# Intro to Python

## Welcome!

### What is Natural Language Processing?

The development and application of computational techniques for the analysis of human language.

<img src="images/nlp.png" width="800" />


### Is it different to Computational Linguistics?

Computational linguistics is a social science with the goal of understanding language through computational models.

### How can I use it in my Humanities Research?

This is what we will learn during the week.

## NLP as Data Science

![From: https://towardsdatascience.com/introduction-to-statistics-e9d72d818745](images/data_science.png)


## NLP as Field of Research

1. New research is mostly published in conference proceedings 
2. The most important conferences are: ACL, EMNLP, COLING, NAACL, EACL
3. Preprints are often shared on arXiv (https://arxiv.org/list/cs.CL/recent)


## What Should I know before we start?

1. All quantitative models of language are wrong - but some are useful!
2. The importance of the human-in-the-loop
3. No-free-lunch theorem
4. Always validate!


## Why Python?

![](images/python.jpg)

1. It is quick to program in

2. It is very popular in NLP and has many useful, thoroughly documented libraries

3. It interfaces well with faster languages 

4. Python is free, so you’ll never have a problem getting hold of it, wherever you go.

## Why write programs for research?

Scripting is not just labour-saving: scripted research can be tested and reproduced!

## Sensible Input - Reasonable Output
Programs are a rigorous way of describing data analysis for other researchers, as well as for computers.

Computational research suffers from people assuming each other’s data manipulation is correct. By sharing code, which is much easier for a non-author to understand than a spreadsheet, we let others check that manipulation for themselves.

## Why Binder?

![](images/mybinder.png)

### Running Code is more complicated than Displaying Code!
[GitHub](https://github.com/) is a great service for sharing code, but the contents are static.

It would be great if you could run a GitHub repository without installing complicated requirements, for example directly in your browser.

However, to run code, you need:

1. Hardware on which to run the code
2. Software, including:
    - The code itself
    - The programming language (e.g. Python, R, Julia, and so on)
    - Relevant packages (e.g. pandas, matplotlib, tidyverse, ggplot)

If you are already familiar with creating Python environments, installing libraries, and using Jupyter Notebooks, you can clone [our GitHub repo](https://github.com/Living-with-machines/dhoxss-text2tech) and use the notebooks on your machine.

### The solution!

[Binder](https://mybinder.org/) is a service that provides your code and the hardware and software to execute it.

You can create a **link** to a live, interactive version of your code (like this one)!

## The Jupyter Notebook

![](images/jupyter.png)

The easiest way to get started using Python, and one of the best for research data work, is the Jupyter Notebook.

In the notebook (like this one), you can easily mix code with discussion and commentary, and mix code with the results of that code, including graphs and other data visualisations.

Jupyter notebooks consist of discussion cells, referred to as “markdown cells”, and “code cells”, which contain Python. This document has been created using Jupyter Notebook, and this very cell is a markdown cell. To learn how to write in Markdown, follow [this link](https://daringfireball.net/projects/markdown/).

And now, let's start with our first Code cell!

In [None]:
# you can use the hash symbol (#) to add comments (these will not get run)

# printing in python

print ("hi!")

In [None]:
# printing multiple things together

print ("Hi","Oxford!")

In [None]:
# things with numbers

print (3+5)

print (3-5)


In [None]:
# defining decimal numbers

10.0

In [None]:
# every time we do these operations, the results are displayed but not stored
5*254

In [None]:
# so, we can create variables and store information in them
six = 3*2
print (six)

number = 0
print (number)

In [None]:
# and also update a variable using its current value
number = number + 1
print (number)

In [None]:
# what happens if we do 
print (seven)

It is important to spend some time and learn how to read error messages, as we will see loads of them!

In this case it is quite clear: the variable "seven" is not defined.
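
As a small illustrative sketch (not part of the original notebook), the same error can be caught with `try`/`except`, so that its message can be inspected without stopping the cell. The last line of a Python error message normally names the error type and describes the problem, and that is the part printed here. The variable `error_message` is introduced only for this example.

```python
# Using a variable that was never defined raises a NameError.
# Catching the error lets us look at its message instead of halting.
try:
    print(seven)  # "seven" has deliberately never been assigned
except NameError as error:
    error_message = str(error)
    print("Python reported:", error_message)
```

Reading the error type first (`NameError`) and the description second is usually the quickest way to find out what went wrong.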

In [None]:
# we can also do operations using variables

print (six*2)

In [None]:
# and even more complicated things 

a_lot = six*six*180

print (a_lot)

In [None]:
# however you need to be careful, as in Python you can re-assign variables

a_lot = 100

print (a_lot)

In [None]:
# you can also assign a string to a variable

a_lot = "big number"

print (a_lot)

In [None]:
# to check what "type" of object a certain variable is you can do

type(a_lot)

Python has two core numeric types: `int` for integers and `float` for real (floating-point) numbers.

In [None]:
one = 1
ten = 10
one_float = 1.0
ten_float = 10.0

type(one_float)
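
The difference between the two types matters as soon as we do arithmetic. The following is a minimal sketch of how `int` and `float` interact; the variable names `mixed_sum`, `true_division` and `floor_division` are made up for this illustration.

```python
one = 1
ten = 10
one_float = 1.0

# mixing an int with a float gives a float
mixed_sum = one + one_float
print(mixed_sum, type(mixed_sum))

# the / operator always returns a float, even for two ints
true_division = ten / 4
print(true_division)

# the // operator performs floor division and keeps ints as ints
floor_division = ten // 4
print(floor_division)

# int() and float() convert between the two types; int() drops the decimal part
print(int(2.9), float(ten))
```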

Also, cells in Jupyter Notebook are not always evaluated in order.

If I now go back to the cell containing `number = number + 1` and run it again with Shift-Enter, `number` will change from 1 to 2, then from 2 to 3. Try it!

So it’s important to remember that if you move your cursor around in the notebook, it doesn’t always run top to bottom.

### For Loops and If/Else Statements

A `for loop` is a command that allows code to be executed repeatedly. 

In [None]:
# for loops - remember to indent!

counter = 0
print (counter)

for i in range(10):
    counter = counter + 1
print (counter)
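
In the cell above the loop variable `i` is never used. As a further sketch (the variable name `total` is an assumption for this example), we can print `i` on each pass to see which values `range` produces, and see how indentation decides what belongs to the loop.

```python
total = 0

for i in range(5):
    # these indented lines run once per pass; i takes the values 0, 1, 2, 3, 4 in turn
    print(i)
    total = total + i

# this line is not indented, so it runs only once, after the loop has finished
print(total)
```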

✏️ **Exercise:**

Write a program that sums two numbers (and prints the sum) if the first is larger than the second, and otherwise prints only the first.

In [None]:
# example of input
number_1 = 5
number_2 = 4

if number_1 > number_2:
    print (number_1+number_2)
else:
    print (number_1)

With this exercise you have learnt another essential element of programming: the `if statement`. In computer science, conditional statements are commands for handling decisions. Specifically, conditionals perform different actions depending on whether a boolean condition evaluates to true or false.
![](images/if-else.png)
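
When there are more than two cases, Python offers `elif` (short for "else if"). The sketch below extends the exercise with a third branch; the variable `comparison` is introduced only for illustration.

```python
number_1 = 4
number_2 = 4

if number_1 > number_2:
    comparison = "larger"
elif number_1 == number_2:
    # elif is only checked when the condition above was False
    comparison = "equal"
else:
    comparison = "smaller"

print(comparison)

# the condition itself is just a boolean value
print(number_1 > number_2)
```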

### Lists (and other objects)

Python’s basic container type is the `list`. The purpose of a list is to hold other objects.

We can define our own list with square brackets:


In [None]:
some_numbers = [1, 3, 7]
type(some_numbers)

Lists do not have to contain just one type:

In [None]:
various_things = [1, 2, "banana", 3.4, [1, 2]]

We access the elements of a list by their index (remember that Python starts counting from 0):

In [None]:
print (various_things[0])
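
Indexing has a few more forms that are worth knowing early on. The following sketch uses the same list; the names on the left-hand side are made up for this example.

```python
various_things = [1, 2, "banana", 3.4, [1, 2]]

first_item = various_things[0]      # index 0 is the first element
last_item = various_things[-1]      # negative indices count from the end
middle_items = various_things[1:3]  # slice: from index 1 up to, but not including, 3
nested_item = various_things[4][0]  # index twice to reach inside the nested list

print(first_item, last_item, middle_items, nested_item)
print(len(various_things))          # number of elements in the list
```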

We can also ask a list whether it contains a particular item:

In [None]:
"banana" in various_things

Note that you can add and remove elements from a list:

In [None]:
various_things.append("pizza")
print (various_things)

In [None]:
various_things.remove(2)
print (various_things)
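
Note that `remove` takes a value, not an index: the cell above deletes the first element equal to `2`. A short sketch of the difference between removing by value and removing by position, using a hypothetical list called `numbers`:

```python
numbers = [10, 2, 30, 2]

# remove() deletes the first element equal to the given value (the value 2, not index 2)
numbers.remove(2)
print(numbers)

# pop() deletes the element at a given index and returns it
popped = numbers.pop(0)
print(popped, numbers)

# del deletes by index without returning anything
del numbers[-1]
print(numbers)
```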

An important point when working with text: strings are sequences, just like lists, so commands that read from a list (indexing, slicing, `in`, `len`, looping) also work on strings. Unlike lists, however, strings cannot be modified in place. For instance:

In [None]:
text = "My name is Fede"
print (text[3:7])
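
Here is a brief sketch of which list-style operations carry over to strings and which do not. The names `capital_letters` and `lowercase_text` are assumptions made for this example.

```python
text = "My name is Fede"

print(len(text))          # number of characters, spaces included
print("name" in text)     # substring test
print(text[0], text[-1])  # indexing works as for lists

# looping over a string yields one character at a time
capital_letters = []
for character in text:
    if character.isupper():
        capital_letters.append(character)
print(capital_letters)

# strings are immutable: assigning to an index raises a TypeError
try:
    text[0] = "m"
except TypeError as error:
    print("Cannot modify a string in place:", error)

# string methods return a new string instead
lowercase_text = text.lower()
print(lowercase_text)
```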

A `set` is an unordered collection which cannot contain the same element twice. We make one by calling set() on any sequence, e.g. a list or string.

In [None]:
many_things = ["fede",1,4,562,65,1,44,"fede","book",3455, 5]
print (set(many_things))

A set has no particular order, but is really useful for checking or storing unique values. Additionally, set operations work as in mathematics:

In [None]:
x = set("Hello")
y = set("Goodbye")
x & y  # Intersection


In [None]:
x | y  # Union

In [None]:
y - x  # y intersection with complement of x: letters in Goodbye but not in Hello
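
For text work, the same operations are often more useful on words than on letters. The sketch below compares the vocabulary of two made-up sentences; the sentences and variable names are purely illustrative.

```python
sentence_1 = "the cat sat on the mat"
sentence_2 = "the dog sat on the log"

# split() breaks a string into words; set() keeps each distinct word once
vocabulary_1 = set(sentence_1.split())
vocabulary_2 = set(sentence_2.split())

shared_words = vocabulary_1 & vocabulary_2   # words in both sentences
all_words = vocabulary_1 | vocabulary_2      # words in either sentence
only_in_first = vocabulary_1 - vocabulary_2  # words in the first but not the second

# sorted() turns a set into an ordered list, which makes the output reproducible
print(sorted(shared_words))
print(sorted(all_words))
print(sorted(only_in_first))
```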

Python supports a container type called a `dictionary`.

This is also known as an “associative array”, “map” or “hash” in other languages.

The things we use to look entries up are called keys, and what we obtain in return are called values.

In [None]:
me = {"name": "Fede", "age": 35, "Jobs": ["Research Data Scientist", "Teacher"]}

In [None]:
me["Jobs"]

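A very important thing: unlike a list, a dictionary is accessed by key, not by position. Since Python 3.7 dictionaries do remember the order in which keys were inserted, but you should still look values up by key rather than rely on where they sit.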

Your programs will be faster and more readable if you use the appropriate container type for your data’s meaning. Always use a set for lists which can’t in principle contain the same data twice, always use a dictionary for anything which feels like a mapping from keys to values.
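
To make this concrete, here is a sketch that stores the same made-up sentence in all three container types, each chosen for what the data means. The variable names are assumptions for this example.

```python
words = "to be or not to be".split()

# a list keeps every token, in order, duplicates included
print(words)

# a set keeps each distinct word once: the vocabulary
vocabulary = set(words)
print(len(words), len(vocabulary))

# a dictionary maps each word (key) to how often it occurs (value)
word_counts = {}
for word in words:
    word_counts[word] = word_counts.get(word, 0) + 1
print(word_counts)
```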

### Opening a file

There are many ways to open a file in Python. For the moment we will just use the one which relies on the built-in function `open`. With `open` you need to specify a path to the file. See the example below.

In [None]:
filepath = "data/input.txt"

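# the with-block closes the file automatically once we are done reading
with open(filepath, "r") as input_file:
    file_content = input_file.read()
print(file_content)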

In [None]:
# splitlines() splits on line breaks and leaves no empty string after a final newline
file_lines = file_content.splitlines()
print (file_lines)

You can also write to a file, again using the `open` function

In [None]:
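output_path = "data/output.txt"
# write() returns the number of characters written, so we do not store its result
with open(output_path, "w") as output_file:
    output_file.write("hello!")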
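
Writing and reading can be combined into a round trip. The sketch below works in a temporary folder (using the standard-library `tempfile` module) so that it does not touch any existing file; the file name `example.txt` is hypothetical.

```python
import os
import tempfile

# work in a temporary folder so no existing file is overwritten
with tempfile.TemporaryDirectory() as temporary_folder:
    example_path = os.path.join(temporary_folder, "example.txt")  # hypothetical file name

    # "w" mode creates the file, or replaces its contents if it already exists
    with open(example_path, "w") as output_file:
        output_file.write("first line\n")
        output_file.write("second line\n")

    # "r" mode reads it back; the file is closed when the with-block ends
    with open(example_path, "r") as input_file:
        round_trip_lines = input_file.read().splitlines()

print(round_trip_lines)
```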

✏️ **Exercise:**
(from [Advent of Code](https://adventofcode.com/) 2021, day 1)

Given `file_lines`, count the number of times a number increases from the previous one. (There is no measurement before the first measurement.)

In [None]:
counter_variable = 0
# this variable holds the previous number each time round the loop
# initially there is no previous number, so we set it to None
previous_number = None

for number in file_lines:
    # you need to convert the strings to integers!
    current_number = int(number)
    if previous_number is not None and current_number > previous_number:
        counter_variable += 1
    # always update the previous number, whether or not there was an increase
    previous_number = current_number

print(counter_variable)
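
One possible alternative, offered as a sketch rather than as the intended solution, pairs each number with the one that follows it using the built-in `zip`. The list `sample_lines` is hypothetical sample input in the same format as `file_lines`.

```python
# hypothetical sample input, in the same format as file_lines
sample_lines = ["199", "200", "208", "210", "200", "207"]

# convert the strings to integers first
numbers = []
for line in sample_lines:
    numbers.append(int(line))

# zip pairs each number with its successor: (199, 200), (200, 208), ...
increases = 0
for previous, current in zip(numbers, numbers[1:]):
    if current > previous:
        increases += 1

print(increases)
```

Because `numbers[1:]` is one element shorter than `numbers`, `zip` stops at the last pair, so the first measurement is never compared with anything before it.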