# Intro to Python

## Welcome!

### What is Natural Language Processing?

The development and application of computational techniques for the analysis of human language.

<img src="images/nlp.png" width="800" />


### Is it different to Computational Linguistics?

Computational linguistics is a social science with the goal of understanding language through computational models.

### How can I use it in my Humanities Research?

This is what we will learn during the week.

## NLP as Data Science

![From: https://towardsdatascience.com/introduction-to-statistics-e9d72d818745](images/data_science.png)


## NLP as Field of Research

1. New research is mostly published in conference proceedings 
2. The most important conferences are: ACL, EMNLP, COLING, NAACL, EACL
3. Preprints are often shared on arXiv (https://arxiv.org/list/cs.CL/recent)


## What Should I know before we start?

1. All quantitative models of language are wrong - but some are useful!
2. The importance of the human-in-the-loop
3. No-free-lunch theorem
4. Always validate!


## Why Python?

![](images/python.jpg)

1. It is quick to program in

2. It is very popular in NLP and has lots of useful, thoroughly documented libraries

3. It interfaces well with faster languages 

4. Python is free, so you’ll never have a problem getting hold of it, wherever you go.

## Why write programs for research?

Scripting is not just labour-saving: scripted research can be tested and reproduced!

## Sensible Input - Reasonable Output
Programs are a rigorous way of describing data analysis for other researchers, as well as for computers.

Computational research suffers from people assuming each other’s data manipulation is correct. Sharing code, which is much easier for a non-author to understand than a spreadsheet, lets others check that manipulation for themselves.

## Why Binder?

![](images/mybinder.png)

### Running Code is more complicated than Displaying Code!
[GitHub](https://github.com/) is a great service for sharing code, but the contents are static.

It would be great if you could run a GitHub repository without installing complicated requirements, for example directly in your browser.

However, to run code, you need:

1. Hardware on which to run the code
2. Software, including:
    - The code itself
    - The programming language (e.g. Python, R, Julia, and so on)
    - Relevant packages (e.g. pandas, matplotlib, tidyverse, ggplot)


### The solution!

[Binder](https://mybinder.org/) is a service that provides your code and the hardware and software to execute it.

You can create a **link** to a live, interactive version of your code (like this one)!

## The Jupyter Notebook

![](images/jupyter.png)

The easiest way to get started using Python, and one of the best for research data work, is the Jupyter Notebook.

In the notebook (like this one), you can easily mix code with discussion and commentary, and mix code with the results of that code, including graphs and other data visualisations.

Jupyter notebooks consist of discussion cells, referred to as “markdown cells”, and “code cells”, which contain Python. This document has been created using Jupyter Notebook, and this very cell is a Markdown cell. To learn how to write in Markdown, follow [this link](https://daringfireball.net/projects/markdown/).

And now, let's start with our first Code cell!

In [9]:
# you can use the hash symbol (#) to add comments (these will not get run)

# printing in python

print ("hi!")

hi!


In [10]:
# printing multiple things together

print ('Hi','Oxford!')

Hi Oxford!


In [11]:
# things with numbers

print (3+5)

print (3-5)


8
-2


In [12]:
# defining decimal numbers

10.0

10.0

In [13]:
# every time we do these operations, results are displayed but not stored
5*254

1270

In [14]:
# so, we can create variables and store information in them
six = 3*2
print (six)

number = 0
print (number)

6
0


In [15]:
# And also sum them together
number = number + 1
print (number)

1


In [16]:
# what happens if we do 
print (seven)

NameError: name 'seven' is not defined

It is important to spend some time learning how to read error messages, as we will see loads of them!

In this case it is quite clear: the variable "seven" is not defined.

In [None]:
# we can also do operations using variables

print (six*2)

12


In [None]:
# and even more complicated things 

a_lot = six*six*180

print (a_lot)

6480


In [None]:
# however you need to be careful, as in Python you can re-assign variables

a_lot = 100

print (a_lot)

100


In [None]:
# you can also assign a string to a variable

a_lot = "big number"

print (a_lot)

big number


In [None]:
# to check what "type" of object a certain variable is you can do

type(a_lot)

str

Also, cells in a Jupyter Notebook are not always evaluated in order.

If I now go back to the cell containing `number = number + 1` and run it again with Shift-Enter, `number` will change from 1 to 2, then from 2 to 3. Try it!

So it’s important to remember that if you move your cursor around in the notebook, the cells don’t always run top to bottom.

In [None]:
# for loops - remember to indent!

counter = 0
print (counter)

for i in range(10):
    counter = counter + 1
print (counter)

0
10
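
The exercise that follows needs a conditional, which has not appeared so far. Here is a minimal sketch of `if`/`else`, on its own and inside a `for` loop. The variable `temperature` and its value are made up for illustration and are not part of the original notebook.

```python
# `temperature` is a hypothetical value chosen only for illustration
temperature = 12

# the indented block under `if` runs only when the condition is True,
# otherwise the block under `else` runs
if temperature > 20:
    message = "warm"
else:
    message = "cold"

print(message)  # cold

# a condition can also sit inside a for loop: count the even numbers below 10
evens = 0
for i in range(10):
    if i % 2 == 0:  # % gives the remainder of a division
        evens = evens + 1

print(evens)  # 5
```

As with loops, the indentation decides which lines belong to the `if` and which to the `else`.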


✏️ **Exercise:**

Write a program that sums two numbers (and prints the sum) if the first is larger than the second, and otherwise prints only the first.

Python has two core numeric types: `int` for integers and `float` for real (floating-point) numbers.

In [2]:
one = 1
ten = 10
one_float = 1.0
ten_float = 10.0

type(one_float)

float
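
The difference between the two types shows up as soon as you do arithmetic. The sketch below assumes Python 3, where `/` always returns a float and `//` is floor division; the variable names are illustrative only.

```python
whole = 7 // 2        # floor division of two ints gives an int: 3
exact = 7 / 2         # true division always gives a float: 3.5
mixed = 1 + 1.0       # mixing int and float gives a float: 2.0
converted = int(3.9)  # int() drops the decimal part: 3

print(type(whole), type(exact), type(mixed), converted)
```

Note that `int()` truncates rather than rounds; use `round()` if you want the nearest integer.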

### Lists, Tuples, Sets and Dictionaries

Python’s basic container type is the `list`. The purpose of a list is to hold other objects.

We can define our own list with square brackets:


In [1]:
some_numbers = [1, 3, 7]
type(some_numbers)

list

Lists do not have to contain just one type:

In [4]:
various_things = [1, 2, "banana", 3.4, [1, 2]]

We access elements of a list with their index (remember that Python starts counting from 0)

In [5]:
print (various_things[0])

1


We can also ask a list if it contains a particular item

In [8]:
'banana' in various_things

True

Note that you can add and remove elements from a list

In [12]:
various_things.append("pizza")
print (various_things)

[1, 2, 'banana', 3.4, [1, 2], 'pizza']


In [None]:
various_things.remove(2)
print (various_things)

[1, 'banana', 3.4, [1, 2], 'pizza']

An important thing when working with text: strings are sequences too, so indexing, slicing and `in` work on them just as they do on lists (although a string, unlike a list, cannot be modified in place). For instance:

In [7]:
text = "My name is Fede"
print (text[3:7])

name
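
A few more sequence operations on the same string may help when handling text. This is a small sketch using only built-in string behaviour; the variable names other than `text` are illustrative.

```python
text = "My name is Fede"

first_char = text[0]         # indexing starts at 0: 'M'
last_word = text[-4:]        # negative indices count from the end: 'Fede'
has_name = "name" in text    # membership test on a substring: True
words = text.split()         # split on whitespace into a list of words
shouted = text.upper()       # returns a new string; `text` is unchanged

print(first_char, last_word, has_name)
print(words)    # ['My', 'name', 'is', 'Fede']
print(shouted)  # MY NAME IS FEDE
```

`split()` is often the first step in text processing, because it turns a string into a list of words that you can loop over.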


A `tuple`, by contrast, is an immutable sequence. It is like a list, except **it cannot be changed**. It is defined with round brackets.

In [10]:
my_tuple = ("Hello", "World","!")


In [13]:
my_tuple[0] = "Goodbye"


TypeError: 'tuple' object does not support item assignment
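
To see the contrast side by side, the sketch below makes the same assignment on a list and on a tuple, catching the error so the cell keeps running. The names `my_list` and `new_tuple` are illustrative additions.

```python
my_list = ["Hello", "World", "!"]
my_list[0] = "Goodbye"  # fine: lists are mutable
print(my_list)          # ['Goodbye', 'World', '!']

my_tuple = ("Hello", "World", "!")
try:
    my_tuple[0] = "Goodbye"  # not allowed: tuples are immutable
except TypeError as error:
    print("Could not change the tuple:", error)

# to "change" a tuple, build a new one from pieces of the old one
new_tuple = ("Goodbye",) + my_tuple[1:]
print(new_tuple)  # ('Goodbye', 'World', '!')
```

The trailing comma in `("Goodbye",)` is what makes it a one-element tuple rather than a string in brackets.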

A `set` is an unordered collection which cannot contain the same element twice. We make one by calling `set()` on any sequence, e.g. a list or string.

In [18]:
many_things = ['fede',1,4,562,65,1,44,'fede','book',3455, 5]
print (set(many_things))

{65, 1, 'fede', 4, 5, 44, 562, 'book', 3455}


A set has no particular order, but is really useful for checking or storing unique values. Additionally, set operations work as in mathematics:

In [19]:
x = set("Hello")
y = set("Goodbye")
x & y  # Intersection


{'e', 'o'}

In [20]:
x | y  # Union

{'G', 'H', 'b', 'd', 'e', 'l', 'o', 'y'}

In [21]:
y - x  # y intersection with complement of x: letters in Goodbye but not in Hello

{'G', 'b', 'd', 'y'}
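
The same operations are handy on words rather than letters, for example to compare the vocabulary of two texts. The two sentences below are made up for illustration; `sorted()` is used only to print the results in a stable order, since sets themselves have none.

```python
# two made-up example sentences
sentence_a = "the cat sat on the mat"
sentence_b = "the dog sat on the log"

# split() gives a list of words; set() keeps each word once
vocab_a = set(sentence_a.split())
vocab_b = set(sentence_b.split())

shared = vocab_a & vocab_b     # words in both sentences
only_a = vocab_a - vocab_b     # words in the first sentence only
all_words = vocab_a | vocab_b  # words in either sentence

print(sorted(shared))  # ['on', 'sat', 'the']
print(sorted(only_a))  # ['cat', 'mat']
print(len(all_words))  # 7
```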

Python supports a container type called a `dictionary`.

This is also known as an “associative array”, “map” or “hash” in other languages.

The things we use to look values up are called keys, and what we obtain are values.

In [15]:
me = {"name": "Fede", "age": 35, "Jobs": ["Research Data Scientist", "Teacher"]}

In [16]:
me["Jobs"]

['Research Data Scientist', 'Teacher']
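
Beyond looking up a key, a dictionary can be extended, tested and looped over. The sketch below reuses the `me` dictionary; the `"course"` key and the `"email"` lookup are hypothetical additions for illustration only.

```python
me = {"name": "Fede", "age": 35, "Jobs": ["Research Data Scientist", "Teacher"]}

# add a new key-value pair ("course" is a made-up key for this example)
me["course"] = "Intro to Python"

# check whether a key exists before using it
print("name" in me)   # True
print("email" in me)  # False

# get() returns a default instead of raising KeyError for a missing key
email = me.get("email", "not recorded")
print(email)  # not recorded

# loop over keys and values together
for key, value in me.items():
    print(key, "->", value)
```

Asking for `me["email"]` directly would raise a `KeyError`, which is why `in` and `get()` are useful when you are not sure a key is present.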

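A very important thing: you look up the elements of a dictionary by key, not by position (as opposed, for instance, to a list). Since Python 3.7 a dictionary remembers the order in which its keys were inserted, but it still has no first or second element that you can ask for by index.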

Your programs will be faster and more readable if you use the appropriate container type for your data’s meaning. Use a set for collections which can’t in principle contain the same data twice, and a dictionary for anything which feels like a mapping from keys to values.
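
As a small illustration of matching the container to the data, counting how often each word occurs is naturally a mapping from word to count, so a dictionary fits. This is a sketch on a made-up sentence, not a full tokeniser.

```python
# a made-up example sentence, split into a list of words
words = "to be or not to be".split()

# word -> number of occurrences: a mapping, so we use a dictionary
counts = {}
for word in words:
    # get() returns 0 the first time we meet a word
    counts[word] = counts.get(word, 0) + 1

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}

# the distinct words, with no counts attached: a set is enough
unique_words = set(words)
print(len(unique_words))  # 4
```

The list keeps every word in its original position, the set keeps only which words occur, and the dictionary keeps how many times each one occurs.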

✏️ **Exercise:**

Write a programme ...