# Week 1: Introduction to Python (Part 2)

This is the second part of the Introduction to Python for Natural Language Engineering mini-course.

These notebooks are designed to give you the working knowledge of Python necessary to complete the lab sessions for Natural Language Engineering. 

From the last notebook you should be familiar with Python types, basic operators, identifiers, lists, strings, booleans and conditions.  This notebook will introduce you to defining functions, using comments and docstrings, and working with more data structures such as sets, tuples and dictionaries.

As in the last session:

- Run all of the code cells as you work through the notebook. 
- Try to understand what is happening in each code cell and predict the output before running it.
- Add more cells to try things out.
- Complete all of the exercises.
- Discuss answers and ask questions!


## 1.2.1 Functions
Functions are defined using the keyword `def`, followed by a function name, and some number of arguments in parentheses.  If there are multiple arguments to a function, these are separated by commas.  Don't forget the `:` after the closing parenthesis.

The body of the function starts on the next line, and must be indented.

The function `double()` below has a single argument `number`.  It returns the value of the `number` argument multiplied by 2. 

In [None]:
def double(number):
    return number * 2

To use or call a function, you use the function name and supply it with values for its arguments in parentheses.



In [None]:
double(13)

We will often want to store the return value from a function in a variable for use later on in our code.  Remember from last time that we can assign a value to a variable using a single `=`

In [None]:
new_value = double(13)

Note also that the built-in `type()` function will tell us that `double` is a function.

In [None]:
type(double)
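The description above mentions functions with several arguments, but `double()` only takes one. As a minimal sketch (the function name `join_words` and its argument names are made up for illustration), a two-argument function might look like this:

```python
def join_words(first_word, second_word):
    # Two arguments, separated by a comma in the definition
    return first_word + " " + second_word

# The call also separates the two values with a comma
greeting = join_words("hello", "world")
print(greeting)
```

The values are matched to the arguments by position, so `"hello"` becomes `first_word` and `"world"` becomes `second_word`.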

We can define functions to do whatever we want.  Here we have a function which takes a string argument and adds a question mark on the end before returning it.

In [None]:
def add_question_mark(any_string):
    return any_string + "?"

In [None]:
add_question_mark("what's your name")

Note that the argument variable (`any_string` in the example above) is a **local variable** to the function.  This means that it cannot be referred to outside of the function.  If you do so, you will get a *NameError* as below.

In [None]:
any_string
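If you want to see the same behaviour without the cell stopping on an error, one option is to catch the `NameError`. This is only a sketch: the function `shout` and its argument `text_to_shout` are hypothetical names, and `try`/`except` has not been introduced in this course yet.

```python
def shout(text_to_shout):
    # text_to_shout only exists while shout() is running
    return text_to_shout.upper() + "!"

result = shout("hello")

try:
    text_to_shout  # the argument name is not defined out here
    name_error_raised = False
except NameError:
    name_error_raised = True

print(result, name_error_raised)
```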

What does the unhelpfully named function below do?

In [None]:
def some_function(any_string):
    half_length_of_string = len(any_string)//2 #use floor division as indices must be integers
    return any_string[:half_length_of_string]

In [None]:
some_function('hi how are you doing?')

### **Exercise 1a**
In the empty cell below define a function called `square` that returns an input parameter squared. 

Hint: check the 'basic functions' section (Section 1.1.2 in Part 1) for the Python syntax for exponentials.

### **Exercise 1b**
In the empty cell below define a function `makelist` that takes a sentence string as an input, and returns a list of the words in the sentence.

## 1.2.2 Comments and docstrings

Look at the code in the code cell below.

- The first block of text shows how docstrings are used in Python. By convention, all function definitions should begin with a block of documentation (a docstring) of the form given by the first block in triple quotes in the program below.
- Comments in the code itself are introduced by a `#` either as a separate line or appended to the end of a line. Python will ignore the rest of a line after a `#`.

When you type shift-enter to execute the cell, the function exists in the kernel and can be called by any cell in the notebook. The function definition ends as soon as the indentation ceases (here, at the comment `# Here is the argument:`).  After creating the function the kernel will continue to execute the contents of the cell, thereby calling the function. 

Notice how the program splits a character string at newline (`"\n"`) characters. This works because `split` is a built-in method of the string (`str`) data type. Therefore all strings can be split in this way.
- `\n` is the newline character. 
- `\t` can be used similarly for reading tab separated data.
- If you leave the argument empty it will treat any string of whitespace as a delimiter to be split. This has the advantage that a double space will be treated as a single delimiter.
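The three ways of calling `split` listed above can be compared side by side. This is a small sketch using made-up strings; the variable names are only for illustration.

```python
# Split on newlines: one string per line of text
lines = "first line\nsecond line".split("\n")
print(lines)        # ['first line', 'second line']

# Split on tabs, as for tab-separated data
fields = "a\tb\tc".split("\t")
print(fields)       # ['a', 'b', 'c']

# No argument: any run of whitespace counts as one delimiter
words = "a  double   space".split()
print(words)        # ['a', 'double', 'space']

# An explicit single space treats each space separately,
# so a double space produces an empty string
with_space = "a  double".split(" ")
print(with_space)   # ['a', '', 'double']
```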

### **Exercise 2a**
The cell below defines and calls a function `count_paragraphs()`.  Read the docstring and the comments to understand what it does.

In [None]:

def count_paragraphs(input_text):
    """
    A paragraph is defined as the text before a newline character, i.e. "\n".
    Take a character string, split it into paragraphs, count them
    and return the count.
    :param input_text: a character string containing paragraph marks.
    :return: integer, the number of paragraphs.
    """
    
    # The following statement creates a list of strings by breaking
    # up input_text wherever a "\n" character occurs
    
    paragraphs = input_text.split("\n")  
    
    # The len() function counts the number of elements in the list
    
    return len(paragraphs)


# Here is the argument:

sample_text = "This is a sample sentence01 showing 7 different token types: alphabetic, numeric, alphanumeric, Title, UPPERCASE, CamelCase and punctuation!\nSentences like that should not exist. They're too artificial.\nA REAL sentence looks different. It has flavour to it. You can smell it; it's like Pythonic code, you know?\nHave you heard of 'code smell'? Google it if you haven't."
print (sample_text)

# Here is the function call:

print ("Number of paragraphs: ", count_paragraphs(sample_text))


### **Exercise 2a continued**
In the blank cell below examine the contents of the variable `sample_text` with and without the print function.
- Notice that execution of the previous cell means the variable is now in the kernel and accessible to any cell.
- Use the same box to try printing the variable `input_text`. What happens? Why?
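Returning to docstrings: one reason to write them is that Python stores the docstring with the function, so it can be looked up later. The sketch below uses a hypothetical function `is_question` to show this.

```python
def is_question(sentence):
    """Return True if the sentence ends with a question mark."""
    return sentence.endswith("?")

# The docstring is stored in the function's __doc__ attribute
print(is_question.__doc__)

# In a notebook, help(is_question) displays the same text
print(is_question("Have you heard of 'code smell'?"))
```

Comments introduced with `#` are not stored in this way; they are only visible to someone reading the source code.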

## 1.2.3 Sets  
Last time, we looked at the more complex data structure of lists.

Now we are going to take a look at a related data structure called a set.  These are **unordered** collections of **unique** elements.

Note the use of curly  brackets rather than the square brackets used for lists.

In [None]:
unique_numbers = {1, 2, 2, 2, 3}
unique_numbers

In [None]:
type(unique_numbers)

To initialise an empty set, use `set()`

In [None]:
new_set = set()
type(new_set)

Or we can give a list to the set constructor `set()`.  This will construct a set out of the elements of the list.  Note that whilst we should think of a set as unordered, it has to be displayed in some order.  Small integers often appear in ascending order because of the way they are stored, but other elements (such as strings) can appear in any order, and that order may differ from one run to the next, so you should never rely on it.

In [None]:
values=[3,1,1,4,5,2,2,1]
set_of_values=set(values)
set_of_values

In [None]:
set(['z','e',3,'e','a'])
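Because the display order of a set should not be relied upon, it can help to compare sets directly or to sort their elements when you need a predictable order. This is a small sketch using the built-in `sorted` function, which returns a new list.

```python
values = [3, 1, 1, 4, 5, 2, 2, 1]
set_of_values = set(values)

# Duplicates are removed when the set is built
print(len(values), len(set_of_values))    # 8 5

# Two sets are equal if they contain the same elements,
# whatever order they were written in
print(set_of_values == {5, 4, 3, 2, 1})   # True

# sorted() gives a list in a predictable order
print(sorted(set_of_values))              # [1, 2, 3, 4, 5]
```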


To add an element to a set, use the method `add`.

In [None]:
unique_numbers.add(5)

Use `len` to give the number of elements in a set.

In [None]:
len(unique_numbers)

To check the presence of an element in a set, use the keyword `in`.  This is similar to the use of `in` for lists and strings.

In [None]:
2 in unique_numbers

Similarly to lists, we often want to do something for every element in a set.

The syntax for **iterating over a set** is the same as that used when iterating over a list. 

Remember to use `for`, `in`, `:` and indentation.

In [None]:
for number in unique_numbers:
    print(number * 3)

In [None]:
for number in unique_numbers:
    print (double(number))

### **Exercise 3a**
In the empty cell below create a function called `get_vocabulary` that takes a *list* of words as input, and returns a *set* of the words in the list.

Use your function `get_vocabulary` to create the set `dickens_vocab`, a set of the unique words in `opening_line` (defined in the cell below).

In [None]:
opening_line="It was the best of times, it was the worst of times"

## 1.2.4 Dictionaries
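A dictionary is a very important data structure which we will be making much use of throughout the NLE course.  By definition, a dictionary is a collection of **key:value pairs**.  (Since Python 3.7, a dictionary remembers the order in which its keys were inserted, but we look values up by key rather than by position.)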

**Keys** are used to index the dictionary.

The main operations are storing a value with a key, and then extracting a specific value using its key. 

Each key in a given dictionary must be unique. 

A dictionary is initialised with curly braces. This can contain comma-separated key:value pairs. 

Note the use of ':' to map a key to a value.

In [None]:
simpsons_ages = {"Bart":10, "Lisa":8, "Homer" : "thirty something"}
simpsons_ages

In [None]:
type(simpsons_ages)

To access the values associated with keys in a dictionary, we use square brackets.  

In [None]:
simpsons_ages["Homer"]

In [None]:
simpsons_ages['Bart']

To get the number of elements in a dictionary, we use the built-in function `len`, just as we did to get the length of a list.

In [None]:
len(simpsons_ages)

We can check the presence of a key in a dictionary using the keyword `in`.

In [None]:
"Marge" in simpsons_ages

In [None]:
"Bart" in simpsons_ages

If we try to access a key that does not exist, we get a **KeyError**.

In [None]:
simpsons_ages["Krusty"]

However, dictionaries also have a `get` method which allows us to look up values for keys which may or may not be in the dictionary.  When using the `get` method, we can supply a second argument, which is a default value to use in the case that the key does not exist.  If the second argument is left out, the default is `None`.

In [None]:
simpsons_ages.get("Bart","I don't know")

In [None]:
simpsons_ages.get("Krusty","I don't know")
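The behaviour of `get` with and without a default can be checked with a small dictionary. This is a sketch using a separate dictionary called `ages`, so that it does not change `simpsons_ages`.

```python
ages = {"Bart": 10, "Lisa": 8}

print(ages.get("Bart"))         # 10
print(ages.get("Krusty"))       # None, as no default was given
print(ages.get("Krusty", 0))    # 0

# get() only looks a key up, it never adds it to the dictionary
print("Krusty" in ages)         # False
```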

We can add a new key:value entry to the dictionary by combining the syntax for looking up a value with the syntax for assigning a value to a variable.  

For example, if we want to add the fact that Marge is 34, we write:

In [None]:
simpsons_ages["Marge"] = 34
simpsons_ages["Marge"]

We can also update the dictionary in the same way.

In [None]:
simpsons_ages["Homer"]=35
simpsons_ages["Homer"]

### **Exercise 4a**
In the blank cell below add two extra key-value pairs to the dictionary, `simpsons_ages`, each consisting of a name and corresponding age.  

There are also some very useful dictionary methods which can be used to iterate over all the keys in the dictionary, all the values in the dictionary or all the key,value pairs in the dictionary.  Make sure you understand all of the examples below.

First, we can use the `keys` method to get a set-like view of all of the keys in a dictionary.  We can then iterate over this like any other set or list.

In [None]:
for person in simpsons_ages.keys():
  print(person,"is in the Simpsons")

Next, we can use the `values` method to get a view of all of the values in a dictionary (unlike keys, values may be repeated).  Here we iterate over the values, adding them to an **accumulator** variable called `total`.

In [None]:
total=0
for age in simpsons_ages.values():
  total+=age
print("The total age of known Simpsons characters is",total)

Alternatively, if we just want the sum of all of the values in the dictionary, we could have just used the built-in `sum` function.  Note that `sum` expects a collection of numbers (either floats or ints), such as a list or a set, and will generate an error if any of the values are strings.

In [None]:
print("The total age of known Simpsons characters is",sum(simpsons_ages.values()))
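To see what happens when one of the values is a string, here is a sketch using a separate dictionary `ages` that still contains Homer's original string value. It uses `try`/`except` and the built-in `isinstance` function, neither of which has been covered yet, so treat it as a preview rather than something you need to reproduce.

```python
ages = {"Bart": 10, "Lisa": 8, "Homer": "thirty something"}

try:
    total = sum(ages.values())
except TypeError:
    # sum() cannot add an int and a str together
    total = None

# One way round this: only add up the values that are ints
numeric_total = 0
for age in ages.values():
    if isinstance(age, int):
        numeric_total += age

print(total, numeric_total)
```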


Next, we can use the `items` method to get a view of the key-value pairs, which we can iterate over. 

In [None]:
#Note that here as in other set and list iterations, 'person' and 'age' are arbitrary variable names, and can be replaced with any two names eg 'key' and 'value'
for person,age in simpsons_ages.items():
  if age < 18:
    print(person,"is a child")
  else:
    print(person,"is an adult")

Finally, we can just use a `for` loop to iterate over the *keys* in the dictionary, without explicitly using the `keys` method.  I wouldn't recommend doing this, as it is less obvious to a reader what is being iterated over.  It's best to be explicit about what you want - do you want keys, values or pairs?  However, it is worth knowing that it is the **keys** that you get if you should perform an iteration over a dictionary.

In [None]:
for person in simpsons_ages: 
     print (person)
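One way to confirm what each kind of iteration produces is to turn it into a list. The sketch below uses a separate small dictionary called `ages`.

```python
ages = {"Bart": 10, "Lisa": 8, "Marge": 34}

# Iterating over the dictionary itself gives the keys
print(list(ages) == list(ages.keys()))   # True

print(list(ages.keys()))     # ['Bart', 'Lisa', 'Marge']
print(list(ages.values()))   # [10, 8, 34]
print(list(ages.items()))    # [('Bart', 10), ('Lisa', 8), ('Marge', 34)]
```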

### **Exercise 4b**
In the blank cell below make a new dictionary called `polygons` where the keys are names of shapes and the values are the corresponding number of sides.

### **Exercise 4c**
In the blank cell below iterate over the keys and values, printing each key and value in a sentence (eg 'a triangle has 3 sides').

### **Exercise 4d**
In the empty code cells below write code that will print, one word per line, each word in `opening_line` together with the number of times that word appears in `opening_line`.

Being able to count the number of times a word or token has been seen is something which is very important in NLE.  However, it can be quite challenging to work out how to do it efficiently with a dictionary in the first place.  See the cells below for some hints.

**Hint 1**: you need to start with an empty dictionary and then iterate over a list of words.  For each word in the list, you need to look up how many times it has been seen previously and then add 1 to it (remembering to update the dictionary!)

**Hint 2**: use the `get` method to supply a default value of 0 for the number of times a word has been seen so far.

## 1.2.5 Files

Quite often we want to load data from (or save data to) files.
Files have a file path and in the cell below we use the variable `input_file_path` to hold a string that contains a file path.

If you are working on Colab then you will see on your Google Drive account that you have a directory called `Colab Notebooks`.  Navigate to this directory and create a new folder called `datafiles`.  Then drag and drop the `sample_text.txt` file from the Week 1 resources folder into this directory.  

Now mount your Google Drive in Colab.  You need to run the code below and follow the instructions in order to authorise access to your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Now you can access files in your Google Drive account by specifying a file path in that directory.

In [None]:
#Make sure the file path points to a valid file
#on the windows machines in the labs you will want something like this
#input_file_path = "N:/Documents/teaching/NLE2018/w1/Week1Labs/sample_text.txt"

#on a mac or a linux machine you will want something like this
#input_file_path = "/Users/juliewe/Documents/teaching/NLE/NLE2019/w1/Week1Labs/sample_text.txt"

#on colab you will want something like this
input_file_path='/content/drive/My Drive/NLE Notebooks/Week1Labs/sample_text.txt'

We now use the file path variable to *open* the file. We need to do this before reading/writing to it.

The code below also checks the **type** of the opened file object - it should be an `_io.TextIOWrapper`. If the drive is not mounted or you type the path incorrectly, you will get a `FileNotFoundError`.

In [None]:
input_file = open(input_file_path)
type(input_file)

Use the `read` method to read the entire file contents into a `str` variable called `input_text`.

In [None]:
input_text = input_file.read()
type(input_text)

We can now use this `str` variable in the same way as any other.

In [None]:
input_text

In [None]:
print("The number of characters in the file is",len(input_text))

When you are done with the file, close it.

In [None]:
input_file.close()

After the file has been closed it cannot be read any more.  The following code should generate a `ValueError`

In [None]:
input_text = input_file.read()

It is much neater to use a `with ... as` block to `open` a file (and then you do not need to `close` it since it will close automatically when the block exits)

In [None]:
with open(input_file_path) as input_file:
    some_input_text=input_file.read()

In [None]:
some_input_text

In [None]:
input_file.read()
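The last cell fails with a `ValueError` because the file was closed automatically when the `with` block ended. The same pattern works for saving data to a file. The sketch below is an illustration only: it writes to a temporary folder created with the standard `tempfile` module (not covered in this course) so that it does not depend on `sample_text.txt`, and the names `scratch_dir`, `demo_path` and `demo_file` are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch_dir:
    demo_path = os.path.join(scratch_dir, "demo.txt")

    # Opening with "w" creates the file for writing
    with open(demo_path, "w") as output_file:
        output_file.write("first line\nsecond line\n")

    # With no second argument, open() reads the file
    with open(demo_path) as demo_file:
        demo_text = demo_file.read()

# Both files were closed when their blocks ended
print(demo_file.closed)
print(len(demo_text))
```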

### **Exercise 5**
In the blank cell below write a function, `print_word_counts` that will take a file path as an argument, open the file, then print, one word per line, each word in the file together with the number of times that word appears in the file.   

Test your function by running it on the `sample_text.txt`.

Hint: Call the function you developed in Exercise 4!

## 1.2.6 Tuples

A tuple is another data structure which at first sight might seem much like a list or a set.  However, note that once created, a tuple has a **fixed** number of elements - we cannot append or add items to a tuple.  The most common types of tuples are **pairs** (two-tuples) and **triples** (three-tuples).  We have already seen pairs when we used the dictionary `.items()` method which returns a list of pairs.

In short, a tuple consists of a number of values separated by commas. These values can be different types. It is initialised with parentheses, containing its objects separated by commas.

In [None]:
person = ("Jon", 14, "jon@thewall.com")
person

In [None]:
type(person)

We can use `len` to count the number of elements in a tuple.

In [None]:
len(person)

We can index into a tuple in the same way that we index into a list.

In [None]:
person[0]

In [None]:
person[-2:]
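Two properties of tuples mentioned above can be checked directly: their elements cannot be changed or added to, and their values can be unpacked into separate variables (as we did with `person, age` when iterating over `items()`). This is a sketch in which the tuple is called `jon` to avoid overwriting `person`; the `try`/`except` is only there so the cell does not stop on the error.

```python
jon = ("Jon", 14, "jon@thewall.com")

# Unpacking: one variable per element of the tuple
name, age, email = jon
print(name, age, email)

# Tuples have no append method...
print(hasattr(jon, "append"))   # False

# ...and assigning to an element raises a TypeError
try:
    jon[1] = 15
    assignment_failed = False
except TypeError:
    assignment_failed = True
print(assignment_failed)        # True
```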

It can be useful to use tuples as values in dictionaries.

In [None]:
#Note that each key is a string, and each value is a tuple
people = {"Joffrey":(12, "Baratheon", "joff@kingslanding.com"), "Jon":(14, "Snow", "jon@thewall.com")}
people["Joffrey"]

In [None]:
### Jon's age - we access this using the dictionary key, and then indexing within the value:
people["Jon"][0]

In [None]:
### Joffrey's email
people["Joffrey"][2]

In [None]:
#  list everyone's first and last names:
for person, record in people.items():
     print (person, record[1])

### **Exercise 6**
In the blank cell below create a dictionary called `address_book`, with at least 3 key-value entries. Each should consist of a person's name in string format (the key), and a tuple with corresponding pieces of information about them (the value).

Once you've done that, iterate over the address book, printing information about each person into a sentence.

## 1.2.7 The range function

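This produces a `range` object, which yields the numbers in a specified range on demand.  We will talk more about generators in Part 3 but, for now, you can think of a range as behaving like a list which is generated as required (rather than all being held in memory).  Strictly speaking, a `range` is a lazy sequence rather than a true generator, which is why `len` and indexing work on it.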

For small ranges, it doesn't really matter if it is stored as a list or a generator.  But if you want a range of 1,000,000 numbers, then it does make a big difference to the memory requirements.

The `range` function takes up to three arguments.  The first argument is the initial number in the range.  The second argument is the first number **NOT** in the range.

In [None]:
indices = range(0,5)

Note that when you output the range, you get a `range` object rather than a list of numbers.

In [None]:
indices

In [None]:
type(indices)

We can use the `len` function to find out how big the range is.

In [None]:
len(indices)

We may want to iterate over a range in the same way that we iterate over lists and sets:

In [None]:
total=0
for i in indices:
  total+=i
print(total)

If `range` is given a single argument, it will create a range from zero.

In [None]:
for i in range(10):
  print (i)

If a `range` is given a third argument, it will use this as a **step** value between the numbers generated in the range.

In [None]:
for i in range(0,10,2):
  print(i)
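A quick way to see exactly which numbers a range contains is to convert it to a list with `list()`. The sketch below also includes a negative step, which counts downwards.

```python
print(list(range(0, 5)))        # [0, 1, 2, 3, 4]
print(list(range(3)))           # [0, 1, 2]
print(list(range(0, 10, 5)))    # [0, 5]

# A negative step counts down; the second argument is still excluded
print(list(range(10, 0, -3)))   # [10, 7, 4, 1]

# A range can be indexed like a list
print(range(0, 5)[2])           # 2
```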

### **Exercise 7a**
In the blank cell below use `range` to print a list of the first 10 odd numbers.

### **Exercise 7b**
In the cell below use `range` to print a list of the first 10 cubes.

## 1.2.8 The zip function

The `zip` function is used to pair up the corresponding elements of multiple iterables (e.g., lists, sets, tuples or generators). 

It takes multiple iterables as arguments, and returns an iterator of tuples (a `zip` object) where the i-th tuple consists of the i-th element from each of the input iterables.  Like a generator, a `zip` object can only be iterated over once.

In the example below, we 'zip together' `words` and `indices` into a series of tuples called `word_positions`. For example, the 3rd element of `word_positions` contains the 3rd element of `words` and the 3rd element of `indices`.

In [None]:
words = 'It was the best of times, it was the worst of times'.split()
indices = range(len(words))
word_positions = zip(words, indices)
type(word_positions)

In [None]:
for word, position in word_positions:
    print("'{0}' is in position {1}".format(word,position))
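One thing to be aware of is that a `zip` object is used up as you iterate over it. If you run the loop above a second time without re-creating `word_positions`, nothing is printed. The sketch below shows this with a short list, using `list()` to collect the pairs.

```python
short_words = ["the", "cat", "sat"]
pairs = zip(short_words, range(len(short_words)))

first_pass = list(pairs)
print(first_pass)    # [('the', 0), ('cat', 1), ('sat', 2)]

# The zip object is now exhausted
second_pass = list(pairs)
print(second_pass)   # []
```

If you need the pairs more than once, store them in a list as `first_pass` does here.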


### **Exercise 8**
In the blank cell below write a function, `show_word_positions` that takes a filepath as its argument. The function should read the text from the file, split the text on whitespace, and then print out each word and its position as in the above example.

Test your function out on `sample_text.txt`.

If the lists are of different lengths, `zip` will ignore elements in the longer list beyond the length of the shorter list.

In [None]:
listA = ["the","cat","sat"]
listB = ["a","dog","lay","down"]

for elem in zip(listB,listA):
    print(elem)

If you want to pad out any 'missing' elements, you might find `zip_longest` useful.  This resides in a library called `itertools` so you need to import it from there (more on libraries later!)

In [None]:
from itertools import zip_longest

for elem in zip_longest(listA,listB):
    print(elem)
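By default `zip_longest` pads with `None`. It also accepts a `fillvalue` argument if you would prefer a different placeholder. A small sketch, using new list names so that `listA` and `listB` are left unchanged:

```python
from itertools import zip_longest

short_list = ["the", "cat", "sat"]
long_list = ["a", "dog", "lay", "down"]

padded = list(zip_longest(short_list, long_list, fillvalue="-"))
for elem in padded:
    print(elem)
```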

## 1.2.9 Enumerate
Python provides a useful built-in function called `enumerate` that can be used instead of the combination of `range` and `zip` seen above.

In [None]:
for a,b in enumerate(['The','Holy','Grail']): 
    print(a,b)

In [None]:
for a,b in enumerate(['The','Holy','Grail'],1): 
    print(a,b)
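To make the connection with `range` and `zip` explicit, the sketch below checks that `enumerate` produces the same pairs as zipping a range with the list. Note that the position comes first in each pair, and that the optional second argument sets the starting number.

```python
title = ["The", "Holy", "Grail"]

with_enumerate = list(enumerate(title))
with_zip = list(zip(range(len(title)), title))

print(with_enumerate)               # [(0, 'The'), (1, 'Holy'), (2, 'Grail')]
print(with_enumerate == with_zip)   # True

# Start counting from 1 instead of 0
print(list(enumerate(title, 1)))    # [(1, 'The'), (2, 'Holy'), (3, 'Grail')]
```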

### **Exercise 9**
In the empty cell below, write a function that calculates the number of letters in each word of an input string, returning a list of tuples `(position, length)`.
