# Week 1: Introduction to Python (Part 2)

This is the second part of the Introduction to Python for Natural Language Engineering mini-course.

These notebooks are designed to give you the working knowledge of Python necessary to complete the lab sessions for Natural Language Engineering. 

From the last notebook you should be familiar with Python types, basic operators, identifiers, lists, strings, booleans and conditions. This notebook will introduce you to defining functions, using comments and docstrings, and working with more data structures such as sets, tuples and dictionaries.

As in the last session:

- Run all of the code cells as you work through the notebook. 
- Try to understand what is happening in each code cell and predict the output before running it.
- Add more cells to try things out.
- Complete all of the exercises.
- Discuss answers and ask questions!


## 1.2.1 Functions
Functions are defined using the keyword `def`, followed by a function name, and some number of arguments in parentheses.  If there are multiple arguments to a function, these are separated by commas.  Don't forget the `:` after the closing parenthesis.

The body of the function starts on the next line, and must be indented.

The function `double()` below has a single argument `number`.  It returns the value of the `number` argument multiplied by 2. 

In [1]:
def double(number):
    return number * 2

To use or call a function, you use the function name and supply it with values for its arguments in parentheses.



In [2]:
double(13)

26
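As a minimal sketch of a function with more than one argument: the function `repeat_word` and its argument names are made up for illustration and are not part of the course material.

```python
# Hypothetical example: two arguments, separated by a comma in the definition
def repeat_word(word, times):
    # Multiplying a string by an integer repeats the string
    return word * times

# The values are matched to the arguments in the order they are given
result = repeat_word("ha", 3)
print(result)  # prints hahaha
```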

We will often want to store the return value from a function in a variable for use later on in our code.  Remember from last time that we can assign a value to a variable using a single `=`

In [3]:
new_value = double(13)

Note also that the built-in `type()` function will tell us that `double` is a function.

In [4]:
type(double)

function

We can define functions to do whatever we want.  Here we have a function which takes a string argument and adds a question mark on the end before returning it.

In [5]:
def add_question_mark(any_string):
    return any_string + "?"

In [6]:
add_question_mark("what's your name")

"what's your name?"

Note that the argument variable (`any_string` in the example above) is a **local variable** to the function.  This means that it cannot be referred to outside of the function.  If you do so, you will get a *NameError* as below.

In [7]:
any_string

NameError: name 'any_string' is not defined
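The sketch below repeats this experiment in a single cell. It uses `try` and `except`, which are not covered in this notebook, only so that the cell runs to completion instead of stopping at the error.

```python
def add_question_mark(any_string):
    # any_string only exists while the body of the function is running
    return any_string + "?"

question = add_question_mark("what's your name")
print(question)

try:
    any_string  # the local variable is not defined out here
except NameError as error:
    message = str(error)
    print("Caught a NameError:", message)
```

The return value survives because it was assigned to `question`; the local variable `any_string` does not.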

What does the unhelpfully named function below do?

In [8]:
def some_function(any_string):
    some_number = len(any_string)//2 #use floor division as indices must be integers
    return any_string[:some_number]

In [9]:
some_function('hi how are you doing?')

'hi how are'

### **Exercise 1a**
In the empty cell below define a function called `square` that returns an input parameter squared. 

Hint: check the 'basic functions' section (Section 1.1.2 in Part 1) for the Python syntax for exponentials.

In [10]:
def square(number):
    return number**2

square(4)

16

### **Exercise 1b**
In the empty cell below define a function `makelist` that takes a sentence string as an input, and returns a list of the words in the sentence.

In [11]:
def makelist(sentence):
    return sentence.split()

makelist("It was the best of times, it was the worst of times")

['It',
 'was',
 'the',
 'best',
 'of',
 'times,',
 'it',
 'was',
 'the',
 'worst',
 'of',
 'times']

## 1.2.2 Comments and docstrings

Look at the code in the code cell below.

- The first block of text shows how docstrings are used in Python. By convention, all function definitions should begin with a block of documentation (a docstring) of the form given by the first block in triple quotes in the code below.
- Comments in the code itself are introduced by a `#` either as a separate line or appended to the end of a line. Python will ignore the rest of a line after a `#`.
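As a small sketch of the difference between the two, the docstring is stored with the function and can be read back, whereas comments are simply discarded. The one-line docstring given to `double` here is an illustration, not the course's wording.

```python
def double(number):
    """Return the value of number multiplied by 2."""
    # This comment is ignored by Python; the docstring above is kept
    return number * 2  # a comment can also follow code on the same line

# The docstring is stored on the function as its __doc__ attribute
print(double.__doc__)

# help(double) would display the same text in a formatted form
```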

When you press Shift-Enter to execute the cell, the function is defined in the kernel and can then be called from any cell in the notebook. The function definition ends as soon as the indentation ceases (here, at the comment `# Here is the argument:`).  After creating the function, the kernel continues to execute the rest of the cell, which calls the function.

Notice how the program splits a character string at newline (`"\n"`) characters. This works because `split` is a built-in method of the string (`str`) data type, so every string can be split in this way.
- `\n` is the newline character.
- `\t` can be used similarly for reading tab-separated data.
- If you call `split()` with no argument, it treats any run of whitespace as a single delimiter. This has the advantage that a double space is treated as one delimiter.
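The three cases in the list above can be sketched as follows. The example strings are invented for illustration.

```python
# Splitting on the newline character gives one string per line
lines = "first line\nsecond line".split("\n")
print(lines)   # ['first line', 'second line']

# Splitting on the tab character separates tab-separated fields
fields = "word\t42\tnoun".split("\t")
print(fields)  # ['word', '42', 'noun']

# With no argument, any run of whitespace counts as one delimiter
words = "two  spaces   here".split()
print(words)   # ['two', 'spaces', 'here']

# With an explicit single space, a double space produces an empty string
with_single_space = "two  spaces".split(" ")
print(with_single_space)  # ['two', '', 'spaces']
```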

### **Exercise 2a**
The cell below defines and calls a function `count_paragraphs()`.  Read the docstring and the comments to understand what it does.
- Notice that once the cell has been executed, the function is defined in the **kernel** and can be called from any cell.

In [12]:

def count_paragraphs(input_text):
    """
    A paragraph is defined as the text before a newline character, i.e. "\\n".
    Take a character string, split it into paragraphs, count them
    and return the count.
    :param input_text: a character string containing paragraph marks.
    :return: integer, the number of paragraphs.
    """
    
    # The following statement creates a list of strings by breaking
    # up input_text wherever a "\n" character occurs
    
    paragraphs = input_text.split("\n")  
    
    # The len() function counts the number of elements in the list
    
    return len(paragraphs)


# Here is the argument:

sample_text = "This is a sample sentence01 showing 7 different token types: alphabetic, numeric, alphanumeric, Title, UPPERCASE, CamelCase and punctuation!\nSentences like that should not exist. They're too artificial.\nA REAL sentence looks different. It has flavour to it. You can smell it; it's like Pythonic code, you know?\nHave you heard of 'code smell'? Google it if you haven't."
print(sample_text)

# Here is the function call:

print("Number of paragraphs: ", count_paragraphs(sample_text))


This is a sample sentence01 showing 7 different token types: alphabetic, numeric, alphanumeric, Title, UPPERCASE, CamelCase and punctuation!
Sentences like that should not exist. They're too artificial.
A REAL sentence looks different. It has flavour to it. You can smell it; it's like Pythonic code, you know?
Have you heard of 'code smell'? Google it if you haven't.
Number of paragraphs:  4


### **Exercise 2a continued**
In the blank cell below examine the contents of the variable `sample_text` with and without the print function.
- Notice that execution of the previous cell means the variable is now in the kernel and accessible to any cell.
- Use the same box to try printing the variable `input_text`. What happens? Why?

In [13]:
sample_text

"This is a sample sentence01 showing 7 different token types: alphabetic, numeric, alphanumeric, Title, UPPERCASE, CamelCase and punctuation!\nSentences like that should not exist. They're too artificial.\nA REAL sentence looks different. It has flavour to it. You can smell it; it's like Pythonic code, you know?\nHave you heard of 'code smell'? Google it if you haven't."

In [14]:
print(sample_text)

This is a sample sentence01 showing 7 different token types: alphabetic, numeric, alphanumeric, Title, UPPERCASE, CamelCase and punctuation!
Sentences like that should not exist. They're too artificial.
A REAL sentence looks different. It has flavour to it. You can smell it; it's like Pythonic code, you know?
Have you heard of 'code smell'? Google it if you haven't.


In [15]:
input_text

NameError: name 'input_text' is not defined

## 1.2.3 Sets  
Last time, we looked at the more complex data structure of lists.

Now we are going to take a look at a related data structure called a set.  These are **unordered** collections of **unique** elements.

Note the use of curly brackets rather than the square brackets used for lists.

In [16]:
unique_numbers = {1, 2, 2, 2, 3}
unique_numbers

{1, 2, 3}

In [17]:
type(unique_numbers)

set

To initialise an empty set, use `set()`

In [18]:
new_set = set()
type(new_set)

set

Alternatively, we can give a list to the set constructor `set()`, which constructs a set out of the elements of the list.  Note that although we should think of a set as unordered, it has to be displayed in some order.  Python makes no guarantee about what that order is. The notebook will often display the elements in sorted order when they can be compared with one another, but you should never rely on this.

In [19]:
values=[3,1,1,4,5,2,2,1]
set_of_values=set(values)
set_of_values

{1, 2, 3, 4, 5}

In [20]:
set(['z','e',3,'e','a'])


{3, 'a', 'e', 'z'}
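Because the display order of a set carries no meaning, a sketch of two safer habits is given below: compare sets with `==`, and call the built-in `sorted` function when you need the elements in a definite order.

```python
values = [3, 1, 1, 4, 5, 2, 2, 1]
set_of_values = set(values)

# Two sets are equal if they contain the same elements, whatever the order
same = set_of_values == {5, 4, 3, 2, 1}
print(same)     # True

# sorted() returns a list, which does have a defined order
ordered = sorted(set_of_values)
print(ordered)  # [1, 2, 3, 4, 5]
```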

To add an element to a set, use the method `add`.

In [21]:
unique_numbers.add(5)

Use `len` to give the number of elements in a set.

In [22]:
len(unique_numbers)

4
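As a brief sketch of how `add` interacts with uniqueness, adding an element that is already present leaves the set unchanged. The set is redefined here so that the example is self-contained.

```python
unique_numbers = {1, 2, 3}

unique_numbers.add(5)  # 5 is new, so the set grows
unique_numbers.add(2)  # 2 is already present, so nothing changes

print(len(unique_numbers))  # 4
```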

To check the presence of an element in a set, use the keyword `in`. This is similar to the use of `in` for lists and strings.

In [23]:
2 in unique_numbers

True

Similarly to lists, we often want to do something for every element in a set.

The syntax for **iterating over a set** is the same as that used when iterating over a list. 

Remember to use `for`, `in`, `:` and indentation.

In [24]:
for number in unique_numbers:
    print(number * 3)

3
6
9
15


In [25]:
for number in unique_numbers:
    print(double(number))

2
4
6
10


### **Exercise 3a**
In the empty cell below create a function called `get_vocabulary` that takes a *list* of words as input, and returns a *set* of the words in the list.

Use your function `get_vocabulary` to create the set `dickens_vocab`, a set of the unique words in `opening_line` (defined in the cell below).

In [26]:
opening_line="It was the best of times, it was the worst of times"

In [28]:
def get_vocabulary(word_list):
    return set(word_list)

dickens_vocab = get_vocabulary(makelist(opening_line))
dickens_vocab

{'It', 'best', 'it', 'of', 'the', 'times', 'times,', 'was', 'worst'}

## 1.2.4 Dictionaries
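A dictionary is a very important data structure which we will be making much use of throughout the NLE course.  A dictionary is a collection of **key:value pairs**. You should think of it as being looked up by key rather than by position, although since Python 3.7 a dictionary does remember the order in which its keys were inserted.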

**Keys** are used to index the dictionary.

The main operations are storing a value with a key, and then extracting a specific value using its key. 

Each key in a given dictionary must be unique. 

A dictionary is initialised with curly braces. This can contain comma-separated key:value pairs. 

Note the use of ':' to map a key to a value.

In [29]:
simpsons_ages = {"Bart":10, "Lisa":8, "Homer" : "thirty something"}
simpsons_ages

{'Bart': 10, 'Lisa': 8, 'Homer': 'thirty something'}

In [30]:
type(simpsons_ages)

dict

To access the values associated with keys in a dictionary, we use square brackets.  

In [31]:
simpsons_ages["Homer"]

'thirty something'

In [32]:
simpsons_ages['Bart']

10

To get the number of elements in a dictionary, we use the built-in function `len`, just as we do to get the length of a list.

In [33]:
len(simpsons_ages)

3

We can check for the presence of a key in a dictionary using the keyword `in`.

In [34]:
"Marge" in simpsons_ages

False

In [35]:
"Bart" in simpsons_ages

True

If we try to access a key that does not exist, we get a **KeyError**.

In [36]:
simpsons_ages["Krusty"]

KeyError: 'Krusty'

However, dictionaries also have a `get` method which allows us to look up values for keys which may or may not be in the dictionary.  We can supply a second argument to `get`, which is the default value to return if the key does not exist. If we leave it out, `get` returns `None` for a missing key.

In [37]:
simpsons_ages.get("Bart","I don't know")

10
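The behaviour of `get` for keys that are present and keys that are missing can be sketched in one place. A cut-down copy of the dictionary is used here so that the example is self-contained.

```python
simpsons_ages = {"Bart": 10, "Lisa": 8}

# The key exists, so the default is ignored
known = simpsons_ages.get("Bart", "I don't know")

# The key is missing, so the default is returned
unknown = simpsons_ages.get("Krusty", "I don't know")

# With no second argument, a missing key gives None
no_default = simpsons_ages.get("Krusty")

print(known, unknown, no_default)

# get only looks things up; it does not add the missing key
print("Krusty" in simpsons_ages)  # False
```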

In [38]:
simpsons_ages.get("Krusty","I don't know")

"I don't know"

We can add a new key:value entry to the dictionary by combining the syntax for looking up a value with the syntax for assigning a value to a variable.  

For example, if we want to add the fact that Marge is 34, we write:

In [39]:
simpsons_ages["Marge"] = 34
simpsons_ages["Marge"]

34

We can also update the dictionary in the same way.

In [40]:
simpsons_ages["Homer"]=35
simpsons_ages["Homer"]

35

### **Exercise 4a**
In the blank cell below add two extra key-value pairs to the dictionary `simpsons_ages`, each consisting of a name and corresponding age.

In [41]:
simpsons_ages["Krusty"]=45
simpsons_ages["Maggie"]=1

There are also some very useful dictionary methods which can be used to iterate over all the keys in the dictionary, all the values in the dictionary or all the key,value pairs in the dictionary.  Make sure you understand all of the examples below.

First, we can use the `keys` method to get all of the keys in a dictionary.  The object it returns behaves much like a set, and we can iterate over it like any other set or list.

In [42]:
for person in simpsons_ages.keys():
    print(person, "is in the Simpsons")

Bart is in the Simpsons
Lisa is in the Simpsons
Homer is in the Simpsons
Marge is in the Simpsons
Krusty is in the Simpsons
Maggie is in the Simpsons


Next, we can use the `values` method to get all of the values in a dictionary. Unlike keys, values do not have to be unique, so the same value may appear more than once.  Here we iterate over the values, adding them to an **accumulator** variable called `total`.

In [43]:
total = 0
for age in simpsons_ages.values():
    total += age
print("The total age of known Simpsons characters is", total)

The total age of known Simpsons characters is 133


Alternatively, if we just want the sum of all of the values in the dictionary, we could have used the built-in `sum` function.  Note that `sum` expects a collection of numbers (either floats or ints), such as a list, a set or the values of a dictionary, and will raise a `TypeError` if any of the values are strings.

In [44]:
print("The total age of known Simpsons characters is",sum(simpsons_ages.values()))


The total age of known Simpsons characters is 133
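To illustrate the error case, the sketch below uses a separate dictionary, `ages`, that still contains a string value, as `simpsons_ages` did before Homer's age was updated. The `try`/`except` statement and the built-in `isinstance` function are not covered in this notebook and are used here only for illustration.

```python
ages = {"Bart": 10, "Lisa": 8, "Homer": "thirty something"}

try:
    total = sum(ages.values())
except TypeError as error:
    # An int and a str cannot be added together
    total = None
    print("Could not sum the values:", error)

# One way round the problem: only add up the values that are integers
numeric_total = 0
for age in ages.values():
    if isinstance(age, int):
        numeric_total += age

print(numeric_total)  # 18
```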


Next, we can use the `items` method to get all of the key-value pairs in the dictionary. Each pair is a tuple of the form `(key, value)`.

In [45]:
# Note that here, as in other set and list iterations, 'person' and 'age' are
# arbitrary variable names; any two names (e.g. 'key' and 'value') would work
for person, age in simpsons_ages.items():
    if age < 18:
        print(person, "is a child")
    else:
        print(person, "is an adult")

Bart is a child
Lisa is a child
Homer is an adult
Marge is an adult
Krusty is an adult
Maggie is a child


Finally, we can use a `for` loop to iterate over the *keys* in the dictionary without explicitly using the `keys` method.  It is often clearer to be explicit about what you want: do you want keys, values or pairs?  However, it is worth knowing that it is the **keys** that you get if you iterate over a dictionary directly.

In [46]:
for person in simpsons_ages:
    print(person)

Bart
Lisa
Homer
Marge
Krusty
Maggie
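A quick way to convince yourself of this is sketched below: converting the dictionary itself to a list gives the same result as converting its keys. A small copy of the dictionary is used so that the example is self-contained.

```python
simpsons_ages = {"Bart": 10, "Lisa": 8, "Homer": 35}

# Iterating over the dictionary directly yields its keys
direct = list(simpsons_ages)

# This is the same as asking for the keys explicitly
explicit = list(simpsons_ages.keys())

print(direct)              # ['Bart', 'Lisa', 'Homer']
print(direct == explicit)  # True
```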


### **Exercise 4b**
In the blank cell below make a new dictionary called `polygons` where the keys are names of shapes and the values are the corresponding number of sides.

In [47]:
polygons={"triangle":3,"square":4,"pentagon":5}

### **Exercise 4c**
In the blank cell below iterate over the keys and values, printing each key and value in a sentence (e.g. 'a triangle has 3 sides').

In [49]:
for key,value in polygons.items():
    print("A "+key+" has "+str(value)+" sides")

A triangle has 3 sides
A square has 4 sides
A pentagon has 5 sides


In [50]:
#the use of the format function for strings is slightly neater
for key,value in polygons.items():
    print("A {} has {} sides".format(key,value))
    


A triangle has 3 sides
A square has 4 sides
A pentagon has 5 sides


### **Exercise 4d**
In the empty code cells below write code that will print, one word per line, each word in `opening_line` together with the number of times that word appears in `opening_line`.

Being able to count the number of times a word or token has been seen is very important in NLE.  However, it can be quite challenging to work out how to do it efficiently with a dictionary in the first place.  See the cells below for some hints.

In [51]:
vocab={}
for word in makelist(opening_line):
    vocab[word]=vocab.get(word,0)+1
    
for key,value in vocab.items():
    print("{}:{}".format(key,value))

It:1
was:2
the:2
best:1
of:2
times,:1
it:1
worst:1
times:1


In [62]:
#this would be useful as a function so
def make_vocab(text):
    vocab={}
    for word in text.split():
        vocab[word]=vocab.get(word,0)+1
        
    return vocab

vocab=make_vocab(opening_line)
for key,value in vocab.items():
    print("{}:{}".format(key,value))

It:1
was:2
the:2
best:1
of:2
times,:1
it:1
worst:1
times:1


**Hint 1**: you need to start with an empty dictionary and then iterate over a list of words.  For each word in the list, you need to look up how many times it has been seen previously and then add 1 to it (remembering to update the dictionary!)

**Hint 2**: use the `get` method to supply a default value of 0 for the number of times a word has been seen so far.
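The two hints can be sketched as below. For comparison, the sketch also uses `Counter` from Python's standard `collections` module, which is not part of this exercise but produces the same counts in one step. It is shown only as an alternative you may meet in other code.

```python
from collections import Counter

opening_line = "It was the best of times, it was the worst of times"

# The approach described in the hints: start empty, then add 1 per word
vocab = {}
for word in opening_line.split():
    vocab[word] = vocab.get(word, 0) + 1

# Alternative (not required for the exercise): Counter does the counting
counts = Counter(opening_line.split())

print(vocab["was"])           # 2
print(dict(counts) == vocab)  # True
```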

## 1.2.5 Files

Quite often we want to load data from (or save data to) files.
Files have a file path and in the cell below we use the variable `input_file_path` to hold a string that contains a file path.



If you are working on Colab, you will need to mount your Google Drive in Colab.  Run the code below and follow the instructions to authorise access to your Google Drive.

In [None]:
#only run this cell if you are working in colab
#you don't need it and it will cause a ModuleNotFoundError on Anaconda
from google.colab import drive
drive.mount('/content/drive')

Now you can access files in your Google Drive account by specifying a file path in that directory.

In [52]:
#Make sure the file path points to a valid file
#on the windows machines in the labs you will want something like this
#input_file_path = "N:/Documents/teaching/NLE2018/w1/Week1Labs/sample_text.txt"

#on a mac or a linux machine you will want something like this
input_file_path = "/Users/juliewe/Documents/teaching/NLE/NLE2021/w1/Week1LabsSolutions/sample_text.txt"

#on colab you will want something like this
#input_file_path='/content/drive/My Drive/NLE Notebooks/Week1Labs/sample_text.txt'

We now use the file path variable to *open* the file. We need to do this before reading/writing to it.

The code below also checks the **type** of the opened file object - it should be an `_io.TextIOWrapper`. If the drive is not mounted or you type the path incorrectly, you will get a `FileNotFoundError`.

In [53]:
input_file = open(input_file_path)
type(input_file)

_io.TextIOWrapper
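As a sketch of what happens when the path is wrong, the code below tries to open a file that is assumed not to exist. The file name is hypothetical, and `try`/`except` is used (although it is not covered in this notebook) so that the cell reports the problem rather than stopping.

```python
# Hypothetical file name, assumed not to exist in the current folder
missing_path = "this_file_does_not_exist_12345.txt"

try:
    input_file = open(missing_path)
except FileNotFoundError as error:
    input_file = None
    print("Could not open the file:", error)
```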

Use the `read` method to read the entire file contents into a `str` variable called `input_text`.

In [54]:
input_text = input_file.read()
type(input_text)

str

We can now use this `str` variable in the same way as any other.

In [55]:
input_text

'This is some sample text. Feel free to replace it with something more interesting!'

In [56]:
print("The number of characters in the file is",len(input_text))

The number of characters in the file is 82


When you are done with the file, close it.

In [57]:
input_file.close()

After the file has been closed it cannot be read any more.  The following code should generate a `ValueError`.

In [58]:
input_text = input_file.read()

ValueError: I/O operation on closed file.

It is much neater to use a `with ... as` block to `open` a file. You then do not need to `close` it, since it is closed automatically when the block exits.

In [59]:
with open(input_file_path) as input_file:
    some_input_text=input_file.read()

In [60]:
some_input_text

'This is some sample text. Feel free to replace it with something more interesting!'

In [61]:
input_file.read()

ValueError: I/O operation on closed file.
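The sketch below shows the automatic closing more directly, using the `closed` attribute of the file object. So that it does not depend on the folders on your machine, it first writes a small temporary file. The `tempfile` and `os` modules and the `"w"` (write) mode are not covered in this notebook and are used here only to set up the example.

```python
import os
import tempfile

# Set-up: create a small file in a temporary folder
temp_dir = tempfile.mkdtemp()
example_path = os.path.join(temp_dir, "example.txt")
with open(example_path, "w") as output_file:
    output_file.write("This is some sample text.")

with open(example_path) as input_file:
    print(input_file.closed)  # False: the file is open inside the block
    text = input_file.read()

print(input_file.closed)      # True: closed automatically when the block ended
print(text)
```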

### **Exercise 5**
In the blank cell below write a function, `print_word_counts`, that will take a file path as an argument, open the file, then print, one word per line, each word in the file together with the number of times that word appears in the file.

Test your function by running it on the `sample_text.txt`.

Hint: Call the function you developed in Exercise 4!

In [64]:
def print_word_counts(filename):
    with open(filename) as inputstream:
        
        some_text=inputstream.read()
        vocab=make_vocab(some_text)
        for key,value in vocab.items():
            print("{}:{}".format(key,value))
            
print_word_counts(input_file_path)

This:1
is:1
some:1
sample:1
text.:1
Feel:1
free:1
to:1
replace:1
it:1
with:1
something:1
more:1
interesting!:1


## 1.2.6 Tuples

A tuple is another data structure which at first sight might seem much like a list or a set.  However, once created, a tuple cannot be changed: it has a **fixed** number of elements, and we cannot append or add items to it.  The most common types of tuples are **pairs** (two-tuples) and **triples** (three-tuples).  We have already seen pairs when we used the dictionary `.items()` method, which gives us key-value pairs.

In short, a tuple consists of a number of values separated by commas. These values can be different types. It is initialised with parentheses, containing its objects separated by commas.

In [65]:
person = ("Jon", 14, "jon@thewall.com")
person

('Jon', 14, 'jon@thewall.com')

In [66]:
type(person)

tuple

We can use `len` to count the number of elements in a tuple.

In [67]:
len(person)

3

We can index into a tuple in the same way that we index into a list.

In [68]:
person[0]

'Jon'

In [69]:
person[-2:]

(14, 'jon@thewall.com')
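Indexing lets us read from a tuple but not change it. The sketch below shows the error you get if you try, and one way to build a new tuple instead. `try`/`except` is not covered in this notebook and is used here only so that the cell runs to completion.

```python
person = ("Jon", 14, "jon@thewall.com")

try:
    person[1] = 15  # tuples do not support item assignment
except TypeError as error:
    print("Tuples cannot be changed:", error)

# To "update" a tuple, build a new one from the old elements
older_person = (person[0], person[1] + 1, person[2])
print(older_person)  # ('Jon', 15, 'jon@thewall.com')
```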

It can be useful to use tuples as values in dictionaries.

In [70]:
#Note that each key is a string, and each value is a tuple
people = {"Joffrey":(12, "Baratheon", "joff@kingslanding.com"), "Jon":(14, "Snow", "jon@thewall.com")}
people["Joffrey"]

(12, 'Baratheon', 'joff@kingslanding.com')

In [71]:
# Jon's age - we access this using the dictionary key, and then indexing within the value:
people["Jon"][0]

14

In [72]:
# Joffrey's email
people["Joffrey"][2]

'joff@kingslanding.com'

In [73]:
# list everyone's first and last names:
for person, record in people.items():
    print(person, record[1])

Joffrey Baratheon
Jon Snow


### **Exercise 6**
In the blank cell below create a dictionary called `address_book`, with at least 3 key-value entries. Each should consist of a person's name in string format (the key), and a tuple with corresponding pieces of information about them (the value).

Once you've done that, iterate over the address book, printing information about each person into a sentence.

In [74]:
address_book={"ann":(123,"ann@mail.com"),"bob":(234,"bob@mymail.com"),"charlie":(345,"charlie@snailmail.com")}

In [77]:
for entry,value in address_book.items():
    print("{}'s phone number is {} and email address is {}".format(entry,value[0],value[1]))

ann's phone number is 123 and email address is ann@mail.com
bob's phone number is 234 and email address is bob@mymail.com
charlie's phone number is 345 and email address is charlie@snailmail.com
