# Python for Computational Linguists 1.3: Basic NLP with Python


## Introduction

In this module, we will cover
- functions;
- regular expressions;
- spaCy.

## Functions

This section is built on this [tutorial](https://www.digitalocean.com/community/tutorials/how-to-define-functions-in-python-3), and students are strongly recommended to go through it.

> A *function* is a block of instructions that performs an action and, once defined, can be reused. Functions make code more modular, allowing you to use the same code over and over again.

Python has a number of built-in functions that you may be familiar with, including:

- `print()` which will print an object to the terminal;
- `int()` which will convert a string or number data type to an integer data type;
- `len()` which returns the length of an object.

In [16]:
print('hello world!')

hello world!


In [17]:
int(3.5)

3

In [18]:
len([1,2,3])

3

A closely related concept is the *method*. See [here](https://stackoverflow.com/questions/155609/whats-the-difference-between-a-method-and-a-function) for the difference between the two.
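
To make the distinction concrete, here is a minimal sketch (the variable names are only for illustration): a function is called by its name and receives the object as an argument, while a method belongs to an object and is called with dot notation.

```python
word = 'linguistics'

# A function is called by name, with the object passed in as an argument.
length = len(word)

# A method belongs to an object and is called with dot notation.
shouted = word.upper()

print(length)   # 11
print(shouted)  # LINGUISTICS
```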

Now let's learn how to define and call functions.

### Define a function

A function is defined by using the `def` keyword, followed by 
1. a **function name** of your choosing;
2. a set of parentheses which hold any **parameters** the function will take (they can be empty);
3. an ending **colon**;
4. function **content code**.

Let's define a simple function `hello()` that prints **hello world**.

In [19]:
def hello():
    print('hello world')

Note that 
- We do not have any parameters for this function; we will discuss them in more detail later;
- The body of the function must be indented. Python only requires consistent indentation, but four spaces per level is the convention, as shown above.

Now we have defined our first function. To run it, we still need to call it, just as we did with Python's built-in functions:

In [20]:
hello()

hello world


Functions can be more complicated. For example, we can use `for` loops, conditional statements, and more within our function block.

For example, the function defined below utilizes a conditional statement to check if the input for the name variable contains a vowel, then uses a for loop to iterate over the letters in the name string.

In [21]:
# Define function names()
def names():
    # Set up name variable with input
    name = str(input('Enter your name: '))
    # Check whether name has a vowel
    if set('aeiou').intersection(name.lower()):
        print('Your name contains a vowel.')
    else:
        print('Your name does not contain a vowel.')

    # Iterate over name
    for letter in name:
        print(letter)

In [22]:
# Call the function
names()

Enter your name: 
Your name does not contain a vowel.


The `names()` function sets up a conditional statement and a for loop, showing how code can be organized within a function definition.

> Defining functions within a program makes our code modular and reusable so that we can call the same functions without rewriting them.

### Function parameters

So far we have looked at functions with empty parentheses, but we can define parameters in function definitions within their parentheses.

> A parameter is a variable in the definition of a function that the function can accept.

Let’s create a function that takes three parameters `x`, `y`, `z`, and adds them in different configurations. The sums of these will be printed by the function. Then we’ll call the function and pass numbers into the function.

In [23]:
def add_numbers(x, y, z):
    a = x + y
    b = x + z
    c = y + z
    print(a, b, c)

Now we can pass the `arguments` we want into the function to call it. Check [here](https://www.quora.com/What-is-the-difference-between-argument-and-parameters-in-C) for the difference between parameter and argument.

In [24]:
add_numbers(1, 2, 3)

3 4 5


We passed the number `1` in for the `x` parameter, `2` in for the `y` parameter, and `3` in for the `z` parameter. These values correspond with each parameter in the order they are given.

In [25]:
# try different arguments
add_numbers(4, 6, 8)
add_numbers('a', 'b', 'c') # how does it work?

10 12 14
ab ac bc
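
The second call works because `+` is also defined for strings, where it means concatenation. A small sketch of the same idea, independent of `add_numbers` (the variable names are made up for this example):

```python
# For numbers, + adds; for strings and lists, + concatenates.
number_sum = 1 + 2
string_sum = 'a' + 'b'
list_sum = [1, 2] + [3]

print(number_sum, string_sum, list_sum)  # 3 ab [1, 2, 3]
```

Mixing the two, as in `1 + 'a'`, raises a `TypeError`, because Python does not guess which meaning of `+` is intended.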


> **<h3>💻 Try it yourself!</h3>**

Write a function that, given an input list of integers, prints their average.

<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p><code>
  def get_average_list(l):
      print( sum(l) / len(l) )
  </code></p>
</details> 

> **<h3>💻 Try it yourself!</h3>**

Write a function that, given an input string, computes and prints the following statistics:
- Total number of characters.
- Total number of tokens.
- Total number of word types (i.e., distinct tokens) *Hint: there is a datatype that can help you with this!*

In [11]:
def string_statistics(s):
    tot_characters = #TODO
    tot_tokens = #TODO
    tot_types = #TODO
    
    print("The given input string contains", tot_characters, "characters,",\
         tot_tokens, "tokens and", tot_types, "types")

# Test your function here!
test_string = "To be or not to be, not to be or to be!"
string_statistics(test_string) 

SyntaxError: ignored

### Keyword Arguments

We can also use keyword arguments in a function call to identify the arguments by the parameter name.

When using keyword arguments, we can use parameters out of order because the Python interpreter will use the keywords provided to match the values to the parameters:

In [26]:
add_numbers(x = 4, y = 6, z = 8)
add_numbers(y = 6, x = 4, z = 8) # out of order, but same arguments

10 12 14
10 12 14


### Default Argument Values

We can also provide default values for one or more parameters. A parameter takes its default value if we do not pass an argument for it when calling the function.

In [27]:
def add_numbers_default_value(x, y = 6, z = 10):
    a = x + y
    b = x + z
    c = y + z
    print(a, b, c)

In [28]:
add_numbers_default_value(4)
add_numbers_default_value(x = 4)

10 14 16
10 14 16


We can also explicitly pass the arguments to the parameters with the default value:

In [29]:
add_numbers_default_value(4, y = 8)

12 14 18


### Returning a Value

We can pass parameters into a function, and a function can also return a value to us in the end. When a function exits, it can *optionally* pass an expression back to the caller, with the `return` statement. If you use a `return` statement with no arguments, the function will return `None`.

So far, we have used the `print()` function instead of the `return` statement in our functions. Let’s now write a function that returns a value instead of printing it. The function `square` takes a parameter `x` and returns the variable `y`, the square of `x`.

In [30]:
def square(x):
    y = x ** 2
    return y

In [31]:
result = square(3)
print(result)

9


As stated previously, we can use `return` with no arguments, so that the function will return `None`.

In [32]:
# we return no arguments
def square_return_noarg(x):
    y = x ** 2
    return

In [33]:
result = square_return_noarg(3)
print(result)

None


Likewise, a function with no `return` statement at all returns `None` by default.

In [34]:
# we do not use return
def square_noreturn(x):
    y = x ** 2

In [35]:
result = square_noreturn(3)
print(result)

None


In the previous function `add_numbers`, instead of printing the results, we can `return` more than one value wrapped in data structures such as **tuples**, **lists** or **dictionaries**:

In [36]:
def add_numbers_return_dict(x, y, z):
    a = x + y
    b = x + z
    c = y + z
    return {'a': a, 'b': b, 'c': c}

In [37]:
result = add_numbers_return_dict(1, 2, 3)
print(result)

{'a': 3, 'b': 4, 'c': 5}
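
The same idea works with a tuple, which can be unpacked directly into separate variables. A minimal sketch (the name `add_numbers_return_tuple` is made up for this example):

```python
def add_numbers_return_tuple(x, y, z):
    # The comma-separated values are packed into a single tuple.
    return x + y, x + z, y + z

# Unpack the returned tuple into three variables.
a, b, c = add_numbers_return_tuple(1, 2, 3)
print(a, b, c)  # 3 4 5
```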


Whenever the program hits a `return` statement, the function exits immediately, whether or not it returns a value.

In [38]:
def loop_five():
    for x in range(0, 25):
        print(x)
        if x == 5:
            # Stop function at x == 5
            return
    print("This line will not execute.")

loop_five()

0
1
2
3
4
5


The function hits `return` before the `for` loop ends, so the line that is outside of the loop will not run. 
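
An early `return` is often used to hand back a result as soon as it is found. As a sketch (the function `first_vowel` is a hypothetical example, not part of this module):

```python
def first_vowel(word):
    for letter in word.lower():
        if letter in 'aeiou':
            # Exit the function as soon as a vowel is found.
            return letter
    # Reached only if the loop finishes without returning.
    return None

print(first_vowel('Python'))  # o
print(first_vowel('rhythm'))  # None
```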

> **<h3>💻 Try it yourself!</h3>**

Write a function that:
- Takes as input a shopping list in the form of `items : quantity` pairs;
- For each item to buy, prints a reminder in the form: *Don't forget to buy* + `quantity` + `items`;
- Ends by wishing you a nice shopping time.

In [None]:
def print_shopping_list(shopping_list):
    # write your function here
    pass

# Test your function here!
shopping_list = {"pears": 5, "apples": 7, "bananas": 4}
print_shopping_list(shopping_list)

<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p><code>
    def print_shopping_list(shopping_list):
        for item, quantity in shopping_list.items():
            print("Don't forget to buy", quantity, item)
        print("Have a nice shopping time!")
  </code></p>
</details> 

Now, let's make our function a bit nicer.
What would happen, for example, if someone entered an incorrect input?
Our function would crash!

Let's improve our `print_shopping_list` function in the cell above in this way:
- At the beginning of the function, check that the given input is of type `dict`;
- If the input doesn't pass this test, `print` a message that asks the user for a correct input, then `return`.

Now, test our improved function with the test inputs below.

In [None]:
# Test your improved function here!
print_shopping_list({"pears": 5, "apples": 7, "bananas": 4})
print_shopping_list(["pears", "apples", "bananas"])
print_shopping_list({"tomatoes": 9, "cat_litter": 1, "tuna_chunks": 4})

<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p><code>
    def print_shopping_list(shopping_list):
        if not isinstance(shopping_list, dict):
            print("Please, give me a valid shopping list!")
            return
        for item, quantity in shopping_list.items():
            print("Don't forget to buy", quantity, item)
        print("Have a nice shopping time!")
  </code></p>
</details> 

> **<h3>💻 Try it yourself!</h3>**

Now, let's write a function that returns the total cost of a shopping list. It takes two dictionaries as input: a `shopping_list` dictionary composed of `item : quantity` pairs (as in the exercise above) and an `inventory` dictionary composed of `item : cost` pairs.


Complete the function below with the missing statements.

In [39]:
def calculate_shopping_bill(shopping_list, inventory):
    shopping_cost = 0
    for item, quantity in shopping_list.items():
        # TODO: get the total price by summing up the prices of single items

    return shopping_cost
        
    
# Test your function here!
our_shopping = {"pears": 5, "apples": 7, "bananas": 4}
current_inventory = {"pears": 0.5, "apples": 0.7, "bananas": 1.3}

cost = calculate_shopping_bill(our_shopping, current_inventory)
print("You'll spend", cost, "GBP")

IndentationError: ignored

<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p><code>
    def calculate_shopping_bill(shopping_list, inventory):
        shopping_cost = 0
        for item, quantity in shopping_list.items():
            single_item_cost = inventory[item]
            total_item_cost = single_item_cost * quantity
            shopping_cost += total_item_cost
        return shopping_cost
  </code></p>
</details> 

### ❓ Quiz  

Given the function:
```
def add_numbers(x = 3, y = 4, z = 5):
    x = x + y
    y = x + z
    z = y + z
    print(x, y, z)
    
o, p, q = 5, 6, 7
L = [5,6,7]
```

1. What does the following code print?

```
d = add_numbers(5)
```

A. 7, 8, 9

B. 7 8 9

C. 9 14 19

D. 9, 14, 19

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>C</p>
</details> 

2. What is the value of `d`?

A. 7

B. 9

C. 10

D. None

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>D</p>
</details> 

3. What are the values of `o`, `p`, and `q` after running the following code?

```
add_numbers(o, p, q)
```

A. 7, 12, 17

B. 9, 14, 19

C. 11, 18, 25

D. 5, 6, 7

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>D</p>
</details> 

4. How do you pass the elements of `L` as the arguments of the function?


A. ```add_numbers(L)```


B. ```add_numbers(L*)```

C. ```add_numbers(*L)```

D. ```add_numbers(**L)```

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>C</p>
</details> 

5. What is the value of `L` after running the following code?

```
add_numbers(L, L, L)
```

A. [5, 6, 7]

B. [5, 6, 7, 5, 6, 7]

C. [5, 6, 7, 5, 6, 7, 5, 6, 7]

D. [5, 6, 7, 5, 6, 7, 5, 6, 7, 5, 6, 7]

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>A</p>
</details> 

## Regular Expressions

This section is heavily built on this [tutorial](https://docs.python.org/3.6/howto/regex.html#regex-howto), and students are strongly recommended to go through it. More comprehensive documentation of Python regular expressions can be found [here](https://docs.python.org/3.6/library/re.html).

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

RE patterns are usually written in Python's **raw string** notation, because backslashes `'\'` are not handled in any special way in a string literal prefixed with `r`. So `r'\n'` is a two-character string containing `'\'` and `'n'`, while `'\n'` is a one-character string containing a newline. 
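
A quick way to check this difference is to compare string lengths. The sketch below uses illustrative variable names only:

```python
raw = r'\n'      # two characters: a backslash and the letter n
newline = '\n'   # one character: a newline

print(len(raw), len(newline))  # 2 1

# Without the r prefix, the backslash itself has to be escaped.
print(raw == '\\n')            # True
```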

We’ll use the raw string notation for the rest of the section. In running text we will write REs in code style, usually without the `r` prefix and quotes (so `\n` stands for `r'\n'`), and strings to be matched 'in single quotes'.

REs can contain both special and ordinary characters. Most ordinary characters, like `A`, `a`, or `0`, are the simplest REs; they simply match themselves. You can concatenate ordinary characters to match more complex sequences. For example, `test` will match the string 'test' exactly.

Some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.

Here’s a complete list of the metacharacters and their explanations:

- `.`             Matches any character except a newline.
- `^`             Matches the start of the string.
- `$`             Matches the end of the string or just before the newline at the end of the string.
- `*`             Matches 0 or more (greedy) repetitions of the preceding RE. Greedy means that it will match as many repetitions as possible.
- `+`             Matches 1 or more (greedy) repetitions of the preceding RE.
- `?`             Matches 0 or 1 (greedy) of the preceding RE.
- `*?, +?, ??`    Non-greedy versions of the previous three special characters.
- `{m,n}`         Matches from m to n repetitions of the preceding RE.
- `{m,n}?`        Non-greedy version of the above.
- `\`             Either escapes special characters or signals a special sequence.
- `[]`            Indicates a set of characters. A "^" as the first character indicates a complementing set.
- `|`             A|B, creates an RE that will match either A or B.
- `(...)`         Matches the RE inside the parentheses. The contents can be retrieved or matched later in the string.
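
A few of these metacharacters in action, as a minimal sketch (the example strings are made up for illustration):

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* consumes as much as it can, so the match runs to the last '>'.
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? consumes as little as it can, so each tag matches on its own.
lazy = re.findall(r'<.*?>', html)

# A character set and alternation: 'gr' + 'a' or 'e' + 'y', or the word 'red'.
colours = re.findall(r'gr[ae]y|red', 'grey, gray and red')

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(colours)  # ['grey', 'gray', 'red']
```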


The special sequences consist of "\\" and a character from the list below:
- `\number`  Matches the contents of the group of the same number.
- `\A`       Matches only at the start of the string.
- `\Z`       Matches only at the end of the string.
- `\b`       Matches the empty string, but only at the start or end of a word.
- `\B`       Matches the empty string, but not at the start or end of a word.
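- `\d`       Matches any decimal digit; equivalent to the set `[0-9]` when the `re.ASCII` flag is used (otherwise other Unicode digits match too).
- `\D`       Matches any non-digit character; equivalent to `[^\d]`.
- `\s`       Matches any whitespace character; equivalent to `[ \t\n\r\f\v]` when the `re.ASCII` flag is used.
- `\S`       Matches any non-whitespace character; equivalent to `[^\s]`.
- `\w`       Matches any word character (letters, digits, and the underscore); equivalent to `[a-zA-Z0-9_]` when the `re.ASCII` flag is used.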
- `\W`       Matches the complement of `\w`.
- `\\`       Matches a literal backslash.
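
The special sequences can be tried out on a short string. This is only a sketch; the sentence and variable names are invented for the example:

```python
import re

sentence = 'Sonnet 2 has 14 lines; the the typo is deliberate.'

# \d+ matches runs of decimal digits.
digits = re.findall(r'\d+', sentence)

# \b\w+\b matches whole runs of word characters.
words = re.findall(r'\b\w+\b', sentence)

# \1 matches the same text that group 1 matched, so this finds a doubled word.
doubled = re.search(r'\b(\w+)\s+\1\b', sentence)

print(digits)           # ['2', '14']
print(len(words))       # 10
print(doubled.group())  # the the
```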

In [40]:
# import built-in regular expression module in python
import re

In [41]:
text = ("""                2
  When forty winters shall besiege thy brow,"""
  "And dig deep trenches in thy beauty's field, "
  "Thy youth's proud livery so gazed on now, "
  "Will be a tattered weed of small worth held: "
  "Then being asked, where all thy beauty lies, "
  "Where all the treasure of thy lusty days; "
  "To say within thine own deep sunken eyes, "
  "Were an all-eating shame, and thriftless praise. "
  "How much more praise deserved thy beauty's use, "
  "If thou couldst answer 'This fair child of mine "
  "Shall sum my count, and make my old excuse' "
  "Proving his beauty by succession thine. "
  "This were to be new made when thou art old, "
  "And see thy blood warm when thou feel'st it cold.")

print(text)

                2
  When forty winters shall besiege thy brow,And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserved thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse' Proving his beauty by succession thine. This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold.


We will mainly concentrate on two essential uses of regular expressions: **searching** and **substituting** text.

### Searching text

Let's first find the word '**where**', regardless of case.

The **re** module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.

In [42]:
p = re.compile(r'\b[Ww]here\b')
p

re.compile(r'\b[Ww]here\b', re.UNICODE)

A compiled pattern object has several methods and attributes. For **searching** text, there are four main methods:
- *match()*: determine if the RE matches at the beginning of the string.
- *search()*: scan through a string, looking for any location where this RE matches.
- *findall()*: find all substrings where the RE matches, and returns them as a list.
- *finditer()*: find all substrings where the RE matches, and returns them as an iterator.

Please consult the **re** documentation for a complete listing.

In [43]:
m = p.match(text)
print(m)

None


In [44]:
m = p.search(text)
print(m)

<_sre.SRE_Match object; span=(212, 217), match='where'>


The functions *match()* and *search()* return None if no match can be found. If they’re successful, a match object instance is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

In this example, the text doesn't start with the word 'where', so *match()* finds nothing, whereas *search()* looks for any location where the RE matches.
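
Because both functions may return `None`, it is safer to test the result before querying it. A minimal sketch, assuming a hypothetical helper called `find_word`:

```python
import re

def find_word(pattern, string):
    """Return the matched substring, or None if there is no match."""
    m = re.search(pattern, string)
    if m is None:
        # Calling m.group() here would raise AttributeError.
        return None
    return m.group()

line = 'Where all thy beauty lies'
print(find_word(r'\bthy\b', line))   # thy
print(find_word(r'\bthee\b', line))  # None

# match() only succeeds at the very beginning of the string.
print(re.match(r'thy', line))        # None
```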

We can further query the match object for information about the matching string.

In [45]:
m.group()

'where'

In [46]:
m.start(), m.end()

(212, 217)

In [47]:
m.span()

(212, 217)

Now what if we want to find all matches, i.e. all words 'where'? We can use *findall()* and *finditer()*.

*findall()* returns a list of matching strings.

In [48]:
m = p.findall(text)
print(m)

['where', 'Where']


*finditer()* returns a sequence of match object instances as an iterator.

In [49]:
iterator = p.finditer(text)
for match in iterator:
    print(match)

<_sre.SRE_Match object; span=(212, 217), match='where'>
<_sre.SRE_Match object; span=(239, 244), match='Where'>


You don’t have to create a pattern object and call its functions; the **re** module also provides the same top-level functions.

In [50]:
print(re.search(r'\b[Ww]here\b', text))

<_sre.SRE_Match object; span=(212, 217), match='where'>


In [51]:
print(re.findall(r'\b[Ww]here\b', text))

['where', 'Where']


Notice that when we compile the RE, a **re.UNICODE** item appears when we print the object. It is a compilation flag; flags let you modify some aspects of how REs work. The full list of flags is available in the module documentation. For example, we can use **re.IGNORECASE** or **re.I** to do case-insensitive matching, so that we do not have to handle capital letters ourselves.

In [52]:
p = re.compile(r'\bwhere\b', re.I)
p

re.compile(r'\bwhere\b', re.IGNORECASE|re.UNICODE)

And we will still have the same results.

In [53]:
m = p.findall(text)
print(m)

['where', 'Where']


> **<h3>💻 Try it yourself!</h3>**

Now try to come up with your own regular expressions and search the text to see whether they work.

You can use the cell below:

#### Groupings

Frequently you need to obtain more information than just whether the RE matched or not. REs are often used to dissect strings by writing a RE divided into several subgroups which match different components of interest.

Groups are marked by the `'('`, `')'` metacharacters. They group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as `*`, `+`, `?`, or `{m,n}`.

In [54]:
p = re.compile('((a(b)c)d)*')
m = p.match('abcdabcd')
print(m.span())

(0, 8)


Groups indicated with `'('`, `')'` also capture the starting and ending index of the text that they match; this can be retrieved by passing an argument to `group()`, `start()`, `end()`, and `span()`. 

Groups are numbered starting with 0. Group 0 is always present; it’s the whole RE, so previous methods all have group 0 as their default argument.

In [55]:
print(m.start(0), m.end(0))
print(m.group(0,1,2,3))
print(m.span(2))

0 8
('abcdabcd', 'abcd', 'abc', 'b')
(4, 7)


The `groups()` method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.

In [56]:
print(m.groups())

('abcd', 'abc', 'b')
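
Groups are most useful when each one captures a different component of interest. As a sketch, assume tokens written in a hypothetical `word/TAG` format:

```python
import re

# Hypothetical example: split a "word/TAG" token into its two parts.
tag_pattern = re.compile(r'(\w+)/([A-Z]+)')
tag_match = tag_pattern.match('beauty/NOUN')

print(tag_match.group(0))  # the whole match: beauty/NOUN
print(tag_match.group(1))  # first group, the word: beauty
print(tag_match.group(2))  # second group, the tag: NOUN
print(tag_match.groups())  # ('beauty', 'NOUN')
```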


### Substituting text

We will use the following functions for substitution:
- *sub()*: find all substrings where the RE matches, and replace them with a different string.
- *subn()*: does the same thing as sub(), but returns the new string and the number of replacements.

Again, you are encouraged to refer to the docs for more functions.

***sub*** *(replacement, string[, count=0])*

Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in *string* by the replacement *replacement*. If the pattern isn’t found, string is returned unchanged.

The optional argument *count* is the maximum number of pattern occurrences to be replaced; *count* must be a non-negative integer. The default value of 0 means to replace all occurrences.

For the same example, let's replace every occurrence of the word 'where' with `'------'`.

In [57]:
p = re.compile(r'\b[Ww]here\b')
print(p.sub('------', text))

                2
  When forty winters shall besiege thy brow,And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, ------ all thy beauty lies, ------ all the treasure of thy lusty days; To say within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserved thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse' Proving his beauty by succession thine. This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold.


In [58]:
print(p.sub('------', text, count = 1))

                2
  When forty winters shall besiege thy brow,And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, ------ all thy beauty lies, Where all the treasure of thy lusty days; To say within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserved thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse' Proving his beauty by succession thine. This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold.


The `subn()` method does the same work, but returns a 2-tuple containing the new string value and the number of replacements that were performed:

In [59]:
print(p.subn('------', text))

("                2\n  When forty winters shall besiege thy brow,And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, ------ all thy beauty lies, ------ all the treasure of thy lusty days; To say within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserved thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse' Proving his beauty by succession thine. This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold.", 2)


In [60]:
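# re.VERBOSE lets us add whitespace to the pattern for readability.
# In the replacement, \g<1> stands for the text captured by group 1 (also named "name").
p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
p.sub(r'subsection{\g<1>}','section{First} section{second}')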

'subsection{First} subsection{second}'
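
A group can be referenced in the replacement by number or by name, and the replacement can also be a function that receives each match object. The sketch below uses illustrative variable names and the same pattern written without `re.VERBOSE`:

```python
import re

p_named = re.compile(r'section{(?P<name>[^}]*)}')
source = 'section{First} section{second}'

# \g<1> and \g<name> refer to the same group here.
by_number = p_named.sub(r'subsection{\g<1>}', source)
by_name = p_named.sub(r'subsection{\g<name>}', source)

# A function replacement is called once per match and returns the new text.
upper = p_named.sub(lambda m: 'section{' + m.group('name').upper() + '}', source)

print(by_number)  # subsection{First} subsection{second}
print(by_name)    # subsection{First} subsection{second}
print(upper)      # section{FIRST} section{SECOND}
```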

> **<h3>💻 Try it yourself!</h3>**

Now it's your turn! Please write a regular expression to match sentences that start with the word “This” (case insensitive).

And a regex that matches words which are emoticons (like *:)*). At least 5 emoticons must be matched.

And a regex that matches words which contain at least 3 *o*s in a row (so that *cooooool* would be a match, but *book* would not).

Finally, write a regex substitution that replaces any word ending in *a* or *as* with *o* or *os*, respectively.

Test your regex by converting the Spanish sentence *Maria es ecuatoriana y tiene dos hijas rubias y muy altas* (Mary is Ecuadorian and has two blonde and very tall daughters) into its masculine equivalent.

### Building a simple Eliza

In the lecture we learnt about Eliza, a program meant to emulate a Rogerian psychologist. Let us now build a simple version of Eliza with REs.

First, let us build a dictionary `grefs` that is used to map first-person pronouns to second-person pronouns and vice-versa. It is used to “reflect” a statement back against the user:

In [61]:
grefs = {
    "am":          "are",
    "was":         "were",
    "i":           "you",
    "i'd":         "you would",
    "i've":        "you have",
    "i'll":        "you will",
    "my":          "your",
    "are":         "am",
    "you are":     "I am",
    "you've":      "I have",
    "you'll":      "I will",
    "your":        "my",
    "yours":       "mine",
    "you":         "me",
    "me":          "you"
}

Then we need another table `gpats` that is made up of a list of lists, where the first element is a RE that matches the user’s statements and the second element is a list of potential responses. 

Many of the potential responses contain placeholders that can be filled in with fragments to echo the user’s statements.

In [62]:
gpats = [
  [r'I need (.*)',
  [  "Why do you need {0}?",
    "Would it really help you to get {0}?",
    "Are you sure you need {0}?"]],

  [r'I can\'?t (.*)',
  [  "How do you know you can't {0}?",
    "Perhaps you could {0} if you tried.",
    "What would it take for you to {0}?"]],

  [r'I am (.*)',
  [  "Did you come to me because you are {0}?",
    "How long have you been {0}?",
    "How do you feel about being {0}?"]],

  [r'I\'?m (.*)',
  [  "How does being {0} make you feel?",
    "Do you enjoy being {0}?",
    "Why do you tell me you're {0}?",
    "Why do you think you're {0}?"]],

  [r'Are you ([^\?]*)\??',
  [  "Why does it matter whether I am {0}?",
    "Would you prefer it if I were not {0}?",
    "Perhaps you believe I am {0}.",
    "I may be {0} -- what do you think?"]],

  [r'What (.*)',
  [  "Why do you ask?",
    "How would an answer to that help you?",
    "What do you think?"]],

  [r'How (.*)',
  [  "How do you suppose?",
    "Perhaps you can answer your own question.",
    "What is it you're really asking?"]],

  [r'Because (.*)',
  [  "Is that the real reason?",
    "What other reasons come to mind?",
    "Does that reason apply to anything else?",
    "If {0}, what else must be true?"]],

  [r'Hello(.*)',
  [  "Hello... I'm glad you could drop by today.",
    "Hi there... how are you today?",
    "Hello, how are you feeling today?"]],

  [r'quit',
  [  "Thank you for talking with me.",
    "Good-bye.",
    "Thank you, that will be $150.  Have a good day!"]],
  
  [r'(.*)',
  [  "Please tell me more.",
    "Let's change focus a bit... Tell me about your family.",
    "Can you elaborate on that?",
    "Why do you say that {0}?",
    "I see.",
    "Very interesting.",
    "{0}.",
    "I see.  And what does that tell you?",
    "How does that make you feel?",
    "How do you feel when you say that?"]] 
]

Now let us write the function `respond` that reads the user input and generates the response accordingly.

We iterate through the regular expressions in `gpats`, trying to match each one with the user’s input `s`. If we find a match, we choose a response template randomly from the list of possible responses associated with the matching pattern. Then we interpolate the match groups from the RE into the response string, calling the `translate` function on each match group first.

When we use the list comprehension to generate a list of reflected match groups, we unpack the list with the asterisk `*` character before passing it to the string’s `format` method. `format` expects a series of positional arguments corresponding to the number of format placeholders – `{0}`, `{1}`, etc. – in the string.
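
The unpacking step can be seen in isolation. In this sketch, `template` and `fragments` are illustrative names, not part of the Eliza code:

```python
template = "Why do you need {0}?"
fragments = ["a holiday"]

# The asterisk unpacks the list into separate positional arguments,
# so this is the same as template.format("a holiday").
filled = template.format(*fragments)
print(filled)  # Why do you need a holiday?

# With two placeholders, the list must supply two values.
pair = "{0} and {1}".format(*["salt", "pepper"])
print(pair)    # salt and pepper
```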

In [63]:
import re
import random
def respond(s):
    for pattern, responses in gpats:
        # find a match for s
        match = re.match(pattern, s)
        if match:
            # chosen randomly from among the available options
            response = random.choice(responses)
            return response.format(*[translate(g) for g in match.groups()])

The `translate` function reflects a fragment back at the user. First, it makes the input lowercase, then it tokenizes it by splitting on whitespace characters. It iterates through the list of tokens and, if a token exists in the reflections dictionary `grefs`, replaces it with the value from the dictionary. So “I” becomes “you”, “your” becomes “my”, etc.

In [64]:
def translate(fragment):
    tokens = fragment.lower().split()
    for i, token in enumerate(tokens):
        if token in grefs:
            tokens[i] = grefs[token]
    return ' '.join(tokens)
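
To see the reflection on its own, here is a self-contained sketch. It uses a reduced table `mini_grefs` and a copy of the function named `reflect`; both names are made up so that the example runs without the cells above.

```python
# A reduced copy of the reflection table, just for this demonstration.
mini_grefs = {"i": "you", "am": "are", "my": "your", "your": "my", "me": "you"}

def reflect(fragment, table=mini_grefs):
    # Same steps as translate(): lowercase, split on whitespace, swap known tokens.
    tokens = fragment.lower().split()
    for i, token in enumerate(tokens):
        if token in table:
            tokens[i] = table[token]
    return ' '.join(tokens)

print(reflect("I am worried about my thesis"))  # you are worried about your thesis
print(reflect("help me, please"))               # help me, please
```

The second call shows a limitation of splitting on whitespace only: the token `me,` includes the comma, so it is not found in the table and stays unchanged.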

Finally, we can wrap everything into an interface where our Eliza waits for the user input and generates a corresponding response.

In [65]:
def start_eliza():
    print('Therapist\n---------')
    print('Talk to the program by typing in plain English, using normal upper-')
    print('and lower-case letters and punctuation.  Enter "quit" when done.')
    print('='*72)
    print('Hello.  How are you feeling today?')

    s = ''
    while s != 'quit':
        try:
            s = input('> ')
            while s[-1] in '!.':
                s = s[:-1]
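        except (EOFError, KeyboardInterrupt, IndexError):
            # s[-1] raises IndexError on empty input; end the session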
            s = 'quit'
            print('Invalid Input, exit ...')
        print(respond(s))

In [66]:
start_eliza()

Therapist
---------
Talk to the program by typing in plain English, using normal upper-
and lower-case letters and punctuation.  Enter "quit" when done.
Hello.  How are you feeling today?
> I am feeling well.
How long have you been feeling well?
> A long time.
I see.
> 
Invalid Input, exit ...
Thank you for talking with me.


Now try to add your own `[RE, responses]` pairs to `gpats` to handle more kinds of user input. Feel free to refer to [this](https://github.com/jezhiggins/eliza.py/blob/master/eliza.py) for more REs.
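
As a starting point, the sketch below shows what one extra pair could look like. The list `demo_pats`, its two patterns, and the function `respond_demo` are hypothetical and only mimic the `[RE, responses]` layout; each pattern has a single response so that the output is deterministic.

```python
import re

# Hypothetical pattern list in the same [RE, responses] layout as `gpats`.
demo_pats = [
    [r'i need (.*)', ['Why do you need {0}?']],
    # Catch-all pattern; it matches anything, so it must come last.
    [r'(.*)', ['Please tell me more.']],
]

def respond_demo(s, pats):
    # Try the patterns in order and answer with the first one that matches.
    for pattern, responses in pats:
        match = re.match(pattern, s.lower())
        if match:
            # Only one response per pattern here, so take the first.
            return responses[0].format(*match.groups())

print(respond_demo('I need a holiday', demo_pats))
print(respond_demo('Hello there', demo_pats))
```

Because the patterns are tried in order, a more specific pattern has to be placed before a more general one, otherwise it will never be reached.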

### ❓ Quiz  

1. For the previous text, which RE can find all words with 3, 4 and 5 characters?

A. `r'.{3,5}'`

B. `r'\b.{3,5}+\b'`

C. `r'\b\w{3,5}\b'`

D. `r'\w{3,5}'`

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>C</p>
</details> 

2. Which function can you use to find all the words that the RE in 1. matches?

A. `match()`

B. `search()`

C. `finditer()`

D. `findall()`

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>C, D</p>
</details> 

3. How many words does the RE in question 1 match?

A. 128

B. 105

C. 96

D. 77

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>D</p>
</details> 

4. How many words are there with 5 letters?

A. 10

B. 20

C. 30

D. 35

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>B</p>
</details> 

5. What's the span of the first word with 4 letters?

A. (16, 20)

B. (18, 22)

C. (20, 24)

D. (22, 26) 

<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>C</p>
</details> 
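
The pattern and the functions from the quiz can be tried on a short string before applying them to the text. The sketch below uses a made-up sentence (`sample`), so its results differ from the quiz answers.

```python
import re

# Made-up sample sentence; the quiz itself refers to the text used earlier.
sample = 'The quick brown fox jumps over a lazy dog'

# Whole words of three to five word characters.
pattern = re.compile(r'\b\w{3,5}\b')

# findall returns the matched strings.
words = pattern.findall(sample)
print(words)

# finditer yields match objects, which also carry the span of each match.
spans = [(m.group(), m.span()) for m in pattern.finditer(sample)]
print(spans[0])

# Without the \b anchors, longer words are cut into pieces.
pieces = re.findall(r'\w{3,5}', 'besiege')
print(pieces)
```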

## spaCy

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

There are many useful resources online for spaCy, among which the official [tutorial](https://spacy.io/usage/spacy-101) and the [course](https://course.spacy.io/) are extremely useful.

### Package Installation

Python uses a tool called a *package manager* to install additional libraries. The most common package managers for Python are [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)) and [conda](https://en.wikipedia.org/wiki/Anaconda_(Python_distribution)).

[Here](https://spacy.io/usage) you will find a guide to installing spaCy with pip.
Alternatively, you can install packages directly from Jupyter by running the following
*shell commands*:
 
```
pip install spacy
python -m spacy download en_core_web_sm
```

Shell commands are usually run in a *terminal*, and are run directly by the
operating system. In Jupyter, you can run a shell command by prepending an 
exclamation mark ("`!`") to the command. Don't worry - we will always prepare
shell commands for you when they are needed!

**IMPORTANT**:
Note that shell commands will work only when running the notebooks
- On Linux and macOS/OS X
- On a cloud provider (e.g. Google Colab, MS Azure, Binder)

These commands *may* not work when running the notebooks locally with Windows.

Let's install spaCy and download the required models:


In [67]:
!pip install spacy
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


The previous command should print the message `✔ Download and installation successful`. If that didn't happen, and/or if the commands below do not work, 
please contact a course instructor for support.
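
If you are unsure whether the installation worked, one option is to ask Python whether the package can be found. This is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration.

```python
import importlib.util

def is_installed(package_name):
    # find_spec returns None when no top-level package of that name is found.
    return importlib.util.find_spec(package_name) is not None

# 'json' ships with Python, so this should print True.
print(is_installed('json'))
# Prints True only if spaCy has been installed in this environment.
print(is_installed('spacy'))
```

The model downloaded above is also installed as a Python package, so `is_installed('en_core_web_sm')` should work in the same way.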

### Using spaCy

spaCy provides many features, introduced [here](https://spacy.io/usage/spacy-101#features), and we will focus on the simplest and most direct ones:
- tokenization;
- stemming; (not supported by spaCy)
- lemmatization;
- part-of-speech (POS) tagging;
- sentence boundary detection and sentence segmentation;
- dependency parsing;

In [68]:
# import spacy module
import spacy

# load english model
nlp = spacy.load("en_core_web_sm")
# read the text and process
doc = nlp(text)

In [69]:
# Tokenization
# get all separated tokens in doc
tokens = [token.text for token in doc]
print(tokens)

['                ', '2', '\n  ', 'When', 'forty', 'winters', 'shall', 'besiege', 'thy', 'brow', ',', 'And', 'dig', 'deep', 'trenches', 'in', 'thy', 'beauty', "'s", 'field', ',', 'Thy', 'youth', "'s", 'proud', 'livery', 'so', 'gazed', 'on', 'now', ',', 'Will', 'be', 'a', 'tattered', 'weed', 'of', 'small', 'worth', 'held', ':', 'Then', 'being', 'asked', ',', 'where', 'all', 'thy', 'beauty', 'lies', ',', 'Where', 'all', 'the', 'treasure', 'of', 'thy', 'lusty', 'days', ';', 'To', 'say', 'within', 'thine', 'own', 'deep', 'sunken', 'eyes', ',', 'Were', 'an', 'all', '-', 'eating', 'shame', ',', 'and', 'thriftless', 'praise', '.', 'How', 'much', 'more', 'praise', 'deserved', 'thy', 'beauty', "'s", 'use', ',', 'If', 'thou', 'couldst', 'answer', "'", 'This', 'fair', 'child', 'of', 'mine', 'Shall', 'sum', 'my', 'count', ',', 'and', 'make', 'my', 'old', 'excuse', "'", 'Proving', 'his', 'beauty', 'by', 'succession', 'thine', '.', 'This', 'were', 'to', 'be', 'new', 'made', 'when', 'thou', 'art', 'o

You can see that each token is a separate entry in the list `tokens`.

In [70]:
# Lemmatization
lemmas = [token.lemma_ for token in doc]
print(lemmas)

['                ', '2', '\n  ', 'when', 'forty', 'winter', 'shall', 'besiege', 'thy', 'brow', ',', 'and', 'dig', 'deep', 'trench', 'in', 'thy', 'beauty', "'s", 'field', ',', 'Thy', 'youth', "'s", 'proud', 'livery', 'so', 'gaze', 'on', 'now', ',', 'Will', 'be', 'a', 'tattered', 'weed', 'of', 'small', 'worth', 'hold', ':', 'then', 'be', 'ask', ',', 'where', 'all', 'thy', 'beauty', 'lie', ',', 'where', 'all', 'the', 'treasure', 'of', 'thy', 'lusty', 'day', ';', 'to', 'say', 'within', 'thine', 'own', 'deep', 'sink', 'eye', ',', 'be', 'an', 'all', '-', 'eat', 'shame', ',', 'and', 'thriftless', 'praise', '.', 'how', 'much', 'more', 'praise', 'deserve', 'thy', 'beauty', "'s", 'use', ',', 'if', 'thou', 'couldst', 'answer', "'", 'this', 'fair', 'child', 'of', '-PRON-', 'Shall', 'sum', '-PRON-', 'count', ',', 'and', 'make', '-PRON-', 'old', 'excuse', "'", 'prove', '-PRON-', 'beauty', 'by', 'succession', 'thine', '.', 'this', 'be', 'to', 'be', 'new', 'make', 'when', 'thou', 'art', 'old', ',', '

Did you notice that almost all pronouns are substituted with `-PRON-`?

Unlike verbs and common nouns, there’s no clear base form of a personal pronoun. Should the lemma of “me” be “I”, or should we normalize person as well, giving “it”, or maybe “he”? spaCy’s solution in version 2, which produced the output above, is to introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal pronouns. Later versions of spaCy (v3 and above) no longer use this symbol, so your output may differ.

In [71]:
# POS tagging
poss = [token.pos_ for token in doc]
print(poss)

['SPACE', 'NUM', 'SPACE', 'ADV', 'NUM', 'NOUN', 'VERB', 'VERB', 'DET', 'NOUN', 'PUNCT', 'CCONJ', 'VERB', 'ADJ', 'NOUN', 'ADP', 'DET', 'NOUN', 'PART', 'NOUN', 'PUNCT', 'PROPN', 'NOUN', 'PART', 'ADJ', 'NOUN', 'ADV', 'VERB', 'ADP', 'ADV', 'PUNCT', 'VERB', 'AUX', 'DET', 'ADJ', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'VERB', 'PUNCT', 'ADV', 'AUX', 'VERB', 'PUNCT', 'ADV', 'DET', 'DET', 'NOUN', 'VERB', 'PUNCT', 'ADV', 'DET', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADJ', 'NOUN', 'PUNCT', 'PART', 'VERB', 'ADP', 'NOUN', 'VERB', 'ADJ', 'VERB', 'NOUN', 'PUNCT', 'AUX', 'DET', 'ADV', 'PUNCT', 'VERB', 'NOUN', 'PUNCT', 'CCONJ', 'NOUN', 'NOUN', 'PUNCT', 'ADV', 'ADV', 'ADJ', 'NOUN', 'VERB', 'DET', 'NOUN', 'PART', 'NOUN', 'PUNCT', 'SCONJ', 'PROPN', 'PROPN', 'NOUN', 'PUNCT', 'DET', 'ADJ', 'NOUN', 'ADP', 'PRON', 'PROPN', 'VERB', 'DET', 'NOUN', 'PUNCT', 'CCONJ', 'VERB', 'DET', 'ADJ', 'NOUN', 'PUNCT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'NOUN', 'PUNCT', 'DET', 'AUX', 'PART', 'AUX', 'ADJ', 'VERB', 'ADV', 'PROPN', 'PROPN', 'ADJ

Compare the outputs of the previous three lists.

Now let's move to sentence-level features, i.e. sentence boundary detection and sentence segmentation. Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries. This is usually more accurate than a rule-based approach, but it also means you’ll need a statistical model and accurate predictions.
If your texts are closer to general-purpose news or web text, this should work well out-of-the-box. 

In [72]:
sentences = list(doc.sents)
sent_texts = [sentence.text for sentence in sentences]
print(sent_texts)
print(len(sent_texts))

['                2\n  ', "When forty winters shall besiege thy brow,And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise.", "How much more praise deserved thy beauty's use, If thou couldst answer '", "This fair child of mine Shall sum my count, and make my old excuse' Proving his beauty by succession thine.", "This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold."]
5


However, for other texts like social media texts, your application may benefit from a custom rule-based implementation. You can either use the built-in Sentencizer or plug an entirely custom rule-based function into your processing pipeline.

Let's build a rule-based sentence segmenter.

In [73]:
from spacy.lang.en import English

nlp_rule = English()  # just the language with no model
# spaCy v2 API; in spaCy v3 and later, use nlp_rule.add_pipe("sentencizer") instead
sentencizer = nlp_rule.create_pipe("sentencizer")
nlp_rule.add_pipe(sentencizer)
doc_rule = nlp_rule(text)
sent_rule_texts = [sent.text for sent in doc_rule.sents]
print(sent_rule_texts)
print(len(sent_rule_texts))

["                2\n  When forty winters shall besiege thy brow,And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise.", "How much more praise deserved thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse' Proving his beauty by succession thine.", "This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold."]
3


The two methods clearly differ: the statistical, dependency-parse-based segmenter generates 5 sentences, whereas the rule-based segmenter yields only 3.
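
To make the idea of a rule-based segmenter concrete, here is a deliberately simplified sketch that uses only a regular expression. It is not how spaCy's Sentencizer is implemented; the function name `split_sentences` and the single rule (split after `.`, `!` or `?` followed by whitespace) are assumptions chosen for illustration.

```python
import re

def split_sentences(text):
    # Split at whitespace that directly follows '.', '!' or '?'.
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

demo = 'I am feeling well. How long have you been here? A long time!'
for sentence in split_sentences(demo):
    print(sentence)

# A fixed rule has no notion of abbreviations, so it splits after "Ph.D." too.
print(split_sentences('I will get my Ph.D. degree soon.'))
```

The second call shows the typical weakness of a purely rule-based approach: every rule needs its own exceptions.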

Finally, let's have a look at the dependencies, i.e. the dependency relations between tokens, which the first (statistical) approach also uses to segment sentences.

In [74]:
# Dependencies
deps = [token.dep_ for token in doc]
print(deps)

['', 'ROOT', '', 'advmod', 'nummod', 'nsubj', 'aux', 'advcl', 'compound', 'dobj', 'punct', 'cc', 'conj', 'amod', 'dobj', 'prep', 'compound', 'poss', 'case', 'pobj', 'punct', 'compound', 'poss', 'case', 'amod', 'nsubj', 'advmod', 'conj', 'prep', 'pcomp', 'punct', 'aux', 'ROOT', 'det', 'amod', 'attr', 'prep', 'amod', 'pobj', 'acl', 'punct', 'advmod', 'auxpass', 'advcl', 'punct', 'advmod', 'det', 'compound', 'nsubj', 'ccomp', 'punct', 'advmod', 'predet', 'det', 'nsubj', 'prep', 'nmod', 'compound', 'pobj', 'punct', 'aux', 'advcl', 'prep', 'pobj', 'amod', 'advmod', 'amod', 'dobj', 'punct', 'ccomp', 'det', 'advmod', 'punct', 'amod', 'attr', 'punct', 'cc', 'compound', 'conj', 'punct', 'advmod', 'advmod', 'amod', 'nsubj', 'ROOT', 'compound', 'poss', 'case', 'dobj', 'punct', 'mark', 'compound', 'nsubj', 'advcl', 'punct', 'det', 'amod', 'nsubj', 'prep', 'pobj', 'aux', 'ROOT', 'poss', 'dobj', 'punct', 'cc', 'conj', 'poss', 'amod', 'dobj', 'punct', 'ccomp', 'poss', 'dobj', 'prep', 'compound', 'pob

Using spaCy’s built-in [displaCy visualizer](https://spacy.io/usage/visualizers), we can visualize our example sentences and their dependencies.

In [75]:
from spacy import displacy
displacy.render(sentences, style = "dep")

'<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" id="e60f70f02e5f4d38a4e6366a1e2c23dd-0" class="displacy" width="575" height="224.5" direction="ltr" style="max-width: none; height: 224.5px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr">\n<text class="displacy-token" fill="currentColor" text-anchor="middle" y="134.5">\n    <tspan class="displacy-word" fill="currentColor" x="50">                </tspan>\n    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="50">SPACE</tspan>\n</text>\n\n<text class="displacy-token" fill="currentColor" text-anchor="middle" y="134.5">\n    <tspan class="displacy-word" fill="currentColor" x="225">2</tspan>\n    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="225">NUM</tspan>\n</text>\n\n<text class="displacy-token" fill="currentColor" text-anchor="middle" y="134.5">\n    <tspan class="displacy-word" fill="currentColor" x="400">\n  </tspan>\n    <tspan class="di

## ✍️ Assessment

1. Write a function to calculate the $n$-th Fibonacci number without a `for` loop, using the following formula:
$$
f_n = \frac{a^n - b^n}{a - b} = \frac{a^n - b^n}{\sqrt{5}}
$$
where $a = \frac{1 + \sqrt{5}}{2}$ and $b = \frac{1- \sqrt{5}}{2}$.

In [76]:
import math
SQRT_5 = math.sqrt(5)
def calc_fibonacci(n):
    a = (1 + SQRT_5) / 2
    b = (1 - SQRT_5) / 2
    fn = round((a**n - b**n) / SQRT_5)  # round, not int: int() truncates and can be off by one
    return fn

print(calc_fibonacci(5))
print(calc_fibonacci(10))
print(calc_fibonacci(20))

5
55
6765
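
Because the formula is evaluated with floating-point numbers, it can be worth checking it against exact integer arithmetic. The sketch below is one way to do that; the names `fib_closed_form` and `fib_iterative` are made up here, and the iterative version uses a loop only as a reference for checking, not as a solution to the exercise.

```python
import math

SQRT_5 = math.sqrt(5)

def fib_closed_form(n):
    # Closed-form expression from the exercise, evaluated in floating point.
    a = (1 + SQRT_5) / 2
    b = (1 - SQRT_5) / 2
    # round() picks the nearest integer, absorbing tiny floating-point errors.
    return round((a**n - b**n) / SQRT_5)

def fib_iterative(n):
    # Exact integer reference: f_0 = 0, f_1 = 1, f_n = f_(n-1) + f_(n-2).
    previous, current = 0, 1
    for _ in range(n):
        previous, current = current, previous + current
    return previous

# Compare the two versions for small n.
agree = all(fib_closed_form(n) == fib_iterative(n) for n in range(41))
print(agree)
```

For large `n` the floating-point result eventually drifts away from the exact value, so a check like this helps to find the range in which the formula can be trusted.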


2. Tokenize the following sentences with RE:
```
sents = """He asked: "When will you graduate?"\n"I will get my Ph.D. degree(Doctor of Philosophy) in a few years, hehe. And get $123,45675.45." I answered. END"""
```
How many token types are there in total?

In [77]:
import re

PUNCTS = ['"', "'",':', ';', '?', '(',')', '[', ']', '!', '$', '&', '*', '#', '-']
# raw strings, so that \s and \. reach the regex engine unchanged
PUNCTS_REGEX = {',': [r'[^ ](,)(\s+)'], '.': [r'[^ ]\.(\s+)([^a-z])']}

def my_tokenizer(sents):
    print('Original...')
    print(sents)

    for punt in PUNCTS:
        sents = sents.replace(punt, (' ' + punt + ' '))

    for punct, regs in PUNCTS_REGEX.items():
        for reg in regs:
            p = re.compile(reg)
            m = p.search(sents)
            while m:
                sents = sents[:m.start() + 1] + ' ' + punct + ' ' + sents[m.start() + 2:]
                m = p.search(sents)
    sents = re.sub(r' +', ' ', sents)        

    print('After Tokenization...')
    print(sents)
    return sents

sents = """He asked: "When will you graduate?"\n"I will get my Ph.D. degree(Doctor of Philosophy) in a few years, hehe. And get $123,45675.45." I answered. END"""
tok_sents = my_tokenizer(sents)
print(set(tok_sents.split()))
print(len(set(tok_sents.split())))

Original...
He asked: "When will you graduate?"
"I will get my Ph.D. degree(Doctor of Philosophy) in a few years, hehe. And get $123,45675.45." I answered. END
After Tokenization...
He asked : " When will you graduate ? " 
 " I will get my Ph.D. degree ( Doctor of Philosophy ) in a few years , hehe . And get $ 123,45675.45 . " I answered . END
{')', '123,45675.45', 'in', 'a', 'hehe', 'degree', ':', 'my', 'answered', 'get', '.', 'When', '$', 'I', 'asked', '?', 'few', 'Philosophy', 'will', 'And', 'He', '(', 'Ph.D.', 'years', 'you', 'of', '"', ',', 'Doctor', 'END', 'graduate'}
31
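
The question distinguishes token *types* (distinct strings) from token occurrences. A short sketch with a made-up sentence shows the difference; `Counter` from the standard library is used here only to display how often each type occurs.

```python
from collections import Counter

# Made-up, already tokenized sentence.
tokens = 'the cat saw the dog and the dog saw the cat'.split()

# Every occurrence counts as a token.
print(len(tokens))

# Each distinct string counts once as a type.
types = set(tokens)
print(len(types))

# Frequency of each type.
counts = Counter(tokens)
print(counts.most_common(1))
```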


3. Tokenize the previous sentences with spaCy. How many token types are there in total?

In [78]:
# import spacy module
import spacy

# load english model
nlp = spacy.load("en_core_web_sm")
# read the text and process
doc = nlp(sents)
print(doc)
tokens = [token.text for token in doc]
print(' '.join(tokens))
print(set(tokens))
print(len(set(tokens)))

He asked: "When will you graduate?"
"I will get my Ph.D. degree(Doctor of Philosophy) in a few years, hehe. And get $123,45675.45." I answered. END
He asked : " When will you graduate ? " 
 " I will get my Ph.D. degree(Doctor of Philosophy ) in a few years , hehe . And get $ 123,45675.45 . " I answered . END
{')', '123,45675.45', 'in', 'degree(Doctor', 'a', 'hehe', ':', 'my', 'answered', 'get', '.', 'When', '\n', '$', 'I', 'asked', '?', 'few', 'Philosophy', 'will', 'And', 'He', 'Ph.D.', 'years', 'you', 'of', '"', ',', 'END', 'graduate'}
30
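
The two tokenizers disagree on only a few items, which a set difference makes visible. The two sets below are copied from the outputs printed above; the variable names `regex_types` and `spacy_types` are introduced here for illustration.

```python
# Token types printed by the regular-expression tokenizer above.
regex_types = {')', '123,45675.45', 'in', 'a', 'hehe', 'degree', ':', 'my',
               'answered', 'get', '.', 'When', '$', 'I', 'asked', '?', 'few',
               'Philosophy', 'will', 'And', 'He', '(', 'Ph.D.', 'years', 'you',
               'of', '"', ',', 'Doctor', 'END', 'graduate'}

# Token types printed by spaCy above.
spacy_types = {')', '123,45675.45', 'in', 'degree(Doctor', 'a', 'hehe', ':',
               'my', 'answered', 'get', '.', 'When', '\n', '$', 'I', 'asked',
               '?', 'few', 'Philosophy', 'will', 'And', 'He', 'Ph.D.', 'years',
               'you', 'of', '"', ',', 'END', 'graduate'}

# Types found only by one of the two tokenizers.
only_regex = sorted(regex_types - spacy_types)
only_spacy = sorted(spacy_types - regex_types)
print(only_regex)
print(only_spacy)
```

In this run spaCy kept `degree(Doctor` as a single token and treated the newline as a token, while the regular-expression tokenizer separated the parenthesis from the words around it.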


## External Resources

- [How To Define Functions in Python 3](https://www.digitalocean.com/community/tutorials/how-to-define-functions-in-python-3)
- [Regular Expression HOWTO](https://docs.python.org/3.6/howto/regex.html#regex-howto)
- [Regular Expression Documentation](https://docs.python.org/3.6/library/re.html)
- [Eliza](https://github.com/jezhiggins/eliza.py/blob/master/eliza.py)
- [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101#_title)
- [Advanced NLP with spaCy](https://course.spacy.io/)

## Next Module

[Click here](../module_1.4/module_1.4.ipynb) to move to the next module.