# [1 - Language Processing and Python Exercises](http://www.nltk.org/book/ch01)

* to learn how to run these Python notebooks, [refer to our setup tutorial](../setup.ipynb)
* run the cell below before practicing the exercises

In [2]:
import nltk

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## 1. 
☼ Try using the Python interpreter as a calculator, and typing expressions like `12 / (4 + 1)`.

## 2.

☼ Given an alphabet of 26 letters, there are 26 to the power 10, or 26 ** 10, ten-letter strings we can form. That works out to 141167095653376. How many hundred-letter strings are possible?
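
Python integers have arbitrary precision, so the hundred-letter count can be computed exactly in the same way as the ten-letter one. A minimal sketch; the variable names are illustrative and not from the book:

```python
# Number of distinct strings of length n over a 26-letter alphabet is 26 ** n
ten_letter = 26 ** 10
hundred_letter = 26 ** 100

print(ten_letter)                # 141167095653376, matching the figure quoted above
print(hundred_letter)            # the exact count of hundred-letter strings
print(len(str(hundred_letter)))  # how many decimal digits that count has
```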

## 3. 

☼ The Python multiplication operation can be applied to lists. What happens when you type `['Monty', 'Python'] * 20`, or `3 * sent1`?
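
Multiplying a list by an integer repeats its elements in order. The sketch below uses a small stand-in list (`words` is an illustrative name); `sent1` from `nltk.book` is an ordinary list of strings and should behave the same way:

```python
words = ['Monty', 'Python']  # stand-in for any list of tokens, such as sent1

repeated = words * 3         # same result as 3 * words
print(repeated)              # ['Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python']
print(len(words * 20))       # 40
```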

## 4.

☼ Review 1 on computing with language. How many words are there in text2? How many distinct words are there?

<Text: Sense and Sensibility by Jane Austen 1811>

## 5.

☼ Compare the lexical diversity scores for humor and romance fiction in 1.1. Which genre is more lexically diverse?

| Genre            | Tokens | Types | Lexical diversity |
| ---------------- | ------ | ----- | ----------------- |
| humor            | 21695  | 5017  | 0.231             |
| fiction: romance | 70022  | 8452  | 0.121             |

**Humor** is more lexically diverse: its score of 0.231 is nearly twice the 0.121 for romance fiction.
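
The scores in the table can be reproduced by dividing types by tokens. A minimal sketch; `lexical_diversity` here is a hypothetical helper that takes the two counts directly, not necessarily the function defined in the book:

```python
def lexical_diversity(types, tokens):
    """Ratio of distinct words (types) to total words (tokens)."""
    return types / tokens

# Counts taken from the table above
humor = lexical_diversity(5017, 21695)
romance = lexical_diversity(8452, 70022)

print(round(humor, 3))    # 0.231
print(round(romance, 3))  # 0.121
```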

## 6.

☼ Produce a dispersion plot of the four main protagonists in Sense and Sensibility: Elinor, Marianne, Edward, and Willoughby. What can you observe about the different roles played by the males and females in this novel? Can you identify the couples?

* [change plot size with matplotlib](https://stackoverflow.com/questions/42848307/nltk-dispersion-plot-figure-size/44126375)

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4)) 

text2.dispersion_plot(["Elinor", "Marianne", "Edward", "Willoughby"])

* Edward and Willoughby are noticeably mentioned in different parts of the novel

[Answer](https://en.wikipedia.org/wiki/Sense_and_Sensibility):
* Marianne -> Willoughby
* Elinor -> Edward

## 7.

☼ Find the collocations in text5.


* **collocation**: a sequence of words that occur together unusually often (more often than chance would predict)
* [Video explaining it](https://www.youtube.com/watch?v=CqRloBkyqQs&vl=en)
* the `.collocations()` method appears to be broken in some NLTK versions
    * [fix for this problem](https://github.com/nltk/nltk_book/issues/224)

In [None]:
text5

In [None]:
text5.collocation_list()

## 8.

☼ Consider the following Python expression: `len(set(text4))`. State the purpose of this expression. Describe the two steps involved in performing this computation.

1. `set(text4)` creates a set of the distinct tokens in the text, discarding duplicates
2. `len(set(text4))` counts the items in that set

In [None]:
text4

In [None]:
len(set(text4))

The purpose of this expression is to compute the number of distinct tokens (words and punctuation) in a text. Text4, the *Inaugural Address Corpus*, has 9913 distinct tokens.
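
The two steps can be seen on a small scale without loading the corpus. In this sketch, `tokens` is a made-up list standing in for `text4`:

```python
tokens = ['the', 'people', 'of', 'the', 'United', 'States', ',', 'the', 'people']

vocabulary = set(tokens)  # step 1: duplicates collapse into a single entry each
print(len(tokens))        # 9 tokens in total
print(len(vocabulary))    # step 2: 6 distinct tokens
```

Note that punctuation such as `','` counts as a token too, which is why the result is better described as distinct tokens than as distinct words.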

## 9.

☼ Review Part 2 on lists and strings.

1. Define a string and assign it to a variable, e.g., `my_string = 'My String'` (but put something more interesting in the string). Print the contents of this variable in two ways, first by simply typing the variable name and pressing enter, then by using the print statement.

2. Try adding the string to itself using `my_string + my_string`, or multiplying it by a number, e.g., `my_string * 3`. Notice that the strings are joined together without any spaces. How could you fix this?



In [None]:
# Part 1

my_string = "This is my string"

print(my_string)

In [None]:
# Part 2

my_string + my_string

In [None]:
# Part 2

my_string + " " + my_string

In [None]:
# Part 2 

my_string * 3

In [None]:
# Part 2

(my_string + " ") * 3

## 10.

☼ Define a variable my_sent to be a list of words, using the syntax `my_sent = ["My", "sent"]` (but with your own words, or a favorite saying).

1. Use `' '.join(my_sent)` to convert this into a string.
2. Use split() to split the string back into the list form you had to start with.



In [None]:
my_sent = ["This", "is", "my", "sent"]

In [None]:
# Part 1
my_string = ' '.join(my_sent)

my_string

In [None]:
# Part 2

my_string.split()

## 11.

☼ Define several variables containing lists of words, e.g., phrase1, phrase2, and so on. Join them together in various combinations (using the plus operator) to form whole sentences. What is the relationship between `len(phrase1 + phrase2)` and `len(phrase1) + len(phrase2)`?

In [None]:
phrase1 = "Hi diddly ho!"
phrase2 = "Neighborino!"

In [None]:
phrase1 + " " + phrase2

In [None]:
len(phrase1 + phrase2)

In [None]:
len(phrase1) + len(phrase2)

**Answer:**

* `len(phrase1 + phrase2)` and `len(phrase1) + len(phrase2)` are the same, because concatenation keeps every element of both operands
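
The exercise asks for lists of words rather than strings. The same relationship holds for lists, as this sketch with illustrative phrases shows:

```python
phrase1 = ['Hi', 'diddly', 'ho']
phrase2 = ['neighborino']

sentence = phrase1 + phrase2  # list concatenation keeps every word, in order
print(sentence)               # ['Hi', 'diddly', 'ho', 'neighborino']
print(len(phrase1 + phrase2) == len(phrase1) + len(phrase2))  # True
```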

## 12.

☼ Consider the following two expressions, which have the same value. Which one will typically be more relevant in NLP? Why?

1. `"Monty Python"[6:12]`
2. `["Monty", "Python"][1]`

**Answer:**

Expression 2 is more relevant because a corpus of words is typically represented as a list of words (*tokens*), so accessing a word from a list of words is more commonly used than accessing the substring of a sentence.
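
A short check that the two expressions agree, followed by the token-based access pattern that is more typical in NLP. The `tokens` list is an illustrative example:

```python
by_slice = "Monty Python"[6:12]    # characters 6 to 11 of the string
by_index = ["Monty", "Python"][1]  # second item of the list

print(by_slice == by_index)  # True
print(by_index)              # Python

# With a tokenized text, a word is reached by its position, not by character offsets
tokens = "Monty Python and the Holy Grail".split()
print(tokens[1])             # Python
```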

## 13.

☼ We have seen how to represent a sentence as a list of words, where each word is a sequence of characters. What does `sent1[2][2]` do? Why? Experiment with other index values.

In [None]:
sent1

In [None]:
sent1[2]

In [None]:
sent1[2][2]

**Answer:**

* `sent1[2][2]` accesses the third character of the third item in the list: `sent1[2]` is `'Ishmael'`, so `sent1[2][2]` is `'h'`

## 14.

☼ The first sentence of text3 is provided to you in the variable sent3. The index of *the* in sent3 is 1, because `sent3[1]` gives us `'the'`. What are the indexes of the two other occurrences of this word in sent3?

In [None]:
sent3

In [None]:
for i, word in enumerate(sent3):
    if word == "the":
        print(i)

In [None]:
sent3[5]

In [None]:
sent3[8]

**Answer:**

* the other indexes are 5 and 8

## 15.

☼ Review the discussion of conditionals in Part 4. Find all words in the Chat Corpus (text5) starting with the letter b. Show them in alphabetical order.

In [None]:
text5

In [None]:
# List Comprehension Solution

b_words = [word for word in text5 if len(word) > 0 and word[0].lower() == "b"]

sorted(b_words)[:10]

In [None]:
# for loop solution

b_words = []

for word in text5:
    if len(word) > 0 and word[0].lower() == "b":
        b_words.append(word)
        
sorted(b_words)[:10]

* only showing the first 10 words to save space in the notebook

## 16.

☼ Type the expression `list(range(10))` at the interpreter prompt. Now try `list(range(10, 20))`, `list(range(10, 20, 2))`, and `list(range(20, 10, -2))`. We will see a variety of uses for this built-in function in later chapters.

In [None]:
list(range(10))

In [None]:
list(range(10, 20))

In [None]:
list(range(10, 20, 2))

In [None]:
list(range(20, 10, -2))

## 17.

◑ Use `text9.index()` to find the index of the word sunset. You'll need to insert this word as an argument between the parentheses. By a process of trial and error, find the slice for the complete sentence that contains this word.

In [None]:
text9.index("sunset")

* change the two indices below to complete the sentence

In [None]:
text9[621:644]

In [None]:
" ".join(text9[621:644])

## 18.

◑ Using list addition, and the set and sorted operations, compute the vocabulary of the sentences sent1 ... sent8.

In [None]:
sorted(set(sent1 + sent2 + sent3 + sent4 + sent5 + sent6 + sent7 + sent8))

## 19.

◑ What is the difference between the following two lines? Which one will give a larger value? Will this be the case for other texts?

```python
>>> sorted(set(w.lower() for w in text1))
>>> sorted(w.lower() for w in set(text1))
```

* `set(w.lower() for w in text1)` lowercases every token first and then removes duplicates, so the result is case-insensitive
* `w.lower() for w in set(text1)` removes duplicates first (case-sensitively) and then lowercases, so the same word can appear more than once

* therefore the second expression gives a value at least as large as the first, and a strictly larger one for any text that contains a word in more than one casing
    * for example, 'The' and 'the' are distinct members of `set(text1)`, so 'the' appears twice in the second result

In [None]:
len(sorted(set(w.lower() for w in text1)))

In [None]:
len(sorted(w.lower() for w in set(text1)))
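
The difference is easier to see on a handful of tokens. In this sketch, `tokens` is a made-up list standing in for `text1`:

```python
tokens = ['The', 'whale', 'the', 'Whale', 'the']

# Lowercase first, then deduplicate: each word appears once
first = sorted(set(w.lower() for w in tokens))

# Deduplicate first (case-sensitively), then lowercase: duplicates reappear
second = sorted(w.lower() for w in set(tokens))

print(first)   # ['the', 'whale']
print(second)  # ['the', 'the', 'whale', 'whale']
```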

## 20.

◑ What is the difference between the following two tests: w.isupper() and not w.islower()?

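* `w.isupper()` is true only when every cased character in `w` is uppercase (and there is at least one cased character); `not w.islower()` is true for anything that is not entirely lowercase, which also includes mixed-case strings such as 'Hello' and strings with no letters at all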

In [None]:
w = "HELLO"
w1 = "Hello"

In [None]:
w.isupper()

In [None]:
w.islower()

In [None]:
w1.isupper()

In [None]:
w1.islower()
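
To compare the two tests side by side, the sketch below evaluates both on a few illustrative strings, including one with no letters:

```python
samples = ["HELLO", "Hello", "hello", "1234"]

# Map each sample to (w.isupper(), not w.islower())
results = {w: (w.isupper(), not w.islower()) for w in samples}

for w, (is_upper, not_lower) in results.items():
    print(w, is_upper, not_lower)
# HELLO True True
# Hello False True
# hello False False
# 1234 False True
```

The two tests disagree on `"Hello"` and `"1234"`: neither string is all uppercase, but neither is all lowercase either.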

## 21.

◑ Write the slice expression that extracts the last two words of text2.

In [None]:
text2[-2:]

## 22.

◑ Find all the four-letter words in the Chat Corpus (text5). With the help of a frequency distribution (FreqDist), show these words in decreasing order of frequency.

In [None]:
text5_4letter = [word.lower() for word in text5 if len(word) == 4]

text5_4letter[:10]

In [None]:
freqdist5 = FreqDist(text5_4letter)

freqdist5.most_common(25)

## 23.

◑ Review the discussion of looping with conditions in Part 4. Use a combination of for and if statements to loop over the words of the movie script for Monty Python and the Holy Grail (text6) and print all the uppercase words, one per line.

In [None]:
text6

In [None]:
# takes up a lot of screen space

for word in text6:
    if word.isupper():
        print(word)

## 24.

◑ Write expressions for finding all words in text6 that meet the conditions listed below. The result should be in the form of a list of words: ['word1', 'word2', ...].

1. Ending in ise

2. Containing the letter z

3. Containing the sequence of letters pt

4. Having all lowercase letters except for an initial capital (i.e., titlecase)

In [None]:
# Part 1
[word for word in text6 if word.lower().endswith('ise')]

In [None]:
# Part 2
[word for word in text6 if "z" in word.lower()]

In [None]:
# Part 3
[word for word in text6 if "pt" in word.lower()]

In [None]:
# Part 4
[word for word in text6 if word[0].isupper() and word[1:].islower()][0:10]

## 25. 
◑ Define sent to be the list of words `['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']`. Now write code to perform the following tasks:

1. Print all words beginning with sh
2. Print all words longer than four characters

In [None]:
sent = ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']

In [None]:
# Part 1
for word in sent:
    if word[:2] == "sh":
        print(word)

In [None]:
# Part 2
for word in sent:
    if len(word) > 4:
        print(word)

## 26.

◑ What does the following Python code do? `sum(len(w) for w in text1)` Can you use it to work out the average word length of a text?

In [None]:
sum(len(w) for w in text1) 

* `w for w in text1` iterates over every token in the text, one at a time


In [None]:
words_in_text1 = [w for w in text1]

words_in_text1[:10]

* `len(w) for w in text1` gets the length of each word in the text

In [None]:
word_lengths_in_text1 = [len(w) for w in text1]

word_lengths_in_text1[:10]

* to find the average word length, you must divide the sum of the word lengths by the total number of words in the text (`len(text1)`)

In [None]:
sum(len(w) for w in text1) / len(text1)
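
The same calculation can be wrapped in a small function. This is a sketch only: `average_word_length` is a hypothetical helper name, and the token list is a short stand-in for a full text:

```python
def average_word_length(tokens):
    """Total number of characters divided by the number of tokens."""
    return sum(len(w) for w in tokens) / len(tokens)

tokens = ['Call', 'me', 'Ishmael', '.']
print(average_word_length(tokens))  # (4 + 2 + 7 + 1) / 4 = 3.5
```

Punctuation tokens are counted as words here, which pulls the average down; filtering with `w.isalpha()` first is one way to exclude them.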

## 27.

◑ Define a function called `vocab_size(text)` that has a single parameter for the text, and which returns the vocabulary size of the text.

In [6]:
# vocabulary size = number of distinct tokens in the text
def vocab_size(text):
    return len(set(text))

In [7]:
# example (outcome: 19317)
vocab_size(text1)

19317

## 28.

◑ Define a function `percent(word, text)` that calculates how often a given word occurs in a text, and expresses the result as a percentage.

In [8]:
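# share of the tokens in text that are equal to word, as a percentage
def percent(word, text):
    return 100 * text.count(word) / len(text)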

In [9]:
# example (outcome: 5.26)
percent('the', text1)

5.260736372733581

## 29.

◑ We have been using sets to store vocabularies. Try the following Python expression: set(sent3) < set(text1). Experiment with this using different arguments to set(). What does it do? Can you think of a practical application for this?

In [None]:
set(sent3) < set(text1)

In [None]:
set(sent3) > set(text1)

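* `<` on sets tests for a proper subset, not for size: `set(sent3) < set(text1)` is true only if every token of `sent3` also occurs in `text1` and `text1` contains at least one token that `sent3` lacks
* a practical application is checking whether a text stays within a known vocabulary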
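
The behaviour of the set comparison operators can be checked on small sets. The names and contents below are illustrative, not taken from the corpora:

```python
small = {'in', 'the', 'beginning'}
large = {'in', 'the', 'beginning', 'god', 'created'}
other = {'zebra'}

is_proper_subset = small < large  # True: every item of small is in large
reverse = large < small           # False
smaller_but_not_subset = other < large  # False, even though other has fewer items
same_set_strict = small < small   # False: a set is not a proper subset of itself
same_set_loose = small <= small   # True: <= allows equality

# One possible application: find words that fall outside a known vocabulary
unknown = {'in', 'the', 'zebra'} - large
print(unknown)  # {'zebra'}
```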