# [1. Computing with Language: Texts and Words](http://www.nltk.org/book/ch01)

* [NLTK-Book-Resource Repository](https://github.com/BetoBob/NLTK-Book-Resource)
* [NLTK-Book-Resource Table of Contents](https://github.com/BetoBob/NLTK-Book-Resource#table-of-contents)

This notebook collects the code blocks from the chapter. As you read, try running the cells here and experiment by changing the code.

Run the cell below before running the other example code:

In [None]:
import nltk

nltk.download('book')

from nltk.book import *

## 1 - Computing with Language: Texts and Words

### 1.1 - Getting Started with Python

* Python code runs in a Jupyter Notebook code cell the same way it runs in the Python interpreter
* try running the Python code below!

In [None]:
# click this cell and press the 'run' button

1 + 5 * 2 - 3

In [None]:
# incomplete expression, so Python reports a SyntaxError
1 +

**Your Turn:** Enter a few more expressions of your own. You can use asterisk `*` for multiplication and slash `/` for division, and parentheses for bracketing expressions.
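A few expressions of this kind, as a minimal sketch. The variable names are illustrative and do not come from the book.

```python
# Multiplication binds tighter than addition and subtraction.
no_parens = 1 + 5 * 2 - 3

# Parentheses change the grouping.
with_parens = (1 + 5) * (2 - 3)

# In Python 3, `/` always returns a float, even for whole-number results.
quotient = 8 / 2

print(no_parens, with_parens, quotient)
```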

* these Jupyter Notebooks use the Python 3 programming language

### 1.2 - Getting Started with NLTK

* see the [Introduction and Setup](../setup.ipynb) notebook for details on installing the NLTK data required for this book

In [None]:
text1

In [None]:
text2

### 1.3 - Searching Text

In [None]:
text1.concordance("monstrous")

**Your Turn:** Try searching for other words. You can also try searches on some of the other texts we have included. For example, search *Sense and Sensibility* for the word affection, using `text2.concordance("affection")`. Search the book of Genesis to find out how long some people lived, using `text3.concordance("lived")`. You could look at `text4`, the Inaugural Address Corpus, to see examples of English going back to 1789, and search for words like *nation, terror, god* to see how these words have been used differently over time. We've also included `text5`, the *NPS Chat Corpus*: search this for unconventional words like *im, ur, lol*. (Note that this corpus is uncensored!)
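To get a feel for what a concordance view computes, here is a rough pure-Python sketch that runs without the NLTK data. `simple_concordance` and the toy token list are hypothetical and not part of NLTK. The real `concordance()` method formats its output differently.

```python
def simple_concordance(tokens, target, window=3):
    """Return each occurrence of `target` with `window` tokens of context on each side."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(' '.join(left + [token] + right))
    return lines

# Toy token list standing in for a real text (an assumption for illustration).
toy_text = 'the whale was a most monstrous size and a monstrous sight to see'.split()

for line in simple_concordance(toy_text, 'monstrous'):
    print(line)
```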

___

In [None]:
text1.similar("monstrous")

In [None]:
text2.common_contexts(["monstrous", "very"])

**Your Turn:** Pick another pair of words and compare their usage in two different texts, using the `similar()` and `common_contexts()` functions.

#### Dispersion Plot

* **Note:** NumPy and Matplotlib are included in both Google Colab and Anaconda

* learning the basics of pyplot is an excellent entry point into data visualization with Python
    * [Intro to pyplot Tutorial](https://matplotlib.org/3.3.1/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py)
    * learning pyplot will allow you to adjust features of the graphs NLTK generates, such as axis titles, colors, graph size and more

In [None]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
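A dispersion plot marks where each word occurs, counted in words from the start of the text. The sketch below computes those positions for a toy token list. `word_offsets` is a hypothetical helper, not an NLTK function, and it does no plotting.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of positions where it occurs."""
    return {t: [i for i, tok in enumerate(tokens) if tok == t] for t in targets}

# Toy token list (an assumption for illustration, not the Inaugural Address Corpus).
toy_speech = ['we', 'the', 'citizens', 'value', 'freedom',
              'and', 'citizens', 'have', 'duties']

offsets = word_offsets(toy_speech, ['citizens', 'freedom', 'democracy'])
print(offsets)
```

A word that never occurs, such as 'democracy' here, simply gets an empty list of positions.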

In [None]:
text3.generate()

### 1.4 - Counting Vocabulary

In [None]:
len(text3)

In [None]:
sorted(set(text3))

In [None]:
len(set(text3))

In [None]:
len(set(text3)) / len(text3)

In [None]:
text3.count("smote")

In [None]:
100 * text4.count('a') / len(text4)

In [None]:
text3.count?

**Your Turn:** How many times does the word lol appear in `text5`? How much is this as a percentage of the total number of words in this text?

#### Functions

In [None]:
def lexical_diversity(text):
    return len(set(text)) / len(text)

In [None]:
def percentage(count, total): 
    return 100 * count / total

In [None]:
lexical_diversity(text3)

In [None]:
lexical_diversity(text5)

In [None]:
percentage(4, 5)

In [None]:
percentage(text4.count('a'), len(text4))
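Both functions work on any list of tokens, so they can be checked by hand on a small example. The sketch below repeats the two definitions so that it runs on its own; the toy token list is an assumption for illustration.

```python
def lexical_diversity(text):
    # Ratio of distinct tokens to total tokens.
    return len(set(text)) / len(text)

def percentage(count, total):
    # Share of `count` in `total`, on a 0-100 scale.
    return 100 * count / total

toy_tokens = ['to', 'be', 'or', 'not', 'to', 'be']

# 4 distinct tokens out of 6 in total
print(lexical_diversity(toy_tokens))

# 'to' accounts for 2 of the 6 tokens
print(percentage(toy_tokens.count('to'), len(toy_tokens)))
```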

## 2 - A Closer Look at Python: Texts as Lists of Words

### 2.1 - Lists

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']

In [None]:
sent1

In [None]:
len(sent1)

In [None]:
lexical_diversity(sent1)

* **note:** `nltk.book` has been imported at the top of the page

In [None]:
sent2

In [None]:
sent3

In [None]:
['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail'] 

In [None]:
sent4 + sent1

In [None]:
sent1.append("Some")

In [None]:
sent1

### 2.2 - Indexing Lists

In [None]:
text4[173]

In [None]:
text4.index('awaken')

In [None]:
text5[16715:16735]

In [None]:
text6[1600:1625]

In [None]:
sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']

In [None]:
sent[0]

In [None]:
sent[9]

In [None]:
# index 10 is past the end of the 10-item list, so this raises an IndexError
sent[10]

In [None]:
sent[5:8]

In [None]:
sent[5]

In [None]:
sent[6]

In [None]:
sent[7]

In [None]:
sent[:3]

In [None]:
text2[141525:]

In [None]:
sent[0] = 'First'

In [None]:
sent[9] = 'Last'

In [None]:
len(sent)

In [None]:
sent[1:9] = ['Second', 'Third'] 

In [None]:
sent

In [None]:
# the list now has only 4 items, so this raises an IndexError
sent[9]

**Your Turn:** Take a few minutes to define a sentence of your own and modify individual words and groups of words (slices) using the same methods used earlier. Check your understanding by trying the exercises on lists at the end of this chapter.
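Slice assignment can change the length of a list, which is why an index that was valid before may fail afterwards. A minimal sketch using a separate list, `toy_sent`, so that the `sent` variable above is left alone:

```python
toy_sent = ['word' + str(n) for n in range(1, 11)]
length_before = len(toy_sent)

# Replace the eight items at positions 1..8 with two new items.
toy_sent[1:9] = ['Second', 'Third']
length_after = len(toy_sent)

# Position 9 no longer exists, so indexing it raises IndexError.
try:
    toy_sent[9]
except IndexError as err:
    message = str(err)

print(length_before, length_after, toy_sent, message)
```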

### 2.3 - Variables

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']

In [None]:
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
           'forth', 'from', 'Camelot', '.']

In [None]:
noun_phrase = my_sent[1:4]

In [None]:
noun_phrase

In [None]:
wOrDs = sorted(noun_phrase)

In [None]:
wOrDs

In [None]:
# `not` is a reserved word, so this assignment raises a SyntaxError
not = 'Camelot'

In [None]:
vocab = set(text1)

In [None]:
vocab_size = len(vocab)

In [None]:
vocab_size

### 2.4 - Strings

In [None]:
name = 'Monty'

In [None]:
name[0]

In [None]:
name[:4]

In [None]:
name * 2

In [None]:
name + '!'

In [None]:
' '.join(['Monty', 'Python'])

In [None]:
'Monty Python'.split()

## 3 - Computing with Language: Simple Statistics

In [None]:
saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

In [None]:
tokens = set(saying)

In [None]:
tokens = sorted(tokens)

In [None]:
tokens[-2:]

### 3.1 - Frequency Distributions

In [None]:
fdist1 = FreqDist(text1)

In [None]:
print(fdist1)

In [None]:
fdist1.most_common(50)

In [None]:
fdist1['whale']

In [None]:
fdist1.plot(50, cumulative=True)

**Your Turn:** Try the preceding frequency distribution example for yourself, for text2.

### 3.2 - Fine-grained Selection of Words

The technique used to create the `long_words` list is called a **list comprehension**. To learn more about how list comprehensions work, a tutorial is linked below. It will be helpful to master list comprehensions for future chapters.

* [Tutorial on List Comprehensions](https://www.programiz.com/python-programming/list-comprehension)
    * **Note:** this tutorial requires an understanding of looping and conditionals (*Section 4.2*), so it is best to finish reading this chapter before using this resource
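A list comprehension is a compact way of writing a loop that builds a list. The sketch below shows both forms side by side on a small made-up vocabulary. `toy_vocab` is an assumption for illustration, not data from the book.

```python
toy_vocab = {'cat', 'uncharacteristically', 'dog',
             'responsibilities', 'internationalization'}

# Explicit loop: collect every word longer than 15 characters.
long_words_loop = []
for w in toy_vocab:
    if len(w) > 15:
        long_words_loop.append(w)

# Equivalent list comprehension.
long_words_comp = [w for w in toy_vocab if len(w) > 15]

print(sorted(long_words_comp))
```

Both versions select the same words. Sets are unordered, so the results are sorted before printing or comparing them.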

In [None]:
V = set(text1)

In [None]:
long_words = [w for w in V if len(w) > 15]

In [None]:
sorted(long_words)

**Your Turn:** Try out the previous statements in the Python interpreter, and experiment with changing the text and changing the length condition. Does it make a difference to your results if you change the variable names, e.g., using `[word for word in vocab if ...]`?

___

In [None]:
fdist5 = FreqDist(text5)

In [None]:
sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)

### 3.3 - Collocations and Bigrams

In [None]:
list(bigrams(['more', 'is', 'said', 'than', 'done']))

* **note:** the `.collocations()` method [is known to be buggy](https://stackoverflow.com/questions/59266387/problem-with-nltk-collocations-too-many-values-to-unpack-expected-2), so the `collocation_list()` method will be used as an alternative

In [None]:
text4.collocation_list()

In [None]:
text8.collocation_list()
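Bigrams are pairs of adjacent words, and collocations are, roughly, bigrams that occur unusually often. The sketch below builds word pairs with `zip` and counts them with the standard library. `adjacent_pairs` is a hypothetical helper, and plain counting is a simplification that does not reproduce how NLTK scores collocations.

```python
from collections import Counter

def adjacent_pairs(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

print(adjacent_pairs(['more', 'is', 'said', 'than', 'done']))

# Toy token list (an assumption for illustration).
toy_tokens = ['red', 'wine', 'and', 'red', 'wine', 'and', 'white', 'wine']
pair_counts = Counter(adjacent_pairs(toy_tokens))
print(pair_counts.most_common(2))
```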

### 3.4 - Counting Other Things

In [None]:
[len(w) for w in text1]

In [None]:
fdist = FreqDist(len(w) for w in text1)

In [None]:
print(fdist)

In [None]:
fdist

In [None]:
fdist.most_common()

In [None]:
fdist.max()

In [None]:
fdist[3]

In [None]:
fdist.freq(3)
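The same kind of count can be tried without NLTK data by using `collections.Counter` from the standard library, which offers a similar counting interface. This is a sketch on a toy token list. The relative frequency is computed by hand here, since `Counter` has no `freq()` method.

```python
from collections import Counter

# Toy token list (an assumption for illustration).
toy_tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

# Count how many tokens there are of each length.
length_counts = Counter(len(w) for w in toy_tokens)

# The most frequent length and how often it occurs.
most_common_length, count = length_counts.most_common(1)[0]

# Share of tokens that are 4 characters long.
relative_freq = length_counts[4] / sum(length_counts.values())

print(most_common_length, count, relative_freq)
```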

## 4 - Back to Python: Making Decisions and Taking Control

### 4.1 - Conditionals

In [None]:
sent7

In [None]:
[w for w in sent7 if len(w) < 4]

In [None]:
[w for w in sent7 if len(w) <= 4]

In [None]:
[w for w in sent7 if len(w) == 4]

In [None]:
[w for w in sent7 if len(w) != 4]

In [None]:
sorted(w for w in set(text1) if w.endswith('ableness'))

In [None]:
sorted(term for term in set(text4) if 'gnt' in term)

In [None]:
sorted(item for item in set(text6) if item.istitle())

In [None]:
sorted(item for item in set(sent7) if item.isdigit())

**Your Turn**: Run the following examples and try to explain what is going on in each one. Next, try to make up some conditions of your own.

In [None]:
sorted(w for w in set(text7) if '-' in w and 'index' in w)

In [None]:
sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10)

In [None]:
sorted(w for w in set(sent7) if not w.islower())

In [None]:
sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)

### 4.2 - Operating on Every Element

In [None]:
[len(w) for w in text1]

In [None]:
[w.upper() for w in text1]

In [None]:
len(text1)

In [None]:
len(set(text1))

In [None]:
len(set(word.lower() for word in text1))

In [None]:
len(set(word.lower() for word in text1 if word.isalpha()))

### 4.3 - Nested Code Blocks

In [None]:
word = 'cat'

if len(word) < 5:
    print('word length is less than 5')
    
if len(word) >= 5:
    print('word length is greater than or equal to 5')

In [None]:
for word in ['Call', 'me', 'Ishmael', '.']:
    print(word)

### 4.4 - Looping with Conditions

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']
for xyzzy in sent1:
    if xyzzy.endswith('l'):
        print(xyzzy)

In [None]:
for token in sent1:
    if token.islower():
        print(token, 'is a lowercase word')
    elif token.istitle():
        print(token, 'is a titlecase word')
    else:
        print(token, 'is punctuation')

In [None]:
tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)

for word in tricky:
    print(word, end=' ')

## 5 - Automatic Natural Language Understanding

### 5.5 - Spoken Dialog Systems

The `nltk.chat.chatbots()` function unfortunately does not work well in Jupyter Notebooks because of an issue with text input in its menu system.

However, you can run each chatbot individually with the `demo()` method. An example of running the *eliza* chatbot is shown below. See the link below for more chatbots you can try out! Simply type "quit" into the text input field when you are finished with your conversation.

* [List of NLTK chatbots](https://www.nltk.org/api/nltk.chat.html)

In [None]:
nltk.chat.eliza.demo()

## Your Turn Solutions

### 1.4

**Your Turn:** How many times does the word lol appear in `text5`? How much is this as a percentage of the total number of words in this text?

In [None]:
text5.count("lol")

In [None]:
100 * text5.count("lol") / len(text5)

### 3.1

**Your Turn:** Try the preceding frequency distribution example for yourself, for text2.

In [None]:
fdist2 = FreqDist(text2)

In [None]:
print(fdist2)

* **note:** for this solution, I used matplotlib library functions to change the size of the graph
    * learn more about matplotlib here: [Intro to pyplot Tutorial](https://matplotlib.org/3.3.1/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6)) 

fdist2.plot(50, cumulative=True)

### 3.2

**Your Turn:** Try out the previous statements in the Python interpreter, and experiment with changing the text and changing the length condition. Does it make a difference to your results if you change the variable names, e.g., using `[word for word in vocab if ...]`?

In [None]:
V = set(text1)

In [None]:
# original
long_words = [w for w in V if len(w) > 15]

* `w` can be given any valid variable name, as long as the same name is used throughout the comprehension

In [None]:
# this is ok
long_words = [word for word in V if len(word) > 15]

* remember that the variable `vocab` was previously declared as `set(text1)`, so the cell below should still run as intended

In [None]:
# this is ok (if vocab was previously declared)
long_words = [word for word in vocab if len(word) > 15]

### 4.1

**Your Turn**: Run the following examples and try to explain what is going on in each one. Next, try to make up some conditions of your own.



In [None]:
sorted(w for w in set(text7) if '-' in w and 'index' in w)

* `w for w in set(text7)` each unique word in `text7` is looked at
* `if '-' in w and 'index' in w` the word must contain the character '-' **and** the substring 'index'
* `sorted(` sort in alphabetical order

In [None]:
sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10)

* `wd for wd in set(text3)` each unique word in `text3` is looked at
* `if wd.istitle()` the word must start with a capital letter **and** the rest of the word must have lower case characters
* `and len(wd) > 10` the word must be longer than 10 characters
* `sorted(` sort in alphabetical order

This query lists, in alphabetical order, the title-case words (mostly names) in `text3` that are longer than 10 characters.

In [None]:
sorted(w for w in set(sent7) if not w.islower())

* `w for w in set(sent7)` each unique word in `sent7` is looked at
* `if not w.islower()` drops words written entirely in lower case, keeping words that contain a capital letter as well as tokens with no letters at all (numbers, punctuation)
* `sorted(` sorted in alphanumeric order
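The effect of `not w.islower()` on numbers and punctuation is easy to miss, so here is a small sketch on made-up tokens. `samples` is an assumption for illustration and is not the content of `sent7`.

```python
samples = ['the', 'The', 'THE', '1987', ',', 'e-mail']

# islower() is True only when a token has at least one letter
# and all of its letters are lower case.
kept = sorted(w for w in set(samples) if not w.islower())

print(kept)
```

Only 'the' and 'e-mail' are dropped. The number and the comma contain no letters, so `islower()` is False for them and they are kept.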

In [None]:
sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)

* `t for t in set(text2)` each unique word in `text2` is looked at
* `if 'cie' in t or 'cei' in t` the word must contain the substring 'cie' **or** 'cei'
* `sorted(` sorted in alphanumeric order