# Week 2 - Language Processing and Python (NN)

**Reference:** https://www.nltk.org/book/


Address these questions:

1. What can we achieve by combining simple programming techniques with large quantities of text?

2. How can we automatically extract key words and phrases that sum up the style and content of a text?

3. What tools and techniques does the Python programming language provide for such work?

4. What are some of the interesting challenges of natural language processing?

# **1. Computing with Language: Texts and Words**

## **1.2 Getting Started with NLTK**

In [None]:
# !pip install nltk

In [None]:
import nltk

nltk.download()

In [None]:
from nltk.book import *

In [None]:
text1

In [None]:
text2

## **1.3 Searching Text**

A "concordance" shows us every occurrence of a given word, together with some context.

A concordance permits us to see words in context.

In [None]:
# Show every occurrence of the word together with its surrounding context

text1.concordance("monstrous")

In [None]:
text2.concordance("Affection")

In [None]:
text9.concordance("Affection")
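
To make the idea of a concordance concrete, here is a minimal pure-Python sketch that does not depend on NLTK. The function name `simple_concordance`, the `width` parameter, and the toy sentence are assumptions made for illustration; NLTK's own implementation is more elaborate (it aligns the output by characters, for example). Matching ignores case in this sketch.

```python
def simple_concordance(tokens, target, width=3):
    """Return each occurrence of target with up to `width` tokens of context per side."""
    target = target.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Hypothetical toy text, tokenized by whitespace
toy = "the monstrous whale swam past a most monstrous size of ship".split()
for line in simple_concordance(toy, "Monstrous"):
    print(line)
```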

What other words appear in a similar range of contexts?

In [None]:
# similar() lists other words that appear in contexts similar to the given word

text1.similar("monstrous")

In [None]:
# Different results for different texts
text2.similar("monstrous")

In [None]:
# common_contexts lets us examine the contexts shared by two or more words
text2.common_contexts(["monstrous", "very"])

In [None]:
text6.common_contexts(['affection', 'simple'])

In [None]:
text7.common_contexts(['affection', 'change'])
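
One way to picture a shared context is as a (previous word, next word) pair that surrounds both words somewhere in the text. The sketch below assumes that simplified definition; the helper name `contexts` and the toy sentence are hypothetical, and NLTK's own method may normalize and rank contexts differently.

```python
def contexts(tokens, word):
    """Set of (previous, next) token pairs surrounding each occurrence of word."""
    return {
        (tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == word
    }

# Hypothetical toy text, tokenized by whitespace
toy = ("a very pretty day and a monstrous pretty day "
       "is very good and monstrous good").split()

# Contexts in which both words occur
shared = contexts(toy, "very") & contexts(toy, "monstrous")
print(sorted(shared))
```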

"dispersion_plot" to see
```
  1. Automatically detect if the word occurs in the text.
  2. Display some words that appear in the same context.
  3. Determine the location of a word in the text.
  4. How many words from the beginning it appears.
```

Each stripe represents an instance of a word, and each row represents the entire text.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(12, 6))

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America", "liberty", "constitution"])
print("\nFigure. Lexical Dispersion Plot for Words in U.S. Presidential Inaugural Addresses: This can be used to investigate changes in language use over time")

"generate()"

Generating a random text.

``` The generate() method is not available in NLTK 3.0 but will be reinstated in a subsequent version. ```

In [None]:
text3.generate()

## **1.4 Counting Vocabulary**

- The length of a text from start to finish is measured in the words and punctuation symbols that appear (**tokens**)

- A token is the technical name for a sequence of characters, such as hairy, his, or ..., that we want to treat as a group

- When we count the number of tokens in a text, say, the phrase "**to be or not to be**", we are counting occurrences of these sequences. Thus, in our example phrase there are two occurrences of **to**, two of **be**, and one each of **or** and **not**. But there are only four distinct vocabulary items in this phrase
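
The counts in the example phrase can be checked directly. This is a small sketch that splits the phrase on whitespace, which is an assumption that happens to be good enough for this phrase but is not a general tokenizer.

```python
phrase = "to be or not to be".split()

num_tokens = len(phrase)          # every occurrence counts
vocabulary = sorted(set(phrase))  # distinct items only
num_types = len(vocabulary)

print(num_tokens, num_types, vocabulary)
print(num_types / num_tokens)     # share of distinct items among all tokens
```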


In [None]:
len(text3)

- How many distinct words does the book of Genesis contain?

In [None]:
sorted(set(text3))

In [None]:
len(set(text3))

- Now, let's calculate a measure of the lexical richness of the text. The next example shows us that the number of distinct words is just 6% of the total number of words, or equivalently that each word is used 16 times on average (16 * 2789 = 44624 ~ 44764)

In [None]:
len(set(text3)) / len(text3)

- Let's consider a single word.
  1. Count how often a word occurs in a text.
  2. Compute what percentage of the text is taken up by a specific word.

In [None]:
text3.count("smote")

In [None]:
100 * text4.count('a') / len(text4)

In [None]:
print(text5.count("lol"))
print(100 * text5.count('lol') / len(text5))

 - How can we avoid retyping the same expressions? Define functions:

In [None]:
def lexical_diversity(text):
    return len(set(text)) / len(text)

def percentage(count, total):
    return 100 * count / total

In [None]:
lexical_diversity(text3)

In [None]:
lexical_diversity(text5)

In [None]:
percentage(text4.count('a'), len(text4))

# **2. A Closer Look at Python: Texts as Lists of Words**

## **2.1 Lists**

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']

In [None]:
len(sent1)

In [None]:
lexical_diversity(sent1)

In [None]:
sent2

In [None]:
sent3

- How do we merge two lists in Python?

In [None]:
['Monty', 'Python']

In [None]:
['and', 'the', 'Holy', 'Grail']

In [None]:
['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']

In [None]:
sent4 + sent1

- What if we want to add a single item to the list?

In [None]:
sent1.append("some")
sent1

## **2.2 Indexing Lists**

- How to pick out the 1st, 173rd, or even 14,278th word in a printed text?

We can access any word in the text by its **index**:

In [None]:
text4[173]

- What about in reverse? (Giving the word and returning the index)

In [None]:
text4.index('awaken')

### Task 1:

In [None]:
# What if we want to find all indexes of a target word?

def find_indices(text, word):
    # Collect the position of every token that exactly matches word (case-sensitive)
    indices = []
    for index, w in enumerate(text):
        if w == word:
            indices.append(index)
    return indices

In [None]:
print(find_indices(text4, 'Smote'))

In [None]:
print(find_indices(text4, 'This'))

Indexes are a common way to access the words of a text or, more generally, the elements of any list.
- How do we access a range of elements in a Python list (start index to end index)?

In [None]:
text5[16715:16735]

In [None]:
text6[1600:1625]

In [None]:
sent = ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10']

In [None]:
sent[0]

In [None]:
sent[9]

- Python lists are zero-based: a list of 10 items has indexes 0 to 9, so the next cell raises an IndexError

In [None]:
sent[10]

- Let's take a look at slicing in our sentence. 5:8 indicates sent elements at indexes 5, 6, 7

In [None]:
sent[5:8]

In [None]:
sent[5]

In [None]:
sent[6]

In [None]:
sent[7]

- By convention, m:n means elements m, …, n-1

  1. If the slice starts at the beginning of the list, we can omit the first number

  2. We can omit the second number if we slice to the end

In [None]:
sent[:3]

In [None]:
text2[141525:]

- Let's modify some elements of the list by assigning to one of its index values.

In [None]:
sent[0] = 'First'

In [None]:
sent[9] = 'Last'

In [None]:
sent

In [None]:
len(sent)

In [None]:
sent[1:9] = ['Second', 'Third']
sent

In [None]:
sent[9]
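
The last two cells are worth tracing step by step: assigning two items to the eight-item slice `1:9` shrinks the list from ten items to four, so index 9 no longer exists. The sketch below rebuilds the same list from scratch and catches the resulting error instead of letting it stop the cell; the variable name `error_message` is only for illustration.

```python
# Rebuild the ten-word list used above
sent = ['word%d' % n for n in range(1, 11)]

sent[0] = 'First'
sent[9] = 'Last'
sent[1:9] = ['Second', 'Third']   # eight items replaced by two
print(sent, len(sent))

try:
    sent[9]
except IndexError as err:
    error_message = str(err)
    print('IndexError:', error_message)
```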

## **2.3 Variables**

- *variable = expression*

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']

In [None]:
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
           'forth', 'from', 'Camelot', '.']

In [None]:
noun_phrase = my_sent[1:4]

In [None]:
noun_phrase

In [None]:
wOrDs = sorted(noun_phrase)

In [None]:
wOrDs

- Don't use reserved words as variable names; the assignment below raises a SyntaxError

In [None]:
not = 'Camelot'


Caution!

  - Take care with your choice of names (or identifiers) for Python variables. First, you should start the name with a letter, optionally followed by digits (0 to 9) or letters. Thus, abc23 is fine, but 23abc will cause a syntax error. Names are case-sensitive, which means that myVar and myvar are distinct variables. Variable names cannot contain whitespace, but you can separate words using an underscore, e.g., my_var. Be careful not to insert a hyphen instead of an underscore: my-var is wrong, since Python interprets the "-" as a minus sign.
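
These naming rules can be checked programmatically with the standard library. The helper name `is_valid_variable_name` is hypothetical; it combines `str.isidentifier()` with `keyword.iskeyword()`. Note that `isidentifier()` is slightly more permissive than the caution above, since it also accepts a leading underscore.

```python
import keyword

def is_valid_variable_name(name):
    """True if name is a legal identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ['abc23', '23abc', 'myVar', 'my_var', 'my-var', 'not']
results = {name: is_valid_variable_name(name) for name in candidates}
for name, ok in results.items():
    print(name, ok)
```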


## **2.4 Strings**

  1. We can assign a string to a variable
  2. Index a string
  3. Slice a string

In [None]:
name = 'Monty'

In [None]:
name[0]

In [None]:
name[:4]

- We can also multiply strings and add (concatenate) them

In [None]:
name * 2

In [None]:
name + '?'

- We can join the words of a list to make a single string, or split a string into a list, as follows:

In [None]:
' '.join(['Monty', 'Python'])

In [None]:
'Monty Python'.split()

# **3. Computing with Language: Simple Statistics**

  1. What makes a text distinct
  2. Use automatic methods to find characteristic words and expressions of a text

In [None]:
# Simple test

saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'said', 'than', 'done']

tokens = set(saying)
tokens = sorted(tokens)
tokens[-2:]

## **3.1 Frequency Distributions**

  - How can we automatically identify the words of a text that are most informative about the topic and genre of the text?

Imagine how you might go about finding the 50 most frequent words of a book.

Let's use a FreqDist to find the 50 most frequent words of *Moby Dick*:

In [None]:
fdist1 = FreqDist(text1)
print(fdist1)

In [None]:
fdist1.most_common(50)

- The only word in this list that is informative about the topic is whale (about 900 occurrences)
- The other words are English "plumbing"

In [None]:
fdist1['whale']
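
For a feel of what a frequency distribution stores, the standard library's `collections.Counter` can serve as a stand-in on a small list. This is a sketch on the toy sentence used earlier, not a replacement for `FreqDist`, which offers additional methods such as `plot()` and `hapaxes()`.

```python
from collections import Counter

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

counts = Counter(saying)        # maps each word to its number of occurrences
print(counts.most_common(3))    # the three most frequent words with their counts
print(counts['said'])           # look up a single word
print(counts['whale'])          # a word that never occurs counts as 0
```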

- Generate a cumulative frequency plot for these words
- These 50 words account for nearly half the book!

In [None]:
plt.figure(figsize=(12, 6))

fdist1.plot(50, cumulative=True)

- What if we want to find the words that occur only once? (**hapaxes**)

They are rare words. How can we guess the meaning of these words without seeing the context?

In [None]:
fdist1.hapaxes()
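
Finding hapaxes amounts to keeping the words whose count is exactly one. A minimal sketch with `collections.Counter` on a toy sentence, under the assumption that tokens are compared case-sensitively:

```python
from collections import Counter

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

counts = Counter(saying)
# Keep only the words that were seen exactly once
hapaxes = sorted(w for w, c in counts.items() if c == 1)
print(hapaxes)
```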

In [None]:
fdist2 = FreqDist(text2)
fdist2

In [None]:
fdist2.most_common(50)

- The most informative word for the topic is Elinor (about 680 occurrences)
- The second most informative word is Marianne (about 560 occurrences)
- The other words are English "plumbing"

In [None]:
fdist2['Elinor']

In [None]:
fdist2['Marianne']

- Generate a cumulative frequency plot for these words

In [None]:
plt.figure(figsize=(14, 8))

fdist2.plot(50, cumulative=True)

## **3.2 Fine-grained Selection of Words**

- How to find the long words in text?

Let's look at the words that are longer than 15 characters

In [None]:
V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)

- Find **frequently occurring long words**

It eliminates frequent short words (e.g., the) and infrequent long words (e.g. antiphilosophists).

Notice how we have used two conditions: len(w) > 7 ensures that the words are longer than seven letters, and fdist5[w] > 7 ensures that these words occur more than seven times.

In [None]:
fdist5 = FreqDist(text5)
sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)

## **3.3 Collocations and Bigrams**

- A "**collocation**" is a sequence of words that occur together unusually often (e.g., *red wine*).

- A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, maroon wine sounds definitely odd.

In [None]:
# Start off by extracting from a text a list of word pairs, also known as bigrams
# This is easily accomplished with the function bigrams().

list(bigrams(['more', 'is', 'said', 'than', 'done']))

Now, collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. The collocations() function does this for us.

In [None]:
text4.collocations()

In [None]:
text8.collocations()
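
The idea of "more often than we would expect" can be illustrated with pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if its two words were independent. PMI is swapped in here purely for illustration and is not necessarily the measure that `collocations()` uses; the function name `pmi_scores`, the `min_count` threshold, and the toy text are all assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score each adjacent word pair seen at least min_count times by its PMI."""
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))   # adjacent pairs, i.e. bigrams
    n_tokens = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (w1, w2), count in pairs.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Hypothetical toy text: "and the" is the most frequent pair,
# but "red wine" is the pair that is most surprising given its parts
toy = ("red wine and red wine and the cat and the dog "
       "and the bird and the red wine").split()

scores = pmi_scores(toy)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Even on this tiny example the most frequent pair is not the top-ranked one, which is the distinction between frequent bigrams and collocations drawn above.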

## **3.4 Counting Other Things**

- We can look at the distribution of word lengths in a text, by creating a FreqDist out of a long list of numbers, where each number is the length of the corresponding word in the text:

We start by deriving a list of the lengths of words in text1, and the FreqDist then counts the number of times each of these occurs.

There are at most only 20 distinct items being counted, the numbers 1 through 20, because there are only 20 different word lengths. (there are words consisting of just one character, two characters, ..., twenty characters, but none with twenty one or more characters)

In [None]:
print([len(w) for w in text1])

In [None]:
# Print maximum length

print(max([len(w) for w in text1]))

In [None]:
fdist = FreqDist(len(w) for w in text1)
print(f"{fdist}\n")
fdist

- How frequent are the different word lengths (e.g., how many words of length four appear in the text, are there more words of length five than length four, etc.)? We can find out as follows:

In [None]:
fdist.most_common()

In [None]:
fdist.max()

In [None]:
fdist[3]

In [None]:
# freq() returns the frequency of a given sample. The frequency of a
# sample is the count of that sample divided by the total number of
# sample outcomes recorded by this FreqDist. The count of a sample is
# the number of times that sample outcome was recorded by this FreqDist.
# Frequencies are always real numbers in the range [0, 1].

fdist.freq(3)
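
The relationship between a count and a frequency is easy to verify by hand on a short phrase. The sketch below uses `collections.Counter` as a stand-in for `FreqDist`; the helper name `freq` mirrors the method above but is defined here only for illustration.

```python
from collections import Counter

phrase = "to be or not to be".split()

# Count word lengths rather than the words themselves
length_counts = Counter(len(w) for w in phrase)
total = sum(length_counts.values())

def freq(sample):
    """Count of sample divided by the total number of recorded outcomes."""
    return length_counts[sample] / total

print(length_counts)
print(freq(2), freq(3))
```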

# **4. Back to Python: Making Decisions and Taking Control**

## **4.1 Conditionals**

Python Tutorial: https://www.w3schools.com/python/default.asp

Python Operators: https://www.w3schools.com/python/python_operators.asp

In [None]:
# If you get an error saying sent7 is undefined, first run the following code:
# from nltk.book import * 

print(sent7)

In [None]:
[w for w in sent7 if len(w) < 4]

In [None]:
[w for w in sent7 if len(w) <= 4]

In [None]:
[w for w in sent7 if len(w) == 4]

In [None]:
print([w for w in sent7 if len(w) != 4])

There is a common pattern to all of these examples: 

**[w for w in text if condition ]**

where condition is a Python "test" that yields either true or false. In the cases shown in the previous code example, the condition is always a numerical comparison. However, we can also test various properties of words, using the functions listed below:

1. **s.startswith(t)** ---- test if s starts with t
2. **s.endswith(t)**  ---- test if s ends with t
3. **t in s**         ---- test if t is a substring of s
4. **s.islower()**    ---- test if s contains cased characters and all are lowercase
5. **s.isupper()**    ---- test if s contains cased characters and all are uppercase
6. **s.isalpha()**    ---- test if s is non-empty and all characters in s are alphabetic
7. **s.isalnum()**    ---- test if s is non-empty and all characters in s are alphanumeric
8. **s.isdigit()**    ---- test if s is non-empty and all characters in s are digits
9. **s.istitle()**    ---- test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals)
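
A quick way to get a feel for these tests is to run them on a handful of strings. The sample strings below are arbitrary choices for illustration; the dictionary `passed` records, for each string, which of the no-argument tests it satisfies.

```python
samples = ['Ishmael', 'whale', 'USA', '1851', 'abc23', 'Moby Dick', '.']
tests = ['islower', 'isupper', 'isalpha', 'isalnum', 'isdigit', 'istitle']

# Map each sample string to the names of the tests it satisfies
passed = {s: [t for t in tests if getattr(s, t)()] for s in samples}
for s, names in passed.items():
    print(repr(s), names)

# The remaining three tests take a second string t
print('Ishmael'.startswith('Ish'))
print('unmentionableness'.endswith('ableness'))
print('gnt' in 'sovereignty')
```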

In [None]:
sorted(w for w in set(text1) if w.endswith('ableness'))

In [None]:
sorted(term for term in set(text4) if 'gnt' in term)

In [None]:
print(sorted(item for item in set(text6) if item.istitle()))

In [None]:
sorted(item for item in set(sent7) if item.isdigit())

We can also create more complex conditions. If c is a condition, then "not c" is also a condition. If we have two conditions c1 and c2, then we can combine them to form a new condition using conjunction and disjunction: "c1 and c2", "c1 or c2".

### Quick Test: What will be the outcome of these lines?

In [None]:
sorted(w for w in set(text7) if '-' in w and 'index' in w)
sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10)
sorted(w for w in set(sent7) if not w.islower())
sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)


## **4.2 Operating on Every Element**

In [None]:
print([len(w) for w in text1])

In [None]:
print([w.upper() for w in text1])

- Let's take a look at some examples

In [None]:
# Considers all words
len(text1)

In [None]:
# Considers the set of all words
len(set(text1))

In [None]:
# Considers the set of all words converted to lowercase,
# e.g. we do not double-count This and this
len(set(word.lower() for word in text1))

In [None]:
# eliminate numbers and punctuation from the vocabulary count by
# filtering out any non-alphabetic items:
len(set(word.lower() for word in text1 if word.isalpha()))
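
The effect of each of these four counts is easier to see on a list small enough to check by hand. The token list below is a made-up example, not drawn from any of the texts.

```python
# Hypothetical token list with case variants, a number, and punctuation
tokens = ['This', 'is', 'this', ',', 'and', 'THIS', 'is', '1851', '.']

n_tokens = len(tokens)                                   # all tokens
n_distinct = len(set(tokens))                            # distinct tokens
n_lowercased = len(set(t.lower() for t in tokens))       # case variants merged
n_words = len(set(t.lower() for t in tokens if t.isalpha()))  # alphabetic only

print(n_tokens, n_distinct, n_lowercased, n_words)
```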

## **4.3 Nested Code Blocks**

In [None]:
# If you are using Python 2.6 or 2.7, you need to include the following line
# in order for the above print function to be recognized:

# from __future__ import print_function

In [None]:
word = 'cat'
if len(word) < 5:
    print('word length is less than 5')

In [None]:
# If the condition is not true, the if block is skipped and
# execution continues with the first statement after it

if len(word) >= 5:
    print('word length is greater than or equal to 5')

print("The block of the code is not executed!")

In [None]:
for word in ['Call', 'me', 'Ishmael', '.']:
    print(word)

## **4.4 Looping with Conditions**

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']
for temp in sent1:
    if temp.endswith('l'):
        print(temp)

In [None]:
for token in sent1:
    if token.islower():
        print(token, 'is a lowercase word')
    elif token.istitle():
        print(token, 'is a titlecase word')
    else:
        print(token, 'is punctuation')

- Let's combine the idioms we've been exploring. First, we create a list of cie and cei words, then we loop over each item and print it. Notice the extra information given in the print statement: end=' '. This tells Python to print a space (not the default newline) after each word.

In [None]:
tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
for word in tricky:
    print(word, end=' ')