[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/Notebooks/Lecture_05_2023_02_09.ipynb)

# Lecture 04: Statistics and Information Theory

In the previous lecture, we examined some general statistical features of words and tokens. We observed that the frequency of terms in a document follows a power-law distribution, and that the most frequent words are often hardly germane to the ideas of the text. What is more, we rarely associate an idea with a single word; we build noun phrases or prepositional phrases to communicate our ideas. For example, the phrase "bacon and eggs" might, in a given context, refer simply to the foods the words name. In a different context, however, the same noun phrase could mean the event of breakfast or the items one would prefer to eat at a morning meal. Thus, we need a strategy to assess the co-occurrence of n terms and whether that co-occurrence is significant.

## Let's look at LOTR

Q: What do you think is the most frequently occurring group of two words or tokens in the Lord of the Rings? Do you think the answer provides insight into the theme(s) or topic(s) of the story? What do we expect?

In [2]:
# Let's import some packages and configure our notebook
import os
from pathlib import Path
import matplotlib.pyplot as plt

N.B.: If you are working in Colab, comment out the cell below and run the Colab cell that follows it.

In [3]:
# We want the Lord of the Rings text
files = Path('./data/').glob('*.txt')

In [4]:
# If you are running this notebook in Google colab, uncomment this line of code and run
# from google.colab import drive
# drive.mount('/content/gdrive/', force_remount=True)
# files = Path('gdrive/MyDrive/DATA_340_3_NLP/Datasets').glob('*.txt')

In [5]:
# We want the text of _The Fellowship of the Ring_, so let's extract it from our files
fellowship = ""

for f in files:
    # Parse the file name using the os package
    base_name = os.path.basename(f)
    f_name, _ = os.path.splitext(base_name)
    
    # We are only concerned with the Fellowship
    if f_name != '01_LOTR_Fellowship':
        continue
    with open(f, 'r', encoding="utf-8") as FIN:
        fellowship = FIN.read()

## Let's Tokenize, Lemmatize, and Remove Stopwords

In [6]:
# We can use NLTK to tokenize and stem our text
# (the Porter stemmer strips suffixes; it is not a true lemmatizer)
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import string

# Create an instance of the stemmer
stemmer = PorterStemmer()

# Extend the stopword list with punctuation marks and digits
punct = list(string.punctuation) + list(string.digits)
stop_words = stopwords.words('english') + punct

In [7]:
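# Iterate over the text to extract our stems (called lemmas here)
def tokenize_lemmatize_text(text):
    # Build the list inside the function so repeated calls do not accumulate tokens
    lemmas = []
    tokens = word_tokenize(text)
    for token in tokens:
        # Note: this check is case-sensitive, so capitalized stopwords such as
        # 'In' or 'I' are kept; the stemmer then lowercases them
        if token in stop_words:
            continue
        lemmas.append(stemmer.stem(token))
    return lemmas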

In [8]:
# Pass our text to the above function so we can then create a bigram dictionary
fellowship_token_lemmas = tokenize_lemmatize_text(fellowship)

In [9]:
fellowship_token_lemmas

['three',
 'ring',
 'elven-k',
 'sky',
 'seven',
 'dwarf-lord',
 'hall',
 'stone',
 'nine',
 'mortal',
 'men',
 'doom',
 'die',
 'one',
 'dark',
 'lord',
 'dark',
 'throne',
 'in',
 'land',
 'mordor',
 'shadow',
 'lie',
 'one',
 'ring',
 'rule',
 'one',
 'ring',
 'find',
 'one',
 'ring',
 'bring',
 'dark',
 'bind',
 'in',
 'land',
 'mordor',
 'shadow',
 'lie',
 'foreword',
 'thi',
 'tale',
 'grew',
 'tell',
 'becam',
 'histori',
 'great',
 'war',
 'ring',
 'includ',
 'mani',
 'glimps',
 'yet',
 'ancient',
 'histori',
 'preced',
 'it',
 'begun',
 'soon',
 '_the',
 'hobbit_',
 'written',
 'public',
 '1937',
 'i',
 'go',
 'sequel',
 'i',
 'wish',
 'first',
 'complet',
 'set',
 'order',
 'mytholog',
 'legend',
 'elder',
 'day',
 'take',
 'shape',
 'year',
 'i',
 'desir',
 'satisfact',
 'i',
 'littl',
 'hope',
 'peopl',
 'would',
 'interest',
 'work',
 'especi',
 'sinc',
 'primarili',
 'linguist',
 'inspir',
 'begun',
 'order',
 'provid',
 'necessari',
 'background',
 "'histori",
 'elvish',
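
The list above still contains tokens such as `'in'`, `'i'`, and `'it'`, even though they are English stopwords. A likely explanation is that the stopword test compares tokens exactly as written while NLTK's English stopword list is lowercase, so "In" or "I" passes the filter and is only lowercased afterwards by the stemmer. The sketch below illustrates the effect with the standard library only. The miniature `STOP_WORDS` set and both function names are hypothetical stand-ins, and `str.lower()` stands in for the stemmer's lowercasing.

```python
import string

# Hypothetical miniature stopword set (assumption: all lowercase, like NLTK's English list)
STOP_WORDS = {"in", "the", "of", "i", "it", "this"} | set(string.punctuation)

def filter_case_sensitive(tokens):
    """Compare tokens exactly as written, then lowercase the survivors."""
    return [t.lower() for t in tokens if t not in STOP_WORDS]

def filter_case_insensitive(tokens):
    """Lowercase each token before the stopword test."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["In", "the", "Land", "of", "Mordor", "where", "the", "Shadows", "lie", "."]

print(filter_case_sensitive(tokens))    # 'in' survives because "In" != "in"
print(filter_case_insensitive(tokens))  # 'in' is removed
```

Whether to lowercase before filtering is a design choice: it removes sentence-initial stopwords, but it also changes the bigram counts reported later in this notebook.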

In [10]:
# Let's build a bi-token dictionary
bigram_freqs = {}

# List comprehension to create a list of bigrams
bigrams = [(fellowship_token_lemmas[i], fellowship_token_lemmas[i + 1]) for i in range(len(fellowship_token_lemmas) - 1)]

# Bigrams repeat, so we count how often each one occurs
for bigram in bigrams:
    bigram_freqs[bigram] = bigram_freqs.get(bigram, 0) + 1
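
The counting loop above can also be sketched with `collections.Counter` and `zip`, which pairs each token with its successor. This is a minimal alternative under the same assumptions, not a replacement for the cell above; the helper name `count_bigrams` and the toy token list (taken from the opening lines of the output shown earlier) are illustrative.

```python
from collections import Counter

def count_bigrams(tokens):
    """Count adjacent token pairs; zip pairs each token with the one after it."""
    return Counter(zip(tokens, tokens[1:]))

# Toy input drawn from the first lines of the stemmed output
tokens = ["one", "ring", "rule", "one", "ring", "find", "one", "ring", "bring"]
bigram_counts = count_bigrams(tokens)

# The single most frequent bigram and its count
print(bigram_counts.most_common(1))
```

A list of n tokens always yields n - 1 bigrams, which is a quick sanity check on the counts.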
                      

In [11]:
# Sort the bigrams from most to least frequent (sorted already returns a list)
bigrams_sorted = sorted(bigram_freqs.items(), key=lambda kv: -kv[1])

In [12]:
# Let's create a dataframe of the bigrams using pandas
import pandas as pd

# to create the dataframe we need to use pd.DataFrame and pass it our data and give it some column names
df = pd.DataFrame(bigrams_sorted, columns=['bigram', 'freq'])

# Let's expand the bigrams to their own columns and keep the index so we can retain the frequencies
df[['first_term', 'second_term']] = pd.DataFrame(df['bigram'].tolist(), index=df.index)

# And drop the bigram column since we now have the lemmas in their own columns
df = df.drop(columns=['bigram'])

In [13]:
df.query("first_term == 'frodo'")

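| | freq | first_term | second_term |
|---|---|---|---|
| 8 | 66 | frodo | 's |
| 9 | 65 | frodo | i |
| 53 | 33 | frodo | look |
| 68 | 26 | frodo | felt |
| 86 | 24 | frodo | said |
| ... | ... | ... | ... |
| 76393 | 1 | frodo | n't |
| 76411 | 1 | frodo | actual |
| 76439 | 1 | frodo | laid |
| 76466 | 1 | frodo | we |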


In [14]:
df.query("second_term == 'frodo'")

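| | freq | first_term | second_term |
|---|---|---|---|
| 0 | 220 | said | frodo |
| 39 | 39 | ask | frodo |
| 42 | 36 | mr. | frodo |
| 80 | 25 | cri | frodo |
| 117 | 19 | look | frodo |
| ... | ... | ... | ... |
| 75817 | 1 | choos | frodo |
| 75832 | 1 | help | frodo |
| 76129 | 1 | _frodo | frodo |
| 76354 | 1 | water-rat | frodo |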
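
The two queries above ask which terms follow, and which precede, a given term. The same lookup can be sketched directly on a bigram frequency dictionary without pandas. The helper name `neighbors` and its `position` parameter are hypothetical, and the small dictionary reuses a few counts from the tables above.

```python
def neighbors(bigram_freqs, term, position=0):
    """Return (neighbor, freq) pairs for bigrams that have `term` at `position`.

    position=0 finds the terms that follow `term`;
    position=1 finds the terms that precede it.
    Results are sorted from most to least frequent.
    """
    other = 1 - position
    pairs = [(bigram[other], freq)
             for bigram, freq in bigram_freqs.items()
             if bigram[position] == term]
    return sorted(pairs, key=lambda kv: -kv[1])

# A few counts copied from the query results above
sample_freqs = {
    ("said", "frodo"): 220,
    ("frodo", "'s"): 66,
    ("frodo", "i"): 65,
    ("ask", "frodo"): 39,
    ("frodo", "look"): 33,
}

print(neighbors(sample_freqs, "frodo", position=0))  # terms after 'frodo'
print(neighbors(sample_freqs, "frodo", position=1))  # terms before 'frodo'
```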


## Shannon's Entropy

Much of what we do with written communication is comparison. We as humans come to understand information and ideas through comparisons, and the same is true for Natural Language Processing. An important quantity for comparison and for discerning similarity between things is Shannon's entropy, which measures the average uncertainty in a random variable $X$.

$$H(X) := -\sum_{x \in X} p(x) \log p(x)$$
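
As a minimal sketch, the definition can be computed from raw frequency counts. The formula above leaves the base of the logarithm open; the code assumes base 2, so the result is in bits. The function name `shannon_entropy` and the toy counts are illustrative.

```python
import math

def shannon_entropy(counts):
    """Entropy in bits of the distribution implied by a {outcome: count} mapping.

    Each probability is estimated as count / total. Zero counts are skipped,
    following the convention that 0 * log(0) contributes nothing.
    """
    total = sum(counts.values())
    probs = [n / total for n in counts.values() if n > 0]
    return -sum(p * math.log2(p) for p in probs)

fair_coin = {"heads": 50, "tails": 50}
four_way = {"a": 1, "b": 1, "c": 1, "d": 1}
skewed = {"a": 2, "b": 1, "c": 1}

print(shannon_entropy(fair_coin))  # 1 bit: two equally likely outcomes
print(shannon_entropy(four_way))   # 2 bits: four equally likely outcomes
print(shannon_entropy(skewed))     # 1.5 bits: less uncertain than uniform over three
```

A uniform distribution maximizes entropy for a given number of outcomes, and any skew toward one outcome lowers it.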

We can make this more intuitive by rewriting it to describe surprise:

$$H(X) = \sum_{x \in X} p(x) \log \frac{1}{p(x)}$$

Read this way, entropy is the expected surprise. The surprise of a single outcome, $\log \frac{1}{p(x)}$, grows as its probability shrinks, so probability and surprise are inversely related.
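
The inverse relationship can be checked numerically. The sketch below assumes base-2 logarithms and takes the surprise of an outcome with probability $p$ to be $\log_2 \frac{1}{p}$; the function names `surprisal` and `expected_surprisal` are illustrative.

```python
import math

def surprisal(p):
    """Surprise, in bits, of an outcome that occurs with probability p."""
    return math.log2(1.0 / p)

def expected_surprisal(probs):
    """Probability-weighted average surprise, which equals the entropy."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# Halving the probability adds one bit of surprise
for p in (0.5, 0.25, 0.125):
    print(p, surprisal(p))

# Averaging the surprise over a whole distribution gives its entropy
print(expected_surprisal([0.5, 0.25, 0.25]))
```

A certain outcome, with probability 1, carries no surprise at all, while rare outcomes carry a lot of surprise but contribute little to the average because they are weighted by their small probability.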

## Resources

* Expected Values, Main Ideas!!! Directed by StatQuest with Josh Starmer, 2021. YouTube, https://www.youtube.com/watch?v=KLs_7b7SKi4.
* Entropy (for Data Science) Clearly Explained!!! Directed by StatQuest with Josh Starmer, 2021. YouTube, https://www.youtube.com/watch?v=YtebGVx-Fxw.
* Jurafsky and Martin, Chapter 3: [N-Gram Language Models](../course_readings/Jurafsky_Martin_chapter_3_39-65.pdf)
