[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/Notebooks/Lecture_04_2023_02_07.ipynb)

# Lecture 04: Properties of Language, Statistics, Information Theory

# Properties of Language

## Our shared humanity?

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3049087/

| Article | PubMed Link |
| ---- | ---- |
| ![17_Languages](./images/17_languages.png) | ![17_Languages](./images/17_languages_qr.png) |

> We studied the frequencies of use in each of these languages for the 200 words that make up the Swadesh fundamental vocabulary word list. (p. 1102)

> Our results point to a surprising regularity in the way that human speakers use language. It might be that the way we use language and its structure means that some words inevitably will be used more than others.

* [Swadesh terms](https://en.wikipedia.org/wiki/Swadesh_list)
* [Lexicostatistics](https://en.wikipedia.org/wiki/Lexicostatistics) (comparative linguistics - lexical cognates between languages)


### Similarities of Swadesh term usage in different languages

<center><img src="./images/figure_3.png"  width="800" height="500"></center>

### Let's explore this in more detail

Let's compare the three volumes of _The Lord of the Rings_.

In [None]:
# Imports
from pathlib import Path
import os

N.B.: The cell below is set up for a local Jupyter Lab session. If you are running this notebook in Google Colab, uncomment the Colab lines and comment out the local `files` line instead.

In [None]:
# If you are running this notebook in Google Colab, uncomment the next three lines
# and comment out the local `files` line at the bottom of this cell
# from google.colab import drive
# drive.mount('/content/gdrive/', force_remount=True)
# files = Path('gdrive/MyDrive/DATA_340_3_NLP/Datasets').glob('*.txt')

# Local Jupyter: Path.glob returns a generator over every file that matches the pattern
files = Path('../datasets').glob('*.txt')

In [None]:
# Let's iterate over the generator and create a list of lists, each holding a short volume name and its text
corpus = []

# Iterate over the files
for f in files:
    print(f)
    # Let's grab the short name from the file name
    base_name = os.path.basename(f)
    f_name, _ = os.path.splitext(base_name)

    # Open the file and read its content (assumes the files are UTF-8 encoded)
    with open(f, 'r', encoding='utf-8') as file:
        text = file.read()

    # Append the short name and the text to the corpus list
    corpus.append([f_name, text])

In [None]:
# Let's look at our corpus
corpus[0][0], corpus[0][1][:100]

### Zipf's Law

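[George Kingsley Zipf](https://en.wikipedia.org/wiki/George_Kingsley_Zipf) observed that most words are rarely used, while a small number are used very often. The empirical law named after him states that the probability $P_n$ of the word with frequency rank $n$ is
$$P_n \sim \frac{1}{n^a}$$
where the exponent $a$ is close to 1 for natural-language text.

This is a power-law distribution: the frequency of any word is inversely proportional to its rank in the vocabulary.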
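
Taking logarithms shows why a power law is usually inspected on log-log axes. Writing the proportionality constant as $C$ (a symbol introduced here for illustration only), $P_n = C / n^a$ gives
$$\log P_n = \log C - a \log n$$
so log-frequency plotted against log-rank should fall roughly on a straight line with slope $-a$.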

Let's write a function that computes the frequency of each vocabulary item in a volume and plots frequency against rank.

In [None]:
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt


def zipf_analysis(text, book):
    """Plot rank against frequency on log-log axes; return (word, count) pairs sorted by count."""
    # Tokenize the text on whitespace; case and punctuation are left untouched
    words = text.split()

    # Count the frequency of each word; Counter does the same work as the loop below
    word_freq = Counter(words)

    # Vanilla Python equivalent
    # word_freq = {}
    # for word in words:
    #     word_freq[word] = word_freq.get(word, 0) + 1

    # Sort the words by frequency so the highest-occurring terms come first
    sorted_word_freq = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)

    # Rank 1 is the most frequent word
    word_rank = np.arange(1, len(sorted_word_freq) + 1)  # X variable
    word_frequency = [count for _, count in sorted_word_freq]  # Y variable

    # A power law shows up as a roughly straight line on log-log axes
    plt.loglog(word_rank, word_frequency, marker='o')
    plt.xlabel('Rank')
    plt.ylabel('Frequency')
    plt.title(f"Zipf's Law for {book}")
    plt.show()
    return sorted_word_freq
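
The plot gives a visual check only. As a rough numerical companion, the sketch below estimates the exponent $a$ by fitting a straight line to log-frequency against log-rank. The helper name `estimate_zipf_exponent` and the synthetic counts are assumptions made for illustration; they are not part of the lecture code.

```python
import numpy as np


def estimate_zipf_exponent(frequencies):
    """Fit log(frequency) = c - a * log(rank) by least squares and return a."""
    # Sort descending so that position i corresponds to rank i + 1
    freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    # polyfit returns coefficients from highest degree down: [slope, intercept]
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), deg=1)
    return -slope


# Synthetic counts that follow frequency = 1000 / rank exactly (hypothetical data)
ideal_counts = [1000 / r for r in range(1, 201)]
exponent = estimate_zipf_exponent(ideal_counts)
print(round(exponent, 3))
```

For a real volume, the counts could be taken from the function's return value, for example `[count for _, count in sorted_word_freq]`. A least-squares fit on log-log axes is only a crude estimate, because the long tail of rare words supplies most of the points.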

#### Fellowship of the Ring

In [None]:
fellowship_name = corpus[0][0]
fellowship_text = corpus[0][1]

fellowship_name, fellowship_text[:100]

In [None]:
fellowship = zipf_analysis(fellowship_text, 'The Fellowship of the Ring')

In [None]:
# Let's use the output to explore the words that occur only once (hapax legomena)
# We can use pandas to explore our data
import pandas as pd

# Convert the tuples to a dataframe
df = pd.DataFrame(fellowship, columns=['word', 'frequency'])

# Let's query the dataframe for words that occur only once
fellowship_hapax_legomenon = df.query('frequency == 1')
fellowship_hapax_legomenon
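
To put a number on how many words occur only once, the following sketch computes the share of the vocabulary that appears a single time. It uses the same whitespace tokenization as `zipf_analysis`. The function name `hapax_share` and the sample sentence are hypothetical, chosen for illustration.

```python
from collections import Counter


def hapax_share(text):
    """Return (number of hapax legomena, vocabulary size, hapax share of the vocabulary)."""
    counts = Counter(text.split())
    if not counts:
        return 0, 0, 0.0
    # A hapax legomenon is a word that occurs exactly once
    n_hapax = sum(1 for count in counts.values() if count == 1)
    return n_hapax, len(counts), n_hapax / len(counts)


# Toy sentence, not taken from the corpus
sample = "the ring the road the shire a ring goes ever on"
n_hapax, vocab_size, share = hapax_share(sample)
print(n_hapax, vocab_size, share)
```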

#### Two Towers

In [None]:
corpus[1][0], corpus[1][1][:100]

In [None]:
# Let's check the distribution for The Two Towers
two_towers = zipf_analysis(corpus[1][1], 'The Two Towers')

In [None]:
# Let's use Pandas again to look at some word frequencies
df = pd.DataFrame(two_towers, columns=['word', 'frequency'])

two_towers_hapax_legomenon = df.query('frequency == 1')
two_towers_hapax_legomenon

#### Return of the King

In [None]:
corpus[2][0], corpus[2][1][:100]

In [None]:
# Plot the distribution of terms
return_king = zipf_analysis(corpus[2][1], 'The Return of the King')

In [None]:
# Let's explore again some of the lower frequency terms
df = pd.DataFrame(return_king, columns=['word', 'frequency'])

return_king_hapax_legomenon = df.query('frequency == 1')
return_king_hapax_legomenon.shape

In [None]:
# Nietzsche
corpus[3][0], corpus[3][1][:100]

In [None]:
Nietzsche = zipf_analysis(corpus[3][1], 'Nietzsche')

In [None]:
# Let's explore again some of the lower frequency terms
df = pd.DataFrame(Nietzsche, columns=['word', 'frequency'])

Nietzsche_hapax_legomenon = df.query('frequency == 1')
Nietzsche_hapax_legomenon.head(100)

### Most used word in the USA

The pattern shown above for _The Lord of the Rings_ generalizes to English text more broadly and, as the Swadesh study discussed earlier suggests, to many languages for certain kinds of words.

<img src="./images/most_used_01.png"  width="800" height="400">

But ... <br>
<img src="./images/most_used_02.png" width="800" height="400">

N.B.: Manning and Schütze, _Foundations of Statistical Natural Language Processing_, show that randomly generated text also follows the power-law pattern, as discussed by Mandelbrot. They conclude their discussion by observing that:

> what makes frequency-based approaches to language hard is that almost all words are rare. Zipf's law is a good way to encapsulate this insight. (p. 29)
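
The point about randomly generated text can be tried out directly. The sketch below types characters uniformly at random from a tiny alphabet plus the space character, splits the result into "words", and returns their counts in rank order. The alphabet, text length, seed, and function name are all assumptions chosen for illustration; this is not the construction used in the book.

```python
import random
from collections import Counter


def random_text_frequencies(n_chars=200_000, alphabet="abcde", seed=0):
    """Generate random characters (space included) and count the resulting 'words'."""
    rng = random.Random(seed)
    symbols = alphabet + " "
    text = "".join(rng.choice(symbols) for _ in range(n_chars))
    counts = Counter(text.split())
    # most_common() sorts by count, so index i holds the frequency at rank i + 1
    return [count for _, count in counts.most_common()]


random_freqs = random_text_frequencies()
hapax_count = sum(1 for f in random_freqs if f == 1)
print(random_freqs[:5], len(random_freqs), hapax_count)
```

Passing `random_freqs` to `plt.loglog` against ranks `1..len(random_freqs)` gives a curve that can be compared with the plots for the novels.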

## Let's Tokenize, Stem, and Remove Stopwords

In [None]:
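# We can use NLTK to tokenize and stem our text
# (Porter stemming strips suffixes by rule; it approximates lemmatization
# but does not return dictionary forms)
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import string

# word_tokenize and stopwords rely on NLTK data; uncomment on first run
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
# nltk.download('punkt')
# nltk.download('stopwords')

# Create an instance of the stemmer
stemmer = PorterStemmer()

# Treat punctuation marks and single digits as stopwords too
punct = list(string.punctuation) + list(string.digits)
stop_words = set(stopwords.words('english') + punct)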

In [None]:
# Tokenize the text, drop stopwords, and stem what remains
def tokenize_lemmatize_text(text):
    # Build the list inside the function so repeated calls do not accumulate tokens
    lemmas = []
    tokens = word_tokenize(text)
    for token in tokens:
        # The NLTK stopword list is lowercase, so compare case-insensitively
        if token.lower() in stop_words:
            continue
        lemmas.append(stemmer.stem(token))
    return lemmas

In [None]:
# Pass our text to the above function so we can then create a bigram dictionary
fellowship_token_lemmas = tokenize_lemmatize_text(corpus[0][1])

In [None]:
# Preview the first 20 stems rather than printing the whole list
fellowship_token_lemmas[:20]

In [None]:
# Let's build a bigram frequency dictionary
bigram_freqs = {}

# List comprehension to create a list of adjacent token pairs (bigrams)
bigrams = [(fellowship_token_lemmas[i], fellowship_token_lemmas[i + 1]) for i in range(len(fellowship_token_lemmas) - 1)]

# Bigrams repeat, so count how often each one occurs
for bigram in bigrams:
    bigram_freqs[bigram] = bigram_freqs.get(bigram, 0) + 1
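
The same counts can be produced more compactly by pairing the token list with itself shifted by one position and handing the pairs to `Counter`. This is a minimal sketch of an equivalent approach. The name `count_bigrams` and the toy tokens are hypothetical.

```python
from collections import Counter


def count_bigrams(tokens):
    """Count adjacent token pairs; zip stops at the shorter input, so the last token starts no pair."""
    return Counter(zip(tokens, tokens[1:]))


# Toy token list, not taken from the corpus
toy_tokens = ["frodo", "said", "sam", "said", "frodo", "said"]
toy_bigram_freqs = count_bigrams(toy_tokens)
top_bigrams = toy_bigram_freqs.most_common(2)
print(top_bigrams)
```

`Counter.most_common()` also returns the pairs already sorted by frequency, which would replace the separate sorting step.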
                      

In [None]:
# Sort the bigrams from most to least frequent
bigrams_sorted = sorted(bigram_freqs.items(), key=lambda kv: kv[1], reverse=True)

In [None]:
# Let's create a dataframe of the bigrams using pandas
import pandas as pd

# Build the dataframe with pd.DataFrame, passing our data and the column names
df = pd.DataFrame(bigrams_sorted, columns=['bigram', 'freq'])

# Let's expand the bigrams to their own columns and keep the index so we can retain the frequencies
df[['first_term', 'second_term']] = pd.DataFrame(df['bigram'].tolist(), index=df.index)

# And drop the bigram column since we now have the lemmas in their own columns
df = df.drop(columns=['bigram'])

In [None]:
df.head()

In [None]:
df.query("first_term == 'frodo'")

In [None]:
x_frodo = df.query("second_term == 'frodo'").copy()

In [None]:
x_frodo

In [None]:
# Sum the frequencies to get the total number of bigrams whose second term is 'frodo'
x_frodo.freq.sum()
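
That total can serve as a denominator: dividing each preceding term's count by it gives the share of occurrences of the target that follow that term. The sketch below does this with a plain dictionary of bigram counts instead of the dataframe. The function name and the toy counts are assumptions for illustration.

```python
from collections import Counter


def predecessor_distribution(bigram_counts, target):
    """Relative frequency of each token that immediately precedes `target`."""
    preceding = Counter()
    for (first, second), count in bigram_counts.items():
        if second == target:
            preceding[first] += count
    total = sum(preceding.values())
    if total == 0:
        return {}
    return {first: count / total for first, count in preceding.items()}


# Hypothetical bigram counts, not taken from the corpus
toy_counts = {("said", "frodo"): 3, ("and", "frodo"): 1, ("frodo", "said"): 2}
dist = predecessor_distribution(toy_counts, "frodo")
print(dist)
```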