[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/Fall_2023/notebooks/02_Properties_of_Language.ipynb)

# Lecture 04: Properties of Language, Statistics, Information Theory

## Objectives

* Understand the basic properties of language by using some EDA techniques
* Use statistical methods to understand language
* Understand the basics of information theory

## Readings

* [Jurafsky & Martin, Ch 17](../../course_readings/Jurafsky_Martin_chapter_17_363-388.pdf)
  * Context-free grammars
  * Treebanks
  * CKY parsing
  * N.B.: The 'Bibliographical and Historical Notes' section is good context
* [Jurafsky & Martin, Ch 18](../../course_readings/Jurafsky_Martin_chapter_18_389-412.pdf)
  * Dependency Parsing
  * Dependency Relations
  * Transition-based dependency parsing
  * Graph-based dependency parsing

### Using spaCy for Dependency Parsing

* [spaCy Dependency Parsing](https://spacy.io/usage/linguistic-features#dependency-parse)
* [spaCy Dependency Parsing Demo](https://explosion.ai/demos/displacy)

In [None]:
# Dependency parsing

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')

for token in doc:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.dep_:{10}}')

In [None]:
# Dependency parsing
# https://spacy.io/usage/linguistic-features#dependency-parse

In [None]:
# use displacy to visualize the dependency tree
from spacy import displacy

doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})

## Our shared humanity?

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3049087/

| Article | Pubmed Link |
| ---- | ---- |
| ![17_Languages](../images/figure_1.png) | ![17_Languages](../images/17_languages_qr.png) |

### Swadesh terms

* [Swadesh list](https://en.wikipedia.org/wiki/Swadesh_list)
* [Swadesh list](https://en.wiktionary.org/wiki/Appendix:Swadesh_lists)

What are Swadesh terms?

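* A list of basic concepts (body parts, pronouns, small numbers, and the like) that are thought to have a word in nearly every language, compiled by the linguist Morris Swadesh for comparing languages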


How does this affect our analysis of texts?

* We can use these words to compare languages
* We can use these words to compare texts within a language
* We can gain an understanding of how language is used within domains of knowledge using statistical methods

### Similarities of Swadesh term usage in different languages

<center><img src="../images/figure_3.png"  width="800" height="600"></center>

## Using simple statistics to shed light on documents and language

In [None]:
import os
from pathlib import Path

In [None]:
# data files
data_folder = Path("../datasets")
# get all files in the data folder
data_files_list = [f for f in os.listdir(data_folder) if os.path.isfile(os.path.join(data_folder, f))]

In [None]:
# print the list of data files
data_files_list

### Zipf's Law

[George Kingsley Zipf](https://en.wikipedia.org/wiki/George_Kingsley_Zipf) observed that a handful of words account for most of the running text, while most words are used rarely. He stated this as an empirical law:
$$P_n \sim \frac{1}{n^a}$$

This is a power-law distribution: the frequency $P_n$ of the word at rank $n$ is inversely proportional to that rank raised to an exponent $a$, which is close to 1 for natural-language text.
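On a log-log plot a power law becomes a straight line whose slope is $-a$, so the exponent can be estimated with an ordinary least-squares fit. The following is a minimal sketch under that assumption; the function name `estimate_zipf_exponent` and the synthetic counts are illustrative and not part of the lecture code.

```python
from math import log


def estimate_zipf_exponent(frequencies):
    """Estimate a in P_n ~ 1 / n^a as the negated least-squares slope
    of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)          # rank 1 = most frequent
    xs = [log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Synthetic counts that follow the law exactly with a = 1 (hypothetical data)
ideal_counts = [1000 / rank for rank in range(1, 201)]
exponent = estimate_zipf_exponent(ideal_counts)
print(round(exponent, 3))
```

Real word counts will not fall exactly on the line, especially at the very top and the very bottom of the ranking, so the fitted value is best read as a rough summary.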

### Zipf's Law in action

In [None]:
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt


def zipf_analysis(text, book) -> None:
    # Tokenize the text into words
    words = text.split()
    
    # Count the frequency of each word
    # word_freq = Counter(words) # this one line of code does the same as the following for loop
    
    # vanilla python implementation
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1
    
    
    # Sort the words by frequency - highest occurring terms are at the top
    sorted_word_freq = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
    
    # Plot the word frequency and rank to check for Zipf's law
    word_rank = np.arange(1, len(sorted_word_freq)+1) # X variable
    word_frequency = [i[1] for i in sorted_word_freq] # Y variable
    
    # Plot log to visualize the power law distribution
    plt.loglog(word_rank, word_frequency, marker='o')
    plt.xlabel('Rank')
    plt.ylabel('Frequency')
    plt.title(f"Zipf's Law for {book}")
    plt.show()

In [None]:
# Let's iterate over the file list and build a list of [file name, text] pairs
corpus = []

# Iterate over the files
for f in data_files_list:
    # open the file and append the text to the corpus list
    with open(os.path.join(data_folder, f), 'r') as file:
        corpus.append([f, file.read()])

### J.R.R. Tolkien's The Lord of the Rings

In [None]:
# let's use our zipf_analysis function to plot the word frequency and rank for each LOTR volume
for book in corpus:
    if 'LOTR' in book[0]:
        zipf_analysis(book[1], book[0])

### Friedrich Nietzsche's Beyond Good and Evil

In [None]:
## Zipf's law in action
for book in corpus:
    if 'LOTR' not in book[0]:
        zipf_analysis(book[1], book[0])

### Most used word in the USA:

The above demonstration of _The Lord of the Rings_ is generalizable to any English text, and as discussed above to many languages for certain kinds of words.

N.B.: See Manning and Schütze, _Foundations of Statistical Natural Language Processing_, who show that even a randomly generated text follows the power-law pattern, as Mandelbrot discussed. They conclude their discussion by observing that:

<img src="../images/most_used_01.png"  width="800" height="400">

<br />

> what makes frequency-based approaches to language hard is that almost all words are rare. Zipf's law is a good way to encapsulate this insight. (p. 29)

Thus ... <br>
<img src="../images/most_used_02.png" width="800" height="400">
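One way to check the "almost all words are rare" claim on any text is to count the hapax legomena, the word types that occur exactly once, as a share of the vocabulary. This is a minimal sketch; the helper `hapax_ratio` and the toy sentence are illustrative assumptions, not part of the lecture code.

```python
from collections import Counter


def hapax_ratio(tokens):
    """Share of distinct word types that occur exactly once."""
    counts = Counter(tokens)
    hapaxes = [word for word, count in counts.items() if count == 1]
    return len(hapaxes) / len(counts)


# Toy example (hypothetical text): 8 distinct types, 6 of which occur once
toy_tokens = "the ring is the one ring and the ring must be destroyed".split()
print(hapax_ratio(toy_tokens))
```

The same function can be applied to `text.split()` for any book in the corpus.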



### What about Tokens/Lemmas and Zipf's Law?

In [None]:
## let's tokenize our text according to their lemmata

import spacy

nlp = spacy.load('en_core_web_sm')

# increase the max length of the text that can be processed
nlp.max_length = 2000000

# create a list of lists with a Short Volume name and its text
corpus = []

# Iterate over the files
for f in data_files_list:
    # open the file and append the text to the corpus list
    with open(os.path.join(data_folder, f), 'r') as file:
        corpus.append([f, nlp(file.read())])

In [None]:
# let's plot the data again using the lemmata
for book in corpus:
    zipf_analysis(' '.join([token.lemma_ for token in book[1]]), book[0])

## What about the distribution of words in a document?

### Word Frequency Distribution

What is the distribution of words in a document?

In [None]:
# look at the distribution of the words

# create a list of the lemmata in the corpus
words_freq = []

# Iterate over the files
for f in data_files_list[:1]:
    # open the file and append the text to the corpus list
    with open(os.path.join(data_folder, f), 'r') as file:
        words_freq.extend([token.lemma_ for token in nlp(file.read())])

In [None]:
# Count the frequency of each word
word_freq = Counter(words_freq)

# count the lemma frequency
lemma_freq = Counter([token.lemma_ for token in nlp(' '.join(words_freq))])

# 20 most common lemmas
most_com = lemma_freq.most_common(20)

# 20 least common lemmas
least_com = lemma_freq.most_common()[-20:]

In [None]:
# create dataframe of the most common lemmas and least common lemmas
import pandas as pd

df_most = pd.DataFrame(most_com, columns=['lemma', 'frequency'])
df_least = pd.DataFrame(least_com, columns=['lemma', 'frequency'])

# concatenate the dataframes
df = pd.concat([df_most, df_least])
df.head(50)


In [None]:
df.head()

### Word Length Distribution

What is the distribution of word lengths in a document?

In [None]:
# word length distribution
word_length = [len(word) for word in words_freq]

# plot the word length distribution 
# change the ticks to increment by 5
plt.hist(word_length, bins=65)
plt.xticks(np.arange(0, 65, 5))

### Sentence Length Distribution

What is the distribution of sentence lengths in a document?

In [None]:
# collect the sentences for the sentence length distribution

# corpus[0][1] is already a spaCy Doc (parsed in the lemma cell above), so read its sentences directly
sentences = list(corpus[0][1].sents)

In [None]:
# first 100 characters of the text as read from the file
corpus[0][1].text[:100]

In [None]:
# SpaCy NLP Sentence Detection
sentences[:5]

In [None]:
# sentence length measured in tokens
sentence_length = [len(sent) for sent in sentences]

# plot the sentence length distribution
plt.hist(sentence_length, bins=65)

In [None]:
# Use NumPy to get the mean, standard deviation, and 95th percentile of the sentence length
import numpy as np

np.mean(sentence_length), np.std(sentence_length), np.percentile(sentence_length, 95)

### Part of Speech Distribution

What is the distribution of parts of speech in a document?

In [None]:
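# use the spaCy POS tags and plot their distribution
# corpus[0][1] is already a spaCy Doc, so iterate over its tokens directly
pos_counts = Counter(token.pos_ for token in corpus[0][1])

# POS tags are categorical, so draw one bar per tag rather than a histogram
tags, counts = zip(*pos_counts.most_common())
plt.bar(tags, counts)
# rotate the x-axis labels
plt.xticks(rotation=90)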

### N-Gram Distribution

What is the distribution of n-grams in a document?

In [None]:
# What are the most common n-grams in the corpus?

# n-grams using python
bi_grams = []

# iterate over the tokens in the corpus
for i in range(len(words_freq)-1):
    bi_grams.append((words_freq[i], words_freq[i+1]))

In [None]:
# count the frequency of each bigram
bi_gram_freq = Counter(bi_grams)

bi_gram_freq.most_common(10)

In [None]:
least_com = bi_gram_freq.most_common()[-10:]
least_com
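The bigram loop above generalizes to any n by zipping the token list against shifted copies of itself. This is a minimal sketch; the helper name `ngrams` and the toy sentence are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, so no index arithmetic is needed
    return list(zip(*(tokens[i:] for i in range(n))))


# Toy example (hypothetical text)
toy_tokens = "the cat sat on the mat and the cat slept".split()

bigram_counts = Counter(ngrams(toy_tokens, 2))
trigram_counts = Counter(ngrams(toy_tokens, 3))

print(bigram_counts.most_common(3))
print(len(trigram_counts))
```

Passing `words_freq` instead of the toy list gives the same bigrams as the loop above when `n=2`.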

### Collocation Distribution

What is the distribution of collocations in a document?

What is a collocation?

Linguistic collocations refer to sequences of words that co-occur more often than would be expected by chance. Testing the significance of collocations is a common task in computational linguistics and Natural Language Processing (NLP). There are several statistical measures used to test the significance of collocations. Some of the most common ones include:

1. **Mutual Information (MI)** (strictly, pointwise mutual information):
   $$ MI(w_1, w_2) = \log_2 \left( \frac{P(w_1, w_2)}{P(w_1) \times P(w_2)} \right) $$
   where $P(w_1, w_2)$ is the probability of the words $w_1$ and $w_2$ occurring together, and $P(w_1)$ and $P(w_2)$ are the probabilities of each word occurring on its own.

2. **Chi-squared Test**:
   The chi-squared test compares the observed frequencies with the frequencies expected if the two words were independent. The formula is:
   $$ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
   where the sum runs over the four cells of the 2×2 contingency table ($w_1$ present or absent, $w_2$ present or absent), $O_{ij}$ is the observed frequency in cell $(i, j)$, and $E_{ij}$ is the expected frequency in that cell if the two words were independent.

3. **T-score**:
   $$ T(w_1, w_2) = \frac{f(w_1, w_2) - E(f(w_1, w_2))}{\sqrt{f(w_1, w_2)}} $$
   where $f(w_1, w_2)$ is the observed frequency of the collocation, and $E(f(w_1, w_2))$ is the expected frequency under the assumption of independence.

4. **Log-Likelihood Ratio (LLR)**:
   This is a more complex measure that compares the likelihood of the observed data under two different models: one where the words are independent and one where they are dependent.

There are other measures as well, but these are among the most commonly used in the field of NLP. The choice of measure often depends on the specific task and the nature of the data.
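To make the mutual information and t-score formulas above concrete, here is a worked example with hypothetical counts: a corpus of 1,024 tokens in which $w_1$ occurs 16 times, $w_2$ occurs 32 times, and the pair occurs together 8 times. The counts are invented so that the arithmetic comes out cleanly.

```python
from math import log2, sqrt

# Hypothetical counts, chosen for easy arithmetic
total_count = 1024
w1_count = 16
w2_count = 32
w1_w2_count = 8

# Mutual information: log2 of observed joint probability over the
# joint probability expected under independence
p_w1_w2 = w1_w2_count / total_count            # 1/128
p_w1 = w1_count / total_count                  # 1/64
p_w2 = w2_count / total_count                  # 1/32
mi = log2(p_w1_w2 / (p_w1 * p_w2))             # log2(16)

# T-score: (observed - expected) / sqrt(observed)
expected = w1_count * w2_count / total_count   # 0.5 co-occurrences
t = (w1_w2_count - expected) / sqrt(w1_w2_count)

print(mi, round(t, 3))
```

The pair occurs 16 times more often than independence would predict, which is 4 bits of mutual information.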

In [None]:
# Collocations with Mutual Information in vanilla python

# import the Counter module
from collections import Counter

# create a list of the lemmata in the corpus
words_freq = []

# Iterate over the files
for f in data_files_list[:1]:
    # open the file and append the text to the corpus list
    with open(os.path.join(data_folder, f), 'r') as file:
        words_freq.extend([token.lemma_ for token in nlp(file.read())])

# Count the frequency of each word
word_freq = Counter(words_freq)

#### Gensim Collocation Detection

In [None]:
# use gensim to get the collocations
from gensim.models.phrases import Phrases, Phraser

# create a list of lists of the lemmata in the corpus
words_freq = []

# Iterate over the files
for f in data_files_list[:1]:
    # open the file and append the text to the corpus list
    with open(os.path.join(data_folder, f), 'r') as file:
        words_freq.append([token.lemma_ for token in nlp(file.read())])
        
# create the phrases model
phrases = Phrases(words_freq, min_count=30, progress_per=10000)

# create the collocations
bigram = Phraser(phrases)

# get the collocations
bigram.phrasegrams

#### Python functions to get the collocations

In [None]:
from math import log, log2, sqrt


def mutual_information(w1_w2_count, w1_count, w2_count, total_count):
    """Pointwise mutual information (in bits) of the pair (w1, w2)."""
    p_w1_w2 = w1_w2_count / total_count
    p_w1 = w1_count / total_count
    p_w2 = w2_count / total_count
    return log2(p_w1_w2 / (p_w1 * p_w2))


def _contingency(w1_w2_count, w1_count, w2_count, total_count):
    """Observed and expected counts for the four cells of the 2x2 table."""
    observed = [
        w1_w2_count,                                      # w1 and w2
        w1_count - w1_w2_count,                           # w1 without w2
        w2_count - w1_w2_count,                           # w2 without w1
        total_count - w1_count - w2_count + w1_w2_count,  # neither word
    ]
    expected = [
        w1_count * w2_count / total_count,
        w1_count * (total_count - w2_count) / total_count,
        (total_count - w1_count) * w2_count / total_count,
        (total_count - w1_count) * (total_count - w2_count) / total_count,
    ]
    return observed, expected


def chi_squared(w1_w2_count, w1_count, w2_count, total_count):
    # sum over all four cells, as in the chi-squared formula
    observed, expected = _contingency(w1_w2_count, w1_count, w2_count, total_count)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def t_score(w1_w2_count, w1_count, w2_count, total_count):
    expected_w1_w2 = (w1_count * w2_count) / total_count
    return (w1_w2_count - expected_w1_w2) / sqrt(w1_w2_count)


def log_likelihood_ratio(w1_w2_count, w1_count, w2_count, total_count):
    observed, expected = _contingency(w1_w2_count, w1_count, w2_count, total_count)
    # a cell with zero observations contributes nothing and would break log()
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
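A scoring function becomes useful once it is applied to every bigram in a token list. The sketch below ranks bigrams by mutual information; it re-implements the formula inline so that it runs on its own. The helper name `rank_bigrams_by_pmi`, the `min_count` threshold, and the toy sentence are illustrative assumptions. As a simplification, the token count is used as the denominator for both unigram and bigram probabilities.

```python
from collections import Counter
from math import log2


def rank_bigrams_by_pmi(tokens, min_count=2):
    """Score each bigram seen at least min_count times; highest score first."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    scores = {}
    for (w1, w2), pair_count in bigram_counts.items():
        if pair_count < min_count:
            continue  # scores for very rare pairs are unreliable
        p_pair = pair_count / total
        p_w1 = unigram_counts[w1] / total
        p_w2 = unigram_counts[w2] / total
        scores[(w1, w2)] = log2(p_pair / (p_w1 * p_w2))

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy example (hypothetical text)
toy_tokens = "new york is big and new york is old but the city is new".split()
ranked = rank_bigrams_by_pmi(toy_tokens)
for pair, score in ranked:
    print(pair, round(score, 2))
```

The `min_count` filter matters because mutual information rewards rare pairs: two words that each occur once, next to each other, get a very high score from a single observation.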


## Exploring Andrew Huberman Data

Let's explore our data from Andrew Huberman's Podcasts:

In [None]:
url = 'https://raw.githubusercontent.com/JamesMTucker/AskHuberman/main/scraper/data/video_metadata.csv'

import pandas as pd

df = pd.read_csv(url)

df.head()

## Data Cleaning

In [None]:
# Let's merge the title and description (missing values become empty strings)

df['title_description'] = df['video_title'].fillna('') + ' ' + df['video_description'].fillna('')

df = df[['video_id', 'title_description']]

df.head()

## TF-IDF: Term-Frequency Inverse Document Frequency

Our analysis of Zipf's law showed that the most frequently occurring terms in a bag of words say little about how topics are discussed. Raw frequencies give only a rough idea of what a document is about, and we usually need more than that to analyze a document and its language. So we built N-grams, which tell us more about how language is used in a document, and we plotted several distributions of the linguistic data, which can prompt further questions. Another useful metric for extracting a document's most relevant terms is term frequency-inverse document frequency (TF-IDF).

### Informal Definition

The general intuition behind tf-idf is that words concentrated in one particular text tell us how that text differs from the other documents in the corpus.

It is important to note the following definitions:

* `Document` = a news article, journal article, tweet, reddit post, etc.
* `Document vocabulary` = the terms in a document and their frequencies
* `Corpus` = a collection of documents
* `Corpus Vocabulary` = the terms in a corpus of documents and their frequencies

Thus,

`tf-idf` = `term_frequency` * `inverse_document_frequency`

* `term_frequency` = count of a word's appearances in a document
* `inverse_document_frequency` = log(total_number_of_documents / number_of_documents_with_term) + 1

### Formal Definition

`term frequency` = $tf_{t,d} = \log_{10}(count(t,d) + 1)$ -- The intuition is that a word appearing 100 times in a document is not 100x more likely to be relevant to the meaning of that document, so we dampen the raw count with a logarithm.

`inverse document frequency` = $idf_{t} = \log_{10}(\frac{N}{df_t})$ -- We let _N_ be the total number of documents in the corpus and $df_t$ the number of documents that contain term _t_. This gives a higher weight to words that occur in only a few documents.

$$tfidf_{t,d} = tf_{t,d} \times idf_{t}$$

The cells below use a common variant of these formulas: raw counts for the term frequency and a smoothed natural-log idf.
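The formal definition can be traced by hand on a tiny corpus. The following sketch applies $tf_{t,d} = \log_{10}(count(t,d) + 1)$ and $idf_t = \log_{10}(N / df_t)$ to three made-up documents; the function name `tf_idf` and the documents themselves are illustrative assumptions.

```python
from collections import Counter
from math import log10

# Hypothetical three-document corpus, already tokenized
docs = {
    "d1": "sleep improves focus and sleep improves mood".split(),
    "d2": "caffeine improves focus".split(),
    "d3": "light exposure improves sleep".split(),
}


def tf_idf(docs):
    """Return {doc_id: {term: tf-idf}} using log-scaled tf and log10(N / df)."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: log10(count + 1) * log10(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


scores = tf_idf(docs)
print(scores["d1"]["improves"])            # appears in every document
print(round(scores["d2"]["caffeine"], 4))  # appears in one document only
```

A term such as "improves" that appears in every document gets an idf of $\log_{10}(3/3) = 0$ and therefore a tf-idf of zero, however often it occurs, while "caffeine", which is unique to one document, keeps a positive weight.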

In [None]:
### Lemma count per video description
import spacy

nlp = spacy.load('en_core_web_sm')

df['lemma_count'] = [len([token.lemma_ for token in nlp(text)]) for text in df['title_description']]

# plot the distribution of the lemma count
plt.hist(df['lemma_count'], bins=20)

In [None]:
# shortest video description
df[df['lemma_count'] == df['lemma_count'].min()]

In [None]:
# longest video description
df[df['lemma_count'] == df['lemma_count'].max()]

### Tokenize the Video Descriptions


In [None]:
# strip punctuation and lemmatize the title_description
import string

punct = string.punctuation

# lowercased lemmas per title_description, with punctuation tokens dropped
df['tokens'] = df['title_description'].apply(lambda x: [token.lemma_.lower() for token in nlp(x) if token.lemma_ not in punct])

df.head()

### One Row per Token

In [None]:
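# explode the tokens column so that each token gets its own row

# only drop columns that exist; there is no 'tokenized' column, so naming it raises a KeyError
token_df = (df
      .explode('tokens')
      .drop(columns=['title_description', 'lemma_count'])
)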

token_df.head()

### Create a TF count dataframe

In [None]:
# create a word frequency dataframe
term_frequency = (token_df
                  .groupby(by=['video_id', 'tokens'])
                  .agg({'tokens': 'count'})
                  .rename(columns={'tokens': 'term_frequency'})
                  .reset_index()
                  .rename(columns={'tokens': 'term'})
                 )

term_frequency

### Remove stop words

In [None]:
# we aren't so interested in stop words
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
         'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
         'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
         'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
         'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
         'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']

### Create a TF Dataframe

In [None]:
# remove stop words
term_frequency = term_frequency.drop(term_frequency[term_frequency['term'].isin(stop_words)].index)
term_frequency

### Create a Document Frequency Dataframe

In [None]:
# Document Frequency
document_frequency = (term_frequency
                      .groupby(['video_id', 'term'])
                      .size()
                      .unstack()
                      .sum()
                      .reset_index()
                      .rename(columns={0: 'document_frequency'})
                     )

document_frequency

### Merge the TF and DF Dataframes to create TF-IDF

In [None]:
# merge the document freqs into the term dataframe
term_frequency = term_frequency.merge(document_frequency)

In [None]:
documents_in_corpus = term_frequency['video_id'].nunique()
documents_in_corpus

### Calculate the Inverse Document Frequency

In [None]:
# inverse document frequency
term_frequency['idf'] = np.log((1 + documents_in_corpus) / (1 + term_frequency['document_frequency'])) + 1
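The formula in the cell above is a smoothed idf. Adding 1 to the numerator and denominator acts as if one extra document contained every term, which avoids division by zero, and the trailing `+ 1` keeps terms that occur in every document from being weighted to zero. The sketch below compares it with the unsmoothed natural-log idf on hypothetical document counts; the function names are illustrative.

```python
from math import log


def plain_idf(n_docs, doc_freq):
    """Unsmoothed idf: zero for a term found in every document."""
    return log(n_docs / doc_freq)


def smoothed_idf(n_docs, doc_freq):
    """Same formula as the cell above: log((1 + N) / (1 + df)) + 1."""
    return log((1 + n_docs) / (1 + doc_freq)) + 1


# Hypothetical corpus of 10 documents
n_docs = 10
for doc_freq in (1, 5, 10):
    print(doc_freq,
          round(plain_idf(n_docs, doc_freq), 3),
          round(smoothed_idf(n_docs, doc_freq), 3))
```

With smoothing, a term found in all 10 documents still gets a weight of 1 instead of 0, so its tf-idf reduces to its raw term frequency.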

In [None]:
term_frequency

### Calculate the TF-IDF values

In [None]:
term_frequency['tfidf'] = term_frequency['term_frequency'] * term_frequency['idf']
term_frequency.sort_values(by=['term_frequency'], ascending=False)

### Normalize the values for interpretation

In [None]:
from sklearn import preprocessing
term_frequency['tfidf_norm'] = preprocessing.normalize(term_frequency[['tfidf']], axis=0, norm='l2')

In [None]:
term_frequency

### Explore the TF-IDF values and Terms

In [None]:
top_n_terms = term_frequency.sort_values(by=['video_id', 'tfidf'], ascending=[True, False]).groupby(['video_id']).head(2)

In [None]:
top_n_terms.head(10)

In [None]:
vidIds = top_n_terms['video_id'].tolist()

### Add the links by merging on video_ids

In [None]:
tfidf_df_titles = pd.merge(top_n_terms, df, on='video_id')

In [None]:
tfidf_df_titles.head()