# Text by the Numbers: Word Vectors

**A Reproducible Research Workshop**

(A Collaboration between Dartmouth Library and Research Computing)

[*Click here to view or register for our current list of workshops*](http://dartgo.org/RRADworkshops)

*This notebook was created by*:
+ Version 1.0: Jeremy Mikecz, Research Data Services (Dartmouth Library)
<!--
+ Some of the inspiration for the code and information in this notebook was taken from https://www.w3schools.com/python/python_intro.asp -- This is a great resource if you want to learn more about Python!-->


## I. Word Vectors

**Text vectorization** is the process of converting texts into numbers or, more specifically, into vectors of numbers. 
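As a minimal sketch of this idea, the snippet below turns two short sentences into vectors of word counts using only the Python standard library. The sentences and the helper name `count_vector` are made up for illustration and are not part of the workshop data.

```python
from collections import Counter

# Two toy "documents" (hypothetical examples, not workshop data)
docs = ["the cat sat on the mat", "the dog sat"]

# Shared vocabulary: every distinct word, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocab):
    """Return one count per vocabulary word for a single document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [count_vector(doc, vocab) for doc in docs]
print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a list of numbers with one position per vocabulary word, which is the basic shape that the methods described in this notebook build on.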

[explain more]

[what can we learn?]

There are different methods of text vectorization. Three of the most common examples are:
+ **Term Frequency - Inverse Document Frequency (TF-IDF)**:
    + *Term frequency* is the number of times a word appears in one document. *Inverse document frequency* is, more or less, a measure of how rare the word is across the entire corpus in which this document is found: the fewer documents that contain the word, the higher the value. Thus, within a corpus of newspaper articles, an article on a baseball game will return high TF-IDF scores for words like "hit", "run", "RBI", and "innings" as well as the names of teams and individual players. Common words found in that same article, like "the", "this", and "and", will have low TF-IDF scores.
+ **Word2Vec**: 
    + Word2Vec is a method to convert a word to a numerical array that essentially situates the word in a multi-dimensional language space where similar words are found close to one another. [more] ["embeddings"]
    + applying this vectorization method to a corpus is significantly faster than the other two methods mentioned here
    + however, TF-IDF is a far simpler process to understand
+ **Sentence-BERT**:
    + Instead of creating a vector for each word, Sentence-BERT creates a vector for each sentence. This allows the encoding of a word's context: for example, that the *bow* of a ship is something altogether different from a *bow* that you tie or a *bow* and arrow.
    + unlike Word2Vec and like TF-IDF, however, this method is computationally intensive and takes up a lot of memory
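
To make "similar words are found close to one another" a little more concrete, here is a rough sketch of cosine similarity, a common way to measure closeness between word vectors. The three-dimensional vectors below are invented for illustration; real embedding models produce vectors with hundreds of dimensions, and these numbers do not come from any trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" (hypothetical values)
toy_vectors = {
    "ship": [0.9, 0.1, 0.0],
    "boat": [0.8, 0.2, 0.1],
    "ribbon": [0.0, 0.2, 0.9],
}

sim_ship_boat = cosine_similarity(toy_vectors["ship"], toy_vectors["boat"])
sim_ship_ribbon = cosine_similarity(toy_vectors["ship"], toy_vectors["ribbon"])
print(sim_ship_boat)
print(sim_ship_ribbon)
```

A value near 1 means the two vectors point in nearly the same direction; a value near 0 means they have little in common.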

## II. Setup

1. Before beginning, we need to import some packages.

In [4]:
from pathlib import Path
import pandas as pd

# Folder holding the State of the Union texts as .txt files
textdir = Path("~/shared/RR-workshop-data/state-of-the-union-dataset/txt").expanduser()
# Sorted list of Path objects, one per .txt file
pathlist = sorted(textdir.glob('*.txt'))

# Term Frequency - Inverse Document Frequency (TF-IDF)

<img src = "https://miro.medium.com/max/720/1*qQgnyPLDIkUmeZKN2_ZWbQ.webp" style="width:60%">

Image from Yassine Hamdaoui, ["TF(Term Frequency)-IDF(Inverse Document Frequency) from scratch in python"](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558) *Towards Data Science (Medium)* (Dec. 9, 2019).

***Portions of this notebook are taken from the lesson ["TF-IDF with Scikit-learn"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html) in Melanie Walsh's Introduction to Cultural Analytics & Python book (indicated by the [MW]).***

## III. TF-IDF with Scikit-Learn [MW]

Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document. 

In this lesson, we’re going to learn how to calculate tf-idf scores using a collection of plain text (.txt) files and the Python library scikit-learn, which has a quick and nifty module called TfidfVectorizer.

We will cover how to:

+ Calculate and normalize tf-idf scores for U.S. State of the Union addresses with scikit-learn


## IV. Breaking Down the TF-IDF Formula [MW]

But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.

**tf-idf = term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1\*

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), flip that fraction on its head, and take its logarithm (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs. the word “pigeons.” In Walsh's example corpus, the stories of *Lost in the City*, the term “said” appears in 13 of the 24 stories (24 / 13 –> a smaller inverse document frequency), while the term “pigeons” occurs in only 2 of the 24 stories (24 / 2 –> a bigger inverse document frequency, a bigger tf-idf boost).

\*There are a bunch of slightly different ways that you can calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing”: it adds 1 to both the numerator and the denominator inside the logarithm:

**inverse_document_frequency** = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
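
As a quick worked example, the sketch below plugs the “said” and “pigeons” counts from the passage above into both versions of the formula. The function names `idf_plain` and `idf_smooth` are illustrative only, and the natural logarithm is assumed, which is what scikit-learn uses.

```python
import math

def idf_plain(n_docs, doc_freq):
    """Unsmoothed form: log(N / df) + 1."""
    return math.log(n_docs / doc_freq) + 1

def idf_smooth(n_docs, doc_freq):
    """Smoothed form: log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 24  # total number of stories in the example
for term, doc_freq in [("said", 13), ("pigeons", 2)]:
    plain = round(idf_plain(n_docs, doc_freq), 3)
    smooth = round(idf_smooth(n_docs, doc_freq), 3)
    print(term, plain, smooth)
```

Under either version the rarer term, “pigeons,” receives roughly twice the idf of “said,” so the same raw count of each word in a story would give “pigeons” the larger tf-idf score.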

## V. Calculate tf-idf [MW]

To calculate tf–idf scores for every word, we’re going to use scikit-learn’s TfidfVectorizer.

4. When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run TfidfVectorizer is with smoothing (smooth_idf = True) and normalization (norm='l2') turned on. These parameters will better account for differences in text length, and overall produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for TfidfVectorizer, so to turn them on, you don’t need to include any extra code at all.
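
To see why L2 normalization helps with differences in text length, consider this small sketch. It rescales a vector so that the sum of its squared values is 1. The two input vectors are hypothetical raw weights, chosen so that the second behaves like a text ten times longer with the same word proportions; `l2_normalize` is an illustrative helper, not a scikit-learn function.

```python
import math

def l2_normalize(vector):
    """Scale a vector so that its Euclidean (L2) length is 1."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        return list(vector)  # an all-zero vector is left unchanged
    return [x / length for x in vector]

short_doc = [3.0, 4.0]    # hypothetical raw weights for a short text
long_doc = [30.0, 40.0]   # same proportions, ten times larger

short_norm = l2_normalize(short_doc)
long_norm = l2_normalize(long_doc)
print(short_norm)
print(long_norm)
```

After normalization the two vectors are the same, so the scores reflect the relative importance of each word within a text rather than the text's overall length.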

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization).

**Note**: tf-idf vectors can become very large even for a modest number of texts.

In [5]:
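from sklearn.feature_extraction.text import TfidfVectorizer
# Default smoothing and L2 normalization; input='filename' means we pass file paths rather than raw text
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')
# Alternative with a smaller vocabulary: ignore terms found in more than half of the documents
# and keep only the 5,000 most frequent of the remaining terms
tfidf_vectorizer2 = TfidfVectorizer(input='filename', stop_words='english', max_df = 0.5, max_features=5000)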

5. Run TfidfVectorizer on our list of text files (`pathlist`)

In [6]:
tfidf_vector = tfidf_vectorizer.fit_transform(pathlist)
tfidf_vector


<233x25023 sparse matrix of type '<class 'numpy.float64'>'
	with 361183 stored elements in Compressed Sparse Row format>
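
As a back-of-the-envelope check on the matrix summary above, the arithmetic below uses only the shape and stored-element count it reports. It estimates how much of the matrix is non-zero and how much memory a dense copy would need, assuming 8 bytes per `float64` value.

```python
n_docs, n_terms = 233, 25023   # shape reported in the output above
stored = 361183                # non-zero entries reported in the output above

total_cells = n_docs * n_terms
density = stored / total_cells
dense_megabytes = total_cells * 8 / 1e6   # 8 bytes per float64 value

print(total_cells)
print(f"{density:.1%} of cells are non-zero")
print(f"about {dense_megabytes:.0f} MB if stored densely")
```

Only a small fraction of the cells hold a score, which is why scikit-learn returns a sparse matrix and why converting it to a dense array or DataFrame increases memory use.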

6. Make a DataFrame out of the resulting tf–idf vector, setting the “feature names” or words as columns and the titles as rows

In [7]:
text_titles = [path.stem for path in pathlist]
# TfidfVectorizer returns a sparse matrix, so we call .toarray() before building the DataFrame.
# Note: get_feature_names() has been removed from recent scikit-learn releases; get_feature_names_out() is used here instead.
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
print(tfidf_df)

              00       000  0000  0001  001  002  003  004  005  006  ...  \
Adams_1797   0.0  0.000000   0.0   0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
Adams_1798   0.0  0.000000   0.0   0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
Adams_1799   0.0  0.000000   0.0   0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
Adams_1800   0.0  0.000000   0.0   0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
Adams_1825   0.0  0.271497   0.0   0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
...          ...       ...   ...   ...  ...  ...  ...  ...  ...  ...  ...   
Wilson_1916  0.0  0.000000   0.0   0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
Wilson_1917  0.0  0.000000   0.0   0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
Wilson_1918  0.0  0.000000   0.0   0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
Wilson_1919  0.0  0.023909   0.0   0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   
Wilson_1920  0.0  0.204717   0.0   0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   

[output truncated]

In [8]:
tfidf_df.head()

Unnamed: 0,00,000,0000,0001,001,002,003,004,005,006,...,zimbabwe,zimbabwean,zinc,zion,zollverein,zone,zones,zoological,zooming,zuloaga
Adams_1797,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Adams_1798,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Adams_1799,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Adams_1800,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Adams_1825,0.0,0.271497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
tfidf_df.index.name = "textname"
tfidf_df = tfidf_df.reset_index()
tfidf_df.head()

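| | textname | 00 | 000 | 0000 | 0001 | 001 | 002 | 003 | 004 | 005 | ... | zimbabwe | zimbabwean | zinc | zion | zollverein | zone | zones | zoological | zooming | zuloaga |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adams_1797 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Adams_1798 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Adams_1799 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Adams_1800 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Adams_1825 | 0.0 | 0.271497 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |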


In [10]:
tfidf_long =  pd.melt(tfidf_df, id_vars = "textname", var_name = "word", value_name = "tfidf_score", value_vars = list(tfidf_df.drop(columns = ["textname"]).columns))
tfidf_long.head()

| | textname | word | tfidf_score |
|---|---|---|---|
| 0 | Adams_1797 | 0 | 0.0 |
| 1 | Adams_1798 | 0 | 0.0 |
| 2 | Adams_1799 | 0 | 0.0 |
| 3 | Adams_1800 | 0 | 0.0 |
| 4 | Adams_1825 | 0 | 0.0 |
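
The reshaping done by `pd.melt` above is easier to see on a tiny table. The sketch below uses a made-up two-text, two-word DataFrame (the names and scores are hypothetical) to show the wide-to-long conversion, followed by the removal of zero scores that the next cell applies to the full data.

```python
import pandas as pd

# Tiny stand-in for tfidf_df: one row per text, one column per word
wide = pd.DataFrame(
    {"textname": ["A_1900", "B_1950"], "gold": [0.5, 0.0], "war": [0.0, 0.7]}
)

# Wide -> long: one row per (text, word) pair.
# When value_vars is omitted, melt uses every column not listed in id_vars.
long = pd.melt(wide, id_vars="textname", var_name="word", value_name="tfidf_score")

# Keep only the (text, word) pairs with a non-zero score
nonzero = long[long["tfidf_score"] > 0.0]

print(long)
print(nonzero)
```

The long table has one row for every text and word combination (here 2 x 2 = 4), and filtering leaves only the pairs where the word occurs in the text.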


In [11]:
print(tfidf_long.shape)
tfidf_long = tfidf_long[tfidf_long['tfidf_score'] > 0.0]
print(tfidf_long.shape)

(5830359, 3)
(361183, 3)


In [12]:
#get top 15 tfidf scores for each text
N = 15
tfidf_long = tfidf_long.sort_values(by = "tfidf_score", ascending=False)
print(tfidf_long.shape)
tfidf_sub = tfidf_long.groupby('textname').head(N).reset_index(drop=True)

tfidf_sub.head(50)

#textnames = list(tfidf_long['textname'].unique())

#for i, text in enumerate(textnames):
#    onetext_df = tfidf_sub[tfidf_sub['textname'] == text]
#    print(onetext_df.head(10))

(361183, 3)


| | textname | word | tfidf_score |
|---|---|---|---|
| 0 | Taft_1910 | 00 | 0.648663 |
| 1 | Johnson_1966 | vietnam | 0.523858 |
| 2 | Carter_1980 | soviet | 0.475029 |
| 3 | Cleveland_1895 | gold | 0.464477 |
| 4 | Arthur_1883 | 00 | 0.458837 |
| 5 | Arthur_1882 | 00 | 0.452926 |
| 6 | Polk_1846 | mexico | 0.442578 |
| 7 | Eisenhower_1961 | 1953 | 0.414143 |
| 8 | Hoover_1930 | 000 | 0.405701 |
| 9 | Pierce_1855 | states | 0.399066 |


<div class="alert alert-info" role="alert" style="color:blue">
    <h3><b>Exercises</b>:</h3> 
    <p>7. Subset the tfidf_sub dataframe to examine the top tfidf_scores for one particular speech</p>
    <p>7b Advanced. Subset the tfidf_sub dataframe to examine the top tfidf_scores for a president.</p>
</div>
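
One possible approach to these exercises is sketched below on a small stand-in table with the same three columns as `tfidf_sub`. The rows and scores are made up for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Stand-in for tfidf_sub with the same three columns (scores are made up)
toy_sub = pd.DataFrame({
    "textname": ["Polk_1846", "Polk_1847", "Carter_1980", "Polk_1846"],
    "word": ["mexico", "mexico", "soviet", "oregon"],
    "tfidf_score": [0.44, 0.57, 0.48, 0.20],
})

# One speech: exact match on textname
one_speech = toy_sub[toy_sub["textname"] == "Polk_1846"]

# One president: match the surname that precedes the underscore
one_president = toy_sub[toy_sub["textname"].str.startswith("Polk_")]

print(one_speech)
print(one_president)
```

Matching on the surname alone will combine presidents who share one. In this dataset, for example, `Adams_1797` and `Adams_1825` belong to two different presidents, so a year filter may also be needed.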

## Automate the TF-IDF vectorization process with a function

8. The function below integrates some of the tasks we did above to further automate the process of creating tf-idf vectors. 

Examine the `tfidf_analysis` function below. What are its inputs? What does it return (its output)? Can you identify what each line of code does?

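You may notice that this code applies an additional processing step to our text: it lemmatizes tokens from the text, reducing words to their base form (for example, plural --> singular for nouns). As called here, without a part-of-speech argument, `WordNetLemmatizer.lemmatize` treats every token as a noun, so most verb forms are left unchanged.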

In [13]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

stop = sorted(stopwords.words('english'))
# Split text into runs of letters, digits, and underscores (this drops punctuation)
tokenizer = RegexpTokenizer(r'\w+')

# Interface lemma tokenizer from nltk with sklearn
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        # Lemmatize each token, skipping tokens made up only of digits
        return [self.wnl.lemmatize(t) for t in tokenizer.tokenize(doc) if not t.isdigit()]

lemma_tokenizer = LemmaTokenizer()
eng_stops = set(stopwords.words('english'))
# Run the stop words through the same tokenizer so they match the lemmatized tokens
lemma_stop = lemma_tokenizer(' '.join(eng_stops))

def tfidf_analysis(textdir, ng_range = (1,1), lemmas = False):
    '''
    Reads in a file folder and returns a long tfidf dataframe for all .txt files found in this folder.

    textdir = pathlib Path object to folder containing .txt files to be analyzed
    ng_range = range of ngrams to be analyzed, e.g. (1,2) will analyze terms of length 1 (unigrams) and 2 (bigrams)
    lemmas = if True, lemmatize tokens with lemma_tokenizer before scoring

    Steps:
    1. Build a TfidfVectorizer (max_df = 0.5, max_features = 5000), with or without lemmatization
    2. Fit it to the .txt files and convert the sparse result to a wide dataframe
    3. Keep only the terms whose highest score in any text is above 0.02
    4. Melt the wide dataframe into long form and sort it by tfidf_score (highest first)
    '''
    if lemmas:
        tfidf_vectorizer = TfidfVectorizer(input = "filename", stop_words = lemma_stop, tokenizer = lemma_tokenizer, ngram_range = ng_range, max_df = 0.5, max_features=5000)
    else:
        tfidf_vectorizer = TfidfVectorizer(input = "filename", stop_words = "english", ngram_range = ng_range, max_df = 0.5, max_features=5000)

    pathlist = sorted(textdir.glob('*.txt'))
    tfidf_vector = tfidf_vectorizer.fit_transform(pathlist)
    text_titles = [path.stem for path in pathlist]
    tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
    print("df shape: ", tfidf_df.shape)
    # Drop terms that never score above 0.02 in any text
    tfidf_df = tfidf_df.loc[: ,(tfidf_df.max(numeric_only = True) > 0.02)]
    print("df shape: ", tfidf_df.shape)
    tfidf_df.index.name = "textname"
    tfidf_df = tfidf_df.reset_index()
    tfidf_long =  pd.melt(tfidf_df, id_vars = "textname", var_name = "word", value_name = "tfidf_score", value_vars = list(tfidf_df.drop(columns = ["textname"]).columns))

    tfidf_long = tfidf_long.sort_values(by = 'tfidf_score', ascending = False)
    print("df shape: ", tfidf_long.shape)
    return tfidf_long

9. We can call the function below. You can try it with or without lemmatization and with single words or n-grams.

In [20]:
# takes about 4x longer to run with the lemmatizing tokenizer!
longdf = tfidf_analysis(textdir, ng_range = (1,1), lemmas = True)
longdf.head()




df shape:  (233, 5000)
df shape:  (233, 5000)
df shape:  (1165000, 3)


| | textname | word | tfidf_score |
|---|---|---|---|
| 1125509 | Johnson_1966 | vietnam | 0.667941 |
| 987957 | Carter_1980 | soviet | 0.633374 |
| 500529 | Cleveland_1895 | gold | 0.593359 |
| 683324 | Polk_1847 | mexico | 0.566879 |
| 683323 | Polk_1846 | mexico | 0.552646 |


10. The above dataframe is large: it has nearly 1.2 million rows. We can reduce it by keeping just the top *N* words by tfidf score for each speech.

In [21]:
N = 10
longdf_sub = longdf.groupby('textname').head(N).reset_index(drop=True)
print(longdf_sub.shape)
longdf_sub.head(20)

(2330, 3)


| | textname | word | tfidf_score |
|---|---|---|---|
| 0 | Johnson_1966 | vietnam | 0.667941 |
| 1 | Carter_1980 | soviet | 0.633374 |
| 2 | Cleveland_1895 | gold | 0.593359 |
| 3 | Polk_1847 | mexico | 0.566879 |
| 4 | Polk_1846 | mexico | 0.552646 |
| 5 | Clinton_1996 | challenge | 0.551497 |
| 6 | Truman_1953 | communist | 0.525939 |
| 7 | Reagan_1982 | program | 0.520346 |
| 8 | Obama_2013 | job | 0.495715 |
| 9 | Roosevelt_1937 | democracy | 0.489216 |


<div class="alert alert-info" role="alert" style="color:blue">
    <h3><b>Exercises</b>:</h3> 
    <p>11. To view the dataframe differently, we can sort it first by year and then by tfidf_score. Do so below:</p>
</div>

In [23]:
longdf_sub['year'] = longdf_sub['textname'].str[-4:].astype(int)
longdf_sub = longdf_sub.sort_values(by = ["year", "tfidf_score"], ascending = [False, False])
longdf_sub.head(15)

| | textname | word | tfidf_score | year |
|---|---|---|---|---|
| 71 | Biden_2023 | folk | 0.326948 | 2023 |
| 82 | Biden_2023 | going | 0.315507 | 2023 |
| 144 | Biden_2023 | job | 0.269779 | 2023 |
| 558 | Biden_2023 | get | 0.176336 | 2023 |
| 734 | Biden_2023 | cancer | 0.161842 | 2023 |
| 761 | Biden_2023 | drug | 0.158991 | 2023 |
| 977 | Biden_2023 | tonight | 0.145714 | 2023 |
| 1084 | Biden_2023 | finish | 0.140049 | 2023 |
| 1250 | Biden_2023 | medicare | 0.132981 | 2023 |
| 1463 | Biden_2023 | covid | 0.125162 | 2023 |


<div class="alert alert-info" role="alert" style="color:blue">
    <h3><b>Exercises</b></h3>
    <p>12. Subset the dataframe by year, keeping only those speeches given in or after the year 2000.</p>
</div>

In [None]:
longdf_21C = longdf_sub.loc[(longdf_sub['year'] >= 2000), :]
longdf_21C = longdf_21C.sort_values(by = "year", ascending = False)
longdf_21C

## VI. TFIDF vectors with ngrams

13. With the function `tfidf_analysis` we can compile tfidf_scores for ngrams, including two-, three-, and four-word terms, by adjusting the minimum and maximum ngram length in the tuple passed to the parameter `ng_range`.
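
To illustrate what a minimum and maximum ngram length mean, here is a rough pure-Python sketch that builds every ngram between two lengths from a list of tokens. The helper `ngrams` and the token list are made up for illustration; this is not how scikit-learn is implemented internally, only the same idea in miniature.

```python
def ngrams(tokens, min_n, max_n):
    """Return all n-grams of length min_n..max_n as space-joined strings."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "free world must trust".split()   # made-up token list
bigrams_and_trigrams = ngrams(tokens, 2, 3)
print(bigrams_and_trigrams)
```

Note that scikit-learn removes stop words before it assembles ngrams, so an ngram can join words that were not adjacent in the original text. That is why a phrase such as "prisoners of war" can surface as a two-word term without the "of".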

In [25]:
N = 15
min_ng = 2
max_ng = 3
ng_longdf = tfidf_analysis(textdir, ng_range = (min_ng, max_ng), lemmas = True)
ng_longdf = ng_longdf.sort_values(by = "tfidf_score", ascending=False)
ng_longdf_sub = ng_longdf.groupby('textname').head(N).reset_index(drop=True)
print(ng_longdf_sub.shape)
ng_longdf_sub.head(50)



df shape:  (233, 5000)
df shape:  (233, 5000)
df shape:  (1165000, 3)
(3495, 3)


| | textname | word | tfidf_score |
|---|---|---|---|
| 0 | Bush_2003 | saddam hussein | 0.683307 |
| 1 | Madison_1815 | sum million | 0.589258 |
| 2 | Clinton_1994 | health care | 0.570977 |
| 3 | Truman_1953 | free world | 0.570115 |
| 4 | Madison_1813 | prisoner war | 0.566379 |
| 5 | Truman_1951 | free nation | 0.540958 |
| 6 | Eisenhower_1960 | free world | 0.535015 |
| 7 | Bush_2008 | must trust | 0.531823 |
| 8 | Johnson_1966 | south vietnam | 0.530202 |
| 9 | Biden_2021 | job plan | 0.528341 |


In [26]:
ng_longdf_sub.to_csv(f"sotu_{min_ng}-{max_ng}grams_tfidf_top{N}.csv", encoding = 'utf-8')