# Text by the Numbers: Word Vectors

**A Reproducible Research Workshop**

(A Collaboration between Dartmouth Library and Research Computing)

[*Click here to view or register for our current list of workshops*](http://dartgo.org/RRADworkshops)

*This notebook created by*:
+ Version 1.0: Jeremy Mikecz, Research Data Services (Dartmouth Library)
+ Version 2.0: ???
<!--
+ Some of the inspiration for the code and information in this notebook was taken from https://www.w3schools.com/python/python_intro.asp -- This is a great resource if you want to learn more about Python!-->


## I. Word Vectors

[types of vectorization]

[what can we learn?]

## II: Setup

1. Before beginning, we need to import some packages.

In [15]:
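from pathlib import Path
import pandas as pd

# Folder containing one plain-text file per State of the Union address
textdir = Path("~/shared/RR-workshop-data/state-of-the-union-dataset/txt").expanduser()
# Sorted list of paths to every .txt file in that folder
pathlist = sorted(textdir.glob('*.txt'))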

# Term Frequency - Inverse Document Frequency (TF-IDF)

<img src = "https://miro.medium.com/max/720/1*qQgnyPLDIkUmeZKN2_ZWbQ.webp" style="width:60%">

Image from Yassine Hamdaoui, ["TF(Term Frequency)-IDF(Inverse Document Frequency) from scratch in python"](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558) *Towards Data Science (Medium)* (Dec. 9, 2019).

## III. TF-IDF with Scikit-Learn [MW]

Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document. 

In this lesson, we will calculate and normalize tf-idf scores for a collection of plain text (.txt) files, the U.S. State of the Union addresses, using the Python library scikit-learn and its quick and nifty `TfidfVectorizer` class.


## IV. Breaking Down the TF-IDF Formula [MW]

But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.

**tf-idf = term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1\*

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once, divided by the total number of documents (document frequency), and you flip that fraction on its head (inverse document frequency); the formula then takes the logarithm of the flipped fraction and adds 1. Finally, you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs the word “pigeons.” The term “said” appears in 13 (document frequency) of the 24 (total documents) stories in *Lost in the City* (24 / 13 → a smaller inverse document frequency), while the term “pigeons” occurs in only 2 (document frequency) of the 24 stories (24 / 2 → a bigger inverse document frequency and a bigger tf-idf boost).

\*There are a bunch of slightly different ways that you can calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing”: it adds 1 to both the numerator and the denominator.

**inverse_document_frequency** = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
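
To make the smoothed formula concrete, here is a minimal sketch that plugs the “said” and “pigeons” counts from the example above into it. The helper name `smoothed_idf` is hypothetical (it is not a scikit-learn function), and the sketch assumes the natural logarithm, which is what scikit-learn uses internally.

```python
import math


def smoothed_idf(n_documents, n_documents_with_term):
    """Smoothed inverse document frequency, following the formula above."""
    return math.log((1 + n_documents) / (1 + n_documents_with_term)) + 1


idf_said = smoothed_idf(24, 13)     # "said" appears in 13 of 24 stories
idf_pigeons = smoothed_idf(24, 2)   # "pigeons" appears in 2 of 24 stories

print(f"idf('said')    = {idf_said:.3f}")
print(f"idf('pigeons') = {idf_pigeons:.3f}")
```

The rarer word receives the larger idf, so if both words occurred equally often in a given story, “pigeons” would end up with the higher tf-idf score.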

## V. Calculate tf-idf [MW]

To calculate tf–idf scores for every word, we’re going to use scikit-learn’s TfidfVectorizer.

4. When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run TfidfVectorizer is with smoothing (smooth_idf = True) and normalization (norm='l2') turned on. These parameters will better account for differences in text length, and overall produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for TfidfVectorizer, so to turn them on, you don’t need to include any extra code at all.
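
As a rough illustration of what these two defaults do, the sketch below fits a `TfidfVectorizer` with default settings on three invented toy documents of different lengths (the texts and variable names are assumptions for illustration only, not part of the workshop data). It checks that every document vector has length 1 after L2 normalization and that the learned idf weights match the smoothed formula.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy "documents" of very different lengths (invented for illustration)
toy_docs = [
    "gold silver gold",
    "gold treaty",
    "treaty war war war peace treaty war peace",
]

# Defaults: smooth_idf=True and norm='l2'
default_vectorizer = TfidfVectorizer()
default_matrix = default_vectorizer.fit_transform(toy_docs)

# With L2 normalization, every document vector has Euclidean length 1,
# regardless of how long the original text was
row_lengths = np.linalg.norm(default_matrix.toarray(), axis=1)
print(row_lengths)

# The learned idf weights follow the smoothed formula
n_docs = len(toy_docs)
doc_freq = (default_matrix.toarray() > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(np.allclose(default_vectorizer.idf_, manual_idf))
```

Because each row is rescaled to unit length, scores from a short text and a long text can be compared on the same scale.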

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization).

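**Note:** tf-idf matrices can become very large even for a modest number of texts. The second vectorizer below therefore ignores words that appear in more than half of the texts (`max_df = 0.5`) and caps the vocabulary at 5,000 words (`max_features=5000`).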

In [16]:
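from sklearn.feature_extraction.text import TfidfVectorizer
# Default settings (smoothing and L2 normalization); each document is read from its file path
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')
# Same, but ignoring words found in more than half of the texts and keeping at most 5,000 words
tfidf_vectorizer2 = TfidfVectorizer(input='filename', stop_words='english', max_df=0.5, max_features=5000)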
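
To see how `max_df` and `max_features` shrink the vocabulary, here is a small sketch on four invented sentences (the toy texts and variable names are assumptions for illustration only). It uses the default `input='content'`, so the strings are passed directly rather than as file paths.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "the union is strong",
    "the union must act",
    "the treasury reports gold",
    "the navy reports ships",
]

# No limits: every word of two or more characters becomes a feature
full = TfidfVectorizer().fit(toy_docs)

# max_df=0.5 drops words that occur in MORE than half of the documents ("the")
no_common = TfidfVectorizer(max_df=0.5).fit(toy_docs)

# max_features=3 then keeps only the 3 most frequent remaining words
capped = TfidfVectorizer(max_df=0.5, max_features=3).fit(toy_docs)

print(len(full.vocabulary_), sorted(full.vocabulary_))
print(len(no_common.vocabulary_), sorted(no_common.vocabulary_))
print(len(capped.vocabulary_), sorted(capped.vocabulary_))
```

Note that “union” and “reports” each occur in exactly half of the documents and are kept, because `max_df` only removes words whose document frequency is strictly above the threshold.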

5. Run TfidfVectorizer on our list of text files (`pathlist`)

In [17]:
tfidf_vector = tfidf_vectorizer2.fit_transform(pathlist)
tfidf_vector


<233x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 207870 stored elements in Compressed Sparse Row format>

6. Make a DataFrame out of the resulting tf–idf vector, setting the “feature names” or words as columns and the titles as rows

In [18]:
text_titles = [path.stem for path in pathlist]
# TfidfVectorizer returns a sparse matrix, so we call .toarray() to get a dense array for the DataFrame.
# get_feature_names_out() replaces get_feature_names(), which was deprecated and later removed from scikit-learn.
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer2.get_feature_names_out())
print(tfidf_df)

              00   08  100      10th   11  112  11th        12  120  125  ...  \
Adams_1797   0.0  0.0  0.0  0.000000  0.0  0.0   0.0  0.000000  0.0  0.0  ...   
Adams_1798   0.0  0.0  0.0  0.000000  0.0  0.0   0.0  0.000000  0.0  0.0  ...   
Adams_1799   0.0  0.0  0.0  0.000000  0.0  0.0   0.0  0.000000  0.0  0.0  ...   
Adams_1800   0.0  0.0  0.0  0.000000  0.0  0.0   0.0  0.000000  0.0  0.0  ...   
Adams_1825   0.0  0.0  0.0  0.046041  0.0  0.0   0.0  0.011819  0.0  0.0  ...   
...          ...  ...  ...       ...  ...  ...   ...       ...  ...  ...  ...   
Wilson_1916  0.0  0.0  0.0  0.000000  0.0  0.0   0.0  0.000000  0.0  0.0  ...   
Wilson_1917  0.0  0.0  0.0  0.000000  0.0  0.0   0.0  0.000000  0.0  0.0  ...   
Wilson_1918  0.0  0.0  0.0  0.000000  0.0  0.0   0.0  0.000000  0.0  0.0  ...   
Wilson_1919  0.0  0.0  0.0  0.000000  0.0  0.0   0.0  0.000000  0.0  0.0  ...   
Wilson_1920  0.0  0.0  0.0  0.000000  0.0  0.0   0.0  0.000000  0.0  0.0  ...   

                yield  yiel

In [19]:
tfidf_df.head()

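| | 00 | 08 | 100 | 10th | 11 | 112 | 11th | 12 | 120 | 125 | ... | yield | yielded | yielding | york | young | younger | youth | zeal | zone | zones |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adams_1797 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Adams_1798 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Adams_1799 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.045193 | 0.0 | 0.0 |
| Adams_1800 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Adams_1825 | 0.0 | 0.0 | 0.0 | 0.046041 | 0.0 | 0.0 | 0.0 | 0.011819 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |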


In [20]:
tfidf_df.index.name = "textname"
tfidf_df = tfidf_df.reset_index()
tfidf_df.head()

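| | textname | 00 | 08 | 100 | 10th | 11 | 112 | 11th | 12 | 120 | ... | yield | yielded | yielding | york | young | younger | youth | zeal | zone | zones |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adams_1797 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Adams_1798 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Adams_1799 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.045193 | 0.0 | 0.0 |
| 3 | Adams_1800 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Adams_1825 | 0.0 | 0.0 | 0.0 | 0.046041 | 0.0 | 0.0 | 0.0 | 0.011819 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |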


In [21]:
tfidf_long =  pd.melt(tfidf_df, id_vars = "textname", var_name = "word", value_name = "tfidf_score", value_vars = list(tfidf_df.drop(columns = ["textname"]).columns))
tfidf_long.head()

| | textname | word | tfidf_score |
|---|---|---|---|
| 0 | Adams_1797 | 00 | 0.0 |
| 1 | Adams_1798 | 00 | 0.0 |
| 2 | Adams_1799 | 00 | 0.0 |
| 3 | Adams_1800 | 00 | 0.0 |
| 4 | Adams_1825 | 00 | 0.0 |
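
The wide-to-long reshaping done by `pd.melt` is easier to see on a tiny table. The sketch below uses an invented two-text, two-word DataFrame (the names `wide` and `long_df` and all values are assumptions for illustration). It omits `value_vars`, since `pd.melt` by default melts every column not listed in `id_vars`.

```python
import pandas as pd

# Wide format: one row per text, one column per word
wide = pd.DataFrame(
    {
        "textname": ["Speech_A", "Speech_B"],
        "gold": [0.5, 0.0],
        "war": [0.0, 0.7],
    }
)

# Long format: one row per (text, word) pair
long_df = pd.melt(wide, id_vars="textname", var_name="word", value_name="tfidf_score")
print(long_df)
```

A wide table with 2 texts and 2 words becomes a long table with 2 x 2 = 4 rows, in the same way that 233 texts and 5,000 words become 1,165,000 rows.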


In [22]:
print(tfidf_long.shape)
tfidf_long = tfidf_long[tfidf_long['tfidf_score'] > 0.0]
print(tfidf_long.shape)

(1165000, 3)
(207870, 3)


In [23]:
#get top 15 tfidf scores for each text
N = 15
tfidf_long = tfidf_long.sort_values(by = "tfidf_score", ascending=False)
print(tfidf_long.shape)
tfidf_sub = tfidf_long.groupby('textname').head(N).reset_index(drop=True)

tfidf_sub.head(50)

#textnames = list(tfidf_long['textname'].unique())

#for i, text in enumerate(textnames):
#    onetext_df = tfidf_sub[tfidf_sub['textname'] == text]
#    print(onetext_df.head(10))

(207870, 3)


| | textname | word | tfidf_score |
|---|---|---|---|
| 0 | Taft_1910 | 00 | 0.790884 |
| 1 | Johnson_1966 | vietnam | 0.663404 |
| 2 | Arthur_1883 | 00 | 0.618767 |
| 3 | Carter_1980 | soviet | 0.607743 |
| 4 | Arthur_1882 | 00 | 0.595908 |
| 5 | Cleveland_1895 | gold | 0.563849 |
| 6 | Polk_1846 | mexico | 0.554145 |
| 7 | Eisenhower_1961 | 1953 | 0.53274 |
| 8 | Polk_1847 | mexico | 0.532682 |
| 9 | Harrison_1892 | 1892 | 0.514329 |
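
The "sort, then `groupby(...).head(N)`" pattern used above keeps the N highest-scoring words for each text, but the result stays ordered by score across all texts, which is why different speeches are interleaved in the output. A minimal sketch on invented data (all names and values here are assumptions for illustration) shows the pattern, plus one extra sort that groups each text's words together:

```python
import pandas as pd

scores = pd.DataFrame(
    {
        "textname": ["A", "A", "A", "B", "B", "B"],
        "word": ["gold", "war", "peace", "gold", "war", "peace"],
        "tfidf_score": [0.2, 0.9, 0.5, 0.8, 0.1, 0.3],
    }
)

top2 = (
    scores.sort_values(by="tfidf_score", ascending=False)  # highest scores first
    .groupby("textname")
    .head(2)                                               # first 2 rows per text
    .sort_values(by=["textname", "tfidf_score"], ascending=[True, False])
    .reset_index(drop=True)
)
print(top2)
```

Here text A keeps “war” and “peace”, while text B keeps “gold” and “peace”.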


# [LEMMAS?]

In [24]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# Requires the NLTK data packages 'stopwords' and 'wordnet' (install once with nltk.download()).
# ASSUMPTION: the original cell called a `tokenizer` object that was never defined,
# which raised a NameError. A simple word-character tokenizer is substituted here.
tokenizer = RegexpTokenizer(r"\w+")


# Interface between NLTK's lemmatizer and scikit-learn's `tokenizer` parameter
class LemmaTokenizer:
    ignore_tokens = [',', '.', ';', ':', '"', '``', "''", '`']

    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        # Tokenize the document, drop punctuation tokens, and lemmatize the rest
        return [self.wnl.lemmatize(t) for t in tokenizer.tokenize(doc) if t not in self.ignore_tokens]


lemma_tokenizer = LemmaTokenizer()
eng_stops = set(stopwords.words('english'))
# Lemmatize the stop words too, so that they match the lemmatized tokens
lemma_stop = lemma_tokenizer(' '.join(eng_stops))


def tfidf_analysis(textdir, ng_range=(1, 1), lemmas=False):
    '''
    Read every .txt file in a folder and return a long-format tf-idf DataFrame.

    textdir  = pathlib Path object for the folder containing the .txt files to be analyzed
    ng_range = range of n-grams to be analyzed, e.g. (1, 2) will analyze unigrams and bigrams
    lemmas   = if True, lemmatize tokens with LemmaTokenizer before scoring
    '''
    if lemmas:
        tfidf_vectorizer = TfidfVectorizer(input="filename", stop_words=lemma_stop, tokenizer=lemma_tokenizer, ngram_range=ng_range, max_df=0.5, max_features=5000)
    else:
        tfidf_vectorizer = TfidfVectorizer(input="filename", stop_words="english", ngram_range=ng_range, max_df=0.5, max_features=5000)

    pathlist = sorted(textdir.glob('*.txt'))
    tfidf_vector = tfidf_vectorizer.fit_transform(pathlist)
    text_titles = [path.stem for path in pathlist]
    tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
    print("df shape: ", tfidf_df.shape)
    # Keep only the words whose highest score in any text exceeds 0.02
    tfidf_df = tfidf_df.loc[:, (tfidf_df.max(numeric_only=True) > 0.02)]
    print("df shape: ", tfidf_df.shape)
    tfidf_df.index.name = "textname"
    tfidf_df = tfidf_df.reset_index()
    # Reshape from wide (one column per word) to long (one row per text-word pair)
    tfidf_long = pd.melt(tfidf_df, id_vars="textname", var_name="word", value_name="tfidf_score", value_vars=list(tfidf_df.drop(columns=["textname"]).columns))

    tfidf_long = tfidf_long.sort_values(by='tfidf_score', ascending=False)
    print("df shape: ", tfidf_long.shape)
    print(tfidf_long.head(10))
    return tfidf_long

In [None]:
# takes 4x longer to run with the lemmatizing tokenizer!
N = 15
longdf = tfidf_analysis(textdir, (1,1), lemmas = True)
longdf = longdf.sort_values(by = "tfidf_score", ascending=False)
longdf_sub = longdf.groupby('textname').head(N).reset_index(drop=True)
print(longdf_sub.shape)
longdf_sub.head(50)


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'



df shape:  (233, 5000)
df shape:  (233, 5000)
df shape:  (1165000, 3)
                textname       word  tfidf_score
431            Taft_1910         00     0.799647
1127606     Johnson_1966    vietnam     0.668202
995180       Carter_1980     soviet     0.626473
243          Arthur_1883         00     0.624692
242          Arthur_1882         00     0.590526
533848    Cleveland_1895       gold     0.579545
214643      Clinton_1996  challenge     0.548781
706157         Polk_1846     mexico     0.543914
706158         Polk_1847     mexico     0.538077
29426    Eisenhower_1961       1953     0.528814
(3495, 3)


| | textname | word | tfidf_score |
|---|---|---|---|
| 0 | Taft_1910 | 00 | 0.799647 |
| 1 | Johnson_1966 | vietnam | 0.668202 |
| 2 | Carter_1980 | soviet | 0.626473 |
| 3 | Arthur_1883 | 00 | 0.624692 |
| 4 | Arthur_1882 | 00 | 0.590526 |
| 5 | Cleveland_1895 | gold | 0.579545 |
| 6 | Clinton_1996 | challenge | 0.548781 |
| 7 | Polk_1846 | mexico | 0.543914 |
| 8 | Polk_1847 | mexico | 0.538077 |
| 9 | Eisenhower_1961 | 1953 | 0.528814 |


## VI. TFIDF vectors with ngrams

In [None]:
N = 15
ng_longdf = tfidf_analysis(textdir, (2,2), lemmas = True)
ng_longdf = ng_longdf.sort_values(by = "tfidf_score", ascending=False)
ng_longdf_sub = ng_longdf.groupby('textname').head(N).reset_index(drop=True)
print(ng_longdf_sub.shape)
ng_longdf_sub.head(50)


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'



df shape:  (233, 5000)
df shape:  (233, 5000)
df shape:  (1165000, 3)
               textname            word  tfidf_score
901506        Bush_2003  saddam hussein     0.686624
933818   Roosevelt_1936       shall say     0.676655
810740     Madison_1813    prisoner war     0.580559
383727      Truman_1953      free world     0.578242
458592     Clinton_1994     health care     0.568968
327         Hoover_1930         000 000     0.539927
938359  Eisenhower_1961      since 1953     0.538360
382327      Truman_1951     free nation     0.536234
383585  Eisenhower_1960      free world     0.535518
656861        Bush_2008      must trust     0.531432
(3495, 3)


| | textname | word | tfidf_score |
|---|---|---|---|
| 0 | Bush_2003 | saddam hussein | 0.686624 |
| 1 | Roosevelt_1936 | shall say | 0.676655 |
| 2 | Madison_1813 | prisoner war | 0.580559 |
| 3 | Truman_1953 | free world | 0.578242 |
| 4 | Clinton_1994 | health care | 0.568968 |
| 5 | Hoover_1930 | 000 000 | 0.539927 |
| 6 | Eisenhower_1961 | since 1953 | 0.53836 |
| 7 | Truman_1951 | free nation | 0.536234 |
| 8 | Eisenhower_1960 | free world | 0.535518 |
| 9 | Bush_2008 | must trust | 0.531432 |


In [None]:
ng_longdf_sub.to_csv("sotu_2grams_tfidf_5000maxfeats_top15.csv", encoding = 'utf-8')

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
tfidf_vectorizer3 = TfidfVectorizer(input = "filename", stop_words = lemma_stop, tokenizer = lemma_tokenizer)
tfidf_matrix = tfidf_vectorizer3.fit_transform(pathlist)
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'



[[1.         0.43991929 0.403336   ... 0.19580329 0.18803809 0.20870311]
 [0.43991929 1.         0.36556119 ... 0.18971342 0.18305033 0.18292755]
 [0.403336   0.36556119 1.         ... 0.20212655 0.19737592 0.20642286]
 ...
 [0.19580329 0.18971342 0.20212655 ... 1.         0.40592595 0.31038614]
 [0.18803809 0.18305033 0.19737592 ... 0.40592595 1.         0.38413319]
 [0.20870311 0.18292755 0.20642286 ... 0.31038614 0.38413319 1.        ]]


In [None]:
cosine_sim.shape

(233, 233)
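
Each cell of this matrix compares one text with another: 1.0 means identical tf-idf vectors (the diagonal, where a text is compared with itself) and 0.0 means no shared vocabulary. As one possible way to read such a matrix, the sketch below finds the most similar pair among three invented toy texts (the titles, texts, and variable names are assumptions for illustration only).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_titles = ["Speech_A", "Speech_B", "Speech_C"]
toy_docs = [
    "gold silver currency treasury",
    "gold currency treasury banks",
    "war navy army treaty",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_docs)
toy_sim = cosine_similarity(toy_matrix)  # 3 x 3 matrix of pairwise similarities

# Mask the diagonal so a text is never reported as most similar to itself
np.fill_diagonal(toy_sim, -1.0)

# Locate the highest remaining similarity score
i, j = np.unravel_index(np.argmax(toy_sim), toy_sim.shape)
print(toy_titles[i], toy_titles[j], round(float(toy_sim[i, j]), 3))
```

The first two toy texts share most of their words and are reported as the closest pair, while the third shares no words with either and scores 0.0 against both.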