# We'll start at 2:10ish PM 

Waiting for others to join the call :)

https://www.youtube.com/watch?v=VHoT4N43jK8

<center><img src='img/cdy.png' style='width:500px; float: left; margin: 0px 30px 15px 0px'></center>

# TF-IDF and Word2Vec
## Class 20 - Data Science Curriculum 

<br>

#### Women Building Change scholarship program 2023 🇧🇮
May 8, 2023





# ⏪ Recap last class

- Vectorial representation of words 
    - One-Hot Encoding
    - Bag of Words
- Similarity of texts using
    - Euclidean distance
    - Cosine distance

# 🚀 Today's agenda

- TF-IDF
- Word2Vec

<center><img src='img/pipeline.png' style='width:1500px; margin: 0px 30px 15px 0px'></center>

# Google Colab: https://colab.research.google.com/

<br>
<center><img src='img/girl_coding.jpg' style='height:350px; float: center; margin: 0px 30px 15px 0px'></center>



# Let's do some text pre-processing

In [2]:
! python3 -m spacy download fr_core_news_md
import nltk
nltk.download('stopwords')

import json
import pandas as pd
import re
from nltk.corpus import stopwords
import spacy

with open('princesses.json', 'r') as f:
    data = json.load(f)

df = pd.DataFrame(data.items(), columns=['princess', 'description']).set_index('princess')

sw = stopwords.words("french")
lemma_obj = spacy.load('fr_core_news_md')

def text_pre_processing(text):
    text = text.lower() # Make it lowercase
    text = re.sub(r"[\W\d]", " ", text) # Replace punctuation and digits with spaces (regex)
    text = text.split() # Tokenize
    text = [word for word in text if word not in sw] # Remove stop words
    text = " ".join(text) # Make it a string

    # Lemmatization
    doc = lemma_obj(text)
    text = [token.lemma_ for token in doc]

    text = " ".join(text) # Make it a string
    return text

df['text_pp'] = df['description'].apply(text_pre_processing)

Collecting fr-core-news-md==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_md-3.3.0/fr_core_news_md-3.3.0-py3-none-any.whl (45.8 MB)
Installing collected packages: fr-core-news-md
Successfully installed fr-core-news-md-3.3.0
You should consider upgrading via the '/Users/vmarquez/opt/anaconda3/bin/python3 -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_md')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vmarquez/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# TF-IDF (Term Frequency, Inverse Document Frequency) 

- TF-IDF quantifies how important a word is to a document, relative to the other words in that document and to the rest of the corpus.
- It is frequently used in information retrieval systems and clustering algorithms.
- The more a word helps to distinguish a document from others, the higher its TF-IDF score will be.

### TF: Term Frequency

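- Term frequency: the number of occurrences of a word in a document, divided by the total number of words in that document.

$$ tf(t,d) = \dfrac{count(t,d)}{|d|}$$

where
- $t$ is the term
- $d$ is the document
- $count(t,d)$ is the number of times $t$ occurs in $d$
- $|d|$ is the number of words in $d$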

*Note: The term frequency is higher for words frequently used in a document.*
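
A minimal sketch of the term-frequency formula above. The `term_frequency` helper and the sample tokens are made up for illustration and are not part of the class notebook:

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """tf(t, d): occurrences of `term` in the document divided by its length."""
    if not document_tokens:
        return 0.0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# Hypothetical pre-processed document (already lowercased and tokenized)
doc = ["princesse", "aimer", "chanter", "princesse", "forêt"]

tf_princesse = term_frequency("princesse", doc)  # 2 occurrences out of 5 tokens
tf_chateau = term_frequency("château", doc)      # term absent from the document
print(tf_princesse, tf_chateau)
```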

### DF: Document Frequency

- Document frequency: the number of documents that contain the word, divided by the total number of documents.

$$ df(t,N) = \dfrac{|\{d_i: t\in d_i, i=1, \dots , N\}|}{N}$$

where
- $t$ is the term
- $d_i$ is the $i$-th document
- $N$ is the number of documents

*Note: Document frequency is higher for words used in many documents.*
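
The same idea for document frequency, as a small sketch. The `document_frequency` helper and the four toy documents are assumptions for illustration only:

```python
def document_frequency(term, corpus_tokens):
    """df(t, N): share of the documents in the corpus that contain `term`."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return 0.0
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return n_containing / n_docs

# Hypothetical corpus of four tokenized documents
corpus = [
    ["princesse", "aimer", "chanter"],
    ["princesse", "perdre", "chaussure"],
    ["princesse", "dormir", "longtemps"],
    ["sirène", "aimer", "chanter"],
]

df_princesse = document_frequency("princesse", corpus)  # in 3 of 4 documents
df_chaussure = document_frequency("chaussure", corpus)  # in 1 of 4 documents
print(df_princesse, df_chaussure)
```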

## So far we have TF and DF... how do we get to TF-IDF?





- A word that is specific to one document will have a very low document frequency

- Since the goal is to distinguish one document from another, we want to highlight words used frequently in one document but penalize them if they are present in all documents. This is called TF-IDF.

- Then, we obtain:

$$tfidf(t,d,N) = tf(t,d) \times \log\left(\frac{1}{df(t,N)}\right)$$

where
- $t$ is term
- $d$ is document
- $N$ is Number of documents

- When $t$ is in all documents, $df(t,N) = 1$, so $idf = \log(1) = 0$
 
- This makes sense since a word that is in all documents is very bad at distinguishing between documents
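
To make the formula concrete, here is a small worked sketch in plain Python. The helper names (`term_freq`, `doc_freq`, `tfidf`) and the toy corpus are assumptions for illustration; the natural logarithm is used, although the slides do not fix a base:

```python
import math

def term_freq(term, doc):
    """tf(t, d): count of the term divided by the document length."""
    return doc.count(term) / len(doc)

def doc_freq(term, corpus):
    """df(t, N): fraction of documents that contain the term."""
    return sum(1 for doc in corpus if term in doc) / len(corpus)

def tfidf(term, doc, corpus):
    """tf(t, d) * log(1 / df(t, N)); 0.0 if the term is not in the corpus."""
    df_value = doc_freq(term, corpus)
    if df_value == 0:
        return 0.0
    return term_freq(term, doc) * math.log(1 / df_value)

# Hypothetical corpus: "princesse" is in every document, "chaussure" in one
corpus = [
    ["princesse", "perdre", "chaussure"],
    ["princesse", "dormir", "longtemps"],
    ["princesse", "aimer", "chanter"],
]
doc = corpus[0]

score_common = tfidf("princesse", doc, corpus)  # idf = log(1) = 0
score_rare = tfidf("chaussure", doc, corpus)    # idf = log(3) > 0
print(score_common, score_rare)
```

The word shared by every document scores zero, while the word unique to the first document gets a positive score.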

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(df['text_pp'].values)

tfidf_matrix = pd.DataFrame(tfidf.toarray(), columns=tfidf_vect.get_feature_names_out())
tfidf_matrix.index = df.index

tfidf_matrix.T.round(3)

princess,Blanche-Neige,Cendrillon,Aurore,Ariel,Belle,Jasmine,Pocahontas,Mulan,Tiana,Raiponce,Mérida
absent,0.000,0.000,0.081,0.000,0.000,0.0,0.0,0.000,0.000,0.0,0.000
accomplir,0.127,0.000,0.000,0.000,0.000,0.0,0.0,0.000,0.000,0.0,0.000
admettre,0.000,0.000,0.081,0.000,0.000,0.0,0.0,0.000,0.000,0.0,0.000
adolescent,0.000,0.000,0.000,0.000,0.000,0.0,0.0,0.000,0.000,0.0,0.093
affirme,0.108,0.000,0.000,0.059,0.000,0.0,0.0,0.000,0.000,0.0,0.000
...,...,...,...,...,...,...,...,...,...,...,...
étroite,0.000,0.000,0.000,0.069,0.000,0.0,0.0,0.000,0.000,0.0,0.000
éviter,0.000,0.000,0.000,0.000,0.000,0.0,0.0,0.000,0.000,0.0,0.093
événement,0.000,0.000,0.000,0.000,0.128,0.0,0.0,0.000,0.000,0.0,0.000
être,0.000,0.053,0.000,0.046,0.000,0.0,0.0,0.091,0.104,0.0,0.000
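
The scores in this table will not match the formula from the slides exactly. With its default settings, scikit-learn's `TfidfVectorizer` uses raw counts for TF, a smoothed IDF of the form $\ln\frac{1+N}{1+n_t} + 1$ (where $n_t$ is the number of documents containing $t$), and then scales each document vector to unit length. The sketch below mimics that weighting in plain Python; `sklearn_style_tfidf` and the toy corpus are hypothetical names used only for illustration:

```python
import math
from collections import Counter

def sklearn_style_tfidf(corpus_tokens):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(corpus_tokens)
    # Number of documents that contain each term
    doc_counts = Counter(term for doc in corpus_tokens for term in set(doc))
    # Smoothed idf: ln((1 + N) / (1 + n_t)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + n)) + 1 for t, n in doc_counts.items()}
    rows = []
    for doc in corpus_tokens:
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Hypothetical tokenized corpus
corpus = [
    ["princesse", "perdre", "chaussure"],
    ["princesse", "dormir", "longtemps"],
    ["princesse", "aimer", "chanter"],
]
weights = sklearn_style_tfidf(corpus)
print(weights[0])
```

Under this variant a word that appears in every document keeps a small positive weight instead of dropping to zero, but it still weighs less than a word that is specific to one document.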


# Similarity measures
How similar are the documents?

# Euclidean distance

<br>
<center><img src='img/dist_euc.png' style='height:350px;'>
<small>Image credit: R-Bloggers</small></center>

In [4]:
from sklearn.metrics.pairwise import euclidean_distances

dist_euc = euclidean_distances(tfidf_matrix.values)
dist_euc = pd.DataFrame(dist_euc, columns = df.index, index = df.index)

dist_euc.style.background_gradient(cmap='Reds')

princess,Blanche-Neige,Cendrillon,Aurore,Ariel,Belle,Jasmine,Pocahontas,Mulan,Tiana,Raiponce,Mérida
Blanche-Neige,0.0,1.373237,1.355777,1.374014,1.368327,1.373438,1.393376,1.37762,1.386181,1.356605,1.345197
Cendrillon,1.373237,0.0,1.375634,1.381346,1.383473,1.382734,1.379908,1.375117,1.387408,1.354227,1.350327
Aurore,1.355777,1.375634,0.0,1.361973,1.372278,1.384766,1.368598,1.365271,1.326518,1.350259,1.378867
Ariel,1.374014,1.381346,1.361973,0.0,1.365308,1.363705,1.386612,1.378191,1.375674,1.361189,1.354413
Belle,1.368327,1.383473,1.372278,1.365308,0.0,1.389688,1.379667,1.390559,1.383216,1.349807,1.391718
Jasmine,1.373438,1.382734,1.384766,1.363705,1.389688,0.0,1.391717,1.381168,1.331025,1.37853,1.395605
Pocahontas,1.393376,1.379908,1.368598,1.386612,1.379667,1.391717,0.0,1.343865,1.349423,1.343287,1.38657
Mulan,1.37762,1.375117,1.365271,1.378191,1.390559,1.381168,1.343865,0.0,1.335878,1.366105,1.39579
Tiana,1.386181,1.387408,1.326518,1.375674,1.383216,1.331025,1.349423,1.335878,0.0,1.36573,1.378886
Raiponce,1.356605,1.354227,1.350259,1.361189,1.349807,1.37853,1.343287,1.366105,1.36573,0.0,1.367093
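
For reference, the quantity that `euclidean_distances` computes for each pair of rows can be sketched in a few lines of plain Python. The `euclidean_distance` helper and the two short vectors are made up for illustration:

```python
import math

def euclidean_distance(u, v):
    """Square root of the summed squared differences between coordinates."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two hypothetical document vectors
doc_a = [0.0, 0.6, 0.8]
doc_b = [0.6, 0.0, 0.8]

dist_ab = euclidean_distance(doc_a, doc_b)  # sqrt(0.36 + 0.36 + 0)
dist_aa = euclidean_distance(doc_a, doc_a)  # a document is at distance 0 from itself
print(dist_ab, dist_aa)
```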


# Cosine distance

<br>
<center><img src='img/dist_cos.png' style='height:350px;'>
<small>Image credit: R-Bloggers</small></center>

In [6]:
from sklearn.metrics.pairwise import cosine_distances

dist_cos = cosine_distances(tfidf_matrix.values)
dist_cos = pd.DataFrame(dist_cos, columns = df.index, index = df.index)
dist_cos.style.background_gradient(cmap='Reds')

princess,Blanche-Neige,Cendrillon,Aurore,Ariel,Belle,Jasmine,Pocahontas,Mulan,Tiana,Raiponce,Mérida
Blanche-Neige,0.0,0.942889,0.919066,0.943958,0.936159,0.943166,0.970748,0.948919,0.960749,0.920188,0.904777
Cendrillon,0.942889,0.0,0.946185,0.954058,0.956999,0.955976,0.952073,0.945473,0.96245,0.916965,0.911691
Aurore,0.919066,0.946185,0.0,0.927486,0.941573,0.958789,0.936531,0.931983,0.879825,0.9116,0.950637
Ariel,0.943958,0.954058,0.927486,0.0,0.932033,0.929845,0.961347,0.949705,0.946239,0.926418,0.917217
Belle,0.936159,0.956999,0.941573,0.932033,0.0,0.965616,0.95174,0.966827,0.956643,0.910989,0.968439
Jasmine,0.943166,0.955976,0.958789,0.929845,0.965616,0.0,0.968438,0.953813,0.885813,0.950172,0.973857
Pocahontas,0.970748,0.952073,0.936531,0.961347,0.95174,0.968438,0.0,0.902987,0.910472,0.90221,0.961288
Mulan,0.948919,0.945473,0.931983,0.949705,0.966827,0.953813,0.902987,0.0,0.892285,0.933121,0.974115
Tiana,0.960749,0.96245,0.879825,0.946239,0.956643,0.885813,0.910472,0.892285,0.0,0.932609,0.950664
Raiponce,0.920188,0.916965,0.9116,0.926418,0.910989,0.950172,0.90221,0.933121,0.932609,0.0,0.934472
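
The Euclidean and cosine tables tell the same story, and that is expected: `TfidfVectorizer` scales each row to unit length by default, and for unit-length vectors the squared Euclidean distance equals twice the cosine distance. The numbers above are consistent with this (for Blanche-Neige and Cendrillon, 1.373237² ≈ 1.8858 ≈ 2 × 0.942889). A small sketch with made-up unit vectors, using hypothetical helper names:

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two hypothetical vectors of length 1
u = [0.6, 0.8, 0.0]
v = [0.0, 0.6, 0.8]

cos_uv = cosine_distance(u, v)
euc_uv = euclidean_distance(u, v)
print(cos_uv, euc_uv ** 2)  # the second value is twice the first
```

On vectors that are not normalized the two measures can disagree, because cosine distance ignores vector length while Euclidean distance does not.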


# 🔮 Going from TF-IDF to Word2Vec

- So far, the vector representations of text that we have seen treat each word as an atomic unit, with no notion of similarity between words.
- The vectors are sparse and high-dimensional
- They cannot represent words outside the vocabulary

With distributed representations, such as **word2vec**, we can create dense, low-dimensional representations that capture distributional similarities between words.

# How does Word2Vec work?

<br>
<center><img src='img/espacio.jpg' style='height:350px;'></center>

- Word2Vec derives the meaning of a word from its context. That is, if two different words occur in similar contexts, then they are likely to have similar meanings

- [Mikolov et al., 2013] showed that neural networks can learn vector spaces that capture semantic relationships between words

- Text representations using neural networks are usually called **embeddings**


# 🛠️ CBOW (Continuous bag of words) & SkipGram

<br>
<center><img src='img/espacio.jpg' style='height:350px;'></center>

- These are variations of Word2Vec

- They are predictive models that estimate word probabilities within a context window.

- Both learn similar embeddings, but they take almost inverse paths to get there: CBOW predicts a word from its surrounding context, while SkipGram predicts the surrounding context from a word
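
The "inverse paths" can be seen in the training examples each variant builds from the same sentence. This is a sketch only: the sentence, the window size of 2, and the helper names are assumptions, not the exact preprocessing used by any library:

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the target word is the label."""
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    """SkipGram: the target word is the input, each context word is a label."""
    return [(target, word)
            for target, context in context_windows(tokens, window)
            for word in context]

sentence = "the quick brown fox jumped over the lazy dog".split()
cbow = cbow_examples(sentence, window=2)
skipgram = skipgram_examples(sentence, window=2)

print(cbow[4])  # context around "jumped" paired with the word to predict
print([pair for pair in skipgram if pair[0] == "jumped"])
```

CBOW produces one example per position, while SkipGram produces one example per (word, neighbor) pair, so it sees many more training pairs for the same text.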


# 🛠️ CBOW (Continuous bag of words) 
<br>
<center><img src='img/cbow_fox.png' style='height:200px;'>
    <small>What is the probability that "jumped" occurs given that...? $$P(jumped|...)$$</small></center>
    
<br>
    
<center><img src='img/cbow_prep.png' style='height:400px;'></center>

<center><img src='img/cbow_nn.png' style='height:600px;'></center>
    

# 🛠️ SkipGram
<br>
<center><img src='img/sg_fox.png' style='height:200px;'></center>
    
<br>
    
<center><img src='img/sg_prep.png' style='height:400px;'></center>

<center><img src='img/sg_nn.png' style='height:600px;'></center>
    

# 🧠 Neural network?

- We initialize each word $w$ in the vocabulary with a vector $v_w$ of random values

- The Word2Vec model then refines these vectors by training a neural network on a prediction task: CBOW predicts $w$ from the vectors of the words in its context, and SkipGram predicts the context words from $v_w$

- The resulting representations are dense and low-dimensional (256 and 300 dimensions in the models used below)
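
A rough sketch of the CBOW forward pass described above, in plain Python. It is a simplification under stated assumptions: a tiny made-up vocabulary, 4-dimensional vectors, a second set of "output" vectors `u`, and a full softmax (real implementations typically use cheaper approximations such as negative sampling). No training happens here, only the prediction step:

```python
import math
import random

random.seed(0)
vocab = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog"]
dim = 4  # embedding size (real models use a few hundred dimensions)

# Input vectors v_w and output vectors u_w, both randomly initialized
v = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
u = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def cbow_probabilities(context):
    """Softmax over the vocabulary, given the averaged context vectors."""
    hidden = [sum(v[w][k] for w in context) / len(context) for k in range(dim)]
    scores = {w: sum(h * x for h, x in zip(hidden, u[w])) for w in vocab}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exp_scores.values())
    return {w: e / total for w, e in exp_scores.items()}

probs = cbow_probabilities(["brown", "fox", "over", "the"])
p_jumped = probs["jumped"]
print(p_jumped)
```

With random vectors the probability of "jumped" is close to chance. Training nudges the vectors so that this probability grows for words that really occur in that context, and the refined vectors are the embeddings.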

# 🛠️ Pre-trained embeddings 

- Training your own embeddings on a large corpus is a very expensive process (in terms of computation and time)

- Fortunately, there are pre-trained embeddings available

    - 🇬🇧 English: [Stanford GloVe](https://nlp.stanford.edu/projects/glove/). Download [here](http://nlp.stanford.edu/data/glove.6B.zip)
    - 🌎 More languages: [FastText](https://fasttext.cc/docs/en/crawl-vectors.html)
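
GloVe vectors are distributed as plain text, one word per line followed by its vector values separated by spaces. A minimal sketch of reading that format is shown below. The `load_glove_vectors` helper is hypothetical, and an in-memory string with made-up values stands in for the real file (for example one of the `.txt` files inside `glove.6B.zip`):

```python
import io

def load_glove_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into {word: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Stand-in for an opened GloVe text file; the values are made up
fake_file = io.StringIO("game 0.1 -0.2 0.3\nplay 0.2 -0.1 0.4\n")
glove = load_glove_vectors(fake_file)
print(glove["game"])
```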

In [8]:
import ast
import pandas as pd
import gensim.downloader as api
from gensim.models import Word2Vec

In [9]:
def analogy(model, worda, wordb, wordc):
    '''
    worda is to wordb as wordc is to ...
    Returns the word whose vector is most similar to wordb - worda + wordc.
    '''
    result = model.most_similar(negative=[worda], 
                                positive=[wordb, wordc])
    return result[0][0]
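
The idea behind `analogy` is vector arithmetic: take `wordb - worda + wordc` and look for the nearest word by cosine similarity, ignoring the three input words. The sketch below reproduces that idea on hand-made 3-dimensional vectors. The vectors, the distractor words, and the `toy_analogy` helper are invented for illustration and only approximate what `most_similar` does internally:

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def toy_analogy(vectors, word_a, word_b, word_c):
    """Return the word closest (by cosine) to b - a + c, excluding the inputs."""
    unit = {w: normalize(vec) for w, vec in vectors.items()}
    query = [b - a + c for a, b, c in zip(unit[word_a], unit[word_b], unit[word_c])]
    query = normalize(query)
    candidates = (w for w in unit if w not in (word_a, word_b, word_c))
    return max(candidates, key=lambda w: sum(q * x for q, x in zip(query, unit[w])))

# Made-up axes: [relates to Colombia, relates to Burundi, is a capital city]
toy_vectors = {
    "Colombia": [1.0, 0.0, 0.0],
    "Bogota": [1.0, 0.0, 1.0],
    "Burundi": [0.0, 1.0, 0.0],
    "Bujumbura": [0.0, 1.0, 1.0],
    "Medellin": [1.0, 0.0, 0.5],
    "river": [0.3, 0.3, 0.3],
}

answer = toy_analogy(toy_vectors, "Colombia", "Bogota", "Burundi")
print(answer)
```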

In [10]:
google_wv = api.load('word2vec-google-news-300')



In [11]:
analogy(google_wv, 'Colombia','Bogota','Burundi')

'Bujumbura'

In [12]:
google_wv.most_similar("game")

[('games', 0.7636998295783997),
 ('play', 0.6501181125640869),
 ('match', 0.6485748887062073),
 ('matchup', 0.6120450496673584),
 ('agame', 0.5863147974014282),
 ('ballgame', 0.5731310248374939),
 ('thegame', 0.5718172192573547),
 ('opener', 0.5680001378059387),
 ('matches', 0.5580832958221436),
 ('tournament', 0.5496207475662231)]

In [14]:
df = pd.read_csv("rap.csv")

In [15]:
df.sample(5)

Unnamed: 0,ALink,SName,SLink,Lyric,language,pp
14946,/mos-def/,The Embassy,/mos-def/the-embassy.html,Mentioned that he worked for the embassy\nPeop...,en,"['mentioned', 'worked', 'embassy', 'people', '..."
16679,/master-p/,Rock It,/master-p/rock-it.html,"(feat. 5th Ward Weebie, Krazy)\n\n[Master P an...",en,"['feat', 'th', 'ward', 'weebie', 'krazy', 'mas..."
4035,/lil-wayne/,That's What They Call Me,/lil-wayne/thats-what-they-call-me.html,"[Lil Wayne - Verse 1]\nMan, I aint got nothing...",en,"['lil', 'wayne', 'verse', 'man', 'aint', 'got'..."
1506,/chris-brown/,Got To War For Ya,/chris-brown/got-to-war-for-ya.html,(trecho)\n\nWhat's the point in me havin' a cr...,en,"['trecho', 'point', 'havin', 'crown', 'got', '..."
12302,/fugees/,Blunted On Reality,/fugees/blunted-on-reality.html,"Intro)\n(*inhales, then coughs*)\nAy, nigga pa...",en,"['intro', 'inhales', 'coughs', 'ay', 'nigga', ..."


In [19]:
# The 'pp' column stores token lists as strings; parse them back into Python lists
df['pp'] = df['pp'].apply(ast.literal_eval)

In [20]:
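rap_model = Word2Vec(df['pp'].values,
                     sg=1,            # 1 = skip-gram, 0 = CBOW
                     seed=1,          # seed for the random initialization
                     vector_size=256, # dimensionality of the word vectors
                     min_count=50,    # ignore words that occur fewer than 50 times
                     window=12)       # max distance between a word and its context words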

In [21]:
rap_model.wv.most_similar("game")

[('fame', 0.5601926445960999),
 ('shame', 0.5253305435180664),
 ('rap', 0.49619030952453613),
 ('niggaz', 0.4614643156528473),
 ('aim', 0.45914676785469055),
 ('gain', 0.4551091194152832),
 ('gimmicks', 0.45426827669143677),
 ('games', 0.4541856348514557),
 ('maintain', 0.4448930621147156),
 ('name', 0.44138872623443604)]

In [28]:
google_wv.most_similar("game")

[('games', 0.7636998295783997),
 ('play', 0.6501181125640869),
 ('match', 0.6485748887062073),
 ('matchup', 0.6120450496673584),
 ('agame', 0.5863147974014282),
 ('ballgame', 0.5731310248374939),
 ('thegame', 0.5718172192573547),
 ('opener', 0.5680001378059387),
 ('matches', 0.5580832958221436),
 ('tournament', 0.5496207475662231)]

In [23]:
matrix = pd.DataFrame(google_wv.get_normed_vectors(), index=google_wv.index_to_key)
matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
</s>,0.067320,-0.053447,0.018991,0.091428,0.065955,-0.083695,-0.001819,-0.025018,-0.034342,0.064136,...,-0.092337,0.081876,-0.003625,-0.049125,0.079146,0.069139,0.033887,-0.093247,-0.007335,-0.005146
in,0.052956,0.065460,0.066195,0.047072,0.052221,-0.082009,-0.061415,-0.116210,0.015629,0.099293,...,-0.127242,-0.066931,-0.060679,0.048911,0.046153,-0.035672,-0.044314,-0.035856,0.010895,-0.047072
for,-0.008512,-0.034224,0.032284,0.045868,-0.013143,-0.046221,-0.000948,-0.052219,0.046574,0.062451,...,-0.016318,0.002690,-0.059628,0.058923,0.005733,0.000345,0.013319,0.051513,-0.025227,0.017465
that,-0.012361,-0.022230,0.065540,0.039477,-0.086620,0.024913,-0.011163,-0.070522,0.092369,0.092752,...,-0.008863,-0.012265,-0.026254,-0.016193,-0.015235,0.050209,0.015810,0.005390,0.047909,-0.116515
is,0.003746,-0.038920,0.091332,0.012000,-0.070575,0.105343,0.059936,-0.057342,0.038141,0.011092,...,-0.124024,-0.019330,-0.049817,0.097040,0.014400,0.067980,-0.013168,0.005968,0.087180,0.056823
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
RAFFAELE,0.013637,-0.074288,-0.027634,0.043783,0.054908,0.013099,-0.084695,0.052396,-0.077159,0.114841,...,-0.041271,-0.004194,-0.087207,-0.010856,-0.012650,-0.034632,0.024942,-0.018572,0.015611,0.039656
Bim_Skala_Bim,0.016599,0.059948,-0.057047,-0.001974,0.034970,0.065427,-0.078319,-0.110227,-0.044478,0.131499,...,0.114740,-0.054791,-0.048668,-0.030780,0.016437,-0.148259,-0.014020,-0.068006,-0.018855,0.068006
Mezze_Cafe,-0.020033,-0.092573,-0.019784,0.020033,-0.000234,0.041061,-0.051264,-0.012940,0.024636,0.013376,...,-0.047282,0.001097,-0.139358,-0.004635,0.023268,0.043798,-0.047780,-0.016673,-0.013687,0.047531
pulverizes_boulders,0.027951,-0.027534,0.030871,0.001004,-0.031288,-0.072589,0.031497,0.027742,0.054233,0.098037,...,0.017104,-0.060074,-0.143509,0.002633,-0.051522,-0.006232,-0.008396,-0.007561,0.023049,0.016270


In [27]:
matrix = pd.DataFrame(rap_model.wv.get_normed_vectors(), index=rap_model.wv.index_to_key)
matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,246,247,248,249,250,251,252,253,254,255
like,0.085370,-0.075596,0.097258,-0.142571,0.024031,-0.033467,0.064075,-0.025131,0.025017,0.076124,...,0.035348,-0.052638,0.012715,-0.012325,-0.021964,0.050976,0.002912,-0.024124,-0.138690,0.053977
get,-0.018450,-0.022875,0.033707,-0.100513,0.103531,-0.060459,0.025217,-0.094869,-0.002985,0.050575,...,-0.038272,0.036821,-0.035520,-0.028045,-0.065190,0.020895,-0.110848,-0.073369,-0.161038,0.040252
got,0.062818,-0.004717,0.056002,-0.055788,0.161225,-0.067321,-0.018808,-0.051887,-0.011999,0.092855,...,0.032761,0.107832,0.019446,-0.071148,-0.059053,0.065336,0.030922,-0.023967,-0.198040,0.054639
know,0.092219,-0.039774,0.021838,-0.165178,0.017003,-0.025243,0.018295,-0.079760,-0.010934,0.066495,...,0.055943,-0.097308,0.070908,-0.015377,0.020762,0.110590,0.031665,-0.000793,-0.085172,0.132437
nigga,0.031362,-0.091054,0.076987,0.003783,0.145514,0.019750,0.003984,-0.073198,-0.094429,0.069638,...,-0.015929,-0.130118,0.017100,-0.001576,0.055559,0.015639,-0.092133,-0.019137,-0.133689,0.031998
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
laker,-0.091635,-0.059359,0.049432,-0.027052,0.051276,0.040647,-0.022010,0.000144,0.049179,0.082729,...,0.039716,0.066397,0.075029,-0.024649,-0.020025,0.009812,-0.052290,0.013983,-0.082141,-0.044984
aahh,-0.007977,-0.092939,-0.075158,-0.046331,-0.074863,0.020637,0.048279,-0.084653,0.012969,0.039376,...,-0.023783,0.044097,0.002948,-0.097088,-0.004610,0.097879,0.036071,0.053854,0.159144,0.066427
worn,0.032061,-0.041073,-0.028772,0.023050,0.008815,-0.021604,0.134567,-0.114270,0.101783,0.094920,...,-0.004112,-0.004811,0.002921,-0.060393,-0.040535,-0.004875,-0.027006,0.049152,-0.103261,0.055036
spilling,-0.033893,-0.066507,0.078387,0.076874,0.106788,0.038120,0.076079,-0.046286,0.075716,0.010986,...,0.006373,-0.040128,-0.050631,0.059539,0.013564,0.024847,-0.076808,0.047732,-0.044988,-0.056792


# ⏪ Today's recap

- TF-IDF
- Word2Vec

<center><img src='img/bye.gif' style='height:250px;'></center> 

# Next class: Performance metrics
# See you next Friday!