# We'll start at 2:10ish PM 

Waiting for others to join the call :)

https://www.youtube.com/watch?v=VHoT4N43jK8

<center><img src='img/cdy.png' style='width:500px; float: left; margin: 0px 30px 15px 0px'></center>

# TF-IDF and Word2Vec
## Class 20 - Data Science Curriculum 

<br>

#### Women Building Change scholarship program 2023 🇧🇮
May 8, 2023





# ⏪ Recap last class

- Vectorial representation of words 
    - One-Hot Encoding
    - Bag of Words
- Similarity of texts using
    - Euclidean distance
    - Cosine distance

# 🚀 Today's agenda

- TF-IDF
- Word2Vec

<center><img src='img/pipeline.png' style='width:1500px; margin: 0px 30px 15px 0px'></center>

# Google Colab: https://colab.research.google.com/

<br>
<center><img src='img/girl_coding.jpg' style='height:350px; float: center; margin: 0px 30px 15px 0px'></center>



# Let's do some text pre-processing

In [2]:
! python3 -m spacy download fr_core_news_md
import nltk
nltk.download('stopwords')

import json
import pandas as pd
import re
from nltk.corpus import stopwords
import spacy

with open('princesses.json', 'r') as f:
    data = json.load(f)

df = pd.DataFrame(data.items(), columns=['princess', 'description']).set_index('princess')

sw = stopwords.words("french")
lemma_obj = spacy.load('fr_core_news_md')

def text_pre_processing(text):
    text = text.lower() # Make it lowercase
    text = re.sub(r"[\W\d]", " ", text) # Replace punctuation and digits with spaces (regex)
    text = text.split() # Tokenize
    text = [word for word in text if word not in sw] # Remove stop words
    text = " ".join(text) # Make it a string

    # Lemmatization
    doc = lemma_obj(text)
    text = [token.lemma_ for token in doc]

    text = " ".join(text) # Make it a string
    return text

df['text_pp'] = df['description'].apply(text_pre_processing)



# TF-IDF (Term Frequency, Inverse Document Frequency) 

- It seeks to quantify the importance of a word relative to the other words in the document and in the corpus.
- It is frequently used in information retrieval systems and clustering algorithms.
- The more a word helps to distinguish a document from others, the higher its TF-IDF score will be.

### TF: Term Frequency

- Term frequency: the number of occurrences of a word in a document, divided by the total number of words in that document.

$$ tf(t,d) = \dfrac{count(t,d)}{|d|}$$

where
- $t$ is the term
- $d$ is the document

*Note: The term frequency is higher for words frequently used in a document.*
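
The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used later in class: the function name `term_frequency` and the toy document are assumptions made up for this example.

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """tf(t, d): occurrences of `term` in the document divided by the document length."""
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# Toy tokenized document (hypothetical, not taken from princesses.json)
doc = "princesse aimer chanter princesse forêt".split()

tf_princesse = term_frequency("princesse", doc)  # 2 occurrences out of 5 tokens
tf_foret = term_frequency("forêt", doc)          # 1 occurrence out of 5 tokens
print(tf_princesse, tf_foret)
```

A word that is repeated in the document gets a higher term frequency than a word that appears only once.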

### DF: Document Frequency

- Document frequency: the number of documents that contain the word, divided by the total number of documents.

$$ df(t,N) = \dfrac{|\{d_i: t\in d_i, i=1, \dots , N\}|}{N}$$

where
- $t$ is the term
- $N$ is the number of documents

*Note: The document frequency is higher for words used in many documents.*
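
As a rough sketch of the same definition in plain Python (the function name `document_frequency` and the three-document toy corpus are assumptions for illustration only):

```python
def document_frequency(term, corpus_tokens):
    """df(t, N): share of documents in the corpus that contain `term` at least once."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return n_containing / len(corpus_tokens)

# Toy corpus of three tokenized documents (hypothetical)
corpus = [
    "princesse aimer chanter".split(),
    "princesse vivre château".split(),
    "princesse aimer lire".split(),
]

df_princesse = document_frequency("princesse", corpus)  # present in 3 of 3 documents
df_chateau = document_frequency("château", corpus)      # present in 1 of 3 documents
print(df_princesse, df_chateau)
```

Note that a word counts once per document here, no matter how many times it is repeated inside that document.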

## So far we have TF and DF... how do we get to TF-IDF?





- A word that is specific to one document will have a very low document frequency

- Since the goal is to distinguish one document from another, we want to highlight words used frequently in one document but penalize them if they are present in all documents. This is called TF-IDF.

- Then, we obtain:

$$tfidf (t,d,N) = tf(t,d) \times \log\big(\frac{1}{df(t,N)}\big)$$

where
- $t$ is the term
- $d$ is the document
- $N$ is the number of documents

- When $t$ is in all documents, $df(t,N) = 1$, so $idf = \log(1) = 0$
 
- This makes sense, since a word that appears in every document is of no use for distinguishing between documents
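
To make the formula concrete, here is a small by-hand sketch in plain Python. The helper names (`tf`, `doc_freq`, `tfidf`) and the toy corpus are assumptions for this example, and the sketch assumes the term occurs in at least one document (otherwise the document frequency is 0 and the division fails).

```python
import math
from collections import Counter

def tf(term, doc):
    """Term frequency: count of `term` in `doc` divided by the length of `doc`."""
    return Counter(doc)[term] / len(doc)

def doc_freq(term, corpus):
    """Document frequency: share of documents that contain `term`."""
    return sum(1 for doc in corpus if term in doc) / len(corpus)

def tfidf(term, doc, corpus):
    """tf(t, d) * log(1 / df(t, N)); assumes `term` occurs somewhere in the corpus."""
    return tf(term, doc) * math.log(1 / doc_freq(term, corpus))

# Toy corpus of three tokenized documents (hypothetical)
corpus = [
    "princesse aimer chanter".split(),
    "princesse vivre château".split(),
    "princesse aimer lire".split(),
]

score_princesse = tfidf("princesse", corpus[0], corpus)  # in every document -> log(1) = 0
score_aimer = tfidf("aimer", corpus[0], corpus)          # in 2 of 3 documents
score_chanter = tfidf("chanter", corpus[0], corpus)      # only in this document
print(score_princesse, score_aimer, score_chanter)
```

All three words have the same term frequency in the first document, so the ranking comes only from the idf part: the rarer the word across the corpus, the higher its score.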

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(df['text_pp'].values)

tfidf_matrix = pd.DataFrame(tfidf.toarray(), columns=tfidf_vect.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
tfidf_matrix.index = df.index

tfidf_matrix.T.round(3)

| term | Blanche-Neige | Cendrillon | Aurore | Ariel | Belle | Jasmine | Pocahontas | Mulan | Tiana | Raiponce | Mérida |
|---|---|---|---|---|---|---|---|---|---|---|---|
| absent | 0.000 | 0.000 | 0.081 | 0.000 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 |
| accomplir | 0.127 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 |
| admettre | 0.000 | 0.000 | 0.081 | 0.000 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 |
| adolescent | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.093 |
| affirme | 0.108 | 0.000 | 0.000 | 0.059 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| étroite | 0.000 | 0.000 | 0.000 | 0.069 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 |
| éviter | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.093 |
| événement | 0.000 | 0.000 | 0.000 | 0.000 | 0.128 | 0.0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 |
| être | 0.000 | 0.053 | 0.000 | 0.046 | 0.000 | 0.0 | 0.0 | 0.091 | 0.104 | 0.0 | 0.000 |
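
The scores in this table will not match a by-hand computation with the formula from the previous slides. As far as the scikit-learn documentation describes the defaults of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), it uses the raw count as tf, a smoothed idf of $\ln\big(\frac{1+N}{1+n_t}\big) + 1$ (with $n_t$ the number of documents containing $t$), and then rescales each document vector to unit length. The plain-Python sketch below mimics that variant on a toy corpus; the function name `sklearn_style_tfidf` and the corpus are assumptions for illustration, not scikit-learn code.

```python
import math
from collections import Counter

def sklearn_style_tfidf(corpus_tokens):
    """Raw-count tf * smoothed idf, then L2-normalize each document vector."""
    n_docs = len(corpus_tokens)
    vocab = sorted({t for doc in corpus_tokens for t in doc})
    # Number of documents containing each term
    doc_count = {t: sum(1 for doc in corpus_tokens if t in doc) for t in vocab}
    # Smoothed idf: ln((1 + N) / (1 + n_t)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_count[t])) + 1 for t in vocab}
    rows = []
    for doc in corpus_tokens:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy corpus of three tokenized documents (hypothetical)
corpus = [
    "princesse aimer chanter".split(),
    "princesse vivre château".split(),
    "princesse aimer lire".split(),
]

vocab, rows = sklearn_style_tfidf(corpus)
row_norms = [math.sqrt(sum(v * v for v in row)) for row in rows]
print(vocab)
print(row_norms)  # each document vector has length 1 (up to rounding)
```

With this smoothed variant, a word that appears in every document keeps a small non-zero weight instead of dropping to exactly 0.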


# Similarity measures
How similar are the documents?

# Euclidean distance

<br>
<center><img src='img/dist_euc.png' style='height:350px;'>
<small>Image credit: R-Bloggers</small></center>

In [4]:
from sklearn.metrics.pairwise import euclidean_distances

dist_euc = euclidean_distances(tfidf_matrix.values)
dist_euc = pd.DataFrame(dist_euc, columns = df.index, index = df.index)

dist_euc.style.background_gradient(cmap='Reds')

| princess | Blanche-Neige | Cendrillon | Aurore | Ariel | Belle | Jasmine | Pocahontas | Mulan | Tiana | Raiponce | Mérida |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Blanche-Neige | 0.0 | 1.373237 | 1.355777 | 1.374014 | 1.368327 | 1.373438 | 1.393376 | 1.37762 | 1.386181 | 1.356605 | 1.345197 |
| Cendrillon | 1.373237 | 0.0 | 1.375634 | 1.381346 | 1.383473 | 1.382734 | 1.379908 | 1.375117 | 1.387408 | 1.354227 | 1.350327 |
| Aurore | 1.355777 | 1.375634 | 0.0 | 1.361973 | 1.372278 | 1.384766 | 1.368598 | 1.365271 | 1.326518 | 1.350259 | 1.378867 |
| Ariel | 1.374014 | 1.381346 | 1.361973 | 0.0 | 1.365308 | 1.363705 | 1.386612 | 1.378191 | 1.375674 | 1.361189 | 1.354413 |
| Belle | 1.368327 | 1.383473 | 1.372278 | 1.365308 | 0.0 | 1.389688 | 1.379667 | 1.390559 | 1.383216 | 1.349807 | 1.391718 |
| Jasmine | 1.373438 | 1.382734 | 1.384766 | 1.363705 | 1.389688 | 0.0 | 1.391717 | 1.381168 | 1.331025 | 1.37853 | 1.395605 |
| Pocahontas | 1.393376 | 1.379908 | 1.368598 | 1.386612 | 1.379667 | 1.391717 | 0.0 | 1.343865 | 1.349423 | 1.343287 | 1.38657 |
| Mulan | 1.37762 | 1.375117 | 1.365271 | 1.378191 | 1.390559 | 1.381168 | 1.343865 | 0.0 | 1.335878 | 1.366105 | 1.39579 |
| Tiana | 1.386181 | 1.387408 | 1.326518 | 1.375674 | 1.383216 | 1.331025 | 1.349423 | 1.335878 | 0.0 | 1.36573 | 1.378886 |
| Raiponce | 1.356605 | 1.354227 | 1.350259 | 1.361189 | 1.349807 | 1.37853 | 1.343287 | 1.366105 | 1.36573 | 0.0 | 1.367093 |
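
For intuition, the Euclidean distance between two vectors is the length of the straight line joining them. A minimal plain-Python sketch (the function name and the two toy vectors are assumptions for this example, not part of the class notebook):

```python
import math

def euclidean_distance(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two toy 2-dimensional vectors forming a 3-4-5 right triangle
u = [3.0, 0.0]
v = [0.0, 4.0]

d = euclidean_distance(u, v)
print(d)  # 5.0
```

The distance from a vector to itself is 0, which is why the diagonal of the matrix above is all zeros.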


# Cosine distance

<br>
<center><img src='img/dist_cos.png' style='height:350px;'>
<small>Image credit: R-Bloggers</small></center>

In [6]:
from sklearn.metrics.pairwise import cosine_distances

dist_cos = cosine_distances(tfidf_matrix.values)
dist_cos = pd.DataFrame(dist_cos, columns = df.index, index = df.index)
dist_cos.style.background_gradient(cmap='Reds')

| princess | Blanche-Neige | Cendrillon | Aurore | Ariel | Belle | Jasmine | Pocahontas | Mulan | Tiana | Raiponce | Mérida |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Blanche-Neige | 0.0 | 0.942889 | 0.919066 | 0.943958 | 0.936159 | 0.943166 | 0.970748 | 0.948919 | 0.960749 | 0.920188 | 0.904777 |
| Cendrillon | 0.942889 | 0.0 | 0.946185 | 0.954058 | 0.956999 | 0.955976 | 0.952073 | 0.945473 | 0.96245 | 0.916965 | 0.911691 |
| Aurore | 0.919066 | 0.946185 | 0.0 | 0.927486 | 0.941573 | 0.958789 | 0.936531 | 0.931983 | 0.879825 | 0.9116 | 0.950637 |
| Ariel | 0.943958 | 0.954058 | 0.927486 | 0.0 | 0.932033 | 0.929845 | 0.961347 | 0.949705 | 0.946239 | 0.926418 | 0.917217 |
| Belle | 0.936159 | 0.956999 | 0.941573 | 0.932033 | 0.0 | 0.965616 | 0.95174 | 0.966827 | 0.956643 | 0.910989 | 0.968439 |
| Jasmine | 0.943166 | 0.955976 | 0.958789 | 0.929845 | 0.965616 | 0.0 | 0.968438 | 0.953813 | 0.885813 | 0.950172 | 0.973857 |
| Pocahontas | 0.970748 | 0.952073 | 0.936531 | 0.961347 | 0.95174 | 0.968438 | 0.0 | 0.902987 | 0.910472 | 0.90221 | 0.961288 |
| Mulan | 0.948919 | 0.945473 | 0.931983 | 0.949705 | 0.966827 | 0.953813 | 0.902987 | 0.0 | 0.892285 | 0.933121 | 0.974115 |
| Tiana | 0.960749 | 0.96245 | 0.879825 | 0.946239 | 0.956643 | 0.885813 | 0.910472 | 0.892285 | 0.0 | 0.932609 | 0.950664 |
| Raiponce | 0.920188 | 0.916965 | 0.9116 | 0.926418 | 0.910989 | 0.950172 | 0.90221 | 0.933121 | 0.932609 | 0.0 | 0.934472 |
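
The cosine distance is 1 minus the cosine of the angle between two vectors, so it depends only on direction and not on length. When both vectors have length 1 (which is what `TfidfVectorizer` produces by default, assuming its L2 normalization is left on), the two distances are tied together: squared Euclidean distance = 2 × cosine distance. The plain-Python sketch below checks this on two toy vectors; all function names and values are assumptions for illustration.

```python
import math

def l2_normalize(u):
    """Rescale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Two toy vectors (hypothetical), rescaled to unit length
a = l2_normalize([1.0, 2.0, 0.0])
b = l2_normalize([0.0, 1.0, 3.0])

squared_euclidean = euclidean_distance(a, b) ** 2
twice_cosine = 2 * cosine_distance(a, b)
print(squared_euclidean, twice_cosine)  # equal up to floating-point rounding
```

The two matrices in these slides are consistent with this relation: for Blanche-Neige and Cendrillon, $1.373237^2 / 2 \approx 0.94289$, which agrees with the cosine distance of 0.942889. On unit-length TF-IDF vectors, both measures therefore rank the pairs of princesses in the same order.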


# 🔮 Going from TF-IDF to Word2Vec

- So far, the vector representations of text that we have seen treat each word as an atomic unit, with no notion of similarity between words.
- The resulting vectors are sparse and high-dimensional (one dimension per word in the vocabulary).
- They cannot represent words that are outside the vocabulary.

With distributed representations, such as **word2vec**, we can create dense, low-dimensional representations that capture distributional similarities between words.
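
The contrast between sparse and dense representations can be illustrated with a tiny plain-Python sketch. The dense vectors below are hand-picked values chosen only for illustration; they are NOT the output of a trained word2vec model, and the three-word vocabulary is also an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["roi", "reine", "pomme"]

# One-hot encoding: one dimension per word, a single 1 per vector
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hand-picked 2-dimensional dense vectors (illustrative, not trained)
dense = {
    "roi": [0.9, 0.1],
    "reine": [0.8, 0.2],
    "pomme": [0.1, 0.9],
}

sim_one_hot = cosine_similarity(one_hot["roi"], one_hot["reine"])
sim_dense_related = cosine_similarity(dense["roi"], dense["reine"])
sim_dense_unrelated = cosine_similarity(dense["roi"], dense["pomme"])
print(sim_one_hot, sim_dense_related, sim_dense_unrelated)
```

With one-hot vectors, any two different words have similarity 0, so "roi" is as far from "reine" as it is from "pomme". Dense vectors can place related words close together, which is the property that word2vec aims to learn from data.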

# ⏪ Today's recap

- TF-IDF
- Word2Vec

<center><img src='img/bye.gif' style='height:250px;'></center> 

# Next class: Performance metrics
# See you next Friday!