<center><img src='img/cdy.png' style='width:500px; float: left; margin: 0px 30px 15px 0px'></center>

# TF-IDF
## Class 19 - Data Science Curriculum 




# ⏪ Recap last class

- Text pre-processing
- Word Cloud

# 🚀 Today's agenda

- TF-IDF

<center><img src='img/pipeline.png' style='width:1500px; margin: 0px 30px 15px 0px'></center>

### Representation of data in numerical form

**Example: Images**

<center><img src='img/komp2.jpeg'>
<small>Image credit: VideoNet</small></center>

- An image is represented on a computer in the form of a matrix where each $m[i,j]$ represents pixel $i$,$j$ of the image

- Similarly, a video is a collection of frames, where each frame is an image. Therefore, any video can be represented as a collection of matrices

- (Un)fortunately, representing text numerically is not so simple
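
To make the image-as-matrix idea concrete, here is a minimal sketch using plain nested lists. The pixel values and the names `image` and `video` are made up for illustration.

```python
# A tiny 2x3 grayscale "image": each entry is a pixel intensity
# (0 = black, 255 = white). The values are invented for this example.
image = [
    [0, 128, 255],
    [64, 192, 32],
]

# image[i][j] is the pixel in row i, column j.
pixel = image[1][2]

# A "video" is a collection of frames, and each frame is an image matrix.
video = [image, image]

print(pixel)
print(len(video))
```

Text has no such natural grid of numbers, which is why the rest of the class is about building one.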

## 🤔 The issue at hand...

<br>
<center><img src='img/gigo.png' style='height:350px;'>
<small>Image credit: R-Bloggers</small></center>

# Let's do some text pre-processing

In [1]:
import json
import pandas as pd


with open('princesses.json', 'r') as f:
    data = json.load(f)
    
df = pd.DataFrame(data.items(), columns=['princess', 'description']).set_index('princess')

df

princess,description
Snow White,"Snow White is a princess of great beauty, whic..."
Cinderella,Cinderella is a young woman subjected to the a...
Aurora,Princess Aurora is the only daughter of King S...
Ariel,"In the 1989 film, Ariel is the youngest of Kin..."
Belle,Belle lives in a small French village with her...
Jasmine,"A modern Disney heroine: independent, intellig..."
Pocahontas,Pocahontas is loosely based on the real-life f...
Mulan,Mulan breaks away from traditional Disney prin...
Tiana,Tiana is the first African-American Disney pri...
Rapunzel,Rapunzel is depicted as a beautiful girl with ...


In [2]:
import re
import pandas as pd 

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vivianamarquez/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
sw = stopwords.words('english')

def text_pre_processing(text):
    text = text.lower() # Make it lowercase
    text = re.sub(r"[\W\d]", " ", text) # Replace punctuation and digits with spaces
    text = text.split() # Tokenize
    text = [word for word in text if word not in sw] # Remove stop words
    return " ".join(text)

df['text_pp'] = df['description'].apply(text_pre_processing)

In [5]:
df

princess,description,text_pp
Snow White,"Snow White is a princess of great beauty, whic...",snow white princess great beauty makes stepmot...
Cinderella,Cinderella is a young woman subjected to the a...,cinderella young woman subjected authority ste...
Aurora,Princess Aurora is the only daughter of King S...,princess aurora daughter king stefan queen lea...
Ariel,"In the 1989 film, Ariel is the youngest of Kin...",film ariel youngest king triton seven daughter...
Belle,Belle lives in a small French village with her...,belle lives small french village father mauric...
Jasmine,"A modern Disney heroine: independent, intellig...",modern disney heroine independent intelligent ...
Pocahontas,Pocahontas is loosely based on the real-life f...,pocahontas loosely based real life figure mato...
Mulan,Mulan breaks away from traditional Disney prin...,mulan breaks away traditional disney princess ...
Tiana,Tiana is the first African-American Disney pri...,tiana first african american disney princess s...
Rapunzel,Rapunzel is depicted as a beautiful girl with ...,rapunzel depicted beautiful girl green eyes na...


# 🤔 Cool, we have cleaned and pre-processed our text... but it's still not numeric

<center><img src='img/this.png' style='width:300px; float: left; margin: 0px 30px 15px 0px'></center> <big>Vector representation of texts!</big>


(A vector is just a list of numbers)

# Vector representation of texts




- There are several methods

- What differentiates one method from another is how well it captures the linguistic properties of the text it represents and how much memory it takes up


- Most popular methods:
    - One-Hot Encoding
    - Bag of Words 
    - TF-IDF 
    - Word embeddings (word2vec) 
        - CBOW (Continuous Bag of Words)
        - SkipGram

# One-Hot Encoding 

Each word in the vocabulary of the text corpus is mapped to a unique index and represented as a vector that is all zeros except for a single 1 at that index.

### Advantages
- Intuitive and easy to understand
- Implementation is straightforward

### Disadvantages
- Generates a sparse matrix
- The representation of a phrase does not have a constant size (one vector per word, so longer phrases produce more vectors)
- No notion of similarity between words
- Out-of-vocabulary problem
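
A minimal sketch of one-hot encoding in plain Python, to illustrate the points above. The helper names `build_vocabulary` and `one_hot` and the two toy texts are hypothetical and not part of the class notebook.

```python
def build_vocabulary(texts):
    """Map each distinct word to a unique integer index (sorted for reproducibility)."""
    words = sorted({word for text in texts for word in text.split()})
    return {word: idx for idx, word in enumerate(words)}


def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1  # raises KeyError for an out-of-vocabulary word
    return vector


texts = ["snow white princess", "young princess"]
vocab = build_vocabulary(texts)

# Each text becomes a list of one-hot vectors, one per word.
encoded = [[one_hot(word, vocab) for word in text.split()] for text in texts]

print(vocab)
print(one_hot("snow", vocab))
print([len(vectors) for vectors in encoded])  # number of vectors differs per text
```

The two texts end up with a different number of vectors, and a word such as "queen" that never appeared in `texts` cannot be encoded at all.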

# Bag of Words (BoW)

- Represent the text as a bag of words (ignoring order and context).
- If two pieces of text have almost the same words, they are considered similar (they "belong to the same bag").

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(binary=True)
bow_rep = count_vect.fit_transform(df['text_pp'].values)

print("Vocabulary: ", count_vect.vocabulary_)

df['bow'] = [row for row in bow_rep.toarray()]

df.head()

Vocabulary:  {'snow': 402, 'white': 475, 'princess': 349, 'great': 196, 'beauty': 40, 'makes': 278, 'stepmother': 409, 'queen': 352, 'jealous': 235, 'daily': 100, 'asks': 25, 'magic': 275, 'mirror': 294, 'fairest': 147, 'land': 247, 'always': 14, 'hoping': 215, 'say': 384, 'one': 318, 'day': 104, 'however': 217, 'tells': 428, 'furious': 178, 'decides': 106, 'young': 486, 'girl': 184, 'killed': 239, 'huntsman': 221, 'assigns': 26, 'task': 425, 'cannot': 65, 'bring': 60, 'lets': 257, 'escape': 138, 'lost': 268, 'forest': 173, 'exhausted': 142, 'ends': 133, 'house': 216, 'inhabited': 227, 'seven': 396, 'dwarfs': 126, 'cinderella': 78, 'woman': 483, 'subjected': 417, 'authority': 29, 'lady': 246, 'tremaine': 446, 'two': 454, 'stepsisters': 410, 'anastasia': 16, 'drizella': 123, 'although': 13, 'mistreated': 296, 'humiliated': 220, 'forced': 171, 'role': 372, 'servant': 394, 'maintains': 277, 'hope': 214, 'dreams': 121, 'believes': 44, 'wishes': 481, 'happiness': 202, 'come': 82, 'true': 45

princess,description,text_pp,bow
Snow White,"Snow White is a princess of great beauty, whic...",snow white princess great beauty makes stepmot...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ..."
Cinderella,Cinderella is a young woman subjected to the a...,cinderella young woman subjected authority ste...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ..."
Aurora,Princess Aurora is the only daughter of King S...,princess aurora daughter king stefan queen lea...,"[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ..."
Ariel,"In the 1989 film, Ariel is the youngest of Kin...",film ariel youngest king triton seven daughter...,"[0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, ..."
Belle,Belle lives in a small French village with her...,belle lives small french village father mauric...,"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


### Advantages
- Intuitive and easy to understand
- The implementation is straightforward
- The vector of each phrase has a constant size

### Disadvantages
- Generates a sparse matrix (solution?)
- No notion of similarity between words
- Out-of-vocabulary problem
- Word order information is lost
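
The trade-offs listed above can be seen in a small pure-Python sketch. The function `bag_of_words`, the three-word vocabulary, and the toy sentences are assumptions made for illustration. Unlike `CountVectorizer(binary=True)`, this version stores counts rather than 0/1 flags.

```python
def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = [0] * len(vocabulary)
    for word in text.split():
        if word in vocabulary:  # out-of-vocabulary words are silently dropped
            counts[vocabulary[word]] += 1
    return counts


vocabulary = {"dog": 0, "bites": 1, "man": 2}

a = bag_of_words("dog bites man", vocabulary)
b = bag_of_words("man bites dog", vocabulary)  # same words, different order
c = bag_of_words("cat bites man", vocabulary)  # "cat" is not in the vocabulary

print(a, b, c)
```

Every vector has length 3 regardless of the text, `a` and `b` are identical even though the sentences mean different things, and "cat" leaves no trace in `c`.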

# TF-IDF (Term Frequency, Inverse Document Frequency) 

- It seeks to quantify the importance of a word relative to the other words in the document and in the corpus.
- It is frequently used in information retrieval systems and clustering algorithms.
- The more a word helps to distinguish a document from others, the higher its TF-IDF score will be.

### TF: Term Frequency

- Term frequency: the number of occurrences of a word in a document, divided by the total number of words in that document.

$$ tf(t,d) = \dfrac{count(t,d)}{|d|}$$

where
- $t$ is the term
- $d$ is the document
- $|d|$ is the number of words in $d$

*Note: The term frequency is higher for words frequently used in a document.*
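
As a quick worked example of the TF formula, here is a sketch that treats a document as a whitespace-separated string. The function name `term_frequency` and the toy document are hypothetical.

```python
def term_frequency(term, document):
    """Occurrences of `term` divided by the number of words in the document."""
    tokens = document.split()
    return tokens.count(term) / len(tokens)


doc = "snow white princess snow"

tf_snow = term_frequency("snow", doc)          # 2 occurrences out of 4 words
tf_princess = term_frequency("princess", doc)  # 1 occurrence out of 4 words

print(tf_snow, tf_princess)
```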

### DF: Document Frequency

- Document frequency: the number of documents that contain the word, divided by the total number of documents.

$$ df(t,N) = \dfrac{|\{d_i: t\in d_i, i=1, \dots , N\}|}{N}$$

where
- $t$ is the term
- $N$ is the number of documents

*Note: Document frequency is higher for words used in many documents*
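
A matching sketch for the DF formula, again on whitespace-tokenized strings. The function name `document_frequency` and the four toy documents are made up for illustration.

```python
def document_frequency(term, documents):
    """Fraction of documents that contain `term` at least once."""
    containing = sum(1 for doc in documents if term in doc.split())
    return containing / len(documents)


docs = [
    "snow white princess",
    "young princess",
    "brave young warrior",
    "princess aurora",
]

df_princess = document_frequency("princess", docs)  # appears in 3 of 4 documents
df_snow = document_frequency("snow", docs)          # appears in 1 of 4 documents

print(df_princess, df_snow)
```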

## So far we have TF and DF... how do we get to TF-IDF?





- A word that is specific to one document will have a very low document frequency

- Since the goal is to distinguish one document from another, we want to highlight words used frequently in one document but penalize them if they are present in all documents. This is called TF-IDF.

- Then, we obtain:

$$tfidf(t,d,N) = tf(t,d) \times \log\left(\frac{1}{df(t,N)}\right)$$

where
- $t$ is the term
- $d$ is the document
- $N$ is the number of documents

- When $t$ is in all documents, $df(t,N) = 1$, so $idf = \log(1) = 0$
 
- This makes sense since a word that is in all documents is very bad at distinguishing between documents
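
Putting the pieces together, the following sketch applies the formula above to a toy corpus. The function names and the three short documents are assumptions made for illustration. It uses the natural logarithm and simple whitespace tokenization.

```python
import math


def term_frequency(term, document):
    tokens = document.split()
    return tokens.count(term) / len(tokens)


def document_frequency(term, documents):
    return sum(1 for doc in documents if term in doc.split()) / len(documents)


def tf_idf(term, document, documents):
    """tf(t, d) * log(1 / df(t, N)); returns 0.0 if the term is in no document."""
    doc_freq = document_frequency(term, documents)
    if doc_freq == 0:
        return 0.0
    return term_frequency(term, document) * math.log(1 / doc_freq)


corpus = ["snow white princess", "young princess", "princess aurora"]

score_princess = tf_idf("princess", corpus[0], corpus)  # in every document
score_snow = tf_idf("snow", corpus[0], corpus)          # only in the first document

print(score_princess, round(score_snow, 3))
```

"princess" appears in every document, so its score is zero, while "snow" gets a positive score because it helps single out the first document.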

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(df['text_pp'].values)

tfidf_matrix = pd.DataFrame(tfidf.toarray(), columns=tfidf_vect.get_feature_names_out())

tfidf_matrix.index = df.index

tfidf_matrix.T.round(3)

princess,Snow White,Cinderella,Aurora,Ariel,Belle,Jasmine,Pocahontas,Mulan,Tiana,Rapunzel,Merida
accidentally,0.000,0.000,0.000,0.000,0.000,0.0,0.000,0.000,0.0,0.118,0.0
activated,0.000,0.000,0.000,0.000,0.000,0.0,0.000,0.000,0.0,0.118,0.0
admired,0.000,0.000,0.000,0.000,0.141,0.0,0.000,0.000,0.0,0.000,0.0
admit,0.000,0.000,0.098,0.000,0.000,0.0,0.000,0.000,0.0,0.000,0.0
adventure,0.000,0.000,0.000,0.000,0.000,0.0,0.146,0.000,0.0,0.000,0.0
...,...,...,...,...,...,...,...,...,...,...,...
women,0.000,0.000,0.000,0.000,0.000,0.0,0.000,0.176,0.0,0.000,0.0
world,0.000,0.000,0.000,0.073,0.000,0.0,0.000,0.000,0.0,0.101,0.0
young,0.084,0.064,0.066,0.000,0.000,0.0,0.098,0.000,0.0,0.000,0.0
youngest,0.000,0.000,0.000,0.085,0.000,0.0,0.000,0.000,0.0,0.000,0.0
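
Note that the scores in this table do not follow the lecture formula exactly. With its default settings, scikit-learn's `TfidfVectorizer` uses the raw count as the term frequency, a smoothed IDF of $\ln\frac{1+N}{1+n_t} + 1$ (where $n_t$ is the number of documents containing $t$), and then scales each document vector to unit Euclidean (L2) length. The sketch below reimplements that variant in plain Python on a toy corpus. The function name `sklearn_style_tfidf` is hypothetical, and whitespace splitting is a simplification of the library's tokenizer.

```python
import math


def sklearn_style_tfidf(documents):
    """Smoothed-IDF, L2-normalized TF-IDF (mirrors TfidfVectorizer defaults)."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)

    # Smoothed IDF: ln((1 + N) / (1 + n_t)) + 1
    idf = {}
    for term in vocabulary:
        n_t = sum(1 for tokens in tokenized if term in tokens)
        idf[term] = math.log((1 + n_docs) / (1 + n_t)) + 1

    rows = []
    for tokens in tokenized:
        weights = [tokens.count(term) * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard empty docs
        rows.append([w / norm for w in weights])
    return vocabulary, rows


vocabulary, rows = sklearn_style_tfidf(["snow white princess", "young princess"])

for row in rows:
    print([round(w, 3) for w in row])
```

One consequence of the `+ 1` term is that a word present in every document (here "princess") keeps a small non-zero weight instead of dropping to 0.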


# 🔮 In the next class

#### Going from TF-IDF to Word2Vec

- The vector representations of text that we have seen so far treat words as atomic units, so they capture no notion of similarity between them.
- The resulting vectors are sparse.
- They cannot handle words outside the vocabulary.

With distributed representations, such as **word2vec**, we can create dense, low-dimensional representations that capture distributional similarities between words.

# Similarity measures
How similar are the documents?

# Euclidean distance

<br>
<center><img src='img/dist_euc.png' style='height:350px;'>
<small>Image credit: R-Bloggers</small></center>
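
For two vectors $u$ and $v$, the Euclidean distance is $\sqrt{\sum_i (u_i - v_i)^2}$, the straight-line distance between their endpoints. A minimal pure-Python sketch (the function name `euclidean_distance` and the example vectors are made up for illustration):

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


d = euclidean_distance([0, 0], [3, 4])  # classic 3-4-5 right triangle
print(d)
```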

In [9]:
from sklearn.metrics.pairwise import euclidean_distances

dist_euc = euclidean_distances(tfidf_matrix.values)
dist_euc = pd.DataFrame(dist_euc, columns = df.index, index = df.index)

dist_euc.style.background_gradient(cmap='Blues')

princess,Snow White,Cinderella,Aurora,Ariel,Belle,Jasmine,Pocahontas,Mulan,Tiana,Rapunzel,Merida
Snow White,0.0,1.349204,1.375952,1.408711,1.400076,1.408716,1.40839,1.408495,1.398964,1.391161,1.393649
Cinderella,1.349204,0.0,1.398689,1.408548,1.40729,1.411802,1.390949,1.407539,1.372281,1.385929,1.352289
Aurora,1.375952,1.398689,0.0,1.392005,1.370182,1.383344,1.370994,1.384726,1.352218,1.377012,1.385299
Ariel,1.408711,1.408548,1.392005,0.0,1.39855,1.375824,1.41047,1.406457,1.397453,1.368856,1.379941
Belle,1.400076,1.40729,1.370182,1.39855,0.0,1.408001,1.414214,1.414214,1.409359,1.414214,1.4092
Jasmine,1.408716,1.411802,1.383344,1.375824,1.408001,0.0,1.383141,1.370872,1.364911,1.37569,1.404715
Pocahontas,1.40839,1.390949,1.370994,1.41047,1.414214,1.383141,0.0,1.380638,1.374663,1.397104,1.382334
Mulan,1.408495,1.407539,1.384726,1.406457,1.414214,1.370872,1.380638,0.0,1.361212,1.403438,1.414214
Tiana,1.398964,1.372281,1.352218,1.397453,1.409359,1.364911,1.374663,1.361212,0.0,1.400576,1.406793
Rapunzel,1.391161,1.385929,1.377012,1.368856,1.414214,1.37569,1.397104,1.403438,1.400576,0.0,1.353261


import seaborn as sns
%matplotlib inline

sns.heatmap(dist_euc, annot=True)

# Cosine distance

<br>
<center><img src='img/dist_cos.png' style='height:350px;'>
<small>Image credit: R-Bloggers</small></center>
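
Cosine distance is one minus the cosine of the angle between two vectors: $1 - \frac{u \cdot v}{\|u\|\,\|v\|}$. It depends only on direction, not on length. A minimal pure-Python sketch (the function name `cosine_distance` and the example vectors are made up for illustration; it assumes neither vector is all zeros):

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


orthogonal = cosine_distance([1, 0], [0, 1])  # no shared direction
parallel = cosine_distance([1, 2], [2, 4])    # same direction, different length

print(orthogonal, round(parallel, 6))
```

For documents, orthogonal vectors correspond to texts that share no vocabulary, which gives the maximum distance of 1 for non-negative TF-IDF weights.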

In [10]:
from sklearn.metrics.pairwise import cosine_distances

dist_cos = cosine_distances(tfidf_matrix.values)
dist_cos = pd.DataFrame(dist_cos, columns = df.index, index = df.index)
dist_cos.style.background_gradient(cmap='Blues')

princess,Snow White,Cinderella,Aurora,Ariel,Belle,Jasmine,Pocahontas,Mulan,Tiana,Rapunzel,Merida
Snow White,0.0,0.910175,0.946622,0.992233,0.980106,0.992241,0.991781,0.991929,0.97855,0.967665,0.971128
Cinderella,0.910175,0.0,0.978166,0.992003,0.990233,0.996592,0.96737,0.990583,0.941578,0.9604,0.914342
Aurora,0.946622,0.978166,0.0,0.968839,0.938699,0.95682,0.939812,0.958733,0.914247,0.948081,0.959527
Ariel,0.992233,0.992003,0.968839,0.0,0.977972,0.946445,0.994713,0.98906,0.976438,0.936883,0.952119
Belle,0.980106,0.990233,0.938699,0.977972,0.0,0.991234,1.0,1.0,0.993147,1.0,0.992922
Jasmine,0.992241,0.996592,0.95682,0.946445,0.991234,0.0,0.956539,0.939645,0.931491,0.946262,0.986612
Pocahontas,0.991781,0.96737,0.939812,0.994713,1.0,0.956539,0.0,0.953081,0.944849,0.975949,0.955423
Mulan,0.991929,0.990583,0.958733,0.98906,1.0,0.939645,0.953081,0.0,0.926449,0.984819,1.0
Tiana,0.97855,0.941578,0.914247,0.976438,0.993147,0.931491,0.944849,0.926449,0.0,0.980807,0.989534
Rapunzel,0.967665,0.9604,0.948081,0.936883,1.0,0.946262,0.975949,0.984819,0.980807,0.0,0.915658
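
The two distance tables are closely related. `TfidfVectorizer` was used with its defaults, which scale every document vector to unit length, and for unit vectors $\|u - v\|^2 = 2\,(1 - \cos\theta)$. In other words, the squared Euclidean distance is twice the cosine distance. The sketch below checks this on three pairs copied from the tables (the dictionary name `pairs` is an assumption for illustration):

```python
import math

# (Euclidean distance, cosine distance) copied from the two tables.
pairs = {
    ("Snow White", "Cinderella"): (1.349204, 0.910175),
    ("Aurora", "Tiana"): (1.352218, 0.914247),
    ("Belle", "Pocahontas"): (1.414214, 1.0),
}

for (a, b), (euclidean, cosine) in pairs.items():
    # For unit-length vectors: euclidean**2 == 2 * cosine (up to rounding).
    print(a, b, round(euclidean ** 2, 4), round(2 * cosine, 4))

# The largest possible value, sqrt(2), occurs when two texts share no words.
max_distance = math.sqrt(2)
```

This is why so many Euclidean distances sit close to 1.414: those pairs of descriptions have almost no vocabulary in common, so both metrics rank the pairs in the same order here.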


# ⏪ Today's recap

- Vector representation of texts
    - Bag of Words
    - TF-IDF
- Distance metrics
    - Euclidean distance
    - Cosine distance

# Next class: Word2Vec