## Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF builds on the Bag of Words (BoW) approach to help us better understand and analyze text. Both TF-IDF and BoW represent text data as numerical values, but they differ in how they handle word importance.

The Bag of Words (BoW) method just counts how often each word appears in a document, treating all words equally. This can be a problem because common words like "and" or "the" might dominate the results, even though they don't carry much meaning.

TF-IDF gets around this by not only counting word frequency (like BoW) but also considering how common or rare a word is across all documents. It makes common words less important and increases the significance of unique or rare words. This makes TF-IDF a more refined method, helping to highlight the words that truly matter in understanding the content of a document.

In a nutshell, while BoW gives you a basic count of words, TF-IDF goes a step further by weighting these counts based on word significance, providing a more accurate reflection of the document's key topics.

### Example: Calculating TF-IDF

#### Step 1: Start with 3 Phrases
Let's use the following phrases as our documents:

1. "The car is fast"
2. "The car is red"
3. "The fast car is blue"

#### Step 2: Create the Term Frequency (TF) Matrix
First, list all unique words across the phrases: `the`, `car`, `is`, `fast`, `red`, `blue`.

**TF Formula:**  
TF = (Number of times the word appears in a phrase) / (Total number of words in the phrase)

| Term   | Phrase 1 ("The car is fast") | Phrase 2 ("The car is red") | Phrase 3 ("The fast car is blue") |
|--------|------------------------------|-----------------------------|-----------------------------------|
| the    | 1/4 = 0.25                   | 1/4 = 0.25                  | 1/5 = 0.20                        |
| car    | 1/4 = 0.25                   | 1/4 = 0.25                  | 1/5 = 0.20                        |
| is     | 1/4 = 0.25                   | 1/4 = 0.25                  | 1/5 = 0.20                        |
| fast   | 1/4 = 0.25                   | 0/4 = 0                     | 1/5 = 0.20                        |
| red    | 0/4 = 0                      | 1/4 = 0.25                  | 0/5 = 0                           |
| blue   | 0/4 = 0                      | 0/4 = 0                     | 1/5 = 0.20                        |
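
The TF table above can be reproduced with a few lines of plain Python. This is only a minimal sketch: it assumes lowercasing and splitting on whitespace is enough tokenization for these toy phrases, and the names `term_frequency` and `tf` are illustrative.

```python
# The three example phrases, lowercased and split on whitespace.
phrases = ["The car is fast", "The car is red", "The fast car is blue"]
docs = [p.lower().split() for p in phrases]

vocabulary = ["the", "car", "is", "fast", "red", "blue"]

def term_frequency(term, tokens):
    """Share of the tokens in one phrase that are equal to `term`."""
    return tokens.count(term) / len(tokens)

# One list of three TF values (one per phrase) for every term.
tf = {term: [term_frequency(term, tokens) for tokens in docs] for term in vocabulary}

for term, values in tf.items():
    print(f"{term:<5}", [round(v, 2) for v in values])
```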

#### Step 3: Create the Inverse Document Frequency (IDF) Matrix
**IDF Formula:**  
IDF = log(Total number of phrases / Number of phrases containing the word), where log is the base-10 logarithm

For 3 phrases:

| Term   | Document Frequency (DF) | IDF Calculation                      | IDF  |
|--------|-------------------------|--------------------------------------|------|
| the    | 3                        | log(3 / 3) = log(1)                  | 0    |
| car    | 3                        | log(3 / 3) = log(1)                  | 0    |
| is     | 3                        | log(3 / 3) = log(1)                  | 0    |
| fast   | 2                        | log(3 / 2)                          | 0.18 |
| red    | 1                        | log(3 / 1)                          | 0.48 |
| blue   | 1                        | log(3 / 1)                          | 0.48 |
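
The IDF column can be checked the same way. The sketch below assumes the base-10 logarithm, which is what yields the 0.18 and 0.48 values in the table; the function name `inverse_document_frequency` is illustrative, and it expects the term to occur in at least one phrase (otherwise the division fails).

```python
import math

phrases = ["The car is fast", "The car is red", "The fast car is blue"]
# A set per phrase is enough here: IDF only asks whether a word occurs at all.
docs = [set(p.lower().split()) for p in phrases]

vocabulary = ["the", "car", "is", "fast", "red", "blue"]

def inverse_document_frequency(term, docs):
    """log10(number of phrases / number of phrases containing `term`)."""
    document_frequency = sum(1 for words in docs if term in words)
    return math.log10(len(docs) / document_frequency)

idf = {term: inverse_document_frequency(term, docs) for term in vocabulary}

for term, value in idf.items():
    print(f"{term:<5} {value:.2f}")
```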

#### Step 4: Calculate the TF-IDF Matrix
**TF-IDF Formula:**  
TF-IDF = TF * IDF

| Term   | TF-IDF for Phrase 1        | TF-IDF for Phrase 2       | TF-IDF for Phrase 3       |
|--------|----------------------------|---------------------------|---------------------------|
| the    | 0.25 * 0 = 0               | 0.25 * 0 = 0              | 0.20 * 0 = 0              |
| car    | 0.25 * 0 = 0               | 0.25 * 0 = 0              | 0.20 * 0 = 0              |
| is     | 0.25 * 0 = 0               | 0.25 * 0 = 0              | 0.20 * 0 = 0              |
| fast   | 0.25 * 0.18 = 0.045        | 0 * 0.18 = 0              | 0.20 * 0.18 = 0.036       |
| red    | 0 * 0.48 = 0               | 0.25 * 0.48 = 0.12        | 0 * 0.48 = 0              |
| blue   | 0 * 0.48 = 0               | 0 * 0.48 = 0              | 0.20 * 0.48 = 0.096       |
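
Putting both parts together gives the TF-IDF table. In this sketch the IDF is rounded to two decimals before multiplying, purely so that the output matches the hand-computed table; without that rounding the third decimal shifts slightly (for example, roughly 0.044 instead of 0.045 for `fast` in Phrase 1). The helper names are illustrative.

```python
import math

phrases = ["The car is fast", "The car is red", "The fast car is blue"]
docs = [p.lower().split() for p in phrases]
vocabulary = ["the", "car", "is", "fast", "red", "blue"]

def tf(term, tokens):
    # Occurrences of the term divided by the phrase length.
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    document_frequency = sum(1 for tokens in docs if term in tokens)
    # Rounded to two decimals only to match the hand-computed table.
    return round(math.log10(len(docs) / document_frequency), 2)

# One list of three TF-IDF values (one per phrase) for every term.
tfidf = {
    term: [round(tf(term, tokens) * idf(term, docs), 3) for tokens in docs]
    for term in vocabulary
}

for term, values in tfidf.items():
    print(f"{term:<5}", values)
```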

#### Final TF-IDF Scores
- **Phrase 1** ("The car is fast"): Important word - `fast` (0.045)
- **Phrase 2** ("The car is red"): Important word - `red` (0.12)
- **Phrase 3** ("The fast car is blue"): Important words - `fast` (0.036), `blue` (0.096)

### Summary
- Words that appear in every phrase ("the", "car", and "is") get a TF-IDF score of 0, because their IDF is log(3/3) = 0. Words unique to one phrase ("red" and "blue") get the highest scores, and "fast", which appears in two of the three phrases, falls in between.


### Implementation in Python

Let's begin by importing the libraries we need.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Next, let's create a sample dataset, just as we did for BoW.

In [3]:
# Sample DataFrame with Netflix app reviews
df = pd.DataFrame({
    'Content_cleaned': [
        'the app is great new features but crashes often',
        'love the app love the content but it crashes',
        'the app crashes too much it is frustrating',
        'the content is great it is easy to use it is great'
    ]
})

Let's visualize it.

In [4]:
print(df.to_string())

                                      Content_cleaned
0     the app is great new features but crashes often
1        love the app love the content but it crashes
2          the app crashes too much it is frustrating
3  the content is great it is easy to use it is great


Here is the code for applying TF-IDF:

In [14]:
# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the data to compute TF-IDF
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Content_cleaned'])

# Create a DataFrame with the TF-IDF scores
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Insert the original reviews as the first column in the DataFrame
tfidf_df.insert(0, 'Original_Review', df['Content_cleaned'])

# Display the first rows of the DataFrame with TF-IDF scores
tfidf_df.head()

|   | Original_Review | app | but | content | crashes | easy | features | frustrating | great | is | it | love | much | new | often | the | to | too | use |
|---|-----------------|-----|-----|---------|---------|------|----------|-------------|-------|----|----|------|------|-----|-------|-----|----|-----|-----|
| 0 | the app is great new features but crashes often | 0.266468 | 0.329142 | 0.0 | 0.266468 | 0.0 | 0.417474 | 0.0 | 0.329142 | 0.266468 | 0.0 | 0.0 | 0.0 | 0.417474 | 0.417474 | 0.217855 | 0.0 | 0.0 | 0.0 |
| 1 | love the app love the content but it crashes | 0.232224 | 0.286843 | 0.286843 | 0.232224 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.232224 | 0.727649 | 0.0 | 0.0 | 0.0 | 0.379717 | 0.0 | 0.0 | 0.0 |
| 2 | the app crashes too much it is frustrating | 0.288291 | 0.0 | 0.0 | 0.288291 | 0.0 | 0.0 | 0.451664 | 0.0 | 0.288291 | 0.288291 | 0.0 | 0.451664 | 0.0 | 0.0 | 0.235697 | 0.0 | 0.451664 | 0.0 |
| 3 | the content is great it is easy to use it is g... | 0.0 | 0.0 | 0.230725 | 0.0 | 0.292645 | 0.0 | 0.0 | 0.46145 | 0.560375 | 0.373583 | 0.0 | 0.0 | 0.0 | 0.0 | 0.152714 | 0.292645 | 0.0 | 0.292645 |
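
These scores do not follow the hand-calculation formulas from the worked example: "the" appears in every review yet still has a non-zero score. With its default settings, scikit-learn's `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below rebuilds the matrix by hand under those assumptions and compares it with the library output; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = [
    'the app is great new features but crashes often',
    'love the app love the content but it crashes',
    'the app crashes too much it is frustrating',
    'the content is great it is easy to use it is great',
]

# Raw term counts: the TF part under scikit-learn's default settings.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(reviews).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of reviews containing each word

# Smoothed IDF: never zero, so words found in every review keep some weight.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit L2 length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(reviews).toarray()
print("Manual and scikit-learn results agree:", np.allclose(manual_tfidf, sklearn_tfidf))
```

Because of the row normalization, scores are only comparable within the same setup; they are not expected to match the numbers from the three-phrase example.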


**Fun tip**: Compare this matrix with the one created using Bag of Words!

Now let's apply what we have learned to the Netflix dataset.

In [5]:
# Read the dataset
df = pd.read_csv('../DATASETS/preprocessed_text.csv')

# Filling empty text that occurred after text preprocessing
df.fillna('', inplace=True)

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the model and transform the data
tfidf = vectorizer.fit_transform(df['content_cleaned'])

# Print the size of the vocabulary and the shape of the matrix
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Shape of the sparse matrix: {tfidf.shape}")


Vocabulary size: 39783
Shape of the sparse matrix: (113292, 39783)


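The matrix has the same shape as the BoW matrix, provided both vectorizers use the same tokenization settings: each method builds one row per document and one column per vocabulary word. Only the values differ (TF-IDF weights instead of raw counts).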
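
The comparison with BoW can be checked on a small sample. The sketch below uses the four sample reviews instead of the full dataset and assumes both vectorizers keep their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = [
    'the app is great new features but crashes often',
    'love the app love the content but it crashes',
    'the app crashes too much it is frustrating',
    'the content is great it is easy to use it is great',
]

bow_matrix = CountVectorizer().fit_transform(reviews)
tfidf_matrix = TfidfVectorizer().fit_transform(reviews)

# Same number of documents (rows) and vocabulary words (columns).
print("BoW shape:   ", bow_matrix.shape)
print("TF-IDF shape:", tfidf_matrix.shape)

# The non-zero cells sit in the same positions; only their values change.
same_pattern = ((bow_matrix.toarray() > 0) == (tfidf_matrix.toarray() > 0)).all()
print("Same non-zero pattern:", same_pattern)
```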

In [7]:
print(df['content_cleaned'][44])
print(tfidf[44])

i love to watch the movie s you have on here and the tv shows i love them growing_heart
  (0, 38860)	0.16308812084494653
  (0, 2769)	0.10912960815218174
  (0, 34247)	0.20741480726571174
  (0, 34829)	0.11001395378020835
  (0, 16108)	0.15733005144487602
  (0, 31090)	0.18829505337366567
  (0, 24131)	0.15370315587563638
  (0, 22367)	0.2341258023408215
  (0, 20654)	0.4173832358745547
  (0, 34287)	0.28537500358727697
  (0, 37517)	0.17283144911951462
  (0, 16371)	0.332457626059765
  (0, 35537)	0.23448183036773226
  (0, 15596)	0.5627552998770302


A cleaner way to display the same information is to map each column index back to its word:

In [9]:
# Convert the sparse matrix row for the document at index 44 to a dense array
dense_array = tfidf[44].toarray()

# Create a DataFrame to map the indices to the words and their TF-IDF scores
words = vectorizer.get_feature_names_out()
tfidf_scores = dense_array.flatten()
tfidf_df = pd.DataFrame({
    'Word': [words[i] for i in range(len(tfidf_scores)) if tfidf_scores[i] > 0],
    'TF-IDF Score': [tfidf_scores[i] for i in range(len(tfidf_scores)) if tfidf_scores[i] > 0]
})

# Display the original text and the TF-IDF scores
print("Original text:")
print(df['content_cleaned'][44])
print("\nTF-IDF Matrix Representation:")
print(tfidf_df)

Original text:
i love to watch the movie s you have on here and the tv shows i love them growing_heart

TF-IDF Matrix Representation:
             Word  TF-IDF Score
0             and      0.109130
1   growing_heart      0.562755
2            have      0.157330
3            here      0.332458
4            love      0.417383
5           movie      0.234126
6              on      0.153703
7           shows      0.188295
8             the      0.207415
9            them      0.285375
10             to      0.110014
11             tv      0.234482
12          watch      0.172831
13            you      0.163088


## Pros and Cons of TF-IDF

### Pros

- **Measuring Relevance**: TF-IDF is great for figuring out which words are most relevant in a document. This can be really important for search engines and other text analysis apps.
   
- **Filtering Out Noise**: By reducing the impact of common words across documents, TF-IDF can filter out the usual "noise" or common words, allowing more relevant and unique content to stand out.
   
- **Simplicity and Efficiency**: TF-IDF is simple to understand and implement. It is also efficient, even with large datasets: it only needs word counts and document frequencies, and the result can be stored as a sparse matrix.
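
To illustrate the relevance point, here is a minimal search-style sketch: the reviews and a query are turned into TF-IDF vectors, and the reviews are ranked by cosine similarity to the query. The query string is made up for the example, and cosine similarity is one common choice for comparing the vectors, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviews = [
    'the app is great new features but crashes often',
    'love the app love the content but it crashes',
    'the app crashes too much it is frustrating',
    'the content is great it is easy to use it is great',
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(reviews)

# Hypothetical search query, transformed with the vocabulary learned above.
query = "app crashes"
query_vector = vectorizer.transform([query])

# One similarity score per review; higher means more relevant to the query.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]

for idx in ranking:
    print(f"{scores[idx]:.3f}  {reviews[idx]}")
```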

### Cons

- **Lack of Context Understanding**: TF-IDF doesn't take context into account, which can be a drawback for tasks that require understanding the meaning of the text.
   
- **Not Suitable for Short Texts**: In documents with very few words (like tweets or SMS messages), the TF-IDF scores might not be very informative since the frequency of words is generally low.
   
- **High-Dimensional Output**: The vectors generated by TF-IDF are usually high-dimensional (one dimension per unique word in the corpus). This can result in sparse matrices, which are more challenging to manage and process for some machine learning models.

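- **Out-Of-Vocabulary Issue**: As with BoW, words that were not in the vocabulary used for fitting cannot be represented. When new text is transformed, scikit-learn's `TfidfVectorizer` simply ignores such words, so whatever information they carry is lost.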