## Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF builds on the Bag of Words (BoW) approach to help us better understand and analyze text. Both TF-IDF and BoW represent text data as numerical values, but they differ in how they handle word importance.

The Bag of Words (BoW) method counts how often each word appears in a document, treating all words the same. This can be a problem because common words like "and" or "the" might dominate the results even though they carry little meaning.

TF-IDF gets around this by not only counting word frequency (like BoW) but also considering how common or rare a word is across all documents. Words that appear in most documents are down-weighted, and words that appear in only a few documents are up-weighted. This helps to highlight the words that are important for understanding the content of a document.

In short, BoW gives you a basic word count, while TF-IDF goes a step further by weighting these counts based on word importance. As a result, TF-IDF usually gives a better picture of the document's key messages.

### Example: Calculating TF-IDF

#### Step 1: Start with 3 Phrases
Let's use the following phrases as our documents (later on, the documents will be our reviews):

1. "The car is fast"
2. "The car is red"
3. "The fast car is blue"

#### Step 2: Create the Term Frequency (TF) Matrix
First, list all unique words across the phrases: `the`, `car`, `is`, `fast`, `red`, `blue`.

**TF Formula:**  
TF = (Number of times the word appears in a phrase) / (Total number of words in the phrase)

| Term   | Phrase 1 ("The car is fast") | Phrase 2 ("The car is red") | Phrase 3 ("The fast car is blue") |
|--------|------------------------------|-----------------------------|-----------------------------------|
| the    | 1/4 = 0.25                   | 1/4 = 0.25                  | 1/5 = 0.20                        |
| car    | 1/4 = 0.25                   | 1/4 = 0.25                  | 1/5 = 0.20                        |
| is     | 1/4 = 0.25                   | 1/4 = 0.25                  | 1/5 = 0.20                        |
| fast   | 1/4 = 0.25                   | 0/4 = 0                     | 1/5 = 0.20                        |
| red    | 0/4 = 0                      | 1/4 = 0.25                  | 0/5 = 0                           |
| blue   | 0/4 = 0                      | 0/4 = 0                     | 1/5 = 0.20                        |
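
The TF table can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming the phrases are lowercased and split on whitespace; the helper name `term_frequency` is made up for this illustration.

```python
# Toy corpus from Step 1, lowercased and split on whitespace
# (an assumption for this sketch; real tokenizers are usually more involved).
phrases = ["The car is fast", "The car is red", "The fast car is blue"]
docs = [phrase.lower().split() for phrase in phrases]

# Vocabulary in the same order as the table above.
vocab = ["the", "car", "is", "fast", "red", "blue"]


def term_frequency(term, tokens):
    """Occurrences of `term` divided by the number of tokens in the phrase."""
    return tokens.count(term) / len(tokens)


# tf[term] holds one value per phrase.
tf = {term: [term_frequency(term, tokens) for tokens in docs] for term in vocab}

for term, values in tf.items():
    print(f"{term:<5}", [round(v, 2) for v in values])
```

Each printed row should match the corresponding row of the TF table.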

#### Step 3: Create the Inverse Document Frequency (IDF) Matrix
**IDF Formula:**  
IDF = log(Total number of phrases / Number of phrases containing the word)

Here `log` is the base-10 logarithm, so log(3 / 2) ≈ 0.18 and log(3 / 1) ≈ 0.48.

For 3 phrases:

| Term   | Document Frequency (DF) | IDF Calculation     | IDF  |
|--------|-------------------------|---------------------|------|
| the    | 3                       | log(3 / 3) = log(1) | 0    |
| car    | 3                       | log(3 / 3) = log(1) | 0    |
| is     | 3                       | log(3 / 3) = log(1) | 0    |
| fast   | 2                       | log(3 / 2)          | 0.18 |
| red    | 1                       | log(3 / 1)          | 0.48 |
| blue   | 1                       | log(3 / 1)          | 0.48 |
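
The IDF column can be checked the same way. The sketch below assumes a base-10 logarithm, which is the base that reproduces the 0.18 and 0.48 values in the table; the helper name `inverse_document_frequency` is hypothetical.

```python
import math

# Each phrase becomes a set of lowercase tokens: for IDF we only care
# whether a word occurs in a phrase, not how many times.
phrases = ["The car is fast", "The car is red", "The fast car is blue"]
docs = [set(phrase.lower().split()) for phrase in phrases]
vocab = ["the", "car", "is", "fast", "red", "blue"]


def inverse_document_frequency(term, docs):
    """log10(number of phrases / number of phrases containing `term`)."""
    document_frequency = sum(term in doc for doc in docs)
    return math.log10(len(docs) / document_frequency)


idf = {term: inverse_document_frequency(term, docs) for term in vocab}

for term, value in idf.items():
    print(f"{term:<5} {value:.2f}")
```

Words that occur in every phrase get an IDF of 0, because log10(1) = 0.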

#### Step 4: Calculate the TF-IDF Matrix
**TF-IDF Formula:**  
TF-IDF = TF * IDF

| Term   | TF-IDF for Phrase 1        | TF-IDF for Phrase 2       | TF-IDF for Phrase 3       |
|--------|----------------------------|---------------------------|---------------------------|
| the    | 0.25 * 0 = 0               | 0.25 * 0 = 0              | 0.20 * 0 = 0              |
| car    | 0.25 * 0 = 0               | 0.25 * 0 = 0              | 0.20 * 0 = 0              |
| is     | 0.25 * 0 = 0               | 0.25 * 0 = 0              | 0.20 * 0 = 0              |
| fast   | 0.25 * 0.18 = 0.045        | 0 * 0.18 = 0              | 0.20 * 0.18 = 0.036       |
| red    | 0 * 0.48 = 0               | 0.25 * 0.48 = 0.12        | 0 * 0.48 = 0              |
| blue   | 0 * 0.48 = 0               | 0 * 0.48 = 0              | 0.20 * 0.48 = 0.096       |
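
Putting both pieces together gives the full TF-IDF matrix. This is a self-contained sketch under the same assumptions as before (whitespace tokens, base-10 logarithm); the function names are illustrative. Because it multiplies by the unrounded IDF, the results differ slightly from the table, which uses IDF values rounded to two decimals (for example, about 0.044 instead of 0.045 for `fast` in Phrase 1).

```python
import math

phrases = ["The car is fast", "The car is red", "The fast car is blue"]
docs = [phrase.lower().split() for phrase in phrases]
vocab = ["the", "car", "is", "fast", "red", "blue"]


def tf(term, tokens):
    """Share of the phrase's tokens that are `term`."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Base-10 log of (number of phrases / phrases containing `term`)."""
    document_frequency = sum(term in tokens for tokens in docs)
    return math.log10(len(docs) / document_frequency)


# tfidf[term] holds one score per phrase.
tfidf = {term: [tf(term, tokens) * idf(term) for tokens in docs] for term in vocab}

for term, scores in tfidf.items():
    print(f"{term:<5}", [round(s, 3) for s in scores])
```

The ranking of the words is the same as in the table: `the`, `car` and `is` score 0, and the distinguishing words score highest.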

#### Final TF-IDF Scores
- **Phrase 1** ("The car is fast"): Important word - `fast` (0.045)
- **Phrase 2** ("The car is red"): Important word - `red` (0.12)
- **Phrase 3** ("The fast car is blue"): Important words - `fast` (0.036), `blue` (0.096)

### Summary

It's amazing to see how TF-IDF highlights the most important words. All the sentences contain `the car is`, so these words get a TF-IDF score of 0. The words that are not shared by every phrase, `fast`, `red` and `blue`, get higher scores.


### Implementation in Python

Let's begin by importing the libraries we need.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Next, let's create a sample dataset, just like we did for BoW.

In [3]:
# Sample DataFrame with Netflix app reviews
df = pd.DataFrame({
    'Content_cleaned': [
        'the app is great new features but crashes often',
        'love the app love the content but it crashes',
        'the app crashes too much it is frustrating',
        'the content is great it is easy to use it is great'
    ]
})

Let's visualize it.

In [4]:
print(df.to_string())

                                      Content_cleaned
0     the app is great new features but crashes often
1        love the app love the content but it crashes
2          the app crashes too much it is frustrating
3  the content is great it is easy to use it is great


Here is the code for applying TF-IDF:

In [5]:
# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the data to compute TF-IDF
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Content_cleaned'])

# Create a DataFrame with the TF-IDF scores
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Insert the original reviews as the first column in the DataFrame
tfidf_df.insert(0, 'Original_Review', df['Content_cleaned'])

# Print the DataFrame with TF-IDF scores
tfidf_df.head()

|   | Original_Review | app | but | content | crashes | easy | features | frustrating | great | is | it | love | much | new | often | the | to | too | use |
|---|-----------------|-----|-----|---------|---------|------|----------|-------------|-------|----|----|------|------|-----|-------|-----|----|-----|-----|
| 0 | the app is great new features but crashes often | 0.266468 | 0.329142 | 0.0 | 0.266468 | 0.0 | 0.417474 | 0.0 | 0.329142 | 0.266468 | 0.0 | 0.0 | 0.0 | 0.417474 | 0.417474 | 0.217855 | 0.0 | 0.0 | 0.0 |
| 1 | love the app love the content but it crashes | 0.232224 | 0.286843 | 0.286843 | 0.232224 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.232224 | 0.727649 | 0.0 | 0.0 | 0.0 | 0.379717 | 0.0 | 0.0 | 0.0 |
| 2 | the app crashes too much it is frustrating | 0.288291 | 0.0 | 0.0 | 0.288291 | 0.0 | 0.0 | 0.451664 | 0.0 | 0.288291 | 0.288291 | 0.0 | 0.451664 | 0.0 | 0.0 | 0.235697 | 0.0 | 0.451664 | 0.0 |
| 3 | the content is great it is easy to use it is g... | 0.0 | 0.0 | 0.230725 | 0.0 | 0.292645 | 0.0 | 0.0 | 0.46145 | 0.560375 | 0.373583 | 0.0 | 0.0 | 0.0 | 0.0 | 0.152714 | 0.292645 | 0.0 | 0.292645 |
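
Notice that `the` appears in every review and still gets a non-zero score here, unlike in the hand calculation. With its default settings, scikit-learn's `TfidfVectorizer` uses raw counts as TF, a smoothed natural-log IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit length (L2 norm). The sketch below re-implements that formula in plain Python to reproduce the first row. It assumes whitespace tokenization, which is enough for this toy data, and the helper names are made up for the illustration.

```python
import math

reviews = [
    'the app is great new features but crashes often',
    'love the app love the content but it crashes',
    'the app crashes too much it is frustrating',
    'the content is great it is easy to use it is great',
]
docs = [review.split() for review in reviews]
n_docs = len(docs)


def smooth_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so it never drops to 0."""
    document_frequency = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + document_frequency)) + 1


def tfidf_row(tokens):
    """Raw count * smoothed IDF for each word, then L2-normalize the row."""
    raw = {term: tokens.count(term) * smooth_idf(term) for term in set(tokens)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


row0 = tfidf_row(docs[0])
for term in ("the", "app", "new"):
    print(term, round(row0[term], 6))
```

The three values should agree with the `the`, `app` and `new` columns of row 0 in the table above.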


**Fun tip**: Take a look at the differences between this matrix and the one created using Bag of Words!

Now let's apply what we have learned to the Netflix dataset.

In [6]:
# Read the dataset
df = pd.read_csv('../DATASETS/preprocessed_text.csv')

# Filling empty text that occurred after text preprocessing
df.fillna('', inplace=True)

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the model and transform the data
tfidf = vectorizer.fit_transform(df['content_cleaned'])

# Print the size of the vocabulary and the shape of the matrix
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Shape of the sparse matrix: {tfidf.shape}")


Vocabulary size: 39783
Shape of the sparse matrix: (113292, 39783)


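The matrix has the same shape as the BoW one. With the same tokenization settings, both methods build one row per review and one column per vocabulary term; only the values in the cells differ.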

**Visualizing a single row**

Let's visualize a single review, just like we did in BoW.

In [8]:
row_index = 44 # Specific row (review) we are inspecting

# Display the original review text
print("Original Text:\n")
print(df['content_cleaned'][row_index])

# Convert the sparse matrix row to a dense array for the selected document
dense_array = tfidf[row_index].toarray()

# Get the feature names (vocabulary) from the vectorizer
words = vectorizer.get_feature_names_out()

# Prepare data for the DataFrame
data = {
    'Row Index': [],
    'Word': [],
    'TF-IDF Score': []
}

# Populate the data with words and their corresponding non-zero TF-IDF scores
for i in range(len(dense_array[0])):
    score = dense_array[0][i]
    if score > 0:  # Only include words with a non-zero TF-IDF score
        data['Row Index'].append(row_index)
        data['Word'].append(words[i])  # Get the actual word from the vocabulary
        data['TF-IDF Score'].append(score)

# Create a DataFrame to map the row index, word, and their corresponding TF-IDF score
tfidf_df = pd.DataFrame(data)

# Display the DataFrame
print("\n\nTF-IDF Matrix Representation as a Table:\n")
print(tfidf_df)

Original Text:

i love to watch the movie s you have on here and the tv shows i love them growing_heart


TF-IDF Matrix Representation as a Table:

    Row Index           Word  TF-IDF Score
0          44            and      0.109130
1          44  growing_heart      0.562755
2          44           have      0.157330
3          44           here      0.332458
4          44           love      0.417383
5          44          movie      0.234126
6          44             on      0.153703
7          44          shows      0.188295
8          44            the      0.207415
9          44           them      0.285375
10         44             to      0.110014
11         44             tv      0.234482
12         44          watch      0.172831
13         44            you      0.163088


**Note**: Single-character tokens are ignored by the default tokenization of `TfidfVectorizer`, which is why `i` and `s` from the original text are missing from the table. If you want single-character tokens to be in the vocabulary, you have to pass a custom `token_pattern` or a custom tokenizer.
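
To see why this happens, we can apply the default `token_pattern` of scikit-learn's vectorizers, `(?u)\b\w\w+\b` (two or more word characters), directly with Python's `re` module. The second pattern, which accepts one or more word characters, is only one possible alternative and is an assumption for this sketch.

```python
import re

text = ("i love to watch the movie s you have on here and "
        "the tv shows i love them growing_heart")

# Default pattern of scikit-learn's vectorizers: at least two word characters.
default_pattern = r"(?u)\b\w\w+\b"
# Alternative that also keeps single-character tokens (illustrative choice).
single_char_pattern = r"(?u)\b\w+\b"

default_tokens = re.findall(default_pattern, text)
all_tokens = re.findall(single_char_pattern, text)

# Tokens that the default pattern throws away.
dropped = sorted(set(all_tokens) - set(default_tokens))
print(dropped)
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` should keep those tokens in the vocabulary.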

It's cool to see that the 2 tokens with the highest TF-IDF scores are `growing_heart` and `love`, which is enough to understand that it's a positive review.
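
Ranking the words by score makes this easier to read than scanning the table. The sketch below copies the scores printed above into a plain dictionary (so it runs without the dataset) and sorts them; the variable names are illustrative.

```python
# TF-IDF scores for review 44, copied from the output above.
scores = {
    "and": 0.109130, "growing_heart": 0.562755, "have": 0.157330,
    "here": 0.332458, "love": 0.417383, "movie": 0.234126,
    "on": 0.153703, "shows": 0.188295, "the": 0.207415,
    "them": 0.285375, "to": 0.110014, "tv": 0.234482,
    "watch": 0.172831, "you": 0.163088,
}

# Sort by score, highest first, and keep the top three words.
top_words = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:3]

for word, score in top_words:
    print(f"{word:<15} {score:.6f}")
```

With the `tfidf_df` DataFrame from the cell above, `tfidf_df.nlargest(3, 'TF-IDF Score')` should give the same ranking.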

## Pros and Cons of TF-IDF

### Pros

- **Measuring relevance**: TF-IDF is great for figuring out which words are most relevant in a document. This can be really important for search engines and other text analysis apps.
   
- **Filtering Out Noise**: By reducing the impact of common words across documents, TF-IDF can filter out the usual "noise" or common words, allowing more relevant and unique content to stand out.
   
- **Simplicity and Efficiency**: TF-IDF is simple to understand and implement. It's also efficient, even with large datasets, because it only needs word counts and document frequencies, which are cheap to compute.

### Cons

- **Lack of Context Understanding**: TF-IDF doesn't take context into account, which can be a drawback for tasks that require understanding the meaning of the text.
   
- **Not Suitable for Short Texts**: In documents with very few words (like tweets or SMS messages), the TF-IDF scores might not be very informative since the frequency of words is generally low.
   
- **High-Dimensional Output**: The vectors generated by TF-IDF are usually high-dimensional (one dimension per unique word in the corpus). This can result in sparse matrices, which are more challenging to manage and process for some machine learning models.

- **Out-Of-Vocabulary Issue**: Again, the vectorizer cannot represent words that were not in the vocabulary used for fitting. In scikit-learn, such words are silently ignored when new text is transformed, so the information they carry is lost.
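
The out-of-vocabulary issue is easy to see with a tiny hand-made vectorizer. This is a simplified sketch, not scikit-learn's implementation: the training reviews and the `vectorize` helper are hypothetical, and it uses raw counts because the problem comes from the fixed vocabulary rather than from the weighting.

```python
# Hypothetical training reviews used to build the vocabulary.
train_reviews = ["the app is great", "the app crashes often"]
vocabulary = sorted({token for review in train_reviews for token in review.split()})


def vectorize(text):
    """Count each vocabulary term in `text`; tokens outside the vocabulary are dropped."""
    tokens = text.split()
    return [tokens.count(term) for term in vocabulary]


new_review = "the app is buggy"
vector = vectorize(new_review)

# Words in the new review that have no column in the matrix.
unseen = [token for token in new_review.split() if token not in vocabulary]

print(vocabulary)
print(vector)
print(unseen)
```

The word `buggy` carries most of the meaning of the new review, yet it has no column in the vocabulary, so it leaves no trace in the vector.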