## Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a text-weighting technique that improves on simpler methods like Bag of Words. Those methods only count words, so every occurrence carries the same importance, which can skew the representation. For example, a common word like 'is' might reach a count of 100, while a more meaningful word like 'Tesla' appears only 10 times. A model trained on such counts may put too much emphasis on the less informative words.

TF-IDF addresses this by adjusting the weight of each word according to its rarity across all documents. It reduces the influence of frequent but less meaningful words and increases the weight of words that say more about the content of a document. The result is a more balanced representation, without having to remove common words from the dataset by hand. In short, TF-IDF highlights the words that define the topic and context of a document, which makes it well suited to tasks involving text relevance and categorization.

More formally, TF-IDF is a numerical statistic that indicates how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document, but it is offset by the number of documents in the corpus that contain the word. This adjusts for the fact that some words appear more frequently in general.

### Components of TF-IDF

1. **Term Frequency (TF)**: This measures how frequently a term occurs in a document. It is calculated as follows:
   - **Formula**: TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

2. **Inverse Document Frequency (IDF)**: This measures how rare, and therefore how informative, a term is across the corpus. TF alone treats all terms as equally important, yet terms like "is", "of", and "that" may appear many times while carrying little meaning. IDF therefore weighs down the frequent terms and scales up the rare ones. It is calculated as follows:
   - **Formula**: IDF(t, D) = log(Total number of documents / Number of documents containing term t)
   - A common variant uses 1 + (Number of documents containing term t) as the denominator, which avoids division by zero for a term that appears in no document.

### TF-IDF Calculation

The TF-IDF value is simply the product of TF and IDF:
   - **Formula**: TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)

This value is highest for terms that occur often in the document but in few other documents, so it measures how important the term is to the document relative to the corpus as a whole.
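The three formulas above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the three-sentence corpus and the function names `tf`, `idf`, and `tf_idf` are made up for this example, the logarithm is taken as the natural log, and the unsmoothed IDF is used, so every term passed in must occur in at least one document.

```python
import math

# Hypothetical three-document corpus, used only for illustration.
corpus = [
    "the car is fast",
    "the car is red",
    "the tesla is electric",
]
docs = [doc.split() for doc in corpus]


def tf(term, doc_tokens):
    """Share of the document's tokens that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, all_docs):
    """Natural log of (number of documents / documents containing `term`).

    Assumes `term` occurs in at least one document (no smoothing).
    """
    containing = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / containing)


def tf_idf(term, doc_tokens, all_docs):
    """Product of the two components."""
    return tf(term, doc_tokens) * idf(term, all_docs)


# Score every term of the third document.
for term in docs[2]:
    print(f"{term:10s} {tf_idf(term, docs[2], docs):.4f}")
```

Because "the" and "is" occur in every document, their IDF is log(1) = 0 and their TF-IDF score vanishes, while "tesla" and "electric", which occur in only one document, keep a positive score.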

In [9]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Let's begin by creating a sample dataset, just as we did for Bag of Words.

In [10]:
# Sample DataFrame with Netflix app reviews
df = pd.DataFrame({
    'Content_cleaned': [
        'the app is great new features but crashes often',
        'love the app love the content but it crashes',
        'the app crashes too much it is frustrating',
        'the content is great it is easy to use it is great'
    ]
})

In [11]:
df.head()

| | Content_cleaned |
|---|---|
| 0 | the app is great new features but crashes often |
| 1 | love the app love the content but it crashes |
| 2 | the app crashes too much it is frustrating |
| 3 | the content is great it is easy to use it is g... |


Here is the code for applying TF-IDF:

In [14]:
# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the data to compute TF-IDF
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Content_cleaned'])

# Create a DataFrame with the TF-IDF scores
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Insert the original reviews as the first column in the DataFrame
tfidf_df.insert(0, 'Original_Review', df['Content_cleaned'])

# Display the first rows of the DataFrame with TF-IDF scores
tfidf_df.head()

| | Original_Review | app | but | content | crashes | easy | features | frustrating | great | is | it | love | much | new | often | the | to | too | use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | the app is great new features but crashes often | 0.266468 | 0.329142 | 0.0 | 0.266468 | 0.0 | 0.417474 | 0.0 | 0.329142 | 0.266468 | 0.0 | 0.0 | 0.0 | 0.417474 | 0.417474 | 0.217855 | 0.0 | 0.0 | 0.0 |
| 1 | love the app love the content but it crashes | 0.232224 | 0.286843 | 0.286843 | 0.232224 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.232224 | 0.727649 | 0.0 | 0.0 | 0.0 | 0.379717 | 0.0 | 0.0 | 0.0 |
| 2 | the app crashes too much it is frustrating | 0.288291 | 0.0 | 0.0 | 0.288291 | 0.0 | 0.0 | 0.451664 | 0.0 | 0.288291 | 0.288291 | 0.0 | 0.451664 | 0.0 | 0.0 | 0.235697 | 0.0 | 0.451664 | 0.0 |
| 3 | the content is great it is easy to use it is g... | 0.0 | 0.0 | 0.230725 | 0.0 | 0.292645 | 0.0 | 0.0 | 0.46145 | 0.560375 | 0.373583 | 0.0 | 0.0 | 0.0 | 0.0 | 0.152714 | 0.292645 | 0.0 | 0.292645 |


### **Fun tip**: Go and check the differences between this matrix and the one created by using Bag of Words!

## Pros and Cons of TF-IDF

### Pros

1. **Relevance Measurement**: TF-IDF is very effective in measuring word relevance in documents. It helps to identify the most significant words in a document, which can be crucial for search engines and other text analysis applications.
   
2. **Filtering Out Noise**: By decreasing the weight of words that are common across documents, TF-IDF suppresses the usual 'noise' of common words, allowing more relevant and distinctive content to stand out.
   
3. **Simplicity and Efficiency**: TF-IDF is simple to understand and implement. It needs only term counts and one logarithm per term, so it stays cheap to compute even on large datasets.
   
4. **Versatility**: It can be used as a foundational tool for many NLP tasks including document classification, clustering, and information retrieval systems.
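To make the search-engine use case from the first point concrete, here is a minimal retrieval sketch. It assumes the four sample reviews from above, a hypothetical query string, and cosine similarity as the ranking function; none of these choices come from the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviews = [
    'the app is great new features but crashes often',
    'love the app love the content but it crashes',
    'the app crashes too much it is frustrating',
    'the content is great it is easy to use it is great',
]

# Learn the vocabulary and IDF weights from the reviews.
vectorizer = TfidfVectorizer()
review_vectors = vectorizer.fit_transform(reviews)

# Hypothetical search query, projected into the same vector space.
query = 'app crashes'
query_vector = vectorizer.transform([query])

# Similarity of the query to every review, highest first.
scores = cosine_similarity(query_vector, review_vectors).ravel()
ranking = scores.argsort()[::-1]

for idx in ranking:
    print(f'{scores[idx]:.3f}  {reviews[idx]}')
```

Note that the query is passed through `transform`, not `fit_transform`, so it is scored with the IDF weights learned from the reviews. The review that mentions neither "app" nor "crashes" receives a score of 0.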

### Cons

1. **Lack of Context Understanding**: TF-IDF focuses solely on the frequency of words and ignores the context or order of words. This can be a limitation for tasks that require understanding the semantic meaning of the text.
   
2. **Not Suitable for Short Texts**: In documents with very few words (like tweets or SMS messages), most terms occur only once, so the term-frequency part carries little signal and the TF-IDF scores might not be very informative.
   
3. **High-Dimensional Output**: The vectors generated by TF-IDF are typically high-dimensional (one dimension per unique word in the corpus). This can lead to sparse matrices, which are harder to manage and process for some machine learning models.
   
4. **IDF Part Sensitivity**: The IDF component can be significantly affected if the document corpus is not representative of the general language use, potentially skewing results.
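Three of these limitations can be observed directly with a few lines of code. The sketch below is illustrative only: the "dog bites man" sentences, the `mixed_corpus` list, and the helper `idf_of` are made up for this example, and `max_features` is shown as just one possible way to cap the vocabulary size.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    'the app is great new features but crashes often',
    'love the app love the content but it crashes',
    'the app crashes too much it is frustrating',
    'the content is great it is easy to use it is great',
]

# 1. Word order is ignored: two hypothetical sentences with opposite
#    meanings but the same words end up with identical vectors.
order_docs = ['dog bites man', 'man bites dog']
order_matrix = TfidfVectorizer().fit_transform(order_docs).toarray()
same_vector = np.allclose(order_matrix[0], order_matrix[1])
print('identical vectors:', same_vector)

# 2. One column per unique word, and most cells are zero.
matrix = TfidfVectorizer().fit_transform(reviews)
n_rows, n_cols = matrix.shape
zero_share = 1 - matrix.nnz / (n_rows * n_cols)
print(f'shape: {matrix.shape}, share of zeros: {zero_share:.2f}')

# max_features keeps only the terms with the highest corpus-wide counts.
capped = TfidfVectorizer(max_features=5).fit_transform(reviews)
print('capped shape:', capped.shape)


# 3. IDF depends on the corpus it was fitted on.
def idf_of(term, corpus):
    """IDF weight that a freshly fitted vectorizer assigns to `term`."""
    vectorizer = TfidfVectorizer().fit(corpus)
    return vectorizer.idf_[vectorizer.vocabulary_[term]]


# Hypothetical corpus in which 'app' is a rare word.
mixed_corpus = [
    'the app crashes',
    'the weather is nice',
    'the train was late',
    'the food is great',
]
print('idf of "app" in reviews:     ', round(idf_of('app', reviews), 3))
print('idf of "app" in mixed corpus:', round(idf_of('app', mixed_corpus), 3))
```

The word "app" is common in the review corpus and rare in the mixed one, so the same word receives a noticeably different IDF weight depending on which documents the vectorizer was fitted on.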
