# TF-IDF

_____________

### TF-IDF (Term Frequency-Inverse Document Frequency)

**TF-IDF** is a popular method for text vectorization that evaluates how important a word is to a document in a collection of documents. It combines two components: **Term Frequency (TF)** and **Inverse Document Frequency (IDF)**.

#### 1. **Term Frequency (TF)**

The **Term Frequency** of a word in a document is simply a measure of how frequently that word appears in the document. The basic formula for **TF** is:

$$
TF = \frac{\text{Number of times word appears in a document}}{\text{Total number of words in the document}}
$$

- **Why it matters**: A word that appears frequently in a document might be important for understanding that document. However, we need to weigh this frequency against the word's presence in other documents to ensure it’s truly significant.
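The TF formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production tokenizer: the function name `term_frequency` and the whitespace-based tokenization are assumptions made for the example.

```python
from collections import Counter


def term_frequency(document):
    """Return each word's count divided by the total number of words."""
    words = document.lower().split()  # naive whitespace tokenization (assumption)
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # "the" is 2 of the 6 words
print(tf["cat"])  # "cat" is 1 of the 6 words
```

Dividing by the document length keeps long documents from dominating simply because they contain more words.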

#### 2. **Inverse Document Frequency (IDF)**

The **Inverse Document Frequency** measures how rare a word is across the entire corpus. It reduces the weight of words that appear in many documents, since such words are usually common and do little to distinguish one document from another.

The formula for **IDF** is:

$$
IDF = \log \left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word}}\right)
$$

- **Why it matters**: Words that appear in many documents (e.g., "the", "is", "and") are considered less meaningful because they are not distinguishing features of any particular document. **IDF** helps emphasize rare and unique words that carry more information.
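As a rough sketch of the IDF formula above, the snippet below counts how many documents contain each word and takes the logarithm of the ratio. The function name `inverse_document_frequency`, the toy corpus, the natural logarithm, and the whitespace tokenization are all assumptions made for illustration.

```python
import math


def inverse_document_frequency(corpus):
    """Return log(N / df) for every word, where df is its document count."""
    n_docs = len(corpus)
    doc_freq = {}
    for document in corpus:
        # set() ensures a word is counted once per document, not once per occurrence
        for word in set(document.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


corpus = ["the cat sat", "the dog barked", "the bird sang"]
idf = inverse_document_frequency(corpus)
print(idf["the"])  # appears in every document, so log(3 / 3) = 0
print(idf["cat"])  # appears in one document, so log(3 / 1)
```

A word found in every document gets an IDF of exactly zero under this formula, which is why many libraries use a smoothed variant instead.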

#### 3. **TF-IDF Calculation**

The final **TF-IDF** score is calculated by multiplying **TF** and **IDF**:

$$
\text{TF-IDF} = \text{TF} \times \text{IDF}
$$

This score reflects both how often a word appears in a specific document and how rare or unique it is across the entire corpus. Words that are frequent in a document but rare in others will have a high **TF-IDF** score, meaning they are important to that document.
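Putting the two pieces together, a minimal from-scratch version of the product above might look like the following. The helper name `tf_idf`, the toy corpus, and the whitespace tokenization are assumptions for this sketch, and it follows the plain formulas in this section rather than any particular library's variant.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: TF * IDF} dictionary per document."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


corpus = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
scores = tf_idf(corpus)
print(scores[0]["the"])  # in every document, so IDF (and the score) is 0
print(scores[0]["mat"])  # only in the first document: (1/6) * log(3)
print(scores[0]["cat"])  # in two of three documents: (1/6) * log(3/2)
```

In the first document, "mat" outscores "cat" even though both appear once, because "mat" occurs in no other document.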

#### Why Use TF-IDF?

- **Relevance**: TF-IDF helps identify words that are specific and important to a document, reducing the influence of common words that don't add much value in distinguishing between documents.
  
- **Improved Accuracy**: By weighing **TF** against **IDF**, the method down-weights ubiquitous words that raw counts would over-represent. The resulting features are often more informative for machine learning models than plain word counts.

In conclusion, **TF-IDF** is a crucial technique in text analysis and NLP, allowing for more meaningful vector representations by balancing word frequency and document importance.


_______________________

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [3]:
tfidfvec = TfidfVectorizer()
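
The numbers this vectorizer prints will not match the plain formulas above. Assuming scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), TF is the raw count, IDF is smoothed as $\ln\frac{1 + N}{1 + df} + 1$, and each document row is scaled to unit Euclidean length. The sketch below re-implements that variant in plain Python to make the arithmetic visible. The helper names `tokenize` and `smoothed_tfidf` are hypothetical, and the regular expression only approximates the default token pattern (lowercased tokens of two or more word characters).

```python
import math
import re
from collections import Counter

data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']


def tokenize(text):
    # Approximation of the default tokenizer: lowercase, drop 1-character tokens
    return re.findall(r"\b\w\w+\b", text.lower())


def smoothed_tfidf(corpus):
    tokenized = [tokenize(doc) for doc in corpus]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # raw count * smoothed IDF
        weights = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                   for term, count in counts.items()}
        # L2-normalize the row so its squared weights sum to 1
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = smoothed_tfidf(data)
print(round(rows[0]["attacks"], 6), round(rows[0]["are"], 6))
```

Under these assumptions the values for row 0 should land close to the `attacks` and `are` columns in the printed matrix (0.257061 and 0.210794). Note also that "the", which appears in every sentence, keeps a non-zero weight here because of the `+ 1` in the smoothed IDF.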

In [4]:
# Learn the vocabulary and IDF weights, then return the sparse document-term matrix
tfidfvec_fit = tfidfvec.fit_transform(data)

In [5]:
# Convert the sparse matrix to a dense DataFrame with one column per vocabulary term
tfidf_bag = pd.DataFrame(tfidfvec_fit.toarray(), columns=tfidfvec.get_feature_names_out())

In [6]:
print(tfidf_bag)

         10     about  admirable     ahead       are        as   attacks  \
0  0.257061  0.257061   0.000000  0.000000  0.210794  0.000000  0.257061   
1  0.000000  0.000000   0.293641  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000   0.000000  0.000000  0.000000  0.292313  0.000000   
3  0.000000  0.000000   0.000000  0.000000  0.222257  0.000000  0.000000   
4  0.000000  0.000000   0.000000  0.290766  0.000000  0.000000  0.000000   
5  0.000000  0.000000   0.000000  0.000000  0.000000  0.178615  0.000000   

      back     bait     beach  ...      were     west     when     where  \
0  0.00000  0.00000  0.257061  ...  0.000000  0.00000  0.00000  0.257061   
1  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
2  0.00000  0.00000  0.000000  ...  0.356474  0.00000  0.00000  0.000000   
3  0.00000  0.00000  0.000000  ...  0.000000  0.27104  0.27104  0.000000   
4  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
5  0.21782 