## Early Days: One-Hot Encoding

Early approaches represented words with **one-hot encoding**. Each word in the vocabulary gets its own vector whose length equals the vocabulary size, with a 1 in the dimension assigned to that word and 0 everywhere else. This approach has several limitations:

1. **High Dimensionality**: Vocabulary sizes can be large, leading to very high-dimensional vectors.
2. **Sparsity**: All but one element of each vector is zero, which wastes memory and computation when the vectors are stored densely.
3. **Lack of Semantics**: One-hot vectors do not capture any semantic relationships between words. Every pair of distinct words is the same distance apart, so "king" is no closer to "queen" than it is to "apple".
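
The limitations above can be seen in a minimal sketch. The three-word vocabulary and the `one_hot` helper are hypothetical and exist only for illustration; a real vocabulary would have thousands of entries or more.

```python
import math

# Toy vocabulary (hypothetical, for illustration only)
vocab = ["apple", "king", "queen"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of len(vocab) with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

king, queen, apple = one_hot("king"), one_hot("queen"), one_hot("apple")
print(king)  # [0, 1, 0]

# Euclidean distance between any two distinct one-hot vectors is sqrt(2)
print(math.dist(king, queen))
print(math.dist(king, apple))
```

Both distances come out equal, which is the "lack of semantics" problem in numeric form: the encoding carries no information about which words are related.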

### 1. One-Hot Encoding

### 2. TF-IDF Encoding

**TF-IDF** (Term Frequency-Inverse Document Frequency) is a statistical weighting scheme from **information retrieval** and **text mining**. It scores how important a word is to one document relative to a whole collection of documents (the corpus). It is commonly used in **text classification**, **document clustering**, **search engines**, and **recommendation systems**.

### What does TF-IDF represent?

1. **TF (Term Frequency)**: Measures how frequently a word occurs in a document.
   
   \[
   \text{TF}(t,d) = \frac{\text{Count of term } t \text{ in document } d}{\text{Total number of terms in document } d}
   \]
   - Here, \( t \) is a specific term (word) and \( d \) is a document.
   - If the word appears more times in the document, the TF will be higher. This indicates that the word is **important within that document**.

2. **IDF (Inverse Document Frequency)**: Measures how rare, and therefore how distinguishing, a word is across the entire corpus of documents.
   
   \[
   \text{IDF}(t,D) = \log \left( \frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t} \right)
   \]
   - If the word appears in many documents, it is less important (IDF will be smaller), because it is common and not distinguishing.
   - If the word appears in only a few documents, it is more important (IDF will be larger), as it is more unique to the documents in which it appears.

3. **TF-IDF**: The final metric that combines both TF and IDF to rank words by their importance in a document within the context of the entire corpus.
   
   \[
   \text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D)
   \]

   This means that:
   - Words that appear frequently in a document but not across many other documents will have a high TF-IDF score.
   - Words that appear frequently across many documents will have a low TF-IDF score, as they don't provide as much distinguishing power.
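
The three definitions above translate almost directly into code. The following is a minimal sketch, not a reference implementation: the function names, the toy corpus, and the choice of a base-10 logarithm are assumptions made for illustration, and `idf` assumes the term occurs in at least one document (otherwise it would divide by zero).

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: count of the term divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency with a base-10 log (assumed convention)."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log10(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF is the product of the two quantities above."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical toy corpus, already tokenized
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

# "the" occurs in every document, so its IDF (and TF-IDF) is zero
print(tf_idf("the", corpus[0], corpus))

# "dog" occurs in only one document, so it scores higher there
print(tf_idf("dog", corpus[1], corpus))
```

A word present in every document gets a score of exactly zero under this unsmoothed formula, which is why libraries often apply a smoothed variant instead.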

### Why is TF-IDF useful?
- **Identify important terms**: TF-IDF helps in identifying the most relevant words in a document or a collection of documents.
- **Reduce noise**: Common words (e.g., "the", "is", "and") that don’t carry meaningful information are given low TF-IDF scores and can be filtered out, leaving more useful keywords.
- **Feature extraction**: TF-IDF is often used as a feature extraction method for machine learning tasks like classification, clustering, and recommendation systems.

### Example Calculation of TF-IDF:

Let’s say we have the following three documents in our corpus:

- **Document 1**: "I love machine learning"
- **Document 2**: "Machine learning is fun"
- **Document 3**: "I love coding in Python"

#### Step 1: Calculate **TF** (Term Frequency)
For the word "machine" in Document 1:
- Document 1 has 4 words: ["I", "love", "machine", "learning"]
- "Machine" appears 1 time.
  
\[
\text{TF}(\text{machine}, D_1) = \frac{1}{4} = 0.25
\]

For the word "learning" in Document 2:
- Document 2 has 4 words: ["Machine", "learning", "is", "fun"]
- "Learning" appears 1 time.

\[
\text{TF}(\text{learning}, D_2) = \frac{1}{4} = 0.25
\]

#### Step 2: Calculate **IDF** (Inverse Document Frequency)

First, count how many documents contain the word "machine". It appears in Document 1 and Document 2, so in **2 out of 3 documents**. The values below use a base-10 logarithm.

\[
\text{IDF}(\text{machine}, D) = \log_{10} \left( \frac{3}{2} \right) \approx 0.176
\]

Next, calculate the IDF for the word "love". It appears in Document 1 and Document 3, so again in **2 out of 3 documents**.

\[
\text{IDF}(\text{love}, D) = \log_{10} \left( \frac{3}{2} \right) \approx 0.176
\]

#### Step 3: Calculate **TF-IDF** for each word

Now that we have TF and IDF, we can compute the TF-IDF score for each word. For the word "machine" in Document 1:

\[
\text{TF-IDF}(\text{machine}, D_1) = 0.25 \times 0.176 \approx 0.044
\]

For the word "learning" in Document 2 (like "machine", "learning" appears in 2 of the 3 documents, so its IDF is also about 0.176):

\[
\text{TF-IDF}(\text{learning}, D_2) = 0.25 \times 0.176 \approx 0.044
\]
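
The hand calculation can be checked with a few lines of Python. This is a sketch under stated assumptions: tokens are produced by lowercasing and splitting on whitespace, the logarithm is base 10, and the `score` helper is a hypothetical name introduced here.

```python
import math

docs = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python",
]
# Simple tokenization: lowercase, then split on whitespace
tokens = [d.lower().split() for d in docs]

def score(term, doc_index):
    """Return (TF, IDF, TF-IDF) for a term in the given document."""
    doc = tokens[doc_index]
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in tokens)   # number of documents containing the term
    idf = math.log10(len(tokens) / df)
    return tf, idf, tf * idf

for term, i in [("machine", 0), ("learning", 1)]:
    tf, idf, tfidf = score(term, i)
    print(f"{term!r} in Document {i + 1}: TF={tf:.2f}, IDF={idf:.3f}, TF-IDF={tfidf:.3f}")
```

Both lines report TF=0.25, IDF=0.176, and TF-IDF=0.044, matching the three steps worked out by hand.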

### Use Cases of TF-IDF

1. **Text Classification**:
   - **Text classification models** (like spam detection or sentiment analysis) often use TF-IDF as a feature extraction method. By converting a collection of documents (or articles, posts, etc.) into vectors of TF-IDF values, you can use these vectors as inputs to machine learning models like Naive Bayes, SVM, etc.

2. **Search Engines**:
   - TF-IDF is used by search engines to rank the relevance of documents for a particular search query. When a user queries for a specific term, the TF-IDF score of the terms in each document helps to rank which documents are most relevant.

3. **Clustering**:
   - TF-IDF can be used to represent documents in a vector space for **unsupervised learning** tasks like **clustering**. For instance, K-means clustering can be applied to group similar documents based on their TF-IDF vector representations.

4. **Recommender Systems**:
   - In content-based recommender systems, TF-IDF can be used to compare articles, books, or movies based on their textual content.

5. **Information Retrieval**:
   - TF-IDF helps rank documents by how relevant they are to a given search query, allowing for better retrieval of information.
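
As one illustration of the search and retrieval use cases, the sketch below ranks the three example documents against a query by cosine similarity between TF-IDF vectors. The query string is hypothetical, and cosine similarity is one common choice of relevance measure rather than the only one; production search engines typically use more elaborate scoring.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python",
]
query = "python coding"  # hypothetical search query

# Learn the vocabulary and IDF weights from the documents only
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Project the query into the same vector space
query_vector = vectorizer.transform([query])

# One similarity score per document, then sort from most to least similar
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]

for i in ranking:
    print(f"{scores[i]:.3f}  {documents[i]}")
```

The third document ranks first because it is the only one that shares terms with the query; the other two receive a score of zero.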

### Implementation in Python (using `scikit-learn`):

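```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Example documents
documents = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python"
]

# Initialize the TF-IDF vectorizer with its default settings
vectorizer = TfidfVectorizer()

# Fit on the documents and transform them into a sparse TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Show the feature names (terms), one per matrix column
print(vectorizer.get_feature_names_out())

# Show the document-term matrix, rounded to 3 decimals for readability
print(tfidf_matrix.toarray().round(3))
```

### Output:

```
['coding' 'fun' 'in' 'is' 'learning' 'love' 'machine' 'python']
[[0.    0.    0.    0.    0.577 0.577 0.577 0.   ]
 [0.    0.563 0.    0.563 0.428 0.    0.428 0.   ]
 [0.529 0.    0.529 0.    0.    0.402 0.    0.529]]
```

Here, each row represents a document, and each column represents a word in the vocabulary, in the order printed on the first line. The values are the **TF-IDF** scores for each term in each document.

Two details explain why this output differs from the hand calculation. First, the default tokenizer lowercases the text and keeps only tokens of two or more characters, so "I" is not part of the vocabulary. Second, by default scikit-learn uses a smoothed IDF with a natural logarithm and scales each row to unit Euclidean length, so the absolute values are not the same as those from the plain formula, although rarer terms still receive higher weights within a document.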
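
To make the relationship between the textbook formula and scikit-learn's numbers concrete, the sketch below rebuilds the matrix by hand. It assumes the library's documented defaults (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency); the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python",
]

# Reference result from the library
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents).toarray()

# Raw term counts, using the same default tokenization
counts = CountVectorizer().fit_transform(documents).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # document frequency per term

# Smoothed IDF with a natural log: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale each row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(idf, vectorizer.idf_))      # IDF weights agree
print(np.allclose(manual, tfidf))             # full matrix agrees
```

The smoothing term keeps a word that appears in every document from being zeroed out entirely, and the row normalization makes documents of different lengths comparable.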

### Conclusion:
**TF-IDF** is a powerful tool for **feature extraction** in text-based applications. By measuring both the term frequency and the rarity of terms across a document corpus, TF-IDF provides a balanced approach to identifying important keywords in texts. It’s widely used in search engines, classification tasks, and content-based recommendation systems.