# 🧠 Traditional Text Representations in NLP
In this lesson, we explore how raw text can be represented numerically using traditional methods:
- Bag of Words (BoW)
- Term Frequency-Inverse Document Frequency (TF-IDF)

These representations are foundational for many NLP tasks like classification, clustering, and information retrieval.




# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import os

# --------------------------------------
# 1. Define the Corpus (Example Sentences)
# --------------------------------------
corpus = [
    "I like the future of AI",
    "I'm learning the AI"
]

# --------------------------------------
# 2. Initialize Vectorizers
# --------------------------------------

# CountVectorizer for Bag of Words representation
# (defaults: lowercases the text and keeps only tokens of 2+ characters, so "I" is dropped)
bow_vectorizer = CountVectorizer()

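# TfidfVectorizer for TF-IDF representation
# (defaults: smoothed IDF and L2-normalized rows, so values differ from the plain TF x IDF formula)
tfidf_vectorizer = TfidfVectorizer()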

# --------------------------------------
# 3. Fit and Transform the Corpus
# --------------------------------------

# Create BoW matrix
bow_matrix = bow_vectorizer.fit_transform(corpus)

# Create TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# --------------------------------------
# 4. Get Feature Names (Vocabulary Words)
# --------------------------------------

bow_features = bow_vectorizer.get_feature_names_out()
tfidf_features = tfidf_vectorizer.get_feature_names_out()

# --------------------------------------
# 5. Create and Save BoW Heatmap
# --------------------------------------

# Set up the plot
plt.figure(figsize=(10, 6))
sns.heatmap(
    bow_matrix.toarray(),
    annot=True,
    cmap="Blues",
    xticklabels=bow_features,
    yticklabels=[f"Sentence {i+1}" for i in range(len(corpus))]
)
plt.title("Bag of Words (BoW) Matrix")
plt.xlabel("Words")
plt.ylabel("Sentences")

# Save the figure before showing
os.makedirs("figures", exist_ok=True)
plt.savefig("figures/bow_matrix.png")
plt.show()
plt.close()

# --------------------------------------
# 6. Create and Save TF-IDF Heatmap
# --------------------------------------

plt.figure(figsize=(10, 6))
sns.heatmap(
    tfidf_matrix.toarray(),
    annot=True,
    cmap="YlGnBu",
    xticklabels=tfidf_features,
    yticklabels=[f"Sentence {i+1}" for i in range(len(corpus))]
)
plt.title("TF-IDF Matrix")
plt.xlabel("Words")
plt.ylabel("Sentences")

# Save the figure before showing
plt.savefig("figures/tfidf_matrix.png")
plt.show()
plt.close()


### 📘 Text Representation: Bag of Words (BoW) and TF-IDF

---

#### 🔹 Bag of Words (BoW)

- **Concept**: BoW is a simple and commonly used method for text representation in NLP.
- **How it works**:
  - Each document (sentence) is represented as a vector.
  - The vector contains the **count of each word** in the document.
  - It does **not consider grammar or word order**, only frequency.
- **Example**:
  - Corpus: `["NLP is amazing", "I learn NLP"]`
  - Vocabulary: `["NLP", "is", "amazing", "I", "learn"]`
  - Each sentence becomes a vector of word counts:
    ```
    [1, 1, 1, 0, 0]   → "NLP is amazing"
    [1, 0, 0, 1, 1]   → "I learn NLP"
    ```

> ✅ BoW is useful for simple models but has limitations like ignoring word order and context.
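
The example above can be reproduced in a few lines of plain Python. This is a minimal sketch, not how `CountVectorizer` works internally: it assumes whitespace tokenization, keeps the original casing, and orders the vocabulary by first appearance so that it matches the vocabulary listed above. The names `vocabulary` and `bow_vectors` are illustrative.

```python
corpus = ["NLP is amazing", "I learn NLP"]

# Build the vocabulary in order of first appearance
# (whitespace tokenization is an assumption made for this sketch).
vocabulary = []
for document in corpus:
    for token in document.split():
        if token not in vocabulary:
            vocabulary.append(token)

# Count how often each vocabulary word occurs in each document.
bow_vectors = [
    [document.split().count(word) for word in vocabulary]
    for document in corpus
]

print(vocabulary)
for document, vector in zip(corpus, bow_vectors):
    print(vector, "->", document)
```

With its default settings, `CountVectorizer` would give a slightly different vocabulary for the same corpus: it lowercases the text, drops single-character tokens such as "I", and sorts the vocabulary alphabetically.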

---

#### 🔹 TF-IDF (Term Frequency–Inverse Document Frequency)

- **Concept**: TF-IDF improves on BoW by weighting each word according to how important it is within a document and across the corpus.
- **How it works**:
  - **TF (Term Frequency)**: Measures how often a term appears in a document.
  - **IDF (Inverse Document Frequency)**: Measures how unique a term is across all documents. Common words get lower weights.
- **Formula**:

  $$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of words in document } d}$$

  IDF represents the importance of a word across a collection of documents (the corpus). A word that appears in most of the documents is a common word and carries little distinguishing value.

  $$\mathrm{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$
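
To make the two formulas concrete, here is a small hand-rolled sketch that applies them to the toy corpus from the BoW example and multiplies TF by IDF, which is the usual way the two are combined. It assumes whitespace tokenization and the natural logarithm; the helper names `term_frequency` and `inverse_document_frequency` are illustrative, not part of any library.

```python
import math

corpus = ["NLP is amazing", "I learn NLP"]
tokenized = [document.split() for document in corpus]  # whitespace tokenization (assumption)
vocabulary = sorted({token for tokens in tokenized for token in tokens})


def term_frequency(term, tokens):
    """Occurrences of the term divided by the number of words in the document."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, documents):
    """Natural log of (total documents / documents containing the term)."""
    containing = sum(1 for tokens in documents if term in tokens)
    return math.log(len(documents) / containing)


# One dictionary of TF x IDF scores per document.
tfidf = [
    {
        term: term_frequency(term, tokens) * inverse_document_frequency(term, tokenized)
        for term in vocabulary
    }
    for tokens in tokenized
]

for document, scores in zip(corpus, tfidf):
    print(document, {term: round(score, 3) for term, score in scores.items()})
```

"NLP" appears in both documents, so its IDF is log(2/2) = 0 and its score vanishes in both, while words unique to one document keep a positive weight. scikit-learn's `TfidfVectorizer` will not reproduce these exact numbers: by default it uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row.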

The log is used here because it gives more weight to words that 

