# Text Representation

Before text can be fed into a machine learning model, it must be converted into a numerical format. This process is called text representation.

### Text Representation Techniques
1️⃣ Bag of Words (BoW) - (Simple, Count-Based)

2️⃣ TF-IDF (Term Frequency-Inverse Document Frequency) - (Weight-Based)

3️⃣ N-grams (Unigram, Bigram, Trigram) - (Context-Based)

4️⃣ Word Embeddings (Word2Vec, GloVe, FastText) - (Dense, Learned Vectors)

# Bag of Words
Bag of Words (BoW) is a simple method that converts text into a word-frequency matrix. It ignores word order and records only how often each word occurs in each document.

📌 How does it work?

1️⃣ Tokenize the text into words.

2️⃣ Count the frequency of each word.

3️⃣ Create a matrix where rows = sentences/documents and columns = words.
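The three steps above can be sketched without any library. This is a minimal illustration, assuming lowercase whitespace tokenization; the names `docs`, `tokenized`, `counts`, `vocab`, and `matrix` are illustrative and not part of any API.

```python
from collections import Counter

docs = [
    "I love NLP",
    "NLP is awesome",
    "Deep learning is useful for NLP",
]

# Step 1: tokenize (lowercase + whitespace split, a deliberate simplification)
tokenized = [doc.lower().split() for doc in docs]

# Step 2: count how often each word occurs in each document
counts = [Counter(tokens) for tokens in tokenized]

# Step 3: build the matrix (rows = documents, columns = sorted vocabulary)
vocab = sorted({token for tokens in tokenized for token in tokens})
matrix = [[count[word] for word in vocab] for count in counts]

print(vocab)
for row in matrix:
    print(row)
```

Unlike scikit-learn's `CountVectorizer`, this sketch keeps the one-letter token "i", because it applies no minimum token length.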

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text corpus
text = [
    "I love NLP",
    "NLP is awesome",
    "Deep learning is useful for NLP"
]

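# Initialize CountVectorizer (by default it lowercases the text and drops
# single-character tokens, which is why "I" is missing from the output)
vectorizer = CountVectorizer()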

# Fit and transform the corpus
X = vectorizer.fit_transform(text)

# Convert the sparse count matrix to a dense array for display
print(X.toarray())

# Show the feature names (words)
print(vectorizer.get_feature_names_out())


[[0 0 0 0 0 1 1 0]
 [1 0 0 1 0 0 1 0]
 [0 1 1 1 1 0 1 1]]
['awesome' 'deep' 'for' 'is' 'learning' 'love' 'nlp' 'useful']


# TF-IDF

TF-IDF is a refinement of Bag of Words (BoW). Instead of using raw counts, it weights each word by how often it occurs in a document (term frequency) and discounts words that occur in many documents (inverse document frequency).

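✔️ Words that appear in most documents (like "is", "the") get lower weights.

✔️ Words concentrated in few documents (like "Transformer" in a general news corpus) get higher weights. In the sample corpus below, "NLP" occurs in all three documents, so it actually receives the lowest weight in each row.

✔️ This reduces the dominance of frequently occurring but less informative words.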
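To make the weighting concrete, the sketch below computes TF-IDF by hand for a small pre-tokenized corpus. It assumes the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document, which is what scikit-learn's `TfidfVectorizer` does by default; other libraries and textbooks use slightly different formulas. The helper names `idf` and `tfidf` are illustrative.

```python
import math

# Pre-tokenized, lowercased documents (single-letter tokens already removed)
docs = [
    ["love", "nlp"],
    ["nlp", "is", "amazing"],
    ["deep", "learning", "is", "useful", "for", "nlp"],
]
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency (assumed variant, see above)."""
    df = sum(term in doc for doc in docs)  # number of documents containing term
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    raw = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights = tfidf(docs[0])
print(weights)
```

For the first document, "love" (found in one document) comes out at about 0.861 and "nlp" (found in all three) at about 0.509, so the word shared by every document is weighted lower.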

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text corpus
text = [
    "I love NLP",
    "NLP is amazing",
    "Deep learning is useful for NLP"
]

# Initialize TfidfVectorizer (defaults: smoothed IDF, L2-normalized rows)
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(text)

# Convert to array
print(X.toarray())

# Show the feature names (words)
print(vectorizer.get_feature_names_out())


[[0.         0.         0.         0.         0.         0.861037
  0.50854232 0.        ]
 [0.72033345 0.         0.         0.54783215 0.         0.
  0.42544054 0.        ]
 [0.         0.45050407 0.45050407 0.34261996 0.45050407 0.
  0.26607496 0.45050407]]
['amazing' 'deep' 'for' 'is' 'learning' 'love' 'nlp' 'useful']


# N-grams
N-grams are contiguous sequences of N words from a given text. They help capture local context and word order, which single-word counts ignore.

### Types of N-grams:
1️⃣ Unigram → Single words (e.g., "I", "love", "NLP")

2️⃣ Bigram → Two-word sequences (e.g., "I love", "love NLP")

3️⃣ Trigram → Three-word sequences (e.g., "I love NLP")

4️⃣ N-gram → Any contiguous sequence of N words
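The types above all come from sliding a window of size N over the token list. A minimal sketch, assuming simple whitespace tokenization; `ngrams` is a hypothetical helper written for this example, not a library function.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love NLP".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with T tokens yields T - N + 1 n-grams, so a three-word sentence has two bigrams and one trigram.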

📌 Why Use N-grams?

✔️ Captures context beyond individual words

✔️ Improves text classification and language modeling

✔️ Helps in autocorrect, autocomplete, and speech recognition



In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text corpus
corpus = [
    "I love NLP",
    "NLP is amazing",
    "Deep learning is useful for NLP"
]

# Unigram
vectorizer_unigram = CountVectorizer(ngram_range=(1,1))
X_unigram = vectorizer_unigram.fit_transform(corpus)
print("Unigram Feature Names:", vectorizer_unigram.get_feature_names_out())

# Bigram
vectorizer_bigram = CountVectorizer(ngram_range=(2,2))
X_bigram = vectorizer_bigram.fit_transform(corpus)
print("Bigram Feature Names:", vectorizer_bigram.get_feature_names_out())

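# Trigram (the first sentence contributes none: "I" is dropped by the
# default tokenizer, leaving only two tokens)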
vectorizer_trigram = CountVectorizer(ngram_range=(3,3))
X_trigram = vectorizer_trigram.fit_transform(corpus)
print("Trigram Feature Names:", vectorizer_trigram.get_feature_names_out())


Unigram Feature Names: ['amazing' 'deep' 'for' 'is' 'learning' 'love' 'nlp' 'useful']
Bigram Feature Names: ['deep learning' 'for nlp' 'is amazing' 'is useful' 'learning is'
 'love nlp' 'nlp is' 'useful for']
Trigram Feature Names: ['deep learning is' 'is useful for' 'learning is useful' 'nlp is amazing'
 'useful for nlp']


# Word Embeddings
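Word embeddings represent words as dense numerical vectors in a continuous vector space (typically tens to a few hundred dimensions), learned so that words used in similar contexts end up close together. Unlike one-hot encoding, where every pair of words is equally distant, embeddings capture semantic relationships between words.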

### Why Use Word Embeddings?
✔️ Capture semantic similarity (e.g., "king" and "queen" are close)

✔️ Improve text classification and sentiment analysis

✔️ Scale to large vocabularies, since the vector size stays fixed no matter how many words there are
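"Close" is usually measured with cosine similarity. The sketch below shows the idea on toy data. The three-dimensional vectors are made up for illustration; real embeddings are learned from data and have many more dimensions. The names `cosine_similarity` and `toy_vectors` are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical hand-made vectors, NOT taken from a trained model
toy_vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

print("king vs queen:", round(king_queen, 3))
print("king vs apple:", round(king_apple, 3))
```

With these toy values, "king" and "queen" score much higher than "king" and "apple", which is the pattern a trained model is expected to show for related versus unrelated words.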

### Types of Word Embeddings
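1️⃣ Word2Vec – Learns vectors with a shallow neural network, either by predicting a word from its context (CBOW) or the context from a word (Skip-gram)

2️⃣ GloVe – Learns vectors from a global word co-occurrence matrix

3️⃣ FastText – Extends Word2Vec by representing each word through its character n-grams (subwords), which helps with misspellings and rare words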
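FastText's subword idea can be illustrated by listing the character n-grams of a word. The sketch assumes the convention described in the FastText paper: the word is wrapped in `<` and `>` boundary markers and n-gram lengths run from 3 to 6. `char_ngrams` is a hypothetical helper for this example, not a FastText API.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


trigrams_only = char_ngrams("where", 3, 3)
print(trigrams_only)

# A misspelling still shares subwords with the correct spelling
shared = set(char_ngrams("learning")) & set(char_ngrams("lerning"))
print(sorted(shared))
```

Because a word's vector is built from its subword vectors, a misspelled or unseen word that shares many character n-grams with a known word can still get a sensible representation.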