Classical text representation techniques are:

1. One-Hot Encoding
2. Bag of Words (BoW)
3. Term Frequency-Inverse Document Frequency (TF-IDF)
4. N-Gram Language Modeling with NLTK
5. Latent Semantic Analysis (LSA)
6. Latent Dirichlet Allocation (LDA)

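These techniques were the foundation of NLP before deep learning became dominant (roughly pre-2015).
They are still useful for small datasets and for tasks where interpretability matters, but they are rarely used as the input representation in large-scale deep learning models. ✅ TF-IDF and LDA are still relevant,
but ❌ BoW, One-Hot, and LSA have mostly been replaced by learned embeddings in modern neural networks.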

# The “Word Embedding Era” (2013–2018)
These are static embeddings: each word has one fixed vector regardless of context
(e.g., “bank” gets the same vector in “river bank” and “money bank”).
1. Word2Vec (Google, 2013): rarely used directly now, but influential
2. GloVe (Stanford, 2014): less used now
3. FastText (Facebook, 2016): still sometimes used
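
To make the "one fixed vector per word" idea concrete, here is a minimal sketch using a hand-made lookup table. The vectors and the `embed` helper are hypothetical and invented for illustration; they are not taken from Word2Vec, GloVe, or FastText.

```python
# Toy static embedding table: the numbers are invented for illustration only.
toy_embeddings = {
    "river": [0.9, 0.1, 0.0],
    "money": [0.0, 0.2, 0.9],
    "bank":  [0.4, 0.5, 0.4],
}

def embed(sentence):
    """Look up each known word; the surrounding context plays no role."""
    return [toy_embeddings[w] for w in sentence.lower().split() if w in toy_embeddings]

river_bank = embed("river bank")
money_bank = embed("money bank")

# "bank" is the last word in both phrases and receives the identical vector.
print(river_bank[-1] == money_bank[-1])
```

A contextual model would instead produce a different vector for "bank" in each phrase.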

# The “Contextual Embedding Era” (2018–Now)
1. ELMo (2018): mainly of historical interest
2. BERT (Google, 2018): very popular now
3. RoBERTa, ALBERT, DistilBERT: variants of BERT, still in use today
4. GPT, GPT-2/3/4/5: popular now
5. Sentence-BERT (SBERT): sentence-level encoding, popular now
6. T5, BART: sequence-to-sequence Transformers, used for tasks such as summarization

These models dominate modern NLP because they:

1. Capture semantic + contextual + syntactic meaning

2. Work across a wide range of tasks (classification, QA, summarization, generation)

3. Can be fine-tuned for specific downstream tasks

# One Hot Encoding
One Hot Encoding is a method for converting categorical variables into a binary format. It creates one new column per category, where 1 means the category is present and 0 means it is not. Its primary purpose is to let machine learning models that expect numeric input work with categorical data. In NLP, the categories are the words of the vocabulary.

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {
    'Employee id': [10, 20, 15, 25, 30],
    'Gender': ['M', 'F', 'F', 'M', 'F'],
    'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice']
}

df = pd.DataFrame(data)
print(df)

# Option 1: pandas. In recent pandas versions get_dummies() returns boolean
# columns unless a numeric dtype is requested, so cast to int to get 0/1.
df_encoded = pd.get_dummies(df, columns=['Gender', 'Remarks']).astype(int)
print("One Hot Encoded using Pandas: ")
print(df_encoded)

# Option 2: scikit-learn. sparse_output=False returns a dense array
# (this parameter needs scikit-learn >= 1.2).
encoder = OneHotEncoder(sparse_output=False)
encoded_output = encoder.fit_transform(df[['Gender', 'Remarks']])
encoded_df = pd.DataFrame(encoded_output, columns=encoder.get_feature_names_out(['Gender', 'Remarks']))

# Re-attach the untouched numeric column.
df_final = pd.concat([df[['Employee id']], encoded_df], axis=1)

print(df_final)




   Employee id Gender Remarks
0           10      M    Good
1           20      F    Nice
2           15      F    Good
3           25      M   Great
4           30      F    Nice
One Hot Encoded using Pandas: 
   Employee id  Gender_F  Gender_M  Remarks_Good  Remarks_Great  Remarks_Nice
0           10         0         1             1              0             0
1           20         1         0             0              0             1
2           15         1         0             1              0             0
3           25         0         1             0              1             0
4           30         1         0             0              0             1
   Employee id  Gender_F  Gender_M  Remarks_Good  Remarks_Great  Remarks_Nice
0           10       0.0       1.0           1.0            0.0           0.0
1           20       1.0       0.0           0.0            0.0           1.0
2           15       1.0       0.0           1.0            0.0           0.0
3           25       0.0       1.0           0.0            1.0           0.0
4           30       1.0       0.0           0.0            0.0           1.0
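
The same idea applies to text: each word in the vocabulary gets a vector containing a single 1. Below is a minimal sketch in plain Python. The `one_hot` helper and the one-sentence vocabulary are assumptions made for illustration.

```python
# Build a small vocabulary and map each word to an index.
sentence = "I love NLP"
tokens = sentence.lower().split()
vocab = sorted(set(tokens))  # ['i', 'love', 'nlp']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

for token in tokens:
    print(token, one_hot(token))
```

The vector length equals the vocabulary size, which is why one-hot word vectors become very large and sparse on real corpora.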

# Bag of Words
Bag of Words (BoW) converts text into numerical vectors while ignoring word order and grammar, focusing only on word frequencies.

# Core Concept

1. Treat text as an unordered collection of words (like a "bag")

2. Count how many times each word appears

3. Represent each document as a frequency vector

# Simple Example

Let’s say you have these 3 sentences:

"I love NLP"

"I love machine learning"

"NLP is fun"

# Vocabulary

` ['I', 'love', 'NLP', 'machine', 'learning', 'is', 'fun']`

| Sentence                | I | love | NLP | machine | learning | is | fun |
| ----------------------- | - | ---- | --- | ------- | -------- | -- | --- |
| I love NLP              | 1 | 1    | 1   | 0       | 0        | 0  | 0   |
| I love machine learning | 1 | 1    | 0   | 1       | 1        | 0  | 0   |
| NLP is fun              | 0 | 0    | 1   | 0       | 0        | 1  | 1   |



In [None]:
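# Manual implementation
import pandas as pd

corpus = ["I love NLP", "I love machine learning", "NLP is fun"]

# Lowercase and split on whitespace to get one token list per sentence.
tokenized = [sentence.lower().split() for sentence in corpus]

# The vocabulary is the sorted set of all distinct tokens.
vocab = sorted(set(word for tokens in tokenized for word in tokens))

print(vocab)
bow = []

# For each sentence, count how often every vocabulary word occurs.
for tokens in tokenized:
    bow_vector = [tokens.count(word) for word in vocab]
    bow.append(bow_vector)

df = pd.DataFrame(bow, columns=vocab)
print(df)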

['fun', 'i', 'is', 'learning', 'love', 'machine', 'nlp']
   fun  i  is  learning  love  machine  nlp
0    0  1   0         0     1        0    1
1    0  1   0         1     1        1    0
2    1  0   1         0     0        0    1
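
Once the vocabulary is fixed, the same counting step can be applied to a sentence that was not in the corpus. The sketch below assumes the same whitespace tokenization as the manual implementation. `bow_vector` is a hypothetical helper name.

```python
from collections import Counter

corpus = ["I love NLP", "I love machine learning", "NLP is fun"]
tokenized = [sentence.lower().split() for sentence in corpus]
vocab = sorted(set(word for tokens in tokenized for word in tokens))

def bow_vector(text):
    """Count vocabulary words in `text`; words outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

# "deep" is not in the vocabulary, and "love" appears twice.
print(vocab)
print(bow_vector("I love love deep learning"))
```

Unknown words such as "deep" simply disappear from the vector, which is one of the main limitations of a fixed-vocabulary representation.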


In [None]:
#Using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

documents = [
        "great movie amazing acting",
        "terrible film boring story",
        "wonderful cinematography excellent direction",
        "poor acting bad script",
        "fantastic performance loved it",
        "awful movie waste of time",
        "brilliant story amazing characters",
        "horrible acting poor direction"
    ]

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['acting' 'amazing' 'awful' 'bad' 'boring' 'brilliant' 'characters'
 'cinematography' 'direction' 'excellent' 'fantastic' 'film' 'great'
 'horrible' 'it' 'loved' 'movie' 'of' 'performance' 'poor' 'script'
 'story' 'terrible' 'time' 'waste' 'wonderful']
[[1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0]
 [0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0]
 [0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0]]


In [None]:
#dir(vectorizer)

vectorizer.vocabulary_

{'great': 12,
 'movie': 16,
 'amazing': 1,
 'acting': 0,
 'terrible': 22,
 'film': 11,
 'boring': 4,
 'story': 21,
 'wonderful': 25,
 'cinematography': 7,
 'excellent': 9,
 'direction': 8,
 'poor': 19,
 'bad': 3,
 'script': 20,
 'fantastic': 10,
 'performance': 18,
 'loved': 15,
 'it': 14,
 'awful': 2,
 'waste': 24,
 'of': 17,
 'time': 23,
 'brilliant': 5,
 'characters': 6,
 'horrible': 13}

# Bag of n-Grams

In BoW, each word is a separate feature. In Bag of n-Grams, the features are sequences of n consecutive words, such as pairs (bigrams), triples (trigrams), and so on.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams (single words) and bigrams (two-word pairs)

X = vectorizer.fit_transform(["great movie amazing acting","I love NLP"])
print(vectorizer.get_feature_names_out())

['acting' 'amazing' 'amazing acting' 'great' 'great movie' 'love'
 'love nlp' 'movie' 'movie amazing' 'nlp']
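
For intuition, n-grams can also be generated by sliding a window of size n over the token list. This is a minimal plain-Python sketch. The `ngrams` helper is written here for illustration and is not part of scikit-learn.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "great movie amazing acting".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so the feature space grows quickly as n increases.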


# TF-IDF (Term Frequency-Inverse Document Frequency)

It’s a statistical method to weigh the importance of words in a document relative to a collection (corpus).

It improves on BoW, which treats all words equally, by giving more weight to rare but informative words and less to very common ones (like “the”, “is”, “and”).


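$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Here TF(t, d) is how often term t occurs in document d, and IDF(t) grows as fewer documents in the corpus contain t.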

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
        "I love NLP",
        "I love machine learning",
        "NLP is fun"
    ]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(X.toarray())

['fun' 'is' 'learning' 'love' 'machine' 'nlp']
[[0.         0.         0.         0.70710678 0.         0.70710678]
 [0.         0.         0.62276601 0.4736296  0.62276601 0.        ]
 [0.62276601 0.62276601 0.         0.         0.         0.4736296 ]]
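
Note that 'i' is missing from the feature names above. With default settings, scikit-learn's vectorizers lowercase the text and use the `token_pattern` `(?u)\b\w\w+\b`, which only keeps tokens of two or more word characters. The sketch below applies that regular expression directly with Python's `re` module. The `tokenize` helper is an illustrative stand-in, not the library's internal function.

```python
import re

# Default token_pattern of CountVectorizer / TfidfVectorizer.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Lowercase the text and keep only tokens with at least two characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

print(tokenize("I love NLP"))
```

If single-character tokens matter for a task, a different pattern such as `r"(?u)\b\w+\b"` can be passed via the `token_pattern` parameter.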


In [2]:
print('\nidf values:')
for term, idf_value in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(term, ':', idf_value)


idf values:
fun : 1.6931471805599454
is : 1.6931471805599454
learning : 1.6931471805599454
love : 1.2876820724517808
machine : 1.6931471805599454
nlp : 1.2876820724517808
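
These numbers are consistent with scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. Each row is then scaled to unit Euclidean length. The plain-Python sketch below reproduces the values under those assumptions (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`, default tokenization). The helper names are illustrative.

```python
import math
import re

documents = ["I love NLP", "I love machine learning", "NLP is fun"]

# Tokenize as scikit-learn does by default: lowercase, tokens of 2+ characters.
tokenized = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in documents]
vocab = sorted(set(word for tokens in tokenized for word in tokens))
n_docs = len(documents)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for word in vocab:
    df = sum(word in tokens for tokens in tokenized)
    idf[word] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    """Multiply raw counts by idf, then scale the row to unit L2 norm."""
    raw = [tokens.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

print(vocab)
for tokens in tokenized:
    print([round(value, 4) for value in tfidf_row(tokens)])
```

For example, "fun" appears in 1 of 3 documents, giving ln(4 / 2) + 1 ≈ 1.693, while "love" appears in 2 of 3, giving ln(4 / 3) + 1 ≈ 1.288.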


In [3]:
print('\nWord indexes:')
print(vectorizer.vocabulary_)
print('\ntf-idf value:')
print(X)



Word indexes:
{'love': 3, 'nlp': 5, 'machine': 4, 'learning': 2, 'is': 1, 'fun': 0}

tf-idf value:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 8 stored elements and shape (3, 6)>
  Coords	Values
  (0, 3)	0.7071067811865476
  (0, 5)	0.7071067811865476
  (1, 3)	0.4736296010332684
  (1, 4)	0.6227660078332259
  (1, 2)	0.6227660078332259
  (2, 5)	0.4736296010332684
  (2, 1)	0.6227660078332259
  (2, 0)	0.6227660078332259


In [None]:
dir(vectorizer)