
Embedding vectors and one-hot encoding both represent categorical data numerically.
 Embeddings are learned, dense, and compact, and they capture semantic relationships in a lower-dimensional
 space. One-hot encoding creates sparse, high-dimensional vectors in which each category is its own orthogonal
 dimension. Embeddings therefore suit large vocabularies and semantic tasks, while one-hot encoding suits
 small, fixed sets of categories without inherent order.


One-Hot Encoding
* How it works: Creates a binary vector for each category, where only one element (the "hot" one) is 1,
 and the rest are 0. 
* Pros: Simple, deterministic, and effective for nominal (no natural order) categorical variables. 

* Cons: Produces very high-dimensional, sparse vectors, which invite the "curse of dimensionality" and waste memory when there are many categories. Encodes no semantic meaning or relationship between categories.
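
The "no semantic meaning" point above can be checked directly. The following is a minimal sketch, assuming a toy three-word vocabulary chosen only for illustration; the helper names `one_hot` and `dot` are hypothetical. Any two distinct one-hot vectors have a dot product of 0, so no pair of categories looks more alike than any other.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


categories = ["cat", "dog", "tree"]  # toy vocabulary (assumption)
vectors = {name: one_hot(i, len(categories)) for i, name in enumerate(categories)}

# Distinct categories never share a non-zero position, so the dot product is 0.
print(dot(vectors["cat"], vectors["dog"]))   # 0
print(dot(vectors["cat"], vectors["tree"]))  # 0
# A vector only matches itself.
print(dot(vectors["cat"], vectors["cat"]))   # 1
```

"cat" is exactly as far from "dog" as it is from "tree", which is the gap that learned embeddings are meant to close.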

Embeddings
* How it works: Maps data into dense, lower-dimensional vectors where each dimension captures latent features or semantic meaning. 
* Pros: Compact in memory and cheaper to compute on than one-hot vectors when there are many categories. Can capture complex relationships and patterns, which often improves model performance and generalization.
* Cons: Requires significant training data and computational resources. The quality depends on the underlying training algorithm (e.g., Word2Vec, GloVe). 
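
Mechanically, an embedding layer is a lookup table with one row per category. The sketch below assumes a toy vocabulary and random values standing in for learned weights; `embed` and `embed_via_one_hot` are hypothetical helper names. It also shows that the lookup gives the same result as multiplying a one-hot vector by the table, which is how the two representations relate.

```python
import random

random.seed(0)  # fixed seed so the toy values are reproducible

vocabulary = ["bird", "cat", "dog"]  # toy vocabulary (assumption)
embedding_dim = 4                    # illustrative size; real models use far more

# One row per category. In a real model these values are learned, not random.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in vocabulary
]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def embed(word):
    """Look up the dense vector for `word` (a plain row lookup)."""
    return embedding_table[word_to_index[word]]


def embed_via_one_hot(word):
    """Same result as embed(), computed as one-hot vector times the table."""
    one_hot = [1 if i == word_to_index[word] else 0 for i in range(len(vocabulary))]
    return [
        sum(one_hot[row] * embedding_table[row][col] for row in range(len(vocabulary)))
        for col in range(embedding_dim)
    ]


print(embed("cat"))
print(embed("cat") == embed_via_one_hot("cat"))  # True
```

The row lookup avoids ever building the sparse vector, which is why frameworks implement embeddings as indexing rather than matrix multiplication.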

When to Use Which

* Use One-Hot Encoding When: You have a small, fixed number of categories and don't need to capture semantic relationships between them. 
* Use Embeddings When: You have a large number of categories, or you need to leverage the semantic meaning and context of the data, as in natural language processing (NLP). 

Key Differences
* Dimensionality: Embeddings are lower-dimensional and dense; one-hot encoding creates high-dimensional and sparse vectors. 
* Information Content: Embeddings encode semantic relationships; one-hot encoding provides no semantic information. 
* Learning: Embeddings are learned through machine learning models; one-hot encoding is a fixed, deterministic process. 
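
To put rough numbers on the dimensionality difference, here is a small arithmetic sketch. All three sizes are hypothetical and chosen only for illustration, and it counts stored values for a naive dense one-hot matrix (sparse formats would store far fewer).

```python
num_categories = 50_000  # hypothetical vocabulary size
embedding_dim = 100      # hypothetical embedding width
num_tokens = 1_000       # hypothetical number of tokens to encode

# Values needed to represent the tokens themselves.
one_hot_values = num_tokens * num_categories
dense_values = num_tokens * embedding_dim

# The embedding approach also has to store its lookup table once.
table_values = num_categories * embedding_dim

print(one_hot_values)                  # 50000000
print(dense_values)                    # 100000
print(table_values)                    # 5000000
print(one_hot_values // dense_values)  # 500
```

The per-token cost drops by a factor of `num_categories / embedding_dim`, at the price of a lookup table that has to be trained and stored.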

Word2Vec (captures semantic relationships and meaning)
* Developed by Google (Mikolov et al., 2013).
* It converts words into dense vectors (called embeddings), where similar words are close to each other in vector space.
* Example: king - man + woman ≈ queen.
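
The analogy above is plain vector arithmetic followed by a nearest-neighbour search. This is a minimal sketch with hand-made 2-D vectors (one axis for "royalty", one for "gender"); the values and the helper names `cosine` and `analogy` are assumptions for illustration, not output of a trained Word2Vec model.

```python
import math

# Hand-made 2-D vectors: (royalty, gender). Real embeddings are learned.
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [
        x - y + z
        for x, y, z in zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])
    ]
    candidates = [word for word in toy_vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, toy_vectors[word]))


print(analogy("king", "man", "woman"))  # queen
```

With learned embeddings the result is only approximately equal to the "queen" vector, which is why the search step (rather than an exact match) is needed.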
How it works:
1. Two architectures:
    * CBOW (Continuous Bag of Words): predicts a word from its surrounding words.
    * Skip-Gram: predicts surrounding words given a word.
2. The model trains on large text data and learns vector representations.
3. After training, each word is represented as a fixed-size vector (e.g., 100-dimensional).
Limitation: It learns one embedding per whole word, so out-of-vocabulary (OOV) words, meaning words not seen during training, have no representation.

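CBOW example: in "I <love> eating", the context words "I" and "eating" are used to predict the target word "love".
Skip-Gram example: in "I love eating icecream alone", a center word such as "eating" is used to predict its neighbours ("I", "love", "icecream", "alone").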
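
The two examples above differ only in how training pairs are cut from a sentence. The sketch below generates both kinds of pairs; the function names and the fixed window sizes are assumptions for illustration. The real Word2Vec implementation also adds details such as random window shrinking, subsampling of frequent words, and negative sampling, which are left out here.

```python
def cbow_pairs(tokens, window):
    """Build (context_words, target_word) pairs: context predicts the word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window):
    """Build (center_word, context_word) pairs: the word predicts its context."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "I love eating icecream alone".split()

# CBOW with a window of 1: the neighbours of "love" predict "love".
print(cbow_pairs(tokens, window=1)[1])
# (['I', 'eating'], 'love')

# Skip-Gram with a window of 2: "eating" predicts each of its neighbours.
print([pair for pair in skipgram_pairs(tokens, window=2) if pair[0] == "eating"])
# [('eating', 'I'), ('eating', 'love'), ('eating', 'icecream'), ('eating', 'alone')]
```

Skip-Gram produces one training example per (center, neighbour) pair, so it yields more examples per sentence than CBOW, which is consistent with it being slower to train.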
 
FastText
* Developed by Facebook AI Research (2016).
* It is an extension of Word2Vec.
* Key difference: Instead of representing each word as a whole, FastText breaks words into subword units (character n-grams).
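Example
Word = “playing”
* Subwords (character 4-grams, with the word-boundary markers < and > that FastText adds): “<pla”, “play”, “layi”, “ayin”, “ying”, “ing>”
* The word's embedding is built by combining the vectors of these pieces.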
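
The subword split can be reproduced in a few lines. This is a sketch under stated assumptions: `char_ngrams` is a hypothetical helper, not a gensim or fastText function, and the default range of 3 to 6 characters mirrors the `min_n` and `max_n` defaults in gensim's FastText as far as I know.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Only 4-grams, to keep the output short.
print(char_ngrams("playing", min_n=4, max_n=4))
# ['<pla', 'play', 'layi', 'ayin', 'ying', 'ing>']
```

The boundary markers let the model tell a prefix such as `<pla` apart from the same letters in the middle of a word, and a suffix such as `ing>` apart from an interior "ing".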
Advantages of FastText:
* Handles rare words better.
* Can create embeddings for out-of-vocabulary (OOV) words (e.g., misspellings, new words).
* Especially useful for morphologically rich languages (like Hindi, Tamil, Turkish, Finnish, etc.).
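
The OOV advantage follows from the subword idea: a word never seen in training can still be assembled from n-gram vectors that were seen. The sketch below is a toy illustration, not the library's implementation. FastText hashes n-grams into a fixed number of buckets; the CRC32 hash, the bucket count, the vector size, and the pseudo-random "learned" vectors used here are all stand-in assumptions, and averaging is one reasonable way to combine the pieces.

```python
import random
import zlib

DIM = 8             # illustrative vector size (assumption)
NUM_BUCKETS = 1000  # illustrative bucket count (assumption)


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    ]


def bucket_vector(bucket):
    """Deterministic pseudo-random vector standing in for a learned n-gram vector."""
    rng = random.Random(bucket)
    return [rng.uniform(-1, 1) for _ in range(DIM)]


def oov_vector(word):
    """Average the vectors of the word's n-grams; works for any string."""
    grams = char_ngrams(word)
    total = [0.0] * DIM
    for gram in grams:
        # Stand-in hash; the real library uses its own hash function.
        bucket = zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS
        for k, value in enumerate(bucket_vector(bucket)):
            total[k] += value
    return [value / len(grams) for value in total]


vector = oov_vector("nights")  # no vocabulary lookup is involved
print(len(vector))  # 8
```

Because "night" and "nights" share most of their n-grams, vectors built this way from trained n-gram weights end up close together.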


Word2Vec = word-level embeddings.
FastText = word + subword embeddings (smarter for rare/OOV words).


Feature     CBOW                       Skip-Gram
Predicts    Word from context          Context from word
Speed       Faster                     Slower
Accuracy    Good for frequent words    Better for rare words
Best for    Larger datasets            Smaller datasets


2017 - Transformer (the architecture that followed static embeddings such as Word2Vec)

In [1]:
def one_hot_encode(text):
    """One-hot encode each whitespace-separated word in `text`."""
    words = text.split()
    # Sorted unique words give every word a stable index.
    vocabulary = sorted(set(words))
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    one_hot_encoded = []
    for word in words:
        # All zeros except a single 1 at the word's index.
        one_hot_vector = [0] * len(vocabulary)
        one_hot_vector[word_to_index[word]] = 1
        one_hot_encoded.append(one_hot_vector)
    return one_hot_encoded, word_to_index, vocabulary

example_text = "cat in the hat dog on the mat bird in the tree"

one_hot_encoded, word_to_index, vocabulary = one_hot_encode(example_text)

print("Vocabulary:", vocabulary)
print("Word to Index Mapping:", word_to_index)
print("One-Hot Encoded Matrix:")
for word, encoding in zip(example_text.split(), one_hot_encoded):
    print(f"{word}: {encoding}")

Vocabulary: ['bird', 'cat', 'dog', 'hat', 'in', 'mat', 'on', 'the', 'tree']
Word to Index Mapping: {'bird': 0, 'cat': 1, 'dog': 2, 'hat': 3, 'in': 4, 'mat': 5, 'on': 6, 'the': 7, 'tree': 8}
One-Hot Encoded Matrix:
cat: [0, 1, 0, 0, 0, 0, 0, 0, 0]
in: [0, 0, 0, 0, 1, 0, 0, 0, 0]
the: [0, 0, 0, 0, 0, 0, 0, 1, 0]
hat: [0, 0, 0, 1, 0, 0, 0, 0, 0]
dog: [0, 0, 1, 0, 0, 0, 0, 0, 0]
on: [0, 0, 0, 0, 0, 0, 1, 0, 0]
the: [0, 0, 0, 0, 0, 0, 0, 1, 0]
mat: [0, 0, 0, 0, 0, 1, 0, 0, 0]
bird: [1, 0, 0, 0, 0, 0, 0, 0, 0]
in: [0, 0, 0, 0, 1, 0, 0, 0, 0]
the: [0, 0, 0, 0, 0, 0, 0, 1, 0]
tree: [0, 0, 0, 0, 0, 0, 0, 0, 1]


In [2]:
# pprint replaces the built-in print for the remaining cells
from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

# Path to the training corpus bundled with gensim's test data
corpus_file = datapath('lee_background.cor')

model = FastText(vector_size=100)

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
)

print(model)


<gensim.models.fasttext.FastText object at 0x1290fa2d0>


In [5]:
import tempfile
import os
with tempfile.NamedTemporaryFile(prefix='saved_model_gensim-', delete=False) as tmp:
    model.save(tmp.name, separately=[])

# Load back the same model.
loaded_model = FastText.load(tmp.name)
print(loaded_model)

os.unlink(tmp.name)  

<gensim.models.fasttext.FastText object at 0x12927d350>


In [6]:
wv = model.wv
print(wv)

#
# FastText models support vector lookups for out-of-vocabulary words by summing up character ngrams belonging to the word.
#
print('night' in wv.key_to_index)

<gensim.models.fasttext.FastTextKeyedVectors object at 0x128fb0a90>
True


In [7]:
print('nights' in wv.key_to_index)

False


In [8]:
print(wv['night'])

array([-0.20284255,  0.17937975, -0.26213023, -0.08410547,  0.06646235,
        0.37616438,  0.30806497,  0.5046025 ,  0.2462801 , -0.23985392,
        0.02934063, -0.1570554 , -0.23325667,  0.5250359 , -0.39944953,
       -0.56109583,  0.18253204, -0.24287868, -0.42940032, -0.53925073,
       -0.4684133 , -0.05898273, -0.46661866, -0.13629974, -0.19980335,
       -0.317181  , -0.68123364, -0.1178949 , -0.3241525 ,  0.2694415 ,
       -0.3264985 ,  0.30853063,  0.8473678 , -0.26889664,  0.18328932,
        0.38816193,  0.3921644 , -0.09665528, -0.37228012, -0.33637613,
        0.48111784, -0.43240216,  0.03541676, -0.41230687, -0.53825843,
       -0.3141474 , -0.07590213,  0.12475323,  0.36745605, -0.00932131,
        0.36208126, -0.44410658,  0.30114147, -0.4140023 , -0.1931092 ,
       -0.18441269, -0.16532879, -0.13075094,  0.04240717, -0.35819513,
       -0.3473046 , -0.44219017, -0.19478136,  0.35050353, -0.11053087,
        0.692508  ,  0.05158235,  0.05928085,  0.42303494,  0.25

In [10]:
print(wv['nights'])

array([-0.17602764,  0.15592496, -0.22667974, -0.07251345,  0.05618022,
        0.32383716,  0.26760095,  0.43797266,  0.21326041, -0.20889054,
        0.02708904, -0.13416103, -0.2026273 ,  0.45149082, -0.34644526,
       -0.485329  ,  0.1569213 , -0.2092339 , -0.36943987, -0.4663614 ,
       -0.40154582, -0.05202486, -0.4028292 , -0.11910775, -0.17127909,
       -0.27256083, -0.5870845 , -0.0996221 , -0.279828  ,  0.23432305,
       -0.280182  ,  0.2661049 ,  0.73039424, -0.23192888,  0.15849352,
        0.33480278,  0.3402905 , -0.08351878, -0.321937  , -0.29128414,
        0.414982  , -0.37282392,  0.03009282, -0.35581183, -0.4656405 ,
       -0.27006483, -0.06262122,  0.1083453 ,  0.31862086, -0.00697254,
        0.31416807, -0.3836532 ,  0.26058412, -0.35724917, -0.16625573,
       -0.1579507 , -0.14461027, -0.11104371,  0.03791947, -0.3064769 ,
       -0.29890484, -0.38219512, -0.16784729,  0.30216286, -0.09485804,
        0.59894264,  0.04466497,  0.04856251,  0.36508232,  0.22

In [11]:
print(wv.similarity("night", "nights"))  # cosine similarity between the two word vectors

0.9999919
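
The score above is a cosine similarity: it compares the direction of two vectors and ignores their length. That fits the two arrays printed earlier, where the visible components of 'nights' look like roughly 0.87 times those of 'night'. A minimal sketch of the computation, using toy vectors that are assumptions for illustration:

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, twice the length
w = [3.0, 0.0, -1.0]  # orthogonal to u

print(cosine_similarity(u, v))  # very close to 1: same direction
print(cosine_similarity(u, w))  # 0.0
```

A scaled copy of a vector scores (almost exactly) 1, so a high cosine similarity alone does not say the two vectors are numerically equal.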


In [12]:
print(wv.most_similar("nights"))  # nearest neighbours by cosine similarity

[('night', 0.9999918937683105),
 ('rights', 0.9999876022338867),
 ('flights', 0.9999868273735046),
 ('overnight', 0.9999866485595703),
 ('fighter', 0.9999847412109375),
 ('fighters', 0.9999846816062927),
 ('night.', 0.999984622001648),
 ('fight', 0.9999843239784241),
 ('overnight.', 0.9999842643737793),
 ('night,', 0.9999841451644897)]
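
Conceptually, this kind of vector search scores every other word against the query and keeps the top results. The sketch below does that by brute force over a toy table; the vectors and the `most_similar` helper are assumptions for illustration and are not gensim's implementation.

```python
import math

# Toy 2-D vectors (assumption); real vectors come from the trained model.
toy_vectors = {
    "night":  [1.0, 0.0],
    "nights": [0.9, 0.1],
    "day":    [0.0, 1.0],
    "tree":   [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, topn=2):
    """Score every other word against `query` and return the best `topn`."""
    scores = [
        (word, cosine(vectors[query], vector))
        for word, vector in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


print([word for word, _ in most_similar("nights", toy_vectors)])
# ['night', 'day']
```

In the real output above, many neighbours ('rights', 'flights', 'fight') share the character sequence "ight" with the query, and all scores sit very close to 1. One plausible reading is that on this small corpus the shared subword vectors dominate, so the ranking reflects spelling more than meaning.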


In [None]:
700 PB  -- model -- 