#### 01. Static Embeddings - Word2Vec

##### Skip-gram

- Goal: Predict the context words given a target (center) word.

- How it works: For each word in a sentence, the model tries to guess the words that appear in its surrounding window.

- Example: Sentence: "He sat by the river bank"
  If the center word is "bank" and window=2, the context words are ["the", "river"]. Because "bank" is the last word, there is no right-hand context.

- Training pairs (one per context word):
    - Input: "bank" → Output: "the"
    - Input: "bank" → Output: "river"

- Best for: Smaller corpora and rare words, at the cost of slower training than CBOW.
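
To make the Skip-gram pairing concrete, here is a minimal sketch in plain Python that enumerates (center, context) pairs for the example sentence. The helper name `skipgram_pairs` is hypothetical and not part of gensim. The sketch only illustrates how pairs are formed and ignores training details such as frequent-word subsampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "he sat by the river bank".split()
pairs = skipgram_pairs(tokens, window=2)

# Keep only the pairs whose center word is "bank"
bank_pairs = [p for p in pairs if p[0] == "bank"]
print(bank_pairs)  # [('bank', 'the'), ('bank', 'river')]
```

Each pair is one training example: the model sees the center word and is asked to score the context word highly.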

##### CBOW (Continuous Bag of Words)

- Goal: Predict the target (center) word given its context words.

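- How it works: For each word in a sentence, the model combines (sums or averages) the vectors of the surrounding words and tries to guess the word in the middle. Word order inside the window is ignored, hence "bag of words".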

- Example: Sentence: "He sat by the river bank"
If the context is ["the", "river"], the model tries to predict "bank".

- Training pairs:
    - Input: ["the", "river"]
    - Output: "bank"

- Best for: Faster training, works well with frequent words.
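
The CBOW setup can be sketched the same way: each position yields one example whose input is the whole list of context words and whose output is the center word. The helper name `cbow_examples` is hypothetical and not part of gensim.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples for every position in tokens."""
    examples = []
    for i, target in enumerate(tokens):
        # Clip the window at the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "he sat by the river bank".split()
examples = cbow_examples(tokens, window=2)
print(examples[-1])  # (['the', 'river'], 'bank')
```

Compared with Skip-gram, this produces one example per position rather than one per (center, context) pair, which is one reason CBOW tends to train faster.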

In [2]:

# Import Word2Vec from gensim
# Gensim is an open-source Python library for unsupervised topic modeling 
# and natural language processing, with a focus on efficient, scalable algorithms 
# for learning vector representations of text.

from gensim.models import Word2Vec

# Basic tokenizer: lowercases text, strips periods, and splits on whitespace
def simple_tokenizer(text):
    # Very basic, for demo only: other punctuation (commas, quotes, ...) stays attached to words
    return text.lower().replace('.', '').split()

# Example sentences with the word 'bank' in different contexts
raw_sentences = [
    "He sat by the river bank.",
    "She deposited money in the bank.",
    "The bank was closed on Sunday."
]

# Tokenize sentences
tokenized_sentences = [simple_tokenizer(sent) for sent in raw_sentences]
# e.g. tokenized_sentences[0] == ['he', 'sat', 'by', 'the', 'river', 'bank']

# Train Word2Vec model
# vector_size: dimension of the embedding vectors (higher = more expressive, but needs more data)
# min_count: ignore words with total frequency lower than this (set to 1 to include all words in this tiny corpus)
# window: maximum distance between the current and predicted word within a sentence (context window size)
w2v_model = Word2Vec(tokenized_sentences, vector_size=50, min_count=1, window=3)
# gensim defaults to CBOW (sg=0), so this call trains CBOW; pass sg=1 to train Skip-gram instead

# Get embedding for 'bank'
# The vector for 'bank' will be the same regardless of which sentence/context it appears in
print("Word2Vec vector for 'bank':\n", w2v_model.wv['bank'])
print("Word2Vec : Bank's Word Vector Shape", w2v_model.wv['bank'].shape)

words = list(w2v_model.wv.index_to_key)  # List all words in the vocabulary
similar = w2v_model.wv.most_similar('bank')  # Find words most similar to 'bank'

print("Similarity between 'bank' and 'river':", w2v_model.wv.similarity('bank', 'river'))
print("Similarity between 'bank' and 'money':", w2v_model.wv.similarity('bank', 'money'))

# NOTE:
#
# - The model slides a window (set by the window parameter) over each sentence
#   and learns that words appearing in similar contexts should have similar vectors.
#
# - 'bank' appears in both a "river bank" context and a "money in the bank" context.
#   Static embeddings assign exactly one vector per word, so 'bank' gets a single
#   vector that blends all of its contexts.
#
# - This is a limitation of static embeddings:
#   they cannot distinguish between different meanings (senses) of a word.

Word2Vec vector for 'bank':
 [-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]
Word2Vec : Bank's Word Vector Shape (50,)
Similarity between 'bank' and 'river': -0.012591083
Similarity between 'bank' and 'money': 0.13204393
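
The similarity scores above are cosine similarities between the two word vectors. As a rough illustration of what that number means, here is a minimal sketch in plain Python; the function `cosine_similarity` is written for this example and is not the gensim implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Scores near 0, like the two printed above, suggest the vectors are almost unrelated. That is not surprising for a three-sentence corpus, where the model has had very little to learn from.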


#### 02. Static Embeddings - FastText

In [None]:
from gensim.models import FastText

# Train FastText model on your tokenized sentences
ft_model = FastText(tokenized_sentences, vector_size=10, min_count=1, window=3)

# Get embedding for a known word
# Each word is wrapped in boundary markers ("<" and ">") and broken into character n-grams
# (e.g., for n=3, "bank" → "<ba", "ban", "ank", "nk>").
# The word vector is the sum (or average) of its n-gram vectors.
# Training is similar to Word2Vec (CBOW or Skip-gram), but on subwords.
print("FastText vector for 'bank':\n", ft_model.wv['bank'])

# FastText can handle OOV (out-of-vocabulary) words using subword information
print("FastText vector for OOV word 'banking':\n", ft_model.wv['banking'])

# Compare with a nonsense word (still gets a vector!)
print("FastText vector for OOV word 'bankzzz':\n", ft_model.wv['bankzzz'])

# Demo: Similarity between 'bank' and 'banking'
print("Similarity between 'bank' and 'banking':", ft_model.wv.similarity('bank', 'banking'))

# NOTE:
# - FastText is especially useful when you expect to encounter new words or
#   rare words, or when you work with morphologically rich languages.

FastText vector for 'bank':
 [-0.02681367 -0.00070708  0.01805622  0.00670038 -0.00706677 -0.01905808
  0.01269603  0.00208769  0.00945567 -0.02328101]
FastText vector for OOV word 'banking':
 [ 0.00135347 -0.00299125  0.00331595  0.01845975  0.0230606   0.00761079
  0.00130226  0.00841182  0.00700071 -0.01047717]
FastText vector for OOV word 'bankzzz':
 [-0.02014219  0.00406886 -0.00836108  0.01079835  0.01020753  0.00953312
  0.01484188 -0.01233316  0.0123117  -0.02921248]
Similarity between 'bank' and 'banking': 0.112472326
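
To see why an unseen word like 'banking' still gets a vector, it helps to list the character n-grams it shares with 'bank'. The sketch below is plain Python with a hypothetical helper `char_ngrams`. It assumes n-gram lengths from 3 to 6, which match gensim's default `min_n` and `max_n`, and it leaves out the hashing of n-grams into buckets that the real implementation performs.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

bank_ngrams = char_ngrams("bank")
print(bank_ngrams)
# ['<ba', 'ban', 'ank', 'nk>', '<ban', 'bank', 'ank>', '<bank', 'bank>', '<bank>']

# N-grams that the OOV word 'banking' has in common with 'bank'
shared = sorted(set(bank_ngrams) & set(char_ngrams("banking")))
print(shared)
# ['<ba', '<ban', '<bank', 'ank', 'ban', 'bank']
```

The OOV vector is assembled from n-gram vectors such as these, so words with overlapping spellings can end up with related vectors once the model has seen enough data.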
