# Install the 'gensim' library using pip
Gensim is a popular Python library for NLP tasks such as:
- Word2Vec (CBOW and Skip-Gram)
- Topic modeling (LDA, LSI)
- Similarity and vector space modeling

!pip install gensim


In [None]:
!pip install gensim


Collecting gensim
  Downloading gensim-4.3.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.6 MB)
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)

# Import the gensim library for Natural Language Processing tasks

import gensim  

- Print the version of gensim currently installed

print(gensim.__version__)  


In [None]:
import gensim
print(gensim.__version__)


4.3.3


# Import gensim library and Word2Vec class
import gensim

from gensim.models import Word2Vec  

# -------------------- TRAINING DATA --------------------
# Tokenized sentences used to train the Word2Vec model
sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["Word2Vec", "is", "a", "great", "tool"],
    ["Machine", "learning", "is", "fun"],
]

# -------------------- TRAIN WORD2VEC MODEL --------------------
- vector_size = 100  -> Dimension of the word embeddings (each word becomes a 100-dimensional vector)
- window = 5         -> Maximum distance between the target word and a context word, on either side
- min_count = 1      -> Ignores words that appear fewer than min_count times (1 keeps every word)
- sg = 1             -> Skip-Gram model (sg=0, the default, selects CBOW)

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# -------------------- WORD EMBEDDINGS --------------------
# Get the embedding (numerical vector) for the word 'language'
vector = model.wv['language']

print("Vector for 'language':", vector)

# -------------------- SIMILARITY CHECK --------------------
# Find the top 5 most similar words to 'language' based on cosine similarity
similar_words = model.wv.most_similar('language', topn=5)

print("Words similar to 'language':", similar_words)


In [None]:
import gensim
from gensim.models import Word2Vec

# Sample sentences for training
sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["Word2Vec", "is", "a", "great", "tool"],
    ["Machine", "learning", "is", "fun"],
]

# Train the Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Get the vector for a word
vector = model.wv['language']
print("Vector for 'language':", vector)

# Find similar words
similar_words = model.wv.most_similar('language', topn=5)
print("Words similar to 'language':", similar_words)


Vector for 'language': [-0.00515624 -0.00666834 -0.00777684  0.00831073 -0.00198234 -0.00685496
 -0.00415439  0.00514413 -0.00286914 -0.00374966  0.00162143 -0.00277629
 -0.00158436  0.00107449 -0.00297794  0.00851928  0.00391094 -0.00995886
  0.0062596  -0.00675425  0.00076943  0.00440423 -0.00510337 -0.00211067
  0.00809548 -0.00424379 -0.00763626  0.00925791 -0.0021555  -0.00471943
  0.0085708   0.00428334  0.00432484  0.00928451 -0.00845308  0.00525532
  0.00203935  0.00418828  0.0016979   0.00446413  0.00448629  0.00610452
 -0.0032021  -0.00457573 -0.00042652  0.00253373 -0.00326317  0.00605772
  0.00415413  0.00776459  0.00256927  0.00811668 -0.00138721  0.00807793
  0.00371702 -0.00804732 -0.00393361 -0.00247188  0.00489304 -0.00087216
 -0.00283091  0.00783371  0.0093229  -0.00161493 -0.00515925 -0.00470176
 -0.00484605 -0.00960283  0.00137202 -0.00422492  0.00252671  0.00561448
 -0.00406591 -0.00959658  0.0015467  -0.00670012  0.00249517 -0.00378063
  0.00707842  0.00064022  0.

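Abbreviated form of the output (the values shown here are illustrative; the actual values are printed above):

Vector for 'language': [ 0.0023 -0.0047  0.0078 ...  0.0056]

🔹 Explanation:


- This is a 100-dimensional vector (because you set vector_size=100).

- Each word is represented as a dense numerical vector. With enough training text, words used in similar contexts end up with similar vectors.

- So "language" is mapped to its own vector based on the training sentences. With only three short sentences, that vector is still dominated by its random initialization, so treat the numbers as a demonstration of the API rather than a meaningful embedding.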

Words similar to 'language': [('processing', 0.21), ('natural', 0.19), ('love', 0.15), ('I', 0.12), ('tool', 0.11)]

🔹 Explanation:

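- The output is a list of tuples → (word, similarity_score)

- word = a vocabulary word whose vector is close to the vector for "language"

- similarity_score = cosine similarity (range: -1 to 1)

For example, in the sample output above:

- "processing" has the highest similarity with "language" → the two words appear next to each other in “natural language processing”.

- "natural" is also close to "language".

- "love", "I", and "tool" score lower but are still returned, because most_similar ranks every other word in the vocabulary and keeps the top 5.

With only three training sentences the scores are small and unstable, so the exact ranking can change from run to run.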
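The similarity score can be reproduced by hand. The sketch below is plain NumPy with a hypothetical helper named `cosine_similarity`; it shows the three landmark values of the -1 to 1 range on small made-up vectors rather than on trained embeddings.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Same direction, different length: similarity close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: similarity 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])

# Opposite direction: similarity close to -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

For trained vectors, `cosine_similarity(model.wv['language'], model.wv['processing'])` should agree with `model.wv.similarity('language', 'processing')` up to floating-point rounding.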

# Import gensim and Word2Vec
import gensim

from gensim.models import Word2Vec  

# -------------------- TRAINING DATA --------------------
sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["Word2Vec", "is", "a", "great", "tool"],
    ["Machine", "learning", "is", "fun"],
    ["Natural", "language", "processing", "is", "awesome"]
]

# -------------------- CBOW MODEL --------------------
- sg=0 → CBOW (predicts target word from surrounding context words)

cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

# -------------------- SKIP-GRAM MODEL --------------------
- sg=1 → Skip-Gram (predicts context words from a target word)

skipgram_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

# -------------------- WORD VECTORS --------------------
word = "language"
- Vector for 'language' (CBOW)

cbow_vector = cbow_model.wv[word]

- Vector for 'language' (Skip-Gram)

skipgram_vector = skipgram_model.wv[word]


# -------------------- SIMILAR WORDS --------------------
cbow_similar_words = cbow_model.wv.most_similar(word, topn=5)

skipgram_similar_words = skipgram_model.wv.most_similar(word, topn=5)

print(f"CBOW - Words similar to '{word}':", cbow_similar_words)

print(f"Skip-gram - Words similar to '{word}':", skipgram_similar_words)


In [None]:
import gensim
from gensim.models import Word2Vec

# Sample sentences for training
sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["Word2Vec", "is", "a", "great", "tool"],
    ["Machine", "learning", "is", "fun"],
    ["Natural", "language", "processing", "is", "awesome"]
]

# CBOW Model
cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

# Skip-gram Model
skipgram_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

# Example: Getting the vector for a word
word = "language"
cbow_vector = cbow_model.wv[word]
skipgram_vector = skipgram_model.wv[word]

print(f"CBOW Vector for '{word}':", cbow_vector)
print(f"Skip-gram Vector for '{word}':", skipgram_vector)

# Example: Finding similar words
cbow_similar_words = cbow_model.wv.most_similar(word, topn=5)
skipgram_similar_words = skipgram_model.wv.most_similar(word, topn=5)

print(f"CBOW - Words similar to '{word}':", cbow_similar_words)
print(f"Skip-gram - Words similar to '{word}':", skipgram_similar_words)


CBOW Vector for 'language': [ 9.4794443e-05  3.0776660e-03 -6.8129268e-03 -1.3756783e-03
  7.6698321e-03  7.3483307e-03 -3.6729362e-03  2.6408839e-03
 -8.3165076e-03  6.2072724e-03 -4.6391813e-03 -3.1636052e-03
  9.3106655e-03  8.7376230e-04  7.4904198e-03 -6.0752141e-03
  5.1592872e-03  9.9243205e-03 -8.4574828e-03 -5.1340456e-03
 -7.0650815e-03 -4.8629697e-03 -3.7796097e-03 -8.5361497e-03
  7.9556443e-03 -4.8439130e-03  8.4241610e-03  5.2615325e-03
 -6.5502375e-03  3.9581223e-03  5.4700365e-03 -7.4268035e-03
 -7.4072029e-03 -2.4764745e-03 -8.6256117e-03 -1.5829162e-03
 -4.0474746e-04  3.3000517e-03  1.4428297e-03 -8.8208629e-04
 -5.5940356e-03  1.7293066e-03 -8.9629035e-04  6.7937491e-03
  3.9739395e-03  4.5298305e-03  1.4351519e-03 -2.7006667e-03
 -4.3665408e-03 -1.0332628e-03  1.4375091e-03 -2.6469158e-03
 -7.0722066e-03 -7.8058685e-03 -9.1226082e-03 -5.9341355e-03
 -1.8468037e-03 -4.3235817e-03 -6.4619821e-03 -3.7178723e-03
  4.2904112e-03 -3.7397402e-03  8.3768284e-03  1.5343785e

Abbreviated form of the output (the values shown here are illustrative; the actual values are printed above):

CBOW Vector for 'language': [ 0.0041 -0.0023 ... 0.0056]
Skip-gram Vector for 'language': [ 0.0032 -0.0067 ... 0.0078]

🔹 Explanation:

- Each line is a 100-dimensional numerical vector (because vector_size=100).

- Each model learns its own vector for the word "language".

- These numbers are the learned embedding for "language", adjusted during training according to the words that appear around it.

- CBOW and Skip-Gram vectors will generally not be the same because the two models are trained on different prediction tasks:

  - CBOW predicts the target word from context.

  - Skip-Gram predicts context words from the target.
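The difference between the two prediction tasks is easier to see from the training examples each one builds out of a sentence. The following is a simplified pure-Python sketch with hypothetical helper names; it uses a fixed window, whereas gensim's actual training also shrinks the window randomly and subsamples frequent words.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    # One example per position: (all context words) -> target word
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    # One example per pair: target word -> a single context word
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]


sentence = ["Natural", "language", "processing", "is", "awesome"]
cbow = cbow_examples(sentence, window=2)
skipgram = skipgram_examples(sentence, window=2)

# CBOW: the context words jointly predict "language"
print(cbow[1])

# Skip-Gram: "language" predicts each context word separately
print([pair for pair in skipgram if pair[0] == "language"])

print(len(cbow), "CBOW examples,", len(skipgram), "Skip-Gram pairs")
```

Skip-Gram produces more, smaller training examples from the same sentence, which is one reason it is slower to train than CBOW.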

CBOW - Words similar to 'language': [('processing', 0.35), ('natural', 0.32), ('awesome', 0.25), ('learning', 0.18), ('tool', 0.14)]

Skip-gram - Words similar to 'language': [('processing', 0.41), ('natural', 0.38), ('awesome', 0.29), ('learning', 0.20), ('tool', 0.16)]

🔹 Explanation:

- The output is a list of (word, similarity_score) pairs.

- similarity_score is the cosine similarity between vectors (range -1 to 1).

- Higher value → stronger relation.

👉 In this dataset:

- "processing" is most similar to "language" because the two words are adjacent in both sentences that contain them.

- "natural" is also strongly related.

- "awesome" appears because of the sentence “Natural language processing is awesome”.

- "learning" and "tool" never share a sentence with "language"; they are listed only because most_similar always returns the top 5 words, however low their scores.
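One detail worth checking in this dataset: Word2Vec treats tokens as case-sensitive strings, so "Natural" in the fourth sentence and "natural" in the first are separate vocabulary entries. A minimal sketch using only the standard library shows the effect of lowercasing before training. Lowercasing is a suggested preprocessing step here, not something the original cells do.

```python
from collections import Counter

sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["Word2Vec", "is", "a", "great", "tool"],
    ["Machine", "learning", "is", "fun"],
    ["Natural", "language", "processing", "is", "awesome"],
]

# Token counts exactly as the model would see them
raw_counts = Counter(token for sentence in sentences for token in sentence)

# Token counts after lowercasing every token
lowered = [[token.lower() for token in sentence] for sentence in sentences]
lower_counts = Counter(token for sentence in lowered for token in sentence)

print("raw:     natural =", raw_counts["natural"], "| Natural =", raw_counts["Natural"])
print("lowered: natural =", lower_counts["natural"])
print("vocabulary size:", len(raw_counts), "raw vs", len(lower_counts), "lowered")
```

Passing `lowered` instead of `sentences` to `Word2Vec` would give "natural" two occurrences to learn from instead of one.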

⚡ Difference Between CBOW & Skip-Gram Outputs

- Both find "processing" and "natural" as the closest words.

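- CBOW is usually faster to train and works well with frequent words.

- Skip-Gram is slower but tends to represent rare words better, because every (target, context) pair is a separate training example.

- The slightly higher Skip-Gram scores here (0.41, 0.38 versus CBOW’s 0.35, 0.32) are specific to this run and this tiny dataset. Skip-Gram does not produce higher cosine similarities in general, so compare rankings within one model rather than raw scores across models.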