<a href="https://colab.research.google.com/github/AdityaKumbhar21/Natural_Language_Processing/blob/main/Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install gensim if you haven't already
# !pip install gensim

# Import necessary libraries
import gensim.downloader as api
from gensim.models import KeyedVectors
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import warnings

# Suppress warnings for cleaner output, especially from gensim
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

# %% [markdown]
"""
## 2. Download and Load the Pre-trained Google News Word2Vec Model

The `gensim.downloader` module provides access to various pre-trained models. We'll use the 'word2vec-google-news-300' model, which contains 300-dimensional vectors for 3 million words and phrases.
"""

# %%
# Define the model name
model_name = "word2vec-google-news-300"

print(f"Attempting to download and load the '{model_name}' model. This may take a while...")

try:
    # Download the model on first use; later calls reuse the cached copy.
    # gensim stores downloads under ~/gensim-data by default.
    wv = api.load(model_name)
    print(f"Model '{model_name}' loaded successfully!")

except Exception as e:
    print(f"Error loading model: {e}")
    print("Please ensure you have enough disk space and a stable internet connection.")
    print("You might need to run this cell again if the download was interrupted.")
    wv = None # Set wv to None if loading fails to prevent errors in subsequent cells

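# Check that the model was loaded (compare against None explicitly rather than
# relying on the truthiness of the KeyedVectors object)
if wv is not None: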
    print(f"\nNumber of words in vocabulary: {len(wv.key_to_index)}")
    print(f"Vector dimension: {wv.vector_size}")
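
# %% [markdown]
"""
A rough size estimate helps explain why the download and load take a while. The sketch below multiplies the vocabulary size and vector dimension quoted above. The 4 bytes per value (32-bit floats) is an assumption, not something confirmed in this notebook.

```python
# Back-of-the-envelope size of the embedding matrix.
vocab_size = 3_000_000   # "3 million words and phrases"
vector_size = 300        # dimensions per vector
bytes_per_value = 4      # ASSUMPTION: values are stored as 32-bit floats

total_bytes = vocab_size * vector_size * bytes_per_value
print(f"Approximate matrix size: {total_bytes / 1e9:.1f} GB")
```

This counts the raw vectors only, so treat it as a lower bound on memory use.
"""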

# %% [markdown]
"""
## 3. Explore Word Vector Operations

Now that the model is loaded, let's perform some common operations to understand how word embeddings capture semantic relationships.
"""

# %%
if wv is not None:
    # 3.1 Get the vector for a specific word
    word_vector_king = wv['king']
    print(f"Vector for 'king' (first 10 dimensions):\n{word_vector_king[:10]}\n")

    # 3.2 Find the most similar words
    print("Words most similar to 'king':")
    for word, similarity in wv.most_similar('king'):
        print(f"  {word}: {similarity:.4f}")
    print("\nWords most similar to 'car':")
    for word, similarity in wv.most_similar('car'):
        print(f"  {word}: {similarity:.4f}")

    # %% [markdown]
    """
    ### 3.3 Word Analogies (King - Man + Woman = Queen)

    One of the best-known demonstrations of Word2Vec is solving analogies with vector arithmetic: add and subtract word vectors, then look up the words closest to the result.
    """

    # %%
    print("Solving analogy: 'king' - 'man' + 'woman' = ?")
    result = wv.most_similar(positive=['woman', 'king'], negative=['man'])
    for word, similarity in result:
        print(f"  {word}: {similarity:.4f}")

    print("\nSolving analogy: 'France' - 'Paris' + 'Rome' = ?")
    result = wv.most_similar(positive=['Rome', 'France'], negative=['Paris'])
    for word, similarity in result:
        print(f"  {word}: {similarity:.4f}")

    # %% [markdown]
    """
    ### 3.4 Calculate Similarity Between Two Words

    You can directly calculate the cosine similarity between any two words in the vocabulary.
    """

    # %%
    word1 = 'good'
    word2 = 'bad'
    word3 = 'excellent'

    similarity_good_bad = wv.similarity(word1, word2)
    similarity_good_excellent = wv.similarity(word1, word3)

    print(f"Similarity between '{word1}' and '{word2}': {similarity_good_bad:.4f}")
    print(f"Similarity between '{word1}' and '{word3}': {similarity_good_excellent:.4f}")

    # You can also get the raw vectors and compute cosine similarity manually
    vec_good = wv[word1]
    vec_bad = wv[word2]
    manual_similarity = cosine_similarity(vec_good.reshape(1, -1), vec_bad.reshape(1, -1))[0][0]
    print(f"Manual cosine similarity between '{word1}' and '{word2}': {manual_similarity:.4f}")

    # %% [markdown]
    """
    ### 3.5 Odd-One-Out (Does not fit)

    `doesnt_match` returns the word in a list that is least similar to the others.
    """

    # %%
    words = ['breakfast', 'lunch', 'dinner', 'car']
    odd_one_out = wv.doesnt_match(words)
    print(f"Odd one out in {words}: '{odd_one_out}'")

    words_2 = ['apple', 'banana', 'orange', 'table']
    odd_one_out_2 = wv.doesnt_match(words_2)
    print(f"Odd one out in {words_2}: '{odd_one_out_2}'")

    # %% [markdown]
    """
    ### 3.6 Handling Out-of-Vocabulary (OOV) Words

    What happens if you query a word not present in the model's vocabulary?
    """

    # %%
    oov_word = 'supercalifragilisticexpialidocious' # A very long, uncommon word
    if oov_word in wv.key_to_index:
        print(f"'{oov_word}' is in the vocabulary.")
        print(wv[oov_word][:5])
    else:
        print(f"'{oov_word}' is NOT in the vocabulary.")
        try:
            # Attempting to access an OOV word will raise a KeyError
            _ = wv[oov_word]
        except KeyError as e:
            print(f"Caught expected error when accessing OOV word: {e}")

    # You can check for a word's presence before accessing it
    common_word = 'computer'
    if common_word in wv.key_to_index:
        print(f"\n'{common_word}' is in the vocabulary.")
        print(wv[common_word][:5])
    else:
        print(f"'{common_word}' is NOT in the vocabulary.")
else:
    print("\nModel was not loaded. Skipping demonstration of word vector operations.")
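
# %% [markdown]
"""
The analogy step in 3.3 can be tried without downloading the model. The sketch below uses made-up 3-dimensional vectors and a hypothetical helper, `toy_analogy`, that adds and subtracts raw vectors and ranks the remaining words by cosine similarity. It is a simplified illustration of the idea, not gensim's implementation of `most_similar`.

```python
import math

# Hypothetical toy vectors (invented values, not taken from the real model).
# The three dimensions loosely stand for: royalty, maleness, femaleness.
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.05],
    "apple":  [0.05, 0.1, 0.1],
}


def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_analogy(positive, negative, vectors):
    # Build the target vector: sum of positives minus sum of negatives.
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Skip the query words themselves, then pick the closest remaining word.
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


best = toy_analogy(["king", "woman"], ["man"], toy_vectors)
print(f"king - man + woman is closest to: {best}")
```

Excluding the query words matters: without it, the nearest neighbor of the result is often one of the inputs.
"""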