## Importing Required Libraries

In [None]:
import numpy as np
import gensim
import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
from IPython.display import display, Markdown
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

## Load the Text8 Dataset
The **Text8 dataset** is a cleaned excerpt of English Wikipedia (the first 100 MB of a preprocessed dump), often used for training word embeddings.  
It consists of approximately 17 million words in a single text file, formatted as a sequence of lowercase words without punctuation.

In this step, you will:
1. **Download the Text8 dataset** using Gensim's downloader.
2. **Convert it to a list** named `text8_corpus` so it can be used for training.

In [None]:
# Your Code Here
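
One possible way to complete this step is sketched below. It assumes Gensim's downloader exposes the corpus under the name `"text8"` and that iterating over the returned object yields one list of tokens per chunk. The function name `load_text8_corpus` and its `loader` parameter are illustrative and not part of Gensim; the parameter only exists so the function can be tried without the real download.

```python
def load_text8_corpus(loader=None):
    """Return Text8 as a list of token lists (a sketch, not the only way)."""
    if loader is None:
        # Deferred import: the real download only happens when no loader is given.
        import gensim.downloader as api
        loader = api.load
    # Materialize the iterable so it can be measured with len() and reused.
    return list(loader("text8"))
```

In the notebook this would be used as `text8_corpus = load_text8_corpus()`. The first call downloads the data, so it can take a while.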

In [None]:
len(text8_corpus)

## Train a Word2Vec Model in CBOW Mode

In this step, you will initialize a Word2Vec model and train it with the **Continuous Bag-of-Words (CBOW)** architecture.  
The model will learn word representations based on a **context-predicting approach**, where the surrounding words are used to predict the target word.  
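
To make the idea concrete, the following toy sketch lists the (context, target) examples a CBOW-style objective would see for one short sentence. The function `cbow_examples` and the sample sentence are made up for illustration. An actual Word2Vec implementation adds further details, such as subsampling of frequent words, that are ignored here.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word) pairs for a CBOW-style objective."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        context = left + right
        if context:                              # skip one-word sentences
            yield context, target


# Illustrative sentence, not taken from Text8.
sentence = ["the", "king", "rules", "the", "land"]
for context, target in cbow_examples(sentence, window=1):
    print(context, "->", target)
```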

Consider the impact of different parameters such as:
- The size of the word representations (e.g., 300 dimensions).
- The number of neighboring words considered (e.g., 5 neighbors).
- The minimum occurrences required for a word to be included in the vocabulary (e.g., 5 occurrences).
- The number of CPU cores used for training (e.g., 4 cores).

Write and run code in the cell below that trains the model for **one epoch**.


In [None]:
# Your Code Here
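
A minimal sketch of this step, assuming Gensim 4.x (where the parameters are called `vector_size` and `epochs`; older releases used different names). The helper names `cbow_params` and `train_cbow` are illustrative, and the values are simply the examples listed above.

```python
def cbow_params(corpus):
    """Keyword arguments for a CBOW run, using the example values from the text."""
    return dict(
        sentences=corpus,   # iterable of token lists
        vector_size=300,    # dimensionality of each word vector
        window=5,           # maximum distance between target and context word
        min_count=5,        # ignore words seen fewer than 5 times
        workers=4,          # worker threads used for training
        sg=0,               # 0 selects CBOW, 1 selects Skip-gram
        epochs=1,           # a single pass over the corpus
    )


def train_cbow(corpus):
    """Build the vocabulary and train in one call."""
    # Deferred import so the parameter helper can be inspected without Gensim.
    from gensim.models import Word2Vec
    return Word2Vec(**cbow_params(corpus))
```

With the corpus from the previous step, `cbow_model = train_cbow(text8_corpus)` would train the model; the variable name `cbow_model` is only a suggestion.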

## Example Words
We select a few words of interest.

In [None]:
words = ["king", "queen", "man", "woman", "car", "bus"]

## Compare Word Similarities

After training the Word2Vec model, we can analyze how well it captures relationships between words.  
In this step, we:
- Find the **top 5 most similar words** for each word based on cosine similarity.
- Display the results in a structured format.

Words with high similarity scores are expected to have similar meanings or occur in similar contexts.  
If a word is **not found in the vocabulary**, it either never appears in the corpus or did not meet the minimum occurrence threshold during training.
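
Cosine similarity itself is just the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on made-up 2D vectors; it is meant to show the formula, not how Gensim computes it internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors chosen for illustration only.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The score depends only on direction, not on vector length, which is why it ranges from -1 to 1.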


In [None]:
for word in words:
    if ## code
        similar_words = ## code
        display(Markdown(f"**{word}:**"))
        for w, score in similar_words:
            display(Markdown(f"- {w}: {score:.4f}"))
    else:
        display(Markdown(f"{word} not in vocabulary."))
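
One way the two blanks above might be filled is sketched here as a standalone helper. It assumes a Gensim 4.x model, whose keyed vectors expose `key_to_index` for the vocabulary check and `most_similar` for the neighbor lookup. The helper name `top_similar` is hypothetical.

```python
def top_similar(model, words, topn=5):
    """Map each word to its `topn` nearest neighbors, or None if out of vocabulary."""
    results = {}
    for word in words:
        if word in model.wv.key_to_index:            # vocabulary membership test
            # List of (neighbor, cosine_similarity) tuples, best match first.
            results[word] = model.wv.most_similar(word, topn=topn)
        else:
            results[word] = None
    return results
```

The condition and the lookup inside the loop correspond to the `if` line and the `similar_words` line of the cell.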

## Visualizing Word Embeddings

Word embeddings are high-dimensional vectors, making them difficult to interpret directly.  
To visualize them, we use **Principal Component Analysis (PCA)** to reduce their dimensionality from 300 to 2.

### Steps:
1. Extract word vectors for selected words.
2. Apply **PCA** to reduce dimensionality from 300 to 2.
3. Plot the words in a 2D space.

Words that appear in similar contexts should be **closer together** in the plot.


In [None]:
vectors = ## CODE
pca = ## CODE
result = ## CODE
plt.figure(figsize=(8,6))
plt.scatter(## CODE)
for i, word in enumerate(words):
    ## CODE
plt.title("Word Embeddings Visualization using CBOW")
plt.show()
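
The steps above could be completed along the following lines. This is a sketch under the assumption that the model is a Gensim 4.x `Word2Vec` instance; the helper names `known_vectors` and `plot_embeddings` are illustrative. Unlike the skeleton, it skips words that are missing from the vocabulary so that labels and points stay aligned.

```python
def known_vectors(model, words):
    """Return (labels, vectors) for the words present in the model's vocabulary."""
    labels = [w for w in words if w in model.wv.key_to_index]
    vectors = [model.wv[w] for w in labels]
    return labels, vectors


def plot_embeddings(model, words, title):
    """Project the selected word vectors to 2D with PCA and label each point."""
    # Deferred imports: only needed when a plot is actually drawn.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    labels, vectors = known_vectors(model, words)
    # PCA with 2 components needs at least two in-vocabulary words.
    points = PCA(n_components=2).fit_transform(np.array(vectors))

    plt.figure(figsize=(8, 6))
    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), label in zip(points, labels):
        plt.annotate(label, xy=(x, y))
    plt.title(title)
    plt.show()
```

With only six words, the 2D layout can change noticeably between training runs, so the plot is best read as a rough impression rather than a measurement.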

## Training Word2Vec with Skip-gram

Now, we will train another Word2Vec model using a different architecture called **Skip-gram**.  
Unlike the previous CBOW approach, Skip-gram learns to predict **context words** given a target word, making it more effective for learning representations of rare words.
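
As a toy illustration of the difference, the sketch below lists the (target, context) pairs a Skip-gram-style objective would train on: every neighbor inside the window becomes its own prediction task. The function `skipgram_pairs` and the sample sentence are made up for this example and leave out details of the real training procedure.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target_word, context_word) pairs for a Skip-gram-style objective."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                     # a word is not its own context
                yield target, tokens[j]


# Illustrative sentence, not taken from Text8.
for target, context in skipgram_pairs(["the", "king", "rules"], window=1):
    print(target, "->", context)
```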

Repeat the previous steps:
1. Initialize a new Word2Vec model with the training mode set to Skip-gram.
2. Train it using the **Text8 dataset** for one epoch.


In [None]:
# Your Code Here
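
A possible solution, again assuming Gensim 4.x parameter names. The only difference from a CBOW run is `sg=1`; the remaining values reuse the examples given earlier in the notebook, and the helper names are hypothetical.

```python
def skipgram_params(corpus):
    """Keyword arguments for a Skip-gram run; only `sg` differs from CBOW."""
    return dict(
        sentences=corpus,
        vector_size=300,
        window=5,
        min_count=5,
        workers=4,
        sg=1,       # 1 selects Skip-gram
        epochs=1,   # a single pass over the corpus
    )


def train_skipgram(corpus):
    """Build the vocabulary and train a Skip-gram model in one call."""
    # Deferred import so the parameter helper can be inspected without Gensim.
    from gensim.models import Word2Vec
    return Word2Vec(**skipgram_params(corpus))
```

Skip-gram creates one training example per (target, context) pair, so this run can be expected to take longer than the CBOW one on the same corpus.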

## Compare Word Similarities (Skip-gram)
Now, we repeat the similarity comparison, but this time using the **Skip-gram** model.

- Skip-gram focuses on predicting surrounding words given a target word.
- It is better suited for smaller datasets and learns high-quality embeddings, especially for infrequent words.
- Here, we retrieve and display the most similar words for a given set of words.


In [None]:
for word in words:
    if ## code
        similar_words = ## code
        display(Markdown(f"**{word}:**"))
        for w, score in similar_words:
            display(Markdown(f"- {w}: {score:.4f}"))
    else:
        display(Markdown(f"{word} not in vocabulary."))

## Visualizing Word Embeddings (Skip-gram)
Now, we visualize the word embeddings obtained from the **Skip-gram** model.

In [None]:
vectors = ## CODE
pca = ## CODE
result = ## CODE
plt.figure(figsize=(8,6))
plt.scatter(## CODE)
for i, word in enumerate(words):
    ## CODE
plt.title("Word Embeddings Visualization using Skip-gram")
plt.show()