<a href="https://colab.research.google.com/github/F1ameX/2025-ODS-NLP/blob/main/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practice: Word2Vec model training

**Task:** Write a program that trains a Word2Vec model. Do not call print() anywhere in your code, otherwise the test procedure will fail. The message "Wrong Answer" indicates that the answer format is incorrect (print() in the code, words missing from the dictionary, etc.). The message "Embeddings are not good enough" means you are on the right track and should focus on improving the model. In this version of the assignment, the checks on the embeddings are easier.

In [2]:
import re
import string
import numpy as np


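def clean(inp: str) -> str:
    '''
    Cleans the input string by replacing punctuation with spaces, converting
    to lowercase, and collapsing runs of whitespace into a single space.
    '''
    # Map every punctuation character to a space so that "a,b" becomes two words.
    inp = inp.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Collapse whitespace runs; strip() drops the space left by leading/trailing punctuation.
    inp = re.sub(r'\s+', ' ', inp.lower()).strip()
    return inp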
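
To illustrate the cleaning steps above, here is a small sketch that applies the same translate-then-collapse approach to a made-up sentence. The name `clean_demo` and the sample text are assumptions for illustration only; they are not part of the assignment.

```python
import re
import string


def clean_demo(text: str) -> str:
    # Hypothetical stand-in that follows the same steps as clean() above.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return re.sub(r"\s+", " ", text.translate(table).lower()).strip()


sample = "Hello, World!  It's word2vec..."
tokens = clean_demo(sample).split()
# tokens == ['hello', 'world', 'it', 's', 'word2vec']
```

Note that the apostrophe counts as punctuation, so contractions such as "It's" are split into two tokens. Whether that is desirable depends on the corpus.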

## Lite version of train function

In [None]:
def train(data: str) -> dict:
    '''
    Accepts raw text, cleans it, and returns a dictionary that maps each
    unique word to a random vector (no actual training is performed).
    '''
    # Clean the text once, then split it on whitespace.
    words = clean(data).split()

    vocab = list(set(words))
    # Each vector has as many components as there are unique words.
    vocab_size = len(vocab)

    w2v_dict = {word: np.random.uniform(-0.5, 0.5, vocab_size) for word in vocab}
    return w2v_dict
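
Because each vector above has as many components as there are unique words, memory grows quadratically with the vocabulary size. The following is a minimal sketch of one alternative, assuming a fixed dimension and a seeded generator for reproducibility. The names `random_embeddings`, `dim`, and `seed` are hypothetical, and the task does not state whether the grader accepts a different vector length.

```python
import numpy as np


def random_embeddings(words, dim=50, seed=0):
    # Hypothetical helper: one random vector of fixed length `dim` per unique word.
    rng = np.random.default_rng(seed)
    vocab = sorted(set(words))  # sorted so the word-to-row assignment is deterministic
    matrix = rng.uniform(-0.5, 0.5, size=(len(vocab), dim))
    return {word: matrix[i] for i, word in enumerate(vocab)}


emb = random_embeddings("the cat sat on the mat".split(), dim=8)
```

With a fixed seed, two calls on the same text return identical vectors, which makes runs easier to compare.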

## Hard version of train function

In [3]:
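def train(data: str) -> dict:
    '''
    Builds word embeddings with a simple context-averaging heuristic:
    each word vector is repeatedly nudged toward the mean of its
    neighbors' vectors. This is not the skip-gram or CBOW objective
    used by Word2Vec.
    '''
    window = 2  # number of neighbors considered on each side
    epochs = 5  # passes over the whole text

    words = clean(data).split()
    vocab = list(set(words))
    vector_size = len(vocab)  # vector length equals the vocabulary size

    # Start from random vectors, one per unique word.
    w2v_dict = {word: np.random.uniform(-0.5, 0.5, vector_size) for word in vocab}

    for _ in range(epochs):
        for i, word in enumerate(words):
            # Collect up to `window` words on each side, excluding the word itself.
            left = max(0, i - window)
            right = min(len(words), i + window + 1)
            neighbors = words[left:i] + words[i + 1:right]

            # A single-word text has no neighbors to average.
            if not neighbors:
                continue

            neighbor_vectors = np.array([w2v_dict[neighbor] for neighbor in neighbors])
            avg_vector = np.mean(neighbor_vectors, axis=0)

            # Keep 90% of the old vector and mix in 10% of the context mean.
            w2v_dict[word] = 0.9 * w2v_dict[word] + 0.1 * avg_vector

    return w2v_dict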
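
The averaging approach above is a heuristic. If the "Embeddings are not good enough" message appears, one possible direction is skip-gram with negative sampling, which is a different technique. The sketch below is a minimal version under several assumptions: the names `train_sgns` and `cosine`, all hyperparameter values, and the toy corpus are hypothetical, and negatives are drawn uniformly rather than from a smoothed unigram distribution.

```python
import numpy as np


def train_sgns(words, dim=16, window=2, epochs=20, lr=0.05, negatives=3, seed=0):
    # Hypothetical skip-gram with negative sampling; returns {word: vector}.
    rng = np.random.default_rng(seed)
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    ids = np.array([index[w] for w in words])

    w_in = rng.uniform(-0.5, 0.5, (len(vocab), dim)) / dim  # center-word vectors
    w_out = np.zeros((len(vocab), dim))                      # context-word vectors

    for _ in range(epochs):
        for pos, center in enumerate(ids):
            left = max(0, pos - window)
            right = min(len(ids), pos + window + 1)
            for ctx_pos in range(left, right):
                if ctx_pos == pos:
                    continue
                # One true context word (label 1) plus uniformly sampled negatives (label 0).
                # Assumption: a negative may occasionally equal the true context word.
                sampled = rng.integers(0, len(vocab), negatives)
                targets = np.concatenate(([ids[ctx_pos]], sampled))
                labels = np.zeros(len(targets))
                labels[0] = 1.0

                v = w_in[center]
                u = w_out[targets]                    # copy of the selected rows
                scores = 1.0 / (1.0 + np.exp(-(u @ v)))
                grad = scores - labels                # logistic-loss gradient w.r.t. the logits
                grad_v = grad @ u

                # subtract.at accumulates correctly when `targets` contains repeats.
                np.subtract.at(w_out, targets, lr * np.outer(grad, v))
                w_in[center] -= lr * grad_v

    return {w: w_in[index[w]] for w in vocab}


def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


corpus = "the cat sat on the mat the dog sat on the rug".split()
emb = train_sgns(corpus, dim=8)
sim = cosine(emb["cat"], emb["dog"])
```

On a corpus this small the similarity value is not meaningful; the point is only to show the shape of the training loop and how the resulting vectors can be compared.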