<a href="https://colab.research.google.com/github/ItWasAllYellow/NLP_2025/blob/main/notebooks/1_word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2Vec Implementation from Scratch

This notebook demonstrates how to implement the Word2Vec algorithm from scratch using PyTorch. We'll use the first Harry Potter book as our corpus to train word embeddings.


## 1. Setting Up the Environment

First, we need to import the necessary libraries:
- `torch` and `torch.nn` for tensor operations and neural network functionality
- `string` for string manipulations (removing punctuation)


In [1]:
import torch
import torch.nn as nn
import string


## 2. Getting the Text Data

We'll download the first Harry Potter book to use as our corpus.

In [2]:
!wget "https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt"


--2025-03-13 05:28:42--  https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 439742 (429K) [text/plain]
Saving to: ‘J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt.1’


2025-03-13 05:28:42 (8.85 MB/s) - ‘J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt.1’ saved [439742/439742]



## 3. Text Preprocessing

Before we can use the text data, we need to preprocess it:
- Remove punctuation
- Convert text to lowercase
- Split text into tokens (words)

These two functions clean and tokenize the text.

In [3]:
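def remove_punctuation(x):
  # Delete every ASCII punctuation character from the string
  return x.translate(str.maketrans('', '', string.punctuation))

def make_tokenized_corpus(corpus):
  # Lowercase the tokens of each sentence; `if y` drops empty strings left by repeated spaces
  out = [[y.lower() for y in remove_punctuation(sentence).split(' ') if y] for sentence in corpus]
  # Drop sentences that ended up with no tokens
  return [x for x in out if x != []]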
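
To see what the preprocessing does, here is a small usage sketch on three made-up strings (the `demo` list is an assumption for illustration, not taken from the book). It repeats the two function definitions so that it runs on its own.

```python
import string

def remove_punctuation(x):
    return x.translate(str.maketrans('', '', string.punctuation))

def make_tokenized_corpus(corpus):
    out = [[y.lower() for y in remove_punctuation(s).split(' ') if y] for s in corpus]
    return [x for x in out if x != []]

# Hypothetical input: one normal sentence, one blank string, one with an abbreviation
demo = ["Harry didn't look.", "  ", "It's Mr. Dursley!"]
tokens = make_tokenized_corpus(demo)
print(tokens)
# [['harry', 'didnt', 'look'], ['its', 'mr', 'dursley']]
```

Note that the blank string disappears entirely and that apostrophes are removed, which is why tokens such as `didnt` and `youd` show up later in the corpus.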


## 4. Loading and Formatting the Text

Now we'll load the text file, replace some special characters, and split the text into sentences.


In [4]:
with open("J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt", 'r') as f:
  strings = f.readlines()
sample_text = "".join(strings).replace('\n', ' ').replace('Mr.', 'mr').replace('Mrs.', 'mrs').split('. ')


Let's tokenize the text using our preprocessing function `make_tokenized_corpus`:

In [5]:
# Corpus is a list of list of strings (words)
corpus = make_tokenized_corpus(sample_text)

## 5. Creating Context Word Pairs

A key concept in Word2Vec is learning from context. We need to create pairs of words that appear near each other in the text. We'll use a sliding window approach to create these pairs.

For example, with a window size of 2, for the word "to" in the sentence "they were the last people youd expect to be involved...", we would create pairs with the two words on each side:
- ("to", "youd")
- ("to", "expect")
- ("to", "be")
- ("to", "involved")

These pairs will be our training data.

In [6]:
from tqdm import tqdm

sample_sentence = ['they', 'were', 'the', 'last', 'people', 'youd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', 'because', 'they', 'just', 'didnt', 'hold', 'with', 'such', 'nonsense']

word_pairs = []
window_size = 2

for sample_sentence in tqdm(corpus):
    for curr_idx, center_word in enumerate(sample_sentence):
        #print(curr_idx, center_word)

        window_begin = max(curr_idx - window_size, 0)
        window_end = min(curr_idx + window_size + 1, len(sample_sentence))

        #for context_word in sample_sentence[window_begin : window_end]:
            #if center_word == context_word: continue -> problematic, because the same word can legitimately appear again as a context word inside the window

            #word_pairs.append((center_word, context_word))

        for j in range(window_begin, window_end):
            if curr_idx == j: continue
            word_pairs.append((center_word, sample_sentence[j]))

print(f"\nLength of word_pairs is {len(word_pairs)}")
print(f"First 5 example of word_pairs is {word_pairs[:5]}")


100%|██████████| 4682/4682 [00:00<00:00, 8004.90it/s]


Length of word_pairs is 282372
First 5 example of word_pairs is [('harry', 'potter'), ('harry', 'and'), ('potter', 'harry'), ('potter', 'and'), ('potter', 'the')]
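
As a sanity check on the windowing logic, the sketch below applies the same loop to a single sentence and keeps only the pairs whose center word is "to". The helper name `make_pairs` is an assumption introduced here for illustration; it is not part of the notebook.

```python
def make_pairs(sentence, window_size=2):
    """Return (center, context) pairs for every token of one tokenized sentence."""
    pairs = []
    for curr_idx, center_word in enumerate(sentence):
        window_begin = max(curr_idx - window_size, 0)
        window_end = min(curr_idx + window_size + 1, len(sentence))
        for j in range(window_begin, window_end):
            if j == curr_idx:  # skip the center position itself, not equal words
                continue
            pairs.append((center_word, sentence[j]))
    return pairs

sentence = ['they', 'were', 'the', 'last', 'people', 'youd', 'expect',
            'to', 'be', 'involved', 'in', 'anything', 'strange']
to_pairs = [p for p in make_pairs(sentence) if p[0] == 'to']
print(to_pairs)
# [('to', 'youd'), ('to', 'expect'), ('to', 'be'), ('to', 'involved')]
```

Comparing positions rather than words means a word that repeats inside its own window still produces a pair.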





## 6. Building the Vocabulary

To work with word vectors, we need to create a vocabulary that maps each unique word to an index. We'll also filter out rare words that appear fewer than a certain number of times in the corpus.

### 6.1 Collecting All Words

First, let's collect all words in our corpus:


In [7]:
# we have to make vocabulary
# pretty print!

sentence = corpus[0]
entire_words = []

for sentence in corpus:
    for word in sentence:
        entire_words.append(word)

len(entire_words)

77597


### 6.2 Finding Unique Words

Now, let's find the unique words in our corpus:

In [8]:
# we have to get the unique items among all the words

unique_words = set(entire_words)
len(unique_words)

6038

### 6.3 Converting to a List and Sorting

We'll convert the set of unique words to a sorted list:

In [9]:
# unique_words[0] # a set is not subscriptable because it has no order

unique_words = sorted(list(unique_words))
unique_words[0]

'\the'

### 6.4 Filtering by Frequency

Now, let's filter out rare words, keeping only those that occur more than `threshold` times:
- We can use the `Counter` class from the `collections` module to count the frequency of each word in the corpus.
- Be careful: `alist.sort()` sorts the list in place and returns `None`, so do not assign its result to a variable.

In [10]:
# how can we filter the vocab by its frequency?
filtered_vocab = None
# you can use the word counter like a dictionary
# In a Python dictionary, dict.keys() gives the keys, dict.values() gives the values,
# and dict.items() gives (key, value) pairs

from collections import Counter

word_counter = Counter(entire_words)

word_counter.most_common(10)
word_counter['harry']

threshold = 5
filtered_vocab = []
for k, v in word_counter.items():
    if v > threshold:
        filtered_vocab.append(k)

filtered_vocab.sort()
filtered_vocab[:10]

['a',
 'able',
 'abou',
 'about',
 'above',
 'across',
 'added',
 'afford',
 'afraid',
 'after']
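
The `sort()` caveat is easy to trip over, so here is a tiny sketch on a made-up word list. The `threshold = 1` value is an assumption chosen only to suit this toy example.

```python
from collections import Counter

words = ['the', 'boy', 'the', 'owl', 'the', 'boy']
counts = Counter(words)

threshold = 1  # hypothetical value for this toy example
kept = [w for w, c in counts.items() if c > threshold]

result = kept.sort()      # sorts `kept` in place and returns None
print(result)             # None
print(kept)               # ['boy', 'the']
print(sorted(counts))     # ['boy', 'owl', 'the'] (a new sorted list of the keys)
```

If you need the sorted list as a value, use `sorted(...)`; if you only want to reorder an existing list, call `.sort()` on its own line.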

## 7. Filtering Word Pairs

Now that we have our filtered vocabulary, we need to filter our word pairs to only include words that are in our vocabulary:

In [16]:
# Filter the word_pairs using the vocab
# word_pairs, filtered_vocab
# word_pairs is a list of (word_a, word_b) tuples

'''
filtered_pairs = []

for pair in tqdm(word_pairs):
    a, b = pair
    if (a in filtered_vocab) and (b in filtered_vocab):
        filtered_pairs.append(pair)
'''

filtered_pairs = []
vocab_set = set(filtered_vocab)

for pair in tqdm(word_pairs):
    a, b = pair
    if (a in vocab_set) and (b in vocab_set):
        filtered_pairs.append(pair)

100%|██████████| 282372/282372 [00:00<00:00, 542551.20it/s]


In [18]:
# implement same algorithm with list comprehension

filtered_word_pairs = [pair for pair in word_pairs if ((pair[0] in vocab_set) and (pair[1] in vocab_set))]
len(filtered_word_pairs)

226846

In [19]:
len(filtered_pairs), len(word_pairs)

(226846, 282372)

## 8. Converting Words to Indices

For efficiency, we'll convert our words to indices according to their position in our vocabulary:

In [22]:
# convert word into index of vocab
filtered_vocab.index('harry')

527

This is inefficient because `list.index()` has to scan the list every time. Let's use a dictionary for faster lookups:

In [23]:
# we can make it faster
# use dictionary to find the index of string

# dictionaries and sets are much better suited to lookups (faster)

word2idx = dict()
for idx, word in enumerate(filtered_vocab):
    word2idx[word] = idx

word2idx['harry']

527

Now, let's convert our word pairs to index pairs more efficiently:

In [None]:
index_pairs = []
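
One possible way to fill in `index_pairs` is sketched below. The small `filtered_vocab` and `filtered_pairs` lists are toy stand-ins (assumptions) for the real ones built earlier, so the snippet runs on its own.

```python
# Toy stand-ins for the notebook's filtered_vocab and filtered_pairs
filtered_vocab = ['and', 'harry', 'potter', 'the']
filtered_pairs = [('harry', 'potter'), ('harry', 'and'), ('potter', 'the')]

word2idx = {word: idx for idx, word in enumerate(filtered_vocab)}

# Each (center, context) word pair becomes a pair of integer indices
index_pairs = [[word2idx[a], word2idx[b]] for a, b in filtered_pairs]
print(index_pairs)
# [[1, 2], [1, 0], [2, 3]]
```

Each lookup in `word2idx` takes constant time on average, so converting all pairs costs roughly one pass over the data.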

In [None]:
# Why don't we need idx2tok?

## 9. Creating Initial Word Vectors

Now we'll create random vectors for each word in our vocabulary. These vectors will be adjusted during training:
- We can use `torch.randn` to create random vectors whose entries are drawn from a standard normal distribution.

In [None]:
# we have to make random vectors for each word in the vocab
# we also have to decide the dimension of the vector

dim = None
vocab_size = None

word_vectors = None
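
A minimal sketch of this step, assuming an embedding size of 8 and a four-word toy vocabulary (both are placeholders; the notebook leaves `dim` open and the real vocabulary is much larger):

```python
import torch

torch.manual_seed(0)  # only to make this sketch reproducible

# Toy vocabulary standing in for the notebook's filtered_vocab
filtered_vocab = ['and', 'harry', 'potter', 'the']
word2idx = {word: idx for idx, word in enumerate(filtered_vocab)}

dim = 8                        # assumed embedding size
vocab_size = len(filtered_vocab)

# One row per word, each entry drawn from a standard normal distribution
word_vectors = torch.randn(vocab_size, dim)

vector_of_harry = word_vectors[word2idx['harry']]
print(word_vectors.shape, vector_of_harry.shape)
# torch.Size([4, 8]) torch.Size([8])
```

Row `i` of `word_vectors` is the vector of the word at index `i` in the vocabulary, which is why `word2idx` is all we need to look up a word's vector.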

In [None]:
# what is the vector for harry?


## 10. Understanding Word Relationships with Dot Products

The core of Word2Vec is using dot products to measure relationships between words. Let's explore this concept:

In [None]:
torch.set_printoptions(sci_mode=False) # Do this to avoid scientific notation


## Dot Product
- Assume we have two vectors $a$ and $b$.
  - $a = [a_1, a_2, a_3, a_4, \dots, a_n]$
  - $b = [b_1, b_2, b_3, b_4, \dots, b_n]$
- $a \cdot b = \sum_{i=1}^n a_i b_i = a_1b_1 + a_2b_2 + a_3b_3 + a_4b_4 + \dots + a_nb_n$

Let's calculate the dot product between "harry" and "potter":


In [None]:
# calculate the dot product between potter and harry (the raw score behind P(potter|harry))

dot_product_value_between_potter_harry = None
dot_product_value_between_potter_harry

In [None]:
# we can get the dot product value for every word in the vocab
# to get P(word | harry)
word_dot_dict = {}

word_dot_dict
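
The two cells above could be filled in along these lines. This is a sketch on random toy vectors (the five-word vocabulary and `dim = 8` are assumptions), showing that the formula and `torch.dot` agree.

```python
import torch

torch.manual_seed(0)
filtered_vocab = ['and', 'harry', 'potter', 'ron', 'the']   # toy vocabulary
word2idx = {w: i for i, w in enumerate(filtered_vocab)}
word_vectors = torch.randn(len(filtered_vocab), 8)

harry = word_vectors[word2idx['harry']]
potter = word_vectors[word2idx['potter']]

# Elementwise product followed by a sum is exactly the dot product formula
dot_manual = (harry * potter).sum()
dot_builtin = torch.dot(harry, potter)

# Score of every vocabulary word against "harry"
word_dot_dict = {w: torch.dot(harry, word_vectors[i]).item()
                 for w, i in word2idx.items()}
print(dot_manual.item(), dot_builtin.item())
```

At this point the scores are arbitrary real numbers, because the vectors are still random.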

Now, let's convert these dot products to probabilities using the softmax function:
- We have to convert our predictions into a probability distribution P(word|harry), so that the sum of [P(a|harry), ..., P(potter|harry), ..., P(ron|harry), ...] is 1.
- The current dot product value can be any real number; it is often called a logit.
  - The term comes from logistic regression: a raw score that has not yet been squashed into the range 0 to 1, i.e. the value before the sigmoid function.
  - Every probability should be in the range (0, 1) (greater than 0, smaller than 1).
  - We can achieve this by taking the exponential of each dot product and dividing it by the sum of all the exponentials.
  - This function is called **Softmax**.

- Why do we use the exponential?
  - Because it makes every value positive while preserving the order of the scores.


In [None]:
from math import exp

word_prob_dict = None
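
Here is a small sketch of the softmax steps with plain Python and `math.exp`. The scores in `word_dot_dict` are made-up numbers (an assumption), not values from a trained model.

```python
from math import exp

# Hypothetical dot-product scores (logits) for a few words given "harry"
word_dot_dict = {'potter': 2.0, 'ron': 1.0, 'the': 0.0, 'owl': -1.0}

exp_dict = {w: exp(v) for w, v in word_dot_dict.items()}     # all positive
total = sum(exp_dict.values())
word_prob_dict = {w: e / total for w, e in exp_dict.items()}  # sums to 1

print(word_prob_dict['potter'])        # P(potter|harry) for these toy scores
print(sum(word_prob_dict.values()))    # 1, up to floating-point rounding
```

Even the negative score for `owl` ends up with a small positive probability, and the ranking of the words is unchanged.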

In [None]:
# Get P(potter|harry)

## 13. Efficient Matrix Operations
![img](https://mkang32.github.io/images/python/khan_academy_matrix_product.png)

Instead of calculating dot products one by one, we can use matrix multiplication for efficiency:


In [None]:
# get dot product result for every word in the vocabulary

# first, make vector_of_harry into matrix format

# do matrix multiplication


Let's verify that our matrix multiplication gives the same result as individual dot products:
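
A possible check, sketched on a random toy matrix (the sizes and the choice of row 1 as "harry" are assumptions for illustration):

```python
import torch

torch.manual_seed(0)
word_vectors = torch.randn(5, 8)          # toy (vocab_size, dim) matrix
vector_of_harry = word_vectors[1]         # pretend index 1 is "harry"

# Turn the 1-D vector into a (1, dim) matrix so it can be matrix-multiplied
mat_of_harry = vector_of_harry.unsqueeze(0)

# (1, dim) @ (dim, vocab_size) -> (1, vocab_size): one score per word
all_dots = mat_of_harry @ word_vectors.T

# The same scores computed one dot product at a time
one_by_one = torch.stack([torch.dot(vector_of_harry, v) for v in word_vectors])

print(torch.allclose(all_dots[0], one_by_one, atol=1e-4))
# True
```

`torch.allclose` is used instead of `==` because the two code paths may round floating-point sums slightly differently.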

Now let's implement the complete softmax calculation using matrix operations:


In [None]:
# convert dot product result into exponential

In [None]:
# get the sum of exponential


In [None]:
# divide exponential value with sum
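
The three steps above might look like this in tensor form. This is a sketch on toy data; the comparison with `torch.softmax` at the end is only there as a cross-check.

```python
import torch

torch.manual_seed(0)
word_vectors = torch.randn(5, 8)                 # toy (vocab_size, dim) matrix
mat_of_harry = word_vectors[1].unsqueeze(0)      # shape (1, dim)

dots = mat_of_harry @ word_vectors.T             # shape (1, vocab_size)
exps = torch.exp(dots)                           # step 1: exponentiate
exp_sum = exps.sum(dim=1, keepdim=True)          # step 2: sum per query row
probs = exps / exp_sum                           # step 3: normalize

print(probs.sum(dim=1))                          # each row sums to about 1
print(torch.allclose(probs, torch.softmax(dots, dim=1), atol=1e-6))
```

Keeping `keepdim=True` leaves `exp_sum` with shape (1, 1), so the division broadcasts across the vocabulary axis.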

## 14. Creating a Probability Function

Let's create a function to calculate probabilities efficiently:

In [None]:
def get_probs(query_vectors, entire_vectors):
  return None

# get_probs(mat_of_harry, word_vectors)
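
One way `get_probs` could be written is sketched below. Subtracting the row-wise maximum before `exp` is an addition of this sketch (a common numerical-stability trick), not something the notebook prescribes; it leaves the resulting probabilities unchanged.

```python
import torch

def get_probs(query_vectors, entire_vectors):
    """Softmax over dot products: (batch, dim) x (vocab, dim) -> (batch, vocab)."""
    logits = query_vectors @ entire_vectors.T
    # Subtracting the row-wise max does not change the softmax result
    # but keeps exp() from overflowing on large logits.
    logits = logits - logits.max(dim=1, keepdim=True).values
    exps = torch.exp(logits)
    return exps / exps.sum(dim=1, keepdim=True)

torch.manual_seed(0)
word_vectors = torch.randn(5, 8)                       # toy (vocab_size, dim) matrix
probs = get_probs(word_vectors[[1, 2]], word_vectors)  # two query words at once
print(probs.shape)
# torch.Size([2, 5])
```

Because the function accepts a matrix of queries, the same code handles one word or a whole batch of center words.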

## 15. Preparing for Training

Before training our Word2Vec model, we need to split our dataset into training and testing sets:

In [None]:
# Now we can train the word2vec

# Let's think about training pairs
index_pairs # this is our dataset. It's a list of lists of two integers
# the two integers are the indices of a pair of neighboring words

# Training set and test set
# We want to validate that our model can handle 'unseen' examples,
# so we have to split the dataset before training.

# To randomly split the dataset, we will first shuffle the dataset

# random.shuffle(index_pairs) # this will shuffle the list items
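
A sketch of the shuffle-and-split step follows. The 90/10 ratio and the toy `index_pairs` are assumptions; the notebook does not fix a split ratio.

```python
import random

random.seed(0)  # only so the sketch is reproducible

index_pairs = [[i, i + 1] for i in range(100)]   # toy stand-in dataset

random.shuffle(index_pairs)                      # shuffles in place, returns None

split = int(len(index_pairs) * 0.9)              # assumed 90/10 split
train_set = index_pairs[:split]
test_set = index_pairs[split:]
print(len(train_set), len(test_set))
# 90 10
```

Shuffling first matters because neighboring pairs in the original order come from the same sentences.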

In [None]:
len(train_set), len(test_set)

## 16. Training the Word2Vec Model

Now we'll train our Word2Vec model using mini-batch gradient descent:

In [None]:
# make batches from train_set
# A batch is a set of training samples that are processed together
# We update the model once after each batch
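
Below is a minimal training sketch under several assumptions: a single embedding matrix used for both center and context words (as `get_probs(mat_of_harry, word_vectors)` suggests), a full softmax rather than negative sampling, plain gradient descent through autograd, made-up hyperparameters, and random toy pairs in place of the real `train_set`. It shows the shape of the loop, not a tuned implementation.

```python
import torch

torch.manual_seed(0)
vocab_size, dim = 20, 8
# Toy stand-in for train_set: random (center, context) index pairs
train_set = torch.randint(0, vocab_size, (512, 2)).tolist()

word_vectors = torch.randn(vocab_size, dim, requires_grad=True)
batch_size, lr, num_epochs = 64, 0.1, 5     # assumed hyperparameters
loss_record = []

for epoch in range(num_epochs):
    for start in range(0, len(train_set), batch_size):
        batch = torch.tensor(train_set[start:start + batch_size])
        centers, contexts = batch[:, 0], batch[:, 1]

        logits = word_vectors[centers] @ word_vectors.T      # (batch, vocab)
        log_probs = torch.log_softmax(logits, dim=1)
        # Negative log-likelihood of the true context word for each pair
        loss = -log_probs[torch.arange(len(batch)), contexts].mean()

        loss.backward()
        with torch.no_grad():
            word_vectors -= lr * word_vectors.grad   # one update per batch
        word_vectors.grad.zero_()
        loss_record.append(loss.item())

print(loss_record[0], loss_record[-1])
```

`loss_record` holds one value per batch, which is the list plotted in the next section.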

## 17. Evaluating the Training

Let's visualize the training loss to see if our model is learning:

In [None]:
import matplotlib.pyplot as plt
plt.plot(loss_record)

## 18. Testing the Model

Now we'll test our model on the test set:
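
One way to evaluate is sketched here: compute the average negative log-likelihood on the held-out pairs, plus the fraction of pairs whose top-scoring word is the true context word. Random vectors and random pairs stand in for the trained `word_vectors` and the real `test_set` (assumptions, so the numbers themselves mean nothing).

```python
import torch

torch.manual_seed(0)
vocab_size, dim = 20, 8
word_vectors = torch.randn(vocab_size, dim)                  # stand-in for trained vectors
test_set = torch.randint(0, vocab_size, (100, 2)).tolist()   # stand-in test pairs

with torch.no_grad():                                        # no gradients needed for evaluation
    batch = torch.tensor(test_set)
    logits = word_vectors[batch[:, 0]] @ word_vectors.T
    log_probs = torch.log_softmax(logits, dim=1)
    test_loss = -log_probs[torch.arange(len(batch)), batch[:, 1]].mean().item()
    # Fraction of pairs where the top-scoring word is the actual context word
    accuracy = (logits.argmax(dim=1) == batch[:, 1]).float().mean().item()

print(test_loss, accuracy)
```

A test loss close to the training loss suggests the vectors generalize to unseen pairs; a much larger one suggests overfitting.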

## 19. Exploring Learned Word Relationships

Let's explore what our model has learned by finding the words most closely related to "harry":

In [None]:
# P(potter|harry)?
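
A sketch of how the most related words could be listed with `torch.topk`. The toy vocabulary and random vectors are stand-ins (assumptions) for the trained model, so the ranking printed here is not meaningful.

```python
import torch

torch.manual_seed(0)
filtered_vocab = ['and', 'hagrid', 'harry', 'potter', 'ron', 'the']  # toy vocabulary
word2idx = {w: i for i, w in enumerate(filtered_vocab)}
word_vectors = torch.randn(len(filtered_vocab), 8)   # stand-in for trained vectors

query = word_vectors[word2idx['harry']].unsqueeze(0)             # (1, dim)
probs = torch.softmax(query @ word_vectors.T, dim=1)[0]          # P(word | harry)

top_probs, top_ids = torch.topk(probs, k=3)
for p, i in zip(top_probs.tolist(), top_ids.tolist()):
    print(f"{filtered_vocab[i]}: {p:.3f}")

print(probs[word2idx['potter']].item())                          # P(potter | harry)
```

Here the vocabulary list itself maps indices back to words, so no separate index-to-word dictionary is needed.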
