# CS6120 NLP Fall 2023 Assignment 3

## Implementing Skipgram and CBOW Algorithms

### Background: 
Word embeddings are dense vector representations of words in a continuous vector space. Skipgram and CBOW are two primary algorithms introduced by Mikolov et al. in their 2013 papers that form the basis of the popular word2vec model. While both are used for generating word embeddings, they use different architectures and techniques.

- Skipgram: Given a word, this model predicts the surrounding context words.
- CBOW (Continuous Bag-of-Words): Given context words, this model predicts the target word.

### Tasks:
1. Data Collection:
Download the text8 dataset, which is a cleaned version of the first 100MB of the English Wikipedia dump. It is available on several NLP data repositories.



In [2]:
import requests

# Download the dataset
url = "http://mattmahoney.net/dc/text8.zip"
response = requests.get(url, allow_redirects = True)

with open('text8.zip', 'wb') as f:
    f.write(response.content)

In [3]:
# Loading and preparing the text8 dataset
with open('text8', 'r') as file:
    data = file.read()
    
print(data[:1000])
print(type(data))
print('The length of the {} is {}' .format('`data`', len(data)))

 anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic instituti

2. Pre-processing:
- Tokenize the dataset.
- Remove stopwords and non-alphabetic tokens.
- Build a vocabulary of the most frequent words (e.g., top 10,000 or 20,000 words).

In [51]:
# import package
import numpy as np
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
import pandas as pd
import re
import numpy as np
from nltk.tokenize import  word_tokenize
import nltk
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# add more packages
from typing import Dict
from typing import Set
from typing import List
import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, accuracy_score
from collections import defaultdict
from nltk.corpus import stopwords
import string
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.metrics import confusion_matrix
import seaborn as sns
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from collections import Counter
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
import random
from math import sqrt
from itertools import chain
from torch.utils.data import Dataset, DataLoader

In [5]:
# Clean data
def clean_corpus(line: list[str]) -> list[str]:
    '''
    preprocess and clean a given line.

    - line: The text line to be cleaned.
    ---
    - list: A list of preprocessed tokens from the line.
    '''
    tokens = word_tokenize(line)
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]  
    return tokens


In [38]:
# Preprocessing the data
lines: list[list[str]] = []

# Predefined list of stop words
stop_words: set = set(stopwords.words('english'))

# Tokenize the text documents and update the lists word_list and lines
lines = clean_corpus(data)
print(len(lines))
lines[:10]

10888361


['anarchism',
 'originated',
 'term',
 'abuse',
 'first',
 'used',
 'early',
 'working',
 'class',
 'radicals']

In [41]:
# Build a vocabulary of the most frequent words (e.g., top 10,000 or 20,000 words).
word_counts = Counter(lines)
top_freq_words = [word for word, count in word_counts.most_common(10000)]
top_freq_words[:10]


['one',
 'zero',
 'nine',
 'two',
 'eight',
 'five',
 'three',
 'four',
 'six',
 'seven',
 'also',
 'first',
 'many',
 'new',
 'used',
 'american',
 'time',
 'see',
 'may',
 'world']

### Subsampling data

In [68]:
# Subsampling

# Flatten the nested list
flattened_lines = lines

t = 1e-5 # # Hyperparameters

word_counts = Counter(flattened_lines)
total_count = len(flattened_lines)
frequencies = {word: count/total_count for word, count in word_counts.items()}


def subsample_prob(word):
    prob = (sqrt(frequencies[word] / t) + 1) * (t / frequencies[word])
    return prob

subsamped_lines = [word for word in flattened_lines if random.random() < (1 - subsample_prob(word))]

In [66]:
print(len(subsamped_lines))

65250614


In [69]:
# Check subsample
subsamped_lines[:10]

['term',
 'abuse',
 'first',
 'used',
 'early',
 'working',
 'class',
 'including',
 'revolution',
 'french']

3. Implement CBOW:
- Create the architecture for CBOW with an embedding layer and a linear layer.
- Generate training samples. For each word in the dataset, use n surrounding words as context.
- Train the model using a suitable optimizer and loss function.
- Extract word embeddings for the vocabulary.

CBOW is Continuous Bag Of Words, another vision of Word2Vec
- Reference: https://www.youtube.com/watch?v=ghu_5o42QGQ

In [70]:
# Create the architecture for CBOW with an embedding layer and a linear layer.
class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_words):
        embedded_words = self.embeddings(context_words)
        # We average the embeddings across the context words (dim=0 since context words are in the first dimension)
        avg_embedded = embedded_words.mean(dim = 0)
        logits = self.linear(avg_embedded)
        return logits

In [71]:
# Generate training samples. For each word in the dataset, use n surrounding words as context.
def generate_context_pairs(corpus, window_size, vocab_set):
    data = []

    for i, word in enumerate(corpus):
        if word not in vocab_set:
            continue

        # Initialize an empty context list for the current word
        context = []

        # Define the start and end indices for the context words
        start_index = max(0, i - window_size)
        end_index = min(len(corpus), i + window_size + 1)

        # Loop over the surrounding words within the window
        for j in range(start_index, end_index):
            # Exclude the current word itself
            if j != i and corpus[j] in vocab_set:
                context.append(corpus[j])

        target = word
        data.append((context, target))

    return data



In [72]:
# check function of genreate
generate_context_pairs(subsamped_lines, 2, top_freq_words)

[(['abuse', 'first'], 'term'),
 (['term', 'first', 'used'], 'abuse'),
 (['term', 'abuse', 'used', 'early'], 'first'),
 (['abuse', 'first', 'early', 'working'], 'used'),
 (['first', 'used', 'working', 'class'], 'early'),
 (['used', 'early', 'class', 'including'], 'working'),
 (['early', 'working', 'including', 'revolution'], 'class'),
 (['working', 'class', 'revolution', 'french'], 'including'),
 (['class', 'including', 'french', 'whilst'], 'revolution'),
 (['including', 'revolution', 'whilst', 'term'], 'french'),
 (['revolution', 'french', 'term', 'still'], 'whilst'),
 (['french', 'whilst', 'still', 'used'], 'term'),
 (['whilst', 'term', 'used', 'way'], 'still'),
 (['term', 'still', 'way', 'describe'], 'used'),
 (['still', 'used', 'describe', 'act'], 'way'),
 (['used', 'way', 'act', 'used'], 'describe'),
 (['way', 'describe', 'used', 'means'], 'act'),
 (['describe', 'act', 'means', 'organization'], 'used'),
 (['act', 'used', 'organization', 'taken'], 'means'),
 (['used', 'means', 'take

In [58]:
# check GPU is work
# Train the model using a suitable optimizer and loss function.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
print(torch.cuda.is_available())

cuda
True


In [60]:
# Hyperparameters
embedding_dim = 100
learning_rate = 0.01
epochs = 10
window_size = 2

# Preparing data
vocab = set(top_freq_words)
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for word, i in word2idx.items()}

training_data = generate_context_pairs(subsamped_lines, window_size, vocab)

# Model, Loss, Optimizer
model = CBOW(len(vocab), embedding_dim).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr = learning_rate)

# List to store losses
epoch_losses = []

# Training loop
for epoch in range(epochs):
    total_loss = 0
    for context, target in tqdm(training_data, desc=f"Epoch {epoch+1}"):
        context_tensor = torch.tensor([word2idx[word] 
                                       for word in context], 
                                      dtype = torch.long).to(device)
        
        target_tensor = torch.tensor(word2idx[target], 
                                     dtype = torch.long).to(device)
        
        optimizer.zero_grad()
        outputs = model(context_tensor)
        loss = loss_fn(outputs, target_tensor)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    epoch_losses.append(total_loss)
    print(f"Epoch {epoch+1}, Loss: {total_loss}")


Epoch 1:   0%|          | 0/590174 [00:00<?, ?it/s]

Epoch 1:   9%|▉         | 54049/590174 [01:14<12:16, 728.06it/s]


KeyboardInterrupt: 

4. Implement Skipgram:
- Create the architecture for Skipgram, which is essentially the inverse of CBOW.
- Generate training samples. For each word in the dataset, create pairs with n surrounding words.
- Train the model using a suitable optimizer and loss function.
- Extract word embeddings for the vocabulary.



5. Evaluation:
- Implement a simple cosine similarity function to measure similarity between word pairs.
- Test the similarity of a few pairs of words (e.g., king & queen, man & woman, Paris & France).
- Visualize embeddings of some selected words using t-SNE or PCA.



6. Report:
- Provide a brief introduction to word embeddings, Skipgram, and CBOW.
- Discuss the architecture of the models.
- Describe the dataset and pre-processing steps.
- Present results from the evaluation step.
- Discuss challenges faced during implementation and potential improvements.
- Conclude with insights and potential applications of the implemented models.