# HW05: Word Embeddings

Remember that these homework assignments are graded on completion. **You can <span style="color:red">not</span> skip a section in this homework.**

**Essay Feedback**

Please provide feedback on two classmates' essays on Eduflow.

**Training word2vec**

In this section, we train a word2vec model using gensim. We train the model on text8, which consists of the first 100M bytes of cleaned text from a 2006 Wikipedia dump and is a common benchmark for evaluating language models.

In [None]:
import gensim.downloader as api

api.info("text8")

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [None]:
dataset = api.load("text8")



In [None]:
##TODO train a word2vec model on this dataset, keeping only words that appear at least 10 times in the corpus

# train the model
from gensim.models import Word2Vec
model = Word2Vec(dataset,
                 workers=8,        # number of worker threads
                 vector_size=300,  # word vector dimensionality
                 min_count=10,     # ignore words that occur fewer than 10 times
                 window=5,         # context window size
                 sample=1e-3,      # downsampling threshold for frequent words
                 )

In [26]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [27]:
# Save model
model_save_name = 'w2v-vectors.pkl'
path = F"/content/gdrive/My Drive/Colab Notebooks/Homework/{model_save_name}" 
model.save(path)

In [None]:
# Load model
from gensim.models import Word2Vec
model_save_name = 'w2v-vectors.pkl'
path = F"/content/gdrive/My Drive/Colab Notebooks/Homework/{model_save_name}" 
model = Word2Vec.load(path)

**Word Similarities**

gensim models provide almost every utility you might want for standard word similarity tasks. They are available through the `.wv` (word vectors) attribute of the model; more details can be found [here](https://radimrehurek.com/gensim/models/keyedvectors.html).

In [None]:
# model.wv
word_vectors  = model.wv

##TODO find the closest words to king
result = word_vectors.most_similar('king')
result

[('kings', 0.6537180542945862),
 ('prince', 0.6409084796905518),
 ('queen', 0.6376361846923828),
 ('throne', 0.6341038942337036),
 ('sultan', 0.6066715717315674),
 ('aragon', 0.6039237380027771),
 ('darius', 0.5958287715911865),
 ('emperor', 0.594192624092102),
 ('duke', 0.5894923210144043),
 ('vii', 0.5867447853088379)]

Man is to king as woman is to X

In [None]:
##TODO find the closest word for the vector "woman" + "king" - "man"
result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
result[0]

('queen', 0.6237307190895081)
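
To make the vector arithmetic behind this query more concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors are made up for illustration and do not come from the trained model, and the helper names (`unit`, `cosine`, `analogy`) are hypothetical. The sketch adds the normalized "positive" vectors, subtracts the normalized "negative" ones, and returns the remaining word with the highest cosine similarity to the result, which is roughly what `most_similar` does.

```python
import math

# Toy 3-dimensional vectors, invented for illustration (not from the trained model).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(positive, negative, vectors):
    """Return the (word, score) closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # the query words themselves are excluded from the candidates
    exclude = set(positive) | set(negative)
    candidates = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return max(candidates, key=lambda item: item[1])

best_word, best_score = analogy(["woman", "king"], ["man"], toy_vectors)
print(best_word, round(best_score, 3))
```

Excluding the query words matters: without it, the nearest neighbour of "woman" + "king" - "man" is often "king" itself.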

**Evaluate Word Similarities** 

One common way to evaluate word2vec models is with word similarity tasks. Let's check how good our model is on one of those. We consider the [WordSim353](http://alfonseca.org/eng/research/wordsim353.html) benchmark, where the task is to determine how similar two words are.

In [None]:
!wget http://alfonseca.org/pubs/ws353simrel.tar.gz
!tar xf ws353simrel.tar.gz

path = "wordsim353_sim_rel/wordsim_similarity_goldstandard.txt"

def load_data(path):
    X, y = [], []
    with open(path) as f:
        for line in f:
            line = line.strip().split("\t")
            X.append((line[0], line[1]))  # each entry in X contains two words, e.g. X[0] = (tiger, cat)
            y.append(float(line[-1]))  # each entry in y is the human similarity rating for the pair, e.g. y[0] = 7.35
    return X, y

X, y = load_data(path)
print(X[:3], y[:3])

--2023-03-30 09:57:03--  http://alfonseca.org/pubs/ws353simrel.tar.gz
Resolving alfonseca.org (alfonseca.org)... 162.215.249.67
Connecting to alfonseca.org (alfonseca.org)|162.215.249.67|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5460 (5.3K) [application/x-gzip]
Saving to: ‘ws353simrel.tar.gz’


2023-03-30 09:57:03 (550 MB/s) - ‘ws353simrel.tar.gz’ saved [5460/5460]

[('tiger', 'cat'), ('tiger', 'tiger'), ('plane', 'car')] [7.35, 10.0, 5.77]


In [None]:
##TODO compute how similar the pairs in the WordSim353 are according to our model
# if a word is not present in our model, we assign similarity 0 for the respective text pair

similarity = []

for pair in X:
  # extract pair
  first_word, second_word = pair

  # check if both words are present in our model's vocabulary
  if first_word in model.wv.key_to_index and second_word in model.wv.key_to_index:
    # compute similarity
    sim = model.wv.similarity(first_word, second_word)
    similarity.append(sim)
  else:
    # assign 0 if one of the words is not in the vocabulary
    similarity.append(0)

In [None]:
for i in range(10):
  print("pair:", X[i], " ---  similarity from WordSim353:", y[i], "  --- Similarity from word2vec:", similarity[i])

pair: ('tiger', 'cat')  ---  similarity from WordSim353: 7.35   --- Similarity from word2vec: 0.5433863
pair: ('tiger', 'tiger')  ---  similarity from WordSim353: 10.0   --- Similarity from word2vec: 1.0
pair: ('plane', 'car')  ---  similarity from WordSim353: 5.77   --- Similarity from word2vec: 0.45190403
pair: ('train', 'car')  ---  similarity from WordSim353: 6.31   --- Similarity from word2vec: 0.53525
pair: ('television', 'radio')  ---  similarity from WordSim353: 6.77   --- Similarity from word2vec: 0.66451186
pair: ('media', 'radio')  ---  similarity from WordSim353: 7.42   --- Similarity from word2vec: 0.3964206
pair: ('bread', 'butter')  ---  similarity from WordSim353: 6.19   --- Similarity from word2vec: 0.68906647
pair: ('cucumber', 'potato')  ---  similarity from WordSim353: 5.92   --- Similarity from word2vec: 0.6796025
pair: ('doctor', 'nurse')  ---  similarity from WordSim353: 7.0   --- Similarity from word2vec: 0.49071032
pair: ('professor', 'doctor')  ---  similarity

In [None]:
from scipy.stats import spearmanr

##TODO compute Spearman's rank correlation between our predictions and the human annotations
res = spearmanr(similarity, y)
print('Correlation: ', res.statistic)
print('p-value: ',res.pvalue)

Correlation:  0.6481808256998792
p-value:  1.4102608500152953e-25
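
The human ratings run from 0 to 10 while cosine similarities lie between -1 and 1, which is why a rank correlation is used here: only the ordering of the pairs matters. The following plain-Python sketch shows the idea (ranks with ties averaged, then a Pearson correlation on the ranks). The helper names are hypothetical, and the numbers are the first five pairs printed above; in the notebook itself `scipy.stats.spearmanr` does this work.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# First five pairs from the output above.
human = [7.35, 10.0, 5.77, 6.31, 6.77]
word2vec = [0.5433863, 1.0, 0.45190403, 0.53525, 0.66451186]
rho = spearman(human, word2vec)
print(rho)
```

On these five pairs the two rankings differ only in the order of ('tiger', 'cat') and ('television', 'radio'), so the correlation is high (0.9), even though the raw scores live on different scales.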


In [None]:
import spacy
en = spacy.load('en_core_web_sm')

##TODO compute word similarities in the WordSim353 dataset using spaCy word embeddings
def get_similarity(word1, word2):
    # en(...) returns a Doc; for a single word it contains one token
    doc1 = en(word1)
    doc2 = en(word2)
    return doc1.similarity(doc2)

similarity2 = [get_similarity(first_word, second_word) for first_word, second_word in X]


In [None]:
##TODO compute Spearman's rank correlation between these similarities and the human annotations
# Don't worry if results are not too convincing for this experiment
# (en_core_web_sm ships without static word vectors, so its similarity scores are less reliable)
res = spearmanr(similarity2, y)
print('Correlation: ', res.statistic)
print('p-value: ', res.pvalue)

Correlation:  0.09174883124982042
p-value:  0.19295227692420674


**PyTorch Embeddings**

In [None]:
# Import the AG News dataset (same as hw01)
# Download it from here:
# !wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd

path = "/content/gdrive/My Drive/Colab Notebooks/Homework/train.csv"
# the AG News CSV has no header row, so set the column names while reading
df = pd.read_csv(path, header=None, names=["label", "title", "lead"])

label_map = {1: "world", 2: "sport", 3: "business", 4: "sci/tech"}
df["label"] = df["label"].map(label_map)
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000)  # only use 10K datapoints
df.head()

| | label | title | lead | text |
|---|---|---|---|---|
| 50286 | sci/tech | Denmark to Claim North Pole, Hopes to Strike O... | Reuters - Denmark aims to claim the North Pole... | Denmark to Claim North Pole, Hopes to Strike O... |
| 73587 | business | M A Industry Weighs Fees Against Size (Reuters) | Reuters - Investment banks are in the midst of... | M A Industry Weighs Fees Against Size (Reuters... |
| 85601 | world | Pakistan Wins U.S. Praise Over Afghan Vote | ISLAMABAD (Reuters) - A senior U.S. official ... | Pakistan Wins U.S. Praise Over Afghan Vote IS... |
| 28833 | sci/tech | Microsoft exec takes aim at open source | New platform chief, Ashim Pal, says software g... | Microsoft exec takes aim at open source New pl... |
| 103893 | business | Shoppers offer retailers cheer | Retailers generally had a good showing as the ... | Shoppers offer retailers cheer Retailers gener... |


In [39]:
##TODO tokenize the text, only keep the 200 most frequent words
from sklearn.feature_extraction.text import CountVectorizer

## pre-process text (uses the spaCy pipeline `en` loaded above)
def tokenize(x):
    return [w.lemma_.lower() for w in en(x) if not w.is_stop and not w.is_punct and not w.is_digit]

df["tokens"] = df["text"].apply(tokenize)
df["preprocessed"] = df["tokens"].apply(" ".join)

vectorizer = CountVectorizer(min_df=0.01,       # keep words that occur in at least 1% of docs
                             max_df=0.9,        # drop words that occur in more than 90% of docs
                             max_features=200,  # keep the 200 most frequent remaining words
                             stop_words='english',
                             ngram_range=(1, 1))
X = vectorizer.fit_transform(df['preprocessed'])
vocab = vectorizer.get_feature_names_out()

In [40]:
#TODO create a one_hot representation for each word and truncate/pad the sequences so that they all have the same length (here we use 100)

# create one_hot representation for each word
!pip install Keras-Preprocessing
from keras_preprocessing.text import one_hot
from keras_preprocessing.sequence import pad_sequences

# one_hot hashes every word to an index in [1, n-1]; index 0 stays free for padding
X_one_hot = [one_hot(row, n=200) for row in df["preprocessed"]]
print(X_one_hot[0][:50])

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Keras-Preprocessing
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
Installing collected packages: Keras-Preprocessing
Successfully installed Keras-Preprocessing-1.1.2
[3, 152, 40, 27, 91, 1, 198, 154, 154, 3, 11, 152, 40, 27, 192, 57, 198, 132, 172, 117, 138, 141, 75, 40, 101, 60, 178, 124]
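
Note that Keras' `one_hot` is a hashing trick: different words can end up with the same index, and the 200-word vocabulary built by `CountVectorizer` above is not used. If the goal is to keep exactly the most frequent words, one alternative is an explicit word-to-index mapping. The sketch below is plain Python on a tiny made-up corpus; the reserved indices `PAD` and `UNK` and the helper names are assumptions of this example, not part of the assignment.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved indices (assumed here): 0 for padding, 1 for out-of-vocabulary words

def build_vocab(token_lists, max_words):
    """Map the max_words most frequent tokens to indices 2, 3, ..."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    most_common = [word for word, _ in counts.most_common(max_words)]
    return {word: index for index, word in enumerate(most_common, start=2)}

def encode(tokens, word_to_index):
    """Replace every token by its index, or by UNK if it is not in the vocabulary."""
    return [word_to_index.get(token, UNK) for token in tokens]

# Tiny invented corpus standing in for df["tokens"].
docs = [
    ["oil", "price", "rise"],
    ["oil", "market", "fall"],
    ["team", "win", "game"],
    ["oil", "price", "fall"],
]
word_to_index = build_vocab(docs, max_words=3)
encoded = [encode(doc, word_to_index) for doc in docs]
print(word_to_index)
print(encoded)
```

With this scheme an embedding layer would need `max_words + 2` rows, one per vocabulary word plus the two reserved indices.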


In [49]:
# next, we pad (or truncate) such that all the inputs have same length
max_seq_length = 100
X_one_hot_padded = pad_sequences(X_one_hot, padding='post', maxlen=max_seq_length, truncating='post')
X_one_hot_padded.shape

(10000, 100)
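
For reference, `padding='post'` and `truncating='post'` both act on the end of a sequence. A minimal plain-Python sketch of that behaviour for a single sequence (the function name `pad_or_truncate` is hypothetical):

```python
def pad_or_truncate(sequence, max_length, pad_value=0):
    """Cut a sequence at max_length or fill it up with pad_value at the end."""
    truncated = sequence[:max_length]  # 'post' truncation: drop items from the end
    return truncated + [pad_value] * (max_length - len(truncated))  # 'post' padding

short = pad_or_truncate([3, 152, 40], max_length=5)
long = pad_or_truncate([3, 152, 40, 27, 91, 1, 198], max_length=5)
print(short)
print(long)
```

Because 0 is used as the padding value, it should not also be the index of a real word.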

In [None]:
##TODO create your torch embedding like we did in notebook 5! (hint: predicting labels: world, sport, business, and sci/tech)

In [50]:
# encode the labels as integer class ids
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
J = encoder.fit_transform(df['label'].astype(str))
num_label = len(encoder.classes_)

In [55]:
# set up DNN
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

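class EmbeddingNet(nn.Module):
  def __init__(self, num_label, vocab_size=200, embed_dim=2, seq_length=100):
    super().__init__()
    # assign embed_dim features to each word index (index 0 is the padding value)
    self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
    self.flatten = nn.Flatten()
    self.fc1 = nn.Linear(seq_length * embed_dim, 16)  # hidden size 16 is an arbitrary choice
    self.fc2 = nn.Linear(16, num_label)

  def forward(self, x):
    x = self.embedding(x)    # (batch, seq_length) -> (batch, seq_length, embed_dim)
    x = self.flatten(x)      # -> (batch, seq_length * embed_dim)
    x = F.relu(self.fc1(x))
    return self.fc2(x)       # raw class scores (logits), as expected by nn.CrossEntropyLoss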

# I wasn't able to figure out what to do from here onwards and stopped after 1.5h.
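
One possible way to continue from here is sketched below. It is not the reference solution from notebook 5. It assumes the padded word indices are the model input and the four integer class ids are the target, and it uses random stand-in data so that it runs on its own. The class name `TextEmbeddingNet`, the embedding size, the learning rate and the number of epochs are all arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data with the same layout as X_one_hot_padded and J (illustration only).
num_docs, seq_length, vocab_size, num_label = 256, 100, 200, 4
inputs = torch.randint(0, vocab_size, (num_docs, seq_length))  # word indices, dtype int64
labels = torch.randint(0, num_label, (num_docs,))              # class ids 0..3

class TextEmbeddingNet(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, embed_dim, seq_length, num_label):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.fc = nn.Linear(seq_length * embed_dim, num_label)

    def forward(self, x):
        embedded = self.embedding(x)         # (batch, seq_length, embed_dim)
        return self.fc(embedded.flatten(1))  # (batch, num_label) logits

net = TextEmbeddingNet(vocab_size, embed_dim=8, seq_length=seq_length, num_label=num_label)
loader = DataLoader(TensorDataset(inputs, labels), batch_size=32, shuffle=True)
criterion = nn.CrossEntropyLoss()  # expects raw logits and integer class ids
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)

epoch_losses = []
for epoch in range(3):
    total = 0.0
    for batch_inputs, batch_labels in loader:
        optimizer.zero_grad()
        loss = criterion(net(batch_inputs), batch_labels)
        loss.backward()
        optimizer.step()
        total += loss.item() * batch_inputs.size(0)
    epoch_losses.append(total / num_docs)  # mean training loss per document
    print(f"epoch {epoch}: loss {epoch_losses[-1]:.4f}")

# After training, each row of the embedding matrix is the learned vector of one word index.
word_embeddings = net.embedding.weight.detach()
print(word_embeddings.shape)
```

With the real data, the stand-in tensors could be replaced by `torch.tensor(X_one_hot_padded, dtype=torch.long)` and `torch.tensor(J, dtype=torch.long)`. A held-out split would be needed to judge how well the classifier generalizes.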