# HW06: Word Embeddings

Remember that these homework work as a completion grade. **You can <span style="color:red">not</span> skip one section this homework.**

**Essay Feedback**

Please provide feedback to two classmates' essays on Eduflow.

**Training word2vec**

In this section, we train a word2vec model using gensim. We train the model on text8 (which consists of the first 90M characters of a Wikipedia dump from 2006 and is considered one of the benchmarks for evaluating language models).

In [44]:
import gensim.downloader as api
import pandas as pd

api.info("text8")

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [45]:
dataset = api.load("text8")

In [46]:
from gensim.models import Word2Vec

##TODO train a word2vec model on this dataset, only consider words which appear at least 10 times in the corpus

model = Word2Vec(dataset,  # list of tokenized sentences
                 min_count=10,  # Minimum word count
                 )

**Word Similarities**

gensim models provide almost all the utility you might want to wish for to perform standard word similarity tasks. They are available in the .wv (wordvectors) attribute of the model, more details could be found [here](https://radimrehurek.com/gensim/models/keyedvectors.html).

In [47]:
from gensim.models import KeyedVectors

word_vectors = model.wv
word_vectors['arafat']
word_vectors.save('vectors.kv')
word_vectors = KeyedVectors.load('vectors.kv')

In [48]:
##TODO find the closest words to king
print(word_vectors.most_similar('king'))

[('prince', 0.7635866403579712), ('queen', 0.7095021605491638), ('kings', 0.7082085609436035), ('emperor', 0.7068253755569458), ('regent', 0.6904042363166809), ('vii', 0.6830312013626099), ('throne', 0.6775007247924805), ('sultan', 0.6710481643676758), ('aragon', 0.6708518266677856), ('viii', 0.6621804237365723)]


King is to man as woman is to X

In [49]:
##TODO find the closest word for the vector "woman" + "king" - "man"
result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
most_similar_key, similarity = result[0]  # look at the first match
print(f"{most_similar_key}: {similarity:.4f}")

queen: 0.6997


**Evaluate Word Similarities** 

One common way to evaluate word2vec models are word analogy tasks. Let's check how good our model is on one of those. We consider the [WordSim353](http://alfonseca.org/eng/research/wordsim353.html) benchmark, the task is to determine how similar two words are.



In [50]:
# !wget http://alfonseca.org/pubs/ws353simrel.tar.gz
# !tar xf ws353simrel.tar.gz

path = "wordsim353_sim_rel/wordsim_similarity_goldstandard.txt"


def df_maker(f):
    for line in f:
        line = line.strip().split("\t")
        yield {'A': line[0], 'B': line[1], 'y': line[-1]}


def load_data(path):
    with open(path) as f:
        df = pd.DataFrame(data=df_maker(f))
    return df


df = load_data(path)
df.head()

Unnamed: 0,A,B,y
0,tiger,cat,7.35
1,tiger,tiger,10.0
2,plane,car,5.77
3,train,car,6.31
4,television,radio,6.77


In [51]:
##TODO compute how similar the pairs in the WordSim353 are according to our model
##TODO if  aword is not present in our model, we assign similarity 0 for the respective text pair
df['similarity'] = df.apply(lambda x: word_vectors.similarity(x.A, x.B) if (x.A in word_vectors) and (x.B in word_vectors) else 0, axis=1)
df.head()


Unnamed: 0,A,B,y,similarity
0,tiger,cat,7.35,0.620412
1,tiger,tiger,10.0,1.0
2,plane,car,5.77,0.435886
3,train,car,6.31,0.535424
4,television,radio,6.77,0.736855


In [52]:
from scipy.stats import spearmanr

##TODO compute spearman's rank correlation between our prediction and the human annotations
spearmanr(df.y, df.similarity).correlation

0.6083728930917283

In [53]:
import spacy

en = spacy.load('en_core_web_sm')
en('apple').similarity(en('orange'))


##TODO compute word similarities in the WordSim353 dataset using spaCy word embeddings
##TODO compute spearman's rank correlation between these similarities and the human annotations
# Don't worry if results are not too convincing for this experiment

  """


0.5420636159533704

In [54]:
df['spacy_sim'] = df.apply(lambda x: en(x.A).similarity(en(x.B)), axis=1)
df.head()

  """Entry point for launching an IPython kernel.


Unnamed: 0,A,B,y,similarity,spacy_sim
0,tiger,cat,7.35,0.620412,0.624192
1,tiger,tiger,10.0,1.0,1.0
2,plane,car,5.77,0.435886,0.608974
3,train,car,6.31,0.535424,0.508821
4,television,radio,6.77,0.736855,0.600423


In [55]:
spearmanr(df.y, df.spacy_sim).correlation

0.08738489440789854