# HW08: Word Embeddings

Remember that these homework assignments are graded on completion. **In this homework, we present two tasks, and you can choose which one you want to solve. You only have to solve <span style="color:red">one task</span> in this homework.**
Task 1 is more guided: we evaluate document embeddings on a standard benchmark. Task 2 is very open-ended and might be a starting point for your course project.

**Task 1**
In this task, we evaluate different document embeddings on the English version of the [STS Benchmark](https://arxiv.org/pdf/1708.00055.pdf). The task is to determine how semantically similar two texts are. It is a popular dataset for evaluating document embeddings, i.e., we want the embeddings of two semantically similar documents to be similar as well. We provide a word-count baseline for this task and ask you to compute and evaluate embeddings for a selection of document embedding techniques.

To evaluate, we follow [(Reimers and Gurevych, 2019)](https://arxiv.org/pdf/1908.10084.pdf) and compute Spearman's rank correlation between the cosine similarity of the sentence embeddings and the gold labels.
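
As a quick illustration of this metric, here is a minimal sketch on made-up data. The three vector pairs and the gold scores are toy values chosen for the example, not taken from the STS dataset. It shows that Spearman's correlation only looks at the ranking: if the cosine similarities order the pairs the same way the gold labels do, the score is perfect even though the two scales differ.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "embeddings" for three sentence pairs (hypothetical values).
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.1])),  # nearly parallel
    (np.array([1.0, 0.0]), np.array([1.0, 1.0])),  # 45 degrees apart
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),  # orthogonal
]
# Toy gold similarity scores on the 0-5 STS scale (hypothetical values).
gold = [4.8, 2.5, 0.2]

sims = [cosine(a, b) for a, b in pairs]

# spearmanr returns (correlation, p-value); only the correlation is needed.
rho, _ = spearmanr(sims, gold)
print(sims, rho)
```

Because only ranks matter, the embeddings do not need to reproduce the 0 to 5 gold scale; they only need to order the pairs correctly.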

In [1]:
# obtain the data
!wget http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.eval.v1.1.zip
!wget http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.gs.zip

!unzip sts2017.eval.v1.1.zip 
!unzip sts2017.gs.zip 

URL transformed to HTTPS due to an HSTS policy
--2021-06-08 08:28:28--  https://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.eval.v1.1.zip
Resolving alt.qcri.org (alt.qcri.org)... 80.76.166.234
Connecting to alt.qcri.org (alt.qcri.org)|80.76.166.234|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 87902 (86K) [application/zip]
Saving to: ‘sts2017.eval.v1.1.zip’


2021-06-08 08:28:29 (253 KB/s) - ‘sts2017.eval.v1.1.zip’ saved [87902/87902]

URL transformed to HTTPS due to an HSTS policy
--2021-06-08 08:28:30--  https://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.gs.zip
Resolving alt.qcri.org (alt.qcri.org)... 80.76.166.234
Connecting to alt.qcri.org (alt.qcri.org)|80.76.166.234|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3138 (3.1K) [application/zip]
Saving to: ‘sts2017.gs.zip’


2021-06-08 08:28:30 (150 MB/s) - ‘sts2017.gs.zip’ saved [3138/3138]

Archive:  sts2017.eval.v1.1.zip
  inflating: STS

In [13]:
# load the data

def load_STS_data():
    with open("STS2017.gs/STS.gs.track5.en-en.txt") as f:
        labels = [float(line.strip()) for line in f]
    
    text_a, text_b = [], []
    with open("STS2017.eval.v1.1/STS.input.track5.en-en.txt") as f:
        for line in f:
            line = line.strip().split("\t")
            text_a.append(line[0])
            text_b.append(line[1])
    return text_a, text_b, labels

text_a, text_b, labels = load_STS_data()
text_a[0], text_b[0], labels[0]

('A person is on a baseball team.',
 'A person is playing basketball on a team.',
 2.4)

In [14]:
# some utils
from scipy.stats import spearmanr
def evaluate(predictions, labels):
    # print Spearman's rank correlation between predictions and gold labels
    print(spearmanr(predictions, labels)[0])

import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a,b):
    return dot(a, b)/(norm(a)*norm(b))


In [15]:
# Wordcounts baseline
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
vec.fit(text_a + text_b)

# encode documents
text_a_encoded = np.array(vec.transform(text_a).todense())
text_b_encoded = np.array(vec.transform(text_b).todense())

# predict cosine similarities
predictions = [cosine_similarity(a,b) for a,b in zip(text_a_encoded, text_b_encoded)]

# evaluate
evaluate(predictions, labels)

0.6998056665685976


In [16]:
##TODO train Doc2Vec on the texts in the dataset, compute cosine similarity of the resulting embeddings and evaluate

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk import word_tokenize
docs = []
for i in text_a + text_b:
    docs.append(word_tokenize(i))

doc_iterator = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs)]
d2v = Doc2Vec(doc_iterator,
                min_count=2, # minimum word count
                window=5,    # window size
                vector_size=25, # size of document vector
                sample=1e-4, 
                negative=5, 
                workers=4, # threads
                #dbow_words = 1 # uncomment to get word vectors too
                max_vocab_size=1000) # max vocab size



In [17]:
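# the first len(text_a) document vectors belong to text_a, the remaining ones to text_b
# note: this is the gensim < 4.0 API; in gensim >= 4.0 the vectors live in d2v.dv.vectors
text_a_encoded = d2v.docvecs.vectors_docs[:len(text_a)]
text_b_encoded = d2v.docvecs.vectors_docs[len(text_a):]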

# predict cosine similarities
predictions = [cosine_similarity(a,b) for a,b in zip(text_a_encoded, text_b_encoded)]

# evaluate
evaluate(predictions, labels)

0.05212032836011226


In [18]:
##TODO do the same with embeddings provided by spaCy

import spacy
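# note: the small model ships without static word vectors, so .vector is an average of
# context-sensitive tensors; en_core_web_md / en_core_web_lg include real word vectors
nlp = spacy.load('en_core_web_sm')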

text_a_encoded = [nlp(text).vector for text in text_a]
text_b_encoded = [nlp(text).vector for text in text_b]

In [19]:
# predict cosine similarities
predictions = [cosine_similarity(a,b) for a,b in zip(text_a_encoded, text_b_encoded)]

# evaluate
evaluate(predictions, labels)

0.5210910398082091


In [20]:
##TODO do the same with universal sentence embeddings

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

import tensorflow_hub as hub
import numpy as np

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def embed(texts):
    # returns a tensor with one 512-dimensional embedding per input text
    return model(texts)

embeddings_a = embed(text_a)
embeddings_b = embed(text_b)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    text_a_encoded = session.run(embeddings_a)
    text_b_encoded = session.run(embeddings_b)


module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [21]:
# predict cosine similarities
predictions = [cosine_similarity(a,b) for a,b in zip(text_a_encoded, text_b_encoded)]

# evaluate
evaluate(predictions, labels)

0.8493103413219787


In [22]:
# sanity check: inspect the shape of the first pair of embeddings only
for a,b in zip(text_a_encoded, text_b_encoded):
    print(a.shape)
    print(b.shape)
    break

(512,)
(512,)


In [23]:
print(text_a_encoded.shape)

(250, 512)


In [11]:
##TODO do the same with SBERT embeddings
from sentence_transformers import SentenceTransformer
model = "bert-base-nli-mean-tokens"
embedder = SentenceTransformer(model)
text_a_encoded = embedder.encode(text_a)
text_b_encoded = embedder.encode(text_b)


In [12]:
# predict cosine similarities
predictions = [cosine_similarity(a,b) for a,b in zip(text_a_encoded, text_b_encoded)]

# evaluate
evaluate(predictions, labels)


0.8008164100246977


**Task 2**
Use your favorite document embedding method to compute embeddings for a dataset you are interested in. Then choose a way to explore the embeddings and provide some visualizations or statistics (one option is the path we have chosen in the notebook, i.e., cluster the embeddings with k-means and visualize low-dimensional representations of the document embeddings obtained by PCA).

This task is very open and there is no right or wrong. If you want to use document embeddings in your course project, this is a chance to play around with them.
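
The k-means and PCA path mentioned above could be sketched as follows. This is only a minimal starting point under stated assumptions: the six toy documents are made up, TF-IDF vectors stand in for whichever embedding method you actually choose, and the number of clusters (2) is an arbitrary choice for the example.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); replace with the dataset you are interested in.
docs = [
    "The striker scored a late goal in the football match.",
    "The goalkeeper saved a penalty in the football final.",
    "The team won the match after extra time.",
    "The central bank raised interest rates again.",
    "Inflation and interest rates worry the bank.",
    "The stock market fell after the rate decision.",
]

# Stand-in embeddings: TF-IDF vectors. Any (n_docs, dim) array from Doc2Vec,
# spaCy, USE, or SBERT can be used here instead.
embeddings = TfidfVectorizer().fit_transform(docs).toarray()

# Cluster the embeddings with k-means (k=2 is an assumption for this toy data).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)

# Project the embeddings to 2 dimensions with PCA for visualization.
coords = PCA(n_components=2, random_state=0).fit_transform(embeddings)

for doc, cid, (x, y) in zip(docs, cluster_ids, coords):
    print(f"cluster {cid}  ({x:+.2f}, {y:+.2f})  {doc}")
```

To turn this into a plot, the two columns of `coords` can be passed to a scatter plot and colored by `cluster_ids`.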

