# DL4NLP SS17 Home Exercise 04
----------------------------------
**Due by Tuesday, 16.05., at 13:00**

## Task 1 Mandatory Paper (2P)

What is the main difference between CBOW and Skip-Gram in comparison to previous models? What is the benefit of this difference?

#### Solution
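The main difference is the absence of a non-linear hidden layer: both models use only a linear projection followed by the output layer. This greatly reduces the computational cost per training example, so the models train faster and can be trained on much more data.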
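To make the "no non-linear hidden layer" point concrete, here is a minimal sketch of a CBOW-style forward pass in plain Python. The function name `cbow_forward`, the toy sizes, and the random weights are assumptions made for illustration only. The sketch also uses a full softmax for readability, whereas the original models rely on cheaper output layers such as hierarchical softmax.

```python
import math
import random


def cbow_forward(context_ids, input_emb, output_emb):
    """Return a probability distribution over the vocabulary for the centre word."""
    dim = len(input_emb[0])
    # Projection layer: average the context word vectors.
    # This step is purely linear; no activation function is applied.
    hidden = [
        sum(input_emb[i][d] for i in context_ids) / len(context_ids)
        for d in range(dim)
    ]
    # Output layer: one score per vocabulary word, normalised with a softmax.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_emb]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
vocab_size, dim = 6, 4  # toy sizes, chosen only for illustration
input_emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
output_emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]

# Predict the centre word from four context word ids.
probs = cbow_forward([0, 1, 3, 4], input_emb, output_emb)
print(probs)
```

Skip-Gram reverses the direction (the centre word predicts each context word), but it shares the same structure: an embedding lookup followed directly by the output layer.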

## Task 2 Embeddings (8P)

### Task 2.0 Setup (1P)
For this home exercise, you will need:
* [the gensim Python library](https://radimrehurek.com/gensim)
* [the scipy Python library](https://www.scipy.org/scipylib/index.html)
* [the binary pretrained 300-dimensional word2vec embeddings from Google](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing), extracted
* [the WS-353 dataset](http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/), a classic word similarity dataset

### Task 2.1 Word Similarity (3P)
In this task, you will evaluate how well the pretrained word2vec embeddings perform in the word similarity task on the WS-353 dataset. Gensim provides this functionality via the `evaluate_word_pairs` method, but we will take the manual route in this task.

a) In the `combined.tab` file from the WS-353 dataset, each row contains two words and a mean similarity score assigned by humans. The three values are separated by tab characters (`\t`). Write a Python function which reads the dataset into an appropriate format.

b) Load the pretrained binary word2vec embeddings with gensim, then compute the pairwise word similarity between each word pair using gensim's `similarity` method.

Hint: https://radimrehurek.com/gensim/models/keyedvectors.html

c) [Spearman's rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient) is a typical choice for measuring how well the rankings of two variables agree. Compute the coefficient between the values assigned by humans and your results from b) using scipy. Explain in two sentences what your resulting coefficient means.

```python
from gensim.models.keyedvectors import KeyedVectors
from scipy.stats import spearmanr


# a)
def read_ws353(path):
    """Read WS-353 into a list of word pairs and a parallel list of human scores."""
    word_pairs = []
    gold_labels = []
    with open(path, 'r') as file:
        next(file)  # skip the header line
        for line in file:
            w1, w2, human = line.rstrip("\n").split("\t")
            word_pairs.append((w1, w2))
            gold_labels.append(float(human))
    return word_pairs, gold_labels


word_pairs, gold_labels = read_ws353("/var/data/wordsim353/combined.tab")

# b)
embeddings_path = "/var/data/word2vec/GoogleNews-vectors-negative300.bin"
print("Started reading embeddings...")
word_vectors = KeyedVectors.load_word2vec_format(embeddings_path, binary=True)
print("Embeddings read.")
prediction = [word_vectors.similarity(w1, w2) for w1, w2 in word_pairs]

# c)
print(spearmanr(gold_labels, prediction))

# console output:
# SpearmanrResult(correlation=0.70001664862721935, pvalue=2.8686666605142199e-53)

# explanation:
#   The coefficient is positive, so word pairs that humans rate as more similar also tend to
#   receive higher w2v embedding similarities. Since 0.7 is relatively close to 1.0 (the
#   maximum possible coefficient), the performance can be considered quite good.
```
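
To illustrate what the coefficient from c) actually measures, the following sketch computes Spearman's coefficient by hand: both variables are replaced by their ranks, and the Pearson correlation of those ranks is taken. The helper names (`average_ranks`, `pearson`, `spearman`) and the toy scores are made up for illustration and are not part of scipy or the dataset.

```python
def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next value equals the current one (a tie).
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's coefficient: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


human = [9.0, 7.5, 7.5, 3.0, 1.0]        # made-up human scores
model = [0.80, 0.55, 0.60, 0.35, 0.10]   # made-up cosine similarities
rho = spearman(human, model)
print(rho)

# Only the ordering matters: a monotonic rescaling leaves the coefficient unchanged.
rho_rescaled = spearman(human, [m ** 3 for m in model])
print(rho_rescaled)
```

This is why the different scales of the two variables (human scores from 0 to 10, cosine similarities from -1 to 1) do not matter for the evaluation.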

### Task 2.2 Gamified Word Intrusion (4P)
"Word intrusion" is a task used by [Faruqui et al. 2015](https://arxiv.org/pdf/1506.02004.pdf) for intrinsic evaluation of word embeddings. The idea is: "In one instance of the experiment, a human judge is presented with five words in random order and asked to select the intruder." The intruder is a word unrelated to the four other words.

In this task you will not evaluate embeddings via word intrusion. Instead, you will use embeddings to write a "game" in which a human has to identify the intruder in a set of five words.

Write a Python script which repeats the following steps:
1. Find four words which are similar to each other according to the pretrained word2vec embeddings, and one intruder word.
2. Print the five words in random order, then query a human (i.e. yourself) to spot the intruder. Python's built-in `input` function might be of use here.
3. Print the accuracy currently reached by the human.

```python
import random

# Note: `word_vectors.vocab` is the gensim API at the time of this exercise;
# gensim >= 4.0 replaces it with `word_vectors.index_to_key`.
vocab_list = list(word_vectors.vocab)
correct = 0
tries = 0
while True:
    # pick two random words: a "normal" seed and the intruder, then find
    # three more words that are similar to the seed and dissimilar to the intruder
    seed_positive, seed_negative = random.sample(vocab_list, 2)
    more_positive = word_vectors.most_similar_cosmul(positive=[seed_positive], negative=[seed_negative], topn=3)

    # create a list of the four related words and insert the intruder at a random position
    choices = [word for word, _score in more_positive]
    choices.append(seed_positive)
    # randint includes both bounds, so the intruder may also end up in the last position
    intruder_pos = random.randint(0, len(choices))
    choices.insert(intruder_pos, seed_negative)

    # query the user for the 0-based index of the intruder
    print("{}\nWhere's the intruder? (index 0-{})".format(" ".join(choices), len(choices) - 1))
    if int(input()) == intruder_pos:
        correct += 1
    tries += 1
    print("Current accuracy: {}\n".format(correct / tries))
```
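
The round-building step can also be separated from the interactive loop, which makes it possible to try out the logic without loading the large embedding file. The sketch below is a simplified variant, not the solution above: it uses a hypothetical `build_round` helper, made-up three-dimensional toy vectors, and picks the word least similar to the seed as the intruder instead of drawing it at random.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_round(seed, vectors, rng, group_size=4):
    """Return (choices, intruder_pos) for one round of the game."""
    # Sort all other words by similarity to the seed, most similar first.
    others = sorted(
        (w for w in vectors if w != seed),
        key=lambda w: cosine(vectors[seed], vectors[w]),
        reverse=True,
    )
    choices = [seed] + others[:group_size - 1]
    intruder = others[-1]  # simplification: the least similar word is the intruder
    rng.shuffle(choices)
    # randrange excludes the upper bound, so len + 1 allows the last position too.
    intruder_pos = rng.randrange(len(choices) + 1)
    choices.insert(intruder_pos, intruder)
    return choices, intruder_pos


# Toy vectors, invented purely for illustration.
toy_vectors = {
    "cat": [0.90, 0.10, 0.00],
    "dog": [0.80, 0.20, 0.10],
    "horse": [0.70, 0.30, 0.00],
    "cow": [0.75, 0.25, 0.05],
    "sheep": [0.60, 0.40, 0.10],
    "carburetor": [0.00, 0.10, 0.90],
}

choices, intruder_pos = build_round("cat", toy_vectors, random.Random(0))
print(choices, intruder_pos)
```

With real embeddings, the sorting step would be replaced by a nearest-neighbour query on the loaded vectors, and a randomly drawn intruder keeps the game from becoming too easy.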