# Similarity tests on TOEFL Similarity Questionnaire

Resource: https://nlp.stanford.edu/projects/glove/

Other results on the TOEFL Similarity Test: https://aclweb.org/aclwiki/TOEFL_Synonym_Questions_(State_of_the_art) (if the link doesn't work, make sure the closing parenthesis is included in the URL!)

The task for today is to pass the TOEFL similarity test with the aid of computational support. The tools we need:

1. word vectors (taken from the GloVe resource)
2. TOEFL similarity Questionnaire + answers for comparison
3. cosine similarity for predicting the correct answer


### Cosine Similarity

We want to define the cosine similarity at an early stage of our code, since it is one of the main steps we need to take. Formula reminder:

$$\text{cos-sim}(v, u) = \frac{v \cdot u}{\|v\| \cdot \|u\|}$$

The building blocks you need to code the formula are given below.

In [None]:
import numpy as np
from numpy.linalg import norm

# Instantiate the vectors
v = np.array([1,2,3])
u = np.array([1,2,3])

# Numpy contains a function for the length of a vector
#norm(vector)

# Insert the cosine_similarity function here; it takes two vectors v and u and computes the similarity between them.
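
One possible way to fill in the function is sketched below. It is a minimal version, not the only valid solution: it divides the dot product of the two vectors by the product of their lengths, and it assumes that neither vector is the zero vector (otherwise the division is undefined).

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(v, u):
    """Return the cosine of the angle between the vectors v and u.

    Assumption: neither v nor u is the zero vector.
    """
    return np.dot(v, u) / (norm(v) * norm(u))

v = np.array([1, 2, 3])
u = np.array([1, 2, 3])

# Identical vectors point in the same direction, so the result is (close to) 1.
print(cosine_similarity(v, u))

# Orthogonal vectors have a dot product of 0, so the similarity is 0.
print(cosine_similarity(np.array([1, 0]), np.array([0, 1])))
```

Because of floating-point rounding, the first value may differ from 1 in the last decimal places.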


## Working with files

Before you can work with a file, you need to define the path to it. The "with open()" statement lets you access the file and perform tasks on it, and it closes the file automatically when the block ends. Please open a dummy file of your own and print its lines.

In [None]:
path_to_file = "C:/path/to/file.txt"

with open(path_to_file, 'r', encoding='utf-8') as f:
    pass # Insert your code here
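
A sketch of what the exercise could look like is given below. To keep it self-contained, it first writes a small dummy file into a temporary directory (the file name and its contents are made up for illustration) and then reads it back line by line.

```python
import os
import tempfile

# Create a small dummy file so that the example runs anywhere.
# The file name and contents are placeholders.
tmp_dir = tempfile.mkdtemp()
path_to_file = os.path.join(tmp_dir, "dummy.txt")

with open(path_to_file, 'w', encoding='utf-8') as f:
    f.write("first line\nsecond line\nthird line\n")

# Iterating over the file object yields one line at a time,
# so the whole file never has to be held in memory.
lines = []
with open(path_to_file, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.rstrip('\n')  # drop the trailing newline
        lines.append(line)
        print(line)
```

Reading line by line matters for the embeddings file used later, which is far larger than a dummy file.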

The code below defines two functions: cosine_sim() and compare_similarity(). Its purpose is to look up the vectors of two words and return their cosine similarity.

In [None]:
import numpy as np
from numpy.linalg import norm

embeddings_path = "C:/path/to/file/glove.6B.300d.txt"

def cosine_sim(vec_1, vec_2):
    return (vec_1/norm(vec_1)).dot((vec_2/norm(vec_2)))

def compare_similarity(word_1, word_2):
    """Look up two words in the file of word vectors and return the
    cosine similarity of their vectors as a float. If either word is
    missing from the file, 0.0 is returned.
    """
    compare_words = {} # dict in which the word vectors are stored
    with open(embeddings_path, 'r', encoding='utf-8') as emb_file: # opens the file with word_vectors
        for word_vector in emb_file: # reads the file line by line: every line is a word vector in string format
            word, vector = word_vector.split(' ', 1) # splits the vector string into two parts: word and vector
            if word in [word_1.lower(), word_2.lower()]: # checks whether the word is one of the two words
                compare_words[word] = (np.fromstring(vector, dtype='float', sep=' ')) # adds vectors to compare_words
                if len(compare_words) == 2: # if compare_words contains two words the loop is broken
                    break

    if len(compare_words) != 2:
        sim_of_words = 0.0
    else:    
        w_1, w_2 = compare_words.values()
        sim_of_words = cosine_sim(w_1, w_2)
        
    return sim_of_words
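
Note that compare_similarity() scans the embeddings file again for every pair of words. As a possible alternative, the sketch below reads the file once and keeps only the vectors that are actually needed. The helpers load_vectors() and cached_similarity() are hypothetical names introduced here, and the three-dimensional toy vectors are made up; the sketch only assumes the same line layout as above (a word followed by space-separated numbers).

```python
import os
import tempfile
import numpy as np
from numpy.linalg import norm

def load_vectors(path, vocabulary):
    """Hypothetical helper: read the vector file once and keep only
    the vectors of the words listed in `vocabulary`."""
    wanted = {word.lower() for word in vocabulary}
    vectors = {}
    with open(path, 'r', encoding='utf-8') as emb_file:
        for line in emb_file:
            word, vector = line.rstrip('\n').split(' ', 1)
            if word in wanted:
                vectors[word] = np.array(vector.split(), dtype=float)
                if len(vectors) == len(wanted):  # everything found, stop early
                    break
    return vectors

def cached_similarity(vectors, word_1, word_2):
    """Hypothetical helper: cosine similarity from preloaded vectors,
    or 0.0 if either word has no vector."""
    v_1 = vectors.get(word_1.lower())
    v_2 = vectors.get(word_2.lower())
    if v_1 is None or v_2 is None:
        return 0.0
    return float(np.dot(v_1, v_2) / (norm(v_1) * norm(v_2)))

# Toy vector file with made-up 3-dimensional vectors.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_vectors.txt")
with open(toy_path, 'w', encoding='utf-8') as f:
    f.write("cat 1.0 0.0 0.0\ndog 1.0 1.0 0.0\ncar 0.0 0.0 1.0\n")

vectors = load_vectors(toy_path, ["cat", "dog", "car"])
print(cached_similarity(vectors, "cat", "dog"))      # vectors at a 45 degree angle
print(cached_similarity(vectors, "cat", "car"))      # orthogonal vectors
print(cached_similarity(vectors, "cat", "unicorn"))  # unknown word
```

The trade-off is memory for speed: the needed vectors stay in a dict, but the file is read only once instead of once per comparison.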

Before we can compute the cosine similarity on our data, we need to load and preprocess it. The code below reads both TOEFL files: the questions and the corresponding answers. Please fill in the missing parts.

In [None]:
toefl_qst_path = "C:/path/to/file/toefl_qst.txt"
toefl_ans_path = "C:/path/to/file/toefl_ans.txt"

def read_toefl():
    questions_list = []
    choices_dict = {}
    
    with open(toefl_qst_path, 'r', encoding='utf-8') as toefl_questions:
        questions = toefl_questions.read()
        questions = questions.split('\n\n')
 
        for q in questions:
            question = q.split()
            if not question:  # skip empty blocks, e.g. blank lines at the end of the file
                continue
            target = question[1]  # question[0] is the question number
            choices = question[2:]
            letters = [c.rstrip('.') for c in choices[0::2]]  # 'a.' -> 'a'
            candidates = choices[1::2]
            choices_dict = dict(zip(letters, candidates))  # fresh dict for every question

            questions_list.append([target, choices_dict])
                
    with open(toefl_ans_path, 'r', encoding='utf-8') as toefl_answers:
        answers = []
        for line in toefl_answers:
            if line.strip():  # skip blank lines
                answer = line.split()[3]  # the answer is the fourth whitespace-separated field
                answers.append(answer)
    return questions_list, answers
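
To make the parsing steps easier to follow, here is a small worked example on a single question block. The block layout (a question number, the target word, then lettered candidates) is an assumption inferred from the parsing code above, not copied from the actual file.

```python
# One question block in the assumed layout.
block = """1. enormously
a. appropriately
b. uniquely
c. tremendously
d. decidedly"""

tokens = block.split()
# ['1.', 'enormously', 'a.', 'appropriately', 'b.', 'uniquely', ...]

target = tokens[1]                                # the word after the question number
choices = tokens[2:]                              # alternating labels and candidates
letters = [c.rstrip('.') for c in choices[0::2]]  # every second token from 0: 'a.', 'b.', ...
candidates = choices[1::2]                        # every second token from 1: the words
choices_dict = dict(zip(letters, candidates))

print([target, choices_dict])
# ['enormously', {'a': 'appropriately', 'b': 'uniquely', 'c': 'tremendously', 'd': 'decidedly'}]
```

Because split() separates on any whitespace, this approach only works as long as the target and every candidate are single words.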

Let's put everything together!

The read_toefl() function returns two lists: questions_list and answers. You will need only the questions_list at this point. The task is to access the target words and to calculate the similarity with their corresponding candidates. The result might look like:

enormously  -  appropriately :  0.537885569727

enormously  -  uniquely :  0.617749822949

enormously  -  tremendously :  0.906292639466

enormously  -  decidedly :  0.555241334939

etc. ...

In [None]:
questions_list, answers = read_toefl()

for target, candidates in questions_list:
    # candidates maps the answer letters to the candidate words
    for token in candidates.values():
        print(target, ' - ', token, ': ', compare_similarity(target, token))


The last step is to calculate the accuracy of our predictions. For this task, we need two lists: the list of predicted answers and the list of correct answers.

The code below picks the most similar candidate for the target word and returns its answer letter (e.g. 'c') as a string.

In [None]:
def predict_synonym(question):
    # Example input: question = ['enormously', {'a':'appropriately','b':'uniquely','c':'tremendously','d':'decidedly'}]
    target = question[0]
    candidates = question[1]
    
    similarities = []
    
    for choice in candidates:
        score = compare_similarity(target, candidates[choice])
        similarities.append((choice, score))

    prediction = max(similarities, key=lambda x:x[1])[0]
    
    return prediction
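
The selection step can be illustrated in isolation. The sketch below reuses the example scores printed earlier for "enormously" and pairs them with the answer letters from the example question; max() with a key function then picks the pair with the highest score.

```python
# (letter, score) pairs, using the example similarities shown earlier
similarities = [
    ('a', 0.537885569727),  # appropriately
    ('b', 0.617749822949),  # uniquely
    ('c', 0.906292639466),  # tremendously
    ('d', 0.555241334939),  # decidedly
]

# key=... tells max() to compare the pairs by their score, not by their letter;
# [0] then extracts the letter of the winning pair.
prediction = max(similarities, key=lambda pair: pair[1])[0]
print(prediction)  # c
```

If two candidates share the highest score, max() returns the first one it encounters.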

Your task now is to write code that produces these two lists.

In [None]:
questions_list, gold = read_toefl()

for question in questions_list:
    print(predict_synonym(question))
    break

Finally, compute the accuracy score between our predictions and the correct answers.

In [None]:
from sklearn.metrics import accuracy_score

questions, gold = read_toefl()
predictions = []

for question in questions:
    #print(question)
    prediction = predict_synonym(question)
    #print(prediction)
    predictions.append(prediction)

print("Accuracy for TOEFL test: ", accuracy_score(gold, predictions))  # signature: accuracy_score(y_true, y_pred)

Accuracy for TOEFL test and 200-dim vector: 0.85

Accuracy for TOEFL test and 300-dim vector: 0.8875
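
For reference, accuracy is simply the share of predictions that match the correct answers. The sketch below computes it by hand with a hypothetical accuracy() helper; the two short lists are made-up toy data, not actual TOEFL results.

```python
def accuracy(gold, predictions):
    """Hypothetical helper: fraction of predictions equal to the gold answers."""
    if len(gold) != len(predictions):
        raise ValueError("gold and predictions must have the same length")
    correct = sum(g == p for g, p in zip(gold, predictions))
    return correct / len(gold)

# Made-up toy data: three of the four predictions are correct.
gold = ['c', 'a', 'b', 'd']
predictions = ['c', 'a', 'd', 'd']

print(accuracy(gold, predictions))  # 0.75
```

For plain accuracy this should give the same number as accuracy_score(), since both count the matching positions and divide by the number of items.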