# In this notebook I expirement with distributed semantic representations.

- Dense distributed representations represent words as embeddings in a continuous vector space where semantically similar words are mapped to nearby points.
- in this notebook i test word2vec, Glove50d and Glive100d embeddings on 4 different analogy tests to measure to what extent are embedding models able to capture semantic relationships(and surplrisingly more!) between words.
- The goal of the analogy tests is to predict the 4th entry in a row given the analogy expressed by the first 3 words.

## Imports

In [60]:
from gensim.models import KeyedVectors
import re
from tqdm import tqdm

### Test on plural-verbs analogy 

In [64]:
base_dir='/Users/Abdelrahman/scrape/fayd/test-category/' 
filename_test_1= 'plural-verbs.txt'

In [105]:
# preprocess the test data file
def preprocess(base_dir, filename):
    with open(base_dir + filename) as f:
        lines = f.readlines()
        
    # ramoving new line characters
    for index, line in enumerate(lines):
        lines[index]= re.sub("\n", "", line)
    
    for index, line in enumerate(lines):
        lines[index]= re.sub("\t", "", line)
    
    # Split a line of words into an array of words
    words=[]
    for line in lines:
        words.append(line.split())
        lines[index]= line.lower()
    
    return words


In [106]:
words_test_1= preprocess(base_dir, filename_test_1)

In [107]:
# show preprocessed data
words_test_1[0:5]

[['decrease', 'decreases', 'describe', 'describes'],
 ['decrease', 'decreases', 'eat', 'eats'],
 ['decrease', 'decreases', 'enhance', 'enhances'],
 ['decrease', 'decreases', 'estimate', 'estimates'],
 ['decrease', 'decreases', 'find', 'finds']]

In [72]:
# show number of examles in words list
len_words_test_1= len(words)
len_words_test_1

866

### Test on plural-nouns analogy 

In [22]:
filename_test_2= 'gram8-plural.txt'

In [37]:
words_test_2= preprocess(base_dir, filename_test_2)

In [38]:
# show preprocessed data
words_test_2[0:5]

[['banana', 'bananas', 'bird', 'birds'],
 ['banana', 'bananas', 'bottle', 'bottles'],
 ['banana', 'bananas', 'building', 'buildings'],
 ['banana', 'bananas', 'car', 'cars'],
 ['banana', 'bananas', 'cat', 'cats']]

In [54]:
# show number of examles in words_test_2 list
len_words_test_2= len(words_test_2)
len_words_test_2

1332

### Test on capital_common_countries analogy 

In [39]:
filename_test_3= 'capital_common_countries.txt'

In [40]:
words_test_3= preprocess(base_dir, filename_test_3)

In [41]:
# show preprocessed data
words_test_3[0:5]

[['Athens', 'Greece', 'Baghdad', 'Iraq'],
 ['Athens', 'Greece', 'Bangkok', 'Thailand'],
 ['Athens', 'Greece', 'Beijing', 'China'],
 ['Athens', 'Greece', 'Berlin', 'Germany'],
 ['Athens', 'Greece', 'Bern', 'Switzerland']]

In [55]:
# show number of examles in words_test_3 list
len_words_test_3= len(words_test_3)
len_words_test_3

506

### Test on currency analogy 

In [29]:
filename_test_4= 'currency.txt'

In [44]:
words_test_4= preprocess(base_dir, filename_test_4)

In [45]:
# show preprocessed data
words_test_4[0:5]

[['Algeria', 'dinar', 'Angola', 'kwanza'],
 ['Algeria', 'dinar', 'Argentina', 'peso'],
 ['Algeria', 'dinar', 'Armenia', 'dram'],
 ['Algeria', 'dinar', 'Brazil', 'real'],
 ['Algeria', 'dinar', 'Bulgaria', 'lev']]

In [56]:
# show number of examles in words_test_4 list
len_words_test_4= len(words_test_4)
len_words_test_4

866

# Word2vec Testing

In [47]:
# Loading the model
word2vec_model = KeyedVectors.load_word2vec_format('/Users/Abdelrahman/scrape/fayd/embeddings/GoogleNews-vectors-negative300.bin', binary=True)

In [50]:
# test example 
word, conf= word2vec_model.most_similar(positive=['decreases','eat'],negative=['decrease'], topn=1)[0]
word

'eats'

In [92]:
def calculate_accuracy(words_list, len_words_list, model_instance):
    ''' accuracy is calculated as correctly answered / number of examples. '''
    correct_answ=0
    for line in tqdm(words_list):
        word, conf= model_instance.most_similar(positive=[line[2],line[1]],negative=[line[0]], topn=1)[0]

        if word == line[3]:
            correct_answ= correct_answ + 1 
        
    acc = correct_answ / len_words_list
    return acc

### Word2vec accuracy on plural verbs analogy

In [83]:
acc= calculate_accuracy(words_list= words_test_1, len_words_list=len_words_test_1, model_instance= word2vec_model)
acc

100%|██████████| 870/870 [03:17<00:00,  4.40it/s]


0.6824480369515011

### Word2vec accuracy on plural-nouns analogy 

In [85]:
acc= calculate_accuracy(words_list= words_test_2, len_words_list=len_words_test_2, model_instance= word2vec_model)
acc

100%|██████████| 1332/1332 [05:03<00:00,  4.39it/s]


0.8986486486486487

### Word2vec accuracy on capital_common_countries analogy 

In [88]:
acc= calculate_accuracy(words_list= words_test_3, len_words_list=len_words_test_3, model_instance= word2vec_model)
acc

100%|██████████| 506/506 [01:55<00:00,  4.37it/s]


0.8320158102766798

### Word2vec accuracy on currency analogy  

In [89]:
acc= calculate_accuracy(words_list= words_test_4, len_words_list=len_words_test_4, model_instance= word2vec_model)
acc

100%|██████████| 866/866 [03:19<00:00,  4.34it/s]


0.3510392609699769

## word2vec results:

| Model | plural verbs| plural nouns | capital common countries | currency |
| --- | --- | --- | --- | --- |
| word2vec | 0.68 | 0.89 | 0.83 | 0.35 |