# In this notebook I experiment with distributed semantic representations.

- Dense distributed representations represent words as embeddings in a continuous vector space where semantically similar words are mapped to nearby points.
- In this notebook I test word2vec, Glove50d, and Glove100d embeddings on four analogy tests to measure how well embedding models capture semantic relationships (and, surprisingly, more!) between words.
- The goal of each analogy test is to predict the fourth word in a row, given the analogy expressed by the first three words.
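
The prediction step can be sketched with plain vector arithmetic: for a row `a b c d`, compute `b - a + c` and pick the vocabulary word whose vector is closest to it by cosine similarity, ignoring the three input words. The toy 3-dimensional vectors and the helper names `cosine` and `solve_analogy` below are made up for illustration. Real embeddings have 50 or more dimensions, and gensim's `most_similar` differs in details such as vector normalization.

```python
import math

# Hypothetical toy embeddings: dimensions loosely mean (fruit, vehicle, plural).
toy_vectors = {
    "banana": [1.0, 0.0, 0.0],
    "bananas": [1.0, 0.0, 1.0],
    "car": [0.0, 1.0, 0.0],
    "cars": [0.0, 1.0, 1.0],
    "cat": [0.7, 0.7, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, ignoring the three input words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# banana : bananas :: car : ?
print(solve_analogy("banana", "bananas", "car", toy_vectors))  # cars
```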

## Imports

In [1]:
from gensim.models import KeyedVectors
import re
from tqdm import tqdm

### Test on plural-verbs analogy 

In [15]:
base_dir='/Users/Abdelrahman/scrape/fayd/test-category/' 
filename_test_1= 'plural-verbs.txt'

In [33]:
# Preprocess the test data file
def preprocess(base_dir, filename):
    with open(base_dir + filename) as f:
        lines = f.readlines()

    # Remove newline characters
    for index, line in enumerate(lines):
        lines[index] = re.sub("\n", "", line)

    # Remove tab characters
    for index, line in enumerate(lines):
        lines[index] = re.sub("\t", "", line)

    # Lowercase all words
    for index, line in enumerate(lines):
        lines[index] = line.lower()

    # Split each line into a list of words
    words = []
    for line in lines:
        words.append(line.split())
        
    return words
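
A more compact variant is sketched below, assuming the test files separate words with whitespace. The name `preprocess_lines` is hypothetical; it takes an iterable of lines rather than a path so it can be tried without a file. One difference from the function above: `str.split()` treats tabs as separators, whereas the original deletes them.

```python
def preprocess_lines(lines):
    """Lowercase each line and split it on whitespace (spaces, tabs, newlines)."""
    # str.split() with no arguments also drops the trailing newline,
    # so no regular expressions are needed.
    return [line.lower().split() for line in lines]


sample = ["Decrease decreases Describe describes\n", "banana bananas car cars\n"]
print(preprocess_lines(sample))
```

An open file object can be passed in directly, since iterating over it yields lines.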


In [34]:
words_test_1= preprocess(base_dir, filename_test_1)

In [35]:
# show preprocessed data
words_test_1[0:5]

[['decrease', 'decreases', 'describe', 'describes'],
 ['decrease', 'decreases', 'eat', 'eats'],
 ['decrease', 'decreases', 'enhance', 'enhances'],
 ['decrease', 'decreases', 'estimate', 'estimates'],
 ['decrease', 'decreases', 'find', 'finds']]

In [36]:
# show number of examples in words_test_1 list
len_words_test_1= len(words_test_1)
len_words_test_1

870

### Test on plural-nouns analogy 

In [37]:
filename_test_2= 'gram8-plural.txt'

In [38]:
words_test_2= preprocess(base_dir, filename_test_2)

In [39]:
# show preprocessed data
words_test_2[0:5]

[['banana', 'bananas', 'bird', 'birds'],
 ['banana', 'bananas', 'bottle', 'bottles'],
 ['banana', 'bananas', 'building', 'buildings'],
 ['banana', 'bananas', 'car', 'cars'],
 ['banana', 'bananas', 'cat', 'cats']]

In [40]:
# show number of examples in words_test_2 list
len_words_test_2= len(words_test_2)
len_words_test_2

1332

### Test on capital_common_countries analogy 

In [41]:
filename_test_3= 'capital_common_countries.txt'

In [42]:
words_test_3= preprocess(base_dir, filename_test_3)

In [43]:
# show preprocessed data
words_test_3[0:5]

[['athens', 'greece', 'baghdad', 'iraq'],
 ['athens', 'greece', 'bangkok', 'thailand'],
 ['athens', 'greece', 'beijing', 'china'],
 ['athens', 'greece', 'berlin', 'germany'],
 ['athens', 'greece', 'bern', 'switzerland']]

In [44]:
# show number of examples in words_test_3 list
len_words_test_3= len(words_test_3)
len_words_test_3

506

### Test on currency analogy 

In [45]:
filename_test_4= 'currency.txt'

In [46]:
words_test_4= preprocess(base_dir, filename_test_4)

In [47]:
# show preprocessed data
words_test_4[0:5]

[['algeria', 'dinar', 'angola', 'kwanza'],
 ['algeria', 'dinar', 'argentina', 'peso'],
 ['algeria', 'dinar', 'armenia', 'dram'],
 ['algeria', 'dinar', 'brazil', 'real'],
 ['algeria', 'dinar', 'bulgaria', 'lev']]

In [48]:
# show number of examples in words_test_4 list
len_words_test_4= len(words_test_4)
len_words_test_4

866

# Glove50d Testing

In [49]:
# Loading the model
glove50d_model = KeyedVectors.load_word2vec_format('/Users/Abdelrahman/scrape/fayd/glove50d', binary=False)

In [51]:
# Test example: decrease is to decreases as eat is to ? (the test file expects 'eats')
word, conf= glove50d_model.most_similar(positive=['decreases','eat'],negative=['decrease'], topn=1)[0]
word

'eaten'

In [53]:
def calculate_accuracy(words_list, len_words_list, model_instance):
    '''Accuracy is calculated as correctly answered examples / number of examples.'''
    correct_answ = 0
    for line in tqdm(words_list):
        # Predict the fourth word from line[1] - line[0] + line[2]
        word, conf = model_instance.most_similar(positive=[line[2], line[1]], negative=[line[0]], topn=1)[0]

        if word == line[3]:
            correct_answ += 1

    acc = correct_answ / len_words_list
    return acc
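
`most_similar` typically raises a `KeyError` when a word is missing from the model's vocabulary, which would abort the loop above. The sketch below is one possible variant that skips such rows and reports how many were skipped. The names `calculate_accuracy_skip_oov` and `StubModel` are hypothetical; the stub only stands in for a `KeyedVectors` object so the example runs without a model file. Accuracy here is computed over answered rows only, so it is not directly comparable to the figure returned by `calculate_accuracy`.

```python
def calculate_accuracy_skip_oov(words_list, model_instance):
    """Return (accuracy over answered rows, number of skipped rows)."""
    correct = 0
    answered = 0
    for a, b, c, expected in words_list:
        # Skip rows containing a word the model does not know.
        if any(word not in model_instance for word in (a, b, c, expected)):
            continue
        answered += 1
        word, _ = model_instance.most_similar(positive=[c, b], negative=[a], topn=1)[0]
        if word == expected:
            correct += 1
    skipped = len(words_list) - answered
    accuracy = correct / answered if answered else 0.0
    return accuracy, skipped


class StubModel:
    """Hypothetical stand-in for a KeyedVectors object, for demonstration only."""

    vocab = {"eat", "eats", "find", "finds"}

    def __contains__(self, word):
        return word in self.vocab

    def most_similar(self, positive, negative, topn=1):
        # Always answers by appending "s" to the first positive word.
        return [(positive[0] + "s", 1.0)]


rows = [["eat", "eats", "find", "finds"], ["eat", "eats", "zzz", "zzzs"]]
print(calculate_accuracy_skip_oov(rows, StubModel()))  # (1.0, 1)
```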

### Glove50d accuracy on plural verbs analogy

In [54]:
acc= calculate_accuracy(words_list= words_test_1, len_words_list=len_words_test_1, model_instance=glove50d_model)
acc

100%|██████████| 870/870 [00:07<00:00, 116.19it/s]


0.34367816091954023

### Glove50d accuracy on plural-nouns analogy 

In [55]:
acc= calculate_accuracy(words_list= words_test_2, len_words_list=len_words_test_2, model_instance=glove50d_model)
acc

100%|██████████| 1332/1332 [00:11<00:00, 111.71it/s]


0.5990990990990991

### Glove50d accuracy on capital_common_countries analogy 

In [56]:
acc= calculate_accuracy(words_list= words_test_3, len_words_list=len_words_test_3, model_instance=glove50d_model)
acc

100%|██████████| 506/506 [00:04<00:00, 124.14it/s]


0.7924901185770751

### Glove50d accuracy on currency analogy  

In [58]:
acc= calculate_accuracy(words_list= words_test_4, len_words_list=len_words_test_4, model_instance=glove50d_model)
acc

100%|██████████| 866/866 [00:07<00:00, 113.45it/s]


0.08314087759815242

## Glove50d results:

| Model | plural verbs | plural nouns | capital common countries | currency |
| --- | --- | --- | --- | --- |
| Glove50d | 0.34 | 0.60 | 0.79 | 0.08 |
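
The table row can be generated from the raw accuracies printed in the cells above instead of being typed by hand, which avoids rounding slips. A minimal sketch, where the `raw_accuracies` dictionary simply holds the four values copied from the cell outputs:

```python
# Raw accuracies copied from the cell outputs above.
raw_accuracies = {
    "plural verbs": 0.34367816091954023,
    "plural nouns": 0.5990990990990991,
    "capital common countries": 0.7924901185770751,
    "currency": 0.08314087759815242,
}

# Format each value to two decimals and join the cells into a Markdown table row.
cells = [f"{value:.2f}" for value in raw_accuracies.values()]
row = "| Glove50d | " + " | ".join(cells) + " |"
print(row)  # | Glove50d | 0.34 | 0.60 | 0.79 | 0.08 |
```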