# Assignment 2

### Initial Setup:

Links to word vector downloads:<br>
Glove: http://nlp.stanford.edu/data/glove.6B.zip <br>
File used: glove.6B.50d.txt<br>

LexVec: https://www.dropbox.com/s/kguufyc2xcdi8yk/lexvec.enwiki%2Bnewscrawl.300d.W.pos.vectors.gz?dl=1 <br>

The aclImdb folder, glove.6B.50d.txt and lexvec.enwiki+newscrawl.300d.W.pos.vectors must be placed in the same directory as the notebook.<br>

## Text Classification

In [1]:
#import relevant libraries
import numpy as np
import pandas as pd

In [2]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

### Problem 1:

We need functions to assess and evaluate the performance of our models. We will implement those first.
Create one function to calculate precision, one function to calculate recall and
one function to calculate f-measure.

#### Confusion Matrix

First, we must calculate a confusion matrix. A confusion matrix is used to describe the performance of a classification model. From the confusion matrix, we can calculate the values for True Positive, False Positive, False Negative and True Negative.<br>
Next, we use these four values to calculate precision, recall and F1 score (F-measure).<br>

In [3]:
#Function to calculate a confusion matrix
def confusion_matrix(y_pred, y_test):
    c_mat = np.zeros((2, 2))
    for p, t in zip(y_pred, y_test):
        c_mat[p][t] += 1
    return c_mat
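
To make the cell layout explicit, here is a small check on made-up labels (the five label pairs are illustrative only, not taken from the dataset). Because the matrix is indexed as `c_mat[predicted][true]`, the true positives sit at `[1, 1]`, the false positives at `[1, 0]` and the false negatives at `[0, 1]`:

```python
import numpy as np

def confusion_matrix(y_pred, y_test):
    # Rows are indexed by the predicted label, columns by the true label
    c_mat = np.zeros((2, 2))
    for p, t in zip(y_pred, y_test):
        c_mat[p][t] += 1
    return c_mat

# Toy labels, made up for illustration only
y_pred = [1, 1, 0, 0, 1]
y_true = [1, 0, 0, 1, 1]

cm = confusion_matrix(y_pred, y_true)
true_pos = cm[1, 1]   # predicted 1, actually 1
false_pos = cm[1, 0]  # predicted 1, actually 0
false_neg = cm[0, 1]  # predicted 0, actually 1
true_neg = cm[0, 0]   # predicted 0, actually 0
print(true_pos, false_pos, false_neg, true_neg)
```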

#### Precision

Precision is calculated as:<br>
Precision = True Positive / (True Positive + False Positive)<br>

In [4]:
#Function to calculate precision
def calc_precision(y_pred, y_test):
    cm = confusion_matrix(y_pred, y_test)
    
    #Precision = True Positive / (True Positive + False Positive)
    return (cm[1, 1] / (cm[1, 1] + cm[1, 0]))

#### Recall

Recall is calculated as:<br>
Recall = True Positive / (True Positive + False Negative)

In [5]:
# Function to calculate recall
def calc_recall(y_pred, y_test):
    cm = confusion_matrix(y_pred, y_test)
    
    #Recall = True Positive / (True Positive + False Negative)
    return (cm[1, 1] / (cm[1, 1] +  cm[0, 1]))

#### F1 Score/F-measure

F1 score is a combination of precision and recall. F1 score is calculated as:<br>
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

In [6]:
#Function to calculate F1 Score
def calc_fmeasure(y_pred, y_test):
    cm = confusion_matrix(y_pred, y_test)
    
    #Precision = True Positive / (True Positive + False Positive)
    precision = (cm[1, 1] / (cm[1, 1] + cm[1, 0]))
    
    #Recall = True Positive / (True Positive + False Negative)
    recall = (cm[1, 1] / (cm[1, 1] + cm[0, 1]))
    
    #F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
    return (2 * precision * recall) / (precision + recall)
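
As a quick sanity check of the formula, consider a hypothetical classifier with precision 0.5 and recall 1.0. The helper name `f1_from` is an assumption made for this sketch and is not part of the assignment code:

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier: half of its positive predictions are correct
# (precision 0.5) and it misses no positive example (recall 1.0)
example_f1 = f1_from(0.5, 1.0)
print(round(example_f1 * 100, 2))
```

The harmonic mean is pulled towards the smaller of the two values, so a perfect recall cannot compensate for a precision of 0.5.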

Additional function to calculate accuracy. Used in Problem 7.

In [7]:
#Function to calculate accuracy
def calc_accuracy(y_pred, y_test):
    cm = confusion_matrix(y_pred, y_test)
    accuracy = (cm[0, 0]+cm[1, 1])/(cm[0, 0]+cm[1, 1]+cm[0, 1]+cm[1, 0])
    return accuracy

### Problem 2

Majority Class Baseline. We will create majority class
baseline to evaluate our initial model performance – which is the simplest
baseline. The label for the test data should be the majority class found in
training data.
You should report P, R and F-score (using the functions you wrote to solve
Problem 1) for both training and test data to obtain full marks on this problem.

To create a majority class baseline, we need to figure out which class is the majority class, i.e. what class does a majority of the data belong to. <br>

In [8]:
#Import the load_files method from sklearn.datasets to import the data
from sklearn.datasets import load_files

We use the load_files function from sklearn.datasets to load the text files. This is done because load_files allows us to label files by the folders they are stored in. This is ideal, as in our case, the positive and negative examples are stored in separate subfolders. Negative reviews are encoded with the label 0 and positive reviews are labelled with 1.
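
The folder-based labelling can be seen on a tiny throwaway corpus. This is only a sketch: the temporary directory, the two sample sentences and the variable names are assumptions made for illustration, and only the `neg`/`pos` folder layout mirrors the real data.

```python
import os
import tempfile
from sklearn.datasets import load_files

# Build a tiny throwaway corpus with the same neg/pos folder layout
root = tempfile.mkdtemp()
samples = {"neg": ["dull and slow"], "pos": ["a wonderful film"]}
for label, texts in samples.items():
    os.makedirs(os.path.join(root, label))
    for i, text in enumerate(texts):
        path = os.path.join(root, label, f"{i}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)

toy = load_files(root, categories=["neg", "pos"])
print(toy.target_names)            # folder names, in label order
print(type(toy.data[0]).__name__)  # raw bytes unless an encoding is passed
```

Since no `encoding` is passed, each document comes back as a byte string, which matters later when review lengths are measured with `len`.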

Note: the following two commands occasionally take a long time to execute. Rerunning the commands a couple of times usually helps.

In [11]:
#Import the training data
directory = r'aclImdb/train/'
train = load_files(directory, categories = ['neg', 'pos'])

In [12]:
#Import the testing data
directory = r'aclImdb/test/'
test = load_files(directory, categories = ['neg', 'pos'])

We count the number of reviews belonging to each class.

In [13]:
#Printing out the number of reviews of each class in the training data
print("Number of Positive Reviews in Training Data: ", (train.target==1).sum())
print("Number of Negative Reviews in Training Data: ", (train.target==0).sum())

Number of Positive Reviews in Training Data:  12500
Number of Negative Reviews in Training Data:  12500


Here we can see that the number of reviews per class in the training data is the same (12500). Since there is no true majority class, we compute the metrics with each class in turn treated as the majority.<br>

#### Majority Class: Negative

We consider 0 (Negative) to be our majority class label.

In [14]:
#Assigning a predicted value of 0 to all reviews
y_pred = [0]*25000

#### Training Data

In [15]:
#Calculating metrics
print("Metrics for Majority Class Baseline (Training Data):")
print("Precision: ", calc_precision(y_pred, train.target)*100)
print("Recall: ", calc_recall(y_pred, train.target)*100)
print("F1 Score: ", calc_fmeasure(y_pred, train.target)*100)
# print("Accuracy: ", calc_accuracy(y_pred, train.target)*100)

Metrics for Majority Class Baseline (Training Data):
Precision:  nan
Recall:  0.0
F1 Score:  nan
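
The nan values above come from a division by zero: when every prediction is 0 there are no predicted positives, so True Positive + False Positive = 0 and precision is 0/0. The sketch below reproduces this with the counts for balanced data, and shows a hypothetical `safe_precision` helper (not part of the assignment code) that reports 0.0 instead, assuming that convention is acceptable:

```python
import numpy as np

# Counts for an all-negative prediction on balanced data, using the
# layout cm[predicted][true] from the confusion matrix function
cm = np.array([[12500.0, 12500.0],
               [0.0, 0.0]])

with np.errstate(invalid="ignore"):
    raw_precision = cm[1, 1] / (cm[1, 1] + cm[1, 0])  # 0/0 evaluates to nan

def safe_precision(cm):
    # Hypothetical helper: return 0.0 when nothing was predicted positive
    predicted_pos = cm[1, 1] + cm[1, 0]
    return cm[1, 1] / predicted_pos if predicted_pos > 0 else 0.0

print(raw_precision, safe_precision(cm))
```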


#### Test Data

In [16]:
#Calculating metrics
print("Metrics for Majority Class Baseline (Test Data):")
print("Precision: ", calc_precision(y_pred, test.target)*100)
print("Recall: ", calc_recall(y_pred, test.target)*100)
print("F1 Score: ", calc_fmeasure(y_pred, test.target)*100)
# print("Accuracy: ", calc_accuracy(y_pred, test.target)*100)

Metrics for Majority Class Baseline (Test Data):
Precision:  nan
Recall:  0.0
F1 Score:  nan


#### Majority Class: Positive

Now, we consider 1 (Positive) to be our majority class label.

In [17]:
#Assigning a predicted value of 1 to all reviews
y_pred = [1]*25000

#### Training Data

In [18]:
#Calculating metrics
print("Metrics for Majority Class Baseline (Training Data):")
print("Precision: ", calc_precision(y_pred, train.target)*100)
print("Recall: ", calc_recall(y_pred, train.target)*100)
print("F1 Score: ", calc_fmeasure(y_pred, train.target)*100)
# print("Accuracy: ", calc_accuracy(y_pred, train.target)*100)

Metrics for Majority Class Baseline (Training Data):
Precision:  50.0
Recall:  100.0
F1 Score:  66.66666666666666


#### Test Data

In [19]:
#Calculating metrics
print("Metrics for Majority Class Baseline:")
print("Precision: ", calc_precision(y_pred, test.target)*100)
print("Recall: ", calc_recall(y_pred, test.target)*100)
print("F1 Score: ", calc_fmeasure(y_pred, test.target)*100)
# print("Accuracy: ", calc_accuracy(y_pred, test.target)*100)

Metrics for Majority Class Baseline:
Precision:  50.0
Recall:  100.0
F1 Score:  66.66666666666666


### Problem 3

Review Length Baseline. We will create baseline to
evaluate our model performance – which takes into account length of the
review.
For this baseline, you should try setting various thresholds of review length to
classify them as positive or negative. For example, all reviews > 50 words in
length can be classified as positive. You should experiment with at least 3
different thresholds and document your reasons why you chose these
thresholds.
For each threshold, you should report P, R and F-score (using the functions you
wrote to solve Problem 1) for both training and test data to obtain full marks on
this problem.

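We need to calculate the length of each review in our dataset. Note that len is applied to the raw review, which load_files returns as a byte string, so length is measured in bytes (roughly characters) rather than in words.<br>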

In [20]:
train['length'] = [len(review) for review in train.data]

Next, we see what the maximum and minimum values for the length are.

In [21]:
print("Maximum Length: ", max(train.length))
print("Minimum Length: ", min(train.length))

Maximum Length:  13704
Minimum Length:  52


For our experiments, we create a threshold based on various percentile values of review length. Reviews above this threshold are given a positive sentiment and the reviews below the threshold are given a negative sentiment. The threshold is positioned at the 25th, 50th and 75th percentile values of review length.<br>
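
The thresholding rule can be sketched on a handful of made-up lengths (the values below are illustrative only). The percentile is computed once per setting and the comparison is applied to the whole array at once:

```python
import numpy as np

# Made-up review lengths, for illustration only
lengths = np.array([10, 20, 30, 40, 50])

preds = {}
for pct in (25, 50, 75):
    threshold = np.percentile(lengths, pct)              # computed once per setting
    preds[pct] = (lengths > threshold).astype(int).tolist()  # 1 = positive, 0 = negative
    print(pct, threshold, preds[pct])
```

Raising the threshold can only reduce the number of reviews labelled positive, so recall can only fall as the percentile increases.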


#### Training Data:

In [22]:
train_data = pd.DataFrame.from_dict({key: train[key] for key in train.keys()
                               & {'data', 'filenames', 'target', 'length'}} )

In [23]:
#Setting the threshold to the 25th Percentile (computed once, then compared against every length)
threshold = np.percentile(train.length, 25)
train_data['y_pred'] = (train_data.length > threshold).astype(int)

In [24]:
#Calculating metrics
print("Metrics for Review Length Baseline (25th Percentile):")
print("Precision: ", calc_precision(train_data['y_pred'], train.target)*100)
print("Recall: ", calc_recall(train_data['y_pred'], train.target)*100)
print("F1 Score: ", calc_fmeasure(train_data['y_pred'], train.target)*100)

Metrics for Review Length Baseline (25th Percentile):
Precision:  49.34877762357212
Recall:  73.96000000000001
F1 Score:  59.19830953448165


In [25]:
#Setting the threshold to the 50th Percentile (computed once, then compared against every length)
threshold = np.percentile(train.length, 50)
train_data['y_pred'] = (train_data.length > threshold).astype(int)

In [26]:
#Calculating metrics
print("Metrics for Review Length Baseline (50th Percentile):")
print("Precision: ", calc_precision(train_data['y_pred'], train.target)*100)
print("Recall: ", calc_recall(train_data['y_pred'], train.target)*100)
print("F1 Score: ", calc_fmeasure(train_data['y_pred'], train.target)*100)

Metrics for Review Length Baseline (50th Percentile):
Precision:  50.19617263191609
Recall:  50.151999999999994
F1 Score:  50.17407659370122


In [27]:
#Setting the threshold to the 75th Percentile (computed once, then compared against every length)
threshold = np.percentile(train.length, 75)
train_data['y_pred'] = (train_data.length > threshold).astype(int)

In [28]:
#Calculating metrics
print("Metrics for Review Length Baseline (75th Percentile):")
print("Precision: ", calc_precision(train_data['y_pred'], train.target)*100)
print("Recall: ", calc_recall(train_data['y_pred'], train.target)*100)
print("F1 Score: ", calc_fmeasure(train_data['y_pred'], train.target)*100)

Metrics for Review Length Baseline (75th Percentile):
Precision:  51.896303408545364
Recall:  25.944
F1 Score:  34.59384500506694


#### Test Data:

In [29]:
test['length'] = [len(review) for review in test.data]

In [30]:
test_data = pd.DataFrame.from_dict({key: test[key] for key in test.keys()
                               & {'data', 'filenames', 'target', 'length'}} )

In [31]:
#Setting the threshold to the 25th Percentile (computed once, then compared against every length)
threshold = np.percentile(test.length, 25)
test_data['y_pred'] = (test_data.length > threshold).astype(int)

In [32]:
#Calculating metrics
print("Metrics for Review Length Baseline (25th Percentile):")
print("Precision: ", calc_precision(test_data['y_pred'], test.target)*100)
print("Recall: ", calc_recall(test_data['y_pred'], test.target)*100)
print("F1 Score: ", calc_fmeasure(test_data['y_pred'], test.target)*100)

Metrics for Review Length Baseline (25th Percentile):
Precision:  49.28457020822211
Recall:  73.848
F1 Score:  59.116234390009616


In [33]:
#Setting the threshold to the 50th Percentile (computed once, then compared against every length)
threshold = np.percentile(test.length, 50)
test_data['y_pred'] = (test_data.length > threshold).astype(int)

In [34]:
#Calculating metrics
print("Metrics for Review Length Baseline (50th Percentile):")
print("Precision: ", calc_precision(test_data['y_pred'], test.target)*100)
print("Recall: ", calc_recall(test_data['y_pred'], test.target)*100)
print("F1 Score: ", calc_fmeasure(test_data['y_pred'], test.target)*100)

Metrics for Review Length Baseline (50th Percentile):
Precision:  49.47933354694008
Recall:  49.416
F1 Score:  49.447646493756004


In [35]:
#Setting the threshold to the 75th Percentile (computed once, then compared against every length)
threshold = np.percentile(test.length, 75)
test_data['y_pred'] = (test_data.length > threshold).astype(int)

In [36]:
#Calculating metrics
print("Metrics for Review Length Baseline (75th Percentile):")
print("Precision: ", calc_precision(test_data['y_pred'], test.target)*100)
print("Recall: ", calc_recall(test_data['y_pred'], test.target)*100)
print("F1 Score: ", calc_fmeasure(test_data['y_pred'], test.target)*100)

Metrics for Review Length Baseline (75th Percentile):
Precision:  50.24023062139654
Recall:  25.096
F1 Score:  33.472044387537345


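From the above experiments, we can see that precision stays close to 50% at every threshold on both training and test data, so review length alone is a weak predictor of sentiment. The higher F1 score at the 25th percentile comes from its higher recall: a lower threshold labels more reviews as positive.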

### Problem 4

Implementing the Naïve Bayes classifier. You can use the
built-in Naive Bayes model from sklearn to train a classifier. Here is the
documentation for how to implement GaussianNB (discussed in class) using
sklearn.
You will train your classifier on the words contained in the positive and negative
reviews, based on the algorithm discussed in class.
You should report P, R and F-score (using the functions you wrote to solve
Problem 1) for both training and test data to obtain full marks on this problem.

#### Setup

In [37]:
#importing relevant libraries
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import GaussianNB

To input text data into a machine learning model, the text data must be converted to a numerical representation. In this case, we use CountVectorizer from sklearn.feature_extraction.text to create a vector of token counts for each document.

In [38]:
#Creating bag of words of the documents using the fit_transform method of the CountVectorizer class
word_vec = CountVectorizer(min_df=2, tokenizer=word_tokenize)         
word_counts = word_vec.fit_transform(train.data)

To avoid discrepancies with regard to higher word counts in longer documents, we create a Term Frequency - Inverse Document Frequency vector to normalize word frequencies.
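
The effect of the normalization can be seen on two made-up documents that use the same words in the same proportions but differ in length. This is a sketch with default settings; in particular it relies on the default `norm='l2'` of TfidfTransformer, which rescales every row to unit length:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two made-up documents with the same word proportions but different lengths
docs = ["good film", "good film good film good film"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

count_rows = counts.toarray()
tfidf_rows = tfidf.toarray()
print(count_rows)   # raw counts differ by a factor of three
print(tfidf_rows)   # rows are L2-normalized, so they coincide
```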

In [39]:
#Creating TF-IDF vectors of the training data
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(word_counts)

In [40]:
#Creating bag of words vectors and TF-IDF vectors of the testing data
test_counts = word_vec.transform(test.data)
test_tfidf = tfidf_transformer.transform(test_counts)

Now that our data is represented appropriately, we can move on to training and testing our model.

In [41]:
#Creating a Gaussian Naive Bayes classifier object
GNB = GaussianNB()

#The unique classes are computed because the first iteration of partial_fit requires it
unique_classes = np.unique(train.target)

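GaussianNB requires dense input, and converting the full TF-IDF matrix to a dense array at once exceeds the available memory, so the fit method of the GaussianNB class cannot be used on the whole training set. Therefore, we use the partial_fit method of the GaussianNB class to train our model incrementally, converting one batch of rows at a time.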

In [42]:
GNB.partial_fit(train_tfidf[0:10000].toarray(), train.target[0:10000], classes=unique_classes)

GaussianNB(priors=None, var_smoothing=1e-09)

In [43]:
GNB.partial_fit(train_tfidf[10000:20000].toarray(), train.target[10000:20000])

GaussianNB(priors=None, var_smoothing=1e-09)

In [44]:
GNB.partial_fit(train_tfidf[20000:].toarray(), train.target[20000:])

GaussianNB(priors=None, var_smoothing=1e-09)
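
The three partial_fit calls above follow one pattern, which can also be written as a loop. The sketch below uses a small synthetic sparse matrix as a stand-in for the TF-IDF data; the matrix shape, the labels and the `batch_size` value are assumptions made for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the sparse TF-IDF matrix (60 documents, 8 features)
rng = np.random.default_rng(0)
X = csr_matrix(rng.random((60, 8)))
y = np.array([0, 1] * 30)

clf = GaussianNB()
classes = np.unique(y)
batch_size = 25  # illustrative value; the notebook uses batches of 10000

# Only one dense slice is held in memory at a time
for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    clf.partial_fit(X[start:stop].toarray(), y[start:stop], classes=classes)

# Prediction can be batched the same way instead of one row per call
y_pred = np.concatenate([
    clf.predict(X[start:start + batch_size].toarray())
    for start in range(0, X.shape[0], batch_size)
])
print(y_pred.shape)
```

Predicting a batch per call avoids the overhead of tens of thousands of separate predict calls while keeping memory use bounded by the batch size.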

#### Training Data:

We use the model to predict the labels for the training data.

In [45]:
#Predicting the target variables for the training data
y_pred = []
for i in range(0, 25000):
    #predict returns a one-element array, so we take its only entry
    y_pred.append(int(GNB.predict(train_tfidf[i].toarray())[0]))

In [46]:
#Storing the training target variables in a list variable y
y = train.target.tolist()

In [47]:
#Calculating metrics
print("Metrics for training data using Gaussian Naive Bayes classification")
print("Precision: ", calc_precision(y_pred, y)*100)
print("Recall: ", calc_recall(y_pred, y)*100)
print("F1 Score: ", calc_fmeasure(y_pred, y)*100)
#print("Accuracy: ", calc_accuracy(y_pred, y)*100)

Metrics for training data using Gaussian Naive Bayes classification
Precision:  99.33763136977832
Recall:  89.984
F1 Score:  94.42975275993787


In [48]:
del y_pred

#### Test Data:

Finally, we use the model to predict the labels of the test data

In [49]:
#Predicting the target variables for the test data
y_pred = []
for i in range(0, 25000):
    #predict returns a one-element array, so we take its only entry
    y_pred.append(int(GNB.predict(test_tfidf[i].toarray())[0]))

In [50]:
#Storing the test target variables in a list variable y
y = test.target.tolist()

In [51]:
#Calculating metrics
print("Metrics for testing data using Gaussian Naive Bayes classification")
print("Precision: ", calc_precision(y_pred, y)*100)
print("Recall: ", calc_recall(y_pred, y)*100)
print("F1 Score: ", calc_fmeasure(y_pred, y)*100)
#print("Accuracy: ", calc_accuracy(y_pred, y)*100)

Metrics for testing data using Gaussian Naive Bayes classification
Precision:  62.775149146266195
Recall:  48.824
F1 Score:  54.92754927549276


In [52]:
#Deleting the model and predicted values to free up memory
del GNB, y_pred

### Problem 5

Implementing your own classifier. You can implement
your own classifier – by using another classifier from sklearn package – SVM or
Logistic Regression (for example) or you can experiment with different features
and add them to your Naïve Bayes model. As before, you should report P, R and
F-score (using the functions you wrote to solve Problem 1) for both training and
test data to obtain full marks on this problem.

#### Setup

The classifier implemented here is a linear Support Vector Machine. The main difference between Naive Bayes and SVM is that Naive Bayes assumes the features are independent of each other given the class, while an SVM makes no such assumption and learns the weights of all features jointly.<br>
We use the SGDClassifier class to implement this instead of the SVC class from sklearn.svm because the former supports minibatch learning using the partial_fit method.

In [53]:
from sklearn import linear_model
svm_clf = linear_model.SGDClassifier()
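
With its default settings, SGDClassifier minimizes the hinge loss (shown as loss='hinge' in the printed parameters), which corresponds to a linear SVM. The sketch below computes that loss by hand on made-up decision scores; the helper name `hinge_loss` and the scores are assumptions made for illustration:

```python
import numpy as np

def hinge_loss(y_true, scores):
    # Labels arrive as 0/1; the hinge loss is defined on -1/+1
    signs = 2 * np.asarray(y_true) - 1
    return np.maximum(0.0, 1.0 - signs * np.asarray(scores))

# Made-up decision scores (w.x + b) for four reviews
losses = hinge_loss([1, 1, 0, 0], [2.0, 0.5, -3.0, 0.2])
print(losses.tolist())
```

Examples that are classified correctly with a margin of at least 1 contribute no loss, so only examples near or on the wrong side of the decision boundary drive the weight updates.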

Incrementally training the classifier in batches using the training data.

In [54]:
svm_clf.partial_fit(train_tfidf[0:10000].toarray(), train.target[0:10000], classes=unique_classes)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [55]:
svm_clf.partial_fit(train_tfidf[10000:20000].toarray(), train.target[10000:20000])

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [56]:
svm_clf.partial_fit(train_tfidf[20000:].toarray(), train.target[20000:])

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

#### Training Data

Using the trained model to predict the training labels

In [57]:
#Predicting the target sentiment values for the training data
y_pred = []
for i in range(0, 25000):
    #predict returns a one-element array, so we take its only entry
    y_pred.append(int(svm_clf.predict(train_tfidf[i].toarray())[0]))

In [58]:
#Storing the training target variables in a list variable y
y = train.target.tolist()

In [59]:
#Calculating metrics
print("Metrics for training data using Linear SVM classification")
print("Precision: ", calc_precision(y_pred, y)*100)
print("Recall: ", calc_recall(y_pred, y)*100)
print("F1 Score: ", calc_fmeasure(y_pred, y)*100)
#print("Accuracy: ", calc_accuracy(y_pred, y)*100)

Metrics for training data using Linear SVM classification
Precision:  69.3671027619367
Recall:  99.256
F1 Score:  81.66260777989864


In [60]:
del y_pred

#### Test Data

Using the model to predict the labels for the testing data

In [61]:
#Predicting the target variables for the test data
y_pred = []
for i in range(0, 25000):
    #predict returns a one-element array, so we take its only entry
    y_pred.append(int(svm_clf.predict(test_tfidf[i].toarray())[0]))

In [62]:
#Storing the test target variables in a list variable y
y = test.target.tolist()

In [63]:
#Calculating metrics
print("Metrics for testing data using Linear SVM classification")
print("Precision: ", calc_precision(y_pred, y)*100)
print("Recall: ", calc_recall(y_pred, y)*100)
print("F1 Score: ", calc_fmeasure(y_pred, y)*100)
#print("Accuracy: ", calc_accuracy(y_pred, y)*100)

Metrics for testing data using Linear SVM classification
Precision:  67.11956521739131
Recall:  98.8
F1 Score:  79.93527508090615


In [64]:
#Deleting the model and predicted values to free up memory
del svm_clf, y_pred

### Problem 6

What do you observe about the performance of the
different models?

The first model, the Majority Class Baseline, works on the assumption that the label for the test data should be the label that is most common in the training data. However, the classes in our training data are perfectly balanced, i.e. they are equal in number. Therefore we calculate the metrics using both labels. We see that the model fares poorly: predicting all negatives yields NaN for precision and F1 score, because with no positive predictions the precision is 0/0, and predicting all positives gives a precision of only 50%.<br>

Our next model classifies reviews based on review length. The threshold is set at the 25th, 50th and 75th percentiles of review length. The F1 score is highest when the threshold is set at the 25th percentile, but precision stays close to 50% at every threshold, so the gain comes from higher recall (more reviews are labelled positive) rather than from length being informative. The precision, recall and F-measure scores are not high, which implies that review length alone is a weak predictor of sentiment.

The first machine learning approach implemented here is a Gaussian Naive Bayes classifier. We convert the data into bag-of-words vectors and then further into TF-IDF vectors. These vectors are then used to predict the target value. The model performs substantially better than the preceding models on the training data (F1 score of about 94%). However, it performs poorly on the test data (F1 score of about 55%), indicating that the model suffers from overfitting.

Our final model uses a linear Support Vector Machine to classify our datasets. It achieves the best test F1 score of all the models (about 80%) and obtains similar values for precision, recall and F1 score on both training and testing data, indicating that the model has generalized well. Its recall (98.8% on the test data) is much higher than its precision (67.1%), so it tends to over-predict the positive class. Unlike Naive Bayes, the SVM does not assume that the features are mutually independent, which suggests that the independence assumption is a poor fit for this data.

## Vector Semantics

### Problem 7

Run the analogy test. Pick any two sets of pretrained
embeddings, implement the analogy prediction method described in Equation
2, and compare their accuracies on the eight analogy tasks listed above. Make
sure to mention the details of your selection in writing.

For this part of the assignment, we use the gensim library to load word vector models from pretrained embeddings.<br>
https://radimrehurek.com/gensim/index.html<br>
<br>
Additionally, gensim's most_similar method helps us find the most similar vector to a target vector using cosine similarity.<br>
https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html

In [65]:
#Reading the test data from the text file
with open(r'word-test.v1.txt', encoding='utf-8') as f:
    text = f.read()

#Preprocessing the data
text = text.split(":")[1:]
text = [doc.split('\n') for doc in text]

In [66]:
#Splitting the test into analogy tests. Keeping only those tests that are required
cap_world = [doc[1:] for doc in text if 'capital-world' in doc[0]][0]
cur = [doc[1:] for doc in text if 'currency' in doc[0]][0]
city_state = [doc[1:] for doc in text if 'city-in-state' in doc[0]][0]
fam = [doc[1:] for doc in text if 'family' in doc[0]][0]
adj_adv = [doc[1:] for doc in text if 'gram1-adjective-to-adverb' in doc[0]][0]
opp = [doc[1:] for doc in text if 'gram2-opposite' in doc[0]][0]
comp = [doc[1:] for doc in text if 'gram3-comparative' in doc[0]][0]
nat_adj = [doc[1:] for doc in text if 'gram6-nationality-adjective' in doc[0]][0]
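
The same grouping can be sketched as a single pass that builds a dictionary of sections. The sample text below is made up and assumes the layout implied by the splitting code above, namely that each section starts with a line beginning with ":" followed by one four-word analogy per line:

```python
# A made-up excerpt in the assumed layout of the test file
sample = """: capital-world
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: currency
Algeria dinar Angola kwanza
"""

sections = {}
current = None
for line in sample.splitlines():
    if line.startswith(":"):
        # A new section header, e.g. ": currency"
        current = line[1:].strip()
        sections[current] = []
    elif line.strip() and current is not None:
        # An analogy line, lowercased and split into its four words
        sections[current].append(line.lower().split())

print(sections["currency"])
```

A dictionary keyed by section name makes it possible to loop over the eight tasks rather than keeping one variable per task.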

### Glove
Source: https://nlp.stanford.edu/projects/glove/<br>
Paper: https://nlp.stanford.edu/pubs/glove.pdf<br>
Download: http://nlp.stanford.edu/data/glove.6B.zip<br>

Text File Used: glove.6B.50d.txt<br>
The vectors have 50 dimensions and were trained on a corpus of 6 billion tokens drawn from the Wikipedia 2014 and Gigaword 5 datasets.<br>
The file is placed in the same directory as the notebook.<br>

#### Setup

In [67]:
#Importing relevant libraries
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import os
import numpy as np

In [68]:
#Function to preprocess text in the analogy tests
def preprocess(data):
    
    #Lowercasing all the words and removing '\t' symbols
    data = [doc.lower().replace('\t', '') for doc in data if doc != '']
    return data

In [69]:
#Function to calculate the result for each analogy test
def calculate_result(t, model):
    results = []
    for analogy in t:
        
        #Splitting the analogy test into the four component words
        a, b, c, d = analogy.split()
        
        try:
            pred = model.most_similar(positive=[b, c], negative=[a], topn = 1)[0][0]
            
        #In case any of the words in the analogy don't exist in the word vector vocabulary,
        #assume that the prediction is a random string, in this case: "abc"
        except KeyError:
            pred = 'abc'
            
        #The function returns 1 if the model guesses the word correctly, else 0
        if pred == d:
            results.append(1)        
        else:
            results.append(0)
    return results
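
The query issued by calculate_result can be sketched by hand with NumPy. The version below implements the additive analogy rule (often called 3CosAdd) on made-up two-dimensional vectors; it assumes that calling most_similar with positive=[b, c] and negative=[a] amounts to the same ranking, and the vectors and the names `unit` and `analogy` are illustrative only:

```python
import numpy as np

# Made-up 2-d embeddings, for illustration only
vectors = {
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([1.0, 1.0]),
    "king":   np.array([3.0, 0.2]),
    "queen":  np.array([3.0, 3.0]),
    "prince": np.array([3.0, -0.5]),
    "apple":  np.array([-1.0, 0.5]),
}

def unit(v):
    # Scale a vector to length 1 so that dot products are cosine similarities
    return v / np.linalg.norm(v)

def analogy(a, b, c):
    # "a is to b as c is to ?": the query is b + c - a on unit vectors
    query = unit(unit(vectors[b]) + unit(vectors[c]) - unit(vectors[a]))
    # The three input words are excluded from the candidates
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: float(np.dot(unit(vectors[w]), query)))

print(analogy("man", "woman", "king"))
```

Excluding the input words matters: without it, the nearest neighbour of the query is frequently one of the inputs themselves.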

In [70]:
#Using gensim to convert the glove vector text file into word2vec format
#Source: https://radimrehurek.com/gensim/scripts/glove2word2vec.html
glove_file = os.path.abspath('glove.6B.50d.txt')
tmp_file = os.path.abspath("test_word2vec.txt")
converted_file = glove2word2vec(glove_file, tmp_file)

In [71]:
#Loading the Glove embeddings in word2vec format
glove_model = KeyedVectors.load_word2vec_format(os.path.abspath("test_word2vec.txt"))

#### Results:

In [72]:
#Average execution time is 1 minute for each task
print("Glove model:")

#Capital World Task
cap_world = preprocess(cap_world)
y_act = [1]*len(cap_world)
y_pred = calculate_result(cap_world, glove_model)
print("The accuracy of the model on the 'Capital-World' task: ", calc_accuracy(y_pred, y_act)*100)

#Currency Task
cur = preprocess(cur)
y_act = [1]*len(cur)
y_pred = calculate_result(cur, glove_model)
print("The accuracy of the model on the 'Currency' task: ", calc_accuracy(y_pred, y_act)*100)

#City State Task
city_state = preprocess(city_state)
y_act = [1]*len(city_state)
y_pred = calculate_result(city_state, glove_model)
print("The accuracy of the model on the 'City-State' task: ", calc_accuracy(y_pred, y_act)*100)

#Family Task
fam = preprocess(fam)
y_act = [1]*len(fam)
y_pred = calculate_result(fam, glove_model)
print("The accuracy of the model on the 'Family' task: ", calc_accuracy(y_pred, y_act)*100)

#Adjective adverb task
adj_adv = preprocess(adj_adv)
y_act = [1]*len(adj_adv)
y_pred = calculate_result(adj_adv, glove_model)
print("The accuracy of the model on the 'Adjective-Adverb' task: ", calc_accuracy(y_pred, y_act)*100)

#Opposites task
opp = preprocess(opp)
y_act = [1]*len(opp)
y_pred = calculate_result(opp, glove_model)
print("The accuracy of the model on the 'Opposites' task: ", calc_accuracy(y_pred, y_act)*100)

#Comparitives task
comp = preprocess(comp)
y_act = [1]*len(comp)
y_pred = calculate_result(comp, glove_model)
print("The accuracy of the model on the 'Comparitives' task: ", calc_accuracy(y_pred, y_act)*100)

#Nationality-Adjective task
nat_adj = preprocess(nat_adj)
y_act = [1]*len(nat_adj)
y_pred = calculate_result(nat_adj, glove_model)
print("The accuracy of the model on the 'Nationality-Adjective' task: ", calc_accuracy(y_pred, y_act)*100)

Glove model:
The accuracy of the model on the 'Capital-World' task:  68.47922192749779
The accuracy of the model on the 'Currency' task:  8.314087759815243
The accuracy of the model on the 'City-State' task:  15.322253749493312
The accuracy of the model on the 'Family' task:  68.97233201581028
The accuracy of the model on the 'Adjective-Adverb' task:  15.221774193548388
The accuracy of the model on the 'Opposites' task:  9.482758620689655
The accuracy of the model on the 'Comparitives' task:  51.80180180180181
The accuracy of the model on the 'Nationality-Adjective' task:  85.99124452782989


The above model fares very poorly in the "Currency", "City-in-State", "Adjective-Adverb" and "Opposite" tasks. This may be because the words involved in these tasks do not appear in enough contexts in the corpus that the Glove embeddings were trained on.

### LexVec
Source: https://github.com/alexandres/lexvec<br>
Download Word Embeddings: https://www.dropbox.com/s/kguufyc2xcdi8yk/lexvec.enwiki%2Bnewscrawl.300d.W.pos.vectors.gz?dl=1<br>
<br>
File Used: lexvec.enwiki+newscrawl.300d.W.pos.vectors<br>
The vectors have 300 dimensions and were trained on a corpus of 7 billion tokens drawn from the Wikipedia 2015 dataset and NewsCrawl.<br>
The file is placed in the same directory as that of the notebook.<br>

#### Setup

In [73]:
#Loading the LexVec embeddings in word2vec format
lexvec = KeyedVectors.load_word2vec_format('lexvec.enwiki+newscrawl.300d.W.pos.vectors', binary = False)

#### Results:

In [74]:
#Average execution time is 2 minutes for each task
print("LexVec model:")

#Capital World Task
cap_world = preprocess(cap_world)
y_act = [1]*len(cap_world)
y_pred = calculate_result(cap_world, lexvec)
print("The accuracy of the model on the 'Capital-World' task: ", calc_accuracy(y_pred, y_act)*100)

#Currency Task
cur = preprocess(cur)
y_act = [1]*len(cur)
y_pred = calculate_result(cur, lexvec)
print("The accuracy of the model on the 'Currency' task: ", calc_accuracy(y_pred, y_act)*100)

#City State Task
city_state = preprocess(city_state)
y_act = [1]*len(city_state)
y_pred = calculate_result(city_state, lexvec)
print("The accuracy of the model on the 'City-State' task: ", calc_accuracy(y_pred, y_act)*100)

#Family Task
fam = preprocess(fam)
y_act = [1]*len(fam)
y_pred = calculate_result(fam, lexvec)
print("The accuracy of the model on the 'Family' task: ", calc_accuracy(y_pred, y_act)*100)

#Adjective adverb task
adj_adv = preprocess(adj_adv)
y_act = [1]*len(adj_adv)
y_pred = calculate_result(adj_adv, lexvec)
print("The accuracy of the model on the 'Adjective-Adverb' task: ", calc_accuracy(y_pred, y_act)*100)

#Opposites task
opp = preprocess(opp)
y_act = [1]*len(opp)
y_pred = calculate_result(opp, lexvec)
print("The accuracy of the model on the 'Opposites' task: ", calc_accuracy(y_pred, y_act)*100)

#Comparitives task
comp = preprocess(comp)
y_act = [1]*len(comp)
y_pred = calculate_result(comp, lexvec)
print("The accuracy of the model on the 'Comparitives' task: ", calc_accuracy(y_pred, y_act)*100)

#Nationality-Adjective task
nat_adj = preprocess(nat_adj)
y_act = [1]*len(nat_adj)
y_pred = calculate_result(nat_adj, lexvec)
print("The accuracy of the model on the 'Nationality-Adjective' task: ", calc_accuracy(y_pred, y_act)*100)

LexVec model:
The accuracy of the model on the 'Capital-World' task:  94.36339522546419
The accuracy of the model on the 'Currency' task:  22.0554272517321
The accuracy of the model on the 'City-State' task:  72.63883259019052
The accuracy of the model on the 'Family' task:  87.74703557312253
The accuracy of the model on the 'Adjective-Adverb' task:  24.899193548387096
The accuracy of the model on the 'Opposites' task:  36.57635467980296
The accuracy of the model on the 'Comparitives' task:  87.31231231231232
The accuracy of the model on the 'Nationality-Adjective' task:  91.80737961225766


The LexVec model is considerably more accurate than the previous Glove model on these tasks. This may be due to the larger number of tokens and dimensions in the LexVec model (300 dimensions, versus 50 for the Glove file used). The LexVec model also takes noticeably longer to perform each task. However, it is still weak on the "Currency" (22.1%), "Adjective-Adverb" (24.9%) and "Opposites" (36.6%) tasks, and the "City-State" task (72.6%) also trails the best-performing tasks.
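
The per-task blocks in the cell above repeat the same four steps (preprocess, build the label list, predict, score). As a rough sketch, they could be folded into one loop. The function name `evaluate_tasks` and all the `_toy_*` helpers below are hypothetical stand-ins so that the example runs on its own; in the notebook, the real `preprocess`, `calculate_result` and `calc_accuracy` would be passed in instead.

```python
def evaluate_tasks(tasks, model, preprocess, calculate_result, calc_accuracy):
    """Return {task name: accuracy in percent} for each analogy task."""
    results = {}
    for name, questions in tasks.items():
        prepared = preprocess(questions)
        y_act = [1] * len(prepared)          # every question has a correct answer
        y_pred = calculate_result(prepared, model)
        results[name] = calc_accuracy(y_pred, y_act) * 100
    return results

# Hypothetical stand-ins (NOT the notebook's helpers) so the sketch is runnable.
def _toy_preprocess(questions):
    return [q.lower().split() for q in questions]

def _toy_calculate_result(prepared, model):
    # 1 if the toy "model" maps the third word to the expected fourth word
    return [1 if model.get(q[2]) == q[3] else 0 for q in prepared]

def _toy_accuracy(y_pred, y_act):
    return sum(p == a for p, a in zip(y_pred, y_act)) / len(y_act)

toy_model = {"paris": "france", "man": "woman"}
toy_tasks = {
    "Capital-World": ["Athens Greece Paris France", "Athens Greece Oslo Norway"],
    "Family": ["boy girl man woman"],
}

scores = evaluate_tasks(toy_tasks, toy_model, _toy_preprocess,
                        _toy_calculate_result, _toy_accuracy)
for name, score in scores.items():
    print(f"The accuracy of the model on the '{name}' task: ", score)
```

Keeping the task names and question lists in one dictionary also makes it easy to run the same set of tasks against a second model.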

In [75]:
#Experiment with Word2Vec

# w2v_model = KeyedVectors.load_word2vec_format(os.path.abspath("GoogleNews-vectors-negative300.bin"), binary=True)
# y_pred = calculate_result(cap_world, w2v_model)
# calc_accuracy(y_pred, y_act)

#Accuracy: 0.021441202475685234

In [76]:
#Experiment with FastText

# from gensim.models import KeyedVectors
# fasttext_model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')
# y_pred = calculate_result(cap_world, fasttext_model)
# calc_accuracy(y_pred, y_act)

#Accuracy: 0.16069849690539345

### Problem 8

One known problem with word embeddings is that
antonyms (words with meanings considered to be opposites) often have similar
embeddings. You can verify this by searching for the top 10 most similar words
to a few verbs like increase or enter that have clear antonyms (e.g., decrease
and exit, respectively) using the cosine similarity. Discuss why embeddings
might have this tendency.

We find the top 10 most similar words for a few such words using both LexVec and Glove.

In [77]:
lexvec.most_similar('increase')

[('decrease', 0.85441654920578),
 ('increased', 0.8349589109420776),
 ('increases', 0.7846347093582153),
 ('increasing', 0.7320786714553833),
 ('decreased', 0.6769896745681763),
 ('reduce', 0.6768884658813477),
 ('rise', 0.6671467423439026),
 ('reduced', 0.6613093614578247),
 ('decline', 0.6586759090423584),
 ('reduction', 0.6572977304458618)]

In [78]:
lexvec.most_similar('exit')

[('exits', 0.7411397695541382),
 ('westbound', 0.6021254062652588),
 ('southbound', 0.6006765365600586),
 ('eastbound', 0.5971554517745972),
 ('northbound', 0.5956331491470337),
 ('exiting', 0.5949065685272217),
 ('entrance', 0.5674291849136353),
 ('interchange', 0.5511788725852966),
 ('ramp', 0.5424832105636597),
 ('cloverleaf', 0.5038855075836182)]

In [79]:
glove_model.most_similar('exit')

[('exits', 0.7869449257850647),
 ('narrow', 0.6713175773620605),
 ('route', 0.6655716896057129),
 ('track', 0.6644213795661926),
 ('reaching', 0.6516510248184204),
 ('reach', 0.6463636755943298),
 ('final', 0.6396181583404541),
 ('passage', 0.6381533145904541),
 ('next', 0.6355715394020081),
 ('northbound', 0.6345898509025574)]

In [80]:
glove_model.most_similar('up')

[('down', 0.9523451924324036),
 ('out', 0.9315087795257568),
 ('while', 0.9285621047019958),
 ('back', 0.9047043919563293),
 ('off', 0.8998678922653198),
 ('put', 0.8958144187927246),
 ('just', 0.8927414417266846),
 ('away', 0.8920525908470154),
 ('over', 0.8814347386360168),
 ('to', 0.8726335763931274)]

In [81]:
glove_model.most_similar('refuse')

[('refusing', 0.883618950843811),
 ('insist', 0.8524091839790344),
 ('accept', 0.8361102938652039),
 ('intend', 0.822323203086853),
 ('obliged', 0.8216915130615234),
 ('cannot', 0.8049349188804626),
 ('must', 0.8007821440696716),
 ('ask', 0.7897180318832397),
 ('willing', 0.7879184484481812),
 ('agree', 0.7838754653930664)]

From the above examples, we can see that antonyms tend to have similar word vectors: "decrease" is the nearest neighbour of "increase" in LexVec, "down" is the nearest neighbour of "up" in Glove, and "accept" and "agree" both appear among the neighbours of "refuse". This is because antonyms often occur in the same contexts and play the same role in sentences. For example, in "Going up the stairs" and "Going down the stairs", "up" and "down" have the same function and occur in the same context (traversing the stairs). Word embeddings such as Word2Vec, Glove and LexVec are trained on contextual similarity, so they capture that two words are used alike but carry little information about their polarity with respect to each other.
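
To make the argument concrete, here is a minimal sketch that represents each word by raw co-occurrence counts from a tiny hand-written corpus. This is a deliberately simplified stand-in for how Glove or LexVec are actually trained, and the corpus, window size and function names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words that appear within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus (an assumption): "up" and "down" appear in identical contexts.
toy_corpus = [
    "going up the stairs",
    "going down the stairs",
    "prices went up today",
    "prices went down today",
    "she ate an apple today",
]

vecs = cooccurrence_vectors(toy_corpus)
sim_up_down = cosine(vecs["up"], vecs["down"])
sim_up_apple = cosine(vecs["up"], vecs["apple"])
print("up vs down :", sim_up_down)
print("up vs apple:", sim_up_apple)
```

In this toy corpus "up" and "down" have exactly the same context counts, so their cosine similarity is 1 (up to floating-point rounding), while "up" and "apple" share only one context word. Nothing in the counts records that the two words point in opposite directions.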

### Problem 9

Design two new types of analogy tests that are not part
of Mikolov’s analogy dataset. You will create your own test questions (3
questions for each type, so in total 6 new questions). Report how well the two
sets of embeddings perform on your test questions. You’re encouraged to be
adversarial so that the embeddings might get an accuracy of zero! Discuss any
interesting observations you have made in the process.

Field-Profession <br>
music : musician :: science : scientist<br>
medicine : doctor :: art : artist<br>
comedy : comedian :: war : warrior<br>
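
A question of the form a : b :: c : ? is commonly answered with vector arithmetic: compute vec(b) - vec(a) + vec(c) and return the nearest remaining word by cosine similarity. The sketch below illustrates this on hand-made toy vectors. The vectors and the function name `solve_analogy` are assumptions for illustration, and the notebook's `calculate_result` may work differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Hand-made toy vectors (hypothetical): axis 0 = "music", axis 1 = "science",
# axis 2 = "is a person working in the field".
toy_vectors = {
    "music": [1.0, 0.0, 0.0],
    "musician": [1.0, 0.0, 1.0],
    "science": [0.0, 1.0, 0.0],
    "scientist": [0.0, 1.0, 1.0],
    "artist": [0.7, 0.7, 1.0],
}

answer = solve_analogy("music", "musician", "science", toy_vectors)
print("music : musician :: science :", answer)
```

The question words themselves are excluded from the candidates because the target vector often lies closest to one of them.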

In [82]:
#Creating a list of Field - Professional analogy questions
professions = ['music musician science scientist',
'medicine doctor art artist',
'comedy comedian war warrior']

In [83]:
#Using the Glove model
professions = preprocess(professions)
y_act = [1]*len(professions)
y_pred = calculate_result(professions, glove_model)
print("Glove Model")
print("The accuracy of the model on the 'Field-Profession' task: ", calc_accuracy(y_pred, y_act)*100)

Glove Model
The accuracy of the model on the 'Field-Profession' task:  66.66666666666666


In [84]:
#Using the LexVec Model
professions = preprocess(professions)
y_act = [1]*len(professions)
y_pred = calculate_result(professions, lexvec)
print("LexVec Model")
print("The accuracy of the model on the 'Field-Profession' task: ", calc_accuracy(y_pred, y_act)*100)

LexVec Model
The accuracy of the model on the 'Field-Profession' task:  33.33333333333333


Male Animal - Female Animal<br>
peacock : peahen :: stallion : mare<br>
billy : nanny :: lion : lioness<br>
bull : cow :: boar : sow<br>

In [85]:
#Creating a list of Male - Female analogy questions
male_female = ['peacock peahen stallion mare',
'billy nanny lion lioness',
'bull cow boar sow']

In [86]:
#Using the Glove model
male_female = preprocess(male_female)
y_act = [1]*len(male_female)
y_pred = calculate_result(male_female, glove_model)
print("Glove Model")
print("The accuracy of the model on the 'Male Animal - Female Animal' task: ", calc_accuracy(y_pred, y_act)*100)

Glove Model
The accuracy of the model on the 'Male Animal - Female Animal' task:  0.0


In [87]:
#Using the LexVec model
male_female = preprocess(male_female)
y_act = [1]*len(male_female)
y_pred = calculate_result(male_female, lexvec)
print("LexVec Model")
print("The accuracy of the model on the 'Male Animal - Female Animal' task: ", calc_accuracy(y_pred, y_act)*100)

LexVec Model
The accuracy of the model on the 'Male Animal - Female Animal' task:  0.0


The above results show that the pretrained embeddings we used are not very good at capturing the actual meanings of words. The vectors rely entirely on context, that is, on how and where each word occurs in the training corpus. The models make some correct predictions on the first analogy type, "Field-Profession", because these words occur often in everyday text, which is what the models are trained on. With only three questions per type, each question accounts for a third of the accuracy, so the gap between Glove (2 of 3 correct) and LexVec (1 of 3 correct) is a single question.<br>
Both models fail completely on the second analogy type. Words such as "peahen", "boar" and "sow" are quite rare and might not occur in the same contexts as their counterparts, and "billy" and "nanny" may be dominated by other senses (a given name and a childcare worker). Therefore, both models have an accuracy of 0.<br>
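
One way to probe the rare-word explanation is to check which question words are missing from a model's vocabulary before scoring. The sketch below does this against any container that supports `in`. The function name `split_by_coverage` and the toy vocabulary are hypothetical, and whether "peahen" is actually missing from either model is not something the results above establish.

```python
def split_by_coverage(questions, vocab):
    """Split analogy questions into fully covered ones and ones with missing words."""
    covered, missing = [], {}
    for question in questions:
        words = question.lower().split()
        absent = [w for w in words if w not in vocab]
        if absent:
            missing[question] = absent
        else:
            covered.append(question)
    return covered, missing

# Toy vocabulary (an assumption): it deliberately lacks "peahen".
toy_vocab = {"peacock", "stallion", "mare", "billy", "nanny",
             "lion", "lioness", "bull", "cow", "boar", "sow"}

male_female_questions = ['peacock peahen stallion mare',
                         'billy nanny lion lioness',
                         'bull cow boar sow']

covered, missing = split_by_coverage(male_female_questions, toy_vocab)
print("Answerable questions:", covered)
print("Questions with out-of-vocabulary words:", missing)
```

If every word turns out to be in the vocabulary, the failures are more likely due to sparse or ambiguous contexts than to missing vectors.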

In [88]:
del glove_model
del lexvec