# **Solving the Definition Extraction Problem**


### **Approach 5: Using a Summed Word2Vec Model and Classifiers**

**Word2Vec** is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word vectors in which vectors that lie close together belong to words with similar meanings (based on the contexts they appear in), while vectors that lie far apart belong to words with differing meanings. For example, *strong* and *powerful* would be close together, whereas *strong* and *Paris* would be relatively far apart.

![alt text](https://www.smartcat.io/media/1395/3d_transparent.png?width=500&height=198.90795631825273)
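To make "close together" concrete, here is a minimal sketch that compares vectors by cosine similarity. The three-dimensional vectors below are hand-picked, hypothetical values used only for illustration; real Word2Vec vectors are learned from data and have many more dimensions.

```python
import math

# Hypothetical 3-dimensional vectors chosen by hand for illustration only;
# they are NOT taken from any trained Word2Vec model.
toy_vectors = {
    "strong":   [0.9, 0.8, 0.1],
    "powerful": [0.8, 0.9, 0.2],
    "Paris":    [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine_similarity(toy_vectors["strong"], toy_vectors["powerful"])
sim_far = cosine_similarity(toy_vectors["strong"], toy_vectors["Paris"])

# With these toy values the first pair scores higher than the second.
print(sim_close > sim_far)
```

With a loaded gensim model, the same kind of comparison is usually done through the model's own similarity methods rather than by hand.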

With the Word2Vec model, we can calculate a vector for each word in a document. But what if we want a vector for the entire document? One option is to look up the Word2Vec vector of every word in the document and then sum these word vectors into a single vector that represents the whole document.
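The summation idea can be sketched on a tiny scale as follows. The embedding table, the tokens, and the helper name `sum_document_vector` are all hypothetical and exist only for this illustration; the notebook itself uses the 300-dimensional GoogleNews vectors.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, made up for illustration.
toy_embeddings = {
    "enzyme":   np.array([1.0, 0.0, 2.0, 0.0]),
    "protein":  np.array([0.0, 1.0, 1.0, 0.0]),
    "catalyst": np.array([0.0, 0.0, 1.0, 3.0]),
}

def sum_document_vector(tokens, embeddings, dim):
    """Sum the vectors of all known tokens; unknown tokens are skipped."""
    doc_vector = np.zeros(dim)
    for token in tokens:
        if token in embeddings:
            doc_vector += embeddings[token]
    return doc_vector

# "zzz" is not in the table, so it contributes nothing to the sum.
tokens = ["enzyme", "protein", "catalyst", "zzz"]
doc_vector = sum_document_vector(tokens, toy_embeddings, dim=4)
print(doc_vector)
```

One consequence of summing (rather than averaging) is that longer documents tend to produce vectors with larger magnitudes.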

In [2]:
# Run this cell only if you are working on google colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Download GoogleNews word embeddings file
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

--2020-01-17 19:58:14--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.186.237
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.186.237|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-01-17 19:58:31 (92.9 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [0]:
!gunzip 'GoogleNews-vectors-negative300.bin.gz'

In [0]:
import numpy as np
from data_loader import DeftCorpusLoader
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

### **Build Word2Vec Model Using GoogleNews Word Embeddings File**

The GoogleNews word embeddings file contains vector representations for 3 million words and phrases from Google News. Each word vector has 300 dimensions. We will load this file into gensim as a `KeyedVectors` model and use it to vectorize the words of each document.

In [6]:
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)



In [0]:
def get_documents_vectors(parsed_documents):
  """
  Get the vector representation of each parsed document (preprocessed, cleaned and tokenized).

  Args:
    parsed_documents: List of tokenized documents.

  Returns:
    vectors: List with one summed word vector per document in parsed_documents.
  """
  vectors = []
  for parsed_document in parsed_documents:

    # Initialize a zero vector with the embedding size (300 for GoogleNews)
    temp_vector = np.zeros(model.vector_size)
    for token in parsed_document:
      # Skip tokens that are not in the model vocabulary
      if token in model:

        # Add the vector of the token to the document vector
        temp_vector = np.add(temp_vector, model.get_vector(token))
    vectors.append(temp_vector)
  return vectors

### **Load DeftEval Training & Dev Data**

Note: because the code is executed on Google Colab, the data path is rooted in the mounted drive. The path needs to be changed if the code is executed on a local machine.

In [0]:
deft_loader = DeftCorpusLoader("drive/My Drive/DeftEval/deft_corpus/data")
trainframe, devframe = deft_loader.load_classification_data()

Preprocess the training and dev data (stop-word removal, stemming and tokenizing).

In [0]:
deft_loader.preprocess_data(devframe)
deft_loader.clean_data(devframe)

deft_loader.preprocess_data(trainframe)
deft_loader.clean_data(trainframe)

Get the vector representation of each document in the training and dev data.

In [0]:
train_vectors = get_documents_vectors(trainframe['Parsed'])
dev_vectors = get_documents_vectors(devframe['Parsed'])

### **Apply Classification Algorithms**

For each classifier, the **F1-score** and **accuracy** are reported on the dev set.
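As a reminder of how these two metrics are defined, the sketch below computes them from confusion-matrix counts. The counts are hypothetical numbers chosen for easy arithmetic and are not taken from the DeftEval results.

```python
# Hypothetical confusion-matrix counts for a binary classifier
# (illustrative values only, not DeftEval results).
true_positives = 6
false_positives = 2
false_negatives = 4
true_negatives = 8

# Precision: share of predicted positives that are correct.
precision = true_positives / (true_positives + false_positives)

# Recall: share of actual positives that were found.
recall = true_positives / (true_positives + false_negatives)

# F1-score: harmonic mean of precision and recall.
f1_score = 2 * precision * recall / (precision + recall)

# Accuracy: share of all predictions that are correct.
total = true_positives + false_positives + false_negatives + true_negatives
accuracy = (true_positives + true_negatives) / total

print(round(precision, 2), round(recall, 2), round(f1_score, 2), round(accuracy, 2))
```

Because the classes here are imbalanced (510 versus 275 dev sentences), the per-class F1-score for class 1 is often more informative than overall accuracy.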

**1. Naive Bayes Algorithm**

In [11]:
gnb = GaussianNB()
test_predict = gnb.fit(train_vectors, trainframe['HasDef']).predict(dev_vectors)
print(metrics.classification_report(list(devframe["HasDef"]), test_predict))

              precision    recall  f1-score   support

           0       0.69      0.79      0.74       510
           1       0.48      0.35      0.41       275

    accuracy                           0.64       785
   macro avg       0.59      0.57      0.57       785
weighted avg       0.62      0.64      0.62       785



**2. Decision Tree**

In [12]:
decision_tree = tree.DecisionTreeClassifier(class_weight="balanced")
test_predict = decision_tree.fit(train_vectors, trainframe['HasDef']).predict(dev_vectors)
print(metrics.classification_report(list(devframe["HasDef"]), test_predict))

              precision    recall  f1-score   support

           0       0.74      0.73      0.74       510
           1       0.52      0.53      0.52       275

    accuracy                           0.66       785
   macro avg       0.63      0.63      0.63       785
weighted avg       0.66      0.66      0.66       785



**3. Logistic Regression**

In [13]:
test_predict = LogisticRegression(class_weight="balanced", random_state=0).fit(train_vectors, trainframe['HasDef']).predict(dev_vectors)
print(metrics.classification_report(list(devframe["HasDef"]), test_predict))

              precision    recall  f1-score   support

           0       0.77      0.74      0.76       510
           1       0.56      0.60      0.58       275

    accuracy                           0.69       785
   macro avg       0.66      0.67      0.67       785
weighted avg       0.70      0.69      0.69       785



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
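The warning above says the solver reached its iteration limit and suggests either raising `max_iter` or scaling the data. A minimal sketch of both suggestions is shown below, assuming a `StandardScaler` in front of the classifier and `max_iter=1000` (an arbitrary choice). The synthetic data is only a stand-in for the summed document vectors, and the scores reported above would likely change if this were applied to the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration): 200 samples,
# 20 features with large magnitudes, loosely mimicking unscaled summed vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20)) * 50.0
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Scale features to zero mean and unit variance, then fit logistic regression
# with a higher iteration limit than the default.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", random_state=0, max_iter=1000),
)
clf.fit(X, y)
predictions = clf.predict(X)
print(predictions.shape)
```

In the notebook, the same pipeline could be fitted on `train_vectors` and `trainframe['HasDef']` in place of the synthetic `X` and `y`.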
