# **Solving the Definition Extraction Problem**


### **Approach 3: Using Doc2Vec model and Classifiers.**

**Doc2Vec** is a model that represents each document as a vector. Its goal is to create a numeric representation of a document regardless of its length: the input documents can vary in length, while the output is always a fixed-length vector.
The design of Doc2Vec is based on Word2Vec. Unlike words, however, documents have no fixed logical structure, so a different method is needed. There are two implementations:

1.   Paragraph Vector - Distributed Memory (PV-DM)
2.   Paragraph Vector - Distributed Bag of Words (PV-DBOW)

**PV-DM** is analogous to the Word2Vec continuous bag-of-words (CBOW) model. Instead of using only the context words to predict the next word, it adds another feature vector that is unique to the document. When the word vectors W are trained, the document vector D is trained as well, and at the end of training it holds a numeric representation of the document.

![alt text](https://quantdare.com/wp-content/uploads/2019/08/06.png)


**PV-DBOW** is analogous to the Word2Vec skip-gram model. Instead of predicting the next word from its context, it uses the document vector alone to predict the words in the document.

![alt text](https://quantdare.com/wp-content/uploads/2019/08/07.png)
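
The difference between the two setups described above may be easier to see from the training examples each one would produce. The sketch below is only an illustration in plain Python, not how gensim implements training: the function names, the symmetric context window, and the toy tokens are all assumptions made for this example.

```python
def pv_dm_examples(doc_id, tokens, window=2):
    """PV-DM sketch: (document id + surrounding context words) -> centre word."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        # The document id is an extra input next to the context words.
        examples.append(((doc_id, left + right), target))
    return examples


def pv_dbow_examples(doc_id, tokens):
    """PV-DBOW sketch: the document id alone -> each word in the document."""
    return [(doc_id, target) for target in tokens]


# Toy document (hypothetical tokens, already preprocessed).
tokens = ["moon", "paint", "smile", "face"]
dm = pv_dm_examples("doc_1", tokens, window=1)
dbow = pv_dbow_examples("doc_1", tokens)

print(dm[1])    # (('doc_1', ['moon', 'smile']), 'paint')
print(dbow[1])  # ('doc_1', 'paint')
```

In PV-DM the document id is one input among several, while in PV-DBOW it is the only input, which is why PV-DBOW does not need the word order inside the document.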


Note: it is recommended to use a combination of both algorithms to infer the vector representation of a document.
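
One common way to combine the two algorithms is to concatenate the vector from a PV-DM model with the vector from a PV-DBOW model for the same document. The sketch below assumes both vectors have already been inferred; `combine_vectors` and the numbers are hypothetical stand-ins, and this notebook itself trains only a single PV-DM model.

```python
def combine_vectors(dm_vector, dbow_vector):
    """Concatenate a PV-DM vector and a PV-DBOW vector for one document."""
    return list(dm_vector) + list(dbow_vector)


# Hypothetical vectors standing in for the output of two trained models.
dm_vector = [0.1, -0.2, 0.3]
dbow_vector = [0.5, 0.0]

combined = combine_vectors(dm_vector, dbow_vector)
print(len(combined))  # 5
```

The combined vector has the sum of the two dimensionalities, so a downstream classifier sees features from both models.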



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
!unzip 'drive/My Drive/wikipedia-movie-plots.zip'

Archive:  drive/My Drive/wikipedia-movie-plots.zip
  inflating: wiki_movie_plots_deduped.csv  


In [4]:
import os
import nltk
import pandas as pd 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from data_loader import DeftCorpusLoader
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### **Load Doc2Vec Model Training Data**

In [7]:
# Load the Wikipedia movie plots; column 7 holds the plot text.
with open('wiki_movie_plots_deduped.csv') as data:
  corpus_list = pd.read_csv(data, sep=",", header=None)
# Skip the first row, which is the CSV header read in as data.
corpus_list = corpus_list[7].tolist()[1:]
print("Corpus length: ", len(corpus_list))

Corpus length:  34886


In [0]:
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()
quotes_list = ["``", "\"\"", "''"]
train_corpus = []
for i, sentence in enumerate(corpus_list):

  # Lowercase the plot text and split it into tokens
  tokens = word_tokenize(sentence.lower())
  processed_tokens = []
  for token in tokens:
    # Skip digits, stop words, single characters, and quote tokens
    if not token.isdigit():
      if token not in stop_words and len(token) > 1 and token not in quotes_list:

        # Keep the stemmed form of every remaining token
        processed_tokens.append(porter.stem(token))
  # Tag each document with its index in the corpus
  train_corpus.append(TaggedDocument(words=processed_tokens, tags=[str(i)]))

In [27]:
train_corpus[:5]

[TaggedDocument(words=['bartend', 'work', 'saloon', 'serv', 'drink', 'custom', 'fill', 'stereotyp', 'irish', 'man', "'s", 'bucket', 'beer', 'carri', 'nation', 'follow', 'burst', 'insid', 'assault', 'irish', 'man', 'pull', 'hat', 'eye', 'dump', 'beer', 'head', 'group', 'begin', 'wreck', 'bar', 'smash', 'fixtur', 'mirror', 'break', 'cash', 'regist', 'bartend', 'spray', 'seltzer', 'water', 'nation', "'s", 'face', 'group', 'policemen', 'appear', 'order', 'everybodi', 'leav'], tags=['0']),
 TaggedDocument(words=['moon', 'paint', 'smile', 'face', 'hang', 'park', 'night', 'young', 'coupl', 'walk', 'past', 'fenc', 'learn', 'rail', 'look', 'moon', 'smile', 'embrac', 'moon', "'s", 'smile', 'get', 'bigger', 'sit', 'bench', 'tree', 'moon', "'s", 'view', 'block', 'caus', 'frown', 'last', 'scene', 'man', 'fan', 'woman', 'hat', 'moon', 'left', 'sky', 'perch', 'shoulder', 'see', 'everyth', 'better'], tags=['1']),
 TaggedDocument(words=['film', 'minut', 'long', 'compos', 'two', 'shot', 'first', 'girl',

### **Train Doc2Vec Model Based on Wikipedia Movie Plots.**
First, we define the attributes of the Doc2Vec model:


*   **Vector Size:** Dimensionality of the document feature vectors.
*   **Min Count:** Ignores all words with a total frequency lower than this.
*   **Epochs:** Number of iterations (epochs) over the corpus.
*   **Workers:** Number of worker threads used to train the model (faster training on multicore machines).

Second, build the **vocabulary** from the training corpus (the processed Wikipedia movie plots). Finally, train the model on the training corpus.

Note: the algorithm used by default is PV-DM.
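
To make the effect of **Min Count** concrete, the following sketch filters a toy vocabulary by total frequency. It is a plain-Python approximation of the idea only: `filter_vocabulary` and the toy documents are made up for this example, and gensim's own vocabulary building does more than this.

```python
from collections import Counter


def filter_vocabulary(documents, min_count=2):
    """Keep only tokens whose total frequency across all documents is >= min_count."""
    counts = Counter(token for doc in documents for token in doc)
    return {token for token, n in counts.items() if n >= min_count}


# Toy corpus of already-tokenized documents (hypothetical data).
docs = [["moon", "smile", "face"], ["moon", "night"], ["smile", "moon"]]

vocab = filter_vocabulary(docs, min_count=2)
print(sorted(vocab))  # ['moon', 'smile']
```

With `min_count=2`, the tokens that occur only once in the whole corpus ("face" and "night") are dropped and never receive a word vector.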

In [0]:
model = Doc2Vec(vector_size=50, min_count=2, epochs=40, workers=8)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

### **Load DeftEval Training & Dev Data**

Note: because the code is executed on Google Colab, the data path is rooted in the mounted drive. The path needs to be changed if the code is run on a local machine.
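
One possible way to avoid editing the path by hand is to fall back to a local directory when the Colab path does not exist. This is only a sketch: `resolve_data_root` is a hypothetical helper, and `LOCAL_ROOT` is an assumed location that has to be adjusted to wherever the corpus lives on your machine.

```python
import os

# Path used in this notebook when Google Drive is mounted in Colab.
COLAB_ROOT = "drive/My Drive/DeftEval/deft_corpus/data"
# Hypothetical local location of the same data; adjust to your machine.
LOCAL_ROOT = "deft_corpus/data"


def resolve_data_root(colab_root=COLAB_ROOT, local_root=LOCAL_ROOT):
    """Return the Colab path if that directory exists, otherwise the local fallback."""
    return colab_root if os.path.isdir(colab_root) else local_root


data_root = resolve_data_root()
print(data_root)
```

The resulting `data_root` could then be passed to `DeftCorpusLoader` in place of the hard-coded string.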

In [0]:
deft_loader = DeftCorpusLoader("drive/My Drive/DeftEval/deft_corpus/data")
trainframe, devframe = deft_loader.load_classification_data()

In [0]:
deft_loader.preprocess_data(devframe)
deft_loader.clean_data(devframe)
dev_vectors = []

# Create dev (evaluation) data vectors from Doc2Vec model
for parsed_list in devframe["Parsed"]:
  dev_vectors.append(model.infer_vector(parsed_list))

In [0]:
deft_loader.preprocess_data(trainframe)
deft_loader.clean_data(trainframe)
train_vectors=[]

# Create training data vectors from Doc2Vec model
for parsed_list in trainframe["Parsed"]:
  train_vectors.append(model.infer_vector(parsed_list))

### **Apply Classification Algorithms**

For each classifier, the **F1-score** and **accuracy** are calculated.
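
As a reminder of what these two numbers measure, the sketch below computes precision, recall, F1-score, and accuracy by hand for a binary task. The helper `binary_report` and the toy labels are illustrative assumptions; the reports in this notebook come from `metrics.classification_report`, not from this function.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels (made up): 3 positive sentences, 5 negative ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

report = binary_report(y_true, y_pred)
print(report)
```

On this toy input the accuracy is 0.625 while the F1-score of the positive class is only 0.4, which shows why accuracy alone can look acceptable on imbalanced data even when few positive examples are found.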

**1. Naive Bayes Algorithm**

In [32]:
gnb = GaussianNB()
test_predict = gnb.fit(train_vectors, trainframe['HasDef']).predict(dev_vectors)
print(metrics.classification_report(list(devframe["HasDef"]), test_predict))

              precision    recall  f1-score   support

           0       0.68      0.84      0.75       511
           1       0.51      0.30      0.38       283

    accuracy                           0.65       794
   macro avg       0.60      0.57      0.57       794
weighted avg       0.62      0.65      0.62       794



**2. Decision Tree Algorithm**

In [34]:
decision_tree = tree.DecisionTreeClassifier()
test_predict = decision_tree.fit(train_vectors, trainframe['HasDef']).predict(dev_vectors)
print(metrics.classification_report(list(devframe["HasDef"]), test_predict))

              precision    recall  f1-score   support

           0       0.67      0.67      0.67       511
           1       0.41      0.41      0.41       283

    accuracy                           0.58       794
   macro avg       0.54      0.54      0.54       794
weighted avg       0.58      0.58      0.58       794



**3. Logistic Regression Algorithm**

In [36]:
test_predict = LogisticRegression(random_state=0).fit(train_vectors, trainframe['HasDef']).predict(dev_vectors)
print(metrics.classification_report(list(devframe["HasDef"]), test_predict))

              precision    recall  f1-score   support

           0       0.66      0.97      0.78       511
           1       0.61      0.10      0.17       283

    accuracy                           0.66       794
   macro avg       0.64      0.53      0.47       794
weighted avg       0.64      0.66      0.56       794

