# Sentiment analysis using word and document embeddings. IMDB dataset
Link to Google Colab:

https://drive.google.com/file/d/1ECvHDLo5j27xOEg38RkjfEY6tKym4tUM/view?usp=sharing

## 1. Theoretical part

### Word and Document Embeddings

### Bag-of-words Model
Early state-of-the-art document representations were based on the <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">bag-of-words model</a>, which represents each input document as a fixed-length vector. For example, borrowing from the Wikipedia article, the two documents  
(1) `John likes to watch movies. Mary likes movies too.`  
(2) `John also likes to watch football games.`  
are used to construct a length 10 list of words  
`["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]`  
so then we can represent the two documents as fixed length vectors whose elements are the frequencies of the corresponding words in our list  
(1) `[1, 2, 1, 1, 2, 1, 1, 0, 0, 0]`  
(2) `[1, 1, 1, 1, 0, 0, 0, 1, 1, 1]`  
Bag-of-words models are surprisingly effective but still lose information about word order. Bag of <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a> models consider word phrases of length n to represent documents as fixed-length vectors to capture local word order but suffer from data sparsity and high dimensionality.
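
The example above can be reproduced in a few lines of plain Python. This is a minimal sketch: the helper names `simple_tokenize` and `bow_vector` are illustrative, and the vocabulary is ordered by first appearance so that it matches the word list shown above.

```python
import re
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def simple_tokenize(text):
    # Keep alphabetic tokens only, so trailing periods are dropped.
    return re.findall(r"[A-Za-z]+", text)

# Vocabulary in order of first appearance across the documents.
vocab = []
for doc in docs:
    for token in simple_tokenize(doc):
        if token not in vocab:
            vocab.append(token)

def bow_vector(text):
    # Count each vocabulary word in the text; absent words get 0.
    counts = Counter(simple_tokenize(text))
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc) for doc in docs]
print(vocab)
print(vectors)
```

Both vectors have length 10 regardless of document length, which is the fixed-length property, and swapping two words in a document leaves its vector unchanged, which is the loss of word order.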

### `Word2Vec`
`Word2Vec` is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant to each other have differing meanings. For example, `strong` and `powerful` would be close together and `strong` and `Paris` would be relatively far. There are two versions of this model based on skip-grams (SG) and continuous-bag-of-words (CBOW), both implemented by the gensim `Word2Vec` class.


#### `Word2Vec` - Skip-gram Model
The skip-gram <a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">word2vec</a> model takes pairs (word1, word2) generated by moving a window across the text and trains a 1-hidden-layer neural network on a synthetic task: given an input word, predict a probability distribution over the words that appear near it. A virtual <a href="https://en.wikipedia.org/wiki/One-hot">one-hot</a> encoding of words goes through a 'projection layer' to the hidden layer; these projection weights are later interpreted as the word embeddings. So if the hidden layer has 300 neurons, this network will give us 300-dimensional word embeddings.
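
To make the pair generation concrete, here is a small sketch of the sliding window. The function name `skipgram_pairs` and the fixed symmetric window are assumptions for illustration; real implementations typically also subsample frequent words and vary the effective window size.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "john likes to watch movies".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Each pair is one training example: the first element is the network input and the second is the word it should assign a high probability to.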

#### `Word2Vec` - Continuous-bag-of-words Model
Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It is also a 1-hidden-layer neural network. The synthetic training task now uses the average of multiple input context words, rather than a single word as in skip-gram, to predict the center word. Again, the projection weights that turn one-hot words into averageable vectors, of the same width as the hidden layer, are interpreted as the word embeddings. 
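
The CBOW forward pass can be sketched with NumPy as follows. This is an illustration under simplifying assumptions: the weights are random rather than trained, the names `W_in`, `W_out` and `cbow_predict` are made up for this sketch, and a full softmax is used where practical implementations usually rely on negative sampling or hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["john", "likes", "to", "watch", "movies"]
word_to_id = {word: i for i, word in enumerate(vocab)}
embedding_dim = 4

# Projection weights: one row per word; these rows become the word embeddings.
W_in = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))
# Output weights: map the hidden vector to one score per vocabulary word.
W_out = rng.normal(scale=0.1, size=(embedding_dim, len(vocab)))

def cbow_predict(context_words):
    ids = [word_to_id[word] for word in context_words]
    hidden = W_in[ids].mean(axis=0)       # average of the context word vectors
    scores = hidden @ W_out               # one score per candidate center word
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()

# Context words surrounding the center word "to"
probs = cbow_predict(["likes", "watch"])
print(probs)
```

Training would adjust `W_in` and `W_out` so that the probability assigned to the true center word increases.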

However, Word2Vec on its own does not give us fixed-size vectors for longer texts.


### Paragraph Vector, aka gensim `Doc2Vec`
The straightforward approach of averaging each of a text's words' word-vectors creates a quick and crude document-vector that can often be useful. However, Le and Mikolov in 2014 introduced the <i>Paragraph Vector</i>, which usually outperforms such simple-averaging.
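
The averaging baseline is easy to sketch with NumPy. The three-dimensional vectors below are made-up toy values, and `average_vector` is an illustrative helper, not a gensim function.

```python
import numpy as np

# Toy word vectors (hypothetical values, for illustration only)
word_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 2.0, 0.0]),
    "bad": np.array([-1.0, 0.0, -2.0]),
}

def average_vector(tokens, word_vectors, dim=3):
    # Use only the tokens that have a vector; unknown words are skipped.
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return np.zeros(dim)  # no known words: fall back to zeros
    return np.mean(known, axis=0)

doc_vec = average_vector(["a", "good", "movie"], word_vectors)
print(doc_vec)
```

The result has the same dimensionality for any document length, but like bag-of-words it ignores word order.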

The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions, and is updated like other word-vectors, but we will call it a doc-vector. Gensim's `Doc2Vec` class implements this algorithm. 

#### Paragraph Vector - Distributed Memory (PV-DM)
This is the Paragraph Vector model analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both context word-vectors and the full document's doc-vector.

#### Paragraph Vector - Distributed Bag of Words (PV-DBOW)
This is the Paragraph Vector model analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)


## 2. Practical part

### 2.1 Use word embeddings for text classification - Logistic Regression

Implement a text classifier based on logistic regression that uses pre-trained word embeddings. Simply average the word embeddings of a sentence (average pooling of word vectors), then apply logistic regression to the resulting representation.

The process for using word embeddings as the initial embedding matrix involves first loading the
embeddings from the disk, then selecting the correct subset of embeddings for the words that are
actually present in the data, and finally setting the Embedding layer’s weight matrix as the loaded
subset.

If you use ``torch``, use ``torch.nn.Embedding`` to load the pre-trained word embeddings. Use the [GloVe](http://nlp.stanford.edu/data/wordvecs/glove.6B.zip) embeddings. Otherwise, you can use ``gensim`` and ``sklearn`` or similar packages.
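
The three steps (load, select the subset, set the weight matrix) might look like the sketch below. It is an illustration only: the `pretrained` dictionary stands in for vectors parsed from a GloVe file, and the `<pad>` and `<unk>` entries are an assumed convention, not something the task prescribes.

```python
import numpy as np

# Hypothetical stand-in for vectors loaded from disk (e.g. parsed from a GloVe text file).
pretrained = {
    "movie": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "good": np.array([0.4, 0.5, 0.6], dtype=np.float32),
    "paris": np.array([0.7, 0.8, 0.9], dtype=np.float32),
}
embedding_dim = 3

corpus = [["good", "movie"], ["good", "unknownword"]]

# Select only the words that occur in the data and have a pre-trained vector.
word_to_index = {"<pad>": 0, "<unk>": 1}
for doc in corpus:
    for word in doc:
        if word in pretrained and word not in word_to_index:
            word_to_index[word] = len(word_to_index)

# Build the weight matrix: row i holds the vector of the word with index i.
embedding_matrix = np.zeros((len(word_to_index), embedding_dim), dtype=np.float32)
for word, idx in word_to_index.items():
    if word in pretrained:
        embedding_matrix[idx] = pretrained[word]

def encode(doc):
    # Map words to row indices; unseen words share the <unk> row.
    return [word_to_index.get(word, word_to_index["<unk>"]) for word in doc]

print(embedding_matrix.shape, encode(["good", "unknownword"]))
```

With ``torch``, the resulting matrix can then be loaded via `torch.nn.Embedding.from_pretrained(torch.from_numpy(embedding_matrix), freeze=True)`. Note that "paris" is dropped because it never occurs in the data.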

In [4]:
import numpy as np
import statistics
from collections import defaultdict
import codecs
import re
import math
from math import pi,log, exp
import matplotlib.pyplot as plt

In [5]:
def generate_ngrams(texts, n):
    n_grams = []
    for text in texts:
        text = text.lower()        
        text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)        
        tokens = [token for token in text.split(" ") if token != ""]
        ngrams = zip(*[tokens[i:] for i in range(n)])
        n_grams.append([" ".join(ngram) for ngram in ngrams])
    return n_grams

def preprocess(texts):
    def replacer(check):
        if check.group(1) is not None:
            return '{} '.format(check.group(1))
        else:
            return ' {}'.format(check.group(2))

    comp = re.compile(r'^(\W+)|(\W+)$')
    
    return [text.split() for text in [" ".join([comp.sub(replacer, word) for word in i.split()]).lower() for i in texts]]

def tokenize(texts):
    return [text.split() for text in texts]

def read_file(textname):
    # Read a text file and return its lines without trailing newline characters
    with open(textname, 'r') as file:
        return [line.rstrip('\r\n') for line in file]


In [6]:
train_texts = read_file('train.texts')
test_texts = read_file('test.texts')
train_labels = read_file('train.labels')
train_labels_enc = [0 if i=='pos' else 1 for i in train_labels]
print(np.asarray(train_texts).shape, np.asarray(train_labels).shape)

(15000,) (15000,)


In [7]:
train_texts[0]

'If the myth regarding broken mirrors would be accurate, everybody involved in this production would now face approximately 170 years of bad luck, because there are a lot of mirrors falling to little pieces here. If only the script was as shattering as the glass, then "The Broken" would have been a brilliant film. Now it\'s sadly just an overlong, derivative and dull movie with only just a handful of remarkable ideas and memorable sequences. Sean Ellis made a very stylish and elegantly photographed movie, but the story is lackluster and the total absence of logic and explanation is really frustrating. I got into a discussion with a friend regarding the basic concept and "meaning" of the film. He thinks Ellis found inspiration in an old legend claiming that spotting your doppelganger is a foreboding of how you\'re going to die. Interesting theory, but I\'m not familiar with this legend and couldn\'t find anything on the Internet about this, neither. Personally, I just think "The Broken"

In [8]:
#Create sparse matrix
from scipy.sparse import csr_matrix
texts = preprocess(train_texts)
index1 = [0]
index2 = []
bow = []
vocab = {}
for text in texts:
    for word in text:
        index = vocab.setdefault(word, len(vocab))
        index2.append(index)
        bow.append(1)
    index1.append(len(index2))

sparse_matrix = csr_matrix((bow, index2, index1), dtype=int).toarray()
print(sparse_matrix.shape)
print(sparse_matrix)

(15000, 101334)
[[ 2 26  1 ...  0  0  0]
 [ 0  4  0 ...  0  0  0]
 [ 1  5  0 ...  0  0  0]
 ...
 [ 2  5  0 ...  0  0  0]
 [ 0 16  0 ...  0  0  0]
 [ 4 36  0 ...  1  1  1]]


In this matrix, the $i^{th}$ row is the bag-of-words vector of the $i^{th}$ review, and its $j^{th}$ component is the absolute frequency of the $j^{th}$ vocabulary token in that review.
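
A toy version of the same construction shows why appending a `1` for every token yields frequencies: `csr_matrix` sums entries that share a (row, column) position. The two mini-documents are made up for illustration.

```python
from scipy.sparse import csr_matrix

docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]

indptr, indices, data, vocab = [0], [], [], {}
for doc in docs:
    for word in doc:
        # Column index of the word; new words get the next free index.
        indices.append(vocab.setdefault(word, len(vocab)))
        data.append(1)
    # Row boundary: where the entries of the next document start.
    indptr.append(len(indices))

counts = csr_matrix((data, indices, indptr), dtype=int)
dense = counts.toarray()  # duplicate (row, column) entries are summed here
print(vocab)
print(dense)
```

The first row has a 2 in the "hello" column because that word occurs twice. Since `LogisticRegression.fit` in scikit-learn also accepts sparse matrices, calling `.toarray()` is optional and skipping it can save a lot of memory on a vocabulary this large.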

## Let's try the model on our sparse matrix

In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
model2 = LogisticRegression(max_iter = 300, penalty = 'l2')

In [11]:
%time model2.fit(sparse_matrix, np.asarray(train_labels_enc))



CPU times: user 17.9 s, sys: 19 s, total: 36.9 s
Wall time: 39.3 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=300,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [12]:
# Accuracy on the training data itself, not on a held-out set
preds = model2.predict(sparse_matrix)
(preds == train_labels_enc).mean()

0.9996666666666667

## Try bigrams

In [13]:
bigrams = generate_ngrams(train_texts,2)
bigrams[0][:10]

['if the',
 'the myth',
 'myth regarding',
 'regarding broken',
 'broken mirrors',
 'mirrors would',
 'would be',
 'be accurate',
 'accurate everybody',
 'everybody involved']

In [14]:
#Create sparse matrix
from scipy.sparse import csr_matrix
texts = generate_ngrams(train_texts,2)
index1 = [0]
index2 = []
bow = []
vocab = {}
for text in texts:
    for word in text:
        index = vocab.setdefault(word, len(vocab))
        index2.append(index)
        bow.append(1)
    index1.append(len(index2))

sparse_matrix_ngr = csr_matrix((bow, index2, index1), dtype=int).toarray()
print(sparse_matrix_ngr.shape)
print(sparse_matrix_ngr)

(15000, 963095)
[[1 1 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 1 1]]


In [466]:
model = LogisticRegression(max_iter=100)
%time model.fit(sparse_matrix_ngr, np.asarray(train_labels_enc))
preds = model.predict(sparse_matrix_ngr)
#mean accuracy
(preds == train_labels_enc).mean(dtype=np.float64)

NameError: name 'sparse_matrix_ngr' is not defined


# Gensim

In [478]:
import numpy as np

In [480]:
#tokenize and lowercase texts
def tokenize(texts):
    return [text.lower().split() for text in texts]

tokenized_train_data = tokenize(train_texts)
print(tokenized_train_data[0][:20])

['if', 'the', 'myth', 'regarding', 'broken', 'mirrors', 'would', 'be', 'accurate,', 'everybody', 'involved', 'in', 'this', 'production', 'would', 'now', 'face', 'approximately', '170', 'years']


# Word2Vec model

In [483]:
from gensim.models import Word2Vec
model = Word2Vec(tokenized_train_data, 
                 size=32,      # embedding vector size
                 min_count=100,  # consider only words that occurred at least 100 times
                 window=5).wv

In [485]:
model.get_vector('film')

array([-1.1927758e+00,  1.4351714e+00,  1.4090706e+00, -9.6739531e-01,
        2.2448234e-01, -9.9713367e-01, -4.6309307e-01, -2.3953633e-01,
       -1.5805256e-01, -5.0986469e-01, -7.7348888e-01, -5.0096147e-02,
        1.4075789e+00,  2.3532798e+00,  6.8942630e-01,  1.2317375e+00,
        1.6273854e+00, -1.2622250e+00, -2.4078004e+00, -7.9034513e-01,
       -2.0152804e-01,  2.4576237e+00, -1.1414950e+00,  1.4708486e+00,
       -4.1453013e+00,  1.6182142e+00,  3.0991731e-03,  4.3842873e-01,
       -2.9843295e+00,  1.7097746e+00,  2.4455978e-01, -2.5442025e-01],
      dtype=float32)

In [486]:
model.most_similar('oscar')

[('award', 0.9104125499725342),
 ('academy', 0.8780422210693359),
 ('won', 0.8151159286499023),
 ('nominated', 0.808489203453064),
 ('actor', 0.7820138335227966),
 ('director,', 0.7722467184066772),
 ('winning', 0.7290933132171631),
 ('awards', 0.7244893312454224),
 ('bruce', 0.7120931148529053),
 ('actress', 0.7081336975097656)]

In [488]:
import logging

class MeanEmbeddingVectorizer(object):
    """Represent each document as the average of its in-vocabulary word vectors."""

    def __init__(self, word_model):
        # word_model is a gensim KeyedVectors instance (e.g. Word2Vec(...).wv)
        self.word_model = word_model
        self.vector_size = word_model.vector_size

    def fit(self):
        return self

    def transform(self, docs):
        return self.word_average_list(docs)

    def word_average(self, sent):
        # Collect the vectors of the words known to the embedding model
        vectors = [self.word_model[word] for word in sent if word in self.word_model]

        if not vectors:
            # No known words: return a vector of zeros
            logging.warning("can't compute average for %s", sent)
            return np.zeros(self.vector_size, dtype=np.float32)
        return np.array(vectors).mean(axis=0)

    def word_average_list(self, docs):
        # One row per document
        return np.vstack([self.word_average(sent) for sent in docs])

In [489]:
mean_vec_tr = MeanEmbeddingVectorizer(model)
doc_vec = mean_vec_tr.transform(tokenized_train_data)


In [490]:
doc_vec[0]

array([ 0.06061316, -0.34751496,  0.08287247, -0.16194482,  0.39507714,
       -0.04011344,  0.07043894,  0.09254696, -0.23309317, -0.42761686,
       -0.3426229 , -0.12471391,  0.00167603,  0.2064087 ,  0.3749557 ,
       -0.04829623, -0.18988962, -0.2737884 ,  0.27389583,  0.14256309,
       -0.05938748,  0.03134052, -0.00522704, -0.17655511, -0.03553925,
        0.2433135 , -0.09339889, -0.10438342, -0.08998757,  0.09295134,
       -0.01837852, -0.06334517], dtype=float32)

In [491]:
#Check how it works with sklearn
%time model2.fit(doc_vec, np.asarray(train_labels_enc))

CPU times: user 134 ms, sys: 11.2 ms, total: 145 ms
Wall time: 150 ms




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=300,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [493]:
preds = model2.predict(doc_vec)
print('accuracy:', (preds == train_labels_enc).mean())

accuracy: 0.7615333333333333


# Glove Word Embedding

## GloVe: Global Vectors for Word Representation

### Introduction

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

### Differences to Word2Vec:

1. Neural network vs. regression objective: GloVe does not train a neural network, while word2vec does. GloVe minimizes a weighted least-squares loss between the dot product of two word embeddings (plus bias terms) and the logarithm of the words' co-occurrence count; the loss is optimized with stochastic gradient methods, much like fitting a regression. Word2vec instead trains a 1-hidden-layer neural network either to predict the context from a word (skip-gram) or to predict a word from its context (continuous bag of words).
2. Global information: word2vec learns from local context windows and has no explicit global statistics built in. GloVe first builds a global word-word co-occurrence matrix over the whole corpus and fits the embeddings to it. In principle this global information should help GloVe, although in practice the two methods tend to perform similarly.
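
For reference, the GloVe objective as given in the original GloVe paper can be written as follows, where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

The weighting function $f$ keeps very frequent pairs from dominating the loss and gives zero weight to pairs that never co-occur, since $f(0) = 0$.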

In [494]:
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors

In [495]:
glove_file = datapath('/Users/dochkinavika/Downloads/glove.6B/glove.6B.100d.txt')
word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)

(400001, 100)

In [496]:
glove_word_model = KeyedVectors.load_word2vec_format(word2vec_glove_file)
glove_mean_vec_tr = MeanEmbeddingVectorizer(glove_word_model)
glove_word_vec = glove_mean_vec_tr.transform(tokenized_train_data)


In [497]:
glove_word_vec.shape

(15000, 100)

In [498]:
#Check how it works with sklearn logregression
%time model2.fit(glove_word_vec, np.asarray(train_labels_enc))



CPU times: user 309 ms, sys: 12.4 ms, total: 322 ms
Wall time: 323 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=300,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [499]:
preds = model2.predict(glove_word_vec)
print('accuracy: ', (preds == train_labels_enc).mean())

accuracy:  0.7899333333333334


### 2.2 Use word embeddings for text classification - FFNN

Use the same pre-trained word embeddings, but replace logistic regression with a feedforward neural network. For both the logistic regression and the FFNN model, tune the hyperparameters, such as the learning rate.
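
One common way to tune the learning rate is to train once per candidate value and compare accuracy on a held-out validation split. The sketch below illustrates the loop with a small NumPy logistic regression on synthetic data; the data, the candidate values and the helper names are assumptions for illustration, not the setup used in this notebook.

```python
import numpy as np

def train_logreg(X, y, lr, epochs=200):
    """Full-batch gradient descent on the logistic loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient step on the weights
        b -= lr * np.mean(p - y)                # gradient step on the bias
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean((X @ w + b > 0).astype(int) == y))

# Synthetic, linearly separable toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) > 0).astype(int)
X_tr, y_tr = X[:300], y[:300]      # used for fitting
X_val, y_val = X[300:], y[300:]    # used only for model selection

val_scores = {}
for lr in [0.1, 0.01, 0.001]:
    w, b = train_logreg(X_tr, y_tr, lr)
    val_scores[lr] = accuracy(X_val, y_val, w, b)

best_lr = max(val_scores, key=val_scores.get)
print(val_scores, best_lr)
```

Selecting the learning rate on a validation split rather than on the final test data keeps the reported test accuracy an honest estimate.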

# Feedforward nn

In [502]:
import torch
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import torch.nn as nn

In [503]:
def train_data_process(X_train, X_test, y_train, y_test):
    X_train = torch.from_numpy(X_train)
    X_test = torch.from_numpy(X_test)
    y_train = torch.from_numpy(np.array(y_train))
    y_test = torch.from_numpy(np.array(y_test))
    return X_train, X_test, y_train, y_test

In [504]:
X_train, X_test, y_train, y_test = train_test_split(glove_word_vec, train_labels_enc, test_size=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_data_process(X_train, X_test, y_train, y_test)

In [525]:
class Feedforward(torch.nn.Module):
    def __init__(self, input_size, hidden_size_in, hidden_size_out):
        super(Feedforward, self).__init__()
        self.input_size = input_size
        self.hidden_size_in = hidden_size_in
        self.hidden_size_out = hidden_size_out
        self.fc1 = torch.nn.Linear(self.input_size, self.hidden_size_in)
        self.relu1 = torch.nn.ReLU()
        self.fc2 = nn.Linear(self.hidden_size_in, self.hidden_size_out)
        self.relu2 = nn.ReLU()  # defined but not applied in forward()
        self.fc3 = torch.nn.Linear(self.hidden_size_out, 2)  # two output classes
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        hidden = self.fc1(x)
        output = self.relu1(hidden)
        output = self.fc2(output)
        output = self.fc3(output)
        # Note: nn.CrossEntropyLoss expects raw logits, so this sigmoid is usually
        # omitted; it is kept because the accuracies reported here were obtained with it.
        output = self.sigmoid(output)
        return output

def train(eval_model, X, y, optimizer, loss_func=nn.CrossEntropyLoss(), num_epoches=200, loss_print=False):
    # Full-batch training: every epoch is a single gradient step on all of X
    for i in range(num_epoches):
        optimizer.zero_grad()
        y_predicted = eval_model(X)
        loss = loss_func(y_predicted, y)
        if loss_print:
            print(i, loss.item())

        loss.backward()
        optimizer.step()

def predict(eval_model, X):
    # Return the index of the highest-scoring class for each row
    with torch.no_grad():
        out = eval_model(X)
    return torch.max(out, 1)[1]

In [563]:
#define NN model
NN = Feedforward(X_train.shape[1], hidden_size_in = 500, hidden_size_out = 100)
optimizer = torch.optim.Adam(NN.parameters(), lr=0.01)

train(NN,X_train,y_train,optimizer,num_epoches=200)
y_pred = predict(NN,X_test)

In [564]:
print('Test accuracy: ', accuracy_score(y_test,y_pred))

Test accuracy:  0.7653333333333333


# 200d embeddings for NN

In [567]:
glove_file = datapath('/Users/dochkinavika/Downloads/glove.6B/glove.6B.200d.txt')
word2vec_glove_file = get_tmpfile("glove.6B.200d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)

(400001, 200)

In [569]:
glove_word_model = KeyedVectors.load_word2vec_format(word2vec_glove_file)
glove_mean_vec_tr = MeanEmbeddingVectorizer(glove_word_model)
glove_word_vec = glove_mean_vec_tr.transform(tokenized_train_data)


In [577]:
glove_word_vec.shape

(15000, 200)

In [578]:
X_train, X_test, y_train, y_test = train_test_split(glove_word_vec, train_labels_enc, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_data_process(X_train, X_test, y_train, y_test)

## learning rate = 0.01

In [579]:
#define NN model, learning rate = 0.01
NN = Feedforward(X_train.shape[1], hidden_size_in = 500, hidden_size_out = 100)
optimizer = torch.optim.Adam(NN.parameters(), lr=0.01)

train(NN,X_train,y_train,optimizer,num_epoches=500)

In [580]:
y_pred = predict(NN,X_test)
print('Test accuracy: ', accuracy_score(y_test,y_pred))

Test accuracy:  0.8053333333333333


## learning rate = 0.001

In [581]:
#define NN model, learning rate = 0.001
NN = Feedforward(X_train.shape[1], hidden_size_in = 500, hidden_size_out = 100)
optimizer = torch.optim.Adam(NN.parameters(), lr=0.001)

train(NN,X_train,y_train,optimizer,num_epoches=500)

In [582]:
y_pred = predict(NN,X_test)
print('Test accuracy: ', accuracy_score(y_test,y_pred))

Test accuracy:  0.8103333333333333


## learning rate = 0.0001

In [584]:
#define NN model, learning rate = 0.0001
NN = Feedforward(X_train.shape[1], hidden_size_in = 500, hidden_size_out = 100)
optimizer = torch.optim.Adam(NN.parameters(), lr=0.0001)

train(NN,X_train,y_train,optimizer,num_epoches=500)

In [585]:
y_pred = predict(NN,X_test)
print('Test accuracy: ', accuracy_score(y_test,y_pred))

Test accuracy:  0.8076666666666666


# 200d embeddings for Logistic Regression

In [586]:
%time model2.fit(glove_word_vec, np.asarray(train_labels_enc))



CPU times: user 637 ms, sys: 37.9 ms, total: 675 ms
Wall time: 683 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=300,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [587]:
preds = model2.predict(glove_word_vec)
print('accuracy: ', (preds == train_labels_enc).mean())

accuracy:  0.8183333333333334


# Takeaway: 

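200d GloVe embeddings work slightly better than 100d ones for both the feedforward network and logistic regression.

Among the learning rates tried for the feedforward network on 200d embeddings (0.01, 0.001, 0.0001), 0.001 gave the best test accuracy.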

### 2.3 Use of document embeddings for text classification 

Use ``gensim`` to obtain document embeddings for all reviews. Build a logistic regression model with ``sklearn`` that loads these embeddings for each document and performs the classification.

In [588]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [589]:
from sklearn import utils  # assumed source of utils.shuffle used in custom_train

class DocModel(object):

    def __init__(self, docs, **kwargs):
        # Create the Doc2Vec model and build its vocabulary from the tagged documents
        self.model = Doc2Vec(**kwargs)
        self.docs = docs
        self.model.build_vocab(self.docs)

    def custom_train(self, fixed_lr=False, fixed_lr_epochs=None):

        if not fixed_lr:
            # Let gensim handle all epochs and the learning-rate schedule
            self.model.train(self.docs,
                             total_examples=len(self.docs),
                             epochs=self.model.epochs)
        else:
            # Train one epoch at a time on reshuffled documents, lowering alpha manually
            for _ in range(fixed_lr_epochs):
                self.model.train(utils.shuffle(self.docs),
                                 total_examples=len(self.docs),
                                 epochs=1)
                self.model.alpha -= 0.002
                self.model.min_alpha = self.model.alpha  # keep the rate constant within an epoch

    def test_orig_doc_infer(self):
        # Sanity check: infer a vector for a random training document and list its nearest doc-vectors
        idx = np.random.randint(len(self.docs))
        print('idx: ' + str(idx))
        doc = [doc for doc in self.docs if doc.tags[0] == idx]
        inferred_vec = self.model.infer_vector(doc[0].words)
        print(self.model.docvecs.most_similar([inferred_vec]))  # wrap vec in a list

In [590]:
dm_args = {
    'dm': 1,
    'dm_mean': 1,
    'vector_size': 100,
    'window': 5,
    'negative': 5,
    'hs': 0,
    'min_count': 2,
    'sample': 0,
    'alpha': 0.025,
    'min_alpha': 0.025,
    'epochs': 100,
    'comment': 'alpha=0.025'
}

In [591]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(tokenized_train_data)]
documents[10]

TaggedDocument(words=['very', 'intelligent', 'humor', 'excellent', 'performing', 'i', "can't", 'believe', 'how', 'people', 'could', 'think', 'it', 'deserves', 'a', '1/10!', 'i', 'hope', 'this', 'movie', 'will', 'be', 'shown', 'everywhere', 'so', 'everyone', 'can', 'enjoy', 'it', 'if', 'you', 'ever', 'have', 'the', 'opportunity,', 'watch', 'it...', "don't", 'miss', 'it', 'there', 'is', 'a', 'part', 'when', 'the', 'principal', 'actors', 'are', 'driving', 'and', 'singing', '"happy', 'birthday"', 'and', '"el', 'payaso', 'plinplin"', '(an', 'argentinian', 'song', 'for', 'kids', '(i', 'think...', 'it', 'could', 'also', 'be', 'south', 'american,', "i'm", 'not', 'sure)).', 'this', 'two', 'songs', 'that', 'have', 'the', 'same', 'melody...', 'but', 'people', "don't", 'usually', 'realize', 'that...', "it's", 'just', 'grate!', 'i', 'tried', 'to', 'write', 'this', 'in', 'both', 'spanish', 'and', 'english,', 'because', "it's", 'an', 'argentinian', 'movie...', 'but', 'the', 'page', "wouldn't", 'allow

In [592]:
import pandas as pd
import os

In [593]:
dm = DocModel(docs=documents, **dm_args)

In [594]:
%time dm.custom_train()

CPU times: user 15min 35s, sys: 23.5 s, total: 15min 58s
Wall time: 7min 8s


In [595]:
# Save doc2vec
dm_doc_vec_ls = []
for i in range(len(dm.model.docvecs)):
    dm_doc_vec_ls.append(dm.model.docvecs[i])
    
dm_doc_vec = np.asarray(dm_doc_vec_ls)

In [596]:
dm_doc_vec.shape

(15000, 100)

In [600]:
X_train, X_test, y_train, y_test = train_test_split(dm_doc_vec, train_labels_enc, test_size=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_data_process(X_train, X_test, y_train, y_test)

In [598]:
NN = Feedforward(X_train.shape[1], hidden_size_in = 500, hidden_size_out = 100)
optimizer = torch.optim.Adam(NN.parameters(), lr=0.001)

train(NN,X_train,y_train,optimizer,num_epoches=400)

In [601]:
y_pred = predict(NN,X_test)
print('Test accuracy: ', accuracy_score(y_test,y_pred))

Test accuracy:  0.8274666666666667


In [602]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

In [603]:
logistic = LogisticRegression(random_state=1, multi_class='multinomial', solver='saga')

In [604]:
# try classification via stochastic gradient descent classifier.
sgd = SGDClassifier(loss='hinge',
                    verbose=1,
                    random_state=1,
                    learning_rate='invscaling',
                    eta0=1)

In [607]:
#Train Logistic regression
%time logistic.fit(dm_doc_vec, np.asarray(train_labels_enc))

CPU times: user 3.6 s, sys: 18.4 ms, total: 3.62 s
Wall time: 3.7 s




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=1, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)

In [608]:
preds = logistic.predict(dm_doc_vec)
(preds == train_labels_enc).mean()

0.8383333333333334

In [609]:
#train SGD
%time sgd.fit(dm_doc_vec, np.asarray(train_labels_enc))

-- Epoch 1
Norm: 2.28, NNZs: 100, Bias: 0.551541, T: 15000, Avg. loss: 2.037334
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 1.84, NNZs: 100, Bias: 0.604777, T: 30000, Avg. loss: 0.733670
Total training time: 0.01 seconds.
-- Epoch 3
Norm: 1.51, NNZs: 100, Bias: 0.679587, T: 45000, Avg. loss: 0.648935
Total training time: 0.01 seconds.
-- Epoch 4
Norm: 1.43, NNZs: 100, Bias: 0.744350, T: 60000, Avg. loss: 0.606516
Total training time: 0.02 seconds.
-- Epoch 5
Norm: 1.40, NNZs: 100, Bias: 0.793753, T: 75000, Avg. loss: 0.578942
Total training time: 0.02 seconds.
-- Epoch 6
Norm: 1.23, NNZs: 100, Bias: 0.842357, T: 90000, Avg. loss: 0.555265
Total training time: 0.03 seconds.
-- Epoch 7
Norm: 1.27, NNZs: 100, Bias: 0.893918, T: 105000, Avg. loss: 0.548313
Total training time: 0.03 seconds.
-- Epoch 8
Norm: 1.18, NNZs: 100, Bias: 0.929273, T: 120000, Avg. loss: 0.535399
Total training time: 0.04 seconds.
-- Epoch 9
Norm: 1.20, NNZs: 100, Bias: 0.940310, T: 135000, Avg. loss: 0.5191

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=1, fit_intercept=True,
              l1_ratio=0.15, learning_rate='invscaling', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=1, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=1, warm_start=False)

In [610]:
preds = sgd.predict(dm_doc_vec)
(preds == train_labels_enc).mean()

0.8303333333333334

### 3.2 Impact of dimensionality

Train the document embeddings from Section 2.3 with different numbers of dimensions and plot how classification accuracy depends on the number of dimensions.

# Results:

In [647]:
import pandas as pd
# Log Regression and SGD scores are accuracies on the training data;
# Feedforward NN scores are accuracies on the held-out split
data = [[0.761, '-', '-'], [0.789, 0.765, '-'], [0.818, 0.81, '-'], [0.838, 0.827, 0.83]]
pd.DataFrame(data,
             index=['Word2Vec, 32d', 'GloVe word embeddings, 100d', 'GloVe word embeddings, 200d', 'Doc2Vec, 100d'],
             columns=['Log Regression', 'Feedforward NN', 'SGD'])

| | Log Regression | Feedforward NN | SGD |
|---|---|---|---|
| Word2Vec, 32d | 0.761 | - | - |
| GloVe word embeddings, 100d | 0.789 | 0.765 | - |
| GloVe word embeddings, 200d | 0.818 | 0.81 | - |
| Doc2Vec, 100d | 0.838 | 0.827 | 0.83 |
