# Sentiment analysis 

# Introduction
Analyze & classify sentiment of text data, articles into positive or negative

# Objective
Sentiment analysis notebooks dives in very depth of various concepts, methods related to text analysis and understand the meaning of it semantically and/or syntactly. They are classified in the following five based notebooks based on different methods & tools used to analyze & classify text.

1. Sentiment Analysis with Text Blob, Word Cloud, Count Vectorizer, N-Gram
2. Sentiment Analysis using Doc2Vec, N-Gram & Phrase Modelling
3. Sentiment Analysis with Chi2 Square & PCA Dimension Reduction
4. Sentiment Analysis with Keras & Tensorflow
5. Sentiment Analysis with Keras & Tensorflow using Doc2Vec, Pretrained GloVe

# Due
## 2. Sentiment Analysis with Doc2Vec, N-Gram & Phrase Modelling

In [1]:
# from multiprocessing import Process

# # Multiprocessing to spawn processes using an API similar to threading module
#     proc = Process(target=model.run_pipeline, args=())

#     proc.start()
#     proc.join()

In [1]:
# Basic import

import re
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

from gensim.models import Doc2Vec
from gensim.models.doc2vec import LabeledSentence
from gensim.models.phrases import Phrases, Phraser

In [3]:
import multiprocessing

In [4]:
# Read TF dataframe

df = pd.read_hdf('./data/redstone.hdf')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1600000 entries, 0 to 1599999
Data columns (total 3 columns):
sentiment        1600000 non-null int64
text             1600000 non-null object
pre_clean_len    1600000 non-null int64
dtypes: int64(2), object(1)
memory usage: 48.8+ MB


Unnamed: 0,sentiment,text,pre_clean_len
0,0,awww that bummer you shoulda got david carr of...,115
1,0,is upset that he can not update his facebook b...,111
2,0,dived many times for the ball managed to save ...,89
3,0,my whole body feels itchy and like its on fire,47
4,0,no it not behaving at all mad why am here beca...,111


In [5]:
# Santitizing dataframe

df.dropna(inplace=True)
df.reset_index(drop=True,inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 3 columns):
sentiment        1600000 non-null int64
text             1600000 non-null object
pre_clean_len    1600000 non-null int64
dtypes: int64(2), object(1)
memory usage: 36.6+ MB


In [6]:
from sklearn import utils
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression

train = df.text
label = df.sentiment
SEED = 21

# Splitting data into train, test & validation sets
x_train, x_val_test, y_train, y_val_test = train_test_split(train, label, test_size=.02, random_state=SEED)

x_val, x_test, y_val, y_test = train_test_split(x_val_test, y_val_test, test_size=.5, random_state=SEED)



### Doc2Vec

Word2vec consists of two techniques – CBOW(Continuous bag of words) and Skip-gram model. Both of these techniques learn weights which act as word vector representations. 
With a corpus, CBOW model predicts the current word from a window of surrounding context words, while Skip-gram model predicts surrounding context words given the current word.

eg. "I love dogs". 
CBOW model tries to predict the word "love" when given "I", "dogs" as inputs.
Skip-gram model tries to predict "I", "dogs" when given the word "love" as input.

![title](images/word2vec.png)

The word vectors are actually the weights of the trained models, not the predicted results. After extracting the weights, such a vector comes to represent in some abstract way the ‘meaning’ of a word.

Doc2vec uses the same logic as word2vec, but applies it on the document level. According to Mikolov et al. (2014), "every paragraph is mapped to a unique vector, represented by a column in matrix D and every word is also mapped to a unique vector, represented by a column in matrix W. 

The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. 

The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph.

https://cs.stanford.edu/~quocle/paragraph_vector.pdf

![title](images/doc2vec.png)

In [7]:
# Labelling tweets using genism library for unsupervised learning

def label_tweets_unigram(tweets, label):
    result = []
    prefix = label
    
    # Split tweets & attach label with index
    for index, tweet in zip(tweets.index, tweets):
        result.append(LabeledSentence(tweet.split(), [prefix + '_%s' % index]))
    
    return result

In [8]:
# All cores of CPU

cores = multiprocessing.cpu_count()

For training Doc2Vec, I have used the whole data set. The rationale behind is that the doc2vec training is completely unsupervised and hence there is no need to hold out any (unlabelled) data.

"An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation"
https://arxiv.org/pdf/1607.05368.pdf

In [9]:
word_vec_train = label_tweets_unigram(df.text , 'all')
len(word_vec_train)

  if __name__ == '__main__':


1600000

In [20]:
word_vec_train[0:10]

[LabeledSentence(words=['awww', 'that', 'bummer', 'you', 'shoulda', 'got', 'david', 'carr', 'of', 'third', 'day', 'to', 'do', 'it'], tags=['all_0']),
 LabeledSentence(words=['is', 'upset', 'that', 'he', 'can', 'not', 'update', 'his', 'facebook', 'by', 'texting', 'it', 'and', 'might', 'cry', 'as', 'result', 'school', 'today', 'also', 'blah'], tags=['all_1']),
 LabeledSentence(words=['dived', 'many', 'times', 'for', 'the', 'ball', 'managed', 'to', 'save', 'the', 'rest', 'go', 'out', 'of', 'bounds'], tags=['all_2']),
 LabeledSentence(words=['my', 'whole', 'body', 'feels', 'itchy', 'and', 'like', 'its', 'on', 'fire'], tags=['all_3']),
 LabeledSentence(words=['no', 'it', 'not', 'behaving', 'at', 'all', 'mad', 'why', 'am', 'here', 'because', 'can', 'not', 'see', 'you', 'all', 'over', 'there'], tags=['all_4']),
 LabeledSentence(words=['not', 'the', 'whole', 'crew'], tags=['all_5']),
 LabeledSentence(words=['need', 'hug'], tags=['all_6']),
 LabeledSentence(words=['hey', 'long', 'time', 'no', '

### DBOW (Document Bag Of Words)

DBOW: This is the Doc2Vec model analogous to Skip-gram model in Word2Vec. The paragraph vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph.

In [10]:
# Initializing Distributed Bag Of Words parameters & building word vocabulary

dbow_ug_model = Doc2Vec(dm=0, vector_size=100, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
dbow_ug_model.build_vocab([w_v for w_v in tqdm(word_vec_train)])

100%|██████████| 1600000/1600000 [00:00<00:00, 3914399.94it/s]


One caveat of the way this algorithm is that, since the learning rate decrease over the course of iterating over the data, labels which are only seen in a single LabeledSentence during training will only be trained with a fixed learning rate. This frequently produces less than optimal results.

Below iteration implement explicit multiple-pass, alpha-reduction approach with added shuffling.

In [11]:
%%time

# Multiple epochs iterating over labels more than once with decreasing learning rate

for epoch in range(30):
    
    # Shuffling word_vec_train & reducing aplha over multiple passes
    dbow_ug_model.train(utils.shuffle([w_v for w_v in tqdm(word_vec_train)]), total_examples=len(word_vec_train), epochs=1)
    dbow_ug_model.alpha -= 0.002
    dbow_ug_model.min_alpha = dbow_ug_model.alpha

100%|██████████| 1600000/1600000 [00:00<00:00, 3751607.30it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4146295.09it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4147855.79it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4028767.00it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4132487.16it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4271302.23it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4237114.75it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4199004.26it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4193069.67it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4221891.42it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4336461.33it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4241836.43it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4211315.08it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4131311.82it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4146807.50it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4284621.

CPU times: user 27min 19s, sys: 1min 52s, total: 29min 11s
Wall time: 18min 59s


In [9]:
# Vectorize train, validation sets using above dbow_ug_model (Document Bag Of Words Unigram Model)

def vectorize(model, corpus, size):
    # Numpy zeros initialization
    vectors = np.zeros((len(corpus), size))
    
    for idx, count in zip(corpus.index, range(len(corpus.index))):
        prefix = 'all_' + str(idx)
        vectors[count] = model.docvecs[prefix]

    return vectors

In [13]:
# Vectorize train, validation sets

train_vecs_dbow = vectorize(dbow_ug_model, x_train, 100)
val_vecs_dbow = vectorize(dbow_ug_model, x_val, 100)

In [57]:
train_vecs_dbow[0:10]

array([[-1.10094115e-01, -4.17756796e-01, -2.21768126e-01,
         1.46800011e-01, -1.88945279e-01,  7.03278463e-03,
         5.17965779e-02,  1.32102147e-01, -3.78717393e-01,
         1.81373477e-01,  2.59546846e-01,  3.60601097e-01,
        -3.41447711e-01, -2.85278633e-02,  4.47911054e-01,
         6.52484521e-02,  3.95397097e-02,  2.68053282e-02,
        -1.71497181e-01,  3.02577972e-01,  7.94886146e-03,
         1.98269516e-01, -2.79568553e-01,  4.29900661e-02,
         4.18800950e-01, -3.09199154e-01, -1.86751425e-01,
         2.34446064e-01,  4.91926447e-02,  7.46935308e-02,
        -9.96179208e-02, -1.91968068e-01,  1.58625469e-01,
         1.78015947e-01,  2.22942159e-01,  3.40586193e-02,
         2.42498498e-02, -6.34907261e-02,  3.00550491e-01,
        -1.70043394e-01, -3.26719761e-01,  1.29000649e-01,
        -3.77572387e-01, -5.38290203e-01, -1.70218259e-01,
         6.27116263e-02,  5.01648366e-01, -3.81974764e-02,
        -7.97180384e-02,  6.05147064e-01, -2.96413392e-0

In [14]:
# Train a Logistic Regression model

clf = LogisticRegression()
clf.fit(train_vecs_dbow, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [15]:
clf.score(val_vecs_dbow, y_val)

0.7320625

DBOW model doesn't learn the semantic understanding of words but it's features obtained from it does a decent job with a simple Logistic Regression classifier.
The result doesn't seem to excel count vectorizer or Tfidf vectorizer. It might even not be a direct comparison as count vectorizer or tfidf vectorizer uses a large number of features to represent a tweet rather than using 200 dimensions as in this case.

In [16]:
# Save the dbow_ug_model

dbow_ug_model.save('./data/dbow_ug_model.doc2vec')

In [21]:
# Load the dbow_ug_model and delete temporary training data

dbow_ug_model = Doc2Vec.load('./data/dbow_ug_model.doc2vec')
dbow_ug_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

### Distributed Memory 

##### Distributed Memory Concatenated

DM is the Doc2Vec model analogous to CBOW model in Word2vec. The paragraph vectors are obtained by training a neural network on the task of inferring a centre word based on context words and a context paragraph.

In [29]:
# Initializing Distributed Memory parameters & building word vocabulary

dm_ug_model = Doc2Vec(dm=1, dm_concat=1, vector_size=100, window=2, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
dm_ug_model.build_vocab([w_v for w_v in tqdm(word_vec_train)])

100%|██████████| 1600000/1600000 [00:00<00:00, 4041921.05it/s]


In [30]:
%%time

# Multiple epochs iterating over labels more than once with decreasing learning rate

for epoch in range(30):
    
    # Shuffling word_vec_train & reducing aplha over multiple passes
    dm_ug_model.train(utils.shuffle([w_v for w_v in tqdm(word_vec_train)]), total_examples=len(word_vec_train), epochs=1)
    dm_ug_model.alpha -= 0.002
    dm_ug_model.min_alpha = dm_ug_model.alpha

100%|██████████| 1600000/1600000 [00:00<00:00, 3830051.72it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3978165.31it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3984594.78it/s]
100%|██████████| 1600000/1600000 [00:01<00:00, 1241856.17it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3972896.94it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3864424.64it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4043486.99it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4053387.97it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4112617.14it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4072470.94it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4052362.41it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4083731.86it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4178236.07it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4120113.63it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4106891.38it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4122611.

CPU times: user 37min 59s, sys: 2min 20s, total: 40min 20s
Wall time: 21min 54s


In [31]:
# Save the dm_ug_model

dm_ug_model.save('./data/dm_ug_model.doc2vec')

In Doc2Vec, one can also retrieve individual word vectors alongwith document vectors. 
However, a Doc2Vec DBOW model doesn't learn the semantic meaning of the words. Hence the word vectors retrieved from pure DBOW model will be the automatic randomly-initialized vectors with no meaning. 
But in DM model, the word vectors has the semantic understanding about words. 

In [50]:
# Similar words to 'good'

dm_ug_model.most_similar('good')

  This is separate from the ipykernel package so we can avoid doing imports until


[('gd', 0.795044481754303),
 ('goood', 0.743361234664917),
 ('great', 0.7367996573448181),
 ('gud', 0.7260905504226685),
 ('goodly', 0.682033360004425),
 ('gooooooood', 0.6817256212234497),
 ('gooood', 0.6782906651496887),
 ('goooood', 0.6629056930541992),
 ('gooooooooooooood', 0.6597786545753479),
 ('gooooood', 0.6576763391494751)]

In [51]:
# Similar words to 'happy'

dm_ug_model.most_similar('happy')

  """Entry point for launching an IPython kernel.


[('hapi', 0.7175722718238831),
 ('hapy', 0.7089468836784363),
 ('happyy', 0.6959027647972107),
 ('happppy', 0.6892277002334595),
 ('pleased', 0.6851349472999573),
 ('ebar', 0.6754604578018188),
 ('happpy', 0.672626256942749),
 ('happpppy', 0.6442862749099731),
 ('delighted', 0.6398066878318787),
 ('happpyyyyy', 0.6274683475494385)]

In [52]:
# Similar words to 'google'

dm_ug_model.most_similar('google')

  This is separate from the ipykernel package so we can avoid doing imports until


[('bing', 0.6888588070869446),
 ('yahoo', 0.6703177094459534),
 ('gmail', 0.6586295962333679),
 ('linkedin', 0.6292126178741455),
 ('stocktwits', 0.6011006832122803),
 ('mixero', 0.5956629514694214),
 ('wikipedia', 0.5900527238845825),
 ('facebook', 0.5883870720863342),
 ('blogspot', 0.5877808928489685),
 ('myfoodtrip', 0.5857887864112854)]

It's interesting that the model has also learnt about corrupted form of the words.

In [33]:
# Words similar to the equation : Embed(bigger) + Embed(small) - Embed('big)

dm_ug_model.most_similar(positive=['bigger', 'small'], negative=['big'])

  This is separate from the ipykernel package so we can avoid doing imports until


[('smaller', 0.6095108389854431),
 ('fewer', 0.5827668905258179),
 ('larger', 0.5684505701065063),
 ('tastier', 0.554661214351654),
 ('shorter', 0.5472672581672668),
 ('deadlier', 0.5471574068069458),
 ('tiny', 0.5415958762168884),
 ('awesumer', 0.5377195477485657),
 ('saner', 0.5373097658157349),
 ('deader', 0.5307070016860962)]

In [34]:
# Vectorize train, validation sets

train_vecs_dm = vectorize(dm_ug_model, x_train, 100)
val_vecs_dm = vectorize(dm_ug_model, x_val, 100)

In [35]:
# Train a Logistic Regression model on Distributed Memory 

clf = LogisticRegression()
clf.fit(train_vecs_dm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [36]:
clf.score(val_vecs_dm, y_val)

0.668125

In [37]:
# Save the dm_ug_model

dm_ug_model.save('./data/dm_ug_model.doc2vec')

In [38]:
# Load the dm_ug_model and delete temporary training data

dm_ug_model = Doc2Vec.load('./data/dm_ug_model.doc2vec')
dm_ug_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

### Distributed Memory 

##### Distributed Memory Mean

DM is the Doc2Vec model analogous to CBOW model in Word2vec. The paragraph vectors are obtained by training a neural network on the task of inferring a centre word based on context words and a context paragraph.

In [39]:
# Initializing Distributed Memory Mean parameters & building word vocabulary

dmm_ug_model = Doc2Vec(dm=1, dm_mean=1, vector_size=100, window=4, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
dmm_ug_model.build_vocab([w_v for w_v in tqdm(word_vec_train)])

100%|██████████| 1600000/1600000 [00:00<00:00, 4000905.24it/s]


In [40]:
%%time

# Multiple epochs iterating over labels more than once with decreasing learning rate

for epoch in range(30):
    
    # Shuffling word_vec_train & reducing aplha over multiple passes
    dmm_ug_model.train(utils.shuffle([w_v for w_v in tqdm(word_vec_train)]), total_examples=len(word_vec_train), epochs=1)
    dmm_ug_model.alpha -= 0.002
    dmm_ug_model.min_alpha = dmm_ug_model.alpha

100%|██████████| 1600000/1600000 [00:00<00:00, 3817282.90it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4133260.90it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4141032.21it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4269630.96it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4155182.95it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4129372.20it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4222353.62it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4197985.11it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4180500.51it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4130457.45it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3914726.47it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4134643.68it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4104068.36it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4164393.25it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4107939.70it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4121277.

CPU times: user 49min 54s, sys: 9min 32s, total: 59min 26s
Wall time: 35min 49s


In [41]:
# Save the dm_ug_model

dmm_ug_model.save('./data/dmm_ug_model.doc2vec')

In Doc2Vec, one can also retrieve individual word vectors alongwith document vectors. 
However, a Doc2Vec DBOW model doesn't learn the semantic meaning of the words. Hence the word vectors retrieved from pure DBOW model will be the automatic randomly-initialized vectors with no meaning. 
But in DM model, the word vectors has the semantic understanding about words. 

In [55]:
# Similar words to 'good'

dmm_ug_model.most_similar('good')

  This is separate from the ipykernel package so we can avoid doing imports until


[('great', 0.9271145462989807),
 ('bad', 0.8902649283409119),
 ('nice', 0.8806440830230713),
 ('sad', 0.8632817268371582),
 ('wonderful', 0.8587142825126648),
 ('alone', 0.856594979763031),
 ('busy', 0.8555803894996643),
 ('weird', 0.8548664450645447),
 ('that', 0.8542935848236084),
 ('but', 0.8539254665374756)]

In [56]:
# Similar words to 'happy'

dmm_ug_model.most_similar('happy')

  This is separate from the ipykernel package so we can avoid doing imports until


[('sad', 0.8552143573760986),
 ('bummed', 0.8075793385505676),
 ('upset', 0.8072896599769592),
 ('excited', 0.8055294752120972),
 ('busy', 0.7970961928367615),
 ('lame', 0.7911461591720581),
 ('good', 0.7825235724449158),
 ('nervous', 0.7758963704109192),
 ('jealous', 0.7745265960693359),
 ('sick', 0.7709179520606995)]

In [43]:
# Words similar to the equation : Embed(bigger) + Embed(small) - Embed('big)

dmm_ug_model.most_similar(positive=['bigger', 'small'], negative=['big'])

  This is separate from the ipykernel package so we can avoid doing imports until


[('smaller', 0.7307447195053101),
 ('nicer', 0.6246125102043152),
 ('shorter', 0.6226528286933899),
 ('cheaper', 0.6045967936515808),
 ('larger', 0.59166419506073),
 ('useable', 0.58817058801651),
 ('clearer', 0.5850169658660889),
 ('higher', 0.5828668475151062),
 ('tiny', 0.5705291628837585),
 ('different', 0.5560811758041382)]

In [44]:
# Vectorize train, validation sets

train_vecs_dmm = vectorize(dmm_ug_model, x_train, 100)
val_vecs_dmm = vectorize(dmm_ug_model, x_val, 100)

In [45]:
# Train a Logistic Regression model on Distributed Memory 

clf = LogisticRegression()
clf.fit(train_vecs_dmm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [46]:
clf.score(val_vecs_dmm, y_val)

0.7258125

In [47]:
# Save the dmm_ug_model

dmm_ug_model.save('./data/dmm_ug_model.doc2vec')

In [48]:
# Load the dmm_ug_model and delete temporary training data

dmm_ug_model = Doc2Vec.load('./data/dmm_ug_model.doc2vec')
dmm_ug_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

In [49]:
display('Pretty Nice.')

'Pretty Nice.'

#### Training on concatanated document vectors obtained from Distributed Bag Of Words & Distributed Memory Concatanated models

Updated vectorize_concate function

In [10]:
# Vectorize train, validation sets using above dbow_ug_model (Document Bag Of Words Unigram Model)

def vectorize_concate(model1, model2, corpus, size):
    # Numpy zeros initialization
    vectors = np.zeros((len(corpus), size))
    
    for idx, count in zip(corpus.index, range(len(corpus.index))):
        prefix = 'all_' + str(idx)
        # Appending document vectors
        vectors[count] = np.append(model1.docvecs[prefix], model2.docvecs[prefix])

    return vectors

In [59]:
# Vectorize & concate document vectors of train, validation sets obtained from Distributed Bag Of Words & Distributed Memory 

train_vecs_dbow_dm = vectorize_concate(dbow_ug_model, dm_ug_model, x_train, 200)
val_vecs_dbow_dm = vectorize_concate(dbow_ug_model, dm_ug_model, x_val, 200)

In [60]:
# Train a Logistic Regression model on Distributed Bag Of Words & Distributed Memory 

clf = LogisticRegression()
clf.fit(train_vecs_dbow_dm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [61]:
clf.score(val_vecs_dbow_dm, y_val)

0.742375

#### Training on concatanated document vectors obtained from Distributed Bag Of Words & Distributed Memory Mean models

In [62]:
# Vectorize & concate document vectors of train, validation sets obtained from Distributed Bag Of Words & Distributed Memory Mean 

train_vecs_dbow_dmm = vectorize_concate(dbow_ug_model, dmm_ug_model, x_train, 200)
val_vecs_dbow_dmm = vectorize_concate(dbow_ug_model, dmm_ug_model, x_val, 200)

In [63]:
# Train a Logistic Regression model on Distributed Bag Of Words & Distributed Memory Mean

clf = LogisticRegression()
clf.fit(train_vecs_dbow_dmm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [64]:
clf.score(val_vecs_dbow_dmm, y_val)

0.7504375

#### Populate table with Models & it's Accuracy

In [28]:
mydata = [['Distributed Bag Of Words', 0.7320625], 
          ['Distributed Memory Concatanated', 0.668125],
          ['Distributed Memory Mean', 0.7258125],
          ['Distributed Memory BoW & Concatanated', 0.742375],
          ['Distributed Memory BoW & Mean', 0.7504375]]

In [29]:
from tabulate import tabulate
from IPython.display import HTML

display(HTML(tabulate(mydata, headers= ['Model', 'Accuracy'], floatfmt='.4f', tablefmt='html')))

Model,Accuracy
Distributed Bag Of Words,0.7321
Distributed Memory Concatanated,0.6681
Distributed Memory Mean,0.7258
Distributed Memory BoW & Concatanated,0.7424
Distributed Memory BoW & Mean,0.7504


## Phrase Modeling¶

In [37]:
# Tokenizing train set

tokenized_train = [tweet.split() for tweet in x_train]

The tokenized tweets corpus will be fed to genism library Phrase functions to get the frequently used phrases and connect them together with underbar.

In [38]:
%%time

# Getting Phrases from the tokens & thereafter bigram token from the phrases

phrases = Phrases(tokenized_train)
bigram = Phraser(phrases)

CPU times: user 1min 4s, sys: 222 ms, total: 1min 4s
Wall time: 1min 5s


In [42]:
# Example

check = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
display(bigram[check])

['the', 'mayor', 'of', 'new_york', 'was', 'there']

As we can see that the bigram can find out the most frequently used phrase in the example "new_york".

In [15]:
# Labelling tweets using genism phrase library for unsupervised learning

def label_tweets_bigram(tweets, label):
    result = []
    prefix = label
    
    # Split tweets & attach label with index
    for index, tweet in zip(tweets.index, tweets):
        result.append(LabeledSentence(bigram[tweet.split()], [prefix + '_%s' % index]))
    
    return result

In [45]:
word_vec_train_bg = label_tweets_bigram(df.text , 'all')
len(word_vec_train_bg)

  if __name__ == '__main__':


1600000

In [48]:
# All cores of CPU

cores = multiprocessing.cpu_count()

### DBOW (Document Bag Of Words) Bigram

DBOW Bigram: This is the Doc2Vec model analogous to Skip-gram model in Word2Vec. The paragraph vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph.

In [49]:
# Initializing Distributed Bag Of Words Bigram parameters & building word vocabulary

dbow_bg_model = Doc2Vec(dm=0, vector_size=100, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
dbow_bg_model.build_vocab([w_v for w_v in tqdm(word_vec_train_bg)])

100%|██████████| 1600000/1600000 [00:00<00:00, 4073397.91it/s]


One caveat of the way this algorithm is that, since the learning rate decrease over the course of iterating over the data, labels which are only seen in a single LabeledSentence during training will only be trained with a fixed learning rate. This frequently produces less than optimal results.

Below iteration implement explicit multiple-pass, alpha-reduction approach with added shuffling.

In [50]:
%%time

# Multiple epochs iterating over labels more than once with decreasing learning rate

for epoch in range(30):
    
    # Shuffling word_vec_train & reducing aplha over multiple passes
    dbow_bg_model.train(utils.shuffle([w_v for w_v in tqdm(word_vec_train_bg)]), total_examples=len(word_vec_train_bg), epochs=1)
    dbow_bg_model.alpha -= 0.002
    dbow_bg_model.min_alpha = dbow_bg_model.alpha

100%|██████████| 1600000/1600000 [00:00<00:00, 3964611.74it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3915788.64it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4228147.68it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4056659.02it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4037393.24it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4105476.88it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3941255.68it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4115886.08it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4247767.46it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4174262.32it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3970551.04it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4251526.89it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3975575.30it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3935482.10it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4351857.56it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3976069.

CPU times: user 29min 6s, sys: 2min 9s, total: 31min 16s
Wall time: 20min 19s


In [17]:
# Vectorize train, validation sets using above dbow_ug_model (Document Bag Of Words Unigram Model)

def vectorize(model, corpus, size):
    # Numpy zeros initialization
    vectors = np.zeros((len(corpus), size))
    
    for idx, count in zip(corpus.index, range(len(corpus.index))):
        prefix = 'all_' + str(idx)
        vectors[count] = model.docvecs[prefix]

    return vectors

In [52]:
# Vectorize train, validation sets

train_vecs_dbow_bg = vectorize(dbow_bg_model, x_train, 100)
val_vecs_dbow_bg = vectorize(dbow_bg_model, x_val, 100)

In [53]:
# Train a Logistic Regression model

clf = LogisticRegression()
clf.fit(train_vecs_dbow_bg, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [55]:
clf.score(val_vecs_dbow_bg, y_val)

0.742875

DBOW model doesn't learn the semantic understanding of words but it's features obtained from it does a decent job with a simple Logistic Regression classifier.
The result doesn't seem to excel count vectorizer or Tfidf vectorizer. It might even not be a direct comparison as count vectorizer or tfidf vectorizer uses a large number of features to represent a tweet rather than using 200 dimensions as in this case.

In [56]:
# Save the dbow_bg_model

dbow_bg_model.save('./data/dbow_bg_model.doc2vec')

In [9]:
# Load the dbow_ug_model and delete temporary training data

dbow_bg_model = Doc2Vec.load('./data/dbow_bg_model.doc2vec')
dbow_bg_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

### Distributed Memory Bigram

##### Distributed Memory Concatenated Bigram

DM is the Doc2Vec model analogous to CBOW model in Word2vec. The paragraph vectors are obtained by training a neural network on the task of inferring a centre word based on context words and a context paragraph.

In [58]:
# Initializing Distributed Memory Bigram parameters & building word vocabulary

dm_bg_model = Doc2Vec(dm=1, dm_concat=1, vector_size=100, window=2, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
dm_bg_model.build_vocab([w_v for w_v in tqdm(word_vec_train_bg)])

100%|██████████| 1600000/1600000 [00:00<00:00, 4037378.66it/s]


In [59]:
%%time

# Multiple epochs iterating over labels more than once with decreasing learning rate

for epoch in range(30):
    
    # Shuffling word_vec_train & reducing aplha over multiple passes
    dm_bg_model.train(utils.shuffle([w_v for w_v in tqdm(word_vec_train_bg)]), total_examples=len(word_vec_train_bg), epochs=1)
    dm_bg_model.alpha -= 0.002
    dm_bg_model.min_alpha = dm_bg_model.alpha

100%|██████████| 1600000/1600000 [00:00<00:00, 3899667.09it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4049567.40it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3943645.87it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3983012.65it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4088690.57it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3595165.22it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4086942.57it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3979476.91it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3929414.78it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3976567.07it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4034230.77it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3730645.40it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4059670.12it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3739878.39it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4002217.57it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4064220.

CPU times: user 38min 37s, sys: 2min 15s, total: 40min 52s
Wall time: 22min 14s


In [66]:
# Save the dm_bg_model

dm_bg_model.save('./data/dm_bg_model.doc2vec')

In Doc2Vec, one can also retrieve individual word vectors alongwith document vectors. 
However, a Doc2Vec DBOW model doesn't learn the semantic meaning of the words. Hence the word vectors retrieved from pure DBOW model will be the automatic randomly-initialized vectors with no meaning. 
But in DM model, the word vectors has the semantic understanding about words. 

In [71]:
# Similar words to 'good'

dm_bg_model.most_similar('good')

  This is separate from the ipykernel package so we can avoid doing imports until


[('great', 0.7463171482086182),
 ('gooood', 0.7296287417411804),
 ('gd', 0.7225816249847412),
 ('goood', 0.7052001357078552),
 ('gud', 0.6986550092697144),
 ('haveno', 0.6722873449325562),
 ('horrrrible', 0.6549490690231323),
 ('nice', 0.6540406346321106),
 ('goooood', 0.6462099552154541),
 ('lebay', 0.6451210379600525)]

In [72]:
# Similar words to 'happy'

dm_bg_model.most_similar('happy')

  This is separate from the ipykernel package so we can avoid doing imports until


[('pleased', 0.6988117694854736),
 ('greatful', 0.6522352695465088),
 ('hapy', 0.6349790692329407),
 ('delighted', 0.6305642127990723),
 ('thankful', 0.6269996166229248),
 ('proud', 0.622420608997345),
 ('happpy', 0.6171808838844299),
 ('thrilled', 0.6133846044540405),
 ('happpppy', 0.609881579875946),
 ('upset', 0.6092841625213623)]

In [73]:
# Words similar to the equation : Embed(bigger) + Embed(small) - Embed('big)

dm_bg_model.most_similar(positive=['bigger', 'small'], negative=['big'])

  This is separate from the ipykernel package so we can avoid doing imports until


[('smaller', 0.5979503989219666),
 ('considerably_less', 0.5903685092926025),
 ('poorer', 0.5861400961875916),
 ('larger', 0.5799989104270935),
 ('reclined', 0.5764734148979187),
 ('different', 0.5636833310127258),
 ('healthier', 0.5633150339126587),
 ('stuffier', 0.5453404188156128),
 ('venomous', 0.5452708005905151),
 ('shorter', 0.5448837876319885)]

In [74]:
# Vectorize train, validation sets

train_vecs_dm_bg = vectorize(dm_bg_model, x_train, 100)
val_vecs_dm_bg = vectorize(dm_bg_model, x_val, 100)

In [75]:
# Train a Logistic Regression model on Distributed Memory 

clf = LogisticRegression()
clf.fit(train_vecs_dm_bg, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [76]:
clf.score(val_vecs_dm_bg, y_val)

0.6571875

In [77]:
# Save the dm_bg_model

dm_bg_model.save('./data/dm_bg_model.doc2vec')

In [12]:
# Load the dm_bg_model and delete temporary training data

dm_bg_model = Doc2Vec.load('./data/dm_bg_model.doc2vec')
dm_bg_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

##### Distributed Memory Bigram Mean

DM is the Doc2Vec model analogous to CBOW model in Word2vec. The paragraph vectors are obtained by training a neural network on the task of inferring a centre word based on context words and a context paragraph.

In [79]:
# Initializing Distributed Memory Bigram Mean parameters & building word vocabulary

dmm_bg_model = Doc2Vec(dm=1, dm_mean=1, vector_size=100, window=4, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
dmm_bg_model.build_vocab([w_v for w_v in tqdm(word_vec_train_bg)])

100%|██████████| 1600000/1600000 [00:00<00:00, 4160714.04it/s]


In [80]:
%%time

# Multiple epochs iterating over labels more than once with decreasing learning rate

for epoch in range(30):
    
    # Shuffling word_vec_train & reducing aplha over multiple passes
    dmm_bg_model.train(utils.shuffle([w_v for w_v in tqdm(word_vec_train_bg)]), total_examples=len(word_vec_train_bg), epochs=1)
    dmm_bg_model.alpha -= 0.002
    dmm_bg_model.min_alpha = dmm_bg_model.alpha

100%|██████████| 1600000/1600000 [00:00<00:00, 3632233.96it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4175659.68it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4068483.60it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4160414.82it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4041168.95it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4119794.93it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4088239.73it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4134373.67it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3936137.65it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4103855.04it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4157388.98it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4185503.98it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4066237.84it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4090781.66it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4051797.23it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4106305.

CPU times: user 50min 59s, sys: 9min 45s, total: 1h 45s
Wall time: 36min 41s


In [82]:
# Save the dmm_bg_model

dmm_bg_model.save('./data/dmm_bg_model.doc2vec')

In Doc2Vec, one can also retrieve individual word vectors alongwith document vectors. 
However, a Doc2Vec DBOW model doesn't learn the semantic meaning of the words. Hence the word vectors retrieved from pure DBOW model will be the automatic randomly-initialized vectors with no meaning. 
But in DM model, the word vectors has the semantic understanding about words. 

In [87]:
# Similar words to 'good'

dmm_bg_model.most_similar('good')

  This is separate from the ipykernel package so we can avoid doing imports until


[('great', 0.9449338316917419),
 ('bad', 0.9191789627075195),
 ('nice', 0.9140607714653015),
 ('sad', 0.8977138996124268),
 ('cool', 0.8960317373275757),
 ('ok', 0.8918680548667908),
 ('that', 0.8916789293289185),
 ('you', 0.8908518552780151),
 ('awesome', 0.8881815075874329),
 ('crazy', 0.8871078491210938)]

In [15]:
# Similar words to 'happy'

dmm_bg_model.most_similar('happy')

  This is separate from the ipykernel package so we can avoid doing imports until


[('sad', 0.9064098000526428),
 ('excited', 0.8870008587837219),
 ('upset', 0.8546335101127625),
 ('lucky', 0.8500240445137024),
 ('cool', 0.8468711376190186),
 ('jealous', 0.8455446362495422),
 ('depressed', 0.8413048982620239),
 ('good', 0.8396432995796204),
 ('bummed', 0.837196946144104),
 ('nice', 0.8340071439743042)]

In [89]:
# Words similar to the equation : Embed(bigger) + Embed(small) - Embed('big)

dmm_bg_model.most_similar(positive=['bigger', 'small'], negative=['big'])

  This is separate from the ipykernel package so we can avoid doing imports until


[('smaller', 0.7743483185768127),
 ('shorter', 0.6514254808425903),
 ('tiny', 0.6247305870056152),
 ('different', 0.6246160268783569),
 ('their_own', 0.6177682876586914),
 ('messy', 0.6109225153923035),
 ('pricey', 0.6010245680809021),
 ('cheaper', 0.5943189859390259),
 ('empty', 0.592384397983551),
 ('diff', 0.5899870991706848)]

In [18]:
# Vectorize Bigram train, validation sets

train_vecs_dmm_bi = vectorize(dmm_bg_model, x_train, 100)
val_vecs_dmm_bi = vectorize(dmm_bg_model, x_val, 100)

In [19]:
# Train a Logistic Regression model on Distributed Memory Bigram

clf = LogisticRegression()
clf.fit(train_vecs_dmm_bi, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [20]:
clf.score(val_vecs_dmm_bi, y_val)

0.7385

In [21]:
# Save the dmm_bg_model

dmm_bg_model.save('./data/dmm_bg_model.doc2vec')

In [10]:
# Load the dmm_bg_model and delete temporary training data

dmm_bg_model = Doc2Vec.load('./data/dmm_bg_model.doc2vec')
dmm_bg_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

In [22]:
display('Pretty Nice.')

'Pretty Nice.'

Training on concatanated document vectors obtained from Bigram Distributed Bag Of Words & Distributed Memory Concatanated models

Updated vectorize_concate function

In [16]:
# Vectorize train, validation sets using Bigram Model

def vectorize_concate(model1, model2, corpus, size):
    # Numpy zeros initialization
    vectors = np.zeros((len(corpus), size))
    
    for idx, count in zip(corpus.index, range(len(corpus.index))):
        prefix = 'all_' + str(idx)
        # Appending document vectors
        vectors[count] = np.append(model1.docvecs[prefix], model2.docvecs[prefix])

    return vectors

In [25]:
# Vectorize & concate document vectors of train, validation sets obtained from Distributed Bag Of Words & Distributed Memory 

train_vecs_dbow_dm_bi = vectorize_concate(dbow_bg_model, dm_bg_model, x_train, 200)
val_vecs_dbow_dm_bi = vectorize_concate(dbow_bg_model, dm_bg_model, x_val, 200)

In [26]:
# Train a Logistic Regression model on Distributed Bag Of Words & Distributed Memory 

clf = LogisticRegression()
clf.fit(train_vecs_dbow_dm_bi, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [27]:
clf.score(val_vecs_dbow_dm_bi, y_val)

0.7499375

Training on concatanated document vectors obtained from Bigram Distributed Bag Of Words & Distributed Memory Mean models

In [12]:
# Vectorize & concate document vectors of train, validation sets obtained from Bigram Distributed Bag Of Words & Distributed Memory Mean

train_vecs_dbow_dmm_bi = vectorize_concate(dbow_bg_model, dmm_bg_model, x_train, 200)
val_vecs_dbow_dmm_bi = vectorize_concate(dbow_bg_model, dmm_bg_model, x_val, 200)

In [13]:
# Train a Logistic Regression model on Bigram Distributed Bag Of Words & Distributed Memory Mean

clf = LogisticRegression()
clf.fit(train_vecs_dbow_dmm_bi, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [14]:
clf.score(val_vecs_dbow_dmm_bi, y_val)

0.7584375

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

### Trigram

Now we will run the same phrase detection on trigram phrases (generated from the corpus) to generate trigram phrases.

In [39]:
%%time

# Getting Bigram Phrases from the tokens & thereafter triagram phrases from the Bigram phrases

tg_phrases = Phrases(bigram[tokenized_train])
trigram = Phraser(tg_phrases)

CPU times: user 2min 3s, sys: 234 ms, total: 2min 4s
Wall time: 2min 4s


In [44]:
check = [u'last', u'cream', u'time', u'ice', u'with', u'nutella', u'and', u'vanilla', u'sadface']
display(bigram[check])

['last',
 'cream',
 'time',
 'ice',
 'with',
 'nutella',
 'and',
 'vanilla',
 'sadface']

In [46]:
# Example

check = [u'last', u'cream', u'time', u'ice', u'with', u'nutella', u'and', u'vanilla', u'sadface']
trigram[bigram[check]]

['last',
 'cream',
 'time',
 'ice',
 'with',
 'nutella',
 'and',
 'vanilla',
 'sadface']

As we can see that the trigram can find out the most frequently used phrase in the example "vanilla_ice_cream".

In [18]:
# Labelling tweets using genism phrase library for unsupervised learning

def label_tweets_trigram(tweets, label):
    result = []
    prefix = label
    
    # Split tweets & attach label with index
    for index, tweet in zip(tweets.index, tweets):
        result.append(LabeledSentence(trigram[tweet.split()], [prefix + '_%s' % index]))
    
    return result

In [19]:
word_vec_train_tg = label_tweets_trigram(df.text , 'all')
len(word_vec_train_tg)

  if __name__ == '__main__':


1600000

In [20]:
# All cores of CPU

cores = multiprocessing.cpu_count()

### DBOW (Document Bag Of Words) Trigram

DBOW Trigram: This is the Doc2Vec model analogous to Skip-gram model in Word2Vec. The paragraph vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph.

In [29]:
# Initializing Distributed Bag Of Words Trigram parameters & building word vocabulary

dbow_tg_model = Doc2Vec(dm=0, vector_size=100, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
dbow_tg_model.build_vocab([w_v for w_v in tqdm(word_vec_train_tg)])

100%|██████████| 1600000/1600000 [00:00<00:00, 4302349.44it/s]


One caveat of the way this algorithm is that, since the learning rate decrease over the course of iterating over the data, labels which are only seen in a single LabeledSentence during training will only be trained with a fixed learning rate. This frequently produces less than optimal results.

Below iteration implement explicit multiple-pass, alpha-reduction approach with added shuffling.

In [30]:
%%time

# Multiple epochs iterating over labels more than once with decreasing learning rate

for epoch in range(30):
    
    # Shuffling word_vec_train & reducing aplha over multiple passes
    dbow_tg_model.train(utils.shuffle([w_v for w_v in tqdm(word_vec_train_tg)]), total_examples=len(word_vec_train_tg), epochs=1)
    dbow_tg_model.alpha -= 0.002
    dbow_tg_model.min_alpha = dbow_bg_model.alpha

100%|██████████| 1600000/1600000 [00:00<00:00, 3955698.73it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3838712.10it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4079791.89it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4083977.89it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4118222.41it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3981169.60it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4061537.42it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4284777.07it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4226352.96it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4121130.75it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4235857.76it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4305728.17it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4096767.64it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4182775.23it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4168601.96it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4191974.

CPU times: user 27min 25s, sys: 2min 3s, total: 29min 28s
Wall time: 18min 34s


In [32]:
# Vectorize train, validation sets using above dbow_tg_model (Document Bag Of Words Trigram Model)

def vectorize(model, corpus, size):
    # Numpy zeros initialization
    vectors = np.zeros((len(corpus), size))
    
    for idx, count in zip(corpus.index, range(len(corpus.index))):
        prefix = 'all_' + str(idx)
        vectors[count] = model.docvecs[prefix]

    return vectors

In [33]:
# Vectorize train, validation sets

train_vecs_dbow_tg = vectorize(dbow_tg_model, x_train, 100)
val_vecs_dbow_tg = vectorize(dbow_tg_model, x_val, 100)

In [47]:
# Train a Logistic Regression model

clf = LogisticRegression()
clf.fit(train_vecs_dbow_tg, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [48]:
clf.score(val_vecs_dbow_tg, y_val)

0.7403125

DBOW model doesn't learn the semantic understanding of words but it's features obtained from it does a decent job with a simple Logistic Regression classifier.
The result doesn't seem to excel count vectorizer or Tfidf vectorizer. It might even not be a direct comparison as count vectorizer or tfidf vectorizer uses a large number of features to represent a tweet rather than using 200 dimensions as in this case.

In [49]:
# Save the dbow_tg_model

dbow_tg_model.save('./data/dbow_tg_model.doc2vec')

In [18]:
# Load the dbow_tg_model and delete temporary training data

dbow_tg_model = Doc2Vec.load('./data/dbow_tg_model.doc2vec')
dbow_tg_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

### Distributed Memory Trigram

##### Distributed Memory Concatenated Trigram

DM is the Doc2Vec model analogous to CBOW model in Word2vec. The paragraph vectors are obtained by training a neural network on the task of inferring a centre word based on context words and a context paragraph.

In [51]:
# Initializing Distributed Memory Trigram parameters & building word vocabulary

dm_tg_model = Doc2Vec(dm=1, dm_concat=1, vector_size=100, window=2, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
dm_tg_model.build_vocab([w_v for w_v in tqdm(word_vec_train_tg)])

100%|██████████| 1600000/1600000 [00:00<00:00, 3968097.63it/s]


In [52]:
%%time

# Multiple epochs iterating over labels more than once with decreasing learning rate

for epoch in range(30):
    
    # Shuffling word_vec_train & reducing aplha over multiple passes
    dm_tg_model.train(utils.shuffle([w_v for w_v in tqdm(word_vec_train_tg)]), total_examples=len(word_vec_train_tg), epochs=1)
    dm_tg_model.alpha -= 0.002
    dm_tg_model.min_alpha = dm_tg_model.alpha

100%|██████████| 1600000/1600000 [00:00<00:00, 3755285.33it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3890833.57it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3938480.03it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3939654.58it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3831453.39it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3861249.51it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3899821.19it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3610144.30it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3982093.27it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4006465.86it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3968857.98it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3796121.68it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3859770.46it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3947928.58it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3787305.91it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 3852633.

CPU times: user 39min 32s, sys: 2min 41s, total: 42min 13s
Wall time: 22min 32s


In [54]:
# Save the dm_tg_model

dm_tg_model.save('./data/dm_tg_model.doc2vec')

In [55]:
# Vectorize train, validation sets

train_vecs_dm_tg = vectorize(dm_tg_model, x_train, 100)
val_vecs_dm_tg = vectorize(dm_tg_model, x_val, 100)

In [56]:
%%time

# Train a Logistic Regression model on Distributed Memory 

clf = LogisticRegression()
clf.fit(train_vecs_dm_tg, y_train)

CPU times: user 10.7 s, sys: 2.9 s, total: 13.6 s
Wall time: 2min 33s


In [57]:
clf.score(val_vecs_dm_tg, y_val)

0.6563125

In [58]:
# Save the dm_tg_model

dm_tg_model.save('./data/dm_tg_model.doc2vec')

In [25]:
# Load the dm_tg_model and delete temporary training data

dm_tg_model = Doc2Vec.load('./data/dm_tg_model.doc2vec')
dm_tg_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

##### Distributed Memory Trigram Mean

DM is the Doc2Vec model analogous to CBOW model in Word2vec. The paragraph vectors are obtained by training a neural network on the task of inferring a centre word based on context words and a context paragraph.

In [21]:
# Initializing Distributed Memory Trigram Mean parameters & building word vocabulary

dmm_tg_model = Doc2Vec(dm=1, dm_mean=1, vector_size=100, window=4, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
dmm_tg_model.build_vocab([w_v for w_v in tqdm(word_vec_train_tg)])

100%|██████████| 1600000/1600000 [00:00<00:00, 3917843.80it/s]


In [22]:
%%time

# Multiple epochs iterating over labels more than once with decreasing learning rate

for epoch in range(30):
    
    # Shuffling word_vec_train & reducing aplha over multiple passes
    dmm_tg_model.train(utils.shuffle([w_v for w_v in tqdm(word_vec_train_tg)]), total_examples=len(word_vec_train_tg), epochs=1)
    dmm_tg_model.alpha -= 0.002
    dmm_tg_model.min_alpha = dmm_tg_model.alpha

100%|██████████| 1600000/1600000 [00:00<00:00, 4065400.32it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4177864.10it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4213694.90it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4218218.66it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4231530.89it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4198137.42it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4102530.40it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4194372.16it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4165403.91it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4220107.32it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4027964.18it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4071309.73it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4190985.26it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4198444.72it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4301635.17it/s]
100%|██████████| 1600000/1600000 [00:00<00:00, 4152280.

CPU times: user 48min 34s, sys: 9min 50s, total: 58min 25s
Wall time: 34min 38s


In [23]:
# Save the dmm_tg_model

dmm_tg_model.save('./data/dmm_tg_model.doc2vec')

In [24]:
# Vectorize train, validation sets

train_vecs_dmm_tg = vectorize(dmm_tg_model, x_train, 100)
val_vecs_dmm_tg = vectorize(dmm_tg_model, x_val, 100)

In [25]:
# Train a Logistic Regression model on Distributed Memory 

clf = LogisticRegression()
clf.fit(train_vecs_dmm_tg, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [26]:
clf.score(val_vecs_dmm_tg, y_val)

0.73325

In [27]:
# Save the dmm_tg_model

dmm_tg_model.save('./data/dmm_tg_model.doc2vec')

In [31]:
# Load the dmm_ug_model and delete temporary training data

dmm_tg_model = Doc2Vec.load('./data/dmm_tg_model.doc2vec')
dmm_tg_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

In [29]:
display('Pretty Nice.')

'Pretty Nice.'

Training on concatanated document vectors obtained from Distributed Bag Of Words & Distributed Memory Concatanated models

In [26]:
# Vectorize & concate document vectors of train, validation sets obtained from Distributed Bag Of Words & Distributed Memory 

train_vecs_dbow_dm_tg = vectorize_concate(dbow_tg_model, dm_tg_model, x_train, 200)
val_vecs_dbow_dm_tg = vectorize_concate(dbow_tg_model, dm_tg_model, x_val, 200)

In [27]:
# Train a Logistic Regression model on Distributed Bag Of Words & Distributed Memory 

clf = LogisticRegression()
clf.fit(train_vecs_dbow_dm_tg, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [28]:
clf.score(val_vecs_dbow_dm_tg, y_val)

0.7496875

Training on concatanated document vectors obtained from Distributed Bag Of Words & Distributed Memory Mean models

In [32]:
# Vectorize & concate document vectors of train, validation sets obtained from Distributed Bag Of Words & Distributed Memory 

train_vecs_dbow_dmm_tg = vectorize_concate(dbow_tg_model, dmm_tg_model, x_train, 200)
val_vecs_dbow_dmm_tg = vectorize_concate(dbow_tg_model, dmm_tg_model, x_val, 200)

In [33]:
# Train a Logistic Regression model on Distributed Bag Of Words & Distributed Memory 

clf = LogisticRegression()
clf.fit(train_vecs_dbow_dmm_tg, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [34]:
clf.score(val_vecs_dbow_dmm_tg, y_val)

0.75875

#### Populate table with Models & it's Accuracy

In [44]:
mydata = [['Distributed Bag Of Words', 0.7320625, 0.668125, 0.7258125],
          ['Distributed Memory Concatanated', 0.742875, 0.6571875, 0.7385],
          ['Distributed Memory Mean', 0.7403125, 0.6563125, 0.73325],
          ['Distributed Memory BoW & Concatanated', 0.742375, 0.7499375, 0.7496875],
          ['Distributed Memory BoW & Mean', 0.7504375, 0.7584375, 0.75875]]

In [45]:
from tabulate import tabulate
from IPython.display import HTML

display(HTML(tabulate(mydata, headers= [ '', 'Unigram', 'Bigram', 'Trigram'], floatfmt='.4f', tablefmt='html')))

Unnamed: 0,Unigram,Bigram,Trigram
Distributed Bag Of Words,0.7321,0.6681,0.7258
Distributed Memory Concatanated,0.7429,0.6572,0.7385
Distributed Memory Mean,0.7403,0.6563,0.7332
Distributed Memory BoW & Concatanated,0.7424,0.7499,0.7497
Distributed Memory BoW & Mean,0.7504,0.7584,0.7588


#### Joint Operation

Using best of above n-gram models, we will train on joint vectors (Distributed Bag Of Words Unigram + Distributed Memory BoW & Mean	Triagram) having best performance.

In [11]:
# Load the dbow_ug_model and delete temporary training data

dbow_ug_model = Doc2Vec.load('./data/dbow_ug_model.doc2vec')
dbow_ug_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

In [12]:
# Load the dmm_tg_model and delete temporary training data

dmm_tg_model = Doc2Vec.load('./data/dmm_tg_model.doc2vec')
dmm_tg_model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

In [13]:
# Vectorize & concate document vectors of train, validation sets obtained from Distributed Bag Of Words & Distributed Memory 

train_vecs_dbow_ug_dmm_tg = vectorize_concate(dbow_ug_model, dmm_tg_model, x_train, 200)
val_vecs_dbow_ug_dmm_tg = vectorize_concate(dbow_ug_model, dmm_tg_model, x_val, 200)

In [14]:
# Train a Logistic Regression model on Distributed Bag Of Words & Distributed Memory 

clf = LogisticRegression()
clf.fit(train_vecs_dbow_ug_dmm_tg, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [15]:
clf.score(val_vecs_dbow_ug_dmm_tg, y_val)

0.754

### Algorithms Comparison

In [30]:
# Basic imports
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron
from sklearn.neighbors import NearestCentroid
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

from datetime import datetime
from collections import Counter

# Classifiers names list
names = ["Logistic Regression", "Multinomial NB", "Bernoulli NB", "Ridge Classifier", 
         "AdaBoost", "Perceptron", "Passive-Aggresive", "Nearest Centroid"]

classifiers = [
    LogisticRegression(),
    MultinomialNB(),
    BernoulliNB(),
    RidgeClassifier(),
    AdaBoostClassifier(),
    Perceptron(),
    PassiveAggressiveClassifier(),
    NearestCentroid()
    ]

# Zipping all of them together
zipped_clf = zip(names,classifiers)

In [26]:
# Scaling inputs

mmscaler = MinMaxScaler()

train_vecs_dbow_ug_dmm_tg_scaled = mmscaler.fit_transform(train_vecs_dbow_ug_dmm_tg)
val_vecs_dbow_ug_dmm_tg_scaled = mmscaler.fit_transform(val_vecs_dbow_ug_dmm_tg)

In [27]:
# Calculate accuracy & summmary of different set of features

def accuracy_features(pipeline, x_train, y_train, x_test, y_test):
    
    counter = Counter(y_test)

    if (counter[0] / (len(y_test)*1.)) > 0.5:
        baseline_accuracy = counter[0] / (len(y_test)*1.)
    else:
        baseline_accuracy = 1. - (counter[0] / (len(y_test)*1.))
   
    # Timer starts
    timer = datetime.now()
    
    model = pipeline.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    
    elapsed_time = datetime.now() - timer
    # Timer stops

    accuracy = accuracy_score(y_test, y_pred)
    
    
    print('Baseline accuracy: {:.2f}%'.format(baseline_accuracy*100))
    print('Accuracy score: {:.2f}%'.format(accuracy*100))
    
    if(accuracy > baseline_accuracy):
        print('\nModel accuracy:{:.2f}% - Baseline accuracy:{:.2f}%: Increase of {:.2f}%'.format(accuracy*100, baseline_accuracy*100, (accuracy-baseline_accuracy)*100))
    else:
        print('Model accuracy:{:.2f}% - Baseline accuracy:{:.2f}%: Decrease of {:.2f}%'.format(accuracy*100, baseline_accuracy*100, (accuracy-baseline_accuracy)*100))
    
    print('Overall Train and Prediction time: {:.2f}s'.format(elapsed_time.total_seconds()))
    print('-'*89)
          
    return accuracy, elapsed_time

In [28]:
# Comparing different classifiers using pipeline

def classifier_comparator(train, val, classifier=zipped_clf):
    result = []
    
    for clf_name, clf in zipped_clf:
        pipeline = Pipeline([
            ('classifier', clf)
        ])
        
        print("\nValidation result for {} classifier".format(clf_name))
        print(clf)
        
        # Calculate accuracy & summmary
        clf_accuracy, clf_time = accuracy_features(pipeline, train, y_train, val, y_val)
        result.append((clf_name, clf_accuracy, clf_time))
        
    return result

In [31]:
%%time

print('Result for joint vectors operation \n(Distributed Bag Of Words Unigram + Distributed Memory BoW & Mean Triagram)\nRunning Different Classifiers now .......................\n')
classifier_comparator(train_vecs_dbow_ug_dmm_tg_scaled, val_vecs_dbow_ug_dmm_tg_scaled, zipped_clf)

Result for joint vectors operation 
(Distributed Bag Of Words Unigram + Distributed Memory BoW & Mean Triagram)
Running Different Classifiers now .......................


Validation result for Logistic Regression classifier
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Baseline accuracy: 50.01%
Accuracy score: 75.43%

Model accuracy:75.43% - Baseline accuracy:50.01%: Increase of 25.42%
Overall Train and Prediction time: 70.29s
-----------------------------------------------------------------------------------------

Validation result for Multinomial NB classifier
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Baseline accuracy: 50.01%
Accuracy score: 73.88%

Model accuracy:73.88% - Baseline accuracy:50.01%: Increase of 23.87%
Overall Train and Prediction tim



Baseline accuracy: 50.01%
Accuracy score: 71.85%

Model accuracy:71.85% - Baseline accuracy:50.01%: Increase of 21.84%
Overall Train and Prediction time: 5.02s
-----------------------------------------------------------------------------------------

Validation result for Passive-Aggresive classifier
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              fit_intercept=True, loss='hinge', max_iter=None, n_iter=None,
              n_jobs=1, random_state=None, shuffle=True, tol=None,
              verbose=0, warm_start=False)




Baseline accuracy: 50.01%
Accuracy score: 67.92%

Model accuracy:67.92% - Baseline accuracy:50.01%: Increase of 17.91%
Overall Train and Prediction time: 6.60s
-----------------------------------------------------------------------------------------

Validation result for Nearest Centroid classifier
NearestCentroid(metric='euclidean', shrink_threshold=None)
Baseline accuracy: 50.01%
Accuracy score: 73.87%

Model accuracy:73.87% - Baseline accuracy:50.01%: Increase of 23.86%
Overall Train and Prediction time: 2.36s
-----------------------------------------------------------------------------------------
CPU times: user 55min 5s, sys: 10.7 s, total: 55min 16s
Wall time: 55min 10s


In [32]:
clf

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)