# Doc2Vec demonstration 

In this notebook, we look at how to learn document embeddings with Doc2Vec and use them as features for text classification. We use the "Sentiment and Emotion in Text" dataset from Figure-Eight, which its authors describe as follows:

"In a variation on the popular task of sentiment analysis, this dataset contains labels for the emotional content (such as happiness, sadness, and anger) of texts. Hundreds to thousands of examples across 13 labels. A subset of this data is used in an experiment we uploaded to Microsoft’s Cortana Intelligence Gallery."

Source: https://www.figure-eight.com/data-for-everyone/

In [61]:
import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [62]:
#Load the dataset and explore.
filepath = "/home/bangaru/Downloads/NLPBookTut/text_emotion.csv"
df = pd.read_csv(filepath)
print(df.shape)
df.head()

(40000, 4)


| | tweet_id | sentiment | author | content |
|---|---|---|---|---|
| 0 | 1956967341 | empty | xoshayzers | @tiffanylue i know i was listenin to bad habi... |
| 1 | 1956967666 | sadness | wannamama | Layin n bed with a headache ughhhh...waitin o... |
| 2 | 1956967696 | sadness | coolfunky | Funeral ceremony...gloomy friday... |
| 3 | 1956967789 | enthusiasm | czareaquino | wants to hang out with friends SOON! |
| 4 | 1956968416 | neutral | xkilljoyx | @dannycastillo We want to trade with someone w... |


In [63]:
df['sentiment'].value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64

In [152]:
#Let us take the top 3 categories and leave out the rest.
shortlist = ["neutral", "happiness", "worry"]
df_subset = df[df['sentiment'].isin(shortlist)]
df_subset.shape

(22306, 4)

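# Text pre-processing:
- Remove @mentions, and perhaps URLs.
- Use the NLTK Tweet tokenizer instead of a general-purpose tokenizer.
- Possibly keep only a few frequent categories (e.g. happiness, sadness, worry, neutral) and ignore the rest. The shortlist above keeps neutral, happiness, and worry.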
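
The notes above mention URLs, but the code in this notebook only strips @handles (via `strip_handles=True`). A minimal sketch of URL removal that could run before tokenization is shown here. The `strip_urls` helper and the `URL_PATTERN` regex are illustrative assumptions, not part of the original notebook, and the pattern is deliberately simple.

```python
import re

# Assumption: treat anything starting with http://, https:// or www. as a URL.
# Real tweets may need a stricter or broader pattern.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text):
    """Return text with URL-like substrings removed and whitespace collapsed."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


cleaned = strip_urls("check this out http://bit.ly/abc123 so cool")
print(cleaned)
```

If used, it could be applied to each tweet before `tweeter.tokenize(content)` is called, leaving the rest of the pipeline unchanged.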

In [153]:
#strip_handles removes @mentions; preserve_case=False lowercases the text.
tweeter = TweetTokenizer(strip_handles=True, preserve_case=False)
mystopwords = set(stopwords.words("english"))

def preprocess_corpus(texts):
    def remove_stops_digits(tokens):
        #Nested function that removes stopwords and purely numeric tokens from a list of tokens
        return [token for token in tokens if token not in mystopwords and not token.isdigit()]
    #Tokenize each text with the tweet tokenizer, then filter the tokens with the function above.
    return [remove_stops_digits(tweeter.tokenize(content)) for content in texts]


mydata = preprocess_corpus(df_subset['content'])
mycats = df_subset['sentiment']
print(len(mydata), len(mycats))


22306 22306


In [154]:
#Split data into train and test (train_test_split holds out 25% for testing by default)
train_data, test_data, train_cats, test_cats = train_test_split(mydata, mycats, random_state=1234)

#Prepare training data in doc2vec format: each token list gets a unique string tag
train_doc2vec = [TaggedDocument(words=d, tags=[str(i)]) for i, d in enumerate(train_data)]

In [165]:
#Train a doc2vec model
max_epochs = 100
vec_size = 50
alpha = 0.025

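#dm=1 selects the distributed memory (PV-DM) training algorithm;
#min_count=10 ignores words that occur fewer than 10 times in the training data.
model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_count=10,
                dm=1, epochs=max_epochs)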
  
model.build_vocab(train_doc2vec)
print(model.epochs)

model.train(train_doc2vec,
                total_examples=model.corpus_count,
                epochs=model.epochs)

model.save("d2v.model")
print("Model Saved")

100
Model Saved


In [166]:
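#Infer vectors for the train and test documents using the saved model.
#Note: gensim 4.0 renamed the `steps` argument of infer_vector to `epochs`; on gensim 3.x, pass steps=50 instead.
model = Doc2Vec.load("d2v.model")
train_vectors = [model.infer_vector(list_of_tokens, epochs=50) for list_of_tokens in train_data]
test_vectors = [model.infer_vector(list_of_tokens, epochs=50) for list_of_tokens in test_data]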

In [167]:
#Use any regular classifier like logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

myclass = LogisticRegression(class_weight="balanced")
myclass.fit(train_vectors, train_cats)
print(myclass.classes_)

['happiness' 'neutral' 'worry']


In [168]:
preds = myclass.predict(test_vectors)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(test_cats, preds))
print(confusion_matrix(test_cats,preds))

             precision    recall  f1-score   support

  happiness       0.50      0.47      0.49      1331
    neutral       0.47      0.58      0.52      2143
      worry       0.54      0.44      0.49      2103

avg / total       0.51      0.50      0.50      5577

[[ 626  498  207]
 [ 328 1245  570]
 [ 290  889  924]]
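
To see how the report and the confusion matrix relate, the sketch below recomputes per-class precision and recall, plus overall accuracy, directly from the matrix printed above. The numbers are copied from that output. The `per_class_scores` helper is a hypothetical name introduced here for illustration, and the code assumes the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels.
labels = ["happiness", "neutral", "worry"]
matrix = [
    [626, 498, 207],
    [328, 1245, 570],
    [290, 889, 924],
]


def per_class_scores(matrix):
    """Return a list of (precision, recall) pairs, one per class."""
    scores = []
    for k in range(len(matrix)):
        true_positive = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column sum: everything predicted as class k
        actual_k = sum(matrix[k])                    # row sum: everything that truly is class k
        scores.append((true_positive / predicted_k, true_positive / actual_k))
    return scores


scores = per_class_scores(matrix)
total = sum(map(sum, matrix))
accuracy = sum(matrix[k][k] for k in range(len(matrix))) / total

for label, (precision, recall) in zip(labels, scores):
    print(f"{label:>10}  precision={precision:.2f}  recall={recall:.2f}")
print(f"accuracy={accuracy:.2f} over {total} test tweets")
```

Rounded to two decimals, these values agree with the classification report. The largest off-diagonal cell is the 889 worry tweets predicted as neutral, which is the main source of the low recall for worry.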
