### Introduction

Sentiment analysis is a key NLP task that has received much attention in recent years. Detecting the sentiment of a text is important for many applications, for instance collecting customer feedback about a brand or product, or aggregating and summarising opinions in reviews for recommender systems.

Sentiment analysis can be viewed as a text classification problem in which the classes are defined by the opinions and emotions expressed in the text.

### Goal

The goal is to configure and train a convolutional neural network and a logistic regression model on Twitter data to predict opinion polarity, using Google Colab.


### Requirements:
- tensorflow
- keras
- scikit-learn

### Import Libraries:

In [0]:
# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, concatenate, Activation, Conv1D, GlobalMaxPooling1D, Dropout
from keras.models import Model, model_from_json
from keras.layers import Embedding
from keras.callbacks import ModelCheckpoint

# scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.manifold import TSNE

# Others
import pandas as pd
import numpy as np
from google.colab import files

### Dataset:

A typical sentiment analysis system needs an annotated text corpus on which it is trained and/or evaluated. For English, annotated resources are widely available; for Russian, only a few exist.


I chose http://study.mokoron.com/, a corpus of Twitter messages in Russian that were manually labeled as positive or negative.

In [3]:
! mkdir data
! wget https://www.dropbox.com/s/r6u59ljhhjdg6j0/negative.csv -P data -q
! wget https://www.dropbox.com/s/fnpq3z4bcnoktiv/positive.csv -P data -q
! ls data 

negative.csv  positive.csv


In [0]:
column_list = [
    "id", # unique ID of the message in Twitter
    "tdate", # publication date of the message (tweet)
    "tmane", # name of the user who posted the message
    "ttext", # text of the message (tweet)
    "ttype", # field that will later hold the class of the tweet (positive, negative, neutral)
    "trep", # number of replies to this message; the Twitter API does not currently return this information
    "tfav", # number of times other users added this message to their favorites
    "tstcount", # total number of messages the user has posted on Twitter
    "tfol", # number of the user's followers (people who read the user)
    "tfrien", # number of the user's friends (people the user reads)
    "listcount" # number of subscription lists the Twitter user has been added to
]

In [5]:
data_positive = pd.read_csv("data/positive.csv", sep=";", names=column_list, index_col=False) 
data_positive.head()

Unnamed: 0,id,tdate,tmane,ttext,ttype,trep,tfav,tstcount,tfol,tfrien,listcount
0,408906692374446080,1386325927,pleease_shut_up,"@first_timee хоть я и школота, но поверь, у на...",1,0,0,0,7569,62,61
1,408906692693221377,1386325927,alinakirpicheva,"Да, все-таки он немного похож на него. Но мой ...",1,0,0,0,11825,59,31
2,408906695083954177,1386325927,EvgeshaRe,RT @KatiaCheh: Ну ты идиотка) я испугалась за ...,1,0,1,0,1273,26,27
3,408906695356973056,1386325927,ikonnikova_21,"RT @digger2912: ""Кто то в углу сидит и погибае...",1,0,1,0,1549,19,17
4,408906761416867842,1386325943,JumpyAlex,@irina_dyshkant Вот что значит страшилка :D\nН...,1,0,0,0,597,16,23


In [6]:
data_positive["ttext"][777]

'@Olgana1000000 нет...предупреждает, что их власть надолго...но думаю опять переоценил он свои возможности..алкоголь не даст дожить до 70))'

In [7]:
data_positive.shape

(114911, 11)

In [8]:
data_negative = pd.read_csv("data/negative.csv", sep=";", names=column_list, index_col=False) 
data_negative.head()

Unnamed: 0,id,tdate,tmane,ttext,ttype,trep,tfav,tstcount,tfol,tfrien,listcount
0,408906762813579264,1386325944,dugarchikbellko,на работе был полный пиддес :| и так каждое за...,-1,0,0,0,8064,111,94
1,408906818262687744,1386325957,nugemycejela,"Коллеги сидят рубятся в Urban terror, а я из-з...",-1,0,0,0,26,42,39
2,408906858515398656,1386325966,4post21,@elina_4post как говорят обещаного три года жд...,-1,0,0,0,718,49,249
3,408906914437685248,1386325980,Poliwake,"Желаю хорошего полёта и удачной посадки,я буду...",-1,0,0,0,10628,207,200
4,408906914723295232,1386325980,capyvixowe,"Обновил за каким-то лешим surf, теперь не рабо...",-1,0,0,0,35,17,34


In [9]:
data_negative.shape

(111923, 11)

In [10]:
# Merge positive and negative tweets into one df
data_input = shuffle(pd.concat([data_positive, data_negative], ignore_index=True))
data_input.shape

(226834, 11)

### Data processing:

In [0]:
# Constants for data processing:
RANDOM_STATE = 42
VOCABULARY_SIZE = 10000
MAX_SEQUENCE_LENGTH = 100
EMBEDDING_SIZE = 200
TEST_SIZE = 0.3

In [0]:
np.random.seed(RANDOM_STATE)

In [0]:
# Convert words to lower case
# TODO: clean data: delete names, urls and others
text = data_input["ttext"].str.lower()  
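
The cleaning step left as a TODO above could look roughly like the sketch below. The `clean_tweet` helper and its regular expressions are assumptions made for illustration and are not part of the original notebook; it lower-cases the text and removes the retweet marker, user mentions and URLs, while leaving emoticons untouched.

```python
import re

# Hypothetical patterns (assumptions, not taken from the original notebook).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"^rt\b\s*:?\s*")  # applied to already lower-cased text
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Lower-case a tweet and strip URLs, @mentions and a leading 'rt' marker."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = SPACE_RE.sub(" ", text).strip()
    text = RETWEET_RE.sub("", text)
    return text


print(clean_tweet("RT @KatiaCheh: Ну ты идиотка) http://t.co/abc"))
```

If such a helper were adopted, it could replace the plain lower-casing step with something like `data_input["ttext"].map(clean_tweet)`.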

In [0]:
# target value, 1-positive, 0-negative
y = pd.to_numeric((data_input["ttype"] + 1) / 2, downcast="integer")

In [0]:
# Split into train and test/validation sets
# TODO: Use cross-validation
text_train, text_val, y_train, y_val = train_test_split(text, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)

In [15]:
print("Train set has total {cnt_rows} entries with {neg_rate:.2%} negative, {pos_rate:.2%} positive".format(cnt_rows=text_train.shape[0], neg_rate=1-np.average(y_train), pos_rate=np.average(y_train)))
print("Test/Val set has total {cnt_rows} entries with {neg_rate:.2%} negative, {pos_rate:.2%} positive".format(cnt_rows=text_val.shape[0], neg_rate=1-np.average(y_val), pos_rate=np.average(y_val)))

Train set has total 158783 entries with 49.37% negative, 50.63% positive
Test/Val set has total 68051 entries with 49.26% negative, 50.74% positive
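
The set sizes printed above can be reproduced with a little arithmetic. The sketch assumes that `train_test_split` rounds a fractional test size up and assigns the remaining rows to the training set; the constant names are local to this example.

```python
import math

N_SAMPLES = 226834  # rows in data_input, taken from the shape printed earlier
TEST_SIZE = 0.3

# Assumption: the test share is rounded up, the rest goes to training.
n_val = math.ceil(TEST_SIZE * N_SAMPLES)
n_train = N_SAMPLES - n_val

print("train:", n_train, "test/val:", n_val)
```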


In [0]:
# Create sequences for our neural network
tokenizer = Tokenizer(num_words=VOCABULARY_SIZE)
tokenizer.fit_on_texts(text_train)
word_index = tokenizer.word_index

sequences_train = tokenizer.texts_to_sequences(text_train)
sequences_val = tokenizer.texts_to_sequences(text_val)

# TODO: Compare results for padding='post' and padding='pre'
sequences_pad_train = pad_sequences(sequences_train, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
sequences_pad_val = pad_sequences(sequences_val, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

In [17]:
sequences_pad_train[0]

array([ 14,  15, 249,   4,  17,  29, 103,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0], dtype=int32)
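
To make the effect of `padding='post'` versus `padding='pre'` concrete, here is a pure-Python sketch of the padding logic. `pad_sketch` is a hypothetical stand-in written for this illustration, not the Keras implementation; it mirrors the documented defaults of `pad_sequences` (`padding='pre'`, `truncating='pre'`, fill value 0). Index 0 is free to use as the fill value because the word indices produced by `Tokenizer` start at 1.

```python
def pad_sketch(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to exactly `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the start, 'post' drops them from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(seq + fill if padding == "post" else fill + seq)
    return padded


print(pad_sketch([[14, 15, 249]], maxlen=5, padding="post"))
print(pad_sketch([[14, 15, 249]], maxlen=5, padding="pre"))
```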

In [0]:
# Create train and test/val data for logistic regression: 
vectorizer = TfidfVectorizer(max_features=VOCABULARY_SIZE, ngram_range=(1, 3))
vectorizer.fit(text_train)

X_train_tfidf = vectorizer.transform(text_train)
X_val_tfidf = vectorizer.transform(text_val)
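
The `ngram_range=(1, 3)` setting means that unigrams, bigrams and trigrams all become candidate features, and `max_features` then keeps only the most frequent ones across the corpus. The sketch below shows which word n-grams a single tokenized tweet yields. `word_ngrams` is a hypothetical helper that takes an already tokenized list; the real vectorizer applies its own tokenizer, which by default drops punctuation and single-character tokens.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all word n-grams of the requested lengths, shortest first."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams(["не", "люблю", "парней"]))
```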


### Models:
#### 1. CNN

The model is based on Y. Kim's well-known paper "Convolutional Neural Networks for Sentence Classification" (https://arxiv.org/pdf/1408.5882.pdf). In this analysis I do not use the multi-channel approach (e.g. one channel with static input word vectors and another whose word vectors are updated during training); I only use convolution filters of different widths, i.e. different n-gram sizes.

The model has parallel branches that take the same input and each perform their own computation; their outputs are then merged. This kind of network structure can be built with the Keras functional API: https://keras.io/getting-started/functional-api-guide/

In [0]:
# Constants for CNN model:
NUMBER_OF_FILTERS = 100
DROPOUT_RATE = 0.5      
BATCH_SIZE = 50      

In [20]:
# Specify each convolution layer and its kernel size, i.e. the n-gram width
# TODO: Try to add dropout layer after each conv layer
# TODO: Try BatchNormalization?
# TODO: use pre-trained word embeddings
  
cnn_model_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
cnn_model_embedding = Embedding(VOCABULARY_SIZE, EMBEDDING_SIZE, input_length=MAX_SEQUENCE_LENGTH, trainable=True)(cnn_model_input)

cnn_model_branch_2_conv1 = Conv1D(filters=NUMBER_OF_FILTERS, kernel_size=2, padding='valid', activation='relu', strides=1)(cnn_model_embedding)
cnn_model_branch_2_maxpool = GlobalMaxPooling1D()(cnn_model_branch_2_conv1)

cnn_model_branch_4_conv1 = Conv1D(filters=NUMBER_OF_FILTERS, kernel_size=4, padding='valid', activation='relu', strides=1)(cnn_model_embedding)
cnn_model_branch_4_maxpool = GlobalMaxPooling1D()(cnn_model_branch_4_conv1)

cnn_model_branch_5_conv1 = Conv1D(filters=NUMBER_OF_FILTERS, kernel_size=5, padding='valid', activation='relu', strides=1)(cnn_model_embedding)
cnn_model_branch_5_maxpool = GlobalMaxPooling1D()(cnn_model_branch_5_conv1)

cnn_model_concatenate = concatenate([cnn_model_branch_2_maxpool, cnn_model_branch_4_maxpool, cnn_model_branch_5_maxpool], axis=1)

cnn_model_dense_1 = Dense(256, activation='relu')(cnn_model_concatenate)
cnn_model_dropout = Dropout(DROPOUT_RATE)(cnn_model_dense_1)
cnn_model_dense_2 = Dense(1)(cnn_model_dropout)
cnn_model_output = Activation('sigmoid')(cnn_model_dense_2)
cnn_model = Model(inputs=[cnn_model_input], outputs=[cnn_model_output])

cnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
cnn_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 100, 200)     2000000     input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 99, 100)      40100       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 97, 100)      80100       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_3 (
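
The visible rows of the summary can be cross-checked by hand. With `padding='valid'` and stride 1, a `Conv1D` layer shortens the sequence to `length - kernel_size + 1`, and its parameter count is `kernel_size * input_channels * filters` weights plus one bias per filter. The sketch below applies this arithmetic to the constants defined earlier; the values for kernel size 5 are computed with the same formula and are not read from the truncated output.

```python
VOCABULARY_SIZE = 10000
MAX_SEQUENCE_LENGTH = 100
EMBEDDING_SIZE = 200
NUMBER_OF_FILTERS = 100


def conv1d_valid(kernel_size):
    """Output length and parameter count of a stride-1 'valid' Conv1D branch."""
    out_len = MAX_SEQUENCE_LENGTH - kernel_size + 1
    weights = kernel_size * EMBEDDING_SIZE * NUMBER_OF_FILTERS
    biases = NUMBER_OF_FILTERS
    return out_len, weights + biases


embedding_params = VOCABULARY_SIZE * EMBEDDING_SIZE
print("embedding params:", embedding_params)
for k in (2, 4, 5):
    print("kernel size", k, "->", conv1d_valid(k))
```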

#### Fit CNN

In [21]:
! mkdir models
! ls

data  models  sample_data


In [0]:
MODEL_FILEPATH = "models/CNN_best_weights.{epoch:02d}-{val_acc:.4f}.hdf5"
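
The placeholders in `MODEL_FILEPATH` are ordinary `str.format` fields, which `ModelCheckpoint` fills in with the 1-based epoch number and the logged metrics. The sketch below formats the template by hand, using the `val_acc` value from the training log further down, to show how the checkpoint file name comes about. Note that more recent Keras releases log this metric as `val_accuracy`, in which case the field name would have to change accordingly.

```python
MODEL_FILEPATH = "models/CNN_best_weights.{epoch:02d}-{val_acc:.4f}.hdf5"

# Values as they appear in the training log: first epoch, val_acc 0.77130.
example_path = MODEL_FILEPATH.format(epoch=1, val_acc=0.77130)
print(example_path)  # models/CNN_best_weights.01-0.7713.hdf5
```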

In [23]:
checkpoint = ModelCheckpoint(MODEL_FILEPATH, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

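# Note: a bare `%time` line times an empty statement, so it does not measure the training.
# To time the fit, write `%time cnn_model.fit(...)` on one line or put `%%time` first in the cell.
%time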
cnn_model.fit(sequences_pad_train, y_train, batch_size=BATCH_SIZE, epochs=3, validation_split=0.3, callbacks = [checkpoint])

CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 12.4 µs
Train on 111148 samples, validate on 47635 samples
Epoch 1/3

Epoch 00001: val_acc improved from -inf to 0.77130, saving model to models/CNN_best_weights.01-0.7713.hdf5
Epoch 2/3

Epoch 00002: val_acc did not improve from 0.77130
Epoch 3/3

Epoch 00003: val_acc did not improve from 0.77130


<keras.callbacks.History at 0x7fbca122f2b0>

In [0]:
# Download best weights
files.download("models/CNN_best_weights.01-0.7713.hdf5")

In [26]:
! sha256sum models/CNN_best_weights.01-0.7713.hdf5
! du -sh models/CNN_best_weights.01-0.7713.hdf5

6ad3e8d4369b0e8cad21581f9f9ebdd3b7135299999cc8de311f327943c542ef  models/CNN_best_weights.01-0.7713.hdf5
27M	models/CNN_best_weights.01-0.7713.hdf5


In [0]:
# Save the model architecture as JSON and download it
with open("models/cnn_model.json", "w") as json_file:
  json_file.write(cnn_model.to_json())
files.download("models/cnn_model.json")

In [32]:
! sha256sum models/cnn_model.json
! du -sh models/cnn_model.json

8a3c8edb7f6956b5455d07b0db787549a11b6974f06c0ec33b4cfa5d4d212c58  models/cnn_model.json
8.0K	models/cnn_model.json


#### Load model

In [39]:
# The weights and the model architecture can be downloaded from these URLs:
! mkdir model_loaded
! wget https://www.dropbox.com/s/xkb69ivurvepk39/CNN_best_weights.01-0.7713.hdf5 -P model_loaded -q
! wget https://www.dropbox.com/s/g55ezivfssvvgwv/cnn_model.json -P model_loaded -q
! ls model_loaded

CNN_best_weights.01-0.7713.hdf5  cnn_model.json


In [49]:
with open('model_loaded/cnn_model.json', 'r') as json_file:
  cnn_model_loaded = model_from_json(json_file.read())

cnn_model_loaded.load_weights("model_loaded/CNN_best_weights.01-0.7713.hdf5") 
cnn_model_loaded.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

cnn_model_loaded.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 100, 200)     2000000     input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 99, 100)      40100       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 97, 100)      80100       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_3 (

#### 2. Logistic regression + Tf-Idf

In [46]:
# Simple logistic regression
# TODO: Tune hyper-parameters

# Note: a bare `%time` line times an empty statement, not the fit below.
%time
lr_model = LogisticRegression()
lr_model.fit(X_train_tfidf, y_train)

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 9.78 µs


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Evaluate

In [51]:
# Accuracy
accuracy_cnn = cnn_model_loaded.evaluate(sequences_pad_val, y_val,batch_size=BATCH_SIZE, verbose=1)[1]
accuracy_lr = lr_model.score(X_val_tfidf, y_val)
print("CNN accuracy: {0:.2f}\nLR accuracy: {1:.2f}".format(accuracy_cnn, accuracy_lr))

CNN accuracy: 0.77
LR accuracy: 0.75


In [76]:
yhat_cnn = cnn_model_loaded.predict(sequences_pad_val, batch_size=BATCH_SIZE, verbose=1)
yhat_lr = lr_model.predict_proba(X_val_tfidf)



In [0]:
y_class_cnn = (yhat_cnn > 0.5).astype('int32')
y_class_lr = lr_model.predict(X_val_tfidf)

In [78]:
# roc_auc
print("CNN_roc_auc: {cnn:.2f}, LR_roc_auc: {lr:.2f}".format(cnn=roc_auc_score(y_val, yhat_cnn), lr=roc_auc_score(y_val, yhat_lr[:,1]))) 

CNN_roc_auc: 0.86, LR_roc_auc: 0.83
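
ROC AUC can be read as the probability that a randomly chosen positive tweet receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition on a toy example. `roc_auc_pairwise` is a hypothetical helper with quadratic cost, meant only to illustrate the metric, not to replace `roc_auc_score`.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
print(roc_auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```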


In [79]:
# confusion matrix
conf_matrix_cnn = confusion_matrix(y_val, y_class_cnn)
conf_matrix_lr = confusion_matrix(y_val, y_class_lr)

print("CNN")
# confusion_matrix sorts the labels, so row/column 0 is the negative class (0) and row/column 1 the positive class (1)
print(pd.DataFrame(conf_matrix_cnn, index=['true:neg', 'true:pos'], columns=['pred:neg', 'pred:pos']))

print("\nLR")
print(pd.DataFrame(conf_matrix_lr, index=['true:neg', 'true:pos'], columns=['pred:neg', 'pred:pos']))


CNN
          pred:neg  pred:pos
true:neg  24770     8755    
true:pos  6716      27810   

LR
          pred:neg  pred:pos
true:neg  24430     9095    
true:pos  8129      26397   
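
Accuracy, precision and recall can be derived from the confusion matrices above. The sketch assumes scikit-learn's layout, in which rows are true labels and columns are predicted labels, both in sorted order, so with labels 0 and 1 the first row holds the truly negative tweets. `binary_metrics` is a hypothetical helper written for this illustration.

```python
# CNN confusion matrix copied from the output above:
# rows = true label (0, 1), columns = predicted label (0, 1).
cnn_matrix = [[24770, 8755],
              [6716, 27810]]


def binary_metrics(matrix):
    """Accuracy plus precision and recall for the positive class (label 1)."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


cnn_metrics = binary_metrics(cnn_matrix)
for name, value in cnn_metrics.items():
    print("{}: {:.2f}".format(name, value))
```

The accuracy obtained this way agrees with the 0.77 reported by `evaluate` earlier.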


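The table below shows the predicted probability of the positive class for the first few validation tweets. For example, both models classify the first tweet as positive and the second as negative.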

In [80]:
pd.set_option('display.max_colwidth', -1)
pd.DataFrame({"tweet": text_val, "CNN": yhat_cnn[:,0], "LR": yhat_lr[:,1]}).head()

Unnamed: 0,CNN,LR,tweet
130831,0.585968,0.564431,"седовласый инженер сказал, что удары током это вовсе не удары током, а всего лишь статика, и ""купи серебряные наручники, я тебя заземлю"".\n:("
2532,0.337828,0.337672,"rt @dance_with_me_: кто бы знал как я не люблю безхарактерных, вечноноющих парней :-)"
31387,0.914562,0.840336,"ну всё понятно, городской житель, к ухабам и колеям не привык)) в колее то хорошо - нашел желобок - и как трамвай едешь)))"
20210,0.946806,0.947246,rt @nina_one_nina: @directioner6901 талдна) спасибо щедрый человек
193285,0.039314,0.107516,@advakhova прости (( \nвеселье уедет в раменское не честно


In [96]:
# Word embeddings
conv_embds = cnn_model_loaded.layers[1].get_weights()[0]
conv_embds[0]

array([-0.02271716, -0.03282207, -0.01942212, -0.06260538, -0.02711385,
       -0.01694492, -0.00113889, -0.1432817 , -0.02591053,  0.00694427,
       -0.00456908, -0.072645  , -0.02462506, -0.05915061,  0.01752759,
       -0.0563279 , -0.03050778, -0.03511775,  0.00021002, -0.05763318,
        0.04794497,  0.02431646, -0.03907726, -0.01259894,  0.01502478,
       -0.02743094,  0.01916009,  0.02461171, -0.00249929, -0.008789  ,
       -0.00715822, -0.04765795, -0.05609009, -0.00240624,  0.03763028,
       -0.00055717, -0.0277299 ,  0.05324748, -0.00399163, -0.0204413 ,
        0.03093167,  0.00538668, -0.0503009 ,  0.01336168,  0.01986855,
        0.0211968 , -0.06580566,  0.02032636, -0.00210202,  0.03620323,
       -0.02590587,  0.04238628, -0.07325724, -0.04166886,  0.03910806,
       -0.00201675,  0.04281683,  0.045309  , -0.01104105, -0.00518672,
        0.07085174,  0.0420563 ,  0.02910163, -0.01480897, -0.00597519,
        0.02139434, -0.02614024,  0.00137777, -0.00439095, -0.07

In [0]:
# TODO: What words have the maximum impact
# TODO: Compare two models
# TODO: Word embedding visualization
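
For the embedding-related TODOs, one possible starting point is to look up nearest neighbours by cosine similarity. The sketch below uses made-up three-dimensional toy vectors; `cosine_similarity`, `nearest_words` and `toy_embeddings` are all hypothetical names for this example. With the real model, row `i` of `conv_embds` would correspond to the word whose value in `tokenizer.word_index` is `i`, and row 0 belongs to the padding index rather than to a word.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(query, embeddings, top_n=2):
    """Rank all other words by cosine similarity to `query`, best first."""
    scores = [
        (word, cosine_similarity(embeddings[query], vector))
        for word, vector in embeddings.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors invented for illustration only.
toy_embeddings = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.7, 0.3, 0.0],
}

print(nearest_words("good", toy_embeddings))
```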