# Sentiment analysis 

# Introduction
Analyze & classify sentiment of text data, articles into positive or negative

# Objective
Sentiment analysis notebooks dives in very depth of various concepts, methods related to text analysis and understand the meaning of it semantically and/or syntactly. They are classified in the following five based notebooks based on different methods & tools used to analyze & classify text.

1. Sentiment Analysis with Text Blob, Word Cloud, Count Vectorizer, N-Gram
2. Sentiment Analysis using Doc2Vec, N-Gram & Phrase Modelling
3. Sentiment Analysis with Chi2 Square & PCA Dimension Reduction
4. Sentiment Analysis with Keras & Tensorflow
5. Sentiment Analysis with Keras & Tensorflow using Doc2Vec, Pretrained GloVe

# Cuatro
## 4. Sentiment Analysis with Keras & Tensorflow

In [1]:
# Basic import

import re
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
# from tqdm import tqdm
# tqdm.pandas(desc="progress-bar")

# from gensim.models import Doc2Vec
# from gensim.models.doc2vec import LabeledSentence
# from gensim.models.phrases import Phrases, Phraser

In [3]:
from textblob import TextBlob
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from datetime import datetime

import multiprocessing

In [4]:
# Read TF dataframe

df = pd.read_hdf('./data/redstone.hdf')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1600000 entries, 0 to 1599999
Data columns (total 3 columns):
sentiment        1600000 non-null int64
text             1600000 non-null object
pre_clean_len    1600000 non-null int64
dtypes: int64(2), object(1)
memory usage: 48.8+ MB


Unnamed: 0,sentiment,text,pre_clean_len
0,0,awww that bummer you shoulda got david carr of...,115
1,0,is upset that he can not update his facebook b...,111
2,0,dived many times for the ball managed to save ...,89
3,0,my whole body feels itchy and like its on fire,47
4,0,no it not behaving at all mad why am here beca...,111


In [5]:
# Santitizing dataframe

df.dropna(inplace=True)
df.reset_index(drop=True,inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 3 columns):
sentiment        1600000 non-null int64
text             1600000 non-null object
pre_clean_len    1600000 non-null int64
dtypes: int64(2), object(1)
memory usage: 36.6+ MB


In [6]:
from sklearn import utils
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression

train = df.text
label = df.sentiment
SEED = 21

# Splitting data into train, test & validation sets
x_train, x_val_test, y_train, y_val_test = train_test_split(train, label, test_size=.02, random_state=SEED)

x_val, x_test, y_val, y_test = train_test_split(x_val_test, y_val_test, test_size=.5, random_state=SEED)



In [7]:
type(y_train)

pandas.core.series.Series

In [8]:
# Quantifying the positive & negative sentiments in the dataset

from collections import Counter

counter = Counter(y_train)
print('Train set entries.')
for key in counter:
    if key == 0:
        print('{:.2f}% Negative Entries'.format( (counter[key]/len(y_train))*100 ))
    elif key == 1:
        print('{:.2f}% Positive Entries'.format( (counter[key]/len(y_train))*100 ))
        
counter = Counter(y_val)
print('\nValidation set entries.')
for key in counter:
    if key == 0:
        print('{:.2f}% Negative Entries'.format( (counter[key]/len(y_val))*100 ))
    elif key == 1:
        print('{:.2f}% Positive Entries'.format( (counter[key]/len(y_val))*100 ))

counter = Counter(y_test)
print('\nTest set entries.')
for key in counter:
    if key == 0:
        print('{:.2f}% Negative Entries'.format( (counter[key]/len(y_test))*100 ))
    elif key == 1:
        print('{:.2f}% Positive Entries'.format( (counter[key]/len(y_test))*100 ))

Train set entries.
50.00% Negative Entries
50.00% Positive Entries

Validation set entries.
50.01% Negative Entries
49.99% Positive Entries

Test set entries.
50.21% Negative Entries
49.79% Positive Entries


### Artificial Neural Networks

After experimenting with Logistic Regression, it would be interesting to evaluate the result of neural network classifier. Logistic regression can be thought as a basic neural network with no hidden layer and just one output node.

![title](./images/lr_nn.png)

### TFIDF Vectorizer with Artificial Neural Networks

The best performing TFIDF vectors have 100,000 features for (Unigram + Trigram) word tokens with logistic regression. 

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
tvec = tvec.fit(x_train)

In [12]:
# Transform train  & validation set

tf_train = tvec.transform(x_train)
tf_val = tvec.transform(x_val)

In [13]:
%%time

# Fitting Logistic Regression classical model
clf = LogisticRegression()
clf.fit(tf_train, y_train)

CPU times: user 30.7 s, sys: 124 ms, total: 30.8 s
Wall time: 31 s


In [14]:
# Train & Validation scores

display(clf.score(tf_train, y_train))
display(clf.score(tf_val, y_val))

0.8413386479591837

0.824

Keras 

In [15]:
# Basic Keras Import

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

import numpy as np

# Fix the seed
np.random.seed(21)

Using TensorFlow backend.


ADAM is an optimization algorithm for updating the parameters and minimizing the cost of the neural network, which is proved to be very effective. It combines two methods of optimization: RMSProp, Momentum. 

Keras NN model cannot handle sparse matrix directly. Hence the data has to be either a dense array or matrix, but transforming the whole training data of 1.5 million (TFIDF vectors) into a dense array won't fit into my RAM. 
An iterable generator object would solve this problem by generating required data on the run which can be achieved by using "yield" instead of "return".

In [4]:
# Batch generator

def batch_generator(train, label, batch_size):
    
    # Calculate no of batches
    number_of_batches = train.shape[0]/batch_size
    
    # Data set indices to choose a batch from
    batch = np.arange(tf_train.shape[0])
    # Starting batch index
    batch_idx = 0
    while True:
        # Selecting batches
        train_batch = train[ batch[batch_size*batch_idx:batch_size*(batch_idx+1)], :].toarray()
        label_batch = label[ batch[batch_size*batch_idx:batch_size*(batch_idx+1)] ]

        #print('\n{} Batch indices from {} to {} selected.\n'.format((batch_idx+1), (batch_size*batch_idx), (batch_size*(batch_idx+1))))
        
        # Generator statement
        yield train_batch, label_batch
        
        # Next batch
        batch_idx += 1
        # Check if 1 epoch is finished then next batch index should be greater than no of batches
        if (batch_idx > number_of_batches):
            batch_idx=0
            

In [17]:
# Parameters

batch_size = 16

In [None]:
# Create Model

model = Sequential()
model.add(Dense(64, input_dim=100000, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile Model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model

model.fit_generator(generator=batch_generator(tf_train, y_train, batch_size),
                    epochs=5, validation_data=(tf_val, y_val),
                    steps_per_epoch=tf_train.shape[0]/batch_size)

Epoch 1/5


Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]




#### Normalizing inputs

In [None]:
# Let's see if normalizing the inputs have any effect on the performance.

from sklearn.preprocessing import Normalizer

norm = Normalizer().fit(tf_train)

tf_train_norm = norm.transform(tf_train)
tf_val_norm = norm.transform(tf_val)

In [None]:
# Create Model

model = Sequential()
model.add(Dense(64, input_dim=100000, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile Model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model with normalized inputs

model.fit_generator(generator=batch_generator(tf_train_norm, y_train, batch_size),
                    epochs=5, validation_data=(tf_val_norm, y_val),
                    steps_per_epoch=tf_train.shape[0]/batch_size)

TFIDF is already normalized. TF (Term Frequency) in TFIDF isn't the absolute frequency but relative frequency and after multiplying IDF (Inverse Document Frequency) to the relative term frequency value, it further normalizes the values in a cross-document manner.

#### Dropout

According to the research paper "Improving neural networks by preventing co-adaptation of feature detectors" by Hinton et al. (2012), a good way to reduce the error on the test set is to average the predictions produced by a very large number of different networks. 
The standard way to do this is to train many separate networks and apply each of these networks to the test data, but this is computationally expensive during both phase of training and testing. 
Random dropout makes it possible to train a huge number of different networks in a reasonable time.
- https://arxiv.org/pdf/1207.0580.pdf

Dropout could be thought as the simulation of training many different networks and averaging them by randomly omitting hidden nodes with a certain probability, throughout the training process. 

In [None]:
# Create Model

model = Sequential()
model.add(Dense(64, input_dim=100000, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

# Compile Model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model

model.fit_generator(generator=batch_generator(tf_train, y_train, batch_size),
                    epochs=5, validation_data=(tf_val, y_val),
                    steps_per_epoch=tf_train.shape[0]/batch_size)

Dropout has added some generalization to the model.

#### Shuffling

By presenting data in the same order during each epoch, there's a possibility that the model learns the parameters which also include noise of the training data. It might eventually lead to overfitting. It can be mitigated by shuffling the order of the data fed to the model.

##### Updated Batch Generator with Shuffling

In [None]:
# Batch generator updated

def batch_generator_shuffle(train, label, batch_size):
    
    # Calculate no of batches
    number_of_batches = train.shape[0]/batch_size
    
    # Data set indices to choose a batch from
    batch = np.arange(tf_train.shape[0])
    # Shuffling batch indices
    np.random.shuffle(batch)
    
    # Starting batch index
    batch_idx = 0
    while True:
        # Selecting batches
        train_batch = train[ batch[batch_size*batch_idx:batch_size*(batch_idx+1)], :].toarray()
        label_batch = label[ batch[batch_size*batch_idx:batch_size*(batch_idx+1)] ]

        #print('\n{} Batch indices from {} to {} selected.\n'.format((batch_idx+1), (batch_size*batch_idx), (batch_size*(batch_idx+1))))
        
        # Generator statement
        yield train_batch, label_batch
        
        # Next batch
        batch_idx += 1
        # Check if 1 epoch is finished then next batch index should be greater than no of batches
        if (batch_idx > number_of_batches):
            np.random.shuffle(batch)
            batch_idx=0
            

Shuffling did improve the model's performance on the validation set. 

In the "deeplearning.ai" course by Andrew Ng, he states that the first thing he would try to improve a neural network model is tweaking the learning rate. 
Please note that except for the learning rate, the parameters 'beta_1', 'beta_2', and 'epsilon' are set to their default values as presented in the original paper titled
"ADAM: A Method for Stochastic Optimization" by Kingma and Ba (2015).       
- https://arxiv.org/pdf/1412.6980.pdf

In [None]:
%%time

import keras

# My ADAM with lr 0.005
my_adam = keras.optimizers.Adam(lr=0.005, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

# Create Model
model = Sequential()
model.add(Dense(64, input_dim=100000, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

# Compile Model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model

model.fit_generator(generator=batch_generator_shuffle(tf_train, y_train, batch_size),
                    epochs=5, validation_data=(tf_val, y_val),
                    steps_per_epoch=tf_train.shape[0]/batch_size)

After trying four different learning rates (0.0005, 0.005, 0.01, 0.1), it seems that none of them outperformed the default learning rate of 0.001.

#### Increase Hidden Nodes

In [None]:
# Create Model

model = Sequential()
model.add(Dense(128, input_dim=100000, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

# Compile Model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model

model.fit_generator(generator=batch_generator(tf_train, y_train, batch_size),
                    epochs=5, validation_data=(tf_val, y_val),
                    steps_per_epoch=tf_train.shape[0]/batch_size)

With 128 hidden nodes, the validation accuracy got closer to the performance of logistic regression. Anyhow, Logistic regression took less than a minute to fit and even if the neural network can be improved further, it isn't an efficient way.

As neural network models failed to outperform logistic regression, the probable cause might be high dimensionality and sparse characteristics of the textual data. 
According to "An Empirical Evaluation of Supervised Learning in High Dimensions" by Caruana et al.(2008), logistic regression showed as good performance as neural networks, in some cases outperforms neural networks.   
- http://icml2008.cs.helsinki.fi/papers/632.pdf

Even though the neural network is a more complex version of logistic regression, it doesn't always outperform logistic regression.
Sometimes with high dimensional sparse data, logistic regression can deliver good performance with much less computation time than neural network.