<a href="https://colab.research.google.com/github/FFOLA/My-Titanic/blob/master/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis Using TensorFlow, for Beginners

## Introduction 

Sentiment Analysis is the process of categorizing opinions expressed in text. Opinions can be positive, negative or neutral.

## Examples

- Lagos is a very dirty city - Negative sentiment
- The youths are our future - Positive sentiment


## Difficult Examples
- I do not dislike your API.
- My governor is better than your governor.
- I love my mobile but would not recommend it to any of my colleagues.

## Applications of Sentiment Analysis
- business; e.g. customer feedback such as restaurant reviews or product complaints
- politics; how do people feel about your candidate?
- relationships; is your boyfriend your boyfriend? 🤷🏿‍♂️

## Goal: 
Given a text, use ML to predict its sentiment. <br/>
We'll keep things simple and deal with only positive and negative sentiments.

Most of the material in this tutorial is taken from the TensorFlow Hub documentation: https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub

## Machine Learning Lifecycle

<img src="https://docs.aws.amazon.com/sagemaker/latest/dg/images/ml-concepts-10.png"> <br/>
Source: https://docs.aws.amazon.com/sagemaker/latest/dg/images/ml-concepts-10.png

### Fetching the data

We'll use the data from https://www.kaggle.com/c/word2vec-nlp-tutorial/data <br/>

The full data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The labeled training file we load below, `labeledTrainData.tsv`, contains 25,000 of them.

In [0]:
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
from sklearn.model_selection import train_test_split

In [0]:
# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

In [0]:
data = pd.read_csv('data/labeledTrainData.tsv', delimiter='\t')

In [0]:
data.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \\The Classic War of the Worlds\\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [0]:
data['review'].iloc[1]

'\\The Classic War of the Worlds\\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells\' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur \\"critics\\" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the \\"critics\\". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells\' classic novel, and we found it to be very entertaining. This made it easy to overlook what the \\"critics\\" perceive to be its shortcomings."'

In [0]:
data['review'].iloc[2]

"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like \xc2\xa8Jurassik Park\xc2\xa8, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fi

Check the distribution of instances across the two classes. This matters because class imbalance can skew the results.

In [0]:
len(data[data['sentiment'] == 1]), len(data[data['sentiment'] == 0])

(12500, 12500)

Each class has the same number of instances, so we don't need to worry. Typically, we'd worry about class imbalance if one class had, say, only 10% of all instances.
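The check above can also be expressed as class shares rather than raw counts. The sketch below is only an illustration in plain Python: the helper names `class_shares` and `is_imbalanced` are made up for this example, the `skewed` list is invented data, and the 10% threshold is simply the rule of thumb mentioned above, not a fixed standard.

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the total as {label: fraction}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def is_imbalanced(labels, min_share=0.10):
    """Flag the data if any class makes up min_share of it or less."""
    return min(class_shares(labels).values()) <= min_share

# Same counts as the labeled training file: 12,500 reviews per class.
balanced = [1] * 12500 + [0] * 12500
# Invented example where the positive class is only 10% of the data.
skewed = [1] * 100 + [0] * 900

print(class_shares(balanced))
print(is_imbalanced(balanced), is_imbalanced(skewed))
```

With the counts from this data set, both classes have a share of 0.5, so the check does not flag it.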

We need to hold out some of the data so that we can test the model on reviews it has not seen during training.

In [0]:
train_data, test_data = train_test_split(data, test_size=0.3, stratify=data['sentiment'], random_state=42)

In [0]:
len(train_data[train_data['sentiment'] == 1]), len(train_data[train_data['sentiment'] == 0])

(8750, 8750)

In [0]:
len(test_data[test_data['sentiment'] == 1]), len(test_data[test_data['sentiment'] == 0])

(3750, 3750)
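The `stratify` argument is what keeps the two classes evenly represented in both splits. As a rough sketch of the idea (not scikit-learn's actual implementation), a stratified split can be written by shuffling and cutting each class separately. The function name `stratified_split` is hypothetical, and the labels are rebuilt from the counts shown earlier.

```python
import random

def stratified_split(labels, test_size=0.3, seed=42):
    """Split row indices into train/test, sampling each class separately."""
    rng = random.Random(seed)
    # Group row indices by their class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])    # first 30% of this class
        train.extend(indices[n_test:])   # remaining 70% of this class
    return train, test

# Stand-in for data['sentiment']: 12,500 positive and 12,500 negative labels.
labels = [1] * 12500 + [0] * 12500
train_idx, test_idx = stratified_split(labels)

train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
print(train_pos, len(train_idx) - train_pos)  # 8750 8750
print(test_pos, len(test_idx) - test_pos)     # 3750 3750
```

Because each class is cut at the same 70/30 point, the per-class counts match the ones reported by the notebook cells above.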

### Clean

### Prepare

ML algorithms work only on numbers, so we need to convert the text to a suitable numeric representation.

Options:
- one-hot encoding
- n-grams
- word embeddings, etc.
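To make the first two options concrete, here is a minimal plain-Python sketch. The helper names (`build_vocab`, `one_hot`, `ngrams`) are made up for this illustration and the tokenization is deliberately naive (lowercase, split on whitespace). Real pipelines usually rely on a library for this.

```python
def build_vocab(texts):
    """Map each distinct lowercase word to an index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

def ngrams(text, n=2):
    """All runs of n consecutive words in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

texts = ["Lagos is a very dirty city", "The youths are our future"]
vocab = build_vocab(texts)

print(one_hot("lagos", vocab))
print(ngrams("Lagos is a very dirty city"))
```

One-hot vectors grow with the vocabulary and say nothing about how words relate to each other, which is one reason dense word embeddings are a popular alternative.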

In [0]:
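# Training input: shuffle the rows and repeat indefinitely (num_epochs=None);
# the `steps` argument passed to `train` decides when to stop.
train_input_fn = tf.estimator.inputs.pandas_input_fn(train_data, train_data['sentiment'], num_epochs=None, shuffle=True)
# Evaluation inputs: a single, unshuffled pass over each data set.
predict_train_input_fn = tf.estimator.inputs.pandas_input_fn(train_data, train_data['sentiment'], shuffle=False)
predict_test_input_fn = tf.estimator.inputs.pandas_input_fn(test_data, test_data['sentiment'], shuffle=False)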

### TFHub to the rescue!

TensorFlow Hub is a library for reusable machine learning modules in TensorFlow.<br/>

We will be using the `nnlm-en-dim128` module, which maps text to 128-dimensional embedding vectors.<br/>

Say your text is "Lagos is dirty". When passed to this module, it'll return a vector of 128 numbers that looks something like this: [0.4343, 0.004565, 0.00545 ...].<br/>

The module also takes care of grunt work such as preprocessing the data: removing punctuation and splitting on spaces.
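To give a feel for what happens inside such a module, the sketch below mimics the two steps in plain Python: clean and split the text, then look up a vector per word and combine them into one vector for the whole sentence. Everything here is a toy assumption: the 3-dimensional values in `TOY_EMBEDDINGS` are invented, and simple averaging is used for illustration only, so the real module's learned 128-dimensional vectors and its way of combining them may differ.

```python
import string

def preprocess(text):
    """Drop ASCII punctuation, then split on whitespace."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Invented 3-dimensional vectors; the real module learns 128 dimensions.
TOY_EMBEDDINGS = {
    "Lagos": [0.5, 0.0, 0.25],
    "is": [0.0, 0.25, 0.0],
    "dirty": [-0.5, 0.5, 0.5],
}

def embed(text):
    """Average the toy vectors of the known words in the text."""
    vectors = [TOY_EMBEDDINGS[w] for w in preprocess(text) if w in TOY_EMBEDDINGS]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

print(preprocess("Lagos is dirty!"))  # ['Lagos', 'is', 'dirty']
print(embed("Lagos is dirty!"))       # [0.0, 0.25, 0.25]
```

Whatever the length of the input text, the result is a fixed-size vector, which is what lets it feed straight into a classifier.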

In [0]:
embedded_text_feature_column = hub.text_embedding_column(key='review', module_spec='https://tfhub.dev/google/nnlm-en-dim128/1')

### Training

In [0]:
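# A feed-forward network with three hidden layers (150, 50 and 20 units) on top of the text embeddings.
estimator = tf.estimator.DNNClassifier(hidden_units=[150, 50, 20], feature_columns=[embedded_text_feature_column], n_classes=2,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.3))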

estimator.train(input_fn=train_input_fn, steps=1500)

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1a1dccf150>

### Evaluation

In [0]:
train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

print('Training set accuracy: {accuracy}'.format(**train_eval_result))
print('Test set accuracy: {accuracy}'.format(**test_eval_result))

Training set accuracy: 0.797828555107
Test set accuracy: 0.793333351612


In [0]:
# Three of the example sentences from the introduction, with the labels we expect (0 = negative, 1 = positive).
features = {
    'review': np.array(['Lagos is a very dirty city', 'The youths are our future', 'I do not dislike your API'])
}
custom_test_input_fn = tf.estimator.inputs.numpy_input_fn(features, np.array([0, 1, 1]), shuffle=False)

### Prediction

In [0]:
[x['class_ids'][0] for x in estimator.predict(input_fn=custom_test_input_fn)]

[0, 1, 0]

In [0]:
custom_test_eval_result = estimator.evaluate(input_fn=custom_test_input_fn)
print('Test 2 set accuracy: {accuracy}'.format(**custom_test_eval_result))

Test 2 set accuracy: 0.666666686535
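The accuracy figure is simply the fraction of predictions that match the labels. As a quick sanity check, the sketch below recomputes it in plain Python from the labels and predictions shown above. The `accuracy` helper is written for this illustration and is not part of the estimator API.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match their labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

labels = [0, 1, 1]       # labels passed to custom_test_input_fn
predictions = [0, 1, 0]  # class_ids returned by estimator.predict

print(accuracy(labels, predictions))
```

Two of the three sentences are classified correctly, giving 2/3. The trailing digits in the notebook's 0.666666686535 are consistent with the metric being stored as a 32-bit float. The sentence the model gets wrong is the double negative "I do not dislike your API", one of the difficult examples listed at the start.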


## Next steps
- Hyperparameter optimization
- State-of-the-art models, e.g. Universal Language Model Fine-tuning for Text Classification (ULMFiT), https://arxiv.org/pdf/1801.06146.pdf, which reports an error rate of 4.6% on the IMDb sentiment data set

## References
- https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub
- https://www.tensorflow.org/hub/modules/google/nnlm-en-dim128/1
- https://www.tensorflow.org/api_docs/python/tf/estimator/DNNClassifier
- https://arxiv.org/pdf/1801.06146.pdf
- https://medium.com/tensorflow/introducing-tensorflow-hub-a-library-for-reusable-machine-learning-modules-in-tensorflow-cdee41fa18f9
- http://nlpprogress.com/sentiment_analysis.html
- https://developers.google.com/machine-learning/guides/text-classification/step-2-5

## Installs
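- TensorFlow: `conda install -c conda-forge tensorflow` (the code in this tutorial uses TensorFlow 1.x APIs such as `tf.logging`, `tf.estimator.inputs` and `tf.train.AdagradOptimizer`)
- TensorFlow Hub: `conda install -c conda-forge tensorflow-hub`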