# Sentiment Analysis

A piece of text usually conveys its author's attitude towards a topic. This attitude can be positive, negative, or neutral (e.g. reviews on Amazon, IMDB, RottenTomatoes, etc.). Sentiment analysis tries to infer this *sentiment* by computational modelling of the text.

- In this tutorial, we will build a simple LSTM model to perform sentiment classification.

(Note that this tutorial is largely inspired by https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras)

#### Load necessary libraries

You might need to install pandas (a data/table processing library) before starting this tutorial:
> pip install pandas

In [1]:
# This is the regex library. Regex stands for (reg)ular (ex)pression.
# It lets you search and manipulate strings in a flexible and convenient way.
import re 

In [2]:
# Data processing libraries
import numpy as np 
import pandas as pd 

# Libraries for text manipulation
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Libraries for building deep learning model. 
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

# Libraries to perform standard machine learning training
from sklearn.model_selection import train_test_split
# keras.utils.np_utils is not available in newer Keras releases; import from keras.utils instead
from keras.utils import to_categorical

Using TensorFlow backend.


### Data Loading and Visualisations

- In this exercise, we will use a dataset that contains tweets about the First GOP Debate 2016.

- The original source says: *We looked through tens of thousands of tweets about the early August GOP debate in Ohio and asked contributors to do both sentiment analysis and data categorization. Contributors were asked if the tweet was **relevant**, which **candidate** was mentioned, what **subject** was mentioned, and then what the **sentiment** was for a given tweet. We've removed the non-relevant messages from the uploaded dataset.*

In [3]:
data = pd.read_csv('Sentiment.csv')

FileNotFoundError: File b'Sentiment.csv' does not exist

Let's familiarise ourselves with the data!
Sentiment.csv is a dataset in the form of a table, where:
- Each row corresponds to one item (tweet1, tweet2, ...)
- Each column corresponds to a specific attribute ('tweet_id', 'tweet_name', 'subject', 'sentiment', etc.)



In [None]:
# get the number of entries, and the number of fields per entry
n_entries, n_fields = data.shape

print('The dataset has {} entries and each entry has {} fields'.format(n_entries, n_fields))

Let's view the first entry to see what the data looks like:

In [None]:
data.loc[0]

In pandas, it is easy to extract only the entries that you care about. The following code first selects the relevant *columns*, then views the first 5 entries:

In [None]:
data[['candidate','name', 'text', 'sentiment',]].loc[0:4]

Let's see what entry 4's *text* field looks like in detail:

In [None]:
data['text'][4]

# Sentiment Analysis via LSTM Classifier

Let's infer the sentiment of texts using a Long Short-Term Memory (LSTM) network. We will:
1. Preprocess the data, then prepare the input and output for the network
2. Build a network model for classification (model specification)
3. Train the network! 
4. Evaluate the network

### **Step 1: Preprocess/Data Cleaning**: 
The first stage is to clean the data so that the neural network can handle it.
- We will perform a binary classification of sentiment: whether the text conveys positive or negative sentiment. Therefore:
    - (1) We will extract 2 fields we care about: *text* and *sentiment*. 
    - (2) We will delete the neutral sentiments from the data.

In [None]:
data = data[['text', 'sentiment']]

In [None]:
data = data[data.sentiment != "Neutral"]
# Use len() to count rows; DataFrame.size would count rows x columns instead
print("Number of tweets with positive sentiment: {}".format(len(data[data['sentiment'] == 'Positive'])))
print("Number of tweets with negative sentiment: {}".format(len(data[data['sentiment'] == 'Negative'])))
print("Number of tweets with neutral sentiment: {}".format(len(data[data['sentiment'] == 'Neutral'])))

- At the moment, the text field contains a lot of symbols that are not actually part of the message. Also, we don't want to worry about case sensitivity. The following code cleans that up a bit using **regex** (don't worry too much about the details).

In [None]:
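data['text'] = data['text'].apply(lambda x: x.lower())
# Keep only letters, digits and whitespace
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
# Remove the retweet marker 'rt' only as a whole word, so that words like 'party' stay intact
data['text'] = data['text'].apply(lambda x: re.sub(r'\brt\b', ' ', x))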

# Let's see how the text got converted
data['text'][4]

Now the text looks better: it contains only lower-case words. There is still some random stuff like "httptco" and "w". It would be very difficult to clean every entry perfectly, so we will live with this. If it hurts the network's performance too much, we can come back and clean the data further.
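To make the cleaning steps concrete, here is a minimal sketch applied to a made-up tweet. The sample text and the `clean_tweet` helper are illustrative assumptions, not part of the dataset. The sketch also assumes that the retweet marker should be removed only when it appears as a whole word.

```python
import re

def clean_tweet(text):
    """Lower-case a tweet and keep only letters, digits and whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # drop punctuation and symbols
    text = re.sub(r'\brt\b', ' ', text)      # drop 'rt' only as a whole word
    return text

# Hypothetical tweet, invented for this example
sample = "RT @someone: The party starts now! http://t.co/abc"
cleaned = clean_tweet(sample)
print(cleaned.split())
```

Words such as "party" and "starts" keep their inner "rt", while the link collapses into a leftover token like the "httptco" residue mentioned above.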

###### Tokenization
- Humans can see a word and understand it, but machines cannot. Everything is just a bunch of numbers to them. Therefore, we need to convert words into a convenient numerical representation.
- For example, to represent the 26 letters of the alphabet (a, b, c, ..., x, y, z), we may use the numbers (1, 2, 3, ..., 24, 25, 26), where the correspondence is (a->1, b->2, ..., z->26).
- For more general words in a dictionary, we need more numbers. That's the idea of tokenization. In the following, we set the maximum number of features (the vocabulary size) to 2000, so only the most frequent words are kept.

- We will then convert the text into a vectorised form. 
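The idea can be sketched in plain Python. This is a simplified stand-in for what a tokenizer does, not the Keras implementation: the helper names and the toy texts are assumptions, and ties between equally frequent words are broken alphabetically here for determinism.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first (index 1)."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count; ties are broken alphabetically
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def texts_to_ids(texts, word_index, num_words):
    """Replace words by their index, skipping words outside the vocabulary."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

toy_texts = ["good debate", "bad debate", "good good answer"]
word_index = fit_word_index(toy_texts)
toy_ids = texts_to_ids(toy_texts, word_index, num_words=3)
print(word_index)
print(toy_ids)
```

With `num_words=3`, only the two most frequent words survive, which mirrors how a vocabulary cap drops rare words from each sequence.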

In [None]:
max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['text'].values)
X = tokenizer.texts_to_sequences(data['text'].values)

# Let's see how a tweet is transformed into the representation that the machine sees!
# (X is a plain list, so X[4] is the 5th tweet that remains after filtering)
print(X[4])

We assigned a number to each word! 

Now, each tweet has a different length. It is easier for the network to handle inputs of the same length, so we will "pad" the sequences with zeros so that all tweets have the same length.
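As a rough illustration of padding, the sketch below adds zeros at the front of each sequence until all have the length of the longest one, which is how `pad_sequences` behaves with its default settings. The `pad_left` helper and the toy sequences are assumptions made for this example.

```python
def pad_left(sequences, value=0):
    """Left-pad every sequence with `value` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]

toy_sequences = [[1, 2], [2], [1, 1, 3]]
padded = pad_left(toy_sequences)
for row in padded:
    print(row)
```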

In [None]:
X = pad_sequences(X)

# Let's see how the same sequence looks after padding
print(X[4])

Finally, we will prepare the output of the network. At the moment, the target variables are 'Positive' and 'Negative'. Again, for the machine, we need a numerical representation. We will encode Negative as [1, 0] and Positive as [0, 1] (`pd.get_dummies` orders the columns alphabetically). This is called *one-hot encoding*.
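A minimal sketch of one-hot encoding, assuming the classes are ordered alphabetically as `pd.get_dummies` does. The `one_hot` helper and the toy labels are hypothetical.

```python
def one_hot(labels):
    """Return the sorted class names and one 0/1 row per label."""
    classes = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in classes] for label in labels]
    return classes, rows

toy_labels = ['Positive', 'Negative', 'Negative']
classes, encoded = one_hot(toy_labels)
print(classes)
print(encoded)
```

Because 'Negative' sorts before 'Positive', column 0 marks negative tweets and column 1 marks positive ones.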

In [None]:
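# astype(int) keeps Y numeric even on pandas versions where get_dummies returns booleans
Y = pd.get_dummies(data['sentiment']).values.astype(int)

# Let's see what the output looks like.
# Y is indexed by position, so use .iloc to pick the matching row of data
print(data['sentiment'].iloc[4], ' -> ', Y[4])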

Our input and output are ready!

### Step 2: Building a network

Let's build a simple one-layer LSTM network for sentiment classification. We will be using Keras. 

Our network is composed of 3 components: 
- **Embedding layer**: The input we prepared above is just a sequence of numbers. This can be processed by the network, but it is not the best representation. For example, "love" and "like" are similar words. We want to be able to represent this "similarity" and "dissimilarity", and that is precisely the purpose of an embedding layer. In the embedding layer, we map the above integers into a *continuous space* where such a notion of similarity can be defined through a distance metric. This space can have any number of dimensions. This is your first hyper-parameter, **embed_dim**.
- **LSTM layer**: Given the input from the embedding layer, we apply an LSTM layer to learn a hidden representation. The hyper-parameter **lstm_out** is the number of hidden units we want to use. The larger the number, the more expressive the network gets, but it also becomes more difficult to train that many hidden units.
- **Classification layer**: Finally, given the hidden representation computed by the LSTM layer, we perform classification. This layer takes *lstm_out* inputs and maps them to a binary answer (positive/negative sentiment) via a fully connected layer called a *Dense layer*.
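To illustrate what "similarity in a continuous space" can mean, the sketch below compares made-up 3-dimensional word vectors using cosine similarity. The vectors are invented for illustration (a real embedding layer learns them during training), and cosine similarity is only one possible choice of metric.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings with embed_dim = 3, for illustration only
toy_embedding = {
    'love': [0.9, 0.8, 0.1],
    'like': [0.8, 0.9, 0.2],
    'hate': [-0.9, -0.7, 0.1],
}
sim_love_like = cosine_similarity(toy_embedding['love'], toy_embedding['like'])
sim_love_hate = cosine_similarity(toy_embedding['love'], toy_embedding['hate'])
print(round(sim_love_like, 2), round(sim_love_hate, 2))
```

In this toy space, "love" and "like" point in nearly the same direction, while "love" and "hate" point in nearly opposite directions.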

The other two hyper-parameters are:

- **batch_size**: this determines how many data points we process at the same time. When the batch size is small, each update of the network is based on only a few specific examples. When the batch size is large, each update is based on many examples simultaneously.
- **dropout_x**: this parameter "corrupts" the network randomly by switching off a fraction of its units during training. The higher this parameter, the more corruption is introduced. By making the network learn while being corrupted, we make it more robust.

These hyper-parameters need to be adjusted to achieve the best performance!
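The "corruption" introduced by dropout can be sketched as follows. This is a simplified illustration of the common "inverted dropout" scheme, in which surviving values are rescaled so the expected total stays the same. The `apply_dropout` helper is hypothetical and is not how Keras is implemented internally.

```python
import random

def apply_dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the survivors."""
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)      # fixed seed so the run is repeatable
activations = [1.0] * 10    # made-up hidden-unit outputs
dropped = apply_dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

With a rate of 0.2, roughly one unit in five is switched off on each training step, so the network cannot rely on any single unit.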

In [None]:
# Hyper parameters
embed_dim = 128
lstm_out = 196
batchsize = 32
dropout_x = 0.2

# Build LSTM classifier with 3 components
model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(LSTM(lstm_out, dropout=dropout_x, recurrent_dropout=dropout_x))
model.add(Dense(2,activation='softmax'))

model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.summary()  # prints the layer-by-layer summary itself

### Step 3: Train the network

- We will split the dataset into training and testing data.
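Conceptually, a train/test split shuffles the data and holds out a fraction for testing. The sketch below shows that idea in plain Python. The `simple_split` helper is hypothetical, and its rounding and shuffling details are assumptions that need not match `train_test_split` exactly.

```python
import random

def simple_split(items, test_size, seed):
    """Shuffle a copy of `items` and hold out a fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

toy_items = list(range(100))
train_items, test_items = simple_split(toy_items, test_size=0.33, seed=42)
print(len(train_items), len(test_items))
```

Fixing the seed makes the split repeatable, which is what `random_state` is for in the real function.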

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)

print('Size of training data: {}'.format(len(X_train)))
print('Size of testing data: {}'.format(len(X_test)))

- Now we will train the network! Since we don't have too much time, we will train for 5 epochs. Feel free to train for longer :)

In [None]:
batch_size = batchsize  # reuse the hyper-parameter defined in Step 2
model.fit(X_train, Y_train, epochs = 5, batch_size=batch_size, verbose = 2)

### Step 4: Evaluation
We will measure the accuracy on the test data.

In [None]:
# Measure the overall accuracy on test data
_, acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("Overall Accuracy: %.2f" % (acc))

# Measure the accuracy separately on the positive and the negative sentiment class.
# Column 0 of Y is 'Negative' and column 1 is 'Positive'.
pred_labels = np.argmax(model.predict(X_test, batch_size=batch_size, verbose=2), axis=1)
true_labels = np.argmax(Y_test, axis=1)

# Boolean masks selecting the test tweets of each class
neg_mask = true_labels == 0
pos_mask = true_labels == 1
neg_cnt, pos_cnt = neg_mask.sum(), pos_mask.sum()

# Count correct predictions within each class
neg_correct = (pred_labels[neg_mask] == 0).sum()
pos_correct = (pred_labels[pos_mask] == 1).sum()

print("Accuracy on Positive Sentiment: ", pos_correct/pos_cnt*100, "%")
print("Accuracy on Negative Sentiment: ", neg_correct/neg_cnt*100, "%")
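Overall accuracy can hide large differences between classes. The toy sketch below computes per-class accuracy on made-up labels. The `per_class_accuracy` helper and the label lists are assumptions for illustration only, not results from the model.

```python
def per_class_accuracy(true_labels, predicted_labels):
    """Return {class: fraction of that class predicted correctly}."""
    totals, correct = {}, {}
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == pred:
            correct[truth] = correct.get(truth, 0) + 1
    return {c: correct.get(c, 0) / totals[c] for c in totals}

# Made-up labels: 0 = negative, 1 = positive
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 0]
toy_result = per_class_accuracy(toy_true, toy_pred)
print(toy_result)
```

Here four of the six predictions are correct overall, yet the two classes score differently, which is why the per-class figures are worth printing.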

### Step 5: Enjoy!

Now, define your own tweet and see what the network thinks!

In [None]:
twt = 'Meetings: Because none of us is as dumb as all of us'

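# Prepare the input: remember, we need to preprocess the data (tokenization and padding).
# texts_to_sequences expects a list of texts; a bare string would be split into characters.
twt = tokenizer.texts_to_sequences([twt])
# Pad in the same way and to the same length as the training data
twt = pad_sequences(twt, maxlen=X.shape[1])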

# Feed the input to the network
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0]

# What does the network think? (index 0 = negative, index 1 = positive)
if np.argmax(sentiment) == 0:
    print("negative")
elif np.argmax(sentiment) == 1:
    print("positive")

### Step 6: Optimise the network

Try adjusting the 4 hyper-parameters mentioned above:
- batchsize 
- dropout_x 
- embed_dim 
- lstm_out
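One simple way to organise such experiments is to enumerate every combination of candidate values and train one model per combination. The sketch below only builds the list of configurations. The candidate values are hypothetical, so choose your own ranges.

```python
from itertools import product

# Hypothetical candidate values for each hyper-parameter
search_space = {
    'batchsize': [32, 64],
    'dropout_x': [0.2, 0.5],
    'embed_dim': [64, 128],
    'lstm_out': [98, 196],
}

names = list(search_space)
configs = [dict(zip(names, values)) for values in product(*search_space.values())]
print(len(configs))
print(configs[0])
```

Each dictionary in `configs` could then be used to rebuild and retrain the model, keeping track of the test accuracy for each setting.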

## Questions
1. Why is the accuracy for positive sentiment so much lower than for negative sentiment? There is no single right answer to this question. Come up with as many reasons as you can. (Hint: for example, look at the number of examples for positive/negative sentiment. Can this be one of the reasons?)
2. How did the hyper-parameters affect the performance? How would you improve the network performance further?