**Sentiment Analysis on Stanford's 1.6 million tweets - Transfer Learning Pipeline**

This notebook needs to be run only once. It builds a deep learning model on Stanford's 1.6 million tweets and saves the featurizer (tokenizer), model architecture, and model weights to DBFS so they can be applied to any other unlabeled dataset as part of the transfer learning pipeline.

In [2]:
import pandas as pd
import numpy as np

Now, let us import Stanford's 1.6 million tweets

In [4]:
# File location and type
file_location = "/FileStore/tables/training_1600000_processed_noemoticon-efba6.csv"
file_type = "csv"

# Read the CSV with Spark's default options (no header row, every column read as a string).
df = spark.read.format(file_type).load(file_location)

df = df.toPandas()
df.columns = ['sentiment','id','date','flag','user','text']

df.head()

Unnamed: 0,sentiment,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [5]:
df['sentiment'].value_counts()

**Now, let us create a cleaning function. We will remove all non-alphabetic characters such as punctuation and digits, strip HTML tags using Beautiful Soup, and expand contractions such as isn't into is not.**

In [8]:
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()

In [9]:
pip install python-Levenshtein

In [10]:
def textCleaning(text):
    """ 
    Clean a single tweet: strip HTML markup, @mentions and http/www links,
    lowercase the text, expand contractions, replace non-letter characters
    with spaces, drop single-character tokens, and trim surrounding whitespace.

    Parameters:
        text: one tweet. The function handles a single string, so it is
            called in a loop over the dataframe (see the next cell).

    Returns:
        The cleaned text.
    """

    # Filter for @mentions and http/https/www links
    filters = r'@[A-Za-z0-9_]+|https?://[^ ]+|www\.[^ ]+'

    # Dictionary for expanding contractions; 'rt' is mapped to an empty string
    my_dictionary = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not", "rt":''}

    myDictionary_pattern = re.compile(r'\b(' + '|'.join(my_dictionary.keys()) + r')\b')
  
    text = str(text)
    
    # Beautiful Soup is a Python package for parsing HTML and XML documents.
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    try:
        bom_removed = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except (AttributeError, UnicodeError):
        # Python 3 strings have no .decode(), so the text is kept as-is
        bom_removed = souped

    # Apply the mention/link filter defined above
    stripped = re.sub(filters, '', bom_removed)
    lower_case = stripped.lower()
    dictionary_handled = myDictionary_pattern.sub(lambda x: my_dictionary[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", dictionary_handled)
    words = [x for x  in tok.tokenize(letters_only) if len(x) > 1]
    
    return (" ".join(words)).strip()
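To see what the regex steps do to a single tweet, here is a minimal standard-library sketch. It deliberately leaves out the Beautiful Soup and NLTK steps, and the three-entry contraction map, the name `clean_sketch`, and the sample tweet are all hypothetical, so it illustrates the order of operations rather than replacing `textCleaning`.

```python
import re

# Hypothetical, shortened contraction map (the notebook's dictionary is longer).
CONTRACTIONS = {"can't": "can not", "isn't": "is not", "won't": "will not"}
CONTRACTION_RE = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")

# Same idea as `filters` in the notebook: @mentions and http/https/www links.
NOISE_RE = re.compile(r"@[A-Za-z0-9_]+|https?://[^ ]+|www\.[^ ]+")


def clean_sketch(text):
    """Strip mentions/links, lowercase, expand contractions, keep letters only."""
    text = NOISE_RE.sub("", text).lower()
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group()], text)
    text = re.sub(r"[^a-z]", " ", text)
    # Drop single-character tokens, as the notebook's function does.
    return " ".join(word for word in text.split() if len(word) > 1)


sample = "@user I can't wait!! http://example.com It isn't over :)"
print(clean_sketch(sample))  # can not wait it is not over
```

Note that the contractions have to be expanded before the letters-only step, otherwise the apostrophes would already be gone and nothing would match.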

In [11]:
df.head()

Unnamed: 0,sentiment,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


Now, let us run the cleaning function defined above to clean all our tweets

In [13]:
# nums = number of rows in the dataframe
nums = df.shape[0]
print("Cleaning and parsing the tweets...\n")

# List that collects the cleaned text of every tweet
clean_tweet_texts = []

# Pass each tweet of our dataframe through the cleaning function
for i in range(nums):
    # Report progress every 100,000 tweets
    if (i + 1) % 100000 == 0:
        print("%d of %d tweets have been processed" % (i + 1, nums))
    clean_tweet_texts.append(textCleaning(df['text'].iloc[i]))
print('Pre-processing done 👍🏼')

In [14]:

df['text'] = pd.Series(clean_tweet_texts)

In [15]:
df['text'] = df['text'].astype('str')

In [16]:
tweets = df.copy()

In [17]:
tweets['sentiment'].value_counts()

In [18]:
tweets.head()

Unnamed: 0,sentiment,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,awww that bummer you shoulda got david carr of...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can not update his facebook b...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,dived many times for the ball managed to save ...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,no it not behaving at all mad why am here beca...


In [19]:
tweets['sentiment'] = tweets['sentiment'].astype('int')
tweets['sentiment']=tweets['sentiment'].replace(4,1)
tweets['sentiment'].value_counts()

Stanford's data is balanced: 800,000 tweets have sentiment 0 and 800,000 tweets have sentiment 1 (originally labelled 4).

Let us keep only the text and sentiment columns for building our sentiment model.

In [21]:
tweets = tweets[['text', 'sentiment']]

In [22]:
tweets.head()

Unnamed: 0,text,sentiment
0,awww that bummer you shoulda got david carr of...,0
1,is upset that he can not update his facebook b...,0
2,dived many times for the ball managed to save ...,0
3,my whole body feels itchy and like its on fire,0
4,no it not behaving at all mad why am here beca...,0


Now, let us start our modelling process. Let us import all the necessary libraries

In [24]:
## helper libraries

from subprocess import check_output
from collections import Counter
import gc
import time
import pickle
import h5py
from sklearn.preprocessing import LabelEncoder
from keras import metrics
from keras.models import Sequential, Model, model_from_json
from keras.layers import Dense, Dropout, Activation, Input, Flatten
from keras.layers import Embedding, LSTM, Bidirectional
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, SpatialDropout1D, AveragePooling1D
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

In [25]:
tweets.head()

Unnamed: 0,text,sentiment
0,awww that bummer you shoulda got david carr of...,0
1,is upset that he can not update his facebook b...,0
2,dived many times for the ball managed to save ...,0
3,my whole body feels itchy and like its on fire,0
4,no it not behaving at all mad why am here beca...,0


In [26]:
texts = tweets['text']

Let us use the Keras tokenizer to tokenize the tweets, keeping the 10,000 most frequent words, and fit it on our tweets. We will also save this tokenizer so that it can be used for feature building on any new dataset.

In [28]:
num_max = 10000
tok = Tokenizer(num_words=num_max)
tok.fit_on_texts(texts)
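To make the effect of `num_words` concrete, here is a simplified pure-Python sketch of frequency-ranked indexing. It is not the Keras implementation: the helper names and toy texts are hypothetical, and it skips the lowercasing and punctuation filtering the real tokenizer performs. It does follow the Keras convention that index 0 is reserved and only words with an index below `num_words` are emitted.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping words whose index is num_words or higher."""
    return [
        [word_index[w] for w in text.split() if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


toy_texts = ["good good movie", "bad movie", "good day"]
word_index = build_word_index(toy_texts)
sequences = texts_to_sequences(toy_texts, word_index, num_words=3)

print(word_index)  # {'good': 1, 'movie': 2, 'bad': 3, 'day': 4}
print(sequences)   # [[1, 1, 2], [2], [1]]
```

The full `word_index` still contains every word seen during fitting; the cut-off only applies when texts are converted to sequences.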

In [29]:
import sys
import os
import csv

Since most of the tweets are between 20 and 50 words long, we will use a maximum sequence length of 40 for our deep learning model. We will also choose 100 as the dimension of our embedding vector, which means each word will have a 100-dimensional representation whose weights are learned during training. Let us use a validation split of 0.25 to check accuracy on unseen data.

In [31]:
MAX_SEQUENCE_LENGTH = 40
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.25

**Now, let us save the tokenizer using pickle**

In [33]:
# saving
with open('/dbfs/FileStore/tables/pipeline_featurizer.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
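Reading the featurizer back is the mirror image of saving it. The sketch below shows the round trip with a plain dictionary standing in for the fitted tokenizer and a temporary local path instead of DBFS; both are assumptions made so the example runs anywhere.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer; any picklable Python object behaves the same way.
featurizer = {"word_index": {"good": 1, "movie": 2}, "num_words": 10000}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_featurizer.pickle")  # hypothetical local path

    # Save in binary mode, as in the cell above
    with open(path, "wb") as handle:
        pickle.dump(featurizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Load in binary mode from the same location
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == featurizer)  # True
```

When the real tokenizer is unpickled in another notebook, that environment generally needs a compatible Keras version installed, because pickle stores a reference to the class rather than its code.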

**Now, let us convert our text into numbers using the above tokenizer. We will also pad the sequences so that every tweet has the same length, which is required by deep learning models.**

In [35]:
sequences = tok.texts_to_sequences(texts)
word_index = tok.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
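With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences that are too long, drops items from the front. A small pure-Python sketch of that behaviour (the helper `pad_pre` is hypothetical and is not part of Keras):

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=5))  # [2, 3, 4, 5, 6]
```

For this notebook it means a tweet longer than 40 tokens keeps its last 40 tokens, and shorter tweets are filled with leading zeros.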

In [36]:
labels = tweets['sentiment']

In [37]:
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

In [38]:
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
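For readers unfamiliar with `to_categorical`, the sketch below shows the one-hot layout it produces for integer labels. The helper `one_hot` is a hypothetical pure-Python stand-in that returns nested lists rather than a NumPy array.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 in the label's column."""
    return [[1.0 if col == label else 0.0 for col in range(num_classes)] for label in labels]


encoded = one_hot([0, 1, 1], num_classes=2)
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Summing each column gives the number of examples per class
print([sum(column) for column in zip(*encoded)])  # [1.0, 2.0]
```

Column 0 therefore corresponds to sentiment 0 and column 1 to sentiment 1, which is how the two softmax outputs of the model are read later.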

In [39]:
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
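The slicing above takes the last quarter of the shuffled rows as the validation set. A toy version with 16 rows (a stand-in for the 1.6 million tweets) shows the arithmetic:

```python
VALIDATION_SPLIT = 0.25
rows = list(range(16))  # stand-in for the shuffled dataset

nb_validation_samples = int(VALIDATION_SPLIT * len(rows))
train_rows = rows[:-nb_validation_samples]
val_rows = rows[-nb_validation_samples:]

print(len(train_rows), len(val_rows))     # 12 4
print(int(VALIDATION_SPLIT * 1_600_000))  # 400000
```

One caveat with this slicing pattern: if `nb_validation_samples` were ever 0, `rows[:-0]` would be empty, so it only works when the validation set has at least one row.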

Now x_train contains 75% of the dataset. Next, let us define an embedding layer with the output size of 100 set above in the variable EMBEDDING_DIM.

In [41]:
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            mask_zero=False,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
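The embedding table has one row per index and `EMBEDDING_DIM` columns, so its parameter count is easy to estimate. The vocabulary size below is a hypothetical figure, since the real `len(word_index)` is not shown in this notebook.

```python
EMBEDDING_DIM = 100
NUM_MAX = 10000              # the tokenizer's num_words from the notebook
assumed_vocab_size = 250000  # hypothetical len(word_index); the real value is not shown

# Table sized from the full vocabulary, as in the cell above
full_table = (assumed_vocab_size + 1) * EMBEDDING_DIM

# Table sized from the indices the tokenizer can actually emit (0 .. NUM_MAX - 1)
compact_table = NUM_MAX * EMBEDDING_DIM

print(full_table)     # 25000100
print(compact_table)  # 1000000
```

Because the tokenizer only emits indices below `num_max`, rows beyond that are never looked up; using `num_max` as the first argument of `Embedding` would be a possible way to shrink the saved weights, although the notebook as written uses the full vocabulary.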

In [42]:
print('Number of negative and positive tweets in the training and validation sets')
print (y_train.sum(axis=0))
print (y_val.sum(axis=0))

Rationale for the model used.

**LSTM PROS**

Recognizes patterns across time and can retain the overall information of a text

RNNs are better when classification is determined by a long-range semantic dependency rather than by local key phrases (so it is worth checking both; CNNs work better for classification in most cases, though)


**LSTM CONS**

The weight given to earlier words decreases with time, even with an LSTM

Parallel computation is not possible (very costly)

LSTMs are very slow; CNNs are 5-6 times faster than LSTMs

**LSTM NLP Applications:**

Question answering

Translation (any sequence-to-sequence model)

Language modelling



**CNN PROS**

Detects patterns of multiple sizes (2, 3, or 5 adjacent words)

Parallel computation

Weights are shared across different patterns

Weightage remains the same for different patterns (no time component, so earlier patterns are remembered)

E.g., patterns could be expressions (word n-grams) such as “I hate” or “very good”

**CNN CONS**

Convolutions lose information about the local order of words; therefore sequence information is lost. (How to rectify? Later ;) )

Text with sentiment or class updates with time. 

**Applications:**

Classification task  Importance (pattern) > Importance (pattern)

Sentiment Analysis
Spam Detection
Topic Categorization


**Exceptions:**

**LSTM>CNN**
Eg: I thought it would be a great movie, but it turns out that it was shitty.

**CNN >>> LSTM**
I was very disappointed with this restaurant. The service was incredibly slow, and the food was mediocre. I will not be back


**Rationale behind Parameter Tuning**

Are pre-trained word embeddings required? It turns out they are not. Why?

Even without pre-trained embeddings the model performs well, and sometimes better.
The language used in the #drones tweets appears many times, in many contexts, in the 1.6 million tweets.
Therefore, there is no need to add complexity through pre-trained embeddings.

Dropout value?
0.5 performs a bit better on new data, so it generalizes better.

Max or average pooling?
Average pooling works better because it averages the sentiments of two patterns.
Max pooling may give better results for extreme sentiment/polarity classification (heuristic checks are required, though).
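The difference between the two pooling choices can be seen on a handful of numbers. This is a pure-Python sketch with hypothetical filter responses; the helper `pool_1d` only mimics non-overlapping pooling windows and is not a Keras function.

```python
def pool_1d(values, pool_size, reducer):
    """Apply `reducer` to consecutive, non-overlapping windows of `pool_size`."""
    usable = len(values) - len(values) % pool_size  # ignore an incomplete last window
    return [reducer(values[i:i + pool_size]) for i in range(0, usable, pool_size)]


def mean(window):
    return sum(window) / len(window)


activations = [1.0, 0.0, 0.25, 0.5]  # hypothetical responses of one filter

print(pool_1d(activations, 2, mean))  # [0.5, 0.375]
print(pool_1d(activations, 2, max))   # [1.0, 0.5]
```

Average pooling blends a strong and a weak response into a moderate value, while max pooling keeps only the strongest one, which is the intuition behind the remark about extreme polarity.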

Adam optimizer instead of stochastic gradient descent? Better performance and faster.

Size of filters: depends on the kind of pattern we want to detect (size = 2, e.g. “not good”; 3 and 4 work perfectly fine too).


**Rationale for parameters:**

**MAX_SEQUENCE_LENGTH**: the maximum sequence length should be chosen according to the average length of the text per document in our input. Here we have Twitter data, so we chose 40.

**Embedding layer size**: the embedding layer size depends on the amount of training data. If the training data is huge, we can increase the size of the embedding layer so that more features can be learned from the additional data.

CNN layer number of filters and filter size: the **number of filters** depends on the number of contexts or features you want to learn from a single training example. The intuition is similar to that for the embedding layer size.

**Filter size** depends on the number of consecutive words in one training example that can define a context or feature. Here we took 3 because three adjacent words can normally define a sentiment context.
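To illustrate what a filter size of 3 means, the sketch below lists the three-word windows a width-3 convolution slides over. The helper `windows` and the sample sentence are hypothetical; a real `Conv1D` layer sees vectors rather than words.

```python
def windows(tokens, size):
    """Return every run of `size` consecutive tokens."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


tokens = "the food was not good".split()
trigrams = windows(tokens, 3)

for trigram in trigrams:
    print(" ".join(trigram))
# the food was
# food was not
# was not good
```

A sequence of n tokens yields n - 2 such windows, which is why a length-40 input shrinks to 38 positions after a width-3 convolution without padding.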



**Note**

Suppose we build this pipeline on another training dataset, for example the **Amazon reviews** dataset. In that case we have to look at the average number of words in each review and increase the sequence-length parameter accordingly (MAX_SEQUENCE_LENGTH from 40 to maybe 80, depending on the average number of words). We can also increase the number of filters in our CNN layer, since more words lead to more features.

In [44]:

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

**Based on the rationale defined above, let us build a deep learning model that is a hybrid of LSTM and CNN to perform sentiment prediction**

In [46]:

x = Bidirectional(LSTM(40, return_sequences=True))(embedded_sequences)

x = Conv1D(64, 3, activation='relu')(x)
x = AveragePooling1D(pool_size=2)(x)

x = Flatten()(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.5)(x)
preds = Dense(2, activation='softmax')(x)

model = Model(sequence_input, preds)

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])

model.summary()
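The shapes reported by `model.summary()` can be checked by hand. The sketch below redoes the arithmetic in plain Python, assuming the Keras defaults used here (no padding and stride 1 for `Conv1D`, and a pooling stride equal to `pool_size`). It is an arithmetic check only and builds no model.

```python
MAX_SEQUENCE_LENGTH = 40
EMBEDDING_DIM = 100
LSTM_UNITS = 40
CONV_FILTERS, KERNEL_SIZE = 64, 3
POOL_SIZE = 2
DENSE_UNITS = 64

shapes = {}
shapes["embedding"] = (MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)
# Forward and backward LSTM outputs are concatenated at every time step
shapes["bidirectional_lstm"] = (MAX_SEQUENCE_LENGTH, 2 * LSTM_UNITS)
# A width-3 convolution without padding loses KERNEL_SIZE - 1 positions
conv_steps = MAX_SEQUENCE_LENGTH - KERNEL_SIZE + 1
shapes["conv1d"] = (conv_steps, CONV_FILTERS)
# Non-overlapping pooling windows of size 2 halve the number of positions
pool_steps = (conv_steps - POOL_SIZE) // POOL_SIZE + 1
shapes["average_pooling"] = (pool_steps, CONV_FILTERS)
flat_size = pool_steps * CONV_FILTERS
shapes["flatten"] = (flat_size,)

# Weights plus biases of the first Dense layer
dense_params = flat_size * DENSE_UNITS + DENSE_UNITS

for name, shape in shapes.items():
    print(name, shape)
print("dense parameters:", dense_params)  # dense parameters: 77888
```

The flattened vector has 19 * 64 = 1216 values, so most of the non-embedding parameters sit in the first `Dense` layer.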

Now, let us fit the above defined model for 3 epochs

In [48]:
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=3, batch_size=128) 

Now, let us save the model architecture and weights.

In [50]:
# Save the weights
model.save_weights('/dbfs/FileStore/tables/model_weights.h5')

In [51]:
# Save the model architecture

with open('/dbfs/FileStore/tables/model_architecture.json', 'w') as f:
    f.write(model.to_json())
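For completeness, here is a hedged sketch of how a downstream notebook might reload the three saved artifacts and score new text. The function names are hypothetical, the Keras imports are placed inside the functions so the definitions can be read without Keras installed, and the code assumes the same Keras version that produced the files.

```python
import pickle


def load_pipeline(featurizer_path, architecture_path, weights_path):
    """Rebuild the tokenizer and model from the three files saved above."""
    from keras.models import model_from_json  # lazy import, requires Keras

    with open(featurizer_path, "rb") as handle:
        tokenizer = pickle.load(handle)
    with open(architecture_path) as handle:
        model = model_from_json(handle.read())
    model.load_weights(weights_path)
    return tokenizer, model


def predict_positive_probability(tokenizer, model, cleaned_texts, maxlen=40):
    """Return the probability of sentiment 1 for texts already cleaned with textCleaning."""
    from keras.preprocessing.sequence import pad_sequences  # lazy import

    sequences = tokenizer.texts_to_sequences(cleaned_texts)
    padded = pad_sequences(sequences, maxlen=maxlen)
    # Column 1 of the softmax output corresponds to sentiment 1
    return model.predict(padded)[:, 1]
```

A call would pass the paths used in this notebook, for example `/dbfs/FileStore/tables/pipeline_featurizer.pickle`, `/dbfs/FileStore/tables/model_architecture.json` and `/dbfs/FileStore/tables/model_weights.h5`. New text should go through the same cleaning function and the same `maxlen` of 40 so that the features match what the model was trained on.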

END