# Twitter Sentiment Analysis

### Problem Statement
- We have a dataset of tweets from various users. The task is to classify each tweet as positive (0) or negative (1) based on its sentiment.

### Dataset
- The dataset is taken from Kaggle. The link to the dataset is given below.
- https://www.kaggle.com/kazanova/sentiment140


In [203]:
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import time

SEED = 42
pd.set_option('display.max_colwidth', 200)

In [285]:
df_train = pd.read_csv("../input/train_2kmZucJ.csv")
df_test = pd.read_csv("../input/test_12QyDcx.csv")
df_sub = pd.read_csv("../input/sample_submission_LnhVWA4.csv")

In [286]:
df_train.head()

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone
1,2,0,Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/
2,3,0,We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu
3,4,0,I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/
4,5,1,What amazing service! Apple won't even talk to me about a question I have unless I pay them $19.95 for their stupid support!


In [287]:
df_test.head()

Unnamed: 0,id,tweet
0,7921,I hate the new #iphone upgrade. Won't let me download apps. #ugh #apple sucks
1,7922,currently shitting my fucking pants. #apple #iMac #cashmoney #raddest #swagswagswag http://instagr.am/p/UUIS0bIBZo/
2,7923,"I'd like to puts some CD-ROMS on my iPad, is that possible?' — Yes, but wouldn't that block the screen?\n"
3,7924,"My ipod is officially dead. I lost all my pictures and videos from the 1D and 5sos concert,and from Vet Camp #hatinglife #sobbing"
4,7925,Been fighting iTunes all night! I only want the music I $&@*# paid for


In [288]:
df_sub.head()

Unnamed: 0,id,label
0,7921,0
1,7922,0
2,7923,0
3,7924,0
4,7925,0


In [289]:
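# Utility decorator that prints how long the wrapped function takes to run
from functools import wraps

def time_it(func):
    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()  # monotonic clock, suited to timing
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        print(f"Execution time: {end_time - start_time} seconds")
        return result
    return wrapper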

In [290]:
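def clean_text(text):
    # Strip any HTML markup; naming the parser avoids bs4's "no parser specified" warning
    text = BeautifulSoup(text, "html.parser").get_text()

    # Remove URLs
    text = re.sub(r"http\S+", " ", text)
    # Substitute runs of symbols such as "$&@*#" with the word "curseword"
    curseword = re.compile(r'[$&@*#]{2,}')
    text = curseword.sub("curseword", text)
    # Replace everything except letters with a space, then lowercase
    text = re.sub("[^a-zA-Z]", " ", text)
    text = text.lower()
    return text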
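
The regex steps in `clean_text` can be traced on one of the test tweets. The sketch below is a stdlib-only illustration, not the function used in this notebook: it skips the BeautifulSoup step, and it adds a whitespace-collapsing step that the original does not have (an assumption, made only so the result is easier to read). `clean_text_steps` is a hypothetical name.

```python
import re

CURSEWORD = re.compile(r"[$&@*#]{2,}")  # two or more of these symbols in a row


def clean_text_steps(text):
    """Regex-only sketch of clean_text (no HTML stripping)."""
    text = re.sub(r"http\S+", " ", text)     # drop URLs
    text = CURSEWORD.sub("curseword", text)  # masked profanity -> one token
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep letters only
    text = text.lower()
    # Extra step, not in the original function: collapse runs of spaces.
    return " ".join(text.split())


sample = "Been fighting iTunes all night! I only want the music I $&@*# paid for"
print(clean_text_steps(sample))
print(clean_text_steps("#apple sucks http://t.co/x"))
```

A single `#` in front of a hashtag is not matched by the `{2,}` quantifier, so hashtags keep their word and only lose the symbol. Digits are dropped as well, which is why `$19.95` disappears and `5sos` becomes `sos` in the cleaned tweets.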

In [291]:
@time_it
def clean(df):
    df["clean_tweet"] = df["tweet"].apply(clean_text)
    return df

df_train = clean(df_train)

Execution time: 1.2144443988800049 seconds


In [292]:
df_train.head()

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone,fingerprint pregnancy test android apps beautiful cute health igers iphoneonly iphonesia iphone
1,2,0,Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/,finally a transparant silicon case thanks to my uncle yay sony xperia s sonyexperias
2,3,0,We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu,we love this would you go talk makememories unplug relax iphone smartphone wifi connect
3,4,0,I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/,i m wired i know i m george i was made that way iphone cute daventry home
4,5,1,What amazing service! Apple won't even talk to me about a question I have unless I pay them $19.95 for their stupid support!,what amazing service apple won t even talk to me about a question i have unless i pay them for their stupid support


In [293]:
df_test = clean(df_test)

Execution time: 0.3080263137817383 seconds


In [294]:
df_test.head()

Unnamed: 0,id,tweet,clean_tweet
0,7921,I hate the new #iphone upgrade. Won't let me download apps. #ugh #apple sucks,i hate the new iphone upgrade won t let me download apps ugh apple sucks
1,7922,currently shitting my fucking pants. #apple #iMac #cashmoney #raddest #swagswagswag http://instagr.am/p/UUIS0bIBZo/,currently shitting my fucking pants apple imac cashmoney raddest swagswagswag
2,7923,"I'd like to puts some CD-ROMS on my iPad, is that possible?' — Yes, but wouldn't that block the screen?\n",i d like to puts some cd roms on my ipad is that possible yes but wouldn t that block the screen
3,7924,"My ipod is officially dead. I lost all my pictures and videos from the 1D and 5sos concert,and from Vet Camp #hatinglife #sobbing",my ipod is officially dead i lost all my pictures and videos from the d and sos concert and from vet camp hatinglife sobbing
4,7925,Been fighting iTunes all night! I only want the music I $&@*# paid for,been fighting itunes all night i only want the music i curseword paid for


In [295]:
df_train["clean_tweet"][0]

' fingerprint  pregnancy test    android  apps  beautiful  cute  health  igers  iphoneonly  iphonesia  iphone'

In [296]:
# Separate the target variable
X = df_train["clean_tweet"]
y = df_train["label"]

In [297]:
# Split the data into train and validation set
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=SEED)
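
What `stratify=y` buys can be shown with a small pure-Python sketch. This is an illustration under simplifying assumptions, not scikit-learn's implementation: `stratified_split` is a hypothetical helper and the labels are toy data with a 75/25 class balance.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # per-class validation share
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical toy labels: 75 tweets of class 0, 25 of class 1.
labels = [0] * 75 + [1] * 25
train_idx, val_idx = stratified_split(labels)
val_pos_share = sum(labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_pos_share)
```

Because the split is done per class, the minority class keeps its share in the validation set, which matters when F1 on the negative class is the metric of interest.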

In [298]:
from keras.preprocessing.text import Tokenizer

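# Prepare a tokenizer and fit it on the training split only,
# so the vocabulary is learned without looking at the validation data
x_tokenizer = Tokenizer()

x_tokenizer.fit_on_texts(X_train)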

In [299]:
x_tokenizer.word_index

{'iphone': 1,
 'apple': 2,
 'i': 3,
 'my': 4,
 'the': 5,
 'to': 6,
 'a': 7,
 'samsung': 8,
 'and': 9,
 'it': 10,
 'new': 11,
 's': 12,
 'for': 13,
 'twitter': 14,
 'com': 15,
 'me': 16,
 'you': 17,
 'phone': 18,
 'is': 19,
 'sony': 20,
 'follow': 21,
 'on': 22,
 'in': 23,
 'of': 24,
 'this': 25,
 't': 26,
 'pic': 27,
 'with': 28,
 'ipad': 29,
 'like': 30,
 'so': 31,
 'have': 32,
 'just': 33,
 'at': 34,
 'life': 35,
 'android': 36,
 'ios': 37,
 'love': 38,
 'your': 39,
 'now': 40,
 'rt': 41,
 'that': 42,
 'day': 43,
 'all': 44,
 'can': 45,
 'instagram': 46,
 'curseword': 47,
 'an': 48,
 'cute': 49,
 'photo': 50,
 'today': 51,
 'm': 52,
 'gain': 53,
 'not': 54,
 'photography': 55,
 'get': 56,
 'galaxy': 57,
 'back': 58,
 'got': 59,
 'from': 60,
 'fun': 61,
 'be': 62,
 'case': 63,
 'news': 64,
 'app': 65,
 'out': 66,
 'music': 67,
 'instagood': 68,
 'happy': 69,
 'who': 70,
 'time': 71,
 'no': 72,
 'funny': 73,
 'lol': 74,
 'fashion': 75,
 'beautiful': 76,
 'birthday': 77,
 'are': 78,
 'b

In [300]:
len(x_tokenizer.word_index)

15198
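
The `word_index` above ranks words by frequency, with 1 for the most frequent word. A rough pure-Python sketch of that idea follows; it only mimics the behaviour on already-cleaned text (letters and spaces), and `build_word_index` and `texts_to_ids` are hypothetical helpers, not Keras functions.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to a rank: 1 = most frequent (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(texts, word_index):
    """Replace words by their ids; words missing from the index are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


train_texts = ["iphone apple iphone", "apple iphone love"]  # hypothetical toy corpus
word_index = build_word_index(train_texts)
print(word_index)
print(texts_to_ids(["love my iphone"], word_index))
```

In the second call the word `my` is not in the toy vocabulary, so it is silently skipped. The same happens to validation or test words that never occurred in the training split when no out-of-vocabulary token is configured.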

In [311]:
from keras.utils import pad_sequences 

# maximum sequence length allowed
max_len = 100

#convert text sequences into integer sequences
x_tr_seq = x_tokenizer.texts_to_sequences(X_train) 
x_val_seq = x_tokenizer.texts_to_sequences(X_val)

#padding up with zero 
x_tr_seq = pad_sequences(x_tr_seq,  padding='post', maxlen=max_len)
x_val_seq = pad_sequences(x_val_seq, padding='post', maxlen=max_len)

In [312]:
x_tr_seq[1]

array([  33, 1435,    4,   11,    1,  150,  175,   17,    1,    2,   37,
       2148,   18,   87,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0], dtype=int32)
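
The padded row above has 14 word ids followed by zeros up to `max_len`. A minimal sketch of post-padding is given below; `pad_post` is a hypothetical helper written for illustration, and it assumes the Keras default of truncating from the front when a sequence is longer than `max_len`.

```python
def pad_post(seq, max_len, value=0):
    """Pad at the end with `value`; keep only the last max_len ids if too long."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    seq = list(seq)[-max_len:]  # front truncation keeps the most recent ids
    return seq + [value] * (max_len - len(seq))


print(pad_post([33, 1435, 4], 6))
print(pad_post([1, 2, 3, 4, 5, 6, 7, 8], 6))
```

Id 0 is reserved for padding, which is why the embedding layers later use `len(x_tokenizer.word_index) + 1` as the vocabulary size.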

## RNN

In [303]:
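from keras.models import Sequential
from keras.layers import (Embedding, SimpleRNN, LSTM, Bidirectional, Conv1D,
                          SpatialDropout1D, Dense, Dropout)
from keras.callbacks import ModelCheckpoint
import keras.backend as K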

In [255]:
K.clear_session()
#sequential model
model = Sequential()

#embedding layer
model.add(Embedding(len(x_tokenizer.word_index) + 1, 50, input_shape=(max_len,), mask_zero=True))

#rnn layer
model.add(SimpleRNN(128,activation='relu'))

model.add(Dropout(0.5))

#dense layer
model.add(Dense(128,activation='relu')) 

#output layer
model.add(Dense(1,activation='sigmoid'))

In [256]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 50)           759950    
                                                                 
 simple_rnn (SimpleRNN)      (None, 128)               22912     
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 128)               16512     
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 799,503
Trainable params: 799,503
Non-trainable params: 0
_________________________________________________________________
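
The parameter counts in the summary can be reproduced by hand from the layer sizes. The arithmetic below uses the vocabulary size reported earlier (15198) and the standard weight shapes for each layer type; the variable names are only for illustration.

```python
vocab_size = 15198 + 1  # word_index size plus the padding id 0
embed_dim = 50
rnn_units = 128
dense_units = 128

embedding = vocab_size * embed_dim                    # one vector per id
simple_rnn = rnn_units * (embed_dim + rnn_units + 1)  # input + recurrent weights + bias
dense = rnn_units * dense_units + dense_units         # weights + bias
output = dense_units * 1 + 1                          # single sigmoid unit
total = embedding + simple_rnn + dense + output

print(embedding, simple_rnn, dense, output, total)
```

About 95% of the parameters sit in the embedding layer, so the vocabulary size dominates the model size here.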


In [257]:
def get_f1(y_true, y_pred): #taken from old keras source code
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val
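
A pure-Python mirror of `get_f1` makes one caveat visible: the function is evaluated per batch, and Keras should report the running mean of those batch values, which is generally not the F1 of the pooled predictions. The sketch below assumes 0/1 labels, probabilities that are never exactly 0.5, and an epsilon of 1e-7; `batch_f1` and the toy batches are hypothetical.

```python
def batch_f1(y_true, y_prob, eps=1e-7):
    """Pure-Python mirror of get_f1 for one batch (probabilities cut at 0.5)."""
    y_hat = [1 if p > 0.5 else 0 for p in y_prob]
    tp = sum(t * h for t, h in zip(y_true, y_hat))
    precision = tp / (sum(y_hat) + eps)
    recall = tp / (sum(y_true) + eps)
    return 2 * precision * recall / (precision + recall + eps)


# Two hypothetical batches of (labels, predicted probabilities).
batch_a = ([1, 1], [0.9, 0.8])
batch_b = ([1, 0, 0, 0], [0.1, 0.9, 0.2, 0.3])

mean_of_batches = (batch_f1(*batch_a) + batch_f1(*batch_b)) / 2
pooled = batch_f1(batch_a[0] + batch_b[0], batch_a[1] + batch_b[1])
print(round(mean_of_batches, 3), round(pooled, 3))
```

This gap is one reason to recompute F1 on the full validation set with scikit-learn, as the later cells do, rather than rely on the value logged during training.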

In [258]:
# Use the Adam optimizer with binary cross-entropy loss, and track the F1 score as the metric
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[get_f1])

In [281]:
# checkpoint to save best model during training
mc = ModelCheckpoint("../models/rnn_weights.best.hdf5", monitor='val_loss', verbose=1, save_best_only=True, mode='min')

In [260]:
#train the model 
model.fit(x_tr_seq, y_train, batch_size=128, epochs=10, verbose=1, validation_data=(x_val_seq, y_val), callbacks=[mc])

Epoch 1/10
Epoch 1: val_loss improved from inf to 0.34830, saving model to ../models/mlp_weights.best.hdf5
Epoch 2/10
Epoch 2: val_loss improved from 0.34830 to 0.31849, saving model to ../models/mlp_weights.best.hdf5
Epoch 3/10
Epoch 3: val_loss improved from 0.31849 to 0.27881, saving model to ../models/mlp_weights.best.hdf5
Epoch 4/10
Epoch 4: val_loss did not improve from 0.27881
Epoch 5/10
Epoch 5: val_loss did not improve from 0.27881
Epoch 6/10
Epoch 6: val_loss did not improve from 0.27881
Epoch 7/10
Epoch 7: val_loss did not improve from 0.27881
Epoch 8/10
Epoch 8: val_loss did not improve from 0.27881
Epoch 9/10
Epoch 9: val_loss did not improve from 0.27881
Epoch 10/10
Epoch 10: val_loss did not improve from 0.27881


<keras.callbacks.History at 0x7f99efb1c6d0>
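
The log shows the checkpoint being written only when `val_loss` improves. The logic behind `save_best_only=True` with `mode='min'` can be sketched in a few lines. In the list below only the first three losses come from the log above; the remaining seven are hypothetical placeholders that merely satisfy "did not improve". `epochs_saved` is a hypothetical helper.

```python
def epochs_saved(val_losses):
    """Return the 1-based epochs at which a new best loss is seen, and that loss."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved, best


# Epochs 1-3 are taken from the log; epochs 4-10 are made-up values >= 0.27881.
val_losses = [0.34830, 0.31849, 0.27881, 0.30, 0.33, 0.35, 0.40, 0.45, 0.50, 0.55]
print(epochs_saved(val_losses))
```

The weights on disk therefore correspond to epoch 3, even though training continued for seven more epochs.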

In [262]:
# restore the best weights saved by the checkpoint
model.load_weights("../models/rnn_weights.best.hdf5")

#predict probabilities
pred_prob = model.predict(x_val_seq)



In [263]:
pred_prob[0]

array([0.96341085], dtype=float32)

In [264]:
#define candidate threshold values
threshold  = np.arange(0,0.5,0.01)
threshold

array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
       0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
       0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32,
       0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43,
       0.44, 0.45, 0.46, 0.47, 0.48, 0.49])

In [265]:
# convert probabilities into classes or tags based on a threshold value
def classify(pred_prob, thresh):
    # vectorized comparison; returns nested lists of 0/1 with the same shape as pred_prob
    return (np.asarray(pred_prob) >= thresh).astype(int).tolist()

In [266]:
from sklearn import metrics
score=[]

#convert to 1d array
y_true = np.array(y_val).ravel() 

for thresh in threshold:
    
    #classes for each threshold
    y_pred_seq = classify(pred_prob,thresh) 

    #convert to 1d array
    y_pred = np.array(y_pred_seq).ravel()

    score.append(metrics.f1_score(y_true,y_pred))

In [267]:
# find the optimal threshold
opt = threshold[score.index(max(score))]
opt

0.41000000000000003
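
The threshold search above boils down to scoring each candidate cut-off with F1 and keeping the first maximum. A dependency-free sketch on toy data follows; the labels and probabilities are hypothetical, and the sweep here runs over the full 0 to 0.99 range, which is an assumption (the notebook stops at 0.49).

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, probs, thresholds):
    """Return (threshold, score) for the first threshold with the highest F1."""
    scores = [f1_binary(y_true, [int(p >= th) for p in probs])
              for th in thresholds]
    best = max(range(len(scores)), key=scores.__getitem__)
    return thresholds[best], scores[best]


# Hypothetical validation labels and predicted probabilities.
y_true = [1, 0, 1, 0, 1]
probs = [0.9, 0.35, 0.45, 0.1, 0.3]
thresholds = [i / 100 for i in range(100)]
th, score = best_threshold(y_true, probs, thresholds)
print(th, round(score, 3))
```

Since the threshold is tuned on the validation set, the validation F1 at that threshold is likely to be somewhat optimistic compared with unseen data.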

In [268]:
#predictions for optimal threshold
y_pred_seq = classify(pred_prob,opt)
y_pred = np.array(y_pred_seq).ravel()

In [269]:
print(metrics.classification_report(y_true,y_pred))

              precision    recall  f1-score   support

           0       0.96      0.89      0.92      1179
           1       0.73      0.89      0.80       405

    accuracy                           0.89      1584
   macro avg       0.85      0.89      0.86      1584
weighted avg       0.90      0.89      0.89      1584
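
The F1 columns in the report can be cross-checked from the precision, recall, and support values. The inputs below are the rounded figures printed above, so the results only agree to two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the classification report.
f1_class0 = f1(0.96, 0.89)
f1_class1 = f1(0.73, 0.89)
support0, support1 = 1179, 405

macro_f1 = (0.92 + 0.80) / 2
weighted_f1 = (0.92 * support0 + 0.80 * support1) / (support0 + support1)

print(round(f1_class0, 2), round(f1_class1, 2))
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average treats both classes equally, while the weighted average is pulled toward class 0 because it has roughly three times the support.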



In [270]:
y_pred

array([1, 0, 1, ..., 0, 0, 1])

In [271]:
df = pd.DataFrame({'comment':X_val,'actual':y_true,'predictions':y_pred})
df.head()

Unnamed: 0,comment,actual,predictions
906,my iphone it sucks keeps screwing up shut off freezes and all that jazz,1,1
3556,flower art flower art colour photography creative beautiful lines samsung tuesday pic twitter com xd uwxqnq,0,0
2043,got my ps with a copy of killzone shadow fall and a year of ps plus sony ps killzone,0,1
1264,coffee is love iphoneography instagram iphonesia photooftheday iphone instagood popular,0,0
5337,i m sure my iphone just deleted every text message in the history of all my contacts apple,0,1


In [272]:
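# prepare a tokenizer
# Caution: fitting a fresh tokenizer on the test set assigns word ids that differ
# from the ones the model was trained on. Reusing the tokenizer fitted on X_train
# (and skipping this fit) would keep the ids consistent.
x_tokenizer = Tokenizer() 

X_test = df_test["clean_tweet"]

x_tokenizer.fit_on_texts(X_test)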

In [273]:
len(x_tokenizer.word_index)

6889
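
The vocabulary size changes from 15198 to 6889 here because the tokenizer was refitted on the test tweets. The toy sketch below shows why that matters: the same word can receive a different id, so the embedding row the model learned for one word gets looked up for another. `build_word_index` is a hypothetical stand-in for the tokenizer's frequency ranking, and the texts are made up.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency, 1 = most frequent."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


train_texts = ["iphone apple iphone", "apple love"]  # hypothetical
test_texts = ["love love apple"]                     # hypothetical

train_index = build_word_index(train_texts)
test_index = build_word_index(test_texts)

# What the model "sees" if test ids are decoded with the training vocabulary.
id_to_train_word = {idx: word for word, idx in train_index.items()}
seen_by_model = [id_to_train_word[test_index[w]] for w in test_texts[0].split()]
print(train_index, test_index)
print(seen_by_model)
```

In this toy case the test tweet "love love apple" reaches the model as "iphone iphone apple". Encoding the test set with the tokenizer that was fitted on the training split avoids the mismatch.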

In [274]:
from keras.utils import pad_sequences 

# maximum sequence length allowed
max_len = 100

#convert text sequences into integer sequences
x_test_seq = x_tokenizer.texts_to_sequences(X_test)

#padding up with zero 
x_test_seq = pad_sequences(x_test_seq,  padding='post', maxlen=max_len)

In [275]:
x_test_seq[1]

array([1491, 1492,    4,  116, 1493,    2,  424, 2353, 2354, 2355,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0], dtype=int32)

In [276]:
#predict probabilities
pred_test_prob = model.predict(x_test_seq)



In [277]:
#predictions for optimal threshold
y_pred_test_seq = classify(pred_test_prob,opt)
y_pred_test = np.array(y_pred_test_seq).ravel()

In [278]:
y_pred_test

array([1, 0, 1, ..., 0, 0, 0])

In [279]:
df_sub["label"] = y_pred_test

In [280]:
df_sub.head()

Unnamed: 0,id,label
0,7921,1
1,7922,0
2,7923,1
3,7924,0
4,7925,0


In [170]:
df_sub.to_csv("../results/submission_rnn_v2.csv", index=False)

## CNN

In [323]:
# define model architecture
K.clear_session()
model =  Sequential()
model.add(Embedding(len(x_tokenizer.word_index) + 1, 50, trainable=True, input_shape=(max_len,)))  #embedding layer

model.add(SpatialDropout1D(0.2)) #spatialdropout1d layer

model.add(Conv1D(64,5,padding='same', activation='relu'))  #conv1d layer
model.add(Bidirectional(LSTM(64,dropout=0.2, recurrent_dropout=0.2))) #bidirectional lstm layer
model.add(Dense(128,activation='relu'))  #dense layer
model.add(Dropout(0.5))
model.add(Dense(64,activation='relu')) #dense layer


model.add(Dense(1,activation='sigmoid')) #output layer
model.summary() #summary of model

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 50)           759950    
                                                                 
 spatial_dropout1d (SpatialD  (None, 100, 50)          0         
 ropout1D)                                                       
                                                                 
 conv1d (Conv1D)             (None, 100, 64)           16064     
                                                                 
 bidirectional (Bidirectiona  (None, 128)              66048     
 l)                                                              
                                                                 
 dense (Dense)               (None, 128)               16512     
                                                                 
 dropout (Dropout)           (None, 128)               0
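
The Conv1D and bidirectional LSTM counts in the summary follow from the usual weight shapes: a kernel per filter plus a bias for the convolution, and four gates per LSTM direction. The arithmetic below is a sketch using the layer sizes from the model definition; the variable names are only for illustration.

```python
embed_dim = 50
conv_filters, kernel_size = 64, 5
lstm_units = 64
dense_units = 128

# Conv1D: each filter spans kernel_size steps of embed_dim channels, plus a bias.
conv1d = kernel_size * embed_dim * conv_filters + conv_filters

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
one_direction = 4 * lstm_units * (conv_filters + lstm_units + 1)
bidirectional = 2 * one_direction  # forward and backward copies

# The first Dense layer receives the concatenated 2 * lstm_units outputs.
dense = 2 * lstm_units * dense_units + dense_units

print(conv1d, bidirectional, dense)
```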

In [325]:
#define optimizer and loss
model.compile(optimizer='adam',loss='binary_crossentropy', metrics=[get_f1])

#checkpoint to save best model during training
mc = ModelCheckpoint("../models/cnn.weights.best.hdf5", monitor='val_loss', verbose=1, save_best_only=True, mode='min')

In [326]:
#train the model 
model.fit(x_tr_seq, y_train, batch_size=128, epochs=10, verbose=1, validation_data=(x_val_seq, y_val), callbacks=[mc])

Epoch 1/10
Epoch 1: val_loss improved from inf to 0.37332, saving model to ../models/cnn.weights.best.hdf5
Epoch 2/10
Epoch 2: val_loss improved from 0.37332 to 0.25281, saving model to ../models/cnn.weights.best.hdf5
Epoch 3/10
Epoch 3: val_loss did not improve from 0.25281
Epoch 4/10
Epoch 4: val_loss did not improve from 0.25281
Epoch 5/10
Epoch 5: val_loss did not improve from 0.25281
Epoch 6/10
Epoch 6: val_loss did not improve from 0.25281
Epoch 7/10
Epoch 7: val_loss did not improve from 0.25281
Epoch 8/10
Epoch 8: val_loss did not improve from 0.25281
Epoch 9/10
Epoch 9: val_loss did not improve from 0.25281
Epoch 10/10
Epoch 10: val_loss did not improve from 0.25281


<keras.callbacks.History at 0x7f99d628be50>

## LSTM

In [307]:
# Define model architecture
K.clear_session()
model =  Sequential()
model.add(Embedding(len(x_tokenizer.word_index) + 1, 128, trainable=True, input_shape=(max_len,), mask_zero=True))  #embedding layer
  
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))  # input shape is inferred from the embedding layer

model.add(Dropout(0.6))

# Dense layer
model.add(Dense(128,activation='relu')) 

# Output layer
model.add(Dense(1,activation='sigmoid'))
model.summary() #summary of model

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 128)          1945472   
                                                                 
 lstm (LSTM)                 (None, 128)               131584    
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 128)               16512     
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 2,093,697
Trainable params: 2,093,697
Non-trainable params: 0
_________________________________________________________________


In [308]:
#define optimizer and loss
model.compile(optimizer='adam',loss='binary_crossentropy', metrics=[get_f1])

#checkpoint to save best model during training
mc = ModelCheckpoint("../models/lstm.weights.best.hdf5", monitor='val_loss', verbose=1, save_best_only=True, mode='min')

In [310]:
#train the model 
model.fit(x_tr_seq, y_train, batch_size=64, epochs=10, verbose=1, validation_data=(x_val_seq, y_val), callbacks=[mc])

Epoch 1/10


Epoch 1: val_loss did not improve from 0.24433
Epoch 2/10
Epoch 2: val_loss did not improve from 0.24433
Epoch 3/10
Epoch 3: val_loss did not improve from 0.24433
Epoch 4/10
Epoch 4: val_loss did not improve from 0.24433
Epoch 5/10
Epoch 5: val_loss did not improve from 0.24433
Epoch 6/10
Epoch 6: val_loss did not improve from 0.24433
Epoch 7/10
Epoch 7: val_loss did not improve from 0.24433
Epoch 8/10
Epoch 8: val_loss did not improve from 0.24433
Epoch 9/10
Epoch 9: val_loss did not improve from 0.24433
Epoch 10/10
Epoch 10: val_loss did not improve from 0.24433


<keras.callbacks.History at 0x7f99ecedcd00>