# Sentiment analysis
1. Task type: NLP
2. Dataset: Tweets (Text type)
3. Use cases: social media management, review systems, news analysis for stock markets

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('./data/train.csv', encoding='latin-1', names=['Target', 'TweetID', 'Date', 'No_Query', 'UserName', 'Data'])
df.head()

Observations:
1. The file is not UTF-8 encoded, so the correct encoding (latin-1) must be set to read the CSV file.
2. The file has no header row, so the column names are supplied explicitly.
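
To illustrate both observations, here is a minimal sketch on a made-up two-row sample (the rows, user names and tweet text are hypothetical, only the six-column layout follows train.csv). It reads headerless latin-1 bytes with explicit column names, then checks that the same bytes cannot be decoded as UTF-8.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same six-column layout as train.csv.
raw = (
    '"0","1","Mon Apr 06","NO_QUERY","user_a","caf\xe9 was closed"\n'
    '"4","2","Mon Apr 06","NO_QUERY","user_b","great day"\n'
).encode('latin-1')

columns = ['Target', 'TweetID', 'Date', 'No_Query', 'UserName', 'Data']

# names= supplies the missing header; encoding= tells pandas how to decode the bytes.
sample = pd.read_csv(io.BytesIO(raw), encoding='latin-1', names=columns)
print(sample[['Target', 'Data']])

# The byte 0xE9 followed by a space is not valid UTF-8, so the default encoding fails.
try:
    pd.read_csv(io.BytesIO(raw), names=columns)
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print('Readable as UTF-8:', utf8_ok)
```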

In [None]:
df.info()

Only the Target and Data columns are useful for training a sentiment detection model, so I can drop the other columns.

In [None]:
df = df[['Target', 'Data']]
df.head()

Next, I will remove tokens that usually do not contribute to sentiment, such as mentions (@username in the tweet text) and URLs. I will keep hashtags for now to check whether they have any effect on the results.

In [None]:
df['Data'] = df['Data'].replace(r'http\S+', '', regex=True).replace(r'@\S+', '', regex=True)
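
As a quick check of what these two patterns do, the sketch below applies the same regular expressions to a single made-up tweet with Python's re module (the example text and the helper name strip_urls_and_mentions are hypothetical). Note that the hashtag survives and that extra whitespace is left behind where the removed tokens were.

```python
import re

URL_PATTERN = re.compile(r'http\S+')    # 'http' up to the next whitespace
MENTION_PATTERN = re.compile(r'@\S+')   # '@' up to the next whitespace


def strip_urls_and_mentions(text):
    """Remove URLs first, then @mentions, in the same order as the cell above."""
    text = URL_PATTERN.sub('', text)
    return MENTION_PATTERN.sub('', text)


example = '@someone check http://example.com/page #monday blues'
cleaned = strip_urls_and_mentions(example)
print(repr(cleaned))
```

A tweet that contained only a mention or a URL ends up empty or whitespace-only after this step, which is why such rows need to be handled before language detection.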

In [None]:
df.head(20)

df.info() shows that the data has no null values and that the column types are int64 or object. Next, I need to determine the language of each tweet, because I want the model to be trained on English text only.

In [None]:
from langdetect import detect

In [None]:
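# numpy.NaN was removed in NumPy 2.0; keep the NaN name because later cells use it.
NaN = np.nan

# After removing URLs and mentions, some tweets contain only whitespace; mark them as missing.
# .loc with a boolean mask avoids the chained assignment df['Data'][i] = ...
blank = df['Data'].map(lambda s: isinstance(s, str) and s.isspace())
df.loc[blank, 'Data'] = NaN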

In [None]:
df = df[df['Data'].notna()]

In [None]:
df = df.reset_index(drop=True)

In [None]:
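import string

# Series.replace without regex matches whole cell values, so this marks tweets that consist
# of a single punctuation character as missing; punctuation inside longer tweets is kept.
df['Data'] = df['Data'].replace(list(string.punctuation), np.nan)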

In [None]:
df.head()

In [None]:
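def detect_english(text):
    """Return 'en' if langdetect identifies the text as English, otherwise NaN."""
    try:
        return 'en' if detect(text) == 'en' else np.nan
    except Exception:
        # detect() fails on missing values and on text with no usable features.
        return np.nan

# One pass over the column replaces both index-based loops and the chained assignments.
df['ln'] = df['Data'].apply(detect_english)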

In [None]:
df.head(20)

In [None]:
df.to_csv('./data/trainModified.csv')

In [3]:
df = pd.read_csv('./data/trainModified.csv')
df.head(10)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Target,Data,ln
0,0,0,0,"- Awww, that's a bummer. You shoulda got Da...",en
1,1,1,0,is upset that he can't update his Facebook by ...,en
2,2,2,0,I dived many times for the ball. Managed to s...,en
3,3,3,0,my whole body feels itchy and like its on fire,en
4,4,4,0,"no, it's not behaving at all. i'm mad. why am...",en
5,5,5,0,not the whole crew,en
6,6,6,0,Need a hug,en
7,7,7,0,"hey long time no see! Yes.. Rains a bit ,onl...",en
8,8,8,0,nope they didn't have it,en
9,9,9,0,que me muera ?,


In [4]:
df = df[df.ln == 'en']
# df = df.drop(df.iloc[:, 0:1], axis=1)
df = df.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1)
df = df.reset_index()
df.head(10)

Unnamed: 0,index,Target,Data,ln
0,0,0,"- Awww, that's a bummer. You shoulda got Da...",en
1,1,0,is upset that he can't update his Facebook by ...,en
2,2,0,I dived many times for the ball. Managed to s...,en
3,3,0,my whole body feels itchy and like its on fire,en
4,4,0,"no, it's not behaving at all. i'm mad. why am...",en
5,5,0,not the whole crew,en
6,6,0,Need a hug,en
7,7,0,"hey long time no see! Yes.. Rains a bit ,onl...",en
8,8,0,nope they didn't have it,en
9,10,0,spring break in plain city... it's snowing,en


In [5]:
len(df[df.Target==0])/len(df[df.Target==4])

1.0135628243231949

The data is now reasonably clean, and the two classes are almost the same size: the ratio of negative (Target 0) to positive (Target 4) tweets is about 1.01. We can therefore train the model without much concern about bias from unbalanced data.

In [6]:
df = df[['Data','Target']]
df.head()

Unnamed: 0,Data,Target
0,"- Awww, that's a bummer. You shoulda got Da...",0
1,is upset that he can't update his Facebook by ...,0
2,I dived many times for the ball. Managed to s...,0
3,my whole body feels itchy and like its on fire,0
4,"no, it's not behaving at all. i'm mad. why am...",0


In [7]:
df['Target'].value_counts()

0    739687
4    729789
Name: Target, dtype: int64
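
The ratio reported earlier can be reproduced from these two counts. The short sketch below recomputes it and also expresses each class as a share of the total; the counts are copied from the output above, everything else is plain arithmetic.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 739687, 4: 729789}

total = sum(counts.values())
ratio = counts[0] / counts[4]
shares = {label: n / total for label, n in counts.items()}

print(f'total rows: {total}')
print(f'negative/positive ratio: {ratio:.4f}')
for label, share in shares.items():
    print(f'Target {label}: {share:.2%} of rows')
```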

Training the model on the complete data took too much time, so I shuffled the data and dropped rows so that the training set has 800,000 rows in total.

In [8]:
df = df.sample(frac=1).reset_index(drop=True)
df.head()

Unnamed: 0,Data,Target
0,i cant spelll well today! argh!! i want to do ...,4
1,This is so awesome!,4
2,I hope they stop when it closes,0
3,YAY! Finished doing the sketch of the venue! W...,4
4,Cool! Managed to connect to my office network ...,4


In [9]:
df = df.drop(df.index[800000:])
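
The shuffle above has no fixed seed, so a different subset is kept on every run. If reproducibility matters, one option is to draw the subset in a single seeded call. The sketch below shows this on a toy frame; the toy data, the size of 6 and the seed 42 are assumptions for illustration, not values from this notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned training frame.
toy = pd.DataFrame({
    'Data': [f'tweet {i}' for i in range(10)],
    'Target': [0] * 5 + [4] * 5,
})

keep = 6  # hypothetical subset size (the notebook keeps 800000 rows)

# sample(n=...) draws rows without replacement in random order, which has the same
# effect as shuffling and keeping the first `keep` rows; random_state fixes the draw.
subset = toy.sample(n=keep, random_state=42).reset_index(drop=True)
print(subset['Target'].value_counts())
```

Checking value_counts() on the subset is a cheap way to confirm that the class balance survived the cut.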

In [10]:
tweet = df.Data.values

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)

tokenizer.fit_on_texts(tweet)



In [12]:
encoded_docs = tokenizer.texts_to_sequences(tweet)

In [13]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequence = pad_sequences(encoded_docs, maxlen=200)
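
To make the two preprocessing steps concrete, here is a simplified pure-Python sketch of the idea behind them. It is not the Keras implementation: the function names are hypothetical, and it only lowercases and splits on whitespace, whereas the Keras Tokenizer also filters punctuation. It does follow the same conventions of ranking words by frequency starting at index 1, dropping words whose index is not below num_words, and padding with zeros at the front.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, word_index, num_words):
    """Turn a text into word indices, skipping unknown words and indices >= num_words."""
    return [
        word_index[word]
        for word in text.lower().split()
        if word in word_index and word_index[word] < num_words
    ]


def pad_pre(sequence, maxlen):
    """Keep the last `maxlen` items and left-pad with zeros (maxlen must be positive)."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence


texts = ['good day', 'good good movie', 'bad day']
word_index = build_word_index(texts)
encoded = encode('bad good day', word_index, num_words=4)
padded = pad_pre(encoded, maxlen=5)
print(word_index)
print(encoded)
print(padded)
```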

In [14]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Dense, Dropout, SpatialDropout1D
from tensorflow.keras.layers import Embedding

vocab_size = len(tokenizer.word_index) + 1
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(vocab_size, embedding_vector_length, input_length=200))
model.add(SpatialDropout1D(0.25))
model.add(LSTM(50, dropout=0.5, recurrent_dropout=0.5))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 32)           6310144   
                                                                 
 spatial_dropout1d (SpatialD  (None, 200, 32)          0         
 ropout1D)                                                       
                                                                 
 lstm (LSTM)                 (None, 50)                16600     
                                                                 
 dropout (Dropout)           (None, 50)                0         
                                                                 
 dense (Dense)               (None, 1)                 51        
                                                                 
Total params: 6,326,795
Trainable params: 6,326,795
Non-trainable params: 0
______________________________________________
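
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the vocabulary size is derived from the summary itself rather than read from the tokenizer.

```python
embedding_dim = 32
lstm_units = 50

# Number of embedding rows implied by the summary: 6,310,144 parameters / 32 dimensions.
vocab_size = 6310144 // embedding_dim

embedding_params = vocab_size * embedding_dim
# 4 gates x (input weights + recurrent weights + bias) per unit.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
# One output unit: one weight per LSTM unit plus a bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(vocab_size, embedding_params, lstm_params, dense_params, total_params)
```

Almost all parameters sit in the embedding table. Because vocab_size is taken from tokenizer.word_index, which lists every word seen during fitting, the table has far more rows than the 5000 indices that Tokenizer(num_words=5000) can emit, so rows for indices of 5000 and above are never looked up in this setup.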

In [15]:
sentiment_label = df.Target.factorize()

In [19]:
history = model.fit(padded_sequence,sentiment_label[0],validation_split=0.5, epochs=2, batch_size=800)

Epoch 1/2
Epoch 2/2


In [21]:
def predict_sentiment(text):
    tw = tokenizer.texts_to_sequences([text])
    tw = pad_sequences(tw,maxlen=200)
    prediction = int(model.predict(tw).round().item())
    if sentiment_label[1][prediction]==0:
        print('Predicted Sentiment: Negative')
    else:
        print('Predicted Sentiment: Positive')
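
The lookup through sentiment_label[1] matters because factorize() assigns codes in order of first appearance, so after shuffling, code 0 does not necessarily mean Target 0. The sketch below shows this on a made-up series (the values and the helper name label_for are hypothetical) where the first row is positive.

```python
import pandas as pd

# Hypothetical shuffled targets whose first row happens to be positive (4).
targets = pd.Series([4, 0, 4, 0, 0])
codes, uniques = targets.factorize()
print(list(codes))    # integer codes the model is trained on
print(list(uniques))  # original Target value behind each code


def label_for(probability, uniques):
    """Round a sigmoid output to a code, then map the code back to the original Target."""
    code = int(round(probability))
    return 'Negative' if uniques[code] == 0 else 'Positive'


print(label_for(0.9, uniques))
print(label_for(0.1, uniques))
```

In this toy case a high sigmoid output means Negative, because code 1 stands for Target 0.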


In [22]:
predict_sentiment('His daughter died in his arms')
predict_sentiment('Life is so good')

Predicted Sentiment: Negative
Predicted Sentiment: Positive


In [44]:
df_test = pd.read_csv('./data/test.csv', names=['Target', 'TweetID', 'Date', 'No_Query', 'UserName', 'Data'])

In [45]:
df_test = df_test[['Target', 'Data']]
len(df_test)

498

In [47]:
df_test = df_test[df_test.Target!=2]
len(df_test)

359

In [48]:
df_test['Data'] = df_test['Data'].replace(r'http\S+', '', regex=True).replace(r'@\S+', '', regex=True)


In [54]:
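# reset_index(inplace=True) returns None, so its result must not be assigned
# (the original assignment overwrote the training frame df with None).
df_test.reset_index(inplace=True)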

In [55]:
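import numpy as np

# Mark whitespace-only tweets as missing without chained assignment.
blank = df_test['Data'].map(lambda s: isinstance(s, str) and s.isspace())
df_test.loc[blank, 'Data'] = np.nan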

In [62]:
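import string
import numpy as np  # numpy.NaN was removed in NumPy 2.0, so np.nan is used instead

df_test = df_test.reset_index(drop=True)
# Whole-value match: only tweets that are a single punctuation character become missing.
df_test['Data'] = df_test['Data'].replace(list(string.punctuation), np.nan)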

In [63]:
df_test.to_csv('./data/testModified.csv')


In [77]:
df_test = pd.read_csv('./data/testModified.csv')

In [78]:
# errors='ignore' skips columns that are absent; no cell shown here adds 'ln' to the test data.
df_test = df_test.drop(columns=['ln', 'Unnamed: 0'], errors='ignore')
df_test.head()

Unnamed: 0,index,Target,Data
0,0,4,I loooooooovvvvvveee my Kindle2. Not that the...
1,1,4,Reading my kindle2... Love it... Lee childs i...
2,2,4,"Ok, first assesment of the #kindle2 ...it fuck..."
3,3,4,You'll love your Kindle2. I've had mine for a...
4,4,4,Fair enough. But i have the Kindle2 and I th...


In [80]:
df_test = df_test[['Data', 'Target']]
df_test.head()

Unnamed: 0,Data,Target
0,I loooooooovvvvvveee my Kindle2. Not that the...,4
1,Reading my kindle2... Love it... Lee childs i...,4
2,"Ok, first assesment of the #kindle2 ...it fuck...",4
3,You'll love your Kindle2. I've had mine for a...,4
4,Fair enough. But i have the Kindle2 and I th...,4


In [82]:
df_test.Data.apply(predict_sentiment)

Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Negative
Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Negative
Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Negative
Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Negative
Predicted Sentiment: Negative
Predicted Sentiment: Negative
Predicted Sentiment: Positive
Predicted Sentiment: Negative
Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Negative
Predicted Sentiment: Positive
Predicted Sentiment: Negative
Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Positive
Predicted Sentiment: Negative
Predicted Sentiment: Positive
Predicted Sentiment: Negative
Predicted Sentiment: Positive
Predicted Sentiment: Negative
Predicted 