<a href="https://colab.research.google.com/github/SameerR007/sentiment_analysis_rnn/blob/main/sentiment_analysis_rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Reading the data

In [26]:
import pandas as pd
dataset = pd.read_csv('train.csv',encoding= 'unicode_escape')

In [27]:
dataset.shape

(27481, 10)

The dataset has 27,481 rows and 10 columns.

In [28]:
dataset.head()

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26


We keep only the columns the model needs: the text (`selected_text`) and its label (`sentiment`).

#Data Preprocessing

In [29]:
df=dataset[['selected_text','sentiment']]

In [30]:
df.head()

Unnamed: 0,selected_text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD,negative
2,bullying me,negative
3,leave me alone,negative
4,"Sons of ****,",negative


In [31]:
#assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning when writing to a column slice
df = df.assign(sentiment=df['sentiment'].map({'positive': 2, 'neutral': 1, 'negative': 0}))


In [32]:
df.head()

Unnamed: 0,selected_text,sentiment
0,"I`d have responded, if I were going",1
1,Sooo SAD,0
2,bullying me,0
3,leave me alone,0
4,"Sons of ****,",0


Sentiments are encoded as integers (positive = 2, neutral = 1, negative = 0) so that the model can use them as class labels.
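
The same encoding can be sketched in plain Python. The names `LABEL_TO_ID`, `ID_TO_LABEL`, and `encode_labels` are illustrative and not part of the notebook. The sketch raises an error on an unknown label, whereas pandas `Series.map` with a dict would silently produce `NaN` for it.

```python
# Illustrative sketch: the label encoding used above, in plain Python.
LABEL_TO_ID = {"positive": 2, "neutral": 1, "negative": 0}
# Inverse mapping, useful for turning a predicted class id back into a label.
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}

def encode_labels(labels):
    """Map sentiment strings to integer class ids, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(LABEL_TO_ID))
    if unknown:
        raise ValueError(f"unexpected sentiment labels: {unknown}")
    return [LABEL_TO_ID[label] for label in labels]

encoded = encode_labels(["neutral", "negative", "negative"])
print(encoded)                            # [1, 0, 0]
print([ID_TO_LABEL[i] for i in encoded])  # ['neutral', 'negative', 'negative']
```

Keeping the inverse mapping next to the forward one makes it harder for the decoding step at prediction time to drift out of sync with the encoding.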

In [34]:
#dropping null values
df=df.dropna()

In [10]:
#import pickle
#pickle.dump(df, open("df.pkl", 'wb'))

In [35]:
#split the data into training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=25)

In [36]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()

In [37]:
docs=df["selected_text"].astype("string")
docs_train=train["selected_text"].astype("string")
docs_test=test["selected_text"].astype("string")

In [38]:
tokenizer.fit_on_texts(docs_train)

In [39]:
tokenizer.word_index

{'i': 1,
 'to': 2,
 'the': 3,
 'a': 4,
 'you': 5,
 'it': 6,
 'my': 7,
 'and': 8,
 'is': 9,
 's': 10,
 'in': 11,
 't': 12,
 'for': 13,
 'of': 14,
 'me': 15,
 'that': 16,
 'on': 17,
 'so': 18,
 'have': 19,
 'm': 20,
 'but': 21,
 'good': 22,
 'just': 23,
 'not': 24,
 'day': 25,
 'be': 26,
 'with': 27,
 'at': 28,
 'was': 29,
 'can': 30,
 'love': 31,
 'no': 32,
 'happy': 33,
 'all': 34,
 'out': 35,
 'this': 36,
 'now': 37,
 'up': 38,
 'like': 39,
 'get': 40,
 'are': 41,
 'go': 42,
 'do': 43,
 'work': 44,
 'going': 45,
 'what': 46,
 'too': 47,
 'your': 48,
 'don': 49,
 'today': 50,
 'lol': 51,
 'got': 52,
 'one': 53,
 'time': 54,
 'we': 55,
 'u': 56,
 'thanks': 57,
 'miss': 58,
 'really': 59,
 'will': 60,
 'from': 61,
 'great': 62,
 'know': 63,
 'back': 64,
 'there': 65,
 'im': 66,
 'fun': 67,
 'see': 68,
 'its': 69,
 'sad': 70,
 'sorry': 71,
 'am': 72,
 'about': 73,
 'home': 74,
 'if': 75,
 'some': 76,
 'want': 77,
 'well': 78,
 'night': 79,
 'they': 80,
 'had': 81,
 'bad': 82,
 'hope': 83,

In [40]:
#number of unique words (vocabulary size) in the training corpus
len(tokenizer.word_index)

15489
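
To make the index above concrete, here is a simplified sketch of a frequency-ranked word index and the text-to-sequence lookup built on it. It is an approximation for illustration, not Keras's implementation: `tokenize`, `build_word_index`, and `texts_to_ids` are hypothetical helpers, and splitting on every non-alphanumeric character is an assumption that only roughly mirrors the `Tokenizer` default filters.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit (an assumption;
    # it reproduces splits such as "don`t" -> "don", "t" seen in the index above).
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def build_word_index(texts):
    # Count every token, then assign indices by descending frequency.
    # Indices start at 1, so 0 stays free to act as the padding value.
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    # Words that were never seen while building the index are dropped.
    return [[word_index[tok] for tok in tokenize(text) if tok in word_index]
            for text in texts]

corpus = ["so sad", "so happy today", "I`m so bored"]
word_index = build_word_index(corpus)
print(word_index["so"])                                 # 1, the most frequent word
print(texts_to_ids(["so so tired today"], word_index))  # "tired" is unseen and dropped
```

Because unseen words are dropped, a text made up entirely of out-of-vocabulary words becomes an empty sequence, which is later padded to all zeros.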

In [41]:
#convert training and test texts into sequences of word indices
sequences_train = tokenizer.texts_to_sequences(docs_train)
sequences_test=tokenizer.texts_to_sequences(docs_test)

In [44]:
docs_train

15707    exhausting day. And more to come tomorrow! Cit...
21368                   Got Six Feet Under series 1 on DVD
392      Just discovered a shortcoming of Gravity. When...
23187    I was so hype about it being Friday & it raini...
19983                                       very bad idea!
                               ...                        
24832    My class will be at Chem Sc building. Will see...
2935                                                 stuck
26768                                          i`m bored :
6619                                                happy.
24895                                           havin fun?
Name: selected_text, Length: 21984, dtype: string

In [46]:
sequences_train

[[3148,
  25,
  8,
  95,
  2,
  142,
  121,
  858,
  13,
  139,
  10,
  25,
  30,
  12,
  157,
  2192,
  3,
  704],
 [52, 1252, 916, 648, 2193, 206, 17, 917],
 [23,
  3149,
  4,
  6052,
  14,
  2594,
  102,
  5,
  377,
  320,
  6053,
  6,
  205,
  12,
  6054,
  3,
  440,
  1595,
  61,
  3,
  2194,
  1921,
  1596],
 [1,
  29,
  18,
  4015,
  73,
  6,
  216,
  229,
  6,
  509,
  453,
  32,
  339,
  34,
  25,
  21,
  69,
  4016,
  102,
  69,
  54,
  2,
  1050,
  6,
  733,
  294,
  9,
  2595],
 [122, 82, 454],
 [3,
  6055,
  28,
  2195,
  2196,
  9,
  175,
  65,
  60,
  26,
  32,
  2596,
  28,
  2195,
  2196,
  28,
  357,
  28,
  375,
  1002,
  1922,
  158,
  65],
 [67],
 [22],
 [1, 120, 216, 140],
 [18, 89, 13, 3150, 13, 1253, 11, 220],
 [4017],
 [71, 603, 48, 579],
 [106, 4018, 50],
 [39],
 [90,
  6056,
  112,
  305,
  2,
  42,
  75,
  5,
  30,
  1,
  334,
  211,
  165,
  604,
  2,
  2597,
  2,
  300,
  6,
  35,
  185,
  153,
  165,
  286,
  65],
 [33],
 [58],
 [1750,
  23,
  45,
  2,
  

In [47]:
import statistics
#median length of the training texts, measured in characters (len of a string)
statistics.median((docs_train.apply(len)).values)

22.0
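
Because `len` of a string counts characters, the value above is a character count rather than a token count. If the median number of tokens is wanted instead, one possible sketch is shown below. The sample texts are the complete rows from the `docs_train` preview above, and whitespace splitting is a simplifying assumption rather than the tokenizer's exact rule.

```python
import statistics

# Complete texts from the docs_train preview (truncated rows are left out).
samples = [
    "Got Six Feet Under series 1 on DVD",
    "very bad idea!",
    "stuck",
    "happy.",
    "havin fun?",
]

char_lengths = [len(text) for text in samples]           # what apply(len) measures
token_lengths = [len(text.split()) for text in samples]  # whitespace-separated tokens

print(statistics.median(char_lengths))   # median number of characters
print(statistics.median(token_lengths))  # median number of tokens
```

The two medians differ considerably even on this small sample, so it matters which of them is used as the padded sequence length.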

In [48]:
#pad (or truncate) every sequence to a uniform length of 22 tokens, the median length computed above
from keras.utils import pad_sequences
sequences_train = pad_sequences(sequences_train,padding='post',maxlen=22)
sequences_test = pad_sequences(sequences_test,padding='post',maxlen=22)
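
What padding and truncation do to each sequence can be sketched in plain Python. `pad_post` is a hypothetical helper written for illustration. It pads at the end, as requested by `padding='post'`, and drops tokens from the front of over-long sequences, which is the Keras default (`truncating='pre'`).

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end to `maxlen`; truncate from the front if longer."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # keep the last `maxlen` tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

short_seq = [122, 82, 454]        # "very bad idea!" from the output above
long_seq = list(range(1, 26))     # a made-up 25-token sequence
result = pad_post([short_seq, long_seq], maxlen=22)
print(result[0])   # three ids followed by 19 zeros
print(result[1])   # ids 4..25: the first three tokens were dropped
```

If the start of a long text matters more than its end, passing `truncating='post'` to `pad_sequences` keeps the first tokens instead.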

#Model Training and Validation

In [49]:
#Creating RNN model
from keras import Sequential
from keras.layers import Dense,SimpleRNN,Embedding,Flatten

In [50]:
model = Sequential()
#input_dim is vocabulary size + 1 because index 0 is reserved for padding
model.add(Embedding(15489+1,2,input_length=22))
model.add(SimpleRNN(32,return_sequences=False))
model.add(Dense(3, activation='softmax'))
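
As a rough check on model size, the trainable parameter counts of the three layers can be worked out by hand. The helper `count_params` below is hypothetical and assumes the standard layer formulas (embedding: `input_dim * output_dim`; simple RNN: `units * (input_features + units + 1)`; dense: `inputs * outputs + outputs`). The vocabulary argument is taken as 15489 + 1, on the assumption that index 0 is reserved for padding.

```python
def count_params(vocab_size, embed_dim, rnn_units, n_classes):
    """Trainable parameters per layer for Embedding -> SimpleRNN -> Dense."""
    embedding = vocab_size * embed_dim               # one vector per token id
    rnn = rnn_units * (embed_dim + rnn_units + 1)    # input weights, recurrent weights, bias
    dense = rnn_units * n_classes + n_classes        # weights plus bias
    return {"embedding": embedding, "rnn": rnn, "dense": dense,
            "total": embedding + rnn + dense}

params = count_params(vocab_size=15489 + 1, embed_dim=2, rnn_units=32, n_classes=3)
print(params)
```

Under these assumptions almost all of the parameters sit in the embedding table, even with 2-dimensional embeddings.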

In [51]:
X_train=sequences_train
X_test=sequences_test

In [52]:
Y_train=train['sentiment']
Y_test=test['sentiment']

In [53]:
Y_train=Y_train.to_numpy()
Y_test=Y_test.to_numpy()

In [54]:
model.compile(optimizer='adam', loss='SparseCategoricalCrossentropy', metrics=['acc'])
model.fit(X_train,Y_train,epochs=5,validation_data=(X_test,Y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7efc1a402560>

Training accuracy increases with every epoch, so the model is fitting the training data. Validation accuracy also increases on average across epochs, which suggests that the model is not overfitting within these 5 epochs.
We can therefore retrain the model on the full dataset for later use.
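
For reference, the loss minimised during training can be sketched in plain Python. With integer labels, sparse categorical cross-entropy is the negative log of the probability the model assigns to the true class, averaged over the batch. `sparse_categorical_crossentropy` below is an illustrative helper, and the probabilities are made-up values, not output from the notebook.

```python
import math

def sparse_categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -log(p[true class]) over a batch of integer labels."""
    losses = []
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1.0)   # clip to avoid log(0)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# Made-up softmax outputs for two examples over (negative, neutral, positive).
y_true = [0, 2]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6]]
loss = sparse_categorical_crossentropy(y_true, y_prob)
print(round(loss, 4))
```

The loss only looks at the probability of the correct class, which is why the integer labels from the encoding step can be used directly without one-hot encoding.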

#Final Model Training with whole dataset

In [73]:
from keras.utils import pad_sequences
#refit the tokenizer on all texts and rebuild the padded sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
sequences = tokenizer.texts_to_sequences(docs)
sequences = pad_sequences(sequences,padding='post',maxlen=22)
#vocabulary size; +1 is added below because index 0 is reserved for padding
voc_size=len(tokenizer.word_index)
#same architecture as before, trained from scratch
model = Sequential()
model.add(Embedding(voc_size+1,2,input_length=22))
model.add(SimpleRNN(32,return_sequences=False))
model.add(Dense(3, activation='softmax'))
#features and integer labels for the whole dataset
X=sequences
Y=df['sentiment']
Y=Y.to_numpy()
model.compile(optimizer='adam', loss='SparseCategoricalCrossentropy', metrics=['acc'])
model.fit(X,Y,epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7efc318fbd00>

We can now use the model trained on the full dataset to predict the sentiment of user-supplied text.

#User input

In [74]:
abc=[input()]

Pizza is delicious


In [75]:
seq=tokenizer.texts_to_sequences(abc)

In [76]:
seq

[[1062, 9, 890]]

In [77]:
inp=pad_sequences(seq,padding='post',maxlen=22)

In [78]:
inp

array([[1062,    9,  890,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0]],
      dtype=int32)

The input text is now a padded sequence of word indices, which is the form the model expects.

#Output

In [79]:
a=model.predict(inp)



In [80]:
#index of the class with the highest probability for the single input
value=a[0].argmax()
if value == 0:
  print("negative")
elif value == 1:
  print("neutral")
else:
  print("positive")

positive


Hence, the user input has a positive sentiment.
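
The decoding step can also be written as a lookup instead of an `if`/`elif` chain. The sketch below uses a plain Python list in place of the array returned by `model.predict`; `LABELS`, `decode_prediction`, and the probability values are hypothetical.

```python
LABELS = ["negative", "neutral", "positive"]   # position i corresponds to class id i

def decode_prediction(probs):
    """Return (label, probability) for the highest-scoring class of one example."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best], probs[best]

# Made-up softmax output for a single input text.
label, confidence = decode_prediction([0.05, 0.15, 0.80])
print(label, confidence)   # positive 0.8
```

Returning the probability alongside the label makes it possible to flag low-confidence predictions rather than reporting every result as equally certain.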