<a href="https://colab.research.google.com/github/KrishaDavda1411/Sentiment_Analysis/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis of IMDB reviews
We start by importing the necessary libraries.

In [None]:
import tensorflow as tf
import tensorflow.keras as keras
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/gdrive')  # makes Google Drive files available under /gdrive

Mounted at /gdrive


# Importing the data files
After importing the necessary libraries, we read the data files. There are two of them.


In [None]:
file1 = r'/gdrive/MyDrive/Colab Notebooks/tutorial/Sentiment_Analysis/imdb_reviews.csv'
file2 = r'/gdrive/MyDrive/Colab Notebooks/tutorial/Sentiment_Analysis/test_reviews.csv'

The first data file contains the IMDB reviews and their corresponding sentiments, each either positive or negative; we use this file as our training data. The second file has the same layout and serves as our test data.

In [None]:
imdb_reviews = pd.read_csv(file1) #training
test_reviews = pd.read_csv(file2) #testing

In [None]:
imdb_reviews.head()

Unnamed: 0,Reviews,Sentiment
0,<START this film was just brilliant casting lo...,positive
1,<START big hair big boobs bad music and a gian...,negative
2,<START this has to be one of the worst films o...,negative
3,<START the <UNK> <UNK> at storytelling the tra...,positive
4,<START worst mistake of my life br br i picked...,negative


# Preprocessing the data
We cannot pass string data to our model directly, so we need to transform it into integer format. To do this, we map each distinct word to a distinct integer, e.g. {'this': 14, 'the': 4}. We already have a file that contains the mapping from words to integers, so we load that file.


In [None]:
file3 = r'/gdrive/MyDrive/Colab Notebooks/tutorial/Sentiment_Analysis/word_indexes.csv'

The word index file contains a mapping from words to integers.

In [None]:
word_index = pd.read_csv(file3)
word_index.head(n=10)

Unnamed: 0,Words,Indexes
0,tsukino,52009
1,nunnery,52010
2,sonja,16819
3,vani,63954
4,woods,1411
5,spiders,16118
6,hanging,2348
7,woody,2292
8,trawling,52011
9,hold's,52012


Next we convert the word_index DataFrame into a Python dictionary so that we can use it to convert our reviews from string to integer format.

In [None]:
word_index = dict(zip(word_index.Words,word_index.Indexes))

In [None]:
# Special tokens. "<START" has no closing ">" because that is how it appears in the review text.
word_index["<PAD>"] = 0
word_index["<START"] = 1
word_index["<UNK>"] = 2
word_index["<UNUSED>"] = 3

In [None]:
word_index

{'tsukino': 52009,
 'nunnery': 52010,
 'sonja': 16819,
 'vani': 63954,
 'woods': 1411,
 'spiders': 16118,
 'hanging': 2348,
 'woody': 2292,
 'trawling': 52011,
 "hold's": 52012,
 'comically': 11310,
 'localized': 40833,
 'disobeying': 30571,
 "'royale": 52013,
 "harpo's": 40834,
 'canet': 52014,
 'aileen': 19316,
 'acurately': 52015,
 "diplomat's": 52016,
 'rickman': 25245,
 'arranged': 6749,
 'rumbustious': 52017,
 'familiarness': 52018,
 "spider'": 52019,
 'hahahah': 68807,
 "wood'": 52020,
 'transvestism': 40836,
 "hangin'": 34705,
 'bringing': 2341,
 'seamier': 40837,
 'wooded': 34706,
 'bravora': 52021,
 'grueling': 16820,
 'wooden': 1639,
 'wednesday': 16821,
 "'prix": 52022,
 'altagracia': 34707,
 'circuitry': 52023,
 'crotch': 11588,
 'busybody': 57769,
 "tart'n'tangy": 52024,
 'burgade': 14132,
 'thrace': 52026,
 "tom's": 11041,
 'snuggles': 52028,
 'francesco': 29117,
 'complainers': 52030,
 'templarios': 52128,
 '272': 40838,
 '273': 52031,
 'zaniacs': 52133,
 '275': 34709,


Now we define a function review_encoder that encodes a tokenized review into integer format according to the mapping in the word_index dictionary.

In [None]:
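def review_encoder(text):
  # text is a list of tokens; words missing from word_index fall back to <UNK>
  # instead of raising a KeyError
  unk = word_index["<UNK>"]
  return [word_index.get(word, unk) for word in text]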

We split the reviews from their corresponding sentiments so that we can preprocess the reviews and sentiments separately and later pass them to our model.

In [None]:
train_data,train_lables = imdb_reviews['Reviews'],imdb_reviews['Sentiment']
test_data,test_lables = test_reviews['Reviews'],test_reviews['Sentiment']

Before transforming the reviews into integers, we need to tokenize each review, i.e. split it on whitespace.
For example, the string "The movie was wonderful" becomes ["The", "movie", "was", "wonderful"].
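
The two steps, splitting on whitespace and looking each token up in the dictionary, can be sketched on a toy example. The dictionary `toy_word_index`, the helper `encode_tokens`, and the integer ids below are made up for illustration and are not the contents of the real word index file; falling back to `<UNK>` for unseen words is an assumption of this sketch.

```python
# Toy word-to-integer mapping (illustrative values only).
toy_word_index = {
    "<PAD>": 0, "<START": 1, "<UNK>": 2,
    "the": 4, "movie": 20, "was": 16, "wonderful": 389,
}

def encode_tokens(tokens, index):
    """Map each token to its integer id, using <UNK> for words not in the index."""
    unk = index["<UNK>"]
    return [index.get(token, unk) for token in tokens]

review = "<START the movie was wonderful zzzunseen"
tokens = review.split()                      # whitespace tokenization
encoded = encode_tokens(tokens, toy_word_index)

print(tokens)
print(encoded)
```

Note that `str.split()` is case-sensitive and does not strip punctuation, so this only works because the review text is already lowercased and cleaned.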

In [None]:
train_data = train_data.apply(lambda review:review.split())
test_data =  test_data.apply(lambda review:review.split())

In [None]:
test_data

0        [<START, please, give, this, one, a, miss, br,...
1        [<START, this, film, requires, a, lot, of, pat...
2        [<START, many, animation, buffs, consider, <UN...
3        [<START, i, generally, love, this, type, of, m...
4        [<START, like, some, other, people, wrote, i'm...
                               ...                        
24995    [<START, the, book, is, better, than, the, fil...
24996    [<START, the, largest, crowd, to, ever, see, a...
24997    [<START, i, suppose, that, to, say, this, is, ...
24998    [<START, in, love, 2, is, the, third, movie, i...
24999    [<START, a, good, ol', boy, film, is, almost, ...
Name: Reviews, Length: 25000, dtype: object

Now that the reviews are tokenized, we can apply the review_encoder function to each review and transform the reviews into integer format.

In [None]:
train_data = train_data.apply(review_encoder)
test_data = test_data.apply(review_encoder)

After transforming, our reviews are going to look like this.

In [None]:
train_data

0        [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, ...
1        [1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463,...
2        [1, 14, 47, 8, 30, 31, 7, 4, 249, 108, 7, 4, 5...
3        [1, 4, 2, 2, 33, 2804, 4, 2040, 432, 111, 153,...
4        [1, 249, 1323, 7, 61, 113, 10, 10, 13, 1637, 1...
                               ...                        
24995    [1, 14, 9, 6, 2758, 20, 21, 1517, 7, 2078, 5, ...
24996    [1, 4679, 2784, 299, 6, 1042, 37, 80, 81, 233,...
24997    [1, 11, 6, 230, 245, 6401, 9, 6, 1225, 446, 2,...
24998    [1, 1446, 7079, 69, 72, 3305, 13, 610, 930, 8,...
24999    [1, 17, 6, 194, 337, 7, 4, 204, 22, 45, 254, 8...
Name: Reviews, Length: 25000, dtype: object
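
As a sanity check on the encoding, the integers can be mapped back to words by inverting the dictionary. This is a minimal sketch on a toy mapping; `reverse_index` and `decode_review` are hypothetical names that do not appear in the notebook.

```python
# Toy mapping; in the notebook this would be the full word_index dictionary.
toy_word_index = {"<PAD>": 0, "<START": 1, "<UNK>": 2, "this": 14, "film": 22, "was": 16}

# Invert the mapping: integer id -> word.
reverse_index = {idx: word for word, idx in toy_word_index.items()}

def decode_review(ids, reverse):
    """Turn a list of integer ids back into a space-separated string."""
    return " ".join(reverse.get(i, "<UNK>") for i in ids)

decoded = decode_review([1, 14, 22, 16, 2], reverse_index)
print(decoded)  # <START this film was <UNK>
```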

We also need to encode the sentiments: we label positive sentiment as 1 and negative sentiment as 0.

In [None]:
def encode_sentiments(sentiment):
  if sentiment=='positive':
    return 1
  else:
    return 0

train_lables = train_lables.apply(encode_sentiments)
test_lables = test_lables.apply(encode_sentiments)


Before giving the reviews as input to the model, we need to perform the following preprocessing steps:

*   All reviews must have the same length for the model to work correctly.
*   We have chosen a length of 500 for each review.
*   If a review is longer than 500 words, we cut off the extra part.
*   If a review contains fewer than 500 words, we pad it with zeros to increase its length to 500.
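
The padding rule can be sketched in plain Python. The helper `pad_or_truncate` is a hypothetical stand-in, not the Keras implementation: it assumes zeros are appended at the end (`padding='post'`) and that reviews that are too long lose words from the beginning, which is what `pad_sequences` does by default (`truncating='pre'`).

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Return a list of exactly maxlen items (maxlen > 0 assumed).

    Short sequences get pad_value appended; long ones keep their last maxlen items.
    """
    if len(seq) > maxlen:
        return seq[-maxlen:]                          # drop items from the front
    return seq + [pad_value] * (maxlen - len(seq))    # append padding at the end

short = pad_or_truncate([1, 14, 22], maxlen=5)
long_ = pad_or_truncate([1, 14, 22, 16, 43, 530, 973], maxlen=5)

print(short)  # [1, 14, 22, 0, 0]
print(long_)  # [22, 16, 43, 530, 973]
```

To keep the start of a long review instead of its end, `pad_sequences` accepts `truncating='post'`.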




In [None]:
train_data=keras.preprocessing.sequence.pad_sequences(train_data,value=word_index["<PAD>"],padding='post',maxlen=500)
test_data=keras.preprocessing.sequence.pad_sequences(test_data,value=word_index["<PAD>"],padding='post',maxlen=500)

Preprocessing is now complete.

# Building the model
Our model is a neural network that consists of the following layers:

1.   A word embedding layer, which creates an embedding of length 16 for each word of the integer-encoded review.
2.   A global average pooling layer, which averages the word embeddings over the whole review to give a single vector of length 16. It has no trainable parameters, which keeps the model small and helps limit overfitting.
3.   A dense layer with 16 hidden units and ReLU as its activation function.
4.   The output layer, a single unit with sigmoid as its activation function.
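
To make the data flow through these four layers concrete, here is a rough NumPy sketch of the forward pass. It is not the Keras model: the weights are random stand-ins for trained parameters, the sizes are shrunk to toy values, and the names (`embedding`, `w1`, `forward`, and so on) are chosen for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the notebook uses a vocabulary of 10000, embeddings of 16, length 500.
vocab_size, embed_dim, seq_len, hidden = 50, 16, 8, 16

# Random parameters standing in for trained weights.
embedding = rng.normal(size=(vocab_size, embed_dim))
w1, b1 = rng.normal(size=(embed_dim, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)

def forward(batch_ids):
    """batch_ids: integer array of shape (batch, seq_len)."""
    x = embedding[batch_ids]              # lookup   -> (batch, seq_len, embed_dim)
    x = x.mean(axis=1)                    # pooling  -> (batch, embed_dim)
    x = np.maximum(0.0, x @ w1 + b1)      # ReLU     -> (batch, hidden)
    logits = x @ w2 + b2                  # output   -> (batch, 1)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid  -> values between 0 and 1

batch = rng.integers(0, vocab_size, size=(4, seq_len))
probs = forward(batch)
print(probs.shape)  # (4, 1)
```

The mean here runs over every position, including padded ones, which matches an embedding layer used without masking.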




In [None]:
model = keras.Sequential([keras.layers.Embedding(10000,16,input_length=500),
                         keras.layers.GlobalAveragePooling1D(),
                         keras.layers.Dense(16,activation='relu'),
                         keras.layers.Dense(1,activation='sigmoid')])

# ReLU (rectified linear unit) returns max(0, x). It is the activation of the hidden
# dense layer and adds non-linearity to the model.
# Sigmoid squashes the output of the final layer into the range (0, 1), so it can be
# read as the probability that a review is positive.

# Compiling the model

1.   Adam is used as the optimizer for our model.
2.   Binary cross-entropy is used as the loss function.
3.   Accuracy is used as the metric for evaluating the model.
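
For intuition about what the loss and the metric measure, both can be computed by hand on a few predictions. This is a plain-Python sketch with made-up labels and probabilities; the function names and the clipping constant `eps` are assumptions of the example, not Keras internals.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions whose thresholded value equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]          # made-up ground truth
probs = [0.9, 0.2, 0.4, 0.6]   # made-up model outputs

loss = binary_cross_entropy(labels, probs)
acc = accuracy(labels, probs)
print(round(loss, 3), acc)  # 0.54 0.5
```

Confident correct predictions (0.9 for a positive review) contribute little to the loss, while predictions on the wrong side of 0.5 contribute much more.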





In [None]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In the next step we are going to train the model on our downloaded IMDB dataset.

In [None]:
history =  model.fit(train_data,train_lables,epochs=30,batch_size=512,validation_data=(test_data,test_lables))

Epoch 1/30
...
Epoch 30/30


Now we will be evaluating the loss and accuracy of our model on testing data.

In [None]:
loss, accuracy = model.evaluate(test_data,test_lables)



Our model reaches an accuracy of 88.56% on the test data.

In [None]:
test_reviews.head()

Unnamed: 0,Reviews,Sentiment
0,<START please give this one a miss br br <UNK>...,negative
1,<START this film requires a lot of patience be...,positive
2,<START many animation buffs consider <UNK> <UN...,positive
3,<START i generally love this type of movie how...,negative
4,<START like some other people wrote i'm a die ...,positive


Now we take a random review from our test dataset and check whether our model produces the correct output.

In [None]:
index = np.random.randint(1,1000)

In [None]:
user_review = test_reviews.loc[index]
print(user_review.Reviews)

<START i acquired this film a couple of years ago and on trying to find some info about it i found that even the mighty imdb didn't have it listed that should have been all i needed to know br br with friends like these is an anthology that plays like a collection of second rate twilight zone outer limits episodes all linked together by a bus journey that never really seems to tie in with the rest of the film of the three stories the only one that i <UNK> any entertainment value from was the second episode in which a man of sorts grows out of the <UNK> in a guys <UNK> this episode wins points for a few spots of humour and it's bizarre premise other than that there is an episode with a talking car bland and <UNK> and an episode where a girl visits a very unique dating agency my dog guessed the ending of this one br br as has been mentioned in other comments the 18 rating is entirely <UNK> there is nothing to offend here if you're after a good horror anthology check out asylum or the <UN

The true sentiment of this review is stored in user_review.Sentiment. We now take the integer format of this particular review, which we already have in our preprocessed test data, and give it as input to our model to check its prediction.

In [None]:
user_review = test_data[index]
user_review = np.array([user_review])  # add a batch dimension: shape (1, 500)
probability = model.predict(user_review)[0][0]  # sigmoid output between 0 and 1
if probability > 0.5:
  print("positive sentiment")
else:
  print("negative sentiment")

negative sentiment


As we can see, our model can now predict the sentiment of a review.
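
The same steps could be wrapped into one helper that takes a raw review string. The sketch below is hypothetical: `predict_sentiment` and `fake_predict` are not part of the notebook, the toy dictionary stands in for `word_index`, and the stub replaces the trained model so that the example runs on its own. Lowercasing the input is an assumption based on the review text being lowercase.

```python
def predict_sentiment(text, word_index, predict_fn, maxlen=500):
    """Tokenize, encode, pad, and classify one raw review string.

    predict_fn is any callable that maps a list of padded id lists
    to a list of probabilities (one per review).
    """
    unk, pad = word_index["<UNK>"], word_index["<PAD>"]
    ids = [word_index["<START"]]
    ids += [word_index.get(word, unk) for word in text.lower().split()]
    ids = ids[-maxlen:] + [pad] * max(0, maxlen - len(ids))  # truncate front, pad end
    prob = predict_fn([ids])[0]
    label = "positive" if prob > 0.5 else "negative"
    return label, prob

# Stand-in for a trained model: "positive" whenever the id for "wonderful" is present.
def fake_predict(batch):
    return [0.9 if 389 in ids else 0.1 for ids in batch]

toy_index = {"<PAD>": 0, "<START": 1, "<UNK>": 2,
             "the": 4, "movie": 20, "was": 16, "wonderful": 389}

label, prob = predict_sentiment("The movie was wonderful", toy_index, fake_predict, maxlen=10)
print(label, prob)  # positive 0.9
```

With the trained Keras model, `predict_fn` could be something like `lambda batch: model.predict(np.array(batch)).ravel()`.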