## IMDB Sentiment Analysis

First, import all the required modules.

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
df = pd.read_csv('/content/labeledTrainData.tsv', delimiter="\t",quoting=3)

In [4]:
df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


Tokenization of the data

First we tokenize each review into words, then remove stopwords to drop the words that contribute little to the meaning of a sentence. Finally we lemmatize the remaining words so that different inflections of a word are reduced to a common base form.

BeautifulSoup is used to strip the HTML tags from the reviews, which is a standard text preprocessing step. WordNetLemmatizer is used to lemmatize the words.

In [5]:
lemmatizer = WordNetLemmatizer()

In [6]:
# build the stopword set once instead of on every call
swords = set(stopwords.words("english"))

def process(review):
    # strip HTML tags; naming the parser keeps results consistent across environments
    review = BeautifulSoup(review, "html.parser").get_text()
    # replace punctuation and digits with spaces
    review = re.sub("[^a-zA-Z]", ' ', review)
    # lowercase and split into words so that stopwords can be removed
    review = review.lower().split()
    review = [lemmatizer.lemmatize(w) for w in review if w not in swords]
    # join the remaining words back into a single string before returning
    return " ".join(review)
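
To see what the regex, lowercase and split steps do on their own, here is a minimal standalone sketch. The helper name `clean_tokens` and the sample sentence are made up for illustration; stopword removal and lemmatization are left out so that the example needs only the standard library.

```python
import re

def clean_tokens(text):
    """Keep letters only, lowercase, and split on whitespace."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only.lower().split()

tokens = clean_tokens("It's a 10/10 film, isn't it?")
print(tokens)
```

Note that digits disappear entirely and contractions are split at the apostrophe, so fragments such as `isn` and `t` become separate tokens.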

In [7]:
# We clean our training data with the help of the above function:
# We can see the status of the review process by printing a line after every 1000 reviews.

train_x_tum = []
for r in range(len(df["review"])):
    if (r+1)%1000 == 0:
        print("No of reviews processed =", r+1)
    train_x_tum.append(process(df["review"][r]))


No of reviews processed = 1000
No of reviews processed = 2000
No of reviews processed = 3000
No of reviews processed = 4000
No of reviews processed = 5000
No of reviews processed = 6000
No of reviews processed = 7000
No of reviews processed = 8000
No of reviews processed = 9000
No of reviews processed = 10000
No of reviews processed = 11000
No of reviews processed = 12000
No of reviews processed = 13000
No of reviews processed = 14000
No of reviews processed = 15000
No of reviews processed = 16000
No of reviews processed = 17000
No of reviews processed = 18000
No of reviews processed = 19000
No of reviews processed = 20000
No of reviews processed = 21000
No of reviews processed = 22000
No of reviews processed = 23000
No of reviews processed = 24000
No of reviews processed = 25000


Converting words to vectors

In [8]:
!pip install tensorflow



In [9]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

In [10]:
vol_size = 10000  # Vocabulary size: every word is mapped to an integer index below this value. A larger value means fewer index collisions but a larger embedding table.

In [11]:
onehot_repr = [one_hot(words,vol_size) for words in train_x_tum]
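
Despite its name, `one_hot` does not produce one-hot vectors: it hashes each word to an integer index, so two different words can collide on the same index. The sketch below imitates that idea with an MD5-based hash and the standard library only. The function name `hashed_indices` is hypothetical, and the sketch skips the punctuation filtering and lowercasing that Keras applies. By default Keras relies on Python's built-in `hash`, which for strings is salted per process, so the indices it returns can differ between runs unless the hash seed is fixed.

```python
import hashlib

def hashed_indices(text, n):
    """Map each whitespace-separated word to an integer in [1, n - 1]."""
    def stable_hash(word):
        # MD5 gives the same value in every run, unlike the built-in hash()
        return int(hashlib.md5(word.encode()).hexdigest(), 16)
    return [stable_hash(w) % (n - 1) + 1 for w in text.split()]

ids = hashed_indices("great movie great cast", 10000)
print(ids)
```

Index 0 is never produced, which leaves it free to act as the padding value in the next step.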

In [12]:
embedded_docs = pad_sequences(onehot_repr,padding='pre',maxlen=200)
print(embedded_docs)

[[8635 8904 6999 ... 5651   70 8407]
 [   0    0    0 ... 1879 5670  472]
 [6231 2556 3682 ... 2558 7132 5441]
 ...
 [   0    0    0 ... 6999 6590 5701]
 [   0    0    0 ...  774 8916 5000]
 [   0    0    0 ... 5268 9696 3731]]
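
Rows that begin with zeros come from reviews shorter than 200 tokens, which are padded on the left. Rows without leading zeros come from reviews that were at least 200 tokens long; with the default `truncating='pre'`, only their last 200 tokens are kept. A minimal pure-Python sketch of that behaviour for a single sequence (the helper name `pad_pre` is made up for illustration):

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([7, 8, 9], 5))              # short sequence: padded on the left
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # long sequence: front is cut off
```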


Model Building

The LSTM model is built with the Keras API of the TensorFlow library.

In [13]:
model = Sequential()
model.add(Embedding(vol_size,40,input_length=200))
model.add(LSTM(100))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='mean_squared_error',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 40)           400000    
                                                                 
 lstm (LSTM)                 (None, 100)               56400     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 456,501
Trainable params: 456,501
Non-trainable params: 0
_________________________________________________________________
None
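
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layers: an embedding table has one vector per vocabulary index, an LSTM has four gates that each own an input kernel, a recurrent kernel and a bias, and a dense layer has one weight per input plus a bias. The variable names are chosen for this example only.

```python
vocab_size, embed_dim, lstm_units = 10000, 40, 100

# one 40-dimensional vector per vocabulary index
embedding_params = vocab_size * embed_dim

# 4 gates, each with (input + recurrent) weights and a bias per unit
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```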


In [14]:
X_final = np.array(embedded_docs)
y_final = np.array(df['sentiment'])

In [15]:
X_final.shape,y_final.shape

((25000, 200), (25000,))

Train Test Split

In [16]:
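# stop early if the training accuracy has not improved for 5 consecutive epochs
callback= EarlyStopping(monitor='accuracy',mode='max',verbose=1,patience=5)
# train_size=0.4 keeps 40% of the reviews for training and holds out the other 60%
X_train,X_test,y_train,y_test = train_test_split(X_final,y_final, train_size=0.4)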

In [17]:
model.fit(X_train,y_train, validation_data=(X_test,y_test),epochs=10,batch_size=8,callbacks=[callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f2ff019f4c0>

In [18]:
# roc_auc_score returns the area under the ROC curve, not the classification accuracy
test_predict = model.predict(X_test)
auc = roc_auc_score(y_test, test_predict)



In [19]:
print("ROC AUC (%):", auc * 100)

ROC AUC (%): 87.67729829603441
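
The value returned by `roc_auc_score` is the area under the ROC curve, a ranking metric, and it can differ noticeably from classification accuracy. The toy sketch below computes both on made-up labels and scores using only the standard library. The helper names are hypothetical, and the pairwise AUC formula is quadratic in the number of samples, so it is meant for illustration rather than for the full test set.

```python
def accuracy_at_threshold(y_true, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.1]
print("accuracy:", accuracy_at_threshold(y_true, scores))
print("ROC AUC:", roc_auc(y_true, scores))
```

Here every positive is ranked above every negative, so the AUC is perfect, yet one positive falls below the 0.5 threshold and the accuracy is lower.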


## The Conclusion

In this project, I implemented an LSTM model for sentiment analysis of IMDB reviews. I first stripped the HTML tags and tokenized the reviews, then removed stopwords and lemmatized the remaining words to reduce them to their base forms. The cleaned text was converted into integer sequences with the Keras `one_hot` function, which hashes each word to an index, and the sequences were padded to a fixed length of 200. These sequences were fed to the LSTM model, which reached a ROC AUC of about 87.68% on the held-out data.