<a href="https://colab.research.google.com/github/Mrigakshi24-ux/Sentiment-Analysis-using-LSTM/blob/main/Sentiment_with_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing the libraries

In [50]:
from string import punctuation as pt
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import Counter
import pandas as pd
import numpy as np
from keras import Sequential
from keras.layers import LSTM, Dense, Flatten, Dropout, Embedding   # Embedding is exported from keras.layers
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Steps:
1.  Read the data.
2.  Clean the text.
3.  Build a vocabulary (a dictionary mapping each word to an index).
4.  Split the reviews into tokens.
5.  Convert the tokens into integer vectors.
6.  Encode the labels.
7.  Remove reviews with length zero.
8.  Pad or truncate every review to a fixed length.
9.  Split into training and testing sets.
10. Build the LSTM model.
11. Train and evaluate the model.
12. Measure the accuracy of the model.
13. Test the model on random data.

### **Reading the data**

In [51]:
with open ('/content/drive/MyDrive/Datasets/reviews.txt', 'r') as f:
  reviews = f.read()
with open ('/content/drive/MyDrive/Datasets/labels.txt', 'r') as f:
  labels = f.read()

In [52]:
reviews[:300]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the ins'

In [53]:
labels[:10]

'positive\nn'

In [54]:
# split the text into individual reviews and count them
reviews = reviews.split('\n')
len(reviews)

25001

In [55]:
reviews[:10]

['bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   ',
 'story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience i

### **Cleaning**

1. Converting to lower case
2. Removing the stopwords
3. Removing the punctuation

In [56]:
# function to preprocess the dataset
def clean(l):
  new_reviews = []
  stop = set(stopwords.words('english'))                             # a set gives fast membership tests
  for i in l:
    i = i.lower()                                                    # converting to lower case
    i = ' '.join([c for c in i.split() if c not in stop])            # removing stopwords
    i = ' '.join([c for c in i.split() if c not in pt])              # removing tokens that are punctuation marks
    new_reviews.append(i)
  return new_reviews
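
To see what `clean` does without downloading the NLTK corpus, here is a minimal sketch of the same three steps on a toy sentence. The stopword set `TOY_STOPWORDS` is a small hand-picked assumption, not the NLTK list, and `clean_toy` is a hypothetical name used only for this illustration.

```python
from string import punctuation

# Assumption: a tiny hand-picked stopword set standing in for stopwords.words('english')
TOY_STOPWORDS = {"is", "a", "it", "at", "the", "as", "i", "am"}

def clean_toy(texts):
    """Lower-case each text, then drop stopwords and stand-alone punctuation tokens."""
    cleaned = []
    for text in texts:
        tokens = text.lower().split()
        tokens = [t for t in tokens if t not in TOY_STOPWORDS]
        tokens = [t for t in tokens if t not in punctuation]
        cleaned.append(" ".join(tokens))
    return cleaned

print(clean_toy(["Bromwell High is a cartoon comedy . It ran at the same time"]))
# ['bromwell high cartoon comedy ran same time']
```

The punctuation test only removes tokens that are a punctuation mark on their own, such as ` . `. A mark attached to a word (`good.`) would survive, which is acceptable here because the reviews already have their punctuation separated by spaces.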

In [57]:
reviews_1 = clean(reviews)

In [58]:
reviews_1[:10]

['bromwell high cartoon comedy ran time programs school life teachers years teaching profession lead believe bromwell high satire much closer reality teachers scramble survive financially insightful students see right pathetic teachers pomp pettiness whole situation remind schools knew students saw episode student repeatedly tried burn school immediately recalled high classic line inspector sack one teachers student welcome bromwell high expect many adults age think bromwell high far fetched pity',
 'story man unnatural feelings pig starts opening scene terrific example absurd comedy formal orchestra audience turned insane violent mob crazy chantings singers unfortunately stays absurd whole time general narrative eventually making putting even era turned cryptic dialogue would make shakespeare seem easy third grader technical level better might think good cinematography future great vilmos zsigmond future stars sally kirkland frederic forrest seen briefly',
 'homelessness houselessness

### **Vocabulary**

In [59]:
#Count frequency of words
x = ' '.join(reviews_1)
count = Counter(x.split())

In [60]:
# count is a Counter (a dictionary); it is converted to a list here only to preview the output
list(count.items())[:10]

[('bromwell', 8),
 ('high', 2161),
 ('cartoon', 545),
 ('comedy', 3246),
 ('ran', 238),
 ('time', 12724),
 ('programs', 66),
 ('school', 1659),
 ('life', 6628),
 ('teachers', 77)]

In [61]:
# indices start from 1 because 0 is reserved for padding
vocabulary = {word: idx for idx, word in enumerate(count, start=1)}
list(vocabulary.items())[:10]

[('bromwell', 1),
 ('high', 2),
 ('cartoon', 3),
 ('comedy', 4),
 ('ran', 5),
 ('time', 6),
 ('programs', 7),
 ('school', 8),
 ('life', 9),
 ('teachers', 10)]

In [62]:
len(vocabulary)

73919
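
The vocabulary above numbers words in order of first appearance. A common alternative, sketched here on toy data, is to number them by descending frequency so that frequent words get small indices, which makes it easy to cap the vocabulary size later. `build_vocab` and `max_words` are hypothetical names, not part of this notebook.

```python
from collections import Counter

def build_vocab(texts, max_words=None):
    """Map each word to an index starting at 1 (0 stays free for padding),
    most frequent words first."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common sorts by count, highest first
    ranked = counts.most_common(max_words)
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

toy = ["good film good cast", "bad film bad good"]
print(build_vocab(toy))               # {'good': 1, 'film': 2, 'bad': 3, 'cast': 4}
print(build_vocab(toy, max_words=1))  # {'good': 1}
```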

In [63]:
reviews_1[:3]

['bromwell high cartoon comedy ran time programs school life teachers years teaching profession lead believe bromwell high satire much closer reality teachers scramble survive financially insightful students see right pathetic teachers pomp pettiness whole situation remind schools knew students saw episode student repeatedly tried burn school immediately recalled high classic line inspector sack one teachers student welcome bromwell high expect many adults age think bromwell high far fetched pity',
 'story man unnatural feelings pig starts opening scene terrific example absurd comedy formal orchestra audience turned insane violent mob crazy chantings singers unfortunately stays absurd whole time general narrative eventually making putting even era turned cryptic dialogue would make shakespeare seem easy third grader technical level better might think good cinematography future great vilmos zsigmond future stars sally kirkland frederic forrest seen briefly',
 'homelessness houselessness

### **Splitting into tokens**

In [64]:
# Splitting sentence into tokens
reviews_2 = []
for i in reviews_1:
  reviews_2.append(i.split())
print(reviews_2[:2])

[['bromwell', 'high', 'cartoon', 'comedy', 'ran', 'time', 'programs', 'school', 'life', 'teachers', 'years', 'teaching', 'profession', 'lead', 'believe', 'bromwell', 'high', 'satire', 'much', 'closer', 'reality', 'teachers', 'scramble', 'survive', 'financially', 'insightful', 'students', 'see', 'right', 'pathetic', 'teachers', 'pomp', 'pettiness', 'whole', 'situation', 'remind', 'schools', 'knew', 'students', 'saw', 'episode', 'student', 'repeatedly', 'tried', 'burn', 'school', 'immediately', 'recalled', 'high', 'classic', 'line', 'inspector', 'sack', 'one', 'teachers', 'student', 'welcome', 'bromwell', 'high', 'expect', 'many', 'adults', 'age', 'think', 'bromwell', 'high', 'far', 'fetched', 'pity'], ['story', 'man', 'unnatural', 'feelings', 'pig', 'starts', 'opening', 'scene', 'terrific', 'example', 'absurd', 'comedy', 'formal', 'orchestra', 'audience', 'turned', 'insane', 'violent', 'mob', 'crazy', 'chantings', 'singers', 'unfortunately', 'stays', 'absurd', 'whole', 'time', 'general'

### **Convert words to vectors**

In [65]:
# convert into vectors
review_3 = []
for i in reviews_2:
  x = []
  for j in i:
    x.append(vocabulary[j])
  review_3.append(x)
print(review_3[:3])

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2, 16, 17, 18, 19, 10, 20, 21, 22, 23, 24, 25, 26, 27, 10, 28, 29, 30, 31, 32, 33, 34, 24, 35, 36, 37, 38, 39, 40, 8, 41, 42, 2, 43, 44, 45, 46, 47, 10, 37, 48, 1, 2, 49, 50, 51, 52, 53, 1, 2, 54, 55, 56], [57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 4, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 67, 30, 6, 80, 81, 82, 83, 84, 85, 86, 71, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 53, 100, 101, 102, 103, 104, 105, 102, 106, 107, 108, 109, 110, 111, 112], [113, 114, 115, 116, 117, 118, 11, 119, 120, 121, 122, 123, 124, 125, 126, 8, 127, 128, 129, 130, 53, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 134, 145, 146, 147, 148, 148, 149, 150, 151, 147, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 125, 162, 25, 163, 131, 164, 165, 166, 148, 148, 167, 168, 169, 106, 165, 170, 171, 58, 125, 172, 173, 90, 150, 174, 175, 176, 177, 25, 151, 147, 178, 179, 153, 154, 165, 180, 181, 102, 182, 83, 183, 
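
The lookup `vocabulary[j]` works here because the vocabulary was built from these same reviews, but it raises `KeyError` for any word that first appears at prediction time. A minimal sketch of a more forgiving encoder follows, assuming it is acceptable for unknown words to share index 0 with padding. `encode` and `unknown_index` are hypothetical names.

```python
def encode(tokens, vocabulary, unknown_index=0):
    """Replace each token by its vocabulary index; unseen tokens get unknown_index."""
    return [vocabulary.get(token, unknown_index) for token in tokens]

toy_vocabulary = {"bromwell": 1, "high": 2, "cartoon": 3}
print(encode(["bromwell", "high", "school"], toy_vocabulary))  # [1, 2, 0]
```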

### **Encode the labels**

In [66]:
# Encode Labels
labels_1 = labels.split('\n')
labels_2 = ['1' if i == 'positive' else '0' for i in labels_1]      # 1-positive, 0-negative
labels_2[:3]

['1', '0', '1']

### **Removing strings with length zero**

In [67]:
lengths = []                 # named lengths so the built-in sum() is not shadowed
for i in review_3:
  lengths.append(len(i))
print(max(lengths))
print(min(lengths))          # this shows that the longest review has 1442 words and the shortest has 0

1442
0


In [68]:
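# keep only the non-empty reviews and their corresponding labels
# (popping inside a loop over range(len(...)) shifts later items and can raise IndexError)
kept = [(r, l) for r, l in zip(review_3, labels_2) if len(r) > 0]
review_3 = [r for r, l in kept]
labels_2 = [l for r, l in kept]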

In [69]:
lengths = []
for i in review_3:
  lengths.append(len(i))
print(min(lengths))              # so reviews with length 0 are removed.

4


In [70]:
print(len(review_3), len(labels_2))

25000 25000


### **Restricting sentences to a length of 300 and padding them**

In [71]:
# Truncating reviews longer than seq_length and padding shorter ones

seq_length = 300

for idx, i in enumerate(review_3):                  # enumerate avoids the slow review_3.index(i) lookup
  if len(i)>seq_length:
    review_3[idx] = i[:seq_length]                  # keep the first seq_length tokens
  else:
    z = seq_length - len(i)
    review_3[idx] = [0]*z+ i                        # left-pad with zeros

In [72]:
lengths = []
for i in review_3:
  lengths.append(len(i))
print(max(lengths))
print(min(lengths))   # this means all the reviews are of the same length i.e. 300 now.

300
300
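
On a toy scale the effect of this step is easier to see. A small sketch, with `pad_or_truncate` as a hypothetical helper that is not part of the notebook:

```python
def pad_or_truncate(sequence, length, pad_value=0):
    """Return a list of exactly `length` items: keep the first `length`
    items of a long sequence, left-pad a short one with pad_value."""
    if len(sequence) > length:
        return sequence[:length]
    return [pad_value] * (length - len(sequence)) + sequence

print(pad_or_truncate([7, 8, 9], 5))           # [0, 0, 7, 8, 9]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```

Left-padding keeps the real tokens at the end of the sequence, closest to the final LSTM state that feeds the classifier.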


In [73]:
X = pd.DataFrame(review_3).astype('int32')
y = pd.DataFrame(labels_2).astype('int32')

### **Splitting the data into training and testing dataset**

In [74]:
xtrain, xtest, ytrain, ytest = train_test_split(X , y, test_size = 0.25, random_state = 35)

In [75]:
xtrain.shape

(18750, 300)

In [76]:
xtrain.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299
21350,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,53,584,135,9,3060,1472,66848,127,1082,367,12397,2236,39875,666,2802,103,737,1185,2811,303,135,45630,3123,244,69139,345,26064,98,3486,2748,725,2481,47,3259,3930,43,11,575,658,3366
23930,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,9697,1399,1358,8493,25856,11383,188,1220,1583,673,2257,3352,3519,1728,953,9418,45662,4985,991,3582,148,148,92,4446,460,5259,18,384,57,262,1415,5654,1169,11,524,1028,10131,65,9697,43
20756,1677,16197,460,98,235,736,16067,835,646,523,574,610,464,3010,3011,872,6818,148,148,3096,518,610,198,2227,12477,7965,23041,30,673,461,12610,3450,542,3,198,13819,28824,460,98,615,...,4335,574,64,260,44218,6793,45307,481,2827,1677,7965,453,247,23973,802,6793,2044,28824,5204,1278,5023,2794,508,3229,450,33887,4569,148,148,17,163,260,7965,6387,4280,658,98,1251,1708,253
8904,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,858,260,237,130,3186,18849,100,2587,6802,6131,16320,1329,15771,64,15211,770,3388,1924,2559,939,2187,2513,130,172,3186,18849,433,869,148,148,1107,260,198,1305,2455,2631,260,100,2587,11176
20546,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,4712,2324,595,62,541,508,507,8208,11121,356,1729,4069,11337,686,2938,1548,27932,1754,29212,965,516,339,688,1119,4585,2509,541,1773,479,163,1215,58,2492,47,1220,376,725,904,1386,5497


In [77]:
xtest.shape

(6250, 300)
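
`train_test_split` shuffles the rows and holds out 25% of them, which gives the 18750 / 6250 split seen above. A rough pure-Python sketch of the idea follows. It is not scikit-learn's implementation and will not reproduce the same row order; `simple_split` and `take` are hypothetical names.

```python
import math
import random

def take(rows, idx):
    """Select the rows at the given positions."""
    return [rows[i] for i in idx]

def simple_split(features, labels, test_size=0.25, seed=35):
    """Shuffle row indices once, then cut them into a test part and a train part."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)          # seeded, so the split is reproducible
    n_test = math.ceil(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (take(features, train_idx), take(features, test_idx),
            take(labels, train_idx), take(labels, test_idx))

rows = [[i, i + 1] for i in range(8)]
tags = [i % 2 for i in range(8)]
xtr, xte, ytr, yte = simple_split(rows, tags)
print(len(xtr), len(xte))  # 6 2
```

Because features and labels are indexed with the same shuffled positions, every review stays paired with its own label.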

## **LSTM Network**

In [78]:
# defining hyperparameters
embedding_vector_length = 100
max_review_length = 300
model = Sequential()
# 80000 embedding rows: this must exceed the largest word index (the vocabulary has 73919 words, plus 0 for padding)
model.add(Embedding(80000, embedding_vector_length, input_length = max_review_length ))
model.add(Dropout(0.2))
model.add(LSTM(100))                                               # LSTM layer with 100 units
model.add(Dropout(0.2))
# model.add(Flatten())
model.add(Dense(1, activation = 'sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          8000000   
_________________________________________________________________
dropout (Dropout)            (None, 300, 100)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 8,080,501
Trainable params: 8,080,501
Non-trainable params: 0
_________________________________________________________________
None
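
The parameter counts in the summary can be checked by hand. The embedding holds one 100-dimensional vector per index; an LSTM layer has four gates, each with input weights, recurrent weights, and a bias; the dense layer has one weight per LSTM unit plus a bias. A short check (the variable names are only for this illustration):

```python
vocab_slots = 80000     # first argument to Embedding
embed_dim = 100         # embedding_vector_length
lstm_units = 100        # argument to LSTM

embedding_params = vocab_slots * embed_dim
# four gates, each with (inputs + recurrent inputs) * units weights and units biases
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)      # 8000000 80400 101
print(embedding_params + lstm_params + dense_params)    # 8080501
```

Almost all of the parameters sit in the embedding table.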


In [79]:
model.fit(xtrain, ytrain, epochs = 10, batch_size = 32, verbose = 5)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f406bf3dc50>

### **Evaluating the model**

In [80]:
scores = model.evaluate(xtest,ytest, verbose = 5)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 84.64%


In [81]:
pred = model.predict(xtest)
mean_squared_error(ytest, pred.flatten())

0.13207549703679747
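
Accuracy thresholds each predicted probability at 0.5, while the mean squared error above compares the raw probabilities with the 0/1 labels, so it also penalises predictions that are on the correct side of 0.5 but not confident. A toy illustration with made-up numbers (`accuracy` and `mse` are hypothetical helpers, not the Keras or scikit-learn functions):

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum((p > threshold) == bool(t) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

def mse(y_true, y_prob):
    """Mean squared difference between labels and predicted probabilities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

toy_labels = [1, 0, 1, 0]           # made-up labels
toy_probs = [0.9, 0.2, 0.4, 0.1]    # made-up model outputs
print(accuracy(toy_labels, toy_probs))        # 0.75
print(round(mse(toy_labels, toy_probs), 3))   # 0.105
```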

In [82]:
pred.shape

(6250, 1)

In [83]:
ytest.shape

(6250, 1)

In [84]:
pred1 = model.predict(xtest.head(1))
print(pred1)

[[0.00047497]]


### **Testing the model on random data**

In [85]:
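# Predicting on random
sent = ['I am happy']
sent_1 = clean(sent)
# print(sent_1)
x = [vocabulary.get(i, 0) for i in sent_1[0].split()]   # assumption: unknown words fall back to 0, the padding index
x = [0]*(300 - len(x)) + x[:300]                        # pad or truncate to the 300 tokens used in training
x1 = pd.DataFrame(np.array(x).reshape(1, -1))
prob = model.predict(x1)                                # named prob so the labels in y are not overwritten
# print(prob)
if prob[0][0]>0.5:
  print('Positive')
else:
  print('Negative')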

Positive
