
# Import Libraries

In [21]:
import numpy as np
import pandas as pd
import re
import spacy
import en_core_web_sm
from sklearn.model_selection import train_test_split
from google.colab import drive

# Load Dataset

In [22]:
drive.mount('/content/drive', force_remount=True)
data=pd.read_csv('drive/MyDrive/Machine Learning Practice/datasets/IMDB_Dataset.csv', nrows=2000)

Mounted at /content/drive


In [23]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


# Convert Sentiment Labels into Numerical Values

In [24]:
# Assign back instead of a chained inplace replace, which newer pandas versions warn about.
data['sentiment'] = data['sentiment'].map({"positive": 1, "negative": 0})
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     2000 non-null   object
 1   sentiment  2000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


In [26]:
data.shape

(2000, 2)

In [27]:
data['sentiment'].value_counts().reset_index()
# Classes are nearly balanced: 1005 positive vs. 995 negative.

Unnamed: 0,sentiment,count
0,1,1005
1,0,995


# Data Preprocessing

In [28]:
def remove_number(text):
  return re.sub('[0-9]+', '', text)

def remove_htmltags(text):
  return re.sub(r'<[^>]+>', '', text)

def remove_symbols(text):
  return re.sub(r"[!@#$%^&*(){}£/']", '', text)

In [29]:
data['review']=data['review'].apply(remove_number)
data['review']=data['review'].apply(remove_htmltags)
data['review']=data['review'].apply(remove_symbols)

In [30]:
print(data['review'][1])

A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too You can truly see the seamless editing guided by the references to Williams diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great masters of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional dream techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets particularly of their flat with Halliwells murals decorating every surface are terribly well done.
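
The three clean-up steps can be tried on a single string to see exactly what they remove. The sketch below mirrors the regexes from the cells above in one helper; `clean_review` and the sample sentence are made up for illustration and are not part of the notebook.

```python
import re

def clean_review(text):
    """Apply the three regex clean-up steps in the same order as the notebook."""
    text = re.sub(r'[0-9]+', '', text)             # drop runs of digits
    text = re.sub(r'<[^>]+>', '', text)            # drop HTML tags such as <br />
    text = re.sub(r"[!@#$%^&*(){}£/']", '', text)  # drop the listed symbols
    return text

# Hypothetical sample, not taken from the dataset.
sample = "It's a 10/10 film!<br /><br />Worth $5 (at least)."
cleaned = clean_review(sample)
print(repr(cleaned))  # 'Its a  filmWorth  at least.'
```

Note the side effects: tags are replaced with nothing, so words that were separated only by a tag get glued together (`filmWorth`), and removed symbols can leave double spaces. Substituting a space instead of an empty string would avoid the gluing, at the cost of extra whitespace.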


# Tokenization

In [31]:
nlp=en_core_web_sm.load()

In [32]:
def tokenize(text):
  doc=nlp(text)
  filtered_tokens=[]
  for token in doc:
    if token.is_stop or token.is_punct:
      continue
    filtered_tokens.append(token.lemma_)
  return ' '.join(filtered_tokens)  # rejoin with spaces so the words stay separable

In [33]:
data['review']=data['review'].apply(tokenize)

# LSTM Model

In [34]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D
from tensorflow.keras.callbacks import EarlyStopping

In [36]:
max_words=1000
max_len=150

tokenizer=Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data['review'])
X=tokenizer.texts_to_sequences(data['review'])
X=pad_sequences(X, maxlen=max_len)
y=np.array(data['sentiment'])
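
To make the integer encoding and padding step concrete, here is a simplified pure-Python sketch of what `Tokenizer(num_words=...)` and `pad_sequences` do with their default settings. It is an illustration, not the Keras implementation: the helper names are hypothetical, punctuation filtering is omitted, and ties in word frequency are broken alphabetically (an assumption made only to keep the output deterministic).

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Ties are broken alphabetically here (assumption, for determinism).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping words whose index is >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(sequences, maxlen):
    """Left-pad with zeros; keep only the last maxlen items of long sequences."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

# Toy corpus, invented for illustration.
toy_texts = ["good film good cast good", "bad film", "good plot bad pace overall"]
word_index = fit_word_index(toy_texts)
sequences = to_sequences(toy_texts, word_index, num_words=4)
print(pad_pre(sequences, maxlen=3))  # [[3, 1, 1], [0, 2, 3], [0, 1, 2]]
```

With `num_words=4` only indices 1 to 3 survive, so rare words are silently dropped rather than mapped to a placeholder. Index 0 is reserved for padding, and both padding and truncation happen at the front of the sequence by default.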

# Train-Test Split

In [37]:
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2,random_state=42)  # fixed seed for a reproducible split

In [38]:
embedding_dim=100

model=Sequential()
model.add(Embedding(max_words, embedding_dim))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 8
batch_size = 264


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 100)         100000    
                                                                 
 spatial_dropout1d (Spatial  (None, None, 100)         0         
 Dropout1D)                                                      
                                                                 
 lstm (LSTM)                 (None, 100)               80400     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 180501 (705.08 KB)
Trainable params: 180501 (705.08 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
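
The parameter counts in the summary can be checked by hand. The short calculation below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer; the variable names are chosen here for illustration.

```python
vocab_size = 1000     # max_words
embedding_dim = 100   # embedding_dim
lstm_units = 100      # LSTM(100, ...)

# One embedding_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# Each of the 4 gates has input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM unit plus a single bias for the sigmoid output.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 100000 80400 101 180501
```

These match the `Param #` column and the `Total params` line in the summary above.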


In [39]:
early_stop = EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)
history = model.fit(
  X_train, y_train,
  epochs=epochs,
  batch_size=batch_size,
  validation_data=(X_test, y_test),
  callbacks=[early_stop],
)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
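
The log ends at epoch 5 of 8, which is consistent with the `EarlyStopping` callback ending the run early. The sketch below is a simplified model of that rule with the same `patience` and `min_delta` values: stop once the monitored loss has failed to improve by more than `min_delta` for `patience` consecutive epochs. The function name and the loss values are invented for illustration, since the per-epoch metrics are not shown in the output above.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0001):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # improvement larger than min_delta
            best = loss
            wait = 0
        else:
            wait += 1                 # another epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, NOT taken from the actual run.
hypothetical_val_losses = [0.69, 0.68, 0.70, 0.69, 0.71, 0.72, 0.73, 0.74]
print(early_stopping_epoch(hypothetical_val_losses))  # 5
```

In this made-up sequence the best loss occurs at epoch 2, the next three epochs fail to beat it, and training stops after epoch 5. Because the callback monitors `val_loss` on `(X_test, y_test)`, the test set influences when training stops; holding out a separate validation split would keep the test set untouched.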
