
In [9]:
!pip install kaggle




**Importing the Dependencies**


In [10]:
import os
import json

from zipfile import ZipFile
import pandas as pd

from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences  # pads/truncates sequences to the same length

**Data Collection - Kaggle API**

In [11]:
# read the Kaggle API credentials; the context manager closes the file
with open("/content/kaggle(1).json") as f:
  kaggle_dictionary = json.load(f)

In [12]:
kaggle_dictionary.keys()

dict_keys(['username', 'key'])

In [13]:
# setup kaggle credentials as environment variables
os.environ["KAGGLE_USERNAME"] = kaggle_dictionary["username"]
os.environ["KAGGLE_KEY"] = kaggle_dictionary["key"]

In [14]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
  0% 0.00/25.7M [00:00<?, ?B/s]
100% 25.7M/25.7M [00:00<00:00, 961MB/s]


In [15]:
!ls

 imdb-dataset-of-50k-movie-reviews.zip	'kaggle(1).json'   sample_data


In [16]:
# unzip the dataset file
with ZipFile("imdb-dataset-of-50k-movie-reviews.zip", "r") as zip_ref:
  zip_ref.extractall()

In [17]:
!ls

'IMDB Dataset.csv'			'kaggle(1).json'
 imdb-dataset-of-50k-movie-reviews.zip	 sample_data


**Loading the Dataset**

In [18]:
data = pd.read_csv("/content/IMDB Dataset.csv")

In [19]:
data.shape

(50000, 2)

In [20]:
data.head()

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
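
The preview shows raw `<br />` tags inside the reviews. The notebook tokenizes the text as-is; with the Keras `Tokenizer` default filters, the angle brackets and slash are stripped, so the tag would most likely survive as the token `br`. One optional cleanup, sketched here with a hypothetical helper `strip_br_tags` and a made-up sample string, is to remove the tags before tokenizing.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_br_tags(text):
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


# Made-up sample, not a row from the dataset.
sample = "Great cast.<br /><br />Weak ending."
print(strip_br_tags(sample))  # Great cast. Weak ending.
```

Applied to the DataFrame, this would be `data["review"] = data["review"].apply(strip_br_tags)` before the train/test split. Whether it changes the accuracy reported later was not measured here.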


In [21]:
data.tail()

       review                                             sentiment
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative


In [22]:
data["sentiment"].value_counts()

sentiment  count
positive   25000
negative   25000


In [23]:
# encode the labels as integers: positive -> 1, negative -> 0
data["sentiment"] = data["sentiment"].map({"positive": 1, "negative": 0})


In [24]:
data.head()

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  1
1  A wonderful little production. <br /><br />The...  1
2  I thought this was a wonderful way to spend ti...  1
3  Basically there's a family where a little boy ...  0
4  Petter Mattei's "Love in the Time of Money" is...  1


In [25]:
data["sentiment"].value_counts()

sentiment  count
1          25000
0          25000


In [26]:
# split data into training data and test data
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

In [27]:
print(train_data.shape)
print(test_data.shape)

(40000, 2)
(10000, 2)
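
The shapes above follow from `test_size=0.2`: 20% of the 50,000 rows (10,000) go to the test set and the remaining 40,000 to training. The idea of a seeded, shuffled split can be sketched in plain Python. `simple_train_test_split` is a hypothetical stand-in for illustration, not scikit-learn's implementation, and it will not reproduce the same row order as `train_test_split`.

```python
import random


def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test fraction.

    Illustrative only; scikit-learn uses its own shuffling internally.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same split every run
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


train_rows, test_rows = simple_train_test_split(list(range(50000)))
print(len(train_rows), len(test_rows))  # 40000 10000
```

Fixing the seed (here `random_state=42` in the notebook) is what makes the split reproducible between runs.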


**Data Preprocessing**

In [28]:
# Tokenize text data
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data["review"])
X_train = pad_sequences(tokenizer.texts_to_sequences(train_data["review"]), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_data["review"]), maxlen=200)
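
To make the tokenize-and-pad step less of a black box, here is a simplified pure-Python sketch of the same idea. It assumes the Keras defaults as I understand them: words are ranked by frequency starting at 1, only ranks below `num_words` are kept, unknown words are dropped (no OOV token is configured), and `pad_sequences` pads and truncates at the front. The function names and the regex-based word splitting are my own simplifications, not the Keras implementation.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z0-9']+")  # simplified word splitting, not Keras's exact filter


def fit_word_index(texts):
    """Map each word to a rank (1 = most frequent). Index 0 is reserved for padding."""
    counts = Counter()
    for text in texts:
        counts.update(WORD.findall(text.lower()))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index, num_words):
    """Convert text to integer ids, dropping unknown words and ranks >= num_words."""
    ids = []
    for word in WORD.findall(text.lower()):
        rank = word_index.get(word)
        if rank is not None and rank < num_words:
            ids.append(rank)
    return ids


def pad_pre(ids, maxlen):
    """Left-pad with zeros, or keep only the last maxlen ids if the list is too long."""
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


# Toy corpus, not taken from the dataset.
word_index = fit_word_index(["the movie was good", "the movie was bad", "the plot"])
seq = to_sequence("The movie was great", word_index, num_words=4)
print(pad_pre(seq, maxlen=5))  # [0, 0, 1, 2, 3]
```

This also explains two things about the notebook: short reviews show up with leading zeros, and a word that is unseen or outside the top ranks (like "great" in the toy example) is silently dropped. Note that the tokenizer is fit on the training reviews only and then reused for the test reviews, so no test vocabulary leaks into training.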

In [29]:
print(X_train)

[[1935    1 1200 ...  205  351 3856]
 [   3 1651  595 ...   89  103    9]
 [   0    0    0 ...    2  710   62]
 ...
 [   0    0    0 ... 1641    2  603]
 [   0    0    0 ...  245  103  125]
 [   0    0    0 ...   70   73 2062]]


In [30]:
print(X_test)

[[   0    0    0 ...  995  719  155]
 [  12  162   59 ...  380    7    7]
 [   0    0    0 ...   50 1088   96]
 ...
 [   0    0    0 ...  125  200 3241]
 [   0    0    0 ... 1066    1 2305]
 [   0    0    0 ...    1  332   27]]


In [31]:
Y_train = train_data["sentiment"]
Y_test = test_data["sentiment"]

In [32]:
print(Y_train)

39087    0
30893    0
45278    1
16398    0
13653    0
        ..
11284    1
44732    1
38158    0
860      1
15795    1
Name: sentiment, Length: 40000, dtype: int64


**LSTM - Long Short-Term Memory**

An LSTM is a type of recurrent neural network (RNN) designed for sequential data. Typical applications include:

- Time series prediction (stock prices, weather forecasting)
- Natural language processing (language modeling, machine translation, speech recognition)
- Video analysis and other sequential data tasks

In [37]:
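# build the model
# Embedding: maps each of the 5000 token ids to a 128-dimensional vector
# LSTM: dropout and recurrent_dropout regularize the layer to reduce overfitting
# Dense: a single fully connected output neuron
# Sigmoid: squashes the output to a value between 0 and 1 for binary classification

model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128))  # input_length is deprecated in Keras 3; build() below sets the shape
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation="sigmoid"))
model.build(input_shape=(None, 200))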

In [38]:
model.summary()

Embedding layer parameters = input_dim × output_dim = 5000 × 128 = 640,000
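
The same arithmetic extends to the other two layers. The sketch below uses the standard parameter-count formulas for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and a dense layer. The totals are computed from the layer sizes in this notebook, not copied from the `model.summary()` output.

```python
vocab_size = 5000   # input_dim of the Embedding layer
embed_dim = 128     # output_dim of the Embedding layer
lstm_units = 128    # units in the LSTM layer

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 640000 131584 129 771713
```

Most of the parameters sit in the embedding table, so `num_words` and `output_dim` are the main levers on model size.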


In [39]:
# compile the model: binary cross-entropy is the loss to minimize;
# Adam is the optimizer that updates the weights from the backpropagated gradients
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
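
For intuition about the loss chosen here, this is a hand-rolled sketch of the sigmoid and of binary cross-entropy in plain Python. The function names and the `eps` clipping value are my own choices for illustration; Keras computes the loss on tensors with its own numerical safeguards.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# A confident correct prediction costs little; an undecided one costs log(2).
confident = binary_crossentropy([1, 0], [sigmoid(3.0), sigmoid(-3.0)])
undecided = binary_crossentropy([1, 0], [0.5, 0.5])
print(confident < undecided)  # True
```

An untrained model that outputs 0.5 everywhere would score about 0.693 (log 2), which is a useful baseline when reading the loss values in the training log.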

**Train the Model**

In [40]:
model.fit(X_train, Y_train, epochs=5, batch_size=64, validation_split=0.2)

Epoch 1/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 208s 402ms/step - accuracy: 0.7158 - loss: 0.5390 - val_accuracy: 0.8372 - val_loss: 0.3742
Epoch 2/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 257s 399ms/step - accuracy: 0.8330 - loss: 0.3854 - val_accuracy: 0.8654 - val_loss: 0.3191
Epoch 3/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 201s 397ms/step - accuracy: 0.8694 - loss: 0.3181 - val_accuracy: 0.8635 - val_loss: 0.3236
Epoch 4/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 202s 397ms/step - accuracy: 0.8913 - loss: 0.2757 - val_accuracy: 0.8530 - val_loss: 0.3971
Epoch 5/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 203s 400ms/step - accuracy: 0.9052 - loss: 0.2407 - val_accuracy: 0.8788 - val_loss: 0.3048


<keras.src.callbacks.history.History at 0x7d40ef0b1990>
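
Two things in the training log can be checked with a little arithmetic. The `500/500` step count follows from the data sizes: 20% of the 40,000 training rows are held out for validation, and the remaining rows are consumed in batches of 64. The validation loss is also not monotone, so it can be worth noting which epoch scored best. The `val_loss` list below is copied by hand from the log.

```python
import math

n_train = 40000          # rows passed to model.fit
validation_split = 0.2   # fraction held out for validation
batch_size = 64

n_val = int(n_train * validation_split)          # rows used only for validation
n_fit = n_train - n_val                          # rows actually trained on
steps_per_epoch = math.ceil(n_fit / batch_size)
print(n_fit, steps_per_epoch)  # 32000 500

# Validation loss per epoch, copied from the training log.
val_loss = [0.3742, 0.3191, 0.3236, 0.3971, 0.3048]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)  # 5
```

Training loss falls every epoch while validation loss rises at epochs 3 and 4 before recovering, which may be an early sign of overfitting. If you train for longer, stopping on validation loss is one option to consider.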

**Model Evaluation**

In [41]:
loss, accuracy = model.evaluate(X_test, Y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy : {accuracy}")

313/313 ━━━━━━━━━━━━━━━━━━━━ 36s 113ms/step - accuracy: 0.8776 - loss: 0.2942
Test Loss: 0.29030999541282654
Test Accuracy : 0.8824999928474426
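
Accuracy is a reasonable headline metric here because the two classes are balanced, but it does not show how the errors split between false positives and false negatives. The sketch below computes accuracy, precision, and recall from thresholded probabilities. `summarize_binary_predictions` is a hypothetical helper and the sample values are made up; in the notebook the probabilities would come from `model.predict(X_test)`.

```python
def summarize_binary_predictions(y_true, y_prob, threshold=0.5):
    """Threshold probabilities and return accuracy, precision, and recall."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))

    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels and probabilities for illustration only.
report = summarize_binary_predictions([1, 0, 1, 1, 0], [0.9, 0.2, 0.4, 0.7, 0.6])
print(report["accuracy"])  # 0.6
```

The 0.5 threshold matches the one used in `predict_sentiment` later in the notebook.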


**Building a Predictive System**

In [42]:
def predict_sentiment(review):
  # tokenize and pad the review
  sequence = tokenizer.texts_to_sequences([review])
  padded_sequence = pad_sequences(sequence, maxlen=200)
  prediction = model.predict(padded_sequence)
  sentiment = "positive" if prediction[0][0] > 0.5 else "negative"
  return sentiment

In [48]:
def predict_sentiment(review):
    # Tokenize and pad the review
    sequence = tokenizer.texts_to_sequences([review])
    padded_sequence = pad_sequences(sequence, maxlen=200)

    # Predict sentiment
    prediction = model.predict(padded_sequence, verbose=0)  # suppress progress bar
    sentiment = "positive" if prediction[0][0] > 0.5 else "negative"

    return sentiment


In [49]:
# Example usage
new_review = "This movie was fantastic. I loved it"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is : {sentiment}")

The sentiment of the review is : positive


In [50]:
# example usage
new_review = "This movie was not that good"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

The sentiment of the review is: negative


In [54]:
# example usage
new_review = "John Garfield plays a Marine who is blinded by a grenade while fighting on Guadalcanal and who has to learn to live with his disability. He has all the stereotypical notions about blindness, and is sure he'll be a burden to everyone. The hospital staff and his fellow wounded Marines can't get through to him. Neither can his girl back home played by Eleanor Parker. He's stubborn and blinded by his own fears, self pity, and prejudices. It's a complex role that Garfield carries off memorably in a great performance that keeps one watching in spite of the ever present syrupy melodrama. The best scenes are on Guadalcanal, where he's in a machine gun nest trying to fend off the advancing Japanese soldiers in a hellish looking night time battle, and later a dream sequence in the hospital where he sees himself walking down a train platform with a white cane, dark glasses, and holding out a tin cup, all the while his girlfriend walks backward away from the camera."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

The sentiment of the review is: positive
