<a href="https://colab.research.google.com/github/Deeksha-coder-debug/Stock-Prediction-project/blob/main/Sentiment_Analysis_with_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import re
# Provides functions for working with regular expressions for text cleaning
import pandas as pd
# Used for data manipulation and analysis with DataFrames
import numpy as np
# Provides support for large arrays and numerical computations
from sklearn.preprocessing import LabelEncoder
# Converts categorical labels into numeric values
from sklearn.model_selection import train_test_split
# Splits data into training and testing sets
from tensorflow.keras.preprocessing.text import Tokenizer
# Converts text into integer sequences for deep learning models
from keras.preprocessing.sequence import pad_sequences
# Pads sequences to ensure equal input length for models
import keras
# Deep learning library for building and training neural networks
from sklearn.metrics import classification_report, accuracy_score
# Evaluates model performance using metrics like precision, recall, F1-score, and accuracy
import math  # Provides mathematical functions like ceil, floor, sqrt, etc.
import nltk  # Natural Language Toolkit for NLP tasks like tokenization, stemming, and stopword removal


In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"saideekshacoder","key":"<REDACTED>"}'}

In [3]:
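!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
# Restrict permissions so the API key is not readable by other users
!chmod 600 ~/.kaggle/kaggle.json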

In [4]:
!kaggle datasets download lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
  0% 0.00/25.7M [00:00<?, ?B/s]
100% 25.7M/25.7M [00:00<00:00, 1.35GB/s]


In [5]:
import zipfile
# Extract the downloaded archive; the context manager closes the file automatically
with zipfile.ZipFile('/content/imdb-dataset-of-50k-movie-reviews.zip', 'r') as zip_ref:
    zip_ref.extractall('/content')

In [6]:
data=pd.read_csv('IMDB Dataset.csv')
data

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |


# **Data Preprocessing**

The first step in sentiment analysis with an LSTM is to clean the reviews: strip HTML tags and URLs, replace punctuation and other non-alphanumeric characters with spaces, and convert the text to lowercase. The remove_tags function does this with regular expressions.

In [7]:
def remove_tags(string):
  # Remove HTML tags such as <p> or <br /> but keep the text between them
  res = re.sub(r'<.*?>', '', string)
  # Remove URLs starting with http, https, or www
  res = re.sub(r'http\S+|www\.\S+', '', res)
  # Replace everything except letters, digits, underscores, and whitespace with a space
  # Example: "hello!!!" -> "hello   "
  res = re.sub(r'[^\w\s]', ' ', res)
  # Convert all characters to lowercase for uniformity
  res = res.lower()
  return res

# Apply the cleaning function to every review in the 'review' column
data['review'] = data['review'].apply(remove_tags)
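
To see what the three substitutions do, here is a minimal standalone sketch that applies the same patterns to one made-up string. The function name `clean_review` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

def clean_review(text):
    """Standalone copy of the cleaning steps (hypothetical helper name)."""
    text = re.sub(r'<.*?>', '', text)             # drop HTML tags, keep inner text
    text = re.sub(r'http\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'[^\w\s]', ' ', text)          # punctuation becomes a space
    return text.lower()

# Made-up sample review containing tags, a URL, and punctuation
sample = "A <b>GREAT</b> film!<br />See http://example.com now."
cleaned = clean_review(sample)
print(cleaned.split())
```

Splitting the result on whitespace hides the extra spaces left behind by the punctuation replacement, which is also why the later `x.split()` step in the notebook is unaffected by them.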

We also need to remove stopwords from the corpus. Commonly used words like ‘and’, ‘the’, and ‘at’ are stopwords that do not add any special meaning or significance to a sentence. NLTK provides a list of stopwords, and you can remove them from the corpus using the following code:

In [8]:
nltk.download('stopwords')
# Downloads the NLTK 'stopwords' dataset (first time only),
# which contains a list of common words like "the", "is", "and" that should be removed
from nltk.corpus import stopwords
# Imports the stopwords list from the NLTK library
stop_words=set(stopwords.words('english'))
# Creates a set of English stopwords for fast lookups
data['review']=data['review'].apply(lambda x:' '.join([word for word in x.split() if word not in (stop_words)]))
# For each review in the 'review' column:
# 1. Splits the review into words (x.split())
# 2. Removes words that are in the stop_words set
# 3. Joins the remaining words back into a cleaned sentence

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [9]:
nltk.download('wordnet')
# Tokenizer that splits text on whitespace (spaces, tabs, newlines)
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
# WordNet lemmatizer, which reduces words to their base form (lemma);
# by default it treats every word as a noun
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
  # Tokenize the text, lemmatize each word, and join the words back into a sentence
  return ' '.join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))

data['review'] = data['review'].apply(lemmatize_text)

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [10]:
data

| | review | sentiment |
|---|---|---|
| 0 | one reviewer mentioned watching 1 oz episode h... | positive |
| 1 | wonderful little production filming technique ... | positive |
| 2 | thought wonderful way spend time hot summer we... | positive |
| 3 | basically family little boy jake think zombie ... | negative |
| 4 | petter mattei love time money visually stunnin... | positive |
| ... | ... | ... |
| 49995 | thought movie right good job creative original... | positive |
| 49996 | bad plot bad dialogue bad acting idiotic direc... | negative |
| 49997 | catholic taught parochial elementary school nu... | negative |
| 49998 | going disagree previous comment side maltin on... | negative |


The next step in sentiment analysis with LSTM is to print some basic statistics about the dataset and check whether the two labels are equally represented. A balanced dataset is preferable, because a severely imbalanced one is harder to model and may require specialized techniques.

In [11]:
# Average number of words per cleaned review
avg_len = data['review'].str.split().str.len().mean()
print("Average length of each review : ", avg_len)
# Share of positive and negative reviews
pos = (data['sentiment'] == 'positive').sum()
neg = data.shape[0] - pos
print("Percentage of reviews with positive sentiment is "+str(pos/data.shape[0]*100)+"%")
print("Percentage of reviews with negative sentiment is "+str(neg/data.shape[0]*100)+"%")

Average length of each review :  119.54964
Percentage of reviews with positive sentiment is 50.0%
Percentage of reviews with negative sentiment is 50.0%


# Encoding Labels and Making Train-Test Splits

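Use the LabelEncoder() from sklearn.preprocessing to convert the labels (‘positive’ and ‘negative’) into 1s and 0s, respectively. LabelEncoder assigns integers in sorted order of the class names, which is why ‘negative’ becomes 0 and ‘positive’ becomes 1.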

In [12]:
reviews=data['review'].values
labels=data['sentiment'].values
encoder=LabelEncoder()
encoded_labels=encoder.fit_transform(labels)
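
As a rough illustration of why ‘negative’ maps to 0 and ‘positive’ to 1, the sketch below reproduces the sorted-class mapping with plain NumPy. It is an approximation of the idea on four made-up labels, not the scikit-learn implementation itself.

```python
import numpy as np

# Four made-up labels standing in for data['sentiment'].values
labels = np.array(['positive', 'negative', 'negative', 'positive'])

# np.unique returns the sorted distinct classes and, with return_inverse=True,
# the index of each label within that sorted array
classes, encoded = np.unique(labels, return_inverse=True)

print(classes)   # sorted class names
print(encoded)   # integer code for each label
```

In the notebook itself, `encoder.classes_` holds the same sorted class names after fitting, and `encoder.inverse_transform` converts integer codes back to strings.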

In [13]:
train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, encoded_labels, stratify = encoded_labels)

Why It's Important

Imagine your dataset has:

* 90 positive reviews
* 10 negative reviews

With a purely random split, the test set could by chance contain very few negative reviews, or none at all, which makes the evaluation unreliable.

With stratify=encoded_labels, the split preserves the ratio of each class in both the training and test sets. If the original dataset is 90% positive and 10% negative:

* Train set → 90% positive, 10% negative
* Test set → 90% positive, 10% negative

Stratifying does not remove class imbalance; it only guarantees that both splits reflect the original class distribution. The IMDB dataset used here is already balanced (50% / 50%), so stratifying simply keeps both splits balanced.

In scikit-learn, if you specify neither test_size nor train_size, test_size defaults to 0.25 (25% of the data goes to the test set).
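
The 90/10 scenario above can be checked with a small sketch. The toy data, `test_size=0.2` (chosen so the class counts come out as whole numbers), and `random_state=0` are assumptions for illustration only; the notebook itself uses the default split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy dataset: 90 positive (1) and 10 negative (0) labels
labels = [1] * 90 + [0] * 10
texts = [f"review {i}" for i in range(100)]

train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

train_counts = Counter(train_y)
test_counts = Counter(test_y)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both splits keep the 9:1 ratio. Dropping the `stratify` argument would let the negative count in the test set vary from run to run.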

In [14]:
# Hyperparameters of the model
vocab_size = 3000 # choose based on statistics
oov_tok = '<OOV>' # placeholder token for words outside the vocabulary
embedding_dim = 100
max_length = 200 # choose based on statistics, for example 150 to 200
padding_type='post'
trunc_type='post'
# Tokenize sentences. The tokenizer:
# - keeps only the most frequent words (indices below vocab_size) when converting text
# - maps every other word to the OOV token, which always gets index 1
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
# word_index stores the mapping of each unique word → integer, most frequent first.
# Example: train_sentences = ["I love pizza", "Pizza is amazing"]
# word_index → {'<OOV>': 1, 'pizza': 2, 'i': 3, 'love': 4, 'is': 5, 'amazing': 6}
word_index = tokenizer.word_index

# Convert the training set to integer sequences and pad them to max_length
# Sentence: "I love pizza" → Sequence: [3, 4, 2]
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding=padding_type, truncating=trunc_type, maxlen=max_length)
# Convert the test set to integer sequences and pad them the same way
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding=padding_type, truncating=trunc_type, maxlen=max_length)
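
To make the tokenize-then-pad step more concrete, here is a plain-Python sketch of the idea. It is not the Keras implementation: the helper names `build_word_index` and `to_padded` are hypothetical, and the sketch ignores the `num_words` cap and punctuation filtering.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Map words to integers by descending frequency (hypothetical helper).
    Index 0 is reserved for padding and index 1 for the OOV token."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_padded(sentence, word_index, max_length, oov_token="<OOV>"):
    """Convert a sentence to integer ids, then truncate and zero-pad at the end."""
    ids = [word_index.get(w, word_index[oov_token]) for w in sentence.lower().split()]
    ids = ids[:max_length]                       # 'post' truncation
    return ids + [0] * (max_length - len(ids))   # 'post' padding

train = ["I love pizza", "Pizza is amazing"]
word_index = build_word_index(train)
seen = to_padded("I love pizza", word_index, max_length=5)
unseen = to_padded("I love burgers", word_index, max_length=5)
print(word_index)
print(seen)
print(unseen)
```

The word "burgers" never appeared in the training sentences, so it falls back to the OOV index, which is what happens to rare or unseen words in the test reviews.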

The next step in sentiment analysis using LSTM is to build a Keras sequential model. It is a linear stack of the following layers:

* An embedding layer that converts each word index into a dense vector of size 100. The input dimension is the vocabulary size, and the output dimension is 100.

* A bidirectional LSTM layer of 64 units.

* A dense (fully connected) layer of 24 units with relu activation.

* A dense layer of 1 unit with sigmoid activation, which outputs the probability that the review is positive, i.e., that the label is 1.
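
As a sanity check on the size of this architecture, the sketch below computes the number of trainable parameters per layer from the standard formulas for embedding, LSTM, and dense layers. The helper function names are hypothetical, and the sizes are the ones listed above (vocabulary 3000, embedding dimension 100).

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embedding_dim, lstm_units = 3000, 100, 64

layer_params = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    # Bidirectional wraps two independent LSTMs (forward and backward)
    "bidirectional_lstm": 2 * lstm_params(embedding_dim, lstm_units),
    # The two directions are concatenated, so the dense layer sees 128 inputs
    "dense_24": dense_params(2 * lstm_units, 24),
    "dense_1": dense_params(24, 1),
}
total_params = sum(layer_params.values())
print(layer_params)
print("total:", total_params)
```

Most of the parameters sit in the embedding layer, so `vocab_size` and `embedding_dim` are the main levers on model size.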

In [15]:
# model initialization
model = keras.Sequential([
    # Declare the input shape with an explicit Input layer rather than
    # passing input_shape to Embedding, which triggers a warning in recent Keras
    keras.Input(shape=(max_length,)),
    keras.layers.Embedding(vocab_size, embedding_dim),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# model summary
model.summary()



We compile the LSTM model with binary cross-entropy loss and the Adam optimizer, since this is a binary classification problem. Binary cross-entropy compares the predicted probabilities to the actual class labels (0 or 1), and Adam, an adaptive variant of stochastic gradient descent, updates the weights to minimize that loss. We use accuracy as the primary performance metric.
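
For intuition about what the loss measures, here is a minimal NumPy sketch of binary cross-entropy on made-up predictions. The function name and the clipping constant `eps` are assumptions added for numerical safety; this is not the Keras internals.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Two samples with true labels 1 and 0
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident correct predictions give a loss near zero, while confident wrong ones are penalized heavily, which is the signal the optimizer uses to adjust the weights.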

# Model Training and Evaluation

In [16]:
num_epochs = 5
history = model.fit(train_padded, train_labels,
                    epochs=num_epochs, verbose=1,
                    validation_split=0.1)

Epoch 1/5
1055/1055 ━━━━━━━━━━━━━━━━━━━━ 24s 19ms/step - accuracy: 0.7337 - loss: 0.5115 - val_accuracy: 0.8584 - val_loss: 0.3531
Epoch 2/5
1055/1055 ━━━━━━━━━━━━━━━━━━━━ 40s 20ms/step - accuracy: 0.8859 - loss: 0.2877 - val_accuracy: 0.8789 - val_loss: 0.3020
Epoch 3/5
1055/1055 ━━━━━━━━━━━━━━━━━━━━ 18s 17ms/step - accuracy: 0.9056 - loss: 0.2426 - val_accuracy: 0.8651 - val_loss: 0.3406
Epoch 4/5
1055/1055 ━━━━━━━━━━━━━━━━━━━━ 18s 17ms/step - accuracy: 0.9246 - loss: 0.2020 - val_accuracy: 0.8664 - val_loss: 0.3499
Epoch 5/5
1055/1055 ━━━━━━━━━━━━━━━━━━━━ 20s 17ms/step - accuracy: 0.9353 - loss: 0.1745 - val_accuracy: 0.8597 - val_loss: 0.3512


In [17]:
prediction = model.predict(test_padded)
# Convert probabilities to labels: 1 if p >= 0.5 else 0
pred_labels = (prediction >= 0.5).astype(int).ravel()
print("Accuracy of prediction on test set : ", accuracy_score(test_labels,pred_labels))

391/391 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step
Accuracy of prediction on test set :  0.85704


The prediction accuracy on the test set is 85.70%. You can try to improve it by tuning the model hyperparameters, changing the model architecture, or adjusting the train-test split ratio. We stopped at five epochs to limit computation time. Note, however, that in the training log above the validation loss is lowest at epoch 2 (0.3020) and rises afterwards while the training loss keeps falling, which suggests the model is starting to overfit. Training for more epochs is therefore unlikely to help on its own; it is more useful in combination with regularization or early stopping.
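
The validation losses from the training log can be used to sketch how an early-stopping rule would have behaved. The function `early_stop_epoch` and the patience value are hypothetical and only illustrate the logic; in Keras, the `keras.callbacks.EarlyStopping` callback provides this behavior during training.

```python
# Validation losses copied from the five-epoch training log
val_loss = [0.3531, 0.3020, 0.3406, 0.3499, 0.3512]

def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, i.e. after
    `patience` consecutive epochs without a new best validation loss,
    or None if that never happens (hypothetical helper)."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Epoch with the lowest validation loss (1-based)
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
stop_epoch = early_stop_epoch(val_loss, patience=2)
print("best epoch:", best_epoch)
print("would stop at epoch:", stop_epoch)
```

With a patience of two epochs, training would have ended one epoch earlier than the fixed five-epoch run, and restoring the weights from the best epoch would give the model with the lowest validation loss.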

# Using the Model to Determine the Sentiment of Unseen Movie Reviews

We can use our trained LSTM model to determine the sentiment of new, unseen movie reviews that are not present in the dataset. Before feeding each new text to the model, you must tokenize and pad it with the same tokenizer and max_length used for training. The model.predict() function returns the probability that the review is positive. If the probability is greater than or equal to 0.5, we consider the review positive; otherwise, it is negative.

In [18]:
# reviews on which we need to predict
# Note: these raw sentences skip the stopword removal and lemmatization
# that were applied to the training reviews
sentence = ["The movie was very touching and heart whelming",
            "I have never seen a terrible movie like this",
            "the movie plot is terrible but it had good acting"]
# convert to a sequence
sequences = tokenizer.texts_to_sequences(sentence)
# pad the sequence
padded = pad_sequences(sequences, padding='post', maxlen=max_length)
# Get labels based on probability: 1 if p >= 0.5 else 0
prediction = model.predict(padded)
pred_labels = (prediction >= 0.5).astype(int).ravel()
for text, label in zip(sentence, pred_labels):
    print(text)
    print("Predicted sentiment : ", 'Positive' if label == 1 else 'Negative')

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
The movie was very touching and heart whelming
Predicted sentiment :  Positive
I have never seen a terrible movie like this
Predicted sentiment :  Negative
the movie plot is terrible but it had good acting
Predicted sentiment :  Negative
