# Sentiment Analysis Using LSTM

**Introduction**

In this class project, we will build a sentiment analysis model using LSTM (Long Short-Term Memory) layers to classify IMDB movie reviews as positive or negative. This exercise will help us learn to work with TensorFlow and Keras, focusing on training a sequential model and fine-tuning various aspects to improve performance. Below are the key steps to complete this project.

## Step 1: Import Necessary Libraries
First, we need to import all the required libraries for data preprocessing, model creation, and training.

In [1]:
# Import necessary libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout


## Step 2: Load the IMDB Dataset
We use the IMDB Movie Reviews dataset, available in TensorFlow, as the training and test data. This dataset contains 50,000 movie reviews labeled as positive or negative, split evenly into 25,000 training reviews and 25,000 test reviews.



In [2]:
# Load the IMDB dataset
(train_data, train_labels), (test_data, test_labels) = tf.keras.datasets.imdb.load_data(num_words=10000)
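
To make the effect of `num_words=10000` concrete, here is a minimal plain-Python sketch of the vocabulary cap. It assumes the Keras default out-of-vocabulary marker (`oov_char=2`); the helper name `cap_vocabulary` and the sample review are hypothetical and only illustrate the idea.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every word index >= num_words with the out-of-vocabulary marker."""
    return [idx if idx < num_words else oov_char for idx in sequence]


# Hypothetical encoded review: 15230 and 48211 stand in for rare words.
review = [1, 14, 22, 15230, 9, 48211, 66]
capped = cap_vocabulary(review, num_words=10000)
print(capped)  # [1, 14, 22, 2, 9, 2, 66]
```

Capping the vocabulary keeps the embedding table small, at the cost of collapsing all rare words into a single token.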


## Step 3: Convert Integers Back to Words (Decoding)
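The dataset is pre-tokenized into integers, and decoding them back into words makes the reviews easier to inspect. Note that the cell below does not perform that decoding: it overwrites the IMDB splits loaded in Step 2 with a four-sentence toy sample, so every later step runs on those four examples rather than on the IMDB reviews.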



In [12]:
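# Toy train and test data (overwrites the IMDB splits loaded in Step 2);
# the integers 123 and 456 are deliberately non-string entries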
train_data = ["This is a sample sentence.", "Another example.", 123, "More text data."]
test_data = ["Test sentence one.", "Another test.", 456, "More test data."]
train_labels = [1, 0, 1, 0]
test_labels = [0, 1, 0, 1]

# Convert all elements in train_data and test_data to strings
train_data = [str(text) for text in train_data]
test_data = [str(text) for text in test_data]
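
For reference, decoding an integer-encoded IMDB review could look roughly like the sketch below. It assumes the Keras defaults for `load_data` (index 0 for padding, 1 for the start marker, 2 for out-of-vocabulary words, and every real word shifted by `index_from=3`). The tiny `toy_word_index` is a hypothetical stand-in for the dictionary returned by `tf.keras.datasets.imdb.get_word_index()`, used here so the example runs without downloading anything.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map an integer-encoded review back to a space-separated string."""
    # Real words are stored with their rank shifted by index_from.
    reverse = {rank + index_from: word for word, rank in word_index.items()}
    # Indices below index_from are reserved markers.
    reverse.update({0: "<PAD>", 1: "<START>", 2: "<OOV>"})
    return " ".join(reverse.get(idx, "?") for idx in encoded)


# Hypothetical word index; the real one maps tens of thousands of words to ranks.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
decoded = decode_review([1, 4, 5, 6, 7, 2], toy_word_index)
print(decoded)  # <START> the movie was great <OOV>
```

With the real dataset, the same function would be called as `decode_review(train_data[0], tf.keras.datasets.imdb.get_word_index())` before `train_data` is overwritten.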

## Step 4: Preprocess the Data
Next, we tokenize the text data, convert it into sequences, and pad the sequences to have a consistent length. This step ensures the input size is uniform for the LSTM model.



In [18]:
# Preprocess the data
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(train_data)

train_sequences = tokenizer.texts_to_sequences(train_data)
train_padded = pad_sequences(train_sequences, maxlen=100, padding='post', truncating='post')

test_sequences = tokenizer.texts_to_sequences(test_data)
test_padded = pad_sequences(test_sequences, maxlen=100, padding='post', truncating='post')
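
The following plain-Python sketch illustrates what this step does conceptually: build a word index from the training texts, map unseen words to the out-of-vocabulary index, and pad or truncate at the end of each sequence. It is a simplified stand-in, not the Keras implementation; the helper names `fit_word_index`, `to_sequence`, and `pad_post` are hypothetical, and the tokenization rule (lowercase, keep letters, digits, and apostrophes) is an assumption.

```python
import re
from collections import Counter

WORD_PATTERN = re.compile(r"[a-z0-9']+")


def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for unseen words."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_PATTERN.findall(text.lower()))
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(ranked, start=2):
        word_index[word] = rank
    return word_index


def to_sequence(text, word_index, oov_token="<OOV>"):
    """Convert a text to word indices, using the OOV index for unknown words."""
    words = WORD_PATTERN.findall(text.lower())
    return [word_index.get(word, word_index[oov_token]) for word in words]


def pad_post(sequence, maxlen, value=0):
    """Truncate at the end, then pad at the end, to exactly maxlen items."""
    sequence = sequence[:maxlen]
    return sequence + [value] * (maxlen - len(sequence))


train_texts = ["This is a sample sentence.", "Another example.", "123", "More text data."]
word_index = fit_word_index(train_texts)

encoded = to_sequence("Test sentence one.", word_index)
padded = pad_post(encoded, maxlen=6)
print(encoded)  # [1, 6, 1]
print(padded)   # [1, 6, 1, 0, 0, 0]
```

Only "sentence" appears in the training texts, so the other two words of the test sentence collapse to the OOV index. With a four-sentence vocabulary, most test words are unknown to the model.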

## Step 5: Define the LSTM Model
We define the LSTM model using the Keras Sequential API. The model consists of two LSTM layers followed by dense layers for classification.



In [20]:
# Define the LSTM model
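model = Sequential([
    # Map each of the 10,000 word indices to a trainable 16-dimensional vector
    Embedding(10000, 16, input_length=100),
    # First LSTM returns its output at every timestep so the next LSTM receives a sequence
    LSTM(32, return_sequences=True),
    # Second LSTM returns only its final output, summarizing the whole sequence
    LSTM(32),
    Dense(24, activation='relu'),
    # Randomly zero half of the activations during training for regularization
    Dropout(0.5),
    # Single sigmoid unit: estimated probability of the positive class (label 1)
    Dense(1, activation='sigmoid')
])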
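
As a rough check on the size of this architecture, the trainable parameters of each layer can be counted by hand. The sketch below uses the standard formulas for embedding, LSTM (four gates, each with input weights, recurrent weights, and a bias), and dense layers; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


layer_params = {
    "embedding": embedding_params(10000, 16),
    "lstm_1": lstm_params(16, 32),
    "lstm_2": lstm_params(32, 32),
    "dense_1": dense_params(32, 24),
    "dropout": 0,  # dropout has no trainable weights
    "dense_2": dense_params(24, 1),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:10s} {count:>8,d}")
print(f"{'total':10s} {total_params:>8,d}")
```

The embedding table accounts for 160,000 of the 175,409 parameters, so the vocabulary size is the main driver of model size here.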


## Step 6: Compile the Model
To prepare the model for training, we compile it with the appropriate loss function and optimizer. Here, we use the `binary_crossentropy` loss function since it’s a binary classification task.



In [21]:
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
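
To show how `binary_crossentropy` scores predictions, here is a minimal plain-Python version of the averaged loss. The clipping constant `eps` is an assumption added to avoid `log(0)`; the predicted probabilities are made up for illustration.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of (label, probability) pairs."""
    total = 0.0
    for label, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1 - eps)  # keep log() finite
        total += -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    return total / len(y_true)


labels = [1, 0, 1, 0]

# Hypothetical predictions: confident and correct vs. confident and wrong.
confident_right = binary_crossentropy(labels, [0.9, 0.1, 0.9, 0.1])
confident_wrong = binary_crossentropy(labels, [0.1, 0.9, 0.1, 0.9])
print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

The loss rewards confident correct predictions and penalizes confident wrong ones heavily, which is why it suits a sigmoid output.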


## Step 7: Train the Model
Finally, we train the model on the training data, specifying the number of epochs and validation data.


In [22]:
# Train the model
model.fit(train_padded, np.array(train_labels), epochs=5, validation_data=(test_padded, np.array(test_labels)))

# Print the model summary
model.summary()

Epoch 1/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 4s/step - accuracy: 0.5000 - loss: 0.6942 - val_accuracy: 0.5000 - val_loss: 0.6932
Epoch 2/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 91ms/step - accuracy: 0.5000 - loss: 0.6935 - val_accuracy: 0.5000 - val_loss: 0.6932
Epoch 3/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 91ms/step - accuracy: 0.2500 - loss: 0.6961 - val_accuracy: 0.5000 - val_loss: 0.6932
Epoch 4/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 90ms/step - accuracy: 0.5000 - loss: 0.6916 - val_accuracy: 0.5000 - val_loss: 0.6932
Epoch 5/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 93ms/step - accuracy: 0.5000 - loss: 0.6905 - val_accuracy: 0.5000 - val_loss: 0.6932


## Project Summary: Sentiment Analysis using LSTM

This project aimed to develop a sentiment analysis model using **Long Short-Term Memory (LSTM)** networks to predict whether a movie review is positive or negative. The **IMDB Movie Reviews** dataset, containing labeled positive and negative reviews, was loaded in Step 2, but the model was ultimately trained on the four-sentence toy sample defined in Step 3.

## Key Steps Included:

### 1. Importing Libraries
- Imported necessary libraries for data processing, model building, and training (e.g., **TensorFlow**, **Keras**).

### 2. Loading Dataset
- Loaded the **IMDB dataset** provided by TensorFlow, with reviews pre-tokenized into integers representing words and the vocabulary capped at the 10,000 most frequent words.

### 3. Data Preprocessing
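- **Replaced the loaded data** with four toy sentences per split and cast every element to a string; the integer-encoded reviews were not actually decoded to human-readable text.
- **Tokenized the text**, transformed it to sequences, and padded these sequences to a uniform length of 100 tokens for the LSTM.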

### 4. Model Definition
- Defined a **Sequential LSTM model** consisting of:
  - **Embedding Layer**: Converts word indices into 16-dimensional dense vectors.
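  - **Two LSTM Layers**: 32 units each, extracting sequential patterns; the first returns its full output sequence so the second can consume it.
  - **Dense Layer**: 24 units with ReLU activation, followed by dropout (rate 0.5) for regularization.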
  - **Final Dense Layer**: Sigmoid activation for binary sentiment classification.

### 5. Training
- The model was **trained for 5 epochs** on the four-example toy dataset, with the four-example test set used as validation data. Training and validation loss and accuracy were monitored after each epoch.
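
The `1/1` shown in each epoch of the training log means one batch per epoch. A short calculation makes the consequence explicit, assuming the Keras default `batch_size=32` (the `fit` call does not set it) and the 25,000-review IMDB training split for comparison. The helper name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size=32):
    """Number of gradient updates in one pass over the data."""
    return math.ceil(num_examples / batch_size)


epochs = 5
toy_steps = steps_per_epoch(4)        # the four-sentence sample
full_steps = steps_per_epoch(25000)   # the full IMDB training split

print(toy_steps * epochs)   # 5
print(full_steps * epochs)  # 3910
```

Five weight updates in total give the optimizer very little opportunity to move away from its initial, near-random predictions.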

## Model Output and Interpretation:

### 1. Training Accuracy
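- Was **50%** in four of the five epochs and dropped to 25% in epoch 3. With only four training examples, a single prediction changes accuracy by 25 percentage points, so these values are consistent with chance.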

### 2. Validation Accuracy
- Remained at **50%** throughout the training process, indicating that the model failed to learn effectively and could not distinguish between positive and negative reviews accurately.

### 3. Loss Values
- The **loss values** stayed near **0.69**: training loss ranged from 0.6905 to 0.6961, and validation loss was constant at 0.6932, suggesting that the model did not converge to an effective solution.
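
A loss near 0.69 has a specific meaning for binary cross-entropy. For a single example with label $y$ and predicted probability $p$, the loss is

$$
\mathcal{L}(y, p) = -\bigl[\, y \ln p + (1 - y) \ln (1 - p) \,\bigr].
$$

If the model outputs $p = 0.5$ for every input, the loss is the same for either label:

$$
\mathcal{L}(y, 0.5) = -\ln 0.5 = \ln 2 \approx 0.6931.
$$

The reported validation loss of 0.6932 is within 0.0001 of this value, which is what one would expect if the network predicts roughly 0.5 regardless of the input.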

## Interpretation of Output:
The model's **training and validation accuracy** of **50%**, along with nearly constant **loss values**, indicates that the model is likely **guessing** the output. This suggests several potential issues:

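- **Underfitting**: The model fails to fit even its own training set. Limited capacity is one possible cause, but with a single batch per epoch the model received only five weight updates in total, which is more likely the limiting factor.
- **Limited Training Data**: The model was trained on just four examples, far too few to learn which words signal sentiment.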

### Further Improvements:
- **More Training Data**: Training on the full IMDB training split from Step 2 instead of the four-sentence sample would give the model far more information to learn from.
- **Parameter Tuning**: Adjusting hyperparameters, such as the number of LSTM units, learning rate, or adding more layers.
- **Longer Training**: More epochs may allow the model to learn better.

## Conclusion:
The LSTM model developed in this project served as an initial attempt to perform sentiment analysis using deep learning. The output highlighted challenges such as **underfitting** and the need for better data and parameter tuning. Further experimentation is necessary to enhance the model's performance, such as using more data and optimizing hyperparameters to achieve meaningful results.
