# Sentiment Analysis Using LSTM
This notebook builds a bidirectional LSTM model for text classification.
The dataset consists of preprocessed tweets, each labeled with one of three classes.

## Importing Required Libraries
We'll begin by importing the necessary libraries for data processing, model building, and evaluation.

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

## Loading the Dataset
We will load the preprocessed dataset, which is assumed to be stored in a CSV file named `processed_text.csv`. The dataset contains a `clean_tweet` column with preprocessed text and a `class` column indicating the category.

In [None]:
df = pd.read_csv('processed_text.csv')
df.head()

## Splitting Input and Target Variables
We'll separate the `clean_tweet` column as the input (X) and the `class` column as the target variable (y).

In [None]:
X = df['clean_tweet'].values
y = df['class'].values

## Tokenizing the Text Data
The text is converted to integer sequences with Keras's `Tokenizer`, which assigns each word an index starting at 1. The vocabulary size is the number of distinct words plus one, because index 0 is reserved for padding.

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
vocab_size = len(tokenizer.word_index) + 1
X = tokenizer.texts_to_sequences(X)
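
To make the mapping concrete, here is a minimal pure-Python sketch of this kind of word-to-index encoding. It is a simplified stand-in, not Keras's implementation: the helper names `build_word_index` and `texts_to_ids` are hypothetical, and the sketch only lowercases and splits on whitespace, whereas `Tokenizer` also strips punctuation by default.

```python
from collections import Counter

def build_word_index(texts):
    """Assign ids 1..N to words, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word with its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

sample_texts = ["good day", "bad day"]  # toy data, not from the dataset
sample_index = build_word_index(sample_texts)
sample_sequences = texts_to_ids(sample_texts, sample_index)
sample_vocab_size = len(sample_index) + 1  # +1 for the unused index 0
```

Here "day" appears twice and so receives the lowest id, and the vocabulary size is one more than the three distinct words.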

## Padding Sequences
The sequences vary in length, so we pad them all to the length of the longest one. The model expects a rectangular array of shape (samples, `max_length`); by default, `pad_sequences` adds zeros at the start of shorter sequences.

In [None]:
max_length = max(len(sequence) for sequence in X)
X = pad_sequences(X, maxlen=max_length)
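
The default padding behavior can be illustrated with a small pure-Python sketch. The function `pad_pre` is a hypothetical helper that mimics the default `padding='pre'` and `truncating='pre'` settings of `pad_sequences` for a positive `maxlen`; it returns nested lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[5, 2], [7], [1, 2, 3, 4]]
toy_max_length = max(len(seq) for seq in toy_sequences)
toy_padded = pad_pre(toy_sequences, toy_max_length)
```

Because `max_length` is set by the longest tweet, a single long outlier widens every row. Passing a smaller fixed `maxlen` is a common alternative, at the cost of truncating the longest tweets.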

## Splitting Data into Training and Testing Sets
The data will be split into training and testing sets using an 80-20 ratio.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
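
With three classes, it can be worth checking that a random split leaves each class represented in both sets. The sketch below uses a hypothetical helper, `class_proportions`, on toy labels; in the notebook it could be called on `y_train.tolist()` and `y_test.tolist()`.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: fraction of samples} for a sequence of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

toy_train = [0, 0, 0, 1, 1, 2, 2, 2]  # toy labels, not from the dataset
toy_test = [0, 1]                      # class 2 is missing from this split
train_props = class_proportions(toy_train)
test_props = class_proportions(toy_test)
```

If the proportions differ noticeably between the two sets, passing `stratify=y` to `train_test_split` keeps the class ratios approximately equal in both.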

## Building the Bidirectional LSTM Model
The model consists of an embedding layer, three stacked bidirectional LSTM layers with dropout, and dense layers ending in a three-way softmax for classification.

In [None]:
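model = Sequential()
# Map each token id to a learned 100-dimensional vector
model.add(Embedding(vocab_size, 100, input_length=max_length))
# The first two recurrent layers return full sequences so the next LSTM can consume them
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Dropout(0.2))
# The last recurrent layer returns one vector per tweet (both directions concatenated)
model.add(Bidirectional(LSTM(128)))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
# Softmax output: one probability per class
model.add(Dense(3, activation='softmax'))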

## Compiling the Model
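The model is compiled with the sparse categorical crossentropy loss, the Adam optimizer, and accuracy as the metric. This loss expects integer class labels, so the values in `class` are assumed to be 0, 1, and 2 rather than one-hot vectors.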

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
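
As a rough illustration of what this loss measures, the sketch below computes it by hand for two toy samples: for each sample it takes the negative log of the probability that the softmax output assigns to the true class, then averages. The function is a simplified stand-in written for this example; it omits details of the Keras implementation such as clipping probabilities away from zero.

```python
import math

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(probability assigned to the true class)."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

toy_labels = [0, 2]            # integer class ids, not one-hot vectors
toy_probs = [[0.7, 0.2, 0.1],  # softmax output for the first sample
             [0.1, 0.3, 0.6]]  # softmax output for the second sample
toy_loss = sparse_categorical_crossentropy(toy_labels, toy_probs)
```

The loss falls toward zero as the probability assigned to the correct class approaches 1.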

## Adding Early Stopping Callback
Early stopping helps limit overfitting by halting training once the validation accuracy has not improved for three consecutive epochs. With `restore_best_weights=True`, the model is rolled back to the weights from its best epoch.

In [None]:
early_stopping = EarlyStopping(monitor='val_accuracy', patience=3, restore_best_weights=True)
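
The patience rule can be traced by hand with a small simulation. The function below is a hypothetical, simplified model of the logic (strict improvement, no `min_delta`), applied to made-up validation accuracies; it is not the Keras source.

```python
def simulate_early_stopping(val_scores, patience=3):
    """Return (stop_epoch, best_epoch) as 0-based indices into val_scores."""
    best_score = float("-inf")
    best_epoch = -1
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:    # strict improvement resets the counter
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:  # patience exhausted: stop here
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch  # ran through every epoch

toy_val_accuracy = [0.60, 0.65, 0.64, 0.65, 0.63, 0.70]  # made-up values
stop_epoch, best_epoch = simulate_early_stopping(toy_val_accuracy)
```

In this toy run, training stops before the final value of 0.70 is ever reached. This is the trade-off of a small patience: a later improvement can be missed.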

## Model Summary
Let's inspect the architecture of the model.

In [None]:
model.summary()
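
The parameter counts reported by the summary can be cross-checked by hand. The sketch below applies the standard counting formulas for LSTM and dense layers to the layer sizes used in this notebook; the helper names are hypothetical. The embedding layer is left out because its size, `vocab_size * 100`, depends on the dataset, and dropout layers have no parameters.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    """The forward and backward LSTMs have separate weights."""
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

embedding_dim = 100
layer_params = {
    "bilstm_1": bidirectional_lstm_params(embedding_dim, 256),  # outputs 512 features
    "bilstm_2": bidirectional_lstm_params(512, 256),            # outputs 512 features
    "bilstm_3": bidirectional_lstm_params(512, 128),            # outputs 256 features
    "dense_1": dense_params(256, 128),
    "dense_2": dense_params(128, 64),
    "output": dense_params(64, 3),
}
non_embedding_total = sum(layer_params.values())
```

Most of the roughly three million non-embedding parameters sit in the recurrent layers, which is worth keeping in mind if the model overfits or trains slowly.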

## Training the Model
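We train for up to 20 epochs with a batch size of 64, using the test set as validation data. Because early stopping monitors this same data, the test metrics reported later will be somewhat optimistic.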

In [None]:
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, batch_size=64, callbacks=[early_stopping])
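
One way to keep the test set out of the early-stopping loop is to carve a separate validation subset out of the training data. The sketch below shows the idea with plain index lists; `three_way_split` and its fractions are hypothetical choices for illustration. In the notebook, a similar effect could be obtained with a second `train_test_split` call on `X_train` and `y_train`, or with the `validation_split` argument of `model.fit`.

```python
import random

def three_way_split(n_samples, val_fraction=0.1, test_fraction=0.2, seed=42):
    """Shuffle indices 0..n-1 and cut them into train, validation, and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_samples * test_fraction)
    n_val = round(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
```

Early stopping would then monitor the validation subset, leaving the test indices untouched until the final evaluation.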

## Evaluating the Model
Finally, we'll evaluate the model on the test data to calculate its accuracy and loss.

In [None]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Loss: {loss}, Accuracy: {accuracy}')
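
Overall accuracy can hide weak performance on individual classes. As a possible follow-up, the sketch below builds a confusion matrix and per-class recall from toy labels. The helper names are hypothetical; in the notebook, the predicted ids could presumably be obtained with `model.predict(X_test).argmax(axis=1)`.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

toy_true = [0, 0, 1, 1, 2, 2]  # toy labels, not model output
toy_pred = [0, 1, 1, 1, 2, 0]
toy_matrix = confusion_matrix(toy_true, toy_pred)
toy_recall = per_class_recall(toy_matrix)
```

In this toy case the overall accuracy is 4 out of 6, yet classes 0 and 2 are each recognized only half the time.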