<a href="https://colab.research.google.com/github/Rtsodzai/ICE1_KNN/blob/main/RNN_Part_1_POE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**TASK 1 – NATURAL LANGUAGE PROCESSING USING A RECURRENT NEURAL NETWORK**

**What Is Recurrent Neural Network**

A Recurrent Neural Network (RNN) is a type of artificial neural network designed for processing sequential data (Jayawardhana, 2020;Research Graph,2024). Unlike traditional neural networks, RNNs can take input data in a sequence and maintain a memory of previous inputs, which makes them particularly well-suited for tasks involving time-series data, natural language processing, and speech (Jayawardhana, 2020;Research Graph,2024).

RNNs have "memory," which allows them to carry information from one part of a sequence to another, effectively learning from the context of the entire sequence rather than just the individual input at each step (Donges, 2024). The ability to maintain information from previous steps in the sequence is the key feature which differenciates between RNNs and other neural networks (Donges, 2024).

The basic structure of an RNN is a loop that allows information to be passed from one step to the next in a sequence. Each time the network processes an element of the sequence, it updates its hidden state (the memory of what it has seen so far) and outputs a prediction based on both the current input and its hidden state.

* **Input:** A sequence of data (e.g., words in a sentence, stock prices over time).
* **Hidden State:** A set of values that capture the memory of the RNN from previous steps.
* **Output:** The prediction or decision at each step in the sequence.

**Common disadvantages of RNN**

According to Donges,2024 below are some common challenges with RNNs which makes it difficult for basic RNNs to remember long-term dependencies in the data:

* Vanishing gradients: During training, RNNs use backpropagation to update weights. However, as the sequence gets longer, the gradients (which indicate how much to update the weights) can become very small (vanish), making it difficult to learn long-term dependencies. This was solved through the concept of LSTM by Sepp Hochreiter and Juergen Schmidhuber.

* Exploding gradients: Conversely, gradients can sometimes become too large, causing unstable weight updates and making the training process difficult.This problem can be easily solved by truncating or squashing the gradients.

* Complex training process: Because RNNs process data sequentially, this can result in a tedious training process.

* Difficulty with long sequences: The longer the sequence, the harder RNNs must work to remember past information.

* Inefficient methods: RNNs process data sequentially, which can be a slow and inefficient approach.  

To address the issues with basic RNNs, especially the vanishing gradient problem, a more advanced type of RNN called Long Short-Term Memory (LSTM) was developed. LSTMs are capable of learning long-term dependencies by using a more complex architecture that carefully regulates the flow of information.

**Brief Description of the Dataset**

The dataset consists of two parts: a training set with 74,682 entries and a validation set with 1510
entries. Each entry includes a tweet ID, entity, sentiment label, and tweet content

**Why the Dataset is Appropriate for Text Processing**

1. **Sequential nature of textual data**

The core reason for using RNN is that its a good fit for analyzing this dataset because of its sequential nature of textual data. When processing natural language (such as Twitter messages), the meaning of each word depends not only on the word itself but also on the surrounding words in the sentence. This means that the sentiment of the message regarding a given entity can only be understood by considering the order of the words and the context in which they appear.

2. **Entity-Level Analysis**

The dataset requires sentiment analysis at the entity level, meaning that the task is not just about understanding the general sentiment of a message but also the sentiment specifically about a given entity. For example, a tweet like:

* "Microsoft's new Windows OS is great, but their customer service is disappointing."

* has both positive sentiment about "Windows" and negative sentiment about "customer service."

* RNNs, particularly LSTMs, can be trained to focus on the parts of the message that pertain to the entity being analyzed. The network processes each word in the sequence and can "remember" important aspects from earlier in the text that relate to the entity. This memory allows the model to correctly associate sentiment with the relevant entity, even when the sentence is long or contains multiple sentiments.

3. **Handling Long and Short Sentences**

Social media data, especially tweets, can be highly variable in length. Some tweets may be short, while others are longer and more complex.

4. **Class Imbalance (Positive, Negative, Neutral)**

The dataset contains three sentiment classes: Positive, Negative, and Neutral. One challenge in sentiment analysis is that the distribution of these classes may be imbalanced. RNNs, with proper tuning and regularization, can handle imbalanced data relatively well.

5. **Contextual Understanding for Sentiment**

In sentiment analysis, understanding context is crucial. Words or phrases can carry different sentiments depending on the context in which they are used. RNNs, and especially LSTMs, excel in contextual understanding because they can maintain information from earlier in the sentence and use it to interpret the meaning of words or phrases that appear later. This is critical for accurately determining the sentiment in the dataset, where different tweets may express sentiment in nuanced ways.

6. **RNNs and Multiclass Classification**

In this dataset, the task is to classify each tweet as either
positive, negative, or neutral regarding a specific entity. RNNs, especially with LSTM layers, are powerful tools for multi-class classification tasks because they can learn to differentiate between subtle differences in text that correspond to each class.

**Explanation of the Analysis to Be Performed and the Aim**

**Key Question:**
Can the RNN with LSTM accurately classify tweets based on their sentiment?

**Analysis to Be Performed on the Dataset**

The analysis of the entity-level sentiment analysis dataset of Twitter will focus on building a machine learning model, specifically a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units, to predict the sentiment of tweets related to a specific entity. The goal is to classify each tweet as positive, negative, or neutral regarding the entity mentioned in the tweet.

Here is a step-by-step breakdown of the analysis:

**Data Preprocessing, Exploration & Model building.**

The objective is to prepare the raw data for input into the RNN and explore the dataset to understand its structure, distribution, and potential issues.

* Loading the Data: Import the dataset into a Spark DataFrame, verify its structure, and inspect the first few rows.
* Assigning Column Names: Assign the correct column names to ensure the data is readable: Tweet_ID, Entity, Sentiment, and Text.
* Data Cleaning: Clean the text by removing unnecessary characters (e.g., URLs, punctuation, emojis, special characters), converting all text to lowercase, and handling missing or incomplete data.
* Analyze the distribution of sentiments across the dataset (positive, negative, neutral).
* Explore the most frequent entities and words.
* Visualize word distributions using word clouds or frequency plots.
* Tokenization: Break down the text into individual tokens (words) and convert them into sequences of integers (for input into the neural network).
* Padding: Ensure all sequences are of the same length by padding shorter sequences with zeros.
* Build and train an RNN with LSTM layers to classify the sentiment of each tweet.
* Ensure the model generalizes well to new, unseen data.
* Optimize the model’s performance by tuning its hyperparameters.
* Evaluate the performance of the model and interpret the results.

**Reference of Datasource**

Datasource Kaggle [Twitter Sentiment Analysis Dataset](https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis)

# **Setup and Import Relevant Library**

# **Install Spark**

In [None]:
!pip install pyspark

In [None]:
import numpy as np
import pandas as pd

# **Download Dataset From Kaggle**

In [None]:
! kaggle datasets list
# Copy API command from Kaggle
!kaggle datasets download -d jp797498e/twitter-entity-sentiment-analysis

# **Load Dataset**

In [None]:
import os
import zipfile
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("TwitterSentimentAnalysis").getOrCreate()

# Path to zip file
zip_file_path = 'twitter-entity-sentiment-analysis.zip'

# Creating a temporary directory to extract files
temp_dir = 'temp_twitter_data'
os.makedirs(temp_dir, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

# List the extracted files
extracted_files = os.listdir(temp_dir)
print("Extracted files: ", extracted_files)


In [None]:
# Define column names
column_names = ['Tweet_ID', 'Entity', 'Sentiment', 'Text']

# Loading dataset
training_file_path = os.path.join(temp_dir, 'twitter_training.csv')
validation_file_path = os.path.join(temp_dir, 'twitter_validation.csv')

# Load the CSV file into a Spark DataFrame
df_training = spark.read.csv(training_file_path, header=False, inferSchema=True).toDF(*column_names)
df_validation = spark.read.csv(validation_file_path, header=False, inferSchema=True).toDF(*column_names)

# Display the first 5 rows to confirm it's loaded correctly
df_training.show(5)

# Display the first 5 rows to confirm it's loaded correctly
df_validation.show(5)


# **Exploratory Data Analysis (EDA)**

In [None]:
# Basic Information checking the schema.
#Essential to verify the structure of the dataset to ensure the data has been #loaded correctly.Once the schema has been verified and confirm the data types #are correct,l will proceed to data cleaning and preprocessing.

df_training.printSchema()
df_validation.printSchema()

In [None]:
# Checking total number of rows (records) in the Training DataFrame
df_training.count()

In [None]:
# Checking total number of rows (records) in the Validation DataFrame
df_validation.count()

# **Data Distribution and Imbalance**

In [None]:
#Sentiment Distribution: Check the distribution of sentiment classes (e.g., #positive, negative, neutral,Irrelevant) to see if there is any imbalance.
df_training.groupBy("Sentiment").count().show()

The sentiment distribution in the training data reveals that the dataset is relatively balanced across
the Positive, Neutral, and Negative sentiment categories, with each category having a substantial
number of entries. The Negative sentiment has the highest count, followed by Positive, and then
Neutral. The Irrelevant category, while still significant, has noticeably fewer entries compared
to the other sentiments. This distribution indicates that the dataset provides a good variety of
sentiment classes for training a sentiment analysis model.

However, the irrelevant sentiment suggests that some tweets in the dataset do not express a sentiment related to the target entities thus marked as irrelevant. As such l will remove the irrelavant sentiment from the dataset.

In [None]:
#Removing irrelavant sentiment from training dataset
df_training = df_training.filter(df_training.Sentiment != 'Irrelevant')

In [None]:
#Removing irrelavant sentiment from training dataset
df_validation = df_validation.filter(df_validation.Sentiment != 'Irrelevant')

In [None]:
#Verifying removal
df_training.groupBy("Sentiment").count().show()
df_validation.groupBy("Sentiment").count().show()

Validation dataset contains a lot of noisy and irrelevant sentiment values. To focus only on Positive, Negative, and Neutral sentiments, l will filter the dataset to remove the irrelevant entries.

In [None]:
# Filter the validation dataset to keep only rows with valid sentiments
valid_sentiments = ["Positive", "Negative", "Neutral"]
df_validation_clean = df_validation.filter(df_validation.Sentiment.isin(valid_sentiments))

# Show the counts of the filtered validation dataset
df_validation_clean.groupBy("Sentiment").count().show()

**Creating a Bar plot to visualize sentiments**

In [None]:
# Group by sentiment and count for both datasets
sentiment_counts_training = df_training.groupBy("Sentiment").count().toPandas()
sentiment_counts_validation_clean = df_validation_clean.groupBy("Sentiment").count().toPandas()

import matplotlib.pyplot as plt
import seaborn as sns

# Set up the figure size
plt.figure(figsize=(10, 6))

# Plot for training dataset
plt.subplot(1, 2, 1)  # 1 row, 2 columns, 1st plot
sns.barplot(x='Sentiment', y='count', data=sentiment_counts_training, palette="viridis")
plt.title("Training Dataset Sentiment Distribution")
plt.xlabel("Sentiment")
plt.ylabel("Count")

# Plot for validation dataset
plt.subplot(1, 2, 2)  # 1 row, 2 columns, 2nd plot
sns.barplot(x='Sentiment', y='count', data=sentiment_counts_validation_clean, palette="magma")
plt.title("Cleaned Validation Dataset Sentiment Distribution")
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.show()

# Show the plots
plt.tight_layout()
plt.show()


I am happy the "Positive," "Neutral," and "Negative" classes are fairly balanced in terms of the number of records, which is ideal for training a machine learning model.
This balanced distribution means that the model is less likely to be biased towards any one sentiment.

In [None]:
#Entity Distribution:
df_training.groupBy("Entity").count().orderBy("count", ascending=False).show(10)

This helps me to understand which entities are most commonly associated with sentiment in your dataset. In analysis l will not remove any entity.

# Text Analysis: Word Count

In [None]:
from pyspark.sql.functions import split, size, col

# Correctly split the text based on spaces between words
df_training = df_training.withColumn("Word_Count", size(split(col("Text"), " ")))

# Group by word count, count occurrences, and order by word count
df_training.groupBy("Word_Count").count().orderBy("Word_Count").show()


There is a row with a word count of -1, which is incorrect because word counts should always be positive. This could be caused by missing or incorrectly formatted text data in the Text column (e.g., empty strings or null values). I will have to clean the Text column by filtering out or handling rows that might be causing this issue.

In [None]:
from pyspark.sql.functions import when, length

# Replace empty strings or nulls with 0 word count
df_training = df_training.withColumn(
    "Word_Count",
    when(length(col("Text")) == 0, 0)
    .otherwise(size(split(col("Text"), " ")))
)

# Filter out negative word counts if they still exist
df_training = df_training.filter(col("Word_Count") >= 0)

#group by word count and display the results
df_training.groupBy("Word_Count").count().orderBy("Word_Count").show(20)

Interpretation: 2045 tweets contains 1 word 1575 tweets contains 2 words and so on. This analysis gives me an insight into the length distribution of the texts in my dataset, which is helpful for tasks like padding sequences before feeding them into a neural network model.

# Text Analysis: Tweet Length

In [None]:
from pyspark.sql.functions import length

df_training = df_training.withColumn("Tweet_Length", length("Text"))
df_training.groupBy("Tweet_Length").count().orderBy("Tweet_Length").show()

In [None]:
# Convert the Spark DataFrame into Pandas
pandas_df = df_training.groupBy("Tweet_Length").count().orderBy("Tweet_Length").toPandas()

# Bar chart
plt.figure(figsize=(12, 6))
plt.bar(pandas_df['Tweet_Length'], pandas_df['count'], color='skyblue')
plt.xlabel('Tweet Length (characters)')
plt.ylabel('Count')
plt.title('Distribution of Tweet Lengths')
plt.xticks(rotation=90)  # Rotate x-axis labels for better visibility
#plt.tight_layout()
plt.show()


# Text Analysis: Frequent Words:

In [None]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark.sql.functions import explode, col

# Drop rows with null values in the "Text" column
df_training = df_training.na.drop(subset=["Text"])

tokenizer = Tokenizer(inputCol="Text", outputCol="Words")
wordsData = tokenizer.transform(df_training)

remover = StopWordsRemover(inputCol="Words", outputCol="Filtered_Words")
filteredData = remover.transform(wordsData)

filteredData.select(explode(col("Filtered_Words")).alias("Word")).groupBy("Word").count().orderBy("count", ascending=False).show(20)

Checking for Missing Values

In [None]:
from pyspark.sql.functions import col, sum

df_training.select([sum(col(c).isNull().cast("int")).alias(c) for c in df_training.columns]).show()

In [None]:
df_validation_clean.select([sum(col(c).isNull().cast("int")).alias(c) for c in df_validation_clean.columns]).show()

# Average Tweet Length by Sentiment:

In [None]:
df_training.groupBy("Sentiment").avg("Tweet_Length").show()

* General Interpretation:

The shorter average length of **positive** tweets suggests that positive emotions are often expressed briefly and directly.
**Neutral** tweets being the longest indicates that neutral statements may require more elaboration to convey the full context or objective information.
**Negative** tweets are relatively long but not as lengthy as neutral ones, likely due to the need for explanation or venting, but they may still be shorter than neutral tweets as users may prioritize emotional expression over detail.


This analysis help guide NLP models, as tweet length could correlate with the complexity of sentiment, influencing how to structure a recurrent neural network (RNN) or Long Short-Term Memory (LSTM) model for sentiment classification.

# Frequent Words by Sentiment:

In [None]:
filteredData.select("Sentiment", explode(col("Filtered_Words")).alias("Word")).groupBy("Sentiment", "Word").count().orderBy("Sentiment", "count", ascending=False).show(20)

# **Data Preprocessing For RNN**

I will prepare the text data by tokenizing and transforming it into numerical sequences that can be the input into the LSTM model.

To develop a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) layers for sentiment analysis task, we'll break down the process into manageable steps. I will proceed with the PySpark DataFrame for data preparation and the Keras library for building and training the LSTM model.

Below is the step-by-step guide for that.

# **Tokenization and Padding**

We will use Keras' Tokenizer to convert the text data into sequences and pad them to ensure all inputs to the model have the same length.

**Quick Summary of the Process:**

The tweets will first be converted into sequences of integers using a tokenizer.

These sequences are then padded to ensure uniform input length.

Sentiments are then mapped to numerical values and one-hot encoded to make them suitable for training a classification model (like a Recurrent Neural Network).

These preprocessed inputs (padded_sequences) and outputs (y_train) will be used to train an RNN model (using LSTM) for sentiment classification.

In [None]:
!pip install keras-preprocessing
!pip install tensorflow
!pip install keras

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical


# Extract the text and sentiment columns from Spark DataFrame and convert to Pandas
training_data = df_training.select("Text", "Sentiment").toPandas()

# Tokenizer - fit on the training data text
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(training_data['Text'])

# Convert the text data to sequences of integers
sequences = tokenizer.texts_to_sequences(training_data['Text'])

# Pad the sequences to ensure uniform input length
max_len = 50  # Define max length for padding
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')

# Convert Sentiment to numerical labels
sentiment_mapping = {'Neutral': 0, 'Positive': 1, 'Negative': 2}
training_data['Sentiment'] = training_data['Sentiment'].map(sentiment_mapping)

# Convert the numerical labels to one-hot encoded vectors
y_train = to_categorical(training_data['Sentiment'], num_classes=3)

In [None]:
#Verify Data Shapes:
print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)

# **Split Data into Training and Validation Sets**

Will split the data into training and validation sets for model training.

In [None]:
from sklearn.model_selection import train_test_split

# Define features (padded sequences) and labels (sentiment)
X = padded_sequences
y = training_data['Sentiment'].values

# Convert labels to one-hot encoding
y_one_hot = to_categorical(y)

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y_one_hot, test_size=0.2, random_state=42)

In [None]:
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)

# **Build the LSTM Model**

Next, will build the LSTM model using Keras. Here's how to develop an LSTM-based RNN for sentiment analysis.



In [None]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

# Build the model
model = Sequential()

# Add an embedding layer (10000 words and embedding size of 128)
model.add(Embedding(input_dim=10000, output_dim=128))

# Add LSTM layer
model.add(LSTM(128, return_sequences=False))

# Add a Dropout layer to prevent overfitting
model.add(Dropout(0.5))

# Add output layer (Softmax for multi-class classification)
model.add(Dense(3, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()


# **Train the Model**

Training the model using the training set.

In [None]:
# Train the LSTM model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val))

# **Evaluate the Model**

Once the model is trained, now l can evaluate its performance on the validation set.

In [None]:
# Evaluate the model on the validation set
loss, accuracy = model.evaluate(X_val, y_val)
print(f"Validation Accuracy: {accuracy * 100:.2f}%")

# **Plot the Training History**

Now l want to visualize the performance, l will plot the accuracy and loss over epochs.

This plot of training and validation loss/accuracy curves will help me to understand better how my model is performing over epochs.

In [None]:
import matplotlib.pyplot as plt

# Plot accuracy
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='val accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

# Plot loss
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.title('Model Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

# **Address Overfitting**

To mitigate overfitting:
* I use Dropout: Already applied a 50% dropout layer.
* Now adding Early Stopping: Stop training when validation loss stops improving.


In [None]:
from keras.callbacks import EarlyStopping

# Use EarlyStopping to stop training when validation loss plateaus
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Re-train the model with early stopping
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_val, y_val), callbacks=[early_stopping])

# **Confusion Matrix and Classification Report**

To evaluate the model performance further, i will create a confusion matrix and classification report.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# Predict on the validation set
y_pred = (model.predict(X_val) > 0.5).astype("int32")

# Convert multi-label predictions to single-label predictions
y_pred_single = np.argmax(y_pred, axis=1)

# Convert multi-label true values to single-label true values
y_val_single = np.argmax(y_val, axis=1)

# Confusion Matrix
conf_matrix = confusion_matrix(y_val_single, y_pred_single)
print("Confusion Matrix:\n", conf_matrix)

# Classification Report
class_report = classification_report(y_val_single, y_pred_single)
print("Classification Report:\n", class_report)

The model demonstrates strong performance in sentiment classification with an overall accuracy of 88%. It achieves high precision, recall, and F1-scores across all sentiment categories: Neutral, Positive, and Negative. The Neutral class has slightly lower precision compared to Positive and Negative, but overall, the model effectively distinguishes between sentiments with robust metrics. The classification report and confusion matrix indicate that the model handles class imbalance well, providing reliable predictions for each sentiment category. This performance reflects a well-tuned model capable of accurately interpreting sentiment in tweets.

I am happy with the performance of the model on Training dataset, now l am going to use my Validation dataset to see how the model will perform on unseen data.

In [None]:
df_validation_clean.columns

# **Processing the validation Dataset**

In [None]:
# Extract text from validation DataFrame
validation_data = df_validation_clean.select("Text").toPandas()

# Convert validation text to sequences using the same tokenizer
validation_sequences = tokenizer.texts_to_sequences(validation_data['Text'])

# Pad the sequences
padded_validation_sequences = pad_sequences(validation_sequences, maxlen=max_len, padding='post')

In [None]:
# Predict sentiment on the validation data
predictions = model.predict(padded_validation_sequences)

In [None]:
# Convert predictions from one-hot encoded vectors to class labels
predicted_labels = np.argmax(predictions, axis=1)

Computing evaluation metrics such as accuracy, confusion matrix, and classification report for the validation dataset.

In [None]:
# Get true labels
true_labels = df_validation_clean.select("Sentiment").toPandas()
true_labels = true_labels['Sentiment'].map(sentiment_mapping).values

# Evaluate the model
conf_matrix = confusion_matrix(true_labels, predicted_labels)
class_report = classification_report(true_labels, predicted_labels)

print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)

The model demonstrates strong performance on the validation dataset, achieving an overall accuracy of 92%. The confusion matrix reveals that the model accurately classifies sentiments with minimal misclassifications, showing particularly high precision and recall for each sentiment category. Specifically, precision and recall values for `Neutral`, `Positive`, and `Negative` sentiments are consistently above 90%, with F1-scores also reflecting robust performance. These results indicate that the model effectively generalizes to new data, correctly identifying sentiments with high reliability and balance across the different categories.

**Comparison Between Training and Validation Results**

The model exhibits a notable improvement in performance on the validation dataset compared to the training data. On the training set, the accuracy achieved was approximately 86%, with a confusion matrix reflecting some challenges in distinguishing between sentiment classes, particularly with lower precision and recall in certain categories. In contrast, the validation results show a substantial accuracy of 92%, with higher precision, recall, and F1-scores across all sentiment classes. This indicates that the model generalizes well to unseen data, achieving a more balanced and effective classification. The substantial enhancement in validation performance suggests that the model is robust and has learned to handle variations in data beyond the training set, avoiding overfitting and demonstrating strong predictive capabilities.

# **Conclusion**

The analysis and model development for the sentiment classification of tweets have demonstrated successful results, highlighting the effectiveness of a Long Short-Term Memory (LSTM) network in handling text data. After preprocessing the dataset, including tokenization and padding, the LSTM model was trained and validated with a focus on accuracy and generalization. The model achieved a commendable performance on the training set, with an accuracy of 88%, indicating that it effectively learned the patterns in the training data. The confusion matrix and classification report revealed that the model initially struggled with distinguishing certain sentiment classes, leading to suboptimal precision and recall for some categories.

However, the model's performance significantly improved on the validation set, where it reached an accuracy of 92%. The validation results showed enhanced precision, recall, and F1-scores across all sentiment classes, demonstrating the model's robust ability to generalize to new, unseen data. This indicates that the model is well-calibrated and not overfitted, making it reliable for practical applications. Overall, the successful application of LSTM in this sentiment analysis task underscores its capability to effectively capture and interpret complex text data, providing valuable insights into sentiment classification for real-world use.

# **Reference List**

1. Research Gate.(2024) *An Introduction to Recurrent Neural Networks(RNNs)*. Available at:https://medium.com/@researchgraph/an-introduction-to-recurrent-neural-networks-rnns-802fcfee3098. (Accessed 01 September 2024).

2.
Jayawardhana,S.(2020) *Sequence Models & Recurrent Neural Networks (RNNs)*. Available at:https://towardsdatascience.com/sequence-models-and-recurrent-neural-networks-rnns-62cadeb4f1e1 (Accessed 01 September 2024).

3. Donges,N. (2024) *What are Recurrent Nueral Networks (RNNs)*. Available at:https://builtin.com/data-science/recurrent-neural-networks-and-lstm. (Accessed September 02 2024).

# **Instructions To Run Code**

1. Install required libraries. Can be installed using pip.
2. Check tensorflow installation. Use code import tensorflow as tf
print(tf.__version__)
3. Data Preparation
4. Handling error - when encountering errors, check the data shapes and types. Ensure that labels are in the correct format.

# **Save the Model**

In [None]:
model.save('lstm_sentiment_model.h5')