The practice is performed following this tutorial: https://www.makeuseof.com/create-sentiment-analysis-model/, which used Trip Advisor Hotel Reviews dataset from Kaggle to build the sentiment analysis model

Dataset used to train the model in this practice: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset, which is is a Tweet Polarity dataset that is intented for sentiment analysis

In [1]:
! pip install tensorflow scikit-learn pandas numpy pickle5



In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
import pickle5 as pickle

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Loading the Dataset

In [4]:
# Load dataset
df_test = pd.read_csv("/content/drive/MyDrive/Sentiment Analysis Dataset/test.csv", encoding='latin1')
df_train = pd.read_csv("/content/drive/MyDrive/Sentiment Analysis Dataset/train.csv", encoding='latin1')

# Select only 'text' and 'sentiment' columns
df_test = df_test[['text', 'sentiment']]
df_train = df_train[['text', 'sentiment']]

In [5]:
# Display the first 5 rows of the datasets
print(df_test.head())
print(df_train.head())

                                                text sentiment
0  Last session of the day  http://twitpic.com/67ezh   neutral
1   Shanghai is also really exciting (precisely -...  positive
2  Recession hit Veronique Branquinho, she has to...  negative
3                                        happy bday!  positive
4             http://twitpic.com/4w75p - I like it!!  positive
                                                text sentiment
0                I`d have responded, if I were going   neutral
1      Sooo SAD I will miss you here in San Diego!!!  negative
2                          my boss is bullying me...  negative
3                     what interview! leave me alone  negative
4   Sons of ****, why couldn`t they put them on t...  negative


## Data Preprocessing

In [7]:
# Check for missing values in the 'text' column
missing_values_train = df_train['text'].isnull().sum()
print("Number of missing values in 'text' column of training dataset:", missing_values_train)
missing_values_test = df_test['text'].isnull().sum()
print("Number of missing values in 'text' column of testing dataset:", missing_values_test)

# Drop rows with missing values
df_train = df_train.dropna(subset=['text'])
df_test = df_train.dropna(subset=['text'])

Number of missing values in 'text' column of training dataset: 1
Number of missing values in 'text' column of testing dataset: 1281


In [8]:
# Tokenization and Padding
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(df_train['text'])
word_index = tokenizer.word_index
sequences_train = tokenizer.texts_to_sequences(df_train['text'])
sequences_test = tokenizer.texts_to_sequences(df_test['text'])
padded_sequences_train = pad_sequences(sequences_train, maxlen=100, truncating='post')
padded_sequences_test = pad_sequences(sequences_test, maxlen=100, truncating='post')

In [9]:
# Convert sentiment labels to one-hot encoded vectors
train_sentiment_labels = pd.get_dummies(df_train['sentiment']).values
test_sentiment_labels = pd.get_dummies(df_test['sentiment']).values

In [10]:
# Split data into features and labels
x_train = padded_sequences_train
y_train = train_sentiment_labels
x_test = padded_sequences_test
y_test = test_sentiment_labels

## Creating and Training the Neural Network

In [11]:
# Creating the Neural Network
model = Sequential()
model.add(Embedding(5000, 100, input_length=100))
model.add(Conv1D(64, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 100)          500000    
                                                                 
 conv1d (Conv1D)             (None, 96, 64)            32064     
                                                                 
 global_max_pooling1d (Glob  (None, 64)                0         
 alMaxPooling1D)                                                 
                                                                 
 dense (Dense)               (None, 32)                2080      
                                                                 
 dropout (Dropout)           (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 3)                 99        
                                                        

In [12]:
# Training the Neural Network
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7a49098bfdf0>

In [13]:
# Evaluating the Performance of the Trained Model
from sklearn.metrics import f1_score

y_pred = np.argmax(model.predict(x_test), axis=-1)
y_true = np.argmax(y_test, axis=-1)
# Calculate accuracy score
print("Accuracy:", accuracy_score(y_true, y_pred))
# Calculate F1-score
print("F1-score:", f1_score(y_true, y_pred, average='macro'))

Accuracy: 0.9828238719068413
F1-score: 0.9832633729384904


In [14]:
# Saving the Model
model.save('my_sentiment_analysis_model.h5')
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

  saving_api.save_model(


## Using the Model to Classify the Sentiment of Given Text



In [34]:
# Load the saved model and tokenizer
import keras

model = keras.models.load_model('my_sentiment_analysis_model.h5')
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

In [55]:
# Define a function to predict the sentiment of input text
def predict_sentiment(text, model, tokenizer):
    # Tokenize and pad the input text
    text_sequence = tokenizer.texts_to_sequences([text])
    text_sequence = pad_sequences(text_sequence, maxlen=100)

    # Make a prediction using the trained model
    predicted_rating = model.predict(text_sequence)[0]

    # Map the predicted sentiment index to its corresponding label
    sentiment_mapping = {0: 'negative', 1: 'neutral', 2: 'positive'}
    predicted_index = np.argmax(predicted_rating)
    predicted_sentiment_label = sentiment_mapping[predicted_index]

    return predicted_index, predicted_sentiment_label

In [64]:
positive_texts = [
    "I loved the new book. It's amazing!",
    "The weather today is beautiful.",
    "I'm feeling great today.",
    "I had a fantastic time at the party last night!",
    "I'm really excited about the upcoming event.",
    "The customer service was excellent!",
    "I feel so happy right now.",
    "The meeting went well.",
    "The hotel room was clean and comfortable.",
    "I'm proud of my achievements."
]
print(len(positive_texts))

10


In [65]:
for text in positive_texts:
    predicted_index, predicted_sentiment = predict_sentiment(text, model, tokenizer)
    print("Text:", text)
    print("Predicted Sentiment Index:", predicted_index)
    print("Predicted Sentiment Label:", predicted_sentiment)
    print()

Text: I loved the new book. It's amazing!
Predicted Sentiment Index: 2
Predicted Sentiment Label: positive

Text: The weather today is beautiful.
Predicted Sentiment Index: 2
Predicted Sentiment Label: positive

Text: I'm feeling great today.
Predicted Sentiment Index: 2
Predicted Sentiment Label: positive

Text: I had a fantastic time at the party last night!
Predicted Sentiment Index: 2
Predicted Sentiment Label: positive

Text: I'm really excited about the upcoming event.
Predicted Sentiment Index: 2
Predicted Sentiment Label: positive

Text: The customer service was excellent!
Predicted Sentiment Index: 2
Predicted Sentiment Label: positive

Text: I feel so happy right now.
Predicted Sentiment Index: 2
Predicted Sentiment Label: positive

Text: The meeting went well.
Predicted Sentiment Index: 1
Predicted Sentiment Label: neutral

Text: The hotel room was clean and comfortable.
Predicted Sentiment Index: 1
Predicted Sentiment Label: neutral

Text: I'm proud of my achievements.
Pred

In [58]:
negative_texts = [
    "The service at the restaurant was terrible.",
    "The traffic was awful this morning.",
    "The food tasted awful.",
    "I'm so disappointed with the product quality.",
    "The flight got delayed again.",
    "I can't stand this waiting.",
    "I'm tired of all this drama.",
    "I'm annoyed by all the noise outside.",
    "The internet connection is so slow.",
    "I'm so frustrated with this project."
]
print(len(negative_texts))

10


In [59]:
for text in negative_texts:
    predicted_index, predicted_sentiment = predict_sentiment(text, model, tokenizer)
    print("Text:", text)
    print("Predicted Sentiment Index:", predicted_index)
    print("Predicted Sentiment Label:", predicted_sentiment)
    print()

Text: The service at the restaurant was terrible.
Predicted Sentiment Index: 0
Predicted Sentiment Label: negative

Text: The traffic was awful this morning.
Predicted Sentiment Index: 0
Predicted Sentiment Label: negative

Text: The food tasted awful.
Predicted Sentiment Index: 0
Predicted Sentiment Label: negative

Text: I'm so disappointed with the product quality.
Predicted Sentiment Index: 1
Predicted Sentiment Label: neutral

Text: The flight got delayed again.
Predicted Sentiment Index: 0
Predicted Sentiment Label: negative

Text: I can't stand this waiting.
Predicted Sentiment Index: 0
Predicted Sentiment Label: negative

Text: I'm tired of all this drama.
Predicted Sentiment Index: 0
Predicted Sentiment Label: negative

Text: I'm annoyed by all the noise outside.
Predicted Sentiment Index: 0
Predicted Sentiment Label: negative

Text: The internet connection is so slow.
Predicted Sentiment Index: 0
Predicted Sentiment Label: negative

Text: I'm so frustrated with this project.


In [60]:
neutral_texts = [
    "I am going to the store to buy some groceries.",
    "The meeting is scheduled for 2 PM in the conference room.",
    "I need to finish this report by the end of the day.",
    "The new software update includes several bug fixes and improvements.",
    "I'm planning to take a vacation next month.",
    "The book I'm reading is quite interesting.",
    "I'm going to try out a new recipe for dinner tonight.",
    "I'm considering joining a yoga class to improve my flexibility.",
    "The Industrial Revolution marked a significant turning point in human history.",
    "The discovery of penicillin by Alexander Fleming revolutionized the field of medicine."
]
print(len(neutral_texts))

10


In [61]:
for text in neutral_texts:
    predicted_index, predicted_sentiment = predict_sentiment(text, model, tokenizer)
    print("Text:", text)
    print("Predicted Sentiment Index:", predicted_index)
    print("Predicted Sentiment Label:", predicted_sentiment)
    print()

Text: I am going to the store to buy some groceries.
Predicted Sentiment Index: 1
Predicted Sentiment Label: neutral

Text: The meeting is scheduled for 2 PM in the conference room.
Predicted Sentiment Index: 1
Predicted Sentiment Label: neutral

Text: I need to finish this report by the end of the day.
Predicted Sentiment Index: 1
Predicted Sentiment Label: neutral

Text: The new software update includes several bug fixes and improvements.
Predicted Sentiment Index: 1
Predicted Sentiment Label: neutral

Text: I'm planning to take a vacation next month.
Predicted Sentiment Index: 1
Predicted Sentiment Label: neutral

Text: The book I'm reading is quite interesting.
Predicted Sentiment Index: 2
Predicted Sentiment Label: positive

Text: I'm going to try out a new recipe for dinner tonight.
Predicted Sentiment Index: 1
Predicted Sentiment Label: neutral

Text: I'm considering joining a yoga class to improve my flexibility.
Predicted Sentiment Index: 2
Predicted Sentiment Label: positive


Based on the results, the model is struggling with identifying neutral texts, especially differentially between positive and neutral texts.

In the positive text set, 8/10 texts are labeled correctly, with the incorrect one labeled as neutral. In the negative text set, 9/10 texts are labeled correctly, with the incorrect one labeled as neutral. In the neutral text set, 8/10 texts are labeled correctly, and the remaining 2 texts are mislabeled as positive.

When analyzing the dataset, I found that it contains more neutral tweets (40.5% of the tweets are labeled neutral). The trained model has decent accuracy score and F1-score, but it struggles to accurately identify neutral input text. This may be due to some problems with the quality of the dataset and the labeling process for neutral tweets.