# Sentiment Analysis on Amazon Product Reviews

## 1. Dataset Overview
- **Dataset Description**:
  - Analyze an Amazon product review dataset containing textual reviews (`reviewText`) and corresponding sentiment labels (`Positive`).
  - Sentiment is binary: 1 for positive, 0 for negative.
- **Objective**:
  - Predict the sentiment of a product review based on its textual content.


In [70]:
import re
import string
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Download the NLTK resources used during preprocessing
nltk.download('punkt')  # Tokenizer
nltk.download('wordnet')  # Lemmatizer
nltk.download('stopwords')  # Stopwords
nltk.download('omw-1.4')  # WordNet extensions for lemmatizer

warnings.filterwarnings("ignore")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\afifa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\afifa\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\afifa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\afifa\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [67]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from tensorflow.keras import regularizers
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

In [3]:
url = 'https://raw.githubusercontent.com/rashakil-ds/Public-Datasets/refs/heads/main/amazon.csv'
df = pd.read_csv(url)
df.head()

                                          reviewText  Positive
0  This is a one of the best apps acording to a b...         1
1  This is a pretty good version of the game for ...         1
2  this is a really cool game. there are a bunch ...         1
3  This is a silly game and can be frustrating, b...         1
4  This is a terrific game on any pad. Hrs of fun...         1


## 2. Data Preprocessing
- Handle missing values, if any.
- Perform text preprocessing on the `reviewText` column:
  - Convert text to lowercase.
  - Remove stop words, punctuation, and special characters.
  - Tokenize and lemmatize text data.
- Split the dataset into training and testing sets.


In [4]:
df = df.dropna(subset=['reviewText', 'Positive'])  
X = df['reviewText'].astype(str)
y = df['Positive']

In [5]:
df.isnull().sum()

reviewText    0
Positive      0
dtype: int64

In [6]:
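# Safeguard only: dropna above already removed rows with missing values, so these fills change nothing
df['reviewText'] = df['reviewText'].fillna('')
df['Positive'] = df['Positive'].fillna(0)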

In [7]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [8]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords and lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    # Join tokens back to a single string
    return ' '.join(tokens)

# Apply preprocessing to the reviewText column
df['reviewText'] = df['reviewText'].apply(preprocess_text)

In [9]:
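max_words = 2000  # vocabulary size cap for the Keras tokenizer
max_len = 50      # number of tokens per review after padding/truncation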

In [10]:
df.head()

                                          reviewText  Positive
0  one best apps acording bunch people agree bomb...         1
1  pretty good version game free lot different le...         1
2  really cool game bunch level find golden egg s...         1
3  silly game frustrating lot fun definitely reco...         1
4  terrific game pad hr fun grandkids love great ...         1


In [11]:
tokenizer = Tokenizer(num_words=max_words, oov_token="<nothing>")
tokenizer.fit_on_texts(df['reviewText'])

In [12]:
seq = tokenizer.texts_to_sequences(df['reviewText'])

In [39]:
# Pad (or truncate) every sequence to max_len tokens; zeros are appended at the end
padded_seq = pad_sequences(seq, maxlen=max_len, padding='post')

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(padded_seq, df['Positive'], test_size=0.3, random_state=24)
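
To make the effect of `max_len` and `padding='post'` concrete, here is a minimal pure-Python sketch of what padding and truncation do to integer sequences. `pad_post` is a hypothetical helper written for illustration, not the Keras implementation. It assumes a padding value of 0 and the Keras default `truncating='pre'`, under which sequences that are too long lose their leading tokens.

```python
def pad_post(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(..., padding='post').

    Assumes truncating='pre': sequences longer than maxlen keep their
    last maxlen tokens. Shorter ones get `value` appended on the right.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]               # drop leading tokens if too long
        kept += [value] * (maxlen - len(kept))   # right-pad if too short
        padded.append(kept)
    return padded


# Toy token-id sequences (hypothetical, not taken from the dataset)
toy_sequences = [[5, 8, 2], [7], [4, 9, 1, 3, 6, 2]]
for row in pad_post(toy_sequences, maxlen=4):
    print(row)
```

With post-padding, short reviews end in a run of zeros, and a recurrent layer reads those zeros last unless the padding is masked.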


## 3. Model Selection
- Choose at least three machine learning models for sentiment classification:
  - Statistical Models:
    - Logistic Regression
    - Random Forest
    - Support Vector Machine (SVM)
    - Naïve Bayes
    - Gradient Boosting (e.g., XGBoost, AdaBoost, CatBoost)
  - Neural Models:
    - LSTM (Long Short-Term Memory)
    - GRU (Gated Recurrent Unit)


In [14]:
log_reg = LogisticRegression()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
svm = SVC(kernel='linear', random_state=42)

In [15]:
nb = MultinomialNB()

In [45]:
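# LSTM
# Note: the tokenizer keeps only the top `max_words` tokens, so indices >= max_words
# never occur in the sequences; input_dim=max_words would also be sufficient here.
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 100
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))  # token ids -> dense vectors
model.add(LSTM(64, return_sequences=True))   # pass the full sequence to the next LSTM
model.add(LSTM(32, return_sequences=False))  # keep only the final output
model.add(Dense(1, activation='sigmoid'))    # probability of the positive class
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])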

In [None]:
model.summary()

## 4. Model Training
- Train each selected model on the training dataset.
- Utilize vectorization techniques for text data:
  - TF-IDF (Term Frequency-Inverse Document Frequency)
  - Word embeddings (e.g., Word2Vec, GloVe)


In [21]:
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
tfidf_features = tfidf_vectorizer.fit_transform(df['reviewText']).toarray()

In [41]:
x_train, x_test, y_trainn, y_testt = train_test_split(tfidf_features, df['Positive'], test_size=0.3, random_state=24)
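
The cells above fit the TF-IDF vectorizer on all reviews before splitting, so vocabulary and IDF statistics from the test reviews influence the training features. One possible alternative, sketched here on hypothetical toy data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so the vectorizer is fitted on the training portion only. The variable names (`texts`, `labels`, `pipe`, and so on) are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for the preprocessed reviews (hypothetical data)
texts = [
    "great game love fun", "terrible app waste money",
    "really cool game fun", "boring crash waste time",
    "love app great fun", "terrible crash boring app",
    "cool fun love game", "waste money terrible game",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, keeping the class ratio in both parts
raw_train, raw_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=24, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training text only
pipe = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
pipe.fit(raw_train, lab_train)

preds = pipe.predict(raw_test)
print(len(preds), "test predictions")
```

The same pipeline object can be passed to `GridSearchCV`, in which case the vectorizer is refitted within each cross-validation fold.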

In [29]:
svm.fit(x_train, y_trainn)
y_pred_svm = svm.predict(x_test)
svm_accuracy = accuracy_score(y_testt, y_pred_svm)
svm_score = svm.score(x_test, y_testt)  # same quantity: SVC.score returns mean accuracy
svm_accuracy, svm_score

(0.8911666666666667, 0.8911666666666667)

In [34]:
nb.fit(x_train, y_trainn)
y_pred_nb = nb.predict(x_test)
nb_accuracy = accuracy_score(y_testt, y_pred_nb)
nb_accuracy 

0.8488333333333333

In [33]:
rf.fit(x_train, y_trainn)
y_pred_rf = rf.predict(x_test)
rf_accuracy = accuracy_score(y_testt, y_pred_rf)

rf_accuracy 

0.8675

In [35]:
log_reg = LogisticRegression()
log_reg.fit(x_train, y_trainn)
y_pred_log_reg = log_reg.predict(x_test)
log_reg_acc = accuracy_score(y_testt, y_pred_log_reg)
#log_reg_score = log_reg.score(x_test, y_testt)
log_reg_acc 

0.8881666666666667

In [42]:
history = model.fit(
    X_train,        
    y_train,         
    epochs=10,            
    batch_size=32,        
    validation_data=(X_test, y_test)  
)

Epoch 1/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 20s 36ms/step - accuracy: 0.7518 - loss: 0.5583 - val_accuracy: 0.7573 - val_loss: 0.5634
Epoch 2/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 15s 35ms/step - accuracy: 0.7596 - loss: 0.5535 - val_accuracy: 0.7573 - val_loss: 0.5548
Epoch 3/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 16s 35ms/step - accuracy: 0.7607 - loss: 0.5514 - val_accuracy: 0.7573 - val_loss: 0.5546
Epoch 4/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 21s 35ms/step - accuracy: 0.7635 - loss: 0.5482 - val_accuracy: 0.7573 - val_loss: 0.5555
Epoch 5/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 15s 35ms/step - accuracy: 0.7640 - loss: 0.5474 - val_accuracy: 0.7573 - val_loss: 0.5542
Epoch 6/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 16s 37ms/step - accuracy: 0.7653 - loss: 0.5454 - val_accuracy: 0.7573 - val_loss: 0.5554
Epoch 7/10
(output truncated)

In [46]:
history = model.fit(
    X_train,        
    y_train,         
    epochs=10,            
    batch_size=32,        
    validation_data=(X_test, y_test)  
)

Epoch 1/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 28s 52ms/step - accuracy: 0.7574 - loss: 0.5508 - val_accuracy: 0.7573 - val_loss: 0.5570
Epoch 2/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 24s 55ms/step - accuracy: 0.7639 - loss: 0.5482 - val_accuracy: 0.7573 - val_loss: 0.5546
Epoch 3/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 24s 54ms/step - accuracy: 0.7666 - loss: 0.5442 - val_accuracy: 0.7573 - val_loss: 0.5544
Epoch 4/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 22s 50ms/step - accuracy: 0.7576 - loss: 0.5543 - val_accuracy: 0.7573 - val_loss: 0.5554
Epoch 5/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 23s 54ms/step - accuracy: 0.7704 - loss: 0.5392 - val_accuracy: 0.7573 - val_loss: 0.5541
Epoch 6/10
438/438 ━━━━━━━━━━━━━━━━━━━━ 26s 58ms/step - accuracy: 0.7684 - loss: 0.5418 - val_accuracy: 0.7573 - val_loss: 0.5551
Epoch 7/10
(output truncated)

In [58]:
lstm_eval = model.evaluate(X_test, y_test)

188/188 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - accuracy: 0.7556 - loss: 0.5564


## 5. Formal Evaluation
- Evaluate the performance of each model on the testing set using the following metrics:
  - Accuracy
  - Precision
  - Recall
  - F1 Score
  - Confusion Matrix
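
The first four metrics in this list can all be derived from the confusion matrix. The sketch below does so in plain Python, using the Logistic Regression confusion matrix reported in this section as input. `binary_metrics` and `log_reg_cm` are hypothetical names introduced for illustration; the matrix layout `[[TN, FP], [FN, TP]]` follows the row = true class, column = predicted class convention of scikit-learn's `confusion_matrix` with labels 0 and 1.

```python
def binary_metrics(cm):
    """Derive metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Logistic Regression confusion matrix as reported in this section
log_reg_cm = [[917, 539], [132, 4412]]
metrics = binary_metrics(log_reg_cm)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

Precision and recall here describe the positive class only; the first row of the matrix shows how many negative reviews were misclassified as positive.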

In [54]:
log_reg_precision = precision_score(y_testt, y_pred_log_reg)
log_reg_recall = recall_score(y_testt, y_pred_log_reg)
log_reg_f1 = f1_score(y_testt, y_pred_log_reg)
log_reg_confusion = confusion_matrix(y_testt, y_pred_log_reg)

print("Logistic Regression - Accuracy:", log_reg_acc)
print("Logistic Regression - Precision:", log_reg_precision)
print("Logistic Regression - Recall:", log_reg_recall)
print("Logistic Regression - F1 Score:", log_reg_f1)
print("Logistic Regression - Confusion Matrix:\n", log_reg_confusion)

Logistic Regression - Accuracy: 0.8881666666666667
Logistic Regression - Precision: 0.8911331044233488
Logistic Regression - Recall: 0.9709507042253521
Logistic Regression - F1 Score: 0.9293312269615587
Logistic Regression - Confusion Matrix:
 [[ 917  539]
 [ 132 4412]]


In [55]:
svm_precision = precision_score(y_testt, y_pred_svm)
svm_recall = recall_score(y_testt, y_pred_svm)
svm_f1 = f1_score(y_testt, y_pred_svm)
svm_confusion = confusion_matrix(y_testt, y_pred_svm)
print("SVM - Accuracy:", svm_accuracy)
print("SVM - Precision:", svm_precision)
print("SVM - Recall:", svm_recall)
print("SVM - F1 Score:", svm_f1)
print("SVM - Confusion Matrix:\n", svm_confusion)


SVM - Accuracy: 0.8911666666666667
SVM - Precision: 0.9088043706661063
SVM - Recall: 0.9518045774647887
SVM - F1 Score: 0.9298075889498011
SVM - Confusion Matrix:
 [[1022  434]
 [ 219 4325]]


In [56]:
rf_precision = precision_score(y_testt, y_pred_rf)
rf_recall = recall_score(y_testt, y_pred_rf)
rf_f1 = f1_score(y_testt, y_pred_rf)
rf_confusion = confusion_matrix(y_testt, y_pred_rf)
print("Random Forest - Accuracy:", rf_accuracy)
print("Random Forest - Precision:", rf_precision)
print("Random Forest - Recall:", rf_recall)
print("Random Forest - F1 Score:", rf_f1)
print("Random Forest - Confusion Matrix:\n", rf_confusion)

Random Forest - Accuracy: 0.8675
Random Forest - Precision: 0.8709677419354839
Random Forest - Recall: 0.9685299295774648
Random Forest - F1 Score: 0.9171616130040637
Random Forest - Confusion Matrix:
 [[ 804  652]
 [ 143 4401]]


In [57]:
nb_precision = precision_score(y_testt, y_pred_nb)
nb_recall = recall_score(y_testt, y_pred_nb)
nb_f1 = f1_score(y_testt, y_pred_nb)
nb_confusion = confusion_matrix(y_testt, y_pred_nb)
print("Naïve Bayes - Accuracy:", nb_accuracy)
print("Naïve Bayes - Precision:", nb_precision)
print("Naïve Bayes - Recall:", nb_recall)
print("Naïve Bayes - F1 Score:", nb_f1)
print("Naïve Bayes - Confusion Matrix:\n", nb_confusion)

Naïve Bayes - Accuracy: 0.8488333333333333
Naïve Bayes - Precision: 0.8416306594025925
Naïve Bayes - Recall: 0.9859154929577465
Naïve Bayes - F1 Score: 0.9080774298165603
Naïve Bayes - Confusion Matrix:
 [[ 613  843]
 [  64 4480]]


In [59]:
lstm_accuracy = lstm_eval[1]
lstm_loss = lstm_eval[0]
# Threshold the sigmoid outputs at 0.5 and flatten to a 1-D label array
y_pred_lstm = (model.predict(X_test) > 0.5).astype(int).ravel()
lstm_precision = precision_score(y_test, y_pred_lstm)
lstm_recall = recall_score(y_test, y_pred_lstm)
lstm_f1 = f1_score(y_test, y_pred_lstm)
lstm_confusion = confusion_matrix(y_test, y_pred_lstm)
print("LSTM - Accuracy:", lstm_accuracy)
print("LSTM - Precision:", lstm_precision)
print("LSTM - Recall:", lstm_recall)
print("LSTM - F1 Score:", lstm_f1)
print("LSTM - Confusion Matrix:\n", lstm_confusion)

188/188 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step
LSTM - Accuracy: 0.7573333382606506
LSTM - Precision: 0.7573333333333333
LSTM - Recall: 1.0
LSTM - F1 Score: 0.8619119878603946
LSTM - Confusion Matrix:
 [[   0 1456]
 [   0 4544]]
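
The confusion matrix above shows that every test review was predicted as positive, so the LSTM's accuracy is simply the share of positive reviews in the test set. The sketch below checks this against a majority-class baseline and computes "balanced" class weights of the form `n_samples / (n_classes * class_count)`. The names `lstm_cm`, `majority_baseline`, and `class_weight` are illustrative. The weights are computed from the test counts only because those are the counts printed here; in practice they would come from the training labels. Possible causes of the collapse, such as the post-padded zeros reaching the LSTM unmasked or the class imbalance, are assumptions that would need to be tested (for example with the `mask_zero` option of the Keras `Embedding` layer or the `class_weight` argument of `model.fit`).

```python
# LSTM confusion matrix as printed above: rows = true class, columns = predicted
lstm_cm = [[0, 1456], [0, 4544]]

predicted_per_class = [sum(col) for col in zip(*lstm_cm)]   # column totals
actual_per_class = [sum(row) for row in lstm_cm]            # row totals
total = sum(actual_per_class)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(actual_per_class) / total
predicts_single_class = sum(1 for n in predicted_per_class if n > 0) == 1

# "Balanced" heuristic: n_samples / (n_classes * class_count)
class_weight = {
    label: total / (len(actual_per_class) * count)
    for label, count in enumerate(actual_per_class)
}

print(f"majority baseline accuracy: {majority_baseline:.4f}")
print("model predicts a single class:", predicts_single_class)
print("balanced class weights:", class_weight)
```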


## 6. Hyperparameter Tuning
- Perform hyperparameter tuning for selected models using:
  - Grid Search
  - Random Search
- Explain the chosen hyperparameters and justify their selection.
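
To illustrate the difference in cost between the two strategies, the sketch below enumerates the Random Forest search space used in this section and counts the model fits each strategy would need with 5-fold cross-validation. It uses only the standard library; `rf_grid`, `all_combos`, and `sampled` are hypothetical names, and the sampling is a simplified stand-in for what a randomized search does with a budget of 10 candidates.

```python
import itertools
import random

# Same values as the Random Forest search space in this section
rf_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}
cv_folds = 5

# Grid search: every combination of values
names = list(rf_grid)
all_combos = [dict(zip(names, values))
              for values in itertools.product(*rf_grid.values())]

# Random search: a fixed budget of combinations sampled without replacement
rng = random.Random(42)
sampled = rng.sample(all_combos, k=10)

print("grid search fits:  ", len(all_combos) * cv_folds)
print("random search fits:", len(sampled) * cv_folds)
```

The gap in fit counts is the practical reason to prefer a randomized search when the grid is large and each fit is slow.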


In [72]:
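# Logistic Regression
log_reg_params = {
    'C': [0.01, 0.1, 1, 10, 100],     # inverse regularization strength (smaller = stronger penalty)
    'solver': ['liblinear', 'lbfgs']  # two optimizers that support the default L2 penalty
}
# Grid search with 5-fold cross-validation on the training split
grid_log_reg = GridSearchCV(LogisticRegression(), log_reg_params, cv=5, scoring='accuracy')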
grid_log_reg.fit(x_train, y_trainn)
print("Best Parameters for Logistic Regression:", grid_log_reg.best_params_)
print("Best Accuracy for Logistic Regression:", grid_log_reg.best_score_)

Best Parameters for Logistic Regression: {'C': 10, 'solver': 'liblinear'}
Best Accuracy for Logistic Regression: 0.8875714285714287


In [None]:
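# SVM
svm_params = {
    'C': [0.1, 1, 10, 100],               # regularization trade-off (larger = weaker penalty)
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']            # kernel coefficient; has no effect with the linear kernel
}
grid_svm = GridSearchCV(SVC(), svm_params, cv=5, scoring='accuracy')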
grid_svm.fit(x_train, y_trainn)
print("Best Parameters for SVM:", grid_svm.best_params_)
print("Best Accuracy for SVM:", grid_svm.best_score_)

In [74]:
# Naive Bayes
nb_params = {
    'alpha': [0.1, 0.5, 1, 2]  # additive (Laplace/Lidstone) smoothing strength
}
grid_nb = GridSearchCV(MultinomialNB(), nb_params, cv=5, scoring='accuracy')
grid_nb.fit(x_train, y_trainn)
print("Best Parameters for Naive Bayes:", grid_nb.best_params_)
print("Best Accuracy for Naive Bayes:", grid_nb.best_score_)

Best Parameters for Naive Bayes: {'alpha': 0.1}
Best Accuracy for Naive Bayes: 0.8675714285714285


In [None]:
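# Random Forest
rf_params = {
    'n_estimators': [50, 100, 200, 300],   # number of trees
    'max_depth': [None, 10, 20, 30, 40],   # None = no depth limit
    'min_samples_split': [2, 5, 10],       # minimum samples needed to split a node
    'min_samples_leaf': [1, 2, 4],         # minimum samples required in a leaf
    'bootstrap': [True, False]             # whether each tree is trained on a bootstrap sample
}
# Randomized search: evaluate 10 sampled combinations instead of the full grid
random_rf = RandomizedSearchCV(
    RandomForestClassifier(), rf_params, n_iter=10, cv=5,
    scoring='accuracy', random_state=42, n_jobs=-1
)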
random_rf.fit(x_train, y_trainn)
print("Best Parameters for Random Forest (Randomized Search):", random_rf.best_params_)
print("Best Accuracy for Random Forest (Randomized Search):", random_rf.best_score_)


## 7. Comparative Analysis
- Compare the performance of all models based on evaluation metrics.
- Identify strengths and weaknesses of each model (e.g., speed, accuracy, interpretability).
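
As a starting point for the comparison, the sketch below rebuilds a summary table from the confusion matrices printed in Section 5. Besides accuracy and F1 it reports the recall on the negative class, `TN / (TN + FP)`, which the positive-class metrics do not show. The dictionary `confusion`, the helper `summarize`, and the column name `neg_recall` are hypothetical names chosen for this illustration; the numbers are copied from the outputs above and nothing is re-run.

```python
# Confusion matrices [[TN, FP], [FN, TP]] copied from the Section 5 outputs
confusion = {
    "Logistic Regression": [[917, 539], [132, 4412]],
    "SVM (linear kernel)": [[1022, 434], [219, 4325]],
    "Random Forest": [[804, 652], [143, 4401]],
    "Naive Bayes": [[613, 843], [64, 4480]],
    "LSTM": [[0, 1456], [0, 4544]],
}


def summarize(cm):
    """Accuracy, positive-class F1, and negative-class recall for one model."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "neg_recall": tn / (tn + fp),  # share of negative reviews identified
    }


# Rank models by test accuracy, best first
rows = sorted(((name, summarize(cm)) for name, cm in confusion.items()),
              key=lambda item: item[1]["accuracy"], reverse=True)

print(f"{'model':<22}{'accuracy':>10}{'f1':>8}{'neg_recall':>12}")
for name, m in rows:
    print(f"{name:<22}{m['accuracy']:>10.4f}{m['f1']:>8.4f}{m['neg_recall']:>12.4f}")
```

Sorting by `neg_recall` instead of accuracy gives a view of how well each model handles the minority class.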


## 8. Conclusion & Comments
- Summarize the findings of the project.
- Provide insights into the challenges faced during data preprocessing, model training, and evaluation.
- Highlight key lessons learned.
- Add clear and concise comments to the code for each step of the project.
- Highlight key results, visualizations, and model comparisons.
