# Sentiment Analysis on Amazon Product Reviews Dataset #

Sourced from: Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, Boom-boxes and Blenders: Domain adaptation for sentiment classification. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 440–447. http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html


## Importing the Required Libraries ##

In [None]:
import os
import pandas as pd
from bs4 import BeautifulSoup

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Downloading necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

## Loading the Dataset ##
The dataset was loaded using the HTML parser from Beautiful Soup. The code was compiled with step-by-step assistance from various websites and articles.

Assistance taken from:

https://oxylabs.io/blog/beautiful-soup-parsing-tutorial

https://stackoverflow.com/questions/21570780/using-python-and-beautifulsoup-saved-webpage-source-codes-into-a-local-file

https://stackoverflow.com/questions/43214305/how-to-use-text-strip-function

In [17]:
folder = 'sorted_data_acl'
categories = ['books', 'dvd', 'electronics', 'kitchen_&_housewares']
data = []

for category in categories:
    for label_name in ['negative', 'positive']:
        path = os.path.join(folder, category, f"{label_name}.review")
        # Use a separate name for the file handle so it does not overwrite label_name
        with open(path, 'r', encoding='utf-8') as fh:
            soup = BeautifulSoup(fh, 'html.parser')
            reviews = soup.find_all('review_text')

            for review in reviews:
                # Remove leading and trailing whitespace; assistance taken from
                # https://stackoverflow.com/questions/43214305/how-to-use-text-strip-function
                clean_text = review.text.strip()
                # Label is 1 for reviews from positive.review, 0 for negative.review
                data.append((clean_text, 1 if label_name == 'positive' else 0))

# The label column is named 'file' to match the cells that follow
df = pd.DataFrame(data, columns=['review_text', 'file'])
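
The loading loop relies on each `.review` file wrapping every review body in `<review_text>` tags. As a minimal sketch of that extraction step, the snippet below uses only the standard library's `re` module on an in-memory string instead of Beautiful Soup. The sample markup and the `extract_reviews` helper are hypothetical, made up for illustration, and a regular expression is a stand-in here, not the parser used in this notebook.

```python
import re

# Hypothetical sample mimicking the tag layout of a .review file.
sample = """
<review>
<rating>1.0</rating>
<review_text>
This blender stopped working after a week.
</review_text>
</review>
<review>
<rating>5.0</rating>
<review_text>
Great value, works as described.
</review_text>
</review>
"""

def extract_reviews(raw):
    """Return the stripped text of every <review_text> element in raw."""
    bodies = re.findall(r"<review_text>(.*?)</review_text>", raw, flags=re.DOTALL)
    return [body.strip() for body in bodies]

texts = extract_reviews(sample)
print(texts)
```

Stripping matters because each review body sits on its own lines between the tags, so the raw text starts and ends with newlines.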


## Viewing the Dataframe ##

In [18]:
df.head()

Unnamed: 0,review_text,sentiment
0,THis book was horrible. If it was possible to...,0
1,I like to use the Amazon reviews when purchasi...,0
2,THis book was horrible. If it was possible to...,0
3,"I'm not sure who's writing these reviews, but ...",0
4,I picked up the first book in this series (The...,0


## Removing Outliers: Very Short Reviews ##

In [None]:
# Defining a function for counting words
def word_count(text):
    return len(text.split())


#Assistance taken from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
min_length = 10 

df['word_count'] = df['review_text'].apply(word_count)
df = df[df['word_count'] >= min_length]

In [19]:
# Counting the number of positive and negative reviews
positive_count = df[df['file'] == 1].shape[0]
negative_count = df[df['file'] == 0].shape[0]

print(f"Number of positive reviews: {positive_count}")
print(f"Number of negative reviews: {negative_count}")

Number of positive reviews: 4000
Number of negative reviews: 4000


## Pre-processing the data ##

The pre-processing task was done with assistance from:

https://www.dataquest.io/blog/how-to-clean-and-prepare-your-data-for-analysis/

In [20]:
# Preprocess the text
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def preprocess_text(text):
    # Tokenize
    words = nltk.word_tokenize(text.lower())
    # Remove stopwords and stem
    filtered_words = [ps.stem(word) for word in words if word not in stop_words and word.isalpha()]
    return " ".join(filtered_words)

df['processed_text'] = df['review_text'].apply(preprocess_text)
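
To make the effect of this step concrete, here is a simplified sketch that lowercases the text, keeps alphabetic tokens, and drops stopwords. It is not the NLTK pipeline used in this notebook: the tiny `STOP_WORDS` set and the regex tokenizer are stand-ins chosen so the example runs without downloads, the `simple_preprocess` name is hypothetical, and stemming is omitted.

```python
import re

# Illustrative subset only; the notebook uses NLTK's full English stopword list.
STOP_WORDS = {"the", "was", "and", "it", "a", "is", "this", "to"}

def simple_preprocess(text):
    """Lowercase, keep alphabetic runs, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = simple_preprocess("This book was horrible, and the ending is worse!")
print(cleaned)
```

Punctuation and common function words disappear, leaving the words most likely to carry sentiment.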

### Vectorization of processed data ###

Assistance from https://medium.com/@WojtekFulmyk/text-tokenization-and-vectorization-in-nlp-ac5e3eb35b85

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['processed_text'])
y = df['file']
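
A count vectorizer turns each document into a row of word counts over a shared vocabulary. The plain-Python sketch below shows the idea on two toy documents. It assumes simple whitespace tokenization, which differs from `CountVectorizer`'s default tokenizer, and the `to_counts` helper is hypothetical.

```python
from collections import Counter

docs = ["good book good plot", "bad plot"]

# Vocabulary: sorted unique tokens, each mapped to a column index.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d.split()}))}

def to_counts(doc):
    """Count how often each vocabulary word occurs in doc; unknown words are ignored."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]

matrix = [to_counts(d) for d in docs]
print(vocab)
print(matrix)
```

Words that were not seen when the vocabulary was built contribute nothing to the row, which is why new reviews must be transformed with the same fitted vectorizer.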

### Splitting the Data ###

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Building the Logistic Regression classifier and training ##

Assistance from : https://spotintelligence.com/2023/02/22/logistic-regression-text-classification-python/

In [24]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(max_iter=1000)  
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

### Calculating Accuracy ###

In [25]:
from sklearn.metrics import accuracy_score

y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.815625
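
Accuracy is the fraction of test reviews whose predicted label matches the true label. The plain-Python sketch below computes it on made-up labels (not this notebook's data), together with the per-class counts that a single accuracy figure hides. All variable names are illustrative.

```python
# Invented labels for eight reviews: 1 = positive, 0 = negative.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(toy_true, toy_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives predicted positive
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives predicted negative
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives predicted positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives predicted negative

toy_accuracy = (tp + tn) / len(pairs)
toy_precision = tp / (tp + fp)
toy_recall = tp / (tp + fn)
print(tp, tn, fp, fn, toy_accuracy, toy_precision, toy_recall)
```

Looking at the four counts separately shows whether a model's errors fall mostly on positive or on negative reviews.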


## Classifier Validation ##

In [31]:
def predict_file(review):
    processed_review = preprocess_text(review)
    vectorized_review = vectorizer.transform([processed_review])
    prediction = classifier.predict(vectorized_review)
    return "Positive" if prediction[0] == 1 else "Negative"

### Testing with a complicated Negative Review ###

In [32]:
# Test the function
print(predict_file("I don't know what to say about this product. The quality of paper was super, and the fininsh just right, but then again the glue used laid waste to it all. All beautiful things broken apart and scattered around"))

Negative


### Testing with a complicated Positive Review ###

In [33]:

# Test the function
print(predict_file("I don't know what to say about this product. The quality of paper was super, and the fininsh just right"))

Positive


*End of Code*

# Continuing Further Model Exploration #

## Support Vector Machine - SVM ##

Assistance from: https://scikit-learn.org/stable/modules/svm.html

In [None]:
from sklearn.svm import SVC

# Creating the SVM model
SVC_model = SVC(kernel='linear', probability=True)

# Training the model
SVC_model.fit(X_train, y_train)

SVC(kernel='linear', probability=True)

### Checking Model Performance ###

In [None]:
from sklearn.metrics import accuracy_score

# Evaluate the SVM trained above (SVC_model)
y_pred = SVC_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.776875


## Naive Bayes Classifier ##

Assistance from: https://scikit-learn.org/stable/modules/naive_bayes.html

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Creating and training the Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

MultinomialNB()

### Checking Model Performance ###

In [None]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.789375


## Decision Tree ##

Assistance from: https://scikit-learn.org/stable/modules/tree.html

In [None]:
from sklearn.tree import DecisionTreeClassifier


dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.688125


## Random Forest ##

Assistance from: https://scikit-learn.org/stable/modules/ensemble.html

In [None]:
from sklearn.ensemble import RandomForestClassifier


rf_model = RandomForestClassifier(n_estimators=100)  
rf_model.fit(X_train, y_train)

RandomForestClassifier()

### Checking Model Performance ###

In [None]:
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.81125


## Gradient Boosting ##

Assistance from: https://scikit-learn.org/stable/modules/ensemble.html

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbm_model = GradientBoostingClassifier()
gbm_model.fit(X_train, y_train)

GradientBoostingClassifier()

### Checking Model Performance ###

In [None]:
y_pred = gbm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.775


## XGBoost Model ##

Assistance from: https://scikit-learn.org/stable/modules/ensemble.html

In [None]:
from xgboost import XGBClassifier

xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
xgb_model.fit(X_train, y_train)



XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='mlogloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=None, ...)

### Checking Model Performance ###

In [None]:
y_pred = xgb_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.805625


## Ensemble Model ##

Assistance from: https://scikit-learn.org/stable/modules/ensemble.html

https://machinelearningmastery.com/voting-ensembles-with-python/
                 

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier


model1 = LogisticRegression(max_iter=1000)
model2 = RandomForestClassifier(n_estimators=100)
model3 = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')

### Training the Model ###

In [None]:
from sklearn.ensemble import VotingClassifier

ensemble = VotingClassifier(estimators=[
    ('lr', model1), 
    ('rf', model2), 
    ('xgb', model3)
], voting='soft')

ensemble.fit(X_train, y_train)

### Checking Model Performance ###

In [None]:
from sklearn.metrics import accuracy_score

y_pred = ensemble.predict(X_test) 

accuracy = accuracy_score(y_test, y_pred)
print("Ensemble Model Accuracy:", accuracy)
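
With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest mean. The plain-Python sketch below applies that rule to a single review. The probabilities are invented for illustration and do not come from the models in this notebook.

```python
# Hypothetical [P(negative), P(positive)] from three models for one review.
probas = {
    "lr":  [0.40, 0.60],
    "rf":  [0.55, 0.45],
    "xgb": [0.30, 0.70],
}

n_classes = 2
# Average each class's probability across the three models.
mean_proba = [
    sum(p[c] for p in probas.values()) / len(probas)
    for c in range(n_classes)
]
# The predicted class is the one with the highest averaged probability.
predicted = max(range(n_classes), key=lambda c: mean_proba[c])
print(mean_proba, predicted)
```

Soft voting lets a confident model outweigh a hesitant one, whereas hard voting would count each model's vote equally.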

# Building the Neural Network Model #

Assistance from: https://github.com/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l09c04_nlp_embeddings_and_sentiment.ipynb

https://www.kaggle.com/code/dilipkumar2k6/tensorflow-nlp-word-embedding-optimization

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# Parameters
vocab_size = 10000
max_length = 100
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'

In [None]:
reviews = df['review_text'].tolist()
labels = df['file'].tolist()  # the label column was created as 'file' when the DataFrame was built

In [None]:
# Tokenizing
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(reviews)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(reviews)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
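
With `padding='post'` and `truncating='post'`, every sequence is cut or zero-filled at its end so that all rows have exactly `max_length` entries. A plain-Python sketch of that behaviour, using a hypothetical `pad_post` helper and a short target length for readability:

```python
def pad_post(seq, maxlen, pad_value=0):
    """Truncate from the end, then pad at the end, to exactly maxlen items."""
    trimmed = seq[:maxlen]
    return trimmed + [pad_value] * (maxlen - len(trimmed))

short_row = pad_post([5, 8, 2], 5)           # short sequence gets trailing zeros
long_row = pad_post([5, 8, 2, 9, 4, 7], 5)   # long sequence loses its tail
print(short_row)
print(long_row)
```

Fixed-length rows are what allow the padded sequences to be stacked into a single array for the embedding layer.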


In [None]:
# Splitting data
training_size = int(len(reviews) * 0.7)
train_padded = padded[:training_size]
test_padded = padded[training_size:]
train_labels = labels[:training_size]
test_labels = labels[training_size:]
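
The slice above takes the first 70% of `reviews` in their stored order. Because the loading loop appends reviews category by category, with negatives before positives, the two parts can end up with different label mixes. The sketch below shows one possible alternative on toy data: shuffle the indices with a fixed seed before slicing. The toy lists and variable names are hypothetical.

```python
import random

# Toy stand-ins: ten "reviews" stored with all negatives (0) before positives (1).
toy_labels = [0] * 5 + [1] * 5
toy_reviews = [f"review {i}" for i in range(10)]

cut = int(len(toy_reviews) * 0.7)

# Slicing in stored order: the held-out part contains only positives.
ordered_test_labels = toy_labels[cut:]

# Shuffling indices first mixes both classes before the cut.
indices = list(range(len(toy_reviews)))
random.Random(42).shuffle(indices)  # fixed seed so the split is repeatable
train_idx, test_idx = indices[:cut], indices[cut:]

shuffled_train_labels = [toy_labels[i] for i in train_idx]
shuffled_test_labels = [toy_labels[i] for i in test_idx]
print(ordered_test_labels)
print(shuffled_train_labels, shuffled_test_labels)
```

The same index lists can be used to select rows from the padded array and the label list, so reviews and labels stay aligned.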

In [None]:
import numpy as np

# Converting labels to numpy arrays
train_labels = np.array(train_labels).astype('float32')
test_labels = np.array(test_labels).astype('float32')

## Model Architecture ##

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 16, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.5),  # Adding dropout
    tf.keras.layers.Dense(12, activation='elu', kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # L2 regularization
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 100, 16)           160000    
                                                                 
 global_average_pooling1d_2   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dropout_39 (Dropout)        (None, 16)                0         
                                                                 
 dense_4 (Dense)             (None, 12)                204       
                                                                 
 dense_5 (Dense)             (None, 1)                 13        
                                                                 
Total params: 160,217
Trainable params: 160,217
Non-trainable params: 0
________________________________________________

## Model Training ##

Training was stopped midway because the model started to overfit: the validation loss began to increase.

Assistance from: https://www.tensorflow.org/api_docs/python/tf/data/Dataset

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_padded, train_labels))
train_dataset = train_dataset.batch(32)  # You can adjust the batch size

test_dataset = tf.data.Dataset.from_tensor_slices((test_padded, test_labels))
test_dataset = test_dataset.batch(32)

# Now use the dataset objects for training and evaluation
model.fit(train_dataset, epochs=100, validation_data=test_dataset, verbose=2)


Epoch 1/100
175/175 - 4s - loss: 0.8064 - accuracy: 0.5309 - val_loss: 0.7579 - val_accuracy: 0.4167 - 4s/epoch - 20ms/step
Epoch 2/100
175/175 - 1s - loss: 0.7354 - accuracy: 0.5352 - val_loss: 0.7191 - val_accuracy: 0.4167 - 1s/epoch - 7ms/step
Epoch 3/100
175/175 - 1s - loss: 0.7110 - accuracy: 0.5339 - val_loss: 0.7059 - val_accuracy: 0.4167 - 1s/epoch - 7ms/step
Epoch 4/100
175/175 - 1s - loss: 0.7023 - accuracy: 0.5298 - val_loss: 0.7012 - val_accuracy: 0.4167 - 1s/epoch - 7ms/step
Epoch 5/100
175/175 - 1s - loss: 0.6984 - accuracy: 0.5250 - val_loss: 0.6984 - val_accuracy: 0.4671 - 1s/epoch - 6ms/step
Epoch 6/100
175/175 - 1s - loss: 0.6949 - accuracy: 0.5314 - val_loss: 0.6950 - val_accuracy: 0.5350 - 1s/epoch - 6ms/step
Epoch 7/100
175/175 - 1s - loss: 0.6899 - accuracy: 0.5539 - val_loss: 0.6905 - val_accuracy: 0.5671 - 1s/epoch - 7ms/step
Epoch 8/100
175/175 - 1s - loss: 0.6823 - accuracy: 0.5832 - val_loss: 0.6842 - val_accuracy: 0.5967 - 1s/epoch - 7ms/step
Epoch 9/100
175

KeyboardInterrupt: 
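
Stopping by hand can be replaced by a patience rule: halt once the validation loss has failed to improve for a set number of epochs. Keras offers this as the `tf.keras.callbacks.EarlyStopping` callback. The plain-Python sketch below shows only the underlying logic, with a hypothetical `stop_epoch` helper and invented loss values rather than the run logged above.

```python
def stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the patience counter.
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

invented_losses = [0.70, 0.66, 0.64, 0.65, 0.67, 0.69, 0.72]
stopped_at = stop_epoch(invented_losses)
print(stopped_at)
```

Here the loss bottoms out at epoch 3 and the rule fires three epochs later. A larger `patience` tolerates more noise at the cost of extra training time.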

# Saving the Best Model for Executable File #

Assistance from: https://medium.com/@maziarizadi/pickle-your-model-in-python-2bbe7dba2bbb

In [None]:
import pickle

with open('logistic_regression_model.pkl', 'wb') as file:
    pickle.dump(classifier, file)
with open('vectorizer.pkl', 'wb') as file:
    pickle.dump(vectorizer, file)
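
For later use, the saved objects have to be loaded back with `pickle.load`. The sketch below shows the save and load round trip on a stand-in dictionary, since any picklable object behaves the same way. The `toy_model` object and the temporary file path are hypothetical, not the files written above.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; used only to demonstrate the round trip.
toy_model = {"name": "logistic_regression", "weights": [0.2, -1.3, 0.7]}

path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")

# Write the object in binary mode.
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Read it back in binary mode.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)
```

The vectorizer needs to be loaded alongside the classifier, because new reviews must be transformed with the same vocabulary the model was trained on. Pickle files should only be loaded from trusted sources.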