# Sentiment Analysis on Amazon Product Reviews Dataset #

Sourced from: Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, Boom-boxes and Blenders: Domain adaptation for sentiment classification. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 440–447. http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html


## Importing the Required Libraries ##

In [None]:
import os
import pandas as pd
from bs4 import BeautifulSoup

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Downloading necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

## Loading the Dataset ##
The dataset was loaded using Beautiful Soup's HTML parser. The code was put together with step-by-step assistance from various websites and articles.

Assistance taken from:

https://oxylabs.io/blog/beautiful-soup-parsing-tutorial

https://stackoverflow.com/questions/21570780/using-python-and-beautifulsoup-saved-webpage-source-codes-into-a-local-file

https://stackoverflow.com/questions/43214305/how-to-use-text-strip-function

In [17]:
folder = 'sorted_data_acl'
categories = ['books', 'dvd', 'electronics', 'kitchen_&_housewares']
data = []

for category in categories:
    for label in ['negative', 'positive']:
        path = os.path.join(folder, category, f"{label}.review")
        # Use a separate name for the file handle so it does not shadow the loop variable
        with open(path, 'r', encoding='utf-8') as handle:
            soup = BeautifulSoup(handle, 'html.parser')
            reviews = soup.find_all('review_text')

            for review in reviews:
                # Remove leading and trailing whitespace; see https://stackoverflow.com/questions/43214305/how-to-use-text-strip-function
                clean_text = review.text.strip()
                data.append((clean_text, 1 if label == 'positive' else 0))

df = pd.DataFrame(data, columns=['review_text', 'file'])


## Viewing the Dataframe ##

In [18]:
df.head()

Unnamed: 0,review_text,sentiment
0,THis book was horrible. If it was possible to...,0
1,I like to use the Amazon reviews when purchasi...,0
2,THis book was horrible. If it was possible to...,0
3,"I'm not sure who's writing these reviews, but ...",0
4,I picked up the first book in this series (The...,0


## Removing Outliers: Very Short Reviews ##

In [None]:
# Defining a function for counting words
def word_count(text):
    return len(text.split())


# Assistance taken from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
min_length = 10  # Minimum number of words a review must contain to be kept

df['word_count'] = df['review_text'].apply(word_count)
df = df[df['word_count'] >= min_length]

In [19]:
# Counting the number of positive and negative reviews
positive_count = df[df['file'] == 1].shape[0]
negative_count = df[df['file'] == 0].shape[0]

print(f"Number of positive reviews: {positive_count}")
print(f"Number of negative reviews: {negative_count}")

Number of positive reviews: 4000
Number of negative reviews: 4000


## Pre-processing the data ##

The pre-processing was done with assistance from:

https://www.dataquest.io/blog/how-to-clean-and-prepare-your-data-for-analysis/

In [20]:
# Preprocess the text
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def preprocess_text(text):
    # Tokenize
    words = nltk.word_tokenize(text.lower())
    # Remove stopwords and stem
    filtered_words = [ps.stem(word) for word in words if word not in stop_words and word.isalpha()]
    return " ".join(filtered_words)

df['processed_text'] = df['review_text'].apply(preprocess_text)
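
To make the effect of the cleaning steps concrete, here is a minimal standard-library sketch of lowercasing, keeping alphabetic tokens, and dropping stop words. It is only an approximation of the cell above: `toy_preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is a tiny hand-picked stand-in for NLTK's English list, a regular expression replaces `nltk.word_tokenize`, and stemming is left out entirely.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook itself relies on NLTK's full English stop-word list.
TOY_STOP_WORDS = {"this", "was", "the", "a", "it", "and", "i"}

def toy_preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    # A simple regex tokenizer stands in for nltk.word_tokenize (assumption).
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [token for token in tokens if token not in TOY_STOP_WORDS]
    return " ".join(kept)

print(toy_preprocess("This book was horrible, and the ending was worse!"))
# book horrible ending worse
```

In the notebook's version, each surviving token would additionally be passed through `PorterStemmer.stem`.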

### Vectorization of processed data ###

Assistance from https://medium.com/@WojtekFulmyk/text-tokenization-and-vectorization-in-nlp-ac5e3eb35b85

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['processed_text'])
y = df['file']
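
The counting idea behind the vectorizer can be illustrated in plain Python. The sketch below is not scikit-learn's implementation; `docs`, `vocabulary`, and `count_vector` are hypothetical names, and the two short documents are made up for the example. Each distinct token gets one column, and each row holds the token counts of one document.

```python
from collections import Counter

# Two made-up, already pre-processed documents.
docs = ["good book good read", "bad book"]

# One column per distinct token, in sorted order.
vocabulary = sorted({token for doc in docs for token in doc.split()})

def count_vector(doc):
    """Return the token counts of doc, ordered like the vocabulary."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

matrix = [count_vector(doc) for doc in docs]
print(vocabulary)  # ['bad', 'book', 'good', 'read']
print(matrix)      # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

`CountVectorizer` follows the same principle but returns a sparse matrix, which matters when the vocabulary has thousands of columns and most entries are zero.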

### Splitting the Data ###

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
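
As a rough picture of what a seeded 80/20 split does, the following standard-library sketch shuffles the items with a fixed seed and holds out a share of them. `toy_train_test_split` is a hypothetical helper; it does not reproduce the exact rows that scikit-learn's `train_test_split` selects, it only shows the size arithmetic and why a fixed seed makes the split repeatable.

```python
import random

def toy_train_test_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed and hold out a test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

train, test = toy_train_test_split(range(8000))
print(len(train), len(test))  # 6400 1600
```

Calling the helper again with the same seed returns the same partition, which is the role `random_state=42` plays in the cell above.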

## Building the Logistic Regression classifier and training ##

Assistance from: https://spotintelligence.com/2023/02/22/logistic-regression-text-classification-python/

In [24]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(max_iter=1000)  
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)
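
For intuition about what the fitted classifier computes, here is a minimal sketch of the logistic decision rule on bag-of-words counts. The weights and bias are invented for illustration and are not taken from the trained model; `predict_proba_positive` is a hypothetical helper. The score is a weighted sum of token counts, squashed by the sigmoid into a probability of the positive class.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Invented per-token weights; a trained model learns one weight per vocabulary column.
weights = {"great": 1.2, "love": 0.9, "broke": -1.5, "refund": -1.1}
bias = 0.0

def predict_proba_positive(token_counts):
    """Probability of the positive class for a dict of token counts."""
    z = bias + sum(weights.get(token, 0.0) * n for token, n in token_counts.items())
    return sigmoid(z)

p = predict_proba_positive({"great": 1, "broke": 1})
label = 1 if p > 0.5 else 0
print(round(p, 3), label)  # 0.426 0
```

Here the negative weight of "broke" outweighs the positive weight of "great", so the review is labelled negative. Tokens missing from the weight table contribute nothing to the score.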

### Calculating Accuracy ###

In [25]:
from sklearn.metrics import accuracy_score

y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.815625
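
Accuracy is the share of test reviews whose predicted label matches the true label. Assuming the split holds out 1600 of the 8000 reviews, 0.815625 corresponds to 1305 correct predictions. The sketch below computes accuracy together with precision and recall from the four confusion counts; the label lists are made up, and `confusion_counts` is a hypothetical helper rather than a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

# Made-up labels for illustration only.
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true_toy, y_pred_toy)
toy_accuracy = (tp + tn) / (tp + tn + fp + fn)
toy_precision = tp / (tp + fp)
toy_recall = tp / (tp + fn)
print(tp, tn, fp, fn)                           # 3 3 1 1
print(toy_accuracy, toy_precision, toy_recall)  # 0.75 0.75 0.75
```

Because the dataset is balanced between positive and negative reviews, accuracy is a reasonable headline metric here; precision and recall show whether the errors fall more on one class than the other.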


## Classifier Validation ##

In [31]:
def predict_file(review):
    processed_review = preprocess_text(review)
    vectorized_review = vectorizer.transform([processed_review])
    prediction = classifier.predict(vectorized_review)
    return "Positive" if prediction[0] == 1 else "Negative"

### Testing with a complicated Negative Review ###

In [32]:
# Test the function
print(predict_file("I don't know what to say about this product. The quality of paper was super, and the fininsh just right, but then again the glue used laid waste to it all. All beautiful things broken apart and scattered around"))

Negative


### Testing with a complicated Positive Review ###

In [33]:

# Test the function
print(predict_file("I don't know what to say about this product. The quality of paper was super, and the fininsh just right"))

Positive


*End of Code*