In [None]:
# Data Source: https://www.kaggle.com/datasets/yasserh/amazon-product-reviews-dataset
# Folder: Amazon
# Description:
### The dataset consists of samples from Amazon Ratings for select products.
### The reviews are picked randomly, and the corpus has nearly 1.6k reviews from different customers.
### Amazon aims to understand the main topics of these reviews so it can classify them for easier search.

# Cleaning, Analysis, Visualization, and Modeling of Amazon Product Reviews Dataset

## Objective
- Understand the dataset and perform the necessary cleanup.
- Apply additional algorithms to go in depth on the positivity of each review.
- Build a strong topic modelling algorithm to classify topics in more detail than what each review's title provides.
- Create a regression model to predict how many people found a review helpful based on the length of the review.

## Libraries and Tools used throughout
- Pandas (data cleaning and manipulation)
- NLTK & spaCy (NLP)
- sklearn (regression)
- langdetect & googletrans (detecting non-english languages and translating to english)
- Gensim (topic modelling)
- pyLDAvis & matplotlib (visualizing topic model)
- warnings (suppress certain warnings that would otherwise show up, or display personal information from the user's device, after a cell is executed)


## In the case of errors:
- Some Python libraries may not be installed on your machine or in your environment. Make sure to install them.
    - Installed versions may be too old for certain operations. Update them to the latest versions.
- You ran a cell after making a problematic edit to it (this notebook is designed to run seamlessly with no edits).
- You are not running a Python kernel, or you are using an old version of the Python kernel.
- You are missing downloads that some libraries need for part or all of their functionality.
    - e.g. vader_lexicon must be downloaded for sentiment analysis (later on in the notebook).
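
A quick way to see which of the libraries listed above are missing is to ask Python whether each one can be found. This is only a minimal sketch and not part of the original notebook: the `missing_packages` helper is hypothetical, and it takes import names (for example `sklearn`), which can differ from the names used when installing (for example `scikit-learn`).

```python
from importlib.util import find_spec

def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]

# Import names of the libraries used throughout this notebook
required = ["pandas", "nltk", "spacy", "sklearn", "langdetect",
            "googletrans", "gensim", "pyLDAvis", "matplotlib"]

print("Missing:", missing_packages(required))
```

An empty list means every library can be imported; it does not check versions or extra downloads such as vader_lexicon.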
    

In [None]:
import pandas as pd
from pandas.errors import SettingWithCopyWarning
from sklearn.linear_model import LinearRegression  # minimum model to be used later.
import nltk # for NLP
import warnings
# warnings.filterwarnings("ignore") # ignore all overall
warnings.filterwarnings("ignore", category=SettingWithCopyWarning) # ignore a warning later on for copying over on a dataframe.
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
# load in the file and create dataframe
df = pd.read_csv('product_reviews.csv')
df.head()

## From this we can see that this dataset contains a lot of columns. For the purpose of our analyses, we only need a few.

## For reference, here is a description of each column 

- **id:** Unique identifier for each product.
- **asins:** ASIN (Amazon Standard Identification Number) associated with the product.
- **brand:** Brand of the product.
- **categories:** Categories to which the product belongs.
- **colors:** Colors available for the product.
- **dateAdded:** Date when the product was added.
- **dateUpdated:** Date when the product information was last updated.
- **dimension:** Dimensions of the product.
- **ean:** EAN (European Article Number) associated with the product.
- **keys:** Unique keys associated with the product.
- **manufacturer:** Manufacturer of the product.
- **manufacturerNumber:** Manufacturer number for the product.
- **name:** Name of the product.
- **prices:** Prices associated with the product, including currency and date information.
- **reviews.date:** Date when the review was posted.
- **reviews.doRecommend:** Indicates whether the reviewer recommends the product.
- **reviews.numHelpful:** Number of users who found the review helpful.
- **reviews.rating:** Rating given by the reviewer.
- **reviews.sourceURLs:** URLs to the source of the reviews.
- **reviews.text:** Text content of the review.
- **reviews.title:** Title of the review.
- **reviews.userCity:** City of the reviewer.
- **reviews.userProvince:** Province of the reviewer.
- **reviews.username:** Username of the reviewer.
- **sizes:** Sizes available for the product.
- **upc:** UPC (Universal Product Code) associated with the product.
- **weight:** Weight of the product.


In [None]:
# To get a clearer idea of all the columns we are working with, let us list them
df.columns

In [None]:
# Let's make a new df that keeps only the columns that are actually relevant
relevant_columns = ['id', 'asins', 'brand', 'categories', 'colors', 'manufacturer',
        'name', 'prices', 'reviews.date',
       'reviews.doRecommend', 'reviews.numHelpful', 'reviews.rating', 'reviews.text', 'reviews.title',
         'sizes', 'weight']
# .copy() gives an independent DataFrame, so later column assignments do not act on a view of df
product_reviews = df[relevant_columns].copy()
product_reviews.tail()

# Now that we have a dataset with more of the information we need, we can see that a few columns need restructuring
### Specifically, the prices column and the reviews date column.

In [None]:
product_reviews['prices'][0]

In [None]:
product_reviews['reviews.date']

In [None]:

# Change format to datetime
product_reviews['reviews.date'] = pd.to_datetime(product_reviews['reviews.date'], format='ISO8601')

# Gets rid of milliseconds by formatting each timestamp as a string
product_reviews['reviews.date'] = product_reviews['reviews.date'].dt.strftime('%Y-%m-%d %H:%M:%S')
product_reviews['reviews.date'].dtype # strftime returns strings, so the column is now stored as object rather than datetime

In [None]:
product_reviews['reviews.date']

In [None]:
# quick test to make sure things are working as intended
product_reviews['reviews.date'] > '2016-02-01'
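
The comparison above works on strings, not datetimes, because `strftime` returns text. It still gives the expected answer since the `%Y-%m-%d %H:%M:%S` format sorts in chronological order. Here is a small sketch of the same steps on made-up timestamps, assuming values in the same ISO 8601 style as the `reviews.date` column:

```python
import pandas as pd

# Made-up timestamps for illustration, not rows from the dataset
raw = pd.Series(["2016-01-15T10:30:45.000Z", "2016-03-02T08:05:10.000Z"])

parsed = pd.to_datetime(raw)                         # real datetime values
formatted = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")  # plain strings, no milliseconds

# String comparison; valid only because this format sorts chronologically
is_after_feb = formatted > "2016-02-01"

print(formatted.tolist())
print(is_after_feb.tolist())
```

If real datetime operations are needed later (for example date differences), keeping the parsed column and using `.dt.floor("s")` instead of `strftime` would be one alternative.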

## Now that the date is fixed, we will move on to fixing the price column


In [None]:
# As a refresher, here is what a value in the prices column looks like
prices_first_row = product_reviews['prices'][0]
print(prices_first_row)
print(type(prices_first_row))

In [None]:
# it is a lot to take in so we'll adjust it to be more presentable
import json

# convert the value that is currently a str to a list with dictionaries
prices_1 = json.loads(prices_first_row)
print("before proper formatting; ", type(prices_1))

# makes it more presentable within json format
prices_1_format = json.dumps(prices_1, indent = 3)
print(prices_1_format)


## For our purposes, we only want prices in USD. The example shown above tells us that there can be multiple prices in USD
- The original price when not on sale, and the sale price.

## With this knowledge, we'll add two extra columns to the product reviews table to store those prices
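
As a rough sketch of the idea, here is how one value of the prices column could be split into a full price and a sale price. The `usd_prices` helper and the sample values are hypothetical; the field names (`amountMax`, `currency`, `isSale`) and the assumption that `isSale` is stored as the string `"true"` or `"false"` follow the way the notebook reads the column.

```python
import json

def usd_prices(prices_json):
    """Return (full_price, sale_price) in USD; either is None when absent."""
    full_price = None
    sale_price = None
    for entry in json.loads(prices_json):
        if entry.get("currency") != "USD":
            continue  # skip prices in other currencies
        amount = float(entry["amountMax"])
        if entry.get("isSale") == "true":
            if sale_price is None:
                sale_price = amount   # first USD sale price
        elif full_price is None:
            full_price = amount       # first USD non-sale price
    return full_price, sale_price

# Hypothetical example shaped like one value of the prices column
sample = json.dumps([
    {"amountMax": 139.99, "currency": "USD", "isSale": "false"},
    {"amountMax": 119.99, "currency": "USD", "isSale": "true"},
    {"amountMax": 189.99, "currency": "CAD", "isSale": "false"},
])
print(usd_prices(sample))
```

Returning `None` for a missing price makes it visible when a product has no sale price, instead of silently reusing another value.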

In [None]:
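# ensure every row has a price in USD: the number of rows containing "USD" should equal the total number of rows
print(product_reviews['prices'].str.contains("USD").sum(), len(product_reviews))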

In [None]:
# lists to store the original (full) price and the sale price for each item
full_prices = []
sale_prices = []

for i in product_reviews.index:
    list_dict = json.loads(product_reviews['prices'][i])

    # The original price is taken from the first price entry. The sale price
    # starts as None so a value from a previous row is never carried over
    original_price = float(list_dict[0]['amountMax'])
    sale_price = None

    # Iterate through the list of dictionaries to find the first USD sale price
    for price_info in list_dict:
        if price_info.get('currency') == 'USD' and price_info.get('isSale') == 'true':
            sale_price = float(price_info['amountMax'])
            break

    # Append prices to respective lists
    full_prices.append(original_price)
    sale_prices.append(sale_price)

In [None]:
# checking whether the loop above needs to be adjusted to include a substitute value when there isn't a sale price
print(len(sale_prices),len(full_prices))


In [None]:
# Now we add two columns to showcase the two prices
product_reviews.insert(8,'fullPrice',full_prices)
product_reviews.insert(9,'salePrice',sale_prices)
product_reviews.head()


In [None]:
# now that this is done, we no longer need the original price column
product_reviews = product_reviews.drop(columns='prices')


In [None]:
product_reviews.head()

## The data is finally clean, and we will now move on to using NLP for the following purposes
- elaborating on how positive each review is
    - labeling each review with a level of positivity based on its score
- identifying the topic of each review


In [None]:
# Optional: opens the NLTK downloader GUI as an intro to the toolkit and the language packages it offers. Close it once you've had a good look so the notebook can continue
nltk.download()

In [None]:
nltk.download('vader_lexicon',quiet=True) # required to be used with sentiment analysis intensity
from nltk.sentiment import SentimentIntensityAnalyzer # for identifying the level of sentiment(neg to pos) of text

# class and function of sentiment intensity analysis
sia = SentimentIntensityAnalyzer()


In [None]:
# quick check to make sure all products have reviews.
product_reviews['reviews.text'].isnull().sum()

In [None]:
# libraries for translation
from langdetect import detect
from googletrans import Translator


translator = Translator()
sia = SentimentIntensityAnalyzer()

scores_data = []

for review in product_reviews['reviews.text']:
    try:
        # Translate non-English reviews to English
        if detect(review) != 'en':
            review = translator.translate(review, dest='en').text
    except Exception as e:
        # If detection or translation fails, fall back to the untranslated text
        print(f"Error processing review: {e}")

    # Analyze sentiment for the (translated or original) review. A score is
    # always appended so scores_data stays aligned with the DataFrame rows
    score = sia.polarity_scores(str(review))
    scores_data.append(score)

scores_data[:10]


In [None]:
# Insert a column to store the positivity scores
product_reviews.insert(15, 'positivityScore', [score['compound'] for score in scores_data])

In [None]:
# storing proper labels for each review in a list
positivity_level = []

for i in product_reviews['positivityScore']:
    if .66 <= i <= 1:
        positivity_level.append("highly positive")
    elif .33 <= i < .66:
        positivity_level.append("positive")
    elif .1 <= i < .33:
        positivity_level.append("fairly positive")
    elif -.1 <= i < .1:
        positivity_level.append("neutral")
    elif -.33 <= i < -.1:
        positivity_level.append("fairly negative")
    elif -.66 <= i < -.33:
        positivity_level.append("negative")
    elif -1 <= i < -.66:
        positivity_level.append("highly negative")


# inserting the values from the list into a column for positivity level
product_reviews.insert(16,'positivityLevel',positivity_level)
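
The same seven bins can also be written as a small function, which makes the thresholds easier to read and test. This is just a sketch with a hypothetical name (`positivity_label`); it is meant to reproduce the cut-offs used in the loop above, not to replace it.

```python
def positivity_label(score):
    """Map a compound sentiment score in [-1, 1] to one of seven labels."""
    # (lower bound, label) pairs, checked from the most positive downwards
    thresholds = [
        (0.66, "highly positive"),
        (0.33, "positive"),
        (0.10, "fairly positive"),
        (-0.10, "neutral"),
        (-0.33, "fairly negative"),
        (-0.66, "negative"),
    ]
    for lower_bound, label in thresholds:
        if score >= lower_bound:
            return label
    return "highly negative"

print([positivity_label(s) for s in (0.9, 0.4, 0.0, -0.5, -1.0)])
```

With a function like this, the labels could be produced in one step with `product_reviews['positivityScore'].apply(positivity_label)`.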

In [None]:
product_reviews.head(3)

## Now we'll move on to creating the algorithm for identifying the main topics of the reviews

In [None]:
# used for splitting the reviews by words
from nltk.tokenize import word_tokenize

# pre-trained Punkt tokenizer data that NLTK's tokenizers rely on
nltk.download("punkt",quiet=True)

In [None]:
# For creating the model and additional tools to assist or support it
# Topic modeling library
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy
import spacy
from nltk.corpus import stopwords

# visualizing the model
import pyLDAvis
import pyLDAvis.gensim  # in pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models

warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:

# storing stopwords in a variable to be used shortly (requires the NLTK "stopwords" corpus)
nltk.download("stopwords", quiet=True)
stopwords = stopwords.words("english")
stopwords[0:100:10] # examples of stopwords

In [None]:
translator = Translator()
# Data to be used in the Topic modeling algorithm
data = product_reviews['reviews.text']

for i in range(len(data)):
    review = data[i]

    # Check if the review is in English
    try:
        if detect(review) != 'en':
            # Translate non-English reviews to English
            translation = translator.translate(review, dest='en').text
            data[i] = translation
    except Exception as e:
        print(f"Error processing review: {e}")

In [None]:
# reduce words to their base form (lemma) so the model used later treats variants of a word as one term
def lemmatization(texts, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]):

    # load spaCy's small English pipeline, which tokenizes, tags, and lemmatizes the text (the parser and NER are disabled because they are not needed here)
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for text in texts:
        doc = nlp(text)
        new_text = []
        for token in doc:

            # keep only the allowed parts of speech and eliminate stopwords
            if token.pos_ in allowed_postags and token.text.lower() not in stopwords:
                new_text.append(token.lemma_)
        final = " ".join(new_text)
        texts_out.append(final)
    return texts_out


lemmatized_texts = lemmatization(data)
print (lemmatized_texts[0][0:90])

In [None]:
# Further pre-processing of the texts
def gen_words(texts):
    final = []
    for text in texts:

        # tokenizing the words and removing accent marks, if there are any
        new = gensim.utils.simple_preprocess(text, deacc=True)
        final.append(new)
    return (final)

data_words = gen_words(lemmatized_texts)

print (data_words[0][0:20])

In [None]:
# detect bigrams and trigrams: two or three words that appear next to each other often enough that they are probably meant to be used together
bigrams_phrases = gensim.models.Phrases(data_words, min_count = 5, threshold = 50)
trigram_phrases = gensim.models.Phrases(bigrams_phrases[data_words], threshold = 50)

# turning phrases into a Phraser object, which can then be used to apply those phrases to new sentences
bigram = gensim.models.phrases.Phraser(bigrams_phrases)
trigram = gensim.models.phrases.Phraser(trigram_phrases)

def make_bigrams(texts):
    return [bigram[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram[bigram[doc]] for doc in texts]

data_bigrams = make_bigrams(data_words)
data_bigrams_trigrams = make_trigrams(data_bigrams)

print(data_bigrams_trigrams[0])


In [None]:
# Creating a dictionary mapping words to unique IDs (built from data_words, so the bigrams and trigrams above are not included)
id2word = corpora.Dictionary(data_words)

# Creating a bag-of-words representation of the corpus
corpus = []
for text in data_words:
    # Converting each document to a bag-of-words format
    new = id2word.doc2bow(text)
    corpus.append(new)

# Printing the bag-of-words representation of the first document (up to the first 20 elements)
print(corpus[0][0:20])

# Retrieving the word that corresponds to ID 0 in the dictionary
word = id2word[0]

print(word)
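
To make the bag-of-words format concrete, here is a simplified stand-in written in plain Python. It is not gensim's implementation, the two tiny documents are made up, and gensim may assign IDs in a different order; the point is only that each document becomes a list of `(word_id, count)` pairs.

```python
from collections import Counter

# Two made-up tokenized documents
docs = [["great", "sound", "great", "price"], ["sound", "quality"]]

# Assign each distinct word an integer ID in order of first appearance
word2id = {}
for doc in docs:
    for word in doc:
        word2id.setdefault(word, len(word2id))

def to_bow(doc):
    """Return sorted (word_id, count) pairs for one tokenized document."""
    counts = Counter(word2id[word] for word in doc)
    return sorted(counts.items())

print(word2id)
print([to_bow(doc) for doc in docs])
```

Word order is discarded in this representation, which is why LDA only sees how often each word occurs in each review.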


## Preprocessing and cleaning the texts is complete. Now we go on to creating the model(s) and doing some analysis on it

In [None]:
# Example hyperparameter tuning
import numpy as np
import matplotlib.pyplot as plt

# Vary the number of topics
num_topics_list = list(range(3, 16))
coherence_scores = []

# builds one LDA model per candidate number of topics, then computes a coherence score that measures how well the words in each cluster fit together as a topic
for num_topics in num_topics_list:
    lda_model = gensim.models.ldamodel.LdaModel(
        corpus=corpus,
        id2word=id2word,
        num_topics=num_topics,
        random_state=100,
        update_every=1,
        chunksize=100,
        passes=10,
        alpha="auto"
    )
    coherence_model = CoherenceModel(model=lda_model, texts=data_words, dictionary=id2word, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append(coherence_score)

# Plot the coherence scores; higher is better
plt.plot(num_topics_list, coherence_scores)
plt.xlabel('Number of Topics')
plt.ylabel('Coherence Score')
plt.title('Coherence Score vs. Number of Topics')
plt.show()
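
Besides reading the plot, the best-scoring number of topics can be picked programmatically. The sketch below uses placeholder scores that are purely illustrative, not results from this notebook, and the variable names are hypothetical so they do not overwrite the ones above.

```python
import numpy as np

# Placeholder values for illustration only
candidate_topics = [3, 4, 5, 6]
example_scores = [0.31, 0.35, 0.33, 0.38]

best_index = int(np.argmax(example_scores))   # position of the highest score
best_num_topics = candidate_topics[best_index]

print(best_num_topics, example_scores[best_index])
```

The highest score is a guide rather than a rule; several topic counts with similar scores may be worth inspecting by eye.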


In [None]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=6, # adjustable; the coherence plot above points to the best candidates, which are 6, 8, and 10
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha="auto")


## Visualize the different clusters of topics.
- This is unsupervised, so it is up to the user to decide on a label (topic) for each cluster

In [None]:
# visualizing the model and its details. Interactive display (must be run on a personal machine, as it cannot be displayed on GitHub)
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
pyLDAvis.display(vis)

In [None]:
# main topics throughout all reviews after looking at most frequent words in each cluster
main_topics = [
    "Device and Display Experience",
    "Audio and Speaker Performance",
    "Headphones and Sound Quality",
    "Content Consumption and App/Software Experience",
    "General Product Review and Comparison",
    "TV Box and Streaming Experience"
]


### Besides analyzing the meaning of the review texts, let's also check whether any trends can be found in them

### Specifically, we want to see if there is a noticeable trend between the number of people who found a review helpful and the length of the review
- If there is an identifiable trend, it would be good to create a model for predictions based on it

In [None]:
## First we want to see if there are any null values in either field
product_reviews[['reviews.text','reviews.numHelpful']].isna().sum()

In [None]:
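## To fix this, we'll replace the null values with 0
## (assigning the result back avoids relying on inplace=True on a selected column)
product_reviews['reviews.numHelpful'] = product_reviews['reviews.numHelpful'].fillna(0)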


In [None]:
product_reviews[['reviews.text','reviews.numHelpful']].isna().sum()

In [None]:
## To check on the trend, I'll create a scatter plot
X = product_reviews['reviews.text'].apply(len).to_frame()
y = product_reviews['reviews.numHelpful']


plt.scatter(X,y)
plt.xscale('log')
plt.xlabel("Characters in review")
plt.ylabel("helpful score")
plt.show()

In [None]:
## From this we can see that there is a relationship between review length and the helpful count
## Further, this relationship seems best suited to a polynomial model

from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# creating the test and training datasets for the model
from sklearn.model_selection import train_test_split

# create the training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.33, random_state = 26)
X_train

In [None]:
# pre-processing step that standardizes the data it is applied to
scaler = StandardScaler()

# the mean and std are computed for each feature in X_train, and then the standardization ((x - u) / std) is applied
X_train_scaler = scaler.fit_transform(X_train)

# the same transformation is applied here, but without fit, because we want it to reuse the mean and std computed from X_train
X_test_scaler = scaler.transform(X_test)

#linear
lin = LinearRegression()
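
To show what the scaler is doing, here is the same standardization written out with NumPy on made-up review lengths. It is a sketch of the formula (x - u) / std, with the mean and standard deviation taken from the training data only; the variable names are hypothetical.

```python
import numpy as np

# Made-up review lengths for illustration
train = np.array([[100.0], [200.0], [300.0]])
test = np.array([[400.0]])

mean = train.mean(axis=0)   # computed from the training data only
std = train.std(axis=0)     # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuses the training mean and std

print(train_scaled.ravel(), test_scaled.ravel())
```

Reusing the training statistics on the test set keeps information about the test data from leaking into the preprocessing.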

In [None]:
# expands the scaled feature into polynomial terms up to degree 2; the model fit on these terms is still a linear regression
poly = PolynomialFeatures(degree = 2)

# learns which polynomial terms to generate from X_train_scaler and applies the expansion to it
X_poly_train = poly.fit_transform(X_train_scaler)

# applies the same expansion to the scaled test data, without refitting
X_poly_test = poly.transform(X_test_scaler)

# fit the linear model on the expanded training features to find the coefficients that minimize the squared difference between predicted and actual y_train values
lin.fit(X_poly_train,y_train)
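
For a single input column, a degree-2 expansion with default settings amounts to the columns 1, x, and x², and the linear model then fits one coefficient per column. The NumPy sketch below illustrates this on made-up values; it uses a plain least-squares solve as a stand-in for `LinearRegression`, so treat it as an illustration rather than the notebook's exact computation.

```python
import numpy as np

x = np.array([-1.0, 0.0, 2.0])   # one already-scaled feature, made-up values

# Columns: bias term, x, x squared
expanded = np.column_stack([np.ones_like(x), x, x ** 2])

# Targets generated from known coefficients so the fit can be checked
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Ordinary least squares on the expanded features
coef, *_ = np.linalg.lstsq(expanded, y, rcond=None)

print(expanded)
print(coef)
```

The model stays linear in its coefficients even though the fitted curve is a parabola in x.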

In [None]:
# the linear model is now trained; predicted y values are created for X_poly_test
y_pred = lin.predict(X_poly_test)


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# using MSE because there don't appear to be any outliers in the values of y, as seen in the matplotlib plot above.
# MSE squares each error, so it penalizes large errors heavily and is sensitive to outliers. If outliers are present, MAE is the more robust choice
test_mse = mean_squared_error(y_test,y_pred)
test_mse
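
The difference between the two metrics is easiest to see on a tiny example. The numbers below are made up: one set of predictions is slightly off everywhere, the other is perfect except for one large miss.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_close = np.array([1.5, 2.5, 3.5, 4.5])     # every prediction off by 0.5
y_outlier = np.array([1.0, 2.0, 3.0, 14.0])  # one prediction off by 10

def mse(a, b):
    """Mean of the squared errors."""
    return float(np.mean((a - b) ** 2))

def mae(a, b):
    """Mean of the absolute errors."""
    return float(np.mean(np.abs(a - b)))

print(mse(y_true, y_close), mae(y_true, y_close))      # 0.25 0.5
print(mse(y_true, y_outlier), mae(y_true, y_outlier))  # 25.0 2.5
```

A single large miss multiplies the MSE by 100 here but the MAE only by 5, which is why MAE is usually preferred when outliers are expected.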

In [None]:
# The test MSE is actually very close to the training MSE. The training MSE is expected to be the better of the two, given that the model saw y_train while it was being fit
y_pred_train = lin.predict(X_poly_train)
train_mse = mean_squared_error(y_train,y_pred_train)
train_mse

In [None]:
# compare the difference between the two to see how well the model does on test data it hasn't seen
print(test_mse - train_mse)

### Rerun with a different degree and compare the difference between the MSEs

In [None]:
# same steps as before, now with polynomial terms up to degree 3
poly = PolynomialFeatures(degree = 3)
X_poly_train = poly.fit_transform(X_train_scaler)
X_poly_test = poly.transform(X_test_scaler)
lin.fit(X_poly_train,y_train)

In [None]:
y_pred = lin.predict(X_poly_test)
test_mse1 = mean_squared_error(y_test,y_pred)
test_mse1

In [None]:
y_pred_train = lin.predict(X_poly_train)
train_mse1 = mean_squared_error(y_train,y_pred_train)
train_mse1

In [None]:
print(test_mse1 - train_mse1)


In [None]:
# Overall comparison of both models. The model with the smallest absolute difference between test and training MSE is the best
print("Differences between MSEs. Degree 2: ",(test_mse - train_mse), " vs. Degree 3: ",(test_mse1 - train_mse1))
print("MSE on the training predictions and test predictions respectively\nDegree 2: ",\
      "trained -", train_mse, " test -", test_mse,"\nDegree 3: ","trained -", train_mse1, " test -", test_mse1)

### From the comparison, a polynomial model of degree 3 is the best option to use for the following reasons.
- The test and training MSE are lower for the model with degree = 3.
- The model with degree = 3 has a smaller difference between the test and training MSE.
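
The degree comparison can also be written as one loop, with scaling, polynomial expansion, and regression chained in a scikit-learn pipeline. The sketch below runs on synthetic stand-in data, so its numbers say nothing about the reviews; the names `X_demo`, `y_demo`, and `results` are hypothetical, and the notebook's own `X` and `y` could be swapped in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with a quadratic trend plus noise
rng = np.random.default_rng(26)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 0.5 * X_demo[:, 0] ** 2 + rng.normal(0, 1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=26
)

results = {}
for degree in (1, 2, 3):
    # Scale, expand to polynomial terms, then fit a linear model
    model = make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=degree), LinearRegression()
    )
    model.fit(X_tr, y_tr)
    tr_mse = mean_squared_error(y_tr, model.predict(X_tr))
    te_mse = mean_squared_error(y_te, model.predict(X_te))
    results[degree] = (tr_mse, te_mse)
    print(degree, round(tr_mse, 3), round(te_mse, 3), round(te_mse - tr_mse, 3))
```

A pipeline fits the scaler and the expansion on the training data only, so each degree is evaluated the same way without repeating the intermediate steps by hand.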

## Overall, this marks the end of my project.
### Topics showcased include:
- Topic Modeling with the Gensim library and additional tools
- Sentiment Analysis with nltk
- Regression Modeling with sklearn
- A bit of visualization with matplotlib and pyLDAvis, an LDA-model-based visualization library
