## Introduction
# Problem Statement
**Classifying Amazon reviews based on customer ratings using NLP**

**Impact**
Reviews provide first-hand feedback on a product and are therefore inherently useful for consumers. This feedback is usually summarized by a numerical rating, the number of stars. The text itself carries more information than the star count, and at times the given rating does not truly convey the customer's experience of the product: the heart of the feedback is in the text. The goal, therefore, is to build a classifier that captures the essence of a review and assigns it the most appropriate rating based on the meaning of the text.

**Background**
Although a product's rating on Amazon is aggregated across all customer reviews, each individual rating is an integer from one to five stars. Our predictions are therefore limited to five discrete classes, which makes this a supervised, multi-class classification problem with the review text as the core predictor.

This study is an exploration of Natural Language Processing (NLP). Predicting the star rating from a piece of text will draw on several NLP topics, including word embedding, topic modeling, and dimensionality reduction. From there, we'll arrive at a final dataframe and apply different machine learning techniques to find the best approach (i.e. the most accurate estimator) for our classifier.

**Datasets**
The Amazon dataset contains the customer reviews for all listed Electronics products spanning from May 1996 up to July 2014. There are a total of 1,689,188 reviews by a total of 192,403 customers on 63,001 unique products. The data dictionary is as follows:


*   asin - Unique ID of the product being reviewed, string
*   helpful - A list with two elements: the number of users who voted the review helpful, and the total number of users who voted on the review (including the not-helpful votes), list
*   overall - The reviewer's rating of the product, int64
*   reviewText - The review text itself, string
*   reviewerID - Unique ID of the reviewer, string
*   reviewerName - Specified name of the reviewer, string
*   summary - Headline summary of the review, string
*   unixReviewTime - Unix time of when the review was posted, int64
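
The two-element `helpful` field can be turned into a single helpfulness ratio. The following is a minimal sketch, assuming the field may arrive either as a real list or as a string such as "[8, 10]"; the function name `helpful_ratio` is hypothetical and not part of the original notebook.

```python
import ast

def helpful_ratio(raw):
    """Return helpful_votes / total_votes, or None when nobody voted."""
    # The field may be a string such as "[8, 10]" or an actual list.
    votes = ast.literal_eval(raw) if isinstance(raw, str) else raw
    helpful_votes, total_votes = votes
    if total_votes == 0:
        return None  # avoid dividing by zero for reviews with no votes
    return helpful_votes / total_votes

print(helpful_ratio("[8, 10]"))  # 0.8
print(helpful_ratio("[0, 0]"))   # None
```

Returning None for reviews without votes keeps "no information" distinct from "zero percent helpful".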


EDA

In [None]:
import pandas as pd
data_df = pd.read_excel('/content/all_kindle_review-X.xlsx')

In [None]:
data_df.head()

In [None]:
# Drop stray index columns whose names start with "Unnamed"
data_df = data_df.loc[:, ~data_df.columns.str.contains("^Unnamed")]

In [None]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   asin            12000 non-null  object
 1   helpful         12000 non-null  object
 2   rating          12000 non-null  int64 
 3   reviewText      12000 non-null  object
 4   reviewTime      12000 non-null  object
 5   reviewerID      12000 non-null  object
 6   reviewerName    11962 non-null  object
 7   summary         11998 non-null  object
 8   unixReviewTime  12000 non-null  int64 
dtypes: int64(2), object(7)
memory usage: 843.9+ KB
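
The summary above shows 11,962 non-null values for reviewerName and 11,998 for summary out of 12,000 rows, so 38 and 2 entries are missing, while reviewText is complete. A small sketch of how such gaps could be counted directly follows; `toy_df` and its values are made up for illustration and stand in for `data_df`.

```python
import pandas as pd

# Toy frame standing in for data_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "reviewText": ["Great read", "Too slow", "Loved it"],
    "reviewerName": ["Ann", None, "Raj"],
    "summary": ["Fun", "Meh", None],
})

# Number of missing entries in each column
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Keep only rows where the text used for modeling is present
usable = toy_df.dropna(subset=["reviewText"])
print(len(usable))
```

Because only reviewText feeds the classifier in this study, gaps in reviewerName or summary need not remove any rows.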


The reviewTime column is dropped because the unixReviewTime series records more precisely when each review was posted.

In [None]:
data_df.drop(labels="reviewTime", axis=1, inplace=True)

In [None]:


# Convert the Unix timestamps (seconds since the epoch, UTC) to MM-DD-YYYY strings
data_df["unixReviewTime"] = pd.to_datetime(data_df["unixReviewTime"], unit="s").dt.strftime("%m-%d-%Y")

# Rename the column to reflect its new contents
data_df.rename(columns={"unixReviewTime": "ReviewTime"}, inplace=True)



In [None]:
data_df.head()

Unnamed: 0,asin,helpful,rating,reviewText,reviewerID,reviewerName,summary,ReviewTime
0,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...",A3HHXRELK8BHQG,Ridley,Entertaining But Average,09-02-2010
1,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,10-08-2013
2,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,04-11-2014
3,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,AC4OQW3GZ919J,Cleargrace,very light murder cozy,07-05-2014
4,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,A3C9V987IQHOQD,Rjostler,Book,12-31-2012
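
Unix timestamps count seconds from 1970-01-01 UTC, so the calendar date they map to depends on the time zone used for the conversion. The sketch below pins the conversion to UTC for a single value; the helper name `format_unix_time` and the sample timestamp are illustrative assumptions, not taken from the dataset.

```python
from datetime import datetime, timezone

def format_unix_time(seconds):
    """Format a Unix timestamp (seconds since the epoch) as MM-DD-YYYY in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%m-%d-%Y")

# 1283385600 is midnight UTC on 2 September 2010 (illustrative value)
print(format_unix_time(1283385600))  # 09-02-2010
```

Passing an explicit `tz` avoids results that shift by a day depending on the machine's local time zone.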


In [None]:
# Each review is stored as a string in the reviewText series; print a sample review
print(data_df["reviewText"].iloc[0])

Jace Rankin may be short, but he's nothing to mess with, as the man who was just hauled out of the saloon by the undertaker knows now. He's a famous bounty hunter in Oregon in the 1890s who, when he shot the man in the saloon, just finished a years long quest to avenge his sister's murder and is now trying to figure out what to do next. When the snotty-nosed farm boy he just rescued from a gang of bullies offers him money to kill a man who forced him off his ranch, he reluctantly agrees to bring the man to justice, but not to kill him outright. But, first he needs to tell his sister's widower the news.Kyla "Kyle" Springer Bailey has been riding the trails and sleeping on the ground for the past month while trying to find Jace. She wants revenge on the man who killed her husband and took her ranch, amongst other crimes, and she's not so keen on the detour Jace wants to take. But she realizes she's out of options, so she hides behind her boy persona as best she can and tries to keep pace

In [None]:
print(data_df['rating'].unique())

[3 5 4 2 1]
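
The unique values confirm that all five classes are present, but not how evenly they are represented, which matters for a multi-class classifier. A minimal sketch of a class-balance check follows; the `ratings` list is made up and merely stands in for `data_df["rating"]`.

```python
from collections import Counter

# Made-up ratings standing in for data_df["rating"]
ratings = [3, 5, 4, 2, 1, 5, 5, 4, 3, 5]

counts = Counter(ratings)
total = len(ratings)

# Report how many reviews fall into each star class, lowest to highest
for star in sorted(counts):
    share = counts[star] / total
    print(f"{star} stars: {counts[star]} reviews ({share:.0%})")
```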



NLP Pre-Processing
We'll work with reviewText to prepare our model's final dataframe. The goal is to produce tokens for every document (i.e. every review). Together, these documents make up the corpus from which we'll draw our vocabulary.
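
To make the terms concrete, here is a rough sketch of turning documents into tokens and collecting a vocabulary. It uses a simple regular expression rather than a dedicated NLP library, and the `tokenize` helper and the two sample sentences are illustrative assumptions rather than the tokenizer used later in this study.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    # The optional group keeps contractions such as "didn't" as one token.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

# Two short illustrative documents
documents = [
    "It was extremely violent and gruesome.",
    "I didn't want to put it down!",
]

# One token list per document, then the set of distinct tokens overall
corpus_tokens = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for tokens in corpus_tokens for token in tokens})

print(corpus_tokens[1])
print(len(vocabulary))
```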

The following is a sample text in its original form.

In [None]:
sample_review = data_df["reviewText"].iloc[11961]

print(sample_review)

Although the premise of Serial sounded very interesting to me, I was disappointed in the story.  It was extremely violent and gruesome.  This is not for you if you have a weak stomach.  I had anticipated a psychological thriller and instead read pages recounting violent and gory deaths.  I wish it had scared me more than grossed me out.


HTML Entities
Some special characters, such as the apostrophe (’) and the en dash (–), are encoded as numeric character references: a code number prefixed by &# and suffixed by ;. This is because the dataset was scraped from HTML pages, and it includes data that predates the widespread adoption of UTF-8.

These HTML entities can be decoded with the unescape function from Python's built-in html library.

In [None]:
import html

decoded_review = html.unescape(sample_review)
print(decoded_review)

Although the premise of Serial sounded very interesting to me, I was disappointed in the story.  It was extremely violent and gruesome.  This is not for you if you have a weak stomach.  I had anticipated a psychological thriller and instead read pages recounting violent and gory deaths.  I wish it had scared me more than grossed me out.
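
The sample review above happens to contain no entities, so decoding leaves it unchanged. A hypothetical string with numeric character references shows the effect more clearly; the text in `raw` is invented for illustration.

```python
import html

# Hypothetical raw text containing numeric character references
raw = "It&#8217;s a quick read &#8211; I couldn&#8217;t put it down."

# html.unescape replaces each reference with the character it encodes
decoded = html.unescape(raw)
print(decoded)
```

Here 8217 and 8211 are the decimal code points of the right single quotation mark and the en dash.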


Since punctuation marks add no value to the way we'll perform NLP, all HTML entities in the review texts can simply be dropped. The resulting preprocessed series is reviewText with these special characters removed.

In [None]:
# Match decimal numeric character references such as &#8217;
pattern = r"&#[0-9]+;"

# Strip the matched entities from every review
data_df["preprocessed"] = data_df["reviewText"].str.replace(pat=pattern, repl="", regex=True)
print(data_df["preprocessed"].iloc[11500])

ainslinn and kyle were two very well thought out characters and I enjoyed ainslinn letting go of her fears and allowing kyle to get close to her.loved this book
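
Note that this pattern matches only decimal numeric references. Named entities such as `&amp;` and hexadecimal references are left in place, and dropping an apostrophe merges the surrounding letters. The sketch below compares dropping with decoding on a made-up string; `ENTITY_PATTERN` and `raw` are illustrative names, not part of the notebook.

```python
import html
import re

ENTITY_PATTERN = re.compile(r"&#[0-9]+;")  # decimal references only

raw = "I didn&#8217;t expect this &amp; loved it"  # hypothetical review text

dropped = ENTITY_PATTERN.sub("", raw)   # remove decimal references
decoded = html.unescape(raw)            # decode every entity instead

print(dropped)  # decimal reference removed; "&amp;" is left untouched
print(decoded)  # every entity replaced by its character
```

Whether to drop or decode depends on the later steps: if punctuation is stripped anyway, both routes can end up with similar tokens, but decoding first avoids leaving fragments such as `&amp;` in the text.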
