# NLP - Sentiment Analysis for Amazon Product Reviews
# Data Preprocessing

This is the second phase of the project: cleaning and manipulating the data extracted from Amazon's review section. We start by reading the 'Whey_Protein_Amazon_Scraped_Reviews.csv' file, which is the scraper's direct output.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
import re

### Read the scraped data: up to 10 pages of reviews (at most 10 reviews per page) for each whey protein product currently sold on Amazon.
The dataset has 3314 rows in total; some products have more reviews than others. The columns available/required for the analysis are the following:
1. ID
2. Product_Name
3. Date
4. Rating_Score
5. Reviews
6. Link
7. Product_ID (extracted from 'Link'; note that some product IDs are linked to the same 'Product_Name')
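
As a rough illustration of the note on item 7, the sketch below counts how many distinct product IDs share one product name. The frame `toy` and all of its values are hypothetical stand-ins, not rows from the scraped file.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the scraped file)
toy = pd.DataFrame({
    "Product_Name": ["Whey-A", "Whey-A", "Whey-A", "Whey-B"],
    "Product_ID": ["B000000001", "B000000002", "B000000001", "B000000003"],
})

# Count the distinct product IDs behind each product name
ids_per_name = toy.groupby("Product_Name")["Product_ID"].nunique()

# Keep only the names that map to more than one ID
shared_names = ids_per_name[ids_per_name > 1]
print(shared_names)
```

The same two lines could be run on the real dataframe once 'Product_ID' has been added, assuming the column names listed above.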

In [2]:
# Read scraped results from CSV
df = pd.read_csv('Whey_Protein_Amazon_Scraped_Reviews.csv')

# Name the unnamed index column 'ID' for later use in the sentiment analysis
df.rename(columns = {'Unnamed: 0': 'ID'}, inplace = True)

# Cast 'Reviews' to 'string' and fill empty cells (from the CSV) with the literal 'NA'
df['Reviews'] = df['Reviews'].astype('string')
df = df.fillna('NA')

# Cast 'Link' to 'string' and print the dtypes of the six columns read from the CSV
df['Link'] = df['Link'].astype('string')
print(df.dtypes)

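# Extract product IDs (ASINs) from the links
product_id = []
for value in df['Link']:
    # re.search returns None when a link does not match; fall back to 'NA' in that case
    match = re.search(r'product-reviews/(.*?)/ref=cm_cr', value)
    product_id.append(match.group(1) if match else 'NA')

df["Product_ID"] = product_id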

# Print the dataframe's shape to confirm 7 columns in total (the 6 listed by dtypes plus 'Product_ID')
print(df.shape)

ID                int64
Product_Name     object
Date             object
Rating_Score    float64
Reviews          string
Link             string
dtype: object
(3314, 7)
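
A vectorized alternative to the extraction loop in cell 2 might look like the sketch below. It uses pandas' `Series.str.extract`, which yields a missing value where the pattern does not match. The links here are hypothetical examples in the shape the pattern expects, not URLs from the dataset.

```python
import pandas as pd

# Hypothetical review-page links (not taken from the scraped file)
links = pd.Series([
    "https://www.amazon.com/Some-Product/product-reviews/B000000001/ref=cm_cr_dp_d_show_all_btm",
    "NA",  # placeholder such as the one left by fillna('NA')
], dtype="string")

# With one capture group and expand=False, str.extract returns a Series;
# rows that do not match the pattern become <NA>
asin = links.str.extract(r"product-reviews/(.*?)/ref=cm_cr", expand=False)
print(asin.tolist())
```

On the real dataframe this would be applied to `df['Link']`, assuming every link follows the same `product-reviews/<ASIN>/ref=cm_cr` layout.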


In [3]:
df.head(5)

| | ID | Product_Name | Date | Rating_Score | Reviews | Link | Product_ID |
|---|---|---|---|---|---|---|---|
| 0 | 0 | NatureWorks-HydroMATE-Electrolytes-Chocolate-C... | 2023-01-25 | 5.0 | I love this. I make it for myself and my kids... | https://www.amazon.com/NatureWorks-HydroMATE-E... | B0BRT77ZK8 |
| 1 | 1 | NatureWorks-HydroMATE-Electrolytes-Chocolate-C... | 2023-02-06 | 5.0 | Takes away lightheadedness and makes my husba... | https://www.amazon.com/NatureWorks-HydroMATE-E... | B0BRT77ZK8 |
| 2 | 2 | NatureWorks-HydroMATE-Electrolytes-Chocolate-C... | 2023-01-27 | 5.0 | The chocolate tastes delicious! I drink it ev... | https://www.amazon.com/NatureWorks-HydroMATE-E... | B0BRT77ZK8 |
| 3 | 3 | NatureWorks-HydroMATE-Electrolytes-Chocolate-C... | 2023-01-27 | 5.0 | I absolutely love this! My buddy gave me a fe... | https://www.amazon.com/NatureWorks-HydroMATE-E... | B0BRT77ZK8 |
| 4 | 4 | NatureWorks-HydroMATE-Electrolytes-Chocolate-C... | 2023-02-18 | 4.0 | I like to work out regularly. This includes w... | https://www.amazon.com/NatureWorks-HydroMATE-E... | B0BRT77ZK8 |


In [4]:
# Save the dataframe to a CSV file for further analysis.
# index = False avoids writing another unnamed index column, since 'ID' already identifies each row
df.to_csv('Whey_Protein_Amazon_Preprocessed_Reviews.csv', index = False)
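
Whether the row index ends up in the saved file depends on the `index` argument of `to_csv`. The sketch below shows the difference on a small hypothetical frame, writing to in-memory buffers instead of files so it can be run anywhere.

```python
import io
import pandas as pd

# Toy frame with an explicit ID column (hypothetical values)
toy = pd.DataFrame({"ID": [0, 1], "Rating_Score": [5.0, 4.0]})

# By default, to_csv writes the index as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the real columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

When the index is written, it comes back on the next read as an 'Unnamed: 0' column, the same kind of column that cell 2 renames to 'ID'.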