<a href="https://colab.research.google.com/github/LenSin3/Conservatory-Product-Review-Classififcation/blob/main/pc_product_review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Amazon Review - Classifying Reviews on PC**

# **Objective**

The main objective of this project is to compare the performance of different machine learning models on text analysis. The models are assessed by the accuracy of their predictions, using a dataset of Amazon reviews of PC products.

# **Project Steps**

The project is divided into the following sections:
1. PySpark Environment Setup
2. Extract Data
3. Exploratory Data Analysis
4. Preprocessing For Machine Learning
5. Machine Learning Models
6. Discuss Best Model
7. Conclusions

## **PySpark Environment Setup**

In [1]:
import os
# Find the latest version of spark 3.0  from http://www-us.apache.org/dist/spark/ and enter as the spark version
# For example:
# spark_version = 'spark-3.0.1'
spark_version = 'spark-3.0.1'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()

Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Ign:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:7 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:10 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:11 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1,386 kB]
Get:12 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packa

## **Extract Data**

The data is read from the amazon-reviews-pds bucket on s3.amazonaws.com.


In [2]:
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AmazonPCReviews").getOrCreate()

In [3]:
# Read in data from S3 Buckets
from pyspark import SparkFiles
url = 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_PC_v1_00.tsv.gz'
spark.sparkContext.addFile(url)
pc_review_df = spark.read.csv(SparkFiles.get("amazon_reviews_us_PC_v1_00.tsv.gz"), sep=r'\t', header=True)

# Show DataFrame
pc_review_df.show()

+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|         US|   22873041|R3ARRMDEGED8RD|B00KJWQIIC|     335625766|Plemo 14-Inch Lap...|              PC|          5|            0|          0|   N|                Y|Pleasantly surprised|I was very surpri...| 2015-08-31|
|         US|   30088427| RQ28TSA020Y6J|B013ALA9LA|     671157305|TP-Link OnHub AC1...|              PC|          5|    
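
The source file is a gzip-compressed, tab-separated text file with a header row. As a minimal sketch of what reading such a file involves, the snippet below parses a tiny in-memory sample with Python's standard library instead of Spark. The two sample rows and the reduced column set are assumptions made for illustration. Note that every parsed value is a string, which is relevant to the data type checks later on.

```python
import csv
import gzip
import io

# Hypothetical two-row sample using a few of the dataset's column names.
SAMPLE_TSV = (
    "marketplace\tcustomer_id\tstar_rating\treview_headline\n"
    "US\t22873041\t5\tPleasantly surprised\n"
    "US\t17228356\t4\tnan\n"
)

def read_gzipped_tsv(raw_bytes):
    """Decompress gzip bytes and parse them as tab-separated rows with a header."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)

# Compress the sample in memory so the example needs no network access.
compressed = gzip.compress(SAMPLE_TSV.encode("utf-8"))
rows = read_gzipped_tsv(compressed)
print(len(rows), rows[0]["star_rating"], type(rows[0]["star_rating"]).__name__)
```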

## **Exploratory Data Analysis**

This section will manipulate the data, employing exploratory data analysis techniques to provide a general overview of trends and patterns.

The data is first inspected, cleaned and transformed into the right format for analysis.

### **Size of dataset**

The size of the dataset is measured by counting its rows.

In [4]:
# Count number of rows in dataset
num_rows = pc_review_df.count()
print(f'There are {num_rows} records in the dataset')

There are 6908554 records in the dataset


This is a large dataset containing **6908554** rows of reviews. The use of a cloud platform is hence justified.

### **Check Data Types**

Use dataframe.dtypes to check for column data types

**Check data frame columns**

In [13]:
pc_review_df.columns

['marketplace',
 'customer_id',
 'review_id',
 'product_id',
 'product_parent',
 'product_title',
 'product_category',
 'star_rating',
 'helpful_votes',
 'total_votes',
 'vine',
 'verified_purchase',
 'review_headline',
 'review_body',
 'review_date']

In [12]:

print(len(pc_review_df.columns))

15


There are 15 columns in the dataframe.

**Check for nan and Null Values**

In [16]:
from pyspark.sql.functions import isnan, when, count, col
pc_review_df.select([count(when(isnan(c), c)).alias(c) for c in pc_review_df.columns]).show()

+-----------+-----------+---------+----------+--------------+-------------+----------------+-----------+-------------+-----------+----+-----------------+---------------+-----------+-----------+
|marketplace|customer_id|review_id|product_id|product_parent|product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|review_headline|review_body|review_date|
+-----------+-----------+---------+----------+--------------+-------------+----------------+-----------+-------------+-----------+----+-----------------+---------------+-----------+-----------+
|          0|          0|        0|         0|             0|            0|               0|          0|            0|          0|   0|                0|              1|          0|          0|
+-----------+-----------+---------+----------+--------------+-------------+----------------+-----------+-------------+-----------+----+-----------------+---------------+-----------+-----------+



The dataset is relatively clean except for one nan value in review_headline.
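
A nan check and a null check measure different things, and because every column is read as text, the flagged headline appears to be the literal text "nan" rather than a missing value. The sketch below illustrates the distinction in plain Python. `classify_missing` is a hypothetical helper, and the assumption that Spark treats the text "nan" as a floating-point NaN when applying `isnan` to a string column is inferred from the output above, not verified here.

```python
import math

def classify_missing(value):
    """Label a raw field as 'null', 'nan', or 'present'."""
    if value is None:
        return "null"
    try:
        if math.isnan(float(value)):
            return "nan"
    except ValueError:
        pass  # not numeric at all, so it cannot be NaN
    return "present"

for field in [None, "nan", "Pleasantly surprised", "5"]:
    print(repr(field), "->", classify_missing(field))
```

Under this reading, the headline "nan" is flagged by the nan check but would not be counted by a null check.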

In [19]:
pc_review_df.filter(isnan(pc_review_df.review_headline)).show()

+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+---------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+---------------+--------------------+-----------+
|         US|   17228356|R1BHLWPYHYJ3Y6|B003ZWZ72G|     781813215|Gizmo Dorks Doubl...|              PC|          4|            0|          0|   N|                Y|            nan|got it before the...| 2010-11-29|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+----------

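An attempt is made to replace the nan value with a null value. Because every column is of string type, replacing the float nan leaves the row unchanged, as the second output shows.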

In [26]:
pc_review_df_replace_nan = pc_review_df.replace(float('nan'), None)

In [27]:
pc_review_df_replace_nan.filter(isnan(pc_review_df_replace_nan.review_headline)).show()

+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+---------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+---------------+--------------------+-----------+
|         US|   17228356|R1BHLWPYHYJ3Y6|B003ZWZ72G|     781813215|Gizmo Dorks Doubl...|              PC|          4|            0|          0|   N|                Y|            nan|got it before the...| 2010-11-29|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+----------
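
The row is still returned after the replacement. One plausible explanation is that the column holds the text "nan" while the replacement targets the float NaN, and the two never match. The plain-Python sketch below shows the idea on two sample rows (taken from review IDs shown above, with all other columns omitted). `replace_nan_string` is a hypothetical helper. In PySpark the analogous step would presumably compare the column against the string 'nan', for example with `when(...).otherwise(...)`, but that is not run here.

```python
def replace_nan_string(rows, column):
    """Return new row dicts where the literal text 'nan' in `column` becomes None."""
    return [
        {**row, column: None} if row.get(column) == "nan" else dict(row)
        for row in rows
    ]

sample = [
    {"review_id": "R1BHLWPYHYJ3Y6", "review_headline": "nan"},
    {"review_id": "R3ARRMDEGED8RD", "review_headline": "Pleasantly surprised"},
]
cleaned = replace_nan_string(sample, "review_headline")

# A float NaN never compares equal to the text "nan", so matching on the float finds nothing.
print(float("nan") == "nan")
print(cleaned[0]["review_headline"], "|", cleaned[1]["review_headline"])
```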

In [23]:
pc_review_df.select([count(when(col(c).isNull(), c)).alias(c) for c in pc_review_df.columns]).toPandas().T
         

Unnamed: 0,0
marketplace,0
customer_id,0
review_id,0
product_id,0
product_parent,0
product_title,0
product_category,3
star_rating,3
helpful_votes,3
total_votes,3


The table shows that some columns contain null values (a count other than zero). These rows can be dropped without significantly affecting later analysis: the original dataframe contains 6908554 rows, and the number of null values is negligible by comparison.
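
To put a number on "insignificant", the share of affected rows can be computed directly. This is a small worked example using the count of 3 shown for product_category and the total of 6908554 rows. `null_share` is a hypothetical helper, and the calculation assumes the columns not visible in the output above have similarly small counts.

```python
def null_share(null_rows, total_rows):
    """Fraction of rows that contain a null in a given column."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return null_rows / total_rows

share = null_share(3, 6908554)
print(f"{share:.2e} of all rows")
```

The result is well below one row in a million, which supports dropping the rows instead of imputing values.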

Drop null values

In [30]:
pc_review_drop_na = pc_review_df.dropna()

In [31]:
pc_review_drop_na.select([count(when(col(c).isNull(), c)).alias(c) for c in pc_review_df.columns]).toPandas().T

Unnamed: 0,0
marketplace,0
customer_id,0
review_id,0
product_id,0
product_parent,0
product_title,0
product_category,0
star_rating,0
helpful_votes,0
total_votes,0


All rows with null values are now dropped.

**Check for duplicates**

In [32]:
pc_review_drop_na_duplicates = pc_review_drop_na.distinct().count()
print(pc_review_drop_na_duplicates)

6908145


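The distinct count (**6908145**) is lower than the original row count (**6908554**), a difference of 409 rows. Since the distinct count is taken after the null rows were dropped, some of this difference may come from dropna() and not from duplicates. Any duplicate rows are dropped below.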

**Drop Duplicates**

In [87]:
pc_review_drop_duplicates = pc_review_drop_na.drop_duplicates()

In [34]:
pc_review_drop_duplicates.count()

6908145

The current count matches the distinct count, which confirms that no duplicate rows remain.
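
The duplicate check compares a total row count with a distinct row count. A minimal plain-Python sketch of that comparison is shown here on three sample rows, where the third is a deliberate copy of the first. `duplicate_summary` is a hypothetical helper and the two-field rows are a simplification of the 15-column records.

```python
from collections import Counter

def duplicate_summary(rows):
    """Count total rows, distinct rows, and surplus copies in a list of hashable rows."""
    counts = Counter(rows)
    repeated = {row: n for row, n in counts.items() if n > 1}
    return {
        "total": len(rows),
        "distinct": len(counts),
        "surplus": len(rows) - len(counts),
        "repeated": repeated,
    }

sample = [
    ("R3ARRMDEGED8RD", "5"),
    ("RQ28TSA020Y6J", "5"),
    ("R3ARRMDEGED8RD", "5"),  # exact copy of the first row
]
summary = duplicate_summary(sample)
print(summary["total"], summary["distinct"], summary["surplus"])
```

For the comparison to isolate duplicates, both counts should be taken on the same dataframe, for instance the one obtained after dropping nulls.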

**Check data types** 

In [88]:
# Note: this import shadows Python's built-in min() and max() for the rest of the notebook
from pyspark.sql.functions import mean, min, max


In [89]:
pc_review_drop_duplicates.dtypes

[('marketplace', 'string'),
 ('customer_id', 'string'),
 ('review_id', 'string'),
 ('product_id', 'string'),
 ('product_parent', 'string'),
 ('product_title', 'string'),
 ('product_category', 'string'),
 ('star_rating', 'string'),
 ('helpful_votes', 'string'),
 ('total_votes', 'string'),
 ('vine', 'string'),
 ('verified_purchase', 'string'),
 ('review_headline', 'string'),
 ('review_body', 'string'),
 ('review_date', 'string')]

In [81]:
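# Look for helpful_votes values that contain a lowercase letter or whitespace, i.e. non-numeric entries
pc_review_drop_duplicates.filter(col("helpful_votes").rlike('([a-z]|\\s+)')).show(truncate=False)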

+-----------+-----------+---------+----------+--------------+-------------+----------------+-----------+-------------+-----------+----+-----------------+---------------+-----------+-----------+
|marketplace|customer_id|review_id|product_id|product_parent|product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|review_headline|review_body|review_date|
+-----------+-----------+---------+----------+--------------+-------------+----------------+-----------+-------------+-----------+----+-----------------+---------------+-----------+-----------+
+-----------+-----------+---------+----------+--------------+-------------+----------------+-----------+-------------+-----------+----+-----------------+---------------+-----------+-----------+



The output indicates that every column is of string type, and the regular-expression filter returns no rows, so **helpful_votes** contains no lowercase letters or whitespace. **star_rating**, **helpful_votes** and **total_votes** hold numbers, which suggests converting them to a numeric type. On closer inspection, however, **star_rating** should not be treated as numeric because its values rate, and therefore categorize, the reviews. Hence **star_rating** is a categorical variable and one of the main variables of interest for the machine learning models.
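
The check-then-convert idea can be sketched in plain Python: screen a value with the same pattern used in the rlike() filter, then convert it, and fall back to None when conversion is not possible. `to_int_or_none` is a hypothetical helper. Returning None for bad input is meant to mirror how a Spark cast is assumed to behave with default (non-ANSI) settings, which is not verified here.

```python
import re

# Same pattern as the rlike() filter: a lowercase letter or a run of whitespace.
SUSPECT = re.compile(r"([a-z]|\s+)")

def to_int_or_none(value):
    """Convert a string to int, returning None when it is missing or not a whole number."""
    if value is None or SUSPECT.search(value):
        return None
    try:
        return int(value)
    except ValueError:
        return None

print([to_int_or_none(v) for v in ["0", "12", "N", "1 2", "", None]])
```

The pattern only catches lowercase letters and whitespace, so a value such as "N" passes the screen and is rejected by the conversion itself. A stricter screen would match anything that is not a digit.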


In [90]:
from functools import reduce  # a built-in in Python 2; must be imported in Python 3

from pyspark.sql import functions as func
from pyspark.sql.types import IntegerType, DoubleType, LongType, DateType

# Earlier attempt, kept for reference and not executed: cast both vote
# columns in a single pass by folding withColumn over the column names.
# integer_cols = ['helpful_votes', 'total_votes']
# pc_review_drop_duplicates_cast_int = reduce(
#     lambda df, c: df.withColumn(c, df[c].cast(DoubleType())),
#     integer_cols,
#     pc_review_drop_duplicates
# )




In [None]:
# pc_review_drop_duplicates_cast_int = pc_review_drop_duplicates.withColumn('total_votes', func.col('total_votes').cast('long'))

**Convert review_date to date**

In [91]:
from pyspark.sql.types import DateType
pc_review_drop_duplicates_cast_int_date = pc_review_drop_duplicates.withColumn("review_date", pc_review_drop_duplicates['review_date'].cast(DateType()))

In [93]:
pc_review_drop_duplicates_cast_int_date.dtypes

[('marketplace', 'string'),
 ('customer_id', 'string'),
 ('review_id', 'string'),
 ('product_id', 'string'),
 ('product_parent', 'string'),
 ('product_title', 'string'),
 ('product_category', 'string'),
 ('star_rating', 'string'),
 ('helpful_votes', 'string'),
 ('total_votes', 'string'),
 ('vine', 'string'),
 ('verified_purchase', 'string'),
 ('review_headline', 'string'),
 ('review_body', 'string'),
 ('review_date', 'date')]

In [94]:
pc_review_drop_duplicates_cast_int_date.schema

StructType(List(StructField(marketplace,StringType,true),StructField(customer_id,StringType,true),StructField(review_id,StringType,true),StructField(product_id,StringType,true),StructField(product_parent,StringType,true),StructField(product_title,StringType,true),StructField(product_category,StringType,true),StructField(star_rating,StringType,true),StructField(helpful_votes,StringType,true),StructField(total_votes,StringType,true),StructField(vine,StringType,true),StructField(verified_purchase,StringType,true),StructField(review_headline,StringType,true),StructField(review_body,StringType,true),StructField(review_date,DateType,true)))
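
The schema shows review_date as DateType after the cast. The conversion itself is simple to illustrate in plain Python: parse the yyyy-MM-dd text seen in the data into a date object and return None when the text does not match. `to_date_or_none` is a hypothetical helper, and treating unparseable text as None is an assumption about the desired behavior.

```python
from datetime import date, datetime

def to_date_or_none(value, fmt="%Y-%m-%d"):
    """Parse a yyyy-MM-dd string into a date, returning None if it does not match."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None

parsed = to_date_or_none("2015-08-31")
print(isinstance(parsed, date), parsed.year, to_date_or_none("not a date"))
```

Once the value is a real date, components such as the year or month can be used for grouping reviews over time.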

In [95]:
pc_review_drop_duplicates_cast_int_date_new = pc_review_drop_duplicates_cast_int_date.withColumn('total_votes', func.col('total_votes').cast('long'))

In [96]:
pc_review_drop_duplicates_cast_int_date_new.schema

StructType(List(StructField(marketplace,StringType,true),StructField(customer_id,StringType,true),StructField(review_id,StringType,true),StructField(product_id,StringType,true),StructField(product_parent,StringType,true),StructField(product_title,StringType,true),StructField(product_category,StringType,true),StructField(star_rating,StringType,true),StructField(helpful_votes,StringType,true),StructField(total_votes,LongType,true),StructField(vine,StringType,true),StructField(verified_purchase,StringType,true),StructField(review_headline,StringType,true),StructField(review_body,StringType,true),StructField(review_date,DateType,true)))

In [97]:
pc_review_drop_duplicates_cast_int_date_new_df = pc_review_drop_duplicates_cast_int_date_new.withColumn('helpful_votes', func.col('helpful_votes').cast('long'))

In [98]:
pc_review_drop_duplicates_cast_int_date_new_df.schema

StructType(List(StructField(marketplace,StringType,true),StructField(customer_id,StringType,true),StructField(review_id,StringType,true),StructField(product_id,StringType,true),StructField(product_parent,StringType,true),StructField(product_title,StringType,true),StructField(product_category,StringType,true),StructField(star_rating,StringType,true),StructField(helpful_votes,LongType,true),StructField(total_votes,LongType,true),StructField(vine,StringType,true),StructField(verified_purchase,StringType,true),StructField(review_headline,StringType,true),StructField(review_body,StringType,true),StructField(review_date,DateType,true)))
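
The casts above are applied one column at a time, each producing a new dataframe name. The same sequence can be expressed as a fold over a mapping of column names to converters. The sketch below shows the pattern on a single record in plain Python. `cast_column`, `cast_columns` and the sample values are hypothetical. A PySpark version would presumably fold `withColumn` over the dataframe in the same way, which is not run here.

```python
from functools import reduce

def cast_column(row, column, caster):
    """Return a copy of `row` with one column converted by `caster`."""
    return {**row, column: caster(row[column])}

def cast_columns(row, casters):
    """Fold cast_column over every (column, caster) pair, one column at a time."""
    return reduce(
        lambda acc, item: cast_column(acc, item[0], item[1]),
        casters.items(),
        row,
    )

# Hypothetical record; star_rating is left as text because it is categorical.
raw = {"star_rating": "5", "helpful_votes": "0", "total_votes": "3"}
typed = cast_columns(raw, {"helpful_votes": int, "total_votes": int})
print(typed)
```

Keeping the column-to-type mapping in one place makes it easier to see which columns were converted and which were left as strings on purpose.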