# Amazon Product Ratings Dataset: Data Pre-Processing and Cleaning

## Introduction

The Amazon Product Ratings dataset contains product ratings and metadata from Amazon. This notebook performs data cleaning and data pre-processing on it.

***Attribute Information***

1. userId: unique ID identifying each user (first column)

2. productId: unique ID identifying each product (second column)

3. Rating: rating given to the corresponding product by the corresponding user (third column)

4. timestamp: time of the rating (fourth column)

## Importing Libraries

In [2]:
# Importing NumPy (np is the alias for numpy)
import numpy as np
# Importing Matplotlib (plt is the alias for pyplot)
import matplotlib.pyplot as plt
# Importing pandas (pd is the alias for pandas)
import pandas as pd

## Loading Data into Dataframe

In [3]:
# Loading the Dataset
amazon_df = pd.read_csv("Amazon_Product_Ratings_Dataset.csv")

## Exploring Data and Performing Data Cleaning and Pre-Processing

In [None]:
# Showing the first five rows
amazon_df.head()

| | AKM1MP6P0OYPR | 0132793040 | 5.0 | 1365811200 |
|---|---|---|---|---|
| 0 | A2CX7LUOHB2NDG | 321732944 | 5.0 | 1341100800 |
| 1 | A2NWSAGRHCP8N5 | 439886341 | 1.0 | 1367193600 |
| 2 | A2WNBOD3WNDNKT | 439886341 | 3.0 | 1374451200 |
| 3 | A1GI0U4ZRJA8WN | 439886341 | 1.0 | 1334707200 |
| 4 | A1QGNMC6O1VW39 | 511189877 | 5.0 | 1397433600 |


In [8]:
# Showing the shape of the Dataset
amazon_df.shape

(7824481, 4)

*From the output above, the data contains 7,824,481 rows (approximately 7.82 million) and 4 columns, so it is a very large dataset.*

## Renaming the columns

In [4]:
# Rename the columns (the current labels are the values of the first record in the file)
amazon_df.rename(columns={'AKM1MP6P0OYPR': 'userId', '0132793040': 'productId', '5.0': 'ratings',
                          '1365811200': 'timestamp'}, inplace=True)
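
The labels replaced above are the values of one rating record, which suggests that the file has no header row and that `read_csv` used its first line as the header. A minimal sketch, assuming that is the case: passing `header=None` and `names=` at load time keeps that first record in the data. The in-memory sample (built from rows shown above) and the `dtype` override for `productId` are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Small in-memory stand-in for the CSV file (no header row), using rows shown above.
sample_csv = io.StringIO(
    "AKM1MP6P0OYPR,0132793040,5.0,1365811200\n"
    "A2CX7LUOHB2NDG,321732944,5.0,1341100800\n"
    "A2NWSAGRHCP8N5,439886341,1.0,1367193600\n"
)

column_names = ["userId", "productId", "ratings", "timestamp"]

# header=None keeps the first line as data; names= assigns the column labels.
# Reading productId as str is an assumption made here to preserve leading zeros.
sample_df = pd.read_csv(
    sample_csv,
    header=None,
    names=column_names,
    dtype={"productId": str},
)

print(sample_df.shape)
print(sample_df.head())
```

Loaded this way, the frame would have one more row than the shape reported earlier, and the separate `rename` step would not be needed.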

In [5]:
# Check datatypes
amazon_df.dtypes

userId        object
productId     object
ratings      float64
timestamp      int64
dtype: object

From the output above, the columns have the following data types:
- userId => object
- productId => object
- ratings => float64
- timestamp => int64

In [11]:
# Show a concise summary of the DataFrame (columns, dtypes, memory usage)
amazon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7824481 entries, 0 to 7824480
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     object 
 1   productId  object 
 2   ratings    float64
 3   timestamp  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 238.8+ MB
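
The `+` in `238.8+ MB` marks the figure as a lower bound: for `object` columns, pandas by default counts only the references, not the Python strings they point to. The sketch below illustrates the difference on a toy frame (`toy_df` is a hypothetical stand-in, with the string column forced to `object` dtype to mirror the dtypes shown above).

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        # dtype="object" mirrors the userId/productId dtype reported above.
        "userId": pd.Series(
            ["A2CX7LUOHB2NDG", "A2NWSAGRHCP8N5", "A2WNBOD3WNDNKT"], dtype="object"
        ),
        "ratings": [5.0, 1.0, 3.0],
    }
)

# Shallow: object columns are counted as one reference per value.
shallow_bytes = toy_df.memory_usage(deep=False).sum()
# Deep: the size of each Python string is also included.
deep_bytes = toy_df.memory_usage(deep=True).sum()

print(shallow_bytes, deep_bytes)
```

On the full dataset, `amazon_df.info(memory_usage="deep")` reports the deep figure, though it may take noticeably longer on millions of rows.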


In [12]:
# Check Duplicates
amazon_df.duplicated().sum()

0

There are no duplicate rows in the dataset.
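
`duplicated()` with no arguments flags a row only when every column matches an earlier row. If the question is whether the same user rated the same product more than once, the `subset` parameter restricts the comparison to those columns. The toy data below is hypothetical and only illustrates the difference; whether such repeated pairs occur in the Amazon data is not shown in this notebook.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "userId": ["U1", "U1", "U2"],
        "productId": ["P1", "P1", "P1"],
        "ratings": [5.0, 4.0, 3.0],
        "timestamp": [1341100800, 1367193600, 1374451200],
    }
)

# Full-row check: no two rows agree on every column.
full_row_duplicates = int(toy_df.duplicated().sum())
# Pair check: the second row repeats the (U1, P1) pair of the first row.
pair_duplicates = int(toy_df.duplicated(subset=["userId", "productId"]).sum())

print(full_row_duplicates, pair_duplicates)
```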

In [13]:
# Check the presence of missing values
amazon_df.isnull().sum()

userId       0
productId    0
ratings      0
timestamp    0
dtype: int64

There are no null values in the dataset.

The timestamp column stores each date as a Unix timestamp (an integer number of seconds since 1970-01-01 UTC) rather than in a readable form, so we write a custom function to convert these values into a date-time format.

In [6]:
# Importing 'datetime' library
import datetime

# Custom function to convert a Unix timestamp (in seconds) into a formatted date-time string
# Note: fromtimestamp() without a tz argument uses the machine's local time zone
def date_time(x):
    formatted = datetime.datetime.fromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S')
    return formatted

In [33]:
# Checking if the function is working properly
date_time(1341100800)

'2012-07-01 05:30:00'
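
`datetime.datetime.fromtimestamp(x)` converts to the local time zone of the machine running the code, so the `05:30:00` above is consistent with a machine set to UTC+05:30, and the same call can return a different string elsewhere. A minimal sketch of a time-zone-independent variant follows; `to_utc_string` is a hypothetical helper name, not part of the original notebook.

```python
import datetime

def to_utc_string(unix_seconds):
    """Format a Unix timestamp (in seconds) as a UTC date-time string."""
    utc_dt = datetime.datetime.fromtimestamp(unix_seconds, tz=datetime.timezone.utc)
    return utc_dt.strftime("%Y-%m-%d %H:%M:%S")

print(to_utc_string(1341100800))  # 2012-07-01 00:00:00
```

In UTC this timestamp falls exactly on midnight, which indicates that the `05:30:00` time component comes from the local offset rather than from the rating itself.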

In [34]:
# Applying the function on whole column by using apply() method
amazon_df['date_time'] = amazon_df['timestamp'].apply(date_time)
# Showing the first five rows
amazon_df.head()

| | userId | productId | ratings | timestamp | date_time |
|---|---|---|---|---|---|
| 0 | A2CX7LUOHB2NDG | 321732944 | 5.0 | 1341100800 | 2012-07-01 05:30:00 |
| 1 | A2NWSAGRHCP8N5 | 439886341 | 1.0 | 1367193600 | 2013-04-29 05:30:00 |
| 2 | A2WNBOD3WNDNKT | 439886341 | 3.0 | 1374451200 | 2013-07-22 05:30:00 |
| 3 | A1GI0U4ZRJA8WN | 439886341 | 1.0 | 1334707200 | 2012-04-18 05:30:00 |
| 4 | A1QGNMC6O1VW39 | 511189877 | 5.0 | 1397433600 | 2014-04-14 05:30:00 |


In [36]:
# Checking the data types after adding the 'date_time' column
amazon_df.dtypes

userId        object
productId     object
ratings      float64
timestamp      int64
date_time     object
dtype: object

In [37]:
# Convert the 'date_time' column from strings to pandas datetime
amazon_df['date_time'] = pd.to_datetime(amazon_df['date_time'])
# Checking data type of date_time column
amazon_df.dtypes

userId               object
productId            object
ratings             float64
timestamp             int64
date_time    datetime64[ns]
dtype: object
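
The conversion above takes two passes: a Python-level `apply()` that builds strings, followed by `pd.to_datetime` to parse them back. As a hedged alternative, `pd.to_datetime` can read the integer seconds directly with `unit="s"`, which is vectorized and typically much faster on millions of rows. The sketch uses a small stand-in frame (`toy_df`); note that the result is in UTC, so the time component is `00:00:00` rather than `05:30:00`.

```python
import pandas as pd

toy_df = pd.DataFrame({"timestamp": [1341100800, 1367193600, 1374451200]})

# unit="s" tells pandas the integers are seconds since the Unix epoch (UTC).
toy_df["date_time"] = pd.to_datetime(toy_df["timestamp"], unit="s")

print(toy_df.dtypes)
print(toy_df)
```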

## Splitting 'date_time' column into 'day', 'month', 'year' & 'time'

In [38]:
# Making a 'day' column from 'date_time' column by using map() method and lambda
amazon_df["day"] = amazon_df['date_time'].map(lambda x: x.day)
# Making a 'month' column from 'date_time' column by using map() method and lambda
amazon_df["month"] = amazon_df['date_time'].map(lambda x: x.month)
# Making a 'year' column from 'date_time' column by using map() method and lambda
amazon_df["year"] = amazon_df['date_time'].map(lambda x: x.year)
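
`map()` with a lambda calls a Python function once per row. For a `datetime64` column, the `.dt` accessor exposes the same fields as vectorized properties; the sketch below shows the equivalent extraction on a small stand-in frame (`toy_df` is illustrative only).

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"date_time": pd.to_datetime(["2012-07-01 05:30:00", "2013-04-29 05:30:00"])}
)

# Vectorized equivalents of the map/lambda calls above.
toy_df["day"] = toy_df["date_time"].dt.day
toy_df["month"] = toy_df["date_time"].dt.month
toy_df["year"] = toy_df["date_time"].dt.year

print(toy_df)
```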

In [44]:
# Making a 'time' column from 'date_time' column using pandas to_datetime() function
amazon_df["time"] = pd.to_datetime(amazon_df['date_time']).dt.time

In [51]:
# Showing the first three rows
amazon_df.head(3)

| | userId | productId | ratings | timestamp | date_time | day | month | year | time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A2CX7LUOHB2NDG | 321732944 | 5.0 | 1341100800 | 2012-07-01 05:30:00 | 1 | 7 | 2012 | 05:30:00 |
| 1 | A2NWSAGRHCP8N5 | 439886341 | 1.0 | 1367193600 | 2013-04-29 05:30:00 | 29 | 4 | 2013 | 05:30:00 |
| 2 | A2WNBOD3WNDNKT | 439886341 | 3.0 | 1374451200 | 2013-07-22 05:30:00 | 22 | 7 | 2013 | 05:30:00 |

In [50]:
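# Inspecting the 'time' column
# Note: 'unique' is referenced here without parentheses, so the output below is the
# bound method's repr (a preview of the Series), not the distinct values;
# call amazon_df["time"].unique() to list the distinct values
amazon_df["time"].unique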

<bound method Series.unique of 0          05:30:00
1          05:30:00
2          05:30:00
3          05:30:00
4          05:30:00
             ...   
7824476    05:30:00
7824477    05:30:00
7824478    05:30:00
7824479    05:30:00
7824480    05:30:00
Name: time, Length: 7824481, dtype: object>
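
The output above begins with `<bound method Series.unique of ...` because `unique` was referenced without being called, so what is displayed is the method object together with a preview of the Series. The sketch below contrasts the reference with the call on a small stand-in Series (`toy_times` is hypothetical), and also shows `nunique()` for a simple count.

```python
import datetime
import pandas as pd

toy_times = pd.Series(
    [datetime.time(5, 30), datetime.time(5, 30), datetime.time(5, 30)],
    name="time",
)

method_object = toy_times.unique      # no call: just a reference to the method
distinct_values = toy_times.unique()  # call: array of the distinct values
distinct_count = toy_times.nunique()  # call: number of distinct values

print(callable(method_object))
print(distinct_values)
print(distinct_count)
```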

In [52]:
# Making a 'month_name' column from 'month' column by using pandas to_datetime() function
amazon_df['month_name'] = pd.to_datetime(amazon_df['month'], format='%m').dt.month_name()
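
The cell above rebuilds a datetime from the month number only to read the month name back from it. Since the `date_time` column is still available at this point, a shorter route is to ask it for the name directly. A minimal sketch on a stand-in frame (`toy_df` is illustrative only):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"date_time": pd.to_datetime(["2012-07-01 05:30:00", "2013-04-29 05:30:00"])}
)

# month_name() returns English month names by default.
toy_df["month_name"] = toy_df["date_time"].dt.month_name()

print(toy_df["month_name"].tolist())
```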

In [53]:
# Showing the first five rows
amazon_df.head()

| | userId | productId | ratings | timestamp | date_time | day | month | year | time | month_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A2CX7LUOHB2NDG | 321732944 | 5.0 | 1341100800 | 2012-07-01 05:30:00 | 1 | 7 | 2012 | 05:30:00 | July |
| 1 | A2NWSAGRHCP8N5 | 439886341 | 1.0 | 1367193600 | 2013-04-29 05:30:00 | 29 | 4 | 2013 | 05:30:00 | April |
| 2 | A2WNBOD3WNDNKT | 439886341 | 3.0 | 1374451200 | 2013-07-22 05:30:00 | 22 | 7 | 2013 | 05:30:00 | July |
| 3 | A1GI0U4ZRJA8WN | 439886341 | 1.0 | 1334707200 | 2012-04-18 05:30:00 | 18 | 4 | 2012 | 05:30:00 | April |
| 4 | A1QGNMC6O1VW39 | 511189877 | 5.0 | 1397433600 | 2014-04-14 05:30:00 | 14 | 4 | 2014 | 05:30:00 | April |


In [55]:
# Checking for duplicate entries in the updated dataframe
amazon_df.duplicated().sum()

0

In [56]:
# Drop the 'timestamp' and 'date_time' columns, which are no longer needed
amazon_df = amazon_df.drop(['timestamp', 'date_time'], axis=1)

In [57]:
# Showing the first five rows
amazon_df.head()

| | userId | productId | ratings | day | month | year | time | month_name |
|---|---|---|---|---|---|---|---|---|
| 0 | A2CX7LUOHB2NDG | 321732944 | 5.0 | 1 | 7 | 2012 | 05:30:00 | July |
| 1 | A2NWSAGRHCP8N5 | 439886341 | 1.0 | 29 | 4 | 2013 | 05:30:00 | April |
| 2 | A2WNBOD3WNDNKT | 439886341 | 3.0 | 22 | 7 | 2013 | 05:30:00 | July |
| 3 | A1GI0U4ZRJA8WN | 439886341 | 1.0 | 18 | 4 | 2012 | 05:30:00 | April |
| 4 | A1QGNMC6O1VW39 | 511189877 | 5.0 | 14 | 4 | 2014 | 05:30:00 | April |


## Saving the Pre-Processed and Cleaned Data to a CSV File

In [60]:
# saving the dataframe
amazon_df.to_csv('Amazon_Pre_Processed_dataset.csv',index=False)

The saved CSV file can now be used for further analysis and for applying machine learning models suited to the data and the desired output.
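
One point to keep in mind when reloading the saved file: CSV stores no data types, so `read_csv` infers them again. The sketch below is an assumption-laden illustration using an in-memory buffer and a one-row toy frame; it shows that an ID made only of digits is parsed as an integer (dropping its leading zero) unless a `dtype` is given. Whether this affects the full file depends on whether the column also contains non-numeric IDs.

```python
import io
import pandas as pd

toy_df = pd.DataFrame(
    {
        "userId": ["A2CX7LUOHB2NDG"],
        "productId": ["0132793040"],
        "ratings": [5.0],
        "year": [2012],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)

buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"productId": str})

print(default_read["productId"].iloc[0])  # parsed as an integer
print(typed_read["productId"].iloc[0])    # kept as text
```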