### Dataset
The dataset is live: it is pulled from RapidAPI and uploaded to an AWS S3 bucket by a Lambda function, so it is updated daily.
URL to dataset source: https://rapidapi.com/letscrape-6bRBa3QguO5/api/real-time-amazon-data/playground/apiendpoint_17991940-c656-454f-a9ee-0277b0ada11d

In [141]:
# Import libraries
from datetime import datetime 
import pandas as pd 
import numpy as np

# Build the S3 URL of today's file and store it in file_name
# (reading s3:// paths with pandas requires the s3fs package)
file_name = f"s3://lambda-amer/electronics_data_{datetime.now().strftime('%Y-%m-%d')}.csv"

# Read the csv in pandas
df = pd.read_csv(file_name)
# Preview the first five rows of the dataframe
df.head()

Unnamed: 0,asin,product_title,product_price,product_original_price,currency,product_star_rating,product_num_ratings,product_url,product_photo,product_num_offers,product_minimum_offer_price,is_best_seller,is_amazon_choice,is_prime,climate_pledge_friendly,sales_volume,delivery,coupon_text,unit_price,unit_count
0,B0D49CWPH2,"USB C to USB C Cable, (5Pack 6FT), New Nylon U...",$29.66,,USD,5.0,91,https://www.amazon.com/dp/B0D49CWPH2,https://m.media-amazon.com/images/I/818ElRqqaJ...,1,$29.66,False,False,True,False,10K+ bought in past month,"FREE delivery Mon, Jun 10 on $35 of items ship...",Save 50% with coupon,,
1,B0D2X7Y7VF,USB C to Lightning Cable [Apple MFi Certified]...,$7.49,$24.98,USD,4.4,41,https://www.amazon.com/dp/B0D2X7Y7VF,https://m.media-amazon.com/images/I/61yvV52CIQ...,1,$7.49,False,False,True,False,3K+ bought in past month,"FREE delivery Mon, Jun 10 on $35 of items ship...",Save 20% with coupon,,
2,B0D12T4WZT,USB C Charger for iPhone 14 13 12 11 Charger [...,$9.99,$15.99,USD,5.0,51,https://www.amazon.com/dp/B0D12T4WZT,https://m.media-amazon.com/images/I/61nKMuhlx7...,1,$9.99,False,False,True,False,3K+ bought in past month,"FREE delivery Mon, Jun 10 on $35 of items ship...",Save 30% with coupon,,
3,B0CZPHPJLN,Beats Solo 4 - Wireless Bluetooth On-Ear Headp...,$149.95,$199.95,USD,4.1,92,https://www.amazon.com/dp/B0CZPHPJLN,https://m.media-amazon.com/images/I/510fGxoTsT...,1,$149.95,False,False,True,False,3K+ bought in past month,"FREE delivery Mon, Jun 10 Or fastest delivery ...",,,
4,B0C58MS6HF,"Anker Magnetic Power Bank 10,000mAh, Wireless ...",$39.99,,USD,4.3,5689,https://www.amazon.com/dp/B0C58MS6HF,https://m.media-amazon.com/images/I/61RB59T7AF...,1,$39.99,False,False,True,True,10K+ bought in past month,"FREE delivery Mon, Jun 10 Or fastest delivery ...",Save 10% with coupon,,
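
Because the file name contains today's date, the read above fails if the daily upload has not run yet or if the notebook's clock and the Lambda's clock disagree about the date. A minimal sketch of one way to guard against that is to build a short list of dated URLs, newest first, and try them in order. The helper name `candidate_urls`, the `days_back` parameter, and the use of UTC dates are assumptions for illustration; the source does not say which time zone the Lambda uses when naming files.

```python
from datetime import datetime, timedelta, timezone


def candidate_urls(bucket, prefix, days_back=3, today=None):
    """Return dated S3 URLs, newest first, for today and the previous days.

    Hypothetical helper: assumes files are named <prefix>_YYYY-MM-DD.csv.
    """
    today = today or datetime.now(timezone.utc).date()
    return [
        f"s3://{bucket}/{prefix}_{(today - timedelta(days=offset)).strftime('%Y-%m-%d')}.csv"
        for offset in range(days_back + 1)
    ]


# Bucket and prefix taken from the cell above
urls = candidate_urls("lambda-amer", "electronics_data")
for url in urls:
    print(url)
```

Each URL could then be passed to `pd.read_csv` inside a `try`/`except` block, stopping at the first one that loads.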


In [110]:
# Check the shape (rows, columns) of the dataset
df.shape

(349, 20)

### Data Description
The data has 349 rows and 20 columns describing products listed in Amazon's electronics category.

### Data Cleaning

In [111]:
# List all the columns in df
df.columns

Index(['asin', 'product_title', 'product_price', 'product_original_price',
       'currency', 'product_star_rating', 'product_num_ratings', 'product_url',
       'product_photo', 'product_num_offers', 'product_minimum_offer_price',
       'is_best_seller', 'is_amazon_choice', 'is_prime',
       'climate_pledge_friendly', 'sales_volume', 'delivery', 'coupon_text',
       'unit_price', 'unit_count'],
      dtype='object')

In [112]:
# Count the number of null values
df.isnull().sum()

asin                             0
product_title                    0
product_price                    6
product_original_price         211
currency                         6
product_star_rating             15
product_num_ratings              0
product_url                      0
product_photo                    0
product_num_offers               0
product_minimum_offer_price      6
is_best_seller                   0
is_amazon_choice                 0
is_prime                         0
climate_pledge_friendly          0
sales_volume                    84
delivery                         8
coupon_text                    274
unit_price                     306
unit_count                     306
dtype: int64

Based on the null counts above, the following columns will be dropped, either because most of their values are missing or because they are not needed for the analysis:
1. product_original_price
2. product_url
3. product_photo
4. coupon_text
5. unit_price
6. unit_count
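
To make the "mostly missing" criterion concrete, the null counts printed above can be turned into shares of the 349 rows. The sketch below uses only those printed counts; the 50% cutoff is an assumption chosen for illustration, not a rule stated in the notebook.

```python
n_rows = 349  # from df.shape

# Null counts copied from the df.isnull().sum() output (columns with 0 nulls omitted)
null_counts = {
    "product_price": 6,
    "product_original_price": 211,
    "currency": 6,
    "product_star_rating": 15,
    "product_minimum_offer_price": 6,
    "sales_volume": 84,
    "delivery": 8,
    "coupon_text": 274,
    "unit_price": 306,
    "unit_count": 306,
}

threshold = 0.5  # hypothetical cutoff: flag columns with more than half the rows missing

null_share = {col: count / n_rows for col, count in null_counts.items()}
mostly_empty = [col for col, share in null_share.items() if share > threshold]

for col in mostly_empty:
    print(f"{col}: {null_share[col]:.1%} missing")
```

With this cutoff, `product_original_price`, `coupon_text`, `unit_price`, and `unit_count` are flagged. `product_url` and `product_photo` have no missing values, so they are in the drop list for relevance rather than for nulls.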

In [117]:
# Assign the columns to a variable
columns_drop = ['product_original_price', 'product_url', 'product_photo','coupon_text', 'unit_price', 'unit_count']
# Drop the columns and keep the result in a new dataframe
df_drop = df.drop(columns=columns_drop)

In [118]:
# Count the number of null values again
df_drop.isnull().sum()

asin                            0
product_title                   0
product_price                   6
currency                        6
product_star_rating            15
product_num_ratings             0
product_num_offers              0
product_minimum_offer_price     6
is_best_seller                  0
is_amazon_choice                0
is_prime                        0
climate_pledge_friendly         0
sales_volume                   84
delivery                        8
dtype: int64

Now that the columns we don't need have been removed, the null values in the remaining columns can be filled with 0.

In [143]:
# Fill the remaining null values with 0
df_drop.fillna(0, inplace=True)

# Count the number of null values again
df_drop.isnull().sum()

asin                           0
product_title                  0
product_price                  0
currency                       0
product_star_rating            0
product_num_ratings            0
product_num_offers             0
product_minimum_offer_price    0
is_best_seller                 0
is_amazon_choice               0
is_prime                       0
climate_pledge_friendly        0
sales_volume                   0
delivery                       0
dtype: int64
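
One side effect of the fill above is that text columns such as `product_price` (values like `$29.66`) and `sales_volume` (values like `10K+ bought in past month`) now hold a numeric 0 next to strings. If those columns are needed for calculations later, one possible approach is to convert them to numbers and let missing entries stay missing. The sketch below is an illustration, not the notebook's method: the sample rows, the new column names, the helper `parse_sales_volume`, and support for an `M` suffix are all assumptions.

```python
import re

import pandas as pd

# Illustrative rows only; they mimic the preview and are not the real dataset
sample = pd.DataFrame({
    "product_price": ["$29.66", "$1,299.00", None],
    "sales_volume": ["10K+ bought in past month", None, "500+ bought in past month"],
})


def parse_sales_volume(text):
    """Return the lower-bound count in a string like '10K+ bought in past month'."""
    if not isinstance(text, str):
        return None  # missing values stay missing
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*([KkMm]?)\+?", text)
    if match is None:
        return None
    number, suffix = match.groups()
    multiplier = {"": 1, "k": 1_000, "m": 1_000_000}[suffix.lower()]
    return int(float(number) * multiplier)


# Strip the currency symbol and thousands separators, then convert to float
sample["price_value"] = pd.to_numeric(
    sample["product_price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)
sample["min_units_sold"] = sample["sales_volume"].map(parse_sales_volume)

print(sample[["price_value", "min_units_sold"]])
```

Keeping missing prices as `NaN` rather than 0 also prevents them from pulling down averages computed on the column.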

In [149]:
# Save the data into a csv file
df_drop.to_csv('Resources/cleaned_data.csv')
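
By default `to_csv` also writes the row index as an unlabeled first column, which comes back as `Unnamed: 0` when the file is read again (the same label is visible in the first preview of this notebook). A small sketch with made-up two-row data, written to in-memory buffers instead of `Resources/`, shows the difference `index=False` makes; whether to drop the index here depends on how the cleaned file is consumed later.

```python
import io

import pandas as pd

# Tiny illustrative frame; not the real dataset
df_demo = pd.DataFrame({
    "asin": ["B0D49CWPH2", "B0D2X7Y7VF"],
    "product_num_ratings": [91, 41],
})

# Default behaviour: the index is written as an extra, unlabeled column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```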