
In [1]:
import pandas as pd
import numpy as np

# Phase 1 Data Retrieval

The team retrieved bestseller data and product detail data for 38 Amazon categories by sending requests through the ScrapeHero API. The raw bestseller and product detail datasets are saved separately. In this notebook, the team explores the raw data, merges the two datasets, and removes unusable records so that the result is ready for deeper cleaning in Phase 2.

**Read Raw Data as Dataframe**

In [3]:
bestseller = pd.read_csv('bestsellers.csv')
productdetail = pd.read_csv('cleaned_productdetails.csv')

**Explore Datasets**

In [10]:
bestseller.head(4)

| | rank | asin | name | ratings_count | rating | sale_price | image | is_prime | product_url | category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B0CJM1GNFQ | Amazon Fire TV Stick 4K with AI-powered Fire T... | 41456.0 | 4.7 | $49.99 | https://images-na.ssl-images-amazon.com/images... | False | https://www.amazon.com/Amazon-Fire-TV-Stick-4K... | Amazon Devices & Accessories |
| 1 | 2 | B0B1N5FK48 | Blink Outdoor 4 – Wireless smart security came... | 24950.0 | 4.2 | $259.99 | https://images-na.ssl-images-amazon.com/images... | False | https://www.amazon.com/Blink-Outdoor-4th-Gen-3... | Amazon Devices & Accessories |
| 2 | 3 | B08C1W5N87 | Amazon Fire TV Stick, HD, sharp picture qualit... | 498544.0 | 4.7 | $39.99 | https://images-na.ssl-images-amazon.com/images... | False | https://www.amazon.com/fire-tv-stick-with-3rd-... | Amazon Devices & Accessories |
| 3 | 4 | B0BP9SNVH9 | Amazon Fire TV Stick 4K Max, our most powerful... | 35230.0 | 4.6 | $59.99 | https://images-na.ssl-images-amazon.com/images... | False | https://www.amazon.com/all-new-amazon-fire-tv-... | Amazon Devices & Accessories |


In [11]:
productdetail.head(4)

| | asin | brand | price | regular_price | seller | availability_status | is_prime | is_aplus_page | full_description | video_count | parent_asin | five_star | four_star | three_star | two_star | one_star |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B0CJM1GNFQ | Amazon | $49.99 | | Amazon.com | In Stock | True | False | | 10 | B0CDR2MSVC | 83% | 10% | 3% | 1% | 3% |
| 1 | B0B1N5FK48 | Blink | $259.99 | | Amazon.com | In Stock | True | False | | 10 | B0C32KN8DC | 65% | 13% | 6% | 5% | 11% |
| 2 | B08C1W5N87 | Amazon | $39.99 | | Amazon.com | In Stock | True | False | | 10 | B08WJSHSLC | 82% | 11% | 3% | 1% | 3% |
| 3 | B0BP9SNVH9 | Amazon | $59.99 | | Amazon.com | In Stock | True | False | | 10 | B0CDR3P78V | 80% | 11% | 3% | 1% | 4% |


In [8]:
bestseller.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 760 entries, 0 to 759
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   rank           760 non-null    int64  
 1   asin           760 non-null    object 
 2   name           760 non-null    object 
 3   ratings_count  739 non-null    float64
 4   rating         739 non-null    float64
 5   sale_price     733 non-null    object 
 6   image          760 non-null    object 
 7   is_prime       760 non-null    bool   
 8   product_url    760 non-null    object 
 9   category       760 non-null    object 
dtypes: bool(1), float64(2), int64(1), object(6)
memory usage: 54.3+ KB
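
Subtracting each `Non-Null Count` from the 760 total rows gives the number of missing values per column: 21 for `ratings_count` and `rating`, and 27 for `sale_price`. A minimal sketch of how those counts could be computed directly with `isna().sum()`, shown on a made-up frame (`toy_bestseller` and its values are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# Hypothetical miniature of the bestseller frame; None marks a missing value.
toy_bestseller = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A4"],
    "rating": [4.7, None, 4.2, None],
    "sale_price": ["$49.99", "$9.99", None, "$5.00"],
})

# Missing values per column, i.e. total rows minus the Non-Null Count.
missing_per_column = toy_bestseller.isna().sum()
print(missing_per_column)
```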


In [9]:
productdetail.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 707 entries, 0 to 706
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   asin                 707 non-null    object
 1   brand                604 non-null    object
 2   price                680 non-null    object
 3   regular_price        435 non-null    object
 4   seller               655 non-null    object
 5   availability_status  638 non-null    object
 6   is_prime             707 non-null    bool  
 7   is_aplus_page        707 non-null    bool  
 8   full_description     586 non-null    object
 9   video_count          707 non-null    int64 
 10  parent_asin          707 non-null    object
 11  five_star            707 non-null    object
 12  four_star            707 non-null    object
 13  three_star           707 non-null    object
 14  two_star             707 non-null    object
 15  one_star             707 non-null    object
dtypes: bool(2), int64(1), object(13)

**Combine Datasets into One**

In [12]:
# Merge the two DataFrames on the 'asin' column using a left join
merged_df = pd.merge(bestseller, productdetail, on='asin', how='left')

# Preview the first four rows of the merged DataFrame
merged_df.head(4)

| | rank | asin | name | ratings_count | rating | sale_price | image | is_prime_x | product_url | category | ... | is_prime_y | is_aplus_page | full_description | video_count | parent_asin | five_star | four_star | three_star | two_star | one_star |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B0CJM1GNFQ | Amazon Fire TV Stick 4K with AI-powered Fire T... | 41456.0 | 4.7 | $49.99 | https://images-na.ssl-images-amazon.com/images... | False | https://www.amazon.com/Amazon-Fire-TV-Stick-4K... | Amazon Devices & Accessories | ... | True | False | | 10.0 | B0CDR2MSVC | 83% | 10% | 3% | 1% | 3% |
| 1 | 2 | B0B1N5FK48 | Blink Outdoor 4 – Wireless smart security came... | 24950.0 | 4.2 | $259.99 | https://images-na.ssl-images-amazon.com/images... | False | https://www.amazon.com/Blink-Outdoor-4th-Gen-3... | Amazon Devices & Accessories | ... | True | False | | 10.0 | B0C32KN8DC | 65% | 13% | 6% | 5% | 11% |
| 2 | 3 | B08C1W5N87 | Amazon Fire TV Stick, HD, sharp picture qualit... | 498544.0 | 4.7 | $39.99 | https://images-na.ssl-images-amazon.com/images... | False | https://www.amazon.com/fire-tv-stick-with-3rd-... | Amazon Devices & Accessories | ... | True | False | | 10.0 | B08WJSHSLC | 82% | 11% | 3% | 1% | 3% |
| 3 | 4 | B0BP9SNVH9 | Amazon Fire TV Stick 4K Max, our most powerful... | 35230.0 | 4.6 | $59.99 | https://images-na.ssl-images-amazon.com/images... | False | https://www.amazon.com/all-new-amazon-fire-tv-... | Amazon Devices & Accessories | ... | True | False | | 10.0 | B0CDR3P78V | 80% | 11% | 3% | 1% | 4% |
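
A left join keeps every row of `bestseller`, so any row whose `asin` has no match in `productdetail` ends up with missing values in the detail columns. The following is a minimal sketch of how the join could be checked, using small made-up frames (the ASINs, brands, and frame names are illustrative assumptions, not values from the dataset). `validate="many_to_one"` raises an error if `asin` is duplicated on the right side, `indicator=True` adds a `_merge` column that marks unmatched rows, and `suffixes` gives the two `is_prime` columns more descriptive names than the default `_x` and `_y`.

```python
import pandas as pd

# Toy stand-ins for the real frames; every value here is hypothetical.
toy_bestseller = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A1"],  # the same product can appear in two categories
    "category": ["Electronics", "Books", "Baby", "Computers & Accessories"],
    "is_prime": [False, False, True, False],
})
toy_detail = pd.DataFrame({
    "asin": ["A1", "A2"],  # no detail record for A3
    "brand": ["BrandX", "BrandY"],
    "is_prime": [True, True],
})

toy_merged = pd.merge(
    toy_bestseller,
    toy_detail,
    on="asin",
    how="left",
    validate="many_to_one",            # raise MergeError if asin repeats in toy_detail
    indicator=True,                    # adds a '_merge' column
    suffixes=("_listing", "_detail"),  # instead of the default _x / _y
)

# Rows found only in the left frame had no product detail record.
unmatched = toy_merged[toy_merged["_merge"] == "left_only"]
print(len(toy_merged), len(unmatched))
```

With `many_to_one` validation in place, the merged frame is guaranteed to have the same number of rows as the left frame, which matches what the notebook observes (760 rows before and after the merge).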


In [13]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 760 entries, 0 to 759
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   rank                 760 non-null    int64  
 1   asin                 760 non-null    object 
 2   name                 760 non-null    object 
 3   ratings_count        739 non-null    float64
 4   rating               739 non-null    float64
 5   sale_price           733 non-null    object 
 6   image                760 non-null    object 
 7   is_prime_x           760 non-null    bool   
 8   product_url          760 non-null    object 
 9   category             760 non-null    object 
 10  brand                622 non-null    object 
 11  price                697 non-null    object 
 12  regular_price        440 non-null    object 
 13  seller               672 non-null    object 
 14  availability_status  656 non-null    object 
 15  is_prime_y           725 non-null    object 

**Preprocessing: Data Cleaning**

While retrieving data through the API, the team noticed that errors occurred for two categories, 'Apps & Games' and 'Gift Cards', as documented in '01_rawdata_api_runner.py'. The data for both categories was incomplete and inaccurate. To keep these records from degrading the dataset, the team removed all rows for 'Apps & Games' and 'Gift Cards'.

In [14]:
# Remove rows where 'category' is 'Apps & Games' or 'Gift Cards'
merged_df = merged_df[~merged_df['category'].isin(['Apps & Games', 'Gift Cards'])]
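
The filter above can be sanity-checked by confirming that no rows from the excluded categories remain and that the number of dropped rows equals the number of rows those categories held. A small sketch on a made-up frame (`toy_df` and its contents are illustrative assumptions):

```python
import pandas as pd

excluded = ["Apps & Games", "Gift Cards"]

# Hypothetical miniature of merged_df: two rows per category.
toy_df = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "category": ["Books", "Books", "Apps & Games", "Apps & Games",
                 "Gift Cards", "Gift Cards"],
})

rows_to_drop = toy_df["category"].isin(excluded).sum()
filtered = toy_df[~toy_df["category"].isin(excluded)]

# No excluded category survives, and exactly the flagged rows are gone.
assert not filtered["category"].isin(excluded).any()
assert len(toy_df) - len(filtered) == rows_to_drop
print(filtered["category"].unique().tolist())
```

Boolean indexing keeps the original index labels, so the filtered frame's index is generally no longer contiguous; `reset_index(drop=True)` would renumber the rows if that matters downstream.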

In [15]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 720 entries, 0 to 759
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   rank                 720 non-null    int64  
 1   asin                 720 non-null    object 
 2   name                 720 non-null    object 
 3   ratings_count        702 non-null    float64
 4   rating               702 non-null    float64
 5   sale_price           693 non-null    object 
 6   image                720 non-null    object 
 7   is_prime_x           720 non-null    bool   
 8   product_url          720 non-null    object 
 9   category             720 non-null    object 
 10  brand                615 non-null    object 
 11  price                690 non-null    object 
 12  regular_price        440 non-null    object 
 13  seller               665 non-null    object 
 14  availability_status  649 non-null    object 
 15  is_prime_y           718 non-null    object 


In [16]:
# Check the number of unique categories
num_unique_categories = merged_df['category'].nunique()
print(f"Number of unique categories: {num_unique_categories}")

# List all the unique categories
unique_categories = merged_df['category'].unique()
print("\nList of all categories:")
for category in unique_categories:
    print(category)

Number of unique categories: 36

List of all categories:
Amazon Devices & Accessories
Amazon Renewed
Appliances
Arts, Crafts & Sewing
Audible Books & Originals
Automotive
Baby
Beauty & Personal Care
Books
Camera & Photo Products
CDs & Vinyl
Cell Phones & Accessories
Clothing, Shoes & Jewelry
Collectible Coins
Computers & Accessories
Electronics
Entertainment Collectibles
Grocery & Gourmet Food
Handmade Products
Health & Household
Home & Kitchen
Kindle Store
Kitchen & Dining
Movies & TV
Musical Instruments
Office Products
Patio, Lawn & Garden
Pet Supplies
Software
Sports & Outdoors
Sports Collectibles
Tools & Home Improvement
Toys & Games
Unique Finds
Video Games
Industrial & Scientific


In [17]:
# Export the merged DataFrame as a CSV file
merged_df.to_csv('amazon_bestseller_phase1.csv', index=False)
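
Because `index=False` omits the row index, reading the file back should yield the same columns and row count with a fresh `RangeIndex`. A minimal round-trip sketch, assuming a temporary directory and a made-up frame rather than the real export (`toy_df` and `toy_export.csv` are hypothetical names):

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by row filtering.
toy_df = pd.DataFrame(
    {"asin": ["A1", "A2", "A3"], "rank": [1, 2, 3]},
    index=[0, 1, 5],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_export.csv")
    toy_df.to_csv(path, index=False)  # the index labels are not written
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded.index.tolist())
```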