# Amazon Best Sellers Web Scraping and Data Analysis Project

This notebook performs web scraping of Amazon India's best sellers across multiple categories, cleans and transforms the data, and saves it to a CSV file for further analysis.

**Note:** GitHub Copilot was used to optimize the code and add comments.

## Step 1: Install Required Libraries

Install and upgrade necessary Python packages for data manipulation, web scraping, and HTTP requests.

In [38]:
# Install and upgrade pip to the latest version
# %pip install --upgrade pip
# Install numpy and pandas for data handling and analysis
# %pip install numpy pandas
# Install requests for making HTTP requests and beautifulsoup4 for HTML parsing
# %pip install requests beautifulsoup4

## Step 2: Import Libraries

Import all necessary modules for web scraping, data processing, and file operations.

In [39]:
# Import BeautifulSoup for parsing HTML content
from bs4 import BeautifulSoup
# Import pandas for data manipulation and analysis
import pandas as pd
# Import requests for sending HTTP requests
import requests
# Import datetime for handling dates and timestamps
import datetime
# Import time for adding delays if needed (though not used here)
import time
# Import csv for writing data to CSV files
import csv
# Import json for potential JSON handling (though not used here)
import json

## Step 3: Define Scraping Function and Execute Scraping

Define a function to scrape product data from Amazon best sellers pages using ScraperAPI, then loop through categories to collect data for up to 30 products per category.

In [None]:
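# Import os to read the ScraperAPI key from the environment
import os

# Define the scraping function that takes a category name and URL link
def scrapping(category, link):
    # Set the URL to the provided link
    url = link
    # API key for ScraperAPI, read from an environment variable instead of being hard-coded
    # (SCRAPERAPI_KEY is an assumed variable name; set it before running this cell)
    api_key = os.environ['SCRAPERAPI_KEY']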
    # Payload for the API request including the API key and target URL
    payload = {'api_key': api_key, 'url': url}
    
    # Send a GET request to ScraperAPI with the payload
    response = requests.get("http://api.scraperapi.com", params=payload)
    
    # Parse the response content using BeautifulSoup with 'html' parser
    soup = BeautifulSoup(response.content, 'html')
    # Initialize counter for number of products scraped
    number_of_products = 0
    # List to store product data dictionaries
    product_list = []
    # Visit product slots 0 through 29 (a slot with no matching element adds no row)
    while number_of_products < 30:
        # Find product elements by their ID pattern
        products = soup.find_all(id=f"p13n-asin-index-{number_of_products}")
        for product in products:
            # Try to extract the rank; if not found, set to 'N/A'
            try:
                rank = product.find(class_='zg-bdg-text').get_text(strip=True)       
            except AttributeError:
                rank = 'N/A'
            # Try to extract the product title; if not found, set to 'N/A'
            try:
                title = product.find(class_=lambda class_name: class_name and class_name.startswith('_cDEzb_p13n-sc-css-line-clamp-')).get_text(strip=True)
            except AttributeError:
                title = 'N/A'
            # Try to extract the price; if not found, set to 'N/A'
            try:
                price = product.find(class_='_cDEzb_p13n-sc-price_3mJ9Z').get_text(strip=True)
            except AttributeError:
                price = 'N/A'
            # Try to extract the rating; if not found, set to 'N/A'
            try:
                rating = product.find(class_='a-icon-alt').get_text(strip=True)
            except AttributeError:
                rating = 'N/A'
            # Try to extract the number of reviews; if not found, set to 'N/A'
            try:
                reviews = product.find('span', class_='a-size-small').get_text(strip=True)
            except AttributeError:
                reviews = 'N/A'
            
            # Append a dictionary of product details to the list
            product_list.append({
                'Category': category,
                'Rank': rank,
                'Title': title,
                'Price': price,
                'Rating': rating,
                'Number of Reviews': reviews
            })
        # Increment the product counter
        number_of_products += 1
    
    # Return the list of product data
    return product_list

# Print a message indicating the start of scraping
print("Starting the scraping process...")

# Dictionary of categories and their corresponding Amazon URLs
categories = {
    'Books': 'https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_nav_books_0',
    'Electronics': 'https://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_electronics_sm',
    'Home & Kitchen': 'https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_kitchen_0',
    'Kindle Store':'https://www.amazon.in/gp/bestsellers/digital-text/ref=zg_bs_nav_digital-text_0' ,
    'Movies & TV Shows': 'https://www.amazon.in/gp/bestsellers/dvd/ref=zg_bs_nav_dvd_0',
    'Music': 'https://www.amazon.in/gp/bestsellers/music/ref=zg_bs_nav_music_0' ,
    'Musical Instruments': 'https://www.amazon.in/gp/bestsellers/musical-instruments/ref=zg_bs_nav_musical-instruments_0',
    'Office Products': 'https://www.amazon.in/gp/bestsellers/office/ref=zg_bs_nav_office_0' ,
    'Pet Supplies': 'https://www.amazon.in/gp/bestsellers/pet-supplies/ref=zg_bs_nav_pet-supplies_0',
    'Shoes & Handbags': 'https://www.amazon.in/gp/bestsellers/shoes/ref=zg_bs_nav_shoes_0' ,
    'Software':'https://www.amazon.in/gp/bestsellers/software/ref=zg_bs_nav_software_0' ,
    'Sports, Fitness & Outdoors': 'https://www.amazon.in/gp/bestsellers/sports/ref=zg_bs_nav_sports_0',
    'Toys & Games': 'https://www.amazon.in/gp/bestsellers/toys/ref=zg_bs_nav_toys_0' ,
    'Video Games': 'https://www.amazon.in/gp/bestsellers/videogames/ref=zg_bs_nav_videogames_0',
    'Watches': 'https://www.amazon.in/gp/bestsellers/watches/ref=zg_bs_watches_sm' 
}
# Initialize an empty list to store all product data
product_data = []
# Loop through each category and scrape data
for category, link in categories.items():
    print(f"Scraping category: {category} from link: {link}")
    product_data = product_data + scrapping(category, link)

# Print the collected product data (for debugging/verification)
print(product_data)
    
# Generate a date string for the filename
date_str = datetime.datetime.now().strftime("%Y%m%d")
# Get the keys from the first product dictionary for CSV headers
keys = product_data[0].keys()    
# Create a filename with the date
filename = f"AmazonBestSellers_{date_str}.csv"
# Open the file in write mode and write the data as CSV
with open(filename, 'w', newline='', encoding='utf-8') as output_file:
    dict_writer = csv.DictWriter(output_file, fieldnames=keys)
    dict_writer.writeheader()
    dict_writer.writerows(product_data)

# Print a completion message that names the single CSV file written above
print(f"Scraping completed and data saved to {filename}.")

Starting the scraping process...
Scraping category: Books from link: https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_nav_books_0
Scraping category: Electronics from link: https://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_electronics_sm
Scraping category: Home & Kitchen from link: https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_kitchen_0
Scraping category: Clothing, Shoes & Jewelry from link: https://www.amazon.in/gp/bestsellers/fashion/ref=zg_bs_nav_fashion_0
Scraping category: Kindle Store from link: https://www.amazon.in/gp/bestsellers/digital-text/ref=zg_bs_nav_digital-text_0
Scraping category: Movies & TV Shows from link: https://www.amazon.in/gp/bestsellers/dvd/ref=zg_bs_nav_dvd_0
Scraping category: Music from link: https://www.amazon.in/gp/bestsellers/music/ref=zg_bs_nav_music_0
Scraping category: Musical Instruments from link: https://www.amazon.in/gp/bestsellers/musical-instruments/ref=zg_bs_nav_musical-instruments_0
Scraping category: Office Products f
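
The five `try`/`except AttributeError` blocks in the scraping function follow the same pattern, so they could be collapsed into one small helper. The sketch below is one possible shape, not the notebook's actual code: `text_or_default` is a hypothetical name, and the HTML snippet is a made-up stand-in for a real best sellers page.

```python
from bs4 import BeautifulSoup

def text_or_default(node, default="N/A"):
    """Return the stripped text of a parsed element, or `default` when the lookup found nothing."""
    return node.get_text(strip=True) if node is not None else default

# Made-up markup that mimics one product tile; real pages are far larger.
sample_html = """
<div id="p13n-asin-index-0">
  <span class="zg-bdg-text">#1</span>
  <span class="a-icon-alt">4.7 out of 5 stars</span>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
product = soup.find(id="p13n-asin-index-0")

rank = text_or_default(product.find(class_="zg-bdg-text"))
rating = text_or_default(product.find(class_="a-icon-alt"))
# The price element is deliberately absent from the sample, so the default is returned.
price = text_or_default(product.find(class_="_cDEzb_p13n-sc-price_3mJ9Z"))

print(rank, rating, price)
```

Checking for `None` explicitly keeps the fallback behaviour of the original code while avoiding a separate exception handler per field.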

# Data Transformation in Pandas

## Step 4: Load and Inspect Raw Data

Load the scraped data from the CSV file into a pandas DataFrame and display it for initial inspection.

In [41]:
# Generate the date string to match the saved CSV filename
date_str = datetime.datetime.now().strftime("%Y%m%d")
# Read the CSV file into a pandas DataFrame
df = pd.read_csv(f"D:\\Python\\Data Analyst Project\\AmazonBestSellers_{date_str}.csv")
# Display the DataFrame
df

Unnamed: 0,Category,Rank,Title,Price,Rating,Number of Reviews
0,Books,#1,Sacred Waters,₹358.00,4.7 out of 5 stars,58
1,Books,#2,Educart PRAYAS CBSE Class 10 for 2026 (Introdu...,₹697.00,4.4 out of 5 stars,1221
2,Books,#3,My First Library: Boxset of 10 Board Books for...,₹379.00,4.5 out of 5 stars,87739
3,Books,#4,Oswaal CBSE 20 Combined Sample Question Papers...,₹399.00,4.6 out of 5 stars,55
4,Books,#5,The Psychology of Money,₹285.00,4.6 out of 5 stars,77869
...,...,...,...,...,...,...
445,Watches,#26,Titan Karishma Analog Champagne Dial Women's W...,"₹1,994.00",4.2 out of 5 stars,3385
446,Watches,#27,Matrix Minimalist Dial with Softest Silicone S...,₹265.00,4.0 out of 5 stars,1059
447,Watches,#28,Casio Enticer Men Analog Blue Dial Men MTP-130...,"₹3,675.00",4.5 out of 5 stars,10691
448,Watches,#29,SWADESI STUFF Stainless Steel Date Display Men...,₹457.00,4.2 out of 5 stars,456
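
Step 3 writes the CSV with a relative filename, while this step reads it from an absolute `D:\` path and recomputes the date string, so the two only line up when the notebook runs from that folder on the same day. One way to keep them in sync is to build the path in a single place. This is a sketch under assumptions: `bestsellers_csv_path`, its `base_dir` argument, and the `data` folder are hypothetical and not part of the notebook.

```python
import datetime
from pathlib import Path

def bestsellers_csv_path(base_dir=".", day=None):
    """Build the dated CSV path so that writing and reading use the same location."""
    day = day or datetime.date.today()
    return Path(base_dir) / f"AmazonBestSellers_{day:%Y%m%d}.csv"

# Example with a fixed date so the result is reproducible.
csv_path = bestsellers_csv_path("data", datetime.date(2025, 1, 31))
print(csv_path.name)  # AmazonBestSellers_20250131.csv
```

The same `csv_path` could then be passed to both `open(...)` in Step 3 and `pd.read_csv(...)` here.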


## Step 5: Check Data Size

Determine the number of rows in the DataFrame to verify the amount of data collected.

In [42]:
# Get the length of the DataFrame index (number of rows)
len(df.index)

450
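
A total of 450 rows is what 15 categories of 30 products each would give, but the total alone does not show whether any single category came back short. A per-category count makes that visible. The sketch below uses a tiny made-up frame in place of `df`; `EXPECTED_PER_CATEGORY` is a hypothetical constant.

```python
import pandas as pd

# Made-up sample standing in for the scraped DataFrame.
sample = pd.DataFrame({
    "Category": ["Books", "Books", "Watches"],
    "Rank": ["#1", "#2", "#1"],
})

EXPECTED_PER_CATEGORY = 30  # the scraper visits 30 product slots per category

# Count rows per category and keep only the categories below the expected size.
counts = sample["Category"].value_counts()
short = counts[counts < EXPECTED_PER_CATEGORY]
print(short.to_dict())
```

On the real data, an empty result would mean every category returned all 30 products.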

## Step 6: Clean Price Column

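Remove the Indian Rupee symbol (₹) and the thousands separators from the Price column to prepare for numeric conversion.

In [43]:
# Remove the '₹' symbol and the thousands-separator commas from the Price column
df['Price'] = df['Price'].str.replace('₹', '', regex=False).str.replace(',', '', regex=False)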
# Display the updated DataFrame
df

Unnamed: 0,Category,Rank,Title,Price,Rating,Number of Reviews
0,Books,#1,Sacred Waters,358.00,4.7 out of 5 stars,58
1,Books,#2,Educart PRAYAS CBSE Class 10 for 2026 (Introdu...,697.00,4.4 out of 5 stars,1221
2,Books,#3,My First Library: Boxset of 10 Board Books for...,379.00,4.5 out of 5 stars,87739
3,Books,#4,Oswaal CBSE 20 Combined Sample Question Papers...,399.00,4.6 out of 5 stars,55
4,Books,#5,The Psychology of Money,285.00,4.6 out of 5 stars,77869
...,...,...,...,...,...,...
445,Watches,#26,Titan Karishma Analog Champagne Dial Women's W...,1994.00,4.2 out of 5 stars,3385
446,Watches,#27,Matrix Minimalist Dial with Softest Silicone S...,265.00,4.0 out of 5 stars,1059
447,Watches,#28,Casio Enticer Men Analog Blue Dial Men MTP-130...,3675.00,4.5 out of 5 stars,10691
448,Watches,#29,SWADESI STUFF Stainless Steel Date Display Men...,457.00,4.2 out of 5 stars,456
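
Prices of ₹1,000 and above carry a thousands separator in the raw data (for example `₹1,994.00` in row 445). `pd.to_numeric` does not parse such strings, so with `errors='coerce'` they become NaN unless the comma is removed as well. That is consistent with rows 445 and 447 showing no price in the final table. A minimal sketch on two sample values taken from the raw output:

```python
import pandas as pd

raw = pd.Series(["₹358.00", "₹1,994.00"])

# Removing only the currency symbol leaves the comma in place.
symbol_only = raw.str.strip("₹")
# Removing both the symbol and the comma yields plain decimal strings.
fully_cleaned = raw.str.replace("₹", "", regex=False).str.replace(",", "", regex=False)

numeric_symbol_only = pd.to_numeric(symbol_only, errors="coerce")
numeric_fully_cleaned = pd.to_numeric(fully_cleaned, errors="coerce")

print(numeric_symbol_only.isna().tolist())  # [False, True]
print(numeric_fully_cleaned.tolist())       # [358.0, 1994.0]
```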


## Step 7: Inspect Data Types

Check the data types and non-null counts for each column to understand the current structure.

In [44]:
# Display information about the DataFrame including data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Category           450 non-null    object
 1   Rank               450 non-null    object
 2   Title              450 non-null    object
 3   Price              426 non-null    object
 4   Rating             442 non-null    object
 5   Number of Reviews  450 non-null    object
dtypes: object(6)
memory usage: 21.2+ KB
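
The `info()` output shows 426 non-null prices and 442 non-null ratings out of 450 rows, i.e. 24 and 8 missing entries. The same numbers can be read directly as missing counts with `isna().sum()`. A small sketch on made-up data:

```python
import pandas as pd

# Made-up sample with one missing price and one missing rating.
sample = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Price": ["358.00", None, "457.00"],
    "Rating": ["4.7 out of 5 stars", "4.4 out of 5 stars", None],
})

# Number of missing values per column.
missing = sample.isna().sum()
print(missing.to_dict())
```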


## Step 8: Clean Rank Column

Remove the '#' symbol from the Rank column to prepare for numeric conversion.

In [45]:
# Strip the '#' symbol from the Rank column
df['Rank'] = df['Rank'].str.strip('#')
# Display the first few rows to verify changes
df.head()

Unnamed: 0,Category,Rank,Title,Price,Rating,Number of Reviews
0,Books,1,Sacred Waters,358.0,4.7 out of 5 stars,58
1,Books,2,Educart PRAYAS CBSE Class 10 for 2026 (Introdu...,697.0,4.4 out of 5 stars,1221
2,Books,3,My First Library: Boxset of 10 Board Books for...,379.0,4.5 out of 5 stars,87739
3,Books,4,Oswaal CBSE 20 Combined Sample Question Papers...,399.0,4.6 out of 5 stars,55
4,Books,5,The Psychology of Money,285.0,4.6 out of 5 stars,77869


## Step 9: Clean Rating Column

Remove the ' out of 5 stars' suffix from the Rating column to prepare for numeric conversion.

In [46]:
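# Remove the literal ' out of 5 stars' suffix from the Rating column
# (str.strip treats its argument as a set of characters and can also drop digits such as a trailing 5)
df['Rating'] = df['Rating'].str.replace(' out of 5 stars', '', regex=False)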
# Display the first few rows to verify changes
df.head()

Unnamed: 0,Category,Rank,Title,Price,Rating,Number of Reviews
0,Books,1,Sacred Waters,358.0,4.7,58
1,Books,2,Educart PRAYAS CBSE Class 10 for 2026 (Introdu...,697.0,4.4,1221
2,Books,3,My First Library: Boxset of 10 Board Books for...,379.0,4.0,87739
3,Books,4,Oswaal CBSE 20 Combined Sample Question Papers...,399.0,4.6,55
4,Books,5,The Psychology of Money,285.0,4.6,77869
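
A note on `str.strip`: both Python's `str.strip` and pandas' `Series.str.strip` treat the argument as a set of characters to remove from both ends, not as a literal suffix. Because `5` is in that set, `4.5 out of 5 stars` loses its final digit, which is consistent with row 2 (raw rating 4.5) appearing as 4.0 in the output above. The sketch below demonstrates the effect in plain Python and shows a regex-based alternative; `parse_rating` is a hypothetical helper.

```python
import re

SUFFIX = " out of 5 stars"

def parse_rating(text):
    """Return the leading number of a string like '4.5 out of 5 stars' as a float, or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of 5 stars", text)
    return float(match.group(1)) if match else None

# strip() removes any of the characters in SUFFIX from both ends.
print(repr("4.5 out of 5 stars".strip(SUFFIX)))  # '4.'
print(repr("5.0 out of 5 stars".strip(SUFFIX)))  # '.0'

# Matching the pattern keeps the full number.
print(parse_rating("4.5 out of 5 stars"))  # 4.5
print(parse_rating("N/A"))                 # None
```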


## Step 10: Clean Number of Reviews Column

Remove commas from the Number of Reviews column to prepare for numeric conversion.

In [47]:
# Replace commas in the Number of Reviews column with empty strings
df['Number of Reviews'] = df['Number of Reviews'].str.replace(',','')
# Display the first few rows to verify changes
df.head()

Unnamed: 0,Category,Rank,Title,Price,Rating,Number of Reviews
0,Books,1,Sacred Waters,358.0,4.7,58
1,Books,2,Educart PRAYAS CBSE Class 10 for 2026 (Introdu...,697.0,4.4,1221
2,Books,3,My First Library: Boxset of 10 Board Books for...,379.0,4.0,87739
3,Books,4,Oswaal CBSE 20 Combined Sample Question Papers...,399.0,4.6,55
4,Books,5,The Psychology of Money,285.0,4.6,77869


## Step 11: Convert Columns to Numeric Types

Convert relevant columns to numeric data types, coercing errors to NaN for invalid values.

In [48]:
# Convert Price to numeric, setting invalid values to NaN
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
# Convert Number of Reviews to numeric, setting invalid values to NaN
df['Number of Reviews'] = pd.to_numeric(df['Number of Reviews'], errors='coerce')
# Convert Rating to numeric, setting invalid values to NaN
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
# Convert Rank to numeric, setting invalid values to NaN
df['Rank'] = pd.to_numeric(df['Rank'], errors='coerce')

# Display updated info to confirm data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Category           450 non-null    object 
 1   Rank               450 non-null    int64  
 2   Title              450 non-null    object 
 3   Price              331 non-null    float64
 4   Rating             442 non-null    float64
 5   Number of Reviews  339 non-null    float64
dtypes: float64(3), int64(1), object(2)
memory usage: 21.2+ KB
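
Compared with Step 7, the non-null count drops from 426 to 331 for Price and from 450 to 339 for Number of Reviews, so `errors='coerce'` silently turned 95 and 111 strings into NaN. Listing the raw values that fail conversion shows what would be lost before the column is overwritten. The values below are made up for illustration.

```python
import pandas as pd

# Made-up raw values: two convertible entries would be typical, two here are not.
raw = pd.Series(["1221", "87,739", "Paperback", None], name="Number of Reviews")

numeric = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw column but became NaN after coercion.
newly_missing = raw[numeric.isna() & raw.notna()]
print(newly_missing.tolist())  # ['87,739', 'Paperback']
```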


## Step 12: Final Data Inspection

Display the cleaned DataFrame to verify all transformations.

In [49]:
# Display the final cleaned DataFrame
df

Unnamed: 0,Category,Rank,Title,Price,Rating,Number of Reviews
0,Books,1,Sacred Waters,358.0,4.7,58.0
1,Books,2,Educart PRAYAS CBSE Class 10 for 2026 (Introdu...,697.0,4.4,1221.0
2,Books,3,My First Library: Boxset of 10 Board Books for...,379.0,4.0,87739.0
3,Books,4,Oswaal CBSE 20 Combined Sample Question Papers...,399.0,4.6,55.0
4,Books,5,The Psychology of Money,285.0,4.6,77869.0
...,...,...,...,...,...,...
445,Watches,26,Titan Karishma Analog Champagne Dial Women's W...,,4.2,3385.0
446,Watches,27,Matrix Minimalist Dial with Softest Silicone S...,265.0,4.0,1059.0
447,Watches,28,Casio Enticer Men Analog Blue Dial Men MTP-130...,,4.0,10691.0
448,Watches,29,SWADESI STUFF Stainless Steel Date Display Men...,457.0,4.2,456.0
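
Beyond eyeballing the table, a few range checks can confirm that the cleaned columns are plausible: ranks between 1 and 30, ratings between 0 and 5, and positive prices. This is a sketch on a made-up frame; the check names are hypothetical.

```python
import pandas as pd

# Made-up sample in the shape of the cleaned DataFrame.
cleaned = pd.DataFrame({
    "Rank": [1, 2, 30],
    "Price": [358.0, None, 457.0],
    "Rating": [4.7, 4.4, 4.2],
})

# Each check ignores missing values and reports whether all remaining values are in range.
checks = {
    "rank_in_1_to_30": bool(cleaned["Rank"].between(1, 30).all()),
    "rating_in_0_to_5": bool(cleaned["Rating"].dropna().between(0, 5).all()),
    "price_positive": bool((cleaned["Price"].dropna() > 0).all()),
}
print(checks)
```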


## Step 13: Save Cleaned Data

Export the cleaned DataFrame to a new CSV file without the index column.

In [50]:
# Save the cleaned DataFrame to a CSV file, excluding the index
df.to_csv('Amazon_Web_Scraping_Cleaned_Data.csv', index=False)
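
To illustrate what `index=False` changes, the sketch below writes a made-up frame to an in-memory buffer both ways and reads it back. When the index is written, `pd.read_csv` labels the resulting header-less first column `Unnamed: 0`.

```python
import io
import pandas as pd

# Made-up one-row frame standing in for the cleaned data.
cleaned = pd.DataFrame({"Category": ["Books"], "Rank": [1], "Price": [358.0]})

def round_trip(frame, index):
    """Write `frame` to an in-memory CSV and read it back, returning the column names."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=index)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)

print(round_trip(cleaned, index=False))  # ['Category', 'Rank', 'Price']
print(round_trip(cleaned, index=True))   # ['Unnamed: 0', 'Category', 'Rank', 'Price']
```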