# WEB SCRAPING

# Import Libraries
- `selenium` → Automates web browsing.  
- `BeautifulSoup` → Parses HTML to extract data.  
- `pandas` → Organizes data in tables.  
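- `time`, `random` → Add randomized delays between page loads to mimic human browsing.  
- `datetime` → Timestamps the output file name.  
- `webdriver_manager`, `Service`, and `Options` → Download a matching ChromeDriver and configure the Chrome browser for automation.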

In [1]:
import time, datetime, random
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options


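# Set Selenium Chrome Options
We configure how Chrome will run for scraping:  

- `--headless` → Runs the browser without a visible window.  
- `--no-sandbox` & `--disable-dev-shm-usage` → Avoid sandbox and shared-memory problems, which improves stability in containers and on low-memory machines.  
- `user-agent` → Sends a regular desktop Chrome user-agent string so requests look like they come from an ordinary browser.  
- `Service(ChromeDriverManager().install())` fetches a matching ChromeDriver, and `webdriver.Chrome` launches the automated browser with these options.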

In [2]:
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36")

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)


# Define Product Categories and URLs
We create a dictionary called `amazon_urls` to store all the Amazon search URLs we want to scrape.  

- Each key represents a category like Electronics, Fashion, or Books.  
- Each value is a list of search URLs for that category.  
- This structure makes scraping organized and scalable.

In [3]:
amazon_urls = {
    "Electronics & Gadgets": [
        "https://www.amazon.in/s?k=smartphones",
        "https://www.amazon.in/s?k=tablets",
        "https://www.amazon.in/s?k=televisions",
        "https://www.amazon.in/s?k=projectors",
        "https://www.amazon.in/s?k=bluetooth+headphones",
        "https://www.amazon.in/s?k=power+banks",
        "https://www.amazon.in/s?k=usb+cables",
        "https://www.amazon.in/s?k=gaming+laptops",
        "https://www.amazon.in/s?k=home+theatre+systems"
    ],

    "Home & Kitchen Appliances": [
        "https://www.amazon.in/s?k=air+fryer",
        "https://www.amazon.in/s?k=microwave+oven",
        "https://www.amazon.in/s?k=otg+oven",
        "https://www.amazon.in/s?k=robotic+vacuum+cleaner",
        "https://www.amazon.in/s?k=induction+cooktop",
        "https://www.amazon.in/s?k=pressure+cooker",
        "https://www.amazon.in/s?k=rice+cooker",
        "https://www.amazon.in/s?k=food+processor",
        "https://www.amazon.in/s?k=air+cooler",
        "https://www.amazon.in/s?k=dehumidifier"
    ],

    "Fashion & Accessories": [
        "https://www.amazon.in/s?k=mens+tshirts",
        "https://www.amazon.in/s?k=jeans+for+women",
        "https://www.amazon.in/s?k=kurta+sets",
        "https://www.amazon.in/s?k=jackets",
        "https://www.amazon.in/s?k=wristwatches",
        "https://www.amazon.in/s?k=ethnic+wear",
        "https://www.amazon.in/s?k=sneakers",
        "https://www.amazon.in/s?k=leather+bags",
        "https://www.amazon.in/s?k=sunglasses",
        "https://www.amazon.in/s?k=belts"
    ],

    "Beauty & Personal Care": [
        "https://www.amazon.in/s?k=hair+straightener",
        "https://www.amazon.in/s?k=face+wash",
        "https://www.amazon.in/s?k=lipstick+set",
        "https://www.amazon.in/s?k=perfume+for+women",
        "https://www.amazon.in/s?k=mens+perfume",
        "https://www.amazon.in/s?k=skin+serum",
        "https://www.amazon.in/s?k=sunscreen",
        "https://www.amazon.in/s?k=dental+care+products",
        "https://www.amazon.in/s?k=vitamins+supplements",
        "https://www.amazon.in/s?k=fitness+watch"
    ],

    "Books & Stationery": [
        "https://www.amazon.in/s?k=fiction+books",
        "https://www.amazon.in/s?k=non+fiction+books",
        "https://www.amazon.in/s?k=comics+manga",
        "https://www.amazon.in/s?k=education+books",
        "https://www.amazon.in/s?k=art+supply",
        "https://www.amazon.in/s?k=notebooks",
        "https://www.amazon.in/s?k=pens+kits",
        "https://www.amazon.in/s?k=magazines",
        "https://www.amazon.in/s?k=board+games",
        "https://www.amazon.in/s?k=movie+dvds"
    ],

    "Grocery & Food": [
        "https://www.amazon.in/s?k=organic+snacks",
        "https://www.amazon.in/s?k=coffee+powder",
        "https://www.amazon.in/s?k=tea+leaves",
        "https://www.amazon.in/s?k=spices+masala",
        "https://www.amazon.in/s?k=health+protein+powder",
        "https://www.amazon.in/s?k=almonds+nuts",
        "https://www.amazon.in/s?k=cookies+biscuit",
        "https://www.amazon.in/s?k=breakfast+cereal",
        "https://www.amazon.in/s?k=keto+foods",
        "https://www.amazon.in/s?k=energy+bars"
    ],

    "Baby, Kids & Toys": [
        "https://www.amazon.in/s?k=diapers",
        "https://www.amazon.in/s?k=school+bags",
        "https://www.amazon.in/s?k=toy+cars",
        "https://www.amazon.in/s?k=lego+sets",
        "https://www.amazon.in/s?k=puzzle+games",
        "https://www.amazon.in/s?k=baby+clothing",
        "https://www.amazon.in/s?k=strollers",
        "https://www.amazon.in/s?k=baby+toys",
        "https://www.amazon.in/s?k=kids+footwear",
        "https://www.amazon.in/s?k=baby+care+essentials"
    ],

    "Automotive & Outdoor": [
        "https://www.amazon.in/s?k=car+accessories",
        "https://www.amazon.in/s?k=bike+lights",
        "https://www.amazon.in/s?k=tool+kit",
        "https://www.amazon.in/s?k=power+tools",
        "https://www.amazon.in/s?k=garden+tools",
        "https://www.amazon.in/s?k=garden+hose",
        "https://www.amazon.in/s?k=outdoor+lamp",
        "https://www.amazon.in/s?k=camping+gear",
        "https://www.amazon.in/s?k=fishing+kit",
        "https://www.amazon.in/s?k=car+cleaner+kit"
    ]
}
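
Since every entry above is the same prefix followed by a `+`-separated search phrase, the lists could also be generated from plain keywords. The following is a minimal sketch under that assumption; `BASE_SEARCH_URL` and `build_search_urls` are hypothetical names that do not appear in the notebook.

```python
from urllib.parse import quote_plus

# Same prefix as the hand-written URLs in amazon_urls.
BASE_SEARCH_URL = "https://www.amazon.in/s?k="


def build_search_urls(keywords):
    """Turn plain search phrases into search URLs (spaces become '+')."""
    return [BASE_SEARCH_URL + quote_plus(keyword) for keyword in keywords]


# Hypothetical usage: rebuild part of one category from plain phrases.
electronics_urls = build_search_urls(["smartphones", "power banks", "usb cables"])
print(electronics_urls)
```

Using `quote_plus` also escapes characters such as `&` that would otherwise break the query string if a keyword contained them.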


# Initialize Containers for Scraped Data
We create empty lists to store all the details we want to scrape:  

- `categories`, `names`, `prices`, `ratings`, `reviews`, `boughts_last_month`, `offers`, `org_prices`  
- `num_pages_to_scrape` defines how many pages we will scrape per product URL.  
- These lists act as storage for our collected data.

In [4]:
categories, names, prices, ratings, reviews, boughts_last_month, offers, org_prices = [], [], [], [], [], [], [], []

num_pages_to_scrape = 1
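
To illustrate how `num_pages_to_scrape` translates into page URLs, here is a small sketch that sets the `page` query parameter with the standard library instead of string concatenation. The helper name `page_url` and the value `num_pages = 2` are assumptions made for this example only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, page_number):
    """Return base_url with its `page` query parameter set to page_number (1-based)."""
    parts = urlsplit(base_url)
    # Keep the existing parameters, dropping any earlier `page` value.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != "page"]
    query.append(("page", str(page_number)))
    return urlunsplit(parts._replace(query=urlencode(query)))


num_pages = 2  # hypothetical value; the notebook uses num_pages_to_scrape = 1
for page_index in range(num_pages):
    print(page_url("https://www.amazon.in/s?k=power+banks", page_index + 1))
```

Because the helper replaces an existing `page` value rather than appending a second one, it is safe to call on a URL that already carries a page number.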

This is the main scraping loop:  

1. Loops through each category and its URLs.  
2. Opens each page using Selenium and parses it with BeautifulSoup.  
3. Saves HTML pages for debugging.  
4. Checks for errors like "503 Service Unavailable" and skips problematic pages.  
5. Extracts product details: name, price, rating, review count, offer, original price, and units bought in the past month.  
6. Appends all collected data into the respective lists.  
7. Prints progress after scraping each page.
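
Step 4 skips a page as soon as the 503 text appears. One possible refinement, which is not part of the original notebook, is to retry the page a few times with a growing delay before giving up. The sketch below assumes a caller-supplied `fetch_html` callable; `fetch_with_retry`, `max_attempts`, and `base_delay` are hypothetical names and values.

```python
import random
import time

# Same text the scraping loop looks for.
ERROR_MARKER = "503 - Service Unavailable Error"


def fetch_with_retry(fetch_html, url, max_attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch_html(url) until the page lacks ERROR_MARKER or attempts run out.

    fetch_html is any callable that returns the page HTML as a string, for
    example a small wrapper around driver.get(url) and driver.page_source.
    Returns the HTML, or None if every attempt hit the error page.
    """
    for attempt in range(1, max_attempts + 1):
        html = fetch_html(url)
        if ERROR_MARKER not in html:
            return html
        if attempt < max_attempts:
            # Wait longer after each failure, with a little random jitter.
            sleep(base_delay * attempt + random.uniform(0, 1))
    return None
```

Passing the fetch function and the `sleep` function as arguments keeps the helper independent of Selenium, so it can be exercised with a stub before being wired into the loop.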

In [5]:
total_categories = len(amazon_urls.keys())
current_category_count = 0
for category, base_urls in amazon_urls.items():
    current_category_count += 1
    url_count = 0
    for base_url in base_urls:
        url_count += 1
        total_category_urls = len(base_urls)
        for page_index in range(num_pages_to_scrape):
            # Always pass an explicit 1-based page number so each page index maps to its own URL.
            url = f"{base_url}&page={page_index + 1}"
            driver.get(url)
            time.sleep(random.randint(5, 10))

            soup = BeautifulSoup(driver.page_source, "html.parser")

            # Save the parsed HTML for debugging; the URL counter keeps files from overwriting each other.
            with open(f"cat_{current_category_count}_url_{url_count}_soup_page_{page_index}.html", "w", encoding="utf-8") as f:
                f.write(soup.prettify())

            # Check the fetched page directly instead of re-reading a file from disk.
            text_to_find = "503 - Service Unavailable Error"
            if text_to_find in driver.page_source:
                print(f"Skipping Page {page_index} ==> ", text_to_find)
                continue

            products = soup.select('div[data-component-type="s-search-result"]')

            for prod in products:
                categories.append(category)
                name = prod.select_one("h2 span:not([class])")
                names.append(name.text.strip() if name else "N/A")

                price = prod.select_one("span.a-price-whole")
                prices.append(price.text.strip() if price else "N/A")

                rating = prod.select_one("span.a-icon-alt")
                ratings.append(rating.text.strip() if rating else "N/A")

                price_block = prod.select_one('div[data-cy="price-recipe"]')
                if not price_block:
                    offers.append("N/A")
                    org_prices.append("N/A")
                else:
                    offs = price_block.select("span:not([class])")
                    offer_text = ""
                    for ofr in offs:
                        text = ofr.get_text(strip=True)
                        if r"% off" in text:
                            offer_text = text.strip("()")
                            break
                    offers.append(offer_text if offer_text else "N/A")

                    org_price = prod.select_one("span.a-offscreen")
                    org_prices.append(org_price.text.replace("₹", "").strip() if org_price else "N/A")

                review_block = prod.select_one('div[data-cy="reviews-block"]')
                if not review_block:
                    reviews.append("N/A")
                    boughts_last_month.append("N/A")
                else:
                    review = review_block.find('span', class_=('a-size-base s-underline-text'))
                    reviews.append(review.text.strip() if review else "N/A")

                    bought_last_month = review_block.find('span', class_=['a-size-base a-color-secondary'])
                    boughts_last_month.append(bought_last_month.text.strip() if bought_last_month else "N/A")

            print(f"Category ({current_category_count}/{total_categories}): {category} , url ({url_count}/{total_category_urls}) ✅ Page {page_index+1} scraped, {len(products)} newly added, total rows so far: {len(names)}, url: {url}")

Category (1/8): Electronics & Gadgets , url (1/9) ✅ Page 1 scraped, 22 newly added, total rows so far: 22, url: https://www.amazon.in/s?k=smartphones&page=1
Category (1/8): Electronics & Gadgets , url (2/9) ✅ Page 1 scraped, 16 newly added, total rows so far: 38, url: https://www.amazon.in/s?k=tablets&page=1
Category (1/8): Electronics & Gadgets , url (3/9) ✅ Page 1 scraped, 22 newly added, total rows so far: 60, url: https://www.amazon.in/s?k=televisions&page=1
Category (1/8): Electronics & Gadgets , url (4/9) ✅ Page 1 scraped, 16 newly added, total rows so far: 76, url: https://www.amazon.in/s?k=projectors&page=1
Category (1/8): Electronics & Gadgets , url (5/9) ✅ Page 1 scraped, 16 newly added, total rows so far: 92, url: https://www.amazon.in/s?k=bluetooth+headphones&page=1
Category (1/8): Electronics & Gadgets , url (6/9) ✅ Page 1 scraped, 22 newly added, total rows so far: 114, url: https://www.amazon.in/s?k=power+banks&page=1
Category (1/8): Electronics & Gadgets , url (7/9) ✅ P

# Close the Selenium Browser

In [6]:
driver.quit()
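
If the scraping loop raises an exception, a separate `driver.quit()` cell may never run and the browser process stays open. A minimal sketch of one way to guard against that is shown below; `managed_driver` and `create_driver` are hypothetical names, and the pattern works with any object that exposes a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Yield a driver from create_driver() and always call quit() afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # Runs even if the code inside the `with` block raised an exception.
        driver.quit()


# Possible usage with the objects defined earlier (not executed here):
# with managed_driver(lambda: webdriver.Chrome(service=service, options=options)) as driver:
#     driver.get("https://www.amazon.in/s?k=smartphones")
```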

Once all data is collected:  

1. We organize the scraped lists into a dictionary called `data`.  
2. Convert it into a `pandas DataFrame` for structured analysis.  
3. Print the total number of rows collected to verify our scraping results.

In [7]:
data = {
    "Category": categories,
    "Name": names,
    "Price (INR)": prices,
    "Rating": ratings,
    "Reviews (Nos)": reviews,
    "Bought Last Month": boughts_last_month,
    "Offer": offers,
    
}

df = pd.DataFrame(data)
print("Total rows collected:", len(df))


Total rows collected: 3062


In [8]:
raw_data_len = df.shape[0]
df = df.drop_duplicates()
final_data_len = df.shape[0]
print(f"{raw_data_len - final_data_len} duplicates found & removed, Finally {final_data_len} rows available")

# The data spans all eight categories, so the file name uses a generic prefix.
filename = f"amazon_products_{len(df)}_{datetime.datetime.now().strftime('%Y-%m-%d %H_%M_%S')}.csv"
df.to_csv(filename, index=False)


71 duplicates found & removed, Finally 2991 rows available
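
The scraped columns are stored as raw strings, with `"N/A"` for missing values. Before numeric analysis they would need converting. The sketch below shows one way to do that with regular expressions; the sample formats (`"1,29,999"`, `"4.2 out of 5 stars"`, `"5K+ bought in past month"`) are assumptions about what the page text looks like, and the helper names are hypothetical.

```python
import re


def parse_price(text):
    """Convert a whole-rupee string such as '1,29,999' to an int, else None."""
    cleaned = text.replace(",", "").strip().rstrip(".")
    return int(cleaned) if cleaned.isdigit() else None


def parse_rating(text):
    """Extract the leading number from text like '4.2 out of 5 stars', else None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of", text)
    return float(match.group(1)) if match else None


def parse_bought(text):
    """Convert text like '5K+ bought in past month' to an int lower bound, else None."""
    match = re.search(r"(\d+(?:\.\d+)?)([KkMm]?)\+?\s+bought", text)
    if not match:
        return None
    multiplier = {"": 1, "k": 1_000, "m": 1_000_000}[match.group(2).lower()]
    return int(round(float(match.group(1)) * multiplier))


print(parse_price("1,29,999"), parse_rating("4.2 out of 5 stars"), parse_bought("5K+ bought in past month"))
```

Each helper returns `None` for `"N/A"` or any text it cannot parse, so missing values stay distinguishable from zero after conversion.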
