# Amazon Mobile Phone WebScraper

## Overview

This is my web scraper for Amazon India mobile phone product pages. I built it to extract key information such as title, price, rating, features, and product specifications from given URLs and save the data to a CSV file for my analysis.

### Features I Implemented:
- Extracts `product title`, `price`, `rating`, `features`, and `product specifications`
- Handles multiple fallback selectors for robust data extraction
- Includes rate limiting to avoid being blocked
- Saves scraped data to `CSV` for further analysis

### Libraries I Chose:
- `pandas`: For data manipulation and CSV export
- `requests`: For making HTTP requests to Amazon
- `BeautifulSoup`: For parsing HTML content
- `time`: For rate limiting

### Important Notes for My Project:
- This scraper is for educational purposes only. I respect Amazon's terms of service and robots.txt.
- I use appropriate headers to mimic a real browser.
- I implemented rate limiting to avoid overwhelming the server.

In [1]:
# Import necessary libraries

import pandas as pd
import requests, re, os, smtplib, time, datetime
from bs4 import BeautifulSoup

# Headers to mimic a browser request and avoid blocking

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
}

# List of Amazon mobile phone product URLs to scrape
urls = [
    'https://www.amazon.in/OnePlus-13R-Smarter-Lifetime-Warranty/dp/B0DPS62DYH?ref_=pd_hp_d_btf_unk_B0DPS62DYH',
    'http://amazon.in/Apple-iPhone-15-128-GB/dp/B0CHX1W1XY/ref=sr_1_1_sspa?sr=8-1-spons&sp_csd=d2lkZ2V0TmFtZT1zcF9hdGY',
    'https://www.amazon.in/Samsung-Galaxy-Smartphone-Titanium-Storage/dp/B0CS5XW6TN/ref=sr_1_1_sspa?nsdOptOutParam=true&sr=8-1-spons&sp_csd=d2lkZ2V0TmFtZT1zcF9hdGY',
    'https://www.amazon.in/Titanium-Storage-Additional-Exchange-Offers/dp/B07WDJMX16/ref=sr_1_4?nsdOptOutParam=true&s=electronics&sr=1-4'
]

## Data Extraction Functions

Here are the functions I created to extract specific information from Amazon mobile phone product pages using BeautifulSoup. Each function includes multiple fallback selectors to handle variations in Amazon's HTML structure.

- `extract_title()`: Extracts the product title
- `extract_price()`: Extracts the current price
- `extract_rating()`: Extracts the star rating
- `extract_features()`: Extracts bullet point features
- `extract_specs()`: Extracts product specifications from tables
- `scrape_product()`: Main function that orchestrates the scraping for a single URL

# Extracts the product title from the Amazon page  
- Tries multiple selectors in order of preference:  
  - `id='productTitle'`  
  - `h1` with specific class  
  - `h1` with `id='title'`  
  - `span` with `id='productTitle'`  
- Returns the title as a string, or **"Title not found"** if none found  
- Removes non-ASCII characters for clean text  


In [2]:
def extract_title(soup):
    try:
        title = soup.find(id='productTitle').get_text().strip()
        title = title.encode('ascii', 'ignore').decode('ascii')
        return title
    except AttributeError:
        try:
            title = soup.find('h1', {'class': 'a-size-large a-spacing-small a-text-bold'}).get_text().strip()
            title = title.encode('ascii', 'ignore').decode('ascii')
            return title
        except AttributeError:
            try:
                title = soup.find('h1', id='title').get_text().strip()
                title = title.encode('ascii', 'ignore').decode('ascii')
                return title
            except AttributeError:
                try:
                    title = soup.find('span', id='productTitle').get_text().strip()
                    title = title.encode('ascii', 'ignore').decode('ascii')
                    return title
                except AttributeError:
                    return "Title not found"

### Extract Price Function
Extracts the product price from the Amazon page. Tries to find price components: 
- 1. Whole and fraction parts with specific classes, 
- 2. Deal price block. Returns formatted price string with ₹ symbol, or "Not found".

In [3]:
def extract_price(soup):
    try:
        price_whole = soup.find('span', {'class': 'a-price-whole'}).get_text().strip()
        price_fraction = soup.find('span', {'class': 'a-price-fraction'}).get_text().strip()
        if price_fraction.startswith('.'):
            price = f"₹{price_whole}{price_fraction}"
        else:
            price = f"₹{price_whole}.{price_fraction}"
        price = price.encode('ascii', 'ignore').decode('ascii')
        return price
    except AttributeError:
        try:
            price = soup.find(id='priceblock_dealprice').get_text().strip()
            price = price.encode('ascii', 'ignore').decode('ascii')
            return price
        except AttributeError:
            return "Not found"

### Extract Features Function
Extracts product features from the bullet points section. Looks for div with id='feature-bullets' and extracts all li elements. Returns features as a semicolon-separated string, or "Not found".

In [4]:
def extract_features(soup):
    try:
        features_div = soup.find(id='feature-bullets')
        features = [li.get_text().strip().encode('ascii', 'ignore').decode('ascii') for li in features_div.find_all('li')]
        features_str = '; '.join(features)
        return features_str
    except AttributeError:
        return "Not found"

### Extract Rating Function
Extracts the product rating (stars) from the Amazon page. Tries multiple selectors for rating information. Returns rating string (e.g., "4.5 out of 5 stars"), or "Not found".

In [5]:
def extract_rating(soup):
    try:
        # Try common selectors for star ratings
        rating_element = soup.find('span', {'class': 'a-icon-alt'})
        if rating_element:
            text = rating_element.get_text().strip()
            if 'stars' in text:
                return text
        # Try another common selector
        rating_element = soup.find('span', {'data-hook': 'rating-out-of-text'})
        if rating_element:
            return rating_element.get_text().strip()
        # Try finding in average customer reviews
        reviews_div = soup.find('div', {'id': 'averageCustomerReviews'})
        if reviews_div:
            rating_element = reviews_div.find('span', {'class': 'a-icon-alt'})
            if rating_element:
                text = rating_element.get_text().strip()
                if 'stars' in text:
                    return text
        return "Not found"
    except AttributeError:
        return "Not found"

### Extract Specifications Function
Extracts product specifications from tables on the Amazon page. Looks for tables and extracts key-value pairs from th/td elements. Returns a dictionary of specifications, or empty dict if not found.

In [6]:
def extract_specs(soup):
    specs = {}
    try:
        # Find all tables on the page
        tables = soup.find_all('table')
        for table in tables:
            rows = table.find_all('tr')
            for row in rows:
                th = row.find('th')
                td = row.find('td')
                if th and td:
                    key = th.get_text().strip().encode('ascii', 'ignore').decode('ascii')
                    value = td.get_text().strip().encode('ascii', 'ignore').decode('ascii')
                    specs[key] = value
        return specs
    except Exception:
        return {}

### Main Scraping Function
Scrapes a single Amazon mobile phone product page for key information. Takes a URL as input and returns a dictionary containing extracted product data, or None if failed.

In [7]:
def scrape_product(url):
    """
    Scrapes a single Amazon mobile phone product page for key information.

    Args:
        url (str): The Amazon mobile phone product URL to scrape

    Returns:
        dict: Dictionary containing extracted product data, or None if failed
    """
    print(f"Scraping: {url}")
    page = requests.get(url, headers=headers)
    if page.status_code != 200:
        print(f"Error fetching {url}: {page.status_code}")
        return None
    soup = BeautifulSoup(page.content, 'html.parser')

    title = extract_title(soup)
    price = extract_price(soup)
    rating = extract_rating(soup)
    features = extract_features(soup)
    specs = extract_specs(soup)

    product_data = {
        'Title': title,
        'Price': price,
        'Rating': rating,
        'Features': features
    }
    product_data.update(specs)

    print(f"Title: {title}")
    print(f"Price: {price}")
    print(f"Rating: {rating}")
    if features != "Not found":
        print("Features:")
        for feature in features.split('; '):
            print(f"  - {feature}")
    if specs:
        print("Product Specs:")
        for key, value in specs.items():
            if value != 'N/A':
                print(f"  {key}: {value}")
    else:
        print("Product Specs: Not found")

    print("-" * 50)
    time.sleep(1)  # Rate limiting
    return product_data

### Execute Scraping Process

In [8]:
# Main scraping loop
# Scrape all mobile phones from the URLs list
data = []
for url in urls:
    product = scrape_product(url)
    if product:
        data.append(product)

Scraping: https://www.amazon.in/OnePlus-13R-Smarter-Lifetime-Warranty/dp/B0DPS62DYH?ref_=pd_hp_d_btf_unk_B0DPS62DYH
Title: OnePlus 13R | Smarter with OnePlus AI | Lifetime Display Warranty (12GB RAM, 256GB Storage Nebula Noir)
Price: 37,999..00
Rating: 4.3 out of 5 stars
Features:
  - Flagship power made smarter with the Snapdragon 8 Gen 3 flagship  Up to 98% faster AI, 30% faster CPU compared to the OnePlus 12R. The OnePlus 13R maximizes the CPU efficiency of the Snapdragon 8 Gen 3 when gaming, while lowering heat and power consumption.
  - Winning made smooth with maximum 120fps gaming experience, no input delay and zero-touch latency gameplay. We've tuned the GPU pipeline to unlock near-instant 120fps HDR gaming that hits harder than any highlight.
  - Our biggest battery ever  Press play all day, every day, with the cutting-edge 6000mAh battery. Driven by our next-gen battery management system, your multimedia becomes an all-you-can-consume buffet.
  - Pro-grade triple camera with 

### Export Data to CSV

In [9]:
# Save scraped data to CSV, appending new data and dropping duplicates
df = pd.DataFrame(data)

if os.path.exists('AmazonWebScraperDataset.csv'):
    existing_df = pd.read_csv('AmazonWebScraperDataset.csv')
    combined_df = pd.concat([existing_df, df], ignore_index=True)
    # Drop duplicates based on 'Title' to avoid duplicate entries
    combined_df.drop_duplicates(subset=['Title'], inplace=True)
else:
    combined_df = df

combined_df.to_csv('AmazonWebScraperDataset.csv', index=False)
print("Data saved to AmazonWebScraperDataset.csv")

Data saved to AmazonWebScraperDataset.csv


## Output and Usage

### Running the Scraper
I execute the main scraping loop to process all URLs in my `urls` list.   scraper will:
1. Fetch each product page
2. Extract title, price, rating, features, and specs
3. Print extracted information to console
4. Store data in a list of dictionaries

### Data Export
The scraped data is saved to `AmazonWebScraperDataset.csv` with columns for:
- Title
- Price
- Rating
- Features
- Additional spec columns (varies by product)

### Potential Improvements for Future Development
- Add error handling for network issues
- Implement retry logic for failed requests
- Add more spec extraction methods
- Include image URLs or other metadata

## Key Takeaways

- **Web Scraping Ethics**: Always respect website terms of service, robots.txt, and implement rate limiting to avoid overloading servers.
- **Robust Data Extraction**: Using multiple fallback selectors ensures reliable data extraction even when website layouts change.
- **Data Cleaning**: Removing non-ASCII characters and handling encoding issues is crucial for clean, usable data.
- **Error Handling**: Implementing checks for HTTP status codes and graceful failure handling makes scrapers more reliable.
- **Data Export**: Pandas makes it easy to structure and export scraped data to CSV for further analysis.