# Imports
Import required libraries used in this notebook:
- `requests` — HTTP requests
- `bs4` (`BeautifulSoup`) — HTML parsing
- `pandas` — data storage and CSV export
- `time` — polite request delays

In [88]:
# Import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Amazon Multipage Scraper
This notebook scrapes Amazon search results across multiple pages and saves product data to `amazon_products.csv`.
- **Purpose:** Collect product title, price, rating, and product link from search results.
- **Prerequisites:** Install `requests`, `beautifulsoup4`, `pandas` (the notebook kernel already includes these).
- **Run order:** Execute cells top-to-bottom. Stop or reduce `NUM_PAGES` if you encounter CAPTCHAs or rate limiting.
- **Politeness & legality:** Use a short delay between requests (`time.sleep(1)`) and respect Amazon's terms of service and robots.txt.

In [89]:
# Set the base Amazon search URL
BASE_URL = "https://www.amazon.in/s?k=playstation+5&crid=302ZJMTZG1JP1&sprefix=playstation+%2Caps%2C369&ref=nb_sb_noss_2"

### Multipage Scraper
This scraper is designed to iterate search result pages by appending a `&page={n}` parameter to the base search URL.
- Set `NUM_PAGES` to control how many pages to scrape.
- Each page is fetched and parsed, and product entries are extracted with `extract_product_data(product)`.
- The collected rows are stored in `all_data` and written to CSV at the end.

### Search URL (BASE_URL)
Set `BASE_URL` to the Amazon search page you want to scrape. Example:

- `https://www.amazon.in/s?k=playstation+5`

Notes:

- Do not include the `&page=` parameter in `BASE_URL`; the pagination loop appends `&page={page}` automatically.
- If you need a different locale or query, update `BASE_URL` accordingly.
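
As a minimal sketch of the pagination rule above, the helper below sets the `page` query parameter with the standard library instead of string concatenation, so a stray `page=` already present in the URL is replaced rather than duplicated. The name `with_page` is hypothetical and not part of this notebook.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(base_url: str, page: int) -> str:
    """Return base_url with its `page` query parameter set to `page`."""
    parts = urlsplit(base_url)
    # Drop any existing page parameter so it is never duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_page("https://www.amazon.in/s?k=playstation+5", 2))
```

For a `BASE_URL` with no `page` parameter, this should produce the same URL as `BASE_URL + f'&page={page}'`.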

### Request Headers
Set `HEADERS` to mimic a browser's `User-Agent` and `Accept-Language`.
- Passed to `requests.get()` to reduce blocking.
- You can update the `User-Agent` string if necessary.

In [90]:
# Set request headers to mimic a browser
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36 Edg/143.0.0.0',
    'Accept-Language': 'en-US, en;q=0.5'
}

### Fetch Page Function
`fetch_amazon_page(url, headers)` makes a GET request to `url` with the given `headers` (a 10-second timeout applies) and returns a `BeautifulSoup` object on an HTTP 200 response, or `None` on any other status code or on a request error.

In [91]:
# Function to fetch and parse a single Amazon search result page
def fetch_amazon_page(url, headers):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return BeautifulSoup(response.content, 'html.parser')
        else:
            print(f"Failed to fetch {url} (status code: {response.status_code})")
            return None
    except requests.RequestException as e:  # network errors, timeouts, invalid URLs
        print(f"Error fetching {url}: {e}")
        return None

### Pagination Settings
Set `NUM_PAGES` to control how many search result pages to scrape and initialize `all_data` to collect extracted rows.

In [92]:
# Set how many pages to scrape and initialize data list
NUM_PAGES = 5  # Change this to scrape more or fewer pages
all_data = []

### Request Headers
To reduce chances of blocking, the scraper sends a `User-Agent` and minimal headers that mimic a browser.
- `HEADERS` is a dictionary provided to `requests.get()`.
- You can update the `User-Agent` string if needed, but avoid rotating it rapidly between requests, which can itself look suspicious.
- Consider adding randomized delays, retries, or proxies for larger scraping tasks.
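
The retry and randomized-delay suggestion could look roughly like the sketch below. It is illustrative only: `fetch_with_retries` and its parameters (`fetch`, `attempts`, `base_delay`, `sleep`) are hypothetical names, and it assumes the fetch function signals failure by returning `None`, as `fetch_amazon_page` does.

```python
import random
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or attempts run out."""
    for attempt in range(attempts):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff (1x, 2x, 4x, ...) plus up to 0.5 s of random jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    return None
```

With this notebook's function it might be called as `fetch_with_retries(lambda u: fetch_amazon_page(u, HEADERS), paged_url)`. The `sleep` argument is injectable only so the helper can be exercised without real waiting.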

### Send HTTP Request
- `webpage = requests.get(URL, headers=HEADERS)`: Sends a GET request to the Amazon search URL with the specified headers and stores the response.

### Display Response Object
- `webpage`: Displays the response object to check the status of the HTTP request.

### View Raw HTML Content
- `webpage.content`: Displays the raw HTML content of the fetched web page.

### Check Content Type
- `type(webpage.content)`: Checks the data type of the web page content (should be bytes).

### Parse HTML with BeautifulSoup
- `soup = BeautifulSoup(webpage.content, 'html.parser')`: Parses the HTML content using BeautifulSoup for further data extraction.

### Display Parsed HTML Object
- `soup`: Displays the BeautifulSoup object to inspect the parsed HTML structure.

### Find Product Links
- `links = soup.find_all(...)`: Finds all anchor tags with specific classes that likely contain product links from the search results.

### Display Product Links
- `links`: Displays the list of found product link elements.

### Extract a Product Link
- `link = links[1].get('href')`: Extracts the href attribute from the second product link in the list.

### Build Full Product URL
- `product_list = 'https://www.amazon.in' + link`: Constructs the full product URL by prepending the site root (the same domain as `BASE_URL`) to the extracted relative link.
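
A hedged alternative to plain concatenation is `urllib.parse.urljoin`, which also copes with an `href` that is already absolute. In this sketch `SITE_ROOT`, `absolute_link`, and the `/dp/EXAMPLE` path are placeholders, and it assumes the links come from the same site as `BASE_URL`.

```python
from urllib.parse import urljoin

# Assumption: product links are relative to the site that served the search page.
SITE_ROOT = "https://www.amazon.in"


def absolute_link(href: str) -> str:
    """Resolve a possibly relative href against SITE_ROOT."""
    return urljoin(SITE_ROOT, href)


print(absolute_link("/dp/EXAMPLE"))                       # relative href
print(absolute_link("https://www.amazon.in/dp/EXAMPLE"))  # already absolute, returned unchanged
```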

### Display Product URL
- `product_list`: Displays the constructed product URL.

### Fetch Product Page
- `new_webpage = requests.get(product_list, headers=HEADERS)`: Sends a GET request to the product page URL to fetch its HTML content.

### Display Product Page Response
- `new_webpage`: Displays the response object for the product page request.

### Parse Product Page HTML
- `new_soup = BeautifulSoup(new_webpage.content, 'html.parser')`: Parses the product page HTML content for data extraction.

### Display Parsed Product Page
- `new_soup`: Displays the BeautifulSoup object for the product page to inspect its structure.

### Find Product Title Element
- `product = new_soup.find('span', attrs={'id':'productTitle'})`: Searches for the product title element by its ID in the parsed product page.

### Display Product Title Element
- `product`: Displays the found product title element (or None if not found).

In [93]:
# (Removed: single-page product title print, not needed for multipage automation)

### Extract and Print Product Title
- Checks if the product element was found.
- If found, extracts and prints the product title text.
- If not found, prints a message indicating the title was not found.
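
The found / not-found check described above can be sketched against a small inline HTML snippet, so no request is needed. The HTML string and the helper name `get_title` are made up for illustration; only the `productTitle` id comes from this notebook.

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched product page (illustrative markup only).
html = '<html><body><span id="productTitle">  Example Console  </span></body></html>'
page = BeautifulSoup(html, "html.parser")


def get_title(parsed_page):
    """Return the stripped product title, or None if the element is missing."""
    elem = parsed_page.find("span", attrs={"id": "productTitle"})
    return elem.get_text(strip=True) if elem else None


title = get_title(page)
print(title if title is not None else "Product title not found")
```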

### Extract Product Data
`extract_product_data(product)` extracts `Title`, `Price`, `Rating`, and `Link` from an individual product container and appends a dictionary to `all_data`.

In [94]:
# Extract product details from a single product element and append to all_data
def extract_product_data(product):
    title_elem = product.find('span', {'class': 'a-size-medium a-color-base a-text-normal'})
    if not title_elem:
        h2_elem = product.find('h2')
        if h2_elem:
            title_elem = h2_elem.find('span')
    title = title_elem.text.strip() if title_elem else None
    price_whole = product.find('span', {'class': 'a-price-whole'})
    price_fraction = product.find('span', {'class': 'a-price-fraction'})
    rating_elem = product.find('span', {'class': 'a-icon-alt'})
    link_elem = product.find('a', {'class': 'a-link-normal s-no-outline'})
    price = None
    if price_whole and price_fraction:
        price = price_whole.text.strip() + price_fraction.text.strip()
    elif price_whole:
        price = price_whole.text.strip()
    rating = rating_elem.text.strip() if rating_elem else None
    link = 'https://www.amazon.in' + link_elem['href'] if link_elem else None
    all_data.append({'Title': title, 'Price': price, 'Rating': rating, 'Link': link})

### Extract Product Title Using Selenium
- Imports Selenium and related modules for browser automation.
- Sets up Chrome browser in headless mode.
- Opens the product page and waits for it to load.
- Tries to find and print the product title using Selenium.
- Handles exceptions if the title is not found.
- Closes the browser after extraction.

### Pagination Loop
Main loop: iterate `page` from `1` to `NUM_PAGES`, fetch each paged URL, parse product containers, call `extract_product_data` for each product, and sleep briefly between requests to be polite.

In [95]:
# Loop through multiple pages and collect product data
for page in range(1, NUM_PAGES + 1):
    paged_url = BASE_URL + f'&page={page}'
    print(f"Fetching page {page}...")
    soup = fetch_amazon_page(paged_url, HEADERS)
    if soup is None:
        continue
    products = soup.find_all('div', {'data-component-type': 's-search-result'})
    for product in products:
        extract_product_data(product)
    time.sleep(1)  # Be polite to Amazon's servers

Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...


### Parse and Extract Product Data from Search Results
- Imports BeautifulSoup and pandas again (for clarity in this cell).
- Parses the original search results page.
- Finds all product containers on the page.
- For each product, extracts:
  - Title
  - Price (whole and fraction)
  - Rating
  - Product link
- Appends the extracted data as a dictionary to a list.

### Save Extracted Data to CSV
After scraping completes, the collected data in `all_data` is converted to a `pandas.DataFrame` and saved as `amazon_products.csv`.
- The CSV includes the columns `Title`, `Price`, `Rating`, and `Link`.
- Post-processing suggestions: normalize `price` to numeric, remove duplicate rows, and validate `link` values before analysis.
- To re-run: adjust `NUM_PAGES`, run cells in order, and check `amazon_products.csv` for the results.
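
The post-processing suggestions might be implemented along these lines. This is a sketch under assumptions: `clean_products` is a hypothetical helper, prices may contain thousands separators, ratings follow the "4.5 out of 5 stars" pattern, and rows without an `https://` link are dropped (a design choice you may not want).

```python
import pandas as pd


def clean_products(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize Price and Rating, keep rows with https links, drop duplicates."""
    out = df.copy()
    # "54,990" -> 54990.0; missing or non-numeric prices become NaN.
    out["Price"] = pd.to_numeric(
        out["Price"].str.replace(",", "", regex=False), errors="coerce"
    )
    # "4.5 out of 5 stars" -> 4.5
    out["Rating"] = pd.to_numeric(
        out["Rating"].str.extract(r"^(\d+(?:\.\d+)?)", expand=False), errors="coerce"
    )
    # Keep only rows whose link is an absolute https URL (missing links count as invalid).
    out = out[out["Link"].str.startswith("https://", na=False)]
    return out.drop_duplicates(subset=["Title", "Price", "Link"]).reset_index(drop=True)


raw = pd.DataFrame([
    {"Title": "A", "Price": "54,990", "Rating": "4.5 out of 5 stars", "Link": "https://www.amazon.in/a"},
    {"Title": "A", "Price": "54,990", "Rating": "4.5 out of 5 stars", "Link": "https://www.amazon.in/a"},
    {"Title": "B", "Price": None, "Rating": None, "Link": None},
])
clean = clean_products(raw)
print(clean)
```

Sponsored results tend to repeat across pages, which is one reason de-duplication is worth doing before any analysis.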

In [97]:
# Save all multipage data to CSV and display
print(f"Total products scraped: {len(all_data)}")
df = pd.DataFrame(all_data)
df.to_csv('amazon_products.csv', index=False)
print('Saved all multipage results to amazon_products.csv')
df

Total products scraped: 110
Saved all multipage results to amazon_products.csv


|  | Title | Price | Rating | Link |
|---|---|---|---|---|
| 0 | Sage Controllers PRO+ Controller compatible wi... | 11700 | 5.0 out of 5 stars | https://www.amazon.in/sspa/click?ie=UTF8&spc=M... |
| 1 | Sage Controllers PRO+ Controller compatible wi... | 14000 | 5.0 out of 5 stars | https://www.amazon.in/sspa/click?ie=UTF8&spc=M... |
| 2 | Sony PlayStation5 Gaming Console (Slim) | 54990 | 4.5 out of 5 stars | https://www.amazon.in/Sony-CFI-2008A01X-PlaySt... |
| 3 | Sony PlayStation®5 Digital Edition (slim) Cons... | 49990 | 4.6 out of 5 stars | https://www.amazon.in/Sony-PlayStation%C2%AE5-... |
| 4 | Sony DualSense Wireless Controller White (Play... | 4890 | 4.2 out of 5 stars | https://www.amazon.in/DualSense-Wireless-Contr... |
| ... | ... | ... | ... | ... |
| 105 | OIVO INDIA Dust Protective Cover for PS5 Slim ... | 499 | 5.0 out of 5 stars | https://www.amazon.in/OIVO-INDIA-Protective-Ac... |
| 106 | PowerA Ultra High Speed HDMI Cable 2.1 For Pla... | 1999 | 4.6 out of 5 stars | https://www.amazon.in/PowerA-Ultra-HDMI-PlaySt... |
| 107 | SEGA Persona 5 Royal \| Standard Edition \| Play... | 1968 | 4.8 out of 5 stars | https://www.amazon.in/Persona-Royal-Standard-P... |
| 108 | Sage Controllers PRO+ Controller compatible wi... | 11700 | 5.0 out of 5 stars | https://www.amazon.in/sspa/click?ie=UTF8&spc=M... |
