<a href="https://colab.research.google.com/github/Elhameed/PLG4_APIs/blob/main/web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# APIs and Web Scraping (PLD 4)

This notebook demonstrates two tasks: scraping paginated tabular data from a website, and extracting product details from Amazon.com. Both use the Python libraries `requests`, `BeautifulSoup`, and `pandas`.


# Task 1: Scraping from Scrape This Site

In this section, we scrape paginated tabular data (hockey team statistics) from https://www.scrapethissite.com/pages/forms/ and save it to a CSV file.

In [56]:
# Import Libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from pathlib import Path

In [57]:
# --- Scraping from ScrapethisSite ---

# Set base URL for scraping data
base_url = "https://www.scrapethissite.com/pages/forms/"
page_num = 1
all_data = []
headers = []

# Loop through pages until no more data
while True:
    url = f"{base_url}?page_num={page_num}"
    print(f"Retrieving page {page_num}... at {url}")

    # Set headers for the HTTP request
    headers_request = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }

    # Make request to the page (the timeout keeps a stalled connection from hanging the loop)
    response = requests.get(url, headers=headers_request, timeout=10)

    # Break loop if page is not found
    if response.status_code != 200:
        print(f"Stopping. Failed to retrieve page {page_num}.")
        break

    # Parse the page content
    soup = BeautifulSoup(response.text, "html.parser")

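    # Find the table in the HTML content
    table = soup.find("table")
    if table is None:
        # soup.find returns None when the page has no <table> element
        print(f"Stopping. No table found on page {page_num}.")
        break
    number_rows = len(table.find_all("tr"))

    # If only the header row is present, there is no more data
    if number_rows == 1: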
        print(f"Stopping. No more data on page {page_num}.")
        break
    else:
        # For the first page, get the table headers
        if page_num == 1:
            headers = [header.text.strip() for header in table.find_all("th")]

        # Extract row data from the table
        for row in table.find_all("tr"):
            data = [cell.text.strip() for cell in row.find_all("td")]
            if data:
                all_data.append(data)

    # Increment to get the next page
    page_num += 1

df = pd.DataFrame(all_data, columns=headers)
df


Retrieving page 1... at https://www.scrapethissite.com/pages/forms/?page_num=1
Retrieving page 2... at https://www.scrapethissite.com/pages/forms/?page_num=2
Retrieving page 3... at https://www.scrapethissite.com/pages/forms/?page_num=3
Retrieving page 4... at https://www.scrapethissite.com/pages/forms/?page_num=4
Retrieving page 5... at https://www.scrapethissite.com/pages/forms/?page_num=5
Retrieving page 6... at https://www.scrapethissite.com/pages/forms/?page_num=6
Retrieving page 7... at https://www.scrapethissite.com/pages/forms/?page_num=7
Retrieving page 8... at https://www.scrapethissite.com/pages/forms/?page_num=8
Retrieving page 9... at https://www.scrapethissite.com/pages/forms/?page_num=9
Retrieving page 10... at https://www.scrapethissite.com/pages/forms/?page_num=10
Retrieving page 11... at https://www.scrapethissite.com/pages/forms/?page_num=11
Retrieving page 12... at https://www.scrapethissite.com/pages/forms/?page_num=12
Retrieving page 13... at https://www.scrapethi

Unnamed: 0,Team Name,Year,Wins,Losses,OT Losses,Win %,Goals For (GF),Goals Against (GA),+ / -
0,Boston Bruins,1990,44,24,,0.55,299,264,35
1,Buffalo Sabres,1990,31,30,,0.388,292,278,14
2,Calgary Flames,1990,46,26,,0.575,344,263,81
3,Chicago Blackhawks,1990,49,23,,0.613,284,211,73
4,Detroit Red Wings,1990,34,38,,0.425,273,298,-25
...,...,...,...,...,...,...,...,...,...
577,Tampa Bay Lightning,2011,38,36,8,0.463,235,281,-46
578,Toronto Maple Leafs,2011,35,37,10,0.427,231,264,-33
579,Vancouver Canucks,2011,51,22,9,0.622,249,198,51
580,Washington Capitals,2011,42,32,8,0.512,222,230,-8
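
The row-extraction step in the loop above can be tried offline on a small HTML string. The following is a minimal sketch, not the notebook's actual cell: `parse_table` and `SAMPLE_HTML` are hypothetical names introduced here, and the sample rows reuse two values from the output above.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample that mimics the structure of the scraped table
SAMPLE_HTML = """
<table>
  <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
  <tr><td> Boston Bruins </td><td>1990</td><td>44</td></tr>
  <tr><td>Buffalo Sabres</td><td>1990</td><td>31</td></tr>
</table>
"""


def parse_table(html):
    """Return (headers, rows) for the first <table> in html, or two empty lists."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return [], []

    # Header cells are <th>; data cells are <td>
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return headers, rows


headers, rows = parse_table(SAMPLE_HTML)
print(headers)
print(rows)
```

Separating parsing from fetching in this way makes it possible to check the extraction logic without sending any requests to the site.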


In [58]:
# Displaying first few rows of the DataFrame
df.head()

# Displaying DataFrame information
df.info()

# Displaying the shape of the DataFrame (rows, columns)
df.shape

# Save the data to a CSV file
df.to_csv('scrape1.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 582 entries, 0 to 581
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Team Name           582 non-null    object
 1   Year                582 non-null    object
 2   Wins                582 non-null    object
 3   Losses              582 non-null    object
 4   OT Losses           582 non-null    object
 5   Win %               582 non-null    object
 6   Goals For (GF)      582 non-null    object
 7   Goals Against (GA)  582 non-null    object
 8   + / -               582 non-null    object
dtypes: object(9)
memory usage: 41.0+ KB
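
The `df.info()` output shows every column stored as `object` (strings), and `OT Losses` counts as non-null because missing values were scraped as empty strings. One possible way to get numeric dtypes is sketched below; `convert_numeric_columns` is a hypothetical helper, and treating `Team Name` as the only text column is an assumption based on the column list above.

```python
import pandas as pd


def convert_numeric_columns(df, text_columns=("Team Name",)):
    """Return a copy of df with every non-text column converted to numbers."""
    typed = df.copy()
    for column in typed.columns:
        if column not in text_columns:
            # errors="coerce" turns unparseable values (such as "") into NaN
            typed[column] = pd.to_numeric(typed[column], errors="coerce")
    return typed


# Small sample in the same shape as the scraped data (all values are strings)
sample = pd.DataFrame(
    {
        "Team Name": ["Boston Bruins", "Tampa Bay Lightning"],
        "Year": ["1990", "2011"],
        "OT Losses": ["", "8"],
        "Win %": ["0.55", "0.463"],
    }
)
typed = convert_numeric_columns(sample)
print(typed.dtypes)
```

Applied to the scraped frame, this would be called as `convert_numeric_columns(df)` before saving or analysing the data.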


# Task 2: Amazon Product Scraping

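In this task, we attempt to scrape the first search result from several product categories on Amazon. For each category, we extract the product name and image URL and save the image locally. Amazon often rejects automated requests, so some categories may return HTTP 503, as in the run recorded below.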

In [None]:

# Custom headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

# URLs for different categories
urls = {
    "consoles": "https://www.amazon.com/s?k=consoles",
    "handbags": "https://www.amazon.com/s?k=handbags",
    "printers": "https://www.amazon.com/s?k=printers",
    "wigs": "https://www.amazon.com/s?k=wigs",
    "keyboards": "https://www.amazon.com/s?k=keyboards",
}

# Function to save image using pathlib
def save_image(url, name, folder="images"):
    folder_path = Path(folder)
    folder_path.mkdir(parents=True, exist_ok=True)  # Create folder if it does not exist
    image_path = folder_path / f"{name}.jpg"
    # Reuse the browser-like headers and set a timeout so the download cannot hang
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        image_path.write_bytes(response.content)
        print(f"Image saved: {image_path}")
    else:
        print(f"Failed to download image from {url}. Status code: {response.status_code}")

# Scrape and save a single product image per category
def scrape_products(url, category):
    print(f"Fetching data from {url}...")
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code != 200:
        print(f"Failed to retrieve {category}. Status code: {response.status_code}")
        return

    soup = BeautifulSoup(response.content, "html.parser")
    product = soup.select_one("div[data-component-type='s-search-result']")  # Get only the first product

    if product:
        try:
            name = product.h2.text.strip()
            img_tag = product.find("img", class_="s-image")
            if img_tag and img_tag.has_attr('src'):
                img_url = img_tag['src']
                save_image(img_url, f"{category}_1")
                print(f"Saved {category} product: {name}")
            else:
                print(f"No image found for {category}.")
        except Exception as e:
            print(f"Error processing product in {category}: {e}")
    else:
        print(f"No products found for {category}.")

# Loop through categories
for category, url in urls.items():
    print(f"Scraping {category}...")
    scrape_products(url, category)

Scraping consoles...
Fetching data from https://www.amazon.com/s?k=consoles...
Failed to retrieve consoles. Status code: 503
Scraping handbags...
Fetching data from https://www.amazon.com/s?k=handbags...
Failed to retrieve handbags. Status code: 503
Scraping printers...
Fetching data from https://www.amazon.com/s?k=printers...
Image saved: images/printers_1.jpg
Saved printers product: Canon PIXMA TS6420a All-in-One Wireless Inkjet Printer [Print,Copy,Scan], Black, Works with Alexa
Scraping wigs...
Fetching data from https://www.amazon.com/s?k=wigs...
Failed to retrieve wigs. Status code: 503
Scraping keyboards...
Fetching data from https://www.amazon.com/s?k=keyboards...
Failed to retrieve keyboards. Status code: 503
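
Four of the five categories in the run above failed with status 503. One common mitigation is to retry with exponential backoff. The following is a minimal sketch under stated assumptions: `fetch_with_backoff`, `RETRYABLE_STATUS`, and `FakeResponse` are hypothetical names, the set of retryable codes is an assumption, and there is no guarantee that retrying will get a request past Amazon's bot protection.

```python
import time

# Assumption: treat "too many requests" and "service unavailable" as retryable
RETRYABLE_STATUS = {429, 503}


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or attempts run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute. The last response is returned either way.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code not in RETRYABLE_STATUS:
            return response
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    return response


class FakeResponse:
    """Stand-in for a real HTTP response so the sketch runs offline."""

    def __init__(self, status_code):
        self.status_code = status_code


# Offline demonstration: two 503 responses followed by a 200
statuses = iter([503, 503, 200])
waits = []
result = fetch_with_backoff(lambda: FakeResponse(next(statuses)), sleep=waits.append)
print(result.status_code, waits)
```

Inside `scrape_products`, the request could be wrapped as `fetch_with_backoff(lambda: requests.get(url, headers=headers, timeout=10))`, leaving the existing status-code check in place for the case where every attempt fails.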
