
Scrape the laptop listings (with review counts) from Webscraper.io’s test e-commerce site:
- http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops

Each product card includes:
- Title
- Description
- Price
- Number of reviews

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# 1. Fetch the page
url = "http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
response = requests.get(url)
response.raise_for_status()

# 2. Parse HTML
soup = BeautifulSoup(response.text, "html.parser")

# 3. Locate product cards
cards = soup.select(".thumbnail")

# 4. Extract data into lists
titles = []
descriptions = []
prices = []
reviews = []

for card in cards:
    titles.append(card.select_one(".title")["title"])
    descriptions.append(card.select_one(".description").text.strip())
    prices.append(card.select_one(".price").text.strip().replace("$", ""))
    review_element = card.select_one(".ratings .pull-right")
    if review_element:
        reviews.append(review_element.text.strip().split()[0])
    else:
        reviews.append(0)

# 5. Build DataFrame
df = pd.DataFrame({
    "title": titles,
    "description": descriptions,
    "price": pd.to_numeric(prices),
    "review_count": pd.to_numeric(reviews)
})

We now have a DataFrame of 117 laptops with their titles, descriptions, prices, and review counts.
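The string cleanup inside the loop (stripping "$" from the price, taking the first token of the review text) can be pulled into small helpers that are easier to test on their own. This is a minimal sketch, assuming price text such as "$1,799.00" and review text such as "14 reviews"; parse_price and parse_review_count are hypothetical names, not part of the notebook.

```python
import re

def parse_price(text):
    """Convert a price string such as "$1,799.00" to a float."""
    # Keep only digits and the decimal point (drops "$", commas, spaces).
    cleaned = re.sub(r"[^\d.]", "", text)
    return float(cleaned)

def parse_review_count(text):
    """Return the first integer found in text such as "14 reviews", or 0 if none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else 0
```

Returning an integer from both branches of parse_review_count also avoids mixing strings and the integer fallback in the same list.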

In [4]:
# 1. Quick info
df.info()

# 2. Check for missing
df.isna().sum()

# 3. Drop rows with any missing values
df_clean = df.dropna().reset_index(drop=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117 entries, 0 to 116
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         117 non-null    object 
 1   description   117 non-null    object 
 2   price         117 non-null    float64
 3   review_count  117 non-null    int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 3.8+ KB


Every column has 117 non-null values, so dropna() removed no rows and df_clean contains the same 117 rows as df.
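To report the number of dropped rows directly instead of reading it off df.info(), the row counts before and after dropna() can be compared. A small sketch on made-up data (the toy frame below is an assumption, not the scraped data):

```python
import pandas as pd

# Toy data standing in for the scraped frame; one price is missing.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "price": [295.99, None, 1799.0],
})

toy_clean = toy.dropna().reset_index(drop=True)

# Rows removed = rows before minus rows after.
rows_dropped = len(toy) - len(toy_clean)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```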

In [5]:
df_clean.shape  # returns (rows, columns)

(117, 4)

The cleaned dataset has 117 rows and 4 columns.

In [6]:
df_clean.dtypes

title            object
description      object
price           float64
review_count      int64
dtype: object


One-sentence insight: the dataset has two text columns, title and description (object), and two numeric columns, price (float64) and review_count (int64).


In [7]:
df_clean.select_dtypes(include=["object"]).nunique()

title           88
description    116
dtype: int64


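- title: 88 unique values across 117 rows, so some titles repeat
- description: 116 unique values across 117 rows, so one description appears twice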

In [8]:
df_clean[["price", "review_count"]].agg(["min", "max"])

       price  review_count
min   295.99             0
max  1799.00             0


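- price ranges from $295.99 to $1,799.00
- review_count is 0 in every row (min and max are both 0), which suggests the ".ratings .pull-right" selector matched nothing and the fallback value of 0 was used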
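A column whose min and max are equal, as review_count is in the output above, holds a single value and usually points to an extraction problem rather than real data. A sketch of a quick check, on made-up data; constant_columns is a hypothetical helper name:

```python
import pandas as pd

def constant_columns(frame):
    """Return the names of columns that contain at most one distinct value."""
    counts = frame.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()

# Toy data: review_count fell back to 0 in every row.
toy = pd.DataFrame({
    "price": [295.99, 1799.0, 499.0],
    "review_count": [0, 0, 0],
})
print(constant_columns(toy))
```

Running a check like this right after building the DataFrame would flag the column before any summary statistics are reported.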

In [9]:
# Example for a hypothetical 'category' column:
# df_clean["category"].value_counts().head()

Titles and descriptions are not all unique on this page: only 88 of the 117 titles and 116 of the 117 descriptions are distinct. To find a “most frequent category,” you’d still need a column, such as star rating or product type, that groups products by repeating values.
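Because the nunique() output shows fewer unique titles than rows, title itself has repeating values, and value_counts() can rank them in the same way it would rank a category column. A sketch on made-up titles (the data below is illustrative only):

```python
import pandas as pd

# Toy titles; "Laptop A" appears twice.
toy = pd.DataFrame({"title": ["Laptop A", "Laptop B", "Laptop A", "Laptop C"]})

# value_counts() sorts by frequency, most common first.
title_counts = toy["title"].value_counts()
most_common_title = title_counts.idxmax()
print(title_counts.head())
```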