
# DATA 304 — Module 5, Session 1
## HTML and Web Scraping Demo

This notebook demonstrates:
- Basics of HTTP requests
- Parsing HTML with BeautifulSoup
- Extracting tables with `pandas.read_html`
- Parsing semi-structured content (div/span listings)
- A small activity for practice


In [1]:

# Imports
import sys
print(sys.version)

# Core libs for this session
import pandas as pd
from bs4 import BeautifulSoup

import requests

# Utility
from io import StringIO


3.12.1 (main, Nov 27 2025, 10:47:52) [GCC 13.3.0]



## 1) HTTP Requests Pattern

Typical pattern when fetching a page:
1. Make a `GET` request.
2. Check status code.
3. Use `response.text` as the HTML to parse.

Below we **show** the pattern. The request is wrapped in a try/except: if it fails for any reason (no network access, a timeout, or an HTTP error status), the cell falls back to a local HTML sample, stored in `html2`.


In [5]:

# URL
url = "https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals"

# Local demo HTML string, defined up front so that html2 exists whether or not the request succeeds
html2 = '\n<!DOCTYPE html>\n<html>\n  <head>\n    <title>Demo Page</title>\n  </head>\n  <body>\n    <h1 class="title">Sample Headline</h1>\n    <p id="msg">Hello, world.</p>\n\n    <h2>Top Stories</h2>\n    <ul>\n      <li><a href="/story/1">Story One</a></li>\n      <li><a href="/story/2">Story Two</a></li>\n      <li><a href="/story/3">Story Three</a></li>\n    </ul>\n\n    <h2>Population Table</h2>\n    <table>\n      <thead>\n        <tr><th>Country</th><th>Population</th></tr>\n      </thead>\n      <tbody>\n        <tr><td>Aland</td><td>30,000</td></tr>\n        <tr><td>Bravo</td><td>1,250,000</td></tr>\n        <tr><td>Charlie</td><td>9,999,999</td></tr>\n      </tbody>\n    </table>\n\n    <h2>Products</h2>\n    <div class="product">\n      <span class="name">Widget A</span>\n      <span class="price">$9.99</span>\n    </div>\n    <div class="product promo">\n      <span class="name">Widget B</span>\n      <span class="price">$14.50</span>\n    </div>\n  </body>\n</html>\n'

try:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # raise HTTPError on 4xx/5xx status codes
    html = resp.text
    print("Fetched from the web:", url)
except requests.RequestException as e:
    print("Falling back to local sample HTML due to:", e.__class__.__name__)
    html = html2  # make sure html is defined on the fallback path too
    print("Using local sample HTML string instead.")


Falling back to local sample HTML due to: HTTPError
Using local sample HTML string instead.
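
The try/except pattern above can be folded into a small helper so later cells do not depend on which branch ran. The sketch below is one possible shape, not part of the original notebook: `fetch_html`, its `getter` parameter, the `always_offline` stand-in, and the returned `(html, source)` tuple are hypothetical names chosen for illustration. Passing the request function in as a parameter makes the fallback path easy to exercise without network access.

```python
import requests


def fetch_html(url, fallback_html, getter=requests.get, timeout=10):
    """Return (html, source), where source is "web" or "fallback".

    `getter` is any callable with the same call shape as requests.get;
    it is a parameter only so the fallback can be tried offline.
    """
    try:
        resp = getter(url, timeout=timeout)
        resp.raise_for_status()  # turn 4xx/5xx responses into HTTPError
        return resp.text, "web"
    except requests.RequestException:
        # Covers connection errors, timeouts, and HTTP error statuses
        return fallback_html, "fallback"


def always_offline(url, timeout):
    """Stand-in getter that simulates a machine with no network."""
    raise requests.ConnectionError("simulated: no network")


sample = "<html><body><p id='msg'>Hello, world.</p></body></html>"
html_demo, source = fetch_html("https://example.com", sample, getter=always_offline)
print(source)
```

In normal use the `getter` argument would be omitted, so the helper calls `requests.get` directly.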


In [6]:
import requests

url = "https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals"
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()
html = resp.text
html

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>List of FIFA World Cup finals - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-fo
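
The first attempt failed with an `HTTPError`, while the retry with a `User-Agent` header succeeded. A likely explanation is that some servers reject requests that arrive with a library's default user agent. A minimal sketch of setting the header once on a `requests.Session`, so every request made through it carries the header, might look like the following. The user-agent string and contact address are hypothetical placeholders. The request is prepared but not sent, so the snippet runs offline.

```python
import requests

session = requests.Session()
# Hypothetical User-Agent string; replace the name and contact with your own
session.headers.update(
    {"User-Agent": "data304-demo/0.1 (contact: student@example.edu)"}
)

# Build the request without sending it, to inspect the headers it would carry
req = requests.Request(
    "GET", "https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals"
)
prepared = session.prepare_request(req)
print(prepared.headers["User-Agent"])

# Sending it would be: resp = session.send(prepared, timeout=10)
```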


## 2) Parsing HTML with BeautifulSoup

We create a `BeautifulSoup` object and then query elements by tag, attribute, or CSS selectors.
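
As a quick orientation, the three query styles can be compared side by side on a tiny document. This is a sketch on a hand-written snippet modeled on the local sample page; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

snippet = """
<h1 class="title">Sample Headline</h1>
<p id="msg">Hello, world.</p>
<ul>
  <li><a href="/story/1">Story One</a></li>
  <li><a href="/story/2">Story Two</a></li>
</ul>
"""
demo = BeautifulSoup(snippet, "html.parser")

by_tag = demo.find("h1").get_text(strip=True)            # first <h1> by tag name
by_attr = demo.find("p", id="msg").get_text(strip=True)  # match on the id attribute
has_title = demo.find("h1", class_="title") is not None  # match on a class
all_links = demo.find_all("a")                           # every <a>, as a list
first_href = demo.select_one("ul li a")["href"]          # CSS selector, first match
hrefs = [a["href"] for a in demo.select("ul li a")]      # CSS selector, all matches

print(by_tag, by_attr, len(all_links), hrefs)
```

`find` and `select_one` return a single element or `None`, so real pages usually need a `None` check before reading `.text` or an attribute.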


In [7]:

soup = BeautifulSoup(html, "html.parser")

# Extract title text
page_title = soup.title.text if soup.title else None
print("Page title:", page_title)

Page title: List of FIFA World Cup finals - Wikipedia


In [9]:
# Example: get the H1 with class 'title' (it exists in the local sample page, not on the Wikipedia page, so this prints None here)
h1_title = soup.find("h1", {"class": "title"})
print("H1 .title ->", h1_title.text if h1_title else None)

H1 .title -> None


In [10]:
h1_title = soup.find("h1")
h1_title

<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">List of FIFA World Cup finals</span></h1>

In [11]:
# Extract (text, href) pairs for every link inside a list item; on the Wikipedia page this also picks up navigation links
links = [(a.text.strip(), a.get("href")) for a in soup.select("ul li a")]
links[:5]

[('Main page', '/wiki/Main_Page'),
 ('Contents', '/wiki/Wikipedia:Contents'),
 ('Current events', '/wiki/Portal:Current_events'),
 ('Random article', '/wiki/Special:Random'),
 ('About Wikipedia', '/wiki/Wikipedia:About')]
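
The selector `ul li a` matches every list link on the page, which is why the output shows site navigation links. One way to narrow the result is to locate a heading first and read only the list that follows it, then resolve relative `href` values against the page address. The sketch below uses hand-written markup modeled on the local sample page; `base_url` is a hypothetical address.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page = """
<ul><li><a href="/about">About</a></li></ul>
<h2>Top Stories</h2>
<ul>
  <li><a href="/story/1">Story One</a></li>
  <li><a href="/story/2">Story Two</a></li>
</ul>
"""
base_url = "https://example.com/news/"  # hypothetical page address

doc = BeautifulSoup(page, "html.parser")
heading = doc.find("h2", string="Top Stories")
# The first <ul> sibling after the heading; whitespace between tags is skipped
story_list = heading.find_next_sibling("ul") if heading else None

stories = []
if story_list is not None:
    for a in story_list.select("li a"):
        # urljoin turns a relative href into an absolute URL
        stories.append((a.get_text(strip=True), urljoin(base_url, a.get("href"))))

print(stories)
```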


## 3) Extracting Tables with `pandas.read_html`

`pandas.read_html` can parse one or more tables from a page or from an HTML string. It returns a list of DataFrames.
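
When a page holds several tables, the `match` parameter of `pandas.read_html` keeps only those whose text matches a string or regex. The sketch below uses a hand-written two-table document based on the local sample page, and assumes a parser backend such as `lxml` is installed.

```python
from io import StringIO
import pandas as pd

two_tables = """
<table>
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>Founded</td><td>1930</td></tr>
</table>
<table>
  <thead><tr><th>Country</th><th>Population</th></tr></thead>
  <tbody>
    <tr><td>Aland</td><td>30,000</td></tr>
    <tr><td>Bravo</td><td>1,250,000</td></tr>
    <tr><td>Charlie</td><td>9,999,999</td></tr>
  </tbody>
</table>
"""

# Without match: one DataFrame per <table>
all_tables = pd.read_html(StringIO(two_tables))

# With match: only tables containing the text "Population"
population_tables = pd.read_html(StringIO(two_tables), match="Population")
df_pop = population_tables[0]

print(len(all_tables), len(population_tables))
print(df_pop)
```

The default `thousands=","` setting means values such as `1,250,000` are read as numbers rather than strings.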


In [12]:
# Use StringIO to provide the HTML string as a file-like object
tables = pd.read_html(StringIO(html))  # requires lxml, or BeautifulSoup with html5lib, to be installed
print(f"Found {len(tables)} table(s).")
tables[0].head()


Found 8 table(s).


Unnamed: 0,0,1
0,Founded,1930; 96 years ago
1,Current champions,Argentina (3rd title)
2,Most championships,Brazil (5 titles)


In [13]:
tables[1]

Unnamed: 0,0,1
0,"London (1966)Paris (1938, 1998)Berlin (2006)Ro...",Montevideo (1930)Buenos Aires (1978)Rio de Jan...



## 4) Semi-Structured Content: Product Listings

Many sites use `<div>`/`<span>` structures rather than `<table>`. We can extract and normalize these into a DataFrame.


In [14]:
html2

'\n<!DOCTYPE html>\n<html>\n  <head>\n    <title>Demo Page</title>\n  </head>\n  <body>\n    <h1 class="title">Sample Headline</h1>\n    <p id="msg">Hello, world.</p>\n\n    <h2>Top Stories</h2>\n    <ul>\n      <li><a href="/story/1">Story One</a></li>\n      <li><a href="/story/2">Story Two</a></li>\n      <li><a href="/story/3">Story Three</a></li>\n    </ul>\n\n    <h2>Population Table</h2>\n    <table>\n      <thead>\n        <tr><th>Country</th><th>Population</th></tr>\n      </thead>\n      <tbody>\n        <tr><td>Aland</td><td>30,000</td></tr>\n        <tr><td>Bravo</td><td>1,250,000</td></tr>\n        <tr><td>Charlie</td><td>9,999,999</td></tr>\n      </tbody>\n    </table>\n\n    <h2>Products</h2>\n    <div class="product">\n      <span class="name">Widget A</span>\n      <span class="price">$9.99</span>\n    </div>\n    <div class="product promo">\n      <span class="name">Widget B</span>\n      <span class="price">$14.50</span>\n    </div>\n  </body>\n</html>\n'

In [15]:
soup2 = BeautifulSoup(html2, "html.parser")
products = []
for card in soup2.select("div.product"):
    name = card.find("span", class_="name")
    price = card.find("span", class_="price")
    products.append({
        "name": name.text.strip() if name else None,
        "price_raw": price.text.strip() if price else None,
        "is_promo": "promo" in (card.get("class") or []),
    })

df_products = pd.DataFrame(products)
df_products


Unnamed: 0,name,price_raw,is_promo
0,Widget A,$9.99,False
1,Widget B,$14.50,True


In [16]:

# Clean the price column into numeric where possible
def parse_price(x):
    if x is None:
        return None
    x = x.replace("$", "").replace(",", "").strip()
    try:
        return float(x)
    except ValueError:
        return None

df_products["price"] = df_products["price_raw"].map(parse_price)
df_products = df_products.drop(columns=["price_raw"])
df_products


Unnamed: 0,name,is_promo,price
0,Widget A,False,9.99
1,Widget B,True,14.5
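
The same cleaning can also be done column-wise with pandas string methods instead of a row-by-row function. This is an alternative sketch, not the notebook's method; the sample values (including the thousands separator and the non-numeric entry) are invented to show the edge cases.

```python
import pandas as pd

# Illustrative values only: one plain price, one with a thousands
# separator, one non-numeric string, and one missing entry
raw = pd.Series(["$9.99", "$1,299.00", "Call for price", None])

# Strip currency symbols and separators, then coerce to numbers;
# anything that still is not numeric becomes NaN instead of raising
cleaned = raw.str.replace(r"[$,]", "", regex=True).str.strip()
prices = pd.to_numeric(cleaned, errors="coerce")

print(prices.tolist())
```

Using `errors="coerce"` keeps the column numeric and marks unparseable entries as missing, which makes them easy to count with `prices.isna().sum()`.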


In [17]:
url3 = "https://books.toscrape.com"
headers = {"User-Agent": "Mozilla/5.0"}
resp3 = requests.get(url3, headers=headers, timeout=10)
resp3.raise_for_status()
# The page is UTF-8, but requests can guess a different encoding when the server's
# Content-Type header omits the charset, which turns "£" into "Â£". Set it explicitly.
resp3.encoding = "utf-8"
html3 = resp3.text
soup3 = BeautifulSoup(html3, "html.parser")

In [18]:
rows = []
for card in soup3.select("article.product_pod"):
    a = card.select_one("h3 a")
    rows.append({
        "name": a.get("title"),
        "price_raw": card.select_one("p.price_color").text.strip(),
        "in_stock": "In stock" in card.select_one("p.instock.availability").text,
        "rating": next((c for c in card.select_one("p.star-rating").get("class", []) if c != "star-rating"), None),
        "url": requests.compat.urljoin(url3, a.get("href")),
    })

df = pd.DataFrame(rows)
df.head()

Unnamed: 0,name,price_raw,in_stock,rating,url
0,A Light in the Attic,£51.77,True,Three,https://books.toscrape.com/catalogue/a-light-i...
1,Tipping the Velvet,£53.74,True,One,https://books.toscrape.com/catalogue/tipping-t...
2,Soumission,£50.10,True,One,https://books.toscrape.com/catalogue/soumissio...
3,Sharp Objects,£47.82,True,Four,https://books.toscrape.com/catalogue/sharp-obj...
4,Sapiens: A Brief History of Humankind,£54.23,True,Five,https://books.toscrape.com/catalogue/sapiens-a...
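
The scraped `price_raw` and `rating` columns are still text. A minimal sketch of normalizing them is shown below, using three rows copied from the output above. The column names `price_gbp` and `rating_num` and the `RATING_WORDS` mapping are hypothetical. The regex drops every character that is not a digit or a decimal point, so it works whether the currency symbol arrives as `£` or as mis-decoded characters.

```python
import pandas as pd

# Rows copied from the scraped output above
books = pd.DataFrame({
    "name": ["A Light in the Attic", "Tipping the Velvet", "Sharp Objects"],
    "price_raw": ["£51.77", "£53.74", "£47.82"],
    "rating": ["Three", "One", "Four"],
})

# The site encodes star ratings as words in a CSS class
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Keep only digits and the decimal point, then convert to float
books["price_gbp"] = pd.to_numeric(
    books["price_raw"].str.replace(r"[^0-9.]", "", regex=True),
    errors="coerce",
)
# Unknown rating words map to NaN rather than raising
books["rating_num"] = books["rating"].map(RATING_WORDS)

print(books[["name", "price_gbp", "rating_num"]])
```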



## 5) Summary

- Use `requests` to fetch pages. Check status codes. Respect site policies.
- Parse HTML with BeautifulSoup. Use `find`, `find_all`, and CSS selectors.
- Use `pandas.read_html` for HTML tables when available.
- For semi-structured content, select container elements and normalize to a DataFrame.
- Always validate and clean extracted data.
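
One concrete way to respect site policies is to consult a site's `robots.txt` before fetching. The sketch below uses the standard library's `urllib.robotparser` on in-memory rules so it runs offline. The rules, the crawler name, and the URLs are hypothetical; a real check would load the file from the site itself with `set_url` and `read`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rp = RobotFileParser()
rp.parse(robots_lines)  # parse in-memory lines instead of fetching them

agent = "data304-demo"  # hypothetical crawler name
allowed_public = rp.can_fetch(agent, "https://example.com/catalogue/page-1.html")
allowed_private = rp.can_fetch(agent, "https://example.com/private/report.html")
delay = rp.crawl_delay(agent)  # seconds to wait between requests, if declared

print(allowed_public, allowed_private, delay)
```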
