
# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.
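
The save-and-preview step described above could look roughly like the sketch below. The helper name `preview_and_save`, the demo data, and the temporary output path are all hypothetical and only illustrate the pattern; they are not part of the provided notebook.

```python
import os
import tempfile

import pandas as pd


def preview_and_save(df, sort_col, path, top=15):
    """Write the full table to CSV, then return the top rows by sort_col (descending)."""
    df.to_csv(path, index=False)
    return df.sort_values(sort_col, ascending=False).head(top)


# Made-up demo data standing in for a scraped table.
demo = pd.DataFrame({"title": ["a", "b", "c"], "points": [5, 42, 17]})
out_path = os.path.join(tempfile.mkdtemp(), "demo.csv")

preview = preview_and_save(demo, "points", out_path, top=2)
print(preview)
```

Saving before sorting keeps the CSV in page order while the preview is sorted. Saving the sorted frame instead would also be a defensible choice if the grader expects the file in ranked order.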


In [2]:
#Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


Dependencies installed.


### Common Imports & Polite Headers

In [3]:
# Common Imports & Polite Headers
import re, sys, pandas as pd, requests
from bs4 import BeautifulSoup
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}
def fetch_html(url: str, timeout: int = 20) -> str:
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text
def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join([str(x) for x in tup if str(x)!="nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df
print("Common helpers loaded.")


Common helpers loaded.


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.
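
As a rough illustration of the tip, the sketch below runs `pandas.read_html` on a small invented HTML snippet (not the real IBAN page) and keeps the first table with at least 3 columns. The helper name `first_wide_table` and the sample rows are assumptions made for this example. It also assumes an HTML parser such as `lxml` is installed, as in the install cell.

```python
from io import StringIO

import pandas as pd

# Invented sample: a one-column layout table followed by a four-column data table.
SAMPLE_HTML = """
<table><tr><th>Note</th></tr><tr><td>layout only</td></tr></table>
<table>
  <tr><th>Country</th><th>Alpha-2 code</th><th>Alpha-3 code</th><th>Numeric</th></tr>
  <tr><td>Zambia</td><td>zm</td><td>zmb</td><td>894</td></tr>
  <tr><td>Yemen</td><td>ye</td><td>yem</td><td>887</td></tr>
</table>
"""


def first_wide_table(html, min_cols=3):
    """Return the first parsed table that has at least min_cols columns."""
    # StringIO avoids the deprecation warning for passing literal HTML strings.
    for table in pd.read_html(StringIO(html)):
        if table.shape[1] >= min_cols:
            return table
    raise ValueError(f"no table with at least {min_cols} columns")


table = first_wide_table(SAMPLE_HTML)
table.columns = ["Country", "Alpha-2", "Alpha-3", "Numeric"]
table["Alpha-2"] = table["Alpha-2"].str.strip().str.upper()
table["Alpha-3"] = table["Alpha-3"].str.strip().str.upper()
# errors="coerce" turns unparseable values into missing values; Int64 is nullable.
table["Numeric"] = pd.to_numeric(table["Numeric"], errors="coerce").astype("Int64")
print(table)
```

Checking the column count rather than hard-coding `tables[0]` makes the selection a little more robust if the page ever gains an extra layout table.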


In [4]:
# --- Q1 Skeleton (fill the TODOs) ---
def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML.
    TODO: implement with pd.read_html(html), pick a reasonable table, then flatten headers.
    """
    raise NotImplementedError("TODO: implement q1_read_table")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids.
    TODO: implement cleaning steps.
    """
    raise NotImplementedError("TODO: implement q1_clean")

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N.
    TODO: implement.
    """
    raise NotImplementedError("TODO: implement q1_sort_top")


In [5]:
# Q1 — Write your answer here
from io import StringIO

url = "https://www.iban.com/country-codes"
html = fetch_html(url)

# Wrap the HTML in StringIO: passing a raw HTML string to read_html is deprecated.
# keep_default_na=False keeps Namibia's Alpha-2 code "NA" from being read as missing.
tables = pd.read_html(StringIO(html), keep_default_na=False)
raw_df = next(t for t in tables if t.shape[1] >= 3)  # First table with >= 3 columns
df = flatten_headers(raw_df.copy())

# Rename columns (assumes the table has exactly these four columns, in this order)
df.columns = ["Country", "Alpha-2", "Alpha-3", "Numeric"]

# Clean data: trim text, uppercase the codes, cast Numeric to a nullable int
df["Country"] = df["Country"].astype(str).str.strip()
for col in ["Alpha-2", "Alpha-3"]:
    df[col] = df[col].astype(str).str.replace(r"\s+", "", regex=True).str.upper()
df["Numeric"] = pd.to_numeric(df["Numeric"], errors="coerce").astype("Int64")

# Save the full cleaned table, then preview the Top-15 by Numeric (descending)
df.to_csv("data_q1.csv", index=False)
top15 = df.sort_values("Numeric", ascending=False).head(15)
display(top15)
print("Saved -> data_q1.csv")



|     | Country | Alpha-2 | Alpha-3 | Numeric |
|-----|---------|---------|---------|---------|
| 247 | Zambia | ZM | ZMB | 894 |
| 246 | Yemen | YE | YEM | 887 |
| 192 | Samoa | WS | WSM | 882 |
| 244 | Wallis and Futuna | WF | WLF | 876 |
| 240 | Venezuela (Bolivarian Republic of) | VE | VEN | 862 |
| 238 | Uzbekistan | UZ | UZB | 860 |
| 237 | Uruguay | UY | URY | 858 |
| 35 | Burkina Faso | BF | BFA | 854 |
| 243 | Virgin Islands (U.S.) | VI | VIR | 850 |
| 236 | United States of America (the) | US | USA | 840 |


Saved -> data_q1.csv


## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
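
To make the row pairing concrete, here is a minimal sketch that parses a hand-written snippet shaped like the structure the tip describes. The sample HTML is invented and much simpler than the live page, and `to_int` is a hypothetical helper for the "non-digits → 0" rule. It uses the built-in `html.parser` so it does not depend on `lxml`.

```python
import re

from bs4 import BeautifulSoup

# Invented sample: two stories, the second without a score or comment count.
SAMPLE_HTML = """
<table>
  <tr class="athing">
    <td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/post">Example story</a></span></td>
  </tr>
  <tr><td class="subtext">
    <span class="score">42 points</span> by <a class="hnuser">someone</a>
    | <a href="item?id=1">7&nbsp;comments</a>
  </td></tr>
  <tr class="athing">
    <td><span class="rank">2.</span></td>
    <td><span class="titleline"><a href="item?id=2">Job posting</a></span></td>
  </tr>
  <tr><td class="subtext"><a href="item?id=2">discuss</a></td></tr>
</table>
"""


def to_int(text):
    """Return the first run of digits in text as an int, or 0 if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else 0


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for story in soup.select("tr.athing"):
    # Points and comments live in the <tr> that follows the story row.
    detail = story.find_next_sibling("tr")
    score = detail.select_one(".score") if detail else None
    links = detail.select(".subtext a") if detail else []
    comment_text = next(
        (a.get_text() for a in links
         if "comment" in a.get_text() or "discuss" in a.get_text()),
        "",
    )
    rank = story.select_one("span.rank")
    title = story.select_one(".titleline a")
    rows.append({
        "rank": to_int(rank.get_text() if rank else ""),
        "title": title.get_text(strip=True) if title else "",
        "points": to_int(score.get_text() if score else ""),
        "comments": to_int(comment_text),
    })

print(rows)
```

Stories with no score or a bare "discuss" link fall through to 0, which matches the cleaning rule stated above.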


In [6]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    TODO: implement with BeautifulSoup on '.athing' and its sibling '.subtext'.
    """
    raise NotImplementedError("TODO: implement q2_parse_items")

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values.
    TODO: cast points/comments/rank to int (non-digits -> 0). Fill text fields.
    """
    raise NotImplementedError("TODO: implement q2_clean")

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N. TODO: implement."""
    raise NotImplementedError("TODO: implement q2_sort_top")


In [13]:
# Q2 — Write your answer here

from urllib.parse import urljoin

url = "https://news.ycombinator.com/"
html = fetch_html(url)
soup = BeautifulSoup(html, "lxml")

stories = []
for block in soup.select("tr.athing"):  # Each block is a story's main row
    rank_element = block.select_one("span.rank")
    rank = rank_element.text.strip() if rank_element else ""

    title_element = block.select_one(".titleline a")
    title = title_element.text.strip() if title_element else ""
    # Resolve relative links (e.g. "item?id=...") against the site URL
    link = urljoin(url, title_element["href"]) if title_element else ""

    # The subtext (points, comments) is in the *next* tr sibling
    subtext_row = block.find_next_sibling("tr")
    points = "0 points"
    comments = "0 comments"

    if subtext_row:
        score_element = subtext_row.select_one(".score")
        if score_element:
            points = score_element.text.strip()

        # Find the comments link by checking all links in the subtext
        for a_tag in subtext_row.select(".subtext a"):
            text = a_tag.text.strip()
            if "comment" in text or "discuss" in text:
                comments = text
                break

    stories.append({"Rank": rank, "Title": title, "Link": link,
                    "Points": points, "Comments": comments})

# Create DataFrame
df = pd.DataFrame(stories, columns=["Rank", "Title", "Link", "Points", "Comments"])

# Clean data: keep the first run of digits in each numeric field (non-digits -> 0)
for col in ["Rank", "Points", "Comments"]:
    df[col] = df[col].str.extract(r"(\d+)", expand=False).fillna("0").astype(int)

# Fill missing text fields
df[["Title", "Link"]] = df[["Title", "Link"]].fillna("")

top15 = df.sort_values("Points", ascending=False).head(15)

display(top15)
df.to_csv("data_q2.csv", index=False)
print("Saved -> data_q2.csv")

|    | Rank | Title | Link | Points | Comments |
|----|------|-------|------|--------|----------|
| 10 | 11.0 | Leaving Meta and PyTorch | https://soumith.ch/blog/2025-11-06-leaving-met... | 663 | 162 |
| 25 | 26.0 | Meta projected 10% of 2024 revenue came from s... | https://sherwood.news/tech/meta-projected-10-o... | 601 | 455 |
| 15 | 16.0 | Denmark's government aims to ban access to soc... | https://apnews.com/article/denmark-social-medi... | 349 | 250 |
| 9 | 10.0 | I Love OCaml | https://mccd.space/posts/ocaml-the-worlds-best/ | 273 | 186 |
| 4 | 5.0 | YouTube Removes Windows 11 Bypass Tutorials, C... | https://news.itsfoss.com/youtube-removes-windo... | 219 | 69 |
| 21 | 22.0 | Apple is crossing a Steve Jobs red line | https://kensegall.com/2025/11/07/apple-is-cros... | 189 | 166 |
| 11 | 12.0 | James Watson has died | https://www.nytimes.com/2025/11/07/science/jam... | 173 | 80 |
| 7 | 8.0 | VLC's Jean-Baptiste Kempf Receives the Europea... | https://fsfe.org/news/2025/news-20251107-01.en... | 171 | 24 |
| 0 | 1.0 | AI is Dunning-Kruger as a service | https://christianheilmann.com/2025/10/30/ai-is... | 154 | 105 |
| 1 | 2.0 | Myna: Monospace typeface designed for symbol-h... | https://github.com/sayyadirfanali/Myna | 150 | 65 |


Saved -> data_q2.csv
