<a href="https://colab.research.google.com/github/DanTran12/Danny_DTSC3020_Fall2025/blob/main/Assignment_6_WebScraping_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.


In [2]:
 #Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


Dependencies installed.


### 2) Common Imports & Polite Headers

In [3]:
# Common Imports & Polite Headers
import re, sys, pandas as pd, requests
from bs4 import BeautifulSoup
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}
def fetch_html(url: str, timeout: int = 20) -> str:
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text
def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join([str(x) for x in tup if str(x)!="nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df
print("Common helpers loaded.")


Common helpers loaded.


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.


In [10]:
def q1_read_table(html: str):
    tables = pd.read_html(html)
    for table in tables:
        if len(table.columns) >= 3:
            return table
def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    if isinstance(df.columns, pd.MultiIndex):
      df.columns = [' '.join([str(x) for x in tup if str(x) != 'nan']).strip()
                      for tup in df.columns.values]
    else:
      df.columns = [str(c).strip() for c in df.columns]


    df["Country"] = (df["Country"].astype(str)
                 .str.replace(r"\[.*?\]", "", regex=True)
                 .str.strip())
    df['Alpha-2'] = df['Alpha-2 code'].str.upper()
    df['Alpha-3'] = df['Alpha-3 code'].str.upper()
    df['Numeric'] = pd.to_numeric(df['Numeric']).astype('Int64')

    return df

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
   df = df.sort_values(by = 'Numeric', ascending = False)
   df.to_csv('data_q1.csv', index = False)
   print('data_q1.csv saved')
   return df.head(15)


In [11]:
# Q1 — Write your answer here

df =  q1_read_table('https://www.iban.com/country-codes')
q1_clean(df)
q1_sort_top(df)


data_q1.csv saved


Unnamed: 0,Country,Alpha-2 code,Alpha-3 code,Numeric,Alpha-2,Alpha-3
247,Zambia,ZM,ZMB,894,ZM,ZMB
246,Yemen,YE,YEM,887,YE,YEM
192,Samoa,WS,WSM,882,WS,WSM
244,Wallis and Futuna,WF,WLF,876,WF,WLF
240,Venezuela (Bolivarian Republic of),VE,VEN,862,VE,VEN
238,Uzbekistan,UZ,UZB,860,UZ,UZB
237,Uruguay,UY,URY,858,UY,URY
35,Burkina Faso,BF,BFA,854,BF,BFA
243,Virgin Islands (U.S.),VI,VIR,850,VI,VIR
236,United States of America (the),US,USA,840,US,USA


## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.


In [33]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(url: str) -> pd.DataFrame:
  resp = requests.get(url, headers=HEADERS, timeout=10)
  resp.raise_for_status()
  soup = BeautifulSoup(resp.text, 'html.parser')
  posts = []

  story_rows = soup.find_all('tr', class_='athing')

  for story_row in story_rows:
      # Extract data from the 'athing' row
      rank_tag = story_row.find('span', class_='rank')
      rank = int(rank_tag.string.strip('.')) if rank_tag else 0

      title_tag = story_row.find('a', class_='storylink')
      title = title_tag.string.strip() if title_tag else ''
      link = title_tag['href'] if title_tag else ' '
      subtext_row = story_row.find_next_sibling('tr')

      points = 0
      comments = 0

      if subtext_row:
          subtext_td = subtext_row.find('td', class_='subtext')
          if subtext_td:
              # Extract points
              points_tag = subtext_td.find('span', class_='score')
              if points_tag:
                  points_str = points_tag.string.replace(' points', '').strip()
                  points = int(points_str)


              comments_tag = subtext_td.find('a', string=re.compile(r'(\d+)\s*comment'))
              if comments_tag:
                  match = re.search(r'(\d+)', comments_tag.text)
                  comments = int(match.group(1)) if match else 0
              elif subtext_td.find('a', string='discuss'):
                  comments = 0


      posts.append({
          'rank': rank,
          'title': title,
          'link': link,
          'points': points,
          'comments': comments,
      })
  return pd.DataFrame(posts)

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    df['rank'] = pd.to_numeric(df['rank'], errors='coerce').fillna(0).astype(int)
    df['points'] = pd.to_numeric(df['points'], errors='coerce').fillna(0).astype(int)
    df['comments'] = pd.to_numeric(df['comments'], errors='coerce').fillna(0).astype(int)
    df['title'] = df['title'].fillna('')
    df['link'] = df['link'].fillna('')

    return df

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    df = df.sort_values(by = 'points', ascending = False)
    df.to_csv('data_q2.csv', index = False)
    print('data_q2.csv saved')
    return df.head(15)


In [34]:
# Q2 — Write your answer here

df = q2_parse_items('https://news.ycombinator.com/')
q2_clean(df)
q2_sort_top(df)



data_q2.csv saved


Unnamed: 0,rank,title,link,points,comments
2,3,,,572,66
4,5,,,382,199
17,18,,,250,209
9,10,,,202,76
3,4,,,167,517
10,11,,,156,15
6,7,,,155,164
7,8,,,147,78
19,20,,,143,108
16,17,,,136,52
