<a href="https://colab.research.google.com/github/SavhBwbd/Saurav_DTSC3020_Fall2025/blob/main/Assignment_6_Webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **"Write your answer here"** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top-15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top-15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top-15 sorted by `points`.


In [2]:
#Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


Dependencies installed.


In [None]:
### 2) Common Imports & Polite Headers

In [3]:
# Common Imports & Polite Headers
import re, sys, pandas as pd, requests
from bs4 import BeautifulSoup
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}
def fetch_html(url: str, timeout: int = 20) -> str:
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text
def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join([str(x) for x in tup if str(x)!="nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df
print("Common helpers loaded.")


Common helpers loaded.
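
As a small illustration of what `flatten_headers` does with a two-level header, the sketch below applies the same joining rule to a made-up DataFrame. The column names and rows are hypothetical sample data, not scraped output.

```python
import pandas as pd

# Hypothetical two-level header, similar to what read_html can return
# for tables whose columns are grouped under a shared heading.
df = pd.DataFrame(
    [["ZM", "ZMB"], ["YE", "YEM"]],
    columns=pd.MultiIndex.from_tuples([("Codes", "Alpha-2"), ("Codes", "Alpha-3")]),
)

# Same joining rule as flatten_headers: merge the levels of each column
# tuple with a space and skip levels whose text is "nan".
flat = [
    " ".join(str(x) for x in tup if str(x) != "nan").strip()
    for tup in df.columns.values
]
print(flat)
```

Flat, single-level names make later steps such as renaming or `sort_values` simpler, because each column can be addressed by one string.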


In [None]:
## Question 1: IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.
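
The tip above can be tried offline on a small page before touching the real URL. The sketch below is only an illustration: `SAMPLE_HTML` is a made-up page with one narrow layout table and one data table, and it assumes the `lxml` parser installed earlier is available.

```python
from io import StringIO

import pandas as pd

# Hypothetical page: a one-column layout table followed by the data table.
SAMPLE_HTML = """
<table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table>
  <tr><th>Country</th><th>Alpha-2 code</th><th>Alpha-3 code</th><th>Numeric</th></tr>
  <tr><td>Zambia</td><td>ZM</td><td>ZMB</td><td>894</td></tr>
  <tr><td>Yemen</td><td>YE</td><td>YEM</td><td>887</td></tr>
</table>
"""

# read_html returns one DataFrame per <table>; wrapping the string in
# StringIO avoids the deprecation warning for literal HTML input.
tables = pd.read_html(StringIO(SAMPLE_HTML))

# Pick the first table that is wide enough to be the data table.
wide = next((t for t in tables if t.shape[1] >= 3), None)
print(len(tables), wide.shape)
print(wide)
```

Filtering by column count is a simple heuristic; on a page with several wide tables, the `match` argument of `read_html` may be a more precise way to select one.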


In [4]:
# --- Q1 Skeleton (fill the TODOs) ---
def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML.
    TODO: implement with pd.read_html(html), pick a reasonable table, then flatten headers.
    """
    raise NotImplementedError("TODO: implement q1_read_table")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids.
    TODO: implement cleaning steps.
    """
    raise NotImplementedError("TODO: implement q1_clean")

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N.
    TODO: implement.
    """
    raise NotImplementedError("TODO: implement q1_sort_top")
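
The cleaning and sorting steps named in the docstrings above can be sketched on a few made-up rows. The values below are hypothetical sample data chosen to show each rule (extra spaces, lowercase codes, a non-numeric code); this is one possible approach, not the required solution.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped table.
sample = pd.DataFrame({
    "Country": ["  Zambia ", "Yemen", "Nowhere"],
    "Alpha-2": ["zm", "YE ", "xx"],
    "Alpha-3": ["zmb", "yem", "xxx"],
    "Numeric": ["894", "887", "n/a"],
})

cleaned = sample.copy()
for col in ["Country", "Alpha-2", "Alpha-3"]:
    cleaned[col] = cleaned[col].str.strip()   # trim surrounding spaces
for col in ["Alpha-2", "Alpha-3"]:
    cleaned[col] = cleaned[col].str.upper()   # codes in UPPERCASE

# errors="coerce" turns unparseable text into NaN; "Int64" (capital I)
# is the nullable integer dtype, so missing values stay missing.
cleaned["Numeric"] = pd.to_numeric(cleaned["Numeric"], errors="coerce").astype("Int64")

# Descending sort with missing values pushed to the end.
top = cleaned.sort_values("Numeric", ascending=False, na_position="last")
print(top)
```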


In [6]:
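from io import StringIO

import pandas as pd
import requests

URL = "https://www.iban.com/country-codes"

def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >=3 columns, with flattened headers."""
    # Wrap the literal HTML in StringIO (passing a raw string is deprecated in pandas >= 2.1).
    # keep_default_na=False keeps Namibia's Alpha-2 code "NA" as text instead of NaN.
    tables = pd.read_html(StringIO(html), keep_default_na=False)
    for t in tables:
        if t.shape[1] >= 3:
            # flatten_headers comes from the Common Imports cell
            return flatten_headers(t.copy())
    raise ValueError("No valid table (>=3 cols) found.")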

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: trim, uppercase codes, convert numeric."""
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    # normalize column names
    df.columns = [c.strip().replace("Code", "code") for c in df.columns]

    # flexible renaming if needed
    rename_map = {}
    for c in df.columns:
        if "alpha-2" in c.lower():
            rename_map[c] = "Alpha-2"
        elif "alpha-3" in c.lower():
            rename_map[c] = "Alpha-3"
        elif "numeric" in c.lower():
            rename_map[c] = "Numeric"
        elif "country" in c.lower():
            rename_map[c] = "Country"
    df = df.rename(columns=rename_map)

    # strip spaces (missing values stay NaN instead of becoming the text "nan")
    for c in df.columns:
        if df[c].dtype == "object":
            df[c] = df[c].where(df[c].isna(), df[c].astype(str).str.strip())

    # uppercase codes
    for col in ["Alpha-2", "Alpha-3"]:
        if col in df.columns:
            df[col] = df[col].str.upper()

    # numeric conversion
    if "Numeric" in df.columns:
        df["Numeric"] = pd.to_numeric(df["Numeric"], errors="coerce").astype("Int64")

    return df

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N."""
    num_col = None
    for c in df.columns:
        if "numeric" in c.lower():
            num_col = c
            break
    if not num_col:
        raise KeyError("No numeric column found in DataFrame.")
    df_sorted = df.sort_values(by=num_col, ascending=False, na_position="last")
    return df_sorted.head(top)

# Main
html = fetch_html(URL)  # shared helper: sends HEADERS, sets a timeout, raises on HTTP errors
df_raw = q1_read_table(html)
df_clean = q1_clean(df_raw)
df_top15 = q1_sort_top(df_clean, 15)

# Save and display
df_clean.to_csv("data_q1.csv", index=False)
print("Top 15 by Numeric code (desc):")
print(df_top15)


Top 15 by Numeric code (desc):
                                               Country Alpha-2 Alpha-3  \
247                                             Zambia      ZM     ZMB   
246                                              Yemen      YE     YEM   
192                                              Samoa      WS     WSM   
244                                  Wallis and Futuna      WF     WLF   
240                 Venezuela (Bolivarian Republic of)      VE     VEN   
238                                         Uzbekistan      UZ     UZB   
237                                            Uruguay      UY     URY   
35                                        Burkina Faso      BF     BFA   
243                              Virgin Islands (U.S.)      VI     VIR   
236                     United States of America (the)      US     USA   
219                       Tanzania, United Republic of      TZ     TZA   
108                                        Isle of Man      IM     IMN   
113    



In [None]:
## Question 2: Hacker News (front page)
**URL:** https://news.ycombinator.com/
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
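
The row pairing described in the tip can be sketched on a tiny hand-written snippet. The markup below is a simplified, hypothetical stand-in for the real page (which may differ), and the sketch uses Python's built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each story row (.athing) is followed by a details row.
SAMPLE = """
<table>
  <tr class="athing"><td><span class="rank">1.</span></td>
      <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr><td class="subtext"><span class="score">120 points</span>
      by <a class="hnuser" href="user?id=alice">alice</a>
      | <a href="item?id=1">45&nbsp;comments</a></td></tr>
  <tr class="athing"><td><span class="rank">2.</span></td>
      <td><span class="titleline"><a href="https://example.com/b">Job post</a></span></td></tr>
  <tr><td class="subtext"><a href="item?id=2">1 hour ago</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
rows = []
for story in soup.select("tr.athing"):
    # The details live in the next <tr>; either lookup may come back empty.
    detail = story.find_next_sibling("tr")
    subtext = detail.select_one(".subtext") if detail else None
    score = subtext.select_one(".score") if subtext else None
    title = story.select_one(".titleline a")
    rows.append({
        "title": title.get_text(strip=True) if title else "",
        "points": score.get_text(strip=True) if score else "0",
    })
print(rows)
```

The second sample row has no `.score` element, which is why every lookup is guarded before calling `get_text`.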


In [7]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    TODO: implement with BeautifulSoup on '.athing' and its sibling '.subtext'.
    """
    raise NotImplementedError("TODO: implement q2_parse_items")

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values.
    TODO: cast points/comments/rank to int (non-digits -> 0). Fill text fields.
    """
    raise NotImplementedError("TODO: implement q2_clean")

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N. TODO: implement."""
    raise NotImplementedError("TODO: implement q2_sort_top")
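
The "non-digits -> 0" rule from the `q2_clean` docstring can be sketched with the standard library alone. The helper name `to_int` and the sample strings are hypothetical, chosen to resemble the kinds of text found in the details row.

```python
import re

def to_int(text) -> int:
    """Return the first run of digits in text as an int, or 0 if there is none."""
    match = re.search(r"\d+", str(text))
    return int(match.group()) if match else 0

# Typical inputs: a score, a comment count with a non-breaking space,
# a link with no count, a rank with a trailing dot, and a missing value.
samples = ["489 points", "428\xa0comments", "discuss", "22.", None]
print([to_int(s) for s in samples])  # [489, 428, 0, 22, 0]
```

Taking only the first run of digits keeps the rule simple; a value written with a thousands separator such as "1,234" would be read as 1, so the separator would need to be removed first if such values can occur.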


In [8]:
# Q2: Write your answer here

# Fetching
url = "https://news.ycombinator.com/"
html = fetch_html(url)

# front-page stories
from bs4 import BeautifulSoup

def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    """
    soup = BeautifulSoup(html, "lxml")
    stories = soup.select(".athing")
    data = []

    for story in stories:
        rank_tag = story.select_one(".rank")
        title_tag = story.select_one(".titleline a")
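        # Details are in the next <tr>; guard against it being absent
        detail_row = story.find_next_sibling("tr")
        subtext = detail_row.select_one(".subtext") if detail_row else None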

        rank = rank_tag.get_text(strip=True).replace(".", "") if rank_tag else "0"
        title = title_tag.get_text(strip=True) if title_tag else ""
        link = title_tag["href"] if title_tag and title_tag.has_attr("href") else ""

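        # subtext may be missing, and some rows (e.g. job posts) have no score or user
        if subtext:
            points_tag = subtext.select_one(".score")
            user_tag = subtext.select_one(".hnuser")
            links = subtext.find_all("a")
            comments_tag = links[-1] if links else None  # avoid IndexError on an empty list
        else:
            points_tag = user_tag = comments_tag = None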

        points = points_tag.get_text(strip=True) if points_tag else "0"
        user = user_tag.get_text(strip=True) if user_tag else ""
        comments_text = comments_tag.get_text(strip=True) if comments_tag else "0"

        # The last link may read "discuss" or "hide" instead of a comment count
        if not comments_text or "comment" not in comments_text:
            comments_text = "0"

        data.append({
            "rank": rank,
            "title": title,
            "link": link,
            "user": user,
            "points": points,
            "comments": comments_text
        })
    return pd.DataFrame(data)

#Clean numeric fields and fill missing
def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values."""
    def extract_int(text):
        text = str(text)
        digits = re.findall(r"\d+", text)
        return int(digits[0]) if digits else 0

    for col in ["rank", "points", "comments"]:
        df[col] = df[col].apply(extract_int)

    df = df.fillna("")
    return df

#Sort by points
def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N."""
    df_sorted = df.sort_values(by="points", ascending=False)
    return df_sorted.head(top)

#Run pipeline
df_raw = q2_parse_items(html)
df_clean = q2_clean(df_raw)
df_top15 = q2_sort_top(df_clean)

#Save and preview
df_clean.to_csv("data_q2.csv", index=False)
print("data_q2.csv saved successfully.\n")
print("🔝 Top 15 Hacker News Stories by Points:\n")
print(df_top15[["rank", "title", "points", "comments", "user"]])

data_q2.csv saved successfully.

🔝 Top 15 Hacker News Stories by Points:

    rank                                              title  points  comments  \
21    22                                Why is Zig so cool?     489       428   
18    19  Valdi – A cross-platform UI framework that del...     459       187   
23    24                 Ticker: Don't die of heart disease     388       342   
27    28  Myna: Monospace typeface designed for symbol-h...     375       168   
3      4  Study identifies weaknesses in how AI systems ...     295       154   
1      2         Marko – A declarative, HTML-based language     196       101   
11    12  IP blocking the UK is not enough to comply wit...     167       193   
28    29  Making Democracy Work: Fixing and Simplifying ...     153        46   
6      7                                       WriterdeckOS     120        58   
9     10                    Avería: The Average Font (2011)     118        26   
12    13  Cloudflare scru