# GDACS Earthquake Event Scraper
## Assignment 1 — Disaster Study Module

This notebook scrapes earthquake event data from the [Global Disaster Alert and Coordination System (GDACS)](https://www.gdacs.org) using **Selenium** with a headless Chrome browser. No manual ChromeDriver installation is required: `selenium-manager`, bundled with Selenium 4.6+, provisions the driver automatically.

### Events Scraped
| Label | Country | Type | Event ID (URL parameter) |
|---|---|---|---|
| Philippines – Historical | Philippines | EQ | eventid=1230629 |
| Philippines – Recent | Philippines | EQ | eventid=1502713 |
| Afghanistan – Historical | Afghanistan | EQ | eventid=1327560 |
| Afghanistan – Recent | Afghanistan | EQ | eventid=1508467 |

### Data Extracted (per event)
- **Summary Tab**: Event title, magnitude, depth, event date (UTC), GDACS score, alert level, country, exposed population (summary)
- **Impact Tab (Shakemap)**: Magnitude, depth, event date, exposed population (detailed)
- **Impact Tab (INFORM)**: INFORM coping capacity score, vulnerability score
- **Media Tab**: Total articles, articles about casualties, articles in last hour, peak news day


## Cell 1 — Imports & Configuration

In [6]:
import os
import re
import time
import pandas as pd
from datetime import datetime, timezone
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import selenium

print('Imports successful.')
print(f'Selenium version: {selenium.__version__}')


Imports successful.
Selenium version: 4.40.0


## Cell 2 — Event Configuration

All four GDACS event URLs and their metadata labels.

In [7]:
EVENTS = [
    {
        "label":   "Philippines – Historical",
        "country": "Philippines",
        "period":  "Historical",
        "url":     "https://www.gdacs.org/report.aspx?eventid=1230629&episodeid=1327163&eventtype=EQ",
    },
    {
        "label":   "Philippines – Recent",
        "country": "Philippines",
        "period":  "Recent",
        "url":     "https://www.gdacs.org/report.aspx?eventid=1502713&episodeid=1663312&eventtype=EQ",
    },
    {
        "label":   "Afghanistan – Historical",
        "country": "Afghanistan",
        "period":  "Historical",
        "url":     "https://www.gdacs.org/report.aspx?eventid=1327560&episodeid=1449790&eventtype=EQ",
    },
    {
        "label":   "Afghanistan – Recent",
        "country": "Afghanistan",
        "period":  "Recent",
        "url":     "https://www.gdacs.org/report.aspx?eventid=1508467&episodeid=1669785&eventtype=EQ",
    },
]

# Output directory for the cleaned CSV.
# __file__ is not defined inside a notebook, so write to the current working directory.
OUTPUT_DIR = os.getcwd()
OUTPUT_CSV = os.path.join(OUTPUT_DIR, "gdacs_earthquake_data.csv")

print("Events configured:")
for e in EVENTS:
    print(f"  • {e['label']} → {e['url']}")

Events configured:
  • Philippines – Historical → https://www.gdacs.org/report.aspx?eventid=1230629&episodeid=1327163&eventtype=EQ
  • Philippines – Recent → https://www.gdacs.org/report.aspx?eventid=1502713&episodeid=1663312&eventtype=EQ
  • Afghanistan – Historical → https://www.gdacs.org/report.aspx?eventid=1327560&episodeid=1449790&eventtype=EQ
  • Afghanistan – Recent → https://www.gdacs.org/report.aspx?eventid=1508467&episodeid=1669785&eventtype=EQ


## Cell 3 — Selenium Driver Setup

Uses **Selenium 4.6+ built-in `selenium-manager`** which automatically downloads the correct ChromeDriver for the installed Chrome version — no manual `chromedriver` installation required.

In [8]:
def create_driver(headless: bool = True) -> webdriver.Chrome:
    """
    Create and return a Chrome WebDriver instance.
    
    Selenium 4.6+ includes selenium-manager which automatically
    downloads the correct ChromeDriver — no manual driver needed.
    
    Parameters
    ----------
    headless : bool
        If True, Chrome runs without a visible window (default).
    
    Returns
    -------
    webdriver.Chrome
    """
    options = ChromeOptions()
    if headless:
        options.add_argument("--headless=new")          # 'new' headless mode (Chrome 112+)
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1920,1080")
    options.add_argument(
        "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    )
    # selenium-manager handles driver download automatically
    driver = webdriver.Chrome(options=options)
    driver.implicitly_wait(5)
    return driver


# Quick smoke-test
print("Testing driver creation...")
_test = create_driver(headless=True)
print(f"Driver created successfully. User-Agent: {_test.execute_script('return navigator.userAgent')}")
_test.quit()
print("Driver closed. Ready to scrape.")

Testing driver creation...
Driver created successfully. User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
Driver closed. Ready to scrape.


## Cell 4 — Helper: Safe Text Extraction

Convenience functions to extract text from DOM elements without raising exceptions.

In [9]:
def safe_text(driver, xpath: str, default: str = 'N/A') -> str:
    """Return stripped text of the first XPATH match, or default."""
    try:
        return driver.find_element(By.XPATH, xpath).text.strip()
    except Exception:
        return default


def safe_texts(driver, xpath: str) -> list:
    """Return list of stripped texts for all XPATH matches."""
    try:
        return [el.text.strip() for el in driver.find_elements(By.XPATH, xpath)]
    except Exception:
        return []


def label_value(driver, label: str, default: str = 'N/A') -> str:
    """
    Extract the value cell from a two-column label/value table.
    Tries exact match first, then partial match.
    """
    for xp in [
        f"//td[normalize-space(text())='{label}']/following-sibling::td[1]",
        f"//td[contains(text(),'{label}')]/following-sibling::td[1]",
    ]:
        val = safe_text(driver, xp)
        if val and val != 'N/A':
            return val
    return default


def build_urls(base_url: str) -> dict:
    """
    GDACS report pages are SEPARATE PAGES (not in-page tabs).
    Given the base report URL, derive the Impact and Media page URLs.

    URL patterns (confirmed from live DOM inspection):
      Summary : https://www.gdacs.org/report.aspx?eventid=X&episodeid=Y&eventtype=EQ
      Impact  : https://www.gdacs.org/Earthquakes/report_shakemap.aspx?eventid=X&episodeid=Y&eventtype=EQ
      Media   : https://www.gdacs.org/media.aspx?eventid=X&episodeid=Y&eventtype=EQ
    """
    # Parse eventid, episodeid, eventtype from query string
    eid  = re.search(r'eventid=(\d+)',   base_url).group(1)
    epid = re.search(r'episodeid=(\d+)', base_url).group(1)
    etype = re.search(r'eventtype=(\w+)', base_url).group(1)
    qs = f'eventid={eid}&episodeid={epid}&eventtype={etype}'
    return {
        'summary': base_url,
        'impact':  f'https://www.gdacs.org/Earthquakes/report_shakemap.aspx?{qs}',
        'media':   f'https://www.gdacs.org/media.aspx?{qs}',
    }


print('Helper functions defined.')


Helper functions defined.
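
The regex parsing in `build_urls` raises an `AttributeError` when a query parameter is missing, because `re.search` returns `None`. A minimal alternative sketch, assuming the same three query parameters and the URL patterns listed in the docstring, parses the query string with `urllib.parse` and raises a descriptive `ValueError` instead. The name `build_urls_strict` is hypothetical and not part of the notebook.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

GDACS_BASE = "https://www.gdacs.org"  # host taken from the event URLs in Cell 2
REQUIRED = ("eventid", "episodeid", "eventtype")


def build_urls_strict(base_url: str) -> dict:
    """Derive Summary/Impact/Media URLs, failing loudly on a malformed URL."""
    params = parse_qs(urlsplit(base_url).query)
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError(f"URL is missing query parameter(s): {', '.join(missing)}")
    # Rebuild the query string in a fixed order from the first value of each key
    qs = urlencode({key: params[key][0] for key in REQUIRED})
    return {
        "summary": base_url,
        "impact": f"{GDACS_BASE}/Earthquakes/report_shakemap.aspx?{qs}",
        "media": f"{GDACS_BASE}/media.aspx?{qs}",
    }


urls = build_urls_strict(
    "https://www.gdacs.org/report.aspx?eventid=1230629&episodeid=1327163&eventtype=EQ"
)
print(urls["impact"])
print(urls["media"])
```

Failing early with a named parameter makes a mistyped event URL easier to spot than an `AttributeError` raised from inside a regex call.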


## Cell 5 — Summary Tab Scraper

In [10]:
def scrape_summary(driver, url: str) -> dict:
    """
    Navigate to the Summary page and scrape all headline fields.

    URL: report.aspx?eventid=X&episodeid=Y&eventtype=EQ
    """
    data = {}
    driver.get(url)
    time.sleep(5)

    # ─ Event Title (confirmed: div.alert_title) ──────────────────────────────
    raw_title = safe_text(driver, "//div[@class='alert_title']")
    if raw_title == 'N/A':
        raw_title = safe_text(driver, "//div[contains(@class,'alert_title')]")
    data['event_title'] = re.sub(r'\s+', ' ', raw_title).strip()

    # ─ Summary table (td label / td value pairs) ───────────────────────
    data['gdacs_id']  = label_value(driver, 'GDACS ID')
    data['magnitude'] = label_value(driver, 'Earthquake Magnitude:')
    data['depth_km']  = label_value(driver, 'Depth:')
    data['lat_lon']   = label_value(driver, 'Lat/Lon:')
    raw_date = label_value(driver, 'Event Date:')
    data['event_date_utc'] = raw_date.split('\n')[0].strip() if raw_date != 'N/A' else 'N/A'
    data['exposed_population_summary'] = label_value(driver, 'Exposed Population:')

    # ─ GDACS Score (confirmed: last td in tbody#tableScoreMain data row) ───
    gdacs_score = 'N/A'
    try:
        tbody = driver.find_element(By.XPATH, "//tbody[@id='tableScoreMain']")
        rows  = tbody.find_elements(By.TAG_NAME, 'tr')
        if len(rows) >= 2:
            tds = rows[1].find_elements(By.TAG_NAME, 'td')
            if tds:
                gdacs_score = tds[-1].text.strip()
    except Exception:
        pass
    if not gdacs_score or gdacs_score == 'N/A':
        gdacs_score = safe_text(driver, "//td[contains(@class,'cell_matrix_gdacs')]")
    data['gdacs_score'] = gdacs_score

    # ─ Alert Level (confirmed: page <title> = 'Overall Orange/Green/Red Earthquake...') ─
    am = re.search(r'Overall\s+(Green|Orange|Red)', driver.title, re.I)
    if am:
        data['alert_level'] = am.group(1).capitalize()
    else:
        try:
            sv = float(gdacs_score)
            data['alert_level'] = 'Green' if sv < 1.0 else ('Orange' if sv <= 2.0 else 'Red')
        except (ValueError, TypeError):
            data['alert_level'] = 'N/A'

    # ─ Country (from event_title: 'M 6.7 in Philippines on 18 Aug...') ────────
    cm = re.search(r'\bin\s+([A-Za-z ,()-]+?)\s+on\s', data.get('event_title', ''), re.I)
    data['country'] = cm.group(1).strip() if cm else 'N/A'

    return data


print('scrape_summary() defined.')


scrape_summary() defined.
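
`scrape_summary` pulls only the country out of the event title, but the same string also carries the magnitude and the event time. The following is a rough sketch, assuming every title follows the `M 6.7 in Philippines on 18 Aug 2020 00:03 UTC` shape seen in the scraped output; `parse_event_title` and `TITLE_RE` are hypothetical names, and `%b` assumes English month abbreviations (the default C locale).

```python
import re
from datetime import datetime, timezone

# Assumed title shape: 'M <magnitude> in <country> on <DD Mon YYYY HH:MM> UTC'
TITLE_RE = re.compile(
    r"M\s*(?P<mag>\d+(?:\.\d+)?)\s+in\s+(?P<country>.+?)\s+on\s+"
    r"(?P<date>\d{1,2} \w{3} \d{4} \d{2}:\d{2})\s+UTC"
)


def parse_event_title(title: str) -> dict:
    """Split a GDACS event title into magnitude, country and a UTC datetime."""
    match = TITLE_RE.search(title)
    if match is None:
        # Unknown title layout: return empty fields rather than raising
        return {"magnitude": None, "country": None, "event_time": None}
    when = datetime.strptime(match["date"], "%d %b %Y %H:%M")
    return {
        "magnitude": float(match["mag"]),
        "country": match["country"],
        "event_time": when.replace(tzinfo=timezone.utc),
    }


parsed = parse_event_title("M 6.7 in Philippines on 18 Aug 2020 00:03 UTC")
print(parsed["magnitude"], parsed["country"], parsed["event_time"].isoformat())
```

Values parsed this way could serve as a cross-check against the `magnitude` and `event_date_utc` fields read from the summary table.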


## Cell 6 — Impact Tab Scraper

In [1]:
def scrape_impact(driver, url: str) -> dict:
    """
    Navigate to the Impact (Shakemap) page and scrape impact metrics.

    URL: Earthquakes/report_shakemap.aspx?eventid=X&episodeid=Y&eventtype=EQ
    """
    data = {}
    driver.get(url)
    time.sleep(5)

    # ─ Core parameters ────────────────────────────────────────────────
    data['impact_magnitude'] = label_value(driver, 'Earthquake Magnitude:')
    data['impact_depth_km']  = label_value(driver, 'Depth:')
    raw_date = label_value(driver, 'Event Date:')
    data['impact_event_date_utc'] = raw_date.split('\n')[0].strip() if raw_date != 'N/A' else 'N/A'

    # ─ Exposed Population ──────────────────────────────────────────
    # Two formats:
    #   A: 'About 11000 people in MMI\n2570000 people within 100km'
    #   B: '860 thousand (in MMI>=VII)'
    raw_pop = label_value(driver, 'Exposed Population:')
    pop_mmi = pop_100km = 'N/A'
    if raw_pop and raw_pop != 'N/A':
        m1 = re.search(r'About\s+([\d,]+)\s+people\s+in\s+MMI', raw_pop, re.I)
        m2 = re.search(r'([\d,.]+\s*(?:thousand|million)?)\s*\(in\s*MMI', raw_pop, re.I)
        mk = re.search(r'([\d,]+)\s+people\s+within\s+100\s*km', raw_pop, re.I)
        pop_mmi   = m1.group(1).replace(',','').strip() if m1 else (m2.group(1).strip() if m2 else 'N/A')
        pop_100km = mk.group(1).replace(',','').strip() if mk else 'N/A'
    data['exposed_population_mmi']   = pop_mmi
    data['exposed_population_100km'] = pop_100km

    # ─ INFORM Coping Capacity ───────────────────────────────────────
    inform_score = 'N/A'
    for lbl in ('INFORM Coping capacity of the alert score:', 'INFORM Coping capacity', 'Coping capacity', 'INFORM'):
        val = label_value(driver, lbl)
        if val and val != 'N/A':
            inform_score = val
            break
    data['inform_coping_capacity'] = inform_score

    # ─ Vulnerability Score ─────────────────────────────────────────
    vuln_score = 'N/A'
    for lbl in ('Country Vulnerability', 'Vulnerability', 'Socio-economic vulnerability', 'Socio-economic'):
        val = label_value(driver, lbl)
        if val and val != 'N/A':
            vuln_score = val
            break
    data['vulnerability_score'] = vuln_score

    # ─ Tsunami Score ──────────────────────────────────────────────
    # Score table (tbody#tableScoreMain): header tds use 'cell_type_matrix',
    # data tds use cell_orange/green/red_matrix. Find Tsunami column by header text.
    tsunami_score = 'N/A'
    try:
        tbody = driver.find_element(By.XPATH, "//tbody[@id='tableScoreMain']")
        rows  = tbody.find_elements(By.TAG_NAME, 'tr')
        if len(rows) >= 2:
            h_tds = rows[0].find_elements(By.TAG_NAME, 'td')
            d_tds = rows[1].find_elements(By.TAG_NAME, 'td')
            tsunami_col = next((i for i, td in enumerate(h_tds) if 'Tsunami' in td.text), None)
            if tsunami_col is not None and len(d_tds) > tsunami_col:
                tsunami_score = d_tds[tsunami_col].text.strip()
    except Exception:
        pass
    data['tsunami_score'] = tsunami_score

    # ─ Secondary Risks ───────────────────────────────────────────
    secondary_els = driver.find_elements(
        By.XPATH,
        "//td[ancestor::tbody and ("
        "contains(normalize-space(.),'Secondary') or "
        "contains(normalize-space(.),'Landslide') or "
        "contains(normalize-space(.),'secondary hazard')"
        ")]/following-sibling::td[1]"
    )
    risks = [el.text.strip() for el in secondary_els if el.text.strip()]
    data['secondary_risks'] = '; '.join(risks) if risks else 'N/A'

    return data


print('scrape_impact() defined.')


scrape_impact() defined.
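
`scrape_impact` returns the exposed population in two shapes: a bare number for format A (for example `11000`) and a number with a unit word for format B (for example `860 thousand`). A minimal sketch of one way to bring both to a plain integer before analysis is shown below; `population_to_int` is a hypothetical helper, and it assumes that `thousand` and `million` are the only unit words, as in the regex used above.

```python
import re

_MULTIPLIERS = {"thousand": 1_000, "million": 1_000_000}


def population_to_int(text: str):
    """Convert '11000', '2,570,000' or '860 thousand' to an int; None if unparseable."""
    match = re.fullmatch(r"\s*([\d,]*\.?\d+)\s*(thousand|million)?\s*", text, re.I)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    # No unit word means the number is already a head count
    scale = _MULTIPLIERS.get((match.group(2) or "").lower(), 1)
    return round(number * scale)


for raw in ("11000", "2,570,000", "860 thousand", "N/A"):
    print(repr(raw), "->", population_to_int(raw))
```

Returning `None` for `N/A` keeps missing values distinguishable from a true zero when the column is later converted to a numeric type.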


## Cell 7 — Media Tab Scraper

In [2]:
def scrape_media(driver, url: str) -> dict:
    """
    Navigate to the Media page (media.aspx) and scrape article counts and news data.
    """
    data = {}
    driver.get(url)
    time.sleep(5)

    # ─ Media Coverage counts ─────────────────────────────────────────
    data['total_articles']     = label_value(driver, 'Articles:')
    data['casualty_articles']  = label_value(driver, 'Articles about casualties:')
    data['articles_last_hour'] = label_value(driver, 'Articles in last hour:')

    # ─ Peak News Day ──────────────────────────────────────────────
    # div#newsForDay contains a nested table structure.
    # Most reliable: BeautifulSoup parse with bar div title attributes.
    # Bar div title format: '2020-08-21T00:00:00: 3'  (date: count)
    date_cells  = []
    count_cells = []
    soup = BeautifulSoup(driver.page_source, 'lxml')
    nfd  = soup.find('div', id='newsForDay')
    if nfd:
        for bar_div in nfd.find_all('div', title=True):
            t   = bar_div.get('title', '')
            cnt_m = re.search(r'(\d{4}-\d{2}-\d{2})T[\d:]+:\s*(\d+)$', t)
            if cnt_m:
                # Convert YYYY-MM-DD to DD/MM for consistency with GDACS display
                raw_date = cnt_m.group(1)         # e.g. '2020-08-21'
                parts    = raw_date.split('-')    # ['2020','08','21']
                disp_date = f'{parts[2]}/{parts[1]}'  # '21/08'
                date_cells.append(disp_date)
                count_cells.append(cnt_m.group(2))

    if date_cells and count_cells:
        try:
            ints = [int(c) for c in count_cells]
            peak = ints.index(max(ints))
            data['peak_news_day']   = date_cells[peak]
            data['peak_news_count'] = str(max(ints))
            data['news_per_day']    = '; '.join(f'{d}:{c}' for d, c in zip(date_cells, count_cells))
        except (ValueError, IndexError):
            data['peak_news_day'] = data['peak_news_count'] = data['news_per_day'] = 'N/A'
    else:
        data['peak_news_day'] = data['peak_news_count'] = data['news_per_day'] = 'N/A'

    # ─ Social Media Note ───────────────────────────────────────────
    social_note = safe_text(
        driver, "//h2[contains(text(),'Social media')]/following-sibling::p[1]"
    )
    data['social_media_note'] = social_note

    # ─ News Headlines ──────────────────────────────────────────────
    headlines = []
    for xp in [
        "//div[contains(@class,'media') or @id='medialist']//a[string-length(normalize-space(text()))>15]",
        "//table[contains(@class,'article') or contains(@class,'news')]//a",
    ]:
        items = safe_texts(driver, xp)
        if items:
            headlines = [h.strip() for h in items if len(h.strip()) > 15]
            break

    if not headlines:
        soup2 = BeautifulSoup(driver.page_source, 'lxml')
        seen  = set()
        for a in soup2.find_all('a', href=True):
            href = a['href']
            text = a.get_text(' ', strip=True)
            if (len(text) > 20 and text not in seen
                    and (href.startswith('http') or href.startswith('/'))
                    and not any(x in href for x in ['javascript', 'mailto', '#'])):
                seen.add(text)
                headlines.append(text)
                if len(headlines) >= 20:  # only the first 20 headlines are kept below
                    break

    data['news_headlines'] = ' | '.join(headlines[:20]) if headlines else 'N/A'

    return data


print('scrape_media() defined.')


scrape_media() defined.
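
`scrape_media` stores the peak day as `DD/MM`, which drops the year and becomes ambiguous if coverage spans a year boundary. The sketch below shows the same peak-finding logic on the bar `title` strings alone, keeping full `date` objects. `peak_day` is a hypothetical helper, and the sample titles are illustrative inputs in the documented `'2020-08-21T00:00:00: 3'` format, not scraped values.

```python
import re
from datetime import date

# Same pattern as in scrape_media, with year, month and day captured separately
BAR_TITLE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})T[\d:]+:\s*(\d+)$")


def peak_day(bar_titles):
    """Return (date, count) for the busiest day, or None if nothing parses."""
    counts = {}
    for title in bar_titles:
        match = BAR_TITLE_RE.search(title)
        if match:
            year, month, day, n = map(int, match.groups())
            counts[date(year, month, day)] = n
    if not counts:
        return None
    # On ties, max() keeps the earliest-inserted day, like list.index(max(...))
    best = max(counts, key=counts.get)
    return best, counts[best]


sample_titles = [
    "2020-08-20T00:00:00: 1",
    "2020-08-21T00:00:00: 3",
    "2020-08-22T00:00:00: 1",
]
print(peak_day(sample_titles))
```

A `date` can still be rendered as `DD/MM` for display with `strftime('%d/%m')`, so nothing is lost by keeping the year internally.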


## Cell 8 — Main Scraping Loop

Defines `scrape_event()`, which scrapes Summary → Impact → Media for a single event and returns the merged fields as one dict. The loop over all four events runs in Cell 9.

In [5]:
def scrape_event(event: dict, headless: bool = True) -> dict:
    """
    Open a single GDACS event and scrape all three pages.

    Each 'tab' on GDACS is actually a separate page:
      Summary : report.aspx
      Impact  : Earthquakes/report_shakemap.aspx
      Media   : media.aspx
    """
    urls   = build_urls(event['url'])
    driver = create_driver(headless=headless)
    result = {
        'label':      event['label'],
        'country':    event['country'],
        'period':     event['period'],
        'url':        event['url'],
        'scraped_at': datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S UTC'),
    }

    try:
        print(f"\n{'='*60}")
        print(f"Scraping: {event['label']}")

        # ── Summary page (report.aspx) ───────────────────────────────
        print('  → Summary page ...')
        summary_data = scrape_summary(driver, urls['summary'])
        result.update(summary_data)
        print(f"     Title     : {summary_data.get('event_title')}")
        print(f"     Alert     : {summary_data.get('alert_level')} (score={summary_data.get('gdacs_score')})")
        print(f"     Country   : {summary_data.get('country')}")

        # ── Impact page (report_shakemap.aspx) ───────────────────────
        print('  → Impact page ...')
        impact_data = scrape_impact(driver, urls['impact'])
        result.update(impact_data)
        print(f"     Pop (MMI)   : {impact_data.get('exposed_population_mmi')}")
        print(f"     Pop (100km) : {impact_data.get('exposed_population_100km')}")
        print(f"     Tsunami     : {impact_data.get('tsunami_score')}")
        print(f"     INFORM      : {impact_data.get('inform_coping_capacity')}")

        # ── Media page (media.aspx) ────────────────────────────────
        print('  → Media page  ...')
        media_data = scrape_media(driver, urls['media'])
        result.update(media_data)
        print(f"     Total articles : {media_data.get('total_articles')}")
        print(f"     Casualty art.  : {media_data.get('casualty_articles')}")
        print(f"     Peak news day  : {media_data.get('peak_news_day')} ({media_data.get('peak_news_count')} articles)")

        result['scrape_status'] = 'success'

    except Exception as exc:
        result['scrape_status'] = f'error: {exc}'
        print(f'  [ERROR] {exc}')

    finally:
        driver.quit()

    return result


print('scrape_event() defined. Ready to run.')


scrape_event() defined. Ready to run.


## Cell 9 — Run the Scraper

In [11]:
# Set headless=False if you want to watch the browser in action
HEADLESS = True

all_results = []

for event in EVENTS:
    record = scrape_event(event, headless=HEADLESS)
    all_results.append(record)

print(f"\n{'='*60}")
print(f"✓ Scraping complete. {len(all_results)} events processed.")


Scraping: Philippines – Historical
  → Summary page ...
     Title     : M 6.7 in Philippines on 18 Aug 2020 00:03 UTC
     Alert     : Orange (score=1.4)
     Country   : Philippines
  → Impact page ...
     Pop (MMI)   : 11000
     Pop (100km) : 2570000
     Tsunami     : 0.4
     INFORM      : 4.2 (Philippines)
  → Media page  ...
     Total articles : 5
     Casualty art.  : 1 (20%)
     Peak news day  : 21/08 (3 articles)

Scraping: Philippines – Recent
  → Summary page ...
     Title     : M 6.9 in Philippines on 30 Sep 2025 13:59 UTC
     Alert     : Orange (score=1.7)
     Country   : Philippines
  → Impact page ...
     Pop (MMI)   : 860 thousand
     Pop (100km) : N/A
     Tsunami     : 0.6
     INFORM      : 4.2 (Philippines)
  → Media page  ...
     Total articles : 1684
     Casualty art.  : 571 (33.9%)
     Peak news day  : 01/10 (794 articles)

Scraping: Afghanistan – Historical
  → Summary page ...
     Title     : M 5.9 in Afghanistan on 21 Jun 2022 20:54 UTC
     Ale

## Cell 10 — Build & Display the DataFrame

In [None]:
# Define a consistent column order for the final dataset
COLUMNS = [
    # Metadata
    "label", "country", "period", "url", "scraped_at", "scrape_status",
    # Summary Tab
    "event_title", "gdacs_id", "magnitude", "depth_km", "lat_lon",
    "event_date_utc", "gdacs_score", "alert_level",
    "exposed_population_summary",
    # Impact Tab
    "impact_magnitude", "impact_depth_km", "impact_event_date_utc",
    "exposed_population_mmi", "exposed_population_100km",
    "inform_coping_capacity", "vulnerability_score",
    "secondary_risks", "tsunami_score",
    # Media Tab
    "total_articles", "casualty_articles", "articles_last_hour",
    "peak_news_day", "peak_news_count", "news_per_day",
    "social_media_note", "news_headlines",
]

df = pd.DataFrame(all_results)

# Add any missing columns as N/A
for col in COLUMNS:
    if col not in df.columns:
        df[col] = "N/A"

# Re-order to defined column order (keep any extra cols at the end)
extra_cols = [c for c in df.columns if c not in COLUMNS]
df = df[COLUMNS + extra_cols]

print(f"DataFrame shape: {df.shape}")
print("\nColumn list:")
for c in df.columns:
    print(f"  {c}")

df.head()
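
Every scraped field is stored as a string, so values such as `6.7M`, `10 Km`, `4.2 (Philippines)` and `571 (33.9%)` need cleaning before any numeric comparison. Below is a small sketch of two helpers that could be applied column by column; the names `leading_number` and `count_and_percent` are hypothetical, and the formats are assumed from the output shown in this notebook.

```python
import re


def leading_number(text):
    """Return the first numeric token in a string as a float, or None."""
    match = re.search(r"-?\d+(?:\.\d+)?", str(text))
    return float(match.group()) if match else None


def count_and_percent(text):
    """Split '571 (33.9%)' into (571, 33.9); percent is None when absent."""
    match = re.fullmatch(r"\s*(\d+)\s*(?:\(\s*([\d.]+)\s*%\s*\))?\s*", str(text))
    if match is None:
        return None, None
    percent = float(match.group(2)) if match.group(2) else None
    return int(match.group(1)), percent


print(leading_number("6.7M"), leading_number("10 Km"), leading_number("4.2 (Philippines)"))
print(count_and_percent("571 (33.9%)"), count_and_percent("5"), count_and_percent("N/A"))
```

With pandas, these could be applied as, for example, `df["magnitude"].map(leading_number)`, leaving the original string columns untouched for reference.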


## Cell 11 — Preview Key Fields

In [13]:
KEY_COLS = [
    "label", "magnitude", "alert_level", "gdacs_score", "country",
    "event_date_utc", "exposed_population_100km",
    "inform_coping_capacity", "total_articles", "casualty_articles",
    "peak_news_day", "peak_news_count",
]

display_cols = [c for c in KEY_COLS if c in df.columns]
display(df[display_cols].T)  # Transpose for easy reading

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| label | Philippines – Historical | Philippines – Recent | Afghanistan – Historical | Afghanistan – Recent |
| magnitude | 6.7M | 6.9M | 5.9M | 6.3M |
| alert_level | Orange | Orange | Red | Red |
| gdacs_score | 1.4 | 1.7 | 4.1 | 3 |
| country | Philippines | Philippines | Afghanistan | Afghanistan |
| event_date_utc | 18 Aug 2020 00:03 UTC | 30 Sep 2025 13:59 UTC | 21 Jun 2022 20:54 UTC | 02 Nov 2025 20:29 UTC |
| exposed_population_100km | 2570000 | | | |
| inform_coping_capacity | 4.2 (Philippines) | 4.2 (Philippines) | 7.5 (Afghanistan) | 7.5 (Afghanistan) |
| total_articles | 5 | 1684 | 1897 | 779 |
| casualty_articles | 1 (20%) | 571 (33.9%) | 1090 (57.5%) | 438 (56.2%) |


## Cell 12 — Export to CSV

In [14]:
df.to_csv(OUTPUT_CSV, index=False, encoding="utf-8-sig")
print(f"✓ CSV saved to: {OUTPUT_CSV}")
print(f"  Rows: {len(df)},  Columns: {len(df.columns)}")

# Quick validation
df_check = pd.read_csv(OUTPUT_CSV)
print(f"  Validation read-back OK — {df_check.shape[0]} rows, {df_check.shape[1]} cols")
print(df_check[["label", "magnitude", "alert_level", "total_articles"]].to_string(index=False))

✓ CSV saved to: /Users/krishhiv/Desktop/Spring 2026/DSM/Assignment 1/scraping/gdacs_earthquake_data.csv
  Rows: 4,  Columns: 32
  Validation read-back OK — 4 rows, 32 cols
                   label magnitude alert_level  total_articles
Philippines – Historical      6.7M      Orange               5
    Philippines – Recent      6.9M      Orange            1684
Afghanistan – Historical      5.9M         Red            1897
    Afghanistan – Recent      6.3M         Red             779

## Cell 13 — Quick Summary Print

Human-readable per-event summary for a sanity check.

In [15]:
for _, row in df.iterrows():
    print(f"""
{'─'*55}
{row['label']}
{'─'*55}
  Event Title   : {row.get('event_title', 'N/A')}
  GDACS ID      : {row.get('gdacs_id', 'N/A')}
  Magnitude     : {row.get('magnitude', 'N/A')}
  Depth         : {row.get('depth_km', 'N/A')}
  Event Date    : {row.get('event_date_utc', 'N/A')}
  Alert Level   : {row.get('alert_level', 'N/A')} (score = {row.get('gdacs_score', 'N/A')})
  Country       : {row.get('country', 'N/A')}

  --- Impact ---
  Pop in MMI    : {row.get('exposed_population_mmi', 'N/A')}
  Pop w/in 100km: {row.get('exposed_population_100km', 'N/A')}
  INFORM Coping : {row.get('inform_coping_capacity', 'N/A')}
  Vulnerability : {row.get('vulnerability_score', 'N/A')}
  Tsunami Score : {row.get('tsunami_score', 'N/A')}
  Secondary Risk: {row.get('secondary_risks', 'N/A')}

  --- Media ---
  Total Articles: {row.get('total_articles', 'N/A')}
  Casualty Art. : {row.get('casualty_articles', 'N/A')}
  Art. Last Hour: {row.get('articles_last_hour', 'N/A')}
  Peak News Day : {row.get('peak_news_day', 'N/A')} ({row.get('peak_news_count', 'N/A')} articles)

  Scrape Status : {row.get('scrape_status', 'N/A')}
""")


───────────────────────────────────────────────────────
Philippines – Historical
───────────────────────────────────────────────────────
  Event Title   : M 6.7 in Philippines on 18 Aug 2020 00:03 UTC
  GDACS ID      : EQ 1230629
  Magnitude     : 6.7M
  Depth         : 10 Km
  Event Date    : 18 Aug 2020 00:03 UTC
  Alert Level   : Orange (score = 1.4)
  Country       : Philippines

  --- Impact ---
  Pop in MMI    : 11000
  Pop w/in 100km: 2570000
  INFORM Coping : 4.2 (Philippines)
  Vulnerability : N/A
  Tsunami Score : 0.4
  Secondary Risk: N/A

  --- Media ---
  Total Articles: 5
  Casualty Art. : 1 (20%)
  Art. Last Hour: 0
  Peak News Day : 21/08 (3 articles)

  Scrape Status : success


───────────────────────────────────────────────────────
Philippines – Recent
───────────────────────────────────────────────────────
  Event Title   : M 6.9 in Philippines on 30 Sep 2025 13:59 UTC
  GDACS ID      : EQ 1502713
  Magnitude     : 6.9M
  Depth         : 10 Km
  Event Date    : 30 