# Data Analytics Project

## Nobel Prize Winners Analysis (1901–2025)


---

### Project context

This notebook is part of the **final project for the Data Analytics module**.  
It focuses on the **preparation, cleaning, and transformation** of the Nobel Prize dataset covering the years **1901 to 2025**.

### What comes next

Subsequent notebooks will continue with:

- **Exploratory Data Analysis (EDA)** to identify trends, patterns, and anomalies  
- Creation of an **interactive dashboard** (Plotly Dash) for visualization and interpretation of results  

### Data source

- **Nobel Prize API (v2.1)**  
  Endpoints used: `/nobelPrizes`, `/laureates`  
  Raw data stored as JSON snapshots in `data/raw/`


## 🌍 Dataset Overview

**Source:** Nobel Prize API (NobelPrize.org, API v2.1)  
**Endpoints used:** `/nobelPrizes`, `/laureates`  
**Scope:** Nobel Prize awards and laureate profiles (persons + organizations)  
**Period:** 1901–2025  
**Format:** JSON (raw snapshots), transformed to CSV (processed tables)

This project uses the official Nobel Prize dataset provided via the Nobel Prize API.  
It combines two key data domains:

- **Prize-level information** (year, category, prize amounts, award status, award dates)  
- **Laureate-level information** (person/organization metadata, names, gender, birth/death details, external identifiers)

Endpoint URLs:

- https://api.nobelprize.org/2.1/nobelPrizes
- https://api.nobelprize.org/2.1/laureates

The raw API responses are stored as timestamped JSON snapshots to ensure reproducibility.  
After extraction, the data is cleaned and transformed into relational tables suitable for analysis and dashboarding.
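
As a rough sketch of what "relational tables" could mean here, the example below splits nested prize records into a prize table and a prize-laureate link table. The sample records are hand-written for illustration, and the field names (`awardYear`, `category.en`, `laureates`, `id`, `portion`) are assumptions about the API schema that should be checked against a real snapshot.

```python
# Hand-written sample records (illustrative only; field names are assumed).
sample_prizes = [
    {
        "awardYear": "1903",
        "category": {"en": "Physics"},
        "laureates": [
            {"id": "4", "portion": "1/2"},
            {"id": "5", "portion": "1/4"},
            {"id": "6", "portion": "1/4"},
        ],
    },
    {
        # A record without a "laureates" key (assumed shape for a year
        # in which the prize was not awarded).
        "awardYear": "1916",
        "category": {"en": "Physics"},
    },
]

# Prize table: one row per (award year, category).
prize_rows = [
    {
        "award_year": int(p["awardYear"]),
        "category": p["category"]["en"],
        "n_laureates": len(p.get("laureates", [])),
    }
    for p in sample_prizes
]

# Link table: one row per (prize, laureate) pair.
link_rows = [
    {
        "award_year": int(p["awardYear"]),
        "category": p["category"]["en"],
        "laureate_id": laureate["id"],
        "portion": laureate["portion"],
    }
    for p in sample_prizes
    for laureate in p.get("laureates", [])
]

for row in prize_rows + link_rows:
    print(row)
```

Each list of dictionaries can be passed to `pd.DataFrame(...)` and written to `data/processed/` with `DataFrame.to_csv(...)`.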

### Key variables included

- **Award year** (e.g., `awardYear`)  
- **Category** (e.g., Physics, Chemistry, Physiology or Medicine, Literature, Peace, Economic Sciences)  
- **Laureate identifiers** (unique IDs linking prizes and laureates)  
- **Laureate type** (person vs. organization)  
- **Gender** (for persons, when available)  
- **Birth/Death information** (date, city, country, where available)  
- **Motivation text** (citation/reason for award)  
- **Prize share / portion** (how a prize is split among laureates)  
- **External references** (Wikipedia, Wikidata IDs, where available)

This dataset enables **historical, categorical, demographic, and textual analysis** of Nobel Prize awards over more than a century, supporting both descriptive analytics and interactive storytelling through dashboards.
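
To make the prize share variable usable in calculations, it may help to convert it to an exact fraction. The sketch below assumes the share is stored as a string such as `"1"`, `"1/2"`, or `"1/4"`; the helper name `parse_portion` is hypothetical, and the sample values are illustrative.

```python
from fractions import Fraction


def parse_portion(portion: str) -> Fraction:
    """Convert a share string such as "1/2" or "1" into an exact Fraction."""
    return Fraction(portion.strip())


# Illustrative shares for one prize split among three laureates.
shares = [parse_portion(p) for p in ["1/2", "1/4", "1/4"]]
total = sum(shares)

print(shares, total)
```

Checking that the shares of each awarded prize sum to 1 is a cheap consistency test before analyzing prize sharing.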

---

## 🎯 Project Objectives

The primary aim of this project is to analyze and visualize Nobel Prize awards from 1901 to 2025 in order to:

1. **Identify long-term trends** in Nobel Prize distribution across categories and years.  
2. **Analyze laureate demographics** (e.g., gender, country of birth) and how they change over time.  
3. **Examine patterns of prize sharing** (single vs. multiple laureates, share distributions).  
4. **Explore award motivations** using text analysis to detect common themes and category-specific language.  
5. **Deliver an interactive dashboard** (Plotly Dash) to support filtering and interpretation of insights.

Through systematic **data cleaning**, **exploratory data analysis (EDA)**, and **interactive visualizations**, this project provides a structured and data-driven overview of Nobel Prize history and its evolution up to 2025.


# 1. Setup & Data Loading <a id="setup-loading"></a>

## 1.1 Imports & Project Paths <a id="imports-paths"></a>

**Goal:** Initialize the notebook environment, import core libraries, and define consistent project paths so that data files are stored **outside** the `notebooks/` folder.

**Project structure (target):**
- `Data_Analytics_Project/notebooks/` → Jupyter notebooks (`.ipynb`)
- `Data_Analytics_Project/data/raw/` → raw JSON snapshots from the API
- `Data_Analytics_Project/data/processed/` → cleaned tables (CSV)

**Actions in the next code cell:**
- Import required libraries:
  - File handling: `Path` (pathlib), `json`, `time`, `datetime`
  - Data analysis: `pandas`, `numpy`
  - API calls: `requests`
- Detect `PROJECT_ROOT` automatically (one level above `notebooks/`)
- Create data folders if missing:
  - `data/raw/`
  - `data/processed/`

**Expected output:**
- No import errors  
- Printed paths showing:
  - current working directory (`CWD`)
  - project root (`PROJECT_ROOT`)
  - confirmed folders: `RAW_DIR`, `PROCESSED_DIR`


In [None]:
# Step 1.1: Imports + project paths (notebook is inside /notebooks)

from pathlib import Path
import json
import time
from datetime import datetime, timezone

import numpy as np
import pandas as pd
import requests

# --- Resolve project root ---
# notebook is in: Data_Analytics_Project/notebooks/
# then project root is one folder up from current working directory
CWD = Path.cwd().resolve()
PROJECT_ROOT = CWD.parent

# Expected structure check
print("CWD:", CWD)
print("PROJECT_ROOT:", PROJECT_ROOT)
print("Has /notebooks:", (PROJECT_ROOT / "notebooks").is_dir())
print("Has /data     :", (PROJECT_ROOT / "data").is_dir())

# --- Define data folders (outside notebooks) ---
DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROCESSED_DIR = DATA_DIR / "processed"

RAW_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print("\nFolders ready:")
print("RAW_DIR      :", RAW_DIR)
print("PROCESSED_DIR:", PROCESSED_DIR)
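
The cell above takes the parent of the current working directory as the project root, which holds only when the kernel is started inside `notebooks/`. One possible alternative, sketched below, walks up from the working directory until it finds a folder containing `notebooks/`. The helper name `find_project_root` and the choice of `notebooks` as the marker are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_project_root(start: Optional[Path] = None, marker: str = "notebooks") -> Path:
    """Return the nearest directory (start or an ancestor) that contains `marker`."""
    start = (start or Path.cwd()).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains a '{marker}/' folder")


# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "Data_Analytics_Project"
    (project / "notebooks").mkdir(parents=True)

    from_notebooks = find_project_root(project / "notebooks")
    from_root = find_project_root(project)

print(from_notebooks == project, from_root == project)
```

With this approach the same root is found whether the notebook runs from the project root or from `notebooks/`, so the data folders are not created in the wrong place.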


## 1.2 Fetch Data from Nobel Prize API (Save Raw JSON) <a id="fetch-api"></a>

**Goal:** Download Nobel Prize data from the official Nobel Prize API and store it as raw JSON snapshots for reproducibility.

**API source:**
- Base URL: `https://api.nobelprize.org/2.1`
- Endpoints:
  - `/nobelPrizes` (prize-level information)
  - `/laureates` (laureate-level information)

**Outputs (saved files):**
- `data/raw/nobelPrizes_<timestamp>.json`
- `data/raw/laureates_<timestamp>.json`

**Actions in the next code cell:**
- Fetch all pages from each endpoint using `limit/offset` pagination
- Save results as timestamped JSON files (UTC time)
- Print record counts and file paths

**Expected output:**
- Two saved JSON files in `data/raw/`
- Printed confirmation:
  - file paths
  - number of prize records and laureate records


In [None]:
# Step 1.2: Fetch from Nobel API and save raw JSON snapshots

BASE_URL = "https://api.nobelprize.org/2.1"

def fetch_all(endpoint: str, root_key: str, params=None, limit=1000, polite_sleep=0.2):
    """
    Fetch all items from a Nobel API endpoint using limit/offset pagination.
    """
    if params is None:
        params = {}

    all_items = []
    offset = 0

    while True:
        page_params = dict(params)
        page_params["limit"] = limit
        page_params["offset"] = offset

        url = f"{BASE_URL}{endpoint}"
        r = requests.get(url, params=page_params, timeout=60)
        r.raise_for_status()
        payload = r.json()

        items = payload.get(root_key, [])
        if not items:
            break

        all_items.extend(items)

        # last page
        if len(items) < limit:
            break

        offset += limit
        time.sleep(polite_sleep)

    return all_items

# 1) Fetch prizes (year/category level)
nobel_prizes = fetch_all("/nobelPrizes", "nobelPrizes", limit=1000)

# 2) Fetch laureates (person/org level)
laureates = fetch_all("/laureates", "laureates", limit=1000)

# Save snapshots with UTC timestamp
ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
prizes_path = RAW_DIR / f"nobelPrizes_{ts}.json"
laureates_path = RAW_DIR / f"laureates_{ts}.json"

with open(prizes_path, "w", encoding="utf-8") as f:
    json.dump(nobel_prizes, f, ensure_ascii=False, indent=2)

with open(laureates_path, "w", encoding="utf-8") as f:
    json.dump(laureates, f, ensure_ascii=False, indent=2)

print("✅ Saved raw snapshots:")
print(" -", prizes_path, "records:", len(nobel_prizes))
print(" -", laureates_path, "records:", len(laureates))
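
The loop in `fetch_all` stops when a page is shorter than `limit`. If a server were to cap the page size below the requested `limit`, that check would end the loop after the first page. A more defensive variant, sketched below, advances by the number of items actually received and stops once a reported total is reached. The `meta.count` field and the `fetch_page` callable are assumptions for illustration; inspect a real response before relying on them.

```python
from typing import Callable, Dict, List, Optional


def paginate(fetch_page: Callable[[int, int], Dict], root_key: str, limit: int = 100) -> List[dict]:
    """Collect all items, advancing by the number of items actually received."""
    items: List[dict] = []
    offset = 0
    while True:
        payload = fetch_page(offset, limit)
        page = payload.get(root_key, [])
        if not page:
            break
        items.extend(page)
        offset += len(page)  # still correct if the server returns fewer than `limit`
        total: Optional[int] = payload.get("meta", {}).get("count")  # assumed field
        if total is not None and offset >= total:
            break
    return items


# Fake server for demonstration: 7 records, never more than 3 per page.
DATA = [{"id": i} for i in range(7)]


def fake_fetch(offset: int, limit: int) -> Dict:
    page_size = min(limit, 3)  # server-side cap smaller than the requested limit
    return {"items": DATA[offset:offset + page_size], "meta": {"count": len(DATA)}}


result = paginate(fake_fetch, "items", limit=1000)
print(len(result))
```

In the notebook, `fetch_page` would wrap the `requests.get(...)` call with the `limit` and `offset` parameters. If no total is reported, the loop simply ends on the first empty page.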


In [None]:
# Step 1.3: Verify the files exist in the correct folder

raw_files = sorted(RAW_DIR.glob("*.json"))
print("RAW_DIR:", RAW_DIR)
print("JSON files found:", len(raw_files))
raw_files[-5:]


## 1.4 Load Raw Nobel API Data (JSON Snapshots) <a id="load-raw-json"></a>

**Goal:** Load the most recent raw Nobel Prize API snapshots from the local project folder (`data/raw/`) and perform a quick structural check before any cleaning or transformation.

> **Note:** This step assumes the raw JSON snapshots already exist (created in Step 1.2: *Fetch Data from Nobel Prize API*).

**Inputs (raw files):**
- `data/raw/nobelPrizes_*.json`
- `data/raw/laureates_*.json`

**Outputs (in-memory objects):**
- `nobel_prizes` → list of prize records (year × category, includes laureate references)
- `laureates` → list of laureate records (persons + organizations)

**Actions in the next code cell:**
- List available raw JSON snapshots in `data/raw/`
- Automatically select the **latest** snapshot for each dataset
- Load JSON into Python objects using `json.load()`
- Run initial inspection:
  - number of records (`len(...)`)
  - preview top-level keys (`nobel_prizes[0].keys()`, `laureates[0].keys()`)

**Validation / checks:**
- Confirm that both snapshots were found and loaded successfully
- Confirm the dataset includes award years up to **2025** (checked in the next inspection step)

**Expected result:**
- Printed file paths of the latest snapshots  
- Counts like `(n_prize_records, n_laureate_records)`  
- Data is ready for transformation into analysis tables


In [None]:
# 1.4 Load latest raw JSON snapshots + quick inspection

def load_latest_json(prefix: str, folder=RAW_DIR):
    files = sorted(folder.glob(f"{prefix}_*.json"))
    if not files:
        raise FileNotFoundError(f"No files found: {folder}/{prefix}_*.json")
    latest_path = files[-1]
    with open(latest_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data, latest_path

nobel_prizes, prizes_path = load_latest_json("nobelPrizes")
laureates, laureates_path = load_latest_json("laureates")

print("Loaded latest files:")
print(" - nobelPrizes:", prizes_path)
print(" - laureates  :", laureates_path)

print("\nRecord counts:")
print(" - nobel_prizes:", len(nobel_prizes))
print(" - laureates  :", len(laureates))

print("\nTop-level keys preview:")
print(" - nobel_prizes[0] keys:", list(nobel_prizes[0].keys())[:20])
print(" - laureates[0] keys   :", list(laureates[0].keys())[:20])
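
`load_latest_json` picks the last file in alphabetical order, which matches chronological order only as long as every matching file name carries a timestamp. A stray file such as `laureates_backup.json` would sort last and be loaded instead. The sketch below parses the timestamp from each file name and ignores names that do not fit; the helper names and the sample file names are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, Optional

TS_FORMAT = "%Y%m%dT%H%M%SZ"  # same pattern used when the snapshots are written


def snapshot_time(path: Path, prefix: str) -> Optional[datetime]:
    """Parse the UTC timestamp from '<prefix>_<timestamp>.json', or return None."""
    stem = path.stem
    if not stem.startswith(prefix + "_"):
        return None
    try:
        parsed = datetime.strptime(stem[len(prefix) + 1:], TS_FORMAT)
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)


def pick_latest(paths: Iterable[Path], prefix: str) -> Path:
    """Return the path with the newest parsable timestamp."""
    dated = [(snapshot_time(p, prefix), p) for p in paths]
    dated = [(t, p) for t, p in dated if t is not None]
    if not dated:
        raise FileNotFoundError(f"No timestamped '{prefix}_*.json' snapshots found")
    return max(dated, key=lambda pair: pair[0])[1]


candidates = [
    Path("laureates_20250101T120000Z.json"),
    Path("laureates_20251015T083000Z.json"),
    Path("laureates_backup.json"),  # ignored: no valid timestamp
]
latest = pick_latest(candidates, "laureates")
print(latest.name)
```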


In [None]:
# Quick sanity check: max award year should be 2025
prizes_df_raw = pd.json_normalize(nobel_prizes)

if "awardYear" in prizes_df_raw.columns:
    prizes_df_raw["awardYear_num"] = pd.to_numeric(prizes_df_raw["awardYear"], errors="coerce")
    print("\nAward year range:")
    print(" - min:", int(prizes_df_raw["awardYear_num"].min()))
    print(" - max:", int(prizes_df_raw["awardYear_num"].max()))
else:
    print("\nColumn 'awardYear' not found. Available columns (first 25):")
    print(prizes_df_raw.columns.tolist()[:25])
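
The check above only prints the range, so an incomplete snapshot would not stop the notebook. A stricter variant, sketched below on hand-written sample records, raises an error when the range differs from the expected one. The helper name `award_year_range` is hypothetical, and the expected bounds of 1901 and 2025 come from the project scope stated in this notebook.

```python
from typing import Iterable, Tuple


def award_year_range(records: Iterable[dict]) -> Tuple[int, int]:
    """Return (min_year, max_year) over records with a numeric `awardYear`."""
    years = []
    for rec in records:
        value = str(rec.get("awardYear", "")).strip()
        if value.isdigit():
            years.append(int(value))
    if not years:
        raise ValueError("No numeric 'awardYear' values found")
    return min(years), max(years)


# Illustrative records, including a missing and a null year.
sample = [{"awardYear": "1901"}, {"awardYear": "2025"}, {"awardYear": None}, {}]
lo, hi = award_year_range(sample)

# Fail loudly if the snapshot does not cover the expected period.
assert (lo, hi) == (1901, 2025), f"Unexpected award year range: {lo}-{hi}"
print(lo, hi)
```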