# DSBA Programming & Visualization — Assignment 02 (Enhanced)
**Author:** Adil Vural  
**Generated:** 2025-10-15 18:02

This is a cleaned, reproducible version of your notebook for *Programming & Visualization — Assignment 02*.
It adds:
- Clear sections for **Q1–Q4**
- Robust **helpers** for safe loading, validation, joins, and tidy transforms
- Idiomatic **pandas** patterns (melt, pivot, merge, groupby, vectorized conditions)
- Built-in **quality checks** (duplicate keys, invalid seasons, dtype mismatches)
- Collapsible blocks with your **original cells** (kept intact)
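
Of the quality checks listed above, the dtype-mismatch check has no dedicated helper in this notebook. A minimal sketch of what one could look like follows; the function name `key_dtype_mismatches` and the toy frames are assumptions for illustration, not part of the assignment data.

```python
import pandas as pd

def key_dtype_mismatches(left, right, keys):
    """Return {key: (left_dtype, right_dtype)} for join keys whose dtypes differ."""
    mismatches = {}
    for key in keys:
        left_dtype, right_dtype = left[key].dtype, right[key].dtype
        if left_dtype != right_dtype:
            mismatches[key] = (str(left_dtype), str(right_dtype))
    return mismatches

# Toy frames (illustrative values): 'year' is an integer on the left but text on the right.
medals = pd.DataFrame({"country_name": ["Norway", "Kenya"], "year": [2016, 2016]})
gdp = pd.DataFrame({"country_name": ["Norway", "Kenya"], "year": ["2016", "2016"]})

mismatches = key_dtype_mismatches(medals, gdp, ["country_name", "year"])
print(mismatches)

# Cast the right-hand key to the left-hand dtype before merging.
gdp["year"] = gdp["year"].astype(medals["year"].dtype)
remaining = key_dtype_mismatches(medals, gdp, ["country_name", "year"])
print(remaining)
```

Running a check like this before a merge tends to surface the typical failure early: a `year` column read as text on one side silently produces zero matches in a left join.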


## How to use & tips
1. Update the **Data paths** below to your 4 CSV files.
2. Run **Environment & Helpers** to load utilities.
3. Proceed Q1→Q4. Each block has:
   - A short **goal**
   - A **clean reference solution** you can adapt
   - Minimal prints and assertive checks
4. Use **left** joins from the main fact table (Olympics) so that no medal rows are dropped unintentionally.
5. Normalize keys before joining:
   - **Country names** (strip, title, mapping)
   - **Seasons** (only `'Summer'` / `'Winter'` are valid).
6. After each major step, check **`shape`, `head()`, and `isna().mean()`** to confirm the row count, the column layout, and the share of missing values.
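
Tips 5 and 6 can be sketched together as follows. The alias table `COUNTRY_ALIASES` and the function `standardize_country` are hypothetical names introduced here for illustration; the actual spellings to map depend on your files.

```python
import pandas as pd

# Hypothetical alias table; extend it with the spellings found in your own files.
COUNTRY_ALIASES = {"Usa": "United States", "Great Britain": "United Kingdom"}

def standardize_country(series, aliases=COUNTRY_ALIASES):
    """Strip, collapse whitespace, title-case, then apply the alias mapping."""
    cleaned = (series.str.strip()
                     .str.replace(r"\s+", " ", regex=True)
                     .str.title())
    return cleaned.replace(aliases)

# Toy input with messy spellings and one missing name.
raw = pd.DataFrame({
    "country_name": ["  usa", "great   britain", "Norway ", None],
    "game_season": ["Summer", "Winter", "Summer", "Summer"],
})
raw["country_name"] = standardize_country(raw["country_name"])

# The three quick checks from tip 6.
print(raw.shape)
print(raw.head())
print(raw.isna().mean())
```

Note that title-casing also capitalizes small words (for example "And"), so the alias table usually needs an entry for any name where that matters.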


## Environment & Helpers

In [10]:

# === Environment & Helpers ===
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")

VALID_SEASONS = {"Summer", "Winter"}

def safe_read_csv(path, **kwargs):
    """Read a CSV and report its shape; re-raise failures with the offending path."""
    try:
        df = pd.read_csv(path, **kwargs)
    except FileNotFoundError as e:
        raise FileNotFoundError(f"File not found: {path}. Update the data path.") from e
    except Exception as e:
        raise RuntimeError(f"Failed reading {path}: {e}") from e
    print(f"Loaded {path} -> {df.shape}")
    return df

def normalize_country(s):
    return (s.astype(str)
              .str.strip()
              .str.replace(r"\s+", " ", regex=True))

def validate_season(series):
    s = series.astype(str).str.strip().str.title()
    bad = ~s.isin(VALID_SEASONS)
    if bad.any():
        bad_vals = sorted(s[bad].unique().tolist())
        import warnings
        warnings.warn(f"Invalid season values detected: {bad_vals}. Only {sorted(VALID_SEASONS)} are allowed.")
    return s.where(s.isin(VALID_SEASONS), np.nan)

def assert_unique_keys(df, keys, name=""):
    if not set(keys).issubset(df.columns):
        missing = list(set(keys) - set(df.columns))
        raise KeyError(f"{name}: missing keys {missing}")
    dups = df.duplicated(subset=keys).sum()
    if dups:
        raise ValueError(f"{name}: {dups} duplicate key rows on {keys}. Deduplicate first.")
    return True

def dedupe_first(df, keys):
    """Keep the first row per key combination; ties keep their original order."""
    keys = list(keys)
    return (df.sort_values(keys, kind="mergesort")  # mergesort is a stable sort
              .drop_duplicates(subset=keys, keep="first"))

def safe_merge(left, right, on, how="left", validate=None, suffixes=("", "_r")):
    if validate is None:
        validate = "m:1"
    res = left.merge(right, on=on, how=how, validate=validate, suffixes=suffixes)
    return res

print("Helpers ready ✔")


Helpers ready ✔


## Data paths & Loading

In [3]:

# === Data paths & Loading (edit if needed) ===
OLYMPICS_CSV   = "olympics_raw.csv"
GDP_CSV        = "gdp_long.csv"
POP_CSV        = "population_long.csv"
FLAGS_CSV      = "flags.csv"

try:
    olympics_raw   = safe_read_csv(OLYMPICS_CSV)
    gdp_long       = safe_read_csv(GDP_CSV)
    population_raw = safe_read_csv(POP_CSV)
    flags          = safe_read_csv(FLAGS_CSV)
except Exception as e:
    print(e)


Loaded olympics_raw.csv -> (20170, 6)
File not found: gdp_long.csv. Update the data path.


## Q1 – Transformations
**Goal:** Load data, clean types, and perform tidy reshapes:
- **A. Melt** medal counts to long format
- **B. Pivot** back to wide by medal type / season
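
As a quick illustration of the two reshapes on a toy frame (the values and the wide layout are assumptions for the example; the real `olympics_raw` turns out to be long already):

```python
import pandas as pd

# Toy medal counts in wide format (one column per medal type).
wide = pd.DataFrame({
    "country_name": ["Norway", "Kenya"],
    "game_year": [2016, 2016],
    "GOLD": [0, 6],
    "SILVER": [0, 6],
    "BRONZE": [4, 1],
})

# A. Melt: one row per country, year, and medal type.
long = wide.melt(
    id_vars=["country_name", "game_year"],
    value_vars=["GOLD", "SILVER", "BRONZE"],
    var_name="medal_type",
    value_name="medal_count",
)

# B. Pivot back: medal types become columns again.
back = (long.pivot_table(index=["country_name", "game_year"],
                         columns="medal_type",
                         values="medal_count",
                         aggfunc="sum",
                         fill_value=0)
            .reset_index())
back.columns.name = None  # drop the leftover 'medal_type' axis label

print(long.shape)
print(back)
```

`pivot_table` is used here rather than `pivot` because it aggregates duplicate key combinations instead of raising an error, which is usually the safer default for medal-level data.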


In [11]:

# Ensure olympics_raw is loaded
if "olympics_raw" not in globals():
    olympics_raw = safe_read_csv(OLYMPICS_CSV)

# --- Q1A: Melt medals to long ---
df = olympics_raw.copy()

# Normalize keys
if "country_name" in df.columns:
    df["country_name"] = normalize_country(df["country_name"])
if "game_season" in df.columns:
    df["game_season"] = validate_season(df["game_season"])

# If medals are wide (GOLD/SILVER/BRONZE), melt them
medal_cols = [c for c in df.columns if c.upper() in {"GOLD","SILVER","BRONZE"}]
if medal_cols:
    olympics_long = df.melt(
        id_vars=[c for c in df.columns if c not in medal_cols],
        value_vars=medal_cols,
        var_name="medal_type",
        value_name="medal_count"
    )
else:
    # Already long
    olympics_long = df.copy()
    if "medal_count" not in olympics_long.columns:
        olympics_long["medal_count"] = 1  # fallback if each row is a medal

print("olympics_long:", olympics_long.shape)
display(olympics_long.head(10))


olympics_long: (20170, 7)


|   | game | medal_type | country_name | game_location | game_season | game_year | medal_count |
|---|---|---|---|---|---|---|---|
| 0 | athens-1896 | GOLD | Australia | Greece | Summer | 1896 | 1 |
| 1 | athens-1896 | GOLD | Australia | Greece | Summer | 1896 | 1 |
| 2 | athens-1896 | BRONZE | Austria | Greece | Summer | 1896 | 1 |
| 3 | athens-1896 | GOLD | Austria | Greece | Summer | 1896 | 1 |
| 4 | athens-1896 | BRONZE | Austria | Greece | Summer | 1896 | 1 |
| 5 | athens-1896 | GOLD | Austria | Greece | Summer | 1896 | 1 |
| 6 | athens-1896 | SILVER | Austria | Greece | Summer | 1896 | 1 |
| 7 | athens-1896 | BRONZE | Denmark | Greece | Summer | 1896 | 1 |
| 8 | athens-1896 | BRONZE | Denmark | Greece | Summer | 1896 | 1 |
| 9 | athens-1896 | SILVER | Denmark | Greece | Summer | 1896 | 1 |


In [12]:

# --- Q1B: Pivot by medal_type and season ---
idx = ["country_name", "game_year", "game_season"]
for k in idx:
    if k not in olympics_long.columns:
        print(f"WARNING: missing key {k} in olympics_long")

pivot_medals = (olympics_long
                .groupby(idx + ["medal_type"], dropna=False)["medal_count"]
                .sum()
                .reset_index()
                .pivot(index=idx, columns="medal_type", values="medal_count")
                .fillna(0)
                .reset_index())

cols = [c for c in ["country_name","game_year","game_season","BRONZE","SILVER","GOLD"] if c in pivot_medals.columns]
pivot_medals = pivot_medals[cols]
print("pivot_medals:", pivot_medals.shape)
display(pivot_medals.head(10))


pivot_medals: (1779, 6)


| medal_type | country_name | game_year | game_season | BRONZE | SILVER | GOLD |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2008 | Summer | 1.0 | 0.0 | 0.0 |
| 1 | Afghanistan | 2012 | Summer | 1.0 | 0.0 | 0.0 |
| 2 | Algeria | 1984 | Summer | 2.0 | 0.0 | 0.0 |
| 3 | Algeria | 1992 | Summer | 1.0 | 0.0 | 1.0 |
| 4 | Algeria | 1996 | Summer | 1.0 | 0.0 | 2.0 |
| 5 | Algeria | 2000 | Summer | 3.0 | 1.0 | 1.0 |
| 6 | Algeria | 2008 | Summer | 1.0 | 1.0 | 0.0 |
| 7 | Algeria | 2012 | Summer | 0.0 | 0.0 | 1.0 |
| 8 | Algeria | 2016 | Summer | 0.0 | 2.0 | 0.0 |
| 9 | Argentina | 1924 | Summer | 2.0 | 3.0 | 1.0 |


## Q2 – Joins
**Goal:** Combine Olympics with GDP, population, and flags.
- Ensure **keys align** (`country_name`, `year`)
- Guard against **duplicates** and **invalid seasons**
- Prefer **left join** from Olympics facts
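
The three guards above can be sketched on toy frames as follows. All values, including the unmatched country "Atlantis", are made up for illustration; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy fact table: one row per country and year.
medals = pd.DataFrame({
    "country_name": ["Norway", "Norway", "Kenya", "Atlantis"],
    "year": [2012, 2016, 2016, 2016],
    "GOLD": [2, 0, 6, 1],
})
# Toy lookup table with illustrative GDP values; "Atlantis" is deliberately absent.
gdp = pd.DataFrame({
    "country_name": ["Norway", "Norway", "Kenya"],
    "year": [2012, 2016, 2016],
    "gdp": [5.1e11, 3.7e11, 7.5e10],
})

merged = medals.merge(
    gdp,
    on=["country_name", "year"],
    how="left",          # keep every medal row
    validate="m:1",      # raise if the right side has duplicate keys
    indicator=True,      # adds a '_merge' column: 'both' or 'left_only'
)

# A left join with unique right-hand keys must not change the row count.
assert len(merged) == len(medals)

unmatched = merged.loc[merged["_merge"] == "left_only", ["country_name", "year"]]
print(unmatched)
```

Inspecting the `left_only` rows after each join shows which country names still need normalizing before the GDP or population columns can be trusted.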


In [14]:

# Ensure gdp_long is loaded; otherwise try to rebuild it from gdp_raw
if "gdp_long" not in globals():
    try:
        gdp_long = safe_read_csv(GDP_CSV)
    except FileNotFoundError:
        # Try to create gdp_long from gdp_raw if available
        if "gdp_raw" in globals():
            gdp_long = pd.melt(
                gdp_raw,
                id_vars=['country_name', 'country_code'],
                value_vars=gdp_raw.columns[4:],
                var_name='year',
                value_name='gdp'
            )
            gdp_long['year'] = gdp_long['year'].str.replace('x', '').astype(int)
            print("Generated gdp_long from gdp_raw:", gdp_long.shape)
        else:
            raise FileNotFoundError("File not found: gdp_long.csv and gdp_raw is not loaded. Update the data path.")

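gdp = gdp_long.copy()
# Fall back to an 'entity' column when 'country_name' is absent
if "country_name" not in gdp.columns and "entity" in gdp.columns:
    gdp = gdp.rename(columns={"entity": "country_name"})
gdp["country_name"] = normalize_country(gdp["country_name"])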

population = population_raw.rename(columns={"entity":"country_name","code":"country_code"}).copy()
population["country_name"] = normalize_country(population["country_name"])

if {"country_code","year"}.issubset(gdp.columns):
    gdp = dedupe_first(gdp, keys=["country_code","year"])

if {"country_name","year"}.issubset(population.columns):
    population = dedupe_first(population, keys=["country_name","year"])

# Flags (country level only)
flags2 = flags.rename(columns={"name":"country_name"}).copy()
flags2["country_name"] = normalize_country(flags2["country_name"])
flags2 = dedupe_first(flags2, keys=["country_name"])

# Olympics base
base = pivot_medals.copy()
year_col = "game_year" if "game_year" in base.columns else "year"
base = base.rename(columns={year_col:"year"})

# Join GDP on country_code when both sides have it, otherwise on country_name
if "country_code" in gdp.columns and "country_code" in base.columns:
    gdp_keys = ["country_code", "year"]
else:
    gdp_keys = ["country_name", "year"]

# validate="m:1" needs unique keys on the right, so dedupe on the keys actually used
gdp = dedupe_first(gdp, keys=gdp_keys)
olympics_gdp = base.merge(gdp, on=gdp_keys, how="left", validate="m:1", suffixes=("", "_gdp"))
print("olympics_gdp:", olympics_gdp.shape)

# Join Population
olympics_gdp_population = olympics_gdp.merge(
    population[["country_name","year","population"]],
    on=["country_name","year"], how="left", validate="m:1"
)
print("olympics_gdp_population:", olympics_gdp_population.shape)

# Join Flags
olympics_full = olympics_gdp_population.merge(
    flags2, on="country_name", how="left", validate="m:1"
)
print("olympics_full:", olympics_full.shape)
display(olympics_full.head(10))


FileNotFoundError: File not found: gdp_long.csv and gdp_raw is not loaded. Update the data path.

## Q3 – Functions and Conditions
**Goal:** Write clean, testable functions for common queries:
- Total medals per country/season
- GDP per capita
- Validated season filters (only `'Summer'`/`'Winter'`)
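
One way to keep such functions testable is to pair each with a tiny fixture and a few assertions. The sketch below uses a hypothetical `filter_season` helper and made-up medal counts; unlike a warn-and-return-empty approach, it raises on an invalid season, which is an assumption about the desired behaviour rather than a requirement of the assignment.

```python
import pandas as pd

VALID = {"Summer", "Winter"}  # the only two season values treated as valid

def filter_season(df, season):
    """Return rows for one season; raise on anything but 'Summer'/'Winter'."""
    season = str(season).strip().title()
    if season not in VALID:
        raise ValueError(f"Invalid season {season!r}; expected one of {sorted(VALID)}.")
    return df[df["game_season"] == season]

# Small fixture with illustrative values.
fixture = pd.DataFrame({
    "country_name": ["Norway", "Norway", "Kenya"],
    "game_season": ["Winter", "Summer", "Summer"],
    "GOLD": [11, 0, 6],
})

# Checks that double as documentation of the expected behaviour.
assert len(filter_season(fixture, " summer ")) == 2
assert filter_season(fixture, "Winter")["GOLD"].sum() == 11

rejected = False
try:
    filter_season(fixture, "Herfst")  # Dutch for "autumn", deliberately invalid
except ValueError as err:
    rejected = True
    print("Rejected:", err)
```

Raising makes a typo in the season name impossible to miss, whereas returning an empty result keeps a pipeline running; which one fits depends on how the function is called downstream.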


In [None]:

import warnings

def total_medals(df, country=None, season=None):
    """Sum BRONZE/SILVER/GOLD per country and season, optionally filtered."""
    out_cols = ["country_name", "game_season", "total_medals"]
    d = df
    if country is not None:
        d = d[d["country_name"].str.casefold() == str(country).strip().casefold()]
    if season is not None:
        season = str(season).strip().title()
        if season not in VALID_SEASONS:
            warnings.warn(f"Season '{season}' is invalid. Use 'Summer' or 'Winter'. Returning empty result.")
            # Empty frame with the same columns as a normal result
            return pd.DataFrame(columns=out_cols)
        d = d[d["game_season"] == season]
    medal_cols = [c for c in ["BRONZE","SILVER","GOLD"] if c in d.columns]
    # assign() returns a new frame, so the caller's data is never modified
    d = d.assign(total_medals=d[medal_cols].sum(axis=1))
    return d.groupby(["country_name","game_season"], dropna=False)["total_medals"].sum().reset_index()

def gdp_per_capita(df):
    if not {"gdp","population"}.issubset(df.columns):
        raise KeyError("Columns 'gdp' and 'population' required.")
    out = df.copy()
    out["gdp_per_capita"] = np.where(out["population"]>0, out["gdp"]/out["population"], np.nan)
    return out

medals_nl_summer = total_medals(olympics_full, country="Netherlands", season="Summer")
# "Herfst" is Dutch for "autumn": a deliberately invalid season to exercise the warning path
medals_invalid   = total_medals(olympics_full, country="Netherlands", season="Herfst")
display(medals_nl_summer.head())
display(medals_invalid)


## Q4 – Functions and Iteration
**Goal:** Apply functions across years/countries; compute summarized panels and simple trends.
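
A minimal sketch of that pattern on a toy panel: one small function applied over a list of thresholds, plus an edition-to-edition change per country. The function name `countries_above`, the thresholds, and the medal totals are illustrative assumptions.

```python
import pandas as pd

# Toy panel with illustrative totals per country and year.
panel = pd.DataFrame({
    "country_name": ["Norway", "Norway", "Norway", "Kenya", "Kenya"],
    "year": [2008, 2012, 2016, 2012, 2016],
    "total_medals": [9, 4, 4, 13, 13],
})

def countries_above(df, threshold):
    """Countries whose medal total over all years exceeds the threshold."""
    totals = df.groupby("country_name")["total_medals"].sum()
    return sorted(totals[totals > threshold].index)

# Iterate over thresholds with a dict comprehension instead of repeated calls.
results = {t: countries_above(panel, t) for t in [10, 20, 30]}
for threshold, countries in results.items():
    print(threshold, countries)

# Simple trend: change versus the previous edition, computed per country.
panel = panel.sort_values(["country_name", "year"])
panel["change"] = panel.groupby("country_name")["total_medals"].diff()
print(panel)
```

The first edition of each country has no predecessor, so its `change` is `NaN` rather than zero.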


In [None]:

medal_cols = [c for c in ["BRONZE","SILVER","GOLD"] if c in olympics_full.columns]
panel = olympics_full.copy()
panel["total_medals"] = panel[medal_cols].sum(axis=1)

medals_by_year = (panel.groupby(["country_name","year"], dropna=False)["total_medals"]
                       .sum().reset_index())

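# 3-edition moving average per country (a window of three Games, not three calendar years);
# the result aligns back to medals_by_year on the index
medals_by_year["ma3_medals"] = (medals_by_year
                                .sort_values(["country_name","year"])
                                .groupby("country_name")["total_medals"]
                                .transform(lambda s: s.rolling(3, min_periods=1).mean()))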

print(medals_by_year.shape)
display(medals_by_year.head(12))


## Appendix
- Your original cells are preserved below (collapsed by topic) so nothing is lost.
- Feel free to copy/paste any bespoke logic into the new structured cells above.


<details><summary><strong>Original — Imports</strong></summary>

(see cell below)

</details>

In [None]:
import pandas as pd

# Load the data

flags_path = 'flags_raw.csv'
gdp_path = 'gdp_raw.csv'
olympics_path = 'olympics_raw.csv'
population_path = 'population_raw.csv'

flags_raw = pd.read_csv(flags_path)
gdp_raw = pd.read_csv(gdp_path)
olympics_raw = pd.read_csv(olympics_path)
population_raw = pd.read_csv(population_path)

# Show the column names and dtypes as a quick check
print(flags_raw.columns)
print(gdp_raw.columns)
print(olympics_raw.columns)
print(population_raw.columns)

print(flags_raw.dtypes)
print(gdp_raw.dtypes)
print(olympics_raw.dtypes)
print(population_raw.dtypes)

<details><summary><strong>Original — Transformations</strong></summary>

(see cell below)

</details>

In [None]:

# After melting, remove the 'x' prefix from the year column and convert to integer. This is necessary for correct merging later.
#gdp_long = gdp_long.assign(year=gdp_long['year'].str.replace('x', '').astype(int))


# 1: Melt the GDP data so that each year column becomes a separate row per country
gdp_long = pd.melt(
    gdp_raw,
    id_vars=['country_name', 'country_code'],  # fixed columns that stay as they are; these 2 columns are not melted
    value_vars=gdp_raw.columns[4:],             # all year columns, from position 5 onwards
    var_name='year',                           # new column for the year
    value_name='gdp'                           # new column for the GDP value
)

# 2: Remove the 'x' from the year and convert to integer

gdp_long['year'] = gdp_long['year'].str.replace('x', '').astype(int)

# 3: Inspect the result
print(gdp_long.head())
print(gdp_long.columns)

# check the new data types
#print(gdp_long.dtypes)

# Use pd.pivot_table instead of pd.pivot to handle the extra arguments aggfunc/fill_value. pivot_table is a more flexible function with the same syntax.

# After pivoting the medal_type to columns, reset the index to turn the multi-index into columns for easier merging later.
#<your_dataframe>.reset_index(inplace=True) 
#<your_dataframe>.columns.name = None  # Remove the 'medal_type' name from columns - not needed but looks better

# Pivot the olympics data
olympics_pivot = pd.pivot_table(
    olympics_raw ,
    index=['country_name', 'game_year', 'game_location', 'game_season'],  # group by these columns
    columns='medal_type',                                                 # medal colour becomes the columns
    aggfunc='size',                                                       # count the number of medals
    fill_value=0                                                          # fill empty values with 0
)
olympics_pivot.reset_index(inplace=True)
olympics_pivot.columns.name = None  # remove the name of the column index
print(olympics_pivot.head())

# check the column names of each dataframe
#print(gdp_raw.columns)
print('gdp_long.columns:', gdp_long.columns)
print('olympics_raw.columns:', olympics_raw.columns)
print('olympics_pivot.columns:', olympics_pivot.columns)
print('population_raw.columns:', population_raw.columns)   

# === Simple joins (keep all rows from olympics_pivot) ===

# Make sure the column names match
olympics_pivot = olympics_pivot.rename(columns={"game_year": "year"})
population = population_raw.rename(columns={"entity": "country_name", "code": "country_code"})[["country_name", "year", "population"]]
flags_clean = flags_raw.rename(columns={"name": "country_name"})

# Join Olympics and GDP
olympics_gdp = olympics_pivot.merge(
    gdp_long[["country_name", "year", "gdp"]],
    how="left",
    on=["country_name", "year"]
)

# Join Olympics, GDP and Population
olympics_gdp_population = olympics_gdp.merge(
    population,
    how="left",
    on=["country_name", "year"]
)

# Join the flags to get olympics_gdp_population_flags
olympics_gdp_population_flags = olympics_gdp_population.merge(
    flags_clean,
    how="left",
    on="country_name"
)

#print(olympics_gdp_population_flags.columns)
#print(olympics_gdp_population_flags.head())
print (olympics_gdp_population_flags.game_season.unique())



<details><summary><strong>Original — Functions</strong></summary>

(see cell below)

</details>

In [None]:
def landen_met_gouden_medailles(data, seizoen='Summer', x_gouden=0):
    # Check for a valid season value
    if seizoen not in ['Winter', 'Summer']:
        raise ValueError("Invalid season. Choose 'Winter' or 'Summer'.")
    
    # Filter the data on the chosen season
    gefilterde_data = data[data['game_season'] == seizoen]
    
    # Group by country and year, and sum the gold medals per edition
    goud_per_editie = gefilterde_data.groupby(['country_name', 'year'])['GOLD'].sum().reset_index()
    
    # Find countries that won more than x gold medals in every edition they appear in
    landen = []
    for land, groep in goud_per_editie.groupby('country_name'):
        if (groep['GOLD'] > x_gouden).all():
            landen.append(land)
    return landen

def landen_met_medaille(data, seizoen='Summer', grens=0, kleur='GOLD', meer=True):
    """
    Return a list of countries whose total number of medals of a given colour, summed over all Games of a given season, is more/less than 'grens'.
    """
    if seizoen not in ['Winter', 'Summer']:
        print(f"Warning: '{seizoen}' is not a valid season. Choose 'Winter' or 'Summer'.")
        return []

    # Filter on season
    df_seizoen = data[data['game_season'] == seizoen]

    # Check that the chosen colour is a column
    if kleur not in df_seizoen.columns:
        print(f"Warning: '{kleur}' is not a valid medal-colour column in this data.")
        return []

    # Group by country and sum the medals of the chosen colour
    aantal_per_land = df_seizoen.groupby('country_name')[kleur].sum()

    # Select the countries that meet the threshold (more/less)
    if meer:
        landen = aantal_per_land[aantal_per_land > grens].index.tolist()
    else:
        landen = aantal_per_land[aantal_per_land < grens].index.tolist()
    return landen



<details><summary><strong>Original — Iteration</strong></summary>

(see cell below)

</details>

In [None]:
# Loop over the thresholds: more than 50, 100, 250, 500 gold medals
grenzen = [50, 100, 250, 500]
for grens in grenzen:
    landen_meer = landen_met_medaille(olympics_gdp_population_flags
                                      , seizoen='Summer'
                                      , grens=grens, kleur='GOLD'
                                      , meer=True)
    print(f"Countries with more than {grens} gold medals (Summer):", landen_meer)
    print(len(landen_meer))

<details><summary><strong>Original — Misc</strong></summary>

(see cell below)

</details>

In [None]:
# Data check: show the number of missing values (NaN) per column for each dataframe
print("Missing values per column in flags_raw:\n", flags_raw.isna().sum())
print("\nMissing values per column in gdp_raw:\n", gdp_raw.isna().sum())
print("\nMissing values per column in olympics_raw:\n", olympics_raw.isna().sum())
print("\nMissing values per column in population_raw:\n", population_raw.isna().sum())

# inspect the data
print(gdp_raw.head())

# Inspect the data once more (df_olympics was never defined; the frame is olympics_raw)
print(olympics_raw.head())

# x_gouden = 1, 2, 10, 20
# Test the function with different parameters
landen_met_gouden = landen_met_gouden_medailles(olympics_gdp_population_flags
                                                , seizoen='Winter'
                                                , x_gouden=1)
print(landen_met_gouden)

landen_met_gouden = landen_met_gouden_medailles(olympics_gdp_population_flags
                                                , seizoen='Winter'
                                                , x_gouden=2)
print(landen_met_gouden)

landen_met_gouden = landen_met_gouden_medailles(olympics_gdp_population_flags
                                                , seizoen='Summer'
                                                , x_gouden=10)
print(landen_met_gouden)

landen_met_gouden = landen_met_gouden_medailles(olympics_gdp_population_flags
                                                , seizoen='Summer'
                                                , x_gouden=20)
print(landen_met_gouden)

# All countries that won fewer than 1 gold medal in total across all Summer Games
landen_minder_dan_1_goud = landen_met_medaille(olympics_gdp_population_flags, seizoen='Summer', grens=1, kleur='GOLD', meer=False)
print("Countries with fewer than 1 gold medal (Summer):", landen_minder_dan_1_goud)
print(len(landen_minder_dan_1_goud))
