# 04 - Tableau-Ready Dataset: ASEAN Carbon Emissions (2000–2024)

This notebook prepares the final datasets for visualization in Tableau.

Goals:
- Lock metric definitions in Python so they stay consistent
- Produce ready-to-use CSV files for the Tableau dashboards
- Reduce the need for complex calculated fields in Tableau

Main outputs:
- data/tableau/trend_4y_long.csv
  Long-format dataset for 4-year trends (average annual)

- data/tableau/ranking_change.csv
  Dataset for ranking the change between the start and end periods (average annual)

- data/tableau/decomposition_top_countries.csv
  Dataset for decomposing the change by emission source for the top countries (average annual)

Methodology notes:
- 4-year groups are anchored to 2000 (year_group 2000 covers 2000–2003, year_group 2004 covers 2004–2007, and so on)
- Trends and comparisons use only complete 4-year groups; a group with fewer than 4 years of data (for example, year_group 2024 when the data ends in 2024) is excluded
- 4-year aggregation uses the mean of the annual values (average annual)
- Default comparison periods: 2000–2003 vs 2020–2023 (year_group 2000 vs 2020)
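
As a rough illustration of the anchoring rule, the sketch below maps a few sample years to their 4-year group. `ANCHOR_YEAR`, `GROUP_SIZE`, and `to_year_group` are hypothetical names used only for this example; the formula is the same one applied to `df_core` later in the notebook.

```python
# Illustrative only: map sample years to their 4-year group anchored at 2000.
ANCHOR_YEAR = 2000  # anchor stated in the methodology notes
GROUP_SIZE = 4      # number of years per group


def to_year_group(year: int) -> int:
    """Return the first year of the 4-year group that contains `year`."""
    return ANCHOR_YEAR + ((year - ANCHOR_YEAR) // GROUP_SIZE) * GROUP_SIZE


sample_years = [2000, 2003, 2004, 2019, 2020, 2023, 2024]
mapping = {y: to_year_group(y) for y in sample_years}
print(mapping)
```

With data ending in 2024, the last group would hold a single year, which is why the comparison stops at year_group 2020.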


In [None]:
import pandas as pd
import numpy as np
from pathlib import Path


## Load Data

The dataset is loaded from the local processed file so that it stays consistent with the previous notebooks.


In [None]:
data_path = "data/process/owid_co2_asean_2000_2024.csv"
df = pd.read_csv(data_path)
df.shape


In [None]:
sorted(df["country"].unique().tolist()), int(df["year"].min()), int(df["year"].max())


## Standardize Column Names

Column names are standardized so that the total and per-source emission metrics follow one naming pattern.


In [None]:
rename_map = {
    "co2": "co2_total",
    "coal_co2": "co2_coal",
    "oil_co2": "co2_oil",
    "gas_co2": "co2_gas",
    "cement_co2": "co2_cement",
    "flaring_co2": "co2_flaring",
}
df = df.rename(columns=rename_map)
df.columns


## Select Core Columns

Only the core columns needed in Tableau are kept.


In [None]:
cols = [
    "country",
    "year",
    "population",
    "co2_total",
    "co2_per_capita",
    "co2_coal",
    "co2_oil",
    "co2_gas",
    "co2_cement",
    "co2_flaring",
]

# Expect an empty list here; selecting df[cols] in the next cell raises a KeyError otherwise.
missing_cols = [c for c in cols if c not in df.columns]
missing_cols


In [None]:
df_core = df[cols].copy()
df_core.shape


## Create 4-Year Group and Aggregate (Average Annual)

The year group is anchored to 2000.
Only complete groups (4 years) are used for the Tableau outputs.


In [None]:
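# Map each year to the first year of its 4-year group (2000, 2004, ...).
df_core["year_group"] = 2000 + ((df_core["year"] - 2000) // 4) * 4

# Count the rows (years) available in each country/group.
group_counts = (
    df_core.groupby(["country", "year_group"])
    .size()
    .reset_index(name="n_years")
)

# Average annual value of every numeric column within each group.
df_4y = (
    df_core.groupby(["country", "year_group"], as_index=False)
    .mean(numeric_only=True)
)

# Attach the year counts and keep only groups that contain all 4 years.
df_4y_full = df_4y.merge(group_counts, on=["country", "year_group"], how="left")
df_4y_complete = df_4y_full[df_4y_full["n_years"] == 4].copy()

df_4y_full.shape, df_4y_complete.shape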
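
The completeness filter above counts rows per group, so a group with 4 rows still passes even if one annual value is missing, and the mean is then taken over the remaining values. The sketch below shows the difference between row counts and non-null counts on made-up data. The `toy` frame, its values, and the stricter `n_valid == 4` rule are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): country "A" has 4 rows in the 2000 group,
# but one co2_total value is missing.
toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "year": [2000, 2001, 2002, 2003] * 2,
    "co2_total": [1.0, np.nan, 3.0, 4.0, 2.0, 2.0, 2.0, 2.0],
})
toy["year_group"] = 2000 + ((toy["year"] - 2000) // 4) * 4

grouped = toy.groupby(["country", "year_group"])["co2_total"]
check = pd.DataFrame({
    "n_rows": grouped.size(),    # counts rows, including missing values
    "n_valid": grouped.count(),  # counts non-null values only
    "mean": grouped.mean(),      # missing values are skipped by default
}).reset_index()

# Hypothetical stricter rule: keep a group only if all 4 annual values are present.
strict_complete = check[check["n_valid"] == 4]
print(check)
```

Whether the stricter rule is appropriate depends on how much missing data the source columns contain; per-source series are the most likely place to find gaps.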


## Output 1: Trend Dataset (Long Format)

This dataset suits the following Tableau views:
- trend line charts
- small multiples per country
- a metric selector driven by a filter

Schema:
- country
- year_group
- metric
- value
- unit
- aggregation_type


In [None]:
# metric key -> (display name, unit, aggregation type)
# NOTE: OWID publishes the emission series in million tonnes. The "tonnes" label
# assumes the values were converted in an earlier notebook; verify before publishing.
metric_map = {
    "co2_total": ("CO2 Total", "tonnes", "average_annual"),
    "co2_per_capita": ("CO2 per Capita", "tonnes_per_person", "average_annual"),
    "co2_coal": ("CO2 from Coal", "tonnes", "average_annual"),
    "co2_oil": ("CO2 from Oil", "tonnes", "average_annual"),
    "co2_gas": ("CO2 from Gas", "tonnes", "average_annual"),
    "co2_cement": ("CO2 from Cement", "tonnes", "average_annual"),
    "co2_flaring": ("CO2 from Flaring", "tonnes", "average_annual"),
}

available_metrics = [m for m in metric_map.keys() if m in df_4y_complete.columns]

trend_long = df_4y_complete[["country", "year_group"] + available_metrics].melt(
    id_vars=["country", "year_group"],
    var_name="metric_key",
    value_name="value"
)

trend_long["metric"] = trend_long["metric_key"].map(lambda x: metric_map[x][0])
trend_long["unit"] = trend_long["metric_key"].map(lambda x: metric_map[x][1])
trend_long["aggregation_type"] = trend_long["metric_key"].map(lambda x: metric_map[x][2])

trend_long = trend_long.drop(columns=["metric_key"])
trend_long = trend_long.sort_values(["metric", "country", "year_group"]).reset_index(drop=True)

trend_long.head(10)
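
A long-format table is easiest to use in Tableau when each combination of country, year group, and metric appears exactly once. The sketch below checks that property on a small made-up table. The `wide` frame and its values are hypothetical, and the same duplicate check could be applied to `trend_long`.

```python
import pandas as pd

# Hypothetical wide table with two metrics for two countries.
wide = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year_group": [2000, 2004, 2000, 2004],
    "co2_total": [10.0, 12.0, 5.0, 6.0],
    "co2_per_capita": [1.0, 1.1, 0.5, 0.6],
})

long_df = wide.melt(
    id_vars=["country", "year_group"],
    var_name="metric",
    value_name="value",
)

# Expect one row per (country, year_group, metric):
# 2 countries x 2 groups x 2 metrics = 8 rows and no duplicate keys.
key = ["country", "year_group", "metric"]
n_duplicates = int(long_df.duplicated(subset=key).sum())
print(len(long_df), n_duplicates)
```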


## Output 2: Ranking Change Dataset (Start vs End Period)

This dataset suits the following Tableau views:
- bar chart ranking countries by change
- table of start vs end values and their change
- scatter plot of change in total emissions vs change in per capita emissions

Schema:
- country
- metric
- unit
- aggregation_type
- start_period
- end_period
- start_value
- end_value
- change_value
- change_pct


In [None]:
start_period = int(df_4y_complete["year_group"].min())
end_period = int(df_4y_complete["year_group"].max())
start_period, end_period


In [None]:
# Put the two headline metrics first, followed by the per-source metrics.
ranking_metrics = ["co2_total", "co2_per_capita"] + [m for m in available_metrics if m.startswith("co2_") and m not in ["co2_total", "co2_per_capita"]]
ranking_metrics = [m for m in ranking_metrics if m in df_4y_complete.columns]

start_df = df_4y_complete[df_4y_complete["year_group"] == start_period][["country"] + ranking_metrics].copy()
end_df = df_4y_complete[df_4y_complete["year_group"] == end_period][["country"] + ranking_metrics].copy()

chg = start_df.merge(end_df, on="country", suffixes=("_start", "_end"))

rows = []
for m in ranking_metrics:
    start_col = f"{m}_start"
    end_col = f"{m}_end"

    out = pd.DataFrame({
        "country": chg["country"],
        "metric_key": m,
        "start_period": start_period,
        "end_period": end_period,
        "start_value": chg[start_col],
        "end_value": chg[end_col],
    })
    out["change_value"] = out["end_value"] - out["start_value"]
    out["change_pct"] = np.where(
        out["start_value"].abs() > 0,
        out["change_value"] / out["start_value"] * 100,
        np.nan
    )
    rows.append(out)

ranking_change = pd.concat(rows, ignore_index=True)

ranking_change["metric"] = ranking_change["metric_key"].map(lambda x: metric_map[x][0] if x in metric_map else x)
ranking_change["unit"] = ranking_change["metric_key"].map(lambda x: metric_map[x][1] if x in metric_map else "unknown")
ranking_change["aggregation_type"] = ranking_change["metric_key"].map(lambda x: metric_map[x][2] if x in metric_map else "average_annual")

ranking_change = ranking_change.drop(columns=["metric_key"])
ranking_change = ranking_change.sort_values(["metric", "change_value"], ascending=[True, False]).reset_index(drop=True)

ranking_change.head(10)
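
To make the change metrics concrete, the sketch below computes `change_value` and `change_pct` for three made-up countries, including one with a zero baseline. The `demo` frame is hypothetical, and masking the baseline with `Series.where` is one alternative way to get the same NaN result as the `np.where` guard used above.

```python
import pandas as pd

# Hypothetical start/end averages for three countries.
demo = pd.DataFrame({
    "country": ["A", "B", "C"],
    "start_value": [10.0, 0.0, 4.0],
    "end_value": [15.0, 2.0, 3.0],
})
demo["change_value"] = demo["end_value"] - demo["start_value"]

# Replace a zero baseline with NaN before dividing so the result is NaN, not inf.
baseline = demo["start_value"].where(demo["start_value"] != 0)
demo["change_pct"] = demo["change_value"] / baseline * 100
print(demo)
```

Country A rises by 50%, country C falls by 25%, and country B has no defined percentage change because its start value is zero. Tableau reads the empty CSV field as null.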


## Output 3: Decomposition Dataset (Top Countries)

This dataset suits the following Tableau views:
- stacked bar chart decomposing the change by source
- focus on the top N countries ranked by change in total CO2

Schema:
- country
- source
- start_period
- end_period
- start_value
- end_value
- change_value
- unit
- aggregation_type


In [None]:
top_n = 5

# Rank countries by the largest increase in average annual total CO2.
rank_total = ranking_change[ranking_change["metric"] == "CO2 Total"].copy()
top_countries = rank_total.sort_values("change_value", ascending=False)["country"].head(top_n).tolist()
top_countries


In [None]:
sources = ["co2_coal", "co2_oil", "co2_gas", "co2_cement", "co2_flaring"]
sources = [s for s in sources if s in df_4y_complete.columns]

start_src = df_4y_complete[df_4y_complete["year_group"] == start_period][["country"] + sources].copy()
end_src = df_4y_complete[df_4y_complete["year_group"] == end_period][["country"] + sources].copy()

src = start_src.merge(end_src, on="country", suffixes=("_start", "_end"))
src = src[src["country"].isin(top_countries)].copy()

decomp_rows = []
for s in sources:
    out = pd.DataFrame({
        "country": src["country"],
        "source_key": s,
        "start_period": start_period,
        "end_period": end_period,
        "start_value": src[f"{s}_start"],
        "end_value": src[f"{s}_end"],
    })
    out["change_value"] = out["end_value"] - out["start_value"]
    decomp_rows.append(out)

decomp = pd.concat(decomp_rows, ignore_index=True)
decomp["source"] = decomp["source_key"].map(lambda x: metric_map[x][0] if x in metric_map else x)
decomp["unit"] = decomp["source_key"].map(lambda x: metric_map[x][1] if x in metric_map else "unknown")
decomp["aggregation_type"] = decomp["source_key"].map(lambda x: metric_map[x][2] if x in metric_map else "average_annual")
decomp = decomp.drop(columns=["source_key"])
decomp = decomp.sort_values(["country", "source"]).reset_index(drop=True)

decomp.head(10)
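
The per-source changes may not add up exactly to the change in total CO2, for instance if the upstream total includes categories that are not listed in `sources`, or if some source values are missing. The sketch below computes that residual on made-up numbers. The frames `decomp_demo` and `total_demo` and all their values are hypothetical.

```python
import pandas as pd

# Hypothetical per-source changes for one country; numbers are made up.
decomp_demo = pd.DataFrame({
    "country": ["A"] * 3,
    "source": ["CO2 from Coal", "CO2 from Oil", "CO2 from Gas"],
    "change_value": [6.0, 3.0, 1.5],
})
# Hypothetical change in total CO2 for the same country.
total_demo = pd.DataFrame({"country": ["A"], "total_change": [11.0]})

source_sum = (
    decomp_demo.groupby("country", as_index=False)["change_value"]
    .sum()
    .rename(columns={"change_value": "sum_of_sources"})
)

residual = total_demo.merge(source_sum, on="country", how="left")
# Residual = part of the total change not explained by the listed sources.
residual["residual"] = residual["total_change"] - residual["sum_of_sources"]
print(residual)
```

A large residual in the real data would be worth showing as its own segment in the stacked bar, so the bars do not understate the total change.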


## Save Outputs for Tableau

The files are saved to the data/tableau folder.


In [None]:
out_dir = Path("data/tableau")
out_dir.mkdir(parents=True, exist_ok=True)

trend_path = out_dir / "trend_4y_long.csv"
ranking_path = out_dir / "ranking_change.csv"
decomp_path = out_dir / "decomposition_top_countries.csv"

trend_long.to_csv(trend_path, index=False)
ranking_change.to_csv(ranking_path, index=False)
decomp.to_csv(decomp_path, index=False)

trend_path.as_posix(), ranking_path.as_posix(), decomp_path.as_posix()
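
One way to confirm that a saved file matches what was written is to read it back and compare columns and row counts. The sketch below does this in a temporary directory with a made-up frame, so it does not touch data/tableau. The `demo` frame and the file name `demo.csv` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame with one missing value, similar in shape to the outputs.
demo = pd.DataFrame({
    "country": ["A", "B"],
    "year_group": [2000, 2000],
    "value": [1.5, None],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.csv"
    demo.to_csv(path, index=False)  # same call pattern as the save cell
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(demo.columns)
same_rows = len(reloaded) == len(demo)
print(same_columns, same_rows)
```

Missing values are written as empty fields and come back as NaN, which Tableau treats as null.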


## Quick Checks

This section checks that the output files have the structure Tableau expects.


In [None]:
trend_long.shape, ranking_change.shape, decomp.shape


In [None]:
trend_long.head(5)


In [None]:
ranking_change.head(5)


In [None]:
decomp.head(5)
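
The visual checks above can be complemented by an explicit schema check. The sketch below is a minimal version, assuming the trend schema listed earlier in this notebook. `missing_columns` and `TREND_SCHEMA` are hypothetical names, and the empty `demo` frame deliberately omits one column to show a failing case.

```python
import pandas as pd


def missing_columns(df: pd.DataFrame, expected: list) -> list:
    """Return the expected column names that are absent from `df`."""
    return [c for c in expected if c not in df.columns]


# Schema of the trend dataset as documented in this notebook.
TREND_SCHEMA = ["country", "year_group", "metric", "value", "unit", "aggregation_type"]

# Hypothetical frame that lacks the aggregation_type column.
demo = pd.DataFrame(columns=["country", "year_group", "metric", "value", "unit"])
print(missing_columns(demo, TREND_SCHEMA))
```

Calling `missing_columns(trend_long, TREND_SCHEMA)` should return an empty list if the output matches the documented schema.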


## Tableau Usage Notes

Common dashboards:
- Trend: filter by metric, line chart of year_group vs value, color by country
- Ranking: filter by metric, bar chart of country vs change_value, label with change_pct
- Decomposition: stacked bar of country vs change_value, stacked by source

Metric definitions:
- All values in trend_4y_long, ranking_change, and decomposition_top_countries are average annual values within a 4-year group
- start_period and end_period in ranking_change and decomposition_top_countries are the first and last complete 4-year groups
