# 02 Data Quality Check — ASEAN Carbon Emission (2000–2024)
This notebook evaluates the quality of the preprocessed dataset:
- schema consistency
- missing values
- duplicates
- year coverage
- value validation

The goal is to establish whether the data is fit for EDA and further analysis, without making aggressive cleaning decisions.

In [10]:
import pandas as pd
import numpy as np


## Load Data
The processed dataset from notebook 01 is loaded from the data/process folder.
This avoids re-downloading the raw data and repeating the filtering from scratch.

In [11]:
df = pd.read_csv("data/process/owid_co2_asean_2000_2024.csv")
df.head()


,country,year,iso_code,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,co2_growth_prct,...,share_global_other_co2,share_of_temperature_change_from_ghg,temperature_change_from_ch4,temperature_change_from_co2,temperature_change_from_ghg,temperature_change_from_n2o,total_ghg,total_ghg_excluding_lucf,trade_co2,trade_co2_share
0,Brunei,2000,BRN,326429.0,,0.0,0.0,5.886,-0.092,-1.537,...,,0.017,0.0,0.0,0.0,0.0,9.218,8.123,-2.344,-39.83
1,Brunei,2001,BRN,333353.0,,0.0,0.0,5.758,-0.128,-2.178,...,,0.017,0.0,0.0,0.0,0.0,9.554,8.428,-2.237,-38.852
2,Brunei,2002,BRN,340108.0,,0.0,0.0,5.285,-0.473,-8.206,...,,0.017,0.0,0.0,0.0,0.0,8.517,7.424,-1.717,-32.479
3,Brunei,2003,BRN,346650.0,,0.0,0.0,6.14,0.854,16.162,...,,0.018,0.0,0.0,0.0,0.0,9.6,8.381,-0.516,-8.412
4,Brunei,2004,BRN,352921.0,,0.0,0.0,5.967,-0.173,-2.817,...,,0.018,0.0,0.0,0.0,0.0,8.826,7.879,-0.508,-8.519


This dataset is the basis for all analysis in the following notebooks.

Next, the dataset size is checked as an initial validation.

In [5]:
df.shape


(6350, 79)
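
The shape printed above has 6350 rows, whereas the quality summary later in this notebook reports 250 rows (10 countries over 25 years). A small guard run right after loading could catch a stale or unfiltered frame early. The following is a minimal sketch on synthetic data; `check_panel_shape` and the `ASEAN` list are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd

# The ten countries that appear in the coverage table of this notebook.
ASEAN = [
    "Brunei", "Cambodia", "Indonesia", "Laos", "Malaysia",
    "Myanmar", "Philippines", "Singapore", "Thailand", "Vietnam",
]


def check_panel_shape(frame, countries, first_year, last_year):
    """Return a list of problems; an empty list means the panel looks as expected."""
    problems = []
    unexpected = sorted(set(frame["country"]) - set(countries))
    if unexpected:
        problems.append(f"unexpected countries: {unexpected[:5]}")
    expected_rows = len(countries) * (last_year - first_year + 1)
    if len(frame) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(frame)}")
    return problems


# Synthetic stand-in for the real file: a complete panel, then one with a stray row.
panel = pd.DataFrame(
    [(c, y) for c in ASEAN for y in range(2000, 2025)],
    columns=["country", "year"],
)
stray = pd.concat(
    [panel, pd.DataFrame({"country": ["World"], "year": [2000]})],
    ignore_index=True,
)

print(check_panel_shape(panel, ASEAN, 2000, 2024))
print(check_panel_shape(stray, ASEAN, 2000, 2024))
```

Returning a list of messages instead of raising keeps the check in the spirit of this notebook, which reports issues without enforcing cleaning decisions.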

## Structure and Data Types
Data types and non-null counts are displayed for an initial overview.

In [6]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6350 entries, 0 to 6349
Data columns (total 79 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   country                                    6350 non-null   object 
 1   year                                       6350 non-null   int64  
 2   iso_code                                   5450 non-null   object 
 3   population                                 5725 non-null   float64
 4   gdp                                        3782 non-null   float64
 5   cement_co2                                 5722 non-null   float64
 6   cement_co2_per_capita                      5622 non-null   float64
 7   co2                                        6175 non-null   float64
 8   co2_growth_abs                             5875 non-null   float64
 9   co2_growth_prct                            5833 non-null   float64
 10  co2_including_luc       

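## Schema and Data Type Validation
Data types are converted so they are consistent. Every column except country is forced to numeric, and values that cannot be parsed become NaN. This includes iso_code, whose text codes are all coerced to NaN, which is why the output below reports it as float64.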

In [18]:
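# year must be an integer for the range and coverage checks that follow
df["year"] = df["year"].astype(int)

# Every column except country is treated as numeric.
# Note: this selection also contains iso_code, so its text codes become NaN.
numeric_cols = df.columns.difference(["country"])
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")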

df.dtypes


column,dtype
country,object
year,int64
iso_code,float64
population,float64
gdp,float64
...,...
temperature_change_from_n2o,float64
total_ghg,float64
total_ghg_excluding_lucf,float64
trade_co2,float64
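
Because `errors="coerce"` replaces anything unparseable with NaN, applying it to a text identifier such as iso_code erases the column. One possible way to avoid that is to keep identifier columns out of the conversion. This is a sketch on a toy frame; `id_cols` is a hypothetical name and the choice of identifier columns is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Brunei", "Brunei"],
    "iso_code": ["BRN", "BRN"],
    "year": ["2000", "2001"],
    "co2": ["5.886", "n/a"],
})

# Assumption: country and iso_code are identifiers and should stay as text.
id_cols = ["country", "iso_code"]
value_cols = toy.columns.difference(id_cols)

converted = toy.copy()
converted[value_cols] = converted[value_cols].apply(pd.to_numeric, errors="coerce")

# For comparison: coercing the text column turns every value into NaN.
wiped = pd.to_numeric(toy["iso_code"], errors="coerce")

print(converted.dtypes)
print(wiped.isna().all())
```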


## Duplicate Check (Country–Year)
The unit of observation is the combination of country and year, so each pair should appear only once.

In [7]:
dup_count = df.duplicated(["country", "year"]).sum()
dup_count


np.int64(0)

Next, fully duplicated rows are checked.

In [19]:
df.duplicated().sum()


np.int64(0)

Both counts are zero, so every country-year pair is unique and no row is repeated.
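
When a key-level count is not zero, the count alone does not show which rows conflict. The sketch below, on a toy frame with one deliberately repeated country-year pair, uses `keep=False` so that every row in a conflicting group is listed rather than only the later copies.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Laos", "Laos", "Laos", "Vietnam"],
    "year": [2000, 2000, 2001, 2000],
    "co2": [1.0, 1.2, 1.1, 50.0],
})

key = ["country", "year"]

# keep=False marks all members of a duplicated group, not just the repeats.
conflicts = toy[toy.duplicated(key, keep=False)].sort_values(key)

print(conflicts)
print(toy.duplicated().sum())  # full-row duplicates are a separate question
```

In this toy case the two Laos 2000 rows disagree on co2, so the key-level check flags them while the full-row check does not.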

## Missing Values Overview
The percentage of missing values is computed for each column.

In [8]:
missing_pct = (
    df.isnull()
    .mean()
    .mul(100)
    .round(2)
    .sort_values(ascending=False)
)
missing_pct


column,missing_pct
other_co2_per_capita,76.38
cumulative_other_co2,75.59
other_industry_co2,75.59
share_global_other_co2,75.59
share_global_cumulative_other_co2,75.59
...,...
share_of_temperature_change_from_ghg,6.69
temperature_change_from_co2,6.69
co2,2.76
year,0.00


This ranking shows which columns have a high share of missing values and may be problematic for analysis.

## Missing Values per Country
Missing values are also computed per country to expose structural gaps that a global average would hide.

In [9]:
missing_by_country = (
    df.groupby("country")
    .apply(lambda x: x.isnull().mean().mul(100).round(2))
)
missing_by_country



country,country,year,iso_code,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,co2_growth_prct,...,share_global_other_co2,share_of_temperature_change_from_ghg,temperature_change_from_ch4,temperature_change_from_co2,temperature_change_from_ghg,temperature_change_from_n2o,total_ghg,total_ghg_excluding_lucf,trade_co2,trade_co2_share
Afghanistan,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,...,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,100.0
Africa,0.0,0.0,100.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,...,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0
Africa (GCP),0.0,0.0,100.0,100.0,100.0,100.0,100.0,0.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
Albania,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,...,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0
Algeria,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,...,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wallis and Futuna,0.0,0.0,0.0,0.0,100.0,4.0,4.0,0.0,0.0,0.0,...,100.0,0.0,100.0,0.0,0.0,100.0,100.0,100.0,100.0,100.0
World,0.0,0.0,100.0,0.0,60.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Yemen,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,...,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,100.0
Zambia,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,...,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0


This step matters because a low overall missing rate can hide extreme missingness in particular countries.
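
The per-country table can also be built without `groupby(...).apply`, which in recent pandas versions is deprecated when it operates on the grouping column. A minimal sketch, on a toy frame, that takes the null mask first and then groups it by country:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Brunei", "Brunei", "Laos", "Laos"],
    "year": [2000, 2001, 2000, 2001],
    "gdp": [np.nan, np.nan, 1.0, 2.0],
    "co2": [5.9, 5.8, np.nan, 1.1],
})

# Build the boolean null mask, then average it within each country.
missing_by_country = (
    toy.drop(columns="country")
    .isna()
    .groupby(toy["country"])
    .mean()
    .mul(100)
    .round(2)
)

print(missing_by_country)
```

The grouping column no longer appears among the result columns, so the table has one row per country and one column per measured variable.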

## Year Coverage per Country
Year coverage is checked to confirm that the 2000–2024 period is covered evenly across countries.

In [12]:
coverage = (
    df.groupby("country")["year"]
    .agg(["min", "max", "nunique"])
    .sort_values("nunique")
)
coverage


country,min,max,nunique
Brunei,2000,2024,25
Cambodia,2000,2024,25
Indonesia,2000,2024,25
Laos,2000,2024,25
Malaysia,2000,2024,25
Myanmar,2000,2024,25
Philippines,2000,2024,25
Singapore,2000,2024,25
Thailand,2000,2024,25
Vietnam,2000,2024,25


All ten countries span 2000 to 2024 with 25 distinct years each, so year coverage is balanced across the period.

## Coverage vs Expectation
The 2000–2024 window is 25 years inclusive, so each country should have 25 yearly records.

In [20]:
expected_years = 25
coverage["missing_years"] = expected_years - coverage["nunique"]
coverage.sort_values("missing_years", ascending=False)


country,min,max,nunique,missing_years
Brunei,2000,2024,25,0
Cambodia,2000,2024,25,0
Indonesia,2000,2024,25,0
Laos,2000,2024,25,0
Malaysia,2000,2024,25,0
Myanmar,2000,2024,25,0
Philippines,2000,2024,25,0
Singapore,2000,2024,25,0
Thailand,2000,2024,25,0
Vietnam,2000,2024,25,0
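
A count of missing years says how many are absent but not which ones. If a gap did appear, a set difference against the expected range would name the specific years. This is a sketch on a toy frame with a shortened 2000–2003 window; the variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Laos"] * 3 + ["Myanmar"] * 4,
    "year": [2000, 2001, 2003, 2000, 2001, 2002, 2003],
})

# Shortened window for the example; the notebook's window is 2000 to 2024.
expected = set(range(2000, 2004))

# For each country, list the expected years that never occur.
missing_years = {
    country: sorted(expected - set(years))
    for country, years in toy.groupby("country")["year"]
}

print(missing_years)
```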


## Value Domain Validation
Values outside a plausible domain are checked: years outside 2000–2024, negative emissions, and zero or negative population.

In [21]:
invalid_years = df.loc[~df["year"].between(2000, 2024), ["country", "year"]]
invalid_years


,country,year


In [22]:
neg_cols = [c for c in numeric_cols if (df[c] < 0).any()]
neg_cols


['co2_growth_abs',
 'co2_growth_prct',
 'co2_including_luc_growth_abs',
 'co2_including_luc_growth_prct',
 'trade_co2',
 'trade_co2_share']
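
All six columns listed above are growth or trade-balance measures, for which negative values are plausible by definition. A possible refinement is to separate such signed columns from columns where a negative value would signal a problem. The naming rule in `may_be_negative` is an assumption made for this sketch, not a rule taken from the dataset documentation.

```python
import pandas as pd

toy = pd.DataFrame({
    "co2": [5.9, 5.8, 5.3],
    "co2_growth_abs": [-0.09, -0.13, -0.47],
    "trade_co2": [-2.3, -2.2, -1.7],
    "population": [326429.0, -1.0, 340108.0],  # deliberately corrupted value
})


def may_be_negative(col):
    # Assumption: growth and trade-balance columns are signed by definition.
    return "growth" in col or col.startswith("trade_co2")


neg_cols = [c for c in toy.columns if (toy[c] < 0).any()]
suspicious = [c for c in neg_cols if not may_be_negative(c)]

print(neg_cols)
print(suspicious)
```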

In [23]:
pop_issue = df.loc[df["population"] <= 0, ["country", "year", "population"]]
pop_issue


,country,year,population


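## Constant and All-Zero Columns
Constant or all-zero columns can be misleading in EDA. The all-zero test fills NaN with 0 first, so a column that is entirely missing is also reported as all-zero. iso_code appears in both lists only because the earlier numeric coercion turned its text codes into NaN.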



In [24]:
const_cols = [c for c in numeric_cols if df[c].nunique(dropna=True) <= 1]
zero_cols = [c for c in numeric_cols if (df[c].fillna(0) == 0).all()]

const_cols, zero_cols


(['cumulative_other_co2',
  'iso_code',
  'other_co2_per_capita',
  'other_industry_co2',
  'share_global_cumulative_other_co2',
  'share_global_other_co2'],
 ['cumulative_other_co2',
  'iso_code',
  'other_co2_per_capita',
  'other_industry_co2',
  'share_global_cumulative_other_co2',
  'share_global_other_co2'])
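
Since both lists above are identical, it may help to tell apart columns that are entirely missing from columns that hold real zeros or a single repeated value. The following sketch labels each column of a toy frame; `classify` and the `flag` column are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "other_industry_co2": [np.nan, np.nan, np.nan],
    "cement_co2": [0.0, 0.0, 0.0],
    "flag": [1.0, 1.0, np.nan],
    "co2": [5.9, 5.8, 5.3],
})


def classify(col):
    """Label a column by what its non-missing values look like."""
    values = col.dropna()
    if values.empty:
        return "all_missing"
    if (values == 0).all():
        return "all_zero"
    if values.nunique() == 1:
        return "constant"
    return "varying"


labels = toy.apply(classify)
print(labels)
```

An all-missing column carries no information at all, whereas a true all-zero column (as cement_co2 is for Brunei in the preview rows) is a measured value and may still be meaningful.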

## Cross-Column Consistency (Sanity Check)
If the units allow it, co2_per_capita is checked against co2 divided by population.

In [25]:
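required_cols = {"co2", "population", "co2_per_capita"}
if required_cols.issubset(df.columns):
    # Compare only rows where all three values are present and population is positive
    mask = (
        df["co2"].notna()
        & df["population"].notna()
        & df["co2_per_capita"].notna()
        & (df["population"] > 0)
    )
    if mask.any():
        # Assumption: co2 is in million tonnes and co2_per_capita in tonnes per person
        calc = df.loc[mask, "co2"] * 1e6 / df.loc[mask, "population"]
        diff = (calc - df.loc[mask, "co2_per_capita"]).abs()
        # Print explicitly: a bare expression inside an if block is not displayed
        print(diff.describe())
else:
    print("Columns co2/population/co2_per_capita are incomplete")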

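
An absolute difference is hard to judge on its own because per-capita values are rounded in the source. A relative tolerance gives a clearer pass or fail per row. The sketch below reuses the first three Brunei rows from the preview, derives co2_per_capita for the toy frame, and then corrupts one row on purpose. The 1 percent tolerance and the unit assumption (co2 in million tonnes, co2_per_capita in tonnes per person) are choices made for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "co2": [5.886, 5.758, 5.285],
    "population": [326429.0, 333353.0, 340108.0],
})

# Derive a rounded per-capita column, then inject one inconsistent row.
toy["co2_per_capita"] = (toy["co2"] * 1e6 / toy["population"]).round(3)
toy.loc[2, "co2_per_capita"] = 1.0

# Assumption: co2 in million tonnes, co2_per_capita in tonnes per person.
recomputed = toy["co2"] * 1e6 / toy["population"]
consistent = np.isclose(recomputed, toy["co2_per_capita"], rtol=0.01)
mismatches = toy.loc[~consistent]

print(mismatches)
```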

## Core Columns for Analysis
The core columns are defined explicitly so that later notebooks do not silently redefine them.

In [15]:
core_columns = [
    "country",
    "year",
    "population",
    "co2",
    "co2_per_capita",
    "coal_co2",
    "oil_co2",
    "gas_co2",
    "cement_co2",
    "flaring_co2"
]

missing_core = missing_pct[core_columns]
missing_core


column,missing_pct
country,0.0
year,0.0
population,9.84
co2,2.76
co2_per_capita,9.06
coal_co2,38.24
oil_co2,8.17
gas_co2,43.21
cement_co2,9.89
flaring_co2,6.69


## Missing-Value Threshold Evaluation
A 10 percent missing threshold is used as a rule of thumb for initial exploration.
Columns above the threshold are identified but not dropped.

In [16]:
threshold = 10

cols_above_threshold = missing_pct[missing_pct > threshold]
cols_below_threshold = missing_pct[missing_pct <= threshold]

cols_above_threshold, cols_below_threshold


(other_co2_per_capita                         76.38
 cumulative_other_co2                         75.59
 other_industry_co2                           75.59
 share_global_other_co2                       75.59
 share_global_cumulative_other_co2            75.59
 consumption_co2_per_gdp                      51.04
 consumption_co2_per_capita                   48.30
 trade_co2                                    47.51
 trade_co2_share                              47.51
 gas_co2_per_capita                           44.00
 consumption_co2                              43.73
 share_global_cumulative_gas_co2              43.21
 cumulative_gas_co2                           43.21
 gas_co2                                      43.21
 share_global_gas_co2                         43.21
 energy_per_gdp                               40.63
 gdp                                          40.44
 coal_co2_per_capita                          39.02
 cumulative_coal_co2                          38.24
 share_globa

The decision to exclude columns is deferred to the analysis stage rather than forced during the quality check.
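
Because no column is dropped here, a coarser view than a single cutoff can help later decisions. One option is to bin the missing percentages into bands. The sketch uses a handful of percentages copied from the outputs in this notebook; the 50 percent boundary and the band labels are assumptions added for illustration.

```python
import pandas as pd

# A few values copied from the missing-percentage outputs in this notebook.
missing_pct = pd.Series({
    "year": 0.0,
    "co2": 2.76,
    "population": 9.84,
    "coal_co2": 38.24,
    "gas_co2": 43.21,
    "other_co2_per_capita": 76.38,
})

# Bands: [0, 10], (10, 50], (50, 100]; only the 10 percent cutoff is from the notebook.
bands = pd.cut(
    missing_pct,
    bins=[0, 10, 50, 100],
    labels=["<=10%", "10-50%", ">50%"],
    include_lowest=True,
)

summary = bands.value_counts().sort_index()
print(summary)
```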

## Quality Summary
Summary metrics to support decisions before EDA.

In [26]:
quality_summary = {
    "rows": len(df),
    "cols": df.shape[1],
    "duplicate_country_year": int(df.duplicated(["country", "year"]).sum()),
    "duplicate_full_row": int(df.duplicated().sum()),
    "countries": df["country"].nunique(),
    "year_min": int(df["year"].min()),
    "year_max": int(df["year"].max()),
    "neg_value_cols": len(neg_cols) if "neg_cols" in globals() else None,
    "const_cols": len(const_cols),
    "zero_cols": len(zero_cols),
}

pd.Series(quality_summary)


metric,value
rows,250
cols,79
duplicate_country_year,0
duplicate_full_row,0
countries,10
year_min,2000
year_max,2024
neg_value_cols,6
const_cols,6
zero_cols,6


## Compact Dataset for EDA
The dataset file is not modified at this stage.
This notebook only produces quality information and column definitions.
As a final validation, the number of rows per country is displayed.

In [17]:
df["country"].value_counts().sort_values()


country,count
Brunei,25
Cambodia,25
Indonesia,25
Laos,25
Malaysia,25
Myanmar,25
Philippines,25
Singapore,25
Thailand,25
Vietnam,25


The output of this notebook serves as a transparent basis for EDA and further analysis, without normative claims about the fitness of the data.

----
## Conclusions
- The data structure is consistent for the country-year unit of observation: both duplicate checks return zero.
- Missing values need monitoring, especially where they are concentrated in particular countries.
- Any country with a high missing_years count should be flagged when interpreting trends; here all ten countries have complete 2000–2024 coverage.
- Columns with negative, constant, or all-zero values should be reviewed before being used as main variables.
- The dataset is not changed apart from type coercion: year is cast to integer and every other column except country is coerced to numeric.
----