# Merge Cleaned Datasets and Deep Cleaning

1. Load cleaned datasets (`IRENA`, `OWID CO₂`, `WGI`)
2. Inspect each dataset for types, missing values, and column consistency
3. Merge datasets step by step on `country_iso` (the ISO3 code) and `year`
4. Handle missing values (fill, drop, or keep as NA)
5. Generate a final combined dataset ready for EDA and visualizations

Highlights:
- ISO3 and year alignment across all datasets
- Step-by-step merging, beginner-friendly
- Missing values handled carefully
- Clean final dataset ready for analysis

In [None]:
# Imports and setup
import pandas as pd
from pathlib import Path

clean_dir = Path("data/cleaned")
final_dir = Path("data/final")
final_dir.mkdir(parents=True, exist_ok=True)  # also create "data/" if it is missing

In [None]:
# Load datasets
from IPython.display import display  # show every preview, not only the last one

# Load cleaned IRENA
irena = pd.read_csv(clean_dir / "cleaned_irena.csv")
print("IRENA shape:", irena.shape)
display(irena.head(2))

# Load cleaned OWID
owid = pd.read_csv(clean_dir / "cleaned_owid.csv")
print("OWID shape:", owid.shape)
display(owid.head(2))

# Load cleaned WGI
wgi = pd.read_csv(clean_dir / "cleaned_wgi.csv")
print("WGI shape:", wgi.shape)
display(wgi.head(2))


## 2. Check Key Columns for Merging
We will merge the datasets on:
- `country_iso` → standardized ISO3 code
- `year` → numeric year

Before merging, confirm that both columns exist in every dataset and contain no missing values.

In [None]:
# Check IRENA
print(irena[['country_iso','year']].isna().sum())

# Check OWID
print(owid[['country_iso','year']].isna().sum())

# Check WGI
print(wgi[['country_iso','year']].isna().sum())
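Missing values are only one way the merge keys can go wrong. The sketch below also counts duplicate key pairs and ISO codes that are not three characters long. `check_merge_keys` is a hypothetical helper, not part of the original notebook, and it runs on a toy frame rather than the real files. Duplicate keys on the left side of a left join may be intentional, but duplicates on the right side multiply rows.

```python
import pandas as pd

def check_merge_keys(df, name, keys=("country_iso", "year")):
    """Summarize merge-key health for one dataset (hypothetical helper)."""
    keys = list(keys)
    missing_cols = [k for k in keys if k not in df.columns]
    if missing_cols:
        raise KeyError(f"{name}: missing key columns {missing_cols}")
    iso_lengths = df["country_iso"].dropna().astype(str).str.len()
    return {
        "dataset": name,
        # rows where at least one key is missing
        "null_keys": int(df[keys].isna().any(axis=1).sum()),
        # rows that repeat an earlier (country_iso, year) pair
        "duplicate_keys": int(df.duplicated(subset=keys).sum()),
        # non-null codes that are not exactly three characters
        "bad_iso_length": int((iso_lengths != 3).sum()),
    }

# Toy frame standing in for one of the cleaned datasets
toy = pd.DataFrame({
    "country_iso": ["DEU", "DEU", "FRA", None, "KE"],
    "year": [2020, 2020, 2020, 2021, 2021],
})
report = check_merge_keys(toy, "toy")
print(report)
```

On the real data, the same call could be made once per frame, for example `check_merge_keys(owid, "OWID")`.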

In [None]:
## 3. Merge Step-by-Step

# 1. Merge `IRENA` + `OWID` on `country_iso` and `year` (left join)
# 2. Merge the result with `WGI` (left join)
# 3. Keep all IRENA rows; missing OWID or WGI data will be NaN
# Merge IRENA + OWID
irena_owid = pd.merge(
    irena,
    owid,
    on=['country_iso','year'],
    how='left',
    suffixes=('_irena','_owid')
)
print("IRENA + OWID shape:", irena_owid.shape)

# Merge with WGI
final_df = pd.merge(
    irena_owid,
    wgi,
    on=['country_iso','year'],
    how='left',
    suffixes=('','_wgi')
)
print("Final merged shape:", final_df.shape)
final_df.head(3)
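A left join keeps every left-hand row, but it does not report how many rows found a match or whether the right-hand table duplicated any of them. The sketch below uses two options of `pd.merge`: `validate="many_to_one"`, which raises an error if the right-hand keys are not unique, and `indicator=True`, which adds a `_merge` column marking unmatched rows. The toy frames and their `generation` and `co2` columns are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the left (IRENA-like) and right (OWID-like) tables
left = pd.DataFrame({
    "country_iso": ["DEU", "DEU", "FRA", "KEN"],
    "year": [2019, 2020, 2020, 2020],
    "generation": [10.0, 12.0, 8.0, 3.0],
})
right = pd.DataFrame({
    "country_iso": ["DEU", "DEU", "FRA"],
    "year": [2019, 2020, 2020],
    "co2": [700.0, 640.0, 300.0],
})

merged = pd.merge(
    left,
    right,
    on=["country_iso", "year"],
    how="left",
    validate="many_to_one",  # error if a key pair repeats in `right`
    indicator=True,          # adds a `_merge` column: both / left_only
)

# With unique right-hand keys, a left join must not change the row count
row_count_unchanged = len(merged) == len(left)

# Rows that found no partner in the right-hand table
unmatched = merged.loc[merged["_merge"] == "left_only", ["country_iso", "year"]]
print(merged["_merge"].value_counts())
print(unmatched)

# Drop the audit column before continuing
merged = merged.drop(columns="_merge")
```

Listing the unmatched keys before imputing anything makes it easier to tell a genuine data gap from a mismatch in country codes.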


## 4. Check Missing Values

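- Some countries or years may be missing OWID or WGI data, because the left joins keep every IRENA row
- For each column, decide whether to impute or to leave the value as NA
- For numeric columns, fill with 0 only where a missing value plausibly means zero; otherwise consider the median or keep NA, depending on context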
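To make the trade-off concrete, the sketch below applies three options to the same toy column: fill with 0, fill with the per-country median, and keep NA while recording a flag. The values are invented, and the per-country grouping is an assumption about what "median" should mean here, not something the notebook specifies.

```python
import pandas as pd

# Toy data with one gap per country (values are made up)
toy = pd.DataFrame({
    "country_iso": ["DEU", "DEU", "DEU", "FRA", "FRA", "FRA"],
    "year": [2018, 2019, 2020, 2018, 2019, 2020],
    "gdp": [3.9, None, 3.8, 2.7, 2.8, None],
})

# Option 1: fill with 0 (only sensible when missing really means zero)
toy["gdp_zero"] = toy["gdp"].fillna(0)

# Option 2: fill with the median of the same country
country_median = toy.groupby("country_iso")["gdp"].transform("median")
toy["gdp_median"] = toy["gdp"].fillna(country_median)

# Option 3: keep NA, but record which rows were missing
toy["gdp_was_missing"] = toy["gdp"].isna()

print(toy)
```

Filling with 0 drags down means and ratios for a quantity such as GDP, whereas the median keeps values in a plausible range at the cost of hiding the gap. The flag column preserves that information.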

In [None]:
# Missing values count
missing_counts = final_df.isna().sum().sort_values(ascending=False)
missing_counts.head(20)

In [None]:
# Imputation
# Example: fill missing energy generation and capacity values with 0.
# Caution: this treats "not reported" as "zero"; skip it if that does not hold.
numeric_cols = [
    'electricity_generation_(gwh)',
    'electricity_installed_capacity_(mw)',
    'heat_generation_(tj)',
]
# population and gdp are left as NaN: 0 is not a plausible value for either

for col in numeric_cols:
    if col in final_df.columns:
        final_df[col] = final_df[col].fillna(0)

In [None]:
## 5. Check Data Types
final_df.dtypes
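If the dtype listing shows `year` or a numeric measure stored as `object`, the column probably contains stray text. Key columns with different dtypes across datasets can also make a merge fail. The sketch below shows one way to coerce such columns on a toy frame. The `value` column and the messy inputs are made up for illustration.

```python
import pandas as pd

# Toy frame with the kinds of problems a dtype check can reveal
toy = pd.DataFrame({
    "country_iso": [" deu", "FRA ", "ken"],
    "year": ["2019", "2020", "n/a"],
    "value": ["1.5", "2", "bad"],
})

# Normalize the ISO3 code: trim whitespace, upper-case
toy["country_iso"] = toy["country_iso"].str.strip().str.upper()

# Convert year to a nullable integer; unparseable entries become <NA>
toy["year"] = pd.to_numeric(toy["year"], errors="coerce").astype("Int64")

# Convert the measure to float; unparseable entries become NaN
toy["value"] = pd.to_numeric(toy["value"], errors="coerce")

print(toy.dtypes)
print(toy)
```

The nullable `Int64` dtype keeps years as integers even when some are missing, which a plain `int64` column cannot do.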

In [None]:
## 6. Quick Descriptive Stats
final_df.describe(include='all').T.head(10)

In [None]:
## 7. Save final merged dataset
final_path = final_dir / "final_combined.csv"
final_df.to_csv(final_path, index=False)
print("Saved final dataset:", final_path)
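After saving, a quick round-trip check confirms that the CSV reloads with the same shape, columns, and number of missing values. The sketch below writes a toy frame to a temporary directory so that it does not touch `data/final`. On the real data, the same comparisons would apply to `final_df` and `final_path`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the merged dataset
df = pd.DataFrame({
    "country_iso": ["DEU", "FRA"],
    "year": [2020, 2020],
    "gdp": [3.8, None],
})

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "final_combined.csv"
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare basic structure before and after the round trip
same_shape = reloaded.shape == df.shape
same_columns = list(reloaded.columns) == list(df.columns)
same_missing = int(reloaded.isna().sum().sum()) == int(df.isna().sum().sum())
print(same_shape, same_columns, same_missing)
```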
