---
## 03_Normalize_Scale
---

# Notebook 03: Normalize & Scale

This notebook normalizes and scales all cleaned datasets to prepare for composite scoring.  
The goal is to standardize indicators on a 0–100 scale for comparability.  
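
A 0–100 scale like the one described above is commonly produced with min-max rescaling. The sketch below illustrates that approach only as an assumption: `min_max_0_100` is a hypothetical name, and the actual logic inside `normalize_indicator` is not shown in this notebook and may differ.

```python
import numpy as np
import pandas as pd


def min_max_0_100(values: pd.Series) -> pd.Series:
    """Rescale a numeric Series to 0-100; NaNs stay NaN (hypothetical helper)."""
    lo, hi = values.min(), values.max()  # both skip NaN by default
    if pd.isna(lo) or hi == lo:
        # All-missing or constant column: there is no spread to scale,
        # so this sketch returns NaN rather than dividing by zero.
        return pd.Series(np.nan, index=values.index, dtype="float64")
    return (values - lo) / (hi - lo) * 100.0


sample = pd.Series([20.0, np.nan, 60.0, 100.0])
scaled = min_max_0_100(sample)
print(scaled.tolist())  # [0.0, nan, 50.0, 100.0]
```

The missing entry passes through untouched, which matches the stated aim of keeping NaNs for explicit handling later.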

## Imports and Settings
Load the libraries required for normalization and visualization, along with general utilities.  
Configure display and plotting styles for consistency.

In [None]:
import sys
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler, StandardScaler

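# Add the project root to the import path so modules under `src` resolve
sys.path.append(os.path.abspath(".."))

# Show every column in DataFrame previews and use a consistent plot style
pd.set_option('display.max_columns', None)
sns.set_style("whitegrid")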

In [4]:
# Import helper
from src.normalize import normalize_indicator

In [6]:
# Directories
clean_dir = "../data/clean/"
norm_dir = "../data/normalized/"
os.makedirs(norm_dir, exist_ok=True)

In [7]:
# List cleaned files
clean_files = [f for f in os.listdir(clean_dir) if f.endswith("_clean.csv")]
print("Available cleaned datasets:", clean_files)

Available cleaned datasets: ['electricity_clean.csv', 'gdp_ppp_clean.csv', 'gov_effect_clean.csv', 'internet_clean.csv', 'literacy_clean.csv', 'mobile_clean.csv', 'researchers_clean.csv', 'rnd_gdp_clean.csv', 'tertiary_clean.csv']


---
## Normalization Test: Electricity
---
Run a test normalization on the electricity dataset.  
Steps:  
1. Load the cleaned electricity data.  
2. Apply the `normalize_indicator` helper function.  
3. Preview a few rows to confirm scaling worked.  
4. Save the normalized version into `/data/normalized/`.  

In [None]:
# Load the cleaned electricity data
df = pd.read_csv(os.path.join(clean_dir, "electricity_clean.csv"))

# Normalize and print a 5-row preview
normalized_df = normalize_indicator(df, "electricity", preview_rows=5)

# Save the normalized version
out_path = os.path.join(norm_dir, "electricity_normalized.csv")
normalized_df.to_csv(out_path, index=False)

print(f"Electricity normalized and saved to {out_path}")

🔎 Preview of electricity:
                  Country Name Country Code    Year    Indicator  Value  \
0                        Aruba          ABW  1990.0  electricity  100.0   
1  Africa Eastern and Southern          AFE  1990.0  electricity    NaN   
2                  Afghanistan          AFG  1990.0  electricity    NaN   
3   Africa Western and Central          AFW  1990.0  electricity    NaN   
4                       Angola          AGO  1990.0  electricity    NaN   

   Normalized  
0       100.0  
1         NaN  
2         NaN  
3         NaN  
4         NaN  
Electricity normalized and saved to ../data/normalized/electricity_normalized.csv


## Normalize All Cleaned Datasets
Run normalization across all cleaned datasets in one loop.  
Each dataset is scaled using `normalize_indicator()` and saved into `/data/normalized/`.  
If a dataset fails, the exception is caught and reported, and the loop moves on to the next file.  

In [9]:
# Normalize all cleaned datasets in one loop
for fname in clean_files:
    indicator = fname.replace("_clean.csv", "")
    try:
        df = pd.read_csv(os.path.join(clean_dir, fname))
        normalized_df = normalize_indicator(df, indicator, preview_rows=0)
        out_path = os.path.join(norm_dir, f"{indicator}_normalized.csv")
        normalized_df.to_csv(out_path, index=False)
        print(f"{indicator} normalized and saved to {out_path}")
    except Exception as e:
        print(f"Failed on {indicator}: {e}")

electricity normalized and saved to ../data/normalized/electricity_normalized.csv
gdp_ppp normalized and saved to ../data/normalized/gdp_ppp_normalized.csv
Failed on gov_effect: unsupported operand type(s) for -: 'str' and 'float'
internet normalized and saved to ../data/normalized/internet_normalized.csv
literacy normalized and saved to ../data/normalized/literacy_normalized.csv
mobile normalized and saved to ../data/normalized/mobile_normalized.csv
researchers normalized and saved to ../data/normalized/researchers_normalized.csv
rnd_gdp normalized and saved to ../data/normalized/rnd_gdp_normalized.csv
tertiary normalized and saved to ../data/normalized/tertiary_normalized.csv
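
The `gov_effect` error above (`'str' and 'float'`) is the kind of failure that appears when a value column has been read as text rather than numbers. One common remedy is to coerce the column with `pd.to_numeric`. The sketch below uses a made-up frame with a placeholder string to show the effect; it is not the contents of `gov_effect_clean.csv`.

```python
import pandas as pd

# Hypothetical frame mimicking a Value column that was read as text.
df = pd.DataFrame({"Value": ["0.5", "-1.2", "..", None]})

# errors="coerce" turns unparseable entries into NaN instead of raising.
df["Value"] = pd.to_numeric(df["Value"], errors="coerce")

print(df["Value"].dtype)         # float64
print(df["Value"].isna().sum())  # 2
```

After coercion the column is numeric, so arithmetic such as subtracting a minimum no longer raises, and entries that could not be parsed become NaN alongside the values that were already missing.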


# Wrap-Up: Normalization & Scaling

## Notebook 03 Wrap-Up

**Objective:**  
Scale all cleaned indicator datasets into a comparable 0–100 range while preserving missing values (NaNs).  

**Process:**  
- Imported helper function `normalize_indicator` from `src/normalize.py`.  
- Defined directories for `/data/clean/` inputs and `/data/normalized/` outputs.  
- Verified available cleaned datasets (9 total).  
- Tested normalization on `electricity_clean.csv`.  
- Normalized all datasets in a loop, saving results into `/data/normalized/`.  
- Patched the `gov_effect` failure reported in the loop output (`unsupported operand type(s) for -: 'str' and 'float'`) by coercing the `Value` column to numeric in `src/normalize.py`.  

**Outputs:**  
- 9 normalized datasets saved into `/data/normalized/`:  
  `electricity`, `gdp_ppp`, `gov_effect`, `internet`, `literacy`,  
  `mobile`, `researchers`, `rnd_gdp`, `tertiary`.  
- Each dataset includes both the raw `Value` column and a `Normalized` column.  
- NaNs preserved to ensure missing data can be handled explicitly later.  
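
The output properties listed above can be spot-checked with a small validation step. The sketch below assumes the `Value` and `Normalized` column names shown in the electricity preview; `check_normalized` is a hypothetical helper, not part of `src/normalize.py`, and it assumes `Normalized` is NaN exactly where `Value` is NaN.

```python
import numpy as np
import pandas as pd


def check_normalized(df: pd.DataFrame) -> bool:
    """Return True if Normalized lies in 0-100 and its NaNs line up with Value."""
    norm = df["Normalized"].dropna()
    in_range = bool(norm.between(0, 100).all())
    nans_match = bool((df["Value"].isna() == df["Normalized"].isna()).all())
    return in_range and nans_match


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {"Value": [100.0, np.nan, 40.0], "Normalized": [100.0, np.nan, 40.0]}
)
print(check_normalized(demo))  # True
```

In the notebook, the same check could be applied to each file in `norm_dir` after loading it with `pd.read_csv`.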

---

## Key Takeaways
- All datasets are now standardized and comparable on a 0–100 scale.  
- Missing values remain intact for explicit handling in the next step.  
- Ready to proceed to **Notebook 04: Composite Scoring**.  