# 💰 Economic Inequality & Digital Divide

## Problem Statement
Digital access is increasingly tied to economic opportunity, education,
and productivity. Unequal digital penetration can reinforce existing
economic disparities.

This analysis examines how digital access correlates with development outcomes.

## Business & Policy Relevance
- Infrastructure investment planning
- Digital inclusion strategies
- Economic development forecasting

## Target Variables
- Internet penetration
- GDP / income indicators
- Education-related metrics
- Composite Digital Divide Score

## Scope & Limitations
- Country-level analysis
- Internet access used as a proxy for digital inclusion
- Forecasts are short-term and linear


## 🟦 Phase 1 - Data Ingestion & Validation

### Objective

Ingest World Bank WDI bulk CSV data for a fixed set of economic and digital-access indicators, validate structure, and document indicator provenance.

This phase is strictly limited to:

- Loading the selected indicators
- Validating shape, countries, and years
- Snapshotting the raw data

It deliberately involves no reshaping, no enrichment, and no metadata assumptions.

### 🟦 1.1 Load Raw Indicator Data

In [1]:
from utils.path_setup import setup_project_path
from utils.logger import get_logger

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from pathlib import Path

PROJECT_ROOT = setup_project_path()
os.chdir(PROJECT_ROOT)

logger = get_logger("domain3_phase1_eco_digital_divide_eda")

print("Project root:", PROJECT_ROOT)

Project root: d:\def_main\Code\MyProjects\eda-mlops-portfolio


In [2]:
RAW_DIR = Path("datasets/raw/eco-digital")

DATA_FILE = RAW_DIR / "wdi_eco_digital_Data.csv"

df_raw = pd.read_csv(DATA_FILE)

print("Data shape:", df_raw.shape)


Data shape: (1069, 69)
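
The shape is consistent with 4 identifier columns plus 65 year columns (4 + 65 = 69). The structural validation this phase calls for could be made explicit along the following lines. This is a minimal sketch rather than the notebook's own code: `validate_wdi_frame` is a hypothetical helper, and the toy frame stands in for `df_raw`.

```python
import re

import pandas as pd

ID_COLS = ["Country Name", "Country Code", "Series Name", "Series Code"]
# Year columns are assumed to look like "1960 [YR1960]".
YEAR_PATTERN = re.compile(r"^\d{4} \[YR\d{4}\]$")


def validate_wdi_frame(df: pd.DataFrame) -> dict:
    """Check identifier columns exist and classify the remaining columns."""
    missing_ids = [c for c in ID_COLS if c not in df.columns]
    if missing_ids:
        raise ValueError(f"Missing identifier columns: {missing_ids}")
    year_cols = [c for c in df.columns if YEAR_PATTERN.match(c)]
    unexpected = [c for c in df.columns if c not in ID_COLS and c not in year_cols]
    return {
        "n_rows": len(df),
        "n_year_cols": len(year_cols),
        "unexpected_cols": unexpected,
    }


# Toy stand-in for df_raw; the values are illustrative only.
toy = pd.DataFrame({
    "Country Name": ["A"],
    "Country Code": ["AAA"],
    "Series Name": ["Population, total"],
    "Series Code": ["SP.POP.TOTL"],
    "1960 [YR1960]": ["100"],
    "1961 [YR1961]": [".."],
})
report = validate_wdi_frame(toy)
print(report)
```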


### 🟦 1.2 Schema Inspection (Data)

In [3]:
df_raw.columns.tolist()


['Country Name',
 'Country Code',
 'Series Name',
 'Series Code',
 '1960 [YR1960]',
 '1961 [YR1961]',
 '1962 [YR1962]',
 '1963 [YR1963]',
 '1964 [YR1964]',
 '1965 [YR1965]',
 '1966 [YR1966]',
 '1967 [YR1967]',
 '1968 [YR1968]',
 '1969 [YR1969]',
 '1970 [YR1970]',
 '1971 [YR1971]',
 '1972 [YR1972]',
 '1973 [YR1973]',
 '1974 [YR1974]',
 '1975 [YR1975]',
 '1976 [YR1976]',
 '1977 [YR1977]',
 '1978 [YR1978]',
 '1979 [YR1979]',
 '1980 [YR1980]',
 '1981 [YR1981]',
 '1982 [YR1982]',
 '1983 [YR1983]',
 '1984 [YR1984]',
 '1985 [YR1985]',
 '1986 [YR1986]',
 '1987 [YR1987]',
 '1988 [YR1988]',
 '1989 [YR1989]',
 '1990 [YR1990]',
 '1991 [YR1991]',
 '1992 [YR1992]',
 '1993 [YR1993]',
 '1994 [YR1994]',
 '1995 [YR1995]',
 '1996 [YR1996]',
 '1997 [YR1997]',
 '1998 [YR1998]',
 '1999 [YR1999]',
 '2000 [YR2000]',
 '2001 [YR2001]',
 '2002 [YR2002]',
 '2003 [YR2003]',
 '2004 [YR2004]',
 '2005 [YR2005]',
 '2006 [YR2006]',
 '2007 [YR2007]',
 '2008 [YR2008]',
 '2009 [YR2009]',
 '2010 [YR2010]',
 '2011 [YR2011]'

### 🟦 1.3 Indicator Validation

In [4]:
df_raw["Series Name"].unique()


array(['Individuals using the Internet (% of population)',
       'GDP per capita (current US$)', 'Population, total',
       'School enrollment, secondary (% gross)', nan], dtype=object)
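
The `nan` entry means some rows carry no series name. A plausible explanation, assumed here rather than verified against the file, is that the export appends footer or blank lines after the data. A small sketch for isolating such rows for inspection, using toy data in place of `df_raw`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_raw; the last two rows mimic footer or blank lines.
toy = pd.DataFrame({
    "Country Name": ["A", "A", np.nan, "Some footer note"],
    "Series Name": ["Population, total", "GDP per capita (current US$)", np.nan, np.nan],
    "2000 [YR2000]": ["100", "2.5", np.nan, np.nan],
})

# Rows without a series name cannot belong to any indicator.
no_series = toy[toy["Series Name"].isna()]
indicator_rows = toy[toy["Series Name"].notna()]

print(len(no_series), "rows without a series name")
print(no_series)
```

Inspecting `no_series` before dropping it keeps the raw snapshot intact and documents what was excluded.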

### 1.4 Country Coverage Check

In [5]:
df_raw["Country Name"].nunique()


268

### 1.5 Temporal Coverage Check

In [6]:
year_cols = [c for c in df_raw.columns if "[YR" in c]

min_year = min(year_cols)
max_year = max(year_cols)

min_year, max_year, len(year_cols)


('1960 [YR1960]', '2024 [YR2024]', 65)
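
Note that `min` and `max` above compare strings, which gives the right answer only because every label starts with a four-digit year. The range 1960 to 2024 inclusive is 2024 - 1960 + 1 = 65 years, which matches the column count. A sketch that parses the year as an integer, so the check does not depend on label ordering (`parse_year` is a hypothetical helper):

```python
import re


def parse_year(label: str) -> int:
    """Extract the year from a label such as '1960 [YR1960]'."""
    match = re.search(r"\[YR(\d{4})\]", label)
    if match is None:
        raise ValueError(f"Not a year column: {label!r}")
    return int(match.group(1))


labels = ["1960 [YR1960]", "2024 [YR2024]", "1999 [YR1999]"]
years = sorted(parse_year(c) for c in labels)
span = (years[0], years[-1])
print(span)
```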

### 🟦 Phase 1 Summary

#### Completed

- Raw WDI indicator data loaded
- Indicator scope validated (4 indicators, plus a `nan` entry from rows without a series name)
- Country and time coverage confirmed
- Indicator metadata documented

#### Deferred (by design)

- Cleaning → Phase 3
- Reshaping → Phase 3
- Inequality framing → Phase 5
- Index creation → Phase 6

## Phase 2 - Dataset Coverage & Bias
### 🎯 Goal
- Understand coverage, missingness, and temporal alignment
- Identify structural bias in indicators
- No correlation
- No causality
- No interpretation

### 📦 Assumptions (from Phase 1)
- Data is wide format
- One row = (Country, Indicator)
- Years are columns like 2000 [YR2000]

### 2.1 - Setup

In [7]:
df = df_raw.copy()

### 2.2 - Identify Structural Columns

In [8]:
id_cols = ["Country Name", "Country Code", "Series Name", "Series Code"]

year_cols = [
    c for c in df.columns 
    if c.endswith("]")
]


In [9]:
len(year_cols), year_cols[:5], year_cols[-5:]


(65,
 ['1960 [YR1960]',
  '1961 [YR1961]',
  '1962 [YR1962]',
  '1963 [YR1963]',
  '1964 [YR1964]'],
 ['2020 [YR2020]',
  '2021 [YR2021]',
  '2022 [YR2022]',
  '2023 [YR2023]',
  '2024 [YR2024]'])

### 2.3 - Indicator Inventory

In [10]:
indicator_inventory = (
    df[["Series Code", "Series Name"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

indicator_inventory


| | Series Code | Series Name |
|---|---|---|
| 0 | IT.NET.USER.ZS | Individuals using the Internet (% of population) |
| 1 | NY.GDP.PCAP.CD | GDP per capita (current US$) |
| 2 | SP.POP.TOTL | Population, total |
| 3 | SE.SEC.ENRR | School enrollment, secondary (% gross) |
| 4 | NaN | NaN |


### 2.4 - Country Coverage per Indicator

In [11]:
countries_per_indicator = (
    df.groupby("Series Name")["Country Name"]
      .nunique()
      .sort_values(ascending=False)
)

countries_per_indicator


Series Name
GDP per capita (current US$)                        266
Individuals using the Internet (% of population)    266
Population, total                                   266
School enrollment, secondary (% gross)              266
Name: Country Name, dtype: int64

### 2.5 - Temporal Coverage per Indicator

In [12]:
def year_coverage(row):
    return row[year_cols].notna().sum()

df["non_null_years"] = df.apply(year_coverage, axis=1)


In [13]:
coverage_summary = (
    df.groupby("Series Name")["non_null_years"]
      .agg(["min", "median", "max"])
)

coverage_summary


| Series Name | min | median | max |
|---|---|---|---|
| GDP per capita (current US$) | 65 | 65.0 | 65 |
| Individuals using the Internet (% of population) | 65 | 65.0 | 65 |
| Population, total | 65 | 65.0 | 65 |
| School enrollment, secondary (% gross) | 65 | 65.0 | 65 |


### 2.6 - First Available Year (by Indicator)

In [14]:
def first_year_with_data(row):
    for c in year_cols:
        if pd.notna(row[c]):
            return c
    return np.nan


In [15]:
df["first_year"] = df.apply(first_year_with_data, axis=1)


In [16]:
first_year_summary = (
    df.groupby("Series Name")["first_year"]
      .value_counts()
      .sort_index()
)

first_year_summary


Series Name                                       first_year   
GDP per capita (current US$)                      1960 [YR1960]    266
Individuals using the Internet (% of population)  1960 [YR1960]    266
Population, total                                 1960 [YR1960]    266
School enrollment, secondary (% gross)            1960 [YR1960]    266
Name: count, dtype: int64
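
Row-wise `apply` works here but scales poorly on wide frames. A vectorized alternative is sketched below on toy data that is assumed to be numeric already (the result is only meaningful once missing cells are real NaN values). `idxmax` returns the first column where the mask is true, and the `where` call blanks out rows that have no data at all.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the year columns of df, already numeric.
toy = pd.DataFrame(
    {
        "1960 [YR1960]": [np.nan, 1.0, np.nan],
        "1961 [YR1961]": [2.0, 3.0, np.nan],
    },
    index=["late_start", "full", "empty"],
)

has_value = toy.notna()
# idxmax picks the first True per row; rows with no True are reset to NaN.
first_year = has_value.idxmax(axis=1).where(has_value.any(axis=1))
print(first_year)
```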

### 2.7 - Missingness Profile

In [17]:
missing_pct = (
    df[year_cols]
    .isna()
    .mean(axis=1) * 100
)

df["missing_pct"] = missing_pct


In [18]:
missing_summary = (
    df.groupby("Series Name")["missing_pct"]
      .describe()[["min", "mean", "max"]]
)

missing_summary


| Series Name | min | mean | max |
|---|---|---|---|
| GDP per capita (current US$) | 0.0 | 0.0 | 0.0 |
| Individuals using the Internet (% of population) | 0.0 | 0.0 | 0.0 |
| Population, total | 0.0 | 0.0 | 0.0 |
| School enrollment, secondary (% gross) | 0.0 | 0.0 | 0.0 |


### 2.8 - High-Income Bias Check (Structural)

In [19]:
countries_with_any_data = (
    df.groupby("Country Name")[year_cols]
      .apply(lambda x: x.notna().any().any())
)


In [20]:
countries_with_any_data.value_counts()

True     266
False      2
Name: count, dtype: int64
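
The 65 non-null years in section 2.5 and the 0% missing rates in section 2.7 sit oddly next to the roughly 33% missing rate measured after numeric coercion in section 3.6. One plausible cause, assumed here rather than confirmed from the file, is that missing cells are stored as placeholder text such as `..`, which `isna()` and `notna()` do not treat as missing. The sketch below reproduces the effect on toy data and shows a variant that coerces to numeric first.

```python
import pandas as pd

# Toy stand-in for df; ".." is an assumed missing-value placeholder.
toy = pd.DataFrame({
    "Series Name": ["Indicator A", "Indicator B"],
    "2000 [YR2000]": ["..", "1.5"],
    "2001 [YR2001]": ["..", ".."],
})
toy_year_cols = [c for c in toy.columns if c.endswith("]")]

# On raw text, placeholders count as present.
raw_missing_pct = toy[toy_year_cols].isna().mean(axis=1) * 100

# Coercing first turns non-numeric placeholders into NaN.
numeric = toy[toy_year_cols].apply(pd.to_numeric, errors="coerce")
true_missing_pct = numeric.isna().mean(axis=1) * 100

print(raw_missing_pct.tolist(), true_missing_pct.tolist())
```

If the raw file does use such placeholders, the coverage, first-year, and missingness checks in this phase would need to run on the coerced values to be informative.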

### Phase 2 Summary

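- Dataset contains 4 indicators with 65 year columns (1960 to 2024).
- Each indicator lists the same 266 country names; 2 further names have no data in any year column.
- The raw-level checks report 65 non-null years and 0% missing for every indicator. This conflicts with the 33.3% missing rate measured after numeric coercion in section 3.6, which suggests the raw file stores missing values as non-null placeholder text. The raw counts therefore overstate coverage.
- Although the Internet indicator has year columns from 1960, its usable coverage likely begins much later: after cleaning in Phase 3, about 61% of country-year rows have no Internet value.
- After cleaning, secondary enrollment is missing for about 54% of country-year rows and GDP per capita for about 15%.
- Structural missingness exists and must be handled explicitly.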

No transformations or analytical claims were made.


## Phase 3 - Cleaning & Feature Readiness

### Goal

- Reshape wide → long
- Normalize identifiers
- Explicit missing-value handling
- Create normalized / per-capita metrics
- Save processed dataset

### 3.1 - Sanity Checkup

In [21]:
df_raw.shape


(1069, 69)

### 3.2 - Identify Structural Columns

In [22]:
id_cols = [
    "Country Name",
    "Country Code",
    "Series Name",
    "Series Code"
]

year_cols = [c for c in df_raw.columns if c.endswith("]")]


In [23]:
len(year_cols), year_cols[:3], year_cols[-3:]


(65,
 ['1960 [YR1960]', '1961 [YR1961]', '1962 [YR1962]'],
 ['2022 [YR2022]', '2023 [YR2023]', '2024 [YR2024]'])

### 3.3 - Reshape: Wide -> Long (Canonical Form)

**Unit of Analysis:**
- Country x Year x Indicator

In [24]:
df_long = (
    df_raw
    .melt(
        id_vars=id_cols,
        value_vars=year_cols,
        var_name="year_raw",
        value_name="value"
    )
)


In [25]:
df_long.shape


(69485, 6)
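
This shape can be checked against the wide frame: `melt` emits one row per input row and year column, so 1069 × 65 = 69485 rows, and the 4 identifier columns plus `year_raw` and `value` give 6 columns. The same invariant on toy data, as a sketch:

```python
import pandas as pd

# Toy wide frame: 2 rows, 2 identifier columns, 3 year columns.
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Series Name": ["X", "X"],
    "2000 [YR2000]": [1.0, 2.0],
    "2001 [YR2001]": [3.0, None],
    "2002 [YR2002]": [5.0, 6.0],
})
ids = ["Country Name", "Series Name"]
years = [c for c in wide.columns if c.endswith("]")]

long = wide.melt(id_vars=ids, value_vars=years, var_name="year_raw", value_name="value")

# melt keeps missing cells, so rows = wide rows x year columns.
expected_shape = (len(wide) * len(years), len(ids) + 2)
print(long.shape, expected_shape)
```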

### 3.4 - Clean Year Column

In [26]:
df_long["year"] = (
    df_long["year_raw"]
    .str.extract(r"(\d{4})", expand=False)  # expand=False returns a Series, not a one-column DataFrame
    .astype(int)
)


In [27]:
df_long.drop(columns="year_raw", inplace=True)


### 3.5 - Type Enforcement

In [28]:
df_long["value"] = pd.to_numeric(df_long["value"], errors="coerce")


In [29]:
df_long.dtypes

Country Name     object
Country Code     object
Series Name      object
Series Code      object
value           float64
year              int64
dtype: object

### 3.6 - Explicit Missing-Value Policy
We do not impute in this project.

Rules:
- Missing = structural or reporting absence
- Keep as NaN
- Track coverage later (Phase 4+)

In [30]:
missing_rate = df_long["value"].isna().mean() * 100
missing_rate


np.float64(33.25609843851191)
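
The overall rate averages across indicators with very different coverage. A per-indicator breakdown could look like the following sketch, shown on toy data in place of `df_long`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_long with two indicators.
toy_long = pd.DataFrame({
    "Series Name": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
    "value": [1.0, np.nan, np.nan, np.nan, 1.0, 2.0, 3.0, np.nan],
})

# Share of missing values per indicator, in percent.
missing_by_indicator = (
    toy_long["value"].isna().groupby(toy_long["Series Name"]).mean() * 100
)
print(missing_by_indicator)
```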

### 3.7 - Indicator Normalization (Naming)
Create **analysis-safe indicator keys**

In [31]:
INDICATOR_MAP = {
    "Individuals using the Internet (% of population)": "internet_users_pct",
    "GDP per capita (current US$)": "gdp_per_capita_usd",
    "Population, total": "population_total",
    "School enrollment, secondary (% gross)": "secondary_enrollment_pct"
}


In [47]:
df_long = df_long[df_long["Series Name"].notna()].copy()

In [48]:
df_long["indicator"] = df_long["Series Name"].map(INDICATOR_MAP)


In [49]:
df_long["indicator"].isna().sum()


np.int64(0)
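
A zero here confirms every remaining series name has a key. If the indicator list grows, a guard that fails loudly may be safer than reading the count by eye. A sketch, where `map_indicators` is a hypothetical helper and `demo_map` is a one-entry stand-in for `INDICATOR_MAP`:

```python
import pandas as pd


def map_indicators(names: pd.Series, mapping: dict) -> pd.Series:
    """Map series names to keys, raising if any name has no mapping."""
    mapped = names.map(mapping)
    unmapped = names[mapped.isna()].unique().tolist()
    if unmapped:
        raise KeyError(f"Unmapped series names: {unmapped}")
    return mapped


demo_map = {"Population, total": "population_total"}

ok = map_indicators(pd.Series(["Population, total"]), demo_map)

# An unknown name should raise instead of silently becoming NaN.
try:
    map_indicators(pd.Series(["Population, total", "Unknown series"]), demo_map)
    raised = False
except KeyError:
    raised = True
print(ok.tolist(), raised)
```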

### 3.8 - Keep Only Approved Indicators

In [36]:
df_long = df_long[
    df_long["indicator"].isin(INDICATOR_MAP.values())
].copy()


### 3.9 - Create Normalized / Derived Metrics
**GDP total (approximation, transparent)**

In [37]:
df_pivot = (
    df_long
    .pivot_table(
        index=["Country Name", "Country Code", "year"],
        columns="indicator",
        values="value"
    )
    .reset_index()
)
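
One behavior worth checking against the "keep as NaN" policy: with default arguments, `pivot_table` discards groups whose values are all NaN, so a country-year with no data for any indicator drops out of the result instead of remaining as a row of NaN values. The toy sketch below contrasts this with `DataFrame.pivot`, which keeps such rows but requires each country-year-indicator combination to be unique (assumed here).

```python
import numpy as np
import pandas as pd

# Toy long frame: year 2000 has no data for any indicator.
toy_long = pd.DataFrame({
    "country": ["A", "A", "A", "A"],
    "year": [2000, 2000, 2001, 2001],
    "indicator": ["gdp", "internet", "gdp", "internet"],
    "value": [np.nan, np.nan, 1.0, 2.0],
})

# Default pivot_table: the all-NaN country-year disappears.
with_drop = toy_long.pivot_table(
    index=["country", "year"], columns="indicator", values="value"
)

# pivot: every country-year is kept, including all-NaN rows.
kept = toy_long.pivot(index=["country", "year"], columns="indicator", values="value")

print(len(with_drop), len(kept))
```

Whether dropping fully empty country-years is acceptable is a design choice. It may help explain differences in country counts between the raw and pivoted data.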


In [38]:
df_pivot["gdp_total_usd"] = (
    df_pivot["gdp_per_capita_usd"] * df_pivot["population_total"]
)
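
Because GDP per capita is GDP divided by population, multiplying the two recovers an approximate GDP total. A missing value in either input propagates to the product, which is why the derived column cannot have better coverage than its inputs. A toy sketch of that behavior:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_pivot; the values are illustrative only.
toy = pd.DataFrame({
    "gdp_per_capita_usd": [2.0, np.nan],
    "population_total": [10.0, 10.0],
})

# NaN in either factor yields NaN in the product.
toy["gdp_total_usd"] = toy["gdp_per_capita_usd"] * toy["population_total"]
print(toy)
```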


### 3.10 - Final Column Standardization

In [39]:
df_final = df_pivot.rename(columns={
    "Country Name": "country",
    "Country Code": "country_code"
})


In [46]:
df_final.columns.name = None

df_final.columns

Index(['country', 'country_code', 'year', 'gdp_per_capita_usd',
       'internet_users_pct', 'population_total', 'secondary_enrollment_pct',
       'gdp_total_usd'],
      dtype='object')

### 3.11 - Sanity Checks

In [41]:
df_final["country"].nunique()


265

In [42]:
df_final["year"].min(), df_final["year"].max()


(np.int64(1960), np.int64(2024))

In [43]:
df_final.isna().mean().sort_values(ascending=False)


indicator
internet_users_pct          0.608142
secondary_enrollment_pct    0.541553
gdp_per_capita_usd          0.153184
gdp_total_usd               0.153184
year                        0.000000
country_code                0.000000
country                     0.000000
population_total            0.000000
dtype: float64

### 3.12 - Save Processed Dataset

In [45]:
PROCESSED_DIR = Path("datasets/processed/eco_digital")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

output_path = PROCESSED_DIR / "eco_digital_clean_long.csv"
df_final.to_csv(output_path, index=False)


In [50]:
output_path


WindowsPath('datasets/processed/eco_digital/eco_digital_clean_long.csv')
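
A quick round-trip check can confirm that the saved file preserves shape and missing values. The sketch below uses a temporary directory and toy data instead of the project paths:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy stand-in for df_final.
toy = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2000, 2000],
    "internet_users_pct": [np.nan, 12.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.csv"
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# NaN is written as an empty field and read back as NaN.
print(reloaded.shape, int(reloaded["internet_users_pct"].isna().sum()))
```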

### Phase 3 Summary

- Data reshaped to long country-year-indicator form, then pivoted to one row per country-year with one column per indicator
- Years parsed and standardized
- Indicator names normalized and locked
- Missing values preserved explicitly
- Derived GDP total created transparently
- Dataset saved for EDA

No analysis or interpretation performed.
