### Monday 31th of March, 2025

## EDA Datathon project IV
## Global Poverty EDA

### Juan Domingo

## Citations

### Share of population living in extreme poverty
Michalis Moatsos (2021) – with major processing by Our World in Data. “Share of population living in extreme poverty – Long-run estimates” [dataset]. Michalis Moatsos, “Global extreme poverty - Present and past since 1820” [original data]. Retrieved April 1, 2025 from https://ourworldindata.org/grapher/share-of-population-living-in-extreme-poverty

### Gini CoefficientWorld Bank
World Bank Poverty and Inequality Platform (2024) – with major processing by Our World in Data. “Gini Coefficient – World Bank” [dataset]. World Bank Poverty and Inequality Platform, “World Bank Poverty and Inequality Platform (PIP) 20240627_2017, 20240627_2011” [original data]. Retrieved April 1, 2025 from https://ourworldindata.org/grapher/economic-inequality-gini-index

### GDP per capitaIn constant international-$ World Bank

Data compiled from multiple sources by World Bank (2025) – with minor processing by Our World in Data. “GDP per capita – World Bank – In constant international-$” [dataset]. Data compiled from multiple sources by World Bank, “World Development Indicators” [original data]. Retrieved April 1, 2025 from https://ourworldindata.org/grapher/gdp-per-capita-worldbank


In [1]:
# Data Analysis:
import pandas as pd
import numpy as np

In [2]:
# Display:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### Virtual enviroment checking:

In [3]:
import sys
print(sys.executable)  # Debe mostrar la ruta a tu entorno virtual
#!which python  # En Linux/Mac

/home/juandomingo/factoriaf5/mod02-projs/eda-datathon-04/.venv/bin/python3


In [4]:
import sys
print(sys.prefix)  # Muestra el directorio base del entorno actual

/home/juandomingo/factoriaf5/mod02-projs/eda-datathon-04/.venv


In [5]:
# Graph configuration
plt.style.use('ggplot') # Nice style
# To display graphs in the notebook.
%matplotlib inline 

In [6]:
# Load files (adjust paths if necessary)
poverty = pd.read_csv("data/share-of-population-living-in-extreme-poverty.csv")
gdp = pd.read_csv("data/gdp-per-capita-worldbank.csv")
gini = pd.read_csv("data/economic-inequality-gini-index.csv")

# Rename columns for clarity
poverty = poverty. rename(columns={"$1.90 a day - Share of population in poverty (Smoothed)": "poverty_rate"})
gdp = gdp.rename(columns={"GDP per capita, PPP (constant 2021 international $)": "gdp_per_capita"})
gini = gini.rename(columns={"Gini coefficient": "gini_index"})

In [10]:
# Combine poverty and GDP
df = pd.merge(
 poverty, 
 gdp, 
 on=["Code", "Year", "Entity"], # "Entity" = country name
 how="inner" # Keep only records with data in both
)

# Combine with Gini
df = pd.merge(
 df, 
 gini, 
 on=["Code", "Year", "Entity"], 
 how="left" # Keep all previous records
)

# View first rows
df.tail()

Unnamed: 0,Entity,Code,Year,poverty_rate,818119-annotations,gdp_per_capita,gini_index,990179-annotations
24,World,OWID_WRL,2014,10.106801,,17370.57,,
25,World,OWID_WRL,2015,10.604952,,17734.984,,
26,World,OWID_WRL,2016,9.568008,,18098.455,,
27,World,OWID_WRL,2017,9.322068,,18575.725,,
28,World,OWID_WRL,2018,8.614398,,19046.809,,


### Cleaning Data

In [8]:
import pandas as pd
import numpy as np
from scipy import stats
import json

with open('utils/region_mapping.json', 'r', encoding='utf-8-sig') as f:
    region_mapping = json.load(f)

# 1. Remove irrelevant columns
df_clean = df.drop(columns=['417485-annotations', 'Continent'], errors='ignore') # Ignore if no

# 2. Filter years (2000-2022 to have recent data but with coverage) Filter years (2000-2022 to have recent data but with coverage)
df_clean = df_clean[(df_clean['Year'] >= 2000) & (df_clean['Year'] <= 2022)]

# 3. Advanced missing value management
# 3.1. Remove countries less than 5 years old Remove countries with less than 5 years of data
min_years = 5
counts = df_clean['Code'].value_counts()
valid_countries = counts[counts >= min_years].index
df_clean = df_clean[df_clean['Code'].isin(valid_countries)]

# 3. Imputation by regional/year medians
df_clean['Region'] = df_clean['Entity'].map(region_mapping)
df_clean['gdp_per_capita'] = df_clean.groupby(['Region', 'Year'])['gdp_per_capita'].transform(
 lambda x: x.fillna(x.median())
)

# 4. Outlier detection and treatment
# 4.1. Use Z-score for numeric variables
numeric_cols = ['poverty_rate', 'gdp_per_capita', 'gini_index']
z_scores = np.abs(stats.zscore(df_clean[numeric_cols].dropna()))
df_clean = df_clean[(z_scores < 3).all(axis=1)] # Remove |Z| > 3

# 5. Validate consistency # 5.1. Remove outliers. Validation of consistency
# 5.1. Ensure poverty rate is between 0-100
df_clean = df_clean[(df_clean['poverty_rate'] >= 0) & (df_clean['poverty_rate'] <= 100)]

# 5.2. Positive GDP per capita
df_clean = df_clean[df_clean['gdp_per_capita'] > 0]

# 6. Normalize country names
# 6.1. Unify names (e.g., "Ivory Coast" -> "Ivory Coast")
df_clean['Entity'] = df_clean['Entity'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')

# 7. Reset index after operations 7. Reset index after operations
df_clean = df_clean.reset_index(drop=True)

# 8. Final verification print(f“% Original”). Final verification
print(f"% Original data maintained: {len(df_clean)/len(df)*100:.2f}%")
print("Post-cleaning summary:")
print(df_clean.describe())

# Export to CSV
os.makedirs('data', exist_ok=True)
df_clean.to_csv("data/poverty_dataset_clean.csv", index=False)

ValueError: Item wrong length 0 instead of 19.