# Wealth of Nations – Economic and Social Indicators Across Countries

This notebook summarizes the main steps and findings of the project:

- Load the cleaned country-level dataset
- Explore descriptive statistics and correlations
- Visualize key relationships between GDP per capita and social indicators
- Estimate multiple regression models (global and by continent)
- Use K-means clustering to identify “development profiles” of countries


## 1. Setup and Data Loading

We use the cleaned dataset `data/clean_country_data.csv`, which already contains:

- `gdp_per_capita` (target variable)
- `life_expectancy` (average of male and female)
- `internet_users`, `fertility`, `unemployment`, `urban_population_growth`
- `continent`, `region`, `name`, `iso2`
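
Before running the later cells, it can help to confirm that the file really contains these columns. The sketch below is one possible check, not part of the original project: `check_columns` and the two-row `demo` frame are illustrative, the values are made up, and `life_expectancy_male` / `life_expectancy_female` are assumed names for whatever the raw source called them.

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "gdp_per_capita", "life_expectancy", "internet_users", "fertility",
    "unemployment", "urban_population_growth",
    "continent", "region", "name", "iso2",
]

def check_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required columns that are missing from `frame`."""
    return [col for col in required if col not in frame.columns]

# Made-up two-row frame; the male/female column names are hypothetical.
demo = pd.DataFrame({
    "name": ["Country A", "Country B"],
    "life_expectancy_male": [70.0, 80.0],
    "life_expectancy_female": [76.0, 84.0],
})

# Average of the two sex-specific values, as described for `life_expectancy`.
demo["life_expectancy"] = demo[
    ["life_expectancy_male", "life_expectancy_female"]
].mean(axis=1)

missing = check_columns(demo)
print(demo["life_expectancy"].tolist())  # [73.0, 82.0]
print("Missing columns:", missing)
```

On the real data, `check_columns(df)` returning an empty list would indicate that every column used later in the notebook is present.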


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

plt.style.use("default")
sns.set_context("talk")

df = pd.read_csv("data/clean_country_data.csv")
df.head()


## 2. Correlations with GDP per Capita

Here we look at which variables are most strongly correlated with GDP per capita, the main variable of interest.


In [None]:
corr_cols = [
    "gdp_per_capita",
    "life_expectancy",
    "fertility",
    "internet_users",
    "unemployment",
    "urban_population_growth",
]

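# Pairwise Pearson correlations; keep only the GDP per capita column,
# sorted from the strongest positive to the strongest negative correlation.
corr = df[corr_cols].corr()
corr["gdp_per_capita"].sort_values(ascending=False)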
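
Pearson correlation, the default of `DataFrame.corr`, captures linear association only. When a relationship is monotonic but curved, a rank-based (Spearman) correlation can serve as a cross-check. The sketch below uses made-up numbers rather than the project data, and the `toy` frame and its `predictor` column are purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic example: income grows exponentially with the predictor,
# so the relationship is perfectly monotonic but not linear.
x = np.arange(1, 21, dtype=float)
toy = pd.DataFrame({"predictor": x, "gdp_per_capita": np.exp(x / 4.0)})

pearson = toy.corr(method="pearson").loc["predictor", "gdp_per_capita"]
spearman = toy.corr(method="spearman").loc["predictor", "gdp_per_capita"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real data, the same comparison would be `df[corr_cols].corr(method="spearman")["gdp_per_capita"]`. A large gap between the two measures would hint that a straight line is a poor summary of that relationship.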


## 3. Visualizing Key Relationships

We visualize how GDP per capita relates to life expectancy, internet usage, and fertility.  
Points are colored by continent.

In [None]:
def scatter_gdp_vs(var, df_plot=None):
    if df_plot is None:
        df_plot = df
    plt.figure(figsize=(8, 6))
    sns.scatterplot(
        data=df_plot,
        x=var,
        y="gdp_per_capita",
        hue="continent",
        alpha=0.8,
    )
    plt.xlabel(var.replace("_", " ").title())
    plt.ylabel("GDP per capita")
    plt.title(f"GDP per capita vs {var.replace('_', ' ').title()}")
    plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.tight_layout()
    plt.show()

for v in ["life_expectancy", "internet_users", "fertility"]:
    scatter_gdp_vs(v)


## 4. Global Multiple Regression

We estimate a linear model of GDP per capita on several predictors:

$$
\text{GDPpc}_i = \beta_0 + \beta_1 \cdot \text{life\_expectancy}_i
+ \beta_2 \cdot \text{internet\_users}_i
+ \beta_3 \cdot \text{fertility}_i
+ \beta_4 \cdot \text{unemployment}_i
+ \beta_5 \cdot \text{urban\_population\_growth}_i + \varepsilon_i
$$


In [None]:
reg_cols = [
    "life_expectancy",
    "internet_users",
    "fertility",
    "unemployment",
    "urban_population_growth",
]

reg_df = df.dropna(subset=["gdp_per_capita"] + reg_cols).copy()
X = reg_df[reg_cols]
X = sm.add_constant(X)
y = reg_df["gdp_per_capita"]

global_model = sm.OLS(y, X).fit()
global_model.summary()


### 4.1 Interpretation (Global Model)

- The R-squared tells us how much of the variation in GDP per capita is explained by the predictors.
- Coefficients on **life expectancy** and **internet users** are positive and statistically significant.
- **Fertility** tends to have a negative association with GDP per capita.
- **Unemployment** and **urban population growth** contribute less to the overall explanatory power.
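
To make the first bullet concrete, the sketch below computes R-squared by hand as one minus the ratio of residual to total variation. It runs on synthetic data with NumPy only, so the numbers say nothing about the project's results. For the fitted model in this notebook, the same quantity is available as `global_model.rsquared`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two predictors plus noise (not the project data).
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Ordinary least squares with an intercept column.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta

ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")
```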


## 5. Regression by Continent

Next, we fit the same model separately for each continent to see whether the relationships differ across regions.


In [None]:
continents = sorted(df["continent"].dropna().unique())
results_by_continent = {}

for cont in continents:
    sub = df[df["continent"] == cont].dropna(subset=["gdp_per_capita"] + reg_cols)
    if len(sub) < 8:  # skip continents with too few complete rows for a 6-parameter model
        continue
    X_sub = sm.add_constant(sub[reg_cols])
    y_sub = sub["gdp_per_capita"]
    m = sm.OLS(y_sub, X_sub).fit()
    results_by_continent[cont] = m
    print(f"\n=== {cont} ===")
    print(m.summary().tables[1])


## 6. K-Means Clustering

Finally, we use K-means clustering to group countries into development
profiles based on:

- `gdp_per_capita`
- `life_expectancy`
- `internet_users`
- `fertility`
- `unemployment`

We standardize the features first, so that no variable dominates the distance calculation because of its units, and then fit a K-means model with 4 clusters.

In [None]:
cluster_cols = [
    "gdp_per_capita",
    "life_expectancy",
    "internet_users",
    "fertility",
    "unemployment",
]

clust_df = df.dropna(subset=cluster_cols).copy()
X_clust = clust_df[cluster_cols].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_clust)

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
clust_df["cluster"] = kmeans.fit_predict(X_scaled)

clust_df[
    [
        "name",
        "continent",
        "gdp_per_capita",
        "life_expectancy",
        "internet_users",
        "fertility",
        "unemployment",
        "cluster",
    ]
].head()


In [None]:
cluster_summary = clust_df.groupby("cluster")[cluster_cols].mean()
cluster_summary
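
The number of clusters is fixed at 4 in this notebook. One common way to sanity-check such a choice is to compare silhouette scores across several values of `k`. The sketch below does this on synthetic, well-separated groups, not on the project data, so it illustrates only the mechanics. The variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: four well-separated groups in two dimensions.
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
points_scaled = StandardScaler().fit_transform(points)

# Fit K-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(points_scaled)
    scores[k] = silhouette_score(points_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

On the real data, `X_scaled` would take the place of `points_scaled`. Real country data is rarely this cleanly separated, so the scores may be close together and the choice of `k` can also rest on interpretability.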


### 6.1 Visualizing Clusters

We can plot GDP per capita vs life expectancy and color points by
their cluster assignment.

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=clust_df,
    x="life_expectancy",
    y="gdp_per_capita",
    hue="cluster",
    palette="tab10",
)

# Label each point with its ISO2 country code, slightly offset from the marker.
for _, row in clust_df.iterrows():
    plt.text(
        row["life_expectancy"] + 0.05,
        row["gdp_per_capita"] + 50,
        row["iso2"],
        fontsize=6,
    )

plt.xlabel("Life Expectancy")
plt.ylabel("GDP per capita")
plt.title("Clusters of Countries by Development Profile")
plt.tight_layout()
plt.show()


## 7. Summary of Findings

- Countries with higher **life expectancy** and higher **internet use** tend to have much higher GDP per capita.
- **Higher fertility** is generally associated with lower income per person.
- The strength and direction of these relationships vary by continent, suggesting that regional context matters.
- **K-means clustering** reveals distinct development profiles such as:
  - High-income digital economies,
  - Low-income high-fertility economies,
  - Intermediate transition economies.

