# Project 2: Mobile Data and Financial Inclusion

Welcome to my Project 2 notebook! The goal of this assignment is to take two different datasets, combine them in Python, and create a single visualization and statistical analysis that shows the relationship.
For my project, I want to explore a topic I care about: digital connectivity and financial access.

Topic: Does mobile broadband adoption correlate with greater financial inclusion?

Research Question: To what extent does the penetration of mobile broadband subscriptions (per 100 people) correlate with the rate of account ownership at a financial institution or mobile-money service (% ages 15+) across various countries and years?

Data Indicators

We are using two key World Bank Data360 indicators:

1. WB_FINDEX_ACCOUNT_T_D will serve as our financial inclusion dependent variable (Y)
2. ITU_DH_MOB_SUB_PER_100 will be our mobile broadband adoption independent variable (X)

Ready? Let's do it!

# Part 1: Loading & Cleaning Data
First, we'll import our packages, define the paths and make sure that our data is working:

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr, linregress
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio

# Define the file paths for the uploaded files
FINDEX_FILE = "WB_FINDEX_ACCOUNT_T_D_WIDEF.csv"
MOBILE_FILE = "ITU_DH_MOB_SUB_PER_100_WIDEF.csv"

# Set Plotly renderer for Jupyter/VS Code environments
pio.renderers.default = "notebook_connected+plotly_mimetype"


Mobile broadband raw data shape: (213167, 43)
Financial inclusion (Findex) raw data shape: (2193, 43)


In [10]:
# Load the raw data
try:
    df_findex_raw = pd.read_csv(FINDEX_FILE)
    df_mobile_raw = pd.read_csv(MOBILE_FILE)
    print("Data files loaded successfully.")
except FileNotFoundError as e:
    print(f"Error: One or more files not found. Check the file names. {e}")
    exit()

# Display the first few rows of each raw dataset for inspection
print("\n--- Findex Raw Data Head ---")
df_findex_raw.head

print("\n--- Mobile Raw Data Head ---")
df_mobile_raw.head

Data files loaded successfully.

--- Findex Raw Data Head ---

--- Mobile Raw Data Head ---


<bound method NDFrame.head of     FREQ FREQ_LABEL REF_AREA               REF_AREA_LABEL  \
0      A     Annual      ABW                        Aruba   
1      A     Annual      AFE  Africa Eastern and Southern   
2      A     Annual      AFG                  Afghanistan   
3      A     Annual      AFW   Africa Western and Central   
4      A     Annual      AGO                       Angola   
..   ...        ...      ...                          ...   
257    A     Annual      XKX                       Kosovo   
258    A     Annual      YEM                  Yemen, Rep.   
259    A     Annual      ZAF                 South Africa   
260    A     Annual      ZMB                       Zambia   
261    A     Annual      ZWE                     Zimbabwe   

                  INDICATOR  \
0    ITU_DH_MOB_SUB_PER_100   
1    ITU_DH_MOB_SUB_PER_100   
2    ITU_DH_MOB_SUB_PER_100   
3    ITU_DH_MOB_SUB_PER_100   
4    ITU_DH_MOB_SUB_PER_100   
..                      ...   
257  ITU_DH_MOB_SUB_

In [7]:
# Display all columns from the Mobile raw data to confirm the data structure
print("\n--- Mobile Data Columns (Full List) ---")
print(df_mobile_raw.columns.tolist())


--- Mobile Data Columns (Full List) ---
['FREQ', 'FREQ_LABEL', 'REF_AREA', 'REF_AREA_LABEL', 'INDICATOR', 'INDICATOR_LABEL', 'UNIT_MEASURE', 'UNIT_MEASURE_LABEL', 'AGG_METHOD', 'AGG_METHOD_LABEL', 'DATABASE_ID', 'DATABASE_ID_LABEL', 'UNIT_MULT', 'UNIT_MULT_LABEL', 'OBS_STATUS', 'OBS_STATUS_LABEL', 'OBS_CONF', 'OBS_CONF_LABEL', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023', '2024']


# Cleaning Dataset 1 (Financial Inclusion - Y)

First, I’ll load the data for the dependent variable (Y), Account Ownership, 
sourced from the World Bank’s Global Findex Database. 
This dataset provides intermittent data points (2011, 2014, 2017, 2021, 2022, 2024).


Account ownership at a financial institution or mobile-money service (% ages 15+ ) (Findex)
**Code:** `WB_FINDEX_ACCOUNT_T_D`

## Cleaning the Policy Data

My goal is to isolate the overall account ownership percentage for the entire adult population (ages 15+) and then transform the data structure for merging.

Thus, we conduct the following steps:

Filter the raw data to include only the total population for the main indicator: INDICATOR = WB_FINDEX_ACCOUNT_T_D, AGE = Y_GE15, SEX = _T, and COMP_BREAKDOWN_1 = _T.

Convert the data from wide format (years as columns) to long format (one row per Country-Year observation). The value column is renamed to Account_Ownership.

Drop rows with missing Account_Ownership values.

Rename REF_AREA to Code and REF_AREA_LABEL to Country for clarity.

In [None]:
# Identify year columns in the Findex data (columns that are numeric strings)
findex_year_cols = [
    col for col in df_findex_raw.columns if str(col).isdigit() and int(col) >= 2011
]

# 1. Filter for the specific indicator and demographic breakdown:
#    - INDICATOR: 'WB_FINDEX_ACCOUNT_T_D' (Account ownership)
#    - AGE: 'Y_GE15' (15 years old and over)
#    - SEX: '_T' (Total)
#    - COMP_BREAKDOWN_1: '_T' (Total)
df_findex_filtered = df_findex_raw[
    (df_findex_raw["INDICATOR"] == "WB_FINDEX_ACCOUNT_T_D")
    & (df_findex_raw["AGE"] == "Y_GE15")
    & (df_findex_raw["SEX"] == "_T")
    & (df_findex_raw["COMP_BREAKDOWN_1"] == "_T")
].copy()

# 2. Convert from wide to long format
df_findex_long = pd.melt(
    df_findex_filtered,
    id_vars=["REF_AREA", "REF_AREA_LABEL"],
    value_vars=findex_year_cols,
    var_name="Year",
    value_name="Account_Ownership",  # Y variable
)

# 3. Clean and convert data types
df_findex_long["Year"] = df_findex_long["Year"].astype(int)
df_findex_long["Account_Ownership"] = pd.to_numeric(
    df_findex_long["Account_Ownership"], errors="coerce"
)
df_findex_long.dropna(subset=["Account_Ownership"], inplace=True)

# Select relevant columns
df_findex_long = df_findex_long[
    ["REF_AREA", "REF_AREA_LABEL", "Year", "Account_Ownership"]
]
df_findex_long.rename(
    columns={"REF_AREA": "Code", "REF_AREA_LABEL": "Country"}, inplace=True
)

print(f"Findex data cleaned. Total observations: {len(df_findex_long)}")
print("\n--- Findex Long Data Head ---")
print(df_findex_long.head())


Findex data cleaned. Total observations: 4025

--- Findex Long Data Head ---
  Code          Country  Year  Account_Ownership
0  MEX           Mexico  2011          13.400245
1  MKD  North Macedonia  2011          82.886250
2  UGA           Uganda  2011           3.493182
3  PHL      Philippines  2011          23.217293
4  MYS         Malaysia  2011          52.513229


# Part 2: Loading & Cleaning Dataset 2 (Mobile Broadband - X)

Now for the independent variable (X), Mobile Broadband Subscriptions. I’m using the data from the ITU’s ICT Indicators Database, which provides annual data from 2000 through 2024.

In [None]:
## 3. Data Cleaning and Transformation - Mobile Broadband (X)
# **Indicator:** Mobile broadband subscriptions (per 100 people)
# **Code:** `IT.CEL.BBND.P2` (The file uses `ITU_DH_MOB_SUB_PER_100`)

# %%
# Identify year columns in the Mobile data
mobile_year_cols = [
    col for col in df_mobile_raw.columns if str(col).isdigit() and int(col) >= 2000
]

# 1. Filter for the specific indicator (IT.CEL.BBND.P2 might be derived, checking the INDICATOR column name used in the file)
# The file likely uses 'ITU_DH_MOB_SUB_PER_100' or similar code for mobile subscriptions
df_mobile_filtered = df_mobile_raw[
    df_mobile_raw["INDICATOR"] == "ITU_DH_MOB_SUB_PER_100"
].copy()

# 2. Convert from wide to long format
df_mobile_long = pd.melt(
    df_mobile_filtered,
    id_vars=["REF_AREA", "REF_AREA_LABEL"],
    value_vars=mobile_year_cols,
    var_name="Year",
    value_name="Mobile_Subscriptions",  # X variable
)

# 3. Clean and convert data types
df_mobile_long["Year"] = df_mobile_long["Year"].astype(int)
df_mobile_long["Mobile_Subscriptions"] = pd.to_numeric(
    df_mobile_long["Mobile_Subscriptions"], errors="coerce"
)
df_mobile_long.dropna(subset=["Mobile_Subscriptions"], inplace=True)

# Select relevant columns
df_mobile_long = df_mobile_long[["REF_AREA", "Year", "Mobile_Subscriptions"]]
df_mobile_long.rename(columns={"REF_AREA": "Code"}, inplace=True)

print(f"Mobile data cleaned. Total observations: {len(df_mobile_long)}")
print("\n--- Mobile Long Data Head ---")
print(df_mobile_long.head())

Mobile data cleaned. Total observations: 6210

--- Mobile Long Data Head ---
  Code  Year  Mobile_Subscriptions
0  ABW  2000             16.899300
1  AFE  2000              3.726379
2  AFG  2000              0.000000
3  AFW  2000              1.240855
4  AGO  2000              0.159347


# Part 3: Merging our Datasets

Now I’ll merge the two DataFrames on the common keys: Code (Country) and Year. 

An inner join is used to ensure we only retain observations where both indicators are present for the exact same country and year.

We also filter out known regional aggregate codes (e.g., 'WLD', 'SSA').

In [None]:
# Merge the two long format DataFrames using an inner join to keep only matching (Code, Year) pairs
df_merged = pd.merge(df_findex_long, df_mobile_long, on=["Code", "Year"], how="inner")

# Filter out regional aggregates (codes that are not typical country codes, e.g., 'WLD', 'SSA', etc.)
# A simple filter: exclude codes shorter than 3 characters or known aggregates.
# Keeping only 3-letter codes might be too aggressive, so we'll use a blacklist/heuristic:
aggregate_codes = [
    "WLD",
    "LMY",
    "UMC",
    "HIC",
    "LIC",
    "MIC",
    "SSA",
    "EAS",
    "ECS",
    "LCN",
    "MEA",
    "NAC",
    "SAS",
    "XCD",
]
df_merged = df_merged[~df_merged["Code"].isin(aggregate_codes)]

# Final check
print(f"Final merged dataset size: {len(df_merged)} data points.")
print("\n--- Merged Data Head (ready for analysis) ---")
print(df_merged.head())

# Define X and Y for analysis
X = df_merged["Mobile_Subscriptions"]
Y = df_merged["Account_Ownership"]

Final merged dataset size: 3616 data points.

--- Merged Data Head (ready for analysis) ---
  Code          Country  Year  Account_Ownership  Mobile_Subscriptions
0  MEX           Mexico  2011          13.400245               82.0725
1  MKD  North Macedonia  2011          82.886250              108.4680
2  UGA           Uganda  2011           3.493182               50.0591
3  PHL      Philippines  2011          23.217293               95.8688
4  MYS         Malaysia  2011          52.513229              125.7160


# Statistical Analysis (Global)

In this step, I calculate the overall correlation between mobile broadband penetration and financial-account ownership across all countries and all years in the dataset. 

After merging both indicators into a single dataframe, I compute:
- the Pearson correlation coefficient
- a simple linear regression (slope and intercept)
- the R² value

This gives me a global, pooled estimate of how strongly the two variables move together, regardless of region or income level.

In [None]:
# ## 5. Statistical Analysis
# Calculate the Pearson correlation coefficient ($r$), perform linear regression, and determine the R-squared ($R^2$) value.

# %%
# 1. Calculate Pearson's r
r_value, p_value = pearsonr(X, Y)

# 2. Perform Linear Regression (Y = mX + b)
slope, intercept, r_value_linreg, p_value_linreg, stderr = linregress(X, Y)

# 3. Calculate R-squared (coefficient of determination)
r_squared = r_value**2

# Output the results
print(f"--- Correlation & Regression Results (n={len(df_merged)}) ---")
print(f"Pearson Correlation Coefficient (r): {r_value:.4f}")
print(f"Coefficient of Determination (R²): {r_squared:.4f}")
print(f"P-value: {p_value:.4e}")
print(f"Regression Equation (Y = mX + b): Y = {slope:.3f} * X + {intercept:.3f}")
print(f"Interpretation:")

if abs(r_value) >= 0.7:
    strength = "Very Strong"
elif abs(r_value) >= 0.5:
    strength = "Strong"
elif abs(r_value) >= 0.3:
    strength = "Moderate"
else:
    strength = "Weak or Non-existent"

direction = "Positive" if r_value > 0 else "Negative"

print(
    f"There is a {strength} {direction} correlation between Mobile Broadband Subscriptions and Account Ownership."
)
# Cast the numpy.bool result of the comparison to a standard Python integer using int()
print(
    f"This means that as mobile broadband subscriptions (per 100 people) {('increase', 'decrease')[int(r_value < 0)]}, the percentage of people with an account tends to {('increase', 'decrease')[int(r_value < 0)]} as well."
)


--- Correlation & Regression Results (n=3616) ---
Pearson Correlation Coefficient (r): 0.4406
Coefficient of Determination (R²): 0.1941
P-value: 1.2701e-171
Regression Equation (Y = mX + b): Y = 0.347 * X + 22.293
Interpretation:
There is a Moderate Positive correlation between Mobile Broadband Subscriptions and Account Ownership.
This means that as mobile broadband subscriptions (per 100 people) increase, the percentage of people with an account tends to increase as well.


# Statistical Analysis (Sub-Saharan Africa)

To focus our analysis and ensure regional relevance, we filter the merged data specifically for the Sub-Saharan Africa (SSA) region. 

We then calculate the overall Pearson correlation coefficient ($r$) and perform linear regression for this regional subset across all years.

In [None]:
# ## 5. Statistical Analysis (Sub-Saharan Africa Focus)
# Calculate the Pearson correlation coefficient ($r$), perform linear regression, and determine the R-squared ($R^2$) value specifically for the Sub-Saharan Africa (SSA) region across all available years.

# %%
# 1. Define SSA Country Codes (re-using the list from Section 6/6.1 for consistency)
SSA_CODES = [
    "AGO",
    "BDI",
    "BEN",
    "BFA",
    "BWA",
    "CMR",
    "COG",
    "CIV",
    "ETH",
    "GHA",
    "KEN",
    "LBR",
    "MLI",
    "MOZ",
    "NAM",
    "NGA",
    "RWA",
    "SEN",
    "SLE",
    "SOM",
    "TZA",
    "UGA",
    "ZAF",
    "ZMB",
    "ZWE",
]

# 2. Filter the merged data for the SSA region
df_ssa = df_merged[df_merged["Code"].isin(SSA_CODES)].copy()

print(f"Sub-Saharan Africa merged dataset size: {len(df_ssa)} data points.")

# 3. Define X and Y for SSA analysis
X = df_ssa["Mobile_Subscriptions"]
Y = df_ssa["Account_Ownership"]

# 4. Calculate Pearson's r
r_value, p_value = pearsonr(X, Y)

# 5. Perform Linear Regression (Y = mX + b)
slope, intercept, r_value_linreg, p_value_linreg, stderr = linregress(X, Y)

# 6. Calculate R-squared (coefficient of determination)
r_squared = r_value**2

# Output the results
print(
    f"\n--- Correlation & Regression Results (SUB-SAHARAN AFRICA ACROSS ALL YEARS, n={len(df_ssa)}) ---"
)
print(f"Pearson Correlation Coefficient (r): {r_value:.4f}")
print(f"Coefficient of Determination (R²): {r_squared:.4f}")
print(f"P-value: {p_value:.4e}")
print(f"Regression Equation (Y = mX + b): Y = {slope:.3f} * X + {intercept:.3f}")
print(f"Interpretation:")

if abs(r_value) >= 0.7:
    strength = "Very Strong"
elif abs(r_value) >= 0.5:
    strength = "Strong"
elif abs(r_value) >= 0.3:
    strength = "Moderate"
else:
    strength = "Weak or Non-existent"

direction = "Positive" if r_value > 0 else "Negative"

print(
    f"There is a {strength} {direction} correlation between Mobile Broadband Subscriptions and Account Ownership in Sub-Saharan Africa."
)
# Cast the numpy.bool result of the comparison to a standard Python integer using int()
print(
    f"This means that as mobile broadband subscriptions (per 100 people) {('increase', 'decrease')[int(r_value < 0)]}, the percentage of people with an account tends to {('increase', 'decrease')[int(r_value < 0)]} as well."
)


Sub-Saharan Africa merged dataset size: 522 data points.

--- Correlation & Regression Results (SUB-SAHARAN AFRICA ACROSS ALL YEARS, n=522) ---
Pearson Correlation Coefficient (r): 0.4118
Coefficient of Determination (R²): 0.1696
P-value: 8.7438e-23
Regression Equation (Y = mX + b): Y = 0.252 * X + 23.688
Interpretation:
There is a Moderate Positive correlation between Mobile Broadband Subscriptions and Account Ownership in Sub-Saharan Africa.
This means that as mobile broadband subscriptions (per 100 people) increase, the percentage of people with an account tends to increase as well.


# Part 4: Visualizing our Data (Sub-Saharan Africa)
Since we have many datapoints in this anlysis, we will go with the subset of countries from Sub-saharan Africa. 
We want a chart displays:

- Scatter Points: Individual SSA countries plotted by their Account Ownership (Y) vs. Mobile Subscriptions (X).

- Regression Line: A red dashed line showing the linear trend for that specific year's data.

- R² Value: An annotation on each subplot shows the $R^2$ value for that year, indicating the goodness of fit for the linear model.

The visual output should reveal:

- A clear, strong positive trend is visible across most years.

- The slope of the regression line may change over time, indicating that the rate at which mobile broadband drives financial inclusion may be slowing or accelerating.

- The clustering of countries shifts, reflecting the growth in mobile broadband penetration across the region from 2011 to 2024.

In [None]:
# Define a subset of countries for regional analysis (Sub-Saharan Africa - SSA) to reduce plot compression
# This includes countries like Kenya, Nigeria, South Africa, Tanzania, etc., which are key mobile money markets.
SSA_CODES = [
    "AGO",
    "BDI",
    "BEN",
    "BFA",
    "BWA",
    "CMR",
    "COG",
    "CIV",
    "ETH",
    "GHA",
    "KEN",
    "LBR",
    "MLI",
    "MOZ",
    "NAM",
    "NGA",
    "RWA",
    "SEN",
    "SLE",
    "SOM",
    "TZA",
    "UGA",
    "ZAF",
    "ZMB",
    "ZWE",
]

# Filter the merged data for the SSA region
df_plot = df_merged[df_merged["Code"].isin(SSA_CODES)].copy()

# Redefine X and Y for the SSA subset for plotting
X_plot = df_plot["Mobile_Subscriptions"]
Y_plot = df_plot["Account_Ownership"]

# Recalculate regression line for the SSA subset for the plotted line
slope_plot, intercept_plot, r_plot, p_plot, stderr_plot = linregress(X_plot, Y_plot)
r_squared_plot = r_plot**2

# Calculate the regression line for plotting
x_line = np.linspace(X_plot.min(), X_plot.max(), 100)
y_line = slope_plot * x_line + intercept_plot

# 1. Create the figure
fig = go.Figure()

# 2. Add Scatter Plot (Color-coded by Year)
fig.add_trace(
    go.Scatter(
        x=df_plot["Mobile_Subscriptions"],
        y=df_plot["Account_Ownership"],
        mode="markers",
        name="Country-Year Observation",
        marker=dict(
            size=8,
            color=df_plot["Year"],  # Color based on the 'Year' column
            colorscale="Viridis",
            colorbar=dict(title="Year"),
            opacity=0.7,
            line=dict(width=1, color="DarkSlateGrey"),
        ),
        hovertemplate="<b>Country:</b> %{customdata[0]}<br>"
        + "<b>Year:</b> %{x}<br>"
        + "<b>Mobile Subscriptions:</b> %{x:.2f}<br>"
        + "<b>Account Ownership:</b> %{y:.2f}%<extra></extra>",
        customdata=df_plot[["Country"]],
    )
)

# 3. Add Regression Line
fig.add_trace(
    go.Scatter(
        x=x_line,
        y=y_line,
        mode="lines",
        name=f"Regression Line (Y = {slope_plot:.2f}X + {intercept_plot:.2f})",
        line=dict(color="red", dash="dash", width=2),
    )
)

# 4. Update Layout and Axes
fig.update_layout(
    # Set the height explicitly to 650 to prevent compression
    height=650,
    title={
        "text": "<b>Interactive Scatter Plot: Mobile Broadband vs. Financial Account Ownership (Sub-Saharan Africa Subset)</b>",
        "y": 0.95,  # Adjusted Y position slightly for better fit with new height
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    xaxis_title="Mobile Broadband Subscriptions (per 100 people) [X]",
    yaxis_title="Account Ownership (% ages 15+) [Y]",
    font=dict(family="Arial, sans-serif", size=12, color="#333"),
    hovermode="closest",
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
    annotations=[
        dict(
            x=X_plot.min() + (X_plot.max() - X_plot.min()) * 0.05,
            y=Y_plot.max() - (Y_plot.max() - Y_plot.min()) * 0.05,
            xref="x",
            yref="y",
            text=f"$R^2$ (SSA) = {r_squared_plot:.4f}<br>$r$ (SSA) = {r_plot:.4f}",
            showarrow=False,
            bgcolor="rgba(255, 255, 255, 0.7)",
            bordercolor="rgba(0, 0, 0, 0.1)",
            borderwidth=1,
            font=dict(size=12),
        )
    ],
)

# 5. Display the interactive plot
fig.show()


While the scatterplot shows us a positive trend, the amount of datapoints still makes it hart to see the yearly trends per country. Let's try dividing the data by year to see if there are any changes over time. 

In [None]:
# Create a scatter plot with the data and overlay the regression line.
# *Note: The visualization now uses **Small Multiples** (subplots by year) for the Sub-Saharan Africa subset to eliminate compression and better visualize the changing correlation over time.*

# %%
# Define a subset of countries for regional analysis (Sub-Saharan Africa - SSA) to reduce plot compression
SSA_CODES = [
    "AGO",
    "BDI",
    "BEN",
    "BFA",
    "BWA",
    "CMR",
    "COG",
    "CIV",
    "ETH",
    "GHA",
    "KEN",
    "LBR",
    "MLI",
    "MOZ",
    "NAM",
    "NGA",
    "RWA",
    "SEN",
    "SLE",
    "SOM",
    "TZA",
    "UGA",
    "ZAF",
    "ZMB",
    "ZWE",
]

# Filter the merged data for the SSA region
df_plot = df_merged[df_merged["Code"].isin(SSA_CODES)].copy()

# Identify the Findex years available in the filtered SSA data
findex_years = df_plot["Year"].unique()
findex_years.sort()  # Ensure correct chronological order

# FIX: Update the grid to 3x2 to accommodate up to 6 unique Findex years (e.g., 2011, 2014, 2017, 2021, 2022, 2024)
fig = make_subplots(
    rows=3,
    cols=2,  # Changed from 2x2 to 3x2 to accommodate 6 years
    subplot_titles=[f"Year {year}" for year in findex_years],
    # Share X and Y axes across subplots for easier comparison of scales
    shared_xaxes=True,
    shared_yaxes=True,
    vertical_spacing=0.1,
    horizontal_spacing=0.1,
)

# Distinct colors for each year's data
color_palette = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]

# Loop through each year to create a separate subplot
for i, year in enumerate(findex_years):
    df_year = df_plot[df_plot["Year"] == year].copy()

    # Determine row and column for the subplot (1-based index)
    # The new logic correctly handles 3 rows (i.e., i=0,1 -> row 1; i=2,3 -> row 2; i=4,5 -> row 3)
    row = (i // 2) + 1
    col = (i % 2) + 1

    # Ensure there is enough data for analysis (at least 2 points to calculate correlation)
    if len(df_year) >= 2:
        X_year = df_year["Mobile_Subscriptions"]
        Y_year = df_year["Account_Ownership"]

        # Recalculate regression for this specific year
        slope_year, intercept_year, r_year, p_year, stderr_year = linregress(
            X_year, Y_year
        )
        r_squared_year = r_year**2

        # Calculate line for plotting
        # Use the min/max of the *entire* X range to ensure the line spans the full shared axis
        x_line_year = np.linspace(
            df_plot["Mobile_Subscriptions"].min(),
            df_plot["Mobile_Subscriptions"].max(),
            100,
        )
        y_line_year = slope_year * x_line_year + intercept_year

        # Add Scatter Plot for the specific year
        fig.add_trace(
            go.Scatter(
                x=X_year,
                y=Y_year,
                mode="markers",
                name=f"Data {year}",
                marker=dict(
                    size=10,
                    color=color_palette[
                        i % len(color_palette)
                    ],  # Use distinct color per year
                    opacity=0.8,
                    line=dict(width=1, color="DarkSlateGrey"),
                ),
                hovertemplate="<b>Country:</b> %{customdata[0]}<br>"
                + "<b>Mobile Subscriptions:</b> %{x:.2f}<br>"
                + "<b>Account Ownership:</b> %{y:.2f}%<extra></extra>",
                customdata=df_year[["Country"]],
                showlegend=False,  # Hide legend for data points
            ),
            row=row,
            col=col,
        )

        # Add Regression Line for the specific year
        fig.add_trace(
            go.Scatter(
                x=x_line_year,
                y=y_line_year,
                mode="lines",
                name=f"Trend {year} (R²={r_squared_year:.2f})",
                line=dict(color="red", dash="dash", width=2),
                showlegend=True,
            ),
            row=row,
            col=col,
        )

        # Add R-squared annotation
        fig.add_annotation(
            xref=f"x{i + 1}",  # Reference the subplot's axis
            yref=f"y{i + 1}",
            # Place annotation in the top-left of the individual subplot
            x=X_year.min() + (X_year.max() - X_year.min()) * 0.05,
            y=Y_year.max() - (Y_year.max() - Y_year.min()) * 0.05,
            xanchor="left",
            yanchor="top",
            text=f"$R^2$ = {r_squared_year:.2f}<br>$r$ = {r_year:.2f}",
            showarrow=False,
            bgcolor="rgba(255, 255, 255, 0.7)",
            bordercolor="rgba(0, 0, 0, 0.1)",
            borderwidth=1,
            font=dict(size=10),
        )

# Update layout for the entire figure
fig.update_layout(
    height=1000,  # Increased height for the 3x2 layout
    width=800,
    title={
        "text": "<b>Mobile Broadband vs. Financial Account Ownership Trends by Year (Sub-Saharan Africa Subset)</b>",
        "y": 0.98,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    font=dict(family="Arial, sans-serif", size=12, color="#333"),
    hovermode="closest",
    legend_title_text="Yearly Trendlines",
)

# Apply axis titles to the outer subplots only for the 3x2 grid
# Set X-axis titles only for the bottom row (row=3)
fig.update_xaxes(
    title_text="Mobile Broadband Subscriptions (per 100 people) [X]", row=3
)
# Set Y-axis titles only for the left column (col=1)
fig.update_yaxes(title_text="Account Ownership (% ages 15+) [Y]", col=1)

# Remove redundant X-axis titles from top two rows
fig.update_xaxes(title_text="", row=[1, 2])
# Remove redundant Y-axis titles from right column
fig.update_yaxes(title_text="", col=2)


# Conclusion

In this project I examined whether greater digital connectivity measured as mobile broadband subscriptions per 100 people  is associated with higher financial inclusion across countries (account ownership among adults 15+). 

Using merged datasets of broadband adoption and financial-account data from the World Bank, I found:
- A positive correlation between mobile broadband penetration and account ownership across the pooled country-year sample
- For the subset of Sub‑Saharan Africa (SSA), the positive relationship remains visible, though the strength of correlation and fit (R²) vary across years.
- Year-by-year scatterplots show that, while there is variation, many cohorts of countries in SSA with higher broadband penetration tend also to have higher financial inclusion, suggesting a meaningful association.

These findings support the hypothesis that improved mobile broadband access tends to coincide with higher levels of financial inclusion, particularly in contexts where mobile money and digital financial services may rely on connectivity.


### Interpretation & Policy Implications

The observed association suggests that national efforts to expand mobile broadband infrastructure might contribute to increasing financial access for adults, especially in developing regions such as SSA.

From a policy-design perspective, investing in broadband infrastructure could be a lever for improving financial inclusion. However, broadband access alone is unlikely to be sufficient: complementary policies (digital literacy, regulation of mobile-money services, supportive financial regulations) likely matter as well.

The variability across years in the SSA subset underlines that connectivity is only one part of the equation: other country-specific factors (economic conditions, regulatory frameworks, social trust, availability of financial services) may shape whether broadband actually translates into financial inclusion.


### Limitations

- Correlation ≠ causation: The analysis shows association, but does not establish that broadband adoption causes higher account ownership. Other unobserved variables (GDP per capita, education, urbanization, regulatory quality) may drive the relationship. 
- Data limitations: The dataset only covers certain indicator-years; many country-year pairs are missing, which may bias results toward countries with more frequent surveys.
- Linear model simplicity: The regression assumes a linear relationship; real world dynamics may be non-linear or have threshold effects (e.g. a minimum broadband penetration needed before financial inclusion rises). Also, the coefficient of determination (R²) only captures how much variance in account ownership is “explained” by broadband, likely only part of the story. 

- Aggregated national data: Country level aggregates mask within country variation (urban vs rural, income levels, etc.). The relationship at national scale may not reflect what happens at individual or community level.




# Data Sources 

### Mobile Broadband Data:
Source: World Bank, World Development Indicators (WDI)
Dataset: “Mobile broadband subscriptions (per 100 people)”
Link: https://data.worldbank.org/indicator/IT.CEL.BBND.P2

### Financial Account Ownership Data:
Source: World Bank, Global Findex Database
Dataset: “Account ownership at a financial institution or with a mobile-money service (% ages 15+)”
Link: https://databank.worldbank.org/source/global-financial-inclusion

AI Use Disclosure: This notebook was created with assistance from AI for data cleaning, merging, and visualization guidance.