# Task 2: Exploratory Data Analysis (EDA)

## Objective
Analyze the data to understand patterns and factors influencing financial inclusion in Ethiopia.

## 1. Dataset Overview
We begin by loading the unified dataset and summarizing its key dimensions.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

# File Paths
data_path = "../data/raw/ethiopia_fi_unified_data.csv"
output_dir = "../reports/figures"

# Create the figures directory if it does not already exist
os.makedirs(output_dir, exist_ok=True)

# Load Data
df = pd.read_csv(data_path)
df['observation_date'] = pd.to_datetime(df['observation_date'], errors='coerce')
df['year'] = df['observation_date'].dt.year
obs_df = df[df['record_type'] == 'observation'].copy()
print(f"Dataset loaded with {len(df)} records.")
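
The loading cell reports only the total record count. A minimal sketch of a fuller summary follows, using only the columns the cell above already references (`record_type`, `indicator_code`, `observation_date`). The helper name `summarize_dimensions` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def summarize_dimensions(frame: pd.DataFrame) -> dict:
    """Return headline counts for a unified long-format dataset."""
    # Re-parse dates defensively so the helper also works on a freshly loaded frame
    dates = pd.to_datetime(frame["observation_date"], errors="coerce")
    return {
        "n_records": len(frame),
        "records_by_type": frame["record_type"].value_counts().to_dict(),
        "n_indicators": frame["indicator_code"].nunique(),
        "first_date": dates.min(),
        "last_date": dates.max(),
        # Counts dates that are missing or could not be parsed
        "n_missing_or_bad_dates": int(dates.isna().sum()),
    }
```

Calling `summarize_dimensions(df)` after the load step would show how many rows are observations versus events and how many dates were coerced to `NaT`.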

### Temporal Coverage and Data Quality
![Temporal Coverage](../reports/figures/task2_temporal_coverage.png)

Over 70% of records carry a 'high' confidence rating, and core indicators such as `ACC_OWNERSHIP` are broadly covered, but recent usage disaggregations are sparse.
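
The confidence share and the coverage gaps described above could be checked along these lines. This is a sketch only: the column name `confidence` is an assumption (the notebook does not show which column holds the rating), and both helper names are hypothetical.

```python
import pandas as pd

def confidence_share(frame: pd.DataFrame, column: str = "confidence") -> pd.Series:
    """Share of records at each confidence level (fractions summing to 1).

    `column` is an assumed name; adjust it to the dataset's actual schema.
    """
    return frame[column].value_counts(normalize=True)

def coverage_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Indicator-by-year grid of observation counts; a 0 marks a gap."""
    return pd.crosstab(frame["indicator_code"], frame["year"])
```

Applied to `obs_df`, `coverage_table` yields a grid that can be passed to `sns.heatmap` to reproduce a temporal coverage figure.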

## 2. Access Analysis

### Account Ownership Trajectory
![Access Trajectory](../reports/figures/task2_access_trajectory.png)

Account ownership in Ethiopia rose steadily from 14% (2011) to 46% (2021). Growth then slowed in 2021-2024 (+3pp), despite the digital revolution.
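
To make the slowdown concrete, the figures quoted above can be converted to an average annual change in percentage points. The 2024 level of 49% is not stated directly; it is inferred here as 46% plus the reported +3pp.

```python
def pp_per_year(start_value, end_value, start_year, end_year):
    """Average change in percentage points per year between two surveys."""
    return (end_value - start_value) / (end_year - start_year)

# Values taken from the text above; 49 is inferred as 46 + 3pp
early = pp_per_year(14, 46, 2011, 2021)  # 32pp over 10 years
late = pp_per_year(46, 49, 2021, 2024)   # 3pp over 3 years

print(f"2011-2021: {early:.1f} pp/year")  # 2011-2021: 3.2 pp/year
print(f"2021-2024: {late:.1f} pp/year")   # 2021-2024: 1.0 pp/year
```

On these numbers, the annual pace after 2021 is roughly a third of the pace in the preceding decade.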

### Gender Gap Evolution
![Gender Gap](../reports/figures/task2_gender_gap.png)

A significant gap remains: male ownership consistently leads female ownership by ~20 percentage points.
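
A gap series like the one plotted above could be derived as sketched below. The notebook does not show how gender is encoded, so the `gender` column and its `male` / `female` values are assumptions, and `gender_gap` is a hypothetical helper.

```python
import pandas as pd

def gender_gap(frame: pd.DataFrame, gender_col: str = "gender") -> pd.Series:
    """Male minus female value per year, in percentage points.

    Assumes `gender_col` holds the labels 'male' and 'female'; both the
    column name and the labels are illustrative, not confirmed by the source.
    """
    wide = frame.pivot_table(
        index="year", columns=gender_col, values="value_numeric", aggfunc="mean"
    )
    return (wide["male"] - wide["female"]).rename("gap_pp")
```

The input would be the `ACC_OWNERSHIP` rows of `obs_df`, restricted to gender-disaggregated observations.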

## 3. Usage (Digital Payments) Analysis

We analyze the adoption of mobile money and the transition from cash to digital.

In [None]:
# Filter for Usage and Mobile Money
usage_codes = ['USG_TELEBIRR_USERS', 'USG_MPESA_USERS', 'USG_P2P_COUNT', 'USG_ACTIVE_RATE']
usage_df = obs_df[obs_df['indicator_code'].isin(usage_codes)].copy()

# Mobile Money User Growth
mm_users = usage_df[usage_df['indicator_code'].isin(['USG_TELEBIRR_USERS', 'USG_MPESA_USERS'])].sort_values('observation_date')

plt.figure(figsize=(12, 6))
sns.lineplot(data=mm_users, x='year', y='value_numeric', hue='indicator', marker='s')
plt.title('Mobile Money Registered User Growth (Millions)')
plt.ylabel('Users (Millions)')
plt.savefig(os.path.join(output_dir, 'task2_mm_users.png'))
plt.show()

### P2P vs. Cash (ATM)
Analyzing the ratio of digital transfers to cash withdrawals.

In [None]:
crossover = obs_df[obs_df['indicator_code'] == 'USG_CROSSOVER'].sort_values('year')
if not crossover.empty:
    plt.figure(figsize=(10, 6))
    plt.plot(crossover['year'], crossover['value_numeric'], marker='^', color='green')
    plt.title('P2P / ATM Transaction Ratio Trend')
    plt.ylabel('Ratio')
    plt.savefig(os.path.join(output_dir, 'task2_usage_crossover.png'))
    plt.show()
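
The cell above reads a precomputed `USG_CROSSOVER` series and silently skips the plot if that series is absent. As a fallback, the ratio could be derived from the underlying counts, as in the sketch below. `USG_P2P_COUNT` appears in the usage codes earlier in the notebook, but `USG_ATM_COUNT` is a hypothetical code used purely for illustration.

```python
import pandas as pd

def p2p_atm_ratio(
    frame: pd.DataFrame,
    p2p_code: str = "USG_P2P_COUNT",
    atm_code: str = "USG_ATM_COUNT",  # hypothetical code; replace with the real one
) -> pd.Series:
    """Yearly ratio of P2P transfers to ATM withdrawals.

    A value above 1 means digital transfers outnumber cash withdrawals.
    """
    subset = frame[frame["indicator_code"].isin([p2p_code, atm_code])]
    wide = subset.pivot_table(
        index="year", columns="indicator_code", values="value_numeric", aggfunc="mean"
    )
    # Keep only years where both series are present
    wide = wide.reindex(columns=[p2p_code, atm_code]).dropna()
    return (wide[p2p_code] / wide[atm_code]).rename("p2p_atm_ratio")
```

The first year in which the ratio exceeds 1 marks the crossover point.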

## 4. Infrastructure and Enablers

Correlating technical enablers with inclusion outcomes.

In [None]:
infra_codes = ['INF_4G_COVERAGE', 'INF_MOBILE_PENETRATION', 'INF_ATM_DENSITY']
infra_df = obs_df[obs_df['indicator_code'].isin(infra_codes)].copy()

plt.figure(figsize=(12, 6))
sns.lineplot(data=infra_df, x='year', y='value_numeric', hue='indicator')
plt.title('Infrastructure Evolution in Ethiopia')
plt.savefig(os.path.join(output_dir, 'task2_infrastructure.png'))
plt.show()
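
The cell above plots the enablers but does not compute any correlation. One way to do so is sketched below: pivot the long-format observations to one column per indicator, then take pairwise Pearson correlations. The helper name `indicator_correlations` is hypothetical, and the `min_periods=3` threshold is an arbitrary choice to avoid reporting a coefficient from only two shared years.

```python
import pandas as pd

def indicator_correlations(frame: pd.DataFrame, codes: list) -> pd.DataFrame:
    """Pairwise Pearson correlation between indicators, aligned by year."""
    subset = frame[frame["indicator_code"].isin(codes)]
    wide = subset.pivot_table(
        index="year", columns="indicator_code", values="value_numeric", aggfunc="mean"
    )
    # Pairs with fewer than 3 overlapping years are returned as NaN
    return wide.corr(method="pearson", min_periods=3)
```

A call such as `indicator_correlations(obs_df, infra_codes + ['ACC_OWNERSHIP'])` would relate the enablers to the access outcome. With only a handful of survey years, and series that all trend upward, high coefficients should be read as suggestive rather than as evidence of causation.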

## 5. Event Timeline and Visual Analysis

Overlaying key milestones on the Account Ownership trend.

In [None]:
events = df[df['record_type'] == 'event'].copy()
events['year'] = pd.to_datetime(events['observation_date'], errors='coerce').dt.year

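# `total_acc` is not defined earlier in this notebook; it is assumed here to be
# the ACC_OWNERSHIP observations (filter to the national total first if the
# series is disaggregated).
total_acc = obs_df[obs_df['indicator_code'] == 'ACC_OWNERSHIP'].sort_values('year')

plt.figure(figsize=(15, 7))
plt.plot(total_acc['year'], total_acc['value_numeric'], marker='o', label='Account Ownership')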

for idx, row in events.dropna(subset=['year']).iterrows():
    plt.axvline(x=row['year'], color='red', linestyle='--', alpha=0.5)
    plt.text(row['year'], 5, row['indicator'], rotation=90, verticalalignment='bottom', color='red')

plt.title('Impact of Market Events on Account Ownership')
plt.savefig(os.path.join(output_dir, 'task2_event_overlay.png'))
plt.show()

## 6. Key Insights & Hypotheses

1. **Stagnation Mystery**: Despite 65M+ mobile money accounts, the 2024 Findex-equivalent ownership rate rose by only 3pp, which suggests "thin inclusion": many users have the tool but do not yet perceive it as a full financial account.
2. **The Gender Barrier**: The gender gap is structural and has not narrowed significantly despite Telebirr's reach.
3. **Event Acceleration**: Telebirr's launch (2021) coincides with the explosive growth in P2P transaction values, even if account ownership growth slowed.
4. **Leading Indicators**: 4G coverage and Mobile Penetration remain strong predictors, but their marginal impact on *new* account opening is decreasing.
5. **Hypothesis**: Future growth relies on deepening **Usage** (savings, credit) rather than just **Access** (registration).