# Phase 5: Digital Exclusion Risk Scoring

## Objective
Quantify digital exclusion risk at the district-month level.

**Digital Exclusion Definition:**
> "Aadhaar exists, but digital usability is limited due to poor mobile linkage and demographic constraints."

## Input Files
- `data/processed/district_features.csv` (from Phase 2)
- `data/final/migration_intensity.csv` (from Phase 3)

## Output File
- `data/final/digital_exclusion_risk.csv`

## Risk Score Components
| Component | Weight | Formula | Source |
|-----------|--------|---------|--------|
| Digital Gap | 0.5 | `digital_gap` | Phase 2 |
| Demographic Risk | 0.3 | `1 - youth_ratio` | Phase 2 |
| Migration Instability | 0.2 | `ABS(migration_score)` | Phase 3 |

## Risk Level Buckets
| Score Range | Risk Level |
|-------------|------------|
| 70-100 | High |
| 40-69 | Medium |
| 0-39 | Low |

## Cell 1: Import Libraries and Set Paths

In [1]:
"""
Import Required Libraries
-------------------------
- pandas: Data manipulation
- pathlib: Cross-platform file path handling
"""
import pandas as pd
from pathlib import Path

# Define project paths
# Assumes the notebook is run from a direct subdirectory of the project root
PROJECT_ROOT = Path.cwd().parent

# Input files
FEATURES_PATH = PROJECT_ROOT / "data" / "processed" / "district_features.csv"
MIGRATION_PATH = PROJECT_ROOT / "data" / "final" / "migration_intensity.csv"

# Output file
OUTPUT_PATH = PROJECT_ROOT / "data" / "final" / "digital_exclusion_risk.csv"

# Ensure output directory exists
OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)

print(f"✅ Libraries imported")
print(f"📂 Features Input: {FEATURES_PATH}")
print(f"📂 Migration Input: {MIGRATION_PATH}")
print(f"📂 Output: {OUTPUT_PATH}")

✅ Libraries imported
📂 Features Input: d:\Projects\ML - DataScience\Saarthi-Net-Data-Pipeline\data\processed\district_features.csv
📂 Migration Input: d:\Projects\ML - DataScience\Saarthi-Net-Data-Pipeline\data\final\migration_intensity.csv
📂 Output: d:\Projects\ML - DataScience\Saarthi-Net-Data-Pipeline\data\final\digital_exclusion_risk.csv


## Cell 2: Load Input Data from Phase 2 and Phase 3

In [2]:
"""
Load Input Data
---------------
1. district_features.csv (Phase 2): Contains digital_gap, youth_ratio
2. migration_intensity.csv (Phase 3): Contains migration_score

Both files are keyed by district_id and month.
"""

# Load Phase 2 features (digital_gap, youth_ratio)
df_features = pd.read_csv(FEATURES_PATH)
print(f"✅ Loaded {len(df_features):,} rows from Phase 2 (district_features.csv)")
print(f"   Columns: {list(df_features.columns)}")

# Load Phase 3 migration scores
df_migration = pd.read_csv(MIGRATION_PATH)
print(f"\n✅ Loaded {len(df_migration):,} rows from Phase 3 (migration_intensity.csv)")
print(f"   Columns: {list(df_migration.columns)}")

# Display sample data
print(f"\n📋 Sample features data:")
display(df_features[['district_id', 'month', 'youth_ratio', 'digital_gap']].head())

print(f"\n📋 Sample migration data:")
display(df_migration.head())

✅ Loaded 240 rows from Phase 2 (district_features.csv)
   Columns: ['district_id', 'month', 'address_velocity', 'migration_index', 'youth_ratio', 'digital_gap', 'growth_3m_consistent']

✅ Loaded 240 rows from Phase 3 (migration_intensity.csv)
   Columns: ['district_id', 'month', 'migration_score', 'migration_category']

📋 Sample features data:

|   | district_id | month | youth_ratio | digital_gap |
|---|-------------|-------|-------------|-------------|
| 0 | GJ_AMD | 2024-01 | 0.4038 | 0.1464 |
| 1 | GJ_AMD | 2024-02 | 0.4038 | 0.1965 |
| 2 | GJ_AMD | 2024-03 | 0.4038 | 0.2605 |
| 3 | GJ_AMD | 2024-04 | 0.4038 | 0.1367 |
| 4 | GJ_AMD | 2024-05 | 0.4038 | 0.1199 |

📋 Sample migration data:

|   | district_id | month | migration_score | migration_category |
|---|-------------|-------|-----------------|--------------------|
| 0 | GJ_AMD | 2024-01 | 0.0017 | Moderate Inflow |
| 1 | GJ_AMD | 2024-02 | 0.0154 | High Inflow |
| 2 | GJ_AMD | 2024-03 | 0.0105 | Moderate Inflow |
| 3 | GJ_AMD | 2024-04 | 0.0006 | Moderate Inflow |
| 4 | GJ_AMD | 2024-05 | 0.0246 | High Inflow |


## Cell 3: Merge Input Datasets

In [3]:
"""
Merge Input Datasets
--------------------
Combine Phase 2 features with Phase 3 migration scores.
Join on district_id and month (the composite key).
"""

# Select only the columns we need from each dataset
features_cols = ['district_id', 'month', 'youth_ratio', 'digital_gap']
migration_cols = ['district_id', 'month', 'migration_score']

# Merge on district_id and month
df = pd.merge(
    df_features[features_cols],
    df_migration[migration_cols],
    on=['district_id', 'month'],
    how='inner'
)

print(f"✅ Merged datasets: {len(df):,} rows")
print(f"   Districts: {df['district_id'].nunique()}")
print(f"   Months: {df['month'].nunique()}")

# Verify no missing values after merge
missing = df.isna().sum().sum()
if missing == 0:
    print(f"✅ No missing values after merge")
else:
    print(f"⚠️ Warning: {missing} missing values found")

print(f"\n📋 Merged data sample:")
df.head(10)

✅ Merged datasets: 240 rows
   Districts: 20
   Months: 12
✅ No missing values after merge

📋 Merged data sample:

|   | district_id | month | youth_ratio | digital_gap | migration_score |
|---|-------------|-------|-------------|-------------|-----------------|
| 0 | GJ_AMD | 2024-01 | 0.4038 | 0.1464 | 0.0017 |
| 1 | GJ_AMD | 2024-02 | 0.4038 | 0.1965 | 0.0154 |
| 2 | GJ_AMD | 2024-03 | 0.4038 | 0.2605 | 0.0105 |
| 3 | GJ_AMD | 2024-04 | 0.4038 | 0.1367 | 0.0006 |
| 4 | GJ_AMD | 2024-05 | 0.4038 | 0.1199 | 0.0246 |
| 5 | GJ_AMD | 2024-06 | 0.4038 | 0.1369 | 0.0202 |
| 6 | GJ_AMD | 2024-07 | 0.4038 | 0.2611 | 0.0054 |
| 7 | GJ_AMD | 2024-08 | 0.4038 | 0.1263 | 0.0144 |
| 8 | GJ_AMD | 2024-09 | 0.4038 | 0.2278 | 0.0211 |
| 9 | GJ_AMD | 2024-10 | 0.4038 | 0.1471 | -0.0023 |
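
An inner join drops any district-month key that appears in only one input, and the missing-value check cannot reveal that. The following is a minimal sketch of how such keys could be surfaced with the `indicator` and `validate` options of `pd.merge`. The toy frames and the `audit` and `unmatched` names are illustrative assumptions, not part of the pipeline.

```python
import pandas as pd

# Toy stand-ins for the Phase 2 and Phase 3 tables (values are made up).
features = pd.DataFrame({
    "district_id": ["A", "A"],
    "month": ["2024-01", "2024-02"],
    "digital_gap": [0.15, 0.20],
})
migration = pd.DataFrame({
    "district_id": ["A", "B"],
    "month": ["2024-01", "2024-01"],
    "migration_score": [0.002, -0.010],
})

# An outer join with indicator=True labels each key as 'both', 'left_only',
# or 'right_only'; validate raises MergeError if a key repeats on either side.
audit = pd.merge(
    features,
    migration,
    on=["district_id", "month"],
    how="outer",
    indicator=True,
    validate="one_to_one",
)

# Keys that an inner join would have dropped silently.
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["district_id", "month", "_merge"]])
```

If `unmatched` is empty, the inner join used in this cell loses no rows.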


## Cell 4: Compute Risk Components

**Three interpretable components:**

1. **Digital Gap Component (weight: 0.5)**
   - Directly from Phase 2: `digital_gap`
   - Higher values = more digital exclusion

2. **Demographic Risk Component (weight: 0.3)**
   - Formula: `1 - youth_ratio`
   - Lower youth population = higher demographic risk

3. **Migration Instability Component (weight: 0.2)**
   - Formula: `ABS(migration_score)`
   - Both high inflow and outflow create instability

In [4]:
"""
Compute Risk Components
-----------------------
Each component represents a different dimension of digital exclusion risk.

1. digital_gap: Direct measure of mobile vs address update disparity
   - Higher values indicate poor mobile linkage relative to address updates

2. demographic_risk: Inverse of youth ratio
   - Older populations tend to have lower digital literacy
   - Formula: 1 - youth_ratio

3. migration_instability: Absolute value of migration score
   - Both high inflow and outflow create service delivery challenges
   - Unstable populations face higher digital exclusion risk
"""

# Component 1: Digital Gap (directly from Phase 2)
# Already normalized between 0 and 1
df['digital_gap_component'] = df['digital_gap']

# Component 2: Demographic Risk = 1 - youth_ratio
# Lower youth ratio means older population with higher digital risk
df['demographic_risk'] = 1 - df['youth_ratio']

# Component 3: Migration Instability = ABS(migration_score)
# Both extreme inflow and outflow create instability
df['migration_instability'] = df['migration_score'].abs()

print(f"✅ Risk components computed")
print(f"\n📊 Component statistics:")
print(f"   digital_gap_component: min={df['digital_gap_component'].min():.4f}, max={df['digital_gap_component'].max():.4f}")
print(f"   demographic_risk: min={df['demographic_risk'].min():.4f}, max={df['demographic_risk'].max():.4f}")
print(f"   migration_instability: min={df['migration_instability'].min():.4f}, max={df['migration_instability'].max():.4f}")

print(f"\n📋 Sample with components:")
df[['district_id', 'month', 'digital_gap_component', 'demographic_risk', 'migration_instability']].head(10)

✅ Risk components computed

📊 Component statistics:
   digital_gap_component: min=0.1071, max=0.7030
   demographic_risk: min=0.5874, max=0.6141
   migration_instability: min=0.0000, max=0.0271

📋 Sample with components:

|   | district_id | month | digital_gap_component | demographic_risk | migration_instability |
|---|-------------|-------|-----------------------|------------------|-----------------------|
| 0 | GJ_AMD | 2024-01 | 0.1464 | 0.5962 | 0.0017 |
| 1 | GJ_AMD | 2024-02 | 0.1965 | 0.5962 | 0.0154 |
| 2 | GJ_AMD | 2024-03 | 0.2605 | 0.5962 | 0.0105 |
| 3 | GJ_AMD | 2024-04 | 0.1367 | 0.5962 | 0.0006 |
| 4 | GJ_AMD | 2024-05 | 0.1199 | 0.5962 | 0.0246 |
| 5 | GJ_AMD | 2024-06 | 0.1369 | 0.5962 | 0.0202 |
| 6 | GJ_AMD | 2024-07 | 0.2611 | 0.5962 | 0.0054 |
| 7 | GJ_AMD | 2024-08 | 0.1263 | 0.5962 | 0.0144 |
| 8 | GJ_AMD | 2024-09 | 0.2278 | 0.5962 | 0.0211 |
| 9 | GJ_AMD | 2024-10 | 0.1471 | 0.5962 | 0.0023 |
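
The three components occupy very different numeric ranges, so the weights alone do not show how far each one can move the weighted sum. The rough sketch below uses the min/max values from the component statistics printed in this cell. The `component_ranges` dictionary and the `score_bounds` helper are illustrative names, not part of the notebook.

```python
# Weights from the Risk Score Components table.
WEIGHTS = {
    "digital_gap": 0.5,
    "demographic_risk": 0.3,
    "migration_instability": 0.2,
}

# (min, max) per component, copied from the component statistics output.
component_ranges = {
    "digital_gap": (0.1071, 0.7030),
    "demographic_risk": (0.5874, 0.6141),
    "migration_instability": (0.0000, 0.0271),
}

def score_bounds(weights, ranges):
    """Return the lowest and highest weighted sum reachable within the ranges."""
    low = sum(weights[name] * ranges[name][0] for name in weights)
    high = sum(weights[name] * ranges[name][1] for name in weights)
    return low, high

# Weighted spread: how much the weighted sum changes as a component
# moves from its observed minimum to its observed maximum.
spread = {
    name: WEIGHTS[name] * (hi - lo)
    for name, (lo, hi) in component_ranges.items()
}

low, high = score_bounds(WEIGHTS, component_ranges)
for name, value in spread.items():
    print(f"{name}: weighted spread = {value:.4f}")
print(f"weighted-sum bounds: {low:.4f} to {high:.4f}")
```

On these figures `digital_gap` accounts for roughly 0.30 of the possible movement in the weighted sum, while the other two components contribute about 0.01 or less each. In this dataset the score is therefore driven almost entirely by `digital_gap`.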


## Cell 5: Compute Digital Exclusion Score

**Formula:**
```
raw_risk = 0.5 * digital_gap + 0.3 * demographic_risk + 0.2 * migration_instability
digital_exclusion_score = ROUND(MIN(1, raw_risk) * 100)
```
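
As a quick check of the formula, the sketch below applies it to two rows from the merged sample: GJ_AMD 2024-01 and GJ_AMD 2024-10, the latter having a negative `migration_score`. The `exclusion_score` helper is an illustrative name and not part of the notebook. Like the notebook code, it also clamps at 0.

```python
def exclusion_score(digital_gap: float, youth_ratio: float, migration_score: float) -> int:
    """Weighted sum of the three components, clamped to [0, 1], scaled to 0-100."""
    raw_risk = (
        0.5 * digital_gap
        + 0.3 * (1 - youth_ratio)
        + 0.2 * abs(migration_score)
    )
    return round(min(1.0, max(0.0, raw_risk)) * 100)

# 2024-01: 0.5*0.1464 + 0.3*0.5962 + 0.2*0.0017 = 0.0732 + 0.17886 + 0.00034 = 0.2524
print(exclusion_score(0.1464, 0.4038, 0.0017))   # 25

# 2024-10: the sign of migration_score is discarded by abs()
# 0.5*0.1471 + 0.3*0.5962 + 0.2*0.0023 = 0.07355 + 0.17886 + 0.00046 = 0.25287
print(exclusion_score(0.1471, 0.4038, -0.0023))  # 25
```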

In [5]:
"""
Compute Digital Exclusion Score
--------------------------------
Weighted combination of three risk components:
- Digital Gap: 50% weight (primary driver)
- Demographic Risk: 30% weight (population structure)
- Migration Instability: 20% weight (stability factor)

Final score is normalized to 0-100 scale as an integer.
"""

# Define weights (must sum to 1.0)
WEIGHT_DIGITAL_GAP = 0.5
WEIGHT_DEMOGRAPHIC = 0.3
WEIGHT_MIGRATION = 0.2

# Compute raw risk score (weighted sum)
df['raw_risk'] = (
    WEIGHT_DIGITAL_GAP * df['digital_gap_component'] +
    WEIGHT_DEMOGRAPHIC * df['demographic_risk'] +
    WEIGHT_MIGRATION * df['migration_instability']
)

# Clamp raw_risk to [0, 1] and scale to 0-100
# Formula: ROUND(MIN(1, raw_risk) * 100)
df['digital_exclusion_score'] = (
    df['raw_risk']
    .clip(lower=0, upper=1)  # Clamp to [0, 1]
    .mul(100)                 # Scale to 0-100
    .round()                  # Round to nearest integer
    .astype(int)              # Convert to integer
)

print(f"✅ Digital exclusion scores computed")
print(f"\n📊 Score statistics:")
print(f"   Min: {df['digital_exclusion_score'].min()}")
print(f"   Max: {df['digital_exclusion_score'].max()}")
print(f"   Mean: {df['digital_exclusion_score'].mean():.1f}")
print(f"   Median: {df['digital_exclusion_score'].median():.0f}")

print(f"\n📋 Sample scores:")
df[['district_id', 'month', 'raw_risk', 'digital_exclusion_score']].head(12)

✅ Digital exclusion scores computed

📊 Score statistics:
   Min: 23
   Max: 54
   Mean: 37.5
   Median: 36

📋 Sample scores:

|   | district_id | month | raw_risk | digital_exclusion_score |
|---|-------------|-------|----------|-------------------------|
| 0 | GJ_AMD | 2024-01 | 0.2524 | 25 |
| 1 | GJ_AMD | 2024-02 | 0.28019 | 28 |
| 2 | GJ_AMD | 2024-03 | 0.31121 | 31 |
| 3 | GJ_AMD | 2024-04 | 0.24733 | 25 |
| 4 | GJ_AMD | 2024-05 | 0.24373 | 24 |
| 5 | GJ_AMD | 2024-06 | 0.25135 | 25 |
| 6 | GJ_AMD | 2024-07 | 0.31049 | 31 |
| 7 | GJ_AMD | 2024-08 | 0.24489 | 24 |
| 8 | GJ_AMD | 2024-09 | 0.29698 | 30 |
| 9 | GJ_AMD | 2024-10 | 0.25287 | 25 |


## Cell 6: Assign Risk Level Categories

**Risk Level Buckets:**
| Score Range | Risk Level |
|-------------|------------|
| 70-100 | High |
| 40-69 | Medium |
| 0-39 | Low |
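
For integer scores, these buckets map directly onto `pandas.cut` bins. The following is a minimal sketch on a few boundary scores. The sample values are chosen to exercise each threshold and are not taken from the dataset.

```python
import pandas as pd

scores = pd.Series([0, 39, 40, 69, 70, 100], name="digital_exclusion_score")

# Bins are right-inclusive: (-1, 39] -> Low, (39, 69] -> Medium, (69, 100] -> High.
# The lowest edge is -1 so that a score of 0 falls inside the first bin.
risk_level = pd.cut(
    scores,
    bins=[-1, 39, 69, 100],
    labels=["Low", "Medium", "High"],
).astype(str)

print(risk_level.tolist())
# ['Low', 'Low', 'Medium', 'Medium', 'High', 'High']
```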

In [6]:
"""
Assign Risk Level Categories
----------------------------
Categorize digital_exclusion_score into three risk levels:

- High (70-100): Critical digital exclusion, needs immediate intervention
- Medium (40-69): Moderate risk, requires monitoring
- Low (0-39): Acceptable digital inclusion levels
"""

def assign_risk_level(score: int) -> str:
    """
    Assign risk level based on digital exclusion score.
    
    Args:
        score: Integer score from 0-100
    
    Returns:
        One of: 'High', 'Medium', 'Low'
    """
    if score >= 70:
        return "High"
    elif score >= 40:
        return "Medium"
    else:
        return "Low"

# Apply risk level assignment
df['risk_level'] = df['digital_exclusion_score'].apply(assign_risk_level)

print(f"✅ Risk levels assigned")
print(f"\n📊 Risk level distribution:")
risk_dist = df['risk_level'].value_counts().sort_index()
for level in ['High', 'Low', 'Medium']:  # Alphabetical order
    if level in risk_dist.index:
        count = risk_dist[level]
        pct = count / len(df) * 100
        print(f"   {level}: {count} ({pct:.1f}%)")

print(f"\n📊 Mean score by risk level:")
print(df.groupby('risk_level')['digital_exclusion_score'].mean().round(1))

print(f"\n📋 Sample with risk levels:")
df[['district_id', 'month', 'digital_exclusion_score', 'risk_level']].head(12)

✅ Risk levels assigned

📊 Risk level distribution:
   Low: 143 (59.6%)
   Medium: 97 (40.4%)

📊 Mean score by risk level:
risk_level
Low       31.0
Medium    47.0
Name: digital_exclusion_score, dtype: float64

📋 Sample with risk levels:

|   | district_id | month | digital_exclusion_score | risk_level |
|---|-------------|-------|-------------------------|------------|
| 0 | GJ_AMD | 2024-01 | 25 | Low |
| 1 | GJ_AMD | 2024-02 | 28 | Low |
| 2 | GJ_AMD | 2024-03 | 31 | Low |
| 3 | GJ_AMD | 2024-04 | 25 | Low |
| 4 | GJ_AMD | 2024-05 | 24 | Low |
| 5 | GJ_AMD | 2024-06 | 25 | Low |
| 6 | GJ_AMD | 2024-07 | 31 | Low |
| 7 | GJ_AMD | 2024-08 | 24 | Low |
| 8 | GJ_AMD | 2024-09 | 30 | Low |
| 9 | GJ_AMD | 2024-10 | 25 | Low |


## Cell 7: Prepare Final Output Schema

In [7]:
"""
Prepare Final Output
--------------------
Select only the required columns in the exact schema order.

Output Schema:
- district_id (string)
- month (string, YYYY-MM)
- digital_exclusion_score (integer, 0-100)
- risk_level (string: High, Medium, Low)
"""

# Define exact output schema
OUTPUT_COLUMNS = [
    'district_id',
    'month',
    'digital_exclusion_score',
    'risk_level'
]

# Select only required columns
df_output = df[OUTPUT_COLUMNS].copy()

# Sort by district_id and month for deterministic output
df_output = df_output.sort_values(['district_id', 'month']).reset_index(drop=True)

print(f"✅ Final output prepared: {len(df_output):,} rows")
print(f"\n📊 Columns: {list(df_output.columns)}")
print(f"\n📋 Data types:")
print(df_output.dtypes)
print(f"\n📋 First 12 rows:")
df_output.head(12)

✅ Final output prepared: 240 rows

📊 Columns: ['district_id', 'month', 'digital_exclusion_score', 'risk_level']

📋 Data types:
district_id                object
month                      object
digital_exclusion_score     int32
risk_level                 object
dtype: object

📋 First 12 rows:

|   | district_id | month | digital_exclusion_score | risk_level |
|---|-------------|-------|-------------------------|------------|
| 0 | GJ_AMD | 2024-01 | 25 | Low |
| 1 | GJ_AMD | 2024-02 | 28 | Low |
| 2 | GJ_AMD | 2024-03 | 31 | Low |
| 3 | GJ_AMD | 2024-04 | 25 | Low |
| 4 | GJ_AMD | 2024-05 | 24 | Low |
| 5 | GJ_AMD | 2024-06 | 25 | Low |
| 6 | GJ_AMD | 2024-07 | 31 | Low |
| 7 | GJ_AMD | 2024-08 | 24 | Low |
| 8 | GJ_AMD | 2024-09 | 30 | Low |
| 9 | GJ_AMD | 2024-10 | 25 | Low |


## Cell 8: Data Quality Validation

In [8]:
"""
Data Quality Checks
-------------------
Validate all required constraints before export.
"""

checks_passed = 0
total_checks = 7

# Check 1: No NaN values
nan_count = df_output.isna().sum().sum()
if nan_count == 0:
    print("✅ Check 1: No NaN values")
    checks_passed += 1
else:
    print(f"❌ Check 1: Found {nan_count} NaN values")

# Check 2: digital_exclusion_score is integer
if df_output['digital_exclusion_score'].dtype in ['int64', 'int32', 'int']:
    print("✅ Check 2: digital_exclusion_score is integer type")
    checks_passed += 1
else:
    print(f"❌ Check 2: digital_exclusion_score is {df_output['digital_exclusion_score'].dtype}, expected integer")

# Check 3: digital_exclusion_score is between 0 and 100
score_min = df_output['digital_exclusion_score'].min()
score_max = df_output['digital_exclusion_score'].max()
if score_min >= 0 and score_max <= 100:
    print(f"✅ Check 3: digital_exclusion_score in range [0, 100] (actual: {score_min}-{score_max})")
    checks_passed += 1
else:
    print(f"❌ Check 3: digital_exclusion_score out of range [{score_min}, {score_max}]")

# Check 4: Valid risk_level values
valid_levels = {'High', 'Medium', 'Low'}
actual_levels = set(df_output['risk_level'].unique())
if actual_levels.issubset(valid_levels):
    print(f"✅ Check 4: Valid risk_level values: {actual_levels}")
    checks_passed += 1
else:
    print(f"❌ Check 4: Invalid risk_level found: {actual_levels - valid_levels}")

# Check 5: risk_level is non-null
if df_output['risk_level'].notna().all():
    print("✅ Check 5: risk_level is non-null")
    checks_passed += 1
else:
    print("❌ Check 5: risk_level has null values")

# Check 6: Correct column count (exactly 4)
if len(df_output.columns) == 4:
    print("✅ Check 6: Exactly 4 columns (no extra columns)")
    checks_passed += 1
else:
    print(f"❌ Check 6: Expected 4 columns, got {len(df_output.columns)}")

# Check 7: One row per district-month (no duplicates)
duplicates = df_output.duplicated(subset=['district_id', 'month']).sum()
if duplicates == 0:
    print(f"✅ Check 7: No duplicate district-month combinations")
    checks_passed += 1
else:
    print(f"❌ Check 7: Found {duplicates} duplicate district-month rows")

print(f"\n{'='*50}")
if checks_passed == total_checks:
    print(f"✅ ALL {total_checks} VALIDATION CHECKS PASSED")
else:
    print(f"⚠️ {checks_passed}/{total_checks} checks passed")

✅ Check 1: No NaN values
✅ Check 2: digital_exclusion_score is integer type
✅ Check 3: digital_exclusion_score in range [0, 100] (actual: 23-54)
✅ Check 4: Valid risk_level values: {'Low', 'Medium'}
✅ Check 5: risk_level is non-null
✅ Check 6: Exactly 4 columns (no extra columns)
✅ Check 7: No duplicate district-month combinations

✅ ALL 7 VALIDATION CHECKS PASSED
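
For use outside a notebook, the same constraints could be expressed as a function that returns failures instead of printing them. This is a sketch based on the schema documented in this phase. `validate_output`, `EXPECTED_COLUMNS`, and the sample frames are hypothetical names and data, and `pd.api.types.is_integer_dtype` is swapped in for the dtype-name comparison used above.

```python
import pandas as pd

VALID_LEVELS = {"High", "Medium", "Low"}
EXPECTED_COLUMNS = ["district_id", "month", "digital_exclusion_score", "risk_level"]

def validate_output(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable constraint violations (empty if none)."""
    # Without the exact schema, the remaining checks cannot run safely.
    if list(df.columns) != EXPECTED_COLUMNS:
        return [f"unexpected columns: {list(df.columns)}"]

    failures = []
    score = df["digital_exclusion_score"]
    if df.isna().any().any():
        failures.append("NaN values present")
    if not pd.api.types.is_integer_dtype(score):
        failures.append(f"score dtype is {score.dtype}, expected integer")
    elif not score.between(0, 100).all():
        failures.append("score outside [0, 100]")
    if not set(df["risk_level"].dropna()).issubset(VALID_LEVELS):
        failures.append("invalid risk_level value")
    if df.duplicated(subset=["district_id", "month"]).any():
        failures.append("duplicate district-month rows")
    return failures

good = pd.DataFrame({
    "district_id": ["GJ_AMD", "GJ_AMD"],
    "month": ["2024-01", "2024-02"],
    "digital_exclusion_score": [25, 28],
    "risk_level": ["Low", "Low"],
})
# Same frame with a repeated month and an unknown risk level.
bad = good.assign(month="2024-01", risk_level="Severe")

print(validate_output(good))  # []
print(validate_output(bad))
```

Returning a list lets a caller decide whether to log the failures, raise an exception, or skip the export.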


## Cell 9: Export to CSV

In [9]:
"""
Export to CSV
-------------
Save the digital exclusion risk data in CSV format.
"""

# Export to CSV (no index column)
df_output.to_csv(OUTPUT_PATH, index=False)

print(f"✅ CSV exported to: {OUTPUT_PATH}")
print(f"\n📊 Export Summary:")
print(f"   Total rows: {len(df_output):,}")
print(f"   Districts: {df_output['district_id'].nunique()}")
print(f"   Months: {df_output['month'].nunique()}")
print(f"\n📊 Risk level distribution:")
for level in ['High', 'Medium', 'Low']:
    count = (df_output['risk_level'] == level).sum()
    pct = count / len(df_output) * 100
    print(f"   {level}: {count} ({pct:.1f}%)")
print(f"\n📋 Schema:")
for col in df_output.columns:
    sample_val = df_output[col].iloc[0]
    print(f"   {col}: {df_output[col].dtype} (e.g., {sample_val})")
print(f"\n📄 First 10 rows:")
df_output.head(10)

✅ CSV exported to: d:\Projects\ML - DataScience\Saarthi-Net-Data-Pipeline\data\final\digital_exclusion_risk.csv

📊 Export Summary:
   Total rows: 240
   Districts: 20
   Months: 12

📊 Risk level distribution:
   High: 0 (0.0%)
   Medium: 97 (40.4%)
   Low: 143 (59.6%)

📋 Schema:
   district_id: object (e.g., GJ_AMD)
   month: object (e.g., 2024-01)
   digital_exclusion_score: int32 (e.g., 25)
   risk_level: object (e.g., Low)

📄 First 10 rows:

|   | district_id | month | digital_exclusion_score | risk_level |
|---|-------------|-------|-------------------------|------------|
| 0 | GJ_AMD | 2024-01 | 25 | Low |
| 1 | GJ_AMD | 2024-02 | 28 | Low |
| 2 | GJ_AMD | 2024-03 | 31 | Low |
| 3 | GJ_AMD | 2024-04 | 25 | Low |
| 4 | GJ_AMD | 2024-05 | 24 | Low |
| 5 | GJ_AMD | 2024-06 | 25 | Low |
| 6 | GJ_AMD | 2024-07 | 31 | Low |
| 7 | GJ_AMD | 2024-08 | 24 | Low |
| 8 | GJ_AMD | 2024-09 | 30 | Low |
| 9 | GJ_AMD | 2024-10 | 25 | Low |


## Phase 5 Complete

### Output File
`data/final/digital_exclusion_risk.csv`

### Schema
| Field | Type | Description |
|-------|------|-------------|
| district_id | string | District identifier |
| month | string | YYYY-MM format |
| digital_exclusion_score | integer | Risk score (0-100) |
| risk_level | string | High, Medium, or Low |

### Risk Score Formula
```
raw_risk = 0.5 * digital_gap + 0.3 * (1 - youth_ratio) + 0.2 * ABS(migration_score)
digital_exclusion_score = ROUND(MIN(1, raw_risk) * 100)
```

### Risk Level Buckets
- **High** (70-100): Critical digital exclusion
- **Medium** (40-69): Moderate risk
- **Low** (0-39): Acceptable inclusion

### Next Steps
All ML phase outputs are now ready for backend API integration.