# Step 3: Interpretability & Use (Exercise notebook)

In the final step of this exercise we combine the results of the data integrity (coverage) checks from **Step 1** and the analytical robustness analysis from **Step 2** to assess how reliable the OD matrix is for disaster preparedness.

**Goal:** Decide how (and where) the mobility indicator can be safely used for disaster‑related decision‑making. We'll identify which districts or OD pairs have sufficient coverage and stable flows, and where strong caveats or additional data are needed.

The notebook does three things:

1. **Documents the inputs, computations, flags and outputs.**  The Markdown
   cells explain what data goes in, what transformations are performed,
   why a scaling factor is applied (because the sample includes only
   approximately 1 000 synthetic users), and how the resulting flags and
   quality levels should be interpreted.
2. **Filters the analysis to a set of seven Admin 2 districts.**  Only
   observations and flows whose home district belongs to the specified list
   (Santana de Parnaíba, São Paulo, Mairiporã, Guarulhos, Mauá,
   São Bernardo do Campo and Santo André) are used in coverage and OD flow
   calculations.
3. **Adds an interactive Folium map.**  The map overlays the Admin 2
   boundaries with colour-coded polygons based on the `recommended_use` flag
   and adds a marker popup showing detailed metrics for each selected
   district; all other polygons are drawn in grey.

## Data and Inputs

This exercise uses the following inputs:

- **events.parquet:** synthetic call detail records (user_id, timestamp,
  cell_id).  These records provide the sequences of network events for each
  user.  We use them to compute speeds, classify ‘stay’ vs ‘move’ segments and
  derive origin–destination (OD) flows between administrative areas.
- **diaries.csv:** a pre‑computed stay/move diary for each user, including
  home locations (`stay_type == 'home'`).  We use only the home records (with
  longitude/latitude) to assign each user to an Admin 2 district.
- **cells.csv:** metadata about each cell tower (cell_id with latitude and
  longitude).  These coordinates allow us to convert event sequences into
  metric coordinates for distance and speed calculations.
- **BRA_adm2.gpkg / BRA_adm2_pop.csv:** the vector boundaries of Brazil’s
  second administrative level and a table of synthetic population counts
  (`pop_sum`) per Admin 2 district, derived from WorldPop.  These layers are
  used to spatially join points to Admin 2 districts and to compute coverage
  ratios.
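
To make the speed calculation mentioned in the first bullet concrete, here is a minimal sketch of the speed between two events once their cell coordinates are expressed in metres. The function name `speed_kmh` and the coordinates are illustrative assumptions, not part of the exercise data; the notebook's `add_speed_metric` helper applies the same distance-over-time idea per user.

```python
from datetime import datetime
from math import hypot


def speed_kmh(x0, y0, t0, x1, y1, t1):
    """Speed in km/h between two events given projected coordinates in metres."""
    hours = (t1 - t0).total_seconds() / 3600.0
    if hours <= 0:
        # No elapsed time: speed is undefined
        return float('nan')
    distance_km = hypot(x1 - x0, y1 - y0) / 1000.0
    return distance_km / hours


# Hypothetical pair of events: 5 km apart, 30 minutes apart
v = speed_kmh(
    0.0, 0.0, datetime(2024, 1, 1, 8, 0),
    3000.0, 4000.0, datetime(2024, 1, 1, 8, 30),
)
print(f"{v:.1f} km/h")  # 10.0 km/h
```

A speed of 10 km/h sits exactly on the default high threshold used later, which is why small changes to that threshold can flip an event between 'stay' and 'move'.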


**Important note on scaling:** the synthetic dataset contains only about
1 000 users.  To make the coverage ratio (`num_subscribers / pop_sum`) more
interpretable for training purposes, we multiply the number of subscribers
per district by a constant scaling factor (`SCALE_FACTOR = 7500` in the code
below) before dividing by population.  This factor does **not** reflect an
actual market share; when working with real data you should replace it with
an estimate grounded in the operator's actual subscriber base.
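
The arithmetic behind the scaled ratio can be sketched as follows. The district figures (40 sampled home users, one million residents) and the function name `scaled_coverage_ratio` are hypothetical and chosen only to show the effect of the factor.

```python
def scaled_coverage_ratio(num_subscribers, pop_sum, scale_factor):
    """Scaled coverage ratio: (subscribers * scale_factor) / population."""
    if pop_sum <= 0:
        # Avoid division by zero for districts without population data
        return float('nan')
    return num_subscribers * scale_factor / pop_sum


# Hypothetical district: 40 sampled home users, 1,000,000 residents
raw = scaled_coverage_ratio(40, 1_000_000, scale_factor=1)
scaled = scaled_coverage_ratio(40, 1_000_000, scale_factor=10_000)

print(f"raw ratio:    {raw:.5f}")     # 0.00004
print(f"scaled ratio: {scaled:.2f}")  # 0.40
```

Without scaling, every district in a 1 000-user sample would have a ratio close to zero, so no threshold could separate them into meaningful classes.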

**Target districts:** all analyses below are limited to the following
Admin 2 districts (ID and name). Only OD flows between these districts are analysed.

| ID_2 | Name                     |
|----:|-------------------------|
| 4918 | Santana de Parnaíba     |
| 4877 | São Paulo               |
| 4677 | Mairiporã               |
| 4570 | Guarulhos               |
| 4688 | Mauá                    |
| 4859 | São Bernardo do Campo   |
| 4920 | Santo André             |

## 1 – Setup and Data Loading

To evaluate interpretability, we need both coverage information (from Step 1) and OD flow sensitivity (from Step 2). This section loads the necessary datasets and recomputes the intermediate tables if they are not already available.

The code below loads the following files from the exercise's GitHub repository (adjust the paths if you work from local copies):
* `events.parquet`, `diaries.csv`, `cells.csv` – the raw synthetic MPD data.
* `BRA_adm2_pop.csv` and `BRA_adm2.gpkg` – synthetic population data at district (Admin 2) level, derived from WorldPop, and the administrative boundaries at the Admin 2 level.

The helper functions defined in Step 2 (to compute OD flows) are reproduced below.


In [None]:
import pandas as pd
import numpy as np
import geopandas as gpd
import warnings
warnings.filterwarnings("ignore")

# File paths (adjust if you organise your files differently)
events_fp = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/events.parquet'
diaries_fp = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/diaries.csv'
cells_fp = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/cells.csv'
adm2_pop_fp = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/BRA_adm2_pop.csv'

# Load the tabular data (events are stored as Parquet, the rest as CSV)
events = pd.read_parquet(events_fp)
diaries = pd.read_csv(diaries_fp)
cells = pd.read_csv(cells_fp)
adm2_pop = pd.read_csv(adm2_pop_fp)

# Load the Admin2 boundaries from the GeoPackage using geopandas
adm2_gdf = gpd.read_file("https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/BRA_adm2.gpkg")
# Ensure CRS is WGS84
adm2_gdf = adm2_gdf.to_crs(epsg=4326)

print(f'Loaded {len(events):,} events, {len(diaries):,} diary records, {len(cells):,} cells and {len(adm2_gdf):,} Admin2 polygons.')

### Run the helper functions cell below

In [None]:
# Helper functions from previous steps; just run this cell

import geopandas as gpd
import numpy as np

def add_coordinates_in_meters_to_events(events_df: pd.DataFrame, cells_df: pd.DataFrame) -> pd.DataFrame:
    """
    Merge events with cell coordinates (longitude/latitude) and project them to a metric CRS (UTM).
    Returns the events dataframe with additional columns: longitude, latitude, x, y (in metres).
    """
    merged_df = events_df.merge(cells_df, on='cell_id', how='left')
    gdf = gpd.GeoDataFrame(
        merged_df,
        geometry=gpd.points_from_xy(merged_df['longitude'], merged_df['latitude']),
        crs='EPSG:4326'
    )
    # Project to UTM zone 23S (EPSG:31983) – appropriate for Brazil
    gdf_projected = gdf.to_crs(epsg=31983)
    gdf_projected['x'] = gdf_projected.geometry.x
    gdf_projected['y'] = gdf_projected.geometry.y
    return gdf_projected.drop(columns=['geometry'])

def add_speed_metric(df: pd.DataFrame, group_by: str = 'user_id') -> pd.DataFrame:
    """
    Compute the backward-looking speed between consecutive cell changes for each user.
    Every event in a run of consecutive events at the same cell receives the same speed.
    """
    aux_df = df.sort_values([group_by, 'timestamp']).reset_index(drop=True)
    aux_df['timestamp'] = pd.to_datetime(aux_df['timestamp'])
    
    def compute_speed(group: pd.DataFrame) -> pd.DataFrame:
        n = len(group)
        if n == 0:
            return group.assign(speed=np.nan)
        x = group['x'].to_numpy(dtype=float)
        y = group['y'].to_numpy(dtype=float)
        ts = group['timestamp'].to_numpy('datetime64[ns]')
        cell = group['cell_id'].to_numpy()
        pos = np.arange(n)
        # Identify the start index of each new cell segment
        change_mask = np.r_[True, cell[1:] != cell[:-1]]
        change_pos = pos[change_mask]
        # For each index i, find the start of its current segment
        idx_cur = np.searchsorted(change_pos, pos, side='right') - 1
        cur_pos = np.full(n, -1, dtype=int)
        mask_cur = idx_cur >= 0
        cur_pos[mask_cur] = change_pos[idx_cur[mask_cur]]
        # Previous segment start for each row
        prev_idx = idx_cur - 1
        prev_pos = np.full(n, -1, dtype=int)
        mask_prev = prev_idx >= 0
        prev_pos[mask_prev] = change_pos[prev_idx[mask_prev]]
        # Compute speed only when both prev_pos and cur_pos are valid and different
        speed = np.full(n, np.nan, dtype=float)
        valid = (prev_pos != -1) & (cur_pos != -1) & (cur_pos != prev_pos)
        if valid.any():
            dt_h = (ts[cur_pos[valid]] - ts[prev_pos[valid]]) / np.timedelta64(1, 'h')
            dist_km = np.hypot(
                x[cur_pos[valid]] - x[prev_pos[valid]],
                y[cur_pos[valid]] - y[prev_pos[valid]]
            ) / 1000.0
            speed_vals = np.full(valid.sum(), np.nan)
            nonzero = dt_h > 0
            speed_vals[nonzero] = dist_km[nonzero] / dt_h[nonzero]
            speed[valid] = speed_vals
        return group.assign(speed=speed)
    aux_df = aux_df.groupby(group_by, group_keys=False).apply(compute_speed)
    return aux_df

def determine_stay_move_from_speed(df: pd.DataFrame, high_speed_threshold_kmh: float = 10.0, low_speed_threshold_kmh: float = 3.0, group_by: str = 'user_id') -> pd.DataFrame:
    """
    Classify each event as 'move' or 'stay' based on speed thresholds.  
    High threshold marks definite moves; contiguous rows with speed above the low threshold are also marked as moves.  
    """
    def process_user(group: pd.DataFrame) -> pd.DataFrame:
        group = group.sort_values('timestamp').reset_index(drop=True)
        group['speed'] = pd.to_numeric(group['speed'], errors='coerce')
        # Initial classification: speed > high_speed → move
        group['event_type'] = np.where(group['speed'] > high_speed_threshold_kmh, 'move', 'stay')
        seed = group['event_type'].eq('move')
        cand = group['speed'] > low_speed_threshold_kmh
        mask = seed | cand
        # Identify contiguous candidate islands
        block_id = (mask.ne(mask.shift(fill_value=False))).cumsum()
        block_id = block_id.where(mask)
        # If any seed exists in the block, extend moves to include candidates
        has_seed = seed.groupby(block_id).transform('any').fillna(False).astype(bool)
        extend = cand & has_seed
        group['event_type'] = np.where(seed | extend, 'move', 'stay')
        return group
    result = df.groupby(group_by, group_keys=False).apply(process_user)
    return result

def collapse_stay_move(df: pd.DataFrame, time_col: str = 'timestamp', type_col: str = 'event_type', lat_col: str = 'latitude', lon_col: str = 'longitude', group_by: str = 'user_id') -> pd.DataFrame:
    """
    Collapse consecutive events of the same type into segments.  
    For stay segments, compute a time-weighted centroid; for move segments, lat/lon may remain NaN.
    """
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col], errors='coerce')
    out = out.sort_values([group_by, time_col])
    out['_seg_id'] = out.groupby(group_by)[type_col].transform(lambda s: (s != s.shift()).cumsum())
    seg_bounds = out.groupby([group_by, '_seg_id']).agg(
        event_type=(type_col, 'first'),
        start_time=(time_col, 'first'),
        end_time=(time_col, 'last'),
    )
    out = out.join(seg_bounds[['end_time']], on=[group_by, '_seg_id'], rsuffix='_seg')
    out['_next_time'] = out.groupby([group_by, '_seg_id'])[time_col].shift(-1)
    w = (out['_next_time'].fillna(out['end_time']) - out[time_col]).dt.total_seconds().clip(lower=0)
    out['_w'] = w
    out['_wlat'] = out['_w'] * out[lat_col]
    out['_wlon'] = out['_w'] * out[lon_col]
    agg = out.groupby([group_by, '_seg_id']).agg(
        event_type=(type_col, 'first'),
        start_time=(time_col, 'first'),
        end_time=(time_col, 'last'),
        w_sum=('_w', 'sum'),
        wlat_sum=('_wlat', 'sum'),
        wlon_sum=('_wlon', 'sum'),
        lat_mean=(lat_col, 'mean'),
        lon_mean=(lon_col, 'mean'),
    )
    agg['duration_s'] = (agg['end_time'] - agg['start_time']).dt.total_seconds()
    stay_mask = agg['event_type'].eq('stay')
    has_weight = agg['w_sum'] > 0
    agg['latitude'] = np.where(stay_mask & has_weight, agg['wlat_sum'] / agg['w_sum'], np.where(stay_mask, agg['lat_mean'], np.nan))
    agg['longitude'] = np.where(stay_mask & has_weight, agg['wlon_sum'] / agg['w_sum'], np.where(stay_mask, agg['lon_mean'], np.nan))
    result = (
        agg.reset_index(level='_seg_id', drop=True).reset_index()[
            [group_by, 'event_type', 'start_time', 'end_time', 'duration_s', 'latitude', 'longitude']
        ].sort_values([group_by, 'start_time']).reset_index(drop=True)
    )
    return result

def map_points_to_admin2(points_df: pd.DataFrame, adm2_gdf: gpd.GeoDataFrame, lat_col: str = 'latitude', lon_col: str = 'longitude') -> pd.DataFrame:
    """
    Assign each point in `points_df` to an Admin2 polygon using a spatial join.  
    Returns a copy of the input DataFrame with added columns `ID_2` and `NAME_2`.  
    """
    gdf = gpd.GeoDataFrame(
        points_df,
        geometry=gpd.points_from_xy(points_df[lon_col], points_df[lat_col]),
        crs='EPSG:4326'
    )
    join = gpd.sjoin(gdf, adm2_gdf[['ID_2', 'NAME_2', 'geometry']], how='left', predicate='within')
    # Return DataFrame without geometry
    return join.drop(columns=['geometry'])


from typing import Tuple, List

def compute_od_flows(events_df: pd.DataFrame,
                     cells_df: pd.DataFrame,
                     target_user_ids: List[int],
                     high_speed: float,
                     low_speed: float) -> pd.DataFrame:
    """
    Compute origin–destination flows at Admin2 level for a given set of users and speed thresholds.
    
    Parameters
    ----------
    events_df : pd.DataFrame
        Raw CDR events with columns: user_id, timestamp, cell_id
    cells_df : pd.DataFrame
        Tower metadata with longitude and latitude
    target_user_ids : list
        IDs of users whose flows should be included
    high_speed : float
        High speed threshold (km/h)
    low_speed : float
        Low speed threshold (km/h)

    Returns
    -------
    pd.DataFrame
        Origin–destination flows with columns: origin_gid, destination_gid, flow
    """
    # Filter events to target users
    target_events = events_df[events_df['user_id'].isin(target_user_ids)].copy()
    if target_events.empty:
        return pd.DataFrame(columns=['origin_gid', 'destination_gid', 'flow'])
    
    # Merge coordinates and project to meters
    ev_proj = add_coordinates_in_meters_to_events(target_events, cells_df)
    
    # Compute speed
    ev_speed = add_speed_metric(ev_proj)
    
    # Classify events into stay/move
    ev_classified = determine_stay_move_from_speed(ev_speed, high_speed_threshold_kmh=high_speed, low_speed_threshold_kmh=low_speed)
    
    # Collapse into segments and only keep stay segments (for origin/destination coordinates)
    segments = collapse_stay_move(ev_classified)
    
    # Drop segments without coordinates; move segments have NaN lat/lon, so only stays remain
    segments = segments.dropna(subset=['latitude', 'longitude'])
    if segments.empty:
        return pd.DataFrame(columns=['origin_gid', 'destination_gid', 'flow'])
    
    # Map stay segments to admin2 codes
    segments_with_admin = map_points_to_admin2(segments, adm2_gdf)
    
    # Build OD trips: for each user, pair consecutive stays (origin → destination)
    trips = []
    for user, group in segments_with_admin.groupby('user_id'):
        group = group.sort_values('start_time').reset_index(drop=True)
        # iterate over consecutive pairs
        for i in range(len(group) - 1):
            origin_row = group.iloc[i]
            dest_row = group.iloc[i + 1]
            origin_gid = origin_row['ID_2']
            dest_gid = dest_row['ID_2']
            # Only count if both origin and destination codes are valid
            if pd.notna(origin_gid) and pd.notna(dest_gid):
                trips.append((origin_gid, dest_gid))
    # Summarise flows
    if not trips:
        return pd.DataFrame(columns=['origin_gid', 'destination_gid', 'flow'])
    trips_df = pd.DataFrame(trips, columns=['origin_gid', 'destination_gid'])
    flows = trips_df.value_counts().reset_index(name='flow')
    return flows


## Coverage calculation

The following code cell computes the coverage ratio per district for the
target areas.  Home records are spatially joined to Admin 2 polygons,
filtered to the seven districts listed above, and grouped by district name.

In [None]:
# ------------------------------------------------------------------
# 0. Config: list of Admin2 we focus on for the exercise (7 target areas)
#    These are used for coverage & interpretability, but OD users
#    are selected from a larger set (top N Admin2) as in Step 2.
# ------------------------------------------------------------------
allowed_ids = [4918, 4877, 4677, 4570, 4688, 4859, 4920]
allowed_names = [
    'Santana de Parnaíba', 'São Paulo', 'Mairiporã',
    'Guarulhos', 'Mauá', 'São Bernardo do Campo', 'Santo André'
]

# ------------------------------------------------------------------
# 1. Home locations for ALL Admin2 (same idea as Step 2 – no filter yet)
# ------------------------------------------------------------------
home_df = (
    diaries[diaries['stay_type'] == 'home']
    .dropna(subset=['longitude', 'latitude'])
    .copy()
)

home_gdf = gpd.GeoDataFrame(
    home_df,
    geometry=gpd.points_from_xy(home_df['longitude'], home_df['latitude']),
    crs='EPSG:4326'
)

home_join = gpd.sjoin(
    home_gdf,
    adm2_gdf[['ID_2', 'NAME_2', 'geometry']],
    how='left',
    predicate='within'
)

# ------------------------------------------------------------------
# 2. Count residents per Admin2 (all districts)
# ------------------------------------------------------------------
home_counts_all = (
    home_join
    .groupby(['ID_2', 'NAME_2'])['user_id']
    .nunique()
    .reset_index(name='num_subscribers')
)

home_counts_all = home_counts_all.sort_values('num_subscribers', ascending=False)

# ------------------------------------------------------------------
# 3. Define target_admin2 for OD flows
#    -> Top N Admin2 by number of residents (same logic as Step 2)
#    If Step 2 uses head(12), set TOP_N = 12
# ------------------------------------------------------------------
TOP_N = 12
target_admin2 = home_counts_all.head(TOP_N)['ID_2'].tolist()

print("Top Admin2 districts by number of subscribers (used for OD):")
display(home_counts_all.head(TOP_N))

# ------------------------------------------------------------------
# 4. Build coverage table only for the 7 focus Admin2 (allowed_ids)
#    and scale subscribers for demonstration of coverage levels.
# ------------------------------------------------------------------
district_cov = (
    adm2_pop[['ID_2', 'NAME_2', 'pop_sum']]
    .merge(home_counts_all, on=['ID_2', 'NAME_2'], how='left')
)

district_cov['num_subscribers'] = district_cov['num_subscribers'].fillna(0)

# keep only our 7 focus areas
district_cov = district_cov[district_cov['ID_2'].isin(allowed_ids)].copy()

# Scale num_subscribers just for illustration of coverage_ratio
# (we only have 1000 users; this is for demo, not real calibration)
SCALE_FACTOR = 7500  # does NOT affect trip counts, only coverage_ratio

district_cov['coverage_ratio'] = (
    district_cov['num_subscribers'] * SCALE_FACTOR / district_cov['pop_sum']
)

def classify_cov(r):
    if r >= 0.4:
        return 'High'
    elif r >= 0.2:
        return 'Medium'
    else:
        return 'Low'

district_cov['coverage_flag'] = district_cov['coverage_ratio'].apply(classify_cov)

print("Coverage table (scaled, 7 focus Admin2):")
display(district_cov)

### Coverage flag definitions

- **High:** the scaled coverage ratio is ≥ 0.4.  After scaling, the synthetic
  MNO sample corresponds to at least 40 % of the synthetic population, so
  population estimates are considered robust.
- **Medium:** the ratio is at least 0.2 but below 0.4.  The operator’s coverage
  is moderate; estimates can be used for rough trends, but detailed planning
  requires caution.
- **Low:** the ratio is below 0.2.  There are too few subscribers in this
  district, so any indicators derived from these data may be unreliable.

> **Note:** These threshold values and the scaling factor are *synthetic* and
> chosen only for this exercise so that we can clearly see High / Medium / Low
> coverage examples. For real analyses, they must be calibrated using
> realistic information about market share and subscriber coverage.
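
The cut-offs above can also be applied in a vectorised way. The sketch below uses `pandas.cut` with left-closed bins as one possible alternative to the row-wise `classify_cov` function; the ratio values are hypothetical. One difference to keep in mind: a missing ratio stays missing with `pd.cut`, whereas `classify_cov` would label it 'Low'.

```python
import numpy as np
import pandas as pd

# Hypothetical scaled coverage ratios, including values exactly on the cut-offs
ratios = pd.Series([0.05, 0.20, 0.39, 0.40, 0.85], name='coverage_ratio')

# right=False makes bins left-closed: [-inf, 0.2), [0.2, 0.4), [0.4, inf)
flags = pd.cut(
    ratios,
    bins=[-np.inf, 0.2, 0.4, np.inf],
    labels=['Low', 'Medium', 'High'],
    right=False,
)

print(pd.DataFrame({'coverage_ratio': ratios, 'coverage_flag': flags}))
```

Values exactly on a boundary (0.20 and 0.40) fall into the higher class, matching the `>=` comparisons used in the notebook.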

## OD flows and sensitivity

The next two cells compute origin–destination flows under two sets of speed
thresholds (Method A and Method B) and quantify how sensitive each district
is to that parameter choice.
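
To see why the high speed threshold matters, the two-threshold rule can be restated in a few lines of plain Python. This is a simplified sketch of the idea behind `determine_stay_move_from_speed`, not a replacement for it; the function name `classify_stay_move` and the speed values are hypothetical.

```python
def classify_stay_move(speeds, high_kmh=10.0, low_kmh=3.0):
    """Label each speed 'move' or 'stay' using two thresholds.

    A run of consecutive speeds above `low_kmh` becomes 'move' only if at
    least one speed in that run also exceeds `high_kmh`.
    """
    labels = ['stay'] * len(speeds)
    i = 0
    while i < len(speeds):
        if speeds[i] > low_kmh:
            # Find the end of this run of candidate speeds
            j = i
            while j < len(speeds) and speeds[j] > low_kmh:
                j += 1
            # Promote the whole run if any speed exceeds the high threshold
            if any(s > high_kmh for s in speeds[i:j]):
                labels[i:j] = ['move'] * (j - i)
            i = j
        else:
            i += 1
    return labels


# Hypothetical speeds (km/h) for one user's consecutive events
speeds = [0.0, 4.0, 12.0, 5.0, 1.0, 4.0, 6.0, 2.0, 15.0]

method_a = classify_stay_move(speeds, high_kmh=10, low_kmh=3)
method_b = classify_stay_move(speeds, high_kmh=20, low_kmh=3)

print('Method A moves:', method_a.count('move'))  # 4
print('Method B moves:', method_b.count('move'))  # 0
```

With the higher threshold of Method B, none of these events counts as a move, so the stays around them merge and fewer origin–destination pairs are produced.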

In [None]:
# ------------------------------------------------------------------
# 0. Users whose home is in one of the target Admin2
#    (same definition as Step 2: top N Admin2, not only 7 focus areas)
# ------------------------------------------------------------------
target_user_ids = (
    home_join[home_join['ID_2'].isin(target_admin2)]['user_id']
    .unique()
    .tolist()
)
print(f'Number of users with home in target Admin2: {len(target_user_ids)}')

# ------------------------------------------------------------------
# 1. Compute OD flows for Method A & B (same parameters as Step 2)
# ------------------------------------------------------------------
od_A = compute_od_flows(
    events, cells, target_user_ids,
    high_speed=10, low_speed=3
).rename(columns={'flow': 'flow_A'})

od_B = compute_od_flows(
    events, cells, target_user_ids,
    high_speed=20, low_speed=3
).rename(columns={'flow': 'flow_B'})

print(f'Method A produced {len(od_A)} OD pairs.')
print(f'Method B produced {len(od_B)} OD pairs.')

# ------------------------------------------------------------------
# 2. Merge flows + compute differences (mirror Step 2)
# ------------------------------------------------------------------
od_merged = pd.merge(
    od_A, od_B,
    on=['origin_gid', 'destination_gid'],
    how='outer'
).fillna(0)

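# diff_AB is the absolute difference (A - B); rel_diff and pct_change are
# measured relative to A as (B - A) / A, so their sign is opposite to diff_AB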
od_merged['diff_AB'] = od_merged['flow_A'] - od_merged['flow_B']
od_merged['rel_diff'] = np.where(
    od_merged['flow_A'] > 0,
    (od_merged['flow_B'] - od_merged['flow_A']) / od_merged['flow_A'],
    np.nan
)
od_merged['pct_change'] = od_merged['rel_diff'] * 100

# ------------------------------------------------------------------
# 3. Attach Admin2 names
# ------------------------------------------------------------------
adm2_names = adm2_gdf[['ID_2', 'NAME_2']].drop_duplicates()

origin_lookup = adm2_names.rename(
    columns={'ID_2': 'origin_gid', 'NAME_2': 'Origin Admin2'}
)
destination_lookup = adm2_names.rename(
    columns={'ID_2': 'destination_gid', 'NAME_2': 'Destination Admin2'}
)

od_named = (
    od_merged
    .merge(origin_lookup, on='origin_gid', how='left')
    .merge(destination_lookup, on='destination_gid', how='left')
)

# ------------------------------------------------------------------
# 4. Residents per Admin2 (from home_join; same base used for coverage)
# ------------------------------------------------------------------
residents = (
    home_join
    .groupby('ID_2')['user_id']
    .nunique()
    .reset_index(name='num_residents')
)

res_origin = residents.rename(
    columns={'ID_2': 'origin_gid', 'num_residents': 'Origin_residents'}
)
res_dest = residents.rename(
    columns={'ID_2': 'destination_gid', 'num_residents': 'Destination_residents'}
)

od_named = (
    od_named
    .merge(res_origin, on='origin_gid', how='left')
    .merge(res_dest, on='destination_gid', how='left')
)

# ------------------------------------------------------------------
# 5. Apply Step 2 filtering logic EXACTLY:
#
#   - Only OD where both origin & destination are in allowed_ids (7 focus areas)
#   - O != D (no intra-district)
#   - Origin_residents > 15 and Destination_residents > 15
#   - Trip count by Method A >= 100
# ------------------------------------------------------------------
mask = (
    (od_named['origin_gid'].isin(allowed_ids)) &
    (od_named['destination_gid'].isin(allowed_ids)) &
    (od_named['origin_gid'] != od_named['destination_gid']) &
    (od_named['Origin_residents'] > 15) &
    (od_named['Destination_residents'] > 15) &
    (od_named['flow_A'] >= 100)
)

od_filtered = od_named[mask].copy()
od_filtered = od_filtered.sort_values('pct_change', ascending=True)

od_output = od_filtered[[
    'Origin Admin2',
    'Destination Admin2',
    'flow_A',
    'flow_B',
    'diff_AB',
    'rel_diff',
    'pct_change',
    'Origin_residents',
    'Destination_residents'
]].rename(columns={
    'flow_A': 'Trip count by method A',
    'flow_B': 'Trip count by method B'
})

print(f'Filtered OD pairs (Step 3, same logic as Step 2 + 7 focus Admin2): {len(od_output)}')
display(od_output.head(20))

Now compute the sensitivity per district by looking at the maximum absolute
percentage change across all flows touching each district (either as origin or
destination):

In [None]:
# 5. Compute max absolute percent change for each district
#    (based on pct_change = (Flow_B - Flow_A) / Flow_A * 100;
#    uses all OD pairs in od_merged, not only the filtered table above)
origin_sens = (
    od_merged.groupby('origin_gid')['pct_change']
    .apply(lambda x: np.nanmax(np.abs(x)))
    .reset_index(name='max_abs_percent_change')
)

dest_sens = (
    od_merged.groupby('destination_gid')['pct_change']
    .apply(lambda x: np.nanmax(np.abs(x)))
    .reset_index(name='max_abs_percent_change_dest')
)

# 6. Merge with coverage table (district_cov is already limited to the focus Admin2)
district_sens = district_cov.merge(
    origin_sens, left_on='ID_2', right_on='origin_gid', how='left'
).merge(
    dest_sens, left_on='ID_2', right_on='destination_gid', how='left'
)

# 7. Combine origin / destination sensitivity into a single value per district
_vals = district_sens[['max_abs_percent_change', 'max_abs_percent_change_dest']].to_numpy()
district_sens['max_abs_percent_change'] = np.nanmax(_vals, axis=1)

# drop helper columns we no longer need
district_sens = district_sens.drop(
    columns=['max_abs_percent_change_dest', 'origin_gid', 'destination_gid']
)

# 8. Classify sensitivity level for each district
def classify_sensitivity(x):
    if pd.isna(x):
        # If we never see this district in any OD pair, treat sensitivity as High (very uncertain)
        return 'High'
    x_abs = abs(x)
    return 'Low' if x_abs < 20 else ('Medium' if x_abs < 40 else 'High')

district_sens['sensitivity_flag'] = district_sens['max_abs_percent_change'].apply(classify_sensitivity)

print("Coverage & sensitivity table:")
display(
    district_sens[
        ['ID_2', 'NAME_2', 'pop_sum', 'num_subscribers',
         'coverage_ratio', 'coverage_flag',
         'max_abs_percent_change', 'sensitivity_flag']
    ]
)

### Sensitivity flag definitions

- **Low:** the maximum absolute percent change across OD flows touching this
  district is less than 20 %.  The flows are stable with respect to the
  speed-threshold parameter.
- **Medium:** the max percent change is at least 20 % but below 40 %.  Flows
  are moderately sensitive; indicators should be used with caution.
- **High:** the max percent change is 40 % or more, or missing (for example
  because the district appears in no OD pair).  OD flows vary significantly
  under different parameter choices; use these indicators only for rough
  approximations.

> **Note:** These sensitivity thresholds are also *synthetic* and chosen to
> highlight clear differences between Low / Medium / High sensitivity in this
> small 1,000-user sample. In a real deployment, the cut-offs should be tuned
> to the context and validated against domain knowledge.
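
The per-district sensitivity can be traced on a toy OD table. The sketch below uses hypothetical districts W, X, Y, Z and made-up trip counts; it stacks origin and destination into one column, which is one possible way to obtain the maximum over all flows touching a district, and then applies the same cut-offs as `classify_sensitivity`.

```python
import pandas as pd

# Hypothetical trip counts under Method A and Method B
od = pd.DataFrame({
    'origin_gid':      ['X', 'Y', 'Z', 'Y', 'W'],
    'destination_gid': ['Y', 'Z', 'X', 'X', 'Y'],
    'flow_A':          [200, 100,  50, 150, 100],
    'flow_B':          [180,  70,  20, 150,  95],
})
od['pct_change'] = (od['flow_B'] - od['flow_A']) / od['flow_A'] * 100

# Each flow "touches" both its origin and its destination district
touching = pd.concat([
    od[['origin_gid', 'pct_change']].rename(columns={'origin_gid': 'district'}),
    od[['destination_gid', 'pct_change']].rename(columns={'destination_gid': 'district'}),
], ignore_index=True)
touching['abs_pct'] = touching['pct_change'].abs()
max_abs = touching.groupby('district')['abs_pct'].max()


def classify_sensitivity(x):
    """Same cut-offs as the notebook: <20 Low, <40 Medium, otherwise High."""
    if pd.isna(x):
        return 'High'
    return 'Low' if x < 20 else ('Medium' if x < 40 else 'High')


flags = max_abs.apply(classify_sensitivity)
print(pd.DataFrame({'max_abs_pct_change': max_abs.round(1),
                    'sensitivity_flag': flags}))
```

District X is flagged High even though its outgoing flow changes by only 10 %, because the incoming flow from Z drops by 60 %. A single unstable flow is enough to raise a district's flag.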

## Quality assessment and recommended use

Combine the coverage and sensitivity flags to derive a quality level and a
recommended use for each district:

In [None]:
# 9. Derive quality level
def derive_quality(row):
    cov = row['coverage_flag']
    sens = row['sensitivity_flag']
    if cov == 'High' and sens == 'Low':
        return 'High quality'
    elif cov in ('High', 'Medium') and sens in ('Low', 'Medium'):
        return 'Medium quality'
    else:
        return 'Low quality'

district_sens['quality_level'] = district_sens.apply(derive_quality, axis=1)

# 10. Recommended use based on quality
def recommend_use(q):
    if q == 'High quality':
        return 'Suitable for detailed evacuation planning'
    elif q == 'Medium quality':
        return 'Suitable only for rough prioritisation'
    else:
        return 'Not suitable without additional information'

district_sens['recommended_use'] = district_sens['quality_level'].apply(recommend_use)

print('Final quality assessment:')
display(
    district_sens[[
        'ID_2','NAME_2','coverage_ratio','coverage_flag',
        'max_abs_percent_change','sensitivity_flag',
        'quality_level','recommended_use'
    ]]
)

### Quality & use definitions

> **Note:** The quality levels below are based on *synthetic* thresholds
> chosen for this exercise (coverage cut-offs and sensitivity cut-offs).  
> They are meant to illustrate how coverage and robustness might be combined.
> In a real application, these thresholds should be calibrated using
> domain expertise and validation against ground truth.

- **High quality:**  
  - Typically occurs when **coverage is High** and **sensitivity is Low**.  
  - We see enough subscribers to represent the synthetic population after
    scaling, *and* OD flows are stable when we change the speed thresholds.  
  - The OD matrix is considered reliable for **detailed disaster planning**
    in this district (e.g. evacuation routing, shelter allocation).

- **Medium quality:**  
  - Occurs for **intermediate combinations**, e.g. High/Medium coverage with
    Low/Medium sensitivity.  
  - The results capture the **main patterns**, but either sample size or
    robustness is not strong enough for precise decisions.  
  - Use these indicators to **set broad priorities only** (e.g. which areas
    to look at first), not for fine-grained interventions.

- **Low quality:**  
  - Triggered when **coverage is Low** or **sensitivity is High** (flows
    react strongly to parameter changes or are missing).  
  - Either we see too few subscribers, or the OD estimates are highly
    unstable under reasonable modelling choices.  
  - These indicators **should not be used for operational planning**
    without additional data, re-calibration of parameters, or external
    validation (e.g. surveys, traffic counts, other operators).
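
The combination rules can be checked by enumerating all nine coverage and sensitivity pairs. The sketch below restates the notebook's `derive_quality` logic as a function of two flags (rather than a DataFrame row) and prints the resulting decision matrix.

```python
from itertools import product

LEVELS = ('Low', 'Medium', 'High')


def derive_quality(coverage_flag, sensitivity_flag):
    """Same rules as the notebook, taking the two flags directly."""
    if coverage_flag == 'High' and sensitivity_flag == 'Low':
        return 'High quality'
    if coverage_flag in ('High', 'Medium') and sensitivity_flag in ('Low', 'Medium'):
        return 'Medium quality'
    return 'Low quality'


# Quality level for every (coverage, sensitivity) combination
matrix = {(c, s): derive_quality(c, s) for c, s in product(LEVELS, LEVELS)}

print(f"{'coverage':<10}" + ''.join(f"sens={s:<12}" for s in LEVELS))
for c in reversed(LEVELS):
    print(f"{c:<10}" + ''.join(f"{matrix[(c, s)]:<17}" for s in LEVELS))
```

Only one of the nine combinations yields 'High quality', three yield 'Medium quality' and five yield 'Low quality', so the scheme is deliberately conservative.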


## Folium choropleth map

Finally, create an interactive map showing the `recommended_use` for the
target districts.  Districts outside the target list are coloured grey.

In [None]:
import folium

# Colour palette for the three categories
colour_map = {
    'Suitable for detailed evacuation planning': '#2ca02c',  # green
    'Suitable only for rough prioritisation': '#ff7f0e',     # orange
    'Not suitable without additional information': '#d62728' # red
}

# Create a base map centred on the mean of all home locations
centre_lat = home_join['latitude'].mean()
centre_lon = home_join['longitude'].mean()
m = folium.Map(location=[centre_lat, centre_lon], zoom_start=9, tiles='cartodbpositron')

# Create a lookup for colours and tooltip info by ID_2
info_lookup = district_sens.set_index('ID_2').to_dict(orient='index')

# Define a style function for colouring polygons
def style_function(feature):
    gid = feature['properties']['ID_2']
    if gid in info_lookup:
        use = info_lookup[gid]['recommended_use']
        return {
            'fillColor': colour_map[use],
            'color': 'black',
            'weight': 0.5,
            'fillOpacity': 0.6
        }
    else:
        return {
            'fillColor': '#cccccc',  # grey for non-target areas
            'color': 'white',
            'weight': 0.5,
            'fillOpacity': 0.4
        }

# Build tooltip text for target districts (kept for reference: the GeoJson
# layer below does not call it; the marker popups carry the same details)
def tooltip_function(feature):
    gid = feature['properties']['ID_2']
    if gid in info_lookup:
        info = info_lookup[gid]
        return (
            f"ID_2: {gid}<br>"
            f"Name: {info['NAME_2']}<br>"
            f"Coverage ratio: {info['coverage_ratio']:.3f}<br>"
            f"Coverage flag: {info['coverage_flag']}<br>"
            f"Max abs % change: {info['max_abs_percent_change']:.1f}<br>"
            f"Sensitivity flag: {info['sensitivity_flag']}<br>"
            f"Quality level: {info['quality_level']}<br>"
            f"Recommended use: {info['recommended_use']}"
        )
    else:
        return ''

# Add the GeoJson layer with style and tooltips
folium.GeoJson(
    adm2_gdf.__geo_interface__,
    name='Admin2',
    style_function=style_function,
    tooltip=folium.GeoJsonTooltip(
        fields=[],
        aliases=[],
        labels=False,
        sticky=False,
        toLocaleString=False,
        style=("background-color: white; border: 1px solid black; "
               "padding: 5px;")
    ),
    highlight_function=lambda x: {'weight': 3, 'color': 'yellow'}
).add_to(m)

# Override the tooltip for target districts: add markers at an interior point
for gid, info in info_lookup.items():
    # find the feature geometry
    poly = adm2_gdf[adm2_gdf['ID_2'] == gid].iloc[0].geometry

    # use an interior point instead of centroid (centroid can fall outside for concave shapes)
    pt = poly.representative_point()

    popup_html = (
        f"<b>{info['NAME_2']}</b><br>"
        f"Coverage ratio: {info['coverage_ratio']:.3f}<br>"
        f"Coverage flag: {info['coverage_flag']}<br>"
        f"Max abs % change: {info['max_abs_percent_change']:.1f}<br>"
        f"Sensitivity flag: {info['sensitivity_flag']}<br>"
        f"Quality level: {info['quality_level']}<br>"
        f"Recommended use: {info['recommended_use']}"
    )
    folium.Marker(
        location=[pt.y, pt.x],
        popup=folium.Popup(popup_html, max_width=300),
        icon=folium.Icon(color='blue', icon='info-sign')
    ).add_to(m)

print('Interactive map ready:')
m

# Optional: save the map to an HTML file
# m.save('district_quality_map.html')

## 3 – Conclusions and Recommendations

In this final step we combined district-level coverage ratios with method
sensitivity metrics to assess the reliability of the OD matrix for disaster
preparedness. The resulting table summarises, for each district, the strength
of coverage and the degree of sensitivity to the speed thresholds.

You can now use the `quality_level` and `recommended_use` columns to determine
which districts are reliable enough for detailed evacuation planning and which
require caution or additional data. Adjust the coverage and sensitivity
thresholds as needed to explore different scenarios.
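
One way to explore such scenarios is to pass the cut-offs as parameters and compare the resulting quality levels side by side. The sketch below is a self-contained illustration: the four districts, their values and the 'stricter' cut-offs are hypothetical, and the function names are not part of the notebook.

```python
def coverage_flag(ratio, medium_cut, high_cut):
    if ratio >= high_cut:
        return 'High'
    return 'Medium' if ratio >= medium_cut else 'Low'


def sensitivity_flag(pct, low_cut, medium_cut):
    if pct != pct:  # NaN: no usable OD flow for this district
        return 'High'
    if abs(pct) < low_cut:
        return 'Low'
    return 'Medium' if abs(pct) < medium_cut else 'High'


def quality_level(cov, sens):
    if cov == 'High' and sens == 'Low':
        return 'High quality'
    if cov in ('High', 'Medium') and sens in ('Low', 'Medium'):
        return 'Medium quality'
    return 'Low quality'


# Hypothetical districts: (scaled coverage ratio, max abs % change)
districts = {'D1': (0.55, 12.0), 'D2': (0.35, 18.0),
             'D3': (0.45, 35.0), 'D4': (0.10, 8.0)}

scenarios = {
    'notebook cut-offs': {'cov': (0.2, 0.4), 'sens': (20.0, 40.0)},
    'stricter cut-offs': {'cov': (0.3, 0.5), 'sens': (10.0, 30.0)},
}

results = {}
for name, cuts in scenarios.items():
    results[name] = {
        d: quality_level(coverage_flag(ratio, *cuts['cov']),
                         sensitivity_flag(pct, *cuts['sens']))
        for d, (ratio, pct) in districts.items()
    }
    print(name, results[name])
```

In this toy case, tightening the cut-offs moves D1 from 'High quality' to 'Medium quality' and D3 from 'Medium quality' to 'Low quality', which shows how strongly the final recommendation depends on the chosen thresholds.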

> **Important:** In this exercise we only work with a synthetic sample of
> 1,000 users. The coverage ratios and quality classes are therefore driven by
> two *illustrative* choices:
> - a scaling factor applied to the number of subscribers, and
> - hand-picked thresholds for coverage and sensitivity.
>
> These numbers are not meant to reflect the real market share of any operator.

In a real application you would:
1. Replace the synthetic scaling by an evidence-based estimate of market share
   (e.g. using regulatory statistics or internal subscriber counts).
2. Re-tune the coverage and sensitivity thresholds so that the flags align with
   expert judgement and, where possible, external validation data.
3. Use the resulting `quality_level` and `recommended_use` fields as a
   **screening tool**:  
   - focus detailed evacuation planning on *High-quality* districts,  
   - use *Medium-quality* districts only for broad prioritisation, and  
   - treat *Low-quality* districts as “high-uncertainty areas” that require
     additional information or alternative data sources.
