
# Step 1: Data Integrity & Coverage

This notebook performs integrity and coverage checks for the synthetic call detail record (CDR) dataset provided for Brazil (March 2025). We combine CDR event logs (`events.parquet`), diary-based home locations (`diaries.csv`), the cell tower catalogue (`cells.csv`), and synthetic census populations aggregated to administrative boundaries (`BRA_adm1_pop.csv`, `BRA_adm2_pop.csv`, `BRA_adm3_pop.csv`).

The goal is to understand whether the CDR data adequately represents the population over time and across districts (admin units). We compute temporal coverage, spatial coverage, and a few basic integrity checks.


In [None]:
# Install GeoPandas and dependencies if not already available (optional in Colab)
import sys

try:
    import geopandas as gpd
except ImportError:
    # In Colab you can uncomment the next line to install geopandas and dependencies
    # !pip install geopandas shapely fiona pyproj rtree folium
    raise ImportError("GeoPandas is required to run this notebook. Please install it via pip in your environment.")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# Set Pandas display options for better readability
pd.set_option('display.max_columns', 50)

# Define file paths – adjust these if you store the files elsewhere
DIARIES_FILE = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/diaries.csv'
EVENTS_FILE  = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/events.parquet'
CELLS_FILE   = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/cells.csv'
ADM1_POP_FILE = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/BRA_adm1_pop.csv'
ADM2_POP_FILE = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/BRA_adm2_pop.csv'
ADM3_POP_FILE = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/BRA_adm3_pop.csv'
ADM_SHAPE_ZIP = 'https://github.com/Flowminder/WB-GDF-Modern-Data-Workflows-Code-Tasks/raw/refs/heads/main/2.5.1_input_data/BRA_adm.zip'  # zipped shapefile containing admin boundaries

# Load core datasets
# The paths above are URLs, so a failed download raises URLError/HTTPError rather than
# FileNotFoundError. All of these are subclasses of OSError, which is caught below.
try:
    diaries = pd.read_csv(DIARIES_FILE, parse_dates=['initial_timestamp','final_timestamp'])
    print(f"Loaded diaries with {len(diaries):,} rows")
except OSError as e:
    print(f"Could not load diaries.csv ({e}). Please check the path or URL.")
    diaries = None

try:
    events = pd.read_parquet(EVENTS_FILE)
    events['timestamp'] = pd.to_datetime(events['timestamp'])
    print(f"Loaded events with {len(events):,} rows")
except OSError as e:
    print(f"Could not load events.parquet ({e}). Please check the path or URL.")
    events = None

try:
    cells = pd.read_csv(CELLS_FILE)
    print(f"Loaded cells with {len(cells):,} rows")
except OSError as e:
    print(f"Could not load cells.csv ({e}). Please check the path or URL.")
    cells = None

# Load population tables
adm1_pop = pd.read_csv(ADM1_POP_FILE) if ADM1_POP_FILE else None
adm2_pop = pd.read_csv(ADM2_POP_FILE) if ADM2_POP_FILE else None
adm3_pop = pd.read_csv(ADM3_POP_FILE) if ADM3_POP_FILE else None

# Read the administrative boundaries from the zipped shapefile using GeoPandas
# If you only need one level, read the corresponding layer name (e.g., 'BRA_adm1', 'BRA_adm2', 'BRA_adm3').
try:
    adm1_gdf = gpd.read_file(f"zip://{ADM_SHAPE_ZIP}!BRA_adm1.shp")
    adm2_gdf = gpd.read_file(f"zip://{ADM_SHAPE_ZIP}!BRA_adm2.shp")
    adm3_gdf = gpd.read_file(f"zip://{ADM_SHAPE_ZIP}!BRA_adm3.shp")
    print("Administrative boundaries loaded successfully")
except Exception as e:
    print(f"Error loading administrative boundaries: {e}")
    adm1_gdf = adm2_gdf = adm3_gdf = None

# Display a sample of the population tables
print('Adm1 population head:')
print(adm1_pop.head())
print('Adm2 population head:')
print(adm2_pop.head())
print('Adm3 population head:')
print(adm3_pop.head())



## 1. Temporal Coverage

The temporal coverage check answers questions such as:

* **How many CDR records are generated each day?**
* **How many unique subscribers are active each day?**
* **Are there gaps (missing days) or spikes in the CDR volumes?**

Below we extract the date from the event timestamps and compute these counts. If you are using diary entries instead of events, you may apply a similar approach on the diary timestamps.
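
One way to make the "gaps or spikes" question concrete is to compare each day's volume with the median of all days, using a robust z-score based on the median absolute deviation (MAD). The sketch below runs on made-up daily counts; the helper name `flag_volume_anomalies` and the threshold of 3.5 are illustrative assumptions, not part of the provided data or of the checks in this notebook.

```python
import pandas as pd

def flag_volume_anomalies(daily_counts: pd.Series, threshold: float = 3.5) -> pd.DataFrame:
    """Flag days whose volume deviates strongly from the median (hypothetical helper)."""
    median = daily_counts.median()
    mad = (daily_counts - median).abs().median()
    if mad == 0:
        # At least half of the days equal the median: no usable spread, so flag nothing
        robust_z = pd.Series(0.0, index=daily_counts.index)
    else:
        # 0.6745 rescales the MAD so the score is comparable to a standard z-score
        robust_z = 0.6745 * (daily_counts - median) / mad
    return pd.DataFrame({
        'num_records': daily_counts,
        'robust_z': robust_z,
        'is_anomaly': robust_z.abs() > threshold,
    })

# Made-up daily record counts with one obvious spike on 2025-03-06
toy_counts = pd.Series(
    [1000, 1020, 980, 1010, 990, 3000, 1005],
    index=pd.date_range('2025-03-01', periods=7, freq='D'),
)
anomalies = flag_volume_anomalies(toy_counts)
print(anomalies[anomalies['is_anomaly']])
```

With real data, the same helper could be applied to the daily record counts or to the daily active subscriber counts. A median-based score is used here because a single extreme day would inflate a mean and standard deviation and could hide itself.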


In [None]:
if events is not None:
    # Extract date (year-month-day) from timestamps
    events['date'] = events['timestamp'].dt.normalize()  # sets time to midnight

    # Number of CDR records per day
    records_per_day = events.groupby('date').size().rename('num_records')

    # Number of unique active subscribers per day
    subscribers_per_day = events.groupby('date')['user_id'].nunique().rename('active_subscribers')

    # Combine into one DataFrame
    temporal_summary = pd.concat([records_per_day, subscribers_per_day], axis=1).reset_index()

    # Sort by date for readability
    temporal_summary = temporal_summary.sort_values('date')

    # Display a few rows
    print("Temporal coverage summary (first 10 rows):")
    print(temporal_summary.head(10))
else:
    print("No events data to compute temporal coverage.")


In [None]:
if events is not None and not events.empty:
    # Create full date range to identify missing days
    start_date = events['date'].min()
    end_date = events['date'].max()
    full_dates = pd.date_range(start_date, end_date, freq='D')

    # Create a complete temporal summary with all dates
    temporal_summary_complete = pd.DataFrame({'date': full_dates})

    # Merge with actual counts, filling missing dates with 0
    temporal_summary_complete = temporal_summary_complete.merge(
        temporal_summary[['date', 'active_subscribers']],
        on='date',
        how='left'
    ).fillna(0)

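    # Dates in the full range with no events (index set difference instead of a Python loop)
    observed_dates = pd.DatetimeIndex(events['date'].unique())
    missing_dates = full_dates.difference(observed_dates).tolist()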
    print(f"Missing dates (no events): {missing_dates}")

    # Add day of week indicator
    temporal_summary_complete['day_of_week'] = temporal_summary_complete['date'].dt.day_name()
    temporal_summary_complete['is_weekend'] = temporal_summary_complete['date'].dt.dayofweek >= 5
    temporal_summary_complete['day_type'] = temporal_summary_complete['is_weekend'].map({True: 'Weekend', False: 'Weekday'})

    import matplotlib.ticker as mtick
    import matplotlib.dates as mdates

    # Plot: Daily active subscribers with weekend shading
    plt.figure(figsize=(14, 6))
    plt.plot(temporal_summary_complete['date'], temporal_summary_complete['active_subscribers'],
             marker='o', linewidth=2, markersize=5, color='green')
    plt.title('Daily Active Subscribers with Weekend Highlighting', fontsize=14)
    plt.xlabel('Date')
    plt.ylabel('Number of active subscribers')
    plt.grid(True, linestyle='--', alpha=0.3)

    # Shade weekends
    for date, is_weekend in zip(temporal_summary_complete['date'], temporal_summary_complete['is_weekend']):
        if is_weekend:
            plt.axvspan(date, date + pd.Timedelta(days=1), alpha=0.2, color='gray')

    # Format y-axis with comma separators
    plt.gca().yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, pos: f"{int(x):,}"))

    # Set x-axis to show dates
    plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=1))
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
    plt.xticks(rotation=45, ha='right')

    plt.tight_layout()
    plt.show()

    # Day of week pattern for active subscribers
    day_of_week_summary = temporal_summary_complete.groupby('day_of_week')['active_subscribers'].mean()
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    day_of_week_summary = day_of_week_summary.reindex(day_order)

    print("\nAverage Active Subscribers by Day of Week:")
    print(day_of_week_summary)

    # Bar chart for active subscribers by day of week
    plt.figure(figsize=(10, 6))
    plt.bar(range(len(day_of_week_summary)), day_of_week_summary,
            color=['#2ca02c' if day not in ['Saturday', 'Sunday'] else '#ff7f0e'
                   for day in day_of_week_summary.index])
    plt.xlabel('Day of Week')
    plt.ylabel('Average Number of Active Subscribers')
    plt.title('Average Active Subscribers by Day of Week', fontsize=14)
    plt.xticks(range(len(day_of_week_summary)), day_of_week_summary.index, rotation=45, ha='right')
    plt.gca().yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, pos: f"{int(x):,}"))
    plt.grid(True, linestyle='--', alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

    # Weekday vs Weekend summary
    weekday_weekend_summary = temporal_summary_complete.groupby('day_type')['active_subscribers'].agg(['mean', 'sum', 'count'])
    print("\nWeekday vs Weekend Summary for Active Subscribers:")
    print(weekday_weekend_summary)

else:
    print("Cannot compute missing days without events data.")


### Missing Days and Weekly Patterns

To detect gaps, we build a complete daily date range from the earliest to the latest event date and compare it with the dates present in the data. Any date in the range that is absent from the data is a day with no CDR activity. We also label each day as a weekday or weekend to explore weekly patterns.
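
The gap check described above can be sketched on a small made-up table. The timestamps below are invented for illustration, and `toy_events` is a hypothetical stand-in for the real `events` table.

```python
import pandas as pd

# Made-up timestamps: 2025-03-03 has no events at all
toy_events = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2025-03-01 08:15', '2025-03-01 19:40',
        '2025-03-02 07:05',
        '2025-03-04 12:30', '2025-03-04 23:59',
    ])
})

# Dates that actually occur in the data (time of day set to midnight)
observed = pd.DatetimeIndex(toy_events['timestamp'].dt.normalize().unique())

# Every calendar day between the first and last observed date
full_range = pd.date_range(observed.min(), observed.max(), freq='D')

# Days in the full range that never appear in the data
missing = full_range.difference(observed)
print(list(missing.strftime('%Y-%m-%d')))  # ['2025-03-03']
```

Because the range runs from the first to the last observed date, days missing at the very start or end of the study period would not be detected this way. Passing the expected study period explicitly to `pd.date_range` would cover that case.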


In [None]:
if events is not None and not events.empty:
    # Create full date range to identify missing days
    start_date = events['date'].min()
    end_date = events['date'].max()
    full_dates = pd.date_range(start_date, end_date, freq='D')

    # Create a complete temporal summary with all dates
    temporal_summary_complete = pd.DataFrame({'date': full_dates})

    # Merge with actual counts, filling missing dates with 0
    temporal_summary_complete = temporal_summary_complete.merge(
        temporal_summary[['date', 'num_records']],
        on='date',
        how='left'
    ).fillna(0)

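    # Dates in the full range with no events (index set difference instead of a Python loop)
    observed_dates = pd.DatetimeIndex(events['date'].unique())
    missing_dates = full_dates.difference(observed_dates).tolist()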
    print(f"Missing dates (no events): {missing_dates}")

    # Add day of week indicator
    temporal_summary_complete['day_of_week'] = temporal_summary_complete['date'].dt.day_name()
    temporal_summary_complete['is_weekend'] = temporal_summary_complete['date'].dt.dayofweek >= 5  # 5=Saturday, 6=Sunday
    temporal_summary_complete['day_type'] = temporal_summary_complete['is_weekend'].map({True: 'Weekend', False: 'Weekday'})

    # Plot daily record counts with weekend shading
    import matplotlib.ticker as mtick
    import matplotlib.dates as mdates

    # Plot: Daily observations with weekend shading
    plt.figure(figsize=(14, 6))
    plt.plot(temporal_summary_complete['date'], temporal_summary_complete['num_records'],
             marker='o', linewidth=2, markersize=5)
    plt.title('Daily Observations with Weekend Highlighting', fontsize=14)
    plt.xlabel('Date')
    plt.ylabel('Number of observations')
    plt.grid(True, linestyle='--', alpha=0.3)

    # Shade weekends
    for date, is_weekend in zip(temporal_summary_complete['date'], temporal_summary_complete['is_weekend']):
        if is_weekend:
            plt.axvspan(date, date + pd.Timedelta(days=1), alpha=0.2, color='gray')

    # Format y-axis with comma separators
    plt.gca().yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, pos: f"{int(x):,}"))

    # Set x-axis to show dates
    plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=1))  # Show every day
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
    plt.xticks(rotation=45, ha='right')

    plt.tight_layout()
    plt.show()

    # Weekday vs Weekend summary statistics
    weekday_weekend_summary = temporal_summary_complete.groupby('day_type')['num_records'].agg(['mean', 'sum', 'count'])
    print("\nWeekday vs Weekend Summary:")
    print(weekday_weekend_summary)

    # Additional analysis: Day of week pattern
    day_of_week_summary = temporal_summary_complete.groupby('day_of_week')['num_records'].agg(['mean', 'sum', 'count'])
    # Sort by day of week order
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    day_of_week_summary = day_of_week_summary.reindex(day_order)

    print("\nDay of Week Summary:")
    print(day_of_week_summary)

    # Plot day of week pattern
    plt.figure(figsize=(10, 6))
    plt.bar(range(len(day_of_week_summary)), day_of_week_summary['mean'],
            color=['#1f77b4' if day not in ['Saturday', 'Sunday'] else '#ff7f0e'
                   for day in day_of_week_summary.index])
    plt.xlabel('Day of Week')
    plt.ylabel('Average Number of Observations')
    plt.title('Average Daily Observations by Day of Week', fontsize=14)
    plt.xticks(range(len(day_of_week_summary)), day_of_week_summary.index, rotation=45, ha='right')
    plt.gca().yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, pos: f"{int(x):,}"))
    plt.grid(True, linestyle='--', alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

else:
    print("Cannot compute missing days without events data.")

## 2. Spatial Coverage by District

To evaluate spatial coverage, we need to assign each subscriber’s home location to an administrative boundary (e.g., district). We use the `diaries` dataset to extract the home location (where `stay_type == 'home'`) for each user. We then perform a **spatial join** to determine which district polygon each home point falls within.

After obtaining the number of residents per district (from the CDR data) we merge these counts with the synthetic census population table (e.g., `BRA_adm3_pop.csv`) to compute the market share or population coverage:

$$
\text{coverage} =
\frac{\text{number of CDR residents in district}}
     {\text{synthetic census population of district}}
$$

We then flag a district as low-coverage if its coverage falls below a threshold (e.g., 20%).

However, note that we are only working with a sample of about 1,000 subscribers, so the raw coverage values will be artificially low and most districts will fall below the threshold unless we first scale this sample to approximate the full population.
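
As a rough illustration of the scaling step, one simple option is to multiply every district's subscriber count by a single national factor (total census population divided by total sampled subscribers) before computing coverage. This is only a sketch under stated assumptions: the numbers below are made up, `toy_cov` is a hypothetical stand-in for the district table, and a uniform factor ignores regional differences in market share.

```python
import pandas as pd

# Made-up district table: sampled subscribers and census population
toy_cov = pd.DataFrame({
    'NAME_3': ['District A', 'District B', 'District C'],
    'n_subscribers': [6, 3, 1],
    'census_pop': [40_000, 30_000, 80_000],
})

# Assumption: one national factor scales the sample up to the full population
scale_factor = toy_cov['census_pop'].sum() / toy_cov['n_subscribers'].sum()

# Raw coverage is tiny because the sample is tiny
toy_cov['raw_coverage'] = toy_cov['n_subscribers'] / toy_cov['census_pop']

# Scaled coverage: 1.0 means the district's share of the sample matches its population share
toy_cov['scaled_coverage'] = toy_cov['n_subscribers'] * scale_factor / toy_cov['census_pop']

# Apply the 20% threshold to the scaled values
toy_cov['low_coverage_flag'] = (toy_cov['scaled_coverage'] < 0.2).astype(int)
print(toy_cov)
```

In this toy table every raw coverage value is far below 0.2, while after scaling only District C (scaled coverage 0.1875) is flagged. Note that scaled coverage is a relative measure: it shows where the sample is under-represented compared with the census, not the operator's actual market share.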

In [None]:

if diaries is not None and adm3_gdf is not None and not adm3_gdf.empty:
    # Filter home stays
    home_stays = diaries[diaries['stay_type'] == 'home'].copy()
    # Drop rows with missing coordinates
    home_stays = home_stays.dropna(subset=['latitude','longitude'])

    # Create GeoDataFrame for home locations
    home_gdf = gpd.GeoDataFrame(
        home_stays,
        geometry=gpd.points_from_xy(home_stays['longitude'], home_stays['latitude']),
        crs='EPSG:4326'
    )

    # Both layers are assumed to be in EPSG:4326; reproject adm3_gdf with .to_crs() first if its CRS differs
    # Perform spatial join: assign each home point to the containing district polygon (admin3)
    joined = gpd.sjoin(home_gdf, adm3_gdf[['ID_3','NAME_3','geometry']], how='left', predicate='within')

    # Compute number of unique residents per district (based on user_id)
    residents_per_district = joined.groupby(['ID_3','NAME_3'])['user_id'].nunique().reset_index()
    residents_per_district = residents_per_district.rename(columns={'user_id':'n_subscribers'})

    # Merge with synthetic census population (pop_sum) from BRA_adm3_pop.csv
    adm3_pop_renamed = adm3_pop.rename(columns={'pop_sum':'census_pop'})
    district_cov = residents_per_district.merge(adm3_pop_renamed[['ID_3','census_pop']], on='ID_3', how='left')

    # Compute coverage ratio and low_coverage_flag
    district_cov['coverage'] = district_cov['n_subscribers'] / district_cov['census_pop']
    # Flag low coverage (<0.2 by default)
    district_cov['low_coverage_flag'] = (district_cov['coverage'] < 0.2).astype(int)

    # Show results
    print("District-level coverage (first 10 rows):")
    print(district_cov.head(10))
else:
    print("Cannot compute spatial coverage: diaries or admin boundaries are missing.")



## 3. Basic Integrity Checks

Besides temporal and spatial coverage, it’s important to confirm that the underlying datasets are complete and consistent.

* **Missing days** – already computed above.
* **Towers with no district mapping** – we join each cell tower to the administrative boundaries to ensure all towers are assigned to a district. Unassigned towers may indicate errors in location data.


In [None]:
if cells is not None and adm3_gdf is not None:
    # Filter cells with valid coordinates
    cell_points = cells.dropna(subset=['longitude','latitude']).copy()
    cell_gdf = gpd.GeoDataFrame(
        cell_points,
        geometry=gpd.points_from_xy(cell_points['longitude'], cell_points['latitude']),
        crs='EPSG:4326'
    )

    # Spatial join to assign district ID
    cell_joined = gpd.sjoin(cell_gdf, adm3_gdf[['ID_3','NAME_3','geometry']], how='left', predicate='within')

    # Identify towers with no district mapping (ID_3 missing)
    no_mapping = cell_joined[cell_joined['ID_3'].isna()]
    print(f"Number of cell towers without district mapping: {len(no_mapping):,}")
    if len(no_mapping) > 0:
        print("Sample of unmapped towers:")
        print(no_mapping[['cell_id','latitude','longitude']].head())
else:
    print("Cannot check tower mapping without cells data or admin boundaries.")


In [None]:
# Interactive maps using Folium: cell towers and unmapped towers
import folium
from IPython.display import display  # explicit import; display() renders the maps below

# Recompute cell tower geodata and district mapping
if cells is not None and adm3_gdf is not None:
    cell_points = cells.dropna(subset=['longitude','latitude']).copy()
    cell_gdf = gpd.GeoDataFrame(
        cell_points,
        geometry=gpd.points_from_xy(cell_points['longitude'], cell_points['latitude']),
        crs='EPSG:4326'
    )
    # Spatial join to find mapped/unmapped towers
    cell_joined = gpd.sjoin(cell_gdf, adm3_gdf[['ID_3','NAME_3','geometry']], how='left', predicate='within')
    no_mapping = cell_joined[cell_joined['ID_3'].isna()]

    # Calculate center for maps
    center_lat = cell_points['latitude'].mean()
    center_lon = cell_points['longitude'].mean()

    # =========================================
    # Map 1: All cell towers on district boundary
    # =========================================
    map1 = folium.Map(
        location=[center_lat, center_lon],
        zoom_start=6,
        tiles='CartoDB positron'
    )

    # Add Admin1 boundaries (provinces/states)
    if adm1_gdf is not None:
        folium.GeoJson(
            adm1_gdf,
            name='Admin1 Boundaries',
            style_function=lambda x: {
                'fillColor': 'transparent',
                'color': 'darkred',
                'weight': 2,
                'fillOpacity': 0
            },
            tooltip=folium.GeoJsonTooltip(
                fields=['NAME_1'] if 'NAME_1' in adm1_gdf.columns else [],
                aliases=['Province/State:'],
                localize=True
            ),
            show=False  # Hidden by default
        ).add_to(map1)

    # Add Admin2 boundaries (municipalities)
    if adm2_gdf is not None:
        folium.GeoJson(
            adm2_gdf,
            name='Admin2 Boundaries',
            style_function=lambda x: {
                'fillColor': 'transparent',
                'color': 'orange',
                'weight': 1,
                'fillOpacity': 0
            },
            tooltip=folium.GeoJsonTooltip(
                fields=['NAME_2'] if 'NAME_2' in adm2_gdf.columns else [],
                aliases=['Municipality:'],
                localize=True
            ),
            show=False  # Hidden by default
        ).add_to(map1)

    # Add Admin3 boundaries (districts) - shown by default
    folium.GeoJson(
        adm3_gdf,
        name='Admin3 Boundaries',
        style_function=lambda x: {
            'fillColor': 'lightgreen',
            'color': 'gray',
            'weight': 0.5,
            'fillOpacity': 0.2
        },
        tooltip=folium.GeoJsonTooltip(
            fields=['NAME_3', 'ID_3'],
            aliases=['District:', 'ID:'],
            localize=True
        ),
        show=True  # Shown by default
    ).add_to(map1)

    # Add all cell towers
    tower_group = folium.FeatureGroup(name='Cell Towers', show=True)
    for idx, row in cell_points.iterrows():
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            radius=3,
            color='blue',
            fill=True,
            fillColor='blue',
            fillOpacity=0.7,
            popup=f"Tower ID: {row.get('cell_id', idx)}<br>Lat: {row['latitude']:.4f}<br>Lon: {row['longitude']:.4f}"
        ).add_to(tower_group)
    tower_group.add_to(map1)

    # Add title
    title_html = '''
        <div style="position: fixed;
                    top: 10px; left: 50px; width: 400px; height: 40px;
                    background-color: white; border:2px solid grey; z-index:9999;
                    font-size:16px; padding: 10px">
            <b>All cell towers on admin boundaries</b>
        </div>
    '''
    map1.get_root().html.add_child(folium.Element(title_html))

    # Add layer control
    folium.LayerControl(collapsed=False).add_to(map1)

    # Display map 1
    display(map1)
    # Or save: map1.save('map1_all_towers.html')

    # =========================================
    # Map 2: Unmapped cell towers
    # =========================================
    map2 = folium.Map(
        location=[center_lat, center_lon],
        zoom_start=6,
        tiles='CartoDB positron'
    )

    # Add Admin1 boundaries
    if adm1_gdf is not None:
        folium.GeoJson(
            adm1_gdf,
            name='Admin1 Boundaries',
            style_function=lambda x: {
                'fillColor': 'transparent',
                'color': 'darkred',
                'weight': 2,
                'fillOpacity': 0
            },
            tooltip=folium.GeoJsonTooltip(
                fields=['NAME_1'] if 'NAME_1' in adm1_gdf.columns else [],
                aliases=['Province/State:'],
                localize=True
            ),
            show=False
        ).add_to(map2)

    # Add Admin2 boundaries
    if adm2_gdf is not None:
        folium.GeoJson(
            adm2_gdf,
            name='Admin2 Boundaries',
            style_function=lambda x: {
                'fillColor': 'transparent',
                'color': 'orange',
                'weight': 1,
                'fillOpacity': 0
            },
            tooltip=folium.GeoJsonTooltip(
                fields=['NAME_2'] if 'NAME_2' in adm2_gdf.columns else [],
                aliases=['Municipality:'],
                localize=True
            ),
            show=False
        ).add_to(map2)

    # Add Admin3 boundaries
    folium.GeoJson(
        adm3_gdf,
        name='Admin3 Boundaries',
        style_function=lambda x: {
            'fillColor': 'lightgreen',
            'color': 'gray',
            'weight': 0.5,
            'fillOpacity': 0.2
        },
        tooltip=folium.GeoJsonTooltip(
            fields=['NAME_3', 'ID_3'],
            aliases=['District:', 'ID:'],
            localize=True
        ),
        show=True
    ).add_to(map2)

    # Add unmapped towers
    if len(no_mapping) > 0:
        unmapped_group = folium.FeatureGroup(name='Unmapped Towers', show=True)
        for idx, row in no_mapping.iterrows():
            folium.CircleMarker(
                location=[row['latitude'], row['longitude']],
                radius=4,
                color='red',
                fill=True,
                fillColor='red',
                fillOpacity=0.8,
                popup=f"Unmapped Tower<br>Lat: {row['latitude']:.4f}<br>Lon: {row['longitude']:.4f}"
            ).add_to(unmapped_group)
        unmapped_group.add_to(map2)

    # Add title
    title_html2 = '''
        <div style="position: fixed;
                    top: 10px; left: 50px; width: 400px; height: 40px;
                    background-color: white; border:2px solid grey; z-index:9999;
                    font-size:16px; padding: 10px">
            <b>Cell towers without district mapping ({} towers)</b>
        </div>
    '''.format(len(no_mapping))
    map2.get_root().html.add_child(folium.Element(title_html2))

    # Add layer control
    folium.LayerControl(collapsed=False).add_to(map2)

    # Display map 2
    display(map2)
    # Or save: map2.save('map2_unmapped_towers.html')

    print(f'Total cell towers: {len(cell_points)}')
    print(f'Unmapped towers: {len(no_mapping)} ({len(no_mapping)/len(cell_points)*100:.1f}%)')

else:
    print('Cannot plot maps without cells or admin boundary data.')


## 4. Conclusions

This notebook provides a reproducible workflow to assess the completeness of the CDR data in time and space, and to compare subscriber counts against synthetic census population data. Low coverage flags highlight districts where caution should be exercised when interpreting mobility indicators.

You can further refine the analysis by:

* Adjusting the coverage threshold depending on the acceptable level of market share.
* Extending the spatial join to `adm1` or `adm2` boundaries using `BRA_adm1_pop.csv` and `BRA_adm2_pop.csv` if district-level detail is too fine.
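
To see how sensitive the low-coverage flag is to the chosen threshold, a small sweep over candidate values can help. The sketch below uses made-up coverage ratios; `toy_coverage` and the list of thresholds are illustrative assumptions, not values from the provided data.

```python
import pandas as pd

# Made-up coverage ratios for five hypothetical districts
toy_coverage = pd.Series(
    [0.05, 0.12, 0.18, 0.25, 0.40],
    index=['District A', 'District B', 'District C', 'District D', 'District E'],
    name='coverage',
)

# Candidate thresholds to compare (illustrative values)
thresholds = [0.10, 0.20, 0.30]

# Count how many districts would be flagged at each threshold
sensitivity = pd.DataFrame({
    'threshold': thresholds,
    'n_flagged': [int((toy_coverage < t).sum()) for t in thresholds],
})
sensitivity['share_flagged'] = sensitivity['n_flagged'] / len(toy_coverage)
print(sensitivity)
```

Running the same sweep on the `coverage` column of `district_cov` would show how many districts move in or out of the low-coverage group as the threshold changes.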


In [None]:
# Bar chart: Top 10 districts by number of subscribers
if 'district_cov' in locals():
    # Select the 10 districts with the most subscribers
    top10 = district_cov.nlargest(10, 'n_subscribers')
    plt.figure(figsize=(10,5))
    plt.bar(top10['NAME_3'], top10['n_subscribers'])
    plt.xticks(rotation=45, ha='right')
    plt.title('Top 10 districts by number of subscribers')
    plt.ylabel('Number of subscribers')
    plt.xlabel('District')
    plt.tight_layout()
    plt.show()
else:
    print('District coverage data is not available for bar chart.')
