# Mozambique AA for Cholera – Threshold Analysis

This notebook supports the development of Anticipatory Action (AA) for cholera in Mozambique. It focuses on exploring and validating outbreak thresholds as defined in the national cholera preparedness and response plan.

The analysis uses historical cholera case data and related risk indicators stored in Azure Blob Storage. Because the data is sensitive, it is not publicly available.

**Objectives:**
- Understand the thresholds set by the national plan and assess their applicability.
- Explore alternative data-driven thresholds to inform early action.

> **Note:** Ensure that access credentials for the Blob Storage container are configured before running the data loading section.
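
A minimal sketch of such a pre-flight check, assuming the SAS token is supplied through the `DSCI_AZ_BLOB_DEV_SAS` environment variable that the notebook reads further down. The helper name `require_env` is hypothetical and not part of `src.utils`.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error."""
    # `environ` can be swapped for a plain dict, which makes the check easy to test.
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty.")
    return value
```

Calling `require_env("DSCI_AZ_BLOB_DEV_SAS")` before building any URL would fail early with a readable message instead of producing a URL that ends in `?None`.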


In [1]:
%load_ext jupyter_black
%load_ext autoreload
%autoreload 2

In [2]:
# Data and visualization
import pandas as pd
import numpy as np
import seaborn as sns
from pathlib import Path
from src.utils import *
import os
import re

# Azure Blob Storage
from azure.storage.blob import ContainerClient, BlobClient
import io

# Display settings
pd.set_option("display.max_columns", None)
pd.options.display.float_format = "{:,.0f}".format
sns.set(style="whitegrid")

In [3]:
# Connection settings; the SAS token is read from the environment
DEV_BLOB_SAS = os.getenv("DSCI_AZ_BLOB_DEV_SAS")
DEV_BLOB_NAME = "imb0chd0dev"
DEV_BLOB_URL = f"https://{DEV_BLOB_NAME}.blob.core.windows.net/"
CONTAINER_NAME = "projects"
BLOB_PATH = "ds-aa-moz-cholera/raw/"  # source Excel workbooks
BLOB_PATH_WRITE = "ds-aa-moz-cholera/processed/"  # cleaned outputs
container_url = f"{DEV_BLOB_URL}{CONTAINER_NAME}?{DEV_BLOB_SAS}"
container_client = ContainerClient.from_container_url(container_url)

## Data Cleaning

In [4]:
excel_blobs = [
    blob.name
    for blob in container_client.list_blobs(name_starts_with=BLOB_PATH)
    if blob.name.endswith((".xls", ".xlsx"))
]

# Maps each workbook name (without extension) to an open pd.ExcelFile
dataframes = {}
for blob_name in excel_blobs:
    blob_url = f"{DEV_BLOB_URL}{CONTAINER_NAME}/{blob_name}?{DEV_BLOB_SAS}"
    blob_client = BlobClient.from_blob_url(blob_url)
    blob_data = blob_client.download_blob().readall()

    # Path.stem drops only the final extension, whatever its letter case
    key = Path(blob_name).stem
    dataframes[key] = pd.ExcelFile(io.BytesIO(blob_data))
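
Each workbook is keyed by its file name without the extension. As a small sketch of one way to derive that key and to recognise Excel blobs regardless of letter case, using only `pathlib`: the helper names `workbook_key` and `is_excel_blob` are hypothetical.

```python
from pathlib import PurePosixPath


def workbook_key(blob_name):
    """Return the blob's file name without its final extension."""
    # Blob names use "/" as a separator regardless of the local OS.
    return PurePosixPath(blob_name).stem


def is_excel_blob(blob_name):
    """Case-insensitive check for .xls / .xlsx blob names."""
    return PurePosixPath(blob_name).suffix.lower() in {".xls", ".xlsx"}
```

Only the last suffix is removed, so dots inside a file name (dates, for instance) stay in the key.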

In [5]:
dataframes

{'Copy of CANAL ENDEMICO DE DIARREIA GAZA 2025': <pandas.io.excel._base.ExcelFile at 0x2929049bf50>,
 'Copy of CANAL ENDEMICO DE DIARREIA SOFALA 2025': <pandas.io.excel._base.ExcelFile at 0x29290515af0>,
 'Copy of CANAL ENDEMICO DE DIARREIA TETE 2025': <pandas.io.excel._base.ExcelFile at 0x2928ef89760>,
 'Copy of CANAL ENDEMICO MAPUTO CIDADE': <pandas.io.excel._base.ExcelFile at 0x29290d58b60>,
 'Copy of CANAL ENDEMICO MAPUTO PROVINCIA': <pandas.io.excel._base.ExcelFile at 0x29290676480>,
 'Copy of CANAL ENDEMICO ZAMBEZIA': <pandas.io.excel._base.ExcelFile at 0x29290bdcaa0>,
 'Copy of Canais endémicos da província de Manica_12.06. 2025': <pandas.io.excel._base.ExcelFile at 0x292910504a0>,
 'Copy of Canal endemico_P.Inhambane': <pandas.io.excel._base.ExcelFile at 0x29290426cf0>,
 'Copy of DISTRITOS_ NIASSA': <pandas.io.excel._base.ExcelFile at 0x29290e54410>,
 'Copy of Nampula_ Distritos Dados': <pandas.io.excel._base.ExcelFile at 0x29291050410>,
 'data DISTRITOS DE CABO DELGADO': <pand

In [6]:
cleaned_data = []
for file_key, xls_file in dataframes.items():
    province = infer_province_from_filename(file_key)
    if not province:
        continue

    # Sheets are read without a header row; each sheet name is treated as a district
    for sheet in xls_file.sheet_names:
        df = xls_file.parse(sheet, header=None)
        district = sheet.strip()
        print(f"Parsing: {province} - {district}")

        cleaned_data.extend(parse_generalized_sheet(df, province, district))

Parsing: Gaza - MASSINGIR
Parsing: Gaza - MASSANGENA
Parsing: Gaza - MAPAI
Parsing: Gaza - MANDLAKAZI
Parsing: Gaza - MABALANE
Parsing: Gaza - LIMPOPO
Parsing: Gaza - GUIJA
Parsing: Gaza - CIDADE DE XAI XAI
Parsing: Gaza - CHONGOENE
Parsing: Gaza - CHOKWE
Parsing: Gaza - CHIGUBO
Parsing: Gaza - CHICUALACUALA
Parsing: Gaza - CHIBUTO
Parsing: Gaza - BILENE
Parsing: Gaza - Provincia de Gaza
⚠️ Skipping invalid district: Provincia de Gaza
Parsing: Gaza - Folha2
⚠️ Skipping invalid district: Folha2
Parsing: Gaza - Folha1
⚠️ Skipping invalid district: Folha1
Parsing: Sofala - Sofala
⚠️ Skipping invalid district: Sofala
Parsing: Sofala - Buzi
Parsing: Sofala - Caia
Parsing: Sofala - Chemba
Parsing: Sofala - Cheringoma
Parsing: Sofala - Chibabava
Parsing: Sofala - Beira
Parsing: Sofala - Dondo
Parsing: Sofala - Gorongosa
Parsing: Sofala - Machanga
Parsing: Sofala - Maringue
Parsing: Sofala - Morromeu
Parsing: Sofala - Muanza
Parsing: Sofala - Nhamatanda
Parsing: Tete - TETE
⚠️ Skipping invalid

In [7]:
province, district

('Cabo Delgado', 'ANCUABE')
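
The parsing loop relies on `infer_province_from_filename` from `src.utils`, whose implementation is not shown in this notebook. The sketch below is a hypothetical stand-in, not the project's actual helper. It illustrates one plausible approach: normalise the file name (lowercase, accents stripped, punctuation turned into spaces) and look for a known province name as whole words. The province list is taken from the cleaned data shown in this notebook; the function name `guess_province` is an assumption.

```python
import re
import unicodedata

# Province names as they appear in the cleaned data in this notebook.
PROVINCES = [
    "Cabo Delgado", "Gaza", "Inhambane", "Manica", "Maputo Cidade",
    "Maputo Provincia", "Nampula", "Niassa", "Sofala", "Tete", "Zambezia",
]


def _normalize(text):
    """Lowercase, strip accents, and collapse non-alphanumerics to spaces."""
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", " ", ascii_text.lower()).strip()


def guess_province(file_key):
    """Hypothetical stand-in: return the province named in a file key, else None."""
    haystack = f" {_normalize(file_key)} "
    # Try longer names first so multi-word provinces are matched in full.
    for province in sorted(PROVINCES, key=len, reverse=True):
        if f" {_normalize(province)} " in haystack:
            return province
    return None
```

Matching on whole words keeps a phrase such as "província de Manica" from being read as "Maputo Provincia", and a file name that mentions no province returns `None`, which the loop treats as a signal to skip the workbook.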

In [8]:
cholera_df = pd.DataFrame(cleaned_data)
cholera_df = cholera_df.sort_values(["province", "district", "year", "week"])

In [9]:
cholera_df["province"].unique(), cholera_df["district"].unique()

(array(['Cabo Delgado', 'Gaza', 'Inhambane', 'Manica', 'Maputo Cidade',
        'Maputo Provincia', 'Nampula', 'Niassa', 'Sofala', 'Tete',
        'Zambezia'], dtype=object),
 array(['Ancuabe', 'Balama', 'Chiure', 'Cidade De Pemba', 'Ibo', 'Macomia',
        'Mecufi', 'Meluco', 'Metuge', 'Mocimboa Da Praia', 'Montepuez',
        'Mueda', 'Muidume', 'Namuno', 'Nangade', 'Palma', 'Quissanga',
        'Bilene', 'Chibuto', 'Chicualacuala', 'Chigubo', 'Chokwe',
        'Chongoene', 'Cidade De Xai Xai', 'Guija', 'Limpopo', 'Mabalane',
        'Mandlakazi', 'Mapai', 'Massangena', 'Massingir',
        'Cidade De Inhambane', 'Funhalouro', 'Govuro', 'Homoine',
        'Inharrime', 'Inhassro', 'Jangamo', 'Mabote', 'Massinga', 'Maxixi',
        'Morrumbene', 'Panda', 'Vilankulos', 'Zavala', 'Barrue',
        'Cidade De Chimoio', 'Gondola', 'Guro', 'Macate', 'Machaze',
        'Macossa', 'Manica', 'Mossurize', 'Sussundenga', 'Tambara',
        'Vanduzi', 'Kalhamankulo', 'Kamavota', 'Kamaxakene', 'K

In [10]:
# filter out rows with missing weeks or weeks > 53
cholera_df = cholera_df[
    (cholera_df["week"].notna()) & (cholera_df["week"] <= 53)
].reset_index(drop=True)
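
The filter above accepts week 53 for every year, although most years have only 52 weeks. As a hedged sketch of a stricter check, assuming the `week` column follows ISO 8601 week numbering (the notebook does not state which epidemiological week convention the source workbooks use), the number of weeks in a year can be derived with the standard library. The function names are hypothetical.

```python
from datetime import date


def weeks_in_iso_year(year):
    """Return 52 or 53, the number of ISO 8601 weeks in the given year."""
    # 28 December always falls in the last ISO week of its year.
    return date(year, 12, 28).isocalendar()[1]


def is_valid_iso_week(year, week):
    """True if `week` is a valid ISO week number for `year`."""
    return 1 <= week <= weeks_in_iso_year(year)
```

Under that assumption, a row with week 53 in 2021 would be flagged, while week 53 in 2020 would be kept. The check also rejects week numbers below 1, which the current filter lets through.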

In [11]:
cholera_df.shape

(63943, 5)

In [12]:
cholera_df.to_csv(
    Path(os.getenv("AA_DATA_DIR"))
    / "private"
    / "processed"
    / "moz"
    / "cholera"
    / "cholera_data_all_cleaned.csv",
    index=False,
    encoding="utf-8",
)

In [12]:
# csv_buffer = io.StringIO()
# cholera_df.to_csv(csv_buffer, index=False)
# csv_bytes = csv_buffer.getvalue().encode("utf-8")

# file_name = "cholera_data_cleaned.csv"
# blob_write_url = (
#    DEV_BLOB_URL
#    + CONTAINER_NAME
#    + "/"
#    + BLOB_PATH_WRITE
#    + file_name
#    + "?"
#    + DEV_BLOB_SAS
# )
# blob_client = BlobClient.from_blob_url(blob_write_url)
# blob_client.upload_blob(csv_bytes, overwrite=True)
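
Read and write URLs in this notebook are assembled by hand from the same four parts. A minimal sketch of a single helper that builds both, so that the raw and processed paths are harder to mix up: the name `build_blob_url` is hypothetical, and the function only composes a string without contacting Azure.

```python
def build_blob_url(account_url, container, blob_path, sas_token):
    """Compose '<account>/<container>/<blob path>?<SAS token>' as a string."""
    if not sas_token:
        raise ValueError("A SAS token is required to build the blob URL.")
    base = account_url.rstrip("/")
    path = blob_path.lstrip("/")
    # Accept tokens with or without a leading "?".
    return f"{base}/{container}/{path}?{sas_token.lstrip('?')}"
```

With the constants defined earlier, the write URL would be `build_blob_url(DEV_BLOB_URL, CONTAINER_NAME, BLOB_PATH_WRITE + file_name, DEV_BLOB_SAS)`, and the result could be passed to `BlobClient.from_blob_url`.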