Script structure:
  1. Setup:
        * Paths
        * Utils functions
        * Load and check config file
  2. Load Data
        * **Routine data** (DHIS2) already formatted & aggregated (output of pipeline XXX)
        * **Population data** (DHIS2) already formatted & aggregated (output of pipeline YYY) & aggregated at **ADM2 x YEAR** level (⚠️ how to implement this better!⚠️)
        * (optional) **Care seeking (taux recherche soins)** (DHS)
        * (user choice) **Reporting Rate**, pick one of:
            * "**Dataset**": pre-computed (directly downloadable from SNIS DHIS2 instance) and formatted & aligned elsewhere (output of pipeline `dhis2-reporting-rate`)
            * "**Data Element**": calculated from routine DHIS2 data, based on reports for defined indicators and "active" facilities
  3. Calculate **Incidence**
     1. calculate **monthly cases**
     2. calculate **yearly incidence**: Crude, Adjusted 1 (Testing), Adjusted 2 (Reporting), (optional) Adjusted 3 (Care Seeking)

-------------------
**Naming harmonization to improve code readability:**

**Incidence**, COLUMN NAMES (always capitalized!):
* "INCIDENCE_CRUDE" = "Crude"
* "INCIDENCE_ADJ_TESTING" = "Adjusted 1 (Testing)"
* "INCIDENCE_ADJ_REPORTING" = "Adjusted 2 (Reporting)"
* _"INCIDENCE_ADJ_CARESEEKING" = "Adjusted 3 (Careseeking)"_ ⚠️is this good naming?

**Reporting Rate** data frames, based on two **methods**:
* follow this structure: reporting\_rate\_\<method\>. So:
    * **Dataset**: `reporting_rate_dataset` (for report nb only: `reporting_rate_dataset_year`)
    * **Data Element** (Diallo 2025): `reporting_rate_dataelement` (for report nb only: `reporting_rate_dataelement_year`)

--------------------

### To do:
* add check on completeness of routine data per ADM2 * MONTH -> issue warning if data is missing for certain months ("holes" see https://bluesquare.slack.com/archives/C08DHT2JXEV/p1751982194834899 )
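
One possible shape for this completeness check is sketched below in base R. The toy `routine` table and the `holes` object are illustrative assumptions, not part of the pipeline; the idea is simply to build every expected ADM2 x YEAR x MONTH combination and list the ones absent from the data.

```r
# Toy routine data (hypothetical): ADM2 "B" has no row for month 2
routine <- data.frame(
  ADM2_ID = c("A", "A", "A", "B", "B"),
  YEAR    = 2024,
  MONTH   = c(1, 2, 3, 1, 3)
)

# Every ADM2 x YEAR x MONTH combination we would expect to see
expected <- expand.grid(
  ADM2_ID = unique(routine$ADM2_ID),
  YEAR    = unique(routine$YEAR),
  MONTH   = sort(unique(routine$MONTH)),
  stringsAsFactors = FALSE
)

# Compare composite keys to find the combinations with no data ("holes")
observed_keys <- paste(routine$ADM2_ID, routine$YEAR, routine$MONTH)
expected_keys <- paste(expected$ADM2_ID, expected$YEAR, expected$MONTH)
holes <- expected[!expected_keys %in% observed_keys, ]

if (nrow(holes) > 0) {
  message("Routine data missing for ", nrow(holes), " ADM2 x MONTH combination(s).")
}
print(holes)
```

In the notebook itself, the `message()` call would presumably be replaced by `log_msg(..., "warning")`.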

## 0. Parameters
👇 these are now ⚡**pipeline parameters**⚡!

In [None]:
# Parameters
# SNT_ROOT_PATH
# CONFIG_FILE_NAME

# N1_METHOD
# USE_CSB_DATA
# REPORTING_RATE_METHOD
# ROUTINE_DATA_CHOICE
# OUTLIER_DETECTION_METHOD

## 1. Setup

### 1.0. Fallback parameter values

In [None]:
# Set BACKUP VALUE: root path - NEVER CHANGE THIS!
if (!exists("SNT_ROOT_PATH")) {
  SNT_ROOT_PATH <- "/home/hexa/workspace" 
}

# Set BACKUP VALUE: CONFIG_FILE_NAME probably won't be needed in the future (TBD)
if (!exists("CONFIG_FILE_NAME")) {
  CONFIG_FILE_NAME <- "SNT_config.json" 
}


# ----- ⚡ Defined in pipeline.py code ---------------

# ⚡ For N1 calculations: use `SUSP-TEST` or `PRES`
if (!exists("N1_METHOD")) {
  N1_METHOD <- "SUSP-TEST" # set this as default in case there is not PRES (not all countries have it)
}

# ⚡ USE_CSB_DATA bool
if (!exists("USE_CSB_DATA")) {
  USE_CSB_DATA <- FALSE
}

# ⚡ Set BACKUP VALUE: choice of reporting rate method to use for incidence adjustment
if (!exists("REPORTING_RATE_METHOD")) { # was REPORTING_RATE_CHOICE
  REPORTING_RATE_METHOD <- "dataset" # "dataelement"
}

# For "Data Element" method ONLY
if (!exists("REPRATE_DELEMENT_METHOD_NUMERATOR")) { 
  REPRATE_DELEMENT_METHOD_NUMERATOR <- "n1" # "n1" | "n2" | ?"Not applicable"?
}

# For "Data Element" method ONLY
if (!exists("REPRATE_DELEMENT_METHOD_DENOMINATOR")) { 
  REPRATE_DELEMENT_METHOD_DENOMINATOR <- "d1" # "d1" | "d2" | ?"Not applicable"?
}



# ⚡ Set BACKUP VALUE: choice of routine data to use 
if (!exists("ROUTINE_DATA_CHOICE")) {
  ROUTINE_DATA_CHOICE <- "raw" # "raw_without_outliers" "imputed"
}

# ⚡ Set BACKUP VALUE: choice of OUTLIERS detection method
#  Needed when `ROUTINE_DATA_CHOICE == "raw_without_outliers"` OR `ROUTINE_DATA_CHOICE == "imputed"`
# Extract from filenames in DHIS2_OUTLIERS_REMOVAL_IMPUTATION
if (!exists("OUTLIER_DETECTION_METHOD")) {
  OUTLIER_DETECTION_METHOD <- "median3mad"  # "mean3sd", "iqr1-5", "magic_glasses_partial", "magic_glasses_complete"
}


### 1.1. Paths

In [None]:
# PROJECT PATHS
CODE_PATH <- file.path(SNT_ROOT_PATH, 'code') # this is where we store snt_utils.r
CONFIG_PATH <- file.path(SNT_ROOT_PATH, 'configuration') # .json config file
DATA_PATH <- file.path(SNT_ROOT_PATH, 'data', 'dhis2') # same files as in Datasets but /data/ gets overwritten at each pipeline run

### 1.2. Utils functions

In [None]:
source(file.path(CODE_PATH, "snt_utils.r"))

### 1.3. Packages

In [None]:
# List required pcks  ---------------->  check  what are the really required libraries
required_packages <- c("arrow", # for .parquet
                       "tidyverse",
                       "stringi", 
                       "jsonlite", 
                       "httr", 
                       "reticulate")

# Execute function
install_and_load(required_packages)

#### For 📦{sf}, tell OH where to find stuff ...

In [None]:
Sys.setenv(PROJ_LIB = "/opt/conda/share/proj")
Sys.setenv(GDAL_DATA = "/opt/conda/share/gdal")

#### Set environment to load openhexa.sdk from the right path

In [None]:
# Set environment to load openhexa.sdk from the right path
Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python")
reticulate::py_config()$python
openhexa <- import("openhexa.sdk")

### 1.4. Load and check `config` file

In [None]:
config_json <- tryCatch({ fromJSON(file.path(CONFIG_PATH, CONFIG_FILE_NAME)) },
    error = function(e) {
        msg <- paste0("[ERROR] Error while loading configuration. [ERROR DETAILS] ", conditionMessage(e))
        cat(msg)   
        stop(msg) 
    })

msg <- paste0("SNT configuration loaded from  : ", file.path(CONFIG_PATH, CONFIG_FILE_NAME)) 
log_msg(msg)

**Checks for SNT mandatory configuration fields**

In [None]:
snt_config_mandatory <- c("COUNTRY_CODE", "DHIS2_ADMINISTRATION_1", "DHIS2_ADMINISTRATION_2") 

for (conf in snt_config_mandatory) {
    print(paste(conf, ":", config_json$SNT_CONFIG[conf]))
    if (is.null(config_json$SNT_CONFIG[[conf]])) {
        msg <- paste("[ERROR] Missing configuration input:", conf)
        cat(msg)   
        stop(msg)
    }
}

**Save config fields as variables**

In [None]:
# Generic
COUNTRY_CODE <- config_json$SNT_CONFIG$COUNTRY_CODE
ADMIN_1 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_1)
ADMIN_2 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_2)

# Specific to INCIDENCE calculation (this nb)
# 🚨 How to treat 0 values (in this case: "SET_0_TO_NA" converts 0 to NAs) 🚨
# NA_TREATMENT <- config_json$SNT_CONFIG$NA_TREATMENT # 🚨 Used only for old method ... 🚨

# Which (aggregated) indicators to use to evaluate "activity" of an HF - for Reporting Rate method "ANY"
DHIS2_INDICATORS <- names(config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS)

In [None]:
# Fixed routine formatting columns
fixed_cols <- c('OU_ID','PERIOD', 'YEAR', 'MONTH', 'ADM1_ID', 'ADM2_ID') 
print(paste("Fixed routine data ('dhis2_routine') columns (always expected): ", paste(fixed_cols, collapse=", ")))

## 2. Load Data

### 2.1. **Routine** data (DHIS2) (parametrized choice)
already formatted & aggregated (output of pipeline XXX)

In [None]:
# ROUTINE_DATA_CHOICE <- "raw_without_outliers"
# ROUTINE_DATA_CHOICE <- "imputed"
# ROUTINE_DATA_CHOICE 

In [None]:
# Define `dataset_name` and `file_name` based on parameters

if (ROUTINE_DATA_CHOICE == "raw") {
    
    dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED
    file_name <- paste0(COUNTRY_CODE, "_routine.parquet")
    # print(c(dataset_name, file_name))
    
} else if (ROUTINE_DATA_CHOICE == "raw_without_outliers") {
    
    dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_OUTLIERS_REMOVAL_IMPUTATION
    file_name <- paste0(COUNTRY_CODE, "_routine_outliers-", OUTLIER_DETECTION_METHOD, "_removed.parquet")
    # print(c(dataset_name, file_name))

} else if (ROUTINE_DATA_CHOICE == "imputed") {

    dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_OUTLIERS_REMOVAL_IMPUTATION
    file_name <- paste0(COUNTRY_CODE, "_routine_outliers-", OUTLIER_DETECTION_METHOD, "_imputed.parquet")
    # print(c(dataset_name, file_name))
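} else {
    # Fail early: without this branch `dataset_name` and `file_name` would be undefined below
    stop(paste0("[ERROR] Invalid ROUTINE_DATA_CHOICE: `", ROUTINE_DATA_CHOICE,
                "`. Expected one of: raw, raw_without_outliers, imputed."))
}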

print(c(dataset_name, file_name))

In [None]:
# Load file from dataset  

dhis2_routine <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, file_name) }, 
    error = function(e) {    
    # Check if the error message indicates that the file does not exist    
    if (grepl("does not exist", conditionMessage(e), ignore.case = TRUE)) { 
        msg <- paste0("[ERROR] File not found! 🛑 The file `", file_name, "` does not exist in `", 
                  dataset_name, "`. To generate it, execute the pipeline `DHIS2 Outliers Removal and Imputation`, choosing the appropriate method.")
    } else {
        msg <- paste0("[ERROR] 🛑 Error while loading DHIS2 routine data file for: ", COUNTRY_CODE, ". [ERROR DETAILS] " , conditionMessage(e))
    }            
    stop(msg)
})

msg <- paste0("DHIS2 routine data : `", file_name, "` loaded from dataset : `", dataset_name, "`. Dataframe dimensions: ", paste(dim(dhis2_routine), collapse=", "))
log_msg(msg)

In [None]:
head(dhis2_routine, 3)

#### Checks on routine data cols

**fixed_cols**

In [None]:
# Check if all "fixed" cols are present in dhis2_routine

actual_cols <- colnames(dhis2_routine) # dhis2_routine
missing_cols <- setdiff(fixed_cols, actual_cols) # Columns in fixed_cols but not in actual_cols)

# Check if all required columns are present
all_present <- length(missing_cols) == 0

if (all_present) { 
  log_msg(paste0("Success: The 'dhis2_routine' tibble contains all the expected 'fixed' columns: ",
                paste(fixed_cols, collapse = ", "), "."))
} else {
    log_msg(paste0(
      "🚨 Missing Columns: The following required columns are NOT present in 'dhis2_routine': ",
      paste(missing_cols, collapse = ", "),
      "."
    ), "warning")  # a warning is logged!
}

**DHIS2_INDICATORS**

In [None]:
# Check if all "DHIS2_INDICATORS" cols are present in dhis2_routine

actual_cols <- colnames(dhis2_routine) # dhis2_routine
missing_cols <- setdiff(DHIS2_INDICATORS, actual_cols) # Columns in DHIS2_INDICATORS but not in actual_cols
all_present <- length(missing_cols) == 0

if (all_present) { 
  log_msg(paste0("Success: The 'dhis2_routine' tibble contains all the expected 'DHIS2_INDICATORS' columns: ",
                paste(DHIS2_INDICATORS, collapse = ", "), "."))
} else {
    log_msg(paste0(
      "🚨 Missing Columns: The following columns for DHIS2 INDICATORS are NOT present in 'dhis2_routine': ",
      paste(missing_cols, collapse = ", "),
      ".\n🚨 Looks like the ", CONFIG_FILE_NAME, " file was modified after extraction.\n🚨 The analysis will continue WITHOUT the missing indicators."
    ), "warning")
}

#### Checks on `N1_METHOD` selected
_**if**_ `N1_METHOD == PRES` then `PRES` must exist in config file and routine data <br>
_**else**_ N1 will use `SUSP-TEST` instead

In [None]:
# N1_METHOD <- "PRES"
# N1_METHOD <- "SUSP-TEST"
# DHIS2_INDICATORS

In [None]:
# Check that col `PRES` exists in both config file and routine data
if (N1_METHOD == "PRES") {
   pres_in_routine <- any(names(dhis2_routine) == "PRES")
   pres_in_config <- any(DHIS2_INDICATORS == "PRES")

    if (!pres_in_routine) {
        log_msg("🚨 Column `PRES` missing from routine data! 🚨 N1 calculations will use `SUSP-TEST` instead!", "warning")
    }
    if (!pres_in_config) {
        log_msg("⚙️ Note: `PRES` set as parameter in this pipeline, but not defined as indicator in the configuration file (SNT_config.json)", "warning")
    }
### set N1_METHOD to "SUSP-TEST" here ?
}

### 2.2. **Population** data (DHIS2) at **ADM2** x **YEAR** level 
already formatted & aggregated (output of pipeline YYY) and then further aggregated in `pipelines/dhis2_incidence/code/WIP/00_TEMP_population_formatting.ipynb`

Note: must run **00_TEMP_population_formatting**.ipynb first!

🚨 **Expected table** with these cols (bold = **must have**): ADM1_ID, **ADM2_ID**,**YEAR**, **POPULATION** (pop at ADM2 level)

In [None]:
# DHIS2 Dataset extract identifier
dhis2_formatted_dataset <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED

# Load file from dataset
dhis2_population_adm2 <- tryCatch({ get_latest_dataset_file_in_memory(dhis2_formatted_dataset, paste0(COUNTRY_CODE, "_population.parquet")) }, 
                  error = function(e) {
                      msg <- paste("[ERROR] Error while loading DHIS2 population file for: " , COUNTRY_CODE, 
                                   " [ERROR DETAILS] ", conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

msg <- paste0("DHIS2 population data loaded from dataset : ", dhis2_formatted_dataset, 
              " dataframe dimensions: ", paste(dim(dhis2_population_adm2), collapse=", "))
log_msg(msg)

In [None]:
head(dhis2_population_adm2, 3)

### 2.3. (optional) **Care Seeking Behaviour** (CSB DHS) (taux recherche soins)
(20250728) Note: **changed units** (proportion to %), see https://bluesquare.atlassian.net/browse/SNT25-127 

In [None]:
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHS_INDICATORS
file_name <- glue::glue("{COUNTRY_CODE}_DHS_ADM1_PCT_CARESEEKING_SAMPLE_AVERAGE.parquet")

if (USE_CSB_DATA == TRUE) {
    # Read the data, if error (cannot find at defined path) -> set careseeking_data to NULL (so it doesn't break the function at # 3.)
    careseeking_data <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, file_name) },          
                  error = function(e) {
                      msg <- paste("🛑 Error while loading DHS Care Seeking data file from `", dataset_name, file_name ,"`.", conditionMessage(e))  # log error message
                      log_msg(msg, "error")
                      return(NULL) # make object NULL on error
                  })
 
    # Only print success messages and data info if careseeking_data is NOT NULL
    if (!is.null(careseeking_data)) {
        log_msg(paste0("Care Seeking data : ", file_name, " loaded from dataset : ", dataset_name))
        log_msg(paste0("Care Seeking data frame dimensions: ", nrow(careseeking_data), " rows, ", ncol(careseeking_data), " columns."))
        head(careseeking_data)
    } else {
        log_msg(paste0("🚨 Care-seeking data not loaded due to an error, `careseeking_data` is set to `NULL`!"), "warning")
    }
    
} else {
    # if `USE_CSB_DATA == FALSE` ... (basically, ignore CSB data)
    careseeking_data <- NULL
}

### 2.4. **Reporting Rate** (parametrized choice)  
Import Reporting Rate file based on user choice (parameter)

📅 **Important**: reporting rate must be **monthly**!

In [None]:
# REPORTING_RATE_METHOD = "dataset"
# REPORTING_RATE_METHOD = "dataelement"
REPORTING_RATE_METHOD

In [None]:
# Define dataset and file name (based on parameter)
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_REPORTING_RATE
file_name <- paste0(COUNTRY_CODE, "_reporting_rate_", REPORTING_RATE_METHOD, ".parquet") # "_month.parquet"

print(c(dataset_name, file_name))

In [None]:
# Define dataset and file name (based on parameter)
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_REPORTING_RATE

if (REPORTING_RATE_METHOD == "dataelement") { 
    # COD_reporting_rate_dataelement-n1-d1.parquet
    file_name <- paste0(COUNTRY_CODE, "_reporting_rate_", 
                        REPORTING_RATE_METHOD, "-", REPRATE_DELEMENT_METHOD_NUMERATOR, "-", REPRATE_DELEMENT_METHOD_DENOMINATOR, ".parquet") 
} else {
    file_name <- paste0(COUNTRY_CODE, "_reporting_rate_", REPORTING_RATE_METHOD, ".parquet") 
}

print(c(dataset_name, file_name))

In [None]:
# Load file from dataset
reporting_rate_month <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, file_name) }, 
                        error = function(e) {
                        msg <- paste("[ERROR] Error while loading Reporting Rate data for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                        cat(msg)
                        stop(msg)
})

msg <- paste0("Reporting Rate data : `", file_name, "` loaded from dataset : `", dataset_name, "`. Dataframe dimensions: ", paste(dim(reporting_rate_month), collapse=", "))
log_msg(msg)

#### 🔍 Check: cols and data type

In [None]:
# --- 0. Start validation (checks on) `reporting_rate_month` content ------------------------------

# Start message ...
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_REPORTING_RATE # "snt-dhis2-reporting-rate" 
msg <- paste("Checking requirements for `", file_name, "` ... ") # log error message
log_msg(msg) 

# --- 1. All expected cols are there? ------------------------------
required_cols <- c("ADM2_ID", "YEAR", "MONTH", "REPORTING_RATE")
missing_cols <- setdiff(required_cols, names(reporting_rate_month))
if (length(missing_cols) > 0) {
  stop(paste("Missing required columns:", paste(missing_cols, collapse = ", ")))
}

# --- 2. Values type match with expected? ------------------------------

# Initialize list of errors (stop) and warnings (issue message)
validation_errors <- list()
validation_warnings <- list()

# Check YEAR is integer (or can be converted), and is recent (current year - 5)
current_year <- as.integer(format(Sys.Date(), "%Y"))
min_valid_year <- current_year - 5

reporting_rate_month <- reporting_rate_month %>%
  mutate(YEAR = suppressWarnings(as.integer(YEAR)))

if (any(is.na(reporting_rate_month$YEAR))) {
  validation_errors <- c(validation_errors, "YEAR contains non-integer values")
} else if (any(reporting_rate_month$YEAR < min_valid_year, na.rm = TRUE)) {
  validation_errors <- c(validation_errors,
                         paste("Some YEAR values are too old (minimum valid year is", min_valid_year,
                               "). Found years:",
                               paste(unique(reporting_rate_month$YEAR[reporting_rate_month$YEAR < min_valid_year]), collapse = ", ")))
}

# Check MONTH is integer or can be converted to, and ranges between 1 and 12
# maybe not needed ...

# Check REPORTING_RATE is numeric between 0 and 1
if (!is.numeric(reporting_rate_month$REPORTING_RATE)) {
  validation_errors <- c(validation_errors, "REPORTING_RATE must be numeric")
} else if (any(reporting_rate_month$REPORTING_RATE < 0 | reporting_rate_month$REPORTING_RATE > 1, na.rm = TRUE)) {
  validation_warnings <- c(validation_warnings, "`REPORTING_RATE` should be between 0 and 1, but some values fall outside this range.")
}

# Report any validation errors and STOP the process
if (length(validation_errors) > 0) {
  stop(paste("[ERROR] 🛑 Data validation for `", file_name, "` FAILED: ", paste(validation_errors, collapse = " "))) 
}


# Report any warnings and print message (do not stop the process)
if (length(validation_warnings) > 0) {
  # Create the warning message string
  warning_msg_string <- paste("🚨 Warning: ", paste(validation_warnings, collapse = " ")) # collapse = "\n"
  log_msg(warning_msg_string, "warning") 
} else {
  success_msg_string <- " ... all looks good! We can proceed 🚀"
  log_msg(success_msg_string) 
}

### 2.5. **Shapes** for plotting maps (choropleths)

In [None]:
# DHIS2 Dataset extract identifier
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED

# Load file from dataset
dhis2_shapes <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, paste0(COUNTRY_CODE, "_shapes.geojson")) }, 
                  error = function(e) {
                      msg <- paste0("[ERROR] Error while loading DHIS2 shapes data file for: " , COUNTRY_CODE, ". [ERROR DETAILS] ",
                                    conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

msg <- paste0("DHIS2 shapes data loaded from dataset : `", dataset_name, "`. Dataframe dimensions: ", paste(dim(dhis2_shapes), collapse = ", "))
log_msg(msg)

In [None]:
names(dhis2_shapes)

#### 🛑 ESTEBAN MSG: I labeled the messages until here only :(   

Changes :  
-Add the severity of msg using the log_msg("error message text", "error or warning")  
-Label the errors that will stop the process. These errors will be passed to the python pipeline code to be logged.  
The structure of the "labeling" is "[ERROR] Some message text here. [ERROR DETAILS] some details" -> usually I just add `conditionMessage(e)` here  
The python code will catch the error and search for these labels and print (depending on the severity assigned in the python code) something like this:  
-> "Some message text here. some details" (the  label [ERROR DETAILS] is optional only..) 

  
After updating the script remember to push it to the main branch under the corresponding folder under pipelines/

-------------------------------

## 3. Calculate Incidence
First calculate monthly cases, then yearly incidence.

### 3.1 **Monthly cases**


These methods follow the standard WHO approach for estimating malaria incidence from routine health information systems (WHO, 2023).
As shown in the code, we begin by calculating **monthly malaria case metrics** (confirmed, tested, presumed) at the **ADM2** level and join them with the **monthly reporting rate**. 

This allows us to compute the **test positivity rate** (TPR, where `TPR` = `CONF` / `TEST`) and adjust for incomplete testing using the formula: 
> N1 = C + (P × C/T)

Which is equivalent to:
> N1 = C + (P × TPR)

where:
- N1 = cases adjusted for testing gaps 
- C = confirmed cases (`CONF`)
- **P** = presumed cases (either `SUSP` - `TEST` or directly available as `PRES`) <-- this is a parameter (`N1_METHOD`)
- T = tested cases (`TEST`)
- TPR = test positivity rate (`CONF` / `TEST`)
  
This produces `N1`, the number of cases adjusted for testing gaps, calculated at the monthly level in line with WHO recommendations to capture intra-annual variation.
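
As a worked example of the N1 formula, the sketch below uses made-up monthly values for a single ADM2 and the `SUSP-TEST` definition of presumed cases. The numbers are purely illustrative.

```r
# Hypothetical monthly values for one ADM2 (not real data)
conf <- 120   # confirmed cases (CONF)
test <- 400   # tested cases (TEST)
susp <- 500   # suspected cases (SUSP)

tpr  <- conf / test         # test positivity rate: 120 / 400 = 0.3
pres <- susp - test         # presumed cases (SUSP-TEST method): 100
n1   <- conf + pres * tpr   # 120 + 100 * 0.3 = 150

print(c(TPR = tpr, P = pres, N1 = n1))
```

Here 100 suspected cases were never tested; applying the observed positivity rate to them adds 30 cases to the 120 confirmed ones.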

Next, we adjust for incomplete reporting using: 
> N2 = N1 / R

where **R** is the monthly **reporting rate** (i.e., reports received divided by reports expected).

Finally, _if_ **careseeking** data is **available**, N3 is calculated as follows:
> N3 = N2 + (N2 × PRIVATE_CARE / PUBLIC_CARE) + (N2 × NO_CARE / PUBLIC_CARE)

where:
- PRIVATE_CARE = proportion of children treated in the **private** sector (`PCT_PRIVATE_CARE` in the data)
- PUBLIC_CARE = proportion of children treated in the **public** sector (`PCT_PUBLIC_CARE` in the data)
- NO_CARE = proportion of children who **did not receive any treatment** (`PCT_NO_CARE` in the data)

Note that this assumes the same TPR across all sectors (private and public).
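
A small numeric sketch of the N3 formula follows, using invented care-seeking shares (in %) and an invented N2. Because the formula only uses ratios of the shares, percentages and proportions give the same result.

```r
# Hypothetical inputs (not real data)
n2               <- 200  # cases adjusted for testing and reporting
pct_public_care  <- 50   # % treated in the public sector
pct_private_care <- 30   # % treated in the private sector
pct_no_care      <- 20   # % who received no treatment

n3 <- n2 +
  (n2 * pct_private_care / pct_public_care) +  # cases seen in the private sector: 120
  (n2 * pct_no_care / pct_public_care)         # cases that sought no care: 80

# When the three shares sum to 100%, this is the same as dividing N2 by the public share
n3_check <- n2 / (pct_public_care / 100)

print(c(N3 = n3, N3_check = n3_check))
```

With these values N3 is 400: the public sector captures half of all cases, so the adjusted count doubles.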



**Important note**<br>
In case reporting rate equals zero (none of the health facilities reported in a given month), N2 is set to `NA`. Note that the annual N2 will be underestimated, which is preferable compared to having `Inf` values.
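
The toy example below illustrates the zero-reporting rule with three hypothetical months. It also shows that the effect on the annual total depends on whether `NA` values are dropped when summing, which is worth keeping in mind when reading the yearly aggregation.

```r
# Hypothetical monthly values for one ADM2 (not real data)
n1             <- c(150, 80, 60)
reporting_rate <- c(0.75, 0.5, 0)   # third month: no facility reported

# Same rule as in the notebook: a zero reporting rate yields NA instead of Inf
n2 <- ifelse(reporting_rate == 0, NA_real_, n1 / reporting_rate)
print(n2)   # 200, 160, NA

# Annual total when NAs are dropped (the NA month contributes nothing)
total_dropping_na <- sum(n2, na.rm = TRUE)
# Annual total when NAs are kept (the whole sum becomes NA)
total_keeping_na <- sum(n2)

print(c(dropping_na = total_dropping_na, keeping_na = total_keeping_na))
```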

-------------

This calculation expects (input):
* **routine_data**: DHIS2 routine data, formatted and aggregated at ADM2 and MONTH level. Tibble (df) _must_ contain the following cols: `YEAR`, `MONTH`, `ADM1_ID`, `ADM2_ID`, `CONF`, `TEST`, `SUSP` and, when `N1_METHOD == "PRES"`, `PRES`.  
* **reporting_rate_data**: reporting rate calculated at ADM2 and MONTH level and expressed as proportion **(0-1)**. Tibble (df) _must_ contain the following cols: `ADM2_ID`, `YEAR`, `MONTH`, `REPORTING_RATE`

The calculation produces (output):
* data frame with the following cols: `ADM2`, `YEAR`, `MONTH`, "value_" * (`CONF`, `TEST`, `SUSP`, `PRES`), `TPR`, `N1`, `N2`

-----------------

In [None]:
# Ensure correct data type for numerical columns ---------------------------------------
routine_data <- dhis2_routine %>%
    mutate(across(any_of(c("YEAR", "MONTH", "CONF", "TEST", "SUSP", "PRES")), as.numeric))

reporting_rate_data <- reporting_rate_month %>% # reporting_rate_data
    mutate(across(c(YEAR, MONTH, REPORTING_RATE), as.numeric))

In [None]:
# Core calculations ------------------------------------------------------------------------------
monthly_cases <- routine_data %>%
    group_by(ADM1_ID, ADM2_ID, YEAR, MONTH) %>% # ADM1 needed to join careseeking data
    summarise(
      CONF = sum(CONF, na.rm = TRUE),
      TEST = sum(TEST, na.rm = TRUE),
      SUSP = sum(SUSP, na.rm = TRUE),
      # Same as `PRES = sum(PRES, na.rm = TRUE)`, but doesn't break if `PRES` is missing!
      across(any_of("PRES"), ~sum(., na.rm = TRUE), .names = "PRES"), # <- handles missing 'PRES' column gracefully
      .groups = "drop"
    ) %>%
    left_join(reporting_rate_data,
              by = c("ADM2_ID", "YEAR", "MONTH")) %>%
    mutate(
      TPR = CONF / TEST
    )

In [None]:
# Calculate N1 based on `N1_METHOD` & availability of `PRES` -----------------------------

if (N1_METHOD == "SUSP-TEST") {
    monthly_cases <- monthly_cases %>%
      mutate(N1 = CONF + ((SUSP - TEST) * TPR))
    log_msg("Calculating N1 as `N1 = CONF + ((SUSP - TEST) * TPR)`")
} else if (N1_METHOD == "PRES") {
    # Use `PRES` only if the column exists in `monthly_cases` and contains at least one non-missing value
    if ("PRES" %in% names(monthly_cases) && !all(is.na(monthly_cases$PRES))) {
      monthly_cases <- monthly_cases %>%
        mutate(N1 = CONF + (PRES * TPR))
      log_msg("Calculating N1 as `N1 = CONF + (PRES * TPR)`")
    } else {
      # Fall back to the SUSP-TEST method and log it with "warning" severity
      log_msg("🚨 Warning: 'PRES' not found in routine data or contains all `NA` values! 🚨 Calculating N1 using 'SUSP-TEST' method instead.", "warning")
      monthly_cases <- monthly_cases %>%
        mutate(N1 = CONF + ((SUSP - TEST) * TPR))
    }
} else {
    # Without N1 the N2 calculation below cannot run, so stop with a labeled error
    stop(paste0("[ERROR] Invalid N1_METHOD: `", N1_METHOD, "`. Please use 'PRES' or 'SUSP-TEST'."))
}


# Calculate N2
monthly_cases <- monthly_cases %>%
    mutate(
      N2 = ifelse(REPORTING_RATE == 0, NA_real_, N1 / REPORTING_RATE) # On the fly convert `RR == 0` to NA to avoid N2 == Inf
    )

In [None]:
# Only calculate N3 if CARESEEKING data is available ---------------------------------------

if (!is.null(careseeking_data)) {
    monthly_cases <- monthly_cases %>%
      mutate(YEAR = as.numeric(YEAR)) %>% # keep as safety
      left_join(., careseeking_data,
                by = c("ADM1_ID")
      ) %>%
      mutate(
        # N3 = N2 + (N2 * PRIVATE_CARE / PUBLIC_CARE) + (N2 * NO_CARE / PUBLIC_CARE) # Formula from Rapport Stratification Burkina
        N3 = N2 + (N2 * PCT_PRIVATE_CARE / PCT_PUBLIC_CARE) + (N2 * PCT_NO_CARE / PCT_PUBLIC_CARE) # CSB values changed from PROPORTION to %, but formula uses ratios so nothing should change
      )
  } else {
    print("🦘 Careseeking data not available, skipping calculation of N3.")
  }

In [None]:
# 'verbose': print reporting message ---------------------------------------

zero_reporting <- reporting_rate_data %>%
      filter(REPORTING_RATE == 0) %>%
      summarise(
        n_months_zero_reporting = n(),
        affected_zones = n_distinct(ADM2_ID)
      )

if (zero_reporting$n_months_zero_reporting > 0) {
    msg_verbose <- paste0("🚨 Note: ", zero_reporting$n_months_zero_reporting,
                          " rows (ADM2 x MONTH) had `REPORTING_RATE == 0`, across ",
                          zero_reporting$affected_zones, " ADM2 units. These N2 values were set to NA.")

    log_msg(msg_verbose)
} else {
    log_msg("✅ Note: no ADM2 x MONTH has `REPORTING_RATE == 0`. All N2 values were preserved.")
}

In [None]:
head(monthly_cases, 3)

### 🔍 Data **coherence** checks on **monthly cases**
Check for ratios or differences that produce negative or implausible values, which would cause an adjusted incidence to be lower than the value it adjusts.


Namely, the following relationships among INDICATORs:
* SUSP-TEST
* CONF/TEST
* N1 == CONF ... (when PRES == 0)

#### 1. `PRES == 0`: causes `N1 == CONF` 
(if `N1_METHOD == "PRES"`)

In [None]:
# Run this check only if N1_METHOD == "PRES" (else, problem doesn't exist)
if (N1_METHOD == "PRES") {
    nr_of_pres_0_adm2_month <- monthly_cases |> filter(PRES == 0) |> nrow()
    # Only warn when the problem actually occurs
    if (nr_of_pres_0_adm2_month > 0) {
        log_msg(glue::glue("🚨 Note: using `PRES` for incidence adjustment, but `PRES == 0` for {nr_of_pres_0_adm2_month} rows (ADM2 x MONTH)."), "warning")
    }
}


#### 2. `SUSP-TEST`: if negative, then N1 is smaller than or equal to CONF (ADJ <= CRUDE)
(if `N1_METHOD == "SUSP-TEST"`)

In [None]:
# SUSP-TEST: if negative, then N1 is smaller than or equal to CONF (ADJ <= CRUDE)
if (N1_METHOD == "SUSP-TEST") {
    nr_of_negative <- monthly_cases |> mutate(SUSP_minus_TEST = SUSP - TEST) |> filter(SUSP_minus_TEST < 0) |> nrow() 
    # Only warn when the problem actually occurs
    if (nr_of_negative > 0) {
        log_msg(glue::glue(
            "🚨 Note: using `SUSP-TEST` for incidence adjustment, but higher tested than suspected cases (`SUSP - TEST < 0`) detected in {nr_of_negative} rows (ADM2 x MONTH)."
        ), "warning")
    }
}

#### 3. `CONF/TEST` = `TPR` (to calculate N1: Incidence adjusted for **Testing**)
This **ratio should** never be **> 1** because **there should _not_ be more confirmed cases than tested** ...

(but if very small, then N1 could be smaller than or equal to CONF (so ADJ INC <= CRUDE))

In [None]:
# CONF/TEST = "TPR": should never be > 1 because there should not be more confirmed cases than tested ...
#             (but if very small, then N1 could be smaller than or equal to CONF (so ADJ INC <= CRUDE))

more_confirmed_than_tested <- monthly_cases |> mutate(CONF_divby_TEST = CONF / TEST) |> filter(CONF_divby_TEST > 1) |> nrow() 

if (more_confirmed_than_tested > 0) {
    log_msg(glue::glue("🚨 Note: higher confirmed than tested cases (`CONF/TEST`) detected in {more_confirmed_than_tested} rows (ADM2 x MONTH)."), "warning")
}

### 3.2 **Yearly incidence**
After calculating N1 and N2 for each `ADM2`-`MONTH`, we aggregate the data annually to compute the yearly totals (sums) for crude cases (`CONF`), `N1` and `N2`. Finally, we compute:
* Crude incidence: C / POP × 1000
* Incidence adjusted for testing: N1 / POP × 1000
* Incidence adjusted for testing and reporting: N2 / POP × 1000
* Incidence adjusted for testing, reporting and careseeking behaviour (optional): N3 / POP × 1000
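
The arithmetic behind these rates can be sketched with invented yearly totals for one ADM2. The population and case counts are assumptions for illustration only.

```r
# Hypothetical values for one ADM2 and one year (not real data)
conf_monthly <- rep(100, 12)   # 100 confirmed cases per month
population   <- 50000

conf_year <- sum(conf_monthly) # 1200 crude cases
n1_year   <- 1500              # yearly sum of monthly N1
n2_year   <- 2000              # yearly sum of monthly N2

incidence_crude         <- conf_year / population * 1000  # 24 per 1000
incidence_adj_testing   <- n1_year   / population * 1000  # 30 per 1000
incidence_adj_reporting <- n2_year   / population * 1000  # 40 per 1000

print(c(crude = incidence_crude,
        adj_testing = incidence_adj_testing,
        adj_reporting = incidence_adj_reporting))
```

Each adjustment can only add cases, so the three rates are expected to be non-decreasing from crude to reporting-adjusted.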

--------------

The calculation expects (input):
* **case_data**: as the output of `calculate_monthly_cases()`, or a tibble/data frame with the following cols: `ADM2`, `YEAR`, `MONTH`, "value_" * (CONF, TEST, SUSP, PRES), `TPR`, `N1`, `N2`  
* **population_data**: df of population data formatted and aligned, aggregated at ADM2 and YEAR level. A tibble/data frame that _must_ contain the following cols: `ADM2`, `YEAR`, `POPULATION`

The calculation produces (output): 
* a data frame with the following cols: ADM2_ID, YEAR, CONF, N1, N2, `INCIDENCE_CRUDE`, `INCIDENCE_ADJ_TESTING`, `INCIDENCE_ADJ_REPORTING`

--------------------

In [None]:
# ---- 1. Enforce column types upfront ----
case_data <- monthly_cases %>% 
    mutate(across(where(is.numeric), as.numeric))  # Convert all numeric columns
  
population_data <- dhis2_population_adm2 %>% # population_data
    mutate(across(c(YEAR, POPULATION), as.numeric))

In [None]:
# ---- 2. Core calculation ----
yearly_incidence <- case_data %>%
    group_by(ADM2_ID, YEAR) %>%
    summarise(
        # 🚨 removed `na.rm = TRUE` on 20250702 - if things break check here! 🚨 
      across(c(CONF, N1, N2), ~sum(.)), #, na.rm = TRUE)), # 🔍 PROBLEM: if NA's, the sum of N2 by YEAR is smaller than the sum of N1 cos missing data for RR!
      # across(any_of(c("CONF", "TEST", "SUSP", "PRES", "N1", "N2")), ~sum(.)), # silenced as not necessary to also summarize "TEST", "SUSP", "PRES"
      .groups = "drop"
    ) %>%
    left_join(
      population_data,
      by = c("ADM2_ID", "YEAR")
    ) %>%
    mutate(
      INCIDENCE_CRUDE = CONF / POPULATION * 1000,
      INCIDENCE_ADJ_TESTING = N1 / POPULATION * 1000,
      INCIDENCE_ADJ_REPORTING = N2 / POPULATION * 1000
    ) |>
    ungroup()

In [None]:
# ---- 3. Optional careseeking adjustment ----
if (!is.null(careseeking_data) && "N3" %in% names(case_data)) {
    n3_data <- case_data %>%
      group_by(ADM2_ID, YEAR) %>%
      summarise(N3 = sum(N3, na.rm = TRUE),
                .groups = "drop") |>
      ungroup()
    
    yearly_incidence <- yearly_incidence %>%
      left_join(n3_data, by = c("ADM2_ID", "YEAR")) %>%
      mutate(
        INCIDENCE_ADJ_CARESEEKING = N3 / POPULATION * 1000
      )
  } else {
    yearly_incidence <- yearly_incidence |>
      mutate(
        INCIDENCE_ADJ_CARESEEKING = NA_real_ # typed NA keeps the column numeric
            )
  }

In [None]:
head(yearly_incidence, 3)

### 🔍 Data **coherence** checks on **yearly incidence**
Here we check if values of Incidence (already at `YEAR` resolution) make sense in relation to each other.<br>
Namely:
* crude values should be the lowest, and any consecutive **adjustment** should cause the incidence values to **increase** or remain the **same** - but should never be lower!
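One way an adjusted value can end up below the previous level is missing data: if `N2` is `NA` for some months (no reporting rate available) and those months are dropped from the yearly sum, the yearly `N2` can fall below the yearly `N1`. A minimal base-R sketch with made-up numbers (the `toy_monthly` data frame is purely illustrative and not pipeline output):

```r
# Hypothetical monthly values for one ADM2 and one year (illustration only)
toy_monthly <- data.frame(
  ADM2_ID = "A",
  YEAR    = 2023,
  MONTH   = 1:3,
  N1      = c(10, 12, 8),
  N2      = c(11, NA, 9)   # month 2: reporting rate missing, so N2 is NA
)

sum_n1          <- sum(toy_monthly$N1)                 # 30
sum_n2_dropped  <- sum(toy_monthly$N2, na.rm = TRUE)   # 20: lower than N1, an "impossible" value
sum_n2_strict   <- sum(toy_monthly$N2)                 # NA: the gap stays visible

print(c(N1 = sum_n1, N2_na_rm = sum_n2_dropped, N2_strict = sum_n2_strict))
```

With `na.rm = TRUE` the yearly `N2` (20) is smaller than the yearly `N1` (30), which the checks below would flag. Without it the yearly `N2` is `NA`, and the comparison is skipped by `sum(na.rm = TRUE)` in the check.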

#### 1. `INCIDENCE_ADJ_TESTING` (adj. level 1) should always be greater than or equal to `INCIDENCE_CRUDE` (not adjusted)

In [None]:
# Same logic as check 2 below, applied to different columns.
# Count rows where the adjusted value is lower than the crude one;
# comparisons involving NA yield NA and are dropped by `na.rm = TRUE`
nr_of_impossible_values <- yearly_incidence |>
  mutate(IMPOSSIBLE_VALUE = INCIDENCE_ADJ_TESTING < INCIDENCE_CRUDE) |>
  pull(IMPOSSIBLE_VALUE) |>
  sum(na.rm = TRUE)

# Warning if any impossible values are found
if (nr_of_impossible_values > 0) {
  log_msg(glue::glue("🚨 Warning: found {nr_of_impossible_values} rows where INCIDENCE_ADJ_TESTING < INCIDENCE_CRUDE!"), "warning")
} else log_msg("✅ For all YEAR and ADM2, `INCIDENCE_CRUDE` is smaller than or equal to `INCIDENCE_ADJ_TESTING` (as expected).")

# Check if all values in a column are NA
if (all(is.na(yearly_incidence$INCIDENCE_ADJ_TESTING))) {
  log_msg("🚨 Warning: all values of `INCIDENCE_ADJ_TESTING` are `NA`s", "warning")
}


#### 2. `INCIDENCE_ADJ_REPORTING` (adj. level 2) should always be greater than or equal to `INCIDENCE_ADJ_TESTING` (adj. level 1)
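This check repeats the logic of check 1 with a different pair of columns. If more adjustment levels are added, the comparison could be factored into a small helper. The sketch below is one possible shape, not part of the pipeline: `count_order_violations()` and the `toy_incidence` data frame are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: count rows where the "higher" adjustment level
# is strictly lower than the "lower" one. Rows where either value is NA
# are ignored, mirroring `sum(na.rm = TRUE)` in the checks of this section.
count_order_violations <- function(df, lower_col, higher_col) {
  sum(df[[higher_col]] < df[[lower_col]], na.rm = TRUE)
}

# Made-up values (illustration only)
toy_incidence <- data.frame(
  INCIDENCE_CRUDE       = c(10, 20, 30, NA),
  INCIDENCE_ADJ_TESTING = c(12, 18, 30, 5)
)

n_bad <- count_order_violations(toy_incidence, "INCIDENCE_CRUDE", "INCIDENCE_ADJ_TESTING")
print(n_bad)  # 1: only the second row (18 < 20) violates the expected order
```

Equal values (third row) are not counted, which matches the rule that an adjustment may leave the incidence unchanged.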

In [None]:
# Count rows where adjustment level 2 is lower than adjustment level 1;
# comparisons involving NA yield NA and are dropped by `na.rm = TRUE`
nr_of_impossible_values <- yearly_incidence |>
  mutate(IMPOSSIBLE_VALUE = INCIDENCE_ADJ_REPORTING < INCIDENCE_ADJ_TESTING) |>
  pull(IMPOSSIBLE_VALUE) |>
  sum(na.rm = TRUE)

# Warning if any impossible values are found
if (nr_of_impossible_values > 0) {
  log_msg(glue::glue("🚨 Warning: found {nr_of_impossible_values} rows where INCIDENCE_ADJ_REPORTING < INCIDENCE_ADJ_TESTING!"), "warning")
} else log_msg("✅ For all YEAR and ADM2, `INCIDENCE_ADJ_TESTING` is smaller than or equal to `INCIDENCE_ADJ_REPORTING` (as expected).")

# Check if all values in a column are NA
if (all(is.na(yearly_incidence$INCIDENCE_ADJ_REPORTING))) {
  log_msg("🚨 Warning: all values of `INCIDENCE_ADJ_REPORTING` are `NA`s", "warning")
}

### 3.3. **Mean** Incidence across **all** available **years**

To keep in mind for future:
* consider taking the `median()` instead. But only if we have at least 3 years of data ... !
* possibly make this **parametrized** so that the user can decide the interval. But this choice needs to be dynamic, not hardcoded (pipeline.py code should read the data to offer choice of years that _exist_)
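The median idea could be sketched as follows, in base R and with made-up values. The function `summarise_years()`, its `min_years` argument and the `toy_yearly` data frame are hypothetical names, and falling back to the mean below the threshold is an assumption about the desired behaviour, not a decision taken in this notebook.

```r
# Hypothetical summary function: median if at least `min_years` non-missing
# yearly values are available, otherwise fall back to the mean.
summarise_years <- function(x, min_years = 3) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)   # no usable year at all
  if (length(x) >= min_years) median(x) else mean(x)
}

# Made-up yearly incidence for two ADM2 units (illustration only)
toy_yearly <- data.frame(
  ADM2_ID         = c("A", "A", "A", "B", "B"),
  YEAR            = c(2021, 2022, 2023, 2022, 2023),
  INCIDENCE_CRUDE = c(10, 50, 20, 30, 40)
)

# One summary value per ADM2
incidence_summary <- vapply(
  split(toy_yearly$INCIDENCE_CRUDE, toy_yearly$ADM2_ID),
  summarise_years,
  numeric(1)
)
print(incidence_summary)
# A: 3 years  -> median(10, 50, 20) = 20
# B: 2 years  -> mean(30, 40)       = 35
```

Because `summarise_years()` takes a plain numeric vector, it could in principle replace `~mean(., na.rm = TRUE)` inside `across()` in the cell below.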

In [None]:
yearly_incidence_mean <- yearly_incidence |>
    group_by(ADM1_ID, ADM2_ID) |>
    summarise(
      # 🔍 Possible PROBLEM here: if data is missing for RR -> sum of N2 by YEAR is smaller than the sum of N1!
      # Note: `mean(., na.rm = TRUE)` returns NaN for an ADM2 whose values are NA in all years
      across(starts_with("INCIDENCE"), ~mean(., na.rm = TRUE)),
      .groups = "drop"
    )

print(dim(yearly_incidence_mean))
head(yearly_incidence_mean, 3)

## 4. Export to `/data/dhis2_incidence/` folder

Dynamically include **reporting rate method** used (`rr-method-`) in **filename**
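To make the naming scheme concrete, the sketch below rebuilds the filename logic as a standalone function and shows the names it yields. `build_incidence_file_name()` is a hypothetical helper, and the argument values (`"XXX"`, `"raw"`, `"n1"`, `"d1"`) are placeholders, not actual pipeline parameter values.

```r
# Hypothetical standalone version of the filename logic (illustration only)
build_incidence_file_name <- function(country_code, routine_data_choice, rr_method,
                                      file_extension,
                                      rr_numerator = NULL, rr_denominator = NULL) {
  parts <- c(
    country_code,
    "_incidence_year_routine-data-", routine_data_choice,
    "_rr-method-", rr_method
  )
  # The "dataelement" method additionally encodes numerator and denominator choices
  if (rr_method == "dataelement") {
    parts <- c(parts, "-", rr_numerator, "-", rr_denominator)
  }
  paste0(c(parts, file_extension), collapse = "")
}

name_dataset <- build_incidence_file_name("XXX", "raw", "dataset", ".csv")
name_delement <- build_incidence_file_name("XXX", "raw", "dataelement", ".parquet", "n1", "d1")

print(name_dataset)   # "XXX_incidence_year_routine-data-raw_rr-method-dataset.csv"
print(name_delement)  # "XXX_incidence_year_routine-data-raw_rr-method-dataelement-n1-d1.parquet"
```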

In [None]:
# Reusable function to generate filename and save data
save_yearly_incidence <- function(yearly_incidence, data_path, file_extension, write_function) {
  
  # Base filename parts
  base_name_parts <- c(
    COUNTRY_CODE, 
    "_incidence_year_routine-data-", ROUTINE_DATA_CHOICE, 
    "_rr-method-", REPORTING_RATE_METHOD
  )
  
  # Add reporting rate specific parts if applicable
  if (REPORTING_RATE_METHOD == "dataelement") {
    specific_parts <- c(
      "-", REPRATE_DELEMENT_METHOD_NUMERATOR, "-", 
      REPRATE_DELEMENT_METHOD_DENOMINATOR
    )
    base_name_parts <- c(base_name_parts, specific_parts)
  }
  
  # Concatenate all parts to form the final filename
  file_name <- paste0(c(base_name_parts, file_extension), collapse = "")
  file_path <- file.path(data_path, "incidence", file_name)
  output_dir <- dirname(file_path)

  # Check if the output directory exists, else create it
  if (!dir.exists(output_dir)) {
    dir.create(output_dir, recursive = TRUE)
  }

  # `write_function` is supplied by the caller, e.g. `write_csv` or `arrow::write_parquet`
  write_function(yearly_incidence, file_path)

  log_msg(paste0("Exported: ", file_path))
}

In [None]:
# Cleanup: remove output of previous runs so that only files from the current run remain
path_to_clear <- file.path(DATA_PATH, "incidence")
files_to_delete <- list.files(path_to_clear, full.names = TRUE, recursive = TRUE)
unlink(files_to_delete, recursive = TRUE)
log_msg(glue::glue("🧹 Deleted all existing files from `{path_to_clear}`. Output of current pipeline run will replace output of previous run."))

In [None]:
# Actually export the data

# CSV
save_yearly_incidence(yearly_incidence, DATA_PATH, ".csv", write_csv)

# Parquet
save_yearly_incidence(yearly_incidence, DATA_PATH, ".parquet", arrow::write_parquet)