Script structure:
  1. Setup:
        * Paths
        * Utils functions
        * Load and check config file
  2. Load Data
        * **Routine data** (DHIS2) already formatted & aggregated (output of pipeline XXX)
        * **Population data** (DHIS2) already formatted (output of pipeline YYY) and aggregated at **ADM2 x YEAR** level<br>
            **Note**: in some countries (e.g., Niger), population and crude incidence data are also available for **specific sections** of the population (e.g., pregnant women, children under 5)
        * (optional) **Care seeking (taux recherche soins)** (DHS)
        * **Reporting Rate**, based on what is available (last run reporting rate pipeline), uses _either_ one of:
            * "**Dataset**": pre-computed (directly downloadable from the SNIS DHIS2 instance) and formatted & aligned elsewhere (output of pipeline `dhis2-reporting-rate`)
            * "**Data Element**": calculated from routine DHIS2 data, based on reports for defined indicators and "active" facilities
  3. Calculate **Incidence**
     1. calculate **monthly cases**
     2. calculate **yearly incidence**: Crude, Adjusted 1 (Test Positivity Rate), Adjusted 2 (Reporting Rate), (optional) Adjusted 3 (Care Seeking Behaviour)

-------------------
**Naming harmonization to improve code readability:**

**Incidence**, COLUMN NAMES (always capitalized!):
* "INCIDENCE_CRUDE" = "Crude"
* "INCIDENCE_ADJ_TESTING" = "Adjusted 1 (Testing)"
* "INCIDENCE_ADJ_REPORTING" = "Adjusted 2 (Reporting)"
* _"INCIDENCE_ADJ_CARESEEKING" = "Adjusted 3 (Careseeking)"_ ⚠️ is this good naming?

**Reporting Rate** data frames, based on two **methods**:
* follow this structure: reporting\_rate\_\<method\>. So:
    * **Dataset**: `reporting_rate_dataset` (for report nb only: `reporting_rate_dataset_year`)
    * **Data Element** (Diallo 2025): `reporting_rate_dataelement` (for report nb only: `reporting_rate_dataelement_year`)

--------------------

### To do:
* add check on completeness of routine data per ADM2 * MONTH -> issue warning if data is missing for certain months ("holes" see https://bluesquare.slack.com/archives/C08DHT2JXEV/p1751982194834899 )
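
A minimal sketch of what such a completeness check could look like, in base R on toy data. The data frame `routine_toy` and its values are hypothetical, and the assumption is that every ADM2 should have one row per month within the observed range. In the pipeline, the final message would presumably go through `log_msg(..., "warning")` instead of `message()`.

```r
# Toy routine data (hypothetical values): ADM2 "B" has no row for month 2
routine_toy <- data.frame(
  ADM2_ID = c("A", "A", "A", "B", "B"),
  YEAR    = 2024,
  MONTH   = c(1, 2, 3, 1, 3),
  CONF    = c(10, 12, 9, 4, 6)
)

# Full ADM2 x YEAR x MONTH grid expected from the observed range
expected <- expand.grid(
  ADM2_ID = unique(routine_toy$ADM2_ID),
  YEAR    = unique(routine_toy$YEAR),
  MONTH   = seq(min(routine_toy$MONTH), max(routine_toy$MONTH)),
  stringsAsFactors = FALSE
)

# Grid rows without a matching data row are the "holes"
observed_keys <- paste(routine_toy$ADM2_ID, routine_toy$YEAR, routine_toy$MONTH)
expected_keys <- paste(expected$ADM2_ID, expected$YEAR, expected$MONTH)
holes <- expected[!(expected_keys %in% observed_keys), ]

if (nrow(holes) > 0) {
  message(sprintf("Routine data is missing for %d ADM2 x MONTH combination(s).", nrow(holes)))
}
```

Building the expected grid first keeps the check independent of how the gaps arose (facility not reporting, extraction issue, and so on).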

## 0. Parameters
👇 these are now ⚡**pipeline parameters**⚡!

## 1. Setup

### 1.0. Validate parameters

In [None]:
# ----- ⚡ Defined in pipeline.py code ---------------
if (!exists("N1_METHOD")) N1_METHOD <- "SUSP-TEST" # ⚡ For N1 calculations: use `SUSP-TEST` or `PRES`
if (!exists("ROUTINE_DATA_CHOICE")) ROUTINE_DATA_CHOICE <- "raw" # "raw_without_outliers" "imputed"
if (!exists("OUTLIER_DETECTION_METHOD")) OUTLIER_DETECTION_METHOD <- "mean"  # ["mean", "median", "iqr", "mg_partial", "mg_complete"]
if (!exists("USE_CSB_DATA")) USE_CSB_DATA <- FALSE # ⚡ USE_CSB_DATA bool
if (!exists("USE_ADJUSTED_POPULATION")) USE_ADJUSTED_POPULATION <- FALSE # ⚡ USE_ADJUSTED_POPULATION bool

#### ⚠️ NER Specific : Parameter settings

In [None]:
# Parameter for "NER":
if (!exists("DISAGGREGATION_SELECTION")) DISAGGREGATION_SELECTION <- NULL  # options: # PREGNANT, UNDER5

### 1.1. Run setup

In [None]:
# PROJECT PATHS
SNT_ROOT_PATH <- "/home/hexa/workspace" 
CODE_PATH <- file.path(SNT_ROOT_PATH, 'code') # this is where we store snt_utils.r
CONFIG_PATH <- file.path(SNT_ROOT_PATH, 'configuration') # .json config file
DATA_PATH <- file.path(SNT_ROOT_PATH, 'data', 'dhis2') # same files as in Datasets but /data/ gets overwritten at each pipeline run

source(file.path(CODE_PATH, "snt_utils.r")) # utils
source(file.path(CODE_PATH, "snt_palettes.r")) # palettes 

# List required pcks
required_packages <- c("arrow", "tidyverse", "stringi", "jsonlite", "httr", "reticulate", "glue")
install_and_load(required_packages)

# Set environment to load openhexa.sdk from the right path
Sys.setenv(PROJ_LIB = "/opt/conda/share/proj")
Sys.setenv(GDAL_DATA = "/opt/conda/share/gdal")
Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python")
reticulate::py_config()$python
openhexa <- import("openhexa.sdk")

### 1.2. Load and check `config` file

**Checks for SNT mandatory configuration fields**

In [None]:
config_json <- tryCatch({ fromJSON(file.path(CONFIG_PATH, "SNT_config.json")) },
    error = function(e) {
        msg <- paste0("[ERROR] Error while loading configuration: ", conditionMessage(e))
        cat(msg)   
        stop(msg) 
    })

msg <- paste0("SNT configuration loaded from  : ", file.path(CONFIG_PATH, "SNT_config.json")) 
log_msg(msg)

# Generic
COUNTRY_CODE <- config_json$SNT_CONFIG$COUNTRY_CODE
ADMIN_1 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_1)
ADMIN_2 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_2)

# Which (aggregated) indicators to use to evaluate "activity" of an HF - for Reporting Rate method "ANY"
DHIS2_INDICATORS <- names(config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS)

# Fixed routine formatting columns
fixed_cols <- c('OU_ID','PERIOD', 'YEAR', 'MONTH', 'ADM1_ID', 'ADM2_ID') 
print(paste("Fixed routine data ('dhis2_routine') columns (always expected): ", paste(fixed_cols, collapse=", ")))

### 1.3. Helper function(s)

In [None]:
# helper function 
resolve_routine_filename <- function(outliers_method, routine_choice) {  
    if (routine_choice == "raw") return("_routine.parquet")
    is_removed <- FALSE
    if (routine_choice == "raw_without_outliers") is_removed <- TRUE 
    removed_status <- if (is_removed) "_removed" else "_imputed"    
    return(glue::glue("_routine_outliers-{outliers_method}{removed_status}.parquet"))
} 

## 2. Load Data

### 2.1. **Routine** data (DHIS2) (parametrized choice)

In [None]:
# select routine dataset and filename
if (ROUTINE_DATA_CHOICE == "raw") {    
    routine_dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED
    routine_name <- resolve_routine_filename(OUTLIER_DETECTION_METHOD, ROUTINE_DATA_CHOICE)
    routine_filename <- paste0(COUNTRY_CODE, routine_name)
} else {    
    routine_dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_OUTLIERS_IMPUTATION
    routine_name <- resolve_routine_filename(OUTLIER_DETECTION_METHOD, ROUTINE_DATA_CHOICE)
    routine_filename <- paste0(COUNTRY_CODE, routine_name)
}

In [None]:
# Load file from dataset  
dhis2_routine <- tryCatch({ get_latest_dataset_file_in_memory(routine_dataset_name, routine_filename) }, 
    error = function(e) {    
    # Check if the error message indicates that the file does not exist    
    if (grepl("does not exist", conditionMessage(e), ignore.case = TRUE)) { 
        msg <- paste0("[ERROR] File not found! 🛑 The file `", routine_filename, "` does not exist in `", 
                  routine_dataset_name, "`. To generate it, execute the pipeline `DHIS2 Outliers Removal and Imputation`, choosing the appropriate method.")
    } else {
        msg <- paste0("[ERROR] 🛑 Error while loading DHIS2 routine data file for: ", COUNTRY_CODE, ". [ERROR DETAILS] " , conditionMessage(e))
    }            
    stop(msg)
})

msg <- paste0("DHIS2 routine data : `", routine_filename, "` loaded from dataset : `", routine_dataset_name, "`. Dataframe dimensions: ", paste(dim(dhis2_routine), collapse=", "))
log_msg(msg)

dim(dhis2_routine)
head(dhis2_routine, 2)

#### Checks on routine data columns

 `fixed_cols`: Fixed columns that should be always present regardless of the config.

In [None]:
# Check if all "fixed" cols are present in dhis2_routine
actual_cols <- colnames(dhis2_routine) # dhis2_routine
missing_cols <- setdiff(fixed_cols, actual_cols) # Columns in fixed_cols but not in actual_cols)

# Check if all required columns are present
all_present <- length(missing_cols) == 0
if (all_present) { 
    log_msg(paste0("The 'dhis2_routine' tibble contains all the expected 'fixed' columns: ", paste(fixed_cols, collapse = ", "), "."))
} else {
    log_msg(paste0("🚨 Missing Columns: The following required columns are NOT present in 'dhis2_routine': ", paste(missing_cols, collapse = ", "), "."), "warning")
}

`DHIS2_INDICATORS`: Indicators, as defined in the config.json file, are expected to be present if the extraction pipeline and this pipeline are run on the same config settings.

In [None]:
# Check if all "DHIS2_INDICATORS" cols are present in dhis2_routine
missing_cols <- setdiff(DHIS2_INDICATORS, actual_cols) # all elements in DHIS2_INDICATORS but not in actual_cols
all_present <- length(missing_cols) == 0
if (all_present) { 
    log_msg(paste0("The 'dhis2_routine' tibble contains all the expected 'DHIS2_INDICATORS' columns: ", paste(DHIS2_INDICATORS, collapse = ", "), "."))
} else {
    log_msg(paste0(
      "🚨 Missing Columns: The following columns for DHIS2 INDICATORS are NOT present in 'dhis2_routine': ",
      paste(missing_cols, collapse = ", "),
      ".\n🚨 Looks like the config.json file was modified after extraction.\n🚨 The analysis will continue WITHOUT the missing indicators."
    ), "warning")
}

#### Checks on `N1_METHOD` selected
_**if**_ `N1_METHOD == PRES` then `PRES` must exist in config.json file _and_ in routine data <br>
_**else**_ N1 will use `SUSP-TEST` instead

In [None]:
# Check that col `PRES` exists in both config file and routine data
if (N1_METHOD == "PRES") {
    pres_in_routine <- any(names(dhis2_routine) == "PRES")
    pres_in_config <- any(DHIS2_INDICATORS == "PRES")

    if (!pres_in_routine) {
        # The pipeline stops here, so the message must not promise a fallback
        msg <- "🛑 Column `PRES` missing from routine data! 🚨 Set `N1_METHOD` to `SUSP-TEST` or add `PRES` to the routine data."
        log_msg(msg, "error")
        stop(msg)
    }
    if (!pres_in_config) {
        msg <- "⚙️ `PRES` set as parameter in this pipeline, but not defined as indicator in the configuration file (SNT_config.json)"
        log_msg(msg, "error")
        stop(msg)
    }
}

#### ⚠️ NER Specific : Indicator selection

Niger NER indicators column selection - ORIGINAL

In [None]:
# # Niger NER indicators column selection - ORIGINAL
# INDICATORS_FOUND <- FALSE
# if (COUNTRY_CODE == "NER" & N1_METHOD == "SUSP-TEST" & !is.null(DISAGGREGATION_SELECTION)) {    
#     susp_column <- glue("SUSP_{DISAGGREGATION_SELECTION}")
#     test_column <- glue("TEST_{DISAGGREGATION_SELECTION}")
#     conf_column <- glue("CONF_{DISAGGREGATION_SELECTION}")
    
#     if (!all(c(susp_column, test_column, conf_column) %in% colnames(dhis2_routine))) {        
#         log_msg(glue("NER Specific: Indicator version selection: {DISAGGREGATION_SELECTION} is not present in routine dataset."), "warning")
#         log_msg(glue("NER Specific: Using total indicators selection: SUSP, TEST, CONF."), "warning")        
#     } else {
#         log_msg(glue("NER Specific: Using indicator: {susp_column} in routine dataset."))
#         log_msg(glue("NER Specific: Using indicator: {test_column} in routine dataset."))
#         log_msg(glue("NER Specific: Using indicator: {conf_column} in routine dataset."))
#         dhis2_routine$SUSP <- dhis2_routine[[susp_column]]
#         dhis2_routine$TEST <- dhis2_routine[[test_column]]
#         dhis2_routine$CONF <- dhis2_routine[[conf_column]]
#         INDICATORS_FOUND <- TRUE
#     }    
# }

NEW by GP (20260109): add case for N1_METHOD == "PRES"

In [None]:
# # NEW by GP (20260109): add case for N1_METHOD == "PRES"
# if (COUNTRY_CODE == "NER" & N1_METHOD == "PRES" & !is.null(DISAGGREGATION_SELECTION)) {    
#     pres_column <- glue("PRES_{DISAGGREGATION_SELECTION}")
#     test_column <- glue("TEST_{DISAGGREGATION_SELECTION}")
#     conf_column <- glue("CONF_{DISAGGREGATION_SELECTION}")
    
#     if (!all(c(pres_column, test_column, conf_column) %in% colnames(dhis2_routine))) {        
#         log_msg(glue("NER Specific: Indicator version selection: {DISAGGREGATION_SELECTION} is not present in routine dataset."), "warning")
#         log_msg(glue("NER Specific: Using total indicators selection: PRES, TEST, CONF."), "warning")        
#     } else {
#         log_msg(glue("NER Specific: Using indicator: {pres_column} in routine dataset."))
#         log_msg(glue("NER Specific: Using indicator: {test_column} in routine dataset."))
#         log_msg(glue("NER Specific: Using indicator: {conf_column} in routine dataset."))
#         dhis2_routine$PRES <- dhis2_routine[[pres_column]]
#         dhis2_routine$TEST <- dhis2_routine[[test_column]]
#         dhis2_routine$CONF <- dhis2_routine[[conf_column]]
#         INDICATORS_FOUND <- TRUE
#     }    
# }

NEW by GP (20260109): add case for N1_METHOD == "PRES" --- like above but DRY

In [None]:
# # NEW by GP (20260109): add case for N1_METHOD == "PRES" , and make it DRY

# INDICATORS_FOUND <- FALSE # 👈

# if (COUNTRY_CODE == "NER" && !is.null(DISAGGREGATION_SELECTION) && N1_METHOD %in% c("SUSP-TEST", "PRES")) {
  
#   # 1. Determine the dynamic prefix based on the method
#   prefix_method <- ifelse(N1_METHOD == "SUSP-TEST", "SUSP", "PRES")
#   prefix_all <- c(prefix_method, "TEST", "CONF") # `CONF` and `TEST` are always needed, regardless of N1_METHOD
#   target_colname <- glue("{prefix_all}_{DISAGGREGATION_SELECTION}")
  
#   # 2. Check if all required columns exist in the dataset
#   if (all(target_colname %in% colnames(dhis2_routine))) {
    
#     # 3. Use a loop or vectorized assignment to map columns and log success
#     for (i in seq_along(prefix_all)) {
#       log_msg(glue("NER Specific: Using indicator: {target_colname[i]} in routine dataset."))
#       dhis2_routine[[prefix_all[i]]] <- dhis2_routine[[target_colname[i]]]
#       # create vector of targets for later use
#       if (i == 1) {
#         targets <- target_colname[i]
#       } else {
#         targets <- c(targets, target_colname[i])
#       }
#     }
#     INDICATORS_FOUND <- TRUE
    
#   } else {
#     # 4. Handle the "not found" case
#     log_msg(glue("NER Specific: Indicator version selection: {DISAGGREGATION_SELECTION} is not present in routine dataset."), "warning")
#     log_msg(glue("NER Specific: Using total indicators selection: {paste(targets, collapse = ', ')}."), "warning")
#   }
# }

In [None]:
# NEW by GP (20260112): re-add state flag "INDICATORS_FOUND", and fix else logic 

INDICATORS_FOUND <- FALSE # 👈

if (COUNTRY_CODE == "NER" && !is.null(DISAGGREGATION_SELECTION) && N1_METHOD %in% c("SUSP-TEST", "PRES")) {
  
  # Determine the dynamic prefix based on the method
  prefix_method <- ifelse(N1_METHOD == "SUSP-TEST", "SUSP", "PRES")
  prefix_all    <- c(prefix_method, "TEST", "CONF") 
  # Define the expected column names 
  # (also make available for the 'else' warning message if the check fails)
  target_colnames <- glue("{prefix_all}_{DISAGGREGATION_SELECTION}")
  
  if (all(target_colnames %in% colnames(dhis2_routine))) {
    
    # We map the specific columns (e.g., SUSP_UNDER5) to generic names (e.g., SUSP)
    dhis2_routine[prefix_all] <- dhis2_routine[target_colnames]
    
    for (col in target_colnames) {
      log_msg(glue("NER Specific: Successfully mapped indicator: {col}"))
    }
    
    # Signal success for the next code block
    INDICATORS_FOUND <- TRUE
    
  } else {
    missing_cols <- setdiff(target_colnames, colnames(dhis2_routine))
    log_msg(glue("NER Specific: Disaggregation '{DISAGGREGATION_SELECTION}' failed."), "warning")
    log_msg(glue("NER Specific: Missing columns in routine dataset: {paste(missing_cols, collapse = ', ')}"), "warning")
  }
}

### 2.2. Load population data at level ADM2 x YEAR

Already formatted & aggregated.  

**Expecting** table with these **cols** (bold = **must have**): 
* ADM1_ID
* **ADM2_ID**
* **YEAR**
* **POPULATION** (pop at ADM2 level)
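
As an illustration only, the must-have columns listed above could be verified with a small helper before any join. `check_required_cols()` and `population_toy` are hypothetical names introduced for this sketch; they are not part of `snt_utils.r`.

```r
# Hypothetical helper: return the required columns that are absent from `df`
check_required_cols <- function(df, required) {
  setdiff(required, colnames(df))
}

# Toy population table where the population column is misnamed on purpose
population_toy <- data.frame(
  ADM2_ID = c("A", "B"),
  YEAR    = 2024,
  POP     = c(1000, 2500)
)

missing_pop_cols <- check_required_cols(population_toy, c("ADM2_ID", "YEAR", "POPULATION"))

if (length(missing_pop_cols) > 0) {
  message("Missing must-have population columns: ", paste(missing_pop_cols, collapse = ", "))
}
```

This mirrors the `setdiff()` pattern already used for the routine data checks in this notebook.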

In [None]:
# Select population file 
if (USE_ADJUSTED_POPULATION) {
    dhis2_pop_dataset <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_POPULATION_TRANSFORMATION
} else {
    dhis2_pop_dataset <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED
}
     
# Load file from dataset
dhis2_population_adm2 <- tryCatch({ get_latest_dataset_file_in_memory(dhis2_pop_dataset, paste0(COUNTRY_CODE, "_population.parquet")) }, 
                  error = function(e) {
                      msg <- paste("[ERROR] Error while loading DHIS2 population file for: " , COUNTRY_CODE, 
                                   " [ERROR DETAILS] ", conditionMessage(e))  # log error message , 
                      cat(msg)
                      stop(msg)
})

log_msg(glue("DHIS2 population data loaded from dataset: {dhis2_pop_dataset}. Dataframe dimensions: {paste(dim(dhis2_population_adm2), collapse=', ')}"))

In [None]:
dhis2_population_adm2 |> head()

#### ⚠️ NER Specific : Population column selection

In [None]:
# Niger NER population column selection
# Inconsistent naming in configuration could cause issues. Namely, POPULATION_U5 vs POPULATION_UNDER5 ...
if (COUNTRY_CODE == "NER" & INDICATORS_FOUND) { 
    # Create named vector (~dictionary) to map DISAGGREGATION_SELECTION to population column names
    population_map <- c("UNDER5" = "POPULATION_U5",
                        "PREGNANT" = "POPULATION_FE")
    POPULATION_SELECTION <- population_map[[DISAGGREGATION_SELECTION]]   
    if (!(POPULATION_SELECTION %in% colnames(dhis2_population_adm2))) {
        log_msg(glue("NER Specific: Column '{POPULATION_SELECTION}' not found in Population dataset."), "warning")
        POPULATION_SELECTION <- "POPULATION"
    }
    # The selected column is assigned to POPULATION col so that later code can use it generically
    dhis2_population_adm2$POPULATION <- dhis2_population_adm2[[POPULATION_SELECTION]]
    log_msg(glue("NER Specific: Column '{POPULATION_SELECTION}' selected as population values."))
}

#### 2.2.1 **Population** data (DHIS2) columns selection.


In [None]:
dhis2_population_adm2 <- dhis2_population_adm2 %>% select(YEAR, ADM1_NAME, ADM1_ID, ADM2_NAME, ADM2_ID, POPULATION) 

dim(dhis2_population_adm2)
head(dhis2_population_adm2, 2)

### 2.3. (optional) **Care Seeking Behaviour** (CSB DHS) (taux recherche soins)
(20250728) Note: **changed units** (proportion to %), see https://bluesquare.atlassian.net/browse/SNT25-127 

In [None]:
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHS_INDICATORS
file_name <- glue::glue("{COUNTRY_CODE}_DHS_ADM1_PCT_CARESEEKING_SAMPLE_AVERAGE.parquet")

if (USE_CSB_DATA == TRUE) {
    # Read the data, if error (cannot find at defined path) -> set careseeking_data to NULL (so it doesn't break the function at # 3.)
    careseeking_data <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, file_name) },          
                  error = function(e) {
                      msg <- paste("🛑 Error while loading DHS Care Seeking data file from `", dataset_name, file_name ,"`.", conditionMessage(e))  # log error message
                      log_msg(msg, "error")
                      return(NULL) # make object NULL on error
                  })
 
    # Only print success messages and data info if careseeking_data is NOT NULL
    if (!is.null(careseeking_data)) {
        log_msg(paste0("Care Seeking data : ", file_name, " loaded from dataset : ", dataset_name))
        log_msg(paste0("Care Seeking data frame dimensions: ", nrow(careseeking_data), " rows, ", ncol(careseeking_data), " columns."))
        head(careseeking_data)
    } else {
        log_msg(paste0("🚨 Care-seeking data not loaded due to an error, `careseeking_data` is set to `NULL`!"), "warning")
    }
    
} else {
    # if `USE_CSB_DATA == FALSE` ... (basically, ignore CSB data)
    careseeking_data <- NULL
}

### 2.4. Load Reporting Rate 

Import the Reporting Rate file based on what is available in the latest OH Dataset version (which depends on the last run of the reporting rate pipeline).

📅 **Important**: reporting rate must be **monthly**!

In [None]:
# Define dataset and file names (based on parameter)
rr_dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_REPORTING_RATE
file_name_de <- paste0(COUNTRY_CODE, "_reporting_rate_dataelement.parquet")
file_name_ds <- paste0(COUNTRY_CODE, "_reporting_rate_dataset.parquet")

# Try loading dataelement reporting rates.
reporting_rate_month <- tryCatch({
    df_loaded <- get_latest_dataset_file_in_memory(rr_dataset_name, file_name_de)
    log_msg(glue("Reporting Rate data: `{file_name_de}` loaded from dataset: `{rr_dataset_name}`. Dataframe dimensions: {paste(dim(df_loaded), collapse=', ')}"))
    REPORTING_RATE_METHOD <- "dataelement"
    df_loaded
}, 
    error = function(e) {    
        cat(glue("[ERROR] Error while loading Reporting Rate 'dataelement' version for: {COUNTRY_CODE} {conditionMessage(e)}"))
        return(NULL)
})

# Try loading dataset reporting rates.
if (is.null(reporting_rate_month)) {
    reporting_rate_month <- tryCatch({
        df_loaded <- get_latest_dataset_file_in_memory(rr_dataset_name, file_name_ds) 
        log_msg(glue("Reporting Rate data: `{file_name_ds}` loaded from dataset: `{rr_dataset_name}`. Dataframe dimensions: {paste(dim(df_loaded), collapse=', ')}"))
        REPORTING_RATE_METHOD <- "dataset"
        df_loaded
    }, 
    error = function(e) {    
        stop(glue("[ERROR] Error while loading Reporting Rate 'dataset' version for: {COUNTRY_CODE} {conditionMessage(e)}")) # raise error
    })
}

rm(df_loaded)
dim(reporting_rate_month)
head(reporting_rate_month, 2)

#### 🔍 Check on data completeness for `REPORTING_RATE` data
Normally we should have "complete" data (no missing or `NA` values). However, when using certain datasets (from pipeline: "Reporting Rate (Dataset)") we might have incomplete coverage and hence `NA`s ... <br>
These are "problematic" because **N2** (Incidence adj 2) will also have `NA` values.

In [None]:
# Check on data completeness for REPORTING RATE data: 
# check how many values of REPORTING_RATE are NA
na_count <- sum(is.na(reporting_rate_month$REPORTING_RATE))     
if (na_count > 0) {
    log_msg(glue("⚠️ Warning: Reporting Rate data contains {na_count} missing values (NA) in 'REPORTING_RATE' column."), "warning")
} else {
    log_msg("✅ Reporting Rate data contains no missing values (NA) in 'REPORTING_RATE' column.")
}

-------------------------------

## 3. Calculate Incidence
First calculate monthly cases, then yearly incidence.

### 3.1 **Monthly cases**


These methods follow the standard WHO approach for estimating malaria incidence from routine health information systems (WHO, 2023).
As shown in the code, we begin by calculating **monthly malaria case metrics** (confirmed, tested, presumed) at the **ADM2** level and join them with the **monthly reporting rate**. 

This allows us to compute the **test positivity rate** (TPR, where `TPR` = `CONF` / `TEST`) and adjust for incomplete testing using the formula: 
> **N1** = `CONF` + (`PRES` × `CONF` / `TEST`)

Which is equivalent to:
> **N1** = `CONF` + (`PRES` × **TPR**)

where:
- **N1** = cases adjusted for testing gaps 
- `CONF` = **confirmed** cases
- `PRES` = **presumed** cases (either `SUSP` - `TEST` or directly available as `PRES`) 👈 this is a parameter (`N1_METHOD`)
- `TEST` = **tested** cases 
- **TPR** = Test Positivity Rate (`CONF` / `TEST`)
  
This produces `N1`, the number of cases adjusted for testing gaps, calculated at the monthly level in line with WHO recommendations to capture intra-annual variation.
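
A worked example of the N1 formula for a single ADM2 x MONTH, using made-up numbers and the `SUSP-TEST` variant of `PRES`:

```r
# Toy values for one ADM2 x MONTH (hypothetical)
CONF <- 80    # confirmed cases
TEST <- 100   # tested cases
SUSP <- 150   # suspected cases

TPR  <- CONF / TEST          # test positivity rate: 0.8
PRES <- SUSP - TEST          # presumed cases under N1_METHOD = "SUSP-TEST": 50
N1   <- CONF + PRES * TPR    # 80 + 50 * 0.8 = 120
```

The 50 suspected but untested cases contribute 40 extra cases, because they are assumed to have the same positivity rate as the tested ones.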

Next, we adjust for incomplete reporting using: 
> **N2** = **N1** / `REPORTING_RATE`

where `REPORTING_RATE` is at the monthly level, and is the number of received reports (submissions to DHIS2) divided by the number of expected reports.

Finally, _if_ **careseeking** data is **available**, N3 is calculated as follows:
> **N3** = N2 + (N2 × `PCT_PRIVATE_CARE` / `PCT_PUBLIC_CARE`) + (N2 × `PCT_NO_CARE` / `PCT_PUBLIC_CARE`)

where:
- `PCT_PRIVATE_CARE` = percentage of children treated in the **private** sector
- `PCT_PUBLIC_CARE` = percentage of children treated in the **public** sector
- `PCT_NO_CARE` = percentage of children who **did not receive any treatment**

Note that this assumes the same TPR across all sectors (private and public).
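
A small numeric illustration of the N3 adjustment, using the care-seeking column names from the code (`PCT_PRIVATE_CARE`, `PCT_PUBLIC_CARE`, `PCT_NO_CARE`). The values are invented, and the three percentages are assumed to sum to 100 for this example.

```r
# Toy values (hypothetical): cases adjusted for testing and reporting
N2 <- 100

# Care-seeking percentages for the matching ADM1 (assumed to sum to 100)
PCT_PUBLIC_CARE  <- 50
PCT_PRIVATE_CARE <- 20
PCT_NO_CARE      <- 30

# Scale up cases seen in the public sector to those treated elsewhere or not at all
N3 <- N2 + (N2 * PCT_PRIVATE_CARE / PCT_PUBLIC_CARE) + (N2 * PCT_NO_CARE / PCT_PUBLIC_CARE)
# 100 + 40 + 60 = 200

# When the three percentages sum to 100, this is the same as dividing by the public share
N3_check <- N2 * 100 / PCT_PUBLIC_CARE
```

Because only ratios of the percentages enter the formula, the result is the same whether the inputs are expressed as percentages or as proportions.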



**Important note**<br>
In case reporting rate equals zero (none of the health facilities reported in a given month), N2 is set to `NA`. Note that the annual N2 will be underestimated, which is preferable compared to having `Inf` values.
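
To make the zero reporting rate rule concrete, here is a sketch with three toy months for one ADM2 (values are hypothetical). It also shows how the choice of `na.rm` in the later yearly sum decides whether the annual N2 comes out as `NA` or as an underestimate.

```r
# Toy monthly values for one ADM2 (hypothetical)
N1             <- c(120, 90, 60)
REPORTING_RATE <- c(0.8, 0, 1)

# Zero reporting rate -> NA instead of Inf
N2 <- ifelse(REPORTING_RATE == 0, NA_real_, N1 / REPORTING_RATE)
# N2 is 150, NA, 60

# Annual total, depending on how NA is handled when summing
annual_n2_keep_na <- sum(N2)                 # NA: the gap propagates to the year
annual_n2_drop_na <- sum(N2, na.rm = TRUE)   # 210: the month with no reports is ignored
```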

-------------

This calculation expects (input):
* **routine_data**: DHIS2 routine data, formatted and aggregated at ADM2 and MONTH level. Tibble (df) _must_ contain the following cols: `YEAR`, `MONTH`, `ADM2`, `CONF`, `TEST`, `SUSP`, `PRES`.  
* **reporting_rate_data**: reporting rate calculated at ADM2 and MONTH level and expressed as proportion **(0-1)**. Tibble (df) _must_ contain the following cols: `ADM2`, `YEAR`, `MONTH`, `REPORTING_RATE`

The calculation produces (output):
* data frame with the following cols: `ADM2`, `YEAR`, `MONTH`, "value_" * (`CONF`, `TEST`, `SUSP`, `PRES`), `TPR`, `N1`, `N2`

-----------------

In [None]:
# Ensure correct data type for numerical columns ---------------------------------------
routine_data <- dhis2_routine %>%
    mutate(across(any_of(c("YEAR", "MONTH", "CONF", "TEST", "SUSP", "PRES")), as.numeric))

reporting_rate_data <- reporting_rate_month %>% # reporting_rate_data
    mutate(across(c(YEAR, MONTH, REPORTING_RATE), as.numeric))

#### 3.1.0. Aggregate at `ADM2` x `MONTH` & calculate **TPR**

In [None]:
# Check for TEST > SUSP
routine_data |> mutate(SUSP_minus_TEST = SUSP - TEST) |> filter(SUSP_minus_TEST < 0) |> nrow() 

In [None]:
# # Group & compute TPR
# monthly_cases <- routine_data %>%
#     group_by(ADM1_ID, ADM2_ID, YEAR, MONTH) %>% # ADM1 needed to join careseeking data
#     summarise(
#       CONF = sum(CONF, na.rm = TRUE),
#       TEST = sum(TEST, na.rm = TRUE),
#       SUSP = sum(SUSP, na.rm = TRUE),
#       across(any_of("PRES"), ~sum(., na.rm = TRUE), .names = "PRES"), # <- handles missing 'PRES' column gracefully
#       .groups = "drop"
#     ) %>%
#     left_join(reporting_rate_data,
#               by = c("ADM2_ID", "YEAR", "MONTH")) %>%
#     # Calculate TPR based on CONF and TEST
#     # Note: if TEST is 0 or NA, set TPR = 1 (to avoid division by zero which produces Inf)
#     mutate(
#       TPR = ifelse(!is.na(CONF) & !is.na(TEST) & (TEST != 0), CONF / TEST, 1)
#     )

In [None]:
# NEW (GP-20251219): 
# Moved `# Cleaning TEST data for "SUSP-TEST" method` here because aggregation (`summarize()`) introduces new cases where TEST > SUSP

# Group & compute TPR
monthly_cases <- routine_data %>%
    group_by(ADM1_ID, ADM2_ID, YEAR, MONTH) %>% # ADM1 needed to join careseeking data
    summarise(
      CONF = sum(CONF, na.rm = TRUE),
      TEST = sum(TEST, na.rm = TRUE),
      SUSP = sum(SUSP, na.rm = TRUE),
      across(any_of("PRES"), ~sum(., na.rm = TRUE), .names = "PRES"), # <- handles missing 'PRES' column gracefully
      .groups = "drop") %>%
    # Cleaning TEST data for "SUSP-TEST" method
    mutate(TEST = ifelse(N1_METHOD == "SUSP-TEST" & !is.na(SUSP) & (TEST > SUSP), SUSP, TEST)) %>%
    left_join(reporting_rate_data,
              by = c("ADM2_ID", "YEAR", "MONTH")) %>%   
    # Calculate TPR based on CONF and TEST
    # Note: if TEST is 0 or NA, set TPR = 1 (to avoid division by zero which produces Inf)
    mutate( 
      TPR = ifelse(!is.na(CONF) & !is.na(TEST) & (TEST != 0), CONF / TEST, 1)
    )

In [None]:
# Check for TEST > SUSP
monthly_cases |> mutate(SUSP_minus_TEST = SUSP - TEST) |> filter(SUSP_minus_TEST < 0) |> nrow() 

#### 3.1.1. Calculate **N1**

In [None]:
# Calculate N1 based on `N1_METHOD` & availability of `PRES` 

if (N1_METHOD == "SUSP-TEST") {
    monthly_cases <- monthly_cases %>%
      mutate(N1 = CONF + ((SUSP - TEST) * TPR))
      log_msg("Calculating N1 as `N1 = CONF + ((SUSP - TEST) * TPR)`")
} else if (N1_METHOD == "PRES") {
    # if: column named "PRES" exists in `monthly_cases` and contains at least one non-missing value
    if ("PRES" %in% names(monthly_cases) && !all(is.na(monthly_cases$PRES))) {
      monthly_cases <- monthly_cases %>%
        mutate(N1 = CONF + (PRES * TPR))
      log_msg("ℹ️ Calculating N1 as `N1 = CONF + (PRES * TPR)`")
    } else {
      log_msg("🚨 Warning: 'PRES' not found in routine data or contains all `NA` values! 🚨 Calculating N1 using 'SUSP-TEST' method instead.", "warning")
      monthly_cases <- monthly_cases %>%
        mutate(N1 = CONF + ((SUSP - TEST) * TPR))
    }
} else {
    stop("Invalid N1_METHOD. Please use 'PRES' or 'SUSP-TEST'.") # fail early: N1 is required by all later steps
}

#### 3.1.2. Calculate **N2**

In [None]:
# Calculate N2
monthly_cases <- monthly_cases %>%
    mutate(
      N2 = ifelse(REPORTING_RATE == 0, NA_real_, N1 / REPORTING_RATE) # On the fly convert `RR == 0` to NA to avoid N2 == Inf
    )

In [None]:
# Log msg about zero REPORTING RATE cases and warn that N2 set to NA

zero_reporting <- reporting_rate_data %>%
      filter(REPORTING_RATE == 0) %>%
      summarise(
        n_months_zero_reporting = n(),
        affected_zones = n_distinct(ADM2_ID)
      )

if (zero_reporting$n_months_zero_reporting > 0) {    
    log_msg(glue("🚨 Note: {zero_reporting$n_months_zero_reporting} rows had `REPORTING_RATE == 0` across ",
                 "{zero_reporting$affected_zones} ADM2. These N2 values were set to NA."))
} else {
  log_msg("✅ Note: no ADM2 has `REPORTING_RATE == 0`. All N2 values were preserved.")
}

#### 3.1.3. (optional) Calculate **N3**

In [None]:
# Only calculate N3 if CARESEEKING data is available 
if (!is.null(careseeking_data)) {
    monthly_cases <- monthly_cases %>%
    mutate(YEAR = as.numeric(YEAR)) %>% # keep as safety
    left_join(., careseeking_data, by = c("ADM1_ID")) %>%
    mutate(
        N3 = N2 + (N2 * PCT_PRIVATE_CARE / PCT_PUBLIC_CARE) + (N2 * PCT_NO_CARE / PCT_PUBLIC_CARE) 
    )
} else {
    print("🦘 Careseeking data not available, skipping calculation of N3.")
}

In [None]:
head(monthly_cases, 3)

#### 💾 Export `monthly_cases` (for 📓 report notebook)
For coherence checks, which need monthly resolution ... !

In [None]:
# Save monthly_cases as .parquet file 
arrow::write_parquet(monthly_cases, file.path(DATA_PATH, "incidence", paste0(COUNTRY_CODE, "_monthly_cases.parquet")))

# Log msg
log_msg(glue("Monthly cases data saved to: {file.path(DATA_PATH, 'incidence', paste0(COUNTRY_CODE, '_monthly_cases.parquet'))}"))

### 🔍 Data **coherence** checks on **monthly cases**
Check for ratios or differences that produce negative or implausible values, which cause the adjusted incidence to be lower than the value it adjusts.


Namely, the following relationships among INDICATORs:
* SUSP-TEST
* CONF/TEST
* N1 == CONF ... (when PRES == 0)

#### 1. `PRES == 0`: causes `N1 == CONF` 
(if `N1_METHOD == "PRES"`)

In [None]:
# Run this check only if N1_METHOD == "PRES" (else, problem doesn't exist)
if (N1_METHOD == "PRES") {
    nr_of_pres_0_adm2_month <- monthly_cases |> filter(PRES == 0) |> nrow()
    # Only warn when at least one row is affected
    if (nr_of_pres_0_adm2_month > 0) {
        log_msg(glue("🚨 Note: using `PRES` for incidence adjustment, but `PRES == 0` for {nr_of_pres_0_adm2_month} rows (ADM2 x MONTH)."), "warning")
    }
}

#### 2. `SUSP-TEST`: if negative, then N1 is smaller than or equal to CONF (ADJ ≤ CRUDE)
(if `N1_METHOD == "SUSP-TEST"`)

In [None]:
# SUSP - TEST: if negative (TEST > SUSP), then N1 smaller or equal to CONF, which then causes ADJ ≤ CRUDE
if (N1_METHOD == "SUSP-TEST") {
    nr_of_negative <- monthly_cases |> mutate(SUSP_minus_TEST = SUSP - TEST) |> filter(SUSP_minus_TEST < 0) |> nrow() 
    if (nr_of_negative > 0) {
        log_msg(
        glue("🚨 Note: using formula `SUSP - TEST` for incidence adjustment, but more tested than suspected cases (`SUSP < TEST`) detected in {nr_of_negative} rows (ADM2 x MONTH)."),
        "warning"
        )
    }
}

#### 3. `CONF/TEST` = `TPR` (to calculate N1: Incidence adjusted for **Testing**)
This **ratio should** always be **≤ 1** because **there should _not_ be more confirmed cases than tested** ...

(but if very small, then N1 could be smaller or equal to CONF (so ADJ INC ≤ CRUDE))

In [None]:
more_confirmed_than_tested <- monthly_cases |> mutate(CONF_divby_TEST = CONF / TEST) |> filter(CONF_divby_TEST > 1) |> nrow() 

if (more_confirmed_than_tested > 0) {
    log_msg(glue("🚨 Note: more confirmed than tested cases (`CONF/TEST` > 1) detected in {more_confirmed_than_tested} rows (ADM2 x MONTH)."), "warning")
}

### 3.2 **Yearly incidence**
After calculating N1 and N2 for each `ADM2`-`MONTH`, we aggregate the data annually to compute the yearly totals (sums) for crude cases (`CONF`), `N1` and `N2`. Finally, we compute:
* Crude incidence: CONF / POP × 1000
* Incidence adjusted for testing: N1 / POP × 1000
* Incidence adjusted for testing and reporting: N2 / POP × 1000
* Incidence adjusted for testing, reporting and careseeking behaviour (optional): N3 / POP × 1000
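
As a worked example of these formulas, the sketch below aggregates made-up monthly values (only three months, for brevity) and converts them to incidence per 1000 population. The data frames `toy_monthly` and `toy_population` are hypothetical; ADM2 "B" has a missing `N2` for one month to show how `NA` propagates when `sum()` is called without `na.rm = TRUE`.

```r
library(dplyr)

# Made-up monthly values: ADM2 "A" is complete, ADM2 "B" lacks N2 for one month
toy_monthly <- data.frame(
  ADM2_ID = rep(c("A", "B"), each = 3),
  YEAR    = 2023,
  MONTH   = rep(1:3, times = 2),
  CONF    = c(100, 150, 250, 100, 150, 250),
  N1      = c(120, 180, 300, 120, 180, 300),
  N2      = c(150, 200, 400, 150, NA, 400)
)
toy_population <- data.frame(ADM2_ID = c("A", "B"), YEAR = 2023, POPULATION = 50000)

toy_yearly <- toy_monthly |>
  group_by(ADM2_ID, YEAR) |>
  summarise(across(c(CONF, N1, N2), sum), .groups = "drop") |>  # no na.rm: NA propagates
  left_join(toy_population, by = c("ADM2_ID", "YEAR")) |>
  mutate(
    INCIDENCE_CRUDE         = CONF / POPULATION * 1000,
    INCIDENCE_ADJ_TESTING   = N1 / POPULATION * 1000,
    INCIDENCE_ADJ_REPORTING = N2 / POPULATION * 1000
  )

toy_yearly |> select(ADM2_ID, starts_with("INCIDENCE"))
# ADM2 "A": 10, 12, 15
# ADM2 "B": 10, 12, NA
```

With `na.rm = TRUE`, the yearly `N2` for ADM2 "B" would be 550, giving an incidence of 11 per 1000, which is lower than the testing-adjusted value of 12. Letting the `NA` propagate makes the gap visible instead of producing an implausibly low adjusted value.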

--------------

The calculation expects (input):
* **monthly_cases**: the output of `calculate_monthly_cases()`, or a tibble/data frame with the following columns: `ADM2_ID`, `YEAR`, `MONTH`, `CONF`, `TEST`, `SUSP`, `PRES`, `TPR`, `N1`, `N2`
* **population_data**: population data, formatted and aligned, aggregated at ADM2 and YEAR level. A tibble/data frame that _must_ contain the following columns: `ADM2_ID`, `YEAR`, `POPULATION`

The calculation produces (output): 
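* a data frame with the following columns: `ADM2_ID`, `YEAR`, `CONF`, `N1`, `N2`, the columns joined from `population_data`, `INCIDENCE_CRUDE`, `INCIDENCE_ADJ_TESTING`, `INCIDENCE_ADJ_REPORTING`, and `INCIDENCE_ADJ_CARESEEKING` (`NA` when no careseeking data is available)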

--------------------

In [None]:
# ---- 1. Enforce column types upfront ----
monthly_cases <- monthly_cases %>% 
    mutate(across(where(is.numeric), as.numeric))  # Coerce all numeric columns (integers included) to double
  
population_data <- dhis2_population_adm2 %>% # population_data
    mutate(across(c(YEAR, POPULATION), as.numeric))

In [None]:
# ---- 2. Core calculation ----
yearly_incidence <- monthly_cases %>%
    group_by(ADM2_ID, YEAR) %>%
    summarise(
        # 🚨 removed `na.rm = TRUE` on 20250702 - if things break check here! 🚨
      across(c(CONF, N1, N2), ~sum(.)), #, na.rm = TRUE)), # 🔍 PROBLEM: if NAs in N2 (due to missing RR data), the sum of N2 by YEAR is smaller than the sum of N1!
      # across(any_of(c("CONF", "TEST", "SUSP", "PRES", "N1", "N2")), ~sum(.)), # silenced as not necessary to also summarize "TEST", "SUSP", "PRES"
      .groups = "drop"
    ) %>%
    left_join(
      population_data,
      by = c("ADM2_ID", "YEAR")
    ) %>%
    mutate(
      INCIDENCE_CRUDE = CONF / POPULATION * 1000,
      INCIDENCE_ADJ_TESTING = N1 / POPULATION * 1000,
      INCIDENCE_ADJ_REPORTING = N2 / POPULATION * 1000
    ) |>
    ungroup()

In [None]:
# ---- 3. Optional careseeking adjustment ----
if (!is.null(careseeking_data) && "N3" %in% names(monthly_cases)) {
    n3_data <- monthly_cases %>%
      group_by(ADM2_ID, YEAR) %>%
      summarise(N3 = sum(N3, na.rm = TRUE),
                .groups = "drop") |>
      ungroup()
    
    yearly_incidence <- yearly_incidence %>%
      left_join(n3_data, by = c("ADM2_ID", "YEAR")) %>%
      mutate(
        INCIDENCE_ADJ_CARESEEKING = N3 / POPULATION * 1000
      )
  } else {
    # No careseeking data: keep the column, typed as numeric, so the output schema stays stable
    yearly_incidence <- yearly_incidence |>
      mutate(INCIDENCE_ADJ_CARESEEKING = NA_real_)
  }

In [None]:
head(yearly_incidence, 3)

### 🔍 Data **coherence** checks on **yearly incidence**
Here we check whether the incidence values (already at `YEAR` resolution) make sense in relation to each other.<br>
Namely:
* crude values should be the lowest, and any consecutive **adjustment** should cause the incidence values to **increase** or remain the **same** - but should never be lower!
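
Both checks in this section apply the same comparison to a different pair of columns. A minimal sketch of a reusable helper follows; the function name `count_decreasing()` and the data frame `toy_incidence` are hypothetical and not part of this notebook.

```r
# Hypothetical helper: count rows where the higher adjustment level is
# lower than the level it builds on (rows with NA in either column are ignored)
count_decreasing <- function(data, lower_col, higher_col) {
  sum(data[[higher_col]] < data[[lower_col]], na.rm = TRUE)
}

# Made-up yearly values
toy_incidence <- data.frame(
  INCIDENCE_CRUDE         = c(10, 20, 30),
  INCIDENCE_ADJ_TESTING   = c(12, 18, 30),
  INCIDENCE_ADJ_REPORTING = c(15, NA, 29)
)

count_decreasing(toy_incidence, "INCIDENCE_CRUDE", "INCIDENCE_ADJ_TESTING")
# 1 (18 < 20)
count_decreasing(toy_incidence, "INCIDENCE_ADJ_TESTING", "INCIDENCE_ADJ_REPORTING")
# 1 (29 < 30)
```

Equal values (30 and 30 in the third row) are not counted, which matches the rule that an adjustment may leave the incidence unchanged.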

#### 1. `INCIDENCE_ADJ_TESTING` (adj. level 1) should always be greater than or equal to `INCIDENCE_CRUDE` (not adjusted)

In [None]:
# same as below but different cols ... 
# Count TRUE values, handling potential NAs in the result of if_else
nr_of_impossible_values <- yearly_incidence |>
  mutate(IMPOSSIBLE_VALUE = if_else(INCIDENCE_ADJ_TESTING < INCIDENCE_CRUDE, TRUE, FALSE)) |>
  pull(IMPOSSIBLE_VALUE) |>
  sum(na.rm = TRUE) 

# Warning if any impossible values are found
if (nr_of_impossible_values > 0) {
  log_msg(glue::glue("🚨 Warning: found {nr_of_impossible_values} rows where INCIDENCE_ADJ_TESTING < INCIDENCE_CRUDE!"), "warning")
} else log_msg("✅ For all YEAR and ADM2, `INCIDENCE_CRUDE` is smaller than or equal to `INCIDENCE_ADJ_TESTING` (as expected).")

# Check if all values in a column are NA
if (all(is.na(yearly_incidence$INCIDENCE_ADJ_TESTING))) {
  log_msg("🚨 Warning: all values of `INCIDENCE_ADJ_TESTING` are `NA`s", "warning")
}


#### 2. `INCIDENCE_ADJ_REPORTING` (adj. level 2) should always be greater than or equal to `INCIDENCE_ADJ_TESTING` (adj. level 1)

In [None]:
# Count TRUE values, handling potential NAs in the result of if_else
nr_of_impossible_values <- yearly_incidence |>
  mutate(IMPOSSIBLE_VALUE = if_else(INCIDENCE_ADJ_REPORTING < INCIDENCE_ADJ_TESTING, TRUE, FALSE)) |>
  pull(IMPOSSIBLE_VALUE) |>
  sum(na.rm = TRUE) 

# Warning if any impossible values are found
if (nr_of_impossible_values > 0) {
  log_msg(glue::glue("🚨 Warning: found {nr_of_impossible_values} rows where INCIDENCE_ADJ_REPORTING < INCIDENCE_ADJ_TESTING!"), "warning")
} else log_msg("✅ For all YEAR and ADM2, `INCIDENCE_ADJ_TESTING` is smaller than or equal to `INCIDENCE_ADJ_REPORTING` (as expected).")

# Check if all values in a column are NA
if (all(is.na(yearly_incidence$INCIDENCE_ADJ_REPORTING))) {
  log_msg("🚨 Warning: all values of `INCIDENCE_ADJ_REPORTING` are `NA`s", "warning")
}

### 3.3. **Mean** Incidence across **all** available **years**

To keep in mind for future:
* consider taking the `median()` instead. But only if we have at least 3 years of data ... !
* possibly make this **parametrized** so that the user can decide the interval. But this choice needs to be dynamic, not hardcoded (pipeline.py code should read the data to offer choice of years that _exist_)
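
A minimal sketch of both ideas, assuming hypothetical parameters `year_min` and `year_max` and a hypothetical helper `summarise_incidence()`; none of these exist in the pipeline yet. The median is used only where at least 3 non-missing yearly values are available, and the mean is used otherwise.

```r
library(dplyr)

# Hypothetical helper, not part of the pipeline
summarise_incidence <- function(yearly_incidence, year_min, year_max) {
  yearly_incidence |>
    filter(YEAR >= year_min, YEAR <= year_max) |>
    group_by(ADM2_ID) |>
    summarise(
      N_YEARS = n(),
      across(
        starts_with("INCIDENCE"),
        # Median needs at least 3 non-missing values; otherwise fall back to the mean
        ~ if (sum(!is.na(.x)) >= 3) median(.x, na.rm = TRUE) else mean(.x, na.rm = TRUE)
      ),
      .groups = "drop"
    )
}

# Made-up yearly values: ADM2 "A" has 4 years, ADM2 "B" has 2
toy_incidence <- data.frame(
  ADM2_ID         = c(rep("A", 4), rep("B", 2)),
  YEAR            = c(2020:2023, 2022:2023),
  INCIDENCE_CRUDE = c(10, 12, 50, 14, 20, 30)
)

summarise_incidence(toy_incidence, year_min = 2021, year_max = 2023)
# ADM2 "A": N_YEARS = 3, INCIDENCE_CRUDE = 14 (median of 12, 50, 14)
# ADM2 "B": N_YEARS = 2, INCIDENCE_CRUDE = 25 (mean of 20, 30)
```

The list of years to offer to the user could be read from the data itself, for instance with `sort(unique(yearly_incidence$YEAR))`.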

In [None]:
yearly_incidence_mean <- yearly_incidence |>
    group_by(ADM1_ID, ADM2_ID) |>
    summarise(
          across(starts_with("INCIDENCE"), ~mean(., na.rm = TRUE)), # 🔍 possible PROBLEM here: if missing data for RR -> sum of N2 by YEAR is smaller than the sum of N1!
          .groups = "drop"
        ) |>
    ungroup()

print(dim(yearly_incidence_mean))
head(yearly_incidence_mean, 3)

## 4. Export to `/data/dhis2_incidence/` folder

Dynamically include **reporting rate method** used (`rr-method-`) in **filename**
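
For example, with the hypothetical parameter values below, the naming scheme used by `save_yearly_incidence()` produces the following filename. The values `"raw"` and `"dataset"` are assumptions chosen for illustration; the values actually allowed for `ROUTINE_DATA_CHOICE` and `REPORTING_RATE_METHOD` are defined elsewhere.

```r
# Hypothetical values, for illustration only
COUNTRY_CODE          <- "NER"
ROUTINE_DATA_CHOICE   <- "raw"
REPORTING_RATE_METHOD <- "dataset"

file_name <- paste0(
  COUNTRY_CODE,
  "_incidence_year_routine-data-", ROUTINE_DATA_CHOICE,
  "_rr-method-", REPORTING_RATE_METHOD,
  ".parquet"
)
file_name
# "NER_incidence_year_routine-data-raw_rr-method-dataset.parquet"
```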

In [None]:
# Reusable function to generate filename and save data
save_yearly_incidence <- function(yearly_incidence, data_path, file_extension, write_function) {
  
  # Base filename parts
  base_name_parts <- c(
    COUNTRY_CODE, 
    "_incidence_year_routine-data-", ROUTINE_DATA_CHOICE, 
    "_rr-method-", REPORTING_RATE_METHOD
  )
   
  # Concatenate all parts to form the final filename
  file_name <- paste0(c(base_name_parts, file_extension), collapse = "")
  file_path <- file.path(data_path, "incidence", file_name)
  output_dir <- dirname(file_path)

  # Check if the output directory exists, else create it
  if (!dir.exists(output_dir)) {
    dir.create(output_dir, recursive = TRUE)
  }

  # Use whichever writer function was passed in, e.g. `write_csv` or `arrow::write_parquet`
  write_function(yearly_incidence, file_path)

  log_msg(paste0("Exporting : ", file_path))
}

#### ⚠️ NER-specific: computation details

Log a message telling the user which version of the indicators and population (under 5, pregnant women, or totals) the results correspond to.

In [None]:
# `&&` is the scalar, short-circuiting operator intended for `if` conditions
if (COUNTRY_CODE == "NER" && INDICATORS_FOUND) {
    log_msg(glue("ℹ️ The results have been computed using the following Indicators: {paste(targets, collapse=', ')}"))
    log_msg(glue("ℹ️ The results have been computed using the following Population: {POPULATION_SELECTION}"))
}

In [None]:
# Export the data

# CSV
save_yearly_incidence(yearly_incidence, DATA_PATH, ".csv", write_csv)

# Parquet
save_yearly_incidence(yearly_incidence, DATA_PATH, ".parquet", arrow::write_parquet)