# Data Element reporting rate: based on reporting of one or more indicators
**_Partially_ following methods by WHO and as per Diallo (2025) paper**

To measure data completeness, we calculate the **monthly** reporting rate per **ADM2** as the **proportion** of **facilities** (HF or `OU_ID`) that, in a given month, submitted data for either a single indicator (e.g., **confirmed** malaria cases, `CONF`) or for _any_ of the chosen indicators (e.g., `CONF`, `SUSP`, `TEST`).

In other words, the "Data Element" reporting rate is the number of facilities reporting on one or more of the selected indicators, divided by the number of facilities expected to report.<br>

For this method, the user can **choose** how to calculate both the **numerator** and the **denominator**.<br>
Specifically:  

* **Numerator**: the number of facilities that _actually reported_ data. A facility (`OU_ID`) counts as reporting in a given month if it submitted data (a non-missing value greater than 0) for **_any_** of the **selected indicators**.
    Note: we **recommend** always including `CONF` because it is a core indicator consistently tracked across the dataset. This choice ensures alignment with the structure of the incidence calculation, which is also mainly based on confirmed cases.
    <br>
    <br>
* **Denominator**: is the number of facilities _expected_ to report. This number can be obtained in two different ways:    
    * `"ROUTINE_ACTIVE_FACILITIES"`: uses the col `EXPECTED_REPORTS` from the df `active_facilities`.<br>
      This is calculated as the number of "**active**" facilities (OU_ID), defined as those that submitted _any_ data **at least once in a given year**, across **all** indicators extracted in `dhis2_routine` (namely: all aggregated indicators as defined in the SNT_config.json file, see: `config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS`)
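    * `"PYRAMID_OPEN_FACILITIES"`: uses the number of facilities considered **open** in the given month according to the DHIS2 pyramid, based on each facility's `OPENING_DATE` and `CLOSED_DATE` (see section 3.3).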

<br>

This method improves over simple binary completeness flags by accounting for both spatial (facility coverage) and temporal (monthly timeliness) dimensions. <br>
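
As a minimal sketch of the calculation described above, the following base R snippet computes the unweighted reporting rate for a single ADM2 and month under both denominator options. The facility IDs and values are made up for illustration, base R is used only to keep the snippet free of dependencies, and the activity weights applied later in this notebook are ignored here.

```r
# Toy data (hypothetical values): one ADM2, one month, four facilities
toy <- data.frame(
  OU_ID     = c("hf_a", "hf_b", "hf_c", "hf_d"),
  CONF      = c(12, NA, 0, NA),
  SUSP      = c(30, 5, NA, NA),
  TEST      = c(28, NA, NA, NA),
  IS_ACTIVE = c(1, 1, 1, 0),  # reported at least once during the year
  OPEN      = c(1, 1, 1, 1)   # open in this month according to the pyramid
)

selected <- c("CONF", "SUSP", "TEST")  # indicators chosen for the numerator

# A facility counts as reporting if any selected indicator is non-missing and > 0
vals <- as.matrix(toy[, selected])
toy$DE_AVAILABLE <- as.integer(rowSums(!is.na(vals) & vals > 0) > 0)

# Numerator: facilities that reported; denominators: the two available options
rr_active <- sum(toy$DE_AVAILABLE) / sum(toy$IS_ACTIVE)  # "ROUTINE_ACTIVE_FACILITIES"
rr_open   <- sum(toy$DE_AVAILABLE) / sum(toy$OPEN)       # "PYRAMID_OPEN_FACILITIES"

print(c(rr_active = rr_active, rr_open = rr_open))
```

In this toy case two of the four facilities report, so the rate is 2/3 against the three active facilities and 2/4 against the four open ones.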

In [None]:
# Parameters
# SNT_ROOT_PATH <- "/home/hexa/workspace" 

## 1. Setup

In [None]:
# Project paths
SNT_ROOT_PATH <- "/home/hexa/workspace" 
CODE_PATH <- file.path(SNT_ROOT_PATH, 'code') # this is where we store snt_utils.r
CONFIG_PATH <- file.path(SNT_ROOT_PATH, 'configuration') # .json config file
DATA_PATH <- file.path(SNT_ROOT_PATH, 'data', 'dhis2')  

# Load utils
source(file.path(CODE_PATH, "snt_utils.r"))

# Load libraries 
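required_packages <- c("arrow", "tidyverse", "lubridate", "stringi", "jsonlite", "httr", "reticulate", "glue") # lubridate provides ym(); listed explicitly in case the installed tidyverse does not attach it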
install_and_load(required_packages)

# Environment variables
Sys.setenv(PROJ_LIB = "/opt/conda/share/proj")
Sys.setenv(GDAL_DATA = "/opt/conda/share/gdal")
Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python")

# Load OpenHEXA sdk
openhexa <- import("openhexa.sdk")

### 1.1. Validate parameters

In [None]:
# Current options: 
# "COUNTRY_CODE_routine.parquet" (RAW data)
# "COUNTRY_CODE_routine_outliers-mean_removed.parquet" 
# "COUNTRY_CODE_routine_outliers-mean_imputed.parquet"
# "COUNTRY_CODE_routine_outliers-median_removed.parquet"
# "COUNTRY_CODE_routine_outliers-median_imputed.parquet"            
# "COUNTRY_CODE_routine_outliers-iqr_removed.parquet"
# "COUNTRY_CODE_routine_outliers-iqr_imputed.parquet"
# "COUNTRY_CODE_routine_outliers-trend_removed.parquet"
# "COUNTRY_CODE_routine_outliers-trend_imputed.parquet" 
# The default for ROUTINE_FILE ("{COUNTRY_CODE}_routine.parquet") is set in section 1.2, once COUNTRY_CODE has been read from the config

# Options: "ROUTINE_ACTIVE_FACILITIES", "PYRAMID_OPEN_FACILITIES"
if (!exists("DATAELEMENT_METHOD_DENOMINATOR")) DATAELEMENT_METHOD_DENOMINATOR <- "ROUTINE_ACTIVE_FACILITIES" 

### 1.2. Load and check `snt config` file

In [None]:
# Load SNT config
config_json <- tryCatch({ jsonlite::fromJSON(file.path(CONFIG_PATH, "SNT_config.json")) },
    error = function(e) {
        msg <- paste0("[ERROR] Error while loading configuration: ", conditionMessage(e))  
        cat(msg)   
        stop(msg) 
    })

log_msg(paste0("SNT configuration loaded from : ", file.path(CONFIG_PATH, "SNT_config.json")))

In [None]:
# Configuration settings
COUNTRY_CODE <- config_json$SNT_CONFIG$COUNTRY_CODE

# Default routine file (RAW data) when none was provided as a parameter
if (!exists("ROUTINE_FILE")) ROUTINE_FILE <- glue("{COUNTRY_CODE}_routine.parquet")
ADMIN_1 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_1)
ADMIN_2 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_2)

# How to treat 0 values (in this case: "SET_0_TO_NA" converts 0 to NAs)
NA_TREATMENT <- config_json$SNT_CONFIG$NA_TREATMENT
DHIS2_INDICATORS <- names(config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS)  # indicators list

numerator_indicators  <- c("SUSP", "TEST", "CONF")  # Parameters in config
volume_activity_indicator <- "CONF"  # Parameters in config
fixed_cols <- c('PERIOD', 'YEAR', 'MONTH', 'ADM1_ID', 'ADM2_ID', 'OU_ID')
fixed_cols_rr <- c('YEAR', 'MONTH', 'ADM2_ID', 'REPORTING_RATE') # Fixed cols for exporting RR tables

### 1.3. 🔍 Check: at least 1 indicator must be selected
The user can toggle each indicator on or off, so we need to make sure at least one is ON. <br>
Indicator `CONF` is mandatory, but all indicators are displayed in the Run pipeline view because this is more intuitive.

In [None]:
if (length(numerator_indicators) == 0) {
    msg <- "[ERROR] Error: no indicator selected, cannot perform calculation of reporting rate method. Select at least one (e.g., `CONF`)."
    cat(msg)   
    stop(msg)
}

## 2. Load Data

### 2.1. Load routine data (DHIS2) 
already formatted & aggregated (output of pipeline XXX)

In [None]:
# DHIS2 Dataset identifier
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_OUTLIERS_IMPUTATION

# Load file from dataset
dhis2_routine <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, ROUTINE_FILE) }, 
                  error = function(e) {
                      msg <- paste("[ERROR] Error while loading DHIS2 routine data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})
dhis2_routine <- dhis2_routine %>% mutate(across(c(PERIOD, YEAR, MONTH), as.numeric)) # Ensure correct data type for numerical columns 

# log
log_msg(glue("DHIS2 routine file {ROUTINE_FILE} loaded from dataset : {dataset_name} dataframe dimensions: {paste(dim(dhis2_routine), collapse=', ')}"))
dim(dhis2_routine)
head(dhis2_routine, 2)

### 2.2. Load organisation units (DHIS2 pyramid)

In [None]:
# Load file from dataset
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED

dhis2_pyramid_formatted <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, paste0(COUNTRY_CODE, "_pyramid.parquet")) }, 
                error = function(e) {
                    msg <- paste("[ERROR] Error while loading DHIS2 pyramid FORMATTED data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                    cat(msg)
                    stop(msg)
})
    
msg <- paste0("DHIS2 pyramid FORMATTED data loaded from dataset : `", dataset_name, "`. Dataframe dimensions: ", paste(dim(dhis2_pyramid_formatted), collapse=", "))
log_msg(msg)
dim(dhis2_pyramid_formatted)
head(dhis2_pyramid_formatted,2)

### 2.3. 🔍 Check that the expected indicators are present in the routine data
Based on which indicator(s) are selected (if any)

In [None]:
if (!all(numerator_indicators %in% names(dhis2_routine))) {
        log_msg(glue("🚨 Warning: one or more of the following columns are missing from `dhis2_routine`: {paste(numerator_indicators, collapse = ', ')}"), "warning")
}

if (!(volume_activity_indicator %in% names(dhis2_routine))) {
    msg <- glue("[ERROR] Volume activity indicator {volume_activity_indicator} not present in the routine data. Process cannot continue.")
    cat(msg)
    stop(msg)
}

## 3. Reporting rates computations

In [None]:
# Define start and end period based on routine data 
PERIOD_START <- dhis2_routine$PERIOD %>% min()
PERIOD_END <- dhis2_routine$PERIOD %>% max()

period_vector <- format(seq(ym(PERIOD_START), ym(PERIOD_END), by = "month"), "%Y%m")
cat(glue("Start period: {PERIOD_START} end period: {PERIOD_END} periods count: {length(period_vector)}"))

#### 3.1. Build master table

In [None]:
log_msg(glue("Build master table with periods from {PERIOD_START} to {PERIOD_END} periods count: {length(period_vector)}"))

# Master table contains all period x organisation unit combinations
facility_master <- dhis2_pyramid_formatted %>%
    rename(
        OU_ID = glue::glue("LEVEL_{config_json$SNT_CONFIG$ANALYTICS_ORG_UNITS_LEVEL}_ID"),
        OU_NAME = glue::glue("LEVEL_{config_json$SNT_CONFIG$ANALYTICS_ORG_UNITS_LEVEL}_NAME"),
        ADM2_ID = str_replace(ADMIN_2, "NAME", "ID"),
        ADM2_NAME = all_of(ADMIN_2),
        ADM1_ID = str_replace(ADMIN_1, "NAME", "ID"),
        ADM1_NAME = all_of(ADMIN_1)
    ) %>%
    select(ADM1_ID, ADM1_NAME, ADM2_ID, ADM2_NAME, OU_ID, OU_NAME, OPENING_DATE, CLOSED_DATE) %>%
    distinct() %>%
    tidyr::crossing(PERIOD = period_vector) %>%
    mutate(PERIOD=as.numeric(PERIOD))
    

#### 3.2. Compute data element availability (numerator)

In [None]:
log_msg(glue("Computing data element availability based on selection: {paste(numerator_indicators, collapse=', ')}"))

# Join routine indicator values and flag availability based on the numerator selection
facility_master_routine <- facility_master %>% 
    left_join(dhis2_routine %>% select(OU_ID, PERIOD, all_of(DHIS2_INDICATORS)), 
              by = c("OU_ID", "PERIOD")) %>%
    mutate(
        YEAR = as.numeric(substr(PERIOD, 1, 4)),
        DE_AVAILABLE = ifelse(
            rowSums(!is.na(across(all_of(numerator_indicators))) & across(all_of(numerator_indicators)) > 0) > 0, 1, 0),        
        COUNT = 1 # Counting every facility
    )

dim(facility_master_routine)
head(facility_master_routine, 3)

#### 3.3. Compute health facility `OPEN` flag.
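
To make the rule concrete, here is a small base R sketch of the same logic on made-up facilities: a facility is treated as closed for a period if it opens after the first day of that month or closed before it, and as open otherwise (including when the dates are missing). The IDs, dates, and the `period_start` helper variable are assumptions used only for illustration.

```r
# Period under evaluation, as YYYYMM, and the first day of that month
period <- 202003
period_start <- as.Date(paste0(period, "01"), format = "%Y%m%d")

# Hypothetical facilities with opening and closing dates (NA = unknown)
facilities <- data.frame(
  OU_ID        = c("hf_a", "hf_b", "hf_c"),
  OPENING_DATE = as.Date(c("2015-01-01", "2020-06-15", NA)),
  CLOSED_DATE  = as.Date(c(NA, NA, "2019-12-31"))
)

# Not yet open: a known opening date later than the start of the period
not_yet_open <- !is.na(facilities$OPENING_DATE) & facilities$OPENING_DATE > period_start
# Already closed: a known closing date earlier than the start of the period
already_closed <- !is.na(facilities$CLOSED_DATE) & facilities$CLOSED_DATE < period_start

facilities$OPEN <- as.integer(!(not_yet_open | already_closed))
print(facilities)
```

With these values only `hf_a` is flagged as open for March 2020: `hf_b` opens later in the year and `hf_c` closed the year before.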

In [None]:
# Is the facility open for the period ? 
facility_master_routine <- facility_master_routine %>%
    mutate(            
        OPEN = ifelse(
          (!is.na(as.Date(OPENING_DATE)) & (as.Date(OPENING_DATE) > as.Date(ym(PERIOD)))) | 
          (!is.na(CLOSED_DATE) & (as.Date(CLOSED_DATE) < as.Date(ym(PERIOD)))),
          0, 1
        )
  )

#### 3.4. Compute health facility `IS_ACTIVE` flag

In [None]:
# Is the facility reporting data elements over the year ?
facility_master_routine_01 <- facility_master_routine %>%
    group_by(OU_ID, YEAR) %>%
    mutate(IS_ACTIVE = max(DE_AVAILABLE, na.rm = TRUE)) %>%  # compute per OU_ID × YEAR
    ungroup()

#### 3.5. Compute advanced weight (based on volume of activity)
Volume of activity = confirmed cases (parameterized by `volume_activity_indicator`)
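
The weighting can be illustrated on toy numbers. This sketch assumes the per-facility mean activity has already been computed (the IDs and values below are hypothetical) and uses base R rather than the dplyr pipeline of this notebook. It applies the same formula: the facility mean divided by the ADM2 total, multiplied by the number of facilities in the ADM2.

```r
# Hypothetical mean monthly confirmed cases per facility, in two ADM2 units
hf <- data.frame(
  ADM2_ID          = c("d1", "d1", "d1", "d2"),
  OU_ID            = c("hf_a", "hf_b", "hf_c", "hf_d"),
  MEAN_HF_ACTIVITY = c(10, 30, 20, 8)
)

# ADM2-level total activity and facility count, repeated on each facility row
total_adm2 <- ave(hf$MEAN_HF_ACTIVITY, hf$ADM2_ID, FUN = sum)
hf_count   <- ave(hf$MEAN_HF_ACTIVITY, hf$ADM2_ID, FUN = length)

# Weight = share of the ADM2 activity, rescaled by the number of facilities
hf$WEIGHT <- hf$MEAN_HF_ACTIVITY / total_adm2 * hf_count

print(hf)
```

Because of the rescaling, the weights average to 1 within each ADM2: a facility with above-average activity counts for more than one facility in the weighted reporting rate, and a low-activity facility for less.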

In [None]:
log_msg(glue("Computing volume of activity using indicator: '{volume_activity_indicator}'"))

# Compute HF and district level 'volume_activity_indicator'
mean_monthly_cases <- dhis2_routine %>% 
    select(ADM2_ID, OU_ID, !!sym(volume_activity_indicator)) %>% 
    group_by(ADM2_ID, OU_ID) %>% 
    summarise(MEAN_HF_ACTIVITY = mean(!!sym(volume_activity_indicator), na.rm=TRUE), .groups = "drop")

mean_monthly_cases_adm2 <- mean_monthly_cases %>% 
    select(ADM2_ID, MEAN_HF_ACTIVITY) %>% 
    group_by(ADM2_ID) %>% 
    summarise(TOTAL_MEAN_ADM2_ACTIVITY = sum(MEAN_HF_ACTIVITY, na.rm=TRUE), 
              HF_COUNT = n())

# Compute weights
hf_weights <- mean_monthly_cases %>% 
    left_join(mean_monthly_cases_adm2, by = "ADM2_ID") %>%
    mutate(WEIGHT = MEAN_HF_ACTIVITY / TOTAL_MEAN_ADM2_ACTIVITY * HF_COUNT)

# Join with rest of data
facility_master_routine_02 <- facility_master_routine_01 %>%
    left_join(hf_weights %>% select(OU_ID, WEIGHT), by = c("OU_ID"))

#### 3.6. Compute weighted availability

In [None]:
# Compute weighted DE_AVAILABLE 
log_msg(glue("Computing weighted availability."))

facility_master_routine_02$DE_AVAILABLE_W <- facility_master_routine_02$DE_AVAILABLE * facility_master_routine_02$WEIGHT
facility_master_routine_02$COUNT_W <- facility_master_routine_02$COUNT * facility_master_routine_02$WEIGHT   
facility_master_routine_02$OPEN_W <- facility_master_routine_02$OPEN * facility_master_routine_02$WEIGHT
facility_master_routine_02$IS_ACTIVE_W <- facility_master_routine_02$IS_ACTIVE * facility_master_routine_02$WEIGHT

dim(facility_master_routine_02)
head(facility_master_routine_02, 2)

#### 3.7. Aggregate data at ADM2 level

In [None]:
log_msg(glue("Aggregating data at admin level 2."))

reporting_rate_adm2 <- facility_master_routine_02 %>% 
    group_by(ADM1_ID, ADM1_NAME, ADM2_ID, ADM2_NAME, YEAR, PERIOD) %>%
    summarise(DE_AVAILABLE = sum(DE_AVAILABLE, na.rm = TRUE),
              HF = sum(COUNT, na.rm = TRUE),
              OPEN_HF = sum(OPEN, na.rm = TRUE),
              ACTIVE_HF = sum(IS_ACTIVE, na.rm = TRUE),
              DE_AVAILABLE_W = sum(DE_AVAILABLE_W, na.rm = TRUE),
              HF_W = sum(COUNT_W, na.rm = TRUE),
              OPEN_HF_W = sum(OPEN_W, na.rm = TRUE),
              ACTIVE_HF_W = sum(IS_ACTIVE_W, na.rm = TRUE), 
              .groups = "drop") %>%
      mutate(
        RR_TOTAL_HF = DE_AVAILABLE / HF,
        RR_OPEN_HF = DE_AVAILABLE / OPEN_HF,
        RR_ACTIVE_HF = DE_AVAILABLE / ACTIVE_HF,
        RR_TOTAL_HF_W = DE_AVAILABLE_W / HF_W,
        RR_OPEN_HF_W = DE_AVAILABLE_W / OPEN_HF_W,
        RR_ACTIVE_HF_W = DE_AVAILABLE_W / ACTIVE_HF_W
      )

dim(reporting_rate_adm2)
head(reporting_rate_adm2, 2)

In [None]:
# quick check
# head(reporting_rate_adm2 %>% filter(ADM2_ID=="CyHuq664hCU", PERIOD==201801))

## 4. 📁 Export to `data/` folder

### 4.1. Select results and format

In [None]:
# Pick the weighted reporting rate column that matches the selected denominator option
if (DATAELEMENT_METHOD_DENOMINATOR == "ROUTINE_ACTIVE_FACILITIES") { 
    rr_column_selection <- "RR_ACTIVE_HF_W"
} else {
    rr_column_selection <- "RR_OPEN_HF_W"
}

In [None]:
log_msg(glue("Formatting table for '{DATAELEMENT_METHOD_DENOMINATOR}' selection."))

# Select column and format final table
reporting_rate_dataelement <- reporting_rate_adm2 %>%
    mutate(MONTH = PERIOD %% 100) %>%
    rename(REPORTING_RATE = !!sym(rr_column_selection)) %>%
    select(all_of(fixed_cols_rr))

print(dim(reporting_rate_dataelement))
head(reporting_rate_dataelement, 3)

### 4.2. Write files

In [None]:
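output_data_path <- file.path(DATA_PATH, "reporting_rate")
dir.create(output_data_path, recursive = TRUE, showWarnings = FALSE) # make sure the output folder exists before writing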

# parquet
file_path <- file.path(output_data_path, paste0(COUNTRY_CODE, "_reporting_rate_dataelement.parquet"))
write_parquet(reporting_rate_dataelement, file_path)
log_msg(glue("Exported : {file_path}"))

In [None]:
# csv
file_path <- file.path(output_data_path, paste0(COUNTRY_CODE, "_reporting_rate_dataelement.csv"))
write.csv(reporting_rate_dataelement, file_path, row.names = FALSE)
log_msg(glue("Exported : {file_path}"))