Script structure:

  0. Parameters: set back-up values for parameters, for when the notebook is run manually (_not_ via pipeline)
  1. Setup:
        * Paths
        * Utils functions
  2. Load Data
        * **Routine data** (DHIS2) already formatted & aggregated (output of pipeline XXX)
        * **Reporting** (DHIS2) pre-computed, already formatted & aggregated (output of pipeline ???)
        * **Shapes** (DHIS2) for plotting (this could be removed if we move the plots to "report/EDA" nb)
  3. Calculate **Reporting Rate (RR)**
        * "**Dataset**": using pre-computed reporting data from DHIS2/SNIS (was: "DHIS2")
        * "**Data Element**": using the calculated expected nr of reports (nr of active facilities) (was: "CONF")
        * <s>"ANY" (based on old code - BFA)</s> 🤫
  4. **Export** reporting rate data to **Datasets** as .csv and .parquet files

-------------------
**Naming harmonization to improve code readability**:

**Reporting Rate** data frames, based on different **methods**:
* follow this structure: `reporting_rate_<method>`. So:
    * **Dataset** (using pre-computed reporting) : `reporting_rate_dataset`
    * **Data Element** (Diallo 2025) : `reporting_rate_dataelement`
        * **Note**: when exported, the file name carries also the info for the choice of **n**umerator and **d**enominator<br>
          Example: COD_reporting_rate_dataelement-**n**1-**d**1.csv
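
The naming rule above can be sketched as a small function. This is only an illustration: `build_rr_file_name()` is a hypothetical helper (it is not part of `snt_utils.r`), and the `n1`/`n2` and `d1`/`d2` codes are assumed to follow the mapping used in the export section of this notebook.

```r
# Hypothetical helper (NOT part of `snt_utils.r`): assembles the export file name
build_rr_file_name <- function(country_code, method,
                               numerator = NULL, denominator = NULL, ext = "csv") {
  # Abbreviations mirror the ones assigned in the export section of this notebook
  num_codes <- c("CONF" = "n1", "CONF|SUSP|TEST" = "n2")
  den_codes <- c("DHIS2_EXPECTED_REPORTS" = "d1", "ACTIVE_FACILITIES" = "d2")

  base_name <- paste0(country_code, "_reporting_rate_", method)

  # Only the "dataelement" method carries the numerator/denominator suffix
  if (method == "dataelement") {
    base_name <- paste0(base_name, "-", num_codes[[numerator]], "-", den_codes[[denominator]])
  }

  paste0(base_name, ".", ext)
}

name_dataelement <- build_rr_file_name("COD", "dataelement", "CONF", "DHIS2_EXPECTED_REPORTS")
name_dataset <- build_rr_file_name("COD", "dataset", ext = "parquet")
print(name_dataelement)
print(name_dataset)
```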

--------------------

🚧 **Notes / possible ToDo's**:
* Consider removing the **yearly** reporting rate calculations (`reporting_rate_*_year`), as they do not seem to be used anywhere except for the reporting/comparison/data quality check. It may therefore be better to keep them only in the reporting nb (and delete them from the main nb).

--------------------

## Parameters

Set Default values **if _not_ provided by pipeline**<br>
This makes the execution flexible and "safe": nb can be run manually from here or be executed via pipeline, without having to change anything in the code!

In [None]:
# Set BACKUP VALUE: root path - NEVER CHANGE THIS!
if (!exists("SNT_ROOT_PATH")) {
  SNT_ROOT_PATH <- "/home/hexa/workspace" 
}

# Run Dataset rep rate? <-- 🚨 Must run anyway if `DATAELEMENT_METHOD_DENOMINATOR == "DHIS2_EXPECTED_REPORTS"`!
RUN_DATASET <- TRUE # FALSE  # ⚠️ Just placeholder, not used yet! ⚠️

# Run Data Element rep rate?
RUN_DATAELEMENT <- TRUE # FALSE # ⚠️ Just placeholder, not used yet! ⚠️

# Data Element RR. Choice: which INDICATORS to use to count the nr of reporting facilities
if (!exists("DATAELEMENT_METHOD_NUMERATOR")) {
  DATAELEMENT_METHOD_NUMERATOR <- "CONF" # "CONF|SUSP|TEST"
}

# Data Element RR. Choice: which df to use for nr of `EXPECTED_REPORTS` (DENOMINATOR)
if (!exists("DATAELEMENT_METHOD_DENOMINATOR")) {
  DATAELEMENT_METHOD_DENOMINATOR <- "DHIS2_EXPECTED_REPORTS" # "ACTIVE_FACILITIES"
}


## 1. Setup

### 1.1. Paths

In [None]:
# PROJECT PATHS
CODE_PATH <- file.path(SNT_ROOT_PATH, 'code') # this is where we store snt_utils.r
CONFIG_PATH <- file.path(SNT_ROOT_PATH, 'configuration') # .json config file
DATA_PATH <- file.path(SNT_ROOT_PATH, 'data', 'dhis2')  

### 1.2. Utils functions

In [None]:
source(file.path(CODE_PATH, "snt_utils.r"))

### 1.3. Packages

In [None]:
# List required pcks  ---------------->  check  what are the really required libraries
required_packages <- c("arrow", # for .parquet
                       "tidyverse",
                       "stringi", 
                       "jsonlite", 
                       "httr", 
                       "reticulate")

# Execute function
install_and_load(required_packages)

### 1.3.1. OpenHEXA-specific settings

#### For 📦{sf}, tell OH where to find stuff ...

In [None]:
Sys.setenv(PROJ_LIB = "/opt/conda/share/proj")
Sys.setenv(GDAL_DATA = "/opt/conda/share/gdal")

#### Set environment to load openhexa.sdk from the right path

In [None]:
# Set environment to load openhexa.sdk from the right path
Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python")
reticulate::py_config()$python
openhexa <- import("openhexa.sdk")

### 1.4. Load and check `config` file

In [None]:
# Load SNT config

config_file_name <- "SNT_config.json" 
config_json <- tryCatch({
        jsonlite::fromJSON(file.path(CONFIG_PATH, config_file_name)) 
    },
    error = function(e) {
        msg <- paste0("Error while loading configuration: ", conditionMessage(e))
        cat(msg)   
        stop(msg) 
    })

msg <- paste0("SNT configuration loaded from : ", file.path(CONFIG_PATH, config_file_name))
log_msg(msg)

**Save config fields as variables**

In [None]:
# Generic
COUNTRY_CODE <- config_json$SNT_CONFIG$COUNTRY_CODE
ADMIN_1 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_1)
ADMIN_2 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_2)

# How to treat 0 values (in this case: "SET_0_TO_NA" converts 0 to NAs)
NA_TREATMENT <- config_json$SNT_CONFIG$NA_TREATMENT

# Which (aggregated) indicators to use to evaluate "activity" of an HF - for Reporting Rate method "Ousmane"
DHIS2_INDICATORS <- names(config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS)

# Which reporting rate PRODUCT_UID to use (note that this is a dataset in COD, but 2 dataElements in BFA!)
REPORTING_RATE_PRODUCT_ID <- config_json$SNT_CONFIG$REPORTING_RATE_PRODUCT_UID

In [None]:
# DHIS2_INDICATORS
log_msg(paste("Expecting the following DHIS2 (aggregated) indicators : ", paste(DHIS2_INDICATORS, collapse=", ")))

In [None]:
# Fixed  cols for routine data formatting 
fixed_cols <- c('OU_ID','PERIOD', 'YEAR', 'MONTH', 'ADM1_ID', 'ADM2_ID') # (OU_NAME has homonymous values!)
# print(paste("Fixed routine data (`dhis2_routine`) columns (always expected): ", paste(fixed_cols, collapse=", ")))
log_msg(paste("Expecting the following columns from routine data (`dhis2_routine`) : ", paste(fixed_cols, collapse=", ")))

In [None]:
# Fixed cols for exporting RR tables: to export output tables with consistent structure
fixed_cols_rr <- c('YEAR', 'MONTH', 'ADM2_ID', 'REPORTING_RATE') 

## 2. Load Data

### 2.1. **Routine** data (DHIS2) 
already formatted & aggregated (output of pipeline XXX)

In [None]:
# DHIS2 Dataset extract identifier
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED

# Load file from dataset
dhis2_routine <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, paste0(COUNTRY_CODE, "_routine.parquet")) }, 
                  error = function(e) {
                      msg <- paste("Error while loading DHIS2 routine data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

msg <- paste0("DHIS2 routine data loaded from dataset : ", dataset_name, " dataframe dimensions: ", paste(dim(dhis2_routine), collapse=", "))
log_msg(msg)

In [None]:
# Ensure correct data type for numerical columns 
dhis2_routine <- dhis2_routine %>%
    mutate(across(c(PERIOD, YEAR, MONTH), as.numeric))

In [None]:
head(dhis2_routine, 3)

#### 🔍 Check expected cols for method **Data Element**, numerator using multiple indicators.
Only when: `DATAELEMENT_METHOD_NUMERATOR == "CONF|SUSP|TEST"`

In [None]:
# 'Since method `DATAELEMENT_METHOD_NUMERATOR == "CONF|SUSP|TEST"` was selected, expecting the following cols in routine data: `CONF`, `SUSP`, `TEST`.'

if (DATAELEMENT_METHOD_NUMERATOR == "CONF|SUSP|TEST") {
    log_msg('Since method `DATAELEMENT_METHOD_NUMERATOR == "CONF|SUSP|TEST"` was selected, expecting the following cols in routine data: `CONF`, `SUSP`, `TEST`.')

    expected_col <- c("CONF", "SUSP", "TEST")
    if (!all(expected_col %in% names(dhis2_routine))) {
        log_msg(paste0("🚨 Warning: one or more of the following columns are missing from `dhis2_routine`: ", paste(expected_col, collapse = ", "), "."), "warning")
    } else {
        log_msg("✅ All expected columns are present in `dhis2_routine` data.")
    }

}

### 2.2. **Reporting** pre-computed from DHIS2 
Data granularity:
* **ADM2**
* **MONTH** (PERIOD)

Note: data comes from different datasets (`PRODUCT_NAME`): `A SERVICES DE BASE`, `B SERVICES SECONDAIRES`, `D SERVICE HOPITAL`

The col `PRODUCT_METRIC` indicates whether the `VALUE` is `EXPECTED_REPORTS` or `ACTUAL_REPORTS`

In [None]:
# DHIS2 Dataset extract identifier
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED
file_name <- paste0(COUNTRY_CODE, "_reporting.parquet")

# Load file from dataset
dhis2_reporting <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, file_name) }, 
                  error = function(e) {
                      msg <- paste("Error while loading DHIS2 pre-computed REPORTING data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

msg <- paste0("DHIS2 pre-computed REPORTING data loaded from file `", file_name, "` (from dataset : `", dataset_name, "`). Dataframe dimensions: ", paste(dim(dhis2_reporting), collapse=", "))
log_msg(msg)

In [None]:
# Convert VALUE col to <dbl> - should not be needed but keep as safety measure 
dhis2_reporting <- dhis2_reporting |>
mutate(across(c(PERIOD, YEAR, MONTH, VALUE), as.numeric))

In [None]:
head(dhis2_reporting, 3)

#### 2.2.1. **Filter** to keep only values for `PRODUCT_UID` defined in config.json

In [None]:
REPORTING_RATE_PRODUCT_ID

In [None]:
# Handle problems with incorrect configuration - to be improved 🚧
if (is.null(REPORTING_RATE_PRODUCT_ID)) {
    log_msg("🛑 Problem with definition of REPORTING_RATE_PRODUCT_ID, check `SNT_config.json` file!")
} else {
    # Braces are required here: without them only the first statement belongs to `else`
    product_name <- dhis2_reporting |> filter(PRODUCT_UID %in% REPORTING_RATE_PRODUCT_ID) |> pull(PRODUCT_NAME) |> unique()
    log_msg(glue::glue("Using REPORTING_RATE_PRODUCT_ID == `{paste(REPORTING_RATE_PRODUCT_ID, collapse = ', ')}`, corresponding to DHIS2 Product name : `{paste(product_name, collapse = ', ')}`."))
}

In [None]:
dhis2_reporting_filtered <- dhis2_reporting |>
# filter(PRODUCT_NAME == REPORTING_DS_NAME) |>
filter(PRODUCT_UID %in% REPORTING_RATE_PRODUCT_ID) |>
select(-PRODUCT_UID, -PRODUCT_NAME) # useless cols now

print(dim(dhis2_reporting_filtered))
head(dhis2_reporting_filtered)

#### 2.2.2. Format to produce `dhis2_reporting_expected`
🚨 Note: Use `dhis2_reporting_expected$EXPECTED_REPORTS` as new denominator for REPORTING_RATE calculations (methods ANY and CONF)

In [None]:
dhis2_reporting_wide <- dhis2_reporting_filtered |> 
pivot_wider(
    names_from = PRODUCT_METRIC, 
    values_from = VALUE
)

print(dim(dhis2_reporting_wide))
head(dhis2_reporting_wide)

In [None]:
# Use `dhis2_reporting_expected$EXPECTED_REPORTS` as new denominator for RR calculations (methods ANY and CONF)

dhis2_reporting_expected <- dhis2_reporting_wide |> 
select(-ACTUAL_REPORTS)

print(dim(dhis2_reporting_expected))
head(dhis2_reporting_expected)

#### 2.2.3. **Checks** on data completeness: _do **periods match** with routine data?_
A lack of perfect overlap in periods between routine data and reporting rate data might create headaches downstream!<br>
Specifically, **incidence** calculations will show **N2 smaller than N1** due to **aggregation by YEAR when NA** values are present!
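
A toy illustration of why the overlap matters, using made-up values and base R only: a month that exists in the routine data but not in the reporting data ends up with an `NA` denominator after a left join, so no reporting rate can be computed for it. The data frames `toy_routine_periods` and `toy_expected` are hypothetical stand-ins for the real tables.

```r
# Made-up periods: routine data covers 3 months, reporting data only 2
toy_routine_periods <- data.frame(YEAR = 2023L, MONTH = 1:3)
toy_expected <- data.frame(YEAR = 2023L, MONTH = 1:2, EXPECTED_REPORTS = c(10, 10))

# Left join (all.x = TRUE keeps every routine period)
toy_joined <- merge(toy_routine_periods, toy_expected,
                    by = c("YEAR", "MONTH"), all.x = TRUE)

# Periods left without a denominator
toy_unmatched <- toy_joined[is.na(toy_joined$EXPECTED_REPORTS), c("YEAR", "MONTH")]
print(toy_unmatched)
```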

In [None]:
# --- Check Year Compatibility ---
routine_years <- sort(unique(as.integer(dhis2_routine$YEAR))) # as.integer
expected_years <- sort(unique(as.integer(dhis2_reporting_expected$YEAR))) # as.integer

if (!setequal(routine_years, expected_years)) {
  missing_in_routine <- setdiff(expected_years, routine_years)
  missing_in_expected <- setdiff(routine_years, expected_years)

  if (length(missing_in_routine) > 0) {
    log_msg(paste0("🚨 Warning: YEAR value(s) present in 'dhis2_reporting_expected' but not in 'dhis2_routine': ",
                   paste(missing_in_routine, collapse = ", ")))
  }
  if (length(missing_in_expected) > 0) {
    log_msg(paste0("🚨 Warning: YEAR value(s) present in 'dhis2_routine' but not in 'dhis2_reporting_expected': ",
                   paste(missing_in_expected, collapse = ", ")))
  }
} else {
  log_msg("✅ YEAR values are consistent across 'dhis2_routine' and 'dhis2_reporting_expected'.")

  # --- Check Month Compatibility (if years are consistent) ---
  all_years <- unique(routine_years) # Or expected_years, they are the same now

  for (year_val in all_years) {
    routine_months_for_year <- dhis2_routine %>%
      filter(YEAR == year_val) %>%
      pull(MONTH) %>%
      unique() %>%
      sort()

    expected_months_for_year <- dhis2_reporting_expected %>%
      filter(YEAR == year_val) %>%
      pull(MONTH) %>%
      unique() %>%
      sort()

    if (!setequal(routine_months_for_year, expected_months_for_year)) {
      missing_in_routine_months <- setdiff(expected_months_for_year, routine_months_for_year)
      missing_in_expected_months <- setdiff(routine_months_for_year, expected_months_for_year)

      if (length(missing_in_routine_months) > 0) {
        log_msg(paste0("🚨 Warning: for YEAR ", year_val, ", MONTH value(s) '", paste(missing_in_routine_months, collapse = ", "),
                       "' present in 'dhis2_reporting_expected' but not in 'dhis2_routine'!"
                       ))
      }
      if (length(missing_in_expected_months) > 0) {
        log_msg(paste0("🚨 Warning: for YEAR ", year_val, ", MONTH value(s) '", paste(missing_in_expected_months, collapse = ", "), 
                       "' present in 'dhis2_routine' but not in 'dhis2_reporting_expected'!"
                       ))
      }
    } else {
      log_msg(paste0("✅ For year ", year_val, ", months are consistent across both data frames."))
    }
  }
}

### 2.3. **Shapes** for plotting maps (choropleths)

In [None]:
# DHIS2 Dataset extract identifier
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED

# Load file from dataset
dhis2_shapes <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, paste0(COUNTRY_CODE, "_shapes.geojson")) }, 
                  error = function(e) {
                      msg <- paste("Error while loading DHIS2 shapes data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

msg <- paste0("DHIS2 shapes data loaded from dataset : `", dataset_name, "`. Dataframe dimensions: ", paste(dim(dhis2_shapes), collapse=", "))
log_msg(msg)

In [None]:
# `head()` cannot display, needs ‘geojsonio’ (which I cannot install) so let's just check col names ... 
names(dhis2_shapes)

## 3. Calculate **Reporting Rate** (RR)
We compute it using 2 approaches; the user can decide later which one to use for incidence adjustment.

### 3.1. "**Dataset**" reporting rate: pre-computed, from **DHIS2**
Extracted from DHIS2 and formatted.

Straightforward: `ACTUAL_REPORTS` / `EXPECTED_REPORTS` (just pivot `PRODUCT_METRIC` and divide)
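
As a minimal sketch of the "pivot and divide" idea, here is the same operation on a toy long-format table. The values are made up, and the column names are assumed to match the ones used in this notebook (`PRODUCT_METRIC`, `VALUE`).

```r
library(dplyr)
library(tidyr)

# Toy long-format reporting data (made-up values): one row per district and metric
toy_reporting <- data.frame(
  ADM2_ID = c("A", "A", "B", "B"),
  YEAR = 2023L,
  MONTH = 1L,
  PRODUCT_METRIC = c("ACTUAL_REPORTS", "EXPECTED_REPORTS",
                     "ACTUAL_REPORTS", "EXPECTED_REPORTS"),
  VALUE = c(8, 10, 12, 10)
)

# Pivot the metric into two columns, then divide
toy_rate <- toy_reporting |>
  pivot_wider(names_from = PRODUCT_METRIC, values_from = VALUE) |>
  mutate(REPORTING_RATE = ACTUAL_REPORTS / EXPECTED_REPORTS)

print(toy_rate)
```

District `B` has more actual than expected reports, so its rate exceeds 1. This is the situation the quality check in this section looks for.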

In [None]:
reporting_rate_dataset <- dhis2_reporting_wide |> 
mutate(REPORTING_RATE = ACTUAL_REPORTS / EXPECTED_REPORTS)

print(dim(reporting_rate_dataset))
head(reporting_rate_dataset, 3)

#### Quick data quality check 🔍

In [None]:
# --- Define function ---------------------------
inspect_reporting_rate <- function(data_tibble) {

  # Dynamically get the name of the tibble passed to the function
  # This extracts the literal name of the variable passed (e.g., "reporting_rate_dhis2_month")
  tibble_name_full <- deparse(substitute(data_tibble))

  # Extract the 'method' part from the tibble name
  method <- stringr::str_extract(tibble_name_full, "(?<=reporting_rate_).*") # "(?<=reporting_rate_).*?(?=_month)"

  # Calculations for proportion of values > 1
  values_greater_than_1 <- sum(data_tibble$REPORTING_RATE > 1, na.rm = TRUE)
  total_values <- length(data_tibble$REPORTING_RATE)

  if (total_values > 0) {
    proportion <- values_greater_than_1 / total_values * 100
    min_rate <- min(data_tibble$REPORTING_RATE, na.rm = TRUE)
    max_rate <- max(data_tibble$REPORTING_RATE, na.rm = TRUE)
  } else {
    proportion <- 0
    min_rate <- NA # Set to NA if no values to calculate min/max
    max_rate <- NA # Set to NA if no values to calculate min/max
  }

  if (proportion == 0) {
      clarification = NULL
  } else {
      clarification = " (there are more reports than expected)"
  }

  # Print the formatted result
  log_msg(
    paste0(
      "🔍 For reporting rate method : `", method, "`, the values of REPORTING_RATE range from ", round(min_rate, 2),
      " to ", round(max_rate, 2),
      ", and ", round(proportion, 2), " % of values are >1", clarification, "."
    )
  )

  # Histogram
  hist(data_tibble$REPORTING_RATE, 
     breaks = 50)
}

In [None]:
inspect_reporting_rate(reporting_rate_dataset)

#### Subset cols

In [None]:
reporting_rate_dataset <- reporting_rate_dataset |> 
select(all_of(fixed_cols_rr))

dim(reporting_rate_dataset)
head(reporting_rate_dataset, 3)

----------------------------

### 3.2. Method **Data Element** reporting rate: based on reporting of one or more indicators
**_Partially_ following methods by WHO and as per Diallo (2025) paper**

To accurately measure data completeness, we calculate the **monthly** reporting rate per **ADM2**, as the **proportion** of **facilities** (HF or `OU_*`) that in a given month submitted data for either a single indicator (i.e., **confirmed** malaria case as `CONF`) or for _any_ of the chosen indicators (i.e., `CONF`, `SUSP`, `TEST`). 

Basically, the number of facilities reporting on a given indicator (1 or more), over the total number of facilities (within the same `ADM2_ID`).<br>

This method lets you **choose** how to calculate both the **numerator** and the **denominator**.<br>
Specifically:
* Choice of **Numerator** depends on the parameter `DATAELEMENT_METHOD_NUMERATOR`, with options:
    * `== "CONF"`: uses a **single** indicator (only look at submissions for `CONF`; confirmed malaria cases)
    * `== "CONF|SUSP|TEST"` : uses **multiple** indicators (look at submissions across `CONF`, `SUSP`, and `TEST`).<br>
      Note: in the latter, a facility (OU_ID) is **counted as "reporting"** if it **submitted data for _any_ of these indicators**.
* Choice of **Denominator** depends on the parameter `DATAELEMENT_METHOD_DENOMINATOR`, with options:
    * `== "DHIS2_EXPECTED_REPORTS"`: uses the col `EXPECTED_REPORTS` from the df `dhis2_reporting_expected` (which is obtained directly from DHIS2, and is the same denominator used to calculate the "Dataset" reporting rate)
    * `== "ACTIVE_FACILITIES"`: uses the col `EXPECTED_REPORTS` from the df `active_facilities`. This is calculated as the number of facilities (OU_ID) that submitted _any_ data at least once in a given year, across _all_ indicators extracted in `dhis2_routine` (all aggregated indicators as defined in the SNT_config.json file, see: `config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS`)

<br>

This method improves over simple binary completeness flags by accounting for both spatial (facility coverage) and temporal (monthly timeliness) dimensions. <br>

We use the presence of `CONF` data (confirmed malaria cases) because it is a core indicator consistently tracked across the dataset. This choice ensures alignment with the structure of the incidence calculation, which is also mainly based on confirmed cases.
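
To make the effect of the numerator choice concrete, here is a small sketch on made-up data: three facilities in one district and one month, with an assumed denominator of 3 expected reports. The names `toy_routine`, `toy_expected_reports`, `RR_CONF` and `RR_ANY` are illustrative only and do not exist elsewhere in this notebook.

```r
library(dplyr)

# Toy routine data (made-up): hf1 reports everything, hf2 only SUSP, hf3 nothing
toy_routine <- data.frame(
  ADM2_ID = "D1", YEAR = 2023L, MONTH = 1L,
  OU_ID = c("hf1", "hf2", "hf3"),
  CONF = c(5, NA, NA),
  SUSP = c(9, 4, NA),
  TEST = c(9, NA, NA)
)
toy_expected_reports <- 3  # assumed denominator for this district-month

toy_rr <- toy_routine |>
  mutate(
    ACTIVE_CONF = !is.na(CONF),                                # numerator "CONF"
    ACTIVE_ANY  = !is.na(CONF) | !is.na(SUSP) | !is.na(TEST)   # numerator "CONF|SUSP|TEST"
  ) |>
  group_by(ADM2_ID, YEAR, MONTH) |>
  summarise(
    RR_CONF = sum(ACTIVE_CONF) / toy_expected_reports,
    RR_ANY  = sum(ACTIVE_ANY) / toy_expected_reports,
    .groups = "drop"
  )

print(toy_rr)
```

With the single-indicator numerator only `hf1` counts as reporting, while the multi-indicator numerator also counts `hf2`, so the second rate can only be equal to or higher than the first.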

### Calculate the **numerator**

In [None]:
# DATAELEMENT_METHOD_NUMERATOR <- "CONF" 
# DATAELEMENT_METHOD_NUMERATOR <- "CONF|SUSP|TEST" 
DATAELEMENT_METHOD_NUMERATOR

**Note**: the col `ACTIVE` keeps the same name regardless of the value of `DATAELEMENT_METHOD_NUMERATOR`, so that
the code needs to be parametrized only once (here).


In [None]:
# Mark as "ACTIVE": only if `CONF` OR `CONF | SUSP | TEST` are not NA (= the HF reported some data)

if (DATAELEMENT_METHOD_NUMERATOR == "CONF") {
    dhis2_routine_active <- dhis2_routine %>%
        mutate(ACTIVE = if_else(!is.na(CONF), 1, 0))
    log_msg("Evaluating reporting facilities based only on indicator `CONF`.")
} else if (DATAELEMENT_METHOD_NUMERATOR == "CONF|SUSP|TEST") {
    dhis2_routine_active <- dhis2_routine %>%
        mutate(ACTIVE = if_else(!is.na(CONF) | !is.na(SUSP) | !is.na(TEST), 1, 0))
    log_msg("Evaluating reporting facilities based on indicators: `CONF`, `SUSP`, and `TEST`.")
} else {
    # Fail early: otherwise `dhis2_routine_active` would not exist for the next steps
    stop(paste0("Unsupported value for DATAELEMENT_METHOD_NUMERATOR: ", DATAELEMENT_METHOD_NUMERATOR))
}

dim(dhis2_routine_active)
head(dhis2_routine_active, 3)

In [None]:
# --- 1.  Calculate `SUBMITTED_REPORTS` as the nr of ACTIVE facilities (that REPORTED, each month) ------------------------

dhis2_routine_submitted <- dhis2_routine_active %>% # OLD: dhis2_routine_reporting_month <- dhis2_routine_reporting %>%
  group_by(ADM2_ID, YEAR, MONTH) %>% 
  summarise(
    SUBMITTED_REPORTS = sum(ACTIVE, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ungroup() %>%  
    mutate(YEAR = as.integer(YEAR),
           MONTH = as.integer(MONTH)
          ) 

print(dim(dhis2_routine_submitted))
head(dhis2_routine_submitted, 3)

### Calculate the **denominator**
This is to be used **only when** `DATAELEMENT_METHOD_DENOMINATOR ==`**`ACTIVE_FACILITIES`** 
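
A minimal sketch of the `ACTIVE_FACILITIES` idea on made-up data: a facility counts towards the yearly denominator if it has at least one non-`NA` value for any indicator in that year. `toy_routine` and `toy_indicators` are hypothetical; `TEST` is deliberately absent from the toy data to show that `any_of()` tolerates indicators that are configured but not extracted.

```r
library(dplyr)

# Toy routine data (made-up): hf3 never reports anything in 2023
toy_routine <- data.frame(
  ADM2_ID = "D1",
  YEAR = 2023L,
  MONTH = c(1L, 2L, 1L, 2L, 1L, 2L),
  OU_ID = c("hf1", "hf1", "hf2", "hf2", "hf3", "hf3"),
  CONF = c(5, 3, NA, NA, NA, NA),
  SUSP = c(9, NA, NA, 4, NA, NA)
)
toy_indicators <- c("CONF", "SUSP", "TEST")  # `TEST` is not a column of the toy data

toy_active_facilities <- toy_routine |>
  # Keep only rows where at least one indicator has a non-NA value
  filter(if_any(any_of(toy_indicators), ~ !is.na(.x))) |>
  group_by(YEAR, ADM2_ID) |>
  summarise(EXPECTED_REPORTS = n_distinct(OU_ID), .groups = "drop")

print(toy_active_facilities)
```

Here `hf1` and `hf2` each reported at least once, so the yearly denominator is 2, and the same value applies to every month of that year.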

In [None]:
# DATAELEMENT_METHOD_DENOMINATOR <- "DHIS2_EXPECTED_REPORTS" 
# DATAELEMENT_METHOD_DENOMINATOR <- "ACTIVE_FACILITIES"
DATAELEMENT_METHOD_DENOMINATOR

In [None]:
# Calculate the tot nr of facilities (distinct OU_ID) based on all HF that appear in the routine data (each YEAR)
# meaning: regardless of what indicators they submit data for, as long as they have submitted something

if (DATAELEMENT_METHOD_DENOMINATOR == "ACTIVE_FACILITIES") {
    active_facilities <- dhis2_routine %>%
    # Keep only rows where at least one indicator has non-NA value
    filter(if_any(any_of(DHIS2_INDICATORS), ~ !is.na(.))) %>%
    group_by(YEAR, ADM2_ID) %>%
    summarize(
      EXPECTED_REPORTS = n_distinct(OU_ID),
      .groups = "drop" # Optional: Removes the grouping structure from the final output
    )

    nr_of_rows <- nrow(active_facilities)
    log_msg(glue::glue("Produced df `active_facilities`, with column `EXPECTED_REPORTS` calculated from DHIS2 routine data. Dataframe `active_facilities` has {nr_of_rows} rows."))

    head(active_facilities, 3)
    
} else print("NOT calculating `active_facilities` as not needed ... ")


### Calculate **Reporting Rate** 

**Join df for Denominator**

**Note**<br>
in both df's (`dhis2_reporting_expected` OR `active_facilities`) the col `EXPECTED_REPORTS` has the same name to simplify parametrization: only difference between the 2 options is the df to be joined (right element in `left_join()`)
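
The difference between the two joins is mostly in the keys: the DHIS2 denominator is monthly (joined on `ADM2_ID`, `YEAR`, `MONTH`), whereas the calculated one is yearly (joined on `ADM2_ID`, `YEAR`) and is therefore repeated for every month. A toy sketch of the yearly case, with made-up values and hypothetical `toy_*` names:

```r
library(dplyr)

# Monthly numerator (made-up values)
toy_submitted <- data.frame(
  ADM2_ID = "D1", YEAR = 2023L, MONTH = 1:3,
  SUBMITTED_REPORTS = c(2, 1, 2)
)

# Yearly denominator: one row per district and year, no MONTH column
toy_active <- data.frame(ADM2_ID = "D1", YEAR = 2023L, EXPECTED_REPORTS = 2)

# Joining without MONTH repeats the yearly denominator on each monthly row
toy_joined <- left_join(toy_submitted, toy_active, by = join_by(ADM2_ID, YEAR)) |>
  mutate(REPORTING_RATE = SUBMITTED_REPORTS / EXPECTED_REPORTS)

print(toy_joined)
```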

In [None]:
# --- 2. Join `dhis2_reporting_expected` OR `active_facilities` to add `EXPECTED_REPORTS` ------------------------------------------------

# Parametrized based on DATAELEMENT_METHOD_DENOMINATOR: left_join() the respective df
if (DATAELEMENT_METHOD_DENOMINATOR == "DHIS2_EXPECTED_REPORTS") {
    # Add df of rep rate extracted directly from DHIS2
    dhis2_routine_submitted_expected <- left_join(
    dhis2_routine_submitted, 
    dhis2_reporting_expected |> select(ADM2_ID, YEAR, MONTH, EXPECTED_REPORTS), # `dhis2_reporting_expected`
    by = join_by(ADM2_ID, YEAR, MONTH)
    ) 

    log_msg("Calculating `Data Element` reporting rate, using as denominator `EXPECTED_REPORTS` from DHIS2.")
    
} else if (DATAELEMENT_METHOD_DENOMINATOR == "ACTIVE_FACILITIES") {
    # Add df of rep rate CALCULATED based on submissions in dhis2 routine data
    dhis2_routine_submitted_expected <- left_join(
    dhis2_routine_submitted, 
    active_facilities, # has only cols: `YEAR`, `ADM2_ID`, `EXPECTED_REPORTS`
    by = join_by(ADM2_ID, YEAR) #, MONTH)
    ) 

    log_msg("Calculating `Data Element` reporting rate, using as denominator `EXPECTED_REPORTS` as CALCULATED from routine data.")
}

# Safety measures ...
dhis2_routine_submitted_expected <- dhis2_routine_submitted_expected|>
  # ungroup() %>%  
  mutate(YEAR = as.integer(YEAR),
         MONTH = as.integer(MONTH)
          ) 


print(dim(dhis2_routine_submitted_expected))
head(dhis2_routine_submitted_expected, 3)

In [None]:
# --- 3. Calculate `REPORTING_RATE` ------------------------------------------------
reporting_rate_dataelement <- dhis2_routine_submitted_expected |>
mutate(
    REPORTING_RATE = SUBMITTED_REPORTS / EXPECTED_REPORTS
  ) 

dim(reporting_rate_dataelement)
head(reporting_rate_dataelement, 3)

#### Quick data quality check 🔍

In [None]:
# inspect_reporting_rate(reporting_rate_conf_month)
inspect_reporting_rate(reporting_rate_dataelement)

#### Subset cols

In [None]:
# reporting_rate_conf_month <- reporting_rate_conf_month |> 
reporting_rate_dataelement <- reporting_rate_dataelement |> 
select(all_of(fixed_cols_rr))

head(reporting_rate_dataelement, 3)

#### Plot by MONTH (heatmap)

In [None]:
# Plot reporting rate heatmap
options(repr.plot.width = 20, repr.plot.height = 10) 

# reporting_rate_conf_month %>%
reporting_rate_dataelement %>%
mutate(
    DATE = as.Date(paste0(YEAR, "-", MONTH, "-01"))
    ) %>%
ggplot(., aes(x = DATE,  
              y = factor(ADM2_ID), 
              fill = REPORTING_RATE * 100)
      ) + 
  geom_tile() +
  scale_fill_viridis_c(
    option = "C",
    direction = 1,  # blue = low, yellow = high
    limits = c(0, 100),
    name = "Reporting rate (%)"
  ) +
  labs(
    title = "Monthly Reporting Rate by Health District",
    subtitle = "Each tile represents the reporting completeness per district per month",
    x = "Month",
    y = "Health District"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 9),
    axis.text.y = element_text(size = 9),
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    legend.position = "right",
    panel.grid = element_blank()
  )

### (SILENCED OLD) 3.2. Method "**ANY**": look at submissions for **_any_ indicator** that is present
The list of **indicators** is **defined** in the **config file**!

`#### Define cols used to evaluate HF "activity" (whether a HF is reporting or not)`

In [None]:
# cols_to_subset <- c(fixed_cols, DHIS2_INDICATORS)
# print(cols_to_subset)

# dhis2_routine_subset = dhis2_routine %>% 
#   # dplyr::select(all_of(cols_to_subset))  # ⚠️ TEMP switch as config.json was changed but not extracted data (some cols are missing) ⚠️
#   dplyr::select(any_of(cols_to_subset))

# # Print warning message in case there are indicators defined in the config but not present in the routine data
# if (length(cols_to_subset) > length(names(dhis2_routine_subset)) ) {
# log_msg(
#     paste0("🚨 Warning: the following columns are expected, but missing in dhis2_routine : ",  paste(setdiff(cols_to_subset, names(dhis2_routine_subset)), collapse = ", ") ) 
# )
#     }

`` #### 🚨 Set `0` values to `NA` ``

In [None]:
# # ⚠️ To switch back: issue with changing config and expected cols ... ⚠️
# # Temp version of the code to handle missing cols (defined in config file, hence -> DHIS2_INDICATORS , but missing from routine data)
# #  (because config file was changed but there was no new data extraction)

# DHIS2_INDICATORS_FILTERED <- intersect(names(dhis2_routine_subset), DHIS2_INDICATORS)

# print(DHIS2_INDICATORS)
# print(DHIS2_INDICATORS_FILTERED)

In [None]:
# # 0 value to NA 
# if (NA_TREATMENT == 'SET_0_TO_NA') { 
#     # dhis2_routine_subset[, DHIS2_INDICATORS][dhis2_routine_subset[, DHIS2_INDICATORS] == 0] <- NA  #  ⚠️ REACTIVATE THIS ⚠️
#     dhis2_routine_subset[, DHIS2_INDICATORS_FILTERED][dhis2_routine_subset[, DHIS2_INDICATORS_FILTERED] == 0] <- NA  # ⚠️ TEMP switch as config.json was changed but not extracted data ⚠️
#     msg <- paste0("✍🏽 Set 0 values to NA in cols : ", paste(names(dhis2_routine_subset[, DHIS2_INDICATORS_FILTERED]), collapse=', ') )
#     log_msg(msg)
# }

In [None]:
# # HF considered "inactif" when all indicators are NA (= did not submit anything for these indicators), 
# #     else "actif" (= they submitted something)

# hf_active = dhis2_routine_subset %>%
#     dplyr::mutate(# nomiss = apply(dhis2_routine_subset[,DHIS2_INDICATORS], 1, function(y) sum(!is.na(y))), 
#                   nomiss = apply(dhis2_routine_subset[,DHIS2_INDICATORS_FILTERED], 1, function(y) sum(!is.na(y))), # ⚠️ TEMP SWITCH (cofing issue ... )
#                   varmis =ifelse(nomiss == 0, 0, 1),
#                   ACTIVE = ifelse(varmis == 0, FALSE, TRUE)) %>% # 🚨 GP changed to BOOLEAN to save space
#     dplyr::arrange(ADM1_ID, ADM2_ID, OU_ID, PERIOD) %>% 
#     dplyr::group_by(ADM1_ID, ADM2_ID, OU_ID) %>% 
#     dplyr::mutate(cummiss = sum(nomiss), 
#                   # inactivity = nomiss/length(DHIS2_INDICATORS) * 100, 
#                   inactivity = nomiss/length(DHIS2_INDICATORS_FILTERED) * 100, # ⚠️ TEMP SWITCH (cofing issue ... )
#                   start_date = ifelse(
#                     any(inactivity != 100, na.rm = TRUE),
#                     min(PERIOD[inactivity != 100], na.rm = TRUE),
#                     NA  # Default to NA if no valid values
#                     )) %>%
#     dplyr::filter(PERIOD >= start_date)

In [None]:
# head(hf_active, 3)

`` #### 🚨 Here 👇 swap denominator: join `dhis2_reporting_expected` to replace `TOTAL_HF` with `EXPECTED_REPORTS` ``

In [None]:
# Break process: create intermediate df (`hf_active_month`) -> then join `dhis2_reporting_expected`

In [None]:
# # --- 1. create intermediate df `hf_active_month`: summarize nr of "active" (reporting) HF by month ------------------------
# hf_active_month <- hf_active %>% 
# # filter(ADM1_ID == "rWrCdr321Qu") |> # ⚠️⚠️⚠️ TEMP subset just for CODE development ... ! ⚠️⚠️⚠️
#     dplyr::group_by(ADM2_ID, YEAR, MONTH) %>%
#     dplyr::summarize(
#                      SUBMITTED_REPORTS = length(which(ACTIVE == TRUE)), # 🚨 GP changed to BOOLEAN to save space
#                      .groups = "drop") |>
# mutate(YEAR = as.integer(YEAR), 
#        MONTH = as.integer(MONTH)
#       )

# print(dim(hf_active_month))
# head(hf_active_month)

In [None]:
# # --- 2. then join `dhis2_reporting_expected` to `hf_active_month`: add denominator col `REPORTING_RATE` ------------------------
# reporting_rate_any_month <- left_join(hf_active_month, 
#                                       dhis2_reporting_expected |> select(ADM2_ID, YEAR, MONTH, EXPECTED_REPORTS),
#                                       by = join_by(ADM2_ID, YEAR, MONTH)   
#                                      )  |>
#     dplyr::mutate(
#         REPORTING_RATE = round(SUBMITTED_REPORTS/EXPECTED_REPORTS,2) # NEW
#     ) %>%
#     ungroup() %>%  
#     mutate(YEAR = as.integer(YEAR),
#            MONTH = as.integer(MONTH),
#           ) 

# print(dim(reporting_rate_any_month))
# head(reporting_rate_any_month)

`#### Quick data quality check 🔍`

In [None]:
# inspect_reporting_rate(reporting_rate_any_month)

`#### Subset cols`

In [None]:
# reporting_rate_any_month <- reporting_rate_any_month |> 
# select(all_of(fixed_cols_rr))

# head(reporting_rate_any_month, 3)

`#### Plot by MONTH (heatmap)`

In [None]:
# # Plot heatmap
# options(repr.plot.width = 20, repr.plot.height = 10)

# reporting_rate_any_month %>%
# mutate(
#     DATE = as.Date(paste(YEAR, MONTH, "01", sep = "-")), 
#     ADM2_ID = factor(ADM2_ID)
#     ) %>%
# ggplot(., 
#        aes(x = DATE, y = ADM2_ID, 
#            fill = REPORTING_RATE * 100) 
#       ) + 
#   geom_tile() +
#   scale_fill_viridis_c(
#     option = "C",
#     direction = 1,
#     limits = c(0, 100), 
#     name = "Reporting rate (%)"
#   ) +
#   labs(
#     title = "Taux de rapportage mensuel par district sanitaire",
#     subtitle = "Chaque tuile représente l’exhaustivité du rapportage par district et par mois",
#     x = "Mois",
#     y = "District sanitaire"
#   ) +
#   theme_minimal(base_size = 13) +
#   theme(
#     axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 9),
#     axis.text.y = element_text(size = 9),
#     plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
#     plot.subtitle = element_text(hjust = 0.5, size = 12),
#     legend.position = "right",
#     panel.grid = element_blank()
#   )

`#### <s>**Year**ly **mean** and **median** per **ADM2**</s>`

In [None]:
# # Mean
# reporting_rate_any_year_mean = reporting_rate_any_month %>%
#     group_by(ADM2_ID, YEAR) %>% 
#     summarise(REPORTING_RATE = round(mean(REPORTING_RATE, na.rm = T), 2), .groups = "drop") %>% 
#     ungroup() %>%
#     mutate(YEAR = as.integer(YEAR)) 

# print(dim(reporting_rate_any_year_mean))
# head(reporting_rate_any_year_mean, 3)

In [None]:
# # Median
# reporting_rate_any_year_median = reporting_rate_any_month %>%
#     group_by(ADM2_ID, YEAR) %>% 
#     summarise(REPORTING_RATE = round(median(REPORTING_RATE, na.rm = T), 2), .groups = "drop") %>% 
#     ungroup() %>%
#     mutate(YEAR = as.integer(YEAR))

# print(dim(reporting_rate_any_year_median))
# head(reporting_rate_any_year_median, 3)

## 4. Export

### 4.1. 📁 To /data/ folder

#### CSV

In [None]:
# Abbreviation for the chosen Data Element NUMERATOR
if (DATAELEMENT_METHOD_NUMERATOR == "CONF") {
    method_num <- "n1"
} else if (DATAELEMENT_METHOD_NUMERATOR == "CONF|SUSP|TEST") {
    method_num <- "n2"
}

method_num


# Abbreviation for the chosen Data Element DENOMINATOR
if (DATAELEMENT_METHOD_DENOMINATOR == "DHIS2_EXPECTED_REPORTS") {
    method_den <- "d1"
} else if (DATAELEMENT_METHOD_DENOMINATOR == "ACTIVE_FACILITIES") {
    method_den <- "d2"
}

method_den

In [None]:
# write function
snt_write_csv <- function(x, output_data_path, method) {
  
  full_directory_path <- file.path(output_data_path, "reporting_rate")
  
  if (!dir.exists(full_directory_path)) {
    dir.create(full_directory_path, recursive = TRUE)
  }

  if (method == "dataelement") {
      file_path <- file.path(full_directory_path, paste0(COUNTRY_CODE, "_reporting_rate_", method, "-", method_num, "-", method_den, ".csv"))
  } else {
      file_path <- file.path(full_directory_path, paste0(COUNTRY_CODE, "_reporting_rate_", method, ".csv")) # "_month.csv"
  }
  
  write_csv(x, file_path)

  log_msg(paste0("Exported : ", file_path))
}

In [None]:
# Method "Dataset"
snt_write_csv(x = reporting_rate_dataset, 
              output_data_path = DATA_PATH, 
              method = "dataset") 

# Method "Data Element"
snt_write_csv(x = reporting_rate_dataelement,
              output_data_path = DATA_PATH, 
              method = "dataelement")

#### parquet

In [None]:
# write function
snt_write_parquet <- function(x, output_data_path, method) {
  
  full_directory_path <- file.path(output_data_path, "reporting_rate")
  
  if (!dir.exists(full_directory_path)) {
    dir.create(full_directory_path, recursive = TRUE)
  }
    
  if (method == "dataelement") {
      file_path <- file.path(full_directory_path, paste0(COUNTRY_CODE, "_reporting_rate_", method, "-", method_num, "-", method_den, ".parquet"))
  } else {
      file_path <- file.path(full_directory_path, paste0(COUNTRY_CODE, "_reporting_rate_", method, ".parquet")) # "_month.csv"
  }
  
  arrow::write_parquet(x, file_path)

  log_msg(paste0("Exported : ", file_path))
}

In [None]:
# Method "Dataset"
snt_write_parquet(x = reporting_rate_dataset,
                  output_data_path = DATA_PATH,
                  method = "dataset"
                 )

# Method "Data Element"
snt_write_parquet(x = reporting_rate_dataelement,
                  output_data_path = DATA_PATH,
                  method = "dataelement"
                 )