Script structure:
* 0. Set parameters (will be part of pipeline so this block will be silenced)
  1. Setup:
        * Paths
        * Utils functions
  2. Load Data
        * **Routine data** (DHIS2) already formatted & aggregated (output of pipeline XXX)
        * **Reporting** (DHIS2) pre-computed, already formatted & aggregated (output of pipeline ???)
        * **Shapes** (DHIS2) for plotting (this could be removed if we move the plots to "report/EDA" nb)
  3. Calculate **Reporting Rate (RR)**
        * "DHIS2": using pre-computed reportings from DHIS2/SNIS
        * "ANY" (based on old code - BFA)
        * "CONF" (based on code in nb: `~/dhis2_incidence/code/WIP/code_from_fre/DRC_DHIS2_analyses_fvdb_v2.ipynb`)
  4. **Export** reporting rate data to **Datasets** as .csv and .parquet files
  5. 🚧 (possibly) Expand reporting: **data inspection** (plots and summary tables) - this might go to **dedicated nb** ...

-------------------
**Naming harmonization to improve code readability**:

**Reporting Rate** data frames, based on different **methods**:
* follow this structure: `reporting_rate_<method>_<periodicity>`. So:
    * **DHIS2** (using pre-computed reporting) : `reporting_rate_dhis2_month`
    * **ANY** (as "this code simply tests for _any_ indicator that is present"): `reporting_rate_any_month`
    * **CONF** (Diallo 2025) : `reporting_rate_conf_month`
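
As a rough illustration of the naming convention above, the names can be built and parsed programmatically. This is only a sketch: `rr_name()` and `rr_method()` are hypothetical helpers, not part of `snt_utils.r`.

```r
# Hypothetical helper (not in snt_utils.r): build a data frame name
# following reporting_rate_<method>_<periodicity>
rr_name <- function(method, periodicity = "month") {
  paste("reporting_rate", tolower(method), tolower(periodicity), sep = "_")
}

# Hypothetical helper: recover the <method> part from such a name
rr_method <- function(name) {
  sub("^reporting_rate_(.*)_(month|year)$", "\\1", name)
}

rr_names <- vapply(c("DHIS2", "ANY", "CONF"), rr_name, character(1))
print(rr_names)
print(rr_method("reporting_rate_conf_month"))
```

Keeping the method in a fixed position of the name is what lets a helper pull it back out for log messages.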

--------------------

🚧 **Notes / possible ToDo's**:
* Currently, the **denominator** for the methods "ANY" and "CONF" is the value of the col `EXPECTED_REPORTS` (from DHIS2/SNIS). Previously it was calculated differently and stored in the cols `TOTAL_HF_ACTIVE` and `N_FACILITIES` respectively. I'm keeping these cols for now for the sake of comparison, but I _might delete later to streamline the code_.
* Considering removing the **yearly** reporting rate calculations (`reporting_rate_*_year`) as they do not seem to be used anywhere, except for the reporting/comparison/data quality check. Therefore, maybe better to only have this in the reporting nb (delete from main nb)

--------------------

## Parameters

No parameters in this nb :P

#### Set Default values **if _not_ provided by pipeline**
This makes the execution flexible and "safe": nb can be run manually from here or be executed via pipeline, without having to change anything in the code!

In [None]:
# Set BACKUP VALUE: root path - NEVER CHANGE THIS!
if (!exists("SNT_ROOT_PATH")) {
  SNT_ROOT_PATH <- "/home/hexa/workspace" 
}

## 1. Setup

### 1.1. Paths

In [None]:
# PROJECT PATHS
CODE_PATH <- file.path(SNT_ROOT_PATH, 'code') # this is where we store snt_utils.r
CONFIG_PATH <- file.path(SNT_ROOT_PATH, 'configuration') # .json config file
DATA_PATH <- file.path(SNT_ROOT_PATH, 'data') # same as in Datasets but /data/ gets overwritten every time a new version of Datasets is pushed

### 1.2. Utils functions

In [None]:
source(file.path(CODE_PATH, "snt_utils.r"))

### 1.3. Packages

In [None]:
# List required pcks  ---------------->  check  what are the really required libraries
required_packages <- c("arrow", # for .parquet
                       "tidyverse",
                       "stringi", 
                       "sf",
                       "jsonlite", 
                       "httr", 
                       "reticulate")

# Execute function
install_and_load(required_packages)

### 1.3.1. OpenHEXA-specific settings

#### For 📦{sf}, tell OH where to find stuff ...

In [None]:
# Hope this gets fixed at the source one day ...
Sys.setenv(PROJ_LIB = "/opt/conda/share/proj")
Sys.setenv(GDAL_DATA = "/opt/conda/share/gdal")

#### Set environment to load openhexa.sdk from the right path

In [None]:
# Set environment to load openhexa.sdk from the right path
Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python")
reticulate::py_config()$python
openhexa <- import("openhexa.sdk")

### 1.4. Load and check `config` file

In [None]:
# Load SNT config

config_file_name <- "SNT_config.json" 
config_json <- tryCatch({
        jsonlite::fromJSON(file.path(CONFIG_PATH, config_file_name)) 
    },
    error = function(e) {
        msg <- paste0("Error while loading configuration: ", conditionMessage(e))  
        cat(msg)   
        stop(msg) 
    })

msg <- paste0("SNT configuration loaded from  : ", file.path(CONFIG_PATH, config_file_name))
log_msg(msg)

**Save config fields as variables**

In [None]:
# Generic
COUNTRY_CODE <- config_json$SNT_CONFIG$COUNTRY_CODE
ADMIN_1 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_1)
ADMIN_2 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_2)

# How to treat 0 values (in this case: "SET_0_TO_NA" converts 0 to NAs)
NA_TREATMENT <- config_json$SNT_CONFIG$NA_TREATMENT

# Which (aggregated) indicators to use to evaluate "activity" of an HF - for Reporting Rate method "Ousmane"
DHIS2_INDICATORS <- names(config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS)

# Which reporting rate PRODUCT_UID to use (note that this is a dataset in COD, but 2 dataElements in BFA!)
REPORTING_RATE_PRODUCT_ID <- config_json$SNT_CONFIG$REPORTING_RATE_PRODUCT_UID

In [None]:
# Fixed  cols for routine data formatting 
fixed_cols <- c('OU_ID','PERIOD', 'YEAR', 'MONTH', 'ADM1_ID', 'ADM2_ID') # (OU_NAME has homonymous values!)
print(paste("Fixed routine data (\"dhis2_routine\") columns (always expected): ", paste(fixed_cols, collapse=", ")))

In [None]:
# Fixed cols for exporting RR tables
fixed_cols_rr <- c('YEAR', 'MONTH', 'ADM2_ID', 'REPORTING_RATE') 

## 2. Load Data

### 2.1. **Routine** data (DHIS2) 
already formatted & aggregated (output of pipeline XXX)

In [None]:
# DHIS2 Dataset extract identifier
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED

# Load file from dataset
dhis2_routine <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, paste0(COUNTRY_CODE, "_routine.parquet")) }, 
                  error = function(e) {
                      msg <- paste("Error while loading DHIS2 routine data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

msg <- paste0("DHIS2 routine data loaded from dataset : ", dataset_name, " dataframe dimensions: ", paste(dim(dhis2_routine), collapse=", "))
log_msg(msg)

In [None]:
# Ensure correct data type for numerical columns 
dhis2_routine <- dhis2_routine %>%
    mutate(across(c(PERIOD, YEAR, MONTH), as.numeric))

In [None]:
head(dhis2_routine)

### 2.2. **Reporting** pre-computed from DHIS2 
Data granularity:
* **ADM2**
* **MONTH** (PERIOD)

Note that data comes from different datasets (`DS_NAME`): `A SERVICES DE BASE`, `B SERVICES SECONDAIRES`, `D SERVICE HOPITAL`

The col `DS_METRIC` indicates whether the `VALUE` is `EXPECTED_REPORTS` or `ACTUAL_REPORTS`

In [None]:
# DHIS2 Dataset extract identifier
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED
file_name <- paste0(COUNTRY_CODE, "_reporting.parquet")

# Load file from dataset
dhis2_reporting <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, file_name) }, 
                  error = function(e) {
                      msg <- paste("Error while loading DHIS2 pre-computed REPORTING data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

msg <- paste0("DHIS2 pre-computed REPORTING data : ", file_name, " loaded from dataset : ", dataset_name, ". Dataframe dimensions: ", paste(dim(dhis2_reporting), collapse=", "))
log_msg(msg)

In [None]:
# Convert PERIOD, YEAR, MONTH and VALUE cols to <dbl> - safety measure 
dhis2_reporting <- dhis2_reporting |>
mutate(across(c(PERIOD, YEAR, MONTH, VALUE), as.numeric))

In [None]:
head(dhis2_reporting, 3)

#### 2.2.1. **Filter** to keep only values for `PRODUCT_UID` defined in config.json

In [None]:
REPORTING_RATE_PRODUCT_ID

In [None]:
# Handle problems with incorrect configuration - to be improved 🚧
if (is.null(REPORTING_RATE_PRODUCT_ID)) {
    log_msg("🛑 Problem with definition of REPORTING_RATE_PRODUCT_ID, check config.json file!")
}

In [None]:
dhis2_reporting_filtered <- dhis2_reporting |>
# filter(PRODUCT_NAME == REPORTING_DS_NAME) |>
filter(PRODUCT_UID %in% REPORTING_RATE_PRODUCT_ID) |>
select(-PRODUCT_UID, -PRODUCT_NAME) # useless cols now

print(dim(dhis2_reporting_filtered))
head(dhis2_reporting_filtered)

#### 2.2.2. Format to produce `dhis2_reporting_expected`
🚨 Note: Use `dhis2_reporting_expected$EXPECTED_REPORTS` as new denominator for REPORTING_RATE calculations (methods ANY and CONF)

In [None]:
dhis2_reporting_wide <- dhis2_reporting_filtered |> 
pivot_wider(
    names_from = PRODUCT_METRIC, # DS_METRIC
    values_from = VALUE
)

print(dim(dhis2_reporting_wide))
head(dhis2_reporting_wide)

In [None]:
# Use `dhis2_reporting_expected$EXPECTED_REPORTS` as new denominator for RR calculations (methods ANY and CONF)

dhis2_reporting_expected <- dhis2_reporting_wide |> 
select(-ACTUAL_REPORTS)

print(dim(dhis2_reporting_expected))
head(dhis2_reporting_expected)

#### 2.2.3. **Checks** on data completeness: do **periods match** with routine data?
Lack of perfect overlap in periods between routine data and reporting rate data might create headaches down the line!<br>
Specifically, **incidence** calculations will show **N2 smaller than N1** due to **aggregation by YEAR when NA** values are present!
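
For intuition, the same kind of period check can be sketched with `dplyr::anti_join()` on toy data. All values below are made up, and this is an alternative formulation rather than the check this notebook runs.

```r
library(dplyr)

# Toy data (made-up values): one row per YEAR-MONTH available in each source
routine_periods  <- tibble(YEAR = c(2020L, 2020L, 2021L), MONTH = c(1L, 2L, 1L))
expected_periods <- tibble(YEAR = c(2020L, 2021L, 2021L), MONTH = c(1L, 1L, 4L))

# Periods present in the routine data but absent from the reporting data
only_in_routine <- anti_join(routine_periods, expected_periods, by = c("YEAR", "MONTH"))

# Periods present in the reporting data but absent from the routine data
only_in_expected <- anti_join(expected_periods, routine_periods, by = c("YEAR", "MONTH"))

print(only_in_routine)
print(only_in_expected)
```

Any row returned here is a period that ends up with an `NA` after the tables are joined, which is where the yearly aggregation issue comes from.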

In [None]:
# # For testing of log msg only!

# dhis2_routine <- tibble(
#   YEAR = c(2020, 2020, 2021, 2021, 2020),
#   MONTH = c("01", "02", "01", "03", "01"),
#   VALUE = c(10, 20, 15, 25, 30)
# )

# dhis2_reporting_expected <- tibble(
#   YEAR = c(2020, 2020, 2021, 2021),
#   MONTH = c("01", "02", "01", "04"),
#   EXPECTED_VALUE = c(12, 22, 17, 27)
# )

In [None]:
# --- Check Year Compatibility ---
routine_years <- sort(unique(as.integer(dhis2_routine$YEAR))) # as.integer
expected_years <- sort(unique(as.integer(dhis2_reporting_expected$YEAR))) # as.integer

if (!setequal(routine_years, expected_years)) {
  missing_in_routine <- setdiff(expected_years, routine_years)
  missing_in_expected <- setdiff(routine_years, expected_years)

  if (length(missing_in_routine) > 0) {
    log_msg(paste0("🚨 Warning: YEAR value(s) present in 'dhis2_reporting_expected' but not in 'dhis2_routine': ",
                   paste(missing_in_routine, collapse = ", ")))
  }
  if (length(missing_in_expected) > 0) {
    log_msg(paste0("🚨 Warning: YEAR value(s) present in 'dhis2_routine' but not in 'dhis2_reporting_expected': ",
                   paste(missing_in_expected, collapse = ", ")))
  }
} else {
  log_msg("✅ YEAR values are consistent across 'dhis2_routine' and 'dhis2_reporting_expected'.")

  # --- Check Month Compatibility (if years are consistent) ---
  all_years <- unique(routine_years) # Or expected_years, they are the same now

  for (year_val in all_years) {
    routine_months_for_year <- dhis2_routine %>%
      filter(YEAR == year_val) %>%
      pull(MONTH) %>%
      unique() %>%
      sort()

    expected_months_for_year <- dhis2_reporting_expected %>%
      filter(YEAR == year_val) %>%
      pull(MONTH) %>%
      unique() %>%
      sort()

    if (!setequal(routine_months_for_year, expected_months_for_year)) {
      missing_in_routine_months <- setdiff(expected_months_for_year, routine_months_for_year)
      missing_in_expected_months <- setdiff(routine_months_for_year, expected_months_for_year)

      if (length(missing_in_routine_months) > 0) {
        log_msg(paste0("🚨 Warning: for YEAR ", year_val, ", MONTH value(s) '", paste(missing_in_routine_months, collapse = ", "),
                       "' present in 'dhis2_reporting_expected' but not in 'dhis2_routine'!"
                       ))
      }
      if (length(missing_in_expected_months) > 0) {
        log_msg(paste0("🚨 Warning: for YEAR ", year_val, ", MONTH value(s) '", paste(missing_in_expected_months, collapse = ", "), 
                       "' present in 'dhis2_routine' but not in 'dhis2_reporting_expected'!"
                       ))
      }
    } else {
      log_msg(paste0("✅ For year ", year_val, ", months are consistent across both data frames."))
    }
  }
}

### 2.3. **Shapes** for plotting maps (choropleths)

In [None]:
# DHIS2 Dataset extract identifier
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED

# Load file from dataset
dhis2_shapes <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, paste0(COUNTRY_CODE, "_shapes.geojson")) }, 
                  error = function(e) {
                      msg <- paste("Error while loading DHIS2 shapes data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

msg <- paste0("DHIS2 shapes data loaded from dataset : ", dataset_name, " dataframe dimensions: ", paste(dim(dhis2_shapes), collapse=", "))
log_msg(msg)

In [None]:
# `head()` cannot display, needs ‘geojsonio’ (which I cannot install) so let's just check col names ... 
names(dhis2_shapes)

## 3. Calculate **Reporting Rate** (RR)
We compute it using 3 approaches; the user can decide later which one to use for incidence adjustment.

### 3.0. From **DHIS2** pre-computed reporting
Extracted from DHIS2 and formatted. 

Simple, just pivot `DS_METRIC` and divide `ACTUAL_REPORTS` / `EXPECTED_REPORTS`
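
A minimal sketch of that pivot-then-divide step on toy data. The values are made up; the metric column is called `PRODUCT_METRIC` here to match the code cells in this notebook.

```r
library(dplyr)
library(tidyr)

# Toy long-format reporting data (made-up values)
reporting_long <- tibble(
  ADM2_ID = c("A", "A", "B", "B"),
  YEAR = 2023L,
  MONTH = 1L,
  PRODUCT_METRIC = c("ACTUAL_REPORTS", "EXPECTED_REPORTS",
                     "ACTUAL_REPORTS", "EXPECTED_REPORTS"),
  VALUE = c(21, 25, 10, 10)
)

# One row per ADM2-month, one column per metric, then the ratio
reporting_rate_toy <- reporting_long |>
  pivot_wider(names_from = PRODUCT_METRIC, values_from = VALUE) |>
  mutate(REPORTING_RATE = ACTUAL_REPORTS / EXPECTED_REPORTS)

print(reporting_rate_toy)
```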

In [None]:
reporting_rate_dhis2_month <- dhis2_reporting_wide |> 
mutate(REPORTING_RATE = ACTUAL_REPORTS / EXPECTED_REPORTS)

print(dim(reporting_rate_dhis2_month))
head(reporting_rate_dhis2_month, 3)

#### Quick data quality check 🔍

In [None]:
# --- Define function ---------------------------
analyze_reporting_rate <- function(data_tibble) {

  # Dynamically get the name of the tibble passed to the function
  # This extracts the literal name of the variable passed (e.g., "reporting_rate_dhis2_month")
  tibble_name_full <- deparse(substitute(data_tibble))

  # Extract the 'method' part from the tibble name
  method <- stringr::str_extract(tibble_name_full, "(?<=reporting_rate_).*?(?=_month)")

  # Calculations for proportion of values > 1
  values_greater_than_1 <- sum(data_tibble$REPORTING_RATE > 1, na.rm = TRUE)
  total_values <- length(data_tibble$REPORTING_RATE)

  if (total_values > 0) {
    proportion <- values_greater_than_1 / total_values * 100
    min_rate <- min(data_tibble$REPORTING_RATE, na.rm = TRUE)
    max_rate <- max(data_tibble$REPORTING_RATE, na.rm = TRUE)
  } else {
    proportion <- 0
    min_rate <- NA # Set to NA if no values to calculate min/max
    max_rate <- NA # Set to NA if no values to calculate min/max
  }

  # Print the formatted result
  print(
    paste0(
      "🔍 For reporting rate method : ", method, ", the values of REPORTING_RATE range from ", round(min_rate, 2),
      " to ", round(max_rate, 2),
      " and ", round(proportion, 2), "% of values are >1 (there are more reports than expected)."
    )
  )

  # Histogram
  hist(data_tibble$REPORTING_RATE, 
     breaks = 50)
}

In [None]:
analyze_reporting_rate(reporting_rate_dhis2_month)

#### Subset cols

In [None]:
reporting_rate_dhis2_month <- reporting_rate_dhis2_month |> 
select(all_of(fixed_cols_rr))

head(reporting_rate_dhis2_month, 3)

### 3.1. Method "**ANY**": look at submissions for **_any_ indicator** that is present
The list of **indicators** is **defined** in the **config file**!
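
The idea behind "ANY" can be sketched on toy data: a facility counts as active in a period if at least one indicator is non-missing. The indicator names below are placeholders (the real list comes from the config file), and `if_any()` is used here instead of the row-wise `apply()` count used further down.

```r
library(dplyr)

# Toy routine data (made-up values); CONF, TEST and SUSP are placeholder indicators
toy_routine <- tibble(
  OU_ID = c("hf1", "hf2", "hf3"),
  CONF  = c(5, NA, NA),
  TEST  = c(NA, 12, NA),
  SUSP  = c(NA_real_, NA_real_, NA_real_)
)
toy_indicators <- c("CONF", "TEST", "SUSP")

# ACTIVE is TRUE when at least one indicator column is not NA
toy_active <- toy_routine |>
  mutate(ACTIVE = if_any(all_of(toy_indicators), ~ !is.na(.x)))

print(toy_active)
```

Here `hf1` and `hf2` are active (one indicator each) while `hf3` is not, since all of its indicators are missing.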

#### Define cols used to evaluate HF "activity" (whether a HF is reporting or not)

In [None]:
cols_to_subset <- c(fixed_cols, DHIS2_INDICATORS)
print(cols_to_subset)

dhis2_routine_subset = dhis2_routine %>% 
  # dplyr::select(all_of(cols_to_subset))  # ⚠️ TEMP switch as config.json was changed but not extracted data (some cols are missing) ⚠️
  dplyr::select(any_of(cols_to_subset))

# Print warning message in case there are indicators defined in the config but not present in the routine data
if (length(cols_to_subset) > length(names(dhis2_routine_subset)) ) {
print(
    paste0("🚨 Warning: the following columns are expected, but missing in dhis2_routine : ",  paste(setdiff(cols_to_subset, names(dhis2_routine_subset)), collapse = ", ") ) 
)
    }

#### 🚨 Set `0` values to `NA`

#### ⚠️ To switch back: issue with changing config and expected cols ... ⚠️

In [None]:
# ⚠️ To switch back: issue with changing config and expected cols ... ⚠️
# Temp version of the code to handle missing cols (defined in config file, hence -> DHIS2_INDICATORS , but missing from routine data)
#  (because config file was changed but there was no new data extraction)

DHIS2_INDICATORS_FILTERED <- intersect(names(dhis2_routine_subset), DHIS2_INDICATORS)

print(DHIS2_INDICATORS)
print(DHIS2_INDICATORS_FILTERED)

In [None]:
a <- paste0("Set 0 values to NA in cols : ", paste(names(dhis2_routine_subset[, DHIS2_INDICATORS_FILTERED]), collapse=', ') )
print(a)

In [None]:
# 0 value to NA 
if (NA_TREATMENT == 'SET_0_TO_NA') { 
    # dhis2_routine_subset[, DHIS2_INDICATORS][dhis2_routine_subset[, DHIS2_INDICATORS] == 0] <- NA  #  ⚠️ REACTIVATE THIS ⚠️
    dhis2_routine_subset[, DHIS2_INDICATORS_FILTERED][dhis2_routine_subset[, DHIS2_INDICATORS_FILTERED] == 0] <- NA  # ⚠️ TEMP switch as config.json was changed but not extracted data ⚠️
    msg <- paste0("✍🏽 Set 0 values to NA in cols : ", paste(names(dhis2_routine_subset[, DHIS2_INDICATORS_FILTERED]), collapse=', ') )
    # print("Set 0 values to NA")
    log_msg(msg)
    
}

In [None]:
# HF considered "inactive" when all indicators are NA (= did not submit anything for these indicators), 
#     else "active" (= they submitted something)
hf_active = dhis2_routine_subset %>%
    dplyr::mutate(# nomiss = apply(dhis2_routine_subset[,DHIS2_INDICATORS], 1, function(y) sum(!is.na(y))), 
                  nomiss = apply(dhis2_routine_subset[,DHIS2_INDICATORS_FILTERED], 1, function(y) sum(!is.na(y))), # ⚠️ TEMP SWITCH (config issue ... )
                  varmis =ifelse(nomiss == 0, 0, 1),
                  ACTIVE = ifelse(varmis == 0, FALSE, TRUE)) %>% # 🚨 GP changed to BOOLEAN to save space
    dplyr::arrange(ADM1_ID, ADM2_ID, OU_ID, PERIOD) %>% # OU,
    dplyr::group_by(ADM1_ID, ADM2_ID, OU_ID) %>% # OU
    dplyr::mutate(cummiss = sum(nomiss), 
                  # inactivity = nomiss/length(DHIS2_INDICATORS) * 100, 
                  inactivity = nomiss/length(DHIS2_INDICATORS_FILTERED) * 100, # ⚠️ TEMP SWITCH (config issue ... )
                  start_date = ifelse(
                    any(inactivity != 100, na.rm = TRUE),
                    min(PERIOD[inactivity != 100], na.rm = TRUE),
                    NA  # Default to NA if no valid values
                    )) %>%
    dplyr::filter(PERIOD >= start_date)

In [None]:
head(hf_active, 3)

#### 🚨 Here 👇 swap denominator: join `dhis2_reporting_expected` to replace `TOTAL_HF` with `EXPECTED_REPORTS`
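
A small sketch of what this join does to the denominator, on made-up data. Rows without a match in the reporting table get `NA` for `EXPECTED_REPORTS`, and a zero denominator would give `Inf`. Turning both cases into `NA` is an assumption of this sketch only; the cell below does not add such a guard.

```r
library(dplyr)

# Toy numerators and denominators (made-up values)
submitted <- tibble(ADM2_ID = c("A", "B", "C"), YEAR = 2023L, MONTH = 1L,
                    SUBMITTED_REPORTS = c(21L, 10L, 4L))
expected  <- tibble(ADM2_ID = c("A", "B"), YEAR = 2023L, MONTH = 1L,
                    EXPECTED_REPORTS = c(25, 0))

rr_toy <- submitted |>
  left_join(expected, by = c("ADM2_ID", "YEAR", "MONTH")) |>
  mutate(
    # Assumption (sketch only): a missing or zero denominator yields NA
    REPORTING_RATE = if_else(
      is.na(EXPECTED_REPORTS) | EXPECTED_REPORTS == 0,
      NA_real_,
      SUBMITTED_REPORTS / EXPECTED_REPORTS
    )
  )

print(rr_toy)
```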

In [None]:
# Break process: create intermediate df (`hf_active_month`) -> then join `dhis2_reporting_expected`

# --- 1. create intermediate df `hf_active_month`: summarize nr of "active" (reporting) HF by month ------------------------
hf_active_month <- hf_active %>% 
# filter(ADM1_ID == "rWrCdr321Qu") |> # ⚠️⚠️⚠️ TEMP subset just for CODE development ... ! ⚠️⚠️⚠️
    dplyr::group_by(ADM2_ID, YEAR, MONTH) %>%
    dplyr::summarize(
                     SUBMITTED_REPORTS = length(which(ACTIVE == TRUE)), # 🚨 GP changed to BOOLEAN to save space
                     .groups = "drop") |>
mutate(YEAR = as.integer(YEAR), 
       MONTH = as.integer(MONTH)
      )

print(dim(hf_active_month))
head(hf_active_month)


# --- 2. then join `dhis2_reporting_expected` to `hf_active_month`: add denominator col `EXPECTED_REPORTS` and compute `REPORTING_RATE` ------------------------
reporting_rate_any_month <- left_join(hf_active_month, 
                                      dhis2_reporting_expected |> select(ADM2_ID, YEAR, MONTH, EXPECTED_REPORTS),
                                      by = join_by(ADM2_ID, YEAR, MONTH)   
                                     )  |>
    dplyr::mutate(
        REPORTING_RATE = round(SUBMITTED_REPORTS/EXPECTED_REPORTS,2) # NEW
    ) %>%
    ungroup() %>%  
    mutate(YEAR = as.integer(YEAR),
           MONTH = as.integer(MONTH),
          ) 

print(dim(reporting_rate_any_month))
head(reporting_rate_any_month)

#### Quick data quality check 🔍

In [None]:
analyze_reporting_rate(reporting_rate_any_month)

#### Subset cols

In [None]:
reporting_rate_any_month <- reporting_rate_any_month |> 
select(all_of(fixed_cols_rr))

head(reporting_rate_any_month, 3)

#### Plot by MONTH (heatmap)

In [None]:
# Plot heatmap
options(repr.plot.width = 20, repr.plot.height = 10)

reporting_rate_any_month %>%
mutate(
    DATE = as.Date(paste(YEAR, MONTH, "01", sep = "-")), 
    ADM2_ID = factor(ADM2_ID)
    ) %>%
ggplot(., 
       aes(x = DATE, y = ADM2_ID, 
           fill = REPORTING_RATE * 100) 
      ) + 
  geom_tile() +
  scale_fill_viridis_c(
    option = "C",
    direction = 1,
    limits = c(0, 100), 
    name = "Reporting rate (%)"
  ) +
  labs(
    title = "Monthly Reporting Rate by Health District",
    subtitle = "Each tile represents the reporting completeness per district per month",
    x = "Month",
    y = "Health District"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 9),
    axis.text.y = element_text(size = 9),
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    legend.position = "right",
    panel.grid = element_blank()
  )

#### **Year**ly **mean** and **median** per **ADM2**

In [None]:
# Mean
reporting_rate_any_year_mean = reporting_rate_any_month %>%
    group_by(ADM2_ID, YEAR) %>% 
    summarise(REPORTING_RATE = round(mean(REPORTING_RATE, na.rm = T), 2), .groups = "drop") %>% 
    ungroup() %>%
    mutate(YEAR = as.integer(YEAR)) 

print(dim(reporting_rate_any_year_mean))
head(reporting_rate_any_year_mean, 3)

In [None]:
# Median
reporting_rate_any_year_median = reporting_rate_any_month %>%
    group_by(ADM2_ID, YEAR) %>% 
    summarise(REPORTING_RATE = round(median(REPORTING_RATE, na.rm = T), 2), .groups = "drop") %>% 
    ungroup() %>%
    mutate(YEAR = as.integer(YEAR))

print(dim(reporting_rate_any_year_median))
head(reporting_rate_any_year_median, 3)

----------------------------

### 3.2. Method **CONF**: based on reporting of **confirmed cases**
**_Reporting rate following methods by WHO and as per Diallo_2025 paper_**

To accurately measure data completeness, we calculate the monthly reporting rate per health district (ADM2) as the **proportion of facility–months that submitted at least one report containing a confirmed malaria case** (**CONF**). <br>
For each ADM2, we expect one report per facility per month. For example, if an ADM2 has 25 facilities, we expect 25 reports for a given month. If only 21 of those facilities report confirmed cases that month, the reporting rate is 21/25 = 84%.

This method improves over simple binary completeness flags by accounting for both spatial (facility coverage) and temporal (monthly timeliness) dimensions. A facility-month is **considered reported** if the **CONF value is not missing**, which serves as a proxy for overall completeness of malaria indicators. We use the presence of CONF (confirmed malaria cases) as the condition for marking a facility-month as reported because it is a core indicator consistently tracked across the dataset. This choice ensures alignment with the structure of the incidence calculation, which is also mainly based on confirmed cases.
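
The 21/25 example above can be reproduced on toy data. All values are made up, and the denominator is simply the number of facility rows here, as a stand-in for `EXPECTED_REPORTS` from DHIS2.

```r
library(dplyr)

# Toy ADM2 with 25 facilities in one month (made-up values):
# 21 facilities have a CONF value, 4 have none
toy_conf <- tibble(
  ADM2_ID = "A",
  YEAR = 2023L,
  MONTH = 1L,
  OU_ID = paste0("hf", 1:25),
  CONF = c(rep(3, 21), rep(NA_real_, 4))
)

toy_rr_conf <- toy_conf |>
  mutate(REPORTED_CONF = if_else(!is.na(CONF), 1, 0)) |>
  group_by(ADM2_ID, YEAR, MONTH) |>
  summarise(
    SUBMITTED_REPORTS = sum(REPORTED_CONF),
    EXPECTED_REPORTS = n(),  # assumption: one expected report per facility row
    .groups = "drop"
  ) |>
  mutate(REPORTING_RATE = SUBMITTED_REPORTS / EXPECTED_REPORTS)

print(toy_rr_conf$REPORTING_RATE)
```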

#### Calculate

In [None]:
# Tag as "REPORTED" only if `CONF` is not NA (= the HF reported some data for CONF)
dhis2_routine_reporting <- dhis2_routine %>%
  mutate(REPORTED_CONF = if_else(!is.na(CONF), 1, 0))

#### 🚨 Here 👇 swap denominator: join `dhis2_reporting_expected` to replace `N_FACILITIES` with `EXPECTED_REPORTS`

In [None]:
# --- 1.  Calculate nr of "reporting" facilities by month (aka nr of submitted reports as `SUBMITTED_REPORTS`) ------------------------
dhis2_routine_reporting_month <- dhis2_routine_reporting %>%
  group_by(ADM2_ID, YEAR, MONTH) %>% 
  summarise(
    SUBMITTED_REPORTS = sum(REPORTED_CONF, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ungroup() %>%  
    mutate(YEAR = as.integer(YEAR),
           MONTH = as.integer(MONTH)
          ) 

print(dim(dhis2_routine_reporting_month))
head(dhis2_routine_reporting_month, 3)



# --- 2. Join `dhis2_reporting_expected` to add `EXPECTED_REPORTS` ------------------------------------------------
reporting_rate_conf_month <- left_join(
    dhis2_routine_reporting_month,
    dhis2_reporting_expected |> select(ADM2_ID, YEAR, MONTH, EXPECTED_REPORTS),
    by = join_by(ADM2_ID, YEAR, MONTH)
    ) |>
  mutate(
    REPORTING_RATE = SUBMITTED_REPORTS / EXPECTED_REPORTS
  ) %>%
  # ungroup() %>%  
  mutate(YEAR = as.integer(YEAR),
         MONTH = as.integer(MONTH)
          ) 

print(dim(reporting_rate_conf_month))
head(reporting_rate_conf_month, 3)

#### Quick data quality check 🔍

In [None]:
analyze_reporting_rate(reporting_rate_conf_month)

#### Subset cols

In [None]:
reporting_rate_conf_month <- reporting_rate_conf_month |> 
select(all_of(fixed_cols_rr))

head(reporting_rate_conf_month, 3)

#### Plot by MONTH (heatmap)

In [None]:
# Plot reporting rate heatmap
options(repr.plot.width = 20, repr.plot.height = 10) 

reporting_rate_conf_month %>%
mutate(
    DATE = as.Date(paste0(YEAR, "-", MONTH, "-01"))
    ) %>%
ggplot(., aes(x = DATE,  # GP replaced `date` with `DATE`
              y = factor(ADM2_ID), # GP replaced `y = ADM2` with `y = factor(ADM2)`
              fill = REPORTING_RATE * 100)
      ) + 
  geom_tile() +
  scale_fill_viridis_c(
    option = "C",
    direction = 1,  # blue = low, yellow = high
    limits = c(0, 100),
    name = "Reporting rate (%)"
  ) +
  labs(
    title = "Monthly Reporting Rate by Health District",
    subtitle = "Each tile represents the reporting completeness per district per month",
    x = "Month",
    y = "Health District"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 9),
    axis.text.y = element_text(size = 9),
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    legend.position = "right",
    panel.grid = element_blank()
  )

#### **Year**ly **mean** and **median** per **ADM2**

In [None]:
# Mean
reporting_rate_conf_year_mean = reporting_rate_conf_month %>%
    group_by(ADM2_ID, YEAR) %>% 
    summarise(REPORTING_RATE = round(mean(REPORTING_RATE, na.rm = T), 2), .groups = "drop") %>% 
    ungroup() %>%
    mutate(YEAR = as.integer(YEAR)) 

print(dim(reporting_rate_conf_year_mean))
head(reporting_rate_conf_year_mean, 3)

In [None]:
# Median
reporting_rate_conf_year_median = reporting_rate_conf_month %>%
    group_by(ADM2_ID, YEAR) %>% 
    summarise(REPORTING_RATE = round(median(REPORTING_RATE, na.rm = T), 2), .groups = "drop") %>% 
    ungroup() %>%
    mutate(YEAR = as.integer(YEAR)) 

print(dim(reporting_rate_conf_year_median))
head(reporting_rate_conf_year_median, 3)

#### Plot by YEAR (choropleth)

In [None]:
# 2. Join ADM2 shapes with SNIS reporting data
map_data <- dhis2_shapes %>% 
  left_join(reporting_rate_conf_year_mean, by = "ADM2_ID") %>% 
  sf::st_as_sf() 

In [None]:
# 3. Bin reporting rate values
map_data <- map_data %>%
  mutate(rate_cat = case_when(
    REPORTING_RATE < 0.5 ~ "< 50%",
    REPORTING_RATE < 0.8 ~ "50–80%",
    REPORTING_RATE < 0.9 ~ "80–90%",
    REPORTING_RATE >= 0.9 ~ "90–100%"  # includes 100%
  ))

# 4. Define colors
rate_colors <- c(
  "< 50%"     = "#b2182b",  # dark red
  "50–80%"    = "#f46d43",  # reddish-orange, more vibrant
  "80–90%"    = "#fee08b",  # yellow
  "90–100%" = "#4daf4a"  # clear, strong green (used in many R palettes)
)

# 5. Plot
options(repr.plot.width = 20, repr.plot.height = 5)
ggplot(map_data) +
  geom_sf(aes(fill = rate_cat,
             geometry = geometry),
          color = "white", size = 0.2) +
  facet_wrap(~ YEAR, nrow = 1) +
  scale_fill_manual(values = rate_colors, name = "Reporting rate") +
  labs(title = "Reporting rate by ADM2, per year") +
  theme_minimal(base_size = 14) +
  theme(
    strip.text = element_text(size = 16),
    plot.title = element_text(size = 18, hjust = 0.5),
    legend.position = "bottom",
    panel.spacing = unit(0.2, "lines"),
    axis.text = element_blank(),
    axis.title = element_blank(),
    axis.ticks = element_blank()
  ) +
  coord_sf(datum = NA)

## 4. Export

### 4.1. 📁 To /data/ folder
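
To show the file naming pattern used for the export, here is a sketch that writes toy files to a temporary folder. The country code `"XXX"` and the data are placeholders, and base R `write.csv()` stands in for `readr::write_csv()`.

```r
# Placeholder values for illustration only
country_code <- "XXX"
output_dir <- file.path(tempdir(), "dhis2_reporting_rate")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

toy_rr <- data.frame(YEAR = 2023L, MONTH = 1L, ADM2_ID = "A", REPORTING_RATE = 0.84)

# One file per method: <COUNTRY_CODE>_reporting_rate_<method>_month.csv
for (method in c("dhis2", "any", "conf")) {
  file_path <- file.path(
    output_dir,
    paste0(country_code, "_reporting_rate_", method, "_month.csv")
  )
  write.csv(toy_rr, file_path, row.names = FALSE)  # base R stand-in for write_csv()
}

exported_files <- sort(list.files(output_dir, pattern = "_month\\.csv$"))
print(exported_files)
```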

#### CSV

In [None]:
# write function
snt_write_csv <- function(x, output_data_path, method) {
  
  full_directory_path <- file.path(output_data_path, "dhis2_reporting_rate")
  
  if (!dir.exists(full_directory_path)) {
    dir.create(full_directory_path, recursive = TRUE)
  }
  
  file_path <- file.path(full_directory_path, paste0(COUNTRY_CODE, "_reporting_rate_", method, "_month.csv"))
  
  write_csv(x, file_path)

  log_msg(paste0("Exported : ", file_path))
}

In [None]:
# Method "DHIS2"
snt_write_csv(x = reporting_rate_dhis2_month, 
              output_data_path = DATA_PATH, 
              method = "dhis2")

# Method "ANY"
snt_write_csv(x = reporting_rate_any_month, 
              output_data_path = DATA_PATH, 
              method = "any")

# Method "CONF"
snt_write_csv(x = reporting_rate_conf_month, 
              output_data_path = DATA_PATH, 
              method = "conf")

#### parquet

In [None]:
# write function
snt_write_parquet <- function(x, output_data_path, method) {
  
  full_directory_path <- file.path(output_data_path, "dhis2_reporting_rate")
  
  if (!dir.exists(full_directory_path)) {
    dir.create(full_directory_path, recursive = TRUE)
  }
  
  file_path <- file.path(full_directory_path, paste0(COUNTRY_CODE, "_reporting_rate_", method, "_month.parquet"))
  
  arrow::write_parquet(x, file_path)

  log_msg(paste0("Exported : ", file_path))
}

In [None]:
# Method "DHIS2"
snt_write_parquet(x = reporting_rate_dhis2_month,
                  output_data_path = DATA_PATH,
                  method = "dhis2"
                 )

# Method "ANY"
snt_write_parquet(x = reporting_rate_any_month,
                  output_data_path = DATA_PATH,
                  method = "any"
                 )

# Method "CONF"
snt_write_parquet(x = reporting_rate_conf_month,
                  output_data_path = DATA_PATH,
                  method = "conf"
                 )