# **Dataset Reporting Rate: Calculation Based on DHIS2 Extracted Data**

The **reporting rate** measures the proportion of registered health facilities that submit data. It is calculated for each administrative level 2 (`ADM2`) area and for each reporting period (`PERIOD` in YYYYMM format).
<br>

**Dataset Selection**<br>
The choice of dataset(s) used for reporting rate calculation is controlled by modifying the <code>SNT_config.json</code> configuration file. This allows flexible selection among multiple datasets extracted from the same DHIS2 instance.

**Calculation Logic**<br>
From the selected dataset(s):
- **Numerator:** Number of facilities that _actually_ reported, derived from the element <code>"ACTUAL_REPORTS"</code>.
- **Denominator:** Number of facilities _expected_ to report, derived from the element <code>"EXPECTED_REPORTS"</code>.

After aggregating these counts at the ADM2 level, the reporting rate is computed as:
<br>
<code>REPORTING RATE = ACTUAL_REPORTS / EXPECTED_REPORTS</code>
<br>
and expressed as a **proportion** between 0 and 1.
<br>
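
As a small worked illustration of the calculation above, the sketch below aggregates facility-level counts by ADM2 and period, then divides. The data frame `toy_reports` and its values are invented for illustration; only the column names follow this notebook.

```r
library(dplyr)

# Hypothetical facility-level data: one row per facility and period
toy_reports <- data.frame(
  ADM2_ID          = c("D1", "D1", "D1", "D2", "D2"),
  PERIOD           = c(202401, 202401, 202401, 202401, 202401),
  ACTUAL_REPORTS   = c(1, 0, 1, 1, 1),
  EXPECTED_REPORTS = c(1, 1, 1, 1, 1)
)

# Sum both counts per ADM2 and period, then compute the proportion
toy_rates <- toy_reports %>%
  group_by(ADM2_ID, PERIOD) %>%
  summarise(
    ACTUAL_REPORTS   = sum(ACTUAL_REPORTS, na.rm = TRUE),
    EXPECTED_REPORTS = sum(EXPECTED_REPORTS, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(REPORTING_RATE = ACTUAL_REPORTS / EXPECTED_REPORTS)

toy_rates
```

In this toy example, D1 has 2 of 3 expected reports (rate 2/3) and D2 has 2 of 2 (rate 1).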

-----

### Additional Data Processing Steps

- **Handling Multiple Datasets:**  
  When multiple datasets are available, the pipeline uses only those specified in <code>SNT_config.json</code>. For these selected datasets, the counts of actual and expected reports are summed by ADM2 area.

- **Deduplication of Entries:**  
  Sometimes, the same organizational unit (<code>OU_ID</code>) may appear in multiple datasets for the same period, risking double counting. To address this, deduplication is performed by keeping only the entry with the **highest** <code>ACTUAL_REPORTS</code> value for each unique combination of <code>OU_ID</code> and <code>PERIOD</code>.  
  <ul>
    <li><strong>Why keep the highest?</strong> Because <code>ACTUAL_REPORTS</code> values are binary (0 or 1). If duplicates agree (all 0 or all 1), keeping one suffices. If they differ (some 0, some 1), keeping the 1 ensures that presence of a report is not missed.</li>
    <li><strong>üö®Important:</strong> Deduplication only proceeds if all duplicated values are within {0,1}. If other values are present, deduplication is skipped with a warning to avoid incorrect data handling.</li>
  </ul>
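
The deduplication rule can be sketched on toy data as follows. This is a minimal illustration, not necessarily the pipeline's exact code: the data frame `toy_wide` is invented, the binary check is applied to the whole toy table for simplicity, and `slice_max()` with `with_ties = FALSE` is just one way to keep a single row per `OU_ID` and `PERIOD`.

```r
library(dplyr)

# Hypothetical wide-format data: facilities "A" and "C" appear in two datasets
toy_wide <- data.frame(
  OU_ID            = c("A", "A", "B", "C", "C"),
  PERIOD           = c(202401, 202401, 202401, 202401, 202401),
  PRODUCT_UID      = c("ds1", "ds2", "ds1", "ds1", "ds2"),
  ACTUAL_REPORTS   = c(0, 1, 1, 1, 1),
  EXPECTED_REPORTS = c(1, 1, 1, 1, 1)
)

# Deduplicate only when every value is binary, as described above
is_binary <- all(toy_wide$ACTUAL_REPORTS %in% c(0, 1)) &&
  all(toy_wide$EXPECTED_REPORTS %in% c(0, 1))

if (is_binary) {
  # Keep one row per OU_ID and PERIOD: the one with the highest ACTUAL_REPORTS
  toy_dedup <- toy_wide %>%
    group_by(OU_ID, PERIOD) %>%
    slice_max(ACTUAL_REPORTS, n = 1, with_ties = FALSE) %>%
    ungroup()
} else {
  warning("Non-binary values found: deduplication skipped.")
  toy_dedup <- toy_wide
}

toy_dedup
```

Facility "A" keeps its entry with `ACTUAL_REPORTS = 1`, and facility "C", whose duplicates agree, keeps a single entry.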

-----


### üá≥üá™ <strong>Niger-Specific Processing:</strong>  
  In Niger, datasets for <strong>HOP</strong> (hospital) facilities are already **pre-aggregated** and may contain values greater than 1 for actual or expected reports, reflecting subunits or departments within a hospital. 
  <br>
  To accurately represent reporting at the facility level and avoid overcounting, all values greater than 1 are converted to 1 (presence/absence). This ensures that the reporting rate reflects whether the hospital as a whole reported, rather than counting multiple subunits separately. This step also prevents cases where <code>ACTUAL_REPORTS</code> exceeds <code>EXPECTED_REPORTS</code>.
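
A minimal sketch of this presence/absence conversion, using invented hospital rows (`toy_hop` is hypothetical). It uses base R `pmin()` as an equivalent alternative to an `ifelse()` replacement.

```r
# Hypothetical hospital rows with pre-aggregated counts
toy_hop <- data.frame(
  OU_ID            = c("H1", "H2", "H3"),
  ACTUAL_REPORTS   = c(3, 0, 1),
  EXPECTED_REPORTS = c(4, 2, 1)
)

# pmin() caps each value at 1 and leaves 0 and NA untouched
toy_hop$ACTUAL_REPORTS   <- pmin(toy_hop$ACTUAL_REPORTS, 1)
toy_hop$EXPECTED_REPORTS <- pmin(toy_hop$EXPECTED_REPORTS, 1)

toy_hop
```

After capping, no row can have `ACTUAL_REPORTS` greater than `EXPECTED_REPORTS` unless a hospital reported while none was expected.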

------

### Pipeline parameters

- **Outliers detection method**: Specify which method was used to detect outliers in routine data. Choose "Routine data (Raw)" to use raw routine data.
    
- **Use routine with outliers removed**: Toggle this on to use the routine data after outliers have been removed (using the outliers detection method selected above). Otherwise, the pipeline uses the imputed routine data (in which the removed outlier values are replaced by imputed values), or the raw routine data if you selected "Routine data (Raw)" as the outliers detection method.

## 1. Setup

In [None]:
# Project paths
SNT_ROOT_PATH <- "/home/hexa/workspace" 
CODE_PATH <- file.path(SNT_ROOT_PATH, 'code') 
CONFIG_PATH <- file.path(SNT_ROOT_PATH, 'configuration') 
DATA_PATH <- file.path(SNT_ROOT_PATH, 'data', 'dhis2')  

# Load utils
source(file.path(CODE_PATH, "snt_utils.r"))

# Load libraries 
required_packages <- c("arrow", "tidyverse", "glue", "jsonlite", "httr", "reticulate") 
install_and_load(required_packages)

# Environment variables
Sys.setenv(PROJ_LIB = "/opt/conda/share/proj")
Sys.setenv(GDAL_DATA = "/opt/conda/share/gdal")
Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python")

# Load OpenHEXA sdk
openhexa <- import("openhexa.sdk")

#### 1.1. Load and check `config_json` file

In [None]:
# Load SNT config
config_json <- tryCatch({ jsonlite::fromJSON(file.path(CONFIG_PATH, "SNT_config.json")) },
    error = function(e) {
        msg <- paste0("[ERROR] Error while loading configuration: ", conditionMessage(e))  
        cat(msg)   
        stop(msg) 
    })

log_msg(paste0("SNT configuration loaded from : ", file.path(CONFIG_PATH, "SNT_config.json")))

In [None]:
# Configuration settings
COUNTRY_CODE <- config_json$SNT_CONFIG$COUNTRY_CODE
ADMIN_1 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_1)
ADMIN_2 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_2)

# Which reporting rate PRODUCT_UID to use (DHIS2 dataset id)
REPORTING_RATE_PRODUCT_ID <- config_json$SNT_CONFIG$REPORTING_RATE_PRODUCT_UID  

fixed_cols_rr <- c('YEAR', 'MONTH', 'ADM2_ID', 'REPORTING_RATE') # Fixed cols for exporting RR tables

#### 1.2. Validate parameters

In [None]:
# default: raw routine
if (!exists("ROUTINE_FILE")) { ROUTINE_FILE <- glue::glue("{COUNTRY_CODE}_routine.parquet") }

#### 1.3. 🔍 Check REPORTING_RATE_PRODUCT_ID is configured

### 🐍 This should probably be moved to the pipeline.py code?

In [None]:
# Check if REPORTING_RATE_PRODUCT_ID is configured
if (is.null(REPORTING_RATE_PRODUCT_ID) || length(REPORTING_RATE_PRODUCT_ID) == 0) {
    log_msg("🚨 Warning: REPORTING_RATE_PRODUCT_ID is not configured properly in 'SNT_config.json'. 
    This will prevent filtering by reporting dataset, and all values will be retained.", level = "warning" )
}

## 2. Load Data

### 2.1. Load routine data (DHIS2) 
The already formatted routine data serves as the master table<br>
(it is only used at the very end, before exporting the table).

In [None]:
# select dataset
if (ROUTINE_FILE == glue::glue("{COUNTRY_CODE}_routine.parquet")) {
    rountine_dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED
} else {
    rountine_dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_OUTLIERS_IMPUTATION
}

# Load file from dataset
dhis2_routine <- tryCatch({ get_latest_dataset_file_in_memory(rountine_dataset_name, ROUTINE_FILE) }, 
                  error = function(e) {
                      msg <- paste("Error while loading DHIS2 routine data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

dhis2_routine <- dhis2_routine %>% mutate(across(c(PERIOD, YEAR, MONTH), as.numeric)) # Ensure correct data type for numerical columns 

# Subset data to keep only columns defined in fixed_cols_rr (if defined)
if (exists("fixed_cols_rr")) {
    dhis2_routine <- dhis2_routine %>% 
    select(any_of(fixed_cols_rr)) |> 
    distinct()
}

# log
log_msg(glue::glue("DHIS2 routine file {ROUTINE_FILE} loaded from dataset : {rountine_dataset_name} dataframe dimensions: {paste(dim(dhis2_routine), collapse=', ')}"))
dim(dhis2_routine)
head(dhis2_routine, 3)

### 2.2. Load Reporting Rate data (DHIS2)

In [None]:
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED
file_name <- paste0(COUNTRY_CODE, "_reporting.parquet")  # reporting rate file

# Load file from dataset
dhis2_reporting <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, file_name) }, 
                  error = function(e) {
                      msg <- paste("[ERROR] Error while loading DHIS2 dataset reporting rates file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})
dhis2_reporting <- dhis2_reporting %>% mutate(across(c(PERIOD, YEAR, MONTH, VALUE), as.numeric))  # numeric values

msg <- paste0("DHIS2 Dataset reporting data loaded from file `", file_name, "` (from dataset : `", dataset_name, "`). 
Dataframe dimensions: ", 
              paste(dim(dhis2_reporting), collapse=", "))
log_msg(msg)
head(dhis2_reporting, 3)

## 3. Transform reporting data

### 3.1. Filter Reporting Rate data by "Dataset" (`PRODUCT_UID`)
Logic:
* Value(s) (string) for `PRODUCT_UID` are defined in the `SNT_config.json` file
* If none provided (**empty** field) skip filtering and **keep everything**
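
The "empty means keep everything" rule can be sketched as a small helper. The function name `filter_by_product_uid` and the toy data are hypothetical, introduced only to illustrate the logic; the sketch does not treat an empty string (`""`) as "not configured".

```r
library(dplyr)

# Hypothetical helper: filter only when at least one UID is configured
filter_by_product_uid <- function(df, product_uids) {
  if (is.null(product_uids) || length(product_uids) == 0) {
    return(df)  # nothing configured: keep everything
  }
  df %>% filter(PRODUCT_UID %in% product_uids)
}

toy <- data.frame(PRODUCT_UID = c("ds1", "ds2", "ds1"), VALUE = c(1, 0, 1))

n_all <- nrow(filter_by_product_uid(toy, NULL))   # no UID configured
n_ds1 <- nrow(filter_by_product_uid(toy, "ds1"))  # one UID configured
```

The explicit length check matters because `x %in% NULL` is `FALSE` for every row, so filtering on an unconfigured value would otherwise drop all data.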

In [None]:
# Filter only if REPORTING_RATE_PRODUCT_ID is configured and all its values are present in the data;
# otherwise skip filtering (keep all) and log a warning
if (length(REPORTING_RATE_PRODUCT_ID) > 0 && all(REPORTING_RATE_PRODUCT_ID %in% unique(dhis2_reporting$PRODUCT_UID))) {
    n_rows_before <- nrow(dhis2_reporting)  # row count before filtering, used in the log message
    dhis2_reporting <- dhis2_reporting %>% filter(PRODUCT_UID %in% REPORTING_RATE_PRODUCT_ID)
    log_msg(glue::glue("🪮 Filtering DHIS2 reporting data to keep only values for REPORTING_RATE_PRODUCT_UID(s): {paste(REPORTING_RATE_PRODUCT_ID, collapse=', ')}.
    Removed {n_rows_before - nrow(dhis2_reporting)} rows.
    Dataframe dimensions after filtering: {paste(dim(dhis2_reporting), collapse=', ')}"))
} else {
    log_msg(glue::glue("🚨 Warning: REPORTING_RATE_PRODUCT_UID: {paste(REPORTING_RATE_PRODUCT_ID, collapse=', ')} not found in DHIS2 reporting data PRODUCT_UIDs: {paste(unique(dhis2_reporting$PRODUCT_UID), collapse=', ')}. 
    🦘 Skipping filtering and keeping all data. Dataframe dimensions: {paste(dim(dhis2_reporting), collapse=', ')}"), level = "warning")
}

### 3.2. Pivot wider

In [None]:
# Pivot wider to have one column per PRODUCT_METRIC (which now indicates whether the VALUE is "ACTUAL_REPORTS" or "EXPECTED_REPORTS")
dhis2_reporting_wide <- dhis2_reporting %>%
  pivot_wider(names_from = PRODUCT_METRIC, values_from = VALUE)

# Log msg
log_msg(glue::glue("Pivoted DHIS2 reporting data to wide format, with one column per PRODUCT_METRIC (ACTUAL_REPORTS, EXPECTED_REPORTS).
Dimensions after pivot: {paste(dim(dhis2_reporting_wide), collapse=', ')}"))

dim(dhis2_reporting_wide)
head(dhis2_reporting_wide, 3)

### 👯 Handle **duplicated** values (`OU_ID`)
Using multiple datasets relies on the **assumption** that **each dataset is complementary to the other(s)**. Namely, no org unit should be counted in more than one dataset; otherwise we would be **double counting**.

#### Check for duplicated values (`OU_ID`)

In [None]:
# Check if any OU_ID is present in more than one PRODUCT_UID
# and if so list them
ou_product_counts <- dhis2_reporting %>%
  group_by(OU_ID, OU_NAME) %>%
  mutate(PRODUCT_UID_count = n_distinct(PRODUCT_UID)) %>%
  filter(PRODUCT_UID_count > 1) %>%
  select(ADM1_NAME, ADM2_NAME, OU_ID, OU_NAME, PRODUCT_UID_count) %>%
  distinct() 

ou_product_counts

# Log msg: which OU_ID have multiple PRODUCT_UIDs
if (nrow(ou_product_counts) > 0) {
    log_msg(glue::glue("🚨 Warning: The following OU_IDs are associated with multiple PRODUCT_UIDs in the DHIS2 reporting data:
{paste(apply(ou_product_counts, 1, function(row) paste0(' - ', row['OU_NAME'], ' (', row['OU_ID'], ')')), collapse='\n')}"), 
    level = "warning")
} else {
    log_msg("All OU_IDs are associated with a single PRODUCT_UID in the DHIS2 reporting data.")
}

#### Remove duplicated OU_IDs (shared across PRODUCT_UIDs)
Logic: 
1. Identify any `OU_ID` that is present in more than one dataset for the same `PERIOD`
2. For these, keep `max(ACTUAL_REPORTS)` (since `EXPECTED_REPORTS` is always == 1) because: 
    * if the values agree (both 0 or both 1) => simply deduplicate (keep a single entry)
    * if the values differ, meaning that one dataset says 1 and the other 0 => keep 1 (the facility _did_ report)

In [None]:
# Step 1: check for duplicated OU_ID by PERIOD (there should be only 1 value of OU_ID per PERIOD)
dupl_ou_period <- dhis2_reporting_wide %>%
  group_by(OU_ID, PERIOD) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  select(OU_ID, OU_NAME, PERIOD, PRODUCT_UID, ends_with("REPORTS"))

# Log msg
if (nrow(dupl_ou_period) > 0) {
    log_msg(glue::glue("🚨 Warning: OU_IDs associated with multiple PRODUCT_UIDs affect {nrow(dupl_ou_period)} rows (OU_ID x PERIOD entries) in the DHIS2 reporting data."), level = "warning")
}

dim(dupl_ou_period)
head(dupl_ou_period, 5)

In [None]:
# Step 2: remove duplicated OU_ID by PERIOD
# Use the following logic:
# - 1. first, check that values (ACTUAL_REPORTS, EXPECTED_REPORTS) are all 0 or 1 (if not that needs to be handled differently, so skip for now)
# - 2. then, if multiple PRODUCT_UIDs exist for the same OU_ID and PERIOD, keep the one with the highest ACTUAL_REPORTS value
# (this is because if values agree, then we can simply keep one, if they don't agree, that means that we have 1 and 0 values, so we keep the 1)

if (all(dupl_ou_period$ACTUAL_REPORTS %in% c(0,1)) && all(dupl_ou_period$EXPECTED_REPORTS %in% c(0,1))) {
    # Keep exactly one row per OU_ID and PERIOD: the one with the highest ACTUAL_REPORTS.
    # with_ties = FALSE keeps a single row even when the duplicated entries agree (both 0 or both 1),
    # which a filter on the maximum value alone would not do.
    dhis2_reporting_wide <- dhis2_reporting_wide %>%
        group_by(PERIOD, OU_ID) %>%
        slice_max(ACTUAL_REPORTS, n = 1, with_ties = FALSE) %>%
        ungroup()

    log_msg(glue::glue("✅ Deduplicated DHIS2 reporting data by keeping only one PRODUCT_UID per OU_ID and PERIOD, based on highest ACTUAL_REPORTS value.
    Dataframe dimensions after deduplication: {paste(dim(dhis2_reporting_wide), collapse=', ')}"))
} else {
    log_msg("🚨 Warning: Cannot deduplicate OU_ID by PERIOD in DHIS2 reporting data because ACTUAL_REPORTS or EXPECTED_REPORTS contain values other than 0 or 1. 
    Analysis will continue without removing duplicated entries.", level = "warning")
}   

dim(dhis2_reporting_wide)
head(dhis2_reporting_wide, 3)

### 3.3. (üá≥üá™ NER only) Make HOP aggregated values (0, >1) into presence/absence (0, 1)
Specific for Niger SNIS instance!<br>
Values for dataset HOP ("ki7YKOfyxjf" = "HOP 03 ACTIVITES DE LUTTE CONTRE LE PALUDISME") count the individual "sub-units" (departments, etc ... ) of a given hospital and therefore can have values >1.<br>
For consistency with CSI (where all values are raw, and therefore only 0 and 1), we need to convert all HOP value >1 into 1.

In [None]:
# Modify dhis2_reporting_wide to replace all values of ACTUAL_REPORTS and EXPECTED_REPORTS that are >1 with 1
if (COUNTRY_CODE == "NER") {
  log_msg("🇳🇪 Special handling for NER: replacing all values of ACTUAL_REPORTS and EXPECTED_REPORTS that are >1 with 1.")

  # Check if any values >1 exist
  n_actual_reports_gt1 <- sum(dhis2_reporting_wide$ACTUAL_REPORTS > 1, na.rm = TRUE)
  n_expected_reports_gt1 <- sum(dhis2_reporting_wide$EXPECTED_REPORTS > 1, na.rm = TRUE)

  # Extract the PRODUCT_UID and PRODUCT_NAME associated with those values
  if (n_actual_reports_gt1 > 0 | n_expected_reports_gt1 > 0) {
    dupl_actual_reports <- dhis2_reporting_wide %>%
      filter(ACTUAL_REPORTS > 1) %>%
      select(PRODUCT_UID, PRODUCT_NAME) %>%
      distinct()

    log_msg(glue::glue("Note: Found {n_actual_reports_gt1} entries with ACTUAL_REPORTS > 1 and {n_expected_reports_gt1} entries with EXPECTED_REPORTS > 1.
Affected PRODUCT_UIDs and PRODUCT_NAMEs for ACTUAL_REPORTS > 1:
{paste(apply(dupl_actual_reports, 1, function(row) paste0(row['PRODUCT_NAME'], ' (', row['PRODUCT_UID'], ')')), collapse='\n')}"))

    dhis2_reporting_wide <- dhis2_reporting_wide %>%
      mutate(
        ACTUAL_REPORTS = ifelse(ACTUAL_REPORTS > 1, 1, ACTUAL_REPORTS),
        EXPECTED_REPORTS = ifelse(EXPECTED_REPORTS > 1, 1, EXPECTED_REPORTS)
      )

    log_msg("✅ Replaced all values of ACTUAL_REPORTS and EXPECTED_REPORTS that were >1 with 1.")

  } # else nothing to replace

  dim(dhis2_reporting_wide)
  head(dhis2_reporting_wide, 3)
}

### 3.4. Aggregate at ADM2 level

In [None]:
# Sum up values (now at facility level) to get totals per ADM2_ID and PERIOD
dhis2_reporting_wide_adm2 <- dhis2_reporting_wide %>%
  group_by(
    PERIOD, 
    YEAR, MONTH, # keep these just for sanity check (not needed for grouping)
    ADM1_NAME, ADM1_ID, # keep these just for sanity check (not needed for grouping)
    ADM2_NAME, ADM2_ID
  ) %>%
  summarise(
    ACTUAL_REPORTS = sum(ACTUAL_REPORTS, na.rm = TRUE),
    EXPECTED_REPORTS = sum(EXPECTED_REPORTS, na.rm = TRUE),
    .groups = 'drop'
  ) 

# Add log messages
log_msg(glue::glue("DHIS2 reporting data pivoted to wide format and aggregated at ADM2 level. 
Dataframe dimensions: {paste(dim(dhis2_reporting_wide_adm2), collapse=', ')}"))
head(dhis2_reporting_wide_adm2, 3)

### 3.5. Calculate REPORTING_RATE
**numerator**: `ACTUAL_REPORTS`<br>
**denominator**: `EXPECTED_REPORTS`
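
If an ADM2 area has no expected reports in a period, the division yields `NaN` (0/0) or `Inf`. One possible guard, shown here on invented data (`toy_adm2` is hypothetical), is to return `NA` in that case. Whether `NA` is the right choice is an analysis decision, so treat this as a sketch rather than the pipeline's behavior.

```r
library(dplyr)

# Hypothetical ADM2-level totals; D2 has no expected reports
toy_adm2 <- data.frame(
  ADM2_ID          = c("D1", "D2"),
  ACTUAL_REPORTS   = c(2, 0),
  EXPECTED_REPORTS = c(4, 0)
)

# Divide only when the denominator is positive, otherwise return NA
toy_adm2 <- toy_adm2 %>%
  mutate(REPORTING_RATE = if_else(EXPECTED_REPORTS > 0,
                                  ACTUAL_REPORTS / EXPECTED_REPORTS,
                                  NA_real_))

toy_adm2
```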

In [None]:
# Calculate REPORTING_RATE as ACTUAL_REPORTS / EXPECTED_REPORTS
reporting_rate_results <- dhis2_reporting_wide_adm2 %>%
  mutate(REPORTING_RATE = ACTUAL_REPORTS / EXPECTED_REPORTS)

log_msg(glue::glue("DHIS2 reporting rate calculated as ACTUAL_REPORTS / EXPECTED_REPORTS. Dataframe dimensions: {paste(dim(reporting_rate_results), collapse=', ')}"))
head(reporting_rate_results, 3)  

### 3.6. Ensure consistency of table (probably can skip because all data comes from the same source!)
Left join reporting indicators with DHIS2 routine data.
Make sure we have a consistent reporting rates table matching periods x org units (safety measure only).
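
To illustrate what the left join guarantees, the toy example below (all data invented) keeps every row of the master table and leaves `REPORTING_RATE` as `NA` where no reporting rate is available.

```r
library(dplyr)

# Hypothetical master table (periods x ADM2) and reporting rate results
toy_master <- data.frame(YEAR = c(2024, 2024), MONTH = c(1, 1), ADM2_ID = c("D1", "D2"))
toy_rr     <- data.frame(YEAR = 2024, MONTH = 1, ADM2_ID = "D1", REPORTING_RATE = 0.8)

# Every master row is kept; unmatched rows get NA
toy_joined <- left_join(toy_master, toy_rr, by = c("YEAR", "MONTH", "ADM2_ID"))

n_missing <- sum(is.na(toy_joined$REPORTING_RATE))
```

Counting the `NA` values after the join is a quick way to spot periods or areas missing from the reporting data.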

In [None]:
reporting_rate_dataset <- left_join(dhis2_routine, 
                              reporting_rate_results %>% select(all_of(fixed_cols_rr)), 
                              by=c("YEAR", "MONTH", "ADM2_ID"))

print(dim(reporting_rate_dataset))
head(reporting_rate_dataset, 3)

### 3.7. Final visual check on REPORTING_RATE values

In [None]:
# Add log message to communicate range of REPORTING_RATE values and warn if any values are outside [0,1]
min_rr <- min(reporting_rate_dataset$REPORTING_RATE, na.rm = TRUE)
max_rr <- max(reporting_rate_dataset$REPORTING_RATE, na.rm = TRUE)
if (min_rr < 0 | max_rr > 1) {  
    log_msg(glue::glue("🚨 Warning: REPORTING_RATE values are outside the expected range [0,1]. 
    Minimum REPORTING_RATE: {round(min_rr, 4)}, Maximum REPORTING_RATE: {round(max_rr, 4)}"), level = "warning")
} else {
    log_msg(glue::glue("✅ REPORTING_RATE values are within the expected range [0,1]. 
    Minimum REPORTING_RATE: {round(min_rr, 4)}, Maximum REPORTING_RATE: {round(max_rr, 4)}"))
}

In [None]:
# Simple plot to visualize distribution of REPORTING_RATE
ggplot(reporting_rate_dataset, aes(x=REPORTING_RATE)) +
  geom_histogram() +
  labs(
    x="Dataset Reporting Rate", y="Frequency",
    title = glue::glue("Reporting rate values range from {round(min(reporting_rate_dataset$REPORTING_RATE, na.rm = TRUE), 2)} to {round(max(reporting_rate_dataset$REPORTING_RATE, na.rm = TRUE), 2)}")
  ) +
  theme_minimal()

## 4. 📁 Export to `data/` folder
Export as both .csv and .parquet file formats.

In [None]:
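output_data_path <- file.path(DATA_PATH, "reporting_rate")
dir.create(output_data_path, recursive = TRUE, showWarnings = FALSE)  # make sure the output folder exists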

# parquet
file_path <- file.path(output_data_path, paste0(COUNTRY_CODE, "_reporting_rate_dataset.parquet")) 
write_parquet(reporting_rate_dataset, file_path)
log_msg(glue("Exported : {file_path}"))

# csv
file_path <- file.path(output_data_path, paste0(COUNTRY_CODE, "_reporting_rate_dataset.csv"))
write.csv(reporting_rate_dataset, file_path, row.names = FALSE)
log_msg(glue("Exported : {file_path}"))