# **DRC DHIS2 Data Quality Analysis and Incidence Calculations**


## Configuration

In [None]:
# Set SNT Paths
SNT_ROOT_PATH  <- "~/workspace"
CODE_PATH      <- file.path(SNT_ROOT_PATH, "code")
CONFIG_PATH    <- file.path(SNT_ROOT_PATH, "configuration")

# load util functions
source(file.path(CODE_PATH, "snt_utils.r"))

# List required packages 
required_packages <- c("dplyr", "tidyr", "terra", "ggplot2", "stringr", "lubridate", "viridis", "patchwork", "zoo", "purrr", "arrow", "sf", "reticulate")

# Execute function
install_and_load(required_packages)

# Set environment to load openhexa.sdk from the right environment
Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python")
reticulate::py_config()$python
openhexa <- import("openhexa.sdk")

# Load SNT config
config_json <- tryCatch({ jsonlite::fromJSON(file.path(CONFIG_PATH, "SNT_config.json"))},
    error = function(e) {
        msg <- paste0("Error while loading configuration: ", conditionMessage(e))
        cat(msg)   
        stop(msg) 
    })

# Configuration variables
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_EXTRACTS
COUNTRY_CODE <- config_json$SNT_CONFIG$COUNTRY_CODE
ADM_2 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_2)

In [2]:
# Helper: print the dimensions of a data frame
printdim <- function(df, name = deparse(substitute(df))) {
  cat("Dimensions of", name, ":", nrow(df), "rows x", ncol(df), "columns\n\n")
}

# Part 1: Completeness of Routine Health Facility Reporting by Data Element
## 1.1 Import data, check config completeness and standardize data element names

#### 🛠️ Part 1 belongs to the DHIS2 extraction pipeline and shows reporting logs that could be added to it: the number of active HFs (1.2), the check for missing months (1.3), and the figures and tables generated under 1.5, 1.6 and 1.7. Please include all of these.
#### 🛠️ Please add an internal check: are all indicators listed in the config actually extracted? Are all months actually extracted? E.g. if one specifies 202201 to 202501, have all months been extracted? This is just to show that the data is complete.
#### 🛠️ Please add a population data check (see with Giulia, who had a comment on this related to the monthly or yearly presence of data).

In [None]:
# import analytics DHIS2 data
data <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, paste0(COUNTRY_CODE, "_dhis2_raw_analytics.parquet")) }, 
                  error = function(e) {
                      msg <- paste0("Error while loading DHIS2 analytics file for ", COUNTRY_CODE, ": ", conditionMessage(e))
                      cat(msg)
                      stop(msg)
                      })
printdim(data)

In [None]:
# Standardize data element (DX_NAME) and category option (CO_NAME) names
standardize_name <- function(x) {
  x %>%
    str_trim() %>%                      # remove leading/trailing whitespace
    str_replace_all("\\s+", "_") %>%    # replace each run of whitespace with one underscore
    str_to_lower()                      # make lowercase
}

data <- data %>%
  mutate(across(c(DX_NAME, CO_NAME), standardize_name))

## 1.2 Number of active health facilities
A health facility is considered active if it reported a value (including zero) for at least one data element in at least one month of the extracted period. Facilities that never reported any value are counted as inactive.

In [None]:
# Total number of unique facilities using org unit ID
total_facilities <- data %>% pull(OU) %>% unique() %>% length()

# Check health facility activity: any reported value (including 0) counts as active
activity <- data %>%
  group_by(OU, PE) %>%
  summarise(active = any(!is.na(VALUE)), .groups = "drop")

# Number of facilities that were ever active
active_facilities <- activity %>%
  group_by(OU) %>%
  summarise(active_ever = any(active), .groups = "drop") %>%
  filter(active_ever) %>%
  nrow()

# Proportion
proportion_active <- 100 * active_facilities / total_facilities

# Print result
cat("Out of ", total_facilities,  " unique health facilities, ", active_facilities, 
    " were ever active (", round(proportion_active, 1), "%)\n", sep ='')

## 1.3 Check for missing months

In [None]:
# Min and max month in the dataset
cat("First month for which data is extracted:", min(data$PE), "\n")
cat("Last month for which data is extracted:", max(data$PE), "\n")
cat("Total Number of months for which data is extracted:", length(unique(data$PE)), "\n")

# Check for missing months (assuming monthly periods from min to max)
all_months <- seq(ymd(paste0(min(data$PE), "01")),
                  ymd(paste0(max(data$PE), "01")),
                  by = "1 month") %>%
              format("%Y%m")

# Which months are missing?
missing_months <- setdiff(all_months, as.character(unique(data$PE)))

if (length(missing_months) == 0) {
  cat("All months are present: no missing months in the dataset.\n")
} else {
  cat("⚠️ Missing months detected:", paste(missing_months, collapse = ", "), "\n")
}
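
The check above looks at months across the whole dataset, so a month that is missing for only one data element would go unnoticed. A per-indicator version could be sketched as below. This is a minimal illustration in base R on a made-up data frame (`toy` and its values are hypothetical; only the column names `DX_NAME` and `PE` come from this notebook), not the pipeline's actual check.

```r
# Toy stand-in for the extracted data (hypothetical values, notebook column names)
toy <- data.frame(
  DX_NAME = c("tdr_positif", "tdr_positif", "tdr_positif", "cas_suspects", "cas_suspects"),
  PE      = c("202201", "202202", "202203", "202201", "202203"),
  stringsAsFactors = FALSE
)

# Expected monthly periods between the first and last extracted month
first_month <- as.Date(paste0(min(toy$PE), "01"), format = "%Y%m%d")
last_month  <- as.Date(paste0(max(toy$PE), "01"), format = "%Y%m%d")
expected_months <- format(seq(first_month, last_month, by = "1 month"), "%Y%m")

# For each indicator, list the expected months that have no rows at all
missing_by_indicator <- lapply(
  split(as.character(toy$PE), toy$DX_NAME),
  function(pe) setdiff(expected_months, unique(pe))
)

# Keep only indicators with at least one missing month
missing_by_indicator <- Filter(length, missing_by_indicator)

print(missing_by_indicator)
```

With the toy data, only `cas_suspects` is flagged, for month 202202. Replacing `toy` with `data` would apply the same logic to the extract, assuming `PE` holds periods in `YYYYMM` form.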

## 1.4 Aggregate data elements by category option (co)

In [None]:
admin_levels <- colnames(data)[grepl("LEVEL_", colnames(data))]

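# Sum values across category options for each data element, facility and period.
# A group with no reported value stays NA so that section 1.5 can tell "missing" from "zero".
data_clean <- data %>%
  mutate(VALUE = as.numeric(VALUE)) %>%
  group_by(DX, OU, PE, DX_NAME, across(all_of(admin_levels))) %>%
  summarise(VALUE = if (all(is.na(VALUE))) NA_real_ else sum(VALUE, na.rm = TRUE),
            .groups = "drop")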
printdim(data_clean)

## 1.5 Proportion of Health Facilities Reporting Zero, NULL or Positive Values for Each Data Element

In [None]:
options(jupyter.plot_mimetypes = c("image/png"))

In [None]:
# --- STEP 1: Convert 'PE' to proper DATE
data_clean <- data_clean %>%
  mutate(
    PE = as.character(PE),
    DATE = as.Date(paste0(substr(PE, 1, 4), "-", substr(PE, 5, 6), "-01")),
    INDICATOR = DX_NAME  # alias for clarity
  )

# --- STEP 2: Build expected full grid (OU × INDICATOR × DATE)
full_grid <- expand_grid(
  OU = unique(data_clean$OU),
  INDICATOR = unique(data_clean$INDICATOR),
  DATE = unique(data_clean$DATE)
)

# --- STEP 3: Join to detect missing / zero / positive
reporting_check <- full_grid %>%
  left_join(
    data_clean %>% select(OU, INDICATOR, DATE, VALUE),
    by = c("OU", "INDICATOR", "DATE")
  ) %>%
  mutate(
    is_missing = is.na(VALUE),
    is_zero = VALUE == 0 & !is.na(VALUE),
    is_positive = VALUE > 0 & !is.na(VALUE)
  )

# --- STEP 4: Summarise by INDICATOR and date
reporting_summary <- reporting_check %>%
  group_by(INDICATOR, DATE) %>%
  summarise(
    n_total = n_distinct(OU),
    n_missing = sum(is_missing),
    n_zero = sum(is_zero),
    n_positive = sum(is_positive),
    pct_missing = ifelse(n_total > 0, 100 * n_missing / n_total, 0),
    pct_zero = ifelse(n_total > 0, 100 * n_zero / n_total, 0),
    pct_positive = ifelse(n_total > 0, 100 * n_positive / n_total, 0),
    .groups = "drop"
  )

# --- STEP 5: Reshape for stacked plot
plot_data <- reporting_summary %>%
  pivot_longer(
    cols = starts_with("pct_"),
    names_to = "Status", values_to = "Percentage"
  ) %>%
  mutate(
    Status = recode(Status,
                    pct_missing = "Missing",
                    pct_zero = "Zero reported",
                    pct_positive = "Positive reported")
  ) %>%
  complete(INDICATOR, DATE, Status, fill = list(Percentage = 0))

# --- STEP 6: Plot
options(repr.plot.width = 15, repr.plot.height = 15)

ggplot(plot_data, aes(x = DATE, y = Percentage, fill = Status)) +
  geom_col(position = "stack") +
  facet_wrap(~ INDICATOR, scales = "free_y", ncol = 4) +
  scale_y_continuous() +
  scale_fill_manual(values = c(
    "Missing" = "tomato",
    "Zero reported" = "skyblue",
    "Positive reported" = "green"
  )) +
  labs(
    title = "Health Facility Reporting Status by Data Element",
    x = "Month", y = "% of Facilities",
    fill = "Reporting Status"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18),
    strip.text = element_text(size = 14),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

## 1.6 Proportion of months reported for each data element per health facility

In [None]:
# Name column of the deepest administrative level available (the health facility)
name_cols <- grep("LEVEL_\\d+_NAME", admin_levels, value = TRUE)
max_level <- max(as.numeric(gsub("LEVEL_(\\d+)_NAME", "\\1", name_cols)))
max_admin_col_name <- paste0("LEVEL_", max_level, "_NAME")

# Count number of months reported for each indicator per facility
facility_coverage <- data_clean %>%
  group_by(OU, !!sym(max_admin_col_name), DX_NAME) %>%
  summarise(N_VALUES = sum(!is.na(VALUE)), .groups = "drop") %>%
  pivot_wider(names_from = DX_NAME, 
              values_from = N_VALUES, 
              values_fill = 0)

# Turn wide data back to long for plotting
facility_long <- facility_coverage %>%
  pivot_longer(
    cols = -c(OU, !!sym(max_admin_col_name)),
    names_to = "indicator",
    values_to = "months_reported"
  ) %>%
  mutate(percent_reported = (months_reported / length(unique(data$PE))) * 100) %>% 
  left_join(
    data %>% 
      select(OU, !!sym(ADM_2)) %>% 
      distinct(),
    by = "OU"
  )

# Heatmap: Indicators as rows, Health Facilities as columns
options(repr.plot.width = 15, repr.plot.height = 10)

# Use the org unit ID on the x-axis so facilities sharing the same name do not overlap
ggplot(facility_long, aes(x = OU, y = indicator, fill = percent_reported)) +
  geom_tile() +
  scale_fill_viridis_c(name = "% Reported", limits = c(0, 100)) +
  labs(
    title = "Reporting Completeness per Health Facility",
    x = "Health Facility",
    y = "Indicator"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),  # Hide x-axis labels if too many
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 18, face = "bold"),
    axis.title.x = element_text(size = 16),  
    axis.title.y = element_text(size = 16),
    panel.grid = element_blank()
  )

**Conclusion:** The aggregated data elements TDR réalisé, TDR positif, Paludisme simple confirmé traité and Cas suspects are reported consistently by most health facilities, while others, such as Paludisme présumé, appear to be consistently missing.

## 1.7 Summary Table of Reporting Completeness Per Indicator

In [None]:
total_facilities <- n_distinct(facility_coverage$OU)

element_summary <- facility_coverage %>%
  pivot_longer(cols = -c(OU, !!sym(max_admin_col_name)), names_to = "indicator", values_to = "months_reported") %>%
  group_by(indicator) %>%
  summarise(
    mean_n_months_reporting = round(mean(months_reported, na.rm = TRUE), 1),
    median_n_months_reporting = round(median(months_reported, na.rm = TRUE), 1),
    n_facilities_reporting = sum(months_reported > 0, na.rm = TRUE),
    prop_facilities_reporting = round(n_facilities_reporting / total_facilities * 100, 1),
    .groups = "drop"
  ) %>%
  arrange(mean_n_months_reporting)

element_summary