# Outliers **removal** and **imputation** 

**Input**:
* DHIS2 **routine data** (formatted and aligned as usual).
    * From Dataset **SNT_DHIS2_FORMATTED** as `XXX_routine.parquet` 
* Table of outliers: contains only the values flagged as outliers by a single outlier detection method. The method is chosen by the user (see Parameters).
    * From Dataset **DHIS2_OUTLIERS_DETECTION** as `XXX_outlier_<method_name>.parquet`

**Output**:
* DHIS2 **routine data _without_ outliers**: outliers are simply removed.
    * To Dataset **DHIS2_OUTLIERS_REMOVAL_IMPUTATION** as `XXX_routine_outliers-<method_name>_removed.parquet` (e.g., `COD_routine_outliers-mean3sd_removed.parquet`)
* DHIS2 **routine data** _without_ outliers & **_with_ imputed values**: values flagged as outliers are replaced by imputed values.
    * To Dataset **DHIS2_OUTLIERS_REMOVAL_IMPUTATION** as `XXX_routine_outliers-<method_name>_imputed.parquet`
    * Imputation uses the moving average as per:
  ```
  rollapply(TO_IMPUTE, # zoo::rollapply()
                               width = 3, 
                               FUN = function(x) ceiling(mean(x, na.rm = TRUE)), 
                               fill = NA, 
                               align = "center")
  ```

-------------------------------

**Parameters**:
* **How to treat outliers**:
    * remove
    * remove & replace with imputed values
<br>
<br>
* **Choice of outliers detection method**:
    * mean _n_ * SD
    * median _n_ * MAD
    * _n_ * IQR
    * Magic Glasses "Partial": file called `*_outlier_magic_glasses_partial_*` (MAD15 >> MAD10)
    * Magic Glasses "Complete": file called `*_outlier_magic_glasses_complete_*` (MAD15 >> MAD10 >> seasonal5 >> seasonal3)

---------------------------

## 0. Parameters

👇 these are now ⚡**pipeline parameters**⚡!

In [None]:
# OUTLIER_METHOD <- "mean3sd"

#### Set default values **if _not_ provided by pipeline**
This makes the execution flexible and "safe": the notebook can be run manually from here or executed via the pipeline, without changing anything in the code!

🚨 **Important**: always use the `if (!exists("PARAMETER_NAME")) { ... }` pattern to avoid overwriting the parameters injected by Papermill (into a special cell at the top of the notebook, see the OUTPUT notebook)

In [None]:
# Set BACKUP VALUE: name of config file to use
if (!exists("CONFIG_FILE_NAME")) {
  CONFIG_FILE_NAME <- "SNT_config.json"  # Default if not provided by pipeline
}

# Set BACKUP VALUE: outlier detection method
if (!exists("OUTLIER_METHOD")) {
  OUTLIER_METHOD <- "median3mad"  # other options: "mean3sd", "iqr1-5", "magic_glasses_partial", "magic_glasses_complete" 
}                              # ⚠️ TO DO: in pipeline.py, make this choice dynamic based on the files available in the Dataset (subset the part of the filename that indicates the method)
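
The TO DO above could be approached as follows. This is only a sketch in R (the same idea would have to be ported to `pipeline.py`): the file listing is made up, and the naming pattern `XXX_outlier_<method_name>.parquet` is assumed from the file name built later in this notebook.

```r
# Hypothetical file listing; in practice this would come from the Dataset
available_files <- c("COD_outlier_mean3sd.parquet",
                     "COD_outlier_iqr1-5.parquet",
                     "COD_outlier_magic_glasses_partial.parquet",
                     "COD_routine.parquet")

# Keep only the outlier tables, then strip the prefix and the extension
pattern <- "^[A-Za-z]+_outlier_(.+)\\.parquet$"
outlier_files <- grep(pattern, available_files, value = TRUE)
available_methods <- sub(pattern, "\\1", outlier_files)

print(available_methods)
```

The resulting vector could then be offered as the list of valid choices for `OUTLIER_METHOD`.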
                               

## 1. Setup

### 1.1. Paths

In [None]:
# Set BACKUP VALUE: root path - NEVER CHANGE THIS!
if (!exists("ROOT_PATH")) {
  ROOT_PATH <- "/home/hexa/workspace"  
}

In [None]:
# PROJECT PATHS

# Project folders
CODE_PATH <- file.path(ROOT_PATH, 'code') # this is where we store snt_functions.r and snt_utils.r
CONFIG_PATH <- file.path(ROOT_PATH, 'configuration') # .json config file
DATA_PATH <- file.path(ROOT_PATH, 'data') # same as in Datasets but /data/ gets overwritten every time a new version of Datasets is pushed

print(CODE_PATH)

### 1.2. Utils functions

In [None]:
source(file.path(CODE_PATH, "snt_utils.r"))

### 1.3. Packages

In [None]:
# List required packages
required_packages <- c("arrow", # for .parquet
                       "tidyverse",
                       "stringi", 
                       # "sf",
                       # "forecast",
                       "zoo",
                       "jsonlite", 
                       "httr", 
                       "reticulate")

# Execute function
install_and_load(required_packages)

#### Set environment to load openhexa.sdk from the right path

In [None]:
# Set environment to load openhexa.sdk from the right path
Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python")
reticulate::py_config()$python
openhexa <- import("openhexa.sdk")

### 1.4. Load and check `config` file

In [None]:
# Load SNT config

config_json <- tryCatch({
        fromJSON(file.path(CONFIG_PATH, CONFIG_FILE_NAME)) # "SNT_config.json"
    },
    error = function(e) {
        msg <- paste0("Error while loading configuration: ", conditionMessage(e))
        cat(msg)   
        stop(msg) 
    })

msg <- paste0("SNT configuration loaded from  : ", file.path(CONFIG_PATH, CONFIG_FILE_NAME)) 
log_msg(msg)

#### **Checks for SNT mandatory configuration fields**

In [None]:
# CHECK SNT configuration 
snt_config_mandatory <- c("COUNTRY_CODE", "DHIS2_ADMINISTRATION_1", "DHIS2_ADMINISTRATION_2") 

for (conf in snt_config_mandatory) {
    print(paste(conf, ":", config_json$SNT_CONFIG[conf]))
    if (is.null(config_json$SNT_CONFIG[[conf]])) {
        msg <- paste("Missing configuration input:", conf)
        cat(msg)   
        stop(msg)
    }
}
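
The loop above stops at the first missing field. As a possible variant, the sketch below collects all missing fields in one pass so they can be reported together. The `config_demo` list is a made-up stand-in for the parsed JSON, not the real configuration.

```r
# Toy config mimicking the structure of the parsed SNT config (values are made up)
config_demo <- list(SNT_CONFIG = list(COUNTRY_CODE = "COD",
                                      DHIS2_ADMINISTRATION_1 = "level_2_name"))
mandatory <- c("COUNTRY_CODE", "DHIS2_ADMINISTRATION_1", "DHIS2_ADMINISTRATION_2")

# A field counts as missing when it is absent (NULL) in the parsed list
is_missing <- vapply(mandatory,
                     function(field) is.null(config_demo$SNT_CONFIG[[field]]),
                     logical(1))
missing_fields <- mandatory[is_missing]

print(missing_fields)
```

In the notebook, a non-empty `missing_fields` would then be passed to `stop()` in a single message.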

#### **Save config fields as variables**

In [None]:
# Generic
COUNTRY_CODE <- config_json$SNT_CONFIG$COUNTRY_CODE
ADMIN_1 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_1)
ADMIN_2 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_2)

# Which (aggregated) indicators to use to evaluate "activity" of an HF - for Reporting Rate method "ANY"
DHIS2_INDICATORS <- names(config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS)

In [None]:
DHIS2_INDICATORS

In [None]:
# Fixed routine formatting columns
# Must keep and use `OU_ID` as it contains unique ids (OU_NAME has homonymous values!)
fixed_cols <- c('PERIOD', 'YEAR', 'MONTH', 'ADM1_ID', 'ADM2_ID', 'OU_ID') # keep ADMX_ID only! 

print(paste("Fixed routine data ('dhis2_routine') columns (always expected): ", paste(fixed_cols, collapse=", ")))

## 2. Load Data

### 2.1. **Routine** data (DHIS2) 
already formatted & aggregated<br>
(output of pipeline "XXX" and stored in Dataset "**SNT_DHIS2_FORMATTED**")

In [None]:
# DHIS2 Dataset extract identifier
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED

# Load file from dataset
dhis2_routine <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, paste0(COUNTRY_CODE, "_routine.parquet")) }, 
                  error = function(e) {
                      msg <- paste("[ERROR] Error while loading DHIS2 routine data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

msg <- paste0("DHIS2 routine data loaded from dataset : ", dataset_name, " dataframe dimensions: ", paste(dim(dhis2_routine), collapse=", "))
log_msg(msg)

In [None]:
head(dhis2_routine, 3)

#### Checks on routine data cols

**fixed_cols**

In [None]:
# Check if all "fixed" cols are present in dhis2_routine

actual_cols <- colnames(dhis2_routine) # dhis2_routine
missing_cols <- setdiff(fixed_cols, actual_cols) # Columns in fixed_cols but not in actual_cols

# Check if all required columns are present
all_present <- length(missing_cols) == 0

if (all_present) { 
  log_msg(paste0("Success: The 'dhis2_routine' tibble contains all the expected 'fixed' columns: ",
                paste(fixed_cols, collapse = ", "), "."))
} else {
    log_msg(paste0(
      "🛑 Missing Columns: The following required columns are NOT present in 'dhis2_routine': ",
      paste(missing_cols, collapse = ", "),
      "."
    ), "warning")
  }

**DHIS2_INDICATORS**

In [None]:
# Check if all "DHIS2_INDICATORS" cols are present in dhis2_routine

actual_cols <- colnames(dhis2_routine) # dhis2_routine
missing_cols <- setdiff(DHIS2_INDICATORS, actual_cols) # Columns in DHIS2_INDICATORS but not in actual_cols
all_present <- length(missing_cols) == 0

if (all_present) { 
  log_msg(paste0("Success: The 'dhis2_routine' tibble contains all the expected 'DHIS2_INDICATORS' columns: ",
                paste(DHIS2_INDICATORS, collapse = ", "), "."))
} else {
    log_msg(paste0(
      "🚨 Missing Columns: The following columns for DHIS2 INDICATORS are NOT present in 'dhis2_routine': ",
      paste(missing_cols, collapse = ", "),
      ". 🚨 Looks like the ", CONFIG_FILE_NAME, " file was modified after extraction. 🚨 The analysis will continue WITHOUT the missing indicators."
    ), "warning")
}

### 2.2. Outliers table

In [None]:
# DHIS2 Dataset extract identifier
dataset_name_outliers <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_OUTLIERS_DETECTION

# Load file from dataset
outliers <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name_outliers, paste0(COUNTRY_CODE, "_outlier_", OUTLIER_METHOD, ".parquet")) }, # COD_outlier_iqr1-5.parquet
                  error = function(e) {
                      msg <- paste("[ERROR] Error while loading outliers table file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      cat(msg)
                      stop(msg)
})

msg <- paste0("Table with outlier values loaded from dataset : ", dataset_name_outliers, " dataframe dimensions: ", paste(dim(outliers), collapse=", "))
log_msg(msg)

In [None]:
print(dim(outliers))
head(outliers, 3)

# 3. Remove Outliers

### Pivot routine data 
Make long

In [None]:
# Pivot `indicator_cols` = all columns except fixed columns
# indicator_cols <- colnames(dhis2_routine)[!(names(dhis2_routine) %in% fixed_cols)] # problem: includes `OU_NAME`
indicator_cols <- names(config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS) # instead, just extract the list of indicator column names from the config file

# pivot routine data
dhis2_routine_long <- dhis2_routine %>%
    pivot_longer(#cols = all_of(indicator_cols),
                 cols = any_of(indicator_cols),  # ⚠️ TEMP switch as config.json was changed but not extracted data (some cols are missing) ⚠️
                 names_to = "INDICATOR",
                 values_to = "VALUE") 
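
To illustrate the difference between `any_of()` and `all_of()` mentioned in the comment above, here is a minimal sketch on made-up data. The table `routine_demo` and the indicator names `CONF` and `SUSP` are illustrative only.

```r
library(tidyr)

# Toy routine table: only CONF is present, SUSP is listed but absent
routine_demo <- data.frame(PERIOD = c("202401", "202402"),
                           OU_ID = c("a1", "a1"),
                           CONF = c(5, 7))
indicator_cols_demo <- c("CONF", "SUSP")

# any_of() silently skips the columns that do not exist
long_demo <- pivot_longer(routine_demo,
                          cols = any_of(indicator_cols_demo),
                          names_to = "INDICATOR",
                          values_to = "VALUE")
print(long_demo)

# all_of() raises an error as soon as one listed column is missing
strict_result <- tryCatch(
  pivot_longer(routine_demo,
               cols = all_of(indicator_cols_demo),
               names_to = "INDICATOR",
               values_to = "VALUE"),
  error = function(e) "error: missing column"
)
print(strict_result)
```

The trade-off: `any_of()` lets the analysis continue without the missing indicators, while `all_of()` would surface a mismatch between the config file and the extracted data immediately.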

In [None]:
print(dim(dhis2_routine_long))
head(dhis2_routine_long,3)

### Format the **outliers** table: rename col `OUTLIER_<method_name>` to `OUTLIER`
🚨 Necessary because the **col flagging** the **outlier** values has a **different name** for each outlier detection method output <br>
(💡 consider changing this upstream ... )

In [None]:
outliers <- outliers %>%
  rename_with(~ "OUTLIER", starts_with("OUTLIER_"))
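
The rename above assumes that exactly one col starts with `OUTLIER_`. A minimal base R sketch that makes this assumption explicit before renaming is shown below. The table `outliers_demo` and the col name `OUTLIER_MEAN3SD` are made up for illustration.

```r
# Toy outliers table; the flag column name is illustrative
outliers_demo <- data.frame(OU_ID = "a1",
                            INDICATOR = "CONF",
                            VALUE = 900,
                            OUTLIER_MEAN3SD = TRUE)

# Find the flag column(s) and insist on exactly one match
flag_cols <- grep("^OUTLIER_", names(outliers_demo), value = TRUE)
if (length(flag_cols) != 1) {
  stop("Expected exactly one 'OUTLIER_*' column, found ", length(flag_cols))
}

# Rename the single flag column to the generic name
names(outliers_demo)[names(outliers_demo) == flag_cols] <- "OUTLIER"
print(names(outliers_demo))
```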

In [None]:
names(outliers)

### Join outliers on routine data
This adds the col `OUTLIER` to the (pivoted long) routine data.
After the join, values that are not outliers have `NA` in this col. These `NA`s need to be replaced by `FALSE`.

In [None]:
# Join
dhis2_routine_long_outliers <- left_join(dhis2_routine_long, outliers, 
                                         by = c(fixed_cols, "INDICATOR", "VALUE"))

# In col `OUTLIER`, replace `NA`s with `FALSE`
dhis2_routine_long_outliers <- dhis2_routine_long_outliers %>%
    mutate(OUTLIER = if_else(is.na(OUTLIER), FALSE, OUTLIER))

print(dim(dhis2_routine_long_outliers))
head(dhis2_routine_long_outliers, 3)
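
A minimal sketch of the join and flag clean-up on made-up data. The tables and values below are illustrative, and `coalesce()` is used as a shorthand that should behave like the `if_else(is.na(...))` call above.

```r
library(dplyr)

# Toy long routine data: one facility, one indicator, three months
routine_long_demo <- tibble(OU_ID = "a1",
                            PERIOD = c("202401", "202402", "202403"),
                            INDICATOR = "CONF",
                            VALUE = c(10, 900, 12))

# Toy outliers table: only the flagged value is listed
outliers_demo <- tibble(OU_ID = "a1",
                        PERIOD = "202402",
                        INDICATOR = "CONF",
                        VALUE = 900,
                        OUTLIER = TRUE)

joined_demo <- routine_long_demo %>%
  left_join(outliers_demo, by = c("OU_ID", "PERIOD", "INDICATOR", "VALUE")) %>%
  mutate(OUTLIER = coalesce(OUTLIER, FALSE))  # unmatched rows get FALSE

print(joined_demo)
```

Because the outliers table lists only flagged values, every unmatched row ends up with `OUTLIER == FALSE`, and the number of rows stays unchanged as long as the outliers table has no duplicated keys.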

### Remove outliers
This was not in the original/older code (BFA). Added now as option: just remove all values flagged as outliers.
<br>
Later: **Export** as `XXX_routine_outliers-<method_name>_removed`.csv/parquet 

#### **Remove** outliers & pivot wider

In [None]:
# was `routine_removed_outliers` now `dhis2_routine_long_outliers_removed`
dhis2_routine_long_outliers_removed <- dhis2_routine_long_outliers %>%
    filter(OUTLIER == FALSE)

In [None]:
print(dim(dhis2_routine_long_outliers_removed))
head(dhis2_routine_long_outliers_removed, 3)

In [None]:
# When removing outliers, how many values are removed (= nr of values converted to NA)
nr_of_outliers <- dhis2_routine_long_outliers %>% filter(OUTLIER == TRUE) %>% nrow()
nr_of_values <- dhis2_routine_long_outliers %>% nrow()
perc_outliers <- round(nr_of_outliers/nr_of_values * 100, 2)

msg <- paste0(
  "Using outliers detection method *", OUTLIER_METHOD,
  "*. Identified and removed ", nr_of_outliers,
  " outliers (", perc_outliers, "% of values)."
)

log_msg(msg) 

In [None]:
dhis2_routine_outliers_removed <- dhis2_routine_long_outliers_removed %>%
    select(all_of(fixed_cols), INDICATOR, VALUE) %>% # keeps: PERIOD, YEAR, MONTH, ADM1_ID, ADM2_ID, OU_ID, INDICATOR, VALUE
    pivot_wider(names_from = "INDICATOR", values_from = "VALUE") 
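
To show why removed outliers end up as `NA` in the wide table, here is a minimal sketch on made-up data (the table `long_demo` and its values are illustrative):

```r
library(dplyr)
library(tidyr)

# Toy long table: two months, two indicators, one flagged value
long_demo <- tibble(OU_ID = "a1",
                    PERIOD = c("202401", "202401", "202402", "202402"),
                    INDICATOR = c("CONF", "SUSP", "CONF", "SUSP"),
                    VALUE = c(10, 20, 900, 22),
                    OUTLIER = c(FALSE, FALSE, TRUE, FALSE))

wide_demo <- long_demo %>%
  filter(!OUTLIER) %>%                          # drop the flagged value
  select(OU_ID, PERIOD, INDICATOR, VALUE) %>%
  pivot_wider(names_from = "INDICATOR", values_from = "VALUE")

print(wide_demo)
```

The cell for `CONF` in `202402` has no source row any more, so `pivot_wider()` fills it with `NA`. One consequence worth checking: if every indicator of a given facility and period were flagged, that row would drop out of the wide table entirely.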

In [None]:
print(dim(dhis2_routine_outliers_removed))
head(dhis2_routine_outliers_removed, 3)

# 4. Impute values: replace outliers based on rolling average

### Create col `TO_IMPUTE` 
Create `TO_IMPUTE` col from `VALUE`, and set to `NA` if value is flagged as outlier (namely, when `OUTLIER == TRUE`)

In [None]:
dhis2_routine_long_outliers_imputed <- dhis2_routine_long_outliers %>%
  mutate(TO_IMPUTE = VALUE,
         TO_IMPUTE = if_else(OUTLIER == TRUE, NA_real_, TO_IMPUTE)) 

head(dhis2_routine_long_outliers_imputed, 3)

In [None]:
# Sanity check
dhis2_routine_long_outliers_imputed %>% filter(OUTLIER == TRUE) %>% head(., 3)

### Impute: Calculate rolling average

In [None]:
log_msg("Calculating rolling average")

dhis2_routine_long_outliers_imputed <- dhis2_routine_long_outliers_imputed %>%
    arrange(ADM1_ID, ADM2_ID, OU_ID, INDICATOR, PERIOD) %>%  # Ensure proper order
    group_by(ADM1_ID, ADM2_ID, OU_ID, INDICATOR) %>%  # Group data
    mutate(
        MOVING_AVG = zoo::rollapply(TO_IMPUTE, 
                                    width = 3, 
                                    FUN = function(x) ceiling(mean(x, na.rm = TRUE)), 
                                    fill = NA, 
                                    align = "center"),
        VALUE_IMPUTED = ifelse(is.na(TO_IMPUTE), MOVING_AVG, TO_IMPUTE)  # Replace NA with MOVING_AVG
    ) %>%  
    ungroup()
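
To see how the centred 3-value window behaves, here is a minimal sketch on a made-up monthly series. The vector `to_impute` is illustrative, with `NA` standing in for a value flagged as an outlier.

```r
library(zoo)

# Toy monthly series for one facility and one indicator
to_impute <- c(10, 12, NA, 14, 16)

# Centred window of width 3: previous, current and next month
moving_avg <- rollapply(to_impute,
                        width = 3,
                        FUN = function(x) ceiling(mean(x, na.rm = TRUE)),
                        fill = NA,
                        align = "center")

# Keep observed values, fill the gaps with the moving average
value_imputed <- ifelse(is.na(to_impute), moving_avg, to_impute)

print(moving_avg)
print(value_imputed)
```

The missing third value is replaced by the rounded-up mean of its two neighbours (12 and 14). The first and last positions have no complete window, so `fill = NA` leaves their moving average as `NA`: an outlier in the first or last period of a series would therefore stay missing.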


In [None]:
print(dim(dhis2_routine_long_outliers_imputed))
head(dhis2_routine_long_outliers_imputed, 3)

In [None]:
# # Curiosity check: see where NaNs are introduced by `zoo::rollapply()`
# dhis2_routine_long_outliers_imputed %>% filter(is.nan(VALUE_IMPUTED)) %>% head(., 3)

In [None]:
# Set all `NaN`s to `NA`s
dhis2_routine_long_outliers_imputed$VALUE_IMPUTED[is.nan(dhis2_routine_long_outliers_imputed$VALUE_IMPUTED)] <- NA
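
A minimal base R sketch of where the `NaN`s come from, using made-up values: when all values in a window are missing, `mean(..., na.rm = TRUE)` has nothing left to average.

```r
# A window in which every value is missing
window_all_missing <- c(NA_real_, NA_real_, NA_real_)

# Same function as used in the rolling average
avg <- ceiling(mean(window_all_missing, na.rm = TRUE))
print(is.nan(avg))

# Toy vector mixing a real value, a NaN and a plain NA
values <- c(5, avg, NA)
values[is.nan(values)] <- NA   # same clean-up as in the cell above
print(values)
```

Note that `is.na()` is `TRUE` for both `NA` and `NaN`, whereas `is.nan()` only matches `NaN`, which is why the clean-up uses `is.nan()`.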

### Replace values, and pivot wider
More specifically:
* replace outlier values with imputed values: simply overwrite the `VALUE` col with `VALUE_IMPUTED`
* pivot table to wider (return to original structure)

In [None]:
dhis2_routine_imputed <- dhis2_routine_long_outliers_imputed %>%
    mutate(VALUE = VALUE_IMPUTED) %>% # overwrite VALUE with the imputed values
    select(all_of(fixed_cols), INDICATOR, VALUE) %>% # keeps: PERIOD, YEAR, MONTH, ADM1_ID, ADM2_ID, OU_ID, INDICATOR, VALUE
    pivot_wider(names_from = "INDICATOR", values_from = "VALUE") 

#### Sanity check: make sure imputing does not alter the number of rows in the table

In [None]:
# Check dimensions and log error if they don't match
if (nrow(dhis2_routine) != nrow(dhis2_routine_imputed)) {
  msg_error <- paste0(
    "[ERROR] Nr of rows in 'dhis2_routine' (",
    paste(dim(dhis2_routine), collapse = "x"),
    ") does not match nr of rows in 'dhis2_routine_imputed' (",
    paste(dim(dhis2_routine_imputed), collapse = "x"),
    ")."
  )

  stop(msg_error) # stop the script and print the error message
}

# 5. Export Output tables
Export tables as .csv and .parquet files to the `/data/dhis2/outliers_removal_imputation/` folder. The code in pipeline.py then handles the writing to the Dataset.

In [None]:
# Output directory
output_directory <- file.path(DATA_PATH, "dhis2", "outliers_removal_imputation")

## 5.1. Routine without outliers: `dhis2_routine_outliers_removed`
As `XXX_routine_outliers-<method_name>_removed`.parquet/csv

In [None]:
# Exporting removed data 
export_data(data_object=dhis2_routine_outliers_removed, 
            file_path=file.path(output_directory, paste0(COUNTRY_CODE, "_routine_outliers-", OUTLIER_METHOD, "_removed.csv")))

export_data(data_object=dhis2_routine_outliers_removed, 
            file_path=file.path(output_directory, paste0(COUNTRY_CODE, "_routine_outliers-", OUTLIER_METHOD, "_removed.parquet")))

## 5.2. Routine with **imputed** values: `dhis2_routine_imputed`
As `XXX_routine_outliers-<method_name>_imputed`.parquet/csv

In [None]:
# Exporting imputed data
export_data(data_object=dhis2_routine_imputed, 
            file_path=file.path(output_directory, paste0(COUNTRY_CODE, "_routine_outliers-", OUTLIER_METHOD, "_imputed.csv")))

export_data(data_object=dhis2_routine_imputed, 
            file_path=file.path(output_directory, paste0(COUNTRY_CODE, "_routine_outliers-", OUTLIER_METHOD, "_imputed.parquet")))
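
The four export calls build their file names with the same `paste0()` pattern. As a sketch, that pattern could be wrapped in a small helper. `build_output_name()` is hypothetical and not part of `snt_utils.r`.

```r
# Hypothetical helper mirroring the paste0() calls used for the exports
build_output_name <- function(country_code, method, treatment, ext) {
  paste0(country_code, "_routine_outliers-", method, "_", treatment, ".", ext)
}

removed_name <- build_output_name("COD", "mean3sd", "removed", "parquet")
imputed_name <- build_output_name("COD", "mean3sd", "imputed", "csv")

print(removed_name)
print(imputed_name)
```

Keeping the naming in one place would make it easier to keep the file names in the notebook header, the section titles and the exports in sync.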