# Outliers Detection

We use the following approaches:
* (always) mean ± \[3] SD 
* (always) median ± \[3] MAD
* (always) IQR x \[1.5]
* (optional) Magic Glasses PARTIAL: 15 MAD -> 10 MAD
* (optional) Magic Glasses COMPLETE: 15 MAD -> 10 MAD -> SEASONAL 5 -> SEASONAL 3

--------------------

**Input**: 
* **routine DHIS2** data (formatted and aligned)
    * from Dataset "**snt-dhis2-formatted**", `XXX_routine_data.parquet`

**Output**: 
All outputs saved to Dataset "**snt-outliers-detection**", with the following .parquet files:
* **1 comprehensive table** with flags for all outliers methods selected by the user
    *  cols: YEAR, MONTH, ADM1_ID, ADM2_ID, OU_ID, INDICATOR, VALUE, **OUTLIER_METHOD_X**, **OUTLIER_METHOD_Y**, **OUTLIER_METHOD_Z** ...
    *  Filename: `XXX_flagged_outliers_allmethods.parquet`
* **3-5 individual tables** (separate files), each containing the flags for _a given method only_
    *   cols: YEAR, MONTH, ADM1_ID, ADM2_ID, OU_ID, INDICATOR, VALUE, **OUTLIER_METHOD_X**
    *   Filename: `XXX_outlier_<method_name>.parquet`
* 🐘 **Table** in ws **Database** with added cols needed for 📊 **Shiny App: SNT Outliers Explorer**
    *   cols: YEAR, MONTH, ADM1_ID, ADM2_ID, OU_ID, INDICATOR, VALUE, OUTLIER_METHOD_X, OUTLIER_METHOD_Y, OUTLIER_METHOD_Z, **ADM1_NAME**, **ADM2_NAME**, **OU_NAME**, **DATE**
    *   Table name: `flagged_outliers_allmethods_name_date`

---------------------

🚨 **Note**: `OU_ID` and `OU_NAME` here refer to the **Health Facility** (HF) level! <br>
Make sure this is correctly configured for each country, otherwise the statistics are computed at a different level and the flagged outliers will differ.

## 0. Parameters

👇 these are now ⚡**pipeline parameters**⚡!

In [None]:
# DEVIATION_MEAN <- 3
# DEVIATION_MEDIAN <- 3
# DEVIATION_IQR <- 1.5
# RUN_MAGIC_GLASSES_PARTIAL <- TRUE
# RUN_MAGIC_GLASSES_COMPLETE <- TRUE

#### Set Default values **if _not_ provided by pipeline**
This makes execution flexible and "safe": the notebook can be run manually from here or executed via the pipeline, without having to change anything in the code!

In [None]:
# ⚠️ TEMP! Just for code dev 
SUBSET_DATA <- FALSE # if TRUE, 🪓 subsets data to keep only the first 3 provinces (ADM1_ID) so that computations don't take forever ...
                     # (the subsetting itself happens in the `if (SUBSET_DATA)` block of section 3.2.1)

# SUBSET_DATA <- TRUE # ⚠️⚠️⚠️ TEMP for dev! Make FALSE again! ⚠️⚠️⚠️

In [None]:
# Set BACKUP VALUE: name of config file to use
if (!exists("CONFIG_FILE_NAME")) {
  CONFIG_FILE_NAME <- "SNT_config.json"  # Default if not provided by pipeline
}

# Set BACKUP VALUE: deviations around mean or median 
if (!exists("DEVIATION_MEAN")) {
  DEVIATION_MEAN <- 3  
}

if (!exists("DEVIATION_MEDIAN")) {
  DEVIATION_MEDIAN <- 3  
}

if (!exists("DEVIATION_IQR")) {
  DEVIATION_IQR <- 1.5  
}

if (!exists("RUN_MAGIC_GLASSES_PARTIAL")) {
  RUN_MAGIC_GLASSES_PARTIAL <- TRUE 
}

if (!exists("RUN_MAGIC_GLASSES_COMPLETE")) {
  RUN_MAGIC_GLASSES_COMPLETE <- TRUE 
}

## 1. Setup

### 1.1. Paths

In [None]:
# Set BACKUP VALUE: root path - NEVER CHANGE THIS!
if (!exists("ROOT_PATH")) {
  ROOT_PATH <- "~/workspace"  
}

In [None]:
# PROJECT PATHS

# Project folders
CODE_PATH <- file.path(ROOT_PATH, 'code') # this is where we store snt_functions.r and snt_utils.r
CONFIG_PATH <- file.path(ROOT_PATH, 'configuration') # .json config file
DATA_PATH <- file.path(ROOT_PATH, 'data') # same as in Datasets but /data/ gets over written every time a new version of Datasets is pushed

print(CODE_PATH)

### 1.2. Utils functions

In [None]:
source(file.path(CODE_PATH, "snt_utils.r"))

### 1.3. Packages

In [None]:
# List required pcks 
required_packages <- c("arrow", # for .parquet
                       "tidyverse",
                       "stringi", 
                       # "sf",
                       "forecast",
                       "jsonlite", 
                       "httr", 
                       "DBI", # write to DB
                       "RPostgres", # write to DB
                       "reticulate")

# Execute function
install_and_load(required_packages)

#### For 📦{sf}, tell OH where to find stuff ...

In [None]:
# Hope this gets fixed at the source one day ...
Sys.setenv(PROJ_LIB = "/opt/conda/share/proj")
Sys.setenv(GDAL_DATA = "/opt/conda/share/gdal")

#### Set environment to load openhexa.sdk from the right path

In [None]:
# Set environment to load openhexa.sdk from the right path
Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python")
reticulate::py_config()$python
openhexa <- import("openhexa.sdk")

### 1.4. Load and check `config` file

In [None]:
# Load SNT config

config_json <- tryCatch({
        fromJSON(file.path(CONFIG_PATH, CONFIG_FILE_NAME)) # "SNT_config_COD.json"
    },
    error = function(e) {
        msg <- paste0("Error while loading configuration: ", conditionMessage(e))
        cat(msg)   
        stop(msg) 
    })

msg <- paste0("SNT configuration loaded from  : ", file.path(CONFIG_PATH, CONFIG_FILE_NAME)) 
log_msg(msg)

#### **Checks for SNT mandatory configuration fields**

In [None]:
# CHECK SNT configuration 
snt_config_mandatory <- c("COUNTRY_CODE", "DHIS2_ADMINISTRATION_1", "DHIS2_ADMINISTRATION_2") 

for (conf in snt_config_mandatory) {
    # print(paste(conf, ":", config_json$SNT_CONFIG[conf]))
    log_msg(paste(conf, ":", config_json$SNT_CONFIG[conf]))
    if (is.null(config_json$SNT_CONFIG[[conf]])) {
        msg <- paste("Missing configuration input:", conf)
        # cat(msg)   
        log_msg(msg)
        stop(msg)
    }
}

#### **Save config fields as variables**

In [None]:
# Generic
COUNTRY_CODE <- config_json$SNT_CONFIG$COUNTRY_CODE
ADMIN_1 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_1)
ADMIN_2 <- toupper(config_json$SNT_CONFIG$DHIS2_ADMINISTRATION_2)

# Which (aggregated) indicators to use to evaluate "activity" of an HF - for Reporting Rate method "ANY"
DHIS2_INDICATORS <- names(config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS)

In [None]:
# DHIS2_INDICATORS
log_msg(paste("Expecting the following DHIS2 (aggregated) indicators : ", paste(DHIS2_INDICATORS, collapse=", ")))

In [None]:
# Fixed routine formatting columns
# Note: must keep & use `OU_ID` as it contains unique ids (OU_NAME has duplicate values!)
# fixed_cols <- c('PERIOD', 'YEAR', 'MONTH', 'ADM1_ID', 'ADM1', 'ADM2_ID', 'ADM2', 'OU', 'OU_NAME') 
fixed_cols <- c('PERIOD', 'YEAR', 'MONTH', 'ADM1_ID', 'ADM2_ID', 'OU_ID') # keep ADMX_ID only! 

# log_msg(paste("Fixed routine data (\"dhis2_routine\") columns (always expected): ", paste(fixed_cols, collapse=", ")))
log_msg(paste("Expecting the following columns from routine data (`dhis2_routine`) : ", paste(fixed_cols, collapse=", ")))

## 2. Load Data

### 2.1. **Routine** data (DHIS2) 
already formatted & aggregated<br>
(output of pipeline "XXX" and stored in Dataset "**SNT_DHIS2_FORMATTED**")

In [None]:
# DHIS2 Dataset extract identifier
dataset_name <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED

# Load file from dataset
dhis2_routine <- tryCatch({ get_latest_dataset_file_in_memory(dataset_name, paste0(COUNTRY_CODE, "_routine.parquet")) }, 
                  error = function(e) {
                      msg <- paste("Error while loading DHIS2 routine data file for: " , COUNTRY_CODE, conditionMessage(e))  # log error message
                      # cat(msg)
                      log_msg(msg)
                      stop(msg)
})

# msg <- paste0("DHIS2 routine data loaded from dataset : ", dataset_name, " dataframe dimensions: ", paste(dim(dhis2_routine), collapse=", "))
# log_msg(msg)

msg1 <- paste0("DHIS2 routine data loaded from dataset : ", dataset_name)
log_msg(msg1)

msg2 <- paste0("DHIS2 routine data loaded has dimensions: ", nrow(dhis2_routine), " rows, ", ncol(dhis2_routine), " columns.")
log_msg(msg2)


In [None]:
head(dhis2_routine, 3)

### 🔍 Check: any "empty" col (contains only NA's)?

In [None]:
# Identify which columns are entirely NA
na_col_check <- dhis2_routine %>%
  map_lgl(~all(is.na(.x)))

# Get the names of those columns
all_na_column_names <- names(na_col_check[na_col_check])

# Print warning only if there are any such columns
if (length(all_na_column_names) > 0) {
  log_msg(paste0("🚨 The following columns contain only `NA` values : ", paste(all_na_column_names, collapse=", ")), "warning")
}

# X. 📊 for **Shiny** app: extract `*_NAME` table
Needed later: 
* join to `XXX_flagged_outliers_allmethods.parquet`
* then write table to Database to expose to Shiny
    * --> in the app, we can have human readable names for labels
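
A minimal base-R sketch of what the name clean-up in the cell below does, using made-up names. The `XX ... PROVINCE` naming pattern is an assumption about how names look in the routine data, and `clean_adm1_name()` is a hypothetical helper (not part of `snt_utils.r`). It also shows a side effect worth checking per country: the `^[A-Z]{2}` part strips the first two capital letters of *any* name, whether or not it carries a prefix.

```r
# Hypothetical helper: same regex as in the `pyramid` cell, written with base R
clean_adm1_name <- function(x) {
  trimws(gsub("^[A-Z]{2}| PROVINCE", "", x))
}

# Made-up names: one with a 2-letter prefix and suffix, one without
adm1_names <- c("KN KINSHASA PROVINCE", "KINSHASA")
cleaned <- clean_adm1_name(adm1_names)

print(cleaned)
# "KINSHASA" "NSHASA"  <- the un-prefixed name loses its first two letters
```

If names without a prefix can occur, a stricter pattern (for instance requiring a space after the two letters) may be safer, but that depends on the country's naming convention.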

In [None]:
pyramid <- dhis2_routine |> 
select(ends_with("_NAME"), ends_with("_ID")) |>
distinct() |>
# Simplify strings 
mutate(
    ADM1_NAME = stringr::str_trim(str_remove_all(ADM1_NAME, "^[A-Z]{2}| PROVINCE")),
    ADM2_NAME = stringr::str_trim(str_remove_all(ADM2_NAME, "^[A-Z]{2}| ZONE DE SANTE"))
)


dim(pyramid)
head(pyramid, 3)

# 3. Outliers Detection

## 3.1. Format routine data
These steps:
* Replace `NA`s with `0`s - 🚨 note: in `dhis2_reporting_rate` the opposite transformation is (optionally) performed with parameter `SET_0_TO_NA`!
* filter indicators (still cols at this step, rows after the pivot) to keep only those specified in the config.json file (`columns_selection = names(config.json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS)`)
* pivot longer: cols become rows

In [None]:
# Replace NAs with 0s
dhis2_routine_NAto0 = dhis2_routine %>% replace(is.na(.), 0)

log_msg("Routine data formatting: replaced all `NA`s with 0.")

head(dhis2_routine_NAto0, 3)

In [None]:
# Filter cols to keep only the indicators specified in the config file

# Define col names
indicators_to_keep = names(config_json$DHIS2_DATA_DEFINITIONS$DHIS2_INDICATOR_DEFINITIONS)
indicators_to_keep_present_in_routine = intersect(names(dhis2_routine), indicators_to_keep)
indicators_to_keep_missing_in_routine = setdiff(indicators_to_keep, names(dhis2_routine))

# Select cols
# dhis2_routine_2 <- dhis2_routine_1 %>% select(all_of(c(fixed_cols, indicators_to_keep)) )  # ⚠️⚠️⚠️ TEMP switch because the CONFIG file was updated but not yet used to extract data!
dhis2_routine_selectcols <- dhis2_routine_NAto0 %>% select(any_of(c(fixed_cols, indicators_to_keep)) )  

log_msg(paste0("Routine data formatting: filtered cols to keep only indicators : ", paste(indicators_to_keep_present_in_routine, collapse=", ")))

if (length(indicators_to_keep_missing_in_routine) > 0 ) {
    log_msg(paste0("The following indicators defined in the config file are missing from routine data : ", paste(indicators_to_keep_missing_in_routine, collapse=", ") ))
}


head(dhis2_routine_selectcols, 3)

In [None]:
# pivot longer: cols become rows
dhis2_routine_long <- dhis2_routine_selectcols %>% 
    pivot_longer(#cols = all_of(indicators_to_keep), 
                 cols = any_of(indicators_to_keep), # ⚠️⚠️⚠️ TEMP switch because the CONFIG file was updated but not yet used to extract data!
                 names_to = 'INDICATOR',
                 values_to = 'VALUE') 

print(dim(dhis2_routine_long))
head(dhis2_routine_long, 3)

## 3.2. Detect outliers: "classic" methods
Namely:
* mean ± 3 SD
* median ± 3 MAD
* 1.5 x IQR
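
To make the three rules concrete, here is a minimal base-R sketch on one made-up monthly series (a single `OU_ID` x `INDICATOR` group). The values and variable names are illustrative only. Unlike the notebook code, the sketch does not apply `ceiling()` to the summary statistics; it does use `constant = 1` for the MAD, as the notebook does.

```r
# Made-up series for one facility and one indicator; the last value is suspicious
x <- c(48, 50, 51, 52, 49, 50, 120)

# Rule 1: mean +/- 3 SD
flag_mean <- x < mean(x) - 3 * sd(x) | x > mean(x) + 3 * sd(x)

# Rule 2: median +/- 3 MAD (raw MAD, i.e. constant = 1 instead of R's default 1.4826)
mad_raw <- mad(x, constant = 1)
flag_median <- x < median(x) - 3 * mad_raw | x > median(x) + 3 * mad_raw

# Rule 3: 1.5 x IQR fences around Q1 and Q3
q <- unname(quantile(x, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
flag_iqr <- x < q[1] - fence | x > q[2] + fence

print(data.frame(x, flag_mean, flag_median, flag_iqr))
```

In this toy series only the median/MAD and IQR rules flag `120`: the extreme value inflates both the mean and the SD, so it stays inside the mean ± 3 SD band. This is one reason for reporting several methods side by side.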

### 3.2.1. Calculate **summary stats**
At `OU_ID` (Health Facility) x `INDICATOR`, calculate:
* mean
* median
* SD
* MAD
* Q1 (25th)
* Q3 (75th).

In [None]:
fixed_cols

In [None]:
# Remove the time columns ("PERIOD", "YEAR", "MONTH") from fixed cols: stats are computed per OU_ID x INDICATOR over the whole series
grouping_cols <- fixed_cols[fixed_cols != "MONTH" & fixed_cols != "YEAR" & fixed_cols != "PERIOD"]
print(grouping_cols)

# 🪓 Switch for code dev!🪓
if (SUBSET_DATA) {
    adm1_to_keep <- head(unique(dhis2_routine_long$ADM1_ID), 3) # first 3 ADM1 units (fewer if not available)
    dhis2_routine_long <- dhis2_routine_long %>% filter(ADM1_ID %in% adm1_to_keep) 
    log_msg(paste0("🚨 Subsetting data for `ADM1_ID %in% c(", paste(adm1_to_keep, collapse = ", "), ")`! 🪓 This should only be used for CODE DEVELOPMENT!"))
}

flagged_outliers_classic <- dhis2_routine_long %>%
group_by(across(all_of(c(grouping_cols, "INDICATOR")))) %>% # "ADM1_ID" "ADM2_ID" "OU_ID" 'INDICATOR'
mutate(
    n = n(), # added for curiosity (but not used)
    mean = ceiling(mean(VALUE, na.rm = TRUE)),
    median = ceiling(median(VALUE, na.rm = TRUE)),
    sd = ceiling(sd(VALUE, na.rm = TRUE)),
    mad = ceiling(mad(VALUE, constant = 1, na.rm = TRUE)), # 🚨 scale factor: `constant = 1` (default `constant = 1.4826`) 
    q1 = ceiling(quantile(VALUE, 0.25, na.rm = TRUE)), 
    q3 = ceiling(quantile(VALUE, 0.75, na.rm = TRUE))
  ) %>% 
  ungroup() %>%
  mutate(sd = if_else(is.na(sd), 0, sd)) # to prevent NA's being introduced by sd() when single VALUE ... !

dim(flagged_outliers_classic)
head(flagged_outliers_classic, 3)

### 3.2.2. Flag outlier values
Flagging from *all* the 3 methods in the same table (each method is a column).

🚨 **Note**: flagging outliers using **booleans** to save memory (instead of old approach with char strings as `'aberrante'|'normale'`)

#### Use the **3 "classic" methods**

In [None]:
# # Parametrize - moved to start of nb
# DEVIATION_MEAN <- 3
# DEVIATION_MEDIAN <- 3
# DEVIATION_IQR <- 1.5

In [None]:
flagged_outliers_classic <- flagged_outliers_classic %>% 
mutate(
    mean_lower_bound = mean - DEVIATION_MEAN * sd, 
    mean_upper_bound = mean + DEVIATION_MEAN * sd,
    !!sym(glue::glue("OUTLIER_MEAN{DEVIATION_MEAN}SD")) := if_else(
      VALUE < mean_lower_bound | VALUE > mean_upper_bound,
      TRUE, # = outlier
      FALSE
    )  ) %>% 
  mutate(
    median_lower_bound = median - DEVIATION_MEDIAN * mad,
    median_upper_bound = median + DEVIATION_MEDIAN * mad,
    !!sym(glue::glue("OUTLIER_MEDIAN{DEVIATION_MEDIAN}MAD")) := if_else(
      VALUE < median_lower_bound | VALUE > median_upper_bound,
      TRUE,
      FALSE
    )  ) %>% 
  mutate(
    iqr = (q3 - q1) * DEVIATION_IQR,
    iqr_lower_bound = q1 - iqr,
    iqr_upper_bound = q3 + iqr,
    !!sym(glue::glue("OUTLIER_IQR{DEVIATION_IQR}")) := if_else(
      VALUE < iqr_lower_bound | VALUE > iqr_upper_bound,
      TRUE,
      FALSE
    )  ) 

outlier_cols = flagged_outliers_classic |> select(starts_with("OUTLIER_")) |> names()
log_msg(paste0("Calculated columns : ", paste(outlier_cols, collapse=", ") ))

print(names(flagged_outliers_classic))
head(flagged_outliers_classic, 3)

**Do not flag zeros as outliers**: where `VALUE == 0` and `OUTLIER_*` is `TRUE`, set the flag to `FALSE` (make _not_ outlier)
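
A small base-R sketch of why this step matters, on made-up values: once `NA`s have been replaced with `0`, a zero in an otherwise stable series falls below the lower bound and gets flagged. The variable names are illustrative; the notebook does the same thing across all `OUTLIER_*` columns with `across()`.

```r
# Made-up series: stable around 50, with one zero (e.g. a former NA)
x <- c(50, 52, 48, 51, 0)

# median +/- 3 MAD, with constant = 1 as in the notebook
mad_raw <- mad(x, constant = 1)
flagged <- x < median(x) - 3 * mad_raw | x > median(x) + 3 * mad_raw

# Zeros are never kept as outliers
unflagged <- flagged & x != 0

print(data.frame(x, flagged, unflagged))
```

Here the zero is the only flagged value before the correction, and nothing is flagged after it.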

In [None]:
flagged_outliers_classic <- flagged_outliers_classic %>%
  mutate(across(starts_with("OUTLIER_"), 
                ~ if_else(VALUE == 0 & .x, FALSE, .x))) #|> filter(VALUE == 0) |> head() # generates an empty tibble 

-------------------

## 3.3. Detect outliers: **Magic Glasses** methods 
Where:
* MAGIC_GLASSES_**PARTIAL** = only MAD15 -> MAD10
* MAGIC_GLASSES_**COMPLETE** = the complete method: MAD15 -> MAD10 -> Seasonality5 -> Seasonality3

### 3.3.1. Use the **Magic Glasses Partial** method
*NOTE: use `dhis2_routine_long` = **routine formatted long** (meaning: NA -> 0, keep only relevant indicators, pivot longer)*

In this approach, values are evaluated (to detect outliers) in two consecutive steps: 
1. Detect and **remove** outliers based on MAD 15
2. Detect and **flag** outliers based on MAD 10

Specifically, breakdown of steps for code:
* **Write function** called `detect_outliers_mad_custom()` that:
    1. calculates the median and MAD on data grouped by `OU_ID`, `INDICATOR` and **`YEAR`**;
    2. creates a new boolean col, named `OUTLIER_MAD<deviation>` (e.g. `OUTLIER_MAD15`), that flags values further than `deviation` MADs from the median
* **Apply function** to `dhis2_routine_long` to flag outliers based on **MAD 15** (`flagged_outliers_mad15`)
* **Remove outliers** as per MAD 15 (`flagged_outliers_mad15_filtered`)
* **Apply function** to flag outliers based on **MAD 10**
* This produces the df `flagged_outliers_mad10`
* finally, the tables with the outliers detected sequentially are merged (joined) into a single df: `flagged_outliers_mad15_mad10`
    * note: this introduces `NA`s because the 2 df have different sizes (since MAD10 is calculated on routine data minus the outliers identified by MAD15). To fix this, `NA`s are converted to `TRUE`
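
The two-step logic above can be sketched on a single toy series (one `OU_ID` x `INDICATOR` x `YEAR` group) in base R. `flag_mad()` is a hypothetical stand-in for `detect_outliers_mad_custom()` without the grouping, and the values are made up.

```r
# Hypothetical helper: TRUE where a value lies more than `deviation` raw MADs from the median
flag_mad <- function(x, deviation) {
  med <- median(x)
  mad_raw <- mad(x, constant = 1)
  x > med + deviation * mad_raw | x < med - deviation * mad_raw
}

x <- c(10, 11, 12, 13, 14, 35, 500)

# Step 1: MAD 15 on the full series
step1 <- flag_mad(x, 15)

# Step 2: MAD 10 on what is left after removing the step-1 outliers
step2 <- rep(NA, length(x))
step2[!step1] <- flag_mad(x[!step1], 10)

# Values removed in step 1 have no step-2 result (NA): they count as outliers
outlier_mad15_mad10 <- ifelse(is.na(step2), TRUE, step2)

print(data.frame(x, step1, step2, outlier_mad15_mad10))
```

In this toy series `500` is caught by MAD 15, and `35` only shows up once `500` has been removed and the bounds are recomputed with MAD 10. Both end up `TRUE` in the final column.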

#### Write function `detect_outliers_mad_custom()`

In [None]:
# Function to detect outliers based on MAD method with custom nr of MAD's (deviation)
detect_outliers_mad_custom <- function(data, deviation) { 
    flag_outlier_colname = paste0("OUTLIER_MAD", deviation)
    
    data %>% 
    group_by(YEAR, OU_ID, INDICATOR) %>% 
    mutate(
      median_val = median(VALUE, na.rm = TRUE),
      mad_val = mad(VALUE, constant = 1, na.rm = TRUE), # 🚨 `constant = 1`
      "{flag_outlier_colname}" := VALUE > (median_val + deviation * mad_val) | VALUE < (median_val - deviation * mad_val) # bool 
    ) %>%
    ungroup()
}

#### **Apply function** with MAD **15**

In [None]:
if (RUN_MAGIC_GLASSES_PARTIAL | RUN_MAGIC_GLASSES_COMPLETE) {
    flagged_outliers_mad15 <- detect_outliers_mad_custom(
                                         data = dhis2_routine_long,  
                                         deviation = 15
                                         )

    outlier_cols = flagged_outliers_mad15 |> select(starts_with("OUTLIER_")) |> names()
    log_msg(paste0("Calculated column : ", paste(outlier_cols, collapse=", ") ))

    print(dim(flagged_outliers_mad15))
    head(flagged_outliers_mad15, 3)
}

#### **Filter** out outliers as per MAD **15**
Keep only values that are _not_ flagged as outliers by MAD15. 

In [None]:
if (RUN_MAGIC_GLASSES_PARTIAL | RUN_MAGIC_GLASSES_COMPLETE) {
    
  flagged_outliers_mad15_filtered <- flagged_outliers_mad15 %>%
    filter(OUTLIER_MAD15 == FALSE) 
  
  print(dim(flagged_outliers_mad15_filtered))
  
  nr_of_rows_removed <- nrow(flagged_outliers_mad15) - nrow(flagged_outliers_mad15_filtered)
  log_msg(paste0("Filtering data for MAD15 outliers removed ", nr_of_rows_removed , " rows"))

    if (length(which(is.na(flagged_outliers_mad15_filtered$OUTLIER_MAD15))) > 0 ) {
        log_msg("🚨 Unexpected `NA`s found in `flagged_outliers_mad15_filtered$OUTLIER_MAD15`.")
    }
    }



#### **Apply function** again, but with MAD **10**

In [None]:
if (RUN_MAGIC_GLASSES_PARTIAL | RUN_MAGIC_GLASSES_COMPLETE) {

    flagged_outliers_mad10 <- detect_outliers_mad_custom(
                                                         data = flagged_outliers_mad15_filtered,
                                                         deviation = 10 # produces column = "OUTLIER_MAD10" 
        ) %>%
    rename(OUTLIER_MAD15_MAD10 = OUTLIER_MAD10) # not elegant but more explicitly SEQUENTIAL (highlight that MAD10 is on top/after of MAD15)

    log_msg(paste0("Calculated column : OUTLIER_MAD15_MAD10"))
    
    print(dim(flagged_outliers_mad10))
    head(flagged_outliers_mad10, 3)

}

#### **Join** together `flagged_outliers_mad15` (MAD 15) and `flagged_outliers_mad10` (MAD 15 -> MAD 10)

In [None]:
# Join together `flagged_outliers_mad15` (MAD 15) and `flagged_outliers_mad10` (MAD 15 -> MAD 10)

if (RUN_MAGIC_GLASSES_PARTIAL | RUN_MAGIC_GLASSES_COMPLETE) {

    flagged_outliers_mad15_mad10 <- flagged_outliers_mad15 %>%
      left_join(
        flagged_outliers_mad10 %>% select(PERIOD, OU_ID, INDICATOR, OUTLIER_MAD15_MAD10),
        by = c("PERIOD", "OU_ID", "INDICATOR")
      ) |>
    # MAD15 has more rows than MAD10 => NA's introduced!
    mutate(OUTLIER_MAD15_MAD10 = if_else(is.na(OUTLIER_MAD15_MAD10), TRUE, OUTLIER_MAD15_MAD10)) # 🚨 IMPORTANT! 🚨 keep!

    log_msg("Created table with OUTLIER_MAD15 and OUTLIER_MAD15_MAD10")
    
    print(dim(flagged_outliers_mad15_mad10))
    head(flagged_outliers_mad15_mad10, 3)

    }

----------------------------

### 3.3.2. Use the **Magic Glasses Complete** method
Apply detection of **seasonal** outliers.

Steps:
* Write **function** `detect_seasonal_outliers()` which takes a user-defined `deviation` value and produces an `OUTLIER_SEASONAL<deviation>` col where outlier values are flagged
* Apply detect_seasonal_outliers() with deviation = **5**
* Filter out outliers seasonal 5
* Apply detect_seasonal_outliers() with deviation = **3**
* Join seasonal 5 and seasonal 3 into single table
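
Written out, the rule that `detect_seasonal_outliers()` (defined below) applies to each `OU_ID` x `INDICATOR` series reads as follows. The symbols are introduced here for illustration only: $y_t$ is the monthly value, $\tilde{y}_t$ is the cleaned series returned by `forecast::tsclean()` on the series with `frequency = 12`, and $k$ is the `deviation` argument.

```latex
\mathrm{flag}_t^{(k)} =
\begin{cases}
\text{TRUE}  & \text{if } \dfrac{\lvert y_t - \tilde{y}_t \rvert}{\operatorname{MAD}(y)} \ge k \\[1.5ex]
\text{FALSE} & \text{otherwise}
\end{cases}
\qquad
\operatorname{MAD}(y) = \operatorname{median}_t \bigl\lvert y_t - \operatorname{median}(y) \bigr\rvert,
\qquad k \in \{5, 3\}
```

Two edge cases follow from the code: series with fewer than 2 non-`NA` values are skipped and end up `FALSE`, and when $\operatorname{MAD}(y) = 0$ the ratio is `NaN` for unchanged values (turned into `FALSE`) and `Inf` for any value that `tsclean()` modified (flagged `TRUE`).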

#### Write function `detect_seasonal_outliers()`

In [None]:
detect_seasonal_outliers <- function(data, deviation) { 
  
  outlier_column_name = paste0("OUTLIER_SEASONAL", deviation) # Renamed 'flag_outlier_colname'
  
  # Ensure data is sorted for time series operations
  prepared_data <- data %>% arrange(OU_ID, INDICATOR, PERIOD) # Renamed 'data' to 'prepared_data'
  
  # Process each group to detect seasonal outliers
  outlier_detection_results <- prepared_data %>% # Renamed 'outlier_list'
    group_by(OU_ID, INDICATOR) %>%
    group_map(~ {
      current_group_df <- .x # Renamed 'df'
      group_identifiers <- .y # Renamed 'key'
      
      # Skip if fewer than 2 non-NA values for time series analysis
      if (sum(!is.na(current_group_df$VALUE)) < 2) {
        current_group_df[[outlier_column_name]] <- NA  # 👈 "skipping" introduces NA's --> these are removed in the last step before `return()`
        current_group_df$OU_ID <- group_identifiers$OU_ID
        current_group_df$INDICATOR <- group_identifiers$INDICATOR
        return(current_group_df)
      }
      
      # Create time series object
      time_series_data <- stats::ts(current_group_df$VALUE, frequency = 12)
      # Clean time series data (remove outliers and fill missing values)
      cleaned_time_series <- forecast::tsclean(time_series_data, replace.missing = TRUE)
      # Calculate Median Absolute Deviation (MAD) for outlier threshold
      mad_threshold <- mad(current_group_df$VALUE, constant = 1, na.rm = TRUE)
      
      # Flag seasonal outliers
      is_seasonal_outlier <- abs(time_series_data - cleaned_time_series) / mad_threshold >= deviation # Renamed 'seasonal_flag'
      current_group_df[[outlier_column_name]] <- is_seasonal_outlier
      current_group_df$OU_ID <- group_identifiers$OU_ID
      current_group_df$INDICATOR <- group_identifiers$INDICATOR
      return(current_group_df)
    }) %>%
    bind_rows()
  
  # In OUTLIER flagging col, replace any `NA` (= skipped evaluation) with `FALSE` (treat as non-outlier)
  outlier_detection_results[[outlier_column_name]][is.na(outlier_detection_results[[outlier_column_name]])] <- FALSE # 👈 replace NA's: fixing here 
  
  return(outlier_detection_results)
}



#### **Filter** out MAD outliers before seasonal detection

In [None]:
# Filter out MAD outliers before seasonal detection

if (RUN_MAGIC_GLASSES_COMPLETE) {

# flagged_outliers_mad15_mad10_filtered <- flagged_outliers_mad15_mad10 %>%
#   filter((is.na(OUTLIER_MAD15) | OUTLIER_MAD15 == FALSE),  # not really needed anymore: `is.na(OUTLIER_MAD15)` ... but keeping just in case
#          (is.na(OUTLIER_MAD15_MAD10) | OUTLIER_MAD15_MAD10 == FALSE),
#          !is.na(VALUE))

flagged_outliers_mad15_mad10_filtered <- flagged_outliers_mad15_mad10 %>%
  filter(#(OUTLIER_MAD15 == FALSE),  # not needed: outliers identified by MAD15 are necessarily also outliers for MAD15_MAD10 !
         (OUTLIER_MAD15_MAD10 == FALSE) 
        )
    
    nr_of_rows_removed <- nrow(flagged_outliers_mad15_mad10) - nrow(flagged_outliers_mad15_mad10_filtered)
    log_msg(paste0("Filtering data for MAD15_MAD10 outliers removed ", nr_of_rows_removed , " rows"))

     if (length(which(is.na(flagged_outliers_mad15_mad10_filtered$OUTLIER_MAD15_MAD10))) > 0 ) {
        log_msg("🚨 Unexpected `NA`s found in `flagged_outliers_mad15_mad10_filtered$OUTLIER_MAD15_MAD10`.")
    }
    }

#### **Apply function** `detect_seasonal_outliers()` with deviation = **5**

In [None]:
if (RUN_MAGIC_GLASSES_COMPLETE) {

    flagged_outliers_seasonal5 <- detect_seasonal_outliers(
                                               data = flagged_outliers_mad15_mad10_filtered, # all MAD (15 and 10) already REMOVED
                                               deviation = 5) # -> resulting colname: OUTLIER_SEASONAL5

    outlier_cols = flagged_outliers_seasonal5 |> select(starts_with("OUTLIER_SEAS")) |> names()
    log_msg(paste0("Calculated column : ", paste(outlier_cols, collapse=", ") ))
    
    print(dim(flagged_outliers_seasonal5))
    head(flagged_outliers_seasonal5, 3)

    }

#### **Filter** out outliers seasonal 5

In [None]:
# remaining values for seasonal3

if (RUN_MAGIC_GLASSES_COMPLETE) {

    # flagged_outliers_seasonal5_filtered <- flagged_outliers_seasonal5 %>%
    #   filter(
    #     OUTLIER_SEASONAL5 == FALSE,
    #     (is.na(OUTLIER_MAD15) | OUTLIER_MAD15 == FALSE),
    #     (is.na(OUTLIER_MAD15_MAD10) | OUTLIER_MAD15_MAD10 == FALSE),
    #     !is.na(VALUE) # NB: so far VALUE is never NA
    #   )

    flagged_outliers_seasonal5_filtered <- flagged_outliers_seasonal5 %>%
      filter(
        OUTLIER_SEASONAL5 == FALSE
        # (OUTLIER_MAD15 == FALSE), # these were already removed in `flagged_outliers_mad15_mad10_filtered` !!
        # (OUTLIER_MAD15_MAD10 == FALSE)
        # !is.na(VALUE) # NB: so far VALUE is never NA
      )

    nr_of_rows_removed <- nrow(flagged_outliers_seasonal5) - nrow(flagged_outliers_seasonal5_filtered)
    log_msg(paste0("Filtering data for SEASONAL5 outliers removed ", nr_of_rows_removed , " rows"))

     if (length(which(is.na(flagged_outliers_seasonal5_filtered$OUTLIER_SEASONAL5))) > 0 ) {
        log_msg("🚨 Unexpected `NA`s found in `flagged_outliers_seasonal5_filtered$OUTLIER_SEASONAL5`.")
    }

    }

#### **Apply function** `detect_seasonal_outliers()` with deviation = **3**

In [None]:
# Run seasonal3 detection

if (RUN_MAGIC_GLASSES_COMPLETE) {

    flagged_outliers_seasonal3 <- detect_seasonal_outliers(data = flagged_outliers_seasonal5_filtered, 
                                                           deviation = 3) %>%
    rename(OUTLIER_SEASONAL5_SEASONAL3 = OUTLIER_SEASONAL3) # not elegant but more explicit ...

    log_msg(paste0("Calculated column : OUTLIER_SEASONAL5_SEASONAL3") )

    }

#### **Join** together `flagged_outliers_seasonal5` (seas5) and `flagged_outliers_seasonal3` (seas5 -> seas3)

In [None]:
# 🚨 Added on 2025-07-14 🚨

if (RUN_MAGIC_GLASSES_COMPLETE) { # seasonal tables only exist when the COMPLETE method is run
  
  flagged_outliers_seasonal5_seasonal3 <- flagged_outliers_seasonal5 %>%
    left_join(
      flagged_outliers_seasonal3 %>% select(PERIOD, OU_ID, INDICATOR, OUTLIER_SEASONAL5_SEASONAL3),
      by = c("PERIOD", "OU_ID", "INDICATOR")
    ) |>
    mutate(OUTLIER_SEASONAL5_SEASONAL3 = if_else(is.na(OUTLIER_SEASONAL5_SEASONAL3), TRUE, OUTLIER_SEASONAL5_SEASONAL3)) # ✍🏽 GP added 20250711

  outlier_cols = flagged_outliers_seasonal5_seasonal3 |> select(starts_with("OUTLIER_")) |> names()
  log_msg(paste0("Created table with ", paste(outlier_cols, collapse=", ") ) ) 
  
  print(dim(flagged_outliers_seasonal5_seasonal3))
  head(flagged_outliers_seasonal5_seasonal3, 3)
  
}

# 4. Generate and Export Output tables
Export tables as .csv and .parquet files to `/data/` folder, then code in pipeline.py will handle the writing to Dataset 

## 4.1. Join all "flags" into single df
This table contains **all values**, **flagged** (bool) based on each (and all) outliers detection method used.

**Structure**: contains always the same fixed set of cols (`fixed_cols`, INDICATOR, VALUE) + **1 col for each** implemented **outliers detection method** <br>
(so table keeps growing as cols are added to the right end)

**Note**: `NA`s are introduced by `left_join()` when the left-hand side has more rows than the right-hand side! <br>
Here, we are joining df's of different sizes: `flagged_outliers_seasonal5_seasonal3` (MG COMPLETE) has fewer rows than `flagged_outliers_mad15_mad10` (MG PARTIAL) because to calculate MG COMPLETE we first remove all the outliers flagged by MG PARTIAL!<br>
Therefore, after joining to make `flagged_outliers_allmethods`, it's necessary to convert all `NA`s into `TRUE` at col `OUTLIER_SEASONAL5_SEASONAL3`
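
A toy illustration of this note, in base R to keep it dependency-free (the notebook itself uses `dplyr::left_join()`; `merge(..., all.x = TRUE)` is used here as a stand-in). Table names and values are made up.

```r
# Left-hand side: all rows, with the MG PARTIAL flag
all_rows <- data.frame(
  PERIOD = c(202401, 202402, 202403),
  VALUE = c(10, 500, 12),
  OUTLIER_MAD15_MAD10 = c(FALSE, TRUE, FALSE)
)

# Right-hand side: seasonal flags, computed only on rows that survived MG PARTIAL
seasonal <- data.frame(
  PERIOD = c(202401, 202403),
  OUTLIER_SEASONAL5_SEASONAL3 = c(FALSE, FALSE)
)

# Left join: the row removed before the seasonal step gets NA
joined <- merge(all_rows, seasonal, by = "PERIOD", all.x = TRUE)

# NA + already an outlier for MG PARTIAL => also an outlier for MG COMPLETE
needs_fill <- is.na(joined$OUTLIER_SEASONAL5_SEASONAL3) & joined$OUTLIER_MAD15_MAD10
joined$OUTLIER_SEASONAL5_SEASONAL3[needs_fill] <- TRUE

print(joined)
```

Any `NA` that remains after this step does not come from the sequential filtering, which is why the notebook logs a warning for it.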

In [None]:
# RUN_MAGIC_GLASSES_PARTIAL <- TRUE
# RUN_MAGIC_GLASSES_COMPLETE <- TRUE

In [None]:
# Join all flags into one full dataset

if (RUN_MAGIC_GLASSES_PARTIAL == FALSE && RUN_MAGIC_GLASSES_COMPLETE == FALSE) {
    
    flagged_outliers_allmethods <- flagged_outliers_classic
    
} else if (RUN_MAGIC_GLASSES_PARTIAL == TRUE && RUN_MAGIC_GLASSES_COMPLETE == FALSE) {
    
    flagged_outliers_allmethods <- flagged_outliers_classic %>%
    left_join(
        flagged_outliers_mad15_mad10 %>% select(PERIOD, OU_ID, INDICATOR, OUTLIER_MAD15_MAD10),
        by = c("PERIOD", "OU_ID", "INDICATOR")
        )

} else if (RUN_MAGIC_GLASSES_COMPLETE == TRUE) { # also covers PARTIAL == FALSE: COMPLETE builds on MAD15 -> MAD10, so both flag cols are joined
    
    flagged_outliers_allmethods <- flagged_outliers_classic %>%
    left_join(
        flagged_outliers_mad15_mad10 %>% select(PERIOD, OU_ID, INDICATOR, OUTLIER_MAD15_MAD10),
        by = c("PERIOD", "OU_ID", "INDICATOR")
        ) %>%
    left_join(
        flagged_outliers_seasonal5_seasonal3 %>% select(PERIOD, OU_ID, INDICATOR, OUTLIER_SEASONAL5_SEASONAL3),
        by = c("PERIOD", "OU_ID", "INDICATOR")
    ) %>%
    # Convert NA to TRUE: all outliers identified by MAD are also outliers for SEASONAL!
    mutate(OUTLIER_SEASONAL5_SEASONAL3 = if_else(is.na(OUTLIER_SEASONAL5_SEASONAL3) & OUTLIER_MAD15_MAD10, TRUE, OUTLIER_SEASONAL5_SEASONAL3)) 

     if (length(which(is.na(flagged_outliers_allmethods$OUTLIER_SEASONAL5_SEASONAL3))) > 0 ) {
        log_msg("🚨 Unexpected `NA`s found in `flagged_outliers_allmethods$OUTLIER_SEASONAL5_SEASONAL3`.")
    }

    }

print(dim(flagged_outliers_allmethods))
head(flagged_outliers_allmethods, 3)

In [None]:
# --- Drop unnecessary cols (median, mean, bounds ... ) ---------------
flagged_outliers_allmethods <- flagged_outliers_allmethods %>% 
# select(all_of(c(fixed_cols, "INDICATOR", "VALUE")), starts_with("OUTLIER_")) %>% # ⚠️⚠️⚠️ TEMP SWITCH due to changing CONFIG  
select(any_of(c(fixed_cols, "INDICATOR", "VALUE")), starts_with("OUTLIER_")) %>%
# Drop "intermediate" outliers flags (are not among the methods themselves, just needed for Magic Glasses)
select(-any_of(c("OUTLIER_MAD15", "OUTLIER_SEASONAL5"))) # `any_of()`: flexible in case these cols are not there (if MG not run)

# Define the desired renames as a named character vector: new_name = "old_name"
# (Note: This is the format for `rename()` when using `!!!`)
desired_renames <- c(
  OUTLIER_MAGIC_GLASSES_PARTIAL = "OUTLIER_MAD15_MAD10",
  OUTLIER_MAGIC_GLASSES_COMPLETE = "OUTLIER_SEASONAL5_SEASONAL3"
)


# --- Rename cols ---------------

# Get the names of columns currently in the dataframe
current_cols <- names(flagged_outliers_allmethods)

# Filter `desired_renames` to only include pairs where the 'old_name' (the value) exists
#  in the dataframe's current columns ...
existing_renames <- desired_renames[desired_renames %in% current_cols]

# ... then apply the rename operation only for the columns that actually exist.
if (length(existing_renames) > 0) { 
  flagged_outliers_allmethods <- flagged_outliers_allmethods %>%
    rename(!!!existing_renames)
}

# # OLD for REFERENCE
# rename(
#     OUTLIER_MAGIC_GLASSES_PARTIAL = OUTLIER_MAD15_MAD10,
#     OUTLIER_MAGIC_GLASSES_COMPLETE = OUTLIER_SEASONAL5_SEASONAL3
# )

In [None]:
outlier_cols = flagged_outliers_allmethods |> select(starts_with("OUTLIER_")) |> names()
log_msg(glue::glue("Created table `{COUNTRY_CODE}_flagged_outliers_allmethods`, containing the columns : ", paste(outlier_cols, collapse=", ") ))

In [None]:
print(dim(flagged_outliers_allmethods))
head(flagged_outliers_allmethods, 3)

### Export as `.csv`(SILENCED due to size) and `.parquet`

In [None]:
# Define the base directory for saving files
output_dir <- file.path(DATA_PATH, "dhis2", "outliers_detection")

# export_data(data_object=flagged_outliers_allmethods, 
#             file_path=file.path(output_dir, paste0(COUNTRY_CODE, "_flagged_outliers_allmethods.csv")))

export_data(data_object=flagged_outliers_allmethods, 
            file_path=file.path(output_dir, paste0(COUNTRY_CODE, "_flagged_outliers_allmethods.parquet")))


## 4.2. Extract **outliers-only** tables for each individual detection method
As many tables as outliers detection methods implemented.

In [None]:
# Define a list of column names with outlier flags 
outlier_cols_to_process <- flagged_outliers_allmethods %>% select(starts_with("OUTLIER_")) %>% names()
print(outlier_cols_to_process)

#### 4.2.1. Create the outlier tables (one tibble for each outlier column)

In [None]:
# Function to extract the rows flagged as outliers in a given column

extract_outlier_tibble <- function(outlier_col_name) {
  # Dynamically create the list of cols to select
  cols_to_select <- c(fixed_cols, "INDICATOR", "VALUE", outlier_col_name)

  # Filter the data for the current outlier column and return the tibble
  outliers_data <- flagged_outliers_allmethods %>%
    select(all_of(cols_to_select)) %>%
    filter(!!sym(outlier_col_name) == TRUE)

  log_msg(paste0("Extracted ", nrow(outliers_data), " outliers for: ", outlier_col_name))
  return(outliers_data)
}

In [None]:
# Apply function -> creates list of tibbles 
all_outlier_tibbles <- map(outlier_cols_to_process, extract_outlier_tibble) %>%
  set_names(outlier_cols_to_process)
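
The extract-then-name pattern above can be sketched on toy data as follows. This is an illustrative variant under stated assumptions, not the notebook's own implementation: `toy`, `flag_cols` and `toy_outliers` are hypothetical names, and the `.data[[col]]` pronoun is used in place of `!!sym()` to refer to a column by its name.

```r
library(dplyr)
library(purrr)

toy <- tibble(
  OU_ID = c("a", "b", "c"),
  VALUE = c(10, 500, 12),
  OUTLIER_MEAN = c(FALSE, TRUE, FALSE),
  OUTLIER_IQR = c(FALSE, TRUE, TRUE)
)

flag_cols <- toy %>% select(starts_with("OUTLIER_")) %>% names()

# `set_names()` first, so `map()` returns a list already named by flag column
toy_outliers <- flag_cols %>%
  set_names() %>%
  map(\(col) toy %>%
        select(OU_ID, VALUE, all_of(col)) %>%
        filter(.data[[col]]))   # logical flag column used directly as the condition

map_int(toy_outliers, nrow)
```

Each list element keeps only the rows flagged by that method, so the toy result holds 1 row for `OUTLIER_MEAN` and 2 rows for `OUTLIER_IQR`.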

In [None]:
# Example of how to access each tibble in `all_outlier_tibbles`: inspect the outliers found by each detection method individually
head(all_outlier_tibbles$OUTLIER_MEAN, 3)

#### 4.2.2. Export each outlier table as individual file

In [None]:
# Function to export each tibble as a separate file
export_outlier_tibble <- function(outlier_tibble, outlier_col_name) {
  
  # # To export as CSV - 🤫 SILENCED as files are VERY HEAVY ... 
  # file_name_csv <- paste0(COUNTRY_CODE, "_", tolower(outlier_col_name), ".csv")
  # file_path_csv <- file.path(output_dir, file_name_csv)
  # write_csv(outlier_tibble, file_path_csv)
    
  # Replace dots with hyphens in the outlier_col_name for the file name 
  cleaned_outlier_col_name <- gsub("\\.", "-", outlier_col_name)

  # To export as PARQUET
  file_name_parquet <- paste0(COUNTRY_CODE, "_", tolower(cleaned_outlier_col_name), ".parquet")
  file_path_parquet <- file.path(output_dir, file_name_parquet)
  arrow::write_parquet(outlier_tibble, file_path_parquet)

  # message(paste0("Exported ", nrow(outlier_tibble), " outliers to: ", file_path_csv, " and ", file_path_parquet)) # 🤫 SILENCED as files are VERY HEAVY ...
  log_msg(paste0("Method `", cleaned_outlier_col_name, "` identified ", nrow(outlier_tibble), " outliers, and exported to: ", file_path_parquet))
}

In [None]:
# Iterate over the list of tibbles and their names in parallel, exporting each tibble to its own file
purrr::walk2(all_outlier_tibbles, names(all_outlier_tibbles), export_outlier_tibble)
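
To make the file naming explicit, the sketch below isolates the name-building step of `export_outlier_tibble()` into a small helper. `build_outlier_file_name()` is a hypothetical function written for illustration, `"XXX"` stands in for `COUNTRY_CODE`, and the column name `OUTLIER_IQR1.5` is made up to show the dot replacement.

```r
# Hypothetical helper mirroring the naming logic in `export_outlier_tibble()`
build_outlier_file_name <- function(country_code, outlier_col_name) {
  # Dots are replaced with hyphens so the only "." left is the file extension
  cleaned <- gsub("\\.", "-", outlier_col_name)
  paste0(country_code, "_", tolower(cleaned), ".parquet")
}

file_name_mean <- build_outlier_file_name("XXX", "OUTLIER_MEAN")
file_name_iqr <- build_outlier_file_name("XXX", "OUTLIER_IQR1.5")

print(file_name_mean)  # "XXX_outlier_mean.parquet"
print(file_name_iqr)   # "XXX_outlier_iqr1-5.parquet"
```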

# 5. 🐘 Write to **Database** to expose to Shiny app

#### 5.1.1. Add *_NAME cols: `flagged_outliers_allmethods_name`
Get these cols from routine data (see: "**X. 📊 Shiny app: extract `*_NAME` table**")

In [None]:
flagged_outliers_allmethods_name <- flagged_outliers_allmethods |> 
  left_join(pyramid, by = join_by(ADM1_ID, ADM2_ID, OU_ID)) 

dim(flagged_outliers_allmethods_name)
head(flagged_outliers_allmethods_name)
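
A left join like the one above silently duplicates rows if the lookup table contains the same key more than once. One possible safeguard, sketched here on toy data, is the `relationship` argument of `left_join()` (available in dplyr 1.1.0 and later, the same release that introduced `join_by()`). `flags` and `pyramid_toy` are hypothetical stand-ins for the real tables.

```r
library(dplyr)

flags <- tibble(
  ADM1_ID = c("r1", "r1"), ADM2_ID = c("d1", "d1"), OU_ID = c("hf1", "hf2"),
  VALUE = c(10, 20)
)
pyramid_toy <- tibble(
  ADM1_ID = "r1", ADM2_ID = "d1", OU_ID = c("hf1", "hf2"),
  ADM1_NAME = "Region 1", ADM2_NAME = "District 1", OU_NAME = c("HF one", "HF two")
)

# "many-to-one": errors if any key matches more than one row of `pyramid_toy`
joined <- flags |>
  left_join(pyramid_toy,
            by = join_by(ADM1_ID, ADM2_ID, OU_ID),
            relationship = "many-to-one")

# The join should add columns, never rows
stopifnot(nrow(joined) == nrow(flags))
```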

#### 5.1.2. Add `DATE` col: `flagged_outliers_allmethods_name_date`

In [None]:
flagged_outliers_allmethods_name_date <- flagged_outliers_allmethods_name |> 
  mutate(DATE = make_date(year = YEAR, month = MONTH, day = 1L)) 

dim(flagged_outliers_allmethods_name_date)
head(flagged_outliers_allmethods_name_date, 3)
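
As a quick illustration of the `DATE` construction, `lubridate::make_date()` builds a `Date` from numeric year and month, pinned here to the first day of the month. The toy values below are made up.

```r
library(dplyr)
library(lubridate)

toy_dates <- tibble(YEAR = c(2023L, 2024L), MONTH = c(12L, 1L)) |>
  mutate(DATE = make_date(year = YEAR, month = MONTH, day = 1L))

toy_dates$DATE  # "2023-12-01" "2024-01-01"
```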

#### 5.2. Write table to DB

In [None]:
# library(DBI)
# library(RPostgres)

In [None]:
dbname <- Sys.getenv("WORKSPACE_DATABASE_DB_NAME")
host <- Sys.getenv("WORKSPACE_DATABASE_HOST")
port <- Sys.getenv("WORKSPACE_DATABASE_PORT")
username <- Sys.getenv("WORKSPACE_DATABASE_USERNAME")
password <- Sys.getenv("WORKSPACE_DATABASE_PASSWORD")

In [None]:
con <- DBI::dbConnect(RPostgres::Postgres(),
                      dbname = dbname, 
                      host = host, 
                      port = port, 
                      user = username,
                      password = password,
                      sslmode = 'require'
                     )

In [None]:
# Write table to DB
DBI::dbWriteTable(con, 
                  "flagged_outliers_allmethods_name_date", 
                  flagged_outliers_allmethods_name_date, 
                  overwrite = TRUE, 
                  row.names = FALSE
                 ) 

In [None]:
DBI::dbDisconnect(con)
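
If the write step fails, the connection above stays open until the session ends. A possible pattern, sketched under assumptions, is to wrap the write in `tryCatch(..., finally = ...)` so the disconnect always runs, and to read back a row count as a sanity check. An in-memory SQLite database (via the RSQLite package) stands in for the workspace Postgres database here; `con_toy` and `toy_flagged_outliers` are hypothetical names.

```r
library(DBI)

# Assumption: an in-memory SQLite DB stands in for the workspace Postgres DB
con_toy <- dbConnect(RSQLite::SQLite(), ":memory:")

toy_table <- data.frame(OU_ID = c("hf1", "hf2"), OUTLIER_MEAN = c(TRUE, FALSE))

n_rows_db <- tryCatch({
  dbWriteTable(con_toy, "toy_flagged_outliers", toy_table,
               overwrite = TRUE, row.names = FALSE)
  # Read back the row count to confirm the table was written in full
  dbGetQuery(con_toy, "SELECT COUNT(*) AS n FROM toy_flagged_outliers")$n
}, finally = dbDisconnect(con_toy))  # runs even if the write fails

n_rows_db == nrow(toy_table)
```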