# Insecticide-treated net (ITN) access and use, DHS data

## Resources

https://dhsprogram.com/data/Guide-to-DHS-Statistics/Access_to_an_Insecticide-Treated_Net_ITN.htm

https://dhsprogram.com/data/Guide-to-DHS-Statistics/index.htm#t=Use_of_Mosquito_Nets_by_Persons_in_the_Household.htm%23Percentage_of_the1bc-1&rhtocid=_15_3_0

https://dhsprogram.com/publications/publication-dhsg4-dhs-questionnaires-and-manuals.cfm

### Access

Percentage of the de facto household population with access to an ITN in the household, defined as the proportion of the de facto household population who could sleep under an ITN if each ITN in the household were used by up to two people.

Numerator: Number of de facto persons (hv103 = 1) who could sleep under an ITN if each ITN in the household is used by up to 2 people, calculated for each household as the minimum of:

1. number of de facto persons in the household (hv013), and
2. twice the number of ITNs in the household (2 * the number of hml10_1 – hml10_7 variables equal to 1), assuming that at most two people can sleep under one bednet
   
Denominator: Number of persons who stayed in the household the night before the survey (hv103 = 1)

Variables: hhid (household identification), hml10_1 – hml10_7 (insecticide-treated net (ITN)), hv013 (number of de facto members), hv103 (slept last night), hv005 (household sample weight)
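
As a worked illustration of the access definition above, the sketch below applies the "minimum of de facto persons and twice the number of ITNs" rule to three made-up households. The data frame and its column names (`n_itn`, `n_defacto`) are hypothetical stand-ins for the DHS variables, and sampling weights are ignored for simplicity.

```r
# Toy households (hypothetical values, not DHS data)
hh <- data.frame(
  hhid      = c("A", "B", "C"),
  n_itn     = c(1, 3, 0),  # number of ITNs owned by the household
  n_defacto = c(5, 4, 2)   # persons who slept in the household last night
)

# each ITN covers at most 2 people, and never more people than are present
hh$potential_users <- pmin(2 * hh$n_itn, hh$n_defacto)

# unweighted access: persons who could sleep under an ITN / de facto persons
access <- sum(hh$potential_users) / sum(hh$n_defacto)
access
```

Household B owns more nets than it needs (6 potential places for 4 people), so its contribution is capped at 4.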

### Use

1)      Percentage of the de facto household population who slept the night before the survey under a mosquito net (treated or untreated).

2)      Percentage of the de facto household population who slept the night before the survey under an insecticide-treated net (ITN).

3)      Among the de facto household population in households with at least one ITN, the percentage who slept under an ITN the night before the survey.

Coverage:
Population base: De facto household members (PR file, HR file)
Time period: Night before the survey

Numerators:
1)      Number of de facto persons who reported sleeping under any mosquito net the night before the survey (hv103 = 1 & hml12 in 1:3)
2)      Number of de facto persons who reported sleeping under an ITN the night before the survey (hv103 = 1 & hml12 in 1:2)
3)      Number of de facto persons in households with at least one ITN who reported sleeping under an ITN the night before the survey (hv103 = 1 & hml12 in 1:2 & any hml10_1 – hml10_7 = 1)

Denominators:
a)       Number of persons in the de facto household population (hv103 = 1)
b)       Number of persons in the de facto household population (hv103 = 1)
c)       Number of persons in the de facto household population in households owning at least one ITN (hv103 = 1 & any hml10_1 – hml10_7 = 1)

Variables: HR file, PR file.


**Project uses numerator 2) Number of de facto persons who reported sleeping under an ITN the night before the survey (hv103 = 1 & hml12 in 1:2)**

**Project uses denominator b) Number of persons in the de facto household population (hv103 = 1)**
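
A minimal sketch of the chosen numerator and denominator on person-level toy data may help. The values below are invented; the HML12 codes in the comments follow the labels commonly found in DHS recodes and should be checked against the value labels of the survey at hand.

```r
# Toy person records (hypothetical values, not DHS data)
persons <- data.frame(
  hv103 = c(1, 1, 1, 0, 1, NA),  # 1 = slept in the household last night
  hml12 = c(1, 2, 3, 1, 0, 1)    # assumed codes: 0 = no net, 1 = only ITN,
                                 # 2 = ITN and untreated, 3 = only untreated
)

# denominator b): de facto persons (missing hv103 is treated as not de facto)
de_facto <- persons$hv103 %in% 1

# numerator 2): de facto persons who slept under an ITN
slept_itn <- de_facto & persons$hml12 %in% c(1, 2)

use <- sum(slept_itn) / sum(de_facto)
use
```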

## Preliminary steps

In [None]:
rm(list = ls())

options(scipen=999)

In [None]:
# Global paths
Sys.setenv(PROJ_LIB = "/opt/conda/share/proj")
Sys.setenv(GDAL_DATA = "/opt/conda/share/gdal")

In [None]:
# Paths
ROOT_PATH <- '~/workspace'
CONFIG_PATH <- file.path(ROOT_PATH, 'configuration')
CODE_PATH <- file.path(ROOT_PATH, 'code')
DATA_PATH <- file.path(ROOT_PATH, 'data')
DHS_DATA_PATH <- file.path(DATA_PATH, 'dhs', 'raw')
OUTPUT_DATA_PATH <- file.path(DATA_PATH, 'dhs', 'indicators', 'bednets')

In [None]:
# Load utils
source(file.path(CODE_PATH, "snt_utils.r"))

# List required packages
required_packages <- c("haven", "sf", "glue", "survey", "data.table", "stringi", "jsonlite", "httr", "reticulate", "arrow")

# Execute function
install_and_load(required_packages)

In [None]:
Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python")
reticulate::py_config()$python
openhexa <- import("openhexa.sdk")

# Load SNT config
CONFIG_FILE_NAME <- "SNT_config.json"
config_json <- tryCatch({ fromJSON(file.path(CONFIG_PATH, CONFIG_FILE_NAME)) },
                        error = function(e) {
                          msg <- paste0("Error while loading configuration: ", conditionMessage(e))
                          cat(msg)   
                          stop(msg) 
                        })

msg <- paste0("SNT configuration loaded from: ", file.path(CONFIG_PATH, CONFIG_FILE_NAME))
log_msg(msg)

# Set config variables
COUNTRY_CODE <- config_json$SNT_CONFIG$COUNTRY_CODE

## Geo and admin data

In [None]:
admin_level <- 'ADM1'
admin_id_col <- glue(admin_level, 'ID', .sep='_')
admin_name_col <- glue(admin_level, 'NAME', .sep='_')
admin_cols <- c(admin_id_col, admin_name_col)

In [None]:
# Load spatial file from dataset 

dhis2_dataset <- config_json$SNT_DATASET_IDENTIFIERS$DHIS2_DATASET_FORMATTED

spatial_data_filename <- paste(COUNTRY_CODE, "shapes.geojson", sep = "_")
# spatial_data <- read_sf(file.path(DATA_PATH, 'dhis2', 'formatted', spatial_data_filename))
spatial_data <- get_latest_dataset_file_in_memory(dhis2_dataset, spatial_data_filename)
log_msg(glue("File {spatial_data_filename} successfully loaded from dataset version: {dhis2_dataset}"))

In [None]:
spatial_data <- st_as_sf(spatial_data)

In [None]:
# aggregate geometries by the admin columns
spatial_data <- aggregate_geometry(
  sf_data=spatial_data,
  admin_id_colname=admin_id_col,
  admin_name_colname=admin_name_col
)

# keep class
spatial_data <- st_as_sf(spatial_data)

if(COUNTRY_CODE == "COD"){
  spatial_data[[admin_name_col]] <- clean_admin_names(spatial_data[[admin_name_col]])
}

admin_data <- st_drop_geometry(spatial_data)
setDT(admin_data)

## Import DHS data

In [None]:
data_source <- 'DHS'
indicator_access <- 'PCT_ITN_ACCESS'
indicator_use <- 'PCT_ITN_USE'

### Unzip data for the analysis

In [None]:
household_recode <- 'HR'
person_recode <- 'PR'
target_file_type <- 'SV'

delete_otherextension_files(DHS_DATA_PATH, extension_to_retain=".zip")

In [None]:
dhs_hr_zip_filename <- extract_latest_dhs_recode_filename(DHS_DATA_PATH, household_recode, target_file_type)
unzip(file.path(DHS_DATA_PATH, dhs_hr_zip_filename), exdir=DHS_DATA_PATH)

dhs_pr_zip_filename <- extract_latest_dhs_recode_filename(DHS_DATA_PATH, person_recode, target_file_type)
unzip(file.path(DHS_DATA_PATH, dhs_pr_zip_filename), exdir=DHS_DATA_PATH)

In [None]:
# # Remove existing output files
# files <- list.files(OUTPUT_DATA_PATH, full.names = TRUE)
# files_to_delete <- files[grepl('_ITN_', basename(files), ignore.case = TRUE) & grepl(COUNTRY_CODE, basename(files), ignore.case = TRUE)]
# file.remove(files_to_delete)

### Import data files

In [None]:
data_extension <- '.SAV'
dhs_hr_filename <- list.files(path = DHS_DATA_PATH, pattern = paste0(".*", household_recode, ".*\\", data_extension, "$"), ignore.case=TRUE)
dhs_pr_filename <- dir(path = DHS_DATA_PATH, pattern = paste0(".*", person_recode, ".*\\", data_extension, "$"), ignore.case=TRUE)

if(!check_dhs_same_version(dhs_hr_filename, dhs_pr_filename)){
  stop("The necessary DHS data do not have the same version/issue. Check available data before rerunning.")
}
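
To make the file-name pattern used above easier to follow, here is a small sketch of how it is assembled and what it matches. The file names and the `build_pattern` helper are hypothetical and serve only as an illustration.

```r
# Hypothetical file names, for illustration only
files <- c("CDHR61FL.SAV", "cdpr61fl.sav", "CDHR61FL.ZIP")

build_pattern <- function(recode, extension) {
  # the "\\" escapes the leading dot of the extension, so ".SAV" is matched literally
  paste0(".*", recode, ".*\\", extension, "$")
}

hr_pattern <- build_pattern("HR", ".SAV")
hr_match <- grepl(hr_pattern, files, ignore.case = TRUE)
pr_match <- grepl(build_pattern("PR", ".SAV"), files, ignore.case = TRUE)

data.frame(files, hr_match, pr_match)
```

Because the recode code is matched anywhere in the name, a file whose name contains "HR" or "PR" elsewhere (for instance as a country code) would also match, and several matching files would make the result longer than one element. It may be worth checking the length of the result before reading the file.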

In [None]:
dhs_hr_dt <- read_spss(file.path(DHS_DATA_PATH, dhs_hr_filename)) # household recode
dhs_hr_dt <- setDT(dhs_hr_dt)

dhs_pr_dt <- read_spss(file.path(DHS_DATA_PATH, dhs_pr_filename)) # person recode
dhs_pr_dt <- setDT(dhs_pr_dt)

In [None]:
# Make admin codes and names dataframe (for future merging)

dhs_admin_dt <- make_dhs_admin_df(
  input_dhs_df=dhs_hr_dt,
  original_admin_column="HV024",
  new_admin_name_colname=admin_name_col,
  new_admin_code_colname='DHS_ADM1_CODE'
)

# format the names to be like DHIS2 names
dhs_admin_dt[, (admin_name_col) := format_names(get(admin_name_col))]

# TODO this should be changed in the formatting of DHIS2 data; the correct name should be with a space
dhs_admin_dt[get(admin_name_col) == "MAI NDOMBE", (admin_name_col) := "MAINDOMBE"]

In [None]:
# Check that all regions can be matched with DHIS2 pyramid
if(!check_perfect_match(dhs_admin_dt, admin_name_col, admin_data, admin_name_col)){
  stop("The DHS data provided does not fully match DHIS2 pyramid data. Please check input data before retrying.")
}

### Set relevant columns

In [None]:
household_id_cols <- c("HHID", "HV000", "HV001", "HV002")
original_household_ITN_cols <- grep('HML10', names(dhs_hr_dt), value = TRUE)
household_sampling_cols <- c("HV005", "HV021", "HV022", "HV023", "HV024")
household_inhabitants_col <- "HV013"
person_slept_col <- "HV103"
person_id_col <- "HVIDX"
person_bednet_col <- "HML12"

## Preprocess Household recode data

In [None]:
# filter columns
hr_dt <- dhs_hr_dt[, .SD, .SDcols=c(household_id_cols, household_sampling_cols, household_inhabitants_col, original_household_ITN_cols)]

# check that no crucial variable was omitted (the number of fully duplicated rows should be 0)
nrow(hr_dt[duplicated(hr_dt)])


In [None]:
sapply(original_household_ITN_cols, function(i) table(hr_dt[[i]], useNA = 'always'))

In [None]:
# make syntactically valid names
setnames(hr_dt, old = names(hr_dt), new = make.names(names(hr_dt)))
household_ITN_cols <- grep('HML10', names(hr_dt), value = TRUE)
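
The renaming step can be illustrated on a few column names. The raw names below are an assumption (SPSS exports of repeated variables often carry a `$` before the slot number; the notebook does not show the actual raw names), so treat this as a sketch of what `make.names` does rather than a description of the real file.

```r
# Assumed raw column names, for illustration only
raw_names <- c("HHID", "HV013", "HML10$1", "HML10$2")

# make.names() replaces characters that are invalid in R names with "."
clean_names <- make.names(raw_names)

# the ITN columns can still be found by their common prefix
itn_cols <- grep("HML10", clean_names, value = TRUE)
clean_names
itn_cols
```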

In [None]:
sapply(household_ITN_cols, function(i) table(hr_dt[[i]], useNA = 'always'))

In [None]:
# add admin name column
hr_dt <- merge.data.table(dhs_admin_dt, hr_dt, by.x = "DHS_ADM1_CODE", by.y = "HV024", all = TRUE)

# sapply(household_ITN_cols, function(i) table(hr_dt[[i]], useNA = 'always'))

hr_dt[, (household_ITN_cols) := lapply(.SD, function(x) {
  x <- as.integer(as.character(x))  # convert labelled/factor/character values to integer
  ifelse(is.na(x), 0, x)            # a missing net slot counts as "no ITN"
}), .SDcols = household_ITN_cols]

# compute the maximum potential users, given the number of ITNs present in the household
hr_dt[, max_users := 2 * rowSums(.SD, na.rm = TRUE), .SDcols = household_ITN_cols] # maximum 2 times the number of ITNs in the household

# compute real potential users
hr_dt[, potential_users := pmin(max_users, HV013, na.rm = TRUE)]

# compute weights
hr_dt[, wt := HV005/1000000]
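
The household preprocessing above can be sketched on a toy `data.table`. All values and the object name `toy_hr` are hypothetical; the column names mimic the renamed DHS variables.

```r
library(data.table)

# Hypothetical household records; HML10.x = 1 means net slot x holds an ITN
toy_hr <- data.table(
  HV013   = c(5, 4, 2),                  # de facto members
  HML10.1 = c(1, 1, NA),                 # NA: no net recorded in this slot
  HML10.2 = c(0, 1, NA),
  HV005   = c(1500000, 800000, 1000000)  # household weight, 6 implied decimals
)
itn_cols <- grep("HML10", names(toy_hr), value = TRUE)

# treat missing slots as "no ITN"
toy_hr[, (itn_cols) := lapply(.SD, function(x) ifelse(is.na(x), 0, x)),
       .SDcols = itn_cols]

# two potential users per ITN, capped at the number of de facto members
toy_hr[, max_users := 2 * rowSums(.SD), .SDcols = itn_cols]
toy_hr[, potential_users := pmin(max_users, HV013)]

# rescale the weight to its natural scale
toy_hr[, wt := HV005 / 1000000]
toy_hr
```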

## Access to ITN

### Preprocess person file

In [None]:
# filter relevant columns
access_pr_dt <- dhs_pr_dt[, .SD, .SDcols = c(
  household_id_cols,
  person_id_col,
  person_slept_col
)]

# # check no necessary column was omitted
# nrow(access_pr_dt[duplicated(access_pr_dt)])

In [None]:
# make denominator: group and sum, removing NAs
access_pr_dt <- access_pr_dt[, .(total_slept = sum(get(person_slept_col), na.rm = TRUE)), by = household_id_cols]

### Join with household file

In [None]:
# check merge with household file
check_perfect_match(hr_dt, 'HHID', access_pr_dt, 'HHID')

# lapply(household_id_cols, function(i) check_perfect_match(hr_dt, i, access_pr_dt, i))
if(!all(unlist((lapply(household_id_cols, function(i) check_perfect_match(hr_dt, i, access_pr_dt, i)))))){
  print('Person and Household data does not match')
}

In [None]:
access_dt <- merge.data.table(hr_dt, access_pr_dt, by = household_id_cols, all = TRUE)

# filter rows
access_dt <- access_dt[total_slept > 0] # to not divide by 0 (only households where someone slept last night)

DHS guidelines for the calculation of “potential users”: "In households which have more than 1 ITN for every 2 people, the product of this calculation will be greater than the number of individuals who spent the previous night. In this case, the “potential users” variable in that household should be modified to reflect the number of individuals who spent the previous night in the household because the number of potential users in a household cannot exceed the number of individuals who spent the previous night in that household."

In [None]:
# cap potential users at the number of people who slept in the household last night
access_dt[, potential_users := fifelse(
    potential_users > total_slept,
    total_slept,
    potential_users
)]

### Compute ITN access indicator

In [None]:
access_dt[, (indicator_access) :=  potential_users / total_slept]

In [None]:
summary(access_dt[[indicator_access]])

#### Account for the sampling strategy

In [None]:
# clustering, stratification, weights (for means, proportions, regression models, etc.)
access_design_sampling <- svydesign(
  ids = ~ HV021, # primary sampling unit / cluster ids (cluster number and/or ultimate area unit)
  data = access_dt, # dataset
  strata = ~ HV023, # groupings of primary sampling units
  weights = ~ wt, # the sampling weights variable
  nest = TRUE # the primary sampling units are nested within the strata
)

In [None]:
bednet_access_table <- svyby(formula = as.formula(paste("~", indicator_access)),
                           # by = ~ ADM1,
                           by = reformulate(admin_name_col),
                           FUN = svymean,
                           design = access_design_sampling,
                           level = 0.95,
                           vartype = "ci",
                           na.rm = TRUE,
                           influence = TRUE)
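
For readers unfamiliar with the `survey` package, the following self-contained sketch shows the same design-then-estimate pattern on invented data. The data frame `toy` and its columns are hypothetical; only `svydesign`, `svyby` and `svymean` come from the package.

```r
library(survey)

# Invented household-level data: 2 strata, 2 clusters each, 2 households per cluster
toy <- data.frame(
  region  = rep(c("NORTH", "SOUTH"), each = 4),
  stratum = rep(c(1, 2), each = 4),
  psu     = rep(c(1, 2, 1, 2), each = 2),  # cluster ids restart in each stratum
  wt      = c(1, 1, 2, 2, 1, 1, 1, 1),
  access  = c(0.5, 1, 0, 0.5, 1, 1, 0.5, 0.5)
)

# nest = TRUE tells svydesign that cluster ids are only unique within a stratum
design <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                    data = toy, nest = TRUE)

# weighted mean per region, with confidence interval columns ci_l and ci_u
est <- svyby(~access, ~region, design, svymean, vartype = "ci")
est
```

The point estimate for each region is the weighted mean of the household values, while the interval reflects the clustering and stratification declared in the design.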

In [None]:
setDT(bednet_access_table)

In [None]:
lower_bound_col <- glue("{toupper(indicator_access)}_CI_LOWER_BOUND")
upper_bound_col <- glue("{toupper(indicator_access)}_CI_UPPER_BOUND")
sample_avg_col <- glue("{toupper(indicator_access)}_SAMPLE_AVERAGE")

In [None]:
# names(bednet_access_table) <- toupper(names(bednet_access_table))
names(bednet_access_table)[names(bednet_access_table) == 'ci_l'] <- lower_bound_col
names(bednet_access_table)[names(bednet_access_table) == 'ci_u'] <- upper_bound_col
names(bednet_access_table)[names(bednet_access_table) == indicator_access] <- sample_avg_col

In [None]:
# Clip the CI bounds to [0, 1] (small samples can produce wide CIs that overshoot)
bednet_access_table[get(lower_bound_col) < 0, (lower_bound_col) := 0]
bednet_access_table[get(upper_bound_col) > 1, (upper_bound_col) := 1]

In [None]:
# Convert to percentages
bednet_access_table[, (lower_bound_col) := get(lower_bound_col) * 100]
bednet_access_table[, (upper_bound_col) := get(upper_bound_col) * 100]
bednet_access_table[, (sample_avg_col) := get(sample_avg_col) * 100]
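
The clipping and percentage conversion can be checked on a small invented table. The object `toy_tab` and its values are hypothetical and only meant to show the effect of the two steps.

```r
library(data.table)

# Invented estimates with bounds that overshoot the [0, 1] range
toy_tab <- data.table(
  ADM1_NAME = c("A", "B"),
  avg  = c(0.10, 0.95),
  ci_l = c(-0.05, 0.80),
  ci_u = c(0.25, 1.10)
)

# clip the bounds to the valid range of a proportion
toy_tab[ci_l < 0, ci_l := 0]
toy_tab[ci_u > 1, ci_u := 1]

# convert proportions to percentages
pct_cols <- c("avg", "ci_l", "ci_u")
toy_tab[, (pct_cols) := lapply(.SD, function(x) x * 100), .SDcols = pct_cols]
toy_tab
```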

In [None]:
bednet_access_table <- merge.data.table(admin_data, bednet_access_table, by = admin_name_col, all = TRUE)

In [None]:
head(bednet_access_table)

In [None]:
filename_without_extension <- glue("{COUNTRY_CODE}_{data_source}_{admin_level}_{toupper(indicator_access)}")
write.csv(bednet_access_table, file = file.path(OUTPUT_DATA_PATH, paste0(filename_without_extension, '.csv')), row.names = FALSE)
write_parquet(bednet_access_table, file.path(OUTPUT_DATA_PATH, paste0(filename_without_extension, '.parquet')))

## ITN use

### Preprocess person file

In [None]:
# filter columns
use_pr_dt <- dhs_pr_dt[, .SD, .SDcols=c(household_id_cols, person_id_col, person_slept_col, person_bednet_col)]

# check no necessary column was omitted
nrow(use_pr_dt[duplicated(use_pr_dt)])

# # for(i in person_slept_col){print(table(access_pr_dt[[i]]))}
# sapply(person_bednet_col, function(i) table(use_pr_dt[[i]], useNA = 'always'))

The DHS guide ( https://dhsprogram.com/data/Guide-to-DHS-Statistics/index.htm#t=Use_of_Mosquito_Nets_by_Persons_in_the_Household.htm ) counts both 1 and 2 as ITN values for HML12. Note that 2 means "Both treated (ITN) and untreated nets". The guide's definition is followed here, but this should be kept in mind when interpreting the results.

In [None]:
# # group and sum, removing NAs and keeping only 1 as valid value
# use_pr_dt <- use_pr_dt[, slept_itn := as.integer(
#   get(person_slept_col) == 1 & (get(person_bednet_col) == 1)
# )]

In [None]:
# flag de facto persons who slept under an ITN (HML12 equal to 1 or 2)
use_pr_dt <- use_pr_dt[, slept_itn := as.integer(
  get(person_slept_col) == 1 & (get(person_bednet_col) %in% c(1, 2))
)]

# check recodings are correct
xtabs(~ get(person_slept_col) + get(person_bednet_col) + slept_itn, data = use_pr_dt, addNA = TRUE)

In [None]:
use_pr_dt <- use_pr_dt[, .(
  total_slept = sum(get(person_slept_col), na.rm = TRUE),
  total_slept_itn = sum(get("slept_itn"), na.rm = TRUE)
), by = household_id_cols
]

use_pr_dt[, (indicator_use) := total_slept_itn / total_slept]
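
The flag-then-aggregate logic above can be traced on a few invented person records. The object names `toy_pr` and `toy_hh` are hypothetical. The third household has no de facto member, which shows what happens when the denominator is zero.

```r
library(data.table)

# Invented person records for three households
toy_pr <- data.table(
  HHID  = c("A", "A", "A", "B", "B", "C"),
  HV103 = c(1, 1, 0, 1, 1, 0),  # slept in the household last night
  HML12 = c(1, 0, 1, 2, 3, 1)   # type of net slept under
)

# flag de facto persons who slept under an ITN
toy_pr[, slept_itn := as.integer(HV103 == 1 & HML12 %in% c(1, 2))]

# aggregate to household level
toy_hh <- toy_pr[, .(
  total_slept     = sum(HV103, na.rm = TRUE),
  total_slept_itn = sum(slept_itn, na.rm = TRUE)
), by = HHID]

# household-level proportion; 0 / 0 yields NaN for household C
toy_hh[, use := total_slept_itn / total_slept]
toy_hh
```

Households with no de facto member end up with `NaN`, which `is.na()` treats as missing, so an estimation step called with `na.rm = TRUE` would be expected to drop them.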

### Join with household file

In [None]:
use_dt <- merge.data.table(hr_dt, use_pr_dt, by = household_id_cols)

### Compute ITN use indicator

#### Account for sampling strategy

In [None]:
use_design_sampling <- svydesign(
  ids = ~ HV021, # primary sampling unit / cluster ids (cluster number and/or ultimate area unit)
  data = use_dt, # dataset
  strata = ~ HV023, # groupings of primary sampling units
  weights = ~ wt, # the sampling weights variable
  nest = TRUE # the primary sampling units are nested within the strata
)

In [None]:
bednet_use_table <- svyby(formula = as.formula(paste("~", indicator_use)),
                          # by = ~ ADM1,
                          by = reformulate(admin_name_col),
                          FUN = svymean,
                          design = use_design_sampling,
                          level = 0.95,
                          vartype = "ci",
                          na.rm = TRUE,
                          influence = TRUE)

In [None]:
setDT(bednet_use_table)

In [None]:
lower_bound_col <- glue("{toupper(indicator_use)}_CI_LOWER_BOUND")
upper_bound_col <- glue("{toupper(indicator_use)}_CI_UPPER_BOUND")
sample_avg_col <- glue("{toupper(indicator_use)}_SAMPLE_AVERAGE")

names(bednet_use_table)[names(bednet_use_table) == 'ci_l'] <- lower_bound_col
names(bednet_use_table)[names(bednet_use_table) == 'ci_u'] <- upper_bound_col
names(bednet_use_table)[names(bednet_use_table) == indicator_use] <- sample_avg_col

In [None]:
# Clip the CI bounds to [0, 1] (small samples can produce wide CIs that overshoot)
bednet_use_table[get(lower_bound_col) < 0, (lower_bound_col) := 0]
bednet_use_table[get(upper_bound_col) > 1, (upper_bound_col) := 1]

In [None]:
# Convert to percentages
bednet_use_table[, (lower_bound_col) := get(lower_bound_col) * 100]
bednet_use_table[, (upper_bound_col) := get(upper_bound_col) * 100]
bednet_use_table[, (sample_avg_col) := get(sample_avg_col) * 100]

In [None]:
bednet_use_table <- merge.data.table(admin_data, bednet_use_table, by = admin_name_col, all = TRUE)

In [None]:
filename_without_extension <- glue("{COUNTRY_CODE}_{data_source}_{admin_level}_{indicator_use}")
write.csv(bednet_use_table, file = file.path(OUTPUT_DATA_PATH, paste0(filename_without_extension, '.csv')), row.names = FALSE)
write_parquet(bednet_use_table, file.path(OUTPUT_DATA_PATH, paste0(filename_without_extension, '.parquet')))