
### Are there **geographic hotspots** where **neonatal mortality** and **morbidity** cluster?

To answer the research question, variables were selected and grouped into four categories:

1. Geographical location
2. Outcomes -> mortality & morbidity
3. Care Seeking Behavior
    1. Preventive dimensions -> Vaccinations
    2. Responsive dimensions -> Treatment and seeking help
    3. Exposure dimensions -> birth weight + ANC/PNC
4. Contextual/ Environmental Risk Factors

The data consists of **83 selected variables** related to these categories.

In [88]:
import pandas as pd

df = pd.read_csv("KR_data.csv")

In [None]:
# TODO: Add the following variables: Place of birth (M15), Current age of respondent (V102), water source (V113)
selected_colms = [
    "V024",
    "V026",
    "V102",
    "V106",
    "V206",
    "V207",
    "B5",
    "B6",
    "B7",
    "B8",
    "M18",
    "M19",
    "H2",
    "H3",
    "H4",
    "H5",
    "H6",
    "H7",
    "H8",
    "H9",
    "H11",
    "H12A",
    "H12B",
    "H12C",
    "H12D",
    "H12E",
    "H12G",
    "H12J",
    "H12K",
    "H12M",
    "H12S",
    "H12T",
    "H12U",
    "H12X",
    "H13",
    "H13B",
    "H14",
    "H15",
    "H15A",
    "H15B",
    "H15C",
    "H15D",
    "H15E",
    "H15F",
    "H15G",
    "H15H",
    "H15I",
    "H20",
    "H22",
    "H31",
    "H31B",
    "H31C",
    "H32A",
    "H32B",
    "H32C",
    "H32D",
    "H32E",
    "H32G",
    "H32K",
    "H32S",
    "H32T",
    "H32X",
    "H32Y",
    "H37A",
    "H37B",
    "H37D",
    "H37DA",
    "H37E",
    "H37AA",
    "H37AB",
    "H37H",
    "H37I",
    "H37J",
    "H37K",
    "H37L",
    "H37M",
    "H37X",
    "H37Y",
    "H37Z",
    "V190",
    "M14",
    "M13",
    "M70",
]

KR_df = df[selected_colms]

KR_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21218 entries, 0 to 21217
Data columns (total 83 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V024    21218 non-null  float64
 1   V026    21218 non-null  float64
 2   V102    21218 non-null  float64
 3   V106    21218 non-null  float64
 4   V206    4178 non-null   float64
 5   V207    4178 non-null   float64
 6   B5      19491 non-null  float64
 7   B6      949 non-null    float64
 8   B7      949 non-null    float64
 9   B8      18452 non-null  float64
 10  M18     18212 non-null  float64
 11  M19     1867 non-null   float64
 12  H2      4239 non-null   float64
 13  H3      3269 non-null   float64
 14  H4      4063 non-null   float64
 15  H5      2812 non-null   float64
 16  H6      3052 non-null   float64
 17  H7      2470 non-null   float64
 18  H8      2329 non-null   float64
 19  H9      3699 non-null   float64
 20  H11     17066 non-null  float64
 21  H12A    88 non-null     float64
 22

In [90]:
# calculate the missing values
missing_values = KR_df.isnull().sum()
print(missing_values)
# 77 out of 83 columns have missing values

V024        0
V026        0
V102        0
V106        0
V206    17040
        ...  
H37Z    21182
V190        0
M14     12314
M13         0
M70     19042
Length: 83, dtype: int64


## Missing Data Analysis

Out of the 83 selected columns, **77 contained missing values**.

To manage missingness, variables were categorized by the percentage of missing data.

- **Low Missingness (≤5%)**: Considered negligible, generally left as-is or imputed minimally.
- **Moderate Missingness (5–10%)**: Evaluated for safe imputation strategies.
- **High Missingness (>10%)**: Required case-by-case assessment; often contextual or structural (e.g., skip patterns in the survey). These variables are handled in two bands: 10–30% and >30%.
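
As a rough illustration of this banding, the sketch below bins each column's missing percentage with `pd.cut`. The helper name `missingness_bands` and the toy frame are hypothetical and not part of the survey data; the cut points mirror the thresholds listed above.

```python
import numpy as np
import pandas as pd


def missingness_bands(frame: pd.DataFrame) -> pd.DataFrame:
    """Return each column's percentage of missing values and its band."""
    pct = frame.isna().mean() * 100
    # Right-closed bins: (-inf, 0], (0, 5], (5, 10], (10, inf)
    bands = pd.cut(
        pct,
        bins=[-np.inf, 0, 5, 10, np.inf],
        labels=["none", "low", "moderate", "high"],
    )
    return pd.DataFrame({"pct_missing": pct, "band": bands})


# Toy data (hypothetical): 40 rows with different amounts of NaN per column.
toy = pd.DataFrame(
    {
        "complete": np.arange(40, dtype=float),
        "low": [np.nan] * 1 + [1.0] * 39,       # 2.5% missing
        "moderate": [np.nan] * 3 + [1.0] * 37,  # 7.5% missing
        "high": [np.nan] * 20 + [1.0] * 20,     # 50% missing
    }
)
summary = missingness_bands(toy)
print(summary)
```

Applied to `KR_df`, the same helper would give one table covering every band instead of re-running the filter once per range.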

In [91]:
# Categorize columns by missingness
# list columns with more than 0% and less than 10% missing data, a range that is usually safe to impute
missing_data = KR_df.isnull().sum() / KR_df.shape[0] * 100
missing_data = missing_data[(missing_data < 10) & (missing_data > 0)].sort_values(
    ascending=False
)
missing_data

B5    8.139316
dtype: float64

### Variable **B5 (Child Alive or Dead at Time of Interview)**

- **Missingness:** 8%
- **Assessment:**
    - Records with missing B5 values also tended to be missing in other critical columns (e.g., sickness symptoms, ANC/PNC details, mother’s reported number of deceased children).
    - No geographic pattern or correlation with district was found.
- **Decision:**
    - Dropped rows with missing B5 values, as these cases were largely incomplete across multiple outcome-related variables.
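
One possible way to run the first assessment point is to compare each column's missing rate between rows where B5 is missing and rows where it is present. The sketch below is an assumption about how such a check could look, not the check actually used here; `comissing_rates` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def comissing_rates(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Missing rate of every other column, split by whether `key` is missing."""
    key_missing = frame[key].isna()
    others = frame.drop(columns=[key]).isna()
    # Group the boolean "is missing" flags by the key's missingness and average them
    rates = others.groupby(key_missing).mean().T
    return rates.rename(columns={False: "key_present", True: "key_missing"})


# Toy data (hypothetical values, column names borrowed from the notebook)
toy = pd.DataFrame(
    {
        "B5": [1.0, 1.0, 2.0, np.nan, np.nan, 1.0],
        "H11": [0.0, 1.0, np.nan, np.nan, np.nan, 0.0],
        "V024": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    }
)
rates = comissing_rates(toy, "B5")
print(rates)
```

A column whose `key_missing` rate is far above its `key_present` rate is one that tends to be empty in the same records as B5. A similar comparison grouped by `V024` could be used to look for a geographic pattern.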

In [92]:
KR_df = KR_df.dropna(subset=["B5"])

In [93]:
missing_data = KR_df.isnull().sum() / KR_df.shape[0] * 100
missing_data = missing_data[(missing_data < 10) & (missing_data > 0)].sort_values(
    ascending=False
)
missing_data

M18    6.674876
B8     5.330665
dtype: float64

In [94]:
KR_df[KR_df["B8"].isnull()]["B5"].value_counts()

B5
2.0    944
1.0     95
Name: count, dtype: int64

### (5–10% Missingness) - After Reassessment

- **B8 (Current Age of Child, if alive)**
    - **Missingness:** ~5%
    - **Treatment:**
        - If child is **alive**: Imputed using the **mode** (most frequently reported current age).
        - If child is **dead**: Assigned value **“-1”** to explicitly indicate that current age is not applicable.
- **M18 (Size of Child at Birth)**
    - **Missingness:** ~6.7%
    - **Treatment:**
        - Imputed missing values with **“Don’t Know”**, which is a valid categorical response already present in the dataset.
        - This preserves interpretability and prevents introducing artificial bias by forcing numeric or derived imputations.

Because the outcome (B5) is now always available, a missing B8 can be filled with the mode when the child is alive, or with -1 when the child is dead.

Missing M18 values can be filled with the code for "Don't Know" (8).

B5 coding in this dataset: 1 = alive, 2 = dead.

In [95]:
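import numpy as np

# Most frequently reported current age; Series.mode() ignores NaN
b8_mode = KR_df["B8"].mode().iloc[0]

# Dead children (B5 == 2) get -1 ("not applicable");
# alive children (B5 == 1) with a missing age get the mode
KR_df["B8"] = np.where(
    KR_df["B8"].isnull() & (KR_df["B5"] == 2.0),
    -1,
    np.where(
        KR_df["B8"].isnull() & (KR_df["B5"] == 1.0),
        b8_mode,
        KR_df["B8"],
    ),
)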

In [96]:
# 8 is the existing "Don't Know" category for size of child at birth
KR_df = KR_df.fillna({"M18": 8.0})

### Consistency Checks on Mortality Variables

Before addressing variables with 10–30% missingness, I performed **consistency checks** on the mortality-related variables **B6 (Age at Death in Months)** and **B7 (Age at Death in Days)**.

- **Logic Applied:**
    - If **B5 = 1 (Child Alive)**, then **B6** and **B7** should not contain valid values.
- **Finding:**
    - For alive children (B5 = 1), both B6 and B7 were filled with `NaN`.
- **Action Taken:**
    - Assigned **“-1”** to both B6 and B7 whenever B5 = 1, to explicitly indicate “Not Applicable.”

This ensured consistency across survival and age-at-death variables.
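
The rule above can also be written as a reusable check that returns the violating rows. This is a minimal sketch with a hypothetical helper name, `alive_with_death_age`, and toy data; it assumes the B5 coding used in this notebook (1 = alive) and should run before the -1 sentinel is assigned, since -1 counts as a non-missing value.

```python
import numpy as np
import pandas as pd


def alive_with_death_age(frame: pd.DataFrame) -> pd.DataFrame:
    """Rows where the child is alive (B5 == 1) yet B6 or B7 holds a value."""
    alive = frame["B5"] == 1.0
    has_death_age = frame[["B6", "B7"]].notna().any(axis=1)
    return frame[alive & has_death_age]


# Toy data (hypothetical): row 1 is alive but carries an age at death
toy = pd.DataFrame(
    {
        "B5": [1.0, 1.0, 2.0],
        "B6": [np.nan, 3.0, 2.0],
        "B7": [np.nan, np.nan, 60.0],
    }
)
violations = alive_with_death_age(toy)
print(violations)
```

An empty result means the survival and age-at-death variables agree.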

In [97]:
print(KR_df[KR_df["B5"] == 1.0][["B6", "B7"]])

# Consistency check for B6 and B7 for alive children
KR_df.loc[KR_df["B5"] == 1.0, "B6"] = -1
KR_df.loc[KR_df["B5"] == 1.0, "B7"] = -1

       B6  B7
0     NaN NaN
1     NaN NaN
7     NaN NaN
8     NaN NaN
9     NaN NaN
...    ..  ..
21213 NaN NaN
21214 NaN NaN
21215 NaN NaN
21216 NaN NaN
21217 NaN NaN

[18542 rows x 2 columns]


In [98]:
KR_df[KR_df["B5"] == 1.0][["B6", "B7"]].value_counts()

B6    B7  
-1.0  -1.0    18542
Name: count, dtype: int64

In [99]:
missing_data = KR_df.isnull().sum() / KR_df.shape[0] * 100
missing_data = missing_data[(missing_data < 10) & (missing_data > 0)].sort_values(
    ascending=False
)
missing_data.count()

0

In [100]:
missing_data = KR_df.isnull().sum() / KR_df.shape[0] * 100
missing_data = missing_data[(missing_data < 30) & (missing_data > 10)].sort_values(
    ascending=False
)
missing_data.count()

4

## Missingness Between 10–30% (Child Morbidity and Care-Seeking)

The next set of variables, with 10–30% missingness, was primarily related to **child morbidity (diarrhea, fever, respiratory symptoms)** and the associated **care-seeking and treatment** variables.

### Diarrhea-Related Variables

- **H11 (Child had diarrhea in the last 2 weeks):**
    - Missing values replaced with **“8 = Don’t Know”**, to avoid assumptions of presence/absence.
- **Seeking Help & Treatment Variables (linked to diarrhea):**
    - If **H11 = NaN (child not reported for diarrhea)**, all associated help-seeking and treatment variables were encoded as **0 (Not Selected)** for consistency.

In [101]:
# ensure all "diarrhea: sought help" options are blank if the child did not have diarrhea
# list of Sought help options
sought_help_options = [
    "H12A",
    "H12B",
    "H12C",
    "H12D",
    "H12E",
    "H12G",
    "H12J",
    "H12K",
    "H12M",
    "H12S",
    "H12T",
    "H12U",
    "H12X",
]

# Treatment options
treatment_options = [
    "H13",
    "H13B",
    "H14",
    "H15",
    "H15A",
    "H15B",
    "H15C",
    "H15D",
    "H15E",
    "H15F",
    "H15G",
    "H15H",
    "H15I",
    "H20",
]

In [102]:
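# Where H11 is missing, mark every linked help-seeking/treatment option as 0 (Not Selected)
KR_df.loc[KR_df["H11"].isna(), sought_help_options + treatment_options] = 0
# Then record the unreported diarrhea status itself as 8 (Don't Know)
KR_df["H11"] = KR_df["H11"].fillna(8.0)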

In [103]:
KR_df[treatment_options + sought_help_options].isna().sum()
# if a treatment or an option for seeking help is not chosen, assume it wasn't selected -> fill with 0

KR_df[treatment_options + sought_help_options] = KR_df[
    treatment_options + sought_help_options
].fillna(0)

In [104]:
missing_data = KR_df.isnull().sum() / KR_df.shape[0] * 100
missing_data = missing_data[(missing_data < 10) & (missing_data > 0)].sort_values(
    ascending=False
)
missing_data

Series([], dtype: float64)

In [105]:
missing_data = KR_df.isnull().sum() / KR_df.shape[0] * 100
missing_data = missing_data[(missing_data < 30) & (missing_data > 10)].sort_values(
    ascending=False
)
missing_data

H22     12.44164
H31     12.44164
H31B    12.44164
dtype: float64

### Fever-Related Variables

- **H22 (Child had fever in the last 2 weeks):**
    - Missing values replaced with **“8 = Don’t Know.”**
- **Help-Seeking & Treatment Variables (linked to fever):**
    - If H22 is missing, all related variables were encoded as **0 (Not Selected).**

### Respiratory Symptoms (Cough, Short Rapid Breaths)

- **H31 (Cough), H31B (Short/Rapid Breathing) & H31C (Chest Problems/Blocked Nose):**
    - Missing values replaced with **“8 = Don’t Know.”**
- **Associated Care-Seeking Variables:**
    - If symptom variable was missing, treatment and care-seeking responses were set to **0 (Not Selected).**
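
The same two-step rule (zero the linked options, then mark the symptom as "Don't Know") applies to diarrhea, fever, and respiratory symptoms, so it could be factored into one helper. The sketch below is one possible shape for it; the function name `resolve_symptom_block`, its parameters, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def resolve_symptom_block(frame, symptom, linked, dont_know=8.0):
    """Zero linked options where the symptom is unreported, then fill the symptom."""
    out = frame.copy()
    unreported = out[symptom].isna()
    # Step 1: no reported symptom -> every linked option is "Not Selected"
    out.loc[unreported, linked] = 0
    # Step 2: any option still missing is treated as not selected
    out[linked] = out[linked].fillna(0)
    # Step 3: the unreported symptom itself becomes "Don't Know"
    out[symptom] = out[symptom].fillna(dont_know)
    return out


# Toy data (hypothetical values, column names borrowed from the notebook)
toy = pd.DataFrame(
    {
        "H22": [1.0, 0.0, np.nan],
        "H32A": [1.0, np.nan, np.nan],
        "H37A": [np.nan, np.nan, 1.0],
    }
)
cleaned = resolve_symptom_block(toy, "H22", ["H32A", "H37A"])
print(cleaned)
```

Note that step 1 also overwrites any option that was recorded for a child whose symptom status is missing, as in the last toy row.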

In [106]:
# list of Sought help options for fever and cough
sought_help_fever = [
    "H32A",
    "H32B",
    "H32C",
    "H32D",
    "H32E",
    "H32G",
    "H32K",
    "H32S",
    "H32T",
    "H32X",
    "H32Y",
]

# Treatment options for fever and cough
treatment_fever = [
    "H37A",
    "H37B",
    "H37D",
    "H37DA",
    "H37E",
    "H37AA",
    "H37AB",
    "H37H",
    "H37I",
    "H37J",
    "H37K",
    "H37L",
    "H37M",
    "H37X",
    "H37Y",
    "H37Z",
]

In [107]:
# where H22 (fever) is missing, set the linked options to 0, then fill H22 with 8 (Don't Know) since missingness is below 30%
KR_df.loc[KR_df["H22"].isna(), sought_help_fever + treatment_fever] = 0
KR_df["H22"] = KR_df["H22"].fillna(8.0)

In [108]:
KR_df.loc[
    KR_df["H31B"].isna(), sought_help_fever + treatment_fever
].isna().sum()  # No Null values
KR_df["H31B"] = KR_df["H31B"].fillna(8.0)

In [109]:
KR_df[treatment_fever + sought_help_fever] = KR_df[
    treatment_fever + sought_help_fever
].fillna(0)

In [110]:
KR_df["H31C"] = KR_df["H31C"].fillna(8.0)
KR_df["H31"] = KR_df["H31"].fillna(8.0)

**After this step, all variables with missingness <30% were resolved.**

In [111]:
missing_data = KR_df.isnull().sum() / KR_df.shape[0] * 100
missing_data = missing_data[(missing_data < 10) & (missing_data > 0)].sort_values(
    ascending=False
)
missing_data

Series([], dtype: float64)

In [112]:
missing_data = KR_df.isnull().sum() / KR_df.shape[0] * 100
missing_data = missing_data[(missing_data < 30) & (missing_data > 10)].sort_values(
    ascending=False
)
missing_data

Series([], dtype: float64)

In [None]:
# TODO: check consistency of the care-sought variables that indicate no treatment/no care against the other options: if "no treatment" is true, the others should be false, and vice versa
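
A minimal sketch of how this pending check might be written is shown below. Which survey column encodes "no treatment" is not established here, so the column names (`opt_a`, `opt_b`, `opt_none`) and the helper `contradiction_flags` are purely hypothetical placeholders.

```python
import pandas as pd


def contradiction_flags(frame, none_col, option_cols):
    """Flag rows whose 'none' answer disagrees with the individual options."""
    any_option = frame[option_cols].eq(1).any(axis=1)
    none_selected = frame[none_col].eq(1)
    return pd.DataFrame(
        {
            # "none" ticked together with at least one concrete option
            "none_and_option": none_selected & any_option,
            # neither "none" nor any concrete option ticked
            "nothing_selected": ~none_selected & ~any_option,
        }
    )


# Toy data with placeholder columns (1 = selected, 0 = not selected)
toy = pd.DataFrame(
    {
        "opt_a": [1, 0, 0, 1, 0],
        "opt_b": [0, 0, 1, 0, 0],
        "opt_none": [0, 1, 1, 0, 0],
    }
)
flags = contradiction_flags(toy, "opt_none", ["opt_a", "opt_b"])
print(flags)
```

Rows flagged as `nothing_selected` are only a problem for children who actually had the symptom, because children without it are encoded as 0 across all options.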

## Variables with >30% Missingness

Finally, I addressed 13 variables with **high missingness (54–90%)**, including:

- Child **vaccination records**
- Maternal **ANC/PNC care details**
- **Birth weight (in kg)**
- Maternal report of **number of sons/daughters deceased**

In [113]:
missing_data = KR_df.isnull().sum() / KR_df.shape[0] * 100
missing_data = missing_data[missing_data > 30].sort_values(ascending=False)
missing_data

M19     90.462265
M70     88.835873
H8      88.050895
H7      87.327484
H5      85.572828
H6      84.341491
H3      83.228157
H9      81.022010
V206    80.165204
V207    80.165204
H4      79.154482
H2      78.251501
M14     54.317377
dtype: float64

For variables with high missingness (above roughly 30–40%), the key question is whether the variable is still useful; dropping it is often the better option.

- V206: Sons who have died
- V207: Daughters who have died
- H9: Received MEASLES 1
- H8: Received POLIO 3
- H7: Received DPT 3
- H6: Received POLIO 2
- H5: Received DPT 2
- H4: Received POLIO 1
- H3: Received DPT 1
- H2: Received BCG
- M14: Number of ANC visits
- M19: Birth weight in kilograms (3 decimals). Birth size is already covered by M18, whose missing values were handled above, so this variable can be dropped.
- M70: Baby postnatal check (within 2 months)

With 54–90% of their values missing, none of these variables can be imputed reliably.

### Revision of the Data Modeling Strategy

To answer the research question, four concepts are required for each subject:

1. Geographical location
2. Outcomes -> mortality & morbidity
3. Care Seeking Behavior
    1. Preventive dimensions -> Vaccinations (this will be dropped)
    2. Responsive dimensions -> Treatment and seeking help
    3. Exposure dimensions -> birth weight (kept through M18, despite many missing and "Don't Know" values) + ANC/PNC (high level of missingness)
4. Contextual/ Environmental Risk Factors

- Given their extremely high missingness, retaining these variables risked introducing noise, unreliable imputation, or loss of interpretability.
- Since alternative variables capturing **care-seeking behavior, birth conditions, and outcomes** were already included, dropping these variables would not jeopardize the analytic strategy.
- **Action:** All variables with >30% missingness were **dropped**.
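
Instead of a hand-written list, the columns to drop can be derived from the threshold itself. The sketch below shows one way to do that, under the assumption that the 30% cut-off is applied to every column uniformly; `drop_high_missing` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_high_missing(frame: pd.DataFrame, threshold_pct: float = 30.0):
    """Drop columns whose share of missing values exceeds `threshold_pct`."""
    pct = frame.isna().mean() * 100
    to_drop = pct[pct > threshold_pct].index.tolist()
    return frame.drop(columns=to_drop), to_drop


# Toy data (hypothetical values, column names borrowed from the notebook)
toy = pd.DataFrame(
    {
        "V024": [1.0, 2.0, 3.0, 4.0],
        "M19": [np.nan, np.nan, np.nan, 3.2],  # 75% missing
        "M18": [1.0, np.nan, 2.0, 3.0],        # 25% missing
    }
)
reduced, dropped = drop_high_missing(toy)
print(dropped)
```

Returning the dropped names alongside the reduced frame keeps the decision auditable, which matters when the threshold is revisited later.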

In [114]:
cols_to_drop = [
    "V206",
    "V207",
    "H9",
    "H8",
    "H7",
    "H6",
    "H5",
    "H4",
    "H3",
    "H2",
    "M14",
    "M19",
    "M70",
]
KR_df = KR_df.drop(columns=cols_to_drop)

In [115]:
KR_df.isna().sum().sum()

0

In [116]:
KR_df.to_csv("KR_cleaned_df.csv")