# Data Cleaning Notebook for Dengue Transfer Learning Project

## Task

Conduct EDA for a TensorFlow transfer learning pipeline that forecasts **weekly dengue cases** (`total_cases`) from 22 weather/environmental features.

### Dataset
Dengue ML datasets track environmental and temporal factors influencing Aedes mosquito breeding and virus transmission in tropical regions like San Juan and Iquitos.

- #### Temporal Features
    - **city**: Location identifier (e.g., 'sj' for San Juan, 'iq' for Iquitos)—captures city-specific mosquito/dengue patterns.
    - **year, weekofyear, week_start_date**: Time granularity for seasonality; dengue peaks during rainy seasons (weekofyear critical for lagged effects).

- #### Vegetation Indices (NDVI)
    - **ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw**: Normalized Difference Vegetation Index by city quadrant. Higher NDVI indicates lush vegetation providing mosquito shade/breeding sites; key for Aedes habitat detection via satellite.

- #### Precipitation & Water
    - **precipitation_amt_mm**: Rainfall amount—creates standing water breeding sites.
    - **reanalysis_precip_amt_kg_per_m2, reanalysis_sat_precip_amt_mm**: Reanalysis (modeled) precipitation variants confirming observed rain.
    - **station_precip_mm**: Ground station measurements—most direct rain proxy.

- #### Temperature Metrics
    - **reanalysis_air_temp_k, reanalysis_avg_temp_k, reanalysis_max_air_temp_k, reanalysis_min_air_temp_k**: Reanalysis temps in Kelvin; optimal Aedes range 26-32°C accelerates larval development/virus replication.
    - **station_avg_temp_c, station_max_temp_c, station_min_temp_c**: Station temps in Celsius—ground truth validation.
    - **station_diur_temp_rng_c**: Diurnal range; wider swings stress mosquitoes.
    - **reanalysis_tdtr_k**: Diurnal temperature range (reanalysis).

- #### Humidity & Moisture
    - **reanalysis_dew_point_temp_k**: Dew point—direct humidity proxy; high values (>20°C) favor mosquito survival.
    - **reanalysis_relative_humidity_percent**: Relative humidity %—critical for egg/larval viability.
    - **reanalysis_specific_humidity_g_per_kg**: Absolute moisture content.

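The reanalysis temperatures are in Kelvin while the station temperatures are in Celsius, so comparing them directly needs a unit conversion. The following is a minimal sketch of one way to do that; the helper name `reanalysis_to_celsius` and the two column lists are assumptions made for illustration, not part of the project code. Note that `reanalysis_tdtr_k` is a temperature *range*, so its numeric value is the same in Kelvin and Celsius and only the column suffix changes.

```python
import pandas as pd

KELVIN_OFFSET = 273.15

# Absolute temperatures: subtract the Kelvin offset (assumed column grouping).
ABSOLUTE_TEMP_COLS_K = [
    "reanalysis_air_temp_k",
    "reanalysis_avg_temp_k",
    "reanalysis_max_air_temp_k",
    "reanalysis_min_air_temp_k",
    "reanalysis_dew_point_temp_k",
]
# Temperature differences: a 1 K difference equals a 1 °C difference.
RANGE_COLS_K = ["reanalysis_tdtr_k"]


def reanalysis_to_celsius(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Kelvin columns replaced by `_c` columns in Celsius."""
    out = df.copy()
    for col in ABSOLUTE_TEMP_COLS_K:
        if col in out.columns:
            out[col[:-2] + "_c"] = out.pop(col) - KELVIN_OFFSET
    for col in RANGE_COLS_K:
        if col in out.columns:
            out[col[:-2] + "_c"] = out.pop(col)  # rename only, no offset
    return out


demo = pd.DataFrame({
    "reanalysis_air_temp_k": [299.15],
    "reanalysis_tdtr_k": [2.5],
    "station_avg_temp_c": [26.4],
})
converted = reanalysis_to_celsius(demo)
```

After conversion, `reanalysis_air_temp_c` can be compared directly against `station_avg_temp_c`.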

### Notebook sections for the second project notebook (Data Cleaning)
1. Get Data
2. Data Cleaning
4. Feature Selection (TBC, possibly notebook 03)
5. Feature Engineering (TBC, possibly notebook 03)
6. Benchmark Model
7. Model Tuning (TBC)
8. Model Evaluation (TBC, possibly notebook 03)

In [1]:
import sys
import os
from pathlib import Path
from typing import List, Tuple, Any
import gc

# Set one level up as project root
if os.path.abspath("..") not in sys.path:
    sys.path.insert(0, os.path.abspath(".."))
    
from src.config import ProjectConfig  # project config file parser
from src.utils.eda import value_streaks, top_correlations
from src.utils.visualizations import compute_correlations_matrix, \
                display_distributions, random_color, random_colormap, \
                display_timeseries

import pandas as pd
import numpy as np
import random
import time
from datetime import timedelta

from IPython.display import display
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
# from matplotlib.axis import Axis
from matplotlib.dates import MonthLocator, YearLocator, DateFormatter
import seaborn as sns

In [2]:
cnfg = ProjectConfig.load_configuration()
PATH_TO_RAW_DATA = cnfg.data.dirs["raw"]
FILE_TRAIN_RAW= cnfg.data.files["features_train"]
FILE_TEST_RAW = cnfg.data.files["features_test"]
FILE_LABELS_RAW = cnfg.data.files["labels_train"]
TARGET = "total_cases"

### Get Data

In [3]:
df_train_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_TRAIN_RAW, parse_dates=["week_start_date"])
df_test_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_TEST_RAW, parse_dates=["week_start_date"])
df_labels_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_LABELS_RAW)
list_raw_df = [df_train_raw, df_test_raw, df_labels_raw]

In [4]:
df_train_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_TRAIN_RAW, parse_dates=["week_start_date"])
# df_test_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_TEST_RAW, parse_dates=["week_start_date"])
df_labels_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_LABELS_RAW)
# list_raw_df = [df_train_raw, df_test_raw, df_labels_raw]
# for df in list_raw_df:
#     display(df.sample(1))  

***To reduce data snooping, hold out the last 5% of weeks for each of the two cities before any analysis.***

In [5]:
holdout_pct = 0.05
cities_first_i = df_train_raw.groupby(by="city")["week_start_date"].idxmin()  # Series of first-row indices per city
cities_last_i = df_train_raw.groupby(by="city")["week_start_date"].idxmax()  # Series of last-row indices per city
cities_last_i = (cities_last_i - (cities_last_i - cities_first_i) * holdout_pct).astype(int)  # index arithmetic on Series: move each end back by holdout_pct
period = tuple(slice(cities_first_i[city], cities_last_i[city], 1) for city in cities_last_i.index[::-1])  # tuple of per-city slices built from the two Series
df_train_raw_eda = df_train_raw.iloc[np.r_[period]].reset_index()  # apply the defined slices (relies on the default RangeIndex, so labels equal positions)
df_labels_raw_eda = df_labels_raw.iloc[np.r_[period]].reset_index()

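For comparison, the same per-city chronological holdout can be sketched with a `groupby` loop instead of index arithmetic. This is an illustrative alternative under stated assumptions, not the notebook's method: the function name `split_holdout_by_city` is hypothetical, and it rounds the holdout size from the row count per city, so the cut point can differ by a row from the slice-based version.

```python
import pandas as pd


def split_holdout_by_city(df, holdout_pct=0.05, city_col="city",
                          date_col="week_start_date"):
    """Split each city's rows chronologically; the last share goes to holdout."""
    eda_parts, holdout_parts = [], []
    for _, group in df.groupby(city_col, sort=False):
        group = group.sort_values(date_col)
        n_holdout = int(round(len(group) * holdout_pct))
        cut = len(group) - n_holdout
        eda_parts.append(group.iloc[:cut])
        holdout_parts.append(group.iloc[cut:])
    return pd.concat(eda_parts), pd.concat(holdout_parts)


# Toy frame: two cities with 20 weekly rows each.
demo = pd.DataFrame({
    "city": ["sj"] * 20 + ["iq"] * 20,
    "week_start_date": list(pd.date_range("2000-01-03", periods=20, freq="7D")) * 2,
})
eda_part, holdout_part = split_holdout_by_city(demo, holdout_pct=0.1)
```

Because the original row index is preserved, a labels frame with the same index can be aligned with `labels.loc[eda_part.index]`.
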
### Data Cleaning

#### TODO:
- NaNs:
  - Potential for row-wise removal where most of a data point's features lack values.
  - Column-wise, the *"ndvi_ne"* feature:
    - Sparse but important (vegetation data is crucial for mosquitoes); attempt to interpolate from other rows of the same feature.
  - Possible imputation for temperatures where reanalysis data is available.
  - Missing-value patterns differ between San Juan and Iquitos.
  - An entire season is missing for all vegetation quadrants:
    - Keep the rows and interpolate from other years' data.
    - The missing-data patterns differ between the two dataset cities:
        - For imputation, use only the relevant city's data, as both climate and vegetation differ between San Juan and Iquitos.
  - **Preprocessing tactics for NaNs**:
    1) Remove obviously sparse rows (e.g. sparsity > 50%).
    2) Work out a strategy for San Juan, which is missing ~20% of 'ndvi_ne' data (most likely keep it, if it can be rebuilt from the other quadrants).
    3) Check the target distribution and whether some outlier rows have to be removed.
    4) Re-evaluate what is left for imputation:
        - Imputation options for weather data and short-term ndvi_ breaks (time-sensitive):
            1) Use only the relevant city's data/group for imputation.
            2) Impute in order of feature importance for the target (mosquito breeding):
                1. station_precip_mm (CRITICAL) - Creates breeding sites
                2. station_avg_temp_c (CRITICAL) - Optimal 26-32°C for development
                3. station_max_temp_c (HIGH) - Heat stress threshold (>32°C kills)
                4. ndvi_ features (MEDIUM) - Shade/habitat:
                    1) Long missing streaks (entire season):
                        - 5-12 weeks missing: seasonal mean from the same quadrant for the same week in all other years
                        - 15-18+ weeks missing: DROP COLUMN or flag as unreliable (entire growing season lost)
                    2) Shorter gaps (1-4 weeks of NaNs): use interpolate(method='time'), i.e. linear between existing time points (to be confirmed)
                5. station_min_temp_c (MEDIUM) - Night survival threshold
                6. station_diur_temp_rng_c (LOW) - Secondary stress indicator
            3) Impute from reanalysis columns where available.
            4) Check time-sensitive imputation methods (simplest: carry the last valid value forward with ffill()).
- Multicollinearity:
    - Not an issue per se for LSTM, but it introduces redundancy. Therefore:
        - Remove identical and near-identical features:
            - "reanalysis_sat_precip_amt_mm" (identical) and "reanalysis_dew_point_temp_k" (near-identical)
        - Remove the highly correlated inferred feature:
            - "reanalysis_avg_temp_k"
        - Keep the potential cross-domain feature despite high correlations:
            - "reanalysis_tdtr_k"
        - Keep direct station measurement data despite correlations with reanalysis data:
            - "station_diur_temp_rng_c"
        - Cluster vegetation features into North and South:
            - 'ndvi_ne' with 'ndvi_nw' and 'ndvi_se' with 'ndvi_sw' (AFTER NaN interpolation)
- **Preprocessing tactics for outliers:**
    - Target ("total_cases")
        - If tree models are used (e.g. LightGBM), outliers are not an issue, since trees are not sensitive to them:
            - use Huber loss for extra safety when handling tails
            - clip extreme values so predictions stay realistic in the dengue context
            - RobustScaler may be redundant for tree models, but if it simplifies the pipeline there is no harm.
        - RNNs (e.g. LSTM) are outlier-sensitive (gradient instability, hidden-state patterns lose importance at peaks, scaling):
            - Log transform
            - Scale (RobustScaler with IQR is more outlier-resistant)
            - apply Huber loss
        - Clip (!= remove) extreme target values for both models, with separate clipping by city (an outlier in Iquitos may not be an outlier in the much larger San Juan)
    - Features:
        - clip globally (less complex; city-specific clipping can interfere with transfer learning)
        - clip for both tree and RNN models
        - RobustScaler for RNN
        - RobustScaler may be redundant for tree models, but if it simplifies the pipeline there is no harm.
        - rainfall:
            - clip to ~300 mm (test the 99.5th percentile)
        - temperature:
            - area-specific extremes range from 20 to 40 °C (test the 99.5th percentile)
            - clip min to ~20
            - clip max to ~40
        - vegetation (ndvi features):
            - dataset values range from -0.456100 (water bodies) to 0.546017 (rainforest), all of which are plausible
            - no need for clipping, but an IQR clip can be applied as a preventive measure for future data/prediction inputs
- **Preprocessing tactics for low-value target streaks**:
    - Drop the 75 rows of the initial zero/low-value streak
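
The NaN tactics listed above (sparse-row removal, time interpolation for short gaps, seasonal means for longer gaps, all per city) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's implementation: the helper names `drop_sparse_rows`, `gap_lengths` and `impute_city` are hypothetical, the 4-week threshold comes from the notes, and the sketch fills *every* longer gap with the week-of-year mean rather than applying the drop-or-flag rule for very long gaps.

```python
import numpy as np
import pandas as pd


def drop_sparse_rows(df, feature_cols, max_nan_share=0.5):
    """Drop rows where more than `max_nan_share` of the features are NaN."""
    nan_share = df[feature_cols].isna().mean(axis=1)
    return df.loc[nan_share <= max_nan_share]


def gap_lengths(series):
    """For each NaN, the length of the NaN run it sits in (0 for valid values)."""
    is_na = series.isna()
    run_id = (is_na != is_na.shift()).cumsum()
    return is_na.astype(int).groupby(run_id).transform("sum")


def impute_city(df_city, cols, short_gap=4,
                date_col="week_start_date", week_col="weekofyear"):
    """Impute one city's features: time interpolation, then seasonal mean."""
    out = df_city.sort_values(date_col).set_index(date_col)
    for col in cols:
        gaps = gap_lengths(out[col])
        # Interpolate everything, but only keep values for short interior gaps.
        interpolated = out[col].interpolate(method="time", limit_area="inside")
        short = (gaps > 0) & (gaps <= short_gap)
        out.loc[short, col] = interpolated[short]
        # Remaining NaNs: mean of the same week of year within this city.
        seasonal_mean = out.groupby(week_col)[col].transform("mean")
        out[col] = out[col].fillna(seasonal_mean)
    return out.reset_index()


# Toy data: three years of weekly values equal to the week number.
week = np.arange(156) % 52 + 1
demo = pd.DataFrame({
    "city": "sj",
    "week_start_date": pd.date_range("2000-01-03", periods=156, freq="7D"),
    "weekofyear": week,
    "ndvi_ne": week.astype(float),
})
demo.loc[[10, 11], "ndvi_ne"] = np.nan   # short gap (2 weeks)
demo.loc[60:69, "ndvi_ne"] = np.nan      # long gap (10 weeks)

imputed = pd.concat(
    [impute_city(group, ["ndvi_ne"]) for _, group in demo.groupby("city")],
    ignore_index=True,
)
sparse_demo = pd.DataFrame({"a": [1.0, np.nan], "b": [2.0, np.nan], "c": [3.0, 1.0]})
kept = drop_sparse_rows(sparse_demo, ["a", "b", "c"])
```

Grouping by city before imputing keeps San Juan values from leaking into Iquitos gaps, which is the point of the city-specific rule in the notes.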

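The outlier tactics in the notes (per-city target clipping, log transform, global feature clipping, robust scaling) might look like the sketch below. Everything here is an assumption for illustration: the function names are hypothetical, the bounds in `FEATURE_BOUNDS` are the approximate values from the notes applied to example columns, and `robust_scale` is a hand-written median/IQR scaling in the spirit of scikit-learn's `RobustScaler` rather than that class itself.

```python
import numpy as np
import pandas as pd

# Approximate global bounds taken from the notes; column choice is illustrative.
FEATURE_BOUNDS = {
    "station_precip_mm": (None, 300.0),
    "station_max_temp_c": (20.0, 40.0),
}


def clip_target_by_city(df, target="total_cases", city_col="city", upper_q=0.995):
    """Clip the target at a per-city upper quantile (values are kept, not removed)."""
    caps = df.groupby(city_col)[target].transform(lambda s: s.quantile(upper_q))
    return df[target].clip(upper=caps)


def clip_features(df, bounds):
    """Clip features to global bounds shared by all cities."""
    out = df.copy()
    for col, (lower, upper) in bounds.items():
        if col in out.columns:
            out[col] = out[col].clip(lower=lower, upper=upper)
    return out


def robust_scale(series):
    """Center on the median and scale by the interquartile range."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    centered = series - series.median()
    return centered / iqr if iqr > 0 else centered


demo = pd.DataFrame({
    "city": ["sj"] * 4 + ["iq"] * 4,
    "total_cases": [1, 2, 3, 500, 0, 1, 1, 20],
    "station_precip_mm": [10.0, 450.0, 30.0, 0.0, 5.0, 20.0, 310.0, 80.0],
    "station_max_temp_c": [31.0, 45.0, 19.0, 33.0, 30.0, 32.0, 34.0, 29.0],
})
clipped_target = clip_target_by_city(demo, upper_q=0.75)  # low quantile for the toy data
log_target = np.log1p(clipped_target)                     # log transform for the RNN path
features = clip_features(demo, FEATURE_BOUNDS)
scaled_precip = robust_scale(features["station_precip_mm"])
```

In a real pipeline the quantile caps and scaling statistics would be fitted on the training split only and then reused on validation and test data.
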
In [6]:
zero_one_targets = value_streaks(data=df_labels_raw_eda, column=TARGET, value=range(2),
                                 run_threshold=5)
print("Streaks of consecutive zero and one values in the target ('total_cases').")
zero_one_targets

Streaks of consecutive zero and one values in the target ('total_cases').


|   | first_pos | last_pos | streak_len |
|---|-----------|----------|------------|
| 0 | 888       | 962      | 75         |
| 1 | 1031      | 1039     | 9          |
| 2 | 1047      | 1052     | 6          |
| 3 | 1299      | 1304     | 6          |

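The internals of the project's `value_streaks` helper are not shown in this notebook. As a rough idea of how such a streak finder could work, here is a run-length sketch; the function name `find_value_streaks` is hypothetical, and treating `run_threshold` as an inclusive minimum length is an assumption.

```python
import pandas as pd


def find_value_streaks(series, values, run_threshold=5):
    """Find runs of consecutive rows whose value is in `values`."""
    mask = series.isin(list(values))
    # A new run starts wherever the mask flips between True and False.
    run_id = (mask != mask.shift()).cumsum()
    rows = []
    for _, run in mask.groupby(run_id):
        if run.iloc[0] and len(run) >= run_threshold:
            rows.append({
                "first_pos": run.index[0],
                "last_pos": run.index[-1],
                "streak_len": len(run),
            })
    return pd.DataFrame(rows, columns=["first_pos", "last_pos", "streak_len"])


cases = pd.Series([3, 0, 1, 0, 0, 1, 0, 4, 0, 0, 5])
streaks = find_value_streaks(cases, values=range(2), run_threshold=5)

# Dropping the first detected streak from the series:
first = streaks.iloc[0]
trimmed = cases.drop(index=range(first["first_pos"], first["last_pos"] + 1))
```

With a table like this, the initial low-value streak can be dropped by its `first_pos` and `last_pos` bounds, as in the last two lines of the sketch.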