# Feature Enginering Notebook for Dengue Transfer Learning Project

### Dataset
Dengue ML datasets track environmental and temporal factors influencing Aedes mosquito breeding and virus transmission in tropical regions like San Juan and Iquitos.

- #### Temporal Features
    - **city**: Location identifier (e.g., 'sj' for San Juan, 'iq' for Iquitos)—captures city-specific mosquito/dengue patterns.
    - **year, weekofyear, week_start_date**: Time granularity for seasonality; dengue peaks during rainy seasons (weekofyear critical for lagged effects).

- #### Vegetation Indices (NDVI)
    - **ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw**: Normalized Difference Vegetation Index by city quadrant. Higher NDVI indicates lush vegetation providing mosquito shade/breeding sites; key for Aedes habitat detection via satellite.

- #### Precipitation \& Water
    - **precipitation_amt_mm**: Rainfall amount—creates standing water breeding sites.
    - **reanalysis_precip_amt_kg_per_m2, reanalysis_sat_precip_amt_mm**: Reanalysis (modeled) precipitation variants confirming observed rain.
    - **station_precip_mm**: Ground station measurements—most direct rain proxy.

- #### Temperature Metrics
    - **reanalysis_air_temp_k, reanalysis_avg_temp_k, reanalysis_max_air_temp_k, reanalysis_min_air_temp_k**: Reanalysis temps in Kelvin; optimal Aedes range 26-32°C accelerates larval development/virus replication.
    - **station_avg_temp_c, station_max_temp_c, station_min_temp_c**: Station temps in Celsius—ground truth validation.
    - **station_diur_temp_rng_c**: Diurnal range; wider swings stress mosquitoes.
    - **reanalysis_tdtr_k**: Temperature diurnal temperature range (reanalysis).

- #### Humidity \& Moisture
    - **reanalysis_dew_point_temp_k**: Dew point—direct humidity proxy; high values (>20°C) favor mosquito survival.
    - **reanalysis_relative_humidity_percent**: Relative humidity %—critical for egg/larval viability.
    - **reanalysis_specific_humidity_g_per_kg**: Absolute moisture content.


### For a fair fight between LightGBM and our LSTM, both models will have the same starting lineup of features. These include seasonal cues, vegetation signals, missing data hints, and early low-case periods—everything else is just how each model likes to learn over time.

### Notebook sections for the third project notebook (Feature Enginering/Selection/)
1. Get Data
2. Feature Engineering
    - Common for both models
    - Model specific:
        - LightGBM benchmark
        - LSTM
3. Feature Selection
4. Target processing
5. Benchmark model (TBC, poss notebook 04)
7. Model Tuning  (TBC)
8. Model Evaluation  (TBC, poss notebook 04)

In [1]:
import sys
import os
from pathlib import Path
from typing import List, Tuple, Any, Dict
import gc
import itertools

# Set one level up as project root|
if os.path.abspath("..") not in sys.path:
    sys.path.insert(0, os.path.abspath(".."))
    
from src.config import ProjectConfig  # project config file parser
from src.utils.eda import value_streaks, top_correlations
from src.utils.visualizations import compute_correlations_matrix, \
                display_distributions, random_color, random_colormap, \
                display_timeseries

import pandas as pd
import numpy as np
import random
import time
from datetime import timedelta

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

from src.utils.eda import top_correlations, top_vif
from src.utils.utils import _check_feature_presence, load_file, save_file
from src.preprocessing.clean import drop_nan_rows, cap_outliers, \
                                    median_groupwise_impute
from src.preprocessing.engineer import reduce_features, remove_features

from IPython.display import display
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
# from matplotlib.axis import Axis
# from matplotlib.dates import MonthLocator, YearLocator, DateFormatter
import seaborn as sns

In [2]:
cnfg = ProjectConfig.load_configuration()
PATH_TO_RAW_DATA = cnfg.data.dirs["raw"]
FILE_TRAIN_RAW= cnfg.data.files["features_train"]
FILE_TEST_RAW = cnfg.data.files["features_test"]
FILE_LABELS_RAW = cnfg.data.files["labels_train"]
TARGET = cnfg.preprocess.feature_groups["target"]
ENV_FEAT_PREFIX = cnfg.preprocess.feature_groups["env_prefixes"]
CITYGROUP_FEAT = cnfg.preprocess.feature_groups["city"]
WEEK_FEAT = cnfg.preprocess.feature_groups["week"]
DATETIME_FEAT = cnfg.preprocess.feature_groups["datetime"]

## Get Data

In [3]:
df_train_raw = load_file(path=PATH_TO_RAW_DATA / FILE_TRAIN_RAW, datetime_col=DATETIME_FEAT)
df_test_raw = load_file(path=PATH_TO_RAW_DATA / FILE_TEST_RAW, datetime_col=DATETIME_FEAT)
df_labels_raw = load_file(path=PATH_TO_RAW_DATA / FILE_LABELS_RAW)
list_raw_df = [df_train_raw, df_test_raw, df_labels_raw]
env_features = [f for f in df_train_raw if f.startswith(tuple(ENV_FEAT_PREFIX))]

In [4]:
# TODO remove after testing
# MOCK CLEAN PIPELINE 

intermediate_output = cap_outliers(data=df_train_raw)
df_train_clean = intermediate_output["data"]
df_train_clean, df_labels_clean = drop_nan_rows(X=df_train_clean, y=df_labels_raw)
df_train_clean, df_nan_mask = median_groupwise_impute(X=df_train_clean)

## Common engineering baseline
- [ ] combine north and south NDVIs
- [ ] Process zero/low value target value streaks

### Combine north and south NDVIs

In [5]:
# # TODO remove after cleaning

# def reduce_features(X: pd.DataFrame, 
#                     input_feat_groups: List[List[str]]=None,
#                     output_feat_names: List[str]=None,
#                    function: str=None):
#     """
#     Aggregate multiple feature groups into single reduced features using specified function.
#     Combine input features and drop originals.
    
#     :param X: Input pandas DataFrame.
#     :param input_feat_groups: List of feature group lists to aggregate. Default None uses 
#            config.yaml settings (e.g., [['ndvi_ne', 'ndvi_nw']]).
#     :param output_feat_names: Output column names for aggregated features. Default None uses 
#            config.yaml settings (e.g., ['ndvi_north']).
#     :param function: Aggregation function string ('mean', 'sum', 'median'). Default None uses 
#            config.yaml settings (e.g 'mean').
#     :return: DataFrame with reduced features. Original input columns dropped.
#     """
#     X_reduced = X.copy()
#     if input_feat_groups is None:
#         input_feat_groups = cnfg.preprocess.combine_features["input_groups"]
#     if output_feat_names is None:
#         output_feat_names = cnfg.preprocess.combine_features["output_names"]
#     if function is None:
#         function = cnfg.preprocess.combine_features["aggregation"]

        
#     if not len(input_feat_groups) == len(output_feat_names):
#         raise ValueError(f"Input feature groups {input_feat_groups} mismatch target keys {output_feat_names}")
#     missing_features = set(itertools.chain(*input_feat_groups)) - set(X.columns)
#     if missing_features:
#         raise ValueError(f"No {missing_features} features in input dataframe columns: {X.columns}")

#     for name, group in zip(output_feat_names, input_feat_groups):
#         X_reduced[name] = X_reduced[group].agg(function, axis=1)
#         X_reduced.drop(columns=group, inplace=True)
        
#     return X_reduced

In [6]:
df_train_clean = reduce_features(df_train_clean)
[f for f in df_train_clean.columns if f.startswith("ndvi")]

['ndvi_north', 'ndvi_south']

### Process zero/low value target value streaks
- [ ] Feature engineer flag for early period's low values (ie `iq_initial_low_case_streak`)

## Common Feature Selection baseline
- [ ] Remove selected multicolinear features (esp important for LighGBM):
    - Populate feature removal list in `confif.yaml` iteratively based on:
        - Initial assesment of pairwise correlations in EDA
        - Proritize removal of "reanalysis_" features
        - Reasses if city-stratified VIF (Variance Inflation Factor) for any feature remains > 10.

### Remove selected multicolinear features
- [X] Remove initial `config.yaml` milticolinear features form dataframe
- [X] Assess city-wise VIF to EDA instead of correlation matrix for one-vs-all relationships.
- [ ] Remove features with VIF > 10
- [X] always prefer station_* over reanalysis_*

In [7]:
# # TODO remove after cleaning

# def reduce_features(X: pd.DataFrame, 
#                     input_feat_groups: List[List[str]]=None,
#                     output_feat_names: List[str]=None,
#                    function: str=None):
#     """
#     Aggregate multiple feature groups into single reduced features using specified function.
#     Combine input features and drop originals.
    
#     :param X: Input pandas DataFrame.
#     :param input_feat_groups: List of feature group lists to aggregate. Default None uses 
#            config.yaml settings (e.g., [['ndvi_ne', 'ndvi_nw']]).
#     :param output_feat_names: Output column names for aggregated features. Default None uses 
#            config.yaml settings (e.g., ['ndvi_north']).
#     :param function: Aggregation function string ('mean', 'sum', 'median'). Default None uses 
#            config.yaml settings (e.g 'mean').
#     :return: DataFrame with reduced features. Original input columns dropped.
#     """
#     X_reduced = X.copy()
#     if input_feat_groups is None:
#         input_feat_groups = cnfg.preprocess.combine_features["input_groups"]
#     if output_feat_names is None:
#         output_feat_names = cnfg.preprocess.combine_features["output_names"]
#     if function is None:
#         function = cnfg.preprocess.combine_features["aggregation"]

        
#     if not len(input_feat_groups) == len(output_feat_names):
#         raise ValueError(f"Input feature groups {input_feat_groups} mismatch target keys {output_feat_names}")
#     missing_features = set(itertools.chain(*input_feat_groups)) - set(X.columns)
#     if missing_features:
#         raise ValueError(f"No {missing_features} features in input dataframe columns: {X.columns}")

#     for name, group in zip(output_feat_names, input_feat_groups):
#         X_reduced[name] = X_reduced[group].agg(function, axis=1)
#         X_reduced.drop(columns=group, inplace=True)
        
#     return X_reduced

In [8]:
cnfg.preprocess.multicolinear["removal_list"]

['reanalysis_sat_precip_amt_mm',
 'reanalysis_dew_point_temp_k',
 'reanalysis_air_temp_k',
 'reanalysis_specific_humidity_g_per_kg',
 'reanalysis_avg_temp_k',
 'reanalysis_max_air_temp_k',
 'reanalysis_min_air_temp_k']

In [9]:
# # TODO remove after cleaning
# def top_vif(data: pd.DataFrame):
#     """
#     Calculate and return Variance Inflation Factor (VIF) scores for numeric features.
    
#     :param data: pandas DataFrame containing numeric and non-numeric features.
#     :return: pandas DataFrame with features and their VIF scores,
#                 sorted descending (excludes constant).
#     """
#     data_vif = add_constant(data.select_dtypes(include="number"))
#     cols = data_vif.columns
#     if data_vif.isna().sum().sum() > 1:
#         raise ValueError(f"{data_vif.isna().sum().sum()} NaNs in the dataframe.")
#     data_vif = [variance_inflation_factor(
#         data_vif.values, i) for i in range(data_vif.shape[1])]
#     data_vif = pd.DataFrame(data=data_vif, index=cols, columns=["vif"])
#     data_vif = data_vif.sort_values(by="vif", ascending=False,
#                                     na_position="first").drop(index="const")
    
#     return data_vif    

In [10]:
# DONT DO CAPPING, LOG TRANSFORM INSTEAD
# intermediate_target = cap_outliers(data=df_labels_raw, features=TARGET)
# df_labels_clean = intermediate_target["data"]
# df_labels_log = ...

In [11]:
# display_distributions(data = df_labels_log,
#                       hue_palette=(CITYGROUP_FEAT, random_colormap()),
#                      features=[TARGET], title_prefix=TARGET.capitalize(),
#                      )

In [12]:
temp_vif_pre = df_train_clean.groupby(by=CITYGROUP_FEAT).apply(lambda group: top_vif(data=group), include_groups=False)
temp_vif_pre =  temp_vif_pre.loc["iq"].join(temp_vif_pre.loc["sj"], lsuffix="_iq", rsuffix="_sj", how="inner")
temp_vif_pre["vif_total"] = top_vif(data=df_train_clean).values
temp_vif_pre.sort_values(by="vif_sj", ascending=False, na_position="first")

  vif = 1. / (1. - r_squared_i)
  vif = 1. / (1. - r_squared_i)
  vif = 1. / (1. - r_squared_i)


Unnamed: 0,vif_iq,vif_sj,vif_total
precipitation_amt_mm,inf,inf,inf
reanalysis_sat_precip_amt_mm,inf,inf,inf
reanalysis_dew_point_temp_k,453.223289,1696.639393,461.460227
reanalysis_air_temp_k,76.644665,1138.498349,182.108323
reanalysis_specific_humidity_g_per_kg,418.345797,522.410071,379.079529
reanalysis_relative_humidity_percent,113.106234,305.767118,209.515216
reanalysis_avg_temp_k,37.465639,274.534057,66.693535
station_avg_temp_c,3.198979,24.468301,7.145946
reanalysis_min_air_temp_k,5.496557,19.140084,23.223278
reanalysis_max_air_temp_k,5.799613,17.218028,26.361666


In [13]:
df_train_clean = remove_features(X=df_train_clean)

In [14]:
temp_vif_post = df_train_clean.groupby(by=CITYGROUP_FEAT).apply(lambda group: top_vif(data=group), include_groups=False)
temp_vif_post =  temp_vif_post.loc["iq"].join(temp_vif_post.loc["sj"], lsuffix="_iq", rsuffix="_sj", how="inner")
temp_vif_post["vif_total"] = top_vif(data=df_train_clean).values
temp_vif_post.sort_values(by="vif_sj", ascending=False, na_position="first")

Unnamed: 0,vif_iq,vif_sj,vif_total
station_avg_temp_c,2.892104,15.205426,3.499835
station_min_temp_c,2.20213,9.482349,2.034867
station_max_temp_c,2.524067,7.57361,2.708194
reanalysis_relative_humidity_percent,6.135895,3.091658,9.21114
station_diur_temp_rng_c,3.015184,3.033468,5.947643
reanalysis_precip_amt_kg_per_m2,1.615392,2.586292,1.958662
year,1.119315,2.036535,1.282457
station_precip_mm,1.275959,1.89013,1.627019
precipitation_amt_mm,1.489978,1.854858,1.73702
reanalysis_tdtr_k,5.927013,1.816751,7.728891


In [15]:
corr_threshold=0.8
print(f"City-stratified correlations exceeding {corr_threshold}:")
df_train_clean.groupby(
    by=CITYGROUP_FEAT).apply(
        lambda group: top_correlations(
            data=group, 
            corr_threshold=corr_threshold
        ), include_groups=False)

City-stratified correlations exceeding 0.8:


city                                                          
iq    ndvi_south          ndvi_north                              0.872014
      reanalysis_tdtr_k   reanalysis_relative_humidity_percent   -0.900588
sj    station_min_temp_c  station_avg_temp_c                      0.898016
      station_max_temp_c  station_avg_temp_c                      0.864780
dtype: float64

**Conclusion**:
- Overall `vif_total` Variance Inflation Factor cobined for both cities is below 10:
  - Among highest VIF scoring overall remains `reanalysis_relative_humidity_percent`, but it is important as there are no alternative station humiidity data and humidity is important for dengue detection domain. 
- `San Juan` still suffers from higher Multicolinearity that reflects geographical characteristic (more stable climate):
    - Top VIF scores for `San Juan` reach 15.2 and are in crucial station  temperature measurements. `station_avg_temp_c` could be potentially removed, but risks hiding important signals in `Iquitos`.
    - Keep `station_avg_temp_c` and observe model results:
        - VIF score of ~ 15 should be well handled by `LightGBM`
        - If `LTSM` `San Juan` -> `Iquitos` transfer learn underperforms, attempt addinf `station_avg_temp_c` to `config.yaml` feature removal list. 
- VIF factors for `Iquitos` are all acceptable in range below ~6.1.
- Remaining high correlation pairs are differenft for both cities confirming that closeness of these features are city specific and thus may encompass important signals for models. So they are kept.

## Target processing
- [ ] log transform targets

### Process zero/low value target value streaks
- [ ] Predict log1p(total_cases) to reduce zero target influence
- [ ] (Optional) LSTM specific - if LSTM results are sub-optimal, experiment with downweighting affected period data (ie 0.3 or more)

In [16]:
zero_one_targets = value_streaks(data=df_labels_raw, column=TARGET, value=range(2),
                             run_threshold=5)
print("Zero and one consecutive value streaks for target data ('total dengue cases).")
zero_one_targets

Zero and one consecutive value streaks for target data ('total dengue cases).


Unnamed: 0,first_pos,last_pos,streak_len
0,936,1010,75
1,1079,1087,9
2,1095,1100,6
3,1347,1352,6


# TODO: adjust **Conclusion**:
- Outliers are 

separate tactics for baseline tree modela and RNN.
- Tree (eg LighGBM):
    - drop redundant and higgly correlated reanalysis features, replace with engineered features that provide temporal info
    - single vector input:
        - weather/vegetation snapshot for t-1 week:
            - keep these features (poss few more):
                1. city
                2. sin/cos_weekofyear (cyclical),
                3. station_avg_temp_c,
                4. station_precip_mm,
                5. station_min_temp_c/station_max_temp_c,
                6. ndvi_ne/ndvi_sw,
                7. reanalysis_relative_humidity_percent
        - multiple lagged/aggregated case metrics.
            - see next cell
            - add:
                -  4w rolling means
                -  4w rollong max
                -  w1-w4 lags (4 features)
        - multiple lagged/aggregated weather features:
            - 4w averages for temp, rain, humidity
        - limit to 25 dimensions (preferably 20)
- RNN (eg LSTM)
    - keep most original features, drop extremes of 2-3 highly collinear reanalysis, minimal engineering 
    - windowed input with:
        -  4-8 time steps (if single step/week forecast)
        -  10-20 features per step
        -  include city embedding (research if truly needed)

Conclusions from EDA notebook (~ line 33)
- Tree models may handle interrupted seasonality better, but:
    - with adittional feature engineering:
      - seasonal cyclical sin/cos features (weekly on annual basis)
      - weekly case autocorrelation/shift() (must) or annual shift features (indicate time momentum for tree model):
          - trees can handle resulting NaNs
      - weekly per city differencing to hughlight spikes and sudue seasonality - stationary (constant statistical properties over time) benefit trees.
      - Optional: annual diff feature for outbreak signals per city
-  RNNs (eg LSTM) will struggle with outbreak spikes as works best with smooth seasonal patterns:
    - still may benefit from  cyclical sin/cos features
    - precip_avg_4w
    - Rolling_mean_cases_4w (handle NaNs, eg impute with bfill() + ffill() combo)
    - lag1_cases