# Feature Engineering Notebook for Dengue Transfer Learning Project

### Dataset
Dengue ML datasets track environmental and temporal factors influencing Aedes mosquito breeding and virus transmission in tropical regions like San Juan and Iquitos.

- #### Temporal Features
    - **city**: Location identifier (e.g., 'sj' for San Juan, 'iq' for Iquitos)—captures city-specific mosquito/dengue patterns.
    - **year, weekofyear, week_start_date**: Time granularity for seasonality; dengue peaks during rainy seasons (weekofyear critical for lagged effects).

- #### Vegetation Indices (NDVI)
    - **ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw**: Normalized Difference Vegetation Index by city quadrant. Higher NDVI indicates lush vegetation providing mosquito shade/breeding sites; key for Aedes habitat detection via satellite.

- #### Precipitation \& Water
    - **precipitation_amt_mm**: Rainfall amount—creates standing water breeding sites.
    - **reanalysis_precip_amt_kg_per_m2, reanalysis_sat_precip_amt_mm**: Reanalysis (modeled) precipitation variants confirming observed rain.
    - **station_precip_mm**: Ground station measurements—most direct rain proxy.

- #### Temperature Metrics
    - **reanalysis_air_temp_k, reanalysis_avg_temp_k, reanalysis_max_air_temp_k, reanalysis_min_air_temp_k**: Reanalysis temps in Kelvin; optimal Aedes range 26-32°C accelerates larval development/virus replication.
    - **station_avg_temp_c, station_max_temp_c, station_min_temp_c**: Station temps in Celsius—ground truth validation.
    - **station_diur_temp_rng_c**: Diurnal range; wider swings stress mosquitoes.
    - **reanalysis_tdtr_k**: Diurnal temperature range from reanalysis, in Kelvin.

- #### Humidity \& Moisture
    - **reanalysis_dew_point_temp_k**: Dew point—direct humidity proxy; high values (>20°C) favor mosquito survival.
    - **reanalysis_relative_humidity_percent**: Relative humidity %—critical for egg/larval viability.
    - **reanalysis_specific_humidity_g_per_kg**: Absolute moisture content.
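Because the reanalysis temperatures are in Kelvin and the station temperatures are in Celsius, the two sources need a common unit before they can be compared. A minimal sketch, assuming toy values that are not taken from the dataset (only the column names come from the list above):

```python
import pandas as pd

KELVIN_OFFSET = 273.15

# Toy values for illustration only (not taken from the dataset).
df = pd.DataFrame({
    "reanalysis_air_temp_k": [297.74, 298.44, 299.15],
    "station_avg_temp_c": [25.4, 26.7, 26.0],
})

# Convert the reanalysis temperature to Celsius so both sources share a unit.
df["reanalysis_air_temp_c"] = df["reanalysis_air_temp_k"] - KELVIN_OFFSET

# Gap between modeled and measured temperature, in degrees Celsius.
# "temp_gap_c" is a hypothetical helper column, not a dataset feature.
df["temp_gap_c"] = df["reanalysis_air_temp_c"] - df["station_avg_temp_c"]
print(df.round(2))
```

Range-type columns such as `reanalysis_tdtr_k` are temperature differences, so they need no offset: a difference of 1 K equals a difference of 1 °C.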


### For a fair fight between LightGBM and our LSTM, both models will have the same starting lineup of features. These include seasonal cues, vegetation signals, missing data hints, and early low-case periods—everything else is just how each model likes to learn over time.

### Notebook sections for the third project notebook (Feature Engineering/Selection)
1. Get Data
2. Feature Engineering
    - Common for both models
    - Model specific:
        - LightGBM benchmark
        - LSTM
3. Feature Selection
4. Target processing
5. Benchmark model (TBC, possibly notebook 04)
6. Model Tuning (TBC)
7. Model Evaluation (TBC, possibly notebook 04)

In [1]:
import sys
import os
from pathlib import Path
from typing import List, Tuple, Any, Dict, Optional
import gc
import itertools
import logging
logging.basicConfig(level=logging.INFO)

# Set one level up as project root
if os.path.abspath("..") not in sys.path:
    sys.path.insert(0, os.path.abspath(".."))
    
from src.config import ProjectConfig  # project config file parser
from src.utils.eda import value_streaks, top_correlations
from src.utils.visualizations import (compute_correlations_matrix,
                display_distributions, random_color, random_colormap,
                display_timeseries)

import pandas as pd
import numpy as np
import random
import time
from datetime import timedelta

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

from src.utils.eda import top_vif  # top_correlations is already imported above
from src.utils.utils import _check_feature_presence, load_file, save_file
from src.preprocessing.clean import (drop_nan_rows, cap_outliers,
                                    median_groupwise_impute, pipe_clean)

from src.preprocessing.engineer import (reduce_features,
                                        add_missingness_features,
                                        low_value_targets,
                                        circular_time_features)

from src.preprocessing.select import remove_features

from IPython.display import display
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
# from matplotlib.axis import Axis
# from matplotlib.dates import MonthLocator, YearLocator, DateFormatter
import seaborn as sns

In [2]:
cnfg = ProjectConfig.load_configuration()
PATH_TO_RAW_DATA = cnfg.data.dirs["raw"]
PATH_TO_INTERMEDIATE_DATA = cnfg.data.dirs["intermediate"]

FILE_TRAIN_RAW = cnfg.data.files["features_train"]
FILE_TEST_RAW = cnfg.data.files["features_test"]
FILE_LABELS_RAW = cnfg.data.files["labels_train"]

FILE_NAN_CLEAN = cnfg.data.files["nan_mask"]
FILE_TRAIN_CLEAN = cnfg.data.files["features_clean"]
FILE_LABELS_CLEAN = cnfg.data.files["labels_clean"]

TARGET = cnfg.preprocess.feature_groups["target"]
ENV_FEAT_PREFIX = cnfg.preprocess.feature_groups["env_prefixes"]
CITYGROUP_FEAT = cnfg.preprocess.feature_groups["city"]
WEEK_FEAT = cnfg.preprocess.feature_groups["week"]
DATETIME_FEAT = cnfg.preprocess.feature_groups["datetime"]

## Get Data

- Raw data

In [3]:
df_train_raw = load_file(path=PATH_TO_RAW_DATA / FILE_TRAIN_RAW, datetime_col=DATETIME_FEAT)
df_test_raw = load_file(path=PATH_TO_RAW_DATA / FILE_TEST_RAW, datetime_col=DATETIME_FEAT)
df_labels_raw = load_file(path=PATH_TO_RAW_DATA / FILE_LABELS_RAW)
list_raw_df = [df_train_raw, df_test_raw, df_labels_raw]
env_features = [f for f in df_train_raw if f.startswith(tuple(ENV_FEAT_PREFIX))]

- Cleaned data

In [4]:
# %%time
# # use pipeline output
# cleaned_data = pipe_clean(overwrite_files=True)
# df_train_clean = cleaned_data["X_clean_data"]
# df_labels_clean = cleaned_data["y_clean_data"]
# df_nan_mask = cleaned_data["nan_mask_data"]

In [5]:
%%time
# or just load from disk:
df_train_clean = load_file(path=PATH_TO_INTERMEDIATE_DATA / FILE_TRAIN_CLEAN, datetime_col=DATETIME_FEAT)
df_labels_clean = load_file(path=PATH_TO_INTERMEDIATE_DATA / FILE_LABELS_CLEAN)
df_nan_mask = load_file(path=PATH_TO_INTERMEDIATE_DATA / FILE_NAN_CLEAN)

CPU times: user 40.2 ms, sys: 5.87 ms, total: 46.1 ms
Wall time: 45.3 ms


## Common engineering baseline
- [X] Missingness features
- [X] combine north and south NDVIs
- [X] Process zero/low target value streaks (add `initial_low_case_streak` feature)
- [ ] Cyclical time features
- [ ] (Optional) Lags / rolling stats
    - Environmental station features: 1-2 lags each (t-1; add t-2 if feature space allows)
    - Cumulative effects (rainfall, precipitation): rolling sum (4 weeks)
    - NDVI missingness: 1-2 lags (t-1; add t-4 if feature space allows)
    - Station missingness: 1 lag (t-1)
    - If feature space bloats, replace lags with a single rolling stat
- [ ] RobustScaler (features)
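The optional lag and rolling-sum items in the checklist could be sketched as follows. This is a minimal illustration on toy data, not the project's pipeline: the output column names (`*_lag1`, `*_lag2`, `*_roll4_sum`) are assumptions, and all values are invented. Grouping by `city` keeps lags and windows from leaking across the two cities.

```python
import pandas as pd

# Toy weekly data for two cities (values are illustrative only).
df = pd.DataFrame({
    "city": ["sj"] * 6 + ["iq"] * 6,
    "week_start_date": list(pd.date_range("2000-01-03", periods=6, freq="7D")) * 2,
    "station_precip_mm": [10.0, 0.0, 5.0, 20.0, 0.0, 15.0,
                          1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
# Lags only make sense on chronologically ordered rows within each city.
df = df.sort_values(["city", "week_start_date"]).reset_index(drop=True)

grouped = df.groupby("city")["station_precip_mm"]

# Lag features: the value observed 1 and 2 weeks earlier in the same city.
df["station_precip_mm_lag1"] = grouped.shift(1)
df["station_precip_mm_lag2"] = grouped.shift(2)

# Cumulative effect: 4-week rolling sum (current week included), per city.
df["station_precip_mm_roll4_sum"] = grouped.transform(
    lambda s: s.rolling(window=4, min_periods=4).sum()
)
print(df)
```

The first rows of each city are NaN by construction; tree models can consume them directly, while an LSTM input would need them imputed or dropped.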

### Add data missingness features
- Add missingness features that flag NaN rows for the top NaN columns with a NaN rate above 1 %:
    - Idea: models, especially `LSTM`, may learn from the information that "this datapoint was missing"
- [X] grouped `missing_ndvi_n` and `missing_ndvi_s` (`ndvi_n` and `ndvi_s` will be grouped downstream)
    - if feature space bloats, combine into a single `missing_ndvi`
- [X] `missing_station` - group all station data missing flags ("station", "precipitation" prefixes)
- [X] `missing_pct_env_feat`:
    - normalized count of missing features (NaNs) at a timestamp relative to a fixed set of core environmental variables; signals overall source data uncertainty.
    - if feature space bloats, drop entirely
- [ ] (Optional) During SJ pretraining: slightly stronger dropout on missingness features than on base features to hide city-correlated missingness

In [6]:
# # TODO remove after cleaning

# def add_missingness_features(X: pd.DataFrame,
#                              nan_mask: pd.DataFrame,
#                              aggregated_feat_name: Optional[str]=None,
#                              input_env_feat_prefixes: Optional[List[str]]=None,
#                              input_group_prefix: Optional[list]=None,
#                              output_feature_prefix: Optional[str]=None
#                             ) -> pd.DataFrame:
#     """
#     Add missingness indicator features to DataFrame using column prefix patterns.
#     1. (optional) Aggregated ratio of missing values across environment prefix columns.
#     2. Max missingness indicator (0/1) per feature group defined by prefixes
#     Fall back to config values if parameters unspecified.
#     :param X: Input DataFrame to add missingness features to.
#     :param nan_mask: Boolean DataFrame same shape as X where True indicates missing.
#     :param aggregated_feat_name: Name for aggregated missingness ratio feature. 
#                                 Uses config default if None.
#     :param input_env_feat_prefixes: List of prefixes to match environment columns 
#                                    for aggregated ratio. Uses config if None.
#     :param input_group_prefix: List of prefix lists or single strings defining 
#                               feature groups for max missingness indicators. 
#                               Mixed format supported: `[["station", "precip"], "ndvi_s"]`.
#                               Uses config if None.
#     :param output_feature_prefix: Prefix for new group missingness columns 
#                                  (e.g., "missing_station_max"). Uses config if None.
#     :return: X with additional missingness feature column(s).
#     :raises ValueError: If no columns match environment prefixes.
#     """
#     X_missing = X.copy()
#     config = cnfg.preprocess.missingness_features
#     aggregated_feat_name = aggregated_feat_name or config.get("aggregated_feat_n")
#     input_env_feat_prefixes = input_env_feat_prefixes or cnfg.preprocess.feature_groups["env_prefixes"]
#     input_group_prefix = input_group_prefix or config["group_prefixes"]
#     output_feature_prefix = output_feature_prefix or config["new_feature_prefix"]

#     if aggregated_feat_name and input_env_feat_prefixes:
#         env_cols = nan_mask.columns[nan_mask.columns.str.startswith(tuple(input_env_feat_prefixes))]
#         denominator = len(env_cols)
#         if denominator == 0:
#             raise ValueError(f"No columns match environment prefix/es '{input_env_feat_prefixes}'.")
#         X_missing[f"{output_feature_prefix}{aggregated_feat_name}"
#             ] = nan_mask[env_cols].sum(axis=1) / denominator

#     for prefix in input_group_prefix:
#         if isinstance(prefix, str):
#             prefix = [prefix]
#         features = nan_mask.columns[
#             nan_mask.columns.str.startswith(tuple(prefix))]
#         if len(features) == 0:
#             logging.warning(f"No columns match prefix/es '{prefix}' - skipping.")
#             continue
#         feature_name = f"{output_feature_prefix}{prefix[0]}"
#         X_missing[feature_name] = nan_mask[features].agg("max", axis=1)
    
#     return X_missing

In [7]:
df_train_eng = add_missingness_features(X=df_train_clean, nan_mask=df_nan_mask)
df_train_eng.columns[-4:]

Index(['missing_pct_env_feat', 'missing_station', 'missing_ndvi_s',
       'missing_ndvi_n'],
      dtype='object')

### Combine north and south NDVIs
- [X] combine primary features (method `mean`)
- [X] combine related missingness flags (method `max`) - done upstream in `add_missingness_features()`

In [8]:
# # TODO remove after cleaning

# def reduce_features(X: pd.DataFrame, 
#                     input_feat_groups: List[List[str]]=None,
#                     output_feat_names: List[str]=None,
#                    function: str=None):
#     """
#     Aggregate multiple feature groups into single reduced features using specified function.
#     Combine input features and drop originals.
    
#     :param X: Input pandas DataFrame.
#     :param input_feat_groups: List of feature group lists to aggregate. Default None uses 
#            config.yaml settings (e.g., [['ndvi_ne', 'ndvi_nw']]).
#     :param output_feat_names: Output column names for aggregated features. Default None uses 
#            config.yaml settings (e.g., ['ndvi_north']).
#     :param function: Aggregation function string ('mean', 'sum', 'median'). Default None uses 
#            config.yaml settings (e.g 'mean').
#     :return: DataFrame with reduced features. Original input columns dropped.
#     """
#     X_reduced = X.copy()
#     if input_feat_groups is None:
#         input_feat_groups = cnfg.preprocess.combine_features["input_groups"]
#     if output_feat_names is None:
#         output_feat_names = cnfg.preprocess.combine_features["output_names"]
#     if function is None:
#         function = cnfg.preprocess.combine_features["aggregation"]

        
#     if not len(input_feat_groups) == len(output_feat_names):
#         raise ValueError(f"Input feature groups {input_feat_groups} mismatch target keys {output_feat_names}")
#     missing_features = set(itertools.chain(*input_feat_groups)) - set(X.columns)
#     if missing_features:
#         raise ValueError(f"No {missing_features} features in input dataframe columns: {X.columns}")

#     for name, group in zip(output_feat_names, input_feat_groups):
#         X_reduced[name] = X_reduced[group].agg(function, axis=1)
#         X_reduced.drop(columns=group, inplace=True)
        
#     return X_reduced

In [9]:
df_train_eng = reduce_features(df_train_eng)
[f for f in df_train_eng.columns if f.startswith("ndvi")]

['ndvi_north', 'ndvi_south']

### Process zero/low value target value streaks
- [X] Engineer a continuous low-case streak count covering low values over the entire period (i.e. `low_case_streak`)
- [X] Engineer a binary low-case streak marker for early low values (i.e. `initial_low_case_streak`)
- [X] Avoid target leakage: do not use future data when calculating streaks (unlike the `zero_one_targets` streak table used for EDA)
- [ ] Compare performance with and without the `low_case_streak` and `initial_low_case_streak` features
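A toy walk-through of the running streak count may help when reading the helper cells in this section. It uses the same cumulative-sum idea as `_get_cumulative_streaks`, on invented case counts. The final `shift(1)` is an assumption added here, not part of the project code: it is one way to make row *t* see only the streak up to week *t-1*, so the current week's target is not encoded in its own feature.

```python
import pandas as pd

# Invented weekly case counts for a single city.
cases = pd.Series([0, 1, 0, 5, 0, 0, 3], name="total_cases")
low_value_range = 2  # values 0 and 1 count as "low" (exclusive upper bound)

is_low = cases.isin(range(low_value_range))

# A new run id starts whenever the low / not-low state changes.
run_id = (is_low != is_low.shift(fill_value=False)).cumsum()

# Running count inside each run; rows that are not low stay at 0.
low_case_streak = is_low.astype(int).groupby(run_id).cumsum()

# Hypothetical leakage guard: expose only the streak known at the end of week t-1.
low_case_streak_prev = low_case_streak.shift(1, fill_value=0)

print(pd.DataFrame({
    "total_cases": cases,
    "low_case_streak": low_case_streak,
    "low_case_streak_prev": low_case_streak_prev,
}))
```

Whether the shifted or unshifted variant is appropriate depends on how the forecast horizon is set up downstream.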

In [10]:
low_val_feats = cnfg.preprocess.low_value_streak_features
low_val_feats

{'target_streak_len_threshold': 26,
 'target_streak_feat_n': 'low_case_streak',
 'initial_streaks': True,
 'low_value_range': 2}

In [11]:
zero_one_targets = value_streaks(data=df_labels_clean, column=TARGET, value=range(2),
                             run_threshold=1)
print("Zero and one consecutive value streaks for target data (total dengue cases).")
zero_one_targets

Zero and one consecutive value streaks for target data (total dengue cases).


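| | first_pos | last_pos | streak_len |
|---|---|---|---|
| 0 | 930 | 1004 | 75 |
| 1 | 1073 | 1081 | 9 |
| 2 | 1089 | 1094 | 6 |
| 3 | 1339 | 1344 | 6 |
| 4 | 1196 | 1200 | 5 |
| 5 | 1238 | 1240 | 3 |
| 6 | 1098 | 1100 | 3 |
| 7 | 1264 | 1266 | 3 |
| 8 | 1335 | 1337 | 3 |
| 9 | 1388 | 1389 | 2 |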


In [12]:
# def _get_cumulative_streaks(group: pd.Series,
#                             low_value_range:int,
#                            filter_initial_streaks: bool=False):
#     """
#     Compute low-value streak lengths within a 1D series, optionally filtering to initial streak only.

#     :param group: pandas Series representing a single grouped sequence (e.g., one city).
#     :param low_value_range: Upper bound (exclusive) for values considered "low".
#     :param filter_initial_streaks: If True, keep only initial low-value streak when it starts at first
#                                    position (all others 0). If False, return complete streak lengths.
#                                    Default is False.
#     :return: pandas Series (same index/shape as `group`) with cumulative low-value streak lengths,
#              filtered according to `filter_initial_streaks`.
#     """
    
#     mask_lows = group.isin(range(low_value_range))
#     streak_groups = (mask_lows != mask_lows.shift(fill_value=False)).cumsum()
#     low_count_streaks = mask_lows.groupby(by=streak_groups).cumsum()

#     if not filter_initial_streaks:
#         return low_count_streaks

#     else:
#         if low_count_streaks.iloc[0] == 1:
#             initial_low_streaks = low_count_streaks.where(streak_groups == 1, 0)
#         else:
#             initial_low_streaks = pd.Series(0, index=low_count_streaks.index)
#         return initial_low_streaks

In [13]:
# def low_value_targets(X: pd.DataFrame, y: pd.DataFrame,
#                       target_feature: str | None=None,
#                       group_feature:str | None=None,
#                       new_feat_name:str | None=None,
#                       initial_streaks_only:bool | None=None,
#                       min_initial_streak_len:int | None=None,
#                       low_value_range:int | None=None
#                      ) -> pd.DataFrame:
#     """
#     Generate low-value streak features for ML pipelines. Creates continuous streak length feature
#     (`low_case_streak`) and optional boolean initial-streak indicator with minimum length filtering.
    
#     Continuous streak preserves magnitude info for model gradients. Boolean initial streak supports
#     minimum length thresholding.

#     :param X: Input features DataFrame.
#     :param y: Target DataFrame (same index as X).
#     :param target_feature: Target column name. Uses config default if None.
#     :param group_feature: Grouping column (e.g., 'city'). None processes entire series.
#                             Defaults to config settings.
#     :param new_feat_name: Output base name (e.g., 'low_case_streak'). None returns X unchanged.
#                             Defaults to config settings.
#     :param initial_streaks_only: If True, adds thresholded boolean `initial_{new_feat_name}`.
#                             Defaults to config settings.
#     :param min_initial_streak_len: Min length threshold. Applies **only** to boolean `initial_*` feature.
#                             Defaults to config settings.
#     :param low_value_range: Values < this are "low" (exclusive upper bound).
#                             Defaults to config settings.
#     :return: X with `low_case_streak` (continuous) ± `initial_low_case_streak` (boolean).
#     """
    
#     assert all(X.index == y.index), "Indices for 'X' and 'y' must be aligned." 
    
#     config_values = cnfg.preprocess
    
#     target_feature = target_feature or config_values.feature_groups["target"]
#     group_feature = group_feature or config_values.feature_groups["city"]
#     min_initial_streak_len = min_initial_streak_len or (config_values.
#         low_value_streak_features["target_streak_len_threshold"])
#     new_feat_name = new_feat_name or (config_values.
#         low_value_streak_features.get("target_streak_feat_n"))
#     low_value_range = low_value_range or (config_values.
#         low_value_streak_features["low_value_range"])

#     if initial_streaks_only is None:
#         initial_streaks_only = config_values.low_value_streak_features.get("initial_streaks"
#                                                                           ) or False
#     if new_feat_name is None:
#         return X
        
#     X_low_streaks = X.copy()

#     if group_feature is not None:
#         low_count_streaks = y.groupby(by=group_feature)[target_feature].transform(
#             lambda x: _get_cumulative_streaks(x, low_value_range))
#     else:
#         low_count_streaks = _get_cumulative_streaks(y[target_feature], low_value_range)
        
#     temp_output_dict = {new_feat_name: low_count_streaks}

#     if initial_streaks_only:
#         if group_feature is not None:
#             initial_streaks = y.groupby(by=group_feature)[target_feature].transform(
#                 lambda x: _get_cumulative_streaks(x, low_value_range, initial_streaks_only))
#         else:
#             initial_streaks = _get_cumulative_streaks(y[target_feature], low_value_range,
#                                                       initial_streaks_only)
            
#         if min_initial_streak_len:
#             long_streak_mask = initial_streaks >= min_initial_streak_len
#             long_streak_threshpoints = long_streak_mask[
#                 initial_streaks == min_initial_streak_len].index
            
#             for point in long_streak_threshpoints:
#                 start = max(0, point - min_initial_streak_len + 1)
#                 long_streak_mask[start:point] = True
#             initial_streaks = initial_streaks.where(long_streak_mask, 0)
            
#         temp_output_dict[f"initial_{new_feat_name}"] = initial_streaks
            
            
#     for feat_name, streak in temp_output_dict.items():
#         if feat_name.startswith("initial_"):
#             streak = streak.astype(bool).astype(int)
#         X_low_streaks[feat_name] = streak
        
#     return X_low_streaks

In [14]:
df_train_eng = low_value_targets(X=df_train_eng, y=df_labels_clean)
df_train_eng.columns

Index(['city', 'year', 'weekofyear', 'week_start_date', 'precipitation_amt_mm',
       'reanalysis_air_temp_k', 'reanalysis_avg_temp_k',
       'reanalysis_dew_point_temp_k', 'reanalysis_max_air_temp_k',
       'reanalysis_min_air_temp_k', 'reanalysis_precip_amt_kg_per_m2',
       'reanalysis_relative_humidity_percent', 'reanalysis_sat_precip_amt_mm',
       'reanalysis_specific_humidity_g_per_kg', 'reanalysis_tdtr_k',
       'station_avg_temp_c', 'station_diur_temp_rng_c', 'station_max_temp_c',
       'station_min_temp_c', 'station_precip_mm', 'missing_pct_env_feat',
       'missing_station', 'missing_ndvi_s', 'missing_ndvi_n', 'ndvi_north',
       'ndvi_south', 'low_case_streak', 'initial_low_case_streak'],
      dtype='object')

In [15]:
df_train_eng["initial_low_case_streak"].sum()

75

In [16]:
df_train_eng["low_case_streak"].sum()

3029

### Cyclical time features
- [X] Create circular sin/cos features based on `weekofyear`
- [X] drop source time feature `weekofyear`
- [X] keep `week_start_date` for time being for train-test-splits and windowing
- [ ] remove `week_start_date` before modeling
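A short worked example of why the sin/cos encoding is used: on the raw `weekofyear` scale, week 52 and week 1 look 51 units apart, while on the unit circle they are neighbours. The helper name `encode_week` and the fixed `period=52` are assumptions for illustration; years that contain a week 53 would need their own handling.

```python
import numpy as np


def encode_week(week: int, period: int = 52) -> np.ndarray:
    """Map a week number to a point (sin, cos) on the unit circle."""
    angle = 2 * np.pi * week / period
    return np.array([np.sin(angle), np.cos(angle)])


# Raw integer distance across the year boundary.
raw_gap = abs(52 - 1)

# Euclidean distance between the encoded points.
encoded_gap_year_end = np.linalg.norm(encode_week(52) - encode_week(1))
encoded_gap_half_year = np.linalg.norm(encode_week(26) - encode_week(52))

print(f"raw gap week 52 -> 1:      {raw_gap}")
print(f"encoded gap week 52 -> 1:  {encoded_gap_year_end:.3f}")
print(f"encoded gap week 26 -> 52: {encoded_gap_half_year:.3f}")
```

Both columns are needed together: sine alone (or cosine alone) maps two different weeks to the same value.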

In [17]:
# def circular_time_features(X: pd.DataFrame,
#                            source_feature: str | None = None,
#                            period: int | None = None,
#                            drop_source_feature: bool = True):
#     """
#     Generate cyclical sin/cos features for periodic time variables. Transforms integer week/month/day
#     features into continuous circular encodings that preserve temporal continuity across period boundaries.

#     :param X: Input features DataFrame.
#     :param source_feature: Periodic column name (e.g., 'weekofyear'). Uses config default if None.
#     :param period: Fixed cycle length (e.g., 52 for weeks, 12 for months). Uses column max if None.
#     :param drop_source_feature: If True, drops original source column after encoding.
#     :return: X with `sin_{source_feature}` and `cos_{source_feature}` columns added.
#     """
#     source_feature = (source_feature 
#                       or cnfg.preprocess.feature_groups["week"])
    
#     _check_feature_presence(source_feature, X.columns)
    
#     X_circular = X.copy()
#     divisor = period or X_circular[source_feature].max()
    
#     X_circular[f"sin_{source_feature}"] = np.sin(
#         2 * np.pi * X_circular[source_feature] / divisor)
#     X_circular[f"cos_{source_feature}"] = np.cos(
#         2 * np.pi * X_circular[source_feature] / divisor)

#     if drop_source_feature:
#         X_circular.drop(columns=[source_feature], inplace=True)

#     return X_circular

In [28]:
df_train_eng = circular_time_features(df_train_eng)
[feature for feature in df_train_eng.columns if "week" in feature]

['week_start_date', 'sin_weekofyear', 'cos_weekofyear']

## Common Feature Selection baseline

### Remove selected multicollinear features
- [X] Remove the initial `config.yaml` multicollinear features from the dataframe
    - Initial assessment of pairwise correlations in EDA
- [X] Assess city-wise VIF instead of the EDA correlation matrix to capture one-vs-all relationships.
- [ ] Remove features with VIF > 10
- [X] always prefer station_* over reanalysis_*
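The still-open item "Remove features with VIF > 10" could be approached iteratively: drop the feature with the highest VIF, recompute, and repeat. The sketch below is an assumption about how that might look, not the project's `remove_features`; it computes VIF with plain NumPy least squares instead of `statsmodels`, and `vif_scores`, `prune_by_vif`, the `protected` argument, and the toy columns are all hypothetical.

```python
import numpy as np
import pandas as pd


def vif_scores(X: pd.DataFrame) -> pd.Series:
    """Return VIF = 1 / (1 - R^2) per column, sorted descending."""
    scores = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        ss_res = float(np.sum((y - design @ coef) ** 2))
        ss_tot = float(np.sum((y - y.mean()) ** 2))
        if ss_tot == 0.0:
            scores[col] = np.nan  # constant column: VIF is undefined
            continue
        one_minus_r2 = ss_res / ss_tot
        scores[col] = np.inf if one_minus_r2 <= 0.0 else 1.0 / one_minus_r2
    return pd.Series(scores, dtype=float).sort_values(ascending=False)


def prune_by_vif(X: pd.DataFrame, threshold: float = 10.0, protected=()) -> pd.DataFrame:
    """Drop the highest-VIF column until all unprotected VIFs are <= threshold."""
    kept = X.copy()
    while kept.shape[1] > 1:
        candidates = (vif_scores(kept)
                      .drop(labels=list(protected), errors="ignore")
                      .dropna())
        if candidates.empty or candidates.iloc[0] <= threshold:
            break
        kept = kept.drop(columns=candidates.index[0])
    return kept


# Toy data: x_sum is almost exactly x1 + x2, x_indep is unrelated.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "x_sum": x1 + x2 + 0.01 * rng.normal(size=200),
    "x_indep": rng.normal(size=200),
})
pruned = prune_by_vif(toy, threshold=10.0)
print(vif_scores(toy), vif_scores(pruned), sep="\n\n")
```

The `protected` argument is one way to express the "prefer station_* over reanalysis_*" rule: protected columns are never candidates for removal.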

In [20]:
cnfg.preprocess.multicolinear["removal_list"]

['reanalysis_sat_precip_amt_mm',
 'reanalysis_dew_point_temp_k',
 'reanalysis_air_temp_k',
 'reanalysis_specific_humidity_g_per_kg',
 'reanalysis_avg_temp_k',
 'reanalysis_max_air_temp_k',
 'reanalysis_min_air_temp_k']

In [21]:
# # TODO remove after cleaning
# def top_vif(data: pd.DataFrame):
#     """
#     Calculate and return Variance Inflation Factor (VIF) scores for numeric features.
    
#     :param data: pandas DataFrame containing numeric and non-numeric features.
#     :return: pandas DataFrame with features and their VIF scores,
#                 sorted descending (excludes constant).
#     """
#     data_vif = add_constant(data.select_dtypes(include="number"))
#     cols = data_vif.columns
#     if data_vif.isna().sum().sum() > 0:
#         raise ValueError(f"{data_vif.isna().sum().sum()} NaNs in the dataframe.")
#     data_vif = [variance_inflation_factor(
#         data_vif.values, i) for i in range(data_vif.shape[1])]
#     data_vif = pd.DataFrame(data=data_vif, index=cols, columns=["vif"])
#     data_vif = data_vif.sort_values(by="vif", ascending=False,
#                                     na_position="first").drop(index="const")
    
#     return data_vif    

In [22]:
# Don't cap outliers; log-transform the target instead
# intermediate_target = cap_outliers(data=df_labels_raw, features=TARGET)
# df_labels_clean = intermediate_target["data"]
# df_labels_log = ...

In [23]:
# display_distributions(data = df_labels_log,
#                       hue_palette=(CITYGROUP_FEAT, random_colormap()),
#                      features=[TARGET], title_prefix=TARGET.capitalize(),
#                      )

In [24]:
temp_vif_pre = df_train_eng.groupby(by=CITYGROUP_FEAT).apply(lambda group: top_vif(data=group), include_groups=False)
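temp_vif_pre = temp_vif_pre.loc["iq"].join(temp_vif_pre.loc["sj"], lsuffix="_iq", rsuffix="_sj", how="inner")
# Assign as a Series so values align by feature name; `.values` would align by position,
# and top_vif() sorts every result independently.
temp_vif_pre["vif_total"] = top_vif(data=df_train_eng)["vif"]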
temp_vif_pre.sort_values(by="vif_sj", ascending=False, na_position="first")

  vif = 1. / (1. - r_squared_i)
  vif = 1. / (1. - r_squared_i)
  return 1 - self.ssr/self.centered_tss
  vif = 1. / (1. - r_squared_i)


| | vif_iq | vif_sj | vif_total |
|---|---|---|---|
| initial_low_case_streak | 5.011947 | NaN | 10.039308 |
| reanalysis_sat_precip_amt_mm | inf | inf | inf |
| precipitation_amt_mm | inf | inf | inf |
| reanalysis_dew_point_temp_k | 455.592262 | 1698.333029 | 465.31392 |
| reanalysis_air_temp_k | 77.742075 | 1138.869614 | 185.239521 |
| reanalysis_specific_humidity_g_per_kg | 419.60753 | 525.113549 | 383.644881 |
| reanalysis_relative_humidity_percent | 114.996983 | 306.044563 | 212.880313 |
| reanalysis_avg_temp_k | 37.573909 | 274.564065 | 66.79385 |
| station_avg_temp_c | 3.350385 | 24.573056 | 4.341595 |
| reanalysis_min_air_temp_k | 5.522143 | 19.140116 | 23.767917 |


In [25]:
df_train_clean = remove_features(X=df_train_clean)

In [26]:
temp_vif_post = df_train_clean.groupby(by=CITYGROUP_FEAT).apply(lambda group: top_vif(data=group), include_groups=False)
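temp_vif_post = temp_vif_post.loc["iq"].join(temp_vif_post.loc["sj"], lsuffix="_iq", rsuffix="_sj", how="inner")
# Assign as a Series so values align by feature name; `.values` would align by position,
# and top_vif() sorts every result independently.
temp_vif_post["vif_total"] = top_vif(data=df_train_clean)["vif"]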
temp_vif_post.sort_values(by="vif_sj", ascending=False, na_position="first")

| | vif_iq | vif_sj | vif_total |
|---|---|---|---|
| station_avg_temp_c | 2.903773 | 15.205912 | 4.242656 |
| station_min_temp_c | 2.246606 | 9.490197 | 2.71182 |
| station_max_temp_c | 2.525159 | 7.615379 | 3.264081 |
| ndvi_se | 2.690311 | 3.280663 | 3.942102 |
| ndvi_sw | 4.025909 | 3.234273 | 6.650551 |
| reanalysis_relative_humidity_percent | 6.139555 | 3.111873 | 9.211951 |
| station_diur_temp_rng_c | 3.016941 | 3.041026 | 5.982186 |
| reanalysis_precip_amt_kg_per_m2 | 1.618934 | 2.597984 | 1.962594 |
| year | 1.139403 | 2.130203 | 1.293049 |
| ndvi_nw | 2.845429 | 1.950984 | 4.045501 |


In [27]:
corr_threshold=0.8
print(f"City-stratified correlations exceeding {corr_threshold}:")
df_train_clean.groupby(
    by=CITYGROUP_FEAT).apply(
        lambda group: top_correlations(
            data=group, 
            corr_threshold=corr_threshold
        ), include_groups=False)

City-stratified correlations exceeding 0.8:


city                                                          
iq    ndvi_ne             ndvi_sw                                 0.840730
      reanalysis_tdtr_k   reanalysis_relative_humidity_percent   -0.900588
sj    station_min_temp_c  station_avg_temp_c                      0.898016
      station_max_temp_c  station_avg_temp_c                      0.864780
      ndvi_sw             ndvi_se                                 0.819673
dtype: float64

**Conclusion**:
- The overall `vif_total` Variance Inflation Factor, combined for both cities, is below 10:
  - `reanalysis_relative_humidity_percent` remains among the highest-scoring features overall, but it is kept because there is no alternative station humidity data and humidity matters in the dengue detection domain.
- `San Juan` still suffers from higher multicollinearity, which reflects a geographical characteristic (more stable climate):
    - Top VIF scores for `San Juan` reach 15.2 and occur in crucial station temperature measurements. `station_avg_temp_c` could potentially be removed, but that risks hiding important signals in `Iquitos`.
    - Keep `station_avg_temp_c` and observe model results:
        - A VIF score of ~15 should be handled well by `LightGBM`
        - If the `LSTM` `San Juan` -> `Iquitos` transfer learning underperforms, try adding `station_avg_temp_c` to the `config.yaml` feature removal list.
- VIF factors for `Iquitos` are all acceptable, peaking at ~6.1.
- The remaining high-correlation pairs differ between the two cities, which suggests that the closeness of these features is city-specific and may carry important signals for the models, so they are kept.

### RobustScaler (features)
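A minimal sketch of what robust scaling does, written with plain pandas so the arithmetic is visible: each feature is centered on its median and divided by its interquartile range (the same statistics scikit-learn's `RobustScaler` uses with default settings). The helper names and toy values are assumptions; the key point is that the statistics are fitted on the training split only and then reused for any later data.

```python
import pandas as pd


def fit_robust_params(train: pd.DataFrame) -> pd.DataFrame:
    """Learn per-column median and IQR from the training split only."""
    q1 = train.quantile(0.25)
    q3 = train.quantile(0.75)
    iqr = (q3 - q1).replace(0, 1.0)  # guard against constant columns
    return pd.DataFrame({"center": train.median(), "scale": iqr})


def apply_robust_params(X: pd.DataFrame, params: pd.DataFrame) -> pd.DataFrame:
    """Scale any split with the statistics learned on the training split."""
    return (X - params["center"]) / params["scale"]


# Toy rainfall column with one extreme week (values are invented).
train = pd.DataFrame({"station_precip_mm": [0.0, 2.0, 4.0, 6.0, 300.0]})
test = pd.DataFrame({"station_precip_mm": [8.0]})

params = fit_robust_params(train)
scaled_train = apply_robust_params(train, params)
scaled_test = apply_robust_params(test, params)
print(params, scaled_train, scaled_test, sep="\n\n")
```

The extreme week stays extreme after scaling, but it no longer distorts the center and scale applied to the ordinary weeks, which is the property that matters for the LSTM inputs.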

## Target processing
- [ ] Process zero/low value target value streaks
- [ ] RobustScaler (Targets)

### Process zero/low value target value streaks
- [ ] Predict log1p(total_cases) to reduce zero target influence (log transform targets)
- [ ] (Optional) LSTM specific - if LSTM results are sub-optimal, experiment with downweighting affected period data (ie 0.3 or more)
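The `log1p` target transform and its inverse can be sketched as below. The case counts and predictions are invented; the clipping step is an assumption about how negative back-transformed predictions might be handled, not a decision taken in this notebook.

```python
import numpy as np

# Invented weekly case counts, including zeros and one outbreak peak.
total_cases = np.array([0, 1, 3, 20, 461])

# log1p(x) = log(1 + x): defined at zero and compresses the outbreak tail.
y_log = np.log1p(total_cases)

# expm1 is the exact inverse, used to bring predictions back to case counts.
y_back = np.expm1(y_log)

# Hypothetical model outputs in log space; slightly negative values can occur,
# so back-transformed predictions are clipped at zero cases.
pred_log = np.array([-0.1, 2.0])
pred_cases = np.clip(np.expm1(pred_log), 0, None)

print(y_log.round(3))
print(y_back)
print(pred_cases.round(3))
```

Metrics such as MAE should be computed after the back-transform if they are meant to be read in units of cases.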

# TODO: adjust **Conclusion**:
- Target outliers are 

### RobustScaler (targets)

# TODO Research before applying:
- **Final Preprocessing tactics for outliers:**
    - Target ("total_cases")
        - If tree models are used (e.g. LightGBM): no major issue, trees are not sensitive to outliers:
            - [ ] use Huber loss for extra safety when handling tails
            - [ ] RobustScaler may be redundant for tree models, but does no harm if it simplifies the pipeline.
        - RNNs (e.g. LSTM) are outlier sensitive (gradient instability, hidden state patterns lose importance at peaks, scaling):
            - [ ] Log transform
            - [ ] Scale (RobustScaler with IQR is more outlier resistant)
            - [ ] apply Huber loss
    - Features:
        - [ ] RobustScaler for RNN
        - [ ] RobustScaler may be redundant for tree models, but does no harm if it simplifies the pipeline.
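For reference while researching the loss choice, a plain NumPy sketch of the Huber loss: quadratic for small errors, linear beyond a threshold `delta`, so a single outbreak peak contributes far less than under squared error. The function name and the `delta=1.0` default are illustrative assumptions; LightGBM (`objective="huber"`) and Keras (`tf.keras.losses.Huber`) ship their own implementations.

```python
import numpy as np


def huber_loss(y_true, y_pred, delta: float = 1.0) -> np.ndarray:
    """Element-wise Huber loss: 0.5*e^2 if |e| <= delta, else delta*(|e| - 0.5*delta)."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_err = np.abs(err)
    quadratic = 0.5 * err ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)


# A small error and an outlier-sized error, compared with half squared error.
errors_true = np.array([0.0, 0.0])
errors_pred = np.array([0.5, 10.0])

huber = huber_loss(errors_true, errors_pred, delta=1.0)
half_squared = 0.5 * (errors_true - errors_pred) ** 2

print(huber)         # small error treated quadratically, large error linearly
print(half_squared)  # large error dominates under squared loss
```

If the target is log-transformed first, `delta` has to be chosen on the log scale.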

Separate tactics for the baseline tree model and the RNN.
- Tree (e.g. LightGBM):
    - drop redundant and highly correlated reanalysis features, replace with engineered features that provide temporal info
    - single vector input:
        - weather/vegetation snapshot for week t-1:
            - keep these features (possibly a few more):
                1. city
                2. sin/cos_weekofyear (cyclical),
                3. station_avg_temp_c,
                4. station_precip_mm,
                5. station_min_temp_c/station_max_temp_c,
                6. ndvi_ne/ndvi_sw,
                7. reanalysis_relative_humidity_percent
        - multiple lagged/aggregated case metrics.
            - see next cell
            - add:
                - 4w rolling means
                - 4w rolling max
                - w1-w4 lags (4 features)
        - multiple lagged/aggregated weather features:
            - 4w averages for temp, rain, humidity
        - limit to 25 dimensions (preferably 20)
- RNN (e.g. LSTM)
    - keep most original features, drop only the 2-3 most collinear reanalysis features, minimal engineering
    - windowed input with:
        - 4-8 time steps (if single step/week forecast)
        - 10-20 features per step
        - include city embedding (research if truly needed)
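The windowed LSTM input described above can be sketched with NumPy: each sample is a block of the previous `n_steps` weeks of features, paired with the target of the following week. `make_windows` and the toy arrays are hypothetical; in practice the windows would be built separately per city so that no window spans both San Juan and Iquitos.

```python
import numpy as np


def make_windows(features: np.ndarray, target: np.ndarray, n_steps: int):
    """Build (samples, n_steps, n_features) windows for one-step-ahead forecasts.

    Window i holds weeks [end - n_steps, end) and is paired with target[end].
    """
    X, y = [], []
    for end in range(n_steps, len(features)):
        X.append(features[end - n_steps:end])
        y.append(target[end])
    return np.stack(X), np.array(y)


# Toy series: 10 weeks, 3 features per week (values are just counters).
features = np.arange(30).reshape(10, 3)
target = np.arange(10) * 10

X, y = make_windows(features, target, n_steps=4)
print(X.shape, y.shape)  # (samples, time steps, features) and (samples,)
print(X[0], y[0])
```

The first `n_steps` weeks of each city produce no sample, which is one more reason to keep the window length modest on a dataset of this size.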

Conclusions from EDA notebook (~ line 33)
- Tree models may handle interrupted seasonality better, but:
    - with additional feature engineering:
      - seasonal cyclical sin/cos features (weekly on an annual basis)
      - weekly case autocorrelation/shift() features (a must) or annual shift features (they indicate time momentum to the tree model):
          - trees can handle the resulting NaNs
      - weekly per-city differencing to highlight spikes and subdue seasonality; stationary series (constant statistical properties over time) benefit trees.
      - Optional: annual diff feature for outbreak signals per city
- RNNs (e.g. LSTM) will struggle with outbreak spikes, as they work best with smooth seasonal patterns:
    - may still benefit from cyclical sin/cos features
    - precip_avg_4w
    - Rolling_mean_cases_4w (handle NaNs, e.g. impute with a bfill() + ffill() combo)
    - lag1_cases
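The case-based features named above (`lag1_cases`, a 4-week rolling mean, weekly differencing) could be derived per city roughly as follows. This is a sketch on invented counts; `diff1_cases` and the `_filled` column are assumed names. Every feature is built from values shifted by at least one week, so row *t* never contains its own target.

```python
import pandas as pd

# Invented weekly case counts for two cities, already in chronological order.
df = pd.DataFrame({
    "city": ["sj"] * 6 + ["iq"] * 6,
    "total_cases": [4, 6, 10, 8, 2, 0,
                    0, 0, 1, 0, 2, 1],
})
cases = df.groupby("city")["total_cases"]

# Last week's cases, never crossing the city boundary.
df["lag1_cases"] = cases.shift(1)

# Weekly differencing on past values only: (t-1) minus (t-2).
df["diff1_cases"] = cases.shift(1) - cases.shift(2)

# Mean of the four weeks before t (shift first, then roll).
df["rolling_mean_cases_4w"] = cases.transform(
    lambda s: s.shift(1).rolling(window=4).mean()
)

# For the LSTM: fill the warm-up NaNs per city with bfill() then ffill().
df["rolling_mean_cases_4w_filled"] = (
    df.groupby("city")["rolling_mean_cases_4w"]
      .transform(lambda s: s.bfill().ffill())
)
print(df)
```

Note that `bfill()` copies a later value into the first weeks of each city, which is a mild look-ahead; dropping the warm-up rows is the stricter alternative.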