# Feature Selection Notebook for Dengue Transfer Learning Project

### Dataset
Dengue ML datasets track environmental and temporal factors influencing Aedes mosquito breeding and virus transmission in tropical regions like San Juan and Iquitos.

- #### Temporal Features
    - **city**: Location identifier (e.g., 'sj' for San Juan, 'iq' for Iquitos)—captures city-specific mosquito/dengue patterns.
    - **year, weekofyear, week_start_date**: Time granularity for seasonality; dengue peaks during rainy seasons (weekofyear critical for lagged effects).

- #### Vegetation Indices (NDVI)
    - **ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw**: Normalized Difference Vegetation Index by city quadrant. Higher NDVI indicates lush vegetation providing mosquito shade/breeding sites; key for Aedes habitat detection via satellite.

- #### Precipitation \& Water
    - **precipitation_amt_mm**: Rainfall amount—creates standing water breeding sites.
    - **reanalysis_precip_amt_kg_per_m2, reanalysis_sat_precip_amt_mm**: Reanalysis (modeled) precipitation variants confirming observed rain.
    - **station_precip_mm**: Ground station measurements—most direct rain proxy.

- #### Temperature Metrics
    - **reanalysis_air_temp_k, reanalysis_avg_temp_k, reanalysis_max_air_temp_k, reanalysis_min_air_temp_k**: Reanalysis temps in Kelvin; optimal Aedes range 26-32°C accelerates larval development/virus replication.
    - **station_avg_temp_c, station_max_temp_c, station_min_temp_c**: Station temps in Celsius—ground truth validation.
    - **station_diur_temp_rng_c**: Diurnal range; wider swings stress mosquitoes.
    - **reanalysis_tdtr_k**: Diurnal temperature range from reanalysis, in Kelvin.

- #### Humidity \& Moisture
    - **reanalysis_dew_point_temp_k**: Dew point—direct humidity proxy; high values (>20°C) favor mosquito survival.
    - **reanalysis_relative_humidity_percent**: Relative humidity %—critical for egg/larval viability.
    - **reanalysis_specific_humidity_g_per_kg**: Absolute moisture content.
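
Because the reanalysis temperatures are in Kelvin while the station temperatures are in Celsius, putting them on one scale can help when comparing the two sources. The following is a minimal sketch with toy values; `kelvin_to_celsius` is a hypothetical helper, not part of the project code. It assumes Kelvin columns end in `_k` and that `reanalysis_tdtr_k` is a temperature difference, so that column is renamed without subtracting the offset.

```python
import pandas as pd

KELVIN_OFFSET = 273.15


def kelvin_to_celsius(df: pd.DataFrame,
                      range_cols=("reanalysis_tdtr_k",)) -> pd.DataFrame:
    """Return a copy where columns ending in '_k' are expressed in Celsius.

    Columns listed in `range_cols` hold temperature differences, which are
    numerically identical in Kelvin and Celsius, so no offset is applied.
    """
    out = df.copy()
    kelvin_cols = [c for c in out.columns if c.endswith("_k")]
    for col in kelvin_cols:
        offset = 0.0 if col in range_cols else KELVIN_OFFSET
        out[col[:-2] + "_c"] = out[col] - offset
        out = out.drop(columns=col)
    return out


# Toy values for illustration only (not taken from the dataset)
demo = pd.DataFrame({
    "reanalysis_air_temp_k": [299.15, 301.15],
    "reanalysis_tdtr_k": [2.5, 3.0],
    "station_avg_temp_c": [26.1, 27.9],
})
converted = kelvin_to_celsius(demo)
print(converted)
```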


### For a fair comparison between LightGBM and our LSTM, both models start from the same feature set: seasonal cues, vegetation signals, missing-data indicators, and early low-case periods. Everything else comes down to how each model learns over time.
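
As an illustration of two of the shared feature types, the sketch below encodes week of year on a circle and adds missing-data indicator columns. Both functions are hypothetical stand-ins written for this example; the project's own `circular_time_features` and `add_missingness_features` are not shown in this notebook and may work differently. A period of 52 weeks is assumed.

```python
import numpy as np
import pandas as pd


def add_cyclical_week(df: pd.DataFrame, week_col: str = "weekofyear",
                      period: int = 52) -> pd.DataFrame:
    """Add sin/cos encodings so the last and first weeks of a year are close."""
    out = df.copy()
    angle = 2 * np.pi * out[week_col] / period
    out["week_sin"] = np.sin(angle)
    out["week_cos"] = np.cos(angle)
    return out


def add_missing_flags(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Add a 0/1 indicator per column marking rows that were originally NaN."""
    out = df.copy()
    for col in columns:
        out[f"{col}_was_missing"] = out[col].isna().astype(int)
    return out


# Toy values for illustration only
demo = pd.DataFrame({
    "weekofyear": [1, 26, 52],
    "ndvi_ne": [0.10, np.nan, 0.30],
})
encoded = add_missing_flags(add_cyclical_week(demo), columns=["ndvi_ne"])
print(encoded)
```

The indicator columns should be created before imputation; otherwise the information about which values were missing is lost.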

### Notebook sections for the fourth project notebook (Feature Selection/Scaling/Pre-Modeling Splits)
1. Get Data
2. Redundant Feature Removal
    - Common for both models
3. Walk-forward Cross-Validation for feature importance assessment
4. Target processing
5. Benchmark model (TBC, possibly notebook 04)
6. Model Tuning (TBC)
7. Model Evaluation (TBC, possibly notebook 04)

In [1]:
import sys
import os
from pathlib import Path
from typing import List, Tuple, Any, Dict, Optional
# import gc
# import itertools
# import logging
# logging.basicConfig(level=logging.INFO)

# Add the parent directory (project root) to sys.path
if os.path.abspath("..") not in sys.path:
    sys.path.insert(0, os.path.abspath(".."))
    
from src.config import ProjectConfig  # project config file parser
from src.utils.eda import value_streaks, top_correlations
from src.utils.visualizations import (compute_correlations_matrix,
                display_distributions, random_color, random_colormap,
                display_timeseries)

import pandas as pd
import numpy as np
import random
# import time
# from datetime import timedelta

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

from src.utils.eda import top_vif  # top_correlations is already imported above
from src.utils.utils import _check_feature_presence, load_file, save_file
from src.preprocessing.clean import (drop_nan_rows, cap_outliers,
                                    median_groupwise_impute, pipe_clean)

from src.preprocessing.engineer import (reduce_features,
                                        add_missingness_features,
                                        low_value_targets,
                                        circular_time_features)

from src.preprocessing.select import remove_features

# from IPython.display import display
# import matplotlib.pyplot as plt
# import matplotlib.colors as mcolors
# from matplotlib.axis import Axis
# from matplotlib.dates import MonthLocator, YearLocator, DateFormatter
# import seaborn as sns

In [2]:
cnfg = ProjectConfig.load_configuration()
PATH_TO_RAW_DATA = cnfg.data.dirs["raw"]
PATH_TO_INTERMEDIATE_DATA = cnfg.data.dirs["intermediate"]

FILE_TRAIN_RAW = cnfg.data.files["features_train"]
FILE_TEST_RAW = cnfg.data.files["features_test"]
FILE_LABELS_RAW = cnfg.data.files["labels_train"]

FILE_NAN_CLEAN = cnfg.data.files["nan_mask"]
FILE_TRAIN_CLEAN = cnfg.data.files["features_clean"]
FILE_LABELS_CLEAN = cnfg.data.files["labels_clean"]

TARGET = cnfg.preprocess.feature_groups["target"]
ENV_FEAT_PREFIX = cnfg.preprocess.feature_groups["env_prefixes"]
CITYGROUP_FEAT = cnfg.preprocess.feature_groups["city"]
WEEK_FEAT = cnfg.preprocess.feature_groups["week"]
DATETIME_FEAT = cnfg.preprocess.feature_groups["datetime"]

## Get Data

- Raw data

In [3]:
df_train_raw = load_file(path=PATH_TO_RAW_DATA / FILE_TRAIN_RAW, datetime_col=DATETIME_FEAT)
df_test_raw = load_file(path=PATH_TO_RAW_DATA / FILE_TEST_RAW, datetime_col=DATETIME_FEAT)
df_labels_raw = load_file(path=PATH_TO_RAW_DATA / FILE_LABELS_RAW)
list_raw_df = [df_train_raw, df_test_raw, df_labels_raw]
env_features = [f for f in df_train_raw if f.startswith(tuple(ENV_FEAT_PREFIX))]

- Cleaned data

In [4]:
# %%time
# # use pipeline output
# cleaned_data = pipe_clean(overwrite_files=True)
# df_train_clean = cleaned_data["X_clean_data"]
# df_labels_clean = cleaned_data["y_clean_data"]
# df_nan_mask = cleaned_data["nan_mask_data"]

In [5]:
%%time
# or just load from disk:
df_train_clean = load_file(path=PATH_TO_INTERMEDIATE_DATA / FILE_TRAIN_CLEAN, datetime_col=DATETIME_FEAT)
df_labels_clean = load_file(path=PATH_TO_INTERMEDIATE_DATA / FILE_LABELS_CLEAN)
df_nan_mask = load_file(path=PATH_TO_INTERMEDIATE_DATA / FILE_NAN_CLEAN)

CPU times: user 38.2 ms, sys: 6.55 ms, total: 44.8 ms
Wall time: 77.8 ms


- TODO: Engineered data

## Redundant Feature Removal

In [None]:
# # TODO remove after cleaning
# def top_vif(data: pd.DataFrame):
#     """
#     Calculate and return Variance Inflation Factor (VIF) scores for numeric features.
    
#     :param data: pandas DataFrame containing numeric and non-numeric features.
#     :return: pandas DataFrame with features and their VIF scores,
#                 sorted descending (excludes constant).
#     """
#     data_vif = add_constant(data.select_dtypes(include="number"))
#     cols = data_vif.columns
#     if data_vif.isna().sum().sum() > 0:
#         raise ValueError(f"{data_vif.isna().sum().sum()} NaNs in the dataframe.")
#     data_vif = [variance_inflation_factor(
#         data_vif.values, i) for i in range(data_vif.shape[1])]
#     data_vif = pd.DataFrame(data=data_vif, index=cols, columns=["vif"])
#     data_vif = data_vif.sort_values(by="vif", ascending=False,
#                                     na_position="first").drop(index="const")
    
#     return data_vif    

In [None]:
# # TODO remove after cleaning

# def reduce_features(X: pd.DataFrame, 
#                     input_feat_groups: List[List[str]]=None,
#                     output_feat_names: List[str]=None,
#                    function: str=None):
#     """
#     Aggregate multiple feature groups into single reduced features using specified function.
#     Combine input features and drop originals.
    
#     :param X: Input pandas DataFrame.
#     :param input_feat_groups: List of feature group lists to aggregate. Default None uses 
#            config.yaml settings (e.g., [['ndvi_ne', 'ndvi_nw']]).
#     :param output_feat_names: Output column names for aggregated features. Default None uses 
#            config.yaml settings (e.g., ['ndvi_north']).
#     :param function: Aggregation function string ('mean', 'sum', 'median'). Default None uses 
#            config.yaml settings (e.g 'mean').
#     :return: DataFrame with reduced features. Original input columns dropped.
#     """
#     X_reduced = X.copy()
#     if input_feat_groups is None:
#         input_feat_groups = cnfg.preprocess.combine_features["input_groups"]
#     if output_feat_names is None:
#         output_feat_names = cnfg.preprocess.combine_features["output_names"]
#     if function is None:
#         function = cnfg.preprocess.combine_features["aggregation"]

        
#     if not len(input_feat_groups) == len(output_feat_names):
#         raise ValueError(f"Input feature groups {input_feat_groups} mismatch target keys {output_feat_names}")
#     missing_features = set(itertools.chain(*input_feat_groups)) - set(X.columns)
#     if missing_features:
#         raise ValueError(f"No {missing_features} features in input dataframe columns: {X.columns}")

#     for name, group in zip(output_feat_names, input_feat_groups):
#         X_reduced[name] = X_reduced[group].agg(function, axis=1)
#         X_reduced.drop(columns=group, inplace=True)

#     return X_reduced

In [3]:
cnfg.preprocess.multicolinear["removal_list"]

['reanalysis_sat_precip_amt_mm',
 'reanalysis_dew_point_temp_k',
 'reanalysis_air_temp_k',
 'reanalysis_specific_humidity_g_per_kg',
 'reanalysis_avg_temp_k',
 'reanalysis_max_air_temp_k',
 'reanalysis_min_air_temp_k']

In [None]:
corr_threshold = 0.8
print(f"City-stratified correlations exceeding {corr_threshold}:")
df_train_clean.groupby(
    by=CITYGROUP_FEAT).apply(
        lambda group: top_correlations(
            data=group, 
            corr_threshold=corr_threshold
        ), include_groups=False)
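
The city-stratified check above relies on the project helper `top_correlations`, whose implementation is not shown in this notebook. As a rough idea of what such a check might involve, here is a hypothetical stand-in, `correlated_pairs`, that lists feature pairs whose absolute Pearson correlation exceeds a threshold, applied per city to toy data. It is a sketch under those assumptions, not the project's actual function.

```python
import pandas as pd


def correlated_pairs(data: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List numeric feature pairs with |Pearson r| above `threshold`."""
    corr = data.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    rows = [
        (cols[i], cols[j], corr.iat[i, j])
        for i in range(len(cols))
        for j in range(i + 1, len(cols))  # upper triangle only, no self-pairs
        if corr.iat[i, j] > threshold
    ]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "abs_corr"])
    return pairs.sort_values("abs_corr", ascending=False, ignore_index=True)


# Toy values for illustration only
demo = pd.DataFrame({
    "city": ["sj"] * 5,
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],   # nearly proportional to "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],    # weakly related to both
})
per_city = {city: correlated_pairs(group) for city, group in demo.groupby("city")}
print(per_city["sj"])
```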

**Conclusion**:
- Overall `vif_total` (Variance Inflation Factor combined for both cities) is below 10:
  - `reanalysis_relative_humidity_percent` remains among the highest-scoring features overall, but it is kept because there is no alternative station humidity data and humidity is important in the dengue detection domain.
- `San Juan` still shows higher multicollinearity, which reflects its geography (a more stable climate):
    - Top VIF scores for `San Juan` reach 15.2 and occur in crucial station temperature measurements. `station_avg_temp_c` could potentially be removed, but that risks hiding important signals in `Iquitos`.
    - Keep `station_avg_temp_c` and observe model results:
        - A VIF score of ~15 should be handled well by `LightGBM`.
        - If `LSTM` `San Juan` -> `Iquitos` transfer learning underperforms, try adding `station_avg_temp_c` to the `config.yaml` feature removal list.
- VIF scores for `Iquitos` are all in an acceptable range, below ~6.1.
- The remaining high-correlation pairs differ between the two cities, which suggests that the closeness of these features is city-specific and may carry important signals for the models. They are therefore kept.
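
For reference, VIF scores like those discussed above can be reproduced with plain NumPy: the VIF of feature j equals the j-th diagonal entry of the inverse of the feature correlation matrix. The sketch below uses that identity on synthetic data instead of the `statsmodels` call used elsewhere in this notebook; `vif_scores` is a hypothetical helper and the data are random, so the printed numbers say nothing about the dengue features.

```python
import numpy as np
import pandas as pd


def vif_scores(data: pd.DataFrame) -> pd.Series:
    """Compute VIF per numeric feature via the inverse correlation matrix."""
    numeric = data.select_dtypes(include="number")
    if numeric.isna().any().any():
        raise ValueError("NaNs present; impute or drop them before computing VIF.")
    inv_corr = np.linalg.inv(numeric.corr().to_numpy())
    vif = pd.Series(np.diag(inv_corr), index=numeric.columns, name="vif")
    return vif.sort_values(ascending=False)


# Synthetic data: x2 is almost a copy of x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=200),
    "x3": rng.normal(size=200),
})
vif = vif_scores(demo)
print(vif)
```

Running the same function on each city's rows separately would give city-level scores comparable to the per-city figures quoted above.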

## Walk-forward Cross-Validation
- [ ] Test A: Baseline (cleaned and engineered features w/o lags and `initial_low_case_streak`)
- [ ] Test B: Baseline + lag shortlist (5 new features)
- [ ] Test C: Baseline + lag shortlist + `initial_low_case_streak` (IQ only)
- [ ] Test D: Baseline + lag shortlist - precip roll + `initial_low_case_streak` (checks which is stronger: `initial_low_case_streak` or the presumably weak climate feature roll)
- [ ] Test E: Baseline + east/west aggregated vegetation `ndvi` features
- [ ] Test F: Baseline + lag shortlist - `station_avg_temp_c` to compare if high 
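
Each test above would be scored with walk-forward splits, where the model is always trained on the past and evaluated on the weeks that follow. A minimal sketch of an expanding-window splitter is given below; `walk_forward_splits` and its parameters are hypothetical, and the fold sizes are placeholders. scikit-learn's `TimeSeriesSplit` offers similar behavior and may be the more practical choice.

```python
from typing import Iterator, Tuple

import numpy as np


def walk_forward_splits(n_samples: int, initial_train: int,
                        horizon: int) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    The first fold trains on the first `initial_train` rows; every later fold
    adds the previous test block to the training set.
    """
    start = initial_train
    while start + horizon <= n_samples:
        yield np.arange(0, start), np.arange(start, start + horizon)
        start += horizon


# 20 weekly rows, first 8 for the initial fit, 4-week test blocks
folds = list(walk_forward_splits(n_samples=20, initial_train=8, horizon=4))
for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Rows must be sorted by date before indexing, and each city's series should be split on its own so that one city's weeks never leak into the other's folds.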

In [None]:
# DON'T DO CAPPING, LOG TRANSFORM INSTEAD
# intermediate_target = cap_outliers(data=df_labels_raw, features=TARGET)
# df_labels_clean = intermediate_target["data"]
# df_labels_log = ...

### Scaling
- ### RobustScaler (features)

## Target processing
- [X] Process zero/low value target value streaks
- [ ] RobustScaler (Targets)
- [ ] Predict log1p(total_cases) to reduce zero target influence (log transform targets) (cross-check: complementary or exclusive to RobustScaler)
- [ ] (Optional) LSTM specific - if LSTM results are sub-optimal, experiment with downweighting affected period data (i.e. 0.3 or more)
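
On the cross-check item: mechanically, the log transform and `RobustScaler` can be chained, with the scaler fitted on the log-transformed training targets only. Whether both steps are actually needed remains the open question in the checklist. The sketch below uses toy case counts and shows the forward transform plus the inverse needed to report predictions as case counts; `to_case_counts` is a hypothetical helper.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy weekly case counts for illustration only
y_train = np.array([0, 0, 1, 3, 5, 12, 80, 250], dtype=float)
y_valid = np.array([2, 40], dtype=float)

# Step 1: log1p compresses outbreak peaks and is defined at zero
y_train_log = np.log1p(y_train).reshape(-1, 1)

# Step 2: fit the scaler on training data only to avoid leakage
scaler = RobustScaler().fit(y_train_log)
y_train_scaled = scaler.transform(y_train_log)
y_valid_scaled = scaler.transform(np.log1p(y_valid).reshape(-1, 1))


def to_case_counts(y_scaled: np.ndarray) -> np.ndarray:
    """Undo scaling, then undo log1p, returning values on the original scale."""
    return np.expm1(scaler.inverse_transform(y_scaled)).ravel()


recovered = to_case_counts(y_train_scaled)
print(recovered)
```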

# TODO: adjust **Conclusion**:
- Target outliers are 

- ### RobustScaler (targets)

# TODO Research before applying:
- **Final Preprocessing tactics for outliers:**
    - Target ("total_cases")
        - If tree models used (e.g. LightGBM) - no issue, trees are not sensitive to outliers:
            - [ ] use Huber loss for extra safety when handling tails
            - [ ] RobustScaler may be redundant for tree models, but if it simplifies pipeline - no harm.
            - [ ] `city` one-hot encoding (no string feature remains)
        - RNNs (e.g. LSTM) are outlier sensitive (gradient instability, hidden state patterns lose importance at peaks, scaling):
            - [ ] Log transform
            - [ ] Scale (RobustScaler with IQR is more outlier resistant)
            - [ ] apply Huber loss
            - [ ] groupby `city` or one-hot encoding if passing entire dataset (no string feature remains)
            - [ ] windowed input with:
                - [ ] 4-8 time steps (if single step/week forecast)
    - Features:
        - [ ] RobustScaler for RNN
        - [ ] RobustScaler may be redundant for tree models, but if it simplifies pipeline - no harm.
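
The windowed input mentioned for the LSTM can be sketched as follows, assuming a single-step forecast where the previous `window` weeks of features predict the current week's target. `make_windows` is a hypothetical helper and the arrays are toy data; a window of 4 is just the lower end of the 4-8 range listed above.

```python
import numpy as np


def make_windows(features: np.ndarray, target: np.ndarray, window: int = 4):
    """Build (samples, window, n_features) inputs and aligned one-step targets.

    Sample i contains rows i .. i+window-1 and is paired with target[i+window].
    """
    if len(features) != len(target):
        raise ValueError("features and target must have the same length")
    if len(features) <= window:
        raise ValueError("need more rows than the window length")
    X = np.stack([features[end - window:end]
                  for end in range(window, len(features))])
    y = target[window:]
    return X, y


# Toy series: 10 weeks, 2 features
features = np.arange(20, dtype=float).reshape(10, 2)
target = np.arange(10, dtype=float)
X, y = make_windows(features, target, window=4)
print(X.shape, y.shape)
```

Windows should be built per city after sorting by date, so that no window spans the boundary between the two cities' series.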

In [None]:
# TODO: review and remove
- RNN preprocessing (eg LSTM)
    