# Data Cleaning Notebook for Dengue Transfer Learning Project

## Task

Conduct EDA for a TensorFlow transfer learning pipeline that forecasts **weekly dengue cases** (`total_cases`) from 22 multivariate weather/environmental features.

### Dataset
Dengue ML datasets track environmental and temporal factors influencing Aedes mosquito breeding and virus transmission in tropical regions like San Juan and Iquitos.

- #### Temporal Features
    - **city**: Location identifier (e.g., 'sj' for San Juan, 'iq' for Iquitos)—captures city-specific mosquito/dengue patterns.
    - **year, weekofyear, week_start_date**: Time granularity for seasonality; dengue peaks during rainy seasons (weekofyear critical for lagged effects).

- #### Vegetation Indices (NDVI)
    - **ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw**: Normalized Difference Vegetation Index by city quadrant. Higher NDVI indicates lush vegetation providing mosquito shade/breeding sites; key for Aedes habitat detection via satellite.

- #### Precipitation & Water
    - **precipitation_amt_mm**: Rainfall amount—creates standing water breeding sites.
    - **reanalysis_precip_amt_kg_per_m2, reanalysis_sat_precip_amt_mm**: Reanalysis (modeled) precipitation variants confirming observed rain.
    - **station_precip_mm**: Ground station measurements—most direct rain proxy.

- #### Temperature Metrics
    - **reanalysis_air_temp_k, reanalysis_avg_temp_k, reanalysis_max_air_temp_k, reanalysis_min_air_temp_k**: Reanalysis temps in Kelvin; optimal Aedes range 26-32°C accelerates larval development/virus replication.
    - **station_avg_temp_c, station_max_temp_c, station_min_temp_c**: Station temps in Celsius—ground truth validation.
    - **station_diur_temp_rng_c**: Diurnal range; wider swings stress mosquitoes.
    - **reanalysis_tdtr_k**: Diurnal temperature range from reanalysis, in Kelvin.

- #### Humidity & Moisture
    - **reanalysis_dew_point_temp_k**: Dew point—direct humidity proxy; high values (>20°C) favor mosquito survival.
    - **reanalysis_relative_humidity_percent**: Relative humidity %—critical for egg/larval viability.
    - **reanalysis_specific_humidity_g_per_kg**: Absolute moisture content.
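
The reanalysis temperatures are stored in Kelvin while the biological thresholds above are quoted in °C, so a conversion helps when checking values against them. The following is a minimal sketch on made-up values; `kelvin_to_celsius` is a hypothetical helper, not part of the project's `src` package. Note that `reanalysis_tdtr_k` is a temperature *range* (a difference), so it is numerically the same in Kelvin and °C and must not be shifted.

```python
import pandas as pd

KELVIN_OFFSET = 273.15


def kelvin_to_celsius(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with the given absolute-temperature columns converted K -> °C.

    Hypothetical helper for illustration only. Columns are renamed from
    a `_k` suffix to a `_c` suffix.
    """
    out = df.copy()
    out[columns] = out[columns] - KELVIN_OFFSET
    return out.rename(columns={c: c.removesuffix("_k") + "_c" for c in columns})


# Made-up example values, not taken from the dataset.
toy = pd.DataFrame({
    "reanalysis_air_temp_k": [299.15, 301.15],
    "reanalysis_dew_point_temp_k": [295.15, 292.15],
    "reanalysis_tdtr_k": [2.5, 3.1],  # a range: left untouched on purpose
    "station_avg_temp_c": [26.4, 27.9],
})

toy_c = kelvin_to_celsius(
    toy, columns=["reanalysis_air_temp_k", "reanalysis_dew_point_temp_k"]
)

# Flag weeks whose dew point exceeds the 20 °C level mentioned above.
humid_weeks = toy_c["reanalysis_dew_point_temp_c"] > 20
```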


### Notebook sections for the second project notebook (Data Cleaning)
1. Get Data
2. Data Cleaning

In [1]:
import sys
import os
from pathlib import Path
from typing import List, Tuple, Any, Dict
import gc
import itertools

# Set one level up as project root
if os.path.abspath("..") not in sys.path:
    sys.path.insert(0, os.path.abspath(".."))
    
from src.config import ProjectConfig  # project config file parser
from src.utils.eda import value_streaks, top_correlations
from src.utils.visualizations import compute_correlations_matrix, \
                display_distributions, random_color, random_colormap, \
                display_timeseries

import pandas as pd
import numpy as np
import random
import time
from datetime import timedelta

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

from src.utils.eda import top_vif  # top_correlations is already imported above
from src.utils.utils import _check_feature_presence
from src.preprocessing.clean import cap_outliers, drop_nan_rows, \
                                    median_groupwise_impute
from src.preprocessing.engineer import reduce_features, remove_features

from IPython.display import display
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
# from matplotlib.axis import Axis
# from matplotlib.dates import MonthLocator, YearLocator, DateFormatter
import seaborn as sns

In [2]:
cnfg = ProjectConfig.load_configuration()
PATH_TO_RAW_DATA = cnfg.data.dirs["raw"]
FILE_TRAIN_RAW = cnfg.data.files["features_train"]
FILE_TEST_RAW = cnfg.data.files["features_test"]
FILE_LABELS_RAW = cnfg.data.files["labels_train"]
TARGET = cnfg.preprocess.feature_groups["target"]
ENV_FEAT_PREFIX = cnfg.preprocess.feature_groups["env_prefixes"]
CITYGROUP_FEAT = cnfg.preprocess.feature_groups["city"]
WEEK_FEAT = cnfg.preprocess.feature_groups["week"]

### Get Data

In [3]:
df_train_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_TRAIN_RAW, parse_dates=["week_start_date"])
df_test_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_TEST_RAW, parse_dates=["week_start_date"])
df_labels_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_LABELS_RAW)
list_raw_df = [df_train_raw, df_test_raw, df_labels_raw]
env_features = [f for f in df_train_raw if f.startswith(tuple(ENV_FEAT_PREFIX))]

In [4]:
# df_train_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_TRAIN_RAW, parse_dates=["week_start_date"])
# # df_test_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_TEST_RAW, parse_dates=["week_start_date"])
# df_labels_raw = pd.read_csv(PATH_TO_RAW_DATA / FILE_LABELS_RAW)
# # list_raw_df = [df_train_raw, df_test_raw, df_labels_raw]
# # for df in list_raw_df:
# #     display(df.sample(1))  

***To reduce data snooping, hold out the most recent entries of each city from the EDA subset***

In [5]:
holdout_pct = 0.05
cities_first_i = df_train_raw.groupby(by=CITYGROUP_FEAT)["week_start_date"].idxmin()  # Series w Start indices
cities_last_i = df_train_raw.groupby(by=CITYGROUP_FEAT)["week_start_date"].idxmax()  # Series w end indices
cities_last_i = (cities_last_i - (cities_last_i - cities_first_i) * holdout_pct).astype(int)  # index arithmetic on the two Series
period = tuple(slice(cities_first_i[city], cities_last_i[city], 1) for city in cities_last_i.index[::-1])  # Create tuple of slices from 2 Series
df_train_raw_eda = df_train_raw.iloc[np.r_[period]].reset_index(drop=True)  # apply defined slices
df_labels_raw_eda = df_labels_raw.iloc[np.r_[period]].reset_index(drop=True)
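
The index arithmetic above assumes a default integer index with each city's rows stored contiguously in date order. As a hedged alternative, the same chronological hold-out can be expressed per group without relying on row positions in the file. This is a sketch on toy data; `split_holdout_per_group` is a hypothetical name and the exact cut-off row may differ by one from the slicing used in the cell above.

```python
import pandas as pd


def split_holdout_per_group(df, group_col, date_col, holdout_pct=0.05):
    """Split off the most recent `holdout_pct` of rows within each group.

    Hypothetical helper: returns (kept_part, holdout_part).
    """
    ordered = df.sort_values([group_col, date_col])
    # Position of each row inside its group (0 = oldest) and the group size.
    pos = ordered.groupby(group_col).cumcount()
    size = ordered.groupby(group_col)[date_col].transform("size")
    is_holdout = pos >= (size * (1 - holdout_pct)).astype(int)
    return ordered[~is_holdout], ordered[is_holdout]


# Toy frame with two "cities" of different length (made-up dates).
toy = pd.DataFrame({
    "city": ["sj"] * 40 + ["iq"] * 20,
    "week_start_date": list(pd.date_range("2000-01-03", periods=40, freq="7D"))
    + list(pd.date_range("2001-01-01", periods=20, freq="7D")),
})

eda_part, holdout_part = split_holdout_per_group(toy, "city", "week_start_date")
```

Because the split is computed inside each group, the held-out rows are always the latest weeks of that city, regardless of how the rows are ordered in the raw file.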

## Data Cleaning

# TODO:
- [x] Cap BEFORE median computation (Winsorization - cap extreme values at specified threshold):
    - 1%/99% threshold for both models
    - do not cap targets
- [X] Remove rows with over 50% of NaN
- [X] Impute NaNs with the groupwise median.
- [ ] (OPTIONAL, if outliers still there) Reapply outlier handling AFTER imputation with 5%/95% threshold:
    - larger threshold preserves more of original distribution shape than tail-focussed 1%/99% threshold
    - should not cap much of the data at this stage (CHECK FOR AFFECTED DATAPOINT COUNT)
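
For the optional second pass, the affected datapoint count can be checked before committing to it. The sketch below estimates, per feature, what share of values a groupwise 5%/95% clip would change. It uses toy data and a hypothetical helper name (`share_changed_by_capping`); it is not the project's `cap_outliers`.

```python
import numpy as np
import pandas as pd


def share_changed_by_capping(df, features, group_col, lower=0.05, upper=0.95):
    """Percent of non-missing values per feature that groupwise clipping would alter."""
    grouped = df.groupby(group_col)
    shares = {}
    for feat in features:
        # Per-group quantiles, broadcast back to the original row order.
        low = grouped[feat].transform(lambda s: s.quantile(lower))
        high = grouped[feat].transform(lambda s: s.quantile(upper))
        clipped = df[feat].clip(lower=low, upper=high)
        # NaN != NaN evaluates to True, so missing cells are excluded explicitly.
        changed = (df[feat] != clipped) & df[feat].notna()
        shares[feat] = changed.sum() / df[feat].notna().sum() * 100
    return pd.Series(shares).round(2)


# Toy data: 100 made-up rainfall values per city.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "city": ["sj"] * 100 + ["iq"] * 100,
    "station_precip_mm": rng.gamma(shape=2.0, scale=20.0, size=200),
})

share = share_changed_by_capping(toy, ["station_precip_mm"], "city")
```

On continuous data a 5%/95% clip touches roughly 10% of the values by construction, so the useful signal is a feature whose share deviates strongly from that.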

### Outlier handling

In [6]:
# TODO remove after cleaning

# def cap_outliers(data: pd.DataFrame, features: List[str]=None,
#                  group_keys: List[str]=None,
#                  lower_cap:float=None, upper_cap:float=None,
#                 output_stats:bool=True) -> Dict[str, Any]:
#     """
#     Perform groupwise Winsorization (percentile clipping) on specified features to handle outliers.
#     Automatically filter environmental features from config prefixes if not specified.
#     :param data: Input pandas DataFrame.
#     :param features: List of column names to clip. Default None auto-selects env features 
#            from prefixes defined in config.yaml.
#     :param group_keys: List of columns to group by for quantile calculation. Default None uses 
#            config.yaml 'city' grouping.
#     :param lower_cap: Lower percentile for clipping (0-1). Default None uses config.yaml 
#            'outlier_perc.lower' (originally 0.01).
#     :param upper_cap: Upper percentile for clipping (0-1). Default None uses config.yaml 
#            'outlier_perc.upper' (originally 0.99).
#     :param output_stats: If True, returns % rows changed per feature. Default True.
#     :return: Dict containing:
#            - 'data': Clipped DataFrame copy (original unchanged)
#            - 'capped_row_prc': Series of % rows clipped per feature (if output_stats=True)
#     """
#     if features is None:
#         features = [f for f in data.columns if f.startswith(
#             tuple(cnfg.preprocess.feature_groups["env_prefixes"]))]
#     if group_keys is None:
#         group_keys = cnfg.preprocess.feature_groups["city"]
#     if lower_cap is None:
#         lower_cap = cnfg.preprocess.outlier_perc["lower"]
#     if upper_cap is None:
#         upper_cap = cnfg.preprocess.outlier_perc["upper"]

#     data_no_outliers = data.copy()
#     data_no_outliers[features] = data_no_outliers.groupby(by=group_keys)[features].transform(
#         lambda group: group.clip(
#             lower=group.quantile(lower_cap), upper=group.quantile(upper_cap)))

#     if output_stats:
#         capped_row_percent = round(
#             ((data[features] != data_no_outliers[features]).sum() / len(data) * 100), 2)
#         return {"data": data_no_outliers,
#                 "capped_row_prc": capped_row_percent}
#     return {"data": data_no_outliers}

In [7]:
intermediate_output = cap_outliers(data=df_train_raw)
df_train_clean = intermediate_output["data"]

In [8]:
intermediate_output["capped_row_prc"]

ndvi_ne                                  15.25
ndvi_nw                                   5.63
ndvi_se                                   3.71
ndvi_sw                                   3.71
precipitation_amt_mm                      2.40
reanalysis_air_temp_k                     2.82
reanalysis_avg_temp_k                     2.82
reanalysis_dew_point_temp_k               2.88
reanalysis_max_air_temp_k                 2.68
reanalysis_min_air_temp_k                 2.61
reanalysis_precip_amt_kg_per_m2           2.88
reanalysis_relative_humidity_percent      2.88
reanalysis_sat_precip_amt_mm              2.40
reanalysis_specific_humidity_g_per_kg     2.88
reanalysis_tdtr_k                         2.82
station_avg_temp_c                        5.01
station_diur_temp_rng_c                   5.01
station_max_temp_c                        3.09
station_min_temp_c                        2.68
station_precip_mm                         2.61
dtype: float64

In [9]:
# # Uncomment cell for visual check on new distributions for selected features after outlier clipping/Winsorization

# selected_distro_EDA_features = ["ndvi_ne", "precipitation_amt_mm", "station_diur_temp_rng_c", "station_max_temp_c",
#        "station_min_temp_c", "station_precip_mm"]
# # selected_distro_EDA_features = [feature for feature in df_train_raw_eda.select_dtypes("float") if not feature.startswith("reanalysis")]  # used for outlier check

# for numeric_feature in selected_distro_EDA_features:
#     display_distributions(data=df_train_clean[selected_distro_EDA_features],
#                           features=[numeric_feature],
#                           title_prefix=numeric_feature)

**Conclusion**:
- Extreme outliers are removed
- Data ranges and variations seem credible for a tropical climate
- No need to adjust the 1%/99% clipping percentiles or to conduct a second round of clipping/Winsorization
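- The percentage of adjusted rows is reasonable at ~2-15% per feature (does not exceed 20%)
    - Caveat: if the share is computed with an element-wise `!=`, as in the commented reference implementation above, NaN cells count as changed, which would inflate the figure for features with many missing values such as `ndvi_ne`.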

### NaN handling

- Remove rows with > 50% NaN values:
    - this also removes all rows for `weekofyear` 53 that have no observational or analytical data (likely a data collection bug)
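
The rule above can be sketched with a boolean mask on the per-row NaN fraction, which also keeps the labels aligned without relying on positional indexing. This is a toy illustration with a hypothetical helper name (`drop_sparse_rows`), not the project's `drop_nan_rows`.

```python
import numpy as np
import pandas as pd


def drop_sparse_rows(X, y=None, max_nan_fraction=0.5):
    """Drop rows of X where more than `max_nan_fraction` of the columns are NaN.

    X and y are assumed to share the same index.
    """
    keep = X.isna().mean(axis=1) <= max_nan_fraction
    X_kept = X[keep].reset_index(drop=True)
    if y is None:
        return X_kept
    return X_kept, y[keep].reset_index(drop=True)


# Made-up rows: the middle one mimics an empty "week 53" record.
X_toy = pd.DataFrame({
    "city": ["sj", "sj", "sj"],
    "weekofyear": [52, 53, 1],
    "ndvi_ne": [0.1, np.nan, 0.2],
    "station_precip_mm": [12.0, np.nan, np.nan],
    "station_avg_temp_c": [26.1, np.nan, 27.0],
    "reanalysis_tdtr_k": [2.5, np.nan, 2.9],
})
y_toy = pd.Series([4, 5, 6], name="total_cases")

X_kept, y_kept = drop_sparse_rows(X_toy, y_toy)
```

The second row has 4 of 6 columns missing and is dropped together with its label; the third row has a single NaN and survives for imputation.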

In [10]:
# TODO remove after cleaning

# def drop_nan_rows(X: pd.DataFrame, y: pd.Series | None = None,
#                   threshold_percent: float = 0.5):
#     """
#     Drop rows with NaN values exceeding threshold_percent of columns.
    
#     :param X: pandas DataFrame of features.
#     :param y: Optional target array/series. Default None.
#     :param threshold_percent: Min non-null fraction required [0,1]. Default 0.5.
#     :return: Filtered X (and y if provided), both with reset_index().
#     """
#     row_drop_threshold = int(len(X.columns) * threshold_percent)
#     result = X.dropna(thresh=row_drop_threshold)
#     if y is not None:
#         return result.reset_index(), y.iloc[result.index].reset_index()
#     return result.reset_index()

In [11]:
df_train_clean, df_labels_clean = drop_nan_rows(X=df_train_clean, y=df_labels_raw)

In [12]:
# TODO remove after cleaning

# def median_groupwise_impute(X: pd.DataFrame,
#                             group_keys: List[str] = ['city', 'weekofyear']):
#     """
#     Impute NaN values in numeric columns using median within specified group keys.
    
#     :param X: pandas DataFrame containing grouping columns and features to impute.
#     :param group_keys: List of column names for grouping. Default ['city', 'weekofyear'].
#     :return: Copy of input DataFrame with NaNs filled by group-wise medians.
#     """
#     missing_keys = set(group_keys) - set(X.columns)
#     if missing_keys:
#         raise ValueError(f"Missing group keys {missing_keys}")

#     X_no_nan = X.copy()
#     cols_with_nan = X_no_nan.select_dtypes(include="number")\
#         .columns[X_no_nan.select_dtypes(include="number").isna().sum() > 0].to_list()

#     if len(cols_with_nan) > 0:
#         X_no_nan[cols_with_nan] = X_no_nan[cols_with_nan + group_keys]\
#             .groupby(by=group_keys)[cols_with_nan]\
#             .transform(lambda group: group.fillna(group.median()))
#     return X_no_nan

In [13]:
df_train_clean, df_nan_mask = median_groupwise_impute(X=df_train_clean)

In [14]:
df_labels_clean.columns

Index(['city', 'year', 'weekofyear', 'total_cases'], dtype='object')

In [15]:
# Check if any NaNs left
df_train_clean.isna().sum().sum()

0

In [16]:
# TODO remove after cleaning

# plt.figure(figsize=(18, 6))  # df_train_raw.shape[1]
# sns.heatmap(
#     # median_groupwise_impute(df_train_clean).isna(),
#     df_train_clean.isna(),
#     cmap='plasma', cbar=False)
# plt.title("Post-clean NaN location check.\n", 
#           fontsize=13, fontweight="bold")
# plt.xticks(rotation=45, ha='right', rotation_mode='anchor')
# plt.show()

**Conclusions**
- Primary imputation method: median of the same-week (`weekofyear`) values within the same city:
    - median because the data has outliers
    - data missingness is acceptable: even in the worst cases (after removing week 53, which has no data) more than 50% of values and at least 6 data points are available to produce `city`+`weekofyear` medians (see EDA table).
    - Features in the dataset have strong seasonality (rain, temperature, humidity, NDVI). A same-week median can handle this.
    - handles the dataset's different issues moderately well: scattered NaNs and long streaks (an entire 15-week season for `NDVI`)
    - simple to implement
- Other imputation methods considered:
    - Data reconstruction (e.g. station average or range features from station_max and station_min):
        - discarded: performing the same calculations on non-NaN data shows significant discrepancies between calculated and original values
    - Horizontal imputation from potentially related reanalysis data:
        - discarded: top correlations between station measurements and reanalysis data differ across the city subsets. San Juan has more promising correlations, from ~0.5 (`station_precip_mm`) to ~0.88 (`station_avg_temp_c`), while the respective Iquitos correlations range from below 0.4 (`station_precip_mm`) to ~0.6 (`station_avg_temp_c`). Since almost all missing station measurements are in Iquitos, the correlations do not explain enough variance (R^2), so median imputation is potentially the better tool.
    - Temporal interpolation (np.interp, splines):
        - discarded: destroys temporal patterns for long NaN streaks (e.g. a straight line across an entire season or month)
    - KNN/multi-feature models:
        - discarded: complexity vs expected gains. Too much effort and bug risk for potentially minimal model improvement over simple median imputation.
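
The same-week median imputation described above can be sketched as follows. The example also returns the NaN mask, mirroring the two return values used in the notebook call; the helper name (`impute_group_median`) and the toy values are assumptions, and the real `median_groupwise_impute` in `src` may differ.

```python
import numpy as np
import pandas as pd


def impute_group_median(X, group_keys=("city", "weekofyear")):
    """Fill numeric NaNs with the median of the same group; also return the NaN mask."""
    keys = list(group_keys)
    numeric = [c for c in X.select_dtypes(include="number").columns if c not in keys]
    nan_mask = X[numeric].isna()
    # Group medians ignore NaNs and are broadcast back to the original rows.
    medians = X.groupby(keys)[numeric].transform("median")
    X_filled = X.copy()
    X_filled[numeric] = X[numeric].fillna(medians)
    return X_filled, nan_mask


# Made-up NDVI values for week 1 in three consecutive years per city.
toy = pd.DataFrame({
    "city": ["sj", "sj", "sj", "iq", "iq", "iq"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "weekofyear": [1, 1, 1, 1, 1, 1],
    "ndvi_ne": [0.2, np.nan, 0.4, 0.6, 0.8, np.nan],
})

filled, nan_mask = impute_group_median(toy)
```

Each gap is filled from the other years of the same city and week, so a San Juan gap never borrows from Iquitos. A group with no observed values at all would stay NaN, which is why the empty week 53 rows are dropped first.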
    

<!-- ### Remove selected multicollinear features
- [X] Remove initial `config.yaml` multicollinear features from dataframe
- [X] Assess city-wise VIF in EDA instead of correlation matrix for one-vs-all relationships.
- [ ] Remove features with VIF > 10
- [X] always prefer station_* over reanalysis_* -->

In [17]:
# # TODO remove after cleaning

# def reduce_features(X: pd.DataFrame, 
#                     input_feat_groups: List[List[str]]=None,
#                     output_feat_names: List[str]=None,
#                    function: str=None):
#     """
#     Aggregate multiple feature groups into single reduced features using specified function.
#     Combine input features and drop originals.
    
#     :param X: Input pandas DataFrame.
#     :param input_feat_groups: List of feature group lists to aggregate. Default None uses 
#            config.yaml settings (e.g., [['ndvi_ne', 'ndvi_nw']]).
#     :param output_feat_names: Output column names for aggregated features. Default None uses 
#            config.yaml settings (e.g., ['ndvi_north']).
#     :param function: Aggregation function string ('mean', 'sum', 'median'). Default None uses 
#            config.yaml settings (e.g 'mean').
#     :return: DataFrame with reduced features. Original input columns dropped.
#     """
#     X_reduced = X.copy()
#     if input_feat_groups is None:
#         input_feat_groups = cnfg.preprocess.combine_features["input_groups"]
#     if output_feat_names is None:
#         output_feat_names = cnfg.preprocess.combine_features["output_names"]
#     if function is None:
#         function = cnfg.preprocess.combine_features["aggregation"]

        
#     if not len(input_feat_groups) == len(output_feat_names):
#         raise ValueError(f"Input feature groups {input_feat_groups} mismatch target keys {output_feat_names}")
#     missing_features = set(itertools.chain(*input_feat_groups)) - set(X.columns)
#     if missing_features:
#         raise ValueError(f"No {missing_features} features in input dataframe columns: {X.columns}")

#     for name, group in zip(output_feat_names, input_feat_groups):
#         X_reduced[name] = X_reduced[group].agg(function, axis=1)
#         X_reduced.drop(columns=group, inplace=True)
        
#     return X_reduced

In [18]:
# cnfg.preprocess.multicolinear["removal_list"]

In [19]:
# # TODO remove after cleaning
# def top_vif(data: pd.DataFrame):
#     """
#     Calculate and return Variance Inflation Factor (VIF) scores for numeric features.
    
#     :param data: pandas DataFrame containing numeric and non-numeric features.
#     :return: pandas DataFrame with features and their VIF scores,
#                 sorted descending (excludes constant).
#     """
#     data_vif = add_constant(data.select_dtypes(include="number"))
#     cols = data_vif.columns
#     if data_vif.isna().sum().sum() > 0:
#         raise ValueError(f"{data_vif.isna().sum().sum()} NaNs in the dataframe.")
#     data_vif = [variance_inflation_factor(
#         data_vif.values, i) for i in range(data_vif.shape[1])]
#     data_vif = pd.DataFrame(data=data_vif, index=cols, columns=["vif"])
#     data_vif = data_vif.sort_values(by="vif", ascending=False,
#                                     na_position="first").drop(index="const")
    
#     return data_vif           

In [20]:
# temp_vif_pre = df_train_clean.groupby(by=CITYGROUP_FEAT).apply(lambda group: top_vif(data=group), include_groups=False)
# temp_vif_pre =  temp_vif_pre.loc["iq"].join(temp_vif_pre.loc["sj"], lsuffix="_iq", rsuffix="_sj", how="inner")
# temp_vif_pre["vif_total"] = top_vif(data=df_train_clean).values
# temp_vif_pre.sort_values(by="vif_sj", ascending=False, na_position="first")

In [21]:
# df_train_clean = remove_features(X=df_train_clean)

In [22]:
# temp_vif_post = df_train_clean.groupby(by=CITYGROUP_FEAT).apply(lambda group: top_vif(data=group), include_groups=False)
# temp_vif_post =  temp_vif_post.loc["iq"].join(temp_vif_post.loc["sj"], lsuffix="_iq", rsuffix="_sj", how="inner")
# temp_vif_post["vif_total"] = top_vif(data=df_train_clean).values
# temp_vif_post.sort_values(by="vif_sj", ascending=False, na_position="first")

In [23]:
# corr_threshold=0.8
# print(f"City-stratified correlations exceeding {corr_threshold}:")
# df_train_clean.groupby(
#     by=CITYGROUP_FEAT).apply(
#         lambda group: top_correlations(
#             data=group, 
#             corr_threshold=corr_threshold
#         ), include_groups=False)

**Conclusion**:
- The overall `vif_total` Variance Inflation Factor, combined for both cities, is below 10:
  - `reanalysis_relative_humidity_percent` remains among the highest-scoring features overall, but it is kept because there is no alternative station humidity data and humidity matters in the dengue domain.
- `San Juan` still shows higher multicollinearity, which reflects a geographical characteristic (a more stable climate):
    - Top VIF scores for `San Juan` reach 15.2 and occur in crucial station temperature measurements. `station_avg_temp_c` could potentially be removed, but that risks hiding important signals in `Iquitos`.
    - Keep `station_avg_temp_c` and observe model results:
        - A VIF score of ~15 should be handled well by `LightGBM`
        - If the `LSTM` `San Juan` -> `Iquitos` transfer learning underperforms, try adding `station_avg_temp_c` to the `config.yaml` feature removal list.
- VIF scores for `Iquitos` are all acceptable, below ~6.1.
- The remaining high-correlation pairs differ between the two cities, confirming that the closeness of these features is city-specific and may therefore carry important signals for the models. So they are kept.
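
To make the city-wise VIF comparison concrete, here is a minimal sketch on synthetic data. It uses the identity that the VIF of each feature equals the corresponding diagonal element of the inverse correlation matrix, which matches regressing each feature on the others with an intercept. `vif_table` and the toy column names are assumptions for illustration; the notebook itself relies on `top_vif` from `src`.

```python
import numpy as np
import pandas as pd


def vif_table(df):
    """VIF per numeric column, computed from the inverse correlation matrix."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=numeric.columns, name="vif").sort_values(ascending=False)


# Synthetic example: one city with two nearly identical temperature features,
# one city where all features are independent.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
sj = pd.DataFrame({
    "city": "sj",
    "temp_avg": base,
    "temp_max": base + 0.1 * rng.normal(size=n),
    "precip": rng.normal(size=n),
})
iq = pd.DataFrame({
    "city": "iq",
    "temp_avg": rng.normal(size=n),
    "temp_max": rng.normal(size=n),
    "precip": rng.normal(size=n),
})
toy = pd.concat([sj, iq], ignore_index=True)

# One VIF column per city; rows are aligned on the feature name.
per_city = pd.DataFrame({city: vif_table(group) for city, group in toy.groupby("city")})
```

In this toy setup the collinear pair shows a high VIF only in the first city, which is the kind of city-specific pattern that a pooled VIF would partly hide.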

# TODO:
- **Preprocessing tactics for outliers:**
    - Target ("total_cases")
        - If tree models are used (e.g. LightGBM), outliers are not an issue, as trees are not sensitive to them:
            - use Huber loss for extra safety when handling tails
            - RobustScaler may be redundant for tree models, but it does no harm if it simplifies the pipeline.
        - RNNs (e.g. LSTM) are outlier-sensitive (gradient instability, hidden state patterns lose importance at peaks, scaling):
            - Log transform
            - Scale (RobustScaler with IQR is more outlier-resistant)
            - apply Huber loss
    - Features:
        - RobustScaler for RNN
        - RobustScaler may be redundant for tree models, but it does no harm if it simplifies the pipeline.
- **Preprocessing tactics for low value target streaks**:
    - Drop 75 rows of initial zero/low value streaks
- Add missingness features that flag NaN rows for the top NaN columns with a NaN rate above 1%:
    - Idea - models, especially `LSTM`, may learn from the information that "this datapoint was missing"
    - grouped `ndvi_n_missing` and `ndvi_s_missing` (`ndvi_n` and `ndvi_s` will be grouped during feature engineering)
    - `station_missing` - group all station data missing flags
    - `num_features_imputed` - normalized aggregation of all NaNs vs remaining base features, signals overall uncertainty
    - (Optional) During SJ pretraining: slightly stronger dropout on missingness features than on base features to hide city-correlated missingness
    - # TODO: function logic (move to the engineering notebook and delete here):
        1. drop the surplus (prefixes taken from `config.yaml`): `[feature for feature in (set(df_nan_mask.columns) - set(df_train_clean.columns)) if not feature.startswith(("station", "ndvi_s", "ndvi_n"))]`
        2. aggregate `num_features_imputed` (what to do with the surplus features that were dropped after NaN imputation?)
        3. based on the `config.yaml` prefixes, select the feature groups and take the max, appending "_missing"
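
The missingness-feature idea from the list above could look roughly like this. It is a sketch under assumptions: the prefixes are passed in directly instead of being read from `config.yaml`, the helper name (`build_missingness_features`) is hypothetical, and `num_features_imputed` is taken as the share of masked columns per row.

```python
import pandas as pd


def build_missingness_features(nan_mask, group_prefixes):
    """Collapse a boolean NaN mask into per-group flags plus an overall imputed share."""
    flags = pd.DataFrame(index=nan_mask.index)
    for prefix in group_prefixes:
        cols = [c for c in nan_mask.columns if c.startswith(prefix)]
        if cols:
            # A group is flagged when any of its member columns was missing.
            flags[f"{prefix}_missing"] = nan_mask[cols].max(axis=1).astype(int)
    # Share of mask columns that were imputed in each row (0 = nothing imputed).
    flags["num_features_imputed"] = nan_mask.mean(axis=1)
    return flags


# Made-up NaN mask for three rows.
toy_mask = pd.DataFrame({
    "ndvi_ne": [True, False, False],
    "ndvi_nw": [False, False, False],
    "ndvi_se": [False, False, True],
    "station_precip_mm": [False, True, False],
    "station_avg_temp_c": [False, True, False],
    "reanalysis_tdtr_k": [False, False, False],
})

missing_features = build_missingness_features(
    toy_mask, group_prefixes=("station", "ndvi_s", "ndvi_n")
)
```

Taking the maximum within a prefix group keeps the number of added columns small while still telling the model that something in that sensor family was imputed.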

# TODO: reorganize notebooks/modules:
2. DATA CLEANING / PREPROCESSING  ← **log1p belongs here**
   - Handle missing values
   - Target transformations (log1p, sqrt, Box-Cox) 
   - Remove duplicates/outliers
   - Date parsing
3. Feature Engineering              ← iq_initial_low_case_streak belongs here
   - Create new features from existing ones
   - Interactions, lags, rolling stats
   - Domain-specific features
4. Feature Selection
5. Target processing
   - log1p(total_cases)
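
As a rough illustration of the target processing step, the sketch below applies `log1p`, scales by median and IQR (the default behaviour of scikit-learn's `RobustScaler`, reimplemented here in NumPy to stay dependency-free), and inverts both steps. The Huber loss is included only to show why it limits the influence of peaks. All values are made up and the function names are hypothetical.

```python
import numpy as np


def robust_scale(values):
    """Centre on the median and scale by the interquartile range."""
    q25, q50, q75 = np.percentile(values, [25, 50, 75])
    iqr = q75 - q25
    return (values - q50) / iqr, (q50, iqr)


def huber(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond `delta`."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * residual**2, delta * (abs_r - 0.5 * delta))


# Made-up weekly case counts with one epidemic peak.
cases = np.array([0, 1, 3, 5, 8, 12, 40, 300], dtype=float)

log_cases = np.log1p(cases)                # compresses the peak, keeps zeros at 0
scaled, (center, iqr) = robust_scale(log_cases)

# Predictions made on the scaled log target are mapped back with the inverse steps.
restored = np.expm1(scaled * iqr + center)

loss_small = huber(np.array([0.5]))[0]     # quadratic branch
loss_large = huber(np.array([3.0]))[0]     # linear branch
```

Keeping the centre and IQR from the training data is what makes the inverse transform possible at prediction time.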

<!-- ### Process zero/low value target value streaks
- [ ] Feature engineer flag for early period's low values (ie `iq_initial_low_case_streak`)
- [ ] Predict log1p(total_cases) to reduce zero target influence
- [ ] (Optional) LSTM specific - if LSTM results are sub-optimal, experiment with downweighting affected period data (ie 0.3 or more) -->

In [24]:
# zero_one_targets = value_streaks(data=df_labels_raw_eda, column=TARGET, value=range(2),
#                              run_threshold=5)
# print("Zero and one consecutive value streaks for target data ('total dengue cases').")
# zero_one_targets
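
The streak search in the commented cell above can be approximated with a plain scan over the target column. This is a hypothetical stand-in (`find_value_streaks`) shown on made-up counts; the actual `value_streaks` helper in `src.utils.eda` may return a different structure.

```python
import pandas as pd


def find_value_streaks(series, values, run_threshold=5):
    """Find runs of at least `run_threshold` consecutive entries that are in `values`.

    Returns a DataFrame with positional start, end (inclusive) and length.
    """
    hit = series.isin(list(values)).to_numpy()
    streaks = []
    start = None
    for pos, flag in enumerate(hit):
        if flag and start is None:
            start = pos                      # a new run begins
        elif not flag and start is not None:
            if pos - start >= run_threshold:
                streaks.append((start, pos - 1, pos - start))
            start = None                     # the run ended
    # A run that lasts until the final row is closed here.
    if start is not None and len(hit) - start >= run_threshold:
        streaks.append((start, len(hit) - 1, len(hit) - start))
    return pd.DataFrame(streaks, columns=["start", "end", "length"])


# Made-up weekly case counts.
toy_cases = pd.Series([0, 1, 0, 0, 1, 0, 5, 9, 0, 0, 3, 1, 1, 0, 0, 0, 1])

streaks = find_value_streaks(toy_cases, values=range(2), run_threshold=5)
```

Here the two-week run of zeros in the middle is ignored because it is shorter than the threshold, while the runs at the start and at the end are reported.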

### Cleaning pipeline
- [ ] run cleaning steps in sequence
- [ ] save clean data and NaN mask to disk
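
One possible shape for that pipeline is sketched below: cap, drop sparse rows, record the NaN mask, impute, then write both frames to disk. It is self-contained and deliberately simplified. The function name, the output file names and the toy data are assumptions, and the real pipeline would call the `src.preprocessing.clean` functions and take paths from the project config instead.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def run_cleaning(df, features, out_dir, group_col="city"):
    """Run the cleaning steps in sequence and persist the results as CSV."""
    # 1. Cap extremes per city at 1%/99% before any medians are computed.
    grouped = df.groupby(group_col)
    capped = df.copy()
    for feat in features:
        low = grouped[feat].transform(lambda s: s.quantile(0.01))
        high = grouped[feat].transform(lambda s: s.quantile(0.99))
        capped[feat] = df[feat].clip(lower=low, upper=high)

    # 2. Drop rows where more than half of the columns are missing.
    kept = capped[capped.isna().mean(axis=1) <= 0.5].reset_index(drop=True)

    # 3. Record where values are missing, then impute with city + week medians.
    nan_mask = kept[features].isna()
    medians = kept.groupby([group_col, "weekofyear"])[features].transform("median")
    clean = kept.copy()
    clean[features] = kept[features].fillna(medians)

    # 4. Save both frames (file names are placeholders).
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    clean.to_csv(out_dir / "train_clean.csv", index=False)
    nan_mask.to_csv(out_dir / "train_nan_mask.csv", index=False)
    return clean, nan_mask


# Made-up data: two weeks observed over three years in one city.
toy = pd.DataFrame({
    "city": ["sj"] * 6,
    "weekofyear": [1, 2, 1, 2, 1, 2],
    "station_precip_mm": [10.0, 20.0, np.nan, 22.0, 14.0, 24.0],
})

with tempfile.TemporaryDirectory() as tmp:
    clean, nan_mask = run_cleaning(toy, ["station_precip_mm"], tmp)
    saved = sorted(p.name for p in Path(tmp).iterdir())
```

Saving the mask next to the cleaned data keeps the later missingness features reproducible without re-running the imputation.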