Feature scaling reportedly does not affect the performance of tree-based models, so I do not scale the data here. It does matter for the GCN and cVAE, so the data should be scaled for those models, but I would do that preprocessing in a separate file.
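A toy illustration of why scaling should not matter for tree splits: standardizing a feature is a monotonic transformation, so the ordering of the samples, and therefore the set of possible left/right partitions, stays the same. The sketch below uses a hypothetical helper `best_split_partition` that searches for a single variance-reducing split on one feature. It is not LightGBM's actual split finder, only a minimal stand-in on synthetic data.

```python
import numpy as np


def best_split_partition(x, y):
    """Return a boolean mask (True = left child) for the best single split on x."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    best_sse, best_k = np.inf, None
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # cannot split between identical feature values
        left, right = ys[:k], ys[k:]
        # Sum of squared errors around the mean of each child
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_k = sse, k
    # Place the threshold halfway between the two neighbouring values
    split_value = (xs[best_k - 1] + xs[best_k]) / 2
    return x <= split_value


# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(loc=300.0, scale=50.0, size=200)
y = (x > 320).astype(float) + rng.normal(scale=0.1, size=200)

# Standardize the feature
x_scaled = (x - x.mean()) / x.std()

same_partition = np.array_equal(
    best_split_partition(x, y), best_split_partition(x_scaled, y)
)
print(same_partition)
```

The split value itself changes with the scale, but the samples that end up in each child do not, which is what determines the predictions.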

In [1]:
import pandas as pd
import numpy as np
import pickle
import gc

Remove redundant features (>0.90 correlation):

**THERE IS A SMALL PROBLEM IN DROPPING CORRELATED FEATURES**

We can't just drop the correlated features based on the full train set of each fold, because that would leak information into the validation set.

Therefore, I have decided to determine the correlated features on each inner train set and apply the same "mask" to the corresponding validation set.

After hyperparameter tuning, when we use the entire train set (train + val) and the test set, we have to start from the entire train set and recompute the correlated features there.

**ChatGPT says the following:**

Do not use the per-inner-fold dropped columns directly for the outer test. That would be inconsistent and messy.

Best practice: use inner folds only for model selection (compute drops in each inner train → apply to inner val). After you pick the best hyperparameters, recompute the correlated-feature removal once on the full outer training set, and then apply that final mask to the outer test (and to the final model training on the outer-train). That gives a single, consistent feature set per outer fold.

Why: the inner-fold removals are for evaluating hyperparameters with realistic training-only transforms. But the final trained model you evaluate on the outer test should be trained with feature selection derived from the full outer training set.

In [2]:
import os

storage_path = "../data/LightGBM/processed/"

# Create the output directory if it does not exist yet
os.makedirs(storage_path, exist_ok=True)

In [3]:
def get_correlated_features_to_drop(X, threshold=0.9) -> list:
    # Compute the absolute pairwise correlation matrix
    corr_matrix = X.corr().abs()

    # Keep only the upper triangle (above the diagonal) so each pair is considered once
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Drop every feature whose correlation with an earlier column exceeds the threshold
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]

    return to_drop
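To make the behaviour of the upper-triangle trick concrete, here is a small check on made-up data. The helper is repeated from the cell above so that the block runs on its own. The toy DataFrame and its column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def get_correlated_features_to_drop(X, threshold=0.9) -> list:
    # Same logic as the notebook cell above, repeated so this block is self-contained
    corr_matrix = X.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    return [col for col in upper.columns if any(upper[col] > threshold)]


# Toy data (hypothetical): "b" duplicates the information in "a"
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],  # exactly 2 * a, so perfectly correlated with "a"
    "c": [5, 1, 4, 2, 6, 3],    # only weakly correlated with "a" and "b"
})

to_drop = get_correlated_features_to_drop(toy)
print(to_drop)
```

Only `b` is flagged. Of a correlated pair, the column that comes first in the DataFrame is kept and the later one is dropped, so the result depends on column order.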

In [4]:
def remove_correlated_features(outer_fold: dict) -> dict:
    
    for fold in outer_fold['inner_folds']:

        X_train, X_val = fold['X_train'], fold['X_val']

        # Determine which columns need to be dropped
        to_drop = get_correlated_features_to_drop(X_train)
        
        # Drop in train and validation sets
        fold['X_train'] = X_train.drop(columns=to_drop)
        fold['X_val'] = X_val.drop(columns=to_drop)

    return outer_fold
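The final step described in the notes above (recompute the mask once on the full outer training set, then apply it to the outer test set) could look roughly like the sketch below. `apply_final_mask` is a hypothetical helper. It takes the two DataFrames directly, because the keys under which the outer train and test data are stored in the pickled folds are not shown in this notebook. The toy data is made up.

```python
import numpy as np
import pandas as pd


def get_correlated_features_to_drop(X, threshold=0.9) -> list:
    # Copy of the notebook helper, repeated so this block is self-contained
    corr_matrix = X.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    return [col for col in upper.columns if any(upper[col] > threshold)]


def apply_final_mask(X_outer_train, X_outer_test, threshold=0.9):
    # Hypothetical helper: the mask is derived from the outer training set only,
    # so nothing about the test set influences which features are kept
    to_drop = get_correlated_features_to_drop(X_outer_train, threshold)
    X_train_final = X_outer_train.drop(columns=to_drop)
    X_test_final = X_outer_test.drop(columns=to_drop)
    return X_train_final, X_test_final, to_drop


# Toy stand-ins for one outer fold (assumption, for illustration only)
X_outer_train = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [5, 1, 4, 2, 6, 3],
})
X_outer_test = pd.DataFrame({
    "a": [7, 8, 9],
    "b": [3, 1, 2],  # not correlated with "a" here, but still dropped
    "c": [9, 2, 4],
})

X_train_final, X_test_final, dropped = apply_final_mask(X_outer_train, X_outer_test)
print(dropped)
print(list(X_test_final.columns))
```

The test set never contributes to the correlation matrix. It only receives the column mask computed on the training data, which mirrors what `remove_correlated_features` does for the inner train and validation sets.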

In [5]:
for idx in range(5):
    # Load the unprocessed outer fold
    with open(f"../data/LightGBM/unprocessed/outer_fold_{idx}.pkl", "rb") as f:
        outer_fold = pickle.load(f)

    # The inner folds are modified in place, so new_outer_fold is the same object as outer_fold
    new_outer_fold = remove_correlated_features(outer_fold)

    # Save outer fold to disk
    filename = f"processed_outer_fold_{idx}.pkl"
    with open(os.path.join(storage_path, filename), "wb") as f:
        pickle.dump(new_outer_fold, f)

    print(f"Saved {filename}")

    # Free up memory before moving to the next outer fold (may not be strictly necessary).
    # Both names refer to the same object, so delete both.
    del outer_fold, new_outer_fold
    gc.collect()

Saved processed_outer_fold_0.pkl
Saved processed_outer_fold_1.pkl
Saved processed_outer_fold_2.pkl
Saved processed_outer_fold_3.pkl
Saved processed_outer_fold_4.pkl


In [6]:
with open("../data/LightGBM/unprocessed/outer_fold_2.pkl", "rb") as f:
    outer_fold_2 = pickle.load(f)

    inner_fold = outer_fold_2['inner_folds'][0]

    print(inner_fold['X_train'].columns)

del outer_fold_2, inner_fold

Index(['MaxAbsEStateIndex', 'MaxEStateIndex', 'MinAbsEStateIndex',
       'MinEStateIndex', 'qed', 'SPS', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt',
       'NumValenceElectrons',
       ...
       'fr_sulfide', 'fr_sulfonamd', 'fr_sulfone', 'fr_term_acetylene',
       'fr_tetrazole', 'fr_thiazole', 'fr_thiocyan', 'fr_thiophene',
       'fr_unbrch_alkane', 'fr_urea'],
      dtype='object', length=217)


In [7]:
with open("../data/LightGBM/processed/processed_outer_fold_2.pkl", "rb") as f:
    outer_fold_2 = pickle.load(f)

    inner_fold = outer_fold_2['inner_folds'][0]

    print(inner_fold['X_train'].columns)

del outer_fold_2, inner_fold

Index(['MaxAbsEStateIndex', 'MinAbsEStateIndex', 'MinEStateIndex', 'qed',
       'SPS', 'MolWt', 'NumRadicalElectrons', 'MaxPartialCharge',
       'MinPartialCharge', 'FpDensityMorgan1',
       ...
       'fr_sulfide', 'fr_sulfonamd', 'fr_sulfone', 'fr_term_acetylene',
       'fr_tetrazole', 'fr_thiazole', 'fr_thiocyan', 'fr_thiophene',
       'fr_unbrch_alkane', 'fr_urea'],
      dtype='object', length=178)


In [8]:
# print(outer_fold_2.keys())
# outer_fold_2['inner_folds'][0].keys()