<a href="https://colab.research.google.com/github/BickNutler/Data-Science-Capstone-Two/blob/main/Nicholas_Butler_Captstone_Two_Pre_processing_and_Training_Data_Development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**NOTE: This project was completed using Google Colab and its built-in Gemini AI.**

## Notebook Outline: Data Preprocessing and K-Fold Split

This notebook preprocesses the encoded dataset and sets up stratified 10-fold cross-validation for several feature engineering configurations. The key steps are:

1.  **Data Loading and Initial Inspection**: Loaded the encoded dataset and performed initial checks on its shape and head.
2.  **Column Dropping and Simplification**: Dropped redundant or less relevant columns: `fnlwgt`, `has_capital_gain`, and every one-hot `education_` column (only `education-num` is kept).
3.  **Feature Engineering (Categorical Simplification)**:
    *   **Native Country**: Collapsed the `native-country_` and `native_country_grouped_` columns into a single `is_US` binary feature.
    *   **Race**: Consolidated `race_` columns into an `is_white` binary feature.
    *   **Marital Status**: Grouped detailed `marital-status_` columns into simplified binary features (`is_Married`, `is_Never_married`, `is_Previously_married`).
    *   **Sex**: Simplified `sex_` columns to a single binary `sex_Male` feature.
4.  **Boolean to Integer Conversion**: Converted all remaining boolean columns (`True`/`False`) to integers (`1`/`0`) to ensure a fully numeric feature set.
5.  **Target and Feature Separation**: Divided the DataFrame into features (`X`) and the target variable (`y`, `income_>50K`).
6.  **Binned Categorical Feature Creation**: Created binned versions of `occupation` and `relationship` categories based on predefined mappings, dropping the original detailed columns.
7.  **Dynamic Parameter Grid Generation (`param_grid_combinations`)**:
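    *   Defined sets of raw and binned columns for `workclass`, `occupation`, and `relationship`. Because step 6 drops the raw `occupation` and `relationship` columns, those two raw sets are empty at this point.
    *   Generated all 2 × 2 × 2 = 8 combinations of using 'raw' versus 'binned' features for these three categories.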
    *   Identified fixed numeric and capital gain/loss columns for consistent scaling.
8.  **Stratified K-Fold Setup**: Initialized `StratifiedKFold` with 10 splits to ensure balanced class distribution across folds.
9.  **Preprocessing and Data Splitting for Each Configuration**:
    *   Iterated through each combination in `param_grid_combinations`.
    *   For each combination, a `ColumnTransformer` was dynamically constructed:
        *   Applied `np.log1p` then `StandardScaler` to capital gain/loss columns.
        *   Applied `StandardScaler` to other numeric features.
        *   Passed through selected workclass, occupation, and relationship features (either raw or binned).
        *   Used `remainder='passthrough'`, so every column of `X` not listed above is also kept. As a result, all eight configurations output the same 41 columns, only in a different order.
    *   The `ColumnTransformer` was fitted and applied to the full `X` with `fit_transform` to create `X_processed` for that configuration.
    *   `X_processed` and the original `y` were then split into 10 stratified train/test folds using the `skf` object. The splits (`X_train`, `X_test`, `y_train`, `y_test`) were stored in each configuration's entry under `kfold_splits`.

The result is eight preprocessed copies of the data, each with 10 stratified train/test splits, ready for model training and evaluation.
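The transformer described in step 9 can be sketched on a small toy frame. This is a minimal illustration, not the notebook's actual cell: the toy values and the names `toy`, `ct`, and `toy_processed` are made up for the example, while the column names and the `log1p` then `StandardScaler` pipeline follow the outline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy frame (hypothetical values): one ordinary numeric column,
# one heavily skewed column, and one 0/1 flag.
toy = pd.DataFrame({
    "age": [25, 40, 55, 30],
    "capital-gain": [0, 0, 5000, 100000],
    "is_US": [1, 0, 1, 1],
})

# Skewed columns: compress the range with log1p, then standardize.
cap_pipeline = Pipeline([
    ("log1p", FunctionTransformer(np.log1p, validate=False)),
    ("scaler", StandardScaler()),
])

ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        ("cap", cap_pipeline, ["capital-gain"]),
    ],
    remainder="passthrough",  # columns not listed (here: is_US) are kept unchanged
)

# Output column order: "num" columns, then "cap" columns, then the remainder.
toy_processed = ct.fit_transform(toy)
print(toy_processed.shape)
print(toy_processed.round(3))
```

The output is a plain NumPy array, so column names are lost; the order is the transformer order followed by the passed-through columns.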

In [1]:

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


In [2]:

# Load the cleaned, encoded dataframe
df = pd.read_csv("/content/df_encoded.csv")
print(df.shape)
df.head()


(48842, 122)


Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,has_capital_gain,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,workclass_binned_Never-worked,workclass_binned_Private,workclass_binned_Self-employed,workclass_binned_Unknown,workclass_binned_Without-pay,work_schedule_Full-time,work_schedule_Overtime,work_schedule_Part-time,native_country_grouped_Non-United-States,native_country_grouped_United-States
0,39,77516,13,2174,0,40,1,False,False,False,...,False,False,False,False,False,True,False,False,False,True
1,50,83311,13,0,0,13,0,False,False,False,...,False,False,True,False,False,False,False,True,False,True
2,38,215646,9,0,0,40,0,False,False,False,...,False,True,False,False,False,True,False,False,False,True
3,53,234721,7,0,0,40,0,False,False,False,...,False,True,False,False,False,True,False,False,False,True
4,28,338409,13,0,0,40,0,False,False,False,...,False,True,False,False,False,True,False,False,True,False


In [3]:

# 1. Drop all education columns besides 'education-num', drop 'fnlwgt', drop 'has_capital_gain'

cols_to_drop = []

# Drop any column whose name starts with 'education' EXCEPT 'education-num'
for col in df.columns:
    if col.startswith("education") and col != "education-num":
        cols_to_drop.append(col)

# Explicit columns to drop if present
for col in ["fnlwgt", "has_capital_gain"]:
    if col in df.columns:
        cols_to_drop.append(col)

print("Dropping columns:", cols_to_drop)
df = df.drop(columns=cols_to_drop)
print(df.shape)
df.head()


Dropping columns: ['education_10th', 'education_11th', 'education_12th', 'education_1st-4th', 'education_5th-6th', 'education_7th-8th', 'education_9th', 'education_Assoc-acdm', 'education_Assoc-voc', 'education_Bachelors', 'education_Doctorate', 'education_HS-grad', 'education_Masters', 'education_Preschool', 'education_Prof-school', 'education_Some-college', 'fnlwgt', 'has_capital_gain']
(48842, 104)


Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,...,workclass_binned_Never-worked,workclass_binned_Private,workclass_binned_Self-employed,workclass_binned_Unknown,workclass_binned_Without-pay,work_schedule_Full-time,work_schedule_Overtime,work_schedule_Part-time,native_country_grouped_Non-United-States,native_country_grouped_United-States
0,39,13,2174,0,40,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
1,50,13,0,0,13,False,False,False,False,False,...,False,False,True,False,False,False,False,True,False,True
2,38,9,0,0,40,False,False,False,True,False,...,False,True,False,False,False,True,False,False,False,True
3,53,7,0,0,40,False,False,False,True,False,...,False,True,False,False,False,True,False,False,False,True
4,28,13,0,0,40,False,False,False,True,False,...,False,True,False,False,False,True,False,False,True,False


In [4]:
native_country_cols = [c for c in df.columns if c.startswith("native-country_")]
grouped_country_cols = [c for c in df.columns if c.startswith("native_country_grouped_")]

# If we already have grouped native country columns, prefer those
if "native_country_grouped_United-States" in df.columns:
    # Create a single is_US column from the grouped indicator
    df["is_US"] = df["native_country_grouped_United-States"].fillna(0).astype(int)
    # Drop all native-country dummies and grouped columns
    df = df.drop(columns=native_country_cols + grouped_country_cols)
else:
    # Otherwise, create is_US from the one-hot column 'native-country_United-States'
    if "native-country_United-States" in df.columns:
        df["is_US"] = df["native-country_United-States"].fillna(0).astype(int)
    else:
        # Fallback: if no explicit US column exists, mark every row as non-US (0)
        df["is_US"] = 0

    # Drop all native-country one-hot columns, we only need 'is_US'
    df = df.drop(columns=native_country_cols, errors="ignore")

# Handle race columns: create 'is_white' and drop all other race columns
race_cols = [c for c in df.columns if c.startswith("race_")]

if "race_White" in df.columns:
    df["is_white"] = df["race_White"].fillna(0).astype(int)
else:
    # Fallback: if 'race_White' is not present, assume 0
    df["is_white"] = 0

# Drop all original race columns
df = df.drop(columns=race_cols, errors="ignore")

# Handle marital status columns: create simplified categories and drop original ones
marital_cols = [c for c in df.columns if c.startswith("marital-status_")]

married_cols = [
    "marital-status_Married-civ-spouse",
    "marital-status_Married-AF-spouse",
]
never_married_cols = [
    "marital-status_Never-married",
]
previously_married_cols = [
    "marital-status_Divorced",
    "marital-status_Separated",
    "marital-status_Widowed",
    "marital-status_Married-spouse-absent",
]

# Create new simplified columns
df["is_Married"] = df[married_cols].any(axis=1).astype(int)
df["is_Never_married"] = df[never_married_cols].any(axis=1).astype(int)
df["is_Previously_married"] = df[previously_married_cols].any(axis=1).astype(int)

# Drop all original marital status columns
df = df.drop(columns=marital_cols, errors="ignore")

# Handle sex columns: drop one to leave a single binary column
if 'sex_Female' in df.columns and 'sex_Male' in df.columns:
    df = df.drop(columns=['sex_Female'])

# Convert all remaining boolean columns to int (0 or 1)
for col in df.select_dtypes(include=['bool']).columns:
    df[col] = df[col].astype(int)

print("Columns after native-country, race, marital status, and sex simplification:",
      [c for c in df.columns if "native" in c or "is_US" in c or "race" in c or "is_white" in c or "marital" in c or "is_Married" in c or "sex" in c])
print(df.shape)
df.head()

Columns after native-country, race, marital status, and sex simplification: ['sex_Male', 'is_US', 'is_white', 'is_Married']
(48842, 52)


Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,...,workclass_binned_Unknown,workclass_binned_Without-pay,work_schedule_Full-time,work_schedule_Overtime,work_schedule_Part-time,is_US,is_white,is_Married,is_Never_married,is_Previously_married
0,39,13,2174,0,40,0,0,0,0,0,...,0,0,1,0,0,1,1,0,1,0
1,50,13,0,0,13,0,0,0,0,0,...,0,0,0,0,1,1,1,1,0,0
2,38,9,0,0,40,0,0,0,1,0,...,0,0,1,0,0,1,1,0,0,1
3,53,7,0,0,40,0,0,0,1,0,...,0,0,1,0,0,1,0,1,0,0
4,28,13,0,0,40,0,0,0,1,0,...,0,0,1,0,0,0,0,1,0,0


In [5]:

# 2. Build X and y
# We assume the target columns are 'income_<=50K' and 'income_>50K'

target_col = "income_>50K"

# Drop the 'income_<=50K' column; it is redundant with the >50K one
if "income_<=50K" in df.columns:
    df = df.drop(columns=["income_<=50K"])

# Separate features and target
y = df[target_col].astype(int)
X = df.drop(columns=[target_col])

print("X shape:", X.shape)
print("y value counts:\n", y.value_counts())
X.head()


X shape: (48842, 50)
y value counts:
 income_>50K
0    37155
1    11687
Name: count, dtype: int64


Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,...,workclass_binned_Unknown,workclass_binned_Without-pay,work_schedule_Full-time,work_schedule_Overtime,work_schedule_Part-time,is_US,is_white,is_Married,is_Never_married,is_Previously_married
0,39,13,2174,0,40,0,0,0,0,0,...,0,0,1,0,0,1,1,0,1,0
1,50,13,0,0,13,0,0,0,0,0,...,0,0,0,0,1,1,1,1,0,0
2,38,9,0,0,40,0,0,0,1,0,...,0,0,1,0,0,1,1,0,0,1
3,53,7,0,0,40,0,0,0,1,0,...,0,0,1,0,0,1,0,1,0,0
4,28,13,0,0,40,0,0,0,1,0,...,0,0,1,0,0,0,0,1,0,0


In [6]:
# 3. Scaling:
# - StandardScaler on all numeric columns except capital-gain and capital-loss
# - For capital-gain and capital-loss: np.log1p, then StandardScaler

# Identify numeric columns
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()

# Capital gain/loss columns (if present)
cap_cols = [c for c in numeric_cols if c in ["capital-gain", "capital-loss"]]

# Other numeric columns (age, education-num, hours-per-week, etc.)
other_numeric_cols = [c for c in numeric_cols if c not in cap_cols]

print("Numeric columns:", numeric_cols)
print("Capital gain/loss columns:", cap_cols)
print("Other numeric columns:", other_numeric_cols)

# Capital gain/loss pipeline: log1p then StandardScaler
cap_pipeline = Pipeline([
    ("log1p", FunctionTransformer(np.log1p, validate=False)),
    ("scaler", StandardScaler())
])

# Other numeric pipeline: StandardScaler
other_num_pipeline = Pipeline([
    ("scaler", StandardScaler())
])

# Columns that are not numeric are passed through unchanged.
# (After the bool-to-int conversion in cell 4 every column of X is numeric, so this
# list is expected to be empty and the one-hot columns are scaled with the other numeric columns.)

# Define a named identity function to make it pickleable
def identity_func(x):
    return x

cat_identity = FunctionTransformer(identity_func, validate=False)

categorical_cols = [c for c in X.columns if c not in numeric_cols]
print("Categorical columns (one-hot / booleans):", categorical_cols[:20], "...")

preprocessor = ColumnTransformer(
    transformers=[
        ("other_num", other_num_pipeline, other_numeric_cols),
        ("cap", cap_pipeline, cap_cols),
        ("cat", cat_identity, categorical_cols),
    ]
)

# Example: fit-transform the preprocessor on X
X_processed = preprocessor.fit_transform(X)
print("Processed X shape:", X_processed.shape)


Numeric columns: ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Never-worked', 'workclass_Private', 'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc', 'workclass_State-gov', 'workclass_Unknown', 'workclass_Without-pay', 'occupation_Adm-clerical', 'occupation_Armed-Forces', 'occupation_Craft-repair', 'occupation_Exec-managerial', 'occupation_Farming-fishing', 'occupation_Handlers-cleaners', 'occupation_Machine-op-inspct', 'occupation_Other-service', 'occupation_Priv-house-serv', 'occupation_Prof-specialty', 'occupation_Protective-serv', 'occupation_Sales', 'occupation_Tech-support', 'occupation_Transport-moving', 'occupation_Unknown', 'relationship_Husband', 'relationship_Not-in-family', 'relationship_Other-relative', 'relationship_Own-child', 'relationship_Unmarried', 'relationship_Wife', 'sex_Male', 'workclass_binned_Government', 'workclass_binned_Never-worked', 'workclass_binned_Private', 'wor

In [7]:
display(df.columns)

Index(['age', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'workclass_Federal-gov', 'workclass_Local-gov',
       'workclass_Never-worked', 'workclass_Private', 'workclass_Self-emp-inc',
       'workclass_Self-emp-not-inc', 'workclass_State-gov',
       'workclass_Unknown', 'workclass_Without-pay', 'occupation_Adm-clerical',
       'occupation_Armed-Forces', 'occupation_Craft-repair',
       'occupation_Exec-managerial', 'occupation_Farming-fishing',
       'occupation_Handlers-cleaners', 'occupation_Machine-op-inspct',
       'occupation_Other-service', 'occupation_Priv-house-serv',
       'occupation_Prof-specialty', 'occupation_Protective-serv',
       'occupation_Sales', 'occupation_Tech-support',
       'occupation_Transport-moving', 'occupation_Unknown',
       'relationship_Husband', 'relationship_Not-in-family',
       'relationship_Other-relative', 'relationship_Own-child',
       'relationship_Unmarried', 'relationship_Wife', 'sex_Male',
       'i

### Create Binned Occupation Columns

In [8]:
# Define the mapping for occupation binning
occupation_bin_map = {
    'occupation_Exec-managerial': 'High-skill',
    'occupation_Prof-specialty': 'High-skill',
    'occupation_Adm-clerical': 'Mid-skill',
    'occupation_Sales': 'Mid-skill',
    'occupation_Tech-support': 'Mid-skill',
    'occupation_Craft-repair': 'Blue-collar',
    'occupation_Handlers-cleaners': 'Blue-collar',
    'occupation_Machine-op-inspct': 'Blue-collar',
    'occupation_Transport-moving': 'Blue-collar',
    'occupation_Other-service': 'Service',
    'occupation_Priv-house-serv': 'Service',
    'occupation_Farming-fishing': 'Farming-fishing',
    'occupation_Protective-serv': 'Protective',
    'occupation_Armed-Forces': 'Protective',
    'occupation_Unknown': 'Unknown' # Assuming '?' maps to 'occupation_Unknown'
}

# Identify original occupation columns
original_occupation_cols = [col for col in X.columns if col.startswith('occupation_')]

# Initialize new binned occupation columns in X
# (iterating over a set gives an arbitrary column order; sorted(...) would make it stable)
for binned_category in set(occupation_bin_map.values()):
    X[f'occupation_binned_{binned_category}'] = 0

# Populate binned columns
for original_col, binned_category in occupation_bin_map.items():
    if original_col in X.columns:
        X[f'occupation_binned_{binned_category}'] = X[f'occupation_binned_{binned_category}'] | X[original_col]

# Drop original occupation columns
# (this leaves no raw occupation columns for the 'raw' option in cell 10)
X = X.drop(columns=original_occupation_cols, errors='ignore')

print("X shape after occupation binning:", X.shape)
print("New binned occupation columns:", [col for col in X.columns if col.startswith('occupation_binned_')])

X shape after occupation binning: (48842, 42)
New binned occupation columns: ['occupation_binned_Farming-fishing', 'occupation_binned_Unknown', 'occupation_binned_High-skill', 'occupation_binned_Mid-skill', 'occupation_binned_Blue-collar', 'occupation_binned_Protective', 'occupation_binned_Service']
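If the intent is to compare raw against binned occupation features later, the raw columns need to survive the binning step. The following is a sketch of one way to do that, not the notebook's method: the helper `add_binned_columns` and the toy data are hypothetical, and only the `occupation_binned_` prefix and the bin names are taken from the cell above.

```python
import pandas as pd

def add_binned_columns(frame, bin_map, prefix):
    """Return a copy of `frame` with one 0/1 column per bin; raw columns are kept."""
    out = frame.copy()
    # sorted(...) gives the same column order on every run, unlike a bare set.
    for bin_name in sorted(set(bin_map.values())):
        members = [c for c, b in bin_map.items() if b == bin_name and c in out.columns]
        if members:
            # A row belongs to the bin if any of its member one-hot columns is 1.
            out[f"{prefix}{bin_name}"] = out[members].max(axis=1).astype(int)
        else:
            out[f"{prefix}{bin_name}"] = 0
    return out

# Toy one-hot frame (hypothetical rows) and a reduced version of the mapping.
toy = pd.DataFrame({
    "occupation_Sales": [1, 0, 0],
    "occupation_Tech-support": [0, 1, 0],
    "occupation_Craft-repair": [0, 0, 1],
})
toy_map = {
    "occupation_Sales": "Mid-skill",
    "occupation_Tech-support": "Mid-skill",
    "occupation_Craft-repair": "Blue-collar",
}

toy_binned = add_binned_columns(toy, toy_map, "occupation_binned_")
print(toy_binned.columns.tolist())
```

With both sets present, a prefix filter such as `startswith('occupation_') and not startswith('occupation_binned_')` would return a non-empty raw list.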


### Create Binned Relationship Columns

In [9]:
# Define the mapping for relationship binning
relationship_bin_map = {
    'relationship_Husband': 'Spouse-present',
    'relationship_Wife': 'Spouse-present',
    'relationship_Not-in-family': 'Not-in-family',
    'relationship_Other-relative': 'Other-relative',
    'relationship_Own-child': 'Own-child',
    'relationship_Unmarried': 'Unmarried'
}

# Identify original relationship columns
original_relationship_cols = [col for col in X.columns if col.startswith('relationship_')]

# Initialize new binned relationship columns in X
# (iterating over a set gives an arbitrary column order; sorted(...) would make it stable)
for binned_category in set(relationship_bin_map.values()):
    X[f'relationship_binned_{binned_category}'] = 0

# Populate binned columns
for original_col, binned_category in relationship_bin_map.items():
    if original_col in X.columns:
        X[f'relationship_binned_{binned_category}'] = X[f'relationship_binned_{binned_category}'] | X[original_col]

# Drop original relationship columns
# (this leaves no raw relationship columns for the 'raw' option in cell 10)
X = X.drop(columns=original_relationship_cols, errors='ignore')

print("X shape after relationship binning:", X.shape)
print("New binned relationship columns:", [col for col in X.columns if col.startswith('relationship_binned_')])

X shape after relationship binning: (48842, 41)
New binned relationship columns: ['relationship_binned_Other-relative', 'relationship_binned_Spouse-present', 'relationship_binned_Not-in-family', 'relationship_binned_Unmarried', 'relationship_binned_Own-child']


In [10]:
from itertools import product
import json

# Define column groups from X for flexible ColumnTransformer configuration

# Raw workclass columns (all original one-hot encoded workclass categories)
raw_workclass_cols = [c for c in X.columns if c.startswith('workclass_') and not c.startswith('workclass_binned_')]

# Binned workclass columns (simplified one-hot encoded workclass categories)
binned_workclass_cols = [c for c in X.columns if c.startswith('workclass_binned_')]

# Raw occupation columns (all original one-hot encoded occupation categories)
raw_occupation_cols = [c for c in X.columns if c.startswith('occupation_') and not c.startswith('occupation_binned_')]

# Binned occupation columns (simplified one-hot encoded occupation categories)
binned_occupation_cols = [c for c in X.columns if c.startswith('occupation_binned_')]

# Raw relationship columns (all original one-hot encoded relationship categories)
raw_relationship_cols = [c for c in X.columns if c.startswith('relationship_') and not c.startswith('relationship_binned_')]

# Binned relationship columns (simplified one-hot encoded relationship categories)
binned_relationship_cols = [c for c in X.columns if c.startswith('relationship_binned_')]

# Fixed numeric and binary columns that are always included in the preprocessing.
# These columns will be scaled (e.g., age, education-num, hours-per-week, and derived binary features).
fixed_num_cols_for_scaling = [
    'age', 'education-num', 'hours-per-week',
    'is_US', 'is_white', 'is_Married', 'is_Never_married', 'is_Previously_married', 'sex_Male'
]
# Ensure only actual existing and numeric columns are included
fixed_num_cols_for_scaling = [c for c in fixed_num_cols_for_scaling if c in X.columns and np.issubdtype(X[c].dtype, np.number)]

# Capital gain/loss columns for specific log1p+scaling
capital_cols_for_scaling = ['capital-gain', 'capital-loss']
capital_cols_for_scaling = [c for c in capital_cols_for_scaling if c in X.columns and np.issubdtype(X[c].dtype, np.number)]


# Define the options for each categorical feature type
workclass_options = {'raw': raw_workclass_cols, 'binned': binned_workclass_cols}
occupation_options = {'raw': raw_occupation_cols, 'binned': binned_occupation_cols}
relationship_options = {'raw': raw_relationship_cols, 'binned': binned_relationship_cols}

# Generate all combinations of these options for the param grid
param_grid_combinations = []

for wc_strategy, occ_strategy, rel_strategy in product(workclass_options.keys(), occupation_options.keys(), relationship_options.keys()):
    config_name = f"workclass_{wc_strategy}_occupation_{occ_strategy}_relationship_{rel_strategy}"
    config = {
        'name': config_name,
        'workclass_cols_to_use': workclass_options[wc_strategy],
        'occupation_cols_to_use': occupation_options[occ_strategy],
        'relationship_cols_to_use': relationship_options[rel_strategy],
        'fixed_numeric_cols': fixed_num_cols_for_scaling,
        'capital_cols': capital_cols_for_scaling
    }
    param_grid_combinations.append(config)

display(param_grid_combinations)

# Save param_grid_combinations to a JSON file
json_output_filename = 'param_grid_combinations.json'
with open(json_output_filename, 'w') as f:
    json.dump(param_grid_combinations, f, indent=4)
print(f"Parameter grid combinations saved to {json_output_filename}")

[{'name': 'workclass_raw_occupation_raw_relationship_raw',
  'workclass_cols_to_use': ['workclass_Federal-gov',
   'workclass_Local-gov',
   'workclass_Never-worked',
   'workclass_Private',
   'workclass_Self-emp-inc',
   'workclass_Self-emp-not-inc',
   'workclass_State-gov',
   'workclass_Unknown',
   'workclass_Without-pay'],
  'occupation_cols_to_use': [],
  'relationship_cols_to_use': [],
  'fixed_numeric_cols': ['age',
   'education-num',
   'hours-per-week',
   'is_US',
   'is_white',
   'is_Married',
   'is_Never_married',
   'is_Previously_married',
   'sex_Male'],
  'capital_cols': ['capital-gain', 'capital-loss']},
 {'name': 'workclass_raw_occupation_raw_relationship_binned',
  'workclass_cols_to_use': ['workclass_Federal-gov',
   'workclass_Local-gov',
   'workclass_Never-worked',
   'workclass_Private',
   'workclass_Self-emp-inc',
   'workclass_Self-emp-not-inc',
   'workclass_State-gov',
   'workclass_Unknown',
   'workclass_Without-pay'],
  'occupation_cols_to_use': []

Parameter grid combinations saved to param_grid_combinations.json


In [11]:
from sklearn.model_selection import StratifiedKFold

# Setup Stratified K-Fold with k=10
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

print(f"Stratified K-Fold splitter initialized with {skf.get_n_splits(X, y)} splits.")

Stratified K-Fold splitter initialized with 10 splits.
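To see what stratification buys, the sketch below checks the positive-class rate in each test fold. It uses a synthetic label vector with roughly the same imbalance as `y` (about 24% positives); `y_toy`, `X_toy`, and `skf_toy` are names invented for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 760 negatives and 240 positives (24% positive).
y_toy = np.array([0] * 760 + [1] * 240)
X_toy = np.zeros((len(y_toy), 1))  # the splitter only needs the row count of X

skf_toy = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Fraction of positives in each test fold.
positive_rates = [
    y_toy[test_idx].mean() for _, test_idx in skf_toy.split(X_toy, y_toy)
]
print(np.round(positive_rates, 3))
```

Every fold's rate should sit close to the overall 24%, which is the property a plain `KFold` does not guarantee on imbalanced targets.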


In [12]:
preprocessed_data_configs = []

for config in param_grid_combinations:
    # Get the column lists for the current configuration
    current_workclass_cols = config['workclass_cols_to_use']
    current_occupation_cols = config['occupation_cols_to_use']
    current_relationship_cols = config['relationship_cols_to_use']
    current_fixed_numeric_cols = config['fixed_numeric_cols']
    current_capital_cols = config['capital_cols']

    # Combine all categorical columns for the current config
    # Only include columns that are actually present in the current X (after binning/dropping)
    current_categorical_cols = \
        [col for col in X.columns if col in current_workclass_cols or col in current_occupation_cols or col in current_relationship_cols]

    # Ensure fixed numeric columns are not part of categorical columns to avoid duplicates/errors
    # Also ensure capital columns are not part of other numeric columns
    numeric_cols_for_ct = [c for c in current_fixed_numeric_cols if c not in current_capital_cols]

    # Create the ColumnTransformer for the current configuration
    current_preprocessor = ColumnTransformer(
        transformers=[
            ("other_num", other_num_pipeline, numeric_cols_for_ct),
            ("cap", cap_pipeline, current_capital_cols),
            ("cat", cat_identity, current_categorical_cols),
        ],
        remainder='passthrough' # Pass through any other columns not explicitly transformed
    )

    # Fit and transform X with the current preprocessor
    X_processed_current = current_preprocessor.fit_transform(X)

    # Store the results and the config name
    preprocessed_data_configs.append({
        'config_name': config['name'],
        'X_processed': X_processed_current,
        'preprocessor': current_preprocessor # Store the fitted preprocessor for inspection or for transforming new data
    })

    print(f"Processed {config['name']}: X_processed shape = {X_processed_current.shape}")

print("Finished processing all parameter grid combinations.")

Processed workclass_raw_occupation_raw_relationship_raw: X_processed shape = (48842, 41)
Processed workclass_raw_occupation_raw_relationship_binned: X_processed shape = (48842, 41)
Processed workclass_raw_occupation_binned_relationship_raw: X_processed shape = (48842, 41)
Processed workclass_raw_occupation_binned_relationship_binned: X_processed shape = (48842, 41)
Processed workclass_binned_occupation_raw_relationship_raw: X_processed shape = (48842, 41)
Processed workclass_binned_occupation_raw_relationship_binned: X_processed shape = (48842, 41)
Processed workclass_binned_occupation_binned_relationship_raw: X_processed shape = (48842, 41)
Processed workclass_binned_occupation_binned_relationship_binned: X_processed shape = (48842, 41)
Finished processing all parameter grid combinations.
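All eight shapes above are identical because `remainder='passthrough'` keeps every column that a configuration did not select. The toy comparison below illustrates the effect and shows that `remainder='drop'` is one possible way to make configurations differ. The column names (`wc_raw_*`, `wc_binned_*`) and the helper `n_output_columns` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame holding both a "raw" and a "binned" encoding of the same feature.
toy = pd.DataFrame({
    "age": [20, 30, 40],
    "wc_raw_a": [1, 0, 0],
    "wc_raw_b": [0, 1, 0],
    "wc_raw_c": [0, 0, 1],
    "wc_binned_x": [1, 1, 0],
    "wc_binned_y": [0, 0, 1],
})
raw_cols = ["wc_raw_a", "wc_raw_b", "wc_raw_c"]
binned_cols = ["wc_binned_x", "wc_binned_y"]

def n_output_columns(selected_cols, remainder):
    """Fit a transformer that selects one encoding and report the output width."""
    ct = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), ["age"]),
            ("cat", "passthrough", selected_cols),
        ],
        remainder=remainder,
    )
    return ct.fit_transform(toy).shape[1]

for remainder in ("passthrough", "drop"):
    print(remainder,
          n_output_columns(raw_cols, remainder),
          n_output_columns(binned_cols, remainder))
```

With `passthrough` both selections produce 6 columns; with `drop` the raw selection produces 4 and the binned selection 3. Note that `drop` also removes any column not listed in a transformer, so every feature that should be kept has to be named explicitly.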


In [13]:
for config in preprocessed_data_configs:
    X_processed = config['X_processed']
    config['kfold_splits'] = [] # Initialize list to store splits for this config

    print(f"Splitting data for config: {config['config_name']}...")
    # Generate train/test splits for the current preprocessed X and original y
    for fold, (train_index, test_index) in enumerate(skf.split(X_processed, y)):
        X_train, X_test = X_processed[train_index], X_processed[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        config['kfold_splits'].append({
            'fold': fold,
            'X_train': X_train,
            'X_test': X_test,
            'y_train': y_train,
            'y_test': y_test
        })
    print(f"  {len(config['kfold_splits'])} folds generated for {config['config_name']}.")

print("Finished creating k-fold splits for all configurations.")

Splitting data for config: workclass_raw_occupation_raw_relationship_raw...
  10 folds generated for workclass_raw_occupation_raw_relationship_raw.
Splitting data for config: workclass_raw_occupation_raw_relationship_binned...
  10 folds generated for workclass_raw_occupation_raw_relationship_binned.
Splitting data for config: workclass_raw_occupation_binned_relationship_raw...
  10 folds generated for workclass_raw_occupation_binned_relationship_raw.
Splitting data for config: workclass_raw_occupation_binned_relationship_binned...
  10 folds generated for workclass_raw_occupation_binned_relationship_binned.
Splitting data for config: workclass_binned_occupation_raw_relationship_raw...
  10 folds generated for workclass_binned_occupation_raw_relationship_raw.
Splitting data for config: workclass_binned_occupation_raw_relationship_binned...
  10 folds generated for workclass_binned_occupation_raw_relationship_binned.
Splitting data for config: workclass_binned_occupation_binned_relation
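Cell 12 fits each `ColumnTransformer` on all of `X` before the folds are created, so the scaling statistics used for a training fold also include its test rows. If that is a concern for evaluation, one alternative is to fit the scaler inside each fold. The sketch below shows the pattern on synthetic data from `make_classification`; all names ending in `_toy` and the `fold_data` list are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for (X, y).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, weights=[0.76, 0.24], random_state=0
)
skf_toy = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

fold_data = []
for fold, (train_idx, test_idx) in enumerate(skf_toy.split(X_toy, y_toy)):
    scaler = StandardScaler()
    # Fit on the training rows only, then reuse those statistics on the test rows.
    X_train = scaler.fit_transform(X_toy[train_idx])
    X_test = scaler.transform(X_toy[test_idx])
    fold_data.append({
        "fold": fold,
        "X_train": X_train,
        "X_test": X_test,
        "y_train": y_toy[train_idx],
        "y_test": y_toy[test_idx],
    })

print(len(fold_data), "folds; first train shape:", fold_data[0]["X_train"].shape)
```

The same effect can be had by wrapping the preprocessor and the model in a single `Pipeline` and passing it to a cross-validation helper, which refits the preprocessor on each training fold.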

In [14]:
import joblib
import math

# Assuming preprocessed_data_configs is already defined and loaded/generated earlier
# If not, you would need to load it first, e.g.,
# from google.colab import drive
# drive.mount('/content/drive')
# preprocessed_data_configs = joblib.load('/content/preprocessed_data_configs.joblib')

num_parts = 4
total_configs = len(preprocessed_data_configs)
part_size = math.ceil(total_configs / num_parts)

print(f"Total configurations: {total_configs}")
print(f"Each part will have approximately: {part_size} configurations")

for i in range(num_parts):
    start_index = i * part_size
    end_index = min((i + 1) * part_size, total_configs)

    # Get the slice for the current part
    current_part = preprocessed_data_configs[start_index:end_index]

    # Define the output filename for this part
    output_filename = f'preprocessed_data_configs_part_{i+1}.joblib'

    try:
        joblib.dump(current_part, output_filename)
        print(f"Part {i+1} saved to {output_filename} (Contains {len(current_part)} configs)")
    except Exception as e:
        print(f"Error saving part {i+1} to {output_filename}: {e}")


Total configurations: 8
Each part will have approximately: 2 configurations
Part 1 saved to preprocessed_data_configs_part_1.joblib (Contains 2 configs)
Part 2 saved to preprocessed_data_configs_part_2.joblib (Contains 2 configs)
Part 3 saved to preprocessed_data_configs_part_3.joblib (Contains 2 configs)
Part 4 saved to preprocessed_data_configs_part_4.joblib (Contains 2 configs)
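A later notebook would need to read the parts back and reassemble the list. The sketch below assumes the file naming used above (`preprocessed_data_configs_part_{i}.joblib`); the helper `load_config_parts` and the toy payload are hypothetical, and a temporary directory stands in for the real output location.

```python
import os
import tempfile

import joblib

def load_config_parts(directory, num_parts=4):
    """Load the part files in order and concatenate them into one list."""
    configs = []
    for i in range(1, num_parts + 1):
        path = os.path.join(directory, f"preprocessed_data_configs_part_{i}.joblib")
        configs.extend(joblib.load(path))
    return configs

# Round-trip demonstration with a tiny stand-in payload.
with tempfile.TemporaryDirectory() as tmp:
    toy_parts = [
        [{"config_name": "a"}, {"config_name": "b"}],
        [{"config_name": "c"}],
    ]
    for i, part in enumerate(toy_parts, start=1):
        joblib.dump(part, os.path.join(tmp, f"preprocessed_data_configs_part_{i}.joblib"))

    restored = load_config_parts(tmp, num_parts=len(toy_parts))

print([c["config_name"] for c in restored])
```

Loading in index order keeps the configurations in the same sequence as the original `preprocessed_data_configs` list.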

