<a href="https://colab.research.google.com/github/Edenshmuel/ICU_Nutrition_ML/blob/main/Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook defines the preprocessing pipeline shared by the clustering and prediction models.
It includes transformations for numerical, categorical, multi-label, and right-skewed features.**

Importing Necessary Libraries

In [1]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

Log Transform + Scaling for skewed features

In [2]:
log_scaler_pipeline = Pipeline(steps=[("log_transform", FunctionTransformer(np.log1p, validate=True)),
    ("scaler", MinMaxScaler())])
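
To illustrate why the log step comes before scaling, here is a minimal sketch on an invented right-skewed sample (the numbers are not from the ICU dataset). It compares plain min-max scaling with the log-then-scale pipeline defined above.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Invented right-skewed sample: most values are small, one is very large.
skewed = np.array([[1.0], [2.0], [3.0], [1000.0]])

# Same structure as log_scaler_pipeline in the cell above.
log_then_scale = Pipeline(steps=[
    ("log_transform", FunctionTransformer(np.log1p, validate=True)),
    ("scaler", MinMaxScaler()),
])

scaled_only = MinMaxScaler().fit_transform(skewed)
log_scaled = log_then_scale.fit_transform(skewed)

# Without the log step the small values are squeezed close to 0;
# with it they are spread further apart inside the [0, 1] range.
print(scaled_only.ravel())
print(log_scaled.ravel())
```

Note that `np.log1p` computes `log(1 + x)`, so it handles zeros but is only defined for values greater than -1.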

Min-Max Scaling for non-skewed features

In [3]:
scaler_pipeline = Pipeline(steps=[("scaler", MinMaxScaler())])

One-Hot Encoding for categorical features

In [4]:
cat_transformer = Pipeline(steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))])
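
The effect of `handle_unknown="ignore"` can be sketched as follows. The "Male"/"Female"/"Other" labels are invented for illustration; the notebook does not say which categorical columns are used.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")

# Hypothetical training categories.
train = np.array([["Male"], ["Female"]], dtype=object)
encoder.fit(train)

# "Other" was never seen during fit.
unseen = np.array([["Female"], ["Other"]], dtype=object)
encoded = encoder.transform(unseen).toarray()

# Categories are stored in sorted order: ["Female", "Male"].
# The unseen label is encoded as an all-zero row instead of raising an error.
print(encoder.categories_[0])
print(encoded)
```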

This function converts the "Disease" column, which contains multiple diseases as a comma-separated string, into a multi-hot encoded format: one binary column is created for each unique disease found in the data.

In [5]:
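def multi_hot_encode_disease(df):
    df = df.copy()
    # Split the comma-separated string into a list of disease names per row
    # (missing values become the literal string "nan" after astype(str))
    df["Disease"] = df["Disease"].astype(str).str.split(", ")
    # Sort so the output column order is deterministic across runs
    all_diseases = sorted({d for sublist in df["Disease"] for d in sublist})

    # One binary indicator column per disease
    for disease in all_diseases:
        df[disease] = df["Disease"].apply(lambda x: 1 if disease in x else 0)

    df = df.drop(columns=["Disease"])
    return df

# Note: the columns are derived from whatever data is passed in, so different
# datasets (e.g. train and test) can yield different output columns.
disease_transformer = FunctionTransformer(multi_hot_encode_disease)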
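
Because the function above derives its columns from the data it receives, a dataset with a different set of diseases produces a different set of columns. One possible way around this, sketched here as an assumption rather than as part of the original pipeline, is a stateful transformer that remembers the diseases seen during `fit`. The class name `DiseaseMultiHotEncoder` and the sample disease names are hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DiseaseMultiHotEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical stateful variant: keeps the disease columns seen in fit."""

    def __init__(self, column="Disease", sep=", "):
        self.column = column
        self.sep = sep

    def fit(self, X, y=None):
        dummies = X[self.column].astype(str).str.get_dummies(sep=self.sep)
        # Remember the disease names in a fixed, sorted order.
        self.diseases_ = sorted(dummies.columns)
        return self

    def transform(self, X):
        dummies = X[self.column].astype(str).str.get_dummies(sep=self.sep)
        # Align to the fitted columns: unseen diseases are dropped,
        # diseases absent from this batch are filled with 0.
        return dummies.reindex(columns=self.diseases_, fill_value=0)


# Invented sample data.
train = pd.DataFrame({"Disease": ["Diabetes, Hypertension", "Asthma"]})
test = pd.DataFrame({"Disease": ["Asthma, Influenza"]})

encoder = DiseaseMultiHotEncoder().fit(train)
encoded_test = encoder.transform(test)
print(encoded_test)
```

With this design the test output always has the columns learned from the training data, at the cost of silently ignoring diseases that first appear at prediction time.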

This code maps the categorical "Activity Level" column to ordinal integers (0 for "Sedentary" through 4 for "Extremely Active"), making it suitable for machine learning models.

In [6]:
activity_mapping = {
    "Sedentary": 0,
    "Lightly Active": 1,
    "Moderately Active": 2,
    "Very Active": 3,
    "Extremely Active": 4
    }

In [7]:
def encode_activity_level(X):
    X = X.copy()
    # Labels missing from activity_mapping (including case or spelling variants) become NaN
    X["Activity Level"] = X["Activity Level"].map(activity_mapping)
    return X

activity_transformer = FunctionTransformer(encode_activity_level)
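
A short sketch of how the mapping behaves, using invented sample rows. It also shows that a label not present in the dictionary (here a lowercase variant) maps to NaN, which is worth checking before fitting a model.

```python
import pandas as pd

activity_mapping = {
    "Sedentary": 0,
    "Lightly Active": 1,
    "Moderately Active": 2,
    "Very Active": 3,
    "Extremely Active": 4,
}

# Invented sample; the last label differs only in letter case.
sample = pd.DataFrame(
    {"Activity Level": ["Sedentary", "Very Active", "very active"]}
)

encoded = sample["Activity Level"].map(activity_mapping)

# Count labels that were not found in the mapping.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist())
print(n_unmapped)
```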

This class is a custom scikit-learn transformer that calculates the Body Mass Index (BMI) as weight divided by height squared. BMI is conventionally defined with weight in kilograms and height in meters.

In [8]:
class BMICalculator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["BMI"] = X["Weight"] / (X["Height"] ** 2)
        return X
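
As a worked example of the formula, the sketch below computes BMI for two invented patients. It assumes "Weight" is in kilograms and "Height" is in meters, which the notebook does not state explicitly, and shows how a height recorded in centimeters would distort the result.

```python
import pandas as pd

# Invented sample rows. Assumption: Weight in kilograms, Height in meters.
patients = pd.DataFrame({"Weight": [70.0, 90.0], "Height": [1.75, 1.80]})
patients["BMI"] = patients["Weight"] / patients["Height"] ** 2

# The second patient again, with height mistakenly given in centimeters.
wrong_units_bmi = 90.0 / 180.0 ** 2

print(patients["BMI"].round(2).tolist())
print(wrong_units_bmi)
```

A value far below the plausible human range is a quick signal that the height column is not in meters.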

Final Preprocessing Pipeline

In [10]:
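# BMI is added first; the log + min-max pipeline is then applied to every column.
# The final "scaler" step re-applies min-max scaling to values already in [0, 1].
num_pipeline = Pipeline(steps=[
    ("bmi_calculator", BMICalculator()),
    ("log_scaled", log_scaler_pipeline),
    ("scaler", scaler_pipeline)])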

In [11]:
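def create_preprocessor(numerical_features, categorical_features, Multy_categorical_features, right_skewed_features=None):
    """Build the ColumnTransformer that combines all feature transformations.

    Note: Multy_categorical_features is accepted but not used; the multi-hot
    step is applied to the hard-coded "Disease" column.
    """
    transformers = []

    # Optional log + min-max pipeline for right-skewed columns
    if right_skewed_features:
        transformers.append(("log_scaled", log_scaler_pipeline, right_skewed_features))

    transformers.extend([
        ("scaled", scaler_pipeline, numerical_features),
        ("activity", activity_transformer, ["Activity Level"]),
        ("cat", cat_transformer, categorical_features),
        ("disease", disease_transformer, ["Disease"])])

    # Columns not listed above are dropped (ColumnTransformer's default remainder)
    preprocessor = ColumnTransformer(transformers=transformers)

    return preprocessor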
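
The way a `ColumnTransformer` routes column groups to different transformers can be sketched on a toy frame. This is a reduced, self-contained illustration rather than the notebook's full preprocessor: the column names "Age", "Daily Calories", and "Gender" and all values are invented, and the activity and disease steps are left out.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder

# Invented toy data; column names are placeholders.
df = pd.DataFrame({
    "Age": [25, 40, 60, 35],
    "Daily Calories": [1800, 2200, 9000, 2000],
    "Gender": ["F", "M", "F", "M"],
})

log_scaler = Pipeline(steps=[
    ("log_transform", FunctionTransformer(np.log1p, validate=True)),
    ("scaler", MinMaxScaler()),
])

preprocessor = ColumnTransformer(transformers=[
    ("log_scaled", log_scaler, ["Daily Calories"]),   # right-skewed column
    ("scaled", MinMaxScaler(), ["Age"]),              # regular numeric column
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])

features = preprocessor.fit_transform(df)

# The result may be sparse depending on its density; convert for inspection.
if sparse.issparse(features):
    features = features.toarray()

# One log-scaled column + one scaled column + two one-hot columns.
print(features.shape)
```

The output columns follow the order in which the transformers are listed, not the column order of the input frame.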

In [None]:
def get_feature_names(preprocessor, input_features):
    """Return the column names after the ColumnTransformer's transformation."""
    feature_names = []

    for name, transformer, columns in preprocessor.transformers_:
        if transformer == "drop":
            # Dropped columns (e.g. the default remainder) produce no output
            continue
        if transformer == "passthrough":
            # Columns that pass through unchanged
            feature_names.extend(columns)
        elif hasattr(transformer, "get_feature_names_out"):
            # If the transformer supports returning column names
            feature_names.extend(transformer.get_feature_names_out(columns))
        elif isinstance(transformer, Pipeline):
            # For an inner Pipeline, take the names from the last step (if possible)
            last_step = transformer.steps[-1][1]
            if hasattr(last_step, "get_feature_names_out"):
                feature_names.extend(last_step.get_feature_names_out(columns))
            else:
                # Otherwise fall back to the original columns
                feature_names.extend(columns)
        else:
            # For transformers that have no get_feature_names_out
            feature_names.extend(columns)

    return feature_names