This example integrates Polars DataFrames with scikit-learn's pipeline and preprocessing functionality to build a configurable machine learning workflow. Polars handles data manipulation, and scikit-learn handles model training and evaluation.


1. Data Preparation:
   - We introduced the `ScalerType` enum to represent different types of scalers available in scikit-learn, such as `StandardScaler`, `MinMaxScaler`, `MaxAbsScaler`, `RobustScaler`, `PowerTransformer`, `QuantileTransformer`, and `Normalizer`.
   - We updated the Pydantic classes (`Feature`, `NumericalFeature`, `EmbeddingFeature`, `CategoricalFeature`, `FeatureSet`, and `InputConfig`) to include additional configuration options:
     - `NumericalFeature` now has a `scaler_type` field to specify the type of scaler to apply to the numerical feature.
     - `CategoricalFeature` now has a `one_hot_encoding` field to indicate whether to apply one-hot encoding to the categorical feature.
   - We added a `get_scaler` method to the `NumericalFeature` class to retrieve the appropriate scaler based on the `scaler_type` using a dictionary mapping.

2. Classifier Configuration:
   - We introduced the `ClassifierName` enum to represent the supported classifiers: `LogisticRegression`, `RandomForestClassifier`, and `HistGradientBoostingClassifier`.
   - We created Pydantic classes (`BaseClassifierConfig`, `LogisticRegressionConfig`, `RandomForestClassifierConfig`, and `HistGradientBoostingClassifierConfig`) to define and validate the classifier configurations.
   - We added a `ClassifierConfig` class that allows specifying the classifier configuration using one of the supported classifier config classes.

3. Pipeline Creation:
   - We updated the `create_pipeline` function to handle the scalers and one-hot encoding based on the feature configurations:
     - For numerical features, we retrieve the scaler using the `get_scaler` method and apply it to the corresponding features.
     - For categorical features, we apply one-hot encoding or passthrough based on the `one_hot_encoding` flag.
   - We modified the classifier creation logic to instantiate the appropriate classifier based on the provided `ClassifierConfig` instance.

4. Model Evaluation:
   - We created a list of `classifier_configs` to demonstrate the evaluation of different classifier configurations.
   - We iterated over each `classifier_config` and created a pipeline using the `create_pipeline` function.
   - We fit each pipeline on the training data and evaluated the model's accuracy on the validation and test data using the `evaluate_model` function.

5. Integration with Polars:
   - We used Polars DataFrames for data manipulation and preprocessing throughout the example.
   - We used the `convert_utf8_to_enum` function to convert string columns to the Enum data type when their ratio of unique values to rows is at or below a threshold (0.2 by default).
   - We used the `convert_enum_to_physical` function to convert Enum columns to their physical (integer) representation when needed.

Together, these changes make the workflow configurable: feature sets and classifier settings are declared as validated Pydantic models, and the same pipeline code can be evaluated across several classifier configurations while the data stays in Polars DataFrames until it reaches scikit-learn.

Because scalers and classifiers are selected through enums and per-classifier config classes, adding support for another scaler, classifier, or configuration option mostly means extending the relevant enum and Pydantic class.
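
As a rough illustration of the threshold rule in step 5, the sketch below reproduces the unique-ratio check in plain Python, without Polars. The helper name `should_convert_to_enum` and the sample columns are hypothetical; only the rule itself (convert when the ratio of unique values to rows is at or below the threshold, skip empty columns) is taken from `convert_utf8_to_enum`.

```python
from typing import List, Optional


def should_convert_to_enum(values: List[Optional[str]], threshold: float = 0.2) -> bool:
    """Hypothetical helper: True when the share of distinct values is at or below the threshold."""
    if not 0 < threshold < 1:
        raise ValueError("Threshold must be between 0 and 1 (exclusive).")
    if not values:
        return False  # empty columns are skipped
    unique_ratio = len(set(values)) / len(values)
    return unique_ratio <= threshold


genders = ["Male", "Female"] * 10            # 2 distinct values in 20 rows -> ratio 0.10
user_ids = [f"user_{i}" for i in range(20)]  # 20 distinct values in 20 rows -> ratio 1.00

print(should_convert_to_enum(genders))   # True
print(should_convert_to_enum(user_ids))  # False
```

A low-cardinality column such as `gender` passes the check, while an identifier-like column does not and would stay a plain string column.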

In [1]:
from enum import Enum
import itertools
from typing import Optional, Union, Dict, Literal, Any, List, Tuple, Type, TypeVar, Generator

import numpy as np
import polars as pl
from pydantic import BaseModel, ValidationInfo, model_validator, Field

from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, PowerTransformer, QuantileTransformer, Normalizer, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline



In [2]:


class ScalerType(str, Enum):
    STANDARD_SCALER = "StandardScaler"
    MIN_MAX_SCALER = "MinMaxScaler"
    MAX_ABS_SCALER = "MaxAbsScaler"
    ROBUST_SCALER = "RobustScaler"
    POWER_TRANSFORMER = "PowerTransformer"
    QUANTILE_TRANSFORMER = "QuantileTransformer"
    NORMALIZER = "Normalizer"

class Feature(BaseModel):
    column_name: str
    name: str
    description: Optional[str] = None

    @model_validator(mode='before')
    def validate_column_name(cls, values):
        column_name = values.get("column_name")
        context = values.get("context")
        if context is not None and isinstance(context, pl.DataFrame):
            if column_name not in context.columns:
                raise ValueError(f"Column '{column_name}' not found in the DataFrame.")
        return values

    class Config:
        arbitrary_types_allowed = True
        extra = "allow"

class NumericalFeature(Feature):
    scaler_type: ScalerType = Field(ScalerType.STANDARD_SCALER, description="The type of scaler to apply to the numerical feature.")

    def get_scaler(self):
        scaler_map = {
            ScalerType.STANDARD_SCALER: StandardScaler(),
            ScalerType.MIN_MAX_SCALER: MinMaxScaler(),
            ScalerType.MAX_ABS_SCALER: MaxAbsScaler(),
            ScalerType.ROBUST_SCALER: RobustScaler(),
            ScalerType.POWER_TRANSFORMER: PowerTransformer(),
            ScalerType.QUANTILE_TRANSFORMER: QuantileTransformer(),
            ScalerType.NORMALIZER: Normalizer(),
        }
        return scaler_map[self.scaler_type]

    @model_validator(mode='before')
    def validate_numerical_column(cls, values):
        column_name = values.get("column_name")
        context = values.get("context")
        if context is not None and isinstance(context, pl.DataFrame):
            if column_name not in context.columns:
                raise ValueError(f"Column '{column_name}' not found in the DataFrame.")
            if context[column_name].dtype not in [
                pl.Boolean,
                pl.Int8,
                pl.Int16,
                pl.Int32,
                pl.Int64,
                pl.UInt8,
                pl.UInt16,
                pl.UInt32,
                pl.UInt64,
                pl.Float32,
                pl.Float64,
                pl.Decimal,
            ]:
                raise ValueError(
                    f"Column '{column_name}' must be of a numeric type (Boolean, Integer, Unsigned Integer, Float, or Decimal)."
                )
        return values

class EmbeddingFeature(NumericalFeature):
    @model_validator(mode='before')
    def validate_embedding_column(cls, values):
        column_name = values.get("column_name")
        context = values.get("context")
        if context is not None and isinstance(context, pl.DataFrame):
            if column_name not in context.columns:
                raise ValueError(f"Column '{column_name}' not found in the DataFrame.")
            if context[column_name].dtype not in [pl.List(pl.Float32), pl.List(pl.Float64)]:
                raise ValueError(f"Column '{column_name}' must be of type pl.List(pl.Float32) or pl.List(pl.Float64).")
        return values

class CategoricalFeature(Feature):
    one_hot_encoding: bool = Field(True, description="Whether to apply one-hot encoding to the categorical feature.")

    @model_validator(mode='before')
    def validate_categorical_column(cls, values):
        column_name = values.get("column_name")
        context = values.get("context")
        if context is not None and isinstance(context, pl.DataFrame):
            if column_name not in context.columns:
                raise ValueError(f"Column '{column_name}' not found in the DataFrame.")
            if context[column_name].dtype not in [
                pl.Utf8,
                pl.Categorical,
                pl.Enum,
                pl.Int8,
                pl.Int16,
                pl.Int32,
                pl.Int64,
                pl.UInt8,
                pl.UInt16,
                pl.UInt32,
                pl.UInt64,
            ]:
                raise ValueError(
                    f"Column '{column_name}' must be of type pl.Utf8, pl.Categorical, pl.Enum, or an integer type."
                )
        return values

class FeatureSet(BaseModel):
    numerical: List[NumericalFeature] = []
    embeddings: List[EmbeddingFeature] = []
    categorical: List[CategoricalFeature] = []

    class Config:
        arbitrary_types_allowed = True
        extra = "allow"

class InputConfig(BaseModel):
    feature_sets: List[FeatureSet]

    def validate_with_dataframe(self, df: pl.DataFrame):
        for feature_set in self.feature_sets:
            for feature_type in ["numerical", "embeddings", "categorical"]:
                for feature in getattr(feature_set, feature_type):
                    # model_dump() is the Pydantic v2 replacement for the deprecated .dict()
                    feature.model_validate({"context": df, **feature.model_dump()})

    class Config:
        arbitrary_types_allowed = True
        extra = "allow"
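
One design note on `get_scaler`: the dictionary above instantiates all seven scalers on every call, even though only one is returned. A possible alternative, sketched below with standard-library stand-ins only, is to map each enum member to a factory and call it on lookup. `ScalerKind`, `FakeScaler`, and `SCALER_FACTORIES` are hypothetical names used for illustration; with scikit-learn the mapping values would simply be the scaler classes themselves.

```python
from enum import Enum
from typing import Callable, Dict, List


class ScalerKind(str, Enum):
    """Hypothetical stand-in for ScalerType, reduced to two members."""
    STANDARD = "StandardScaler"
    MIN_MAX = "MinMaxScaler"


class FakeScaler:
    """Stand-in for a scikit-learn scaler; records every instance that gets built."""
    built: List[str] = []

    def __init__(self, name: str) -> None:
        self.name = name
        FakeScaler.built.append(name)


# Map enum members to factories instead of ready-made instances.
SCALER_FACTORIES: Dict[ScalerKind, Callable[[], FakeScaler]] = {
    ScalerKind.STANDARD: lambda: FakeScaler("standard"),
    ScalerKind.MIN_MAX: lambda: FakeScaler("min_max"),
}


def get_scaler(kind: ScalerKind) -> FakeScaler:
    # Only the requested scaler is constructed, and each call returns a fresh object.
    return SCALER_FACTORIES[kind]()


first = get_scaler(ScalerKind.MIN_MAX)
second = get_scaler(ScalerKind("MinMaxScaler"))  # str-valued enums can be looked up by value

print(FakeScaler.built)  # ['min_max', 'min_max']
print(first is second)   # False
```

Returning a fresh instance per call also avoids two features accidentally sharing one fitted scaler object.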



In [4]:


class ClassifierName(str, Enum):
    LOGISTIC_REGRESSION = "LogisticRegression"
    RANDOM_FOREST = "RandomForestClassifier"
    HIST_GRADIENT_BOOSTING = "HistGradientBoostingClassifier"

class BaseClassifierConfig(BaseModel):
    classifier_name: ClassifierName
    

class LogisticRegressionConfig(BaseClassifierConfig):
    classifier_name: Literal[ClassifierName.LOGISTIC_REGRESSION] = ClassifierName.LOGISTIC_REGRESSION
    n_jobs: int = Field(-1, description="Number of CPU cores to use.")
    penalty: str = Field("l2", description="Specify the norm of the penalty.")
    dual: bool = Field(False, description="Dual or primal formulation.")
    tol: float = Field(1e-4, description="Tolerance for stopping criteria.")
    C: float = Field(1.0, description="Inverse of regularization strength.")
    fit_intercept: bool = Field(True, description="Specifies if a constant should be added to the decision function.")
    intercept_scaling: float = Field(1, description="Scaling factor for the constant.")
    class_weight: Optional[Union[str, Dict[Any, float]]] = Field(None, description="Weights associated with classes.")
    random_state: Optional[int] = Field(None, description="Seed for random number generation.")
    solver: str = Field("lbfgs", description="Algorithm to use in the optimization problem.")
    max_iter: int = Field(100, description="Maximum number of iterations.")
    multi_class: str = Field("auto", description="Approach for handling multi-class targets.")
    verbose: int = Field(0, description="Verbosity level.")
    warm_start: bool = Field(False, description="Reuse the solution of the previous call to fit.")
    l1_ratio: Optional[float] = Field(None, description="Elastic-Net mixing parameter.")

class RandomForestClassifierConfig(BaseClassifierConfig):
    classifier_name: Literal[ClassifierName.RANDOM_FOREST] = ClassifierName.RANDOM_FOREST
    n_jobs: int = Field(-1, description="Number of CPU cores to use.")
    n_estimators: int = Field(100, description="The number of trees in the forest.")
    criterion: str = Field("gini", description="The function to measure the quality of a split.")
    max_depth: Optional[int] = Field(None, description="The maximum depth of the tree.")
    min_samples_split: Union[int, float] = Field(2, description="The minimum number of samples required to split an internal node.")
    min_samples_leaf: Union[int, float] = Field(1, description="The minimum number of samples required to be at a leaf node.")
    min_weight_fraction_leaf: float = Field(0.0, description="The minimum weighted fraction of the sum total of weights required to be at a leaf node.")
    max_features: Union[str, int, float] = Field("sqrt", description="The number of features to consider when looking for the best split.")
    max_leaf_nodes: Optional[int] = Field(None, description="Grow trees with max_leaf_nodes in best-first fashion.")
    min_impurity_decrease: float = Field(0.0, description="A node will be split if this split induces a decrease of the impurity greater than or equal to this value.")
    bootstrap: bool = Field(True, description="Whether bootstrap samples are used when building trees.")
    oob_score: bool = Field(False, description="Whether to use out-of-bag samples to estimate the generalization score.")
    
    random_state: Optional[int] = Field(None, description="Seed for random number generation.")
    verbose: int = Field(0, description="Verbosity level.")
    warm_start: bool = Field(False, description="Reuse the solution of the previous call to fit and add more estimators to the ensemble.")
    class_weight: Optional[Union[str, Dict[Any, float]]] = Field(None, description="Weights associated with classes.")
    ccp_alpha: float = Field(0.0, description="Complexity parameter used for Minimal Cost-Complexity Pruning.")
    max_samples: Optional[Union[int, float]] = Field(None, description="If bootstrap is True, the number of samples to draw from X to train each base estimator.")
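    # RandomForestClassifier expects an array-like of ints with one entry per feature, not a dict.
    monotonic_cst: Optional[List[int]] = Field(None, description="Monotonic constraint to enforce on each feature (1: increasing, 0: no constraint, -1: decreasing).")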

class HistGradientBoostingClassifierConfig(BaseClassifierConfig):
    classifier_name: Literal[ClassifierName.HIST_GRADIENT_BOOSTING] = ClassifierName.HIST_GRADIENT_BOOSTING
    loss: str = Field("log_loss", description="The loss function to use in the boosting process.")
    learning_rate: float = Field(0.1, description="The learning rate, also known as shrinkage.")
    max_iter: int = Field(100, description="The maximum number of iterations of the boosting process.")
    max_leaf_nodes: int = Field(31, description="The maximum number of leaves for each tree.")
    max_depth: Optional[int] = Field(None, description="The maximum depth of each tree.")
    min_samples_leaf: int = Field(20, description="The minimum number of samples per leaf.")
    l2_regularization: float = Field(0.0, description="The L2 regularization parameter.")
    max_features: Union[str, int, float] = Field(1.0, description="Proportion of randomly chosen features in each and every node split.")
    max_bins: int = Field(255, description="The maximum number of bins to use for non-missing values.")
    categorical_features: Optional[Union[str, List[int], List[bool]]] = Field("warn", description="Indicates the categorical features.")
    monotonic_cst: Optional[Dict[str, int]] = Field(None, description="Monotonic constraint to enforce on each feature.")
    interaction_cst: Optional[Union[str, List[Tuple[int, ...]]]] = Field(None, description="Specify interaction constraints, the sets of features which can interact with each other in child node splits.")
    warm_start: bool = Field(False, description="Reuse the solution of the previous call to fit and add more estimators to the ensemble.")
    early_stopping: Union[str, bool] = Field("auto", description="Whether to use early stopping to terminate training when validation score is not improving.")
    scoring: Optional[str] = Field("loss", description="Scoring parameter to use for early stopping.")
    validation_fraction: float = Field(0.1, description="Proportion of training data to set aside as validation data for early stopping.")
    n_iter_no_change: int = Field(10, description="Used to determine when to stop if validation score is not improving.")
    tol: float = Field(1e-7, description="The absolute tolerance to use when comparing scores.")
    verbose: int = Field(0, description="Verbosity level.")
    random_state: Optional[int] = Field(None, description="Seed for random number generation.")
    class_weight: Optional[Union[str, Dict[Any, float]]] = Field(None, description="Weights associated with classes.")

class ClassifierConfig(BaseModel):
    classifier: Union[LogisticRegressionConfig, RandomForestClassifierConfig, HistGradientBoostingClassifierConfig]
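
The config classes above carry the estimator's keyword arguments plus a `classifier_name` tag. A minimal sketch of how such a config could be turned into an estimator is shown below, using dataclasses and a toy estimator so that it runs without scikit-learn or Pydantic. `ToyForestConfig`, `ToyForest`, `REGISTRY`, and `build_classifier` are hypothetical names; with the Pydantic models, `config.model_dump(exclude={"classifier_name"})` would play the role of `asdict` here.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ToyForestConfig:
    """Hypothetical, reduced stand-in for RandomForestClassifierConfig."""
    classifier_name: str = "RandomForestClassifier"
    n_estimators: int = 100
    max_depth: Optional[int] = None


class ToyForest:
    """Stand-in estimator that just stores its hyperparameters."""

    def __init__(self, n_estimators: int = 100, max_depth: Optional[int] = None) -> None:
        self.n_estimators = n_estimators
        self.max_depth = max_depth


# Registry from the classifier_name tag to the estimator constructor.
REGISTRY: Dict[str, Callable[..., Any]] = {"RandomForestClassifier": ToyForest}


def build_classifier(config: ToyForestConfig) -> Any:
    params = asdict(config)
    name = params.pop("classifier_name")  # the tag selects the class, the rest are kwargs
    return REGISTRY[name](**params)


clf = build_classifier(ToyForestConfig(n_estimators=50, max_depth=4))
print(type(clf).__name__, clf.n_estimators, clf.max_depth)  # ToyForest 50 4
```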

In [8]:
class FoldMode(str, Enum):
    COMBINATORIAL = "Combinatorial"
    MONTE_CARLO = "MonteCarlo"

class KFoldConfig(BaseModel):
    k: int = Field(5, description="Number of folds. The data is divided into k parts of equal size; the last fold may differ in size depending on the remainder of len(data) // k.")
    n_test_folds: int = Field(1, description="Number of test folds to use for cross-validation. Must be strictly less than k. In Monte Carlo mode the test folds are sampled first and the remaining folds are used for training. In combinatorial mode all combinations of n_test_folds out of k folds are used.")
    fold_mode: FoldMode = Field(FoldMode.COMBINATORIAL, description="The mode to use for splitting the data into folds. Combinatorial splits the data into k equal parts, while Monte Carlo randomly samples the k equal parts without replacement.")
    shuffle: bool = Field(True, description="Whether to shuffle the data before splitting.")
    random_state: Optional[int] = Field(None, description="Seed for random number generation. In Monte Carlo cross-validation the seed is increased by 1 at each replica, maintaining reproducibility while ensuring that the samples differ.")
    montecarlo_replicas: int = Field(5, description="Number of random replicas to use for Monte Carlo cross-validation.")
    
class StratificationMode(str, Enum):
    PROPORTIONAL = "Proportional"
    UNIFORM_STRICT = "UniformStrict"
    UNIFORM_RELAXED = "UniformRelaxed"

class StratifiedConfig(KFoldConfig):
    groups: List[str] = Field([], description="The DataFrame column(s) to use for stratification. They are used in a group-by operation so that stratification is done within each group.")
    strat_mode: StratificationMode = Field(StratificationMode.PROPORTIONAL, description="The mode to use for stratification. Proportional maintains the original proportion of each group in every split: the data is grouped first and each group is then broken into k equal parts, so every group appears in train and test with the same proportion. Uniform instead ensures that each group has the same number of samples in each train and test fold, which is not compatible with the proportional mode.")
    group_size: Optional[int] = Field(None, description="The number of samples to use for each group in the stratification. Only used if strat_mode is UniformStrict or UniformRelaxed. With UniformRelaxed, group_size is a target size for each group, and a group with fewer samples than the target is used as is. With UniformStrict, the group size for all groups is forced to min(group_size, min_samples_in_group).")

class PurgedConfig(KFoldConfig):
    groups: List[str] = Field([], description="The DataFrame column(s) to use for purging. They are used in a group-by operation so that purging happens at the whole-group level. k determines the fraction of groups to purge from train and restrict to test. In Monte Carlo mode the test groups are sampled first and the remaining groups are used for training. In combinatorial mode all combinations of n_test_folds out of k group partitions are used.")

class CVConfig(BaseModel):
    inner: Union[KFoldConfig,StratifiedConfig,PurgedConfig]
    inner_replicas: int = Field(1, description="Number of random replicas to use for inner cross-validation.")
    outer: Optional[Union[KFoldConfig,StratifiedConfig,PurgedConfig]] = None
    outer_replicas: int = Field(1, description="Number of random replicas to use for outer cross-validation.")
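
In combinatorial mode, the number of train/test splits grows as the binomial coefficient C(k, n_test_folds). The short sketch below, which uses only the standard library, counts the splits for `k=10` and `n_test_folds=5` and shows the first one; the variable names are illustrative.

```python
import itertools
import math

k, n_test_folds = 10, 5

# Every way of choosing n_test_folds fold indexes out of k is one split.
test_fold_sets = list(itertools.combinations(range(k), n_test_folds))

print(len(test_fold_sets))            # 252
print(math.comb(k, n_test_folds))     # 252

first_test = test_fold_sets[0]
first_train = [i for i in range(k) if i not in first_test]
print(first_test)   # (0, 1, 2, 3, 4)
print(first_train)  # [5, 6, 7, 8, 9]
```

With the default `k=5` and `n_test_folds=1` this reduces to ordinary 5-fold cross-validation with 5 splits, so larger `n_test_folds` values should be chosen with the resulting number of model fits in mind.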

In [17]:
def check_add_cv_index(df: pl.DataFrame, strict: bool = False) -> pl.DataFrame:
    # Ensure a "cv_index" row-index column exists; in strict mode a missing column is an error.
    if "cv_index" not in df.columns:
        if strict:
            raise ValueError("cv_index column not found in the DataFrame.")
        df = df.with_row_index(name="cv_index")
    return df


def shuffle_frame(df:pl.DataFrame):
    return df.sample(fraction=1,shuffle=True)

def slice_frame(df:pl.DataFrame, num_slices:int, shuffle:bool = False, explode:bool = False) -> List[pl.DataFrame]:
    max_index = df.shape[0]
    if shuffle:
        df = shuffle_frame(df)
    indexes = [0] + [max_index//num_slices*i for i in range(1,num_slices)] + [max_index]
    if explode:
        return [df.slice(indexes[i],indexes[i+1]-indexes[i]).explode("cv_index").select(pl.col("cv_index")) for i in range(len(indexes)-1)]
    else:
        return [df.slice(indexes[i],indexes[i+1]-indexes[i]).select(pl.col("cv_index")) for i in range(len(indexes)-1)]

def hacky_list_relative_slice(items: List[int], k: int):
    # Split items into k consecutive slices keyed "fold_0" ... "fold_{k-1}".
    # The parameter is named items to avoid shadowing the built-in list.
    slices = {}
    slice_size = len(items) // k
    for i in range(k):
        if i < k - 1:
            slices["fold_{}".format(i)] = items[i*slice_size:(i+1)*slice_size]
        else:
            # For the last slice, include the remainder
            slices["fold_{}".format(i)] = items[i*slice_size:]
    return slices
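
To make the remainder handling concrete, here is a plain-Python sketch of the same slicing rule applied to small lists. The function name `relative_slices` is hypothetical; the logic mirrors `hacky_list_relative_slice`.

```python
from typing import Dict, List


def relative_slices(items: List[int], k: int) -> Dict[str, List[int]]:
    """Split items into k consecutive slices; the last slice absorbs the remainder."""
    size = len(items) // k
    slices = {}
    for i in range(k):
        end = (i + 1) * size if i < k - 1 else len(items)
        slices[f"fold_{i}"] = items[i * size:end]
    return slices


print(relative_slices(list(range(10)), 3))
# {'fold_0': [0, 1, 2], 'fold_1': [3, 4, 5], 'fold_2': [6, 7, 8, 9]}

print(relative_slices([0, 1], 3))
# {'fold_0': [], 'fold_1': [], 'fold_2': [0, 1]}
```

The second call shows an edge case worth keeping in mind: when a group has fewer than `k` members, the slice size is zero and every member lands in the last fold.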

In [33]:
def kfold_combinatorial(df: pl.DataFrame, config: KFoldConfig) -> Generator[Tuple[pl.Series, pl.Series], None, None]:
    df = check_add_cv_index(df, strict=True)
    cv_index = df["cv_index"].shuffle(seed=config.random_state)
    num_samples = cv_index.shape[0]
    fold_size = num_samples // config.k
    index_start = pl.Series([int(i * fold_size) for i in range(config.k)])

    print("index_start", index_start)
    # The last fold absorbs the remainder so that no sample is dropped when num_samples is not divisible by k.
    folds = [
        cv_index.slice(offset=start, length=fold_size if i < config.k - 1 else num_samples - start)
        for i, start in enumerate(index_start)
    ]
    # Use itertools to enumerate every combination of n_test_folds folds (out of k) to serve as the test set.
    # Folds are indexed from 0 to k-1; each combination yields one (train, test) pair, where test covers
    # n_test_folds folds and train covers the remaining k - n_test_folds folds.
    test_folds = list(itertools.combinations(range(config.k), config.n_test_folds))
    print("num of test_folds combinations", len(test_folds))
    for test_fold in test_folds:
        # Concatenate the index Series of the selected folds to get the sample indexes for this split.
        test_series = pl.concat([folds[i] for i in test_fold]).sort()
        train_series = pl.concat([folds[i] for i in range(config.k) if i not in test_fold]).sort()
        yield train_series, test_series

def kfold_montecarlo(df: pl.DataFrame, config: KFoldConfig) -> Generator[Tuple[pl.Series, pl.Series], None, None]:
    df = check_add_cv_index(df, strict=True)
    cv_index = df["cv_index"].shuffle(seed=config.random_state)
    train_fraction = (config.k - config.n_test_folds) / config.k
    for i in range(config.montecarlo_replicas):
        # Offset the seed per replica so replicas differ but stay reproducible; None keeps sampling unseeded.
        seed = None if config.random_state is None else config.random_state + i
        train_series = cv_index.sample(fraction=train_fraction, with_replacement=False, seed=seed)
        # The test set is every index that was not sampled into the training set.
        test_series = cv_index.filter(~cv_index.is_in(train_series.to_list()))
        yield train_series, test_series

def purged_combinatorial(df: pl.DataFrame, config: PurgedConfig) -> Generator[Tuple[pl.Series, pl.Series], None, None]:
    df = check_add_cv_index(df, strict=True)
    # One row per group, holding the list of cv_index values that belong to it.
    gdf = df.group_by(config.groups).agg(pl.col("cv_index")).select(config.groups + ["cv_index"])
    gdf_slices = slice_frame(gdf, config.k, shuffle=config.shuffle, explode=True)
    test_folds = list(itertools.combinations(range(config.k), config.n_test_folds))
    for test_fold in test_folds:
        # slice_frame returns single-column DataFrames, so extract the cv_index Series before sorting.
        test_series = pl.concat([gdf_slices[i] for i in test_fold])["cv_index"].sort()
        train_series = pl.concat([gdf_slices[i] for i in range(config.k) if i not in test_fold])["cv_index"].sort()
        yield train_series, test_series

def purged_montecarlo(df: pl.DataFrame, config: PurgedConfig) -> Generator[Tuple[pl.Series, pl.Series], None, None]:
    df = check_add_cv_index(df, strict=True)
    gdf = df.group_by(config.groups).agg(pl.col("cv_index")).select(config.groups + ["cv_index"])
    for i in range(config.montecarlo_replicas):
        # Reshuffle the groups on every replica, then take the last n_test_folds slices as the test set.
        gdf_slices = slice_frame(gdf, config.k, shuffle=True, explode=True)
        train_series = pl.concat(gdf_slices[:config.k - config.n_test_folds])["cv_index"].sort()
        test_series = pl.concat(gdf_slices[config.k - config.n_test_folds:])["cv_index"].sort()
        yield train_series, test_series

def stratified_combinatorial(df: pl.DataFrame, config: StratifiedConfig) -> Generator[Tuple[pl.Series, pl.Series], None, None]:
    k = config.k
    df = check_add_cv_index(df)
    sdf = df.group_by(config.groups).agg(pl.col("cv_index"))
    if config.shuffle:
        sdf = sdf.with_columns(pl.col("cv_index").list.sample(fraction=1,shuffle=True))
    sliced = sdf.select(pl.col("cv_index").map_elements(lambda s: hacky_list_relative_slice(s,k)).alias("hacky_cv_index")).unnest("hacky_cv_index")
    test_folds = list(itertools.combinations(range(config.k),config.n_test_folds))
    for test_fold in test_folds:
        test_series=sliced.select(pl.concat_list([sliced["fold_{}".format(j)] for j in range(config.k) if j in test_fold]).alias("cv_index")).explode("cv_index")["cv_index"]
        train_series = sliced.select(pl.concat_list([sliced["fold_{}".format(j)] for j in range(config.k) if j not in test_fold]).alias("cv_index")).explode("cv_index")["cv_index"]
        yield train_series,test_series

def stratified_montecarlo(df: pl.DataFrame, config: StratifiedConfig) -> Generator[Tuple[pl.Series, pl.Series], None, None]:
    k = config.k
    df = check_add_cv_index(df)
    sdf = df.group_by(config.groups).agg(pl.col("cv_index"))
    if config.shuffle:
        sdf = sdf.with_columns(pl.col("cv_index").list.sample(fraction=1, shuffle=True))
    # Instead of hacky_list_relative_slice, sample the test indexes directly within each group.
    for i in range(config.montecarlo_replicas):
        traintest = sdf.select(
            pl.col("cv_index"),
            pl.col("cv_index").list.sample(fraction=config.n_test_folds / k).alias("test_index"),
        ).with_columns(
            # The training indexes are the group's indexes that were not sampled for testing.
            pl.col("cv_index").list.set_difference(pl.col("test_index")).alias("train_index")
        )
        train_series = traintest.select("train_index").explode("train_index")["train_index"]
        test_series = traintest.select("test_index").explode("test_index")["test_index"]
        yield train_series, test_series
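
The Monte Carlo variants rely on one idea: draw a random training subset per replica, offset the seed by the replica number, and treat everything not drawn as the test set. The sketch below shows that idea with the standard library's `random` module instead of Polars; `montecarlo_splits` and its parameters are hypothetical names chosen to echo `KFoldConfig`.

```python
import random
from typing import Iterator, List, Optional, Tuple


def montecarlo_splits(
    indexes: List[int],
    k: int,
    n_test_folds: int,
    replicas: int,
    random_state: Optional[int],
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; train holds (k - n_test_folds) / k of the data."""
    n_train = len(indexes) * (k - n_test_folds) // k
    for i in range(replicas):
        # Offset the seed per replica: splits differ across replicas but are reproducible.
        seed = None if random_state is None else random_state + i
        rng = random.Random(seed)
        train = sorted(rng.sample(indexes, n_train))
        train_set = set(train)
        test = [ix for ix in indexes if ix not in train_set]
        yield train, test


indexes = list(range(10))
splits = list(montecarlo_splits(indexes, k=5, n_test_folds=1, replicas=3, random_state=42))
for train, test in splits:
    print(len(train), len(test))  # 8 2 on every replica
```

Handling `random_state=None` explicitly matters here, because adding the replica number to `None` would raise a `TypeError`.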


In [None]:
def stratified_kfold(df:pl.DataFrame,group:List[str],k:int,shuffle:bool=True,pre_name:str="",target_names:Tuple[str,str] = ("train","test")):
    df = check_add_cv_index(df)
    sdf = df.group_by(group).agg(pl.col("cv_index"))
    if shuffle:
        sdf = sdf.with_columns(pl.col("cv_index").list.sample(fraction=1,shuffle=True))
    sliced = sdf.select(pl.col("cv_index").map_elements(lambda s: hacky_list_relative_slice(s,k)).alias("hacky_cv_index")).unnest("hacky_cv_index")

    index_df = df.select(pl.col("cv_index"))

    for i in range(k):
            test_set = sliced.select(pl.col("fold_{}".format(i)).alias("cv_index")).explode("cv_index")
            train_set = sliced.select(pl.concat_list([sliced["fold_{}".format(j)] for j in range(k) if j!=i]).alias("cv_index")).explode("cv_index")
            test_df = test_set.with_columns([pl.lit(target_names[1]).alias(pre_name+"fold_{}".format(i))])
            train_df = train_set.with_columns([pl.lit(target_names[0]).alias(pre_name+"fold_{}".format(i))])
            index_df = index_df.join(train_df.vstack(test_df), on="cv_index", how="left")
    return index_df
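
The proportional stratification used above can be pictured without Polars: group the row indexes by label, cut each group into `k` parts, and build fold `i` from the `i`-th part of every group. The following sketch assumes a single grouping column given as a list of labels; `stratified_folds` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def stratified_folds(labels: List[Hashable], k: int) -> List[List[int]]:
    """Return k folds of row indexes in which each label keeps its overall proportion."""
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_group[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(k)]
    for members in by_group.values():
        size = len(members) // k
        for i in range(k):
            # The last part of each group absorbs the remainder.
            end = (i + 1) * size if i < k - 1 else len(members)
            folds[i].extend(members[i * size:end])
    return folds


labels = ["a"] * 6 + ["b"] * 3
folds = stratified_folds(labels, 3)
print(folds)  # [[0, 1, 6], [2, 3, 7], [4, 5, 8]]
```

Each fold contains two `a` rows and one `b` row, matching the 2:1 ratio of the full data.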

In [None]:
def purged_kfold(df:pl.DataFrame,group:List[str],k:int,shuffle:bool=True,pre_name:str="",target_names:Tuple[str,str] = ("train","test")):
    #group is a list of columns that will be used to group the data
    #k is the number of splits
    #we will use the group to split the data into k groups and then we will use the group to make sure that the same group is not in the train and test set
    #we will use the row index to split the data
    #first we will shuffle the data
    df = check_add_cv_index(df)
    gdf = df.group_by(group).agg(pl.col("cv_index"))
    #then we will split the data into k slices we need to explode the index since we are working on groups
    gdf_slices = slice_frame(gdf,k,shuffle=shuffle,explode=True)
    #then we will iterate over the slices and use the slice as the test set and the rest as the train set
    index_df = df.select(pl.col("cv_index"))
    for i in range(k):
        test_set = gdf_slices[i]
        train_set = [gdf_slices[j] for j in range(k) if j!=i]
        train_set = pl.concat(train_set)

        test_df = test_set.with_columns([pl.lit(target_names[1]).alias(pre_name+"fold_{}".format(i))])
        train_df = train_set.with_columns([pl.lit(target_names[0]).alias(pre_name+"fold_{}".format(i))])
        fold_df = train_df.vstack(test_df)
        index_df = index_df.join(fold_df, on="cv_index", how="left")
    return index_df
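
Purging works at the group level: whole groups are assigned to a fold, so no group ever contributes rows to both train and test. The sketch below illustrates this in plain Python under the assumption of a single grouping column and no shuffling; `purged_folds` and the `u1`..`u4` labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def purged_folds(group_labels: List[Hashable], k: int) -> List[List[int]]:
    """Assign whole groups to k folds and return the row indexes of each fold."""
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, group in enumerate(group_labels):
        by_group[group].append(idx)

    groups = list(by_group)  # groups in order of first appearance
    size = len(groups) // k
    bounds = [0] + [size * i for i in range(1, k)] + [len(groups)]
    return [
        [idx for group in groups[bounds[i]:bounds[i + 1]] for idx in by_group[group]]
        for i in range(k)
    ]


group_labels = ["u1", "u1", "u2", "u3", "u3", "u3", "u4"]
folds = purged_folds(group_labels, 2)
print(folds)  # [[0, 1, 2], [3, 4, 5, 6]]

test, train = folds[0], folds[1]
test_groups = {group_labels[i] for i in test}
train_groups = {group_labels[i] for i in train}
print(test_groups & train_groups)  # set()
```

Because folds are cut by number of groups rather than number of rows, fold sizes can be uneven when group sizes differ, as in the 3-row and 4-row folds here.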

In [42]:
config = KFoldConfig(k=10,n_test_folds=5)
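# The recorded output below (index_start, 252 combinations) comes from the combinatorial generator.
generator = kfold_combinatorial(df,config)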
i = 0
for train,test in generator:
    print(train)
    print(test)
    print("")
    i+=1
print("total number of folds",i)

index_start shape: (10,)
Series: '' [i64]
[
	0
	1
	2
	3
	4
	5
	6
	7
	8
	9
]
num of test_folds combinations 252
shape: (5,)
Series: 'cv_index' [i64]
[
	0
	1
	3
	5
	9
]
shape: (5,)
Series: 'cv_index' [i64]
[
	2
	4
	6
	7
	8
]

shape: (5,)
Series: 'cv_index' [i64]
[
	0
	1
	3
	5
	8
]
shape: (5,)
Series: 'cv_index' [i64]
[
	2
	4
	6
	7
	9
]

shape: (5,)
Series: 'cv_index' [i64]
[
	0
	1
	3
	8
	9
]
shape: (5,)
Series: 'cv_index' [i64]
[
	2
	4
	5
	6
	7
]

shape: (5,)
Series: 'cv_index' [i64]
[
	0
	1
	5
	8
	9
]
shape: (5,)
Series: 'cv_index' [i64]
[
	2
	3
	4
	6
	7
]

shape: (5,)
Series: 'cv_index' [i64]
[
	0
	3
	5
	8
	9
]
shape: (5,)
Series: 'cv_index' [i64]
[
	1
	2
	4
	6
	7
]

shape: (5,)
Series: 'cv_index' [i64]
[
	1
	3
	5
	8
	9
]
shape: (5,)
Series: 'cv_index' [i64]
[
	0
	2
	4
	6
	7
]

shape: (5,)
Series: 'cv_index' [i64]
[
	0
	1
	2
	3
	5
]
shape: (5,)
Series: 'cv_index' [i64]
[
	4
	6
	7
	8
	9
]

shape: (5,)
Series: 'cv_index' [i64]
[
	0
	1
	2
	3
	9
]
shape: (5,)
Series: 'cv_index' [i64]
[
	4


In [None]:



# check_add_cv_index (with its optional strict flag) is defined in an earlier cell and reused here;
# redefining it without the strict parameter would break the generators that call it with strict=True.
def vanilla_kfold(df:pl.DataFrame,group,k:int,shuffle:bool=True,pre_name:str="",target_names:Tuple[str,str] = ("train","test")):
    #we will use the row index to split the data
    #first we will shuffle the data
    df = check_add_cv_index(df)
    if shuffle:
        df = shuffle_frame(df)
    #then we will split the data into k slices
    df_slices = slice_frame(df,k,shuffle=False,explode=False)
    index_df = df.select(pl.col("cv_index"))
    for i in range(k):
        test_set = df_slices[i]
        train_set = [df_slices[j] for j in range(k) if j!=i]
        train_set = pl.concat(train_set)

        test_df = test_set.with_columns([pl.lit(target_names[1]).alias(pre_name+"fold_{}".format(i))])
        train_df = train_set.with_columns([pl.lit(target_names[0]).alias(pre_name+"fold_{}".format(i))])
        fold_df = train_df.vstack(test_df)
        index_df = index_df.join(fold_df, on="cv_index", how="left")
    return index_df
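
The `*_kfold` functions return a wide table with one row per `cv_index` and one column per fold, each cell holding the train or test label. A plain-Python sketch of that layout, assuming unshuffled consecutive slices, is shown below; `fold_label_table` is a hypothetical name.

```python
from typing import Dict, List, Tuple, Union


def fold_label_table(
    n_rows: int, k: int, target_names: Tuple[str, str] = ("train", "test")
) -> List[Dict[str, Union[int, str]]]:
    """Build one record per row index with a train/test label for each of the k folds."""
    size = n_rows // k
    bounds = [0] + [size * i for i in range(1, k)] + [n_rows]
    table = []
    for idx in range(n_rows):
        row: Dict[str, Union[int, str]] = {"cv_index": idx}
        for i in range(k):
            in_test = bounds[i] <= idx < bounds[i + 1]
            row[f"fold_{i}"] = target_names[1] if in_test else target_names[0]
        table.append(row)
    return table


for row in fold_label_table(4, 2):
    print(row)
# {'cv_index': 0, 'fold_0': 'test', 'fold_1': 'train'}
# {'cv_index': 1, 'fold_0': 'test', 'fold_1': 'train'}
# {'cv_index': 2, 'fold_0': 'train', 'fold_1': 'test'}
# {'cv_index': 3, 'fold_0': 'train', 'fold_1': 'test'}
```

Each row is a test row in exactly one fold column, which makes the table easy to join back onto the original DataFrame by `cv_index`.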

In [None]:
def generate_simulated_data(n_samples: int, n_classes: int) -> pl.DataFrame:
    # Note: exactly two classes are simulated; n_classes is currently unused.
    # Class sizes are computed so they always sum to n_samples, even when n_samples is odd.
    n_class_0 = n_samples // 2
    n_class_1 = n_samples - n_class_0
    class_0 = np.random.multivariate_normal(mean=[30, 50000], cov=[[100, 0], [0, 1000000]], size=n_class_0)
    class_1 = np.random.multivariate_normal(mean=[50, 80000], cov=[[100, 0], [0, 1000000]], size=n_class_1)
    data = {
        "age": np.concatenate((class_0[:, 0], class_1[:, 0])),
        "income": np.concatenate((class_0[:, 1], class_1[:, 1])),
        "gender": np.random.choice(["Male", "Female"], size=n_samples),
        "education": np.random.choice(["Bachelor's", "Master's", "PhD"], size=n_samples),
        "target": np.concatenate((np.zeros(n_class_0), np.ones(n_class_1))),
        "fold": np.random.choice([0, 1, 2], size=n_samples),
    }
    return pl.DataFrame(data)



In [5]:
def convert_utf8_to_enum(df: pl.DataFrame, threshold: float = 0.2) -> pl.DataFrame:
    if not 0 < threshold < 1:
        raise ValueError("Threshold must be between 0 and 1 (exclusive).")

    for column in df.columns:
        if df[column].dtype == pl.Utf8 and len(df[column]) > 0:
            unique_values = df[column].unique()
            unique_ratio = len(unique_values) / len(df[column])

            if unique_ratio <= threshold:
                # Drop nulls and sort so the Enum category order is deterministic
                enum_dtype = pl.Enum(unique_values.drop_nulls().sort().to_list())
                df = df.with_columns(df[column].cast(enum_dtype))
            else:
                print(f"Column '{column}' has a high ratio of unique values ({unique_ratio:.2f}). Skipping conversion to Enum.")
        elif df[column].dtype == pl.Utf8 and len(df[column]) == 0:
            print(f"Column '{column}' is empty. Skipping conversion to Enum.")

    return df


def evaluate_model(pipeline: Pipeline, X, y):
    predictions = pipeline.predict(X)
    accuracy = accuracy_score(y, predictions)
    return accuracy

def convert_enum_to_physical(df: pl.DataFrame) -> pl.DataFrame:
    df_physical = df.with_columns(
        [pl.col(col).to_physical() for col in df.columns if df[col].dtype == pl.Enum]
    )
    return df_physical

def create_pipeline(df: pl.DataFrame, input_config: InputConfig, classifier_config: ClassifierConfig) -> Pipeline:
    transformers = []
    for feature_set in input_config.feature_sets:
        numerical_features = [feature.column_name for feature in feature_set.numerical]
        if numerical_features:
            # Simplification: the first feature's scaler is applied to every numerical
            # column in the set; the other features' scaler_type values are ignored.
            scaler = feature_set.numerical[0].get_scaler()
            transformers.append(("numerical", scaler, numerical_features))

        categorical_features = [feature.column_name for feature in feature_set.categorical]
        if categorical_features:
            for feature in feature_set.categorical:
                if feature.one_hot_encoding:
                    if df[feature.column_name].dtype == pl.Categorical:
                        categories = [df[feature.column_name].unique().to_list()]
                    elif df[feature.column_name].dtype == pl.Enum:
                        categories = [df[feature.column_name].dtype.categories]
                    else:
                        raise ValueError(f"Column '{feature.column_name}' must be of type pl.Categorical or pl.Enum for one-hot encoding.")
                    one_hot_encoder = OneHotEncoder(categories=categories, handle_unknown='error', sparse_output=False)
                    transformers.append((f"categorical_{feature.column_name}", one_hot_encoder, [feature.column_name]))
                else:
                    if df[feature.column_name].dtype not in [pl.Float32, pl.Float64]:
                        raise ValueError(f"Column '{feature.column_name}' must be of type pl.Float32 or pl.Float64 for physical representation.")
                    transformers.append((f"categorical_{feature.column_name}", "passthrough", [feature.column_name]))

    preprocessor = ColumnTransformer(transformers)

    # Create the classifier based on the classifier configuration
    if isinstance(classifier_config.classifier, LogisticRegressionConfig):
        classifier = LogisticRegression(**classifier_config.classifier.dict(exclude={"classifier_name"}))
    elif isinstance(classifier_config.classifier, RandomForestClassifierConfig):
        classifier = RandomForestClassifier(**classifier_config.classifier.dict(exclude={"classifier_name"}))
    elif isinstance(classifier_config.classifier, HistGradientBoostingClassifierConfig):
        classifier = HistGradientBoostingClassifier(**classifier_config.classifier.dict(exclude={"classifier_name"}))
    else:
        raise ValueError(f"Unsupported classifier: {type(classifier_config.classifier)}")

    pipeline = Pipeline([("preprocessor", preprocessor), ("classifier", classifier)])
    pipeline.set_output(transform="polars")
    return pipeline

# Example usage
n_samples = 1000
n_classes = 2
df = generate_simulated_data(n_samples, n_classes)
# Convert low-cardinality string columns to Enum
df_enum = convert_utf8_to_enum(df, threshold=0.8)
# No physical conversion is applied here; df_physical is just an alias of df_enum
df_physical = df_enum

# Declare feature sets using Pydantic classes
numerical_features = [
    NumericalFeature(column_name="age", name="Age"),
    NumericalFeature(column_name="income", name="Income"),
]
categorical_features = [
    CategoricalFeature(column_name="gender", name="Gender"),
    CategoricalFeature(column_name="education", name="Education"),
]
feature_set = FeatureSet(
    numerical=numerical_features,
    categorical=categorical_features,
)
input_config = InputConfig(feature_sets=[feature_set])

# Declare classifier configurations
classifier_configs = [
    ClassifierConfig(
        classifier=LogisticRegressionConfig(
            penalty="l2",
            C=1.0,
            solver="lbfgs",
            max_iter=100,
        )
    ),
    ClassifierConfig(
        classifier=RandomForestClassifierConfig(
            n_estimators=100,
            max_depth=5,
            min_samples_split=2,
            min_samples_leaf=1,
            random_state=42,
        )
    ),
    ClassifierConfig(
        classifier=HistGradientBoostingClassifierConfig(
            learning_rate=0.1,
            max_iter=100,
            max_leaf_nodes=31,
            min_samples_leaf=20,
            random_state=42,
        )
    ),
]

# Simulate a train/validation/test split using the random "fold" column
fold_name = "fold"
train_df = df_physical.filter(pl.col(fold_name) == 0)
val_df = df_physical.filter(pl.col(fold_name) == 1)
test_df = df_physical.filter(pl.col(fold_name) == 2)

# Evaluate each classifier configuration
for classifier_config in classifier_configs:
    print(f"Evaluating {type(classifier_config.classifier).__name__}")

    # Create the pipeline
    pipeline = create_pipeline(df_enum, input_config, classifier_config)

    # Fit the pipeline on the training data (target excluded from the features)
    pipeline.fit(train_df.drop([fold_name, "target"]), train_df["target"])

    # Evaluate the model on the validation and test data
    val_accuracy = evaluate_model(pipeline, val_df.drop([fold_name, "target"]), val_df["target"])
    test_accuracy = evaluate_model(pipeline, test_df.drop([fold_name, "target"]), test_df["target"])

    print("Validation accuracy:", val_accuracy)
    print("Test accuracy:", test_accuracy)
    print()

Evaluating LogisticRegressionConfig
Validation accuracy: 1.0
Test accuracy: 1.0

Evaluating RandomForestClassifierConfig
Validation accuracy: 1.0
Test accuracy: 1.0

Evaluating HistGradientBoostingClassifierConfig
Validation accuracy: 1.0
Test accuracy: 1.0
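
A perfect score from all three classifiers is expected for this simulated data, because the two classes barely overlap. The sketch below works through the arithmetic using the means and variances from `generate_simulated_data`; `separation_in_sigmas` is a hypothetical helper introduced only for this illustration.

```python
import math


def separation_in_sigmas(mean_0: float, mean_1: float, variance: float) -> float:
    """Distance between two class means in units of the shared standard deviation."""
    return abs(mean_1 - mean_0) / math.sqrt(variance)


# Means and variances copied from generate_simulated_data
age_gap = separation_in_sigmas(30, 50, 100)
income_gap = separation_in_sigmas(50_000, 80_000, 1_000_000)

print(f"age: {age_gap:.1f} sigma, income: {income_gap:.1f} sigma")
# prints "age: 2.0 sigma, income: 30.0 sigma"
```

With the income means 30 standard deviations apart, a single threshold on income separates the classes almost surely, so an accuracy of 1.0 says little about the relative quality of the classifiers.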


