In this example, we integrated Polars DataFrames with scikit-learn's pipeline and preprocessing tools to build a compact machine learning workflow. The goal was to use Polars for data manipulation and scikit-learn for model training and evaluation.

Here's a summary of what we accomplished:

1. Data Preparation:
   - We defined a `convert_utf8_to_enum` function that casts string (UTF-8) columns in a Polars DataFrame to the Enum data type when the ratio of unique values to rows is at or below a specified threshold.
   - We created Pydantic classes (`Feature`, `NumericalFeature`, `EmbeddingFeature`, `CategoricalFeature`, `FeatureSet`, and `InputConfig`) to define and validate the feature sets used in the machine learning pipeline.
   - We generated simulated data using the `generate_simulated_data` function to demonstrate the workflow.

2. Pipeline Creation:
   - We defined a `create_pipeline` function that takes an `InputConfig` instance and creates a scikit-learn `Pipeline` object.
   - The pipeline consists of a `ColumnTransformer` for preprocessing numerical and categorical features, followed by a `LinearSVC` classifier.
   - For numerical features, we used `StandardScaler` to standardize the data.
   - For categorical features, we used `"passthrough"` to pass the physical (integer) representation of the Enum columns directly through the pipeline, avoiding the need for `OneHotEncoder`. Note that the classifier then treats these integer codes as ordinal numeric values.
   - We called `pipeline.set_output(transform="polars")` so that the preprocessing steps return Polars DataFrames. This setting only affects `transform` output; `pipeline.predict()` still returns a NumPy array.

3. Model Training and Evaluation:
   - We simulated a single cross-validation split by filtering the DataFrame on a "fold" column: fold 0 for training, fold 1 for validation, and fold 2 for testing.
   - We fit the pipeline on the training data using `pipeline.fit()`.
   - We evaluated the model's performance on the validation and test data using the `evaluate_model` function, which calculates the accuracy using scikit-learn's `accuracy_score`.

4. Integration with Polars:
   - Throughout the example, we used Polars DataFrames for data manipulation and preprocessing.
   - We converted the Enum columns to their physical (integer) representation using `pl.col(col).to_physical()` to ensure compatibility with the pipeline.
   - We used Polars' `filter` and `drop` methods to select the appropriate subsets of data for training, validation, and testing.

In our next working session, we will focus on the following tasks:

1. Refactoring the Cynde-related methods to integrate the new approaches developed in this example.
2. Introducing equivalent Pydantic models for defining the model configurations, replacing the current dictionary-based approach.
3. Cleaning and generalizing all the cross-validation methods under a common framework to improve code organization and reusability.

By building upon the foundation established in this example and incorporating the planned enhancements, we aim to create a more robust, efficient, and user-friendly machine learning framework within the Cynde project.
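
As one possible starting point for the common cross-validation framework, the fixed assignment used in this example (fold 0 for training, fold 1 for validation, fold 2 for testing) could be rotated so that every fold serves once as the validation set. The sketch below is an assumption about how that might look: `rotate_folds` is a hypothetical helper, not an existing Cynde function.

```python
from typing import Iterator, List, Tuple


def rotate_folds(fold_ids: List[int]) -> Iterator[Tuple[List[int], int, int]]:
    """Yield (train_folds, val_fold, test_fold) for each rotation of fold_ids.

    Hypothetical helper: the validation fold moves through the list, the
    test fold is the next one (wrapping around), and the rest are training.
    """
    n = len(fold_ids)
    for i in range(n):
        val_fold = fold_ids[i]
        test_fold = fold_ids[(i + 1) % n]
        train_folds = [f for f in fold_ids if f not in (val_fold, test_fold)]
        yield train_folds, val_fold, test_fold


splits = list(rotate_folds([0, 1, 2]))
for train_folds, val_fold, test_fold in splits:
    print(train_folds, val_fold, test_fold)
# [2] 0 1
# [0] 1 2
# [1] 2 0
```

Each tuple could then drive the same Polars filters used in the example, for instance `df.filter(pl.col("fold").is_in(train_folds))` for the training rows.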

In [1]:
import numpy as np
import polars as pl
from typing import List, Optional
from pydantic import BaseModel, model_validator
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def convert_utf8_to_enum(df: pl.DataFrame, threshold: float = 0.5) -> pl.DataFrame:
    if not 0 < threshold < 1:
        raise ValueError("Threshold must be between 0 and 1 (exclusive).")

    for column in df.columns:
        if df[column].dtype == pl.Utf8 and len(df[column]) > 0:
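            # Drop nulls (not valid Enum categories) and sort so the integer codes are reproducible.
            unique_values = df[column].drop_nulls().unique().sort()
            unique_ratio = len(unique_values) / len(df[column])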

            if unique_ratio <= threshold:
                enum_dtype = pl.Enum(unique_values.to_list())
                df = df.with_columns(df[column].cast(enum_dtype))
            else:
                print(f"Column '{column}' has a high ratio of unique values ({unique_ratio:.2f}). Skipping conversion to Enum.")
        elif df[column].dtype == pl.Utf8 and len(df[column]) == 0:
            print(f"Column '{column}' is empty. Skipping conversion to Enum.")

    return df

class Feature(BaseModel):
    column_name: str
    name: str
    description: Optional[str] = None

    @model_validator(mode='before')
    def validate_column_name(cls, values):
        column_name = values.get("column_name")
        context = values.get("context")
        if context is not None and isinstance(context, pl.DataFrame):
            if column_name not in context.columns:
                raise ValueError(f"Column '{column_name}' not found in the DataFrame.")
        return values

    model_config = {"arbitrary_types_allowed": True, "extra": "allow"}

class NumericalFeature(Feature):
    @model_validator(mode='before')
    def validate_numerical_column(cls, values):
        column_name = values.get("column_name")
        context = values.get("context")
        if context is not None and isinstance(context, pl.DataFrame):
            if column_name not in context.columns:
                raise ValueError(f"Column '{column_name}' not found in the DataFrame.")
            if context[column_name].dtype not in [
                pl.Boolean,
                pl.Int8,
                pl.Int16,
                pl.Int32,
                pl.Int64,
                pl.UInt8,
                pl.UInt16,
                pl.UInt32,
                pl.UInt64,
                pl.Float32,
                pl.Float64,
                pl.Decimal,
            ]:
                raise ValueError(
                    f"Column '{column_name}' must be of a numeric type (Boolean, Integer, Unsigned Integer, Float, or Decimal)."
                )
        return values

class EmbeddingFeature(Feature):
    @model_validator(mode='before')
    def validate_embedding_column(cls, values):
        column_name = values.get("column_name")
        context = values.get("context")
        if context is not None and isinstance(context, pl.DataFrame):
            if column_name not in context.columns:
                raise ValueError(f"Column '{column_name}' not found in the DataFrame.")
            if context[column_name].dtype not in [pl.List(pl.Float32), pl.List(pl.Float64)]:
                raise ValueError(f"Column '{column_name}' must be of type pl.List(pl.Float32) or pl.List(pl.Float64).")
        return values

class CategoricalFeature(Feature):
    @model_validator(mode='before')
    def validate_categorical_column(cls, values):
        column_name = values.get("column_name")
        context = values.get("context")
        if context is not None and isinstance(context, pl.DataFrame):
            if column_name not in context.columns:
                raise ValueError(f"Column '{column_name}' not found in the DataFrame.")
            if context[column_name].dtype not in [
                pl.Utf8,
                pl.Categorical,
                pl.Enum,
                pl.Int8,
                pl.Int16,
                pl.Int32,
                pl.Int64,
                pl.UInt8,
                pl.UInt16,
                pl.UInt32,
                pl.UInt64,
            ]:
                raise ValueError(
                    f"Column '{column_name}' must be of type pl.Utf8, pl.Categorical, pl.Enum, or an integer type."
                )
        return values

class FeatureSet(BaseModel):
    numerical: List[NumericalFeature] = []
    embeddings: List[EmbeddingFeature] = []
    categorical: List[CategoricalFeature] = []

    model_config = {"arbitrary_types_allowed": True, "extra": "allow"}

class InputConfig(BaseModel):
    feature_sets: List[FeatureSet]

    def validate_with_dataframe(self, df: pl.DataFrame):
        for feature_set in self.feature_sets:
            for feature_type in ["numerical", "embeddings", "categorical"]:
                for feature in getattr(feature_set, feature_type):
                    feature.model_validate({"context": df, **feature.model_dump()})

    model_config = {"arbitrary_types_allowed": True, "extra": "allow"}

In [2]:
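def generate_simulated_data(n_samples: int, n_classes: int) -> pl.DataFrame:
    # Two Gaussian classes of n_samples // 2 rows each; n_samples is assumed to be even.
    # n_classes is currently unused: the function always generates two classes.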
    class_0 = np.random.multivariate_normal(mean=[30, 50000], cov=[[100, 0], [0, 1000000]], size=n_samples // 2)
    class_1 = np.random.multivariate_normal(mean=[50, 80000], cov=[[100, 0], [0, 1000000]], size=n_samples // 2)
    data = {
        "age": np.concatenate((class_0[:, 0], class_1[:, 0])),
        "income": np.concatenate((class_0[:, 1], class_1[:, 1])),
        "gender": np.random.choice(["Male", "Female"], size=n_samples),
        "education": np.random.choice(["Bachelor's", "Master's", "PhD"], size=n_samples),
        "target": np.concatenate((np.zeros(n_samples // 2), np.ones(n_samples // 2))),
        "fold": np.random.choice([0, 1, 2], size=n_samples),
    }
    return pl.DataFrame(data)



In [3]:
def evaluate_model(pipeline: Pipeline, X, y):
    predictions = pipeline.predict(X)
    accuracy = accuracy_score(y, predictions)
    return accuracy

def create_pipeline(input_config: InputConfig) -> Pipeline:
    transformers = []
    for feature_set in input_config.feature_sets:
        numerical_features = [feature.column_name for feature in feature_set.numerical]
        if numerical_features:
            transformers.append(("numerical", StandardScaler(), numerical_features))
        categorical_features = [feature.column_name for feature in feature_set.categorical]
        if categorical_features:
            transformers.append(("categorical", "passthrough", categorical_features))
    preprocessor = ColumnTransformer(transformers)
    classifier = LinearSVC(dual=False)
    pipeline = Pipeline([("preprocessor", preprocessor), ("classifier", classifier)])
    pipeline.set_output(transform="polars")
    return pipeline

# Example usage
n_samples = 1000
n_classes = 2
df = generate_simulated_data(n_samples, n_classes)

# Convert categorical columns to Enum
df_enum = convert_utf8_to_enum(df, threshold=0.8)

# Convert Enum columns to their physical representation
df_physical = df_enum.with_columns(
    [pl.col(col).to_physical() for col in df_enum.columns if df_enum[col].dtype == pl.Enum]
)

# Declare feature sets using Pydantic classes
numerical_features = [
    NumericalFeature(column_name="age", name="Age"),
    NumericalFeature(column_name="income", name="Income"),
]
categorical_features = [
    CategoricalFeature(column_name="gender", name="Gender"),
    CategoricalFeature(column_name="education", name="Education"),
]
feature_set = FeatureSet(
    numerical=numerical_features,
    categorical=categorical_features,
)
input_config = InputConfig(feature_sets=[feature_set])

# Validate feature sets with the DataFrame
try:
    input_config.validate_with_dataframe(df_physical)
    print("Feature set validation successful!")
except ValueError as e:
    print(f"Feature set validation failed: {str(e)}")

# Create the pipeline
pipeline = create_pipeline(input_config)

# Simulate cross-validation
fold_name = "fold"
train_df = df_physical.filter(pl.col(fold_name) == 0)
val_df = df_physical.filter(pl.col(fold_name) == 1)
test_df = df_physical.filter(pl.col(fold_name) == 2)

# Fit the pipeline on the training data (drop the target so X holds features only)
pipeline.fit(train_df.drop([fold_name, "target"]), train_df["target"])

# Evaluate the model on the validation and test data
val_accuracy = evaluate_model(pipeline, val_df.drop([fold_name, "target"]), val_df["target"])
test_accuracy = evaluate_model(pipeline, test_df.drop([fold_name, "target"]), test_df["target"])

print("Validation accuracy:", val_accuracy)
print("Test accuracy:", test_accuracy)

Feature set validation successful!
Validation accuracy: 1.0
Test accuracy: 1.0
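
The perfect accuracy is what one would expect from this simulated data rather than evidence of a strong model. A quick check of how far apart the two classes are, using only the means and variances passed to `generate_simulated_data` (the variable names below are illustrative):

```python
import math

# Means and variances copied from generate_simulated_data.
mean_class_0 = {"age": 30, "income": 50_000}
mean_class_1 = {"age": 50, "income": 80_000}
variance = {"age": 100, "income": 1_000_000}

# Gap between the class means, in standard deviations of each feature.
separation = {
    name: (mean_class_1[name] - mean_class_0[name]) / math.sqrt(variance[name])
    for name in variance
}
print(separation)  # {'age': 2.0, 'income': 30.0}
```

With the class means 30 standard deviations apart on `income`, the two classes are effectively linearly separable, so a linear classifier can split them without error. A more informative test of the workflow would need overlapping classes.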
