In [None]:
# this cell's metadata contains
# "nbsphinx": "hidden" so it is hidden by nbsphinx

def _set_paths() -> None:
    # set the correct path when launched from within PyCharm

    module_paths = ["pytools", "sklearndf"]

    import sys
    import os
    
    if 'cwd' not in globals():
        # noinspection PyGlobalUndefined
        global cwd
        cwd = os.path.join(os.getcwd(), os.pardir, os.pardir, os.pardir)
        os.chdir(cwd)   
    print(f"working dir is '{os.getcwd()}'")
    for module_path in module_paths:
        # compare the resolved source path, not the bare module name
        src_path = os.path.abspath(f"{cwd}/{os.pardir}/{module_path}/src")
        if src_path not in sys.path:
            sys.path.insert(0, src_path)
            print(f"added `{src_path}` to python paths")
        
def _ignore_warnings():
    # ignore irrelevant warnings that would affect the output of this tutorial notebook
    
    # ignore a useless LGBM warning
    import warnings
    warnings.filterwarnings("ignore", category=UserWarning, message=r".*Xcode_8\.3\.3")

_set_paths()
_ignore_warnings()

del _set_paths, _ignore_warnings

# Scikit-learn and data frames


The `sklearndf` package enhances scikit-learn for advanced support of data frames.

It addresses a common issue with scikit-learn: the outputs of transformers are numpy arrays, even when the input is a data frame. However, to inspect a model it is essential to keep track of the feature names.
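To make the issue concrete, here is a minimal sketch using only native scikit-learn and a made-up toy data frame (the column names and values are purely illustrative). It assumes scikit-learn's default output configuration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy data frame with named columns (illustrative values only)
toy_df = pd.DataFrame({"area": [50.0, 80.0, 120.0], "rooms": [2.0, 3.0, 5.0]})

# with default settings, the native transformer returns a bare numpy array
scaled = StandardScaler().fit_transform(toy_df)

print(type(scaled).__name__)       # the container type of the result
print(hasattr(scaled, "columns"))  # whether any column labels survived
```

The values are scaled correctly, but the labels `area` and `rooms` are no longer attached to the result.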

`sklearndf` enhances scikit-learn's estimators to:
- return data frames as results of transformations, preserving feature names as the column index
- add estimator properties that trace a feature name back to its original input feature; this is especially useful for transformers that create new features (e.g., one-hot encoding), and for pipelines that include such transformers

Using `sklearndf` is very simple: Append `DF` at the end of scikit-learn class names and you will get enhanced data frame support.

In [None]:
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from sklearndf.classification import RandomForestClassifierDF
from sklearndf.pipeline import PipelineDF, RegressorPipelineDF
from sklearndf.regression import RandomForestRegressorDF
from sklearndf.regression.extra import LGBMRegressorDF
from sklearndf.transformation import ColumnTransformerDF, OneHotEncoderDF, SimpleImputerDF
from sklearndf.transformation.extra import BorutaDF

We load our data:

In [None]:
housing_features_df, housing_target_sr = fetch_openml(data_id=42165, return_X_y=True, as_frame=True)
housing_features_df = housing_features_df.drop(["Id", "YrSold", "MoSold", "MSSubClass", "MiscVal"], axis=1)

The data set includes categorical features, e.g., garage types:

In [None]:
housing_features_df["GarageType"].unique()

Let us build a preprocessing pipeline which:

- for categorical features, fills missing values with the string `'<unknown>'` and then one-hot encodes them
- for numerical features, fills missing values with the median

In [None]:
categorical_features = housing_features_df.select_dtypes(object).columns
numerical_features = housing_features_df.select_dtypes("number").columns

categorical_features, numerical_features
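As a small illustration of this dtype-based split, the sketch below applies `select_dtypes` to a hand-made frame. The three columns borrow names from the housing data, but the values are invented, and the string column is given an explicit `object` dtype so that the selection behaves the same across pandas versions:

```python
import pandas as pd

# toy frame: one string column (explicit object dtype) and two numeric columns
toy_df = pd.DataFrame(
    {
        "GarageType": pd.Series(["Attchd", None, "Detchd"], dtype=object),
        "LotArea": [8450, 9600, 11250],
        "LotFrontage": [65.0, None, 68.0],
    }
)

# columns holding Python objects (here: strings) are treated as categorical
categorical = toy_df.select_dtypes(include="object").columns

# "number" matches all integer and floating point dtypes
numerical = toy_df.select_dtypes(include="number").columns

print(list(categorical), list(numerical))
```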

# Contrasting a scikit-learn and sklearndf pipeline

## A scikit-learn pipeline

We first build the preprocessing pipeline with native scikit-learn transformers.
This takes only a few lines of code; however, it does not allow us to keep track of feature names.

In [None]:
preprocessing_numeric = SimpleImputer(strategy="median", add_indicator=True)

preprocessing_categorical = Pipeline(
    steps=[
        ('imputer', SimpleImputer(missing_values=None, strategy='constant', fill_value='<unknown>')),
        ('one-hot', OneHotEncoder(sparse=False))
    ]
)

preprocessing = ColumnTransformer(
    transformers=[
        ('numeric', preprocessing_numeric, numerical_features),
        ('categorical', preprocessing_categorical, categorical_features),
    ]
)

In [None]:
preprocessing.fit_transform(X=housing_features_df, y=housing_target_sr)
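The loss of feature names matters most once one-hot encoding changes the number of columns. The following sketch repeats the idea on a two-column toy frame (invented values, native scikit-learn only): the result has three unnamed columns, and nothing in it says which of them came from `GarageType`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# toy input: one numeric column with a gap, one categorical column
toy_df = pd.DataFrame(
    {
        "LotArea": [8450.0, None, 11250.0],
        "GarageType": ["Attchd", "Detchd", "Attchd"],
    }
)

native = ColumnTransformer(
    transformers=[
        ("numeric", SimpleImputer(strategy="median"), ["LotArea"]),
        ("categorical", OneHotEncoder(), ["GarageType"]),
    ]
)

result = native.fit_transform(toy_df)

# the result may be sparse or dense depending on its density; normalise to dense
dense = result.toarray() if hasattr(result, "toarray") else np.asarray(result)

print(toy_df.shape, "->", dense.shape)
```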

The strength of `sklearndf` is that it maintains scikit-learn's conventions and expressiveness while preserving data frames, and thus keeps track of feature names.

## An sklearndf pipeline

The convention in `sklearndf` is to append `DF` at the end of each corresponding scikit-learn class. 
For instance, to reproduce the above example, we write:

In [None]:
preprocessing_numeric_df = SimpleImputerDF(strategy="median", add_indicator=True)

preprocessing_categorical_df = PipelineDF(
    steps=[
        ('imputer', SimpleImputerDF(missing_values=None, strategy='constant', fill_value='<unknown>')),
        ('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore"))
    ]
)

preprocessing_df = ColumnTransformerDF(
    transformers=[
        ('categorical', preprocessing_categorical_df, categorical_features),
        ('numeric', preprocessing_numeric_df, numerical_features),
    ]
)

In [None]:
transformed_df = preprocessing_df.fit_transform(X=housing_features_df, y=housing_target_sr)
transformed_df.head()

The `~sklearndf.transformation.ColumnTransformerDF.features_original_` attribute returns a series mapping the output columns (the series' index) to the input columns (the series' values):

In [None]:
preprocessing_df.features_original_.to_frame().head(10)

You can therefore easily select all output features generated from a given input feature:

In [None]:
garage_type_derivatives = preprocessing_df.features_original_ == "GarageType"

transformed_df.loc[:, garage_type_derivatives].head()
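The selection above relies only on ordinary pandas boolean indexing. The sketch below reproduces it with a hand-built stand-in for `features_original_` (the series and the frame are invented for illustration and are not produced by `sklearndf`):

```python
import pandas as pd

# stand-in for a `features_original_` series:
# index = output column names, values = originating input column names
features_original = pd.Series(
    ["GarageType", "GarageType", "LotArea"],
    index=["GarageType_Attchd", "GarageType_Detchd", "LotArea"],
)

# stand-in for the transformed data frame, with matching columns
transformed = pd.DataFrame(
    [[1.0, 0.0, 8450.0], [0.0, 1.0, 9600.0]],
    columns=features_original.index,
)

# boolean mask over the output columns, aligned with the frame's column index
mask = features_original == "GarageType"
garage_columns = transformed.loc[:, mask]

print(list(garage_columns.columns))
```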

# Supervised learners

## Regressors

As with transformers, every scikit-learn regressor and classifier has a `sklearndf` sibling obtained by appending `DF` to the class name, and the API remains the same. The results of all prediction and decision functions are returned as a pandas series (single output) or data frame (class probabilities or multi-output).

For a random forest regressor we get:

In [None]:
# a simplified features vector (we will use a pipeline for more sophisticated pre-processing further down)
numerical_features_df = housing_features_df.loc[:, numerical_features].fillna(0)

df_numerical_train, df_numerical_test, y_train, y_test = train_test_split(
    numerical_features_df,
    housing_target_sr,
    random_state=42
)

random_forest_regressor_df = RandomForestRegressorDF(
    n_estimators=100,
    max_depth=5,
    random_state=42,
    n_jobs=-3
)

random_forest_regressor_df.fit(X=df_numerical_train, y=y_train)
random_forest_regressor_df.score(X=df_numerical_test, y=y_test)

In [None]:
random_forest_regressor_df.predict(df_numerical_test.iloc[:10]).to_frame()
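For comparison, a native regressor returns a plain numpy array, so the row labels have to be reattached by hand. The sketch below shows what that manual step could look like on invented toy data; it illustrates the idea only and is not how `sklearndf` is implemented internally:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# toy training data with a non-default row index
X = pd.DataFrame(
    {"area": [50.0, 80.0, 120.0, 200.0], "rooms": [2.0, 3.0, 5.0, 7.0]},
    index=["a", "b", "c", "d"],
)
y = pd.Series([100.0, 150.0, 230.0, 400.0], index=X.index, name="price")

native = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# native output: a numpy array without row labels
raw = native.predict(X)

# manual re-labelling: reuse the index of the input and the name of the target
predictions = pd.Series(raw, index=X.index, name=y.name)

print(predictions)
```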

In [None]:
random_forest_regressor_df.get_params()

In [None]:
random_forest_regressor_df.set_params(max_depth=7)

The underlying scikit-learn regressor is stored in the `native_estimator` attribute:

In [None]:
random_forest_regressor_df.native_estimator

Property `is_fitted` tells if the regressor is fitted, and -- for fitted estimators -- property `features_in_` returns the names of the ingoing features as a pandas index.

In [None]:
random_forest_regressor_df.is_fitted

In [None]:
random_forest_regressor_df.features_in_

## Classifiers

Classifiers follow the same logic:

In [None]:
# we bin house prices into three classes (below 100k, 100k to below 200k, 200k and above) for multi-class classification
y_classes = housing_target_sr.apply(lambda x: '>=200k' if x >= 200000 else '>=100k' if x >= 100000 else '<100k')

df_numerical_train, df_numerical_test, y_classification_train, y_classification_test = train_test_split(
    numerical_features_df,
    y_classes,
    random_state=42
)
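As an aside, the same binning can be expressed with `pandas.cut`. The sketch below compares both approaches on a few invented prices chosen to sit on the class boundaries; `right=False` makes each bin include its lower edge, matching the `>=` comparisons in the lambda:

```python
import numpy as np
import pandas as pd

# invented prices that probe both class boundaries
prices = pd.Series([99_999, 100_000, 199_999, 200_000, 350_000])

# binning with nested conditionals, as in the cell above
by_lambda = prices.apply(
    lambda x: ">=200k" if x >= 200000 else ">=100k" if x >= 100000 else "<100k"
)

# equivalent binning with half-open intervals [lower, upper)
by_cut = pd.cut(
    prices,
    bins=[-np.inf, 100_000, 200_000, np.inf],
    right=False,
    labels=["<100k", ">=100k", ">=200k"],
)

print(by_cut.astype(str).tolist() == by_lambda.tolist())
```

Note that `pandas.cut` returns an ordered categorical series, which can be convenient when the class order matters.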

In [None]:
random_forest_classifier_df = RandomForestClassifierDF(
    n_estimators=100,
    max_depth=5,
    random_state=42,
    n_jobs=-3
)
random_forest_classifier_df.fit(df_numerical_train, y_classification_train)
random_forest_classifier_df.score(df_numerical_test, y_classification_test)

In [None]:
random_forest_classifier_df.predict(df_numerical_test.iloc[:10]).to_frame()

In [None]:
random_forest_classifier_df.predict_proba(df_numerical_test.iloc[:10])

In [None]:
random_forest_classifier_df.predict_log_proba(df_numerical_test.iloc[:10])
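With a native classifier, the class probabilities come back as an unlabelled array whose column order follows the `classes_` attribute. A minimal sketch of the manual labelling, on invented toy data and using native scikit-learn only:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# toy data: one feature, three price classes
X = pd.DataFrame({"area": [40.0, 55.0, 90.0, 110.0, 180.0, 220.0]})
y = pd.Series(["<100k", "<100k", ">=100k", ">=100k", ">=200k", ">=200k"])

clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# label the rows with the input index and the columns with the class names
proba = pd.DataFrame(clf.predict_proba(X), index=X.index, columns=clf.classes_)

print(proba)
```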

## Pipeline

We can combine the above steps to build a full predictive pipeline. `sklearndf` provides two useful, specialised pipeline objects for this, `~sklearndf.pipeline.RegressorPipelineDF` and `~sklearndf.pipeline.ClassifierPipelineDF`. Both implement a special two-step pipeline with one pre-processing step and one prediction step, while staying compatible with the general sklearn pipeline idiom. 

In [None]:
pipeline_df = RegressorPipelineDF(
    preprocessing=preprocessing_df,
    regressor=RandomForestRegressorDF(
        n_estimators=1000,
        max_features=2/3,
        max_depth=7,
        random_state=42,
        n_jobs=-3
    )
)

In [None]:
df_train, df_test, y_train, y_test = train_test_split(housing_features_df, housing_target_sr, random_state=42)
pipeline_df.fit(df_train, y_train)
pipeline_df.score(df_test, y_test)
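For readers more familiar with native scikit-learn, the two-step structure roughly corresponds to a `Pipeline` with one preprocessing step and one final estimator. The sketch below shows that native analogue on invented toy data; the step names `preprocessing` and `regressor` are chosen here to mirror the arguments above and are otherwise arbitrary:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# toy data with a few missing values
X = pd.DataFrame(
    {
        "area": [50.0, np.nan, 120.0, 200.0, 90.0, 150.0],
        "rooms": [2.0, 3.0, np.nan, 7.0, 3.0, 5.0],
    }
)
y = pd.Series([100.0, 150.0, 230.0, 400.0, 170.0, 300.0])

# native analogue: exactly one preprocessing step followed by one regressor
native_pipeline = Pipeline(
    steps=[
        ("preprocessing", SimpleImputer(strategy="median")),
        ("regressor", RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)),
    ]
)

native_pipeline.fit(X, y)
predictions = native_pipeline.predict(X)

print(list(native_pipeline.named_steps))
```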

# Extras

`sklearndf` also provides some additional estimators developed by Gamma or third parties, which are useful additions to the scikit-learn repertoire, and which follow the scikit-learn idiom. These are provided in `.extra` modules, such as

- `sklearndf.regression.extra.LGBMRegressorDF`
- `sklearndf.transformation.extra.BorutaDF`
- `sklearndf.transformation.extra.OutlierRemoverDF`

## LightGBM regressor

In [None]:
lgbm_df = LGBMRegressorDF(n_estimators=100, max_depth=8)
lgbm_df.fit(df_numerical_train, y_train)
lgbm_df.predict(df_numerical_test.iloc[:10]).to_frame()

## Boruta

[Boruta](https://www.jstatsoft.org/article/view/v036i11) is a feature selection method that eliminates all features whose predictive power is no better than random noise.

The `sklearndf.transformation.extra.BorutaDF` transformer provides easy access to this powerful method. The basis of this is a tree-based learner, usually a random forest.

For the random forest, we rely on default parameters but set the maximum tree depth to 5 (for Boruta, setting a depth between 3 and 7 is highly recommended and depends on the number of features and expected complexity of the feature/target interactions). The number of trees is automatically managed by the Boruta feature selector (argument `n_estimators="auto"`).


In [None]:
boruta_pipeline = PipelineDF(
    steps=[
        ('preprocess', preprocessing_df),
        ('boruta', BorutaDF(
            estimator=RandomForestRegressorDF(max_depth=5, n_jobs=-3), 
            n_estimators="auto", 
            random_state=42,
            verbose=2
        )),
    ]
)

In [None]:
boruta_pipeline.fit(X=housing_features_df, y=housing_target_sr)

Boruta is implemented as an sklearn transformer; its output features are all features that passed the Boruta test.

In [None]:
boruta_pipeline.features_out_.to_list()
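To give an intuition for the Boruta test, the sketch below runs a single, heavily simplified round of the underlying idea on synthetic data: each feature gets a shuffled "shadow" copy, and a feature passes the round only if its importance exceeds that of the best shadow. The actual algorithm repeats this over many iterations and applies a statistical test; none of that is shown here, and the data and names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 300

# synthetic data: one informative feature, one pure-noise feature
X = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=n)

# shadow features: shuffled copies that keep each feature's distribution
# but destroy any relationship with the target
shadows = pd.DataFrame(
    {f"shadow_{name}": rng.permutation(X[name].to_numpy()) for name in X.columns}
)
X_all = pd.concat([X, shadows], axis=1)

forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)
forest.fit(X_all, y)

importances = pd.Series(forest.feature_importances_, index=X_all.columns)

# a real feature passes this round if it beats the most important shadow
threshold = importances[shadows.columns].max()
passed = importances[X.columns] > threshold

print(passed)
```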

`sklearndf` allows us to trace outgoing features back to the original features from which they were derived, using the `~sklearndf.TransformerDF.features_original_` property. This is useful here as we want to know which features to eliminate before putting them into the pipeline.

In our example, feature `BsmtQual_Ex` is a derivative of feature `BsmtQual`, obtained through one-hot encoding: 

In [None]:
boruta_pipeline.features_original_.to_frame()

So, to obtain all features we want to select from the original data set, we can select the unique ingoing features from the original feature mapping:

In [None]:
boruta_pipeline.features_original_.unique()
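The unique input names can then be used to subset the original data frame with ordinary pandas indexing. A minimal sketch, using a hand-built stand-in for the mapping and an invented raw frame (neither is produced by `sklearndf` here):

```python
import pandas as pd

# stand-in for the pipeline's `features_original_` after feature selection:
# index = surviving output columns, values = originating input columns
features_original = pd.Series(
    ["BsmtQual", "GarageType", "GarageType", "LotArea"],
    index=["BsmtQual_Ex", "GarageType_Attchd", "GarageType_Detchd", "LotArea"],
)

# stand-in for the raw data; `Street` has no surviving derivative
raw_df = pd.DataFrame(
    {
        "BsmtQual": ["Ex", "Gd"],
        "GarageType": ["Attchd", "Detchd"],
        "LotArea": [8450, 9600],
        "Street": ["Pave", "Pave"],
    }
)

# unique input features, in order of first appearance
selected_inputs = list(features_original.unique())
reduced_df = raw_df.loc[:, selected_inputs]

print(list(reduced_df.columns))
```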