In [4]:
# this cell's metadata contains
# "nbsphinx": "hidden" so it is hidden by nbsphinx

def _ignore_warnings():
    # ignore irrelevant warnings that would affect the output of this tutorial notebook
    
    # ignore a useless LGBM warning
    import warnings
    warnings.filterwarnings("ignore", category=UserWarning, message=r".*Xcode_8\.3\.3")

_ignore_warnings()

del _ignore_warnings



# Scikit-learn and data frames


The `sklearndf` package extends scikit-learn with native support for pandas data frames.

It addresses a common issue with scikit-learn: the outputs of transformers are numpy arrays, even when the input is a data frame. However, to inspect a model it is essential to keep track of the feature names.

`sklearndf` enhances scikit-learn's estimators to:

- return data frames as results of transformations, preserving feature names as the column index
- add estimator properties that trace a feature name back to its original input feature; this is especially useful for transformers that create new features (e.g., one-hot encoding), and for pipelines that include such transformers

Using `sklearndf` is very simple: Append `DF` at the end of scikit-learn class names and you will get enhanced data frame support.
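
To make the underlying issue concrete, here is a minimal sketch using only native scikit-learn and pandas. The toy data frame and its column names are made up for illustration; the sketch only shows the default behaviour of a native transformer, not anything specific to `sklearndf`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy data frame with named columns (hypothetical data)
df = pd.DataFrame({"area": [50.0, 80.0, 120.0], "rooms": [2.0, 3.0, 5.0]})

# by default, a native scikit-learn transformer returns a plain numpy array
scaled = StandardScaler().fit_transform(df)
print(type(scaled).__name__)

# restoring the names by hand works for simple transformers, but becomes
# error-prone once transformers add or drop columns
scaled_df = pd.DataFrame(scaled, index=df.index, columns=df.columns)
print(scaled_df.columns.to_list())
```

Re-attaching names manually is only safe here because the scaler keeps the number and order of columns unchanged.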

In [2]:
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from sklearndf.classification import RandomForestClassifierDF
from sklearndf.pipeline import PipelineDF, RegressorPipelineDF
from sklearndf.regression import RandomForestRegressorDF
from sklearndf.regression.extra import LGBMRegressorDF
from sklearndf.transformation import ColumnTransformerDF, OneHotEncoderDF, SimpleImputerDF
from sklearndf.transformation.extra import BorutaDF



We load our data:

In [5]:
housing_features_df, housing_target_sr = fetch_openml(data_id=42165, return_X_y=True, as_frame=True)
housing_features_df = housing_features_df.drop(["Id", "YrSold", "MoSold", "MSSubClass", "MiscVal"], axis=1)

The data set includes categorical features, e.g., garage types:

In [6]:
housing_features_df["GarageType"].unique()

array(['Attchd', 'Detchd', 'BuiltIn', 'CarPort', None, 'Basment',
       '2Types'], dtype=object)

Let us build a preprocessing pipeline which:

- for categorical variables, fills missing values with the string `'<unknown>'` and then one-hot encodes
- for numerical variables, fills missing values with the column median and adds indicator columns that flag the imputed values
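
As a rough illustration of what these two steps do to the data, the following pandas-only sketch applies them to a tiny, made-up data frame. It is not the pipeline built below, and it omits the missing-value indicator columns.

```python
import pandas as pd

# toy data with one categorical and one numerical column (hypothetical values)
toy_df = pd.DataFrame(
    {
        "GarageType": ["Attchd", None, "Detchd"],
        "LotFrontage": [65.0, None, 80.0],
    }
)

# categorical: replace missing values with a placeholder, then one-hot encode
categorical = pd.get_dummies(toy_df[["GarageType"]].fillna("<unknown>"), dtype=float)

# numerical: replace missing values with the column median
numerical = toy_df[["LotFrontage"]].fillna(toy_df[["LotFrontage"]].median())

preprocessed = pd.concat([categorical, numerical], axis=1)
print(preprocessed)
```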

In [7]:
categorical_features = housing_features_df.select_dtypes(object).columns
# "number" selects all numeric dtypes; pd.np is no longer available in current pandas
numerical_features = housing_features_df.select_dtypes("number").columns

categorical_features, numerical_features

(Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
        'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
        'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
        'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
        'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
        'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
        'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
        'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
        'SaleType', 'SaleCondition'],
       dtype='object'),
 Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
        'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
        'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
        'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
        'Kit

# Contrasting a scikit-learn and sklearndf pipeline

## A scikit-learn pipeline

We first build the preprocessing pipeline with native scikit-learn transformers.
This takes only a few lines of code; however, it does not allow us to keep track of feature names.

In [8]:
preprocessing_numeric = SimpleImputer(strategy="median", add_indicator=True)

preprocessing_categorical = Pipeline(
    steps=[
        ('imputer', SimpleImputer(missing_values=None, strategy='constant', fill_value='<unknown>')),
        ('one-hot', OneHotEncoder(sparse=False))
    ]
)

preprocessing = ColumnTransformer(
    transformers=[
        ('numeric', preprocessing_numeric, numerical_features),
        ('categorical', preprocessing_categorical, categorical_features),
    ]
)

In [9]:
preprocessing.fit_transform(X=housing_features_df, y=housing_target_sr)

array([[6.500e+01, 8.450e+03, 7.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       [8.000e+01, 9.600e+03, 6.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       [6.800e+01, 1.125e+04, 7.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       ...,
       [6.600e+01, 9.042e+03, 7.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       [6.800e+01, 9.717e+03, 5.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       [7.500e+01, 9.937e+03, 5.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00]])

The strength of `sklearndf` is that it maintains the scikit-learn conventions and expressivity while preserving data frames, and hence keeps track of the feature names.

## An sklearndf pipeline

The convention in `sklearndf` is to append `DF` at the end of each corresponding scikit-learn class. 
For instance, to reproduce the above example, we write:

In [10]:
preprocessing_numeric_df = SimpleImputerDF(strategy="median", add_indicator=True)

preprocessing_categorical_df = PipelineDF(
    steps=[
        ('imputer', SimpleImputerDF(missing_values=None, strategy='constant', fill_value='<unknown>')),
        ('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore"))
    ]
)

preprocessing_df = ColumnTransformerDF(
    transformers=[
        ('categorical', preprocessing_categorical_df, categorical_features),
        ('numeric', preprocessing_numeric_df, numerical_features),
    ]
)

In [11]:
transformed_df = preprocessing_df.fit_transform(X=housing_features_df, y=housing_target_sr)
transformed_df.head()

feature_out,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Grvl,Street_Pave,Alley_<unknown>,Alley_Grvl,Alley_Pave,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,LotFrontage__missing,MasVnrArea__missing,GarageYrBlt__missing
0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,...,548.0,0.0,61.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,...,460.0,298.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,...,608.0,0.0,42.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,...,642.0,0.0,35.0,272.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,...,836.0,192.0,84.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The `sklearndf.transformation.ColumnTransformerDF.feature_names_original_` attribute returns a series mapping the output columns (the series' index) to the input columns (the series' values):

In [12]:
preprocessing_df.feature_names_original_.to_frame().head(10)

feature_out,feature_in
MSZoning_C (all),MSZoning
MSZoning_FV,MSZoning
MSZoning_RH,MSZoning
MSZoning_RL,MSZoning
MSZoning_RM,MSZoning
Street_Grvl,Street
Street_Pave,Street
Alley_<unknown>,Alley
Alley_Grvl,Alley
Alley_Pave,Alley


You can therefore easily select all output features generated from a given input feature:

In [13]:
garage_type_derivatives = preprocessing_df.feature_names_original_ == "GarageType"

transformed_df.loc[:, garage_type_derivatives].head()

feature_out,GarageType_2Types,GarageType_<unknown>,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0
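
The structure of such a mapping, and the selection based on it, can be imitated with plain pandas. The sketch below builds the mapping by hand for a made-up data frame; the variable names are hypothetical, and this is not how `sklearndf` derives the mapping internally.

```python
import pandas as pd

# toy input data (hypothetical values)
toy_df = pd.DataFrame(
    {"Street": ["Pave", "Grvl", "Pave"], "LotArea": [8450, 9600, 11250]}
)

# one-hot encode the categorical column; pandas names the new columns
# "<input column>_<category>"
encoded = pd.get_dummies(toy_df, columns=["Street"], dtype=float)

# hand-built output-to-input mapping: each output column points to the
# input column it was derived from
origin = {
    column: "Street" if column.startswith("Street_") else column
    for column in encoded.columns
}
feature_names_original = pd.Series(origin, name="feature_in").rename_axis("feature_out")

# boolean mask over the output columns that were derived from "Street"
street_derivatives = feature_names_original == "Street"
print(encoded.loc[:, street_derivatives.to_numpy()].columns.to_list())
```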


# Supervised learners

## Regressors

As with transformers, scikit-learn regressors and classifiers have a `sklearndf` sibling obtained by appending `DF` to the class name, and the API remains the same. Predictions and decision function results are returned as a pandas series (single output) or as a data frame (class probabilities or multiple outputs).
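
For contrast, this is roughly what has to be done by hand with a native scikit-learn learner to get a labelled prediction. The sketch uses a made-up data set and a linear regression purely for brevity; it illustrates the manual step, not the `sklearndf` implementation.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy training data with a non-trivial row index (hypothetical values)
X = pd.DataFrame({"area": [50.0, 80.0, 120.0, 200.0]}, index=[892, 1105, 413, 522])
y = pd.Series([100.0, 160.0, 240.0, 400.0], index=X.index, name="price")

native = LinearRegression().fit(X, y)

# the native estimator returns an array without row labels ...
raw_prediction = native.predict(X)

# ... so the row index has to be re-attached manually
prediction = pd.Series(raw_prediction, index=X.index, name="prediction")
print(prediction.index.to_list())
```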

For a random forest regressor we get:

In [14]:
# a simplified features vector (we will use a pipeline for more sophisticated pre-processing further down)
numerical_features_df = housing_features_df.loc[:, numerical_features].fillna(0)

df_numerical_train, df_numerical_test, y_train, y_test = train_test_split(
    numerical_features_df,
    housing_target_sr,
    random_state=42
)

random_forest_regressor_df = RandomForestRegressorDF(
    n_estimators=100,
    max_depth=5,
    random_state=42,
    n_jobs=-3
)

random_forest_regressor_df.fit(X=df_numerical_train, y=y_train)
random_forest_regressor_df.score(X=df_numerical_test, y=y_test)

0.8638857401761125

In [15]:
random_forest_regressor_df.predict(df_numerical_test.iloc[:10]).to_frame()

,prediction
892,138678.817934
1105,305008.808241
413,133420.81078
522,171533.659061
1036,307214.384636
614,88023.125186
218,198144.411028
1160,167850.680826
649,88023.125186
887,126950.121597


In [16]:
random_forest_regressor_df.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': 5,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': -3,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [17]:
random_forest_regressor_df.set_params(max_depth=7)

RandomForestRegressorDF(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                        max_depth=7, max_features='auto', max_leaf_nodes=None,
                        max_samples=None, min_impurity_decrease=0.0,
                        min_impurity_split=None, min_samples_leaf=1,
                        min_samples_split=2, min_weight_fraction_leaf=0.0,
                        n_jobs=-3, oob_score=False, random_state=42, verbose=0,
                        warm_start=False)

The underlying scikit-learn regressor is stored in the `native_estimator` attribute:

In [18]:
random_forest_regressor_df.native_estimator

RandomForestRegressor(max_depth=7, n_jobs=-3, random_state=42)

Property `is_fitted` tells whether the regressor is fitted. For fitted estimators, property `feature_names_in_` returns the names of the input features as a pandas index.

In [19]:
random_forest_regressor_df.is_fitted

True

In [20]:
random_forest_regressor_df.feature_names_in_

Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea'],
      dtype='object', name='feature_in')

## Classifiers

Classifiers follow the same logic:

In [21]:
# we bin house prices into three classes (below 100k, 100k to below 200k, 200k and above) for multi-class classification
y_classes = housing_target_sr.apply(lambda x: '>=200k' if x >= 200000 else '>=100k' if x >= 100000 else '<100k')

df_numerical_train, df_numerical_test, y_classification_train, y_classification_test = train_test_split(
    numerical_features_df,
    y_classes,
    random_state=42
)
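
The same binning can also be written with `pd.cut`, which some readers may find easier to extend to more classes. This is an optional alternative sketched on made-up prices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# toy prices, including values exactly on the class boundaries
prices = pd.Series([95000, 100000, 150000, 200000, 310000])

# same binning rule as the lambda used above
labels_lambda = prices.apply(
    lambda x: ">=200k" if x >= 200000 else ">=100k" if x >= 100000 else "<100k"
)

# equivalent binning with pd.cut; right=False makes each bin include its lower edge
labels_cut = pd.cut(
    prices,
    bins=[-np.inf, 100000, 200000, np.inf],
    labels=["<100k", ">=100k", ">=200k"],
    right=False,
)
print(labels_cut.astype(str).tolist())
```

Note that `pd.cut` returns a categorical series, whereas the lambda produces plain strings.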

In [22]:
random_forest_classifier_df = RandomForestClassifierDF(
    n_estimators=100,
    max_depth=5,
    random_state=42,
    n_jobs=-3
)
random_forest_classifier_df.fit(df_numerical_train, y_classification_train)
random_forest_classifier_df.score(df_numerical_test, y_classification_test)

0.8767123287671232

In [23]:
random_forest_classifier_df.predict(df_numerical_test.iloc[:10]).to_frame()

,prediction
892,>=100k
1105,>=200k
413,>=100k
522,>=100k
1036,>=200k
614,<100k
218,>=100k
1160,>=100k
649,<100k
887,>=100k


In [24]:
random_forest_classifier_df.predict_proba(df_numerical_test.iloc[:10])

,<100k,>=100k,>=200k
892,0.056555,0.905298,0.038148
1105,0.001141,0.091114,0.907745
413,0.104472,0.860927,0.034602
522,0.047416,0.814917,0.137667
1036,0.000812,0.081764,0.917424
614,0.795575,0.201516,0.002909
218,0.02344,0.569006,0.407554
1160,0.05507,0.89993,0.045
649,0.810701,0.186019,0.00328
887,0.066446,0.89682,0.036734


In [25]:
random_forest_classifier_df.predict_log_proba(df_numerical_test.iloc[:10])

,<100k,>=100k,>=200k
892,-2.872548,-0.099491,-3.266291
1105,-6.775823,-2.395643,-0.096792
413,-2.25884,-0.149746,-3.363852
522,-3.0488,-0.204669,-1.982916
1036,-7.116299,-2.503917,-0.086185
614,-0.22869,-1.601889,-5.839849
218,-3.753297,-0.563864,-0.897583
1160,-2.899155,-0.105438,-3.101087
649,-0.209856,-1.681904,-5.720017
887,-2.711371,-0.1089,-3.30404
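
The three outputs above are consistent with each other. As a quick check, the sketch below takes the first row of the `predict_proba` output (observation 892) and verifies by hand that, for this random forest, the log probabilities are the natural logarithm of the probabilities and that the predicted class is the most probable one.

```python
import math

# first row of the predict_proba output above (index 892)
proba = {"<100k": 0.056555, ">=100k": 0.905298, ">=200k": 0.038148}

# the class probabilities of one observation sum to 1 (up to rounding)
print(round(sum(proba.values()), 4))

# predict_log_proba is the natural logarithm of predict_proba
log_proba = {label: math.log(p) for label, p in proba.items()}
print({label: round(value, 4) for label, value in log_proba.items()})

# predict returns the class with the highest probability
print(max(proba, key=proba.get))
```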


## Pipeline

We can combine the above steps to build a full predictive pipeline. `sklearndf` provides two specialised pipeline classes for this, `RegressorPipelineDF` and `ClassifierPipelineDF`. Both implement a two-step pipeline with one preprocessing step and one prediction step, while staying compatible with the general scikit-learn pipeline idiom.

In [26]:
pipeline_df = RegressorPipelineDF(
    preprocessing=preprocessing_df,
    regressor=RandomForestRegressorDF(
        n_estimators=1000,
        max_features=2/3,
        max_depth=7,
        random_state=42,
        n_jobs=-3
    )
)

In [27]:
df_train, df_test, y_train, y_test = train_test_split(housing_features_df, housing_target_sr, random_state=42)
pipeline_df.fit(df_train, y_train)
pipeline_df.score(df_test, y_test)

0.8892115754403426

# Extras

`sklearndf` also provides some additional estimators developed by Gamma or third parties, which are useful additions to the scikit-learn repertoire, and which follow the scikit-learn idiom. These are provided in `.extra` modules, such as

- `sklearndf.regression.extra.LGBMRegressorDF`
- `sklearndf.transformation.extra.BorutaDF`


## LightGBM regressor

In [28]:
lgbm_df = LGBMRegressorDF(n_estimators=100, max_depth=8)
lgbm_df.fit(df_numerical_train, y_train)
lgbm_df.predict(df_numerical_test.iloc[:10]).to_frame()

,prediction
892,139447.461334
1105,288031.725712
413,124917.506705
522,167320.406141
1036,315868.109901
614,77154.764703
218,223173.451376
1160,149507.513568
649,76728.165603
887,136832.195002


## Boruta

[Boruta](https://www.jstatsoft.org/article/view/v036i11) is a feature selection method that eliminates all features whose predictive power is no better than random noise.

The `sklearndf.transformation.extra.BorutaDF` transformer provides easy access to this powerful method. The basis of this is a tree-based learner, usually a random forest.

For the random forest, we rely on default parameters but set the maximum tree depth to 5. For Boruta, a depth between 3 and 7 is highly recommended; the best value depends on the number of features and the expected complexity of the feature/target interactions. The number of trees is managed automatically by the Boruta feature selector (argument `n_estimators="auto"`).
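
The core idea behind Boruta can be illustrated with a deliberately simplified, single-round sketch using native scikit-learn: add shuffled "shadow" copies of all features, fit a random forest, and keep the real features that are more important than the best shadow. The actual Boruta method repeats this over many iterations and applies a statistical test, so the code below is an illustration on made-up data under those simplifying assumptions, not the algorithm behind `BorutaDF`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 300

# toy data: "signal" drives the target, "noise" is unrelated to it
X = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=n)

# shadow features: independently shuffled copies of every real feature
shadows = pd.DataFrame(
    {f"shadow_{name}": rng.permutation(X[name].to_numpy()) for name in X.columns}
)
X_extended = pd.concat([X, shadows], axis=1)

forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_extended, y)
importances = pd.Series(forest.feature_importances_, index=X_extended.columns)

# a real feature "wins" this round if it beats the best shadow feature
threshold = importances[shadows.columns].max()
winners = [name for name in X.columns if importances[name] > threshold]
print(winners)
```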


In [29]:
boruta_pipeline = PipelineDF(
    steps=[
        ('preprocess', preprocessing_df),
        ('boruta', BorutaDF(
            estimator=RandomForestRegressorDF(max_depth=5, n_jobs=-3), 
            n_estimators="auto", 
            random_state=42,
            verbose=2
        )),
    ]
)

In [30]:
boruta_pipeline.fit(X=housing_features_df, y=housing_target_sr)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	303
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	303
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	303
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	303
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	303
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	303
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	303
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	29
Rejected: 	274
Iteration: 	9 / 100
Confirmed: 	12
Tentative: 	17
Rejected: 	274
Iteration: 	10 / 100
Confirmed: 	12
Tentative: 	17
Rejected: 	274
Iteration: 	11 / 100
Confirmed: 	12
Tentative: 	17
Rejected: 	274
Iteration: 	12 / 100
Confirmed: 	13
Tentative: 	14
Rejected: 	276
Iteration: 	13 / 100
Confirmed: 	13
Tentative: 	14
Rejected: 	276
Iteration: 	14 / 100
Confirmed: 	13
Tentative: 	14
Rejected: 	276
Iteration: 	15 / 100
Confirmed: 	13
Tentative: 	14
Rejected: 	276
Iteration: 	16 / 100
Confirmed: 	1

PipelineDF(memory=None,
           steps=[('preprocess',
                   ColumnTransformerDF(n_jobs=None, remainder='drop',
                                       sparse_threshold=0.3,
                                       transformer_weights=None,
                                       transformers=[('categorical',
                                                      PipelineDF(memory=None,
                                                                 steps=[('imputer',
                                                                         SimpleImputerDF(add_indicator=False,
                                                                                         copy=True,
                                                                                         fill_value='<unknown>',
                                                                                         missing_values=None,
                                                                                   

Boruta is implemented natively as a scikit-learn transformer; its output features are all features that passed the Boruta test.

In [31]:
boruta_pipeline.feature_names_out_.to_list()

['BsmtQual_Ex',
 'LotFrontage',
 'LotArea',
 'OverallQual',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'GrLivArea',
 'FullBath',
 'TotRmsAbvGrd',
 'GarageCars',
 'GarageArea']

`sklearndf` allows us to trace output features back to the original features from which they were derived, using the `feature_names_original_` property. This is useful here because we want to know which original features we can eliminate before they enter the pipeline.

In our example, feature `BsmtQual_Ex` is a derivative of feature `BsmtQual`, obtained through one-hot encoding: 

In [32]:
boruta_pipeline.feature_names_original_.to_frame()

feature_out,feature_in
BsmtQual_Ex,BsmtQual
LotFrontage,LotFrontage
LotArea,LotArea
OverallQual,OverallQual
YearBuilt,YearBuilt
YearRemodAdd,YearRemodAdd
MasVnrArea,MasVnrArea
BsmtFinSF1,BsmtFinSF1
TotalBsmtSF,TotalBsmtSF
1stFlrSF,1stFlrSF


So, to obtain all features we want to select from the original data set, we can take the unique input features from the original feature mapping:

In [33]:
boruta_pipeline.feature_names_original_.unique()

array(['BsmtQual', 'LotFrontage', 'LotArea', 'OverallQual', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'TotalBsmtSF',
       '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd',
       'GarageCars', 'GarageArea'], dtype=object)
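
The resulting array of names can then be used to subset the raw data before any preprocessing. The sketch below shows this last step with plain pandas on a hypothetical excerpt of the mapping and a made-up raw data frame.

```python
import pandas as pd

# hypothetical excerpt of an output-to-input feature mapping
feature_names_original = pd.Series(
    {"BsmtQual_Ex": "BsmtQual", "LotArea": "LotArea", "GarageCars": "GarageCars"},
    name="feature_in",
).rename_axis("feature_out")

# hypothetical raw data with one column ("PoolArea") that was not selected
raw_df = pd.DataFrame(
    {
        "BsmtQual": ["Ex", "Gd"],
        "LotArea": [8450, 9600],
        "GarageCars": [2, 2],
        "PoolArea": [0, 0],
    }
)

# keep only the original columns that contributed to a selected feature
selected_df = raw_df.loc[:, feature_names_original.unique()]
print(selected_df.columns.to_list())
```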