# Classification using Cyclic Boosting

First, install the `cyclic-boosting` package and its dependencies:

```sh
pip install cyclic-boosting
```

In [None]:
# Optional formatting if jupyter-black is installed
try:
    import jupyter_black

    jupyter_black.load()
except ImportError:
    ...

In [None]:
import pandas as pd
import numpy as np


from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline

from cyclic_boosting import flags, common_smoothers, observers, binning
from cyclic_boosting.plots import plot_analysis
from cyclic_boosting.pipelines import pipeline_CBClassifier

Let's load the adult census income dataset from OpenML

In [None]:
from sklearn.datasets import fetch_openml

data = fetch_openml(data_id=1590)


# Build a DataFrame from the feature data; the target column is added further below
df = pd.DataFrame(data.data, columns=data.feature_names)
df

For convenience, we split the columns into two groups: categorical and continuous.

In [None]:
cols_categorical = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
]
cols_noncat = [n for n in df.columns if n not in cols_categorical]

Add the target column to the DataFrame, converted to 0 and 1:

In [None]:
df["target"] = data.target.eq(">50K").mul(1)
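
As a small illustration of the conversion above, here is the same `eq(...).mul(1)` pattern on a toy Series. The values in `toy_target` are made up for this sketch and are not taken from the dataset:

```python
import pandas as pd

# Made-up labels in the same style as the income target
toy_target = pd.Series([">50K", "<=50K", ">50K"])

# eq() gives a boolean Series; multiplying by 1 turns True/False into 1/0
toy_binary = toy_target.eq(">50K").mul(1)
print(toy_binary.tolist())
```

The boolean comparison marks the `>50K` class, so that class ends up as 1 and everything else as 0.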

In [None]:
cols_noncat + cols_categorical

In [None]:
df.columns

In [None]:
assert set(cols_noncat + cols_categorical) - set(df.columns) == set(), "Columns not in data set"
print("unused columns:", set(df.columns) - set(cols_noncat + cols_categorical))

# Prepare the data

The data has to be prepared for training. We convert the categorical variables into numerical values using the scikit-learn `OrdinalEncoder` (guess who contributed this 😜).

In [None]:
def prepare_data(df):
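    # Categories not seen during fitting are encoded as NaN
    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)

    # Note: this overwrites the categorical columns of `df` in place
    df[cols_categorical] = enc.fit_transform(df[cols_categorical])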

    y = np.asarray(df["target"])
    X = df.drop(columns="target")

    return X, y
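
To see what the encoder does, here is a minimal sketch on a toy column. The DataFrame `toy` and its values are invented for illustration; only the `OrdinalEncoder` settings are the same as in `prepare_data`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented toy data with a single categorical column
toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
codes = enc.fit_transform(toy)

# A category that was not present during fitting is encoded as NaN
unseen = enc.transform(pd.DataFrame({"workclass": ["Never-worked"]}))

print(codes.ravel())
print(unseen.ravel())
```

Each category is mapped to a float code in sorted order, and the unseen category becomes NaN instead of raising an error.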

In [None]:
X, y = prepare_data(df)

In [None]:
y

# Set the feature properties

We need to tell Cyclic Boosting which features to use, what type each feature is, and how to handle it.

We want the continuous features to be `IS_CONTINUOUS` with missing values (very handy, isn't it 😎) and the categorical features to be treated as unordered classes (with no neighboring relation between categories, as there would be for weekdays, for example).

Note: there is deliberately next to no feature engineering done here. On closer inspection of the features, a lot could potentially be improved by treating them individually and maybe even combining them into 2D features (see the documentation). Here, we just want to get things up and running.

In [None]:
features = cols_categorical + cols_noncat

feature_properties = {
    **{col: flags.IS_UNORDERED | flags.HAS_MISSING | flags.MISSING_NOT_LEARNED for col in cols_categorical},
    **{col: flags.IS_CONTINUOUS | flags.HAS_MISSING | flags.MISSING_NOT_LEARNED for col in cols_noncat},
}
features, feature_properties
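
The `|` operator in the cell above combines several flags into one value per feature. As a rough analogy only, the following sketch uses Python's `enum.IntFlag`. `ToyFlags` and its values are hypothetical and not part of Cyclic Boosting; the library's own flag values and helper functions may differ:

```python
from enum import IntFlag


class ToyFlags(IntFlag):
    # Hypothetical values for illustration; not the values in cyclic_boosting.flags
    IS_CONTINUOUS = 1
    IS_UNORDERED = 2
    HAS_MISSING = 4


# Bitwise OR merges individual flags into one combined value
combined = ToyFlags.IS_CONTINUOUS | ToyFlags.HAS_MISSING

# Bitwise AND checks whether a given flag is part of the combination
is_continuous = bool(combined & ToyFlags.IS_CONTINUOUS)
is_unordered = bool(combined & ToyFlags.IS_UNORDERED)
print(is_continuous, is_unordered)
```

The point is only that one dictionary value per feature can carry several properties at once.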

# Build the model

The model is implemented as a scikit-learn pipeline, stitching together a binner and the CB classifier estimator. Most notably, we reduce the number of bins used for all continuous features to 10 instead of 100, which should be plenty.

In [None]:
def cb_classifier_model():
    plobs = [observers.PlottingObserver(iteration=-1)]

    CB_pipeline = pipeline_CBClassifier(
        feature_properties=feature_properties,
        feature_groups=features,
        observers=plobs,
        maximal_iterations=50,
        number_of_bins=10,
        smoother_choice=common_smoothers.SmootherChoiceGroupBy(
            use_regression_type=True,
            use_normalization=False,
        ),
    )

    return CB_pipeline
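
To get a feeling for what 10 bins means for a continuous feature, here is a sketch using `pandas.qcut` on random data. This assumes roughly equal-frequency bins purely for illustration; it is not the binner used inside the pipeline, which may choose its bin edges differently:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Random stand-in for a continuous feature such as age
toy_age = pd.Series(rng.integers(17, 91, size=1000))

# Split into (at most) 10 bins with roughly equal numbers of samples
toy_bins = pd.qcut(toy_age, q=10, labels=False, duplicates="drop")

print(toy_bins.value_counts().sort_index())
```

Every sample is then represented by its bin number instead of its raw value, which is the kind of input a binned model works on.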

In [None]:
CB_est = cb_classifier_model()

CB_est

# The training

In [None]:
%%timeit -r 1
CB_est.fit(X.copy(), y)

That's it, the training is done. That was fast and easy, wasn't it?

## Evaluation

Now we can run inference for all samples. Note that `predict_proba` gives us proper probabilities for all target categories, which is really nice!

In [None]:
yhat = CB_est.predict_proba(X.copy())

With this, we can calculate the mean absolute deviation between the target and the predicted probability:

In [None]:
mad = np.nanmean(np.abs(y - yhat[:, 0]))
mad
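
The same calculation on a handful of made-up numbers shows what the metric measures. `y_toy` and `p_toy` are invented for this sketch:

```python
import numpy as np

# Invented targets and predicted probabilities for the positive class
y_toy = np.array([1, 0, 1, 0])
p_toy = np.array([0.9, 0.2, 0.6, 0.1])

# Average absolute distance between target and predicted probability
mad_toy = np.nanmean(np.abs(y_toy - p_toy))
print(mad_toy)
```

A value close to 0 means the predicted probabilities sit close to the true labels, while 0.5 is what constant predictions of 0.5 would give.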

Or the scikit-learn in-sample score (yes, you should do some cross-validation for a real-world problem 😬):

In [None]:
# in-sample score
CB_est.score(X, y)
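
A held-out evaluation could look roughly like the following sketch. It uses synthetic data and `LogisticRegression` as a stand-in so that it runs on its own; the assumption is that the Cyclic Boosting pipeline can be fitted and scored in the same way, as it is in the cells above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in this notebook it would be X and y
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Stand-in estimator; any estimator with fit and score is used the same way
toy_est = LogisticRegression(max_iter=1000)
toy_est.fit(X_train, y_train)

in_sample = toy_est.score(X_train, y_train)
held_out = toy_est.score(X_test, y_test)
print(f"in-sample: {in_sample:.3f}, held-out: {held_out:.3f}")
```

Comparing the two numbers gives a first impression of how much the in-sample score overstates the performance on unseen data.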

# Some nice plots

Cyclic Boosting includes some useful reporting of the training. We can create a PDF with this code:

In [None]:
def plot_CB(filename, plobs, binner):
    for i, p in enumerate(plobs):
        plot_analysis(plot_observer=p, file_obj=filename + "_{}".format(i), use_tightlayout=False, binners=[binner])

In [None]:
plot_CB("analysis_CB_iterlast", [CB_est[-1].observers[-1]], CB_est[-2])

You will now find a PDF file containing all sorts of plots. They are explained in the Cyclic Boosting documentation.

Just as eye candy, let's plot the separation of the two classes.

In [None]:
df["pred"] = yhat[:, 0]

In [None]:
ax = df[df["target"] > 0].pred.hist(log=True, alpha=0.5)
df[df["target"] == 0].pred.hist(log=True, alpha=0.5, ax=ax)

You see, it is easy to do a classification using Cyclic Boosting and it works!