# Categorical features

Categorical features are features that can take only a limited number of values, such as color, gender, or location. They often carry useful information about patterns and relationships in the data and can strongly affect the predictive ability of a model, but they also pose specific challenges for machine learning models.

One of these challenges is the need to transform categorical features before they can be used by most models. This transformation involves converting categorical values into numerical values that can be processed by the machine learning algorithm.

Another challenge is dealing with infrequent categories. If a categorical feature has many categories and some of them appear only rarely in the data, the model has few observations from which to learn their effect, which can result in biased or inaccurate predictions for those categories.

Despite these difficulties, categorical features are still an essential component in many use cases. When properly encoded and handled, machine learning models can effectively learn from patterns and relationships in categorical data, leading to better predictions.

There are various transformations described in the literature, each with its own benefits and drawbacks. Choosing the right one can significantly impact model performance. This document describes three of the most commonly used transformations: One-hot encoding, Ordinal encoding, and Target encoding, and explains how to apply them in skforecast using scikit-learn encoders.

## Libraries and data

In [15]:
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5

from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline

from skforecast.ForecasterAutoreg import ForecasterAutoreg

In [16]:
# Downloading data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-'
       'learning-python/master/data/bike_sharing_dataset_clean.csv')
data = pd.read_csv(url)

# Preprocess data
# ==============================================================================
data['date_time'] = pd.to_datetime(data['date_time'], format='%Y-%m-%d %H:%M:%S')
data = data.set_index('date_time')
data = data.asfreq('H')
data = data.sort_index()
data = data[['holiday', 'weather', 'temp', 'hum', 'users']]
data[['holiday', 'weather']] = data[['holiday', 'weather']].astype(str)
data.head(3)

| date_time | holiday | weather | temp | hum | users |
|---|---|---|---|---|---|
| 2011-01-01 00:00:00 | 0.0 | clear | 9.84 | 81.0 | 16.0 |
| 2011-01-01 01:00:00 | 0.0 | clear | 9.02 | 80.0 | 40.0 |
| 2011-01-01 02:00:00 | 0.0 | clear | 9.02 | 80.0 | 32.0 |


In [17]:
# Split train-test
# ==============================================================================
start_train = '2012-06-01 00:00:00'
end_train = '2012-07-31 23:59:00'
end_test = '2012-08-15 23:59:00'
data_train = data.loc[start_train:end_train, :]
data_test  = data.loc[end_train:end_test, :]

print(f"Dates train : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Dates test  : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")

Dates train : 2012-06-01 00:00:00 --- 2012-07-31 23:00:00  (n=1464)
Dates test  : 2012-08-01 00:00:00 --- 2012-08-15 23:00:00  (n=360)


## One-hot encoding

One-hot encoding, also known as dummy encoding or one-of-K encoding, consists of replacing the categorical variable with a group of binary variables that take the value 0 or 1 to indicate whether a certain category is present in an observation. For example, consider a dataset that includes a categorical variable called "color" with possible values "red," "blue," and "green." With one-hot encoding, this variable is converted into a set of binary variables such as "color_red," "color_blue," and "color_green," where each variable takes the value 1 if the observation belongs to that category and 0 otherwise.

The `OneHotEncoder` class in scikit-learn can be used to transform each categorical feature with *n* possible values into *n* new binary features, with one of them taking the value 1 and all others 0. The `OneHotEncoder` can be configured to handle certain corner cases, including unknown categories, missing values, and infrequent categories.

+ When `handle_unknown='ignore'` and `drop` is not `None`, unknown categories are encoded as all zeros. Additionally, if a feature contains both `np.nan` and `None`, they are considered separate categories.

+ `OneHotEncoder` supports aggregating infrequent categories into a single output for each feature. The parameters that enable the grouping of infrequent categories are `min_frequency` and `max_categories`. By setting `handle_unknown='infrequent_if_exist'`, unknown categories are treated as infrequent.

+ To avoid collinearity between features, it is possible to drop one of the categories per feature. This is particularly important when using linear models. This behavior is controlled with the `drop` argument.


[ColumnTransformers in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) provide a powerful way to define transformations and apply them to specific features. By encapsulating the `OneHotEncoder` within a `ColumnTransformer` object, it can be passed to a forecaster using the `transformer_exog` argument. For more details on how to use `OneHotEncoder`, please refer to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder).

In [18]:
# ColumnTransformer with a one hot encoding transformer
# ==============================================================================
# The ColumnTransformer one-hot encodes the categorical (non-numerical) features.
# Numerical features are left untouched. For binary features, only one column
# is created (drop='if_binary').
from sklearn.compose import ColumnTransformer

one_hot_encoder = ColumnTransformer(
    [
        ('oh', OneHotEncoder(sparse_output=False, drop='if_binary'), make_column_selector(dtype_exclude=np.number))
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

In [19]:
# Create and fit forecaster with a transformer for exogenous features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']

forecaster = ForecasterAutoreg(
                regressor = LGBMRegressor(random_state=123),
                lags = 5,
                transformer_exog = one_hot_encoder
             )

forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)

forecaster

ForecasterAutoreg 
Regressor: LGBMRegressor(random_state=123) 
Lags: [1 2 3 4 5] 
Transformer for y: None 
Transformer for exog: ColumnTransformer(remainder='passthrough',
                  transformers=[('oh',
                                 OneHotEncoder(drop='if_binary',
                                               sparse_output=False),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x7fb94aff0820>)],
                  verbose_feature_names_out=False) 
Window size: 5 
Weight function included: False 
Exogenous included: True 
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'> 
Exogenous variables names: ['holiday', 'weather', 'temp', 'hum'] 
Training range: [Timestamp('2011-01-01 00:00:00'), Timestamp('2012-07-31 23:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: H 
Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', '

<script src="https://kit.fontawesome.com/d20edc211b.js" crossorigin="anonymous"></script>

<div class="admonition note" name="html-admonition" style="background: rgba(0,184,212,.1); padding-top: 0px; padding-bottom: 6px; border-radius: 8px; border-left: 8px solid #00b8d4;">

<p class="title">
    <i class="fa-circle-exclamation fa" style="font-size: 18px; color:#00b8d4;"></i>
    <b> &nbsp; Note</b>
</p>

Applying a transformation to the entire dataset independent of the forecaster is feasible. However, it is crucial to ensure that the transformations are learned exclusively from the training data to prevent information leakage. Additionally, the same transformation should be applied to the input data during predictions. Therefore, it is advisable to include the transformation inside the forecaster, so that it is handled internally. This approach ensures consistency in the application of transformations and reduces the likelihood of errors.

To examine how data is being transformed, it is possible to use the <code>create_train_X_y</code> method to generate the matrices that the forecaster is using to train the model. This approach enables gaining insight into the specific data manipulations that occur during the training process.

</div>

In [20]:
# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
                        y = data.loc[:end_train, 'users'],
                        exog = data.loc[:end_train, exog_features]
                   )
X_train.head()

| date_time | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday_1.0 | weather_clear | weather_mist | weather_rain | temp | hum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.84 | 75.0 |
| 2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9.02 | 80.0 |
| 2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0.0 | 1.0 | 0.0 | 0.0 | 8.2 | 86.0 |
| 2011-01-01 08:00:00 | 3.0 | 2.0 | 1.0 | 1.0 | 13.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9.84 | 75.0 |
| 2011-01-01 09:00:00 | 8.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 13.12 | 76.0 |


## Ordinal encoding

Ordinal encoding is a technique used to convert categorical variables into numerical variables. Each category is assigned a unique numerical value based on its order or rank, as determined by a chosen criterion such as frequency or importance. This encoding method is particularly useful when categories have a natural order or ranking, such as with educational levels. However, it is important to note that the numerical values assigned to each category do not represent any inherent numerical difference between them, but simply provide a numerical representation.

The `OrdinalEncoder` class in scikit-learn can be used to transform each categorical feature into a new feature of integers, with values ranging from 0 to `n_categories - 1`. This class also offers the `encoded_missing_value` parameter to encode missing values.

## Target encoding

Target encoding is a technique that encodes categorical variables based on the relationship between the categories and the target variable. Each category is encoded using a shrunk estimate of the average target value of the observations belonging to that category. The encoding scheme mixes the global target mean with the target mean conditioned on the value of the category.

For example, suppose we have a categorical variable "City" with categories "New York," "Los Angeles," and "Chicago," and a target variable "Salary." One can calculate the mean salary for each city category based on the training data, and use these mean values to encode the categories.
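
The mixing of the category mean and the global mean can be sketched by hand with pandas. The salary values and the fixed `smoothing` constant below are hypothetical, and the weighting `n / (n + smoothing)` is one common shrinkage scheme, shown here as an assumption rather than as the exact computation performed by any particular library.

```python
import pandas as pd

# Hypothetical training data
df = pd.DataFrame({
    "city": ["New York", "New York", "New York",
             "Los Angeles", "Los Angeles", "Chicago"],
    "salary": [90.0, 110.0, 100.0, 80.0, 100.0, 60.0],
})

smoothing = 2.0  # assumed constant: larger values pull harder toward the global mean
global_mean = df["salary"].mean()

# Per-category mean and number of observations
stats = df.groupby("city")["salary"].agg(["mean", "count"])

# Categories with few observations get a low weight on their own mean
weight = stats["count"] / (stats["count"] + smoothing)
encoding = weight * stats["mean"] + (1 - weight) * global_mean

df["city_encoded"] = df["city"].map(encoding)
print(encoding)
```

With a single observation, "Chicago" is pulled strongly toward the global mean, whereas "New York", with three observations, stays closer to its own mean.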

This encoding scheme is useful for categorical features with high cardinality, where one-hot encoding would inflate the feature space and make it more expensive for a downstream model to process. Classic examples of high-cardinality features are location-based variables such as zip code or region.

The `TargetEncoder` class in scikit-learn (available since version 1.3) treats missing values, such as `np.nan` or `None`, as another category and encodes them like any other category. Categories that are not seen during `fit` are encoded with the target mean, i.e. `target_mean_`.

## Native implementation for categorical features

Gradient boosting libraries (XGBoost, LightGBM, CatBoost) assume that the input categories are integers starting from 0 up to the number of categories [0, 1, ..., n_categories-1]. In most real cases, categorical variables are not encoded with numbers but with strings, so an intermediate transformation step is necessary. Two options are:

+ Set columns with categorical variables to `category` type. Internally, this data structure consists of an array of categories and an array of integer values (codes) that point to the actual value of the array of categories. That is, internally they are a numeric array with a mapping that relates each value to a category. Models are able to automatically identify the columns of type `category` and access their internal codes. This approach is applicable with XGBoost, LightGBM and CatBoost.

+ Preprocess the categorical columns with an `OrdinalEncoder` to transform their values into integers and then store them again as `category` type. If this last step is skipped, the models will treat the variables as if they were numeric.
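
The internal structure of the `category` type described above can be inspected directly with pandas. This is a small sketch with hypothetical weather values.

```python
import pandas as pd

# Hypothetical string column
weather = pd.Series(["clear", "mist", "rain", "clear", "mist"])

# Option 1: store the column as category type
weather_cat = weather.astype("category")
categories = list(weather_cat.cat.categories)  # array of categories
codes = weather_cat.cat.codes.tolist()         # integer codes pointing to it

# Integer codes on their own are a plain numeric column...
codes_numeric = weather_cat.cat.codes
# ...so they must be cast back to category to be treated as categorical
codes_as_category = codes_numeric.astype("category")

print(categories)
print(codes)
print(codes_numeric.dtype, codes_as_category.dtype)
```

The last line shows the difference that matters to the models: the same integer values are numeric in one case and categorical in the other.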

To use this native handling in skforecast (the example here uses LightGBM), categorical variables must first be encoded as integers and then stored as `category` type. This is because skforecast internally uses numeric NumPy arrays to speed up computation.

In [21]:
# Transformer: Ordinal encoding + cast to category type
# ==============================================================================
pipeline_categorical = make_pipeline(
                            OrdinalEncoder(
                                dtype=int,
                                handle_unknown="use_encoded_value",
                                unknown_value=-1,
                                encoded_missing_value=-1
                            ),
                            FunctionTransformer(
                                func=lambda x: x.astype('category'),
                                feature_names_out= 'one-to-one'
                            )
                       )

transformer_exog = make_column_transformer(
                        (
                            pipeline_categorical,
                            make_column_selector(dtype_exclude=np.number)
                        ),
                        remainder="passthrough",
                        verbose_feature_names_out=False,
                   ).set_output(transform="pandas")

In [24]:
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                regressor = LGBMRegressor(categorical_features='auto'),
                lags = 5,
                transformer_exog = transformer_exog
             )
            
forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)

forecaster



ForecasterAutoreg 
Regressor: LGBMRegressor(categorical_features='auto') 
Lags: [1 2 3 4 5] 
Transformer for y: None 
Transformer for exog: ColumnTransformer(remainder='passthrough',
                  transformers=[('pipeline',
                                 Pipeline(steps=[('ordinalencoder',
                                                  OrdinalEncoder(dtype=<class 'int'>,
                                                                 encoded_missing_value=-1,
                                                                 handle_unknown='use_encoded_value',
                                                                 unknown_value=-1)),
                                                 ('functiontransformer',
                                                  FunctionTransformer(feature_names_out='one-to-one',
                                                                      func=<function <lambda> at 0x7fb94ae53a30>))]),
                                 <sklearn.compos

In [25]:
# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
                        y = data.loc[:end_train, 'users'],
                        exog = data.loc[:end_train, exog_features]
                   )
X_train.head()

| date_time | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday | weather | temp | hum |
|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0 | 1 | 9.84 | 75.0 |
| 2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0 | 0 | 9.02 | 80.0 |
| 2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0 | 0 | 8.2 | 86.0 |
| 2011-01-01 08:00:00 | 3.0 | 2.0 | 1.0 | 1.0 | 13.0 | 0 | 0 | 9.84 | 75.0 |
| 2011-01-01 09:00:00 | 8.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0 | 0 | 13.12 | 76.0 |


