# Categorical features

In machine learning, categorical features are features that can take a limited number of values, such as color, gender, or location. They can reveal useful patterns and relationships in the data, but they also pose specific challenges for machine learning models.

One of these challenges is the need to transform categorical features before they can be used by most models. This transformation involves converting categorical values into numerical values that can be processed by the machine learning algorithm.

Another challenge is dealing with infrequent categories, which can lead to biased models. If a categorical feature has a large number of categories, but some of them are rare or appear infrequently in the data, the model may not be able to learn accurately from these categories, resulting in biased predictions and inaccurate results.

Despite these difficulties, categorical features are still an essential component in many use cases. When they are properly encoded and handled, machine learning models can effectively learn from patterns and relationships in categorical data, leading to better predictions.

There are various transformations described in the literature, each with its own benefits and drawbacks. Choosing the right one can significantly impact model performance. This document describes three of the most commonly used transformations (one-hot encoding, ordinal encoding, and target encoding) and explains how to apply them in skforecast using scikit-learn encoders.

## Ordinal encoding

Ordinal encoding is a technique used to convert categorical variables into numerical variables. Each category is assigned a unique numerical value based on its order or rank, as determined by a chosen criterion such as frequency or importance. This encoding method is particularly useful when categories have a natural order or ranking, such as with educational levels. However, it is important to note that the numerical values assigned to each category do not represent any inherent numerical difference between them, but simply provide a numerical representation.

The `OrdinalEncoder` class in scikit-learn transforms each categorical feature into a new feature of integers, with values ranging from 0 to `n_categories - 1`. By default, the integers follow the sorted order of the categories found in each feature. This class also offers the `encoded_missing_value` parameter to encode missing values.

In [1]:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
X = [['male'], ['female'], [np.nan], ['female']]
encoder.fit_transform(X)

array([[ 1.],
       [ 0.],
       [nan],
       [ 0.]])

## One-hot encoding

In one-hot encoding, also known as dummy encoding or one-of-K encoding, each categorical value is converted into a binary vector of 0s and 1s, where the index of the 1 represents the category. For example, suppose a dataset includes a categorical variable called "color" with possible values "red", "blue", and "green". To use this variable in a machine learning algorithm, it is converted into a set of binary variables, such as "color_red", "color_blue", and "color_green", where each variable is 1 if the observation belongs to that category and 0 otherwise.

The `OneHotEncoder` class in scikit-learn transforms each categorical feature with `n_categories` possible values into `n_categories` binary features, one of which is 1 and all others 0. For features with exactly two categories, one of the two columns is redundant and can be dropped by setting `drop='if_binary'`. When `handle_unknown='ignore'` and `drop` is not `None`, unknown categories are encoded as all zeros.

`OneHotEncoder` also supports categorical features with missing values by considering the missing values as an additional category. If a feature contains both `np.nan` and `None`, they will be considered separate categories.
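
The short sketch below, built on made-up data, shows how missing values end up with their own output columns. The feature mixes `np.nan` and `None` on purpose so that both appear as separate categories.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature containing both kinds of missing value.
X = np.array([['red'], [np.nan], ['blue'], [None]], dtype=object)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(X).toarray()

# Two real categories plus one category each for None and NaN.
print(encoder.categories_)
print(encoded)
```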

Furthermore, `OneHotEncoder` supports aggregating infrequent categories into a single output for each feature. The parameters that enable the gathering of infrequent categories are `min_frequency` and `max_categories`. By setting `handle_unknown` to `'infrequent_if_exist'`, unknown categories will be considered infrequent.

## Target encoding

<script src="https://kit.fontawesome.com/d20edc211b.js" crossorigin="anonymous"></script>

<div class="admonition note" name="html-admonition" style="background: rgba(0,184,212,.1); padding-top: 0px; padding-bottom: 6px; border-radius: 8px; border-left: 8px solid #00b8d4;">

<p class="title">
    <i class="fa-circle-exclamation fa" style="font-size: 18px; color:#00b8d4;"></i>
    <b> &nbsp Note</b>
</p>

Coming Soon. This section is currently being created :)

</div>
