#### Column transformations

TL;DR - Do the data preparation, exploratory data analysis (EDA), feature encoding/engineering and outlier removal in Pandas before the ML modeling, as we have seen so far in this course. When the data is clean and ready for the ML models, convert the DataFrame to a 2D Numpy float array with .values and pass it to the .fit/.predict/.score methods of a sklearn estimator, or of a Pipeline object if a StandardScaler is needed.

Explanation: So far in this program, we have structured our data analysis into three parts: (1) data preparation, (2) exploratory data analysis (EDA) and (3) machine learning. Steps (1) and (2) involve a lot of data manipulation and are done in Pandas; step (3) is done with Scikit-learn. Because ML models only work with numerical data, we usually convert our DataFrame into a 2D Numpy float array only at step (3), for the sklearn estimators. The only exception is for common preprocessing steps that are very specific to ML, such as StandardScaler, or dimensionality reduction such as PCA (more about this in the next course). Those are usually encapsulated into a Pipeline object, as shown in the last unit. The reason for this exception is simple: those ML operations are independent of the nature of the column, unlike the data manipulation steps from (1) and (2), which are usually very specific to each variable. For instance, feature engineering and outlier removal are done in Pandas because they depend on the type of each variable and its meaning: it doesn’t make sense to create polynomial features for categories, or to apply z-score outlier removal to ordinal or skewed variables.
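
As a minimal sketch of this handoff, the snippet below converts a DataFrame to a Numpy array only at the modeling step and passes it to a Pipeline with a StandardScaler. The toy_df DataFrame and its values are made up for illustration and are not the course data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up, already cleaned numerical data (illustrative only)
toy_df = pd.DataFrame(
    {
        "temp": [0.2, 0.4, 0.6, 0.8],
        "hum": [0.9, 0.5, 0.7, 0.3],
        "casual": [10, 20, 30, 40],
    }
)

# Convert to a 2D float array (features) and a 1D array (target)
# only at the modeling step
X = toy_df.drop("casual", axis=1).values.astype(float)
y = toy_df["casual"].values

# ML-specific preprocessing lives in the Pipeline, not in Pandas
toy_pipe = Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())])
toy_pipe.fit(X, y)
r2 = toy_pipe.score(X, y)
print("X shape:", X.shape, "R^2 on the toy data:", round(r2, 3))
```

Everything before the .values call stays in Pandas; only the scaler, which treats every column the same way, is part of the Pipeline.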

In this unit and the next ones: Jupyter notebooks are a great way to develop and share a data analysis pipeline iteratively, documenting each step with Markdown cells and plots. At the end of this “prototyping” work, we sometimes want to encapsulate our code from (1) and (2) into a “clean” ML pipeline. This unit and the next ones from this Advanced Scikit-learn chapter present tools to achieve this. These tools are not required: they only help with the extra step of encapsulating the Pandas code from steps (1) and (2) into Scikit-learn objects at the end of the analysis/prototyping work, and using them requires good programming/debugging experience. For this reason, these units are entirely optional. You can skip them and start working on the final course project now. When you are happy with your work, you can come back to them and think about how you could apply those tools to your analysis.

In this unit, we will see how to use the ColumnTransformer object from Scikit-learn to perform a few common preprocessing steps such as ordinal and one-hot encoding.

Before going into the code, it’s important to understand that this tool is part of a new workflow in Scikit-learn that tries to consolidate the data manipulation steps, usually done in Pandas, with the modeling part. As we will see in this unit and the next ones, this new Pandas/Scikit-learn workflow can be very powerful. However, Scikit-learn only provides partial support for DataFrames at the moment, so it can be difficult to model complex sequences of data manipulations with it.

In such cases, don’t hesitate to do part or all of the data manipulation work in Pandas as we saw previously. Also, keep an eye on the upcoming Scikit-learn releases to see how these new features evolve.

#### One-hot encoding with Scikit-learn
Let’s start by loading the data.

In [1]:
import pandas as pd

data_df = pd.read_csv("c3_bike-sharing-data.csv")
data_df.head()

Unnamed: 0,temp,hum,windspeed,yr,workingday,holiday,weekday,season,weathersit,casual
0,0.344,0.806,0.16,2011,no,no,6,spring,cloudy,331
1,0.363,0.696,0.249,2011,no,no,0,spring,cloudy,131
2,0.196,0.437,0.248,2011,yes,no,1,spring,clear,120
3,0.2,0.59,0.16,2011,yes,no,2,spring,clear,108
4,0.227,0.437,0.187,2011,yes,no,3,spring,clear,82


Scikit-learn implements a OneHotEncoder transformer to handle categorical variables. Like the other objects from Scikit-learn, it accepts array-like objects, including DataFrames, as input, but by default it returns Numpy arrays or related objects such as sparse matrices, as we will see below.

Let’s test it on our data_df DataFrame.



In [2]:
from sklearn.preprocessing import OneHotEncoder

# Create encoder
encoder = OneHotEncoder()
encoder.fit_transform(data_df)

<731x1714 sparse matrix of type '<class 'numpy.float64'>'
	with 7310 stored elements in Compressed Sparse Row format>

The result can be a bit surprising at first sight: we pass a DataFrame object and get a sparse matrix with 1,714 columns! In fact, the transformer encodes all the columns from the input data, including the numerical ones. So it creates a new one-hot encoded column for each distinct value found in each column of the DataFrame.
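
To make the column count concrete, here is a small sketch on a made-up DataFrame (toy_df is an illustrative assumption, not the bike-sharing data). The number of encoded columns should match the total number of distinct values across all columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: one numerical and one categorical column
toy_df = pd.DataFrame(
    {
        "temp": [0.1, 0.2, 0.3, 0.4],
        "season": ["spring", "spring", "fall", "fall"],
    }
)

# Without further instructions, every column is treated as categorical
encoded_toy = OneHotEncoder().fit_transform(toy_df)

# Total number of distinct values: 4 for temp + 2 for season
n_distinct = toy_df.nunique().sum()
print(encoded_toy.shape, n_distinct)  # (4, 6) 6
```

The numerical temp column alone contributes four of the six columns, which is why continuous variables inflate the result so quickly.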

Let’s see how to fix this.

#### ColumnTransformer object
So far, we always converted the input data into Numpy arrays to avoid any issues during the ML part. However, Scikit-learn recently released a ColumnTransformer object that can apply different transformations to the columns of a Pandas DataFrame object.

In our case, we can use it to apply one-hot encoding to the categorical variables.



In [3]:
from sklearn.compose import ColumnTransformer

# Handle categorical variables
cat_columns = ["yr", "workingday", "holiday", "weekday", "season", "weathersit"]
cat_transformer = OneHotEncoder(sparse=False)

# Create the column transformer
preprocessor = ColumnTransformer(
    [("categorical", cat_transformer, cat_columns)], remainder="passthrough"
)

In this code, we first list the categorical columns in a cat_columns variable and create the OneHotEncoder() object. This time, we specify sparse=False when creating the encoder to get Numpy arrays instead of sparse matrices (in Scikit-learn 1.2 and later, this parameter is named sparse_output). We then create the ColumnTransformer object and specify the different transformations - one in our case - by defining (name, transformer, columns) triplets. We pass the list of categorical variables with the one-hot encoder and tell the object to leave the other columns unchanged by setting its remainder parameter to 'passthrough'.

Let’s test it on our input DataFrame.



In [4]:
encoded = preprocessor.fit_transform(data_df)
encoded

array([[1.00e+00, 0.00e+00, 1.00e+00, ..., 8.06e-01, 1.60e-01, 3.31e+02],
       [1.00e+00, 0.00e+00, 1.00e+00, ..., 6.96e-01, 2.49e-01, 1.31e+02],
       [1.00e+00, 0.00e+00, 0.00e+00, ..., 4.37e-01, 2.48e-01, 1.20e+02],
       ...,
       [0.00e+00, 1.00e+00, 1.00e+00, ..., 7.53e-01, 1.24e-01, 1.59e+02],
       [0.00e+00, 1.00e+00, 1.00e+00, ..., 4.83e-01, 3.51e-01, 3.64e+02],
       [0.00e+00, 1.00e+00, 0.00e+00, ..., 5.78e-01, 1.55e-01, 4.39e+02]])

It’s important to note that we pass a DataFrame object as input and get a Numpy array. By looking at the values, we can see that the first columns correspond to the one-hot encoded columns and the last ones to the untransformed ones.

Let’s check the type and size of our encoded data.



In [6]:
print("Shape:", encoded.shape)
print("Type:", type(encoded))
print("Data type:", encoded.dtype)


Shape: (731, 24)
Type: <class 'numpy.ndarray'>
Data type: float64


This time, we get a reasonable number of columns. The result is now a (731, 24) Numpy array with the usual uniform float data type.

The encoded data is ready for the ML estimators, but not for additional Pandas data manipulation steps since we lost the column names in the conversion. This is why we said above that it can be complex to model sequences of data manipulation steps in Scikit-learn.

If we want to convert the result back into a DataFrame, we need to use the get_feature_names_out() method from our encoder.

In [7]:
try:
    cat_transformer.get_feature_names_out()
except Exception as e:
    print(e)


This OneHotEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.


We get an error saying that the transformer is not fitted yet. The ColumnTransformer fits copies (clones) of the transformers we pass in and leaves the original objects untouched. To access the fitted copies, we need to use the named_transformers_ attribute, which maps each name to its fitted transformer. This is similar to the named_steps attribute from Pipeline objects.



In [8]:
preprocessor.named_transformers_

{'categorical': OneHotEncoder(sparse=False), 'remainder': 'passthrough'}

To get the feature names, we simply need to retrieve the encoder copy and call its get_feature_names_out() method.



In [9]:
preprocessor.named_transformers_["categorical"].get_feature_names_out()

array(['yr_2011', 'yr_2012', 'workingday_no', 'workingday_yes',
       'holiday_no', 'holiday_yes', 'weekday_0', 'weekday_1', 'weekday_2',
       'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'season_fall',
       'season_spring', 'season_summer', 'season_winter',
       'weathersit_clear', 'weathersit_cloudy', 'weathersit_rainy'],
      dtype=object)

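Scikit-learn builds each name from the input column name and the category value, and keeps the order of cat_columns: the first names, yr_2011 and yr_2012, come from the first column in cat_columns, which is yr.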

#### Issue with missing categories
The one-hot encoder creates a new column for each categorical value. A common issue is to have new, previously unknown, categories in the test data. For instance, let’s see what happens if we create a new storm category for the weathersit feature.



In [10]:
new_data = data_df.iloc[:1].copy()
new_data["weathersit"] = "storm"
new_data

Unnamed: 0,temp,hum,windspeed,yr,workingday,holiday,weekday,season,weathersit,casual
0,0.344,0.806,0.16,2011,no,no,6,spring,storm,331


If you take a look at the column names retrieved with the get_feature_names_out() call from above, you can see that the encoder only knows about the clear, cloudy and rainy categories. Let’s see how it handles this new storm category.

In [11]:
try:
    preprocessor.transform(new_data)
except Exception as e:
    print(e)

Found unknown categories ['storm'] in column 5 during transform


The one-hot encoder raises an exception saying that 'storm' is an unknown value. A common practice is to simply ignore unseen values and set all the corresponding one-hot encoded variables to zero, i.e. weathersit_clear, weathersit_cloudy and weathersit_rainy.

We can specify this behavior by setting the handle_unknown parameter of our OneHotEncoder to 'ignore'.



In [12]:
# Handle categorical variables
cat_transformer = OneHotEncoder(handle_unknown="ignore", sparse=False)

# Create the column transformer
preprocessor = ColumnTransformer(
    [("categorical", cat_transformer, cat_columns)], remainder="passthrough"
)
preprocessor.fit_transform(data_df)
preprocessor.transform(new_data)


array([[1.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 1.00e+00, 0.00e+00,
        0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        1.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 0.00e+00, 3.44e-01, 8.06e-01, 1.60e-01, 3.31e+02]])

If we look at the entries with index 17, 18, 19 that correspond to the weathersit variable, we can see that they all have a value of zero.

#### Ordinal encoding with Scikit-learn
Scikit-learn also provides an OrdinalEncoder object to encode ordinal variables. It takes the ordered list of values for each variable and encodes them as integers from 0 to N-1, where N is the number of values. Let’s test it on the weathersit variable.

In [13]:
from sklearn.preprocessing import OrdinalEncoder

# Handle ordinal variables
ord_columns = ["weathersit"]
ord_transformer = OrdinalEncoder(categories=[["clear", "cloudy", "rainy"]])

In this case, the encoder will simply map clear, cloudy and rainy to 0, 1 and 2, respectively.
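
As a quick check of this mapping, the sketch below fits the same kind of encoder on a few made-up values (toy_weather and toy_ord are illustrative names, not part of the course code).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample of weathersit values (illustrative only)
toy_weather = pd.DataFrame({"weathersit": ["cloudy", "clear", "rainy", "clear"]})

# The order of the list defines the integer codes: clear=0, cloudy=1, rainy=2
toy_ord = OrdinalEncoder(categories=[["clear", "cloudy", "rainy"]])
codes = toy_ord.fit_transform(toy_weather)

print(codes.ravel())  # [1. 0. 2. 0.]
```

Note that the codes are returned as floats, which matches the uniform float arrays that the estimators expect.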

#### FunctionTransformer object
Ordinal and one-hot encoding are two common transformations which have their dedicated Scikit-learn transformers. However, we can also create new transformers with the FunctionTransformer object.

For instance, let’s create polynomial features with continuous variables.



In [14]:
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Add polynomial features
poly_columns = ["temp", "hum", "windspeed"]
poly_transformer = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("poly", FunctionTransformer(lambda X: np.c_[X, X ** 2, X ** 3])),
    ]
)

The FunctionTransformer takes the function to apply as a parameter. In the code from above, we create an anonymous one with the lambda notation. The function simply appends the squared and cubed values of the input array X to its columns with the np.c_[] concatenation operation.
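
To see what this function produces, here is a small sketch on a made-up 2x2 array. It uses a named function, add_powers (a hypothetical helper, not part of the course code), instead of a lambda; a named module-level function has the side benefit that the transformer can be pickled.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def add_powers(X):
    """Append the squared and cubed values of X as extra columns."""
    X = np.asarray(X)
    return np.c_[X, X ** 2, X ** 3]


# Made-up input with two features (illustrative only)
toy_X = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_poly = FunctionTransformer(add_powers).fit_transform(toy_X)

print(toy_poly)
# [[ 1.  2.  1.  4.  1.  8.]
#  [ 3.  4.  9. 16. 27. 64.]]
```

Each input column yields three output columns, so two features become six and no interaction terms are created.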

Note that our transformer from above is not equivalent to the PolynomialFeatures one which adds all the interaction terms in addition to the polynomial features.



In [15]:
from sklearn.preprocessing import PolynomialFeatures

polyfeat = PolynomialFeatures(degree=3, include_bias=False)
polyfeat.fit(data_df[poly_columns])
polyfeat.get_feature_names_out()


array(['temp', 'hum', 'windspeed', 'temp^2', 'temp hum', 'temp windspeed',
       'hum^2', 'hum windspeed', 'windspeed^2', 'temp^3', 'temp^2 hum',
       'temp^2 windspeed', 'temp hum^2', 'temp hum windspeed',
       'temp windspeed^2', 'hum^3', 'hum^2 windspeed', 'hum windspeed^2',
       'windspeed^3'], dtype=object)

As we can see, with the interaction terms, the PolynomialFeatures object creates a total of 19 features instead of just the 9 polynomial ones.
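
These two counts can be checked with a short calculation. The number of monomials of degree at most d in n variables is the binomial coefficient C(n + d, d), which includes the constant (bias) term. The variable names below are illustrative.

```python
from math import comb

n_features = 3  # temp, hum, windspeed
degree = 3

# All monomials up to the given degree, minus the constant (bias) term
n_poly_terms = comb(n_features + degree, degree) - 1

# Pure powers only: x, x^2 and x^3 for each feature
n_pure_powers = n_features * degree

print(n_poly_terms, n_pure_powers)  # 19 9
```

The gap between the two grows quickly with the number of features, which is worth keeping in mind before applying PolynomialFeatures to many columns.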

#### Complete pipeline
Let’s assemble the different transformations into a final ColumnTransformer.



In [16]:
# Create the column transformer
preprocessor = ColumnTransformer(
    [
        ("categorical", cat_transformer, cat_columns),
        ("ordinal", ord_transformer, ord_columns),
        ("poly", poly_transformer, poly_columns),
    ],
    remainder="drop",
)

encoded = preprocessor.fit_transform(data_df)
encoded.shape

(731, 30)

This time, we apply the three different transformations and make sure that any additional columns are dropped by setting remainder to 'drop'.

If you execute the code from above, you will probably get a FutureWarning. Scikit-learn is simply warning us that the default value for one of the object parameters will change in a future release of the library. We can ignore such warnings by adding a simplefilter using the Python warnings module.


In [18]:
import warnings

warnings.simplefilter("ignore", FutureWarning)
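
The filter above applies to the rest of the session. If we prefer to silence the warning for a single call only, one option is the catch_warnings context manager from the same module. In this sketch, legacy_call is a hypothetical stand-in for a library call that emits a FutureWarning.

```python
import warnings


def legacy_call():
    # Hypothetical stand-in for a library call that emits a FutureWarning
    warnings.warn("default value will change", FutureWarning)
    return 42


# The filter is only active inside the with block;
# the previous warning settings are restored on exit
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    result = legacy_call()

print(result)  # 42
```

Scoping the filter this way keeps other FutureWarning messages visible, which can be useful when upgrading the library later.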

Let’s encapsulate our preprocessor with a LinearRegression estimator into a pipeline.

In [19]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Create Pipeline
pipe = Pipeline([("preprocessor", preprocessor), ("regressor", LinearRegression())])


We can then use the usual train/test split methodology to evaluate it.

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as MAE

# Split into train/test sets
X = data_df.drop("casual", axis=1)
y = data_df.casual
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit/evaluate pipeline
pipe.fit(X_tr, y_tr)
print("MAE: {:.2f}".format(MAE(y_te, pipe.predict(X_te))))

MAE: 253.03


This time, we get a slightly better MAE score than what we obtained in the previous unit with only one-hot encoding: 253 vs. 280.

#### Summary
In this unit, we saw how to apply feature-specific preprocessing steps, such as OneHotEncoder, OrdinalEncoder or handmade FunctionTransformer transformers, with the ColumnTransformer object.

In the next unit, we will see how to encapsulate complex transformations into a custom transformer object that can be used with the tools from Scikit-learn.