# Under The Hood: ColumnTransformers in sklearn

ColumnTransformers coupled with Pipelines are great tools for managing features before training a model. Sadly, the documentation on how to use them is over-complicated. With a trivial dataset, I demonstrate how ColumnTransformers can be used without hassle and what happens under the hood.

When using the sklearn library, one almost inevitably struggles with transforming the features before model training. ‘Transforming’ in this context means:
- Scaling (e.g. MinMax, Standard etc) for numerical variables
- Encoding (e.g. Ordinal, OneHot, with an option to avoid the dummy variable trap) for the categorical variables
- Imputing (e.g. with a simple mean) the missing data points

A ColumnTransformer handles all three seamlessly: you need not separate the categorical features from the numerical ones, process them separately and join them back. ColumnTransformer does this for you in a single flow, with one Pipeline per group of columns.

However, ColumnTransformers do a great deal with very little code, which hides what is actually going on. In this document I will explore the inner workings of ColumnTransformers and answer questions like:
- How to best manage categorical and numerical variables? Is it necessary to use a ColumnTransformer?
- In what order does ColumnTransformer process columns? 
- How to keep track of lost column names since a ColumnTransformer returns a numpy nd-array?
- What if there are other variables in my dataset which are not model features but are required to be retained?

The best way to manage variables is to keep two separate lists: one containing the names of the categorical variables, the other containing the names of the numerical variables. (Tip: It’s even better to carry them together in a dictionary.)

In [77]:
numeric_columns = ['income', 'age']
categorical_columns = ['class', 'gender']
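The dictionary tip can be sketched as follows. The name `feature_groups` and its keys are purely illustrative; sklearn does not require any particular structure here.

```python
# Hypothetical container: one dict carrying both groups of column names.
feature_groups = {
    "numeric": ["income", "age"],
    "categorical": ["class", "gender"],
}

# The individual lists are still one lookup away...
numeric_columns = feature_groups["numeric"]
categorical_columns = feature_groups["categorical"]

# ...and the full list of model features is easy to derive from the dict.
model_features = [col for cols in feature_groups.values() for col in cols]
print(model_features)
```

Passing one dictionary around is less error-prone than passing two lists that must stay in sync.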

For a large DataFrame, you can create these two lists by looking at the `dtypes` and sending the `int`/`float` type variables to `numeric_columns` and the object type variables to `categorical_columns`. 
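A minimal sketch of the dtype-based split, using pandas' `select_dtypes`. The variable names `numeric_guess` and `categorical_guess` are my own; treat the result as a first guess to review, not a final answer.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [100, 500, 600],
    "age": [23, 12, 5],
    "class": [6, 7, 8],          # a category that happens to be stored as integers
    "gender": ["M", "F", "F"],
})

# Integer and float columns are treated as numerical candidates.
numeric_guess = df.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as a categorical candidate.
categorical_guess = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_guess)
print(categorical_guess)
```

Note that `class` ends up in the numerical list because it is stored as integers, which is exactly the situation where a manual record of the variables is needed.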

In case this doesn’t work out and you have categories that are numbered, it’s best to use a YAML file to manually keep a record of all the model variables under suitable headings/keys. Creating a YAML would just be a one-time effort.

Note: Unless you specify the categories yourself, the encoders create the encoded categories in sorted (alphabetical) order, so you can expect to see `gender_F` and then `gender_M` for `gender`.
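The ordering can be inspected through the fitted encoder's `categories_` attribute. The sketch below also passes an explicit `categories` list to show how the default order can be overridden; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["M"], ["F"], ["F"]], dtype=object)

# Default behaviour: categories are learned from the data and sorted.
default_enc = OneHotEncoder().fit(X)
default_order = default_enc.categories_[0].tolist()

# Explicit behaviour: one list of categories per input column, kept in the given order.
custom_enc = OneHotEncoder(categories=[["M", "F"]]).fit(X)
custom_order = custom_enc.categories_[0].tolist()

print(default_order, custom_order)
```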

Now create a separate pipe for each list.
Note: Each step of a sklearn Pipeline is declared as a (name, transformer) tuple.

In [3]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

categorical_pipe = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])
numerical_pipe = Pipeline([("scale", MinMaxScaler())])

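To avoid the dummy variable trap when using linear models, pass `drop='first'` to `OneHotEncoder` (`drop_first=True` is the `pandas.get_dummies()` spelling and is not accepted by the encoder). Depending on your sklearn version, `drop` may not be combinable with `handle_unknown="ignore"`, in which case use it instead of `handle_unknown`.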
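In `OneHotEncoder`, dropping the first level of each feature is done through the `drop` parameter. A minimal sketch on the `gender` column alone (the variable names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"gender": ["M", "F", "F"]})

# drop='first' removes the first (sorted) category of each feature, here 'F',
# so a two-level feature is encoded with a single column.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(X).toarray()  # the encoder returns a sparse matrix by default

print(encoded)
```

The single remaining column is the indicator for `M`; a row of zeros means `F`.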

Now define the ColumnTransformer in the format (Name, Object, Columns):

In [79]:
preprocess = ColumnTransformer([
    ("cat", categorical_pipe, categorical_columns),
    ("num", numerical_pipe, numeric_columns),
], remainder='passthrough')

The `fit` and `transform` methods can now be invoked on a `DataFrame`.

In [80]:
import pandas as pd
train = pd.DataFrame({'income': [100, 500, 600],
                      'age': [23, 12, 5],
                      'class': [6, 7, 8],
                      'gender': ['M', 'F', 'F']})
preprocess = preprocess.fit(train)
train_transformed = preprocess.transform(train)
print(train_transformed)
pd.DataFrame(train_transformed)

[[1.         0.         0.         0.         1.         0.
  1.        ]
 [0.         1.         0.         1.         0.         0.8
  0.38888889]
 [0.         0.         1.         1.         0.         1.
  0.        ]]


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.8 | 0.388889 |
| 2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


Alternatively, you can do both steps at once with `train_transformed = preprocess.fit_transform(train)`. This also fits `preprocess` in place, so keep a reference to that object: it is what you _should_ use to transform the validation and test sets.
Notice that `train_transformed` is returned as an nd-array (something you don’t want) with column information lost.
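A minimal sketch of the one-step variant, assuming `fit_transform` is called on the ColumnTransformer object itself. The object is fitted in place, so it can be reused on new data afterwards. The one-row `test` frame and the name `ct` are made up for illustration, and the pipes are replaced by bare transformers for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

train = pd.DataFrame({"income": [100, 500, 600], "age": [23, 12, 5],
                      "class": [6, 7, 8], "gender": ["M", "F", "F"]})
test = pd.DataFrame({"income": [400], "age": [12], "class": [6], "gender": ["M"]})

ct = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["class", "gender"]),
    ("num", MinMaxScaler(), ["income", "age"]),
])

train_out = ct.fit_transform(train)  # fits `ct` in place and returns the transformed array
test_out = ct.transform(test)        # reuses the categories and min/max learned from `train`

print(train_out.shape, test_out.shape)
```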

The workaround for this is the function `get_transformer_feature_names(columnTransformer)`

Credits: https://stackoverflow.com/questions/57528350/can-you-consistently-keep-track-of-column-labels-using-sklearns-transformer-api

In [81]:
def get_transformer_feature_names(columnTransformer):

    output_features = []

    for name, pipe, features in columnTransformer.transformers_:
        if name!='remainder':
            for i in pipe:
                trans_features = []
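                if hasattr(i,'categories_'):
                    # get_feature_names was removed in newer sklearn releases; prefer get_feature_names_out when present
                    namer = getattr(i, 'get_feature_names_out', None) or i.get_feature_names
                    trans_features.extend(str(f) for f in namer(features))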
                else:
                    trans_features = features
            output_features.extend(trans_features)

    return output_features

get_transformer_feature_names(preprocess)

['class_6', 'class_7', 'class_8', 'gender_F', 'gender_M', 'income', 'age']

But then you ask, what about the order of the columns? How am I sure the columns are rightly named and there is no mix up?
The answer to this is: you need not worry. The column order within the categorical and numerical variables will be **exactly** as declared in the two initial lists, and the ColumnTransformer will not mess with the ordering **at all**. As for the overall arrangement of the numerical and categorical blocks, they will appear **in the order in which they were submitted to the ColumnTransformer.**

The original order of columns in the DataFrame **does not matter!**

See what happens if I send numerical first and categorical next in the ColumnTransformer.

In [39]:
preprocess = ColumnTransformer([
    ("num", numerical_pipe, numeric_columns),
    ("cat", categorical_pipe, categorical_columns),
], remainder='passthrough')

preprocess = preprocess.fit(train)
train_transformed = preprocess.transform(train)
get_transformer_feature_names(preprocess)

['income', 'age', 'class_6', 'class_7', 'class_8', 'gender_F', 'gender_M']

`get_transformer_feature_names` will retrieve names from `ColumnTransformer` and `Pipeline` in the same order, so one need not worry about mis-naming.
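The claim that the DataFrame's own column order does not matter can be checked with a small sketch: fit the same ColumnTransformer definition on the original frame and on a column-shuffled copy, then compare the outputs. The helper `build_preprocess` is a hypothetical convenience, not part of sklearn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

train = pd.DataFrame({"income": [100, 500, 600], "age": [23, 12, 5],
                      "class": [6, 7, 8], "gender": ["M", "F", "F"]})

def build_preprocess():
    # Hypothetical helper: returns a fresh, unfitted ColumnTransformer each time.
    return ColumnTransformer([
        ("num", MinMaxScaler(), ["income", "age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class", "gender"]),
    ])

# Same data, but with the columns rearranged in the DataFrame.
shuffled = train[["gender", "age", "class", "income"]]

out_original = build_preprocess().fit_transform(train)
out_shuffled = build_preprocess().fit_transform(shuffled)

# The output layout follows the transformer definition, not the DataFrame layout.
same_output = np.allclose(out_original, out_shuffled)
print(same_output)
```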

What about `remainder = 'passthrough'`? This is used to retain any other column(s) of the dataset that were not mentioned in any `Pipeline`. These columns can be anywhere in the dataset, even interleaved with the model variables. **After transformation, they are appended at the right end of the output, in the order in which they appear in the dataset from left to right.** See the example below.

In [82]:
train = pd.DataFrame({'income': [100, 500, 600],
                      'age': [23, 12, 5],
                      'location': ['Urban', 'Urban', 'Urban'],
                      'nbrhd': ['Rich', 'Poor', 'Poor'],  # Added these 2 new columns that should 'passthrough'
                      'class': [6, 7, 8],
                      'gender': ['M', 'F', 'F']})
preprocess = preprocess.fit(train)
train_transformed = preprocess.transform(train)
pd.DataFrame(train_transformed)

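|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0.0 | 1.0 | Urban | Rich |
| 1 | 0 | 1 | 0 | 1 | 0 | 0.8 | 0.388889 | Urban | Poor |
| 2 | 0 | 0 | 1 | 1 | 0 | 1.0 | 0.0 | Urban | Poor |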


However, note that the names of these passthrough columns are not returned by `get_transformer_feature_names`:

In [83]:
get_transformer_feature_names(preprocess)

['class_6', 'class_7', 'class_8', 'gender_F', 'gender_M', 'income', 'age']

The helper skips the `remainder` entry of `transformers_`, so the passthrough columns never make it into the list. To get all column names, explicitly extend the list returned by `get_transformer_feature_names`:

In [84]:
passed_cols = list(train.columns[~train.columns.isin(numeric_columns + categorical_columns)])
pd.DataFrame(train_transformed, columns = get_transformer_feature_names(preprocess) + passed_cols)

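|   | class_6 | class_7 | class_8 | gender_F | gender_M | income | age | location | nbrhd |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0.0 | 1.0 | Urban | Rich |
| 1 | 0 | 1 | 0 | 1 | 0 | 0.8 | 0.388889 | Urban | Poor |
| 2 | 0 | 0 | 1 | 1 | 0 | 1.0 | 0.0 | Urban | Poor |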


Similarly, one need not worry about using `preprocess.transform()` on the test set. In the sklearn version used here (newer releases match DataFrame columns by name rather than by position, so verify these on your own version):
- If any column from the training set is missing in the test set, even a passthrough column, it raises a ValueError saying that a higher number of features was expected
- If the column order differs from the training set in any way, even slightly, it raises a ValueError saying that the column ordering must be the same as when fitted
- If there is a never-before-seen extra column somewhere in between, it raises the same error as above
- If there is a never-before-seen extra column at the end, it lets that column pass through, but raises a FutureWarning saying that this will fail in sklearn’s next version
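The missing-column case is the easiest of these to reproduce. A minimal sketch, with a made-up one-row `test` frame and bare transformers instead of pipes; the exact error message varies between sklearn versions, so only the exception type is relied on here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

train = pd.DataFrame({"income": [100, 500, 600], "age": [23, 12, 5],
                      "class": [6, 7, 8], "gender": ["M", "F", "F"]})
test = pd.DataFrame({"income": [400], "age": [12], "class": [6], "gender": ["M"]})

ct = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["class", "gender"]),
    ("num", MinMaxScaler(), ["income", "age"]),
], remainder="passthrough").fit(train)

# Drop a column the transformer was fitted on and try to transform.
try:
    ct.transform(test.drop(columns=["gender"]))
    error_message = None
except ValueError as exc:
    error_message = str(exc)  # wording depends on the installed sklearn version

print(error_message)
```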

In [86]:
test = pd.DataFrame({'income': [400, 600, 900],
                     'age': [12, 15, 76],
                     'location': ['Rural', 'Urban', 'Urban'],
                     'nbrhd': ['Rich', 'Rich', 'Poor'],
                     'class': [6, 7, 7],
                     'gender': ['M', 'M', 'F']})
pd.DataFrame(preprocess.transform(test))
# Try removing any column, even if it is a passthrough column
# Try adding 'occu':['ag','ag','non-ag'] at various positions in between and at the end

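|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0.6 | 0.388889 | Rural | Rich |
| 1 | 0 | 1 | 0 | 0 | 1 | 1.0 | 0.555556 | Urban | Rich |
| 2 | 0 | 1 | 0 | 1 | 0 | 1.6 | 3.94444 | Urban | Poor |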
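The values 1.6 and 3.94444 in the last row fall outside [0, 1] because `MinMaxScaler` keeps using the minimum and maximum it learned from the training set. The arithmetic can be reproduced by hand, as in this small sketch (variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training values for income and age, as in the example above.
train_values = np.array([[100, 23], [500, 12], [600, 5]], dtype=float)
scaler = MinMaxScaler().fit(train_values)

# Third test row: income=900 and age=76 both exceed the training maximum.
scaled = scaler.transform(np.array([[900, 76]], dtype=float))

# (x - train_min) / (train_max - train_min)
manual_income = (900 - 100) / (600 - 100)
manual_age = (76 - 5) / (23 - 5)

print(scaled, manual_income, manual_age)
```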


Lastly, a word on the alternative.

Another option is to use `pandas.get_dummies()`, but it is generally not recommended: your training set (which is larger and richer) might have categories that are not present in the validation and test sets, so after encoding the number of columns will not match and the model will fail to predict. A fitted `OneHotEncoder` avoids this because its output columns are fixed by the categories seen during `fit`, and `handle_unknown="ignore"` additionally covers categories that show up only at prediction time.
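A minimal sketch of the mismatch, on a made-up single-column dataset. `get_dummies` derives its columns from whatever data it is given, while the fitted encoder always returns the columns it learned from the training set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"class": ["a", "b", "c"]})
test = pd.DataFrame({"class": ["a", "d"]})  # 'b' and 'c' are absent, 'd' was never seen in training

# get_dummies: the columns depend on the data passed in, so train and test disagree.
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)
dummy_cols_match = list(train_dummies.columns) == list(test_dummies.columns)

# OneHotEncoder: the columns are fixed at fit time; the unseen 'd' becomes a row of zeros.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
test_encoded = encoder.transform(test).toarray()

print(dummy_cols_match)
print(test_encoded)
```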

You can use `pandas.get_dummies()` if: 
- your data has only a few, broad categories
- it is possible for you to create dummies first and split data into train-test second

See: http://fastml.com/how-to-use-pd-dot-get-dummies-with-the-test-set/

But scaling, encoding and imputation all come together as one package in a Pipeline, which is why I make the case for it.
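To round this off, here is a minimal sketch of what that package can look like when imputation is added as a first step in each pipe. The dataset and the choice of strategies (`mean` for numerical, `most_frequent` for categorical) are assumptions made for illustration; only the numerical column has a missing value here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

train = pd.DataFrame({"income": [100, np.nan, 600],  # one missing income
                      "age": [23, 12, 5],
                      "gender": ["M", "F", "F"]})

# Numerical columns: fill gaps with the column mean, then scale to [0, 1].
numerical_pipe = Pipeline([("impute", SimpleImputer(strategy="mean")),
                           ("scale", MinMaxScaler())])

# Categorical columns: fill gaps with the most frequent level, then one-hot encode.
categorical_pipe = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                             ("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([
    ("num", numerical_pipe, ["income", "age"]),
    ("cat", categorical_pipe, ["gender"]),
])

out = preprocess.fit_transform(train)
print(out)
```

The missing income is replaced by the training mean (350) before scaling, so it lands at 0.5 in the output.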