# Categorical Variables

In the last section we talked about scaling for continuous features, which all the features in the cancer dataset were.
In the lending club data, we also saw another type of feature: categorical or discrete features. Categorical or discrete features are those that can take one of several distinct values that are usually not numerical, and often not even ordered. In the lending club data, examples were the grade, which could be a letter from 'A' to 'G', and home ownership, which could be 'RENT', 'MORTGAGE', 'OWN' or 'ANY'.


In [1]:
import pandas as pd

In [2]:
loans = pd.read_csv("C:/Users/t3kci/Downloads/loan.csv/loan.csv", nrows=100000)



In [3]:
loans.grade.unique()

array(['C', 'D', 'B', 'A', 'E', 'F', 'G'], dtype=object)

In [4]:
loans.home_ownership.unique()

array(['RENT', 'MORTGAGE', 'OWN', 'ANY'], dtype=object)

Most machine learning algorithms require that we preprocess these features into some numeric encoding before we can use them. For our old friend ``KNeighborsClassifier``, we could in theory define a distance metric on these categorical features (many options exist), though the simpler and more common approach is to transform the data so we can use the standard Euclidean distance on all the features.
Let's take a small subset of the data to investigate the possible preprocessing methods:

In [5]:
# look only at the two loan statuses we discussed in Chapter TODO
loans_paid = loans[loans.loan_status.isin(['Fully Paid', 'Charged Off'])]
# For this example, we want some paid and some charged off loans
# we group them to make sure we get some of both. We then take the first 10 and remove the grouping index
some_loans = loans_paid.groupby('loan_status').apply(lambda x: x.head(10)).reset_index(drop=True)
# we only consider three relatively well-behaved features and the loan status
small_data = some_loans[['loan_amnt', 'home_ownership', 'grade', 'loan_status']]
small_data

Unnamed: 0,loan_amnt,home_ownership,grade,loan_status
0,8000,MORTGAGE,A,Charged Off
1,6000,MORTGAGE,C,Charged Off
2,10000,MORTGAGE,A,Charged Off
3,10000,RENT,E,Charged Off
4,35000,MORTGAGE,C,Charged Off
5,4800,MORTGAGE,C,Charged Off
6,35000,RENT,C,Charged Off
7,15000,OWN,B,Charged Off
8,16000,RENT,B,Charged Off
9,25000,MORTGAGE,A,Charged Off


The loan amount is a continuous feature (even though it is recorded as an integer amount), while grade and home ownership are categorical. The loan_status is our classification target (which means it's also a discrete variable, but we don't consider it a feature and won't process it as such).
Scikit-learn requires you to explicitly handle categorical features in most cases, which is unlike some other libraries and frameworks. However, that gives you more control over the processing, and a better idea of what happens to your data.

The encoding that is most appropriate depends on the model you're using, but there are some general encoding schemes that are frequently used.

## Ordinal (or integer) encoding
One of the simplest ways to encode categorical data is to assign an integer to each category. You could do this with the ``OrdinalEncoder`` in scikit-learn, or with pandas by using pandas categorical data:

In [6]:
# extract a column and convert it to categorical data (it was represented as strings before)
home_ownership_cat = small_data.home_ownership.astype('category')
home_ownership_cat

0     MORTGAGE
1     MORTGAGE
2     MORTGAGE
3         RENT
4     MORTGAGE
5     MORTGAGE
6         RENT
7          OWN
8         RENT
9     MORTGAGE
10    MORTGAGE
11    MORTGAGE
12    MORTGAGE
13        RENT
14    MORTGAGE
15        RENT
16        RENT
17        RENT
18    MORTGAGE
19         OWN
Name: home_ownership, dtype: category
Categories (3, object): [MORTGAGE, OWN, RENT]

In [7]:
# get integer codes from the categorical data
# all categorical operations are accessible through the cat attribute:
home_ownership_cat.cat.codes

0     0
1     0
2     0
3     2
4     0
5     0
6     2
7     1
8     2
9     0
10    0
11    0
12    0
13    2
14    0
15    2
16    2
17    2
18    0
19    1
dtype: int8

We could create a new dataframe using these integer codes, which could now be interpreted by a machine learning model. However, this is often problematic, as it imposes an order and a distance between the different categories that might not accurately reflect the semantics of the data. Both pandas and scikit-learn by default use the lexical ordering of categories, so MORTGAGE corresponds to 0, OWN to 1 and RENT to 2. This order makes little sense. We could specify our own ordering, say RENT, MORTGAGE, OWN (describing degrees of ownership), but this is also not entirely satisfactory: if we encode it using integers, we postulate that the difference between RENT and MORTGAGE is the same as the difference between MORTGAGE and OWN, and that the difference between RENT and OWN is twice the difference between MORTGAGE and OWN. Making these assumptions seems somewhat questionable, and in many cases even ordering the categories might be hard: imagine working on a dataset of cars that includes the color, where imposing any ordering seems very arbitrary.

For the grades, using integer encodings might be reasonable, as there is a clear ordering and distance. Whether this is appropriate might depend on the model you are using. If there are very few categories, such as here, it's probably a safer bet to forgo the ordinal encoding and use a different scheme instead.
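To make the idea of a custom ordering concrete, here is a minimal sketch using an ordered pandas categorical. The toy series and the chosen order RENT < MORTGAGE < OWN are assumptions made for illustration, not part of the lending club analysis:

```python
import pandas as pd

# toy data, made up for illustration
ownership = pd.Series(['RENT', 'MORTGAGE', 'OWN', 'RENT'])

# an explicit order replaces the default lexical one
order = ['RENT', 'MORTGAGE', 'OWN']
ordered = pd.Categorical(ownership, categories=order, ordered=True)

# the codes follow the position in ``order``: RENT=0, MORTGAGE=1, OWN=2
codes = pd.Series(ordered.codes, index=ownership.index)
print(codes.tolist())
```

Note that the codes still imply equal spacing between neighboring categories, so the caveat about distances applies unchanged.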

## One-Hot (Dummy) Encoding

The most commonly used encoding scheme for categorical variables by far is the so-called one-hot encoding or dummy encoding. The idea behind one-hot encoding is to add a new column for each value of a categorical variable, and set the column to 1 for the category that applies to the row, and 0 for all the other categories. An easy way to compute this encoding is the ``get_dummies`` function in pandas:

In [8]:
small_data[['grade']].head()

Unnamed: 0,grade
0,A
1,C
2,A
3,E
4,C


In [9]:
pd.get_dummies(small_data[['grade']]).head()

Unnamed: 0,grade_A,grade_B,grade_C,grade_D,grade_E
0,1,0,0,0,0
1,0,0,1,0,0
2,1,0,0,0,0
3,0,0,0,0,1
4,0,0,1,0,0


As you can see, ``get_dummies`` replaced the one column ``grade`` with five columns, one for each possible value. The original value for the grade of the first row was ``A``, so the new column ``grade_A`` has a 1, while all the other columns have a 0.
We can also call ``get_dummies`` on the whole dataframe. In this case, it will apply dummy encoding to all the columns that have either categorical data or objects (including strings):

In [10]:
pd.get_dummies(small_data).head()

Unnamed: 0,loan_amnt,home_ownership_MORTGAGE,home_ownership_OWN,home_ownership_RENT,grade_A,grade_B,grade_C,grade_D,grade_E,loan_status_Charged Off,loan_status_Fully Paid
0,8000,1,0,0,1,0,0,0,0,1,0
1,6000,1,0,0,0,0,1,0,0,1,0
2,10000,1,0,0,1,0,0,0,0,1,0
3,10000,0,0,1,0,0,0,0,1,1,0
4,35000,1,0,0,0,0,1,0,0,1,0


As you can see, ``loan_amnt`` wasn't changed, while the dummy encoding was applied to all the other columns, including the target ``loan_status``. As we don't want to encode this column, we can explicitly provide the columns that we want to encode:

In [11]:
pd.get_dummies(small_data, columns=['home_ownership', 'grade']).head()

Unnamed: 0,loan_amnt,loan_status,home_ownership_MORTGAGE,home_ownership_OWN,home_ownership_RENT,grade_A,grade_B,grade_C,grade_D,grade_E
0,8000,Charged Off,1,0,0,1,0,0,0,0
1,6000,Charged Off,1,0,0,0,0,1,0,0
2,10000,Charged Off,1,0,0,1,0,0,0,0
3,10000,Charged Off,0,0,1,0,0,0,0,1
4,35000,Charged Off,1,0,0,0,0,1,0,0


```{note}
If a categorical variable was already represented as an integer you can force pandas to apply dummy encoding by passing the column name to the ``columns`` parameter of ``pd.get_dummies``.
```
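A small sketch of the situation described in the note, using a made-up dataframe in which the hypothetical column ``term_code`` is a category stored as an integer:

```python
import pandas as pd

# toy data: ``term_code`` is a categorical variable that happens to be stored as integers
df = pd.DataFrame({'term_code': [0, 1, 2, 1],
                   'loan_amnt': [8000, 6000, 10000, 4800]})

# without ``columns``, integer columns are treated as numeric and left untouched
default = pd.get_dummies(df)
print(list(default.columns))

# naming the column explicitly forces the dummy encoding
forced = pd.get_dummies(df, columns=['term_code'])
print(list(forced.columns))
```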

### Aligning dataframes with pandas
A common problem in using ``get_dummies`` is that if you have multiple datasets or files, and you call ``get_dummies`` on each of them, you might get inconsistent encodings.
Let's split our toy data into training and test set and apply ``get_dummies`` on them separately:

In [12]:
from sklearn.model_selection import train_test_split
small_train, small_test = train_test_split(small_data, random_state=21)
# To avoid setting with copy warnings we copy the data after splitting
# this is probably not necessary
small_train = small_train.copy()
small_test = small_test.copy()
pd.get_dummies(small_train, columns=['home_ownership', 'grade'])

Unnamed: 0,loan_amnt,loan_status,home_ownership_MORTGAGE,home_ownership_RENT,grade_A,grade_B,grade_C,grade_D,grade_E
17,2500,Fully Paid,0,1,0,0,1,0,0
18,4000,Fully Paid,1,0,0,0,0,1,0
11,40000,Fully Paid,1,0,0,0,1,0,0
6,35000,Charged Off,0,1,0,0,1,0,0
14,8425,Fully Paid,1,0,0,0,0,0,1
1,6000,Charged Off,1,0,0,0,1,0,0
2,10000,Charged Off,1,0,1,0,0,0,0
12,20000,Fully Paid,1,0,1,0,0,0,0
3,10000,Charged Off,0,1,0,0,0,0,1
8,16000,Charged Off,0,1,0,1,0,0,0


In [13]:
pd.get_dummies(small_test, columns=['home_ownership', 'grade'])

Unnamed: 0,loan_amnt,loan_status,home_ownership_MORTGAGE,home_ownership_OWN,home_ownership_RENT,grade_A,grade_B,grade_C,grade_D
7,15000,Charged Off,0,1,0,0,1,0,0
10,30000,Fully Paid,1,0,0,0,0,0,1
19,2700,Fully Paid,0,1,0,1,0,0,0
13,4500,Fully Paid,0,0,1,0,1,0,0
5,4800,Charged Off,1,0,0,0,0,1,0


Both of the dataframes have nine columns, seven of which are dummy columns, but the meaning of the columns is quite different. The training set doesn't have a column for ``home_ownership=OWN`` while the test set doesn't have a column for ``grade=E``.
However, remember that scikit-learn doesn't know about column names in dataframes, so if you'd passed this data directly into scikit-learn, you would get meaningless predictions without knowing it! TODO callout?

There are several ways to avoid this. The easiest is to use ``get_dummies`` before splitting the data, but that might not be possible if new data arrives and you want to apply an existing model.
Another way is to encode all categorical variables using the pandas categorical type, and specifying all the known categories:

In [14]:
ownership_cats = ['MORTGAGE', 'OWN', 'RENT']
grade_cats = ['A', 'B', 'C', 'D', 'E']

small_test_explicit_cats = small_test.copy()
small_test_explicit_cats['home_ownership'] = pd.Categorical(small_test_explicit_cats['home_ownership'], categories=ownership_cats)
small_test_explicit_cats['grade'] = pd.Categorical(small_test_explicit_cats['grade'], categories=grade_cats)

Now the columns are aware of all possible values, and all of them will receive a column, whether the value is present or not (note the all-zero column `grade_E`):

In [15]:
pd.get_dummies(small_test_explicit_cats, columns=['home_ownership', 'grade'])

Unnamed: 0,loan_amnt,loan_status,home_ownership_MORTGAGE,home_ownership_OWN,home_ownership_RENT,grade_A,grade_B,grade_C,grade_D,grade_E
7,15000,Charged Off,0,1,0,0,1,0,0,0
10,30000,Fully Paid,1,0,0,0,0,0,1,0
19,2700,Fully Paid,0,1,0,1,0,0,0,0
13,4500,Fully Paid,0,0,1,0,1,0,0,0
5,4800,Charged Off,1,0,0,0,0,1,0,0


This method is very explicit and safe, but can be a bit cumbersome if there are many categorical columns, or if not all of the categories are known beforehand.
A somewhat simpler and less explicit method is using the ``align`` method of the dataframe after calling ``get_dummies``:

In [16]:
# compute dummy encoding (the columns will not match afterwards)
small_train_dummies = pd.get_dummies(small_train, columns=['home_ownership', 'grade'])
small_test_dummies = pd.get_dummies(small_test, columns=['home_ownership', 'grade'])

# align dataframes
# join='right' aligns test (left=self) to train (right=other) keeping only the columns in train
# axis=1 means we align only the columns, and don't try joining the row indices
# align returns two aligned frames; because we did a right join, train is unchanged
# so we can discard the second return value (and assign it to _)
# fill value specifies what to put into previously non-existing columns
test_aligned, _ = small_test_dummies.align(small_train_dummies, join='right', axis=1, fill_value=0)
test_aligned

Unnamed: 0,loan_amnt,loan_status,home_ownership_MORTGAGE,home_ownership_RENT,grade_A,grade_B,grade_C,grade_D,grade_E
7,15000,Charged Off,0,0,0,1,0,0,0
10,30000,Fully Paid,1,0,0,0,0,1,0
19,2700,Fully Paid,0,0,1,0,0,0,0
13,4500,Fully Paid,0,1,0,1,0,0,0
5,4800,Charged Off,1,0,0,0,1,0,0


The result is an aligned dataframe ``test_aligned`` that has the same columns as ``small_train_dummies``, the dataframe it was aligned with.
This ensures that the shapes are compatible between training and test set and we will get meaningful results from scikit-learn.
Note that the ``home_ownership_OWN`` column that was previously present in the test set was dropped, as it is not present in the training set.
We could also perform an outer join when aligning the dataframes (which is the default in ``align`` if you don't specify ``join``), so that the aligned dataframes contain all the columns present in either of the dataframes:


In [17]:
# if we don't specify join, an outer join is performed and we retain all the columns
# in this case we also want to store the new aligned training set
test_aligned, train_aligned = small_test_dummies.align(small_train_dummies, axis=1, fill_value=0)
test_aligned

Unnamed: 0,grade_A,grade_B,grade_C,grade_D,grade_E,home_ownership_MORTGAGE,home_ownership_OWN,home_ownership_RENT,loan_amnt,loan_status
7,0,1,0,0,0,0,1,0,15000,Charged Off
10,0,0,0,1,0,1,0,0,30000,Fully Paid
19,1,0,0,0,0,0,1,0,2700,Fully Paid
13,0,1,0,0,0,0,0,1,4500,Fully Paid
5,0,0,1,0,0,1,0,0,4800,Charged Off


However, that's not very useful if you already created a model using the original training dataset.
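If a model has already been trained, one option is to keep the list of training columns and conform new data to it. The following is a minimal sketch with ``DataFrame.reindex`` on made-up toy frames; it has the same effect as the right join shown earlier, and the variable names are assumptions for illustration:

```python
import pandas as pd

# toy training and test data with different observed categories
train = pd.DataFrame({'home_ownership': ['MORTGAGE', 'RENT'], 'grade': ['A', 'E']})
test = pd.DataFrame({'home_ownership': ['OWN', 'MORTGAGE'], 'grade': ['B', 'A']})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# store the column layout the model was trained on
train_columns = list(train_dummies.columns)

# conform the test data: unseen columns are dropped, missing ones are filled with 0
test_aligned = test_dummies.reindex(columns=train_columns, fill_value=0).astype(int)
print(test_aligned)
```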

## OneHotEncoder - dummies with sklearn
Another option (and the one we will use for most of the book) is to do our encoding with scikit-learn. Because scikit-learn has a concept of training and test set, the issue of aligning the data is handled automatically.
The dummy or one-hot encoding in scikit-learn is implemented in the ``OneHotEncoder``, which is a transformer, just like other preprocessing methods.

In [18]:
from sklearn.preprocessing import OneHotEncoder
# By default, OneHotEncoder errors if it sees unknown categories in the test data
# we will overwrite this behavior by specifying handle_unknown='ignore'
# also OneHotEncoder by default outputs scipy sparse matrices, which are more efficient but cumbersome
# we disable that with sparse=False
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
# fit on the training set; stores the categories for each column
ohe.fit(small_train)
# apply one-hot encoding to both training and test set
X_train_ohe = ohe.transform(small_train)
X_test_ohe = ohe.transform(small_test)

In [19]:
X_train_ohe

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        1., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 1., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
        1., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
        1., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        1., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
        0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.,
        0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 1., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1.,
        0., 0., 0.,

```{margin} Feature names in scikit-learn
While scikit-learn doesn't output pandas dataframes (for now) we can get feature names for the output of some transformers using the ``get_feature_names`` method.
```

There are two things you might notice in this output. First, as with all scikit-learn transformations, the output is a numpy array without column names, which can be a bit inconvenient.
Second, all the columns have been one-hot-encoded, including the loan amount, which is no longer present as a numeric feature.
We can actually get the feature names by using the ``get_feature_names`` method of the ``OneHotEncoder``:

In [20]:
ohe.get_feature_names()

array(['x0_2500', 'x0_4000', 'x0_6000', 'x0_6600', 'x0_8000', 'x0_8425',
       'x0_10000', 'x0_16000', 'x0_20000', 'x0_25000', 'x0_35000',
       'x0_40000', 'x1_MORTGAGE', 'x1_RENT', 'x2_A', 'x2_B', 'x2_C',
       'x2_D', 'x2_E', 'x3_Charged Off', 'x3_Fully Paid'], dtype=object)

By default, the input columns in scikit-learn are named ``x0``, ``x1`` and so on, so what this tells us is that for the first feature, loan amount, several new columns were added to one-hot-encode the observed integer values, which was not what we had in mind. We can get more informative feature names by passing the original dataframe column names to ``get_feature_names``:

In [21]:
ohe.get_feature_names(small_train.columns)

array(['loan_amnt_2500', 'loan_amnt_4000', 'loan_amnt_6000',
       'loan_amnt_6600', 'loan_amnt_8000', 'loan_amnt_8425',
       'loan_amnt_10000', 'loan_amnt_16000', 'loan_amnt_20000',
       'loan_amnt_25000', 'loan_amnt_35000', 'loan_amnt_40000',
       'home_ownership_MORTGAGE', 'home_ownership_RENT', 'grade_A',
       'grade_B', 'grade_C', 'grade_D', 'grade_E',
       'loan_status_Charged Off', 'loan_status_Fully Paid'], dtype=object)

``OneHotEncoder``, like nearly all other estimators in scikit-learn, always works on all input columns. So we need to pass it only the columns that we want to transform.
One possible way to do this is to slice off the categorical columns and transform them with ``OneHotEncoder``, like so:

In [22]:
small_train.columns

Index(['loan_amnt', 'home_ownership', 'grade', 'loan_status'], dtype='object')

In [23]:
train_cat = small_train[['home_ownership', 'grade']]
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
train_cat_ohe = ohe.fit_transform(train_cat)
train_cat_ohe

array([[0., 1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 1., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 1., 0., 0., 0., 0.]])

We can now make this one-hot-encoded array back into a dataframe by using ``get_feature_names``:

In [24]:
train_df_ohe = pd.DataFrame(train_cat_ohe,
                            # create column names
                            columns=ohe.get_feature_names(train_cat.columns),
                            # keep the old index.
                            # this is optional but will make joining with the other features easier
                            index=train_cat.index)
train_df_ohe.head()

Unnamed: 0,home_ownership_MORTGAGE,home_ownership_RENT,grade_A,grade_B,grade_C,grade_D,grade_E
17,0.0,1.0,0.0,0.0,1.0,0.0,0.0
18,1.0,0.0,0.0,0.0,0.0,1.0,0.0
11,1.0,0.0,0.0,0.0,1.0,0.0,0.0
6,0.0,1.0,0.0,0.0,1.0,0.0,0.0
14,1.0,0.0,0.0,0.0,0.0,0.0,1.0


Then we can concatenate it again with the remaining ``loan_amnt`` feature:

In [25]:
train_df_all = pd.concat([train_df_ohe, small_train[['loan_amnt']]], axis=1)
train_df_all.head()

Unnamed: 0,home_ownership_MORTGAGE,home_ownership_RENT,grade_A,grade_B,grade_C,grade_D,grade_E,loan_amnt
17,0.0,1.0,0.0,0.0,1.0,0.0,0.0,2500
18,1.0,0.0,0.0,0.0,0.0,1.0,0.0,4000
11,1.0,0.0,0.0,0.0,1.0,0.0,0.0,40000
6,0.0,1.0,0.0,0.0,1.0,0.0,0.0,35000
14,1.0,0.0,0.0,0.0,0.0,0.0,1.0,8425


While this is the result we wanted, this was pretty complicated compared to using ``pd.get_dummies``.
Luckily, scikit-learn has another tool that will make this much easier, the ``ColumnTransformer``.

## ColumnTransformer

The ``ColumnTransformer`` is another meta-estimator, similar to the ``Pipeline``, for combining multiple transformations.
In particular, the ``ColumnTransformer`` allows you to apply different transformations to different subsets of the columns.
It's the only part of scikit-learn that explicitly uses pandas column names, and is made specifically to ease the use of pandas dataframes
as input to scikit-learn models.

In contrast to ``Pipeline``, which basically applies several transformations in sequence, the ``ColumnTransformer`` applies several transformations in parallel,
each on a subset of columns, and then concatenates the results, similar to what we did manually above. This is illustrated in Figure TODO.

TODO new image

![:scale 100%](images/column_transformer_schematic.png)
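The "in parallel, then concatenate" behavior can be checked on a toy example. This sketch uses made-up data and the ``ColumnTransformer`` class directly; ``sparse_threshold=0`` is set only so that the result is always a dense array that is easy to compare:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data, made up for illustration
df = pd.DataFrame({'loan_amnt': [8000.0, 6000.0, 10000.0, 35000.0],
                   'grade': ['A', 'C', 'A', 'E']})

# each transformer only sees its own subset of columns
ct = ColumnTransformer([('ohe', OneHotEncoder(), ['grade']),
                        ('scaler', StandardScaler(), ['loan_amnt'])],
                       sparse_threshold=0)
combined = ct.fit_transform(df)

# the same result, built by hand: transform each subset and stack the columns
manual = np.hstack([OneHotEncoder().fit_transform(df[['grade']]).toarray(),
                    StandardScaler().fit_transform(df[['loan_amnt']])])

print(combined.shape)
print(np.allclose(combined, manual))
```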

As with the pipeline, there is a ``make_column_transformer`` helper function to easily create a new column transformer.
The function takes tuples, where each tuple consists of a transformation and the columns the transformation should be applied to.
So to apply the ``OneHotEncoder`` to only the ``home_ownership`` and ``grade`` columns, we could do:

In [26]:
from sklearn.compose import make_column_transformer
# a single transformation, OneHotEncoder, applied to two columns
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False), ['home_ownership', 'grade']))
# now we can pass the full dataset, as the column transformer will do the subsetting for us:
ct.fit(small_train)

ColumnTransformer(transformers=[('onehotencoder',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse=False),
                                 ['home_ownership', 'grade'])])

The ``make_column_transformer`` function has created a ``ColumnTransformer`` object for us with a single transformer. It has also automatically generated a name for the transformer using the lower-cased class name, `'onehotencoder'`.
We can now use the ``ColumnTransformer`` to transform our dataset:

In [27]:
ct.transform(small_train)

array([[0., 1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 1., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 1., 0., 0., 0., 0.]])

Again, we can get the column names for the output using ``get_feature_names``:
TODO this doesn't take the columns?!

In [28]:
ct.get_feature_names()

['onehotencoder__x0_MORTGAGE',
 'onehotencoder__x0_RENT',
 'onehotencoder__x1_A',
 'onehotencoder__x1_B',
 'onehotencoder__x1_C',
 'onehotencoder__x1_D',
 'onehotencoder__x1_E']

As you can see, by default, the ``ColumnTransformer`` only keeps the columns that we mentioned.
If we want to keep the remaining columns, we can specify ``remainder='passthrough'``:

In [29]:
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False),
                              ['home_ownership', 'grade']),
                             remainder='passthrough'
                            )
ct.fit_transform(small_train)

array([[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2500, 'Fully Paid'],
       [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 4000, 'Fully Paid'],
       [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 40000, 'Fully Paid'],
       [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 35000, 'Charged Off'],
       [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8425, 'Fully Paid'],
       [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 6000, 'Charged Off'],
       [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 10000, 'Charged Off'],
       [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 20000, 'Fully Paid'],
       [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 10000, 'Charged Off'],
       [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 16000, 'Charged Off'],
       [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 8000, 'Charged Off'],
       [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 6600, 'Fully Paid'],
       [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 35000, 'Charged Off'],
       [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 20000, 'Fully Paid'],
       [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 25000, 'Charged Off']],
      dtype=object)

Typically we would not want the target in our data within scikit-learn, though, so we might want to drop it beforehand, or we might want to pass through only the 'loan_amnt' column.
We can pass through only some columns by adding a new transformation in the ColumnTransformer, though instead of passing a scikit-learn transformer, we pass the string `'passthrough'`:

In [30]:
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False),
                              ['home_ownership', 'grade']),
                             ('passthrough', ['loan_amnt'])
                            )
# we set numpy printoptions to not use scientific notation for nicer output
# and suppress all but three decimal places
import numpy as np
np.set_printoptions(suppress=True, precision=3)
ct.fit_transform(small_train)

array([[    0.,     1.,     0.,     0.,     1.,     0.,     0.,  2500.],
       [    1.,     0.,     0.,     0.,     0.,     1.,     0.,  4000.],
       [    1.,     0.,     0.,     0.,     1.,     0.,     0., 40000.],
       [    0.,     1.,     0.,     0.,     1.,     0.,     0., 35000.],
       [    1.,     0.,     0.,     0.,     0.,     0.,     1.,  8425.],
       [    1.,     0.,     0.,     0.,     1.,     0.,     0.,  6000.],
       [    1.,     0.,     1.,     0.,     0.,     0.,     0., 10000.],
       [    1.,     0.,     1.,     0.,     0.,     0.,     0., 20000.],
       [    0.,     1.,     0.,     0.,     0.,     0.,     1., 10000.],
       [    0.,     1.,     0.,     1.,     0.,     0.,     0., 16000.],
       [    1.,     0.,     1.,     0.,     0.,     0.,     0.,  8000.],
       [    0.,     1.,     0.,     1.,     0.,     0.,     0.,  6600.],
       [    1.,     0.,     0.,     0.,     1.,     0.,     0., 35000.],
       [    0.,     1.,     0.,     0.,     0.,    

If instead, we want to scale the 'loan_amnt' column, we can pass ``StandardScaler`` instead of ``'passthrough'``:


In [31]:
from sklearn.preprocessing import StandardScaler
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False),
                              ['home_ownership', 'grade']),
                             (StandardScaler(), ['loan_amnt'])
                            )
ct.fit_transform(small_train)

array([[ 0.   ,  1.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   , -1.173],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   , -1.047],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  1.984],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  1.563],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   , -0.674],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   , -0.879],
       [ 1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   , -0.542],
       [ 1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.3  ],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   , -0.542],
       [ 0.   ,  1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   , -0.037],
       [ 1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   , -0.71 ],
       [ 0.   ,  1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   , -0.828],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  1.563],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  1.

Now we can finally transform our test dataset without much work:

In [32]:
ct.transform(small_test)

array([[ 0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   , -0.121],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  1.142],
       [ 0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   , -1.156],
       [ 0.   ,  1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   , -1.005],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   , -0.98 ]])
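The first row of this output has zeros in both home-ownership columns, because its value 'OWN' never occurred in the training set. The following sketch isolates that effect of ``handle_unknown='ignore'`` on made-up single-column data; the sparse result is converted with ``toarray`` for printing:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy data: 'OWN' appears only in the test data
train = np.array([['MORTGAGE'], ['RENT'], ['MORTGAGE']], dtype=object)
test = np.array([['OWN'], ['RENT']], dtype=object)

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)

# the unknown category is encoded as a row of zeros instead of raising an error
encoded = ohe.transform(test).toarray()
print(ohe.categories_[0])
print(encoded)
```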

Instead of using the ``make_column_transformer`` function, we can also directly use the ``ColumnTransformer`` class. As with the ``Pipeline``, this allows us to explicitly give names to the different transformers:

In [33]:
from sklearn.compose import ColumnTransformer
# this is equivalent to the column transformer created above
# each transformation is a tuple (name, transformer, columns)
ct = ColumnTransformer([('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False),
                         ['home_ownership', 'grade']),
                        ('scaler', StandardScaler(), ['loan_amnt'])])
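One practical benefit of explicit names is that the fitted sub-transformers can be looked up through the ``named_transformers_`` attribute. A minimal sketch on made-up toy data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data, made up for illustration
toy = pd.DataFrame({'loan_amnt': [8000.0, 6000.0, 10000.0],
                    'grade': ['A', 'C', 'A']})

ct = ColumnTransformer([('ohe', OneHotEncoder(handle_unknown='ignore'), ['grade']),
                        ('scaler', StandardScaler(), ['loan_amnt'])])
ct.fit(toy)

# look up the fitted transformers by the names we gave them
grade_categories = ct.named_transformers_['ohe'].categories_[0].tolist()
amount_mean = float(ct.named_transformers_['scaler'].mean_[0])
print(grade_categories, amount_mean)
```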

In [35]:
# hidden for gluing
import sklearn
sklearn.set_config(display='diagram')
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False),
                              ['home_ownership', 'grade']),
                             (StandardScaler(), ['loan_amnt']))
from myst_nb import glue
glue('ct_diagram', ct)
ct

````{note} Diagram representations of estimators
Starting with version 0.23, scikit-learn has diagram representations for estimators when running in Jupyter.
These can be enabled using ``sklearn.set_config``:

```python
import sklearn
sklearn.set_config(display='diagram')
```

```{glue:} ct_diagram
```

In particular for more complex workflows they can come in really handy.
````

## Combining ColumnTransformer and Pipeline

While ``ColumnTransformer`` by itself is already quite awesome, the real power comes from combining it with ``Pipeline`` to encapsulate the whole preprocessing and model training.
Let's apply ``KNeighborsClassifier`` to the small subset of the lending club data we have been looking at. This is more for illustrative purposes, as we're using a tiny subset, and KNeighborsClassifier is potentially not a good model for this dataset, but we will see this combination in many of our later examples.

Let's s

```python
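from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ``df`` is the dataframe of features; the mask is True for string (object) columns
categorical = df.dtypes == object

# scale the non-categorical columns, one-hot encode the categorical ones
preprocess = make_column_transformer(
    (StandardScaler(), ~categorical),
    (OneHotEncoder(), categorical))

model = make_pipeline(preprocess, LogisticRegression())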
```
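For a version that runs end to end, here is a sketch that puts ``KNeighborsClassifier`` behind a column transformer in a pipeline. The six loans are made up for illustration, and ``n_neighbors=1`` is chosen only because the toy data is so small:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data, made up for illustration
X = pd.DataFrame({
    'loan_amnt': [8000.0, 6000.0, 10000.0, 35000.0, 4800.0, 15000.0],
    'home_ownership': ['MORTGAGE', 'MORTGAGE', 'RENT', 'RENT', 'MORTGAGE', 'OWN'],
    'grade': ['A', 'C', 'A', 'E', 'C', 'B'],
})
y = pd.Series(['Charged Off', 'Fully Paid', 'Charged Off',
               'Fully Paid', 'Fully Paid', 'Charged Off'])

preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['home_ownership', 'grade']),
    (StandardScaler(), ['loan_amnt']))

# preprocessing and model are fitted together in a single call
model = make_pipeline(preprocess, KNeighborsClassifier(n_neighbors=1))
model.fit(X, y)
train_predictions = model.predict(X)

# new data goes through the same preprocessing, even with an unseen category
new_loan = pd.DataFrame({'loan_amnt': [9000.0],
                         'home_ownership': ['ANY'],
                         'grade': ['A']})
new_prediction = model.predict(new_loan)
print(new_prediction)
```

Because the preprocessing is part of the pipeline, the raw dataframe can be passed to ``fit`` and ``predict`` directly.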


The way to handle mixed-type data is the column transformer, which
allows you to transform only some of the columns. For
example, you can apply a categorical encoder only to the
categorical columns and ``StandardScaler`` to the
non-categorical columns, and then use the result to preprocess
your data. With plain pandas, you have to make sure your column
names match up and turn everything into numbers yourself;
with the column transformer, that bookkeeping is handled for you.

In contrast to basically all other estimators in sklearn,
this uses the column information in pandas and allows you to slice
out different columns based on column names, integer indices or boolean masks.
In this example I'm constructing a boolean mask that is ``True`` for every column with dtype ``object``.
Here's a schematic of the column transformer.
Most commonly you might want to separate continuous and categorical columns,
but you can select any subsets of columns you like. They can also overlap.
Or you can apply multiple transformations to the same set of columns.
Let's say I want a scaled version of the data, but I also want to
extract principal components. I can use the same columns as input to multiple
transformers, and the results will be concatenated.
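A minimal sketch of this idea, using random toy data and selecting columns by integer index; the number of components is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# toy data: 20 samples with 3 continuous features
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))

# both transformers receive the same three columns
all_columns = [0, 1, 2]
ct = make_column_transformer(
    (StandardScaler(), all_columns),
    (PCA(n_components=2), all_columns))

# the output holds 3 scaled columns followed by 2 principal components
X_combined = ct.fit_transform(X)
print(X_combined.shape)
```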

FIXME add code

## Determining what variables are categorical

## Dropping one category


### Dummy variables and collinearity

- One-hot is redundant (last one is 1 minus the sum of the others)
- Can introduce collinearity
- Can drop one
- Choice which one matters for penalized models
- Keeping all can make the model more interpretable
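The redundancy is easy to see on a toy column. This sketch uses the ``drop_first`` option of ``pd.get_dummies`` on made-up grades; which category to drop is a modeling choice, and dropping the first one here is just the pandas default for that option:

```python
import pandas as pd

# toy data, made up for illustration
grades = pd.DataFrame({'grade': ['A', 'C', 'A', 'E', 'B']})

# full one-hot encoding: every row sums to exactly 1
full = pd.get_dummies(grades).astype(int)
row_sums = full.sum(axis=1)

# dropping the first category removes the redundant column
reduced = pd.get_dummies(grades, drop_first=True).astype(int)

# the dropped column can be recovered from the remaining ones
recovered_grade_a = 1 - reduced.sum(axis=1)
print(list(reduced.columns))
print(recovered_grade_a.tolist())
```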






## Models Supporting Discrete Features

- In principle:
  - All tree-based models, naive Bayes
- In scikit-learn:
  - Some Naive Bayes classifiers.
- In scikit-learn "soon":
  - Decision trees, random forests, gradient boosting




In principle, all tree-based models support categorical
features. In scikit-learn none of them do yet; hopefully
they will soon. So you can either use the One Hot
Encoder, or encode the categories as integers and treat
the result as a continuous feature. If you have categorical
variables with very many levels, keeping them as integers
might make more sense.
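
As a rough sketch of the integer-encoding route, here is `OrdinalEncoder` followed by a decision tree on the small borough example (the frame `df_demo` is a copy made up for this illustration). A tree can isolate a single category with a pair of thresholds, even though the integer order itself carries no meaning.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

df_demo = pd.DataFrame({
    'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
    'vegan': ['No', 'No', 'No', 'Yes', 'Yes', 'No']})

# map each category to an integer code (categories are sorted alphabetically)
enc = OrdinalEncoder()
X_int = enc.fit_transform(df_demo[['boro']])

# the tree treats the codes as a continuous feature and splits on thresholds
tree = DecisionTreeClassifier(random_state=0).fit(X_int, df_demo.vegan)

print(list(enc.categories_[0]))
print(tree.get_depth(), tree.score(X_int, df_demo.vegan))
```

Here Brooklyn ends up between Bronx and Manhattan in the integer order, so the tree needs two splits to separate it from the rest.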

## Target Encoding (Impact Encoding)

![:scale 100%](images/zip_code_prices.png)


- For high cardinality categorical features
- Instead of 70 one-hot variables, one “response encoded” variable.
- For regression:
  - “average price in zip code”
- Binary classification:
  - “buildings in this zip code have a likelihood p for class 1”
- Multiclass:
  - One feature per class: probability distribution
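
The multiclass case from the list above can be sketched with `pd.crosstab`. The zip codes and building types below are invented for illustration; in practice the fractions would be computed on the training set only.

```python
import pandas as pd

# Invented training data: a zip code and a three-class label per building
train = pd.DataFrame({
    'zipcode': ['10001', '10001', '10001', '10001', '94110', '94110'],
    'label': ['condo', 'condo', 'house', 'townhouse', 'house', 'house']})

# rows: zip codes, columns: classes, values: fraction of each class in that zip code
class_fractions = pd.crosstab(train['zipcode'], train['label'], normalize='index')

# look up the class distribution of each sample's zip code:
# one new feature per class
encoded = class_fractions.loc[train['zipcode']].reset_index(drop=True)
print(encoded)
```

Each row of `encoded` is a probability distribution over the classes, so its entries sum to one.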




There's another way to encode categorical variables
that is often used; I like to call it target-based encoding.
It's meant for very high cardinality categorical
features. For example, if you have a categorical feature that
is all US states and you don't have a lot of samples, or a
categorical feature that is all US zip codes,
you don't want to do One Hot
Encoding. For the states you would get 50 new features, which
is a lot if you don't have much data. Instead,
you can use one single variable that encodes the
response. For regression, it would be "people in this
state have an average response of that". You compute this
on the training set, not on the test set: for each level of
the categorical variable, you find the mean response and
use it as the feature value. So you get one single
feature. For binary classification, you can use the
fraction of people in each level that belong to class one. For
multi-class, you usually use the fraction of
people in each of the classes. So in multi-class, you get
one new feature per class, and for each state you compute what
fraction of the people in this state belongs to each class.
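
The regression case described above can be written by hand in a few lines. This is a simplified sketch on invented data: it uses plain per-level means, and falling back to the global training mean for levels not seen during training is an assumption made here, not something prescribed by the text. Library implementations may additionally smooth the per-level means.

```python
import pandas as pd

# Invented data: a state and a price per sample
train = pd.DataFrame({'state': ['NY', 'NY', 'CA', 'CA', 'CA', 'TX'],
                      'price': [300.0, 500.0, 700.0, 900.0, 800.0, 200.0]})
test = pd.DataFrame({'state': ['CA', 'NY', 'WA']})

# mean response per level, computed on the training set only
state_means = train.groupby('state')['price'].mean()
# assumed fallback for levels that never appear in the training set
global_mean = train['price'].mean()

# replace each level by its mean response: one single numeric feature
train_encoded = train['state'].map(state_means)
test_encoded = test['state'].map(state_means).fillna(global_mean)

print(state_means)
print(test_encoded)
```

The test set is only transformed with statistics learned from the training data, which mirrors the usual fit/transform split.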


More encodings for categorical features: http://contrib.scikit-learn.org/categorical-encoding/

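## Load data, include ZIP code
```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

data = fetch_openml("house_sales", as_frame=True)
target = data.frame.price
X = data.frame.drop(['date', 'price'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, target)
X_train.columns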
```
```
Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat',
       'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')
```

.tiny[
```python
X_train.head()
```


  

    

      
| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | ... | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10666 | 4.0 | 2.50 | 2160.0 | 7000.0 | 2.0 | ... | 98029.0 | 47.566 | -122.013 | 2300.0 | 7440.0 |
| 19108 | 4.0 | 4.25 | 3250.0 | 11780.0 | 2.0 | ... | 98004.0 | 47.632 | -122.203 | 1800.0 | 9000.0 |
| 20132 | 3.0 | 2.50 | 1280.0 | 1920.0 | 3.0 | ... | 98105.0 | 47.662 | -122.324 | 1450.0 | 1900.0 |
| 16169 | 4.0 | 1.50 | 1220.0 | 9600.0 | 1.0 | ... | 98014.0 | 47.646 | -121.909 | 1180.0 | 9000.0 |
| 16890 | 3.0 | 1.50 | 2120.0 | 6290.0 | 1.0 | ... | 98108.0 | 47.566 | -122.318 | 1620.0 | 5400.0 |

  




```python
from category_encoders import TargetEncoder
te = TargetEncoder(cols='zipcode').fit(X_train, y_train)
te.transform(X_train).head()
```


  

    

      
| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | ... | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10666 | 4.0 | 2.50 | 2160.0 | 7000.0 | 2.0 | ... | 6.164e+05 | 47.566 | -122.013 | 2300.0 | 7440.0 |
| 19108 | 4.0 | 4.25 | 3250.0 | 11780.0 | 2.0 | ... | 1.357e+06 | 47.632 | -122.203 | 1800.0 | 9000.0 |
| 20132 | 3.0 | 2.50 | 1280.0 | 1920.0 | 3.0 | ... | 8.503e+05 | 47.662 | -122.324 | 1450.0 | 1900.0 |
| 16169 | 4.0 | 1.50 | 1220.0 | 9600.0 | 1.0 | ... | 4.464e+05 | 47.646 | -121.909 | 1180.0 | 9000.0 |
| 16890 | 3.0 | 1.50 | 2120.0 | 6290.0 | 1.0 | ... | 3.604e+05 | 47.566 | -122.318 | 1620.0 | 5400.0 |

  




```python
pd.DataFrame(y_train.groupby(X_train.zipcode).mean()[X_train.head().zipcode]).T
```


  

    

      
| zipcode | 98029.0 | 98004.0 | 98105.0 | 98014.0 | 98108.0 |
|---|---|---|---|---|---|
| price | 616356.941 | 1.357e+06 | 850306.816 | 446448.065 | 360416.811 |

  




]

```python
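import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
X = data.frame.drop(['date', 'price', 'zipcode'], axis=1)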
scores = cross_val_score(Ridge(), X, target)
np.mean(scores)
```
```
0.69
```
--

```python
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
X = data.frame.drop(['date', 'price'], axis=1)

ct = make_column_transformer((OneHotEncoder(), ['zipcode']), remainder='passthrough')
pipe_ohe = make_pipeline(ct, Ridge())
scores = cross_val_score(pipe_ohe, X, target)
np.mean(scores)
```
```
0.52
```

--

```python
from category_encoders import TargetEncoder
X = data.frame.drop(['date', 'price'], axis=1)
pipe_target = make_pipeline(TargetEncoder(cols='zipcode'), Ridge())
scores = cross_val_score(pipe_target, X, target)
np.mean(scores)
```
```
0.78
```

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
                   'salary': [103, 89, 142, 54, 63, 219],
                   'vegan': ['No', 'No','No','Yes', 'Yes', 'No']})
df

In [None]:
df['boro_ordinal'] = df.boro.astype("category").cat.codes
# reorder columns so it looks nice
df = df[['boro', 'boro_ordinal', 'vegan']]
df

In [None]:
df_int = df.copy()
df_int['vegan'] = df.vegan.astype("category").cat.codes
plt.figure(figsize=(4, 2))
df_int.plot(x='boro_ordinal', y='vegan', kind='scatter', ax=plt.gca())

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression().fit(df[['boro_ordinal']], df.vegan)
lr.coef_

In [None]:
lr.intercept_

In [None]:
dec = lr.decision_function(np.linspace(0, 3).reshape(-1, 1))

In [None]:
lr.predict(np.linspace(0, 3).reshape(-1, 1))

In [None]:
plt.figure(figsize=(4, 2))
df_int.plot(x='boro_ordinal', y='vegan', kind='scatter', ax=plt.gca())
plt.vlines([1.5], 0, 1, linestyle='--', label='linear_classifier')
plt.legend(loc='best')

In [None]:
df = pd.DataFrame({'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
                   'salary': [103, 89, 142, 54, 63, 219],
                   'vegan': ['No', 'No','No','Yes', 'Yes', 'No']})
df

In [None]:
pd.get_dummies(df)

In [None]:
pd.get_dummies(df, columns=['boro'])

In [None]:
df_ordinal = df.copy()
df_ordinal['boro'] = df.boro.astype("category").cat.codes
df_ordinal

In [None]:
df2html(df_ordinal)

In [None]:
pd.get_dummies(df_ordinal, columns=['boro'])

In [None]:
df2html(pd.get_dummies(df_ordinal, columns=['boro']))

In [None]:
df = pd.DataFrame({'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
                   'salary': [103, 89, 142, 54, 63, 219],
                   'vegan': ['No', 'No','No','Yes', 'Yes', 'No']})
df_dummies = pd.get_dummies(df, columns=['boro'])
df_dummies

In [None]:
df = pd.DataFrame({'boro': ['Brooklyn', 'Manhattan', 'Brooklyn', 'Queens', 'Brooklyn', 'Staten Island'],
                   'salary': [61, 146, 142, 212, 98, 47],
                   'vegan': ['Yes', 'No','Yes','No', 'Yes', 'No']})
df_dummies = pd.get_dummies(df, columns=['boro'])
display(df_dummies)

In [None]:
df = pd.DataFrame({'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
                   'salary': [103, 89, 142, 54, 63, 219],
                   'vegan': ['No', 'No','No','Yes', 'Yes', 'No']})
df['boro'] = pd.Categorical(
    df.boro, categories=['Manhattan', 'Queens', 'Brooklyn', 'Bronx', 'Staten Island'])
df_dummies = pd.get_dummies(df, columns=['boro'])
display(df_dummies)

In [None]:
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'salary': [103, 89, 142, 54, 63, 219],
                   'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx']})
ce = OneHotEncoder().fit(df)
ce.transform(df).toarray()

In [None]:
from sklearn.datasets import fetch_openml

data = fetch_openml("house_sales", as_frame=True)

data.frame.columns

In [None]:
data.frame.zipcode.value_counts()

In [None]:
import seaborn as sns
plt.figure(figsize=(15, 5))
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax = sns.boxplot(x=data.frame.zipcode, y=data.frame.price)
#plt.tight_layout()
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);

In [None]:
from sklearn.model_selection import train_test_split
target = data.frame.price

data = fetch_openml("house_sales", as_frame=True)
X = data.frame.drop(['date', 'price'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, target)
X_train.columns

In [None]:
# drop some stuff so it fits on slide
import pandas as pd
pd.set_option('display.max_columns', 10)
#disp = X_train.drop(['waterfront', 'view', 'condition', 'grade', 'sqft_basement', 'yr_renovated'], axis=1).head()
disp = X_train.head()
disp

In [None]:
from category_encoders import TargetEncoder
te = TargetEncoder(cols='zipcode').fit(X_train, y_train)
te.transform(X_train).head()

In [None]:
disp2 = te.transform(X_train).head()
disp2

In [None]:
pd.DataFrame(target.groupby(X.zipcode).mean()[X_train.head().zipcode]).T

In [None]:
pd.DataFrame(y_train.groupby(X_train.zipcode).mean()[X_train.head().zipcode]).T

In [None]:
from category_encoders import LeaveOneOutEncoder, TargetEncoder

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
X = data.frame.drop(['date', 'price'], axis=1)
pipe_target = make_pipeline(TargetEncoder(cols='zipcode'), Ridge())
scores = cross_val_score(pipe_target, X, target)
np.mean(scores)

In [None]:
X = data.frame.drop(['date', 'price', 'zipcode'], axis=1)
scores = cross_val_score(Ridge(), X, target)
np.mean(scores)

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
X = data.frame.drop(['date', 'price'], axis=1)

pipe_ohe = make_pipeline(make_column_transformer((OneHotEncoder(), ['zipcode']), remainder='passthrough'), Ridge())
scores = cross_val_score(pipe_ohe, X, target)
np.mean(scores)

In [None]:
X.columns

In [None]:
TargetEncoder(cols='zipcode').fit_transform(data.frame, target)