# Encoding Categorical Variables as Quantitative

## Review

Last class, we discussed how to measure the distance between two observations $\mathrm{x}$ and $\mathrm{\acute{x}}$.

For example, we can calculate the Euclidean ($\ell_2$) distance:

$$
d(\mathrm{x}, \mathrm{\acute{x}}) = \sqrt{\sum_{j=1}^{m}(x_j - \acute{x}_j)^2}
$$
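
As a quick refresher, the formula above can be sketched in NumPy. The two vectors below reuse the quantitative values of rows 0 and 1 of the housing data (Gr Liv Area, Bedroom AbvGr, Full Bath, Half Bath); the variable names `x` and `x_prime` are chosen here for illustration only.

```python
import numpy as np

# Two observations with m = 4 quantitative features
# (Gr Liv Area, Bedroom AbvGr, Full Bath, Half Bath), copied from rows 0 and 1.
x = np.array([1656, 3, 1, 0])
x_prime = np.array([896, 2, 1, 0])

# Euclidean distance: square root of the sum of squared differences
dist = np.sqrt(np.sum((x - x_prime) ** 2))

# np.linalg.norm of the difference vector computes the same quantity
dist_check = np.linalg.norm(x - x_prime)

print(dist, dist_check)
```

Note that the difference in Gr Liv Area dominates the result because the variables are on very different scales.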

In [2]:
# load dataset
# Data URL: https://datasci112.stanford.edu/data/housing.tsv

import pandas as pd
df = pd.read_table("data/housing.tsv")
df

,PID,Gr Liv Area,Bedroom AbvGr,Full Bath,Half Bath,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,526301100,1656,3,1,0,20,RL,141.0,31770,Pave,...,0,,,,0,5,2010,WD,Normal,215000
1,526350040,896,2,1,0,20,RH,80.0,11622,Pave,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,526351010,1329,3,1,1,20,RL,81.0,14267,Pave,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,526353030,2110,3,2,1,20,RL,93.0,11160,Pave,...,0,,,,0,4,2010,WD,Normal,244000
4,527105010,1629,3,2,1,60,RL,74.0,13830,Pave,...,0,,MnPrv,,0,3,2010,WD,Normal,189900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,923275080,1003,3,1,0,80,RL,37.0,7937,Pave,...,0,,GdPrv,,0,3,2006,WD,Normal,142500
2926,923276100,902,2,1,0,20,RL,,8885,Pave,...,0,,MnPrv,,0,6,2006,WD,Normal,131000
2927,923400125,970,3,1,0,85,RL,62.0,10441,Pave,...,0,,MnPrv,Shed,700,7,2006,WD,Normal,132000
2928,924100070,1389,2,1,0,20,RL,77.0,10010,Pave,...,0,,,,0,4,2006,WD,Normal,170000


## What if there are categorical variables?

In [None]:
# Some of the features we want to use are categorical.
# To calculate distances, we first need to convert the categorical variables into quantitative variables!
features = ["Gr Liv Area", "House Style", "Bedroom AbvGr", "Full Bath", "Half Bath", "Neighborhood"]
df[features]

,Gr Liv Area,House Style,Bedroom AbvGr,Full Bath,Half Bath,Neighborhood
0,1656,1Story,3,1,0,NAmes
1,896,1Story,2,1,0,NAmes
2,1329,1Story,3,1,1,NAmes
3,2110,1Story,3,2,1,NAmes
4,1629,2Story,3,2,1,Gilbert
...,...,...,...,...,...,...
2925,1003,SLvl,3,1,0,Mitchel
2926,902,1Story,2,1,0,Mitchel
2927,970,SFoyer,3,1,0,Mitchel
2928,1389,1Story,2,1,0,Mitchel


## Encoding Categorical Variables as Quantitative

There is a standard way to encode a categorical variable as a quantitative variable: **dummy encoding** or **one-hot encoding**.

- Each class gets its own column.
- Each column consists of $0$s and $1$s. A $1$ indicates that the observation was in that class.

Questions:

1. How many $1$s are in each row?
2. How many $1$s are in each column?

In [None]:
# We have to encode the categorical variables as quantitative variables.
# The standard way to do this is dummy encoding.

pd.get_dummies(df[['House Style']])

,House Style_1.5Fin,House Style_1.5Unf,House Style_1Story,House Style_2.5Fin,House Style_2.5Unf,House Style_2Story,House Style_SFoyer,House Style_SLvl
0,False,False,True,False,False,False,False,False
1,False,False,True,False,False,False,False,False
2,False,False,True,False,False,False,False,False
3,False,False,True,False,False,False,False,False
4,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...
2925,False,False,False,False,False,False,False,True
2926,False,False,True,False,False,False,False,False
2927,False,False,False,False,False,False,True,False
2928,False,False,True,False,False,False,False,False
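
One way to check your answers to the two questions above is to sum the dummy columns in each direction. The sketch below uses a small made-up DataFrame (`toy` is hypothetical; only the category labels are borrowed from the housing data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data; the labels mimic the "House Style" column
toy = pd.DataFrame({"House Style": ["1Story", "1Story", "2Story", "SLvl", "1Story"]})

dummies = pd.get_dummies(toy[["House Style"]], dtype=int)

# Sum across the columns: number of 1s in each row
row_sums = dummies.sum(axis=1)

# Sum down the rows: number of 1s in each column
col_sums = dummies.sum(axis=0)

# Class frequencies, for comparison with the column sums
counts = toy["House Style"].value_counts()

print(row_sums.tolist())
print(col_sums.to_dict())
print(counts.to_dict())
```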


In [None]:
# encode multiple categorical variables at once
pd.get_dummies(df[['House Style', 'Neighborhood']], dtype=int)

,House Style_1.5Fin,House Style_1.5Unf,House Style_1Story,House Style_2.5Fin,House Style_2.5Unf,House Style_2Story,House Style_SFoyer,House Style_SLvl,Neighborhood_Blmngtn,Neighborhood_Blueste,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2926,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2927,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2928,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


If you pass a mix of quantitative variables and categorical variables to `pd.get_dummies()`, it will dummy encode the categorical variables and leave the quantitative variables alone.

In [6]:
pd.get_dummies(df[features])

,Gr Liv Area,Bedroom AbvGr,Full Bath,Half Bath,House Style_1.5Fin,House Style_1.5Unf,House Style_1Story,House Style_2.5Fin,House Style_2.5Unf,House Style_2Story,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,1656,3,1,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,896,2,1,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,1329,3,1,1,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,2110,3,2,1,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,1629,3,2,1,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,1003,3,1,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2926,902,2,1,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2927,970,3,1,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2928,1389,2,1,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
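
A caveat worth keeping in mind: `pd.get_dummies()` decides what counts as categorical from the column's dtype, so a category stored as numbers is left alone by default. The sketch below uses a hypothetical three-row DataFrame and assumes that `MS SubClass` is really a category code rather than a quantity; the `columns=` argument names the columns to encode explicitly.

```python
import pandas as pd

# Hypothetical toy data with one numeric-coded categorical column
toy = pd.DataFrame({
    "Gr Liv Area": [1656, 896, 1629],
    "MS SubClass": [20, 20, 60],            # assumed to be a category code
    "House Style": ["1Story", "1Story", "2Story"],
})

# Default behavior: only the string column is dummy encoded
encoded = pd.get_dummies(toy, dtype=int)

# Explicit list: when columns= is given, only the listed columns are encoded
encoded_all = pd.get_dummies(toy, columns=["MS SubClass", "House Style"], dtype=int)

print(encoded.columns.tolist())
print(encoded_all.columns.tolist())
```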


## Dummy Encoding in Scikit-Learn

We can do dummy encoding in Scikit-Learn using `OneHotEncoder`.

In [7]:
from sklearn.preprocessing import OneHotEncoder

# declare the encoder
encoder = OneHotEncoder()

# fit the encoder to data
encoder.fit(df[['House Style']])

# transform the data
encoder.transform(df[['House Style']])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2930 stored elements and shape (2930, 8)>

We can cast a sparse matrix to a "dense" one using `.todense()`...

...or specify that we don't want a sparse matrix to begin with.

In [None]:
# declare the encoder with sparse_output=False to get a dense matrix
encoder = OneHotEncoder(sparse_output=False)

# fit the encoder to data
encoder.fit(df[['House Style']])

# transform the data
encoder.transform(df[['House Style']])

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]], shape=(2930, 8))

## Mixed Variables in Scikit-Learn

What if we have a mix of quantitative and categorical variables, and we only want to dummy encode the categorical ones?

We make a `ColumnTransformer`.

In [15]:
# use make_column_transformer to encode mixed variables
from sklearn.compose import make_column_transformer

transformer = make_column_transformer(
    (OneHotEncoder(), ['House Style', 'Neighborhood']),
    remainder='passthrough'
)
transformer.fit(df[features])
transformer.transform(df[features])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 15717 stored elements and shape (2930, 40)>

In [16]:
# use make_column_transformer to encode mixed variables (with sparse_output=False)

transformer = make_column_transformer(
    (OneHotEncoder(sparse_output=False), ['House Style', 'Neighborhood']),
    remainder='passthrough'
)
transformer.fit(df[features])
transformer.transform(df[features])

array([[0., 0., 1., ..., 3., 1., 0.],
       [0., 0., 1., ..., 2., 1., 0.],
       [0., 0., 1., ..., 3., 1., 1.],
       ...,
       [0., 0., 0., ..., 3., 1., 0.],
       [0., 0., 1., ..., 2., 1., 0.],
       [0., 0., 0., ..., 3., 2., 1.]], shape=(2930, 40))

## Visualizing a `ColumnTransformer`

Scikit-Learn provides a nice visualization of a `ColumnTransformer`.

In [17]:
transformer

ColumnTransformer parameters:
Parameter,Value
transformers,"[('onehotencoder', ...)]"
remainder,'passthrough'
sparse_threshold,0.3
n_jobs,None
transformer_weights,None
verbose,False
verbose_feature_names_out,True
force_int_remainder_cols,'deprecated'

OneHotEncoder parameters:
Parameter,Value
categories,'auto'
drop,None
sparse_output,False
dtype,<class 'numpy.float64'>
handle_unknown,'error'
min_frequency,None
max_categories,None
feature_name_combiner,'concat'


## Scaling and Encoding in Scikit-Learn

We can mix scalers and encoders with `ColumnTransformer`!

In [19]:
from sklearn.preprocessing import StandardScaler

transformer = make_column_transformer(
    (OneHotEncoder(sparse_output=False), ['House Style', 'Neighborhood']),
    (StandardScaler(), ['Gr Liv Area']),
    remainder='passthrough'
)
transformer.fit(df[features])
transformer.transform(df[features])

array([[0., 0., 1., ..., 3., 1., 0.],
       [0., 0., 1., ..., 2., 1., 0.],
       [0., 0., 1., ..., 3., 1., 1.],
       ...,
       [0., 0., 0., ..., 3., 1., 0.],
       [0., 0., 1., ..., 2., 1., 0.],
       [0., 0., 0., ..., 3., 2., 1.]], shape=(2930, 40))

In [20]:
# Let's visualize this `ColumnTransformer` as well.
transformer

ColumnTransformer parameters:
Parameter,Value
transformers,"[('onehotencoder', ...), ('standardscaler', ...)]"
remainder,'passthrough'
sparse_threshold,0.3
n_jobs,None
transformer_weights,None
verbose,False
verbose_feature_names_out,True
force_int_remainder_cols,'deprecated'

OneHotEncoder parameters:
Parameter,Value
categories,'auto'
drop,None
sparse_output,False
dtype,<class 'numpy.float64'>
handle_unknown,'error'
min_frequency,None
max_categories,None
feature_name_combiner,'concat'

StandardScaler parameters:
Parameter,Value
copy,True
with_mean,True
with_std,True


## A Look Ahead

In section tomorrow, you will put all the pieces from the last two lectures together.

1. Convert categorical variables to quantitative variables.
2. Calculate distances on the transformed data to solve a real problem.