## Let's set up the data and target as before

In [1]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
adult_census = adult_census.drop(columns='education-num')

target_name = 'class'
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name])

## Identifying categorical variables

### Explore one of the variables

Ask if someone remembers any of the categorical variables
  -> use whichever suggestion comes up in the cell below

In [2]:
# Use suggestion from audience

data['native-country'].value_counts()

# or head or whatever


 United-States                 43832
 Mexico                          951
 ?                               857
 Philippines                     295
 Germany                         206
 Puerto-Rico                     184
 Canada                          182
 El-Salvador                     155
 India                           151
 Cuba                            138
 England                         127
 China                           122
 South                           115
 Jamaica                         106
 Italy                           105
 Dominican-Republic              103
 Japan                            92
 Guatemala                        88
 Poland                           87
 Vietnam                          86
 Columbia                         85
 Haiti                            75
 Portugal                         67
 Taiwan                           65
 Iran                             59
 Greece                           49
 Nicaragua                        49
 

ASK: Does anyone remember how to check which columns might be categorical?

In [3]:
data.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

# Select features based on types

We can ask sklearn to select all categorical data for us

In [4]:
from sklearn.compose import make_column_selector as selector

categorical_selector = selector(dtype_include=object)
categorical_columns = categorical_selector(data)

categorical_columns

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
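
The same selector can also pick out the non-categorical columns. The snippet below is a minimal sketch on a made-up three-column frame (the `toy` names are illustrative, not part of the lesson data). The string column is created with an explicit `object` dtype because, depending on the pandas version, plain strings may get a dedicated string dtype that `dtype_include=object` would not match.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical miniature version of the census data
toy = pd.DataFrame({
    "age": [25, 38],
    "workclass": pd.Series(["Private", "Local-gov"], dtype=object),
    "hours-per-week": [40, 50],
})

# Columns stored as Python objects (here: strings) are treated as categorical
toy_categorical = selector(dtype_include=object)(toy)

# Everything that is not object dtype (here: the integer columns)
toy_numerical = selector(dtype_exclude=object)(toy)

print(toy_categorical)
print(toy_numerical)
```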

In [5]:
# Now move them all into a new DataFrame
data_categorical = data[categorical_columns]

# Let's take a quick look to make sure it looks right
data_categorical.head()


| | workclass | education | marital-status | occupation | relationship | race | sex | native-country |
|---|---|---|---|---|---|---|---|---|
| 0 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | United-States |
| 1 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | United-States |
| 2 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | United-States |
| 3 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | United-States |
| 4 | ? | Some-college | Never-married | ? | Own-child | White | Female | United-States |


In [6]:
print(f"the data set is composed of {data_categorical.shape[1]} features")

the data set is composed of 8 features


# Encoding categorical data

Q: Is this data useful for machine learning?

Q: Is the formatting of the data useful for a computer? / What does a computer know about native countries or marital statuses, etc?

Q: How can a machine read this data?

## Ordinal encoding

We will take marital status as an example of how to encode data

In [7]:
print(data['marital-status'])

0              Never-married
1         Married-civ-spouse
2         Married-civ-spouse
3         Married-civ-spouse
4              Never-married
                ...         
48837     Married-civ-spouse
48838     Married-civ-spouse
48839                Widowed
48840          Never-married
48841     Married-civ-spouse
Name: marital-status, Length: 48842, dtype: object


In [8]:
data['marital-status'].value_counts()

 Married-civ-spouse       22379
 Never-married            16117
 Divorced                  6633
 Separated                 1530
 Widowed                   1518
 Married-spouse-absent      628
 Married-AF-spouse           37
Name: marital-status, dtype: int64

#### OK, now let's set up an encoder to convert this data to numbers

In [9]:
from sklearn.preprocessing import OrdinalEncoder

# select the column to work with
mar_status_column = data[['marital-status']]

# Initialize the encoder
ordinal_encoder = OrdinalEncoder()

# Use fit_transform method to encode the data
mar_status_ordinal = ordinal_encoder.fit_transform(mar_status_column)

# and let's take a look at what that looks like:
mar_status_ordinal


array([[4.],
       [2.],
       [2.],
       ...,
       [6.],
       [4.],
       [2.]])

In [10]:
# and let's take a look at which categories are present:
ordinal_encoder.categories_

[array([' Divorced', ' Married-AF-spouse', ' Married-civ-spouse',
        ' Married-spouse-absent', ' Never-married', ' Separated',
        ' Widowed'], dtype=object)]
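
To make the link between the numbers and `categories_` explicit: each code is simply the position of the category in that (alphabetically sorted) array. The sketch below uses a small made-up column (`toy_status` is illustrative only) and `inverse_transform` to go back from codes to labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical four-row column, not the real census data
toy_status = pd.DataFrame(
    {"marital-status": ["Widowed", "Divorced", "Never-married", "Divorced"]}
)

toy_encoder = OrdinalEncoder()
toy_codes = toy_encoder.fit_transform(toy_status)

# The code of a category is its index in the sorted categories_ array
code_to_category = dict(enumerate(toy_encoder.categories_[0]))
print(code_to_category)

# inverse_transform maps the codes back to the original labels
decoded = toy_encoder.inverse_transform(toy_codes)
print(decoded.ravel())
```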

In [11]:
#### Now let's transform all the data at once

In [12]:
data_ordinal = ordinal_encoder.fit_transform(data_categorical)
data_ordinal[:5]

array([[ 4.,  1.,  4.,  7.,  3.,  2.,  1., 39.],
       [ 4., 11.,  2.,  5.,  0.,  4.,  1., 39.],
       [ 2.,  7.,  2., 11.,  0.,  4.,  1., 39.],
       [ 4., 15.,  2.,  7.,  0.,  2.,  1., 39.],
       [ 0., 15.,  4.,  0.,  3.,  4.,  0., 39.]])

In [13]:
print(f'The encoded data set contains {data_ordinal.shape[1]} features')

The encoded data set contains 8 features


## Evaluation:

Q: Is ordinal encoding appropriate for marital status? For which columns in the data set would it be?  
A: Only education (in fact, the encoding was already present in the data set as education-num)

Q: Can anyone think of another example of categorical data that is ordinal?  
A: Try to find at least 2 examples, one alphabetized and one not. Examples could be:
  - Alphabetized: US grading system: A, B, C, D, F
  - Not alphabetized: clothing sizes: XS, S, M, L, XL, XXL

Q: What problem arises if we use ordinal encoding on the clothing sizes (or their own suggestion)?  
HINT if needed: Look at `ordinal_encoder.categories_`  
A: The categories would not be in the correct order (they are sorted alphabetically)
  - How to solve the issue? (Look in documentation; at the very top)
    - Answer:
  ```
  ordered_size_list = ['XS', 'S', 'M', 'L', 'XL', 'XXL']
  encoder_with_order = OrdinalEncoder(categories=[ordered_size_list])
  ```

In [14]:
# Answer to question on how to give the order

ordered_size_list = ['XS', 'S', 'M', 'L', 'XL', 'XXL']
# `categories` expects one list per column, hence the extra brackets
encoder_with_order = OrdinalEncoder(categories=[ordered_size_list])
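
To show the effect of giving the order, here is a minimal sketch comparing the default (alphabetical) codes with the codes obtained from an explicit order. The `sizes` frame is made up for illustration. Note that `categories` takes a list with one list of categories per column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column data set of clothing sizes
sizes = pd.DataFrame({"size": ["M", "XS", "XXL", "S", "L", "XL"]})
ordered_size_list = ["XS", "S", "M", "L", "XL", "XXL"]

# Default: categories are sorted alphabetically (L, M, S, XL, XS, XXL)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes).ravel()

# Explicit order: one list of categories per column
ordered_encoder = OrdinalEncoder(categories=[ordered_size_list])
ordered_codes = ordered_encoder.fit_transform(sizes).ravel()

print(default_codes)  # codes follow the alphabet, not the size
print(ordered_codes)  # codes follow XS < S < M < L < XL < XXL
```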

## Encoding nominal categories

In [15]:
# Let's take another look at the marital status data

data['marital-status'].value_counts()

 Married-civ-spouse       22379
 Never-married            16117
 Divorced                  6633
 Separated                 1530
 Widowed                   1518
 Married-spouse-absent      628
 Married-AF-spouse           37
Name: marital-status, dtype: int64

In [16]:
from sklearn.preprocessing import OneHotEncoder

encoder_onehot = OneHotEncoder(sparse_output=False)
mar_status_onehot = encoder_onehot.fit_transform(mar_status_column)
mar_status_onehot[-5:]

array([[0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.]])

In [17]:
mar_status_ordinal[-5:]

array([[2.],
       [2.],
       [6.],
       [4.],
       [2.]])

In [18]:
feature_names = encoder_onehot.get_feature_names_out(input_features=['marital-status'])
marital_status_onehot = pd.DataFrame(mar_status_onehot, columns=feature_names)

marital_status_onehot

| | marital-status_ Divorced | marital-status_ Married-AF-spouse | marital-status_ Married-civ-spouse | marital-status_ Married-spouse-absent | marital-status_ Never-married | marital-status_ Separated | marital-status_ Widowed |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 48838 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 48839 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 48840 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


In [19]:
data_onehot = encoder_onehot.fit_transform(data_categorical)
data_onehot[:3]

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.

In [20]:
# Before there were 8 features:
print(f'The data set contains {data_categorical.shape[1]} features')

# How many features after ordinal encoding? --> still 8
print(f'The ordinal encoded data set contains {data_ordinal.shape[1]} features')

# How many features after nominal encoding? --> many
print(f'The one-hot encoded data set contains {data_onehot.shape[1]} features')

The data set contains 8 features
The ordinal encoded data set contains 8 features
The one-hot encoded data set contains 102 features


In [21]:
# Why are there suddenly so many different features?
# Let's take a look 

columns_onehot = encoder_onehot.get_feature_names_out(data_categorical.columns)
pd.DataFrame(data_onehot, columns=columns_onehot).head()

| | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | workclass_ Without-pay | education_ 10th | ... | native-country_ Portugal | native-country_ Puerto-Rico | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam | native-country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


# Choosing an encoding strategy

## Generally speaking:

Linear model --> OneHot  
Tree model --> Ordinal

Using an OrdinalEncoder will output ordinal categories. This means that there is an order in the resulting categories (e.g. 0 < 1 < 2).  
The impact of violating this ordering assumption is really dependent on the downstream models. Linear models will be impacted by misordered categories while tree-based models will not.
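
A small illustration of this difference, as a sketch on synthetic data (the `toy` frame and its target are made up, and `DecisionTreeClassifier` stands in for tree-based models in general). The target alternates with the alphabetical codes, so the arbitrary order carries no meaning. A tree can isolate each code with successive splits, whereas a linear model can only place a single threshold along the codes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: categories a, b, c, d get codes 0, 1, 2, 3,
# but the target is 1 for "a" and "c" and 0 for "b" and "d"
toy = pd.DataFrame({"category": ["a", "b", "c", "d"] * 25})
toy_target = np.array([1, 0, 1, 0] * 25)

tree_model = make_pipeline(OrdinalEncoder(), DecisionTreeClassifier(random_state=0))
linear_model = make_pipeline(OrdinalEncoder(), LogisticRegression())

# Accuracy on the training data itself, to keep the comparison simple
tree_accuracy = tree_model.fit(toy, toy_target).score(toy, toy_target)
linear_accuracy = linear_model.fit(toy, toy_target).score(toy, toy_target)

print(f"tree: {tree_accuracy:.2f}, linear: {linear_accuracy:.2f}")
```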

! They will learn more about tree-based and linear models in a later module



### Ordinal encoder in linear models:
You can still use an OrdinalEncoder with linear models but you need to be sure that:
- the original categories (before encoding) have an ordering;
- the encoded categories follow the same ordering as the original categories.

Exercise 4: What happens if we violate these conditions?



### OneHot encoder in tree models:

One-hot encoding categorical variables with high cardinality can cause computational inefficiency in tree-based models. Because of this, it is not recommended to use OneHotEncoder in such cases even if the original categories do not have a given order.

Exercise 5: What does this issue look like in practice?

# Evaluate predictive pipeline using categorical variables

--> Do we need this before exercises?

We can now integrate this encoder inside a machine learning pipeline, as we did with numerical data.

Let’s train a linear classifier on the encoded data and check the generalization performance of this machine learning pipeline using cross-validation.

In [22]:
# let's look at some statistics from the native country column

data['native-country'].value_counts()

 United-States                 43832
 Mexico                          951
 ?                               857
 Philippines                     295
 Germany                         206
 Puerto-Rico                     184
 Canada                          182
 El-Salvador                     155
 India                           151
 Cuba                            138
 England                         127
 China                           122
 South                           115
 Jamaica                         106
 Italy                           105
 Dominican-Republic              103
 Japan                            92
 Guatemala                        88
 Poland                           87
 Vietnam                          86
 Columbia                         85
 Haiti                            75
 Portugal                         67
 Taiwan                           65
 Iran                             59
 Greece                           49
 Nicaragua                        49
 

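Categories with very few samples (e.g. only 1 for the Netherlands) are problematic for train/test splits and cross-validation: such a category may end up only in the test set, so the encoder never sees it during `fit`.
Use the `handle_unknown` parameter to solve this.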

For one-hot encoding, use: `handle_unknown='ignore'`  
For ordinal encoding, use: `handle_unknown='use_encoded_value'` together with the `unknown_value` parameter

In [23]:
# Make the pipeline

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'), LogisticRegression(max_iter=500)
)


# Note: 
# Here, we need to increase the maximum number of iterations to 
# obtain a fully converged LogisticRegression and silence a ConvergenceWarning. 

# Contrary to the numerical features, the one-hot encoded categorical features are 
# all on the same scale (values are 0 or 1), so they would not benefit from scaling. 

# In this case, increasing max_iter is the right thing to do.


In [24]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data_categorical, target)
cv_results

{'fit_time': array([0.69111085, 0.57528877, 0.46406245, 0.46208835, 0.48548198]),
 'score_time': array([0.01783872, 0.01935291, 0.01820779, 0.01695299, 0.01785111]),
 'test_score': array([0.83222438, 0.83560242, 0.82882883, 0.83312858, 0.83466421])}

In [25]:
# check score:

scores = cv_results['test_score']

print(f'The accuracy of scoring using only categorical data is: {scores.mean():.3f}, with a standard deviation of {scores.std():.3f}')

The accuracy of scoring using only categorical data is: 0.833, with a standard deviation of 0.002


This worked slightly better than using only the numerical features.

Q: Any idea why it worked better?  
A1: Because it uses 8 features instead of 4.  
A2: By chance (the difference is not that big).

Either way, there is no reason to assume that either categorical or numerical features are the better choice.  
In fact, there is no reason to use only one or the other, as we will see in the next part.