In this notebook, we'll walk through one of the most important and common preprocessing tasks: encoding categorical data.

### What is Categorical Data?

Categorical data is data that can be sorted into groups or categories. It represents characteristics such as:
- Color: Red, Blue, Green
- Size: Small, Medium, Large
- Type: Car

In [1]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
  
# fetch dataset 
census_income = fetch_ucirepo(id=20) 
  
# data (as pandas dataframes) 
X = census_income.data.features 
y = census_income.data.targets 

In [9]:
print(data.columns)

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')


In [3]:
data = pd.concat([X, y], axis=1)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


Starting from this data, let's say we want to train a logistic regression classifier. We can't use the data as it is, because a logistic regression model expects numerical inputs and can't handle string-valued categorical features directly.

Clearly, we need to convert the categorical data into numerical data. This process is called encoding.

### One-Hot Encoding (Dummy Variables)

The most common way to represent categorical data is *one-hot encoding* (also called *one-out-of-N encoding*), which creates a binary column for each category.

The idea is to replace a categorical variable with one or more new features that take the values 0 and 1. A more compact alternative is to map the categories to integer codes such as 0, 1, 2, ..., but that imposes an ordering on the categories that a linear model would treat as meaningful.

There are many ways to do one-hot encoding, but the most common are the `get_dummies` function from `pandas` and the `OneHotEncoder` class from `sklearn`.
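To make the idea concrete before applying it to the census data, here is a minimal sketch on a made-up table. The `toy` DataFrame and its `color` and `size_cm` columns are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column
toy = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],
    "size_cm": [10, 12, 9, 11],
})

# One new 0/1 column is created per distinct value of "color";
# the numeric column is left untouched
toy_encoded = pd.get_dummies(toy, columns=["color"], dtype=int)
print(toy_encoded)
```

Each row has exactly one `1` among the `color_*` columns, which is where the name "one-hot" comes from.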

#### Checking string-encoded categorical data

After reading the data, we can check whether it contains string-encoded categorical columns using the `select_dtypes` method of a `pandas` DataFrame.
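As a rough sketch of that check, the snippet below lists the non-numeric columns of a small made-up frame. The `toy` DataFrame is hypothetical. Excluding numeric dtypes (rather than including `object`) is a choice made here so the result does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Hypothetical sample mimicking a few columns of the census data
toy = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Private"],
    "sex": ["Male", "Female"],
})

# Keep every column that is not numeric; these are the candidates for encoding
string_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(string_cols)
```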

When working with data entered by users, we can't be sure that it is in a consistent format, so we need to check it and convert it where necessary.

A good way to do this is to use the `value_counts` method from `pandas` to inspect the unique values of each column and how often they occur.

In [11]:
# Pandas way
for col in data.columns:
    print(col, data[col].nunique())
    print(data[col].value_counts())

age 74
age
36    1348
35    1337
33    1335
23    1329
31    1325
      ... 
88       6
85       5
87       3
89       2
86       1
Name: count, Length: 74, dtype: int64
workclass 9
workclass
Private             33906
Self-emp-not-inc     3862
Local-gov            3136
State-gov            1981
?                    1836
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: count, dtype: int64
fnlwgt 28523
fnlwgt
203488    21
120277    19
190290    19
125892    18
126569    18
          ..
286983     1
185942     1
234220     1
214706     1
350977     1
Name: count, Length: 28523, dtype: int64
education 16
education
HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th      

This shows all the unique values of each column, so we can decide whether any of them need to be cleaned up or converted.
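For instance, the counts above show a `?` placeholder in `workclass`, and the encoded column names later in this notebook reveal that `income` contains labels both with and without a trailing period. One possible cleanup is sketched below on a made-up frame. The `toy` and `cleaned` names are hypothetical, and whether `?` should be treated as missing is an assumption rather than something the dataset documentation is quoted on here.

```python
import pandas as pd

# Hypothetical sample reproducing the two inconsistencies
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "income": ["<=50K", ">50K.", "<=50K.", ">50K"],
})

cleaned = toy.copy()
# Treat the "?" placeholder as a missing value (assumption)
cleaned["workclass"] = cleaned["workclass"].mask(cleaned["workclass"] == "?")
# Drop the trailing period so each class has a single spelling
cleaned["income"] = cleaned["income"].str.rstrip(".")

print(cleaned["income"].value_counts())
```

After this step `income` has two distinct labels instead of four, so one-hot encoding it would no longer split each class across two columns.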

In [12]:
print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))

Original features:
 ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'] 

Features after get_dummies:
 ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass_?', 'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Never-worked', 'workclass_Private', 'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc', 'workclass_State-gov', 'workclass_Without-pay', 'education_10th', 'education_11th', 'education_12th', 'education_1st-4th', 'education_5th-6th', 'education_7th-8th', 'education_9th', 'education_Assoc-acdm', 'education_Assoc-voc', 'education_Bachelors', 'education_Doctorate', 'education_HS-grad', 'education_Masters', 'education_Preschool', 'education_Prof-school', 'education_Some-college', 'marital-status_Divorced', 'marital-status_Married-AF-spouse', 'marital-status_Married-civ-spouse', 'mari

Let's take a look at the data after encoding. Every column is now numeric or boolean, which is the kind of input expected by models such as Decision Tree, Random Forest, Gradient Boosting, XGBoost, LightGBM, CatBoost, etc.:

In [14]:
data_dummies.head(10)

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,income_<=50K,income_<=50K.,income_>50K,income_>50K.
0,39,77516,13,2174,0,40,False,False,False,False,...,False,False,False,True,False,False,True,False,False,False
1,50,83311,13,0,0,13,False,False,False,False,...,False,False,False,True,False,False,True,False,False,False
2,38,215646,9,0,0,40,False,False,False,False,...,False,False,False,True,False,False,True,False,False,False
3,53,234721,7,0,0,40,False,False,False,False,...,False,False,False,True,False,False,True,False,False,False
4,28,338409,13,0,0,40,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
5,37,284582,14,0,0,40,False,False,False,False,...,False,False,False,True,False,False,True,False,False,False
6,49,160187,5,0,0,16,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
7,52,209642,9,0,0,45,False,False,False,False,...,False,False,False,True,False,False,False,False,True,False
8,31,45781,14,14084,0,50,False,False,False,False,...,False,False,False,True,False,False,False,False,True,False
9,42,159449,13,5178,0,40,False,False,False,False,...,False,False,False,True,False,False,False,False,True,False


In [17]:
# Scikit-learn way
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
print(ohe.fit_transform(X))

  (0, 22)	1.0
  (0, 81)	1.0
  (0, 3545)	1.0
  (0, 28616)	1.0
  (0, 28635)	1.0
  (0, 28643)	1.0
  (0, 28647)	1.0
  (0, 28663)	1.0
  (0, 28672)	1.0
  (0, 28674)	1.0
  (0, 28702)	1.0
  (0, 28798)	1.0
  (0, 28936)	1.0
  (0, 29032)	1.0
  (1, 33)	1.0
  (1, 80)	1.0
  (1, 3872)	1.0
  (1, 28616)	1.0
  (1, 28635)	1.0
  (1, 28641)	1.0
  (1, 28650)	1.0
  (1, 28662)	1.0
  (1, 28672)	1.0
  (1, 28674)	1.0
  (1, 28675)	1.0
  :	:
  (48840, 28616)	1.0
  (48840, 28635)	1.0
  (48840, 28639)	1.0
  (48840, 28647)	1.0
  (48840, 28665)	1.0
  (48840, 28669)	1.0
  (48840, 28674)	1.0
  (48840, 28757)	1.0
  (48840, 28798)	1.0
  (48840, 28936)	1.0
  (48840, 29032)	1.0
  (48841, 18)	1.0
  (48841, 79)	1.0
  (48841, 14143)	1.0
  (48841, 28616)	1.0
  (48841, 28635)	1.0
  (48841, 28641)	1.0
  (48841, 28650)	1.0
  (48841, 28662)	1.0
  (48841, 28672)	1.0
  (48841, 28674)	1.0
  (48841, 28675)	1.0
  (48841, 28798)	1.0
  (48841, 28956)	1.0
  (48841, 29032)	1.0


OK, time to train the model.

In [22]:
# Column names produced by get_dummies contain no space after the underscore
features = data_dummies.loc[:, 'age':'native-country_Yugoslavia']
X = features.to_numpy(dtype=float)
# The raw labels appear both with and without a trailing period, so combine both columns
y = (data_dummies['income_>50K'] | data_dummies['income_>50K.']).to_numpy()

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Raise the iteration cap: the unscaled features can make the default solver slow to converge
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))