# Categorical Features

Datasets often contain categorical variables stored as text values, such as color (red, white, blue) or size (small, medium, large). The challenge is turning these text attributes into numerical quantities that machine learning techniques can work with.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 10)

## Automobile dataset

We will use the Automobile Data Set [https://archive.ics.uci.edu/ml/datasets/automobile] from the UCI Machine Learning Repository [https://archive-beta.ics.uci.edu/]. It includes categorical and continuous variables. 

In [2]:
# Defining the headers
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration", "num_doors", "body_style", 
           "drive_wheels", "engine_location", "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke", "compression_ratio", 
           "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]

In [3]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,...,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,...,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,...,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,...,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,...,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,...,115.0,5500.0,18,22,17450.0


In [4]:
df.dtypes

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

We will work only with some variables. Let us start with `body_style`.

In [5]:
dfb = df[['body_style']].copy()
dfb.head()

Unnamed: 0,body_style
0,convertible
1,convertible
2,hatchback
3,sedan
4,sedan


`body_style` has 5 different categories.

In [6]:
dfb.body_style.value_counts()

sedan          96
hatchback      70
wagon          25
hardtop         8
convertible     6
Name: body_style, dtype: int64

## Label Encoding

Label encoding consists of assigning a numeric label to each category in the feature.

### Using scikit-learn library

In [7]:
from sklearn.preprocessing import LabelEncoder

In [8]:
# Creating an instance of LabelEncoder
labelencoder = LabelEncoder()

In [9]:
# Let's create a new variable called body_style_n with the encoded values.
dfb['body_style_n'] = labelencoder.fit_transform(dfb['body_style'])
dfb.head()

Unnamed: 0,body_style,body_style_n
0,convertible,0
1,convertible,0
2,hatchback,2
3,sedan,3
4,sedan,3


In [10]:
body_style_val = dfb.body_style.unique()
body_style_val

array(['convertible', 'hatchback', 'sedan', 'wagon', 'hardtop'],
      dtype=object)

In [11]:
body_style_n = dfb.body_style_n.unique()
body_style_n

array([0, 2, 3, 4, 1])

In [12]:
res = pd.DataFrame({'body_style':body_style_val, 'body_style_n':body_style_n})
res

Unnamed: 0,body_style,body_style_n
0,convertible,0
1,hatchback,2
2,sedan,3
3,wagon,4
4,hardtop,1


In [13]:
# sort_values returns a new DataFrame; it does not modify res in place
res.sort_values(by='body_style_n')

Unnamed: 0,body_style,body_style_n
0,convertible,0
4,hardtop,1
1,hatchback,2
2,sedan,3
3,wagon,4


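As you can see, each value in `body_style` is assigned an integer code. The codes follow the alphabetical order of the categories, so different numbers only identify different `body_style` categories; they do not imply any ranking among them.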
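The mapping between categories and codes can also be read directly from the fitted encoder. The following is a minimal sketch, assuming a small toy series in place of the `body_style` column; the names `styles`, `encoder` and `mapping` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the body_style column (values taken from the outputs above)
styles = pd.Series(['convertible', 'convertible', 'hatchback',
                    'sedan', 'wagon', 'hardtop'])

encoder = LabelEncoder()
codes = encoder.fit_transform(styles)

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
# {'convertible': 0, 'hardtop': 1, 'hatchback': 2, 'sedan': 3, 'wagon': 4}

# inverse_transform converts the codes back into the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Reading `classes_` avoids having to pair up two `unique()` calls by hand, as is done in the cells above.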

### Category Codes

In [14]:
dfb.head()

Unnamed: 0,body_style,body_style_n
0,convertible,0
1,convertible,0
2,hatchback,2
3,sedan,3
4,sedan,3


`body_style` has dtype `object`.

In [15]:
dfb.dtypes

body_style      object
body_style_n     int32
dtype: object

In [16]:
# Changing body_style to category
dfb['body_style'] = dfb['body_style'].astype('category')
dfb.dtypes

body_style      category
body_style_n       int32
dtype: object

Now you can create a new variable that enumerates the categories of `body_style`.

In [17]:
dfb['body_style_cat'] = dfb['body_style'].cat.codes
dfb.head()

Unnamed: 0,body_style,body_style_n,body_style_cat
0,convertible,0,0
1,convertible,0,0
2,hatchback,2,2
3,sedan,3,3
4,sedan,3,3


Both methods, label encoding and category codes, produce the same result.

In [18]:
res['body_style_cat'] = dfb.body_style_cat.unique()
res.sort_values(by='body_style_cat')

Unnamed: 0,body_style,body_style_n,body_style_cat
0,convertible,0,0
4,hardtop,1,1
1,hatchback,2,2
2,sedan,3,3
3,wagon,4,4
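One difference worth knowing when choosing category codes is how missing values are handled. The sketch below uses a hypothetical toy series with one missing entry (the `body_style` column itself shows no missing values in the counts above) to illustrate that pandas sorts the categories and assigns the code `-1` to missing values.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value (hypothetical data, for illustration only)
styles = pd.Series(['sedan', 'wagon', np.nan, 'hardtop', 'sedan'])

as_cat = styles.astype('category')

# Categories are the sorted unique non-missing values
categories = list(as_cat.cat.categories)
print(categories)        # ['hardtop', 'sedan', 'wagon']

# Missing values receive the code -1
codes = as_cat.cat.codes.tolist()
print(codes)             # [1, 2, -1, 0, 1]
```

A code of `-1` is easy to mistake for a real category downstream, so it is usually worth checking for it before modeling.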


## One-Hot Encoding

One-hot encoding is a representation of categorical variables as binary vectors. It is useful for nominal variables.

We create new columns (dummy variables) with binary encoding (0 or 1) for each category to denote whether a particular row belongs to this category. 

### Using scikit-learn library

In [19]:
from sklearn.preprocessing import OneHotEncoder

In [20]:
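# Create the OneHotEncoder object
# (scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`; use `sparse=False` on older versions)
onehotencoder = OneHotEncoder(sparse_output=False, drop=None)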

In [21]:
dfo = pd.DataFrame(onehotencoder.fit_transform(dfb[['body_style']]))
dfo.head()

Unnamed: 0,0,1,2,3,4
0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0


In [22]:
dfo['body_style'] = dfb['body_style']
dfo.head(8)

Unnamed: 0,0,1,2,3,4,body_style
0,1.0,0.0,0.0,0.0,0.0,convertible
1,1.0,0.0,0.0,0.0,0.0,convertible
2,0.0,0.0,1.0,0.0,0.0,hatchback
3,0.0,0.0,0.0,1.0,0.0,sedan
4,0.0,0.0,0.0,1.0,0.0,sedan
5,0.0,0.0,0.0,1.0,0.0,sedan
6,0.0,0.0,0.0,1.0,0.0,sedan
7,0.0,0.0,0.0,0.0,1.0,wagon


We create 5 new binary variables:
- for `convertible` value,  you have a 1 in the variable '0' and 0 in the others.
- for `hardtop` value,      you have a 1 in the variable '1' and 0 in the others.
- for `hatchback` value,    you have a 1 in the variable '2' and 0 in the others.
- for `sedan` value,        you have a 1 in the variable '3' and 0 in the others.
- for `wagon` value,        you have a 1 in the variable '4' and 0 in the others.

In [23]:
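# Create the OneHotEncoder object, dropping the first category
# (scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`; use `sparse=False` on older versions)
onehotencoder1 = OneHotEncoder(sparse_output=False, drop='first')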

In [24]:
dfo1 = pd.DataFrame(onehotencoder1.fit_transform(dfb[['body_style']]))
dfo1.head()

Unnamed: 0,0,1,2,3
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0


In [25]:
dfo1['body_style'] = dfb['body_style']
dfo1.head(100)

Unnamed: 0,0,1,2,3,body_style
0,0.0,0.0,0.0,0.0,convertible
1,0.0,0.0,0.0,0.0,convertible
2,0.0,1.0,0.0,0.0,hatchback
3,0.0,0.0,1.0,0.0,sedan
4,0.0,0.0,1.0,0.0,sedan
...,...,...,...,...,...
95,0.0,1.0,0.0,0.0,hatchback
96,0.0,0.0,1.0,0.0,sedan
97,0.0,0.0,0.0,1.0,wagon
98,1.0,0.0,0.0,0.0,hardtop


Here we create 4 new binary variables:
- for `convertible` value, the first category, you have a 0 in all variables.
- for `hardtop` value,      you have a 1 in the variable '0' and 0 in the others.
- for `hatchback` value,    you have a 1 in the variable '1' and 0 in the others.
- for `sedan` value,        you have a 1 in the variable '2' and 0 in the others.
- for `wagon` value,        you have a 1 in the variable '3' and 0 in the others.

### Using pandas library

Because one-hot encoding is so common, pandas provides a function, `get_dummies`, that creates the new binary features representing the categorical variable.

In [26]:
dfbd = pd.get_dummies(dfb.body_style, prefix='bs')
dfbd['body_style'] = dfb['body_style']
dfbd.head(8)

Unnamed: 0,bs_convertible,bs_hardtop,bs_hatchback,bs_sedan,bs_wagon,body_style
0,1,0,0,0,0,convertible
1,1,0,0,0,0,convertible
2,0,0,1,0,0,hatchback
3,0,0,0,1,0,sedan
4,0,0,0,1,0,sedan
5,0,0,0,1,0,sedan
6,0,0,0,1,0,sedan
7,0,0,0,0,1,wagon


As you can see, the result is identical to the one obtained in `dfo`, but the variable names are clearer here.

In [27]:
dfb1 = pd.get_dummies(dfb.body_style, prefix='bs', drop_first=True)
dfb1['body_style'] = dfb['body_style']
dfb1.head(8)

Unnamed: 0,bs_hardtop,bs_hatchback,bs_sedan,bs_wagon,body_style
0,0,0,0,0,convertible
1,0,0,0,0,convertible
2,0,1,0,0,hatchback
3,0,0,1,0,sedan
4,0,0,1,0,sedan
5,0,0,1,0,sedan
6,0,0,1,0,sedan
7,0,0,0,1,wagon
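Dropping the first category loses no information, because each row of the full encoding contains exactly one 1. The sketch below checks this on toy data; `styles`, `full` and `reduced` are illustrative names. It passes `dtype=int` explicitly, since recent pandas versions (2.0 and later) return boolean columns by default.

```python
import pandas as pd

# Toy stand-in for the body_style column
styles = pd.Series(['convertible', 'hatchback', 'sedan',
                    'wagon', 'hardtop', 'sedan'], name='body_style')

full = pd.get_dummies(styles, prefix='bs', dtype=int)
reduced = pd.get_dummies(styles, prefix='bs', drop_first=True, dtype=int)

# Every row of the full encoding sums to 1
row_sums = full.sum(axis=1)

# So the dropped column is 1 minus the sum of the remaining columns
recovered = 1 - reduced.sum(axis=1)
print((recovered == full['bs_convertible']).all())
```

This redundancy is why the reduced form is often preferred for models with an intercept term, such as linear regression.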


## Ordinal Encoding

An ordinal encoder maps the categories of a feature to integers that preserve the natural order of the categories. It is suited to ordinal variables, whose values can be ranked.
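As a first illustration, scikit-learn's `OrdinalEncoder` accepts an explicit category order. The sketch below is an assumption-laden toy example: it uses the size values mentioned in the introduction (small, medium, large) rather than a real dataset, and the names `toy` and `size_cat` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data using the size example from the introduction
toy = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# The order is supplied explicitly; without it, categories are sorted
# alphabetically (large, medium, small), which is not the natural order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

# fit_transform returns a 2-D float array, so flatten it and cast to int
toy['size_cat'] = encoder.fit_transform(toy[['size']]).ravel().astype(int)
print(toy['size_cat'].tolist())   # [0, 2, 1, 0]
```

Supplying the order by hand is the key step: the encoder has no way to infer that `medium` lies between `small` and `large`.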

### Car Evaluation Dataset

In [28]:
car_var = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'carEval']

In [29]:
car = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
                  header=None, names=car_var, na_values="?")
car.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,carEval
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


### Using the pandas map method

Let's analyze the categories of the price-related variables: `buying` and `maint`.

In [30]:
car.buying.unique()

array(['vhigh', 'high', 'med', 'low'], dtype=object)

In [31]:
# getting all the values
car.maint.unique()

array(['vhigh', 'high', 'med', 'low'], dtype=object)

`buying` and `maint` are ordinal variables. Their values are:
- vhigh (very high)
- high (high)
- med (medium)
- low (low)

In [32]:
# Creating a dictionary for mapping the ordinal values
priceDict = {'low':0, 'med':1, 'high':2 , 'vhigh':3}

In [33]:
# Let's put together the variables `buying` and its ordinal encoding
b = pd.DataFrame({'buying': car.buying, 'buying_cat': car.buying.map(priceDict)})
b

Unnamed: 0,buying,buying_cat
0,vhigh,3
1,vhigh,3
2,vhigh,3
3,vhigh,3
4,vhigh,3
...,...,...
1723,low,0
1724,low,0
1725,low,0
1726,low,0


In [34]:
# Let's see all codes for buying
res2 = pd.DataFrame()
res2['buying'] = b.buying.unique()
res2['buying_cat'] = b.buying_cat.unique()
res2

Unnamed: 0,buying,buying_cat
0,vhigh,3
1,high,2
2,med,1
3,low,0


In [35]:
# Assigning ordinal numerical value to the buying and maint columns in the original DataFrame
car['buying'] = car.buying.map(priceDict)
car['maint']  = car.maint.map(priceDict)
car.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,carEval
0,3,3,2,2,small,low,unacc
1,3,3,2,2,small,med,unacc
2,3,3,2,2,small,high,unacc
3,3,3,2,2,med,low,unacc
4,3,3,2,2,med,med,unacc
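One thing to watch with dictionary mapping is that `Series.map` returns `NaN`, without raising an error, for any value that is not a key in the dictionary. The sketch below shows a simple check on toy data containing a deliberate typo; `price_levels` mirrors `priceDict`, and the other names are hypothetical.

```python
import pandas as pd

price_levels = {'low': 0, 'med': 1, 'high': 2, 'vhigh': 3}

# The last value is a deliberate typo that is not a key in the dictionary
buying = pd.Series(['vhigh', 'high', 'med', 'low', 'v-high'])

mapped = buying.map(price_levels)

# Values without a dictionary entry come back as NaN
unmapped = buying[mapped.isna()].unique().tolist()
print(unmapped)   # ['v-high']
```

Running a check like this before overwriting the original column helps catch misspelled or unexpected categories early.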


In [36]:
car.lug_boot.value_counts()

small    576
med      576
big      576
Name: lug_boot, dtype: int64

In [37]:
# Creating a dictionary for mapping lug_boot values
bootDict = {'small':0, 'med':1, 'big':2}

In [38]:
car['lug_boot']  = car.lug_boot.map(bootDict)
car.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,carEval
0,3,3,2,2,0,low,unacc
1,3,3,2,2,0,med,unacc
2,3,3,2,2,0,high,unacc
3,3,3,2,2,1,low,unacc
4,3,3,2,2,1,med,unacc


In [39]:
car.safety.value_counts()

low     576
med     576
high    576
Name: safety, dtype: int64

In [40]:
# Creating a dictionary for mapping safety values
safeDict = {'low':0, 'med':1, 'high':2}

In [41]:
car['safety']  = car.safety.map(safeDict)
car.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,carEval
0,3,3,2,2,0,0,unacc
1,3,3,2,2,0,1,unacc
2,3,3,2,2,0,2,unacc
3,3,3,2,2,1,0,unacc
4,3,3,2,2,1,1,unacc


So far, we have encoded almost all of the ordinal features.

References:
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html