# Dealing with categorical values

**Two types of categorical features**
- Ordinal Values - these categorical values have a natural order, so we can sort them (e.g., student grades: A > B > C)
- Nominal Values - these have no natural order, so they can't be sorted meaningfully (e.g., countries)
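
The distinction can be made explicit in pandas. The following is a minimal sketch on made-up data (the `grades` and `countries` values are illustrative and not taken from the permits dataset): an ordered `Categorical` carries the ranking, while an unordered one does not.

```python
import pandas as pd

# Ordinal: declare the order explicitly (worst to best) so sorting respects it.
grades = pd.Series(
    pd.Categorical(["B", "A", "C", "A"], categories=["C", "B", "A"], ordered=True)
)
sorted_grades = grades.sort_values().tolist()  # sorted by rank, not alphabetically
best_grade = grades.max()

# Nominal: no order is declared, so the values are just distinct labels.
countries = pd.Series(pd.Categorical(["Peru", "Kenya", "Japan"]))
is_ordered = countries.cat.ordered
```

Here `sorted_grades` follows the declared ranking C, B, A, A, and `is_ordered` is `False` for the country column.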

*Encoding is the process of converting data from one form into another, required form (here: categorical values into numbers a model can use)*


**Two main techniques**
- *Label (Ordinal) Encoding* - maps each unique value to an integer between 0 and (number of unique classes/values - 1)
- *One Hot Encoder* - for each unique categorical value, it creates a column that contains 1s and 0s, depending on whether the row holds that value
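
To make the two techniques concrete, here is a small sketch on a toy column using plain pandas (the `animals` data is made up; the scikit-learn encoders used later in these notes produce the same kind of output):

```python
import pandas as pd

animals = pd.Series(["dog", "cat", "dog", "mouse", "cat"], name="animal")

# Label encoding: one integer per unique value, from 0 to n_unique - 1.
# pandas assigns codes in sorted category order: cat=0, dog=1, mouse=2.
label_codes = animals.astype("category").cat.codes.tolist()

# One-hot encoding: one 0/1 column per unique value.
one_hot = pd.get_dummies(animals, dtype=int)
```

`label_codes` is a single column of integers, while `one_hot` has three columns (`cat`, `dog`, `mouse`) with exactly one 1 in each row.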

In [9]:
import pandas as pd
import numpy as np
df = pd.read_csv('Smaller_Building_Permits.csv')
df.dtypes

Unnamed: 0                                  int64
Permit Number                              object
Permit Type                                 int64
Permit Type Definition                     object
Permit Creation Date                       object
Block                                      object
Lot                                        object
Street Number                               int64
Street Number Suffix                       object
Street Name                                object
Street Suffix                              object
Unit                                      float64
Unit Suffix                                object
Description                                object
Current Status                             object
Current Status Date                        object
Filed Date                                 object
Issued Date                                object
Completed Date                             object
First Construction Document Date           object


In [10]:
# Select the object-dtype columns (27 categorical features)
df_categorical = df.select_dtypes(include='object')
df_categorical.shape

(10372, 27)

In [11]:
# Get the number of unique values for each feature
unique_values_df = (
    df_categorical.nunique()
    .rename_axis('Column Name')
    .reset_index(name='Number unique values')
)

unique_values_df


| | Column Name | Number unique values |
|---|---|---|
| 0 | Permit Number | 9211 |
| 1 | Permit Type Definition | 8 |
| 2 | Permit Creation Date | 255 |
| 3 | Block | 2976 |
| 4 | Lot | 446 |
| 5 | Street Number Suffix | 7 |
| 6 | Street Name | 1006 |
| 7 | Street Suffix | 15 |
| 8 | Unit Suffix | 45 |
| 9 | Description | 6977 |


In [13]:
# replace missing values
df_categorical = df_categorical.fillna('Missing')

## Label (Ordinal) Encoder

The problem with the ordinal encoder is that it puts different integers in the same column, so the model may interpret the values as ordered (0 < 1 < 2).

It's OK to use this encoder with a feature that has a natural order, e.g. grades (C < B < A would be encoded as 1 < 2 < 3).
But where there is no order, this encoder can turn [dog, cat, dog, mouse, cat] into [1,2,1,3,2], which implies that the average of a dog and a mouse is a cat.

The problem is that the model **may derive a correlation** that does not exist in the data. For example, it may learn that the larger the encoded animal value, the larger the animal, even though a mouse (3) isn't bigger than a cat (2) or a dog (1).
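
When a feature really is ordinal, the order has to be given to the encoder, because scikit-learn's `OrdinalEncoder` sorts categories alphabetically by default. Below is a minimal sketch with made-up grade data (the names `grade_enc` and `default_enc` are illustrative). Note that scikit-learn starts its codes at 0, so C < B < A becomes 0 < 1 < 2 rather than 1 < 2 < 3.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

grades = np.array([["B"], ["A"], ["C"], ["A"]])

# Explicit order, worst to best: C -> 0, B -> 1, A -> 2.
grade_enc = OrdinalEncoder(categories=[["C", "B", "A"]])
encoded = grade_enc.fit_transform(grades).ravel().tolist()

# Default behaviour: alphabetical order, so A -> 0, B -> 1, C -> 2,
# which reverses the intended ranking.
default_enc = OrdinalEncoder()
encoded_default = default_enc.fit_transform(grades).ravel().tolist()
```

With the explicit `categories` list the best grade gets the largest code. With the default, the best grade gets the smallest one.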

In [18]:
from sklearn.preprocessing import OrdinalEncoder

# Each column's categories are sorted, then mapped to 0..n_categories-1
ord_enc = OrdinalEncoder()
transformed_features = ord_enc.fit_transform(df_categorical)  # NumPy array of floats
transformed_df = pd.DataFrame(data=transformed_features, columns=df_categorical.columns)

## One Hot Encoder

One Hot Encoding has the advantage that the result is binary rather than ordinal: each category gets its own axis in an orthogonal vector space, so no artificial order between categories is implied.

The disadvantage is that **for high cardinality, the feature space can really blow up quickly and you start fighting with the curse of dimensionality**
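
The size of the blow-up is easy to estimate before encoding, since each feature contributes one output column per unique value. Here is a rough sketch on a made-up frame (the column names `permit_id` and `street_suffix` are hypothetical stand-ins for a near-unique identifier and a low-cardinality feature):

```python
import pandas as pd

toy = pd.DataFrame({
    "permit_id": [f"P{i}" for i in range(1000)],      # 1000 unique values
    "street_suffix": ["St", "Av", "Bl", "Wy"] * 250,  # 4 unique values
})

# Expected one-hot width: the sum of unique values over all columns.
n_one_hot_columns = int(toy.nunique().sum())

# Check against an actual one-hot encoding of the same frame.
encoded_width = pd.get_dummies(toy).shape[1]
```

Two input columns turn into 1004 output columns here, almost all of them coming from the identifier-like feature.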

In [20]:
from sklearn.preprocessing import OneHotEncoder

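oh_enc = OneHotEncoder()
# fit_transform returns a sparse matrix with one column per (feature, category) pair
transformed_features = oh_enc.fit_transform(df_categorical)
# Keep the result sparse and name the columns (get_feature_names_out needs scikit-learn >= 1.0)
transformed_df = pd.DataFrame.sparse.from_spmatrix(transformed_features, columns=oh_enc.get_feature_names_out())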