# Categorical Variables

There's a lot of non-numeric data out there. Here's how to use it for machine learning.

A **categorical variable** takes only a fixed number of values.

- Consider a survey that asks how often you eat breakfast and provides four options: 'Never', 'Rarely', 'Most days', 'Every Day'. In this case, the data is categorical because responses fall into a fixed set of categories.

You'll get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first.
We'll go over three approaches.
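
To see the failure concretely, the sketch below fits a tree model on a raw string column. The tiny DataFrame and target values are made up for illustration, and the exact error text may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Made-up survey answers and target values, for illustration only.
X = pd.DataFrame({"breakfast": ["Never", "Rarely", "Most days", "Every Day"]})
y = [1.0, 2.0, 3.0, 4.0]

try:
    DecisionTreeRegressor(random_state=0).fit(X, y)
    error_message = None
except (ValueError, TypeError) as exc:
    # The model cannot turn the strings into numbers on its own.
    error_message = str(exc)

print(error_message)
```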

# Three Approaches

1. **Drop Categorical Variables :** The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This approach will only work well if the columns do not contain useful information. 

2. **Ordinal Encoding :** This assigns each unique value to a different integer. 
<img src = '../images/ordinal_encoding.png'>

This approach assumes an ordering of the categories: 'Never' (0) < 'Rarely' (1) < 'Most days' (2) < 'Every Day' (3).

This assumption makes sense in this example because there is an indisputable ranking to the categories. Not all categorical variables have a clear ordering in the values, but we refer to those that do as **ordinal variables**. For tree-based models (like decision trees and random forests), you can expect ordinal encoding to work well with ordinal variables.
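
As a minimal sketch of the breakfast example, the snippet below passes the ranking to scikit-learn's `OrdinalEncoder` explicitly. The `survey` DataFrame and the `breakfast_encoded` column name are hypothetical. The explicit `categories` list matters because, left to its default, the encoder sorts the values alphabetically rather than by meaning.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical survey responses (illustrative data only).
survey = pd.DataFrame({"breakfast": ["Rarely", "Every Day", "Never", "Most days"]})

# Spell out the intended ranking, lowest to highest.
order = ["Never", "Rarely", "Most days", "Every Day"]
encoder = OrdinalEncoder(categories=[order])

# fit_transform returns a 2-D array with one column; flatten it for assignment.
survey["breakfast_encoded"] = encoder.fit_transform(survey[["breakfast"]]).ravel()
print(survey)
```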

3. **One Hot Encoding :** This creates new columns indicating the presence or absence of each possible value in the original data. 
<img src = '../images/one_hot_encoding.png'> 

In the original dataset, 'Color' is a categorical variable with three categories: 'Red', 'Yellow', and 'Green'. The corresponding one-hot encoding contains one column for each possible value, and one row for each row in the original dataset. Whenever the original value was 'Red', we put a 1 in the 'Red' column; if the original value was 'Yellow', we put a 1 in the 'Yellow' column, and so on.

In contrast to ordinal encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can expect the approach to work particularly well if there's no clear ordering in the categorical data (e.g., 'Red' is neither more nor less than 'Yellow'). We refer to categorical variables without an intrinsic ranking as **nominal variables**.
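
The 'Color' example can be reproduced in a few lines. This sketch uses pandas' `get_dummies` as a quick stand-in for the scikit-learn encoder used in the benchmark further down. The `colors` DataFrame is invented for illustration.

```python
import pandas as pd

# Invented data mirroring the 'Color' example.
colors = pd.DataFrame({"Color": ["Red", "Red", "Yellow", "Green", "Yellow"]})

# One new 0/1 column per distinct value; the original 'Color' column is replaced.
one_hot = pd.get_dummies(colors, columns=["Color"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, in the column that matches its original value.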

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (as a rule of thumb, you generally won't use it for variables taking more than 15 different values).
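
One way to apply this rule of thumb is to count the unique values per column before choosing an encoding. The helper name `split_by_cardinality`, its `max_unique` parameter, and the toy data below are all assumptions made for this sketch.

```python
import pandas as pd

def split_by_cardinality(df, columns, max_unique=15):
    """Split columns into (low, high) lists by their number of unique values."""
    counts = df[columns].nunique()
    low = [col for col in columns if counts[col] <= max_unique]
    high = [col for col in columns if counts[col] > max_unique]
    return low, high

# Toy data: 'color' has 3 unique values, 'seller' has 20.
toy = pd.DataFrame({
    "color": ["Red", "Yellow", "Green", "Red"] * 5,
    "seller": [f"seller_{i}" for i in range(20)],
})

low_cardinality, high_cardinality = split_by_cardinality(toy, ["color", "seller"])
print(low_cardinality, high_cardinality)
```

Columns in the first list are candidates for one-hot encoding; those in the second are better dropped or encoded another way.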

# Benchmark
Let's benchmark the three approaches. We'll use the Melbourne housing dataset.

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_absolute_error

In [6]:
melb_data = pd.read_csv(filepath_or_buffer = '../datasets/melb_data.csv')
melb_data.dropna(inplace = True) # Fastest way to "Clean" the data.
y = melb_data.Price
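# Caution: 'Price' is not dropped here, so X still contains the target column and the
# MAE values below are computed with the target among the features. Add 'Price' to the
# drop list to evaluate the encodings without this leak.
X = melb_data.drop(['Date','Address','CouncilArea','Suburb','SellerG'], axis = 1)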


# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [7]:
# SellerG was dropped because it contains 246 unique entries, which is a lot.
# I don't think I need the date of sale. There are only 58 unique dates, so it may be worth coming back to.
# Suburb was removed because it has 314 unique values, and Regionname already acts as a coarser version of it.
# I don't need the address.


In [8]:
# Let's obtain a list of all categorical columns in the dataset. 
categorical = (X_train.dtypes == 'object')

categorical = list(categorical[categorical].index)
print(f' Categorical Variables: \n {categorical}')

 Categorical Variables: 
 ['Type', 'Method', 'Regionname']


### Define a function to measure the quality of each approach
We'll define a function that uses the mean absolute error (MAE) to compare the three approaches to dealing with categorical variables.

In [9]:
def score_approach(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor(random_state = 0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    MAE = mean_absolute_error(y_test, preds)
    return MAE

### Evaluating approach 1, drop categorical columns


In [10]:
drop_X_train = X_train.select_dtypes(exclude = 'object')
drop_X_test = X_test.select_dtypes(exclude = 'object')

print('MAE FOR APPROACH 1 (DROP CATEGORICAL VARIABLES)')
score_approach(drop_X_train, drop_X_test, y_train, y_test)

MAE FOR APPROACH 1 (DROP CATEGORICAL VARIABLES)


np.float64(976.2410587475787)

### Evaluating approach 2, Ordinal Encoding. 


In [11]:
# Make copy to avoid changing the original Dataset.

label_X_train = X_train.copy()
label_X_test = X_test.copy()

# Apply ordinal Encoder. 
OE = OrdinalEncoder()
label_X_train[categorical] = OE.fit_transform(label_X_train[categorical]) 
label_X_test[categorical] = OE.transform(label_X_test[categorical]) 

print('MAE FOR APPROACH 2 (ORDINAL ENCODING)')
score_approach(label_X_train, label_X_test, y_train, y_test)

MAE FOR APPROACH 2 (ORDINAL ENCODING)


np.float64(797.5960038734668)
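
The cell above works because every category in the test split also occurs in the training split. When that is not guaranteed, `OrdinalEncoder` raises an error on unseen values by default. A hedged sketch of one workaround, using the encoder's `handle_unknown` and `unknown_value` parameters on made-up data, is:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up splits: 'VB' appears only in the test data.
train = pd.DataFrame({"Method": ["S", "SP", "PI", "S"]})
test = pd.DataFrame({"Method": ["S", "VB"]})

# Map any category not seen during fit to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

encoded_test = encoder.transform(test)
print(encoded_test)
```

The choice of -1 is arbitrary here; any value that cannot collide with a real category code would do.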

### Evaluating approach 3, One-Hot Encoding.

We use the `OneHotEncoder` class from scikit-learn to get one-hot encodings. There are a number of parameters that can be used to customize its behaviour.

- We set `handle_unknown='ignore'` to avoid errors when the test data contains categories that aren't represented in the training data.
- Setting `sparse_output=False` ensures that the encoded columns are returned as a NumPy array (instead of a sparse matrix). In scikit-learn versions before 1.2 this parameter was called `sparse`.

In [12]:
# Apply one hot encoder to each column with categorical data. 
OH = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
OH_cols_train =  pd.DataFrame(OH.fit_transform(X_train[categorical]))
OH_cols_test = pd.DataFrame(OH.transform(X_test[categorical]))

# One Hot encoding removed the index; put it back. 
OH_cols_test.index = X_test.index
OH_cols_train.index = X_train.index

# Remove Categorical Columns, we'll replace with one hot encoding. 
num_X_train = X_train.select_dtypes(exclude = 'object')
num_X_test = X_test.select_dtypes(exclude = 'object')

# Concatenate One_Hot and Numerical features. 
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis = 1)
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis = 1)

# Ensure all columns have string types. 
OH_X_train.columns = OH_X_train.columns.astype('str') 
OH_X_test.columns = OH_X_test.columns.astype('str')

print('MAE FOR APPROACH 3 (ONE-HOT ENCODING)')
score_approach(OH_X_train, OH_X_test, y_train, y_test)


MAE FOR APPROACH 3 (ONE-HOT ENCODING)


np.float64(831.4351387992252)

### Which approach is best?

Here, ordinal encoding gave the lowest MAE and dropping the categorical columns gave the highest, with one-hot encoding in between. I believe the ranking varies from dataset to dataset, so it is worth checking on your own data.