# Categorical Variables

Categorical data takes only a limited number of values. For example, a column `Weather` might take the values `Sunny`, `Cloudy`, `Windy` and `Rainy`.

We will use the *Melbourne housing data set* as an example of how to deal with categorical variables.

In [98]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
print("Locked and Loaded")

Locked and Loaded


In [130]:
# Load data
df = pd.read_csv('/home/vosti/machine_learning/csvs/melb_data.csv')

# Features and target
X = df.drop(['Price'], axis=1)
y = df.Price

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# Drop columns with missing values (reassign rather than drop in place,
# which avoids pandas' SettingWithCopyWarning on the split frames)
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
X_train = X_train.drop(cols_with_missing, axis=1)
X_val = X_val.drop(cols_with_missing, axis=1)

How do we identify columns that hold categorical variables?

Hint: Look at **cardinality** and dtype. *Cardinality* is the number of unique values in a column, and categorical columns usually have low cardinality. Columns with dtype `object` often hold categorical data.
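
As a minimal sketch of both checks, the toy frame below (made-up data, not the Melbourne set) counts unique values per column and lists the non-numeric columns. The names `toy`, `cardinality`, and `non_numeric` are illustrative only.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Weather": ["Sunny", "Cloudy", "Windy", "Rainy", "Sunny"],
    "Temp": [30.1, 22.4, 19.8, 17.0, 29.5],
})

# Cardinality: number of unique values in each column
cardinality = toy.nunique()

# Non-numeric columns are the usual candidates for categorical data
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(cardinality["Weather"], non_numeric)
```

Here `Weather` repeats `Sunny`, so its cardinality is lower than its row count, while every `Temp` value is distinct.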

In [129]:
# Get columns with low cardinality
low_card_cols = [cname for cname in X_train.columns if X_train[cname].nunique() < 10
                and X_train[cname].dtype == 'object']

# Numerical columns
num_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64','float64']]

# prediction features
my_cols = low_card_cols + num_cols
new_X_train = X_train[my_cols].copy()
new_X_val = X_val[my_cols].copy()

The approaches below are used to deal with categorical variables:

* Dropping columns with categorical variables
* Ordinal Encoding
* One-Hot Encoding

We shall compute the `mean_absolute_error` (MAE) for each approach using the `score()` function defined below. The approach with the lowest MAE is the most suitable for prediction.

In [74]:
def score(X_train, X_val, y_train, y_val):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    mae = mean_absolute_error(y_val, preds)
    return mae
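
For reference, MAE is the average of the absolute differences between actual and predicted values. The hand computation below is a small sketch with made-up numbers (`actual` and `predicted` are not taken from the housing data) that shows what `mean_absolute_error` measures.

```python
# Made-up prices, for illustration only
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

# Absolute error per prediction: 10, 10 and 30
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE is the mean of those absolute errors
mae = sum(abs_errors) / len(abs_errors)
print(mae)
```

Because errors are not squared, one large miss affects MAE in proportion to its size only, and the result is in the same unit as the target (here, price).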

## **Approach 1: Dropping Columns**

In [78]:
# Drop columns with categorical data
drop_X_train = new_X_train.select_dtypes(exclude=['object'])
drop_X_val = new_X_val.select_dtypes(exclude=['object'])

print("MAE from dropping columns is:")
print(score(drop_X_train, drop_X_val, y_train, y_val))

MAE from dropping columns is:
182715.92268504738


## **Approach 2: Ordinal Encoding**

Ordinal encoding assigns a distinct integer to each unique value of a categorical column.
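
To show the idea without any library, here is a hand-rolled sketch. It assumes categories are numbered in sorted order, which matches the encoded output shown in this notebook (for example `h` becomes `0.0`). The list `methods` is toy data, so the integers differ from the real column, which has more categories.

```python
# Toy sample of a categorical column, for illustration only
methods = ["SP", "SP", "S", "S", "VB", "PI"]

# Sort the unique categories and number them from zero
categories = sorted(set(methods))
mapping = {cat: float(i) for i, cat in enumerate(categories)}

# Replace each value with its integer code
encoded = [mapping[m] for m in methods]
print(mapping)
print(encoded)
```

The integers imply an order (`PI` < `S` < `SP` < `VB`) that may carry no real meaning for the data, which is the main trade-off of this approach.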

In [94]:
# Get columns with categorical variables ("cat" is short for categorical)
s = (new_X_train.dtypes == 'object')
cat_cols = list(s[s].index)

# Copy the training and validation features so the originals stay unchanged
encoded_X_train = new_X_train.copy()
encoded_X_val = new_X_val.copy()

# Ordinal Encoding
ordinal_encoder = OrdinalEncoder()

encoded_X_train[cat_cols] = ordinal_encoder.fit_transform(new_X_train[cat_cols])
encoded_X_val[cat_cols] = ordinal_encoder.transform(new_X_val[cat_cols])

# Column assignment keeps the original index; re-setting it here is only a safeguard
encoded_X_train.index = new_X_train.index
encoded_X_val.index = new_X_val.index

print("Categorical Data before Ordinal encoding:")
new_X_train

Categorical Data before Ordinal encoding:


| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2038 | h | SP | Northern Metropolitan | 3 | 9.2 | 3058.0 | 3.0 | 1.0 | 404.0 | -37.73050 | 144.98030 | 3445.0 |
| 10039 | h | SP | South-Eastern Metropolitan | 4 | 34.7 | 3977.0 | 4.0 | 2.0 | 448.0 | -38.08170 | 145.19654 | 1721.0 |
| 6191 | u | S | Southern Metropolitan | 2 | 11.2 | 3127.0 | 3.0 | 1.0 | 0.0 | -37.81740 | 145.08930 | 5457.0 |
| 4598 | h | S | Northern Metropolitan | 3 | 9.9 | 3044.0 | 3.0 | 1.0 | 560.0 | -37.71560 | 144.94290 | 7485.0 |
| 557 | h | VB | Southern Metropolitan | 5 | 9.7 | 3103.0 | 5.0 | 6.0 | 739.0 | -37.80390 | 145.07140 | 5682.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13123 | h | SP | Northern Metropolitan | 3 | 5.2 | 3056.0 | 3.0 | 1.0 | 212.0 | -37.77695 | 144.95785 | 11918.0 |
| 3264 | h | S | Eastern Metropolitan | 3 | 10.5 | 3081.0 | 3.0 | 1.0 | 748.0 | -37.74160 | 145.04810 | 2947.0 |
| 9845 | h | PI | Northern Metropolitan | 4 | 6.7 | 3058.0 | 4.0 | 2.0 | 441.0 | -37.73572 | 144.97256 | 11204.0 |
| 10799 | h | S | Northern Metropolitan | 3 | 12.0 | 3073.0 | 3.0 | 1.0 | 606.0 | -37.72057 | 145.02615 | 21650.0 |


In [95]:
print("Categorical data after Ordinal Encoding:")
encoded_X_train

Categorical data after Ordinal Encoding:


| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2038 | 0.0 | 3.0 | 2.0 | 3 | 9.2 | 3058.0 | 3.0 | 1.0 | 404.0 | -37.73050 | 144.98030 | 3445.0 |
| 10039 | 0.0 | 3.0 | 4.0 | 4 | 34.7 | 3977.0 | 4.0 | 2.0 | 448.0 | -38.08170 | 145.19654 | 1721.0 |
| 6191 | 2.0 | 1.0 | 5.0 | 2 | 11.2 | 3127.0 | 3.0 | 1.0 | 0.0 | -37.81740 | 145.08930 | 5457.0 |
| 4598 | 0.0 | 1.0 | 2.0 | 3 | 9.9 | 3044.0 | 3.0 | 1.0 | 560.0 | -37.71560 | 144.94290 | 7485.0 |
| 557 | 0.0 | 4.0 | 5.0 | 5 | 9.7 | 3103.0 | 5.0 | 6.0 | 739.0 | -37.80390 | 145.07140 | 5682.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13123 | 0.0 | 3.0 | 2.0 | 3 | 5.2 | 3056.0 | 3.0 | 1.0 | 212.0 | -37.77695 | 144.95785 | 11918.0 |
| 3264 | 0.0 | 1.0 | 0.0 | 3 | 10.5 | 3081.0 | 3.0 | 1.0 | 748.0 | -37.74160 | 145.04810 | 2947.0 |
| 9845 | 0.0 | 0.0 | 2.0 | 4 | 6.7 | 3058.0 | 4.0 | 2.0 | 441.0 | -37.73572 | 144.97256 | 11204.0 |
| 10799 | 0.0 | 1.0 | 2.0 | 3 | 12.0 | 3073.0 | 3.0 | 1.0 | 606.0 | -37.72057 | 145.02615 | 21650.0 |


In [97]:
print("MAE using Ordinal encoding is:")
print(score(encoded_X_train, encoded_X_val, y_train, y_val))

MAE using Ordinal encoding is:
170808.84246192835
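
One caveat this dataset does not seem to trigger: with its default settings, `OrdinalEncoder` raises an error in `transform()` when the validation data contains a category that was absent from the training data. The sketch below uses toy values and assumes scikit-learn 0.24 or later, where the `handle_unknown="use_encoded_value"` option is available; the choice of `-1` as the placeholder code is arbitrary.

```python
from sklearn.preprocessing import OrdinalEncoder

# Toy training and validation values, for illustration only
train_values = [["S"], ["SP"], ["VB"]]
val_values = [["SP"], ["XX"]]  # "XX" never appears in the training data

# Map unseen categories to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_values)

encoded_val = encoder.transform(val_values)
print(encoded_val)
```

The known category keeps its fitted code, while the unseen one receives the placeholder.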


## **Approach 3: One-Hot Encoding**
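
One-hot encoding replaces a categorical column with one binary column per category, set to 1 where the row has that category and 0 elsewhere. The hand-rolled sketch below uses toy values; the helper `one_hot()` is hypothetical and only mimics the behaviour of `handle_unknown='ignore'`, where an unseen category becomes a row of zeros.

```python
# Toy training values, for illustration only
train_types = ["h", "u", "h", "t"]

# One output column per category seen in training, in sorted order
categories = sorted(set(train_types))

def one_hot(value):
    """Return one 0/1 flag per known category; all zeros if the value is unseen."""
    return [1.0 if value == cat else 0.0 for cat in categories]

encoded_train = [one_hot(v) for v in train_types]
unseen_row = one_hot("x")  # "x" was not in the training values

print(categories)
print(encoded_train)
print(unseen_row)
```

Unlike ordinal encoding, this imposes no order on the categories, at the cost of one extra column per category. That cost is why the notebook limits itself to low-cardinality columns.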

In [125]:
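# One-hot encoding
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False
OH_encoder = OneHotEncoder(handle_unknown = 'ignore', sparse=False)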
OH_X_train = pd.DataFrame(OH_encoder.fit_transform(new_X_train[cat_cols]))
OH_X_val = pd.DataFrame(OH_encoder.transform(new_X_val[cat_cols]))

# One-hot encoding returns a plain array, so the index is lost; put it back
OH_X_train.index = new_X_train.index
OH_X_val.index = new_X_val.index


# numerical columns
num_X_train = new_X_train.drop(cat_cols, axis=1)
num_X_val = new_X_val.drop(cat_cols, axis=1)

# Combine one-hot encoded columns with numerical columns
num_OH_X_train = pd.concat([num_X_train, OH_X_train], axis=1)
num_OH_X_val = pd.concat([num_X_val, OH_X_val], axis=1)

print("Data after one hot encoding:")
num_OH_X_train

Data after one hot encoding:


| | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount | 0 | ... | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2038 | 3 | 9.2 | 3058.0 | 3.0 | 1.0 | 404.0 | -37.73050 | 144.98030 | 3445.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10039 | 4 | 34.7 | 3977.0 | 4.0 | 2.0 | 448.0 | -38.08170 | 145.19654 | 1721.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 6191 | 2 | 11.2 | 3127.0 | 3.0 | 1.0 | 0.0 | -37.81740 | 145.08930 | 5457.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4598 | 3 | 9.9 | 3044.0 | 3.0 | 1.0 | 560.0 | -37.71560 | 144.94290 | 7485.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 557 | 5 | 9.7 | 3103.0 | 5.0 | 6.0 | 739.0 | -37.80390 | 145.07140 | 5682.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13123 | 3 | 5.2 | 3056.0 | 3.0 | 1.0 | 212.0 | -37.77695 | 144.95785 | 11918.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3264 | 3 | 10.5 | 3081.0 | 3.0 | 1.0 | 748.0 | -37.74160 | 145.04810 | 2947.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9845 | 4 | 6.7 | 3058.0 | 4.0 | 2.0 | 441.0 | -37.73572 | 144.97256 | 11204.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10799 | 3 | 12.0 | 3073.0 | 3.0 | 1.0 | 606.0 | -37.72057 | 145.02615 | 21650.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [126]:
print("MAE from One Hot Encoding is:")
print(score(num_OH_X_train, num_OH_X_val, y_train, y_val))

MAE from One Hot Encoding is:
169682.9182286275
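
To compare the three approaches, the scores printed above can be placed side by side. The snippet below simply copies those three MAE values into a dictionary (the name `mae_by_approach` is illustrative) and picks the smallest.

```python
# MAE values copied from the outputs above
mae_by_approach = {
    "Dropping columns": 182715.92268504738,
    "Ordinal encoding": 170808.84246192835,
    "One-hot encoding": 169682.9182286275,
}

# The approach with the lowest MAE
best = min(mae_by_approach, key=mae_by_approach.get)
print(best)

# How much each approach trails the best one
for name, mae in sorted(mae_by_approach.items(), key=lambda item: item[1]):
    print(f"{name}: {mae:.2f} (+{mae - mae_by_approach[best]:.2f})")
```

On this particular split (`random_state=0`), one-hot encoding scores best and dropping the categorical columns scores worst. The gap between the two encodings is small, so the ranking may not hold for other splits or datasets.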
