### Categorical Variables

A categorical variable takes only a limited number of values.

Consider a survey that asks how often you eat breakfast and provides four options: "Never", "Rarely", "Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of categories.
If people responded to a survey about what brand of car they owned, the responses would fall into categories like "Honda", "Toyota", and "Ford". In this case, the data is also categorical.
You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first. In this tutorial, we'll compare three approaches that you can use to prepare your categorical data.

In [1]:
import pandas as pd

data=pd.read_csv('data/melb_data.csv')

In [2]:
data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [3]:
# import the helper that splits the data into training and validation sets
from sklearn.model_selection import train_test_split

In [4]:
# separate the target column (Price) from the features
y=data.Price
X=data.drop(['Price'], axis=1)

In [5]:
X_train_full,X_valid_full,y_train,y_valid=train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [6]:
# drop columns with missing values
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
# reassign instead of using inplace=True, which triggers a SettingWithCopyWarning here
X_train_full = X_train_full.drop(cols_with_missing, axis=1)
X_valid_full = X_valid_full.drop(cols_with_missing, axis=1)


#### Investigating cardinality (from exercises)
"Cardinality" means the number of unique values in a column. We need to select categorical columns with relatively low cardinality, so we take all columns of dtype object that contain fewer than 10 distinct values.

For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset.  For this reason, we typically will only one-hot encode columns with relatively low cardinality.  Then, high cardinality columns can either be dropped from the dataset, or we can use label encoding.
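
To make the effect of cardinality concrete, here is a minimal sketch on a made-up DataFrame (the `toy` data and its column values are assumptions for illustration, not part of the Melbourne dataset). It counts the distinct values per text column and shows how many columns a one-hot encoding would create for each, using `pd.get_dummies` as a quick stand-in for an encoder.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Type": ["h", "u", "h", "t", "u", "h"],
    "Address": ["1 A St", "2 B St", "3 C St", "4 D St", "5 E St", "6 F St"],
    "Rooms": [2, 1, 3, 2, 1, 4],
})

# Cardinality = number of distinct values in each text column.
text_columns = ["Type", "Address"]
cardinality = {col: toy[col].nunique() for col in text_columns}
print(cardinality)

# One-hot encoding creates one new column per distinct value,
# so a high-cardinality column widens the dataset considerably.
type_width = pd.get_dummies(toy["Type"]).shape[1]
address_width = pd.get_dummies(toy["Address"]).shape[1]
print(type_width, address_width)
```

Here `Type` would add only three columns, while `Address` would add one column per row, which is why columns like it are usually dropped or encoded differently.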

In [7]:
low_cardinality_cols= [cname for cname in X_train_full.columns if 
                      X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == "object"]
# (Series.nunique() returns the number of distinct values in the column)

In [8]:
#extract numerical columns
numerical_cols=[cname for cname in X_train_full.columns if 
               X_train_full[cname].dtype in ['int64','float64']]

In [9]:
#keep only selected columns
cols=low_cardinality_cols + numerical_cols
X_train=X_train_full[cols].copy()
X_valid=X_valid_full[cols].copy()

In [10]:
X_train.head()

Unnamed: 0,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Propertycount
12167,u,S,Southern Metropolitan,1,5.0,3182.0,1.0,1.0,0.0,-37.85984,144.9867,13240.0
6524,h,SA,Western Metropolitan,2,8.0,3016.0,2.0,2.0,193.0,-37.858,144.9005,6380.0
8413,h,S,Western Metropolitan,3,12.6,3020.0,3.0,1.0,555.0,-37.7988,144.822,3755.0
2919,u,SP,Northern Metropolitan,3,13.0,3046.0,3.0,1.0,265.0,-37.7083,144.9158,8870.0
6043,h,S,Western Metropolitan,3,13.3,3020.0,3.0,1.0,673.0,-37.7623,144.8272,4217.0


In [11]:
# list the categorical (object-dtype) columns in the training data
s=(X_train.dtypes=='object')
object_cols=list(s[s].index) 

print('Categorical variables: ')
print(object_cols)

Categorical variables: 
['Type', 'Method', 'Regionname']


We define a function score_dataset() to compare the three different approaches to dealing with categorical variables.
This function reports the mean absolute error (MAE) from a random forest model. 
In general, we want the MAE to be as low as possible!
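
As a reminder of what the metric measures, the sketch below computes MAE by hand on three made-up prices (the numbers and the helper name `mean_absolute_error_by_hand` are assumptions for illustration). It should agree with scikit-learn's `mean_absolute_error` for the same inputs.

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Hypothetical sale prices and predictions, invented for illustration.
actual = [500_000, 750_000, 1_000_000]
predicted = [550_000, 700_000, 1_150_000]

# Absolute errors are 50,000, 50,000 and 150,000, so the mean is 250,000 / 3.
mae = mean_absolute_error_by_hand(actual, predicted)
print(mae)
```

An MAE of roughly 83,000 here means the predictions are off by about that many dollars on average.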

In [12]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # fit a random forest on the training data and return the validation MAE
    model=RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train,y_train)
    preds=model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

### 1. Approach - Drop categorical variables

The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This approach will only work well if the columns do not contain useful information.

In [13]:
drop_X_train=X_train.select_dtypes(exclude=['object'])
drop_X_valid=X_valid.select_dtypes(exclude=['object'])

In [14]:
print('MAE for Dropping categorical var: ')
print(score_dataset(drop_X_train,drop_X_valid, y_train, y_valid))

MAE for Dropping categorical var: 
175703.48185157913


### 2. Approach - Label Encoding

Label encoding assigns each unique value to a different integer.

Not all categorical variables have a clear ordering in the values, but we refer to those that do as ordinal variables. For tree-based models (like decision trees and random forests), you can expect label encoding to work well with ordinal variables.
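
One caveat is worth illustrating: scikit-learn's `LabelEncoder` numbers the classes in sorted order, which for strings is alphabetical and may not match the natural ranking. The sketch below uses the breakfast survey answers from the introduction; the `frequency_order` mapping is a hand-written assumption showing one way to preserve the ordering, and the alphabetical dictionary only mimics how a sorted encoding would number the same values.

```python
import pandas as pd

# Survey answers from the breakfast example (sample values invented).
answers = pd.Series(["Rarely", "Every day", "Never", "Most days", "Rarely"])

# Explicit mapping that follows the natural ordering of the categories.
frequency_order = {"Never": 0, "Rarely": 1, "Most days": 2, "Every day": 3}
encoded = answers.map(frequency_order)
print(encoded.tolist())

# Numbering the sorted unique values, as a sorted label encoding would.
alphabetical = {label: code for code, label in enumerate(sorted(answers.unique()))}
print(alphabetical)
```

With the alphabetical numbering, "Every day" receives 0 and "Never" receives 2, so the integers no longer reflect how often someone eats breakfast.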

In [15]:
from sklearn.preprocessing import LabelEncoder

In [16]:
# copy the data so the original DataFrames are not modified
label_X_train=X_train.copy()
label_X_valid=X_valid.copy()

In [17]:
label_encoder=LabelEncoder()
# transforming object columns
for col in object_cols: 
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

In [18]:
print('MAE for Label Encoding: ')
print(score_dataset(label_X_train,label_X_valid, y_train, y_valid))

MAE for Label Encoding: 
165936.40548390493


### 3. Approach - One-Hot Encoding

In contrast to label encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., "Red" is neither more nor less than "Yellow"). We refer to categorical variables without an intrinsic ranking as nominal variables.

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (i.e., you generally won't use it for variables taking more than 15 different values).

We use the OneHotEncoder class from scikit-learn to get one-hot encodings. There are a number of parameters that can be used to customize its behavior.

We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data, and setting sparse=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix). Note that scikit-learn 1.2 renamed this parameter to sparse_output, and the old name was removed in version 1.4.

To use the encoder, we supply only the categorical columns that we want to be one-hot encoded. For instance, to encode the training data, we supply X_train[object_cols]. (object_cols in the code cell below is a list of the column names with categorical data, and so X_train[object_cols] contains all of the categorical data in the training set.)
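
To see what handle_unknown='ignore' does in practice, here is a minimal sketch on a made-up color column (the `train` and `valid` frames are assumptions for illustration). It leaves the encoder's output format at its default and converts the sparse result with `.toarray()`, which is an alternative to requesting dense output in the constructor.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Blue" appears only in the validation set.
train = pd.DataFrame({"Color": ["Red", "Yellow", "Green", "Red"]})
valid = pd.DataFrame({"Color": ["Green", "Blue"]})

# With handle_unknown="ignore", unseen categories do not raise an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; .toarray() makes it a dense numpy array.
encoded_valid = encoder.transform(valid).toarray()

learned_categories = encoder.categories_[0].tolist()
print(learned_categories)
print(encoded_valid.tolist())
```

The row for "Green" has a single 1 in the "Green" column, while the unseen "Blue" row is encoded as all zeros rather than causing a failure.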

In [19]:
from sklearn.preprocessing import OneHotEncoder

In [20]:
# apply one-hot encoder to each column with categorical data
# (on scikit-learn >= 1.2, pass sparse_output=False instead of sparse=False)
OH_encoder=OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train=pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid=pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

In [21]:
# One-hot encoding removed index; put it back
OH_cols_train.index=X_train.index
OH_cols_valid.index=X_valid.index

In [22]:
# remove categorical columns (will replace with one-hot encoding)
num_X_train=X_train.drop(object_cols, axis=1)
num_X_valid=X_valid.drop(object_cols, axis=1)

In [23]:
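# add one-hot encoded columns to numerical features
OH_X_train=pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid=pd.concat([num_X_valid, OH_cols_valid], axis=1)
# recent scikit-learn versions require all column names to be strings
OH_X_train.columns=OH_X_train.columns.astype(str)
OH_X_valid.columns=OH_X_valid.columns.astype(str)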

In [24]:
print('MAE for OH Encoder: ')
print(score_dataset(OH_X_train,OH_X_valid, y_train, y_valid))

MAE for OH Encoder: 
166089.4893009678


In this case, dropping the categorical columns (Approach 1) performed worst, since it had the highest MAE score. As for the other two approaches, since the returned MAE scores are so close in value, there doesn't appear to be any meaningful benefit to one over the other.

In general, one-hot encoding (Approach 3) typically performs best, and dropping the categorical columns (Approach 1) typically performs worst.