# Intermediate Machine Learning Micro Course

An expansion on the concepts and methods used in the Intro to Machine Learning micro course.

## Categorical Variables

Categorical variables take their values from a fixed set of predefined options. Things like car brands (Honda, Ford, Toyota, etc.) or entries in dropdown boxes (in progress, not started, etc.) are categorical variables. Because these values are usually stored as strings, most machine learning models raise an error if they are fed in raw. We will look at three approaches to dealing with them.
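To see the failure concretely, here is a minimal sketch on a tiny made-up frame. The `brand` column, its values, and the prices are hypothetical and are not taken from the Melbourne data. Fitting a scikit-learn random forest on a raw string column is expected to raise a `ValueError`:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: one raw string column and a numeric target
toy_X = pd.DataFrame({"brand": ["Honda", "Ford", "Toyota", "Ford"]})
toy_y = [21000, 18500, 23000, 19000]

try:
    RandomForestRegressor(n_estimators=5, random_state=0).fit(toy_X, toy_y)
    fit_failed = False
except ValueError as err:
    # scikit-learn cannot convert strings such as "Honda" to floats
    fit_failed = True
    print(f"Model refused the raw strings: {err}")
```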

In [5]:
# Import data and set up training / validation sets
import pandas as pd
from sklearn.model_selection import train_test_split

MELBOURNE_DATA_PATH = '..\\data\\melb_data.csv'
melb_data = pd.read_csv(MELBOURNE_DATA_PATH)

y = melb_data.Price
X = melb_data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()] 
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [6]:
# Define function to return MAE score for the data
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

def score_dataset(X_train, X_val, y_train, y_val) -> float:
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    return mean_absolute_error(y_val, preds)

In [7]:
# Get a list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print(f'Categorical variables:\n{object_cols}')

Categorical variables:
['Type', 'Method', 'Regionname']


### Approach 1: Drop Categorical Variables

Simply dropping the categorical columns can be an effective approach if they do not contain useful information.



In [10]:
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print(f'MAE from approach #1 (drop categorical variables): {score_dataset(drop_X_train, drop_X_valid, y_train, y_valid)}')

MAE from approach #1 (drop categorical variables): 183550.22137772635


### Approach 2: Ordinal Encoding

This approach replaces each unique value with a unique integer. For example, in a survey where one of the questions can be answered "Never", "Rarely", "Most Days", or "Always", the answers can be assigned the numbers 0, 1, 2, and 3. This approach is called "ordinal encoding".

It works especially well when the categories have a natural order that the numbers can reflect, as in this example. It may not work as well for something like car brands, where there is no clear ranking.

In [19]:
from sklearn.preprocessing import OrdinalEncoder

# Make a copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
# Fit on the training data only, then reuse the same mapping for the validation data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print(f'MAE from approach #2 (Ordinal Encoding): {score_dataset(label_X_train, label_X_valid, y_train, y_valid)}')


MAE from approach #2 (Ordinal Encoding): 175062.2967599411


This is a common approach that is simpler than providing custom labels; however, we can expect an additional boost in performance if we provide better-informed labels for all ordinal variables.
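As a rough sketch of what better-informed labels could look like, the survey example from earlier can be encoded with an explicit order via the `categories` parameter of `OrdinalEncoder`. The `survey` frame and its `frequency` column are hypothetical; without an explicit order the encoder sorts the categories, so "Always" would receive 0.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical survey answers, mirroring the example in the text
survey = pd.DataFrame({"frequency": ["Rarely", "Always", "Never", "Most Days"]})

# Spell out the order explicitly instead of relying on the default sorted order
frequency_order = ["Never", "Rarely", "Most Days", "Always"]
ordered_encoder = OrdinalEncoder(categories=[frequency_order])

# fit_transform returns a 2D array, so flatten it before storing it as a column
survey["frequency_encoded"] = ordered_encoder.fit_transform(survey[["frequency"]]).ravel()
print(survey)
```

With this order, "Never" maps to 0 and "Always" maps to 3, so larger numbers mean more frequent answers.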

When using ordinal encoding, be sure to check the unique values in both the training and validation data. Sometimes a value appears only in the training data or only in the validation data, and this can cause an error. The encoder works if the training data contains every category that shows up in the validation data, but if the validation data contains a new value, the encoder has no integer assigned to it and the transform fails.
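Two ways to guard against this are sketched here on a hypothetical toy split (the `fuel` column and its values are made up). The first check lists validation categories missing from the training data. The second uses the encoder's `handle_unknown="use_encoded_value"` option, available in recent scikit-learn releases, to map unseen categories to a sentinel value; the choice of -1 is an assumption, not something the course prescribes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy split: "Diesel" appears only in the validation data
toy_train = pd.DataFrame({"fuel": ["Petrol", "Electric", "Petrol"]})
toy_valid = pd.DataFrame({"fuel": ["Electric", "Diesel"]})

# Check 1: list validation categories the encoder would never have seen
unseen = set(toy_valid["fuel"]) - set(toy_train["fuel"])
print(f"Unseen in training: {unseen}")

# Check 2: map unseen categories to a sentinel value instead of raising an error
safe_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
safe_encoder.fit(toy_train[["fuel"]])
encoded_valid = safe_encoder.transform(toy_valid[["fuel"]]).ravel().tolist()
print(encoded_valid)
```

Whether a sentinel is appropriate depends on the model; dropping the affected columns or rows is another option.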

### Approach 3: One-Hot Encoding

One-hot encoding creates a new column for each category. For every row, a 1 is placed in the column that matches the row's original value and a 0 is placed in the columns for all other categories.

In [17]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# NOTE: handle_unknown describes the behavior of the encoder if it encounters values not in the training data
# NOTE: sparse_output tells the encoder whether to return a sparse matrix (True) or a NumPy array (False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed the index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (to be replaced by one-hot encoded)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print(f'MAE from approach #3 (One-Hot Encoding): {score_dataset(OH_X_train, OH_X_valid, y_train, y_valid)}')

MAE from approach #3 (One-Hot Encoding): 176703.63810751104


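One-hot encoding typically performs best, followed by ordinal encoding, with dropping the categorical columns generally performing worst. In this run the two encodings landed close together: ordinal encoding scored slightly lower (MAE of about 175,062) than one-hot encoding (about 176,704), and both beat dropping the columns (about 183,550).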