# Categorical Encodings

In this tutorial I'll show you methods for representing categorical variables as numerical features, which will typically improve the performance of your models. In general, these encodings are learned from the data itself. The two most basic encodings are one-hot encoding, which you should be familiar with, and label encoding, which you saw in the first tutorial. Here I'll cover count encoding, target encoding (and variations), and learning encodings with singular value decomposition.
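
As a quick refresher on the two basic encodings, here is a minimal sketch on a toy column. The `toy` dataframe and its values are made up for illustration and are not part of the Kickstarter data.

```python
import pandas as pd

# Toy column, invented for illustration (not the Kickstarter data)
toy = pd.DataFrame({'country': ['US', 'GB', 'US', 'CA']})

# Label encoding: one integer per distinct value (codes follow sorted order)
label_encoded = toy['country'].astype('category').cat.codes

# One-hot encoding: one indicator column per distinct value
one_hot = pd.get_dummies(toy['country'], prefix='country')

print(label_encoded.tolist())
print(one_hot.columns.tolist())
```

Label encoding keeps a single column, while one-hot encoding adds one column per distinct value, which gets expensive for features with many values.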

First I'm going to load in the data and rebuild the baseline model from the first tutorial.

In [82]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ks = pd.read_csv('../input/kickstarter-projects/ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])

# Drop live projects
ks = ks.query('state != "live"')

# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))

# Timestamp features
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)

# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)

data_cols = ['goal', 'hour', 'day', 'month', 'year', 'outcome']
baseline_data = ks[data_cols].join(encoded)

In [101]:
# Defining some functions that will help us test our encodings
import lightgbm as lgb
from sklearn import metrics

def get_data_splits(dataframe, valid_fraction=0.1):
    # valid_fraction is the share of rows used for validation (and again for test)
    valid_size = int(len(dataframe) * valid_fraction)

    train = dataframe[:-valid_size * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    
    return train, valid, test

def train_model(train, valid):
    feature_cols = train.columns.drop('outcome')

    dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    print("Training model!")
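    # Early stopping and logging are passed as callbacks (LightGBM >= 3.3);
    # older releases used early_stopping_rounds=10 and verbose_eval=False
    bst = lgb.train(param, dtrain, num_boost_round=1000, valid_sets=[dvalid],
                    callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False),
                               lgb.log_evaluation(period=0)])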

    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['outcome'], valid_pred)
    print(f"Validation AUC score: {valid_score:.4f}")
    return bst

In [100]:
# Training a model on the baseline data
train, valid, _ = get_data_splits(baseline_data)
bst = train_model(train, valid)

Training model!
Validation AUC score: 0.7467


# Count Encoding

Count encoding replaces each categorical value with the number of times it appears in the dataset. For example, if the value "GB" occurred 10 times in the country feature, then "GB" would be replaced with 10 for each occurrence.
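
Here is a minimal sketch of that idea on a toy column. The `toy` dataframe is made up for illustration; the real data is encoded further down.

```python
import pandas as pd

# Toy column, invented for illustration
toy = pd.DataFrame({'country': ['GB', 'US', 'GB', 'CA', 'GB']})

# Number of rows in which each value appears
counts = toy['country'].value_counts()

# Replace every value with its count
toy['country_count'] = toy['country'].map(counts)
print(toy)
```

"GB" appears three times, so every "GB" row gets the value 3.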

I'll be using the [`category_encoders` package](https://github.com/scikit-learn-contrib/categorical-encoding) for most of the work here. However, the current release does not include a count encoder. It should be in the next release, but for now I'll write my own class for count encoding in the style of a scikit-learn transformer.


In [86]:
class CountEncoder:
    def __init__(self):
        self.mapping = {}
        
    def fit(self, df):
        """ Calculates count encodings for each column in a dataframe. """
        for col in df.columns:
            # Number of rows in which each value of this column appears
            self.mapping[col] = df[col].value_counts()
        return self
        
    def transform(self, df):
        """ Applies learned encodings to a dataframe. Returned dataframe has the same
            indices and columns as original dataframe. Values that were not seen
            during fitting are encoded as 0.
        """
        out_df = df.copy()
        for col, encoding in self.mapping.items():
            out_df[col] = df[col].map(encoding).fillna(0)
        
        return out_df
    
    def fit_transform(self, df):
        self.fit(df)
        return self.transform(df)

With that defined we can apply it to our data.

In [87]:
cat_features = ['category', 'currency', 'country']
count_enc = CountEncoder()
count_encoded = count_enc.fit_transform(ks[cat_features])

data = baseline_data.join(count_encoded.add_suffix("_count"))

In [89]:
# Training a model on the data with count encodings added
train, valid, test = get_data_splits(data)
bst = train_model(train, valid)

Training model!
Validation AUC score: 0.7486


Adding the count encoding features increases the validation score from 0.7467 to 0.7486, a slight improvement!

# Target Encoding

Target encoding replaces a categorical value with the target probability given that value. For example, given the country value "CA", you'd calculate the average outcome for all the rows with `country == 'CA'`, which is around 0.28. This is often blended with the target probability over the entire dataset to reduce the variance of values with few occurrences.
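
To make the blending concrete, here is a minimal sketch on toy data. It uses simple additive smoothing with a made-up weight `m`, which is only one possible blending rule and not necessarily the one a given library uses. The `toy` dataframe and all names in it are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'country': ['CA', 'CA', 'CA', 'CA', 'NZ', 'US', 'US', 'US'],
    'outcome': [1, 0, 0, 0, 1, 1, 0, 1],
})

m = 2  # hypothetical smoothing weight: acts like m extra rows at the global mean
global_mean = toy['outcome'].mean()

# Per-value mean outcome and number of rows
stats = toy.groupby('country')['outcome'].agg(['mean', 'count'])

# Weighted average of the per-value mean and the global mean
blended = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

toy['country_target'] = toy['country'].map(blended)
print(blended)
```

"NZ" has a single successful row, so its raw mean is 1.0. Blending pulls it toward the global mean of 0.5, which is the variance reduction described above.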

One thing to note here is that we are using the targets to create new features. This means if we include the validation or test data in the target encodings, information from these datasets will end up in the model (often called leakage). For this reason, I'll learn the target encodings from the training dataset only and apply them to the other datasets.
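
A minimal sketch of learning the encodings on the training rows only, using plain pandas on toy data. The frames `toy_train` and `toy_valid` are made up, and falling back to the training mean for unseen values is my own assumption about how to handle them.

```python
import pandas as pd

# Toy splits, invented for illustration
toy_train = pd.DataFrame({'country': ['CA', 'CA', 'US', 'US', 'US'],
                          'outcome': [0, 1, 1, 1, 0]})
toy_valid = pd.DataFrame({'country': ['US', 'CA', 'NZ'],
                          'outcome': [0, 1, 1]})

# Learn the per-value means from the training rows only
train_means = toy_train.groupby('country')['outcome'].mean()
train_global = toy_train['outcome'].mean()

# Apply to both splits; unseen values ('NZ') fall back to the training mean
toy_train['country_target'] = toy_train['country'].map(train_means)
toy_valid['country_target'] = (toy_valid['country']
                               .map(train_means)
                               .fillna(train_global))
print(toy_valid)
```

The validation outcomes are never read when computing the encodings, so nothing about them can leak into the features.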

The `category_encoders` package provides `TargetEncoder` for this. It works like scikit-learn transformers with `.fit` and `.transform` methods.

In [61]:
import category_encoders as ce

In [115]:
cat_features = ['category', 'currency', 'country']

# Create the encoder itself
target_enc = ce.TargetEncoder(cols=cat_features)

train, valid, _ = get_data_splits(data)

# Fit the encoder using the categorical features and target
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename the columns with _target suffix, and join to dataframe
train = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

In [116]:
train.head()

| | goal | hour | day | month | year | outcome | category | currency | country | category_count | currency_count | country_count | category_target | currency_target | country_target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000.0 | 12 | 11 | 8 | 2015 | 0 | 108 | 5 | 9 | 1362 | 33853 | 33393 | 0.36019 | 0.357122 | 0.361636 |
| 1 | 30000.0 | 4 | 2 | 9 | 2017 | 0 | 93 | 13 | 22 | 5174 | 293624 | 290887 | 0.384615 | 0.373392 | 0.376631 |
| 2 | 45000.0 | 0 | 12 | 1 | 2013 | 0 | 93 | 13 | 22 | 5174 | 293624 | 290887 | 0.384615 | 0.373392 | 0.376631 |
| 3 | 5000.0 | 3 | 17 | 3 | 2012 | 0 | 90 | 13 | 22 | 15647 | 293624 | 290887 | 0.412655 | 0.373392 | 0.376631 |
| 4 | 19500.0 | 8 | 4 | 7 | 2015 | 0 | 55 | 13 | 22 | 10054 | 293624 | 290887 | 0.302625 | 0.373392 | 0.376631 |


In [108]:
bst = train_model(train, valid)

Training model!
Validation AUC score: 0.7491


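The validation score is higher again: 0.7491, up from 0.7486 with the count encodings alone and 0.7467 for the baseline. Note that the count features are still in the data here, so this model uses both sets of encodings.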

# Leave-One-Out Encoding

Leave-One-Out encoding is the same as target encoding except that it leaves out the target for each row. That is, for each row it calculates the target probability over all the other rows with the same value. This can help the model generalize since it's leaving out information from the current record. This is implemented with `LeaveOneOutEncoder`, similar to `TargetEncoder`.
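
To show the arithmetic, here is a minimal pandas sketch of the leave-one-out mean on toy data. It is written by hand for illustration and is not how `LeaveOneOutEncoder` is implemented internally; the `toy` dataframe is made up.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'country': ['CA', 'CA', 'CA', 'US', 'US'],
    'outcome': [1, 0, 0, 1, 1],
})

grouped = toy.groupby('country')['outcome']
cat_sum = grouped.transform('sum')      # total outcome per value, repeated on each row
cat_count = grouped.transform('count')  # number of rows per value

# Mean outcome over the *other* rows with the same value.
# A value that appears only once would give 0/0 (NaN) and needs a fallback.
toy['country_loo'] = (cat_sum - toy['outcome']) / (cat_count - 1)
print(toy)
```

The first "CA" row is the only successful one, so with its own outcome removed its encoding is 0.0, while the other two "CA" rows get 0.5.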

In [109]:
cat_features = ['category', 'currency', 'country']
target_enc = ce.LeaveOneOutEncoder(cols=cat_features)

train, valid, _ = get_data_splits(data)
target_enc.fit(train[cat_features], train['outcome'])

train = train.join(target_enc.transform(train[cat_features]).add_suffix('_loo'))
valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_loo'))

In [110]:
bst = train_model(train, valid)

Training model!
Validation AUC score: 0.7491


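This gives us the same improvement as target encoding. One likely reason, worth checking against the `category_encoders` documentation for your version: when `.transform` is called without the target, the encoder cannot exclude each row's own outcome, so the training rows may simply receive the plain per-category mean.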

# CatBoost Encoding

Finally, we'll look at CatBoost encoding. This is similar to leave-one-out encoding and target encoding in that it's based on the target probability for a given value. However, with CatBoost encoding, the target probability for each row is calculated only from the rows before it.
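
Here is a simplified sketch of the "only the rows before it" idea in plain pandas. The prior and its weight `a` are assumptions I've chosen for illustration, and the exact formula used by `CatBoostEncoder` may differ. The `toy` dataframe is made up.

```python
import pandas as pd

# Toy data, invented for illustration; row order matters here
toy = pd.DataFrame({
    'country': ['CA', 'US', 'CA', 'CA', 'US'],
    'outcome': [1, 0, 0, 1, 1],
})

prior = toy['outcome'].mean()  # used when a value has little or no history
a = 1                          # hypothetical weight given to the prior

grouped = toy.groupby('country')['outcome']
# Sum and count of outcomes in *earlier* rows with the same value
prev_sum = grouped.cumsum() - toy['outcome']
prev_count = grouped.cumcount()

toy['country_cb'] = (prev_sum + a * prior) / (prev_count + a)
print(toy)
```

The first row for each value has no history, so it gets the prior. Later rows only see outcomes that came before them, never their own.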

In [112]:
cat_features = ['category', 'currency', 'country']
target_enc = ce.CatBoostEncoder(cols=cat_features)

train, valid, _ = get_data_splits(data)
target_enc.fit(train[cat_features], train['outcome'])

train = train.join(target_enc.transform(train[cat_features]).add_suffix('_cb'))
valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_cb'))

In [113]:
bst = train_model(train, valid)

Training model!
Validation AUC score: 0.7492


This does slightly better than target and LOO encoding.

# Encoding with Singular Value Decomposition

Now we'll use singular value decomposition to learn encodings from pairs of categorical features. The idea here is we'll construct a matrix of co-occurrences for each pair of categorical features. Each row corresponds to a value in feature A, while each column corresponds to a value in feature B. Each element is the count of rows where the value in A appears together with the value in B.

We can use singular value decomposition (SVD) to find two smaller matrices that approximately reproduce the count matrix when multiplied.

<center><img src="https://i.imgur.com/mnnsBKJ.png" width=600px></center>

We can choose how many components we'll find with SVD, that is, how long each factor vector will be. In general, longer vectors will contain more information at the cost of more memory and computation. To get the encodings for feature A, you multiply the count matrix by the small matrix for feature B.
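
The factorization can be sketched with NumPy on a small made-up count matrix. I'm using `np.linalg.svd` here instead of `TruncatedSVD` only so the individual factors are easy to inspect; the matrix values and the choice `k = 2` are assumptions for illustration.

```python
import numpy as np

# Toy co-occurrence counts: 4 values of feature A (rows) x 3 values of feature B (columns)
counts = np.array([[10, 0, 2],
                   [ 8, 1, 1],
                   [ 0, 9, 7],
                   [ 1, 7, 9]], dtype=float)

k = 2  # number of components to keep

# Full SVD, then keep only the first k components
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Encodings for feature A: count matrix times the small matrix for feature B
enc_a = counts @ Vt_k.T          # shape (4, k), one vector per value of A

# Keeping only k components approximates the counts rather than copying them
approx = U_k @ np.diag(s_k) @ Vt_k
print(enc_a)
print(np.round(approx, 1))
```

Each row of `enc_a` is the encoded vector for one value of feature A. Raising `k` makes `approx` closer to `counts` at the cost of longer vectors.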

I'll show you how you can do this for one pair of features using scikit-learn's `TruncatedSVD` class.

In [145]:
from sklearn.decomposition import TruncatedSVD

# Use 3 components in the latent vectors
svd = TruncatedSVD(n_components=3)

First we can use `.groupby` to count up co-occurrences for any pair of features.

In [162]:
train, valid, _ = get_data_splits(data)

# Count co-occurrences; the result is a Series indexed by (country, category)
pair_counts = train.groupby(['country', 'category'])['outcome'].count()
pair_counts.head(10)

country  category
0        0            3
         1            1
         2            3
         3            1
         5            3
         6            1
         7           10
         8           22
         9            1
         10           6
Name: outcome, dtype: int64

Now we have a series with a two-level index. We want to convert this into a matrix with `country` on one axis and `category` on the other. To do this, we can use `.unstack`. By default it'll put `NaN`s where data doesn't exist, but we can tell it to fill those spots with zeros.

In [147]:
pair_matrix = pair_counts.unstack(fill_value=0)
pair_matrix

| country \ category | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 3 | 1 | 0 | 3 | 1 | 10 | 22 | 1 | ... | 4 | 0 | 14 | 1 | 4 | 6 | 0 | 2 | 1 | 0 |
| 1 | 13 | 17 | 94 | 17 | 10 | 34 | 17 | 199 | 218 | 20 | ... | 22 | 4 | 185 | 9 | 69 | 15 | 3 | 13 | 18 | 13 |
| 2 | 2 | 2 | 6 | 2 | 1 | 4 | 1 | 14 | 18 | 1 | ... | 1 | 0 | 15 | 2 | 1 | 4 | 0 | 3 | 2 | 0 |
| 3 | 33 | 34 | 160 | 32 | 16 | 80 | 42 | 275 | 298 | 32 | ... | 68 | 5 | 254 | 52 | 133 | 36 | 9 | 49 | 44 | 24 |
| 4 | 0 | 2 | 19 | 0 | 0 | 4 | 0 | 13 | 34 | 0 | ... | 5 | 0 | 27 | 0 | 2 | 4 | 0 | 2 | 1 | 0 |
| 5 | 17 | 29 | 33 | 6 | 6 | 9 | 7 | 93 | 125 | 9 | ... | 22 | 1 | 109 | 0 | 15 | 11 | 1 | 12 | 8 | 7 |
| 6 | 3 | 1 | 13 | 1 | 0 | 2 | 0 | 17 | 41 | 4 | ... | 4 | 0 | 36 | 1 | 6 | 7 | 1 | 2 | 6 | 1 |
| 7 | 10 | 12 | 39 | 4 | 1 | 8 | 6 | 49 | 85 | 5 | ... | 12 | 0 | 44 | 2 | 10 | 3 | 1 | 7 | 4 | 2 |
| 8 | 14 | 8 | 46 | 3 | 1 | 15 | 0 | 50 | 117 | 13 | ... | 27 | 2 | 107 | 4 | 16 | 6 | 1 | 14 | 4 | 8 |
| 9 | 52 | 70 | 272 | 78 | 29 | 180 | 71 | 568 | 502 | 55 | ... | 76 | 9 | 483 | 25 | 273 | 64 | 12 | 99 | 42 | 68 |


In [161]:
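# Keep the country values as the index so later lookups don't rely on row position
svd_encoding = pd.DataFrame(svd.fit_transform(pair_matrix), index=pair_matrix.index)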
svd_encoding.head(10)

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 59.365276 | -17.716868 | 28.732793 |
| 1 | 803.302921 | -230.297291 | 311.076301 |
| 2 | 56.609274 | -16.332383 | 37.15316 |
| 3 | 1516.634213 | -389.099596 | 498.836358 |
| 4 | 78.918379 | -31.018666 | 42.184847 |
| 5 | 419.671057 | -141.159794 | 240.848483 |
| 6 | 108.802633 | -30.550797 | 45.430678 |
| 7 | 226.403703 | -100.570748 | 150.670188 |
| 8 | 293.403001 | -102.305092 | 190.624686 |
| 9 | 3517.365433 | -583.252591 | 1020.85602 |


This gives us a mapping from the values in the country feature (the index of the dataframe) to our encoded vectors. Next, we need to replace the values in our data with these vectors. We can do this using the `.reindex` method. This method takes the values in the country column and creates a new dataframe from `svd_encoding` using those values as the index. Then we need to set the index back to the original index. Note that I learned the encodings from the training data, but I'm applying them to the whole dataset.
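
The `.reindex` lookup is easier to see on a tiny example. Everything here (`lookup`, `codes`, the column name `svd_0`) is made up for illustration.

```python
import pandas as pd

# Hypothetical lookup table: one row of encodings per country code
lookup = pd.DataFrame({'svd_0': [1.5, 2.5, 3.5]}, index=[0, 1, 2])

# Column of country codes with its own, unrelated row index
codes = pd.Series([2, 0, 2, 5], index=[10, 11, 12, 13], name='country')

# reindex looks up one row per code (repeats allowed, unseen codes give NaN)
looked_up = lookup.reindex(codes)

# Restore the original row index so the result can be joined back to the data
looked_up = looked_up.set_index(codes.index)
print(looked_up)
```

Code 5 never appeared in the lookup table, so its row is filled with `NaN`. The same thing happens for values that only occur outside the training data.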

In [176]:
encoded = svd_encoding.reindex(data['country']).set_index(data.index)
encoded.head(10)

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 3517.365433 | -583.252591 | 1020.85602 |
| 1 | 33124.671369 | 51.45325 | -159.740957 |
| 2 | 33124.671369 | 51.45325 | -159.740957 |
| 3 | 33124.671369 | 51.45325 | -159.740957 |
| 4 | 33124.671369 | 51.45325 | -159.740957 |
| 5 | 33124.671369 | 51.45325 | -159.740957 |
| 6 | 33124.671369 | 51.45325 | -159.740957 |
| 7 | 33124.671369 | 51.45325 | -159.740957 |
| 8 | 33124.671369 | 51.45325 | -159.740957 |
| 9 | 33124.671369 | 51.45325 | -159.740957 |


In [182]:
# Join encoded feature to the dataframe, with info in the column names
data_svd = data.join(encoded.add_prefix("country_category_svd_"))
data_svd.head()

| | goal | hour | day | month | year | outcome | category | currency | country | category_count | currency_count | country_count | country_category_svd_0 | country_category_svd_1 | country_category_svd_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000.0 | 12 | 11 | 8 | 2015 | 0 | 108 | 5 | 9 | 1362 | 33853 | 33393 | 3517.365433 | -583.252591 | 1020.85602 |
| 1 | 30000.0 | 4 | 2 | 9 | 2017 | 0 | 93 | 13 | 22 | 5174 | 293624 | 290887 | 33124.671369 | 51.45325 | -159.740957 |
| 2 | 45000.0 | 0 | 12 | 1 | 2013 | 0 | 93 | 13 | 22 | 5174 | 293624 | 290887 | 33124.671369 | 51.45325 | -159.740957 |
| 3 | 5000.0 | 3 | 17 | 3 | 2012 | 0 | 90 | 13 | 22 | 15647 | 293624 | 290887 | 33124.671369 | 51.45325 | -159.740957 |
| 4 | 19500.0 | 8 | 4 | 7 | 2015 | 0 | 55 | 13 | 22 | 10054 | 293624 | 290887 | 33124.671369 | 51.45325 | -159.740957 |


In [181]:
train, valid, _ = get_data_splits(data_svd)
bst = train_model(train, valid)

Training model!
Validation AUC score: 0.7495


This is the best score yet, 0.7495 compared to the baseline of 0.7467. In practice you'd create these encodings for each pair of categorical variables, likely improving the score even more.
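
One way to build the encodings for every pair is to loop over the column pairs and repeat the steps above. This is a rough sketch, not a tuned implementation: `svd_encode_pairs` is a hypothetical helper, and the toy data, `n_components=2`, and the fixed `seed` are assumptions for illustration.

```python
import itertools

import pandas as pd
from sklearn.decomposition import TruncatedSVD

def svd_encode_pairs(train_df, full_df, cat_cols, n_components=2, seed=0):
    """Hypothetical helper: learn SVD encodings for every ordered pair of
    categorical columns on train_df, then look them up for the rows of full_df."""
    encoded = pd.DataFrame(index=full_df.index)
    for col_a, col_b in itertools.permutations(cat_cols, 2):
        # Co-occurrence counts: rows are values of col_a, columns are values of col_b
        pair_matrix = train_df.groupby([col_a, col_b]).size().unstack(fill_value=0)
        svd = TruncatedSVD(n_components=n_components, random_state=seed)
        factors = pd.DataFrame(svd.fit_transform(pair_matrix.to_numpy()),
                               index=pair_matrix.index)
        # Look up each row's value of col_a; values unseen in training become NaN
        looked_up = factors.reindex(full_df[col_a]).set_index(full_df.index)
        encoded = encoded.join(looked_up.add_prefix(f"{col_a}_{col_b}_svd_"))
    return encoded

# Toy data, invented for illustration
toy = pd.DataFrame({
    'category': ['art', 'art', 'film', 'film', 'games', 'games', 'art', 'film'],
    'currency': ['USD', 'GBP', 'USD', 'EUR', 'USD', 'GBP', 'EUR', 'USD'],
    'country':  ['US', 'GB', 'US', 'DE', 'US', 'GB', 'DE', 'US'],
})

# Learn from the first six rows only, then encode every row
pair_features = svd_encode_pairs(toy[:6], toy, ['category', 'currency', 'country'])
print(pair_features.shape)
```

With three categorical columns there are six ordered pairs, so the number of new features grows quickly. You may want to keep `n_components` small or encode only the pairs that help validation performance.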

Next up, you'll get hands-on practice at encoding categorical features.