### A detailed explanation of an award-winning strategy for limiting feature explosion when encoding categorical variables


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# The dataset is zipped, so we first need to unzip it to extract the CSV file.
# The unzip command extracts the CSV file and stores it in the Kaggle working directory.

!unzip ../input/mercedes-benz-greener-manufacturing/train.csv.zip

In [None]:
# Reading data
data = pd.read_csv('./train.csv',usecols=['X1','X2','X3','X4','X5','X6'])

In [None]:
data.head()

In [None]:
# Checking the number of unique labels in each categorical variable.
for col in data.columns:
    print(col,':', len(data[col].unique()),'labels')

In [None]:
# Simple one hot encoding using get_dummies method. It creates separate columns for each label in a categorical variable 
# and drops one of them to solve the problem of multicollinear data
pd.get_dummies(data, drop_first=True).shape

So we had 6 categorical variables. One hot encoding them created 117 new columns, one for each label (minus the dropped first label) of each variable. Training a model on this many columns can be computationally expensive and cause performance issues.
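The column count follows from the label counts: with `drop_first=True`, each variable contributes one column fewer than its number of distinct labels. Below is a minimal sketch on a made-up toy frame (the column names and values are invented for illustration and are not part of the Mercedes-Benz data):

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "colour": ["red", "blue", "green", "red"],  # 3 distinct labels
    "size": ["S", "M", "S", "L"],               # 3 distinct labels
})

# Expected number of dummy columns: (number of labels - 1), summed over the variables
expected_cols = sum(toy[col].nunique() - 1 for col in toy.columns)

# Actual number of columns produced by one hot encoding
actual_cols = pd.get_dummies(toy, drop_first=True).shape[1]

print(expected_cols, actual_cols)
```

The same sum applied to the label counts printed in the previous cell should match the shape returned by `get_dummies`.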

### What can we do instead?

A very efficient approach was put forward by the winning solution of the KDD Cup 2009: "Winning the KDD Cup Orange Challenge with Ensemble Selection". The authors presented the idea of limiting one hot encoding to the 10 most frequent labels of each variable. This is equivalent to grouping all the other labels under a single new category, which in our case will be dropped for this dataset.

Link to Paper: http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

So in this case each dummy variable indicates whether a particular observation has the corresponding one of the 10 most frequent labels (1) or not (0). An observation whose label is not among the top 10 gets 0 in all ten dummy variables.
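The equivalence with "grouping the other labels and dropping that group" can be checked on a small example. This is only a sketch on an invented toy column; the cut-off `top_k = 2` and the group name `"other"` are assumptions made for illustration, not choices taken from the paper.

```python
import pandas as pd

# Toy column, invented for illustration
s = pd.Series(["a", "a", "a", "b", "b", "c", "d"])

top_k = 2  # hypothetical cut-off for the toy data (the paper used 10)
top_labels = s.value_counts().head(top_k).index.tolist()

# Route 1: one indicator column per frequent label
direct = pd.DataFrame({f"var_{lab}": (s == lab).astype(int) for lab in top_labels})

# Route 2: group the rare labels under "other", one hot encode, then drop "other"
grouped = s.where(s.isin(top_labels), "other")
via_group = pd.get_dummies(grouped, prefix="var").astype(int).drop(columns="var_other")

# Compare the two encodings column by column
same = bool((direct.to_numpy() == via_group[direct.columns].to_numpy()).all())
print(same)
```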



In [None]:
# Inspecting the 20 most frequent categories of the variable X2 (we will keep the top 10)

data.X2.value_counts().sort_values(ascending=False).head(20)

In [None]:
# Making a list of the 10 most frequent labels in the categorical variable X2

top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]
top_10

In [None]:
# We now make the binary variables for the top 10 categories

for label in top_10:
    data[label] = np.where(data["X2"]==label,1,0)
    
data[['X2']+top_10].head(40)

Let's understand how this works.
We iterate over each of the top 10 labels of X2 that we stored in the list.

For each label in top_10, we check where in the X2 column that label occurs. If the X2 value of a row matches the label, the new column gets 1 for that row; otherwise it gets 0. A row that has 0 in all ten new columns therefore has an X2 value that is not among the top 10 labels.
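To make the mechanics concrete, here is a small sketch of the same loop on an invented four-row column. The labels and the `all_zero` helper column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column, invented for illustration
toy = pd.DataFrame({"X": ["as", "ae", "zz", "as"]})
frequent = ["as", "ae"]  # pretend these are the frequent labels

for label in frequent:
    # 1 where the row's value equals this label, 0 elsewhere
    toy[label] = np.where(toy["X"] == label, 1, 0)

# Rows whose label is not in the list end up with 0 in every indicator column
toy["all_zero"] = toy[frequent].sum(axis=1) == 0
print(toy)
```

The third row, whose label `zz` is not in the list, is the only one flagged by `all_zero`.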

In [None]:
# Now we will apply this to all the categorical columns to obtain the final set of dummy variables

def one_hot_top_x(df, variable, top_x_labels):
    # This function creates the dummy variables for the most frequent labels (df is modified in place).
    # We can also vary the number of most frequent labels to encode, e.g. top 10 or top 20.

    for label in top_x_labels:
        # compare against the df passed in, not the global `data`
        df[variable+'_'+label] = np.where(df[variable]==label, 1, 0)

# reading the data again

data = pd.read_csv('./train.csv',usecols=['X1','X2','X3','X4','X5','X6'])

#encoding X2 into the top 10 most frequent categories

one_hot_top_x(data,'X2',top_10)
data.head()
    
    

In [None]:
# Finding the top 10 most frequent categories in column X1
top_10 = [x for x in data.X1.value_counts().sort_values(ascending=False).head(10).index]
top_10

# now creating the 10 most frequent dummy variables for X1
one_hot_top_x(data,'X1',top_10)
data.head()
    

We can continue in the same way and one hot encode all 6 columns, limiting the encoding to the 10 most frequent labels of each column.
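One possible way to do this in a single pass is sketched below. The helper `one_hot_top_k` is a hypothetical variant of the function above (it returns a copy instead of modifying the frame in place), and the toy frame stands in for the real data so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def one_hot_top_k(df, variables, k=10):
    """Return a copy of df with indicators for the k most frequent labels of each variable."""
    out = df.copy()
    for variable in variables:
        # value_counts() sorts by frequency, most frequent first
        top_labels = df[variable].value_counts().head(k).index
        for label in top_labels:
            out[f"{variable}_{label}"] = np.where(df[variable] == label, 1, 0)
    return out

# Toy stand-in for the real data, invented for illustration
toy = pd.DataFrame({
    "X1": ["a", "a", "a", "b", "b", "c"],
    "X2": ["p", "q", "q", "q", "q", "p"],
})

encoded = one_hot_top_k(toy, ["X1", "X2"], k=2)
print(encoded.columns.tolist())
```

On the notebook's data the call would presumably look like `one_hot_top_k(data, ['X1','X2','X3','X4','X5','X6'], k=10)`. The original categorical columns are kept in the result, so they would still need to be dropped before modelling.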

# One hot encoding the Top Variables

## Advantages
- Very straightforward and easy to implement
- Saves hours of time and effort that would otherwise go into data and variable exploration
- Limits expansion of the feature space (the number of columns in the dataset). Plain one hot encoding resulted in 117 columns for the 6 categorical variables. Keeping only the top 10 labels caps this at 60 (top 10 * 6 columns = 60).

## Disadvantages
- Does not add any information that makes the variable more predictive.
- Loses the information carried by the ignored, less frequent labels.


It is not unusual for a categorical variable to have a few dominant categories, while the remaining, less frequent labels are mostly noise that distorts the predictive ability of the variable. This approach is simple and straightforward, and it can be useful on many occasions.

Also, it is worth noting that taking the top 10 labels is completely arbitrary. It could be the top 5 or the top 20, depending on the distribution of labels in a particular variable.
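One way to make the choice less arbitrary is to look at how many rows the top labels actually cover. The sketch below uses an invented toy column, and the 90% `target` is a hypothetical threshold chosen only for illustration; it is not a rule from the paper.

```python
import pandas as pd

# Toy column, invented for illustration
s = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 4 + ["e"] * 1)

# Cumulative share of rows covered by the k most frequent labels
coverage = s.value_counts(normalize=True).cumsum()

target = 0.9  # hypothetical coverage threshold
# Smallest k whose top-k labels cover at least `target` of the rows,
# capped at the number of distinct labels
k = min(int((coverage < target).sum()) + 1, len(coverage))
print(k)
```

Applied to a real column such as `data.X2`, the same `coverage` series shows how quickly the frequent labels account for most of the observations.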
