# **Is there a cat in your dat?**

### Overview

A common task in machine learning pipelines is encoding categorical variables for a given algorithm in a format that allows as much useful signal as possible to be captured.

Because this is such a common task and an important skill to master, we've put together a dataset that contains only categorical features, and includes:

- binary features
- low- and high-cardinality nominal features
- low- and high-cardinality ordinal features
- (potentially) cyclical features <br>

This Playground competition will give you the opportunity to try different encoding schemes for different algorithms to compare how they perform. 

![cat](https://i.kinja-img.com/gawker-media/image/upload/s--rqCW9nxC--/c_scale,f_auto,fl_progressive,q_80,w_800/p4b69sblvgebowkdhnfy.jpg)

### Table of Contents

- [Importing Libraries](#imports)
- [Exploring the Data](#explore_data)
   - [Binary features](#binary_features)
   - [Nominal features](#nominal_features)
   - [Ordinal features](#ordinal_features)
   - [Cyclical features](#cyclical_features)
- [Categorical Feature Encoding](#cat)  
   - [Binary features encoding](#bin_cat)
   - [Nominal features encoding](#nom_cat)
   - [Ordinal features encoding](#ord_cat)
   - [Cyclical features encoding](#cyc_cat)

### Importing Libraries <a class="anchor" id="imports"></a>

In [None]:
# Importing numpy (linear algebra) and pandas (data processing): 
import numpy as np 
import pandas as pd 

# Imports for plotting:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

import os
import matplotlib.ticker as ticker

### Exploring the Data <a class="anchor" id="explore_data"></a>

In this competition, we will be predicting the probability [0, 1] of a binary target column.

The data contains binary features (bin_), nominal features (nom_), ordinal features (ord_) as well as (potentially cyclical) day (of the week) and month features. The string ordinal features ord_{3-5} are lexically ordered according to string.ascii_letters.

Since the purpose of this competition is to explore various encoding strategies, the data has been simplified in that (1) there are no missing values, and (2) the test set does not contain any "unseen" feature values. (Of course, in real-world settings both of these factors are often important to consider!)
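The two simplifications above can be checked directly. The following is a minimal sketch on toy frames: `unseen_values` is a hypothetical helper name and the values are made up. On the real data, `train_df`, `test_df` and the feature column names would take their place.

```python
import pandas as pd

def unseen_values(train, test, columns):
    """Return {column: set of values present in test but absent from train}."""
    report = {}
    for col in columns:
        extra = set(test[col].dropna().unique()) - set(train[col].dropna().unique())
        if extra:
            report[col] = extra
    return report

# Toy frames standing in for train_df / test_df (values are made up):
toy_train = pd.DataFrame({"nom_0": ["Red", "Green", "Blue"], "bin_3": ["T", "F", "T"]})
toy_test = pd.DataFrame({"nom_0": ["Red", "Purple"], "bin_3": ["F", "T"]})

# Columns whose test values never occur in train:
report = unseen_values(toy_train, toy_test, ["nom_0", "bin_3"])

# True if any cell in the training frame is missing:
has_missing = bool(toy_train.isnull().any().any())

print(report)
print(has_missing)
```

An empty report and `has_missing == False` on the competition data would confirm both simplifications.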

In [None]:
# Explore what's in the cat-in-the-dat folder:
print(os.listdir("../input/cat-in-the-dat"))

In [None]:
# Read train, test and sample_submission data:
train_df = pd.read_csv("../input/cat-in-the-dat/train.csv")
test_df = pd.read_csv("../input/cat-in-the-dat/test.csv")
submission = pd.read_csv("../input/cat-in-the-dat/sample_submission.csv")

In [None]:
submission.head()

In [None]:
# Shape of the train and test datasets:
print(train_df.shape, test_df.shape)

In [None]:
# To display first 5 rows of the train_df:
train_df.head()

**Names of all columns**

In [None]:
# Print the names of all columns in train DataFrame:
print(train_df.columns.values)

**Checking for missing data (nan)**

In [None]:
# Are there any missing values in train_df? (True/False per column)
print(train_df.isnull().any())

In [None]:
# Function to describe variables
def desc(df):
    summ = pd.DataFrame(df.dtypes,columns=['Data_Types'])
    summ = summ.reset_index()
    summ['Columns'] = summ['index']
    summ = summ[['Columns','Data_Types']]
    summ['Missing'] = df.isnull().sum().values    
    summ['Uniques'] = df.nunique().values
    return summ

# Function to analyse missing values
def nulls_report(df):
    nulls = df.isnull().sum()
    nulls = nulls[df.isnull().sum()>0].sort_values(ascending=False)
    nulls_report = pd.concat([nulls, nulls / df.shape[0]], axis=1, keys=['Missing_Values','Missing_Ratio'])
    return nulls_report

In [None]:
# Use desc function to describe train data:
desc(train_df)

**Target distribution**

In [None]:
# Bar chart of target class frequency in our train dataset:
total = float(len(train_df))

plt.figure(figsize=(16,4))
ax = sns.countplot(x = 'target', data=train_df,  palette = 'rocket_r')

# Make twin axis
ax2=ax.twinx()
ax2.set_ylabel('Frequency [%]')

for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}'.format(height*100/total),
           # '{0:.0%}'.format(height/total),
            ha="center") 


# Use a LinearLocator to ensure the correct number of ticks
ax.yaxis.set_major_locator(ticker.LinearLocator(11))

# Fix the Frequency [%] range to 0-100
ax2.set_ylim(0,100)
ax.set_ylim(0,300000)

# And use a MultipleLocator to ensure a tick spacing of 10
ax.yaxis.set_major_locator(ticker.MultipleLocator(25000))
ax2.yaxis.set_major_locator(ticker.MultipleLocator(10))

# Turn the grid on ax2 off, otherwise the gridlines will cut through percentages %:
ax.grid(False)
ax2.grid(False)   
    
plt.title('Target Distribution')
plt.show()

In [None]:
print(train_df['target'].value_counts())

In our train_df we have 300,000 rows of data with 208,236 (69.41%) rows with the target of 0 and 91,764 (30.59%) rows with the target of 1. 
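The percentages quoted above follow directly from the class counts. The short sketch below redoes the arithmetic; the negative-to-positive ratio at the end is included only as a commonly used starting point for class-weight style parameters, which is an assumption here and not something this notebook tunes.

```python
# Class counts as reported by train_df['target'].value_counts() above:
counts = {0: 208_236, 1: 91_764}
total_rows = sum(counts.values())

# Share of each class in percent, rounded to two decimals:
shares = {label: round(100 * n / total_rows, 2) for label, n in counts.items()}

# Ratio of negatives to positives (assumption: useful as a first guess
# for a class-weight parameter, to be checked on validation data):
neg_pos_ratio = counts[0] / counts[1]

print(total_rows, shares, round(neg_pos_ratio, 2))
```

The ratio comes out slightly above 2, the same order of magnitude as the `scale_pos_weight=2` used for the classifier later in the notebook.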

#### **Binary (bin_) features** <a class="anchor" id="binary_features"></a>

In [None]:
# Define bin list:
bin = ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']

In [None]:
# Bar charts for binary features, split according to the target:
for i in bin:
    plt.figure(figsize=(16,4))
    ax = sns.countplot(x=i, 
                       hue="target", 
                       palette= 'ocean_r',
                       data=train_df
                       )
    
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}'.format(height*100/total),
                #'{0:.0%}'.format(height/total),
                ha="center") 
       
    # Axis limits, grid and title are set once per figure, not once per bar:
    ax.set_ylim(0,200000)
    ax.grid(False)
    plt.title('Target Distribution')
    plt.show()

Columns bin_3 and bin_4 contain T, F and Y, N respectively, instead of the numerical values 0, 1.

#### **Nominal (nom_) features** <a class="anchor" id="nominal_features"></a>

In [None]:
# Define nom as:
nom = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']

In [None]:
# Bar charts for nominal features, split according to the target:
for i in nom[0:5]:
    plt.figure(figsize=(16,4))
    ax = sns.countplot(x=i, 
                       hue="target", 
                       palette= 'gist_heat_r',
                       data=train_df
                       )
    
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height*100/total),
                #'{0:.0%}'.format(height/total),
                ha="center") 
       
    # Axis limits, grid and title are set once per figure, not once per bar:
    ax.set_ylim(0,100000)
    ax.grid(False)
    plt.title('Target Distribution')
    plt.show()

This is interesting: the target distributions for nom_1, nom_2 and nom_3 look similar. To see this more clearly, round the percentages to the nearest integer and compare the following (each line lists one value from nom_1, nom_2 and nom_3):
- Trapezoid, Lion, Russia (24%, 10%)
- Square, Cat, Canada (11%, 6%)
- Star, Snake, China (11%, 5%)
- Circle, Dog, Finland (9%, 3%)
- Polygon, Axolotl, Costa Rica (8%, 4%)
- Triangle, Hamster, India (6%, 4%)

Let's have a look at the value tables for nom_1, nom_2 and nom_3, just to confirm that Target Distribution is very similar for all three features.

In [None]:
# Create a crosstab with nom_1 and target:
print('Crosstab for numerical target distribution in nom_1:')

pd.crosstab([train_df.target], 
            [train_df.nom_1],
             margins=True).style.background_gradient(cmap='autumn_r')

In [None]:
# Create a crosstab with nom_2 and target:
print('Crosstab for numerical target distribution in nom_2:')

pd.crosstab([train_df.target], 
            [train_df.nom_2],
             margins=True).style.background_gradient(cmap='autumn_r')

In [None]:
# Create a crosstab with nom_3 and target:
print('Crosstab for numerical target distribution in nom_3:')

pd.crosstab([train_df.target], 
            [train_df.nom_3],
             margins=True).style.background_gradient(cmap='autumn_r')

The remaining columns, nom_5 to nom_9, are high-cardinality: they hold between 222 and 11,981 categories. Let's have a look at how many categories each of these columns holds:
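A minimal sketch of such a count is given here. `category_counts` is a hypothetical helper and the toy frame is made up; on the real data one would call it with `train_df` and `nom[5:]`.

```python
import pandas as pd

def category_counts(df, columns):
    """Number of distinct values per column, highest cardinality first."""
    return df[columns].nunique().sort_values(ascending=False)

# Toy frame standing in for train_df (the hex-like strings are made up):
toy = pd.DataFrame({
    "nom_5": ["a1f", "a1f", "b2e", "c3d"],
    "nom_9": ["0f1", "9ab", "77c", "e01"],
})

cardinality = category_counts(toy, ["nom_5", "nom_9"])
print(cardinality)
```

Sorting by cardinality makes it easy to decide which columns are still small enough for one-hot encoding and which need a more compact scheme.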

#### **Ordinal (ord_) features** <a class="anchor" id="ordinal_features"></a>

In [None]:
ord = ['ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5']

In [None]:
# Bar charts for ordinal features, split according to the target:

for i in ord[0:3]:
    plt.figure(figsize=(16,4))
    ax = sns.countplot(x=i, 
                       hue="target", 
                       palette= 'winter_r',
                       data=train_df
                       )
    
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.1f}%'.format(height*100/total),
                #'{0:.0%}'.format(height/total),
                ha="center") 
       
    # Axis limits, grid and title are set once per figure, not once per bar:
    ax.set_ylim(0,150000)
    ax.grid(False)
    plt.title('Target Distribution')
    plt.show()

In [None]:
for i in ord[3:5]:
    plt.figure(figsize=(16,4))
    ax = sns.countplot(x=i, 
                       hue="target", 
                       palette= 'winter_r',
                       data=train_df
                       )
    
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                #'{:1.1f}%'.format(height*100/total),
                '{0:.0%}'.format(height/total),
                ha="center") 
       
    # Axis limits, grid and title are set once per figure, not once per bar:
    ax.set_ylim(0,35000)
    ax.grid(False)
    plt.title('Target Distribution')
    plt.show()

In [None]:
# Number of unique values in ord_5:
print('Number of unique values for ord_5: ' + str(train_df['ord_5'].nunique()))

For ord_5 we have 192 unique values, each consisting of two letters of the alphabet.

#### **Cyclical features** <a class="anchor" id="cyclical_features"></a>

Hours of the day, days of the week and months of the year are all examples of cyclical features. In our DataFrame we have days and months, so let's have a look at the unique values of those features.

In [None]:
print('Unique values of day:',train_df.day.unique())
print('Unique values of month:',train_df.month.unique())

As expected, day takes the values 1-7 and month takes the values 1-12.

In [None]:
cyc = ['day', 'month']


for i in cyc:
    plt.figure(figsize=(16,4))
    ax = sns.countplot(x=i, 
                       hue="target", 
                       palette= 'cool_r',
                       data=train_df
                       )
    
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                #'{:1.1f}%'.format(height*100/total),
                '{0:.0%}'.format(height/total),
                ha="center") 
       
    # Axis limits, grid and title are set once per figure, not once per bar:
    ax.set_ylim(0,60000)
    ax.grid(False)
    plt.title('Target Distribution')
    plt.show()

Interesting! We don't have much data for June and for Saturdays. 

## Categorical Features Encoding <a class="anchor" id="cat"></a>

Before we start working on feature encoding, we will combine train_df and test_df into one DataFrame called tetra_df and set the target column aside. This will allow us to apply every encoding step to both DataFrames at the same time.

In [None]:
# Assign output target to the following variable:
target = train_df['target']

In [None]:
# Merge train and test data into tetra_df and drop target and id column:
tetra_df = pd.concat([train_df, test_df], ignore_index = True, sort = True)
tetra_df = tetra_df.drop(['target', 'id'], axis = 1)

In [None]:
# Check if merge worked (must have 500,000 entries):
tetra_df.shape

In [None]:
# Create indexes to separate data later:
train_df_idx = len(train_df)
test_df_idx = len(tetra_df) - len(test_df)
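The combine, encode, split pattern used in this section can be sketched on toy data as follows. The frames and values are made up; the sketch uses `pd.concat`, which stacks the rows of both frames, and relies on `ignore_index=True` so that the first `len(toy_train)` rows are exactly the training rows.

```python
import pandas as pd

# Toy frames standing in for train_df / test_df (values are made up):
toy_train = pd.DataFrame({"id": [0, 1, 2], "bin_3": ["T", "F", "T"], "target": [0, 1, 0]})
toy_test = pd.DataFrame({"id": [3, 4], "bin_3": ["F", "F"]})

# Stack the rows; the test rows get NaN in 'target', which is dropped right after.
combined = pd.concat([toy_train, toy_test], ignore_index=True, sort=True)
combined = combined.drop(["target", "id"], axis=1)

# Encode once on the combined frame ...
combined["bin_3"] = combined["bin_3"].map({"T": 1, "F": 0})

# ... then split back by position:
split_at = len(toy_train)
train_part = combined.iloc[:split_at]
test_part = combined.iloc[split_at:]

print(train_part.shape, test_part.shape)
```

Because both parts go through the same mapping, the encoded columns are guaranteed to line up between train and test.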

### Binary features encoding

Since bin_3 and bin_4 each contain only two values, we can convert them to binary columns. Let's assume that: <br>
 => T = True and F = False, <br>
 => Y = Yes and N = No <br>
We can then simply replace T with 1 and F with 0 in bin_3, and Y with 1 and N with 0 in bin_4.


In [None]:
# Convert T, F in bin_3 to binary values (0,1):
tetra_df['bin_3'] = tetra_df['bin_3'].map({'T':1, 'F':0})

# Similarly convert Y, N in bin_4 to binary values:
tetra_df['bin_4'] = tetra_df['bin_4'].map({'Y':1, 'N':0})

In [None]:
# Check the outcome:
tetra_df[bin].head()

### Nominal features encoding

In [None]:
# One hot encoding for column : nom_0 to nom_4
tetra_df = pd.get_dummies(tetra_df, columns = nom[0:5],
                        prefix = nom[0:5], 
                        drop_first = True)

In [None]:
# Encoding hex features
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
features_hex = nom[5:]

for col in features_hex:
    labelencoder.fit(tetra_df[col])
    tetra_df[col] = labelencoder.transform(tetra_df[col])

### Ordinal features encoding

In [None]:
#tetra_df[ord].head()

In [None]:
# Convert ord_1 by dictionary mapping as follows
# (in Kaggle's progression system Expert ranks below Master):
tetra_df['ord_1'] = tetra_df['ord_1'].map({
    'Novice': 0,
    'Contributor': 1,
    'Expert': 2,
    'Master' : 3,
    'Grandmaster': 4
})

# Similarly convert ord_2:
tetra_df['ord_2'] = tetra_df['ord_2'].map({
    'Freezing': 0,
    'Cold': 1,
    'Warm': 2,
    'Hot' : 3,
    'Boiling Hot': 4,
    'Lava Hot' : 5
})

In [None]:
# Change type of ord_3 to category, create a dictionary alph that orders letters alphabetically:
tetra_df['ord_3'] = tetra_df['ord_3'].astype('category')
alph = dict(zip(tetra_df['ord_3'],tetra_df['ord_3'].cat.codes))
# Map alph to ord_3 and change type of ord_3 to integer:
tetra_df['ord_3'] = tetra_df['ord_3'].map(alph)
tetra_df['ord_3'] = tetra_df['ord_3'].astype(int)

# Similarly change ord_4:
tetra_df['ord_4'] = tetra_df['ord_4'].astype('category')
alph1 = dict(zip(tetra_df['ord_4'],tetra_df['ord_4'].cat.codes))
tetra_df['ord_4'] = tetra_df['ord_4'].map(alph1)
tetra_df['ord_4'] = tetra_df['ord_4'].astype(int)

In [None]:
# Create sorted list of ord_5 values (sorted() orders by code point, so uppercase letters come before lowercase):
ordli = sorted(list(set(tetra_df['ord_5'].values)))

# Create mapping dictionary alph2 for ord_5
alph2 = dict(zip(ordli, range(len(ordli))))  

# Map alph2 dictionary to ord_5
tetra_df['ord_5'] = tetra_df['ord_5'].map(alph2)
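The data description earlier states that the string ordinal features are ordered according to `string.ascii_letters`, which lists lowercase letters before uppercase ones. Python's `sorted` compares by code point and puts uppercase first, so for mixed-case values the two orders differ. The following is a minimal sketch of a sort key that follows `string.ascii_letters`; `ascii_letters_key` is a hypothetical helper and the sample values are made up. Whether this ordering helps the model is not tested here.

```python
import string

# Position of each letter in string.ascii_letters: 'a' -> 0, ..., 'z' -> 25, 'A' -> 26, ...
LETTER_RANK = {letter: rank for rank, letter in enumerate(string.ascii_letters)}

def ascii_letters_key(value):
    """Sort key that compares strings letter by letter in ascii_letters order."""
    return [LETTER_RANK[ch] for ch in value]

sample = ["Zx", "aB", "ab", "Az"]  # made-up two-letter values

# Default ordering (by code point) versus ordering by string.ascii_letters:
code_point_order = sorted(sample)
ascii_letters_order = sorted(sample, key=ascii_letters_key)

# Mapping from value to ordinal code, analogous to alph2 above:
mapping = {value: code for code, value in enumerate(ascii_letters_order)}

print(code_point_order)
print(ascii_letters_order)
```

On the real column, `sorted(set(tetra_df['ord_5']), key=ascii_letters_key)` would take the place of the toy list.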

### Cyclical features encoding

One of the methods for cyclical features encoding is to perform sine and cosine transformation of the feature by using the following formulas:

$$x_{\sin} = \sin\left(\frac{2 \pi x}{\max(x)}\right)$$

$$x_{\cos} = \cos\left(\frac{2 \pi x}{\max(x)}\right)$$

Since both trigonometric functions are periodic, it's not a good idea to use only one of them for encoding. The reason is simple: two different feature values can be encoded as the same number. <br>
By using both the sin and the cos function we avoid this and assign each value a unique position on a [unit circle](http://mathworld.wolfram.com/UnitCircle.html).
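The ambiguity of a single trigonometric function can be illustrated with months on a 12-step cycle. The sketch below uses only the standard library; `encode_cyclical` is a hypothetical helper that applies the two formulas above with the period passed in explicitly.

```python
import math

def encode_cyclical(value, period):
    """Map a value on a cycle of length `period` to (sin, cos) on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

jan, feb, may, dec = (encode_cyclical(m, 12) for m in (1, 2, 5, 12))

# Sine alone cannot tell January from May ...
same_sine = math.isclose(jan[0], may[0])
# ... but the cosine component separates them:
different_cosine = not math.isclose(jan[1], may[1])

# On the circle, December is as close to January as January is to February:
dist_dec_jan = math.dist(dec, jan)
dist_jan_feb = math.dist(jan, feb)

# All twelve months receive distinct (sin, cos) pairs:
pairs = {tuple(round(c, 9) for c in encode_cyclical(m, 12)) for m in range(1, 13)}

print(same_sine, different_cosine, len(pairs))
```

The distance check shows the second benefit of this encoding: the end of the cycle sits next to its beginning, which a plain integer encoding (12 versus 1) does not express.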

In [None]:
# Cyclical encoding for day:
tetra_df['day_sin'] = np.sin(2 * np.pi * tetra_df['day']/7.0)
tetra_df['day_cos'] = np.cos(2 * np.pi * tetra_df['day']/7.0)

# Cyclical encoding for month:
tetra_df['month_sin'] = np.sin(2 * np.pi * tetra_df['month']/12.0)
tetra_df['month_cos'] = np.cos(2 * np.pi * tetra_df['month']/12.0)

Both sin and cos values will be in the range between -1 and 1.

In [None]:
# Show that the encoded values now lie on a circle with radius 1 centred at the origin [0,0]:
tetra_df.sample(5000).plot.scatter('day_sin','day_cos').set_aspect('equal')
tetra_df.sample(5000).plot.scatter('month_sin','month_cos').set_aspect('equal')

In [None]:
tetra_df = tetra_df.drop(['day', 'month'], axis = 1)

In [None]:
# Print the names of all columns in tetra_df DataFrame:
print(tetra_df.columns.values)

### Normalize data columns

In [None]:
#from sklearn.preprocessing import MinMaxScaler
#min_max_scaler = MinMaxScaler()

# x returns a numpy array
#x = tetra_df.values 


#x_scaled = min_max_scaler.fit_transform(x)
#tetra_df = pd.DataFrame(x_scaled)

In [None]:
#tetra_df.describe()

### Training the Model

In [None]:
# Creating training and testing data:
training = tetra_df[ : train_df_idx]
testing = tetra_df[test_df_idx :]

In [None]:
# For splitting data we will be using train_test_split from sklearn:
from sklearn.model_selection import train_test_split

In [None]:
X = training
y = target

In [None]:
# Splitting the training data into test and train, we are testing on 0.20 = 20% of dataset:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=13)

**USE XGBoost CLASSIFIER**

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_validate, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler

In [None]:
xgb = XGBClassifier(objective= 'binary:logistic'
                    , learning_rate=0.7
                    , max_depth=3
                    , n_estimators=250
                    , scale_pos_weight=2
                    , random_state=42
                    , colsample_bytree=0.5
                    )
    
xgb.fit(X_train, y_train)   

In [None]:
y_predict = xgb.predict(X_test)
print(classification_report(y_test,y_predict))
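The task statement earlier asks for a probability, while `xgb.predict` returns hard 0/1 labels. If predictions are scored with a ranking metric such as ROC AUC (an assumption here, since the metric is not stated in this notebook), hard labels discard the ordering information that probabilities carry. The sketch below shows the effect with made-up scores and a hypothetical pure-Python helper, `pairwise_auc`; in practice `sklearn.metrics.roc_auc_score` applied to `xgb.predict_proba(X_test)[:, 1]` would play the same role.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1, 1, 0]
probabilities = [0.2, 0.6, 0.55, 0.9, 0.4, 0.1]        # made-up model outputs
hard_labels = [int(p >= 0.5) for p in probabilities]   # what a 0.5 threshold would return

auc_from_probabilities = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)

print(round(auc_from_probabilities, 3), round(auc_from_labels, 3))
```

In this toy case thresholding turns several correctly ordered pairs into ties, so the score computed from labels is lower than the one computed from probabilities.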

In [None]:
# Confusion matrix cm:
cm = confusion_matrix(y_test,y_predict)
cm

In [None]:
# Quick overview of our confusion matrix:
sns.heatmap(cm, annot = True, square = True, fmt='g')

In [None]:
# The task asks for a probability, so take the predicted probability of class 1:
prediction = xgb.predict_proba(testing)[:, 1]

In [None]:
# Combine id and target into one DataFrame:
final_result = pd.DataFrame({'target': prediction, 'id': submission.id})
final_result = final_result[['id', 'target']]

# Save final_result as cat_output.csv:
final_result.to_csv('cat_output.csv', index = False)