# 🎆 Encode like there's no tomorrow

**This kernel combines a variety of encoding techniques, pipelines, and models with the aim of achieving a better score.**
I have exported the encoded dataset built in this notebook. Feel free to use it in your pipelines, and don't forget to credit this notebook. 😉
I hope this notebook is as fun to read as it was for me to write.

# 1. Data
I will be using a custom dataset that I created. It contains the contest data after cleaning, preprocessing, and encoding. The steps used to build it are as follows:


In [None]:
import pandas as pd
import numpy as np
test = pd.read_csv("../input/cat-in-the-dat-ii/test.csv")
train = pd.read_csv("../input/cat-in-the-dat-ii/train.csv")
test_encoded = pd.read_csv("../input/catindat2encoded/test_encoded.csv")
train_encoded = pd.read_csv("../input/catindat2encoded/train_encoded.csv")
test_id = pd.read_csv("../input/cat-in-the-dat-ii/sample_submission.csv")['id']

* Missing values were filled with the mode of their respective columns

In [None]:
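target = train['target']
test.drop(['id'], axis=1, inplace=True)
train.drop(['id', 'target'], axis=1, inplace=True)
# Stack train and test so every column is encoded consistently
data = pd.concat([train, test], ignore_index=True)
# Fill gaps with each column's most frequent value
for i in data.columns:
    data[i] = data[i].fillna(data[i].mode()[0])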
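
To see what the mode fill does, here is a minimal sketch on a made-up two-column frame (the values are invented for illustration, not taken from the contest data). `mode()` returns a Series because ties are possible, so `[0]` picks the first of the most frequent values.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in a numeric column and one in a string column.
toy = pd.DataFrame({
    "bin_0": [0.0, 1.0, np.nan, 0.0],
    "nom_0": ["Red", None, "Blue", "Red"],
})

filled = toy.copy()
for col in filled.columns:
    # mode() ignores missing values; [0] takes the first most frequent value.
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(filled)
```

The gap in `bin_0` becomes `0.0` and the gap in `nom_0` becomes `"Red"`, the most common value in each column.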

* Binary data was converted to boolean values using the function below

In [None]:
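def bin_encoder(value):
    # Map the 0/1, N/Y and F/T spellings onto booleans
    if value == 0 or value == 'N' or value == 'F':
        return False
    elif value == 1 or value == 'Y' or value == 'T':
        return True
    # Any other value falls through and returns None
# Named bin_cols rather than bin to avoid shadowing the built-in bin()
bin_cols = ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']
for i in bin_cols:
    data[i] = data[i].apply(bin_encoder)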
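
The same conversion can also be written as a dictionary lookup. This is only a sketch: `BIN_MAP` and the toy frame are hypothetical names and values of mine. One difference worth knowing is that `Series.map` leaves any spelling missing from the dictionary as NaN, which makes unexpected values easy to spot.

```python
import pandas as pd

# Hypothetical lookup covering the binary spellings handled above.
BIN_MAP = {0: False, 1: True, "N": False, "Y": True, "F": False, "T": True}

toy = pd.DataFrame({
    "bin_0": [0, 1, 1],
    "bin_3": ["F", "T", "F"],
    "bin_4": ["Y", "N", "Y"],
})

# Apply the lookup column by column.
encoded = toy.apply(lambda col: col.map(BIN_MAP))
print(encoded)
```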

* nom_0 to nom_4 were Label Encoded


In [None]:
from sklearn.preprocessing import LabelEncoder
for i in range(0,5):
    l_encoder = LabelEncoder()
    key = "nom_" + str(i)
    data[key] = l_encoder.fit_transform(data[key].fillna("NULL").astype(str).values) 
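
A quick sketch of what `LabelEncoder` produces, using made-up colour values. The point to notice is that `classes_` is sorted, so the integer codes follow alphabetical order rather than any meaningful ranking.

```python
from sklearn.preprocessing import LabelEncoder

colours = ["Red", "Blue", "Green", "Blue"]  # invented nom_0-style values
encoder = LabelEncoder()
codes = encoder.fit_transform(colours)

print(list(encoder.classes_))  # ['Blue', 'Green', 'Red']
print(codes.tolist())          # [2, 0, 1, 0]

# The mapping is reversible, which helps when inspecting encoded data.
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```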

* nom_5 to nom_9 were target encoded due to their high cardinality

In [None]:
from category_encoders import TargetEncoder
for i in range(5, 10):
    target_encoder = TargetEncoder()
    key = "nom_" + str(i)
    # Fit on the training rows only, then apply the learned means to the test rows
    train_te = target_encoder.fit_transform(train[key].fillna("NULL").astype(str).values, target).values
    test_te = target_encoder.transform(test[key].fillna("NULL").astype(str).values).values
    # Flatten the single-column outputs before assigning them to one column
    data[key] = np.concatenate((train_te, test_te), axis=0).ravel()
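
Fitting the encoder on the full training column lets each training row see its own target value. One common way to limit that leakage is out-of-fold encoding. The sketch below is a plain mean encoding without the smoothing that `TargetEncoder` applies. `oof_target_encode` and the toy data are hypothetical and not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(train_col, target, test_col, n_splits=5, seed=0):
    """Mean-target encode one column; each train row only sees other folds."""
    train_col = train_col.reset_index(drop=True)
    target = target.reset_index(drop=True)
    prior = target.mean()  # fallback for categories unseen in the fitting folds
    encoded = pd.Series(np.nan, index=train_col.index, dtype=float)

    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(train_col):
        # Category means computed without the rows being encoded.
        means = target.iloc[fit_idx].groupby(train_col.iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = (
            train_col.iloc[enc_idx].map(means).fillna(prior).to_numpy()
        )

    # Test rows can use statistics from the whole training set.
    full_means = target.groupby(train_col).mean()
    test_encoded = test_col.map(full_means).fillna(prior)
    return encoded, test_encoded


train_col = pd.Series(list("aabbabab"))
target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1])
test_col = pd.Series(["a", "b", "c"])

enc, test_enc = oof_target_encode(train_col, target, test_col, n_splits=4)
print(enc.round(3).tolist())
print(test_enc.tolist())
```

The unseen test category `"c"` falls back to the overall target mean.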

* Ordinal variables ord_1 and ord_2 were mapped using dicts to values matching their natural order (ord_0 is already numeric and is left unchanged)

In [None]:
# Category order for ord_0, ord_1 and ord_2, lowest to highest
ordinal = [
    [1.0, 2.0, 3.0],
    ['Novice', 'Contributor', 'Expert', 'Master', 'Grandmaster'],
    ['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']
]

# Only ord_1 and ord_2 need mapping; ord_0 is already numeric
for i in range(1, 3):
    ordinal_dict = {label: rank for rank, label in enumerate(ordinal[i])}
    key = "ord_" + str(i)
    data[key] = data[key].map(ordinal_dict)
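
A small sketch of the dict mapping on made-up values. Note that `map` returns NaN for any label missing from the dict, so a misspelled category shows up as a gap instead of raising an error.

```python
import pandas as pd

levels = ['Novice', 'Contributor', 'Expert', 'Master', 'Grandmaster']
rank = {label: position for position, label in enumerate(levels)}

# The last value is a deliberate typo to show the NaN behaviour.
ord_1 = pd.Series(['Expert', 'Novice', 'Grandmaster', 'expert'])
mapped = ord_1.map(rank)

print(mapped.tolist())  # [2.0, 0.0, 4.0, nan]
```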

* The alphabetical ordinal variables (ord_3 to ord_5) were encoded using OrdinalEncoder

In [None]:
from sklearn.preprocessing import OrdinalEncoder
oencoder = OrdinalEncoder(dtype=np.int16)
for enc in ["ord_3", "ord_4", "ord_5"]:
    # OrdinalEncoder expects a 2-D input; flatten the result back to one column
    data[enc] = oencoder.fit_transform(np.array(data[enc]).reshape(-1, 1)).ravel()

**Feature generation from cyclic variables**
* After reading a few notebooks, I decided to generate a few more features that capture the cyclic nature of day and month

In [None]:
data['month_sin'] = np.sin((data['month'] - 1) * (2.0 * np.pi / 12))
data['month_cos'] = np.cos((data['month'] - 1) * (2.0 * np.pi / 12))

data['day_sin'] = np.sin((data['day'] - 1) * (2.0 * np.pi / 7))
data['day_cos'] = np.cos((data['day'] - 1) * (2.0 * np.pi / 7))
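
Here is a quick check of why the sine and cosine pair helps. As raw integers, December (12) and January (1) are 11 apart, but on the circle they are as close as January and February. `cyclic_encode` is a hypothetical helper of mine that applies the same formula as above.

```python
import numpy as np


def cyclic_encode(values, period):
    """Return (sin, cos) pairs for 1-based cyclic values such as month or day."""
    angle = (np.asarray(values, dtype=float) - 1) * (2.0 * np.pi / period)
    return np.column_stack([np.sin(angle), np.cos(angle)])


months = cyclic_encode([1, 2, 12], period=12)
jan, feb, dec = months

# Euclidean distance between the encoded points.
dist_jan_dec = np.linalg.norm(jan - dec)
dist_jan_feb = np.linalg.norm(jan - feb)
print(dist_jan_dec, dist_jan_feb)
```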

* Scaling the data so that every column lies in the range 0 to 1

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# fit_transform returns a NumPy array, so pass the column names back in
data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
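
With its default settings, `MinMaxScaler` rescales each column as (x - min) / (max - min). A minimal sketch on a made-up frame confirms this against the manual formula.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented values standing in for two encoded columns.
toy = pd.DataFrame({"ord_1": [0, 2, 4], "month_sin": [-1.0, 0.0, 1.0]})

scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

# The same rescaling written out by hand, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
print(np.allclose(scaled, manual))
```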

* Exporting our newly made dataset to csv

In [None]:
n_train = len(target)  # 600000 rows in the training set
train = pd.concat([data[:n_train], target], axis=1)
train.to_csv("train_encoded.csv", index=False)
test = data[n_train:]
test.to_csv("test_encoded.csv", index=False)

**Borrowing a bit of code from my [previous notebook](https://www.kaggle.com/amoghjrules/intro-to-stacking-averaging-base-models)**

In [None]:
from tqdm import tqdm
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
class average_stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
    def fit(self, x, y):
        # Clone the base models so the originals are left unfitted
        self.model_clones = [clone(model) for model in self.models]

        for model in tqdm(self.model_clones):
            model.fit(x, y)
        return self
    def predict(self, x):
        # One column of predictions per model; classifiers contribute hard 0/1
        # labels here, while regressors contribute continuous values
        preds = np.column_stack([
            model.predict(x) for model in tqdm(self.model_clones)
        ])
        # Average across models for each row
        return np.mean(preds, axis=1)
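
The averaging idea can be tried out on synthetic data before committing to a long training run. This is only a sketch under my own assumptions: it uses two small scikit-learn classifiers rather than the models defined in this notebook, averages positive-class probabilities instead of `predict` outputs, and scores with ROC AUC on a hold-out split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem standing in for the real data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = [
    LogisticRegression(max_iter=1000),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
]

# One column of positive-class probabilities per model.
probas = np.column_stack(
    [m.fit(X_fit, y_fit).predict_proba(X_val)[:, 1] for m in models]
)
avg_proba = probas.mean(axis=1)

for m, p in zip(models, probas.T):
    print(type(m).__name__, roc_auc_score(y_val, p))
avg_auc = roc_auc_score(y_val, avg_proba)
print("average", avg_auc)
```

Comparing each model's score with the averaged score shows whether the blend is actually helping.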

In [None]:
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostClassifier
from sklearn import linear_model
import xgboost as xgb
import lightgbm as lgb

In [None]:
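# penalty='none' is the spelling used by older scikit-learn releases; newer releases expect penalty=None
glm = linear_model.LogisticRegression(random_state=1, solver='lbfgs', max_iter=2020, fit_intercept=True, penalty='none', verbose=0)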
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=920,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
GBoost = GradientBoostingRegressor(n_estimators=1500, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   verbose = True,
                                   loss='huber', random_state =5)
model_ada = AdaBoostClassifier(n_estimators= 2200, learning_rate= 0.75)

In [None]:
averaged_models = average_stacking(models = (model_lgb, glm, GBoost, model_ada))
train_y = train_encoded['target']
train_encoded.drop(['target'], axis =1 , inplace = True)
# averaged_models.fit(train_encoded, train_y)
# avg_pred = averaged_models.predict(test_encoded)

In [None]:
# pd.DataFrame({'id': test_id, 'target': avg_pred}).to_csv('submission.csv', index=False)

### <span style="color:red">Upvote and share if this notebook helped you in any way</span> 😁