# Introduction
Hi, this is my second notebook, so feel free to comment on my mistakes.
In this notebook, I will show everything I did in this competition and the reasons behind it.
This may not be the best notebook, but I think someone will learn something new from my work.
So, LET'S GO.


# First look at our data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
sns.set(style='white', context='notebook', palette='deep', rc = {'figure.figsize':(15,8)})
import matplotlib.pyplot as plt
%matplotlib inline

Maybe-useful knowledge: the dataset is large, so if you want to speed up loading, you can save it once in Feather format and read it back with pandas.

In [None]:
url = "../input/tabular-playground-series-may-2022/"
train = pd.read_csv(url + "train.csv")
train = train.drop('id', axis=1)
test  = pd.read_csv(url + "test.csv")
target = train['target']

In [None]:
train

First look: 
1. There are 900,000 data points and 31 features
2. The label is binary
3. All features are anonymized (f_00 to f_30), so we don't have any domain knowledge about this problem

In [None]:
null_count = train.isnull().sum()
null_count[null_count > 0]

No null values in our train dataset

In [None]:
null_count = test.isnull().sum()
null_count[null_count > 0]

The test set has no missing values either

In [None]:
train.info()

Looking at info(), I can see there is only one object feature (f_27), while the others are numerical

Let's see if any of our numerical features are actually categorical

In [None]:
unique_value = train.select_dtypes(include='number').nunique().sort_values()
unique_value.plot.bar(logy=True, title='Unique value per feature', figsize=(20,10))

From the chart above, we can see that the features on the left have very few unique values, so they are actually categorical

Then, let's classify the features into continuous and categorical (easily done using unique_value above)

In [None]:
continuous  = unique_value[unique_value > 20].index.to_list()
continuous.sort()
categorical = unique_value[unique_value <= 20].index.to_list()
categorical.sort()
categorical.remove('target')

# Target

In [None]:
sns.countplot(x=train['target'])  # pass the column as x; recent seaborn versions expect keyword arguments

So the numbers of 0 and 1 labels are almost the same
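To put a number on the balance instead of reading it off the plot, the class shares can be computed directly. This is a small sketch on a made-up target column (toy_target is not the competition data).

```python
import pandas as pd

# Made-up labels standing in for train['target']
toy_target = pd.Series([0, 1, 1, 0, 1, 0, 0, 1, 0, 1], name="target")

# Share of each class instead of raw counts
class_share = toy_target.value_counts(normalize=True).sort_index()
print(class_share)
```

In this notebook the same call would be train['target'].value_counts(normalize=True).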

# Continuous feature

Distribution

In [None]:
fig, axes = plt.subplots(4,4, figsize=(30,30))
for col, ax in zip(continuous, axes.ravel()):
    sns.kdeplot(train[col], ax=ax)

Looks like our continuous features are almost normally distributed

In [None]:
plt.figure(figsize=(12, 12))
sns.heatmap(train[continuous + ['target']].corr(), center=0, annot=True, fmt='.2f')
plt.show()

From the heatmap, we know
1. The highest correlation with target is 0.13, so most of our features don't have a linear relationship with target
2. f_00 to f_06 are slightly linearly correlated with f_28
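The target column of the heatmap can also be read as a sorted list, which is easier to scan when there are many features. A sketch on synthetic data follows; the columns f_a and f_b and the way the target is generated are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({"f_a": rng.normal(size=n), "f_b": rng.normal(size=n)})
# Synthetic target: depends weakly on f_a and not at all on f_b
toy["target"] = (0.5 * toy["f_a"] + rng.normal(size=n) > 0).astype(int)

# Absolute linear correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
print(corr_with_target)
```

A low value here only rules out a linear relationship; a tree model can still pick up non-linear effects.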


# Categorical feature

In [None]:
fig, axes = plt.subplots(4,4, figsize=(30,30))
fig.delaxes(axes[3][3])
fig.delaxes(axes[3][2])
for col, ax in zip(categorical, axes.ravel()):
    sns.countplot(x=train[col], ax=ax)

1. Categorical features f_07 to f_18 look very similar, and the higher categories are rare compared to the lower ones.
2. f_07 to f_18 could be ordinal.
3. Categorical feature f_30 is almost uniform.

In [None]:
fig, axes = plt.subplots(4,4, figsize=(30,30))
fig.delaxes(axes[3][3])
fig.delaxes(axes[3][2])
for col, ax in zip(categorical, axes.ravel()):
    sns.countplot(x=train[col], hue=train['target'], ax=ax)

The numbers of 0 and 1 labels differ within the categories of every categorical feature, which suggests all of them are informative.
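The same comparison can be made numerically by computing the share of target = 1 inside each category. This is a sketch on a made-up column f_cat, not one of the real features.

```python
import pandas as pd

toy = pd.DataFrame({
    "f_cat":  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2],
    "target": [0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
})

# Mean of a 0/1 target is the positive rate; count shows how much data backs it
rate = toy.groupby("f_cat")["target"].agg(["mean", "count"])
print(rate)
```

Categories whose rate is far from the overall rate carry signal, but rates based on a small count should be read with care.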

# Object feature


In [None]:
train['f_27'].apply(lambda x: len(x)).unique()

So every f_27 value is a string of 10 uppercase characters

In [None]:
counts = train['f_27'].value_counts()
print(counts)
print(len(counts[counts > 1]))

We have a lot of repeated strings, and BBBBBBCJBC is the most common.

In [None]:
for i in range(10):
    print("Position {}: {} unique character".format(i,train['f_27'].apply(lambda x: x[i]).nunique()))

Each position has a small cardinality, so instead of encoding f_27 as a whole, we should encode each position in f_27
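The per-position count can also be computed with the pandas string accessor, which avoids a Python lambda for every row. This is a sketch on made-up 4-character strings (the real f_27 strings have 10 characters).

```python
import pandas as pd

toy = pd.Series(["ABAB", "ACAB", "ABBB", "ADAA"], name="f_27")

# .str[i] picks the i-th character of every string in one vectorized call
per_position = {i: toy.str[i].nunique() for i in range(4)}
print(per_position)
```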

# F_27 feature


In [None]:
train = pd.read_csv(url + "train.csv")
# train = train.sample(100000, random_state=2)  # used during feature engineering and hyperparameter tuning
train = train.drop('id', axis=1)
test  = pd.read_csv(url + "test.csv")
id_test = test['id']
test  = test.drop('id', axis=1)
target = train['target']

In [None]:
unique_value = train.select_dtypes(include='number').nunique().sort_values()
continuous  = unique_value[unique_value > 20].index.to_list()
continuous.sort()
categorical = unique_value[unique_value <= 20].index.to_list()
categorical.sort()
categorical.remove('target')

Next, we will split f_27 into 10 features, one per character position, and then encode them

In [None]:
test['target'] = 0
train_len = len(train)
traintest = pd.concat([train, test])
del train,test
f_27_cols = []
for i in range(10):
    new_col = "f_27_{}".format(i)
    f_27_cols.append(new_col)
    traintest[new_col] = traintest['f_27'].apply(lambda x: x[i])
traintest["unique_characters"] = traintest['f_27'].apply(lambda x: len(set(x)))
traintest = traintest.drop('f_27',axis=1)

In [None]:
from sklearn.preprocessing import OrdinalEncoder
OE = OrdinalEncoder(categories='auto')
OE.fit(traintest[f_27_cols])
print(OE.categories_)

I used an ordinal encoder instead of a label encoder because I believe the alphabet is likely to be ordinal
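One caveat: with categories='auto', OrdinalEncoder numbers the sorted values it sees in each column consecutively, so a letter that never occurs in a position shifts the codes of the letters after it. If the distance between letters matters, a possible alternative is to encode by alphabet offset. This is a sketch assuming only uppercase A to Z; alphabet_code is a hypothetical helper, and pd.factorize is used here only to imitate consecutive numbering.

```python
import pandas as pd


def alphabet_code(chars: pd.Series) -> pd.Series:
    """Map 'A' -> 0, 'B' -> 1, ... so that gaps between letters are preserved."""
    return chars.map(lambda c: ord(c) - ord("A"))


toy = pd.Series(["A", "C", "D", "A"])  # note: 'B' never occurs

fixed = alphabet_code(toy)
# Consecutive numbering of the sorted observed values, for comparison
consecutive = pd.Series(pd.factorize(toy, sort=True)[0])

print(fixed.tolist())
print(consecutive.tolist())
```

For a tree model both encodings keep the same order, so the difference mostly matters for features built by arithmetic on the codes, such as sums.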

In [None]:
traintest[f_27_cols] = OE.transform(traintest[f_27_cols])
traintest['f_27_sum'] = 0
for col in f_27_cols:
    traintest['f_27_sum'] = traintest['f_27_sum'] + traintest[col]

In [None]:
fig, axes = plt.subplots(2,5, figsize=(50,30))
for col, ax in zip(f_27_cols, axes.ravel()):
    sns.countplot(x=traintest[:train_len][col], hue=traintest[:train_len]['target'], ax=ax)

The difference between 0 and 1 in some of these features is quite large, so f_27 can be very important

# Feature engineering

I'm just a newbie in feature engineering, so I only know a few techniques, but I will try to apply all of them in this notebook.
1. The first, and maybe the most important, is feature interaction, where I apply +, -, * and / between pairs of features. I chose those pairs by trying 100 random combinations and keeping only the top features. That took really long (5 hours running on Kaggle), but as a student, I ran it on Kaggle and went to school 😂.
2. Next, I fitted a base lgbm classifier and plotted the feature importance. f_27_5 looked useless, but maybe it is only useless on its own, so I created its interactions with all other features and kept the two combinations with f_27_7 and f_27_8.
3. Lastly, I tried polynomial features, which are simple but powerful. Again, I created all degree-2 polynomial features on the continuous data and kept the top 3.

If I remember correctly, the base lgbm without any feature engineering, trained on 100,000 data points, scored 0.9818, and with feature engineering it scored 0.9821. A reasonable increase, right!!
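The "generate many candidates, keep the top few" idea can be sketched as follows. To keep it fast, this sketch ranks each candidate by its single-feature AUC instead of retraining LightGBM, which is a cheaper stand-in and not the procedure used in this notebook. The data, the column names a to d and the helper single_feature_auc are all made up for illustration.

```python
import itertools

import numpy as np
import pandas as pd


def single_feature_auc(y: pd.Series, score: pd.Series) -> float:
    """AUC of one feature used directly as a score (Mann-Whitney rank formula)."""
    ranks = score.rank()  # average ranks, so ties are handled
    n_pos = int((y == 1).sum())
    n_neg = len(y) - n_pos
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


rng = np.random.default_rng(0)
n = 5000
toy = pd.DataFrame(rng.normal(size=(n, 4)), columns=["a", "b", "c", "d"])
# Synthetic target driven by the product a*b, which no single column reveals
y = pd.Series((toy["a"] * toy["b"] + 0.3 * rng.normal(size=n) > 0).astype(int))

results = {}
for f1, f2 in itertools.combinations(toy.columns, 2):
    candidates = {
        f"{f1}+{f2}": toy[f1] + toy[f2],
        f"{f1}-{f2}": toy[f1] - toy[f2],
        f"{f1}*{f2}": toy[f1] * toy[f2],
    }
    for name, values in candidates.items():
        auc = single_feature_auc(y, values)
        results[name] = max(auc, 1 - auc)  # the direction of the feature does not matter

# Keep only the strongest candidates
top = pd.Series(results).sort_values(ascending=False).head(3)
print(top)
```

Scoring by a full model, as done for this notebook, is slower but also catches features that only help in combination with others.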

In [None]:
def create_interaction_feature(df, fea1, fea2):
    # Add the four arithmetic interactions of fea1 and fea2 as new columns (in place)
    df['{}+{}'.format(fea1, fea2)] = df[fea1] + df[fea2]
    df['{}-{}'.format(fea1, fea2)] = df[fea1] - df[fea2]
    df['{}/{}'.format(fea1, fea2)] = df[fea1] / df[fea2]
    df['{}*{}'.format(fea1, fea2)] = df[fea1] * df[fea2]
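One thing to watch in the division feature: ordinal-encoded columns contain 0, so the ratio can become inf (x/0) or NaN (0/0). A sketch of one way to handle it is below, where safe_ratio is a hypothetical helper. Turning inf into NaN is an assumption made for this example; it relies on LightGBM treating NaN as a missing value by default.

```python
import numpy as np
import pandas as pd


def safe_ratio(num: pd.Series, den: pd.Series) -> pd.Series:
    """Element-wise num / den with +/-inf replaced by NaN."""
    ratio = num / den
    return ratio.replace([np.inf, -np.inf], np.nan)


toy = pd.DataFrame({"x": [1.0, 2.0, 0.0, -3.0], "y": [2.0, 0.0, 0.0, 0.0]})
toy["x/y"] = safe_ratio(toy["x"], toy["y"])
print(toy)
```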

In [None]:
import itertools
fea1 = 'f_27_5'
other = ['f_27_7', 'f_27_8']
for fea2 in other:
    create_interaction_feature(traintest, fea1, fea2)

top_per_feature = ['f_26','f_00']
for fea1, fea2 in itertools.permutations(top_per_feature, 2):
    create_interaction_feature(traintest, fea1, fea2)

for col in ['f_21', 'f_22', 'f_26']:
    traintest['{}^2'.format(col)] = traintest[col]**2 


In [None]:
traintest = pd.get_dummies(traintest, columns=['f_29','f_30','f_27_0','f_27_2','f_27_5'])

In [None]:
traintest = traintest.drop('target',axis=1)
train = traintest[:train_len]
test  = traintest[train_len:]
del traintest

# Model

In [None]:
import lightgbm as lgbm
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score, confusion_matrix

Hyperparameter tuning was done with Optuna on 100,000 data points, based on this blog post: [lgbm tuning guide](https://neptune.ai/blog/lightgbm-parameters-guide). Tuned lgbm roc_auc: 0.9361097364777405


In [None]:
score_list = []
kf = KFold(n_splits=3, shuffle=True, random_state=2)
fold = 1
params = {
     'n_estimators': 8000, 
     'lambda_l1': 0.41952180928025645, 
     'bagging_fraction': 0.965448697013478,
     'bagging_freq': 1,
     'num_leaves': 60, 
     'max_depth': 10, 
     'max_bin': 786, 
     'learning_rate': 0.023740024697292472, 
     'feature_fraction': 0.7754066689188489, 
     'min_data_in_leaf': 12,
     'objective' : 'binary',
     'metric' : 'auc',
     'is_unbalance': True
     }
for idx_tr, idx_va in kf.split(train):
     X_tr = train.iloc[idx_tr]
     X_va = train.iloc[idx_va]
     y_tr = target.iloc[idx_tr]
     y_va = target.iloc[idx_va]
     model = lgbm.LGBMClassifier(**params)
     model.fit(X_tr, y_tr, eval_set=[(X_va,y_va)], eval_metric='auc', callbacks=[lgbm.early_stopping(800)])
     y_va_pred = model.predict_proba(X_va)[:,1]  # pass the DataFrame so feature names match the fitted model
     score = roc_auc_score(y_va, y_va_pred)
     score_list.append(score)
     print("Fold {} done".format(fold))
     fold += 1
np.mean(score_list)

In [None]:
from lightgbm import plot_importance
plot_importance(model, figsize=(14,30))

In [None]:
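from scipy.stats import rankdata
pred_list = []
for seed in range(5):
    # Same hyperparameters, different seed for each model
    model = lgbm.LGBMClassifier(**params, random_state=seed + 1000)
    model.fit(train, target)
    y_pred = model.predict_proba(test)[:, 1]
    # Store ranks rather than probabilities so the seeds are averaged on a common scale
    pred_list.append(rankdata(y_pred))
    del y_pred
    print(f"{seed:2}", pred_list[-1])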
print()
y_pred = np.array(pred_list).mean(axis=0)
submission = pd.DataFrame({'id': id_test, 'target': y_pred})
submission.to_csv('submission.csv', index=False)
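The final cell averages ranks rather than probabilities across seeds. Since AUC depends only on the ordering of the predictions, this is a valid blend. A small sketch on made-up predictions shows the idea; dividing by the number of rows to bring the result into (0, 1] is an optional extra step that the cell above does not do.

```python
import pandas as pd

# Made-up probabilities from two seeds for four test rows
preds_by_seed = pd.DataFrame({
    "seed_0": [0.10, 0.80, 0.40, 0.95],
    "seed_1": [0.20, 0.70, 0.30, 0.99],
})

ranks = preds_by_seed.rank()                       # rank within each seed (1 = lowest)
blended = ranks.mean(axis=1) / len(preds_by_seed)  # average rank, scaled by row count
print(blended.tolist())
```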