# Approach:-
My advice to beginners is to follow this workflow when approaching any tabular data competition:-  
1. Read the Problem Statement and understand the requirements carefully.
2. READ THE PROBLEM STATEMENT AND REQUIREMENTS AGAIN.
3. Look carefully at the data and do EDA. It helps a lot during feature engineering and inference, and it is worth the time.
4. Decide the problem category (Classification/Regression).
5. Split the data into K folds before doing feature engineering, because it is very easy to leak information into the validation set during feature engineering. Once that happens, the validation set is no longer representative of real data.
6. Build a basic model and record its performance.
7. Try to improve the performance by Feature engineering, better encoding, synthetic feature creation etc.
8. Do feature selection to only use the important/relevant features.
9. If performance saturates or starts to drop, go back and try other models to see which one works better.
10. Tune the hyperparameters of the models that seem to work well in your experiments.
11. Select some of the best models from the previous step and ensemble them (stacking/blending).
12. Submit to the leaderboard and compare the CV and LB scores. If the gap is large, there was most likely some overfitting or leakage across folds. Try to identify and fix it.
13. When done, resubmit and the CV and LB scores should be closer.
14. Not satisfied with the result or trying for a better rank? Head over to the Discussion Forum, read through the interesting discussions and see what others are trying. If an idea looks promising, implement it and gradually improve your scores and skills.

**Happy Kaggling!**

# Why this Competition?
This competition provides a unique opportunity for Data Science beginners to participate in a hackathon-style challenge. It also lets beginners get their hands dirty with a practical application of ML and one of the most basic tasks machine learning algorithms can do:- **Classification**.  

This competition has the right mix of categorical and numerical features we might expect in a practical problem, which helps us learn how to use both of them in conjunction for a classification task.

# Problem Statement
The goal of this competition is to provide a fun tabular dataset that is approachable for anyone. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition.  

The dataset used for this competition is synthetic but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features.  

So we are dealing with a synthetic variation of real-world data, and as Data Scientists we are expected to predict a binary target based on these features.

## Data Description:-
We have 19 Categorical Features and 11 Continuous Features in the data.

## Expected Outcome:-
* Build a model to predict the probability of the binary target given the information above.
* Grading Metric: **ROC_AUC_SCORE**
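To make the metric concrete, here is a minimal sketch of how ROC AUC behaves, using scikit-learn's `roc_auc_score` on a few made-up labels and probabilities (the values are purely illustrative and are not competition data). AUC only looks at how the scores rank positives against negatives, so rescaling the predictions does not change it.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only)
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly
auc = roc_auc_score(y_true, y_prob)

# Halving every score keeps the ordering, so the AUC stays the same
auc_rescaled = roc_auc_score(y_true, [p * 0.5 for p in y_prob])

print(f"ROC AUC: {auc:.2f}")
print(f"ROC AUC after rescaling: {auc_rescaled:.2f}")
```

This is why we submit probabilities rather than hard 0/1 labels: the ordering of the predictions is what gets scored.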

## Problem Category:-
From the data and objective, it is evident that this is a **Binary Classification Problem** in the **Tabular Data** format.

So without further ado, let's now start with some basic imports to take us through this journey:-

In [None]:
# Aesthetics
import warnings
import sklearn.exceptions
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

# General
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import os
from scipy.optimize import fmin as scip_fmin

# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")

# Machine Learning

# Utils
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn import preprocessing
import category_encoders as ce

#Feature Selection
from sklearn.feature_selection import chi2, f_classif, f_regression
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, VarianceThreshold
# Models
import xgboost as xgb

#Metrics
from sklearn.metrics import roc_auc_score

In [None]:
data_dir = '../input/tabular-playground-series-mar-2021'

train_file_path = os.path.join(data_dir, 'train.csv')
test_file_path = os.path.join(data_dir, 'test.csv')
sample_sub_file_path = os.path.join(data_dir, 'sample_submission.csv')

print(f'Train file: {train_file_path}')
print(f'Test file: {test_file_path}')
print(f'Sample submission file: {sample_sub_file_path}')

In [None]:
RANDOM_SEED = 42

In [None]:
def seed_everything(seed=RANDOM_SEED):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

In [None]:
seed_everything()

# EDA
Let's have a basic look around the data we have at hand first

In [None]:
train_df = pd.read_csv(train_file_path)
test_df = pd.read_csv(test_file_path)
sub_df = pd.read_csv(sample_sub_file_path)

Let's see what columns we have in the training data.

In [None]:
train_df.sample(10)

In [None]:
train_df.columns

The naming scheme makes it obvious which column has what datatype.

In [None]:
train_df.describe().T

We can see that all of the continuous features are on a similar scale. This makes our job easier, as we do not need to do any sort of scaling later. Let's see if we have any null values in the dataset...

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()

Another good sign for us 😄.  
There are no null values in either the train or the test dataset. Less work for us, as we do not have to handle missing values or do imputation.  

## 1. Class Imbalance

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
sns.set_style("whitegrid")
sns.countplot(x='target', data=train_df);
plt.ylabel("No. of Observations", size=20);
plt.xlabel("Targets", size=20);

Ok, so this is an imbalanced dataset. We have to keep this in mind while developing our models later.  
Now let's move on to understanding each individual feature through EDA and some visualizations...
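Before moving on, here is a minimal sketch of how the imbalance could be put into numbers. It uses a made-up `toy_target` series in place of `train_df['target']`. The `scale_pos_weight` ratio is one common way to tell XGBoost about imbalance; it is not used in this notebook, so treat it as an optional idea to experiment with.

```python
import pandas as pd

# Hypothetical toy target standing in for train_df['target']
toy_target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 1], name="target")

# Share of each class in the data
class_share = toy_target.value_counts(normalize=True)

# Ratio of negatives to positives, a common starting value for
# XGBoost's scale_pos_weight parameter
neg = int((toy_target == 0).sum())
pos = int((toy_target == 1).sum())
scale_pos_weight = neg / pos

print(class_share)
print(f"neg/pos ratio: {scale_pos_weight:.2f}")
```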

## 2. cat0

In [None]:
def plot_cat_distribution(cat, train_df=train_df):
    """Bar chart of how many rows fall into each category of `cat`."""
    fig, ax = plt.subplots(figsize=(12, 6))
    sns.set_style('whitegrid')
    sns.countplot(x=cat, data=train_df, ax=ax)
    ax.set_ylabel('No. of Observations', size=20)
    ax.set_xlabel(cat + ' Count', size=20)
    plt.show()
    
def plot_cat_response(cat, train_df=train_df):
    """Same counts as above, split by target value within each category."""
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.set_style('whitegrid')
    sns.countplot(x=cat, hue='target', data=train_df, ax=ax)
    plt.show()

In [None]:
plot_cat_distribution('cat0')

In [None]:
plot_cat_response('cat0')

* Category A of cat0 has a significantly higher response rate than category B.  
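The response rate read off the plot can also be computed directly. Below is a minimal sketch on a made-up `toy` frame (the values are invented for illustration). The same `groupby` would work on `train_df` for any of the `cat` columns.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df[['cat0', 'target']]
toy = pd.DataFrame({
    "cat0": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "target": [1, 1, 0, 1, 0, 0, 1, 0],
})

# mean of a 0/1 target = share of rows with target == 1 (the response rate);
# count shows how many rows back up that estimate
rate = toy.groupby("cat0")["target"].agg(["mean", "count"])
print(rate)
```

Keeping the count next to the rate matters for the rare categories we will see below: a rate computed from a handful of rows is not very trustworthy.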

## 3. cat1

In [None]:
plot_cat_distribution('cat1')

In [None]:
plot_cat_response('cat1')

* We can see there are certain categories where the data is very rare and we do not necessarily learn anything new from them.
* Also there are categories like A which are almost exclusively representing target=0.
* Even though the overall dataset is imbalanced in favour of Target 0, in certain categories here like L, H and G both targets are similarly represented.  

## 4. cat2

In [None]:
plot_cat_distribution('cat2')

In [None]:
plot_cat_response('cat2')

* We can see there are certain categories where the data is very rare and we do not necessarily learn much from them.
* Category A is the dominant category by far in this feature.
* Category O gives a good response rate for Target value 1.  

## 5. cat3

In [None]:
plot_cat_distribution('cat3')

In [None]:
plot_cat_response('cat3')

* There are some very rare categories in this feature.  

## 6. cat4

In [None]:
plot_cat_distribution('cat4')

In [None]:
plot_cat_response('cat4')

* There are some very rare categories in this feature.  

## 7. cat5

In [None]:
train_df['cat5'].value_counts()

* This is a high-cardinality feature.
* There are certain categories which are very rare in nature.
* BI is the most dominant category by far.

## 8. cat6

In [None]:
plot_cat_distribution('cat6')

In [None]:
plot_cat_response('cat6')

* There are some rare categories in this feature.  

## 9. cat7

In [None]:
train_df['cat7'].value_counts()

## 10. cat8

In [None]:
train_df['cat8'].value_counts()

* There are some very rare categories in this feature.  

## 11. cat9

In [None]:
plot_cat_distribution('cat9')

In [None]:
plot_cat_response('cat9')

* Category A is by far the dominant category.
* There are some very rare categories in this feature.

## 12. cat10

In [None]:
train_df['cat10'].value_counts()

* There are some very rare categories in this feature.  

## 13. cat11

In [None]:
plot_cat_distribution('cat11')

In [None]:
plot_cat_response('cat11')

* Category B has a strong response rate to Target 1 despite 1 being the lesser class.

## 14. cat12

In [None]:
plot_cat_distribution('cat12')

In [None]:
plot_cat_response('cat12')

## 15. cat13

In [None]:
plot_cat_distribution('cat13')

In [None]:
plot_cat_response('cat13')

* B is a very rare category in this feature.
* But category B has a very strong correlation with the Target class 1.

## 16. cat14

In [None]:
plot_cat_distribution('cat14')

In [None]:
plot_cat_response('cat14')

* Category A has a strong response towards Target 0.

## 17. cat15

In [None]:
plot_cat_distribution('cat15')

In [None]:
plot_cat_response('cat15')

* Category C is very rare in nature.
* Category B and C have a strong response towards target 0.  

## 18. cat16

In [None]:
plot_cat_distribution('cat16')

In [None]:
plot_cat_response('cat16')

## 19. cat17

In [None]:
plot_cat_distribution('cat17')

In [None]:
plot_cat_response('cat17')

* Category A is very rare in nature for this feature.

## 20. cat18

In [None]:
plot_cat_distribution('cat18')

In [None]:
plot_cat_response('cat18')

* Category A is very rare in nature in this feature.

## 21. Numerical Features

In [None]:
not_features = ['id', 'target']
features = [feat for feat in train_df.columns if feat not in not_features]
numerical_features = []
categorical_features = []

for feat in features:
    if train_df[feat].dtype == 'object':
        categorical_features.append(feat)
    else:
        numerical_features.append(feat)

print(f'Numerical Features: {numerical_features}')
print(f'Categorical Features: {categorical_features}')

Let's look at the distribution, correlation and target relationship of all the numerical features through a pair plot.

In [None]:
g = sns.pairplot(train_df[numerical_features + ['target']], hue='target')
g.map_lower(sns.kdeplot, levels=4, color=".2")
g.fig.set_size_inches(20,20)

In [None]:
# Pearson correlation between the numerical features
train_df_corr = train_df[numerical_features].corr(method='pearson')
plt.figure(figsize=(10,10))
sns.heatmap(train_df_corr, square=True, cmap='coolwarm', annot=True);

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(data=train_df[numerical_features], orient="h", palette="Set2");

* There does not seem to be a strong, definitive relationship between the numerical features, despite some of them having fairly high correlation coefficients.
* There are some continuous features which have a long tail, so we can take the log of them to diminish the effects of outliers.

This is about as much basic EDA as we can do, given that we do not know the feature names and have no idea about their relationships for bi-variate EDA at this point.  
Having said that, we can always iterate over the unknown features to discover relationships between them. We will skip that for now and come back to it later, when we want to further improve our model and need to create new features.

# KFold Splits  
Before we move on to feature engineering, it is always a good idea to perform the cross-validation splits. That way we reduce the risk of data leakage and can be more confident that the validation set is representative of real-world unseen data.

In [None]:
# From https://github.com/abhishekkrthakur/approachingalmost
NUM_SPLITS = 5

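# Placeholder fold id; every row is assigned a real fold below
train_df["kfold"] = -1
# Shuffle the rows once so the folds are not tied to the original file order
train_df = train_df.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)
y = train_df.target.values
# Stratify on the target so each fold keeps roughly the same class ratio
kf = StratifiedKFold(n_splits=NUM_SPLITS)
for f, (t_, v_) in enumerate(kf.split(X=train_df, y=y)):
    train_df.loc[v_, 'kfold'] = f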
    
train_df.head()
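A quick sanity check after creating folds is to confirm that every row got a fold and that each fold has roughly the same positive rate. Here is a minimal sketch on synthetic data (the `toy` frame and its 25% positive rate are assumptions for illustration). The same `groupby` can be run on `train_df` with its `kfold` column.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data standing in for train_df
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "x": rng.normal(size=1000),
    "target": (rng.random(1000) < 0.25).astype(int),
})

# Assign a fold id to every row, stratified on the target
toy["kfold"] = -1
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(skf.split(X=toy, y=toy["target"])):
    toy.loc[valid_idx, "kfold"] = fold

# Rows per fold and positive rate per fold; the rates should be nearly equal
fold_summary = toy.groupby("kfold")["target"].agg(["count", "mean"])
overall_rate = toy["target"].mean()
print(fold_summary)
print(f"Overall positive rate: {overall_rate:.3f}")
```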

# Feature Engineering
As we have seen in the EDA, some features have very rare categories. As a rule, we will take all rare categories that make up <0.1% of the population and group them together under a new label called 'RARE'. So let's do that first...

In [None]:
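# Replace categories seen in fewer than 0.1% of training rows with 'RARE'
rare_threshold = 0.001 * train_df.shape[0]
train_df[categorical_features] = train_df[categorical_features].apply(
    lambda x: x.mask(x.map(x.value_counts()) < rare_threshold, 'RARE')
)

# Map test categories that did not survive in the training data to 'RARE' as well
for col in categorical_features:
    vals = set(train_df[col].unique())
    test_df[col] = test_df[col].apply(lambda x: 'RARE' if x not in vals else x)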
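To see what the grouping does on a small scale, here is a minimal sketch with a hypothetical helper `group_rare` and made-up series. The helper name, the toy values and the 20% threshold are all assumptions chosen so the effect is visible on ten rows. The idea is the same as above: the set of kept categories is learned from train only and then applied to both train and test.

```python
import pandas as pd

def group_rare(train_col, test_col, min_frac=0.001, label="RARE"):
    """Hypothetical helper: keep categories with train share >= min_frac."""
    freq = train_col.value_counts(normalize=True)
    keep = set(freq[freq >= min_frac].index)
    # Anything outside the kept set (rare in train or unseen in train)
    # is replaced by the common label
    train_out = train_col.where(train_col.isin(keep), label)
    test_out = test_col.where(test_col.isin(keep), label)
    return train_out, test_out

# Toy data: 'C' is rare in train, 'Z' never appears in train
toy_train = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"])
toy_test = pd.Series(["A", "C", "Z"])

tr, te = group_rare(toy_train, toy_test, min_frac=0.2)
print(tr.value_counts())
print(te.tolist())
```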

## Categorical Value Encoding
Categorical values are of datatype object, and in most scenarios they cannot be fed directly to a machine learning algorithm. The process of converting these objects to numbers is called encoding.  
There can be two types of categorical features:-
1. Ordinal - Have order associated with object name (Hot, Cold, Warm)
2. Nominal - Do not have any order associated (Cake, Bread, Burger)  

As the ordinality of the features is unknown in our case, we will assume that all of them are nominal and proceed accordingly.  
Categorical variables also fall into two groups by cardinality (as we know from the EDA):-
* High Cardinality Features (many distinct values, often including many rare ones)
* Low Cardinality Features (only a few distinct values)

Some encoders, like Target and CatBoost encoders, work well with both. We will be using the CatBoost Encoder here. 

### Define Encoders

In [None]:
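def catboost_enc(train_df, test_df, features):
    """Replace `features` with CatBoost-encoded copies suffixed with '_cb'."""
    # Note: the encoder is fit on the full training set, so rows that later act
    # as validation folds also contribute to the encoding statistics
    cb_enc = ce.CatBoostEncoder(cols=features)
    cb_enc.fit(train_df[features], train_df['target'])
    
    # Append the encoded columns next to the originals
    train_df = train_df.join(cb_enc.transform(train_df[features]).add_suffix('_cb'))
    test_df = test_df.join(cb_enc.transform(test_df[features]).add_suffix('_cb'))
    
    # Drop the raw categorical columns now that encoded versions exist
    train_df = train_df.drop(features, axis=1)
    test_df = test_df.drop(features, axis=1)
    
    return train_df, test_df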

In [None]:
low_cardinality_cols = []
high_cardinality_cols = []

for feat in categorical_features:
    if train_df[feat].nunique() < 5:
        low_cardinality_cols.append(feat)
    else:
        high_cardinality_cols.append(feat)
        
print(f'Low Cardinality Cols: {low_cardinality_cols}')
print(f'High Cardinality Cols: {high_cardinality_cols}')

In [None]:
train_df, test_df = catboost_enc(train_df, test_df, low_cardinality_cols)

In [None]:
train_df, test_df = catboost_enc(train_df, test_df, high_cardinality_cols)
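Since the encoder above is fit on all training rows at once, the `kfold` column is not used during encoding. If you want the encoding itself to respect the folds, one option is an out-of-fold target encoding. The sketch below is a plain smoothed target-mean encoder written with pandas. It is not the CatBoost Encoder used in this notebook, and the function name `oof_target_encode`, the `smoothing` value and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, col, target="target", fold_col="kfold", smoothing=10.0):
    """Hypothetical helper: encode each fold using statistics from the other folds."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in sorted(df[fold_col].unique()):
        in_fold = df[fold_col] == fold
        fit_part = df[~in_fold]
        # Global positive rate of the fitting rows, used as the prior
        prior = fit_part[target].mean()
        stats = fit_part.groupby(col)[target].agg(["sum", "count"])
        # Shrink small categories towards the prior
        smooth = (stats["sum"] + prior * smoothing) / (stats["count"] + smoothing)
        # Categories unseen in the fitting rows fall back to the prior
        encoded.loc[in_fold] = df.loc[in_fold, col].map(smooth).fillna(prior)
    return encoded

# Toy data with two folds
toy = pd.DataFrame({
    "cat": ["A", "A", "B", "B", "A", "B"],
    "target": [1, 0, 1, 1, 0, 0],
    "kfold": [0, 0, 0, 1, 1, 1],
})
toy["cat_te"] = oof_target_encode(toy, "cat", smoothing=1.0)
print(toy)
```

With this layout, the encoded value of a row never sees that row's own target or the targets of its fold, which keeps the validation folds closer to how the test set is treated.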

## Diminishing Outliers  
As seen in EDA, there is a long tail for some numerical values. So, let's take log of the same to diminish the effects of outliers.

In [None]:
cols_for_log = ['cont8', 'cont9', 'cont10']

for col in cols_for_log:
    train_df[col] = np.log(train_df[col])
    test_df[col] = np.log(test_df[col])
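A plain `np.log` returns `NaN` for negative inputs and `-inf` for zero, so it only works when the column is strictly positive. If you reuse this step on a column that can touch zero or go below it, a shifted `log1p` is one possible alternative. The sketch below uses a hypothetical helper `shifted_log1p` and made-up values; the shift is learned from train and reused on test.

```python
import numpy as np
import pandas as pd

def shifted_log1p(train_col, test_col):
    """Hypothetical helper: shift so the train minimum is 0, then apply log1p."""
    # Only shift when the train column actually goes below zero
    shift = max(0.0, -float(train_col.min()))
    train_out = np.log1p(train_col + shift)
    # Test values below the train minimum are clipped to 0 before the log
    test_out = np.log1p((test_col + shift).clip(lower=0.0))
    return train_out, test_out

# Toy columns: train has a negative value, test goes even lower
toy_train = pd.Series([-0.05, 0.0, 0.3, 2.5])
toy_test = pd.Series([-0.10, 0.2, 4.0])

tr, te = shifted_log1p(toy_train, toy_test)
print(tr.round(3).tolist())
print(te.round(3).tolist())
```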

# Features Selection
We need to select only the important features for better model performance. Unnecessary features will, in the best case, add nothing productive to the algorithm's calculations and, in the worst case, 'confuse' the model.  

Also there is something called the curse of dimensionality associated with mathematical models, which says **"The curse of dimensionality, indicates that the number of samples needed to estimate an arbitrary function with a given level of accuracy grows exponentially with respect to the number of input variables (i.e., dimensionality) of the function."** Hence removing unnecessary/low-contributing features can actually help the algorithm make better predictions.

To do the same let's create a wrapper class that has all the built in statistical tests required to perform feature selection and takes some basic inputs from user and spits out the required features.

In [None]:
target = ['target']
not_features = ['id', 'target', 'kfold']
cols = list(train_df.columns)
features = [feat for feat in cols if feat not in not_features]

In [None]:
# From https://github.com/abhishekkrthakur/approachingalmost
class UnivariateFeatureSelction:
    def __init__(self, n_features, problem_type, scoring, return_cols=True):
        """
        Custom univariate feature selection wrapper on
        different univariate feature selection models from
        scikit-learn.
        :param n_features: SelectPercentile if float else SelectKBest
        :param problem_type: classification or regression
        :param scoring: scoring function, string
        """
        self.n_features = n_features
        
        if problem_type == "classification":
            valid_scoring = {
                "f_classif": f_classif,
                "chi2": chi2,
                "mutual_info_classif": mutual_info_classif
            }
        else:
            valid_scoring = {
                "f_regression": f_regression,
                "mutual_info_regression": mutual_info_regression
            }
        if scoring not in valid_scoring:
            raise Exception("Invalid scoring function")
            
        if isinstance(n_features, int):
            self.selection = SelectKBest(
                valid_scoring[scoring],
                k=n_features
            )
        elif isinstance(n_features, float):
            self.selection = SelectPercentile(
                valid_scoring[scoring],
                percentile=int(n_features * 100)
            )
        else:
            raise Exception("Invalid type of feature")
    
    def fit(self, X, y):
        return self.selection.fit(X, y)
    
    def transform(self, X):
        return self.selection.transform(X)
    
    def fit_transform(self, X, y):
        return self.selection.fit_transform(X, y)
    
    def return_cols(self, X):
        # get_support() returns a boolean mask over the columns seen during fit,
        # for both SelectKBest and SelectPercentile
        mask = self.selection.get_support()
        return [feature for keep, feature in zip(mask, X.columns) if keep]

In [None]:
ufs = UnivariateFeatureSelction(
    n_features=0.9,
    problem_type="classification",
    scoring="f_classif"
)

ufs.fit(train_df[features], train_df[target].values.ravel())
selected_features = ufs.return_cols(train_df[features])
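To get a feel for what the wrapper does underneath, here is a minimal sketch that calls scikit-learn's `SelectPercentile` with `f_classif` directly on synthetic data. The column names and the data are made up: one column carries signal about the label and three are pure noise. In the cell above, `n_features=0.9` translates to `percentile=90`; here 50 is used so the effect is visible with only four columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic data: 'signal' depends on the label, the noise columns do not
rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "signal": y + rng.normal(scale=0.5, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "noise_c": rng.normal(size=n),
})

# Keep the top 50% of columns ranked by their ANOVA F-score
selector = SelectPercentile(f_classif, percentile=50).fit(X, y)
kept = list(X.columns[selector.get_support()])

# The raw scores help judge how clear-cut the selection is
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
print(f"Kept: {kept}")
```

Looking at `scores_` before trusting the selection is worthwhile: if the scores of kept and dropped columns are close, the cut-off is fairly arbitrary.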

# Utils  
Before creating models, let's create some helper functions that will help us rate the models without writing the same trivial code again and again. This is a good practice for keeping your code clean and reusable.

In [None]:
def rate_model(clf, x, cv = StratifiedKFold(n_splits=NUM_SPLITS),
               features=selected_features, target=target):
    '''
    Prints out various evaluation metrics for a classification task. Like:-
    1. Classification Accuracy
    2. ROC-AUC Score
    3. Precision
    4. Recall
    5. F1 Score
    All scores use scikit-learn's default binary scorers and are reported as the mean over the CV folds.
    
    clf - Classifier
    x - Input features
    cv - Cross Validation criteria
    features - Feature column names
    target - Target column name
    '''
    
    scoring = {'acc' : 'accuracy', 'roc' : 'roc_auc', 'precision' : 'precision', 'recall' : 'recall', 'f1' : 'f1'}
    scores = cross_validate(clf, x[features], x[target].values.ravel(), scoring=scoring, cv=cv, return_train_score=False)
    roc = np.mean(scores['test_roc'])
    acc = np.mean(scores['test_acc'])
    prec = np.mean(scores['test_precision'])
    rec = np.mean(scores['test_recall'])
    f1 = np.mean(scores['test_f1'])
    print(f'ROC: {roc}')
    print(f'Accuracy: {acc}')
    print(f'Precision: {prec}')
    print(f'Recall: {rec}')
    print(f'F-Score: {f1}')

In [None]:
def prepare_submission_one_model(clf, train_df=train_df, test_df=test_df,
                                 features=selected_features, target=target):
    '''
    Prepares the submission.csv file for submitting to the competition.
    
    Inputs:-
    clf - Classifier
    train_df - Training Dataset
    test_df - Test Dataset
    features - Feature column names
    target - Target column name
    '''
    
    clf.fit(train_df[features], train_df[target].values.ravel())
    preds = clf.predict_proba(test_df[features])[:, 1]
    output = pd.DataFrame({'id': test_df['id'],
                           'target': preds})
    output.to_csv('submission.csv', index=False)
    print('Prediction file saved. All the Best!')

# Model Building
Here we will build a single, simple XGBoost Classifier. As this is a starter notebook, I believe this is enough to get anyone started. You can build upon the same layout, add your own models and evaluate their performance in the same way.  
In the end, after selecting some good models, you can try an ensemble of the best-performing models on the dataset.
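As a rough idea of what a simple ensemble could look like, here is a minimal sketch of rank averaging. The helper name `rank_average` and the prediction arrays are made up for illustration. Because the metric is ROC AUC, only the ordering of the predictions matters, so averaging ranks instead of raw probabilities is one way to blend models whose outputs sit on different scales.

```python
import numpy as np
from scipy.stats import rankdata

def rank_average(pred_list):
    """Hypothetical helper: average the normalised ranks of several prediction arrays."""
    # rankdata gives 1..n (ties share an average rank); divide by n to land in (0, 1]
    ranks = [rankdata(p) / len(p) for p in pred_list]
    return np.mean(ranks, axis=0)

# Made-up test predictions from two models for the same four rows
preds_a = np.array([0.10, 0.80, 0.40, 0.95])
preds_b = np.array([0.30, 0.60, 0.20, 0.70])

blend = rank_average([preds_a, preds_b])
print(blend)
```

The blended values are no longer calibrated probabilities, which is fine for an AUC-scored submission but would not be for a metric like log loss.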

In [None]:
NUM_BOOSTERS = 5000

params = {
    'booster': 'gbtree',
    'eval_metric': 'auc',
    'random_state': RANDOM_SEED,
    'use_label_encoder': False,
    'tree_method': 'gpu_hist',
    'max_depth': 8,
    'learning_rate': 0.01,
    'n_estimators': NUM_BOOSTERS,
    'min_child_weight': 20,
    'gamma': 0.1,
    'alpha': 0.2,
    'lambda': 9,
    'colsample_bytree': 0.2,
    'subsample': 0.8,
    'nthread': -1
}

clf = xgb.XGBClassifier(**params)
rate_model(clf, train_df)

# Submission

In [None]:
NUM_BOOSTERS = 5000

params = {
    'booster': 'gbtree',
    'eval_metric': 'auc',
    'random_state': RANDOM_SEED,
    'use_label_encoder': False,
    'tree_method': 'gpu_hist',
    'max_depth': 8,
    'learning_rate': 0.01,
    'n_estimators': NUM_BOOSTERS,
    'min_child_weight': 20,
    'gamma': 0.1,
    'alpha': 0.2,
    'lambda': 9,
    'colsample_bytree': 0.2,
    'subsample': 0.8,
    'nthread': -1
}

clf = xgb.XGBClassifier(**params)
prepare_submission_one_model(clf)

This was a quick overview/implementation example of a basic model for tabular data.  
Hope you learnt something from this notebook.  
I will keep updating this notebook and adding new things as and when I come across more algorithms, feature engineering ideas or models worth sharing. So come back for more if you liked this one...

***Also if you found this notebook useful and use parts of it in your work, please don't forget to show your appreciation by upvoting this kernel. This keeps me motivated to write and share similar such starter kernels for beginners.***