# Tabular Playground Series - Mar 2021
In this notebook we explore a binary classification dataset and build a machine learning classification model. The dataset contains 31 columns besides the target: an `id`, 19 categorical features (`cat0` to `cat18`), and 11 continuous features (`cont0` to `cont10`).

# Load the Dataset
In this section, we import the libraries we need and load the dataset into the notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import warnings

from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, KFold, cross_val_score, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import VarianceThreshold
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier, Pool

warnings.filterwarnings('ignore')
sns.set_theme()  # replaces plt.style.use('seaborn'), a style name that recent Matplotlib releases no longer accept


In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')
df_train.head()

In [None]:
df_test = pd.read_csv('../input/tabular-playground-series-mar-2021/test.csv')
df_test.head()

In [None]:
df_train_label = df_train.drop('target', axis=1)
df_train_label['train-test'] = 1
df_test['train-test'] = 0

In [None]:
df = pd.concat([df_train_label, df_test])

In [None]:
len(df)

# Dataset Statistics
In this section, we compute basic summary statistics to understand the dataset and to spot issues that affect preprocessing, such as outliers and skewness.

In [None]:
df.describe()

In [None]:
df.info()

From this, we can see that the dataset has no missing values. However, it contains some categorical features, which need to be encoded before most models can use them. Before that, we perform EDA to find problems in the data and to check whether the target is imbalanced.
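
As a rough sketch of what encoding the categorical features could look like, the snippet below one-hot encodes a small made-up frame with `pd.get_dummies`. The toy values and the choice of one-hot encoding are assumptions for illustration only; the CatBoost model used later in this notebook takes the categorical columns directly through `cat_features`, so this step is only needed for models that require numeric input.

```python
import pandas as pd

# Toy frame standing in for the competition data (values are made up).
toy = pd.DataFrame({
    "cat0": ["A", "B", "A", "C"],
    "cat1": ["x", "x", "y", "y"],
    "cont0": [0.1, 0.5, 0.3, 0.9],
})

# Treat every non-numeric column as categorical.
cat_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Number of distinct levels per categorical column.
cardinality = toy[cat_cols].nunique()

# One-hot encode the categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=cat_cols)

print(cardinality.to_dict())
print(encoded.columns.tolist())
```

Checking the cardinality first is useful because one-hot encoding a column with many levels can make the frame very wide.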

# Exploratory Data Analysis
In this section, we analyze the data and look for patterns based on correlation and k-nearest neighbours. We also look for skewness and outliers in the dataset.


`TODO`
* find the outliers. ✅
* find the skewness.
* find class imbalance using the target variable.
* find the distinct categorical values used in the dataset.
* find the correlation between the dataset features.
* find the k-nearest neighbours in the dataset using the distinct features in the dataset.
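
Two of the items above, class balance and feature correlation, can be sketched with plain pandas. The frame below is made up for illustration; in the notebook the same calls would presumably be applied to `df_train['target']` and to the continuous columns.

```python
import pandas as pd

# Made-up data: two continuous features and a binary target.
toy = pd.DataFrame({
    "cont0": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "cont1": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6],
    "target": [0, 0, 0, 1, 0, 1, 0, 0],
})

# Share of each class; a large gap between the shares indicates imbalance.
class_share = toy["target"].value_counts(normalize=True).sort_index()
imbalance_ratio = class_share.max() / class_share.min()

# Pearson correlation between the continuous features.
corr = toy[["cont0", "cont1"]].corr()

print(class_share.to_dict())
print(imbalance_ratio)
print(corr.loc["cont0", "cont1"])
```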

## Outliers

In [None]:
df.head()

In [None]:
numerical_col = [col for col in df.columns if pd.api.types.is_float_dtype(df[col])]
plt.boxplot(df[numerical_col])
plt.title('Numerical Boxplot', fontsize=24, fontweight='bold')
plt.xlabel('Features');

We have outliers in the columns `cont8`, `cont9` and `cont10`. We now handle these outliers by capping them using the interquartile range (IQR) rule.

In [None]:
outlier_col = ['cont8', 'cont9', 'cont10']
for col in outlier_col:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    IQR = q3 - q1
    # Cap only values beyond the fences; values between a quartile and its fence stay unchanged
    lower, upper = q1 - 1.5 * IQR, q3 + 1.5 * IQR
    df[col] = df[col].clip(lower=lower, upper=upper)

In [None]:
numerical_col = [col for col in df.columns if pd.api.types.is_float_dtype(df[col])]
plt.boxplot(df[numerical_col])
plt.title('Numerical Boxplot', fontsize=24, fontweight='bold')
plt.xlabel('Features');
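
A numeric check can complement the boxplot. The sketch below counts, per column, how many values fall outside the fences at 1.5 × IQR beyond the quartiles, before and after clipping to those fences. The helper name `count_iqr_outliers`, the factor `k=1.5`, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def iqr_fences(frame, k=1.5):
    """Return per-column lower and upper fences at k * IQR beyond the quartiles."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_iqr_outliers(frame, lower, upper):
    """Count values per column that lie strictly outside the given fences."""
    mask = frame.lt(lower, axis=1) | frame.gt(upper, axis=1)
    return mask.sum()

# Made-up data: cont_b has one extreme value.
toy = pd.DataFrame({
    "cont_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "cont_b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0],
})

lower, upper = iqr_fences(toy)
before = count_iqr_outliers(toy, lower, upper)

# Clip each column to its own fences, then count again against the same fences.
capped = toy.clip(lower=lower, upper=upper, axis=1)
after = count_iqr_outliers(capped, lower, upper)

print(before.to_dict())
print(after.to_dict())
```

Counting against the fences computed before clipping keeps the comparison fair, since recomputing quartiles on the clipped data would move the fences.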

Now there are no outliers left in the numerical features. Let's check whether the numerical features are skewed.
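
A minimal sketch of such a skewness check, using `DataFrame.skew()` on made-up data. The cut-off of 1 in absolute value is a common rule of thumb and an assumption here, not something the notebook specifies.

```python
import pandas as pd

# Made-up data: one symmetric column and one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 3.0, 20.0],
})

# Sample skewness per column; values near 0 indicate a roughly symmetric distribution.
skew = toy.skew()

# Flag columns whose absolute skewness exceeds the (assumed) threshold of 1.
needs_attention = skew[skew.abs() > 1].index.tolist()

print(skew)
print(needs_attention)
```

Columns flagged this way are candidates for a transform such as a logarithm or a power transform, although tree-based models like the one used below are largely insensitive to monotonic transforms of the inputs.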

In [None]:
df.head()

In [None]:
# Copy the slices so that later column drops do not act on views of df
train_data = df[df['train-test'] == 1].copy()
test_data = df[df['train-test'] == 0].copy()

In [None]:
cat_features = [f'cat{i}' for i in range(19)]  # 'cat0' ... 'cat18'
# Reassign instead of dropping in place to avoid SettingWithCopyWarning on sliced frames
train_data = train_data.drop(columns=['id', 'train-test'])
train_pool = Pool(train_data, label=df_train['target'], cat_features=cat_features)
test_data = test_data.drop(columns=['id', 'train-test'])
test_pool = Pool(test_data, cat_features=cat_features)

In [None]:
train_pool.get_label()

In [None]:
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(train_data, df_train['target'], test_size=0.2)
len(X_train), len(y_train)

In [None]:
test_X_pool = Pool(X_test, y_test, cat_features=cat_features)

In [None]:
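cat_model = CatBoostClassifier()
# `grid` was not defined in the original notebook; the values below are illustrative assumptions
grid = {'learning_rate': [0.03, 0.1], 'depth': [4, 6, 8], 'l2_leaf_reg': [1, 3, 5]}
cat_model.randomized_search(grid, train_pool)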

In [None]:
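# Accuracy on the hold-out rows; these rows are also part of train_pool, so the estimate is optimistic
cat_model.score(test_X_pool)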
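
`score` reports accuracy, while `roc_auc_score` is imported at the top of the notebook but never used. The sketch below shows how ROC AUC could be computed from predicted probabilities. The labels and probabilities are made up; in the notebook's own terms the call would presumably be `roc_auc_score(y_test, cat_model.predict_proba(test_X_pool)[:, 1])`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up hold-out labels and two-column class probabilities,
# shaped like the output of predict_proba (column 1 = positive class).
y_true = np.array([0, 0, 1, 1])
proba = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.65, 0.35],
    [0.20, 0.80],
])

# ROC AUC is computed from the positive-class probability, not from hard labels.
auc = roc_auc_score(y_true, proba[:, 1])
print(auc)
```

AUC depends only on how the examples are ranked, so it is less affected by class imbalance than accuracy at a fixed threshold.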

In [None]:
y_preds = cat_model.predict_proba(test_pool)

In [None]:
submission = pd.DataFrame(y_preds[:, 1], columns=['target'])
submission.index = df_test['id']
submission.to_csv('./submission-final.csv')

In [None]:
submission.head()