<a id='0'> </a>
### **Table of Contents**

   1. [Basic Overview 📺](#1)
   
      1.1. [Train Dataset](#1.1)
      
      1.2. [Test Dataset](#1.2)
   
   2. [Exploratory Data Analysis 📊](#2)
         
      2.1. [Train Dataset](#2.1)
      
      2.2. [Test Dataset](#2.2)
   
   3. [Modelling 🍀](#3)
   
      3.1. [XGB Model](#3.1)
      
      3.2. [Neural Networks](#3.2)
      
      
      

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_train = pd.read_csv("../input/tabular-playground-series-may-2022/train.csv")
df_test = pd.read_csv("../input/tabular-playground-series-may-2022/test.csv")

[Slide Up](#0)

<a id = 1></a>
## **Basic Overview 📺**

<a id = 1.1></a>
### **1. Train Dataset**

In [None]:
df_train.head(5)

In [None]:
print("***** Shape of Training Dataset ***** ",df_train.shape)


1) The dataset contains 16 int64 columns, 16 float64 columns and 1 object column.

2) There are no null values in the dataset.

3) The training dataset has 900000 rows and 33 columns.

4) 'f_07', 'f_08', 'f_09', 'f_10', 'f_11', 'f_12', 'f_13', 'f_14', 'f_15', 'f_16', 'f_17', 'f_18', 'f_29', 'f_30' and target are integer columns (together with id, these make up the 16 int64 columns).

5) 'f_00', 'f_01', 'f_02', 'f_03', 'f_04', 'f_05', 'f_06', 'f_19', 'f_20', 'f_21', 'f_22', 'f_23', 'f_24', 'f_25', 'f_26' and 'f_28' are float columns.

6) f_27 is the only column that contains object values, i.e. strings.
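
The points above can be checked programmatically. The sketch below uses a small made-up frame (`toy`, not the competition data) only to show the calls; on the real data, `df_train` would take its place.

```python
import pandas as pd

# Toy stand-in for df_train; column names and values are illustrative only.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "f_00": [0.1, -1.2, 0.7],
    "f_07": [2, 5, 1],
    "f_27": ["ABABDADBAB", "ACACCADCEB", "AAAEABCKAD"],
    "target": [0, 1, 1],
})

# Number of columns per dtype (e.g. int64 / float64 / string-like).
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Total number of missing values across the whole frame.
n_missing = int(toy.isna().sum().sum())
print("missing values:", n_missing)

# Rows and columns.
print("shape:", toy.shape)
```

`df_train.info()` in the next cell reports the same information in a single call; the explicit counts are handy when the numbers need to be reused in code.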

In [None]:
df_train.info()

In [None]:
lis = ['f_07','f_08','f_09','f_10','f_11','f_12','f_13','f_14','f_15',
       'f_16','f_17','f_18','f_29','f_30']

for feature in lis:
    print( "Number of Unique Feature "+feature +" : ", len(df_train[feature].unique()))
    print()
    print( "Feature "+feature +" : ", sorted(df_train[feature].unique()))
    print()

In [None]:
df_train.describe(include='all').style.background_gradient(cmap='Blues')

In [None]:
df_train['f_27'].value_counts()

[Slide Up](#0)
<a id = 1.2></a>
### **2. Test Dataset**

In [None]:
df_test.head()

In [None]:
print("***** Shape of Testing Dataset ***** ",df_test.shape)


1) The test dataset contains 15 int64 columns (it has no target), 16 float64 columns and 1 object column.

2) There are no null values in the dataset.

3) The test dataset has 700000 rows and 32 columns.

4) 'f_07', 'f_08', 'f_09', 'f_10', 'f_11', 'f_12', 'f_13', 'f_14', 'f_15', 'f_16', 'f_17', 'f_18', 'f_29' and 'f_30' are integer columns.

5) 'f_00', 'f_01', 'f_02', 'f_03', 'f_04', 'f_05', 'f_06', 'f_19', 'f_20', 'f_21', 'f_22', 'f_23', 'f_24', 'f_25', 'f_26' and 'f_28' are float columns.

6) f_27 is the only column that contains object values, i.e. strings.

In [None]:
df_test.info()

In [None]:
df_test.describe(include='all').style.background_gradient(cmap='Blues')

In [None]:
lis = ['f_07','f_08','f_09','f_10','f_11','f_12','f_13','f_14','f_15',
       'f_16','f_17','f_18','f_29','f_30']

for feature in lis:
    print( "Number of Unique Feature "+feature +" : ", len(df_test[feature].unique()))
    print()
    print( "Feature "+feature +" : ", sorted(df_test[feature].unique()))
    print()

As observed above, f_27 is an object-type feature, i.e. it holds strings.

In [None]:
df_test['f_27'].value_counts()

[Slide Up](#0)
<a id = 2></a>
## **Exploratory Data Analysis 📊**

<a id = 2.1></a>
### **1. Train Dataset**

First, let's find out how the features are related to each other, i.e. we will look at the correlation between features.

1) I have used the Spearman rank correlation to measure the correlation between features and target, because it also captures monotonic non-linear relationships between columns (if present).

2) As observed from the correlation table, there is no strong correlation between any two features; the few noticeable ones are listed below.

3) f_28 and f_03 have a positive correlation of 0.32121.

4) f_28 has a correlation > 0.1 with f_00, f_01, f_02, f_03, f_04, f_05 and f_06.

5) f_22 and f_30 have a positive correlation of 0.318088.

6) None of the available features has a significant correlation with the target.
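
To illustrate why a rank correlation is used here, the sketch below compares Pearson and Spearman on synthetic data (not the competition data). Spearman is computed as the Pearson correlation of the ranks, which is its definition; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic data: y grows monotonically with x, but not linearly.
x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 2.0)

# Pearson measures linear association (the default method of Series.corr).
pearson = x.corr(y)

# Spearman is the Pearson correlation of the ranks, so any strictly
# increasing relationship scores 1.0.
spearman = x.rank().corr(y.rank())

print(f"pearson  = {pearson:.3f}")
print(f"spearman = {spearman:.3f}")
```

Note that Spearman only detects monotonic relationships. A symmetric, non-monotonic dependence (for example y = x² around zero) can still give a value close to 0.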

In [None]:
df_train1 = df_train.drop(['id','f_27'],axis=1)

In [None]:
spear_corr=df_train1.corr(method='spearman')
spear_corr.style.background_gradient(cmap='Blues')

In [None]:
def without_hue(data,feature,ax):
    
    total=float(len(data))
    bars_plot=ax.patches
    
    for bars in bars_plot:
        percentage = '{:.1f}%'.format(100 * bars.get_height()/total)
        x = bars.get_x() + bars.get_width()/2.0
        y = bars.get_height()
        # Label each bar with its percentage and raw count as a single string
        ax.text(x, y, f"{percentage} ({bars.get_height():.0f})", ha='center',fontweight='bold',fontsize=10)

In [None]:
fig=plt.figure(figsize=(10,5))

#Setting Colour
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")


#Dealing with spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')

#count plot
train_target = sns.countplot(data=df_train,x='target',palette="Set1")
without_hue(df_train,'target',train_target)

In [None]:
nrows=4
ncols=4
f,ax= plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,15))
f.patch.set_facecolor('#F2EDD7FF')

for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
    
sns.histplot(data=df_train,x='f_00',kde=True,palette='Set1',ax=ax[0][0],hue='target')
sns.histplot(data=df_train,x='f_01',kde=True,palette='Set1',ax=ax[0][1],hue='target')
sns.histplot(data=df_train,x='f_02',kde=True,palette='Set1',ax=ax[0][2],hue='target')
sns.histplot(data=df_train,x='f_03',kde=True,palette='Set1',ax=ax[0][3],hue='target')
sns.histplot(data=df_train,x='f_04',kde=True,palette='Set1',ax=ax[1][0],hue='target')
sns.histplot(data=df_train,x='f_05',kde=True,palette='Set1',ax=ax[1][1],hue='target')
sns.histplot(data=df_train,x='f_06',kde=True,palette='Set1',ax=ax[1][2],hue='target')
sns.histplot(data=df_train,x='f_07',kde=True,palette='Set1',ax=ax[1][3],hue='target')
sns.histplot(data=df_train,x='f_08',kde=True,palette='Set1',ax=ax[2][0],hue='target')
sns.histplot(data=df_train,x='f_09',kde=True,palette='Set1',ax=ax[2][1],hue='target')
sns.histplot(data=df_train,x='f_10',kde=True,palette='Set1',ax=ax[2][2],hue='target')
sns.histplot(data=df_train,x='f_11',kde=True,palette='Set1',ax=ax[2][3],hue='target')
sns.histplot(data=df_train,x='f_12',kde=True,palette='Set1',ax=ax[3][0],hue='target')
sns.histplot(data=df_train,x='f_13',kde=True,palette='Set1',ax=ax[3][1],hue='target')
sns.histplot(data=df_train,x='f_14',kde=True,palette='Set1',ax=ax[3][2],hue='target')
sns.histplot(data=df_train,x='f_15',kde=True,palette='Set1',ax=ax[3][3],hue='target')


In [None]:
nrows=4
ncols=4
f,ax= plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,15))
f.patch.set_facecolor('#F2EDD7FF')

for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
    
sns.histplot(data=df_train,x='f_15',kde=True,palette='Set1',ax=ax[0][0],hue='target')
sns.histplot(data=df_train,x='f_16',kde=True,palette='Set1',ax=ax[0][1],hue='target')
sns.histplot(data=df_train,x='f_17',kde=True,palette='Set1',ax=ax[0][2],hue='target')
sns.histplot(data=df_train,x='f_18',kde=True,palette='Set1',ax=ax[0][3],hue='target')
sns.histplot(data=df_train,x='f_19',kde=True,palette='Set1',ax=ax[1][0],hue='target')
sns.histplot(data=df_train,x='f_20',kde=True,palette='Set1',ax=ax[1][1],hue='target')
sns.histplot(data=df_train,x='f_21',kde=True,palette='Set1',ax=ax[1][2],hue='target')
sns.histplot(data=df_train,x='f_22',kde=True,palette='Set1',ax=ax[1][3],hue='target')
sns.histplot(data=df_train,x='f_23',kde=True,palette='Set1',ax=ax[2][0],hue='target')
sns.histplot(data=df_train,x='f_24',kde=True,palette='Set1',ax=ax[2][1],hue='target')
sns.histplot(data=df_train,x='f_25',kde=True,palette='Set1',ax=ax[2][2],hue='target')
sns.histplot(data=df_train,x='f_26',kde=True,palette='Set1',ax=ax[2][3],hue='target')
#sns.histplot(data=df_train,x='f_27',kde=True,palette='Set1',ax=ax[3][0],hue='target')
sns.histplot(data=df_train,x='f_28',kde=True,palette='Set1',ax=ax[3][0],hue='target')
sns.histplot(data=df_train,x='f_29',kde=True,palette='Set1',ax=ax[3][1],hue='target')
sns.histplot(data=df_train,x='f_30',kde=True,palette='Set1',ax=ax[3][2],hue='target')


In [None]:
nrows=4
ncols=4
f,ax= plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,15))
f.patch.set_facecolor('#F2EDD7FF')

for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
    
sns.boxplot(data=df_train,x='f_00',palette='Set1',ax=ax[0][0])
sns.boxplot(data=df_train,x='f_01',palette='Set1',ax=ax[0][1])
sns.boxplot(data=df_train,x='f_02',palette='Set1',ax=ax[0][2])
sns.boxplot(data=df_train,x='f_03',palette='Set1',ax=ax[0][3])
sns.boxplot(data=df_train,x='f_04',palette='Set1',ax=ax[1][0])
sns.boxplot(data=df_train,x='f_05',palette='Set1',ax=ax[1][1])
sns.boxplot(data=df_train,x='f_06',palette='Set1',ax=ax[1][2])
sns.boxplot(data=df_train,x='f_19',palette='Set1',ax=ax[1][3])
sns.boxplot(data=df_train,x='f_20',palette='Set1',ax=ax[2][0])
sns.boxplot(data=df_train,x='f_21',palette='Set1',ax=ax[2][1])
sns.boxplot(data=df_train,x='f_22',palette='Set1',ax=ax[2][2])
sns.boxplot(data=df_train,x='f_23',palette='Set1',ax=ax[2][3])
sns.boxplot(data=df_train,x='f_24',palette='Set1',ax=ax[3][0])
sns.boxplot(data=df_train,x='f_25',palette='Set1',ax=ax[3][1])
sns.boxplot(data=df_train,x='f_26',palette='Set1',ax=ax[3][2])
sns.boxplot(data=df_train,x='f_28',palette='Set1',ax=ax[3][3])


In [None]:
nrows=4
ncols=4
f,ax= plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,15))
f.patch.set_facecolor('#F2EDD7FF')

for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
        
sns.stripplot(data=df_train,y='f_00',x='target',palette='Set1',ax=ax[0][0])
sns.stripplot(data=df_train,y='f_01',x='target',palette='Set1',ax=ax[0][1])
sns.stripplot(data=df_train,y='f_02',x='target',palette='Set1',ax=ax[0][2])
sns.stripplot(data=df_train,y='f_03',x='target',palette='Set1',ax=ax[0][3])
sns.stripplot(data=df_train,y='f_04',x='target',palette='Set1',ax=ax[1][0])
sns.stripplot(data=df_train,y='f_05',x='target',palette='Set1',ax=ax[1][1])
sns.stripplot(data=df_train,y='f_06',x='target',palette='Set1',ax=ax[1][2])
sns.stripplot(data=df_train,y='f_19',x='target',palette='Set1',ax=ax[1][3])
sns.stripplot(data=df_train,y='f_20',x='target',palette='Set1',ax=ax[2][0])
sns.stripplot(data=df_train,y='f_21',x='target',palette='Set1',ax=ax[2][1])
sns.stripplot(data=df_train,y='f_22',x='target',palette='Set1',ax=ax[2][2])
sns.stripplot(data=df_train,y='f_23',x='target',palette='Set1',ax=ax[2][3])
sns.stripplot(data=df_train,y='f_24',x='target',palette='Set1',ax=ax[3][0])
sns.stripplot(data=df_train,y='f_25',x='target',palette='Set1',ax=ax[3][1])
sns.stripplot(data=df_train,y='f_26',x='target',palette='Set1',ax=ax[3][2])
sns.stripplot(data=df_train,y='f_28',x='target',palette='Set1',ax=ax[3][3])


In [None]:
# Top 20 most frequent f_27 strings, kept as a Series so the code
# does not depend on the name pandas gives to the count column
df_f_27 = df_train['f_27'].value_counts().head(20)

fig=plt.figure(figsize=(12,5))

#Setting Colour
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")


#Dealing with spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')


train_target = sns.barplot(x=df_f_27.index,y=df_f_27.values)
plt.xticks(rotation=-90)
plt.show()

[Slide Up](#0)
<a id = 2.2></a>
### **2. Test Dataset**

In [None]:
nrows=4
ncols=4
f,ax= plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,15))
f.patch.set_facecolor('#F2EDD7FF')

for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
    
sns.histplot(data=df_test,x='f_00',kde=True,palette='Set1',ax=ax[0][0])
sns.histplot(data=df_test,x='f_01',kde=True,palette='Set1',ax=ax[0][1])
sns.histplot(data=df_test,x='f_02',kde=True,palette='Set1',ax=ax[0][2])
sns.histplot(data=df_test,x='f_03',kde=True,palette='Set1',ax=ax[0][3])
sns.histplot(data=df_test,x='f_04',kde=True,palette='Set1',ax=ax[1][0])
sns.histplot(data=df_test,x='f_05',kde=True,palette='Set1',ax=ax[1][1])
sns.histplot(data=df_test,x='f_06',kde=True,palette='Set1',ax=ax[1][2])
sns.histplot(data=df_test,x='f_07',kde=True,palette='Set1',ax=ax[1][3])
sns.histplot(data=df_test,x='f_08',kde=True,palette='Set1',ax=ax[2][0])
sns.histplot(data=df_test,x='f_09',kde=True,palette='Set1',ax=ax[2][1])
sns.histplot(data=df_test,x='f_10',kde=True,palette='Set1',ax=ax[2][2])
sns.histplot(data=df_test,x='f_11',kde=True,palette='Set1',ax=ax[2][3])
sns.histplot(data=df_test,x='f_12',kde=True,palette='Set1',ax=ax[3][0])
sns.histplot(data=df_test,x='f_13',kde=True,palette='Set1',ax=ax[3][1])
sns.histplot(data=df_test,x='f_14',kde=True,palette='Set1',ax=ax[3][2])
sns.histplot(data=df_test,x='f_15',kde=True,palette='Set1',ax=ax[3][3])


In [None]:
nrows=4
ncols=4
f,ax= plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,15))
f.patch.set_facecolor('#F2EDD7FF')

for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
    
sns.histplot(data=df_test,x='f_15',kde=True,palette='Set1',ax=ax[0][0])
sns.histplot(data=df_test,x='f_16',kde=True,palette='Set1',ax=ax[0][1])
sns.histplot(data=df_test,x='f_17',kde=True,palette='Set1',ax=ax[0][2])
sns.histplot(data=df_test,x='f_18',kde=True,palette='Set1',ax=ax[0][3])
sns.histplot(data=df_test,x='f_19',kde=True,palette='Set1',ax=ax[1][0])
sns.histplot(data=df_test,x='f_20',kde=True,palette='Set1',ax=ax[1][1])
sns.histplot(data=df_test,x='f_21',kde=True,palette='Set1',ax=ax[1][2])
sns.histplot(data=df_test,x='f_22',kde=True,palette='Set1',ax=ax[1][3])
sns.histplot(data=df_test,x='f_23',kde=True,palette='Set1',ax=ax[2][0])
sns.histplot(data=df_test,x='f_24',kde=True,palette='Set1',ax=ax[2][1])
sns.histplot(data=df_test,x='f_25',kde=True,palette='Set1',ax=ax[2][2])
sns.histplot(data=df_test,x='f_26',kde=True,palette='Set1',ax=ax[2][3])
#sns.histplot(data=df_train,x='f_27',kde=True,palette='Set1',ax=ax[3][0],hue='target')
sns.histplot(data=df_test,x='f_28',kde=True,palette='Set1',ax=ax[3][0])
sns.histplot(data=df_test,x='f_29',kde=True,palette='Set1',ax=ax[3][1])
sns.histplot(data=df_test,x='f_30',kde=True,palette='Set1',ax=ax[3][2])


In [None]:
nrows=4
ncols=4
f,ax= plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,15))
f.patch.set_facecolor('#F2EDD7FF')

for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
    
sns.boxplot(data=df_test,x='f_00',palette='Set1',ax=ax[0][0])
sns.boxplot(data=df_test,x='f_01',palette='Set1',ax=ax[0][1])
sns.boxplot(data=df_test,x='f_02',palette='Set1',ax=ax[0][2])
sns.boxplot(data=df_test,x='f_03',palette='Set1',ax=ax[0][3])
sns.boxplot(data=df_test,x='f_04',palette='Set1',ax=ax[1][0])
sns.boxplot(data=df_test,x='f_05',palette='Set1',ax=ax[1][1])
sns.boxplot(data=df_test,x='f_06',palette='Set1',ax=ax[1][2])
sns.boxplot(data=df_test,x='f_19',palette='Set1',ax=ax[1][3])
sns.boxplot(data=df_test,x='f_20',palette='Set1',ax=ax[2][0])
sns.boxplot(data=df_test,x='f_21',palette='Set1',ax=ax[2][1])
sns.boxplot(data=df_test,x='f_22',palette='Set1',ax=ax[2][2])
sns.boxplot(data=df_test,x='f_23',palette='Set1',ax=ax[2][3])
sns.boxplot(data=df_test,x='f_24',palette='Set1',ax=ax[3][0])
sns.boxplot(data=df_test,x='f_25',palette='Set1',ax=ax[3][1])
sns.boxplot(data=df_test,x='f_26',palette='Set1',ax=ax[3][2])
sns.boxplot(data=df_test,x='f_28',palette='Set1',ax=ax[3][3])


In [None]:
# Top 20 most frequent f_27 strings, kept as a Series so the code
# does not depend on the name pandas gives to the count column
df_f_27_test = df_test['f_27'].value_counts().head(20)

fig=plt.figure(figsize=(12,5))

#Setting Colour
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")


#Dealing with spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')


train_target = sns.barplot(x=df_f_27_test.index,y=df_f_27_test.values)
plt.xticks(rotation=-90)
plt.show()

In [None]:
#Adding features, referred from @AbrosM

#1) Extract features from the f_27 string: one column per character position (0-9),
# holding that character's offset from the letter "A" (A -> 0, B -> 1, ...).

#2) Add another column with the number of unique letters in each row's string.

for data in [df_train,df_test]:
    for i in range(10):
        data['f_27_'+'ch'+str(i)] = data['f_27'].str[i].apply(lambda x : ord(x)-ord('A'))
        

def unique_char(string):
    return(len(set(string)))

df_train['unique_length'] = df_train['f_27'].apply(unique_char)
df_test['unique_length'] = df_test['f_27'].apply(unique_char)
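
To make the two transformations concrete, here is what they produce for a single string. This is a minimal sketch; the string `s` is made up for illustration and only mimics the shape of an `f_27` value (10 uppercase letters).

```python
# Example string shaped like an f_27 value; made up for illustration.
s = "ABABDADBAB"

# Position-wise encoding: offset of each character from 'A' (A -> 0, B -> 1, ...).
char_codes = [ord(ch) - ord("A") for ch in s]
print(char_codes)   # [0, 1, 0, 1, 3, 0, 3, 1, 0, 1]

# Number of distinct characters in the string.
n_unique = len(set(s))
print(n_unique)     # 3
```

Each entry of `char_codes` corresponds to one of the `f_27_ch0` ... `f_27_ch9` columns, and `n_unique` corresponds to `unique_length`.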


In [None]:
df_train1 = df_train.drop(['id','f_27'],axis=1)
df_test1 = df_test.drop(['id','f_27'],axis=1)

In [None]:
df_train1.head()

[Slide Up](#0)
#### **Distribution of alphabets in training dataset**

In [None]:

nrows=3
ncols=4
f,ax= plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,15))
f.patch.set_facecolor('#F2EDD7FF')

for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
    
sns.histplot(data=df_train1,x='f_27_ch0',palette='Set1',ax=ax[0][0],hue="target")
sns.histplot(data=df_train1,x='f_27_ch1',palette='Set1',ax=ax[0][1],binwidth=1,hue='target')
sns.histplot(data=df_train1,x='f_27_ch2',palette='Set1',ax=ax[0][2],hue="target")
sns.histplot(data=df_train1,x='f_27_ch3',palette='Set1',ax=ax[0][3],binwidth=1,hue="target")
sns.histplot(data=df_train1,x='f_27_ch4',palette='Set1',ax=ax[1][0],binwidth=1,hue="target")
sns.histplot(data=df_train1,x='f_27_ch5',palette='Set1',ax=ax[1][1],hue="target")
sns.histplot(data=df_train1,x='f_27_ch6',palette='Set1',ax=ax[1][2],binwidth=1,hue="target")
sns.histplot(data=df_train1,x='f_27_ch7',palette='Set1',ax=ax[1][3],binwidth=1,hue="target")
sns.histplot(data=df_train1,x='f_27_ch8',palette='Set1',ax=ax[2][0],binwidth=1,hue="target")
sns.histplot(data=df_train1,x='f_27_ch9',palette='Set1',ax=ax[2][1],binwidth=1,hue="target")


[Slide Up](#0)
#### **Distribution of alphabets in test dataset**

In [None]:
nrows=3
ncols=4
f,ax= plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,15))
f.patch.set_facecolor('#F2EDD7FF')

for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
    
sns.histplot(data=df_test1,x='f_27_ch0',palette='Set1',ax=ax[0][0])
sns.histplot(data=df_test1,x='f_27_ch1',palette='Set1',ax=ax[0][1],binwidth=1)
sns.histplot(data=df_test1,x='f_27_ch2',palette='Set1',ax=ax[0][2])
sns.histplot(data=df_test1,x='f_27_ch3',palette='Set1',ax=ax[0][3],binwidth=1)
sns.histplot(data=df_test1,x='f_27_ch4',palette='Set1',ax=ax[1][0],binwidth=1)
sns.histplot(data=df_test1,x='f_27_ch5',palette='Set1',ax=ax[1][1])
sns.histplot(data=df_test1,x='f_27_ch6',palette='Set1',ax=ax[1][2],binwidth=1)
sns.histplot(data=df_test1,x='f_27_ch7',palette='Set1',ax=ax[1][3],binwidth=1)
sns.histplot(data=df_test1,x='f_27_ch8',palette='Set1',ax=ax[2][0],binwidth=1)
sns.histplot(data=df_test1,x='f_27_ch9',palette='Set1',ax=ax[2][1],binwidth=1)


#### **TRAIN VS TEST**

1) f_07 (test) does not contain the value 14 but contains 16 instead, which differs slightly from f_07 (train); both have 16 unique categories.

2) f_08 (train) and f_08 (test) have 16 and 14 unique categories respectively.

3) f_09 (train) and f_09 (test) have 15 and 16 unique categories respectively.

4) f_10 (train) and f_10 (test) have 15 and 16 unique categories respectively.

5) Similarly, f_11, f_13, f_14, f_15 and f_16 differ in their number of unique categories.

6) f_29 and f_30 take only a few distinct values, i.e. [0,1] and [0,1,2] respectively.

7) Continuous features in both the train and test datasets look approximately normally distributed. The box plots show outliers, but we keep them to avoid losing data.

8) f_27 is a string column in which every string has length 10. After feature engineering we can observe which letters occur at each position of the string.
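
Differences in category values between train and test, such as the ones listed above, can be found with a set comparison. The sketch below uses two made-up series (`train_col`, `test_col`) in place of the real columns.

```python
import pandas as pd

# Made-up stand-ins for one integer column in train and test.
train_col = pd.Series([0, 1, 2, 3, 14, 2, 1], name="f_07")
test_col = pd.Series([0, 1, 2, 3, 16, 3, 0], name="f_07")

train_values = set(train_col.unique())
test_values = set(test_col.unique())

# Values seen in only one of the two datasets (cast to plain int for printing).
only_in_train = sorted(int(v) for v in train_values - test_values)
only_in_test = sorted(int(v) for v in test_values - train_values)

print("only in train:", only_in_train)
print("only in test :", only_in_test)
```

Running the same comparison in a loop over the integer feature list would reproduce the per-feature observations without reading the printed value lists by eye.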





[Slide Up](#0)
<a id = 3></a>
## **Modelling**

In [None]:
import xgboost as xgb
import lightgbm as lgb
import optuna
from sklearn.model_selection import train_test_split , cross_val_score
# plot_confusion_matrix is not imported: it is unused here and was removed in scikit-learn 1.2
from sklearn.metrics import accuracy_score , roc_auc_score , confusion_matrix ,classification_report


<a id = 3.1></a>
### XGB Model

In [None]:
y= df_train1['target']
x= df_train1.drop(['target'],axis=1)

In [None]:
#Hyper-parameter tuning using optuna
def objective(trial,data=x,target=y):
    
    # Use the function arguments rather than the global x and y
    x_train, x_test, y_train , y_test = train_test_split(data, target, test_size=0.3,random_state=42)
    
    #'tree_method':'gpu_hist',  # this parameter means using the GPU when training our model to speedup the training process
     #'max_depth': trial.suggest_categorical('max_depth', [5,7,9,11,13,15,17,20]),
           
    param = {
        'tree_method':'gpu_hist',
        'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
        'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.008,0.009,0.01,0.012,0.014,0.016,0.018, 0.02]),
        'n_estimators': trial.suggest_int('n_estimators', 50,10000),
        'max_depth': trial.suggest_int('max_depth', 2,20),
        'random_state': trial.suggest_categorical('random_state', [24, 48]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
    }
    
    model = xgb.XGBClassifier(**param)  
    
    model.fit(x_train,y_train,eval_set=[(x_test,y_test)],early_stopping_rounds=100,verbose=False)
    
    pred = model.predict_proba(x_test)[:,1]
    
    roc_score = roc_auc_score(y_test,pred)
    
    return(roc_score)
    

In [None]:
study_xgb= optuna.create_study(direction='maximize')
study_xgb.optimize(objective, n_trials=20)

In [None]:
trial_xgb= study_xgb.best_trial
print(trial_xgb.value)
print(trial_xgb.params)

In [None]:
'''params = {'tree_method':'gpu_hist','lambda': 0.0012300031400918612, 'alpha': 0.3886003728767329, 
          'colsample_bytree': 0.9, 'subsample': 0.6, 
          'learning_rate': 0.018, 'n_estimators': 2796, 
          'max_depth': 12, 'random_state': 24, 'min_child_weight': 2}'''

'''params = {'tree_method':'gpu_hist','lambda': 0.027954391300634148, 'alpha': 0.1774939353037524, 'colsample_bytree': 0.9, 
          'subsample': 0.6, 'learning_rate': 0.016, 'n_estimators': 2661, 'max_depth': 17, 
          'random_state': 48, 'min_child_weight': 9}'''

params= {'tree_method':'gpu_hist','lambda': 7.12453109103749, 'alpha': 0.0012447844254356213, 
         'colsample_bytree': 0.9, 'subsample': 1.0, 'learning_rate': 0.02, 'n_estimators': 5079, 
         'max_depth': 8, 'random_state': 48, 'min_child_weight': 3}


In [None]:

x_train, x_test, y_train , y_test = train_test_split(x, y, test_size=0.3,random_state=42)

model_xgb_tr = xgb.XGBClassifier(**params)

model_xgb_tr.fit(x_train,y_train)


preds = model_xgb_tr.predict_proba(x_test)[:,1]



In [None]:
#Getting Classification Score
print(classification_report(y_test,preds.round()))



In [None]:
#Getting Roc Score
print(roc_auc_score(y_test,preds))
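
ROC-AUC depends only on how the scores rank positives against negatives, which is why predicted probabilities rather than rounded labels are passed to it. The sketch below is a hand-written illustration of that definition on four made-up points; `pairwise_auc` is a hypothetical helper, not part of scikit-learn, and is far too slow for the real dataset.

```python
def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(labels, scores)
print(auc_value)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75.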

In [None]:
#model_xgb = xgb.XGBClassifier(**params)
model_xgb_tr.fit(x,y)


#Prediction
pred=model_xgb_tr.predict_proba(df_test1)[:,1]

In [None]:
#Getting Submission Dataframe
dataframe = pd.DataFrame({'id':df_test['id'],'target':pred})

In [None]:
dataframe.head()

In [None]:
#Saving File for submission
dataframe.to_csv("Optuna+XGB+more estimators2.csv",index=False)

[Slide Up](#0)
<a id = 3.2></a>
### **Neural Networks**

In [None]:
from tensorflow.keras.models import Model , Sequential
from tensorflow.keras.layers import Dense, Flatten , Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint , EarlyStopping , ReduceLROnPlateau
from tensorflow.keras import regularizers , metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# plot_confusion_matrix and plot_roc_curve are not imported: they are unused here and were removed in scikit-learn 1.2
from sklearn.metrics import roc_auc_score , classification_report , accuracy_score

In [None]:
y_nn= df_train1['target']
x_nn= df_train1.drop(['target'],axis=1)

In [None]:
ss = StandardScaler()
x_nn = pd.DataFrame(ss.fit_transform(x_nn))

#df_test1 = ss.fit_transform(df_test1)
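
A note on scaling: a scaler is usually fitted on the training data only, and the same statistics are then applied to the test data so that both are on the same scale. The sketch below shows the idea with plain NumPy on synthetic data; it is a hand-rolled stand-in for the fit and transform steps of `StandardScaler`, and all names in it are illustrative.

```python
import numpy as np

# Synthetic train and test matrices drawn from slightly different distributions.
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
test = rng.normal(loc=5.5, scale=2.0, size=(200, 3))

# "Fit": learn the statistics from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)

# "Transform": apply the same training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(3))  # approximately zero by construction
print(test_scaled.mean(axis=0).round(3))   # keeps the real shift between train and test
```

Fitting a second scaler on the test set would instead force its mean to zero and hide any shift between the two datasets from the model.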

In [None]:
x_nn.head()

In [None]:
x_train_nn, x_test_nn, y_train_nn , y_test_nn = train_test_split(x_nn, y_nn, test_size=0.3,random_state=42)


In [None]:
model = Sequential()

# shape must be a tuple, hence the trailing comma
model.add(Input(shape=(len(x_nn.columns),)))

#model.add(Dense(512, activation='swish',kernel_regularizer= regularizers.l2(1e-5)))

model.add(Dense(64, activation='swish',kernel_regularizer= regularizers.l2(1e-5)))

model.add(Dense(64 , activation='swish',kernel_regularizer= regularizers.l2(1e-5)))

model.add(Dense(64, activation='swish',kernel_regularizer= regularizers.l2(1e-5)))

model.add(Dense(16,activation='swish',kernel_regularizer= regularizers.l2(1e-5)))

model.add(Dense(1 , activation='sigmoid'))


model.summary()

In [None]:
model.compile(optimizer='Adam',loss='binary_crossentropy',metrics = [metrics.AUC(name='auc')])


In [None]:
lr = ReduceLROnPlateau(monitor='val_loss',patience=5,factor=0.7,verbose=1)
es = EarlyStopping(monitor='val_loss',patience=10,verbose=1)
mc = ModelCheckpoint(filepath='/content',monitor='val_loss',verbose=1,save_best_only=True)

In [None]:
history = model.fit(x_train_nn,y_train_nn,validation_data=(x_test_nn,y_test_nn),
                    callbacks=[lr,es,mc],epochs=200,batch_size=2048)

In [None]:
fig=plt.figure(figsize=(10,5))
fig.patch.set_color("#F2EDD7FF")

ax=plt.axes()
ax.set_facecolor("#F2EDD7FF")
fig.patch.set_color("#F2EDD7FF")

plt.plot(history.history['loss'],color='b',label='Training Loss')
plt.plot(history.history['val_loss'],color='r',label='Validation Loss')

plt.xlabel("epochs")
plt.ylabel("loss_value")
plt.title("loss")
plt.legend()
plt.show()

In [None]:
preds= model.predict(x_test_nn).squeeze()

print("***** Accuracy of Neural Network *****")
print(accuracy_score(y_test_nn,preds.round()))

print()
print("***** Classification Report of Neural Network *****")
print(classification_report(y_test_nn,preds.round()))

print()
print("***** ROC-AUC Score of Neural Network *****")
print(roc_auc_score(y_test_nn,preds))


In [None]:
model.fit(x_nn,y_nn,epochs=300,batch_size=2048)

In [None]:
# Reuse the scaler fitted on the training data; do not refit it on the test set
df_test1= ss.transform(df_test1)

preds_test= model.predict(df_test1).squeeze()

In [None]:
preds_test

In [None]:
dataframe = pd.DataFrame({'id':df_test['id'],'target':preds_test})

In [None]:
dataframe.to_csv("NN16.csv",index=False)

> ### **Thanks for scrolling down this far. Please share any suggestions, and upvote if you find that this notebook does some justice to the dataset ✌😊**