<html>
    <a id='toc'></a>
    <h1 style="color:#990011FF; background-color:#FCF6F5FF; font-size:55px; border:10px solid brown ; padding:15px;"><center><b>TABLE OF CONTENTS ⏩</b></center></h1>
</html>

* [Motivation 💪](#1)
* [Dataset 📋](#2)
* [Overview 📺](#3)
* [Visualizations 📉](#4)
  * [Univariate Analysis](#4.1)
    * [Categorical Features](#4.1.1)
    * [Continuous Features](#4.1.2)
  * [Bivariate Analysis](#4.2)
    * [Continuous vs Categorical](#4.2.1)
    * [Continuous vs Continuous](#4.2.2)
  * [Multivariate Analysis](#4.3)
* [Model and Prediction 🧭](#5)
  * [Preprocessing](#5.1)
  * [Logistic Regression](#5.2)
  * [Random Forest Classifier](#5.3)
    * [Feature Importance of RF](#5.3.1)
  * [LightGBM Classifier](#5.4)
    * [Feature Importance of LGBM](#5.4.1)

[Table of Contents](#toc)
<html>
    <a id="1"></a>
    <h1 style="color:#990011FF; background-color:#FCF6F5FF; font-size:55px; border:10px solid brown ; padding:15px;"><center><b>1. MOTIVATION 💪</b></center></h1>
</html>

![Heart Attack](https://media.sciencephoto.com/image/c0249835/800wm/C0249835-Heart_Attack_and_Atherosclerosis.jpg)

**What is a heart attack?**

A heart attack happens when the flow of oxygen-rich blood in one or more of the coronary arteries, which supply the heart muscle, suddenly becomes blocked, and a section of heart muscle can’t get enough oxygen. The blockage is usually caused when a plaque ruptures. If blood flow isn’t restored quickly, either by a medicine that dissolves the blockage or a catheter placed within the artery that physically opens the blockage, the section of heart muscle begins to die.

**Why do we need to predict the chance of a heart attack early?**

1. Early detection is one of the most crucial steps in saving a person's life.
2. Heart attacks have become very common in India, and often lethal. India has more than one crore (10 million) cases every year.
3. If we can predict the chance of a heart attack through machine learning, it will be a great breakthrough in medical science.
4. We have a dataset that lets us predict it to some extent, but we need more data and features to make our model more robust.

[Table of contents](#toc)
<html>
    <a id="2"></a>
    <h1 style="color:#990011FF; background-color:#FCF6F5FF; font-size:55px; border:10px solid brown ; padding:15px;"><center><b>2. DATASET 📋</b></center></h1>
</html>

1. **age** - age in years

2. **sex** - sex (1 = male; 0 = female)

3. **cp** - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)

4. **trestbps** - resting blood pressure (in mm Hg on admission to the hospital)

5. **chol** - serum cholesterol in mg/dl

6. **fbs** - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

7. **restecg** - resting electrocardiographic results (0 = normal; 1 = having ST-T wave abnormality; 2 = hypertrophy)

8. **thalach** - maximum heart rate achieved

9. **exang** - exercise induced angina (1 = yes; 0 = no)

10. **oldpeak** - ST depression induced by exercise relative to rest

11. **slope** - the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)

12. **ca** - number of major vessels (0-3) colored by fluoroscopy

13. **thal** - thallium stress test result (0 to 3)

14. **output** - 0 (less chance of heart attack) and 1 (more chance of heart attack)
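
The names in this list follow the dataset documentation, while the code cells below use the shorter column names found in `heart.csv` (for example `trtbps`, `thalachh`, `exng`). The lookup below is a small sketch of how the two naming schemes line up. The mapping is inferred from the column lists used later in this notebook, so treat it as an assumption rather than an official schema; `DOC_TO_COLUMN` and `to_column` are hypothetical names.

```python
# Assumed mapping: documented feature name -> column name used in the code cells.
# Inferred from the cat_columns / con_columns lists in this notebook.
DOC_TO_COLUMN = {
    "age": "age",
    "sex": "sex",
    "cp": "cp",
    "trestbps": "trtbps",
    "chol": "chol",
    "fbs": "fbs",
    "restecg": "restecg",
    "thalach": "thalachh",
    "exang": "exng",
    "oldpeak": "oldpeak",
    "slope": "slp",
    "ca": "caa",
    "thal": "thall",
    "Output": "output",
}


def to_column(doc_name):
    """Return the DataFrame column that corresponds to a documented name."""
    return DOC_TO_COLUMN[doc_name]


print(to_column("thalach"))
```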

<html>
    <a id="3"></a>
    <h1 style="color:#990011FF; background-color:#FCF6F5FF; font-size:55px; border:10px solid brown; padding:15px;"><center><b>3. OVERVIEW 📺</b></center></h1>
</html>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
df=pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
print("***** Shape of the dataset *****")
df.shape

In [None]:
print("***** First five rows *****")
df.head()

In [None]:
print("***** Column names in dataset *****")
print(list(df.columns))

In [None]:
print("***** Basic Information about dataset *****")
print()
df.info()

In [None]:
print("***** Description of data *****")
df.describe().T.style.bar(subset=['mean'],color='#205ff2').background_gradient(subset=['std','25%','50%','75%'],cmap="coolwarm")

In [None]:
cat_columns=['sex' , 'exng' , 'caa' , 'cp' , 'fbs' , 'restecg' , 'thall' , 'slp']
print("***** Unique values in categorical features *****")
print()

for i in cat_columns:
    print("Unique values in",i,'feature are : ',end=" ")
    print(df[i].unique())
    print()

In [None]:
cat_columns=['sex' , 'exng' , 'caa' , 'cp' , 'fbs' , 'restecg' , 'thall' , 'slp']
print("***** Value counts in categorical features *****")
print()

for i in cat_columns:
    print("Value counts of",i,'feature are : ')
    print(df[i].value_counts())
    print()

In [None]:
print("***** dtypes of columns in dataset ****")
print(df.dtypes)

In [None]:
df['output'].value_counts()

[Table of contents](#toc)
<html>
    <a id="4"></a>
    <h1 style="color:#990011FF; background-color:#FCF6F5FF; font-size:55px; border:10px solid brown; padding:15px;"><center><b>4. VISUALIZATIONS 📉</b></center></h1>
</html>

In [None]:
def with_hue(data,feature,ax):
    
    #Number of categories (x==x is False for NaN, so missing values are skipped)
    num_of_cat=len([x for x in data[feature].unique() if x==x])
    
    bars=ax.patches
    
    for ind in range(num_of_cat):
        ##     Get every hue bar
        ##     ex. 8 X categories, 4 hues =>
        ##    [0, 8, 16, 24] are hue bars for 1st X category
        hueBars=bars[ind:][::num_of_cat] 
        # Get the total height (for percentages)
        total=sum([x.get_height() for x in hueBars])
        #Printing percentages on bar
        for bar in hueBars:
            percentage='{:.1f}%'.format(100 * bar.get_height()/total)
            ax.text(bar.get_x()+bar.get_width()/2.0,
                   bar.get_height(),
                   percentage,
                    ha="center",va="bottom",fontweight='bold',fontsize=14)
    

    
def without_hue(data,feature,ax):
    
    total=float(len(data))
    bars_plot=ax.patches
    
    for bars in bars_plot:
        percentage = '{:.1f}%'.format(100 * bars.get_height()/total)
        x = bars.get_x() + bars.get_width()/2.0
        y = bars.get_height()
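        #Label each bar as "percentage (count)" instead of passing a raw tuple as the text
        ax.text(x, y,'{} ({})'.format(percentage,int(bars.get_height())),ha='center',fontweight='bold',fontsize=14)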



In [None]:
sns.set_theme(context="notebook",style="white",font_scale=2)
fig=plt.figure(figsize=(18,7))

#Setting plot and background color
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

#Dealing with spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')

#countplot
c=sns.countplot(data=df,x='output',palette='rocket')

#percentage on bar plots
without_hue(df,'output',c)

[Table of Contents](#toc)

<html>
    <a id='4.1'></a>
    <h1 style="color:coral; background-color:teal; font-size:25px; padding:15px;"><center><b>4.1. UNIVARIATE ANALYSIS</b></center></h1>
</html>

In [None]:
df.head()

**Categorical Features**: Sex , exng , caa , cp , fbs , restecg , thall , slp

**Continuous Features** : Age ,  trtbps , chol , thalachh , oldpeak

[Table of Contents](#toc)

<html>
    <a id='4.1.1'></a>
    <h1 style="color:#101820FF; background-color:#FEE715FF; font-size:20px; padding:15px;"><center><b>4.1.1. CATEGORICAL FEATURES</b></center></h1>
</html>

**Categorical Features**: Sex , exng , caa , cp , fbs , restecg , thall , slp


In [None]:
# I have tried to make my code more dynamic here so that it can be easily reusable 
def plotting_cat_features(nrows,ncols,cat_columns):
    
    f,ax=plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,34))
    f.patch.set_facecolor('#F2EDD7FF')

    #Setting background and foreground color
    for i in range(0,nrows):
        for j in range(0,ncols):
            ax[i][j].set_facecolor('#F2EDD7FF')

    #Plotting count plot 
    for i in range(0,nrows):
        for j in range(0,ncols):
            if(i==0): #For [0,0] sub plot
                if(j==0):
                    ax[i][j].spines['bottom'].set_visible(False)
                    ax[i][j].spines['left'].set_visible(False)
                    ax[i][j].spines['top'].set_visible(False)
                    ax[i][j].spines['right'].set_visible(False)
                    
                    ax[i][j].tick_params(left=False,bottom=False)
                    ax[i][j].set_xticklabels([])
                    ax[i][j].set_yticklabels([])
                    ax[i][j].text(0.5,0.5,"Count plot of\ncategorical features",
                                    horizontalalignment="center",
                                    verticalalignment='center',
                                    fontweight='bold',fontsize=20,fontstyle='italic')
                elif(j==1): #For [0,1] subplot
                    ax[i][j].spines['bottom'].set_visible(False)
                    ax[i][j].spines['left'].set_visible(False)
                    ax[i][j].spines['top'].set_visible(False)
                    ax[i][j].spines['right'].set_visible(False)
                    
                    ax[i][j].tick_params(left=False,bottom=False)
                    ax[i][j].set_xticklabels([])
                    ax[i][j].set_yticklabels([])
                    ax[i][j].text(0.5,0.5,"Count plot with respect to\ntarget",
                                    horizontalalignment="center",
                                    verticalalignment='center',
                                    fontweight='bold',fontsize=20,fontstyle='italic')

            else:
                #Without hueness
                if(j==0):
                    a1=sns.countplot(data=df,x=cat_columns[i-1],palette='rocket',ax=ax[i][j])
                    without_hue(df,cat_columns[i-1],a1)
                #With hueness
                elif(j==1):
                    a2=sns.countplot(data=df,x=cat_columns[i-1],hue='output',ax=ax[i][j],palette='rocket')
                    with_hue(df,cat_columns[i-1],a2)
                
                #Dealing with spines
                ax[i][j].spines['top'].set_visible(False)
                ax[i][j].spines['right'].set_visible(False)
                ax[i][j].spines['left'].set_visible(False)
                ax[i][j].grid(linestyle="--",axis='y',color='gray')
        
        
    

In [None]:
#First four columns
cat_columns=['sex' , 'exng' , 'caa' , 'cp']        
plotting_cat_features(5,2,cat_columns)   

In [None]:
cat_columns=['fbs' , 'restecg' , 'thall' , 'slp'] 
plotting_cat_features(5,2,cat_columns)

<html>
    <body style="background-color:LIGHTBLUE;">
        <h1 style="color:PINK;">Observations from univariate analysis of categorical features</h1>
    </body>
</html>

1. **Sex**: There are 96 females in the data, 75% of whom have a higher chance of a heart attack; among the 207 males the share is 45%. According to this data, females are therefore more likely than males to be in the higher-chance class.

2. **exng**: People whose 'exercise induced angina' is 0 are more likely to have a higher chance of a heart attack.

3. **caa**: People with number of major vessels = 0 are more prone to heart attacks.

4. **Chest pain (cp)**: People with atypical chest pain have the highest chance of a heart attack, at 82%, while the other chest pain types also indicate a risk.

5. **Fasting blood sugar (fbs)**: Whether blood sugar is above or below 120 makes little difference; in both groups more than 50% of people have a higher chance of a heart attack.

6. **Resting electrocardiographic results (restecg)**: 63.2% of people with ST-T wave abnormality are in the higher-chance class, but 46.3% among the normal cases is also a large share.

7. **Thallium stress results (thall)**: People with thallium stress 3 have a higher chance, at 78.3%.
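
The percentages quoted above are read off the annotated count plots. They can also be computed directly with a row-normalised cross-tabulation. The sketch below uses a tiny made-up frame (not the real data) with the same `sex` and `output` column names; `demo` and `rates` are illustrative names.

```python
import pandas as pd

# Tiny synthetic stand-in for the real data (values are made up).
demo = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "output": [1, 1, 1, 0, 1, 0, 0, 1, 0, 0],
})

# normalize="index" makes each row sum to 1, i.e. the share of each
# output class within every category of the feature.
rates = pd.crosstab(demo["sex"], demo["output"], normalize="index") * 100
print(rates.round(1))
```

On the real data the same call, for example `pd.crosstab(df['sex'], df['output'], normalize='index')`, should reproduce the per-category shares shown on the hue plots.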

[Table of Contents](#toc)

<html>
    <a id='4.1.2'></a>
    <h1 style="color:#101820FF; background-color:#FEE715FF; font-size:20px; padding:15px;"><center><b>4.1.2. CONTINUOUS FEATURES</b></center></h1>
</html>

**Continuous Features** : Age ,  trtbps , chol , thalachh , oldpeak

**HISTPLOT OF CONTINUOUS FEATURES**

In [None]:
def plotting_con_features(nrows,ncols,con_features):
    f,ax=plt.subplots(nrows=nrows,ncols=ncols,figsize=(20,34))
    f.patch.set_facecolor('#F2EDD7FF')

    #Setting background and foreground color
    for i in range(0,nrows):
        for j in range(0,ncols):
            ax[i][j].set_facecolor('#F2EDD7FF')

    #Plotting histograms
    for i in range(0,nrows):
        for j in range(0,ncols):
            if(i==0): #For [0,0] sub plot
                if(j==0):
                    ax[i][j].spines['bottom'].set_visible(False)
                    ax[i][j].spines['left'].set_visible(False)
                    ax[i][j].spines['top'].set_visible(False)
                    ax[i][j].spines['right'].set_visible(False)
                    
                    ax[i][j].tick_params(left=False,bottom=False)
                    ax[i][j].set_xticklabels([])
                    ax[i][j].set_yticklabels([])
                    ax[i][j].text(0.5,0.5,"Histplot of\ncontinuous features",
                                    horizontalalignment="center",
                                    verticalalignment='center',
                                    fontweight='bold',fontsize=20,fontstyle='italic')
                elif(j==1): #For [0,1] subplot
                    ax[i][j].spines['bottom'].set_visible(False)
                    ax[i][j].spines['left'].set_visible(False)
                    ax[i][j].spines['top'].set_visible(False)
                    ax[i][j].spines['right'].set_visible(False)
                    
                    ax[i][j].tick_params(left=False,bottom=False)
                    ax[i][j].set_xticklabels([])
                    ax[i][j].set_yticklabels([])
                    ax[i][j].text(0.5,0.5,"Histplot with respect to\ntarget",
                                    horizontalalignment="center",
                                    verticalalignment='center',
                                    fontweight='bold',fontsize=20,fontstyle='italic')

            else:        
                #Without hueness
                if(j==0):
                    #Use the con_features argument rather than the global con_columns
                    a1=sns.histplot(data=df,x=con_features[i-1],ax=ax[i][j],kde=True)
                #With hueness
                elif(j==1):
                    a2=sns.histplot(data=df,x=con_features[i-1],hue='output',ax=ax[i][j],palette='rocket',multiple='stack',kde=True)

                #Dealing with spines
                ax[i][j].spines['top'].set_visible(False)
                ax[i][j].spines['right'].set_visible(False)
                ax[i][j].spines['left'].set_visible(False)
                ax[i][j].grid(linestyle="--",axis='y',color='gray')
        
        
    

In [None]:
con_columns=['age' , 'trtbps' , 'chol' , 'thalachh' , 'oldpeak']
plotting_con_features(6,2,con_columns)

**BOXEN PLOT OF CONTINUOUS FEATURES**

In [None]:
nrows=3
ncols=2
f,ax=plt.subplots(nrows=nrows,ncols=ncols,figsize=(18,20))

#Setting background and foreground color
f.patch.set_facecolor('#F2EDD7FF')
for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')

ax[0][0].spines['bottom'].set_visible(False)
ax[0][0].spines['left'].set_visible(False)
ax[0][0].spines['top'].set_visible(False)
ax[0][0].spines['right'].set_visible(False)

ax[0][0].tick_params(left=False,bottom=False)
ax[0][0].set_xticklabels([])
ax[0][0].set_yticklabels([])
ax[0][0].text(0.5,0.5,"BoxenPlot of continuous features",
            horizontalalignment="center",
            verticalalignment='center',
            fontweight='bold',fontsize=20)

for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        

        

sns.boxenplot(data=df,x='age',palette='rocket',ax=ax[0][1])

sns.boxenplot(data=df,x='trtbps',palette='gnuplot',ax=ax[1][0])

sns.boxenplot(data=df,x='chol',palette='rocket',ax=ax[1][1])

sns.boxenplot(data=df,x='thalachh',palette='gnuplot',ax=ax[2][0])

sns.boxenplot(data=df,x='oldpeak',palette='rocket',ax=ax[2][1])


<html>
        <h1 style="color:PINK;">Observations from univariate analysis of continuous features</h1>
</html>

1. Age is approximately normally distributed with little variance, and the result is the same after applying a log transform, so we will keep the original age values. The boxen plot also shows some outliers in age, but with only 303 rows, removing them would cost us data.

2. In their original form, 'trtbps' and 'chol' are right-skewed and 'thalachh' is left-skewed.

3. After applying a log transform, 'trtbps' and 'chol' become less skewed and close to a normal distribution, while 'thalachh' remains left-skewed.

4. We will go ahead with the original form of the data, with no changes.

5. All continuous features have outliers, but we will not remove them because that would lose data.
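
The log-transform comparison mentioned above is not shown in a cell. A minimal sketch of how such a check could be done follows, using a synthetic right-skewed series in place of a real column. The 1.5 × IQR rule used for counting outliers is an assumed convention here and is not necessarily what the boxen plot marks as outliers.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a column such as 'chol'.
rng = np.random.default_rng(1)
chol_like = pd.Series(rng.lognormal(mean=5.5, sigma=0.25, size=303))

# Sample skewness before and after the log transform.
skew_raw = chol_like.skew()
skew_log = np.log(chol_like).skew()
print("skew raw:", round(skew_raw, 3), "| skew after log:", round(skew_log, 3))

# Count points outside 1.5 * IQR of the quartiles (assumed outlier rule).
q1, q3 = chol_like.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (chol_like < q1 - 1.5 * iqr) | (chol_like > q3 + 1.5 * iqr)
n_outliers = int(is_outlier.sum())
print("outliers by the 1.5*IQR rule:", n_outliers)
```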

[Table of Contents](#toc)

<html>
    <a id='4.2'></a>
    <h1 style="color:coral; background-color:teal; font-size:25px; padding:15px;"><center><b>4.2. BIVARIATE ANALYSIS</b></center></h1>
</html>


<html>
    <a id='4.2.1'></a>
    <h1 style="color:#101820FF; background-color:#FEE715FF; font-size:20px; padding:15px;"><center><b>4.2.1. CONTINUOUS VS CATEGORICAL</b></center></h1>
</html>

* **First, we will see how our continuous features relate to our categorical features:**

 Sex , exng , caa , cp , fbs , restecg , thall , slp

We cannot conclude anything from blood pressure and old peak, as they are nearly uniformly distributed across all the categorical features, so we will not plot them against the categorical features.

**Age**

In [None]:
nrows=4
ncols=2
f,ax=plt.subplots(nrows=nrows,ncols=ncols,figsize=(18,24))
cat_columns=['sex','exng','caa','cp','fbs','restecg','thall','slp']
n=len(cat_columns)
f.patch.set_facecolor('#F2EDD7FF')
for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        #Row-major index into cat_columns: i*ncols+j
        sns.stripplot(data=df,x=cat_columns[i*ncols+j],y='age',palette='gnuplot',ax=ax[i][j])
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
        


* **caa (number of major vessels)** shows a slightly +ve correlation with age
* **Chest pain** occurs mostly in people aged 40 or above
* People with fasting blood sugar greater than 120 are mostly above 40



**chol** - serum cholesterol in mg/dl

In [None]:
nrows=4
ncols=2
f,ax=plt.subplots(nrows=nrows,ncols=ncols,figsize=(18,24))
cat_columns=['sex','exng','caa','cp','fbs','restecg','thall','slp']

f.patch.set_facecolor('#F2EDD7FF')
for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        #Row-major index into cat_columns: i*ncols+j
        sns.stripplot(data=df,x=cat_columns[i*ncols+j],y='chol',palette='gnuplot',ax=ax[i][j])
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
        

* People with chest pain type 3 have higher cholesterol than those with the other types
* People with restecg types 0 and 2 have slightly higher cholesterol than those with restecg type 1


**thalachh** - maximum heart rate achieved

In [None]:
nrows=4
ncols=2
f,ax=plt.subplots(nrows=nrows,ncols=ncols,figsize=(18,24))
cat_columns=['sex','exng','caa','cp','fbs','restecg','thall','slp']

f.patch.set_facecolor('#F2EDD7FF')
for i in range(0,nrows):
    for j in range(0,ncols):
        ax[i][j].set_facecolor('#F2EDD7FF')
        ax[i][j].spines['left'].set_visible(False)
        ax[i][j].spines['top'].set_visible(False)
        ax[i][j].spines['right'].set_visible(False)
        #Row-major index into cat_columns: i*ncols+j
        sns.stripplot(data=df,x=cat_columns[i*ncols+j],y='thalachh',palette='gnuplot',ax=ax[i][j])
        ax[i][j].grid(linestyle="--",axis='y',color='gray')
        

* People with chest pain types 1, 2 and 3 have higher heart rates
* People with blood sugar > 120 have high heart rates, with a few exceptions
* People with restecg type 2 have higher heart rates
* People with slp type 2 have higher heart rates than those with the other slp types

[Table of Contents](#toc)

<html>
    <a id='4.2.2'></a>
    <h1 style="color:#101820FF; background-color:#FEE715FF; font-size:20px; padding:15px;"><center><b>4.2.2. CONTINUOUS VS CONTINUOUS</b></center></h1>
</html>

In [None]:
fig=plt.figure(figsize=(18,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

sns.lineplot(data=df,x='age',y='chol',hue='output',palette='rocket')

In [None]:
fig=plt.figure(figsize=(18,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

sns.lineplot(data=df,x='age',y='trtbps',hue='output',palette='rocket')

<html>
        <h1 style="color:PINK;">Observations from Bivariate analysis of features</h1>
</html>

* As age increases, cholesterol level increases

* For people with a higher chance of a heart attack, blood pressure remains high as age increases

* People with chest pain types 1, 2 and 3, i.e. chest pain other than asymptomatic, have high heart rates

* People with chest pain types 1, 2 and 3 who are above the age of 40 have a high chance of a heart attack

* Most people above age 40 have high blood sugar


[Table of Contents](#toc)

<html>
    <a id='4.3'></a>
    <h1 style="color:coral; background-color:teal; font-size:25px; padding:15px;"><center><b>4.3. MULTIVARIATE ANALYSIS</b></center></h1>
</html>

In [None]:
plt.figure(figsize=(30,20))
plt.title("Heatmap of Features",fontsize=30,fontweight='bold',fontstyle='italic')
sns.heatmap(df.corr(),annot=True,linewidths=3)

According to the heatmap above, no two features are strongly correlated, either positively or negatively, so no corrective action is needed.

**The strongest (+ve and -ve) correlations are:**


* **Age vs thalachh** = -0.4 correlated
* **cp vs output** = 0.43 correlated
* **thalachh vs output** = 0.42 correlated
* **exng vs output** = -0.44 correlated
* **slp vs oldpeak** = -0.58 correlated
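
Pairs like these can be pulled out of the correlation matrix programmatically instead of being read off the heatmap. The sketch below shows one way to do it on synthetic data; `top_correlations` is a hypothetical helper, and the columns `a`, `b`, `c` are made up for the demonstration.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is built to be negatively correlated with 'a'.
rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
demo = pd.DataFrame({
    "a": a,
    "b": -0.8 * a + rng.normal(scale=0.5, size=n),
    "c": rng.normal(size=n),
})


def top_correlations(frame, k=5):
    """Return the k feature pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order].head(k)


top = top_correlations(demo, k=2)
print(top)
```

Applied to the notebook's data, `top_correlations(df)` would list the strongest pairs with their signs.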

In [None]:
#pairplot of continuous variables
x_vars=['age','trtbps','chol','thalachh','oldpeak']
y_vars=['age','trtbps','chol','thalachh','oldpeak']

sns.pairplot(data=df,x_vars=x_vars,y_vars=y_vars,hue='output',palette='gnuplot')

[Table of contents](#toc)
<html>
    <a id="5"></a>
    <h1 style="color:#990011FF; background-color:#FCF6F5FF; font-size:50px; border:10px solid brown; padding:15px;"><center><b>5. MODEL AND PREDICTION 🧭</b></center></h1>
</html>

[Table of Contents](#toc)

<html>
    <a id='5.1'></a>
    <h1 style="color:coral; background-color:teal; font-size:25px; padding:15px;"><center><b>5.1. PREPROCESSING</b></center></h1>
</html>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.preprocessing import StandardScaler
#plot_roc_curve is not used below and has been removed from recent scikit-learn releases
from sklearn.metrics  import accuracy_score,classification_report,roc_auc_score,roc_curve,auc,f1_score
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
import optuna

The data is not heavily imbalanced (138 -ve cases and 165 +ve cases), but we will still use SMOTE oversampling so that our model has an equal opportunity to learn from +ve and -ve cases.
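
To put "not heavily imbalanced" into numbers, the class shares can be worked out from the two counts quoted above. This is plain arithmetic on those counts, with no other data assumed.

```python
# Class counts quoted in the text above.
neg, pos = 138, 165
total = neg + pos

share_pos = pos / total   # fraction of +ve cases
share_neg = neg / total   # fraction of -ve cases
ratio = neg / pos         # minority-to-majority ratio

print("total rows:", total)
print("share +ve: {:.1%}, share -ve: {:.1%}".format(share_pos, share_neg))
print("minority/majority ratio: {:.3f}".format(ratio))
```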

In [None]:
#SCALING (the one-hot encoding step is left commented out below)
df1=df.copy()

cat_features=['sex','exng','caa','cp','fbs','restecg','thall','slp']
con_features=['age','trtbps','chol','thalachh','oldpeak']

#df1=pd.get_dummies(df1,columns=cat_features,drop_first=True)

ss=StandardScaler()
df1[con_features]=ss.fit_transform(df1[con_features])

In [None]:
#First we will split data into training and testing set before oversampling to avoid any data leakage

Y=df1['output']
X=df1.drop('output',axis=1)
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,random_state=42)
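
Note that the scaling cell above fits `StandardScaler` on the full dataset before this split, so test-set statistics still reach the training features. A minimal sketch of one leakage-free ordering is shown below on synthetic data: split first, fit the scaler on the training part only, then apply it to the test part. The frame `demo` and its values are made up, and this is one possible arrangement rather than the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two continuous columns and the target (made-up values).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age": rng.normal(54, 9, 200),
    "chol": rng.normal(246, 50, 200),
    "output": rng.integers(0, 2, 200),
})
con_cols = ["age", "chol"]

X_demo = demo.drop("output", axis=1)
y_demo = demo["output"]

# 1) Split first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr, X_te = X_tr.copy(), X_te.copy()

# 2) Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler()
X_tr[con_cols] = scaler.fit_transform(X_tr[con_cols])
X_te[con_cols] = scaler.transform(X_te[con_cols])

print(X_tr[con_cols].mean().round(6).to_dict())
```

Oversampling would then be applied to `X_tr`, `y_tr` only, as the next cell does with the training split.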

In [None]:
#OverSampling
#Fix the seed so the synthetic samples are reproducible
smt=SMOTE(random_state=42)
x_train_sampling,y_train_sampling=smt.fit_resample(x_train,y_train)

In [None]:
x_train_sampling.head()

In [None]:
y_train_sampling.value_counts()

[Table of Contents](#toc)

<html>
    <a id='5.2'></a>
    <h1 style="color:coral; background-color:teal; font-size:25px; padding:15px;"><center><b>5.2. LOGISTIC REGRESSION</b></center></h1>
</html>

In [None]:
clf_logistic=LogisticRegression(random_state=0)

clf_logistic.fit(x_train_sampling,y_train_sampling)

y_pred_proba=clf_logistic.predict_proba(x_test)

y_pred=np.argmax(y_pred_proba,axis=1)

print("Accuracy of Logistic Regression : ",accuracy_score(y_test,y_pred))

In [None]:
ans=classification_report(y_test,y_pred)
print("***** Classification report of Logistic Regression is *****")
print()
print(ans)

In [None]:
print("F1 Score with oversampling : ", f1_score(y_test,y_pred))

In [None]:
fig=plt.figure(figsize=(18,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

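#ROC needs scores, not hard labels, so use the positive-class probability
fpr,tpr,_=roc_curve(y_test,y_pred_proba[:,1])

plt.title('Logistic Regression ROC curve')
plt.xlabel('FPR (1 - Specificity)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
#Diagonal reference line of a random classifier
plt.plot((0,1),(0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))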
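
A ROC curve traces TPR against FPR across all decision thresholds, so it is normally computed from continuous scores. When hard 0/1 predictions are passed instead, only one threshold is available and the curve collapses to at most three points. The sketch below contrasts the two on synthetic data generated with `make_classification`; the data and model settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)           # 0/1 predictions
scores = model.predict_proba(X_te)[:, 1]    # probability of class 1

fpr_h, tpr_h, _ = roc_curve(y_te, hard_labels)
fpr_s, tpr_s, _ = roc_curve(y_te, scores)
auc_h = auc(fpr_h, tpr_h)
auc_s = auc(fpr_s, tpr_s)

print("points on curve  -> labels:", len(fpr_h), "| scores:", len(fpr_s))
print("area under curve -> labels:", round(auc_h, 3), "| scores:", round(auc_s, 3))
```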

[Table of Contents](#toc)

<html>
    <a id='5.3'></a>
    <h1 style="color:coral; background-color:teal; font-size:25px; padding:15px;"><center><b>5.3. RANDOMFOREST CLASSIFIER</b></center></h1>
</html>

In [None]:
def objective(trial):
    
    n_estimators = trial.suggest_int('n_estimators', 2, 200)
    max_depth = int(trial.suggest_int('max_depth', 1, 40))
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    return cross_val_score(clf, x_train_sampling, y_train_sampling, 
           n_jobs=-1, cv=5,scoring='f1').mean()

In [None]:
study = optuna.create_study(direction='maximize',study_name='Random Forest')
study.optimize(objective, n_trials=50)

In [None]:
trial = study.best_trial
#The objective returns the mean cross-validated F1, not accuracy
print('## Best mean CV F1 score -->',trial.value)
print("## best_parameters -->",trial.params)

In [None]:
clf_rf=RandomForestClassifier(**trial.params)
clf_rf.fit(x_train_sampling,y_train_sampling)

In [None]:
pred_rf=clf_rf.predict(x_test)
print("## Accuracy of random forest -->",accuracy_score(y_test,pred_rf))
print()
print("***** Classification report of Random Forest *****")
print()
print(classification_report(y_test,pred_rf))



In [None]:
#Default binary F1, to match the logistic regression cell (micro-averaged F1 would just equal accuracy)
print("F1 Score with oversampling : ", f1_score(y_test,pred_rf))
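
When comparing F1 across models, the averaging mode matters. For a single-label problem, micro-averaged F1 is identical to accuracy, while the default binary F1 scores the positive class only. The short sketch below illustrates this with hand-made labels; `y_true` and `y_hat` are made-up values.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hand-made labels: 7 of the 10 predictions are correct.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_hat)
f1_micro = f1_score(y_true, y_hat, average="micro")
f1_binary = f1_score(y_true, y_hat)  # default: positive class only

print("accuracy :", acc)
print("micro F1 :", f1_micro)
print("binary F1:", round(f1_binary, 4))
```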

In [None]:
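#Keep a handle to this figure so the background colour is applied to it
fig=plt.figure(figsize=(18,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

#ROC needs scores, not hard labels, so use the positive-class probability
fpr,tpr,_=roc_curve(y_test,clf_rf.predict_proba(x_test)[:,1])

plt.title('Random Forest ROC curve')
plt.xlabel('FPR (1 - Specificity)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
#Diagonal reference line of a random classifier
plt.plot((0,1),(0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))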

[Table of Contents](#toc)

<html>
    <a id='5.3.1'></a>
    <h1 style="color:#101820FF; background-color:#FEE715FF; font-size:20px; padding:15px;"><center><b>5.3.1. FEATURE IMPORTANCE FOR RF </b></center></h1>
</html>

In [None]:
feature_importance = np.array(clf_rf.feature_importances_)
feature_names = np.array(x_train.columns)
data={'feature_names':feature_names,'feature_importance':feature_importance}
df_plt = pd.DataFrame(data)
df_plt.sort_values(by=['feature_importance'], ascending=False,inplace=True)
fig=plt.figure(figsize=(18,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

sns.barplot(x=df_plt['feature_importance'], y=df_plt['feature_names'])
#plt.style.use("ggplot")
plt.xlabel('FEATURE IMPORTANCE')
plt.ylabel('FEATURE NAMES')
plt.title("Important features for RandomForest Classifier")
plt.show()

[Table of Contents](#toc)

<html>
    <a id='5.4'></a>
    <h1 style="color:coral; background-color:teal; font-size:25px; padding:15px;"><center><b>5.4. LGBM CLASSIFIER</b></center></h1>
</html>

In [None]:
import lightgbm as lgb

In [None]:
#HYPERPARAMETER TUNING
def objective_lgbm(trial):
    
    n_estimators = trial.suggest_int('n_estimators', 2, 300)
    max_depth = int(trial.suggest_int('max_depth', 2, 50))
    #suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    learning_rate=trial.suggest_float('learning_rate',0.001,1,log=True)
    colsample_bytree=trial.suggest_float("colsample_bytree",0.1, 1,log=True)
    num_leaves=trial.suggest_int('num_leaves',10,300)
    reg_alpha= trial.suggest_float('reg_alpha',0.1,1,log=True)
    reg_lambda= trial.suggest_float('reg_lambda',0.1,1,log=True)
    min_split_gain=trial.suggest_float('min_split_gain',0.1,1,log=True)
    subsample=trial.suggest_float('subsample',0.1,1,log=True)
    clf = lgb.LGBMClassifier(n_estimators=n_estimators, max_depth=max_depth,
                            learning_rate=learning_rate,colsample_bytree=colsample_bytree,
                            num_leaves=num_leaves,reg_alpha=reg_alpha,reg_lambda=reg_lambda,
                            min_split_gain=min_split_gain,subsample=subsample)
    return cross_val_score(clf, x_train_sampling, y_train_sampling, 
           n_jobs=-1, cv=5,scoring='f1').mean()

In [None]:
study_lgbm= optuna.create_study(direction='maximize',study_name="LGBM")
study_lgbm.optimize(objective_lgbm, n_trials=40)

In [None]:
#GETTING BEST PARAMETERS
trial_lgbm= study_lgbm.best_trial
#The objective returns the mean cross-validated F1, not accuracy
print("## Best mean CV F1 score --> ",trial_lgbm.value)
print("## Best parameters --> ",trial_lgbm.params)

In [None]:
#MODEL
model_lgbm=lgb.LGBMClassifier(**trial_lgbm.params)

In [None]:
#TRAINING
model_lgbm.fit(x_train_sampling,y_train_sampling)

In [None]:
#PREDICTING
pred_lgbm=model_lgbm.predict(x_test)
print('## Accuracy of LightGBM --> ',accuracy_score(y_test,pred_lgbm))

In [None]:
#CLASSIFICATION REPORT
print("***** Classification report of LGBM Classifier *****")
print()
print(classification_report(y_test,pred_lgbm))

In [None]:
fig=plt.figure(figsize=(18,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

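#ROC needs scores, not hard labels, so use the positive-class probability
fpr,tpr,_=roc_curve(y_test,model_lgbm.predict_proba(x_test)[:,1])

plt.title('Light GBM ROC curve')
plt.xlabel('FPR (1 - Specificity)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
#Diagonal reference line of a random classifier
plt.plot((0,1),(0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))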

[Table of Contents](#toc)

<html>
    <a id='5.4.1'></a>
    <h1 style="color:#101820FF; background-color:#FEE715FF; font-size:20px; padding:15px;"><center><b>5.4.1. FEATURE IMPORTANCE FOR LGBM CLASSIFIER</b></center></h1>
</html>

In [None]:
feature_importance = np.array(model_lgbm.feature_importances_)
feature_names = np.array(x_train.columns)
data={'feature_names':feature_names,'feature_importance':feature_importance}
df_plt = pd.DataFrame(data)
df_plt.sort_values(by=['feature_importance'], ascending=False,inplace=True)
fig=plt.figure(figsize=(20,10))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

sns.barplot(x=df_plt['feature_importance'], y=df_plt['feature_names'])
#plt.style.use("ggplot")
plt.xlabel('FEATURE IMPORTANCE')
plt.ylabel('FEATURE NAMES')
plt.title("Important features for LGBM Classifier")
plt.show()

[Table of Contents](#toc)

# **ANY SUGGESTIONS ARE MOST WELCOME. PLEASE CONSIDER GIVING IT AN UPVOTE 👍**

**IF YOU THINK I SHOULD ADD SOMETHING OR CHANGE ANY STEP TO INCREASE MY AUC, TELL ME IN THE COMMENTS AND I WILL EDIT THIS NOTEBOOK ACCORDING TO THE SUGGESTIONS 😊🙌**

**AN UPVOTE MAY HELP ME GET A JOB/INTERNSHIP 👨‍🎓**
