<center> 
<img src = "https://www.elastic.co/guide/en/machine-learning/master/images/classification-vis.png" width = 800 height = 400/>
</center>
<br>

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is a system of ongoing health-related telephone surveys designed to collect data on health-related risk behaviors, chronic health conditions, and the use of preventive services from the non-institutionalized adult population (≥ 18 years) residing in the United States. The BRFSS is administered and supported by CDC's Population Health Surveillance Branch, under the Division of Population Health at CDC's National Center for Chronic Disease Prevention and Health Promotion.

The dataset originally comes from the CDC (1) and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. As the CDC describes: "Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world." The most recent dataset (as of February 15, 2022) includes data from 2020. It consists of 401,958 rows and 279 columns.

The vast majority of columns are questions asked to respondents about their health status, such as "Do you have serious difficulty walking or climbing stairs?" or "Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]". In this dataset, we noticed many different factors (questions) that directly or indirectly influence heart disease, so we decided to select the most relevant variables from it and do some cleaning so that it would be usable for machine learning projects (2).

1. [codebook20_llcp-v2-508.pdf](https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf)
2. [personal-key-indicators-of-heart-disease](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)


<hr>

# 1. Business Understanding
Heart disease is the leading cause of death in the United States. The term "heart disease" refers to several types of heart conditions. The most common type of heart disease in the United States is coronary artery disease (CAD), which can lead to a heart attack. Machine learning can help us understand which risk factors are useful for predicting heart disease.

<hr>

<a id="title-two"></a>
# 2. Data Collecting
### 2.1 About the dataset


Factors assessed by the BRFSS in 2020 included health status and healthy days, exercise, inadequate sleep, chronic health conditions, oral health, tobacco use, cancer screenings, and health-care access (core section). Optional Module topics for 2020 included prediabetes and diabetes, cognitive decline, electronic cigarettes, cancer survivorship (type, treatment, pain management) and sexual orientation/gender identity (SOGI).

The dataset originally comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. As the CDC describes: "Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world." The most recent dataset (as of February 15, 2022) includes data from 2020. It consists of 401,958 rows and 279 columns. The vast majority of columns are questions asked to respondents about their health status, such as "Do you have serious difficulty walking or climbing stairs?" or "Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]". In this dataset, we noticed many different factors (questions) that directly or indirectly influence heart disease, so we decided to select the most relevant variables from it and do some cleaning so that it would be usable for machine learning projects ([1](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)).

<br>

### 2.2 Dataset Description [source](https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf)

|Category|Label|Question|Value| 
|-|-|-|-|
|<b>HeartDisease</b>|Ever had CHD or MI| <i>-Respondents that have ever reported having coronary <br> -heart disease (CHD) or myocardial infarction (MI)</i>|-Yes<br>-No|
|<b>BMI</b>|Computed body mass index|<i>Computed body mass index</i>|Float[1-9999]|
|<b>Smoking</b>|Smoked at Least 100 Cigarettes|<i>Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]</i>|-Yes<br>-No|
|<b>AlcoholDrinking</b>|Heavy Alcohol Consumption Calculated Variable|<i>Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)</i>|-Yes<br>-No|
|<b>Stroke</b>|Ever Diagnosed with a Stroke|<i>(Ever told) (you had) a stroke.</i>|-Yes<br>-No|
|<b>PhysicalHealth</b>|Number of Days Physical Health Not Good|<i>Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?</i>|Number of days [1-30]|
|<b>MentalHealth</b>|Number of Days Mental Health Not Good|<i>Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?</i>|Number of days [1-30]|
|<b>DiffWalking</b>|Difficulty Walking or Climbing Stairs|<i>Do you have serious difficulty walking or climbing stairs?</i>|-Yes<br>-No|
|<b>Sex</b>|Are you male or female?|<i>Are you male or female?</i>|-Male<br>-Female|
|<b>AgeCategory</b>|Reported age in five-year age categories calculated variable|<i>Fourteen-level age category</i>|-Age [18-79]<br>-Age [80 or older]|
|<b>Race</b>|Imputed race/ethnicity value|<i>Imputed race/ethnicity value (This value is the reported race/ethnicity or an imputed race/ethnicity, if the respondent refused to give a race/ethnicity. The value of the imputed race/ethnicity will be the most common race/ethnicity response for that region of the state)</i>|-White<br>-Black<br>-Asian<br>-American Indian/Alaskan Native<br>-Hispanic<br>-Other|
|<b>Diabetic</b>|(Ever told) you had diabetes|<i>(Ever told) (you had) diabetes? (If ´Yes´ and respondent is female, ask ´Was this only when you were pregnant?´. If Respondent says pre-diabetes or borderline diabetes, use response code 4.)</i>|-Yes<br>-No<br>-No, borderline diabetes<br>-Yes (during pregnancy)|
|<b>PhysicalActivity</b>|Exercise in Past 30 Days|<i>During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?</i>|-Yes<br>-No|
|<b>GenHealth</b>|General Health|<i>Would you say that in general your health is:</i>|-Excellent<br>-Very good<br>-Good<br>-Fair<br>-Poor|
|<b>SleepTime</b>|How Much Time Do You Sleep|<i>On average, how many hours of sleep do you get in a 24-hour period?</i>|Number of hours [1-24]|
|<b>Asthma</b>|Ever Told Had Asthma|<i>(Ever told) (you had) asthma?</i>|-Yes<br>-No|
|<b>KidneyDisease</b>|Ever told you have kidney disease?|<i>Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?</i>|-Yes<br>-No|
|<b>SkinCancer</b>|(Ever told) you had skin cancer?|<i>(Ever told) (you had) skin cancer?</i>|-Yes<br>-No|

In [None]:
# import pandas as pd for data manipulation
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

df = pd.read_csv('/kaggle/input/personal-key-indicators-of-heart-disease/heart_2020_cleaned.csv')
df_org = df.copy() # make a copy of the original data frame
display(df)

<hr>

<a id="title-three"></a>
# 3. Data Understanding
### 3.1 Listing the unique values of each column
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages, and makes importing and analyzing data much easier.

While analyzing the data, many times the user wants to see the unique values in a particular column, which can be done using Pandas unique() function.

In [None]:
# retrieve all labels and store in a list
columns_df = list(df.columns.values)
# iterate over the list to print all unique values of each column in the dataframe
for column in columns_df:
    print(column, ':', str(df[column].unique()))

### 3.2 Continuous vs Categorical
A dataset generally has two types of variables: continuous (numerical) and categorical. Regression-based algorithms can use both continuous and categorical features to build models, but most ML libraries cannot fit categorical variables into a regression equation in their raw form, so they must first be encoded as numbers.

#### 3.2.1 Return a subset of the DataFrame’s columns based on the column dtypes
`DataFrame.select_dtypes(include=None, exclude=None)`
- To select all numeric types, use `np.number` or `number`.
- To select strings you must use the `object` dtype, but note that this will return all object dtype columns.

In [None]:
# import numpy for array operations and select all numerical columns
import numpy as np

In [None]:
# list of numerical features
numeric_features = df.select_dtypes(include=[np.number])
numeric_features.columns

In [None]:
# list of categorical features
categorical_features = df.select_dtypes(include=[object])
categorical_features.columns

### 3.3 Generate descriptive statistics
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
- `DataFrame.describe(percentiles=None, include=None, exclude=None)` (pandas 1.x also accepted a `datetime_is_numeric` argument, which was removed in pandas 2.0)

Before generating statistics on our dataset, note that the unique values of the 'AgeCategory' column are ranges of the form [a-b], where a is the minimum age and b is the maximum age.
- AgeCategory : [ '18-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80 or older']

We have to encode this column numerically so that it can be used for generating statistics. 
- One of the easiest ways to convert a range [a-b] to an integer is to replace it with its midpoint, (a + b) / 2.
- For example, the midpoint of [55-59] is 57, so every value in the 55-59 category is replaced with 57. The open-ended category '80 or older' is replaced with 80.
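
The midpoint rule can also be derived from the labels themselves rather than typed by hand. The following is a minimal sketch, assuming every label has the form 'a-b' or 'a or older'; the helper name `age_midpoint` is hypothetical and not part of the original notebook.

```python
def age_midpoint(label: str) -> int:
    """Return the midpoint of an 'a-b' label, or the lower bound of 'a or older'."""
    if "-" in label:
        low, high = label.split("-")
        return (int(low) + int(high)) // 2
    # open-ended category such as '80 or older': fall back to the lower bound
    return int(label.split()[0])

# a few of the AgeCategory labels listed above
labels = ['18-24', '25-29', '55-59', '80 or older']
encoded = {label: age_midpoint(label) for label in labels}
print(encoded)
```

With this helper, a column could be encoded with `df['AgeCategory'].map(age_midpoint)` instead of a hand-written dictionary.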

In [None]:
# encode 'AgeCategory' column
encode_AgeCategory = {'55-59':57, '80 or older':80, '65-69':67,
                      '75-79':77,'40-44':42,'70-74':72,'60-64':62,
                      '50-54':52,'45-49':47,'18-24':21,'35-39':37,
                      '30-34':32,'25-29':27}
df['AgeCategory'] = df['AgeCategory'].apply(lambda x: encode_AgeCategory[x])
df['AgeCategory'] = df['AgeCategory'].astype(int)
df['AgeCategory']

In [None]:
# Generate descriptive statistics
df.describe()[1:][list(numeric_features)].T.style.background_gradient(cmap='Blues')

<hr>

<a id="title-four"></a>
# 4. Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to analyzing data in order to summarize its main characteristics, often with visual methods.

EDA is mainly performed using the following methods:
- Univariate visualization: provides summary statistics for each field in the raw data set.
- Bivariate visualization: is performed to find the relationship between each variable in the dataset and the target variable of interest.


### 4.1 Univariate visualization
#### 4.1.1 Univariate visualization of categorical features
This provides summary statistics (value counts) for each categorical field in the raw data set.

In [None]:
# import matplotlib and seaborn for visualization
from matplotlib import pyplot as plt
import seaborn as sns

# Univariate visualization of categorical features
def categorical_feature_func():
  i = 1
  plt.figure(figsize = (25,15))
  for feature in categorical_features:
      plt.subplot(3,5,i)
      sns.set(palette='Paired')
      sns.set_style("ticks")
      ax = sns.countplot(x = feature, data = df)#, hue = 'Stroke')#, color='#221C35') 
      ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
      i +=1

categorical_feature_func()

#### 4.1.2 Univariate visualization of numerical features
This provides summary statistics (the distribution of values) for each numerical field in the raw data set.

In [None]:
# Univariate visualization of numerical features
def numeric_features_func():
  i=1
  plt.figure(figsize = (35,5))
  for feature in numeric_features.columns:
      plt.subplot(1,5,i)
      sns.set(palette='dark')
      sns.set_style("ticks")
      sns.histplot(df[feature],kde=True)
      plt.xlabel(feature)
      plt.ylabel("Count")
      i+=1

numeric_features_func()

### 4.2 Bivariate visualization
#### 4.2.1 Bivariate visualization of categorical features
This shows the relationship between each categorical variable in the dataset and the target variable of interest (HeartDisease).

In [None]:
def categorical_feature_func():
  i = 1
  plt.figure(figsize = (25,15))
  for feature in categorical_features:
      plt.subplot(3,5,i)
      sns.set(palette='Paired')
      sns.set_style("ticks")
      ax = sns.countplot(x = feature, data = df, hue = 'HeartDisease')
      ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
      i +=1

categorical_feature_func()

#### 4.2.2 Bivariate visualization of numerical features
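This shows the relationship between each numerical variable in the dataset and a categorical variable of interest, here DiffWalking. The function takes the column name as an argument, so the target (HeartDisease) can be passed instead.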

In [None]:
def numeric_features_func(f):
  i=1
  plt.figure(figsize=(35,5))
  sns.set(palette='Paired')
  sns.set_style("ticks")
  for feature in numeric_features:
      plt.subplot(1,5,i)
      sns.boxplot(y=df[feature], x = df[f])
      i+=1

numeric_features_func('DiffWalking')

<hr>

<br>

<a id="title-five"></a>
# 5. Preprocessing

### 5.1 Label Encoding
In machine learning, encoding is a process of converting categorical data into a form that can be easily understood by machine learning algorithms. As many machine learning models only accept numerical inputs, encoding is a crucial step in preparing your data for model training. There are multiple types of encoding, each with its own use cases:

- Label Encoding: This involves converting each value in a column to a number. For example, 'red' could be 1, 'green' could be 2, and so on. While this is a straightforward method, it can sometimes lead to problems as it may introduce an arbitrary ordering where none exists. For example, the model might incorrectly learn that 'green' > 'red' because 2 > 1, which is not necessarily a valid comparison.

- One-Hot Encoding: This involves creating a new binary column for each category in the data. For instance, for a column 'Color' with categories 'red', 'green', and 'blue', one-hot encoding would create three new columns, 'Color_red', 'Color_green', and 'Color_blue', which are either 0 or 1 depending on the color of the instance. This method can lead to a large increase in the dataset's dimensionality if a categorical variable has many categories, potentially slowing down training and reducing performance.

- Mixed Encoding: Sometimes, a combination of the above methods can be used based on the cardinality of the categorical variables. For example, label encoding could be applied to binary variables (with only two categories), and one-hot encoding could be applied to non-binary variables. This approach aims to balance the advantages and drawbacks of both encoding methods.


<br>
Example (one-hot encoding of two rows whose `GenHealth` values are 'Very good' and 'Fair'):

- Original: `['GenHealth'] : ['Very good' 'Fair' 'Good' 'Poor' 'Excellent']`
- Encoded: 
  - `['GenHealth_Very good'] = [1, 0]`
  - `['GenHealth_Fair'] = [0, 1]`
  - `['GenHealth_Good'] = [0, 0]`
  - `['GenHealth_Poor'] = [0, 0]`
  - `['GenHealth_Excellent'] = [0, 0]`
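
The difference between label encoding and one-hot encoding can be seen on a toy column. This is a minimal sketch on made-up data (the `toy` DataFrame is an assumption for illustration, not part of the dataset), using `LabelEncoder` from scikit-learn and `pandas.get_dummies`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy data, invented for illustration only
toy = pd.DataFrame({"GenHealth": ["Very good", "Fair", "Good", "Fair"]})

# label encoding: one integer per category (classes are sorted alphabetically)
le = LabelEncoder()
toy_label = le.fit_transform(toy["GenHealth"])
print(dict(zip(le.classes_, range(len(le.classes_)))))
print(toy_label)

# one-hot encoding: one indicator column per category
toy_onehot = pd.get_dummies(toy["GenHealth"], prefix="GenHealth", dtype=int)
print(toy_onehot)
```

Note that the integers produced by label encoding follow alphabetical order, not the natural order of the health ratings, which is one reason one-hot encoding is often preferred for non-binary columns.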

In [None]:
# Encode all columns
columns_df = list(df.columns.values)
from sklearn.preprocessing import LabelEncoder

cat_cols = ["Smoking", "AlcoholDrinking", "Stroke", "DiffWalking",
                "Sex", "AgeCategory", "Race", "Diabetic", "PhysicalActivity",
                "GenHealth", "Asthma", "KidneyDisease", "SkinCancer"]
for cat_col in cat_cols:
    dummy_col = pd.get_dummies(df[cat_col], prefix=cat_col)
    df = pd.concat([df, dummy_col], axis=1)
    del df[cat_col]

for col in ['HeartDisease']:
    if df[col].dtype == 'O':
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])


df.head()

<b> UPDATE </b>
We added 1 more encoding method:
- Original:
  - OneHotEncoder
    - "Smoking", "AlcoholDrinking", "Stroke", "DiffWalking",
    - "Sex", "AgeCategory", "Race", "Diabetic", "PhysicalActivity",
    - "GenHealth", "Asthma", "KidneyDisease", "SkinCancer"
  - LabelEncoder
    - "HeartDisease"

<br>

- New method:
  - if unique values <= 2
    - LabelEncoder
  - else
    - OneHotEncoder


In [None]:
# copied original dataframe to new dataframe for new encoding methods
df_enc_mix = df_org.copy()

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Define a LabelEncoder
le = LabelEncoder()

# Get a list of categorical column names
categorical_cols = df_enc_mix.select_dtypes(include=['object', 'category']).columns.tolist()

# mix of label encoding and one-hot encoding
for col in categorical_cols:
    if len(df_enc_mix[col].unique()) <= 2:
        # label encode binary variables
        df_enc_mix[col] = le.fit_transform(df_enc_mix[col])
    else:
        # one-hot encode non-binary variables
        df_enc_mix = pd.get_dummies(df_enc_mix, columns=[col])

df_enc_mix.head()

#### 5.1.1 Listing the unique values of each column after encoding
Each encoded column now lists all of its possible values. For example:

- Original: `Diabetic : ['Yes', 'No', 'No, borderline diabetes', 'Yes (during pregnancy)']`
- Encoded: 
  - `Diabetic_Yes : [0, 1]`
  - `Diabetic_No : [0, 1]`
  - `Diabetic_No, borderline diabetes : [0, 1]`
  - `Diabetic_Yes (during pregnancy) : [0, 1]`

In [None]:
# iterate over the list to print all unique values of each column in the dataframe
for column in list(df.columns.values):
    print(column, ':', str(df[column].unique()))

In [None]:
# iterate over the list to print all unique values of each column in the dataframe
for column in list(df_enc_mix.columns.values):
    print(column, ':', str(df_enc_mix[column].unique()))

### 5.2 Dataset Splitting
Here we show how to split a dataset into train and test sets. 
- The train-test split is used to estimate the performance of machine learning algorithms on data that was not used for training. 
- Here, 80% of the data goes to the training set and the remaining 20% to the test set (`train_size=0.80`). Without an explicit size, `train_test_split` holds out 25% of the data for testing.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# split the dataset into train and test set with 80% and 20% respectively
train_data, test_data = train_test_split(df, train_size=0.80) 
train_data.shape, test_data.shape

In [None]:
# split the dataset into train and test set for NEW encoding method
train_data_label, test_data_label = train_test_split(df_enc_mix, train_size=0.80)
train_data_label.shape, test_data_label.shape

We split the dataset into train and test sets to evaluate how well our machine learning model performs on unseen data. The train set is used to fit the model, so its labels are known to the model. The second set, the test set, is used solely for predictions, which are then compared with the true labels.
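
When one class is much rarer than the other, it can help to keep the class ratio the same in both subsets and to fix the random seed so the split is reproducible. The sketch below shows one way to do that with `train_test_split`; the toy data, the 90/10 class ratio, and the seed value 42 are assumptions for illustration, and the original notebook does not use `stratify` or `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data, invented for illustration: 100 samples, 10% positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the 10% positive rate in both subsets;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.80, stratify=y_toy, random_state=42
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```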

In [None]:
# split the train and test set into features and labels
X_train= train_data.drop('HeartDisease', axis=1)
y_train= train_data['HeartDisease']
print(X_train.shape, y_train.shape)

X_test= test_data.drop('HeartDisease', axis=1)
y_test= test_data['HeartDisease']
print(X_test.shape, y_test.shape)

In [None]:
# split the train and test set for the NEW encoding method
X_train_new= train_data_label.drop('HeartDisease', axis=1)
y_train_new= train_data_label['HeartDisease']
print(X_train_new.shape, y_train_new.shape)

X_test_new= test_data_label.drop('HeartDisease', axis=1)
y_test_new= test_data_label['HeartDisease']
print(X_test_new.shape, y_test_new.shape)

### 5.3 Feature Scaling
Feature scaling puts the independent features in the data on a comparable scale. It is performed during data preprocessing. Here we use standardization, which rescales each feature to zero mean and unit variance.
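
What `StandardScaler` computes can be checked by hand: each value x becomes z = (x - mean) / std, where the mean and standard deviation are estimated from the training data only. The following is a minimal sketch on made-up numbers (the `train` and `test` arrays are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data, invented for illustration
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train statistics

# the same result computed by hand (StandardScaler uses the population std, ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(train_scaled, manual))
print(test_scaled)
```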

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
# the scaler is fitted on the training set only and then applied to the test set,
# so no information from the test set leaks into training
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)

# for NEW encoding method
X_train_new=sc.fit_transform(X_train_new)
X_test_new=sc.transform(X_test_new)

In [None]:
# X_train after scaling
X_train

In [None]:
# X_train_new after scaling
X_train_new

### 5.4 Handling Imbalanced Data
Balanced vs Imbalanced Dataset :

- Balanced Dataset: In a Balanced dataset, there is approximately equal distribution of classes in the target column.
- Imbalanced Dataset: In an Imbalanced dataset, there is a highly unequal distribution of classes in the target column.

Let's take a look at our target which is 'HeartDisease':

In [None]:
df['HeartDisease'].value_counts()

In [None]:
df_enc_mix['HeartDisease'].value_counts()

As the output above shows, there are about 292k 'No' and 27k 'Yes' responses, so the dataset is imbalanced: the two classes are very unequally distributed.
<br>
Problem with Imbalanced dataset:

- Algorithms may get biased towards the majority class and thus tend to predict output as the majority class.
- Minority class observations look like noise to the model and are ignored by the model.
- Imbalanced dataset gives misleading accuracy score.
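
The misleading accuracy score is easy to reproduce. The sketch below uses made-up labels (900 negatives and 100 positives, an assumption for illustration) and a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# toy labels, invented for illustration: 900 negatives (0) and 100 positives (1)
y_true = np.array([0] * 900 + [1] * 100)

# a "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions
rec = recall_score(y_true, y_pred)    # fraction of positives that were found
print(acc, rec)
```

The accuracy looks high even though not a single positive case is detected, which is why the classification reports later in this notebook also show precision, recall, and F1 per class.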
<br>
Techniques to deal with Imbalanced dataset :

- Under Sampling
  - In this technique, we reduce the sample size of Majority class and try to match it with the sample size of Minority Class.
    - For example, take an imbalanced training dataset with 1000 records.
    - Before Under Sampling :
      - Target class 'Yes' = 900
      - Target class 'No' = 100
    - After Under Sampling :
      - Target class 'Yes' = 100
      - Target class 'No' = 100
    - Now, both classes have the same sample size.

- Over Sampling
  - In this technique, we increase the sample size of Minority class by replication and try to match it with the sample size of Majority Class.
    - For example, let’s take the same imbalanced training dataset with 1000 records.
    - Before Over Sampling :
      - Target class 'Yes' = 900
      - Target class 'No' = 100
    - After Over Sampling :
      - Target class 'Yes' = 900
      - Target class 'No' = 900
    - Now, both classes have the same sample size.

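We use both approaches in our project: 'SMOTE' for over-sampling and 'NearMiss' for under-sampling. Unlike plain replication, SMOTE creates synthetic minority samples by interpolating between existing minority samples and their nearest neighbors.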
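
The two basic ideas from the lists above can be sketched with plain random resampling in pandas. This is an illustration on made-up data and is not the SMOTE or NearMiss algorithm: random under-sampling drops majority rows, and random over-sampling repeats minority rows.

```python
import pandas as pd

# toy data, invented for illustration: 900 'Yes' rows and 100 'No' rows
toy = pd.DataFrame({"target": ["Yes"] * 900 + ["No"] * 100})
majority = toy[toy["target"] == "Yes"]
minority = toy[toy["target"] == "No"]

# random under-sampling: keep only as many majority rows as there are minority rows
under = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# random over-sampling: draw minority rows with replacement up to the majority size
over = pd.concat([majority, minority.sample(n=len(majority), replace=True, random_state=0)])

print(under["target"].value_counts().to_dict())
print(over["target"].value_counts().to_dict())
```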

In [None]:
# import SMOTE from imblearn.over_sampling
from imblearn.over_sampling import SMOTE

In [None]:
# import NearMiss from imblearn.under_sampling
from imblearn.under_sampling import NearMiss

In [None]:
# Counter is a collection where elements are stored as 
# dictionary keys and their counts are stored as dictionary values.
from collections import Counter

In [None]:
# balance the dataset using SMOTE (Synthetic Minority Oversampling Technique)
smote = SMOTE(sampling_strategy='minority')
X_train_smote , y_train_smote = smote.fit_resample(X_train,y_train)

print('Original: {}'.format(Counter(y_train))) 
print('   SMOTE: {}'.format(Counter(y_train_smote))) 

In [None]:
# SMOTE: NEW encoding method
X_train_smote_new , y_train_smote_new = smote.fit_resample(X_train_new,y_train_new)

print('Original: {}'.format(Counter(y_train_new)))
print('  SMOTE2: {}'.format(Counter(y_train_smote_new)))

In [None]:
# balance the dataset using NearMiss (undersampling)
nearmiss = NearMiss(version=3)
X_train_nearmiss, y_train_nearmiss = nearmiss.fit_resample(X_train, y_train)

print('Original: {}'.format(Counter(y_train))) 
print('NearMiss: {}'.format(Counter(y_train_nearmiss))) 

In [None]:
# NearMiss: NEW encoding method
X_train_nearmiss_new, y_train_nearmiss_new = nearmiss.fit_resample(X_train_new, y_train_new)

print(' Original: {}'.format(Counter(y_train_new)))
print('NearMiss2: {}'.format(Counter(y_train_nearmiss_new)))

### 5.5 K-Fold Cross Validation
K-Fold cross validation (CV) splits a given data set into K sections (folds), and each fold is used as the testing set exactly once. 
<br>
Let's take the scenario of 5-fold cross validation (K=5). Here, the data set is split into 5 folds. In the first iteration, the first fold is used to test the model and the rest are used to train the model. In the second iteration, the second fold is used as the testing set while the rest serve as the training set. This process is repeated until each of the 5 folds has been used as the testing set.
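
The 5-fold scenario can be made concrete by printing the indices that `KFold` produces. This is a minimal sketch on ten made-up samples (the `X_toy` array is an assumption for illustration).

```python
import numpy as np
from sklearn.model_selection import KFold

# toy data, invented for illustration: 10 samples
X_toy = np.arange(10).reshape(10, 1)

# 5 folds without shuffling: each fold is a contiguous block of 2 samples
kf = KFold(n_splits=5, shuffle=False)
folds = list(kf.split(X_toy))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={train_idx}, test={test_idx}")
```

Every sample index appears in exactly one test fold, so each sample is used for testing once and for training in the other four iterations.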

In [None]:
# kfold cross validation
from sklearn.model_selection import KFold

# make a 10 fold cross validation
cv = KFold(n_splits=10, random_state=None,shuffle=False) 

<hr>

<a id="title-six"></a>
# 6. Model Training

In this section, we compare the classification strength of AdaBoost, Random Forest, Decision Tree, KNN, Naïve Bayes, and Perceptron. After training our models, Naïve Bayes achieved the highest accuracy, whereas Perceptron reached the lowest accuracy.


In [None]:
# required libraries

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix


### 6.1 Decision Tree

#### 6.1.1 Decision Tree Classifier

In [None]:
# make the model and parameters
def dt_model():
    model_dt = DecisionTreeClassifier()
    params_dt = {"criterion":['gini','entropy'], "max_depth": [100], "random_state": [1024]}
    model_dt_cv = GridSearchCV( model_dt, 
                                param_grid = params_dt, 
                                cv = cv, 
                                n_jobs = -1, 
                                verbose = 1 )
    return model_dt_cv

In [None]:
# fit the model with the best hyperparameters using SMOTE
model_dt_cv_smote = dt_model()
model_dt_cv_smote.fit(X_train_smote ,y_train_smote)
print("Best Hyper Parameters for SMOTE: ", model_dt_cv_smote.best_params_)

In [None]:
# fit the model for NEW encoding method - SMOTE
model_dt_cv_smote_new = dt_model()
model_dt_cv_smote_new.fit(X_train_smote_new ,y_train_smote_new)
print("Best Hyper Parameters for SMOTE2: ", model_dt_cv_smote_new.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss
model_dt_cv_nearmiss = dt_model()
model_dt_cv_nearmiss.fit(X_train_nearmiss ,y_train_nearmiss)
print("Best Hyper Parameters for NearMiss: ", model_dt_cv_nearmiss.best_params_)

In [None]:
# fit the model for New encoding method - NearMiss
model_dt_cv_nearmiss_new = dt_model()
model_dt_cv_nearmiss_new.fit(X_train_nearmiss_new ,y_train_nearmiss_new)
print("Best Hyper Parameters for NearMiss2: ", model_dt_cv_nearmiss_new.best_params_)

#### 6.1.2 Decision Tree Classification Report

In [None]:
# print the best score (SMOTE)
y_pred_dt_smote = model_dt_cv_smote.predict(X_test)
print("Classification Report for SMOTE: \n", classification_report(y_test, y_pred_dt_smote))

In [None]:
# print the best score (SMOTE) New encoding method
y_pred_dt_smote_new = model_dt_cv_smote_new.predict(X_test_new)
print("Classification Report for SMOTE2: \n", classification_report(y_test_new, y_pred_dt_smote_new))

In [None]:
# print the best score (NearMiss)
y_pred_dt_nearmiss = model_dt_cv_nearmiss.predict(X_test)
print("Classification Report for NearMiss: \n", classification_report(y_test, y_pred_dt_nearmiss))

In [None]:
# print the best score (NearMiss) New encoding method
y_pred_dt_nearmiss_new = model_dt_cv_nearmiss_new.predict(X_test_new)
print("Classification Report for NearMiss2: \n", classification_report(y_test_new, y_pred_dt_nearmiss_new))

#### 6.1.3 Decision Tree Confusion Matrix

In [None]:
%matplotlib inline
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm_dt_smote = confusion_matrix(y_test,y_pred_dt_smote) # confusion matrix
cm_dt_nearmiss = confusion_matrix(y_test,y_pred_dt_nearmiss) # confusion matrix
cm_dt_smote_new = confusion_matrix(y_test_new,y_pred_dt_smote_new) # confusion matrix
cm_dt_nearmiss_new = confusion_matrix(y_test_new,y_pred_dt_nearmiss_new) # confusion matrix

def plot_confusion_matrix(ax, cm, title='Confusion matrix', cmap='viridis'):
    sn.heatmap(cm, annot=True, linewidths=0.8, fmt='d', cmap=cmap, ax=ax)
    ax.set_xlabel('Predicted',fontsize=16)
    ax.set_ylabel('Truth',fontsize=16)
    ax.set_title(title,fontsize=16)

fig, axs = plt.subplots(2,2, figsize=(10,4))

plot_confusion_matrix(axs[0,0], cm_dt_smote, title='Decision Tree with SMOTE')
plot_confusion_matrix(axs[0,1], cm_dt_smote_new, title='SMOTE + New Encoding Method')
plot_confusion_matrix(axs[1,0], cm_dt_nearmiss, title='Decision Tree with NearMiss')
plot_confusion_matrix(axs[1,1], cm_dt_nearmiss_new, title='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()


#### 6.1.4 Decision Tree ROC Curve

In [None]:
from sklearn.metrics import roc_curve, auc

def plot_roc_auc(ax, model_cv, X_test, y_test, label):
    # ROC-AUC
    # predict probabilities
    y_score_model = model_cv.predict_proba(X_test) # results are probabilities for each sample for each class
    yes_probs = y_score_model[:,1] # retrieve the probabilities only for class 1 (yes, positive class)

    # calculate the points of the ROC curve
    fpr_model, tpr_model, _ = roc_curve(y_test, yes_probs) # false positive rate, true positive rate, thresholds

    # AUC
    auc_model = auc(fpr_model, tpr_model)

    # plot "No-Skill" on ROC Curve
    ax.plot([0,1],[0,1], linestyle='--', label='No Skill')

    # Plot the ROC Curve
    label = f'{label} (auc={auc_model:.3f})'
    ax.plot(fpr_model, tpr_model, marker='_', label=label, color='red')

    # X-axis label
    ax.set_xlabel("False Positive Rate")

    # Y-axis label
    ax.set_ylabel("True Positive Rate")

    # show the legend
    ax.legend()


In [None]:
# plot_roc_auc(model_dt_cv, X_test, y_test, label='Decision Tree')

fig, axs = plt.subplots(2,2, figsize=(12,8))

plot_roc_auc(axs[0,0], model_dt_cv_smote, X_test, y_test, label='Decision Tree with SMOTE')
plot_roc_auc(axs[0,1], model_dt_cv_smote_new, X_test_new, y_test_new, label='SMOTE + New Encoding Method')
plot_roc_auc(axs[1,0], model_dt_cv_nearmiss, X_test, y_test, label='Decision Tree with NearMiss')
plot_roc_auc(axs[1,1], model_dt_cv_nearmiss_new, X_test_new, y_test_new, label='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()

## 6.2 AdaBoost
#### 6.2.1 AdaBoost Classifier

In [None]:
# make the model and parameters
def ada_model():
    model_ada = AdaBoostClassifier()
    params_ada = {'n_estimators':[50], 'learning_rate':[1]}
    model_ada_cv = GridSearchCV(model_ada, 
                                param_grid = params_ada, 
                                cv = cv, 
                                verbose = 1)
    return model_ada_cv
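

The four fitting cells that follow repeat the same steps for each resampled training set. As an optional alternative, the loop below sketches how one factory function such as `ada_model` could be applied to every variant. `fit_all_variants` is a hypothetical helper, and `MajorityStub` is a stand-in object used only so the sketch runs without the notebook's data.

```python
def fit_all_variants(make_model, train_sets):
    """Fit one fresh model per training set and return the fitted models by name."""
    fitted = {}
    for name, (X_train, y_train) in train_sets.items():
        model = make_model()        # new, unfitted search object for each variant
        model.fit(X_train, y_train)
        fitted[name] = model
    return fitted


class MajorityStub:
    """Stand-in for a GridSearchCV object; it only mimics fit() and best_params_."""

    def fit(self, X, y):
        self.best_params_ = {"majority": max(set(y), key=y.count)}
        return self


# Illustrative data only
demo_sets = {
    "smote": ([[0], [1], [2]], [0, 1, 1]),
    "nearmiss": ([[0], [1]], [0, 0]),
}
fitted_demo = fit_all_variants(MajorityStub, demo_sets)
for name, model in fitted_demo.items():
    print(name, model.best_params_)
```

In the notebook, the call would presumably look like `fit_all_variants(ada_model, {"smote": (X_train_smote, y_train_smote), ...})`, with one entry per resampling and encoding variant.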

In [None]:
# fit the model with the best hyperparameters using SMOTE
model_ada_cv_smote = ada_model()
model_ada_cv_smote.fit(X_train_smote ,y_train_smote)
print("Best Hyper Parameters for SMOTE: ", model_ada_cv_smote.best_params_)

In [None]:
# fit the model with the best hyperparameters using SMOTE New encoding method
model_ada_cv_smote_new = ada_model()
model_ada_cv_smote_new.fit(X_train_smote_new ,y_train_smote_new)
print("Best Hyper Parameters for SMOTE2: ", model_ada_cv_smote_new.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss
model_ada_cv_nearmiss = ada_model()
model_ada_cv_nearmiss.fit(X_train_nearmiss ,y_train_nearmiss)
print("Best Hyper Parameters for NearMiss: ", model_ada_cv_nearmiss.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss New encoding method
model_ada_cv_nearmiss_new = ada_model()
model_ada_cv_nearmiss_new.fit(X_train_nearmiss_new ,y_train_nearmiss_new)
print("Best Hyper Parameters for NearMiss2: ", model_ada_cv_nearmiss_new.best_params_)

#### 6.2.2 AdaBoost Classification Report

In [None]:
# predict on the test set and print the classification report (SMOTE)
y_pred_ada_smote = model_ada_cv_smote.predict(X_test)
print("Classification Report for SMOTE: \n", classification_report(y_test, y_pred_ada_smote))

In [None]:
# predict on the test set and print the classification report (SMOTE, new encoding method)
y_pred_ada_smote_new = model_ada_cv_smote_new.predict(X_test_new)
print("Classification Report for SMOTE2: \n", classification_report(y_test_new, y_pred_ada_smote_new))

In [None]:
# predict on the test set and print the classification report (NearMiss)
y_pred_ada_nearmiss = model_ada_cv_nearmiss.predict(X_test)
print("Classification Report for NearMiss: \n", classification_report(y_test, y_pred_ada_nearmiss))

In [None]:
# predict on the test set and print the classification report (NearMiss, new encoding method)
y_pred_ada_nearmiss_new = model_ada_cv_nearmiss_new.predict(X_test_new)
print("Classification Report for NearMiss2: \n", classification_report(y_test_new, y_pred_ada_nearmiss_new))
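

The per-class numbers printed by `classification_report` follow directly from confusion-matrix counts. The sketch below recomputes precision, recall and F1 for the positive class from four counts; `binary_report` and the example counts are hypothetical and serve only as an illustration.

```python
def binary_report(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative counts, not results from this notebook
report_demo = binary_report(tn=90, fp=10, fn=5, tp=15)
print(report_demo)
```

With an imbalanced test set, recall on the positive class is often more informative than overall accuracy, since a model can reach high accuracy by mostly predicting the majority class.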

#### 6.2.3 AdaBoost Confusion Matrix

In [None]:
%matplotlib inline
cm_ada_smote = confusion_matrix(y_test, y_pred_ada_smote)
cm_ada_nearmiss = confusion_matrix(y_test, y_pred_ada_nearmiss)
cm_ada_smote_new = confusion_matrix(y_test_new, y_pred_ada_smote_new)
cm_ada_nearmiss_new = confusion_matrix(y_test_new, y_pred_ada_nearmiss_new)

fig, axs = plt.subplots(2,2, figsize=(10,4))

plot_confusion_matrix(axs[0,0], cm_ada_smote, title='AdaBoost with SMOTE')
plot_confusion_matrix(axs[0,1], cm_ada_smote_new, title='SMOTE + New Encoding Method')
plot_confusion_matrix(axs[1,0], cm_ada_nearmiss, title='AdaBoost with NearMiss')
plot_confusion_matrix(axs[1,1], cm_ada_nearmiss_new, title='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()

#### 6.2.4 AdaBoost ROC Curve

In [None]:
# plot_roc_auc(model_ada_cv, X_test, y_test, label='AdaBoost')

fig, axs = plt.subplots(2,2, figsize=(12,8))

plot_roc_auc(axs[0,0], model_ada_cv_smote, X_test, y_test, label='AdaBoost with SMOTE')
plot_roc_auc(axs[0,1], model_ada_cv_smote_new, X_test_new, y_test_new, label='SMOTE + New Encoding Method')
plot_roc_auc(axs[1,0], model_ada_cv_nearmiss, X_test, y_test, label='AdaBoost with NearMiss')
plot_roc_auc(axs[1,1], model_ada_cv_nearmiss_new, X_test_new, y_test_new, label='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()

## 6.3 Random Forest
#### 6.3.1 Random Forest Classifier

In [None]:
# make the model and parameters
def rf_model():
    model_rf = RandomForestClassifier()
    params_rf = {"criterion":['gini','entropy']}
    model_rf_cv = GridSearchCV(model_rf, 
                            param_grid = params_rf, 
                            cv = cv, 
                            verbose = 1)
    return model_rf_cv

In [None]:
# fit the model with the best hyperparameters using SMOTE
model_rf_cv_smote = rf_model()
model_rf_cv_smote.fit(X_train_smote ,y_train_smote)
print("Best Hyper Parameters for SMOTE: ", model_rf_cv_smote.best_params_)

In [None]:
# fit the model with the best hyperparameters using SMOTE New encoding method
model_rf_cv_smote_new = rf_model()
model_rf_cv_smote_new.fit(X_train_smote_new ,y_train_smote_new)
print("Best Hyper Parameters for SMOTE2: ", model_rf_cv_smote_new.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss
model_rf_cv_nearmiss = rf_model()
model_rf_cv_nearmiss.fit(X_train_nearmiss ,y_train_nearmiss)
print("Best Hyper Parameters for NearMiss: ", model_rf_cv_nearmiss.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss New encoding method
model_rf_cv_nearmiss_new = rf_model()
model_rf_cv_nearmiss_new.fit(X_train_nearmiss_new ,y_train_nearmiss_new)
print("Best Hyper Parameters for NearMiss2: ", model_rf_cv_nearmiss_new.best_params_)

#### 6.3.2 Random Forest Classification Report

In [None]:
# predict on the test set and print the classification report (SMOTE)
y_pred_rf_smote = model_rf_cv_smote.predict(X_test)
print("Classification Report for SMOTE: \n", classification_report(y_test, y_pred_rf_smote))

In [None]:
# predict on the test set and print the classification report (SMOTE, new encoding method)
y_pred_rf_smote_new = model_rf_cv_smote_new.predict(X_test_new)
print("Classification Report for SMOTE2: \n", classification_report(y_test_new, y_pred_rf_smote_new))

In [None]:
# predict on the test set and print the classification report (NearMiss)
y_pred_rf_nearmiss = model_rf_cv_nearmiss.predict(X_test)
print("Classification Report for NearMiss: \n", classification_report(y_test, y_pred_rf_nearmiss))

In [None]:
# predict on the test set and print the classification report (NearMiss, new encoding method)
y_pred_rf_nearmiss_new = model_rf_cv_nearmiss_new.predict(X_test_new)
print("Classification Report for NearMiss2: \n", classification_report(y_test_new, y_pred_rf_nearmiss_new))

#### 6.3.3 Random Forest Confusion Matrix

In [None]:
%matplotlib inline
cm_rf_smote = confusion_matrix(y_test, y_pred_rf_smote)
cm_rf_nearmiss = confusion_matrix(y_test, y_pred_rf_nearmiss)
cm_rf_smote_new = confusion_matrix(y_test_new, y_pred_rf_smote_new)
cm_rf_nearmiss_new = confusion_matrix(y_test_new, y_pred_rf_nearmiss_new)

fig, axs = plt.subplots(2,2, figsize=(10,4))

plot_confusion_matrix(axs[0,0], cm_rf_smote, title='Random Forest with SMOTE')
plot_confusion_matrix(axs[0,1], cm_rf_smote_new, title='SMOTE + New Encoding Method')
plot_confusion_matrix(axs[1,0], cm_rf_nearmiss, title='Random Forest with NearMiss')
plot_confusion_matrix(axs[1,1], cm_rf_nearmiss_new, title='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()

#### 6.3.4 Random Forest ROC Curve

In [None]:
# plot_roc_auc(model_rf_cv, X_test, y_test, label='Random Forest')

fig, axs = plt.subplots(2,2, figsize=(12,8))

plot_roc_auc(axs[0,0], model_rf_cv_smote, X_test, y_test, label='Random Forest with SMOTE')
plot_roc_auc(axs[0,1], model_rf_cv_smote_new, X_test_new, y_test_new, label='SMOTE + New Encoding Method')
plot_roc_auc(axs[1,0], model_rf_cv_nearmiss, X_test, y_test, label='Random Forest with NearMiss')
plot_roc_auc(axs[1,1], model_rf_cv_nearmiss_new, X_test_new, y_test_new, label='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()

## 6.4 Naïve Bayes
#### 6.4.1 Naïve Bayes Classifier

In [None]:
# make the model and parameters
def nb_model():
    model_nb = GaussianNB()
    # search var_smoothing on a log scale from 1 down to 1e-9 (the scikit-learn default is 1e-9)
    params_nb = {'var_smoothing': np.logspace(0, -9, num=100)}
    model_nb_cv = GridSearchCV(model_nb, 
                            param_grid = params_nb, 
                            cv = cv, 
                            verbose = 1)
    return model_nb_cv

In [None]:
# fit the model with the best hyperparameters using SMOTE
model_nb_cv_smote = nb_model()
model_nb_cv_smote.fit(X_train_smote ,y_train_smote)
print("Best Hyper Parameters for SMOTE: ", model_nb_cv_smote.best_params_)

In [None]:
# fit the model with the best hyperparameters using SMOTE New encoding method
model_nb_cv_smote_new = nb_model()
model_nb_cv_smote_new.fit(X_train_smote_new ,y_train_smote_new)
print("Best Hyper Parameters for SMOTE2: ", model_nb_cv_smote_new.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss
model_nb_cv_nearmiss = nb_model()
model_nb_cv_nearmiss.fit(X_train_nearmiss ,y_train_nearmiss)
print("Best Hyper Parameters for NearMiss: ", model_nb_cv_nearmiss.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss New encoding method
model_nb_cv_nearmiss_new = nb_model()
model_nb_cv_nearmiss_new.fit(X_train_nearmiss_new ,y_train_nearmiss_new)
print("Best Hyper Parameters for NearMiss2: ", model_nb_cv_nearmiss_new.best_params_)

#### 6.4.2 Naïve Bayes Classification Report

In [None]:
# predict on the test set and print the classification report (SMOTE)
y_pred_nb_smote = model_nb_cv_smote.predict(X_test)
print("Classification Report for SMOTE: \n", classification_report(y_test, y_pred_nb_smote))

In [None]:
# predict on the test set and print the classification report (SMOTE, new encoding method)
y_pred_nb_smote_new = model_nb_cv_smote_new.predict(X_test_new)
print("Classification Report for SMOTE2: \n", classification_report(y_test_new, y_pred_nb_smote_new))

In [None]:
# predict on the test set and print the classification report (NearMiss)
y_pred_nb_nearmiss = model_nb_cv_nearmiss.predict(X_test)
print("Classification Report for NearMiss: \n", classification_report(y_test, y_pred_nb_nearmiss))

In [None]:
# predict on the test set and print the classification report (NearMiss, new encoding method)
y_pred_nb_nearmiss_new = model_nb_cv_nearmiss_new.predict(X_test_new)
print("Classification Report for NearMiss2: \n", classification_report(y_test_new, y_pred_nb_nearmiss_new))

#### 6.4.3 Naïve Bayes Confusion Matrix

In [None]:
%matplotlib inline
cm_nb_smote = confusion_matrix(y_test, y_pred_nb_smote)
cm_nb_nearmiss = confusion_matrix(y_test, y_pred_nb_nearmiss)
cm_nb_smote_new = confusion_matrix(y_test_new, y_pred_nb_smote_new)
cm_nb_nearmiss_new = confusion_matrix(y_test_new, y_pred_nb_nearmiss_new)

fig, axs = plt.subplots(2,2, figsize=(10,4))

plot_confusion_matrix(axs[0,0], cm_nb_smote, title='Naive Bayes with SMOTE')
plot_confusion_matrix(axs[0,1], cm_nb_smote_new, title='SMOTE + New Encoding Method')
plot_confusion_matrix(axs[1,0], cm_nb_nearmiss, title='Naive Bayes with NearMiss')
plot_confusion_matrix(axs[1,1], cm_nb_nearmiss_new, title='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()

#### 6.4.4 Naïve Bayes ROC Curve

In [None]:
# plot_roc_auc(model_nb_cv, X_test, y_test, label='Naive Bayes')

fig, axs = plt.subplots(2,2, figsize=(12,8))

plot_roc_auc(axs[0,0], model_nb_cv_smote, X_test, y_test, label='Naive Bayes with SMOTE')
plot_roc_auc(axs[0,1], model_nb_cv_smote_new, X_test_new, y_test_new, label='SMOTE + New Encoding Method')
plot_roc_auc(axs[1,0], model_nb_cv_nearmiss, X_test, y_test, label='Naive Bayes with NearMiss')
plot_roc_auc(axs[1,1], model_nb_cv_nearmiss_new, X_test_new, y_test_new, label='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()

## 6.5 K-Nearest Neighbors
#### 6.5.1 KNN Classifier

In [None]:
# make the model and parameters
def knn_model():
    model_knn = KNeighborsClassifier()
    params_knn = {'algorithm':['auto'], 'n_neighbors': range(1,4)}
    model_knn_cv = GridSearchCV(model_knn, 
                            param_grid = params_knn, 
                            cv = cv, 
                            verbose = 1)
    return model_knn_cv

In [None]:
# fit the model with the best hyperparameters using SMOTE
model_knn_cv_smote = knn_model()
model_knn_cv_smote.fit(X_train_smote ,y_train_smote)
print("Best Hyper Parameters for SMOTE: ", model_knn_cv_smote.best_params_)

In [None]:
# fit the model with the best hyperparameters using SMOTE New encoding method
model_knn_cv_smote_new = knn_model()
model_knn_cv_smote_new.fit(X_train_smote_new ,y_train_smote_new)
print("Best Hyper Parameters for SMOTE2: ", model_knn_cv_smote_new.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss
model_knn_cv_nearmiss = knn_model()
model_knn_cv_nearmiss.fit(X_train_nearmiss ,y_train_nearmiss)
print("Best Hyper Parameters for NearMiss: ", model_knn_cv_nearmiss.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss New encoding method
model_knn_cv_nearmiss_new = knn_model()
model_knn_cv_nearmiss_new.fit(X_train_nearmiss_new ,y_train_nearmiss_new)
print("Best Hyper Parameters for NearMiss2: ", model_knn_cv_nearmiss_new.best_params_)

#### 6.5.2 KNN Classification Report

In [None]:
# predict on the test set and print the classification report (SMOTE)
y_pred_knn_smote = model_knn_cv_smote.predict(X_test)
print("Classification Report for SMOTE: \n", classification_report(y_test, y_pred_knn_smote))

In [None]:
# predict on the test set and print the classification report (SMOTE, new encoding method)
y_pred_knn_smote_new = model_knn_cv_smote_new.predict(X_test_new)
print("Classification Report for SMOTE2: \n", classification_report(y_test_new, y_pred_knn_smote_new))

In [None]:
# predict on the test set and print the classification report (NearMiss)
y_pred_knn_nearmiss = model_knn_cv_nearmiss.predict(X_test)
print("Classification Report for NearMiss: \n", classification_report(y_test, y_pred_knn_nearmiss))

In [None]:
# predict on the test set and print the classification report (NearMiss, new encoding method)
y_pred_knn_nearmiss_new = model_knn_cv_nearmiss_new.predict(X_test_new)
print("Classification Report for NearMiss2: \n", classification_report(y_test_new, y_pred_knn_nearmiss_new))

#### 6.5.3 KNN Confusion Matrix

In [None]:
%matplotlib inline
cm_knn_smote = confusion_matrix(y_test, y_pred_knn_smote)
cm_knn_nearmiss = confusion_matrix(y_test, y_pred_knn_nearmiss)
cm_knn_smote_new = confusion_matrix(y_test_new, y_pred_knn_smote_new)
cm_knn_nearmiss_new = confusion_matrix(y_test_new, y_pred_knn_nearmiss_new)

fig, axs = plt.subplots(2,2, figsize=(10,4))

plot_confusion_matrix(axs[0,0], cm_knn_smote, title='KNN with SMOTE')
plot_confusion_matrix(axs[0,1], cm_knn_smote_new, title='SMOTE + New Encoding Method')
plot_confusion_matrix(axs[1,0], cm_knn_nearmiss, title='KNN with NearMiss')
plot_confusion_matrix(axs[1,1], cm_knn_nearmiss_new, title='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()

#### 6.5.4 KNN ROC Curve

In [None]:
# plot_roc_auc(model_knn_cv, X_test, y_test, label='KNN')

fig, axs = plt.subplots(2,2, figsize=(12,8))

plot_roc_auc(axs[0,0], model_knn_cv_smote, X_test, y_test, label='KNN with SMOTE')
plot_roc_auc(axs[0,1], model_knn_cv_smote_new, X_test_new, y_test_new, label='SMOTE + New Encoding Method')
plot_roc_auc(axs[1,0], model_knn_cv_nearmiss, X_test, y_test, label='KNN with NearMiss')
plot_roc_auc(axs[1,1], model_knn_cv_nearmiss_new, X_test_new, y_test_new, label='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()

## 6.6 Perceptron
#### 6.6.1 Perceptron Classifier

In [None]:
# make the model and parameters
def per_model():
    model_per = Perceptron()
    params_per = {'tol':[0.0001], 'random_state': [2]}
    model_per_cv = GridSearchCV(model_per, 
                            param_grid = params_per, 
                            cv = cv, 
                            refit = True,
                            verbose = 1)
    return model_per_cv
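

One caveat for this model: scikit-learn's `Perceptron` provides `decision_function` but no `predict_proba`, so a ROC helper that calls `predict_proba` would fail for it. Since `roc_curve` also accepts non-thresholded decision values, a small fallback like the one sketched below could be used instead. `positive_class_scores` and the two stub classes are hypothetical and exist only to keep the example self-contained.

```python
def positive_class_scores(model, X):
    """Return one score per sample for the positive class.

    Uses predict_proba when the model offers it, otherwise falls back
    to the signed margins from decision_function.
    """
    if hasattr(model, "predict_proba"):
        return [row[1] for row in model.predict_proba(X)]
    return list(model.decision_function(X))


class ProbaStub:
    """Stand-in for a probabilistic classifier (illustrative values)."""

    def predict_proba(self, X):
        return [[0.2, 0.8], [0.7, 0.3]]


class MarginStub:
    """Stand-in for a margin-based classifier such as a perceptron."""

    def decision_function(self, X):
        return [1.5, -0.5]


X_demo = [[0.0], [1.0]]
proba_scores = positive_class_scores(ProbaStub(), X_demo)
margin_scores = positive_class_scores(MarginStub(), X_demo)
print(proba_scores, margin_scores)
```

Decision values are not probabilities, but the ROC curve only depends on how the samples are ranked, so either kind of score can be passed to `roc_curve`.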

In [None]:
# fit the model with the best hyperparameters using SMOTE
model_per_cv_smote = per_model()
model_per_cv_smote.fit(X_train_smote ,y_train_smote)
print("Best Hyper Parameters for SMOTE: ", model_per_cv_smote.best_params_)

In [None]:
# fit the model with the best hyperparameters using SMOTE New encoding method
model_per_cv_smote_new = per_model()
model_per_cv_smote_new.fit(X_train_smote_new ,y_train_smote_new)
print("Best Hyper Parameters for SMOTE2: ", model_per_cv_smote_new.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss
model_per_cv_nearmiss = per_model()
model_per_cv_nearmiss.fit(X_train_nearmiss ,y_train_nearmiss)
print("Best Hyper Parameters for NearMiss: ", model_per_cv_nearmiss.best_params_)

In [None]:
# fit the model with the best hyperparameters using NearMiss New encoding method
model_per_cv_nearmiss_new = per_model()
model_per_cv_nearmiss_new.fit(X_train_nearmiss_new ,y_train_nearmiss_new)
print("Best Hyper Parameters for NearMiss2: ", model_per_cv_nearmiss_new.best_params_)

#### 6.6.2 Perceptron Classification Report

In [None]:
# predict on the test set and print the classification report (SMOTE)
y_pred_per_smote = model_per_cv_smote.predict(X_test)
print("Classification Report for SMOTE: \n", classification_report(y_test, y_pred_per_smote))

In [None]:
# predict on the test set and print the classification report (SMOTE, new encoding method)
y_pred_per_smote_new = model_per_cv_smote_new.predict(X_test_new)
print("Classification Report for SMOTE2: \n", classification_report(y_test_new, y_pred_per_smote_new))

In [None]:
# predict on the test set and print the classification report (NearMiss)
y_pred_per_nearmiss = model_per_cv_nearmiss.predict(X_test)
print("Classification Report for NearMiss: \n", classification_report(y_test, y_pred_per_nearmiss))

In [None]:
# predict on the test set and print the classification report (NearMiss, new encoding method)
y_pred_per_nearmiss_new = model_per_cv_nearmiss_new.predict(X_test_new)
print("Classification Report for NearMiss2: \n", classification_report(y_test_new, y_pred_per_nearmiss_new))

#### 6.6.3 Perceptron Confusion Matrix

In [None]:
%matplotlib inline
cm_per_smote = confusion_matrix(y_test, y_pred_per_smote)
cm_per_nearmiss = confusion_matrix(y_test, y_pred_per_nearmiss)
cm_per_smote_new = confusion_matrix(y_test_new, y_pred_per_smote_new)
cm_per_nearmiss_new = confusion_matrix(y_test_new, y_pred_per_nearmiss_new)

fig, axs = plt.subplots(2,2, figsize=(10,4))

plot_confusion_matrix(axs[0,0], cm_per_smote, title='Perceptron with SMOTE')
plot_confusion_matrix(axs[0,1], cm_per_smote_new, title='SMOTE + New Encoding Method')
plot_confusion_matrix(axs[1,0], cm_per_nearmiss, title='Perceptron with NearMiss')
plot_confusion_matrix(axs[1,1], cm_per_nearmiss_new, title='NearMiss + New Encoding Method')

plt.tight_layout()
plt.show()