<h1 style="text-align: center;">MACHINE LEARNING</h1>

Now that we have a basic picture of what our data looks like, we can move on to the Machine Learning part.\
Up to this point, we know:
* **Task:** Classification Problem
* **Independent/Features Variables:**
    - Quantitative Features:
        - *Age*
        - *RestingBP*
        - *Cholesterol*
        - *MaxHR*
        - *Oldpeak*
    - Categorical Features:
        - *FastingBS*
        - *Sex*
        - *ChestPainType*
        - *RestingECG*
        - *ExerciseAngina*
        - *ST_Slope*
* **Dependent/Target Variable:** HeartDisease
* **Proposed Approach:**
    * Convert Categorical (non-numeric) Features to Numerical Features (Categorical -> Numerical)
    * Scale Values (if needed)
    * Apply Classification Algorithms like:
        - *Gaussian Naive Bayes*
        - *Nearest Neighbours*
        - *SVC*
        - *Decision Trees*
        - *Ensembles*
        - *Random Forest*
        - *Gradient Boosting*
        - *MLPClassifier*

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from joblib import dump
import os

from sklearn.naive_bayes import GaussianNB

## Get data

In [2]:
data = pd.read_csv('data/kaggle/heart.csv')

In [3]:
data.head()

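|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |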


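To convert the categorical features to numerical ones, we can either define an explicit mapping (one integer code per category) or use pandas dummies (one-hot encoding). Both methods are set up below so their results can be compared; the mapping version is kept commented out, and the rest of this notebook uses the dummies version.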
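
As a rough illustration of how the two encodings differ, here is a minimal sketch on a tiny hand-made frame. The frame `toy` and its three rows are made up for this example and are not taken from the dataset; only the column names and category labels mirror the real data.

```python
import pandas as pd

# Toy frame with two of the categorical columns (rows are illustrative only)
toy = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Down"],
})

# Method 1: explicit mapping, one integer code per category
sex_map = {"M": 0, "F": 1}
st_map = {"Flat": 0, "Up": 1, "Down": 2}
toy_mapped = toy.copy()
toy_mapped["Sex"] = toy_mapped["Sex"].map(sex_map)
toy_mapped["ST_Slope"] = toy_mapped["ST_Slope"].map(st_map)

# Method 2: one-hot encoding; drop_first=True drops the first category
# (in sorted order) of each column, so k categories become k-1 columns
toy_dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(toy_mapped)
print(toy_dummies)
```

The mapping keeps one column per feature but imposes an arbitrary numeric order on the categories, while the dummies add columns without implying any order.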

In [4]:
# One-hot encode the categorical columns with pandas get_dummies
data_dummies = pd.get_dummies(data, drop_first=True)

In [5]:
# Dummified Data
data_dummies

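|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease | Sex_M | ChestPainType_ATA | ChestPainType_NAP | ChestPainType_TA | RestingECG_Normal | RestingECG_ST | ExerciseAngina_Y | ST_Slope_Flat | ST_Slope_Up |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 140 | 289 | 0 | 172 | 0.0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 49 | 160 | 180 | 0 | 156 | 1.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 37 | 130 | 283 | 0 | 98 | 0.0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 48 | 138 | 214 | 0 | 108 | 1.5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 4 | 54 | 150 | 195 | 0 | 122 | 0.0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 913 | 45 | 110 | 264 | 0 | 132 | 1.2 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 914 | 68 | 144 | 193 | 1 | 141 | 3.4 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 915 | 57 | 130 | 131 | 0 | 115 | 1.2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 916 | 57 | 130 | 236 | 0 | 174 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |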


In [6]:
# chestpain_map = {'ATA': 0, 'NAP': 1, 'ASY': 2, 'TA': 3}
# resting_map = {'Normal': 0, 'LVH': 1, 'ST': 2}
# exercise_map = {'N': 0, 'Y': 1}
# st_map = {'Flat': 0, 'Up': 1, 'Down': 2}
# sex_map = {'M': 0, 'F': 1}

In [7]:
# def numeric_features(data):
#     data = data.copy()
#     data['ChestPainType'] = data['ChestPainType'].map(chestpain_map)
#     data['Sex'] = data['Sex'].map(sex_map)
#     data['RestingECG'] = data['RestingECG'].map(resting_map)
#     data['ExerciseAngina'] = data['ExerciseAngina'].map(exercise_map)
#     data['ST_Slope'] = data['ST_Slope'].map(st_map)
#     return data

In [8]:
# data_mapped = numeric_features(data)

## Train/Test split

In [9]:
# Separate Features and Target Variable
X = data_dummies.drop(["HeartDisease"], axis=1)
y = data_dummies["HeartDisease"]

In [10]:
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, stratify=y, random_state=123)
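
To see what `stratify` does in the split above, the following sketch runs the same call on synthetic data. The arrays `X_demo` and `y_demo` are made up for this illustration (200 samples, 30% positives) and stand in for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 200 samples, 30% positive class
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 60 + [0] * 140)

# Same arguments as the split above, applied to the synthetic arrays
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=123
)

# With stratification, both parts keep roughly the overall class ratio
print("Overall positive rate:", y_demo.mean())
print("Train positive rate:  ", y_tr.mean())
print("Test positive rate:   ", y_te.mean())
```

Without `stratify`, a small test set could end up with a noticeably different share of positive cases, which would make the test accuracy harder to interpret.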

## Scaling

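Instead of scaling inside each model's pipeline, we scale the data once and reuse it for every model. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training.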

In [11]:
# Initiate Scaler
scaler = StandardScaler()

In [12]:
# Scale X_train and X_test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
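
As a small check of what the two calls above do, here is a sketch on synthetic values. The arrays `train_demo` and `test_demo` are random numbers made up for this example, not the real features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (illustrative values only)
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=130, scale=18, size=(100, 1))
test_demo = rng.normal(loc=130, scale=18, size=(20, 1))

scaler_demo = StandardScaler()
# fit_transform learns mean and std from the training data, then scales it
train_scaled = scaler_demo.fit_transform(train_demo)
# transform reuses the training mean and std; nothing is learned from test data
test_scaled = scaler_demo.transform(test_demo)

# Training data ends up with mean ~0 and std ~1; test data only approximately
print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```

The scaled test set usually does not have exactly zero mean and unit variance, which is expected because it is scaled with the training statistics.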

In [13]:
# Save the scaled data so that later experiments reuse exactly the same arrays
os.makedirs('data/scaled', exist_ok=True)
np.save('data/scaled/X_train.npy', X_train_scaled)
np.save('data/scaled/X_test.npy', X_test_scaled)
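
Arrays saved this way can be read back with `np.load`. The sketch below shows the round trip on a small made-up array in a temporary directory, so it does not touch the `data/scaled` files; the array `arr` is purely illustrative.

```python
import os
import tempfile

import numpy as np

# Small illustrative array standing in for the scaled features
arr = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_train.npy")
    np.save(path, arr)          # write the array in .npy format
    restored = np.load(path)    # read it back into memory

# The restored array has the same shape, dtype, and values
print(np.array_equal(arr, restored))
```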

## Base Model Training

In [14]:
# Directory to save models
if not os.path.exists("models/"):
    print("Initiated Models directory!")
    os.makedirs("models")
else:
    print("Models directory already exists!")

Initiated Models directory!


In [15]:
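def train_model(model, modelName, X_train, y_train, X_test, y_test):
    """Fit `model`, print train/test accuracy, and save it under models/."""
    model = model.fit(X_train, y_train)
    print("Model Name: ", modelName)
    # score() returns the mean accuracy for scikit-learn classifiers
    print("Train Accuracy: ", model.score(X_train, y_train))
    print("Test Accuracy: ", model.score(X_test, y_test))
    # Persist the fitted model so it can be reloaded without retraining
    dump(model, os.path.join('models', modelName + '.joblib'))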
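
A model saved with `dump` can be restored later with `joblib.load`. The following sketch shows this on a classifier trained on synthetic data and written to a temporary directory; `X_demo`, `y_demo`, and the temporary path are assumptions made for the example, not part of this notebook's data.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.naive_bayes import GaussianNB

# Synthetic two-feature data with a simple rule for the label
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

clf = GaussianNB().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "GaussianNB.joblib")
    dump(clf, path)              # serialize the fitted model
    restored_clf = load(path)    # load it back as a new object

# The restored model should give the same predictions as the original
same = np.array_equal(clf.predict(X_demo), restored_clf.predict(X_demo))
print(same)
```

Loading a saved model generally works best with the same scikit-learn version that was used to save it.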

## 1. GaussianNB

In [16]:
# Initiate model
model = GaussianNB()

In [17]:
# Train model
train_model(model, "GaussianNB", X_train_scaled, y_train, X_test_scaled, y_test)

Model Name:  GaussianNB
Train Accuracy:  0.8628205128205129
Test Accuracy:  0.8623188405797102
