# ISM6251.003S23- Assignment 1


## Business Problem
The aim is to create a predictive model using the patient characteristics provided, which can determine if a patient has heart disease or not. If the model proves effective, it could be utilized by the medical sector to anticipate heart disease in new patients who possess the necessary data. This will enable them to prioritize patients for diagnostic testing, surgeries, and other treatments.

## Description of the analysis
This project aims to determine the evaluation metrics employed for assessing the performance of the model(s) and explain the rationale behind the selection of these metrics. The logistic regression, SVM, and decision tree models will be utilized for modeling purposes. To test a variety of parameter values for each model, both random and grid searches will be performed for each of the model.

### 1. Library Import

In [38]:
import numpy as np
import pandas as pd
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

### 2. Load the data

In [18]:
X_train = pd.read_csv("heart_train_X.csv")
X_test = pd.read_csv("heart_test_X.csv")
y_train = pd.read_csv("heart_train_y.csv")
y_test = pd.read_csv("heart_test_y.csv")


### 2. Model the data

Dataframe will be constructed to contain all the outcomes obtained from our models

In [19]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

### 3. Choosing the right performance metric

In this dataset, a false-negative result would indicate that a person with heart disease is wrongly classified as healthy. Therefore, it is crucial to choose a model that minimizes the number of false negatives and thus, maximizes the recall value. Also, the cost of a false-negative result in this model is very high and would result in death of the patient.

Hence, recall will be used to measure the performance of the models. 

### 4. Logistic Regression 

#### 4.1 Define parameter distribution

In [20]:
param_grid_lr = {'penalty': ['None','l1', 'l2']}

#### 4.2 Performing Random Search

In [21]:
#Defining scoring metric
score_measure = "recall"

# define the logistic regression model
model = LogisticRegression()

# Create a random search object
random_search_lr = RandomizedSearchCV(estimator=model, param_distributions=param_grid_lr, n_iter=100, cv=5, random_state=42)

# Fit the random search to the data
_ = random_search_lr.fit(X_train, y_train)

# Print the best parameters found
print(f"The best {score_measure} score is {random_search_lr.best_score_}")
print(f"... with parameters: {random_search_lr.best_params_}")

warnings.filterwarnings('ignore')

The best recall score is 0.8243968739381582
... with parameters: {'penalty': 'l2'}


#### 4.3 Grid Search

In [22]:
#Defining scoring metric
score_measure = "recall"
kfolds = 5

#Using best parameters found from random search
param_grid_lrg = {'penalty': ['l2']}

# define the logistic regression model
model = LogisticRegression()

# define the grid search
grid_search_lr = GridSearchCV(estimator = model, param_grid=param_grid_lrg, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# fit the grid search to the data
_ = grid_search_lr.fit(X_train, y_train)


#Best parameters
print(f"Best parameters: {grid_search_lr.best_params_}")



Fitting 5 folds for each of 1 candidates, totalling 5 fits
Best parameters: {'penalty': 'l2'}


#### 4.4 Predict with test data

In [23]:
c_matrix = confusion_matrix(y_test, grid_search_lr.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"Logistic Regression", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Logistic Regression,0.772532,0.803571,0.743802,0.772532


### 5. SVM

#### 5.1 Define parameter distribution

In [24]:
param_grid_SVM = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

#### 5.2  Performing Random Search

In [25]:
# Create an SVM classifier
svm = SVC()

# Create a random search object
random_search_SVM = RandomizedSearchCV(
    svm, 
    param_distributions=param_grid_SVM, 
    n_iter=50, # number of parameter combinations to try
    cv=5, # number of cross-validation folds
    scoring='recall', # scoring metric to use
    verbose=1, 
    n_jobs=-1 # number of CPU cores to use for parallel computation
)


# Fit the random search to the data
_= random_search_SVM.fit(X_train, y_train)

# Print the best parameters found
print("Best parameters found:", random_search_SVM.best_params_)


Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters found: {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1}


#### 5.3 Grid Search

In [26]:
#Using the best parameters found from random search
param_grid_SVMg = [
  {'C': [1], 
   'kernel': ['rbf'],
   'gamma': [0.0001]}
  
]

# Create an SVM classifier
svm = SVC()

# Create a grid search object
grid_search_SVM = GridSearchCV(
    svm, 
    param_grid=param_grid_SVMg, 
    cv=5, # number of cross-validation folds
    scoring='recall', # scoring metric to use
    verbose=1, 
    n_jobs=-1 # number of CPU cores to use for parallel computation
)

# Fit the grid search to the data
grid_search_SVM.fit(X_train, y_train)



Fitting 5 folds for each of 1 candidates, totalling 5 fits


GridSearchCV(cv=5, estimator=SVC(), n_jobs=-1,
             param_grid=[{'C': [1], 'gamma': [0.0001], 'kernel': ['rbf']}],
             scoring='recall', verbose=1)

In [27]:
c_matrix = confusion_matrix(y_test, grid_search_SVM.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"SVM", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Logistic Regression,0.772532,0.803571,0.743802,0.772532
0,SVM,0.519313,0.519313,1.0,0.683616


### 6. Decision Tree

#### 6.1 Define Parameter Distribution

In [28]:
param_grid_dtree = {
    'min_samples_split': np.arange(1,200),  
    'min_samples_leaf': np.arange(1,200),
    'max_leaf_nodes': np.arange(5, 200), 
    'max_depth': np.arange(1,50), 
    'criterion': ['entropy', 'gini'],
}

#### 6.2 Perform Random Search

In [29]:
# Define scoring measure
score_measure = "recall"
kfolds = 5

# Create a decision tree object
dtree = DecisionTreeClassifier()

# Use RandomizedSearchCV to search over the parameter grid
random_search_dtree = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid_dtree, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Fit the random search object to the data
_ = random_search_dtree.fit(X_train, y_train)

# Print the best parameters and score

print(f"The best {score_measure} score is {random_search_dtree.best_score_}")
print(f"... with parameters: {random_search_dtree.best_params_}")


warnings.filterwarnings('ignore')

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best recall score is 0.8442857142857143
... with parameters: {'min_samples_split': 193, 'min_samples_leaf': 25, 'max_leaf_nodes': 36, 'max_depth': 14, 'criterion': 'gini'}


#### 6.3 Grid Search

In [30]:
#Define scoring measure
score_measure = "recall"
kfolds = 5

#Using best parameters found from random search
param_grid_dtreeg = {
    'min_samples_split': np.arange(176,180),  
    'min_samples_leaf': np.arange(22,26),
    'max_leaf_nodes': np.arange(49,53), 
    'max_depth': np.arange(20,24), 
    'criterion': ['gini'],
}

# Create a decision tree object
dtree = DecisionTreeClassifier()

# Create a grid search object
grid_search_dtree = GridSearchCV(estimator = dtree, param_grid=param_grid_dtreeg, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Fit the grid search to the data
_ = grid_search_dtree.fit(X_train, y_train)


# Print the best parameters and score
print(f"The best {score_measure} score is {grid_search_dtree.best_score_}")
print(f"... with parameters: {grid_search_dtree.best_params_}")


Fitting 5 folds for each of 256 candidates, totalling 1280 fits
The best recall score is 0.8442857142857143
... with parameters: {'criterion': 'gini', 'max_depth': 20, 'max_leaf_nodes': 49, 'min_samples_leaf': 23, 'min_samples_split': 176}


In [43]:
c_matrix = confusion_matrix(y_test, grid_search_dtree.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decision tree", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Logistic Regression,0.772532,0.803571,0.743802,0.772532
0,SVM,0.519313,0.519313,1.0,0.683616
0,Decision tree,0.738197,0.720588,0.809917,0.762646


### 7 Neural Net

#### 7.1 Define parameter distribution

In [31]:
param_grid_nn = {
    'hidden_layer_sizes': [ (50,), (70,),(50,30), (40,20), (60,40, 20), (70,50,40)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'alpha': [0, .2, .5, .7, 1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init': [0.001, 0.01, 0.1, 0.2, 0.5],
    'max_iter': [5000]
}

#### 7.2 Perform Random search

In [33]:
#Defining Scoring metric
score_measure = "accuracy"
kfolds = 5

#Create a MLP classifier
ann = MLPClassifier()

#Use RandomizedSearchCV to search over the parameter grid
random_search_nn = RandomizedSearchCV(estimator = ann, param_distributions=param_grid_nn, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)
# Fit the random search object to the data
_ = random_search_nn.fit(X_train, y_train)

# Print the best parameters and score
bestRecallTree = random_search_nn.best_estimator_
print(random_search_nn.best_params_)



Fitting 5 folds for each of 100 candidates, totalling 500 fits
{'solver': 'sgd', 'max_iter': 5000, 'learning_rate_init': 0.5, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (50,), 'alpha': 1, 'activation': 'tanh'}


In [40]:
y_pred = bestRecallTree.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.80      0.79       112
           1       0.81      0.79      0.80       121

    accuracy                           0.80       233
   macro avg       0.80      0.80      0.80       233
weighted avg       0.80      0.80      0.80       233



#### 7.3 Grid Search

In [44]:
#Define scoring measure
score_measure = "accuracy"
kfolds = 5

#Using best parameters found from random search
param_grid_nng = {
    'hidden_layer_sizes': [ (30,), (50,), (70,), (90,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd'],
    'alpha': [.5, .7, 1],
    'learning_rate': ['adaptive', 'invscaling'],
    'learning_rate_init': [0.2, 0.5, 0.7],
    'max_iter': [5000]
}

#Create a MLP classifier
ann = MLPClassifier()

# Create a grid search object
grid_search_nn = GridSearchCV(estimator = ann, param_grid=param_grid_nng, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Fit the grid search to the data
_ = grid_search_nn.fit(X_train, y_train)

# Print the best parameters and score
bestRecallTree = grid_search_nn.best_estimator_

print(grid_search_nn.best_params_)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
{'activation': 'relu', 'alpha': 1, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'learning_rate_init': 0.7, 'max_iter': 5000, 'solver': 'sgd'}


In [45]:
y_pred = bestRecallTree.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.81      0.78       112
           1       0.81      0.75      0.78       121

    accuracy                           0.78       233
   macro avg       0.78      0.78      0.78       233
weighted avg       0.78      0.78      0.78       233



In [46]:
c_matrix = confusion_matrix(y_test, grid_search_nn.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Neural Net", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Logistic Regression,0.772532,0.803571,0.743802,0.772532
0,SVM,0.519313,0.519313,1.0,0.683616
0,Decision tree,0.738197,0.720588,0.809917,0.762646
0,Neural Network,0.781116,0.8125,0.752066,0.781116


### Discussion of Result

The aim of this project was to develop a binary classifier that could predict whether an individual has heart disease or not. To achieve this, four different modeling techniques, namely logistic regression, SVM, decision tree and neural net were used. The primary goal was to select the model that minimized false negatives (i.e., a sick person being wrongly classified as healthy), as this had severe consequences. Additionally, high accuracy values were desirable, particularly true positives, given the evenly split data.

After comparing the four models, it was determined that the SVM model has the best recall score among all the models, indicating less false negative predictions. 

But the SVM model had a lower precision and accuracy score than the other models indicating less accurate identification of true positives. 

The neural net model also has a good recall score of 0.752, accuracy of 0.781, precision of 0.812 and F1 score of 0.781.However, the decision tree model has a good recall score of 0.809917, indicating fewer false negatives. It also has a good accuracy score of 0.738197 in this case high true positive gives indication of a good model.

Given the context of the data, model with high recall and accuracy should be selected, as the cost of false negatives is higher than that of false positives and it is also important to determine the true positives accurately(people with heart disease classified as sick). Decision tree would be the best fit model for this data, as it has a good recall score, as well as a good accuracy score.