# Predicting heart disease using machine learning
This project uses several Python-based machine learning and data science libraries to build a model that predicts whether or not someone has heart disease based on their medical attributes.

## 1. Problem Statement
> Given clinical parameters about a patient, can we predict whether or not they have heart disease?

## 2. Data
The data is the open-source Cleveland heart disease dataset from the UCI Machine Learning Repository, which is also available on Kaggle: https://archive.ics.uci.edu/dataset/45/heart+disease

## 3. Evaluation
> We aim to reach 95% accuracy at predicting whether or not a patient has heart disease before moving our model to full deployment.

In [None]:
## Importing the tools

# Regular EDA and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# for plotting inside our notebook
%matplotlib inline

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# For Model Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix , classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
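from sklearn.metrics import RocCurveDisplay
# plot_roc_curve was removed in scikit-learn 1.2; alias its replacement so later cells still work
plot_roc_curve = RocCurveDisplay.from_estimator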

# For saving our model
import pickle

## Load Data

In [None]:
df = pd.read_csv("6.1 heart-disease.csv")
df.shape

## Exploratory Data Analysis (EDA)

In [None]:
df.head()

In [None]:
# Checking data distribution balance
df["target"].value_counts().plot(kind="bar", color=["lightpink", "lightblue"]);

In [None]:
# Checking info about every attribute of the data
df.info()

In [None]:
# Checking for missing values
df.isna().sum()

In [None]:
# Getting summary statistics for each attribute of the data
df.describe()

### Finding patterns in our data

In [None]:
# Finding relation between Sex and Target attributes
df.sex.value_counts() # Checking distribution between males and females

In [None]:
pd.crosstab(df.target, df.sex) # Comparing sex to target

In [None]:
# Plotting above crosstab distribution
pd.crosstab(df.target, df.sex).plot(kind="bar",
                                    figsize = (10,5),
                                    color=["lightpink", "lightblue"])
plt.title("Heart disease frequency for Sex")
plt.xlabel("0 = Healthy, 1 = Heart Disease")
plt.ylabel("Amount")
plt.legend(["Female", "Male"])
plt.xticks(rotation=0);


In [None]:
# Age vs. Max Heart Rate for Heart Disease
plt.figure(figsize=(10,6))
# Scatter with positive example
plt.scatter(df.age[df.target==1],
            df.thalach[df.target==1],
            c="salmon")
# Scatter with negative example
plt.scatter(df.age[df.target==0],
            df.thalach[df.target==0],
            c="lightblue")
plt.title("Age vs. Heart Rate in healthy and heart disease patients")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
plt.legend(["Heart Disease", "Healthy"]);

In [None]:
# Checking age distribution
df.age.plot.hist();

#### Most patients in our data are between 55 and 65 years old

## Heart disease frequency per chest pain type
* 0: Typical angina (related to heart disease)
* 1: Atypical angina
* 2: Non-anginal pain
* 3: Asymptomatic
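
The numeric codes above are easy to misread in tables and plots. As a minimal sketch, the codes can be mapped to readable names before building a crosstab. The `toy` frame below is made-up data standing in for `df`, and the mapping is copied from the list above on the assumption that it matches the encoding in the CSV.

```python
import pandas as pd

# Code-to-label mapping copied from the list above
# (assumption: the CSV uses this same encoding)
CP_LABELS = {
    0: "Typical angina",
    1: "Atypical angina",
    2: "Non-anginal pain",
    3: "Asymptomatic",
}

# Tiny made-up frame standing in for df; only the column names match the notebook
toy = pd.DataFrame({"cp": [0, 1, 2, 3, 2, 0],
                    "target": [0, 1, 1, 0, 1, 0]})

# Map codes to names so the crosstab rows are self-explanatory
toy["cp_label"] = toy["cp"].map(CP_LABELS)
cp_table = pd.crosstab(toy["cp_label"], toy["target"])
print(cp_table)
```

The same `.map(CP_LABELS)` call could be applied to `df.cp` before plotting, so the x-axis shows names instead of codes.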

In [None]:
pd.crosstab(df.cp, df.target)

In [None]:
pd.crosstab(df.cp, df.target).plot(kind = "bar",
                                   figsize = (10,6),
                                   color=["lightpink", "lightblue"])
plt.title("Heart Disease Frequency per Chest Pain Type")
plt.xlabel("Chest Pain type")
plt.ylabel("Amount")
plt.legend(["No Disease", "Disease"])
plt.xticks(rotation=0);

### Making a correlation matrix

In [None]:
df.corr()

In [None]:
corr_matrix = df.corr()
fig, ax = plt.subplots(figsize=(15, 15))
ax = sns.heatmap(corr_matrix,
                 annot =True,
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="YlGnBu");
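
A full heatmap can be hard to scan when the main question is which features move with the target. The sketch below ranks features by the absolute value of their correlation with `target`. It runs on synthetic data, and the column names `strong_pos`, `strong_neg` and `noise` are invented for illustration; on the real data the same two lines would start from `df.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: two features tied to the target and one pure-noise column
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "target": target,
    "strong_pos": target + rng.normal(0, 0.3, size=n),
    "strong_neg": -target + rng.normal(0, 0.3, size=n),
    "noise": rng.normal(0, 1, size=n),
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["target"].drop("target")

# Order by strength of the relationship, keeping the sign for interpretation
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Correlation only captures linear relationships, so a low value here does not prove a feature is useless to a non-linear model such as a random forest.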

## 5. Modelling
We will experiment with 3 classification models:
1. Logistic Regression
2. K-Nearest Neighbours Classifier
3. Random Forest Classifier

In [None]:
# Split data into features and target variables
x = df.drop("target", axis = 1)
y = df["target"]

In [None]:
# Split data into training and test sets
np.random.seed(42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
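
The split above relies on the global NumPy seed. A possible alternative, sketched here on synthetic arrays rather than the heart disease data, is to pass `random_state` directly and to add `stratify` so that the train and test sets keep the same class proportions. The names `X_toy` and `y_toy` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples with a 70/30 class imbalance
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 70 + [1] * 30)

# random_state pins the shuffle; stratify keeps the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positive samples in each subset
print(y_tr.mean(), y_te.mean())
```

Whether stratification changes the results on this dataset is not something the notebook measures; it mainly matters when classes are imbalanced or the test set is small.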

In [None]:
# Put all models in a dictionary
models = {"Logistic Regression" : LogisticRegression(),
          "KNN" : KNeighborsClassifier(),
          "Random Forest" : RandomForestClassifier()}

# Create a function to fit and score each model
def fit_and_score(models, x_train, x_test, y_train, y_test):
    """Fit each model on the training data and return its accuracy on the test data."""
    np.random.seed(42)
    model_scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        # For classifiers, .score() returns mean accuracy
        model_scores[name] = model.score(x_test, y_test)
    return model_scores

In [None]:
model_scores = fit_and_score(models=models,
                            x_train = x_train,
                            x_test = x_test,
                            y_train = y_train,
                            y_test = y_test)
model_scores

## Model comparison

In [None]:
model_compare = pd.DataFrame(model_scores, index=["accuracy"])
model_compare.T.plot.bar();

### Evaluation tools that we are going to consider:
* Feature importance
* Confusion matrix
* Cross-validation
* Precision
* Recall
* F1 score
* Classification report
* ROC curve
* Area under the curve (AUC)
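
To make the relationship between several of these metrics concrete, the sketch below derives precision, recall and F1 by hand from a confusion matrix and checks them against scikit-learn's own functions. The labels are made up for illustration and are not taken from the heart disease data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels: 5 positive and 5 negative samples
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In a medical setting, recall is often the number to watch, because a false negative means a patient with heart disease is missed.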

**Hyperparameter Tuning**

In [None]:
# Tuning KNN
train_scores = []
test_scores = []

# Creating different n_neighbors to test on
neighbors = range(1,21)

# Setup KNN instance
knn = KNeighborsClassifier()

# Loop through different n_neighbors
for i in neighbors:
    knn.set_params(n_neighbors = i)
    knn.fit(x_train, y_train)
    train_scores.append(knn.score(x_train, y_train))
    test_scores.append(knn.score(x_test, y_test))

In [None]:
train_scores

In [None]:
test_scores

In [None]:
plt.plot(neighbors, train_scores, label="Train score")
plt.plot(neighbors, test_scores, label="Test score")
plt.xticks(np.arange(1,21,1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend()
best_n_neighbors_index = test_scores.index(max(test_scores))
best_n_neighbors = neighbors[best_n_neighbors_index]
print(f"Maximum KNN score on the test data: {max(test_scores)*100:.2f}%, achieved with n_neighbors = {best_n_neighbors}")

## Hyperparameter tuning with RandomizedSearchCV
Now we will tune the following models:
* LogisticRegression()
* RandomForestClassifier()

In [None]:
# Creating a hyperparameter grid for LogisticRegression
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

# Creating a hyperparameter grid for RandomForestClassifier
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3,5,10],
           "min_samples_split": np.arange(2,20,2),
           "min_samples_leaf": np.arange(1,20,2)}
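
Before running a randomized search, it can help to count how many combinations each grid contains. The sketch below copies the two grids from the cell above and counts them with scikit-learn's `ParameterGrid`; the variable names `n_log_reg` and `n_rf` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Grids copied from the cell above
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

# len(ParameterGrid(...)) is the product of the number of values per parameter
n_log_reg = len(ParameterGrid(log_reg_grid))
n_rf = len(ParameterGrid(rf_grid))
print(f"LogisticRegression candidates: {n_log_reg}")
print(f"RandomForestClassifier candidates: {n_rf}")
```

With `n_iter=20`, a randomized search can visit every LogisticRegression candidate but only a small fraction of the RandomForestClassifier ones, which is worth keeping in mind when comparing the two tuned scores.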

In [None]:
# Tuning LogisticRegression model

np.random.seed(42)
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter=20,
                                verbose=True)
rs_log_reg.fit(x_train, y_train)

In [None]:
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(x_test, y_test)

In [None]:
# Hyperparameter tuning for RandomForestClassifier
np.random.seed(42)

rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions=rf_grid,
                           cv=5,
                           n_iter=20,
                           verbose=True)
rs_rf.fit(x_train, y_train)

In [None]:
rs_rf.best_params_

In [None]:
rs_rf.score(x_test, y_test)

In [None]:
model_scores

## Hyperparameter Tuning with GridSearchCV

In [None]:
# Setting Hyperparameters for LogisticRegression model
log_reg_grid = {"C": np.logspace(-4, 4, 30),
                "solver": ["liblinear"]}
gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid=log_reg_grid,
                          cv=15,
                          verbose=True)
gs_log_reg.fit(x_train, y_train);

In [None]:
gs_log_reg.best_params_

In [None]:
gs_log_reg.score(x_test, y_test)

## Evaluating our tuned machine learning classifier

* ROC curve and AUC score
* Confusion Matrix
* Classification report
* Precision
* Recall
* F1-score
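
As a rough sketch of what the ROC curve and AUC score measure, the example below computes both from predicted probabilities on synthetic data. The curve is the true positive rate plotted against the false positive rate over all decision thresholds, and AUC is the area under it. The dataset from `make_classification` and the names `toy_clf` and `probs` are placeholders, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem standing in for the heart disease data
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# ROC analysis needs scores or probabilities, not hard 0/1 predictions
probs = toy_clf.predict_proba(X_te)[:, 1]

# One (fpr, tpr) point per threshold
fpr, tpr, thresholds = roc_curve(y_te, probs)

# Two equivalent ways to get the area under the curve
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(y_te, probs)
print(f"AUC: {auc_direct:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.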

In [None]:
y_preds = gs_log_reg.predict(x_test)
y_preds

In [None]:
x_test.head()

In [None]:
# Plot ROC Curve and calculate AUC metric
plot_roc_curve(gs_log_reg, x_test, y_test)

In [None]:
# Confusion matrix
print(confusion_matrix(y_test, y_preds))

In [None]:
sns.set(font_scale=1.5)

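def plot_conf_mat(y_test, y_preds):
    """Plot a confusion matrix as a seaborn heatmap."""
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True,
                     cbar=False)
    # confusion_matrix puts true labels on rows (y-axis) and predictions on columns (x-axis)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")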

plot_conf_mat(y_test, y_preds)

In [None]:
# Classification report 
print(classification_report(y_test, y_preds))

### Calculating evaluation metrics using cross validation
Using `cross_val_score()`

In [None]:
gs_log_reg.best_params_

In [None]:
# Creating a new classifier with the best parameters found above
clf = LogisticRegression(C=0.20433597178569418,
                         solver="liblinear")

In [None]:
# Cross-validated accuracy
cv_acc = cross_val_score(clf,
                         x,
                         y,
                         cv=5,
                         scoring="accuracy")
cv_acc

In [None]:
cv_acc = np.mean(cv_acc)
cv_acc

In [None]:
# Cross-validated precision
cv_precision = cross_val_score(clf,
                               x,
                               y,
                               cv=5,
                               scoring="precision")
cv_precision = np.mean(cv_precision)
cv_precision

In [None]:
# Cross-validated recall
cv_recall = cross_val_score(clf,
                            x,
                            y,
                            cv=5,
                            scoring="recall")
cv_recall = np.mean(cv_recall)
cv_recall

In [None]:
# Cross-validated f1
cv_f1 = cross_val_score(clf,
                        x,
                        y,
                        cv=5,
                        scoring="f1")
cv_f1 = np.mean(cv_f1)
cv_f1

In [None]:
# Visualizing cross_validated metrics
cv_metrics = pd.DataFrame({"Accuracy": cv_acc,
                           "Precision": cv_precision,
                           "Recall": cv_recall,
                           "F1": cv_f1},
                         index=[0])
cv_metrics.T.plot.bar(title ="Cross-validated classification metrics",
                     legend=False);
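
The four `cross_val_score()` calls above each refit the model on every fold. One possible shortcut, sketched here on synthetic data, is scikit-learn's `cross_validate()`, which accepts a list of scorers and fits each fold only once. `X_toy`, `y_toy` and `toy_clf` are placeholder names; in the notebook they would correspond to `x`, `y` and `clf`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in with 13 features, the same count as the feature columns here
X_toy, y_toy = make_classification(n_samples=300, n_features=13, random_state=0)
toy_clf = LogisticRegression(solver="liblinear")

# One call, several metrics; results are keyed as "test_<metric>"
scoring = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(toy_clf, X_toy, y_toy, cv=5, scoring=scoring)

# Average each metric over the 5 folds
cv_means = {metric: cv_results[f"test_{metric}"].mean() for metric in scoring}
print(cv_means)
```

The resulting dictionary can be passed to `pd.DataFrame(cv_means, index=[0])` to reproduce the bar chart above.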

### Feature Importance

In [None]:
clf.fit(x_train, y_train);

In [None]:
clf.coef_

In [None]:
# Matching coefficients to feature names
# (x.columns holds only the features; df.columns also includes "target")
feature_dict = dict(zip(x.columns, list(clf.coef_[0])))
feature_dict

In [None]:
# Visualize feature importance 
feature_df = pd.DataFrame(feature_dict, index=[0])
feature_df.T.plot.bar(title ="Feature Importance", legend = False);
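
One caveat when reading logistic regression coefficients as feature importance: their size depends on the units of each feature. The sketch below illustrates this on synthetic data where two features contribute equally but live on very different scales, and shows how standardizing inside a pipeline makes the coefficients comparable. The column names `small_scale` and `large_scale` are invented, and this is an illustration of the effect rather than a claim about the heart disease features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
toy_X = pd.DataFrame({
    "small_scale": rng.normal(0, 1, size=n),
    "large_scale": rng.normal(0, 100, size=n),
})

# Both features matter equally once their scale is removed
signal = toy_X["small_scale"] + toy_X["large_scale"] / 100
toy_y = (signal + rng.normal(0, 0.5, size=n) > 0).astype(int)

# Fit once on raw features and once on standardized features
raw = LogisticRegression(solver="liblinear").fit(toy_X, toy_y)
scaled = make_pipeline(StandardScaler(),
                       LogisticRegression(solver="liblinear")).fit(toy_X, toy_y)

raw_coefs = pd.Series(raw.coef_[0], index=toy_X.columns)
scaled_coefs = pd.Series(scaled.named_steps["logisticregression"].coef_[0],
                         index=toy_X.columns)

print("Raw coefficients:\n", raw_coefs)
print("Standardized coefficients:\n", scaled_coefs)
```

On raw features the large-scale column looks almost irrelevant, while after standardizing the two coefficients are of similar size.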

In [None]:
# Saving our best model (the with-block closes the file handle after writing)
with open("gs-log-reg.model", "wb") as f:
    pickle.dump(gs_log_reg, f)
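
A saved model is only useful if it loads back correctly. The sketch below round-trips a small model through `pickle` and checks that the reloaded copy gives identical predictions. It uses synthetic data and a temporary file named `toy.model`, both placeholders, so it does not touch `gs-log-reg.model`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for gs_log_reg
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
toy_model = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy.model"

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Load it back into a new object
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(toy_model.predict(X_toy),
                                  loaded_model.predict(X_toy))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and a model is best reloaded with the same scikit-learn version that saved it.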