# Predicting Heart Disease Using Machine Learning

## 1. Problem Statement

> Given clinical data of a patient, predict whether the patient has heart disease.

## 2. Data

> Kaggle Dataset: <https://www.kaggle.com/c/heart-disease-uci>  
> UCI Dataset: <https://archive.ics.uci.edu/ml/datasets/heart+disease>

## 3. Evaluation

> The goal is to reach at least 95% accuracy at predicting whether a patient has heart disease.
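As a minimal sketch of that check, accuracy is the fraction of predictions that match the true labels. The labels, the `accuracy` helper, and the `TARGET_ACCURACY` name below are illustrative assumptions, not values from the dataset:

```python
# Hypothetical labels for ten patients (1 = disease, 0 = no disease).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

TARGET_ACCURACY = 0.95  # the goal stated in this section


def accuracy(y_true, y_pred):
    """Return the fraction of predictions equal to the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


score = accuracy(y_true, y_pred)
meets_target = score >= TARGET_ACCURACY
print(f"accuracy = {score:.2f}, meets target: {meets_target}")
```

Here nine of the ten made-up predictions are correct, so the score falls short of the target.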

## 4. Features

> The data set contains ***`13`*** feature attributes and ***`1`*** target attribute:
>
> 1. `age`: age in years
> 2. `sex`: sex (1 = male; 0 = female)
> 3. `cp`: chest pain type
>
>     * Value 0: typical angina
>     * Value 1: atypical angina
>     * Value 2: non-anginal pain
>     * Value 3: asymptomatic
> 4. `trestbps`: resting blood pressure (in mm Hg on admission to the hospital)
> 5. `chol`: serum cholesterol in mg/dl
> 6. `fbs`: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
> 7. `restecg`: resting electrocardiographic results
>     * Value 0: normal
>     * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
>     * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
> 8. `thalach`: maximum heart rate achieved
> 9. `exang`: exercise induced angina (1 = yes; 0 = no)
> 10. `oldpeak`: ST depression induced by exercise relative to rest
> 11. `slope`: the slope of the peak exercise ST segment
>     * Value 0: upsloping
>     * Value 1: flat
>     * Value 2: downsloping
> 12. `ca`: number of major vessels (0-3) colored by fluoroscopy
> 13. `thal`:
>     * 0 = normal
>     * 1 = fixed defect
>     * 2 = reversible defect
>
> and the label:
>
> 14. `condition` (stored as the `target` column in the CSV loaded below):
>     * 0 = no disease
>     * 1 = disease
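Several of these attributes are integer codes rather than measurements. A small sketch of turning the codes back into readable labels, using only the mappings listed above; the `CATEGORY_LABELS` table, the `decode` helper, and the sample record are illustrative and not part of the dataset or of any library:

```python
# Code-to-label lookups copied from the feature list above.
CATEGORY_LABELS = {
    "sex": {0: "female", 1: "male"},
    "cp": {0: "typical angina", 1: "atypical angina",
           2: "non-anginal pain", 3: "asymptomatic"},
    "slope": {0: "upsloping", 1: "flat", 2: "downsloping"},
}


def decode(record):
    """Replace known integer codes with labels; leave other fields untouched."""
    return {
        name: CATEGORY_LABELS.get(name, {}).get(value, value)
        for name, value in record.items()
    }


# Hypothetical patient record, not a row from the dataset.
patient = {"age": 54, "sex": 1, "cp": 2, "slope": 1, "chol": 239}
decoded = decode(patient)
print(decoded)
```

Numeric fields such as `age` and `chol` pass through unchanged because they have no entry in the lookup table.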


### Preparing Tools

We will use the following libraries:
1. pandas
2. NumPy
3. Matplotlib
4. scikit-learn (`sklearn`)
5. seaborn

In [None]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# We want our plots to appear in the notebook
%matplotlib inline

# Importing the machine learning models from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluation 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
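# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay
plot_roc_curve = RocCurveDisplay.from_estimator  # keeps the old name usable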


### Load Data

In [None]:
df=pd.read_csv('../input/heart-disease/heart.csv')
df.shape

### Data Exploration

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.isna().sum()

In [None]:
df.describe()


In [None]:
df.info()

In [None]:
df.target.value_counts()

In [None]:
df.target.value_counts().plot.bar(color=['red','blue'],figsize=(10,5));

In [None]:
df.sex.value_counts()

In [None]:
pd.crosstab(df.sex,df.target)

In [None]:
pd.crosstab(df.target,df.sex).plot.bar(color=['red','blue'],figsize=(10,5))
plt.title("Heart Disease frequency for Sex")
plt.xlabel("0 = No disease, 1 = Heart disease")
plt.ylabel("Amount")
plt.legend(['Female', 'Male']);

### Age vs Maximum heart rate for Heart Disease


In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
# Label each artist directly so the legend entries cannot get out of order
ax.scatter(df.age[df.target == 1], df.thalach[df.target == 1], color='red', label='Heart disease')
ax.scatter(df.age[df.target == 0], df.thalach[df.target == 0], color='blue', label='No disease')
ax.set_title("Age vs Maximum heart rate")
ax.set_xlabel("Age")
ax.set_ylabel("Maximum heart rate")
ax.axhline(df.thalach.mean(), linestyle='--', color='black', label='Mean maximum heart rate')
ax.legend();


In [None]:
df.age.plot.hist(figsize=(10,5),color='blue');

In [None]:
pd.crosstab(df.cp,df.target)

In [None]:
pd.crosstab(df.cp,df.target).plot.bar(color=['blue','red'],figsize=(10,5));
plt.title("Heart Disease frequency for each type of chest pain")
plt.xlabel("Chest Pain Type")
plt.ylabel("Amount")
plt.legend(['No disease', 'Heart disease']);

### Finding some correlations
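`df.corr()` computes the pairwise Pearson correlation coefficient by default, and each cell of the heatmap in this section holds one such coefficient. As a rough sketch of what that number is, it can be computed by hand for two columns. The `pearson` helper and the sample values are illustrative and not taken from the dataset:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values in which maximum heart rate falls as age rises.
age = [35, 42, 50, 58, 66]
thalach = [180, 172, 160, 151, 140]
r = pearson(age, thalach)
print(f"r = {r:.3f}")
```

A value near -1 or +1 indicates a strong linear relationship, while a value near 0 indicates little or none.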

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
ax=sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt= '.2f', cmap='YlGnBu',ax=ax)
ax.set_title("Correlation between variables",fontsize=20);

## 5. Modelling 
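This section shuffles the rows and holds out 20% of them as a test set. A minimal sketch of that idea in plain Python; `shuffled_split` is a hypothetical helper written for illustration, not scikit-learn's implementation, and it will not reproduce the same rows as `train_test_split`:

```python
import math
import random


def shuffled_split(rows, test_size=0.2, seed=42):
    """Shuffle the rows, then hold out a `test_size` fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-ins for ten patient records
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=42`: the same rows land in the test set on every run.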

In [None]:
# Split the data into X and y
X = df.drop(['target'], axis=1)
y = df.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

In [None]:
X_train

In [None]:
y_train.value_counts()

### To choose a model, we follow scikit-learn's model selection map.
You can find the map here: <https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html>

In [None]:
models={
    'Logistic Regression':LogisticRegression(max_iter=1000),
    'KNN':KNeighborsClassifier(),
    'Random Forest':RandomForestClassifier()
}

# Create a function to fit and score the models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates the given machine learning models.
    models : a dict of different scikit-learn machine learning models
    X_train : training data (no labels)
    X_test : test data (no labels)
    y_train : training labels
    y_test : test labels
    Returns a dict mapping each model name to its test-set accuracy.
    """
    np.random.seed(42)
    model_scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)
model_scores


In [None]:
model_scores_df=pd.DataFrame(model_scores,index=['Accuracy'])
model_scores_df.T.plot.bar(figsize=(10,5));

## 6. Hyperparameter Tuning

In [None]:
train_scores=[]
test_scores=[]
neighbours=range(1,20)
knn=KNeighborsClassifier()
for n in neighbours:
    knn.set_params(n_neighbors=n)
    knn.fit(X_train,y_train)
    train_scores.append(knn.score(X_train,y_train))
    test_scores.append(knn.score(X_test,y_test))

In [None]:
train_scores

In [None]:
test_scores

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(neighbours,train_scores,label='Training Score',color='blue')
ax.plot(neighbours,test_scores,label='Test Score',color='red')
ax.set_title("KNN Score vs Number of Neighbours")
ax.set_xlabel("Number of Neighbours")
ax.set_ylabel("Score")
ax.set_xticks(neighbours)
ax.legend(["Training Score","Test Score"]);
print(f"Maximum KNN Testing Score: {max(test_scores)*100:.2f}%")

The best `KNN` test score is lower than the `Logistic Regression` and `Random Forest` scores, so we drop KNN and tune Logistic Regression and Random Forest instead.
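Tuning means fitting many candidate settings, so it helps to estimate the cost first. The sketch below counts model fits for an exhaustive search versus a randomized search, assuming the logistic regression grid used in this notebook (2 penalties, 20 values of `C`, 1 solver), 5-fold cross-validation, and 20 sampled candidates. The variable names are illustrative, and the `C` values are generated in plain Python to mimic `np.logspace(-4, 4, 20)`:

```python
from itertools import product

# Same shape as the logistic regression grid in this notebook.
grid = {
    "penalty": ["l1", "l2"],
    "C": [10 ** (-4 + 8 * i / 19) for i in range(20)],  # 20 log-spaced values
    "solver": ["liblinear"],
}
cv_folds = 5
n_iter = 20  # candidates sampled by the randomized search

n_combinations = len(list(product(*grid.values())))
grid_search_fits = n_combinations * cv_folds  # every combination, every fold
random_search_fits = n_iter * cv_folds        # only the sampled candidates

print(n_combinations, grid_search_fits, random_search_fits)
```

This gives 40 combinations, so 200 fits for the exhaustive search versus 100 for the randomized one, not counting the final refit on the full training set.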

In [None]:
# Create a hyperparameter grid for logistic regression
log_reg_grid = {'penalty': ['l1', 'l2'],
                'C': np.logspace(-4, 4, 20),
                'solver': ['liblinear']}

# Create a hyperparameter grid for Random Forest Classifier
rf_grid = {'n_estimators': np.arange(10, 1000, 50),
           'max_depth': [None, 3, 5, 10],  # None (not the string "None") means no depth limit
           'min_samples_split': np.arange(2, 20, 2),  # a step of 20 would yield only [2]
           'min_samples_leaf': np.arange(1, 20, 2)}

In [None]:
# Tune LogisticRegression hyperparameters with a randomized search
np.random.seed(42)
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_reg_grid,
                                n_iter=20,
                                cv=5,
                                verbose=2)


In [None]:
rs_log_reg.fit(X_train,y_train)

In [None]:
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(X_test,y_test)

In [None]:
# Tune RandomForestClassifier hyperparameters with a randomized search
np.random.seed(42)
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                            param_distributions=rf_grid,
                            n_iter=20,
                            cv=5,
                            verbose=True)

In [None]:
rs_rf.fit(X_train,y_train)

In [None]:
rs_rf.best_params_

In [None]:
rs_rf.score(X_test,y_test)

In [None]:
# GridSearchCV for LogisticRegression
np.random.seed(42)
log_reg_grid = {'penalty':['l1','l2'],
                'C':np.logspace(-4,4,20),
                'solver':['liblinear']}
gs_log_reg = GridSearchCV(LogisticRegression(),
                            param_grid=log_reg_grid,                                                    
                            cv=5,
                            verbose=True)
gs_log_reg.fit(X_train,y_train)

In [None]:
gs_log_reg.best_params_

In [None]:
gs_log_reg.score(X_test,y_test)


In [None]:
# #### GridSearchCV for RandomForestClassifier
# np.random.seed(42)
# rf_grid = {'n_estimators': np.arange(10,1000,20),
#             'min_samples_split': np.arange(2,20,15),
#             'min_samples_leaf': np.arange(1,20,2)}
# gs_rf = GridSearchCV(RandomForestClassifier(),
#                     param_grid=rf_grid, 
#                     cv=3,
#                     verbose=True)
# gs_rf.fit(X_train,y_train)


## 7. Evaluating the model
* ROC curve and AUC
* Confusion matrix
* Classification report
* Precision
* Recall
* F1 score
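Precision, recall, F1 and accuracy all derive from the four counts in a binary confusion matrix. A short sketch of those formulas; the `classification_metrics` helper and the counts are hypothetical, not results from this notebook, and the helper assumes at least one predicted positive and one actual positive so that no division by zero occurs:

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the headline metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)  # of predicted positives, how many were right
    recall = tp / (tp + fn)     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: true/false positives and negatives.
metrics = classification_metrics(tp=28, fp=4, fn=4, tn=25)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a screening setting, recall is often the metric to watch, since a false negative means a patient with heart disease goes undetected.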


In [None]:
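# Plot the ROC curve and calculate AUC
# (plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it)
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(gs_log_reg, X_test, y_test);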

In [None]:
confusion_matrix(y_test,gs_log_reg.predict(X_test))

In [None]:
sns.set(font_scale=1.5)
fig,ax=plt.subplots(figsize=(5,5))
ax=sns.heatmap(confusion_matrix(y_test,gs_log_reg.predict(X_test)),
                                annot=True,
                                cbar=False)
ax.set_title("Confusion Matrix",fontsize=20)
# confusion_matrix puts true labels on rows (y axis) and predictions on columns (x axis)
ax.set_xlabel("Predicted Label")
ax.set_ylabel("True Label");

In [None]:
# classification report
print(classification_report(y_test,gs_log_reg.predict(X_test)))

In [None]:
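# Cross-validated metrics on the training data.
# Passing the GridSearchCV object re-runs the grid search inside every fold
# (nested cross-validation), so this cell is slow.
eval_metrics = ['accuracy', 'precision', 'recall', 'f1']
eval_metrics_results = {}
for metric in eval_metrics:
    eval_metrics_results[metric] = cross_val_score(gs_log_reg, X_train, y_train, cv=5, scoring=metric).mean()
eval_metrics_df = pd.DataFrame(eval_metrics_results, index=['Mean'])
eval_metrics_df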

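The `cv=5` argument used throughout splits the data into five folds and scores the model on each held-out fold in turn, and the cell above averages those five scores. A plain-Python sketch of how the fold indices can be laid out; `k_fold_indices` is a hypothetical helper, and unlike scikit-learn's default for classifiers it does not stratify the folds by class:

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=12, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Every sample appears in exactly one validation fold, so each row is used for validation once and for training in the other four rounds.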

In [None]:
eval_metrics_df.T.plot.bar(figsize=(10,5),color='blue');
plt.title('Evaluation Metrics')