# Heart Disease Exercise Notebook

> This notebook contains the solution for the Heart Disease Exercise.

In this notebook, we are trying to make a model that can detect the presence of heart disease in a patient.

## Understand the dataset
Attribute Information:
1. **age** - age in years 

2. **sex** - (1 = male; 0 = female) 

3. **cp** - chest pain type
    * 0: Typical angina: chest pain related to decreased blood supply to the heart
    * 1: Atypical angina: chest pain not related to the heart
    * 2: Non-anginal pain: typically esophageal spasms (not heart related)
    * 3: Asymptomatic: chest pain not showing signs of disease
    
    
4. **trestbps** - resting blood pressure (in mm Hg on admission to the hospital)

5. **chol** - serum cholesterol in mg/dl 

6. **fbs** - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) 

7. **restecg** - resting electrocardiographic results
    * 0: Nothing to note
    * 1: ST-T Wave abnormality
    * 2: Possible or definite left ventricular hypertrophy


8. **thalach** - maximum heart rate achieved 

9. **exang** - exercise induced angina (1 = yes; 0 = no)

10. **oldpeak** - ST depression induced by exercise relative to rest 

11. **slope** - the slope of the peak exercise ST segment
    * 0: Upsloping: better heart rate with exercise (uncommon)
    * 1: Flatsloping: minimal change (typical healthy heart)
    * 2: Downsloping: signs of unhealthy heart
    
    
12. **ca** - number of major vessels (0-3) colored by fluoroscopy 

13. **thal** - thallium stress result

14. **target** - whether the patient has heart disease (1 = yes; 0 = no); this is the attribute we want to predict
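
The numeric codes above can be hard to read in tables and plots. As a minimal sketch, the snippet below maps the `cp` codes to the labels from the list above using `Series.map()`. The `sample` DataFrame and the `cp_label` column name are made up for illustration; they are not rows or columns from the real dataset.

```python
import pandas as pd

# Code-to-label mapping copied from the attribute list above
CP_LABELS = {
    0: "Typical angina",
    1: "Atypical angina",
    2: "Non-anginal pain",
    3: "Asymptomatic",
}

# Tiny hand-made sample (illustrative values, not real patients)
sample = pd.DataFrame({"cp": [0, 2, 3, 1, 0]})

# Add a human-readable column next to the numeric code
sample["cp_label"] = sample["cp"].map(CP_LABELS)
print(sample)
```

The same pattern would work for `restecg` and `slope`, given a dictionary built from their code lists.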

## Import the libraries
Importing the Data Science and Machine Learning libraries

* Pandas as pd
* NumPy as np
* Matplotlib (`pyplot`) as plt
* Scikit-learn

In [1]:
# Import libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

# We do not import scikit-learn here; we will import individual functions from it when we need them

## Importing the data

Since the dataset is a CSV file, we will use the `pd.read_csv()` function to read it.

And since the file is hosted on GitHub, we can pass the raw URL of the file directly.

In [2]:
# Read the csv file from github
df = pd.read_csv("https://raw.githubusercontent.com/Sayed-Husain/Introduction-to-Machine-Learning-Workshop/main/Data/Heart%20disease.csv")

# To confirm that the file was read, let's view the first 5 rows
df.head()

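| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |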


In [3]:
# Check the size of the dataset
len(df)

302

In [4]:
# Check how many positive and negative samples we have
df["target"].value_counts()

1    164
0    138
Name: target, dtype: int64
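
Raw counts are easier to compare as proportions. As a small sketch, `value_counts(normalize=True)` returns fractions instead of counts. To keep the snippet self-contained, `y_demo` is a stand-in Series rebuilt from the two counts shown above (164 and 138); on the real data you would call this on `df["target"]` directly.

```python
import pandas as pd

# Stand-in target column rebuilt from the class counts reported above
# (164 positive, 138 negative); the row order here is arbitrary.
y_demo = pd.Series([1] * 164 + [0] * 138, name="target")

# normalize=True returns the fraction of rows in each class
proportions = y_demo.value_counts(normalize=True)
print(proportions.round(3))
```

The classes come out at roughly 54% positive and 46% negative, so the dataset is fairly balanced.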

## Prepare our data

**Steps to make:**
1. Create X, y variables
2. Create training and testing datasets

### Create the X and y variables

In [5]:
# Create the X variable - the df excluding `target` column
X = df.drop("target", axis=1)

# Create the y variable - the `target` column
y = df["target"]

In [6]:
# Confirm that the X variable is what we are expecting 
X.head()

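| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 |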


In [7]:
# Confirm that the y variable is what we are expecting
y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

### Create the training and testing datasets

In [8]:
# Import the train_test_split() function from scikit-learn
from sklearn.model_selection import train_test_split

# Use the train_test_split() function to split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [9]:
# Check the sizes of the training and testing sets
len(X_train), len(y_train), len(X_test), len(y_test)

(241, 241, 61, 61)
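
The split above is random, so the rows in each set (and the final score) change on every run. One possible variation, sketched below, passes `random_state` for reproducibility and `stratify` to keep the class ratio similar in both sets. This is an optional alternative, not what the notebook does. `X_demo` and `y_demo` are stand-ins with the same size and class counts as the dataset, and the seed `42` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same size and class counts as the dataset above;
# the single feature is a placeholder, not a real column.
X_demo = pd.DataFrame({"placeholder_feature": np.arange(302)})
y_demo = pd.Series([1] * 164 + [0] * 138, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,  # fixed seed: the same rows land in each set on every run
    stratify=y_demo,  # keep the positive/negative ratio similar in both sets
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

The set sizes match the ones above (241 and 61), and the share of positive samples is close to identical in the two sets.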

## Make the model

As we have the training and testing datasets ready, it is time to make the model.

As this is a classification problem, we will use the `RandomForestClassifier()` model.

In [10]:
# Import the model from scikit-learn library 
from sklearn.ensemble import RandomForestClassifier

# Make the model
model = RandomForestClassifier()

# Train the model
model.fit(X_train, y_train)

# Evaluate the model (for classifiers, score() returns the mean accuracy)
model.score(X_test, y_test)

0.8688524590163934
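
With only 61 test samples, a single accuracy value can shift noticeably from one random split to the next. A common way to get a steadier estimate is k-fold cross-validation. The sketch below uses `cross_val_score` on synthetic data from `make_classification`, so the printed numbers say nothing about the heart disease dataset. The names `X_demo`, `y_demo`, and `model_cv` and the seed values are assumptions for illustration; on the real data you would pass `X` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (302 rows, 13 features), NOT the heart disease dataset
X_demo, y_demo = make_classification(n_samples=302, n_features=13, random_state=0)

model_cv = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, rotate
scores = cross_val_score(model_cv, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

The mean of the five scores is usually a more stable summary than one split, and the standard deviation gives a rough sense of how much the score moves between splits.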

## Is 87% a good score?

Generally speaking, a model that predicts with about 87% accuracy is definitely picking up on patterns in the data.

However, because the subject of this model, heart disease, is sensitive, 87% may not be good enough.


Accuracy is also arguably not the best measure of this model's competency. Our main aim is not to get the highest accuracy score, but to make sure that no one who actually has heart disease is predicted as free of disease (a false negative). We accept that some people who are not in danger may be classified as in danger (false positives), because the doctors will eventually examine the patient anyway.


Therefore, **`recall`**, the share of patients who actually have the disease that the model correctly flags, is a better evaluation metric for this problem.
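
To make the difference between accuracy and recall concrete, here is a minimal sketch using scikit-learn's `confusion_matrix`, `accuracy_score`, and `recall_score`. The labels and predictions are made-up toy values, not output from the model above. On the real data you would pass `y_test` and `model.predict(X_test)` instead.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy values for illustration only (1 = disease, 0 = no disease)
y_true_demo = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = accuracy_score(y_true_demo, y_pred_demo)  # (tp + tn) / all samples
recall = recall_score(y_true_demo, y_pred_demo)      # tp / (tp + fn)

print(f"missed disease cases (false negatives): {fn}")
print(f"accuracy: {accuracy:.2f}, recall: {recall:.2f}")
```

In this toy case the accuracy is 0.70, yet the recall is only 0.67: two of the six patients with the disease would be told they are healthy, which is exactly the kind of error this problem most needs to avoid.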
