# Machine learning model to predict heart disease status (`target`) from patient variables

### Random Forest Classifier Workflow for Classifying Heart Disease

#### 1. Get the data ready

We'll import `heart-disease.csv`.

This file contains anonymised patient medical records and whether or not each patient has heart disease. This is a classification problem because we're trying to predict whether something is one thing or another.

In [114]:
import pandas as pd
import numpy as np

heart_disease = pd.read_csv("heart-disease.csv") 
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


Here, each row is a different patient and all columns except `target` are different patient characteristics. 

The `target` column indicates whether the patient has heart disease (`target=1`) or not (`target=0`). This is our "label" column, **the variable we're going to try and predict**.

The rest of the columns (often called features) are what we'll be using to predict the `target` value.

> **Note:** It's a common convention to save features to a variable `X` and labels to a variable `y`. In practice, we'd like to use `X` (features) to build a predictive algorithm that predicts `y` (labels).

In [115]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

# Print the features DataFrame (pandas abbreviates the middle rows)
print(X)

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   3       145   233    1        0      150      0      2.3   
1     37    1   2       130   250    0        1      187      0      3.5   
2     41    0   1       130   204    0        0      172      0      1.4   
3     56    1   1       120   236    0        1      178      0      0.8   
4     57    0   0       120   354    0        1      163      1      0.6   
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
298   57    0   0       140   241    0        1      123      1      0.2   
299   45    1   3       110   264    0        1      132      0      1.2   
300   68    1   0       144   193    1        1      141      0      3.4   
301   57    1   0       130   131    0        1      115      1      1.2   
302   57    0   1       130   236    0        0      174      0      0.0   

     slope  ca  thal  
0        0   0     1  
1        0   0     2  
2        2   0    

In [116]:
print(y.head())
print(y.value_counts())

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64
target
1    165
0    138
Name: count, dtype: int64
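
The counts above suggest the two classes are roughly balanced. A minimal sketch of expressing them as proportions follows; `y_demo` is a stand-in Series rebuilt from the printed counts (an assumption for illustration, not the loaded CSV).

```python
import pandas as pd

# Stand-in for heart_disease["target"], rebuilt from the counts printed above
y_demo = pd.Series([1] * 165 + [0] * 138, name="target")

# normalize=True returns class proportions instead of raw counts
class_share = y_demo.value_counts(normalize=True)
print(class_share.round(3))
```

With the real data, the same call on `y` would show whether one class dominates, which matters when reading accuracy later on.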


#### 2. Choose the model and hyperparameters

In [117]:
# Since we're working on a classification problem, we'll start with a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100) # Number of trees used to create the random forest

# View the current hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
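
The dictionary above lists every hyperparameter the estimator accepts. As a small sketch, any of them can be changed after construction with `set_params`; the values `50` and `5` below are arbitrary choices for illustration, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

clf_demo = RandomForestClassifier(n_estimators=100)

# set_params updates hyperparameters in place and returns the estimator
clf_demo.set_params(n_estimators=50, max_depth=5)

params = clf_demo.get_params()
print(params["n_estimators"], params["max_depth"])
```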

#### 3. Fit the model to the training data and use it to make a prediction

In [118]:
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
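
Because the split is random, each run of the cell produces different subsets and therefore different scores. The sketch below shows one possible way to make the split repeatable and to keep the class ratio similar in both subsets. It uses synthetic data from `make_classification` as a stand-in for the heart disease features (an assumption, so it runs without the CSV), and the seed `42` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the feature table (303 rows, 13 columns)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,  # fixes the shuffle so the split is repeatable
    stratify=y_demo,  # keeps the class ratio similar in both subsets
)
print(X_tr.shape, X_te.shape)
```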

In [119]:
# Fit the model: it learns patterns from the training data
clf.fit(X_train, y_train)

In [120]:
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
107,45,0,0,138,236,0,0,152,1,0.2,1,0,2
71,51,1,2,94,227,0,1,154,1,0.0,2,1,3
35,46,0,2,142,177,0,0,160,1,1.4,0,0,2
130,54,0,2,160,201,0,1,163,0,0.0,2,1,2
108,50,0,1,120,244,0,1,162,0,1.1,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
83,52,1,3,152,298,1,1,178,0,1.2,1,0,3
164,38,1,2,138,175,0,1,173,0,0.0,2,4,2
63,41,1,1,135,203,0,1,132,0,0.0,1,0,1
248,54,1,1,192,283,0,0,195,0,0.0,2,1,3


In [121]:
y_train

107    1
71     1
35     1
130    1
108    1
      ..
83     1
164    1
63     1
248    0
197    0
Name: target, Length: 242, dtype: int64

In [122]:
y_preds = clf.predict(X_test)
y_preds

array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1], dtype=int64)
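
`predict` returns only the hard labels. A minimal sketch of also inspecting the class probabilities behind them with `predict_proba` follows, again on synthetic stand-in data (an assumption for illustration, not the patient records).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
clf_demo.fit(X_demo[:150], y_demo[:150])

# One column per class; each row sums to 1
proba = clf_demo.predict_proba(X_demo[150:])
labels = clf_demo.predict(X_demo[150:])

# The predicted label is the class with the highest probability
labels_from_proba = clf_demo.classes_[np.argmax(proba, axis=1)]
print(proba[:5])
print(labels[:5])
```

Probabilities can be useful when a borderline prediction should be treated differently from a confident one.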

In [123]:
y_test # Labels extracted from the dataset (known labels)

79     1
234    0
8      1
160    1
105    1
      ..
163    1
51     1
297    0
296    0
17     1
Name: target, Length: 61, dtype: int64

#### 4. Evaluate the model on the training data and test data

In [124]:
train_acc = clf.score(X_train, y_train)
print(f"The model's accuracy on the training dataset is: {train_acc*100:.2f}%")

The model's accuracy on the training dataset is: 100.00%


**The model scores 1.0 (100%) on the training data because it was fitted on those exact features and labels (it had the chance to see both the data *and* the labels).**

**Having seen the labels, it had the chance to correct itself, so a perfect training score says little about how it will handle unseen patients.**

In [125]:
# Evaluate the model on the test set
# score() predicts labels for X_test and compares them with the true labels (ground truth)
test_acc = clf.score(X_test, y_test)
print(f"The model's accuracy on the testing dataset is: {test_acc*100:.2f}%")

The model's accuracy on the testing dataset is: 81.97%


**Our model's accuracy is lower on the test dataset (81.97%) than on the training dataset (100%).**

**This is quite often the case because the model has never seen the testing examples before.**
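
A single 61-row test set gives a fairly noisy estimate. One common way to get a steadier figure is k-fold cross-validation, sketched here with `cross_val_score` on synthetic stand-in data (an assumption; the fold count of 5 and the seeds are arbitrary).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=100, random_state=42)

# Fits and scores the model 5 times, each time holding out a different fifth of the data
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```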

In [126]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Create a classification report
print(classification_report(y_test, y_preds)) # True labels vs predictions

              precision    recall  f1-score   support

           0       0.85      0.76      0.80        29
           1       0.80      0.88      0.84        32

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61



In [127]:
# Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

array([[22,  7],
       [ 4, 28]], dtype=int64)
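
The figures in the classification report can be recomputed by hand from this matrix. The sketch below does that for class `1`, using the matrix values printed above (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Confusion matrix copied from the output above
conf_mat_demo = np.array([[22, 7],
                          [4, 28]])

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = conf_mat_demo.ravel()

precision_1 = tp / (tp + fp)  # 28 / 35
recall_1 = tp / (tp + fn)     # 28 / 32
accuracy = (tp + tn) / conf_mat_demo.sum()  # 50 / 61

print(f"precision={precision_1:.3f} recall={recall_1:.3f} accuracy={accuracy:.3f}")
```

These agree with the report: precision 0.80 and recall 0.875 (shown as 0.88) for class `1`, and an overall accuracy of about 0.82.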

In [128]:
# Compute the accuracy score (same as the score() method for classifiers) 
accuracy_score(y_test, y_preds)

0.819672131147541

#### 5. Improve the model

In [129]:
# Try different values of n_estimators
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 77.05%

Trying model with 20 estimators...
Model accuracy on test set: 80.33%

Trying model with 30 estimators...
Model accuracy on test set: 77.05%

Trying model with 40 estimators...
Model accuracy on test set: 81.97%

Trying model with 50 estimators...
Model accuracy on test set: 80.33%

Trying model with 60 estimators...
Model accuracy on test set: 83.61%

Trying model with 70 estimators...
Model accuracy on test set: 85.25%

Trying model with 80 estimators...
Model accuracy on test set: 83.61%

Trying model with 90 estimators...
Model accuracy on test set: 81.97%
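
The loop above picks `n_estimators` by looking at test-set accuracy, which tends to make the final test score optimistic. One alternative, sketched here under assumptions (synthetic stand-in data, a small hypothetical grid of three values), is to choose the value by cross-validation on the training data only and keep the test set for a final check.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Each candidate is scored with 5-fold cross-validation on the training data
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50, 90]},
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
# The test set is used once, after the choice has been made
final_score = search.score(X_te, y_te)
print(f"Test accuracy of the selected model: {final_score * 100:.2f}%")
```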



#### 6. Save a model and load it

In [130]:
import pickle

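# Use a context manager so the file is closed after writing
with open("random_forst_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [131]:
# Load the model back and check it scores the same as before saving
with open("random_forst_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)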
loaded_model.score(X_test, y_test)

0.819672131147541
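
As an alternative to `pickle`, `joblib` is often used for scikit-learn models. The sketch below is a self-contained round trip on synthetic stand-in data; the file name `demo_model.joblib` and the temporary directory are hypothetical choices for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(clf_demo, model_path)   # save the fitted model
    restored = joblib.load(model_path)  # load it back

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_demo) == clf_demo.predict(X_demo)).all())
print(same_predictions)
```

As with `pickle`, a saved model should only be loaded from a trusted source and ideally with the same scikit-learn version that created it.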