## A Simple Scikit-Learn Classification Workflow
This notebook shows a brief workflow you might use with scikit-learn to build a machine learning model that classifies whether or not a patient has heart disease.

It follows the diagram below:

![scikit-learn workflow diagram](sklearn-workflow.png)

Note: This workflow assumes your data is ready to be used with machine learning models (is numerical, has no missing values).
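
One way to check that assumption before modelling is sketched below. The helper name `is_model_ready` and the two toy DataFrames are hypothetical and are not part of the heart disease data; the sketch only tests that every column is numeric and that no value is missing.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_model_ready(df: pd.DataFrame) -> bool:
    """Return True if every column is numeric and no value is missing."""
    all_numeric = all(is_numeric_dtype(df[col]) for col in df.columns)
    no_missing = not df.isna().any().any()
    return all_numeric and no_missing


# Hypothetical toy frames, used only to illustrate the check
clean = pd.DataFrame({"age": [63, 37], "chol": [233, 250]})
dirty = pd.DataFrame({"age": [63, None], "sex": ["M", "F"]})

print(is_model_ready(clean))  # True
print(is_model_ready(dirty))  # False: one missing value and one text column
```

If the check fails, the data would need encoding or imputation first, which this workflow does not cover.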

In [1]:
# import the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
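# Let's read the data!

# A raw string (r"...") keeps the backslashes in the Windows path from being treated as escape sequences
heart_disease = pd.read_csv(r"F:\Data Science\ZeroToMastery.io\heart-disease.csv")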
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
# Let's create the feature columns (X) and the target column (y)

X = heart_disease.drop("target", axis=1)

y = heart_disease["target"]

In [13]:
# Now we will split the data into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# View the data shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))
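
The split above is random, so the rows that land in each set change from run to run. A minimal sketch of a repeatable split follows, assuming synthetic stand-in data from `make_classification` with the same shape as the heart disease features (303 rows, 13 columns). The `random_state` and `stratify` arguments are additions for illustration and are not used in the original cell.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 303 samples, 13 features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# random_state fixes the shuffle; stratify keeps the class balance similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (242, 13) (61, 13)
```

With a fixed `random_state`, rerunning the cell gives the same rows in each set, which makes later scores easier to compare.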

In [14]:
# Now let's fit a model to the data

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

In [15]:
model.fit(X_train, y_train)

RandomForestClassifier()

In [16]:
# Predict the target column for the test data

y_preds = model.predict(X_test)
y_preds

array([1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0], dtype=int64)

In [17]:
# Let's evaluate the model

model.score(X_train, y_train)  # training set

1.0

In [18]:
model.score(X_test, y_test)  # test set

0.8360655737704918
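
For a classifier, `score` returns the mean accuracy, so the same number can be computed from the predictions with `accuracy_score`. The sketch below shows this on synthetic stand-in data (an assumption, not the heart disease set). A perfect training score next to a lower test score, as seen above, is common for random forests and suggests the model fits its training rows more closely than it generalizes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Accuracy computed by hand from the predictions
manual_accuracy = accuracy_score(y_te, clf.predict(X_te))

print(manual_accuracy == clf.score(X_te, y_te))  # True
```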

In [21]:
# Experiment to improve the model score (hyperparameter tuning)
# Try different numbers of estimators (n_estimators is a hyperparameter you can change)

np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test)}")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 0.8360655737704918

Trying model with 20 estimators...
Model accuracy on test set: 0.8688524590163934

Trying model with 30 estimators...
Model accuracy on test set: 0.8360655737704918

Trying model with 40 estimators...
Model accuracy on test set: 0.819672131147541

Trying model with 50 estimators...
Model accuracy on test set: 0.7868852459016393

Trying model with 60 estimators...
Model accuracy on test set: 0.819672131147541

Trying model with 70 estimators...
Model accuracy on test set: 0.819672131147541

Trying model with 80 estimators...
Model accuracy on test set: 0.8360655737704918

Trying model with 90 estimators...
Model accuracy on test set: 0.8032786885245902



In [24]:
# Note: It's best practice to test different hyperparameters with a validation set or cross-validation.

from sklearn.model_selection import cross_val_score

# Try different numbers of estimators, reporting both the single test-set score
# and the mean 5-fold cross-validation score (printed as a percentage)
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test)}")
    print(f"Cross-validation score: {np.mean(cross_val_score(model, X, y, cv=5)) * 100}")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 0.8360655737704918
Cross-validation score: 78.53551912568305

Trying model with 20 estimators...
Model accuracy on test set: 0.8032786885245902
Cross-validation score: 79.84699453551912

Trying model with 30 estimators...
Model accuracy on test set: 0.819672131147541
Cross-validation score: 80.50819672131148

Trying model with 40 estimators...
Model accuracy on test set: 0.819672131147541
Cross-validation score: 82.15300546448088

Trying model with 50 estimators...
Model accuracy on test set: 0.819672131147541
Cross-validation score: 81.1639344262295

Trying model with 60 estimators...
Model accuracy on test set: 0.819672131147541
Cross-validation score: 83.47540983606557

Trying model with 70 estimators...
Model accuracy on test set: 0.8524590163934426
Cross-validation score: 81.83060109289617

Trying model with 80 estimators...
Model accuracy on test set: 0.819672131147541
Cross-validation score: 82.81420765027322

Trying
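
The loop above picks hyperparameters by hand. As a rough alternative, scikit-learn's `GridSearchCV` can run the same kind of search with cross-validation on the training set only, leaving the test set untouched until the end. This is a swapped-in technique, not the method used in this notebook, and the sketch assumes synthetic stand-in data and a small, arbitrary grid of `n_estimators` values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 5-fold cross-validation over a small, illustrative grid
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50, 90]},
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)         # best n_estimators found on the training folds
print(search.best_score_)          # mean cross-validated accuracy for that setting
print(search.score(X_te, y_te))    # accuracy of the refitted best model on the held-out test set
```

Keeping the test set out of the search means the final test score is a fairer estimate of how the chosen model performs on unseen data.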

In [25]:
# Save the ML model for later use!
import pickle

# Save the trained model to a file; `model` is the last one fitted in the loop above (n_estimators=90)
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(model, f)

In [26]:
# Load a saved model
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)

In [34]:
loaded_model.predict(X_test)
loaded_model

RandomForestClassifier(n_estimators=90)

In [35]:
loaded_model.score(X_test, y_test)

0.8524590163934426
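
A quick way to confirm that saving and loading preserved the model is to compare predictions before and after the round trip. The sketch below does this with a small forest on synthetic stand-in data and a temporary file; the file name `demo_model.pkl` and the data are assumptions for illustration only.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model (assumptions)
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=42)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Write the model to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(clf, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly what the original does
same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.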