In [13]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

In [14]:
heart_df = pd.read_csv("data/heart-disease.csv")
heart_df.head() # classification dataset - supervised learning

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


## 2. Hyperparameter tuning with RandomizedSearchCV

Scikit-Learn's [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) lets us randomly search across different hyperparameter combinations to see which work best. It also stores details about the ones which work best.

First, we create a grid (a dictionary) of the hyperparameters we'd like to search over.

In [15]:
# Hyperparameter grid RandomizedSearchCV will search over
# Keys   --> hyperparameter names
# Values --> the settings we want to try for each hyperparameter
# The grid is a dictionary

grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],  # note: "auto" was removed in scikit-learn 1.3 (for classifiers it meant "sqrt")
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

Where did these values come from? They're made up.

Well, not completely pulled out of thin air. After reading the Scikit-Learn documentation on random forests, you'll see that some hyperparameters have values which usually perform well, and that some hyperparameters take strings rather than integers.

Now we've got the grid set up, Scikit-Learn's RandomizedSearchCV will look at it, pick a random value for each hyperparameter, instantiate a model with those values and test each model.

How many models will it test?
Up to as many as there are combinations of hyperparameter values in the grid.

n_estimators has 6 values, max_depth has 5, max_features has 2, min_samples_split has 3 and min_samples_leaf has 3. That's 6x5x2x3x3 = 540 models!
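The count can be checked directly from the grid: it's the product of the number of values listed for each hyperparameter. A minimal sketch using only the standard library (the `grid` dictionary is copied from the cell above; `n_combinations` is a name introduced here for illustration):

```python
from math import prod

# Same grid as in the cell above
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

# Multiply the number of options for each hyperparameter together
n_combinations = prod(len(values) for values in grid.values())
print(n_combinations)  # 6 * 5 * 2 * 3 * 3 = 540
```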

Or...

We can set the n_iter parameter to limit the number of models RandomizedSearchCV tests.
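To get a feel for what "pick n_iter random combinations" looks like, Scikit-Learn's `ParameterSampler` can draw combinations from the same kind of dictionary. This is only an illustrative sketch: the variable name `sampled` and the choice of `random_state=42` are assumptions, and the combinations it prints won't necessarily match the ones the search in this notebook ends up trying.

```python
from sklearn.model_selection import ParameterSampler

# Same grid as in the cell above
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

# Draw 10 random combinations (one value per hyperparameter in each)
sampled = list(ParameterSampler(grid, n_iter=10, random_state=42))

for params in sampled:
    print(params)
```

Because every entry in the grid is a list, the combinations are drawn without replacement, so the same combination isn't tried twice.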

The best thing? The results we get will be cross-validated (hence the CV in RandomizedSearchCV), so we don't need to create a separate validation set ourselves. A single train_test_split() into training and test sets is enough.
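If cross-validation is new to you, here's a small sketch of what 5-fold splitting does, using `KFold` on 10 made-up samples (the toy array `X_demo` and the counter `validation_counts` are assumptions for illustration). Each sample ends up in the validation part exactly once. Note that for classifiers, an integer `cv` in RandomizedSearchCV uses a stratified variant of this idea, so treat this as a simplified picture.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy samples with 2 features each (made-up data)
X_demo = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5)

# Count how many times each sample is used for validation
validation_counts = np.zeros(len(X_demo), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kfold.split(X_demo)):
    print(f"Fold {fold}: train={train_idx}, validation={val_idx}")
    validation_counts[val_idx] += 1

print(validation_counts)
```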

And since we're going over so many different models, we'll set n_jobs=-1 on RandomForestClassifier so Scikit-Learn takes advantage of all the cores (processors) on our computer.

**Note**: Depending on n_iter (how many models you test), the different values in the hyperparameter grid, and the power of your computer, running the cell below may take a while.

In [16]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)  # Results are reproducible

# Shuffle the data
heart_df_shuffle = heart_df.sample(frac=1)

# Split into X and y
X = heart_df_shuffle.drop("target",axis=1)
y = heart_df_shuffle["target"]


# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
# n_jobs --> how many CPU cores to dedicate to training the model
# -1 --> use all available cores (1 would use a single core)
clf = RandomForestClassifier(n_jobs=-1)

# Setup RandomizedSearchCV --> cross validation - automatically creates the validation sets for us
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid,
                            n_iter=10, # try 10 models total
                            cv=5, # 5-fold cross-validation
                            verbose=2) # print out results

# It takes clf and grid, then tries random combinations of hyperparameters from grid to find which works best

# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train);

# "Fitting 5 folds for each of 10 candidates, totalling 50 fits" means:
# 10 randomly chosen combinations of hyperparameters from grid (n_iter=10)
# Each combination is evaluated on 5 different train/validation splits (cv=5)
# So fit() runs 50 times, using different hyperparameters on different subsets of the data

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=200; total time=   0.1s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=200; total time=   0.1s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=200; total time=   0.1s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=200; total time=   0.1s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=200; total time=   0.1s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=100; tot

In [17]:
# Which combination of hyperparameters got the best results found by RandomizedSearchCV
rs_clf.best_params_

{'n_estimators': 200,
 'min_samples_split': 2,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': 20}

When we call predict() on rs_clf (our RandomizedSearchCV version of our classifier), it'll use the best hyperparameters it found.
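A fitted search object keeps more than just `best_params_`. The sketch below uses a small synthetic dataset and a cut-down grid (both are assumptions so that it runs quickly; it is not the heart disease data and the names `demo_grid`, `demo_search` and `results_df` are made up for this example). It shows the best cross-validated score, the refitted best model, and the per-combination results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# A deliberately small grid so the example runs quickly
demo_grid = {"n_estimators": [10, 20, 50],
             "max_depth": [None, 5],
             "min_samples_leaf": [1, 2, 4]}

demo_search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                                 param_distributions=demo_grid,
                                 n_iter=5,
                                 cv=3,
                                 random_state=42)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)  # best combination found
print(demo_search.best_score_)   # its mean cross-validated score

# With the default refit=True, the best combination is refit on all the data passed to fit()
best_model = demo_search.best_estimator_

# One row per combination tried
results_df = pd.DataFrame(demo_search.cv_results_)
print(results_df[["params", "mean_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))
```

Calling `demo_search.predict(...)` is a shortcut for calling `predict(...)` on `best_model`.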

In [1]:
def evaluate_preds(y_true, y_preds):
    """
    Performs evaluation comparison on y_true labels vs. y_preds labels
    on a classification model.
    """
    accuracy = accuracy_score(y_true,y_preds)
    precision = precision_score(y_true,y_preds)
    recall = recall_score(y_true,y_preds) 
    f1 = f1_score(y_true,y_preds)
    metric_dict = {
        "accuracy":round(accuracy,2),
        "precision":round(precision,2),
        "recall":round(recall,2),
        "f1":round(f1,2)
    } # A dictionary that stores the results of the evaluation metrics
    
    print(f"Acc: {accuracy * 100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 score: {f1:.2f}")
    
    return metric_dict

In [19]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions with the best hyperparameters
rs_y_preds = rs_clf.predict(X_test)

# Evaluate the predictions
rs_metrics = evaluate_preds(y_test, rs_y_preds)

Acc: 83.61%
Precision: 0.87
Recall: 0.82
F1 score: 0.84


In [2]:
# ------------------------------------------------------------------

A few next ideas you could try:

* Collect more data - Based on the results our models are getting now, it seems like they're finding some patterns. Collecting more data may improve a model's ability to find patterns. However, your ability to do this will largely depend on the project you're working on.
* Try a more advanced model - Although our tuned Random Forest model is doing pretty well, a more advanced ensemble method such as [XGBoost](https://xgboost.ai/) or [CatBoost](https://catboost.ai/) might perform better.

Since machine learning is part engineering, part science, **these kinds of experiments are commonplace in any machine learning project.**

In [3]:
# ---------------------------------------------------------------

### Correlation Analysis

* Correlation analysis looks at which attributes (columns) are correlated with one another.

If one column is highly correlated with another, the two carry much the same information.

e.g. In a house price dataset, floor size and land size tend to increase together as prices go up --> highly correlated

So we can remove one of them from our data, as keeping both adds little to our model.
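Here's a rough sketch of how that could look with pandas. Everything in it is made up for illustration: the columns `floor_size`, `land_size` and `age`, the synthetic numbers, and the 0.9 cutoff (`threshold`) are assumptions, not values from this notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Made-up housing-style data: land_size is almost a multiple of floor_size
floor_size = rng.normal(150, 30, n)
land_size = 3 * floor_size + rng.normal(0, 10, n)
age = rng.normal(20, 5, n)  # unrelated to the other two

demo_df = pd.DataFrame({"floor_size": floor_size,
                        "land_size": land_size,
                        "age": age})

# Absolute pairwise correlations between columns
corr_matrix = demo_df.corr().abs()

# Keep only the upper triangle so each pair of columns is looked at once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Drop the second column of any pair whose correlation is above the cutoff
threshold = 0.9  # assumed cutoff, tune for your own data
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced_df = demo_df.drop(columns=to_drop)

print(to_drop)
print(reduced_df.columns.tolist())
```

Which column of a correlated pair to keep is a judgment call. This sketch simply keeps whichever comes first.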

### Forward/Backward Attribute Selection

* Forward - start with just one column when you train a model and keep adding one attribute at a time until the accuracy plateaus. For example, if after the 15th column none of the attributes you add improve the model, then you don't need them.

* Backward - train the model on all the attributes and then take away attributes (columns) one at a time, retraining each time. Does removing a column hurt or improve the model?

These approaches let us test our model, reduce our data if we want to, and experiment with which columns to include instead of just including everything.
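Scikit-Learn has a `SequentialFeatureSelector` that automates both directions. The sketch below is one possible way to try it, under some assumptions: it uses synthetic data from `make_classification`, a `LogisticRegression` model, and a fixed target of 3 columns. Fixing the number of columns is a simplification of the "keep going until it plateaus" idea described above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 8 columns, only 3 of them informative (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)

model = LogisticRegression(max_iter=1000)

# Forward: start with no columns and add the most helpful one at each step
forward = SequentialFeatureSelector(model, n_features_to_select=3,
                                    direction="forward", cv=5)
forward.fit(X_demo, y_demo)
print("Forward keeps columns:", forward.get_support(indices=True))

# Backward: start with all columns and remove the least helpful one at each step
backward = SequentialFeatureSelector(model, n_features_to_select=3,
                                     direction="backward", cv=5)
backward.fit(X_demo, y_demo)
print("Backward keeps columns:", backward.get_support(indices=True))

# Reduce the data to the selected columns
X_forward = forward.transform(X_demo)
print(X_forward.shape)
```

The two directions don't always agree on which columns to keep, so it can be worth comparing both.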