# Improving a Machine Learning model

* First predictions = **Baseline predictions.**
* First model = **Baseline model.**

From a **data** perspective:
* Could we collect more data? 
* Could we improve our data?

From a **model** perspective:
* Is there a better model we could use? (see scikit-learn model flow chart)
* Could we improve the current model?

Hyperparameters vs Parameters
* **Parameters** = Patterns the model learns from the data during fitting.
* **Hyperparameters** = Settings on a model that you adjust to (potentially) improve its ability to find patterns.
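
To make the distinction concrete, here is a minimal sketch on synthetic data (the dataset from `make_classification` and the chosen values are assumptions for illustration, not part of this project). Hyperparameters can be read before fitting, whereas the learned parameters (here, the fitted trees and the feature importances derived from them) only exist after `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

clf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42)

# Hyperparameters: settings we choose, available before the model sees any data
hyperparams = clf.get_params()
print(hyperparams["n_estimators"], hyperparams["max_depth"])

# Parameters: what the model learns from the data, available only after fitting
clf.fit(X, y)
n_trees = len(clf.estimators_)
importances_shape = clf.feature_importances_.shape
print(n_trees, importances_shape)
```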

Three ways to adjust Hyperparameters:
* By hand
* Randomly with RandomizedSearchCV
* Exhaustively with GridSearchCV
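
The two automated options can be sketched as follows. This is a rough illustration on synthetic data; the grid values, `cv=3` and `n_iter=3` are arbitrary assumptions chosen to keep the run short. `RandomizedSearchCV` samples a fixed number of combinations, while `GridSearchCV` tries every combination in the grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hypothetical search space: 2 * 2 * 2 = 8 combinations
grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 2],
}

# Random search: evaluates only n_iter sampled combinations
rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=grid,
    n_iter=3,
    cv=3,
    random_state=42,
)
rs.fit(X, y)

# Grid search: evaluates all 8 combinations
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=grid, cv=3)
gs.fit(X, y)

print("Random search tried:", len(rs.cv_results_["params"]), "best:", rs.best_params_)
print("Grid search tried:", len(gs.cv_results_["params"]), "best:", gs.best_params_)
```

Random search is usually the cheaper first pass; grid search makes sense once the search space has been narrowed down.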

In [46]:
# Standard imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [47]:
# Load data

# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston

boston = load_boston()

boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
boston_df['target'] = pd.Series(boston['target'])


heart_disease = pd.read_csv('~/sample_project/Data/heart-disease.csv')

In [48]:
# Specific imports 

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

**Improving the model**

In [49]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [50]:
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### 5.1 Tuning hyperparameters by hand

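Make 3 sets:
* Training: used to fit the model.
* Validation: used to compare hyperparameter settings.
* Test: used once at the end to evaluate the chosen model.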
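
One possible way to produce three such sets is to call `train_test_split` twice. The sketch below uses a made-up array of 100 rows and a 70/15/15 split; the data and the variable name `n_rest` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, binary labels
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: hold out 30% of the rows (passed as an absolute count)
n_rest = round(0.3 * len(X))
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=n_rest, random_state=42
)

# Second split: divide the held-out rows evenly into validation and test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

print(len(X_train), len(X_valid), len(X_test))
```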


We will try to adjust the following:
* `max_depth`.
* `max_features`.
* `min_samples_leaf`.
* `min_samples_split`.
* `n_estimators`.
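
As a rough sketch of what adjusting these by hand can look like, the loop below fits a few hand-picked settings and scores each on a validation set. The synthetic data and the candidate values are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical hand-picked settings to compare
candidates = [
    {"n_estimators": 100},
    {"n_estimators": 50, "max_depth": 5},
    {
        "n_estimators": 200,
        "max_depth": 10,
        "max_features": "sqrt",
        "min_samples_leaf": 2,
        "min_samples_split": 4,
    },
]

scores = []
for params in candidates:
    clf = RandomForestClassifier(random_state=42, **params)
    clf.fit(X_train, y_train)
    # Score on the validation set, never on the test set, while tuning
    acc = accuracy_score(y_valid, clf.predict(X_valid))
    scores.append(acc)
    print(params, f"validation accuracy: {acc:.2f}")
```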

In [52]:
def evaluate_preds(y_true, y_preds):
    """
    Performs evaluation comparison on y_true labels vs. y_preds labels on a classification model.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric_dict = {"accuracy": round(accuracy, 2),
                   "precision": round(precision, 2),
                   "recall": round(recall, 2),
                   "f1": round(f1, 2)}
    
    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"recall: {recall:.2f}")
    print(f"F1 score: {f1:.2f}")
    
    return metric_dict
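
To see what the four metrics measure, here is a small worked example on made-up labels (the labels are an assumption for illustration only). It computes each metric from the confusion counts and checks the results against the scikit-learn functions used above.

```python
import math
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels, for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_preds = [1, 0, 0, 1, 1, 1]

# Count the four confusion-matrix cells by hand
pairs = list(zip(y_true, y_preds))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The hand-computed values should agree with scikit-learn
assert math.isclose(accuracy, accuracy_score(y_true, y_preds))
assert math.isclose(precision, precision_score(y_true, y_preds))
assert math.isclose(recall, recall_score(y_true, y_preds))
assert math.isclose(f1, f1_score(y_true, y_preds))

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```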

#### Manually splitting the data.

###### Shuffling the data.

In [53]:
np.random.seed(42)

heart_disease_shuffled = heart_disease.sample(frac=1)

###### Split into X and y.

In [54]:
X = heart_disease_shuffled.drop('target', axis=1)
y = heart_disease_shuffled['target']

###### Split the data into train, validation and test sets.

In [55]:
train_split = round(0.7 * len(heart_disease_shuffled))
validation_split = round(train_split + 0.15 * len(heart_disease_shuffled))

X_train, y_train = X[:train_split], y[:train_split]

X_valid, y_valid = X[train_split:validation_split], y[train_split:validation_split]

X_test, y_test = X[validation_split:], y[validation_split:]

In [56]:
len(X_train), len(X_valid), len(X_test)

(212, 45, 46)
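
The three sizes follow from the rounding in the slice indices. The arithmetic can be checked as below, assuming the dataset has 303 rows (inferred here from 212 + 45 + 46; `n_rows` and `sizes` are names introduced for this sketch).

```python
# Assumed row count, inferred from the three lengths shown above
n_rows = 303

train_split = round(0.7 * n_rows)                       # 212.1  -> 212
validation_split = round(train_split + 0.15 * n_rows)   # 257.45 -> 257

sizes = (
    train_split,                      # rows [0, 212)
    validation_split - train_split,   # rows [212, 257)
    n_rows - validation_split,        # rows [257, 303)
)
print(sizes)
```

Because each boundary is rounded, the validation and test sets can differ in size by one row even though both are nominally 15%.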

In [57]:
clf = RandomForestClassifier()

**Baseline parameters**

In [58]:
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [66]:
clf.fit(X_train, y_train)

# Make baseline predictions on the validation set, since this is the set we will tune the hyperparameters on.

y_preds = clf.predict(X_valid)

# Evaluate the classifier on the validation set

baseline_metrics = evaluate_preds(y_valid, y_preds)
baseline_metrics

Accuracy: 84.44%
Precision: 0.85
recall: 0.88
F1 score: 0.86


{'accuracy': 0.84, 'precision': 0.85, 'recall': 0.88, 'f1': 0.86}