## Improving a model

In [1]:
# First predictions = baseline predictions.
# First model = baseline model.

From a data perspective:
* Could I collect more data? (generally, more data = better)
* Could I improve the data? (more features per sample)

From a model perspective:
* Is there a better model I can use?
* Could I improve the current model?

Parameters = patterns the model finds in the data.
Hyperparameters = settings on a model you can adjust to try to improve its
ability to find patterns.
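To make the distinction concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the names `X_demo`, `y_demo` and `demo_clf` are assumptions for illustration, not part of this notebook. Hyperparameters are set before fitting; parameters (here, the fitted trees) only exist after `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: any small classification set would do)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)

# Hyperparameters: chosen by you, before the model sees any data
demo_clf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42)
has_trees_before_fit = hasattr(demo_clf, "estimators_")  # nothing learned yet

# Parameters: learned from the data during fit (here, the fitted trees)
demo_clf.fit(X_demo, y_demo)
n_trees_learned = len(demo_clf.estimators_)

print(has_trees_before_fit, n_trees_learned)
```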

## Three ways to adjust hyperparameters:
1. By hand;
2. Randomly with RandomizedSearchCV;
3. Exhaustively with GridSearchCV.
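The difference between the random and the exhaustive approach can be sketched with `ParameterGrid` and `ParameterSampler` from `sklearn.model_selection`, which enumerate or sample hyperparameter combinations. The search space `demo_grid` below is hypothetical and kept small on purpose.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space, kept small for illustration
demo_grid = {"n_estimators": [10, 100, 1000],
             "max_depth": [None, 5, 10],
             "min_samples_split": [2, 4]}

# Exhaustive: every combination is tried (3 * 3 * 2 of them)
all_combinations = list(ParameterGrid(demo_grid))

# Random: only n_iter combinations are sampled from the same space
sampled_combinations = list(ParameterSampler(demo_grid, n_iter=5, random_state=42))

print(len(all_combinations), len(sampled_combinations))
```

An exhaustive search grows multiplicatively with every value added to the grid, which is why a random search is often tried first on large spaces.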

In [2]:
# Standard imports for all projects
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [5]:
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
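Since `get_params()` returns a plain dictionary, single hyperparameters can be read from it, and `set_params()` changes them on an existing estimator. A small sketch (the name `demo_clf` and the chosen values are assumptions for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

demo_clf = RandomForestClassifier()

# Read one hyperparameter from the dictionary returned by get_params()
default_n_estimators = demo_clf.get_params()["n_estimators"]

# Change hyperparameters on an existing (unfitted) estimator
demo_clf.set_params(n_estimators=200, max_depth=10)
updated = {name: demo_clf.get_params()[name] for name in ("n_estimators", "max_depth")}

print(default_n_estimators, updated)
```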

## 1. Improving hyperparameters by hand

In [6]:
# Make 3 sets: training, validation and test
# Try adjusting: max_depth, max_features, min_samples_leaf, min_samples_split, n_estimators
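One possible way to get the three sets is to call `train_test_split` twice. This is only a sketch on made-up data (`X_demo`, `y_demo` and the `_d` suffixed names are assumptions) and is an alternative to splitting by index.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples, 4 features (assumption, replace with your own X and y)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = rng.integers(0, 2, size=100)

# First split: roughly 70% train, 30% held out
X_train_d, X_hold, y_train_d, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Second split: halve the held-out part into validation and test (about 15% each)
X_valid_d, X_test_d, y_valid_d, y_test_d = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42)

print(len(X_train_d), len(X_valid_d), len(X_test_d))
```

The validation set is used to compare hyperparameter settings; the test set stays untouched until the final evaluation.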


In [127]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Wrap the evaluation in a function so it can be reused for every model
def evaluate_preds(y_true, y_preds):
    """
    Compare true labels with predicted labels on a binary classification task.
    Prints accuracy, precision, recall and F1 and returns them in a dict.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric_dict = {"accuracy": round(accuracy, 2),
                   "precision": round(precision, 2),
                   "recall": round(recall, 2),
                   "f1": round(f1, 2)}
    print(f"Acc: {accuracy * 100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 score: {f1:.2f}")
    return metric_dict
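To see what the four metrics measure, they can be recomputed by hand from the confusion matrix. The toy labels below are made up for illustration and the `_by_hand` names are assumptions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels, made up for illustration
y_true_demo = [1, 0, 1, 1, 0, 1]
y_pred_demo = [1, 0, 0, 1, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy_by_hand = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision_by_hand = tp / (tp + fp)                  # of the predicted positives, how many are real
recall_by_hand = tp / (tp + fn)                     # of the real positives, how many were found
f1_by_hand = 2 * precision_by_hand * recall_by_hand / (precision_by_hand + recall_by_hand)

print(accuracy_by_hand, precision_by_hand, recall_by_hand, f1_by_hand)
```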

In [128]:
heart_disease = pd.read_csv("data/heart-disease.csv")


In [129]:
# Manually split the data into training, validation and test
from sklearn.ensemble import RandomForestClassifier


np.random.seed(42)

# Shuffle the data
heart_disease_shuffled = heart_disease.sample(frac=1)

# Split into X & y
X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]

# Split data into train, validation and test sets
train_split = round(0.7 * len(heart_disease_shuffled)) # 70% of data
valid_split = round(train_split + 0.15 * len(heart_disease_shuffled)) # next 15% of data
X_train, y_train = X[:train_split], y[:train_split]
X_valid, y_valid = X[train_split:valid_split], y[train_split:valid_split]
X_test, y_test = X[valid_split:], y[valid_split:]

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Make baseline predictions
y_preds = clf.predict(X_valid)

# Evaluate the classifier on validation set
baseline_metrics = evaluate_preds(y_valid, y_preds)
baseline_metrics

Acc: 82.22%
Precision: 0.81
Recall: 0.88
F1 score: 0.85


{'accuracy': 0.82, 'precision': 0.81, 'recall': 0.88, 'f1': 0.85}

In [133]:
np.random.seed(42)

# Create a second classifier with different hyperparameters
clf_2 = RandomForestClassifier(n_estimators=1000)
clf_2.fit(X_train, y_train)

# Make predictions with different hyperparameters
y_preds_2 = clf_2.predict(X_valid)

# Evaluate the 2nd classifier
clf_2_metrics = evaluate_preds(y_valid, y_preds_2)

Acc: 82.22%
Precision: 0.81
Recall: 0.88
F1 score: 0.85
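The two results are easier to compare side by side. A small sketch with pandas, using the rounded values printed above (the names `baseline_demo`, `clf_2_demo` and `comparison` are assumptions):

```python
import pandas as pd

# Rounded values copied from the two outputs above
baseline_demo = {"accuracy": 0.82, "precision": 0.81, "recall": 0.88, "f1": 0.85}
clf_2_demo = {"accuracy": 0.82, "precision": 0.81, "recall": 0.88, "f1": 0.85}

# One column per model, one row per metric
comparison = pd.DataFrame({"baseline": baseline_demo, "clf_2": clf_2_demo})
comparison["difference"] = comparison["clf_2"] - comparison["baseline"]

print(comparison)
```

On this validation set, going from 100 to 1000 trees changed none of the four metrics.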


In [120]:
clf_3 = RandomForestClassifier(n_estimators=1000,
                               max_depth=10)
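Trying values by hand usually ends up as a loop like the one sketched below. It runs on synthetic data rather than the heart disease set, and the candidate `max_depth` values and all `_d` suffixed names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_train_d, X_valid_d, y_train_d, y_valid_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Try a few max_depth values by hand and score each one on the validation set
valid_scores = {}
for depth in [None, 2, 5, 10]:
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    model.fit(X_train_d, y_train_d)
    valid_scores[depth] = accuracy_score(y_valid_d, model.predict(X_valid_d))

# Keep the setting with the highest validation accuracy
best_depth = max(valid_scores, key=valid_scores.get)
print(valid_scores, best_depth)
```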

## 2. Randomly with RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

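# Note: for classifiers "auto" behaves like "sqrt"; it was deprecated in
# scikit-learn 1.1 and removed in 1.3, so drop it on newer versions
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1500],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}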

np.random.seed(42)
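A rough sketch of how a search over such a grid might continue. It uses synthetic data and a smaller, hypothetical `demo_grid` so that it runs quickly; `n_iter=5` and `cv=3` are arbitrary choices, and all names except the scikit-learn ones are assumptions. Because `RandomizedSearchCV` cross-validates on the training data, no separate validation set is split off here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Hypothetical, smaller search space so the example runs quickly
demo_grid = {"n_estimators": [10, 50, 100],
             "max_depth": [None, 5, 10],
             "max_features": ["sqrt", "log2"],
             "min_samples_split": [2, 4, 6],
             "min_samples_leaf": [1, 2, 4]}

# Sample n_iter combinations and score each with 3-fold cross-validation
rs_clf = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                            param_distributions=demo_grid,
                            n_iter=5,
                            cv=3,
                            random_state=42)
rs_clf.fit(X_train_d, y_train_d)

# Best combination found, and accuracy of the refitted best model on the test set
test_accuracy = rs_clf.score(X_test_d, y_test_d)
print(rs_clf.best_params_)
print(test_accuracy)
```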