# Machine Learning Workflow

A very high-level outline of a typical machine learning workflow.

The goal is to keep this *very* concise and only include detail where it makes sense. I want to be able to use this as my standard reference for when I'm working on a new ML project.

In [86]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
pd.options.display.max_columns = 0

Two sample datasets cover the two main types of classic supervised ML problems:
* `heart_disease` is a classification dataset (predicting whether someone has heart disease or not).
* `boston_df` is a regression dataset (predicting the median house price of towns in the Boston area).

In [72]:
heart_disease = pd.read_csv('./data/heart-disease.csv')
boston_df = pd.read_csv('./data/boston.csv')

# Prepare Data

In [73]:
# X = features, y = target (conventional variable names)
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

In [74]:
from sklearn.model_selection import train_test_split

# A purely random split can mislead on time-ordered, grouped, or imbalanced data
X_train, X_test, y_train, y_test = train_test_split(X, y)
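
One common way to soften the limits of a purely random split is to stratify on the target and fix the seed. The sketch below uses synthetic, imbalanced stand-in data (the names `X_demo` and `y_demo` are made up for illustration, not part of this notebook's datasets). For time-ordered or grouped data, splitters such as `TimeSeriesSplit` or `GroupShuffleSplit` are usually a better fit.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (80% class 0, 20% class 1)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 160 + [1] * 40)

# stratify keeps the class ratio the same in both splits;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Share of the positive class in each split
print(y_tr.mean(), y_te.mean())
```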

# Pick a suitable model and train

Use the [sklearn ML cheat sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) to narrow down which model to try first.

In [75]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf_model = RandomForestClassifier()
regr_model = RandomForestRegressor()

In [76]:
clf_model.fit(X_train, y_train)

y_preds = clf_model.predict(X_test)
y_probs = clf_model.predict_proba(X_test)

y_preds, y_probs

(array([0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0,
        0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,
        0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1,
        1, 0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=int64),
 array([[0.61, 0.39],
        [0.71, 0.29],
        [0.09, 0.91],
        [0.82, 0.18],
        [0.97, 0.03],
        [0.52, 0.48],
        [0.59, 0.41],
        [0.24, 0.76],
        [0.23, 0.77],
        [0.99, 0.01],
        [0.42, 0.58],
        [0.92, 0.08],
        [0.2 , 0.8 ],
        [0.14, 0.86],
        [0.71, 0.29],
        [1.  , 0.  ],
        [0.61, 0.39],
        [0.17, 0.83],
        [0.82, 0.18],
        [0.3 , 0.7 ],
        [0.3 , 0.7 ],
        [0.89, 0.11],
        [0.97, 0.03],
        [0.1 , 0.9 ],
        [0.01, 0.99],
        [0.07, 0.93],
        [0.14, 0.86],
        [0.82, 0.18],
        [0.94, 0.06],
        [0.24, 0.76],
        [0.83, 0.17],
        [0.89, 0.11],
        [0.06, 0.94],

# Evaluate the model

Every scikit-learn estimator has a default metric accessible through the `score()` method (mean accuracy for classifiers, R² for regressors). However, there is a range of other metrics you can use depending on the problem you're solving.

See https://scikit-learn.org/stable/modules/model_evaluation.html
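
To make the default metric concrete, here is a small check on synthetic data (generated with `make_classification` and `make_regression`, so the numbers say nothing about the heart disease or Boston data). It only shows that `score()` agrees with `accuracy_score` for a classifier and with `r2_score` for a regressor.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Classifier: score() is mean accuracy
Xc, yc = make_classification(n_samples=200, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(Xc_tr, yc_tr)
clf_default = clf.score(Xc_te, yc_te)
clf_explicit = accuracy_score(yc_te, clf.predict(Xc_te))

# Regressor: score() is the coefficient of determination (R^2)
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(Xr_tr, yr_tr)
reg_default = reg.score(Xr_te, yr_te)
reg_explicit = r2_score(yr_te, reg.predict(Xr_te))

print(clf_default, clf_explicit)
print(reg_default, reg_explicit)
```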

In [77]:
clf_model.score(X_test, y_test)

0.8289473684210527

In [78]:
from sklearn.model_selection import cross_val_score

# scoring=None uses default score() metric
print(cross_val_score(
    estimator=clf_model,
    X=X,
    y=y,
    cv=5, # 5-fold cross-validation
    scoring=None
))
print(cross_val_score(
    estimator=clf_model,
    X=X,
    y=y,
    cv=5, # 5-fold cross-validation
    scoring='precision'
))

[0.85245902 0.8852459  0.80327869 0.81666667 0.76666667]
[0.78947368 0.90625    0.82758621 0.81818182 0.76315789]
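
The per-fold scores are easier to compare across models when reduced to a mean and a spread. A minimal sketch, reusing the five accuracy values printed above (the variable names are only for illustration):

```python
import numpy as np

# Accuracy per fold, copied from the first cross_val_score output above
fold_accuracy = np.array([0.85245902, 0.8852459, 0.80327869, 0.81666667, 0.76666667])

mean_acc = fold_accuracy.mean()
std_acc = fold_accuracy.std()

print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```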


In [79]:
# Other classification metrics

from sklearn.metrics import accuracy_score
print('accuracy')
print(accuracy_score(y_test, y_preds))

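# Receiver Operating Characteristic (ROC) curve / Area Under the Curve (AUC)
from sklearn.metrics import roc_curve, roc_auc_score
# roc_curve needs scores (here the positive-class probability), not hard labels
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_probs[:, 1])
print('roc auc')
# NOTE: y_preds are hard 0/1 labels, so this is the AUC at a single threshold;
# pass y_probs[:, 1] instead to score the full ROC curve
print(roc_auc_score(y_test, y_preds))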

from sklearn.metrics import confusion_matrix
print('confusion matrix')
print(confusion_matrix(y_test, y_preds))

from sklearn.metrics import classification_report
print('classification report')
print(classification_report(y_test, y_preds))

accuracy
0.8289473684210527
roc auc
0.8291666666666666
confusion matrix
[[30  6]
 [ 7 33]]
classification report
              precision    recall  f1-score   support

           0       0.81      0.83      0.82        36
           1       0.85      0.82      0.84        40

    accuracy                           0.83        76
   macro avg       0.83      0.83      0.83        76
weighted avg       0.83      0.83      0.83        76
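
A note on the ROC AUC value above: when `roc_auc_score` is given hard 0/1 predictions, the result equals the balanced accuracy, (TPR + TNR) / 2. With the confusion matrix shown here that is (33/40 + 30/36) / 2, which matches the printed 0.829. The sketch below, on synthetic data rather than the heart disease set, contrasts that with the AUC computed from predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
hard_preds = clf.predict(X_te)
pos_probs = clf.predict_proba(X_te)[:, 1]  # probability of class 1

# AUC over the full ROC curve (uses the ranking of the probabilities)
auc_from_probs = roc_auc_score(y_te, pos_probs)

# AUC from hard labels collapses to balanced accuracy
auc_from_labels = roc_auc_score(y_te, hard_preds)
balanced_acc = balanced_accuracy_score(y_te, hard_preds)

print(auc_from_probs, auc_from_labels, balanced_acc)
```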



In [80]:
# Other regression metrics

X = boston_df.drop('target', axis=1)
y = boston_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

regr_model.fit(X_train, y_train)
y_preds = regr_model.predict(X_test)

from sklearn.metrics import r2_score
print('r squared score')
print(r2_score(y_test, y_preds))

from sklearn.metrics import mean_absolute_error
print('MAE')
print(mean_absolute_error(y_test, y_preds))

from sklearn.metrics import mean_squared_error
print('MSE')
print(mean_squared_error(y_test, y_preds))

r squared score
0.9217375486209248
MAE
1.694323529411767
MSE
5.857417696078444
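
To see what these three numbers measure, the sketch below computes them by hand with NumPy on four made-up values and checks the results against scikit-learn. The arrays are illustrative only and unrelated to the Boston data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_pred

# MAE: average size of the error, in the units of the target
mae = np.abs(errors).mean()

# MSE: average squared error, which penalises large errors more heavily
mse = (errors ** 2).mean()

# RMSE: square root of MSE, back in the units of the target
rmse = np.sqrt(mse)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```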


# Improve through experimentation

The two main ways to improve on a model's baseline metrics (the first evaluation metrics you get) are improving the data and improving the model.

### Improving the data
* Can we collect more data? In ML, more data is generally better as it gives the model more opportunities to learn patterns.
* Can we improve our data? This could be filling in missing values, finding a better encoding (turning things into numbers), feature engineering, etc.

### Improving the model
* Is there a better model we could use?
* Could we improve the current model with **hyperparameter tuning**?

In [81]:
# You can get a list of adjustable hyperparameters
clf_model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [82]:
# Example of adjusting hyperparameters by hand

# Split data into X & y
X = heart_disease.drop("target", axis=1) # use all columns except target
y = heart_disease["target"] # we want to predict y using X

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Instantiate two models with different settings
clf_1 = RandomForestClassifier(n_estimators=100)
clf_2 = RandomForestClassifier(n_estimators=200)

# Fit both models on training data
clf_1.fit(X_train, y_train)
clf_2.fit(X_train, y_train)

# Evaluate both models on test data and see which is best
print(clf_1.score(X_test, y_test))
print(clf_2.score(X_test, y_test))

0.8157894736842105
0.8289473684210527
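
A difference this small on a single test split of 76 rows amounts to one prediction, so it may well be noise. A possible alternative, sketched here on synthetic data with a fixed seed (both are assumptions added for reproducibility), is to compare the two settings with cross-validation instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=300, random_state=0)

cv_results = {}
for n_estimators in (100, 200):
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    # 5-fold cross-validated accuracy for this setting
    scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    cv_results[n_estimators] = (scores.mean(), scores.std())
    print(n_estimators, scores.mean(), scores.std())
```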


In [83]:
# Example of adjusting hyperparameters computationally (recommended)

from sklearn.model_selection import RandomizedSearchCV

# Define a grid of hyperparameters
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["log2", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Set n_jobs to -1 to use all cores (NOTE: n_jobs=-1 is broken as of 8 Dec 2019, using n_jobs=1 works)
clf = RandomForestClassifier(n_jobs=1)

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid,
                            n_iter=10, # try 10 models total
                            cv=5, # 5-fold cross-validation
                            verbose=2) # print out results

# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train);

# Find the best hyperparameters
print(rs_clf.best_params_)

# Scoring automatically uses the best hyperparameters
rs_clf.score(X_test, y_test)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END max_depth=20, max_features=log2, min_samples_leaf=4, min_samples_split=4, n_estimators=200; total time=   0.6s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=4, min_samples_split=4, n_estimators=200; total time=   0.6s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=4, min_samples_split=4, n_estimators=200; total time=   0.6s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=4, min_samples_split=4, n_estimators=200; total time=   0.6s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=4, min_samples_split=4, n_estimators=200; total time=   0.7s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=6, n_estimators=1200; total time=   4.3s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=6, n_estimators=1200; total time=   4.1s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=6, n_estimators=1

0.7540983606557377
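
`RandomizedSearchCV` samples a fixed number of combinations from the grid. Once it has pointed at a promising region, `GridSearchCV` can try every combination in a smaller grid. The sketch below uses synthetic data and a deliberately tiny, made-up grid (`small_grid`) to keep the run short; it is not tuned for the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Every combination is tried: 2 * 3 = 6 candidates
small_grid = {"n_estimators": [50, 100], "min_samples_leaf": [1, 2, 4]}

gs_clf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,  # 3-fold cross-validation
)
gs_clf.fit(X_tr, y_tr)

n_candidates = len(gs_clf.cv_results_["params"])
print(n_candidates, gs_clf.best_params_)

# Like RandomizedSearchCV, score() uses the refitted best estimator
print(gs_clf.score(X_te, y_te))
```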

# Save and reload a trained model

In [84]:
import pickle
from pathlib import Path

file_name = 'rs_random_forest_model_1.pkl'
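# Use context managers so the file handles are closed before the file is deleted
with open(file_name, 'wb') as f:
    pickle.dump(rs_clf, f)
with open(file_name, 'rb') as f:
    loaded_pickle_model = pickle.load(f)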
print(loaded_pickle_model.score(X_test, y_test))
Path(file_name).unlink()

0.7540983606557377
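
`joblib` is a common alternative to `pickle` for scikit-learn models. The sketch below trains a small stand-in model on synthetic data and round-trips it through a temporary directory; the file name `demo_model.joblib` is hypothetical. As with `pickle`, only load files from sources you trust.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=100, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# The temporary directory (and the saved file) is removed on exit
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.joblib"
    joblib.dump(clf, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should make identical predictions
same_predictions = bool((clf.predict(X_demo) == reloaded.predict(X_demo)).all())
print(same_predictions)
```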


# All together with simple model

`Pipeline` chains several sklearn preprocessing steps and a model into a single estimator.

Most scikit-learn models need input with no missing values and no non-numeric values. We can handle these in various ways, such as dropping, filling, encoding, etc.

In [87]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

import numpy as np
np.random.seed(42)

# Import data and drop rows with missing labels
data = pd.read_csv('./data/car-sales-extended-missing-data.csv')
data.dropna(subset=['Price'], inplace=True)

# Define different features and column transformer pipelines
# NOTE: Process numeric categorical columns (e.g. 'Doors') separately.
# If 'Doors' went through the categorical transformer, its missing values
# would be filled with the string 'missing', which is not only meaningless
# but would also cause an error with the encoder.
# 'Doors' is already a number, so it does not need to be encoded either.
categorical_features = ['Make', 'Colour']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

door_feature = ['Doors']
door_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=4))
])

numerical_features = ['Odometer (KM)']
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
])

# Setup preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('door', door_transformer, door_feature),
        ('num', numerical_transformer, numerical_features)
    ],
    # NOTE: This allows you to know which col transform failed
    verbose=True
)

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(
        # verbose > 0 prints progress, which helps estimate time remaining
        # (random forests can take a while on large datasets)
        verbose=3,
        # n_jobs=-1 uses all processors, which can make the machine sluggish;
        # n_jobs=-2 leaves one processor free if you are working at the same time
        n_jobs=-1
    ))
])


X = data.drop('Price', axis=1)
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model.fit(X_train, y_train)
model.score(X_test, y_test)

0.22188417408787875
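
Because the whole pipeline is a single estimator, it can be tuned with `GridSearchCV` by addressing nested parameters as `step__parameter`. The sketch below is a reduced version of the pipeline on a synthetic, made-up car table (the real CSV is not used, and the grid values are arbitrary), so the scores are not comparable to the one above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data with some missing odometer readings
rng = np.random.default_rng(0)
n = 120
demo = pd.DataFrame({
    "Make": rng.choice(["Toyota", "Honda", "BMW"], size=n),
    "Odometer (KM)": rng.uniform(10_000, 200_000, size=n),
})
demo["Price"] = 30_000 - 0.1 * demo["Odometer (KM)"] + rng.normal(0, 1_000, size=n)
demo.loc[::10, "Odometer (KM)"] = np.nan

preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Make"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(random_state=0)),
])

# Keys follow <step name>__<parameter>, nested through the ColumnTransformer
param_grid = {
    "preprocessor__num__strategy": ["mean", "median"],
    "model__n_estimators": [20, 50],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(demo.drop("Price", axis=1), demo["Price"])

print(search.best_params_)
```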