# ml-utils guide

First, let's import `py_utils`.

In [116]:
import py_utils

Running the function below lets us hide code cells in a Jupyter notebook. <br>
This can be helpful when presenting.

In [117]:
py_utils.hide_code_cells()

## Model Building

To explore how to use the model utils, let's quickly build a test model. <br>
First, import some more packages.

In [118]:
import os
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification, load_iris

Let's make a fake numeric dataset for the purpose of our examples.

We'll do this by using scikit-learn's helpful `make_classification` function.

<mark>You can 'unpack' arguments into a function via a dictionary using the `**` notation (see below).</mark>

In [119]:
make_classification_dict = {'n_samples': 100000, 'n_features': 50}

sample_data =  make_classification(**make_classification_dict)
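To make the `**` unpacking concrete, here is a minimal sketch using a small stand-in function. The `summarize` helper is hypothetical and exists only for illustration; it is not part of `py_utils` or scikit-learn.

```python
def summarize(n_samples, n_features=20):
    """Hypothetical helper used only to illustrate keyword unpacking."""
    return f"{n_samples} rows x {n_features} columns"


settings = {'n_samples': 100000, 'n_features': 50}

# These two calls are equivalent: ** expands the dictionary into keyword arguments.
unpacked = summarize(**settings)
explicit = summarize(n_samples=100000, n_features=50)

print(unpacked)
print(unpacked == explicit)
```

Keeping the arguments in a dictionary makes it easy to reuse or log the exact settings that produced a dataset.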

Let's look at our X:

In [120]:
X = pd.DataFrame(sample_data[0])
X.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.365997 | -0.433317 | 1.560637 | -0.138079 | 0.245504 | 0.132429 | -0.12972 | 0.440637 | 0.876539 | -0.008592 | ... | 0.177584 | -0.168598 | -0.095449 | -0.061516 | -0.979664 | 0.558874 | -2.092906 | 0.171905 | 1.03565 | 0.430578 |
| 1 | -0.132642 | 1.8865 | -0.378027 | 1.32195 | 1.468235 | 0.85114 | 0.273501 | -1.474592 | 0.194993 | 1.00709 | ... | 0.720932 | -1.619365 | 0.302415 | 0.094322 | 1.447754 | 0.24062 | 0.430696 | 0.958248 | -0.3113 | -0.544773 |
| 2 | -1.060272 | -1.19859 | 0.148687 | 1.467177 | 0.023818 | -0.467987 | 0.074374 | 0.248414 | -2.211274 | 1.50754 | ... | 0.568848 | -0.693858 | 0.549049 | -0.124684 | -0.530048 | 1.41776 | -0.609569 | -0.026135 | -1.037186 | -0.623432 |
| 3 | -0.771197 | -0.865145 | 0.607501 | -0.910828 | -0.556104 | -0.987186 | 0.076289 | 0.270997 | -0.49054 | 1.691022 | ... | -0.589941 | -1.192576 | 1.390384 | -1.84215 | 0.079264 | 0.868303 | 0.309218 | -1.786201 | 1.454674 | -0.281275 |
| 4 | -0.416671 | -0.356968 | -0.740395 | 2.274168 | 1.440758 | -0.367331 | -0.73658 | -0.118942 | -0.845161 | -0.01438 | ... | -0.109576 | -0.877958 | 0.788034 | 1.718997 | -0.671951 | -1.693619 | 1.327838 | -1.463735 | -0.217258 | 0.747213 |


Let's look at our y: <br>
<mark>You can continue code onto the next line with `\` (see below)</mark>

In [121]:
y = pd.Series(sample_data[1])\
.to_frame()
y.head()

| | 0 |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 0 |
| 3 | 1 |
| 4 | 1 |
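As a side note on line continuation, a chained expression can also be wrapped in parentheses instead of ending each line with `\`. The sketch below uses a plain string rather than a pandas object so that it runs without extra packages; the variable names are illustrative only.

```python
words = "  Model Building  "

# Continuation with a trailing backslash, as in the cell above.
with_backslash = words.strip()\
    .lower()\
    .split()

# The same chain wrapped in parentheses; no backslashes are needed.
with_parens = (
    words.strip()
    .lower()
    .split()
)

print(with_backslash == with_parens)
```

Both forms produce the same result; the parenthesized form is less fragile because a stray space after a backslash is a syntax error.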


In [122]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y)

In [123]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # instantiate the scaler object
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on X_train and transform X_train in one step

Now let's say we want to apply a logistic regression model.

In [124]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train_scaled, y_train.values.ravel())

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [125]:
logreg.score(X_test, y_test)  # caution: X_test is unscaled here, although the model was fitted on X_train_scaled

0.85548
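A model fitted on scaled features should also be scored on features transformed by the same fitted scaler. One way to keep the two steps together is scikit-learn's `make_pipeline`, which is imported earlier in this guide. The following is a minimal sketch on a small synthetic dataset; the dataset size, the `random_state` values, and the variable names are illustrative assumptions, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset; sizes and random_state are illustrative choices.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Fitting the pipeline fits the scaler on the training split only ...
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs'))
pipe.fit(X_tr, y_tr)

# ... and scoring reuses that fitted scaler on the test split.
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

Because the scaler travels with the model, saving the pipeline instead of the bare estimator would also preserve the preprocessing step.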

### Model Saving

In [148]:
mkdir Models

Let's save the model. The function assumes that a 'Models' subfolder exists, which is why we created one above. <br>
We need to define the following for this function:
* `model_name` = the root name of the model we want to save.
* `model_var` = the model variable we want to save.
* `subfolder` = optional subfolder we want to use; we don't need it here, so we leave it as an empty string.
* `wd` = working directory, the main working directory of our project.

In [126]:
py_utils.dump_diff_model(model_name='my_model', model_var=logreg, subfolder='', wd = os.getcwd())

Let's check that the file was saved in the 'Models' folder.

In [127]:
[x for x in os.listdir(os.path.join(os.getcwd(), 'Models')) if x.endswith('.joblib')]

['my_model_2019-07-22 01:13:59.555809.joblib']

Now let's try to save it again and check the output.

In [128]:
py_utils.dump_diff_model(model_name='my_model', model_var=logreg, subfolder='', wd = os.getcwd())

In [129]:
[x for x in os.listdir(os.path.join(os.getcwd(), 'Models')) if x.endswith('.joblib')]

['my_model_2019-07-22 01:13:59.555809.joblib']

We can see that nothing new was saved! <br>
This is because we tried to save the same model twice.

Now let's try to save a different model under the same name, 'my_model'. <br>
Let's say we want to change the logistic regression solver.

In [130]:
logreg = LogisticRegression(solver='saga')
logreg.fit(X_train_scaled, y_train.values.ravel())

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

In [131]:
logreg.score(X_test, y_test)

0.85548

We've now changed the solver from 'lbfgs' to 'saga'. Let's try to save this different model.

In [132]:
py_utils.dump_diff_model(model_name='my_model', model_var=logreg, subfolder='', wd = os.getcwd())

In [133]:
[x for x in os.listdir(os.path.join(os.getcwd(), 'Models')) if x.endswith('.joblib')]

['my_model_2019-07-22 01:14:01.120734.joblib',
 'my_model_2019-07-22 01:13:59.555809.joblib']

We can see that this new model has been saved. <br>
This is because it differs from the most recent saved version of that model, as identified by the timestamp in the filename.
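The "save only if different" behaviour shown above can be sketched roughly as follows. This is not the `py_utils` implementation: the function name `dump_if_changed`, the use of `pickle` instead of joblib, the byte-for-byte comparison, the `.pkl` extension, and the filename timestamp format are all assumptions made for illustration.

```python
import os
import pickle
import tempfile
from datetime import datetime


def dump_if_changed(model_name, model_var, wd, subfolder=''):
    """Hypothetical sketch: write a timestamped file unless the latest one is identical."""
    folder = os.path.join(wd, 'Models', subfolder)
    os.makedirs(folder, exist_ok=True)

    new_bytes = pickle.dumps(model_var)

    # Year-first, zero-padded timestamps sort chronologically as plain strings.
    previous = sorted(
        f for f in os.listdir(folder)
        if f.startswith(model_name + '_') and f.endswith('.pkl')
    )
    if previous:
        with open(os.path.join(folder, previous[-1]), 'rb') as fh:
            if fh.read() == new_bytes:
                return None  # unchanged, so nothing is written

    stamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S.%f')
    path = os.path.join(folder, f'{model_name}_{stamp}.pkl')
    with open(path, 'wb') as fh:
        fh.write(new_bytes)
    return path


# Plain dictionaries stand in for fitted models in this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    first = dump_if_changed('my_model', {'solver': 'lbfgs'}, tmp)
    repeat = dump_if_changed('my_model', {'solver': 'lbfgs'}, tmp)
    changed = dump_if_changed('my_model', {'solver': 'saga'}, tmp)
    saved_files = sorted(os.listdir(os.path.join(tmp, 'Models')))

print(len(saved_files))
```

The second call returns `None` because its serialized bytes match the latest file, mirroring the behaviour seen with the two identical `logreg` saves.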

### Model Loading

Now that we know how to save models, let's see how to load them. <br>
First, let's import the package we need.

In [115]:
import joblib

Let's use another one of these util functions: `most_recent_model()`. <br>
It simply returns the path of the most recent model in a directory (based on the timestamp).

Let's see what we define for this function:
* `model_name` = the root name of the model we want to load.
* `wd` = working directory, the main working directory of our project.
* `subfolder` = optional subfolder we want to use; we don't need it here, so we leave it as an empty string.
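For intuition, a lookup of this kind could be sketched as below. The function `latest_model_path` is hypothetical and is not the `py_utils` code; it assumes files are named `<model_name>_<timestamp>.joblib` inside a 'Models' folder, as in the listings earlier in this guide, and that the timestamps sort chronologically as strings.

```python
import os
import tempfile


def latest_model_path(model_name, wd, subfolder=''):
    """Hypothetical sketch: return the newest '<model_name>_<timestamp>.joblib' path, or None."""
    folder = os.path.join(wd, 'Models', subfolder)
    candidates = [
        f for f in os.listdir(folder)
        if f.startswith(model_name + '_') and f.endswith('.joblib')
    ]
    if not candidates:
        return None
    # Year-first, zero-padded timestamps sort chronologically as strings.
    return os.path.join(folder, max(candidates))


# Empty placeholder files are enough to demonstrate the selection logic.
with tempfile.TemporaryDirectory() as tmp:
    models_dir = os.path.join(tmp, 'Models')
    os.makedirs(models_dir)
    for name in [
        'my_model_2019-07-22 01:13:59.555809.joblib',
        'my_model_2019-07-22 01:14:01.120734.joblib',
        'notes.txt',
    ]:
        open(os.path.join(models_dir, name), 'w').close()

    newest = latest_model_path('my_model', tmp)

print(os.path.basename(newest))
```

Relying on the filename rather than the file's modification time keeps the result stable even if files are copied or restored from a backup.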

In [135]:
model_to_load = py_utils.most_recent_model(model_name = 'my_model', wd = os.getcwd(), subfolder='')
model_to_load

'/Users/Daniel/Desktop/GitHub_Repos/ml-utils/Models/my_model_2019-07-22 01:14:01.120734.joblib'

The output above is the path of the most recently saved model. <br>
Now let's load this model.

In [136]:
logreg = joblib.load(model_to_load)

In [137]:
logreg

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

We see above that we successfully loaded this model.
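Printing the estimator confirms its parameters, but a stronger check is that the reloaded model predicts exactly like the original. The sketch below is a self-contained round trip through `joblib.dump` and `joblib.load`; the small synthetic dataset, the temporary file path, and the variable names are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset used only for this demonstration.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

# Save the fitted model to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_model.joblib')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Note that a model loaded this way expects inputs preprocessed the same way as its training data, so any scaler used during training needs to be applied before predicting.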