# Example of versioning ML experiments using DVC

This notebook aims to be a guideline for versioning your ML projects using DVC, from a Jupyter notebook.

This notebook allows you to experiment as much as you like, and when you are in a state that you would like to preserve for future reference as a git commit, use the DVC cells to version all your relevant files. 

The cells marked with a green markdown box are responsible for creating a snapshot of your raw data, processed data, and trained models.

This snapshot is implemented as md5 hashes of the respective files saved as text in the `.dvc` files. The hashes in the .dvc files will be part of the git commit.

## Imports and global declarations

In [14]:
from sklearn import datasets
import sklearn
from sklearn import preprocessing
from sklearn.externals import joblib
from sklearn import metrics
from sklearn import model_selection
import numpy as np
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression
import json
import os

<div class="alert alert-block alert-success">
<h2>Download and version raw data</h2>
</div>

In [3]:
raw_data = datasets.fetch_california_housing(data_home="data/raw")
# Save the raw input data for reproducibility
!dvc commit -f data/raw.dvc

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to data/raw


Saving 'data/raw' to cache '.dvc/cache'.
Linking directory 'data/raw'.
Saving information to 'data/raw.dvc'.
[0m

## Data preprocessing

In [4]:
def to_dataframe(X, y):
    return pd.concat([
            pd.DataFrame(data=X, columns=raw_data.feature_names),
            pd.DataFrame(data=y, columns=['Value'])
        ],
        axis=1)

In [5]:
raw_df = to_dataframe(raw_data.data, raw_data.target)
raw_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Value
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [6]:
raw_df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Value
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


### Test train split

In [7]:
train_X, test_X, train_y, test_y = model_selection.train_test_split(raw_df[raw_df.columns[:-1]], raw_df['Value'])

In [8]:
train_X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
17584,2.6573,29.0,3.451128,1.077694,969.0,2.428571,37.31,-121.93
11752,3.1583,16.0,5.622378,1.034965,792.0,2.769231,38.76,-121.21
909,4.9167,34.0,5.963675,1.057692,1276.0,2.726496,37.55,-122.01
6418,2.5417,43.0,4.186386,1.025932,1544.0,2.502431,34.15,-118.0
1631,7.6107,33.0,7.554167,1.045833,1348.0,2.808333,37.88,-122.17


### Normalize feature columns by training data only

In [9]:
scaler = preprocessing.StandardScaler()
train_X_scaled = pd.DataFrame(scaler.fit_transform(train_X), index=train_X.index, columns=train_X.columns)
train_X_scaled.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
17584,-0.643187,0.029381,-0.774588,-0.034971,-0.402238,-0.057802,0.789598,-1.181518
11752,-0.377902,-1.004661,0.079716,-0.120778,-0.556614,-0.0276,1.467986,-0.822341
909,0.553191,0.42709,0.214004,-0.075138,-0.134478,-0.031389,0.901883,-1.221427
6418,-0.704399,1.142965,-0.485292,-0.138919,0.099267,-0.051254,-0.688818,0.778992
1631,1.979695,0.347548,0.839802,-0.098953,-0.071681,-0.024134,1.056275,-1.301244


In [10]:
test_X_scaled = pd.DataFrame(scaler.transform(test_X), index=test_X.index, columns=test_X.columns)
test_X_scaled.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
186,-0.495771,1.858841,-0.429702,-0.273856,0.199568,0.083884,1.014168,-1.331175
4934,-1.069603,0.90434,-0.571272,0.074738,0.76823,0.114198,-0.758996,0.644301
911,0.597299,0.268006,0.42951,-0.135071,-0.263561,-0.019934,0.897205,-1.221427
2359,0.557586,-0.527411,0.515851,-0.187719,0.318184,-0.013504,0.532279,-0.034146
16169,-0.504031,0.268006,-0.666624,-0.004922,-0.17547,-0.082624,1.014168,-1.49081


In [11]:
train_df = pd.concat([train_X_scaled, train_y], axis=1)
test_df = pd.concat([test_X_scaled, test_y], axis=1)

<div class="alert alert-block alert-success">
<h2>Optional: Version the processed data with DVC for efficiency and/or reproducibility</h2>
</div>

In [15]:
os.makedirs('data/processed/')
train_df.to_csv('data/processed/california_households_train.csv', index_label='Index')
test_df.to_csv('data/processed/california_households_test.csv', index_label='Index')
joblib.dump(scaler, 'data/processed/california_households_scaler.pkl')
!dvc commit -f process_data.dvc

Saving 'data/processed' to cache '.dvc/cache'.
Linking directory 'data/processed'.
Saving information to 'process_data.dvc'.
[0m

### Use this cell to reload processed data, after switching branches

In [16]:
train_df = pd.read_csv('data/processed/california_households_train.csv', index_col=0)
test_df = pd.read_csv('data/processed/california_households_train.csv', index_col=0)
scaler = joblib.load('data/processed/california_households_scaler.pkl')

## Training

In [17]:
model = LinearRegression()
X = train_df[train_df.columns[:-1]]
y = train_df['Value']
model.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

<div class="alert alert-block alert-success">
<h2>Save the trained model for reproducibility</h2>
</div>

In [20]:
os.makedirs('models')
joblib.dump(model, 'models/california_households.pkl')
!dvc commit -f models.dvc

Saving 'models' to cache '.dvc/cache'.
Linking directory 'models'.
Saving information to 'models.dvc'.
[0m

### Use this cell to reload the model, after switching branches

In [21]:
model = joblib.load('models/california_households.pkl')

## Evaluate the model

In [22]:
predictions = model.predict(test_df[test_df.columns[:-1]])
truth = test_df['Value']
metrics_dict = {}
metrics_dict['R2'] = metrics.r2_score(truth, predictions)
metrics_dict['MAE'] = metrics.mean_absolute_error(truth, predictions)
metrics_dict['MSE'] = metrics.mean_squared_error(truth, predictions)
metrics_dict['median_absolute_error'] = metrics.median_absolute_error(truth, predictions)
metrics_dict['loss'] = metrics_dict['MSE']
pd.DataFrame(metrics_dict, index=[0])

Unnamed: 0,R2,MAE,MSE,median_absolute_error,loss
0,0.609808,0.528453,0.515188,0.412521,0.515188


<div class="alert alert-block alert-success">
<h2>Save the computed metrics for easy display in DVC and DAGsHub</h2>
</div>

In [23]:
with open('metrics/metrics.json', 'w') as f:
    json.dump(metrics_dict, f, indent=2)
!dvc commit -f eval.dvc

Output 'metrics/metrics.json' doesn't use cache. Skipping saving.
Saving information to 'eval.dvc'.
[0m

<div class="alert alert-block alert-success">
<h2>Versioning section - use the following cells to create a full commit of your current state</h2>
</div>

### Make sure all data and models are committed to DVC
The output of the following cell should be: `Pipeline is up to date. Nothing to reproduce.`

If you get something else, then maybe you forgot to `dvc commit` earlier in the notebook.
We recommend to make sure that the current contents in the data and models directories are to your liking,
and if so, use the commit cell below to automatically commit all current files to DVC.

In [24]:
!dvc status

Pipeline is up to date. Nothing to reproduce.
[0m

In [427]:
# Use this if dvc status is not up-to-date and you're sure the current state is OK.
!dvc commit -f

Saving 'models' to cache '.dvc/cache'.
Linking directory 'models'.
Saving information to 'models.dvc'.
Saving 'data/processed' to cache '.dvc/cache'.
Linking directory 'data/processed'.
Saving information to 'process_data.dvc'.
Saving information to 'eval.dvc'.
Saving 'data/raw' to cache '.dvc/cache'.
Linking directory 'data/raw'.
Saving information to 'data/raw.dvc'.
[0m