# Example of versioning ML experiments using DVC

This notebook aims to be a guideline for versioning your ML projects using DVC, from a Jupyter notebook.

This notebook allows you to experiment as much as you like, and when you are in a state that you would like to preserve for future reference as a git commit, use the DVC cells to version all your relevant files. 

The cells marked with a green markdown box are responsible for creating a snapshot of your raw data, processed data, and trained models.

This snapshot is implemented as md5 hashes of the respective files saved as text in the `.dvc` files. The hashes in the .dvc files will be part of the git commit.

In [3]:
!pip install --upgrade ruamel.yaml

Collecting ruamel.yaml
  Downloading ruamel.yaml-0.16.10-py2.py3-none-any.whl (111 kB)
[K     |████████████████████████████████| 111 kB 7.5 MB/s eta 0:00:01
Installing collected packages: ruamel.yaml
  Attempting uninstall: ruamel.yaml
    Found existing installation: ruamel.yaml 0.16.6
    Uninstalling ruamel.yaml-0.16.6:
      Successfully uninstalled ruamel.yaml-0.16.6
Successfully installed ruamel.yaml-0.16.10


In [8]:
!pip install -r requirements.txt



## Imports and global declarations

In [9]:
from sklearn import datasets
import sklearn
from sklearn import preprocessing
from sklearn.externals import joblib
from sklearn import metrics
from sklearn import model_selection
import numpy as np
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression
import json
import os

<div class="alert alert-block alert-success">
<h2>Download and version raw data</h2>
</div>

In [10]:
raw_data = datasets.fetch_california_housing(data_home="data/raw")
# Save the raw input data for reproducibility
!dvc commit -f data/raw.dvc

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to data/raw


[0m                                                                            

## Data preprocessing

In [11]:
def to_dataframe(X, y):
    return pd.concat([
            pd.DataFrame(data=X, columns=raw_data.feature_names),
            pd.DataFrame(data=y, columns=['Value'])
        ],
        axis=1)

In [12]:
raw_df = to_dataframe(raw_data.data, raw_data.target)
raw_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Value
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [13]:
raw_df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Value
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


### Test train split

In [14]:
train_X, test_X, train_y, test_y = model_selection.train_test_split(raw_df[raw_df.columns[:-1]], raw_df['Value'])

In [15]:
train_X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
4099,3.5714,8.0,3.105599,1.105599,2138.0,1.515237,34.14,-118.37
5990,5.8891,37.0,6.522642,1.026415,1344.0,2.535849,34.1,-117.73
253,2.6765,52.0,5.0,1.026846,473.0,3.174497,37.77,-122.21
29,1.6875,52.0,4.703226,1.032258,395.0,2.548387,37.84,-122.28
19988,3.4231,15.0,5.442509,0.958188,961.0,3.348432,36.2,-119.32


### Normalize feature columns by training data only

In [16]:
scaler = preprocessing.StandardScaler()
train_X_scaled = pd.DataFrame(scaler.fit_transform(train_X), index=train_X.index, columns=train_X.columns)
train_X_scaled.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
4099,-0.153975,-1.642304,-1.03954,0.027355,0.678076,-1.16153,-0.696645,0.597222
5990,1.067067,0.667823,0.501218,-0.153383,-0.063364,-0.336056,-0.715306,0.916197
253,-0.625438,1.862716,-0.185347,-0.1524,-0.876707,0.180484,0.99689,-1.316627
29,-1.146476,1.862716,-0.319164,-0.140046,-0.949543,-0.325915,1.029547,-1.351515
19988,-0.232104,-1.084687,0.014182,-0.309111,-0.421011,0.321164,0.264424,0.123744


In [17]:
test_X_scaled = pd.DataFrame(scaler.transform(test_X), index=test_X.index, columns=test_X.columns)
test_X_scaled.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
15118,-0.229207,0.189866,0.111352,-0.075341,-0.616176,0.245928,-1.303144,1.304947
10400,0.670308,-1.084687,1.82869,0.648606,-1.1942,0.003404,-0.985898,1.005909
6472,0.224502,0.906802,-0.083953,-0.147253,-0.276271,0.000836,-0.719972,0.751725
14036,-0.833642,0.030546,-0.885872,-0.229721,-0.382725,-0.962769,-1.345133,1.210252
5935,2.328095,-0.367751,1.375218,-0.025427,0.606173,0.067939,-0.691979,0.846421


In [18]:
train_df = pd.concat([train_X_scaled, train_y], axis=1)
test_df = pd.concat([test_X_scaled, test_y], axis=1)

<div class="alert alert-block alert-success">
<h2>Optional: Version the processed data with DVC for efficiency and/or reproducibility</h2>
</div>

In [19]:
os.makedirs('data/processed/')
train_df.to_csv('data/processed/california_households_train.csv', index_label='Index')
test_df.to_csv('data/processed/california_households_test.csv', index_label='Index')
joblib.dump(scaler, 'data/processed/california_households_scaler.pkl')
!dvc commit -f process_data.dvc

[0m                                                                            

### Use this cell to reload processed data, after switching branches

In [20]:
train_df = pd.read_csv('data/processed/california_households_train.csv', index_col=0)
test_df = pd.read_csv('data/processed/california_households_train.csv', index_col=0)
scaler = joblib.load('data/processed/california_households_scaler.pkl')

## Training

In [21]:
model = LinearRegression()
X = train_df[train_df.columns[:-1]]
y = train_df['Value']
model.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

<div class="alert alert-block alert-success">
<h2>Save the trained model for reproducibility</h2>
</div>

In [22]:
os.makedirs('models')
joblib.dump(model, 'models/california_households.pkl')
!dvc commit -f models.dvc

[0m                                                                            

### Use this cell to reload the model, after switching branches

In [23]:
model = joblib.load('models/california_households.pkl')

## Evaluate the model

In [24]:
predictions = model.predict(test_df[test_df.columns[:-1]])
truth = test_df['Value']
metrics_dict = {}
metrics_dict['R2'] = metrics.r2_score(truth, predictions)
metrics_dict['MAE'] = metrics.mean_absolute_error(truth, predictions)
metrics_dict['MSE'] = metrics.mean_squared_error(truth, predictions)
metrics_dict['median_absolute_error'] = metrics.median_absolute_error(truth, predictions)
metrics_dict['loss'] = metrics_dict['MSE']
pd.DataFrame(metrics_dict, index=[0])

Unnamed: 0,R2,MAE,MSE,median_absolute_error,loss
0,0.625872,0.511748,0.499559,0.392732,0.499559


<div class="alert alert-block alert-success">
<h2>Save the computed metrics for easy display in DVC and DAGsHub</h2>
</div>

In [25]:
with open('metrics/metrics.json', 'w') as f:
    json.dump(metrics_dict, f, indent=2)
!dvc commit -f eval.dvc

Output 'metrics/metrics.json' doesn't use cache. Skipping saving.       
[0m

<div class="alert alert-block alert-success">
<h2>Versioning section - use the following cells to create a full commit of your current state</h2>
</div>

### Make sure all data and models are committed to DVC
The output of the following cell should be: `Pipeline is up to date. Nothing to reproduce.`

If you get something else, then maybe you forgot to `dvc commit` earlier in the notebook.
We recommend to make sure that the current contents in the data and models directories are to your liking,
and if so, use the commit cell below to automatically commit all current files to DVC.

In [30]:
!dvc status

.ipynb_checkpoints/models-checkpoint.dvc:                               
	changed deps:
		deleted:            .ipynb_checkpoints/data/processed
	changed outs:
		deleted:            .ipynb_checkpoints/models
[0m

In [29]:
# Use this if dvc status is not up-to-date and you're sure the current state is OK.
!dvc commit -f

[31mERROR[39m: failed to commit - dependency '.ipynb_checkpoints/data/processed' does not exist

[33mHaving any troubles?[39m Hit us up at [34mhttps://dvc.org/support[39m, we are always happy to help!
[0m