Code used in a workshop on deploying machine learning models and MLOps for the November 2021 Cohort Training for the AIMLAC Doctoral Training Centre, UK.
The repo is an example of how to produce a reproducible pipeline for training a machine learning model for text categorisation. It focuses on the binary classification task of separating ham from spam text messages, using a Kaggle dataset of 5157 text messages.
The various stages of the presentation have been labelled with git tags, so you can easily jump to the stage of interest. These stages are described below with the commands used to create them (if applicable).
Populate `unpinned_requirements.txt` with the main dependencies required by the project. Additional test and development dependencies can be set in `requirements_test.txt` and `requirements_dev.txt` respectively.
Run:
make update-requirements-txt
...to populate `requirements.txt`, and:
make virtualenv
...to create a virtual environment.
To activate the virtual environment run:
source ./build/virtualenv/bin/activate
You can also add the following to your ~/.bashrc (or equivalent) to allow you to just run `activate`:
function activate {
source build/virtualenv/bin/activate
}
You'll then need to run `source ~/.bashrc` to make this function available in your current shell.
The next step is to get the data. This is available from Kaggle under a CC0 license, but you will need to have a Kaggle account and be signed in to access the download link.
The downloaded dataset should be:
name | size | rows |
---|---|---|
SPAM text message 20170820 - Data.csv | 485.7 kB | 5574 |
Download this dataset and put it in a new folder called ./data/raw/
mkdir -p ./data/raw
Then add `dvc[s3]` to the unpinned dependencies and update the virtualenv:
echo "dvc[s3]" >> unpinned_requirements.txt
make update-requirements-txt virtualenv
Next, add the raw data to dvc with:
dvc add data/raw/SPAM\ text\ message\ 20170820\ -\ Data.csv
This will add two new files, ./data/raw/.gitignore and ./data/raw/SPAM text message 20170820 - Data.csv.dvc, and will prompt us to add both to git, which we should do:
git add ./data/raw/.gitignore ./data/raw/SPAM text message 20170820 - Data.csv.dvc
# Commit following your own flow, but as an example:
git commit -m 'new: Added raw data to data/raw'
Optionally (though ideally) you can also set a dvc remote, for example an s3 bucket:
dvc remote add s3 s3://mantisnlp-blogs/deploying-ML
dvc remote default s3
dvc push
Add pandas and loguru to `unpinned_requirements.txt` and update dependencies:
echo "pandas\nloguru" >> unpinned_requirements.txt
make update-requirements-txt virtualenv
Create a new folder for source files and for processed data:
mkdir -p src data/processed
Add a script to drop duplicates and NA values from the raw data and save the result to data/processed/spam_data.csv. Add this file to a .gitignore so that git doesn't track it:
echo data/processed/spam_data.csv >> data/processed/.gitignore
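The cleaning script itself can be very small. Here is a minimal sketch of what it might look like (the filename `src/clean_data.py` and the exact column handling are assumptions, so adapt it to the real repo):

```python
"""Drop duplicates and NA values from the raw data (illustrative sketch)."""
from pathlib import Path

import pandas as pd
from loguru import logger

# Paths are assumptions for illustration.
RAW_PATH = Path("data/raw/SPAM text message 20170820 - Data.csv")
OUT_PATH = Path("data/processed/spam_data.csv")


def main():
    df = pd.read_csv(RAW_PATH)
    logger.info("Loaded {} rows from {}", len(df), RAW_PATH)

    # Drop exact duplicate rows and any rows with missing values.
    df = df.drop_duplicates().dropna()
    logger.info("{} rows remain after cleaning", len(df))

    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(OUT_PATH, index=False)


if __name__ == "__main__":
    main()
```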
The next step is to create a dvc pipeline, so that dvc can track each step of our workflow along with its inputs and outputs.
We do this by defining a `dvc.yaml` file, in which we describe our workflow, setting the dependencies and the outputs that we want dvc to track. In this way dvc is a bit like a Makefile: it builds a directed acyclic graph, which we can view with `dvc dag`:
+----------------------------------------------------+
| data/raw/SPAM text message 20170820 - Data.csv.dvc |
+----------------------------------------------------+
*
*
*
+------------+
| clean_data |
+------------+
In this case we have a very simple DAG containing just one dependency (the data) and our cleaning step. We reproduce our pipeline with:
dvc repro
We can check what has been produced with:
>>> ls -gh data/processed/
total 428K
-rw-rw-r-- 1 matthew 428K nov 22 19:33 spam_data.csv
dvc prompts us to add the dvc.lock file to git, which we should do if we are happy with the outcome of the pipeline run.
NOTE: In a real application we would probably also want a dev set; here we keep things simple for the example.
- We add a script to create the split (`src/process_data.py`); a sketch is shown below.
- We add a new stage (train_test_split) to the dvc pipeline defined in `dvc.yaml`.
- We define parameters for the train/test split script in `params.yaml`.
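A minimal sketch of what `src/process_data.py` could look like, reading `test_prop` from `params.yaml` (the output paths, the `Category` column name, the `seed` key, and the use of scikit-learn and PyYAML — which would need adding to the requirements — are all assumptions):

```python
"""Split the cleaned data into train and test sets (illustrative sketch)."""
from pathlib import Path

import pandas as pd
import yaml
from loguru import logger
from sklearn.model_selection import train_test_split

CLEAN_PATH = Path("data/processed/spam_data.csv")
TRAIN_PATH = Path("data/processed/train.csv")
TEST_PATH = Path("data/processed/test.csv")


def main():
    # params.yaml holds a train_test_split section with (at least) test_prop.
    with open("params.yaml") as f:
        params = yaml.safe_load(f)["train_test_split"]

    df = pd.read_csv(CLEAN_PATH)
    train, test = train_test_split(
        df,
        test_size=params["test_prop"],
        random_state=params.get("seed", 42),  # a seed key is an assumption
        stratify=df["Category"],              # column name is an assumption
    )
    logger.info("train: {} rows, test: {} rows", len(train), len(test))

    train.to_csv(TRAIN_PATH, index=False)
    test.to_csv(TEST_PATH, index=False)


if __name__ == "__main__":
    main()
```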
The DAG from `dvc dag` now looks like:
+----------------------------------------------------+
| data/raw/SPAM text message 20170820 - Data.csv.dvc |
+----------------------------------------------------+
*
*
*
+------------+
| clean_data |
+------------+
*
*
*
+------------------+
| train_test_split |
+------------------+
Here we add a script (src/train.py) to train the model and output its predictions to ./results.
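The choice of model is not the point of the workshop; purely as an illustration, a `src/train.py` along these lines would work (the TF-IDF + logistic regression model, the file paths, the column names, and persisting the model with joblib are all assumptions):

```python
"""Train a model and write predictions to ./results (illustrative sketch)."""
from pathlib import Path

import joblib
import pandas as pd
from loguru import logger
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

RESULTS_DIR = Path("results")


def main():
    train = pd.read_csv("data/processed/train.csv")
    test = pd.read_csv("data/processed/test.csv")

    # Illustrative model: TF-IDF features + logistic regression.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train["Message"], train["Category"])
    logger.info("Trained on {} examples", len(train))

    RESULTS_DIR.mkdir(exist_ok=True)
    # Persisting the model is an assumption; it lets us serve it later.
    joblib.dump(model, RESULTS_DIR / "model.pkl")

    # Write predictions for both splits so they can be evaluated downstream.
    for split, df in [("train", train), ("test", test)]:
        preds = pd.DataFrame(
            {"true": df["Category"].to_numpy(), "pred": model.predict(df["Message"])}
        )
        preds.to_csv(RESULTS_DIR / f"{split}_predictions.csv", index=False)


if __name__ == "__main__":
    main()
```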
The DAG from `dvc dag` now looks like:
+----------------------------------------------------+
| data/raw/SPAM text message 20170820 - Data.csv.dvc |
+----------------------------------------------------+
*
*
*
+------------+
| clean_data |
+------------+
*
*
*
+------------------+
| train_test_split |
+------------------+
*
*
*
+-------+
| train |
+-------+
We add a script (src/evaluate.py) to evaluate the model by comparing the true test labels with the predictions.
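A sketch of what `src/evaluate.py` might look like, using scikit-learn's `classification_report` to produce per-class and averaged metrics (the predictions file name is an assumption carried over from the training sketch above):

```python
"""Compare test predictions with the true labels and write metrics (illustrative sketch)."""
import json
from pathlib import Path

import pandas as pd
from loguru import logger
from sklearn.metrics import classification_report


def main():
    preds = pd.read_csv("results/test_predictions.csv")

    # output_dict=True gives accuracy plus per-class and averaged
    # precision/recall/f1-score/support in one nested dict.
    metrics = classification_report(preds["true"], preds["pred"], output_dict=True)

    out_path = Path("results/metrics.json")
    out_path.write_text(json.dumps(metrics, indent=2))
    logger.info("Wrote metrics to {}", out_path)


if __name__ == "__main__":
    main()
```

With this stage added to dvc.yaml, the DAG becomes: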
+----------------------------------------------------+
| data/raw/SPAM text message 20170820 - Data.csv.dvc |
+----------------------------------------------------+
*
*
*
+------------+
| clean_data |
+------------+
*
*
*
+------------------+
| train_test_split |
+------------------+
** **
** *
* **
+-------+ *
| train | **
+-------+ *
** **
** **
* *
+----------+
| evaluate |
+----------+
We create a file containing the results of the evaluation (`./results/metrics.json`), and since we define this as a metric output of the evaluate stage, we can check the results by running:
# To show all metrics (as a markdown table)
dvc metrics show --md
To show just metrics that have changed:
dvc metrics diff --md
Path | Metric | HEAD | workspace | Change |
---|---|---|---|---|
results/metrics.json | accuracy | - | 0.97576 | - |
results/metrics.json | ham.f1-score | - | 0.98643 | - |
results/metrics.json | ham.precision | - | 0.97426 | - |
results/metrics.json | ham.recall | - | 0.9989 | - |
results/metrics.json | ham.support | - | 1819 | - |
results/metrics.json | macro avg.f1-score | - | 0.93665 | - |
results/metrics.json | macro avg.precision | - | 0.98208 | - |
results/metrics.json | macro avg.recall | - | 0.90109 | - |
results/metrics.json | macro avg.support | - | 2063 | - |
results/metrics.json | spam.f1-score | - | 0.88688 | - |
results/metrics.json | spam.precision | - | 0.9899 | - |
results/metrics.json | spam.recall | - | 0.80328 | - |
results/metrics.json | spam.support | - | 244 | - |
results/metrics.json | weighted avg.f1-score | - | 0.97465 | - |
results/metrics.json | weighted avg.precision | - | 0.97611 | - |
results/metrics.json | weighted avg.recall | - | 0.97576 | - |
results/metrics.json | weighted avg.support | - | 2063 | - |
DVC allows us to create for loops within the pipeline. Imagine, for instance, that we want to collect metrics from predictions made on the training set to help us diagnose overfitting. We can do this by parameterising `src/evaluate.py`. We use the typer library to create a simple command line interface for the evaluate script. We then call this script from `dvc.yaml` as usual, but convert the stage into a foreach stage. These stages are very powerful because they allow us to iterate through multiple datasets, models, etc. in just a few lines of code.
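A sketch of the parameterised evaluate script (the argument name and the file naming conventions are assumptions):

```python
"""Evaluate predictions for a given data split (illustrative sketch)."""
import json
from pathlib import Path

import pandas as pd
import typer
from loguru import logger
from sklearn.metrics import classification_report


def evaluate(split: str = typer.Argument(..., help="Split to evaluate, e.g. train or test")):
    preds = pd.read_csv(f"results/{split}_predictions.csv")
    metrics = classification_report(preds["true"], preds["pred"], output_dict=True)

    out_path = Path(f"results/{split}_metrics.json")
    out_path.write_text(json.dumps(metrics, indent=2))
    logger.info("Wrote {} metrics to {}", split, out_path)


if __name__ == "__main__":
    typer.run(evaluate)
```

The foreach stage in dvc.yaml then calls this script once per split (for example `python src/evaluate.py train` and `python src/evaluate.py test`), producing `results/train_metrics.json` and `results/test_metrics.json`.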
The `dvc dag` output now looks like:
+----------------------------------------------------+
| data/raw/SPAM text message 20170820 - Data.csv.dvc |
+----------------------------------------------------+
*
*
*
+------------+
| clean_data |
+------------+
*
*
*
+------------------+
| train_test_split |
+------------------+*
**** * ****
**** * ****
**** * ****
*** +-------+ ***
* | train | ***
*** +-------+** *
* ** ** ***
*** ** ** *
* ** ** **
+----------------+ +---------------+
| evaluate@train | | evaluate@test |
+----------------+ +---------------+
And we can check the results:
>>> dvc metrics diff 6-evaluate-model main --md | grep -E "(Path|---|weighted)"
Path | Metric | 6-evaluate-model | main | Change |
---|---|---|---|---|
results/metrics.json | weighted avg.f1-score | 0.97465 | - | - |
results/metrics.json | weighted avg.precision | 0.97611 | - | - |
results/metrics.json | weighted avg.recall | 0.97576 | - | - |
results/metrics.json | weighted avg.support | 2063 | - | - |
results/test_metrics.json | weighted avg.f1-score | - | 0.97465 | - |
results/test_metrics.json | weighted avg.precision | - | 0.97611 | - |
results/test_metrics.json | weighted avg.recall | - | 0.97576 | - |
results/test_metrics.json | weighted avg.support | - | 2063 | - |
results/train_metrics.json | weighted avg.f1-score | - | 0.9974 | - |
results/train_metrics.json | weighted avg.precision | - | 0.99742 | - |
results/train_metrics.json | weighted avg.recall | - | 0.99741 | - |
results/train_metrics.json | weighted avg.support | - | 3094 | - |
Change a parameter in `params.yaml`, for example test_prop=0.2. We can then check the implications of this change by running:
>>> dvc status
train_test_split:
changed deps:
params.yaml:
modified: train_test_split
This indicates, as expected, that the train_test_split stage is affected by the parameter change, so if we run `dvc repro` we can expect this stage and all stages that depend on it to be reproduced.
>>> dvc repro
'data/raw/SPAM text message 20170820 - Data.csv.dvc' didn't change, skipping
Stage 'clean_data' didn't change, skipping
Running stage 'train_test_split':
Updating lock file 'dvc.lock'
Running stage 'train':
> ./build/virtualenv/bin/python src/train.py
Updating lock file 'dvc.lock'
Running stage 'evaluate':
> ./build/virtualenv/bin/python src/evaluate.py
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock
Use `dvc push` to send your updates to remote storage.
We can now check the metrics change with `dvc metrics diff` as before:
>>> dvc metrics diff 6-evaluate-model main --md | grep -E "(Path|---|weighted)"
Path | Metric | main | workspace | Change |
---|---|---|---|---|
results/train_metrics.json | weighted avg.f1-score | 0.9974 | 0.99732 | -8e-05 |
results/train_metrics.json | weighted avg.precision | 0.99742 | 0.99734 | -8e-05 |
results/train_metrics.json | weighted avg.recall | 0.99741 | 0.99733 | -8e-05 |
results/train_metrics.json | weighted avg.support | 3094 | 4126 | 1032 |
results/test_metrics.json | weighted avg.f1-score | 0.97465 | 0.9821 | 0.00745 |
results/test_metrics.json | weighted avg.precision | 0.97611 | 0.98238 | 0.00626 |
results/test_metrics.json | weighted avg.recall | 0.97576 | 0.98254 | 0.00678 |
results/test_metrics.json | weighted avg.support | 2063 | 1031 | -1032 |
Up until now, we have assumed that any change to a parameter that results in the pipeline being re-run would require a new commit to record it in our history. That is somewhat inconvenient, and fortunately dvc offers a solution that makes running experiments simpler.
If we want to queue up a lot of experiments and run them sequentially, or in parallel, we can use the `dvc exp` command.
For instance, the following commands will queue up five experiments that we will then execute in turn, each with a different train/test split.
dvc exp run --queue -S params.yaml:train_test_split.test_prop=0.1
dvc exp run --queue -S params.yaml:train_test_split.test_prop=0.2
dvc exp run --queue -S params.yaml:train_test_split.test_prop=0.3
dvc exp run --queue -S params.yaml:train_test_split.test_prop=0.4
dvc exp run --queue -S params.yaml:train_test_split.test_prop=0.5
We can check the list of experiments as a markdown table (we can also save to csv, or json) with:
dvc exp show --md --no-pager \
--include-metrics "weighted avg.f1-score" \
--include-metrics "weighted avg.support"
Experiment | Created | State | results/train_metrics.json:weighted avg.f1-score | results/test_metrics.json:weighted avg.f1-score | train_test_split.test_prop |
---|---|---|---|---|---|
workspace | - | - | 0.99732 | 0.9821 | 0.2 |
main | 04:15 PM | - | 0.99732 | 0.9821 | 0.2 |
├── 726fd28 | 04:16 PM | Queued | - | - | 0.5 |
├── 0ec1387 | 04:16 PM | Queued | - | - | 0.4 |
├── bb1c240 | 04:16 PM | Queued | - | - | 0.3 |
├── c6fd1fd | 04:16 PM | Queued | - | - | 0.2 |
└── 41a6d76 | 04:16 PM | Queued | - | - | 0.1 |
Once we are happy with our queue, we can execute all the experiments with:
# --jobs 10 will run up to 10 experiments in parallel
# Ensure you have activated your virtualenv before running
dvc exp run --run-all --jobs 10
If we check the results again with `dvc exp show`, the table will now be populated with results:
Experiment | Created | results/train_metrics.json:weighted avg.f1-score | results/test_metrics.json:weighted avg.f1-score | train_test_split.test_prop |
---|---|---|---|---|
workspace | - | 0.99732 | 0.9821 | 0.2 |
main | 04:15 PM | 0.99732 | 0.9821 | 0.2 |
├── 14055fb [exp-46088] | 04:17 PM | 0.9975 | 0.97786 | 0.3 |
├── 52386c7 [exp-d053f] | 04:17 PM | 0.99697 | 0.98616 | 0.1 |
├── 5003c73 [exp-ad17f] | 04:17 PM | 0.99766 | 0.97524 | 0.5 |
└── fe55962 [exp-597b7] | 04:17 PM | 0.9974 | 0.97465 | 0.4 |
Here we see that the experiment with a test proportion of 0.1 is the best performing. We can select it and apply it to the current workspace with:
dvc exp apply exp-d053f
Now we can continue to work as usual and commit the parameter changes to git.
We add a simple API using fastapi in ./src/api.py, which takes raw text as input and returns a class: either spam or ham.
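A minimal sketch of what such an api.py can look like (it assumes the training step persisted a fitted scikit-learn pipeline, e.g. with joblib, at the illustrative path `results/model.pkl`):

```python
"""Serve the spam/ham classifier over HTTP (illustrative sketch)."""
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Loading a persisted model is an assumption: it requires the training
# stage to have saved one (e.g. joblib.dump(model, "results/model.pkl")).
model = joblib.load("results/model.pkl")


class Message(BaseModel):
    text: str


@app.post("/predict")
def predict(message: Message):
    # The pipeline expects an iterable of documents and returns an array of labels.
    label = model.predict([message.text])[0]
    return {"result": str(label).lower()}
```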
To launch the API (in development mode), run `make serve`, or:
build/virtualenv/bin/uvicorn src.api:app --reload
In another terminal you can test the API by making POST requests to it with curl, either with `make test-api` or:
curl --header "Content-Type: application/json" --request POST \
--data '{"text":"You'\''ve WON a PRIZE, text back to find out what"}' \
localhost:8000/predict
You will get a response of `{"result":"spam"}`.
Finally, we can containerise the API for easier deployment by creating a Dockerfile. We use docker-compose.yaml to define how we want that container to be built and launched.
To build and launch the container:
docker-compose up --build
Note that we need to set both the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables in the local environment where we launch the container. These credentials should belong to an AWS user with read access to the s3 bucket we are using as the dvc remote.
We can test the container in the same way we tested the API locally, with `make test-api`.
To stop the container, run:
docker-compose down