An example of a makefile-like workflow for downloading, unzipping and transforming data and training a model, based on the doit build tool and illustrated with the iceberg competition on Kaggle

What is this about?

It is a structured and extensible implementation of a typical data transformation workflow, built with the doit build tool and illustrated with the Statoil/C-CORE Iceberg Classifier Challenge on Kaggle.

Assume that you have the following workflow:

  1. Download archives with datasets from Kaggle website.

  2. Unzip them.

  3. Transform the data by converting it from JSON to NumPy format and adding a third layer to each image, computed as the mean of the two given layers (HH and HV bands).

  4. Train your model (CNN) and save its coefficients to a file.
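The transformation in step 3 can be sketched roughly as follows. This is only an illustration, not the repository's actual code: the function name add_mean_band is hypothetical, and it assumes that each band arrives as a flattened list of size * size values (75 x 75 images, as far as I recall for this competition).

```python
import numpy as np


def add_mean_band(band_1, band_2, size=75):
    """Stack HH, HV and their mean into an array of shape (n, size, size, 3).

    Assumption: every image band is a flat sequence of size * size numbers.
    """
    hh = np.asarray(band_1, dtype=np.float32).reshape(-1, size, size)
    hv = np.asarray(band_2, dtype=np.float32).reshape(-1, size, size)
    mean = (hh + hv) / 2.0  # the computed third layer
    return np.stack([hh, hv, mean], axis=-1)
```

The resulting array can then be written with np.save so that later steps do not have to parse the JSON again.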

Imagine that you do not want to push these huge files to your version control system and/or Docker registry. Your computer may have a much slower internet connection than the AWS instances where you are training your models, so downloading and transforming the data on the instance each time is much faster for you.

Imagine that you also want to cache the intermediate results of these steps so that re-running the pipeline does not repeat unnecessary actions.

But when something has actually changed (your code or the input data), the cached data should be invalidated and recomputed.

The straightforward solution is to implement these steps and the caching logic manually. The obvious disadvantages of this approach are the many conditional statements it needs and the complexity of detecting that the inputs of some step have changed and that the step requires recomputation.
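To make that drawback concrete, here is a minimal sketch of hand-rolled caching logic for a single step. The function names and the JSON state file are hypothetical; the sketch assumes that a step is stale when a target is missing or when the content hash of any dependency differs from the one recorded after the last successful run.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_rerun(deps, targets, state_file):
    """Return True if the step must be executed again."""
    if not all(Path(t).exists() for t in targets):
        return True  # a target is missing
    state_path = Path(state_file)
    if not state_path.exists():
        return True  # nothing was recorded yet
    recorded = json.loads(state_path.read_text())
    current = {str(d): file_digest(d) for d in deps}
    return recorded != current  # some dependency has changed


def record_run(deps, state_file):
    """Store the dependency digests after a successful run."""
    digests = {str(d): file_digest(d) for d in deps}
    Path(state_file).write_text(json.dumps(digests))
```

Every step of the pipeline would need its own call to needs_rerun and record_run, plus the ordering between steps, which is exactly the bookkeeping a build tool takes over.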

This kind of problem has historically been addressed by build automation tools such as Make. That is already much better, but Make is based on shell scripts, whereas the doit task management and automation tool also supports Python code in tasks and is easily extended with Python code.

What does the doit implementation look like?

Assuming that all dependencies are installed and you have specified your Kaggle credentials, just run the following shell command:


It looks up its configuration, finds 'default_tasks': ['train'] there and launches a task called train. Its definition looks like:

def task_train():
    return {
        'actions': [baseline_model.train],
        'file_dep': ['', 'data/train.npy']
    }

baseline_model.train is a Python function that reads data from data/train.npy and trains a CNN on it.

Note that it depends not only on the data but also on the source file, which is correct: if neither the data nor the training code has changed, retraining is not required, but if at least one of the inputs has changed, the model should be retrained as well.

data/train.npy in turn is configured as a target of a task called convert_train_to_numpy:

def task_convert_train_to_numpy():
    return {
        'actions': [baseline_model.convert_train_to_numpy],
        'file_dep': ['', 'data/train.json'],
        'targets': ['data/train.npy']
    }

By combining target and dependency files, doit is able to determine which task depends on which and which task targets are up to date. This is the default workflow, and it can easily be extended by implementing a custom Python function that checks whether a task is outdated.
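As a rough illustration of such a custom check, doit tasks accept an 'uptodate' list whose callables receive the task object and the values saved from earlier runs and return a boolean. The task name, the download_train placeholder and the data/train.zip path below are hypothetical and not taken from this repository.

```python
import os


def targets_exist_and_nonempty(task, values):
    """Custom up-to-date check: every target exists and is not empty."""
    return all(
        os.path.exists(target) and os.path.getsize(target) > 0
        for target in task.targets
    )


def download_train():
    """Hypothetical placeholder for the real download action."""
    print('downloading data/train.zip ...')


def task_download_train():
    return {
        'actions': [download_train],
        'targets': ['data/train.zip'],  # hypothetical archive name
        # doit calls each entry on every run; False marks the task as outdated
        'uptodate': [targets_exist_and_nonempty],
    }
```

A task like this has no file dependencies, so without the extra check doit would run it every time. For the common "run only once" case, doit.tools.run_once is a ready-made alternative.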


I used the Anaconda distribution, which already includes packages like numpy, pandas and scikit-learn.

This sample also uses Keras, TensorFlow, doit, progressbar and MechanicalSoup, all of which are easily installed with pip/conda, unless you want to install TensorFlow with GPU support, especially on Windows.

Kaggle credentials

Kaggle requires your login and password to download a dataset. Specify your credentials with environment variables KAGGLE_LOGIN and KAGGLE_PASSWORD or .credentials.ini in the following format (file has higher priority):


Be careful and never push this file to a public repository like GitHub (it is already added to .gitignore for convenience).
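The lookup order described above (the file takes priority over the environment variables) could be implemented along these lines. This is a sketch only: the function name, the [kaggle] section and the login/password option names are hypothetical and may not match the actual format of .credentials.ini.

```python
import configparser
import os


def load_credentials(ini_path='.credentials.ini', environ=os.environ):
    """Return (login, password); values from the ini file win over the environment."""
    login = environ.get('KAGGLE_LOGIN')
    password = environ.get('KAGGLE_PASSWORD')

    parser = configparser.ConfigParser()
    # read() returns the list of files it managed to parse (empty if missing)
    if parser.read(ini_path) and parser.has_section('kaggle'):
        section = parser['kaggle']  # hypothetical section name
        login = section.get('login', login)  # hypothetical option names
        password = section.get('password', password)

    if not login or not password:
        raise RuntimeError('Kaggle credentials are not configured')
    return login, password
```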


Thanks to Gerrit Gruben (@uberwach) for his idea of using makefiles for data transformation in Kaggle competitions. See his Kaggle competition project structure template.

Thanks to DeveshMaheshwari for his kernel with a basic CNN model.

Thanks to Kaggle-CLI contributors.


Feel free to contact me. You can also post a comment with Disqus at the corresponding blog post.
