Writing a Tutorial

Shahin Saadati edited this page May 20, 2022 · 11 revisions

In this repository, a tutorial is a Jupyter notebook written for a dataset, either to analyze it in depth or to apply an ML technique to it. Some possibilities for tutorials are as follows:

  • A walkthrough of a dataset: go over its main features and use data visualization to tell a story about it
  • Correlation/causation analysis
  • Time series analysis
  • Supervised learning, such as classification, regression, or forecasting
  • Unsupervised learning, such as clustering

Choose a Dataset

Make sure the dataset you choose is tabular and has been onboarded by our team. A directory for that dataset should be available here.

Write your Tutorial

Your tutorial can be anything you want, as long as it shows something interesting about the dataset.

Colab Development

You may start by downloading a copy of the template and uploading it to Colab.

Development Tips

  • While Colab offers many cool macros and shortcuts, we ask you not to use them, since these tutorials should also be runnable in Workbench, and locally.
  • To help the reader follow your tutorial more easily, add enough description in markdown cells before your code cells.
  • We encourage you to submit your notebook for code review via a Pull Request on GitHub. This helps us keep track of your progress and gives you access to the history of your work later on.
  • When your code is done, download the notebook to your local machine and run it there to make sure it still runs without any issues.

Providing Metadata

Each tutorial requires some metadata, which should be stored in an artifact.yaml file. Here is an example of what the file should look like:

artifact:
  title: The title of your tutorial
  description: A brief description of what the tutorial is about.
  tags:
    - libraries:sklearn,matplotlib
    - ml:classification
    - vertical:government
    - tier:free

The vertical tag is one of healthcare, environment, finance, information, education, retail, government, and manufacturing. The tier tag is either free or paid; it is paid only when the tutorial requires GCP services such as Vertex AI.
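As a rough illustration of these constraints, the sketch below checks the vertical and tier tags of metadata that has already been parsed into a list of strings. The helper name check_tags is hypothetical and not part of the repository's tooling, and reading the YAML file itself is left out.

```python
# Hypothetical helper for illustration; not part of the repository's tooling.
ALLOWED_VERTICALS = {
    "healthcare", "environment", "finance", "information",
    "education", "retail", "government", "manufacturing",
}
ALLOWED_TIERS = {"free", "paid"}


def check_tags(tags):
    """Return a list of problems found in a list of 'key:value' tag strings."""
    problems = []
    # Split on the first colon only, so values such as
    # 'sklearn,matplotlib' are kept intact.
    parsed = dict(tag.split(":", 1) for tag in tags)
    vertical = parsed.get("vertical")
    if vertical not in ALLOWED_VERTICALS:
        problems.append(f"unknown vertical: {vertical!r}")
    tier = parsed.get("tier")
    if tier not in ALLOWED_TIERS:
        problems.append(f"unknown tier: {tier!r}")
    return problems


# Tags copied from the example artifact.yaml above.
example_tags = [
    "libraries:sklearn,matplotlib",
    "ml:classification",
    "vertical:government",
    "tier:free",
]
print(check_tags(example_tags))  # prints []
```

A missing tag is reported the same way as an invalid one, since its value is then None.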

Formatting the Code

We use black to format the Python code and flake8 to lint it. The following commands should be helpful for this purpose:

# Running black on Python files:
poetry run black .

# Running flake8 on Python files:
poetry run flake8 .

# Running flake8 on Jupyter Notebook files:
poetry run nbqa flake8 . 

Testing

Each notebook tutorial requires a test file. We use testbook to write our unit tests. Let's assume we have a simple notebook called my_notebook.ipynb with the following four cells:

# Cell 1
import pandas as pd
from google.cloud import bigquery
# Cell 2
QUERY = 'SELECT * FROM table LIMIT 1000'
# Cell 3
bqclient = bigquery.Client(project='my_project')
dataframe = bqclient.query(QUERY).result().to_dataframe()
# Cell 4
var = 3 + 4

In our test, we want to mock the bigquery client and avoid making a real request. The trick is to inject a cell before the one that calls bigquery.Client and patch the client there. Here is how to do it using testbook:

from testbook import testbook

@testbook('./my_notebook.ipynb')
def test_get_details(tb):
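    tb.inject(
        """
        from unittest import mock

        # Fake query result: 10 rows, 2 columns
        mock_df = pd.DataFrame()
        mock_df['week'] = range(10)
        mock_df['count'] = 5

        # Make bigquery.Client(...) return a mock whose query result is mock_df
        mock_client = mock.MagicMock()
        mock_client.query().result().to_dataframe.return_value = mock_df
        p1 = mock.patch.object(bigquery, 'Client', return_value=mock_client)
        p1.start()
        """,
        before=2,  # zero-based index 2 is the cell that calls bigquery.Client
        run=False  # only insert the cell; tb.execute() runs it with the rest
    )
    tb.execute()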
    dataframe = tb.get('dataframe')
    assert dataframe.shape == (10, 2)

    var = tb.get('var')
    assert var == 7

A full example can be found here.
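The mocking pattern used in the injected cell can also be tried outside a notebook. The sketch below uses only the standard library; fake_bigquery and fake_result are hypothetical stand-ins for the real bigquery module and the real DataFrame, introduced here purely for illustration.

```python
import types
from unittest import mock

# Hypothetical stand-ins for the real module and the real DataFrame.
fake_bigquery = types.SimpleNamespace(Client=object)
fake_result = {"week": list(range(10)), "count": [5] * 10}

mock_client = mock.MagicMock()
# Calling a MagicMock returns its return_value no matter what arguments
# are passed, so this configures the end of the chain
# query(...).result().to_dataframe().
mock_client.query().result().to_dataframe.return_value = fake_result

# Replace Client with a mock that hands back mock_client when called.
patcher = mock.patch.object(fake_bigquery, "Client", return_value=mock_client)
patcher.start()
try:
    client = fake_bigquery.Client(project="my_project")
    result = client.query("SELECT * FROM table LIMIT 1000").result().to_dataframe()
finally:
    # Restore the original attribute once the patched code has run.
    patcher.stop()

print(result is fake_result)  # prints True
```

In the notebook test the patch is started but never stopped, which is acceptable there because the kernel is discarded after the test finishes.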

In order to run the test file and see if your tests pass, you can use Poetry. Run the following command from the root directory of the repository:

poetry run python -m pytest -v datasets/DATASET_NAME/docs/tutorials/TUTORIAL_DIRECTORY/NOTEBOOK_test.py

[Optional] Detailed Description

If you want, you may provide an overview.md file with more details about your tutorial and place it alongside the tutorial's other files.

Submit the Code

Your code needs to be submitted for review via a Pull Request. Here is a guideline that shows how to do it. We encourage you to submit your tutorial for review on GitHub frequently, so that you get incremental feedback from the reviewers.

Example

Let's assume you wrote a tutorial for the austin_bikeshare dataset that trains a model to predict the duration of a trip, and saved it in a file called bike_trip_predict.ipynb.

For each dataset, every tutorial should be placed in its own new subdirectory under .../docs/tutorials. For this example, create a subdirectory called bike_trip_prediction and place all your new files inside it. So, the final tree structure should look something like this:

datasets
└── austin_bikeshare
    └── docs
        └── tutorials
            └── bike_trip_prediction
                ├── bike_trip_predict.ipynb
                ├── bike_trip_predict_test.py
                └── artifact.yaml
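Before opening a Pull Request, it may help to confirm that the tutorial directory contains the three files shown above. The sketch below is one way to do that; the helper missing_tutorial_files is hypothetical and not part of the repository, and the layout is rebuilt in a temporary directory so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def missing_tutorial_files(tutorial_dir, notebook_stem):
    """Return the expected file names that are absent from tutorial_dir."""
    expected = [
        f"{notebook_stem}.ipynb",
        f"{notebook_stem}_test.py",
        "artifact.yaml",
    ]
    return [name for name in expected
            if not (Path(tutorial_dir) / name).is_file()]


# Rebuild the example layout in a temporary directory,
# deliberately leaving out the test file.
with tempfile.TemporaryDirectory() as root:
    tutorial_dir = Path(root, "datasets", "austin_bikeshare", "docs",
                        "tutorials", "bike_trip_prediction")
    tutorial_dir.mkdir(parents=True)
    (tutorial_dir / "bike_trip_predict.ipynb").touch()
    (tutorial_dir / "artifact.yaml").touch()
    missing = missing_tutorial_files(tutorial_dir, "bike_trip_predict")

print(missing)  # prints ['bike_trip_predict_test.py']
```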