
Creating a Course

Ryan Holbrook edited this page Aug 10, 2021 · 15 revisions

Let's say we're going to create a new course on Kaggle Learn called Data Science.

Notebooks

  1. On the command line, navigate to the learntools/notebooks/ directory.
  2. Create a new branch off master with a name like ds-course. First check that there isn't already a branch with that name.
  3. Decide on a "track name" like data_science. This will be the name of the directory where your course files will exist. Check that there isn't already a directory with that name.
  4. There should be a Bash script called new_track.sh. Run ./new_track.sh data_science.
  5. Stage the new files: git add data_science.
  6. Commit the changes: git commit -m "Create track ds-course."
  7. Create a pull request on GitHub named [Data Science] New course.
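
The steps above can be strung together roughly as follows. This is a minimal sketch, assuming it is run from learntools/notebooks/ on an up-to-date checkout of master; the helper names track_dir_is_free and start_track are hypothetical and not part of learntools.

```shell
#!/bin/bash
# Sketch only: track_dir_is_free and start_track are hypothetical helpers.

# Succeeds if nothing named $1 exists in the current folder (step 3).
track_dir_is_free() {
    [ ! -e "$1" ]
}

# Runs steps 2-6 for a track directory ($1) and a branch name ($2).
# Assumes the current directory is learntools/notebooks/.
start_track() {
    local track="$1" branch="$2"
    if ! track_dir_is_free "$track"; then
        echo "directory $track already exists" >&2
        return 1
    fi
    # Refuse to reuse an existing local branch name (step 2).
    if git rev-parse --verify --quiet "refs/heads/$branch" >/dev/null; then
        echo "branch $branch already exists" >&2
        return 1
    fi
    git checkout -b "$branch" master &&
    ./new_track.sh "$track" &&
    git add "$track" &&
    git commit -m "Create track $branch."
}
```

Calling start_track data_science ds-course would leave only the pull request (step 7) to do on GitHub. The branch check only sees local branches; git ls-remote --heads origin ds-course is one way to look for a remote branch of the same name.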

Checking Code

  1. Navigate to the learntools root directory learntools/ (the directory containing setup.py). Do this from inside a Jupyter notebook, either with %cd or os.chdir. (A plain !cd runs in a subshell and does not change the notebook's working directory.)
  2. Uninstall the current version of learntools. Inside of a Jupyter notebook, run !pip uninstall -y learntools (the -y flag skips pip's confirmation prompt, which a notebook cell may not be able to answer).
  3. Install an editable version of learntools. Inside of a Jupyter notebook, run !pip install --editable . (note the period). Installing the local copy of learntools from inside Jupyter helps ensure the Python kernel can find the installation. Installing from the command line may target a different Python environment than the one the kernel uses, in which case the import can fail.
  4. Navigate to learntools/learntools.
  5. Create a directory for your course: mkdir data_science.
  6. Create an initialization file: touch data_science/__init__.py.
  7. Commit the changes.
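
Steps 4 to 6 amount to creating an empty Python package for the course. A minimal sketch, assuming the repository root is passed in explicitly; create_course_package is a hypothetical helper name, not something learntools provides.

```shell
# Hypothetical helper: create learntools/<track>/__init__.py under a repo root.
# $1 is the repository root (the directory containing setup.py), $2 the track name.
create_course_package() {
    pkg="$1/learntools/$2"
    mkdir -p "$pkg" &&
    touch "$pkg/__init__.py"
}
```

From the repository root, create_course_package . data_science produces the same files as the mkdir and touch commands in the steps above.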

Datasets

Create a folder to contain local copies of the course data: mkdir learntools/notebooks/input. This folder will just be for your own use while developing and won't be committed to the repository (it's in notebooks/.gitignore).

Create a folder for a course dataset: mkdir input/ds-course-data. Put all of the data you plan to use in here. If you develop your notebooks in the raw folder (notebooks/data_science/raw/), then you can access your datasets just like you would on Kaggle, like '../input/ds-course-data/data.csv'. (NB: This trick relies on using a relative path to the input folder. Unlike on Kaggle, absolute paths like /kaggle/input won't work.)

Now navigate to the dataset folder and zip up the datasets:

cd input/ds-course-data
zip -r ds-course-data.zip *

Create a dataset on Kaggle with a name that matches the folder you created, like: DS Course Data, and upload the zip file. Whenever you add files to your dataset, just repeat the process.
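
Because the zip step gets repeated every time files are added, it can help to wrap it so that a stale archive is never zipped into the new one. A minimal sketch, assuming the Info-ZIP zip command is installed; zip_dataset is a hypothetical helper name.

```shell
# Hypothetical helper: rebuild <folder>/<folder-name>.zip from the folder's files.
# The body runs in a subshell, so the caller's working directory is unchanged.
zip_dataset() (
    dir="${1%/}"                 # e.g. input/ds-course-data
    name=$(basename "$dir")      # e.g. ds-course-data
    cd "$dir" || exit 1
    rm -f "$name.zip"            # drop the archive left by a previous run
    zip -r "$name.zip" *
)
```

Run from learntools/notebooks/, zip_dataset input/ds-course-data leaves input/ds-course-data/ds-course-data.zip ready to upload.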

Jenkins

Add track name 'data_science' to TRACKS and TESTABLE_NOTEBOOK_TRACKS in learntools/notebooks/test.sh.
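
A quick way to confirm the edit took effect is to search test.sh for the track name. The sketch below assumes both variables are single-line, space-separated assignments such as TRACKS="python data_science"; that layout is an assumption about test.sh, and track_listed is a hypothetical helper.

```shell
# Hypothetical check: succeed if track $2 appears on the line assigning
# variable $3 in script $1. Assumes single-line assignments like NAME="a b c".
track_listed() {
    grep -E "^$3=" "$1" | grep -qw "$2"
}
```

For example, track_listed test.sh data_science TRACKS && track_listed test.sh data_science TESTABLE_NOTEBOOK_TRACKS should succeed once both lists have been updated.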

Create a new file setup_data.sh in learntools/notebooks/data_science/:

#!/bin/bash
# Download the datasets used in the course notebooks to the correct relative paths (../input/...)

mkdir -p input

DATASETS="ryanholbrook/ds-course-data ryanholbrook/some-other-data"

for slug in $DATASETS
do
    # Keep only the dataset name, i.e. the part of the slug after the slash.
    name=$(echo "$slug" | cut -d '/' -f 2)
    dest="input/$name"
    mkdir -p "$dest"
    kaggle d download -p "$dest" --unzip "$slug"
done

COMPDATASETS="competition-name"

for comp in $COMPDATASETS
do
    dest="input/$comp"
    mkdir -p "$dest"
    kaggle competitions download "$comp" -p "$dest"
    cd "$dest"
    unzip "${comp}.zip"
    chmod 700 *.csv
    # Also copy the CSV files up into input/ itself.
    cp *.csv ..
    cd ../..
done

You'll need to keep this list of datasets in DATASETS up-to-date with those you use in your course (that is, those defined in track_meta.py). For competition datasets, user @dansbecker needs to accept the competition rules for Jenkins to be able to access the dataset. (Check this in case of a 403 - Forbidden error.)
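
When adding a slug to DATASETS, it helps to know which folder it will land in. The first loop of the script keeps only the part of the slug after the slash, which can be sketched with shell parameter expansion instead of cut; slug_to_dest is a hypothetical name used only for illustration.

```shell
# Map a Kaggle dataset slug (owner/name) to its local folder, as the loop does.
slug_to_dest() {
    # ${1#*/} strips everything up to and including the first slash.
    echo "input/${1#*/}"
}

slug_to_dest ryanholbrook/ds-course-data   # prints input/ds-course-data
```

A notebook can then read the data from a path ending in input/ds-course-data/, matching the folder name used during local development.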

After you've saved the file, at the command prompt run:

chmod a+x setup_data.sh

This will make the file executable by Jenkins.
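
To double-check the result before committing, a small test like the following can be used. It is a sketch only; is_runnable_script is a hypothetical helper that checks the executable bit and the shebang line shown in the script above.

```shell
# Hypothetical check: succeed only if the file is executable by the current
# user and its first line is the bash shebang.
is_runnable_script() {
    [ -x "$1" ] && [ "$(head -n 1 "$1")" = '#!/bin/bash' ]
}
```

Running is_runnable_script setup_data.sh && echo ok in learntools/notebooks/data_science/ should print ok after the chmod.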
