# Helsinki Machine Learning Project Template

> Template for open source ML and analytics projects.

> NOTE: Once you begin your work, rewrite this notebook (index.ipynb) so that it describes your project, and regenerate README by calling `nbdev_build_docs`

## About

This is a git repository template for Python-based open source ML and analytics projects.

Installing this template creates a new,
independent repository with a clean commit history,
but with a copy of all the files and folders in this repository.
Updates to the template cannot be pulled into projects created from it.

This template is developed and maintained by the data and analytics team of the city of Helsinki.
The template is intended as the basis of ML projects of the city,
but is free for anyone to use under the Apache-2.0 licence.

This template helps you develop, test, share and update ML applications collaboratively.
It defines the steps and tools of an ML project and helps you tie them together into an easy-to-use, explainable and reproducible workflow.
You can just replace the examples with your own code, or start from scratch with clean notebooks.
Instructions for using the template are given below, so just keep reading!

ML requires team effort. When working with data, we have three kinds of people.
We have people who know code, people who know data, and people who know a problem to solve.
People who master two of these are rare, let alone all three.
Working with the template enables joint work of application field specialists
(e.g. healthcare, city planning, finance), researchers, data analysts, engineers & scientists, programmers and other stakeholders.
In essence, the template is a tool for teamwork: *a single person does not have to complete all of the steps defined, and most likely does not even know how to!*
With code, documentation and results in one place, each team member can understand what is going on and contribute their part.

The template is completely open source and environment agnostic.
It contains examples and instructions to help you through the whole project.
However, we try to keep it light and easy to maintain, and thus the documentation is not exhaustive.
We have added references to original sources so that they are easy to find.

If you see a tool or concept you are not familiar with, don't be scared.
Just follow along and you'll get started with ease, for sure.

If you have a question, the internet most probably already has the answer. Just search for it!
If you can't find the answer, you can post your question to the [discussion forum](https://github.com/City-of-Helsinki/ml_project_template/discussions),
so the maintainers can help. [Stack Overflow](https://stackoverflow.com/) is the recommended forum for questions that are not specific to this template.

To start working on your data project, all you need to do is install the template following the instructions below.


## Contents

The repository contains the following files and folders:

    ## EDITABLES:
    data/               # Folder for storing data files. Contents are ignored by git.
    results/            # Executed notebooks, figures, trained models etc. Ignored by git.
    notebook_templates/ # Notebook templates with usage instructions, but without code examples
    .gitignore          # Filetypes, files and folders to be ignored by git.
    00_data.ipynb       # Data loading, cleaning and preprocessing with examples
    01_model.ipynb      # ML Model scripting, class creation and testing with examples
    02_loss.ipynb       # ML Evaluation with examples
    03_workflow.ipynb   # ML workflow definition with examples
    requirements.txt    # Required python packages and versions
    CONTRIBUTING.md     # Instructions for contributing
    settings.ini        # Project specific settings. Build instructions for lib and docs.
    docker-compose.yml  # docker settings (fast.ai default), if you wish to containerize your project

    ## AUTOMATICALLY GENERATED: (Do not edit unless otherwise specified!)
    docs/               # Project documentation (html)
    [your_module]/      # Python module built from the notebooks. The name of the module is the name of the repository (after installation).
    Makefile
    README.md 

    ## STATIC NON-EDITABLES: (Do not edit unless otherwise specified or you really know what you're doing!)
    LICENSE
    MANIFEST.in   
    setup.py                                     

## Example Project

We wanted to make this template easy to approach.
That's why it is built around a demo: the notebooks `index`, `data`, `model`, `loss` and `workflow`.
They explain the required steps of an ML project.

The demo is an example ML project on automating heart disease diagnosis with logistic regression on the [UCI heart disease open dataset](https://archive.ics.uci.edu/ml/datasets/heart+disease).
The dataset contains missing values, and is thus great for demonstrating some light data wrangling.
The demo is only meant for showcasing how the template joins together different tools and steps.

**If you'd like to skip the demo**, and get right into action, you can replace the notebooks `index`, `data`, `model`, `loss` and `workflow` with clean copies under the folder `notebook_templates/`.

The `index` notebook (this notebook or the empty copy) will become the `README` of your project and the front page of your documentation, so edit it accordingly.
You should at least have a general description of the project,
instructions on how to install and use it,
and instructions for contributing.

## Installing the Template

> Note: if you are doing a project on personal or sensitive data for the City of Helsinki, contact the data and analytics team of the city before proceeding!

### 1. On your GitHub homepage:

0. Create a [GitHub account](https://github.com/) if you do not have one already.
1. Sign in to GitHub.
2. Go to page https://github.com/City-of-Helsinki/ml_project_template and click the green button that says 'Use this template'.
(**NOTE:** if you are already familiar with using the template, you can use the pre-cleaned version instead: https://github.com/City-of-Helsinki/ml_project_template_pre_cleaned.
If you do so, follow these instructions where applicable.)
3. Give your project a name. Do not use the dash symbol '-', but rather the underscore '_', because the name of the repo will become the name of your Python module.
4. If you are creating a project for your organization, change the owner of the repo from the drop-down menu (it's you by default).
You need to be a member of the organization on GitHub.
5. Choose the visibility of your project (you can change this later; in most cases you'll want to begin with a private repo).
6. Click 'Create repository from template'

This will create a new repository for you copying everything from this template, but with clean commit history.
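
The naming rule in step 3 follows from how Python imports work: a module name has to be a valid identifier. A minimal sketch for checking a candidate repository name before creating it is shown below. The function name and the example repository names are hypothetical and only serve as illustration.

```python
import keyword


def is_valid_module_name(name: str) -> bool:
    """Return True if `name` can be used in an `import` statement."""
    # A dash makes the name an invalid identifier; reserved words such as
    # `class` are valid identifiers but still cannot be imported.
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["heart_disease_demo", "heart-disease-demo"]:
    print(candidate, is_valid_module_name(candidate))
```

The name with underscores passes the check and the name with dashes does not.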

### 2. On your computing environment:

**Enter each highlighted** `command` **into your shell one at a time and press enter.**
**Replace the parts in square brackets with your own information and remove the brackets:** `'[replace this with your info]'`


0. Create an SSH key and add it to your GitHub profile. SSH is a protocol for secure communication over the internet.
    An SSH key is unique to a computing unit, and you must repeat this step every time you start using a new unit,
    be it a personal computer, server or a cloud computing instance. You can read more on SSH from [Wikipedia](https://fi.wikipedia.org/wiki/SSH) or
    from [GitHub docs](https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent).
    * Create SSH key with `ssh-keygen -t ed25519 -C "[your_email]"`
    * You can leave the file name empty to accept the default (just press enter), but **always create keys with a secure password that you remember**.
    This password cannot be reset. You have to create a new key if you forget it.
    * Now among other lines, there should be a text displayed saying `Your public key has been saved in [public_key_path]/id_ed25519.pub`.
    * Copy the public key path and call `cat [public_key_path]/id_ed25519.pub`
    * The public key is now displayed in your shell. It begins with 'ssh' and ends with your email.
    Select it in the shell output and copy it to the clipboard.
    * Go to your GitHub homepage > profile picture in top right corner > settings > SSH and GPG keys > new SSH key
    * Paste the public key to the key part, and give the key a name that describes the computing environment it belongs to.
    * If you permanently stop using this computing environment, remove the public key from your github profile.
    
1. In your shell, move to the folder you want to work in: `cd [path to your programming projects folder]`.
(If you get lost, `cd ..` moves you one folder towards root, and `cd` takes you to your home folder.)
2. Clone the repository you just imported: `git clone git@github.com:[repository_owner]/[your_repository]`.
You'll be asked for the password of the SSH key you just generated.
This copies all the files and folders of your new GitHub repository to your computing environment.
3. Go inside the repository folder: `cd [your_repository]`
4. Create a virtual environment. Virtual environments allow you to install project-specific Python versions and track dependencies.
Read more on virtual environments from [this blog post](https://realpython.com/python-virtual-environments-a-primer/).


    # Using conda (Azure ML only supports conda virtual environments):

    conda create --name [environment_name] python=3.8
    conda activate [environment_name] # every time you start working
    conda install pip

    conda deactivate # when you stop working

    # Using virtualenv (preferred way if not working in Azure):

    pip install virtualenv
    python3.8 -m virtualenv [environment_name]
    source [environment_name]/bin/activate # every time you start working
    
    deactivate # when you stop working

5. Install dependencies (versions of python packages that work well together):


    pip install -r requirements.txt # install required versions of python packages with pip
    nbdev_install_git_hooks # install nbdev git additions

6. Create an ipython kernel for running the notebooks. Good practice is to name the kernel with your virtual environment name.


    python -m ipykernel install --user --name [ipython_kernel_name] --display-name "Python 3.8 ([ipython_kernel_name])"

7. With your team, decide which notebook editor you are using. There are two common editors, Jupyter and JupyterLab, and both run the same notebooks.
Depending on the selection, you'll have to edit the top cell of each notebook, where the black formatter extension is activated for the notebook cells.
You can change this later, but it is convenient to only develop with one type of an editor.
Black is a code formatting tool used to unify code style regardless of who is writing it.
You may notice that the structure of your code changes a bit from what you have written after you run the cells of a notebook.
This is the formatter restructuring your code.
There are other formats and tools, and even more opinions on them, but black is used in the city of Helsinki projects.
So, after deciding which editor you are working with (Azure ML default notebook view is based on JupyterLab), edit the top cell of all notebooks:
    
    
    # if using Jupyter:
    %load_ext nb_black
    
    # if using JupyterLab:
    %load_ext lab_black

    # do not add comment to same line with a magic command:
    %load_ext nb_black #this comment breaks the magic command

8. Check that you can run the notebooks `00_data.ipynb`, `01_model.ipynb`, `02_loss.ipynb` and `03_workflow.ipynb`.
You may have to change the kernel your notebook interpreter is using to the one you just created.
This can be done from the drop-down menu at the top of the notebook editor. You can play around with the notebooks to better understand the structure and the examples.

9. If you ran or edited any of the notebooks while getting to know the template examples, run the following command to discard any irrelevant changes.

    git reset --hard

Please note that this will reset any changes made to the template. We do this to clean up any 'play-around' work you might have done while getting to know the template and the examples.
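
Note that `git reset --hard` discards changes to files git already tracks, but leaves untracked files (for example a scratch notebook you created) in place. A minimal sketch for listing both groups before resetting is shown below, assuming git is installed and the code is run inside the repository. The function names and the sample file names are hypothetical.

```python
import subprocess


def split_status(porcelain: str) -> dict:
    """Sort `git status --porcelain` lines into tracked changes and untracked files."""
    result = {"tracked_changes": [], "untracked": []}
    for line in porcelain.splitlines():
        if not line:
            continue
        # Each line is a two-character status code, a space, then the path.
        code, path = line[:2], line[3:]
        key = "untracked" if code == "??" else "tracked_changes"
        result[key].append(path)
    return result


def current_status() -> dict:
    """Run git in the current folder; must be called inside a repository."""
    completed = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True,
        text=True,
        check=True,
    )
    return split_status(completed.stdout)


# Parsing a hand-written sample instead of a live repository:
example = " M 00_data.ipynb\n?? scratch.ipynb\n"
print(split_status(example))
```

Files listed under `tracked_changes` would be reset by the command above. Files listed under `untracked` would stay and have to be deleted by hand if you do not want them.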

10. Replace the notebooks `index`, `data`, `model`, `loss` and `workflow` with copies without the code examples (there is also an additional empty notebook template `_XX_empty_notebook_template.ipynb` if you want to deviate from the basic template structure):


    git rm index.ipynb 00_data.ipynb 01_model.ipynb 02_loss.ipynb 03_workflow.ipynb
    git mv notebook_templates/_index.ipynb ./index.ipynb
    git mv notebook_templates/_00_data.ipynb ./00_data.ipynb
    git mv notebook_templates/_01_model.ipynb ./01_model.ipynb
    git mv notebook_templates/_02_loss.ipynb ./02_loss.ipynb
    git mv notebook_templates/_03_workflow.ipynb ./03_workflow.ipynb

11. You may delete the folders `ml_project_template`, `notebook_templates` and `visuals`.


    git rm -r ml_project_template notebook_templates visuals docs/visuals

12. Edit `settings.ini`, `docs/_config.yml` and `docs/_data/topnav.yml` according to your project details.
The files contain instructions for minimum required edits.
You can continue editing them in the future, so no need to worry about getting it right the first time.
These are used for building the python modules and docs based on your notebooks.
If you get errors when building a module or docs, take a look again at these files.

13. The Helsinki logo is a registered trademark, and may only be used by the city of Helsinki.
If you are using this template for other than city of Helsinki projects, remove the files `favicon.ico` and `company_logo.png` from `docs/assets/images/`.
You may replace these with your own logo. The fast.ai logo will be shown in the documentation if custom logos are not defined.

14. Recreate the module and doc pages to clean them: `nbdev_build_lib && nbdev_build_docs`

15. Configure your git user name and email address (one of those added to your GitHub account) if you haven't done it already:


    git config --global user.name "FIRST_NAME LAST_NAME"
    git config --global user.email "MY_NAME@example.com"

16. Make initial commit (snapshot of the code as it is when you begin the work):


    git add .
    git commit -m "Initial commit"

17. Push (save changes to remote repository): `git push -u origin master`. You will again be asked for the password of your SSH key.
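
Several of the steps above only work as intended when the virtual environment from step 4 is active. A minimal sketch for checking this from Python (for example in a notebook cell) is shown below. It assumes that conda sets the `CONDA_DEFAULT_ENV` environment variable on activation and that virtualenv changes `sys.prefix`. The function name is hypothetical.

```python
import os
import sys


def active_environment() -> str:
    """Describe which kind of Python environment is running this code."""
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    if conda_env:
        return f"conda: {conda_env}"
    # Inside a virtualenv, sys.prefix points to the environment folder while
    # sys.base_prefix points to the Python it was created from.
    # Older virtualenv releases set sys.real_prefix instead.
    if sys.prefix != sys.base_prefix or hasattr(sys, "real_prefix"):
        return f"virtualenv: {sys.prefix}"
    return "none (system Python)"


print(active_environment())
```

If the result is `none (system Python)`, activate the environment before installing packages or running the notebooks.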


## How to use

1. Install this template as basis of your new project (see above).

2. Remember to always activate your virtual environment before you start working: `conda activate [environment name]` with anaconda or `source [environment name]/bin/activate` with virtualenv

3. Explore, explain, visualize, test - do your thing!

4. Save your notebooks and call `nbdev_build_lib` to build Python modules from your notebooks. This is needed if you want to share code between notebooks or create modules.
This will export all notebook cells with the `# export` tag to the corresponding Python files.
Remember to do this if you want to rerun your workflow after making changes to exportables.

5. Save your notebooks and call `nbdev_build_docs` to create doc pages based on your notebooks (see below).
This will also update the README.md file based on `index.ipynb`.
If you want to host your project pages on GitHub, you will have to make your project public.
You can also build the pages locally with Jekyll.

6. You can install new packages with `pip install [package name]`.
Check out what packages are installed with the template from `requirements.txt`, or check if a specific package is installed with `pip show [package_name]`.
If you install new packages, remember to update the requirements for dependency management: `pip freeze > requirements.txt`.

7. Before you publish your project, edit LICENSE and the copyright information in `index.ipynb` according to your project details.
Please mind that some dependencies of your project might have more restrictive licenses than the Apache-2.0 this template is distributed under.

8. Remember to track your changes with git!
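
To illustrate the `# export` tag mentioned above, here is a minimal sketch of what an exportable notebook cell might look like. The function, its name and the `"?"` missing-value marker are hypothetical and not taken from the demo notebooks. The sketch also assumes that the notebook names its target module in an earlier cell, as nbdev expects.

```python
# export
def drop_incomplete_rows(rows, missing_marker="?"):
    """Return only the rows that contain no missing-value marker."""
    return [row for row in rows if missing_marker not in row]


# The lines below would live in a separate cell without the export tag,
# so they run in the notebook but are not copied to the Python module.
sample = [[63, 1, "?"], [41, 0, 2]]
assert drop_incomplete_rows(sample) == [[41, 0, 2]]
```

After calling `nbdev_build_lib`, the function would be importable from the generated module, while the check stays in the notebook as a lightweight test.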


## Keeping Project libraries up to date

Python has a rich ecosystem of libraries for machine learning tasks, among other things: Pandas, Matplotlib, SciPy and PyTorch, to name a few. If the base libraries in this template aren't sufficient, you can add more with `pip install [library]`. However, the `pip` command only installs libraries into your local Python environment. For consistent reproducibility, the requirements must also be recorded in the project repository. Add new libraries to the `project_requirements.in` file. When you change this file, remember to run:

```bash
pip-compile --generate-hashes --allow-unsafe -o requirements.txt base_requirements.in full_requirements.in project_requirements.in
pip-compile --generate-hashes --allow-unsafe -o project_requirements.txt base_requirements.in project_requirements.in
```

These two commands update the full requirements for development environments and the lighter, more focused requirements for server usage.

After requirements are updated you should run:

```bash
pip install -r requirements.txt
```

This way you and other users will have the same Python environment.

Warning: if you don't update package names and versions, the code might not work the next time you or anybody else tries to use this project in another environment. Worse, it might *seem to* work, but do so incorrectly.
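
One way to catch such drift early is to compare the pinned versions with what is actually installed. The sketch below is only an illustration: the function names and the sample package versions are hypothetical, only simple `name==version` pins are handled, and versions are compared as plain strings.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Collect `name==version` pins, ignoring comments, hashes and options."""
    pins = {}
    for raw in requirements_text.splitlines():
        # Drop comments and the line-continuation backslash used before hashes.
        line = raw.split("#", 1)[0].strip().rstrip("\\").strip()
        if not line or line.startswith("-") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        name = name.split("[", 1)[0].strip().lower()  # drop extras like pkg[extra]
        pins[name] = version.split(";", 1)[0].strip()  # drop environment markers
    return pins


def find_mismatches(pins: dict) -> dict:
    """Map package name to (pinned, installed) wherever the two differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Parsing a hand-written sample in the style produced by pip-compile:
sample = "pandas==1.2.4 \\\n    --hash=sha256:0123\n# a comment\nnbdev==1.1.14\n"
print(parse_pins(sample))
```

To check a real project, you could read `requirements.txt` with `pathlib.Path("requirements.txt").read_text()` and pass the result of `parse_pins` to `find_mismatches`. An empty dictionary would mean that every pinned package is installed in the pinned version.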


## Contributing

See [CONTRIBUTING.md](https://github.com/City-of-Helsinki/ml_project_template/blob/master/CONTRIBUTING.md) on how to contribute to the development of this template.


## Copyright

Copyright 2021 City-of-Helsinki. Licensed under the Apache License, Version 2.0 (the "License");
you may not use this project's files except in compliance with the License.
A copy of the License is provided in the LICENSE file in this repository.

The Helsinki logo is a registered trademark, and may only be used by the city of Helsinki.
> NOTE: If you are using this template for other than city of Helsinki projects, remove the files `favicon.ico` and `company_logo.png` from `docs/assets/images/`.

    # to remove the Helsinki logo and favicon:
    git rm docs/assets/images/favicon.ico docs/assets/images/company_logo.png
    git commit -m "removed Helsinki logo and favicon"

This template was built using [nbdev](https://nbdev.fast.ai/) on top of the fast.ai [nbdev_template](https://github.com/fastai/nbdev_template).

## Now you are all set up and ready to begin your ML project!