# Helsinki Machine Learning Project Template

> Template for open source ML and analytics projects.

> NOTE: Once you begin your work, rewrite this notebook (`index.ipynb`) so that it describes your project, and regenerate the README by calling `nbdev_build_docs`.

## About

This is a git repository template for Python-based open source ML and analytics projects.

The core tools are Jupyter notebooks, nbdev and papermill.

Installing this template creates a new, independent repository with a clean commit history
and a copy of all the files and folders in this repository.
Note that updates to the template cannot be automatically pulled into projects that are built on top of it.

This template helps you develop, test, share and update ML applications collaboratively.
It defines the steps and tools of an ML project and helps you tie them together into an easy-to-use, explainable and reproducible workflow.


The template is completely open source and environment agnostic.
It contains examples and instructions to help you through the whole project.
However, we try to keep it light and easy to maintain, and thus the documentation is not exhaustive.
We have added references to original sources so that they are easy to find.

The template is developed and maintained by the data and analytics team of the city of Helsinki.
The template is published under the Apache-2.0 licence and open source utilization is encouraged!


## Contents

The core structure of the repository is the following:

    ## EDITABLE:
    data/               # Folder for storing data files. Ignored by git by default.
    |- raw_data/        # To store raw data files
    |- preprocessed_data/   # To store cleaned data
    results/            # Save results here. Ignored by git by default.
    |- notebooks/       # Save automatically executed notebooks here
    00_data.ipynb       # Extract, transform, load data here & define related functions.
    01_model.ipynb      # Create your ML model and test its code
    02_loss.ipynb       # Train and evaluate ML model, deploy or save for later use
    03_workflow.ipynb   # Define ML workflow and parameterization
    04_api.ipynb        # Define runtime API for using trained ML model
    project_requirements.in    # Add here the Python packages you want to install
    settings.ini        # Project specific settings. Build instructions for lib and docs.
    Dockerfile          # instructions for building docker image
    docker-compose.yml  # docker settings (fast.ai default)

    ## AUTOMATICALLY GENERATED: (Do not edit unless otherwise specified!)
    docs/               # Project documentation (html)
    [your_module]/      # Python module built from the notebooks. The name of the module is the name of the repository (after installation).
    README.md           # The front page of your repository
    requirements.txt    # automatically generated by pip-tools
    project_requirements.txt # automatically generated by pip-tools

    ## STATIC NON-EDITABLE: (Edit only if you know what you're doing!)
    base_requirements.in    # core tools that every project built based on the template always requires
    full_requirements.in    # core tools that every project requires at development stage
    LICENSE                 # license information
    MANIFEST.in             # metadata for building python distributable
    setup.py                # settings for the python module of your project. Generated by nbdev.
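
Because the contents of `data/` and `results/` are ignored by git, a fresh clone may not contain the subfolders listed above (this is an assumption; check your own clone). The following is a minimal sketch of how a notebook could create them on demand. The function name `ensure_project_dirs` is hypothetical and not part of the template.

```python
from pathlib import Path

# Folder names taken from the repository layout described above.
PROJECT_DIRS = ("data/raw_data", "data/preprocessed_data", "results/notebooks")


def ensure_project_dirs(root="."):
    """Create the data and results folders under `root` if they are missing.

    Returns the list of relative paths that were actually created.
    """
    created = []
    for relative in PROJECT_DIRS:
        path = Path(root) / relative
        if not path.exists():
            path.mkdir(parents=True)
            created.append(relative)
    return created
```

Calling `ensure_project_dirs()` from the repository root is safe to repeat: folders that already exist are left untouched.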
                                         

## How to Install

> Note: if you are doing a project on personal data for the City of Helsinki, contact the data and analytics team of the city before proceeding any further!

### 1. On your GitHub homepage:

0. Create a [GitHub account](https://github.com/) if you do not have one already.
1. Sign in to your GitHub account.
2. Go to https://github.com/City-of-Helsinki/ml_project_template and click the green button that says 'Use this template'.
3. Give your project a name. Use the underscore '_' instead of the dash '-', because the name of the repo will become the name of your Python module.
4. If you are creating a project for your organization, change the owner of the repo from the drop-down menu (it's you by default).
You need to be a member of your organization's GitHub account to do this.
5. Choose your project's visibility (you can change this later, but most likely you want to begin with a private repo).
6. Click 'Create repository from template'.

This will create a new repository for you copying everything from this template, but with clean commit history.
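
The naming rule in step 3 exists because a Python module name must be a valid identifier. As a rough check before creating the repository, you could test a candidate name like this. The helper `is_valid_module_name` is a hypothetical illustration, not part of the template.

```python
import keyword


def is_valid_module_name(repo_name):
    """Return True if `repo_name` can be imported as a Python module."""
    # A module name must be a valid identifier and must not be a reserved word.
    return repo_name.isidentifier() and not keyword.iskeyword(repo_name)


print(is_valid_module_name("my_ml_project"))  # True
print(is_valid_module_name("my-ml-project"))  # False: the dash is not allowed
```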

### 2. Setting up your development environment

#### 2.1 Alternative 1: Codespaces

If your organization has Codespaces enabled (requires Enterprise GitHub & Azure subscription), you are now ready to begin development. Just launch the repository in a codespace, and a dev container is automatically set up.

#### 2.2 Alternative 2: Local installation with Docker

#### 2.3 Alternative 3: Local installation manually

0. Create an SSH key and add it to your GitHub profile ([instructions](https://docs.github.com/en/authentication/connecting-to-github-with-ssh))
1. Configure your git user name and email address if you haven't done so already: `git config --global user.name "FIRST_NAME LAST_NAME" && git config --global user.email "your@email.com"`
2. Clone your new repository: `git clone git@github.com:[repository_owner]/[your_repository]`.
3. Go inside the repository folder: `cd [your_repository]`
4. Create and activate a virtual environment of your choice. Remember to set the Python version to 3.9! (Instructions: [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), [venv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/))
5. Install requirements: `pip install -r requirements.txt && nbdev_install_git_hooks`
6. Create an ipython kernel for running the notebooks: `python -m ipykernel install --user --name python39myenv`
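
Since step 4 asks for Python 3.9, it can help to confirm that the activated environment really runs that version before installing the requirements. A minimal sketch follows; the function name `is_required_python` is hypothetical.

```python
import sys

REQUIRED = (3, 9)  # the Python version named in step 4 above


def is_required_python(version_info=None, required=REQUIRED):
    """Return True if the major.minor part of `version_info` equals `required`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if not is_required_python():
    print(
        f"Warning: expected Python {REQUIRED[0]}.{REQUIRED[1]}, "
        f"found {sys.version_info.major}.{sys.version_info.minor}"
    )
```

Run it with the interpreter of the environment you just activated. It prints nothing when the version matches.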

### 3. Initializing your project

After setting up your local dev environment, complete the following steps:

1. Edit `LICENSE`, `Makefile`, `settings.ini`, `docs/_config.yml` and `docs/_data/topnav.yml` according to your project details. You can continue editing them in the future.
2. Remove the folder `ml_project_template`. A new folder with the name of your repository will be created automatically when calling `nbdev_build_lib`. This contains the Python module of your project.
3. Recreate the module and doc pages to clean them: `nbdev_build_lib && nbdev_build_docs`
4. Make the initial commit: `git add . && git commit -m "initialized repository from City-of-Helsinki/ml_project_template"`
5. Push the changes: `git push -u origin master`

You are now ready to begin your ML project development! Remember to track your changes with git!


## How to use

1. Install this template as basis of your new project (see above).

2. Remember to activate your virtual environment every time you begin work (containers will take care of this for you if you utilize them): `conda activate [environment name]` with anaconda or `source [environment name]/bin/activate` with virtualenv

3. Do your data science!

4. Save your notebooks and call `nbdev_build_lib` to build Python modules from your notebooks. This is needed if you want to share code between notebooks or create a module.
This will export all notebook cells with `# export` tag to corresponding .py files under the module (the folder inside your repository named after your repository).
Do this every time you make changes to any exportable parts of the code.

5. Save your notebooks and call `nbdev_build_docs` to create doc pages based on your notebooks (see below).
This will convert the notebooks into HTML files under `docs/` and update README based on the `index.ipynb`.
If you want to host your project pages on GitHub, you will have to make your project public and enable GitHub Pages in repo > Settings > Pages: set Source to `docs/`.
Alternatively you can build the pages locally with jekyll.
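
To illustrate step 4, here is a sketch of what an exportable notebook cell might look like. The function `drop_empty_rows` is a hypothetical example and is not part of the template. Only the cell that starts with the `# export` tag would be written to the module. The second cell, shown here in the same block for brevity, has no tag and stays in the notebook as a quick test.

```python
# export
def drop_empty_rows(rows):
    """Return only the rows that contain at least one non-empty value."""
    return [
        row for row in rows
        if any(value not in (None, "") for value in row)
    ]


# --- next notebook cell: no export tag, so it is not exported ---
sample = [[1, None], [None, ""], ["a", "b"]]
assert drop_empty_rows(sample) == [[1, None], ["a", "b"]]
```

After saving the notebook and calling `nbdev_build_lib`, other notebooks should be able to import the function from your project's module.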


## Keeping Project libraries up to date

Python has a rich ecosystem of libraries for machine learning and related tasks, such as Pandas, Matplotlib, SciPy and PyTorch.
If the base libraries in this template aren't sufficient, you can add more with `pip install library`.
However, the `pip` command only installs libraries into your local Python environment.
For consistent reproducibility, the requirements must also be recorded in the project repository.
Add new libraries to the `project_requirements.in` file. When you change this file, remember to run:

```bash
pip-compile --generate-hashes --allow-unsafe -o requirements.txt base_requirements.in full_requirements.in project_requirements.in
pip-compile --generate-hashes --allow-unsafe -o project_requirements.txt base_requirements.in project_requirements.in
```

These commands update the full requirements for development environments (`requirements.txt`) and the lighter, more focused requirements for server use (`project_requirements.txt`).

After the requirements are updated, you should run:

```bash
pip install -r requirements.txt
```

This way you and other users will have the same Python environment.

> NOTE: For convenience, `./update_requirements.sh` contains the three pip commands above for updating and installing the requirements.

Warning: if you don't update the package names and versions, the code might not work the next time you or anybody else tries to use this project in another environment. Worse, it might *seem to* work but do so incorrectly.
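
One way to notice such drift early is to compare the versions pinned in `requirements.txt` with what is actually installed. The sketch below is a simplified illustration, not part of the template. The helper names `parse_pins` and `find_mismatches` are hypothetical. The parser handles only plain `name==version` pins and compares versions as exact strings.

```python
import re
from importlib import metadata

# Matches lines such as "pandas==1.4.2 \" and ignores "--hash=..." lines and comments.
PIN_PATTERN = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)(?:\[[^\]]*\])?==([^\s\\;]+)")


def parse_pins(requirements_text):
    """Map package name to pinned version for every `name==version` line."""
    pins = {}
    for line in requirements_text.splitlines():
        match = PIN_PATTERN.match(line.strip())
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} for packages that differ or are missing.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

For example, `find_mismatches(parse_pins(open("requirements.txt").read()))` returns an empty dictionary when the environment matches the pins.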


## Ethical Aspects

Please consider the ethical aspects of your ML application!

For example:
* Can you recognize ethical issues with your ML project?
* Is there a risk of bias, discrimination, violation of privacy or conflict with local or international laws?
* Could your results or algorithms be misused for malicious acts?
* Can data or model updates introduce bias into your model?
* How have you tackled these issues in your implementation?

## How to Cite this Work (optional)

If you are doing a research project, you can add BibTeX and other citation templates here.
You can also get a DOI for your code by adding it to a code archive,
so that your code can be cited directly!

To cite this work, use:

    @misc{sten2022helsinki,
    title = {Helsinki Machine Learning Project Template},
    author = {Nuutti A Sten and Jussi Arpalahti},
    year = {2022},
    howpublished = {City of Helsinki. Available at: \url{https://github.com/City-of-Helsinki/ml_project_template}}
    }

## Contributing

See [CONTRIBUTING.md](https://github.com/City-of-Helsinki/ml_project_template/blob/master/CONTRIBUTING.md) on how to contribute to the development of this template.


## Copyright

Copyright 2022 City-of-Helsinki. Licensed under the Apache License, Version 2.0 (the "License");
you may not use this project's files except in compliance with the License.
A copy of the License is provided in the LICENSE file in this repository.

The Helsinki logo is a registered trademark and may only be used by the City of Helsinki.
> NOTE: If you are using this template for projects other than those of the City of Helsinki, remove the files `favicon.ico` and `company_logo.png` from `docs/assets/images/`.

    # to remove the Helsinki logo and favicon:
    git rm docs/assets/images/favicon.ico docs/assets/images/company_logo.png
    git commit -m "removed Helsinki logo and favicon"

This template was built using [nbdev](https://nbdev.fast.ai/) on top of the fast.ai [nbdev_template](https://github.com/fastai/nbdev_template).