In [2]:
from dsc.notebook import embed_website

# Agenda
- In this chapter we discuss how to set up your Python project after you have created a Git repo and a virtual environment
- We talk about how you can rapidly develop prototypes and conduct analyses in Jupyter notebooks and when you should modularize your code into modules or a package.
- In particular, we will use [PyScaffold](https://github.com/pyscaffold) in combination with the [data science template](https://github.com/drivendata/cookiecutter-data-science/) of [Cookiecutter](https://www.cookiecutter.io/) so that
    - You obtain a sound folder structure for your project so that you know where to put
        - Data
        - Notebooks
        - Modules
        - Reports
        - ...  
    - You can easily pip-install your project package in editable mode

# Motivation

- Jupyter notebooks are heavily used in the data science community (industry, science, Kaggle)
- Notebooks are wonderful for prototyping, exploration, analyses (including visualizations) and sharing results.
    - Combining markdown with source code in many programming languages
    - Separating code into cells
    - Plotting
- There are even companies that 
    - Use Jupyter notebooks in production systems, e.g., [Netflix](https://netflixtechblog.com/notebook-innovation-591ee3221233)
    - Provide [platforms](https://www.databricks.com/solutions/data-science) that productionize notebooks 
<!-- (but if you don't need Spark, Databricks might be too expensive)-->
- However, the use of Jupyter notebooks for code that is actually deployed and runs in production is rather the exception.  

<div align="center">
And there are good reasons for that!
</div>

<div>
<img src="./figures/all_code_in_notebook_meme.png" alt="Notebook meme" width=1000/>
</div>

The (in)famous talk ["I don't like notebooks"](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/) given by Joel Grus at JupyterCon 2018 highlights possible drawbacks of Jupyter notebooks for coding.

In [14]:
# embed_website("https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/")  # activating this might lag the presentation

- In the talk the following **drawbacks of Jupyter notebooks** are mentioned:
    1. **Reproducibility and hidden states** are a concern (slides 23, 24, 31, 42, 73-74)
        - Re-running cells or running them in a different order gives you much flexibility for exploring
        - But it can be difficult to reproduce or understand results
    1. Inspiring bad coding habits and **discouraging good coding habits** (slides 45-59, 76-90)
        - No functions, abstractions or OOP, no writing of modules/packages
        - Mixing "library" and "execution" code
        - Code is often not DRY and reusable
        - No testing
        - No type annotations
    1. **Powerful features of IDEs are missing** (slides 60-70; by now this can be partly addressed by opening a notebook directly in VS Code or PyCharm)
        - Autosuggestion
        - Automatic highlighting of possible errors by linters
        - Code formatting using, e.g., black
        - Type annotation checkers
<!-- - The author also mentions in his notes that the Jupyter ecosystem consists of tools to work around people's bad habits so they don't have to fix them -->
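
The hidden-state problem from the first drawback can be illustrated with a small sketch. It does not use Jupyter itself: the hypothetical `cells` dictionary and `run` helper only simulate a kernel by executing cell sources against one shared namespace, which is an assumption made for illustration.

```python
# Simulated notebook cells (hypothetical content), keyed by their position.
cells = {
    1: "scale = 2",
    2: "data = [1, 2, 3]",
    3: "data = [x * scale for x in data]",  # mutates state from cell 2
    4: "total = sum(data)",
}


def run(order):
    """Execute the cells in the given order in one shared namespace."""
    namespace = {}
    for index in order:
        exec(cells[index], namespace)
    return namespace["total"]


# Running top to bottom once.
top_to_bottom = run([1, 2, 3, 4])

# Accidentally running cell 3 twice before cell 4.
reran_cell_3 = run([1, 2, 3, 3, 4])

print(top_to_bottom, reran_cell_3)  # prints "12 24"
```

The notebook on disk looks identical in both cases, yet the result depends on the execution history, which is exactly what makes such results hard to reproduce.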

- Drawbacks not mentioned in the talk:
    - **Code versioning is difficult** (by now, this can be addressed with Jupytext)
    - **Collaboration is difficult** (everybody working on the same notebook? merge conflicts?)
- See also [here](https://www.reddit.com/r/datascience/comments/nf47se/does_netflix_use_jupyter_notebooks_in_production/) and [here](https://www.youtube.com/watch?v=9Q6sLbz37gk)
for further discussions about using notebooks in production.

- Jupyter notebooks and their extensions are tools.
    - Many **projects try to extend Jupyter notebooks and make them ready for collaboration and production** (or e.g., writing books with them).
    - Like any tool, a notebook can be used for a purpose it was not designed for: ["If Your Only Tool Is a Hammer Then Every Problem Looks Like a Nail"](https://en.wikipedia.org/wiki/Law_of_the_instrument).
- Much "data science" education appears to be limited to showing beginners how to use notebooks but does not develop their programming skills or teach best practices.
- Instead of extending and productionizing notebooks, **I think it makes more sense to empower data scientists to build production-ready code the way it is typically done in other areas**.
- Especially when your project becomes more mature or complex, it is beneficial to move away from notebooks, modularize your
  code and factor it into a package/module.
- When the time comes, you'll be much closer to production-grade code and can more easily ensure the scalability, maintainability and resiliency that long-lived production code needs.
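
As a rough sketch of what factoring notebook code into a module can look like, the following shows a small typed function together with a unit test. The file locations in the comments and the names `center` and `test_center_shifts_mean_to_zero` are hypothetical and only chosen for illustration; the type hints assume Python 3.9 or newer.

```python
# Hypothetical module: src/my_project/preprocessing.py
from statistics import mean


def center(values: list[float]) -> list[float]:
    """Return the values shifted so that their mean is zero."""
    if not values:
        raise ValueError("values must not be empty")
    offset = mean(values)
    return [value - offset for value in values]


# Hypothetical test file: tests/test_preprocessing.py
def test_center_shifts_mean_to_zero() -> None:
    assert center([1.0, 2.0, 3.0]) == [-1.0, 0.0, 1.0]
```

A notebook would then only import and call `center`, so the "library" code lives in one place, can be checked by a type checker and is covered by a test that `pytest` can collect.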

# Using PyScaffold for setting up your Python project

## What is PyScaffold?
From the [website](https://pyscaffold.org/en/stable/):

- PyScaffold is a **project generator** for bootstrapping high-quality Python packages, ready to be shared on PyPI and installable via pip. 
- PyScaffold incentivizes its users to use the **best tools and practices** available in the Python ecosystem.

    - A generated project will contain **sane default configurations** for 
        - Setuptools (the de facto standard for building Python packages)
        - Sphinx (the one & only Python documentation tool)
        - Pytest and tox (most commonly used Python testing framework & task runner)
    - PyScaffold can also bring pre-commit into the mix to run a set of prolific linters and automatic formatters in each commit in order to adhere to common coding standards like pep8 and black.

"Using PyScaffold is like having a Python Packaging Guru, who has spent a lot of time researching how to create the best project setups, as a friend that is helping you with your project." &#128515;

- Moreover, there is a [PyScaffold extension tailored for Data Science projects](https://github.com/pyscaffold/pyscaffoldext-dsproject) &#128522;
- This extension creates a folder structure that is inspired by [cookiecutter-data-science](https://github.com/drivendata/cookiecutter-data-science). In contrast to cookiecutter-data-science alone, it
    1. Advocates a proper Python package structure that can be shipped and distributed.
    2. Uses a conda environment instead of something virtualenv-based and is thus more suitable for data science projects.
    3. Provides more default configurations for Sphinx, pytest, pre-commit, etc. to foster clean coding and best practices.

<!-- Cookiecutter is a tool that allows the definition of templates 
for a broad range of software projects. On the other hand, PyScaffold focus is on developing distributable Python packages (exclusively) in a simple way
-->

- In this course, we will only use a subset of the features that the standard PyScaffold (DataScience) provides:
    - A sound folder structure
    - Easily installing the project package in editable mode
    - Setting up documentation for a package
    - Setting up pre-commit hooks (will be discussed when we talk about collaboration)

## Installing PyScaffold and creating a project with PyScaffold

- `cd` into a location that is not inside a Git repo (e.g., "../dsc_2022").
- Run ```conda create -n pyscaffold_test -c conda-forge pyscaffoldext-dsproject```
- Activate the pyscaffold_test environment with ```conda activate pyscaffold_test```
- Run ```putup --dsproject pyscaffold_test``` to create a directory called pyscaffold_test which is set up by PyScaffold
- Note that PyScaffold automatically turns pyscaffold_test into a Git repo (please rename the initial branch, e.g., with ```git branch -m main```)

The resulting folder structure should look like this:

```
├── AUTHORS.md              <- List of developers and maintainers.
├── CHANGELOG.md            <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md         <- Guidelines for contributing to this project.
├── Dockerfile              <- Build a docker container with `docker build .`.
├── LICENSE.txt             <- License as chosen on the command-line.
├── README.md               <- The top-level README for developers.
├── configs                 <- Directory for configurations of model & application.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── environment.yml         <- The conda environment file for reproducibility.
├── models                  <- Trained and serialized models, model predictions,
│                              or model summaries.
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for
│                              ordering), the creator's initials and a description,
│                              e.g. `1.0-fw-initial-data-exploration`.
├── pyproject.toml          <- Build configuration. Don't change! Use `pip install -e .`
│                              to install for development or `tox -e build` to build.
├── references              <- Data dictionaries, manuals, and all other materials.
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated plots and figures for reports.
├── scripts                 <- Analysis and production scripts which import the
│                              actual PYTHON_PKG, e.g. train_model.
├── setup.cfg               <- Declarative configuration of your project.
├── setup.py                <- [DEPRECATED] Use `python setup.py develop` to install for
│                              development or `python setup.py bdist_wheel` to build.
├── src
│   └── PYTHON_PKG          <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `pytest`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.
```
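
With a fixed layout like this, code in `notebooks` or `scripts` can locate the data folders relative to the project root instead of hard-coding absolute paths. The following is only a minimal sketch: `find_project_root` is a hypothetical helper (not part of PyScaffold), and using `setup.cfg` as the marker file is an assumption based on the tree above. To stay self-contained, the example builds a miniature copy of the layout in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "setup.cfg") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker} found above {start}")


# Build a tiny stand-in for the generated project to demonstrate the helper.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "pyscaffold_test"
    (project / "notebooks").mkdir(parents=True)
    (project / "data" / "raw").mkdir(parents=True)
    (project / "setup.cfg").touch()

    # Pretend we are working inside the notebooks folder.
    root = find_project_root(project / "notebooks")
    raw_dir = root / "data" / "raw"

    found_root = root == project
    raw_exists = raw_dir.is_dir()
```

In a real notebook one would call `find_project_root(Path.cwd())` and build paths such as `root / "data" / "raw"` from the result.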