![IE](../img/ie.png)

# Sessions 2 & 3: Project layout and unit tests

### Juan Luis Cano Rodríguez <jcano@faculty.ie.edu> - Master in Business Analytics and Big Data (2019-04-03)

## Triangular workflows in git

When collaborating on a project hosted online on GitHub or GitLab, the most common setup is a central repository, one remote fork per user, and local clones/checkouts:

![Triangular workflow](https://github.blog/wp-content/uploads/2015/07/5dcdcae4-354a-11e5-9f82-915914fad4f7.png?resize=2000%2C951)

(Source: https://github.blog/2015-07-29-git-2-5-including-multiple-worktrees-and-triangular-workflows/)

Following this workflow requires discipline and sticking to a subset of actions and git commands to avoid common mistakes. This website contains all you need to know to set up your triangular workflow, so we don't need to reproduce it here:

https://www.asmeurer.com/git-workflow/

*Notice* the different naming conventions between this website and the first image:

1. **Convention 1**: `upstream/origin/local`
2. **Convention 2**: `origin/<username>/local`

We will be consistent with Aaron Meurer's guide and therefore use Convention 2 throughout.

### ⚠ After creating a pull request ⚠

After your pull request has been merged into `master`, your local `master` and `<username>/master` will be outdated with respect to `origin/master`. In addition, **you should not keep working on the branch that was just merged**: remember that branches should be ephemeral and short-lived.

To put yourself in a clean state again, you have to:

1. Click "remove branch" in the pull request (don't click "remove fork"!)
2. `git checkout master` to go back to `master`
3. `git fetch origin` (**never, ever use `git pull` unless you know exactly what you're doing**)
4. `git merge --ff-only origin/master` (this will update your local `master` to match `origin/master`, and fail if you accidentally made any commit on `master`)
5. `git fetch -p <username>` :star2: this will acknowledge the removal of the branch :star2:
6. `git push <username> master` (this will update your fork with respect to `origin`)
7. `git checkout -b new-branch` to start working on the new feature!

This process **has to be repeated after every pull request**.
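
Since the sequence is always the same, steps 2 to 7 above can be sketched as a small Python helper that builds the commands and, by default, only prints them for review. This is a minimal sketch, not part of the referenced guide: the names `cleanup_commands` and `run_cleanup`, their parameters and the `dry_run` flag are hypothetical, and the merge target is written as the remote-tracking ref `origin/master`. Step 1 still happens in the web interface.

```python
import subprocess


def cleanup_commands(username, new_branch):
    """Return the git commands for steps 2-7 as argument lists (hypothetical helper)."""
    return [
        ["git", "checkout", "master"],
        ["git", "fetch", "origin"],
        ["git", "merge", "--ff-only", "origin/master"],
        ["git", "fetch", "-p", username],
        ["git", "push", username, "master"],
        ["git", "checkout", "-b", new_branch],
    ]


def run_cleanup(username, new_branch, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cleanup_commands(username, new_branch):
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing step,
            # e.g. when the merge cannot be fast-forwarded
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: nothing is executed, the commands are only printed
    run_cleanup("your-username", "new-branch")
```

Keeping the dry run as the default is a deliberate choice here: the commands can be read and compared with the list above before anything touches the repository.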

🌈 

## Project layout

Most data science projects start as a bunch of notebooks. However, at some point we will want to reuse code between them, and eventually put our models in production without needing the notebooks themselves ([unless you are Netflix](https://medium.com/netflix-techblog/notebook-innovation-591ee3221233)). Choosing a good project layout is extremely important to organize the code, avoid common pitfalls and be predictable (i.e. imitate the rest of the ecosystem to minimize surprise). On the other hand, there is a lot (**a lot**) of outdated, bad or plain wrong advice on the Internet about this topic, so here we will present The Truth™.

### References

* Packaging a Python library https://blog.ionelmc.ro/2014/05/25/python-packaging/
* Less known packaging features and tricks https://blog.ionelmc.ro/presentations/packaging/
* setuptools documentation https://setuptools.readthedocs.io/en/stable/setuptools.html

### The `src` layout

```
├─ src
│  └─ packagename
│     ├─ __init__.py
│     └─ ...
├─ tests
│  └─ ...
├─ README.txt
├─ setup.py
└─ setup.cfg
```

* The `src/packagename` directory contains the source code of the library.
  - The `packagename` is what users type after `import` in a Python script, and therefore must be a valid Python identifier: letters, digits and underscores, no hyphens or other special characters.
  - It should contain an `__init__.py` that can be empty (more on that below).
  - The `src` segment prevents you from *shooting yourself in the foot*: without it, running `import packagename` from the project root while developing imports the code straight from the working directory instead of the installed package, which hides packaging mistakes. Always include it.

* The `tests` directory contains the tests. It **must not** contain an `__init__.py` because it's not meant to be imported as a package. In very specific cases the tests are placed inside `src/packagename` instead.

* Every project contains a `README.txt` that at least explains what the project is.

* `setup.py` can be an extremely simple file containing only this:

```
from setuptools import setup

setup()
```

(This requires `setuptools >= 30.3.0`, released 8 Dec 2016)

* `setup.cfg` contains the metadata of the project. The absolutely required fields are `name`, `version`, and `packages`, so you will need something like this:

```
[metadata]
name = my_package
version = 0.1.0

# Magic! Don't touch below this line
[options]
package_dir=
    =src
packages=find:

[options.packages.find]
where=src
```
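
As a quick sanity check, the three required fields can be verified with the standard library before attempting an install. The following is a minimal sketch, assuming `setup.cfg` is plain INI syntax readable by `configparser`; the helper `missing_fields` and the `REQUIRED` table are hypothetical names made up for this example, and this is not how `setuptools` itself validates the file.

```python
import configparser

# Required fields named in the text, grouped by their section (illustrative table)
REQUIRED = {"metadata": ["name", "version"], "options": ["packages"]}

SETUP_CFG = """\
[metadata]
name = my_package
version = 0.1.0

[options]
package_dir=
    =src
packages=find:

[options.packages.find]
where=src
"""


def missing_fields(text):
    """Return 'section.key' strings for required fields absent from the text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return [
        f"{section}.{key}"
        for section, keys in REQUIRED.items()
        for key in keys
        if not parser.has_option(section, key)
    ]


print(missing_fields(SETUP_CFG))                          # nothing missing
print(missing_fields("[metadata]\nname = my_package\n"))  # version and packages missing
```

To check a real project, the same function could be fed the contents of the file, for example `missing_fields(open("setup.cfg").read())`.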

The `name` is what users will have to type after `pip install` and therefore can contain hyphens. **Do not confuse this** with what users have to type on `import` (see above).
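
The difference between the two names can be checked mechanically. The sketch below uses hypothetical helper names: `is_importable_name` tests whether a string can follow `import`, and `normalized_project_name` applies the normalization rule described in PEP 503, under which hyphens, underscores and dots in project names are treated as equivalent by package indexes.

```python
import keyword
import re


def is_importable_name(name):
    """True if the string is a valid, non-keyword Python identifier."""
    return name.isidentifier() and not keyword.iskeyword(name)


def normalized_project_name(name):
    """Normalize a project name following the rule given in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(is_importable_name("test-package"))       # hyphen: fine for pip, not for import
print(is_importable_name("test_package"))       # underscore: fine for import
print(normalized_project_name("Test_Package"))  # how an index would spell the name
```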

With this layout, **you can `pip install` your code** in your Python environment:

```
$ pip install --editable .
$ python
>>> import packagename
>>>
```

This is an alternative to modifying the `PYTHONPATH` environment variable (see first session).
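
To see what the `src` segment protects against, the following sketch builds two throwaway projects in temporary directories, one flat and one with a `src` directory, and tries `import` from the project root without installing anything. The package name `demo_pkg_for_src_layout` and the helper `importable_from_root` are hypothetical, and the example assumes no package with that name is installed in the current environment.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

PKG = "demo_pkg_for_src_layout"  # hypothetical name, assumed not to be installed


def importable_from_root(package_parent):
    """Create a throwaway project and try `import PKG` from its root directory."""
    with tempfile.TemporaryDirectory() as root:
        pkg_dir = Path(root, package_parent, PKG)
        pkg_dir.mkdir(parents=True)
        (pkg_dir / "__init__.py").write_text("")
        # Drop variables that would change how sys.path is built in the child process
        env = {
            k: v for k, v in os.environ.items()
            if k not in ("PYTHONPATH", "PYTHONSAFEPATH")
        }
        result = subprocess.run(
            [sys.executable, "-c", f"import {PKG}"],
            cwd=root, env=env, capture_output=True,
        )
        return result.returncode == 0


# Flat layout: the import succeeds from the working directory, installed or not
print(importable_from_root("."))
# src layout: the import fails until the package is actually installed
print(importable_from_root("src"))
```

With the flat layout the import works even though nothing was installed, so a broken `setup.cfg` would go unnoticed; with the `src` layout the failure shows up immediately.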

### Exercise

1. Create a directory called `test-package`
2. `git init` inside it
3. Create a basic `src` layout in it, with `name = test-package` and the source in `src/test_package`
4. Create a `src/test_package/__init__.py` with a `print("Hello, world!")`
5. Install it in editable mode using `pip` and test that `>>> import test_package` prints `Hello, world!`
6. Include a `README.txt` and an appropriate `.gitignore` from http://gitignore.io/
7. Commit the changes
8. Create a new GitHub project and push the repository there

🎉