![IE](../img/ie.png)

# Sessions 5 & 6: Project layout and unit tests

### Juan Luis Cano Rodríguez <jcano@faculty.ie.edu> - Master in Business Analytics and Big Data (2019-12-11)

## Project layout

Most data science projects will start with a bunch of notebooks. However, at some point we will want to reuse code between them, and eventually put our models in production without the need to use the notebooks themselves ([unless you are Netflix](https://medium.com/netflix-techblog/notebook-innovation-591ee3221233)). Choosing a good project layout is extremely important to organize the code, avoid common pitfalls and be predictable (i.e. imitate the rest of the ecosystem to minimize surprise). On the other hand, there is lots (**lots**) of outdated, bad or wrong advice on the Internet about this topic, so here we will present The Truth™.

### References

* Packaging a Python library https://blog.ionelmc.ro/2014/05/25/python-packaging/
* Less known packaging features and tricks https://blog.ionelmc.ro/presentations/packaging/
* setuptools documentation https://setuptools.readthedocs.io/en/stable/setuptools.html

### The `src` layout

```
package-name
├─ src
│  └─ package_name
│     ├─ __init__.py
│     └─ ...
├─ tests
│  └─ ...
├─ README.md
└─ pyproject.toml
```

* The `src/package_name` directory contains the source code of the library.
  - The `package_name` is what users type after `import` in a Python script, and therefore cannot contain special characters (only letters, numbers, and underscores).
  - It should contain an `__init__.py` that can be empty (more on that below).
  - The `src` segment prevents you from *shooting yourself in the foot*: without it, running `import package_name` from the project root while developing imports the code straight from the working directory instead of the installed copy, which can hide packaging mistakes. Always include it.

* The `tests` directory contains the tests. It **must not** contain any `__init__.py` because it's not meant to be imported as a package. In very specific cases it's included inside `src/package_name`.

* Every project contains a `README.md` that at least explains what the project is.

* `pyproject.toml` contains the metadata of the project. The absolutely required fields are `module`, `author`, and some extra information that tells Python how to install the package.
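
To see what the `src` directory buys us, here is a minimal sketch that builds two throwaway projects in a temporary directory and asks Python's import machinery what `import demo_pkg` would find when only the project root is on the search path (roughly the situation when you start `python` from that directory). The package name `demo_pkg` and the helpers `find_origin` and `make_project` are made up for this illustration.

```python
import importlib.machinery
import tempfile
from pathlib import Path


def find_origin(name, search_path):
    """Return the file `import name` would load from `search_path`, or None."""
    entries = [str(entry) for entry in search_path]
    spec = importlib.machinery.PathFinder.find_spec(name, entries)
    return spec.origin if spec else None


def make_project(root, layout):
    """Create a hypothetical project with a `demo_pkg` package, flat or under src/."""
    base = root / "src" if layout == "src" else root
    package_dir = base / "demo_pkg"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text('"""Demo package."""\n')
    return root


with tempfile.TemporaryDirectory() as tmp:
    flat_root = make_project(Path(tmp) / "flat-project", "flat")
    src_root = make_project(Path(tmp) / "src-project", "src")

    # Flat layout: the project root alone is enough to import the package,
    # so an uninstalled working copy can silently shadow the installed one.
    flat_origin = find_origin("demo_pkg", [flat_root])

    # src layout: the project root does not expose the package,
    # so the import only works once the package is properly installed.
    src_origin = find_origin("demo_pkg", [src_root])

print(flat_origin)  # a path ending in demo_pkg/__init__.py
print(src_origin)   # None
```

With the `src` layout, a forgotten `pip install` shows up immediately as an `ImportError` instead of going unnoticed.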

### Creating a package

1. Run `flit init` to create the metadata:

```
$ flit init
Module name: ie_nlp_utils
Author [Juan Luis Cano Rodríguez]: 
Author email [jcano@faculty.ie.edu]: 
Home page: 
Choose a license (see http://choosealicense.com/ for more info)
1. MIT - simple and permissive
2. Apache - explicitly grants patent rights
3. GPL - ensures that code based on this is shared with the same terms
4. Skip - choose a license later
Enter 1-4 [1]: 

Written pyproject.toml; edit that file to add optional extra info.
$ cat pyproject.toml 
[build-system]
requires = ["flit_core >=2,<3"]
build-backend = "flit_core.buildapi"

[tool.flit.metadata]
module = "ie_nlp_utils"
author = "Juan Luis Cano Rodríguez"
author-email = "jcano@faculty.ie.edu"
classifiers = ["License :: OSI Approved :: MIT License"]


```

2. Place some code under the source directory. In `__init__.py` there must be a docstring giving a description of the project and a `__version__` variable indicating the version:

```
$ mkdir src
$ mkdir src/ie_nlp_utils/
$ nano src/ie_nlp_utils/__init__.py  # ...
$ cat src/ie_nlp_utils/__init__.py 
"""IE NLP utils (test package)."""

__version__ = "0.1.0"

```

3. Install the code using `pip install`!

```
$ pip install .
$ python
>>> import ie_nlp_utils
>>> ie_nlp_utils.__version__
'0.1.0'

```

This is an alternative to modifying the `PYTHONPATH` environment variable (see previous session).

4. Add a `README.md` and a `.gitignore` file, for example copying the one from https://www.gitignore.io/api/python.
5. Commit the changes 🎉

### _Intermezzo:_ Version numbers

* Version numbers for Python packages are explained in [PEP 440](https://www.python.org/dev/peps/pep-0440/)
* For libraries, the most widely used convention is [semantic versioning](https://semver.org/): `X.Y.Z`
  - `Z` **must** be incremented if only backwards compatible bug fixes are introduced (a bug fix is defined as an internal change that fixes incorrect behavior)
  - `Y` **must** be incremented every time there is new, backwards-compatible functionality
  - `X` **must** be incremented every time there are backwards-incompatible changes
* Between releases, the version should have the `.dev0` suffix
* Recommendation: start with `0.1.dev0` (development version), then make a `0.1.0` release, then progress to `0.1.1` for quick fixes and `0.2.0` for new functionality, and when you want to make a promise of *relative* stability jump to `1.0.0`
* For applications, other conventions are more appropriate, like [calendar versioning](http://calver.org/): `[YY]YY.MM.??`
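
The semantic versioning rules above can be sketched as a small function. This is only an illustration: `bump` is a hypothetical helper, not part of flit or any packaging tool, and it assumes plain `X.Y.Z` versions without suffixes such as `.dev0`.

```python
def bump(version, part):
    """Return the next semantic version after bumping `part`.

    `part` is "major" (X), "minor" (Y) or "patch" (Z).
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Backwards-incompatible changes: reset Y and Z
        return f"{major + 1}.0.0"
    if part == "minor":
        # New backwards-compatible functionality: reset Z
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backwards-compatible bug fixes only
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown part: {part!r}")


print(bump("0.1.0", "patch"))  # 0.1.1
print(bump("0.1.1", "minor"))  # 0.2.0
print(bump("0.2.0", "major"))  # 1.0.0
```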

## Project requirements

Sometimes our project will depend on third-party libraries (pandas, scikit-learn). To make pip install those dependencies automatically, we can add them to our `pyproject.toml` under the `[tool.flit.metadata]` section, using the `requires` option:

```
[build-system]
requires = ["flit_core >=2,<3"]
build-backend = "flit_core.buildapi"

[tool.flit.metadata]
module = "ie_nlp_utils"
author = "Juan Luis Cano Rodríguez"
author-email = "jcano@faculty.ie.edu"
classifiers = ["License :: OSI Approved :: MIT License"]
requires = [
    "pandas",
    "matplotlib>=2",
]
```

On the other hand, we might want to specify _optional_ dependencies that should only be installed upon request, or for some specific purposes. A typical example is development dependencies: we will need things like pytest and black, but we don't want the user to install them as part of our library. To do that, we can specify *groups* of optional dependencies under the `tool.flit.metadata.requires-extra` section:

```
[tool.flit.metadata.requires-extra]
dev = [
    "black",
    "pytest",
]
```

That way, they will only get installed when `[dev]` is added after the name of our library:

```
$ pip install .[dev]
$ # pip install /path/to/library/[dev]  # Absolute instead of relative paths
$ # pip install library[dev]  # Libraries already available in pypi.org
```
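
To check which of these dependencies actually ended up in the current environment, one option is the standard library module `importlib.metadata` (available from Python 3.8). The sketch below is an illustration only: `installed_version` and `report` are hypothetical helper names, and the output depends on what is installed on your machine.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of `distribution`, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def report(distributions):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in distributions}


# Required dependencies plus the optional `dev` group from the example above
print(report(["pandas", "matplotlib", "black", "pytest"]))
```

After a plain `pip install .` the entries for `black` and `pytest` may well be `None`; after `pip install .[dev]` they should show a version.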

## Testing

Testing is **essential**. Many developers get along without testing their software, but as common wisdom says:

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">If you use software that lacks automated tests, you are the tests.</p>&mdash; Jenny Bryan (@JennyBryan) <a href="https://twitter.com/JennyBryan/status/1043307291909316609?ref_src=twsrc%5Etfw">September 22, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

Computers excel at doing repetitive tasks: they basically never make mistakes (the mistake might be in what we told the computer to do). Humans, on the other hand, fail more often, especially under pressure, or on Friday afternoons and Monday mornings. Therefore, instead of letting the humans be the tests, we will use the computer to **frequently verify that our software works as specified**.

### References

* pytest documentation https://docs.pytest.org

### Further reading

* Extreme Programming https://www.wikiwand.com/en/Extreme_programming
* Obey the Testing Goat! http://www.obeythetestinggoat.com/pages/book.html#toc
* (Shameless self-plug) Testing and validation approaches for scientific software https://nbviewer.jupyter.org/format/slides/github/poliastro/oscw2018-talk/blob/master/Talk.ipynb

### Test-Driven Development

> Make it work. Make it right. Make it fast.

Test-Driven Development shifts the focus of software development to writing tests. The "practice of test-first development, planning and writing tests before each micro-increment" is not new: it was in use at NASA in the early 1960s ([source](https://www.wikiwand.com/en/Extreme_programming)). In the 1990s, Extreme Programming took this concept to the extreme by the use of **small, automated** tests.

The "test-driven development mantra" is <span style="color: red">**Red**</span> - <span style="color: green">**Green**</span> - **Refactor**:

![The mantra](../img/red-green-refactor.png)

1. Write a test. <span style="color: red">**Watch it fail**</span>.
2. Write just enough code to <span style="color: green">**pass the test**</span>.
3. Improve the code without breaking the test.
4. Repeat.

### Testing in Python

Summary: **use pytest**. Everybody does. It rocks.

[pytest](https://docs.pytest.org/) is a testing framework for Python that makes writing tests extremely easy. It is much more powerful than the standard library equivalent, `unittest`. To use it, you need to install it first:

```
$ pip install pytest
```

The simplest test is **a function with an `assert`**. The `assert` statement raises an `AssertionError` if its condition is false, and does nothing otherwise. *It should only be used for testing*.

```python
assert True  # Does nothing

assert False  # Fails!
# AssertionError

assert 2 + 2 == 5, "Math is wrong"  # Fails with a message
# AssertionError: Math is wrong
```
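
One reason to keep `assert` out of production logic is that Python removes `assert` statements altogether when it runs with the `-O` flag. The following sketch shows the difference by running the same failing assertion in two fresh interpreters; `run_snippet` is a hypothetical helper written for this illustration, and the exit codes assume the `PYTHONOPTIMIZE` environment variable is not set.

```python
import subprocess
import sys


def run_snippet(code, optimize=False):
    """Run `code` in a fresh interpreter and return its exit status."""
    command = [sys.executable]
    if optimize:
        command.append("-O")  # Optimized mode: assert statements are stripped
    command += ["-c", code]
    return subprocess.run(command, capture_output=True).returncode


normal = run_snippet("assert 2 + 2 == 5, 'Math is wrong'")
optimized = run_snippet("assert 2 + 2 == 5, 'Math is wrong'", optimize=True)

print(normal)     # 1: the AssertionError stopped the program
print(optimized)  # 0: the assert was skipped entirely
```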

### Example

> Write a function that **tokenizes a sentence** (i.e. splits it into a list of words)

First, we write a (failing) test:

```python
# tests/test_tokenize.py
from ie_nlp_utils import tokenize  # This will fail right away!

def test_tokenize_returns_expected_list():
    sentence = "This is a sentence"
    expected_tokens = ["This", "is", "a", "sentence"]

    tokens = tokenize(sentence)

    assert tokens == expected_tokens
```

and we run it from the command line:

```
$ pytest
...
```

Then we make the test pass in the simplest way:

```python
# src/ie_nlp_utils/__init__.py
def tokenize(sentence):
    return sentence.split()
```

And we watch it pass!

```
$ pytest
...
```
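
Calling `str.split()` without arguments splits on any run of whitespace, which gives this implementation some behavior worth pinning down with more tests. The sketch below is only a suggestion: the test names are made up, and `tokenize` is repeated so that the snippet is self-contained (in the real project the tests would live in `tests/test_tokenize.py` and import it from `ie_nlp_utils`).

```python
def tokenize(sentence):
    # Same implementation as above, repeated to keep the snippet self-contained
    return sentence.split()


def test_tokenize_ignores_repeated_whitespace():
    assert tokenize("This  is   a sentence") == ["This", "is", "a", "sentence"]


def test_tokenize_empty_sentence_returns_empty_list():
    assert tokenize("") == []


def test_tokenize_keeps_punctuation_attached():
    # Splitting on whitespace does not separate punctuation from words
    assert tokenize("Hello, world!") == ["Hello,", "world!"]
```

The last test documents a limitation rather than a feature: if punctuation should become separate tokens, that test is the one to change first, following the Red - Green - Refactor cycle.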

### Exercise

1. Add a new test that checks that `tokenize(sentence, lower=True)` returns a list of *lowercase* tokens.
2. Make the new test pass *without breaking the first one*.
3. *Extra*: Use `@pytest.mark.parametrize` to pass two different sentences to the new test https://docs.pytest.org/en/latest/example/parametrize.html