# Documentation

**Contact:** Alex Kanitz (alexander.kanitz@unibas.ch)

This lesson gives you a summary of the various methods that you can use to document and annotate your software, for users, other developers and yourself.

## Table of Contents

* [Why should we document our software?](#Why-should-we-document-our-software?)
* ["Docs"](#"Docs")
* [Logging](#Logging)
  * [A simple example](#A-simple-example)
  * [Logging across multiple modules](#Logging-across-multiple-modules)
  * [Further reading: Logging](#Further-reading:-Logging)
* [Typing](#Typing)
  * [Type hints](#Type-hints)
  * [Further reading: Type hints](#Further-reading:-Type-hints)
* [Docstrings](#Docstrings)
  * [Docstring styles](#Docstring-styles)
  * [Further reading: Docstrings](#Further-reading:-Docstrings)
* [Block and inline comments](#Block-and-inline-comments)
* [Homework](#Homework)

## Why should we document our software?

In the field of bioinformatics, software is often badly documented. This is presumably due to time constraints and/or the belief that software is first and foremost written for oneself or the lab.

However, even _if_ software is really only used by yourself, documentation is still valuable, because your _future self_ is, so to speak, another person. You will surely have a hard time figuring out your code when you look at it again in a few weeks, months or years!

Moreover, good scientific practices require proper documentation of code. Not doing so would be the equivalent of just publishing raw data, maybe with an unannotated set of protocols. So documenting and annotating your code is critically important for context, reusability, interpretability and, more generally, scientific provenance! Let's have a look at a number of ways in which you can improve your codebase in those respects.

## "Docs"

We have already added a basic `README.md` file to our test repository in an earlier lesson, but so far haven't
paid it much attention. If you want anyone to actually use your software, that will need some work!

While there isn't a standard or even a convention on what exactly should be inside a `README.md` file accompanying your repository, popular Open Source software typically has most or all of the following items covered in its docs:

* **Synopsis:** Short description of what the software does and why someone would want to use it.
* **Usage:** A section with short examples demonstrating _how_ the software can be used.
* **Installation:** Instructions for installing the software.
* **Detailed usage:** Exhaustive information on all relevant functions and behaviors of the software. Only necessary for larger projects.
* **License:** Information on how the software can be used by others.
* **Contributing:** Information on how others can contribute to the software. Only necessary if contributions are desired.
* **Versioning:** Information on what releases mean, if there is a regular release cycle etc.
* **Contributors:** A place to acknowledge all those who contributed to the software.
* **Contact:** Let people know how to get in touch with you.

Of course you can add other sections/info, e.g., for scientific/research software, you may want to add a section on how to cite the software when it is used.

> Note that if a project is sufficiently complex, markdown may not be the most suitable way to write
> documentation. In those cases, often [reStructuredText](https://docutils.sourceforge.io/rst.html) is used instead to generate beautifully rendered,
> browsable multipage documentation that is typically hosted outside of the repository, e.g., on services like
> [Read the Docs](https://readthedocs.org/).

## Logging

A common, simple way of _debugging_ code during development, as well as of
keeping users updated about what the program is currently doing, is to add
`print()` statements throughout the codebase.

While this works, it is not very flexible, e.g., you probably _don't_ want the
user to see _debugging_ messages when they just run the code in _production_.
Also, it is more difficult to set up logging via `print()` statements to
different output streams, like screen output, writing to files etc.

Python comes with the `logging` module to make logging more convenient,
flexible and effective, as it has multiple configuration options for
formatting, configuring handlers/outputs and more. Moreover, it comes, by
default, with several builtin _log levels_ that you can attach to each log
message. These are the following (together with some loose conventions of how
they are being used; ultimately, that is, of course, up to you), ordered in
increasing severity:

* `DEBUG`: messages not to be printed during normal operation
* `INFO`: general messages for the user
* `WARNING`: messages that the user should pay extra attention to, but that are
   not considered errors
* `ERROR`: messages that report on recoverable faulty/erroneous program
   execution
* `CRITICAL`: messages reporting on unrecoverable errors (the program exits
   with an error status)
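
To make the ordering of the levels concrete, here is a minimal sketch that prints the integer value the `logging` module assigns to each level and checks which levels pass a given threshold. The logger name `levels_demo` is made up for this illustration.

```python
import logging

# the five standard levels, in increasing severity
LEVEL_NAMES = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]

for level_name in LEVEL_NAMES:
    # each level is an integer constant defined in the logging module
    print(f"{level_name}: {getattr(logging, level_name)}")

# "levels_demo" is a hypothetical logger name used only for this sketch
demo_log = logging.getLogger("levels_demo")
demo_log.setLevel(logging.WARNING)

# messages below the logger's level are dropped, all others are processed
print(demo_log.isEnabledFor(logging.INFO))
print(demo_log.isEnabledFor(logging.ERROR))
```

With the threshold set to `WARNING`, the first check prints `False` and the second prints `True`.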

### A simple example

It is very easy to use the builtin `logging` module. Simply import it,
configure it, create a logger and then use it to emit your log messages. Here is
an example:

```python
import logging

# configure the "root logger", from which all logger instances inherit
# this is OPTIONAL, you can also use the default options
logging.basicConfig(
    format='[%(asctime)s: %(levelname)s] %(message)s',
    level=logging.INFO,
)

# set up a logger instance for the current module
LOG = logging.getLogger(__name__)

# use the logger in your code
LOG.info("Script starting...")
```

So what's happening here?

1. First, we **`import`** the [`logging`](https://docs.python.org/3/library/logging.html)
   module as usual.
   > As `logging` is part of the standard library, we do not need to install it.
2. We then **configure** the _root logger_ with the
   [`.basicConfig()`](https://docs.python.org/3/library/logging.html#logging.basicConfig)
   function by passing it two (optional) parameters:

   - `format`: defines the format of the log messages (see
      [here](https://docs.python.org/3/library/logging.html#logrecord-attributes) for a list of attributes you can use)
   - `level`: the minimum severity of log levels to log; here, we are only
      logging `logging.INFO` and more severe messages, i.e., we are _not_
      logging any debug messages

   > Note that you can also use a configuration file together with the
   > [`.fileConfig()`](https://docs.python.org/3/library/logging.config.html#logging.config.fileConfig)
   > function for advanced configuration use cases.
3. We then create a **logger instance**. The `.getLogger()` function takes
   the name of the logger, i.e., a string, as its argument. It is a convention
   to pass it `__name__` in most cases, as this makes sure that in a program
   with multiple modules, each module has its own logger instance. We now have
   our logger instance stored in the variable `LOG`.
   > In principle it is also possible to skip this step and use the
   > _root logger_ directly, e.g., via `logging.debug("My debug message...")`,
   > but that is considered bad practice.
4. Finally, we use the logger to log a message via one of its convenience
   methods.
   > Next to `.info()` there are also corresponding methods for all other
   > log levels, e.g., `.debug()`, `.error()` etc., which log messages with
   > the corresponding log level.
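
The point made earlier about sending log messages to different outputs can be sketched with _handlers_. The following is a minimal example, not the only way to do it: the logger name `handlers_demo` and the file name `demo.log` are made up, and the log file is written to a temporary directory so that the sketch does not touch your project.

```python
import logging
import os
import tempfile

# "handlers_demo" is a hypothetical logger name used only for this sketch
log = logging.getLogger("handlers_demo")
log.setLevel(logging.DEBUG)  # let the individual handlers decide what to keep
log.propagate = False  # keep this sketch independent of the root logger

formatter = logging.Formatter("[%(asctime)s: %(levelname)s] %(message)s")

# handler 1: screen output (standard error by default), INFO and above
console = logging.StreamHandler()
console.setLevel(logging.INFO)
console.setFormatter(formatter)
log.addHandler(console)

# handler 2: a log file in a temporary directory, DEBUG and above
log_path = os.path.join(tempfile.mkdtemp(), "demo.log")
to_file = logging.FileHandler(log_path)
to_file.setLevel(logging.DEBUG)
to_file.setFormatter(formatter)
log.addHandler(to_file)

log.debug("Only written to the file")
log.info("Written to the screen and to the file")

# close the file handler, then show what ended up in the file
to_file.close()
with open(log_path) as log_file:
    file_contents = log_file.read()
print(file_contents)
```

Each handler has its own level and formatter, so the same logger can be verbose in a file while staying quiet on the screen.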

### Logging across multiple modules

So do we have to set up logging like that in every module?

There are [several
patterns](https://stackoverflow.com/questions/15727420/using-logging-in-multiple-modules)
to set up logging across multiple modules. Here's a simple one that should
serve many scenarios, and it allows you to configure logging only once
(for each entry point into your program):

You configure the _root logger_ in your entry point (here module `main.py`):

```python
# main.py

import logging

from my_module import my_func


def main():
    LOG.info("Program started")
    my_func()
    LOG.info("Program finished")


if __name__ == '__main__':
    logging.basicConfig(
        format='[%(asctime)s: %(levelname)s] %(message)s (module "%(module)s")',
        level=logging.INFO,
    )
    LOG = logging.getLogger(__name__)
    main()
```

Given that we have the _root logger_ configured already (and all other loggers
inherit from that logger), in any other (non-entry point) module, we can now
just do:

```python
# my_module.py

import logging

LOG = logging.getLogger(__name__)


def my_func():
    LOG.info("This is from a function from another module")

```

You can copy the code from above to two files, `main.py` and `my_module.py`,
respectively, and run the code from the command line with:

```bash
python main.py
```

You will see the following output:

```console
[2021-11-02 16:08:16,945: INFO] Program started (module "main")
[2021-11-02 16:08:16,945: INFO] This is from a function from another module (module "my_module")
[2021-11-02 16:08:16,946: INFO] Program finished (module "main")
```

So, how did this work?

Given that we started the program from the command line, `__name__` is set to
`__main__` and so the logger is configured and the `main()` function called.
Here, there is first a log message to tell us that the program started, then
code from a function in another module (that we imported) is executed (leading
to another log message being written from there) and then finally we receive
a message that the program concluded. Note that each message now includes the
name of the module it was logged in. This is because we included the
`%(module)s` attribute in the format string.

But what happens if we import the module `my_module` from another program
because we offer some library code that is useful in various programs? There
won't be any call to `.basicConfig()` and thus the root logger and the
module-level logger won't be configured in the same way! That's correct - and it is
actually desirable. It is always the calling program/client that should
configure logging. Therefore, put your logging configuration only in your
entry point modules, and then just import and create a logger in all of your
other modules with the pattern:

```python
import logging

LOG = logging.getLogger(__name__)
```
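
To illustrate why leaving the configuration to the client is useful, here is a minimal sketch of what a calling program could do. It assumes the library module is called `my_module`, as in the example above, and uses the `force=True` option of `.basicConfig()` (available from Python 3.8) only so that the sketch also works if logging has already been configured in the current session.

```python
import logging

# stands in for the module-level logger that `my_module.py` creates
library_log = logging.getLogger("my_module")

# the client decides on the global configuration ...
logging.basicConfig(
    format='[%(asctime)s: %(levelname)s] %(message)s',
    level=logging.WARNING,
    force=True,  # Python 3.8+: replace any existing root configuration
)
# the library logger has no level of its own, so it follows the root logger
print(library_log.getEffectiveLevel() == logging.WARNING)

# ... and can still turn up the verbosity for one specific module
logging.getLogger("my_module").setLevel(logging.DEBUG)
print(library_log.getEffectiveLevel() == logging.DEBUG)
```

Both checks print `True`: the library code itself never had to change.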

### Further reading: Logging

- [Detailed How-To](https://docs.python.org/3/howto/logging.html)
- [Official docs](https://docs.python.org/3/library/logging.html)
- [Format string attributes](https://docs.python.org/3/library/logging.html#logrecord-attributes)

## Typing

By original design, Python is a _dynamically typed_ language, meaning that a
variable's type (e.g., `int` or `str`) can change over time. As in most
dynamically typed languages, a variable's type also does not have to be
declared before the variable is used. Thus, as you know, you can just do
something like this without problems:

```python
x = 18
print(f"Value of x: {x}")
print(f"Type of x: {type(x)}")
x = "Now I'm a string"
print(f"Value of x: {x}")
print(f"Type of x: {type(x)}")
```

This is in contrast to _statically typed_ languages, such as Java or C, where
variable types have to be declared first. In C, for example, this could look
as follows:

```c
int x;
x = 18;
int y = 20; /* you can also assign to the variable at the same time */
z = 20; /* raises an error because the type of z hadn't been declared! */
```

In addition to being _dynamically typed_, Python is also a
[_duck-typed_](https://en.wikipedia.org/wiki/Duck_typing) language, meaning
that whether the current value of a given variable matches a required type is
evaluated at run time with the "duck test":

> _"If it walks like a duck and it quacks like a duck, then it must be a duck"_

An example (taken from [Wikipedia](https://en.wikipedia.org/wiki/Duck_typing)):

```python
class Duck:
    def swim(self):
        print("Duck swimming")

    def fly(self):
        print("Duck flying")

class Whale:
    def swim(self):
        print("Whale swimming")

for animal in [Duck(), Whale()]:
    animal.swim()
    animal.fly()
```

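So if we say "everything that can swim is a duck", then a whale is a duck.
Until we get into a situation where we require that a duck also needs to
be able to fly: calling `animal.fly()` on the `Whale` instance raises an
`AttributeError`, because the class `Whale` does not define a `fly()` method.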
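
The failure in the loop above only shows up once the code runs. As a minimal sketch of one way to deal with it, the loop below reuses the two classes from the example and catches the exception; the `cannot_fly` list is an addition made purely for illustration.

```python
class Duck:
    def swim(self):
        print("Duck swimming")

    def fly(self):
        print("Duck flying")


class Whale:
    def swim(self):
        print("Whale swimming")


cannot_fly = []
for animal in [Duck(), Whale()]:
    animal.swim()
    try:
        animal.fly()
    except AttributeError as exc:
        # the "duck test" failed at run time: this object has no fly() method
        print(f"Skipping: {exc}")
        cannot_fly.append(type(animal).__name__)

print(cannot_fly)
```

Note that the problem is only detected when the offending line is actually executed, which is exactly the weakness discussed in the following paragraphs.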

Both _dynamic_ and _static_ typing systems for programming languages have
advantages and disadvantages. The advantage of _dynamic_ typing, especially
_duck typing_, is that it is easier and more flexible. The
disadvantage is that it is too easy and too flexible!
Duck typing takes a load off your mind to such an extent that you are prone not
to think about your code _enough_ and hence are more likely to introduce
errors, especially in edge cases. And once a bug is introduced into your
codebase, _dynamic typing_ will also make it a lot harder to spot it, or to
write unit tests for your code, because it is difficult to foresee all of the
different scenarios in which that code may be used and, e.g., what types of
inputs your function may receive in these scenarios. In other words, _dynamic
typing_ tends to be more dangerous, especially for more complex codebases!

One other major disadvantage of _dynamic typing_ nowadays is that you cannot
use the full power of modern, _smart_ editors, which are able to check your
code for (potential) issues that may arise from _duck typing_. If the types are
not known until the code is run, the editor cannot help you in spotting these!

Finally, statically typed languages are generally more performant, because
knowing the types ahead of time allows for better optimization when the code
is compiled.

### Type hints

Before Python 3.5, there was no standard way of declaring types in Python
code. Due to the disadvantages of _dynamic typing systems_ mentioned in
the previous section, an _optional_ "type hinting" system was introduced in
Python 3.5, and a dedicated syntax for annotating variables followed in
Python 3.6. In Python 3.9 and above, this is now quite mature, and we strongly
recommend that you make use of it for production code. But you may perhaps skip
it for your unit tests - best of both worlds!

So how does it work? It's quite simple!


> Note that the typing system in Python and the `typing` module have been
> undergoing a lot of changes since they were first introduced in Python 3.5.
> We are referring here to how things are done in Python 3.9, where the system
> is more mature and has stabilized to some degree. Note that if you need to
> support older Python versions, you may need to do things slightly differently
> (generally the older ways are still supported in the newer Python minor
> releases).

To declare the type of a variable, you can do:

```python
x: int
x = 18
```

Or you can declare the type and assign at the same time (more common):

```python
x: int = 18
```

However, note that typing _is not enforced_! Unlike in C (see above), your
runtime won't complain at all if you do something like this:

```python
x: int
x = "But I'm a string!"
```

Python remains a _dynamically typed_ language and the type hints are, well,
just _hints_!
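
The same holds for functions. In the following minimal sketch (the function `repeat_twice` is made up for illustration), the annotations are stored on the function object, but nothing checks them when the function is called:

```python
def repeat_twice(value: int) -> int:
    """Return the input multiplied by two."""
    return value * 2


# the annotations are stored on the function, but never checked at run time
print(repeat_twice.__annotations__)

print(repeat_twice(21))    # an int, as annotated
print(repeat_twice("ab"))  # a str: runs without complaint and returns "abab"
```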

So why should I bother with adding them, then?

The answer is that you can use static type checkers like `mypy` to check your
code for typing issues. You will likely be able to configure your editor to use
it and tell you in real time if you run into potential issues. If you make use
of type hinting, you should also include `mypy` in your CI. Just include a call
such as `mypy name_of_your_package` and it will report any issues it finds.
After coding for a bit, even after passing all your other linter tests, you
might be amazed what issues `mypy` finds!
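
As an illustration of the kind of issue a type checker can catch, consider the following sketch. The function `mean_length` is hypothetical, and the exact wording of the checker's report is deliberately not reproduced here:

```python
def mean_length(seqs: list[str]) -> float:
    """Return the mean length of the given sequences."""
    return sum(len(seq) for seq in seqs) / len(seqs)


print(mean_length(["ACGT", "AC"]))  # intended use: a list of sequences

# Forgetting the brackets passes a single string instead. At run time this
# does not fail: Python iterates over the characters and returns 1.0.
# A static type checker should reject the call, because `str` does not match
# the annotated parameter type `list[str]`.
print(mean_length("ACGT"))
```

The second call is the dangerous kind of bug: it produces a plausible-looking number instead of an error, so it can easily go unnoticed without a type check.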

Let's look at how to use type hints in functions and methods:

```python
def my_func(
    a: list,
    b: bool = False,
) -> str:
    # my code
    ...
```

Here we have defined a function that takes one required parameter `a` that is
supposed to be of type `list`, as well as an optional (default value provided!)
parameter `b` that is of type `bool`. The _return type_ is declared as `str`.
We recommend you to use type hints at the very least for your functions so that
your interfaces are well defined. It also helps you with writing docstrings, as
you don't need to bother with adding variable types in them. Tools that are
able to process properly formatted docstrings (e.g., Google-style docstrings)
will detect the types from the hints in the function/method signature. This is
of course better, because a docstring is just text and doesn't enforce
anything: it may declare that a parameter requires a certain type while the
actual implementation uses another, and not even `mypy` would notice. Docstrings
tend to degrade more easily, and the real source of truth is always the code
itself!

Let's look at some more type hints:

```python
a: list[str]  # a list of strings
b: tuple[str, int]  # a tuple with two items, the first a string, the second an integer
c: dict  # a dictionary
d: dict[str, int]  # a dictionary with string keys and integer values
```

If you want your variables to _optionally_ accept `None` (in addition to the
declared type), you can use `Optional` from the `typing` module:

```python
from typing import Optional

a: Optional[list[str]]  # here we accept a list of strings, or `None`
```
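
A typical use of `Optional` is a function parameter that defaults to `None`. The function `join_ids` below is a made-up example to sketch the pattern:

```python
from typing import Optional


def join_ids(ids: list[str], sep: Optional[str] = None) -> str:
    """Join identifiers, falling back to a comma if no separator is given."""
    if sep is None:
        sep = ","
    return sep.join(ids)


print(join_ids(["a", "b"]))
print(join_ids(["a", "b"], sep="|"))
```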

Two other useful features of the `typing` module are `Union` and `Any`. They
allow you to specify more than one type:

```python
from typing import (Any, Union)

a: Union[str, int]  # accepts a string OR an integer
b: Any  # accepts any type
```
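
When a parameter is declared as a `Union`, the function body usually needs to find out which of the types it actually received. A minimal sketch with a hypothetical function `to_int`:

```python
from typing import Union


def to_int(value: Union[str, int]) -> int:
    """Convert a string or an integer to an integer."""
    if isinstance(value, str):
        # inside this branch, type checkers treat `value` as `str`
        return int(value.strip())
    return value


print(to_int(" 42 "))
print(to_int(7))
```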

### Further reading: Type hints

Of course there's a lot more to the new Python typing system, but you will
probably be able to get quite far with the rather simple examples above. Once
you run into situations where they won't be enough (or if you want to support
Python versions below 3.9), you can check the following resources:

* [Official documentation](https://docs.python.org/3/library/typing.html)
* [PEP 484](https://www.python.org/dev/peps/pep-0484/)
* [Cheat sheet](https://mypy.readthedocs.io/en/latest/cheat_sheet_py3.html)

## Docstrings

One particular aspect of documentation is the Python feature that allows modules, functions, classes and
methods to be described by a simple triple-quoted string, the documentation string, more commonly referred to as just "docstring". A docstring has to be the first statement (i.e., the first non-comment, non-blank line) of the code unit it describes. Having docstrings placed right next to the code units they document helps tremendously in ensuring that code and documentation do not diverge over time.

Let's look at a minimal example of a docstring annotating a function:

```python
def my_function():
    """I am a docstring."""
    pass
```

Docstrings are important because they can be used by users and developers to quickly get an idea of what an imported piece of code can do. For example, after defining the function above, we can call the `help()` function on it to retrieve the information from the docstring:

```python
help(my_function)
```

While this may not seem overly useful in this instance, this is simply because our docstring isn't very helpful. 
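
Besides `help()`, a docstring can also be accessed programmatically, which is what documentation tools rely on. A minimal sketch using the function from above:

```python
import inspect


def my_function():
    """I am a docstring."""
    pass


# the raw docstring is stored in the `__doc__` attribute
print(my_function.__doc__)

# `inspect.getdoc()` returns the same text with indentation cleaned up
print(inspect.getdoc(my_function))
```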

So how do we write better docstrings?

### Docstring styles

There are some popular styles in which docstrings are typically written, all of which follow the general docstring conventions and have tooling available to auto-generate beautiful documentation from them. Let's see an example for each of these styles for a hypothetical function `func` that takes as inputs an integer `arg1` and a string `arg2` and which either returns a `bool` or raises a `TypeError`.

* [Sphinx style](https://pythonhosted.org/an_example_pypi_project/sphinx.html#function-definitions)
  ```python
  def func(arg1, arg2):
      """Summary line.

      Extended description of function.

      :param arg1: Description of arg1.
      :type arg1: int
      :param arg2: Description of arg2.
      :type arg2: str
      ...
      :raises TypeError: Description of situation when exception is raised.
      ...
      :returns: Description of return value.
      :rtype: bool
      """
  ```

* [NumPy style](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html)
  ```python
  def func(arg1, arg2):
      """Summary line.

      Extended description of function.

      Parameters
      ----------
      arg1 : int
          Description of arg1.
      arg2 : str
          Description of arg2.

      Returns
      -------
      bool
          Description of return value.

      Raises
      ------
      TypeError
          Description of situation when exception is raised.
      """
      pass
  ```

* [Google style](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html)
  ```python
  def func(arg1, arg2):
      """Summary line.

      Extended description of function.

      Args:
          arg1 (int): Description of arg1.
          arg2 (str): Description of arg2.

      Returns:
          bool: Description of return value.
      
      Raises:
          TypeError: Description of situation when exception is raised.
      """
      pass
  ```

> We will be using **Google-style docstrings** for our collaborative coding projects, as they are least verbose
> and therefore easiest to read and write in most situations. For your own projects, choose the style that suits
> you or your coding environment best - just make sure to be consistent.

> Note that in all of the examples above, types of input parameters and the return value are explicitly indicated.
> **When using type hints in the function definition, this information is redundant and should therefore be avoided.**

So a better way of defining function `func` with Google-style docstrings would be:

```python
def func(arg1: int, arg2: str) -> bool:
    """Summary line.

    Extended description of function.

    Args:
        arg1: Description of arg1.
        arg2: Description of arg2.

    Returns:
        Description of return value.

    Raises:
        TypeError: Description of situation when exception is raised.
    """
    pass

# let's call help on the function
help(func)
```

That already looks much better - though perhaps a little abstract. Let's write a docstring for a function that we
might actually want to implement some day:

```python
def translate_dna(
    seq: str,
    start: int,
) -> str:
    """Translate a DNA sequence.

    Args:
        seq: DNA sequence to be translated.
        start: Start position for translation.

    Returns:
        Amino acid sequence corresponding to DNA sequence.

    Raises:
        ValueError: Sequence contains non-DNA characters.
        ValueError: Start position is out of bounds.
    """
    pass

# let's call help on the function
help(translate_dna)
```

So just by looking at this docstring, we know exactly how we can use this function. We even know what we could
possibly do wrong. In fact, after putting in the work to come up with the function _signature_ and _docstring_,
we would have a very good idea how to implement that function!
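
To show how far the signature and docstring already take us, here is one possible implementation sketch. It is not the only reasonable one and rests on several assumptions that the docstring leaves open: `start` is 0-based, lowercase input is accepted, only `A`, `C`, `G` and `T` count as DNA characters, the standard genetic code is used, stop codons are rendered as `*`, and trailing bases that do not form a complete codon are ignored. The helper names `BASES`, `AMINO_ACIDS` and `CODON_TABLE` are made up for this sketch.

```python
BASES = "TCAG"
# standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_dna(
    seq: str,
    start: int,
) -> str:
    """Translate a DNA sequence.

    Args:
        seq: DNA sequence to be translated.
        start: Start position for translation.

    Returns:
        Amino acid sequence corresponding to DNA sequence.

    Raises:
        ValueError: Sequence contains non-DNA characters.
        ValueError: Start position is out of bounds.
    """
    seq = seq.upper()  # ASSUMPTION: lowercase input is acceptable
    if set(seq) - set(BASES):
        raise ValueError("Sequence contains non-DNA characters.")
    if not 0 <= start < len(seq):
        raise ValueError("Start position is out of bounds.")
    # ASSUMPTIONS: 0-based start, stop codons shown as "*", trailing
    # bases that do not form a complete codon are ignored
    codons = (seq[pos:pos + 3] for pos in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


print(translate_dna("ATGGCCTAA", 0))
```

The call at the end prints `MA*`. Note how every assumption listed above is a question that the docstring could, and probably should, answer explicitly.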

Writing docstrings indeed makes you think about your code and possible side effects or edge cases of your functions, methods and classes. We therefore _strongly_ recommend that you not only write extensive docstrings, but that you even do so before actually implementing the code they describe. Then, after you are done, double check that your docstrings still match your implementation.

We recommend that you do this at least for any code that might ever be useful in another context, for other people or your future self! It may be tedious to write docstrings, but in the end it will increase the uptake of and trust in your code, and it will improve your reputation as a developer.

### Further reading: Docstrings

Refer to the following information to learn more about how to write good docstrings:
- [PEP 8](https://www.python.org/dev/peps/pep-0008/) (general Python style guide)
- [PEP 257](https://www.python.org/dev/peps/pep-0257/) (specific guidelines for docstrings)

Docstrings can also be parsed by tools like [Sphinx](https://www.sphinx-doc.org/en/master/) to generate beautifully rendered, browsable, interlinked documentation for your codebase, which can then be hosted on services like [Read the Docs](https://readthedocs.org/).

## Block and inline comments

Sometimes, you want to tell other developers (or your future self) why you did a particularly tricky thing in a certain way. And sometimes, that thing is an implementation detail that would not make too much sense to document in the docstring.

In those (and _only in those_) cases, you can use block or inline comments:

```python
# this is a block comment
some_complicated_code_that_benefits_from_documentation(...)

...

some_other_complicated_code_further_down(...)  # this is an inline comment
```
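
As a more concrete sketch of a comment that explains _why_ something is done in a certain way (the function `drop_empty` is made up for illustration):

```python
def drop_empty(records: list[str]) -> None:
    """Remove empty strings from `records` in place."""
    # iterate over a copy: removing items from a list while iterating over
    # that same list would silently skip elements
    for record in list(records):
        if not record:
            records.remove(record)


reads = ["chr1", "", "", "chr2"]  # two consecutive empty entries
drop_empty(reads)
print(reads)
```

The comment documents a non-obvious implementation detail that would be out of place in the docstring, but that a future maintainer might otherwise "simplify" away.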

However, **we strongly recommend you to use block and inline comments as sparingly as possible** to document your code, because:

* The code is the _only_ source of truth! It's easy for code and documentation to diverge, because code gets updated and accompanying comments are often forgotten. Comments that don't match the code are not only not helpful, they may be harmful!
* If your code is so "ugly" that you feel like you need comments to structure it, you probably should refactor it (e.g., create shorter/clearer functions/methods).
* Most of the time, docstrings, type annotations and log messages should be sufficient for documenting what your code is doing, and for these, there is (better) tooling available to check whether they are still consistent with the actual code.


Apart from the very rare cases described at the beginning of this section, some other good reasons to include comments are:

* Document that a workaround is required because of some known issue in a dependency (link to the issue)
* Give credit when taking code from public sources, e.g. from [Stack Overflow](https://stackoverflow.com/)
* Indicate edge cases that are currently not planned to be covered (if they are planned to be covered, it is preferable to open an issue in the repository rather than adding a `# TODO: xxx` comment)

> Please also do not use block comments to "comment out" code! When using version control, one can always go back
> to a previous state and recover code that was deleted. Git servers like GitHub, GitLab etc. make it very easy
> to trace all changes, so it's unnecessary to keep outdated code in the codebase.

## Homework

For all homework: Please merge your code via the Git flow you learned about in the last session (feature branch, commit, merge request, merge). Each point below should be a separate commit (write [semantic commit messages](https://www.conventionalcommits.org/en/v1.0.0/) and choose the most appropriate keywords, e.g., `refactor`, `build`, etc. - none of what is added in this homework will be a feature).

1. **Extend docs** (30 min)  
   Extend the `README.md` of your project with some or all of the suggested sections.
2. **Use logging** (30 min)  
   Add log messages to your code using Python's `logging` module, making use of the various logging levels, e.g.,
   `info`, `debug`, `warning` etc. Replace existing `print()` commands in your code (if any) with log messages.
3. **Add type hints** (15 min)  
   Add type hints to all of your function/method definitions (inputs and return value), as well as to any
   class/instance attributes and local variables in functions and methods.
4. **Add docstrings** (45 min)  
   Add docstrings for your package, as well as all of your modules, classes, functions and methods.

> When working with multiple people on a tool, split up the work accordingly. Each party could write some of the documentation (resolve any merge conflicts that may arise!) and annotate their own code with log messages, type hints and docstrings.

The directory structure of your repository should not have changed.

Enjoy documenting and annotating your code :)