# Encapsulation & Packaging

This is a primer on how to:
- encapsulate your code for better reusability
- import other people's code so that you don't need to reinvent the wheel
- package your code and publish it in a way that other people can use it.

## Table of Contents

* [Why should we encapsulate our code?](#Why-should-we-encapsulate-our-code?)
* [Scope](#Scope)
* [Types of encapsulation](#Types-of-encapsulation)
  * [Classes](#Classes)
  * [Modules](#Modules)
  * [Packages](#Packages)
* [Creating and publishing packages](#Creating-and-publishing-packages)
  * [Annotate your package with metadata](#Annotate-your-package-with-metadata)
  * [Create a command-line executable](#Create-a-command-line-executable)
  * [Build the package for distribution](#Build-the-package-for-distribution)
  * [Upload the package to PyPI](#Upload-the-package-to-PyPI)
* [Further reading](#Further-reading)
* [Homework](#Homework)

## Why should we encapsulate our code?

A very important concept implemented in most, if not all, programming languages is the principle of *encapsulation*.

The basic idea is that code should be written in a modular manner, each functionality being implemented in a piece of code that communicates with the rest of the code via an interface. This has many advantages, such as 

* each piece of code can be written and tested *independently* of other code
* variable values will not be inadvertently changed somewhere else in the program
* the code can be easily shared and reused, as users only need to know the interface of the code

In other words, it becomes much easier to ensure that a particular functionality is implemented correctly, it can be optimized without affecting other parts of a program, and once the development of the code is complete, it does not have to be revisited when another part of the program is modified.

Furthermore, encapsulated code can be conveniently reused without having to worry how it does what it does, just like a *black box*:
```
        ____________
       | BLACK  BOX |
       |            |
input ==> f(input) ==> output
       |            |
       |____________|
```

The output produced by the code in the black box is just a function of its input.

## Scope

The fundamental idea that different layers of encapsulation build on is that of **scope**, i.e. where a particular object is *visible* and can be accessed in the program.

Imagine that we have written a function `power_up()` that looks like this:

In [1]:
def power_up(x, y=1):
    """returns x^y, or x^1 if no value for y is provided"""
    return x**y

And we use this function in our code:

In [2]:
print("With 2 args:", power_up(3, 3))
print("With default:", power_up(3))
print("With y=", y, ": ", power_up(3))

With 2 args: 27
With default: 3


NameError: name 'y' is not defined

What has happened? 

Even though the function executes fine, at the last call to print we find out that `y` is not defined. Why is that? 

That's because `y` only exists within the **scope of the `power_up` function**, not in the scope of the program which calls the function. This is important! Imagine what would happen if every function, module etc. that we use in our code, written by ourselves or others, defined variables that were visible everywhere in the code. Variables that we define would then start changing without us being aware of it, because the same names are defined and used in the modules that we imported. Not a good idea.

Python keeps track of variables, functions etc. in a `symbol table`, where each of these objects is associated with its scope. Objects from an *outer* scope are visible in an *inner* scope but not the other way around. Let's look at an example:

In [3]:
def power_up(x, y=1):
    """returns x^y, or x^1 if no value for y is provided"""
    print("In the inner scope x =", str(x))
    print("in the inner scope z = ", str(z))
    return x**y

x = 3
z = 10

print("Before function call x =", str(x))
print(power_up(4, 2))
print("After function call x =", str(x))

Before function call x = 3
In the inner scope x = 4
in the inner scope z =  10
16
After function call x = 3


As you can see, the `global` variable `x` defined outside of the function is not influenced by the argument `x` used by the function. Furthermore, the `global` variable `z` is visible inside the function. This is because Python follows specific rules of *scope resolution*, roughly speaking checking tables of symbols from the innermost to the outermost scope until it finds the name that we refer to (or not, as was the case with the `y` variable in the example above).
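
The lookup order can be made concrete with a nested function. The following is only a small sketch; the names `outer`, `inner`, `shadow`, `bump` and `counter` are made up for this illustration. It also shows that assigning to a name inside a function creates a new local name unless the name is declared `global`:

```python
counter = 0  # lives in the global scope


def outer():
    label = "enclosing"  # local to outer(), enclosing scope for inner()

    def inner():
        label_local = "local"  # local to inner()
        # `label` is found in the enclosing scope, `counter` in the global scope
        return label_local, label, counter

    return inner()


def shadow():
    counter = 99  # assignment creates a new local name; the global is untouched
    return counter


def bump():
    global counter  # explicitly rebind the global name
    counter += 1
    return counter


print(outer())   # ('local', 'enclosing', 0)
print(shadow())  # 99
print(counter)   # 0
print(bump())    # 1
print(counter)   # 1
```

Rebinding globals from inside functions works against encapsulation, so in practice it is usually cleaner to pass values in as arguments and return the results.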

## Types of encapsulation

Python's functions and methods are one example of encapsulation. Let's look at some others.

### Classes

Classes are nice abstractions that allow us to group data and functionality into objects, creating new *data types* or extending already available ones. Classes also define new *name scopes*, beyond the *global* scope of the entire program and the *local* scope of individual functions.

#### Defining classes

Classes are defined similarly to functions:

```
class ClassName:
    <statement 1>
    <statement 2>
    ...
    <statement n>
```

Class definitions should come before the first use of the class object. The statements inside the class definition implement class *variables* and *methods*, the latter being a special type of function, associated with the data type, which can be accessed with the <kbd>.</kbd> notation. Let's look in more detail at an example, a class intended to describe lymphocytes, which are distinguished by their surface molecules (*markers*).

In [14]:
# a class describing different types of lymphocytes
class Cell:

    markers = []
    
    def __init__(self, kind):
        self.kind = kind
        
    def add_marker(self, marker):
        self.markers.append(marker)

What did we just do?

We started by defining a `class variable` called *markers*, intended to hold the markers associated with each type of cell. We then defined two functions, one that has a special name, `__init__` and will be called every time we create an `instance` of the class and another that adds markers for the cell. 

An interesting feature of Python is that while data types are "recipes" for creating *instances*, they are themselves objects in the language. Methods are functions stored on the class object, which is why the first parameter in each method definition is `self`: it receives the instance on which the method is called. Note that `self` is a naming convention, not a keyword.

The `add_marker` function allows us to construct the list of markers defining the cell type.


Now, let's define some lymphocyte objects:

In [15]:
c = Cell("B cell")
c.add_marker("CD19")
d = Cell("T cell")
d.add_marker("CD3")
print(c.kind + " markers: ")
print(c.markers)
print(d.kind + " markers: ")
print(d.markers)

B cell markers: 
['CD19', 'CD3']
T cell markers: 
['CD19', 'CD3']


Hmmm... what just happened? Not quite what we intended, it seems. Instead of a cell type-specific marker list, each cell ends up having the same markers. This is because *markers* was defined as a mutable `class variable`, which all instances share and end up modifying. In contrast, *kind* is an `instance variable`, its definition being linked to a specific instance of the class, which is why the full name of the variable is `self.kind`. 

To achieve what we want we would have to define a *markers* variable for each instance. 

Using the scope rules and the fact that classes are also Python objects, we can still define and access class variables as illustrated below. Note though that, in contrast to some other programming languages, Python does not prevent us from accessing all members of a class, variables and methods, via the <kbd>.</kbd> notation, i.e. they are *public*, not *private* to the class. 

In [18]:
# a class describing different cell types
class Cell:

    markers = []
    
    def __init__(self, kind):
        self.kind = kind
        self.markers = []
        
    def add_marker(self, marker):
        self.markers.append(marker)
        
c = Cell("B cell")
c.add_marker("CD19")
d = Cell("T cell")
d.add_marker("CD3")
Cell.markers.append("TIA1")
print(c.kind + " markers: " + " ".join(c.markers))
print(d.kind + " markers: " + " ".join(d.markers))
print("Cell class markers: " + " ".join(Cell.markers))

B cell markers: CD19
T cell markers: CD3
Cell class markers: TIA1


A better use of class variables is to keep track of properties that are indeed shared by all instances of the class. In the example below we define a dictionary class variable. This uses a data type we have not discussed before, the enumeration or `Enum` type. This basically consists of consecutive numerical values typically starting from 1, but assigns more informative names to these values that we use in subsequent code. 

Here we first create an `Enum` called *errorType* that associates the numbers 1, 2, and 3 with three names parsed out of a string argument. This gives us three members of errorType, called `unknown`, `wrongType` and `wrongValue`. Each member has a name and a value: the names are again `unknown`, `wrongType` and `wrongValue`, and the values are 1, 2 and 3.

In the second line, we create a dictionary class variable that associates the values 1, 2, 3 denoting the error types to more verbose messages about what the errors mean.

In [24]:
from enum import Enum

class ParameterError:
    errorType = Enum('errorType', 'unknown wrongType wrongValue')
    messageDict = {errorType.unknown.value : "Unexpected error",
                   errorType.wrongType.value : "Incorrect parameter type", 
                   errorType.wrongValue.value : "Incorrect parameter value"}
    
    def __init__(self, typeCode=errorType.unknown.value):
        self.t = typeCode
        self.m = self.messageDict[self.t]

    def shout(self):
        return (self.t, self.m)

e = ParameterError(1)
f = ParameterError(2)
print(e.shout())
print(f.shout())

print("And here is what an errorType variable contains:")
print(type(ParameterError.errorType.unknown))
print(f'{ParameterError.errorType.unknown.name}:{ParameterError.errorType.unknown.value}')

(1, 'Unexpected error')
(2, 'Incorrect parameter type')
And here is what an errorType variable contains:
<enum 'errorType'>
unknown:1
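
The functional `Enum(...)` call used above has an equivalent class-based spelling, which many readers find easier to scan. The sketch below is an alternative formulation, not the code used in the rest of this primer; the names `ErrorType` and `MESSAGES` are chosen here for illustration, and the dictionary is keyed by the enum members themselves rather than by their values:

```python
from enum import Enum


class ErrorType(Enum):
    # same three members as above, with explicitly assigned values
    UNKNOWN = 1
    WRONG_TYPE = 2
    WRONG_VALUE = 3


# map each member (not its integer value) to a verbose message
MESSAGES = {
    ErrorType.UNKNOWN: "Unexpected error",
    ErrorType.WRONG_TYPE: "Incorrect parameter type",
    ErrorType.WRONG_VALUE: "Incorrect parameter value",
}

member = ErrorType(2)             # look a member up by its value
print(member.name, member.value)  # WRONG_TYPE 2
print(MESSAGES[member])           # Incorrect parameter type
```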


#### Inheritance

Classes are important not only because of the encapsulation they provide, but also because they support `inheritance`, a concept that allows us to build objects in a modular, incremental manner. Let's look again at an example. 

In [25]:
class Transcript:
    
    def __init__(self, tid, name, kind):
        # give the transcript an identifier, name and function
        self.tid = tid
        self.name = name
        self.kind = kind
        
    def set_coords(self, chrom, strand, start, end):
        # save the coordinates of the transcript in the genome
        self.chrom = chrom
        self.strand = strand
        self.start = start
        self.end = end
        
    def set_sequence(self, seq):
        # save the transcript sequence
        self.seq = seq
        

class CodingTranscript(Transcript):
    
    def set_cds(self, start, end):
        self.cds_start = start
        self.cds_end = end
        

We here defined a general `Transcript` class, with information that all transcripts should have, i.e. id, name, functional annotation, as well as genome coordinates and sequence. For transcripts encoding proteins we would also like to know where the coding region starts and ends in the genome, so we next defined a `CodingTranscript` class that **inherits** from the `Transcript` class, i.e. has all the attributes of this class, but in addition, it has a method that can set the additional variables. Let's use these classes now to create a coding and a non-coding transcript.

In [26]:
my_coding_transcript = CodingTranscript("1", "CT1", "coding")
my_coding_transcript.set_coords("chr1", "+", 231456, 232929)
my_coding_transcript.set_cds(142, 895)

my_nc_transcript = Transcript("2", "NCT1", "noncoding")
my_nc_transcript.set_coords("chr3", "-", 852314, 853100)
my_nc_transcript.set_cds(42, 604)

AttributeError: 'Transcript' object has no attribute 'set_cds'

As we can see, all works well when we set `cds_start` and `cds_end` in a transcript of the `CodingTranscript` class, which has this method and attributes, but not when we try to use the method for an instance of the `Transcript` base class, which does not have the `set_cds` method defined. 

Furthermore, we can also see that the `__init__` function defined for a `Transcript` is invoked when we create an instance of the `CodingTranscript` class, because this latter class does not have another `__init__` method.

On the other hand, we can override the `__init__` function in the subclass, which will work like this:

In [27]:
class OtherCodingTranscript(Transcript):
    
    def __init__(self, tid, name, kind, start, end):
        # give the transcript an identifier, name and function
        self.tid = tid
        self.name = name
        self.kind = kind
        self.cds_start = start
        self.cds_end = end

second_coding_transcript = OtherCodingTranscript("2", "CT2", "coding", 45, 500)
third_coding_transcript = OtherCodingTranscript("3", "CT3", "coding")

TypeError: __init__() missing 2 required positional arguments: 'start' and 'end'

So when the function is defined in the derived class, it is used, and in this case, it has a different number of arguments than the `__init__` function of the base class, which is why the third_coding_transcript does not get created. There are also ways to do the initialization stepwise, cascading `__init__` calls to parent functions. The `__init__` function of the parent class can be accessed as `super().__init__` in the body of the `__init__` function of the derived class.
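
A minimal sketch of such a cascading initialization is shown below. The class name `CascadingCodingTranscript` and the default `kind="coding"` are assumptions made for this illustration, and the base class is repeated in shortened form so that the example runs on its own:

```python
class Transcript:

    def __init__(self, tid, name, kind):
        self.tid = tid
        self.name = name
        self.kind = kind


class CascadingCodingTranscript(Transcript):

    def __init__(self, tid, name, start, end, kind="coding"):
        # let the base class set tid, name and kind ...
        super().__init__(tid, name, kind)
        # ... and only add what is specific to coding transcripts
        self.cds_start = start
        self.cds_end = end


t = CascadingCodingTranscript("4", "CT4", 45, 500)
print(t.tid, t.name, t.kind, t.cds_start, t.cds_end)  # 4 CT4 coding 45 500
print(isinstance(t, Transcript))                      # True
```

Compared with `OtherCodingTranscript` above, the attribute assignments of the base class are not duplicated, so a later change to `Transcript.__init__` is picked up by the subclass automatically.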

### Modules

Let's look now at another very useful form of encapsulation, which is the **module**. Documentation about modules can be found at https://docs.python.org/3/tutorial/modules.html

A **module** is a logical unit of code, a file containing Python definitions and statements. In its most basic form, it is just the set of commands inside one of the code cells here, saved as a text file with a `.py` extension (a Python *script* or program).

As soon as scripts get too big, it makes sense to distribute functional units across multiple files/modules to make them more manageable. We would then split up the code such that each module deals with a specific functionality or set of functionalities that we need in our program. Some modules may be extremely general, as they basically define objects and methods for those objects that are useful in very different fields, from physics to biology to humanities.

To use the content of a module within another module or program one needs to `import` the module. This will include the objects and methods of the module in the symbol table of our program. To avoid any confusion, the names of the objects and methods of the module will be accessible within the current program prefixed with the name of the module.

Python already comes with a considerable number of modules in its standard library. We are going to look at a number of modules in the coming weeks, but let's take a look at an example to get an idea of how a module is composed. We will use a module that comes with the standard Python distribution, namely `time`.

#### Example: The `time` module

It's not uncommon that we want to find out how long it takes to run various parts of our programs and for this, Python has a built-in `time` module that has a lot of relevant functions. We can access this module as follows:

In [28]:
import time

print(time)

<module 'time' (built-in)>


The `print` command does not tell us very explicitly what is inside the module. To find out, we can use the **`dir()`** function, using the module name as argument:

In [29]:
dir(time)

['_STRUCT_TM_ITEMS',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'altzone',
 'asctime',
 'ctime',
 'daylight',
 'get_clock_info',
 'gmtime',
 'localtime',
 'mktime',
 'monotonic',
 'monotonic_ns',
 'perf_counter',
 'perf_counter_ns',
 'process_time',
 'process_time_ns',
 'sleep',
 'strftime',
 'strptime',
 'struct_time',
 'time',
 'time_ns',
 'timezone',
 'tzname',
 'tzset']

What are these things? Modules typically contain one or more of the following objects:
* special attributes (with names that start and end with double underscores)
* variables (may also be referred to as attributes of the module)
* functions (loosely also called methods, although strictly a method is a function that belongs to a class)

To enforce good programming practices, Python has a number of conventions about underscores in names:

* Single Leading Underscore **`_var`**: Indicates that the name is not part of the interface; it is for internal use in the module (similar to `private` names in other programming languages). 
* Double Leading Underscore **`__var`**: Indicates that in the context of Python `classes` the name will be rewritten to prevent conflicts with names in subclasses.
* Single Trailing Underscore **`var_`**: Used to avoid naming conflicts with Python keywords.
* Double Leading and Trailing Underscore **`__var__`**: Indicates special names defined by the Python language (as we see above in the `time` module).
* Underscore **`_`**: Special name for temporary variables.
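
These conventions can be tried out directly. The class `Account` and its attribute names below are invented for this sketch. It shows that a single leading underscore is only a convention, whereas a double leading underscore inside a class body is rewritten by Python to `_ClassName__var`:

```python
class Account:

    def __init__(self):
        self.public = "part of the interface"
        self._internal = "internal by convention only"
        self.__mangled = "stored under a rewritten name"


acct = Account()
print(acct.public)
print(acct._internal)                # works: Python does not enforce privacy
print(hasattr(acct, "__mangled"))    # False: the attribute was renamed
print(acct._Account__mangled)        # the rewritten name is still reachable

class_ = "a name that avoids the keyword `class`"  # single trailing underscore

total = 0
for _ in range(3):  # `_` marks a loop variable whose value we do not use
    total += 1
print(total)  # 3
```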

Let's have a look at some of the objects, starting with the specially designated ones.

In [30]:
# Documentation text, also used when calling help() on the module
print(time.__doc__)

This module provides various functions to manipulate time values.

There are two standard representations of time.  One is the number
of seconds since the Epoch, in UTC (a.k.a. GMT).  It may be an integer
or a floating point number (to represent fractions of seconds).
The Epoch is system-defined; on Unix, it is generally January 1st, 1970.
The actual value can be retrieved by calling gmtime(0).

The other representation is a tuple of 9 integers giving local time.
The tuple items are:
  year (including century, e.g. 1998)
  month (1-12)
  day (1-31)
  hours (0-23)
  minutes (0-59)
  seconds (0-59)
  weekday (0-6, Monday is 0)
  Julian day (day in the year, 1-366)
  DST (Daylight Savings Time) flag (-1, 0 or 1)
If the DST flag is 0, the time is given in the regular time zone;
if it is 1, the time is given in the DST time zone;
if it is -1, mktime() should guess based on the date and time.



In [31]:
print('Loader: ', time.__loader__)
print('Package: ', time.__package__)
print('Spec: ', time.__spec__)
print('Name: ', time.__name__)

Loader:  <class '_frozen_importlib.BuiltinImporter'>
Package:  
Spec:  ModuleSpec(name='time', loader=<class '_frozen_importlib.BuiltinImporter'>, origin='built-in')
Name:  time


So these special objects give us more information about the module, including its name, where it resides and how it is loaded.

In [32]:
print(time.gmtime())  # coordinated universal time
print(time.localtime())  # local time

time.struct_time(tm_year=2024, tm_mon=11, tm_mday=2, tm_hour=9, tm_min=59, tm_sec=42, tm_wday=5, tm_yday=307, tm_isdst=0)
time.struct_time(tm_year=2024, tm_mon=11, tm_mday=2, tm_hour=10, tm_min=59, tm_sec=42, tm_wday=5, tm_yday=307, tm_isdst=0)


The `sleep` function is especially useful when dealing with web services, when we do not want to flood the service with requests. Then, we typically space the requests at appropriate time intervals, and the `sleep` function helps us do this. For example:

In [33]:
def sleep_n_seconds(n = 10):
    """This function waits for n (default = 10) seconds."""
    print('Starting to wait...')
    time.sleep(n)
    print('Done waiting.')
    return

sleep_n_seconds(10)

Starting to wait...
Done waiting.


Let's convince ourselves that it really waits for the specified number of seconds. We'll rewrite the function to tell us the time before it starts waiting and when it's finished:

In [34]:
def sleep_n_seconds(n=10):
    """This function waits for n (default = 10) seconds and returns the start and end times."""
    print('Starting to wait...')
    start = time.time()
    time.sleep(n)
    end = time.time()
    print('Done waiting.')
    return start, end

(start, end) = sleep_n_seconds(3)

print('Started at: ', time.ctime(start))
print('Ended at: ', time.ctime(end))
print('Time elapsed:', end-start)

Starting to wait...
Done waiting.
Started at:  Sat Nov  2 11:00:49 2024
Ended at:  Sat Nov  2 11:00:52 2024
Time elapsed: 3.0011141300201416
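
For measuring how long a piece of code takes, the `perf_counter` function listed in the `dir(time)` output above is often a better fit than `time.time()`, because it is a monotonic, high-resolution clock that is not affected by adjustments of the system clock. The helper `timed` below is a hypothetical name introduced for this sketch, not part of the `time` module:

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


result, elapsed = timed(sum, range(1_000_000))
print(f"sum = {result}, took {elapsed:.4f} s")
```

The values returned by `perf_counter` are only meaningful as differences; unlike `time.time()`, they cannot be converted to a calendar date with `time.ctime`.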


#### The special variable `__name__`

We'll have a closer look at the **`__name__`** variable, which has a special use in Python: it allows us to write modules that can either be run as stand-alone programs or be imported by other programs. How does it work? The whole idea rests on what the `__name__` variable is set to in these two situations: 

1. When the module is run as stand-alone with `python my_module.py`, the variable `__name__` is given the value `'__main__'`.
2. However, when the module is imported within another program, the variable `__name__` is given the name of the module file, which in this case would be `'my_module'`.

We can exploit this to define code that is only executed if the program was started in the stand-alone mode:

```python
def main():
    print('The value of __name__ is ' + __name__)

#### THE BELOW WILL BE FALSE IF THE MODULE IS IMPORTED AND main() WILL NOT BE EXECUTED ####    

if __name__ == '__main__':
    main()
```

This pattern is generally found in Python-based command-line tools and scripts. Not having the `if __name__ == '__main__'` guard in these circumstances may have unintended side effects upon importing.

Let's use one of the previous examples to illustrate this, specifically an example of the `power_up` function. Assume that we saved one of the previous cells in this notebook as a script (`my_mod.py`) in a directory of `helpers` scripts. The content of the script is:

In [35]:
!cat helpers/my_mod.py

def power_up_1(x, y=1):
    """returns x^y, or x^1 if no value for y is provided"""
    print("In the inner scope x =", str(x))
    print("in the inner scope z =", str(z1))
    return x**y

x = 3
z1 = 10

print("Before function call x =", str(x))
print(power_up_1(4, 2))
print("After function call x =", str(x))


Basically, we defined the `power_up_1` function, which we called like that to make sure that we distinguish from the `power_up` function we defined earlier, and after this function definition we initialized some variables, made a call to the `power_up_1` function and had some print statements. Now take a look at what happens if we import this piece of code, i.e. this module:

In [36]:
import helpers.my_mod

Before function call x = 3
In the inner scope x = 4
in the inner scope z = 10
16
After function call x = 3


The code in the module was executed upon import, including setting of variables and print out statements, which we included in that module for testing purposes. The variables and functions from this module are accessible (recall the `.` notation):

In [37]:
print("Accessing module function: " + str(helpers.my_mod.power_up_1(5, 3)))
print("Accessing module variable: " + str(helpers.my_mod.z1))

In the inner scope x = 5
in the inner scope z = 10
Accessing module function: 125
Accessing module variable: 10

To keep the testing code from running on import, we can move it into a `main()` function and call that function only behind the `__name__` guard. A second version of the script, `my_mod2.py`, does exactly this:


In [41]:
!cat helpers/my_mod2.py

def power_up_2(x, y=1):
    """returns x^y, or x^1 if no value for y is provided"""
    print("In the inner scope x =", str(x))
    return x**y

def main():
    x2 = 3

    print("Before function call x =", str(x2))
    print(power_up_2(4, 2))
    print("After function call x =", str(x2))

if __name__ == '__main__':
    main()


Now let's include this module in our code:

In [42]:
import helpers.my_mod2

We see nothing. We can check if the `power_up_2` function is defined

In [43]:
helpers.my_mod2.power_up_2(3,3)

In the inner scope x = 3


27

and we see that it is. However, the rest of the testing code is not executed upon import because it is wrapped into a `main()` function, which is available in the module, but has to be invoked explicitly once the module is imported. 

If we run the code as a stand-alone program from the commandline, we get something different:

In [44]:
!python helpers/my_mod2.py

Before function call x = 3
In the inner scope x = 4
16
After function call x = 3


Now the `main()` function is executed, because the `__name__` variable in the module has been set to `'__main__'`. This allows for functions of the module to be reused, while the application context where these functions were defined can remain in the background, not cluttering the programs developed by other users.

#### Example: The `argparse` module

Another very general-purpose module that we should look at is `argparse`. It provides a wide range of functionality for parsing command-line arguments and enforcing the correct use of programs. The full documentation on the module can be found at https://docs.python.org/3/library/argparse.html. Here we will go through a few important features of this module. The basic idea is to construct an object containing all the relevant information about the command-line arguments that we expect and use this object to parse the command line, extracting the individual arguments. While we may have done this ourselves in other programming languages, parsing the command line to extract arguments and ensure that they have the correct type, range etc. is really tedious and error-prone. Let's look at the most basic way of using the module:

In [45]:
import argparse
parser = argparse.ArgumentParser()

In [46]:
dir(parser)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_action_groups',
 '_actions',
 '_add_action',
 '_add_container_actions',
 '_check_conflict',
 '_check_value',
 '_defaults',
 '_get_args',
 '_get_formatter',
 '_get_handler',
 '_get_kwargs',
 '_get_nargs_pattern',
 '_get_option_tuples',
 '_get_optional_actions',
 '_get_optional_kwargs',
 '_get_positional_actions',
 '_get_positional_kwargs',
 '_get_value',
 '_get_values',
 '_handle_conflict_error',
 '_handle_conflict_resolve',
 '_has_negative_number_optionals',
 '_match_argument',
 '_match_arguments_partial',
 '_mutually_exclusive_groups',
 '_negative_number_matcher',
 '_option_string_actions',
 '_optionals',
 '_parse_know

We created an object that has in principle a lot of functionality, but does not have any variables/arguments. What we need to do is to add these variables. To see how this works, let's unpack the first example provided in the documentation, which looks like this:

In [48]:
!cat helpers/argparse_basic.py

import argparse

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('integers', metavar='N', type=int, nargs='+',
                    help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const',
                    const=sum, default=max,
                    help='sum the integers (default: find the max)')
parser.add_argument('chars', metavar='char', type=str, nargs='+',
                    help='character type parameters')
parser.add_argument('--p', action='store_true', help="print args?")

if __name__ == '__main__':
    args = parser.parse_args()
    if(args.p):
        print(args.integers)
        print(args.chars)
    print(args.accumulate(args.integers))
    
    




We start by creating the `parser` object by instantiating the `ArgumentParser` class of `argparse`. Its constructor can take quite a few arguments, one of which is the _description_ of what the program is supposed to do, information that will show up if we run the program with the option `-h`, that is, when we call "help" on the program. The complete list of arguments to `ArgumentParser` is this, with their associated defaults:
```python
class argparse.ArgumentParser(prog=None, usage=None, description=None, epilog=None, parents=[], formatter_class=argparse.HelpFormatter, prefix_chars='-', fromfile_prefix_chars=None, argument_default=None, conflict_handler='error', add_help=True, allow_abbrev=True, exit_on_error=True)
```
You can look up what each of these parameters is for in the documentation; some are more intuitive than others. For example, `prog` holds the name of the program (by default taken from `sys.argv[0]` when we run the program from the command line), and `usage` holds a string describing how the program is to be invoked.

The next line
```python
parser.add_argument('integers', metavar='N', type=int, nargs='+',
                    help='an integer for the accumulator')
```
specifies one commandline argument. The first argument to this method has to be either an argument name or a flag that specifies an optional argument. In this case the first argument is a name that tells us how the positional argument(s) will be called. We further see that this argument should be of type `int`, that there should be at least one (and if more, they will be gathered into a list, `nargs='+'`) and that in the `help` message this parameter will be called `N`.

The next line shows us how to add an optional argument:
```python
parser.add_argument('--sum', dest='accumulate', action='store_const',
                    const=sum, default=max,
                    help='sum the integers (default: find the max)')
```
This is an optional argument because the first argument to the method is the flag `--sum`. With `action='store_const'` the flag takes no value of its own on the command line. Instead, when the flag is present, the constant given by `const`, here the built-in function `sum`, is stored in the attribute named by `dest`, which in this case is `accumulate`. If the flag is not specified, `accumulate` gets the default value, the built-in function `max`. 

We'll add one more positional parameter, similar to the integers, but holding characters. Finally, we add a flag (`--p`) that can be used by the program (as you see above, it is used to print the number and character lists that were created when parsing the command line).

Let's now run our basic program:

In [56]:
!python helpers/argparse_basic.py 1 2 3 4 5 6 7 8 9 a --p

[1, 2, 3, 4, 5, 6, 7, 8, 9]
['a']
9


In [57]:
!python helpers/argparse_basic.py 1 2 3 4 5 6 7 8 9 --sum

36


In [58]:
!python helpers/argparse_basic.py --sum 1 2 3 4 5 6 7 8 9

36

Note that both `--sum` runs print 36 rather than 45. Because `chars` requires at least one value and no explicit character was given, the parser assigns the last token, `9`, to `chars`, so only the integers 1 through 8 are summed.


In [59]:
!python helpers/argparse_basic.py 1 2 3 4 5 6 7 8 9 a

9


In [60]:
!python helpers/argparse_basic.py 1 --sum

usage: argparse_basic.py [-h] [--sum] [--p] N [N ...] char [char ...]
argparse_basic.py: error: the following arguments are required: char


As you may notice, we have not passed any arguments to the method `parse_args()`. That's because by default it parses the list of command-line arguments held in `sys.argv[1:]` (the first element, `sys.argv[0]`, is the program name). We could also construct such a list ourselves and pass it in, e.g.

In [61]:
import helpers.argparse_basic

In [62]:
helpers.argparse_basic.parser.parse_args("1 2 3 4 5 6 7 8 9 --sum".split())

Namespace(accumulate=<built-in function sum>, chars=['9'], integers=[1, 2, 3, 4, 5, 6, 7, 8], p=False)
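
Passing an explicit list to `parse_args()` is also a convenient way to try out a parser without leaving the notebook. The sketch below is a reduced variant of the script above with only the integer positional and the `--sum` flag; the function name `build_parser` is an assumption made for this illustration:

```python
import argparse


def build_parser():
    """Return a parser with one positional list of integers and a --sum flag."""
    parser = argparse.ArgumentParser(description="Process some integers.")
    parser.add_argument("integers", metavar="N", type=int, nargs="+",
                        help="an integer for the accumulator")
    parser.add_argument("--sum", dest="accumulate", action="store_const",
                        const=sum, default=max,
                        help="sum the integers (default: find the max)")
    return parser


parser = build_parser()

# with the flag: accumulate is the built-in sum
args = parser.parse_args(["1", "2", "3", "--sum"])
print(args.integers, args.accumulate(args.integers))  # [1, 2, 3] 6

# without the flag: accumulate falls back to the default, max
args = parser.parse_args(["1", "2", "3"])
print(args.integers, args.accumulate(args.integers))  # [1, 2, 3] 3
```

With a single positional argument there is no second positional competing for the last token, so all three values end up in `integers`.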

#### More on `import`s

Apart from importing entire modules, one can also directly import any objects from inside them, such as classes, functions and constants/variables. This is often convenient, as it states explicitly which names are used and keeps the code that uses them short. Note, however, that the module is still loaded and executed in full either way. To import a class `SomeClass` from the module `my_module`, you would, e.g., write:

```python
from my_module import SomeClass
```

To import some constant `MY_CONSTANT` at the same time, you would write the following instead:

```python
from my_module import (MY_CONSTANT, SomeClass)
```

It is also possible to import an object under a different name:

```python
from my_module import SomeClass as sc
```

Following the above statement you can (and indeed MUST) now refer to the imported class as `sc` rather than the original `SomeClass`.
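
The three import forms can be tried out with modules that ship with Python. The sketch below uses `math` and `collections` purely as stand-ins for `my_module`; the alias `od` is an arbitrary choice.

```python
import math                                # whole module
from math import pi, sqrt                  # selected objects
from collections import OrderedDict as od  # object under a different name

# The directly imported name and the module attribute are the same object.
print(sqrt is math.sqrt)  # True
print(round(pi, 2))       # 3.14

# After `import ... as`, only the alias is bound in this namespace.
try:
    OrderedDict
    original_name_bound = True
except NameError:
    original_name_bound = False

print(od.__name__)          # OrderedDict
print(original_name_bound)  # False
```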

### Packages

While discussing modules above, I may have sometimes referred to a *package*. Is there a difference?

A module is a single Python file. On the other hand, a [package](https://docs.python.org/3/tutorial/modules.html#packages) is a _collection_ of related modules (but it may well contain only a single module), organized in a hierarchical manner on the file system and containing an `__init__.py` module in each directory (in special cases there are exceptions to this last point, but we will not go into that here).

The simplest possible package, with just a single `__init__.py` module, looks like this:

```console
my_package/
└── __init__.py
```

A more complex, nested package might look like this:

```console
my_package/
├── __init__.py
├── my_module.py
├── my_other_module.py
├── my_subpackage/
│   ├── __init__.py
│   ├── my_submodule.py
│   └── my_other_submodule.py
├── my_other_subpackage/
│   ├── __init__.py
│   ├── not_a_python_file  # ignored by Python
│   ├── my_third_submodule.py
│   └── my_third_subpackage/
│       ├── __init__.py
│       └── my_last_submodule.py
└── not_a_package/  # not a (sub)package, because it does not contain an "__init__.py" file
    ├── not_a_python_file
    └── my_other_submodule.py
```

In principle, the `__init__.py` module can contain any code. However, by convention it should generally just contain initialization code to be executed when the package is imported (technically, you cannot really import a package; when you do something like `import my_package`, what is _really_ imported are the contents of the `__init__.py` file in the package `my_package`). If there is nothing to initialize (which is often the case), the file is best left empty.
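
A quick way to convince yourself that importing a package runs its `__init__.py` is to build a throwaway package on disk. This is only a sketch: the package name `demo_pkg`, its contents and the temporary directory are made up for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package; all names here are hypothetical.
tmp_dir = tempfile.mkdtemp()
pkg_dir = Path(tmp_dir) / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("GREETING = 'initialized'\n")
(pkg_dir / "my_module.py").write_text("def answer():\n    return 42\n")

sys.path.insert(0, tmp_dir)    # make the parent directory importable
importlib.invalidate_caches()  # pick up files created after startup

import demo_pkg                         # executes demo_pkg/__init__.py
from demo_pkg.my_module import answer   # imports a module inside the package

print(demo_pkg.GREETING)             # initialized
print(Path(demo_pkg.__file__).name)  # __init__.py
print(answer())                      # 42
```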

#### Package `import`s

To import an entire (sub)module from a package, you would write something like this:

```python
import my_package  # imports module `__init__.py` in `my_package` root
import my_package.my_subpackage  # imports module `__init__.py` in subpackage `my_subpackage`
import my_package.my_module  # imports module `my_module` in `my_package` root
```

As you can see, you need to use the `.` notation, starting with the outermost package. For example, to import a class `SomeClass` and a constant `MY_CONSTANT` from the submodule `my_submodule` in subpackage `my_subpackage` in package `my_package`, you would, e.g., write:

```python
from my_package.my_subpackage.my_submodule import (MY_CONSTANT, SomeClass)
```

_**Where does Python look for packages?**_

When an import statement is encountered, Python looks for the requested module in the directories listed in the variable `sys.path`. This list is initialized from the directory containing the script being run (or the current directory in an interactive session), the environment variable `PYTHONPATH` (if it is set) and an installation-dependent default. The list can be modified, e.g.:

```python
import sys
sys.path.append('/path/to/my/module')
```

An `ImportError` exception is raised if the package/module is not found. Packages support a special attribute, `__path__`, initialized to be a list containing the name of the directory holding the package's `__init__.py` before the code in that file is executed. This is sometimes used to extend the set of modules found in a package.
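
The lookup behaviour can be observed directly. In the sketch below, the module name `demo_lookup_module` and the temporary directory are invented for illustration, and `importlib.import_module()` is used as the programmatic equivalent of an `import` statement.

```python
import importlib
import sys
import tempfile
from pathlib import Path

module_name = "demo_lookup_module"  # hypothetical, not installed anywhere

# First attempt: the module is not on sys.path, so the import fails.
first_attempt = None
try:
    importlib.import_module(module_name)
except ImportError as err:  # ModuleNotFoundError is a subclass of ImportError
    first_attempt = type(err).__name__

# Create the module in a new directory and add that directory to sys.path.
module_dir = tempfile.mkdtemp()
(Path(module_dir) / f"{module_name}.py").write_text("VALUE = 'found'\n")
sys.path.append(module_dir)
importlib.invalidate_caches()

# Second attempt: the module is now found via the extended sys.path.
module = importlib.import_module(module_name)

print(first_attempt)  # ModuleNotFoundError
print(module.VALUE)   # found
```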

_**Conventions on `import`ing**_

Production code can frequently have dozens of `import` statements per module. To make these more maintainable there are a couple of conventions that you should try to stick to:

* Place all import statements at the top of a module, before any other code. Try to avoid importing modules/objects just before you need them. This practice will reveal problems with missing or broken packages early, and it will be easier to maintain and read the list of imports.
* Distribute all imports into three separate blocks, separated with one blank line: imports from (1) built-in, (2) third-party (i.e., from packages you have installed manually) and (3) local (i.e., from _this_ project) modules. Sort each block alphabetically by the package/module names (irrespective of whether you use `import` or `from ... import`).

For example, this could be the first couple of lines of your module:

```python
import os
from time import sleep

from third_party_package.some_module import SomeClass

from my_own_package.some_module import SOME_CONSTANT
import my_own_package.whole_module
```

#### Installing packages

Packages are also the container that is used most commonly to publish Python projects. Over the years, innumerable packages have been written for Python that one can *install* and use. The biggest resource for Python packages is the [Python Package Index](https://pypi.org/), more commonly referred to as PyPI. 

Python packages can be installed via the package manager Pip that should be automatically installed together with Python (you can verify that it is available with `pip --version`). When installing a package that is listed in PyPI, the syntax is very simple:

```bash
pip install package_name
```

where `package_name` is the name of the package you would like to install.

When installing a package, Pip fetches the code from a repository and stores it in an efficient way on your local file system so that you can use it in the future. The location where packages are stored is pre-configured and typically does not need to be modified. If you want to find out where a given package was installed, you can run `pip show package_name`.
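
From inside Python, you can also ask the import system where a module or package lives. The sketch below uses `json` only because it ships with every Python installation; for a package installed with Pip, the reported path would typically point into a `site-packages` directory.

```python
import importlib.util

# Look up the module without importing it.
spec = importlib.util.find_spec("json")

print(spec.origin)                      # path to the package's __init__.py
print(spec.submodule_search_locations)  # directory (or directories) of the package
```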

It is useful to know that Pip can also be used to install packages from sources other than PyPI, e.g., from Git repositories:

```bash
# Install code from default branch
pip install git+https://github.com/user/repo.git

# Install code from specific branch/tag/commit
pip install git+https://github.com/user/repo.git@branch/tag/commit
```

Finally, Pip can also be used to install a Python package from the current directory. For this, we need to be in the root directory of the project, which should contain instructions about package installation (a `pyproject.toml` file is the currently preferred option, a `setup.py` file was used before and still works currently). If available, you can execute the following command to install your local package:

```bash
pip install .
```

Sometimes it is useful to install a Python project that you are currently working on. In that case, it is best to install the package in an "editable" manner by providing the `-e` flag when installing:

```bash
pip install -e .
```

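An editable install does not copy your files to your interpreter directory (e.g., the `site-packages` directory). Instead, Python is pointed at your working directory, so changes to the code take effect without reinstalling the package.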

## Creating and publishing packages

There is a lot of support for creating distributable versions of packages. We will only look at the basics; you can use the [setuptools quickstart guide](https://setuptools.pypa.io/en/latest/userguide/quickstart.html) as an initial reference.

We have already learned that what makes a directory containing Python modules a package is the presence of an `__init__.py` module. However, if we want to publish/distribute our package in a form that is usable by others, we need to do a few more things:

### Annotate your package with metadata

For your project you probably created a directory `code/` inside of your Git repository root directory and put all your actual code inside that subdirectory. We generally do this so that all the code is nicely located together in one directory and is not mixed up with configuration files etc. (e.g., `.gitignore`). As such, the `code/` (the name is not important) subdirectory is effectively your package, and all files and directories inside it are (sub)modules and subpackages of that package.

To annotate the package and provide the instructions required for PyPI to store and Pip to install the package, we need to create a file `pyproject.toml` inside the repository root directory (i.e., _outside_ of the package).

A very minimal `pyproject.toml` file could look something like this:

```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "mypackage"
version = "1.0.0"
description = "Brief package description"
license = { text = "MIT" }
authors = [
    {name = "MyName", email = "MyEmail"},
]
dependencies = []  # add here packages that are required for your package to run, including version or range of versions

[tool.setuptools.packages]
find = {}  # this will autodetect Python packages from the directory tree, e.g., in `code/`
```

You can modify these values to fit your needs and most of these should be fairly obvious. We would like to point out two good practices though:

1. Never publish any code without a **license**! Commonly used code licenses in the Open Source Software community are the MIT, Apache 2.0 and GPL-based licenses. We recommend you use the [MIT license](https://opensource.org/licenses/MIT) for a start. See [this resource](https://choosealicense.com/) for further info and additional licenses.
2. It is very useful (particularly for users of your software) to explicitly **version your code**. We strongly recommend you adhere to the [Semantic Versioning](https://semver.org/) specification for this purpose, a 3-part versioning system (a fourth part is optional) composed of major (for breaking changes to the software's interface), minor (for backwards-compatible changes to the software's interface) and patch (for changes that do not change the software's interface).
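
To make the three parts of a semantic version concrete, here is a small sketch of how a version string changes depending on the kind of release. The helper `bump()` is hypothetical and only handles plain `major.minor.patch` strings; it is not part of `setuptools` or any other tool mentioned here.

```python
def bump(version: str, part: str) -> str:
    """Return `version` with the given part incremented (illustrative helper)."""
    major, minor, patch = (int(number) for number in version.split("."))
    if part == "major":  # breaking change: reset minor and patch
        return f"{major + 1}.0.0"
    if part == "minor":  # backwards-compatible change: reset patch
        return f"{major}.{minor + 1}.0"
    if part == "patch":  # no change to the interface
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


print(bump("1.4.2", "major"))  # 2.0.0
print(bump("1.4.2", "minor"))  # 1.5.0
print(bump("1.4.2", "patch"))  # 1.4.3
```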

For more configuration options, see:
- https://setuptools.pypa.io/en/latest/userguide/index.html
- https://setuptools.pypa.io/en/latest/userguide/quickstart.html
- https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html


### Create a command-line executable

As scripts that can be executed from the command line are commonplace in the life sciences, we would like to point out that it is possible to create command-line executables from your Python code. They become available after package installation and execute a specific function when invoked. All you need to do is add the following to your `pyproject.toml` file:

```toml
[project.scripts]
my-executable = "mypackage.my_module:main"
```

Here we tell `setuptools`: _"Create an executable `my-executable` that executes the function `main()` from module `my_module` of package `mypackage`."_

> By convention, we often call the function that serves as an entry point for command-line scripts `main()`. See [above](#On-executing-Python-scripts-from-the-command-line) for an example and how to guard this sort of code against accidental execution when importing from the module that contains that function. In your homework you will write such a `main()` function that parses the user's commandline arguments and provides them as input to the code you have previously written.
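
A minimal sketch of what such an entry-point module could look like is given below. The function `add_up()` and the argument name `integers` are placeholders for your own code. The sketch assumes the installed executable is a wrapper that passes the return value of `main()` on as the exit status, which is why `main()` returns `0` rather than the computed result.

```python
import argparse


def add_up(numbers):
    """Placeholder for the actual functionality of your package."""
    return sum(numbers)


def main(argv=None):
    """Entry point: parse command-line arguments and print the result."""
    parser = argparse.ArgumentParser(description="Add up some integers.")
    parser.add_argument("integers", metavar="N", type=int, nargs="*")
    args = parser.parse_args(argv)  # argv=None means "use sys.argv[1:]"
    print(add_up(args.integers))
    return 0  # treated as the exit status, 0 meaning success


# Only run main() when executed as a script, not when imported.
if __name__ == "__main__":
    main()
```

Accepting an optional `argv` list makes `main()` easy to call from tests, e.g., `main(["1", "2", "3"])`, without touching the real command line.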

### Build the package for distribution

To prepare your package for publication, execute the `build` module (it does not ship with Python, so install it first with `pip install build`), like so:

```bash
python -m build
```

If you have a look at your directory after executing the command, you will see that `build` created a directory `dist/` for you, which contains compressed files that we need to publish the package.

### Upload the package to PyPI

You should now be ready to publish your package to the Python Package Index (PyPI)!

Unless you are sure that your software is ready to be used by others (or you have at least a strong desire to get there soon), we recommend you _not_ to publish to PyPI just yet. Nothing will stop you from doing so, but there is not much point in flooding [PyPI](https://pypi.org/) with test software or half-baked code. Instead, we can publish the code to the [Test PyPI](https://test.pypi.org/). By the way, publishing to Test PyPI is also good practice for serious/production releases - just to make sure that your package ends up on the index just like you want it.

To upload the package to Test PyPI, we make use of the `twine` module (installable with `pip install twine`) and tell it to upload the contents of the `dist/` folder to `testpypi`:

```bash
python -m twine upload --repository testpypi dist/*
```

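Executing this command will ask you for your **credentials**, so make sure you register with Test PyPI first. Note that uploads are now authenticated with an API token (which you create in your account settings) rather than with your account password.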

> To upload to PyPI instead, simply omit the `--repository` parameter and its argument. Note that you need to register with PyPI and Test PyPI individually, as their user databases are not shared!

Your code should now be available on Test PyPI for a while. You and others can install the package with Pip using

```bash
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple mypackage
```

where `mypackage` is the name of the package. The arguments to `--index-url` and `--extra-index-url` tell Pip to find `mypackage` on Test PyPI but find any dependent packages (if your code requires them) from the regular PyPI.

## Further reading

If you would like more info about today's topics, have a look at the following resources on:

* [Classes](https://docs.python.org/3/tutorial/classes.html)
* [Modules](https://docs.python.org/3/tutorial/modules.html)
* [Packaging](https://packaging.python.org/)
* [Licenses](https://choosealicense.com/)
* [Semantic Versioning](https://semver.org/)

## Homework

For all homework: Please merge your code via the Git flow you learned about in the last session (feature branch, commit, merge request, merge). Each point below should be a separate commit (write [semantic commit messages](https://www.conventionalcommits.org/en/v1.0.0/) and choose the most appropriate keywords, e.g., `refactor`, `build`, etc.; none of what is added in this homework will be a feature).

1. **Refactor your code to use a class** (60 min)  
   Rewrite the code you have so far to use classes. E.g., `GillespieApp` could be one such class, `SimulationParams` another. Save the class(es) in a file/module in your code directory and delete the old code.
2. **Set up a basic `pyproject.toml` file** (10 min)  
   Use the information above to create a basic `pyproject.toml` file in your project's root directory. Make sure to pick an Open Source software license from [this resource](https://choosealicense.com/), and add a corresponding entry to your `pyproject.toml` file.
3. **Add a command-line interface** (45 min)  
   Create a command-line interface (CLI) and make sure that respective code is only executed when the module is called from the command line, not when imported. Add the corresponding entry to the `pyproject.toml` file.
4. **Create a package** (20 min)  
   Add an `__init__.py` file to your code directory to make it a package. Give your package and
   executable suitable names. For the package name, first verify that a package with your chosen
   name does not already exist on PyPI.
5. **Publish your package** (10 min)  
   Build the package then publish it on Test PyPI. Note that your package will be removed from Test
   PyPI after a while. This is fine as we do not want to store our test packages permanently. For
   any real package, you would of course publish it on the regular PyPI as well, after verifying
   that the upload to Test PyPI works as expected.

Upon completion, the directory structure should look something like this:

```console
├── your_package
│   ├── __init__.py
│   ├── your_code_file.py
│   ├── your_other_code_file.py
│   └── ...  # any additional modules
├── .git
│   ├── ...
│   ├── ...
│   └── ...
├── .gitignore
├── images
│   ├── screenshot_git_tutorial_1_student_1.png
│   ├── screenshot_git_tutorial_2_student_1.png
│   ├── screenshot_markdown_tutorial_student_1.png
│   ├── screenshot_git_tutorial_1_student_2.png
│   ├── screenshot_git_tutorial_2_student_2.png
│   ├── screenshot_markdown_tutorial_student_2.png
│   └── ...  # screenshots from additional contributors
├── LICENSE
├── README.md
└── pyproject.toml
```

Enjoy creating and publishing your first Python package! :)