# Running Luigi tasks

This notebook will detail the different tasks that may be run in this project, using Luigi.

We won't run any Python code cells here; instead, we list the shell commands that execute each of the previous steps.

As a reminder, before detailing the Luigi-related commands, let's look at how the source code is organized:
- jitenshop/
  - tasks/
    - __init__.py
    - db.py
    - stations.py
    - availability.py
    - clusters.py
  - webapp/
  - __init__.py
  - controller.py
  - iodb.py
  - config.ini.sample
  - (config.ini)

The modules in `jitenshop/tasks` are of primary interest here.

## Database management

As a prerequisite to some of our pipeline items, we need to create a database schema to host our data.

The corresponding task is located in `jitenshop/tasks/db.py`.

The command is the following:

```
python -m luigi --local-scheduler --module jitenshop.tasks.db CreateSchema
```

## Station data handling

As described in [the first notebook](./1_data_recovering.ipynb), we retrieve the shared-bike station data by:
- downloading the data from an Open Data portal as a zip file;
- unzipping this archive to get shapefiles;
- populating the database with raw station information;
- creating a clean station table that will be usable thereafter.

Each of these steps is handled by a dedicated task: `DownloadShapefile`, `UnzipShapeFile`, `ShapefileIntoDB` and `NormalizeStationTable`, respectively.

The pipeline is linear across these steps, so running the last task is enough to guarantee that every step is done:
```
python -m luigi --local-scheduler --module jitenshop.tasks.stations NormalizeStationTable
```
By doing so, we build the table `lyon.station`, the output of this part of the pipeline.
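
To make the ordering concrete, here is a minimal sketch of how such a linear chain resolves. It is not Luigi's actual scheduler: the `REQUIRES` dictionary and the `execution_order` helper are illustrative assumptions that stand in for each task's `requires()` method, and other prerequisites (such as the schema creation) are left out.

```python
# Hypothetical dependency map mirroring the linear station pipeline.
# In Luigi, each task declares this through its `requires()` method;
# a plain dict is used here so the sketch runs without Luigi installed.
REQUIRES = {
    "DownloadShapefile": [],
    "UnzipShapeFile": ["DownloadShapefile"],
    "ShapefileIntoDB": ["UnzipShapeFile"],
    "NormalizeStationTable": ["ShapefileIntoDB"],
}


def execution_order(target, requires=REQUIRES):
    """Return the tasks needed to build `target`, dependencies first."""
    order = []

    def visit(task):
        # Walk the dependencies before recording the task itself.
        for dependency in requires[task]:
            visit(dependency)
        if task not in order:
            order.append(task)

    visit(target)
    return order


print(execution_order("NormalizeStationTable"))
```

Asking for the last task pulls in the three upstream tasks, which is why a single command is enough for this part of the pipeline.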

## Availability data handling

Still in [the first notebook](./1_data_recovering.ipynb), we recover the availability data as well. We saw two ways of doing so:
- retrieve the bike availability data in real time by requesting the Open Data portal every 5 minutes (for instance);
- or retrieve the last week of data for demonstration or testing purposes.

In both cases, we have to:
- download the data;
- save it in a developer-friendly format like `.csv`;
- store it into the database.

The corresponding tasks are:
- `RealTimeAvailability`, `RealTimeAvailabilityToCSV`, and `RealTimeAvailabilityToDB` in the first scenario;
- `Availability`, `AvailabilityToCSV` and `AvailabilityToDB` in the second scenario.

During this workshop, we focus on the last task of the second scenario, `AvailabilityToDB`:
```
python -m luigi --local-scheduler --module jitenshop.tasks.availability AvailabilityToDB --start 2019-08-22 --stop 2019-08-24
```

At the end of this step, we have the shared-bike availability at each Lyon station within the indicated time window.
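
If the same command has to be issued for several time windows, it can be convenient to build it from Python. The following is only a sketch: `luigi_command` is a hypothetical helper, not part of Luigi or `jitenshop`. It relies on the fact that Luigi exposes task parameters as command-line flags, with underscores replaced by hyphens.

```python
import shlex
from datetime import date


def luigi_command(module, task, local_scheduler=True, **params):
    """Build the argument list for `python -m luigi`.

    Keyword arguments become command-line flags; underscores are turned
    into hyphens, which is how Luigi names parameters on the CLI.
    """
    args = ["python", "-m", "luigi"]
    if local_scheduler:
        args.append("--local-scheduler")
    args += ["--module", module, task]
    for name, value in params.items():
        args += ["--" + name.replace("_", "-"), str(value)]
    return args


command = luigi_command(
    "jitenshop.tasks.availability",
    "AvailabilityToDB",
    start=date(2019, 8, 22),
    stop=date(2019, 8, 24),
)
# Print the command in a copy-pastable form, without executing it.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to actually launch the task.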

## Clustering data creation

The last important step of this data pipeline, within the scope of the workshop, is the clustering of Lyon shared-bike stations. For this purpose, we have to:
- compute the clusters;
- store the clustered stations into the database;
- store the typical week day profiles (cluster centroids) into the database.

The three corresponding tasks in `jitenshop.tasks.clusters` are `ComputeClusters`, `StoreClustersToDatabase` and `StoreCentroidsToDatabase`. The last two are independent of each other (although they both have the first one as a dependency). Consequently, we have to run both commands to complete the clustering process:
```
python -m luigi --local-scheduler --module jitenshop.tasks.clusters StoreClustersToDatabase --n-clusters 4 --start 2019-08-22 --stop 2019-08-24
python -m luigi --local-scheduler --module jitenshop.tasks.clusters StoreCentroidsToDatabase --n-clusters 4 --start 2019-08-22 --stop 2019-08-24
```

Once these tasks succeed, the clusters and centroids are stored in the database.
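
Running the two commands one after the other should not compute the clusters twice, provided `ComputeClusters` produces an output that Luigi can recognize as complete (an assumption about this project; Luigi skips tasks whose outputs already exist). The sketch below mimics that behavior in plain Python; `CLUSTER_REQUIRES`, `run` and the `done` set are hypothetical names, not part of Luigi or `jitenshop`.

```python
# Hypothetical dependency map: both storage tasks share one upstream task.
CLUSTER_REQUIRES = {
    "ComputeClusters": [],
    "StoreClustersToDatabase": ["ComputeClusters"],
    "StoreCentroidsToDatabase": ["ComputeClusters"],
}


def run(target, done, requires=CLUSTER_REQUIRES):
    """Run `target` and its missing dependencies; return what was executed.

    `done` plays the role of Luigi's completeness check: a task whose
    name is already in the set is skipped.
    """
    executed = []
    for dependency in requires[target]:
        executed += run(dependency, done, requires)
    if target not in done:
        done.add(target)
        executed.append(target)
    return executed


done = set()
first = run("StoreClustersToDatabase", done)
second = run("StoreCentroidsToDatabase", done)
print(first)   # ['ComputeClusters', 'StoreClustersToDatabase']
print(second)  # ['StoreCentroidsToDatabase']
```

The second call only executes the centroid storage, because the shared dependency was completed by the first one.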

## TL;DR

A final task that handles the whole process is located in `jitenshop/tasks/main`. It declares `AvailabilityToDB`, `StoreClustersToDatabase` and `StoreCentroidsToDatabase` as dependencies. We run it with the following command:
```
python -m luigi --local-scheduler --module jitenshop.tasks.main Finalize
```

This task ensures the database is ready to use, for instance in a web application!

## Server mode

Luigi tasks may also be launched through a server. Until now, we used the `--local-scheduler` argument, which executes the pipeline locally.

We can do the same with the Luigi daemon (*warning:* requires `python-daemon`):
```
luigid --background --logdir ./log
```

And then, we can run the task of our choice:
```
python -m luigi --module jitenshop.tasks.main Finalize
```
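
Without `--local-scheduler`, the command expects a scheduler to be reachable. As a hedged sketch, one can check this from Python before submitting a task. The helper name `scheduler_is_reachable` is hypothetical, and the host and port below assume Luigi's defaults (`localhost`, port 8082); adapt them if `luigid` was started with other settings.

```python
import socket


def scheduler_is_reachable(host="localhost", port=8082, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    This only tests that the port is open; it does not verify that the
    listener is actually a Luigi scheduler.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if scheduler_is_reachable():
    print("A scheduler seems to be up; tasks can be submitted without --local-scheduler.")
else:
    print("No scheduler found on localhost:8082; start luigid first.")
```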

In this mode, one gets more information about the pipeline, which is especially useful in production environments. As an example, one may get a fancy visualization of the pipeline:

![fancy_luigi_diagram](../img/luigi-pipeline.png)

For more details, see the [official documentation](https://luigi.readthedocs.io/en/stable/central_scheduler.html?highlight=daemon#the-luigid-server).