<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Module-config.py" data-toc-modified-id="Module-config.py-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Module <code>config.py</code></a></span></li><li><span><a href="#Modules-data.py-and-plot.py" data-toc-modified-id="Modules-data.py-and-plot.py-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Modules <code>data.py</code> and <code>plot.py</code></a></span></li></ul></div>

# Show the disaggregator overview

In [None]:
from IPython.display import Image
pic = Image(filename='./docs/_static/overview.png')
pic

# Introduction

## Module `config.py`

The `config.py` module is responsible for all configuration needs of the program. This includes, but is not limited to:
- making I/O paths available,
- processing the contents of the `config.yaml` file,
- establishing a connection to the demandregio database through a RESTful API, and
- providing data assignments based on dictionaries.

In [None]:
from disaggregator import config

Load the configuration from the ``config.yaml`` file and look up a value by its key, e.g.:

In [None]:
cfg = config.get_config()
cfg['database_host']

A typical example of such assignments is the mapping between the NUTS-3 code of each of the 401 regions in Germany and its real-world name. It can be accessed through:

In [None]:
dict_nuts3_name = config.dict_region_code(keys='natcode_nuts3', values='name')
dict_nuts3_name
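To picture what such an assignment looks like, here is a minimal sketch with a plain Python dictionary. The three entries and their shortened names are an illustrative excerpt chosen for this example; the exact name strings returned by `config.dict_region_code()` may differ.

```python
# Illustrative excerpt only: three NUTS-3 codes with shortened names.
# The dictionary returned by config.dict_region_code() covers all regions
# and may spell the names differently.
nuts3_to_name = {
    'DE212': 'München',
    'DE300': 'Berlin',
    'DE600': 'Hamburg',
}

# Forward lookup: code -> name
print(nuts3_to_name['DE300'])

# Reverse lookup: swap keys and values (safe here because the names are unique)
name_to_nuts3 = {name: code for code, name in nuts3_to_name.items()}
print(name_to_nuts3['Hamburg'])
```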

## Modules `data.py` and `plot.py`

The `data.py` module is responsible for providing
- all relevant datasets (dimensionless, spatial, temporal and spatiotemporal) in a clear and structured manner,
- comfortable access to the demandregio database, and
- some handy utility functions.

The `plot.py` module provides plotting functions for
- spatial data: geographical choropleth maps,
- temporal data: multidimensional line/bar/scatter charts, and
- spatiotemporal data: animations.

In [None]:
from disaggregator import data, plot

The demandregio database contains both **spatial** and **temporal** datasets.

To see which **spatial datasets** are available, we do:

In [None]:
df_spatial = data.database_description('spatial', force_update=True)
df_spatial.head()

... and for **temporal datasets** we do:

In [None]:
df_temporal = data.database_description('temporal', force_update=True)
df_temporal.head()

In [None]:
df_temporal.columns

Load the **population** per region. This dataset is one-dimensional and is returned as a ``pandas.Series``:

In [None]:
df_pop = data.population(year=2000)
df_pop.head()

Load the **household sizes** per region. This dataset is two-dimensional and is returned as a ``pandas.DataFrame``:

In [None]:
df_HH = data.households_per_size()
df_HH.head()

So single-person households are in column ``1``, households with two persons are in column ``2``, and so on.  
***Please note***: Column ``6`` contains all households with more than five persons.

Now, you might find the values in one region remarkable (very high, very low, varying in size, ...) compared with the surrounding regions, e.g.:

In [None]:
df_HH.loc['DE27D':'DE402']

As you can see, the values for the region with NUTS-3 ID `DE300` are a lot higher than those in the two neighbouring rows. In such a case, it is useful to quickly look up the name of that region for a better understanding. This can be done easily with the function `append_region_name(df)`:

In [None]:
data.append_region_name(df_HH.loc['DE27D':'DE402'])

Or, a bit more elegant and pythonic, just like this:

In [None]:
df_HH.loc['DE27D':'DE402'].pipe(data.append_region_name)

After this step it becomes clear why this region's values are that high: it is simply Berlin, the biggest city.
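The two call styles shown above give the same result, because `DataFrame.pipe(func)` simply calls `func(df)`. The following sketch demonstrates this with a hypothetical stand-in function `append_name` and a made-up lookup table; it is not the implementation of `data.append_region_name`, and the numbers are invented.

```python
import pandas as pd

# Hypothetical lookup table, standing in for the one used by the package.
NAMES = {'DE300': 'Berlin', 'DE600': 'Hamburg'}

def append_name(df):
    """Return a copy of df with an extra 'region_name' column looked up from the index."""
    out = df.copy()
    out['region_name'] = out.index.map(NAMES)
    return out

# Made-up household counts, indexed by NUTS-3 code.
df = pd.DataFrame({1: [10, 20], 2: [5, 8]}, index=['DE300', 'DE600'])

direct = append_name(df)      # plain function call
piped = df.pipe(append_name)  # same call, written as a method chain

print(direct.equals(piped))
```

The `pipe` form pays off mainly in longer method chains, where it keeps the steps readable from left to right.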

One further important dataset contains the **living space in [m²]** by _building type_ for each region. The building types are:
- `1FH`: one-family house
- `2FH`: two-family house
- `MFH_03_06`: multi-family house for 3-6 families
- `MFH_07_12`: multi-family house for 7-12 families
- `MFH_13_99`: multi-family house for >12 families

In [None]:
df_ls = data.living_space()
df_ls.head()

Now let's plot these datasets as a choropleth map:

In [None]:
fig, ax = plot.choropleth_map(df_pop/1e6, relative=False, unit='Mio. cap', axtitle='Population absolute')

As you can see, this is an <u>absolute</u> illustration, as it just shows the number of persons living in each region.  
  
Though this might be the most intuitive way, it has **two severe problems**:  
1. The population of the biggest city (Berlin) is almost *twice as large* as that of the second-largest city (Hamburg) and more than three times as large as that of the fourth-largest city (Cologne), while most rural areas have **far fewer** residents. The colorbar scale is therefore dominated by a few large values and conveys little information about the rest.  
2. The illustration does not take into account the size of the different regions. So even if, theoretically, all people were distributed equally over Germany, bigger regions would always show more residents than smaller ones.  
  
The **solution** is a <u>relative</u> illustration, showing the population of each region divided by its area in square kilometers:

In [None]:
fig, ax = plot.choropleth_map(df_pop, relative=True, unit='cap', axtitle='Population relative per km²')

This graphic shows much more clearly which areas are more densely and which are less densely populated.  
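The normalisation behind the relative view can be pictured as a simple division of population by area. The sketch below assumes exactly that; the region names and the rounded figures are invented for illustration and are not taken from the demandregio database, and the internals of `plot.choropleth_map` may differ.

```python
# Illustrative, rounded figures (NOT taken from the demandregio database).
population = {'city_a': 3_600_000, 'district_b': 120_000}  # residents
area_km2 = {'city_a': 900.0, 'district_b': 1_500.0}        # square kilometers

# Population density = residents per square kilometer
density = {region: population[region] / area_km2[region] for region in population}

for region, value in density.items():
    print(f'{region}: {value:.0f} cap/km²')
```

Although `district_b` covers the larger area, its density is far lower, which is exactly the information the absolute map hides.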

However, Berlin and Munich, with roughly 4000 or more residents/km², still stand out. To see where other densely populated areas are, it can help to limit the color interval to the range from zero to 3000:

In [None]:
fig, ax = plot.choropleth_map(df_pop, relative=True, unit='cap', axtitle='Population relative', interval=(0,3000))
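To see why a limited interval brings out more detail, consider how a value is mapped onto the colour scale. The sketch below assumes a linear mapping onto the range 0 to 1 that saturates outside the interval; the helper `scale_to_interval` and the density values are hypothetical and only illustrate the idea, not the package's actual implementation.

```python
def scale_to_interval(value, vmin, vmax):
    """Map value linearly onto [0, 1], saturating outside [vmin, vmax]."""
    clipped = min(max(value, vmin), vmax)
    return (clipped - vmin) / (vmax - vmin)

# Made-up densities in cap/km², including one outlier at 4000.
densities = [80.0, 1500.0, 3000.0, 4000.0]

# Scale spanning the full data range: the outlier defines the upper end.
full_range = [scale_to_interval(d, 0, max(densities)) for d in densities]

# Scale limited to (0, 3000): everything at or above 3000 gets the top colour.
limited = [scale_to_interval(d, 0, 3000) for d in densities]

print(full_range)
print(limited)
```

With the limited interval, the mid-range value of 1500 moves from 0.375 to 0.5 on the colour scale, so differences among the non-outlier regions become easier to see, at the price of no longer distinguishing the regions above 3000.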

Now let's plot the households on a map:

In [None]:
fig, ax = plot.choropleth_map(df_HH, relative=True, unit='households', axtitle='Households /w', colorbar_each_subplot=True, add_percentages=False)

As you can see, the framework automatically recognizes that this dataset contains several data columns and creates a subplot for each column.
Still, you might not be interested in the distribution of each household size but in the sum over all sizes, e.g. to check whether the distribution of households corresponds to the distribution of the population. Let's do this:

In [None]:
fig, ax = plot.choropleth_map(df_HH.sum(axis=1), relative=True, unit='households', axtitle='Sum of Households')
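The call `df_HH.sum(axis=1)` adds up the columns within each row, so the result is a one-dimensional `pandas.Series` with one total per region. A small sketch with made-up numbers and placeholder region labels:

```python
import pandas as pd

# Made-up household counts: rows are regions, columns are household sizes.
hh = pd.DataFrame(
    {1: [100, 40], 2: [60, 30], 3: [20, 10]},
    index=['region_a', 'region_b'],
)

# axis=1 sums across the columns, leaving one value per row (region).
totals = hh.sum(axis=1)
print(totals)
```

Because the result has only one value per region, the map in the cell above consists of a single plot instead of one subplot per column.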

So what about the living space distribution? Is it comparable to that of the households?

In [None]:
fig, ax = plot.choropleth_map(df_ls, relative=True, unit='m²', axtitle='Living spaces in', colorbar_each_subplot=True)

Now let's have a look at the **income distribution**:

In [None]:
df_inc = data.income(by='population')
df_inc.head()

In [None]:
fig, ax = plot.choropleth_map(df_inc/1e3, relative=False, unit='1000 €/cap.', axtitle='Income per capita')

Now save this figure, e.g. as a PDF file:

In [None]:
from disaggregator.config import data_out
fig.savefig(data_out('income_distribution.pdf'), bbox_inches='tight')

By the way, it is always possible to change the underlying colormap:

In [None]:
fig, ax = plot.choropleth_map(df_inc/1e3, relative=False, unit='1000 €/cap.', axtitle='Income per capita', cmap='gist_rainbow')