In [1]:
import logging
import views
logging.basicConfig(
    level=logging.DEBUG,
    format=views.config.LOGFMT,
)

# New repository workshop 2020.06.11

### Before we start

* Hello! I'm Frederick.
* Audio / Video OK?
* Please ask questions as we go.
* Does everyone have a working copy of code and data? 
* Can you all see this notebook and follow along?

## Overview

* New repository, why?
* Layout, what's in here?

### Another repository?

    "Those who cannot remember the past are condemned to repeat it."

#### The original
First there was 
* the original ViEWS github repository. ( https://github.com/UppsalaConflictDataProgram/views/ )
* With an old public mirror at https://github.com/UppsalaConflictDataProgram/openviews

Pros/cons?

* (+) Still working to churn out monthly forecasts.
* (-) There was no package structure to this repository, no `import views`.
* (-) Model specification was a janky mess of scripts that generated paramfiles using black magic.
* (-) Very difficult to maintain and use.

#### The gitlab

After the first paper was published I (frehoy) created a new repository to structure the code and make it easier to work with, while keeping the original repository for monthly forecast production.
It exists at https://gitlab.com/frehoy/views and is currently private, ask for an invite if you want one.
That repository is now where all ViEWS functionality that went into the JPR 2020 paper resides.
Just before JPR 2020 publication it will be pushed as a separate branch to https://github.com/UppsalaConflictDataProgram/openviews2

* (+) Has a package that you can `import views` from
* (+) Has a lot of cool functionality
* (-) Relies heavily on naming conventions and passing dictionaries around.
* (-) Has a lot of half-finished, in-a-hurry functionality
* (-) Doesn't include some core database functionality
* (-) Isn't on github under the UCDP organisation.

#### This one (2)

This new repository, with
* a private version here:  https://github.com/UppsalaConflictDataProgram/views2/
* and a public copy here: https://github.com/UppsalaConflictDataProgram/openviews2

aims to fix the negatives of the gitlab repository and bring balance to the force.

* 📦 Has a package that you can `import views` from.
* 🧠 Has a clear way to go from model/ensemble development to production.
* 👮‍♂️ Has a stricter layout.
* 📕 Has a manual!
* ✅ Has tests.
* 👩‍🎨 Has a code formatter.
* ⚠️ Has type checking.
* 😡 Doesn't yet have all the functionality from the gitlab repository.
* 🏗 Is a work in progress.


# Repository layout

## / (root)

* README.md, with basic installation and tooling instructions.
* install_views2.sh, the installer for MacOS and Linux.
* run_tools.sh for linting, testing, formatting, type checking and building documentation.
* config.yaml 
* storage directory

## projects

* Holds one directory per project.
* Everything that doesn't need to be imported from somewhere else belongs in a project.
* Each project should have a README.md at its root to aid when navigating through github.
* This workshop is a project. 

## docs

Documentation is built using [sphinx](https://www.sphinx-doc.org/en/master/usage/quickstart.html), which is still a bit of a mystery to me but extremely powerful.

* Human written files go in `docs/human_source/`
* Rebuilt and compiled to HTML by run_tools.sh. The HTML is actually nicer than the .pdf and can be viewed by opening `/docs/_build/html/index.html` (SHOW!)
* Rebuilt and compiled to .pdf (via LaTeX) by docs/build_docs.sh (SHOW!)
* Built documentation isn't stored in git, only the source files.

## runners

Everything in production usage is done via scripts, not notebooks.
In Python we distinguish between **scripts**, which are run, and **modules**, which are imported.

* All the .py files under the `views/` directory are **modules** to be imported.
* **Scripts** that are executed go in `runners/`

All runners take some arguments. To see which arguments are taken do

`python runners/predict.py --help` (SHOW!)
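
As a rough illustration of where a runner's `--help` text comes from, here is a minimal sketch of a script built on `argparse`. The flag names `--run-id` and `--loa` and their choices are hypothetical and are not taken from `runners/predict.py`; use `--help` on the real runner to see its actual arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the command line parser for a hypothetical runner."""
    parser = argparse.ArgumentParser(
        description="Example runner (illustrative only)."
    )
    # Both flags below are made up for this sketch.
    parser.add_argument("--run-id", required=True, help="Identifier of the run.")
    parser.add_argument(
        "--loa", choices=["cm", "pgm"], default="cm", help="Level of analysis."
    )
    return parser


def main(argv=None) -> str:
    """Parse the arguments and return a summary of what would be run."""
    args = build_parser().parse_args(argv)
    return f"run_id={args.run_id} loa={args.loa}"


# A real script would call main() so that sys.argv is used.
# An explicit list is passed here so the sketch also runs in a notebook.
print(main(["--run-id", "d_2020_05_01_prelim", "--loa", "pgm"]))
```

Keeping the parsing in `build_parser()` and the work in `main()` means the same logic can be imported and tested like a module while still being executed as a script.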

## misc

Misc contains the environment.yaml file. Use it to add requirements. 


## views

All importable functionality lives in the views package.

In [2]:
import views

To see available documentation (from docstrings) for any object when running in jupyter simply add a question mark after it:

In [3]:
views.Period?

Init signature:
views.Period(
    name: str,
    train_start: int,
    train_end: int,
    predict_start: int,
    predict_end: int,
) -> None
Docstring:      Defines a time period of training and predicting.
File:           ~/github/ViEWS2/views/apps/model/api.py
Type:           type
Subclasses:


Or, to list all the available subpackages in views, type `views.` and press TAB.

In [4]:
# Add ., press TAB.
views.utils

<module 'views.utils' from '/Users/frehoy/github/ViEWS2/views/utils/__init__.py'>

In jupyter or a configured IDE or text editor an autocomplete list should appear.

Now to the structure of the `views` package

## `__init__.py`


When you do `import views` you are actually importing the file `views/__init__.py`.
For details on how packages work see the [official docs](https://docs.python.org/3/tutorial/modules.html#packages).

An `__init__.py` should exist in every directory that has python modules to tell python "here is a package you can import".
It can be empty but a good idea is to import the package contents that you want to expose in the `__init__.py`.
For `views` a few key features are imported in this top level `__init__.py` so that you can do

In [5]:
from views import Model

which is a bit nicer than the full path: 

In [6]:
from views.apps.model.api import Model

The goal is to expose most things that people will use directly at this top level.
Care must be taken to avoid circular import errors though.
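
To make the re-export pattern concrete, the sketch below builds a throwaway package on disk and imports the same class through both paths. The package name `demo_pkg` and its contents are made up for illustration; only the directory layout mimics `views/apps/model/api.py`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; "demo_pkg" and its layout are made up.
root = Path(tempfile.mkdtemp())
api_dir = root / "demo_pkg" / "apps" / "model"
api_dir.mkdir(parents=True)

# Every directory gets an __init__.py; the nested ones stay empty.
(root / "demo_pkg" / "apps" / "__init__.py").write_text("")
(api_dir / "__init__.py").write_text("")
(api_dir / "api.py").write_text("class Model:\n    pass\n")

# The top-level __init__.py re-exports Model for the short import path.
(root / "demo_pkg" / "__init__.py").write_text(
    "from demo_pkg.apps.model.api import Model\n"
)

# Make the temporary directory importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from demo_pkg import Model as short_path
from demo_pkg.apps.model.api import Model as full_path

# Both names point at the same class object.
print(short_path is full_path)
```

A circular import would appear if `api.py` in turn did `from demo_pkg import ...` at module level, because the top-level `__init__.py` would still be halfway through executing at that point.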

#### views.DATASETS

`views.DATASETS` is a dictionary holding `Dataset` objects.
It is your one-stop shop for data, replacing the current `flat` schema in the database.

A Dataset is composed of:

In [7]:
dataset = views.DATASETS["cm_africa_imp_0"]
dataset?

Type:        Dataset
String form: Dataset(name='cm_africa_imp_0', ids=['month_id', 'country_id'], table_skeleton=Table(fqtable='ske <...> 87815b5a60>, <views.apps.data.api.Transform object at 0x7f87815c5b50>], balance=False, cols=None)
File:        ~/github/ViEWS2/views/apps/data/api.py
Docstring:
Represents a dataset

Args:
    name: A descriptive name
    ids: Identifier columns, should be 2
    table_skeleton: Table instance of the base table to join into
    tables: List of Tables to join in data from
    loa: Name of level of analysis used to get correct geometry
    transforms: List of Transforms to compute
    balance: Whether to make a balanced index of the dataset
    cols: List of columns to subset tables by


To get the data from a dataset do

In [8]:
# The magical .df attribute
df = dataset.df.loc[400:420]
df.head()

[2020-06-10 22:23:39,669] - views.utils.io:65 - DEBUG - Reading parquet at /Users/frehoy/github/ViEWS2/storage/data/datasets/cm_africa_imp_0.parquet with cols None
[2020-06-10 22:23:43,183] - views.utils.io:72 - DEBUG - Finished reading parquet from /Users/frehoy/github/ViEWS2/storage/data/datasets/cm_africa_imp_0.parquet.


| month_id | country_id | acled_count_ns | acled_count_os | acled_count_pr | acled_count_sb | cdum_1 | cdum_10 | cdum_100 | cdum_101 | cdum_102 | cdum_103 | ... | greq_25_splag_1_1_ged_best_ns | time_since_greq_5_splag_1_1_ged_best_ns | tlag_8_greq_1_ged_best_sb | tlag_11_greq_25_ged_best_os | tlag_12_vdem_v2xdd_i_or | greq_500_splag_1_1_ged_best_ns | time_since_greq_500_splag_1_1_ged_best_ns | tlag_6_acled_count_sb | time_since_greq_25_splag_1_1_ged_best_ns | tlag_3_greq_25_ged_best_os |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 400 | 40 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1397.0 | 0.0 | 0.0 | 0.0 | 0 | 1397.0 | 0.0 | 1397.0 | 0.0 |
| 400 | 41 | 2.0 | 1.0 | 18.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1.0 | 0.0 | 0.0 | 0.131 | 0 | 202.0 | 3.0 | 4.0 | 0.0 |
| 400 | 42 | 1.0 | 1.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 18.0 | 0.0 | 0.0 | 0.088 | 0 | 1397.0 | 0.0 | 26.0 | 0.0 |
| 400 | 43 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 18.0 | 0.0 | 0.0 | 0.736 | 0 | 1397.0 | 0.0 | 22.0 | 0.0 |
| 400 | 47 | 0.0 | 0.0 | 7.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1.0 | 0.0 | 0.0 | 0.131 | 0 | 229.0 | 0.0 | 4.0 | 0.0 |
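
The `.loc[400:420]` call above slices on the first index level, `month_id`. Here is a minimal sketch of that behaviour on a toy frame; the column `x` and all values are made up, and only the `(month_id, country_id)` index layout is taken from the output above.

```python
import pandas as pd

# Toy frame with a (month_id, country_id) index; values are made up.
index = pd.MultiIndex.from_product(
    [[399, 400, 401, 402], [40, 41]], names=["month_id", "country_id"]
)
toy = pd.DataFrame({"x": range(len(index))}, index=index)

# Label slices on .loc include both endpoints and apply to the first level.
subset = toy.loc[400:401]
months = sorted(subset.index.get_level_values("month_id").unique().tolist())
print(months, len(subset))
```

Unlike positional slicing, the end label is included, so `400:401` keeps every country for both months.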


## views/specs

### data
Datasets, Tables and Transforms are defined by `specs/data/spec.yaml`
Let's take a look!

In [9]:
from views.specs import data
print("Datasets:")
for name in data.DATASETS.keys():
    print("\t", name)
print("Tables")
for name in data.TABLES.keys():
    print("\t", name)

Datasets:
	 cm_global_imp_0
	 cm_global_imp_1
	 cm_global_imp_2
	 cm_global_imp_3
	 cm_global_imp_4
	 cm_africa_imp_0
	 cm_africa_imp_1
	 cm_africa_imp_2
	 cm_africa_imp_3
	 cm_africa_imp_4
	 pgm_global_imp_0
	 pgm_global_imp_1
	 pgm_global_imp_2
	 pgm_global_imp_3
	 pgm_global_imp_4
	 pgm_africa_imp_0
	 pgm_africa_imp_1
	 pgm_africa_imp_2
	 pgm_africa_imp_3
	 pgm_africa_imp_4
Tables
	 skeleton.cm_africa
	 skeleton.cm_global
	 skeleton.cy_africa
	 skeleton.cy_global
	 skeleton.pgm_africa
	 skeleton.pgm_global
	 skeleton.pgy_africa
	 skeleton.pgy_global
	 acled.cm
	 acled.pgm
	 cdum.c
	 fvp_v2.cy_imp_sklearn_0
	 fvp_v2.cy_imp_sklearn_1
	 fvp_v2.cy_imp_sklearn_2
	 fvp_v2.cy_imp_sklearn_3
	 fvp_v2.cy_imp_sklearn_4
	 ged.cm
	 ged.pgm_geoimp_0
	 ged.pgm_geoimp_1
	 ged.pgm_geoimp_2
	 ged.pgm_geoimp_3
	 ged.pgm_geoimp_4
	 icgcw_v2.cm
	 pgdata.pgy_imp_sklearn_0
	 pgdata.pgy_imp_sklearn_1
	 pgdata.pgy_imp_sklearn_2
	 pgdata.pgy_imp_sklearn_3
	 pgdata.pgy_imp_sklearn_4
	 reign_v2.cm_extrapolated

### periods

Defines the A, B and C train/test splits by `run_id`.

In [10]:
from views.specs.periods import get_periods_by_name
periods = get_periods_by_name(run_id="d_2020_05_01_prelim")
periods

[2020-06-10 22:23:43,221] - views.utils.io:107 - DEBUG - Loading YAML from /Users/frehoy/github/ViEWS2/views/specs/periods/periods.yaml


{'A': Period(name='A', train_start=121, train_end=396, predict_start=397, predict_end=432),
 'B': Period(name='B', train_start=121, train_end=432, predict_start=433, predict_end=468),
 'C': Period(name='C', train_start=121, train_end=480, predict_start=484, predict_end=521)}
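
To show how these fields can be used, here is a small stand-in for `views.Period` with the same five fields. The class name `PeriodSketch` and its two helper methods are hypothetical and not part of the package, and treating the `month_id` ranges as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodSketch:
    """Stand-in mirroring the fields shown for views.Period."""

    name: str
    train_start: int
    train_end: int
    predict_start: int
    predict_end: int

    def n_train_months(self) -> int:
        # month_id ranges are treated as inclusive in this sketch.
        return self.train_end - self.train_start + 1

    def is_out_of_sample(self) -> bool:
        # Prediction must start strictly after training ends.
        return self.train_end < self.predict_start


# Values copied from period A in the output above.
period_a = PeriodSketch("A", 121, 396, 397, 432)
print(period_a.n_train_months(), period_a.is_out_of_sample())
```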

### models - development

Let's hop to `projects/model_development/example.ipynb`

### models - production
Ok, now we've seen how models are specified in development. 

What about production?

Let's hop to `views/apps/pipeline/models_cm.py`

And `views/specs/models/cm.yaml`

## Data updates

Because data is stored in this normalised way, updating from the database is a two-step process:

In [11]:
# Don't actually run as it takes a few hours
if False:
    for table in views.TABLES.values():
        table.refresh()
    for dataset in views.DATASETS.values():
        dataset.refresh()
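
Why the order matters can be shown with a toy model. This sketch assumes that datasets are rebuilt from locally cached tables; `TableSketch` and `DatasetSketch` are hypothetical stand-ins and not the classes in `views.apps.data.api`.

```python
class TableSketch:
    """Stand-in for a cached table; not the views Table class."""

    def __init__(self, name, source):
        self.name = name
        self._source = source  # plays the role of the database
        self.cache = None

    def refresh(self):
        # Step 1: copy the latest rows from the source into the local cache.
        self.cache = list(self._source)


class DatasetSketch:
    """Stand-in for a dataset assembled from cached tables."""

    def __init__(self, tables):
        self.tables = tables
        self.rows = None

    def refresh(self):
        # Step 2: rebuild from whatever the table caches hold right now.
        self.rows = [row for table in self.tables for row in table.cache]


source = [1, 2]
demo_table = TableSketch("demo", source)
demo_dataset = DatasetSketch([demo_table])

demo_table.refresh()
demo_dataset.refresh()

source.append(3)        # new data arrives upstream
demo_dataset.refresh()  # refreshing only the dataset does not see it
stale = list(demo_dataset.rows)

demo_table.refresh()    # tables first ...
demo_dataset.refresh()  # ... then datasets
fresh = list(demo_dataset.rows)
print(stale, fresh)
```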
        

In [12]:
# For members of the public, to update tables do
if False:
    # Fetch .zipped tables and update cache
    views.apps.data.public.import_tables_and_geoms(
        tables=views.TABLES,
        geometries=views.GEOMETRIES,
        path_zip=views.apps.data.public.fetch_latest_zip_from_website(
            path_dir_destination=views.DIR_SCRATCH
        ),
    )
    # Update datasets
    for dataset in views.DATASETS.values():
        dataset.refresh()
    

The above should normally be run by the dedicated script

    python runners/refresh_data.py --all

for the ViEWS team or

    python runners/import_data --fetch --datasets

for the public.

## views/utils

`from views.utils import io, db, data as datautils, stats as statsutils`

The utils folder holds a few modules for dealing with io, data and database connectivity.

One particularly important function is `assign_into_df`:

    from views.utils.data import assign_into_df
    assign_into_df?

It lets you insert data into the same column multiple times without overwriting previously inserted values with missing values.
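
The problem this solves can be reproduced with plain pandas. The sketch below is not the implementation of `assign_into_df`; the helper `assign_sketch` is a hypothetical stand-in that only illustrates the idea of writing to the rows a series covers, assuming all of its index labels exist in the frame.

```python
import numpy as np
import pandas as pd


def assign_sketch(df: pd.DataFrame, s: pd.Series) -> pd.DataFrame:
    """Write s into df[s.name], touching only the rows s covers."""
    if s.name not in df.columns:
        df[s.name] = np.nan
    df.loc[s.index, s.name] = s
    return df


first = pd.Series([1.0, 2.0], index=[0, 1], name="pred")
second = pd.Series([3.0, 4.0], index=[2, 3], name="pred")

# Plain assignment realigns to the full index, so the second write
# replaces the first two rows with NaN.
naive = pd.DataFrame(index=range(4))
naive["pred"] = first
naive["pred"] = second

# Row-restricted assignment keeps what was already there.
kept = pd.DataFrame(index=range(4))
kept = assign_sketch(kept, first)
kept = assign_sketch(kept, second)

print(int(naive["pred"].isna().sum()), kept["pred"].tolist())
```

This pattern is handy when predictions for different periods or steps are written into one column in several passes.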


## views/database

Everything related to maintaining the database goes here.
"Clients" shouldn't need to use this at all.
