Merge pull request #142 from JannisHoch/update_docs
Update docs
JannisHoch committed Jun 16, 2021
2 parents 347d20e + b663a44 commit cefb89a
Showing 48 changed files with 1,579 additions and 1,354 deletions.
53 changes: 21 additions & 32 deletions README.rst
@@ -1,9 +1,6 @@
===============
Overview
===============

-CoPro
------------------
-===============

Welcome to CoPro, a machine-learning tool for conflict risk projections based on climate, environmental, and societal drivers.

@@ -29,12 +26,12 @@ Welcome to CoPro, a machine-learning tool for conflict risk projections based on
:target: https://joss.theoj.org/papers/1f03334e56413ff71f65092ecc609aa4

.. image:: https://mybinder.org/badge_logo.svg
-   :target: https://mybinder.org/v2/gh/JannisHoch/copro/to_binder?filepath=%2Fpresentations%2FvEGU21.ipynb
+   :target: https://mybinder.org/v2/gh/JannisHoch/copro/update_docs?filepath=%2Fexample%2Fnb_binder.ipynb

Model purpose
--------------

-As primary model output, CoPro provides maps of conflict risk (defined as the fraction conflict predictions of all predictions).
+As primary model output, CoPro provides maps of conflict risk.

To that end, it employs observed conflicts as target data together with (user-provided) socio-economic and environmental sample data to train different classifiers (RFClassifier, kNearestClassifier, and Support Vector Classifier).
While the samples keep the units of the underlying data, the target value is converted to a Boolean, where 0 indicates no conflict occurrence and 1 indicates occurrence.
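As a rough illustration of this setup (not CoPro's actual code; the data, variable names, and classifier settings below are made up), the three classifier types named above can be trained with scikit-learn on a Boolean target:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# hypothetical sample data: 500 data points, 3 drivers (e.g. climate, environment, society)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
# Boolean target: 0 = no conflict occurrence, 1 = occurrence (synthetic rule)
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

risks = {}
for clf in (RandomForestClassifier(random_state=0),
            KNeighborsClassifier(),
            SVC(probability=True, random_state=0)):
    clf.fit(X_train, y_train)
    # conflict risk read as the fraction of conflict predictions of all predictions
    risks[type(clf).__name__] = clf.predict(X_test).mean()
```

Each entry of ``risks`` is a fraction between 0 and 1, mirroring how a per-polygon conflict-risk value can be derived from repeated Boolean predictions.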
@@ -89,18 +86,19 @@ To run the model from command line, a command line script is provided. The usage
Usage: copro_runner [OPTIONS] CFG
-Main command line script to execute the model. All settings are read from
-cfg-file. One cfg-file is required argument to train, test, and evaluate
-the model. Additional cfg-files can be provided as optional arguments,
-whereby each file corresponds to one projection to be made.
+Main command line script to execute the model.
+All settings are read from cfg-file.
+One cfg-file is a required argument to train, test, and evaluate the model.
+Multiple classifiers are trained based on different train-test data combinations.
+Additional cfg-files for multiple projections can be provided as optional arguments, whereby each file corresponds to one projection to be made.
+Per projection, each classifier is used to create separate projection outcomes per time step (year).
+All outcomes are combined after each time step to obtain the common projection outcome.
Args: CFG (str): (relative) path to cfg-file
Options:
-  -proj, --projection-settings PATH path to cfg-file with settings for a projection run
-  -v, --verbose command line switch to turn on verbose mode
+  -plt, --make_plots add additional output plots
+  -v, --verbose command line switch to turn on verbose mode
   --help Show this message and exit.
This help information can also be accessed with

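A minimal, hypothetical mirror of such a click-based command line interface is sketched below (the real ``copro_runner`` implementation differs; only the option names follow the help text above):

```python
import click

@click.command()
@click.argument('cfg', type=click.Path())
@click.option('-plt', '--make_plots', is_flag=True, help='add additional output plots')
@click.option('-v', '--verbose', is_flag=True, help='command line switch to turn on verbose mode')
def cli(cfg, make_plots, verbose):
    """Main command line script to execute the model."""
    if verbose:
        click.echo('verbose mode on')
    # all settings would be read from the cfg-file here
    click.echo('reading all settings from {}'.format(cfg))

if __name__ == '__main__':
    cli()
```

Registering such a command as a console script (e.g. via ``setup.py`` entry points) is what makes it callable as ``copro_runner`` from the shell.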
@@ -120,23 +118,18 @@ Example data
----------------

Example data for demonstration purposes can be downloaded from `Zenodo <https://zenodo.org/record/4297295>`_.
-To facilitate this process, the bash-script ``download_example_data.sh`` can be called in the example folder.
+To facilitate this process, the bash-script ``download_example_data.sh`` can be called in the example folder under ``/_scripts``.

With this (or other) data, the provided configuration-files (cfg-files) can be used to perform a reference run or a projection run.
All output is stored in the output directory specified in the cfg-files.
In the output directory, two folders are created: one named ``_REF`` for output from the reference run, and one named ``_PROJ`` for output from projections.

Jupyter notebooks
^^^^^^^^^^^^^^^^^^

There are multiple Jupyter notebooks available to guide you through the model application process step-by-step.
They can all be run and converted to html-files by executing the provided shell-script.

.. code-block:: console

    $ cd path/to/copro/example
    $ sh run_notebooks.sh

-It is of course also possible to execute the notebook cell-by-cell and explore the full range of possibilities.
+It is possible to execute the notebooks cell-by-cell and explore the full range of possibilities.
Note that in this case the notebooks need to be run in the right order as some temporary files will be saved to file in one notebook and loaded in another!
This is due to the re-initialization of the model at the beginning of each notebook and the resulting deletion of all files in existing output folders.

@@ -147,24 +140,20 @@ Command-line

While the notebooks are great for exploring, the command line script is the envisaged way to use CoPro.

-To only test the model for the reference situation, the cfg-file is the required argument.
-
-To make a projection, both cfg-files need to be specified with the latter requiring the -proj flag.
-If more projections are ought to be made, multiple cfg-files can be provided with the -proj flag.
+To only test the model for the reference situation and one projection, the cfg-file for the reference run is the required argument.
+In turn, this cfg-file needs to point to the cfg-file of the projection.

.. code-block:: console

    $ cd path/to/copro/example
    $ copro_runner example_settings.cfg
    $ copro_runner example_settings.cfg -proj example_settings_proj.cfg

Alternatively, the same commands can be executed using a bash-file.

.. code-block:: console

-   $ cd path/to/copro/example
-   $ sh run_script_reference.sh
-   $ sh run_script_projections.sh
+   $ cd path/to/copro/example/_scripts
+   $ sh run_command_line_script.sh

Validation
^^^^^^^^^^^^^^^^^^
@@ -175,7 +164,7 @@ The selected classifier is trained and validated against this data.
Main validation metrics are the ROC-AUC score as well as accuracy, precision, and recall.
All metrics are reported and written to file per model evaluation.

-With the example data downloadable from `Zenodo <https://zenodo.org/record/4297295>`_, a ROC-AUC score of 0.82 can be obtained.
+With the example data downloadable from `Zenodo <https://zenodo.org/record/4297295>`_, a ROC-AUC score of above 0.8 can be obtained.
Note that with additional and more explanatory sample data, the score will most likely increase.
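As a hedged sketch (not CoPro's own evaluation code), these metrics can be computed with scikit-learn from observed targets and predicted probabilities; the vectors below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

# toy observed conflict targets and predicted conflict probabilities
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.9, 0.2, 0.6])
# class prediction at the default 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_test, y_prob)   # ranking quality, threshold-independent
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
```

Note that the ROC-AUC score is computed from probabilities, while accuracy, precision, and recall require thresholded class predictions.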

.. figure:: docs/_static/roc_curve.png
18 changes: 9 additions & 9 deletions copro/conflict.py
@@ -3,7 +3,7 @@
import pandas as pd
import numpy as np
import os, sys
-import math
+import click

def conflict_in_year_bool(config, conflict_gdf, extent_gdf, sim_year, out_dir):
"""Creates a list for each timestep with boolean information whether a conflict took place in a polygon or not.
@@ -27,7 +27,8 @@ def conflict_in_year_bool(config, conflict_gdf, extent_gdf, sim_year, out_dir):
# select the entries which occurred in this year
temp_sel_year = conflict_gdf.loc[conflict_gdf.year == sim_year]

-    assert (len(temp_sel_year) != 0), AssertionError('ERROR: no conflicts were found in sampled conflict data set for year {}'.format(sim_year))
+    if len(temp_sel_year) == 0:
+        click.echo('WARNING: no conflicts were found in sampled conflict data set for year {}'.format(sim_year))

# merge the dataframes with polygons and conflict information, creating a sub-set of polygons/regions
data_merged = gpd.sjoin(temp_sel_year, extent_gdf)
@@ -54,10 +55,6 @@ def conflict_in_year_bool(config, conflict_gdf, extent_gdf, sim_year, out_dir):
if config.getboolean('general', 'verbose'): print('DEBUG: storing boolean conflict map of year {} to file {}'.format(sim_year, os.path.join(out_dir, 'conflicts_in_{}.csv'.format(sim_year))))
# data_stored = pd.merge(bool_per_poly, global_df, on='ID', how='right').fillna(0)
data_stored = pd.merge(bool_per_poly, global_df, on='ID', how='right').dropna()
-    # print(global_df.head())
-    # print(bool_per_poly.head())
-    # data_stored = global_df.merge(bool_per_poly, left_index=True, right_index=True, how='left')
-    # print(data_stored)
data_stored.index = data_stored.index.rename('watprovID')
data_stored = data_stored.drop('geometry', axis=1)
data_stored = data_stored.astype(int)
@@ -167,9 +164,6 @@ def read_projected_conflict(extent_gdf, bool_conflict, check_neighbors=False, ne

if check_neighbors:

-            # if neighboring_matrix == None:
-            #     raise ValueError('ERROR: if check_neighbors=True, a matrix with neihgbouring polygons needs to be provided too!')

# determine log-scaled number of conflict events in neighboring polygons
val = calc_conflicts_nb(i_poly, neighboring_matrix, bool_conflict)
# append resulting value
@@ -287,6 +281,8 @@ def split_conflict_geom_data(X):
arrays: separate arrays with ID, geometry, and actual data
"""

+    # first column corresponds to ID, second to geometry
+    # all remaining columns are actual data
X_ID = X[:, 0]
X_geom = X[:, 1]
X_data = X[: , 2:]
@@ -308,10 +304,14 @@ def get_pred_conflict_geometry(X_test_ID, X_test_geom, y_test, y_pred, y_prob_0,
dataframe: dataframe with each input list as column plus computed 'correct_pred'.
"""

+    # stack separate columns horizontally
arr = np.column_stack((X_test_ID, X_test_geom, y_test, y_pred, y_prob_0, y_prob_1))

+    # convert array to dataframe
df = pd.DataFrame(arr, columns=['ID', 'geometry', 'y_test', 'y_pred', 'y_prob_0', 'y_prob_1'])

+    # compute whether a prediction is correct
+    # if so, assign 1; otherwise, assign 0
df['correct_pred'] = np.where(df['y_test'] == df['y_pred'], 1, 0)

return df
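The ``correct_pred`` logic added above can be illustrated on toy vectors (the values below are hypothetical, not CoPro output):

```python
import numpy as np
import pandas as pd

# toy observed and predicted conflict values for four polygons
df = pd.DataFrame({'y_test': [0, 1, 1, 0],
                   'y_pred': [0, 1, 0, 0]})

# 1 where prediction matches observation, else 0
df['correct_pred'] = np.where(df['y_test'] == df['y_pred'], 1, 0)
```

Averaging ``correct_pred`` over all rows directly yields the accuracy of the predictions.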
39 changes: 21 additions & 18 deletions copro/data.py
@@ -7,7 +7,7 @@


def initiate_XY_data(config):
"""Initiates an empty dictionary to contain the XY-data for each polygon.
"""Initiates an empty dictionary to contain the XY-data for each polygon, ie. both sample data and target data.
This is needed for the reference run.
By default, the first column is for the polygon ID, the second for polygon geometry.
The antepenultimate column is for boolean information about conflict at t-1 while the penultimate column is for boolean information about conflict at t-1 in neighboring polygons.
@@ -22,6 +22,8 @@ def initiate_XY_data(config):
dict: empty dictionary to be filled, containing keys for each variable (X), binary conflict data (Y) plus meta-data.
"""

+    # Initialize dictionary
+    # some entries are set by default, besides the ones corresponding to input data variables
XY = {}
XY['poly_ID'] = pd.Series()
XY['poly_geometry'] = pd.Series()
@@ -39,7 +41,7 @@ def initiate_XY_data(config):
return XY

def initiate_X_data(config):
"""Initiates an empty dictionary to contain the X-data for each polygon.
"""Initiates an empty dictionary to contain the X-data for each polygon, ie. only sample data.
This is needed for each time step of each projection run.
By default, the first column is for the polygon ID and the second for polygon geometry.
The penultimate column is for boolean information about conflict at t-1 while the last column is for boolean information about conflict at t-1 in neighboring polygons.
@@ -50,8 +52,10 @@ def initiate_X_data(config):
Returns:
dict: empty dictionary to be filled, containing keys for each variable (X) plus meta-data.
"""

"""

+    # Initialize dictionary
+    # some entries are set by default, besides the ones corresponding to input data variables
X = {}
X['poly_ID'] = pd.Series()
X['poly_geometry'] = pd.Series()
@@ -68,7 +72,7 @@ def initiate_X_data(config):
return X

def fill_XY(XY, config, root_dir, conflict_data, polygon_gdf, out_dir):
"""Fills the XY-dictionary with data for each variable and conflict for each polygon for each simulation year.
"""Fills the (XY-)dictionary with data for each variable and conflict for each polygon for each simulation year.
The number of rows should therefore equal the number of simulation years times the number of polygons.
    At the end of the last simulation year, the dictionary is converted to a numpy-array.
@@ -78,6 +82,7 @@ def fill_XY(XY, config, root_dir, conflict_data, polygon_gdf, out_dir):
root_dir (str): path to location of cfg-file.
conflict_data (geo-dataframe): geo-dataframe containing the selected conflicts.
polygon_gdf (geo-dataframe): geo-dataframe containing the selected polygons.
+        out_dir (path): path to output folder.
Raises:
Warning: raised if the datetime-format of the netCDF-file does not match conventions and/or supported formats.
@@ -162,8 +167,6 @@ def fill_XY(XY, config, root_dir, conflict_data, polygon_gdf, out_dir):
if config.getboolean('general', 'verbose'): click.echo('DEBUG: all data read')

df_out = pd.DataFrame.from_dict(XY)

-    df_corr = evaluation.calc_correlation_matrix(df_out.drop(columns='poly_ID'), out_dir)

return df_out.to_numpy()
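The final conversion step can be illustrated with a hypothetical mini-dictionary (keys and values below are made up, not CoPro's real variables):

```python
import pandas as pd

# hypothetical XY-dictionary: two polygons, one variable, one conflict column
XY = {'poly_ID': pd.Series([1, 2]),
      'precipitation': pd.Series([0.5, 0.7]),
      'conflict': pd.Series([0, 1])}

# dictionary of Series -> DataFrame -> plain numpy-array
arr = pd.DataFrame.from_dict(XY).to_numpy()
```

Each dictionary key becomes one column of the resulting array, so the row count equals the number of polygon/year entries.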

@@ -270,31 +273,31 @@ def fill_X_conflict(X, config, conflict_data, polygon_gdf):
return X

def split_XY_data(XY, config):
"""Separates the XY-array into array containing information about variable values (X-array) and conflict data (Y-array).
"""Separates the XY-array into array containing information about variable values (X-array or sample data) and conflict data (Y-array or target data).
Thereby, the X-array also contains the information about unique identifier and polygon geometry.
Args:
XY (array): array containing variable values and conflict data.
config (ConfigParser-object): object containing the parsed configuration-settings of the model.
Returns:
-        arrays: two separate arrays, the X-array and Y-array
+        arrays: two separate arrays, the X-array and Y-array.
"""

+    # convert array to dataframe for easier handling
XY = pd.DataFrame(XY)
if config.getboolean('general', 'verbose'): click.echo('DEBUG: number of data points including missing values: {}'.format(len(XY)))

-    # some debugging, seems that for some reason popluation data is not added to values
-    # test_df = XY[XY.isna().any(axis=1)]
-    # test_df.drop(test_df.columns[[1]], axis = 1, inplace=True)
-    # test_df.to_csv(os.path.join(os.path.abspath(config.get('general', 'output_dir')), '_REF', 'test_df.csv'))

-    # if config.getboolean('general', 'verbose'): click.echo('DEBUG: exluding polygons containing NaNs: {}'.format(X[X.isna().any(axis=1)]))
-    # X = X.dropna()
+    # fill all missing values with 0
XY = XY.fillna(0)

+    # convert dataframe back to array
    XY = XY.to_numpy()

-    X = XY[:, :-1] # since conflict is the last column, we know that all previous columns must be variable values
+    # get X data
+    # since conflict is the last column, we know that all previous columns must be variable values
+    X = XY[:, :-1]
+    # get Y data and convert to integer values
Y = XY[:, -1]
Y = Y.astype(int)
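The X/Y split shown above can be demonstrated on a toy XY-array (values are made up):

```python
import numpy as np

# toy XY-array: last column is the Boolean conflict target,
# all preceding columns are variable values
XY = np.array([[0.2, 0.0],
               [0.8, 1.0],
               [0.4, 0.0]])

X = XY[:, :-1]             # sample data
Y = XY[:, -1].astype(int)  # target data as integers
```

Keeping the target in the last column makes the split a single slicing operation, independent of how many variables are used.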

@@ -306,7 +309,7 @@

def neighboring_polys(config, extent_gdf, identifier='watprovID'):
"""For each polygon, determines its neighboring polygons.
-    As result, a n times n look-up dataframe is obtained containing, where n is number of polygons in extent_gdf.
+    As a result, a (n x n) look-up dataframe is obtained, where n is the number of polygons in extent_gdf.
Args:
config (ConfigParser-object): object containing the parsed configuration-settings of the model.
