Merge pull request #142 from JannisHoch/update_docs
Update docs
JannisHoch committed Jun 16, 2021
2 parents 347d20e + b663a44 commit cefb89a
Showing 48 changed files with 1,579 additions and 1,354 deletions.
53 changes: 21 additions & 32 deletions README.rst
@@ -1,9 +1,6 @@
===============
Overview
===============

-CoPro
------------------
-===============

Welcome to CoPro, a machine-learning tool for conflict risk projections based on climate, environmental, and societal drivers.

@@ -29,12 +26,12 @@ Welcome to CoPro, a machine-learning tool for conflict risk projections based on
:target: https://joss.theoj.org/papers/1f03334e56413ff71f65092ecc609aa4

.. image:: https://mybinder.org/badge_logo.svg
-   :target: https://mybinder.org/v2/gh/JannisHoch/copro/to_binder?filepath=%2Fpresentations%2FvEGU21.ipynb
+   :target: https://mybinder.org/v2/gh/JannisHoch/copro/update_docs?filepath=%2Fexample%2Fnb_binder.ipynb

Model purpose
--------------

-As primary model output, CoPro provides maps of conflict risk (defined as the fraction conflict predictions of all predictions).
+As primary model output, CoPro provides maps of conflict risk.

To that end, it employs observed conflicts as target data together with (user-provided) socio-economic and environmental sample data to train different classifiers (RFClassifier, kNearestClassifier, and Support Vector Classifier).
While the samples keep the units of the underlying data, the target value is converted to a Boolean, where 0 indicates no conflict occurrence and 1 indicates occurrence.
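As a rough illustration of this setup (not CoPro's actual code; the data, variable names, and classifier settings below are made up), the three classifier types named above can be trained with scikit-learn on a Boolean target:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# hypothetical sample data: 500 data points, 3 drivers (e.g. climate, environment, society)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
# Boolean target: 0 = no conflict occurrence, 1 = occurrence (synthetic rule)
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

risks = {}
for clf in (RandomForestClassifier(random_state=0),
            KNeighborsClassifier(),
            SVC(probability=True, random_state=0)):
    clf.fit(X_train, y_train)
    # conflict risk read as the fraction of conflict predictions of all predictions
    risks[type(clf).__name__] = clf.predict(X_test).mean()
```

Each entry of ``risks`` is a fraction between 0 and 1, mirroring how a per-polygon conflict-risk value can be derived from repeated Boolean predictions.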
@@ -89,18 +86,19 @@ To run the model from command line, a command line script is provided. The usage
Usage: copro_runner [OPTIONS] CFG
-Main command line script to execute the model. All settings are read from
-cfg-file. One cfg-file is required argument to train, test, and evaluate
-the model. Additional cfg-files can be provided as optional arguments,
-whereby each file corresponds to one projection to be made.
+Main command line script to execute the model.
+All settings are read from cfg-file.
+One cfg-file is a required argument to train, test, and evaluate the model.
+Multiple classifiers are trained based on different train-test data combinations.
+Additional cfg-files for multiple projections can be provided as optional arguments, whereby each file corresponds to one projection to be made.
+Per projection, each classifier is used to create separate projection outcomes per time step (year).
+All outcomes are combined after each time step to obtain the common projection outcome.
Args: CFG (str): (relative) path to cfg-file
Options:
-  -proj, --projection-settings PATH path to cfg-file with settings for a projection run
-  -v, --verbose command line switch to turn on verbose mode
+  -plt, --make_plots add additional output plots
+  -v, --verbose command line switch to turn on verbose mode
   --help Show this message and exit.
This help information can also be accessed with

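A minimal, hypothetical mirror of such a click-based command line interface is sketched below (the real ``copro_runner`` implementation differs; only the option names follow the help text above):

```python
import click

@click.command()
@click.argument('cfg', type=click.Path())
@click.option('-plt', '--make_plots', is_flag=True, help='add additional output plots')
@click.option('-v', '--verbose', is_flag=True, help='command line switch to turn on verbose mode')
def cli(cfg, make_plots, verbose):
    """Main command line script to execute the model."""
    if verbose:
        click.echo('verbose mode on')
    # all settings would be read from the cfg-file here
    click.echo('reading all settings from {}'.format(cfg))

if __name__ == '__main__':
    cli()
```

Registering such a command as a console script (e.g. via ``setup.py`` entry points) is what makes it callable as ``copro_runner`` from the shell.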
@@ -120,23 +118,18 @@ Example data
----------------

Example data for demonstration purposes can be downloaded from `Zenodo <https://zenodo.org/record/4297295>`_.
-To facilitate this process, the bash-script ``download_example_data.sh`` can be called in the example folder.
+To facilitate this process, the bash-script ``download_example_data.sh`` can be called in the example folder under ``/_scripts``.

With this (or other) data, the provided configuration-files (cfg-files) can be used to perform a reference run or a projection run.
All output is stored in the output directory specified in the cfg-files.
In the output directory, two folders are created: one named ``_REF`` for output from the reference run, and one named ``_PROJ`` for output from projections.

Jupyter notebooks
^^^^^^^^^^^^^^^^^^

There are multiple Jupyter notebooks available to guide you through the model application process step-by-step.
They can all be run and converted to html-files by executing the provided shell-script.

.. code-block:: console

    $ cd path/to/copro/example
    $ sh run_notebooks.sh

-It is of course also possible to execute the notebook cell-by-cell and explore the full range of possibilities.
+It is possible to execute the notebooks cell-by-cell and explore the full range of possibilities.
Note that in this case the notebooks need to be run in the right order as some temporary files will be saved to file in one notebook and loaded in another!
This is due to the re-initialization of the model at the beginning of each notebook and the resulting deletion of all files in existing output folders.

@@ -147,24 +140,20 @@ Command-line

While the notebooks are great for exploring, the command line script is the envisaged way to use CoPro.

-To only test the model for the reference situation, the cfg-file is the required argument.
-
-To make a projection, both cfg-files need to be specified with the latter requiring the -proj flag.
-If more projections are ought to be made, multiple cfg-files can be provided with the -proj flag.
+To only test the model for the reference situation and one projection, the cfg-file for the reference run is the required argument.
+In turn, this cfg-file needs to point to the cfg-file of the projection.

.. code-block:: console

    $ cd path/to/copro/example
    $ copro_runner example_settings.cfg
    $ copro_runner example_settings.cfg -proj example_settings_proj.cfg

Alternatively, the same commands can be executed using a bash-file.

.. code-block:: console

-   $ cd path/to/copro/example
-   $ sh run_script_reference.sh
-   $ sh run_script_projections.sh
+   $ cd path/to/copro/example/_scripts
+   $ sh run_command_line_script.sh

Validation
^^^^^^^^^^^^^^^^^^
@@ -175,7 +164,7 @@ The selected classifier is trained and validated against this data.
Main validation metrics are the ROC-AUC score as well as accuracy, precision, and recall.
All metrics are reported and written to file per model evaluation.

-With the example data downloadable from `Zenodo <https://zenodo.org/record/4297295>`_, a ROC-AUC score of 0.82 can be obtained.
+With the example data downloadable from `Zenodo <https://zenodo.org/record/4297295>`_, a ROC-AUC score of above 0.8 can be obtained.
Note that with additional and more explanatory sample data, the score will most likely increase.
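As a hedged sketch (not CoPro's own evaluation code), these metrics can be computed with scikit-learn from observed targets and predicted probabilities; the vectors below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

# toy observed conflict targets and predicted conflict probabilities
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.9, 0.2, 0.6])
# class prediction at the default 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_test, y_prob)   # ranking quality, threshold-independent
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
```

Note that the ROC-AUC score is computed from probabilities, while accuracy, precision, and recall require thresholded class predictions.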

.. figure:: docs/_static/roc_curve.png
18 changes: 9 additions & 9 deletions copro/conflict.py
@@ -3,7 +3,7 @@
import pandas as pd
import numpy as np
import os, sys
-import math
+import click

def conflict_in_year_bool(config, conflict_gdf, extent_gdf, sim_year, out_dir):
"""Creates a list for each timestep with boolean information whether a conflict took place in a polygon or not.
@@ -27,7 +27,8 @@ def conflict_in_year_bool(config, conflict_gdf, extent_gdf, sim_year, out_dir):
# select the entries which occurred in this year
temp_sel_year = conflict_gdf.loc[conflict_gdf.year == sim_year]

-    assert (len(temp_sel_year) != 0), AssertionError('ERROR: no conflicts were found in sampled conflict data set for year {}'.format(sim_year))
+    if len(temp_sel_year) == 0:
+        click.echo('WARNING: no conflicts were found in sampled conflict data set for year {}'.format(sim_year))

# merge the dataframes with polygons and conflict information, creating a sub-set of polygons/regions
data_merged = gpd.sjoin(temp_sel_year, extent_gdf)
@@ -54,10 +55,6 @@ def conflict_in_year_bool(config, conflict_gdf, extent_gdf, sim_year, out_dir):
if config.getboolean('general', 'verbose'): print('DEBUG: storing boolean conflict map of year {} to file {}'.format(sim_year, os.path.join(out_dir, 'conflicts_in_{}.csv'.format(sim_year))))
# data_stored = pd.merge(bool_per_poly, global_df, on='ID', how='right').fillna(0)
data_stored = pd.merge(bool_per_poly, global_df, on='ID', how='right').dropna()
-    # print(global_df.head())
-    # print(bool_per_poly.head())
-    # data_stored = global_df.merge(bool_per_poly, left_index=True, right_index=True, how='left')
-    # print(data_stored)
data_stored.index = data_stored.index.rename('watprovID')
data_stored = data_stored.drop('geometry', axis=1)
data_stored = data_stored.astype(int)
@@ -167,9 +164,6 @@ def read_projected_conflict(extent_gdf, bool_conflict, check_neighbors=False, ne

if check_neighbors:

-            # if neighboring_matrix == None:
-            #     raise ValueError('ERROR: if check_neighbors=True, a matrix with neihgbouring polygons needs to be provided too!')

# determine log-scaled number of conflict events in neighboring polygons
val = calc_conflicts_nb(i_poly, neighboring_matrix, bool_conflict)
# append resulting value
@@ -287,6 +281,8 @@ def split_conflict_geom_data(X):
arrays: separate arrays with ID, geometry, and actual data
"""

+    # first column corresponds to ID, second to geometry
+    # all remaining columns are actual data
X_ID = X[:, 0]
X_geom = X[:, 1]
X_data = X[: , 2:]
@@ -308,10 +304,14 @@ def get_pred_conflict_geometry(X_test_ID, X_test_geom, y_test, y_pred, y_prob_0,
dataframe: dataframe with each input list as column plus computed 'correct_pred'.
"""

+    # stack separate columns horizontally
arr = np.column_stack((X_test_ID, X_test_geom, y_test, y_pred, y_prob_0, y_prob_1))

+    # convert array to dataframe
df = pd.DataFrame(arr, columns=['ID', 'geometry', 'y_test', 'y_pred', 'y_prob_0', 'y_prob_1'])

+    # compute whether a prediction is correct
+    # if so, assign 1; otherwise, assign 0
df['correct_pred'] = np.where(df['y_test'] == df['y_pred'], 1, 0)

return df
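The ``correct_pred`` logic added above can be illustrated on toy vectors (the values below are hypothetical, not CoPro output):

```python
import numpy as np
import pandas as pd

# toy observed and predicted conflict values for four polygons
df = pd.DataFrame({'y_test': [0, 1, 1, 0],
                   'y_pred': [0, 1, 0, 0]})

# 1 where prediction matches observation, else 0
df['correct_pred'] = np.where(df['y_test'] == df['y_pred'], 1, 0)
```

Averaging ``correct_pred`` over all rows directly yields the accuracy of the predictions.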
39 changes: 21 additions & 18 deletions copro/data.py
@@ -7,7 +7,7 @@


def initiate_XY_data(config):
"""Initiates an empty dictionary to contain the XY-data for each polygon.
"""Initiates an empty dictionary to contain the XY-data for each polygon, ie. both sample data and target data.
This is needed for the reference run.
By default, the first column is for the polygon ID, the second for polygon geometry.
The antepenultimate column is for boolean information about conflict at t-1 while the penultimate column is for boolean information about conflict at t-1 in neighboring polygons.
@@ -22,6 +22,8 @@ def initiate_XY_data(config):
dict: empty dictionary to be filled, containing keys for each variable (X), binary conflict data (Y) plus meta-data.
"""

+    # Initialize dictionary
+    # some entries are set by default, besides the ones corresponding to input data variables
XY = {}
XY['poly_ID'] = pd.Series()
XY['poly_geometry'] = pd.Series()
@@ -39,7 +41,7 @@ def initiate_XY_data(config):
return XY

def initiate_X_data(config):
"""Initiates an empty dictionary to contain the X-data for each polygon.
"""Initiates an empty dictionary to contain the X-data for each polygon, ie. only sample data.
This is needed for each time step of each projection run.
By default, the first column is for the polygon ID and the second for polygon geometry.
The penultimate column is for boolean information about conflict at t-1 while the last column is for boolean information about conflict at t-1 in neighboring polygons.
@@ -50,8 +52,10 @@ def initiate_X_data(config):
Returns:
dict: empty dictionary to be filled, containing keys for each variable (X) plus meta-data.
"""

"""

+    # Initialize dictionary
+    # some entries are set by default, besides the ones corresponding to input data variables
X = {}
X['poly_ID'] = pd.Series()
X['poly_geometry'] = pd.Series()
@@ -68,7 +72,7 @@ def initiate_X_data(config):
return X

def fill_XY(XY, config, root_dir, conflict_data, polygon_gdf, out_dir):
"""Fills the XY-dictionary with data for each variable and conflict for each polygon for each simulation year.
"""Fills the (XY-)dictionary with data for each variable and conflict for each polygon for each simulation year.
The number of rows should therefore equal the number of simulation years times the number of polygons.
    At the end of the last simulation year, the dictionary is converted to a numpy-array.
@@ -78,6 +82,7 @@ def fill_XY(XY, config, root_dir, conflict_data, polygon_gdf, out_dir):
root_dir (str): path to location of cfg-file.
conflict_data (geo-dataframe): geo-dataframe containing the selected conflicts.
polygon_gdf (geo-dataframe): geo-dataframe containing the selected polygons.
+        out_dir (path): path to output folder.
Raises:
Warning: raised if the datetime-format of the netCDF-file does not match conventions and/or supported formats.
@@ -162,8 +167,6 @@ def fill_XY(XY, config, root_dir, conflict_data, polygon_gdf, out_dir):
if config.getboolean('general', 'verbose'): click.echo('DEBUG: all data read')

df_out = pd.DataFrame.from_dict(XY)

-    df_corr = evaluation.calc_correlation_matrix(df_out.drop(columns='poly_ID'), out_dir)

return df_out.to_numpy()
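The final conversion step can be illustrated with a hypothetical mini-dictionary (keys and values below are made up, not CoPro's real variables):

```python
import pandas as pd

# hypothetical XY-dictionary: two polygons, one variable, one conflict column
XY = {'poly_ID': pd.Series([1, 2]),
      'precipitation': pd.Series([0.5, 0.7]),
      'conflict': pd.Series([0, 1])}

# dictionary of Series -> DataFrame -> plain numpy-array
arr = pd.DataFrame.from_dict(XY).to_numpy()
```

Each dictionary key becomes one column of the resulting array, so the row count equals the number of polygon/year entries.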

@@ -270,31 +273,31 @@ def fill_X_conflict(X, config, conflict_data, polygon_gdf):
return X

def split_XY_data(XY, config):
"""Separates the XY-array into array containing information about variable values (X-array) and conflict data (Y-array).
"""Separates the XY-array into array containing information about variable values (X-array or sample data) and conflict data (Y-array or target data).
Thereby, the X-array also contains the information about unique identifier and polygon geometry.
Args:
XY (array): array containing variable values and conflict data.
config (ConfigParser-object): object containing the parsed configuration-settings of the model.
Returns:
-        arrays: two separate arrays, the X-array and Y-array
+        arrays: two separate arrays, the X-array and Y-array.
"""

+    # convert array to dataframe for easier handling
XY = pd.DataFrame(XY)
if config.getboolean('general', 'verbose'): click.echo('DEBUG: number of data points including missing values: {}'.format(len(XY)))

-    # some debugging, seems that for some reason popluation data is not added to values
-    # test_df = XY[XY.isna().any(axis=1)]
-    # test_df.drop(test_df.columns[[1]], axis = 1, inplace=True)
-    # test_df.to_csv(os.path.join(os.path.abspath(config.get('general', 'output_dir')), '_REF', 'test_df.csv'))

-    # if config.getboolean('general', 'verbose'): click.echo('DEBUG: exluding polygons containing NaNs: {}'.format(X[X.isna().any(axis=1)]))
-    # X = X.dropna()
+    # fill all missing values with 0
XY = XY.fillna(0)

+    # convert dataframe back to array
    XY = XY.to_numpy()

-    X = XY[:, :-1] # since conflict is the last column, we know that all previous columns must be variable values
+    # get X data
+    # since conflict is the last column, we know that all previous columns must be variable values
+    X = XY[:, :-1]
+    # get Y data and convert to integer values
Y = XY[:, -1]
Y = Y.astype(int)
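The X/Y split shown above can be demonstrated on a toy XY-array (values are made up):

```python
import numpy as np

# toy XY-array: last column is the Boolean conflict target,
# all preceding columns are variable values
XY = np.array([[0.2, 0.0],
               [0.8, 1.0],
               [0.4, 0.0]])

X = XY[:, :-1]             # sample data
Y = XY[:, -1].astype(int)  # target data as integers
```

Keeping the target in the last column makes the split a single slicing operation, independent of how many variables are used.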

@@ -306,7 +309,7 @@

def neighboring_polys(config, extent_gdf, identifier='watprovID'):
"""For each polygon, determines its neighboring polygons.
-    As result, a n times n look-up dataframe is obtained containing, where n is number of polygons in extent_gdf.
+    As a result, a (n x n) look-up dataframe is obtained, where n is the number of polygons in extent_gdf.
Args:
config (ConfigParser-object): object containing the parsed configuration-settings of the model.
