updated output documentation and some bugfixes
JannisHoch committed Nov 30, 2020
1 parent d03505e commit 07534a5
Showing 5 changed files with 41 additions and 88 deletions.
2 changes: 1 addition & 1 deletion copro/evaluation.py
@@ -143,7 +143,7 @@ def polygon_model_accuracy(df, global_df, out_dir, make_proj=False):
if not make_proj: df_temp['fraction_correct_predictions'] = df_temp.nr_correct_predictions / df_temp.nr_predictions

#- compute chance of conflict by dividing the number of predicted conflicts by the number of all predictions
df_temp['fraction_correct_conflict_predictions'] = df_temp.nr_predicted_conflicts / df_temp.nr_predictions
df_temp['chance_of_conflict'] = df_temp.nr_predicted_conflicts / df_temp.nr_predictions

#- merge with global dataframe containing IDs and geometry, and keep only those polygons occurring in test sample
df_hit = pd.merge(df_temp, global_df, on='ID', how='left')
117 changes: 36 additions & 81 deletions docs/Output.rst
@@ -1,101 +1,56 @@
Output
=========================

The model can produce a range of output files. Output is stored in the output folder as specified in the configurations-file (cfg-file).
The model can produce a range of output files. All output is stored in the output folder as specified in the configurations-file (cfg-file).

.. note::

In addition to these output files, the model settings file (cfg-file) is automatically copied to the output folder.
In addition to the output files listed below, the model settings file (cfg-file) is automatically copied to the output folder.

.. important::

Not all model types provide the output mentioned below. If the 'leave-one-out' or 'single variable' model is selected, only the metrics are stored to a csv-file.

.. important::

Most of the output can only be produced when running a reference model, i.e. when comparing the predictions against observations.
If running a prediction model, only the chance of conflict per polygon is stored to file.

Selected polygons
------------------
A shp-file named ``selected_polygons.shp`` contains all polygons after performing the selection procedure.

Selected conflicts
-------------------
The shp-file ``selected_conflicts.shp`` contains all conflict data points after performing the selection procedure.

Sampled variable and conflict data
-----------------------------------
During model execution, data is sampled per polygon and time step.
This data contains the geometry and ID of each polygon as well as the unscaled variable values (X) and a boolean identifier of whether conflict took place (Y).
The resulting XY-array is stored to ``XY.npy`` so that it can be re-used if the model is re-run without changes to the data or the sampling procedure. This file can be loaded again with ``np.load()``.

If making projections, the Y-part is not available. The remaining X-data is still written to a file ``X.npy``.

.. note::

Note that ``np.load()`` returns an array. This can be further processed with e.g. pandas.
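
A minimal sketch of loading the file back in and converting it to a dataframe; the column layout is an assumption here and depends on the variables specified in the cfg-file:

.. code-block:: python

    import numpy as np
    import pandas as pd

    # allow_pickle is needed since the array stores geometry objects
    # alongside the numeric values
    XY = np.load('XY.npy', allow_pickle=True)

    # convert to a dataframe for further processing; column names are
    # not stored in the npy-file and must be assigned manually
    df = pd.DataFrame(XY)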

ML classifier
--------------
At the end of a reference run, the chosen classifier is fitted with all available XY-data.
To be able to re-use the classifier (e.g. to make predictions), it is pickled to ``clf.pkl``.
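
A sketch of re-loading the classifier, assuming a fitted scikit-learn classifier as used by CoPro:

.. code-block:: python

    import pickle

    # re-load the fitted classifier from the reference run
    with open('clf.pkl', 'rb') as f:
        clf = pickle.load(f)

    # the classifier can then be re-used, e.g. via clf.predict(X)
    # or clf.predict_proba(X)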

All predictions
------------------
Per model run, a fraction of the total XY-data is used to make a prediction.
To be able to analyse model output, all predictions made per run (stored as pandas dataframes) are appended to a main output-dataframe.
This dataframe forms the basis of all further analyses.
When stored to file, it can become rather large.
Therefore, the dataframe is converted to a npy-file (``raw_output_data.npy``). This file can be loaded again with ``np.load()``.

.. note::

Note that ``np.load()`` returns an array. This can be further processed with e.g. pandas.
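
A sketch of loading the raw output for custom analyses; the column layout is an assumption and depends on the model set-up:

.. code-block:: python

    import numpy as np
    import pandas as pd

    # load the appended predictions of all runs
    arr = np.load('raw_output_data.npy', allow_pickle=True)
    df = pd.DataFrame(arr)

    # once column names are assigned, per-polygon aggregates can be
    # computed, e.g. the number of predictions made per polygon via
    # df.groupby('ID').size()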

Evaluation metrics
-----------------------
Per model run, a range of metrics is computed to evaluate the predictions made.
They are all appended to a dictionary and saved to the file ``evaluation_metrics.csv``.

ROC-AUC
--------
To be able to determine the mean ROC-AUC score and its standard deviation, the required data is stored to csv-files.
``ROC_data_tprs.csv`` contains the true-positive rates per evaluation, and ``ROC_data_aucs.csv`` the area-under-curve values per run.
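
The two files could, for instance, be combined to plot a mean ROC-curve. A sketch, assuming the true-positive rates were interpolated onto a common, regular grid of false-positive rates with one row per repetition:

.. code-block:: python

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    tprs = pd.read_csv('ROC_data_tprs.csv').to_numpy()
    aucs = pd.read_csv('ROC_data_aucs.csv').to_numpy().ravel()

    # assuming the tprs were sampled on a regular grid from 0 to 1
    mean_fpr = np.linspace(0, 1, tprs.shape[1])
    mean_tpr = tprs.mean(axis=0)

    plt.plot(mean_fpr, mean_tpr,
             label='mean ROC (AUC = {:.2f} +/- {:.2f})'.format(aucs.mean(), aucs.std()))
    plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
    plt.xlabel('false-positive rate')
    plt.ylabel('true-positive rate')
    plt.legend()
    plt.show()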

Model prediction per polygon
-----------------------------
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| File name | Description | Note |
+===============================+=============================================================================================+=============================================================================================+
| ``selected_polygons.shp`` | Shapefile containing all remaining polygons after selection procedure | |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``selected_conflicts.shp`` | Shapefile containing all remaining conflict points after selection procedure | |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``XY.npy``                    | NumPy-array containing geometry, ID, and scaled data of sample (X) and target data (Y)     | can be provided in cfg-file to save time in next run; file can be loaded with numpy.load() |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``X.npy`` | NumPy-array containing geometry, ID, and scaled data of sample (X) | only written in projection run; file can be loaded with numpy.load() |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``clf.pkl`` | Pickled classifier fitted with the entirety of XY-data | needed to perform projection run; file can be loaded with pickle.load() |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``raw_output_data.npy`` | NumPy-array containing each single prediction made in the reference run | will contain multiple predictions per polygon; file can be loaded with numpy.load() |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``evaluation_metrics.csv``    | Various evaluation metrics determined per repetition of the split-sample test              | file can e.g. be loaded with pandas.read_csv()                                             |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``ROC_data_tprs.csv``         | True-positive rates per repetition of the split-sample test                                | file can e.g. be loaded with pandas.read_csv(); data can be used to plot the ROC-curve     |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``ROC_data_aucs.csv``         | Area-under-curve values per repetition of the split-sample test                            | file can e.g. be loaded with pandas.read_csv(); data can be used to plot the ROC-curve     |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``output_per_polygon.shp`` | Shapefile containing resulting conflict risk estimates per polygon | for further explanation, see below |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+

Conflict risk per polygon
---------------------------
At the end of all model repetitions, the resulting output dataframe contains multiple predictions for each polygon.
At the end of all model repetitions, the resulting output data frame contains multiple predictions for each polygon.
By aggregating results per polygon, it is possible to assess model output spatially.

Three main output metrics are calculated per polygon:

1. The chance of a correct prediction (*CCP*), defined as the ratio of the number of correct predictions made to the overall number of predictions made;
2. The total number of conflicts in the test sample (*NOC*);
3. The chance of conflict (*COC*), defined as the ratio of the number of conflict predictions to the overall number of predictions made.
The following output metrics are calculated per polygon and saved to ``output_per_polygon.shp``:

all data
^^^^^^^^^
1. The number of predictions made per polygon;
2. The number of observed conflicts per polygon;
3. The number of predicted conflicts per polygon;
4. The fraction of correct predictions (*FOP*), defined as the ratio of the number of correct predictions over the total number of predictions made;
5. The chance of conflict (*COC*), defined as the ratio of the number of conflict predictions over the total number of predictions made.

All output metrics (CCP, NOC, COC) are determined based on the entire data set at the end of the run, i.e. without splitting it into chunks.

The data is stored to ``output_per_polygon.shp``.
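
A sketch of inspecting the per-polygon output with geopandas; the column name used for plotting is an assumption, since shapefile drivers may truncate long column names:

.. code-block:: python

    import geopandas as gpd

    # read the per-polygon conflict risk estimates
    gdf = gpd.read_file('output_per_polygon.shp')

    # list the available columns first, as names may be truncated
    print(gdf.columns)

    # map one of the metrics, e.g. the chance of conflict
    gdf.plot(column='chance_of_conflict', legend=True)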

k-fold analysis
^^^^^^^^^^^^^^^^
The model is repeated several times to eliminate the influence of how the data is split into training and test samples.
As such, the accuracy per run and polygon will differ.

To account for that, the resulting data set containing all predictions at the end of the run is split into k chunks.
Subsequently, the mean, median, and standard deviation of the CCP are determined from the k chunks.

The resulting shp-file is named ``output_kFoldAnalysis_per_polygon.shp``.
.. important::

.. note::
For projection runs, only the COC can be determined, as no conflict observations are available.

In addition to these shp-files, various plots can be stored using the provided plot functions. These plots are also stored in the output directory.
Note that the plot settings cannot yet be fully controlled via those functions, i.e. they are intended mainly for debugging.
To create custom-made plots, rather use the shp-files and csv-files.



5 changes: 3 additions & 2 deletions docs/examples/index.rst
@@ -4,8 +4,9 @@ Workflow
=========

This page provides a short example workflow in Jupyter Notebooks.
It is designed such that the main features of model become clear.
Even though the model can be executed perfectly well using notebooks, there is also the possibility to use a command line script (see :ref:`script`).
It is designed such that the main features of CoPro become clear.

Even though the model can be executed perfectly well using notebooks, the main (and more convenient) way of model execution is the command line script (see :ref:`script`).

.. toctree::
:maxdepth: 1
2 changes: 1 addition & 1 deletion docs/model_settings.rst
@@ -32,7 +32,7 @@ Here, the different sections are explained briefly.
**[pre_calc]**

- *XY*: if the XY-data was already pre-computed in a previous run and stored as npy-file, it can be specified here and will be loaded from file. If nothing is specified, the model will save the XY-data by default to the output directory as ``XY.npy``;
- *clf*: path to the pickled fitted classifier from the reference run. Needed for projection runs only;
- *clf*: path to the pickled fitted classifier from the reference run. Needed for projection runs only!
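
A hypothetical example of this section, assuming both files were produced by a previous reference run stored in ``./OUT``:

.. code-block:: ini

    [pre_calc]
    # re-use the sampled XY-data and the fitted classifier
    # from a previous reference run (paths are illustrative)
    XY=./OUT/XY.npy
    clf=./OUT/clf.pkl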

**[extent]**

3 changes: 0 additions & 3 deletions scripts/copro_runner.py
@@ -99,9 +99,6 @@ def main(cfg, projection_settings=[], verbose=False):
# create accuracy values per polygon and save to output folder
df_hit, gdf_hit = copro.evaluation.polygon_model_accuracy(out_y_df, global_df, out_dir)

# apply k-fold
gdf_CCP = copro.evaluation.calc_kFold_polygon_analysis(out_y_df, global_df, out_dir, k=10)

#- plot distribution of all evaluation metrics
fig, ax = plt.subplots(1, 1)
copro.plots.metrics_distribution(out_dict, figsize=(20, 10))
