updated output documentation and some bugfixes
JannisHoch committed Nov 30, 2020
1 parent d03505e commit 07534a5
Showing 5 changed files with 41 additions and 88 deletions.
2 changes: 1 addition & 1 deletion copro/evaluation.py
@@ -143,7 +143,7 @@ def polygon_model_accuracy(df, global_df, out_dir, make_proj=False):
if not make_proj: df_temp['fraction_correct_predictions'] = df_temp.nr_correct_predictions / df_temp.nr_predictions

#- compute chance of conflict by dividing the number of predicted conflicts by the number of all predictions
df_temp['fraction_correct_conflict_predictions'] = df_temp.nr_predicted_conflicts / df_temp.nr_predictions
df_temp['chance_of_conflict'] = df_temp.nr_predicted_conflicts / df_temp.nr_predictions

#- merge with global dataframe containing IDs and geometry, and keep only those polygons occurring in test sample
df_hit = pd.merge(df_temp, global_df, on='ID', how='left')
117 changes: 36 additions & 81 deletions docs/Output.rst
@@ -1,101 +1,56 @@
Output
=========================

The model can produce a range of output files. Output is stored in the output folder as specified in the configurations-file (cfg-file).
The model can produce a range of output files. All output is stored in the output folder as specified in the configurations-file (cfg-file).

.. note::

In addition to these output files, the model settings file (cfg-file) is automatically copied to the output folder.
In addition to the output files listed below, the model settings file (cfg-file) is automatically copied to the output folder.

.. important::

Not all model types provide the output mentioned below. If the 'leave-one-out' or 'single variable' model is selected, only the metrics are stored to a csv-file.

.. important::

Most of the output can only be produced when running a reference model, i.e. when comparing the predictions against observations.
If running a prediction model, only the chance of conflict per polygon is stored to file.

Selected polygons
------------------
A shp-file named ``selected_polygons.shp`` contains all polygons after performing the selection procedure.

Selected conflicts
-------------------
The shp-file ``selected_conflicts.shp`` contains all conflict data points after performing the selection procedure.

Sampled variable and conflict data
-----------------------------------
During model execution, data is sampled per polygon and time step.
This data contains the geometry and ID of each polygon as well as the unscaled variable values (X) and a boolean identifier of whether conflict took place (Y).
The resulting XY-array is stored to ``XY.npy`` so that it can be re-used if the model is re-run without changes to the data or the sampling procedure. This file can be loaded again with ``np.load()``.

If making projections, the Y-part is not available. The remaining X-data is still written to a file ``X.npy``.

.. note::

Note that ``np.load()`` returns an array. This can be further processed with e.g. pandas.
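
A minimal sketch of loading the file back in and converting it to a dataframe; the column layout is an assumption here and depends on the variables specified in the cfg-file:

.. code-block:: python

    import numpy as np
    import pandas as pd

    # allow_pickle is needed since the array stores geometry objects
    # alongside the numeric values
    XY = np.load('XY.npy', allow_pickle=True)

    # convert to a dataframe for further processing; column names are
    # not stored in the npy-file and must be assigned manually
    df = pd.DataFrame(XY)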

ML classifier
--------------
At the end of a reference run, the chosen classifier is fitted with all available XY-data.
To be able to re-use the classifier (e.g. to make predictions), it is pickled to ``clf.pkl``.
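
A sketch of re-loading the classifier, assuming a fitted scikit-learn classifier as used by CoPro:

.. code-block:: python

    import pickle

    # re-load the fitted classifier from the reference run
    with open('clf.pkl', 'rb') as f:
        clf = pickle.load(f)

    # the classifier can then be re-used, e.g. via clf.predict(X)
    # or clf.predict_proba(X)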

All predictions
------------------
Per model run, a fraction of the total XY-data is used to make a prediction.
To be able to analyse model output, all predictions made per run (stored as pandas dataframes) are appended to a main output-dataframe.
This dataframe forms the basis of all further analyses.
When stored to file, it can become rather large.
Therefore, the dataframe is converted to a npy-file (``raw_output_data.npy``). This file can be loaded again with ``np.load()``.

.. note::

Note that ``np.load()`` returns an array. This can be further processed with e.g. pandas.
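
A sketch of loading the raw output for custom analyses; the column layout is an assumption and depends on the model set-up:

.. code-block:: python

    import numpy as np
    import pandas as pd

    # load the appended predictions of all runs
    arr = np.load('raw_output_data.npy', allow_pickle=True)
    df = pd.DataFrame(arr)

    # once column names are assigned, per-polygon aggregates can be
    # computed, e.g. the number of predictions made per polygon via
    # df.groupby('ID').size()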

Evaluation metrics
-----------------------
Per model run, a range of metrics is computed to evaluate the predictions made.
They are all appended to a dictionary and saved to the file ``evaluation_metrics.csv``.

ROC-AUC
--------
To be able to determine the mean ROC-AUC score and its standard deviation, the required data is stored to csv-files.
``ROC_data_tprs.csv`` contains the true-positive rates per evaluation, and ``ROC_data_aucs.csv`` the area-under-curve values per run.
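
The two files could, for instance, be combined to plot a mean ROC-curve. A sketch, assuming the true-positive rates were interpolated onto a common, regular grid of false-positive rates with one row per repetition:

.. code-block:: python

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    tprs = pd.read_csv('ROC_data_tprs.csv').to_numpy()
    aucs = pd.read_csv('ROC_data_aucs.csv').to_numpy().ravel()

    # assuming the tprs were sampled on a regular grid from 0 to 1
    mean_fpr = np.linspace(0, 1, tprs.shape[1])
    mean_tpr = tprs.mean(axis=0)

    plt.plot(mean_fpr, mean_tpr,
             label='mean ROC (AUC = {:.2f} +/- {:.2f})'.format(aucs.mean(), aucs.std()))
    plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
    plt.xlabel('false-positive rate')
    plt.ylabel('true-positive rate')
    plt.legend()
    plt.show()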

Model prediction per polygon
-----------------------------
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| File name | Description | Note |
+===============================+=============================================================================================+=============================================================================================+
| ``selected_polygons.shp`` | Shapefile containing all remaining polygons after selection procedure | |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``selected_conflicts.shp`` | Shapefile containing all remaining conflict points after selection procedure | |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``XY.npy``                    | NumPy-array containing geometry, ID, and scaled data of sample (X) and target data (Y)     | can be provided in cfg-file to save time in next run; file can be loaded with numpy.load() |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``X.npy`` | NumPy-array containing geometry, ID, and scaled data of sample (X) | only written in projection run; file can be loaded with numpy.load() |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``clf.pkl`` | Pickled classifier fitted with the entirety of XY-data | needed to perform projection run; file can be loaded with pickle.load() |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``raw_output_data.npy`` | NumPy-array containing each single prediction made in the reference run | will contain multiple predictions per polygon; file can be loaded with numpy.load() |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``evaluation_metrics.csv``    | Various evaluation metrics determined per repetition of the split-sample test              | file can e.g. be loaded with pandas.read_csv()                                             |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``ROC_data_tprs.csv``         | True-positive rates per repetition of the split-sample test                                | file can e.g. be loaded with pandas.read_csv(); data can be used to plot the ROC-curve     |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``ROC_data_aucs.csv``         | Area-under-curve values per repetition of the split-sample test                            | file can e.g. be loaded with pandas.read_csv(); data can be used to plot the ROC-curve     |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| ``output_per_polygon.shp`` | Shapefile containing resulting conflict risk estimates per polygon | for further explanation, see below |
+-------------------------------+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+

Conflict risk per polygon
---------------------------
At the end of all model repetitions, the resulting output dataframe contains multiple predictions for each polygon.
At the end of all model repetitions, the resulting output data frame contains multiple predictions for each polygon.
By aggregating results per polygon, it is possible to assess model output spatially.

Three main output metrics are calculated per polygon:

1. The chance of a correct prediction (*CCP*), defined as the ratio of the number of correct predictions made to the overall number of predictions made;
2. The total number of conflicts in the test sample (*NOC*);
3. The chance of conflict (*COC*), defined as the ratio of the number of conflict predictions to the overall number of predictions made.
The following output metrics are calculated per polygon and saved to ``output_per_polygon.shp``:

all data
^^^^^^^^^
1. The number of predictions made per polygon;
2. The number of observed conflicts per polygon;
3. The number of predicted conflicts per polygon;
4. The fraction of correct predictions (*FOP*), defined as the ratio of the number of correct predictions over the total number of predictions made;
5. The chance of conflict (*COC*), defined as the ratio of the number of conflict predictions over the total number of predictions made.

All output metrics (CCP, NOC, COC) are determined based on the entire data set at the end of the run, i.e. without splitting it into chunks.

The data is stored to ``output_per_polygon.shp``.
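
A sketch of inspecting the per-polygon output with geopandas; the column name used for plotting is an assumption, since shapefile drivers may truncate long column names:

.. code-block:: python

    import geopandas as gpd

    # read the per-polygon conflict risk estimates
    gdf = gpd.read_file('output_per_polygon.shp')

    # list the available columns first, as names may be truncated
    print(gdf.columns)

    # map one of the metrics, e.g. the chance of conflict
    gdf.plot(column='chance_of_conflict', legend=True)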

k-fold analysis
^^^^^^^^^^^^^^^^
The model is repeated several times to eliminate the influence of how the data is split into training and test samples.
As such, the accuracy per run and polygon will differ.

To account for that, the resulting data set containing all predictions at the end of the run is split into k chunks.
Subsequently, the mean, median, and standard deviation of the CCP are determined from the k chunks.

The resulting shp-file is named ``output_kFoldAnalysis_per_polygon.shp``.
.. important::

.. note::
For projection runs, only the COC can be determined, as no conflict observations are available.

In addition to these shp-files, various plots can be stored using the provided plot functions. These plots are also stored in the output directory.
Note that the plot settings cannot yet be fully controlled via those functions, i.e. they are intended mainly for debugging.
To create custom-made plots, rather use the shp-files and csv-files.



5 changes: 3 additions & 2 deletions docs/examples/index.rst
@@ -4,8 +4,9 @@ Workflow
=========

This page provides a short example workflow in Jupyter Notebooks.
It is designed such that the main features of model become clear.
Even though the model can be executed perfectly well using notebooks, there is also the possibility to use a command line script (see :ref:`script`).
It is designed such that the main features of CoPro become clear.

Even though the model can be executed perfectly well using notebooks, the main (and more convenient) way of model execution is the command line script (see :ref:`script`).

.. toctree::
:maxdepth: 1
2 changes: 1 addition & 1 deletion docs/model_settings.rst
@@ -32,7 +32,7 @@ Here, the different sections are explained briefly.
**[pre_calc]**

- *XY*: if the XY-data was already pre-computed in a previous run and stored as npy-file, it can be specified here and will be loaded from file. If nothing is specified, the model will save the XY-data by default to the output directory as ``XY.npy``;
- *clf*: path to the pickled fitted classifier from the reference run. Needed for projection runs only;
- *clf*: path to the pickled fitted classifier from the reference run. Needed for projection runs only!
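
A hypothetical example of this section, assuming both files were produced by a previous reference run stored in ``./OUT``:

.. code-block:: ini

    [pre_calc]
    # re-use the sampled XY-data and the fitted classifier
    # from a previous reference run (paths are illustrative)
    XY=./OUT/XY.npy
    clf=./OUT/clf.pkl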

**[extent]**

3 changes: 0 additions & 3 deletions scripts/copro_runner.py
@@ -99,9 +99,6 @@ def main(cfg, projection_settings=[], verbose=False):
# create accuracy values per polygon and save to output folder
df_hit, gdf_hit = copro.evaluation.polygon_model_accuracy(out_y_df, global_df, out_dir)

# apply k-fold
gdf_CCP = copro.evaluation.calc_kFold_polygon_analysis(out_y_df, global_df, out_dir, k=10)

#- plot distribution of all evaluation metrics
fig, ax = plt.subplots(1, 1)
copro.plots.metrics_distribution(out_dict, figsize=(20, 10))
