Commit
added more contextual info and explanations to notebooks
JannisHoch committed Nov 30, 2020
1 parent 7bc0a3f commit c458f92
Showing 9 changed files with 462 additions and 317 deletions.
2 changes: 2 additions & 0 deletions README.rst
@@ -157,6 +157,8 @@ Note that with additional and more explanatory sample data, the score will most

.. figure:: docs/_static/roc_curve.png

Additional ways to validate the model are showcased in the `Workflow <https://copro.readthedocs.io/en/latest/examples/index.html>`_.
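
A ROC curve like the one shown above can be reproduced with scikit-learn. The snippet below is a minimal sketch, not CoPro's own validation routine: it assumes a fitted binary classifier ``clf`` and a held-out test set ``X_test``/``y_test`` (generated here from toy data purely for illustration).

.. code-block:: python

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    # toy data standing in for the XY sample data used by CoPro
    X, y = make_classification(n_samples=1000, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

    # probabilities of the positive class are needed for the ROC curve
    y_score = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)

    plt.plot(fpr, tpr, label='AUC = {:.2f}'.format(roc_auc_score(y_test, y_score)))
    plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # no-skill reference
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.legend()
    plt.show()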

Documentation
---------------

Binary file modified docs/_static/roc_curve.png
4 changes: 3 additions & 1 deletion docs/examples/index.rst
@@ -4,7 +4,9 @@ Workflow
=========

This page provides a short example workflow in Jupyter Notebooks.
It is designed such that the main features of CoPro become clear.
It is designed such that the main steps, features, assumptions, and outcomes of CoPro become clear.

As model input data, the data set downloadable from `Zenodo <https://zenodo.org/record/4297295>`_ was used.

Even though the model can be executed perfectly well using the notebooks, the main (and more convenient) way of running the model is the command line script (see :ref:`script`).

2 changes: 1 addition & 1 deletion example/example_settings.cfg
@@ -12,7 +12,7 @@ y_start=2000
# end year
y_end=2015
# number of repetitions
n_runs=3
n_runs=10

[pre_calc]
# if nothing is specified, the XY array will be stored in output_dir
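
For orientation, a minimal sketch of how such a cfg-file can be read with Python's standard-library configparser is shown below. This is not CoPro's own parsing code, and the section name [settings] is an assumption, since the section header is not visible in this hunk.

```python
# minimal sketch, assuming the options above live in a [settings] section
from configparser import ConfigParser

config = ConfigParser()
config.read('example_settings.cfg')

y_start = config.getint('settings', 'y_start')  # start year, e.g. 2000
y_end = config.getint('settings', 'y_end')      # end year, e.g. 2015
n_runs = config.getint('settings', 'n_runs')    # number of repetitions, e.g. 10

print('simulating {}-{} with {} repetitions'.format(y_start, y_end, n_runs))
```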
2 changes: 1 addition & 1 deletion example/example_settings_proj.cfg
@@ -33,7 +33,7 @@ min_nr_casualties=1
type_of_violence=1,2,3

[climate]
shp=KoeppenGeiger/2000/1976-2000.shp
shp=KoeppenGeiger/2000/Koeppen_Geiger_1976-2000.shp
# define either one or more classes (use abbreviations!) or specify None for not filtering
zones=BWh,BSh
code2class=KoeppenGeiger/classification_codes.txt
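
The [climate] block above points to a Koeppen-Geiger shapefile, a code-to-class look-up table, and the zones to keep. The sketch below illustrates what such a zone filter could look like with geopandas; it is not CoPro's implementation, and the column name 'GRIDCODE' as well as the two-column layout of classification_codes.txt are assumptions made for illustration.

```python
# illustrative sketch of filtering Koeppen-Geiger polygons by climate zone
import geopandas as gpd
import pandas as pd

zones = ['BWh', 'BSh']  # as specified under zones in the [climate] section

# look-up table mapping numeric codes to class abbreviations (layout assumed)
code2class = pd.read_csv('KoeppenGeiger/classification_codes.txt',
                         sep=r'\s+', names=['code', 'class'])
keep_codes = code2class.loc[code2class['class'].isin(zones), 'code'].tolist()

kg_gdf = gpd.read_file('KoeppenGeiger/2000/Koeppen_Geiger_1976-2000.shp')
kg_sel = kg_gdf[kg_gdf['GRIDCODE'].isin(keep_codes)]  # keep only the requested zones
print('{} of {} polygons fall within zones {}'.format(len(kg_sel), len(kg_gdf), zones))
```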
51 changes: 31 additions & 20 deletions example/nb01_model_init_and_selection.ipynb
@@ -4,7 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model initialization and selection procedure"
"# Model initialization and selection procedure\n",
"\n",
"In this notebook, we will show how CoPro is initialized and how the polygons and conflicts are selected."
]
},
{
@@ -78,9 +80,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the cfg-file, all the settings for the analysis are defined. By 'parsing' (i.e. reading) it, all settings and file paths are known to the model. This is a simple way to make the code independent of the input data and settings.\n",
"\n",
"**Note** that the cfg-file can be stored anywhere, not per se in the same directory where the model data is stored (as in this example case). Make sure that the paths in the cfg-file are updated if you use relative paths and change the folder location of th cfg-file."
"In the cfg-file, all the settings for the analysis are defined. Note that the cfg-file can be stored anywhere, not per se in the same directory where the model data is stored (as in this example case). Make sure that the paths in the cfg-file are updated if you use relative paths and change the folder location of th cfg-file."
]
},
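
The warning about relative paths can be made concrete with a small sketch: a relative path from the cfg-file only stays valid if it is resolved against (or updated together with) the location of the cfg-file. Whether CoPro resolves paths exactly this way is not stated here, and the example path is hypothetical.

```python
# illustrative sketch: resolving a relative path against the cfg-file's folder
import os

cfg_file = os.path.abspath('example_settings.cfg')
rel_path = 'UCDP/ged201.csv'  # hypothetical relative path taken from the cfg-file

abs_path = os.path.normpath(os.path.join(os.path.dirname(cfg_file), rel_path))
print(abs_path, os.path.isfile(abs_path))
# if the cfg-file is moved, os.path.dirname(cfg_file) changes and the
# relative paths inside the cfg-file need to be updated accordingly
```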
{
@@ -96,7 +96,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on this cfg-file, the set-up of the run can be initialized. One part of the cfg-file is the specification and creation of an output folder."
"Based on this cfg-file, the set-up of the run can be initialized. Here, the cfg-file is parsed (i.e. read) and all settings and paths become known to the model. Also, the output folder is created (if it does not exist yet) and the cfg-file is copied to the output folder for improved reusability."
]
},
{
@@ -130,9 +130,20 @@
"source": [
"## Filter conflicts and polygons\n",
"\n",
"As conflict database, we use the [PRIO/UCDP database](https://ucdp.uu.se/downloads/). Not all conflicts of the database may need to be used in the model. This can be, for example, because they belong to a non-relevant type of conflict we are not interested in, or because it is simply not in our area-of-interest.\n",
"### Background\n",
"\n",
"As conflict database, we use the [UCDP Georeferenced Event Dataset](https://ucdp.uu.se/downloads/index.html#ged_global) v201. Not all conflicts of the database may always need to be used for a simulation. This can be, for example, because they belong to a non-relevant type of conflict we are not interested in, or because it is simply not in our area-of-interest. Therefore, it is possible to filter the conflicts on various properties:\n",
"\n",
"1. min_nr_casualties: minimum number of casualties of a reported conflict; \n",
"1. type_of_violence: 1=state-based armed conflict; 2=non-state conflict; 3=one-sided violence.\n",
"\n",
"In the selection procedure, we first load the conflict database and convert it to a georeferenced dataframe (geo-dataframe). We then only keep those entries that fit to our selection criteria as specified in the cfg-file. Subsequently, we clip the remaining conflict datapoints to the extent of a provided shp-file, representing the area-of-interest. Here, we focus on the African continent and do the analysis at the scale of water provinces. The remaining conflict points are then used for the machine-learning model."
"To unravel the interplay between climate and conflict, it may be beneficial to run the model only for conflicts in particular climate zones. It is hence also possible to select only those conflcits that fall within a climate zone following the [Koeppen-Geiger classification](http://koeppen-geiger.vu-wien.ac.at/).\n",
"\n",
"### Selection procedure\n",
"\n",
"In the selection procedure, we first load the conflict database and convert it to a georeferenced dataframe (geo-dataframe). To define the study area, a shape-file containing polygons (in this case water provinces) is loaded and converted to geo-dataframe as well.\n",
"\n",
"We then apply the selection criteria (see above) as specified in the cfg-file, and keep the remaining data points and polygons. "
]
},
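
For readers who want to see the individual steps spelled out, the sketch below mimics the selection procedure with pandas and geopandas. It is not what selection.select() does internally; the UCDP GED file name, the shapefile path, and the criteria values are assumptions taken from the settings shown in this example.

```python
# illustrative sketch of the selection steps described above
import geopandas as gpd
import pandas as pd

# load the conflict database (UCDP GED csv export; file name assumed)
conflicts = pd.read_csv('ged201.csv')

# apply the selection criteria from the cfg-file
conflicts = conflicts[(conflicts['best'] >= 1) &                       # min_nr_casualties
                      (conflicts['type_of_violence'].isin([1, 2, 3]))]

# convert to a geo-dataframe using the reported coordinates
conflict_gdf = gpd.GeoDataFrame(
    conflicts,
    geometry=gpd.points_from_xy(conflicts['longitude'], conflicts['latitude']),
    crs='EPSG:4326')

# load the polygons (here: water provinces) and clip the conflict points to them
polygons_gdf = gpd.read_file('waterProvinces.shp')  # path assumed
conflict_gdf = gpd.clip(conflict_gdf, polygons_gdf)
```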
{
@@ -156,26 +167,16 @@
"conflict_gdf, extent_gdf, selected_polygons_gdf, global_df = selection.select(config, out_dir)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"conflict_gdf.to_file('conflicts.shp')\n",
"selected_polygons_gdf.to_file('polygons.shp')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on the settings, we may focus on some climate zones only. As such, not all water provinces are used in the model. For a visual inspection if this selection worked as intended, we plot below the conflicts and as background map only those water provinces that are actually used in the model."
"With the chosen settings, the following picture of polygons and conflict data points is obtained."
]
},
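
A map like the one produced by the next cell can be made with a few lines of geopandas and matplotlib. The sketch below re-uses the conflict_gdf and selected_polygons_gdf returned by selection.select() above; the styling is illustrative and not necessarily identical to the notebook's own plotting cell.

```python
# illustrative sketch: selected polygons as background, conflict points on top
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(12, 8))
selected_polygons_gdf.plot(ax=ax, color='white', edgecolor='grey')  # background map
conflict_gdf.plot(ax=ax, color='red', markersize=2)                 # conflict locations
plt.show()
```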
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 6,
"metadata": {},
"outputs": [
{
@@ -204,7 +205,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The selected conflict points and polygons are saved as shp-files to be re-used later. For the dataframe with polygon ID and geometry we need to take a detour via numpy because the more straighforwad option to store the dataframe as csv does not with geometry information."
"To be able to also run the following notebooks, some of the data has to be written to file temporarily."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"conflict_gdf.to_file('conflicts.shp')\n",
"selected_polygons_gdf.to_file('polygons.shp')"
]
},
{
108 changes: 52 additions & 56 deletions example/nb02_XY_data.ipynb
@@ -4,9 +4,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Variable values and conflict data\n",
"# Samples matrix and target values\n",
"\n",
"## Preparations"
"In this notebook, we will show how CoPro reads the samples matrix and target values needed to establish a machine-learning model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparations\n",
"\n",
"Start with loading the required packages."
]
},
{
@@ -65,38 +74,46 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### The configurations-file (cfg-file)"
"To be able to also run this notebooks, some of the previously saved data needs to be loaded."
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"In the cfg-file, all the settings for the analysis are defined. By 'parsing' (i.e. reading) it, all settings and file paths are known to the model. This is a simple way to make the code independent of the input data and settings.\n",
"\n",
"**Note** that the cfg-file can be stored anywhere, not per se in the same directory where the model data is stored (as in this example case). Make sure that the paths in the cfg-file are updated if you use relative paths and change the folder location of th cfg-file."
"conflict_gdf = gpd.read_file('conflicts.shp')\n",
"selected_polygons_gdf = gpd.read_file('polygons.shp')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"settings_file = 'example_settings.cfg'"
"### The configurations-file (cfg-file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on this cfg-file, the set-up of the run can be initialized. One part of the cfg-file is the specification and creation of an output folder."
"To be able to continue the simulation with the same settings as in the previous notebook, the cfg-file has to be read again and the model needs to be initialised subsequently."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"settings_file = 'example_settings.cfg'"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -118,42 +135,42 @@
"config, out_dir = utils.initiate_setup(settings_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So be able to continue from the previous notebook, some output has to be read in again."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"conflict_gdf = gpd.read_file('conflicts.shp')\n",
"selected_polygons_gdf = gpd.read_file('polygons.shp')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read the files and store the data\n",
"\n",
"This is an essential part of the code. Here, we go through all model years as specified in the cfg-file and do the following:\n",
"### Background\n",
"\n",
"This is an essential part of the code. For a machine-learning model to work, it requires a samples matrix (X), representing the 'drivers' of conflict, and target values (Y) representing the conflicts themselves. By fitting a machine-learning model, a relation between X and Y is established, which in turn can be used to make projections.\n",
"\n",
"Additional information can be found on [scikit-learn](https://scikit-learn.org/stable/getting_started.html#fitting-and-predicting-estimator-basics).\n",
"\n",
"Since CoPro simulates conflict risk not only globally, but also spatially explicit for provided polygons, it is furthermore needed to be able to associate each polygons with the corresponding data points in X and Y.\n",
"\n",
"### Implementation\n",
"\n",
"CoPro goes through all model years as specified in the cfg-file. Per year, CoPro loops over all polygons remaining after the selection procedure (see previous notebook) and does the following to obtain the X-data.\n",
"\n",
"1. Assing ID to polygon and retrieve geometry information;\n",
"2. Calculate the mean value per polygon from each input file specified in the cfg-file in section 'data'.\n",
"\n",
"And to obtain the Y-data:\n",
"\n",
"1. Assign a Boolean value whether a conflict took place in a polygon or not - the number of casualties or conflicts per year is not relevant in thise case.\n",
"\n",
"1. Get a 0/1 classifier whether a conflict took place in a geographical unit (here water province) or not;\n",
"2. Loop through various files with climate or environmental variables, and get mean variable value per geographical unit (here water province).\n",
"This information is stored in a X-array and a Y-array. The X-array has 2+n columns whereby n denotes the number of samples provided. The Y-array has obviously only 1 column.\n",
"In both arrays is the number of rows determined as number of years times the number of polygons. In case a row contains a missing value, the entire row is removed from the XY-array.\n",
"\n",
"This information is stored in a XY-array with then is split in two different arrays. The X-array represents all climate/environmental variable values per polygon per year, while the Y-array represents the binary classifier whether conflict took place or not. In case some variables did contain no data for a given water province, this data points is dropped entirely."
"Note that the sample values can still range a lot depending on their units, measurement, etc. In the next notebook, the X-data will be scaled to be able to compare the different values in the samples matrix."
]
},
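
The row/column layout described above can be made concrete with a small, entirely made-up example. This is a schematic sketch of the data structure only, not CoPro's internal code; the variable names and values are invented for illustration.

```python
# schematic sketch of the XY structure: one row per polygon per year
import numpy as np
import pandas as pd

years = [2000, 2001]
poly_ids = [101, 102, 103]                # hypothetical polygon IDs
rng = np.random.default_rng(42)

rows = []
for year in years:
    for poly_id in poly_ids:
        rows.append({
            'poly_ID': poly_id,
            'poly_geometry': 'POLYGON (...)',     # placeholder for the polygon geometry
            'var_precipitation': rng.normal(),    # mean variable value per polygon and year
            'var_temperature': rng.normal(),
            'conflict': int(rng.random() < 0.2),  # 1 if a conflict took place, else 0
        })

XY = pd.DataFrame(rows).dropna()          # rows with missing values are removed
X = XY.drop(columns=['conflict'])         # samples matrix: 2 + n variable columns
Y = XY['conflict'].to_numpy()             # target values: one column
print(X.shape, Y.shape)                   # (len(years) * len(poly_ids), 2 + n) and (rows,)
```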
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we did not specify a npy-file in the cfg-file, the provided files are read per year."
"Since we did not specify a pre-calculated npy-file in the cfg-file, the provided files are read per year."
]
},
{
@@ -176,13 +193,6 @@
"config.get('pre_calc', 'XY')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's get to it:"
]
},
{
"cell_type": "code",
"execution_count": 7,
Expand Down Expand Up @@ -222,7 +232,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"At the end of this function, the resulting XY-array (i.e. before splitting to make it easier) is by default stored to the input directory. This is handy because we now do not need to repeat the file reading and data storing anymore. At least as long as the settings do not change!"
"Depending on sample and file size, obtaining the X-array and Y-array can be time-consuming. Therefore, CoPro automatically stores a combined XY-array as npy-file if not specified otherwise in the cfg-file."
]
},
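
To re-use the stored array in a later run, it can be pointed to under XY in the [pre_calc] section of the cfg-file, or loaded manually as sketched below. The sketch re-uses the config object from the cells above; whether allow_pickle is required depends on the array contents (e.g. geometry objects) and is an assumption here.

```python
# illustrative sketch: loading the stored XY-array instead of re-reading all input files
import os
import numpy as np

XY_file = os.path.join(os.path.abspath(config.get('general', 'output_dir')), 'XY.npy')

if os.path.isfile(XY_file):
    XY = np.load(XY_file, allow_pickle=True)
    print('loaded pre-calculated XY-array with shape', XY.shape)
else:
    print('no pre-calculated XY-array found, input files will be read per year')
```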
{
Expand All @@ -244,20 +254,6 @@
"source": [
"os.path.isfile(os.path.join(os.path.abspath(config.get('general', 'output_dir')), 'XY.npy'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note** that the XY.npy can be stored anywhere as long its location is correctly specified in the cfg-file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {