# Samples matrix and target values

In this notebook, we will show how CoPro reads the samples matrix and target values needed to establish a machine-learning model.

## Preparations

Start by loading the required packages.

In [1]:
from copro import utils, pipeline

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import os, sys
import warnings
warnings.simplefilter("ignore")

For better reproducibility, the version numbers of all key packages are provided.

In [2]:
utils.show_versions()

Python version: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)]
copro version: 0.0.7b
geopandas version: 0.8.0
xarray version: 0.15.1
rasterio version: 1.1.0
pandas version: 1.0.3
numpy version: 1.18.1
scikit-learn version: 0.23.2
matplotlib version: 3.2.1
seaborn version: 0.11.0
rasterstats version: 0.14.0


To run this notebook, some of the data saved in the previous notebook needs to be loaded.

In [3]:
conflict_gdf = gpd.read_file('conflicts.shp')
selected_polygons_gdf = gpd.read_file('polygons.shp')

### The configurations-file (cfg-file)

To continue the simulation with the same settings as in the previous notebook, the cfg-file has to be read again and the model initialised.

In [4]:
settings_file = 'example_settings.cfg'

In [5]:
config, out_dir, root_dir = utils.initiate_setup(settings_file)


#### CoPro version 0.0.7b ####
#### For information about the model, please visit https://copro.readthedocs.io/ ####
#### Copyright (2020-2020): Jannis M. Hoch, Sophie de Bruin, Niko Wanders ####
#### Contact via: j.m.hoch@uu.nl ####
#### The model can be used and shared under the MIT license ####

INFO: verbose mode on: False
INFO: saving output to folder C:\Users\hoch0001\Documents\_code\copro\example\./OUT


## Read the files and store the data

### Background

This is an essential part of the code. For a machine-learning model to work, it requires a samples matrix (X), representing the 'drivers' of conflict, and target values (Y) representing the conflicts themselves. By fitting a machine-learning model, a relation between X and Y is established, which in turn can be used to make projections.

Additional information can be found in the [scikit-learn documentation](https://scikit-learn.org/stable/getting_started.html#fitting-and-predicting-estimator-basics).
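To make the roles of X and Y concrete, here is a minimal sketch of fitting and predicting with scikit-learn. The data are synthetic, and the choice of `RandomForestClassifier` is an assumption made for illustration only; it is not necessarily the model CoPro uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic samples matrix: 200 samples, 4 made-up 'driver' variables
X = rng.normal(size=(200, 4))
# Synthetic Boolean target: 'conflict' whenever the first driver is large
Y = X[:, 0] > 0.5

# Fitting establishes the relation between X and Y
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, Y)

# The fitted relation can then be applied to unseen samples
X_new = rng.normal(size=(5, 4))
Y_pred = clf.predict(X_new)
```

`Y_pred` holds one Boolean prediction per row of `X_new`.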

Since CoPro simulates conflict risk not only globally, but also in a spatially explicit way for the provided polygons, each polygon must be associated with its corresponding data points in X and Y.

### Implementation

CoPro goes through all model years as specified in the cfg-file. Per year, CoPro loops over all polygons remaining after the selection procedure (see previous notebook) and does the following to obtain the X-data:

1. Assign an ID to the polygon and retrieve its geometry information;
2. Calculate the mean value per polygon from each input file specified in the cfg-file in section 'data'.
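The per-polygon mean in step 2 can be illustrated with plain NumPy. This is only a sketch under simplifying assumptions: the raster, the `polygon_ids` grid, and the helper `mean_per_polygon` are hypothetical, and the polygons are represented as a grid of IDs rather than as vector geometries. It does not show how CoPro computes the statistic.

```python
import numpy as np

# Hypothetical 4x4 raster of one driver variable (e.g. precipitation)
raster = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 9.0, 10.0],
    [5.0, 6.0, 9.0, 10.0],
])

# Hypothetical polygon map: each cell holds the ID of the polygon covering it
polygon_ids = np.array([
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [30, 30, 20, 20],
    [30, 30, 20, 20],
])

def mean_per_polygon(raster, polygon_ids):
    """Return {polygon ID: mean of the raster cells inside that polygon}."""
    means = {}
    for poly_id in np.unique(polygon_ids):
        cells = raster[polygon_ids == poly_id]
        means[int(poly_id)] = float(cells.mean())
    return means

zonal_means = mean_per_polygon(raster, polygon_ids)
```

Each polygon ends up with a single value per input file and year, which becomes one entry in the corresponding row of the X-array.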

And to obtain the Y-data:

1. Assign a Boolean value indicating whether or not a conflict took place in a polygon; the number of casualties or conflicts per year is not relevant in this case.
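The Boolean assignment for the Y-data can be sketched as follows. The conflict records, polygon IDs, and years are made up for illustration; only the logic (any conflict in a polygon in a year yields `True`, regardless of counts) follows the description above.

```python
# Hypothetical conflict records: (polygon ID, year, casualties)
conflicts = [
    (10, 2000, 3),
    (10, 2000, 120),  # second conflict in the same polygon and year
    (30, 2001, 1),
]
polygon_ids = [10, 20, 30]
years = [2000, 2001]

# Set of (polygon, year) pairs with at least one conflict;
# casualties and the number of conflicts are deliberately ignored
conflict_pairs = {(poly_id, year) for poly_id, year, _ in conflicts}

# One Boolean per (year, polygon) combination, looping over years first
Y = [(poly_id, year) in conflict_pairs
     for year in years
     for poly_id in polygon_ids]
```

Polygon 10 in 2000 gets a single `True` even though two conflicts were recorded there.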

This information is stored in an X-array and a Y-array. The X-array has 2+n columns, where n denotes the number of samples provided. The Y-array has only 1 column.
In both arrays, the number of rows equals the number of years times the number of polygons. If a row contains a missing value, the entire row is removed from the XY-array.
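The row count and the removal of incomplete rows can be illustrated with a small NumPy sketch. All values are hypothetical, and for simplicity the ID and geometry columns are left out so that the array stays purely numeric.

```python
import numpy as np

n_years, n_polygons, n_vars = 2, 3, 2

# Hypothetical variable values, one row per (year, polygon) combination
X_vars = np.array([
    [0.1, 20.0],
    [0.4, np.nan],  # missing value: this row will be dropped
    [0.3, 22.5],
    [0.2, 19.0],
    [0.5, 21.0],
    [0.6, 23.0],
])
# Hypothetical target values (1 = conflict, 0 = no conflict), one column
Y = np.array([1, 0, 0, 0, 0, 1], dtype=float).reshape(-1, 1)

# Combine into one XY-array: rows = years * polygons
XY = np.hstack([X_vars, Y])

# Keep only rows without any missing value
keep = ~np.isnan(XY).any(axis=1)
XY_clean = XY[keep]

# Split back into samples matrix and Boolean target values
X_clean = XY_clean[:, :-1]
Y_clean = XY_clean[:, -1].astype(bool)
```

Dropping rows from the combined array, rather than from X and Y separately, keeps the samples and target values aligned.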

Note that the sample values can still vary widely depending on their units, measurement, etc. In the next notebook, the X-data will be scaled so that the different values in the samples matrix can be compared.

Since we did not specify a pre-calculated npy-file in the cfg-file, the provided files are read per year.

In [6]:
config.get('pre_calc', 'XY')

''

In [7]:
X, Y = pipeline.create_XY(config, out_dir, root_dir, selected_polygons_gdf, conflict_gdf)

{'poly_ID': Series([], dtype: float64), 'poly_geometry': Series([], dtype: float64), 'total_evaporation': Series([], dtype: float64), 'precipitation': Series([], dtype: float64), 'temperature': Series([], dtype: float64), 'irr_water_demand': Series([], dtype: float64), 'conflict_t-1': Series([], dtype: bool), 'conflict': Series([], dtype: bool)}

INFO: reading data for period from 2000 to 2015
INFO: entering year 2000
DEBUG: key poly_ID
DEBUG: key poly_geometry
DEBUG: key total_evaporation
DEBUG: key precipitation
DEBUG: key temperature
DEBUG: key irr_water_demand
DEBUG: key conflict_t-1
YOOOOO: now computing conflict for t-1
... it is the first year, so no conflict for previous year is known
DEBUG: key conflict
INFO: entering year 2001
DEBUG: key poly_ID
DEBUG: key poly_geometry
DEBUG: key total_evaporation
DEBUG: key precipitation
DEBUG: key temperature
DEBUG: key irr_water_demand
DEBUG: key conflict_t-1
YOOOOO: now computing conflict for t-1
DEBUG: key conflict
INFO: entering year 2

Depending on sample and file size, obtaining the X-array and Y-array can be time-consuming. Therefore, CoPro automatically stores a combined XY-array as an npy-file unless specified otherwise in the cfg-file.

In [8]:
os.path.isfile(os.path.join(os.path.abspath(config.get('general', 'output_dir')), 'XY.npy'))

True
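The caching idea described above can be sketched with `numpy.save` and `numpy.load`. The helper `load_or_create_xy` and its arguments are hypothetical and only mimic the behaviour described in this notebook (an empty setting means the array is built and stored; a non-empty path means it is loaded); this is not CoPro's actual implementation.

```python
import os
import tempfile
import numpy as np

def load_or_create_xy(pre_calc_path, out_dir, create_fn):
    """Load a stored XY-array if a path is given, else build and store it."""
    if pre_calc_path:  # non-empty setting: reuse the stored array
        return np.load(pre_calc_path)
    XY = create_fn()   # empty setting: do the time-consuming work once
    np.save(os.path.join(out_dir, "XY.npy"), XY)
    return XY

calls = []

def expensive_build():
    """Stand-in for reading all input files year by year."""
    calls.append(1)
    return np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    # First run: nothing pre-calculated, so the array is built and saved
    first = load_or_create_xy("", tmp_dir, expensive_build)
    cached_path = os.path.join(tmp_dir, "XY.npy")
    file_was_written = os.path.isfile(cached_path)
    # Second run: point to the stored file, so nothing is rebuilt
    second = load_or_create_xy(cached_path, tmp_dir, expensive_build)
```

In a later run, the time-consuming step is skipped entirely because the stored array is read instead.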