In [1]:
%matplotlib inline
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import getpass
import folium
import nivapy
from sqlalchemy import create_engine

# NOPE metals

We need to develop an extension module for NOPE to simulate metals. It's not clear yet exactly how this should be done, but the first step is to explore some of the datasets sent by Øyvind following our meeting in Grimstad in September.

## 1. Data exploration

### 1.1. Data from Øyvind

#### 1.1.1. Spatial data

One of the raw datasets is a little unusual. It was originally sent to Øyvind by Anders Finstad and then forwarded on to me (see e-mail from Øyvind received 28/09/2017 at 15.52). The file is called *ecco_biwa_db_storage.csv* and it's large (nearly 700 MB). Anders hasn't provided much background information, but it looks as though this is a direct dump from a spatial database and the file includes hexadecimal-encoded Well-Known Binary (WKB) spatial data.

**Note:** The original file includes duplicated column headings `geom` and `ebint` (it looks as though this is due to a previous spatial join). I have therefore renamed the first occurrences to `geom2` and `ebint2`, respectively, to avoid naming conflicts.
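
The renaming described in the note could also be scripted rather than done by hand. The sketch below is one possible approach using only the standard library; the function name `rename_first_duplicates` and the `suffix` argument are hypothetical, and the source does not say how the renaming was actually carried out. It appends a suffix to the first occurrence of any heading that appears more than once.

```python
from collections import Counter


def rename_first_duplicates(headers, suffix="2"):
    """Return a copy of `headers` where the first occurrence of each
    duplicated name gets `suffix` appended (e.g. 'geom' -> 'geom2')."""
    counts = Counter(headers)
    already_renamed = set()
    out = []
    for name in headers:
        if counts[name] > 1 and name not in already_renamed:
            out.append(name + suffix)
            already_renamed.add(name)
        else:
            out.append(name)
    return out


# Illustrative header row (only the duplicated names come from the note above)
example = ["geom", "ebint", "some_column", "geom", "ebint"]
renamed = rename_first_duplicates(example)
```

Note that this does not check whether the new name (e.g. `geom2`) already exists elsewhere in the header, and a heading repeated three or more times would still be left with duplicates, so it is only suitable for simple cases like the one described here.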

The file contains a huge amount of data (nearly 500 columns), so it'll take a bit of figuring out. I also need to parse the hex-encoded WKB into something that I can display on a map.

In [2]:
# Read raw data
in_csv = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
          r'\NOPE\Metals\Raw_Datasets\ecco_biwa_db_storage.csv')
df = pd.read_csv(in_csv, sep=';')

df.head()

Unnamed: 0,geom2,ebint2,corine_2000_continuous_urban_fabric_area_km2,corine_2000_discontinuous_urban_fabric_area_km2,corine_2000_industrial_or_commercial_units_area_km2,corine_2000_road_and_rail_networks_and_associated_land_area_km2,corine_2000_port_areas_area_km2,corine_2000_airports_area_km2,corine_2000_mineral_extraction_sites_area_km2,corine_2000_dump_sites_area_km2,...,column_44,column_45,no3_ug_l,no3_p_vektbasert,ebint,dist_closest_ebint,no_lakes_in_250m,no_lakes_in_100m,no_lakes_in_10m,dist_2nd_closest_ebint
0,0106000020E964000001000000010300000001000000D1...,6014758,0,0,0,0,0,0,0,0,...,82.0,,248.0208,0.396833,6014758,23.216747,1,1,0,
1,0106000020E9640000010000000103000000010000009D...,1933788,0,0,0,0,0,0,0,0,...,83.992481,46.8,230.305029,0.397763,1933788,0.0,1,1,1,
2,0106000020E964000001000000010300000001000000CF...,11900284,0,0,0,0,0,0,0,0,...,47.0,,62.0052,0.169877,11900284,0.0,1,1,1,
3,0106000020E9640000010000000103000000010000001B...,10515000,0,0,0,0,0,0,0,0,...,52.0,12.92,105.40884,0.21512,10515000,0.0,1,1,1,
4,0106000020E96400000200000001030000000100000091...,15254282,0,0,0,0,0,0,0,0,...,154.259399,6.75,44.289429,0.150644,15254282,0.0,1,1,1,


It is possible to use [Shapely](https://github.com/Toblerity/Shapely) to parse the WKB data, after which the dataframe can be converted to a geodataframe. My initial idea was to then save this as a shapefile for display in ArcGIS. However, the column names in the file are too long for the shapefile format, and naming conflicts arise when the columns are truncated. An alternative is therefore to save the data in GeoJSON format, which is more flexible. Unfortunately, there's then no way to read this using ArcMap. QGIS can read GeoJSON, but a 700 MB JSON file is unwieldy and difficult to manipulate. The code below is nevertheless useful and worth recording for the future.

In [None]:
#from shapely import wkb
#from functools import partial # Makes it possible to use "map" with kwargs - see
#                              # https://stackoverflow.com/a/13499853/505698
#
## We need to apply wkb.loads to the 'geom2' col, using the kwarg
## hex=True. To do this, use 'partial' to create a new func
#map_func = partial(wkb.loads, hex=True)
#
## Parse geometry data
#geometry = df['geom2'].map(map_func)
#
## Delete plain text geom
#df = df.drop('geom2', axis=1)
#
## Build gdf
#crs = {'init': 'epsg:32633'} # Numbers look like UTM. Assume Zone 33N
#gdf = gpd.GeoDataFrame(df, crs=crs, geometry=geometry)
#
## Write to GeoJSON
## Works, but some encoding issues with special chars in some columns
## Don't think utf-8 is supported
#out_path = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
#            r'\NOPE\Metals\Raw_Datasets\ecco_biwa_db_storage.json')
#gdf.to_file(out_path, driver='GeoJSON')
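
The CRS in the cell above is an assumption, but it may be possible to check it directly from the data. The hex strings in the `df.head()` output all begin with `0106000020E9640000`, which looks like PostGIS "extended" WKB (EWKB), where an SRID can be embedded in the header. The sketch below is a minimal decoder for just that header, using only the standard library. It assumes the column really is EWKB, and the function name `ewkb_header` is hypothetical.

```python
import struct

SRID_FLAG = 0x20000000  # EWKB flag bit indicating that an SRID follows the type


def ewkb_header(hex_wkb):
    """Return (geometry_type_code, srid) from the start of a hex EWKB string.

    srid is None if the SRID flag is not set.
    """
    raw = bytes.fromhex(hex_wkb[:18])  # 1 byte order + 4 type + 4 SRID
    byte_order = "<" if raw[0] == 1 else ">"  # 1 = little-endian
    (type_word,) = struct.unpack(byte_order + "I", raw[1:5])
    geom_type = type_word & 0xFF  # e.g. 3 = Polygon, 6 = MultiPolygon
    srid = None
    if type_word & SRID_FLAG:
        (srid,) = struct.unpack(byte_order + "I", raw[5:9])
    return geom_type, srid


# Prefix copied from the df.head() output above
geom_type, srid = ewkb_header("0106000020E9640000")
```

For this prefix the decoder gives geometry type 6 (MultiPolygon) and SRID 25833, i.e. ETRS89 / UTM zone 33N rather than EPSG:32633 (WGS 84 / UTM zone 33N). The two are very similar in practice, but this is worth confirming before hard-coding a CRS.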

Rather than working with a huge GeoJSON file, it would be better to put this information back into a spatial database. I've spent a few hours messing around with Spatialite, but the development is patchy and there are issues with getting everything installed correctly. Although it seems a bit over-the-top, it's actually much easier to just install PostGIS.

Having [installed PostgreSQL 9.6 and PostGIS 2.4](http://www.bostongis.com/PrinterFriendly.aspx?content_name=postgis_tut01), I've created a new database called `niva_work`, which I'll use for manipulating spatial datasets in the future. For reference, having created this database, spatial extensions are enabled by right-clicking `Extensions` and choosing `PostGIS` from the list of extension names. I have then created a new schema called `nope_metals`, which I'll use for this project.

I have also done some further development of NivaPy to make it easier to integrate spatial and non-spatial data processing in different databases - see this [notebook](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/oracle_postgis_test.ipynb) for details.

In [2]:
# Connect to db
pg_eng = nivapy.da.connect(src='postgres')

········


In [6]:
## Write raw text data to new table
#df.to_sql('ecco_biwa', schema='nope_metals', index=False, 
#          con=pg_eng)
#
## Parse geom column from hex-encoded WKB. See:
## https://gis.stackexchange.com/a/233215/2131
#sql = ("ALTER TABLE niva_work.nope_metals.ecco_biwa "
#       "ALTER COLUMN geom2 TYPE geometry(MULTIPOLYGON, 32633) " # UTM Zone 33N
#       "USING ST_SetSRID(geom2, 32633)")
#res = pg_eng.execute(sql)
#
## Build spatial index
#sql = ("CREATE INDEX nope_metals_ecco_biwa_gix "
#       "ON niva_work.nope_metals.ecco_biwa "
#       "USING GIST (geom2)")
#res = pg_eng.execute(sql)

This data can now be explored using QGIS (`Layer > Add layers > Add PostGIS layers`) by supplying the following credentials:

    Name:     niva_work
    Host:     localhost
    Database: niva_work
    
The CSV has data for 4677 catchments across Fennoscandia, of which 990 are located in Norway. As demonstrated in this [notebook](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/oracle_postgis_test.ipynb), I'm pretty sure this is the "1000 lakes" dataset, where each polygon corresponds to a lake catchment area. A large number of parameters have been derived for each location, including air and water chemistry (but not much on metals), CORINE and NDVI statistics over time, population data etc.

#### 1.1.2. Other datasets

In addition to the spatial dataset described above, Øyvind has also supplied the following (via Tom Andersen):

 * **N1k_dat_29102014.txt**. Catchment characteristics for the "1000 lakes" survey
 
 * **ReginCC_data_170708b.txt**. Catchment properties for around 8000 regine catchments
 
Both these datasets are potentially useful, but it is difficult to interpret exactly what they mean without more details regarding column headings etc. **Come back to this later**.

### 1.2. External datasets

The [Meteorological Synthesizing Centre-East (MSC-E)](http://www.msceast.org/index.php/pollution-assessment/emep-domain-menu/data-hm-pop-menu) is part of EMEP, and their website includes simulated deposition of some metals (Cd, Hg and Pb) for Norway in recent years (2014 and 2015). The data are provided on the 50 km EMEP grid and are saved locally here:

C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet\NOPE\Metals\Raw_Datasets\EMEP


 
## 2. "1000 lakes" dataset

Before exploring the datasets above in too much detail, I'd like to investigate the data we already have in RESA2. In 1995, NIVA undertook a national scale survey of 1000 lakes, which included testing for metals.

This seems like a good starting point. As an initial data exploration, I'll attempt the following workflow:

  1. Extract metal concentrations (Ag, As, Pb, Cd, Cu, Cr, Ni, Hg and Zn) from RESA2 for all the lakes in the 1995 "1000 lakes" survey <br><br>
  
  2. Link these data to e.g. atmospheric deposition, land use, population etc. in the spatial data from Anders  <br><br>
  
  3. See if any relationships can be identified from which national scale metals loads can be estimated
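
Steps 2 and 3 of this workflow amount to a join on a shared lake identifier followed by some measure of association. The sketch below illustrates the idea in plain Python so that it has no dependencies; in the notebook itself, pandas `merge` and `corr` would be the more natural tools. All names here (`join_on_key`, `pearson_r`, `lake_id`, `pb_ug_l`, `pb_dep`) are hypothetical, and the numbers are placeholders used only to exercise the functions, not real measurements.

```python
from math import sqrt


def join_on_key(left, right, key):
    """Inner-join two lists of dicts on a shared key."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder rows: lake chemistry (step 1) and catchment attributes (step 2)
metals = [
    {"lake_id": 1, "pb_ug_l": 1.0},
    {"lake_id": 2, "pb_ug_l": 2.0},
    {"lake_id": 3, "pb_ug_l": 3.0},
]
catchments = [
    {"lake_id": 2, "pb_dep": 20.0},
    {"lake_id": 3, "pb_dep": 30.0},
    {"lake_id": 4, "pb_dep": 40.0},
]

# Only lakes present in both datasets survive the join
joined = join_on_key(metals, catchments, "lake_id")
r = pearson_r(
    [row["pb_dep"] for row in joined],
    [row["pb_ug_l"] for row in joined],
)
```

A simple correlation like this is only a first screening step. Any relationship used to estimate national scale loads would need a proper regression analysis and checks against independent data.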