# Notebook 6 - Extracting Data from the DABO Well Database

[GemGIS](https://github.com/cgre-aachen/gemgis) is a package for geographic information processing for geomodeling. In particular, data is prepared for direct use in [GemPy](https://github.com/cgre-aachen/gempy) via a GemPy Data Class. The package provides functions to process spatial data such as vector data (shape files, geojson files, geopackages), raster data (tiff-files), data retrieved from online services (WMS, WCS, WFS) or KML/XML files. 

At a later stage, functionality will be added to interactively add interfaces and orientations for a model, choose the extent, define custom sections and more. In addition, functionality will be provided to export data from GemPy into Geographic Information Systems (GIS) such as QGIS or ArcGIS, and into Google Earth.

# Overview

This notebook presents the extraction of borehole data (location of wells and stratigraphy) from logs provided by the Geological Survey NRW. The raw data of the logs is **NOT** provided with the repository but data can be requested from the Geological Survey NRW at no cost.

- [Downloading and Installing GemGIS](#gemgis)
- [Structure of GemGIS](#structure)
- [Importing Libraries](#import)
- [Version Reports](#vreport)


<a id='gemgis'></a>
## Downloading and installing GemGIS

`GemGIS` is under constant development and the latest available version can be downloaded at https://github.com/cgre-aachen/gemgis. A pip version can be found at https://pypi.org/project/gemgis/. A dedicated documentation page will follow.

<a id='structure'></a>
## Structure of GemGIS

The core of `GemGIS` is made of the `GemPyData` class (`gemgis.py`). Its attributes can directly be utilized by `GemPy` making it easier for users to load data. Methods of the `GemPyData` class allow users to directly set these attributes. Multiple other files contain functions to manipulate vector data, raster data, etc.:

* `gemgis.py` - core file containing the `GemPyData` class
* `vector.py` - file containing functions to manipulate vector data
* `raster.py` - file containing functions to manipulate raster data
* `utils.py` - file containing utility functions frequently used for the manipulation of vector/raster data
* `wms.py` - file containing methods to load online services as vector and raster data
* `visualization.py` - file containing functions to simplify plotting of spatial data
* `postprocessing.py` - file containing functions to postprocess GemPy geo_model data


If you have any problems using GemGIS, find a bug or have an idea for a new feature, open an issue at https://github.com/cgre-aachen/gemgis/issues. 

<a id='import'></a>
# Importing Libraries

Apart from creating a `GemPyData` object later in the tutorial, GemGIS works with plain GeoDataFrames, Rasterio objects and NumPy arrays to keep data handling simple for the user. ***Currently, geopandas version 0.8 is the latest stable version that is supported by GemGIS***. A general introduction to working with rasters and Rasterio objects in GemGIS is provided in the next notebook.

The first step is loading `GemGIS` together with the auxiliary libraries `geopandas` and `rasterio` as well as `NumPy` and `Matplotlib`. `GemGIS` will also load `GemPy` in the background. If the installation of `GemPy` was not successful, `GemGIS` cannot be used.
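Because `GemGIS` depends on `GemPy` and several geospatial libraries, it can help to check the environment before running the imports. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of GemGIS.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Packages imported in this notebook, plus GemPy, which GemGIS loads in the background
required = ['gempy', 'geopandas', 'rasterio', 'numpy', 'matplotlib']
missing = missing_packages(required)

if missing:
    print('Missing packages:', ', '.join(missing))
else:
    print('All required packages were found.')
```

The check only confirms that a package can be located; it does not verify that the installed version is compatible.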

In [1]:
import sys
sys.path.append('../../../gemgis')
import gemgis as gg
import geopandas as gpd
import rasterio
import numpy as np
import matplotlib.pyplot as plt
print(gg)



<module 'gemgis' from '../../../gemgis\\gemgis\\__init__.py'>


# Load PDF File as String and Save to txt-file

Borehole logs provided by the Geological Survey NRW through its database DABO (https://www.gd.nrw.de/gd_archive_dabo.htm) can be parsed to obtain a Pandas DataFrame that can be used for modeling with `GemPy`. The raw files are not provided with the repository but can be requested for a particular area, target depth and target horizon directly from the Geological Survey. The boreholes investigated here are from the Münster area.

The PDF is loaded as a string and saved as a `txt` file for further processing; the string can also be used directly.

In [2]:
data = gg.misc.load_pdf('../../../BoreholeDataMuenster.pdf')

FileNotFoundError: [Errno 2] No such file or directory: '../../../BoreholeDataMuenster.pdf'
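The cell above only loads the PDF; writing the extracted string to disk is a separate step. A minimal sketch using only the standard library is given below. The placeholder string stands in for the text extracted from the PDF, and a temporary directory replaces the real file path; both are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Placeholder standing in for the string extracted from the PDF
data_demo = 'Placeholder borehole log text from Münster\nsecond line'

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = Path(tmp_dir) / 'BoreholeDataDemo.txt'
    # Write with an explicit encoding so that umlauts survive the round trip
    txt_path.write_text(data_demo, encoding='utf-8')
    # Read the file back to confirm that nothing was lost
    restored = txt_path.read_text(encoding='utf-8')

print(restored == data_demo)
```

Passing the same `encoding` when the file is reopened later avoids platform-dependent defaults.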

# Open saved txt file as string

The saved file can be loaded as a string for further processing.

In [None]:
with open('../../../BoreholeDataMuenster.txt', "r") as text_file:
    data = text_file.read()

In [None]:
data[:100]

# Create Coordinate DataFrame

The borehole coordinates can be extracted from the string containing all borehole information with the function below and stored as a Pandas DataFrame.

In [None]:
coordinates_dataframe = gg.misc.coordinates_table_list_comprehension(data, 'GD')
coordinates_dataframe.head()

# Converting DataFrame to GeoDataFrame

The DataFrame can be converted to a GeoDataFrame. 

In [None]:
gdf = gpd.GeoDataFrame(
    coordinates_dataframe, geometry=gpd.points_from_xy(coordinates_dataframe.X, coordinates_dataframe.Y))
gdf.head()

# Loading WMS Service for Background Imagery

The WMS Service used in Tutorial 3 will also be used here for Background Imagery. 

## Load WMS Layer and Map
A basic WMS layer with OpenStreetMap data is loaded as a reference to better locate the data.

In [None]:
wms = gg.wms.load('https://ows.terrestris.de/osm/service?')

In [None]:
wms_map = gg.wms.load_as_array('https://ows.terrestris.de/osm/service?',
                               'OSM-WMS', 'default', 'EPSG:4647',
                               [32375000, 32435000, 5730000, 5790000],
                               [4000, 2000], 'image/png')
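The extent list in the call above appears to be ordered as `[minx, maxx, miny, maxy]`, the same order Matplotlib's `imshow` expects for `extent`, whereas WMS `GetMap` requests conventionally take the bounding box as `(minx, miny, maxx, maxy)`. The sketch below shows the reordering and how an image size matching the aspect ratio of the extent could be derived. Both helper names are hypothetical and not part of GemGIS.

```python
def extent_to_bbox(extent):
    """Reorder [minx, maxx, miny, maxy] to (minx, miny, maxx, maxy)."""
    minx, maxx, miny, maxy = extent
    return (minx, miny, maxx, maxy)


def size_for_extent(extent, width):
    """Return [width, height] in pixels with the same aspect ratio as the extent."""
    minx, maxx, miny, maxy = extent
    height = round(width * (maxy - miny) / (maxx - minx))
    return [width, height]


extent = [32375000, 32435000, 5730000, 5790000]
print(extent_to_bbox(extent))
print(size_for_extent(extent, 2000))
```

The extent used here spans 60 km in both directions, so a square image preserves its aspect ratio.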

# Plotting the Data

The data of the GeoDataFrame can then easily be plotted to show the spatial distribution of the wells. The background map shows the area around Münster.

In [None]:
fig, ax1 = plt.subplots()
ax1.imshow(wms_map, extent= [32375000,32435000,5730000,5790000])
gdf.plot(ax=ax1, markersize=5)
ax1.grid()
ax1.set_xlabel('m')
ax1.set_ylabel('m')
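It can be useful to confirm that the well coordinates actually fall inside the extent of the background map, since points outside it will not appear on the plot as expected. A minimal sketch in plain Python follows; the coordinates are made up for illustration and the helper `inside_extent` is hypothetical.

```python
def inside_extent(x, y, extent):
    """Check whether a point lies within [minx, maxx, miny, maxy]."""
    minx, maxx, miny, maxy = extent
    return minx <= x <= maxx and miny <= y <= maxy


extent = [32375000, 32435000, 5730000, 5790000]

# Made-up coordinates: the first lies inside the extent, the second outside
wells = [(32405000.0, 5757000.0), (32300000.0, 5628000.0)]

flags = [inside_extent(x, y, extent) for x, y in wells]
print(flags)
```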

# Extract Stratigraphic Column from Borehole Logs

In addition to the coordinates of the boreholes, the stratigraphic column provided in the logs can also be extracted.

# Load supplementary Data

In [None]:
with open('../../../gemgis/data/misc/symbols.txt', "r") as text_file:
    symbols = [(i, '') for i in text_file.read().splitlines()]

with open('../../../gemgis/data/misc/formations.txt', "r") as text_file:
    formations = text_file.read().split()
    
formations = [(formations[i], formations[i+1]) for i in range(0,len(formations)-1,2)]
formations[:10]
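The list comprehension above pairs consecutive whitespace-separated tokens, so the file is expected to alternate between two kinds of entries. The sketch below reproduces that pairing on a made-up string (the tokens are placeholders, not the real contents of `formations.txt`) and shows an equivalent formulation with `zip`.

```python
# Placeholder text: alternating tokens, separated by whitespace
text = 'a1 FormationA b2 FormationB c3 FormationC'
tokens = text.split()

# Pairing by index, as in the cell above
pairs_indexed = [(tokens[i], tokens[i + 1]) for i in range(0, len(tokens) - 1, 2)]

# Equivalent pairing: even-indexed tokens zipped with odd-indexed tokens
pairs_zipped = list(zip(tokens[::2], tokens[1::2]))

print(pairs_zipped)
print(pairs_indexed == pairs_zipped)
```

Because `split()` separates on any whitespace, an entry containing a space would shift all subsequent pairs.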

# Load Txt File

# Extract Stratigraphic Data

In [None]:
df = gg.misc.stratigraphic_table_list_comprehension(data, 'GD', symbols, formations)

In [None]:
print(len(df))
df.head(10)

# Categorized Formations

In [None]:
# Formations to be removed from the DataFrame
formations_to_drop = [
    'Quaternary', 'Coniacium', 'Turonium',
    'OberCampanium', 'UnterCampanium',
    'OberSantonium', 'MittelSantonium', 'UnterSantonium', 'AachenFM/UnterSantonium',
    'MittelConiacium', 'UnterConiacium',
    'OberTuronium', 'MittelTuronium', 'UnterTuronium',
    'OberCenomanium', 'MittelCenomanium', 'UnterCenomanium',
    'OberAlbium', 'MittelAlbium', 'UnterAlbium',
    'EssenFM', 'BochumFM', 'WittenFM',
    'Carboniferous', 'Devonian', 'Cenomanium', 'Cretaceous',
    'HorstFM', 'DorstenFM', 'Zechstein', 'UntererKeuperGP',
    'UnterJura', 'OberJura', 'MittelJura',
    'Oberkreide', 'Unterkreide', 'OberConiacium',
]

# Keep only the rows whose formation is not in the list above
df = df[~df['formation'].isin(formations_to_drop)]