<img src="http://openenergy-platform.org/static/OEP_logo_2_no_text.svg" alt="OpenEnergy Platform" height="100" width="100"  align="left"/>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Bhl_neue_wege_logo_transparent.svg/2000px-Bhl_neue_wege_logo_transparent.svg.png" alt="BHL" height="300" width="300" align="right"/>

# OpenEnergyPlatform
<br><br>

# MaStR Data Cleaning
Repository: https://github.com/OpenEnergyPlatform/data-preprocessing/tree/master/data-import/bnetza_mastr

Please report bugs and improvements here: https://github.com/OpenEnergyPlatform/data-preprocessing/issues <br>
How to get started with Jupyter Notebooks can be found here: https://github.com/OpenEnergyPlatform/oeplatform/wiki

In [None]:
__copyright__ = "Bauhaus Luftfahrt e.V."
__license__   = "GNU Affero General Public License Version 3 (AGPL-3.0)"
__url__       = "https://github.com/openego/data_processing/blob/master/LICENSE"
__author__    = "Benjamin W. Portner"

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Drop-duplicates" data-toc-modified-id="Drop-duplicates-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Drop duplicates</a></span></li><li><span><a href="#Drop-unlocated" data-toc-modified-id="Drop-unlocated-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Drop unlocated</a></span></li><li><span><a href="#Drop-incorrectly-located" data-toc-modified-id="Drop-incorrectly-located-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Drop incorrectly located</a></span></li></ul></div>

## Load data

First off, imports.

In [None]:
import os

import geopandas as gpd
import geoviews as gv  # used at the end of the notebook to export the maps
import pandas as pd
from shapely.geometry import Point, Polygon

from helpers import *

Now, load the datasets for hydro, wind and biomass:

In [None]:
version = '1.4'

fn_wind = f'bnetza_mastr_rli_v{version}_wind'
df_wind = pd.read_csv(f'data/OEP/bnetza_mastr_power-units_rli_v{version}/{fn_wind}.csv', 
                      encoding='utf8', sep=';', parse_dates=["Inbetriebnahmedatum"], dtype={"Postleitzahl":str})

fn_hydro = f'bnetza_mastr_rli_v{version}_hydro'
df_hydro = pd.read_csv(f'data/OEP/bnetza_mastr_power-units_rli_v{version}/{fn_hydro}.csv', 
                       encoding='utf8', sep=';', parse_dates=["Inbetriebnahmedatum"], dtype={"Postleitzahl":str})

fn_biomass = f'bnetza_mastr_rli_v{version}_biomass'
df_biomass = pd.read_csv(f'data/OEP/bnetza_mastr_power-units_rli_v{version}/{fn_biomass}.csv', 
                         encoding='utf8', sep=';', parse_dates=["Inbetriebnahmedatum"], dtype={"Postleitzahl":str})
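
The three read calls above differ only in the technology name, so they could be folded into a small helper. The following is a sketch only: `load_mastr_csv` and `mastr_path` are hypothetical names that are not part of the `helpers` module, and the sketch assumes the same separator, encoding and column names as the cell above.

```python
import pandas as pd


def load_mastr_csv(path_or_buffer):
    """Read one MaStR export with the same options as the cell above."""
    return pd.read_csv(
        path_or_buffer,
        encoding="utf8",
        sep=";",
        parse_dates=["Inbetriebnahmedatum"],  # commissioning date
        dtype={"Postleitzahl": str},          # keep leading zeros in postal codes
    )


def mastr_path(technology, version="1.4"):
    """Build the file path used in this notebook for one technology."""
    folder = f"data/OEP/bnetza_mastr_power-units_rli_v{version}"
    return f"{folder}/bnetza_mastr_rli_v{version}_{technology}.csv"
```

With these helpers, loading the wind data would read `df_wind = load_mastr_csv(mastr_path("wind"))`.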

Merge them into one DataFrame:

In [None]:
# pd.concat works across pandas versions (DataFrame.append was removed in pandas 2.0)
df_all = pd.concat([df_wind, df_hydro, df_biomass], ignore_index=True, sort=False)
df_all.head()

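The merged DataFrame contains a column named `Unnamed: 0`, most likely a row index that was written into the CSV files. It carries no information, so let's drop it: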

In [None]:
df_all.drop(columns=['Unnamed: 0'], inplace=True)
df_all.head()

## Drop duplicates

Next, let's see if all the entries are unique. A plain `drop_duplicates()` does not remove any rows from this dataset:

In [None]:
len_before = len(df_all)
len_after = len(df_all.drop_duplicates())

len_before, len_before == len_after

However, quite a few entries share the same `EinheitMastrNummer` and are otherwise nearly identical:

In [None]:
df_all[df_all.duplicated(subset="EinheitMastrNummer", keep=False)].sort_values(by="EinheitMastrNummer").head()

Why were these not dropped before? Because the dataset contains a timestamp that documents the time of download for each entry. Naturally, that timestamp is unique for each entry. Let's ignore it and repeat:

In [None]:
# create one dataframe with all duplicated entries
has_double = df_all.drop(columns=["timestamp_e"]).duplicated(keep=False)
df_duplicated = df_all[has_double].copy().sort_values(by="EinheitMastrNummer").reset_index(drop=True)

# create one dataframe with only unique entries
is_unique = ~df_all.drop(columns=["timestamp_e"]).duplicated(keep="first")
df_unique = df_all[is_unique].copy().sort_values(by="EinheitMastrNummer").reset_index(drop=True)

# compare sizes
len(df_all), len(df_duplicated), len(df_unique)

There are 65,133 unique entries. That is 6,004 fewer than in the merged dataset. There are 9,381 entries which have at least one duplicate.
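
The effect of the timestamp column can be reproduced on a toy table. This is a minimal sketch with made-up values; the `Bruttoleistung` column and all entries are illustrative assumptions, not rows from the MaStR dataset.

```python
import pandas as pd

# Toy data: rows 0 and 1 describe the same unit, downloaded at different times.
df_toy = pd.DataFrame({
    "EinheitMastrNummer": ["SEE1", "SEE1", "SEE2"],
    "Bruttoleistung": [2000.0, 2000.0, 500.0],
    "timestamp_e": ["2019-01-01 10:00", "2019-01-01 10:05", "2019-01-01 10:07"],
})

# Comparing all columns finds no duplicates, because the timestamps differ.
n_full = int(df_toy.duplicated().sum())

# Ignoring the timestamp column reveals the duplicate.
without_ts = df_toy.drop(columns=["timestamp_e"])
n_redundant = int(without_ts.duplicated(keep="first").sum())   # rows to drop
n_with_double = int(without_ts.duplicated(keep=False).sum())   # rows that have a twin

print(n_full, n_redundant, n_with_double)
```

Here `keep="first"` counts only the redundant copies, while `keep=False` counts every row that has a twin, which is why the two numbers reported above differ.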

Export the result:

In [None]:
# create output directory if it does not already exist
outdir = './output'
os.makedirs(outdir, exist_ok=True)

# export
df_duplicated.to_csv("output/OEP_0_duplicated.csv", index=False)
df_unique.to_csv("output/OEP_0_unique.csv", index=False)

## Drop unlocated

Next, I will remove all entries which do not have coordinates associated with them.

In [None]:
unlocated_index = df_unique[["Laengengrad", "Breitengrad"]].isna().any(axis=1)
df_unlocated = df_unique[unlocated_index].copy().reset_index(drop=True)
df_located = df_unique[~unlocated_index].copy().reset_index(drop=True)
len(df_unique), len(df_located), len(df_unlocated)

A little more than half of the entries have no coordinates. Export:

In [None]:
df_unlocated.to_csv("output/OEP_1_unlocated.csv", index=False)
df_located.to_csv("output/OEP_1_located.csv", index=False)
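
The split into located and unlocated entries follows a pattern that recurs in this notebook: build a boolean mask, then keep both halves. A small sketch of that pattern is shown below; `split_by_missing` is a hypothetical helper (not part of the `helpers` module) and the coordinates are made up.

```python
import pandas as pd


def split_by_missing(df, columns):
    """Return (complete, incomplete): rows with all / not all of `columns` set."""
    missing = df[columns].isna().any(axis=1)
    complete = df[~missing].reset_index(drop=True)
    incomplete = df[missing].reset_index(drop=True)
    return complete, incomplete


# Toy data: SEE2 lacks a longitude, SEE3 lacks a latitude.
df_toy = pd.DataFrame({
    "EinheitMastrNummer": ["SEE1", "SEE2", "SEE3"],
    "Laengengrad": [11.6, None, 9.9],
    "Breitengrad": [48.1, 52.5, None],
})
located, unlocated = split_by_missing(df_toy, ["Laengengrad", "Breitengrad"])
```

Using `any(axis=1)` means an entry counts as unlocated as soon as one of the two coordinates is missing.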

## Drop incorrectly located

Some entries have coordinates outside Germany. Some of these are offshore wind turbines. However, there are entries which have addresses in Bayern or Niedersachsen but coordinates in Italy or even Africa. Also, some entries lie within Germany but in the wrong Bundesland. I will identify and remove these.


First, load the shapefiles. They contain polygons describing the boundaries of the Bundesländer, and of the German exclusive economic zones in the Baltic Sea and in the North Sea:

In [None]:
# forward slashes work on Windows, Linux and macOS alike
gdf_bld = gpd.read_file("data/shapefiles_germany/BKG/vg2500_bld.shp")
gdf_baltic = gpd.read_file("data/shapefiles_germany/marineregions.org/DE_baltic_sea/eez_iho.shp")
gdf_north_sea = gpd.read_file("data/shapefiles_germany/marineregions.org/DE_north_sea/eez_iho.shp")

gdf_bld.head()

The nomenclature to describe the Bundesländer is different from the OEP standard. Fix that:

In [None]:
rename_map = {
    'Thüringen': 'Thueringen',
    'Sachsen-Anhalt': 'SachsenAnhalt',
    'Mecklenburg-Vorpommern': 'MecklenburgVorpommern',
    'Nordrhein-Westfalen': 'NordrheinWestfalen',
    'Rheinland-Pfalz': 'RheinlandPfalz',
    'Schleswig-Holstein': 'SchleswigHolstein',
    'Baden-Württemberg': 'BadenWuerttemberg',
}
gdf_bld["GEN"] = gdf_bld["GEN"].replace(rename_map)
gdf_baltic["GEN"] = "AusschliesslicheWirtschaftszone"
gdf_north_sea["GEN"] = "AusschliesslicheWirtschaftszone"
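
Since the later location check compares names as plain strings, a single spelling mismatch would silently mark every entry of that Bundesland as incorrectly located. A quick sanity check can catch this. The sketch below uses a hypothetical helper, `unmatched_names`, and illustrative values rather than the actual files.

```python
def unmatched_names(data_names, shape_names):
    """Return names used in the data that have no counterpart in the shapes."""
    known = set(shape_names)
    # Skip missing values (None or float NaN); only real names are compared.
    present = {name for name in data_names if isinstance(name, str)}
    return sorted(present - known)


# Illustrative values only, not taken from the actual files.
shape_names = ["Bayern", "SachsenAnhalt", "AusschliesslicheWirtschaftszone"]
data_names = ["Bayern", "Sachsen-Anhalt", None, "Bayern"]

print(unmatched_names(data_names, shape_names))
```

Applied to the notebook's objects, the arguments would be the unique values of the `Bundesland` column and the `GEN` values of all three GeoDataFrames; an empty list means every name has a matching polygon.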

Merge the three GeoDataFrames into one for easier handling:

In [None]:
# pd.concat keeps the GeoDataFrame type (GeoDataFrame.append is no longer available in recent versions)
gdf_bld_eez = pd.concat([gdf_bld, gdf_baltic, gdf_north_sea], ignore_index=True, sort=False)

Extend the polygons slightly to make sure that points close to their edge will be correctly identified as within:

In [None]:
buffer = 0.01
gdf_bld_eez["geometry"] = gdf_bld_eez.buffer(buffer)
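
The buffer distance is given in the units of the layer's coordinate reference system. Assuming the shapefiles use geographic coordinates in degrees (the notebook does not state this explicitly), the following rough sketch estimates what 0.01 corresponds to on the ground. It uses a spherical Earth approximation, and `degrees_to_km` is a hypothetical helper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def degrees_to_km(delta_deg, latitude_deg):
    """Approximate north-south and east-west extent of `delta_deg` degrees."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    north_south = delta_deg * km_per_degree
    # Meridians converge towards the poles, so east-west distances shrink.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# 51 degrees north is roughly the middle of Germany.
ns_km, ew_km = degrees_to_km(0.01, latitude_deg=51.0)
print(f"{ns_km:.2f} km north-south, {ew_km:.2f} km east-west")
```

Under these assumptions the tolerance is on the order of one kilometre, and it is not the same in both directions.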

Now I want to check for all entries if their coordinates are really located in the Bundesland specified in the entry. First, remove entries without Bundesland:

In [None]:
df_no_bld = df_located[df_located["Bundesland"].isna()].copy().reset_index(drop=True)
df_bld = df_located[~df_located["Bundesland"].isna()].copy().reset_index(drop=True)

To check if their coordinates are correct, I first need to convert all entries to a GeoDataFrame:

In [None]:
df_id_bld_coord = df_bld[["EinheitMastrNummer", "Bundesland", "Laengengrad", "Breitengrad"]]
# longitude is the x coordinate, latitude the y coordinate
points = [Point(xy) for xy in zip(df_id_bld_coord["Laengengrad"], df_id_bld_coord["Breitengrad"])]
crs = "EPSG:4326"  # WGS 84; the {'init': ...} dict syntax is deprecated
gdf_points = gpd.GeoDataFrame(df_id_bld_coord[["EinheitMastrNummer", "Bundesland"]], crs=crs, geometry=points)

Now, let's check. By default, all entries are tagged as incorrectly located:

In [None]:
df_bld["correct_bld"] = False

Iterate over the Bundesländer. Check for all entries of the Bundesland whether their coordinates are in the polygon defined in the shape file. If yes, tag them as correctly located.

In [None]:
for bld in df_bld["Bundesland"].unique():

    # get points which are supposedly in the Bundesland
    gdf_points_bld = gdf_points[gdf_points["Bundesland"] == bld]

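    # find all points which really are in the Bundesland
    # (an inner join keeps only the points that lie within the polygon;
    # recent geopandas versions use `predicate` instead of `op`)
    points_in_bld = gpd.sjoin(gdf_points_bld, gdf_bld_eez[gdf_bld_eez["GEN"]==bld], how="inner", predicate="within")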

    # tag those as correctly located
    df_bld.loc[
        df_bld["EinheitMastrNummer"].isin(points_in_bld["EinheitMastrNummer"]),
        "correct_bld"
    ] = True
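
To make the logic of the loop concrete without any geospatial libraries, here is a self-contained sketch of the same idea: each entry is tested only against the polygon of the region it claims to be in. The point-in-polygon test is a plain ray casting implementation, used here as a stand-in; it is not how geopandas implements `sjoin`. The square regions and the entries are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting test: True if (x, y) lies inside `polygon`.

    `polygon` is a list of (x, y) vertices; points exactly on an edge
    are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical square "regions" and entries, for illustration only.
regions = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
entries = [
    {"id": "SEE1", "region": "A", "lon": 1.0, "lat": 1.0},  # inside A
    {"id": "SEE2", "region": "A", "lon": 3.0, "lat": 1.0},  # claims A, lies in B
    {"id": "SEE3", "region": "B", "lon": 9.0, "lat": 9.0},  # outside everything
]

# Test each entry only against the region it claims to belong to.
correct = {
    e["id"]: point_in_polygon(e["lon"], e["lat"], regions[e["region"]])
    for e in entries
}
print(correct)
```

As in the notebook, an entry that lies in a valid region other than the one it claims is still flagged as incorrectly located.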

Separate correctly located entries from incorrectly located entries.

In [None]:
df_correct_lctn = df_bld[df_bld["correct_bld"]].copy().reset_index(drop=True)
df_incorrect_lctn = df_bld[~df_bld["correct_bld"]].copy().reset_index(drop=True)
len(df_bld), len(df_correct_lctn), len(df_incorrect_lctn)

Of the 32,259 entries which have Bundesland and coordinates specified, 31,685 are located correctly. 574 are located incorrectly.

Export the CSV files:

In [None]:
df_no_bld.to_csv("output/OEP_2_no_bld.csv", index=False)
df_correct_lctn.to_csv("output/OEP_2_correct_lctn.csv", index=False)
df_incorrect_lctn.to_csv("output/OEP_2_incorrect_lctn.csv", index=False)

Let's also export interactive maps of correctly and incorrectly located entries. These maps are zoomable and pannable. When hovering over a point on the map, the entry's name, address, type, and gross output are shown as a tooltip. Also, clicking on legend entries will show or hide the corresponding points. All the plotting logic is contained in the function `plotPowerPlants` of the `helpers` module:

In [None]:
plot_correct_lctn = plotPowerPlants(df_correct_lctn)
plot_incorrect_lctn = plotPowerPlants(df_incorrect_lctn)

# export as html
gv.renderer("bokeh").save(plot_correct_lctn, "output/OEP_2_correct_lctn")
gv.renderer("bokeh").save(plot_incorrect_lctn, "output/OEP_2_incorrect_lctn")

Looks like the outliers were identified correctly. Now, what can we do about the wrongly located entries? Maybe we can use data from other sources to get their coordinates. I will describe this in the [next notebook](OEP_MaStR_supplement_coords.ipynb).