<img src="http://openenergy-platform.org/static/OEP_logo_2_no_text.svg" alt="OpenEnergy Platform" height="100" width="100"  align="left"/>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Bhl_neue_wege_logo_transparent.svg/2000px-Bhl_neue_wege_logo_transparent.svg.png" alt="BHL" height="300" width="300" align="right"/>

# OpenEnergyPlatform
<br><br>

# MaStR Supplement Coordinate Data
Repository: https://github.com/OpenEnergyPlatform/data-preprocessing/tree/master/data-import/bnetza_mastr

Please report bugs and suggest improvements here: https://github.com/OpenEnergyPlatform/data-preprocessing/issues <br>
A guide to getting started with Jupyter Notebooks is available here: https://github.com/OpenEnergyPlatform/oeplatform/wiki

In [None]:
__copyright__ = "Bauhaus Luftfahrt e.V."
__license__   = "GNU Affero General Public License Version 3 (AGPL-3.0)"
__url__       = "https://github.com/openego/data_processing/blob/master/LICENSE"
__author__    = "Benjamin W. Portner"

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#OEP-MaStR-data" data-toc-modified-id="OEP-MaStR-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>OEP MaStR data</a></span><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load data</a></span></li></ul></li><li><span><a href="#OPSD-data" data-toc-modified-id="OPSD-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>OPSD data</a></span><ul class="toc-item"><li><span><a href="#Load-and-clean-data" data-toc-modified-id="Load-and-clean-data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load and clean data</a></span></li></ul></li><li><span><a href="#Match-datasets" data-toc-modified-id="Match-datasets-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Match datasets</a></span></li><li><span><a href="#Export" data-toc-modified-id="Export-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Export</a></span></li></ul></div>

## OEP MaStR data

First off, imports.

In [None]:
import pandas as pd
import os
from helpers import *

### Load data

Load data with no / wrong coordinates. Merge into one DataFrame:

In [None]:
df_no_coord = pd.read_csv(
    "output/OEP_1_unlocated.csv", 
    parse_dates=["Inbetriebnahmedatum"], dtype={"Postleitzahl":str}
)
df_wrong_coord = pd.read_csv(
    "output/OEP_2_incorrect_lctn.csv",
    parse_dates=["Inbetriebnahmedatum"], dtype={"Postleitzahl":str}
)

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_OEP_unlocated = pd.concat([df_no_coord, df_wrong_coord], ignore_index=True, sort=False)
len(df_OEP_unlocated)

There are 33,447 entries that have no / incorrect coordinates.

## OPSD data

Open Power System Data (OPSD, https://open-power-system-data.org/) is another project offering power plant data for Germany (and many other European states). The datasets include power plant type, generation capacity, and geo-coordinates (among other fields).

### Load and clean data

The renewable plants dataset includes many different power plant types, including hydro, solar, wind, bioenergy, and geothermal. For this analysis, I will keep only those contained in the OEP dataset: hydro, bioenergy, and wind. Also, I will rename some columns for compatibility with the OEP MaStR data.

In [None]:
# load renewables
df_renewables = pd.read_csv(
    "data/OPSD/renewable_power_plants_DE.csv",
    # "ANSI" is a valid codec name on Windows only; cp1252 is assumed to be the equivalent here
    sep=";", encoding="cp1252", parse_dates=["commissioning_date"]
)

# keep only wind, bioenergy and hydro
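# .copy() makes the filtered frame independent, so the column assignments below do not trigger SettingWithCopyWarning
df_renewables = df_renewables[df_renewables["energy_source_level_2"].isin(["Wind","Bioenergy","Hydro"])].copy()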

# rename columns for compatibility with OEP data
df_renewables.rename(columns={
    "energy_source_level_2": "Einheittyp",
    "address" : "Standort",
    "federal_state": "Bundesland",
    "commissioning_date": "Inbetriebnahmedatum",
    "electrical_capacity": "Bruttoleistung",
    "lat": "Breitengrad",
    "lon": "Laengengrad",
    "eeg_id": "AnlagenschluesselEeg",
}, inplace=True)

# add columns which are not defined in OPSD data
df_renewables["Name"] = None
df_renewables["Land"] = "Deutschland"

# rename types for compatibility with OEP
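# assign the result back: inplace replace on a selected column is unreliable in recent pandas
df_renewables["Einheittyp"] = df_renewables["Einheittyp"].replace(
    {"Hydro": "Wasser", "Bioenergy": "Biomasse", "Wind": "Windeinheit"}
)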

df_renewables.head()

Apart from renewable plant data, OPSD also offers data on conventional power plants. Strangely, some biomass and hydro plants are not contained in the renewable dataset but in the conventional one. Let's add those to the renewable set.

In [None]:
# load conventional
df_conv = pd.read_csv(
    "data/OPSD/conventional_power_plants_DE.csv",
    # "ANSI" is a valid codec name on Windows only; cp1252 is assumed to be the equivalent here
    sep=",", encoding="cp1252", decimal=".", dtype={"commissioned": str}
)

# keep only wind, bioenergy and hydro
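# .copy() makes the filtered frame independent, so the column assignments below do not trigger SettingWithCopyWarning
df_conv = df_conv[df_conv["energy_source_level_2"].isin(["Wind","Bioenergy","Hydro"])].copy()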

# commissioning dates are given as string with format YYYY.0 - Parse manually:
df_conv["commissioned"] = pd.to_datetime(df_conv["commissioned"], format="%Y.0")

# rename columns for compatibility with OEP data
df_conv.rename(columns={
    "energy_source_level_2": "Einheittyp",
    "street": "Standort",
    "state": "Bundesland",
    "commissioned": "Inbetriebnahmedatum",
    "capacity_gross_uba": "Bruttoleistung",
    "lat": "Breitengrad",
    "lon": "Laengengrad",
    "eeg": "AnlagenschluesselEeg",
}, inplace=True)

# add columns which are not defined in OPSD data
df_conv["Land"] = "Deutschland"
df_conv["Name"] = df_conv["name_bnetza"].fillna("") + df_conv["name_uba"].fillna("")

# rename types for compatibility with OEP
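# assign the result back: inplace replace on a selected column is unreliable in recent pandas
df_conv["Einheittyp"] = df_conv["Einheittyp"].replace(
    {"Hydro": "Wasser", "Bioenergy": "Biomasse", "Wind": "Windeinheit"}
)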

# merge
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_OPSD = pd.concat([df_renewables, df_conv], ignore_index=True, sort=False)

len(df_OPSD)

The OPSD dataset contains 19,479 entries for hydro, wind, and biomass power plants.
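
The conventional dataset stores commissioning years as strings such as `1998.0`, which is why they are parsed with an explicit format above. A minimal sketch of how the `"%Y.0"` pattern handles such strings, using made-up values; it assumes missing years should simply stay missing (`NaT`):

```python
import pandas as pd

# toy values (made up for illustration), mimicking the "YYYY.0" strings
years = pd.Series(["1998.0", "2005.0", None])

# "%Y" consumes the four-digit year, ".0" matches the literal suffix
parsed = pd.to_datetime(years, format="%Y.0")

# missing input stays missing (NaT) instead of raising an error
print(parsed)
```

Each parsed year resolves to January 1st of that year, so the day and month carry no information for these entries.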

## Match datasets

Both the OEP MaStR data and the OPSD datasets contain a so-called EEG ID, which I use for matching.

First, remove entries without EEG ID.

In [None]:
df_OPSD_EEG = df_OPSD[df_OPSD["AnlagenschluesselEeg"].notna()]
df_OEP_EEG = df_OEP_unlocated[df_OEP_unlocated["AnlagenschluesselEeg"].notna()]

len(df_OPSD), len(df_OPSD_EEG), len(df_OEP_unlocated), len(df_OEP_EEG)

Luckily, most entries in both datasets have an EEG ID. Next, let's find the intersection.

In [None]:
intersecting_EEGs = set(df_OPSD_EEG["AnlagenschluesselEeg"]).intersection(set(df_OEP_EEG["AnlagenschluesselEeg"]))
len(intersecting_EEGs)

1,265 unique EEG IDs are contained in both the OPSD data and in the unlocated OEP data. Fetch their coordinates from the OPSD dataset. 

In [None]:
# OPSD datasets intersecting OEP and containing coordinates
df_OPSD_intersected_located = df_OPSD_EEG[
    ( df_OPSD_EEG["AnlagenschluesselEeg"].isin(intersecting_EEGs) ) &
    ~(
        df_OPSD_EEG["Laengengrad"].isna() | 
        df_OPSD_EEG["Breitengrad"].isna() 
    )
]

# use unlocated OEP as basis
df_extracted = df_OEP_EEG.copy()

# default: all coordinates are nan
df_extracted.loc[:,["Laengengrad", "Breitengrad"]] = None

# overwrite coordinates in OEP with those of OPSD
for OPSD_index, eeg_id in df_OPSD_intersected_located["AnlagenschluesselEeg"].items():
    OEP_indices = df_extracted[df_extracted["AnlagenschluesselEeg"]==eeg_id].index
    df_extracted.loc[OEP_indices, ["Laengengrad", "Breitengrad"]] = \
        df_OPSD_intersected_located.loc[OPSD_index,["Laengengrad", "Breitengrad"]].values

# keep only overwritten entries
df_extracted.dropna(subset=["Laengengrad", "Breitengrad"], inplace=True)

len(df_extracted)

Surprise! We could extract coordinates for 1,299 entries. How can this be when there were only 1,265 intersecting EEG IDs? Because the EEG IDs are not unique! The 1,299 entries have only 1,264 unique EEG IDs. One EEG ID is missing because its coordinates are not contained in the OPSD dataset either.

In [None]:
(
    len(df_extracted),
    len(df_extracted["AnlagenschluesselEeg"].unique()), 
    len(df_OPSD_intersected_located["AnlagenschluesselEeg"].unique())
)
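
The effect of non-unique EEG IDs can be reproduced on toy data: when an ID occurs more than once in the unlocated set, every occurrence receives coordinates, so the number of located rows exceeds the number of unique IDs. The sketch below also swaps the row-by-row loop for an index lookup via `Series.map`. The frames `oep` and `opsd` and all values are made up for illustration, and duplicate IDs on the OPSD side are resolved by keeping the last occurrence, which is assumed to mirror the loop's overwrite order.

```python
import pandas as pd

# toy data (hypothetical values): "A" appears twice in the unlocated set
oep = pd.DataFrame({"AnlagenschluesselEeg": ["A", "A", "B", "C"]})
opsd = pd.DataFrame({
    "AnlagenschluesselEeg": ["A", "B", "D"],
    "Laengengrad": [11.5, 13.4, 9.9],
    "Breitengrad": [48.1, 52.5, 53.5],
})

# one row per EEG ID; keep the last one, as the loop's final write would
lookup = (
    opsd.drop_duplicates(subset="AnlagenschluesselEeg", keep="last")
        .set_index("AnlagenschluesselEeg")
)

# look up both coordinates by EEG ID; unmatched IDs become NaN
located = oep.copy()
for col in ["Laengengrad", "Breitengrad"]:
    located[col] = located["AnlagenschluesselEeg"].map(lookup[col])

# keep only rows that received coordinates
located = located.dropna(subset=["Laengengrad", "Breitengrad"])

# number of located rows vs. number of unique EEG IDs among them
print(len(located), located["AnlagenschluesselEeg"].nunique())
```

On data of this size the loop is fast enough; the lookup mainly makes the handling of duplicate IDs explicit.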

## Export

Export DataFrames and plots: newly located entries (OEP-OPSD intersection).

In [None]:
df_extracted.to_csv("output/OEP_3_located_only_OPSD.csv", index=False)
plot = plotPowerPlants(df_extracted)
gv.renderer("bokeh").save(plot, "output/OEP_3_located_only_OPSD")

Update the set of located entries (OEP located plus OPSD).

In [None]:
# load OEP datasets with validated locations
df_OEP_located = pd.read_csv(
    "output/OEP_2_correct_lctn.csv",
    parse_dates=["Inbetriebnahmedatum"], dtype={"Postleitzahl":str}
)

# add newly located
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_OEP_OPSD_located = pd.concat([df_OEP_located, df_extracted], ignore_index=True, sort=False)

# export
df_OEP_OPSD_located.to_csv("output/OEP_3_located_incl_OPSD.csv", index=False)
plot = plotPowerPlants(df_OEP_OPSD_located)
gv.renderer("bokeh").save(plot, "output/OEP_3_located_incl_OPSD")

Update the set of unlocated entries (OEP unlocated minus all located).

In [None]:
# now unlocated = previously unlocated - newly located
df_OEP_OPSD_unlocated = df_OEP_unlocated[
    ~( df_OEP_unlocated["EinheitMastrNummer"].isin(df_OEP_OPSD_located["EinheitMastrNummer"]) )
].copy().reset_index(drop=True)

# export
df_OEP_OPSD_unlocated.to_csv("output/OEP_3_unlocated_incl_OPSD.csv", index=False)

I do not plot the unlocated entries because none of them have coordinates. All wrongly located entries have been removed! Also, more than half of the entries in the OEP dataset are now located!

In [None]:
len(df_OEP_OPSD_located), len(df_OEP_OPSD_unlocated)
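
Because the unlocated set is built by excluding every `EinheitMastrNummer` found in the located set, the two sets should be disjoint. A minimal sketch of that `isin`-based anti-join, followed by a disjointness check; the frames and unit numbers are hypothetical:

```python
import pandas as pd

# toy data (hypothetical unit numbers)
unlocated_before = pd.DataFrame({"EinheitMastrNummer": ["SEE1", "SEE2", "SEE3"]})
located = pd.DataFrame({"EinheitMastrNummer": ["SEE2", "SEE9"]})

# anti-join: keep rows whose unit number is NOT in the located set
mask = unlocated_before["EinheitMastrNummer"].isin(located["EinheitMastrNummer"])
unlocated_after = unlocated_before[~mask].reset_index(drop=True)

# the two sets should share no unit numbers
overlap = set(unlocated_after["EinheitMastrNummer"]) & set(located["EinheitMastrNummer"])
print(unlocated_after["EinheitMastrNummer"].tolist(), overlap)
```

The same check could be run on `df_OEP_OPSD_located` and `df_OEP_OPSD_unlocated` before relying on the exported files.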