This notebook provides a brief guideline and some code so that everyone can conduct their own data analysis and provide a standardized DataFrame, which can then be processed and displayed on a map.

Ideally, some DataFrames can be sent to **Vivika by Monday, 31st of August**, so she can visualize them for the Tuesday call.

# Guideline for individual data analysis

- After your analysis, please **provide a 4-column DataFrame (ID, name, year, values) in the format displayed below**. ID, name and year are required. You can add further value columns as 5th, 6th, etc. columns, but don't duplicate ID, name or year. If you use external data, see below for a table matching NUTS-1 IDs to the "Bundesländer".
- Please **aim for NUTS-3 level**, alternatively NUTS-1
- Please **provide a value column name** (e.g. change "AIxxxx" to something human-readable, like "school drop-outs under 15yrs")
- Please also provide a **brief description** of the column 
- Language is **English**, so please translate column names / descriptions, if necessary (to translate from German, use e.g. https://www.deepl.com/translator)
- If necessary, please convert the data so that **high values are positive for child well-being** (e.g. values for high-school drop-outs would need to be reversed so that high values are positive)
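
As a minimal sketch of the formatting points above, the following uses made-up values and the placeholder code "AIxxxx" to show a reversed, renamed indicator in the four-column layout. Reversing a percentage as `100 - value` is only one possible convention and is assumed here for illustration.

```python
import pandas as pd

# Made-up example values; they only illustrate the layout
raw = pd.DataFrame({
    "id_nuts3": ["10041", "10041", "09780"],
    "name_nuts3": [
        "Saarbrücken, Regionalverband",
        "Saarbrücken, Regionalverband",
        "Oberallgäu, Landkreis",
    ],
    "year": [2016, 2017, 2017],
    "AIxxxx": [6.0, 5.5, 4.0],   # placeholder code: share of school drop-outs in %
})

# Reverse the indicator so that high values are positive for child well-being
# (assumption: the value is a percentage, so 100 - value is a sensible reversal)
example_df = raw.assign(AIxxxx=100 - raw["AIxxxx"])

# Give the value column a human-readable English name
example_df = example_df.rename(
    columns={"AIxxxx": "share of pupils who do not drop out (%)"}
)
```

The new column name doubles as a short description; a longer description can be sent along separately.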

In [1]:
# Import libraries
%load_ext autoreload
%autoreload
  
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy import loadtxt
import geopandas as gpd
import math
import os
import pickle

# Datenguidepy
from datenguidepy.query_helper import get_regions, get_statistics, get_availability_summary
from datenguidepy import Query

# Processing/App
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

%matplotlib inline

In [2]:
# Query regions, e.g. get all "Bundesländer"
get_regions().query("level == 'nuts1'")

region_id,name,level,parent
10,Saarland,nuts1,DG
11,Berlin,nuts1,DG
12,Brandenburg,nuts1,DG
13,Mecklenburg-Vorpommern,nuts1,DG
14,Sachsen,nuts1,DG
15,Sachsen-Anhalt,nuts1,DG
16,Thüringen,nuts1,DG
1,Schleswig-Holstein,nuts1,DG
2,Hamburg,nuts1,DG
3,Niedersachsen,nuts1,DG


**Helper function in case you are using Datenguidepy**

This function takes a statistic code as input and returns a DataFrame in the required final format. Feel free to optimize it: a query can take one or two minutes.

In [3]:
# Save all NUTS3 codes as DataFrame (used in function for standardized query)
nuts3_codes = pd.DataFrame(get_regions().query('level == "nuts3"').name)   # Get all NUTS-3 codes

In [12]:
# Function for standardized query (the raw result contains duplicate rows, which are dropped below. Can surely be improved)
def nuts3_query(code):
    """Return a DataFrame with id_nuts3, name_nuts3, year and the value column for one statistic code."""
    query = Query.region(list(nuts3_codes.index))    # Query.all_regions(nuts=3) did not work for me somehow
    query.add_field(code)
    query_res = query.results(verbose_statistics=True)    # verbose_statistics changes column name of code to title
    value_col = [col for col in query_res.columns if code in col][0]   # necessary as "code" is no longer column title
    # Retain only required columns, drop duplicate rows and add the "_nuts3" suffix.
    # Chaining on a new object avoids modifying a slice of the raw result in place.
    query_res = (
        query_res[["id", "name", "year", value_col]]
        .drop_duplicates(keep="first")
        .rename(columns={"id": "id_nuts3", "name": "name_nuts3"})
    )
    return query_res

In [13]:
# Apply query function
df = nuts3_query('AI0304')
df

,id_nuts3,name_nuts3,year,Anteil Schulabgänger mit allgem. Hochschulreife (AI0304)
0,10041,"Saarbrücken, Regionalverband",2006,29.3
2,10041,"Saarbrücken, Regionalverband",2007,27.1
4,10041,"Saarbrücken, Regionalverband",2008,29.2
6,10041,"Saarbrücken, Regionalverband",2009,44.1
8,10041,"Saarbrücken, Regionalverband",2010,33.9
...,...,...,...,...
16,09780,"Oberallgäu, Landkreis",2014,15.9
18,09780,"Oberallgäu, Landkreis",2015,17.5
20,09780,"Oberallgäu, Landkreis",2016,17.0
22,09780,"Oberallgäu, Landkreis",2017,19.4
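
Before sending a result like the one above, it can help to check it against the required layout. The sketch below is one possible check; `check_format` is a hypothetical helper, not part of Datenguidepy, and the two sample rows are copied from the output above.

```python
import pandas as pd

def check_format(frame):
    """Hypothetical helper: list the layout problems found in a result DataFrame."""
    required = ["id_nuts3", "name_nuts3", "year"]
    problems = []
    if list(frame.columns[:3]) != required:
        problems.append("first three columns must be id_nuts3, name_nuts3, year")
    elif frame.duplicated(subset=required).any():
        problems.append("duplicate (id_nuts3, name_nuts3, year) rows found")
    if frame.shape[1] < 4:
        problems.append("at least one value column is required")
    return problems

sample = pd.DataFrame({
    "id_nuts3": ["10041", "10041"],
    "name_nuts3": ["Saarbrücken, Regionalverband"] * 2,
    "year": [2006, 2007],
    "Anteil Schulabgänger mit allgem. Hochschulreife (AI0304)": [29.3, 27.1],
})

problems = check_format(sample)   # an empty list means the layout looks fine
```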


In [15]:
# Get additional information on the field
query = Query.all_regions(nuts=1)
field_info = query.add_field('AI0304')
field_info.get_info()

kind:
OBJECT

description:
Anteil Schulabgänger mit allgem. Hochschulreife

arguments:
year: LIST of type SCALAR(Int)

statistics: LIST of type ENUM(AI0304Statistics)
enum values:
R99910: Regionalatlas Deutschland

fields:
id: Interne eindeutige ID
year: Jahr des Stichtages
value: Wert
source: Quellenverweis zur GENESIS Regionaldatenbank

enum values:
None


**Save DataFrame in pickle format**

In [None]:
# Save DataFrame to pickle
df.to_pickle("saves/AI0304.pkl")
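
Note that `to_pickle` does not create missing folders, so the target folder has to exist first. The following sketch shows a save-and-reload round trip; it uses a temporary folder as a stand-in for `saves/` and a one-row demo DataFrame, both of which are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({
    "id_nuts3": ["10041"],
    "name_nuts3": ["Saarbrücken, Regionalverband"],
    "year": [2006],
    "value": [29.3],
})

with tempfile.TemporaryDirectory() as tmp:
    save_dir = os.path.join(tmp, "saves")    # stand-in for the notebook's saves/ folder
    os.makedirs(save_dir, exist_ok=True)     # create the folder if it is missing
    path = os.path.join(save_dir, "demo.pkl")
    demo.to_pickle(path)
    restored = pd.read_pickle(path)          # reload to confirm nothing was lost

round_trip_ok = restored.equals(demo)
```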

# What happens to your DataFrame afterwards

This section only shows what happens to your DataFrame afterwards, once it is combined with the others. No further action is required on your part.

In [None]:
# Import all DataFrames from folder and merge them on the shared key columns
df = pd.DataFrame(columns=['id_nuts3', 'name_nuts3', 'year'])
for f in os.listdir('data_pickles'):
    if not f.startswith('.'):   # skip hidden files
        temp = pd.read_pickle(os.path.join('data_pickles', f))   # read_pickle opens and closes the file itself
        df = pd.merge(df, temp, how='outer', on=['year', 'id_nuts3', 'name_nuts3'])
df.replace(0, np.nan, inplace=True)   # zeros are treated as missing values
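
To illustrate what the outer merge does, here is a small sketch with two made-up indicator DataFrames. Unlike the cell above, it starts from the first frame rather than from an empty one and uses `functools.reduce`; the indicator names and values are invented.

```python
from functools import reduce

import pandas as pd

keys = ["year", "id_nuts3", "name_nuts3"]

# Two made-up contributions with different regional coverage
a = pd.DataFrame({
    "id_nuts3": ["10041", "09780"],
    "name_nuts3": ["Saarbrücken, Regionalverband", "Oberallgäu, Landkreis"],
    "year": [2017, 2017],
    "indicator_a": [1.0, 2.0],
})
b = pd.DataFrame({
    "id_nuts3": ["10041"],
    "name_nuts3": ["Saarbrücken, Regionalverband"],
    "year": [2017],
    "indicator_b": [5.0],
})

# An outer merge keeps every (year, region) pair that appears in any frame;
# indicators missing for a pair are filled with NaN
combined = reduce(
    lambda left, right: pd.merge(left, right, how="outer", on=keys),
    [a, b],
)
```

Here `combined` has one row per region and year, and `indicator_b` is missing for the region that only appears in `a`.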

In [None]:
# Add columns for id_nuts1 and name_nuts1
regions = get_regions()
regions_nuts1 = regions[regions.level=="nuts1"]["name"]   # Get id and names of regions on nuts1 level
df["id_nuts1"] = [str(x)[:2] for x in df.id_nuts3]   # Add column with id_nuts1
df = pd.merge(df, regions_nuts1, how='left', left_on="id_nuts1", right_index=True)  # Add column with nuts1 name
df.rename(columns=lambda x: x+"_nuts1" if x in ["id", "name"] else x, inplace=True)   # Rename column to name_nuts1
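
The NUTS-1 lookup above relies on the first two characters of the NUTS-3 id. A minimal sketch with a hand-written lookup (a stand-in for `get_regions()`) is shown below. Padding the ids to five characters is an assumption, added in case ids were read as integers and lost a leading zero.

```python
import pandas as pd

toy = pd.DataFrame({
    "id_nuts3": ["10041", 9780],   # second id lost its leading zero on purpose
    "name_nuts3": ["Saarbrücken, Regionalverband", "Oberallgäu, Landkreis"],
})

# Hand-written stand-in for the NUTS-1 names returned by get_regions()
lookup = pd.Series({"10": "Saarland", "09": "Bayern"}, name="name_nuts1")

# Assumption: NUTS-3 ids are five-character strings, so pad before slicing
toy["id_nuts3"] = toy["id_nuts3"].astype(str).str.zfill(5)
toy["id_nuts1"] = toy["id_nuts3"].str[:2]

# Left merge keeps all rows of toy and attaches the matching NUTS-1 name
toy = toy.merge(lookup, how="left", left_on="id_nuts1", right_index=True)
```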

In [None]:
# changing location of nuts1 columns
mid = df[['id_nuts1','name_nuts1']]
df.drop(labels=['id_nuts1','name_nuts1'], axis=1, inplace = True)
df.insert(0,'name_nuts1', mid['name_nuts1'])
df.insert(0,'id_nuts1', mid['id_nuts1'])

In [None]:
# scaling data (the first five columns are identifiers and stay unscaled)
scaled_df = df.iloc[:,:5]
scaled_cols = MinMaxScaler().fit_transform(X=df.iloc[:,5:])
scaled_cols = pd.DataFrame(scaled_cols, columns=df.columns[5:], index=df.index)   # reuse df's index so concat aligns rows
scaled_df = pd.concat([scaled_df, scaled_cols], axis=1)
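
For intuition, min-max scaling maps each column to the range 0 to 1 via (x - min) / (max - min). The sketch below applies that formula with plain pandas to made-up values; it is an illustration of the formula, not a replacement for the `MinMaxScaler` step above.

```python
import numpy as np
import pandas as pd

# Made-up indicator values, including one missing entry
values = pd.DataFrame({
    "indicator_a": [10.0, 20.0, 30.0],
    "indicator_b": [1.0, np.nan, 3.0],
})

# Column-wise min-max scaling; pandas skips NaN when computing min and max,
# and the missing entry stays NaN in the result
scaled = (values - values.min()) / (values.max() - values.min())
```

With this plain formula, a column whose values are all identical would divide by zero, so such columns need separate handling.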

In [None]:
# importing GeoJSON containing Landkreis borders
geojson = gpd.read_file('landkreise_simplify200.geojson')


In [None]:
# defining features for input
features = list(df.drop(['id_nuts1','name_nuts1','id_nuts3','name_nuts3','year'],axis=1).columns)


**Next: App implementation via Plotly Dash**