# Project Group 18

Members: Karin van den Berg, Wouter Diebels, Floris Muis, Levi Mulder, Maaike Tjeerdsma

Student numbers: 4938933, 5869323, 5110394, 4712463, 4964578

# Research Objective

## Introduction
Nearly 60% of the Netherlands is flood-prone [Ligtvoet, 2009]. Of this area, 55% is inland area that is protected by dunes and dikes [Minnen et al., 2012]. Climate change affects flood safety in the Netherlands in several ways. Besides sea level rise, one of the most important factors increasing flood risk is the increase in peak river discharges [Minnen et al., 2012]. For instance, in 2021 extreme weather and high river water levels caused the second most expensive natural disaster of that year [nos, 2021]. In contrast to high water levels, climate change also causes drought. Low water levels have made freight transport by inland shipping difficult for some time: the water level in the Rhine dropped so far that shipping was hampered [Parool, 2022]. Ships have to carry less cargo to avoid lying too deep, which puts more pressure on inland freight transport. If climate change leads to extreme water levels and river discharges more often, the risk of such incidents will rise [Klijn et al., 2010]. According to [Baede, 2001], climate refers to the average weather, in terms of the mean and its variability, over a certain time span and a certain area, so climate change is a change in those statistics. The research objective of the project is therefore to investigate the extent to which climate change, taken here as changes in temperature, affects Dutch water levels. Our hypothesis is that changes in temperature play a major role in influencing the water levels in the Netherlands.

## Research Question
To what extent does global temperature change influence Dutch river water level heights?

## Method
### Data
The water level data used for this assignment were retrieved from Rijkswaterstaat. Data were available for 1980 (which was left out of further analysis) and from 1987 until 2022, measured at the stations Eijsden and Lobith. These towns were selected because they lie on the Dutch border at the places where the Maas and the Rhine enter the Netherlands.
For the European temperatures, data from the National Centers for Environmental Information (NCEI, part of NOAA) were used. The monthly average temperature of Europe was selected from 1987 until August 2022. The dataset presents the monthly temperature as an anomaly in °C, relative to the average of the base period 1910-2000.
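
To make the notion of an anomaly concrete, the sketch below subtracts a base-period average from a few temperature values. The numbers are hypothetical and only illustrate the arithmetic; the NOAA dataset already contains the anomalies, and its exact procedure (for example, a separate base value per calendar month) is not reproduced here.

```python
import pandas as pd

# Hypothetical July mean temperatures in °C, for illustration only (not NOAA data).
temps = pd.Series(
    [9.0, 9.5, 10.0, 11.0],
    index=pd.to_datetime(["1910-07-01", "1950-07-01", "2000-07-01", "2022-07-01"]),
)

# Average over the base period 1910-2000.
in_base_period = (temps.index.year >= 1910) & (temps.index.year <= 2000)
base_mean = temps[in_base_period].mean()

# The anomaly is the difference between each value and the base-period average.
anomaly = temps - base_mean
print(anomaly)
```

A positive anomaly means the month was warmer than the base-period average, a negative one that it was colder.
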
### Visualization
For the visualization, an interactive plot will be made showing the water levels and the average temperature, combined with a slider for the time.
A second plot will show both water level heights and temperature against time, together with the linear regressions of the temperature and the river level heights.
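
As a rough illustration of the linear regression mentioned above, the following sketch fits a straight line to a yearly series with `scipy.stats.linregress`. The series is made up for the example; the variable names `years` and `values` are assumptions and do not appear in the project code.

```python
import numpy as np
from scipy import stats

years = np.arange(1987, 2023)

# Hypothetical yearly series with a known slope of 0.03 per year (not project data).
values = 0.5 + 0.03 * (years - 1987)

# linregress returns, among others, the slope and intercept of the least-squares line.
fit = stats.linregress(years, values)

# Points on the fitted line, which could be drawn on top of the measured series.
trend = fit.intercept + fit.slope * years
print(f"slope per year: {fit.slope:.3f}")
```

The fitted `trend` array has one value per year and can be added to a plot as an extra line.
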
### Analysis
Following the hypothesis, a rise in average temperature is expected to increase the variance of the water level height. The variance of the water level height differences is measured over time spans of weeks, months and years. A Pearson's r correlation test will then be performed on the temperature and the change in variance of the water level heights.
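
The correlation test described above could look roughly like the sketch below, which uses `scipy.stats.pearsonr`. Both series are generated for the example and are not project results; the names `temperature_anomaly` and `yearly_variance` are assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical yearly values, for illustration only (not project data).
rng = np.random.default_rng(seed=0)
temperature_anomaly = np.linspace(0.2, 1.8, 36)  # one value per year
yearly_variance = 50 + 20 * temperature_anomaly + rng.normal(0, 5, 36)

# pearsonr returns the correlation coefficient r and a two-sided p-value.
r, p_value = stats.pearsonr(temperature_anomaly, yearly_variance)
print(f"r = {r:.2f}, p = {p_value:.4f}")
```

Both inputs must have the same length, so the temperature data would first have to be aggregated to the same time span (week, month or year) as the variance.
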


# Contribution Statement

*Be specific. Some of the tasks can be coding (expect everyone to do this), background research, conceptualisation, visualisation, data analysis, data modelling*

**Karin van den Berg**: Requesting and downloading temperature data, calculating variance

**Wouter Diebels**: Writing introduction, plotting temperature and water level height data

**Floris Muis**: Cleaning water level height data

**Levi Mulder**: Cleaning temperature data, plotting temperature and water level height data

**Maaike Tjeerdsma**: Writing methods, implementation of text into the project template


# Data Used

- [nos, 2021] NOS (2021). Overstromingen in Limburg en buurlanden op één na duurste natuurramp van 2021.
- [Baede, 2001] Baede, A. P. (2001). The climate system: an overview. Climate change 2001: the scientific basis, pages 38–47.
- [Klijn et al., 2010] Klijn, F., Kwadijk, J., de Bruijn, K., and Hunink, J. (2010). Overstromingsrisico’s en droogterisico’s in een veranderend klimaat: verkenning van wegen naar een klimaatveranderingsbestendig Nederland. Deltares Delft.
- [Ligtvoet, 2009] Ligtvoet, W. (2009). Roadmap for a climate-proof Netherlands; wegen naar een klimaatbestendig Nederland.
- [Minnen et al., 2012] Minnen, J. V., Ligtvoet, W., Bree, L. v., Hollander, G. d., Visser, H., Schrier, G., Bessembinder, J., van Oldenborgh, G., Prozny, T., Sluijter, R., et al. (2012). Effecten van klimaatverandering in Nederland: 2012. Beleidsstudies, pages 1–125.
- [Parool, 2022] Het Parool (2022). Dalend waterpeil in de Rijn geeft problemen voor goederenvervoer binnenvaart. parool.nl. https://www.parool.nl/nederland/dalend-waterpeil-in-de-rijn-geeft-problemen-voor-goederenvervoer-binnenvaart b670f116/. [Accessed 13-Oct-2022].


# Data Pipeline

### Import modules and data

In [48]:
import pandas as pd
import plotly.express as px
import numpy as np
from scipy import stats
import datetime as dt
import math

import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [49]:
# Remove these files from your local GitHub directory after using them, otherwise the push fails with:
# "file size exceeds 100mb, failed to push"
# The raw data files can be found in our Google Drive folder (for group 18 members).
# They are stored as .csv files in the folder Rawdata and must be opened from that folder, because they are too big for GitHub.

filePath1 = "Rawdata/20221012_030.csv" #data from 20221012_030.zip
filePath2 = "Rawdata/20221012_030_2nd.csv" #data from 20221012_030 (1).zip, needs to be renamed to 20221012_030_2nd.csv
filePath3 = 'Rawdata_temperature/Temperature_data.csv'
filterSize = 5 #number of standard deviations for which values are not considered outliers. 

In [50]:
# Read the raw CSV files. This takes quite a long time due to the file size.
rawDataEijsden =  pd.read_csv(filePath1, delimiter=";", encoding='latin-1') 
rawDataLobith =  pd.read_csv(filePath2, delimiter=";", encoding='latin-1') 
rawDataTemp = pd.read_csv(filePath3, delimiter=',', skiprows=[0,1,2,3]) #read csv and skip first 4 rows with non usable data

### Functions

In [68]:
# Clean the raw river water level data so it is usable.
def data_cleaner(data):
    selectedData = data.iloc[:,[21,22,24]] #keep only the columns needed for the analysis
    selectedData = selectedData.iloc[::144,:] #144 = 6 * 24: keep every 144th row, i.e. 1 row per day for 10-minute data
    locationName = data.iloc[1,1]
    #drop rows whose water level lies filterSize or more standard deviations from the mean
    selectedData = selectedData[(np.abs(stats.zscore(selectedData["NUMERIEKEWAARDE"])) < filterSize)]
    #WAARNEMINGDATUM is not converted here and keeps the format of the raw file (dd-mm-yyyy)
    return selectedData, locationName

def first_visual(data, plotTitle):
    fig = px.line(data, title=plotTitle, x="WAARNEMINGDATUM", y="NUMERIEKEWAARDE")
    fig.update_xaxes(
        rangeslider_visible=True,
        rangeselector=dict(
            buttons=list([
                dict(count=10, label="10y", step="year", stepmode="backward"), # the legend still needs to be fixed; not sure how yet
                dict(count=3, label="3y", step="year", stepmode="backward"),
                dict(count=1, label="YTD", step="year", stepmode="todate"),
                dict(count=5, label="5y", step="year", stepmode="backward"),
                dict(step="all")
            ])
        )
    )

    fig.show()
    return

# function to clean the raw data from the temperature so it is usable
def data_cleaner_temperature(data):
    data['Year'] = pd.to_datetime(data['Year'], format='%Y%m') #date from YYYYMM to YYYY-MM-DD
    return data    

def variance_Eijsden(data): 
    #number of data points
    number_data = len(data)
    #square deviation
    deviations = [(p - mean_all_data_Eijsden)**2 for p in data]
    #variance
    variance = sum(deviations) / number_data
    return variance
    
def stddev_Eijsden(data):
    #variance of data
    variance_data = variance_Eijsden(data)
    #standard deviation of the data 
    stddev_data = math.sqrt(variance_data)
    return stddev_data

def variance_Lobith(data): 
    #number of data points
    number_data = len(data)
    #square deviation
    deviations = [(p - mean_all_data_Lobith)**2 for p in data]
    #variance
    variance = sum(deviations) / number_data
    return variance

def stddev_Lobith(data):
    #variance of data
    variance_data = variance_Lobith(data)
    #standard deviation of the data 
    stddev_data = math.sqrt(variance_data)
    return stddev_data

# ----------------------------- functions variance specific years ----------------------------- 

# Select the rows of one calendar year. The dates are parsed (dd-mm-yyyy) instead of matched
# with .str.endswith, which raises an AttributeError when the column does not hold strings.
def select_year(data, year):
    dates = pd.to_datetime(data['WAARNEMINGDATUM'], format='%d-%m-%Y')
    return data[dates.dt.year == year]

# variance of Eijsden in a given year
def variance_year_Eijsden(year):
    return variance_Eijsden(select_year(dataEijsden, year)['NUMERIEKEWAARDE'])

# variance of Lobith in a given year
def variance_year_Lobith(year):
    return variance_Lobith(select_year(dataLobith, year)['NUMERIEKEWAARDE'])

# standard deviation of Eijsden in a given year
def stdev_year_Eijsden(year):
    return stddev_Eijsden(select_year(dataEijsden, year)['NUMERIEKEWAARDE'])

# standard deviation of Lobith in a given year
def stdev_year_Lobith(year):
    return stddev_Lobith(select_year(dataLobith, year)['NUMERIEKEWAARDE'])

# Plot water level (left axis) and temperature anomaly (right axis) against the date.
def figure_merge(data_merge):
    trace1 = go.Scatter(x=data_merge['WAARNEMINGDATUM'],
                    y=data_merge['NUMERIEKEWAARDE'],
                    name='Water level',
                    mode='lines+markers',
                    yaxis='y1')
    trace2 = go.Scatter(x=data_merge['WAARNEMINGDATUM'],
                    y=data_merge['Value'],
                    name='Temperature',
                    mode='lines+markers',
                    yaxis='y2')
    data = [trace1, trace2]
    layout = go.Layout(title= 'River water level vs temperature anomaly',
                   yaxis=dict(title='Water level'),
                   yaxis2=dict(title='Temperature anomalies in Celsius',
                               overlaying='y',
                               side='right'))

    Figure = go.Figure(data=data, layout=layout)
    return Figure
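
The station-specific variance helpers above all follow the same pattern: the mean squared deviation from the overall mean, evaluated per year. As a possible simplification, the sketch below does this for any station in one call with a pandas `groupby`. The function name `yearly_variance` and the demo data are hypothetical, and the date format `dd-mm-yyyy` is an assumption taken from the year matching used above.

```python
import pandas as pd

def yearly_variance(data, date_col="WAARNEMINGDATUM", value_col="NUMERIEKEWAARDE"):
    """Mean squared deviation from the overall mean, per calendar year."""
    dates = pd.to_datetime(data[date_col], format="%d-%m-%Y")
    squared_dev = (data[value_col] - data[value_col].mean()) ** 2
    return squared_dev.groupby(dates.dt.year).mean()

# Hypothetical demo data, for illustration only.
demo = pd.DataFrame({
    "WAARNEMINGDATUM": ["01-01-2020", "02-01-2020", "01-01-2021", "02-01-2021"],
    "NUMERIEKEWAARDE": [1.0, 3.0, 5.0, 7.0],
})
result = yearly_variance(demo)
print(result)
```

Taking the square root of the result (`result ** 0.5`) gives the corresponding yearly standard deviation. Note that, like the helpers above, this measures spread around the mean of all years, not around each year's own mean.
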

### Visualisations

In [70]:
dataEijsden, locationName1 = data_cleaner(data=rawDataEijsden)
dataLobith, locationName2 = data_cleaner(data=rawDataLobith)


first_visual(data=dataEijsden,plotTitle=locationName1)
first_visual(data=dataLobith,plotTitle=locationName2)

#we need the mean of all data, to get the variance of different periods towards the mean
number_of_all_data_Eijsden = len(dataEijsden)
mean_all_data_Eijsden = sum(dataEijsden['NUMERIEKEWAARDE'])/number_of_all_data_Eijsden

number_of_all_data_Lobith = len(dataLobith)
mean_all_data_Lobith = sum(dataLobith['NUMERIEKEWAARDE'])/number_of_all_data_Lobith
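
The outlier filter inside `data_cleaner` keeps only values whose absolute z-score is below `filterSize`. A minimal sketch of that step on made-up numbers, with the threshold of 5 copied from the settings above:

```python
import numpy as np
from scipy import stats

# Hypothetical water levels with one obviously faulty reading (not project data).
values = np.array([10.0] * 50 + [10.5] * 49 + [1000.0])

# The z-score expresses each value as a number of standard deviations from the mean.
z = np.abs(stats.zscore(values))

# Keep only the values that lie fewer than 5 standard deviations from the mean.
kept = values[z < 5]
print(len(values), len(kept))
```

Because the mean and standard deviation are computed with the outlier included, a very high threshold only removes extreme values such as sensor errors.
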

In [71]:
#plot europe temperature anomalies
dataTemp = data_cleaner_temperature(rawDataTemp)

fig = px.line(dataTemp, title = 'Europe temperature anomalies', x = 'Year', y = 'Value')
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=10, label="10y", step="year", stepmode="backward"), # the legend still needs to be fixed; not sure how yet
            dict(count=3, label="3y", step="year", stepmode="backward"),
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=5, label="5y", step="year", stepmode="backward"),
            dict(step="all")
        ])
    )
)

fig.show()

In [72]:
# compute the variance for every year from 1987 through 2022
year_number = range(1987, 2023)
variance_list_Eijsden = [variance_year_Eijsden(n) for n in year_number]
variance_list_Lobith = [variance_year_Lobith(n) for n in year_number]


In [55]:
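#plot variance Eijsden
# The yearly lists are passed directly; combining them with the daily data frame fails because the lengths differ.
fig = px.line(title = 'variance Eijsden', x = list(year_number), y = variance_list_Eijsden,
              labels = {'x': 'Year', 'y': 'Variance'})
fig.show()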

In [56]:
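#plot variance Lobith
# The yearly lists are passed directly; combining them with the daily data frame fails because the lengths differ.
fig = px.line(title = 'variance Lobith', x = list(year_number), y = variance_list_Lobith,
              labels = {'x': 'Year', 'y': 'Variance'})
fig.show()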

In [57]:
#Deviation per day Eijsden
deviation_day_Eijsden = []

for p in (dataEijsden['NUMERIEKEWAARDE']):
    deviation_day_Eijsden.append((p - mean_all_data_Eijsden))


fig = px.line(dataEijsden, title = 'Deviation Eijsden', x = 'WAARNEMINGDATUM', y = deviation_day_Eijsden)
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=6, label="6m", step="month", stepmode="backward"), # the legend still needs to be fixed; not sure how yet
            dict(count=3, label="3y", step="year", stepmode="backward"),
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=5, label="5y", step="year", stepmode="backward"),
            dict(step="all")
        ])
    )
)
fig.show()

In [58]:
#Deviation per day Lobith
deviation_day_Lobith = []

for p in (dataLobith['NUMERIEKEWAARDE']):
    deviation_day_Lobith.append((p - mean_all_data_Lobith))


fig = px.line(dataLobith, title = 'Deviation Lobith', x = 'WAARNEMINGDATUM', y = deviation_day_Lobith)
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=6, label="6m", step="month", stepmode="backward"), # the legend still needs to be fixed; not sure how yet
            dict(count=3, label="3y", step="year", stepmode="backward"),
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=5, label="5y", step="year", stepmode="backward"),
            dict(step="all")
        ])
    )
)
fig.show()

In [59]:
# compute the standard deviation for every year in year_number
stdev_list_Eijsden = [stdev_year_Eijsden(n) for n in year_number]
stdev_list_Lobith = [stdev_year_Lobith(n) for n in year_number]

In [60]:
#plot standard deviation Eijsden with an ordinary least squares trendline
fig = px.scatter(title = 'Standard Deviation Eijsden', x = list(year_number), y = stdev_list_Eijsden, trendline = 'ols',
                 labels = {'x': 'Year', 'y': 'Standard deviation'})
fig.show()

In [61]:
#plot standard deviation Lobith with an ordinary least squares trendline
fig = px.scatter(title = 'Standard Deviation Lobith', x = list(year_number), y = stdev_list_Lobith, trendline = 'ols',
                 labels = {'x': 'Year', 'y': 'Standard deviation'})
fig.show()

In [73]:
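# dataTemperature_merge was not defined anywhere above. It is assumed here to be the
# temperature data with its date column renamed to match the water level data.
dataTemperature_merge = dataTemp.rename(columns={'Year': 'WAARNEMINGDATUM'})

# Parse the water level dates (dd-mm-yyyy) so that both merge keys are datetimes.
def with_dates(data):
    return data.assign(WAARNEMINGDATUM=pd.to_datetime(data['WAARNEMINGDATUM'], format='%d-%m-%Y'))

df_Lobith_temp = pd.merge(with_dates(dataLobith), dataTemperature_merge, how='outer', on=['WAARNEMINGDATUM'])
df_Eijsden_temp = pd.merge(with_dates(dataEijsden), dataTemperature_merge, how='outer', on=['WAARNEMINGDATUM'])

# figure: Lobith water level and temperature
figure_merge(df_Lobith_temp).show()

# figure: Eijsden water level and temperature
figure_merge(df_Eijsden_temp).show()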
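
The water level data contain one row per day, while the temperature data contain one row per month, so a merge on the date only pairs the rows that fall on the first day of a month. One possible way around this, sketched below on made-up data, is to average the daily water levels per month before merging. This is an assumption about how the two series could be aligned, not the method used in the project.

```python
import pandas as pd

# Hypothetical daily water levels and monthly anomalies, for illustration only.
daily = pd.DataFrame({
    "WAARNEMINGDATUM": pd.date_range("2022-01-01", "2022-02-28", freq="D"),
    "NUMERIEKEWAARDE": 100.0,
})
monthly_temp = pd.DataFrame({
    "WAARNEMINGDATUM": pd.to_datetime(["2022-01-01", "2022-02-01"]),
    "Value": [1.2, 1.5],
})

# Average the daily values per month, labelled by the first day of the month ("MS").
monthly_level = (
    daily.set_index("WAARNEMINGDATUM")["NUMERIEKEWAARDE"]
    .resample("MS")
    .mean()
    .reset_index()
)

# Every monthly water level now has a matching temperature row.
merged = pd.merge(monthly_level, monthly_temp, how="inner", on="WAARNEMINGDATUM")
print(merged)
```

The resulting frame has the same column names as the merged frames above, so it could be passed to `figure_merge` unchanged.
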