## 1. About the DWD Open Data Portal 

The data of the Climate Data Center (CDC) of the DWD (Deutscher Wetterdienst, German Weather Service) is provided on an **FTP server**. <br> **FTP** stands for _File Transfer Protocol_.

Open the FTP link ftp://opendata.dwd.de/climate_environment/CDC/ in your browser (copy and paste it) and find out how it is structured hierarchically.

You can also open the link with **HTTPS** (Hypertext Transfer Protocol Secure): https://opendata.dwd.de/climate_environment/CDC/

We are interested in downloading the metadata for the annual temperature data to get information about the stations.

In [None]:
import requests
from bs4 import BeautifulSoup
import os
import re  # regular expressions
import tqdm
import pandas as pd

# URL of the DWD website
url_base = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/"
url_temporal_resolution = "annual/"
url_parameter = "kl/"
url_subdir = "historical/"
# URLs always use "/", so concatenate the parts instead of using os.path.join (which uses "\" on Windows)
url_full = url_base + url_temporal_resolution + url_parameter + url_subdir

# Directory to save the downloaded files
download_dir = "../data/original/dwd/" +  url_temporal_resolution + url_parameter + url_subdir

# Create the directory if it doesn't exist
os.makedirs(download_dir, exist_ok=True)

print("download dir: ", download_dir)

In [None]:
url_full

In [None]:
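def grab_file(file_url, download_dir):
    """Download file_url into download_dir, keeping the remote file name."""
    # get only the file name from the full url
    file_name = file_url.split("/")[-1]
    file_path = os.path.join(download_dir, file_name)
    # Download the file; raise an error on HTTP failure instead of saving an error page
    response = requests.get(file_url)
    response.raise_for_status()
    with open(file_path, "wb") as file:
        file.write(response.content)
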

In [None]:
# Send an HTTP request to the URL
response = requests.get(url_full)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, "html.parser")
    # Look for the metadata file
    links = soup.find_all(href=re.compile("Beschreibung"))
    # Take the url of the file
    file_name = links[0].get("href")
    # Download the file
    grab_file(url_full + file_name, download_dir)
    print(f"Downloaded: {download_dir+file_name}")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

In [None]:
# get station path
file_path = os.path.join(download_dir,file_name)
# read the header of the file
with open(file_path, encoding="latin") as f:
    header = f.readline().split()
header

In [None]:
# translation dictionary
translate = \
{'Stations_id':'station_id',
 'von_datum':'date_from',
 'bis_datum':'date_to',
 'Stationshoehe':'altitude',
 'geoBreite': 'latitude',
 'geoLaenge': 'longitude',
 'Stationsname':'name',
 'Bundesland':'state'}

In [None]:
#pd.read_csv?

# Exercise:
Notice that `Stations_id` is originally a string. If you read the data with the default settings, you will lose the leading zeros of the ID.
- Check the documentation of pd.read_csv.
- Figure out how to correctly read the data. Focus on:
    - skiprows
    - names
    - encoding
    - parse_dates
    - dtype
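
The effect of `dtype` can be seen on a tiny fixed-width sample. This is only a sketch: the two station rows are invented values, not the real DWD file, and only the first three columns are reproduced.

```python
import io
import pandas as pd

# Hypothetical fixed-width sample (invented rows, not the real DWD file)
rows = [
    ("Stations_id", "von_datum", "bis_datum"),
    ("00044", "19710301", "20221231"),
    ("00555", "19250101", "20221231"),
]
sample = "\n".join(f"{a:<12}{b:<10}{c}" for a, b, c in rows) + "\n"

# Default type inference turns the IDs into integers and drops the leading zeros
df_default = pd.read_fwf(io.StringIO(sample))

# Forcing the column to str keeps the IDs exactly as written in the file
df_as_str = pd.read_fwf(io.StringIO(sample), dtype={"Stations_id": str})

print(df_default["Stations_id"].tolist())
print(df_as_str["Stations_id"].tolist())
```

Keeping the IDs as strings matters later, because the zip file names on the server contain the zero-padded ID.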

In [None]:
df_stations_2 = pd.read_fwf(file_path,
                          skiprows=[0,1],
                          names=translate,
                          encoding="latin", 
                          parse_dates=["von_datum","bis_datum"],
                          dtype={"Stations_id":str}
                          #index_col="Stations_id"
                         )
df_stations_2

In [None]:
# read the stations dataframe # solution is in the cell

df_stations = pd.read_fwf(file_path,
                          skiprows=2,
                          names=header,
                          encoding="latin", 
                          parse_dates=["von_datum","bis_datum"],
                          dtype={"Stations_id":str}
                          #index_col="Stations_id"
                         )
df_stations

# Exercise:
Check all the different values in the "state" column. You can use the function <code>.unique()</code> for this.

In [None]:
df_stations.rename(columns=translate,inplace=True)

In [None]:
df_stations.loc[:,"state"].unique()

# Exercise:
Select only the stations in NRW (you know how it is spelled from the previous exercise) that are still active (`date_to` is in 2022 or later) and that started recording in 1950 or earlier.
**Hint:** In the pandas documentation, look for `DataFrame.query()`.
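
As a rough illustration of how `DataFrame.query()` handles text and date conditions, here is a sketch on invented data. The station names, the dates and the variable `min_end` are assumptions made up for this example; `@min_end` is the `query()` syntax for referring to a Python variable.

```python
import pandas as pd

# Hypothetical stations (names and dates invented for illustration)
stations = pd.DataFrame({
    "name": ["A", "B", "C"],
    "state": ["Nordrhein-Westfalen", "Hessen", "Nordrhein-Westfalen"],
    "date_from": pd.to_datetime(["1931-01-01", "1940-01-01", "1975-06-01"]),
    "date_to": pd.to_datetime(["2023-12-31", "2023-12-31", "2021-05-31"]),
})

# A Python variable can be referenced inside the query string with "@"
min_end = pd.Timestamp("2022-01-01")
selected = stations.query("state == 'Nordrhein-Westfalen' and date_to >= @min_end")

print(selected["name"].tolist())
```

Conditions are combined with `and` / `or` inside a single string, which usually reads more easily than chained boolean masks.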

In [None]:
#df_stations.query?

In [None]:
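# filter stations only in NRW which are still active and started recording in 1950 or earlier
# (date_from and date_to are datetime columns, so compare them against date strings)
df_stations_short = df_stations.query("state == 'Nordrhein-Westfalen' and date_to >= '2022-01-01' and date_from <= '1950-12-31'")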

In [None]:
# df_stations.query?    Use the ? help and look closely at the examples in there

df_stations_short

In [None]:
# get the links. 
links = soup.find_all(href=[re.compile("KL_"+x) for x in df_stations_short.loc[:,"station_id"]])
links



In [None]:
#soup

# Question:
1) How does `re.compile` work?
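
A minimal sketch of `re.compile` on its own, outside BeautifulSoup. The file names below are hypothetical stand-ins for the `href` values on the server, not the real names.

```python
import re

# Compile the pattern once; the resulting object can be reused many times
pattern = re.compile("KL_00555")

# Hypothetical file names standing in for the href values of the links
file_names = [
    "jahreswerte_KL_00555_hist.zip",
    "jahreswerte_KL_00044_hist.zip",
    "KL_Jahreswerte_Beschreibung_Stationen.txt",
]

# search() succeeds if the pattern occurs anywhere in the string
matches = [name for name in file_names if pattern.search(name)]
print(matches)
```

`soup.find_all(href=...)` accepts such compiled patterns and keeps the tags whose `href` matches one of them.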

In [None]:
try:
    # iterate through the list
    for link in tqdm.tqdm(links):
        # Take the url of the file
        file_name = link.get("href")
        # Download the file
        grab_file(url_full + file_name, download_dir)
except requests.RequestException as err:
    print(f"Failed to download: {err}")
else:
    # only report success if no download raised an error
    print("Download complete")

### Which file do I need?
Extract one of the zip files and look at its content. Identify which file contains the data you are interested in.

In [None]:
import glob
zip_list = glob.glob(download_dir+"*.zip")
zip_list

In [None]:
from zipfile import ZipFile
# example of the files inside the first zip file
with ZipFile(zip_list[0]) as myzip:
    print(myzip.namelist())

# Question
Inspect the different files from the archive (.zip) example. 
1. Which file contains the temperature data? 
1. Which other parameters can be found inside?

You can find below the file names translated.
- 'Metadaten_Stationsname_Betreibername': Metadata stations' name and operator's name  
- 'Metadaten_Parameter_klima_jahr': Metadata parameters climate year
- 'Metadaten_Geraete_Lufttemperatur': Metadata devices air temperature
- 'Metadaten_Geraete_Lufttemperatur_Maximum': Metadata devices air temperature maximum
- 'Metadaten_Geraete_Lufttemperatur_Minimum': Metadata devices air temperature minimum
- 'Metadaten_Geraete_Niederschlagshoehe': Metadata devices precipitation height
- 'Metadaten_Geraete_Sonnenscheindauer': Metadata devices sunshine duration
- 'Metadaten_Fehldaten': Metadata missing data
- 'Metadaten_Fehlwerte': Metadata missing values
- 'produkt_klima_jahr': Product climate year

In [None]:
# use the name pattern to get the file name
with ZipFile(zip_list[0]) as myzip:
    prod_filename = [name for name in myzip.namelist() if name.split("_")[0]=="produkt"][0] 
    print(prod_filename)

In [None]:
# Read one of the files as an example

with ZipFile(zip_list[0]) as myzip:
    prod_filename = [name for name in myzip.namelist() if name.split("_")[0]=="produkt"][0] 
    
    # open just the product file within the archive
    with myzip.open(prod_filename) as myfile:
        # read the time series data into a dataframe
        df_temp = pd.read_csv(myfile,
                              sep=";",
                              parse_dates=["MESS_DATUM_BEGINN", "MESS_DATUM_ENDE"],
                              index_col="MESS_DATUM_BEGINN",
                              na_values=[-999.0],
                              dtype={"STATIONS_ID": str}
                             )
df_temp.head()

Now repeat the example for all the files in `zip_list` and join them into a single dataframe.
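
The joining step can be sketched on toy data first. The two series below are invented values with hypothetical station IDs as names. `pd.concat(..., axis=1)` is used here as one possible way to get an outer join on the index; it is an alternative to merging inside a loop, not necessarily the approach this notebook takes.

```python
import pandas as pd

# Two hypothetical stations with partly overlapping years (invented values)
s1 = pd.Series([9.1, 9.4], index=[1950, 1951], name="00001")
s2 = pd.Series([8.7, 8.9], index=[1951, 1952], name="00002")

# axis=1 places each series in its own column; the default join is "outer",
# so years missing for a station are filled with NaN
combined = pd.concat([s1, s2], axis=1).sort_index()
print(combined)
```

Collecting all series in a list and concatenating once avoids rebuilding the dataframe on every iteration.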

In [None]:
# create an empty dataFrame to merge the temperature data to
df_temp = pd.DataFrame()
# iterate through the zipfiles
for zip_file in zip_list:
    with ZipFile(zip_file) as myzip:
        #we are only interested in the file starting with 'produkt_'
        prod_filename = [name for name in myzip.namelist() if name.split("_")[0]=="produkt"][0] 
        
        #open just the product file within archive
        with myzip.open(prod_filename) as myfile:
            # read the time series data in a temporal dataframe
            df_dummy = pd.read_csv(myfile, 
                                  sep=";", 
                                  parse_dates = ["MESS_DATUM_BEGINN", "MESS_DATUM_ENDE"], 
                                  index_col = "MESS_DATUM_BEGINN", 
                                  na_values = [-999.0],
                                  dtype={"STATIONS_ID":str}
                                 )
            # Only interested in the average temperature parameter
            temp_series = df_dummy["JA_TT"].rename(df_dummy["STATIONS_ID"].iloc[0]).to_frame()
            # outer join
            df_temp = pd.merge(df_temp,temp_series,left_index=True, right_index=True, how="outer")

In [None]:
df_temp

In [None]:
df_temp.index.rename(name='year', inplace=True)
df_temp.head()

In [None]:
# Replace full datetime with year as integer (skipped if the cell was already run)
if isinstance(df_temp.index, pd.DatetimeIndex):
    df_temp.index = df_temp.index.year  # extract year from index as int
df_temp

In [None]:
mean = df_temp[(df_temp.index >= 1961) & (df_temp.index <= 1990)].mean() # mean annual temp between 1961 and 1990
mean

In [None]:
df_temp_diff = (df_temp - mean)
df_temp_diff

In [None]:
df_temp_diff.info()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# plot
sns.set_style('ticks')
fig1, ax1 = plt.subplots(dpi = 300, figsize = (30,10))

sns.heatmap(df_temp.T, cmap='coolwarm', ax = ax1)
fig1.savefig('NRW_Annual_Temp_Stripes_01.png')

# Exercise
The plot above is not easy to read. Only one station has data going back to 1851. Remember that you applied a filter to the list of stations, so it makes sense to display only the data within that window of dates.
1) Generate a new plot displaying only the measurements from 1950 onwards.

In [None]:
# plot


# Exercise:
Good! Now have a look at the temperature values. Some stations show low temperatures throughout the whole series. We can assume that this is an effect of the geographic location; maybe the colder stations are located at higher altitudes. You can investigate that by looking at your data.

In [None]:
df_stations_short.filter(like="station")  # like= keeps every column whose name contains the given string (the columns were renamed to English above)

We are actually interested in the changes in temperature relative to the historical mean of each station (here the mean of 1961 to 1990).
When plotting the temperature differences, a blue tone means that the measurement was below that station's average and a red tone means that it was above.
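
The difference to the reference mean can be checked by hand on a small example. The station names and temperatures below are invented for illustration. The reference period is 1961 to 1990, as in the cell that computes `mean`.

```python
import pandas as pd

# Hypothetical annual mean temperatures for two stations (invented values)
temps = pd.DataFrame(
    {"st_a": [8.0, 9.0, 10.0, 11.0], "st_b": [5.0, 6.0, 7.0, 9.0]},
    index=[1961, 1975, 1990, 2020],
)

# Per-station mean over the reference period (label slicing includes 1990)
baseline = temps.loc[1961:1990].mean()

# Subtracting a Series from a DataFrame aligns on the column names,
# so every station is compared with its own mean
anomalies = temps - baseline
print(anomalies)
```

Because each station is compared only with itself, a station that is cold on average (for example because of its altitude) can still show positive differences.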

In [None]:
# 
sns.set_style('ticks')
fig3, ax3 = plt.subplots(dpi = 150, figsize = (12,4))

sns.heatmap(df_temp_diff[df_temp_diff.index >= 1950].T, cmap='coolwarm', vmin = -2, vmax = 2, ax = ax3)
fig3.savefig('NRW_Annual_Temp_Diff_Stripes_02.png')

# Question:

- Which trend can you see in the temperature according to the plot above?
- Why does station 555 display a different tendency than the other stations?