**WBS Codingschool, Data Science Bootcamp, Final Project, Part Data Engineering**

- This Notebook is part of our final project "DengAI".
- We built a neural network model to predict dengue cases. The model needs comprehensive weather data over large areas. ECA&D (European Climate Assessment & Dataset) provides such data for the whole of Europe, but does not offer an API. (Source: https://www.ecad.eu/dailydata/predefinedseries.php)
- This notebook implements a workflow that automatically downloads weather data and weather-station metadata from ECA&D, processes them, and prepares output data according to the model's input schema. The prepared data is saved as a zip file.

**To run this notebook**
- Set "path" to an existing working folder (e.g. "data").
- Set "zip_file_name" to the name of the output zip file.
- Both inputs are set in the "HELM Box" below.
- (A limited choice of ECA&D features is possible there, as is the selection of countries. The start of the time period is set in `clean_data`.)
- Note! During the workflow a 'temp' folder is created. Because the initial download volume is large, it is meant to be removed at the end. Before removal you will be asked whether to delete or keep it. Please answer Y or N to finish the process (any answer other than N deletes the folder).
- Tip! You can follow the progress of the workflow by watching the 'temp' folder.

# Data Download and Management

In [1]:
def create_temp(path):
    folder = "temp"
    temp = os.path.join(path, folder)
    if not os.path.exists(temp):
        os.makedirs(temp)
    else:
        print("A 'temp' subfolder already exists.")
    return temp        

In [2]:
def remove_temp(temp):
    import shutil
    if os.path.exists(temp):
        yn=input("Do you want to remove the temporary folder including downloaded raw data now (recommended)? (Y/N)").lower()
        if yn == "y":
            shutil.rmtree(temp)
            print(f"Subfolder '{temp}' and its contents deleted successfully.")
        elif yn == "n":
            print("The 'temp' folder including downloads and other process data remains. Thank you.")
        else:
            shutil.rmtree(temp)
            print(f"Subfolder '{temp}' and its contents are deleted.")
    else:
        print(f"Subfolder '{temp}' does not exist.")
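
The prompt above treats every answer other than "n" as consent to delete, so a typo or a non-English "yes" also removes the folder. A stricter variant could normalise the answer first and ask again when it is unclear. The sketch below shows one way to do that; `parse_yes_no` and its accepted spellings are hypothetical and not part of the notebook.

```python
def parse_yes_no(answer):
    """Map a free-text answer to True (yes), False (no) or None (unclear).

    Hypothetical helper: the accepted spellings are an assumption.
    """
    normalized = answer.strip().lower()
    if normalized in ("y", "yes"):
        return True
    if normalized in ("n", "no"):
        return False
    return None


# Show the decision for a few possible answers
for reply in ["Y", " n ", "j", ""]:
    print(repr(reply), "->", parse_yes_no(reply))
```

A caller could loop on `input(...)` until the result is not `None`, and delete only on `True`.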

In [3]:
def download_ecad(f, temp):
    import requests

    url = f"https://knmi-ecad-assets-prd.s3.amazonaws.com/download/ECA_blend_{f}.zip"
    url_sta = f"https://knmi-ecad-assets-prd.s3.amazonaws.com/download/ECA_blend_station_{f}.txt"
    
    tpath = os.path.join(temp, f"eca_blend_{f}.zip")
    tpath_sta = os.path.join(temp, f"eca_stations_{f}.txt")
    
    response = requests.get(url)
    if response.status_code == 200:
        with open(tpath, 'wb') as file:
            file.write(response.content)
    else:
        print(f"Failed to download {url} (HTTP status {response.status_code}).")

    response_sta = requests.get(url_sta)
    if response_sta.status_code == 200:
        with open(tpath_sta, 'wb') as file:
            file.write(response_sta.content)
    else:
        print(f"Failed to download {url_sta} (HTTP status {response_sta.status_code}).")
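
`download_ecad` reads each response fully into memory via `response.content` before writing it to disk. Given the large download volume mentioned at the top of this notebook, a chunked write may be gentler on memory. The sketch below assumes the streaming interface of `requests` (`stream=True` together with `iter_content`). The helper names `write_chunks` and `download_streamed`, the chunk size and the timeout are illustrative choices, not part of the original workflow.

```python
import os
import tempfile


def write_chunks(chunks, dest):
    """Write an iterable of byte chunks to dest and return the bytes written."""
    written = 0
    with open(dest, "wb") as out:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                out.write(chunk)
                written += len(chunk)
    return written


def download_streamed(url, dest, chunk_size=1 << 20, timeout=60):
    """Stream url to dest without holding the whole body in memory."""
    import requests  # imported lazily so write_chunks works without it

    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()  # raise on HTTP errors instead of printing
        return write_chunks(response.iter_content(chunk_size=chunk_size), dest)


# Offline demonstration of the chunked write with in-memory data
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
demo_size = write_chunks([b"abc", b"", b"defg"], demo_path)
print(demo_size)
```

Raising on a failed request, rather than printing and continuing, also stops the later steps from running on a missing file.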

# Process Stations Data

In [4]:
def clean_stations(st):
    st['country'] = st['country'].str.strip()
    return(st)

In [5]:
def dms_to_decimal(dms_str):
    degrees, minutes, seconds = map(float, dms_str[1:].split(':'))
    decimal_degrees = degrees + (minutes / 60) + (seconds / 3600)
    return decimal_degrees if dms_str[0] in ('+', 'N', 'E') else -decimal_degrees
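
As a quick check of the conversion, the sketch below repeats the function and applies it to two coordinates. The sample strings are made up for illustration and assume the signed `+DD:MM:SS` form that the function expects.

```python
def dms_to_decimal(dms_str):
    # First character is the sign/hemisphere, the rest is degrees:minutes:seconds
    degrees, minutes, seconds = map(float, dms_str[1:].split(':'))
    decimal_degrees = degrees + (minutes / 60) + (seconds / 3600)
    return decimal_degrees if dms_str[0] in ('+', 'N', 'E') else -decimal_degrees


# 48 degrees 30 minutes with a positive sign is 48.5 decimal degrees
north = dms_to_decimal("+48:30:00")
# 16 degrees 15 minutes with a negative sign is -16.25 decimal degrees
west = dms_to_decimal("-016:15:00")
print(north, west)
```

Because the sign is read from the first character only, a string with leading whitespace would be treated as negative, so values should be stripped before conversion.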

In [6]:
def convert_coordinates(st):
    # Work on a copy so that a filtered slice passed in by the caller is not modified
    df = st.copy()
    df['latitude'] = df['latitude_DMS'].apply(dms_to_decimal)
    df['longitude'] = df['longitude_DMS'].apply(dms_to_decimal)
    return df[['station_id', 'station_name', 'country', 'latitude', 'longitude', 'height_m']]

In [7]:
def set_stations(f, temp):
    inpath = os.path.join(temp, f"eca_stations_{f}.txt")
    outpath = os.path.join(temp, f"{f}_stations.csv")
    st = pd.read_csv(inpath, sep=",", skiprows=19, header=None, names=["station_id", "station_name", "country", "latitude_DMS", "longitude_DMS", "height_m"])
    st = clean_stations(st)
    st = st.loc[st["country"].isin(countries)]
    st = convert_coordinates(st)
    st = st[['station_id','country','latitude','longitude','height_m']]
    st.to_csv(outpath, index=False)
    station_ids = st["station_id"]
    return station_ids

# Process Weather Data

In [8]:
def load_data(f, fn, temp, station_ids):
    import zipfile
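    dfs = []
    # "in" on a pandas Series tests the index, not the values,
    # so compare against a plain set of station IDs instead
    wanted_ids = set(station_ids)
    fpath = os.path.join(temp, f"eca_blend_{f}.zip")
    with zipfile.ZipFile(fpath, 'r') as z:
        for i in z.namelist():
            if "STAID" in i:
                # characters 8-13 of the member name hold the zero-padded station ID
                st_no = int(i[8:14])
                if st_no in wanted_ids: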
                    with z.open(i) as file:
                        df = pd.read_csv(file, skiprows=22, header=None)
                        df.columns = ['station_id', 'source_id', 'date', f'{f}', 'quality']
                        df = df[['station_id', 'date', f'{f}']]
                        dfs.append(df)
    df = pd.concat(dfs, ignore_index=True)
    return df
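
The slice `i[8:14]` relies on every archive member carrying the station number at a fixed position. A pattern match makes that assumption explicit and skips names that do not fit. The sketch below assumes member names of the form `TN_STAID000011.txt`, which is inferred from the slice above rather than taken from ECA&D documentation; `station_id_from_name` is a hypothetical helper.

```python
import re

# Two-letter element code, "_STAID", six digits, ".txt" (assumed naming scheme)
MEMBER_PATTERN = re.compile(r"^[A-Z]{2}_STAID(\d{6})\.txt$")


def station_id_from_name(name):
    """Return the station number encoded in an archive member name, or None."""
    match = MEMBER_PATTERN.match(name)
    return int(match.group(1)) if match else None


wanted_ids = {11, 2045}  # plain Python set of station numbers
names = ["TN_STAID000011.txt", "TN_STAID000012.txt", "elements.txt"]
selected = [n for n in names if station_id_from_name(n) in wanted_ids]
print(selected)
```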

In [9]:
def clean_data(df):
    df["date"] = pd.to_datetime(df['date'], format='%Y%m%d') 
    df = df.loc[df["date"] >= "2022-01-01"]  # INPUT: start of the time period to keep
    return df

# Prepare Data for Model

In [10]:
def reload_process(temp):
    txpath = os.path.join(temp, "tx_df.csv")
    tx = pd.read_csv(txpath)
    tnpath = os.path.join(temp, "tn_df.csv")
    tn = pd.read_csv(tnpath)
    rrpath = os.path.join(temp, "rr_df.csv")
    rr = pd.read_csv(rrpath)

    # drop rows carrying the ECA&D missing-value code (-9999)
    tx = tx.loc[tx["tx"] != -9999]
    rr = rr.loc[rr["rr"] != -9999]
    tn = tn.loc[tn["tn"] != -9999]

    # merge features; the inner joins keep only station/date pairs present in all three
    keys = ["station_id", "date"]
    df = tn.merge(tx, on=keys, how="inner").merge(rr, on=keys, how="inner")

    # order and rename columns to the model input schema
    # (values stay in the raw ECA&D units of 0.1 °C and 0.1 mm; no rescaling happens here)
    df = df[["station_id", "date", "rr", "tx", "tn"]].rename(columns={
        "rr": "precipitation_amt_mm",
        "tx": "station_max_temp_c",
        "tn": "station_min_temp_c",
    })

    return df
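
The renamed columns end in `_c` and `_mm`, while the file format notes in the References section give the raw values in 0.1 °C and 0.1 mm, with -9999 as the missing-value code. If the model expects whole degrees and millimetres (an assumption; the notebook does not say), a conversion along these lines could be applied before export. `to_physical` is a hypothetical helper.

```python
MISSING = -9999  # ECA&D missing-value code


def to_physical(raw):
    """Convert a raw ECA&D integer (tenths of a unit) to whole units, or None if missing."""
    if raw == MISSING:
        return None
    return raw / 10


raw_tx = [215, -31, -9999]
converted = [to_physical(v) for v in raw_tx]
print(converted)
```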

In [11]:
def split_to_csv_zip(df, path, zpath):
    import zipfile
    # 'path' is kept so that the existing call still works; each CSV is written
    # straight into the archive, so the function no longer needs the global 'temp'
    with zipfile.ZipFile(zpath, 'w') as z:
        for i in df["station_id"].unique():
            sta_df = df.loc[df["station_id"] == i]
            csv_name = f"station_{i}.csv"
            z.writestr(csv_name, sta_df.to_csv(index=False))
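
To confirm that the archive has the expected layout, it can be opened again and its members listed. The sketch below builds a tiny stand-in archive in memory, using the `station_<id>.csv` naming from the function above, and reads it back. The sample rows and the reduced column set are invented for illustration.

```python
import csv
import io
import zipfile

# Build a small in-memory archive that mimics the output layout
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("station_11.csv", "station_id,date,precipitation_amt_mm\n11,2022-01-01,5\n")
    z.writestr("station_12.csv", "station_id,date,precipitation_amt_mm\n12,2022-01-01,0\n")

# Read it back: list the members and parse one of them
with zipfile.ZipFile(buffer, "r") as z:
    members = z.namelist()
    with z.open("station_11.csv") as raw:
        rows = list(csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8")))

print(members)
print(rows[0]["date"])
```

For the real output, `buffer` would be replaced by the path of the generated zip file.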

# HELM Box & INPUTs

In [12]:
import pandas as pd
import os

# INPUT 1 of 2 <------------------------------ I N P U T: directory/path, output filename ---------------------------<
path = "data/"
zip_file_name = "data_ready_for_model.zip"

# INPUT 2 of 2 <------------------------------ I N P U T: select weather features and countries ---------------------<
features = ["tn", "tx", "rr"]  # tn = min. temperature, tx = max. temperature, rr = precipitation
filenames = [feature.upper() for feature in features]
countries = ['AT', 'BE']  # ISO 3166-1 alpha-2 country codes
# further countries that can be added to the list:
# 'BG', 'CY', 'CZ', 'DK', 'EE', 'GR', 'HR', 'HU', 'IE', 'LT', 'LU', 'LV', 'MT', 'NL', 'PL', 'PT', 'RO', 'SI', 'SK', 'DE', 'SE', 'FI', 'FR', 'IT', 'ES'

# call functions
dfs = {}
temp = create_temp(path)
for f, fn in zip(features, filenames):
    download_ecad(f, temp)
    station_ids = set_stations(f, temp)
    df = load_data(f, fn, temp, station_ids)
    df = clean_data(df)
    dfs[f"{f}_df"] = df
    dpath = os.path.join(temp, f"{f}_df.csv")
    df.to_csv(dpath, index=False)
df = reload_process(temp)
zpath = os.path.join(path, zip_file_name)
split_to_csv_zip(df, path, zpath)
remove_temp(temp)

Do you want to remove the temporary folder including downloaded raw data now? (Y/N) j


Subfolder 'data/temp' and its contents deleted.


# References

**Data Source: https://www.ecad.eu/dailydata/predefinedseries.php**

In [14]:
# 1) ...for min. Temperature
# EUROPEAN CLIMATE ASSESSMENT & DATASET (ECA&D), file created on: 08-04-2019
# THESE DATA CAN BE USED FOR NON-COMMERCIAL RESEARCH AND EDUCATION PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED: 

# Squintu, AA, G. van der Schrier, Y. Brugnara and AMG Klein Tank. 2019. Homogenization of daily ECA&D temperature series
# Int. J. of Climatol., 39, 1243-1261. doi:10.1002/joc.5874
# Data and metadata available at http://www.ecad.eu

# FILE FORMAT (MISSING VALUE CODE = -9999):

# 01-06 STAID: Station identifier
# 08-13 SOUID: Source identifier
# 15-22 DATE : Date YYYYMMDD
# 24-28 TN   : Minimum temperature in 0.1 °C
# 30-34 Q_TN : quality code for TN (0='valid'; 1='suspect'; 9='missing')


# 2) ...for max. Temperature and Precipitation:
# EUROPEAN CLIMATE ASSESSMENT & DATASET (ECA&D), file created on: 16-01-2024
# THESE DATA CAN BE USED FOR NON-COMMERCIAL RESEARCH AND EDUCATION PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED: 

# Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface
# air temperature and precipitation series for the European Climate Assessment.
# Int. J. of Climatol., 22, 1441-1453.
# Data and metadata available at http://www.ecad.eu

# FILE FORMAT (MISSING VALUE CODE = -9999):

# 01-06 STAID: Station identifier
# 08-13 SOUID: Source identifier
# 15-22 DATE : Date YYYYMMDD
# 24-28 TX   : Maximum temperature in 0.1 °C
# 30-34 Q_TX : quality code for TX (0='valid'; 1='suspect'; 9='missing')

# 24-28 RR   : Precipitation amount in 0.1 mm
# 30-34 Q_RR : quality code for RR (0='valid'; 1='suspect'; 9='missing')
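
The format notes above list a quality code for every value (0 = valid, 1 = suspect, 9 = missing), which `load_data` currently discards. The sketch below parses a few records in that layout and keeps only the valid ones. The sample lines, including the source identifier, are invented, and `parse_record` is a hypothetical helper.

```python
from datetime import datetime


def parse_record(line):
    """Split one comma-separated data line into typed fields."""
    staid, souid, date, value, quality = (part.strip() for part in line.split(","))
    return {
        "station_id": int(staid),
        "source_id": int(souid),
        "date": datetime.strptime(date, "%Y%m%d").date(),
        "value": int(value),
        "quality": int(quality),
    }


lines = [
    "     1, 35382,19010101,  -31,    0",  # valid
    "     1, 35382,19010102,  -13,    1",  # suspect
    "     1, 35382,19010103,-9999,    9",  # missing
]
records = [parse_record(line) for line in lines]
valid = [r for r in records if r["quality"] == 0]
print(len(valid), valid[0]["date"], valid[0]["value"])
```

Whether suspect values should be dropped along with missing ones depends on the model and is left open here.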