## meteorological data from DWD: German weather service

1. data source: sequential, hourly data from DWD (German weather service), recorded at a site in south Berlin

2. features: 
- `prec_mm`: amount of precipitation (e.g. rain) in mm (normally scaled)
- `prec_bool`: whether precipitation (e.g. rain) occurred (boolean)
- `humidity`: humidity (normally scaled)
- `temp`: temperature (normally scaled)
- `radiation`: solar radiation (normally scaled)
- `air_pressure`: air pressure (normally scaled)
- `wind_speed`: wind speed (normally scaled)
- `wind_degree`: wind direction in degrees, [1, 360] (ordinally scaled)



## Imputation of missing data
1. rule-based imputation (based on EDA), using the correlation between features
2. linear interpolation for single missing values with a preceding and a subsequent data point, and for features with a low LSTM score
3. LSTM-based multi-step forecasting
4. last value carried forward for categorical features

In [1]:
# load data 
import pandas as pd
df_weather = pd.read_csv('datasets/df_meteorological_na.csv')

### 1. rule-based imputation through correlation (N imputations = 84 values)
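- `prec_bool` and `prec_type` are highly correlated in their values, but their missing values do not coincide, so an observed `prec_type` can fill a missing `prec_bool`

- > `prec_bool` takes 0 (no rain) if `prec_type` == 0 (type "no rain")
- > `prec_bool` takes 1 (rain) if `prec_type` > 0 (any other type of precipitation)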
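
Both parts of that claim can be checked before the rule is applied. The following is a minimal sketch on a toy frame, not on the notebook's data; the `prec_type` codes 6 and 7 are hypothetical placeholders for "some type of precipitation".

```python
import numpy as np
import pandas as pd

# toy data (hypothetical): prec_type 0 = no rain, 6 and 7 = placeholder precipitation codes
toy = pd.DataFrame({
    "prec_type": [0, 0, 6, 6, 0, 7, np.nan],
    "prec_bool": [0, 0, 1, 1, np.nan, np.nan, np.nan],
})

# agreement in values: where both columns are observed, does (prec_type > 0) match prec_bool?
both = toy.dropna(subset=["prec_type", "prec_bool"])
agreement = ((both["prec_type"] > 0).astype(int) == both["prec_bool"]).mean()

# disagreement in absence: rows where prec_bool is missing but prec_type is observed
recoverable = toy["prec_bool"].isna() & toy["prec_type"].notna()

print(f"agreement: {agreement:.2f}, rows the rule can fill: {recoverable.sum()}")
```

A high agreement together with a non-zero number of recoverable rows is what makes the rule useful; rows where both columns are missing are left for the later steps.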


In [2]:
# impute prec_bool based on its strong correlation with prec_type
for n in df_weather.index:
    # prec_bool is missing, but prec_type at the same index is observed
    if pd.isna(df_weather.loc[n, 'prec_bool']) and not pd.isna(df_weather.loc[n, 'prec_type']):
        if df_weather.loc[n, 'prec_type'] == 0:  # type "no rain"
            df_weather.loc[n, 'prec_bool'] = 0  # impute 0 for no rain
            df_weather.loc[n, 'prec_mm'] = 0  # no rain also means 0 mm
        else:  # any other type of precipitation
            df_weather.loc[n, 'prec_bool'] = 1

# prec_type is no longer needed after the imputation
df_weather = df_weather.drop('prec_type', axis=1)

### 2. linear interpolation of single missing values (N = 12)

1. identify single (non-clustered) missing values, i.e. the values directly before and after exist
2. plot those values
3. impute values through linear interpolation

In [3]:
features = ['prec_mm', 'humidity', 'temp', 'radiation', 'wind_speed', 'air_pressure', 'wind_degree']


def find_single_missing_points(missing_time_steps):
    """Return the missing time steps whose direct neighbors (time step - 1 and + 1) are not missing."""
    missing = set(missing_time_steps)  # a set gives fast membership tests
    return [t for t in missing_time_steps if t - 1 not in missing and t + 1 not in missing]


def linear_interpolation(feature, time_stamp):
    """Impute one missing value as the mean of the values directly before and after it."""
    # value before and value after (int() truncates any decimals before averaging)
    before_value = int(df_weather.loc[df_weather['time_step'] == time_stamp - 1, feature].iloc[0])
    after_value = int(df_weather.loc[df_weather['time_step'] == time_stamp + 1, feature].iloc[0])
    # midpoint of the two neighbors
    interpolated_value = (before_value + after_value) / 2
    row_index = df_weather.index[df_weather['time_step'] == time_stamp][0]
    print(f'{interpolated_value} for {feature} at {time_stamp}')
    df_weather.loc[row_index, feature] = interpolated_value


# interpolate the single missing values of every feature
for feature in features:
    current_missing = df_weather.loc[df_weather[feature].isna(), 'time_step']
    single_missing = find_single_missing_points(list(current_missing))

    for missing in single_missing:
        linear_interpolation(feature=feature, time_stamp=missing)
                        

0.0 for prec_mm at 2023012116
0.0 for prec_mm at 2023052307
0.5 for prec_mm at 2023062010
0.0 for prec_mm at 2023062103
0.0 for prec_mm at 2023092214
0.0 for prec_mm at 2023101018
0.0 for prec_mm at 2023101401
54.5 for humidity at 2023040311
3.5 for temp at 2023040311
205.0 for radiation at 2023080910
5.0 for wind_speed at 2023040312
1032.0 for air_pressure at 2023040311
255.0 for wind_degree at 2023070412
110.0 for wind_degree at 2023090712
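
The neighbor lookup above uses `time_step - 1` and `time_step + 1`. The printed values suggest that `time_step` is encoded as `YYYYMMDDHH`; under that assumption, integer arithmetic cannot reach the neighboring hour across midnight (hour 23 to hour 00). The sketch below is one possible datetime-based variant, not the notebook's method. The function name `interpolate_isolated_gaps` is hypothetical, and it assumes unique integer time steps.

```python
import numpy as np
import pandas as pd


def interpolate_isolated_gaps(df, feature, time_col="time_step"):
    """Fill NaNs whose values one hour before and one hour after are both observed."""
    out = df.copy()
    # assumption: time_col holds integers in YYYYMMDDHH format
    ts = pd.to_datetime(out[time_col].astype(str), format="%Y%m%d%H")
    by_time = pd.Series(out[feature].to_numpy(dtype=float), index=pd.DatetimeIndex(ts))
    hour = pd.Timedelta(hours=1)
    # look up the real neighboring hours; absent rows and missing values both give NaN
    prev_val = by_time.reindex(by_time.index - hour).to_numpy()
    next_val = by_time.reindex(by_time.index + hour).to_numpy()
    isolated = by_time.isna().to_numpy() & ~np.isnan(prev_val) & ~np.isnan(next_val)
    # the midpoint of the neighbors replaces isolated gaps, everything else is kept
    out[feature] = np.where(isolated, (prev_val + next_val) / 2, by_time.to_numpy())
    return out


# toy data (hypothetical): one isolated gap at 23:00 and a two-hour gap after midnight
toy = pd.DataFrame({
    "time_step": [2023013122, 2023013123, 2023020100, 2023020101, 2023020102],
    "temp": [1.0, np.nan, 2.0, np.nan, np.nan],
})
result = interpolate_isolated_gaps(toy, "temp")
print(result)
```

In the toy frame only the 23:00 value is filled, because its neighbors at 22:00 and at 00:00 of the next day are observed; the clustered gap is left for the later steps. Unlike the cell above, this variant does not cast the neighbors to `int`.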


### 2.1 linear interpolation of features with a lower LSTM score

In [4]:
# linearly interpolate wind speed & air pressure
df_weather[['wind_speed', 'air_pressure']] = df_weather[['wind_speed', 'air_pressure']].interpolate(method='linear')
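
To illustrate what this call does with gaps longer than one value, here is a small sketch on a toy series (hypothetical values, not the notebook's data). With the default settings, `method='linear'` treats the rows as equally spaced and fills in the forward direction only.

```python
import numpy as np
import pandas as pd

# toy series with a leading gap, a two-value interior gap and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])
filled = s.interpolate(method="linear")

# the interior gap lies on a straight line between 1.0 and 4.0,
# the leading NaN stays missing, the trailing NaN repeats the last observed value
print(filled.tolist())
```

A leading gap would therefore survive this step, and a trailing gap is effectively carried forward rather than interpolated.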

### 3. LSTM-based multi-step forecasting
1. training the LSTM: load data of previous years and select a time span without missing values
2. define the LSTM model and the required data transformation: 10,896 instances are reshaped to (454, 24, 1), i.e. (samples, timesteps, features)
3. fit the model
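
The shape arithmetic in step 2 can be sketched with NumPy. The actual transformation is implemented in `utilities/LSTM_model.py`, which is not shown here, so this assumes non-overlapping 24-hour windows over a single feature; the array `hourly` is a stand-in for one gap-free column.

```python
import numpy as np

timesteps, n_features = 24, 1
hourly = np.arange(10_896, dtype=float)  # stand-in for 10,896 gap-free hourly values

# number of complete 24-hour windows; leftover hours (none here) would be dropped
n_samples = len(hourly) // timesteps
X = hourly[: n_samples * timesteps].reshape(n_samples, timesteps, n_features)

print(X.shape)  # (454, 24, 1) = (samples, timesteps, features)
```

Each sample then holds 24 consecutive hours of one feature, which is the 3D input layout recurrent layers typically expect.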

In [5]:
import pandas as pd
import numpy as np
from utilities.LSTM_model import LSTM_model, lstm_impute

df_weather_train = pd.read_csv('datasets/df_meteorological_train_data.csv') # training data from the previous year

lstm_impute(df = df_weather,
            features = ['prec_mm', 'humidity', 'temp', 'radiation'], 
            lstm_model = LSTM_model(df = df_weather_train, timesteps= 24, epochs= 5, batch_size= 32)).getitem()

### 4. Last Observation Carried Forward for categorical features 
(prec_bool, wind_degree)

In [None]:
# simple forward fill for the categorical features
df_weather[['prec_bool', 'wind_degree']] = df_weather[['prec_bool', 'wind_degree']].astype('float').ffill()
df_weather.to_csv('datasets/df_meteorological_impute.csv', index=False)  # save
df_weather.isna().sum()

time_step       0
prec_mm         0
prec_bool       0
humidity        0
temp            0
radiation       0
wind_degree     0
wind_speed      0
air_pressure    0
dtype: int64
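
Forward fill only copies earlier observations, so a gap at the very start of a column would stay missing; the zero counts above indicate that this did not happen here. A minimal sketch on toy data (hypothetical values) shows the behavior:

```python
import numpy as np
import pandas as pd

# toy data: prec_bool starts with a gap, wind_degree ends with one
toy = pd.DataFrame({
    "prec_bool": [np.nan, 1, np.nan, np.nan, 0],
    "wind_degree": [270, np.nan, np.nan, 90, np.nan],
})
filled = toy.astype("float").ffill()
remaining = filled.isna().sum()

# the leading gap in prec_bool has no earlier value to copy and stays missing
print(filled)
print(remaining)
```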