To-dos:
- When there is a larger set of data with weather predictions (after 2021/06/15), remove weather observations and switch to only using weather predictions
- Update list of locations to use in report
- Improve/validate prediction model based on above changes

# Resono 1 week predictions 

Make 1-week-ahead predictions of the hourly visitor counts, based on historic data, for all locations (in druktebeeld) or for a selected list of Resono locations.

Generate a graph of the predictions for each location, for use in the weekly report. Each graph is saved automatically in the directory set in the arguments section.

Predictions are stored in a data frame with the following additional columns: 
- **'total_count_predicted'**: predicted total counts (for the next 7 days per location)
- **'data_version'**: version of the data (feature set)
- **'model_version'**: version of the model (type and settings)
- **'predicted_at'**: timestamp of prediction (moment prediction was made)
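
As a rough illustration of the layout described above, the sketch below builds a small data frame with the four listed columns. The `datetime` and `location` columns and all values are assumptions made for this example; the real data frame is produced by `resono_pred.get_resono_predictions` and may be organised differently.

```python
import pandas as pd

# One timestamp per hour for the week to predict (7 days x 24 hours).
start_prediction = pd.Timestamp("2021-06-28 00:00:00")
hours = pd.date_range(start_prediction, periods=24 * 7, freq=pd.Timedelta(hours=1))

# Only the last four column names come from this notebook;
# 'datetime', 'location' and every value are hypothetical placeholders.
example_predictions = pd.DataFrame({
    "datetime": hours,
    "location": "Albert Cuyp",
    "total_count_predicted": 0.0,                         # placeholder prediction
    "data_version": "1_0",                                # feature set version
    "model_version": "lr_0_0",                            # model type and settings
    "predicted_at": pd.Timestamp("2021-06-24 09:00:00"),  # placeholder timestamp
})

print(example_predictions.head())
```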

#### Information for using this notebook:

Data file needed: 
- Hourly total counts of historic Resono data (retrieved from Resono dashboard)

Current model:
- Linear regression (chosen after a 7-week validation on a selection of locations, compared against a baseline model that repeats the past week)
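
The "repeat past week" baseline mentioned above can be sketched as follows. This is a minimal, self-contained illustration on synthetic data, not the helper used in this notebook (`h.test_model_past_week_bt`); the function names `repeat_past_week` and `rmse` are hypothetical.

```python
import numpy as np

def repeat_past_week(history, n_samples_week=24 * 7):
    """Baseline forecast: copy the last observed week as next week's prediction."""
    history = np.asarray(history, dtype=float)
    if len(history) < n_samples_week:
        raise ValueError("Need at least one full week of history.")
    return history[-n_samples_week:].copy()

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic hourly counts with a purely daily pattern (three weeks of data).
hours = np.arange(24 * 7 * 3)
counts = 100 + 50 * np.sin(2 * np.pi * hours / 24)

# Hold out the last week and predict it by repeating the week before.
train, test = counts[:-24 * 7], counts[-24 * 7:]
baseline_rmse = rmse(test, repeat_past_week(train))
print(baseline_rmse)  # close to zero, because the synthetic data repeats exactly
```

A model is only worth using operationally if its error on held-out weeks is lower than that of this baseline; that is the comparison made in the backtesting section.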

Model input current version:
- Past observations Resono data (average past few weeks etc.)
- Periodic data (time of day etc.)
- Stringency Index
- Holiday data
- Vacation data
- Weather data **observations** (< 2021/06/15), and **predictions** (>= 2021/06/15) (temperature, wind speed, global radiation, cloud cover)
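
To make the "periodic data" input more concrete, here is one common way to derive time-of-day and day-of-week features from an hourly index. The sine/cosine encoding and the column names are assumptions made for illustration; the feature set actually built by `resono_pred.prepare_data` may differ.

```python
import numpy as np
import pandas as pd

def add_periodic_features(df):
    """Return a copy of df (with a DatetimeIndex) plus simple periodic features."""
    out = df.copy()
    hour = out.index.hour.to_numpy()
    # Encode the hour on a circle so that 23:00 and 00:00 end up close together.
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    # Monday = 0, ..., Sunday = 6.
    out["day_of_week"] = out.index.dayofweek.to_numpy()
    return out

# Two days of hourly timestamps, starting on a Monday.
index = pd.date_range("2021-06-28 00:00:00", periods=48, freq=pd.Timedelta(hours=1))
example = pd.DataFrame({"total_count": np.zeros(48)}, index=index)

features = add_periodic_features(example)
print(features.head())
```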

### Preparations

In the code blocks below, change the directories to the folders that contain the function files and the database credentials.

In [None]:
def install_packages():
    # (Re-)Installs packages.
    
    get_ipython().run_cell_magic('bash', '', 'pip install imblearn\npip install xgboost\npip install mord\npip install psycopg2-binary\npip install workalendar\npip install eli5\npip install plotly')
    
    import pandas as pd
    pd.set_option('mode.chained_assignment', None)

In [None]:
%%capture
install_packages()

In [None]:
#pip install scikit-learn==0.24.2  # Run if sklearn error 

In [None]:
import os
import pandas as pd

os.chdir("/home/jovyan/Credentials") # Directory with Azure DB credentials
import env_az

os.chdir("/home/jovyan/gitops/central_storage_analyses/notebooks_predictions/resono_week")
import prediction_model_helpers as h  # Universal predictions
import resono_week_predictions as resono_pred  # Resono 1 week model specific

import importlib  # For when coding

### Settings

#### Arguments for functions

In [None]:
# directory and file name of historic Resono data (total hourly counts)
resono_data_dir = "/home/jovyan/Resono_1week_predictions"
file_name = '2021-01-01_2021-06-23_totalsperhour.csv'

In [None]:
# frequency of sampling for data source to predict
freq = 'H'

In [None]:
# how many samples in a day
n_samples_day = 24
# how many samples in a week
n_samples_week = 24*7
# what period to predict for operational forecast (samples)
predict_period = n_samples_week

In [None]:
# list of column name(s) of the variable(s) to predict (can also be "all")
#Y_names = "all" 
Y_names = ['Albert Cuyp', 'Vondelpark West', 'Rembrandtplein',
          'Nieuwmarkt', 'Leidseplein', 'Kalverstraat Noord', 'Kalverstraat Zuid']

# data source (for which the predictions are made)
data_source = 'resono'

# type of prediction (count -> regression or level -> classification)
target = 'count'  # Can only be count 

In [None]:
# input for start of the learnset
start_learnset = h.get_start_learnset(train_length = 20, date_str = None)

In [None]:
# input for start prediction
start_prediction = "2021-06-28 00:00:00"  # start date of week to predict 
start_prediction = pd.to_datetime(start_prediction)

In [None]:
# Minimum number of training samples needed to make predictions (otherwise no predictions for that location)
min_train_samples = n_samples_week*4

In [None]:
# perform outlier removal ("yes" or "no")
outlier_removal = "no"

In [None]:
# set versions (for storing results)
current_model_version = 'lr_0_0'
current_data_version = "1_0" 

In [None]:
# Report graph settings
report_dir = "/home/jovyan/Resono_1week_predictions/"
week_label = "26"
legend = "yes"

### Get predictions

#### 1. Prepare data sets

In [None]:
base_df, resono_df, resono_df_raw, start_prediction, end_prediction, Y_names_all = resono_pred.prepare_data(
    env_az, resono_data_dir, file_name, freq, predict_period, start_prediction,
    n_samples_day, Y_names, target, start_learnset)

#### 2. Make predictions and store in data frame

In [None]:
# --- remove in version without backtesting
prepared_dfs = dict()
y_scalers = dict()
# ---

# Initialize data frame with predictions
final_df = pd.DataFrame()

# Predict for each location
for idx, Y in enumerate(Y_names_all):
    
    # Show location
    print(Y)
    
    # Preprocessed data frame for this location
    preprocessed_df = resono_pred.get_location_df(base_df, resono_df, Y)
    
    # Gather predictions for this location
    prepared_df, predictions, y_scaler = resono_pred.get_resono_predictions(preprocessed_df, resono_df_raw, freq, predict_period, n_samples_day, 
                                                             n_samples_week, Y, data_source, target, 
                                                             outlier_removal, start_learnset,
                                                             current_model_version, current_data_version, 
                                                             start_prediction, end_prediction, min_train_samples)
    # Add predictions to final data frame
    final_df = pd.concat([final_df, predictions], axis=0)
    
    # Get and store report figure
    report_df = resono_pred.get_location_report_df(final_df, prepared_df, y_scaler, Y)
    resono_pred.get_report_plot_hourly(report_df, legend, Y, report_dir, week_label)
    resono_pred.get_report_plot_daily(report_df, Y, report_dir, week_label)
    
    # --- remove in version without backtesting
    prepared_dfs[Y] = prepared_df
    y_scalers[Y] = y_scaler
    # ---

### Check operational prediction

In [None]:
final_df

### Backtesting --- remove code blocks below in version without backtesting

Test the model predictions for the selected locations (argument at the beginning) and time period (start_test, which must lie within the time period for which the data has been prepared).
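
The idea behind backtesting is to train only on data from before `start_test` and to evaluate on the period that follows. The sketch below shows such a time-based split on a synthetic hourly series. The function `split_for_backtest` is hypothetical and only illustrates the principle; the actual preparation is done by `h.prepare_backtesting`, whose internals are not shown in this notebook.

```python
import numpy as np
import pandas as pd

def split_for_backtest(df, start_test, predict_period):
    """Split df (with a DatetimeIndex) into a train part before start_test
    and a test part of at most predict_period samples from start_test onwards."""
    start_test = pd.Timestamp(start_test)
    train = df[df.index < start_test]
    test = df[df.index >= start_test].iloc[:predict_period]
    return train, test

# Two weeks of synthetic hourly counts.
index = pd.date_range("2021-04-11 00:00:00", periods=24 * 14, freq=pd.Timedelta(hours=1))
data = pd.DataFrame({"total_count": np.arange(len(index), dtype=float)}, index=index)

train, test = split_for_backtest(data, "2021-04-18 00:00:00", predict_period=24 * 7)
print(len(train), len(test))
```

Splitting on time rather than at random keeps future observations out of the training set, which would otherwise make the backtest look better than an operational forecast can be.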

In [None]:
# Input for backtesting

# Start testing from this timestamp until the most recent time slot
start_test = "2021-04-18 00:00:00"
# What period to predict for backtesting (samples)
predict_period = n_samples_week*7

In [None]:
# If using a NN/LSTM model, it is necessary to also install these libraries
# Related functions have to be uncommented in prediction_model_helpers.py
#pip install keras
#pip install tensorflow

In [None]:
# Perform backtesting

# Store results
locations = []
rmse_benchmarks = []
rmse_models = []
figs_pred_time = dict()
feat_imps = dict()
figs_feat_imp = dict()

# Predict for each location
for idx, Y in enumerate(Y_names_all):
    
    # Show location
    print(Y)
    
    # Prepare data (skip locations without a prepared data frame, so that
    # variables from a previous iteration are never reused by accident)
    if Y not in prepared_dfs:
        print("No prepared data: no backtesting performed.")
        continue
    (df_y_predict_bt, df_y_train_bt, df_y_ground_truth_bt, df_y_ground_truth_bt_scaled,
     df_X_train_bt, df_X_predict_bt) = h.prepare_backtesting(start_test, predict_period, freq,
                                                             prepared_dfs[Y], Y,
                                                             n_samples_week, target, y_scalers[Y])
    
    # Do not perform backtesting if there is not enough training data 
    if df_X_train_bt.empty or len(df_X_train_bt) < min_train_samples:
        print("Not enough training data: no backtesting performed.")
        continue
    
    # Benchmark predictions
    df_y_benchmark = df_y_predict_bt.copy()
    df_y_benchmark[Y] = h.test_model_past_week_bt(df_y_train_bt, df_y_predict_bt, df_y_ground_truth_bt_scaled, 
                                                    predict_period, 
                                                   n_samples_week, target)
    if target == "count":
        df_y_benchmark = h.unscale_y(df_y_benchmark, y_scalers[Y])
        
    error_metrics_benchmark = h.evaluate(df_y_benchmark, df_y_ground_truth_bt, target, Y_name = Y,
                                         print_metrics = False)
    
    rmse_benchmarks.append(error_metrics_benchmark['rmse'])
    
    # Model predictions
    df_y_model = df_y_predict_bt.copy()
    
    model = h.train_model_ridge_regression(df_X_train_bt, df_y_train_bt, Y, target)
    df_y_model[Y] = h.test_model_ridge_regression(model, df_X_predict_bt)
    if target == "count":
        df_y_model = h.unscale_y(df_y_model, y_scalers[Y])
    error_metrics_model = h.evaluate(df_y_model, df_y_ground_truth_bt, target, Y_name = Y, print_metrics = False)
    
    rmse_models.append(error_metrics_model['rmse'])
    
    # Visualize backtesting result
    fig_pred_time = h.visualize_backtesting(df_y_ground_truth_bt, df_y_benchmark, df_y_model, target, Y, 
                                        error_metrics_model, y_label = "Total visitor count", count_to_level = False)
    figs_pred_time[Y] = fig_pred_time
    
    # Feature importance
    feat_imp, fig_feat_imp = h.feature_importance(model.coef_[0], list(df_X_train_bt.columns))
    feat_imps[Y] = feat_imp
    figs_feat_imp[Y] = fig_feat_imp
    
    locations.append(Y)

In [None]:
# Backtesting results for all locations
df_results = h.backtesting_results_all_locations(locations, rmse_models, rmse_benchmarks)

In [None]:
# Summarized results
df_results.describe()

In [None]:
# Locations for which the model performs better
df_results[df_results['RMSE_difference'] < 0]

In [None]:
# Locations for which the benchmark model performs better
df_results[df_results['RMSE_difference'] > 0]

#### Query results for specific location

In [None]:
df_results[df_results['Location'] == "Kalverstraat Noord"]

In [None]:
figs_pred_time["Kalverstraat Noord"]

In [None]:
figs_feat_imp["Kalverstraat Noord"]