# Investigate the bias over time

In [1]:
import pandas as pd
# pd.options.plotting.backend = "plotly"

In [2]:
file_name = "data/forecast-leg-accuracy-2020-12-09-1607472437"

In [3]:
df = pd.read_parquet(file_name)
df["leg_dep_date"] = df["leg_dep_date"].dt.date

CAF implemented a major data change at the beginning of November.

This notebook investigates whether the bias has improved since then, i.e. whether it has moved closer to zero and stabilized.

In [4]:
date_of_change = pd.Timestamp("2020-11-01")

Let us plot the one-week (horizon 7) performance over time.
Note that the effect on the one-week horizon should appear about 7 days after the date of change. Because the data is per leg, the transition is somewhat blurred, but it should show up around this date:

In [5]:
print(date_of_change + pd.Timedelta(days=7))

2020-11-08 00:00:00


In [58]:
import plotly.express as px
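def plot_bias_over_time_for_horizon(df, horizon, metric="perc_bias"):
    """Plot the daily global bias for one forecast horizon, with a 7-day rolling mean."""
    if metric == "perc_bias":
        # Percentage bias: total error relative to total actuals, using only legs that have a prediction.
        func = lambda x: 100 * (x["ffe_prediction"] - x["ffe_actual"]).sum() / x.dropna(subset=["ffe_prediction"])["ffe_actual"].sum()
    else:
        # Any other value falls back to the absolute bias (sum of prediction minus actual).
        metric = "bias"
        func = lambda x: (x["ffe_prediction"] - x["ffe_actual"]).sum()
    # One bias value per departure date for the requested horizon.
    bias = df.query(f"horizon_days == {horizon}").groupby("leg_dep_date").apply(func)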
    
    fig = px.line(bias, labels={"value": metric, "variable": "Global"}, title=f"Global {metric} over time for horizon {horizon} days")
    fig.update_traces(name='Diner')
    fig.add_vline(x=date_of_change + pd.Timedelta(days=horizon), line_color="red")
    
    # Add rolling mean to show the stability
    rol_mean = bias.rolling(7).mean()
    fig2 = px.line(rol_mean, color_discrete_sequence=["green"])
    fig2.update_traces(name='Rolling 7 day mean')
    
    fig.add_trace(fig2.data[0])
    
    return fig
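
To make the percentage bias concrete, here is a small worked example on made-up numbers. It is only a sketch: the `toy` frame and the `perc_bias` helper are hypothetical and not part of the notebook, but the helper is meant to mirror the `perc_bias` branch above (total error divided by total actuals, counting only legs that have a prediction).

```python
import pandas as pd

# Toy data (made up for illustration): two departure dates at horizon 7.
toy = pd.DataFrame(
    {
        "leg_dep_date": ["2020-11-09", "2020-11-09", "2020-11-10", "2020-11-10"],
        "horizon_days": [7, 7, 7, 7],
        "ffe_prediction": [110.0, 95.0, 80.0, None],  # last leg has no prediction
        "ffe_actual": [100.0, 100.0, 100.0, 50.0],
    }
)


def perc_bias(group):
    # Keep only legs with a prediction so numerator and denominator cover the same rows.
    scored = group.dropna(subset=["ffe_prediction"])
    error = (scored["ffe_prediction"] - scored["ffe_actual"]).sum()
    return 100 * error / scored["ffe_actual"].sum()


toy_bias = toy.groupby("leg_dep_date")[["ffe_prediction", "ffe_actual"]].apply(perc_bias)
print(toy_bias)
# 2020-11-09: 100 * (10 - 5) / 200 = 2.5
# 2020-11-10: 100 * (-20) / 100 = -20.0 (the leg without a prediction is ignored)
```

A positive value means the forecast is too high in total for that departure date, a negative value means it is too low.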

In [60]:
plot_bias_over_time_for_horizon(df, horizon=7, metric="perc_bias")

After the change, the 7-day horizon forecast appears considerably more stable, and its percentage bias looks clearly better.

In [61]:
plot_bias_over_time_for_horizon(df, horizon=14, metric="perc_bias")

The 14-day horizon forecast also appears considerably more stable after the change, with a clearly better percentage bias.
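
The conclusions above are read off the plots. One possible way to back them with numbers is to compare the mean and standard deviation of the daily bias before and after the expected effect date. The sketch below assumes a daily bias series indexed by departure date, like the `bias` series computed inside `plot_bias_over_time_for_horizon` (which would have to be returned or recomputed to be used here). The helper name `summarize_before_after` and the toy values are hypothetical and do not come from the real data.

```python
import numpy as np
import pandas as pd


def summarize_before_after(bias, cutoff):
    """Mean and standard deviation of a daily bias series before and after `cutoff`."""
    # The index may hold plain date objects, so convert it before comparing.
    period = np.where(pd.to_datetime(bias.index) >= cutoff, "after", "before")
    return bias.groupby(period).agg(["mean", "std"])


# Made-up daily percentage bias around the expected effect date for horizon 7.
dates = pd.to_datetime(
    ["2020-11-05", "2020-11-06", "2020-11-07", "2020-11-08", "2020-11-09", "2020-11-10"]
).date
example_bias = pd.Series([12.0, -8.0, 10.0, 1.0, -1.0, 0.0], index=dates)

cutoff = pd.Timestamp("2020-11-01") + pd.Timedelta(days=7)
summary = summarize_before_after(example_bias, cutoff)
print(summary)
```

A mean closer to zero in the `after` row would indicate less bias, and a smaller standard deviation would indicate a more stable forecast.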