# Error Measurement

So far, we have used a variety of measures to compare models or judge how well a model performed its task. Now, we will analyze best practices for judging the accuracy of forecasts, emphasizing the specific issues regarding **time series** data.

For those new to time series forecasting, the most important thing to understand is that standard cross-validation is usually not advisable. You cannot select training, validation, and testing data sets by sampling randomly from the data as if time did not matter.

However, things are even more complicated. You need to think about how different data samples relate to each other over time, even when they appear independent. For example, suppose you are working on a **time series** classification task, so that you have many separate **time series** samples, each of which is its own data point. It may be tempting to think that in this case you can randomly assign **time series** to the training, validation, and test sets, but this does not work. The problem with this approach is that it does not reflect how you would use your model: in practice, the model is trained on earlier data and tested on later data.

You don't want future information to leak into your model, because that is not what happens in production. If it does leak, the forecast error you measure during testing will be lower than the error you see in production, because during testing the cross-validation gave your model information about the future.

Let's look at a realistic scenario of how this could happen. Imagine you are training an air quality detector for major cities in the Western US. In your training set, you include all data from 2017 and 2018 for San Francisco, Salt Lake City, Denver, and San Diego. In your test set, you include the same date range for Las Vegas, Los Angeles, Oakland, and Phoenix. You discover that your air quality model does very well on the Las Vegas and Los Angeles measurements, and that it does especially well for 2018. Great.

Then you try to replicate the model training process on data from earlier decades and find that it does not test as well as it did in your original train/test run. You then remember the record-breaking wildfires in Southern California in 2018 and realize that they were "baked into" the original train/test split, because your training set gave the model a window into the future. This is precisely why we should avoid standard cross-validation.

There are cases where information from the future propagating back into model selection is not a problem. For example, if you are only trying to understand the dynamics of a **time series** by testing how well it can be forecast, you are not really trying to make a prediction, but rather testing the best possible fit of a given model to the data. In this case, including future data helps you understand the dynamics, although you should be careful about overfitting. Arguably, even in this case, maintaining a valid test set (which requires not letting information leak backward from the future) still matters, so the concerns about **time series** and cross-validation remain.

Now that we've clarified things, let's turn to a concrete example of splitting data for training, validating, and testing a model. Next, we'll look more generally at how to determine when a forecast is good enough, or as good as it can be. We will also examine how to estimate the uncertainty of our forecast when using techniques that do not directly produce an uncertainty or error measure as part of the output. We will end the chapter with a list of pitfalls to watch for when building your **time series** model or preparing to put it into production.

## Basic Concepts: How to Test Predictions

The most important element of generating a forecast is to ensure that you build it solely with data you can access far enough in advance. For this reason, you need to think not only about when events happen, but also about when the data will be available to you.
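The distinction between when an event happened and when its data became available can be made explicit in code. The following is a minimal sketch, assuming each observation carries two timestamps; the field names `event_time` and `available_time`, the sample records, and the helper `usable_at` are all hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical records: each observation stores when the event occurred
# and the (later) time at which the value was actually published.
records = [
    {"value": 10.0, "event_time": datetime(2018, 1, 1), "available_time": datetime(2018, 1, 2)},
    {"value": 12.0, "event_time": datetime(2018, 1, 2), "available_time": datetime(2018, 1, 5)},
    {"value": 11.0, "event_time": datetime(2018, 1, 3), "available_time": datetime(2018, 1, 4)},
]


def usable_at(records, forecast_time):
    """Keep only the observations that were already available at forecast_time."""
    return [r for r in records if r["available_time"] <= forecast_time]


forecast_time = datetime(2018, 1, 4)
usable_values = [r["value"] for r in usable_at(records, forecast_time)]
print(usable_values)
```

Note that the second record is excluded even though its event happened before the forecast time, because it had not yet been published. Filtering on `event_time` alone would have let it through.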

While this sounds simple, remember that common preprocessing, such as exponential smoothing, can inadvertently leak information from the training period into the testing period. You can test this yourself by fitting a linear regression first to an autoregressive **time series** and then to an exponentially smoothed version of the same series. You will notice that the more you smooth the **time series**, and the longer the smoothing half-life, the "better" your predictions become. This is because you have to make less and less of a prediction, as more and more of each value is made up of an exponential average of previous values. This is a dangerous lookahead, and it is insidious enough that it still shows up in academic articles.
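The smoothing experiment described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not a reference implementation: the AR(1) coefficient of 0.5, the series length, the seed, and the smoothing factors are arbitrary choices, and the "linear regression" is a one-lag least-squares fit whose quality is summarized by its R².

```python
import random


def simulate_ar1(phi, n, seed):
    """Simulate an AR(1) series with standard normal noise."""
    rng = random.Random(seed)
    series = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        series.append(phi * series[-1] + rng.gauss(0.0, 1.0))
    return series


def exp_smooth(series, alpha):
    """Exponential smoothing; a smaller alpha means a longer half-life."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def lag1_r_squared(series):
    """R^2 of the least-squares regression of y[t] on y[t - 1]."""
    x, y = series[:-1], series[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


raw = simulate_ar1(phi=0.5, n=2000, seed=0)
r2_raw = lag1_r_squared(raw)
# Apparent one-step "predictability" after smoothing with longer and longer half-lives
r2_by_alpha = {alpha: lag1_r_squared(exp_smooth(raw, alpha)) for alpha in (0.5, 0.2, 0.05)}

print("raw:", round(r2_raw, 3))
for alpha, r2 in r2_by_alpha.items():
    print("alpha =", alpha, "->", round(r2, 3))
```

The apparent fit improves as `alpha` shrinks, even though nothing new has been learned about the underlying process. The target itself has simply become an average of values the model has already seen.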

Bearing in mind these dangers and other hard-to-spot ways of feeding the past into the future and vice versa, the gold standard for any model should be backtesting with a roll-forward of the training, validation, and testing periods.

In backtesting, a model is developed for a given set or range of data and then tested extensively on historical data, ideally representing the full range of possible conditions and variability. It is also important that practitioners have well-founded reasons to backtest a specific model and avoid trying out too many models. As most data analysts know, the more models you test, the more likely you are to overfit your data, that is, to choose a model that has picked up overly specific details of the current data set rather than generalizing robustly. Unfortunately, for **time series** practitioners, this means a tricky balancing act that can lead to embarrassing results when models are put into production.

But how do we implement backtesting? We do so in a way that preserves a structure similar to cross-validation while remaining temporally aware. The common paradigm, assuming you have data representing the sequential passage of time "in alphabetical order", is as follows:

        Train[A]               test with[B]
        Train[A B]             test with[C]
        Train[A B C]           test with[D]
        Train[A B C D]         test with[E]
        Train[A B C D E]       test with[F]
<br>

![The standard of excellence for evaluating the performance of a **time series** model, roll-forward training, validation and testing window](https://analisemacro.com.br/wp-content/uploads/2022/05/tscv.png)

<br>

You can also move the training data window instead of expanding it. In this case, your training could look something like this:

        Train[A B]         test with[C]
        Train[B C]         test with[D]
        Train[C D]         test with[E]
        Train[D E]         test with[F]
        
<br>

The method you choose depends in part on whether you think the behavior of your series is evolving over time. If so, it's best to use a rolling window so that every test period is evaluated with a model trained on the most relevant data. Alternatively, your main concern may be avoiding overfitting, in which case an expanding window will discipline your model better than a fixed-length window. Since this type of rolling split is a common training need, R and Python both offer easy ways to generate it:

- in Python, an easy way to generate data splits is through *sklearn.model_selection.TimeSeriesSplit*
- in R, *tsCV* from the **forecast** package will roll a model forward in time using backtesting and report the errors

There are other packages in R and Python that will do the same. You can also write your own functions to split your data, if you have ideas about how to implement this model test in a specific project. Maybe you want to ignore certain time periods because they exhibited anomalous dynamics, or maybe you want to weight the performance of certain time periods more.
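If you do want to write your own splitting functions, the two schemes shown earlier can be sketched in a few lines. The helpers `expanding_splits` and `rolling_splits` below are hypothetical names, not part of any library. They generate index lists only and assume equally sized, contiguous test blocks.

```python
def expanding_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) with a growing training window."""
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        test_start = first_test_start + k * test_size
        yield list(range(test_start)), list(range(test_start, test_start + test_size))


def rolling_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices) with a fixed-length training window."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        yield list(range(start, train_end)), list(range(train_end, train_end + test_size))
        start += test_size


def label(splits, periods):
    """Translate index splits into period labels for display."""
    return [([periods[i] for i in tr], [periods[i] for i in te]) for tr, te in splits]


periods = ["A", "B", "C", "D", "E", "F"]
expanding = label(expanding_splits(len(periods), n_folds=5, test_size=1), periods)
rolling = label(rolling_splits(len(periods), train_size=2, test_size=1), periods)

for train, test in expanding:
    print("expanding  train", train, "test", test)
for train, test in rolling:
    print("rolling    train", train, "test", test)
```

Because the functions yield plain indices, it is straightforward to add project-specific rules on top, such as dropping folds that fall in an anomalous period.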

For example, suppose you work with financial data. Depending on your goals, it may be worth excluding data from extraordinary periods, such as the 2008 financial crisis. Or, if you work with retail data, you may want to weight model performance most heavily for the Christmas shopping season, even if this sacrifices some accuracy in forecasting low-volume seasons.


### Model-Specific Considerations for Backtesting

Consider the dynamics of the model you are training when structuring your backtesting, especially when deciding what time range of data to train on. With traditional statistical models such as ARIMA, all data points are weighted equally when the model parameters are selected. More data may therefore make the model less accurate if you think that the model parameters should vary over time. The same is true of machine learning models for which all training data is weighted equally.

On the other hand, batched stochastic methods can result in weights and estimates that evolve over time. Thus, if you train on the data in chronological order, neural network models trained with typical stochastic gradient descent methods will, to some extent, take the temporal nature of the data into account. The most recent gradient adjustments to the weights will reflect the most recent data. In most cases, **time series** neural network models are trained on data in chronological order, because this tends to give better results than training on the data in random order.

State space models also allow the fit to adapt over time. This argues for a longer training window, because a long window will not prevent the posterior estimate from evolving over time.

## When Is Your Forecast Good Enough?

The quality of your forecast will depend on your overall objectives, the minimum quality level required for what you need to do, and the limits and nature of your data. If your data has a very high noise-to-signal ratio, you should have limited expectations for your model.

Remember: a **time series** model will never be perfect. But you should aim to do as well as, or a little better than, alternative methods, such as solving a system of differential equations about climate change, asking a knowledgeable stock broker for a tip, or turning to a medical textbook that shows you how to classify an EEG. When evaluating performance, keep in mind the known limits of domain experts on forecasting: for now, such measures indicate the upper bound on performance in many forecasting problems.

There are times when you know the model you are working with isn't good enough yet and that you can do better. Let's look at some things we can do to identify these opportunities:

*Plot the model outputs for the test set*

- the distribution generated by the model should match the distribution of the values you are trying to predict, assuming there is no expected regime shift or underlying trend. For example, if you are trying to predict stock prices and you know that these prices fall and rise with the same frequency, a model that always predicts a rise is inadequate. Sometimes the distributions will be obviously off, while at other times you can apply a statistical test to compare your model output with your actual targets.

*Plot model residuals over time*

- if the residuals are not homogeneous over time, your model is underspecified. The temporal behavior of the residuals may point to additional parameters needed in the model to describe that behavior.

*Test the model against a simple temporally aware null model*

- a common null model is one in which every forecast for time *t* is the value at time *t - 1*. If your model does not perform better than such a simple model, you cannot justify it. If a simple, naive model manages to outperform the model you created, something is fundamentally wrong with your model, loss function, or data preprocessing, and a hyperparameter grid search will not fix it. Alternatively, it could be a sign that the data has a lot of noise relative to the signal, which also suggests that your model is useless for its intended purpose.

*Study how the model deals with outliers*

- in many areas, outliers are simply that: extreme values far from typical behavior. These events probably could not have been predicted. That is, the best your model can do is ignore these outliers instead of fitting them. In fact, if your model predicts outliers well, this could be a sign of overfitting or of a poorly chosen loss function. This depends on the model you chose and the loss functions you employed. However, for most uses, you will want a model whose predictions are not as extreme as the extreme values in your data set. Of course, this recommendation does not apply when the cost of outlier events is high and the forecasting task is mainly to warn about outlier events when possible.

*Perform a temporal sensitivity analysis*

- do qualitatively similar behaviors in related time series produce related results in your model? Using your knowledge of your system's underlying dynamics, make sure that your model recognizes and treats similar temporal patterns in the same way. For example, if one time series shows an upward trend with a drift of 3 units per day and another shows an upward trend with a drift of 2.9 units per day, you want to make sure that the predictions made for these series are similar. Furthermore, you would like to be sure that the rank ordering of the predictions relative to the input data is sensible (a larger drift should result in a larger predicted value). If this is not the case, your model may be overfitting.
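The comparison against a null model in the checklist above is easy to automate. Here is a minimal sketch, assuming one-step-ahead forecasts and mean absolute error as the metric. The sample series and the `model_forecasts` values are made up for illustration, and `error_ratio_vs_naive` is a hypothetical helper name.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between targets and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast(series):
    """Null model: the forecast for time t is the observed value at t - 1."""
    return series[:-1]


def error_ratio_vs_naive(series, model_forecasts):
    """MAE of the model divided by MAE of the naive forecast.

    model_forecasts[i] is the model's forecast for series[i + 1].
    A ratio below 1 means the model beats the null model on this data.
    """
    targets = series[1:]
    model_mae = mean_absolute_error(targets, model_forecasts)
    naive_mae = mean_absolute_error(targets, naive_forecast(series))
    return model_mae / naive_mae


series = [10.0, 12.0, 11.0, 13.0, 12.0]
model_forecasts = [11.5, 11.5, 12.5, 12.5]  # hypothetical model output for t = 1..4
ratio = error_ratio_vs_naive(series, model_forecasts)
print(round(ratio, 3))
```

A ratio at or above 1 on held-out data is a signal to revisit the loss function and preprocessing before spending time on hyperparameter tuning.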

## Estimating Uncertainty in Your Model with a Simulation

An advantage of traditional statistical **time series** analysis is that such analyses come with well-defined analytical formulas for the uncertainty in an estimate. However, even then, and also in the case of non-statistical methods, it can be useful to understand the uncertainty associated with a forecast model through computational methods. A very intuitive and accessible way to do this is with a simple simulation. Suppose we perform an analysis of what we believe to be an AR(1) process. It is worth remembering that an AR(1) process can be expressed as:

*y<sub>t</sub> = θ × y<sub>t - 1</sub> + e<sub>t</sub>*

Following a model fit, we want to study how variable our estimate of the coefficient θ (called `phi` in the code) can be. One way to study this is to run a number of Monte Carlo simulations. We can easily do this in R, as long as we remember what we learned about AR processes:

```R
require(forecast)

phi         <- 0.7    # true AR(1) coefficient
time_steps  <- 24
N           <- 1000   # number of simulated series
sigma_error <- 1

## stationary variance of an AR(1) process, used to draw starting values
var_series  <- sigma_error^2 / (1 - phi^2)
starts      <- rnorm(N, sd = sqrt(var_series))
estimates   <- numeric(N)
res         <- numeric(time_steps)

for (i in 1:N) {
    errs    <- rnorm(time_steps, sd = sigma_error)
    res[1]  <- starts[i] + errs[1]

    for (t in 2:time_steps) {
        ## each value depends on the previous value of the same series
        res[t] <- phi * res[t - 1] + errs[t]
    }
    ## store the fitted AR(1) coefficient for this simulated series
    estimates[i] <- arima(res, c(1, 0, 0))$coef[1]
}

hist(estimates,
     main   = "Estimated Phi for AR(1) when ts is AR(1)",
     breaks = 50)
```
<br>

We can also get a sense of the range of estimates and quantiles with the *summary()* function applied to *estimates*:

```R
summary(estimates)
```
<br>

We can also use simulation to ask more complicated questions. Suppose we want to know the numerical cost of oversimplifying our model relative to the ground truth. Imagine that the process under study is AR(2), although we diagnosed it as an AR(1) process. To find out how this affects our estimate, we can modify the previous R code like this:

```R
## Now let's assume we have a true AR(2) process
## and since this is more complicated, let's switch to arima.sim
phi_1 <- 0.7
phi_2 <- -0.2

estimates <- numeric(N)
for (i in 1:N) {
    res <- arima.sim(list(order = c(2, 0, 0),
                          ar = c(phi_1, phi_2)),
                     n = time_steps)
    ## deliberately fit the (misspecified) AR(1) model
    estimates[i] <- arima(res, c(1, 0, 0))$coef[1]
}

hist(estimates,
     main   = "Estimated Phi for AR(1) when ts is AR(2)",
     breaks = 50)
```
<br>

Maybe the distributions don't seem that different to you, and they really aren't. We confirm this with statistical summaries:

```R
summary(estimates)
```
<br>

We can see that the range of estimates is wider when the model is misspecified and that the estimate for the first-order term is somewhat worse than when the model was correctly specified, but the deviation is not very large. This can ease concerns that underestimating the model order will badly affect our estimate of *θ*. We can run a variety of simulation scenarios to address potential issues and understand the range of likely estimation errors given some imagined possibilities.
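The same two experiments can also be sketched in Python with the standard library. This version makes some assumptions that differ from the R code: the coefficient is estimated by a simple least-squares regression of each value on its predecessor rather than by R's `arima` fit, the series are started from zeros with a burn-in period, and the helper names (`simulate_ar`, `fit_ar1`, `monte_carlo`) are hypothetical.

```python
import random
import statistics


def simulate_ar(coefs, n, rng, burn_in=200):
    """Simulate an AR(p) series with unit-variance Gaussian noise."""
    values = [0.0] * len(coefs)
    for _ in range(burn_in + n):
        # coefs[j] multiplies the value j + 1 steps back
        nxt = sum(c * values[-1 - j] for j, c in enumerate(coefs)) + rng.gauss(0.0, 1.0)
        values.append(nxt)
    return values[-n:]


def fit_ar1(series):
    """Least-squares slope of y[t] on y[t - 1] (with an intercept)."""
    x, y = series[:-1], series[1:]
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def monte_carlo(coefs, n_sims=1000, time_steps=24, seed=0):
    """Fit an AR(1) model to n_sims simulated series of the given AR process."""
    rng = random.Random(seed)
    return [fit_ar1(simulate_ar(coefs, time_steps, rng)) for _ in range(n_sims)]


est_ar1 = monte_carlo([0.7])        # model correctly specified
est_ar2 = monte_carlo([0.7, -0.2])  # true process is AR(2), fitted as AR(1)

for name, est in (("AR(1) truth", est_ar1), ("AR(2) truth", est_ar2)):
    q1, median, q3 = statistics.quantiles(est, n=4)
    print(name, "min", round(min(est), 2), "q1", round(q1, 2),
          "median", round(median, 2), "q3", round(q3, 2), "max", round(max(est), 2))
```

With only 24 time steps per series, the estimates are noticeably spread out in both cases. That spread is the kind of uncertainty the simulation is meant to expose.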

## Predicting Multiple Steps Ahead

So far we have mostly covered one-step-ahead prediction, but you may want to predict multiple steps ahead. This happens, among other reasons, when the **time series** data you have is at a higher temporal resolution than the **time series** values you would like to predict. For example, you may have daily stock quotes available, but you would like to predict monthly stock quotes so you can come up with a long-term strategy for your retirement plan. Or you may have brain electrical activity readings taken every minute, but you would like to predict a seizure at least five minutes in advance to notify your users/patients as soon as possible. In these cases, you have several options for generating multi-step-ahead forecasts.

### Directly Adjust the Horizon of Interest

It's as simple as setting your *y* (target) value to reflect the forecast horizon of interest. So, if your data consists of minute-by-minute indicators but you want a forecast horizon five minutes ahead, simply cut off the model inputs at time *t* and train against a label generated from data up to time *t + 5*. You would then fit this data with whatever you are using to predict, whether a simple linear regression, a machine learning model, or even a deep learning network. In effect, it would look like this:

*model(X) = Y*

In this context, you can choose *Y* to have any time horizon you want. Each of the following would therefore be a legitimate scenario, depending on your forward horizon of interest (whether ten time steps or three):

- *model<sub>1</sub>(X)<sub>t</sub> is fitted to Y<sub>t + 10</sub>*

- *model<sub>2</sub>(X)<sub>t</sub> is fitted to Y<sub>t + 3</sub>*
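Building the training pairs for a direct multi-step model mostly comes down to index bookkeeping. The sketch below assumes a univariate series and uses the last `n_lags` values as inputs. The helper name `make_direct_dataset` and the toy series are illustrative only.

```python
def make_direct_dataset(series, n_lags, horizon):
    """Pair the n_lags most recent values up to time t with the value at t + horizon."""
    inputs, targets = [], []
    for t in range(n_lags - 1, len(series) - horizon):
        inputs.append(series[t - n_lags + 1 : t + 1])  # values up to and including t
        targets.append(series[t + horizon])            # label taken `horizon` steps ahead
    return inputs, targets


series = list(range(10))  # toy data: 0, 1, ..., 9
X, Y = make_direct_dataset(series, n_lags=3, horizon=5)

for x, y in zip(X, Y):
    print(x, "->", y)
```

Any regression model can then be fitted to `(X, Y)`. Changing `horizon` produces a separate data set, and therefore a separate model, for each horizon of interest.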

### Recursive Approach for Distant Time Horizons

With a recursive approach to fitting a variety of horizons, you build a single model but prepare to feed its own output back in as input to predict more distant horizons. This idea may be familiar, because ARIMA models use exactly this strategy to make multi-step-ahead forecasts. Suppose we develop a model to fit one step ahead, by training *model(X)<sub>t</sub> = Y<sub>t + 1</sub>*. If we wanted to fit horizons three steps ahead, we would do the following:

- *model(X)<sub>t</sub> -> estimate Y<sub>t + 1</sub>*

- *model(X<sub>t</sub> with the estimate of Y<sub>t + 1</sub>) -> estimate Y<sub>t + 2</sub>*

- *model(X<sub>t</sub> with the estimate of Y<sub>t + 1</sub> and estimate of Y<sub>t + 2</sub>) -> estimate Y<sub> t + 3</sub>*

The expected error for our estimate of *Y<sub>t + 3</sub>* would necessarily be greater than for our estimate of *Y<sub>t + 1</sub>*. How much greater? That question gets complicated. A good way to get an idea is to run a simulation, as discussed earlier.
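The recursive scheme can be sketched as a small loop that appends each forecast to the input window before predicting the next step. The one-step model here is a stand-in: `ar1_model` is a hypothetical fitted model with an assumed coefficient of 0.5, and `recursive_forecast` is an illustrative helper rather than a library function.

```python
def recursive_forecast(history, one_step_model, n_lags, steps):
    """Forecast `steps` values ahead by feeding each prediction back in as input."""
    window = list(history[-n_lags:])
    forecasts = []
    for _ in range(steps):
        y_hat = one_step_model(window)
        forecasts.append(y_hat)
        # Slide the window: drop the oldest value, append the newest estimate
        window = window[1:] + [y_hat]
    return forecasts


def ar1_model(window):
    """Hypothetical fitted one-step model: an AR(1) with coefficient 0.5."""
    return 0.5 * window[-1]


forecasts = recursive_forecast([3.0, 8.0], ar1_model, n_lags=1, steps=3)
print(forecasts)
```

From the second step onward, the inputs are estimates and not observations. That is why the errors compound as the horizon grows.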

### Multitask Learning Applied to Time Series

Multitask learning is a general deep learning concept that can be applied with particular significance to **time series** analysis. In more general terms, multitask learning represents the idea that a model can be built for multiple purposes at the same time or to learn to generalize by trying to predict several different but related targets at the same time. Some consider this to be a form of regularization, which encourages the model to be more general by teaching it related tasks. In **time series**, you can apply multitask learning by defining targets for different time horizons in the forecasting context. In this case, your model fit would look something like this:

- *model(X)<sub>t</sub> -> (Y<sub>t + 1</sub>, Y<sub>t + 10</sub>, Y<sub>t + 100</sub>)*

- *model(X)<sub>t</sub> -> (Y<sub>t + 1</sub>, Y<sub>t + 2</sub>, Y<sub>t + 3</sub>)*

When training this model, you might also think about how to view the loss function: would you like to weight all forecasts equally, or would you like to favor certain forecast horizons over others?
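One simple way to express such a preference is a weighted loss across horizons. The following sketch assumes a squared-error loss and per-horizon weights chosen by the modeler. The function name, the sample predictions, and the weights are all hypothetical.

```python
def weighted_multi_horizon_loss(predictions, targets, weights):
    """Weighted mean squared error across forecast horizons.

    All three arguments are dicts keyed by horizon (in time steps).
    """
    total_weight = sum(weights.values())
    weighted_errors = sum(weights[h] * (predictions[h] - targets[h]) ** 2 for h in weights)
    return weighted_errors / total_weight


predictions = {1: 10.0, 10: 12.0, 100: 20.0}
targets = {1: 11.0, 10: 10.0, 100: 16.0}

# Every horizon counts the same
equal_loss = weighted_multi_horizon_loss(predictions, targets, {1: 1.0, 10: 1.0, 100: 1.0})
# Favor the short horizon and discount the most distant one
short_term_loss = weighted_multi_horizon_loss(predictions, targets, {1: 3.0, 10: 1.0, 100: 0.5})

print(equal_loss, round(short_term_loss, 3))
```

Here the large miss at horizon 100 dominates the equally weighted loss and contributes much less once short horizons are favored. The weights therefore steer which horizons the model is pushed to fit.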

If you are trying to make a very far-out forecast, you can use multitask horizons to teach your model by including short-term horizons that may point to features that are salient and useful for the longer horizons but are difficult to find directly from low signal-to-noise data about the distant future. Another scenario for multitask modeling would be to fit several time windows in the future, all in the same season but at different points in time (such as spring in various years or Mondays in various weeks). This would be a way to fit seasonality and a trend at the same time.