In studies on machine learning and statistical analysis, the focus is predominantly on the performance of models in terms of accuracy. While accuracy should typically be the primary concern when evaluating a model, sometimes computational performance considerations are imperative when looking at large data sets or models that are widely deployed to serve large populations of client applications.

**Time series** data sets can become so large that you simply cannot perform an analysis - or cannot perform it properly - because it is too demanding in terms of available computational resources. In these cases, many organizations do one of the following:

- increase their computational resources (expensive and often wasteful, both economically and environmentally);
- carry out the project poorly (insufficient hyperparameter tuning, insufficient data, etc.);
- do not undertake the project at all.

None of these options is satisfactory, especially when you are just starting out with a new data set or a new analytical technique. It can be frustrating not knowing whether your failures are the result of bad data, a thorny problem, or a lack of resources. Fortunately, we will cover some workarounds to expand your options for very demanding analyses or huge data sets.

The purpose of the notebook is to guide you with some considerations on how to reduce the computational resources required for training or inference on a specific model. Most of the time, these questions are specific to a particular data set, as well as the resources you have available and your accuracy and speed goals. In this chapter, we will address these concerns, with the hope that they partially cover the problems you encounter and can inspire future brainstorming. These considerations will come to the fore when you have completed your first few rounds of analysis and modeling and should not be a priority when you are dealing with a problem for the first time. However, when the time comes to put something into production or extend a small research project, you should revisit these concerns frequently.

## Working with Tools Built for General Use Cases

One challenge with time series data is that most tools, particularly those for machine learning, are built for a more general use case, and most illustrative examples show the use of cross-sectional data. But these machine learning methods are not as efficient with **time series** data. The solutions to your individual problems will vary, but the general ideas are the same. 

### Models Built for Cross-Sectional Data Do Not "Share" Data Across Samples

In many cases, when you feed discrete samples of **time series** data to an algorithm, most often a machine learning model, you will notice that large chunks of the data overlap between samples. For example, suppose you have the following data on monthly widget sales:


| Month    | Sold Widgets |
|----------|------------:|
| Jan 2014 | 11,221 |
| Feb 2014 |  9,880 |
| Mar 2014 | 14,423 |
| Apr 2014 | 16,720 |
| May 2014 | 17,347 |
| Jun 2014 | 22,020 |
| Jul 2014 | 21,340 |
| Aug 2014 | 25,973 |
| Sep 2014 | 11,210 |
| Oct 2014 | 11,583 |
| Nov 2014 | 11,539 |
| Dec 2014 | 10,240 |

You are trying to make predictions by mapping each "shape" to its nearest neighboring curve, so you prepare many shapes from this data. Here, we list just a few of these data points, assuming you want to use six-month curves as the "shapes" of interest (note that we are not doing any data preprocessing to normalize the data or to create additional features of interest, such as moving averages or smoothed curves).

| Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 |
|-------|-------|-------|-------|-------|-------|
| 11221 | 9880  | 14423 | 16720 | 17347 | 22020 |
|  9880 | 14423 | 16720 | 17347 | 22020 | 21340 |
| 14423 | 16720 | 17347 | 22020 | 21340 | 25973 |


Interestingly, all we were able to do with this input preparation was make our dataset six times larger without including any additional information. And from a performance point of view, this is a real catastrophe, even though it is necessary for the inputs of a variety of machine learning modules. If you have encountered this problem, consider some solutions.

#### Do not use overlapping data

Consider generating your "data points" so that each individual month makes its way into only one curve. If you do this, the previous data might look similar to the following table:

| Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 |
|-------|-------|-------|-------|-------|-------|
| 11221 | 9880  | 14423 | 16720 | 17347 | 22020 |
| 21340 | 25973 | 11210 | 11583 | 11539 | 10240 |

Note that this would be quite easy, because it amounts to simple array reshaping, rather than custom data repetition.
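The reshaping described above can be sketched as follows. This is a minimal illustration, assuming the monthly sales from the first table are held in a 1D **NumPy** array; the names `sales`, `window`, and `curves` are illustrative only.

```python
import numpy as np

# Monthly widget sales from the table above (Jan 2014 to Dec 2014).
sales = np.array([11221, 9880, 14423, 16720, 17347, 22020,
                  21340, 25973, 11210, 11583, 11539, 10240])

window = 6  # length of each "shape"; an illustrative choice

# Drop any trailing points that do not fill a complete window,
# then reshape into one row per non-overlapping window.
usable = (sales.shape[0] // window) * window
curves = sales[:usable].reshape(-1, window)
```

Because the array is contiguous, the reshape returns a view onto the original buffer, so no values are duplicated in memory.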

#### Use a generator-type paradigm to iterate through the dataset

Using a generator-like paradigm to iterate through the dataset, resampling the same data structure as appropriate, is easy in Python, but you can also use R and other languages. If we imagine that the original data is stored in a 1D **NumPy** array, this would look like the following code (note that this would have to be coupled with a machine learning data structure or algorithm that accepts generators):

```python
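def array_to_ts(arr):
    """Yield overlapping six-point windows as views onto arr."""
    idx = 0
    while idx + 6 <= arr.shape[0]:
        yield arr[idx:(idx + 6)]
        idx += 1  # advance the window by one time step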
```
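As a possible alternative to a hand-written generator, NumPy (version 1.20 or later) offers `numpy.lib.stride_tricks.sliding_window_view`, which exposes every window as a read-only view without copying data. The sketch below assumes a stand-in series and a hypothetical helper, `iter_batches`, that hands the windows to a consumer in batches.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(12, dtype=np.float64)  # stand-in for 12 monthly values

# Each row is a six-point window, but no data is copied: the rows are
# overlapping views onto the same underlying buffer.
windows = sliding_window_view(series, window_shape=6)

def iter_batches(view, batch_size):
    """Yield consecutive batches of windows (hypothetical helper)."""
    for start in range(0, view.shape[0], batch_size):
        yield view[start:start + batch_size]
```

Whether this helps in practice depends on whether the downstream model copies its inputs; if it does, the memory savings apply only up to that point.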

Note that it is a good idea to write data modeling code that does not unnecessarily inflate a data set, from both a training and a production point of view. In training, this lets you fit more training examples into memory, and in production, it lets you make multiple predictions (or classifications) on overlapping data with fewer computational resources. If you are making frequent predictions for the same use case, you are probably working with overlapping data, so this problem and its solutions will be very relevant.

### Models that are not Pre-Calculated generate unnecessary Lag between Measuring Data and Making a Prediction

Typically, machine learning models do not prepare for or take into account the possibility of pre-calculating part of a result before having all the data. However, this is a very common scenario for **time series**.

If you are making your model available in a time-sensitive application, such as for medical predictions, vehicle location estimates, or stock price forecasting, you may find that the lag of calculating a forecast only after all the data is available is enormous. In this case, consider whether the chosen model can be partially pre-calculated in advance. Let’s look at some examples of how this is possible:

- if you are using a recurrent neural network that takes multiple channels of information in 100 different time steps, you can pre-calculate/unroll the neural network in the first 99 time steps. So when the last data point finally arrives, you only need to do one final set of matrix multiplications (and other activation function calculations) instead of 100. In theory, this speeds up your response time by 100 times.
- if you are using an AR(5) model, you can pre-calculate everything except the most recent term in the sum that makes up the model. Recall that an AR(5) process looks like the equation below. If you are about to generate a forecast, you already know the values of *y<sub>t - 4</sub>*, *y<sub>t - 3</sub>*, *y<sub>t - 2</sub>*, and *y<sub>t - 1</sub>*, which means you can have everything except *phi<sub>0</sub> × y<sub>t</sub>* ready before you know *y<sub>t</sub>*:

*y<sub>t + 1</sub> = phi<sub>4</sub> × y<sub>t - 4</sub> + phi<sub>3</sub> × y<sub>t - 3</sub> + phi<sub>2</sub> × y<sub>t - 2</sub> + phi<sub>1</sub> × y<sub>t - 1</sub> + phi<sub>0</sub> × y<sub>t</sub>*

- if you are using a clustering model to find nearest neighbors based on summary features of a **time series** (mean, standard deviation, maximum, minimum, etc.), you can calculate these features on the **time series** with one less data point and run your model on that **time series** to identify several nearest neighbors. You can then update the features once the final value arrives and rerun the analysis using only the nearest neighbors found in the first round. This will actually require more computational resources overall, but will result in a shorter lag between the final measurement and forecast delivery.
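The AR(5) case above can be sketched in a few lines. This is only an illustration: the coefficient values are hypothetical, and `precompute_partial` and `finish_forecast` are names introduced here, not part of any library.

```python
import numpy as np

# Hypothetical fitted coefficients phi_4 ... phi_0 (oldest lag first).
phi = np.array([0.05, 0.10, 0.15, 0.20, 0.40])

def precompute_partial(history, phi):
    """Sum every term except phi_0 * y_t, using y_{t-4} ... y_{t-1}."""
    return float(np.dot(phi[:4], history[-4:]))

def finish_forecast(partial, y_t, phi):
    """Add the last term once y_t has been measured."""
    return partial + phi[4] * y_t

history = np.array([10.0, 11.0, 12.0, 13.0])    # y_{t-4} ... y_{t-1}
partial = precompute_partial(history, phi)      # done while waiting for y_t
forecast = finish_forecast(partial, 14.0, phi)  # one multiply-add at the end
```

The work left for the moment the final measurement arrives is a single multiplication and addition, regardless of the order of the model.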

In many cases, your model will be nowhere near as slow as network lag or other factors, so pre-calculation is a worthwhile technique only when response time is extremely important and when you are confident that the model calculation contributes substantially to the time between an application receiving all the necessary information and outputting a useful prediction.

## Data Storage Formats: Advantages and Disadvantages

An overlooked area when it comes to performance bottlenecks for training and productizing **time series** models is data storage. Let's look at some common mistakes:

- *data storage in a row-based data format, even though the **time series** is formed by traversing a column*. This results in data where adjacent points in time are not adjacent in memory;
- *storage of raw data and execution of analyses based on this data*. Depending on the model, it is preferable to have preprocessed data that has been downsampled, as far as possible.
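To make the first point concrete, the sketch below inspects how far apart consecutive time points of one series sit in memory under two layouts. It assumes a hypothetical array of 1,000 time steps by 8 series stored as 64-bit floats.

```python
import numpy as np

n_times, n_series = 1_000, 8
# Row-major (C order): each row holds one time step across all series.
row_major = np.zeros((n_times, n_series), dtype=np.float64)

# One series is a column; consecutive time points sit n_series * 8 bytes apart.
one_series = row_major[:, 0]
stride_row_major = one_series.strides[0]

# Column-major (Fortran order) stores each series contiguously in memory.
col_major = np.asfortranarray(row_major)
stride_col_major = col_major[:, 0].strides[0]
```

With the column-major layout, reading a single series walks through adjacent memory, which is generally friendlier to caches and to disk reads than hopping across rows.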

Next, we'll look at these data storage factors so that your model training and inference happens as quickly as possible.

### Store your Data in Binary Format

It's tempting to store data in a comma-separated text file, like a CSV file. Normally, this is how the data is provided, so inertia leads us to make this choice. These file formats are also human-readable, which makes it easier to check the data in the file against the pipeline outputs. Lastly, this data is generally easy to upload to different platforms.

However, it is not easy for your computer to read text files. If you are working on data sets so large that you cannot fit all of your data into memory during training, you will be dealing with I/O and related processing associated with the file format you choose. By storing data in a binary format, you can substantially reduce I/O-related slowdowns in several ways:

- because the data is in binary format, your data processing package already "understands" it. There is no need to parse a CSV and convert it into a data frame. When you read the data in, you already have a data frame;
- because the data is in binary format, it can be compressed much more effectively than a CSV or other text-based file. That is, the I/O itself will be faster, as there is less physical data to read from disk in order to recreate the file's contents.

Binary storage formats are easily accessible in R and Python. In R, use *save()* and *load()* for **data.table**. In Python, use *pickling* and note that both **Pandas** (pd.DataFrame.to_pickle() and pd.read_pickle()) and **NumPy** (np.save() and np.load()) include their own binary save and load functions for their specific objects (NumPy writes its own .npy format and falls back to pickling only for object arrays).
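As a rough illustration of the difference, the sketch below writes the same synthetic array as CSV and as NumPy's binary `.npy` format and compares the file sizes. The data, file names, and sizes are assumptions for the example; actual savings depend on your data and on the text formatting used.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1_000, 4))  # synthetic stand-in data

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    npy_path = os.path.join(tmp, "data.npy")

    np.savetxt(csv_path, data, delimiter=",")  # text: must be parsed on load
    np.save(npy_path, data)                    # binary .npy: loaded directly

    csv_bytes = os.path.getsize(csv_path)
    npy_bytes = os.path.getsize(npy_path)

    from_csv = np.loadtxt(csv_path, delimiter=",")
    from_npy = np.load(npy_path)
```

The binary file stores 8 bytes per value plus a small header, while the text file stores every digit as a character and has to be parsed back into floats on each read.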

### Preprocess your Data so that You can "Slide" over it

This recommendation is related to "Models Built for Cross-Sectional Data Do Not "Share" Data Across Samples". In this case, you should also think about how to preprocess your data and ensure that the way you do this is consistent with using a sliding window over that data to generate multiple test samples.

As an example, consider normalization or moving averages as preprocessing steps. If you plan to compute these separately for each time window, you could get better model accuracy (although, in my experience, these gains are often marginal). However, there are several disadvantages:

- more computational resources are needed to calculate these preprocessing features over and over again on overlapping data - only to end up with very similar numbers;
- you need to store overlapping data with slightly different preprocessing over and over again;
- you cannot take full advantage of sliding a window over your data.
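One way to avoid these costs, sketched below under simple assumptions, is to run the preprocessing step once over the whole series and only then slide a window over the result. The three-point moving average and six-point window are illustrative choices, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(20, dtype=np.float64)  # stand-in for a real series

# Compute a 3-point moving average ONCE over the whole series.
kernel = np.ones(3) / 3
smoothed = np.convolve(series, kernel, mode="valid")  # length 18

# Slide a six-point window over the already smoothed series.
windows = sliding_window_view(smoothed, window_shape=6)  # shape (13, 6)
```

The windows are views onto `smoothed`, so the preprocessed values are stored a single time even though each one appears in several samples.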

## Modifying Your Analysis to Suit Performance Considerations

Many of us are guilty of getting too comfortable with a particular set of analytical tools, along with the software and rules of thumb we use to fit a model. We tend to assess our accuracy needs once and not reassess them when we find out the computational cost of the various possible levels of model performance.

**Time series** data, often used to make fast forecasts, is especially likely to need models that can be fit and put into production quickly. Models need to be fit quickly so that they can be updated as new data arrives, and they need to run quickly so that the consumers of the models' forecasts have as much time as possible to act on them. Because of this, you may sometimes want to change your expectations - and the accompanying analysis - to make the analysis and forecasting processes faster and computationally leaner.

### Using All Your Data Isn't Necessarily Better

An important factor when thinking about how to optimize your analysis is understanding that not all data in a **time series** is equally important. More distant data is less important. Data during "exceptional" times is less important for building a model for normal times.

There are several ways to reduce the amount of data you use to train a model. Although many of these options have been analyzed previously, it is always good to revisit them, especially with regard to performance:

*Downsampling*

- generally, you can use less frequent data to cover the same lookback window when making a prediction. This is a way to reduce the size of your data by a multiplicative factor. Note that, depending on the analytical technique used, you also have more creative options, such as employing downsampling at different rates depending on how far back the data goes.

*Training on recent data only*

- even though machine learning loves data, there are many time series models where statistical or even deep learning techniques will do better when trained only on recent data, rather than on all the data. This will help you reduce your input data by removing data that is only minimally informative for your model.

*Reduce the lookback window used to make your prediction*

- in many **time series** models, model performance will continue to improve, however slightly, as you look further and further into the past. You must decide how much accuracy is actually required relative to computational performance. You may be loading much more data into memory per sample than is necessary for acceptable accuracy.
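The three reductions above can each be expressed as a simple slice. The sketch below uses a stand-in series, and the specific factors (every 10th point, the last 2,000 observations, a 24-point lookback) are arbitrary values chosen for illustration, not recommendations.

```python
import numpy as np

series = np.arange(10_000, dtype=np.float64)  # stand-in for a long series

# Downsampling: keep every 10th point (plain decimation, no averaging).
downsampled = series[::10]

# Training on recent data only: keep the last 2,000 observations.
recent = series[-2_000:]

# Shorter lookback window: use 24 past points per sample instead of, say, 96.
lookback = 24
last_sample = series[-lookback:]
```

In practice you would compare model accuracy before and after each reduction to confirm that the discarded data really was only minimally informative.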

### Complicated Models are not always Better

It can be interesting and fun to try out the latest and greatest when choosing an analytical model. However, the real question is whether the benefit of a more sophisticated model justifies the additional computational resources it requires.

In machine learning, almost all the progress made in recent years relies on applying more and more computational power to a problem. When it comes to problems like image recognition, where there is definitely a right answer and 100% accuracy is attainable, this makes perfect sense.

On the other hand, when it comes to problems such as **time series** predictions, where there may be physical or mathematical limits to how accurate a prediction can be, ask yourself whether choosing a more complex model is simply an automatic upgrade made without a cost-benefit analysis. Consider whether the gains in accuracy justify the additional lag that the model may introduce when calculating a prediction. Consider whether the additional training time or computing resources you will need are worth it. It may be that a less resource-intensive method with slightly worse accuracy is a better "deal" than a sophisticated model that shows almost no improvement over the simpler version.

If you are a data analyst, this trade-off between complexity/accuracy and lag time/computational resources is something you should analyze. Treat it as another hyperparameter to tune. It's your job to flag these trade-offs, rather than assuming a data engineer will handle them. People working upstream or downstream in the data pipeline cannot substitute their judgment for yours when it comes to model selection. So, take both the engineering and the data science sides into consideration while weighing the pros and cons.

### Brief Mention of Alternative High Performance Tools

If you've fully explored the previous options, you might consider changing your underlying codebase, specifically giving up slower scripting languages like Python and R. There are several ways to do this:

- go all in with C++ or Java. Even if you haven't considered these languages, learning the basics can sometimes speed up slow parts of the pipeline enough to make impossible tasks manageable. In terms of usability and standard libraries applicable to data processing, C++ has evolved a lot. The STL and C++17 syntax now offer many options quite comparable to Python for operating on data sets in a variety of data structures. Even if you have disliked C++ and Java in the past, consider giving them another look;
- in Python, you can use several different modules that let you write Python code and then compile it into C or C++ code, speeding up execution time. This can help with very repetitive code containing lots of *for* loops, which are slow in Python and can become more efficient in C or C++ without any clever redesign - simply implementing the same code in a faster language solves the problem. **Numba** and **Cython** are accessible Python modules that can help you speed up snippets of Python code;
- in R you can use *Rcpp* for similar functionality.
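As a small illustration of the Python option, the sketch below applies Numba's `njit` decorator to a loop-heavy rolling sum. The function `moving_sum` is a hypothetical example, and the fallback decorator is an assumption added here so that the snippet still runs as plain Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        """Fallback: run as plain Python when Numba is not installed."""
        return func

@njit
def moving_sum(values, window):
    """Rolling sum written as an explicit loop, the kind of code Numba speeds up."""
    out = np.empty(values.shape[0] - window + 1)
    running = 0.0
    for i in range(values.shape[0]):
        running += values[i]
        if i >= window:
            running -= values[i - window]
        if i >= window - 1:
            out[i - window + 1] = running
    return out

result = moving_sum(np.arange(10, dtype=np.float64), 3)
```

The first call includes compilation time when Numba is available, so any speedup should be measured on subsequent calls and on realistically sized inputs.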