# Storing Temporal Data
In general, the value of **time series** data lies in retrospective analysis (the batch ingestion model) rather than in live streaming. For this reason, storing **time series** data is necessary for most analyses.

A good storage solution is one that allows for easy access and reliability of data without requiring a large investment of computing resources. Later, we will look at what aspects of a dataset we should consider for storage, as well as examine the advantages of SQL databases, NoSQL databases, and a variety of flat file formats.

Developing a general **time series** storage solution is challenging because there are many different types of data, each with different storage, read/write, and analysis patterns. Some data will be stored and examined repeatedly, while other data is useful only for a short period, after which it may be deleted entirely.

Use case examples:

<ins>*Use case 1:*</ins>
- We are collecting performance metrics on a production system. These metrics need to be stored for years on end, but the older the data gets, the less detailed it needs to be. Therefore, we need a storage medium that automatically performs *downsampling* and culls the data as it ages;

<ins>*Use case 2:*</ins>
- We have remote access to an open source repository of **time series** data, but we need to keep a local copy on our computer to reduce network traffic. The remote repository stores each time series in a folder of downloadable files on a web server, but we would like to compile all of these files into a single database to simplify things. The data must be immutable and capable of being stored indefinitely, as the aim is to have a reliable copy of the remote repository;

<ins>*Use case 3:*</ins>
- We create our own **time series** data by integrating a variety of data sources at different time scales, each with distinct pre-processing and formatting. Collecting and processing the data was tedious and time-consuming. We would like to store the data in its final format instead of rerunning the pre-processing step every time, but we would also like to keep the raw data so that we can explore pre-processing alternatives later. We may need to revisit both the processed and the raw data frequently as we develop new machine learning models, refitting models on the same data and adding data over time as newer raw data becomes available. There is no need to downsample or cull data in storage.

Use case solutions:

<ins>*Importance of how performance scales with size*</ins>
- In the first use case, we would look for a solution that can incorporate automated scripts to delete old data. We would not be concerned about how the system scales to large datasets, as we plan to keep the dataset small. For the second and third cases, we would expect a large, stable collection of data and a large, growing collection of data, respectively;

<ins>*Importance of random access versus sequential access of data points*</ins>
- In the second case, we expect all of the data to be accessed about equally often, since this **time series** data would all have the same "age" upon insertion and would all refer to data sets of interest. In contrast, in the first and third cases, we expect the most recent data to be accessed more frequently;

<ins>*Importance of automation scripts*</ins>
- The first case seems well suited to automation, while the second would not require it (since the data would be immutable). The third case suggests little automation but a considerable amount of data collection and processing across all parts of the data, not just the most recent ones. In the first case, we want a storage solution that can be integrated with scripts or stored procedures, while in the third case we want a solution that allows easy customization of data processing.
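
The automated downsampling and deletion described for the first use case can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `apply_retention`, the seven-day and one-year thresholds, and the choice of hourly means are all hypothetical and not prescribed by the text.

```python
from datetime import datetime, timedelta
from statistics import mean


def apply_retention(points, now,
                    keep_raw=timedelta(days=7),
                    max_age=timedelta(days=365)):
    """Downsample old points to hourly means and drop very old ones.

    points: list of (timestamp, value) tuples.
    keep_raw and max_age are illustrative thresholds (assumptions).
    """
    recent = []
    hourly = {}  # start of hour -> values observed in that hour
    for ts, value in points:
        age = now - ts
        if age > max_age:
            continue  # older than the retention limit: delete
        if age <= keep_raw:
            recent.append((ts, value))  # recent data keeps full detail
        else:
            bucket = ts.replace(minute=0, second=0, microsecond=0)
            hourly.setdefault(bucket, []).append(value)
    downsampled = [(bucket, mean(values)) for bucket, values in hourly.items()]
    return sorted(downsampled + recent)


now = datetime(2024, 1, 31)
points = [
    (datetime(2022, 1, 1, 0, 0), 1.0),       # beyond max_age
    (datetime(2024, 1, 1, 10, 5), 10.0),     # old: downsampled
    (datetime(2024, 1, 1, 10, 35), 20.0),    # old: same hour as above
    (datetime(2024, 1, 30, 12, 0, 30), 5.0)  # recent: kept as-is
]
result = apply_retention(points, now)
```

In a real system this logic would typically run as a scheduled script or a stored procedure, which is why integration with such automation matters for the first use case.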

## Defining the Requirements

When considering how to store **time series** data, we invite you to ask a few questions:
- <ins>*How much **time series** data will we store?* *How quickly will this data grow?*</ins>
    - We will want to choose a storage solution suited to the expected growth rate of the data. Database administrators who are migrating from transaction-oriented datasets to **time series** are often surprised by how quickly the datasets can grow;<br><br>

- <ins>*Do the measurements tend to form an endless stream of updates (e.g., a constant stream of web traffic updates) or distinct events (e.g., a series of air traffic schedules for every major U.S. holiday over the past ten years)?*</ins>
    - If the data resembles an endless stream, we will mostly look at recent data. If, on the other hand, our data is a collection of **time series** split into separate events, then events further back in time can still be quite interesting. In the latter case, random access is the more likely pattern;<br><br>
    
- <ins>*Does the data have regular or irregular spacing?*</ins>
    - If the data is regularly spaced, we will be able to calculate more accurately and in advance how much data we expect to collect and how frequently this data will be entered into the system. If the data is irregularly spaced, we must prepare for a less predictable style of data access, one that efficiently accommodates both periods of inactivity and periods of write activity;<br><br>
    
- <ins>*Will we collect data continuously or will we have a well-defined end date?*</ins>
    - If we have a well-defined end date for data collection, it will be easier to know the size of the data set that needs to be accommodated. However, once they start collecting a specific type of **time series**, many organizations find that they do not want to stop;<br><br>
    
- <ins>*What will we do with our **time series**? Are real-time views necessary? Will we need preprocessed data for a neural network to iterate over thousands of times? Sharded data that is highly available to a large mobile user base?*</ins>
    - The primary use case will indicate whether we are more likely to need sequential or random access to our data, and how important latency should be when choosing the storage format;<br><br>
    
- <ins>*How will we cull or downsample the data? How will we avoid infinite growth? What should be the life cycle of an individual data point in a **time series**?*</ins>
    - It is impossible to store all events forever. It is better to make decisions about systematic data deletion policies in advance than to do so in a one-off fashion. The more you anticipate, the better the choice you can make regarding storage formats. In the next section, we will talk more about this.<br><br>
    

The answers to these questions will indicate whether we should store raw or processed data, whether the data should be laid out in memory according to time or some other axis, and whether we need to store the data in a format that makes it easy to both read and write. Use cases vary, so we should take a fresh inventory for each new dataset.
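
For regularly spaced data, the volume estimate mentioned in the question about spacing comes down to simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a sizing tool: the function names are hypothetical, and the default of 16 bytes per point assumes an 8-byte timestamp plus an 8-byte float with no compression or index overhead.

```python
def estimate_points(interval_seconds, duration_days, n_series=1):
    """Number of points collected at a fixed sampling interval."""
    seconds = duration_days * 24 * 60 * 60
    return (seconds // interval_seconds) * n_series


def estimate_bytes(interval_seconds, duration_days, n_series=1,
                   bytes_per_point=16):
    """Rough raw size; bytes_per_point=16 is an assumption
    (8-byte timestamp + 8-byte float, no compression or indexes)."""
    return estimate_points(interval_seconds, duration_days, n_series) * bytes_per_point


# Example: 100 series sampled every 5 minutes for one year.
points = estimate_points(interval_seconds=300, duration_days=365, n_series=100)
size = estimate_bytes(interval_seconds=300, duration_days=365, n_series=100)
print(points)  # 10512000
print(size)    # 168192000 (about 168 MB)
```

Actual storage engines add overhead or compress data, so a figure like this is best treated as an order-of-magnitude guide.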

## Live Data versus Stored Data

When thinking about which storage options are right for our data, it is critical to understand the data's life cycle. The more realistic we are about our use cases, the less data we will need to save and the less time we will spend searching for the ideal storage system, because we will not be scaling to an intractable amount of data. Organizations often over-record events of interest because they fear losing their data stores. However, more data stored in an intractable form is less useful than aggregated data stored at meaningful time scales.

When it comes to short-lived live data, like performance data that will be examined just to make sure nothing is wrong, we may never need to store the data in the form in which it is collected, at least not for long. This is especially true of event-driven data, where no single event is important and aggregated statistics are the values of interest instead.

Suppose we are running a web server that records and reports the amount of time it took each mobile device to fully load a given web page. The resulting irregularly spaced **time series** might look similar to the following table:

| Timestamp                       | Time to load the page |
| :------------------------------ | --------------------- |
| April 5, 2018 10:22:24 pm       | 23s                   |
| April 5, 2018 10:22:28 pm       | 15s                   |
| April 5, 2018 10:23:02 pm       | 14s                   |

<br>
<br>

For various reasons, we may not be interested in any individual measurement of the time taken to load a page. We would rather aggregate the data (say, the average page load time per minute), and even the aggregate statistics would be interesting only for a short while. To be sure we can show that performance was good while we were looking after the system, all of this could be reduced to a single data point.

Instead of having 3,470 individual events that are of no interest to anyone, we will have compact and readily accessible values of interest. It is necessary to simplify data storage through aggregation and deduplication whenever possible.

| Period                       | Peak access hour | Pages loaded | Average time to load | Maximum time to load |
| :--------------------------- | ---------------- | ------------ | -------------------- | -------------------- |
| April 5, 2018 8pm - 8am      | 11pm             | 3,470        | 21s                  | 45s                  |
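
A summary row like this one can be produced with a small aggregation step. The following is a minimal sketch using only the standard library and the three sample events from the first table. The function name `summarize_loads` and the dictionary keys are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def summarize_loads(events):
    """Collapse (timestamp, load_seconds) events into one summary record."""
    if not events:
        raise ValueError("no events to summarize")
    timestamps = [ts for ts, _ in events]
    load_times = [seconds for _, seconds in events]
    # Hour of day with the most page loads.
    busiest_hour, _ = Counter(ts.hour for ts in timestamps).most_common(1)[0]
    return {
        "start": min(timestamps),
        "end": max(timestamps),
        "busiest_hour": busiest_hour,
        "pages_loaded": len(events),
        "mean_load_s": mean(load_times),
        "max_load_s": max(load_times),
    }


events = [
    (datetime(2018, 4, 5, 22, 22, 24), 23),
    (datetime(2018, 4, 5, 22, 22, 28), 15),
    (datetime(2018, 4, 5, 22, 23, 2), 14),
]
summary = summarize_loads(events)
```

Once the summary is stored, the individual events can be discarded if nothing downstream needs them.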

### Variables that change gradually

If you are storing a state variable, consider recording only the data points where the value changed. For example, if you are recording temperature in five-minute increments, your curve may look like a *step function*, especially if you only care about a rounded value, such as the nearest degree. In this case, it is not necessary to store repeated values, which saves storage space.
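
One way to sketch this change-only storage is shown below. It assumes timestamps are sorted and, for brevity, uses integer minutes. The helper names `compress_changes` and `value_at` are hypothetical.

```python
from bisect import bisect_right


def compress_changes(readings):
    """Keep only readings whose value differs from the last kept value."""
    kept = []
    for ts, value in readings:
        if not kept or kept[-1][1] != value:
            kept.append((ts, value))
    return kept


def value_at(kept, ts):
    """Return the value in effect at time ts (the last change at or before ts)."""
    times = [t for t, _ in kept]
    i = bisect_right(times, ts)
    if i == 0:
        raise KeyError(f"no reading at or before {ts}")
    return kept[i - 1][1]


# Temperature to the nearest degree, sampled every five minutes.
readings = [(0, 20), (5, 20), (10, 21), (15, 21), (20, 21), (25, 20)]
kept = compress_changes(readings)
print(kept)                # [(0, 20), (10, 21), (25, 20)]
print(value_at(kept, 15))  # 21
```

One trade-off of this approach is that the stored series no longer distinguishes "the value did not change" from "no measurement was taken", so it fits best when the sensor is known to report reliably.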

### Noisy and high-frequency data

If the data is noisy, there is reason not to care much about any specific data point. You may want to aggregate the data points before recording them, since a high noise level reduces the value of any individual measurement. Of course, this depends on the domain, and you will need to ensure that downstream users can still evaluate the noise in their measurements.
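
A possible way to aggregate while still letting downstream users judge the noise is to store a spread statistic alongside each mean. The sketch below assumes fixed-size windows and uses the sample standard deviation. Both choices, along with the name `aggregate_windows`, are illustrative rather than taken from the text.

```python
from statistics import mean, stdev


def aggregate_windows(values, window):
    """Replace each run of `window` values with its count, mean, and std."""
    summaries = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        # stdev needs at least two points; report 0.0 for a lone value.
        spread = stdev(chunk) if len(chunk) > 1 else 0.0
        summaries.append({"n": len(chunk), "mean": mean(chunk), "std": spread})
    return summaries


noisy = [1.0, 3.0, 2.0, 4.0, 10.0]
summaries = aggregate_windows(noisy, window=2)
```

Keeping the count and standard deviation costs two extra numbers per window but preserves enough information to estimate the noise level after the raw points are gone.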