# Generating and Selecting Features

In the previous examples, we looked at **time series** analysis methods that fit a model using every data point in the series. Now, in preparation for applying machine learning, we will study how to generate and select **features** for **time series**.

Feature generation is the process of finding a quantitative way to encapsulate the most important aspects of **time series** data in just a few numeric values and categorical labels. You compress the raw series into a shorter representation, using a set of features to describe a given **time series**. For example, a very simple feature set could describe each **time series** by its mean value and the number of time steps in the series. This describes the **time series** without going through all the raw data step by step.

The goal of feature generation is to compress as much information about the complete **time series** as possible into a few metrics or, alternatively, to use these metrics to identify the most important information about the **time series** and discard the rest. This is indispensable for machine learning methods, which can be applied successfully to **time series** problems as long as we can digest a **time series** into a properly formatted input.

Once we have generated some putatively useful features, we must make sure they are in fact useful. We are unlikely to create many useless features by hand, but we may run into this problem when using code that automatically generates a large number of features from a **time series** for downstream machine learning use. For this reason, we must inspect the features once they are generated and decide which ones can be discarded from later analyses.

Traditional machine learning models were not originally developed with **time series** in mind, so they are not automatically suited to time series analysis. One way to make these models work with temporal data is to generate features. For example, by describing a univariate **time series** not as a long run of numbers detailing the step-by-step outputs of a process, but as a set of features, we gain access to methods designed for cross-sectional data.

## Introductory Example

Consider the following morning, midday, and evening temperatures:

| Time              | Temperature (°F) |
|:------------------|------------------|
| Monday morning    | 35 |
| Monday midday     | 52 |
| Monday evening    | 15 |
| Tuesday morning   | 37 |
| Tuesday midday    | 52 |
| Tuesday evening   | 15 |
| Wednesday morning | 37 |
| Wednesday midday  | 54 |
| Wednesday evening | 16 |
| Thursday morning  | 39 |
| Thursday midday   | 51 |
| Thursday evening  | 12 |
| Friday morning    | 41 |
| Friday midday     | 55 |
| Friday evening    | 20 |
| Saturday morning  | 43 |
| Saturday midday   | 58 |
| Saturday evening  | 22 |
| Sunday morning    | 46 |
| Sunday midday     | 61 |
| Sunday evening    | 35 |


We could plot this data and see elements of periodicity (a daily cycle) as well as a general upward trend in temperatures. However, we cannot store the image of a graph in a database, and most methods that accept an image as input are data-intensive applications that themselves seek to reduce the image to summary metrics. In other words, we must compute the summary metrics ourselves. Instead of describing the series with the 21 numbers in the table, we could describe it with a few words and numbers:
- periodic, with a daily cycle;
- increasing trend; we could make this more quantitative by calculating a slope;
- average values for morning, midday, and evening;

By doing this, we would reduce the 21-point **time series** to 2 to 5 numbers: good data compression without losing too much detail. This is a simple case of feature generation. Feature *selection*, in turn, would involve eliminating any features that are not descriptive enough to justify their inclusion. What justifies inclusion depends on our downstream use of the features.
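As a rough illustration of the summary above, the sketch below turns the 21 temperatures from the table into four numbers using only the Python standard library. The helper `slope` and the feature names are illustrative choices, not part of any library, and fitting the trend on daily means is just one reasonable way to quantify "increasing".

```python
from statistics import mean

# Temperatures from the table, in °F, ordered Monday morning to Sunday evening.
temps = [35, 52, 15, 37, 52, 15, 37, 54, 16, 39, 51, 12,
         41, 55, 20, 43, 58, 22, 46, 61, 35]


def slope(values):
    """Least-squares slope of values against their index (units per step)."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


# Averaging the three readings of each day removes the daily cycle
# before the trend is fitted.
daily_means = [mean(temps[i:i + 3]) for i in range(0, len(temps), 3)]

features = {
    "trend_per_day": slope(daily_means),  # °F per day
    "mean_morning": mean(temps[0::3]),
    "mean_midday": mean(temps[1::3]),
    "mean_evening": mean(temps[2::3]),
}

for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

Whether all four numbers are worth keeping is exactly the feature selection question: a model that only needs the trend could drop the three time-of-day means.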

## General Considerations when Calculating Characteristics

As with any aspect of analysis, when calculating features for a **time series** data set we should consider whether the analysis makes sense and whether the effort put into generating features is more likely to result in overfitting, due to too many features, than in relevant insights.

The best approach is to develop a set of potentially useful features while exploring and cleaning the **time series**. As we visualize the data and analyze distinctions between different **time series** from the same data set or across different time periods, we can develop ideas about which kinds of measurements would be useful for labeling or predicting a **time series**. Any prior knowledge about the system, or even working hypotheses that we would like to test with further analysis, can help.

### The Nature of Time Series
As we decide which **time series** features to generate, we must keep in mind the underlying attributes of the **time series** that we identified during data exploration and cleaning.

#### Stationarity
Stationarity is one consideration. Many **time series** features assume stationarity and are useless unless the underlying data is stationary or at least ergodic. For example, using the mean of a **time series** as a feature is only practical when the series is stationary, so that the idea of an average makes sense. For a non-stationary **time series**, this value carries little meaning: the measured average is something of a fluke, the result of many tangled processes such as a trend or a seasonal cycle.
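A small simulation can make this concrete. The sketch below is only an illustration on synthetic data: the noise level, the trend of 0.1 per step, and the helper `half_means` are all assumptions chosen for the example. It compares the mean of the first and second halves of a stationary series and of the same series with a linear trend added.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

n = 200
noise = [random.gauss(0, 1) for _ in range(n)]
stationary = noise                                     # fluctuates around 0
trending = [0.1 * t + e for t, e in enumerate(noise)]  # same noise plus a trend


def half_means(series):
    """Mean of the first half and of the second half of the series."""
    mid = len(series) // 2
    return mean(series[:mid]), mean(series[mid:])


stat_first, stat_second = half_means(stationary)
trend_first, trend_second = half_means(trending)

stationary_gap = abs(stat_second - stat_first)
trending_gap = abs(trend_second - trend_first)

print(f"stationary: gap between half means = {stationary_gap:.2f}")
print(f"trending:   gap between half means = {trending_gap:.2f}")
```

For the stationary series the two halves give nearly the same mean, so the overall mean is a stable description. For the trending series the mean depends almost entirely on which stretch of time happened to be observed.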

#### Time series size
Another consideration for feature generation is the length of the **time series**. Some features may be sensible for a stationary **time series** but become unstable as the length of the series increases, such as the minimum and maximum values. For the same underlying process, a longer **time series** is likely to show more extreme maximum and minimum values than a shorter one, simply because there are more opportunities to observe them.
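The effect of series length on extreme values can be seen with a quick experiment. This is a minimal sketch on synthetic Gaussian noise; the lengths and the seed are arbitrary choices. It takes progressively longer prefixes of one series and records the minimum and maximum of each.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# One long draw from a single stationary process (standard Gaussian noise).
series = [random.gauss(0, 1) for _ in range(10_000)]

lengths = [10, 100, 1_000, 10_000]
prefix_max = [max(series[:n]) for n in lengths]
prefix_min = [min(series[:n]) for n in lengths]

for n, lo, hi in zip(lengths, prefix_min, prefix_max):
    print(f"first {n:>6} points: min = {lo:6.2f}, max = {hi:6.2f}")
```

Because each prefix contains the previous one, the maximum can only stay the same or grow and the minimum can only stay the same or shrink. Comparing the maximum of a short series with that of a long series therefore partly measures length rather than the process itself.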

### Domain Knowledge
Domain knowledge is essential for generating **time series** features, since it is what makes many insights possible. For example, if we are working with a physics **time series**, we must quantify features that make sense on the time scale of the system under study, and make sure the selected features are not unduly influenced by, say, the error properties of a sensor rather than the characteristics of the underlying system. As another example, imagine we are working with data from a specific financial market. To ensure financial stability, this market imposes a limit on how much prices may move on a given day. If the price changes too much, the exchange closes. In this situation, we might consider generating a feature that indicates the maximum price seen on a given day.
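The daily maximum price mentioned above is simple to compute. The sketch below uses a handful of made-up `(day, price)` ticks purely for illustration; the dates, prices, and variable names are hypothetical and do not come from any real market.

```python
from collections import defaultdict

# Made-up intraday ticks as (day, price) pairs, for illustration only.
ticks = [
    ("2024-01-02", 100.0), ("2024-01-02", 103.5), ("2024-01-02", 101.2),
    ("2024-01-03", 101.0), ("2024-01-03", 99.4), ("2024-01-03", 104.8),
]

# Track the highest price seen on each day.
running_max = defaultdict(lambda: float("-inf"))
for day, price in ticks:
    running_max[day] = max(running_max[day], price)

daily_max = dict(running_max)
print(daily_max)
```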

### External Considerations
The available computing resources and data storage are also important. The purpose of generating **time series** features can influence how many features we decide to calculate and whether we should consider those that are computationally demanding. This may also depend on the overall size of the data set we are analyzing.

### Open Source Libraries for Generating Time Series Features
There have been many advances to automate the creation of **time series** features, as they are often interesting, descriptive, and even predictive across multiple domains.

#### Python tsfresh module
Python's *tsfresh* module is one example of a library for automatic feature generation, implementing a broad and general set of features. To get an idea of its scope, the general categories available are:
- <u>Descriptive statistics</u>
    - are motivated by the traditional statistical **time series** methodologies we studied previously, including:
        - augmented Dickey-Fuller test value;
        - AR(k) coefficient;
        - autocorrelation for a lag k;
- <u>Physics-inspired nonlinearity and complexity indicators</u>
    - this category includes:
        - the *c3()* function, a proxy for the expected value of *L^2(X) × L(X) × X* (*L* is the lag operator), that is, the mean of x<sub>i+2·lag</sub> × x<sub>i+lag</sub> × x<sub>i</sub>. It was proposed as a measure of *nonlinearity* in a **time series**;
        - the *cid_ce()* function, which calculates the square root of the sum, over all consecutive pairs of points, of (x<sub>i+1</sub> - x<sub>i</sub>)^2. It was proposed as a measure of the *complexity* of a **time series**;
        - the *friedrich_coefficients()* function, which returns coefficients from a model fitted to describe complex non-linear motion;
- <u>History compression calculations</u>
    - this category includes the following characteristics:
        - the sum of the values in a **time series** that occur more than once;
        - the size of the longest consecutive subsequence that is above or below the average;
        - the first occurrence in the **time series** of the minimum or maximum value;
<br><br>
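To make a few of these descriptions concrete, here is a minimal pure-Python sketch of four of them. These are plain re-implementations of the formulas as described in the list, not the *tsfresh* source code. The names `sum_repeated_values` and `longest_run_above_mean` are hypothetical helpers, and counting each repeated value only once is an assumption of this sketch.

```python
from collections import Counter
from statistics import mean


def c3(x, lag):
    """Mean of x[i + 2*lag] * x[i + lag] * x[i] over all valid i."""
    n = len(x)
    if 2 * lag >= n:
        return 0.0  # not enough points for this lag
    products = [x[i + 2 * lag] * x[i + lag] * x[i] for i in range(n - 2 * lag)]
    return mean(products)


def cid_ce(x):
    """Square root of the sum of squared differences between consecutive points."""
    return sum((b - a) ** 2 for a, b in zip(x, x[1:])) ** 0.5


def sum_repeated_values(x):
    """Sum of the distinct values that appear more than once (each counted once)."""
    return sum(value for value, count in Counter(x).items() if count > 1)


def longest_run_above_mean(x):
    """Length of the longest consecutive run of points strictly above the mean."""
    m = mean(x)
    best = run = 0
    for value in x:
        run = run + 1 if value > m else 0
        best = max(best, run)
    return best


series = [1, 2, 3, 2, 1, 4]
features = {
    "c3_lag1": c3(series, 1),
    "cid_ce": cid_ce(series),
    "sum_repeated_values": sum_repeated_values(series),
    "longest_run_above_mean": longest_run_above_mean(series),
}
print(features)
```

Writing a feature out like this is mainly useful for understanding what it measures. For real work, the library versions are the safer choice.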

A module like *tsfresh* can save time and help us choose efficient implementations for feature selection. It can also introduce us to ways of describing data that we might not have come across on our own. The module has several advantages, as is generally the case when analysis builds on well-tested open source tools, including:
- when calculating standard characteristics, there is no need to reinvent the wheel. When you use a shared library, you have some assurance that other users have checked its accuracy, rather than doubting your own code and having to check it;
- a library like this provides a framework for calculating characteristics, not just their extensive list;
- this library was designed with downstream consumers of features in mind, mainly **sklearn**. Its features can therefore be used easily in machine learning models;


#### The Cesium time series analysis platform

A more accessible, but equally extensive, catalog of generated features is the list implemented in the **Cesium** library (https://cesium-ml.org/docs/index-html). The current list is available in the documentation (https://cesium-ml.org/docs/feature_table.html), and below we have selected some interesting features for analysis and inspection. The general categories are broken down in the source code (https://perma.cc/8HX4-MXBU), but we detail them a little more here:
- Characteristics that describe the general distribution of data values without taking into account their temporal relationships. This category can encompass a diverse set of characteristics that, despite everything, are independent of time:
    - how many local peaks are there in a histogram of the data?
    - what percentage of the data points are within a fixed window of values close to the data median?
- Features that describe the time distribution of data:
    - features that consider the time distribution between measurements as their own distribution and calculate statistics similar to those just described, now on the distribution of time differences instead of on data values;
    - features that calculate the probability that the next observation will occur within *n* time steps, given the observed distribution;
- Characteristics that describe measures of the periodicity of behavior in the **time series**. These characteristics are often associated with the *Lomb-Scargle periodogram*;
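The first two categories can be illustrated with a short sketch. The functions below are hypothetical helpers written for this example and are not Cesium's API. They compute one value-distribution feature (the share of points near the median) and a few statistics of the gaps between observation times, using made-up irregularly sampled data.

```python
from statistics import mean, median, pstdev


def fraction_near_median(values, half_width):
    """Share of points within +/- half_width of the median, ignoring time order."""
    med = median(values)
    return sum(abs(v - med) <= half_width for v in values) / len(values)


def gap_statistics(times):
    """Summary statistics of the time differences between consecutive observations."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {"mean_gap": mean(gaps), "std_gap": pstdev(gaps), "max_gap": max(gaps)}


# Made-up, irregularly sampled observations for illustration.
times = [0.0, 1.0, 2.5, 3.0, 7.0, 8.0]
values = [10, 12, 11, 30, 12, 13]

near_median = fraction_near_median(values, half_width=1)
gap_stats = gap_statistics(times)

print(f"fraction within 1 of the median: {near_median:.2f}")
print(gap_stats)
```

The first feature would be unchanged if the values were shuffled, while the second ignores the values entirely and looks only at when the measurements were taken.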

The features just described can be calculated over the entire **time series** or as rolling or expanding window functions. Given what we have already learned about implementing rolling and expanding window functions, we should be able to implement these features ourselves, since we can already follow the documentation and understand what this library does. Here we would use rolling window functions to summarize the data rather than to clean it. In **time series** analysis, the same techniques often prove useful in many different situations.
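As a reminder of the difference between the two window types, the following sketch applies an arbitrary summary function over rolling and expanding windows. The helpers `rolling` and `expanding` are minimal stand-ins written for this example, not functions from any of the libraries discussed.

```python
from statistics import mean


def rolling(values, window, func):
    """Apply func to each full window of fixed width, sliding one step at a time."""
    return [func(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]


def expanding(values, func):
    """Apply func to all data seen so far, growing the window by one step each time."""
    return [func(values[:i + 1]) for i in range(len(values))]


# First two days of the temperature table (morning, midday, evening).
values = [35, 52, 15, 37, 52, 15]

rolling_mean = rolling(values, 3, mean)  # one value per full 3-point window
expanding_max = expanding(values, max)   # running maximum so far

print(rolling_mean)
print(expanding_max)
```

Any feature that maps a list of values to a number, such as the median, the range, or a count of peaks, can be passed as `func`, which turns a whole-series feature into a local one.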

The **cesium** library offers more than feature generation. For example, it has a web-based GUI for generating features, and it also integrates with **sklearn**. If you test these libraries on your data, you will find that generating **time series** features takes a great deal of time. Therefore, you should carefully consider how many features you need to generate for your data and when it makes sense to generate them automatically rather than developing your own.

Many of the features these libraries generate are computationally heavy and, given the comprehensiveness of the feature lists, will often not address the question we are trying to answer. With some domain knowledge, we can even recognize that a specific type of feature is irrelevant, noisy, or non-predictive. Do not calculate features you know to be useless: doing so only slows the analysis down and makes it harder to interpret. Automatic feature generation libraries are useful, but should be used with caution.

#### The tsfeatures package in R
**tsfeatures** is a convenient R package for generating a variety of useful and commonly used **time series** features. The documentation includes a list of features with the following functions:
- functions *acf_features()* and *pacf_features()*: each calculates a number of related values, giving a general sense of how important autocorrelation is in the behavior of a **time series**. For the *acf_features()* function, the documentation describes the following return values: "A vector of six values: first autocorrelation coefficient and sum of the square of the first ten autocorrelation coefficients of the original series, first-differenced series, and twice-differenced series. For seasonal data, the autocorrelation coefficient at the first seasonal lag is also returned";
- the *lumpiness()* and *stability()* functions are based on tiled windows, while the *max_level_shift()* and *max_var_shift()* functions are based on rolling windows. In each case, difference and range statistics are applied to values measured in overlapping (rolling) or non-overlapping (tiled) windows of the **time series**;
- *unitroot_kpss()* (https://perma.cc/WF3Y-7MDJ) and *unitroot_pp()* (https://perma.cc/54XY-4HWJ), which compute unit root test statistics;
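The tiled-window and rolling-window ideas can be sketched in a few lines. The code below is written in Python to stay consistent with the rest of these notes and is only an approximation of what the R functions compute. It assumes that stability is the variance of the window means, that lumpiness is the variance of the window variances, and that the level shift is the largest change in mean between two adjacent windows. It also uses the sample variance and ignores any scaling the package may apply.

```python
from statistics import mean, variance


def tiled_windows(values, width):
    """Split into consecutive non-overlapping windows, dropping a short tail."""
    return [values[i:i + width]
            for i in range(0, len(values) - width + 1, width)]


def stability(values, width):
    """Variance of the means of the tiled windows."""
    return variance([mean(w) for w in tiled_windows(values, width)])


def lumpiness(values, width):
    """Variance of the variances of the tiled windows."""
    return variance([variance(w) for w in tiled_windows(values, width)])


def max_level_shift(values, width):
    """Largest absolute change in mean between two adjacent windows,
    sliding the pair of windows one step at a time."""
    shifts = [
        abs(mean(values[i + width:i + 2 * width]) - mean(values[i:i + width]))
        for i in range(len(values) - 2 * width + 1)
    ]
    return max(shifts)


# A series with a single jump in level halfway through.
series = [1, 1, 1, 1, 5, 5, 5, 5]

series_stability = stability(series, width=4)
series_lumpiness = lumpiness(series, width=4)
series_level_shift = max_level_shift(series, width=2)

print(series_stability, series_lumpiness, series_level_shift)
```

For this series the window means differ while the window variances do not, so the jump shows up in the stability and level-shift values but not in the lumpiness.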

The **tsfeatures** package usefully consolidates work from a variety of academic projects related to the study of **time series** features, as well as other ongoing efforts to develop **time series** features that serve a wide range of domains. Among them:

- *compengine()* calculates the same **time series** features developed by the compengine.org project, found to be useful on a wide variety of **time series** data from many domains;
- several features borrowed from the **hctsa** package (https://github.com/benfulcher/hctsa), whose objective is to perform highly comparative **time series** analysis in MATLAB. Some of them are: *autocorr_features()*, *firstmin_ac()*, *pred_features()*, and *trev_num()*;

### Examples of Domain Specific Characteristics
