# Generating and Selecting Features

The previous examples covered **time series** analysis methods that use every data point in a **time series** to fit the model. Now, in preparation for applying machine learning, we will study how to generate and select **features** for **time series**.

Feature generation is the process of finding a quantitative way to encapsulate the most important aspects of **time series** data in just a few numeric values and categorical labels. You compress the raw series into a shorter representation, using a set of features to describe a given **time series**. For example, a simple feature set could describe each **time series** by its mean value and the number of time steps in the series. This describes the **time series** without stepping through all the raw data.

The goal of generating features is to compress as much information about the complete **time series** as possible into a few metrics or, alternatively, to use these metrics to identify the most important information about the **time series** and discard the rest. This is indispensable for machine learning methods, which can be applied successfully to **time series** problems as long as we can digest a **time series** into a properly formatted input.

Once we have generated some putatively useful features, we must check that they are in fact useful. We are unlikely to create many useless features by hand, but the problem does arise when we use code that automatically generates a large number of features from a **time series** for downstream machine learning. For this reason, we must inspect the generated features to decide which ones can be discarded in later analyses.

Traditional machine learning models were not originally developed with **time series** in mind, so they are not automatically suited to **time series** analysis. One way to make these models work with temporal data is to generate features. For example, if we describe a univariate **time series** not with the many numbers that detail the step-by-step outputs of a process but with a set of features, we gain access to methods designed for cross-sectional data.

## Introductory Example

Consider the morning, midday, and evening temperatures in the table below:

| Time              | Temperature (°F) |
|:------------------|------------------|
| Monday morning    | 35 |
| Monday midday     | 52 |
| Monday evening    | 15 |
| Tuesday morning   | 37 |
| Tuesday midday    | 52 |
| Tuesday evening   | 15 |
| Wednesday morning | 37 |
| Wednesday midday  | 54 |
| Wednesday evening | 16 |
| Thursday morning  | 39 |
| Thursday midday   | 51 |
| Thursday evening  | 12 |
| Friday morning    | 41 |
| Friday midday     | 55 |
| Friday evening    | 20 |
| Saturday morning  | 43 |
| Saturday midday   | 58 |
| Saturday evening  | 22 |
| Sunday morning    | 46 |
| Sunday midday     | 61 |
| Sunday evening    | 35 |


We could plot this data and see both periodicity (a daily cycle) and a general upward trend in temperatures. However, we cannot store the image of a graph in a database, and most methods that accept an image as input are data-intensive applications that themselves try to reduce the image to summary metrics. In other words, we must compute the summary metrics ourselves. Instead of describing the 21 numbers in the table as a **time series**, we could describe the series with a few words and numbers:
- periodic, with a daily cycle;
- increasing trend, which we could make more quantitative by calculating a slope;
- average values for each of morning, midday, and evening.

By doing this, we summarize the 21-point **time series** in 2 to 5 numbers, which is good data compression without losing too much detail. This is a simple case of feature generation. Feature *selection* would then involve eliminating any features that are not descriptive enough to justify their inclusion. What justifies inclusion depends on our downstream use of the features.
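The summary described above can be sketched in a few lines of plain Python. This is a minimal illustration that uses only the values from the table. The variable names and the choice of an ordinary least-squares slope over the daily means are assumptions made for this example, not a prescribed method.

```python
# Temperatures from the table, ordered Monday morning ... Sunday evening
temps = [35, 52, 15, 37, 52, 15, 37, 54, 16, 39, 51, 12,
         41, 55, 20, 43, 58, 22, 46, 61, 35]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Average per time of day: every third reading, starting at offsets 0, 1, 2
morning_mean = mean(temps[0::3])
midday_mean = mean(temps[1::3])
evening_mean = mean(temps[2::3])

# One mean per day, then a least-squares slope of those means against the day index
daily_means = [mean(temps[i:i + 3]) for i in range(0, len(temps), 3)]
days = list(range(len(daily_means)))
day_bar, temp_bar = mean(days), mean(daily_means)
slope = (sum((d - day_bar) * (t - temp_bar) for d, t in zip(days, daily_means))
         / sum((d - day_bar) ** 2 for d in days))

features = {
    "morning_mean": morning_mean,
    "midday_mean": midday_mean,
    "evening_mean": evening_mean,
    "trend_slope_per_day": slope,  # degrees Fahrenheit per day
}
print(features)
```

Four numbers now stand in for the 21 readings: three capture the daily cycle and one captures the trend.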

## General Considerations when Calculating Features

As with any aspect of analysis, when calculating features for a **time series** data set we should consider whether the analysis makes sense and whether the effort put into generating features is more likely to produce overfitting, because of too many features, than relevant insights.

The best approach is to develop a set of potentially useful features while exploring and cleaning the **time series**. As we visualize data and analyze distinctions between different **time series** from the same data set or across different time periods, we can develop ideas about what types of measurements would be useful in labeling or predicting a **time series**. Any prior knowledge about a system, or even working hypotheses that we would like to test with further analysis, can help.

### The Nature of Time Series
As we decide which **time series** features to generate, we must keep in mind the underlying attributes of the **time series** that we identified during data exploration and cleaning.

#### Stationarity
Stationarity is one consideration. Many **time series** features assume stationarity and are useless unless the underlying data is stationary or at least ergodic. For example, using the mean of a **time series** as a feature is only practical when the **time series** is stationary, so that the idea of an average makes sense. For a non-stationary **time series** this value carries little meaning, because the measured average is something of an accident, the result of several tangled processes such as a trend or a seasonal cycle.
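To make the point about the mean concrete, here is a small sketch on simulated data. The two series, the seed, and the helper `half_means` are hypothetical and chosen only for illustration: the same noise is placed once on a constant level and once on a linear trend, and the mean of each half of the series is compared.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the illustration is repeatable
n = 1000
noise = rng.normal(loc=0.0, scale=1.0, size=n)

stationary = 10.0 + noise                      # constant level plus noise
trending = 10.0 + 0.05 * np.arange(n) + noise  # same noise on a linear trend


def half_means(x):
    """Mean of the first half and of the second half of a series."""
    half = len(x) // 2
    return float(x[:half].mean()), float(x[half:].mean())


stat_first, stat_second = half_means(stationary)
trend_first, trend_second = half_means(trending)

print("stationary:", stat_first, stat_second)
print("trending:  ", trend_first, trend_second)
```

For the stationary series the two half-means nearly agree, so "the mean" is a stable description. For the trending series the mean depends on which stretch of time happened to be observed.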

#### Time series size
Another consideration for feature generation is the length of the **time series**. Some features may be sensible for a stationary **time series** but become unstable as the length of the series increases, such as the minimum and maximum values. For the same underlying process, a longer **time series** is likely to contain more extreme maximum and minimum values than a shorter one, simply because there are more opportunities to observe them.
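The effect of series length on the maximum can be seen with a short simulation. This is a sketch under the assumption of independent Gaussian noise with a fixed seed; the lengths 100, 1,000, and 10,000 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)    # fixed seed, illustrative only
series = rng.normal(size=10_000)  # one stationary process, observed for longer and longer

# running_max[i] is the maximum of the first i + 1 observations
running_max = np.maximum.accumulate(series)

for length in (100, 1_000, 10_000):
    print(f"max over first {length:>6} points: {running_max[length - 1]:.3f}")
```

The running maximum can only stay the same or grow as more points arrive, so a maximum taken from a long series is not directly comparable with one taken from a short series of the same process.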

### Domain Knowledge
Domain knowledge is essential for generating **time series** features, because it is what makes many insights possible. For example, if we are working with a physics **time series**, we must quantify features that make sense on the time scale of the system under study, and ensure that the selected features are not unduly influenced by, say, the error of a sensor rather than the properties of the underlying system. As another example, imagine that we are working with data from a specific financial market. To ensure financial stability, this market imposes maximum price variations on a given day. If the price changes too much, the stock exchange closes. In this situation, we might consider generating a feature that indicates the maximum price seen on a given day.
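A daily-maximum feature of this kind takes only a few lines to compute. The sketch below uses made-up `(day, price)` ticks purely for illustration; the dates, prices, and variable names are hypothetical and do not come from any real market.

```python
from collections import defaultdict

# Made-up (day, price) ticks, for illustration only
ticks = [
    ("2024-01-02", 100.0),
    ("2024-01-02", 104.5),
    ("2024-01-02", 103.0),
    ("2024-01-03", 103.5),
    ("2024-01-03", 101.0),
]

# Track the highest price seen on each day
running = defaultdict(lambda: float("-inf"))
for day, price in ticks:
    running[day] = max(running[day], price)

daily_max = dict(running)
print(daily_max)
```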

### External Considerations
The available computing resources and data storage also matter. The purpose of generating **time series** features can influence how many features we calculate and whether we should consider those that are computationally demanding. This may also depend on the overall size of the dataset we are analyzing.

### Open Source Libraries for Generating Time Series Features
There have been many advances to automate the creation of **time series** features, as they are often interesting, descriptive, and even predictive across multiple domains.

#### Python tsfresh module
Python's *tsfresh* module is an example of automatic feature generation, as it implements a broad and general set of features. To get an idea of its scope, the general categories available are:
- <u>Descriptive statistics</u>
    - these are motivated by the traditional statistical **time series** methodologies we studied previously, including:
        - augmented Dickey-Fuller test value;
        - AR(k) coefficient;
        - autocorrelation for a lag k;
- <u>Physics-inspired nonlinearity and complexity indicators</u>
    - this category includes:
        - the *c3()* function, a proxy to calculate the expected value of *L^2 (X^2) × L (X) × X* (*L* is the lag operator). It was proposed as a measure of *nonlinearity* in a **time series**;
        - the *cid_ce()* function, which calculates the square root of the sum, for i from 0 to n - 2, of (x<sub>i</sub> - x<sub>i+1</sub>)^2. It was proposed as a measure of the *complexity* of a **time series**;
        - the *friedrich_coefficients()* function, which returns coefficients from a model fitted to describe complex non-linear motion;
- <u>History compression calculations</u>
    - this category includes the following characteristics:
        - the sum of the values in a **time series** that occur more than once;
        - the size of the longest consecutive subsequence that is above or below the average;
        - the first occurrence in the **time series** of the minimum or maximum value;
<br><br>
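To give a feel for what some of these features measure, here is a minimal NumPy sketch of four of them. These are not the *tsfresh* implementations: the function names are hypothetical, the complexity estimate assumes the sum runs over all consecutive pairs of points, and the repeated-value sum counts each repeated value once, which is only one possible reading of that feature.

```python
from collections import Counter

import numpy as np


def complexity_estimate(x):
    """Square root of the summed squared differences between consecutive points."""
    x = np.asarray(x, dtype=float)
    return float(np.sqrt(np.sum(np.diff(x) ** 2)))


def sum_of_repeated_values(x):
    """Sum of the distinct values that appear more than once (each counted once)."""
    counts = Counter(x)
    return sum(value for value, count in counts.items() if count > 1)


def longest_run_above_mean(x):
    """Length of the longest consecutive stretch strictly above the series mean."""
    x = np.asarray(x, dtype=float)
    longest = current = 0
    for is_above in x > x.mean():
        current = current + 1 if is_above else 0
        longest = max(longest, current)
    return longest


def first_index_of_max(x):
    """Position of the first occurrence of the maximum value."""
    return int(np.argmax(x))


example = [1, 2, 2, 3, 3, 3, 5]
print(complexity_estimate(example))
print(sum_of_repeated_values(example))
print(longest_run_above_mean(example))
print(first_index_of_max(example))
```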

A module like *tsfresh* can save time and help us choose efficient implementations for feature calculation. It can also teach us ways of describing data that we might not have come across otherwise. As is always the case when combining analysis with open source, well-tested tools, this module has numerous advantages, including:
- when calculating standard features, there is no need to reinvent the wheel. When you use a shared library, you have some assurance that other users have checked its accuracy, rather than having to doubt and verify your own code;
- a library like this provides a framework for calculating features, not just a long list of them;
- the library was designed to feed downstream consumers of features, mainly **sklearn**, so its features can be used easily in machine learning models;


#### The Cesium time series analysis platform

A more accessible, but equally extensive, catalog of generated features is the list implemented in the **Cesium** library (https://cesium-ml.org/docs/index-html). The current list is available in the documentation (https://cesium-ml.org/docs/feature_table.html), and below we have selected some interesting features for analysis and inspection. The general categories are broken down in the source code (https://perma.cc/8HX4-MXBU), but we detail them a little more here:
- Characteristics that describe the general distribution of data values without taking into account their temporal relationships. This category can encompass a diverse set of characteristics that, despite everything, are independent of time:
    - how many local peaks are there in a histogram of the data?
    - what percentage of the data points are within a fixed window of values close to the data median?
- Features that describe the time distribution of data:
    - features that consider the time distribution between measurements as their own distribution and calculate statistics similar to those just described, now on the distribution of time differences instead of on data values;
    - features that calculate the probability that the next observation will occur within *n* time steps, given the observed distribution;
- Features that describe measures of the periodicity of behavior in the **time series**. These features are often associated with the *Lomb-Scargle periodogram*;
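The first two categories can be illustrated with a short NumPy sketch. These are not the **Cesium** implementations: the function names, the absolute `half_width` window, and the particular gap statistics are assumptions made for this example.

```python
import numpy as np


def fraction_near_median(values, half_width):
    """Share of points within +/- half_width of the median (ignores time order)."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(np.abs(values - np.median(values)) <= half_width))


def time_gap_stats(times):
    """Summary statistics of the gaps between consecutive measurement times."""
    gaps = np.diff(np.asarray(times, dtype=float))
    return {
        "mean_gap": float(gaps.mean()),
        "std_gap": float(gaps.std()),
        "max_gap": float(gaps.max()),
    }


values = [1, 2, 2, 3, 10]  # illustrative measurements
times = [0, 1, 3, 6, 10]   # illustrative, irregularly spaced timestamps

print(fraction_near_median(values, half_width=1.0))
print(time_gap_stats(times))
```

The first function looks only at the distribution of values, while the second treats the spacing between observations as a distribution in its own right, which matters for irregularly sampled series.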

The features just described can be calculated over the entire **time series** or as rolling or expanding window functions. Given what we have already learned about implementing rolling and expanding window functions, we could certainly implement these features ourselves, since we can follow the documentation and the concept behind what this library does. Here, rolling window functions serve to summarize the data rather than to clean it. In **time series** analysis, the same techniques turn up in many different but equally useful settings.
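As a reminder of how the two window types differ when used for summarizing, here is a minimal NumPy sketch. The helper names `rolling_feature` and `expanding_mean` are hypothetical, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_feature(x, width, func=np.mean):
    """Apply func to every window of `width` consecutive points."""
    windows = sliding_window_view(np.asarray(x, dtype=float), width)
    return func(windows, axis=1)


def expanding_mean(x):
    """Mean of all points seen so far, at every time step."""
    x = np.asarray(x, dtype=float)
    return np.cumsum(x) / np.arange(1, len(x) + 1)


series = [1, 2, 3, 4, 5]
print(rolling_feature(series, width=3))               # rolling mean
print(rolling_feature(series, width=3, func=np.max))  # rolling maximum
print(expanding_mean(series))
```

A rolling window yields one feature value per window position, so a statistic of those values (for example their maximum or variance) can itself serve as a single feature for the whole series.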

The **Cesium** library offers more than feature generation. For example, it has a web-based GUI for feature generation and also integrates with **sklearn**. If you test these libraries on your data, you will find that generating **time series** features takes a great deal of time. Therefore, you should carefully consider how many features you need to generate for your data and when it makes sense to generate them automatically rather than developing your own.

Many of the features these libraries generate are computationally heavy and, given how comprehensive the feature lists are, will often not address the question we are trying to answer. With some domain knowledge, we may even recognize that a specific type of feature is irrelevant, noisy, or non-predictive. Calculating useless features only slows the analysis down and obscures the results. Automatic feature generation libraries are useful, but should be used with caution.

#### The tsfeatures package in R
**tsfeatures** is a convenient R package for generating a variety of useful and commonly used **time series** features. The documentation includes a list of features with the following functions:
- functions *acf_features()* and *pacf_features()*: each calculates a number of related values that give a general sense of how important autocorrelation is to the behavior of a **time series**. For the *acf_features()* function, the documentation describes the following return values: "A vector of six values: first autocorrelation coefficient and sum of squares of the first ten autocorrelation coefficients of the original series, the first-differenced series, and the twice-differenced series. For seasonal data, the autocorrelation coefficient at the first seasonal lag is also returned";
- the *lumpiness()* and *stability()* functions are based on tiled windows, while the *max_level_shift()* and *max_var_shift()* functions are based on rolling windows. In each case, difference or range statistics are applied to values measured in overlapping (rolling) or non-overlapping (tiled) windows of the **time series**;
- *unitroot_kpss()* (https://perma.cc/WF3Y-7MDJ) and *unitroot()* (https://perma.cc/54XY-4HWJ);
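The difference between tiled and rolling windows can be sketched in Python, the language used for the code in this section. The functions below are illustrative only and do not reproduce the R package: they assume that stability is the variance of the tile means, that lumpiness is the variance of the tile variances, and that the level shift is the largest change in mean between two adjacent rolling windows. They skip any preprocessing the package applies, and the names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def tiled_windows(x, width):
    """Split x into non-overlapping windows, dropping any leftover points."""
    x = np.asarray(x, dtype=float)
    n_tiles = len(x) // width
    return x[: n_tiles * width].reshape(n_tiles, width)


def stability_sketch(x, width):
    """Sample variance of the tile means (needs at least two tiles)."""
    return float(np.var(tiled_windows(x, width).mean(axis=1), ddof=1))


def lumpiness_sketch(x, width):
    """Sample variance of the tile variances (needs at least two tiles)."""
    return float(np.var(tiled_windows(x, width).var(axis=1, ddof=1), ddof=1))


def max_level_shift_sketch(x, width):
    """Largest absolute change in mean between two adjacent rolling windows.

    Needs at least 2 * width points.
    """
    means = sliding_window_view(np.asarray(x, dtype=float), width).mean(axis=1)
    return float(np.abs(means[width:] - means[:-width]).max())


step = [1, 1, 1, 5, 5, 5]  # a series with one abrupt level change
print(stability_sketch(step, width=3))
print(lumpiness_sketch(step, width=3))
print(max_level_shift_sketch(step, width=3))
```

On the step series, the tile means differ while the tile variances are both zero, so the level change shows up in the stability and level-shift values but not in lumpiness.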

The **tsfeatures** package usefully consolidates work from a variety of academic projects on **time series** features, as well as other ongoing efforts to develop **time series** features that serve many domains. Among them:

- *compengine()* calculates the same **time series** features developed by the compengine.org project, found to be useful on a wide variety of **time series** data from many domains;
- several features borrowed from the **hctsa** package (https://github.com/benfulcher/hctsa), whose objective is to perform highly comparative **time series** analysis in MATLAB. Some of them are in: *autocorr_features()*, *firstmin_ac()*, *pred_features()*, and *trev_num()*;

### Examples of Domain Specific Characteristics
Another source of inspiration can come from domain-specific features developed for a variety of **time series** data. In general, these characteristics have been developed over decades, either from heuristics that work empirically even when they are not well understood or from scientific knowledge about how the underlying mechanisms of a system work. Next, we will analyze some examples.

#### Stock Market Technical Indicators
Stock market technical indicators are arguably the most widely documented and formalized sets of indicators used for domain-specific **time series** applications. Over the last century, economists who have studied **time series** data have compiled an extensive list of characteristics that they often use to quantify data in financial markets and make predictions. Although the context may not be of interest to everyone, this list is inspiring to show how a list of domain-specific characteristics can also be quite extensive, descriptive and creative.

To give a sense of the complexity and high specificity of these indicators in financial markets, we include a short list of them below. Given the complexity, it's no wonder that people spend their entire professional careers trying to understand how these "signals" can predict the rise and fall of financial markets.

- <u>Relative Strength Index</u> (RSI)
    - This measure is equal to *100 - 100 / (1 + RS)*, where RS (relative strength) is the average gain (AG) during "bull" periods (rising prices) divided by the average loss (AL) during "bear" periods (falling prices). The lookback period for these rising and falling periods is an input parameter, so we can have different RSI values for different lookback periods. Traders have created rules of thumb about which RSI cutoff values indicate that an asset is undervalued or overvalued relative to its true value. The RSI is known as a "momentum indicator" because it compares an asset's upward price movements with its downward ones;
- <u>Moving Average Convergence Divergence</u> (MACD)
    - The MACD **time series** is the difference between a short-term ("fast") exponential moving average of the asset and a long-term ("slow") exponential moving average of the asset;
        - the "average" **time series** is an exponential moving average of the MACD series;
        - the "divergence" **time series** is the difference between the MACD **time series** and the "average" **time series**. In general, it is the value used to make financial forecasts. The other two series (MACD and average) are normally computed only to create the "divergence" series;
- <u>Chaikin Money Flow</u> (CMF)
    - this indicator calculates the direction of spending trends:
        - calculate the money flow multiplier: ((Close - Low) - (High - Close)) / (High - Low);
        - calculate the money flow volume, which is the trading volume of the day multiplied by the money flow multiplier;
        - add up the money flow volume over a given period of days and divide it by the total trading volume over that same period. This indicator is an "oscillator" that varies between -1 and 1 and indicates "buying pressure" or "selling pressure", a measure of market direction;
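The three indicators above can be sketched with NumPy as follows. This is a minimal illustration, not a trading-grade implementation. The function names are hypothetical, the RSI uses simple averages over the lookback window, and the default spans of 12, 26, and 9 periods and the 14-period lookback are conventional values assumed here rather than taken from the text.

```python
import numpy as np


def rsi(prices, lookback=14):
    """RSI from simple averages of gains and losses over the last `lookback` changes."""
    changes = np.diff(np.asarray(prices, dtype=float))[-lookback:]
    avg_gain = np.clip(changes, 0, None).mean()
    avg_loss = np.clip(-changes, 0, None).mean()
    if avg_loss == 0:
        return 100.0  # no losses in the window, so RS is unbounded
    rs = avg_gain / avg_loss
    return float(100 - 100 / (1 + rs))


def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    values = np.asarray(values, dtype=float)
    alpha = 2.0 / (span + 1)
    out = np.empty_like(values)
    out[0] = values[0]
    for i in range(1, len(values)):
        out[i] = alpha * values[i] + (1 - alpha) * out[i - 1]
    return out


def macd(prices, fast=12, slow=26, signal=9):
    """Return the MACD series, its "average" series, and the "divergence" series."""
    macd_line = ema(prices, fast) - ema(prices, slow)
    average_line = ema(macd_line, signal)
    return macd_line, average_line, macd_line - average_line


def chaikin_money_flow(high, low, close, volume):
    """CMF over the whole period supplied (assumes high > low on every day)."""
    high, low, close, volume = (
        np.asarray(a, dtype=float) for a in (high, low, close, volume)
    )
    multiplier = ((close - low) - (high - close)) / (high - low)
    return float((multiplier * volume).sum() / volume.sum())


prices = [10, 11, 12, 11, 13]  # illustrative prices only
print(rsi(prices, lookback=4))
print(macd(prices)[2])
print(chaikin_money_flow([11, 12], [9, 10], [11, 12], [100, 200]))
```

Each function reduces a price history to one number or one derived series, which is exactly the kind of domain-specific feature that can then be passed to a model.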
        
        
#### Healthcare Time Series
Another field where **time series** features have domain-specific meanings and even names is healthcare, which offers a wide range of **time series** data. One example is electrocardiogram (EKG) data. Reading an electrocardiogram is both a science and an art, and doctors identify several features and use them to read the **time series**. If we were to select features for a machine learning study on EKG data, we would certainly start by studying these features and talking to an experienced clinician to understand their purpose and what they indicate.

Similarly, if we are analyzing high-resolution **time series** data on blood glucose, it helps to first understand the kinds of patterns that tend to drive the daily data.

## How to Select Features after Generating Them
Suppose we have automatically generated many features to describe a large **time series** data set. We may not be able to analyze all the proposed features in a first pass through the data, so it may be useful to complement the automatic *generation* of features with the automatic *selection* of features. A good selection algorithm is *FRESH*, implemented in the **tsfresh** package. *FRESH* stands for FeatuRe Extraction based on Scalable Hypothesis tests.

The *FRESH* algorithm is motivated by the growing amount of **time series** data available, which is often stored in a distributed manner that facilitates parallel computation. The algorithm evaluates the importance of each input feature with respect to a target variable by calculating a *p-value* for each feature. The per-feature *p-values* are then evaluated together using the *Benjamini-Yekutieli procedure*, which decides which features to keep based on an input parameter for the acceptable error rate. The *Benjamini-Yekutieli procedure* is a method of limiting the number of false positives arising from the hypothesis tests used to generate the p-values in the initial step.
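The final step can be sketched directly. The function below is a minimal NumPy version of the Benjamini-Yekutieli step-up rule applied to a vector of p-values. It is written for illustration and is not the **tsfresh** implementation; the function name is hypothetical and the parameter name `fdr_level` is an assumption chosen to match the package's terminology.

```python
import numpy as np


def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return a boolean mask marking the hypotheses (features) to keep.

    Sort the p-values, compare the k-th smallest against
    k * fdr_level / (m * c(m)) with c(m) = 1 + 1/2 + ... + 1/m,
    and keep everything up to the largest k that passes.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))
    thresholds = np.arange(1, m + 1) * fdr_level / (m * c_m)

    passed = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        last = np.max(np.nonzero(passed)[0])  # largest rank that passes
        keep[order[: last + 1]] = True
    return keep


# Illustrative p-values, one per feature
print(benjamini_yekutieli([0.001, 0.8, 0.004, 0.5]))
```

The harmonic-sum factor `c_m` is what distinguishes this rule from the plain Benjamini-Hochberg procedure: it makes the thresholds stricter so that the error-rate control holds even when the tests are dependent on one another.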

We will use the code from an example in the module documentation. First, we will download the **time series** data related to robot execution failures:

In [9]:
# installing libs
## ! pip install tsfresh
## ! pip install "dask[dataframe]"



In [30]:
# importing libs
import random
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from tsfresh.examples.robot_execution_failures import (
    download_robot_execution_failures,  # Function to download example dataset
    load_robot_execution_failures       # Function to load the dataset into variables
)

In [10]:
# Downloading the robot execution failure dataset
download_robot_execution_failures()

# Loading the timeseries data and the corresponding labels
timeseries, y = load_robot_execution_failures()

We then extract the features without specifying them, because the package calculates all of them by default. This goes against the advice given earlier, since we are being extremely inclusive without regard for computational resources. This test dataset does not have many data points, but we would not want to run the same call hastily on another dataset without first reducing it to a reasonable size:

In [12]:
# Importing the feature extraction function from tsfresh
from tsfresh import extract_features

# Extracting features from the timeseries dataset
# `column_id` specifies the identifier for different time series (e.g., which sensor or experiment)
# `column_sort` specifies the column containing the timestamps to sort the time series
extracted_features = extract_features(
    timeseries,
    column_id="id",      # The column that identifies different time-series entities
    column_sort="time"   # The column that contains time-related values for sorting
)

# Displaying the shape of the extracted features
print(f"Extracted Features Shape: {extracted_features.shape}")


Feature Extraction: 100%|███████████████████████| 20/20 [00:07<00:00,  2.81it/s]


Extracted Features Shape: (88, 4698)


Although **tsfresh** provides a way to specify which features you want to calculate, in this example we chose to include them all. For features that take parameters, you can also set those parameters manually instead of using the defaults. This is covered and exemplified in the documentation (https://perma.cc/D5RS-BJ6T). With a full extraction, as we did here on the example data provided by *tsfresh*, we can see that a large number of features are calculated:

In [16]:
# listing the names of all the extracted features
extracted_features.columns

Index(['F_x__variance_larger_than_standard_deviation',
       'F_x__has_duplicate_max', 'F_x__has_duplicate_min',
       'F_x__has_duplicate', 'F_x__sum_values', 'F_x__abs_energy',
       'F_x__mean_abs_change', 'F_x__mean_change',
       'F_x__mean_second_derivative_central', 'F_x__median',
       ...
       'T_z__fourier_entropy__bins_5', 'T_z__fourier_entropy__bins_10',
       'T_z__fourier_entropy__bins_100',
       'T_z__permutation_entropy__dimension_3__tau_1',
       'T_z__permutation_entropy__dimension_4__tau_1',
       'T_z__permutation_entropy__dimension_5__tau_1',
       'T_z__permutation_entropy__dimension_6__tau_1',
       'T_z__permutation_entropy__dimension_7__tau_1',
       'T_z__query_similarity_count__query_None__threshold_0.0',
       'T_z__mean_n_absolute_max__number_of_maxima_7'],
      dtype='object', length=4698)

There are 4,698 columns. This is far more features than we could calculate by hand, but computing them on a realistic data set is also quite time consuming. When deciding how and when to use such a large set of features, try to be realistic about your computational resources and your ability to analyze the results carefully. Remember that for **time series** data, *outliers* can unduly influence subsequent analyses, so we want to ensure that the chosen features are robust to outliers.

Although the *FRESH* algorithm helps to account for dependency between features, it is difficult to understand. We can also use a more traditional and transparent feature selection technique, *recursive feature elimination* (RFE). We can use RFE to complement the *FRESH* algorithm and improve our understanding of how different the features selected by the algorithm are from those it rejected:
- RFE is an incremental approach to feature selection in which features are gradually eliminated from a more inclusive model, producing progressively less inclusive models until only the minimum number of features, defined at the start of the selection procedure, remains. This technique is known as *backward selection*, as you start with the most inclusive model and move "backward" to a simpler one. In contrast, in *forward selection*, features are added incrementally until the specified maximum number of features, or some other stopping criterion, is reached.
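The backward-selection loop can be written out by hand to show what happens at each step. The sketch below is a simplified stand-in and not the **sklearn** `RFE` class: it assumes a plain least-squares linear model without an intercept, features on comparable scales, and elimination of the feature with the smallest absolute coefficient. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def backward_elimination_ranking(X, y):
    """Rank features by repeatedly dropping the one with the smallest |coefficient|.

    The feature dropped first gets the worst rank (n_features);
    the last one standing gets rank 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    remaining = list(range(X.shape[1]))
    ranking = np.zeros(X.shape[1], dtype=int)
    while remaining:
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))
        ranking[remaining[weakest]] = len(remaining)
        remaining.pop(weakest)
    return ranking


# Synthetic data: feature 0 matters most, feature 2 a little, feature 1 not at all
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 2]

print(backward_elimination_ranking(X, y))
```

The returned array follows the same convention as a ranking in which 1 marks the most important feature, so it can be read the same way as the output of the RFE call used below.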

We can use RFE to select features and rank them by importance. To test this, we combine ten features randomly selected from the list of features retained by the *FRESH* algorithm with ten features randomly selected from the list of features rejected by the *FRESH* algorithm:

In [24]:
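# Run the FRESH selection: impute NaN/inf values in place, then keep only the
# features whose hypothesis tests against the target y pass
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

impute(extracted_features)
features_filtered = select_features(extracted_features, y)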


# Step 1: Select 10 random indices from features_filtered
x_idx = random.sample(range(len(features_filtered.columns)), 10)
selX = features_filtered.iloc[:, x_idx].values  # Select values from those columns

# Step 2: Identify unselected features
unselected_features = list(set(extracted_features.columns).difference(set(features_filtered.columns)))

# Step 3: Randomly sample 10 features from the unselected set
unselected_features = random.sample(unselected_features, 10)

# Step 4: Find indices of these unselected features in the extracted_features DataFrame
unsel_x_idx = [idx for idx, val in enumerate(extracted_features.columns) if val in unselected_features]
unselX = extracted_features.iloc[:, unsel_x_idx].values  # Extract the corresponding values

# Step 5: Combine selX and unselX horizontally
mixed_X = np.hstack([selX, unselX])

With this set of 20 features, we can perform the RFE to get a sense of the ranked importance of these features in the dataset and in the model we use within the RFE:

In [44]:
# Example mixed_X and y (replace with your actual data)
# Ensure mixed_X is a NumPy array or pandas DataFrame
# Ensure y is a 1D array or pandas Series

# Handle invalid values in mixed_X
mixed_X = np.nan_to_num(mixed_X, nan=0.0, posinf=1e10, neginf=-1e10)  # Replace NaN and infinity
mixed_X = np.round(mixed_X, 4)  # Format to 4 decimal places

# Handle invalid values in y
if np.any(np.isnan(y)):
    raise ValueError("y contains NaN values. Please clean y before proceeding.")
y = y.astype(int)  # Ensure y is integer

# RFE with SVM
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(mixed_X, y)

# Ranking of features
print("Feature ranking:", rfe.ranking_)

Feature ranking: [ 4  3 10 13  5  8 11 15  1 14  9  7 18 17 16  2 19 20  6 12]


Here we can see the relative rankings of the twenty features we provided to the RFE algorithm. We expect the first ten, those selected by the *FRESH* algorithm, to rank higher than those it did not select. This happens most of the time, but there are exceptions. For example, in the second half of the array, which holds the rankings of the unselected features, the sixth feature ranks second among all twenty and the ninth ranks sixth. We cannot expect a perfect match, however, and the results are broadly consistent.

We can use RFE on the selected features as a selective elimination mode. We can also use it as a validation test, trying to adjust the parameters of the *FRESH* algorithm or the number of features we are generating as input for the *FRESH* algorithm.

Note that the *FRESH* algorithm has basically no parameters, so the number and quality of the features we input is the best way to influence its output. The one parameter we do define for the *FRESH* algorithm is *fdr_level*, the false discovery rate to respect, that is, the expected share of irrelevant features among those the algorithm keeps. This parameter defaults to 0.05, but we can set a lower value to make the feature filtering more selective, especially when generating a large number of features without taking into account whether they are suitable for our domain of interest.

We looked at the motivation for feature selection and at a simple example of how generating features can convert even a short **time series** into a more compact set of numbers that is almost as informative as the original. We also looked at examples of Python modules developed to automate feature generation and selection on **time series** data, which can easily generate thousands of features from a **time series**. As there is a risk that many of the features generated in this way will not be very useful, we also saw methods for selecting the most useful features to pass along our analytical pipeline, so that we do not carry noise or uninformative data downstream. Feature generation serves several purposes, such as:
- produce **time series** data in a format suited to downstream machine learning algorithms which, for the most part, are designed to accept a set of features per data point rather than a **time series**;
- summarize **time series** data in order to compress temporal observations into a collection of numbers and qualitative indicators. This can help not only with analysis, but also with storing data in a more succinct and readable format in cases where it is not necessary to keep the complete **time series**;
- provide a common set of metrics to describe and identify similarities between data that may have been measured under many different conditions. By summarizing our data more generally, we make them comparable where a direct comparison would otherwise be difficult.