# Feature Extraction Benchmarks
---

This walkthrough benchmarks `functime` feature extraction functions against their `tsfresh` counterparts. We first measure feature extraction speed on time series of four sizes: 10K, 100K, 1M, and 9M rows. We then measure speed in a group-by and aggregation context, comparing functime with polars against tsfresh with pandas.

In [15]:
%%capture
%pip install perfplot
%pip install pandas
%pip install tsfresh
%pip install functime

In [52]:
from typing import Callable

import pandas as pd
import perfplot
import polars as pl
from tsfresh.feature_extraction import feature_calculators as tsfresh
from functime import feature_extractors as fe

In [53]:
pl.Config.set_tbl_rows(100)
pl.Config.set_fmt_str_lengths(60)
pl.Config.set_tbl_hide_column_data_types(True)

polars.config.Config

## 1. Setup for the comparison
---
We use the M4 dataset, loaded once as a `pd.DataFrame` and once as a `pl.DataFrame`. We also define a list of tuples, each with the following structure:
<br>
(<br>
&emsp;  `<functime_function>`,<br>
&emsp;  `<tsfresh_function>`,<br>
&emsp;  `<functime_parameters>`,<br>
&emsp;   `<tsfresh_parameters>`<br>
)<br>
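
As a minimal sketch of how one such tuple is consumed, the example below unpacks an entry and calls both functions with their own keyword arguments. `feature_a` and `feature_b` are hypothetical stand-ins written for this illustration; they are not functime or tsfresh functions.

```python
# Hypothetical stand-ins for a functime function and a tsfresh function.
# Both count the values above a threshold, but name the parameter differently.
def feature_a(values, threshold):
    return sum(1 for v in values if v > threshold)


def feature_b(x, t):
    return len([v for v in x if v > t])


# Same layout as an entry of FUNC_PARAMS_BENCH:
# (<functime_function>, <tsfresh_function>, <functime_parameters>, <tsfresh_parameters>)
entry = (feature_a, feature_b, {"threshold": 0.0}, {"t": 0.0})

f_feat, ts_feat, f_params, ts_params = entry
data = [-1.0, 0.5, 2.0, -3.0]

# Each function receives its own parameter dictionary as keyword arguments.
result_f = f_feat(data, **f_params)
result_ts = ts_feat(data, **ts_params)
print(result_f, result_ts)
```

Keeping two parameter dictionaries per entry is what lets the benchmark call both libraries even when their argument names differ.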

In [54]:
_M4_DATASET = "../../data/m4_1d_train.parquet"

DF_PANDAS = (
    pd.melt(pd.read_parquet(_M4_DATASET))
    .drop(columns=["variable"])
    .dropna()
    .reset_index(drop=True)
)
DF_PL_EAGER = (
    pl.read_parquet(_M4_DATASET).drop("V1").melt().drop("variable").drop_nulls()
)
DF_PL_LAZY = DF_PL_EAGER.lazy()

In [49]:
FUNC_PARAMS_BENCH = [
    (fe.absolute_energy, tsfresh.abs_energy, {}, {}),
    (fe.absolute_maximum, tsfresh.absolute_maximum, {}, {}),
    (fe.absolute_sum_of_changes, tsfresh.absolute_sum_of_changes, {}, {}),
    (
        fe.approximate_entropy,
        tsfresh.approximate_entropy,
        {"run_length": 2, "filtering_level": 0.5},
        {"m": 2, "r": 0.5},
    ),
    # (fe.augmented_dickey_fuller, tsfresh.augmented_dickey_fuller, "param")
    (fe.autocorrelation, tsfresh.autocorrelation, {"n_lags": 4}, {"lag": 4}),
    (
        fe.autoregressive_coefficients,
        tsfresh.ar_coefficient,
        {"n_lags": 4},
        {"param": [{"coeff": i, "k": 4} for i in range(5)]},
    ),
    (fe.benford_correlation, tsfresh.benford_correlation, {}, {}),
    (fe.binned_entropy, tsfresh.binned_entropy, {"bin_count": 10}, {"max_bins": 10}),
    (fe.c3, tsfresh.c3, {"n_lags": 10}, {"lag": 10}),
    (
        fe.change_quantiles,
        tsfresh.change_quantiles,
        {"q_low": 0.1, "q_high": 0.9, "is_abs": True},
        {"ql": 0.1, "qh": 0.9, "isabs": True, "f_agg": "mean"},
    ),
    (fe.cid_ce, tsfresh.cid_ce, {"normalize": True}, {"normalize": True}),
    (fe.count_above, tsfresh.count_above, {"threshold": 0.0}, {"t": 0.0}),
    (fe.count_above_mean, tsfresh.count_above_mean, {}, {}),
    (fe.count_below, tsfresh.count_below, {"threshold": 0.0}, {"t": 0.0}),
    (fe.count_below_mean, tsfresh.count_below_mean, {}, {}),
    # (fe.cwt_coefficients, tsfresh.cwt_coefficients, {"widths": (1, 2, 3), "n_coefficients": 2},{"param": {"widths": (1, 2, 3), "coeff": 2, "w": 1}}),
    (
        fe.energy_ratios,
        tsfresh.energy_ratio_by_chunks,
        {"n_chunks": 6},
        {"param": [{"num_segments": 6, "segment_focus": i} for i in range(6)]},
    ),
    (fe.first_location_of_maximum, tsfresh.first_location_of_maximum, {}, {}),
    (fe.first_location_of_minimum, tsfresh.first_location_of_minimum, {}, {}),
    # (fe.fourier_entropy, tsfresh.fourier_entropy, {"n_bins": 10}, {"bins": 10}),
    # (fe.friedrich_coefficients, tsfresh.friedrich_coefficients, {"polynomial_order": 3, "n_quantiles": 30}, {"params": [{"m": 3, "r": 30}]}),
    (fe.has_duplicate, tsfresh.has_duplicate, {}, {}),
    (fe.has_duplicate_max, tsfresh.has_duplicate_max, {}, {}),
    (fe.has_duplicate_min, tsfresh.has_duplicate_min, {}, {}),
    (
        fe.index_mass_quantile,
        tsfresh.index_mass_quantile,
        {"q": 0.5},
        {"param": [{"q": 0.5}]},
    ),
    (
        fe.large_standard_deviation,
        tsfresh.large_standard_deviation,
        {"ratio": 0.25},
        {"r": 0.25},
    ),
    (fe.last_location_of_maximum, tsfresh.last_location_of_maximum, {}, {}),
    (fe.last_location_of_minimum, tsfresh.last_location_of_minimum, {}, {}),
    # (fe.lempel_ziv_complexity, tsfresh.lempel_ziv_complexity, {"n_bins": 5}, {"bins": 5}),
    (
        fe.linear_trend,
        tsfresh.linear_trend,
        {},
        {
            "param": [
                {"attr": "pvalue"},
                {"attr": "rvalue"},
                {"attr": "intercept"},
                {"attr": "slope"},
                {"attr": "stderr"},
            ]
        },
    ),
    (fe.longest_streak_above_mean, tsfresh.longest_strike_above_mean, {}, {}),
    (fe.longest_streak_below_mean, tsfresh.longest_strike_below_mean, {}, {}),
    (fe.mean_abs_change, tsfresh.mean_abs_change, {}, {}),
    (fe.mean_change, tsfresh.mean_change, {}, {}),
    (
        fe.mean_n_absolute_max,
        tsfresh.mean_n_absolute_max,
        {"n_maxima": 20},
        {"number_of_maxima": 20},
    ),
    (
        fe.mean_second_derivative_central,
        tsfresh.mean_second_derivative_central,
        {},
        {},
    ),
    (
        fe.number_crossings,
        tsfresh.number_crossing_m,
        {"crossing_value": 0.0},
        {"m": 0.0},
    ),
    (fe.number_cwt_peaks, tsfresh.number_cwt_peaks, {"max_width": 5}, {"n": 5}),
    (fe.number_peaks, tsfresh.number_peaks, {"support": 5}, {"n": 5}),
    # (fe.partial_autocorrelation, tsfresh.partial_autocorrelation, "param"),
    (
        fe.percent_reoccurring_values,
        tsfresh.percentage_of_reoccurring_values_to_all_values,
        {},
        {},
    ),
    (
        fe.percent_reoccurring_points,
        tsfresh.percentage_of_reoccurring_datapoints_to_all_datapoints,
        {},
        {},
    ),
    (
        fe.permutation_entropy,
        tsfresh.permutation_entropy,
        {"tau": 1, "n_dims": 3},
        {"tau": 1, "dimension": 3},
    ),
    (
        fe.range_count,
        tsfresh.range_count,
        {"lower": 0, "upper": 9, "closed": "none"},
        {"min": 0, "max": 9},
    ),
    (fe.ratio_beyond_r_sigma, tsfresh.ratio_beyond_r_sigma, {"ratio": 2}, {"r": 2}),
    (
        fe.ratio_n_unique_to_length,
        tsfresh.ratio_value_number_to_time_series_length,
        {},
        {},
    ),
    (fe.root_mean_square, tsfresh.root_mean_square, {}, {}),
    (fe.sample_entropy, tsfresh.sample_entropy, {}, {}),
    (
        fe.spkt_welch_density,
        tsfresh.spkt_welch_density,
        {"n_coeffs": 10},
        {"param": [{"coeff": i} for i in range(10)]},
    ),
    (fe.sum_reoccurring_points, tsfresh.sum_of_reoccurring_data_points, {}, {}),
    (fe.sum_reoccurring_values, tsfresh.sum_of_reoccurring_values, {}, {}),
    (
        fe.symmetry_looking,
        tsfresh.symmetry_looking,
        {"ratio": 0.25},
        {"param": [{"r": 0.25}]},
    ),
    (
        fe.time_reversal_asymmetry_statistic,
        tsfresh.time_reversal_asymmetry_statistic,
        {"n_lags": 3},
        {"lag": 3},
    ),
    (fe.variation_coefficient, tsfresh.variation_coefficient, {}, {}),
    (fe.var_gt_std, tsfresh.variance_larger_than_standard_deviation, {}, {}),
]

## 2. Benchmark core functions
---
Benchmark the core functions on time series of length 10_000, 100_000, 1_000_000 and 9_000_000. There are two exceptions: `approximate_entropy` runs only at 10_000, and `number_cwt_peaks` and `sample_entropy` run only at 10_000 and 100_000. `all_benchmarks()` iterates over the entries of `FUNC_PARAMS_BENCH` and invokes `benchmark()` for each one.

In [11]:
def benchmark(
    f_feat: Callable, ts_feat: Callable, f_params: dict, ts_params: dict, is_expr: bool
):
    if f_feat.__name__ == "approximate_entropy":
        n_range = [10_000]
    elif f_feat.__name__ in ("number_cwt_peaks", "sample_entropy"):
        n_range = [10_000, 100_000]
    else:
        n_range = [10_000, 100_000, 1_000_000, 9_000_000]
    benchmark = perfplot.bench(
        setup=lambda n: (DF_PL_EAGER.head(n), DF_PANDAS.head(n)),
        kernels=[
            lambda x, _y: f_feat(x["value"], **f_params)
            if not is_expr
            else x.select(f_feat(pl.col("value"), **f_params)),
            lambda _x, y: ts_feat(y["value"], **ts_params),
        ],
        n_range=n_range,
        equality_check=False,
        labels=["functime", "tsfresh"],
    )
    return benchmark

In [19]:
def all_benchmarks(params: list[tuple], is_expr: bool) -> pl.DataFrame:
    bench_df = pl.DataFrame(
        schema={
            "Feature name": pl.Utf8,
            "n": pl.Int64,
            "functime (ms)": pl.Float64,
            "tfresh (ms)": pl.Float64,
            "diff (ms)": pl.Float64,
            "diff %": pl.Float64,
            "speedup": pl.Float64,
        }
    )
    for x in params:
        try:
            f_feat = x[0]
            print(f"Feature: {f_feat.__name__}")
            bench = benchmark(
                f_feat=f_feat,
                ts_feat=x[1],
                f_params=x[2],
                ts_params=x[3],
                is_expr=is_expr,
            )
            bench_df = pl.concat(
                [
                    pl.DataFrame(
                        {
                            "Feature name": [x[0].__name__] * len(bench.n_range),
                            "n": bench.n_range,
                            "functime (ms)": bench.timings_s[0] * 1_000,
                            "tfresh (ms)": bench.timings_s[1] * 1_000,
                            "diff (ms)": (bench.timings_s[0] - bench.timings_s[1])
                            * 1_000,
                            "diff %": 100
                            * (bench.timings_s[0] - bench.timings_s[1])
                            / bench.timings_s[1],
                            "speedup": bench.timings_s[1] / bench.timings_s[0],
                        }
                    ),
                    bench_df,
                ]
            )
        except ValueError:
            print(f"Failed to compute feature {x[0].__name__}")
        except ImportError:
            print(f"Failed to import feature {x[0].__name__}")
        except TypeError:
            print(f"Feature {x[0].__name__} not implemented for pl.Expr")
        except AttributeError:
            print(f"Incompatible functions have been called on pl.Expr for feature {x[0].__name__}")
    return bench_df
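
The loop described in this section can be sketched without `perfplot`, using only the standard library's `timeit`. This is a simplified illustration under stated assumptions: `time_ms`, `run_all`, and the two `abs_energy_*` functions are hypothetical names introduced here, and plain Python lists replace the polars and pandas inputs.

```python
import timeit


def time_ms(func, *args, repeat=3, number=10, **kwargs):
    """Best-of-`repeat` wall-clock time of a single call, in milliseconds."""
    timer = timeit.Timer(lambda: func(*args, **kwargs))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1_000


# Hypothetical stand-ins for one functime / tsfresh pair of features.
def abs_energy_a(values):
    return sum(v * v for v in values)


def abs_energy_b(x):
    total = 0.0
    for v in x:
        total += v**2
    return total


PARAMS = [(abs_energy_a, abs_energy_b, {}, {})]


def run_all(params, n_range):
    """Time both functions of every entry at every input size."""
    rows = []
    for first, second, first_kwargs, second_kwargs in params:
        for n in n_range:
            data = [float(i) for i in range(n)]
            rows.append(
                {
                    "feature": first.__name__,
                    "n": n,
                    "first (ms)": time_ms(first, data, **first_kwargs),
                    "second (ms)": time_ms(second, data, **second_kwargs),
                }
            )
    return rows


results = run_all(PARAMS, n_range=[100, 1_000])
for row in results:
    print(row)
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; it is a choice made for this sketch and may differ from how `perfplot` aggregates its timings.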

## 3. Run benchmarks
---

In [13]:
# Code to prettify benchmark results
def table_prettifier(df: pl.DataFrame, n: int):
    table = (
        df.filter(pl.col("n") == n)
        .drop("n")
        .sort("speedup", descending=True)
        .with_columns(
            pl.when(pl.exclude("Feature name").abs() < 0.1)
            .then(pl.exclude("Feature name").round(4))
            .when(pl.exclude("Feature name").abs() < 1)
            .then(pl.exclude("Feature name").round(2))
            .when(pl.exclude("Feature name").abs() < 30)
            .then(pl.exclude("Feature name").round(1))
            .otherwise(pl.exclude("Feature name").round(1))
        )
        .with_columns(speedup="x " + pl.col("speedup").cast(pl.Utf8))
    )
    return table
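
The rounding rule used by `table_prettifier` can be restated in plain Python. In this sketch, `round_by_magnitude` is a hypothetical helper name; since the last two branches of the polars expression both round to one decimal, they are merged into a single fallback here.

```python
def round_by_magnitude(value):
    """Keep more decimals for small values, fewer for large ones."""
    if abs(value) < 0.1:
        return round(value, 4)
    if abs(value) < 1:
        return round(value, 2)
    return round(value, 1)


for raw in (0.07523, 0.4512, 12.92):
    print(raw, "->", round_by_magnitude(raw))
```

This is why the result tables show values such as `0.0752` next to values such as `13.9`.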

In [20]:
%%capture
bench_expr = all_benchmarks(params = FUNC_PARAMS_BENCH, is_expr = True)
bench_series = all_benchmarks(params = FUNC_PARAMS_BENCH, is_expr = False)

# pl.Expr benchmarks
df_expr_10k = table_prettifier(bench_expr, n=10_000)
df_expr_100k = table_prettifier(bench_expr, n=100_000)
df_expr_1m = table_prettifier(bench_expr, n=1_000_000)
df_expr_9m = table_prettifier(bench_expr, n=9_000_000)

# pl.Series benchmarks
df_series_10k = table_prettifier(bench_series, n=10_000)
df_series_100k = table_prettifier(bench_series, n=100_000)
df_series_1m = table_prettifier(bench_series, n=1_000_000)
df_series_9m = table_prettifier(bench_series, n=9_000_000)

Feature: absolute_energy


Feature: absolute_maximum


Feature: absolute_sum_of_changes


INFO:functime.feature_extractors:Expression version of approximate_entropy is not yet implemented due to technical difficulty regarding Polars Expression Plugins.


Feature: approximate_entropy


Feature approximate_entropy not implemented for pl.Expr
Feature: autocorrelation


INFO:functime.feature_extractors:Expression version of autoregressive_coefficients is not yet implemented due to technical difficulty regarding Polars Expression Plugins.


Feature: autoregressive_coefficients


Feature autoregressive_coefficients not implemented for pl.Expr
Feature: benford_correlation2


Feature: benford_correlation


Feature: binned_entropy


Feature: c3


Feature: change_quantiles


Feature: cid_ce


Feature: count_above


Feature: count_above_mean


Feature: count_below


Feature: count_below_mean


Feature: energy_ratios


Feature: first_location_of_maximum


Feature: first_location_of_minimum


Feature: has_duplicate


Feature: has_duplicate_max


Feature: has_duplicate_min


Feature: index_mass_quantile


Feature: large_standard_deviation


Feature: last_location_of_maximum


Feature: last_location_of_minimum


Feature: linear_trend


Feature: longest_streak_above_mean


Feature: longest_streak_below_mean


Feature: mean_abs_change


Feature: mean_change


Feature: mean_n_absolute_max


Feature: mean_second_derivative_central


Feature: number_crossings


Feature: number_cwt_peaks


Incompatible functions have been called on pl.Expr for feature number_cwt_peaks
Feature: number_peaks


Feature: percent_reoccurring_values


Feature: percent_reoccurring_points


Feature: permutation_entropy


Feature: range_count


Feature: ratio_beyond_r_sigma


Feature: ratio_n_unique_to_length


Feature: root_mean_square


INFO:functime.feature_extractors:Expression version of sample_entropy is not yet implemented due to technical difficulty regarding Polars Expression Plugins.


Feature: sample_entropy


INFO:functime.feature_extractors:Expression version of spkt_welch_density is not yet implemented due to technical difficulty regarding Polars Expression Plugins.


Feature sample_entropy not implemented for pl.Expr
Feature: spkt_welch_density


Feature spkt_welch_density not implemented for pl.Expr
Feature: sum_reoccurring_points


Feature: sum_reoccurring_values


Feature: symmetry_looking


Feature: time_reversal_asymmetry_statistic


Feature: variation_coefficient


Feature: var_gt_std


Feature: absolute_energy


Feature: absolute_maximum


Feature: absolute_sum_of_changes


Feature: approximate_entropy


Feature: autocorrelation


Feature: autoregressive_coefficients


Feature: benford_correlation2


Feature: benford_correlation


Feature: binned_entropy


Feature: c3


Feature: change_quantiles


Feature: cid_ce


Feature: count_above


Feature: count_above_mean


Feature: count_below


Feature: count_below_mean


Feature: energy_ratios


Feature: first_location_of_maximum


Feature: first_location_of_minimum


Feature: has_duplicate


Feature: has_duplicate_max


Feature: has_duplicate_min


Feature: index_mass_quantile


Feature: large_standard_deviation


Feature: last_location_of_maximum


Feature: last_location_of_minimum


Feature: linear_trend


Feature: longest_streak_above_mean


Feature: longest_streak_below_mean


Feature: mean_abs_change


Feature: mean_change


Feature: mean_n_absolute_max


Feature: mean_second_derivative_central


Feature: number_crossings


Feature: number_cwt_peaks


Feature: number_peaks


Feature: percent_reoccurring_values


Feature: percent_reoccurring_points


Feature: permutation_entropy


Feature: range_count


Feature: ratio_beyond_r_sigma


Feature: ratio_n_unique_to_length


Feature: root_mean_square


Feature: sample_entropy


Feature: spkt_welch_density


Feature: sum_reoccurring_points


Feature: sum_reoccurring_values


Feature: symmetry_looking


Feature: time_reversal_asymmetry_statistic


Feature: variation_coefficient


Feature: var_gt_std


## 4. Benchmark results
---

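We display 8 tables, each showing the 10 features with the largest speedup:
- For `pl.Series`: 10k, 100k, 1M and 9M rows
- For `pl.Expr`: 10k, 100k, 1M and 9M rows

Each table contains the execution time (ms) for functime and tsfresh, the difference (functime minus tsfresh, so negative values mean functime is faster), the difference as a percentage of the tsfresh time, and the speedup (tsfresh time divided by functime time).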
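
As a small worked example, the three derived columns can be computed from a pair of timings with the same formulas as in `all_benchmarks()`. The helper name `compare` and the input timings are made up for illustration.

```python
def compare(functime_ms, tsfresh_ms):
    """Derive the comparison columns from two timings in milliseconds."""
    diff = functime_ms - tsfresh_ms
    return {
        "diff (ms)": diff,  # negative when functime is faster
        "diff %": 100 * diff / tsfresh_ms,  # relative to the tsfresh time
        "speedup": tsfresh_ms / functime_ms,  # above 1 when functime is faster
    }


row = compare(functime_ms=2.0, tsfresh_ms=10.0)
print(row)
```

With these inputs functime takes 8 ms less, which is 80% less than tsfresh, for a speedup of 5.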

### 4.1 Results for `pl.Expr`

#### 10k expr

In [21]:
df_expr_10k

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| benford_correlation2 | 0.98 | 13.9 | -12.9 | -93.0 | x 14.2 |
| benford_correlation | 1.4 | 13.8 | -12.4 | -89.9 | x 9.9 |
| energy_ratios | 0.45 | 3.4 | -3.0 | -86.9 | x 7.6 |
| mean_n_absolute_max | 0.13 | 0.65 | -0.52 | -79.6 | x 4.9 |
| longest_streak_below_mean | 0.45 | 1.7 | -1.3 | -73.7 | x 3.8 |
| range_count | 0.0752 | 0.25 | -0.18 | -70.4 | x 3.4 |
| change_quantiles | 0.45 | 1.3 | -0.85 | -65.3 | x 2.9 |
| longest_streak_above_mean | 0.61 | 1.7 | -1.1 | -63.9 | x 2.8 |
| number_peaks | 0.64 | 1.7 | -1.1 | -62.3 | x 2.7 |
| ratio_beyond_r_sigma | 0.17 | 0.42 | -0.25 | -59.3 | x 2.5 |


#### 100k expr

In [22]:
df_expr_100k

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| benford_correlation2 | 6.5 | 150.0 | -143.5 | -95.6 | x 22.9 |
| benford_correlation | 10.6 | 153.6 | -143.0 | -93.1 | x 14.5 |
| mean_n_absolute_max | 0.56 | 7.3 | -6.8 | -92.4 | x 13.2 |
| longest_streak_below_mean | 2.4 | 16.7 | -14.3 | -85.6 | x 6.9 |
| longest_streak_above_mean | 2.4 | 16.3 | -13.8 | -85.0 | x 6.7 |
| energy_ratios | 1.0 | 4.9 | -3.9 | -78.9 | x 4.7 |
| ratio_n_unique_to_length | 2.8 | 7.7 | -5.0 | -64.3 | x 2.8 |
| change_quantiles | 2.2 | 5.6 | -3.4 | -60.9 | x 2.6 |
| absolute_maximum | 0.15 | 0.36 | -0.21 | -57.6 | x 2.4 |
| count_above_mean | 0.19 | 0.42 | -0.23 | -55.4 | x 2.2 |


#### 1M expr

In [23]:
df_expr_1m

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| benford_correlation2 | 76.9 | 1552.6 | -1475.6 | -95.0 | x 20.2 |
| benford_correlation | 116.6 | 1848.4 | -1731.8 | -93.7 | x 15.8 |
| mean_n_absolute_max | 7.6 | 93.6 | -86.0 | -91.9 | x 12.3 |
| energy_ratios | 8.3 | 74.8 | -66.5 | -88.9 | x 9.0 |
| longest_streak_below_mean | 24.0 | 169.2 | -145.1 | -85.8 | x 7.0 |
| longest_streak_above_mean | 24.2 | 168.0 | -143.7 | -85.6 | x 6.9 |
| absolute_maximum | 0.88 | 5.0 | -4.1 | -82.4 | x 5.7 |
| count_below_mean | 1.3 | 4.7 | -3.4 | -71.8 | x 3.5 |
| has_duplicate_min | 1.3 | 4.1 | -2.8 | -67.7 | x 3.1 |
| has_duplicate_max | 1.3 | 4.0 | -2.7 | -66.6 | x 3.0 |


#### 9M expr

In [24]:
df_expr_9m

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| benford_correlation2 | 873.9 | 15083.4 | -14209.5 | -94.2 | x 17.3 |
| mean_n_absolute_max | 62.9 | 946.4 | -883.5 | -93.4 | x 15.1 |
| benford_correlation | 1183.0 | 14642.3 | -13459.3 | -91.9 | x 12.4 |
| absolute_maximum | 6.4 | 46.4 | -40.0 | -86.1 | x 7.2 |
| longest_streak_below_mean | 218.4 | 1560.5 | -1342.1 | -86.0 | x 7.1 |
| longest_streak_above_mean | 218.9 | 1558.3 | -1339.4 | -86.0 | x 7.1 |
| large_standard_deviation | 35.3 | 224.1 | -188.8 | -84.2 | x 6.3 |
| change_quantiles | 259.5 | 1201.5 | -942.0 | -78.4 | x 4.6 |
| energy_ratios | 117.3 | 511.6 | -394.3 | -77.1 | x 4.4 |
| count_below_mean | 10.4 | 41.9 | -31.5 | -75.1 | x 4.0 |


### 4.2 Results for `pl.Series`

#### 10k series

In [25]:
df_series_10k

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| approximate_entropy | 180.0 | 43057.4 | -42877.4 | -99.6 | x 239.2 |
| sample_entropy | 153.6 | 12344.4 | -12190.7 | -98.8 | x 80.3 |
| benford_correlation2 | 0.42 | 14.0 | -13.6 | -97.0 | x 33.0 |
| energy_ratios | 0.3 | 3.3 | -3.0 | -91.0 | x 11.1 |
| count_above_mean | 0.0184 | 0.16 | -0.15 | -88.8 | x 8.9 |
| count_below_mean | 0.0183 | 0.16 | -0.14 | -88.7 | x 8.9 |
| has_duplicate_min | 0.0206 | 0.17 | -0.15 | -88.2 | x 8.5 |
| has_duplicate_max | 0.0207 | 0.17 | -0.15 | -88.2 | x 8.4 |
| benford_correlation | 1.8 | 14.7 | -13.0 | -87.9 | x 8.3 |
| count_below | 0.0169 | 0.12 | -0.1 | -86.1 | x 7.2 |


#### 100k series

In [26]:
df_series_100k

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| sample_entropy | 5912.4 | 1332607.1 | -1326700.0 | -99.6 | x 225.4 |
| benford_correlation2 | 1.7 | 151.2 | -149.5 | -98.9 | x 89.5 |
| benford_correlation | 11.6 | 150.2 | -138.6 | -92.3 | x 13.0 |
| mean_n_absolute_max | 0.6 | 7.4 | -6.8 | -91.9 | x 12.3 |
| energy_ratios | 0.6 | 4.9 | -4.3 | -87.7 | x 8.1 |
| autoregressive_coefficients | 9.5 | 69.1 | -59.6 | -86.3 | x 7.3 |
| longest_streak_above_mean | 2.4 | 15.8 | -13.5 | -85.0 | x 6.7 |
| longest_streak_below_mean | 2.4 | 15.6 | -13.2 | -84.7 | x 6.6 |
| has_duplicate_max | 0.081 | 0.49 | -0.41 | -83.4 | x 6.0 |
| has_duplicate_min | 0.0808 | 0.49 | -0.41 | -83.4 | x 6.0 |


#### 1M series

In [27]:
df_series_1m

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| benford_correlation2 | 18.4 | 1546.3 | -1527.8 | -98.8 | x 83.8 |
| benford_correlation | 119.5 | 1558.7 | -1439.2 | -92.3 | x 13.0 |
| mean_n_absolute_max | 7.7 | 91.8 | -84.2 | -91.6 | x 12.0 |
| autoregressive_coefficients | 88.2 | 843.5 | -755.3 | -89.5 | x 9.6 |
| energy_ratios | 4.5 | 38.3 | -33.8 | -88.3 | x 8.5 |
| root_mean_square | 0.55 | 4.5 | -3.9 | -87.6 | x 8.1 |
| longest_streak_above_mean | 23.0 | 168.9 | -145.9 | -86.4 | x 7.3 |
| longest_streak_below_mean | 22.9 | 168.0 | -145.1 | -86.4 | x 7.3 |
| absolute_maximum | 1.1 | 5.1 | -4.0 | -78.8 | x 4.7 |
| linear_trend | 23.2 | 108.4 | -85.3 | -78.6 | x 4.7 |


#### 9M series

In [28]:
df_series_9m

| Feature name | functime (ms) | tfresh (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| benford_correlation2 | 164.0 | 14326.9 | -14162.9 | -98.9 | x 87.3 |
| mean_n_absolute_max | 51.3 | 920.6 | -869.2 | -94.4 | x 17.9 |
| benford_correlation | 1422.8 | 16781.7 | -15358.9 | -91.5 | x 11.8 |
| root_mean_square | 3.8 | 39.9 | -36.1 | -90.4 | x 10.4 |
| longest_streak_below_mean | 198.9 | 1901.4 | -1702.4 | -89.5 | x 9.6 |
| energy_ratios | 36.1 | 311.5 | -275.4 | -88.4 | x 8.6 |
| longest_streak_above_mean | 201.7 | 1512.2 | -1310.6 | -86.7 | x 7.5 |
| linear_trend | 207.6 | 1110.2 | -902.6 | -81.3 | x 5.3 |
| absolute_maximum | 10.4 | 47.3 | -36.9 | -78.0 | x 4.5 |
| count_below_mean | 9.8 | 42.0 | -32.2 | -76.6 | x 4.3 |


## 5. Benchmark `Group by / Aggregation` context

Benchmark combining functime's feature extraction and polars' `Group by / Aggregation` context.

In [42]:
_SP500_DATASET = "../../data/sp500.parquet"

SP500_PANDAS = pd.read_parquet(_SP500_DATASET)
SP500_PL_EAGER = pl.read_parquet(_SP500_DATASET)

In [43]:
SP500_PANDAS

| Unnamed: 0 | ticker | time | price |
|---|---|---|---|
| 0 | A | 2022-06-01 | 122.278214 |
| 1 | A | 2022-06-02 | 128.248581 |
| 2 | A | 2022-06-03 | 127.642609 |
| 3 | A | 2022-06-06 | 126.788277 |
| 4 | A | 2022-06-07 | 128.049881 |
| ... | ... | ... | ... |
| 126248 | ZTS | 2023-05-24 | 169.139999 |
| 126249 | ZTS | 2023-05-25 | 165.240005 |
| 126250 | ZTS | 2023-05-26 | 164.740005 |
| 126251 | ZTS | 2023-05-30 | 160.940002 |


We want to compare `tsfresh` combined with a pandas `groupby` against `functime` combined with a polars `group_by`, for example:

In [44]:
%%timeit
SP500_PANDAS.groupby(
    by = "ticker"
)["price"].agg(
    tsfresh.number_peaks,
    n = 5
)

1.05 s ± 245 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [45]:
%%timeit
SP500_PL_EAGER.group_by(
    pl.col("ticker")
).agg(
    pl.col("price").ts.number_peaks(support = 5)
)

65.5 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In the single-series benchmarks of section 4.1, `number_peaks` is approximately **2.5** times faster with `functime` than with `tsfresh`.

In the group-by context, the two timings above (1.05 s versus 65.5 ms) correspond to a speedup of roughly **16** times, although both measurements show sizeable run-to-run variance.
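
To make the group-by comparison concrete, here is a minimal pure-Python sketch of what both calls compute: split the rows by ticker, then apply one feature to each group's prices. `group_apply` is a hypothetical helper, and `number_peaks` is a simplified stand-in written for this illustration (a value counts as a peak when it is strictly greater than its `support` neighbours on each side); neither is the tsfresh or functime implementation.

```python
from collections import defaultdict


def number_peaks(values, support):
    """Count values strictly greater than their `support` neighbours on both sides."""
    count = 0
    for i in range(support, len(values) - support):
        left = values[i - support : i]
        right = values[i + 1 : i + 1 + support]
        if all(values[i] > v for v in left) and all(values[i] > v for v in right):
            count += 1
    return count


def group_apply(rows, feature, **params):
    """Group (key, value) rows by key, then apply `feature` to each group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: feature(vals, **params) for key, vals in groups.items()}


# Toy (ticker, price) rows; made-up values, not taken from the S&P 500 dataset.
rows = [
    ("A", 1.0), ("A", 3.0), ("A", 2.0), ("A", 4.0), ("A", 1.0),
    ("B", 5.0), ("B", 4.0), ("B", 6.0), ("B", 1.0),
]

peaks = group_apply(rows, number_peaks, support=1)
print(peaks)
```

In the pandas version the feature runs as a Python call once per group, whereas the polars version passes an expression to `agg`; the sketch mirrors only the result, not either execution model.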

In [46]:
def benchmark_groupby_context(
    f_feat: Callable, ts_feat: Callable, f_params: dict, ts_params: dict
):
    benchmark = perfplot.bench(
        setup=lambda _n: (SP500_PL_EAGER, SP500_PANDAS),
        kernels=[
            lambda x, _y: x.group_by(pl.col("ticker")).agg(
                f_feat(pl.col("price"), **f_params)
            ),  # functime + polars groupby
            lambda _x, y: y.groupby("ticker")["price"].agg(
                ts_feat, **ts_params
            ),  # tsfresh + pandas groupby
        ],
        n_range=[1],
        equality_check=False,
        labels=["functime", "tsfresh"],
    )
    return benchmark

In [47]:
def all_benchmarks_groupby(params: list[tuple]) -> pl.DataFrame:
    bench_df = pl.DataFrame(
        schema={
            "Feature name": pl.Utf8,
            "n": pl.Int64,
            "functime + pl groupby (ms)": pl.Float64,
            "tfresh + pd groupby (ms)": pl.Float64,
            "diff (ms)": pl.Float64,
            "diff %": pl.Float64,
            "speedup": pl.Float64,
        }
    )
    for x in params:
        try:
            print(f"Feature: {x[0].__name__}")
            bench = benchmark_groupby_context(
                f_feat=x[0], ts_feat=x[1], f_params=x[2], ts_params=x[3]
            )
            bench_df = pl.concat(
                [
                    pl.DataFrame(
                        {
                            "Feature name": [x[0].__name__] * len(bench.n_range),
                            "n": bench.n_range,
                            "functime + pl groupby (ms)": bench.timings_s[0] * 1_000,
                            "tfresh + pd groupby (ms)": bench.timings_s[1] * 1_000,
                            "diff (ms)": (bench.timings_s[0] - bench.timings_s[1])
                            * 1_000,
                            "diff %": 100
                            * (bench.timings_s[0] - bench.timings_s[1])
                            / bench.timings_s[1],
                            "speedup": bench.timings_s[1] / bench.timings_s[0],
                        }
                    ),
                    bench_df,
                ]
            )
        except ValueError:
            print(f"Failed to compute feature {x[0].__name__}")
        except ImportError:
            print(f"Failed to import feature {x[0].__name__}")
        except TypeError:
            print(f"Feature {x[0].__name__} not implemented for pl.Expr")
        except AttributeError:
            print(f"Incompatible functions have been called on pl.Expr for feature {x[0].__name__}")
    return bench_df

In [50]:
%%capture
bench_groupby = all_benchmarks_groupby(params=FUNC_PARAMS_BENCH)
df_groupby = table_prettifier(df=bench_groupby, n=1)

Feature: absolute_energy


Feature: absolute_maximum


Feature: absolute_sum_of_changes


INFO:functime.feature_extractors:Expression version of approximate_entropy is not yet implemented due to technical difficulty regarding Polars Expression Plugins.


Feature: approximate_entropy


Feature approximate_entropy not implemented for pl.Expr
Feature: autocorrelation


INFO:functime.feature_extractors:Expression version of autoregressive_coefficients is not yet implemented due to technical difficulty regarding Polars Expression Plugins.


Feature: autoregressive_coefficients


Feature autoregressive_coefficients not implemented for pl.Expr
Feature: benford_correlation


Feature: binned_entropy


Feature: c3


Feature: change_quantiles


Feature: cid_ce


Feature: count_above


Feature: count_above_mean


Feature: count_below


Feature: count_below_mean


Feature: energy_ratios


Feature: first_location_of_maximum


Feature: first_location_of_minimum


Feature: has_duplicate


Feature: has_duplicate_max


Feature: has_duplicate_min


Feature: index_mass_quantile


Feature: large_standard_deviation


Feature: last_location_of_maximum


Feature: last_location_of_minimum


Feature: linear_trend


Feature: longest_streak_above_mean


Feature: longest_streak_below_mean


Feature: mean_abs_change


Feature: mean_change


Feature: mean_n_absolute_max


Feature: mean_second_derivative_central


Feature: number_crossings


Feature: number_cwt_peaks


Incompatible functions have been called on pl.Expr for feature number_cwt_peaks
Feature: number_peaks


Feature: percent_reoccurring_values


Feature: percent_reoccurring_points


Feature: permutation_entropy


Feature: range_count


Feature: ratio_beyond_r_sigma


Feature: ratio_n_unique_to_length


Feature: root_mean_square


INFO:functime.feature_extractors:Expression version of sample_entropy is not yet implemented due to technical difficulty regarding Polars Expression Plugins.


Feature: sample_entropy


INFO:functime.feature_extractors:Expression version of spkt_welch_density is not yet implemented due to technical difficulty regarding Polars Expression Plugins.


Feature sample_entropy not implemented for pl.Expr
Feature: spkt_welch_density


Feature spkt_welch_density not implemented for pl.Expr
Feature: sum_reoccurring_points


Feature: sum_reoccurring_values


Feature: symmetry_looking


Feature: time_reversal_asymmetry_statistic


Feature: variation_coefficient


Feature: var_gt_std


#### S&P500 groupby

In [51]:
df_groupby

| Feature name | functime + pl groupby (ms) | tfresh + pd groupby (ms) | diff (ms) | diff % | speedup |
|---|---|---|---|---|---|
| energy_ratios | 8.8 | 2475.0 | -2466.2 | -99.6 | x 279.8 |
| range_count | 2.7 | 167.3 | -164.6 | -98.4 | x 61.2 |
| symmetry_looking | 3.1 | 170.7 | -167.6 | -98.2 | x 55.9 |
| ratio_beyond_r_sigma | 5.9 | 230.3 | -224.3 | -97.4 | x 38.8 |
| root_mean_square | 3.2 | 119.3 | -116.1 | -97.3 | x 37.5 |
| count_below | 2.7 | 89.6 | -86.9 | -97.0 | x 33.6 |
| percent_reoccurring_points | 9.8 | 329.7 | -319.9 | -97.0 | x 33.5 |
| count_above | 2.7 | 86.6 | -83.9 | -96.9 | x 32.0 |
| change_quantiles | 21.7 | 579.8 | -558.1 | -96.3 | x 26.8 |
| variation_coefficient | 2.6 | 65.3 | -62.7 | -96.0 | x 25.1 |
