In [1]:
# We first use tsfresh/TSFEL to generate some features and get a feel for what we are dealing with:
#   1. tsfresh takes a long-format table of n rows of (id, time, feature columns) plus one label per unique id as input.
#   2. tsfresh does not test or preprocess the input time series, so make them stationary yourself first.
#   3. Each id is one independent time series, long or short.
#   4. Each id has exactly 1 label, which seems at odds with the idea that a time series should carry a label at every time stamp.
#   5. However, at each time stamp we have short-memory and long-memory features. Short-memory features are equivalent to sliding a short window over the series and emitting one scalar label per window, which is exactly how tsfresh/TSFEL work.
#   6. So it may be best practice to use these tools for short-memory features and to handcraft the long-memory or more hidden features.
#   7. These tools cannot capture cross-sectional features between ids. Generating features for 1 id at a time gives exactly the same result; passing many ids at once is just more efficient because the work is parallelized.
#   8. Do features that work on a long time series work equally well, on average, on the many short rolling windows it is split into? Not necessarily.
#   9. So how do you evaluate whether features generated this way work consistently over time? Either average their statistical performance across windows or train a model (then compare model weights).
#   10. Try short -> long windows when ranking feature importance (e.g. FFT does not work well on short windows).
#   11. To evaluate the effect of a feature you do not need too many samples, just enough to make sure the feature is stationary and preferably normally distributed.
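
Notes 3 to 5 can be illustrated without tsfresh itself. The following is a minimal sketch, assuming a toy series and a hypothetical helper `make_windows`, of how one long series becomes many short ids with one scalar label each. The column names `id`, `time`, `value` and `label` are illustrative choices, not something tsfresh requires.

```python
import numpy as np
import pandas as pd

def make_windows(df, window, step):
    """Split one long series into overlapping windows in long format.

    Returns (long_df, labels): long_df has columns id/time/value, and
    labels holds one scalar per id, taken from the bar right after the window.
    """
    segments, labels = [], {}
    for idx, start in enumerate(range(0, len(df) - window, step)):
        end = start + window  # exclusive end of the window
        seg = df.iloc[start:end][["time", "value"]].copy()
        seg.insert(0, "id", idx)  # every window is its own independent "series"
        segments.append(seg)
        labels[idx] = df["label"].iloc[end]
    return pd.concat(segments, ignore_index=True), pd.Series(labels, name="label")

# Toy data: 10 bars, label is just 10x the bar number so it is easy to trace.
toy = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=10, freq="min"),
    "value": np.arange(10.0),
    "label": np.arange(10) * 10,
})
long_df, labels = make_windows(toy, window=4, step=2)
print(long_df.groupby("id")["value"].agg(["first", "last", "size"]))
print(labels)
```

With `step` smaller than `window` the windows overlap, so neighbouring ids share most of their bars and are not independent samples, which is worth keeping in mind when reading significance tests on the resulting features.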

In [None]:
# | **Method**                                       | **Approach Type**           | **Pattern Type Learned**                 | **Interpretable?** | **Best Use Case**                                  | **Tools / Libraries**                |
# | ------------------------------------------------ | --------------------------- | ---------------------------------------- | ------------------ | -------------------------------------------------- | ------------------------------------ |
# | **Shapelet Transform**                           | Distance-based (Supervised) | Local subsequence "shapes"               | ✅ High            | Finding interpretable, discriminative patterns     | `tslearn`, `sktime`, `pyts`          |
# | **Dynamic Time Warping + Supervised Clustering** | Similarity + Aggregation    | Whole series similarity (flexible time)  | ✅ Medium          | Pattern grouping + average label scoring           | `tslearn`, `dtaidistance`, `HDBSCAN` |
# | **Bag-of-SFA Symbols (BOSS)**                    | Symbolic / Frequency        | Frequency of symbolic subsequences       | ✅ High            | Symbolic pattern detection (e.g., zigzags)         | `sktime`, `pyts`                     |
# | **TDE (Temporal Dictionary Ensemble)**           | Dictionary Learning         | Frequency/strength of learned shapes     | ✅ High            | Detecting repeated motifs with outcome correlation | `sktime`                             |
# | **Time Series Forest (TSF)**                     | Tree Ensemble               | Random intervals + summaries             | ⚠️ Partial         | Strong classification baseline                     | `sktime`, `pyts`                     |
# | **HIVE-COTE 2.0**                                | Ensemble (Hybrid)           | Multiple feature types                   | ⚠️ Partial         | State-of-the-art accuracy on many tasks            | `sktime`                             |
# | **ROCKET / MiniROCKET / MultiROCKET**            | Random Kernels              | Statistical response to many filters     | ❌ No              | Fast, accurate classification/regression           | `sktime`, `rocket-boost`             |
# | **1D CNNs**                                      | Deep Learning (CNN)         | Localized filters (motifs)               | ⚠️ Limited         | Predicting from raw time series                    | `Keras`, `PyTorch`                   |
# | **RNNs / LSTMs / Transformers**                  | Deep Learning (Sequential)  | Temporal dynamics, memory, long patterns | ❌ No              | Capturing long-term dependencies                   | `PyTorch`, `Keras`, `Hugging Face`   |
# | **Siamese / Triplet Networks**                   | Metric Learning             | Latent similarity between time windows   | ⚠️ Medium          | Learning similarity among high-performing windows  | `PyTorch`, `TensorFlow`              |
# | **Autoencoder + Regressor**                      | Latent Representation       | Abstract embeddings                      | ⚠️ Medium          | Unsupervised pretraining + label prediction        | `PyTorch`, `scikit-learn`            |
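
For the "whole series similarity (flexible time)" row, the idea behind Dynamic Time Warping can be sketched in plain NumPy. This is the textbook O(n*m) dynamic program with absolute difference as the local cost, not the optimized implementation found in `tslearn` or `dtaidistance`; the function name `dtw_distance` is hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW: minimal cumulative |a_i - b_j| cost over monotone alignments."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

x = [0, 1, 2, 1, 0]
y = [0, 0, 1, 2, 1, 0]   # same shape as x, delayed by one step
z = [2, 1, 0, 1, 2]      # inverted shape

print(dtw_distance(x, y), dtw_distance(x, z))
```

The delayed copy `y` aligns with `x` at zero cost, while the inverted shape `z` does not, which is the sense in which DTW is tolerant to shifts in time but not to changes in shape.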


In [None]:
# simple features
import os
import sys
import numpy as np
import pandas as pd

# raw_data = pd.read_parquet(os.path.join(os.getcwd(), 'bar_and_label.parquet'))
raw_data = pd.read_parquet(os.path.join(os.getcwd(), 'bar_and_label_trend_1to8.parquet'))
# raw_data = pd.read_parquet(os.path.join(os.getcwd(), 'bar_and_label_trend_3to24.parquet'))
# keep only the most recent 60*24*14 rows (14 days if the bars are 1-minute)
raw_data = raw_data[-(60*24*14):].copy()

# Log Returns
raw_data['log_ret'] = np.log(raw_data['close'] / raw_data['close'].shift(1))

# Momentum
for i in range(1, 6):
    raw_data[f'mom{i}'] = raw_data['close'].pct_change(periods=i)

# Volatility
window_stdev = 50
raw_data['volatility'] = raw_data['log_ret'].rolling(window=window_stdev, min_periods=window_stdev).std()

# Serial Correlation
window_autocorr = 50
for lag in range(1, 6):
    raw_data[f'autocorr_{lag}'] = raw_data['log_ret'].rolling(window=window_autocorr, min_periods=window_autocorr).apply(
        lambda x: x.autocorr(lag=lag) if x.notna().sum() > lag else np.nan,
        raw=False
    )

# Lagged log returns
for i in range(1, 6):
    raw_data[f'log_t{i}'] = raw_data['log_ret'].shift(i)

# Moving averages
fast_window = 7
slow_window = 15
raw_data['fast_mavg'] = raw_data['close'].rolling(window=fast_window, min_periods=fast_window).mean()
raw_data['slow_mavg'] = raw_data['close'].rolling(window=slow_window, min_periods=slow_window).mean()

# ATR
high_low = raw_data['high'] - raw_data['low']
high_close = (raw_data['high'] - raw_data['close'].shift()).abs()
low_close = (raw_data['low'] - raw_data['close'].shift()).abs()
tr = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
raw_data['atr'] = tr.ewm(span=12, adjust=False).mean()

# raw_data.info(verbose=True, memory_usage='deep')
raw_data.describe().transpose()
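
Every feature in this cell is built from `shift`, `rolling`, `pct_change` or `ewm`, which only look backwards. One way to check that property for any feature function is a truncation test: recompute the features on a shortened history and confirm that the overlapping rows are unchanged. The sketch below runs on synthetic prices; `simple_features`, `has_lookahead` and `leaky` are hypothetical helpers, and `simple_features` covers only a subset of the columns above with an arbitrary volatility window.

```python
import numpy as np
import pandas as pd

def simple_features(df):
    out = pd.DataFrame(index=df.index)
    out["log_ret"] = np.log(df["close"] / df["close"].shift(1))
    out["mom3"] = df["close"].pct_change(periods=3)
    out["volatility"] = out["log_ret"].rolling(window=20, min_periods=20).std()
    out["fast_mavg"] = df["close"].rolling(window=7, min_periods=7).mean()
    return out

def has_lookahead(feature_fn, df, cut):
    """True if features on the first `cut` rows change when later rows are added."""
    full = feature_fn(df).iloc[:cut]
    truncated = feature_fn(df.iloc[:cut])
    return not np.allclose(full.to_numpy(), truncated.to_numpy(), equal_nan=True)

def leaky(df):
    # Deliberately wrong: a centered window averages over future bars.
    return pd.DataFrame({"c": df["close"].rolling(7, center=True).mean()})

rng = np.random.default_rng(0)
prices = pd.DataFrame({"close": 100 * np.exp(np.cumsum(rng.normal(0, 0.001, 500)))})

print(has_lookahead(simple_features, prices, cut=300))  # backward-looking features
print(has_lookahead(leaky, prices, cut=300))            # centered window
```

The same check can be pointed at any later feature pipeline, as long as it can be called on a prefix of the data.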


In [None]:
# Tsfresh Features
# https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html
import os
import sys
import numpy as np
import pandas as pd
import plotly.graph_objects as go

from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import EfficientFCParameters

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# print in full
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

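# period < 20  -> series is stationary (ADF test)
# period > 10  -> carries significant trend information (correlation)

M = 200 # number of top features kept
weeks = 1
period = 15 # rolling-mean length and window length, in bars
step = 15 # stride between windows; should be 1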

dir = os.getcwd()
# dir = os.path.dirname(os.path.abspath(__file__))
filepath = os.path.join(dir, "bar_and_label.parquet")

# 'bar_and_label.parquet'
# 'bar_and_label_trend_1to8.parquet'
# 'bar_and_label_trend_3to24.parquet'

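def analyze_feature(df: pd.DataFrame):
    # Sort chronologically first so the rolling statistics below see the bars in time order.
    df = df.sort_values('time').reset_index()
    series = df["close"]
    # Log price relative to the first bar (kept for reference, not used as a feature).
    df['ref'] = np.log((series / series.iloc[0]).fillna(1))
    # Log ratio of close to its `period`-bar mean lagged by `period` bars; warm-up NaNs become log(1) = 0.
    df['value'] = np.log((series / series.rolling(period).mean().shift(period)).fillna(1))

    # print(df.tail())
    # Keep the last `weeks` weeks: 60/5 bars per hour (5-minute bars) * 24 hours * 7 days.
    df = df[-int(60/5*24*7*weeks):].copy()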
    
    # Roll the series into fixed-length windows; each window becomes one tsfresh id.
    window_size = period
    step_size = step
    segments = []
    labels = []
    times = []
    for idx, start in enumerate(range(0, len(df) - window_size, step_size)):
        end = start + window_size  # exclusive end, i.e. the first bar after the window
        segment = df.iloc[start:end].copy()
        segment["id"] = idx  # unique id for each window
        segments.append(segment[["id", "time", "value"]])
        # Label and timestamp are taken from the bar right after the window.
        labels.append(df["label"].iloc[end])
        times.append(df["time"].iloc[end])

    X_features = pd.concat(segments, axis=0)
    y_labels = pd.Series(labels)
    t_times = pd.Series(times)

    print(X_features.tail())
    # print(y_labels.tail())

    # --- Feature extraction
    X_features = extract_features(
        X_features,
        column_id="id",
        column_sort="time",
        default_fc_parameters=EfficientFCParameters(),
        show_warnings=False,
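        # NOTE: tsfresh's impute replaces NaN with the column median and +/-inf with the column
        # max/min, computed over all windows, so later windows can leak into earlier ones (mostly okay).
        impute_function=impute,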
        # disable_progressbar=True
    )
    X_features = pd.DataFrame(X_features)  # Ensure features is a DataFrame
    X_features = X_features.loc[:, X_features.notna().sum() == len(X_features)] # remove NaN features after impute
    X_features = X_features.loc[:, X_features.nunique() > 1] # remove constant features
    print(f"Valid features number:{X_features.shape[1]}")
    # X_features = select_features(X_features, y_labels)
    # print(X_features.describe().transpose())

    # | Feature Type | Target Type | Statistical Test Used             |
    # | ------------ | ----------- | --------------------------------- |
    # | Continuous   | Binary      | Kolmogorov-Smirnov test (KS test) |
    # | Continuous   | Categorical | ANOVA F-test                      |
    # | Continuous   | Continuous  | Kendall's tau correlation test    |
    # | Binary       | Binary      | Fisher’s exact test               |
    # | Binary       | Categorical | Chi-square test                   |
    # | Binary       | Continuous  | Point biserial correlation        |

    # --- Compute correlation with label
    results = {}
    for col in X_features.columns:
        # Use Spearman for robustness (nonlinear monotonic relationships)
        corr = pd.Series(X_features[col]).corr(y_labels, method='spearman')
        results[col] = abs(corr)  # use absolute value to reflect strength
    
    # --- Sort by importance
    sorted_features = sorted(results.items(), key=lambda x: x[1], reverse=True)
    print("\nTop statistically important features (Spearman correlation):")
    for name, score in sorted_features[:10]:
        print(f"{name}: {score:.4f}")
    
    top_features = [name for name, score in sorted_features[:M]]
    X_selected = X_features[top_features].copy()
    X_selected["time"] = t_times.values
    X_selected["label"] = y_labels.values
    X_selected.set_index('time', inplace=True)
    X_selected.to_parquet(os.path.join(dir, f"tsfresh_features_and_label_{weeks}weeks.parquet"))
    # print(X_selected.describe().transpose())

if __name__ == '__main__':
    df = pd.read_parquet(filepath)
    analyze_feature(df)

"""
Top statistically important features (Spearman correlation):
close__has_duplicate_max: 0.0721
close__spkt_welch_density__coeff_5: 0.0693
close__symmetry_looking__r_0.30000000000000004: 0.0532
close__change_quantiles__f_agg_"mean"__isabs_False__qh_0.6__ql_0.4: 0.0521
close__change_quantiles__f_agg_"mean"__isabs_True__qh_0.8__ql_0.6: 0.0505 
close__large_standard_deviation__r_0.4: 0.0477
close__agg_linear_trend__attr_"slope"__chunk_len_10__f_agg_"var": 0.0474  
close__partial_autocorrelation__lag_4: 0.0434
close__change_quantiles__f_agg_"mean"__isabs_False__qh_0.2__ql_0.0: 0.0420
"""


SyntaxError: invalid syntax (3800220679.py, line 223)
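
The single full-sample Spearman score used for the ranking says nothing about whether a feature works consistently over time (note 9 in the first cell). One option is to compute the correlation per consecutive time block and look at its mean, spread and sign consistency. The sketch below uses synthetic data; `blockwise_spearman` and `stability_summary` are hypothetical helpers, the block count is arbitrary, and Spearman is computed as the Pearson correlation of within-block ranks.

```python
import numpy as np
import pandas as pd

def blockwise_spearman(feature, label, n_blocks=5):
    """Spearman correlation of feature vs label in consecutive, near-equal blocks."""
    feature = pd.Series(np.asarray(feature, dtype=float))
    label = pd.Series(np.asarray(label, dtype=float))
    blocks = np.array_split(np.arange(len(feature)), n_blocks)
    # Spearman = Pearson correlation of the ranks, ranked inside each block.
    corrs = [feature.iloc[b].rank().corr(label.iloc[b].rank()) for b in blocks]
    return pd.Series(corrs, name="spearman")

def stability_summary(corrs):
    return {
        "mean": corrs.mean(),
        "std": corrs.std(),
        # share of blocks whose sign agrees with the sign of the mean
        "sign_consistency": (np.sign(corrs) == np.sign(corrs.mean())).mean(),
    }

rng = np.random.default_rng(1)
n = 1000
label = rng.normal(size=n)
stable = label + rng.normal(scale=2.0, size=n)            # weak but persistent signal
flip = np.where(np.arange(n) < n // 2, 1.0, -1.0)
unstable = flip * label + rng.normal(scale=2.0, size=n)   # relationship flips sign halfway

for name, feat in [("stable", stable), ("unstable", unstable)]:
    print(name, stability_summary(blockwise_spearman(feat, label, n_blocks=4)))
```

Ranking by the absolute full-sample correlation, as the cell above does, cannot tell these two cases apart from the per-block picture: the unstable feature averages out to a weak score even though it is strong within each half.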