In [1]:
# We first use tsfresh/TSFEL to generate some features and get a feel for what we are dealing with:
#   1. tsfresh takes long-format rows of (id, time, feature values) plus one label per id as input.
#   2. tsfresh does not test or preprocess the input time series, so make them stationary yourself.
#   3. id = one independent time series, long or short.
#   4. Each id has a single label, which conflicts with the notion that a time series should carry a label series as well.
#   5. However, at each timestamp we have short-memory and long-memory features. Short-memory features are equivalent to sliding a short window over the series and emitting one scalar label per window, which is exactly how tsfresh/TSFEL work.
#   6. It may therefore be best practice to use these tools for short-memory features and to handcraft long-memory or more hidden features.
#   7. These tools cannot capture cross-sectional features between ids. Generating features one id at a time gives the same result; batching ids is only more efficient because it parallelizes.
#   8. Do features that work on a long time series work equally well, on average, on the many short rolling windows split from it? Not necessarily.
#   9. So how do you evaluate whether features generated this way work consistently over time? Either average their statistical performance or train a model (then compare model weights).
#   10. Try short -> long windows when assessing feature importance (e.g. FFT does not work well on short windows).
#   11. Evaluating the effect of a feature does not require many samples: just enough to make sure the feature is stationary and, preferably, normally distributed.
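
Notes 3 to 5 can be made concrete with a minimal sketch, assuming a single series that carries a per-timestamp label. The helper name `make_windows` and the toy data are hypothetical; only pandas and numpy are used. The resulting frame has the (id, time, value) layout that `extract_features(column_id="id", column_sort="time")` expects, with exactly one label per id.

```python
import numpy as np
import pandas as pd


def make_windows(df: pd.DataFrame, window: int, step: int = 1):
    """Split one series into rolling windows in long (id, time, value) format.

    Window k covers rows [k*step, k*step + window) and is paired with the
    label of the row right after the window, so there is one label per id.
    """
    segments, labels = [], []
    for idx, start in enumerate(range(0, len(df) - window, step)):
        end = start + window
        seg = df.iloc[start:end][["time", "value"]].copy()
        seg.insert(0, "id", idx)  # unique id for each window
        segments.append(seg)
        labels.append(df["label"].iloc[end])
    long_df = pd.concat(segments, ignore_index=True)
    y = pd.Series(labels, name="label")
    return long_df, y


# Toy data (hypothetical): 10 timestamps, window of 4 -> 6 windows
toy = pd.DataFrame({
    "time": np.arange(10),
    "value": np.linspace(0.0, 0.9, 10),
    "label": [0, 1] * 5,
})
long_df, y = make_windows(toy, window=4)
print(long_df.groupby("id").size().unique())  # every id holds `window` rows
print(len(y), long_df["id"].nunique())        # one label per id
```

Because each id is an independent short series, anything the extractor computes is by construction a short-memory feature; longer-memory information has to enter through the values themselves (for example a ratio to a lagged moving average).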

In [None]:
# | **Method**                                       | **Approach Type**           | **Pattern Type Learned**                 | **Interpretable?** | **Best Use Case**                                  | **Tools / Libraries**                |
# | ------------------------------------------------ | --------------------------- | ---------------------------------------- | ------------------ | -------------------------------------------------- | ------------------------------------ |
# | **Shapelet Transform**                           | Distance-based (Supervised) | Local subsequence "shapes"               | ✅ High            | Finding interpretable, discriminative patterns     | `tslearn`, `sktime`, `pyts`          |
# | **Dynamic Time Warping + Supervised Clustering** | Similarity + Aggregation    | Whole series similarity (flexible time)  | ⚠️ Medium          | Pattern grouping + average label scoring           | `tslearn`, `dtaidistance`, `HDBSCAN` |
# | **Bag-of-SFA Symbols (BOSS)**                    | Symbolic / Frequency        | Frequency of symbolic subsequences       | ✅ High            | Symbolic pattern detection (e.g., zigzags)         | `sktime`, `pyts`                     |
# | **TDE (Temporal Dictionary Ensemble)**           | Dictionary Learning         | Frequency/strength of learned shapes     | ✅ High            | Detecting repeated motifs with outcome correlation | `sktime`                             |
# | **Time Series Forest (TSF)**                     | Tree Ensemble               | Random intervals + summaries             | ⚠️ Partial         | Strong classification baseline                     | `sktime`, `pyts`                     |
# | **HIVE-COTE 2.0**                                | Ensemble (Hybrid)           | Multiple feature types                   | ⚠️ Partial         | State-of-the-art accuracy on many tasks            | `sktime`                             |
# | **ROCKET / MiniROCKET / MultiROCKET**            | Random Kernels              | Statistical response to many filters     | ❌ No              | Fast, accurate classification/regression           | `sktime`, `rocket-boost`             |
# | **1D CNNs**                                      | Deep Learning (CNN)         | Localized filters (motifs)               | ⚠️ Limited         | Predicting from raw time series                    | `Keras`, `PyTorch`                   |
# | **RNNs / LSTMs / Transformers**                  | Deep Learning (Sequential)  | Temporal dynamics, memory, long patterns | ❌ No              | Capturing long-term dependencies                   | `PyTorch`, `Keras`, `Hugging Face`   |
# | **Siamese / Triplet Networks**                   | Metric Learning             | Latent similarity between time windows   | ⚠️ Medium          | Learning similarity among high-performing windows  | `PyTorch`, `TensorFlow`              |
# | **Autoencoder + Regressor**                      | Latent Representation       | Abstract embeddings                      | ⚠️ Medium          | Unsupervised pretraining + label prediction        | `PyTorch`, `scikit-learn`            |

In [None]:
# simple features
import os
import sys
import numpy as np
import pandas as pd

# raw_data = pd.read_parquet(os.path.join(os.getcwd(), 'bar_and_label.parquet'))
raw_data = pd.read_parquet(os.path.join(os.getcwd(), 'bar_and_label(trend_1to8).parquet'))
# raw_data = pd.read_parquet(os.path.join(os.getcwd(), 'bar_and_label(trend_3to24).parquet'))
raw_data = raw_data[-(60*24*14):].copy()

# Log Returns
raw_data['log_ret'] = np.log(raw_data['close'] / raw_data['close'].shift(1))

# Momentum
for i in range(1, 6):
    raw_data[f'mom{i}'] = raw_data['close'].pct_change(periods=i)

# Volatility
window_stdev = 50
raw_data['volatility'] = raw_data['log_ret'].rolling(window=window_stdev, min_periods=window_stdev).std()

# Serial Correlation
window_autocorr = 50
for lag in range(1, 6):
    raw_data[f'autocorr_{lag}'] = raw_data['log_ret'].rolling(window=window_autocorr, min_periods=window_autocorr).apply(
        lambda x: x.autocorr(lag=lag) if x.notna().sum() > lag else np.nan,
        raw=False
    )

# Lagged log returns
for i in range(1, 6):
    raw_data[f'log_t{i}'] = raw_data['log_ret'].shift(i)

# Moving averages
fast_window = 7
slow_window = 15
raw_data['fast_mavg'] = raw_data['close'].rolling(window=fast_window, min_periods=fast_window).mean()
raw_data['slow_mavg'] = raw_data['close'].rolling(window=slow_window, min_periods=slow_window).mean()

# ATR
high_low = raw_data['high'] - raw_data['low']
high_close = (raw_data['high'] - raw_data['close'].shift()).abs()
low_close = (raw_data['low'] - raw_data['close'].shift()).abs()
tr = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
raw_data['atr'] = tr.ewm(span=12, adjust=False).mean()

# raw_data.info(verbose=True, memory_usage='deep')
raw_data.describe().transpose()

           count           mean          std        min         25%       50%         75%       max
open     20160.0       19560.24  1026.461275    16500.5    18946.75   19636.5  20201.6875   21521.0
high     20160.0       19576.02  1021.795364    16534.5  18965.0625   19648.5   20215.125  21529.75
low      20160.0       19543.97  1030.919129    16452.5     18928.0   19624.0  20185.0625   21513.0
close    20160.0       19560.23  1026.465353    16499.5  18946.4375  19636.25  20202.3125  21521.75
label    20160.0      0.4428075     0.496731        0.0         0.0       0.0         1.0       1.0
log_ret  20159.0  -2.444558e-07     0.001295  -0.046691   -0.000659       0.0    0.000649  0.024732
mom1     20159.0   5.933983e-07     0.001294  -0.045618   -0.000659       0.0    0.000649  0.025041
mom2     20158.0   1.188938e-06      0.00183   -0.04849   -0.000909       0.0    0.000889  0.030751
mom3     20157.0   1.803692e-06      0.00226  -0.052876   -0.001099       0.0     0.00107  0.040459
mom4     20156.0   2.418886e-06     0.002603  -0.053822   -0.001254  -1.3e-05    0.001221  0.047852
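
The serial-correlation features in this cell call `autocorr` through `rolling(...).apply`, which runs a Python callback per window and can be slow on long series. A hedged alternative, assuming a slightly different window definition is acceptable: `s.rolling(w).corr(s.shift(lag))` correlates the last `w` returns with their lagged counterparts, which matches `autocorr(lag)` on a window of `w + lag` points rather than `w`. The helper name `rolling_autocorr` and the synthetic series are illustrative only.

```python
import numpy as np
import pandas as pd


def rolling_autocorr(s: pd.Series, window: int, lag: int) -> pd.Series:
    """Rolling lag-`lag` autocorrelation over `window` (value, lagged value) pairs."""
    return s.rolling(window=window, min_periods=window).corr(s.shift(lag))


rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=300))
window, lag = 50, 2

fast = rolling_autocorr(s, window, lag)

# Reference: autocorr on windows of window + lag points uses the same pairs
slow = s.rolling(window + lag, min_periods=window + lag).apply(
    lambda x: x.autocorr(lag=lag), raw=False
)

print(np.allclose(fast.dropna(), slow.dropna()))
```

The values are not identical to the `window`-point version used in the cell, so switch only if the feature is re-validated afterwards.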


In [None]:
# Tsfresh Features
# https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html
import os
import sys
import numpy as np
import pandas as pd
import plotly.graph_objects as go

from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import EfficientFCParameters

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# print in full
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# period < 20: `value` stays stationary (ADF test)
# period > 10: `value` carries significant trend info (correlation)

M = 200  # number of top_features kept
weeks = 5
period = 15
step = 1  # should be 1

dir = os.getcwd()
# dir = os.path.dirname(os.path.abspath(__file__))
filepath = os.path.join(dir, "bar_and_label(trend_3to24).parquet")

# 'bar_and_label(calmar).parquet'
# 'bar_and_label(trend_1to8).parquet'
# 'bar_and_label(trend_3to24).parquet'


def analyze_feature(df: pd.DataFrame):
    series = df["close"]
    df['ref'] = np.log((series / series.iloc[0]).fillna(1))
    df['value'] = np.log((series / series.rolling(period).mean().shift(period)).fillna(1))

    # print(df.tail())
    df = df[-int(60/5*24*7*weeks):].copy()
    df = df.sort_values('time').reset_index()

    # Rolling the time series
    window_size = period
    step_size = step
    segments = []
    labels = []
    times = []
    for idx, start in enumerate(range(0, len(df) - window_size, step_size)):
        end = start + window_size
        segment = df.iloc[start:end].copy()
        segment["id"] = idx  # unique id for each window
        segments.append(segment[["id", "time", "value"]])
        labels.append(df["label"].iloc[end])
        times.append(df["time"].iloc[end])

    X_features = pd.concat(segments, axis=0)
    y_labels = pd.Series(labels)
    t_times = pd.Series(times)

    print(X_features.tail())
    # print(y_labels.tail())

    # --- Feature extraction
    X_features = extract_features(
        X_features,
        column_id="id",
        column_sort="time",
        default_fc_parameters=EfficientFCParameters(),
        show_warnings=False,
        impute_function=impute,  # NOTE: impute fills NaN/inf with per-column median/min/max over ALL windows, so it can leak future info (usually minor)
        # disable_progressbar=True
    )
    X_features = pd.DataFrame(X_features)  # Ensure features is a DataFrame
    X_features = X_features.loc[:, X_features.notna().sum() == len(X_features)]  # remove NaN features after impute
    X_features = X_features.loc[:, X_features.nunique() > 1]  # remove constant features
    print(f"Valid features number:{X_features.shape[1]}")
    # X_features = select_features(X_features, y_labels)
    # print(X_features.describe().transpose())

    # | Feature Type | Target Type | Statistical Test Used             |
    # | ------------ | ----------- | --------------------------------- |
    # | Continuous   | Binary      | Kolmogorov-Smirnov test (KS test) |
    # | Continuous   | Categorical | ANOVA F-test                      |
    # | Continuous   | Continuous  | Kendall's tau correlation test    |
    # | Binary       | Binary      | Fisher’s exact test               |
    # | Binary       | Categorical | Chi-square test                   |
    # | Binary       | Continuous  | Point biserial correlation        |

    # --- Compute correlation with label
    results = {}
    for col in X_features.columns:
        # Use Spearman for robustness (nonlinear monotonic relationships)
        corr = pd.Series(X_features[col]).corr(y_labels, method='spearman')
        results[col] = abs(corr)  # use absolute value to reflect strength

    # --- Sort by importance
    sorted_features = sorted(results.items(), key=lambda x: x[1], reverse=True)
    print("\nTop statistically important features (Spearman correlation):")
    for name, score in sorted_features[:10]:
        print(f"{name}: {score:.4f}")

    top_features = [name for name, score in sorted_features[:M]]
    X_selected = X_features[top_features].copy()
    X_selected["time"] = t_times.values
    X_selected["label"] = y_labels.values
    X_selected.set_index('time', inplace=True)
    # X_selected.to_parquet(os.path.join(dir, f"features_and_label(tsfresh_trend_3to24).parquet"))
    # print(X_selected.describe().transpose())


if __name__ == '__main__':
    df = pd.read_parquet(filepath)
    analyze_feature(df)

"""
Top statistically important features (Spearman correlation):
close__has_duplicate_max: 0.0721
close__spkt_welch_density__coeff_5: 0.0693
close__symmetry_looking__r_0.30000000000000004: 0.0532
close__change_quantiles__f_agg_"mean"__isabs_False__qh_0.6__ql_0.4: 0.0521
close__change_quantiles__f_agg_"mean"__isabs_True__qh_0.8__ql_0.6: 0.0505 
close__large_standard_deviation__r_0.4: 0.0477
close__agg_linear_trend__attr_"slope"__chunk_len_10__f_agg_"var": 0.0474  
close__partial_autocorrelation__lag_4: 0.0434
close__change_quantiles__f_agg_"mean"__isabs_False__qh_0.2__ql_0.0: 0.0420
"""

          id          time     value
10074  10064  202505160937  0.000453
10075  10064  202505161008  0.000334
10076  10064  202505161028  0.001015
10077  10064  202505161056  0.000939
10078  10064  202505161139  0.000607


Feature Extraction: 100%|██████████| 30/30 [06:21<00:00, 12.73s/it]


Valid features number:320

Top statistically important features (Spearman correlation):
value__number_crossing_m__m_0: 0.0464
value__change_quantiles__f_agg_"var"__isabs_True__qh_0.8__ql_0.0: 0.0359
value__permutation_entropy__dimension_7__tau_1: 0.0352
value__permutation_entropy__dimension_6__tau_1: 0.0316
value__symmetry_looking__r_0.15000000000000002: 0.0316
value__cwt_coefficients__coeff_14__w_2__widths_(2, 5, 10, 20): 0.0307
value__change_quantiles__f_agg_"var"__isabs_True__qh_1.0__ql_0.2: 0.0303
value__change_quantiles__f_agg_"var"__isabs_True__qh_1.0__ql_0.0: 0.0293
value__change_quantiles__f_agg_"mean"__isabs_False__qh_0.2__ql_0.0: 0.0286
value__number_peaks__n_1: 0.0284



In [4]:
# simple candlestick pattern
import array
import math


class candlestrength:
    """
    Analyze candlestick strength based on body position within true range.

    Strength patterns explained (│ = wick, █ = body):

    Bullish Patterns (close > open):
    Strength 0 (Very Bullish):     Strength 1:           Strength 2:           Strength 3:           Strength 4:

        █                             │                      │                      │                      │   
        █       Bottom third          █     Bottom third     │     Middle third     █    Middle third      -    Top third
        █                             █                      █                      │                      │   

    Bearish Patterns (close < open):
    Strength 8 (Very Bearish):     Strength 7:           Strength 6:           Strength 5:           Strength 4:
        █                             █                      █                      │                      │   
        █         Top third           █      Top third       │    Middle third      █    Middle third      -    Bottom third
        █                             │                      │                      │                      │   
    """

    def __init__(self,
                 opens: array.array,
                 highs: array.array,
                 lows: array.array,
                 closes: array.array,
                 ):
        """Initialize the CandleStrength analyzer."""
        self.opens = opens
        self.highs = highs
        self.lows = lows
        self.closes = closes

        self.strength = array.array('b', [])
        self.is_bullish = None

    def update(self):
        """
        Calculate the strength of the latest candle and append it to self.strength.

        Takes no arguments: the most recent open/high/low/close are read from
        the arrays passed to __init__. The rating runs from 0 (most bullish)
        to 8 (most bearish). Nothing is returned.
        """
        open = self.opens[-1]
        high = self.highs[-1]
        low = self.lows[-1]
        close = self.closes[-1]
        # Calculate true range and section sizes
        true_range = high - low
        section_size = true_range / 3

        # Calculate section boundaries
        lower_third = low + section_size
        upper_third = high - section_size

        # Determine body position
        body_high = max(open, close)
        body_low = min(open, close)

        # Determine if candle is bullish or bearish
        self.is_bullish = close > open

        # Calculate strength based on body position
        if self.is_bullish:
            if body_high <= lower_third:
                strength = 0  # Very bullish - full body in bottom third
            elif body_low <= lower_third:
                strength = 1  # Body extends into bottom third
            elif body_high <= upper_third:
                strength = 2  # Full body in middle third
            elif body_low <= upper_third:
                strength = 3  # Body extends into middle third
            else:
                strength = 4  # Body in top third
        else:  # bearish
            if body_low >= upper_third:
                strength = 8  # Very bearish - full body in top third
            elif body_high >= upper_third:
                strength = 7  # Body extends into top third
            elif body_low >= lower_third:
                strength = 6  # Full body in middle third
            elif body_high >= lower_third:
                strength = 5  # Body extends into middle third
            else:
                strength = 4  # Body in bottom third

        self.strength.append(strength)
        LEN = 100
        if len(self.strength) > 2*LEN:
            del self.strength[:-LEN]
        return
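
To see how the thirds-based rules map concrete candles to ratings, here is a standalone restatement of the branch logic in `update` as a pure function. The name `candle_strength` and the example prices are illustrative, not part of the class. With low = 0 and high = 9 the third boundaries fall at 3 and 6. Note that a doji (close == open) fails the `close > open` test, so it is rated through the bearish branch.

```python
def candle_strength(open_: float, high: float, low: float, close: float) -> int:
    """Rate one candle from 0 to 8 using the same comparisons as `update`."""
    section = (high - low) / 3
    lower_third = low + section
    upper_third = high - section
    body_high = max(open_, close)
    body_low = min(open_, close)

    if close > open_:  # bullish branch
        if body_high <= lower_third:
            return 0
        if body_low <= lower_third:
            return 1
        if body_high <= upper_third:
            return 2
        if body_low <= upper_third:
            return 3
        return 4
    # bearish branch (also taken when close == open)
    if body_low >= upper_third:
        return 8
    if body_high >= upper_third:
        return 7
    if body_low >= lower_third:
        return 6
    if body_high >= lower_third:
        return 5
    return 4


# (open, high, low, close) examples with low=0, high=9 -> thirds at 3 and 6
examples = {
    "bullish, body inside bottom third": (1, 9, 0, 2),   # 0
    "bullish, body starts in bottom third": (1, 9, 0, 8),  # 1
    "bullish, body inside middle third": (4, 9, 0, 5),   # 2
    "bullish, body inside top third": (7, 9, 0, 8),      # 4
    "bearish, body inside top third": (8, 9, 0, 7),      # 8
    "bearish, body inside middle third": (5, 9, 0, 4),   # 6
    "doji in the middle": (5, 9, 0, 5),                  # 6
}
for name, ohlc in examples.items():
    print(f"{name}: {candle_strength(*ohlc)}")
```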

In [None]:
# bar features (model baseline tests)

import os
import sys
import array
import torch
import numpy as np
import pandas as pd
from tqdm import tqdm

dir = os.getcwd()
sys.path.append(os.path.abspath(os.path.join(dir, "../..")))

pd.set_option('display.expand_frame_repr', False)

i = 0


class TimeSeries_Analysis():
    def __init__(self, n_timestamps: int):
        from Math.performance.log_return import logreturn

        self.n_timestamps = n_timestamps

        # Initialize fixed-size arrays for each level
        self.multipliers = [1]
        n_timeframes = len(self.multipliers)
        self.counts = array.array('I', [0] * n_timeframes)
        self.opens = [array.array('d', [0.0]) for _ in range(n_timeframes)]  # trimmed as size increases
        self.highs = [array.array('d', [0.0]) for _ in range(n_timeframes)]  # trimmed as size increases
        self.lows = [array.array('d', [0.0]) for _ in range(n_timeframes)]   # trimmed as size increases
        self.closes = [array.array('d', [0.0]) for _ in range(n_timeframes)]  # trimmed as size increases
        self.volumes = [array.array('L', [0]) for _ in range(n_timeframes)]  # trimmed as size increases
        self.timestamp = array.array('d', [0.0])  # put in array to be mutable

        self.logreturn = logreturn(self.closes[i])
        self.candlestrength = candlestrength(self.opens[i], self.highs[i], self.lows[i], self.closes[i])

        self.feature_specs = {
            'logreturn': {
                'instance': self.logreturn,
                'features': [('log_returns', -1), ('log_returns', -2), ('log_returns', -3), ('log_returns', -4), ('log_returns', -5)],
                # 'Scaler': ScalingMethod.ROBUST,
            },
            'candlestrength': {
                'instance': self.candlestrength,
                'features': [('strength', -1), ('strength', -2), ('strength', -3), ('strength', -4), ('strength', -5),],
                # 'Scaler': ScalingMethod.ROBUST,
            },
        }
        self.n_features = sum(len(spec['features']) for spec in self.feature_specs.values())

        self.init_shared_tensor()

        self.init = False

    def init_shared_tensor(self):
        N_timestamps = self.n_timestamps
        N_features = self.n_features
        N_labels = 1
        N_columns = N_features + N_labels
        N_codes = 1
        self.label_index = N_features
        self.column_names = self._get_column_names()
        print(f"Initializing Pytorch Tensor: (timestamp({N_timestamps}), feature({N_features}) + label({N_labels}), codes({N_codes}))")
        self.shared_tensor = torch.zeros((N_timestamps, N_columns, N_codes), dtype=torch.float16).share_memory_()
        self.time_tensor = torch.zeros(N_timestamps, dtype=torch.int64).share_memory_()

    def analyze(self, code_idx, timestamp, open, high, low, close, label):
        self.parse_kline(timestamp, open, high, low, close)
        self.update_features(code_idx)
        self.shared_tensor[self.counts[i], self.label_index, code_idx] = label
        self.time_tensor[self.counts[i]] = timestamp
        self.counts[i] += 1

    def parse_kline(self, timestamp, open, high, low, close):
        LEN = 100

        # Append new bar
        self.opens[i].append(open)
        self.highs[i].append(high)
        self.lows[i].append(low)
        self.closes[i].append(close)
        # self.volumes[i].append(curr_vol)
        self.timestamp.append(timestamp)  # else idx

        # Trim arrays to fixed window
        if len(self.opens[i]) > 2 * LEN:
            del self.opens[i][:-LEN]
            del self.highs[i][:-LEN]
            del self.lows[i][:-LEN]
            del self.closes[i][:-LEN]
            # del self.volumes[i][:-LEN]
            del self.timestamp[:-LEN]

    def update_features(self, code_idx):
        for spec in self.feature_specs.values():
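            try:
                spec['instance'].update()
            except Exception:
                # best effort: an indicator may fail while history is still short
                # (the arrays start with a 0.0 placeholder); skip it for this bar
                pass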

        if not self.init:
            if len(self.closes[i]) <= 5:
                return
            else:
                self.init = True

        feature_idx = 0
        for spec in self.feature_specs.values():
            instance = spec['instance']
            for attr_name, idx in spec['features']:
                value = getattr(instance, attr_name)
                self.shared_tensor[self.counts[i], feature_idx, code_idx] = value[idx] if idx is not None else value
                feature_idx += 1

    def _get_column_names(self):
        names = []
        for spec_key, spec in self.feature_specs.items():
            for attr_name, idx in spec['features']:
                name = f"{spec_key}_{attr_name}_{abs(idx) if idx else 0}"
                names.append(name)
        names.append("label")
        return names

    def get_df(self, code_index: int):
        if code_index is not None:  # `if code_index:` would treat index 0 as "no index"
            data = self.shared_tensor[:, :, code_index].cpu().numpy()
        else:
            data = self.shared_tensor.squeeze(-1).cpu().numpy()  # only valid with a single code

        df = pd.DataFrame(data, columns=self.column_names)
        df['time'] = self.time_tensor.cpu().numpy()
        df.set_index('time', inplace=True)
        return df


if __name__ == '__main__':
    # raw_data = pd.read_parquet(os.path.join(os.getcwd(), 'bar_and_label(calmar).parquet'))
    # raw_data = pd.read_parquet(os.path.join(os.getcwd(), 'bar_and_label(trend_1to8).parquet'))
    raw_data = pd.read_parquet(os.path.join(os.getcwd(), 'bar_and_label(trend_3to24).parquet'))
    # raw_data = raw_data[-(n_timestamps):].copy()

    # n_timestamps = raw_data.shape[0]
    n_timestamps = 12*24*20*1

    TA = TimeSeries_Analysis(n_timestamps)

    raw_data.reset_index(inplace=True)
    for row in tqdm(raw_data[-(n_timestamps):].itertuples(index=False), total=n_timestamps):
        TA.analyze(
            code_idx=0,
            timestamp=row.time,
            open=row.open,
            high=row.high,
            low=row.low,
            close=row.close,
            label=row.label,
        )

    features_and_labels = TA.get_df(code_index=0)
    print(features_and_labels.astype('float32').describe().transpose())  # cast to float32 first: describe() does not work on a large number of float16 values
    # features_and_labels.to_parquet(os.path.join(dir, f"features_and_label(candlestick_trend_3to24).parquet"))
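
The shared tensor stores features as `float16`, which has limited range and precision. The sketch below uses NumPy's `float16` (the same IEEE half-precision format) to illustrate two effects worth keeping in mind, assuming return magnitudes like those in this notebook: small returns keep only about three significant digits or flush to zero, and sums accumulated in half precision overflow past 65504, which may be one reason `describe()` needs the `float32` cast. The sample values are illustrative.

```python
import numpy as np

# Typical 1-bar log-return magnitudes (illustrative values)
returns = np.array([5.04e-4, 1.6e-5, 3.0e-6, 1.0e-8], dtype=np.float64)
as_half = returns.astype(np.float16)

# Relative rounding error per value; tiny values lose precision or flush to zero
rel_err = np.abs(as_half.astype(np.float64) - returns) / returns
for r, h, e in zip(returns, as_half, rel_err):
    print(f"{r:.3e} -> {float(h):.3e} (relative error {e:.2%})")

# float16 tops out at 65504, so a half-precision total of many values overflows
half_max = float(np.finfo(np.float16).max)
strengths = np.full(100_000, 4.0, dtype=np.float16)
with np.errstate(over="ignore"):
    total_half = strengths.sum(dtype=np.float16)  # result kept in half precision
total_single = strengths.astype(np.float32).sum()
print(half_max, total_half, total_single)
```

If the precision loss matters for the smallest returns, storing the tensor as `float32` (or scaling returns before the cast) is a simple alternative at twice the memory.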

Initializing Pytorch Tensor: (timestamp(413636), feature(10) + label(1), codes(1))


100%|██████████| 413636/413636 [01:12<00:00, 5677.52it/s]


                              count      mean       std       min       25%       50%       75%       max
logreturn_log_returns_1    413636.0  0.000003  0.000947 -0.060974 -0.000496  0.000016  0.000504  0.025635
logreturn_log_returns_2    413636.0  0.000003  0.000947 -0.060974 -0.000496  0.000016  0.000504  0.025635
logreturn_log_returns_3    413636.0  0.000003  0.000947 -0.060974 -0.000496  0.000016  0.000504  0.025635
logreturn_log_returns_4    413636.0  0.000003  0.000947 -0.060974 -0.000496  0.000016  0.000504  0.025635
logreturn_log_returns_5    413636.0  0.000003  0.000947 -0.060974 -0.000496  0.000016  0.000504  0.025635
candlestrength_strength_1  413636.0  4.002007  2.623760  0.000000  1.000000  4.000000  7.000000  8.000000
candlestrength_strength_2  413636.0  4.002024  2.623764  0.000000  1.000000  4.000000  7.000000  8.000000
candlestrength_strength_3  413636.0  4.002016  2.623768  0.000000  1.000000  4.000000  7.000000  8.000000
candlestrength_strength_4  413636.0  4.002033 