# Table of Contents
- [Dynamic Time Patterns for Algorithmic Trading using machine learning](#dynamic-time-patterns-for-algorithmic-trading-using-machine-learning)
  - [Setup and Preprocessing](#setup-and-preprocessing)
  - [Math Behind the Algorithm](#math)
    - [Distance Metrics](#distance-metrics)  
    - [K Nearest Neighbor Algorithm](#k-nearest-neighbor-algorithm)
  - [Simple Backtest](#simple-backtest-of-forecast-predictions)
    - [Backtest Strategy](#backtest-strategy)
  - [Results and Visualization](#results-and-visualization)  
  - [Conclusion](#conclusion)

# Dynamic Time Patterns for Algorithmic Trading using machine learning
This notebook walks through the dynamic time pattern algorithm and its implementation.


## Setup and Preprocessing
To run the algorithm yourself, import the libraries and initialize the data as shown in this section. With the time interval and parameters I have chosen, a full run takes roughly 20 minutes. If you don't want to run it, skip ahead to the math and implementation.

### Environment Setup
Install the required libraries using:

In [2]:
pip install numpy pandas matplotlib fastdtw scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from fastdtw import fastdtw
from sklearn.neighbors import NearestNeighbors

### Data Collection
This project uses high-frequency forex data sourced from MetaTrader.

The data is downloaded at 1-minute resolution and resampled to 5-minute intervals. Each bar has Time, Open, High, Low, and Close fields; returns are computed from the Close price.
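The notebook loads a file that is already at 5-minute resolution, so the resampling step itself is not shown. A minimal sketch of how it could be done with pandas follows. The 1-minute frame `df_1min` is synthetic, and labeling each 5-minute bar by its left edge (the pandas default for this frequency) is an assumption, not necessarily the convention used for `gbp_usd_5min.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical 1-minute bars standing in for the MetaTrader export.
idx = pd.date_range("2021-03-01 00:00", periods=10, freq="1min")
close = 1.39 + 0.0001 * np.arange(10)
df_1min = pd.DataFrame(
    {
        "Open": close - 0.00005,
        "High": close + 0.0001,
        "Low": close - 0.0001,
        "Close": close,
    },
    index=idx,
)
df_1min.index.name = "Time"

# Aggregate each 5-minute bucket: first open, highest high, lowest low, last close.
df_5min = (
    df_1min.resample("5min")
    .agg({"Open": "first", "High": "max", "Low": "min", "Close": "last"})
    .dropna()  # drop buckets with no trades (e.g. weekends)
)
print(df_5min)
```

Ten 1-minute bars collapse into two 5-minute bars here; the Close of each bar is the last 1-minute Close inside its bucket.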

The code below uses GBP/USD data covering March 1, 2021 00:00 to March 31, 2021 23:55, and restricts it to the window ending March 31, 2021 17:15 set by `end_date`.

In [4]:
df = pd.read_csv("gbp_usd_5min.csv", parse_dates=["Time"])

# Index by timestamp and sort so the index is monotonic for date-range slicing
df.set_index("Time", inplace=True)
df = df.sort_index()

# date-range of data that will be used
start_date = "2021-03-01 00:00"
end_date = "2021-03-31 17:15"

# Define range of df to be used based on dates
df = df.loc[start_date:end_date]

print(df.tail())


                        Open     High      Low    Close
Time                                                   
2021-03-31 16:55:00  1.37838  1.37848  1.37788  1.37791
2021-03-31 17:00:00  1.37792  1.37816  1.37759  1.37809
2021-03-31 17:05:00  1.37808  1.37840  1.37787  1.37787
2021-03-31 17:10:00  1.37802  1.37802  1.37794  1.37795
2021-03-31 17:15:00  1.37795  1.37801  1.37794  1.37798


### Data Preprocessing

Checking for missing values:

In [5]:
print(df.isna().sum())


Open     0
High     0
Low      0
Close    0
dtype: int64


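Logarithmic returns are computed as the natural log of the closing price at time $t$ divided by the closing price at time $t-1$: $ R_t = \ln(C_t / C_{t-1}) $, where $ C_t $ is the Close at time $t$. The first row has no predecessor, so its return is undefined and is dropped.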

In [6]:
df["returns"] = np.log(df["Close"] / df["Close"].shift(1))
df.dropna(inplace=True)
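As a quick illustration of the convention used above, the following sketch applies the same `shift(1)` expression to a few toy prices (hypothetical values, not taken from the dataset) and checks a useful property of log returns: they add up over time.

```python
import numpy as np
import pandas as pd

# Toy closing prices (hypothetical values, not from the dataset).
close = pd.Series([1.3900, 1.3905, 1.3898, 1.3910])

# R_t = ln(C_t / C_{t-1}); the first entry has no predecessor and is NaN.
log_ret = np.log(close / close.shift(1)).dropna()

# Log returns are additive: their sum equals ln(C_last / C_first).
total = np.log(close.iloc[-1] / close.iloc[0])
print(log_ret.sum(), total)
```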

#### Smoothing Returns Using Exponential Moving Average

The returns are smoothed using an exponential moving average (EMA). The EMA is a weighted moving average that gives more weight to recent data points, making it more responsive to recent changes than a simple moving average (SMA).

##### Formula:
$
EMA_t = \alpha \cdot R_t + (1 - \alpha) \cdot EMA_{t-1}
$

where:
- $ EMA_t $ is the exponential moving average at time $ t $,
- $ R_t $ is the return at time $ t $,
- $ \alpha $ is the smoothing factor, calculated as:
  $
  \alpha = \frac{2}{span + 1}
  $
- $ span $ is the chosen period for smoothing.

The smoothing factor $ \alpha $ determines how much weight recent returns receive. A higher $ \alpha $ (lower span) makes the EMA react more to new data, while a lower $ \alpha $ (higher span) results in a smoother curve.

The implementation in Python is as follows:

In [7]:
def apply_ema(series, span):
    return series.ewm(span=span, adjust=False).mean()

df["ema_returns"] = apply_ema(df["returns"], span=15)
df.dropna(inplace=True)
print(df["ema_returns"].head())

Time
2021-03-01 00:05:00    0.000007
2021-03-01 00:10:00    0.000052
2021-03-01 00:15:00    0.000035
2021-03-01 00:20:00    0.000022
2021-03-01 00:25:00    0.000031
Name: ema_returns, dtype: float64


To validate the function's correctness (and to show the formula in Python), the EMA at row 100 is computed manually and compared to the output of `apply_ema()`:

In [8]:
span = 15
EMA_return_100 = 2/(span+1) * df["returns"].iloc[100] + (1 - 2/(span+1)) * df["ema_returns"].iloc[99]
if np.isclose(EMA_return_100, df["ema_returns"].iloc[100]):  # tolerant float comparison
    print("🎉🎉🎉")
print("Manual EMA:", EMA_return_100)
print("apply_ema() EMA:", df["ema_returns"].iloc[100])

🎉🎉🎉
Manual EMA: -0.00019552803368233756
apply_ema() EMA: -0.00019552803368233756
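The single-row check above can be extended to a whole series. The sketch below writes the EMA recursion as an explicit loop and compares it with `ewm(span=..., adjust=False)` on synthetic returns. The helper `ema_loop` and the random data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def ema_loop(values, span):
    """Explicit recursion EMA_t = alpha * R_t + (1 - alpha) * EMA_{t-1}."""
    alpha = 2.0 / (span + 1)
    out = np.empty(len(values))
    out[0] = values[0]  # adjust=False seeds the recursion with the first value
    for t in range(1, len(values)):
        out[t] = alpha * values[t] + (1 - alpha) * out[t - 1]
    return out

rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0, 1e-4, size=500))  # synthetic returns

manual = ema_loop(returns.to_numpy(), span=15)
pandas_ema = returns.ewm(span=15, adjust=False).mean().to_numpy()

# Tight tolerances, since the values themselves are on the order of 1e-4.
print(np.allclose(manual, pandas_ema, rtol=1e-9, atol=1e-15))
```

With `adjust=False`, pandas uses exactly this recursion, which is why the formula in this section matches `apply_ema()`. With the default `adjust=True`, the early values would differ.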


#### Standardizing EMA Returns Using Rolling Standard Deviation
Standardization normalizes the data by adjusting for differences in scale and volatility. This makes values comparable across time periods and helps in detecting regime shifts. The rolling standard deviation is:

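$
\sigma_t = \sqrt{\frac{1}{w - 1} \sum_{i=t-w+1}^{t} (EMA_i - \bar{EMA})^2}
$

where:

- $ \sigma_t $ is the rolling standard deviation at time $ t $,
- $ EMA_i $ are the exponential moving averages within the window,
- $ \bar{EMA} $ is the mean EMA within the window,
- $ w $ is the window size. The $ w - 1 $ denominator is the sample standard deviation, which is what the pandas `rolling(...).std()` call used here computes by default (`ddof=1`).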

The EMA returns are then standardized by dividing by the rolling standard deviation:

$
EMA_{std,t} = \frac{EMA_t}{\sigma_t}
$

In this case the window is 120 bars, i.e. 10 hours of 5-minute data.

The implementation in Python is as follows:

In [9]:
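window_size = 120
# Rolling sample standard deviation (pandas default ddof=1) of the smoothed returns
df["rolling_std"] = df["ema_returns"].rolling(window=window_size, min_periods=1).std()
# The first row has a single observation, so its std is NaN; fill it with the column mean
df["rolling_std"] = df["rolling_std"].fillna(df["rolling_std"].mean())
# Scale each smoothed return by its rolling volatility
df["ema_returns_std"] = df["ema_returns"] / df["rolling_std"]
df.dropna(inplace=True)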
print(df["ema_returns_std"].head())

Time
2021-03-01 00:05:00    0.104078
2021-03-01 00:10:00    1.640531
2021-03-01 00:15:00    1.536623
2021-03-01 00:20:00    1.175463
2021-03-01 00:25:00    1.892782
Name: ema_returns_std, dtype: float64
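To make the rolling computation concrete, the sketch below recomputes one window by hand on synthetic data (a stand-in for `ema_returns`, not the real series). It shows that pandas' `rolling(...).std()` returns the sample standard deviation (`ddof=1`), and that with `min_periods=1` the very first value is NaN, which is why the notebook fills it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ema = pd.Series(rng.normal(0.0, 1e-4, size=300))  # synthetic stand-in for ema_returns
w = 120

rolling_std = ema.rolling(window=w, min_periods=1).std()

# Full window ending at position t, i.e. indices t-w+1 .. t inclusive.
t = 200
window = ema.iloc[t - w + 1 : t + 1].to_numpy()
manual_sample = np.std(window, ddof=1)      # divides by w - 1
manual_population = np.std(window, ddof=0)  # divides by w

print(np.isclose(rolling_std.iloc[t], manual_sample, rtol=1e-6, atol=0))
print(np.isclose(rolling_std.iloc[t], manual_population, rtol=1e-6, atol=0))
print(rolling_std.iloc[0])  # a single observation has no sample standard deviation
```

For a window of 120 the two conventions differ by well under one percent, so the choice has little practical effect on the standardized series.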


## Math

### Distance Metrics
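The kNN model compares time series embeddings using one of the following distance measures:
- **Euclidean**: $ \text{ED}(x, y) = \sqrt{\sum (x_i - y_i)^2} $
- **DTW**: Dynamic Time Warping, which aligns two series by warping the time axis so that similar shapes match even when they are shifted or stretched in time.
- **ASD**: Average Squared Difference, $\text{ASD}(x, y) = \frac{1}{n} \sum (x_i - y_i)^2 $
- **CID**: Complexity-Invariant Distance, which scales the Euclidean distance by a complexity ratio: $ \text{CID}(x, y) = \text{ED}(x, y) \cdot \frac{\max(CE(x), CE(y))}{\min(CE(x), CE(y))} $, where $ CE(x) = \sqrt{\sum (x_{i+1} - x_i)^2} $ is the complexity estimate of a series.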
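The four measures above can be sketched in plain NumPy. This is an illustrative sketch under stated assumptions, not the notebook's implementation: DTW is written as the exact dynamic-programming recurrence with absolute difference as local cost (the notebook imports `fastdtw`, which is an approximation of it), and the handling of zero complexity in `cid` is an assumption of this sketch.

```python
import numpy as np

def euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))

def asd(x, y):
    return float(np.mean((x - y) ** 2))

def complexity(x):
    # Complexity estimate CE(x): root of the summed squared first differences.
    return float(np.sqrt(np.sum(np.diff(x) ** 2)))

def cid(x, y):
    ed = euclidean(x, y)
    lo, hi = sorted((complexity(x), complexity(y)))
    if lo == 0.0:
        # Assumption of this sketch: two flat series get factor 1,
        # a flat series versus a non-flat one is infinitely far apart.
        return ed if hi == 0.0 else float("inf")
    return ed * hi / lo

def dtw(x, y):
    # Exact O(n*m) dynamic programming; cost[i, j] is the best alignment
    # cost of the first i points of x with the first j points of y.
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Same bump, shifted by one step.
a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([1.0, 2.0, 1.0, 0.0, 0.0])

print("Euclidean:", euclidean(a, b))
print("ASD:", asd(a, b))
print("CID:", cid(a, b))
print("DTW:", dtw(a, b))
```

On this pair, DTW is smaller than the Euclidean distance because warping absorbs most of the one-step shift, which is the behavior that makes it attractive for pattern matching.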


### K-Nearest-Neighbor Algorithm

## Simple Backtest of Forecast Predictions

### Backtest Strategy

## Results and Visualization

## Conclusion