<h1>The Synopsis</h1>




Project Title:
Analyzing the Impact of Cross-Asset Data and Correlation on Price Prediction

Research Questions:
  Does the inclusion of cross-asset technical indicators significantly improve prediction accuracy compared to single-asset models?
    
   Is information from correlated assets the better at enhancing prediction performance than non correlated assets?
    
   Between linear and logistic regression which has better performance thanks to cross asset data?

Brief Synopsis:
The core objective of my research is to investigate whether incorporating data from other assets (cross-asset data) can improve the accuracy of price prediction models. Traditionally, predictive models for an asset (e.g., a stock or currency pair) rely solely on its own historical data and indicators. However, I aim to explore whether including data from related assets (e.g., other stocks or FX pairs) can provide additional predictive power.

For example, if building a linear regression model to predict Microsoft stock prices, instead of using only Microsoft’s indicators (e.g., OHLC, RSI, MACD), I would also incorporate indicators from related assets like Apple or Google stock. This approach extends to FX currency pairs, where I will examine whether data from other pairs (e.g., using EURUSD data to predict GBPUSD) improves predictions.

Data Preparation:

I have dataset with 9 FX currency pairs and 9 stocks. EURUSD, EURGBP, GBPUSD, USDJPY, AUDJPY, EURJPY, AUDUSD, GBPAUD, USDCHF, USDCAD, META, AAPL, AMZN, NFLX, GOOGL, JPM, GS, C, AXP.

Here is the complete list of explanatory variables with their column names in brackets:  

Weighted price average [weighted], Open-Close delta [ocDelta], High-Low delta [hlDelta], High-Low to Open-Close ratio [hl_oc], Percent price change [Percent_Change], Candle direction binary [CandleDirection], Simple Moving Average 14-period [SMA], Exponential Moving Average 14-period [EMA], Double Exponential Moving Average 14-period [DEMA], Triple Exponential Moving Average 14-period [TEMA], Rate of Change 14-period [ROC], Average True Range 14-period [ATR], Normalized Average True Range 14-period [NATR], Bollinger Bands Upper 14-period [BBANDS_UPPER], Bollinger Bands Middle 14-period [BBANDS_MIDDLE], Bollinger Bands Lower 14-period [BBANDS_LOWER], Aroon Up 14-period [Aroon_Up], Aroon Down 14-period [Aroon_Down], Aroon Oscillator 14-period [Aroon_Osc], Average Directional Index 14-period [ADX], Commodity Channel Index 14-period [CCI], Rate of Change Percentage 14-period [ROCP], Williams %R 14-period [WilliamsR], Chaikin Oscillator 14/28-period [ADOSC], Parabolic SAR [SAR], Weighted price minus SMA [w-SMA], Weighted price minus Bollinger Upper [w-UPPER], Weighted price minus Bollinger Lower [w-LOWER], Hammer pattern [CDLHAMMER], Inverted Hammer pattern [CDLINVERTEDHAMMER], Hanging Man pattern [CDLHANGINGMAN], Shooting Star pattern [CDLSHOOTINGSTAR], Engulfing pattern [CDLENGULFING], Harami pattern [CDLHARAMI], Piercing Line pattern [CDLPIERCING], Dark Cloud Cover pattern [CDLDARKCLOUDCOVER], Morning Star pattern [CDLMORNINGSTAR], Evening Star pattern [CDLEVENINGSTAR].  

Each variable is suffixed with the asset symbol (e.g., [SMA_AAPL] for Apple’s Simple Moving Average). Candlestick pattern variables return 0 (no pattern), +100 (bullish), or -100 (bearish).

The response variables in this study consist of two key metrics designed to capture future price movements. The first is NextCandleDirectionR [NextCandleDirectionR], a continuous variable that measures the actual numerical difference between an asset's closing price on the next trading day and its current closing price (calculated as Close_{t+1} - Close_t).The second is NextCandleDirectionC [NextCandleDirectionC], a binary classification variable that simplifies the prediction task by encoding whether the next day's closing price will be higher than the current day's close (1 if Close_{t+1} > Close_t, otherwise 0).

The variables are computed for all 18 assets (9 FX pairs and 9 stocks) and aligned with the timestamped explanatory variables (inner join by date) to ensure consistent temporal relationships in the analysis.

Model Building:

Baseline Models: Conduct  linear regression and  logistic regression for each asset using only its own data.

Cross-Asset Models: Merge all FX datasets (indexed by time) and repeat the process using cross-asset data. Do the same for stocks.

Compare the performance of baseline and cross-asset models to determine if cross-asset data improves predictions.

We decided to use  ElasticNet (L1/L2 balanced) alpha=0.000066, l1_ratio=0.5
because our data is higly multicollinear because it i all derived from the same ohlc and also because ElasticNet can be a form of stepwise regression(feature selection) ,
for both linear and logistic reg . 

Each regression method has metrics it records to be used for comparison between base and full

Linear: adjusted Rsquared Accuracy and AIC, 
Logistic: Accuracy, Precision, Specificity, AIC

For the results dataframe. the rows are each asset and each column is a result metric with suffix _base for base model and _full for full models

different frames for linear and logistic results 

We then conduct sign test to test if the results from base columns are significantly different from cross   

for each cross analysis we save a dictionary or JSON obj of its used features and columns with their absolute something value, this is for possible further analysis as shown by next two points

1. for each model we should look at percentage of columns used since elastic net drops some columns, then for cross models , we should look at how many columns/features are from other assets and express it as a percentage of the features it uses.  

2. Then maybe to take it a step furthur lets say EURUSD is correlated with symbols XYZ lets then see prevalance or dominance of those symbols in 

but these two points are a maybe. we will see ease of implementation when we get there. 

There will be no train test split for the data instead we train on the whole data and test on the whole data 

<h1>Data Acquisition & Feature Engineering</h1>

In [None]:
#Imports
import pandas as pd
import numpy as np
import os
import datetime
from datetime import timedelta
import yfinance as yf
import pyarrow.parquet as pq
import pyarrow as pa
from tqdm import tqdm
import json
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import math
from sklearn.decomposition import PCA, KernelPCA, IncrementalPCA, FastICA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, MDS, TSNE
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import ElasticNet
from sklearn.decomposition import PCA
import statsmodels.api as sm
from scipy.stats import binomtest


import talib as ta
#I used a on wheels install of talib specific to the current python version i am using.
#  Do so to if necessary in future
# pip install talib will typically install the 32bit version of the package which is incompatible with
# most 64 bit computers. So expect an error. The website below should have a command/release for 
# whatever python version you have. NOTE you must tailor the command to your version
# https://github.com/cgohlke/talib-build?tab=readme-ov-file

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.float_format', lambda x: '%.4f' % x)

# Define asset lists
FX = [
    "EURUSD=X", "EURGBP=X", "GBPUSD=X", "USDJPY=X",
    "AUDJPY=X", "EURJPY=X", "AUDUSD=X", "GBPAUD=X",
    "USDCHF=X", "USDCAD=X"
]

STOCKS = [
    "META", "AAPL", "AMZN", "NFLX", "GOOGL",
    "JPM", "GS", "C", "AXP"
]

SYMBOLS = FX + STOCKS

# Parameters for data download
START_DATE = "2020-12-17"
END_DATE = "2025-01-01"
INTERVAL = "1d"

# Directory to save parquet files
SAVE_DIR = "new_yahoo_finance_data"
os.makedirs(SAVE_DIR, exist_ok=True)


In [None]:
#Data Download and pre processing
def process_and_save_data(ticker):
    print(f"Fetching data for {ticker}...")
    df = yf.download(ticker, start=START_DATE, end=END_DATE, interval=INTERVAL)
    
    if df.empty:
        print(f"Data for {ticker} is empty. Skipping...")
        return
    
    # Flattens the data, Keeps only Price Type
    df.columns = [col[0] for col in df.columns]

    #Change index from date to numbers and retain a date column  
    df.reset_index(inplace=True)
    df["Date"] = pd.to_datetime(df["Date"], utc=True)

    # Create derived columns
    df["weighted"] = (df['Open'] + df['High'] + df['Low'] + df['Close']) / 4
    df['ocDelta'] = df['Close'] - df['Open'] #open close delta
    df['hlDelta'] = df['High'] - df['Low']
    df['hl_oc'] = df['hlDelta'] / df['ocDelta']
    df['Percent_Change'] = (df['ocDelta']/df["Open"])*100
    df['CandleDirection'] = (df['Close'] > df['Open']).astype(int)

    # Replace infinite or NaN values
    df.replace([np.inf, -np.inf], 0, inplace=True)

    ## FEATURE CREATION OF INDICATORS AND FORMATIONS

    df[f'NextCandleDirectionR'] = df['Close'].shift(1) - df['Close']
    df['NextCandleDirectionC'] = (df['NextCandleDirectionR'] > 0).astype(int)

    #Averages
    df[f"SMA"] = ta.SMA(df['weighted'], timeperiod=14)
    df[f"EMA"] = ta.EMA(df['weighted'], timeperiod=14)
    df[f"DEMA"] = ta.DEMA(df['weighted'], timeperiod=14)
    df[f"TEMA"] = ta.TEMA(df['weighted'], timeperiod=14)
    df[f"ROC"] = ta.ROC(df['weighted'], timeperiod=14)
    df[f"ATR"] = ta.ATR(df['High'], df['Low'], df['Close'], timeperiod=14)
    df[f"NATR"] = ta.NATR(df['High'], df['Low'], df['Close'], timeperiod=14)
    upper, middle, lower = ta.BBANDS(df['weighted'], timeperiod=14)
    df[f"BBANDS_UPPER"] = upper
    df[f"BBANDS_MIDDLE"] = middle
    df[f"BBANDS_LOWER"] = lower
    #Deviations
    df["w_SMA"] = df["weighted"]-df[f"SMA"]
    df["w_UPPER"] = df["weighted"]-df[f"BBANDS_UPPER"]
    df["w_LOWER"] = df["weighted"]-df[f"BBANDS_LOWER"]
    df[f"Aroon_Up"], df[f"Aroon_Down"] = ta.AROON(df['High'], df['Low'], timeperiod=14)
    df[f"Aroon_Osc"] = ta.AROONOSC(df['High'], df['Low'], timeperiod=14)
    df[f"ADX"] = ta.ADX(df['High'], df['Low'], df['Close'], timeperiod=14)
    df[f"CCI"] = ta.CCI(df['High'], df['Low'], df['Close'], timeperiod=14)
    df[f"ROCP"] = ta.ROCP(df['weighted'], timeperiod=14)
    df[f"WilliamsR"] = ta.WILLR(df['High'], df['Low'], df['Close'], timeperiod=14)
    df[f"ADOSC"] = ta.ADOSC(df['High'], df['Low'], df['Close'], df['Volume'], fastperiod=14, slowperiod=14*2)
    df['SAR'] = ta.SAR(df['High'].values, df['Low'].values)
        
    df = df.replace([np.inf, -np.inf,np.nan], 0)

    df.drop(df.head(14).index,inplace=True)
    df.drop(df.tail(1).index,inplace=True)
    
    df.reset_index(inplace=True, drop=True)
    df = df.replace([np.inf, -np.inf,np.nan], 0)
    #Patterns
    # Hammer
    df["CDLHAMMER"] = ta.CDLHAMMER(df["Open"], df["High"], df["Low"], df["Close"])

    # Inverted Hammer
    df["CDLINVERTEDHAMMER"] = ta.CDLINVERTEDHAMMER(df["Open"], df["High"], df["Low"], df["Close"])

    # Hanging Man
    df["CDLHANGINGMAN"] = ta.CDLHANGINGMAN(df["Open"], df["High"], df["Low"], df["Close"])

    # Shooting Star
    df["CDLSHOOTINGSTAR"] = ta.CDLSHOOTINGSTAR(df["Open"], df["High"], df["Low"], df["Close"])

    # Engulfing Pattern
    df["CDLENGULFING"] = ta.CDLENGULFING(df["Open"], df["High"], df["Low"], df["Close"])

    # Harami Pattern
    df["CDLHARAMI"] = ta.CDLHARAMI(df["Open"], df["High"], df["Low"], df["Close"])

    # Piercing Line
    df["CDLPIERCING"] = ta.CDLPIERCING(df["Open"], df["High"], df["Low"], df["Close"])

    # Dark Cloud Cover
    df["CDLDARKCLOUDCOVER"] = ta.CDLDARKCLOUDCOVER(df["Open"], df["High"], df["Low"], df["Close"], penetration=0)

    # Morning Star
    df["CDLMORNINGSTAR"] = ta.CDLMORNINGSTAR(df["Open"], df["High"], df["Low"], df["Close"], penetration=0)

    # Evening Star
    df["CDLEVENINGSTAR"] = ta.CDLEVENINGSTAR(df["Open"], df["High"], df["Low"], df["Close"], penetration=0)


    # Remove '=X' from ticker symbol for cleaner column names
    clean_ticker = ticker.replace("=X", "")
    
    # Rename columns to include ticker name
    df.columns = [f"{col}_{clean_ticker}" if col != "Date" else "Date" for col in df.columns]
    
    # Save as Parquet
    file_path = os.path.join(SAVE_DIR, f"{clean_ticker}.parquet")
    df.to_parquet(file_path, engine="pyarrow")
    print(f"Saved {ticker} as {file_path}")

# Download and process all assets
for asset in tqdm(FX + STOCKS, desc="Downloading & Processing Data"):
    process_and_save_data(asset)

#declare filtration function
def FilterColumn(partial: list,df: (pd.core.frame.DataFrame),compliment: bool=False):
  """
  Provides filtered list of columns

  Uses list of keywords provided to filter columns of the df provided.
  If compliment is on, the filter excludes the columns having the input
  list of keywords. 

  Parameters
  ----------
  partial (list):list containing keywords/phrases/sections of column names  
  df  (pandas.core.frame.DataFrame): dataframe in which columns are to be filtered
  compliment (bool):False to include the columns containing the keywords, True to exclude such columns
  """

  hasIt = []
  hazIt = list(df.columns)
  
  for p in partial:
    for column in df.columns:
      if(compliment==False):
        if(column.find(p)!=-1):
          hasIt.append(column)

      if(compliment==True):
        if(column.find(p)!=-1):
          hazIt.remove(column)

  if(compliment==False):
    return hasIt
  if(compliment==True):
    return hazIt

#Merging data

# Load FX data
fx_dataframes = []
for fx_ticker in FX:
    clean_ticker = fx_ticker.replace("=X", "")
    file_path = os.path.join(SAVE_DIR, f"{clean_ticker}.parquet")
    df = pd.read_parquet(file_path)
    fx_dataframes.append(df)

# Load Stock data
stock_dataframes = []
for stock_ticker in STOCKS:
    file_path = os.path.join(SAVE_DIR, f"{stock_ticker}.parquet")
    df = pd.read_parquet(file_path)
    stock_dataframes.append(df)

# Function to merge dataframes on the 'Date' column
def merge_dataframes(dataframes):
    merged_df = dataframes[0]  # Start with the first dataframe
    for df in dataframes[1:]:
        merged_df = pd.merge(merged_df, df, on="Date", how="inner")
    return merged_df

# Merge FX dataframes
fx_merged_df = merge_dataframes(fx_dataframes)
fx_merged_df = fx_merged_df[FilterColumn(["Volume"],fx_merged_df,compliment=True)]
# Merge Stock dataframes
stock_merged_df = merge_dataframes(stock_dataframes)

# Merge the FX and Stock dataframes on the 'Date' column
combined_df = pd.merge(fx_merged_df, stock_merged_df, on="Date", how="inner")

# Save the merged dataframes to Parquet files (optional)
fx_merged_df.to_parquet(os.path.join(SAVE_DIR, "fx_merged.parquet"), engine="pyarrow")
stock_merged_df.to_parquet(os.path.join(SAVE_DIR, "stock_merged.parquet"), engine="pyarrow")
combined_df.to_parquet(os.path.join(SAVE_DIR, "combined_fx_stock.parquet"), engine="pyarrow")

<h1>Descriptive Statistics<h1>

In [None]:
# import merged data 
fx_df = pd.read_parquet(os.path.join(SAVE_DIR, "fx_merged.parquet"))
stock_df = pd.read_parquet(os.path.join(SAVE_DIR, "stock_merged.parquet"))
combined_df = pd.read_parquet(os.path.join(SAVE_DIR, "combined_fx_stock.parquet"))



In [3]:
# Check for missing values
missing_counts = combined_df.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
rows_with_missing = combined_df[combined_df.isnull().any(axis=1)].shape[0]

print(f"Columns with missing values: {columns_with_missing}")
print(f"Total rows with missing data: {rows_with_missing}/{len(combined_df)}")

Columns with missing values: []
Total rows with missing data: 0/1000


In [None]:
#Weighted price desc stats
def weighted_price_stats(df, assets):
    """
    Generate descriptive statistics for weighted price across assets.
    
    Parameters:
    - df: Merged DataFrame containing all assets
    - assets: List of asset symbols (e.g., ['EURUSD', 'AAPL'])
    
    Returns:
    - DataFrame with assets as index and statistics as columns
    """
    # Initialize stats dictionary
    stats_list = []
    
    # Calculate statistics for each asset
    for asset in assets:
        col_name = f'weighted_{asset}'
        if col_name not in df.columns:
            continue
            
        series = df[col_name].dropna()
        if len(series) == 0:
            continue
            
        stats = {
            'Asset': asset,
            'Mean': series.mean(),
            'Median': series.median(),
            'Min': series.min(),
            'Max': series.max(),
            'Std': series.std(),
            'Var': series.var(),
            'IQR': series.quantile(0.75) - series.quantile(0.25),
            'Skew': series.skew(),
            'Kurtosis': series.kurtosis(),
            'Count': len(series),
            '5th %ile': series.quantile(0.05),
            '95th %ile': series.quantile(0.95),
            'Range': series.max() - series.min(),
            'CV': series.std() / series.mean()  # Coefficient of variation
        }
        stats_list.append(stats)
    
    # Create and format DataFrame
    stats_df = pd.DataFrame(stats_list)
    stats_df.set_index('Asset', inplace=True)
    
    # Reorder columns logically
    col_order = [
        'Count', 'Mean', 'Median', 'Std', 'Var', 'Min', 'Max', 'Range',
        '5th %ile', '95th %ile', 'IQR', 'Skew', 'Kurtosis', 'CV'
    ]
    return stats_df[col_order]

# Example usage:
fx_symbols = [x.replace("=X", "") for x in FX]
stock_symbols = STOCKS

# Get stats for FX and Stocks separately
fx_stats = weighted_price_stats(fx_df, fx_symbols)
stock_stats = weighted_price_stats(stock_df, stock_symbols)

# Combine and add asset type marker
fx_stats['Type'] = 'FX'
stock_stats['Type'] = 'Stock'
all_stats = pd.concat([fx_stats, stock_stats])

# Format numeric columns
float_cols = all_stats.select_dtypes(include=[float]).columns
all_stats[float_cols] = all_stats[float_cols].round(4)

# Display
print("Weighted Price Descriptive Statistics")
print("="*60)
display(all_stats.sort_values('Type'))

In [None]:
#create visualisations skew
def plot_skew_kurt(stats_df, save_path='figures/skew_kurt_plot.png'):
    """
    Enhanced scatterplot of skewness vs kurtosis with:
    - Boxes (□) for FX pairs
    - Triangles (▲) for equities 
    - Unique colors per asset
    - Professional formatting
    - Saves as high-res PNG
    """
    # Create directory if needed
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    
    plt.figure(figsize=(12, 7), dpi=300)  # High DPI for quality
    
    # Create style mappings
    marker_map = {'FX': 's', 'Stock': '^'}  # □ for FX, ▲ for stocks
    unique_assets = stats_df['Asset'].unique()
    colors = plt.cm.tab20(np.linspace(0, 1, len(unique_assets)))
    
    # Plot each asset
    for i, asset in enumerate(unique_assets):
        asset_data = stats_df[stats_df['Asset'] == asset]
        plt.scatter(
            x=asset_data['Skew'],
            y=asset_data['Kurtosis'],
            s=150,  # Marker size
            marker=marker_map[asset_data['Type'].iloc[0]],
            color=colors[i],
            edgecolor='white',
            linewidth=0.8,
            alpha=0.9,
            label=asset
        )
    
    # Reference lines
    plt.axvline(0, color='gray', linestyle=':', alpha=0.7)
    plt.axhline(0, color='gray', linestyle=':', alpha=0.7)
    
    # Annotations
    plt.text(0.5, 2.5, "Fat Right Tails", ha='center', fontsize=10, color='darkred')
    plt.text(-0.5, 2.5, "Fat Left Tails", ha='center', fontsize=10, color='darkred')
    
    # Highlight special cases
    for asset in ['GS', 'AXP', 'EURUSD']:  # Example highlights
        if asset in stats_df['Asset'].values:
            idx = stats_df[stats_df['Asset'] == asset].index[0]
            plt.annotate(
                asset, 
                (stats_df.loc[idx, 'Skew'], stats_df.loc[idx, 'Kurtosis']),
                textcoords="offset points",
                xytext=(8,5),
                ha='left',
                fontsize=9,
                bbox=dict(boxstyle='round,pad=0.3', fc='white', alpha=0.7)
            )
    
    # Styling
    plt.title('Asset Return Distributions: Skewness vs Kurtosis', pad=20, fontsize=14)
    plt.xlabel('Skewness → Positive = Right-Skewed', fontsize=12)
    plt.ylabel('Excess Kurtosis → Positive = Fat Tails', fontsize=12)
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', frameon=False)
    plt.grid(alpha=0.2)
    sns.despine()
    
    # Save as high-quality PNG
    plt.savefig(save_path, bbox_inches='tight', dpi=300, transparent=False)
    plt.close()
    print(f"Saved to {save_path}")

# Usage
plot_skew_kurt(all_stats.reset_index())

In [None]:
#create visualisations correlation
corr_cmap = LinearSegmentedColormap.from_list(
    'corr_cmap', ['#2A5CAA', 'white', '#B22222'])

def preprocess_data(df):
    """Filter to keep only continuous weighted price columns"""
    return df[[col for col in df.columns if 'weighted_' in col]]

def plot_correlation_matrix(df, title, figsize=(10,8)):
    """Plot cleaned correlation matrix"""
    # Clean column names
    df.columns = [col.replace('weighted_', '') for col in df.columns]
    corr = df.corr()
    
    # Create mask for upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))
    
    plt.figure(figsize=figsize)
    sns.heatmap(
        corr, 
        mask=mask,
        cmap=corr_cmap,
        center=0,
        vmin=-1,
        vmax=1,
        annot=True,
        fmt='.2f',
        annot_kws={'size':8},
        linewidths=0.5
    )
    plt.title(title, pad=20)
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.savefig(f'figures/{title.lower().replace(" ", "_")}.png', 
               dpi=300, bbox_inches='tight')
    plt.close()
    return corr

def get_correlation_tiers(corr_matrix):
    """Identify correlation strength tiers"""
    corr_pairs = corr_matrix.stack()
    corr_pairs = corr_pairs[corr_pairs.index.get_level_values(0) != 
                    corr_pairs.index.get_level_values(1)]  # Remove diagonal
    
    strong = corr_pairs[abs(corr_pairs) > 0.6].sort_values(ascending=False)
    moderate = corr_pairs[(abs(corr_pairs) >= 0.3) & 
                         (abs(corr_pairs) <= 0.6)].sort_values(ascending=False)
    
    return strong, moderate

# 1. FX vs FX correlations
fx_cont = preprocess_data(fx_df)
fx_corr = plot_correlation_matrix(fx_cont, "FX Pairs Correlation Matrix")
fx_strong, fx_moderate = get_correlation_tiers(fx_corr)

# 2. Stocks vs Stocks correlations
stock_cont = preprocess_data(stock_df)
stock_corr = plot_correlation_matrix(stock_cont, "Equities Correlation Matrix")
stock_strong, stock_moderate = get_correlation_tiers(stock_corr)

# 3. FX vs Stocks correlations
combined = pd.concat([fx_cont, stock_cont], axis=1)
cross_corr = combined.corr().loc[fx_cont.columns, stock_cont.columns]
cross_corr.columns = [col.replace('weighted_', '') for col in cross_corr.columns]
cross_corr.index = [col.replace('weighted_', '') for col in cross_corr.index]

plt.figure(figsize=(12,8))
sns.heatmap(
    cross_corr,
    cmap=corr_cmap,
    center=0,
    vmin=-1,
    vmax=1,
    annot=True,
    fmt='.2f',
    annot_kws={'size':8},
    linewidths=0.5
)
plt.title("FX vs Equities Cross-Correlation", pad=20)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.savefig('figures/fx_vs_equities_correlation.png', dpi=300, bbox_inches='tight')
plt.close()

cross_strong, cross_moderate = get_correlation_tiers(cross_corr)

# Print correlation tiers
print("=== STRONG CORRELATIONS ===")
print("FX Pairs:\n", fx_strong.to_string(), "\n")
print("Equities:\n", stock_strong.to_string(), "\n")
print("FX-Stocks:\n", cross_strong.to_string(), "\n")

print("=== MODERATE CORRELATIONS ===")
print("FX Pairs:\n", fx_moderate.to_string(), "\n")
print("Equities:\n", stock_moderate.to_string(), "\n")
print("FX-Stocks:\n", cross_moderate.to_string())

<h1>Inferrential Statistics<h1>

<h2>Preprocess</h2>

In [4]:
# Scaling Data and checking for multivariate normality

# Exclude discrete/categorical columns (e.g., candlestick patterns)
categorical_keywords = ['Date','CDL', 'CandleDirection','Next'] 
non_contin_cols = FilterColumn(categorical_keywords, combined_df, compliment=False)

# Get continuous columns (numeric with >4 distinct values)
continuous_cols = [
    col for col in combined_df.columns 
    if col not in non_contin_cols 
    and np.issubdtype(combined_df[col].dtype, np.number)
    and len(combined_df[col].unique()) > 4
]

continuous_cols

#perform the Henze-Zirkler Multivariate Normality Test (H0: data is multivariate normal)
from pingouin import multivariate_normality
X_continuous = combined_df[continuous_cols]
multivariate_normality(X_continuous, alpha=.05)

#data isnt multivariate continous

# Scale all continuous features uniformly
scaler = RobustScaler()
combined_df[continuous_cols] = scaler.fit_transform(combined_df[continuous_cols])



<h2>Regressions</h2>

In [None]:
#Linear Regressions
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import json
from tqdm import tqdm

# Get clean symbol names
symbols = [s.replace('=X', '') for s in SYMBOLS]

# Define model parameters
l1_ratioA = 0.5  # Fixed L1 ratio
full_cols = FilterColumn(['Date','Next'], combined_df, compliment=True)

# Define alpha search space (wider range with more steps)
alphas = np.logspace(-5, -1, 30)  # From 1e-5 to 1e-1

# Initialize results dataframe with all needed columns
linear_results = pd.DataFrame(
    index=symbols,
    columns=[
        'RMSE_base','RMSE_full', 'R2_base','R2_full', 'Adj_R2_base','Adj_R2_full', 
        'AIC_base','AIC_full', 'Features_used_base', 'Features_used_full',
        'Cross_asset_features_used', 'Cross_asset_pctage', 'Alpha_base', 'Alpha_full'
    ],
    data=151515
)

# Initialize feature importance storage
feature_importance = {
    "linear": {
        sym: {
            "base": {},
            "full": {}
        } for sym in symbols
    }
}

for symb in tqdm(symbols, desc="Processing assets"):
    # Prepare data for current symbol
    base_cols = [s for s in full_cols if symb in s]
    X_base = combined_df[base_cols]
    X_full = combined_df[full_cols]
    y = combined_df[f"NextCandleDirectionR_{symb}"]
    
    # ===== ALPHA OPTIMIZATION WITH CONVERGENCE HANDLING =====
    def find_best_alpha(X, y, model_type='base'):
        best_alpha = None
        best_score = -np.inf
        best_model = None
        
        for alpha in alphas:
            try:
                model = ElasticNet(
                    alpha=alpha, 
                    l1_ratio=l1_ratioA,
                    random_state=42,
                    max_iter=100000,
                    selection='random'  # Helps with convergence
                )
                model.fit(X, y)
                
                # Calculate score (using R² adjusted for fair comparison)
                n = X.shape[0]
                p = np.sum(model.coef_ != 0)
                r2 = model.score(X, y)
                adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
                
                if adj_r2 > best_score:
                    best_score = adj_r2
                    best_alpha = alpha
                    best_model = model
                    
            except Exception as e:
                print(f"Warning: Failed for {symb} {model_type} with alpha {alpha:.2e}: {str(e)}")
                continue
                
        return best_alpha, best_model
    
    # Find best alpha for baseline and full models
    best_alpha_base, base_model = find_best_alpha(X_base, y, 'base')
    best_alpha_full, full_model = find_best_alpha(X_full, y, 'full')
    
    # ===== CALCULATE ALL METRICS AND STORE FEATURE IMPORTANCE =====
    def calculate_metrics(model, X, y, alpha, model_type):
        # Basic metrics
        pred = model.predict(X)
        rmse = np.sqrt(mean_squared_error(y, pred))
        r2 = model.score(X, y)
        
        # Adjusted R²
        n = X.shape[0]
        p = np.sum(model.coef_ != 0)
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
        
        # AIC
        RSS = np.sum((y - pred) ** 2)
        aic = n * np.log(RSS / n) + 2 * p
        
        # Feature usage
        features_used = np.sum(model.coef_ != 0)
        
        # Store feature importance
        feature_importance["linear"][symb][model_type] = {
            col: float(coef)  # Convert numpy to native Python float
            for col, coef in zip(X.columns, model.coef_)
            if coef != 0  # Only store non-zero coefficients
        }
        
        return {
            'RMSE': rmse,
            'R2': r2,
            'Adj_R2': adj_r2,
            'AIC': aic,
            'Features_used': features_used,
            'Alpha': alpha
        }
    
    # Calculate and store baseline metrics
    base_metrics = calculate_metrics(base_model, X_base, y, best_alpha_base, 'base')
    linear_results.loc[symb, 'RMSE_base'] = base_metrics['RMSE']
    linear_results.loc[symb, 'R2_base'] = base_metrics['R2']
    linear_results.loc[symb, 'Adj_R2_base'] = base_metrics['Adj_R2']
    linear_results.loc[symb, 'AIC_base'] = base_metrics['AIC']
    linear_results.loc[symb, 'Features_used_base'] = base_metrics['Features_used']
    linear_results.loc[symb, 'Alpha_base'] = base_metrics['Alpha']
    
    # Calculate and store full model metrics
    full_metrics = calculate_metrics(full_model, X_full, y, best_alpha_full, 'full')
    linear_results.loc[symb, 'RMSE_full'] = full_metrics['RMSE']
    linear_results.loc[symb, 'R2_full'] = full_metrics['R2']
    linear_results.loc[symb, 'Adj_R2_full'] = full_metrics['Adj_R2']
    linear_results.loc[symb, 'AIC_full'] = full_metrics['AIC']
    linear_results.loc[symb, 'Features_used_full'] = full_metrics['Features_used']
    linear_results.loc[symb, 'Alpha_full'] = full_metrics['Alpha']
    
    # Cross-asset features
    cross_features = sum(1 for col in X_full.columns[full_model.coef_ != 0] if symb not in col)
    linear_results.loc[symb, 'Cross_asset_features_used'] = cross_features
    linear_results.loc[symb, 'Cross_asset_pctage'] = (cross_features / full_metrics['Features_used'] * 100 
                                                   if full_metrics['Features_used'] > 0 else 0)

# Display results
print("\nFinal Linear Regression Results:")
display(linear_results)

# Save results
linear_results.to_csv('linear_regression_results_optimized.csv')

# Save feature importance
with open('linear_feature_importance.json', 'w') as f:
    json.dump(feature_importance, f, indent=2)

print("\nFeature importance saved to linear_feature_importance.json")

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = c


Final Linear Regression Results:





Unnamed: 0,RMSE_base,RMSE_full,R2_base,R2_full,Adj_R2_base,Adj_R2_full,AIC_base,AIC_full,Features_used_base,Features_used_full,Cross_asset_features_used,Cross_asset_pctage,Alpha_base,Alpha_full
EURUSD,0.0037,0.0024,0.4471,0.7653,0.4312,0.6066,-11123.2059,-11229.9629,28,403,380,94.2928,0.0,0.0
EURGBP,0.0039,0.0026,0.5113,0.7715,0.4982,0.6157,-11067.1703,-11069.3748,26,405,378,93.3333,0.0,0.0
GBPUSD,0.0049,0.003,0.4683,0.805,0.4502,0.6612,-10560.3189,-10781.2847,33,424,404,95.283,0.0,0.0
USDJPY,0.6227,0.3522,0.4382,0.8203,0.4178,0.5454,-877.3138,-878.9423,35,604,570,94.3709,0.0,0.0001
AUDJPY,0.4756,0.2641,0.4327,0.8251,0.4115,0.6065,-1414.4184,-1553.1986,36,555,526,94.7748,0.0,0.0001
EURJPY,0.6309,0.3582,0.444,0.8207,0.4232,0.5432,-849.3679,-839.418,36,607,574,94.5634,0.0,0.0001
AUDUSD,0.0034,0.0021,0.4311,0.7771,0.4183,0.6403,-11320.7768,-11541.8767,22,380,358,94.2105,0.0,0.0
GBPAUD,0.0064,0.004,0.4484,0.7853,0.4319,0.6245,-10052.9425,-10198.7996,29,428,404,94.3925,0.0,0.0
USDCHF,0.0032,0.0021,0.4869,0.7829,0.4721,0.6434,-11424.6098,-11558.8927,28,391,366,93.6061,0.0,0.0
USDCAD,0.004,0.0025,0.4485,0.7818,0.4331,0.6312,-11000.2723,-11165.5395,27,408,386,94.6078,0.0,0.0



Feature importance saved to linear_feature_importance.json


In [None]:
#LogisticRegressions
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
import numpy as np
import pandas as pd
import json
from tqdm import tqdm

# Initialize symbols and results structures
symbols = [s.replace('=X', '') for s in SYMBOLS]
full_cols = FilterColumn(['Date','Next'], combined_df, compliment=True)
l1_ratioA = 0.5

# Initialize results DataFrame
logistic_results = pd.DataFrame(
    index=symbols,
    columns=[
        'Accuracy_base', 'Precision_base', 'Specificity_base', 'AIC_base', 
        'Features_used_base', 'Alpha_base',
        'Accuracy_full', 'Precision_full', 'Specificity_full', 'AIC_full',
        'Features_used_full', 'Alpha_full',
        'Cross_asset_features_used', 'Cross_asset_pct'
    ],
    data=151515
)

# Initialize feature storage dictionary
feature_storage = {
    "logistic": {
        sym: {"base": {}, "full": {}} for sym in symbols
    }
}

# Define alpha search space
alphas = np.logspace(-3, 1, 20)  # Wider range for logistic regression

for symb in tqdm(symbols, desc="Processing assets"):
    # Prepare data
    base_cols = [s for s in full_cols if symb in s]
    X_base = combined_df[base_cols]
    X_full = combined_df[full_cols]
    y = combined_df[f"NextCandleDirectionC_{symb}"]
    
    # ===== ALPHA OPTIMIZATION =====
    def find_best_logistic_alpha(X, y, model_type):
        best_alpha = None
        best_score = -np.inf
        best_model = None
        
        for alpha in alphas:
            try:
                model = LogisticRegression(
                    penalty='elasticnet',
                    solver='saga',
                    l1_ratio=l1_ratioA,
                    C=1/(alpha * len(y)),
                    random_state=42,
                    max_iter=10000,
                    class_weight='balanced'
                )
                model.fit(X, y)
                
                # Use balanced accuracy as selection criteria
                pred = model.predict(X)
                tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
                specificity = tn / (tn + fp)
                precision = precision_score(y, pred)
                score = (accuracy_score(y, pred) + specificity + precision) / 3  # Combined metric
                
                if score > best_score:
                    best_score = score
                    best_alpha = alpha
                    best_model = model
                    
            except Exception as e:
                continue
                
        return best_alpha, best_model
    
    # Find optimal alphas
    best_alpha_base, base_model = find_best_logistic_alpha(X_base, y, 'base')
    best_alpha_full, full_model = find_best_logistic_alpha(X_full, y, 'full')
    
    # ===== METRIC CALCULATION =====
    def calculate_logistic_metrics(model, X, y, alpha):
        # Predictions
        pred = model.predict(X)
        proba = model.predict_proba(X)[:, 1]
        
        # Basic metrics
        tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
        metrics = {
            'Accuracy': accuracy_score(y, pred),
            'Precision': precision_score(y, pred),
            'Specificity': tn / (tn + fp),
            'Features_used': np.sum(model.coef_ != 0),
            'Alpha': alpha
        }
        
        # AIC calculation
        n = len(y)
        k = np.sum(model.coef_ != 0) + 1
        ll = model.score(X, y) * n  # Approximate log-likelihood
        metrics['AIC'] = 2 * k - 2 * ll
        
        # Feature importance
        features = {
            col: float(coef) 
            for col, coef in zip(X.columns, model.coef_[0])
            if coef != 0
        }
        
        return metrics, features
    
    # Calculate and store baseline metrics
    base_metrics, base_features = calculate_logistic_metrics(base_model, X_base, y, best_alpha_base)
    for metric, value in base_metrics.items():
        logistic_results.loc[symb, f'{metric}_base'] = value
    feature_storage["logistic"][symb]["base"] = base_features
    
    # Calculate and store full model metrics
    full_metrics, full_features = calculate_logistic_metrics(full_model, X_full, y, best_alpha_full)
    for metric, value in full_metrics.items():
        logistic_results.loc[symb, f'{metric}_full'] = value
    feature_storage["logistic"][symb]["full"] = full_features
    
    # Cross-asset features
    cross_features = sum(1 for col in full_features if symb not in col)
    logistic_results.loc[symb, 'Cross_asset_features_used'] = cross_features
    logistic_results.loc[symb, 'Cross_asset_pct'] = (cross_features / full_metrics['Features_used'] * 100 
                                                   if full_metrics['Features_used'] > 0 else 0)

# Save results
logistic_results.to_csv('logistic_regression_results_optimized.csv')

# Save feature importance
with open('logistic_feature_importance.json', 'w') as f:
    json.dump(feature_storage, f, indent=2)

print("\nLogistic Regression Results:")
display(logistic_results)
print("\nFeature importance saved to logistic_feature_importance.json")

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_pr


Logistic Regression Results:





Unnamed: 0,Accuracy_base,Precision_base,Specificity_base,AIC_base,Features_used_base,Alpha_base,Accuracy_full,Precision_full,Specificity_full,AIC_full,Features_used_full,Alpha_full,Cross_asset_features_used,Cross_asset_pct
EURUSD,0.677,0.6839,0.6598,-1280,36,0.0016,0.792,0.8101,0.8062,-344,619,0.0026,584,94.3457
EURGBP,0.72,0.7212,0.7041,-1368,35,0.001,0.768,0.7791,0.7755,-88,723,0.001,685,94.7441
GBPUSD,0.753,0.7565,0.754,-1448,28,0.0043,0.804,0.8143,0.8165,-378,614,0.0026,575,93.6482
USDJPY,0.713,0.6645,0.7305,-1360,32,0.0026,0.784,0.734,0.7784,-138,714,0.001,678,94.958
AUDJPY,0.689,0.6659,0.7222,-1358,9,0.2069,0.815,0.7919,0.8185,-464,582,0.0026,553,95.0172
EURJPY,0.711,0.6859,0.7263,-1360,30,0.0043,0.789,0.7551,0.7747,-144,716,0.001,682,95.2514
AUDUSD,0.664,0.6602,0.6493,-1254,36,0.0016,0.816,0.8323,0.8397,-380,625,0.0026,591,94.56
GBPAUD,0.546,1.0,1.0,-1084,3,0.8859,0.774,0.7674,0.7683,-476,535,0.0043,498,93.0841
USDCHF,0.538,1.0,1.0,-1070,2,0.336,0.78,0.759,0.7701,-110,724,0.001,688,95.0276
USDCAD,0.554,0.6273,0.883,-1096,5,0.336,0.796,0.787,0.7953,-166,712,0.001,678,95.2247



Feature importance saved to logistic_feature_importance.json


In [None]:
#maybe visualisations
#1. Scattergraph of predicted vs actual for both baseline and full
#2. Confusion Matrix box

<h2>Sign Test of Results</h2>

In [None]:

# Load your results (replace with your path)
df = pd.read_csv("linear_regression_results_optimized.csv")

# Compute differences (Full - Base)
df["RMSE_diff"] = df["RMSE_full"] - df["RMSE_base"]  # Lower RMSE = better → Negative diff = improvement
df["Adj_R2_diff"] = df["Adj_R2_full"] - df["Adj_R2_base"]        # Higher R² = better → Positive diff = improvement
df["AIC_diff"] = df["AIC_full"] - df["AIC_base"]     # Lower AIC = better → Negative diff = improvement

n_better_rmse = (df["RMSE_diff"] < 0).sum()  # Count where Full model has lower RMSE
n_total_rmse = (df["RMSE_diff"] != 0).sum()  # Ignore ties (if any)

# One-sided test: Is Full model better more often than chance?
result_rmse = binomtest(n_better_rmse, n_total_rmse, p=0.5, alternative='two-sided')
print(f"RMSE: p-value = {result_rmse.pvalue:.4f}")

n_better_Adj_R2 = (df["Adj_R2_diff"] > 0).sum()  # Count where Full model has higher R²
n_total_Adj_R2 = (df["Adj_R2_diff"] != 0).sum()  # Ignore ties

result_Adj_R2 = binomtest(n_better_Adj_R2, n_total_Adj_R2, p=0.5, alternative='two-sided')
print(f"Adj_R²: p-value = {result_Adj_R2.pvalue:.4f}")

n_better_aic = (df["AIC_diff"] < 0).sum()  # Count where Full model has lower AIC
n_total_aic = (df["AIC_diff"] != 0).sum()  # Ignore ties

result_aic = binomtest(n_better_aic, n_total_aic, p=0.5, alternative='two-sided')
print(f"AIC: p-value = {result_aic.pvalue:.4f}")

RMSE: p-value = 0.0000
R²: p-value = 0.0001
AIC: p-value = 1.0000


In [15]:
df = pd.read_csv("logistic_regression_results_optimized.csv")
df.columns

df["Accuracy_diff"] = df["Accuracy_full"] - df["Accuracy_base"]
n_better_Accuracy = (df["Accuracy_diff"] < 0).sum()  # Count where Full model has lower Accuracy
n_total_Accuracy = (df["Accuracy_diff"] != 0).sum()  # Ignore ties (if any)
result_Accuracy = binomtest(n_better_Accuracy, n_total_Accuracy, p=0.5, alternative='two-sided')
print(f"Accuracy: p-value = {result_Accuracy.pvalue:.4f}")

df["Precision_diff"] = df["Precision_full"] - df["Precision_base"]
n_better_Precision = (df["Precision_diff"] < 0).sum()  # Count where Full model has lower Precision
n_total_Precision = (df["Precision_diff"] != 0).sum()  # Ignore ties (if any)
result_Precision = binomtest(n_better_Precision, n_total_Precision, p=0.5, alternative='two-sided')
print(f"Precision: p-value = {result_Precision.pvalue:.4f}")

df["Specificity_diff"] = df["Specificity_full"] - df["Specificity_base"]
n_better_Specificity = (df["Specificity_diff"] < 0).sum()  # Count where Full model has lower Specificity
n_total_Specificity = (df["Specificity_diff"] != 0).sum()  # Ignore ties (if any)
result_Specificity = binomtest(n_better_Specificity, n_total_Specificity, p=0.5, alternative='two-sided')
print(f"Specificity: p-value = {result_Specificity.pvalue:.4f}")

df["AIC_diff"] = df["AIC_full"] - df["AIC_base"]
n_better_AIC = (df["AIC_diff"] < 0).sum()  # Count where Full model has lower AIC
n_total_AIC = (df["AIC_diff"] != 0).sum()  # Ignore ties (if any)
result_AIC = binomtest(n_better_AIC, n_total_AIC, p=0.5, alternative='two-sided')
print(f"AIC: p-value = {result_AIC.pvalue:.4f}")

Accuracy: p-value = 0.0075
Precision: p-value = 0.1671
Specificity: p-value = 0.6476
AIC: p-value = 0.0000
