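Association analysis is a family of statistical methods for analyzing relationships between variables in a dataset. It is mainly used to discover correlations, dependencies, or links between variables, and to identify candidate association rules. The best-known and most widely used technique is association rule mining, whose algorithms include Apriori and FP-Growth.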
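
To make the idea concrete, here is a minimal pure-Python sketch of the level-wise search behind Apriori. It is illustrative only and is not how `efficient_apriori` is implemented; the function name `frequent_itemsets` and the small `baskets` list are made up for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        result.update({s: support(s) for s in current})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Hypothetical toy baskets, only for illustration.
baskets = [
    ['bread', 'milk'],
    ['bread', 'butter', 'milk'],
    ['bread', 'butter'],
    ['milk'],
]
found = frequent_itemsets(baskets, min_support=0.5)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most large candidates are discarded without counting them.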

In [1]:
from efficient_apriori import apriori

import pandas as pd
import yfinance as yf

import time 

In [2]:
data = [
	['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
	['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
	['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
	['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
	['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']
]

In [3]:
itemsets, rules = apriori(data, min_support=0.5,  min_confidence=1) 
rules

[{Eggs} -> {Kidney Beans},
 {Onion} -> {Eggs},
 {Milk} -> {Kidney Beans},
 {Onion} -> {Kidney Beans},
 {Yogurt} -> {Kidney Beans},
 {Kidney Beans, Onion} -> {Eggs},
 {Eggs, Onion} -> {Kidney Beans},
 {Onion} -> {Eggs, Kidney Beans}]

## Try on stocks

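The idea is to build a dataset of scenes in which events "occur together", then let an association algorithm find the events that frequently co-occur. For example, treat each stock's rise or fall in one period and each stock's rise or fall in the following period as parts of the same scene, so that they count as occurring together. The code below uses 5-minute bars, so a period here is one bar.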

In [4]:
tickers = ['^SPX', 'NVDA', 'MSFT', 'AAPL', 'META']

df = yf.download(tickers, period="60d", interval="5m")

df["Close"]

[*********************100%%**********************]  5 of 5 completed


Datetime,AAPL,META,MSFT,NVDA,^SPX
2023-11-21 09:30:00-05:00,191.210007,338.059998,374.934998,502.459991,
2023-11-21 09:35:00-05:00,190.919998,338.605499,375.749390,504.260010,
2023-11-21 09:40:00-05:00,190.764999,338.660004,376.000000,503.135010,
2023-11-21 09:45:00-05:00,190.919998,338.839996,375.214996,500.510010,
2023-11-21 09:50:00-05:00,191.021194,339.184998,375.720001,501.600006,
...,...,...,...,...,...
2024-02-16 15:35:00-05:00,182.119995,473.484985,404.149994,730.851990,5012.060059
2024-02-16 15:40:00-05:00,182.054993,473.290100,403.910004,728.770020,5006.509766
2024-02-16 15:45:00-05:00,181.735001,472.170013,403.769989,725.760010,5000.839844
2024-02-16 15:50:00-05:00,181.964005,473.390015,403.703094,727.234985,5005.089844


By default, `yf.download` groups the columns by field (Open, Close, ...); use `df.Close` or `df['Close']` to select the Close column for every ticker.
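
The selection works because the columns form a two-level index. The sketch below reproduces that layout on a small hand-built frame, assuming the (field, ticker) column order seen in the output above; the tickers `AAA` and `BBB` and the prices are invented.

```python
import pandas as pd

# Hypothetical frame with (field, ticker) column pairs, mimicking the download layout.
columns = pd.MultiIndex.from_product([['Close', 'Open'], ['AAA', 'BBB']])
prices = pd.DataFrame(
    [[10.0, 20.0, 9.5, 19.0],
     [10.5, 19.5, 10.0, 20.0]],
    columns=columns,
)

# Selecting the top-level key returns one column per ticker.
close = prices['Close']

# Equivalent, more explicit form: take a cross-section on the first column level.
same = prices.xs('Close', axis=1, level=0)
```

After the selection, `close.columns` holds only the ticker names, which is why the later cells can use them directly as label prefixes.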

In [5]:
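# ^SPX has no value in some rows (see the table above); fillna(0) replaces those gaps with 0.
data = df['Close'].fillna(0)

# Label each change per ticker column (r.name is the ticker). The first row (diff is NaN)
# and unchanged prices are both labelled DOWN, because only x > 0 counts as UP.
updown = data.diff().apply(lambda r: r.apply(lambda x: r.name + '_UP' if x > 0 else r.name + '_DOWN'))

updown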

Datetime,AAPL,META,MSFT,NVDA,^SPX
2023-11-21 09:30:00-05:00,AAPL_DOWN,META_DOWN,MSFT_DOWN,NVDA_DOWN,^SPX_DOWN
2023-11-21 09:35:00-05:00,AAPL_DOWN,META_UP,MSFT_UP,NVDA_UP,^SPX_DOWN
2023-11-21 09:40:00-05:00,AAPL_DOWN,META_UP,MSFT_UP,NVDA_DOWN,^SPX_DOWN
2023-11-21 09:45:00-05:00,AAPL_UP,META_UP,MSFT_DOWN,NVDA_DOWN,^SPX_DOWN
2023-11-21 09:50:00-05:00,AAPL_UP,META_UP,MSFT_UP,NVDA_UP,^SPX_DOWN
...,...,...,...,...,...
2024-02-16 15:35:00-05:00,AAPL_DOWN,META_UP,MSFT_DOWN,NVDA_DOWN,^SPX_DOWN
2024-02-16 15:40:00-05:00,AAPL_DOWN,META_DOWN,MSFT_DOWN,NVDA_DOWN,^SPX_DOWN
2024-02-16 15:45:00-05:00,AAPL_DOWN,META_DOWN,MSFT_DOWN,NVDA_DOWN,^SPX_DOWN
2024-02-16 15:50:00-05:00,AAPL_UP,META_UP,MSFT_DOWN,NVDA_UP,^SPX_UP


Convert the numbers to labels according to whether each bar's close rose or fell relative to the previous bar.
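
The labelling rule is easiest to see on a few hand-picked prices. This is a minimal sketch that follows the same rule as the In [5] cell (a change greater than 0 is UP, anything else is DOWN) but uses `np.where` instead of nested `apply`; the tickers and prices are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for two made-up tickers.
close = pd.DataFrame({
    'AAA': [10.0, 10.5, 10.5, 10.2],
    'BBB': [20.0, 19.0, 19.5, 19.5],
})

# Change relative to the previous row; the first row has no predecessor, so it is NaN.
change = close.diff()

# One label per cell: '<ticker>_UP' when the change is positive, '<ticker>_DOWN' otherwise.
labels = pd.DataFrame(
    {col: np.where(change[col] > 0, col + '_UP', col + '_DOWN') for col in close.columns},
    index=close.index,
)
```

Note that the first row and any unchanged price fall into the DOWN bucket, since a NaN or zero change is not greater than 0.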

In [6]:
updown['Next ^SPX'] = "N_" + updown['^SPX'].shift(-1)

updown = updown[:-1]

updown

Datetime,AAPL,META,MSFT,NVDA,^SPX,Next ^SPX
2023-11-21 09:30:00-05:00,AAPL_DOWN,META_DOWN,MSFT_DOWN,NVDA_DOWN,^SPX_DOWN,N_^SPX_DOWN
2023-11-21 09:35:00-05:00,AAPL_DOWN,META_UP,MSFT_UP,NVDA_UP,^SPX_DOWN,N_^SPX_DOWN
2023-11-21 09:40:00-05:00,AAPL_DOWN,META_UP,MSFT_UP,NVDA_DOWN,^SPX_DOWN,N_^SPX_DOWN
2023-11-21 09:45:00-05:00,AAPL_UP,META_UP,MSFT_DOWN,NVDA_DOWN,^SPX_DOWN,N_^SPX_DOWN
2023-11-21 09:50:00-05:00,AAPL_UP,META_UP,MSFT_UP,NVDA_UP,^SPX_DOWN,N_^SPX_DOWN
...,...,...,...,...,...,...
2024-02-16 15:30:00-05:00,AAPL_DOWN,META_DOWN,MSFT_UP,NVDA_DOWN,^SPX_DOWN,N_^SPX_DOWN
2024-02-16 15:35:00-05:00,AAPL_DOWN,META_UP,MSFT_DOWN,NVDA_DOWN,^SPX_DOWN,N_^SPX_DOWN
2024-02-16 15:40:00-05:00,AAPL_DOWN,META_DOWN,MSFT_DOWN,NVDA_DOWN,^SPX_DOWN,N_^SPX_DOWN
2024-02-16 15:45:00-05:00,AAPL_DOWN,META_DOWN,MSFT_DOWN,NVDA_DOWN,^SPX_DOWN,N_^SPX_UP


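Shifting the next bar's SPX move up by one row puts each stock's move and the following SPX move on the same row, which is equivalent to the two "occurring together".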

The last row has no next-bar data, so it is dropped.
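
The alignment can be checked on a three-row example. This sketch mirrors the `shift(-1)` step from the In [6] cell; the column name `X` and its labels are placeholders.

```python
import pandas as pd

# Hypothetical label column for a made-up ticker X.
frame = pd.DataFrame({'X': ['X_UP', 'X_DOWN', 'X_UP']})

# shift(-1) moves every value up one row, so row i now also carries row i+1's label.
frame['Next X'] = 'N_' + frame['X'].shift(-1)

# The last row has nothing after it (its 'Next X' is missing), so drop it.
aligned = frame[:-1]
```

Each remaining row now reads as one scene: the current move in `X` together with the move that followed it.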

In [7]:
start_time = time.time()
itemsets, rules = apriori(updown.values, min_support=0.2,  min_confidence=0.5)
print("--- run time: %s seconds ---" % (time.time() - start_time))

--- run time: 0.02684187889099121 seconds ---


In [8]:
rules_rhs = filter(lambda rule: len(rule.lhs) == 1 and rule.rhs[0].startswith('N_'), rules)

Keep only rules with a single-item antecedent whose target event is the next bar's SPX move, that is, N_^SPX_UP or N_^SPX_DOWN.

In [9]:
for rule in sorted(rules_rhs, key=lambda rule: rule.confidence, reverse=True):
  print(rule)

{^SPX_UP} -> {N_^SPX_UP} (conf: 0.522, supp: 0.265, lift: 1.028, conv: 1.030)
{META_DOWN} -> {N_^SPX_UP} (conf: 0.521, supp: 0.259, lift: 1.026, conv: 1.028)
{AAPL_DOWN} -> {N_^SPX_UP} (conf: 0.518, supp: 0.252, lift: 1.019, conv: 1.020)
{NVDA_UP} -> {N_^SPX_UP} (conf: 0.516, supp: 0.268, lift: 1.015, conv: 1.016)
{MSFT_UP} -> {N_^SPX_UP} (conf: 0.509, supp: 0.263, lift: 1.002, conv: 1.002)
{MSFT_DOWN} -> {N_^SPX_UP} (conf: 0.507, supp: 0.245, lift: 0.998, conv: 0.998)
{^SPX_DOWN} -> {N_^SPX_DOWN} (conf: 0.506, supp: 0.249, lift: 1.030, conv: 1.030)
{META_UP} -> {N_^SPX_DOWN} (conf: 0.505, supp: 0.254, lift: 1.026, conv: 1.026)
{AAPL_UP} -> {N_^SPX_DOWN} (conf: 0.501, supp: 0.257, lift: 1.018, conv: 1.018)
{NVDA_DOWN} -> {N_^SPX_DOWN} (conf: 0.500, supp: 0.240, lift: 1.017, conv: 1.017)
{NVDA_DOWN} -> {N_^SPX_UP} (conf: 0.500, supp: 0.240, lift: 0.984, conv: 0.983)


Two concepts: support and confidence

- support is the fraction of all scenes in which both events occur, i.e. support(A -> B) = count(A & B) / count(ALL)
- confidence is the fraction of scenes containing A in which B also occurs, i.e. confidence(A -> B) = count(A & B) / count(A)
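
Both definitions can be checked by hand on the grocery transactions from the In [2] cell. The helper `rule_metrics` below is a hypothetical name for this sketch, not part of `efficient_apriori`; it also computes lift, which appears in the printed rules, using the standard definition confidence(A -> B) / support(B).

```python
# Transactions copied from the In [2] cell.
data = [
    ['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'],
]


def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) of the rule lhs -> rhs."""
    baskets = [set(t) for t in transactions]  # duplicates within a basket are ignored
    lhs, rhs = set(lhs), set(rhs)
    n_all = len(baskets)
    n_lhs = sum(lhs <= b for b in baskets)            # count(A)
    n_rhs = sum(rhs <= b for b in baskets)            # count(B)
    n_both = sum((lhs | rhs) <= b for b in baskets)   # count(A & B)
    support = n_both / n_all
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n_all)
    return support, confidence, lift


support, confidence, lift = rule_metrics(data, {'Onion'}, {'Eggs'})
```

Onion appears in three of the five baskets and Eggs is in all three of them, so the rule has support 3/5 and confidence 3/3, which is why it passes the `min_support=0.5` and `min_confidence=1` thresholds used in In [3]. A lift close to 1, as in the stock rules above, means the antecedent barely changes the probability of the consequent.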

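With the final results sorted by confidence, the single-factor rules for the SPX move all rank near the top, and multi-factor rules rarely appear (the filter in In [8] keeps only rules with a single-item antecedent).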