# Factor Mining-based Machine Learning Stock Trading Strategy

This project aims to develop an efficient stock trading strategy using machine learning and factor mining techniques. Our approach combines traditional technical analysis indicators with advanced time series feature extraction methods to capture potential alpha factors in the market.

The project workflow includes:
1. Data Acquisition and Preprocessing: Using the qstock library to obtain historical stock data and perform initial cleaning.
2. Feature Engineering: Utilizing TA-Lib to calculate traditional technical indicators such as RSI, MACD, etc.
3. Factor Mining: Applying the tsfresh library for large-scale time series feature extraction to uncover potential alpha factors.
4. Model Training: Building predictive models using SVM or Random Forest algorithms to identify the most predictive factor combinations.
5. Strategy Backtesting: Constructing trading strategies based on model predictions and conducting historical backtests to evaluate strategy effectiveness.

Through this project, we aim to demonstrate how modern data science techniques can be applied to quantitative investment, and how to systematically mine and validate potential alpha factors. This approach is not only applicable to individual stock analysis but can also be extended to broader markets and asset classes.

Note: This project is for educational and research purposes only and does not constitute any investment advice. Before applying any strategy in actual trading, thorough risk assessment and additional validation are necessary.

# Data Acquisition and Preprocessing

This cell performs the following tasks:
1. Imports qstock, which is used to fetch the stock data.
2. Uses qstock to download daily historical price data for Apple Inc. (AAPL) from 2005-04-08 to 2024-03-24.
3. Cleans the data: drops the 'name' column, sorts by date, forward-fills missing values and drops any rows that are still incomplete, then copies the date from the index into a string-formatted 'date' column and resets the index.

In [None]:
import qstock as qs

# Fetch daily open/high/low/close, volume, turnover and turnover-rate data for AAPL; the index is the date
# data = qs.get_data(code_list=['AAPL','NVDA','MSFT'], start='20050408', end='20240324', freq='d')
data = qs.get_data(code_list=['AAPL'], start='20050408', end='20240324', freq='d')
# Drop the name column, sort by date, forward-fill gaps and drop rows that are still missing values
data = data.drop(columns=['name']).sort_index().ffill().dropna()
# Insert the date as the first column
data.insert(0, 'date', data.index)
# Convert the dates from datetime to 'YYYY-MM-DD' strings
data['date'] = data['date'].apply(lambda x: x.strftime('%Y-%m-%d'))
data = data.reset_index(drop=True)

print(data.shape)
data.tail(10)
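
The cleaning step above forward-fills gaps and then drops whatever is still missing. A minimal, dependency-free sketch of that idea on a plain list follows; the helper name ffill_then_drop is hypothetical and is not part of qstock or pandas.

```python
def ffill_then_drop(values):
    """Forward-fill None gaps, then drop leading entries that had nothing to copy from."""
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v          # remember the most recent real observation
        filled.append(last)   # gaps inherit the previous observation
    # Entries before the first real observation are still None and get dropped
    return [v for v in filled if v is not None]


cleaned = ffill_then_drop([None, 10.0, None, 12.5])
```

Only a leading gap is removed by the final drop, because every later gap has an earlier value to inherit.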

# Feature Engineering

This cell performs the following operations:
1. Uses the talib library to calculate several technical indicators: the 5-day linear-regression slope of the close, the 14-day Relative Strength Index (RSI), the 7-day Williams %R, MACD (DIF, DEA and histogram, with 12/26/9 periods), and Parabolic SAR.
2. Drops the raw open, high and low columns, keeping the calculated indicators and the remaining columns.
3. Forward-fills, then drops the rows with missing values that the indicator warm-up periods introduce.

In [None]:
import talib

# Slope of the closing price (5-day linear regression)
data['slope'] = talib.LINEARREG_SLOPE(data['close'].values, timeperiod=5)
# Relative Strength Index
data['rsi'] = talib.RSI(data['close'].values, timeperiod=14)
# Williams %R
data['wr'] = talib.WILLR(data['high'].values, data['low'].values, data['close'].values, timeperiod=7)
# MACD: DIF line, DEA (signal) line and MACD histogram
data['dif'], data['dea'], data['macd'] = talib.MACD(data['close'].values, fastperiod=12, slowperiod=26, signalperiod=9)
# Parabolic SAR
data['sar'] = talib.SAR(data['high'].values, data['low'].values)
# Drop open, high and low; forward-fill, then drop the warm-up rows that are still NaN
data = data.drop(columns=['open','high','low']).ffill().dropna().reset_index(drop=True)

print(data.shape)
data.tail(10)
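
To make the 'rsi' column less of a black box, here is a pure-Python sketch of the commonly used Wilder-smoothed RSI. It illustrates the formula and is not a replacement for talib.RSI; the function name wilder_rsi and the value returned for a completely flat window are choices made for this sketch.

```python
def wilder_rsi(closes, period=14):
    """Return a list the same length as closes; the first `period` entries are None."""
    if len(closes) <= period:
        return [None] * len(closes)

    deltas = [closes[i] - closes[i - 1] for i in range(1, len(closes))]
    gains = [max(d, 0.0) for d in deltas]
    losses = [max(-d, 0.0) for d in deltas]

    def to_rsi(avg_gain, avg_loss):
        total = avg_gain + avg_loss
        if total == 0:
            return 0.0  # flat window: convention chosen for this sketch
        # Equivalent to 100 - 100 / (1 + avg_gain / avg_loss)
        return 100.0 * avg_gain / total

    # Seed the averages with a simple mean over the first `period` changes
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    out = [None] * period + [to_rsi(avg_gain, avg_loss)]

    # Wilder smoothing: each new change gets weight 1/period
    for i in range(period, len(deltas)):
        avg_gain = (avg_gain * (period - 1) + gains[i]) / period
        avg_loss = (avg_loss * (period - 1) + losses[i]) / period
        out.append(to_rsi(avg_gain, avg_loss))
    return out


rising = wilder_rsi([float(p) for p in range(1, 21)])
```

A strictly rising series has no losses, so its RSI sits at 100; a strictly falling one sits at 0.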

# Time Series Feature Extraction

These cells accomplish the following tasks:
1. Use the roll_time_series function from the tsfresh library to cut the data into rolling windows (max_timeshift=20, min_timeshift=5), one window per end date.
2. Check the number of rows and the first and last date of each window.
3. Use the extract_features function from tsfresh to extract a large number of time series features from every window.
4. Replace the (code, date) index of the extracted features with the date alone, so that it lines up with the original data.

In [None]:
from tsfresh.utilities.dataframe_functions import roll_time_series

data_roll = roll_time_series(data, column_id='code', column_sort='date', max_timeshift=20, min_timeshift=5).drop(columns=['code'])

print(data_roll.shape)
data_roll.head(15)

In [None]:
# Row count and first/last date of each rolling window
gg = data_roll.groupby('id').agg({'date': ['count', 'min', 'max']})

print(gg.shape)
gg.head(20)

In [None]:
from tsfresh import extract_features

data_feat = extract_features(data_roll, column_id='id', column_sort='date')
# For a single instrument, use the date as the index
data_feat.index = [v[1] for v in data_feat.index]

print(data_feat.shape)
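
The rolling step can be pictured without tsfresh. The sketch below assumes, based on my reading of the tsfresh documentation, that each window ends at one date, holds at most max_timeshift + 1 rows, and is discarded when it has min_timeshift rows or fewer. The function rolling_windows is a hypothetical stand-in, not the library's implementation.

```python
def rolling_windows(dates, max_timeshift=20, min_timeshift=5):
    """Map each window's end date to the list of dates the window contains."""
    windows = {}
    for end in range(len(dates)):
        start = max(0, end - max_timeshift)   # look back at most max_timeshift rows
        window = dates[start:end + 1]
        if len(window) > min_timeshift:       # drop windows that are too short
            windows[dates[end]] = window
    return windows


dates = [f"day{i:02d}" for i in range(30)]
windows = rolling_windows(dates)
```

With 30 dates, the first five end dates yield windows that are too short, which leaves 25 windows; the groupby check in the notebook reports the same kind of count, minimum and maximum per window.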

# Data Preparation

These cells perform the following operations:
1. Merge the original columns (prices and indicators) with the features extracted by tsfresh.
2. Create the target variables:
   - 'pct': the next day's return, i.e. the next close divided by the current close, minus 1
   - 'rise': a binary label that is 1 if 'pct' is positive and 0 otherwise
3. Remove rows whose 'pct' is missing (the last row, which has no next day).
4. Split the data chronologically into training (first 80%) and test (last 20%) sets.
5. Use the select_features function from tsfresh to select features on the training set only, and keep the same columns in the test set.

In [None]:
import pandas as pd

# Add the original factors (prices and indicators) to the factor matrix
data_feat = pd.merge(data_feat, data.set_index('date', drop=True).drop(columns=['code']), 
                     how='left', left_index=True, right_index=True)

# Label the data: next-day return and a binary rise flag
data_feat['pct'] = data_feat['close'].shift(-1) / data_feat['close'] - 1.0
data_feat['rise'] = data_feat['pct'].apply(lambda x: 1 if x>0 else 0)
data_feat = data_feat.dropna(subset=['pct'])

print(data_feat.shape)
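
The labeling rule is easy to verify by hand. The following sketch applies the same definition to a plain list of closes; the function name label_next_day is hypothetical.

```python
def label_next_day(closes):
    """Return (pct, rise) for every day that has a following day."""
    # pct[t] = close[t+1] / close[t] - 1, so the last day gets no label
    pct = [closes[t + 1] / closes[t] - 1.0 for t in range(len(closes) - 1)]
    # rise is 1 only for a strictly positive next-day return
    rise = [1 if p > 0 else 0 for p in pct]
    return pct, rise


pct, rise = label_next_day([100.0, 110.0, 99.0, 99.0])
```

Four closes give three labeled days: a gain of about 10%, a loss of about 10%, and an unchanged day, which counts as 0 because the return is not strictly positive.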

In [None]:
from tsfresh import select_features

# Split into training and test sets chronologically (first 80% / last 20%)
num_train = round(len(data_feat)*0.8)
data_train = data_feat.iloc[:num_train, :]
y_train = data_train['rise']
data_test = data_feat.iloc[num_train:, :]
y_test = data_test['rise']

# Feature selection on the training set only, using columns without missing values
data_train0 = select_features(data_train.drop(columns=['pct','rise']).dropna(axis=1, how='any'), y_train)
select_columns = list(data_train0.columns) + ['pct','rise']
# .copy() so that adding a 'pred' column later does not write to a slice of data_feat
data_train = data_train[select_columns].copy()
data_test = data_test[select_columns].copy()

print(data_train.shape)
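
The split is positional rather than random, so every training row comes before every test row in time. A minimal sketch of that rule on a plain list (chronological_split is a hypothetical helper name):

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows into an earlier training part and a later test part."""
    num_train = round(len(rows) * train_fraction)
    # No shuffling: the order of rows is preserved on both sides
    return rows[:num_train], rows[num_train:]


train_rows, test_rows = chronological_split(list(range(10)))
```

Keeping the order intact means the model is never trained on observations that come after the ones it is tested on.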

# Model Training

These cells accomplish the following tasks:
1. Take the training (80%) and test (20%) sets and the features chosen by tsfresh's select_features from the previous cell.
2. Standardize the features using StandardScaler, fitted on the training set only.
3. Train an SVC with an RBF kernel, evaluate its accuracy on the training and test sets, and save it to 'trained_model.pkl'.
4. Repeat with a RandomForestClassifier, this time filling missing values with SimpleImputer (mean strategy) before standardizing.
5. Evaluate the random forest's accuracy on both the training and test sets.
6. Save the trained random forest to 'trained_model_rf.pkl' and the names of the selected features to 'feature_names.pkl' and 'selected_feature_names.pkl'.

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

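# Convert to numpy ndarrays; skip any 'pred' column left over from an earlier run
X_train = data_train.drop(columns=['pct','rise','pred'], errors='ignore').values
X_test = data_test.drop(columns=['pct','rise','pred'], errors='ignore').values

# Standardize the features (statistics come from the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model
classifier = SVC(C=1.0, kernel='rbf')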
classifier.fit(X_train, y_train)
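
The scaler is fitted on the training data and only applied to the test data. The sketch below shows that pattern for a single column using the standard library; it assumes the population standard deviation, and the helper names fit_standardizer and standardize are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_column):
    """Learn mean and population standard deviation from the training column only."""
    mean = fmean(train_column)
    std = pstdev(train_column)
    # Guard against a constant column so the division below is always defined
    return mean, (std if std > 0 else 1.0)


def standardize(column, mean, std):
    """Apply previously learned statistics to any column."""
    return [(x - mean) / std for x in column]


train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # reuses the training statistics
```

The test column is never used to compute the mean or standard deviation, which is what calling transform instead of fit_transform achieves in the cell above.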

In [None]:
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)
data_train['pred'] = y_train_pred
data_test['pred'] = y_test_pred
accuracy_train = 100 * data_train[data_train.rise==data_train.pred].shape[0] / data_train.shape[0]
accuracy_test = 100 * data_test[data_test.rise==data_test.pred].shape[0] / data_test.shape[0]
print('Training set accuracy: %.2f%%' % accuracy_train)
print('Test set accuracy: %.2f%%' % accuracy_test)

import joblib
# Save the model
joblib.dump(classifier, 'trained_model.pkl')

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Feature columns only: exclude the targets and the 'pred' column added by the SVM cell
feature_cols = data_train.columns.drop(['pct', 'rise', 'pred'], errors='ignore')
# Convert to numpy ndarrays
X_train = data_train[feature_cols].values
X_test = data_test[feature_cols].values

# Fill missing values with SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model
classifier = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42, n_jobs=-1)
classifier.fit(X_train, y_train)

y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)
data_train['pred'] = y_train_pred
data_test['pred'] = y_test_pred
accuracy_train = 100 * data_train[data_train.rise == data_train.pred].shape[0] / data_train.shape[0]
accuracy_test = 100 * data_test[data_test.rise == data_test.pred].shape[0] / data_test.shape[0]
print('Training set accuracy: %.2f%%' % accuracy_train)
print('Test set accuracy: %.2f%%' % accuracy_test)

import joblib
# Save the model
joblib.dump(classifier, 'trained_model_rf.pkl')

# Save the feature names
feature_names = feature_cols
joblib.dump(feature_names, 'feature_names.pkl')
# Save the column names after feature selection (loaded by the backtest)
selected_feature_names = feature_cols
joblib.dump(selected_feature_names, 'selected_feature_names.pkl')
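
An accuracy figure for an up/down label is easier to judge next to a trivial reference. The sketch below computes accuracy the same way as the cells above (as a percentage) and compares it with always predicting the majority class of the training labels. The helper names are hypothetical, and this baseline is an addition for illustration rather than part of the original workflow.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Percentage of positions where the prediction equals the label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * hits / len(y_true)


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy on y_test of always predicting the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority] * len(y_test))


baseline = majority_baseline_accuracy([1, 1, 0], [1, 0, 1, 1])
```

If the test accuracy of the SVC or the random forest is not clearly above this baseline, the model has likely learned little beyond the share of up days.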

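# Strategy Backtesting

These cells perform the following operations:
1. Load price data from 'AAPL.csv', recompute the talib indicators, and pass them to backtrader through an extended PandasData feed.
2. Define a strategy that loads the saved random forest and the selected feature names, extracts tsfresh features for the current bar, and predicts whether the price will rise. The imputer and scaler are fitted inside the strategy rather than loaded from the training step.
3. Buy 10 shares when the prediction is 1 and no position is open, and sell when the prediction is 0 and a position is open.
4. Run the backtest with 10,000 in starting cash and a 0.1% commission, and print the starting and ending portfolio value.
5. Plot the backtest and save the figure to 'AAPL1.png'.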

In [None]:
import backtrader as bt
import pandas as pd
import joblib
import talib
import dill as pickle  # use dill in place of pickle
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from tsfresh import extract_features

class PandasDataExtend(bt.feeds.PandasData):
    lines = ('Slope', 'Rsi', 'Wr', 'Dif', 'Dea', 'Macd', 'Sar',)
    params = (('Slope', -1), ('Rsi', -1), ('Wr', -1), ('Dif', -1), ('Dea', -1), ('Macd', -1), ('Sar', -1),)

class MLStrategy(bt.Strategy):
    def __init__(self):
        self.model = joblib.load('trained_model_rf.pkl')
        self.selected_feature_names = joblib.load('selected_feature_names.pkl')
        self.imputer = SimpleImputer(strategy='mean')
        self.scaler = StandardScaler()
        initial_data = self.data_to_dataframe()
        self.features = extract_features(initial_data, column_id='code', column_sort='date')
        self.features = self.features.reindex(columns=self.selected_feature_names, fill_value=0)
        self.imputer.fit(self.features)
        self.scaler.fit(self.features)
        self.extracted_features = {}
        self.daily_value = []
        self.trades = []

    def data_to_dataframe(self):
        data = {
            'date': [self.data.datetime.date(0)],
            'code': [self.data._name],
            'Close': [self.data.close[0]],
            'High': [self.data.high[0]],
            'Low': [self.data.low[0]],
            'Open': [self.data.open[0]],
            'Volume': [self.data.volume[0]],
        }
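        # The strategy declares no params of its own, so read the extra lines declared on PandasDataExtend
        for name in ('Slope', 'Rsi', 'Wr', 'Dif', 'Dea', 'Macd', 'Sar'):
            data[name] = [getattr(self.data, name)[0]]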
        return pd.DataFrame(data)

    def next(self):
        current_date = self.data.datetime.date(0)
        if current_date not in self.extracted_features:
            current_data = self.data_to_dataframe()
            current_features = extract_features(current_data, column_id='code', column_sort='date')
            current_features = current_features.reindex(columns=self.selected_feature_names, fill_value=0)
            current_features = self.imputer.transform(current_features)
            current_features = self.scaler.transform(current_features)
            self.extracted_features[current_date] = current_features
        else:
            current_features = self.extracted_features[current_date]
        prediction = self.model.predict(current_features)
        if prediction == 1 and not self.position:
            self.buy()
            self.trades.append((current_date, 'buy', self.data.close[0]))
        elif prediction == 0 and self.position:
            self.sell()
            self.trades.append((current_date, 'sell', self.data.close[0]))
        self.daily_value.append((current_date, self.broker.getvalue()))

dataframe = pd.read_csv('AAPL.csv', index_col='Date', parse_dates=True)
dataframe['Slope'] = talib.LINEARREG_SLOPE(dataframe['Close'].values, timeperiod=5)
dataframe['Rsi'] = talib.RSI(dataframe['Close'].values, timeperiod=14)
dataframe['Wr'] = talib.WILLR(dataframe['High'].values, dataframe['Low'].values, dataframe['Close'].values, timeperiod=7)
dataframe['Dif'], dataframe['Dea'], dataframe['Macd'] = talib.MACD(dataframe['Close'].values, fastperiod=12, slowperiod=26, signalperiod=9)
dataframe['Sar'] = talib.SAR(dataframe['High'].values, dataframe['Low'].values)
dataframe = dataframe.dropna()

cerebro = bt.Cerebro()
cerebro.addstrategy(MLStrategy)
data = PandasDataExtend(dataname=dataframe)
cerebro.adddata(data)
cerebro.broker.set_cash(10000)
cerebro.addsizer(bt.sizers.FixedSize, stake=10)
cerebro.broker.setcommission(commission=0.001)

print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue())
cerebro.run()
print('Ending Portfolio Value: %.2f' % cerebro.broker.getvalue())

cerebro.plot()

In [None]:
img = cerebro.plot(dpi=800)

In [None]:
img[0][0].savefig('AAPL1.png')
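
The trading rule itself (buy when the prediction is 1 and flat, sell when it is 0 and holding) can be traced by hand with a small simulation. This sketch is a simplification: it fills every order at the same bar's close, whereas backtrader by default fills market orders at the next bar's open. The function simulate_long_flat is hypothetical, and its defaults mirror the cash, stake and commission set above.

```python
def simulate_long_flat(closes, preds, cash=10000.0, stake=10, commission=0.001):
    """Return the portfolio value after each bar for a long/flat rule."""
    position = 0
    values = []
    for price, pred in zip(closes, preds):
        if pred == 1 and position == 0:
            # Buy `stake` shares and pay commission on the traded amount
            cash -= price * stake * (1 + commission)
            position = stake
        elif pred == 0 and position > 0:
            # Sell the whole position and pay commission on the proceeds
            cash += price * position * (1 - commission)
            position = 0
        values.append(cash + position * price)
    return values


values = simulate_long_flat([100.0, 110.0, 120.0], [1, 1, 0])
```

On this toy input the rule buys at 100, holds through 110 and sells at 120, so the portfolio moves from 9999.0 to 10099.0 to 10197.8 after commissions.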