This project is about the application of machine learning algorithms such as support vector machines, random forests, and naive Bayes. It provides you a more practical example to help you use these algorithms better.   

In this article, in order to reduce the noise in the extracted features, we discretize common technical analysis indicators as feature values, such as RSI, MACD, and so on. Then we apply SVM, Naive Bayes and Random Forest algorithm to predict the rise and fall of the next trading day. In order to be closer to the actual application scenario, we will add today's data to the training data every day. Because the training data set has changed, we will train a new model every day.  

Here we use JQdata API to obtain Chinese financial data, you can also use `tushare` as an alternative. Our program will involve many statements that come with the `jqdata` package. If you are not interested in this, you can skip the data collection and directly read the machine learning part of the code.  

How to install and use jqdata:  
https://www.joinquant.com/help/api/help?name=JQData#%E5%85%B3%E4%BA%8EJQData%E3%80%81jqdatasdk%E5%92%8Cjqdata

In [126]:
import talib
import numpy as np
import pandas as pd
from jqdatasdk import *
from sklearn import preprocessing
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
import datetime
import time
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings("ignore") #cancel warninig
auth('your phone','your passwd')
is_auth = is_auth()
print(is_auth)

True


We choose 399300.XSHE as market index.Taking into account the timeliness of the data, we start with January-4-2010 and use February-26-2015 as the end date of the input data to train the model. The backtest runs until June 3 2016 (these dates are all trading days).  

So that we have two intervals: Original data period[start_date,test_start_date]; Backtest period[test_start_date,end_date].

In [31]:
test_stock = '399300.XSHE' # Choose 399300.XSHE as market index
start_date = datetime.date(2010, 1, 4) # Set the start date of the experiment data, note that the date needs to be a trading day
end_date = datetime.date(2016, 6, 3) # The expiration date of all the data, also needs to be a trading day
test_start_date = datetime.date(2015, 2, 26) # Set the date to start backtest, also needs to be a trading day

trading_days = get_all_trade_days().tolist() # Get a list of all trading days to conveniently view and check


In [32]:
trading_days

[datetime.date(2005, 1, 4),
 datetime.date(2005, 1, 5),
 datetime.date(2005, 1, 6),
 datetime.date(2005, 1, 7),
 datetime.date(2005, 1, 10),
 datetime.date(2005, 1, 11),
 datetime.date(2005, 1, 12),
 datetime.date(2005, 1, 13),
 datetime.date(2005, 1, 14),
 datetime.date(2005, 1, 17),
 datetime.date(2005, 1, 18),
 datetime.date(2005, 1, 19),
 datetime.date(2005, 1, 20),
 datetime.date(2005, 1, 21),
 datetime.date(2005, 1, 24),
 datetime.date(2005, 1, 25),
 datetime.date(2005, 1, 26),
 datetime.date(2005, 1, 27),
 datetime.date(2005, 1, 28),
 datetime.date(2005, 1, 31),
 datetime.date(2005, 2, 1),
 datetime.date(2005, 2, 2),
 datetime.date(2005, 2, 3),
 datetime.date(2005, 2, 4),
 datetime.date(2005, 2, 16),
 datetime.date(2005, 2, 17),
 datetime.date(2005, 2, 18),
 datetime.date(2005, 2, 21),
 datetime.date(2005, 2, 22),
 datetime.date(2005, 2, 23),
 datetime.date(2005, 2, 24),
 datetime.date(2005, 2, 25),
 datetime.date(2005, 2, 28),
 datetime.date(2005, 3, 1),
 datetime.date(2005, 3,

For the convenience of calling peer data in subsequent programs, we convert the above datetime information into index to establish the position of this row of other data in the database.

We also create a list to store the backtest results.

In [33]:
start_date_index = trading_days.index(start_date)
end_date_index = trading_days.index(end_date)
test_start_index = trading_days.index(test_start_date)

# Save the results of each day's forecast. If the forecast is correct on a certain day, save 1, and if the forecast is wrong on a certain day, save -1
result_list = []

In order to facilitate everyone to better understand the code, we first extract one day in the backtest period for operation, and then use the `for` loop.

In [34]:
trading_days.index(test_start_date)

2460

In [35]:
# We choose the first day from the interval as the given test day:[test_start,end_date],whose index bigger than 2460
index_end = 2461

In [36]:
# Step 1: We get all the training data generated from the initial point to the given backtest period
for index in range(start_date_index, index_end):
    # We first get the opening high, low closing and trading volume data of the stock in the first 35 trading days through the `get_price` method.
    start_day = trading_days[index - 35]
    end_day = trading_days[index] 
    stock_data = get_price(test_stock, start_date=start_day, end_date=end_day, \
                           frequency='daily', fields=['close','high','low','volume'])
    close_prices = stock_data['close'].values
    high_prices = stock_data['high'].values
    low_prices = stock_data['low'].values
    volumes = stock_data['volume'].values
    # Then we use the feature calculation API provided by the `talib` package to calculate the corresponding features. 
    # If you are not interested, you can skip this part.
    # We choose [SMA，WMA，MOM，STCK，STCD，MACD，RSI，WILLR，CCI，MFI，OBV，ROC，CMO]
    # as a technical indicator to train the ML model
    sma_data = talib.SMA(close_prices)[-1]    
    wma_data = talib.WMA(close_prices)[-1]
    mom_data = talib.MOM(close_prices)[-1]
    stck, stcd = talib.STOCH(high_prices, low_prices, close_prices)
    stck_data = stck[-1]
    stcd_data = stcd[-1]

    macd, macdsignal, macdhist = talib.MACD(close_prices)
    macd_data = macd[-1]
    rsi_data = talib.RSI(close_prices,timeperiod=10)[-1]
    willr_data = talib.WILLR(high_prices, low_prices, close_prices)[-1]
    cci_data = talib.CCI(high_prices, low_prices, close_prices)[-1]

    mfi_data = talib.MFI(high_prices, low_prices, close_prices, volumes)[-1]
    obv_data = talib.OBV(close_prices, volumes)[-1]
    roc_data = talib.ROC(close_prices)[-1]
    cmo_data = talib.CMO(close_prices)[-1]

    # Save a set of training data
    features = []
    features.append(sma_data)
    features.append(wma_data)
    features.append(mom_data)
    features.append(stck_data)
    features.append(stcd_data)
    features.append(macd_data)
    features.append(rsi_data)
    features.append(willr_data)
    features.append(cci_data)
    features.append(mfi_data)
    features.append(obv_data)
    features.append(roc_data)
    features.append(cmo_data)
    # A temporary variables used when judging y. Delete after discretization
    features.append(close_prices[-1])

    # Code for calculating classification labels
    start_day = trading_days[index]
    end_day = trading_days[index + 1]
    stock_data = get_price(test_stock, start_date=start_day, end_date=end_day, \
                           frequency='daily', fields=['close','high','low','volume'])
    close_prices = stock_data['close'].values
    
    # Determine whether the market index price is rising
    label = False
    if close_prices[-1] > close_prices[-2]:
        label = True
    # From this, we got the data before this certain backtest period
    x_all.append(features)
    y_all.append(label)

In [37]:
x_all

[[1, 1, 1, -1, -1, 1, 1, 1, -1, -1, -1, 1, 1],
 [1, 1, 1, -1, -1, 1, -1, -1, 1, 1, -1, 1, -1],
 [-1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1],
 [-1, -1, 1, -1, -1, -1, 1, 1, -1, 1, -1, 1, 1],
 [-1, -1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, 1],
 [-1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1],
 [-1, -1, -1, -1, 1, -1, 1, -1, 1, 1, -1, -1, 1],
 [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, -1, 1],
 [-1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1],
 [1, 1, -1, 1, 1, 1, -1, 1, 1, 1, -1, -1, -1],
 [-1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1],
 [-1, -1, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1, 1],
 [-1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1],
 [-1, -1, -1, 1, -1, -1, -1, -1, 1, 1, -1, -1, -1],
 [-1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1],
 [-1, -1, -1, -1, -1, -1, 1, -1, 1, 1, 1, -1, -1],
 [-1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, -1, 1],
 [-1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1, -1, -1],
 [-1, -1, -1, 1, 1, -1, 1, -1, -1, 1, -1, -1, -1],


In [38]:
y_all

[False,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 False,
 False,
 True,
 False,
 True,
 True,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 True,
 False,
 False,
 True,
 False,
 True,
 True,
 False,
 False,
 True,
 False,
 False,
 True,
 True,
 False,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 True,
 False,
 True,
 True,
 False,
 True,
 False,
 False,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,

Since there is a lot of noise in the stock market, if the continuous value feature is directly put into the machine learning algorithm for training, the result is likely to be over-fitting (good backtest results, poor real market results).  

In order to avoid over-fitting, this article combines the meaning of each feature and discretizes continuous-valued features into binary features.   

Traverse from the back to the front, the first row of data needs to be discarded, the range only contains the first element, not the second element

In [39]:
for index in range(len(x_all)-1, 0, -1):
    # SMA
    if x_all[index][0] < x_all[index][-1]:
        x_all[index][0] = 1
    else:
        x_all[index][0] = -1
    # WMA
    if x_all[index][1] < x_all[index][-1]:
        x_all[index][1] = 1
    else:
        x_all[index][1] = -1
    # MOM
    if x_all[index][2] > 0:
        x_all[index][2] = 1
    else:
        x_all[index][2] = -1
    # STCK
    if x_all[index][3] > x_all[index-1][3]:
        x_all[index][3] = 1
    else:
        x_all[index][3] = -1
    # STCD
    if x_all[index][4] > x_all[index-1][4]:
        x_all[index][4] = 1
    else:
        x_all[index][4] = -1
    # MACD
    if x_all[index][5] > x_all[index-1][5]:
        x_all[index][5] = 1
    else:
        x_all[index][5] = -1

    # RSI
    if x_all[index][6] > 70:
        x_all[index][6] = -1
    elif x_all[index][6] < 30:
        x_all[index][6] = 1
    else:
        if x_all[index][6] > x_all[index-1][6]:
            x_all[index][6] = 1
        else:
            x_all[index][6] = -1
    # WILLR
    if x_all[index][7] > x_all[index-1][7]:
        x_all[index][7] = 1
    else:
        x_all[index][7] = -1
    # CCI
    if x_all[index][8] > 200:
        x_all[index][8] = -1
    elif x_all[index][8] < -200:
        x_all[index][8] = 1
    else:
        if x_all[index][8] > x_all[index-1][8]:
            x_all[index][8] = 1
        else:
            x_all[index][8] = -1

    # MFI
    if x_all[index][9] > 90:
        x_all[index][9] = -1
    elif x_all[index][9] < 10:
        x_all[index][9] = 1
    else:
        if x_all[index][9] > x_all[index-1][9]:
            x_all[index][9] = 1
        else:
            x_all[index][9] = -1
    # OBV
    if x_all[index][10] > x_all[index-1][10]:
        x_all[index][10] = 1
    else:
        x_all[index][10] = -1
    # ROC
    if x_all[index][11] > 0:
        x_all[index][11] = 1
    else:
        x_all[index][11] = -1
    # CMO
    if x_all[index][12] > 50:
        x_all[index][12] = -1
    elif x_all[index][12] < -50:
        x_all[index][12] = 1
    else:
        if x_all[index][12] > x_all[index-1][12]:
            x_all[index][12] = 1
        else:
            x_all[index][12] = -1        
    # delete price
    x_all[index].pop(-1)

Let's sort out the data currently obtained and perform basic operations on it.

In [108]:
# Remove the first row of data
x_all = x_all[1:]
y_all = y_all[1:]

y_all = pd.DataFrame(y_all)
x_all = pd.DataFrame(x_all)
x_all.columns = ['SMA','WMA','MOM','STCK','STCD','MACD','RSI','WILLR','CCI','MFI','OBV','ROC','CMO']
x_all = x_all.fillna(value = 0)
x_all

Unnamed: 0,SMA,WMA,MOM,STCK,STCD,MACD,RSI,WILLR,CCI,MFI,OBV,ROC,CMO
10,1,1,-1,-1,-1,-1,1,1,-1,1,-1,-1,0.0
11,-1,-1,-1,-1,-1,-1,1,-1,-1,1,-1,-1,0.0
12,-1,-1,-1,1,-1,-1,1,-1,1,1,-1,-1,0.0
13,-1,-1,-1,-1,-1,-1,1,-1,-1,1,-1,-1,0.0
14,-1,-1,-1,-1,-1,-1,1,-1,1,1,1,-1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3768,-1,1,1,1,1,1,1,1,1,-1,1,1,1.0
3769,1,1,1,1,1,1,1,1,1,1,-1,1,1.0
3770,1,1,1,-1,1,1,1,1,1,1,-1,1,1.0
3771,-1,1,1,-1,-1,1,-1,-1,-1,1,-1,1,-1.0


In [109]:
y_all.isna().sum()

0    0
dtype: int64

In [110]:
# Training data is all data execpt the last one,
x_train = x_all.iloc[:-1,]
y_train = y_all.iloc[:-1,]
# … while the last one is the test data.
x_test = x_all.iloc[-1:,]
y_test = y_all.iloc[-1:,]

In [111]:
x_train

Unnamed: 0,SMA,WMA,MOM,STCK,STCD,MACD,RSI,WILLR,CCI,MFI,OBV,ROC,CMO
10,1,1,-1,-1,-1,-1,1,1,-1,1,-1,-1,0.0
11,-1,-1,-1,-1,-1,-1,1,-1,-1,1,-1,-1,0.0
12,-1,-1,-1,1,-1,-1,1,-1,1,1,-1,-1,0.0
13,-1,-1,-1,-1,-1,-1,1,-1,-1,1,-1,-1,0.0
14,-1,-1,-1,-1,-1,-1,1,-1,1,1,1,-1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3767,-1,-1,-1,1,1,1,1,1,1,-1,1,-1,1.0
3768,-1,1,1,1,1,1,1,1,1,-1,1,1,1.0
3769,1,1,1,1,1,1,1,1,1,1,-1,1,1.0
3770,1,1,1,-1,1,1,1,1,1,1,-1,1,1.0


In [112]:
y_train

Unnamed: 0,0
10,False
11,False
12,False
13,False
14,True
...,...
3767,True
3768,True
3769,True
3770,False


In [113]:
x_test

Unnamed: 0,SMA,WMA,MOM,STCK,STCD,MACD,RSI,WILLR,CCI,MFI,OBV,ROC,CMO
3772,1,1,1,1,-1,1,1,1,1,1,-1,1,1.0


In [128]:
y_test

Unnamed: 0,0
3772,True


In [115]:
# The following three lines of code use machine learning SVM, random forest, and naive Bayes. You should just call one of them.
# However, you can also call them together to compare the effect of the algorithm, but this will consume a lot of time.

# define you method
clf = svm.SVC()
#     clf = RandomForestClassifier(n_estimators=50)
#     clf = GaussianNB()
# train the mode 
clf.fit(x_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [129]:
# Forecast, and record the forecast results of the day
prediction = clf.predict(x_test)
prediction

array([False])

In [127]:
if prediction == np.array(y_test):
    print('True')
    result_list.append(1)
else:
    print('False')
    result_list.append(-1)

False


Let's merge the code and take a look at it as a whole.

In [130]:
time_start = time.time()

for index_end in range(test_start_index, end_date_index):

    x_all = []
    y_all = []
    

    for index in range(start_date_index, index_end):
     
        start_day = trading_days[index - 35]
        end_day = trading_days[index]
        
       
        stock_data = get_price(test_stock, start_date=start_day, end_date=end_day, \
                               frequency='daily', fields=['close','high','low','volume'])
        close_prices = stock_data['close'].values
        high_prices = stock_data['high'].values
        low_prices = stock_data['low'].values
        volumes = stock_data['volume'].values
       
        sma_data = talib.SMA(close_prices)[-1]    
        wma_data = talib.WMA(close_prices)[-1]
        mom_data = talib.MOM(close_prices)[-1]
        stck, stcd = talib.STOCH(high_prices, low_prices, close_prices)
        stck_data = stck[-1]
        stcd_data = stcd[-1]

        macd, macdsignal, macdhist = talib.MACD(close_prices)
        macd_data = macd[-1]
        rsi_data = talib.RSI(close_prices,timeperiod=10)[-1]
        willr_data = talib.WILLR(high_prices, low_prices, close_prices)[-1]
        cci_data = talib.CCI(high_prices, low_prices, close_prices)[-1]
        
        mfi_data = talib.MFI(high_prices, low_prices, close_prices, volumes)[-1]
        obv_data = talib.OBV(close_prices, volumes)[-1]
        roc_data = talib.ROC(close_prices)[-1]
        cmo_data = talib.CMO(close_prices)[-1]
        
      
        features = []
        features.append(sma_data)
        features.append(wma_data)
        features.append(mom_data)
        features.append(stck_data)
        features.append(stcd_data)
        features.append(macd_data)
        features.append(rsi_data)
        features.append(willr_data)
        features.append(cci_data)
        features.append(mfi_data)
        features.append(obv_data)
        features.append(roc_data)
        features.append(cmo_data)

        features.append(close_prices[-1])
    
   
        start_day = trading_days[index]
        end_day = trading_days[index + 1]
        stock_data = get_price(test_stock, start_date=start_day, end_date=end_day, \
                               frequency='daily', fields=['close','high','low','volume'])
        close_prices = stock_data['close'].values
        
        label = False
        if close_prices[-1] > close_prices[-2]:
            label = True
  
        x_all.append(features)
        y_all.append(label)
        
  
    for index in range(len(x_all)-1, 0, -1):
        # SMA
        if x_all[index][0] < x_all[index][-1]:
            x_all[index][0] = 1
        else:
            x_all[index][0] = -1
        # WMA
        if x_all[index][1] < x_all[index][-1]:
            x_all[index][1] = 1
        else:
            x_all[index][1] = -1
        # MOM
        if x_all[index][2] > 0:
            x_all[index][2] = 1
        else:
            x_all[index][2] = -1
        # STCK
        if x_all[index][3] > x_all[index-1][3]:
            x_all[index][3] = 1
        else:
            x_all[index][3] = -1
        # STCD
        if x_all[index][4] > x_all[index-1][4]:
            x_all[index][4] = 1
        else:
            x_all[index][4] = -1
        # MACD
        if x_all[index][5] > x_all[index-1][5]:
            x_all[index][5] = 1
        else:
            x_all[index][5] = -1

        # RSI
        if x_all[index][6] > 70:
            x_all[index][6] = -1
        elif x_all[index][6] < 30:
            x_all[index][6] = 1
        else:
            if x_all[index][6] > x_all[index-1][6]:
                x_all[index][6] = 1
            else:
                x_all[index][6] = -1
        # WILLR
        if x_all[index][7] > x_all[index-1][7]:
            x_all[index][7] = 1
        else:
            x_all[index][7] = -1
        # CCI
        if x_all[index][8] > 200:
            x_all[index][8] = -1
        elif x_all[index][8] < -200:
            x_all[index][8] = 1
        else:
            if x_all[index][8] > x_all[index-1][8]:
                x_all[index][8] = 1
            else:
                x_all[index][8] = -1
                
        # MFI
        if x_all[index][9] > 90:
            x_all[index][9] = -1
        elif x_all[index][9] < 10:
            x_all[index][9] = 1
        else:
            if x_all[index][9] > x_all[index-1][9]:
                x_all[index][9] = 1
            else:
                x_all[index][9] = -1
        # OBV
        if x_all[index][10] > x_all[index-1][10]:
            x_all[index][10] = 1
        else:
            x_all[index][10] = -1
        # ROC
        if x_all[index][11] > 0:
            x_all[index][11] = 1
        else:
            x_all[index][11] = -1
        # CMO
        if x_all[index][12] > 50:
            x_all[index][12] = -1
        elif x_all[index][12] < -50:
            x_all[index][12] = 1
        else:
            if x_all[index][12] > x_all[index-1][12]:
                x_all[index][12] = 1
            else:
                x_all[index][12] = -1        
        # delete price
        x_all[index].pop(-1)
    
        
    x_all = x_all[1:]
    y_all = y_all[1:]

    y_all = pd.DataFrame(y_all)
    x_all = pd.DataFrame(x_all)
    x_all = x_all.fillna(value = 0)    
    

    x_train = x_all.iloc[:-1,]
    y_train = y_all.iloc[:-1,]

    x_test = x_all.iloc[-1:,]
    y_test = y_all.iloc[-1:,]

    clf = svm.SVC()
#     clf = RandomForestClassifier(n_estimators=50)
#     clf = GaussianNB()

    clf.fit(x_train, y_train)

    prediction = clf.predict(x_test)
    if prediction == np.array(y_test):
        print('True')
        result_list.append(1)
    else:
        print('False')
        result_list.append(-1)
# plot the accuracy curve
x = range(0, len(result_list))
y = []
for i in range(0, len(result_list)):
    y.append((1 + float(sum(result_list[:i])) / (i+1)) / 2)
line, = plt.plot(x, y)
plt.show()

time_end = time.time()
print('The algorithm consume %d time' & (time_end - time_start) )

False
False
True
True
True
True
False
True
True
False
True
True
True
True
True
False


Exception: 您当天的查询条数超过了每日最大查询限制：100万条；付费可增加流量权限，详情请咨询管理员，微信号：JQData02

Ooops, it is very unfortunite that I reach the limit of daily data access amount and cannot run all the code here. Hope you can get some intuition that how to write a ML-based strategy in the file.