# TUTORIAL


In this tutorial I will show how to run the second part of the notebook `Stocks_Performance_Predictor.ipyn`, which is also the most interesting one.

I will use the files that have been uploaded in the Tutorial folder: those are the files that one would build in the first part of the notebook.

We will also need the custom function `get_price_var()` at the end of this notebook (it comes from the main notebook).

Let's begin from the required imports.

In [1]:
import numpy as np
import pandas as pd
from pandas_datareader import data

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
from sklearn.metrics import classification_report

from tqdm import tqdm

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Custom function to pull prices of stocks
def get_price_var(symbol):
    '''
    Get historical price data for a given symbol leveraging the power of pandas_datareader and Yahoo.
    Compute the difference between first and last available time-steps in terms of Adjusted Close price.
    
    Input: ticker symbol
    Output: percent price variation 
    '''
    # read data
    prices = data.DataReader(symbol, 'yahoo', '2019-01-01', '2019-12-31')['Adj Close']

    # get all timestamps for specific lookups
    today = prices.index[-1]
    start = prices.index[0]

    # calculate percentage price variation
    price_var = ((prices[today] - prices[start]) / prices[start]) * 100
    return price_var

### LOAD DATA

It is now time to load the example data. The file `Example_Data.csv` contains all the financial data for the stocks from the Technology sector in the US stock exchange that are available from the Financial Modeling Prep API. The financial data are specified by the file `indicators.txt` (also available in this repo). The dataset has already been cleaned and prepared according to the following rules:

  * remove all those columns that have more than 20 0-valued entries;
  * remove all those columns that have more than 15 nan-valued entries;
  * fill the remaining nan-valued entries with the average value of each column.
  
Those are the rules I followed, but their definition is up to the user.

In [2]:
# Load complete dataset as pickle (dataframe with class column)
df = pd.read_csv('Example_DATASET.csv', index_col=0)

### TRAIN TEST SPLIT

Once the dataset is loaded, we must split it into training and testing. 20% of the data will be used to test the ML models, note the parameter `stratify` used in order to keep the same class-ratio between training and testing datasets.

From the `train_split` and `test_split` we extract both input data `X_train`, `X_test` and output target data `y_train`, `y_test`.

A sanity check is performed after.

In [3]:
# Divide data in train and testing
train_split, test_split = train_test_split(df, test_size=0.2, random_state=1, stratify=df['class'])
X_train = train_split.iloc[:, :-1].values
y_train = train_split.iloc[:, -1].values
X_test = test_split.iloc[:, :-1].values
y_test = test_split.iloc[:, -1].values

print()
print(f'Number of training samples: {X_train.shape[0]}')
print()
print(f'Number of testing samples: {X_test.shape[0]}')
print()
print(f'Number of features: {X_train.shape[1]}')


Number of training samples: 510

Number of testing samples: 128

Number of features: 107


### DATA STANDARDIZATION

The next step consists in the standardization of the data. We leverage the `StandardScaler()` available from `scikit-learn`. It is important to use the same coefficients when standardizing both training and testing data: for this reason we first fit the scaler to `X_train`, and then apply it it both `X_train` and `X_test` via the method `.fit_transform()`. 

In [4]:
# Standardize input data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

### ML MODEL I: SUPPORT VECTOR MACHINE


The first classification model we will run is the support vector machine. A `GridSeachCV` is performed in order to tune some hyper-parameters (`kernel`, `gamma`, `C`). The required number of cross-validations is set to 5. We want to achieve maximum weighted precision, in order to minimize the number of _false positives_.

In [5]:
# Parameter grid to be tuned
tuned_parameters = [{'kernel': ['rbf', 'linear'],
                     'gamma': [1e-3, 1e-4],
                     'C': [0.01, 0.1, 1, 10, 100]}]

clf1 = GridSearchCV(SVC(random_state=1),
                    tuned_parameters,
                    n_jobs=6,
                    scoring='precision_weighted',
                    cv=5)
clf1.fit(X_train, y_train)

print('Best score and parameters found on development set:')
print()
print('%0.3f for %r' % (clf1.best_score_, clf1.best_params_))
print()

Best score and parameters found on development set:

0.713 for {'C': 0.01, 'gamma': 0.001, 'kernel': 'linear'}



### ML MODEL II: RANDOM FOREST


The second classification model we will run is the random forest. A `GridSeachCV` is performed in order to tune some hyper-parameters (`n_estimators`, `max_features`, `max_depth`, `criterion`). The required number of cross-validations is set to 5. We want to achieve maximum weighted precision, in order to minimize the number of _false positives_.

In [6]:
# Parameter grid to be tuned
tuned_parameters = {'n_estimators': [32, 256, 512, 1024],
                    'max_features': ['auto', 'sqrt'],
                    'max_depth': [4, 5, 6, 7, 8],
                    'criterion': ['gini', 'entropy']}

clf2 = GridSearchCV(RandomForestClassifier(random_state=1),
                    tuned_parameters,
                    n_jobs=6,
                    scoring='precision_weighted',
                    cv=5)
clf2.fit(X_train, y_train)

print('Best score and parameters found on development set:')
print()
print('%0.3f for %r' % (clf2.best_score_, clf2.best_params_))
print()

Best score and parameters found on development set:

0.724 for {'criterion': 'gini', 'max_depth': 5, 'max_features': 'auto', 'n_estimators': 32}



### ML MODEL III: EXTREME GRADIENT BOOSTING


The third classification model we will run is the extreme gradient boosting. A `GridSeachCV` is performed in order to tune some hyper-parameters (`learning_rate`, `max_depth`, `n_estimators`). The required number of cross-validations is set to 5. We want to achieve maximum weighted precision, in order to minimize the number of _false positives_.

In [7]:
# Parameter grid to be tuned
tuned_parameters = {'learning_rate': [0.01, 0.001],
                    'max_depth': [4, 5, 6, 7, 8],
                    'n_estimators': [32, 128, 256]}

clf3 = GridSearchCV(xgb.XGBClassifier(random_state=1),
                   tuned_parameters,
                   n_jobs=6,
                   scoring='precision_weighted', 
                   cv=5)
clf3.fit(X_train, y_train)

print('Best score and parameters found on development set:')
print()
print('%0.3f for %r' % (clf3.best_score_, clf3.best_params_))
print()

Best score and parameters found on development set:

0.697 for {'learning_rate': 0.001, 'max_depth': 4, 'n_estimators': 256}



### ML MODEL IV: MULTI-LAYER PERCEPTRON


The fourth classification model we will run is the multi-layer perceptron (feed-forward neural network). A `GridSeachCV` is performed in order to tune some hyper-parameters (`hidden_layer_sizes`, `activation`, `solver`). The required number of cross-validations is set to 5. We want to achieve maximum weighted precision, in order to minimize the number of _false positives_.

In [8]:
# Parameter grid to be tuned
tuned_parameters = {'hidden_layer_sizes': [(32,), (64,), (32, 64, 32)],
                    'activation': ['tanh', 'relu'],
                    'solver': ['lbfgs', 'adam']}

clf4 = GridSearchCV(MLPClassifier(random_state=1, batch_size=4, early_stopping=True), 
                    tuned_parameters,
                    n_jobs=6,
                    scoring='precision_weighted',
                    cv=5)
clf4.fit(X_train, y_train)

print('Best score, and parameters, found on development set:')
print()
print('%0.3f for %r' % (clf4.best_score_, clf4.best_params_))
print()

Best score, and parameters, found on development set:

0.730 for {'activation': 'relu', 'hidden_layer_sizes': (32, 64, 32), 'solver': 'adam'}



### EVALUATE THE MODELS


Now that 4 classification models have been trained, we must test them and compare their performance with respect to each other and to the benchmarks (S&P 500, DOW JONES). Indeed, we don't limit ourserlves to the comparison of their testing accuracies: we want to understand which model allows to make more money.

First, we load the 2019 percent price variations for all the stocks, and we filter them in order to get only those used to test the models (`pvar_test`).

In [9]:
# Load all 2019 price variations for the tech stocks.
pvar = pd.read_csv('Example_2019_price_var.csv', index_col=0)

# Get 2019 price variations ONLY for the stocks in testing split
pvar_test = pvar.loc[test_split.index.values, :]

Now, we build a new dataframe `df1` in which, for each tested stock, we collect all the predicted classes from each model (it is reminded that the two classes are `0`=IGNORE, `1`=BUY).

If the model predicts class `1`, we proceed to buy 100 USD worth of that stock; otherwise, we ignore the stock.

In [10]:
# Initial investment can be $100 for each stock whose predicted class = 1
buy_amount = 100

# In new dataframe df1, store all the information regarding each model's predicted class and relative gain/loss in $USD
df1 = pd.DataFrame(y_test, index=test_split.index.values, columns=['ACTUAL']) # first column is the true class (BUY/INGORE)

df1['SVM'] = clf1.predict(X_test) # predict class for testing dataset
df1['VALUE START SVM [$]'] = df1['SVM'] * buy_amount # if class = 1 --> buy $100 of that stock
df1['VAR SVM [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START SVM [$]'] # compute price variation in $
df1['VALUE END SVM [$]'] = df1['VALUE START SVM [$]'] + df1['VAR SVM [$]'] # compute final value

df1['RF'] = clf2.predict(X_test)
df1['VALUE START RF [$]'] = df1['RF'] * buy_amount
df1['VAR RF [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START RF [$]']
df1['VALUE END RF [$]'] = df1['VALUE START RF [$]'] + df1['VAR RF [$]']

df1['XGB'] = clf3.predict(X_test)
df1['VALUE START XGB [$]'] = df1['XGB'] * buy_amount
df1['VAR XGB [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START XGB [$]']
df1['VALUE END XGB [$]'] = df1['VALUE START XGB [$]'] + df1['VAR XGB [$]']

df1['MLP'] = clf4.predict(X_test)
df1['VALUE START MLP [$]'] = df1['MLP'] * buy_amount
df1['VAR MLP [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START MLP [$]']
df1['VALUE END MLP [$]'] = df1['VALUE START MLP [$]'] + df1['VAR MLP [$]']

Finally, we build a compact dataframe `MODELS_COMPARISON` in which we collect the main information required to perform the comparison between the classification models and the benchmarks (S&P 500, DOW JONES).

Leveraging the dataframe `df1`, we can easily compute gains and losses for each model (`net_gain_`, `percent_gain_`).

Since we miss the data from the benchmarks, we quickly exploit the custom function `get_price_var` in order to get the percent price variation for both S&P 500 (^GSPC) and DOW JONES (^DJI) for the year 2019.

In [11]:
# Create a new, compact, dataframe in order to show gain/loss for each model
start_value_svm = df1['VALUE START SVM [$]'].sum()
final_value_svm = df1['VALUE END SVM [$]'].sum()
net_gain_svm = final_value_svm - start_value_svm
percent_gain_svm = (net_gain_svm / start_value_svm) * 100

start_value_rf = df1['VALUE START RF [$]'].sum()
final_value_rf = df1['VALUE END RF [$]'].sum()
net_gain_rf = final_value_rf - start_value_rf
percent_gain_rf = (net_gain_rf / start_value_rf) * 100

start_value_xgb = df1['VALUE START XGB [$]'].sum()
final_value_xgb = df1['VALUE END XGB [$]'].sum()
net_gain_xgb = final_value_xgb - start_value_xgb
percent_gain_xgb = (net_gain_xgb / start_value_xgb) * 100

start_value_mlp = df1['VALUE START MLP [$]'].sum()
final_value_mlp = df1['VALUE END MLP [$]'].sum()
net_gain_mlp = final_value_mlp - start_value_mlp
percent_gain_mlp = (net_gain_mlp / start_value_mlp) * 100

percent_gain_sp500 = get_price_var('^GSPC') # get percent gain of S&P500 index
percent_gain_dj = get_price_var('^DJI') # get percent gain of DOW JONES index

MODELS_COMPARISON = pd.DataFrame([start_value_svm, final_value_svm, net_gain_svm, percent_gain_svm],
                    index=['INITIAL COST [USD]', 'FINAL VALUE [USD]', '[USD] GAIN/LOSS', 'ROI'], columns=['SVM'])
MODELS_COMPARISON['RF'] = [start_value_rf, final_value_rf, net_gain_rf, percent_gain_rf]
MODELS_COMPARISON['XGB'] = [start_value_xgb, final_value_xgb, net_gain_xgb, percent_gain_xgb]
MODELS_COMPARISON['MLP'] = [start_value_mlp, final_value_mlp, net_gain_mlp, percent_gain_mlp]
MODELS_COMPARISON['S&P 500'] = ['', '', '', percent_gain_sp500]
MODELS_COMPARISON['DOW JONES'] = ['', '', '', percent_gain_dj]
MODELS_COMPARISON

Unnamed: 0,SVM,RF,XGB,MLP,S&P 500,DOW JONES
INITIAL COST [USD],12300.0,4700.0,10100.0,10000.0,,
FINAL VALUE [USD],15701.241963,6624.89708,13268.852673,12834.504406,,
[USD] GAIN/LOSS,3401.241963,1924.89708,3168.852673,2834.504406,,
ROI,27.652374,40.955257,31.374779,28.345044,28.7148,22.24


From the dataframe `MODELS_COMPARISON`, it is possible to see that:

  * XGB and RF are the ML models that yield the highest ROI, 31.3% and 40.9% respectively
  * RF outperforms the S&P 500 by more than 12 p.p, while it outperforms the DOW JONES by almost 20 p.p.
  * XGB outperforms the S&P 500 by a few p.p., while it outperforms the DOW JONES by almost 10 p.p.
  * MLP and SVM are closely matched, and yield an ROI of 28.3% and 27.2% respectively
  * MLP and SVM perform similarly to the S&P 500, while they both outperfom the DOW JONES
  * the SVM leads to the highest net gains, at about 3290 USD; however, it also has the highest initial investment cost at 12100 USD
  * the RF leads to the lowest net gains, at about 1920 USD; however, it also has the lowest initial investment cost at 4700 USD

So, this example proves, at least as proof-of-concept, that it is possible to find useful information in the 10-K filings that the publicly traded companies release. The financial information can be used to train machine learning models that learn to recognize buy-worthy stocks.


For what concerns a more traditional comparison between the performance of the ML models implemented, it is possible to analyze the `classification_report`. 

In [17]:
from sklearn.metrics import classification_report

print()
print(53 * '=')
print(15 * ' ' + 'SUPPORT VECTOR MACHINE')
print(53 * '-')
print(classification_report(y_test, clf1.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')
print(53 * '=')
print(20 * ' ' + 'RANDOM FOREST')
print(53 * '-')
print(classification_report(y_test, clf2.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')
print(53 * '=')
print(14 * ' ' + 'EXTREME GRADIENT BOOSTING')
print(53 * '-')
print(classification_report(y_test, clf3.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')
print(53 * '=')
print(15 * ' ' + 'MULTI-LAYER PERCEPTRON')
print(53 * '-')
print(classification_report(y_test, clf4.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')


               SUPPORT VECTOR MACHINE
-----------------------------------------------------
              precision    recall  f1-score   support

      IGNORE       0.40      0.05      0.09        38
         BUY       0.71      0.97      0.82        90

    accuracy                           0.70       128
   macro avg       0.55      0.51      0.45       128
weighted avg       0.62      0.70      0.60       128

-----------------------------------------------------
                    RANDOM FOREST
-----------------------------------------------------
              precision    recall  f1-score   support

      IGNORE       0.37      0.79      0.50        38
         BUY       0.83      0.43      0.57        90

    accuracy                           0.54       128
   macro avg       0.60      0.61      0.54       128
weighted avg       0.69      0.54      0.55       128

-----------------------------------------------------
              EXTREME GRADIENT BOOSTING
-----------------

Looking carefully, it is fair to ask: **why does the RF returns the highest ROI if it is the method with the lowest weighted accuracy?** This happens because:

  * RF has the highest precision for the BUY class (83%). Indeed, 83% of the BUY predictions are _true positives_, and the remaning 17% are _false positives_
  * Mimizing the number of _false positives_ allows to minimize the quantity of money spent on stocks that will decrease in value during 2019
  * RF has the highest recall for the IGNORE class (79%), meaning that it correctly identified 79% of the stocks that should not be bought

However, all this means that we miss a lot of potential stocks to be bought, since RF leads to a high number of _false negatives_. Indeed, it is easy to see that RF has the lowest recall value for the BUY class (43%), meaning that we only find 43% of the total stocks that should have been classified as BUY-worthy.