# Financial Stock Price Prediction with Technical Indicators

## Introduction

This notebook is aimed to serve as an introduction to the creation of a not personalized recommendation algorithm operating on stock market prices in order to predict future profitability of those assets. The time series are converted into financial technical indicators to predict the future.

The notebook covers the processing steps, the calculation of the features fed to the prediction models, and provides the outcome of a profitability prediction model.

In [1]:
# Local Mode
#storageDIR = "HugeStockMarketDataset" # creates a dataset directory in the same folder as the notebook
#storageDIRNews = "NewsSentimentDataset"
# Container Mode
storageDIR = ""


## Dataset

Different types of financial asset recommendation systems use different sources of data to produce their recommendations. The approach we introduce in this notebook is known as Profitability Prediction, where assets that are predicted to gain significant value over the following six months are recommended. This type of approach uses past pricing data, i.e. the price for different assets over time, to identify pricing trends and hence future profitable assets. Hence, as input, we need the price history over time for a range of assets.

### Pricing data

For illustration, in this notebook we will use open pricing data, available from Yahoo! Finance. In particular, it contains the historical price and volume data for US-based stocks and ETFs trading on the NYSE, NASDAQ and AMEX markets, and it runs up to the end of March 2022. Each entry of this dataset is comprised of:
 - Date: The date of the pricing data
 - Open: Opening price for that day
 - High: The maximum price for that day
 - Low: The minimum price for that day
 - Close: The closing price for that day
 - AdjClose: The adjusted closing price
 - Volume: The amount of the asset that is traded

We introduce here three different ways to download the data. Along with this example, we provide the pricing information for ~200 financial assets in a single file.

Please, download the following file and upload it.


In [2]:
import pandas as pd
import numpy as np
import glob, os, random, math

file_name = "reduced-stocks.csv"

prices_df = pd.read_csv(file_name)
prices_df["Date"] = pd.to_datetime(prices_df["Date"])

print("Dataset Extraction and Loading as Dataframe Complete")

Dataset Extraction and Loading as Dataframe Complete


In [3]:
prices_df.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock
0,2018-01-02,30.403334,30.42,29.383333,29.433332,29.433332,1333500.0,ACGL
1,2018-01-03,29.466667,30.016666,29.110001,29.459999,29.459999,1887600.0,ACGL
2,2018-01-04,29.463333,29.84,29.07,29.57,29.57,1835100.0,ACGL
3,2018-01-05,29.663334,29.870001,29.309999,29.453333,29.453333,1257300.0,ACGL
4,2018-01-08,29.469999,29.58,29.216667,29.456667,29.456667,1380300.0,ACGL


#### Filtering the pricing data

Pandas allows us to perform manipulations on the pricing data so that we can extract only what we need for training the model. We will only use pricing data from 2018 to 2021. We shall consider data until July 2019 as the past, and we shall train models at different points of time.

Lets first filter the dataset to only hold data from the dates we care about:

In [4]:
prices_df['Date'] = pd.to_datetime(prices_df['Date'])
min_date = pd.to_datetime('2018-01-01')
max_date = pd.to_datetime('2021-01-10')
# Selecting only that data from either 2016 or 2017
prices_df = prices_df[prices_df['Date'] >= min_date]
prices_df = prices_df[prices_df['Date'] <= max_date]
print("Filtered the data prices")
prices_df

Filtered the data prices


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock
0,2018-01-02,30.403334,30.420000,29.383333,29.433332,29.433332,1333500.0,ACGL
1,2018-01-03,29.466667,30.016666,29.110001,29.459999,29.459999,1887600.0,ACGL
2,2018-01-04,29.463333,29.840000,29.070000,29.570000,29.570000,1835100.0,ACGL
3,2018-01-05,29.663334,29.870001,29.309999,29.453333,29.453333,1257300.0,ACGL
4,2018-01-08,29.469999,29.580000,29.216667,29.456667,29.456667,1380300.0,ACGL
...,...,...,...,...,...,...,...,...
131274,2021-01-04,13.320000,13.320000,13.320000,13.320000,13.320000,0.0,AKO-A
131275,2021-01-05,13.640000,13.640000,13.640000,13.640000,13.640000,200.0,AKO-A
131276,2021-01-06,13.560000,13.980000,13.560000,13.770000,13.770000,1000.0,AKO-A
131277,2021-01-07,13.700000,13.700000,13.700000,13.700000,13.700000,200.0,AKO-A


After this step, we print below the number of stocks.

In [5]:
stocks = prices_df['Stock'].unique().tolist()
print("Num. stocks with data between 2018 and 2021: " + str(len(stocks)))

Num. stocks with data between 2018 and 2021: 200


## Feature Creation for the Model

Now that we have collected the pricing data and the knowledge graph, we can craft the features we can use in our model. We distinguish to kind of features: price-based technical indicators and knowledge graph embeddings.

### Technical indicators

Now that we have the pricing data in a more useful form, we can now convert that data into additional indicators that a machine learned model can use for identifying patterns/trends. In effect, we want to capture how the price for an asset changed in the recent past, for use as indicators for future performance (of course past performance is not always a good indicator, and more advanced approaches may mix in other sources of evidence here). We convert the pricing data into 3 different indicator (feature) types:

**NOTE:** In the following equations, the sub-index $t$ indicates the time of computation of the metric. $t-1$ might indicate, then, the previous day, and so on.

1. <b>Returns</b>: The returns on investment (ROI) represent the percentage change between close prices on different dates, across different periods.

\begin{equation}
\text{ROI}_t(n) = \frac{\text{Close}_t - \text{Close}_{t-n}}{\text{Close}_{t-n}}
\end{equation}

2. <b>Volatility</b>: Volatility represents the risk of a stock as expressed by its fluctuations, and is expressed as the standard deviation of the logarithmic returns of the stock. In this case, we take the daily returns.
\begin{equation}
\text{Volatility}_t(N,n) = \sqrt{\frac{1}{N-1} \sum_{i=0}^{N-1} \log^2(\text{ROI}_{t-i}(n)) - \left(\frac{1}{N-1} \sum_{i=0}^{N-1} \log(\text{ROI}_{t-i}(n))\right)^2} * \sqrt{n}
\end{equation}
Here, $N$ represents the number of periods we consider for measuring the Volatility (here, we take $N$ days), and $n$ represents the period of time for computing the ROI (here, we take $n = 1$ day). In the right square root, $n$ is the number of periods covered by the ROI calculation. For instance, if we took a monthly measure of ROI, we should measure $n$ in months. In this example, as each period is equal to a day, we take $n = 1$.

3. <b>Mean price</b>: This indicator just represents the average price of an asset over a period of time:
\begin{equation}
\text{Mean}_t(n) = \frac{1}{n} \sum_{i=0}^{n-1} \text{Close}_{t-i}
\end{equation}




In [6]:
def returns(df, periods=[1,3,5,7,14,21,28,63,126]):
    for t in periods:
        df[f"return_{t}"] = (df['Close'] - df['Close'].shift(t)) / df['Close'].shift(t)
    return df

def log_returns(df, periods=[1,3,5,7,14,21,28,63,126]):
    for t in periods:
        df[f"log_return_{t}"] = (df['Close'] - df['Close'].shift(t)) / df['Close'].shift(t)
    return df

def volatility(df, roi_periods = [1], periods=[3,5,7,14,21,28,63,126]):
    for n in roi_periods:
        name = f"log_return_{n}"
        if not name in df.columns:
            log_returns(df, roi_periods)
            break

    for t in periods:
        for n in roi_periods:
            df[f"volatility_{t}_{n}"] = df[f"log_return_{n}"].rolling(window=t).std()*np.sqrt(n)

    df['3_28_volatility_ratio'] = df['volatility_3_1'] / df['volatility_28_1']
    return df

def mean_price(df, periods=[3,5,7,14,21,28,63,126]):
    for t in periods:
        df[f'mean_{t}'] = df['Close'].rolling(window=t).mean()

    return df


In [7]:
import datetime as dt
pricedfs = []
i = 0
timea = dt.datetime.now()
for s in stocks:
    df = prices_df[prices_df['Stock'] == s]
    pricedfs.append(df)
    i += 1
    if i % 100 == 0:
        print("Processed " + str(i) + " stocks (" + str((dt.datetime.now() - timea).seconds) + " s)")
print("Dataset Filtering Complete")

Processed 100 stocks (1 s)
Processed 200 stocks (3 s)
Dataset Filtering Complete


In [8]:
pd.options.mode.chained_assignment = None  # default='warn'

newpricedfs = []
i = 0
timea = dt.datetime.now()
for p in pricedfs:
    if not p.empty:
        p1 = returns(p)
        p1 = volatility(p1)
        p1 = mean_price(p1)
        newpricedfs.append(p1.dropna())
        i += 1
        if i % 100 == 0:
            timeb = dt.datetime.now()
            print("Processed " + str(i) + " stocks (" + str((dt.datetime.now() - timea).seconds) + " s)")
print ("Metrics calculated for all stocks")

Processed 100 stocks (2 s)
Processed 200 stocks (4 s)
Metrics calculated for all stocks


In [9]:
newpricedfs[0]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock,return_1,return_3,...,volatility_126_1,3_28_volatility_ratio,mean_3,mean_5,mean_7,mean_14,mean_21,mean_28,mean_63,mean_126
126,2018-07-03,26.650000,26.920000,26.650000,26.760000,26.760000,537300.0,ACGL,0.004882,0.012869,...,0.012760,0.225157,26.616666,26.508000,26.508571,26.642619,26.932857,26.790833,26.774286,28.080106
127,2018-07-05,26.959999,26.990000,26.660000,26.930000,26.930000,926200.0,ACGL,0.006353,0.017763,...,0.012774,0.077833,26.773333,26.640000,26.554285,26.634524,26.928254,26.817143,26.744603,28.060027
128,2018-07-06,26.980000,27.299999,26.680000,27.280001,27.280001,1482600.0,ACGL,0.012997,0.024409,...,0.012826,0.379685,26.990000,26.812000,26.678571,26.666191,26.932063,26.850833,26.725714,28.041852
129,2018-07-09,27.250000,27.959999,26.950001,27.889999,27.889999,1858000.0,ACGL,0.022361,0.042227,...,0.012984,0.686943,27.366667,27.098000,26.910000,26.744524,26.964286,26.918333,26.743175,28.029445
130,2018-07-10,28.000000,28.250000,27.860001,28.240000,28.240000,1784000.0,ACGL,0.012549,0.048645,...,0.013034,0.475161,27.803333,27.420000,27.170000,26.839762,26.971111,26.985357,26.774656,28.019788
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
756,2021-01-04,36.150002,36.230000,34.639999,34.900002,34.900002,2167100.0,ACGL,-0.032437,-0.011331,...,0.021510,1.225778,35.516668,35.436001,35.302858,34.996429,34.510000,34.220000,32.960476,31.760159
757,2021-01-05,35.040001,35.360001,34.500000,35.040001,35.040001,1814500.0,ACGL,0.004011,-0.015177,...,0.021259,1.186043,35.336668,35.378001,35.297144,35.101429,34.626667,34.257500,33.030318,31.819921
758,2021-01-06,35.709999,37.160000,35.669998,36.580002,36.580002,3357300.0,ACGL,0.043950,0.014139,...,0.021581,1.744733,35.506668,35.634001,35.542858,35.219287,34.741429,34.330000,33.131905,31.890794
759,2021-01-07,36.840000,36.959999,35.919998,36.240002,36.240002,2516900.0,ACGL,-0.009295,0.038395,...,0.021495,1.309340,35.953335,35.766001,35.672858,35.279287,34.883810,34.426072,33.227143,31.963810


We finally compute the target of our recommendations: return at 6 months into the future (126 financial days)

In [10]:
for i in range(len(newpricedfs)):
    newpricedfs[i]["target"] = newpricedfs[i]["return_126"].shift(-126)
    newpricedfs[i] = newpricedfs[i].dropna()

In [11]:
newpricedfs[0]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock,return_1,return_3,...,3_28_volatility_ratio,mean_3,mean_5,mean_7,mean_14,mean_21,mean_28,mean_63,mean_126,target
126,2018-07-03,26.650000,26.920000,26.650000,26.760000,26.760000,537300.0,ACGL,0.004882,0.012869,...,0.225157,26.616666,26.508,26.508571,26.642619,26.932857,26.790833,26.774286,28.080106,-0.036622
127,2018-07-05,26.959999,26.990000,26.660000,26.930000,26.930000,926200.0,ACGL,0.006353,0.017763,...,0.077833,26.773333,26.640,26.554285,26.634524,26.928254,26.817143,26.744603,28.060027,-0.020052
128,2018-07-06,26.980000,27.299999,26.680000,27.280001,27.280001,1482600.0,ACGL,0.012997,0.024409,...,0.379685,26.990000,26.812,26.678571,26.666191,26.932063,26.850833,26.725714,28.041852,-0.034824
129,2018-07-09,27.250000,27.959999,26.950001,27.889999,27.889999,1858000.0,ACGL,0.022361,0.042227,...,0.686943,27.366667,27.098,26.910000,26.744524,26.964286,26.918333,26.743175,28.029445,-0.052348
130,2018-07-10,28.000000,28.250000,27.860001,28.240000,28.240000,1784000.0,ACGL,0.012549,0.048645,...,0.475161,27.803333,27.420,27.170000,26.839762,26.971111,26.985357,26.774656,28.019788,-0.052054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
630,2020-07-06,28.799999,29.170000,28.270000,28.500000,28.500000,1207000.0,ACGL,0.014596,-0.005236,...,0.478540,28.246667,28.352,28.334286,29.133571,30.360952,30.335000,27.766191,33.748889,0.224561
631,2020-07-07,28.299999,28.299999,27.430000,27.510000,27.510000,1327100.0,ACGL,-0.034737,-0.022735,...,0.740220,28.033333,28.180,28.138572,28.868571,30.005238,30.248929,27.773810,33.621825,0.273719
632,2020-07-08,27.370001,28.059999,27.000000,27.650000,27.650000,2009300.0,ACGL,0.005089,-0.015664,...,0.781254,27.886667,27.980,28.131429,28.650000,29.624286,30.199286,27.770159,33.498730,0.322966
633,2020-07-09,27.490000,27.799999,26.219999,27.040001,27.040001,1792500.0,ACGL,-0.022061,-0.051228,...,0.610335,27.400000,27.758,27.941429,28.399286,29.274762,30.157143,27.743968,33.374206,0.340237


In [12]:
full_kpis_df = pd.concat(newpricedfs)
full_kpis_df

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock,return_1,return_3,...,3_28_volatility_ratio,mean_3,mean_5,mean_7,mean_14,mean_21,mean_28,mean_63,mean_126,target
126,2018-07-03,26.650000,26.920000,26.650000,26.760000,26.760000,537300.0,ACGL,0.004882,0.012869,...,0.225157,26.616666,26.508,26.508571,26.642619,26.932857,26.790833,26.774286,28.080106,-0.036622
127,2018-07-05,26.959999,26.990000,26.660000,26.930000,26.930000,926200.0,ACGL,0.006353,0.017763,...,0.077833,26.773333,26.640,26.554285,26.634524,26.928254,26.817143,26.744603,28.060027,-0.020052
128,2018-07-06,26.980000,27.299999,26.680000,27.280001,27.280001,1482600.0,ACGL,0.012997,0.024409,...,0.379685,26.990000,26.812,26.678571,26.666191,26.932063,26.850833,26.725714,28.041852,-0.034824
129,2018-07-09,27.250000,27.959999,26.950001,27.889999,27.889999,1858000.0,ACGL,0.022361,0.042227,...,0.686943,27.366667,27.098,26.910000,26.744524,26.964286,26.918333,26.743175,28.029445,-0.052348
130,2018-07-10,28.000000,28.250000,27.860001,28.240000,28.240000,1784000.0,ACGL,0.012549,0.048645,...,0.475161,27.803333,27.420,27.170000,26.839762,26.971111,26.985357,26.774656,28.019788,-0.052054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131148,2020-07-06,13.500000,13.800000,13.500000,13.650000,13.650000,23400.0,AKO-A,0.011111,0.112469,...,1.290079,13.133333,12.814,12.681429,12.672857,12.683810,12.402143,11.772857,12.241270,-0.024176
131149,2020-07-07,14.250000,14.250000,13.580000,13.580000,13.580000,1700.0,AKO-A,-0.005128,0.108571,...,1.338175,13.576667,13.050,12.864286,12.722857,12.709048,12.475000,11.819524,12.231111,0.004418
131150,2020-07-08,13.620000,13.620000,13.190000,13.190000,13.190000,800.0,AKO-A,-0.028719,-0.022963,...,0.458651,13.473333,13.234,12.977143,12.754286,12.751905,12.533929,11.856667,12.218413,0.043973
131151,2020-07-09,13.190000,13.190000,13.190000,13.190000,13.190000,0.0,AKO-A,0.000000,-0.033700,...,0.352799,13.320000,13.422,13.090000,12.794286,12.794762,12.600714,11.888889,12.205238,0.038666


In [13]:
full_kpis_df.to_csv(os.path.join(storageDIR, "kpis.csv"), index=False)

In order to use the model for learning and testing, we thin the dataset by getting only the data at Mondays.

In [14]:
filtered_kpis_df = full_kpis_df.loc[full_kpis_df['Date'].dt.weekday == 1]
filtered_kpis_df

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Stock,return_1,return_3,...,3_28_volatility_ratio,mean_3,mean_5,mean_7,mean_14,mean_21,mean_28,mean_63,mean_126,target
126,2018-07-03,26.650000,26.920000,26.650000,26.760000,26.760000,537300.0,ACGL,0.004882,0.012869,...,0.225157,26.616666,26.508000,26.508571,26.642619,26.932857,26.790833,26.774286,28.080106,-0.036622
130,2018-07-10,28.000000,28.250000,27.860001,28.240000,28.240000,1784000.0,ACGL,0.012549,0.048645,...,0.475161,27.803333,27.420000,27.170000,26.839762,26.971111,26.985357,26.774656,28.019788,-0.052054
135,2018-07-17,28.530001,28.709999,28.330000,28.360001,28.360001,1008000.0,ACGL,-0.005959,0.009972,...,0.811561,28.440001,28.288001,28.224286,27.451429,27.185556,27.255119,26.874286,27.965291,-0.017278
140,2018-07-24,29.170000,29.250000,28.900000,29.129999,29.129999,1916700.0,ACGL,0.003099,0.004136,...,0.173422,29.073333,29.036000,28.867143,28.354286,27.739048,27.498452,27.040635,27.905370,-0.011329
145,2018-07-31,30.469999,30.700001,30.170000,30.559999,30.559999,1869000.0,ACGL,0.007251,0.018667,...,0.143114,30.346666,30.108000,29.815714,29.222857,28.613810,28.067857,27.290529,27.883122,-0.039594
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131130,2020-06-09,12.290000,12.290000,12.290000,12.290000,12.290000,0.0,AKO-A,0.000000,0.059483,...,1.864851,12.543333,12.118000,12.018571,11.725714,11.734286,11.711429,11.122857,12.560397,0.004882
131135,2020-06-16,13.200000,13.200000,12.880000,12.880000,12.880000,200.0,AKO-A,0.008614,0.032051,...,1.663026,13.116667,12.838000,12.681429,12.227143,11.969048,11.950357,11.243810,12.506587,-0.031056
131140,2020-06-23,13.240000,13.240000,12.510000,12.510000,12.510000,600.0,AKO-A,-0.009501,-0.009501,...,0.609242,12.696667,12.694000,12.731429,12.635000,12.247143,12.041429,11.476190,12.409524,0.035172
131145,2020-06-30,12.400000,12.400000,12.030000,12.270000,12.270000,3800.0,AKO-A,-0.010484,-0.002439,...,0.221968,12.356667,12.334000,12.401429,12.640714,12.472381,12.213929,11.639841,12.292063,0.082315


## Dataset split

### Getting the training / test examples

In order to execute this, we need training and test examples. Basically, we shall use the training examples for our model, and the test examples

Then, we are doing the following:
1. Choose a recommendation date. For instance, 2020-06-30.
2. We get 6 months for obtaining target values,
3. The previous 6 months are used as training examples. (Essentially from 2019-07-02 to 2019-12-31)
4. Then, we take the targets of the examples at that point as test targets.

In [15]:
train_data = filtered_kpis_df[(filtered_kpis_df["Date"] >= pd.to_datetime("2019-07-02")) & (filtered_kpis_df["Date"] <= pd.to_datetime("2019-07-02"))]
test_data = filtered_kpis_df[filtered_kpis_df["Date"] == pd.to_datetime("2020-06-30")]

In [16]:
train_data.to_csv(os.path.join(storageDIR, "training.csv"), index=False)
test_data.to_csv(os.path.join(storageDIR, "test.csv"), index=False)

## Basic model
As a baseline, we provide here the training model. Considering the data, we train a random forest model using, as input, the different technical indicators.

In [17]:
basic_kpis = ["return_28","return_63","return_126", "volatility_28_1", "volatility_63_1", "volatility_126_1", "mean_28", "mean_63", "mean_126"]

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

We first read the training and test data from files.

In [19]:
train_data = pd.read_csv(os.path.join(storageDIR, "training.csv"))
test_data = pd.read_csv(os.path.join(storageDIR, "test.csv"))

And we separate the training features and the training targets the algorithm shall learn.

In [20]:
basic_training_data = train_data[basic_kpis]
basic_targets = train_data["target"]

Then, we create our model: a linear regressor. Any regression algorithm might be configured at this point. We have imported several regressors, you can try and change it by another one.

In [21]:
model = LinearRegression()

Finally, we call the function `fit` to train the model. This function receives, as imput, the training data features and the training targets. Once the algorithm is train, we use the `predict` method over the test features to generate the predictions, and we sort the values by descending scores to generate a recommendation ranking

In [22]:
model.fit(basic_training_data, basic_targets)
basic_test_data = test_data[["Stock", "target"]]
basic_test_data["prediction"] = model.predict(test_data[basic_kpis])
basic_ranking = basic_test_data.sort_values( by="prediction", ascending = False).head(10)
basic_ranking

Unnamed: 0,Stock,target,prediction
95,NVR,0.266381,3.433874
142,TPL,0.153643,0.818988
44,FCNCA,0.391289,0.537302
58,HUBS,0.753822,0.279266
86,MTN,0.482515,0.271896
12,AWH,0.611979,0.26913
1,ADSK,0.246331,0.257196
19,BURL,0.31595,0.254117
72,KWR,0.331484,0.214472
25,CHE,0.175073,0.207695


And we compute the average profitability of the model:

In [23]:
basic_prof = basic_ranking["target"].mean()

We do the same for the advanced technical indicators.

We show here the results for our algorithm, in terms of profitability (return on investment) at 6 months. As we can see, the model with basic KPIs is showing a 3.7% profitability over the studied 6 month period on its top 10. This is enough to beat the market median (2.04%)

In [25]:
profitability_df = pd.DataFrame([{"Model" : "Basic KPIs", "RoI@10" : basic_prof}, {"Model" : "Market median", "RoI@10" : test_data["target"].median()}])
profitability_df

Unnamed: 0,Model,RoI@10
0,Basic KPIs,0.372847
1,Market median,0.204059


<!-- In the above table, we are showing the top 10 stocks that were predicted to be profitable. The last two columns report the actual return on investment after 9 months and asset volitility. Note that a return value of 1.5 means a 150% return on investment.

We notice that predicted returns for the top stocks are exceedingly high, which is not ordinary. However, we can also see that the actual returns for these stocks are similar for several of these instances, i.e. the model is not wrong in predicting these as profitable investments in the short term. However, we can also see that the volitility fo these stocks is very high, i.e. these are 'high-risk' assets that may subsequently crash in price.

We can also analyse the statistics for this predictions across the dataset. -->

<!-- The returns and volatility for the top stocks, ranked by predicted returns, are far higher than their averages across the test set. This indicates that ranking assets by their predicted returns can produce some highly profitable but risk-laden investment recommendations, which might be suitable for aggressive investors. However, it remains to be seen how much of this is owed to fluctuations and outliers in the data, and perhaps even if there are better ways to capture the returns and volatility of the dataset.

Next, we look at the differences between the actual and predicted returns. -->

<!-- Lastly, we can examine the mean absolute error and mean squared error of the predictions. As these can be quite dependent on the dataset and problem in question, we also assume a simple baseline, by taking the median of all stock returns from the test dataset. We then compare the results of applying these metrics to the baseline and our predictor model. -->

<!-- We can see from this that the random forest model presents an improvement (reduction) in both MAE and MSE. -->