In this notebook we analyse another algorithm. We already saw how K-Means performs, and we were able to build a portfolio of healthy companies. Now the question is: what can we do with these stocks?
As investors, we are interested in how much return we will earn if we buy these stocks and sell them after a period of time (a buy-and-hold portfolio). A logical next step is therefore to build a model that predicts the return on our portfolio.

In this model we use a multiple linear regression (one model per stock, each with a single target variable) to predict the return on each stock. We begin by importing the "pandas_datareader" library, which lets us download stock information such as opening and closing prices. It is not part of a default installation, so you may need to install it with pip from the Anaconda command line.
For example, let us look at TD Bank's stock information for February of this year:

In [1]:
from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from numpy.linalg import inv
import math

In [2]:
start_date = '2020-02-01'
end_date = '2020-02-29'
february_info = data.DataReader('TD', 'yahoo', start_date, end_date)
display(february_info)

Date,High,Low,Open,Close,Volume,Adj Close
2020-02-03,55.700001,54.98,55.220001,54.990002,920700,54.990002
2020-02-04,56.349998,55.470001,55.470001,56.040001,1020700,56.040001
2020-02-05,56.75,56.279999,56.450001,56.599998,992800,56.599998
2020-02-06,57.009998,56.650002,56.869999,56.93,596100,56.93
2020-02-07,57.099998,56.630001,56.77,56.98,896900,56.98
2020-02-10,56.900002,56.599998,56.66,56.779999,1327000,56.779999
2020-02-11,57.209999,56.82,57.0,56.919998,1133300,56.919998
2020-02-12,57.169998,56.77,57.139999,56.84,2181200,56.84
2020-02-13,56.82,56.349998,56.700001,56.52,1433700,56.52
2020-02-14,56.860001,56.52,56.52,56.82,1056500,56.82


Now that we have access to stock information, we can actually calculate yearly historical returns for our portfolio. That can be done using the formula:

$$\text{Total Stock Return} = \frac{(P_{1} - P_{0}) + \text{Dividends}}{P_{0}}$$

where $P_{0}$ is the initial price of the stock, $P_{1}$ is the ending price of the stock (after one period), and dividends are the payments a company makes to its shareholders. Since we do not have access to the dividend amounts, we use the adjusted close price instead, which is the closing price for a period after accounting for corporate actions such as dividends and stock splits. It is widely used for computing historical stock returns. We will discuss how dividends can be added to our model in the project paper.

So our new equation becomes:
$$\text{Total Stock Return} = \frac{AP_{1} - P_{0}}{P_{0}}$$

where $AP_{1}$ is the adjusted close price after 1 period.
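
As a small worked example of the adjusted-close formula, the sketch below computes the return over the TD sample shown earlier, taking the first Open value as $P_{0}$ and the last Adj Close value as $AP_{1}$. The function name `total_return` is hypothetical and is only used for illustration.

```python
def total_return(initial_price, adjusted_close):
    """Return (AP_1 - P_0) / P_0 as a fraction; multiply by 100 for percent."""
    return (adjusted_close - initial_price) / initial_price

# Values copied from the TD table above:
# Open on 2020-02-03 and Adj Close on 2020-02-14
p0 = 55.220001
ap1 = 56.82

td_return = total_return(p0, ap1)
print(f"{td_return * 100:.2f}%")  # prints 2.90%
```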


We have all our formulas set. Let us discuss how we are going to create our model:
* Since we are a small investor, we begin by selecting 5 stocks from our master portfolio.
* Then we calculate their yearly historical returns. These are our target data: any predictions we make are compared against these values to assess the accuracy of the model.
* We take the features for these 5 stocks from all 5 years. For simplicity, we use the same features as in the K-Means model.
* We now have 5 feature matrices (one per stock) of 5 training examples each (one per year of data), and 5 vectors of target values.
* We use the normal equation below to solve for the parameters of the multiple linear regression:
$$\theta = (X^{T}X)^{-1}X^{T}y$$
* We use these parameters to predict our returns.
* We measure the error with the root-mean-square error (RMSE), which gives the typical size of the prediction error in the same units as the target (here, percentage points of return). The RMSE has the following form:
$$RMSE = \sqrt{\frac{\sum\limits_{i=1}^n(\hat y_{i} - y_{i})^{2}}{n}} $$
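
The last three steps can be sketched on synthetic data. The example below is only an illustration under stated assumptions: the feature matrix, the true parameters, and the noise level are made up, and there are more examples than features so that $X^{T}X$ is invertible. It solves the normal equation with `np.linalg.solve` rather than forming the inverse explicitly, which gives the same $\theta$ whenever the inverse exists.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 50 examples, 3 features, known parameters
n_examples, n_features = 50, 3
X = rng.normal(size=(n_examples, n_features))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=n_examples)

# Normal equation: theta = (X^T X)^{-1} X^T y,
# written as the linear system (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Predictions and RMSE on the same data
y_pred = X @ theta
rmse = np.sqrt(np.mean((y_pred - y) ** 2))

print("theta:", theta)
print("RMSE:", rmse)
```

Like the regression cell in this notebook, the sketch does not add an intercept column to the feature matrix.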

In [3]:
# Two telecommunication companies, two tech companies, and Walmart
hist_return = []  # historical returns in percent, ordered by year and then by ticker
tickers = ['CMCSA','T', 'AAPL', 'MSFT', 'WMT']

# (start_date, end_date) for each year; each end_date is the last trading day of that year
periods = [('2014-01-01', '2014-12-31'),
           ('2015-01-01', '2015-12-31'),
           ('2016-01-01', '2016-12-30'),
           ('2017-01-01', '2017-12-29'),
           ('2018-01-01', '2018-12-31')]

for start_date, end_date in periods:
    for ticker in tickers:
        # One download per ticker and year, covering the whole period
        prices = data.DataReader(ticker, 'yahoo', start_date, end_date)
        unit_cost = prices['Open'].iloc[0]        # opening price on the first trading day
        adj_close = prices['Adj Close'].iloc[-1]  # adjusted close on the last trading day
        ticker_return = ((adj_close - unit_cost) / unit_cost)*100
        hist_return.append(ticker_return)

Here the historical return of each stock is calculated for every year from 2014 to 2018. These values serve as the Expected Return targets later on.

In [4]:
hist_return = np.array(hist_return)
years = list(range(2014, 2019))

returns_df = pd.DataFrame(hist_return.reshape(5,5), columns = tickers)
returns_df['year'] = years
returns_df.name = 'HISTORICAL Returns'
print(returns_df.name)
display(returns_df)

HISTORICAL Returns


,CMCSA,T,AAPL,MSFT,WMT,year
0,1.705207,-28.38881,27.456902,11.344808,-3.731348,2014
1,-10.146865,-18.620589,-11.904417,9.353252,-35.554956,2015
2,18.108101,4.171407,7.569124,8.095585,6.627371,2016
3,9.813918,-20.169875,41.511218,31.603032,36.547809,2017
4,-18.278692,-32.165617,-8.887455,15.891587,-8.055473,2018


In the following cell we build, for each of the selected stocks, a table of its financial data from 2018 back to 2014. The data files are read in and stored in a list. The outer loop iterates through each year and selects the features of interest. The inner loop takes each stock from the selected year and adds it to its own list. It then takes the appropriate Expected Return values (calculated above) and appends them to each stock as a new column. Lastly, we store the resulting dataframes in another list titled datatables. To see the results, uncomment the last two lines of the cell.

In [5]:
# Read in all datasets
df2018 = pd.read_csv("2018_Financial_Data.csv")
df2017 = pd.read_csv("2017_Financial_Data.csv")
df2016 = pd.read_csv("2016_Financial_Data.csv")
df2015 = pd.read_csv("2015_Financial_Data.csv")
df2014 = pd.read_csv("2014_Financial_Data.csv")

# Insert data into list for iteration and initialize other lists
datatables = []
listtotal = []
list_stocks = []
list_years = [df2018, df2017, df2016, df2015, df2014]
for i in range(len(list_years)):
    
    # Clear list_stocks for the upcoming year
    if len(list_stocks)!=0:
        list_stocks.clear()
        
    # Modify data to get desired amount of features
    financial_data = pd.DataFrame(list_years[i], columns = ['Symbol', 'Revenue', 'Revenue Growth', 'Gross Profit', 
                                                  'Operating Income', 'Earnings before Tax', 'Free Cash Flow', 
                                                  'Net Income', 'Total current assets',
                                                  'Operating Expenses',  'Net Debt', 'Short-term debt', 'Long-term debt', 
                                                  'Total shareholders equity', 'Weighted Average Shs Out', 
                                                  'Total current liabilities', 'Total debt', 'Total liabilities']) 
    
    for j in range(len(tickers)):
        list_stocks.append(pd.DataFrame(financial_data.loc[financial_data['Symbol'] == tickers[j], :]))
        
        if i == 0:
            # Add stocks to their own list
            listtotal.append(list_stocks[j])
            
        elif i == (len(list_years)-1):
            # Append stock data together throughout 5 years
            # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
            listtotal[j] = pd.concat([listtotal[j], list_stocks[j]])
            
            # Add on Expected Return column to each stock
            # Note: the rows are ordered 2018 -> 2014 (the order of list_years),
            # while hist_return is ordered 2014 -> 2018
            listtotal[j]['Expected Return'] = [hist_return[j],hist_return[j+len(tickers)], 
                                            hist_return[j+(2*len(tickers))], hist_return[j+(3*len(tickers))],
                                            hist_return[j+(4*len(tickers))]]
            
            datatables.append(pd.DataFrame(listtotal[j]))
            
        else:
            listtotal[j] = pd.concat([listtotal[j], list_stocks[j]])
            
#for k in range(0, len(datatables)):
    #display(datatables[k])
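
The same per-stock stacking can be sketched on a tiny made-up example. The frames, symbols, and values below are hypothetical stand-ins for the yearly CSV files; the sketch only shows the pattern of selecting one symbol from each yearly table, concatenating the rows, and attaching a target column whose order matches the row order.

```python
import pandas as pd

# Hypothetical yearly tables (stand-ins for the CSV files)
yearly = {
    2014: pd.DataFrame({'Symbol': ['AAA', 'BBB'], 'Revenue': [10.0, 20.0]}),
    2015: pd.DataFrame({'Symbol': ['AAA', 'BBB'], 'Revenue': [11.0, 19.0]}),
    2016: pd.DataFrame({'Symbol': ['AAA', 'BBB'], 'Revenue': [12.5, 21.0]}),
}
# Hypothetical yearly returns in percent, keyed by (symbol, year)
returns = {('AAA', 2014): 1.0, ('AAA', 2015): -2.0, ('AAA', 2016): 3.0,
           ('BBB', 2014): 0.5, ('BBB', 2015): 4.0, ('BBB', 2016): -1.5}

tables = {}
for symbol in ['AAA', 'BBB']:
    rows = []
    for year, frame in yearly.items():
        row = frame.loc[frame['Symbol'] == symbol].copy()
        row['year'] = year
        rows.append(row)
    table = pd.concat(rows, ignore_index=True)
    # Look the target up by (symbol, year) so it always matches its row
    table['Expected Return'] = [returns[(symbol, int(y))] for y in table['year']]
    tables[symbol] = table

print(tables['AAA'])
```

Keying the targets by symbol and year, rather than by list position, keeps each return attached to the matching year even if the yearly tables are read in a different order.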

In this cell we apply multiple linear regression to our data. First we build the X and y matrices for the normal equation, then we calculate the best parameters for our regression model. Finally, we evaluate the error between our predictions and the real returns using the RMSE.

In [8]:
# Loop through each of the tables and calculate the parameters for multiple linear regression 
# After calculating the parameters, test the regression and calculate the RMSE
expected_return = []
error = []
param_theta_apple = []
for i in range(0, len(datatables)):
    
    # Find what are the X and Y matrices for normal equation
    # (the positional axis argument of drop was removed in pandas 2.0, so columns= is used)
    x = datatables[i].drop(columns=['Symbol', 'Expected Return']).to_numpy()
    y = datatables[i]['Expected Return'].to_numpy()
    
    # Calculate the parameters
    XTX = np.dot(x.transpose(),x)
    XTX_inverse = inv(XTX)
    XTX_inverse_X = np.dot(XTX_inverse, x.transpose())
    param_theta = np.dot(XTX_inverse_X, y)
    

    # Find RMSE for the model using the training data and targets we have
    error_sum = 0
    for j in range (0,5):
        y_pred = np.dot(param_theta, x[j])
        expected_return.append(y_pred)
        #print(y_pred) #==> see the predictions here
        pred_difference = y_pred - returns_df[tickers[i]][j]
        error_sum = error_sum + ((pred_difference**2) / 5)
    RMSE = math.sqrt(error_sum)
    error.append(RMSE)

In [9]:
expected_return = np.array(expected_return)
# expected_return was filled ticker by ticker (outer loop) and year by year (inner loop),
# so each row of the reshaped array is one ticker and each column is one year
returns_pred = pd.DataFrame(expected_return.reshape(5,5), index = tickers, columns = years)
returns_pred['RMSE'] = error  # one RMSE value per ticker
returns_pred.name = 'PREDICTED Returns'
print(returns_pred.name)
display(returns_pred)

PREDICTED Returns


,2014,2015,2016,2017,2018,RMSE
CMCSA,11.53125,-5.09375,25.5625,16.15625,-12.375,7.109598
T,-59.556285,-60.22416,-34.707178,-41.39959,-41.148818,30.806531
AAPL,77.994347,11.976519,-8.720534,51.756857,18.15482,29.072128
MSFT,58.820363,15.020561,-47.501074,32.965628,27.529715,33.20945
WMT,-16.030975,-47.928816,-5.48332,24.351925,-21.545435,12.504246


In short, the RMSE values are very large (this is discussed further in the paper). This is due to the small number of training examples: the base data set provides only one observation per stock per year, so each regression is fit on just five examples. We could look into alternative models, but in the end we would face the same problem. However, it is interesting to see that the predictions tend to follow the sign of the historical returns. Unfortunately, this is not enough for a risk-averse investor, and with the data we have we cannot make reliable predictions this way.
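
The effect of having fewer training examples than features can be seen directly in the linear algebra. The sketch below uses random numbers, not the notebook's data: with 5 rows and 17 feature columns (the shape of each per-stock table here), $X^{T}X$ has rank at most 5, so it is not invertible in exact arithmetic and `inv` cannot be relied on. `np.linalg.lstsq` is one standard alternative that returns the minimum-norm least-squares solution; it is a swapped-in technique for illustration, not the method used above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in data (assumption): 5 examples, 17 features
X = rng.normal(size=(5, 17))
y = rng.normal(size=5)

XTX = X.T @ X
# X^T X has the same rank as X, which cannot exceed the number of rows
rank = np.linalg.matrix_rank(X)
print("shape of X^T X:", XTX.shape, "rank:", rank)

# Minimum-norm least-squares solution; works even when X^T X is singular
theta, residuals, lstsq_rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

rmse = np.sqrt(np.mean((X @ theta - y) ** 2))
print("in-sample RMSE:", rmse)
```

Because there are more parameters than examples, such a fit can reproduce the training targets almost exactly, so the in-sample RMSE says little about how well the model would predict unseen years.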