In [1]:
# Code to do library installations
import sys # Import the library that does system activities (like install other packages)

Next we'll have code that you, as a user, will input. 

You'll submit what company ticker you're interested in, and the start and end dates of interest. 

You can tind the ticker symbols for a lot of companies [here](http://www.eoddata.com/symbols.aspx?AspxAutoDetectCookieSupport=1).

When selected a date range, keep in mind COVID-19 has had an insane effect on the market. Stocks are trading very irreguarly and differently than the did, pre-COVID. If you traing on historical data before COVID, and try to use it after COVID, your model might not be that effective. 

In [2]:
# Choose your ticker
tickerSymbol = "AMZN"

# Choose date range - format should be 'YYYY-MM-DD' 
startDate = '2015-04-01' # as strings
endDate = '2020-01-01' # as strings

Next, we'll go ahead and install that *yfinance* Python library. As a reminder, this is how we'll get stock price information from the Yahoo! Finance website. 

This will be what we use to go and get the stock data for that ticker.

In [3]:
# Check if local computer has the library yfinance. If not, install. Then Import it.
!{sys.executable} -m pip install yfinance # Check if the machine has yfinance, if not, download yfinance
import yfinance as yf # Import library to access Yahoo finance stock data

Collecting yfinance
  Downloading yfinance-0.1.70-py2.py3-none-any.whl (26 kB)
Collecting requests>=2.26
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
[?25l[K     |█████▏                          | 10 kB 37.5 MB/s eta 0:00:01[K     |██████████▍                     | 20 kB 18.7 MB/s eta 0:00:01[K     |███████████████▋                | 30 kB 23.4 MB/s eta 0:00:01[K     |████████████████████▊           | 40 kB 21.7 MB/s eta 0:00:01[K     |██████████████████████████      | 51 kB 10.4 MB/s eta 0:00:01[K     |███████████████████████████████▏| 61 kB 12.2 MB/s eta 0:00:01[K     |████████████████████████████████| 63 kB 2.1 MB/s 
Collecting lxml>=4.5.1
  Downloading lxml-4.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.4 MB)
[K     |████████████████████████████████| 6.4 MB 20.4 MB/s 
Installing collected packages: requests, lxml, yfinance
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Unin

Now that *yfinance* is imported, let's go ahead and get the stock data using the *yfinance* package.

We'll print out a preview of what the data looks like once it is complete. 

In [5]:
# Create ticker yfinance object
tickerData = yf.Ticker(tickerSymbol)

# Create historic data dataframe and fetch the data for the dates given. 
df = tickerData.history(start = startDate, end = endDate)

# Print statement showing the download is done

# Show what the first 5 rows of the data frame
# Note the dataframe has:
#   - Date (YYY-MM-DD) as an index
#   - Open (price the stock started as)
#   - High (highest price stock reached that day)
#   - Low (lowest price stock reached that day)
#   - Close (price the stock ended the day as)
#   - Volume (how many shares were traded that day)
#   - Dividends (any earnings shared to shareholders)
#   - Stock Splits (any stock price changes)

print('-----------------------')
print('Done!')
print(df.head())

-----------------------
Done!
                  Open        High         Low       Close   Volume  \
Date                                                                  
2015-04-01  372.100006  373.160004  368.339996  370.260010  2458100   
2015-04-02  370.500000  373.279999  369.000000  372.250000  1875300   
2015-04-06  370.100006  380.200012  369.359985  377.040009  3050700   
2015-04-07  376.149994  379.309998  374.029999  374.410004  1954900   
2015-04-08  374.660004  381.579987  374.649994  381.200012  2636400   

            Dividends  Stock Splits  
Date                                 
2015-04-01          0             0  
2015-04-02          0             0  
2015-04-06          0             0  
2015-04-07          0             0  
2015-04-08          0             0  


Let's get another useful library imported, [pandas](https://pandas.pydata.org/). 

*pandas* is the best way to manipulate dataframe objects.

[*numpy*](https://numpy.org/) is also helpful dealing with data structures. 

In [6]:
# Import the library that does dataframe management
import pandas as pd # Library that manages dataframes
import numpy as np

The date is just a string right now, but Python is smart and can realize it is a date if we help it out. These date variable types are easier to work with and efficient. 

Let's change the date time from a string to a date type. 

In [7]:
# Change the date column to a pandas date time column 

# Define string format
date_change = '%Y-%m-%d'

# Create a new date column from the index
df['Date'] = df.index

# Perform the date type change
df['Date'] = pd.to_datetime(df['Date'], format = date_change)

# Create a variable that is the date column
Dates = df['Date']

We know the "Open", "High", "Low", "Close", "Volume" are useful, but there is more data that can be derived off of this data. 

Financial Technical Indicators are useful to understand what is going on with a particular stock. 

We will create some of these with help from a package called *ta* standing for technical anlysis. 

First, we'll have to install and then import the package.

In [8]:
# Add financial information and indicators 
!{sys.executable} -m pip install ta # Download ta
from ta import add_all_ta_features # Library that does financial technical analysis 

Collecting ta
  Downloading ta-0.10.1.tar.gz (24 kB)
Building wheels for collected packages: ta
  Building wheel for ta (setup.py) ... [?25l[?25hdone
  Created wheel for ta: filename=ta-0.10.1-py3-none-any.whl size=28985 sha256=e2b90cd8b6daad602c24be6ab3f549b6df1524bc76d322f8f22ee7609a5789ea
  Stored in directory: /root/.cache/pip/wheels/bc/2a/c2/a56e77d07edc16a1fa7fb012667e55cb0643cfa65996bddecc
Successfully built ta
Installing collected packages: ta
Successfully installed ta-0.10.1


Now that the package is imported, let's add these technical indicators to our dataframe.

We'll print out each column name of our dataframe to see what new columns we gained. 

In [9]:
# Add all technical analysis to the dataframe we've already loaded
df = add_all_ta_features(df, "Open", "High", "Low", "Close", "Volume", fillna=True) 

print(df.columns)

  dip[idx] = 100 * (self._dip[idx] / value)
  din[idx] = 100 * (self._din[idx] / value)


Index(['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits',
       'Date', 'volume_adi', 'volume_obv', 'volume_cmf', 'volume_fi',
       'volume_em', 'volume_sma_em', 'volume_vpt', 'volume_vwap', 'volume_mfi',
       'volume_nvi', 'volatility_bbm', 'volatility_bbh', 'volatility_bbl',
       'volatility_bbw', 'volatility_bbp', 'volatility_bbhi',
       'volatility_bbli', 'volatility_kcc', 'volatility_kch', 'volatility_kcl',
       'volatility_kcw', 'volatility_kcp', 'volatility_kchi',
       'volatility_kcli', 'volatility_dcl', 'volatility_dch', 'volatility_dcm',
       'volatility_dcw', 'volatility_dcp', 'volatility_atr', 'volatility_ui',
       'trend_macd', 'trend_macd_signal', 'trend_macd_diff', 'trend_sma_fast',
       'trend_sma_slow', 'trend_ema_fast', 'trend_ema_slow',
       'trend_vortex_ind_pos', 'trend_vortex_ind_neg', 'trend_vortex_ind_diff',
       'trend_trix', 'trend_mass_index', 'trend_dpo', 'trend_kst',
       'trend_kst_sig', 'trend_kst_diff', 'trend

Yay! Now, we've added the techincal indicators!

You can learn and understand what all these new values are on the documentation of the *ta* site. They have a [dictionary](https://technical-analysis-library-in-python.readthedocs.io/en/latest/ta.html) that exmplains what these indicators are, and what they mean. 

Now that we have the technical indicators and dates sorted out, let's add some date features that will show what month it is, what day of the dear it is, what day in the quarter it is, ect. 

To do that, we will use a Python package called *fastai*. 

Let's install the package.

In [10]:
# Install fastai to use the date function
!{sys.executable} -m pip install fastai # Download fastai 
import fastai.tabular # Library that does date factors



After it is imported, let's add the new date features. 

In [11]:
# Define the date parts 
fastai.tabular.add_datepart(df,'Date', drop = 'True')

# Ensure the correct format
df['Date'] = pd.to_datetime(df.index.values, format = date_change)

# Add the date parts
fastai.tabular.add_cyclic_datepart(df, 'Date', drop = 'True')

  for n in attr: df[prefix + n] = getattr(field.dt, n.lower())
  for n in attr: df[prefix + n] = getattr(field.dt, n.lower())
  df[prefix + 'Elapsed'] = field.astype(np.int64) // 10 ** 9
  df[prefix + 'Elapsed'] = field.astype(np.int64) // 10 ** 9


Unnamed: 0_level_0,Open,High,Low,Close,Volume,Dividends,Stock Splits,volume_adi,volume_obv,volume_cmf,...,Is_year_start,Elapsed,weekday_cos,weekday_sin,day_month_cos,day_month_sin,month_year_cos,month_year_sin,day_year_cos,day_year_sin
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2015-04-01,372.100006,373.160004,368.339996,370.260010,2458100,0,0,-4.997689e+05,2458100,-0.203315,...,False,1427846400,-0.222521,0.974928,1.000000,0.000000,6.123234e-17,1.0,0.021516,0.999769
2015-04-02,370.500000,373.279999,369.000000,372.250000,1875300,0,0,4.729342e+05,4333400,0.109137,...,False,1427932800,-0.900969,0.433884,0.978148,0.207912,6.123234e-17,1.0,0.004304,0.999991
2015-04-06,370.100006,380.200012,369.359985,377.040009,3050700,0,0,1.745000e+06,7384100,0.236319,...,False,1428278400,1.000000,0.000000,0.500000,0.866025,6.123234e-17,1.0,-0.064508,0.997917
2015-04-07,376.149994,379.309998,374.029999,374.410004,1954900,0,0,7.149045e+04,5429200,0.007655,...,False,1428364800,0.623490,0.781831,0.309017,0.951057,6.123234e-17,1.0,-0.081676,0.996659
2015-04-08,374.660004,381.579987,374.649994,381.200012,2636400,0,0,2.418781e+06,8065600,0.201979,...,False,1428451200,-0.222521,0.974928,0.104528,0.994522,6.123234e-17,1.0,-0.098820,0.995105
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-12-24,1793.810059,1795.569946,1787.579956,1789.209961,881300,0,0,2.367677e+08,351414100,0.110968,...,False,1577145600,0.623490,0.781831,-0.050649,-0.998717,8.660254e-01,-0.5,0.990532,-0.137279
2019-12-26,1801.010010,1870.459961,1799.500000,1868.770020,6005400,0,0,2.424870e+08,357419500,0.148106,...,False,1577318400,-0.900969,0.433884,0.347305,-0.937752,8.660254e-01,-0.5,0.994671,-0.103102
2019-12-27,1882.920044,1901.400024,1866.010010,1869.800049,6186600,0,0,2.376255e+08,363606100,0.038199,...,False,1577404800,-0.900969,-0.433884,0.528964,-0.848644,8.660254e-01,-0.5,0.996298,-0.085965
2019-12-30,1874.000000,1884.000000,1840.619995,1846.890015,3674700,0,0,2.350131e+08,359931400,0.026659,...,False,1577664000,1.000000,0.000000,0.918958,-0.394356,8.660254e-01,-0.5,0.999407,-0.034422


### Data Pulling, Complete!

We've now pulled all the data we need. We'll start creating our model now! 

Let's start by defining how far out we want to do our predictions. 

I'm interested in 1 day out, 5 days out, and 10 days out so I'll add those to my *shifts* list. 

We'll also define how much of our data we want to use to train and how much we will use to evaluate the model. 75% is a good start. 




In [12]:
# Define key model parameters

# Set days out to predict 
shifts = [1,5,10]

# Set a training percentage
train_pct = .75

# Plotting dimensions
w = 16 # width
h = 4 # height 

### Defining Functions

Next, we'll define some functions to do some tasks for us.

The first one is boring and tedious, but the packages we used were a little lazy on what variable types they used. The following function just goes through and makes sure the right columns are numbers (floats) and the right columns are categories (like strings). 

In [13]:
# Ensure column types are correct

def CorrectColumnTypes(df):
  # Input: dataframe 
  # ouptut: dataframe (with column types changed)

  # Numbers
  for col in df.columns[1:80]:
      df[col] = df[col].astype('float')

  for col in df.columns[-10:]:
      df[col] = df[col].astype('float')

  # Categories 
  for col in df.columns[80:-10]:
      df[col] = df[col].astype('category')

  return df 

In order to do the days in the future, we have to move our closing costs by that number of days.

We'll write a function that does that for us. 

In [14]:
# Create the lags 
def CreateLags(df,lag_size):
  # inputs: dataframe , size of the lag (int)
  # ouptut: dataframe ( with extra lag column), shift size (int)

  # add lag
  shiftdays = lag_size
  shift = -shiftdays
  df['Close_lag'] = df['Close'].shift(shift)
  return df, shift


Finally, we'll actually divide the historic data into the test and train sets.

We'll split up the x's and the y as well for this.

We'll end up with a test and training set for the *x*'s and the *y*. 

In [15]:
# Split the testing and training data 
def SplitData(df, train_pct, shift):
  # inputs: dataframe , training_pct (float between 0 and 1), size of the lag (int)
  # ouptut: x train dataframe, y train data frame, x test dataframe, y test dataframe, train data frame, test dataframe

  train_pt = int(len(df)*train_pct)
  
  train = df.iloc[:train_pt,:]
  test = df.iloc[train_pt:,:]
  
  x_train = train.iloc[:shift,1:-1]
  y_train = train['Close_lag'][:shift]
  x_test = test.iloc[:shift,1:-1]
  y_test = test['Close'][:shift]

  return x_train, y_train, x_test, y_test, train, test



The best way to understand how good our predictions are is to actually *see* and *compare*. We'll do this by making a time series visualization.This visual will compare the actual versus the predicted over time. 

The best visualization package for Python is [plotly](https://plotly.com/). 

We'll start by installing it. 

In [16]:
!{sys.executable} -m pip install plotly # Download plotly 
import plotly.graph_objs as go  # Import the graph ojbects 



Now we'll make a function that greats a sweet graph for us

In [17]:
# Function to make the plots
def PlotModelResults_Plotly(train, test, pred, ticker, w, h, shift_days,name):
  # inputs: train dataframe, test dataframe, predicted value (list), ticker ('string'), width (int), height (int), shift size (int), name (string)
  # output: None

  # Create lines of the training actual, testing actual, prediction 
  D1 = go.Scatter(x=train.index,y=train['Close'],name = 'Train Actual') # Training actuals
  D2 = go.Scatter(x=test.index[:shift],y=test['Close'],name = 'Test Actual') # Testing actuals
  D3 = go.Scatter(x=test.index[:shift],y=pred,name = 'Our Prediction') # Testing predction

  # Combine in an object  
  line = {'data': [D1,D2,D3],
          'layout': {
              'xaxis' :{'title': 'Date'},
              'yaxis' :{'title': '$'},
              'title' : name + ' - ' + tickerSymbol + ' - ' + str(shift_days)
          }}
  # Send object to a figure 
  fig = go.Figure(line)

  # Show figure
  fig.show()

## Making the Model

In order to make the models, we'll be using a package called SciKit Learn.

We'll have to install and import the package. 

In [18]:
# Import sklearn modules that will help with modeling building

!{sys.executable} -m pip install sklearn # Download sklearn 
from sklearn.metrics import mean_squared_error # Install error metrics 
from sklearn.linear_model import LinearRegression # Install linear regression model
from sklearn.neural_network import MLPRegressor # Install ANN model 
from sklearn.preprocessing import StandardScaler # to scale for ann



As discussed earlier, the easiest form of machine learning is linear regression. 

In [19]:
# Regreesion Function

def LinearRegression_fnc(x_train,y_train, x_test, y_test):
  #inputs: x train data, y train data, x test data, y test data (all dataframe's)
  # output: the predicted values for the test data (list)
  
  lr = LinearRegression()
  lr.fit(x_train,y_train)
  lr_pred = lr.predict(x_test)
  lr_MSE = mean_squared_error(y_test, lr_pred)
  lr_R2 = lr.score(x_test, y_test)
  print('Linear Regression R2: {}'.format(lr_R2))
  print('Linear Regression MSE: {}'.format(lr_MSE))

  return lr_pred


In [20]:
# ANN Function 

def ANN_func(x_train,y_train, x_test, y_test):

  # Scaling data
  scaler = StandardScaler()
  scaler.fit(x_train)
  x_train_scaled = scaler.transform(x_train)
  x_test_scaled = scaler.transform(x_test)


  MLP = MLPRegressor(random_state=1, max_iter=1000, hidden_layer_sizes = (100,), activation = 'identity',learning_rate = 'adaptive').fit(x_train_scaled, y_train)
  MLP_pred = MLP.predict(x_test_scaled)
  MLP_MSE = mean_squared_error(y_test, MLP_pred)
  MLP_R2 = MLP.score(x_test_scaled, y_test)

  print('Muli-layer Perceptron R2 Test: {}'.format(MLP_R2))
  print('Multi-layer Perceptron MSE: {}'.format(MLP_MSE))

  return MLP_pred

Let's create one last function to calculate how much money we would have made, had we been trading this strategy

In [21]:
def CalcProfit(test_df,pred,j):
  pd.set_option('mode.chained_assignment', None)
  test_df['pred'] = np.nan
  test_df['pred'].iloc[:-j] = pred
  test_df['change'] = test_df['Close_lag'] - test_df['Close'] 
  test_df['change_pred'] = test_df['pred'] - test_df['Close'] 
  test_df['MadeMoney'] = np.where(test_df['change_pred']/test_df['change'] > 0, 1, -1) 
  test_df['profit'] = np.abs(test['change']) * test_df['MadeMoney']
  profit_dollars = test['profit'].sum()
  print('Would have made: $ ' + str(round(profit_dollars,1)))
  profit_days = len(test_df[test_df['MadeMoney'] == 1])
  print('Percentage of good trading days: ' + str( round(profit_days/(len(test_df)-j),2))     )

  return test_df, profit_dollars

## Let's Start Predicting!
## Time To Make Money!

We've gotten our data, created functions, now let's get to the point of actually doing predictions. 

For the ticker, we'll have a prediction for each time length out into the future. 

In [22]:
# Go through each shift....

for j in shifts: 
  print(str(j) + ' days out:')
  print('------------')
  df_lag, shift = CreateLags(df,j)
  df_lag = CorrectColumnTypes(df_lag)
  x_train, y_train, x_test, y_test, train, test = SplitData(df, train_pct, shift)

  # Linear Regression
  print("Linear Regression")
  lr_pred = LinearRegression_fnc(x_train,y_train, x_test, y_test)
  test2, profit_dollars = CalcProfit(test,lr_pred,j)
  PlotModelResults_Plotly(train, test, lr_pred, tickerSymbol, w, h, j, 'Linear Regression')

  # Artificial Neuarl Network 
  print("ANN")
  MLP_pred = ANN_func(x_train,y_train, x_test, y_test)
  test2, profit_dollars = CalcProfit(test,MLP_pred,j)
  PlotModelResults_Plotly(train, test, MLP_pred, tickerSymbol, w, h, j, 'ANN')
  print('------------')







1 days out:
------------
Linear Regression


  estimator=estimator,
  X = check_array(X, **check_params)
  X = check_array(X, **check_params)


Linear Regression R2: -1.4614441095763415
Linear Regression MSE: 36514.338996211685
Would have made: $ -310.6
Percentage of good trading days: 0.52


ANN
Muli-layer Perceptron R2 Test: 0.9803675697398474
Multi-layer Perceptron MSE: 291.23765640247757
Would have made: $ 259.3
Percentage of good trading days: 0.51



Stochastic Optimizer: Maximum iterations (1000) reached and the optimization hasn't converged yet.




Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.


Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.


Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.



------------
5 days out:
------------
Linear Regression
Linear Regression R2: -19.589286295045365
Linear Regression MSE: 307272.3181395197
Would have made: $ 499.4
Percentage of good trading days: 0.53


ANN
Muli-layer Perceptron R2 Test: 0.9032010552337836
Multi-layer Perceptron MSE: 1444.6171530934607
Would have made: $ 576.0
Percentage of good trading days: 0.49


------------
10 days out:
------------
Linear Regression
Linear Regression R2: -47.31703559129529
Linear Regression MSE: 732692.6260633768
Would have made: $ 2198.8
Percentage of good trading days: 0.59



Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.


Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.


Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.



ANN
Muli-layer Perceptron R2 Test: 0.7150876669500319
Multi-layer Perceptron MSE: 4320.487855795382
Would have made: $ 1290.5
Percentage of good trading days: 0.56


------------
