# Stock Price Predictor

The first step is to import the modules needed to make predictions.

In [1]:
%matplotlib notebook

import sys, os, pdb
import uuid, json, time
import pandas as pd

# import prediction algorithms
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

sys.path.append(os.path.join(os.getcwd(), 'src'))
# import main stocks predictor / data preprocessing file
import lib.stocks as st
import lib.visualizer as vzr

## Configuration & Parameters

Below we set the tickers to train on and to predict, the start dates of the training and testing periods, the end date, and the window and horizon sizes to apply.

In [2]:
DATE_TRAIN_START = '2017-01-01'
DATE_TEST_START = '2018-06-01'
DATE_END = '2019-01-01'

WINDOWS = [5]
HORIZONS = [7]

TICKERS_TRAIN = ['AMZN', 'AAPL', 'CSCO', 'NVDA']
TICKERS_PREDICT = ['NVDA', 'AMZN']
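The code in `lib.stocks` is not shown here, so the exact meaning of a window and a horizon is an assumption. The output columns further down (`returns_…`, `roll_mean_N_…`) suggest rolling features computed over a look-back window, with the prediction target lying some number of rows ahead. A minimal sketch of that idea, using made-up prices and hypothetical column names:

```python
import pandas as pd

# Hypothetical adjusted-close prices for one ticker (made-up values).
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 107.0, 110.0],
    index=pd.date_range("2017-01-02", periods=6, freq="D"),
    name="adj_close",
)

window = 3   # assumed meaning: look-back length of the rolling feature
horizon = 2  # assumed meaning: how many rows ahead the target lies

features = pd.DataFrame({
    "adj_close": prices,
    "returns": prices.pct_change(),
    "roll_mean_{}".format(window): prices.rolling(window).mean(),
})

# Target: the price `horizon` rows in the future.
features["target"] = prices.shift(-horizon)

# Rows lacking a full window or a known future price cannot be used.
usable = features.dropna()
print(usable)
```

Larger windows and horizons both shrink the usable sample, which is one reason to keep the training period comfortably longer than either.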

The next step is to create a directory where we will save the transformed data. Writing to disk avoids holding many data files in memory, since our algorithm may apply multiple windows and horizons (one file for each combination).

Once the directory exists, we load a single dataset containing the needed information about all the specified stocks __before__ transformation.

In [3]:
# create a directory with a unique ID
TRIAL_ID = uuid.uuid1()
DIRECTORY = "_trials/{}".format(TRIAL_ID)
os.makedirs(DIRECTORY)

print("Loading data for {}...".format(', '.join(TICKERS_TRAIN)))

data_files = st.loadMergedData(
    WINDOWS, HORIZONS, TICKERS_TRAIN, TICKERS_PREDICT,
    DATE_TRAIN_START, DATE_END, DIRECTORY
)

print("A new trial started with ID: {}\n".format(TRIAL_ID))
print("The data files generated are:")
print(data_files)

Loading data for AMZN, AAPL, CSCO, NVDA...
AMZN Index(['adj_close_AMZN', 'volume_AMZN', 'returns_AMZN'], dtype='object')
AAPL Index(['adj_close_AAPL', 'volume_AAPL', 'returns_AAPL'], dtype='object')
CSCO Index(['adj_close_CSCO', 'volume_CSCO', 'returns_CSCO'], dtype='object')
NVDA Index(['adj_close_NVDA', 'volume_NVDA', 'returns_NVDA'], dtype='object')
            adj_close_AMZN  volume_AMZN  returns_AMZN  roll_mean_2_AMZN  \
02-01-2017             NaN          NaN           NaN               NaN   
03-01-2017          753.67          NaN           NaN               NaN   
04-01-2017          757.18    -0.286998      0.004657           755.425   
05-01-2017          780.45     1.322250      0.030732           768.815   
06-01-2017          795.99     0.026786      0.019912           788.220   

            roll_mean_3_AMZN  roll_mean_4_AMZN  adj_close_AAPL  volume_AAPL  \
02-01-2017               NaN               NaN             NaN          NaN   
03-01-2017               NaN        ...

... of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

  data = pd.concat(data, axis=1)
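The one-file-per-combination layout described earlier can be sketched with the standard library alone. This is only an illustration: the file names, the placeholder header, and the temporary directory are assumptions, and only the `(horizon, window, path)` tuple order mirrors how `data_files` is unpacked later in this notebook.

```python
import csv
import itertools
import os
import tempfile

windows = [5, 10]
horizons = [7]

# A throwaway directory stands in for the trial directory.
directory = tempfile.mkdtemp(prefix="trial_")

data_files = []
for horizon, window in itertools.product(horizons, windows):
    # Hypothetical naming scheme; the real file names are not shown here.
    path = os.path.join(directory, "h{}_w{}.csv".format(horizon, window))
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerow(["Date", "feature"])  # placeholder header
    data_files.append((horizon, window, path))

# Later, each file is opened only when its combination is processed,
# so at most one transformed dataset is in memory at a time.
for horizon, window, path in data_files:
    with open(path, newline="") as handle:
        header = next(csv.reader(handle))
    print(horizon, window, header)
```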


Now we create a list of regressors which we would like to use for making predictions:

In [4]:
classifiers = [
    ('GradientBoosted', MultiOutputRegressor(GradientBoostingRegressor())),
    # ('AdaBoost', MultiOutputRegressor(AdaBoostRegressor()))
]
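`GradientBoostingRegressor` predicts a single target, so wrapping it in `MultiOutputRegressor` fits one independent copy per target column. The sketch below shows this on synthetic data; the two targets stand in for the two predicted tickers, and the `n_estimators` and `random_state` settings are arbitrary choices for the example rather than the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(0)
X = rng.rand(60, 4)

# Two synthetic targets standing in for two tickers' prices.
y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])

model = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=20, random_state=0)
)
model.fit(X, y)

# One fitted regressor per target column.
print(len(model.estimators_))

# Predictions have one column per target.
predictions = model.predict(X[:5])
print(predictions.shape)
```

Because each target gets its own model, correlations between the targets are not exploited; they only enter through whatever shared features are in `X`.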

In [5]:
from IPython.display import display

# Collect the results of each model, keyed by horizon (h) and window (w)
all_results = {}

# Train each of the models on the data and save the highest-performing
# model as a pickle file
for h, w, file_path in data_files:
    # Start measuring time
    time_start = time.time()
    
    # load data
    finance = pd.read_csv(file_path, encoding='utf-8', header=0)
    finance = finance.set_index(finance.columns[0])
    finance.index.name = 'Date'
    
    # perform preprocessing
    X_train, y_train, X_test, y_test = \
        st.prepareDataForClassification(finance, DATE_TEST_START, TICKERS_PREDICT, h, w)

    results = {}

    print("Starting an iteration with a horizon of {} and a window of {}...".format(h, w))

    for i, clf_ in enumerate(classifiers):
        print("Training and testing the {} model...".format(clf_[0]))
        
        # perform k-fold cross validation
        results['cross_validation_%s'%clf_[0]] = \
            st.performCV(X_train, y_train, 10, clf_[1], clf_[0])
        
        # perform predictions with testing data and record result
        preds, results['accuracy_%s'%clf_[0]] = \
            st.trainPredictStocks(X_train, y_train, X_test, y_test, clf_[1], DIRECTORY)

        print("\nBelow is a sample of the results:\n")
        # reindex_axis was removed in pandas 1.0; reindex(..., axis=1) replaces it
        display(preds.sample(10).sort_index().reindex(sorted(preds.columns), axis=1))
            
        # plot results
        vzr.visualize_predictions(preds, title='Testing Data Results')

    results['window'] = w
    results['horizon'] = h

    # Stop time counter
    time_end = time.time()
    results['time_lapsed'] = time_end - time_start

    all_results['H%s_W%s'%(h, w)] = results

print(json.dumps(all_results, indent=4))

Starting an iteration with a horizon of 7 and a window of 5...
Training and testing the GradientBoosted model...

Below is a sample of the results:

| Date | adj_close_AMZN_pred | adj_close_AMZN_true | adj_close_NVDA_pred | adj_close_NVDA_true |
|---|---|---|---|---|
| 21-09-2017 | 965.817943 | 964.65 | 183.680189 | 180.7799 |
| 22-12-2017 | 1180.782249 | 1168.36 | 196.591995 | 195.27 |
| 24-10-2017 | 993.624903 | 975.9 | 198.540595 | 198.68 |
| 25-01-2017 | 825.853833 | 836.52 | 107.2584 | 107.450653 |
| 25-05-2017 | 993.178581 | 993.38 | 141.366565 | 138.140994 |
| 26-06-2018 | 1496.990839 | 1497.05 | 225.504526 | 225.52 |
| 27-04-2018 | 1496.90034 | 1497.05 | 225.504526 | 225.52 |
| 29-12-2017 | 1179.144302 | 1169.47 | 195.455515 | 193.5 |
| 30-07-2018 | 1496.90034 | 1497.05 | 225.504526 | 225.52 |
| 30-08-2018 | 1496.914953 | 1497.05 | 225.504526 | 225.52 |
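The rows above are ordered by day of month rather than by date, which is what `sort_index()` produces when the index holds day-first strings such as `21-09-2017`. A minimal sketch, assuming the index really is made of `dd-mm-yyyy` strings as the output suggests, of parsing it to datetimes so that the sort becomes chronological (the `pred` column and its values are made up):

```python
import pandas as pd

# Hypothetical predictions indexed by day-first date strings.
preds = pd.DataFrame(
    {"pred": [1.0, 2.0, 3.0]},
    index=["25-01-2017", "21-09-2017", "22-12-2017"],
)
preds.index.name = "Date"

# Lexical sort: ordered by the leading day-of-month characters.
lexical = preds.sort_index()

# Parse the strings first, then sort chronologically.
chronological = preds.copy()
chronological.index = pd.to_datetime(chronological.index, format="%d-%m-%Y")
chronological = chronological.sort_index()

print(lexical)
print(chronological)
```

The same parsing step also matters for any date-based train/test split, since comparing day-first strings against a cutoff such as `'2018-06-01'` would not compare dates.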


[Plot: Testing Data Results]

{
    "H7_W5": {
        "cross_validation_GradientBoosted": 1.337,
        "accuracy_GradientBoosted": 0.6814344739820678,
        "window": 5,
        "horizon": 7,
        "time_lapsed": 0.970531702041626
    }
}
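With several windows and horizons, the response object grows to one entry per combination, and it becomes useful to pick out the best run programmatically. The sketch below reproduces the bookkeeping pattern of the loop above using only the standard library; `run_trial` and its `score` are hypothetical stand-ins for the training step, not the metrics reported by `lib.stocks`.

```python
import json
import time


def run_trial(horizon, window):
    """Stand-in for one training iteration; returns a made-up score."""
    return 1.0 / (horizon + window)


all_results = {}
for horizon, window in [(7, 5), (14, 5)]:
    started = time.perf_counter()
    score = run_trial(horizon, window)
    all_results["H{}_W{}".format(horizon, window)] = {
        "score": score,
        "window": window,
        "horizon": horizon,
        "time_lapsed": time.perf_counter() - started,
    }

# Pick the combination with the highest (hypothetical) score.
best_key = max(all_results, key=lambda key: all_results[key]["score"])

report = json.dumps(all_results, indent=4)
print(report)
print("Best run:", best_key)
```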
