# **Cointegration Testing Analysis and Decision Tree Modeling for the AAPL Stock**
## In this notebook we start to bring things together from other sections of the project.  We combine our analyzed and preprocessed core_stock_data with our preprocessed and dynamically generated secondary stock data to perform cointegration testing to see if we can create cointegrated pairs, which we will add as separate features to use for our Decision Tree model.  We will also merge in our exogenous data for this Decision Tree model, data we haven't talked about a whole lot yet.  This data will aid in adding in another layer to the stock chains we want to create at the end output at the end of the Decision Tree based on their relationships.  We will use the output of the Decision Tree to read into how the stocks work with each other so we can start to formulate our trading strategy in the next step.

#### As with our other notebooks let's read in our data to use for this.

In [1]:
import sys
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error, r2_score
from statsmodels.tsa.stattools import coint, adfuller
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

project_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

In [2]:
# Now let's access the main core_stock_data.csv file, as well as the secondary stock data and exo_data to all be used in this notebook!
csv_path = os.path.join(project_root, 'data', 'core_stock_unscaled.csv')
core_stock_data = pd.read_csv(csv_path, parse_dates=['Date'], index_col= 'Date')
print(core_stock_data.head())
print(core_stock_data.shape)

csv_path = os.path.join(project_root, 'data', 'secondary_stocks_gen_filtered.csv')
sec_stock_data = pd.read_csv(csv_path, parse_dates=['Date'], index_col= 'Date')
print(sec_stock_data.head())
print(sec_stock_data.shape)

csv_path = os.path.join(project_root, 'data', 'exo_data_unscaled.csv')
exo_data = pd.read_csv(csv_path, parse_dates=['Date'], index_col= 'Date')
print(exo_data.head())
print(exo_data.shape)



            Close_core  Volume_core  Open_core  High_core   Low_core  \
Date                                                                   
2019-03-14   45.932499   94318000.0  45.974998  46.025002  45.639999   
2019-03-15   46.529999  156171600.0  46.212502  46.832500  45.935001   
2019-03-18   47.005001  104879200.0  46.450001  47.097500  46.447498   
2019-03-19   46.632500  126585600.0  47.087502  47.247501  46.480000   
2019-03-20   47.040001  124140800.0  46.557499  47.372501  46.182499   

            SMA_core   EMA_core   RSI_core  BBM_core   BBU_core  ...  \
Date                                                             ...   
2019-03-14  41.35925  42.219051  75.741602  41.35925  46.695085  ...   
2019-03-15  41.50025  42.388107  76.985910  41.50025  47.003365  ...   
2019-03-18  41.72940  42.569162  78.724282  41.72940  47.174667  ...   
2019-03-19  41.92075  42.728509  73.527018  41.92075  47.369412  ...   
2019-03-20  42.12190  42.897587  80.396901  42.12190  47.569044

#### Now let's grab just our AAPL stock for this notebook.

In [3]:
aapl_data = core_stock_data[core_stock_data['Ticker'] == 'AAPL']

# As a check make sure data is aligned with AAPL and sec_stock_data
aapl_data = aapl_data.loc[sec_stock_data.index]

#### Now we are ready.  We have our core stock and our secondary stock data.  Before we do the cointegration tests we will perform a ADF (Augmented Dickey-Fuller) test to check for stationarity.  This will help our cointegration results.

In [4]:

# Perform ADF Test on the 'Close' price column
adf_result = adfuller(aapl_data['Close_core'])

# Display ADF test results
print('ADF Statistic:', adf_result[0])
print('p-value:', adf_result[1])
print('Critical Values:', adf_result[4])




ADF Statistic: -25.057373959777124
p-value: 0.0
Critical Values: {'1%': -3.430374530859638, '5%': -2.8615508423302964, '10%': -2.5667757709800068}


#### Let's break down our results.  The ADF Statistic as a negative number is expected and a higher negative number here suggests stronger evidence against our hypothesis that the data is non-stationary.  The output here is fairly large which means the data is very likely to be stationary.  For the p value a value of 0 can occur when the ADF statistic is very large.  This value also aids in showing that the data is stationary.  Finally for the critical values as all being negative as well, this again shows that we can reject the null hypothesis and safely say that our AAPL data is stationary and move on with the next phase of the notebook.

#### Now since there are 200 stocks in our secondary stock data we will need to make a function to process the tests more efficiently.

In [5]:
# Let's make a function that runs the test for our core stock for each secondary stock we have generated.
def cointegration_test(core_stock_data, sec_stock_data, core_ticker):
    results = []
    
    core_data = core_stock_data[core_stock_data['Ticker'] == core_ticker]['Close_core']
    
    secondary_tickers = sec_stock_data['ticker'].unique()
    
    for ticker in secondary_tickers:
        sec_data = sec_stock_data[sec_stock_data['ticker'] == ticker]['Close_sec']
        
        # As a check ensure the lengths match by trimming the larger series (if needed)
        min_len = min(len(core_data), len(sec_data))
        core_trimmed = core_data.iloc[-min_len:]
        sec_trimmed = sec_data.iloc[-min_len:]
        
        # Perform the cointegration test.
        coint_t, p_value, critical_values = coint(core_trimmed, sec_trimmed)
        
        # Append results to our results list
        results.append({
            'Secondary_Ticker' : ticker,
            'T-Statistic' : coint_t,
            'P_value' : p_value,
            'Critical_Values' : critical_values,
            'Cointegrated' : p_value < 0.05 # True if cointegrated
        })

    # Convert the results into a dataframe
    results_df = pd.DataFrame(results)
    
    return results_df

# Call the function now and perform cointegration testing on subject core stock and all secondary stocks
results_df = cointegration_test(core_stock_data, sec_stock_data, core_ticker='AAPL')

print(results_df)

# Filter and display only the cointegrated pairs
cointegrated_pairs = results_df[results_df['Cointegrated'] == True]
print("\nCointegrated Pairs:")
print(cointegrated_pairs)







    Secondary_Ticker  T-Statistic   P_value  \
0                RMD    -0.814387  0.932787   
1               AMAT    -2.290508  0.378299   
2               JNPR    -2.632174  0.224915   
3               DECK    -2.150604  0.449963   
4                JBL    -1.825677  0.617168   
..               ...          ...       ...   
195               ON    -1.828137  0.615957   
196             SWKS    -0.497450  0.964513   
197              MCD    -2.279811  0.383669   
198             ADSK    -0.781050  0.937149   
199                A    -1.380142  0.804239   

                                       Critical_Values  Cointegrated  
0    [-3.9044872726627737, -3.340613212756168, -3.0...         False  
1    [-3.9044872726627737, -3.340613212756168, -3.0...         False  
2    [-3.9044872726627737, -3.340613212756168, -3.0...         False  
3    [-3.9044872726627737, -3.340613212756168, -3.0...         False  
4    [-3.9044872726627737, -3.340613212756168, -3.0...         False  
..       

#### Neat!  Out of our 200 secondary stocks that we generated and preprocessed, we have 3 cointegration pairs with our core stock!  For this notebook using AAPL as our core stock it looks like AZN, ROL, and VTRS (AstraZeneca PLC, Rollins Inc, and Viatris Inc respectively) are the cointegrated pairs as their p-value is less than 0.05 which is the deterministic stat for the cointegration test.

#### Ok time for our Decision Tree models.  We have a lot of setup to do for them first too.  We need to rename our ticker feature columns in our core_stock_ and sec_stock_data dataframes so they don't overwrite each other, and we then need to merge all 3 of our datasets together.  We will be careful as to how since it is a lot of data at this point.  Once done we will integrate our cointegrated pair information into the new df so this information is fed into the Decision Tree model.

In [6]:
# First let's rename our two columns.
core_stock_data.rename(columns = {'Ticker' : 'core_ticker'}, inplace = True)
sec_stock_data.rename(columns = {'ticker' : 'sec_ticker'}, inplace = True)

# Now let's make a variable that we can use for our cointegrated pairs
cointegrated_stocks = ['AZN', 'ROL', 'VTRS'] # To be edited for each core stock's results.
# Now let's make a single column that will have a binary '1' for when the cointegrated stock pair appears.
sec_stock_data['cointegrated_flag'] = sec_stock_data['sec_ticker'].apply(lambda x: 1 if x in cointegrated_stocks else 0)

sec_stock_data.head()




Unnamed: 0_level_0,sec_ticker,Close_sec,Volume_sec,Open_sec,High_sec,Low_sec,SMA_sec,EMA_sec,RSI_sec,BBM_sec,...,Momentum_7_Lag_Std_1_3_sec,Momentum_30_Lag_Avg_1_3_sec,Momentum_30_Lag_Std_1_3_sec,Momentum_50_Lag_Avg_1_3_sec,Momentum_50_Lag_Std_1_3_sec,OBV_Lag_Avg_1_3_sec,OBV_Lag_Std_1_3_sec,Diff_Close_EMA_sec,Ratio_Close_EMA_sec,cointegrated_flag
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2019-03-14,RMD,101.0,972500.0,102.57,102.57,100.959999,104.9046,101.0,63.818062,104.9046,...,1e-10,2.940002,1e-10,12.889999,1e-10,1e-10,1611779.0,1e-10,1.0,0
2019-03-15,RMD,100.370003,2279400.0,100.900002,101.730003,100.199997,104.9046,100.975294,63.818062,104.9046,...,1e-10,2.940002,1e-10,12.889999,1e-10,1e-10,1611779.0,1e-10,0.994006,0
2019-03-18,RMD,97.400002,1915700.0,100.360001,100.610001,96.940002,104.9046,100.835087,63.818062,104.9046,...,1e-10,2.940002,1e-10,12.889999,1e-10,1e-10,1611779.0,1e-10,0.965934,0
2019-03-19,RMD,97.900002,1101100.0,97.660004,98.190002,97.07,104.9046,100.719985,63.818062,104.9046,...,1e-10,2.940002,1e-10,12.889999,1e-10,1e-10,2100176.0,1e-10,0.972002,0
2019-03-20,RMD,98.809998,1467600.0,99.129997,100.800003,98.449997,104.9046,100.645084,63.818062,104.9046,...,1e-10,2.940002,1e-10,12.889999,1e-10,1e-10,961414.0,1e-10,0.981767,0


#### The new feature column has been added on the end of our sec_stock_data and the column renaming is done.  Let's move on to merging.  Deep breath!

In [7]:
# First merge, core_stock_data with sec_stock_data.
#coresec_data = pd.merge(core_stock_data, sec_stock_data, how = 'left', left_index=True, right_index=True)

#print(coresec_data.head())
#print(coresec_data.shape)

# Next is the new coresec_data with our exo_data.
#coresec_exo_data = pd.merge(coresec_data, exo_data, how = 'left', left_index=True, right_index=True)

#print(coresec_exo_data.head())
#print(coresec_exo_data.shape)

# Finally merge sec_stock_data with exo_data, as we will need this here for setting up our model.
sec_exo_data = pd.merge(sec_stock_data, exo_data, how = 'left', on = 'Date')

#print(sec_exo_data.head())
#print(sec_exo_data.shape)

# One more, since we are focused on just one core stock at a time let's merge its data only with sec_stock_data and exo_data.
aapl_sec_exo_data = pd.merge(aapl_data, sec_exo_data, how = 'left', on = 'Date')

print(aapl_sec_exo_data.shape)

MemoryError: Unable to allocate 59.0 GiB for an array with shape (152, 52101601) and data type float64

#### Alright that's a big step accomplished, and we have compiled a lot of data (almost 1.3B values!).  Let's set up our Decision Tree model.

#### We will now set up our X,y for the model.  Similar to how we did for the Linear Regression models in this project we will eschew the traditional train_test_split methodology in favor of a more straightforward approach that works well with time series data.

In [None]:

# Let's set up our Decision Tree model and get it ready.
X = coresec_exo_data.drop(['Close_core', 'core_ticker', 'sec_ticker'], axis = 1) # ALL the features and data minus the target
y = aapl_data['Close_core'] # Our target

# Now create the X,y for train and test for time series.
split_index = int(len(X) * 0.8)

X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]


#### Ok let's now instantiate the model and prep for the initial run.

In [None]:
# Set up the Decision Tree model itself.
dec_tree = DecisionTreeRegressor(random_state=42)

# Set up the parameter grid for tuning if needed.
param_grid = {
    'max_depth' : [3, 5, 6],
    'min_samples_split' : [10, 20, 30],
    'min_samples_leaf' : [5, 10, 15]
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(dec_tree, param_grid, cv = 3, scoring = 'neg_mean_squared_error', n_jobs = -1)

# Fit the model.
grid_search.fit(X_train, y_train)



#### Now let's get the best model and run it again on the training data.

In [None]:
best_tree = grid_search.best_estimator_

best_tree.fit(X_train, y_train)

# Now we can make predictions
y_train_pred = best_tree.predict(X_train)
y_test_pred = best_tree.predict(X_test)