# **Secondary Stock Data - Feature Filtering and Optimization Using Correlation Matrix and VIF Notebook**
## In this notebook we take the secondary stock data that was preprocessed in sec_stock_preprocessing.ipynb and apply a triangular correlation matrix together with VIF (Variance Inflation Factor) to narrow down our list of lagged features relative to our target, the closing price (Close_sec).  Other parts of this project have shown that many of our features are extremely collinear with the target, which leaks target information into the model inputs.  That would badly skew our model results, so we need to address it now before we get too far along.  The refined feature list should make our models more efficient, and we will also use it when generating our secondary stocks so that the cointegration tests later in the project are more reliable.

#### As usual let's start by bringing in the libraries and logic necessary for reading in our file.

In [1]:
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

#### Now let's read in our file and take a look; we need the scaled version of our preprocessed secondary stock data.

In [2]:
# Read the preprocessed (scaled) secondary stock data, sec_stock_preprocessed.csv
csv_path = os.path.join(project_root, 'data', 'sec_stock_preprocessed.csv')
sec_stock_data = pd.read_csv(csv_path, parse_dates=['Date'], index_col='Date')
print(sec_stock_data.head())
print(sec_stock_data.shape)

            Close_sec  Volume_sec  Open_sec  High_sec   Low_sec   SMA_sec  \
Date                                                                        
2019-03-14  -0.259231   -0.558310 -0.255721 -0.258880 -0.255947 -0.249447   
2019-03-15  -0.260650    0.087049 -0.259482 -0.260749 -0.257680 -0.249447   
2019-03-18  -0.267337   -0.044652 -0.260699 -0.263240 -0.265112 -0.249447   
2019-03-19  -0.266211   -0.464214 -0.266780 -0.268624 -0.264815 -0.249447   
2019-03-20  -0.264162   -0.246529 -0.263469 -0.262818 -0.261669 -0.249447   

             EMA_sec   RSI_sec   BBM_sec   BBU_sec  ...  \
Date                                                ...   
2019-03-14 -0.258458  0.678803 -0.249447 -0.254554  ...   
2019-03-15 -0.258515  0.678803 -0.249447 -0.254554  ...   
2019-03-18 -0.258838  0.678803 -0.249447 -0.254554  ...   
2019-03-19 -0.259103  0.678803 -0.249447 -0.254554  ...   
2019-03-20 -0.259275  0.678803 -0.249447 -0.254554  ...   

            Momentum_7_Lag_Std_1_3_sec  Moment

#### We first need to set up the correlation filtering, which uses two thresholds: a feature is kept only if its absolute correlation with the target is above 0.25, and the upper-triangular correlation matrix of the kept features uses a pairwise threshold of 0.85 to flag features that are highly correlated with each other.  Doing the correlation step first lets us filter out a first batch of features, which sets us up nicely for our VIF calculations.  First though, let's drop our ticker column, as it is not numeric and cannot be processed by the following calculations.  We also need to define our target, the closing price (Close_sec), since the first filter is based on each feature's correlation with it.

In [3]:
# Setting up our target for the correlation
target = 'Close_sec'

# Drop the non-numeric ticker column for now.
sec_stock_data = sec_stock_data.drop(columns=['ticker'])

# Calculate the correlation of each feature with the target
corr_with_target = sec_stock_data.corr()[target].abs()

# Set a correlation threshold with the target 
corr_threshold_with_target = 0.25

# Filter features that have a decent correlation with the target
selected_features = corr_with_target[corr_with_target > corr_threshold_with_target].index

# Print correlation results for selected features
print(f"Features that correlate well with the target ({target}):\n{corr_with_target[selected_features]}")


# Set up the correlation matrix for the selected features (after filtering by target correlation)
corr_matrix = sec_stock_data[selected_features].corr().abs()

# Set the pairwise correlation threshold
corr_threshold = 0.85

# Set up a triangular matrix to identify features with high pairwise correlation.
# Note: upper_tri is built for inspection only; no features are dropped with it in this cell,
# so the count printed below reflects the target-correlation filter alone.
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Remove the target itself from the list of candidate features
selected_features = [feature for feature in selected_features if feature != target]


print(f"Number of remaining features after correlation matrix filtering:\n{len(selected_features)}")



Features that correlate well with the target (Close_sec):
Close_sec                      1.000000
Volume_sec                     0.448197
Open_sec                       0.999826
High_sec                       0.999914
Low_sec                        0.999921
                                 ...   
Momentum_30_Lag_Avg_1_3_sec    0.287041
Momentum_30_Lag_Std_1_3_sec    0.716781
Momentum_50_Lag_Avg_1_3_sec    0.358614
Momentum_50_Lag_Std_1_3_sec    0.721166
Diff_Close_EMA_sec             0.353725
Name: Close_sec, Length: 85, dtype: float64
Number of remaining features after correlation matrix filtering:
84
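The upper-triangular matrix built above can also be used to actually drop one feature from each highly correlated pair. The sketch below shows one common way to do that on toy data; the helper name `drop_highly_correlated` and the toy columns `a`, `b`, `c` are hypothetical and not part of this notebook. Applying it to `sec_stock_data[selected_features]` would change the feature count and the VIF results shown in this notebook.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.85):
    """Return (kept, dropped) column lists based on the upper triangle of |corr|."""
    corr_matrix = df.corr().abs()
    # Keep only entries strictly above the diagonal so each pair is seen once
    upper_tri = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    # A column is dropped if it is too correlated with any earlier column
    dropped = [col for col in upper_tri.columns if (upper_tri[col] > threshold).any()]
    kept = [col for col in df.columns if col not in dropped]
    return kept, dropped


# Toy data (hypothetical): 'b' is nearly a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})

kept, dropped = drop_highly_correlated(demo, threshold=0.85)
print("kept:", kept)
print("dropped:", dropped)
```

Because only the upper triangle is inspected, the later column of each correlated pair is the one removed, so column order decides which feature survives.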


#### Good start: as the output above shows, we are down to 84 features from our original 151, so we did a nice job of not over-filtering.  That gives our VIF (Variance Inflation Factor) calculation a good pool to work with, and we can manually step the VIF threshold up or down until we find a set of features we like.  Let's get that VIF calculation up and running.
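For reference, this is the standard definition of VIF: feature $i$ is regressed on all the other features, and $R_i^2$ is the R-squared of that regression.

```latex
\mathrm{VIF}_i = \frac{1}{1 - R_i^2}
```

A VIF of 1 means feature $i$ is not explained by the other features at all, and the value grows without bound as $R_i^2$ approaches 1. A threshold of 30 therefore corresponds to $R_i^2 = 1 - 1/30 \approx 0.967$.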

In [7]:
# Let's set up our VIF calculation here.  We make it a function so we can re-run it as we tune the threshold.

def calculate_vif(X):
    vif_data = pd.DataFrame()
    vif_data['Feature'] = X.columns
    vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    
    return vif_data

# Filter data for VIF calculation
sec_data_for_vif = sec_stock_data[selected_features]

# Calculate VIF for the remaining features
vif_result = calculate_vif(sec_data_for_vif)

# Set a VIF threshold 
vif_threshold = 30

# Filter features that have VIF below the threshold
filtered_vif_result = vif_result[vif_result['VIF'] < vif_threshold]

final_features = filtered_vif_result['Feature']

# Print final selected features
print(f"Final selected features with VIF < {vif_threshold}:\n{final_features}")
print(f"Number of final selected features: {len(final_features)}")

print(f"VIF values of selected features:\n{filtered_vif_result}")



  vif = 1. / (1. - r_squared_i)


Final selected features with VIF < 30:
0                      Volume_sec
70           MACD_Lag_Std_1_3_sec
72    MACD_Signal_Lag_Std_1_3_sec
73      MACD_Hist_Lag_Std_1_3_sec
75         ATR_14_Lag_Std_1_3_sec
76     Momentum_1_Lag_Std_1_3_sec
77     Momentum_3_Lag_Std_1_3_sec
78     Momentum_7_Lag_Std_1_3_sec
80    Momentum_30_Lag_Std_1_3_sec
82    Momentum_50_Lag_Std_1_3_sec
Name: Feature, dtype: object
Number of final selected features: 10
VIF values of selected features:
                        Feature       VIF
0                    Volume_sec  1.265354
70         MACD_Lag_Std_1_3_sec  7.740249
72  MACD_Signal_Lag_Std_1_3_sec  5.483497
73    MACD_Hist_Lag_Std_1_3_sec  6.205859
75       ATR_14_Lag_Std_1_3_sec  6.016526
76   Momentum_1_Lag_Std_1_3_sec  6.654860
77   Momentum_3_Lag_Std_1_3_sec  4.478609
78   Momentum_7_Lag_Std_1_3_sec  5.042563
80  Momentum_30_Lag_Std_1_3_sec  4.168060
82  Momentum_50_Lag_Std_1_3_sec  4.206210
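The cell above computes VIF once and keeps every feature below the threshold. A frequently used alternative, sketched here and not the method used in this notebook, removes only the single highest-VIF feature, recomputes, and repeats, because dropping one collinear feature lowers the VIF of the others. The function name `iterative_vif_filter` and the toy data are hypothetical. Like `calculate_vif`, the sketch adds no constant column, which assumes the inputs are already scaled to roughly zero mean.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def iterative_vif_filter(X, threshold=30.0):
    """Drop the highest-VIF column one at a time until every VIF is below threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        worst = vifs.idxmax()
        if vifs[worst] < threshold:
            break  # all remaining features are below the threshold
        X = X.drop(columns=[worst])
    return list(X.columns)


# Toy data (hypothetical): 'b' is nearly a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})

result = iterative_vif_filter(demo, threshold=30.0)
print(result)
```

On the toy data a single-pass filter would discard both `a` and `b`, since each has a very high VIF, while the iterative version keeps one of them along with `c`.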


#### We have a final list of 10 features, down from our original 151, selected for their correlation with our target and their low collinearity with one another.  Due to time restrictions we will just manually implement these in various areas of our project for now.  Later, as time allows, we will explore other methodologies (RFECV/Ridge, Boruta with RandomForest, etc.) to see if the results differ.

#### In this notebook we used a correlation matrix and VIF (Variance Inflation Factor) to filter and score our features against our target, the closing price (Close_sec).  We ended up with 10 features that we will now manually implement where needed in our project.