## Analysis of Stock Prices with Macroeconomic Indicators Using Different Regression Models

Going through a variety of regression analysis techniques, using the closing prices of S&P 500 index companies as the dependent variable and all the macroeconomic (M.E.) indicators as independent variables. The data covers 12 years, from the start of 2012 to the end of 2023, on a month-by-month average basis.

The indicators were deliberately chosen to exclude the very common ones, such as GDP and inflation, to see the effect of the others.
 
The ones used were:
Unemployment Rate, Federal Funds Rate, Housing Starts, Personal Savings Rate, Average Hourly Earnings, Money Supply M1, Long-Term Interest Rates, Average Weekly Hours, Personal Consumption Expenditures, Personal Income.

#### The main task was to compare the following models: 
- Linear Regression
- Lasso Regression
- XGBoost
- ARIMA
- Random Forest Regressor

Results are at the end



In [2]:
import pandas as pd
import numpy as np
import glob
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

The dataset is already mostly preprocessed, and missing values have been handled. See the merging.py file for details on how the historical data and the balance sheet data were combined.

In [3]:
file_path = 'Stocks_ME_Merged_dataset.csv'
dfile = pd.read_csv(file_path)

dfile['Date'] = pd.to_datetime(dfile["Date"])

In [20]:
# Declaring X (independent variables) and y (dependent variable)
X = dfile.drop(columns=['Date', 'Average Close'])
y = dfile['Average Close']

# Random 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [21]:
# Standardizing the data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
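
A minimal sketch of what the scaling step does, using made-up numbers rather than the dataset. The mean and standard deviation are learned from the training rows only and then reused on the test rows, so nothing about the test set leaks into the scaler. The names `X_train_toy`, `X_test_toy` and `toy_scaler` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values, not taken from the dataset)
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[4.0, 400.0]])

toy_scaler = StandardScaler()
X_train_toy_scaled = toy_scaler.fit_transform(X_train_toy)  # learns mean/std from the training rows
X_test_toy_scaled = toy_scaler.transform(X_test_toy)        # reuses the training mean/std

# The same transformation by hand: (x - train_mean) / train_std
manual = (X_test_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)

print(X_test_toy_scaled)
print(manual)
```

After scaling, every training column has mean 0 and standard deviation 1, which puts indicators measured in very different units on a comparable footing.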

### Using Linear Regression
(most common method, no parameter tuning for this one)

In [22]:
from sklearn.linear_model import LinearRegression
# Declaring the model
model = LinearRegression()

model.fit(X_train_scaled, y_train)
model_predictions = model.predict(X_test_scaled) 

# Evaluation of the model
mse = mean_squared_error(y_test, model_predictions)
r2 = r2_score(y_test, model_predictions)

print(f'Linear Regression - Mean Squared Error: {mse}')
print(f'Linear Regression - R2_SCORE: {r2}')

Linear Regression - Mean Squared Error: 532286.3592794591
Linear Regression - R2_SCORE: 0.4496216702810921
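
To make the two metrics concrete, here is a small hand-rolled sketch of MSE and R2 on made-up numbers. The helper names `mse_by_hand` and `r2_by_hand` are hypothetical; the notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn. R2 compares the model's squared error with that of always predicting the mean of the true values, so it is 0 for a mean-only prediction and drops below 0 when the predictions are worse than that.

```python
def mse_by_hand(y_true, y_pred):
    # Average of the squared differences between truth and prediction
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares around the mean)
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true_toy = [10.0, 20.0, 30.0]      # hypothetical targets
close_preds = [12.0, 19.0, 29.0]     # near the truth
mean_preds = [20.0, 20.0, 20.0]      # always predicts the mean
reversed_preds = [30.0, 20.0, 10.0]  # trend in the wrong direction

for label, preds in [("close", close_preds), ("mean", mean_preds), ("reversed", reversed_preds)]:
    print(label, mse_by_hand(y_true_toy, preds), r2_by_hand(y_true_toy, preds))
```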


### Using Lasso Regression
(Adds a penalty on the size of the coefficients when the RSS is calculated and shrinks the least impactful ones to zero, which reduces the effect of multicollinearity and keeps the focus on the more important features)

In [23]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV


warnings.filterwarnings('ignore')  # set before fitting so that fit-time warnings are suppressed

lasso = Lasso()
parameters = {'alpha': np.logspace(-4, 4, 10)}  # 10 log-spaced values between 10^-4 and 10^4
lasso_regressor = GridSearchCV(lasso, parameters, cv=2)  # 2-fold cross-validation (because of limited data)

lasso_regressor.fit(X_train_scaled, y_train)

best_alpha_lasso = lasso_regressor.best_params_['alpha']  # alpha is the strength of the L1 penalty
print(f'Best Alpha for Lasso: {best_alpha_lasso}')

Lasso_prediction = lasso_regressor.predict(X_test_scaled)
lasso_mse = mean_squared_error(y_test, Lasso_prediction)
lasso_r2 = r2_score(y_test, Lasso_prediction)
print(f'Lasso Regression Mean Squared Error: {lasso_mse}')
print(f'Lasso Regression R2 Score: {lasso_r2}')

Best Alpha for Lasso: 166.81005372000558
Lasso Regression Mean Squared Error: 629775.631061466
Lasso Regression R2 Score: 0.34881881927148495
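
The "reduces the least impactful ones to zero" behaviour can be seen on synthetic data. This is only an illustration under assumed conditions: `X_toy`, `y_toy` and the choice `alpha=0.5` are made up, and only the first two of five features actually drive the target.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 5 features, only the first two matter
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

lasso_toy = Lasso(alpha=0.5).fit(X_toy, y_toy)

# Coefficients of the irrelevant features are driven to exactly zero,
# while the relevant ones are shrunk towards zero but kept
print(lasso_toy.coef_)
n_zero = int(np.sum(lasso_toy.coef_ == 0))
print(f"Coefficients set to zero: {n_zero} of {lasso_toy.coef_.size}")
```

For the grid search above, the same inspection should be possible through `lasso_regressor.best_estimator_.coef_`, which shows which indicators the selected alpha kept.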


### Using XGBoost
(Uses gradient boosting: trees are added one at a time, and each new tree learns from the errors of the model built so far in order to correct/reduce its residuals)

In [24]:
import xgboost as xgb

# Creating and fitting the model
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1000, random_state=42)
xg_reg.fit(X_train_scaled, y_train)

# Evaluating and printing the predictions
xgb_predictions = xg_reg.predict(X_test_scaled)
xgb_mse = mean_squared_error(y_test, xgb_predictions)
xgb_r2 = r2_score(y_test, xgb_predictions)

print(f"XGBoost MSE: {xgb_mse}")
print(f"XGBoost R2 Score: {xgb_r2}")

XGBoost MSE: 950400.6645766628
XGBoost R2 Score: 0.017296007657378265
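
The "learns from the errors of the previous model" idea can be sketched by hand. Note that this is plain gradient boosting for squared error built from scikit-learn decision trees, not XGBoost's actual implementation (which adds regularization and other refinements). The data, `learning_rate`, tree depth and number of rounds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic nonlinear data (hypothetical)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant prediction
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residuals = y_toy - prediction  # what the current model still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # small corrective step
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after 50 rounds: {mse_history[-1]:.4f}")
```

The training error keeps falling with more rounds, which is also why a large `n_estimators` on a small dataset can overfit: the test error does not have to follow the training error down.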


### Using ARIMA 
(Used for time series data. Reportedly better at capturing time-related trends; basically spicy linear regression)

In [25]:
from statsmodels.tsa.arima.model import ARIMA

# Converting to a time series: the target is indexed by date so the model sees the observations in order
df_time_series = pd.DataFrame({'y': y})
df_time_series.index = dfile['Date']

arima_model = ARIMA(df_time_series, order=(7, 0, 1))  # order = (p, d, q)
# p = how many lagged values are used to predict the next value
# d = degree of differencing used to make the data stationary; 1 means subtracting the previous observation from each one. Limited data, so d = 0 here
# q = number of past forecast errors to include, to capture shocks
arima_model_fit = arima_model.fit()

# Predictions
# Note: the model is fit on the full series and forecast() returns values for the periods after its last date,
# while y_test is a random sample of rows, so this comparison is not aligned by date
forecast = arima_model_fit.forecast(steps=len(y_test))
arima_mse = mean_squared_error(y_test, forecast)
arima_r2 = r2_score(y_test, forecast)

print(f"ARIMA MSE: {arima_mse}")
print(f"ARIMA R2 Score: {arima_r2}")

warnings.filterwarnings('ignore')

ARIMA MSE: 4657320.966387032
ARIMA R2 Score: -3.815619430704169
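
A small illustration of what the `d` term in `order=(p, d, q)` refers to, on a made-up trending series (the notebook itself uses `d = 0`). First differencing replaces each value with its change from the previous one, which removes a steady trend; `series` and its values are hypothetical.

```python
import numpy as np

# Hypothetical monthly series: steady upward trend plus a small wiggle
t = np.arange(24)
series = 100.0 + 5.0 * t + np.sin(t)

# d = 1: each observation minus the previous one (one value is lost at the start)
diffed = np.diff(series)

# The level of the raw series drifts between the two halves...
print(series[:12].mean(), series[12:].mean())
# ...while the differenced series stays around the same level
print(diffed[:11].mean(), diffed[11:].mean())
```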


### Using Random Forest Regressor
(Builds a given number of decision trees, each on a random sample of the data and with a random selection of features considered at each split. Each tree keeps branching until a split no longer reduces the variance or some other stopping condition has been met, e.g. the maximum number of branch splits. Then, it averages the predictions of all the trees)

In [26]:
from sklearn.ensemble import RandomForestRegressor

# Creating the model. I Love this model
rfr = RandomForestRegressor(n_estimators=1000)
rfr.fit(X_train_scaled, y_train)

# Predictions and Evaluation
rfr_predictions = rfr.predict(X_test_scaled)
rfr_mse = mean_squared_error(y_test, rfr_predictions) 
rfr_r2 = r2_score(y_test, rfr_predictions)

print(f"RFR MSE: {rfr_mse}")
print(f"RFR R2_Score: {rfr_r2}")

RFR MSE: 703187.7341453121
RFR R2_Score: 0.2729114998896135


#### The following are the results produced. The data and the M.E. indicators chosen can be blamed for the low R2 scores, but the point here, once again, was not to predict the outcomes but to compare various techniques and see which one is the best when used in the simplest manner, without any major fine-tuning.

- Linear Regression: MSE 532286.3592794591, R2 Score 0.4496216702810921
- Lasso Regression: MSE 629775.631061466, R2 Score 0.34881881927148495
- XGBoost: MSE 950400.6645766628, R2 Score 0.017296007657378265
- ARIMA: MSE 4657320.966387032, R2 Score -3.815619430704169
- Random Forest Regressor: MSE 703187.7341453121, R2 Score 0.2729114998896135
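
For a quick side-by-side view, the scores printed by the cells above can be collected and ranked. The numbers are copied from those outputs; the `results` dictionary and the formatting are just one possible way to lay them out.

```python
# Scores copied from the cell outputs above: model -> (MSE, R2)
results = {
    "Linear Regression": (532286.3592794591, 0.4496216702810921),
    "Lasso Regression": (629775.631061466, 0.34881881927148495),
    "XGBoost": (950400.6645766628, 0.017296007657378265),
    "ARIMA": (4657320.966387032, -3.815619430704169),
    "Random Forest Regressor": (703187.7341453121, 0.2729114998896135),
}

# Rank from best to worst by R2 (higher is better)
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)

for name, (mse_value, r2_value) in ranked:
    print(f"{name:<24} MSE={mse_value:>12.1f}  R2={r2_value:>7.3f}")
```

Sorting by MSE in ascending order gives the same ranking here.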

The clear winner here is Linear Regression, which I presume is due to the linearity in the model's independent variables and how connected they were. If there were more dimensionality in the model, such as features other than M.E. indicators (financial metrics, for example), then I think that would have introduced a lot of nonlinearity, for which Random Forest would have been best suited. I am underwhelmed by the performance of the ARIMA model. Perhaps I do not understand its usage well enough. One likely factor is that the ARIMA forecast covers the periods after the end of the series, while `y_test` is a random sample of rows, so the two are not aligned by date.
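
In the ARIMA cell, the model is fit on the whole series and `forecast` produces values for the periods after its last date, whereas `y_test` comes from a random split. One way to make the comparison line up, sketched here under assumptions, is a time-ordered hold-out: train on the earliest 80% of months and test on the latest 20%. The series below is made up, and the "repeat the last training value" forecast is only a stand-in for any forecaster that is fit on `train` and asked for `len(test)` steps.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series covering 12 years (placeholder values, not the dataset)
dates = pd.date_range("2012-01-01", periods=144, freq="MS")
series = pd.Series(np.linspace(1000.0, 2000.0, 144), index=dates)

# Time-ordered split: the last 20% of months form the test set
n_test = int(len(series) * 0.2)
train, test = series.iloc[:-n_test], series.iloc[-n_test:]

# Stand-in forecaster: repeat the last training value for every test month.
# A model fit on `train` only would produce its forecast for these same dates.
naive_forecast = pd.Series(train.iloc[-1], index=test.index)

mse_naive = float(np.mean((test - naive_forecast) ** 2))
print(f"Train: {train.index.min().date()} to {train.index.max().date()}")
print(f"Test:  {test.index.min().date()} to {test.index.max().date()}")
print(f"Naive forecast MSE: {mse_naive:.1f}")
```

Scores from a split like this would not be directly comparable with the randomly split scores above unless the other models were evaluated on the same time-ordered split.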

# Valete