<a href="https://colab.research.google.com/github/Riddick4-droid/Machine_Learning-Pt/blob/main/Stock_Prediction_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# STOCK PRICE PREDICTION WITH DIFFERENT ALGORITHMS

In this notebook I explore several different regression algorithms for predicting stock prices. I will evaluate the models with standard regression metrics: `MEAN SQUARED ERROR` (or `ROOT MEAN SQUARED ERROR`), `MEAN ABSOLUTE ERROR`, and `R_SQUARED`. The models I will explore include:
1. Linear regression
2. SGD Regressor
3. Regression Forest
4. Neural Network-from Tensorflow
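
For reference, here is a minimal sketch of how each of the metrics named above is computed, using NumPy on a small made-up example. The arrays `y_true` and `y_pred` are illustrative values only, not data from this notebook.

```python
import numpy as np

# toy values chosen only for illustration (not from the stock dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_pred

mse = np.mean(residuals ** 2)                   # mean squared error
rmse = np.sqrt(mse)                             # root mean squared error, in the target's units
mae = np.mean(np.abs(residuals))                # mean absolute error
ss_res = np.sum(residuals ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot                 # R squared

print(f'MSE: {mse} | RMSE: {rmse:.4f} | MAE: {mae} | RSQ: {r_squared:.4f}')
```

RMSE is simply the square root of MSE, so it can be derived from any MSE value reported later without refitting a model.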

In [None]:
#importing dependencies
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression,SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import RobustScaler,StandardScaler,MinMaxScaler
from sklearn.tree import DecisionTreeRegressor,plot_tree
import kagglehub
import os

In [None]:
##load data from kagglehub api
path = kagglehub.dataset_download("jacksoncrow/stock-market-dataset")

#print path
print(f'path to dataset is {path}')

In [None]:
os.listdir(path)

In [None]:
pd.read_csv(os.path.join(path,'symbols_valid_meta.csv')).head(20)

In [None]:
#lets check the file in the path directory
os.listdir(os.path.join(path,'stocks'))[:20]

In [None]:
##lets select one stock
##selecting 'AAPL.csv'
new_path = os.path.join(path,'stocks')

In [None]:
##lets load it as a csv with pandas
data = pd.read_csv(os.path.join(new_path,'AAPL.csv')).set_index('Date')

#display
display(data.head())

In [None]:
##check the last date
display(data.tail(30))

In [None]:
##checking the data shape
print(f'There are {data.shape[0]} instances and {data.shape[1]} features')

In [None]:
##explore the data
matplotlib.rcParams['figure.figsize']=(12,5)
plt.plot(data['Close'][:1000],label='close')
plt.plot(data['Open'][:1000],label='open')
plt.xlabel('Date')
plt.ylabel('Open and Close')
plt.legend(loc='upper right');

In [None]:
#stocks are auto regressive(or have temporal dependencies) in nature,where a price at a time may depend on
#one or more previous prices
#i will demonstrate a simple function to give us this autoregressive data
import typing
def generate_autoreg(data,
                     cutoff:int=60,
                     window:int=60,
                     return_df:typing.Literal['test','train']='train'):
    #this function builds autoregressive (sliding-window) features from the 'Close' column
    #window = number of prior days used as features; cutoff = number of final rows held out for testing
    #cutoff must be larger than window, otherwise the test frame has no rows
    #returns a dataframe for clear structure understanding: set return_df='train' or return_df='test'
    train_data = data['Close'].iloc[:-cutoff]

    #use the final `cutoff` days of data as test data
    #ensure that it is also autoregressive
    test_data = data['Close'].iloc[-cutoff:]


    x_train,y_train = [],[] #train: windows of prior days and their next-day prices
    x_test,y_test = [],[] #test: windows of prior days and their next-day prices

    #scale the data
    scale = MinMaxScaler()

    #scale data and get values then reshape those values
    data_scaled_train = scale.fit_transform(train_data.values.reshape(-1,1))

    data_scaled_test = scale.transform(test_data.values.reshape(-1,1))

    #get for train data
    for i in range(window,len(train_data)):
        x_train.append(data_scaled_train[i-window:i,0])
        y_train.append(np.array(train_data)[i])
    #convert to numpy array
    x_train,y_train = np.array(x_train),np.array(y_train)

    #for test data
    for i in range(window,len(test_data)):
        x_test.append(data_scaled_test[i-window:i,0])
        y_test.append(np.array(test_data)[i])

    x_test,y_test = np.array(x_test),np.array(y_test)

    #lets use a dataframe
    if return_df=='train':
        colnames = [f'day_{i}' for i in range(1,window+1)]
        days_data_train = pd.DataFrame(x_train,columns=colnames)
        days_data_train['next_day_price'] = y_train
        return days_data_train

    elif return_df == 'test':
        colnames = [f'day_{i}' for i in range(1,window+1)]
        days_data_test = pd.DataFrame(x_test,columns=colnames)
        days_data_test['next_day_price'] = y_test
        return days_data_test

In [None]:
##lets test the function
days_data_train = generate_autoreg(data=data,
                                   cutoff=60,
                                   window=20,
                                   return_df='train')

#display data
display(days_data_train.head(10))

In [None]:
##lets test the function
days_data_test = generate_autoreg(data=data,
                                   cutoff=60,
                                   window=20,
                                   return_df='test')

#display data
display(days_data_test.head(10))
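
The number of rows each call returns follows directly from `cutoff` and `window`. The sketch below works through that arithmetic. The helper name `expected_rows` is hypothetical (it is not part of the notebook), and it assumes the price series has at least `cutoff` rows.

```python
import typing

def expected_rows(n_prices: int, cutoff: int, window: int) -> typing.Tuple[int, int]:
    """Rows expected in the (train, test) frames for a series of n_prices rows.

    Hypothetical helper for illustration; assumes n_prices >= cutoff.
    """
    # the train split has n_prices - cutoff prices, and the first `window`
    # of them can only serve as features, never as targets
    train_rows = max(n_prices - cutoff - window, 0)
    # the test split has `cutoff` prices, again losing the first `window`
    test_rows = max(cutoff - window, 0)
    return train_rows, test_rows

print(expected_rows(1000, cutoff=60, window=20))  # (920, 40)
print(expected_rows(1000, cutoff=60, window=60))  # (880, 0)
```

With `cutoff=60` and `window=20` the test frame has 40 rows, which is why the test plots later use a range of 40. With the function's default arguments (`cutoff=60`, `window=60`) the test frame would be empty.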

The above is an example of a 20-day window: we always use 20 days of prices to predict the price on day 21. The 20 prior prices are scaled and act as our features. The function leaves the next-day prices unscaled because these values are our targets, and in ML we mostly do not scale targets. In autoregression, and with stock prices specifically, it helps to give the model enough prior days to extract patterns from. However, simple machine learning models such as the linear regression models used here (the target is continuous) tend to struggle with long-range dependencies: they have no mechanism for remembering anything beyond the fixed window, so they may miss the underlying patterns or overfit.
As such, I will experiment with neural networks to find out whether they cope better with the long-range dependencies and high dimensionality of autoregressive problems.
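
The windowing idea described above can be sketched on a toy series. This is an illustration under stated assumptions, not the notebook's function: the series, the window of 3, and the hand-written min-max scaling are all made up for brevity, and the scaling statistics come from the whole toy series, whereas `generate_autoreg` fits its scaler on the training split only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

prices = np.arange(1.0, 11.0)   # toy series: 1, 2, ..., 10
window = 3                      # toy window; the notebook uses 20

# every run of `window` consecutive prices, dropping the last run
# because it has no following price to use as a target
features = sliding_window_view(prices, window)[:-1]
targets = prices[window:]       # the price right after each window

# min-max scale the features only; targets stay in price units
lo, hi = prices.min(), prices.max()
features_scaled = (features - lo) / (hi - lo)

print(features[0], '->', targets[0])
print(features_scaled.shape, targets.shape)
```

Each row of `features` holds the prior days and the matching entry of `targets` is the following day's price, which mirrors the `day_1 ... day_n` and `next_day_price` columns of the dataframes above.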

In [None]:
#check days data shape
print(f'new autoregressive data has shape {days_data_train.shape}')

In [None]:
#we want to test the models' ability to capture the autoregressive relationship between
##the features or prior days and the next day
#we use the first train data to train the algorithm to get our model
#then test the models ability to predict with unseen data from the test data set

# Linear Regression

In [None]:
#first model is linear regression without SGD
#split the data into features and target
x_train = days_data_train.drop('next_day_price',axis=1)
y_train = days_data_train['next_day_price']

#init model
lr = LinearRegression(fit_intercept=True,n_jobs=-1) #not many hyperparameters to tune here

#fit
lr.fit(x_train,y_train)

In [None]:
##lets define a function to capture all the metrics
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

def capture_metrics(test,model):
    #this function captures 3 metrics specific to regression at a go
    #it captures the model's performance in terms of MSE,MAE and R_sqr
    #specifically on the test data
    #enables automation and reduces the need to always write the metrics calculation for
    #all models tested

    x_test = test.drop('next_day_price',axis=1)
    y_test = test['next_day_price']

    #make prediction
    predictions = model.predict(x_test)

    #get model performance
    mse = mean_squared_error(y_true=y_test,y_pred=predictions)
    mae = mean_absolute_error(y_true=y_test,y_pred=predictions)
    r_squared = r2_score(y_true=y_test,y_pred=predictions)

    #print results
    print(f'MSE: {mse} | MAE: {mae} | RSQ: {r_squared}')

    #return dataframe
    results = {
        f'mse_{model.__class__.__name__}':round(mse,4),
        f'mae_{model.__class__.__name__}':round(mae,4),
        f'rsq_{model.__class__.__name__}':round(r_squared,4)
    }

    return pd.DataFrame([results]),predictions

In [None]:
###apply on train data
results_train,inference_lr_train = capture_metrics(days_data_train,lr)

results_train

In [None]:
len(inference_lr_train)==len(y_train)

In [None]:
##lets visualize the values
import typing
def plot_model_pred(predictions,data,plot_range:int,figsize:typing.Tuple[int,int],plot_continuation:bool=False):
    ##the function plots the actual data trend vs the predicted
    ##for the last `plot_range` rows of the data
    ##plot_continuation=True overlays both lines on one axis; other modes are not implemented yet

    #set figsize
    plt.figure(figsize=figsize)

    if plot_continuation:
        x=range(plot_range)
        #take the last `plot_range` points of each series
        plt.plot(x,predictions[-plot_range:],label='predictions',linestyle='--')
        plt.plot(x,data['next_day_price'].iloc[-plot_range:],label='actuals')
        plt.title('Actual vs Predicted')
        plt.xlabel('Index')
        plt.ylabel('Value')
        plt.legend()
        plt.show()
    else:
        pass

In [None]:
plot_model_pred(inference_lr_train, days_data_train, 5, (15,5), True)

In [None]:
results_test,inference_lr = capture_metrics(days_data_test,lr)

results_test

In [None]:
len(inference_lr)==len(days_data_test['next_day_price'])

In [None]:
##plot test data predictions
plot_model_pred(inference_lr, days_data_test, 40, (15,5), True)

# SGDRegressor: Regression with Stochastic Gradient Descent

In [None]:
# implement the GridSearchCV
params = {
    'alpha':[1e-4,3e-4,1e-3],
    'eta0':[0.01,0.03,0.1]
}

#init model
sgd_lr = SGDRegressor(penalty='l2',max_iter=5000,random_state=42,verbose=1)

#implement the cross validation
#first perform TimeSeriesSplit to maintain timeseries structure
tscv = TimeSeriesSplit(n_splits=3)

#grid search
grid_search = GridSearchCV(sgd_lr,params,cv=tscv,scoring='r2')

#fit
grid_search.fit(x_train,y_train)
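
To see why `TimeSeriesSplit` preserves the time-series structure, the sketch below prints the folds it produces for 12 toy samples. The names `X_toy` and `tscv_toy` are made up for this illustration and are separate from the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)   # 12 time-ordered toy samples
tscv_toy = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, val_idx) in enumerate(tscv_toy.split(X_toy), start=1):
    # every validation index comes after every training index,
    # so the model is never validated on data older than its training data
    print(f'fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}')
```

Unlike a shuffled K-fold split, the training portion grows forward in time and each validation block lies strictly after it.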

In [None]:
##get best params
grid_search.best_params_

In [None]:
##fit the model again with the best parameters
#unpack the best parameters found by the grid search instead of hard-coding them
sgd_lr = SGDRegressor(penalty='l2',max_iter=5000,random_state=42,verbose=1,**grid_search.best_params_)

#fit
sgd_lr.fit(x_train,y_train)

In [None]:
##lets compute metrics
results_train_sgd,inference_lr_train_sgd = capture_metrics(days_data_train,sgd_lr)

results_train_sgd

In [None]:
##plot predictions for train data
plot_model_pred(inference_lr_train_sgd, days_data_train, 200, (15,5), True)

Always try to get a granular view of the model's performance when plotting the predictions, as this helps show whether it is getting things right. A denser plot hides the model's errors and can give the impression that it is performing better than it actually is.

In [None]:
#check for the test data
results_test_sgd,inference_lr_test_sgd = capture_metrics(days_data_test,sgd_lr)

results_test_sgd

In [None]:
##plot predictions for test data
plot_model_pred(inference_lr_test_sgd, days_data_test, 40, (15,5), True)

# Decision Trees

In [None]:
##lets try with dt
dt = DecisionTreeRegressor(criterion='squared_error',random_state=42)

#setup params
param_grid = {'min_samples_split':[2,3],
              'max_depth':[25,30],
              'min_samples_leaf':[3,4],
              'ccp_alpha':[1e-4,3e-4,1e-3],
              }
#this may take a while :) if no patience, use RandomizedSearchCV
grid_search_dt = GridSearchCV(estimator=dt,param_grid=param_grid,cv=tscv,n_jobs=-1,scoring='r2')

#fit
grid_search_dt.fit(x_train,y_train)

In [None]:
##best estimator
grid_search_dt.best_params_

#fit with best params
#the '**' before the grid_search_dt helps unpack the values in the dictionary
dt = DecisionTreeRegressor(**grid_search_dt.best_params_,random_state=42,criterion='squared_error')

#fit
dt.fit(x_train,y_train)

In [None]:
##check performance and plot
results_train_dt,inference_dt_train = capture_metrics(days_data_train,dt)

results_train_dt

In [None]:
#plot for the train
plot_model_pred(inference_dt_train, days_data_train, 100, (15,5), True)

In [None]:
##check performance and plot
results_test_dt,inference_dt_test = capture_metrics(days_data_test,dt)

results_test_dt


#plot
plot_model_pred(inference_dt_test, days_data_test, 40, (15,5), True)

# RandomForestRegressor

In [None]:
##parameters for tuning and cross val
param_grid = {
    'max_depth':[20,30,50],
    'min_samples_split':[2,5,10],
    'min_samples_leaf':[1,3,5]
}

rf = RandomForestRegressor(n_estimators=30,n_jobs=-1,random_state=42)

grid_search_rf = GridSearchCV(rf,param_grid=param_grid,cv=tscv,scoring='r2',n_jobs=-1,verbose=1)

#fit
grid_search_rf.fit(x_train,y_train)

In [None]:
##best params
grid_search_rf.best_params_,grid_search_rf.best_score_

#fit
rf_best = RandomForestRegressor(**grid_search_rf.best_params_,n_estimators=30,n_jobs=-1,random_state=42)

rf_best.fit(x_train,y_train)

In [None]:
#capture metrics
results_train_rf,inference_rf_train = capture_metrics(days_data_train,rf_best)

results_train_rf


#plot
plot_model_pred(inference_rf_train, days_data_train, 100, (15,5), True)

In [None]:
##test data
#capture metrics
results_test_rf,inference_rf_test = capture_metrics(days_data_test,rf_best)

results_test_rf


#plot
plot_model_pred(inference_rf_test, days_data_test, 40, (15,5), True)

In [None]:
##collect the train predictions of all models to plot them together
all_results_train = {
    'truth':days_data_train['next_day_price'].tolist(),
    'lr':inference_lr_train.tolist(),
    'sgd_lr':inference_lr_train_sgd.tolist(),
    'decision_trees':inference_dt_train.tolist(),
    'random_forest':inference_rf_train.tolist()
}

#make dataframe
train_perf = pd.DataFrame(all_results_train)

train_perf.iloc[-150:].plot();

In [None]:
all_results_test = {
    'truth':days_data_test['next_day_price'].tolist(),
    'lr':inference_lr.tolist(),
    'sgd_lr':inference_lr_test_sgd.tolist(),
    'decision_trees':inference_dt_test.tolist(),
    'random_forest':inference_rf_test.tolist()
}

#make dataframe
test_perf = pd.DataFrame(all_results_test)

test_perf.plot();

The final graph displays the actual stock prices (truth) in blue, alongside the predictions from the four different models on the test data: Linear Regression (lr, orange), SGD Regressor (sgd_lr, green), Decision Trees (decision_trees, red), and Random Forest (random_forest, purple).

Here's an interpretation of their performance:

`Actual Prices (truth)`: The blue line shows the actual stock price movement in the test set, which has a general downward trend with several fluctuations.

`Linear Regression (lr)` and `SGD Regressor (sgd_lr)`: Both lr (orange) and sgd_lr (green) appear to follow the general downward trend of the actual prices. However, they are smoother and lag behind the sharper movements and sudden drops in the actual prices. They capture the overall direction but are not responsive enough to capture rapid changes.

`Decision Trees (decision_trees)`: The red line for Decision Trees shows very poor performance. It predicts constant values for significant periods. For instance, it predicts around 300 for the first segment, then drops to around 275-280, and later to around 260. Some step-like output is expected from a tree, which predicts one constant value per leaf and cannot produce values outside the range of targets seen in training. The long flat stretches suggest that the test windows fall into only a few leaves, so the tree generalizes poorly and has likely overfit the training data.

`Random Forest (random_forest)`: Similar to Decision Trees, the purple line for Random Forest also exhibits step-like or largely flat predictions for considerable stretches, particularly in the earlier part of the test data (predicting around 290). While it shows somewhat more variation than the single `Decision Tree`, it still struggles to track the dynamic nature of the stock prices in the test set. It too appears to suffer from poor generalization.

`Overall Conclusion`: Based on this visualization, the Linear Regression and SGD Regressor models, while imperfect, provide a more reasonable approximation of the stock price trend on the test data than the Decision Tree and Random Forest models. The tree-based models perform poorly here, which indicates they are not well suited to the continuous and dynamic nature of this time series, likely due to overfitting during training combined with their piecewise-constant predictions.
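
One general property that may contribute to the flat tree-based predictions is that a decision tree cannot predict values outside the range of targets it saw during training, while a linear model can extrapolate a trend. The sketch below illustrates this on made-up data (`X_toy`, `y_toy`, and `X_new` are hypothetical). It illustrates the property only and is not a diagnosis of the models trained above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# toy data with a simple upward trend: y = 2x for x = 0..9
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 2 * X_toy.ravel()

lin = LinearRegression().fit(X_toy, y_toy)
tree = DecisionTreeRegressor(random_state=42).fit(X_toy, y_toy)

# ask both models about a point far outside the training range
X_new = np.array([[20.0]])
lin_pred = lin.predict(X_new)[0]
tree_pred = tree.predict(X_new)[0]

print(f'linear: {lin_pred:.1f} | tree: {tree_pred:.1f} | max training target: {y_toy.max():.1f}')
```

The linear model continues the trend, whereas the tree returns the value of its nearest leaf and so never exceeds the largest training target. A random forest averages such trees and inherits the same limit.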