# **Core Stock Baseline Modeling - Linear Regression for NVDA Ticker**
## In this notebook we examine only the Nvidia (NVDA) stock over the period selected for this project (01-01-2019 through 06-30-2024) and fit a Linear Regression model on the preprocessed dataframe that we created in our other notebook.  The goal is to get an idea of how well the new features we created predict the closing price.

#### As usual, let's start by importing the libraries we need and adding the project root to the path.

In [1]:
import sys
import os

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import plotly.graph_objects as go
import plotly.express as px
from sklearn.model_selection import cross_val_score

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)




#### Now let's read in our data that we will use for this notebook.

In [2]:
# Load the preprocessed core stock data (core_stock_preprocessed.csv)
csv_path = os.path.join(project_root, 'data', 'core_stock_preprocessed.csv')
preprocessed_df = pd.read_csv(csv_path, parse_dates=['Date'], index_col='Date')
preprocessed_df.head()

Date,Close,Volume,Open,High,Low,SMA_core,EMA_core,RSI_core,RMA_core,Close_Lag_1,...,EMA_Lag_Std_1_3,SMA_Lag_Avg_1_3,SMA_Lag_Std_1_3,RMA_Lag_Avg_1_3,RMA_Lag_Std_1_3,Close_Lag_Avg_1_3,Close_Lag_Std_1_3,Diff_Close_EMA_core,Ratio_Close_EMA_core,Ticker
2019-03-14,-1.013302,-0.074793,-1.012345,-1.018762,-1.009483,-1.085271,-1.049825,1.249108,0.801504,-1.013494,...,-0.081211,-1.086158,-0.085461,0.817706,-0.6253,-1.014627,-0.348813,0.037931,0.801504,AAPL
2019-03-15,-1.007093,0.293158,-1.009876,-1.010456,-1.006383,-1.083783,-1.048013,1.325622,0.931542,-1.013494,...,-0.081211,-1.086158,-0.085461,0.817706,-0.6253,-1.014627,-0.348813,0.074966,0.931542,AAPL
2019-03-18,-1.002157,-0.011968,-1.007406,-1.00773,-1.000997,-1.081366,-1.046072,1.432516,1.018034,-1.00728,...,-0.081211,-1.085413,-0.085461,0.884049,-0.6253,-1.011515,-0.348813,0.100376,1.018034,AAPL
2019-03-19,-1.006028,0.117158,-1.000779,-1.006187,-1.000655,-1.079347,-1.044364,1.112931,0.846961,-1.002341,...,-0.066578,-1.084358,-0.062046,0.935581,-0.515015,-1.008829,-0.323448,0.054402,0.846961,AAPL
2019-03-20,-1.001793,0.102615,-1.006289,-1.004901,-1.003782,-1.077225,-1.042552,1.535368,0.916237,-1.006214,...,-0.067836,-1.08238,-0.055671,0.951042,-0.666796,-1.006398,-0.386895,0.075012,0.916237,AAPL


#### Great, now let's grab just the NVDA rows to use in our Linear Regression model.

In [3]:
nvda_data = preprocessed_df[preprocessed_df['Ticker'] == 'NVDA']
print(nvda_data.head())
nvda_data.shape

               Close    Volume      Open      High       Low  SMA_core  \
Date                                                                     
2019-01-02 -1.455257  2.390567 -1.456369 -1.456605 -1.454946  0.416128   
2019-01-03 -1.457395  3.561279 -1.455550 -1.457459 -1.455566  0.416128   
2019-01-04 -1.455265  2.847835 -1.456291 -1.456798 -1.455038  0.416128   
2019-01-07 -1.453391  3.582742 -1.454326 -1.454957 -1.453270  0.416128   
2019-01-08 -1.454319  4.039938 -1.452198 -1.454471 -1.453146  0.416128   

            EMA_core  RSI_core  RMA_core  Close_Lag_1  ...  EMA_Lag_Std_1_3  \
Date                                                   ...                    
2019-01-02 -1.465854  0.456727 -0.370774     0.518515  ...         0.015187   
2019-01-03 -1.465941 -3.408328 -1.146266    -1.455748  ...        27.314884   
2019-01-04 -1.465938 -0.339385 -0.343192    -1.457888  ...        27.344370   
2019-01-07 -1.465859  0.600040  0.333526    -1.455756  ...        -0.111496   
2019-01

(1382, 32)

#### Now let's prepare our Linear Regression model.  For X we select four features (EMA_core, SMA_core, RSI_core, and Close_Lag_1), which leaves out Ticker, since it is a non-numeric column, and Close, since it is our y target.  Date is our index, so it is automatically excluded from the feature set.

In [4]:
X = nvda_data[['EMA_core', 'SMA_core', 'RSI_core', 'Close_Lag_1']]
y = nvda_data['Close']

#### Let's now set up the rest of the model and run our first set of predictions on it.  I will be looking for the MAE (Mean Absolute Error) and the RMSE (Root Mean Squared Error) for metrics here.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
model = LinearRegression()
model.fit(X_train, y_train)
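
#### A note on the split above: train_test_split shuffles rows by default, so with daily prices the training set can contain dates that come after some test dates.  Below is a minimal sketch of a chronological 80/20 split instead.  The data is synthetic and the names (demo, X_train_ts, ts_model, and so on) are placeholders for illustration, not part of this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for nvda_data (hypothetical values, not the project's data).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2019-01-02", periods=500)
close = pd.Series(np.cumsum(rng.normal(0, 1, size=500)), index=dates, name="Close")
demo = pd.DataFrame({"Close": close, "Close_Lag_1": close.shift(1)}).dropna()

X_demo = demo[["Close_Lag_1"]]
y_demo = demo["Close"]

# Chronological 80/20 split: rows before the cutoff train, the rest test.
cutoff = int(len(demo) * 0.8)
X_train_ts, X_test_ts = X_demo.iloc[:cutoff], X_demo.iloc[cutoff:]
y_train_ts, y_test_ts = y_demo.iloc[:cutoff], y_demo.iloc[cutoff:]

ts_model = LinearRegression().fit(X_train_ts, y_train_ts)

# Every training date precedes every test date.
print(X_train_ts.index.max() < X_test_ts.index.min())  # True
```

#### Passing shuffle = False to train_test_split should give the same kind of ordered split, provided the rows are already sorted by date.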

#### Now let's make our predictions and see where we end up.

In [6]:
y_pred = model.predict(X_test)

#### Let's retrieve our MAE and MSE metrics for performance.  Note that sklearn's mean_squared_error returns the Mean Squared Error rather than its square root, so the second value is labeled MSE here.

In [7]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f'Mean Absolute Error (MAE): {mae}')
print(f'Mean Squared Error (MSE): {mse}')

Mean Absolute Error (MAE): 0.014229073910618964
Mean Squared Error (MSE): 0.0004245469048552596
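
#### The 0.000424... value printed above is the mean of the squared errors; the RMSE is its square root.  The short sketch below shows the relationship on toy values (hypothetical, chosen so the arithmetic is easy to follow) and then applies the square root to the value reported above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values (hypothetical): a single error of 2 across four points.
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_hat_demo = np.array([1.0, 2.0, 3.0, 6.0])

mse_demo = mean_squared_error(y_true_demo, y_hat_demo)  # (0 + 0 + 0 + 4) / 4 = 1.0
rmse_demo = np.sqrt(mse_demo)                           # sqrt(1.0) = 1.0

# Square root of the value printed by the metrics cell above.
reported_mse = 0.0004245469048552596
reported_rmse = np.sqrt(reported_mse)
print(f'RMSE from reported MSE: {reported_rmse:.4f}')   # 0.0206
```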


#### These scores look very good, so let's verify them with cross-validation to make sure they hold up across different slices of the data.

In [8]:
scores = cross_val_score(model, X, y, cv = 5, scoring = 'neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)

print(f'Cross-validated RMSE Scores: {rmse_scores}')
print(f'Mean RMSE: {rmse_scores.mean()}')

Cross-validated RMSE Scores: [0.16299463 0.01115374 0.01624387 0.01553908 0.07491328]
Mean RMSE: 0.05616892024928642
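
#### With cv = 5 and a regressor, scikit-learn uses unshuffled K-fold splits, so each fold above is a contiguous block of dates and some folds are trained on dates later than the ones they are tested on.  That may partly explain the wide spread between folds.  A minimal sketch of a forward-only alternative using TimeSeriesSplit follows; the data is synthetic and the names X_cv, y_cv, and tscv are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered stand-in data (hypothetical).
rng = np.random.default_rng(42)
n = 300
X_cv = rng.normal(size=(n, 2))
y_cv = X_cv @ np.array([0.7, 0.3]) + rng.normal(scale=0.05, size=n)

# Each split trains on an initial segment and tests on the block right after it.
tscv = TimeSeriesSplit(n_splits=5)
forward_only = all(train_idx.max() < test_idx.min()
                   for train_idx, test_idx in tscv.split(X_cv))

ts_scores = cross_val_score(LinearRegression(), X_cv, y_cv,
                            cv=tscv, scoring='neg_mean_squared_error')
ts_rmse = np.sqrt(-ts_scores)

print(f'Forward-only splits: {forward_only}')
print(f'TimeSeriesSplit RMSE Scores: {ts_rmse}')
print(f'Mean RMSE: {ts_rmse.mean()}')
```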


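#### Our original scores were quite unrealistic.  After cross-validating, the scores above are a more reasonable baseline to go forward with: a mean RMSE of about 0.056.  Because Close is scaled in this dataframe, that figure is in scaled units rather than dollars or cents, and the fold scores range from about 0.011 to 0.163, so performance is not uniform across the period.  Next, let's check how strongly each feature correlates with the target.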
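
#### The Close column in this dataframe is scaled, so an RMSE computed on it is in scaled units.  If the scaling was a standard z-score (an assumption; the scaler lives in the preprocessing notebook), the error can be converted back to dollars by multiplying by the standard deviation of the raw closing prices.  The raw prices below are made up purely to show the arithmetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw closing prices in dollars (not the project's data).
raw_close = np.array([[30.0], [35.0], [40.0], [45.0], [50.0]])
scaler = StandardScaler().fit(raw_close)

# Mean cross-validated RMSE reported above, in scaled units.
scaled_rmse = 0.05616892024928642

# For a z-score scaler, one scaled unit equals one standard deviation in dollars.
dollar_rmse = scaled_rmse * scaler.scale_[0]
print(f'Std of raw prices: {scaler.scale_[0]:.4f}')
print(f'RMSE in dollars (for these made-up prices): {dollar_rmse:.4f}')
```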

In [9]:
correlations = X.corrwith(y)
print(correlations.sort_values(ascending = False))

EMA_core       0.990690
Close_Lag_1    0.976618
SMA_core       0.410351
RSI_core       0.165734
dtype: float64


#### Several feature columns track our Close target almost perfectly (EMA_core and Close_Lag_1 both correlate above 0.97), which is likely what makes the scores look so good.  If we want a more balanced score, we can go back to the X and y assignment near the top of the notebook and drop more of those columns.
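
#### One way to make that column-dropping step repeatable is to filter features by their absolute correlation with the target.  This is only a sketch: the data is synthetic, the column names are invented, and the 0.95 threshold is an arbitrary choice.  A high correlation is not a problem in itself; it is mainly a flag to check whether a feature is computed from the same-day Close.

```python
import numpy as np
import pandas as pd

# Synthetic target and features (hypothetical, for illustration only).
rng = np.random.default_rng(1)
n = 400
target = pd.Series(np.cumsum(rng.normal(size=n)), name='Close')
features = pd.DataFrame({
    'near_copy': target + rng.normal(scale=0.01, size=n),
    'noisy_signal': target + rng.normal(scale=target.std() * 2, size=n),
    'pure_noise': rng.normal(size=n),
})

# Absolute correlation of each feature with the target.
abs_corr = features.corrwith(target).abs()

threshold = 0.95  # arbitrary cutoff chosen for this sketch
kept = abs_corr[abs_corr < threshold].index.tolist()
dropped = abs_corr[abs_corr >= threshold].index.tolist()

print(abs_corr.sort_values(ascending=False))
print(f'Kept: {kept}')
print(f'Dropped: {dropped}')
```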

#### Now that our predictions look good let's make a few plots to visualize our results, starting with a plot that shows Actual vs Predicted values.

In [10]:
fig = go.Figure()

fig.add_trace(go.Scatter(x = y_test, y = y_pred,
                        mode = 'markers',
                        name = 'Predicted vs Actual'))

fig.add_trace(go.Scatter(x = y_test, y = y_test,
                        mode = 'lines',
                        name = 'Ideal Line'))

fig.update_layout(title = 'NVDA Actual vs Predicted Closing Prices',
                xaxis_title = 'Actual Closing Price',
                yaxis_title = 'Predicted Closing Price',
                template = 'plotly_dark')

fig.show()




#### Pretty good, we will improve this further later on in the project.  Let's look at a residuals plot as well now.

In [11]:
# Calculate the residuals first.
residuals = y_test - y_pred

# Create the residuals plot
fig = go.Figure()

fig.add_trace(go.Scatter(x = y_pred, y = residuals,
                mode = 'markers',
                name = 'Residuals'))

fig.add_trace(go.Scatter(x = y_pred, y = np.zeros_like(y_pred),
                mode = 'lines',
                name = 'Zero Line'))

fig.update_layout(title = 'Residuals vs Predicted Closing Prices',
                xaxis_title = 'Predicted Closing Prices',
                yaxis_title = 'Residuals',
                template = 'plotly_dark')

fig.show()




#### In the plot above, contrary to many other plots we create, we want to see randomness.  We are looking for points scattered randomly around the zero line, with a spread that is fairly symmetrical around zero.  Clustering can indicate bias in the predictions, and here we do see some centralized clustering.  To alleviate that distortion I will address outliers (outside this notebook) and try more advanced models that can better capture the behavior in this data.
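
#### The visual check above can be backed up with a few numbers: the mean residual (near zero if predictions are unbiased), the correlation between residuals and predictions (near zero if there is no trend), and the residual spread in the lower versus upper half of the predictions (close to 1 if the spread is even).  The sketch below runs on synthetic, well-behaved residuals; pred_demo and resid_demo are placeholders for y_pred and residuals.

```python
import numpy as np

# Synthetic predictions and residuals (hypothetical, well behaved by design).
rng = np.random.default_rng(7)
pred_demo = rng.normal(size=500)
resid_demo = rng.normal(scale=0.02, size=500)

# 1. Bias: the average residual should sit near zero.
mean_resid = resid_demo.mean()

# 2. Trend: residuals should not be correlated with the predictions.
corr_resid_pred = np.corrcoef(pred_demo, resid_demo)[0, 1]

# 3. Even spread: compare residual std for low vs high predictions.
order = np.argsort(pred_demo)
half = len(order) // 2
spread_ratio = resid_demo[order[:half]].std() / resid_demo[order[half:]].std()

print(f'Mean residual: {mean_resid:.5f}')
print(f'Residual/prediction correlation: {corr_resid_pred:.3f}')
print(f'Low/high spread ratio: {spread_ratio:.3f}')
```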

#### Since we have our residuals variable established let's make a distribution plot for it to get another look.

In [12]:
fig = px.histogram(residuals, nbins = 30, marginal = 'box', histnorm = 'probability density')

fig.update_layout(title = 'NVDA Distribution of Residuals',
                xaxis_title = 'Residuals',
                yaxis_title = 'Density',
                template = 'plotly_dark')

fig.show()

#### Overall this is a good distribution.  It is centered around 0, which suggests that on average the model's predictions are unbiased.  The histogram appears somewhat bell-shaped, though it is definitely not normal.  The highest concentration of residuals is very close to 0, indicating that most predictions are very accurate.  The boxplot at the top shows a few outliers on each end, while the majority of the data points fall within the IQR (Interquartile Range), suggesting that most predictions are fairly close to the actual values.  If anything is to be improved, I can look at the outliers here and remove them so that the model isn't hindered by their bias.
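
#### To put numbers on the shape described above, we can compute skewness, excess kurtosis, and the count of points outside the usual 1.5 x IQR fences (the rule boxplots commonly use for outliers).  The residuals below are synthetic, a narrow core plus a small share of wide errors, and resid_series is a placeholder for the notebook's residuals variable.

```python
import numpy as np
import pandas as pd

# Synthetic residuals (hypothetical): mostly small errors plus a few large ones.
rng = np.random.default_rng(3)
core = rng.normal(scale=0.02, size=950)
tails = rng.normal(scale=0.2, size=50)
resid_series = pd.Series(np.concatenate([core, tails]))

skewness = resid_series.skew()         # roughly 0 for a symmetric distribution
excess_kurtosis = resid_series.kurt()  # roughly 0 for a normal distribution

# Points outside the 1.5 x IQR fences.
q1, q3 = resid_series.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = resid_series[(resid_series < q1 - 1.5 * iqr) |
                        (resid_series > q3 + 1.5 * iqr)]

print(f'Skewness: {skewness:.3f}')
print(f'Excess kurtosis: {excess_kurtosis:.3f}')
print(f'Outliers beyond 1.5 x IQR: {len(outliers)} of {len(resid_series)}')
```

#### A large positive excess kurtosis together with a non-trivial outlier count is the numeric counterpart of a histogram that looks bell-shaped but is not normal.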

## Summary of the Analysis on the NVDA Linear Regression Model

#### In this notebook we developed a baseline Linear Regression model to predict the closing prices of NVDA stock.  We used the output of our earlier preprocessing steps, including feature creation and scaling, followed by model creation and evaluation.

### Key Insights

#### - The model performed well as a baseline, with a cross-validated mean Root Mean Squared Error (RMSE) of approximately 0.056 in scaled units.
#### - The residuals analysis indicated that the model's errors are centered around zero, suggesting little overall bias, although the distribution is not normal and the residuals plot shows some clustering.
#### - The distribution of residuals showed a few outliers on each end, which are worth addressing before relying on the model beyond a baseline.

### Potential Next Steps

#### - In the next phase, we will explore more advanced models such as Decision Trees, GRU, and Transformer models to improve prediction accuracy.
#### - We will also consider incorporating additional data and refining features based on the insights gained from this baseline model.