# **Core Stock Baseline Modeling - Linear Regression for TSLA Ticker**
## In this notebook we examine only the Tesla (TSLA) stock over the period selected for this project (01-01-2019 through 06-30-2024) and fit a Linear Regression model using the preprocessed dataframe created in our other notebook.  The goal is to get an idea of how well the new features we engineered predict the closing price.

#### As usual, let's start by importing the libraries this notebook needs and adding the project root to the path.

In [100]:
import sys
import os

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import plotly.graph_objects as go
import plotly.express as px

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)




#### Now let's read in our data that we will use for this notebook.

In [101]:
# Load the preprocessed data (core_stock_preprocessed.csv), using Date as the index
csv_path = os.path.join(project_root, 'data', 'core_stock_preprocessed.csv')
preprocessed_df = pd.read_csv(csv_path, parse_dates=['Date'], index_col='Date')
preprocessed_df.head()

| Date | Close | Volume | Open | High | Low | SMA_core | EMA_core | RSI_core | RMA_core | Close_Lag_1 | ... | EMA_Lag_Std_1_3 | SMA_Lag_Avg_1_3 | SMA_Lag_Std_1_3 | RMA_Lag_Avg_1_3 | RMA_Lag_Std_1_3 | Close_Lag_Avg_1_3 | Close_Lag_Std_1_3 | Diff_Close_EMA_core | Ratio_Close_EMA_core | Ticker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-03-14 | -1.010398 | 0.03858 | -1.009605 | -1.015299 | -1.006805 | -1.062552 | -1.040363 | 1.266968 | 0.782694 | -1.010284 | ... | -0.073765 | -1.063089 | -0.076952 | 0.796947 | -0.554398 | -1.011303 | -0.312358 | 0.057257 | 0.782694 | AAPL |
| 2019-03-15 | -1.005062 | 0.452947 | -1.007484 | -1.008167 | -1.00414 | -1.061304 | -1.038819 | 1.34041 | 0.900346 | -1.010284 | ... | -0.073765 | -1.063089 | -0.076952 | 0.796947 | -0.554398 | -1.011303 | -0.312358 | 0.088819 | 0.900346 | AAPL |
| 2019-03-18 | -1.000821 | 0.109331 | -1.005363 | -1.005826 | -0.999509 | -1.059275 | -1.037165 | 1.443013 | 0.978601 | -1.004948 | ... | -0.073765 | -1.062464 | -0.076952 | 0.856856 | -0.554398 | -1.008633 | -0.312358 | 0.110473 | 0.978601 | AAPL |
| 2019-03-19 | -1.004147 | 0.254746 | -0.999669 | -1.004502 | -0.999215 | -1.057581 | -1.03571 | 1.136258 | 0.823821 | -1.000706 | ... | -0.063359 | -1.061579 | -0.060461 | 0.903391 | -0.456453 | -1.006327 | -0.293423 | 0.071293 | 0.823821 | AAPL |
| 2019-03-20 | -1.000508 | 0.238368 | -1.004403 | -1.003398 | -1.001903 | -1.0558 | -1.034166 | 1.541735 | 0.886499 | -1.004033 | ... | -0.064253 | -1.059921 | -0.055971 | 0.917352 | -0.591252 | -1.004241 | -0.340785 | 0.088857 | 0.886499 | AAPL |


#### Great, now let's grab the rows for our subject ticker, TSLA, to use in our Linear Regression model.

In [102]:
tsla_data = preprocessed_df[preprocessed_df['Ticker'] == 'TSLA']
print(tsla_data.head())
tsla_data.shape

               Close    Volume      Open      High       Low  SMA_core  \
Date                                                                     
2019-01-02 -1.235957  0.578270 -1.237963 -1.236249 -1.239220  2.822138   
2019-01-03 -1.241768  0.106643 -1.237427 -1.239623 -1.240075  2.822138   
2019-01-04 -1.231451  0.149742 -1.238022 -1.234559 -1.236852  2.822138   
2019-01-07 -1.221169  0.165529 -1.228662 -1.223525 -1.227804  2.822138   
2019-01-08 -1.220937  0.110995 -1.216611 -1.219244 -1.222220  2.822138   

            EMA_core  RSI_core  RMA_core  Close_Lag_1  ...  EMA_Lag_Std_1_3  \
Date                                                   ...                    
2019-01-02 -1.237114  0.165409 -0.277934     3.082193  ...         0.091251   
2019-01-03 -1.237347 -3.203487 -0.643004    -1.235834  ...        50.430704   
2019-01-04 -1.237157  0.572292  0.019237    -1.241645  ...        50.467524   
2019-01-07 -1.236562  1.400157  0.649990    -1.231328  ...        -0.093617   
2019-01

(1382, 32)

#### Now let's prepare our Linear Regression model.  Rather than using every column, we select nine numeric features for X.  Ticker is left out because it is a non-numeric string, and Close is left out because it is our y target.  Date is the index, so it is automatically excluded from the feature set.

In [103]:
X = tsla_data[['EMA_core', 'SMA_core', 'RSI_core', 'Close_Lag_1', 'Volume', 'Open', 'High', 'Low', 'Close_Lag_3']]
y = tsla_data['Close']

#### Let's now split the data, fit the model, and run our first set of predictions.  We will use the MAE (Mean Absolute Error) and the RMSE (Root Mean Squared Error) as our metrics.

In [104]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
model = LinearRegression()
model.fit(X_train, y_train)
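
#### A note on the split: `train_test_split` shuffles rows by default, so with time-ordered prices the training set can contain days that come after some of the test days.  Below is a minimal sketch of a chronological split, assuming the frame is already sorted by date.  The synthetic frame and the `chronological_split` helper are hypothetical stand-ins for illustration, not part of this project's code.

```python
import numpy as np
import pandas as pd


def chronological_split(X, y, test_size=0.2):
    """Hold out the last `test_size` fraction of rows (hypothetical helper)."""
    split_idx = int(len(X) * (1 - test_size))
    return X.iloc[:split_idx], X.iloc[split_idx:], y.iloc[:split_idx], y.iloc[split_idx:]


# Synthetic stand-in for tsla_data: 10 business days of made-up values.
dates = pd.bdate_range("2019-01-02", periods=10)
demo = pd.DataFrame(
    {"Close_Lag_1": np.arange(10.0), "Close": np.arange(1.0, 11.0)},
    index=dates,
)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = chronological_split(
    demo[["Close_Lag_1"]], demo["Close"], test_size=0.2
)

# Every training date precedes every test date.
print(X_train_demo.index.max() < X_test_demo.index.min())
```

#### The same effect is available directly from scikit-learn by passing `shuffle=False` to `train_test_split`, which keeps the row order and takes the test set from the end.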

#### Now let's make our predictions and see where we end up.

In [105]:
y_pred = model.predict(X_test)

#### Let's retrieve our MAE and RMSE metrics for performance.

In [106]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE

print(f'Mean Absolute Error (MAE): {mae}')
print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse:.4f}')

Mean Absolute Error (MAE): 0.016894743260739375
Mean Squared Error (MSE): 0.0005844685220932393
Root Mean Squared Error (RMSE): 0.0242
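
#### Because `mean_squared_error` returns the mean of the squared errors, its value is in squared units, and the RMSE is the square root of that number.  The small worked example below uses made-up values (not our TSLA data) so the arithmetic can be checked by hand.

```python
import numpy as np

# Made-up values chosen so the arithmetic is easy to follow.
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.0, 2.0, 3.0, 8.0])

errors = y_true_demo - y_pred_demo        # [0, 0, 0, -4]
mae_demo = np.mean(np.abs(errors))        # (0 + 0 + 0 + 4) / 4 = 1.0
mse_demo = np.mean(errors ** 2)           # (0 + 0 + 0 + 16) / 4 = 4.0
rmse_demo = np.sqrt(mse_demo)             # sqrt(4.0) = 2.0

print(mae_demo, mse_demo, rmse_demo)
```

#### A single large miss raises the RMSE (2.0) well above the MAE (1.0), which is why the two metrics are worth reporting together.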


#### Those scores look very good, so let's verify them with 5-fold cross-validation to make sure they are not an artifact of one particular train/test split.

In [107]:
scores = cross_val_score(model, X, y, cv = 5, scoring = 'neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)

print(f'Cross-validated RMSE Scores: {rmse_scores}')
print(f'Mean RMSE: {rmse_scores.mean()}')

Cross-validated RMSE Scores: [0.02587436 0.02354417 0.03333949 0.02871106 0.0194676 ]
Mean RMSE: 0.026187336264943007
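
#### With `cv = 5`, `cross_val_score` uses ordinary K-fold splits, so some folds train on later rows and validate on earlier ones.  For time-ordered data, scikit-learn's `TimeSeriesSplit` is a common alternative in which each fold trains only on rows that precede its validation rows.  The sketch below runs it on synthetic data (an assumption for illustration, not the TSLA frame).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered stand-in data (not the TSLA frame).
rng = np.random.default_rng(42)
n_rows = 200
X_demo = rng.normal(size=(n_rows, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=n_rows)

# Each fold trains on an initial segment and validates on the rows right after it.
tscv = TimeSeriesSplit(n_splits=5)
folds_are_ordered = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in tscv.split(X_demo)
)

neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=tscv, scoring="neg_mean_squared_error"
)
rmse_per_fold = np.sqrt(-neg_mse)

print(folds_are_ordered)
print(rmse_per_fold)
```

#### Swapping `cv = 5` for a `TimeSeriesSplit` instance is the only change needed in the call above; the scores it produces are usually a little more pessimistic, since the model never sees the future.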


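#### Our first scores were unrealistically good.  After re-testing, the cross-validated scores above are a more trustworthy baseline to go forward with: a mean RMSE of about 0.026, ranging from roughly 0.019 to 0.033 across folds.  Note that Close was scaled during preprocessing, so this error is in scaled units rather than dollars or cents.  Also, because same-day Open, High, and Low are among the features, the score reflects how well the model fits the same day's close rather than how well it forecasts future prices.  Let's check how strongly each feature correlates with the target.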

In [108]:
correlations = X.corrwith(y)
print(correlations.sort_values(ascending = False))

Low            0.999234
High           0.999152
Open           0.998103
Close_Lag_1    0.989525
EMA_core       0.968766
Close_Lag_3    0.968332
SMA_core       0.654543
RSI_core      -0.012661
Volume        -0.432286
dtype: float64


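#### Several feature columns correlate almost perfectly with our Close target: Low, High, and Open are all above 0.99, and Close_Lag_1 is just under 0.99.  Same-day Open, High, and Low bracket the closing price, so they let the model reproduce Close almost directly instead of learning from the engineered features.  Let's go back to our X and y assignment at the top of the notebook and drop some of those columns to get a more honest score.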
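
#### One way to make that column-dropping step repeatable is to filter on the correlation itself.  The sketch below is a minimal illustration on synthetic data; the `drop_highly_correlated` helper and the 0.99 cut-off are hypothetical choices, not something this project has settled on.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(X, y, threshold=0.99):
    """Return X without columns whose |correlation| with y exceeds threshold."""
    correlations = X.corrwith(y).abs()
    to_drop = correlations[correlations > threshold].index.tolist()
    return X.drop(columns=to_drop), to_drop


# Synthetic stand-in data: one near-copy of the target and one unrelated column.
rng = np.random.default_rng(0)
close = pd.Series(np.linspace(0.0, 1.0, 100))
X_demo = pd.DataFrame({
    "High": close + 0.001,             # near-copy of the target
    "Volume": rng.normal(size=100),    # unrelated noise
})

X_reduced, dropped = drop_highly_correlated(X_demo, close, threshold=0.99)
print(dropped)
print(list(X_reduced.columns))
```

#### Applied to the correlations printed above, a 0.99 cut-off would remove Low, High, and Open and keep the remaining six features.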

#### Now that our predictions look good let's make a few plots to visualize our results, starting with a plot that shows Actual vs Predicted values.

In [109]:
fig = go.Figure()

fig.add_trace(go.Scatter(x = y_test, y = y_pred,
                        mode = 'markers',
                        name = 'Predicted vs Actual'))

fig.add_trace(go.Scatter(x = y_test, y = y_test,
                        mode = 'lines',
                        name = 'Ideal Line'))

fig.update_layout(title = 'TSLA Actual vs Predicted Closing Prices',
                xaxis_title = 'Actual Closing Price',
                yaxis_title = 'Predicted Closing Price',
                template = 'plotly_dark')

fig.show()




#### Pretty good.  We will improve this further later on in the project.  Let's look at a residuals plot as well.

In [110]:
# Calculate the residuals first.
residuals = y_test - y_pred

# Create the residuals plot
fig = go.Figure()

fig.add_trace(go.Scatter(x = y_pred, y = residuals,
                mode = 'markers',
                name = 'Residuals'))

fig.add_trace(go.Scatter(x = y_pred, y = np.zeros_like(y_pred),
                mode = 'lines',
                name = 'Zero Line'))

fig.update_layout(title = 'Residuals vs Predicted Closing Prices',
                xaxis_title = 'Predicted Closing Prices',
                yaxis_title = 'Residuals',
                template = 'plotly_dark')

fig.show()




#### In the plot above, unlike most plots we create, we want to see randomness.  The points should be scattered randomly around the zero line, with a spread that is fairly symmetrical above and below it.  Clustering or a visible pattern could indicate bias in the predictions.
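
#### The visual check can be backed up with a few numbers: the mean residual (bias), the correlation between residuals and predictions (trend), and the ratio of residual spread between low and high predictions (changing variance).  The sketch below computes them on synthetic, well-behaved residuals; the data and variable names are assumptions for illustration only.

```python
import numpy as np

# Synthetic stand-ins: predictions plus unbiased residuals with constant spread.
rng = np.random.default_rng(1)
y_pred_demo = rng.uniform(-1.0, 1.0, size=500)
residuals_demo = rng.normal(loc=0.0, scale=0.02, size=500)

# Bias: the mean residual should sit close to zero.
mean_residual = residuals_demo.mean()

# Trend: residuals should not correlate with the predicted values.
corr_with_pred = np.corrcoef(y_pred_demo, residuals_demo)[0, 1]

# Spread: compare residual standard deviation for low vs high predictions.
order = np.argsort(y_pred_demo)
low_half, high_half = np.array_split(residuals_demo[order], 2)
spread_ratio = high_half.std() / low_half.std()

print(f"mean residual: {mean_residual:.4f}")
print(f"correlation with predictions: {corr_with_pred:.3f}")
print(f"spread ratio (high / low): {spread_ratio:.2f}")
```

#### For a healthy model the first two numbers are near 0 and the ratio is near 1; values far from those would match the clustering or funnel shapes we look for in the plot.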

#### Since we have our residuals variable established let's make a distribution plot for it to get another look.

In [111]:
fig = px.histogram(residuals, nbins = 30, marginal = 'box', histnorm = 'probability density')

fig.update_layout(title = 'TSLA Distribution of Residuals',
                xaxis_title = 'Residuals',
                yaxis_title = 'Density',
                template = 'plotly_dark')

fig.show()

#### Overall this is a good distribution.  It is centered around 0, which suggests that on average the model's predictions are unbiased.  The histogram is somewhat bell-shaped, though it is not normal.  The highest concentration of residuals is very close to 0, indicating that most predictions are very accurate.  The boxplot at the top shows a few outliers on each end, while the majority of the data points fall within the IQR (Interquartile Range), suggesting that most predictions are fairly close to the actual values.  A possible improvement is to inspect those outliers and, where justified, remove them so that the model isn't skewed by them.
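
#### To find the outliers mentioned above programmatically, the conventional 1.5 × IQR rule used by box plots can be applied directly to the residuals.  The sketch below uses a small set of made-up residuals, not the notebook's actual values.

```python
import numpy as np

# Made-up residuals: mostly small errors plus two obvious extremes.
residuals_demo = np.array(
    [-0.02, -0.01, -0.01, 0.0, 0.0, 0.0, 0.01, 0.01, 0.02, 0.15, -0.12]
)

# Quartiles and the interquartile range.
q1, q3 = np.percentile(residuals_demo, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outlier_mask = (residuals_demo < lower_fence) | (residuals_demo > upper_fence)

print(residuals_demo[outlier_mask])
```

#### Flagged rows are worth inspecting before removal, since a large residual on a volatile trading day may be real signal rather than bad data.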

## Summary of the Analysis on the TSLA Linear Regression Model

#### In this notebook we developed a baseline Linear Regression model to predict the closing prices of TSLA stock.  We built on our previous preprocessing steps, including feature creation and scaling, followed by model creation and evaluation.

### Key Insights

#### - The model performed well as a baseline, with a Mean Absolute Error (MAE) of approximately 0.017 and a cross-validated mean RMSE of approximately 0.026, both in scaled units.
#### - The residuals analysis indicated that the model's errors are centered on zero and roughly bell-shaped, though not strictly normal, suggesting that the predictions are generally unbiased.
#### - The distribution of residuals showed no significant skewness and only a few outliers on each end, which is acceptable for a baseline.

### Potential Next Steps

#### - In the next phase, we will explore more advanced models such as Decision Trees, GRU, and Transformer models to improve prediction accuracy.
#### - We will also consider incorporating additional data and refining features based on the insights gained from this baseline model.