## Table of Contents

1. [Introduction](#introduction)
2. [ARMA Model (df1)](#arma-model-df1)
   - [Data Preparation](#data-preparation-df1)
   - [Preprocessing](#preprocessing-df1)
   - [Modeling](#modeling-df1)
   - [Conclusion](#conclusion-df1)
3. [GARCH Model (df2)](#garch-model-df2)
   - [Data Preparation](#data-preparation-df2)
   - [Preprocessing](#preprocessing-df2)
   - [Modeling](#modeling-df2)
   - [Conclusion](#conclusion-df2)
4. [LSTM Model (df3)](#lstm-model-df3)
   - [Data Preparation](#data-preparation-df3)
   - [Preprocessing](#preprocessing-df3)
   - [Modeling](#modeling-df3)
   - [Conclusion](#conclusion-df3)
5. [Summary and Conclusions](#summary-and-conclusions)
6. [References](#references)


Before delving into the modeling process, it's essential to organize the dataset appropriately for each modeling technique. In our approach, we'll segment the dataset into three distinct dataframes, each tailored for a specific modeling technique. This segmentation ensures that we apply the most suitable preprocessing and modeling strategies for each model. Let's outline the breakdown:

**ARMA Model (df1):** 
* This dataframe will focus on preparing the data for the AutoRegressive Moving Average (ARMA) model. It includes essential columns such as 'Date' and 'Adj Close', optimized for time-series analysis.

**GARCH Model (df2):**
* Here, we'll structure the data to suit the Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) model. The dataframe will contain 'Date' and 'pct_change' columns, crucial for capturing volatility patterns.

**LSTM Model (df3):**
* For the Long Short-Term Memory (LSTM) model, we'll set up a dataframe with features like 'Open', 'High', 'Low', 'Close', 'Adj Close', and 'Volume'. This comprehensive dataset enables the LSTM model to learn intricate temporal dependencies and patterns.

In [13]:
import pandas as pd
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Loading the data
The data comes from the Kaggle: a free, open-source data-sharing portal with a massive range of datasets.
b

In [14]:
df = pd.read_csv('../data/TSLA_cleaned.csv')
df.head()

Unnamed: 0,Date,Open,Open_Delta,High,High_Delta,Low,Low_Delta,Close,Close_Delta,Adj Close,Adj Close_Delta,Volume,Volume_Delta,daily_return Adj Close
0,2010-07-01,5.0,,5.184,,4.054,,4.392,,4.392,,41094000,,
1,2010-07-02,4.6,-0.4,4.62,-0.564,3.742,-0.312,3.84,-0.552,3.84,-0.552,25699000,-15395000.0,-0.125683
2,2010-07-06,4.0,-0.6,4.0,-0.62,3.166,-0.576,3.222,-0.618,3.222,-0.618,34334500,8635500.0,-0.160937
3,2010-07-07,3.28,-0.72,3.326,-0.674,2.996,-0.17,3.16,-0.062,3.16,-0.062,34608500,274000.0,-0.019243
4,2010-07-08,3.228,-0.052,3.504,0.178,3.114,0.118,3.492,0.332,3.492,0.332,38557000,3948500.0,0.105063


1. **ARMA Model (df1):** 
**Data Characteristics:** 
* ARMA (AutoRegressive Moving Average) models are well-suited for stationary time series data, which exhibit constant statistical properties over time.

**Preprocessing Approach:**
* The provided data seems to consist of daily stock prices (like 'Adj Close') and their changes over time.
* In preprocessing, standardization of the 'value' column (representing stock prices) has been done to ensure that the data has a mean of 0 and a standard deviation of 1. This is crucial for ARMA modeling as it assumes normally distributed stationary data.


**Modeling Stage:**
* ARMA models are used for modeling the autocorrelation in the data and predicting future values based on past observations.
* With the preprocessed data, the ARMA model can be fitted to capture the linear dependencies between past and present values, aiding in predicting future stock prices.

In [25]:
# Selecting 'Date' and 'Adj Close' columns from df and renaming 'Adj Close' to 'Value'
df1 = df[['Date', 'Adj Close']].rename(columns={'Adj Close': 'value'})

# Displaying the first few rows of df1 to verify the changes
df1.head()

Unnamed: 0,Date,value
3,2010-07-07,3.16
4,2010-07-08,3.492
5,2010-07-09,3.48
6,2010-07-12,3.41
7,2010-07-13,3.628


In [17]:
# Check if there are any categorical variables
categorical_columns = df1.select_dtypes(include=['object']).columns
if len(categorical_columns) > 0:
    # Create dummy or indicator features for categorical variables
    df1 = pd.get_dummies(df1, columns=categorical_columns)

# Standardize the magnitude of numeric features using a scaler
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df1[['value']])
scaled_df = pd.DataFrame(scaled_df, columns=['value'])

# Split into testing and training datasets
X_train, X_test, y_train, y_test = train_test_split(scaled_df, df1['value'], test_size=0.2, random_state=42)

# Check the range of features
feature_ranges = X_train.max() - X_train.min()
print("Ranges of features:")
print(feature_ranges)

Ranges of features:
value    4.660569
dtype: float64


In summary, standardization ensures fair treatment of all features in the model, regardless of their original scale or units. The feature range check helps ensure that after standardization, all features have reasonable variations that align with their natural characteristics.

In our case, after standardization, the range of values for the 'Adj Close' feature is approximately 4.660569. This means that the standardized 'Adj Close' prices have a reasonable variation, indicating that they are appropriately scaled for inclusion in the model. This information provides further assurance that the preprocessing steps have been applied effectively, setting the stage for robust modeling and analysis.

2. **GARCH Model (df2):**

**Data Characteristics:** 
* GARCH (Generalized AutoRegressive Conditional Heteroskedasticity) models are suitable for time series data with volatility clustering, where periods of high volatility tend to cluster together.


**Preprocessing Approach:**
* The data for this model consists of percentage changes ('pct_change') of the 'Adj Close' prices, which is a common input for GARCH models.
* Similar to the ARMA preprocessing, standardization has been performed on the 'value' column to ensure the data meets the assumptions of the GARCH model.


**Modeling Stage:**
* GARCH models are specifically designed to model the volatility clustering phenomenon often observed in financial time series.
* By fitting a GARCH model to the preprocessed data, one can capture the time-varying volatility and make predictions about future volatility levels, which is valuable for risk management and option pricing.

In [20]:
df_copy = df
df_copy['pct_change'] = 100*df_copy['Adj Close'].pct_change()
df_copy.dropna(inplace=True)
df2 = df_copy.drop(columns=['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'])
# Selecting 'Date' and 'pct_change' columns from df and renaming 'pct_change' to 'Value'
df2 = df2[['Date', 'pct_change']].rename(columns={'pct_change': 'value'})
df2.head()

Unnamed: 0,Date,value
3,2010-07-07,-1.924271
4,2010-07-08,10.506329
5,2010-07-09,-0.343643
6,2010-07-12,-2.011494
7,2010-07-13,6.392962


In [22]:
# Check if there are any categorical variables
categorical_columns = df2.select_dtypes(include=['object']).columns
if len(categorical_columns) > 0:
    # Create dummy or indicator features for categorical variables
    df2 = pd.get_dummies(df2, columns=categorical_columns)

# Standardize the magnitude of numeric features using a scaler
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df2[['value']])
scaled_df = pd.DataFrame(scaled_df, columns=['value'])

# Split into testing and training datasets
X_train, X_test, y_train, y_test = train_test_split(scaled_df, df2['value'], test_size=0.2, random_state=42)

# Check the range of features
feature_ranges = X_train.max() - X_train.min()
print("Ranges of features:")
print(feature_ranges)

Ranges of features:
value    12.47568
dtype: float64



Certainly! Here's an example summary for the preprocessing of df2:

In summary, the preprocessing steps for df2, geared towards preparing data for the GARCH model, ensure that the dataset is appropriately structured and scaled for accurate modeling.

Standardization of the percentage change in 'Adj Close' prices is a critical step in ensuring fair treatment of features in the model. This process transforms the data to have a mean of 0 and a standard deviation of 1, allowing for consistent interpretation and analysis across different features.

After standardization, the feature range check reveals that the range of values for the standardized percentage change is approximately 12.47568. This indicates that the data exhibits reasonable variation, ensuring that no single feature dominates the modeling process due to differences in scale.

Overall, these preprocessing steps lay a solid foundation for training the GARCH model, providing confidence that the model will be able to effectively capture volatility patterns in the financial data.

3. **LSTM Model (df3):**

**Data Characteristics:**
* LSTM (Long Short-Term Memory) models are a type of recurrent neural network (RNN) that excel at capturing long-term dependencies and patterns in sequential data.


**Preprocessing Approach:**
* Unlike the previous models, the data here contains multiple features such as 'Open', 'High', 'Low', 'Close', 'Adj Close', and 'Volume'.
* Preprocessing involves standardization of all numeric features using a scaler, ensuring that the magnitudes of different features do not bias the LSTM model during training.


**Modeling Stage:**
* LSTMs are particularly effective for modeling complex, non-linear relationships in sequential data.
* With the preprocessed data, an LSTM model can be trained to learn the temporal patterns and dependencies present in the historical stock price data, potentially leading to more accurate predictions of future stock prices.

In [23]:
df3 = df[['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']]
df3.head()

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
3,3.28,3.326,2.996,3.16,3.16,34608500
4,3.228,3.504,3.114,3.492,3.492,38557000
5,3.516,3.58,3.31,3.48,3.48,20253000
6,3.59,3.614,3.4,3.41,3.41,11012500
7,3.478,3.728,3.38,3.628,3.628,13400500


In [24]:
# Check if there are any categorical variables
categorical_columns = df3.select_dtypes(include=['object']).columns
if len(categorical_columns) > 0:
    # Create dummy or indicator features for categorical variables
    df3 = pd.get_dummies(df3, columns=categorical_columns)

# Standardize the magnitude of numeric features using a scaler
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df3)
scaled_df = pd.DataFrame(scaled_df, columns=df3.columns)

# Split into testing and training datasets
X_train, X_test, y_train, y_test = train_test_split(scaled_df, df3['Close'], test_size=0.2, random_state=42)

# Check the range of features
feature_ranges = X_train.max() - X_train.min()
print("Ranges of features:")
print(feature_ranges)

Ranges of features:
Open          4.704777
High          4.659606
Low           4.703418
Close         4.658829
Adj Close     4.658829
Volume       10.697265
dtype: float64


In summary, the preprocessing steps for df3, tailored for training an LSTM model, ensure that the dataset is appropriately structured and scaled for effective sequential data analysis.

Standardization of all numeric features, including 'Open', 'High', 'Low', 'Close', 'Adj Close', and 'Volume', is crucial for maintaining consistency in the model training process. This process transforms the data to have a mean of 0 and a standard deviation of 1, facilitating the learning of patterns and dependencies by the LSTM model.

After standardization, the feature range check reveals that the ranges of values for the standardized features vary:

'Open': Approximately 4.704777
'High': Approximately 4.659606
'Low': Approximately 4.703418
'Close': Approximately 4.658829
'Adj Close': Approximately 4.658829
'Volume': Approximately 10.697265
These ranges indicate that the standardized features exhibit reasonable variations, ensuring that no single feature dominates the model training process due to differences in scale. The wider range observed for 'Volume' compared to other features is expected, as trading volumes often vary significantly across different time periods.

Overall, these preprocessing steps establish a solid groundwork for training the LSTM model, enabling it to effectively capture temporal dependencies and patterns in the financial data.