# Machine Learning Model for Financial Data Prediction

#### Kabilan Mani
#### 230413612
#### ec230413612@qmul.ac.uk


### 1. Importing Essential Libraries


Here, we import the libraries used throughout the analysis and model-building process. Each has a specific role: **numpy** and **pandas** handle data manipulation, **scikit-learn** covers preprocessing, splitting, and evaluation, **xgboost** provides gradient boosting, **PyTorch** is used for the neural network, and **statsmodels** supplies the ARIMA model.

In [1]:
# Importing necessary libraries for data manipulation, machine learning, and deep learning.
# numpy and pandas are used for numerical operations and data handling.
# sklearn provides tools for data preprocessing, model selection, and evaluation.
# xgboost is a popular library for gradient boosting algorithms.
# torch is used for building and training deep learning models.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import xgboost as xgb
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neural_network import MLPRegressor
import torch
from torch import nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from statsmodels.tsa.arima.model import ARIMA

### 2. Loading and Merging Datasets

We load two datasets, convert their `Date` columns to a common datetime format, and merge them on the date. Because the merge is an inner join, only dates present in both files are kept. The result places **financial indicators** and **stock performance** side by side in one table, which is the input for our predictive models.

In [3]:
# Load financial and stock data
merged_financial_data = pd.read_csv('Merged_Financial_Data.csv')
stock_data = pd.read_csv('TSLA_Quarterly_Data.csv')

# Ensure column names are consistent
merged_financial_data.columns = merged_financial_data.columns.astype(str)
stock_data.columns = stock_data.columns.astype(str)

# Convert 'Date' columns to datetime format
merged_financial_data['Date'] = pd.to_datetime(merged_financial_data['Date'])
stock_data['Date'] = pd.to_datetime(stock_data['Date'])

# Merge datasets on 'Date'
merged_df = pd.merge(merged_financial_data, stock_data, on='Date', how='inner')

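Because the merge above is an inner join, any date missing from either file is dropped silently, and a column name shared by both files (here apparently `Close`, given the `Close_y` target) is split into `_x` and `_y` versions. The sketch below uses two small made-up frames, not the real CSV files, to show one way to audit both effects. If `Close_x` turns out to carry the same price as `Close_y`, it would presumably need to be dropped from the features as well.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; all values are made up for illustration
financial = pd.DataFrame({
    "Date": pd.to_datetime(["2023-03-31", "2023-06-30", "2023-09-30"]),
    "Revenue": [23.3, 24.9, 23.4],
    "Close": [207.5, 261.8, 250.2],
})
stock = pd.DataFrame({
    "Date": pd.to_datetime(["2023-06-30", "2023-09-30", "2023-12-31"]),
    "Close": [261.8, 250.2, 248.5],
})

# Outer merge with indicator=True labels each row as left_only, right_only, or both
audit = pd.merge(financial, stock, on="Date", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

# The inner merge keeps only dates present in both frames; the shared
# 'Close' column is disambiguated with the default _x / _y suffixes
inner = pd.merge(financial, stock, on="Date", how="inner")

print(inner.columns.tolist())
print(f"kept {len(inner)} rows, dropped {len(unmatched)} unmatched dates")
```
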


### 3. Preparing Features and Scaling


After merging the datasets, we prepare our feature matrix and target variable. Standardizing the features puts them on a comparable scale, so that no feature dominates purely because of its units, and splitting the data allows us to assess how well our model generalizes to unseen data.

In [10]:
# Preparing the feature matrix (X) by dropping non-essential columns and setting the stock price as the target variable (y).
features = merged_df.drop(columns=['Date', 'Close_y'])
target = merged_df['Close_y']

# Standardizing the feature matrix using StandardScaler.
# This step is crucial for many machine learning algorithms, which perform better when features are on a similar scale.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Splitting the dataset into training and testing sets to evaluate our model's performance.
# We use 80% of the data for training and reserve 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(scaled_features, target, test_size=0.2, random_state=42)
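The split above shuffles rows at random, which is fine for models that treat rows independently but mixes past and future observations for a time series model such as the ARIMA used later. As a hedged alternative, the sketch below keeps the rows in date order with `shuffle=False`, so the most recent observations form the test set. The dates and prices are synthetic placeholders, not values from the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical date-ordered observations (synthetic values)
dates = pd.date_range("2021-01-01", periods=10, freq="MS")
prices = pd.Series(np.linspace(100.0, 190.0, 10), index=dates, name="Close")
X = np.arange(10, dtype=float).reshape(-1, 1)

# shuffle=False keeps row order, so the last 20% of dates become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, prices, test_size=0.2, shuffle=False
)

print(y_train.index.max(), y_test.index.min())
```

With this split every training date precedes every test date, which matches how a forecast would be used in practice.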


### 4. Handling Missing Data


Missing data is a common issue in real-world datasets. Here, we use mean imputation: each gap is filled with the mean of its feature, computed on the training set and then applied to the test set, so that models requiring complete inputs can be trained without dropping rows.

In [11]:
# Addressing any missing values in the dataset by imputing them with the mean of each feature.
# This is a simple yet effective way to handle missing data without losing any rows from our dataset.
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
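In the cells above, `StandardScaler` is fitted on the full dataset before the split, so test-set statistics influence the scaling of the training data. A minimal sketch of one way to avoid this, assuming scikit-learn's `Pipeline` and synthetic data in place of the real features, is to split first and fit both preprocessing steps on the training rows only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix with two missing entries (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(20, 3))
X[2, 1] = np.nan
X[7, 0] = np.nan
y = rng.normal(size=20)

# Split first, so the test rows never influence the preprocessing statistics
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Impute, then scale; both steps learn their parameters from the training rows
prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
X_tr_prep = prep.fit_transform(X_tr)
X_te_prep = prep.transform(X_te)  # reuses the training means and variances

print(X_tr_prep.mean(axis=0).round(6), X_te_prep.shape)
```
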

### 5. Model

#### 5.1 ARIMA Model Training

This code builds and evaluates an ARIMA model for time series forecasting. It first checks stationarity with the augmented Dickey-Fuller (ADF) test and applies first differencing if the p-value exceeds 0.05. If `pmdarima` is installed, `auto_arima` selects the order (p, d, q); otherwise the order falls back to (5, 1, 0).
The fitted model then forecasts as many steps as there are test observations, and the forecast is scored with MSE.

In [43]:
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from sklearn.metrics import mean_squared_error

# Function to test stationarity
def test_stationarity(timeseries):
    result = adfuller(timeseries)
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    for key, value in result[4].items():
        print(f'Critical Values {key}: {value}')

# Check if the time series is stationary
test_stationarity(y_train)

# Differencing if the series is not stationary
if adfuller(y_train)[1] > 0.05:
    y_train_diff = np.diff(y_train, n=1)
    print("Applied first differencing")
else:
    y_train_diff = y_train

# Use auto_arima to find the best parameters (optional)
try:
    from pmdarima import auto_arima
    auto_arima_model = auto_arima(y_train_diff, seasonal=False, trace=True)
    print(auto_arima_model.summary())
    p, d, q = auto_arima_model.order
except ImportError:
    # If pmdarima is not available, fall back to manual tuning
    p, d, q = 5, 1, 0  # Default values to be adjusted

# Train ARIMA model with tuned parameters
arima_model = ARIMA(y_train_diff, order=(p, d, q))
arima_result = arima_model.fit()

# Forecast using ARIMA
arima_forecast_diff = arima_result.forecast(steps=len(y_test))

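# Reverse differencing to obtain the forecast in the original scale.
# This step assumes the forecast holds first differences, i.e. that differencing was applied above.
# np.asarray avoids label-based lookup when y_train is a pandas Series.
arima_forecast = np.r_[np.asarray(y_train)[-1], arima_forecast_diff].cumsum()[1:]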

# Evaluate ARIMA model
arima_mse = mean_squared_error(y_test, arima_forecast)
print(f'ARIMA Model MSE: {arima_mse:.4f}')


ADF Statistic: -14.11899967915911
p-value: 2.4437281550243927e-26
Critical Values 1%: -6.045114
Critical Values 5%: -3.9292800000000003
Critical Values 10%: -2.98681


  warn('Non-stationary starting autoregressive parameters'


ARIMA Model MSE: 56256.5453
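
The reverse-differencing step in the cell above can be checked in isolation. The numpy-only sketch below uses a made-up series and made-up forecast differences to show that a cumulative sum anchored at the last observed level undoes `np.diff`. This inversion applies only when the model was fitted on manually differenced data; when differencing is handled through the `d` term of the ARIMA order, statsmodels already returns forecasts on the original scale.

```python
import numpy as np

# Made-up price levels and their first differences
series = np.array([100.0, 104.0, 103.0, 110.0, 115.0])
diffed = np.diff(series, n=1)  # [4., -1., 7., 5.]

# Suppose a model fitted on `diffed` forecasts the next two differences
forecast_diff = np.array([2.0, -3.0])

# Undo the differencing: cumulative sum anchored at the last observed level
forecast_level = series[-1] + np.cumsum(forecast_diff)

# Equivalent to the expression used in the cell above
same_result = np.r_[series[-1], forecast_diff].cumsum()[1:]

print(forecast_level)  # [117. 114.]
```
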




In [44]:
# Load and transpose data for feature selection
data_path = 'Final_Transposed_Financial_Data_with_Category.csv'
df = pd.read_csv(data_path)
df_transposed = df.set_index('category').T

# Extract features and target
target_column = 'Close'
features = df_transposed.drop(columns=[target_column]).values
target = df_transposed[target_column].values

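# Handle missing values: fill each feature column with its own mean
# (features.mean() on the raw array is a single overall value and is NaN if any entry is missing)
features_df = pd.DataFrame(features)
features = features_df.fillna(features_df.mean()).values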
target = np.nan_to_num(target)

# Scale features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(features_scaled, target, test_size=0.2, random_state=42)

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).view(-1, 1)
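
The cell above relies on `set_index('category').T` to turn a file with one row per metric into a table with one row per period. The sketch below illustrates that reshaping on a small hypothetical frame; the metric names other than `Close`, the period labels, and all values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical layout: one row per metric, one column per period (values made up)
df = pd.DataFrame({
    "category": ["Revenue", "NetIncome", "Close"],
    "2023Q1": [23.3, 2.5, 207.5],
    "2023Q2": [24.9, 2.7, 261.8],
})

# After transposing, each row is a period and each column is a metric
df_t = df.set_index("category").T

features = df_t.drop(columns=["Close"])
target = df_t["Close"]

print(df_t)
```
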


#### 5.2 XGBoost Model Training

XGBoost is a gradient-boosted decision tree library that is widely used for regression on tabular data. We train it on the imputed, scaled features so that it can learn the relationship between the financial metrics and the stock price, and then score its predictions on the test set.

In [50]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Check the shapes of the training data and labels
print(f'Shape of X_train_imputed: {X_train_imputed.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of X_test_imputed: {X_test_imputed.shape}')
print(f'Shape of y_test: {y_test.shape}')

# Ensure that X_train and y_train have the same number of rows
min_train_len = min(X_train_imputed.shape[0], y_train.shape[0])
X_train_imputed = X_train_imputed[:min_train_len]
y_train = y_train[:min_train_len]

# Ensure there are no missing values in y_train and y_test
print(f'Missing values in y_train: {np.isnan(y_train).sum()}')
print(f'Missing values in y_test: {np.isnan(y_test).sum()}')

# Prepare data for XGBoost
try:
    dtrain = xgb.DMatrix(X_train_imputed, label=y_train)
    dtest = xgb.DMatrix(X_test_imputed, label=y_test)

    # Define and train XGBoost model
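    params = {
        'objective': 'reg:squarederror',
        'max_depth': 7,
        'learning_rate': 0.01,
        'verbosity': 1
    }
    # The number of trees is set by num_boost_round; 'n_estimators' belongs to
    # the scikit-learn wrapper and is ignored by xgb.train
    bst = xgb.train(params, dtrain, num_boost_round=100)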

    # Make predictions
    y_test_pred_xgb = bst.predict(dtest)

    # Evaluate XGBoost model
    xgb_mse = mean_squared_error(y_test, y_test_pred_xgb)
    xgb_mae = mean_absolute_error(y_test, y_test_pred_xgb)
    xgb_r2 = r2_score(y_test, y_test_pred_xgb)

except xgb.core.XGBoostError as e:
    print(f'XGBoost Error: {e}')


Shape of X_train_imputed: (8, 92)
Shape of y_train: (8,)
Shape of X_test_imputed: (3, 92)
Shape of y_test: (3,)
Missing values in y_train: 0
Missing values in y_test: 0


Parameters: { "n_estimators" } are not used.



#### 5.3 Hybrid Model (ARIMA + Neural Network) Training

This code implements a hybrid model that combines ARIMA with a neural network. The ARIMA model captures linear trends and patterns in the time series, and the neural network is trained on the ARIMA residuals to learn any remaining non-linear patterns, with the aim of improving overall forecasting accuracy. Early stopping and learning rate scheduling (`ReduceLROnPlateau`) are included to keep training efficient and to reduce the risk of overfitting.
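
To make the residual-learning idea concrete, here is a dependency-light sketch on synthetic data. It is not the notebook's model: a least-squares trend stands in for ARIMA, and a plain linear regression on the two previous residuals stands in for the neural network, so the example stays deterministic. The point it illustrates is the data flow, where the second stage predicts the next residual from earlier residuals and its output is added back to the first-stage fit. The comparison is in-sample only and says nothing about out-of-sample accuracy.

```python
import numpy as np

# Synthetic series: linear trend plus a repeating pattern the trend cannot capture
t = np.arange(60, dtype=float)
y = 2.0 * t + 10.0 * np.sin(t / 3.0)

# Stage 1 (stand-in for ARIMA): least-squares linear trend
slope, intercept = np.polyfit(t, y, deg=1)
linear_fit = slope * t + intercept
residuals = y - linear_fit

# Stage 2 (stand-in for the neural network): predict residual[k]
# from residual[k-2] and residual[k-1]
lags = 2
X_res = np.column_stack(
    [residuals[i:len(residuals) - lags + i] for i in range(lags)]
)
y_res = residuals[lags:]
A = np.column_stack([X_res, np.ones(len(X_res))])  # add an intercept column
coef, *_ = np.linalg.lstsq(A, y_res, rcond=None)
residual_pred = A @ coef

# Hybrid fit = first-stage fit + predicted residual
hybrid_fit = linear_fit[lags:] + residual_pred

mse_linear = np.mean((y[lags:] - linear_fit[lags:]) ** 2)
mse_hybrid = np.mean((y[lags:] - hybrid_fit) ** 2)
print(f"stage 1 only: {mse_linear:.4f}, hybrid: {mse_hybrid:.4f}")
```
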

In [47]:
from torch import nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Truncate y_train and ARIMA fitted values to have the same length
min_len = min(len(y_train), len(arima_result.fittedvalues))
y_train_truncated = y_train[:min_len]
arima_fitted_truncated = arima_result.fittedvalues[:min_len]

# Calculate residuals from ARIMA model
arima_residuals = y_train_truncated - arima_fitted_truncated

# Split data into training and validation sets for better evaluation
residual_train, residual_val = train_test_split(arima_residuals, test_size=0.2, random_state=42)

# Prepare residual data for Neural Network
residual_train_tensor = torch.tensor(residual_train, dtype=torch.float32).view(-1, 1)
residual_val_tensor = torch.tensor(residual_val, dtype=torch.float32).view(-1, 1)

# Define a more complex neural network model
class ResidualNN(nn.Module):
    def __init__(self):
        super(ResidualNN, self).__init__()
        self.fc1 = nn.Linear(residual_train_tensor.shape[1], 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 32)
        self.fc5 = nn.Linear(32, 1)
        self.dropout = nn.Dropout(0.3)
        self.bn1 = nn.BatchNorm1d(256)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(64)
        self.bn4 = nn.BatchNorm1d(32)

    def forward(self, x):
        x = torch.relu(self.bn1(self.fc1(x)))
        x = torch.relu(self.bn2(self.fc2(x)))
        x = self.dropout(x)
        x = torch.relu(self.bn3(self.fc3(x)))
        x = torch.relu(self.bn4(self.fc4(x)))
        x = self.fc5(x)
        return x

# Initialize the model, criterion, and optimizer
nn_model = ResidualNN()
criterion = nn.MSELoss()
optimizer = optim.Adam(nn_model.parameters(), lr=0.001)
scheduler = ReduceLROnPlateau(optimizer, 'min', patience=5, factor=0.5, verbose=True)

# Training with early stopping, learning rate scheduling, and validation
num_epochs = 100
best_val_loss = float('inf')
patience, trials = 10, 0

for epoch in range(num_epochs):
    nn_model.train()
    optimizer.zero_grad()
    y_pred = nn_model(residual_train_tensor)
    loss = criterion(y_pred, residual_train_tensor)
    loss.backward()
    optimizer.step()

    nn_model.eval()
    with torch.no_grad():
        val_pred = nn_model(residual_val_tensor)
        val_loss = criterion(val_pred, residual_val_tensor)

    scheduler.step(val_loss)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        trials = 0
    else:
        trials += 1
        if trials >= patience:
            print(f"Early stopping on epoch {epoch+1}")
            break

    print(f'Epoch {epoch+1}/{num_epochs}, Training Loss: {loss.item():.4f}, Validation Loss: {val_loss.item():.4f}')

# After training, use the model for predictions or further evaluation




Epoch 1/100, Training Loss: 18049.7324, Validation Loss: 4542.3193
Epoch 2/100, Training Loss: 18021.4121, Validation Loss: 4537.5654
Epoch 3/100, Training Loss: 17926.2441, Validation Loss: 4533.3760
Epoch 4/100, Training Loss: 17964.0488, Validation Loss: 4527.6904
Epoch 5/100, Training Loss: 17919.7734, Validation Loss: 4523.0640
Epoch 6/100, Training Loss: 17864.1953, Validation Loss: 4516.7461
Epoch 7/100, Training Loss: 17872.7480, Validation Loss: 4510.6758
Epoch 8/100, Training Loss: 17889.1113, Validation Loss: 4502.5229
Epoch 9/100, Training Loss: 17847.6270, Validation Loss: 4494.4131
Epoch 10/100, Training Loss: 17843.9727, Validation Loss: 4484.7881
Epoch 11/100, Training Loss: 17830.3828, Validation Loss: 4475.7832
Epoch 12/100, Training Loss: 17846.5645, Validation Loss: 4466.7334
Epoch 13/100, Training Loss: 17836.0176, Validation Loss: 4458.7559
Epoch 14/100, Training Loss: 17811.8301, Validation Loss: 4448.6870
Epoch 15/100, Training Loss: 17828.0566, Validation Loss:

After training the model, the section below forecasts values for the test period and evaluates each model's accuracy using MSE and related metrics.

### 6. Evaluation of Models


This code segment evaluates three different models on the test dataset:

1. **ARIMA Model**: A traditional time series model that captures linear patterns.
2. **XGBoost Model**: A gradient-boosted tree model that can capture non-linear patterns and feature interactions.
3. **Hybrid Model**: A combination of ARIMA and a neural network, designed to capture both linear and non-linear patterns in the data.

Each model's performance is measured using standard metrics: MSE, MAE, and R2 score.

In [52]:
import numpy as np
import torch
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Evaluate ARIMA model
arima_mse = mean_squared_error(y_test, arima_forecast)
arima_mae = mean_absolute_error(y_test, arima_forecast)
arima_r2 = r2_score(y_test, arima_forecast)

print(f'ARIMA Model MSE: {arima_mse:.4f}')
print(f'ARIMA Model MAE: {arima_mae:.4f}')
print(f'ARIMA Model R2 Score: {arima_r2:.4f}')

# Convert arima_forecast to the correct shape if necessary
arima_forecast_array = arima_forecast.reshape(-1, 1)

# Evaluate XGBoost model

print(f'XGBoost Model MSE: {xgb_mse:.4f}')
print(f'XGBoost Model MAE: {xgb_mae:.4f}')
print(f'XGBoost Model R2 Score: {xgb_r2:.4f}')

# Evaluate Hybrid model (Combine ARIMA forecast and neural network)
nn_forecast = nn_model(torch.tensor(arima_forecast_array, dtype=torch.float32)).detach().numpy()
hybrid_forecast = arima_forecast + nn_forecast.flatten()

# Calculate metrics for the hybrid model
hybrid_mse = mean_squared_error(y_test, hybrid_forecast)
hybrid_mae = mean_absolute_error(y_test, hybrid_forecast)
hybrid_r2 = r2_score(y_test, hybrid_forecast)

print(f'Hybrid Model MSE: {hybrid_mse:.4f}')
print(f'Hybrid Model MAE: {hybrid_mae:.4f}')
print(f'Hybrid Model R2 Score: {hybrid_r2:.4f}')


ARIMA Model MSE: 56256.5453
ARIMA Model MAE: 216.5778
ARIMA Model R2 Score: -4.7823
XGBoost Model MSE: 13287.9088
XGBoost Model MAE: 100.3768
XGBoost Model R2 Score: -0.3658
Hybrid Model MSE: 58778.0403
Hybrid Model MAE: 221.8783
Hybrid Model R2 Score: -5.0415
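
All three R2 scores above are negative. The sketch below, on made-up numbers rather than the notebook's predictions, shows how R2 is computed and why a negative value means the predictions have a larger squared error than simply predicting the mean of the test targets, which scores exactly 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and two sets of predictions
y_true = np.array([100.0, 150.0, 200.0])
baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
poor = np.array([300.0, 50.0, 400.0])           # far from the targets

def r2_manual(y, y_hat):
    """R2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r2_manual(y_true, baseline))  # 0.0
print(r2_manual(y_true, poor))      # -17.0
```
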


### 7. Prediction of Next Value

In [56]:
# Handle ARIMA warnings by ensuring a proper index is set on the data
y_test = y_test.reset_index(drop=True)  # Ensure index is properly set
arima_forecast = arima_result.forecast(steps=1)[0]  # Correctly extract ARIMA forecast value
print(f'Next value prediction using ARIMA: {arima_forecast:.4f}')

# Predict the next value using the most recent data point from X_test
next_input = X_test_imputed[-1].reshape(1, -1)  # Reshape to ensure it's in the right format
next_dmatrix = xgb.DMatrix(next_input)
next_value_pred_xgb = bst.predict(next_dmatrix)[0]  # Get the prediction for the next value

print(f'Next value prediction using XGBoost: {next_value_pred_xgb:.4f}')

# Hybrid Model: Combine ARIMA forecast and neural network residual prediction
try:
    next_residual = nn_model(torch.tensor([[arima_forecast]], dtype=torch.float32)).item()
    next_value_hybrid = arima_forecast + next_residual
    print(f'Next value prediction using Hybrid Model: {next_value_hybrid:.4f}')
except KeyError as e:
    print(f"KeyError occurred: {e}")


Next value prediction using ARIMA: 126.8539
Next value prediction using XGBoost: 147.4598
Next value prediction using Hybrid Model: 127.4968
