# Machine Learning Models for Stock Prediction - Non-Transformation Version

# Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings("ignore")

## Import data

The data was downloaded from Bloomberg and covers:
* Exchange rates between Vietnam and its major trading partners: China and the US
* Precious metal spot and futures prices: Gold, Silver, Palladium, Platinum
* Global stock indices: Hang Seng Index, Nasdaq 100, Nasdaq Composite, Nikkei 225, S&P 500, DOJI, Shanghai Shenzhen CSI3000, Shanghai Shenzhen Composite and Singapore Stock Index
* Volatility index: VIX Index

The data is imported from the previous EDA session, where it was already cleaned.

In [2]:
# Import data
data = pd.read_csv('data.csv')

# Convert the Date column to datetime
data['Date'] = pd.to_datetime(data['Date'], format = '%m/%d/%Y')

# Use the date as the index
data.set_index('Date', inplace = True)

# Machine Learning models - Regression

We will test several models to determine which performs best. The models are:
- Linear Regression (including Ridge and Lasso, whose regularization helps with multicollinearity)
- Decision Tree Regressor
- Gradient Boosting Regressor
- XGBoost Regressor
- Random Forest Regressor
- Support Vector Regressor (SVR)
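
As a minimal sketch of why Ridge and Lasso are listed alongside plain Linear Regression, the snippet below fits all three on two nearly identical synthetic features. The data, the `alpha` values and every variable name here are assumptions made for illustration; none of it comes from the Bloomberg dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data (made up): toy_x2 is almost a copy of toy_x1.
rng = np.random.default_rng(0)
n_rows = 200
toy_x1 = rng.normal(size=n_rows)
toy_x2 = toy_x1 + rng.normal(scale=0.01, size=n_rows)
toy_X = np.column_stack([toy_x1, toy_x2])
toy_y = 3.0 * toy_x1 + rng.normal(scale=0.1, size=n_rows)

# Fit the three linear models on the same collinear features.
ols_toy = LinearRegression().fit(toy_X, toy_y)
ridge_toy = Ridge(alpha=1.0).fit(toy_X, toy_y)
lasso_toy = Lasso(alpha=0.1).fit(toy_X, toy_y)

# OLS can split the effect between the two columns erratically;
# the penalized models keep the coefficient vector small.
print("OLS coefficients:  ", ols_toy.coef_)
print("Ridge coefficients:", ridge_toy.coef_)
print("Lasso coefficients:", lasso_toy.coef_)
```

The L2 penalty in Ridge guarantees a coefficient vector no longer than the unpenalized one, which is what stabilizes the fit when features overlap. In this notebook the only feature is the time index, so the penalty has little to act on.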

## Train test split - 80% 20%

In [21]:
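# Define a chronological train/test split (no shuffling)
# Note: this name shadows sklearn.model_selection.train_test_split if that is imported
def train_test_split(df, target):
    # Number of rows in the training set (first 80% of the data)
    train_time = int(round(len(df) * 0.8))
    
    # X is the integer time index (0, 1, ..., len(df) - 1); y is the target column
    X = np.arange(0, len(df))
    y = df[target]
    
    # Chronological split: earliest 80% for training, latest 20% for testing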
    X_train = X[:train_time]
    X_test = X[train_time:]
    
    y_train = y.iloc[:train_time]
    y_test = y.iloc[train_time:]
    
    # Print out to check shape
    print(X_train.shape)
    print(X_test.shape)
    
    print(y_train.shape)
    print(y_test.shape)
    
    return X_train, y_train, X_test, y_test

In [22]:
# Train test split the dataset
X_train, y_train, X_test, y_test = train_test_split(data, 'index_vni')

(3340,)
(835,)
(3340,)
(835,)
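
The same chronological 80/20 split can be sketched with scikit-learn's own splitter by turning shuffling off. This is an illustrative alternative on a made-up ten-row series, not the code used for the results in this notebook; the toy values and the `sk_split` alias are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split as sk_split

# Toy series standing in for the real data (values are made up).
toy_dates = pd.date_range("2020-01-01", periods=10, freq="D")
toy_df = pd.DataFrame({"index_vni": np.linspace(100.0, 109.0, 10)}, index=toy_dates)

toy_X = np.arange(len(toy_df))   # integer time index as the single feature
toy_y = toy_df["index_vni"]

# shuffle=False preserves order: the first 80% is train, the last 20% is test.
X_tr, X_te, y_tr, y_te = sk_split(toy_X, toy_y, test_size=0.2, shuffle=False)

print(X_tr, X_te)
print("Train ends before test starts:", y_tr.index.max() < y_te.index.min())
```

Keeping the order intact matters for time series: a shuffled split would let the model see observations from after the dates it is asked to predict.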


## Prepare models

We will prepare the regression models and fit them without tuning first to see how they behave. Based on the results, we will perform further hyperparameter tuning if needed.

In [23]:
# Import model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

In [24]:
# Instantiate the models
lr = LinearRegression()
ridge = Ridge(alpha = 15)
lasso = Lasso(alpha = 20)
dtr = DecisionTreeRegressor(criterion = 'squared_error', max_depth = 10, min_samples_split = 4)
gb = GradientBoostingRegressor()
xgb = XGBRegressor()
rf = RandomForestRegressor(criterion = 'squared_error', max_depth = 10, min_samples_split = 4)
svr = SVR()

# Set models list
models = [lr, ridge, lasso, dtr, gb, xgb, rf, svr]

## Test models

In [25]:
# Import metrics
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse

def rmse(mse_value):
    # Root mean squared error: the square root of the MSE
    return np.sqrt(mse_value)

In [26]:
# Define evaluate model
def evaluate_model(model, X_train, y_train, X_cv, y_cv):
    # Fit model and obtain result
    model.fit(X_train, y_train)
    y_pred_cv = model.predict(X_cv)
    MAE = mae(y_cv, y_pred_cv)
    MSE = mse(y_cv, y_pred_cv)
    RMSE = rmse(MSE)

    # Store result
    return MAE, MSE, RMSE
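
How the three metrics relate can be checked on a handful of made-up numbers. The arrays below are purely illustrative and the variable names are assumptions; only the scikit-learn metric functions match what the notebook imports.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions; the errors are -2, 2, -5 and 10.
toy_true = np.array([100.0, 110.0, 120.0, 130.0])
toy_pred = np.array([102.0, 108.0, 125.0, 120.0])

mae_value = mean_absolute_error(toy_true, toy_pred)   # mean of |error|
mse_value = mean_squared_error(toy_true, toy_pred)    # mean of error ** 2
rmse_value = np.sqrt(mse_value)                       # same units as the target

print(f"MAE  = {mae_value:.2f}")
print(f"MSE  = {mse_value:.2f}")
print(f"RMSE = {rmse_value:.2f}")
```

MAE and RMSE are both expressed in index points, so they are easier to read against the level of the index than MSE. RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest.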

In [27]:
# Test the model

col = ['Linear Regression','Ridge','Lasso','Decision Tree','Gradient Boosting','XGBoost','Random Forest', 'SVR']
MAE_cv = []
MSE_cv = []

for model in models:
    MAE, MSE, RMSE = evaluate_model(model, X_train.reshape(-1,1), y_train, X_test.reshape(-1,1), y_test)
    MAE_cv.append(MAE)
    MSE_cv.append(MSE)

cv_result = pd.DataFrame(data = [MAE_cv, MSE_cv], columns = col, index = ['MAE','MSE'])
cv_result.T.sort_values(by = 'MAE')

| Model | MAE | MSE |
|---|---|---|
| Linear Regression | 254.927068 | 100731.572821 |
| Ridge | 254.927069 | 100731.573519 |
| Lasso | 254.965061 | 100754.140121 |
| SVR | 329.300403 | 143074.986964 |
| XGBoost | 378.520056 | 184394.282982 |
| Decision Tree | 382.792711 | 187726.939374 |
| Random Forest | 385.914154 | 190177.626501 |
| Gradient Boosting | 389.849369 | 193293.623894 |


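When we apply simple univariate models, with the time index as the only feature, the errors are very high. Even the best model, Linear Regression, has an MAE of about 255 index points and an RMSE of about 317 (the square root of its MSE), and the tree-based models do worse still.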
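
One plausible reason the tree-based models rank last is that a tree cannot extrapolate: for any time index beyond the training range it returns the value of its last leaf. The sketch below shows this on a synthetic, noiseless trend. It is an illustration under that assumption, not a diagnosis of the real data, and all names in it are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic upward trend (made up): value = 2 * t + 50.
toy_t = np.arange(100).reshape(-1, 1)
toy_values = 2.0 * toy_t.ravel() + 50.0

# Chronological 80/20 split, as in the notebook.
t_tr, t_te = toy_t[:80], toy_t[80:]
v_tr, v_te = toy_values[:80], toy_values[80:]

tree_toy = DecisionTreeRegressor(max_depth=10, random_state=0).fit(t_tr, v_tr)
line_toy = LinearRegression().fit(t_tr, v_tr)

tree_pred = tree_toy.predict(t_te)
line_pred = line_toy.predict(t_te)

# Every test point lies to the right of all training points, so the tree
# sends them all to the same leaf and predicts one constant value.
print("Distinct tree predictions:", np.unique(tree_pred))
print("Largest training target:  ", v_tr.max())
print("Max linear error on test: ", np.abs(line_pred - v_te).max())
```

On a trending series this caps the tree's forecasts at roughly the last levels seen in training, while a linear model keeps following the trend. Real index data is noisier than this toy example, so the effect there may be only part of the story.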