# Ubiquant Market Prediction 
## Comparing ML Techniques and LightGBM Finetuning! ⚡


<div style="color:turquoise;
           display:fill;
           border-radius:5px;
           background-color:aquamarine;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:turquoise;">

</p>
</div>

# Introduction

This notebook aims to share initial analysis on which technique may get you a nice score in the Ubiquant Market Prediction Competition. 🍀

It does this by trying out different models and comparing their results! 

The models compared are:


* Linear Regression
* Random Forest Regressor
* Ridge
* XGBoost
* LightGBM
* Support Vector Regressor


This notebook has 3 sections: Import, Model Comparison and LightGBM Finetuning (as LightGBM showed the best performance in our analysis).

Here we go! 😀


<div style="color:yellow;
           display:fill;
           border-radius:5px;
           background-color:chartreuse;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:yellow;">

</p>
</div>

# Import

First we need to import!

In [None]:
# Import libraries

from datetime import datetime
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
import time
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Import competition data

train = pd.read_csv('/kaggle/input/ubiquant-market-prediction/train.csv', nrows=10000)
example_test = pd.read_csv('/kaggle/input/ubiquant-market-prediction/example_test.csv')
example_sample_submission =  pd.read_csv('/kaggle/input/ubiquant-market-prediction/example_sample_submission.csv')

In [None]:
train.head()

Please note that we only imported the first 10,000 rows, as a quick way to compare techniques and share the results.

As the competition progresses, further work could include trying different import methods.

That said, let's see how the different models fared!

<div style="color:turquoise;
           display:fill;
           border-radius:5px;
           background-color:turquoise;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:turquoise;">

</p>
</div>

# Data Prep

In this section we compare the performance of different techniques.

First we need to do some data prep for the models.

In [None]:
# Data Prep for Models

df = train

scaler = StandardScaler()
X = np.array(df.drop(['row_id', 'time_id', 'investment_id', 'target'], axis = 1))
scaler.fit(X)
X = scaler.transform(X)

y = np.array(df['target'])

print(X.shape)
print(y.shape)

That is the initial prep complete!


<div style="color:yellow;
           display:fill;
           border-radius:5px;
           background-color:yellow;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:yellow;">

</p>
</div>

# Linear Regression

By fitting a linear equation to observed data, linear regression seeks to model the relationship between two variables. One variable is regarded as an explanatory variable, while the other is regarded as a dependent variable. A modeller might, for example, use a linear regression model to match people's weights to their heights.

(For further details our recommended read is by Yale, linked below).
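To make the height and weight example concrete, here is a minimal sketch of a least-squares line fit using NumPy only. The data is synthetic and the numbers (slope 0.9, intercept -90, noise level) are assumptions chosen purely for illustration; they are not taken from the competition data.

```python
import numpy as np

# Synthetic heights (cm) and weights (kg); the values are made up for illustration.
rng = np.random.default_rng(0)
heights = rng.uniform(150, 200, size=200)
weights = 0.9 * heights - 90 + rng.normal(0, 5, size=200)

# Design matrix with a column of ones so the model can learn an intercept.
A = np.column_stack([heights, np.ones_like(heights)])

# Least-squares solution of A @ [slope, intercept] ~= weights.
(slope, intercept), *_ = np.linalg.lstsq(A, weights, rcond=None)

print(f"weight ~= {slope:.2f} * height + {intercept:.2f}")
```

The fitted slope should land close to the 0.9 used to generate the data, which is the sense in which the line "models the relationship" between the two variables.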


Let's try it out! 💡

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_confidence = lr.score(X_test, y_test)
lr_confidence


<div style="color:fuchsia;
           display:fill;
           border-radius:5px;
           background-color:fuchsia;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:fuchsia;">

</p>
</div>

# RandomForestRegressor

A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting.

If bootstrap=True (default), the sub-sample size is controlled by the “max_samples” argument; otherwise, the entire dataset is used to build each tree.

(For further details our recommended read is by Scikit-Learn, linked below).
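The averaging described above can be checked directly. The sketch below uses synthetic data from `make_regression` (an assumption, not the competition data) and compares the forest's prediction with a manual average over its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem, for illustration only.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# max_samples=0.5 means each tree is trained on a bootstrap sample of half the rows.
forest = RandomForestRegressor(n_estimators=20, max_samples=0.5, random_state=0)
forest.fit(X_demo, y_demo)

# Average the individual trees by hand and compare with the forest's own prediction.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_demo)))
```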

Let's try it out! 🪄

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
rf_confidence = rf.score(X_test, y_test)
rf_confidence


<div style="color:orange;
           display:fill;
           border-radius:5px;
           background-color:orange;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:orange;">

</p>
</div>

# Ridge

This model handles a regression problem in which the loss function is the linear least squares function and the l2-norm is used for regularisation. 

This approach is also known as Ridge Regression or Tikhonov regularisation. The estimator has built-in support for multi-variate regression, i.e. when y is a 2d-array of shape (n_samples, n_targets).

(For further details our recommended read is by Scikit-Learn, linked below).
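As a small worked example of least squares with an l2 penalty, the sketch below solves the ridge problem in closed form, w = (XᵀX + αI)⁻¹Xᵀy, and compares it with scikit-learn's `Ridge`. The data is synthetic, and the intercept is switched off so that both solve exactly the same problem; both choices are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 1.0  # regularisation strength (scikit-learn's default for Ridge)

# Closed-form solution: w = (X^T X + alpha * I)^-1 X^T y
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(3), X_demo.T @ y_demo)

# Same problem solved by scikit-learn; the intercept is disabled to match the formula.
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_

print(w_closed)
print(w_sklearn)
```

Larger values of `alpha` shrink the coefficients further towards zero.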

Let's give it a go! 👀

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
rg = Ridge()
rg.fit(X_train, y_train)
rg_confidence = rg.score(X_test, y_test)
rg_confidence


<div style="color:blueviolet;
           display:fill;
           border-radius:5px;
           background-color:blueviolet;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:blueviolet;">

</p>
</div>

# XGBoost

XGBoost is a distributed gradient boosting toolkit that has been tuned for efficiency, flexibility, and portability. It uses the Gradient Boosting framework to create machine learning algorithms. 

XGBoost is a parallel tree boosting (also known as GBDT, GBM) algorithm that solves a variety of data science issues quickly and accurately. The same algorithm may tackle problems with billions of examples in a distributed environment (Hadoop, SGE, MPI). 

(For further details our recommended read is by Scikit-Learn, linked below).

Let's have a go! 👓

In [None]:
# Import xgboost
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Create the training and test sets (same random_state as the other models so the scores are comparable)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the XGBRegressor: xg_reg
# ("reg:squarederror" replaces the deprecated "reg:linear" objective)
xg_reg = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=10, random_state=123)

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)

# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))


XGB_confidence = xg_reg.score(X_test, y_test)
XGB_confidence

In [None]:
# Convert the training and testing sets into DMatrixes: DM_train, DM_test
DM_train = xgb.DMatrix(data=X_train,label=y_train)
DM_test =  xgb.DMatrix(data=X_test,label=y_test)

# Create the parameter dictionary: params
params = {"booster":"gblinear", "objective":"reg:squarederror"}

# Train the model: xg_reg
xg_reg = xgb.train(params = params, dtrain=DM_train, num_boost_round=5)

# Predict the labels of the test set: preds
preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))

print('Prediction: %.3f' % preds[0])


<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:blue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:blue;">

</p>
</div>

# LightGBM

LightGBM is a decision tree-based gradient boosting framework that improves model efficiency while reducing memory utilisation.

It employs two innovative techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which address the drawbacks of the histogram-based approach employed in most GBDT (Gradient Boosting Decision Tree) frameworks. Together, GOSS and EFB give LightGBM its characteristic properties: they make the model run efficiently and give it an advantage over competing GBDT frameworks.

(For further details our recommended read is by LightGBM, linked below)
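As a rough illustration of the GOSS idea, the sketch below keeps the rows with the largest absolute gradients, randomly samples some of the remaining rows, and up-weights that sample by (1 - a) / b so the small-gradient rows are not under-represented. This is a simplified NumPy sketch and not LightGBM's actual implementation: the function name `goss_sample` is hypothetical, and the `top_rate` / `other_rate` names and values are borrowed from LightGBM's parameters as an assumption.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of the rows kept by a GOSS-style sampler."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Sort rows by absolute gradient, largest first, and keep the top slice.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]

    # Randomly sample (without replacement) from the remaining small-gradient rows.
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    # Up-weight the sampled small-gradient rows to compensate for the subsampling.
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, other_idx]), weights

# Stand-in gradients, for illustration only.
grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), w[0], w[-1])
```

With these settings only 300 of the 1,000 rows are used to grow the next tree, which is where the speed-up comes from.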


Let's try it out! 🤗

In [None]:
from lightgbm import LGBMRegressor


# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

LGBM = LGBMRegressor()

# Fit the regressor to the training set only, so the test score is not inflated by leakage
LGBM.fit(X_train, y_train)

LGBM_confidence = LGBM.score(X_test, y_test)
LGBM_confidence


<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:coral;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:yellow;">

</p>
</div>

# Support Vector Regressor 

Support vector regression (SVR) is the regression form of the support vector machine, a method that can be used for both classification and regression problems.

SVR is distinguished by the use of kernels, sparse solutions, and VC control of the margin and the number of support vectors.

(For further details our recommended read is by Awad & Khanna, linked below)
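The "sparse solutions" property can be seen on a toy problem: only the training points on or outside the epsilon-tube around the fit become support vectors. The sketch below uses a synthetic noisy sine curve, and the `C` and `epsilon` values are assumptions picked for the illustration.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve, for illustration only.
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 2 * np.pi, size=200)).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.05, size=200)

# epsilon sets the half-width of the tube around the fit in which errors are ignored.
model = SVR(kernel="rbf", C=1.0, epsilon=0.2)
model.fit(X_demo, y_demo)

print(f"{len(model.support_)} of {len(X_demo)} training points are support vectors")
```

Shrinking `epsilon` narrows the tube, so more points fall outside it and the solution becomes less sparse.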


Let's give it a go! 💪


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
svr = SVR()
svr.fit(X_train, y_train)
svr_confidence = svr.score(X_test, y_test)
svr_confidence


<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:yellow;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:yellow;">

</p>
</div>

# Model Comparison ✨

Let's compare each ML model's performance 🔍
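The scores compared here come from each estimator's `.score` method, which for regressors following the scikit-learn API returns the coefficient of determination R² rather than a classification accuracy. As a minimal sketch, R² can be computed by hand and checked against `sklearn.metrics.r2_score`; the helper name `r_squared` and the toy numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy targets and predictions, for illustration only.
y_true = np.array([0.1, -0.3, 0.5, 0.0, -0.2])
y_pred = np.array([0.0, -0.1, 0.4, 0.1, -0.3])

print(r_squared(y_true, y_pred), r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean of the targets, and negative values mean worse than that baseline.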

In [None]:
# Compare performance of each model

names = ['Linear Regression', 'Random Forest', 'Ridge', 'SVR', 'XGBoost', 'LGBM']
columns = ['model', 'accuracy']
scores = [lr_confidence, rf_confidence, rg_confidence, svr_confidence, XGB_confidence, LGBM_confidence]
alg_vs_score = pd.DataFrame([[x, y] for x, y in zip(names, scores)], columns = columns)

In [None]:
import plotly.express as px

fig = px.bar(alg_vs_score, y='accuracy', x='model',
             title="Performance of Different Models", color="model", hover_name="accuracy",
             labels={'accuracy': 'R² score'},  # .score() returns R² for regressors
             color_discrete_sequence=px.colors.qualitative.Pastel
             )

fig.show()

In our run, LightGBM performed the best! 🥇⭐


<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:lightgreen;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:lightgreen;">

</p>
</div>

# Fine Tuning LightGBM Hyperparameters


Let's fine tune to see if we can improve the score! 🧠

### Tuning Number of Trees

An important hyperparameter for the LightGBM ensemble algorithm is the number of decision trees used in the ensemble.

Recall that decision trees are added to the model sequentially in an effort to correct and improve upon the predictions made by prior trees. As such, more trees are often better.

The number of trees can be set via the “n_estimators” argument and defaults to 100.

The example below explores the effect of the number of trees with values between 10 and 5,000.

In [None]:
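# Rebuild the feature matrix for tuning (same feature columns as in Data Prep)
df = train
scaler = StandardScaler()
# row_id is a string identifier, so it is dropped along with the other id columns and the target
X = np.array(df.drop(['row_id', 'time_id', 'investment_id', 'target'], axis=1))
X = scaler.fit_transform(X)
y = np.array(df['target'])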

In [None]:
X.shape

In [None]:
y.shape

In [None]:

# explore the effect of the number of trees on lightgbm performance
from numpy import mean
from numpy import std
from matplotlib import pyplot
from sklearn.model_selection import KFold, cross_val_score

# get a list of models to evaluate
def get_models():
    models = dict()
    trees = [10, 50, 100, 500, 1000, 5000]
    for n in trees:
        models[str(n)] = LGBMRegressor(n_estimators=n)
    return models
 
# evaluate a given model using cross-validation
def evaluate_model(model):
    # shuffle=True is needed for random_state to be accepted by KFold
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    return scores
 
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

### Tuning Tree Depth

Varying the depth of each tree added to the ensemble is another important hyperparameter for gradient boosting.

The tree depth controls how specialized each tree is to the training dataset: how general or overfit it might be. Trees are preferred that are not too shallow and general (like AdaBoost) and not too deep and specialized (like bootstrap aggregation).

Gradient boosting generally performs well with trees that have a modest depth, finding a balance between skill and generality.

Tree depth is controlled via the “max_depth” argument, which defaults to -1 (no limit) because LightGBM's default mechanism for controlling tree complexity is the number of leaf nodes.

There are two main ways to control tree complexity: the maximum depth of the trees and the maximum number of terminal nodes (leaves) in the tree. In this case we are exploring the tree depth, so we also need to increase the number of leaves to support deeper trees by setting the “num_leaves” argument.

The example below explores tree depths between 1 and 10 and the effect on model performance.
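The reason for pairing “max_depth” with “num_leaves” is that a binary tree of depth d has at most 2^d leaves, so a low leaf limit would cap the tree before it reaches the requested depth. The sketch below checks that bound using scikit-learn's `DecisionTreeRegressor` as a stand-in (an assumption made to keep the example small; LightGBM grows trees leaf-wise, but the same bound applies), on synthetic data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Pure-noise data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))
y_demo = rng.normal(size=2000)

leaf_counts = {}
for depth in range(1, 6):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_demo, y_demo)
    leaf_counts[depth] = tree.get_n_leaves()
    print(f"max_depth={depth}: {leaf_counts[depth]} leaves (upper bound {2 ** depth})")
```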

In [None]:
# get a list of models to evaluate
def get_models():
    models = dict()
    for i in range(1,11):
        models[str(i)] = LGBMRegressor(max_depth=i, num_leaves=2**i)
    return models
 
# evaluate a given model using cross-validation
def evaluate_model(model):
    # shuffle=True is needed for random_state to be accepted by KFold
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    return scores
 

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

### Tuning Learning Rate

Learning rate controls the amount of contribution that each model has on the ensemble prediction.

Smaller rates may require more decision trees in the ensemble.

The learning rate can be controlled via the “learning_rate” argument and defaults to 0.1.

The example below explores the learning rate and compares the effect of values between 0.0001 and 1.0.
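To see why smaller rates may need more trees, here is a minimal hand-rolled gradient booster for squared error in which each new tree's prediction is scaled by the learning rate before being added to the ensemble. It is a sketch under stated assumptions, not LightGBM's algorithm: the helper name `boost`, the use of scikit-learn's `DecisionTreeRegressor`, and the synthetic data are all chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, learning_rate, n_trees, max_depth=2):
    """Fit a tiny squared-error gradient booster; return the training MSE after each tree."""
    prediction = np.full(len(y), y.mean())
    history = []
    for _ in range(n_trees):
        residual = y - prediction  # negative gradient of the squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(X, residual)
        # Each tree contributes only a learning_rate-sized step towards the residual.
        prediction = prediction + learning_rate * tree.predict(X)
        history.append(np.mean((y - prediction) ** 2))
    return history

# Noisy sine curve, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=500)

for rate in [0.01, 0.1, 1.0]:
    mse = boost(X_demo, y_demo, learning_rate=rate, n_trees=50)
    print(f"learning_rate={rate}: training MSE after 50 trees = {mse[-1]:.4f}")
```

With the same budget of 50 trees, the smallest rate leaves the largest training error, so it would need many more trees to catch up. Note that this only tracks training error; a large rate can still generalise worse.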

In [None]:
# get a list of models to evaluate
def get_models():
    models = dict()
    rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
    for r in rates:
        key = '%.4f' % r
        models[key] = LGBMRegressor(learning_rate=r)
    return models
 
# evaluate a given model using cross-validation
def evaluate_model(model):
    # shuffle=True is needed for random_state to be accepted by KFold
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    return scores
 

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

### Tuning Boosting Type

A feature of LightGBM is that it supports a number of different boosting algorithms, referred to as boosting types.

The boosting type can be specified via the “boosting_type” argument, which takes a string. The options include:

* ‘gbdt’: Gradient Boosting Decision Tree (GBDT).
* ‘dart’: Dropouts meet Multiple Additive Regression Trees (DART).
* ‘goss’: Gradient-based One-Side Sampling (GOSS).

The default is GBDT, which is the classical gradient boosting algorithm.

In [None]:


# get a list of models to evaluate
def get_models():
    models = dict()
    types = ['gbdt', 'dart', 'goss']
    for t in types:
        models[t] = LGBMRegressor(boosting_type=t)
    return models
 

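# evaluate a given model using cross-validation
def evaluate_model(model):
    # shuffle=True is needed for random_state to be accepted by KFold
    cv = KFold(n_splits=10, shuffle=True, random_state=1)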
    scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    return scores
 

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()


<div style="color:turquoise;
           display:fill;
           border-radius:5px;
           background-color:deeppink;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:turquoise;">

</p>
</div>

# Conclusion

Thanks for reading this notebook! We wish you the best in the competition 💥

Please give it an upvote if you found it a useful insight into the ML models you could use and into LGBM finetuning.

Thanks 👍

<div style="color:turquoise;
           display:fill;
           border-radius:5px;
           background-color:royalblue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:turquoise;">

</p>
</div>

# Recommended Reads


Here are some links to reading material we found useful and would recommend: 
- https://www.geeksforgeeks.org/lightgbm-light-gradient-boosting-machine/
- https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
- https://machinelearningmastery.com/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost/
- https://machinelearningmastery.com/light-gradient-boosted-machine-lightgbm-ensemble/
- http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
- https://link.springer.com/chapter/10.1007/978-1-4302-5990-9_4
