# Ubiquant Market Prediction 
## Comparing ML Techniques and LightGBM Finetuning! ⚡


<div style="color:turquoise;
           display:fill;
           border-radius:5px;
           background-color:aquamarine;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:turquoise;">

</p>
</div>

# Introduction

This notebook aims to share initial analysis on which technique may get you a nice score in the Ubiquant Market Prediction Competition. 🍀

It does this by trying out different models and comparing their results! 

The models compared are:


* Linear Regression
* Random Forest Regressor
* Ridge
* XGBoost
* LightGBM
* Support Vector Regressor


This notebook has three sections: Import, Model Comparison and LightGBM Finetuning (LightGBM showed the best performance in our analysis).

Here we go! 😀


<div style="color:yellow;
           display:fill;
           border-radius:5px;
           background-color:chartreuse;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:yellow;">

</p>
</div>

# Import

First we need to import!

In [None]:
# Import libraries
import pickle
import math
import time
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error

warnings.filterwarnings("ignore")

In [None]:
#Importing pickle files
with open('../input/ubiquant-how-to-make-pickle-file/train.pickle', 'rb') as f:
    train = pickle.load(f)

In [None]:
train.head()

In [None]:
train.info()

Please note that we imported only the first 10,000 rows, as a quick way to compare techniques and share the results.

As the competition progresses, further work could include trying different import methods.

That said, let's see how the different models fared!

<div style="color:turquoise;
           display:fill;
           border-radius:5px;
           background-color:turquoise;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:turquoise;">

</p>
</div>

# Data Prep

In this section we compare the performance of different techniques.

First we need to do some data prep for the models.

In [None]:
# # Data Prep for Models

# df = train

# scaler = StandardScaler()
# X = np.array(df.drop(['row_id', 'time_id', 'investment_id', 'target'], axis = 1))
# scaler.fit(X)
# X = scaler.transform(X)

# y = np.array(df['target'])

# print(X.shape)
# print(y.shape)

In [None]:
lim  = int(train.shape[0]*0.75) # the row number at which we split the train & test dataset (here we take the first 75% of the data-df)
x_train = train.iloc[0:lim,4:] # select x_train as the first 75% of the features dataset
x_test   = train.iloc[lim:,4:] # select x_test as the last 25% of the features dataset
y_train = train.target.iloc[0:lim] # select y_train as the first 75% of the target dataset
y_test   = train.target.iloc[lim:] # select y_test as the last 25% of the target dataset

print(np.shape(x_train), np.shape(x_test), np.shape(y_train), np.shape(y_test))

In [None]:
x_train.head()

In [None]:
y_train.head()

That is the initial prep complete!


<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:blue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:blue;">

</p>
</div>

# LightGBM

LightGBM is a decision-tree-based gradient boosting framework that improves model efficiency while reducing memory utilisation.

It employs two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which address the drawbacks of the histogram-based approach employed in most GBDT (Gradient Boosting Decision Tree) frameworks. Together, GOSS and EFB give LightGBM its characteristic properties: they make the model run efficiently and give it an advantage over competing GBDT frameworks.

(For further details, our recommended reads on LightGBM are linked below.)
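To make the GOSS idea a little more concrete, here is a minimal NumPy sketch of the sampling step as it is usually described: keep the rows with the largest absolute gradients, randomly sample some of the remaining rows, and up-weight those sampled rows. The function `goss_sample` is a hypothetical helper written for illustration, not LightGBM's internal implementation; the argument names `top_rate` and `other_rate` mirror the LightGBM parameters of the same name.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) picked by a GOSS-style sampling step (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Sort rows by absolute gradient, largest first.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]      # always kept: these rows are still poorly fitted
    rest_idx = order[n_top:]     # small-gradient rows: only a random subset is kept

    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    # Up-weight the sampled small-gradient rows to compensate for the rows dropped.
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, sampled_idx]), weights

# Toy gradients standing in for the per-row gradients of a loss function.
gradients = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(gradients)
print(len(idx), w[0], w[-1])
```

With these settings only 30% of the rows are used to grow the next tree, which is where the speed-up comes from.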


Let's try it out!🤗

**Defaults (Base Params)**

In [None]:
model=lgb.LGBMRegressor(random_state=0, num_threads=4)
# Look at the parameters used by our current model
print('Parameters currently in use:\n')
model.get_params()

In [None]:
# 'rmse' is a regression metric ('logloss' is for classification);
# log_evaluation prints the metric every 50 iterations
model.fit(x_train,y_train,eval_set=[(x_test,y_test),(x_train,y_train)],eval_metric='rmse',callbacks=[lgb.log_evaluation(period=50)])

In [None]:
# For regressors, score() returns R^2 rather than accuracy
print('Training R^2 {:.4f}'.format(model.score(x_train,y_train)))
print('Testing R^2 {:.4f}'.format(model.score(x_test,y_test)))

In [None]:
lgb.plot_metric(model)
plt.show()
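A note on reading the scores: for scikit-learn-style regressors such as `LGBMRegressor`, `score()` returns the coefficient of determination (R²), not a classification accuracy. The sketch below computes R² by hand on made-up numbers; `r_squared` is a hypothetical helper for illustration only.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets, not competition data.
y_true = np.array([0.1, -0.2, 0.4, 0.0])

perfect = r_squared(y_true, y_true)                                  # predictions match exactly
mean_only = r_squared(y_true, np.full_like(y_true, y_true.mean()))   # always predict the mean
worse = r_squared(y_true, -y_true)                                   # predictions with the wrong sign

print(perfect, mean_only, worse)
```

A model that always predicts the mean scores 0 and a model that does worse than the mean scores below 0, so small or negative values on noisy market data are not unusual.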

# Tuning Parameters

- Trees = [10, 50, 100 (default), 500, 1000, 5000]
- Tree depth = 1 to 10
- Learning rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
- Boosting types = ['gbdt', 'dart', 'goss']

**Number of trees / n_estimators=600**

In [None]:
model=lgb.LGBMRegressor(random_state=0, n_estimators=600, num_threads=4)
# Look at the parameters used by our current model
print('Parameters currently in use:\n')
model.get_params()

In [None]:
# 'rmse' is a regression metric ('logloss' is for classification)
model.fit(x_train,y_train,eval_set=[(x_test,y_test),(x_train,y_train)],eval_metric='rmse',callbacks=[lgb.log_evaluation(period=50)])

#objective='rmse', boosting_type='gbdt', num_leaves=100, n_jobs=-1, learning_rate=0.1, feature_fraction=0.8, bagging_fraction=0.8, num_threads=2)

In [None]:
# For regressors, score() returns R^2 rather than accuracy
print('Training R^2 {:.4f}'.format(model.score(x_train,y_train)))
print('Testing R^2 {:.4f}'.format(model.score(x_test,y_test)))
lgb.plot_metric(model)
plt.show()

**Boosting Type = goss**

In [None]:
model=lgb.LGBMRegressor(random_state=0, boosting_type='goss', num_threads=4)
# Look at the parameters used by our current model
print('Parameters currently in use:\n')
model.get_params()

In [None]:
# 'rmse' is a regression metric ('logloss' is for classification)
model.fit(x_train,y_train,eval_set=[(x_test,y_test),(x_train,y_train)],eval_metric='rmse',callbacks=[lgb.log_evaluation(period=50)])

#objective='rmse', boosting_type='gbdt', num_leaves=100, n_jobs=-1, learning_rate=0.1, feature_fraction=0.8, bagging_fraction=0.8, num_threads=2)

In [None]:
# For regressors, score() returns R^2 rather than accuracy
print('Training R^2 {:.4f}'.format(model.score(x_train,y_train)))
print('Testing R^2 {:.4f}'.format(model.score(x_test,y_test)))
lgb.plot_metric(model)
plt.show()

**Learning rate = 1.0**

In [None]:
model=lgb.LGBMRegressor(random_state=0, learning_rate=1.0, num_threads=4)
# Look at the parameters used by our current model
print('Parameters currently in use:\n')
model.get_params()

In [None]:
# 'rmse' is a regression metric ('logloss' is for classification)
model.fit(x_train,y_train,eval_set=[(x_test,y_test),(x_train,y_train)],eval_metric='rmse',callbacks=[lgb.log_evaluation(period=50)])

#objective='rmse', boosting_type='gbdt', num_leaves=100, n_jobs=-1, learning_rate=0.1, feature_fraction=0.8, bagging_fraction=0.8, num_threads=2)

In [None]:
# For regressors, score() returns R^2 rather than accuracy
print('Training R^2 {:.4f}'.format(model.score(x_train,y_train)))
print('Testing R^2 {:.4f}'.format(model.score(x_test,y_test)))
lgb.plot_metric(model)
plt.show()

**Learning rate = 0.0001**

In [None]:
model=lgb.LGBMRegressor(random_state=0, learning_rate=0.0001, num_threads=4)
# Look at the parameters used by our current model
print('Parameters currently in use:\n')
model.get_params()

In [None]:
# 'rmse' is a regression metric ('logloss' is for classification)
model.fit(x_train,y_train,eval_set=[(x_test,y_test),(x_train,y_train)],eval_metric='rmse',callbacks=[lgb.log_evaluation(period=50)])

#objective='rmse', boosting_type='gbdt', num_leaves=100, n_jobs=-1, learning_rate=0.1, feature_fraction=0.8, bagging_fraction=0.8, num_threads=2)

In [None]:
# For regressors, score() returns R^2 rather than accuracy
print('Training R^2 {:.4f}'.format(model.score(x_train,y_train)))
print('Testing R^2 {:.4f}'.format(model.score(x_test,y_test)))
lgb.plot_metric(model)
plt.show()

**Learning rate = 0.05**

In [None]:
model=lgb.LGBMRegressor(random_state=0, learning_rate=0.05, num_threads=4)
# Look at the parameters used by our current model
print('Parameters currently in use:\n')
model.get_params()

In [None]:
# 'rmse' is a regression metric ('logloss' is for classification)
model.fit(x_train,y_train,eval_set=[(x_test,y_test),(x_train,y_train)],eval_metric='rmse',callbacks=[lgb.log_evaluation(period=50)])

#objective='rmse', boosting_type='gbdt', num_leaves=100, n_jobs=-1, learning_rate=0.1, feature_fraction=0.8, bagging_fraction=0.8, num_threads=2)

In [None]:
# For regressors, score() returns R^2 rather than accuracy
print('Training R^2 {:.4f}'.format(model.score(x_train,y_train)))
print('Testing R^2 {:.4f}'.format(model.score(x_test,y_test)))
lgb.plot_metric(model)
plt.show()

**Max depth = 20**

In [None]:
model=lgb.LGBMRegressor(random_state=0, max_depth=20, num_threads=4)
# Look at the parameters used by our current model
print('Parameters currently in use:\n')
model.get_params()

In [None]:
# 'rmse' is a regression metric ('logloss' is for classification)
model.fit(x_train,y_train,eval_set=[(x_test,y_test),(x_train,y_train)],eval_metric='rmse',callbacks=[lgb.log_evaluation(period=50)])

#objective='rmse', boosting_type='gbdt', num_leaves=100, n_jobs=-1, learning_rate=0.1, feature_fraction=0.8, bagging_fraction=0.8, num_threads=2)

In [None]:
# For regressors, score() returns R^2 rather than accuracy
print('Training R^2 {:.4f}'.format(model.score(x_train,y_train)))
print('Testing R^2 {:.4f}'.format(model.score(x_test,y_test)))
lgb.plot_metric(model)
plt.show()

**Max bin = 2000**

In [None]:
model=lgb.LGBMRegressor(random_state=0, max_bin=2000, num_threads=4)
# Look at the parameters used by our current model
print('Parameters currently in use:\n')
model.get_params()

In [None]:
# 'rmse' is a regression metric ('logloss' is for classification)
model.fit(x_train,y_train,eval_set=[(x_test,y_test),(x_train,y_train)],eval_metric='rmse',callbacks=[lgb.log_evaluation(period=50)])

#objective='rmse', boosting_type='gbdt', num_leaves=100, n_jobs=-1, learning_rate=0.1, feature_fraction=0.8, bagging_fraction=0.8, num_threads=2)

In [None]:
# For regressors, score() returns R^2 rather than accuracy
print('Training R^2 {:.4f}'.format(model.score(x_train,y_train)))
print('Testing R^2 {:.4f}'.format(model.score(x_test,y_test)))
lgb.plot_metric(model)
plt.show()

In [None]:
# n_jobs and num_threads are aliases, so only num_threads is set here
model=lgb.LGBMRegressor(random_state=0, objective='rmse', boosting_type='gbdt', num_leaves=100,
                        learning_rate=0.05, feature_fraction=0.8, num_threads=4)
# Look at the parameters used by our current model
print('Parameters currently in use:\n')
model.get_params()

In [None]:
# 'rmse' is a regression metric ('logloss' is for classification)
model.fit(x_train,y_train,eval_set=[(x_test,y_test),(x_train,y_train)],eval_metric='rmse',callbacks=[lgb.log_evaluation(period=50)])

#objective='rmse', boosting_type='gbdt', num_leaves=100, n_jobs=-1, learning_rate=0.1, feature_fraction=0.8, bagging_fraction=0.8, num_threads=2)

In [None]:
# score() returns R^2 on the held-out rows
LGBM_confidence = model.score(x_test, y_test)
LGBM_confidence

In [None]:
print('Training R^2 {:.4f}'.format(model.score(x_train,y_train)))
print('Testing R^2 {:.4f}'.format(model.score(x_test,y_test)))


<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:yellow;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:yellow;">

</p>
</div>

# Fine Tuning LightGBM Hyperparameters


Let's fine-tune to see if we can improve the score! 🧠

### Tuning Number of Trees

An important hyperparameter for the LightGBM ensemble algorithm is the number of decision trees used in the ensemble.

Recall that decision trees are added to the model sequentially in an effort to correct and improve upon the predictions made by prior trees. As such, more trees are often better.
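The sequential "each tree corrects the previous ones" idea can be sketched in a few lines of NumPy. This is a toy illustration on synthetic data using one-split trees (stumps) and squared error; `fit_stump` and `predict_stump` are hypothetical helpers and not how LightGBM grows its trees.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:   # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Synthetic 1-D regression problem (not competition data).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())    # start from a constant model
mse_history = []
for _ in range(50):
    residual = y - prediction             # what the ensemble still gets wrong
    stump = fit_stump(x, residual)        # the new tree is fitted to those errors
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

The training error keeps falling as trees are added; whether the held-out score also improves is exactly what the cross-validation in this section checks.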

The number of trees can be set via the “n_estimators” argument and defaults to 100.

The example below explores the effect of the number of trees, using values of 10, 50, 100 and 200.

In [None]:
# df = train
# scaler = StandardScaler()
# X = np.array(df.drop(['target'], 1))
# scaler.fit(X)
# X = scaler.transform(X)
# X = np.array(df.drop(['target'], 1))
# y = np.array(df['target'])

In [None]:
# X.shape

In [None]:
# y.shape



In [None]:

# explore the effect of the number of trees on lightgbm performance
from numpy import mean
from numpy import std
from matplotlib import pyplot
from sklearn.model_selection import KFold, cross_val_score

#model=lgb.LGBMRegressor(random_state=0)

# get a list of models to evaluate
def get_models():
    models = dict()
    trees = [10, 50, 100, 200]
    for n in trees:
        models[str(n)] = lgb.LGBMRegressor(n_estimators=n, num_threads=4)
    return models
 
# evaluate a given model using cross-validation
def evaluate_model(model):
    # KFold only accepts random_state when shuffle=True
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(model, x_train, y_train, cv=cv, n_jobs=-1)
    return scores
 
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

### Tuning Tree Depth

Varying the depth of each tree added to the ensemble is another important hyperparameter for gradient boosting.

The tree depth controls how specialized each tree is to the training dataset: how general or overfit it might be. Trees are preferred that are not too shallow and general (like AdaBoost) and not too deep and specialized (like bootstrap aggregation).

Gradient boosting generally performs well with trees that have a modest depth, finding a balance between skill and generality.

Tree depth is controlled via the “max_depth” argument. It is unrestricted by default, because LightGBM's default mechanism for controlling tree complexity is the number of leaf nodes.

There are two main ways to control tree complexity: the max depth of the trees and the maximum number of terminal nodes (leaves) in the tree. In this case we are exploring the max depth, so we also need to increase the number of leaves to support deeper trees by setting the “num_leaves” argument. A binary tree of depth d can have at most 2^d leaves, which is why the code sets num_leaves to 2**i.
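As a quick worked example of how the two limits interact, the sketch below lists the maximum number of leaves a binary tree can hold at each depth and compares it with LightGBM's default `num_leaves` of 31. The helper `max_leaves_for_depth` is a hypothetical name used only for this illustration.

```python
DEFAULT_NUM_LEAVES = 31  # LightGBM's documented default for num_leaves

def max_leaves_for_depth(depth):
    """A binary tree of the given depth can have at most 2**depth leaves."""
    return 2 ** depth

for depth in range(1, 11):
    capacity = max_leaves_for_depth(depth)
    # Whichever limit is smaller is the one that actually restricts the tree.
    limited_by = "max_depth" if capacity <= DEFAULT_NUM_LEAVES else "num_leaves"
    print(f"max_depth={depth:2d} allows up to {capacity:4d} leaves; "
          f"with num_leaves=31 the binding limit is {limited_by}")
```

From depth 5 onwards the default leaf cap is the tighter constraint, so raising `max_depth` alone would change very little.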

The example below explores tree depths between 1 and 10 and the effect on model performance.

In [None]:
# get a list of models to evaluate
def get_models():
    models = dict()
    for i in range(1,11):
        # num_leaves=2**i lets a tree of depth i be fully grown
        models[str(i)] = lgb.LGBMRegressor(max_depth=i, num_leaves=2**i)
    return models
 
# evaluate a given model using cross-validation
def evaluate_model(model):
    # KFold only accepts random_state when shuffle=True
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(model, x_train, y_train, cv=cv, n_jobs=-1)
    return scores
 

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

### Tuning Learning Rate

Learning rate controls the amount of contribution that each model has on the ensemble prediction.

Smaller rates may require more decision trees in the ensemble.

The learning rate can be controlled via the “learning_rate” argument and defaults to 0.1.
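To see why smaller rates may need more trees, consider an idealised case (an assumption made purely for illustration) in which every new tree predicts the current residual perfectly. Each tree then removes a fraction `learning_rate` of the remaining error, so after n trees a fraction (1 - learning_rate)^n is left. The helper name `remaining_error_fraction` is hypothetical.

```python
def remaining_error_fraction(learning_rate, n_trees):
    """Fraction of the initial residual left after n_trees idealised boosting rounds."""
    return (1.0 - learning_rate) ** n_trees

# 100 trees is the default n_estimators mentioned earlier in this notebook.
for rate in [0.0001, 0.001, 0.01, 0.1, 1.0]:
    fraction = remaining_error_fraction(rate, n_trees=100)
    print(f"learning_rate={rate:<6} residual left after 100 trees: {fraction:.4f}")
```

Real trees are far from perfect and a rate of 1.0 tends to overfit, but the arithmetic shows why very small rates barely move the predictions within 100 trees.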

The example below explores the learning rate and compares the effect of values between 0.0001 and 1.0.

In [None]:
# get a list of models to evaluate
def get_models():
    models = dict()
    rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
    for r in rates:
        key = '%.4f' % r
        models[key] = lgb.LGBMRegressor(learning_rate=r)
    return models
 
# evaluate a given model using cross-validation
def evaluate_model(model):
    # KFold only accepts random_state when shuffle=True
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    # X and y are only defined in the commented-out prep cell, so use the training split
    scores = cross_val_score(model, x_train, y_train, cv=cv, n_jobs=-1)
    return scores
 

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

### Tuning Boosting Type

A feature of LightGBM is that it supports a number of different boosting algorithms, referred to as boosting types.

The boosting type can be specified via the “boosting_type” argument, which takes a string. The options include:

- ‘gbdt’: Gradient Boosting Decision Tree (GBDT).
- ‘dart’: Dropouts meet Multiple Additive Regression Trees (DART).
- ‘goss’: Gradient-based One-Side Sampling (GOSS).

The default is GBDT, which is the classical gradient boosting algorithm.

In [None]:


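# get a list of models to evaluate
def get_models():
    models = dict()
    types = ['gbdt', 'dart', 'goss']
    for t in types:
        models[t] = lgb.LGBMRegressor(boosting_type=t)
    return models
 
# evaluate a given model using cross-validation
def evaluate_model(model):
    # KFold only accepts random_state when shuffle=True
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    # X and y are only defined in the commented-out prep cell, so use the training split
    scores = cross_val_score(model, x_train, y_train, cv=cv, n_jobs=-1)
    return scores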
 

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()


<div style="color:turquoise;
           display:fill;
           border-radius:5px;
           background-color:deeppink;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:turquoise;">

</p>
</div>

# Conclusion

Thanks for reading this notebook! We wish you the best in the competition 💥

Please give it an upvote if you found it a useful insight into ML models you could use, and into LGBM finetuning.

Thanks 👍

<div style="color:turquoise;
           display:fill;
           border-radius:5px;
           background-color:royalblue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:turquoise;">

</p>
</div>

# Recommended Reads


Here are some links to reading material we found useful and would recommend: 
- https://www.geeksforgeeks.org/lightgbm-light-gradient-boosting-machine/
- https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
- https://machinelearningmastery.com/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost/
- https://machinelearningmastery.com/light-gradient-boosted-machine-lightgbm-ensemble/
- http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
- https://link.springer.com/chapter/10.1007/978-1-4302-5990-9_4
