<a href="https://www.kaggle.com/code/nataliasz/quick-lgbm-no-leak?scriptVersionId=97453334" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# JPX Tokyo Stock Exchange Prediction <span style="color:DarkCyan"> with LGBM</span>

Thank you for viewing my notebook. I hope you enjoy it 📊<br>
Don't hesitate to leave feedback 😉

# Table of Contents
1. [Overview](#Overview)
2. [Load JPX data](#Load-JPX-data)
3. [Preprocess](#Preprocess)
4. [Train the LGBM Model](#Train-the-LGBM-Model)
5. [Predict Test Data](#Predict-Test-Data)
6. [Submit](#Submit)

# Overview

In this notebook, I will build a <span style="color:DarkCyan">Light Gradient Boosting Model</span> for the [JPX Tokyo Stock Exchange Prediction Competition](https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction).

### Quick introduction to <span style="color:DarkCyan">LGBM</span>

[Light GBM](https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc) is a gradient boosting framework built on decision trees.<br>
<span style="color:DarkCyan">Light GBM</span> grows trees leaf-wise (vertically), while many other gradient boosting implementations (e.g. XGBoost with its default settings) grow trees level-wise (horizontally).<br>
<span style="color:DarkCyan">LGBM</span> chooses the leaf with the maximum delta loss to split next. With the number of leaves held fixed, leaf-wise algorithms tend to achieve lower loss than level-wise algorithms.
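
To make the leaf-wise versus level-wise distinction concrete, here is a minimal sketch of how the number of leaves relates to tree depth. The helper names are hypothetical and only do arithmetic on binary trees; they do not call LightGBM. A level-wise tree fills every level before going deeper, while a leaf-wise tree with the same leaf budget can become much deeper along one branch.

```python
import math


def max_leaves_level_wise(depth: int) -> int:
    """A fully grown level-wise binary tree of this depth has 2**depth leaves."""
    return 2 ** depth


def min_depth_for_leaves(num_leaves: int) -> int:
    """Shallowest binary tree that can hold num_leaves leaves (balanced growth)."""
    return math.ceil(math.log2(num_leaves))


def max_depth_leaf_wise(num_leaves: int) -> int:
    """Deepest binary tree with num_leaves leaves: always split the newest leaf."""
    return num_leaves - 1


# Leaf budget used later in this notebook (num_leaves = 10)
num_leaves = 10
print(min_depth_for_leaves(num_leaves), max_depth_leaf_wise(num_leaves))
```

With a budget of 10 leaves, a tree can be anywhere between 4 and 9 levels deep, which is why leaf-wise models are usually controlled through `num_leaves` (and optionally `max_depth`) rather than depth alone.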

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import lightgbm as lgb
import jpx_tokyo_market_prediction

pd.set_option('display.max_columns', 100)

In [2]:
lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',  # gbdt - traditional Gradient Boosting Decision Tree
    'objective': 'regression',  # L2 loss
    'metric': 'rmse',
    'learning_rate': 0.05,
    'lambda_l1': 0.5,  # L1 regularization
    'lambda_l2': 0.5,  # L2 regularization
    'num_leaves': 10,
    'feature_fraction': 0.5,  # randomly select 50% of the features for each tree
    'bagging_fraction': 0.5,  # randomly select 50% of the rows, without resampling
    'bagging_freq': 5,  # re-draw the bagging sample every 5 iterations
    'min_child_samples': 10,
    'seed': 42
}

# Load JPX data

Load the datasets

In [3]:
file_path = '/kaggle/input/jpx-tokyo-stock-exchange-prediction/'
prices = pd.read_csv(Path(file_path, 'train_files/stock_prices.csv'))
stock_list = pd.read_csv(Path(file_path, 'stock_list.csv'))

In [4]:
prices.describe()

| | SecuritiesCode | Open | High | Low | Close | Volume | AdjustmentFactor | ExpectedDividend | Target |
|---|---|---|---|---|---|---|---|---|---|
| count | 2332531.0 | 2324923.0 | 2324923.0 | 2324923.0 | 2324923.0 | 2332531.0 | 2332531.0 | 18865.0 | 2332293.0 |
| mean | 5894.835 | 2594.511 | 2626.54 | 2561.227 | 2594.023 | 691936.6 | 1.000508 | 22.01773 | 0.0004450962 |
| std | 2404.161 | 3577.192 | 3619.363 | 3533.494 | 3576.538 | 3911256.0 | 0.0677304 | 29.882453 | 0.02339879 |
| min | 1301.0 | 14.0 | 15.0 | 13.0 | 14.0 | 0.0 | 0.1 | 0.0 | -0.5785414 |
| 25% | 3891.0 | 1022.0 | 1035.0 | 1009.0 | 1022.0 | 30300.0 | 1.0 | 5.0 | -0.01049869 |
| 50% | 6238.0 | 1812.0 | 1834.0 | 1790.0 | 1811.0 | 107100.0 | 1.0 | 15.0 | 0.0 |
| 75% | 7965.0 | 3030.0 | 3070.0 | 2995.0 | 3030.0 | 402100.0 | 1.0 | 30.0 | 0.01053159 |
| max | 9997.0 | 109950.0 | 110500.0 | 107200.0 | 109550.0 | 643654000.0 | 20.0 | 1070.0 | 1.119512 |


In [5]:
prices.head()

| | RowId | Date | SecuritiesCode | Open | High | Low | Close | Volume | AdjustmentFactor | ExpectedDividend | SupervisionFlag | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20170104_1301 | 2017-01-04 | 1301 | 2734.0 | 2755.0 | 2730.0 | 2742.0 | 31400 | 1.0 | NaN | False | 0.00073 |
| 1 | 20170104_1332 | 2017-01-04 | 1332 | 568.0 | 576.0 | 563.0 | 571.0 | 2798500 | 1.0 | NaN | False | 0.012324 |
| 2 | 20170104_1333 | 2017-01-04 | 1333 | 3150.0 | 3210.0 | 3140.0 | 3210.0 | 270800 | 1.0 | NaN | False | 0.006154 |
| 3 | 20170104_1376 | 2017-01-04 | 1376 | 1510.0 | 1550.0 | 1510.0 | 1550.0 | 11300 | 1.0 | NaN | False | 0.011053 |
| 4 | 20170104_1377 | 2017-01-04 | 1377 | 3270.0 | 3350.0 | 3270.0 | 3330.0 | 150800 | 1.0 | NaN | False | 0.003026 |


Display information about the main dataset

In [6]:
prices.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2332531 entries, 0 to 2332530
Data columns (total 12 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   RowId             2332531 non-null  object 
 1   Date              2332531 non-null  object 
 2   SecuritiesCode    2332531 non-null  int64  
 3   Open              2324923 non-null  float64
 4   High              2324923 non-null  float64
 5   Low               2324923 non-null  float64
 6   Close             2324923 non-null  float64
 7   Volume            2332531 non-null  int64  
 8   AdjustmentFactor  2332531 non-null  float64
 9   ExpectedDividend  18865 non-null    float64
 10  SupervisionFlag   2332531 non-null  bool   
 11  Target            2332293 non-null  float64
dtypes: bool(1), float64(7), int64(2), object(2)
memory usage: 198.0+ MB


# Preprocess

Add a time feature: `date_rank` is the number of calendar days elapsed since the first date in the dataset.

In [7]:
prices['Date'] = pd.to_datetime(prices['Date'])
min_date = prices['Date'].min()
prices['date_rank'] = (prices['Date'] - min_date).dt.days
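
As a quick illustration of what `date_rank` contains, the sketch below applies the same calculation to a toy frame. The frame and the `toy_min_date` name are made up for the example; only the date arithmetic mirrors the cell above.

```python
import pandas as pd

# Toy data: three dates with a weekend and a holiday gap before the last one
toy = pd.DataFrame({"Date": ["2017-01-04", "2017-01-05", "2017-01-10"]})
toy["Date"] = pd.to_datetime(toy["Date"])

# Same calculation as in the notebook: calendar days since the earliest date
toy_min_date = toy["Date"].min()
toy["date_rank"] = (toy["Date"] - toy_min_date).dt.days

print(toy["date_rank"].tolist())  # [0, 1, 6]
```

Note that the feature counts calendar days, not trading days, so weekends and holidays show up as gaps in the sequence.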

Choose features to use as independent variables in the model

In [8]:
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'date_rank', 'SecuritiesCode']

In [9]:
prices = prices.dropna(subset=features)

In [10]:
target = prices.pop('Target')
# scaler = StandardScaler()
# target = scaler.fit_transform(np.array(target).reshape(-1, 1)).ravel()
# target = pd.Series(target, index = prices.index)
target_mean = target.mean()

In [11]:
target.describe()

count    2.324923e+06
mean     4.262053e-04
std      2.339197e-02
min     -5.785414e-01
25%     -1.052632e-02
50%      0.000000e+00
75%      1.052632e-02
max      6.182380e-01
Name: Target, dtype: float64

In [12]:
train_f, valid_f = train_test_split(prices[features], test_size=0.2, shuffle=False)
train_idx = train_f.index
valid_idx = valid_f.index
lgb_train = lgb.Dataset(train_f, target[train_idx])
lgb_valid = lgb.Dataset(valid_f, target[valid_idx], reference=lgb_train)
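
With `shuffle=False`, the split above keeps the last 20% of rows for validation, which is chronological only because the rows are ordered by date, and the boundary may fall in the middle of a trading day. A possible alternative, sketched here under the assumption that a clean date boundary is wanted, is to split on a cutoff date. The `split_by_date` helper and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def split_by_date(df, date_col="Date", valid_frac=0.2):
    """Put the most recent valid_frac of distinct dates into the validation set."""
    dates = df[date_col].drop_duplicates().sort_values().reset_index(drop=True)
    n_valid = max(1, int(round(len(dates) * valid_frac)))
    cutoff = dates.iloc[-n_valid]  # first date that belongs to validation
    train = df[df[date_col] < cutoff]
    valid = df[df[date_col] >= cutoff]
    return train, valid


# Toy data: 5 trading days, 2 securities per day
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2017-01-04"] * 2 + ["2017-01-05"] * 2 + ["2017-01-06"] * 2
        + ["2017-01-10"] * 2 + ["2017-01-11"] * 2
    ),
    "SecuritiesCode": [1301, 1332] * 5,
})

train_part, valid_part = split_by_date(toy)
print(len(train_part), len(valid_part))
```

Every date then lands entirely in one of the two sets, so no trading day is shared between training and validation.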

In [13]:
train_f.head()

| | Open | High | Low | Close | Volume | date_rank | SecuritiesCode |
|---|---|---|---|---|---|---|---|
| 0 | 2734.0 | 2755.0 | 2730.0 | 2742.0 | 31400 | 0 | 1301 |
| 1 | 568.0 | 576.0 | 563.0 | 571.0 | 2798500 | 0 | 1332 |
| 2 | 3150.0 | 3210.0 | 3140.0 | 3210.0 | 270800 | 0 | 1333 |
| 3 | 1510.0 | 1550.0 | 1510.0 | 1550.0 | 11300 | 0 | 1376 |
| 4 | 3270.0 | 3350.0 | 3270.0 | 3330.0 | 150800 | 0 | 1377 |


# Train the <span style="color:DarkCyan">LGBM</span> Model

In [14]:
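model = lgb.train(
    lgbm_params,
    lgb_train,
    num_boost_round=2000,
    valid_sets=[lgb_train, lgb_valid],
    valid_names=['Train', 'Valid'],
    # Callbacks replace the early_stopping_rounds / verbose_eval arguments,
    # which were removed from lgb.train in LightGBM 4.0
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),
        lgb.log_evaluation(period=100),
    ],
)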



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1785
[LightGBM] [Info] Number of data points in the train set: 1859938, number of used features: 7
[LightGBM] [Info] Start training from score 0.000384
Training until validation scores don't improve for 100 rounds
[100]	Train's rmse: 0.0231328	Valid's rmse: 0.0219297
Early stopping, best iteration is:
[25]	Train's rmse: 0.0233947	Valid's rmse: 0.0219286
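
One way to put the validation RMSE in context is to compare it with a predictor that always outputs the training mean. The sketch below is an assumption about how such a check could look; `rmse` and `constant_baseline_rmse` are hypothetical helpers, and the toy numbers are illustrative, not results from this competition.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, the metric reported in the training log."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def constant_baseline_rmse(y_train, y_valid):
    """RMSE on y_valid when every prediction equals the mean of y_train."""
    constant = np.full(len(y_valid), np.mean(y_train))
    return rmse(y_valid, constant)


# Toy targets (made up for the example)
toy_train = [0.01, -0.01, 0.02, -0.02]
toy_valid = [0.03, -0.03]
baseline = constant_baseline_rmse(toy_train, toy_valid)
print(baseline)
```

In this notebook, `constant_baseline_rmse(target[train_idx], target[valid_idx])` would give the number to compare against the `Valid's rmse` in the log above.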


# Predict Test Data

In [15]:
test_prices = pd.read_csv(Path(file_path, 'example_test_files/stock_prices.csv'))
test_prices['date_rank'] = (pd.to_datetime(test_prices['Date']) - min_date).dt.days

In [16]:
preds = model.predict(test_prices[features], num_iteration=model.best_iteration)
preds

array([0.0005377 , 0.00071663, 0.0005377 , ..., 0.00050657, 0.00049588,
       0.0005898 ])

In [17]:
pd.Series(preds).fillna(target_mean).rank(ascending=False, method='first').astype(int)

0       2068
1        197
2       2069
3        671
4       1360
        ... 
3995     528
3996     529
3997    2986
3998    3090
3999     972
Length: 4000, dtype: int64
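
The ranks above start at 1 and run over all 4000 rows, which appear to cover two trading days of 2000 securities each. The submission loop in this notebook instead assigns ranks from 0 within a single day. A minimal sketch of doing the same per date on a multi-day frame follows; `add_daily_rank` and the toy data are hypothetical.

```python
import pandas as pd


def add_daily_rank(df, pred_col="Prediction", date_col="Date"):
    """Rank predictions within each date: 0 = highest prediction, ties by row order."""
    out = df.copy()
    ranks = out.groupby(date_col)[pred_col].rank(ascending=False, method="first")
    out["Rank"] = ranks.astype(int) - 1  # shift from 1-based to 0-based
    return out


# Toy data: two dates, three securities each, with a tie on the second date
toy = pd.DataFrame({
    "Date": ["2021-12-06"] * 3 + ["2021-12-07"] * 3,
    "SecuritiesCode": [1301, 1332, 1333] * 2,
    "Prediction": [0.1, 0.3, 0.2, 0.5, 0.5, -0.1],
})

ranked = add_daily_rank(toy)
print(ranked["Rank"].tolist())  # [2, 0, 1, 0, 1, 2]
```

Using `method="first"` guarantees that every rank within a date is unique even when predictions are tied.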

# Submit

In [18]:
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    # Rebuild the date feature relative to the first training date
    prices['date_rank'] = (pd.to_datetime(prices['Date']) - min_date).dt.days
    preds = model.predict(prices[features], num_iteration=model.best_iteration)
    preds = np.squeeze(preds)
    print(preds)
    # Assumes prices and sample_prediction list the securities in the same row order
    sample_prediction["Prediction"] = preds
    # Rank 0 goes to the highest predicted target
    sample_prediction = sample_prediction.sort_values(by="Prediction", ascending=False)
    sample_prediction["Rank"] = np.arange(len(sample_prediction))  # no hard-coded 2000
    sample_prediction = sample_prediction.sort_values(by="SecuritiesCode", ascending=True)
    # Selecting the three required columns also drops the helper Prediction column
    submission = sample_prediction[["Date", "SecuritiesCode", "Rank"]]
    env.predict(submission)

This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.
[0.0005377  0.00071663 0.0005377  ... 0.00050657 0.00049588 0.00056692]
[0.0005377  0.00071663 0.0005377  ... 0.00050657 0.00049588 0.0005898 ]
