<a href="https://www.kaggle.com/code/nataliasz/quick-lgbm-no-leak?scriptVersionId=96351189" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# JPX Tokyo Stock Exchange Prediction <span style="color:DarkCyan"> with LGBM</span>

Thank you for viewing my notebook. I hope you enjoy it 📊<br>
Don't hesitate to leave feedback 😉

# Table of Contents
1. [Overview](#Overview)
2. [Load JPX data](#Load-JPX-data)
3. [Preprocess](#Preprocess)
4. [Train the LGBM Model](#Train-the-LGBM-Model)
5. [Predict Test Data](#Predict-Test-Data)
6. [Submit](#Submit)

# Overview

In this notebook, I will build a <span style="color:DarkCyan">LightGBM (Light Gradient Boosting Machine)</span> model for the [JPX Tokyo Stock Exchange Prediction Competition](https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction).

### Quick introduction to <span style="color:DarkCyan">LGBM</span>

[LightGBM](https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc) is a gradient boosting framework built on decision trees.<br>
<span style="color:DarkCyan">LightGBM</span> grows trees leaf-wise ("vertically"), while many other gradient boosting implementations (e.g. XGBoost by default) grow trees level-wise ("horizontally").<br>
<span style="color:DarkCyan">LGBM</span> chooses the leaf with the largest loss reduction (max delta loss) to split next. With the number of leaves held fixed, leaf-wise growth tends to achieve a lower loss than level-wise growth.
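To make the leaf-wise versus level-wise distinction concrete, here is a toy sketch. It is not LightGBM's implementation: the `GAINS` table is a made-up set of loss reductions for the nodes of a small binary tree, and both helper functions are hypothetical. It only shows why, under the same leaf budget, always splitting the best leaf can remove more loss than splitting level by level.

```python
import heapq

# Hypothetical loss reduction ("gain") obtained by splitting each node of a
# small binary tree. Node i has children 2*i + 1 and 2*i + 2.
GAINS = {0: 10.0, 1: 6.0, 2: 1.0, 3: 5.0, 4: 0.5, 5: 0.4, 6: 0.3}


def leaf_wise(gains, max_leaves):
    """Best-first growth: always split the current leaf with the largest gain."""
    total, leaves = 0.0, 1
    heap = [(-gains[0], 0)]  # max-heap emulated with negated gains
    while heap and leaves < max_leaves:
        neg_gain, node = heapq.heappop(heap)
        total += -neg_gain
        leaves += 1  # splitting turns one leaf into two
        for child in (2 * node + 1, 2 * node + 2):
            if child in gains:
                heapq.heappush(heap, (-gains[child], child))
    return total


def level_wise(gains, max_leaves):
    """Breadth-first growth: split nodes level by level, left to right."""
    total, leaves = 0.0, 1
    queue = [0]
    while queue and leaves < max_leaves:
        node = queue.pop(0)
        total += gains[node]
        leaves += 1
        for child in (2 * node + 1, 2 * node + 2):
            if child in gains:
                queue.append(child)
    return total


leaf_wise_gain = leaf_wise(GAINS, max_leaves=4)    # splits nodes 0, 1, 3
level_wise_gain = level_wise(GAINS, max_leaves=4)  # splits nodes 0, 1, 2
print(leaf_wise_gain, level_wise_gain)
```

With four leaves allowed, the leaf-wise strategy follows the high-gain branch (nodes 0, 1, 3), while the level-wise strategy spends one split on the low-gain node 2.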

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import lightgbm as lgb
import jpx_tokyo_market_prediction

pd.set_option('display.max_columns', 100)

In [2]:
lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',  # gbdt - traditional Gradient Boosting Decision Tree
    'objective': 'regression',  # L2 loss
    'metric': 'rmse',
    'learning_rate': 0.05,
    'lambda_l1': 0.5,  # L1 regularization
    'lambda_l2': 0.5,  # L2 regularization
    'num_leaves': 10,
    'feature_fraction': 0.5,  # randomly select 50% of the features before training each tree
    'bagging_fraction': 0.5,  # randomly select 50% of the data without resampling
    'bagging_freq': 5,  # perform bagging every 5 iterations
    'min_child_samples': 10,
    'seed': 42
}

Load datasets

# Load JPX data

In [3]:
file_path = '/kaggle/input/jpx-tokyo-stock-exchange-prediction/'
prices = pd.read_csv(Path(file_path, 'train_files/stock_prices.csv'))
stock_list = pd.read_csv(Path(file_path, 'stock_list.csv'))

In [4]:
prices.head()

| | RowId | Date | SecuritiesCode | Open | High | Low | Close | Volume | AdjustmentFactor | ExpectedDividend | SupervisionFlag | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20170104_1301 | 2017-01-04 | 1301 | 2734.0 | 2755.0 | 2730.0 | 2742.0 | 31400 | 1.0 | NaN | False | 0.00073 |
| 1 | 20170104_1332 | 2017-01-04 | 1332 | 568.0 | 576.0 | 563.0 | 571.0 | 2798500 | 1.0 | NaN | False | 0.012324 |
| 2 | 20170104_1333 | 2017-01-04 | 1333 | 3150.0 | 3210.0 | 3140.0 | 3210.0 | 270800 | 1.0 | NaN | False | 0.006154 |
| 3 | 20170104_1376 | 2017-01-04 | 1376 | 1510.0 | 1550.0 | 1510.0 | 1550.0 | 11300 | 1.0 | NaN | False | 0.011053 |
| 4 | 20170104_1377 | 2017-01-04 | 1377 | 3270.0 | 3350.0 | 3270.0 | 3330.0 | 150800 | 1.0 | NaN | False | 0.003026 |


Display information about the main dataset

In [5]:
prices.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2332531 entries, 0 to 2332530
Data columns (total 12 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   RowId             2332531 non-null  object 
 1   Date              2332531 non-null  object 
 2   SecuritiesCode    2332531 non-null  int64  
 3   Open              2324923 non-null  float64
 4   High              2324923 non-null  float64
 5   Low               2324923 non-null  float64
 6   Close             2324923 non-null  float64
 7   Volume            2332531 non-null  int64  
 8   AdjustmentFactor  2332531 non-null  float64
 9   ExpectedDividend  18865 non-null    float64
 10  SupervisionFlag   2332531 non-null  bool   
 11  Target            2332293 non-null  float64
dtypes: bool(1), float64(7), int64(2), object(2)
memory usage: 198.0+ MB


# Preprocess

Handle the date column: add a `date_rank` variable that counts how many days have passed since the first date in the dataset

In [6]:
prices['Date'] = pd.to_datetime(prices['Date'])
min_date = prices['Date'].min()
prices['date_rank'] = (prices['Date'] - min_date).dt.days
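As a quick illustration of what `date_rank` contains, the sketch below applies the same calculation to a tiny made-up frame (the `toy` DataFrame and its dates are illustrative, not competition data). Note that the difference is measured in calendar days, so weekends and holidays leave gaps in the sequence.

```python
import pandas as pd

# Toy data: three dates, with a gap between the second and third
toy = pd.DataFrame({"Date": ["2017-01-04", "2017-01-05", "2017-01-10"]})
toy["Date"] = pd.to_datetime(toy["Date"])

# Same calculation as in the notebook: days elapsed since the earliest date
toy_min_date = toy["Date"].min()
toy["date_rank"] = (toy["Date"] - toy_min_date).dt.days

date_ranks = toy["date_rank"].tolist()
print(date_ranks)  # [0, 1, 6]
```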

Choose features to use as independent variables in the model

In [7]:
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'date_rank', 'SecuritiesCode']

In [8]:
prices = prices.dropna(subset=features)

In [9]:
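# Separate the target and standardize it to zero mean and unit variance
target = prices.pop('Target')
scaler = StandardScaler()
target = scaler.fit_transform(np.array(target).reshape(-1, 1)).ravel()
target = pd.Series(target, index=prices.index)  # restore the row index for later lookups
target_mean = target.mean()  # used later to fill missing predictions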
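For reference, the standardization applied to the target can be reproduced by hand. The sketch below uses plain NumPy on made-up values (`toy_target` is illustrative). `StandardScaler` divides by the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np

# Made-up target values standing in for the real returns
toy_target = np.array([0.01, -0.02, 0.03, 0.00])

mean = toy_target.mean()
std = toy_target.std()  # population standard deviation (ddof=0)

# Standardize: subtract the mean, divide by the standard deviation
standardized = (toy_target - mean) / std

# Invert the transform to get back to the original scale
recovered = standardized * std + mean

print(standardized.mean(), standardized.std())
```

Because the transform is a positive linear rescaling, it does not change the ordering of the values, so ranks computed from standardized predictions are the same as ranks computed on the original scale.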

In [10]:
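# Random 80/20 split over all rows (rows from the same date can land in both sets)
train_f, valid_f = train_test_split(prices[features], test_size=0.2)
train_idx = train_f.index
valid_idx = valid_f.index
# Wrap the features and their matching targets in LightGBM datasets
lgb_train = lgb.Dataset(train_f, target[train_idx])
lgb_valid = lgb.Dataset(valid_f, target[valid_idx], reference=lgb_train)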
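The split above is random, so the validation set contains dates that also appear in the training set. A possible alternative, which this notebook does not use, is to hold out the most recent dates for validation. The sketch below is one way to do that; `time_based_split` is a hypothetical helper and the `toy` frame is made-up data.

```python
import pandas as pd


def time_based_split(df, date_col="date_rank", valid_fraction=0.2):
    """Hypothetical helper: put the most recent dates in the validation set."""
    unique_dates = sorted(df[date_col].unique())
    n_valid = max(1, int(len(unique_dates) * valid_fraction))
    cutoff = unique_dates[-n_valid]  # first date that belongs to validation
    train_part = df[df[date_col] < cutoff]
    valid_part = df[df[date_col] >= cutoff]
    return train_part, valid_part


# Toy data: five dates, two rows per date
toy = pd.DataFrame({
    "date_rank": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "Close": [10.0, 20.0, 11.0, 19.0, 12.0, 21.0, 13.0, 22.0, 14.0, 23.0],
})
toy_train, toy_valid = time_based_split(toy)
print(len(toy_train), len(toy_valid))
```

With this kind of split, every validation row is later in time than every training row, which is closer to how the model is used at prediction time.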

In [11]:
train_f.head()

| | Open | High | Low | Close | Volume | date_rank | SecuritiesCode |
|---|---|---|---|---|---|---|---|
| 743128 | 1120.0 | 1120.0 | 1117.0 | 1117.0 | 2100 | 580 | 9028 |
| 253555 | 1344.0 | 1344.0 | 1314.0 | 1332.0 | 140600 | 197 | 5218 |
| 1474654 | 949.0 | 949.0 | 894.0 | 894.0 | 48800 | 1154 | 2715 |
| 2288696 | 206.0 | 208.0 | 194.0 | 198.0 | 4530600 | 1763 | 2315 |
| 691699 | 1380.0 | 1401.0 | 1365.0 | 1398.0 | 79100 | 540 | 9880 |


# Train the <span style="color:DarkCyan">LGBM</span> Model

In [12]:
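model = lgb.train(
    lgbm_params,
    lgb_train,
    valid_sets=[lgb_train, lgb_valid],
    valid_names=['Train', 'Valid'],
    num_boost_round=2000,
    # Callbacks replace the `early_stopping_rounds` / `verbose_eval` arguments,
    # which were removed from `lgb.train` in LightGBM 4.0.
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),  # stop after 100 rounds without improvement
        lgb.log_evaluation(period=100),  # log metrics every 100 rounds
    ],
)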



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1785
[LightGBM] [Info] Number of data points in the train set: 1859938, number of used features: 7
[LightGBM] [Info] Start training from score 0.000030
Training until validation scores don't improve for 100 rounds
[100]	Train's rmse: 0.985736	Valid's rmse: 0.987824
[200]	Train's rmse: 0.982165	Valid's rmse: 0.984322
[300]	Train's rmse: 0.979792	Valid's rmse: 0.982049
[400]	Train's rmse: 0.978205	Valid's rmse: 0.980587
[500]	Train's rmse: 0.977021	Valid's rmse: 0.97955
[600]	Train's rmse: 0.976111	Valid's rmse: 0.978805
[700]	Train's rmse: 0.97536	Valid's rmse: 0.978255
[800]	Train's rmse: 0.974745	Valid's rmse: 0.977827
[900]	Train's rmse: 0.97426	Valid's rmse: 0.977534
[1000]	Train's rmse: 0.973821	Valid's rmse: 0.977265
[1100]	Train's rmse: 0.973363	Valid's rmse: 0.97701
[1200]	Train's rmse: 0.973015	Valid's rmse: 0.97688
[1300]	Train's rmse: 0.972617	Valid's rmse: 0.976669
[1400]	Train's rmse: 0.9

# Predict Test Data

In [13]:
test_prices = pd.read_csv(Path(file_path, 'example_test_files/stock_prices.csv'))
test_prices['date_rank'] = (pd.to_datetime(test_prices['Date']) - min_date).dt.days

In [14]:
preds = model.predict(test_prices[features], num_iteration=model.best_iteration)
preds

array([0.5475503 , 0.47230382, 0.53618104, ..., 0.44426708, 0.44437173,
       0.43425699])

In [15]:
pd.Series(preds).fillna(target_mean).rank(ascending=False, method='first').astype(int)

0       1445
1       2466
2       1636
3        554
4        600
        ... 
3995    2741
3996    2716
3997    2734
3998    2731
3999    2842
Length: 4000, dtype: int64
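Note that `rank` returns ranks starting at 1, whereas the submission loop in this notebook assigns ranks starting at 0 with `np.arange`. The small sketch below, on made-up predictions (`toy_preds` is illustrative), shows how `method='first'` breaks ties by order of appearance and how to shift the result to a zero-based rank.

```python
import pandas as pd

# Made-up predictions, including a tie between positions 1 and 3
toy_preds = pd.Series([0.2, 0.9, 0.5, 0.9])

# Highest prediction gets rank 1; ties are broken by order of appearance
one_based = toy_preds.rank(ascending=False, method="first").astype(int)

# Shift so that the highest prediction gets rank 0
zero_based = one_based - 1

print(one_based.tolist())   # [4, 1, 3, 2]
print(zero_based.tolist())  # [3, 0, 2, 1]
```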

# Submit

In [16]:
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    prices['date_rank'] = (pd.to_datetime(prices['Date']) - min_date).dt.days
    preds = model.predict(prices[features], num_iteration=model.best_iteration)
    preds = np.squeeze(preds)
    print(preds)
    # Assumes `prices` and `sample_prediction` list the securities in the same order
    sample_prediction["Prediction"] = preds
    # The highest prediction gets rank 0
    sample_prediction = sample_prediction.sort_values(by="Prediction", ascending=False)
    sample_prediction["Rank"] = np.arange(len(sample_prediction))
    sample_prediction = sample_prediction.sort_values(by="SecuritiesCode", ascending=True)
    submission = sample_prediction[["Date", "SecuritiesCode", "Rank"]]
    env.predict(submission)

This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.
[0.5475503  0.47230382 0.53618104 ... 0.42796954 0.51768106 0.3847571 ]
[0.5457729  0.50494581 0.53612962 ... 0.44426708 0.44437173 0.43425699]
