# JPX Tokyo Stock Exchange Prediction <span style="color:DarkCyan"> with LGBM</span>

Thank you for viewing my notebook. I hope you enjoy it 📊<br>
Don't hesitate to leave feedback 😉

# Table of Contents
1. [Overview](#Overview)
2. [Load JPX data](#Load-JPX-data)
3. [Preprocess](#Preprocess)
4. [Train the LGBM Model](#Train-the-LGBM-Model)
5. [Predict Test Data](#Predict-Test-Data)
6. [Submit](#Submit)

# Overview

In this notebook, I will build a <span style="color:DarkCyan">LightGBM (Light Gradient Boosting Machine)</span> model for the [JPX Tokyo Stock Exchange Prediction Competition](https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction).

### Quick introduction to <span style="color:DarkCyan">LGBM</span>

[Light GBM](https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc) is a gradient boosting framework that uses decision trees as its base learners.<br>
<span style="color:DarkCyan">Light GBM</span> grows trees vertically (leaf-wise), while many other gradient boosting algorithms (e.g. XGBoost with its default settings) grow trees horizontally (level-wise).<br>
<span style="color:DarkCyan">LGBM</span> chooses the leaf with the maximum delta loss to split next. For a fixed number of leaves, leaf-wise algorithms tend to achieve lower loss than level-wise algorithms.
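
To make the leaf-wise vs. level-wise difference concrete, here is a minimal toy sketch. It is not LightGBM's actual implementation: the `GAIN` table is a set of made-up loss reductions, and the level-wise variant is approximated by splitting nodes in breadth-first order until the leaf budget is used up.

```python
import heapq
from collections import deque

# Hypothetical loss reduction obtained by splitting each node.
# Node i has children 2*i + 1 and 2*i + 2; unlisted nodes give no gain.
GAIN = {0: 10.0, 1: 8.0, 2: 1.0, 3: 6.0, 4: 5.0, 5: 0.5, 6: 0.2}


def gain(node):
    return GAIN.get(node, 0.0)


def grow_leaf_wise(num_leaves):
    """Always split the current leaf with the largest gain (best-first)."""
    heap = [(-gain(0), 0)]  # max-heap via negated gains
    total, leaves = 0.0, 1
    while leaves < num_leaves:
        neg_gain, node = heapq.heappop(heap)
        total += -neg_gain
        for child in (2 * node + 1, 2 * node + 2):
            heapq.heappush(heap, (-gain(child), child))
        leaves += 1  # one leaf replaced by two
    return total


def grow_level_wise(num_leaves):
    """Split nodes level by level (breadth-first order)."""
    queue = deque([0])
    total, leaves = 0.0, 1
    while leaves < num_leaves:
        node = queue.popleft()
        total += gain(node)
        queue.extend((2 * node + 1, 2 * node + 2))
        leaves += 1
    return total


print(grow_leaf_wise(4))   # 24.0
print(grow_level_wise(4))  # 19.0
```

With the same budget of four leaves, the leaf-wise tree in this toy example removes more loss because it keeps splitting along the most promising branch instead of spending a split on a low-gain node.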

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import lightgbm as lgb
import jpx_tokyo_market_prediction

pd.set_option('display.max_columns', 100)

In [None]:
lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',  # gbdt - traditional Gradient Boosting Decision Tree
    'objective': 'regression',  # L2 loss
    'metric': 'rmse',
    'learning_rate': 0.05,
    'lambda_l1': 0.5,  # L1 regularization
    'lambda_l2': 0.5,  # L2 regularization
    'num_leaves': 10,
    'feature_fraction': 0.5,  # randomly select 50% of the features before training each tree
    'bagging_fraction': 0.5,  # randomly select 50% of the rows, without resampling
    'bagging_freq': 5,  # re-draw the bagging subset every 5 iterations
    'min_child_samples': 10,
    'seed': 42
}


# Load JPX data

In [None]:
file_path = '/kaggle/input/jpx-tokyo-stock-exchange-prediction/'
prices = pd.read_csv(Path(file_path, 'train_files/stock_prices.csv'))
stock_list = pd.read_csv(Path(file_path, 'stock_list.csv'))

In [None]:
prices.describe()

In [None]:
prices.head()

Display information about the main dataset (column types and non-null counts)

In [None]:
prices.info(show_counts=True)

# Preprocess

Convert `Date` to a datetime and add `date_rank`, the number of calendar days elapsed since the first date in the dataset

In [None]:
prices['Date'] = pd.to_datetime(prices['Date'])
min_date = prices['Date'].min()
prices['date_rank'] = (prices['Date'] - min_date).dt.days
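
As a quick illustration of what `date_rank` contains, here is the same computation on a tiny made-up frame (the `toy` DataFrame and its dates are hypothetical, not taken from the competition data). Note that the feature counts calendar days, so weekends and holidays show up as gaps.

```python
import pandas as pd

# Hypothetical dates: two consecutive trading days, then a gap over a weekend/holiday
toy = pd.DataFrame({'Date': ['2017-01-04', '2017-01-05', '2017-01-10']})
toy['Date'] = pd.to_datetime(toy['Date'])

toy_min_date = toy['Date'].min()
toy['date_rank'] = (toy['Date'] - toy_min_date).dt.days

print(toy['date_rank'].tolist())  # [0, 1, 6]
```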

Choose features to use as independent variables in the model

In [None]:
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'date_rank', 'SecuritiesCode']

In [None]:
prices = prices.dropna(subset=features)

In [None]:
target = prices.pop('Target')
# scaler = StandardScaler()
# target = scaler.fit_transform(np.array(target).reshape(-1, 1)).ravel()
# target = pd.Series(target, index = prices.index)
target_mean = target.mean()

In [None]:
target.describe()

In [None]:
# Fix random_state so the split is reproducible (same value as the LightGBM seed)
train_f, valid_f = train_test_split(prices[features], test_size=0.2, shuffle=True, random_state=42)
train_idx = train_f.index
valid_idx = valid_f.index
lgb_train = lgb.Dataset(train_f, target[train_idx])
lgb_valid = lgb.Dataset(valid_f, target[valid_idx], reference=lgb_train)

In [None]:
train_f.head()
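
The split above shuffles rows across all dates. Since the data is a time series, one possible alternative is to hold out the most recent dates for validation instead. The helper below, `chronological_split`, is a hypothetical sketch of that idea and is not what this notebook uses; it is shown on a made-up frame.

```python
import pandas as pd


def chronological_split(df, date_col='date_rank', valid_fraction=0.2):
    """Hypothetical helper: keep the latest dates as the validation set."""
    unique_dates = sorted(df[date_col].unique())
    n_valid = max(1, int(len(unique_dates) * valid_fraction))
    cutoff = unique_dates[-n_valid]  # first date that goes to validation
    train_part = df[df[date_col] < cutoff]
    valid_part = df[df[date_col] >= cutoff]
    return train_part, valid_part


# Made-up data: 5 dates, 2 rows per date
toy = pd.DataFrame({'date_rank': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
                    'Close': range(10)})
train_part, valid_part = chronological_split(toy)
print(len(train_part), len(valid_part))  # 8 2
```

With this kind of split, every validation row is later in time than every training row, so the validation score is less likely to benefit from information about the future.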

# Train the <span style="color:DarkCyan">LGBM</span> Model

In [None]:
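model = lgb.train(
    lgbm_params,
    lgb_train,
    valid_sets=[lgb_train, lgb_valid],
    valid_names=['Train', 'Valid'],
    num_boost_round=3000,
    # Callbacks replace the early_stopping_rounds / verbose_eval arguments,
    # which were removed from lgb.train in LightGBM 4.0
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),
        lgb.log_evaluation(period=100),
    ],
)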

# Predict Test Data

In [None]:
test_prices = pd.read_csv(Path(file_path, 'example_test_files/stock_prices.csv'))
test_prices['date_rank'] = (pd.to_datetime(test_prices['Date']) - min_date).dt.days

In [None]:
preds =  model.predict(test_prices[features], num_iteration=model.best_iteration)
preds

In [None]:
# Preview the ranking: the highest prediction gets rank 0, as in the submission format
pd.Series(preds).fillna(target_mean).rank(ascending=False, method='first').astype(int) - 1
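
Here is a small sketch of how predictions turn into ranks, using made-up values (`toy_preds` and `toy_mean` are hypothetical stand-ins for `preds` and `target_mean`). `Series.rank` starts at 1, so 1 is subtracted to get ranks that start at 0, matching the `np.arange(0, ...)` ranks used in the Submit section.

```python
import numpy as np
import pandas as pd

toy_preds = pd.Series([0.02, np.nan, -0.01, 0.05])  # hypothetical predictions
toy_mean = 0.0  # stand-in for target_mean

ranks = (
    toy_preds.fillna(toy_mean)               # missing predictions fall back to the mean
    .rank(ascending=False, method='first')   # largest prediction gets rank 1
    .astype(int)
    - 1                                      # shift to zero-based ranks
)
print(ranks.tolist())  # [1, 2, 3, 0]
```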

# Submit

In [None]:
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    # Build the same date feature that was used for training
    prices['date_rank'] = (pd.to_datetime(prices['Date']) - min_date).dt.days
    preds = model.predict(prices[features], num_iteration=model.best_iteration)
    preds = np.squeeze(preds)
    print(preds)
    sample_prediction["Prediction"] = preds
    # Highest prediction gets rank 0
    sample_prediction = sample_prediction.sort_values(by="Prediction", ascending=False)
    # Use the actual row count instead of a hard-coded 2000
    sample_prediction["Rank"] = np.arange(len(sample_prediction))
    sample_prediction = sample_prediction.sort_values(by="SecuritiesCode", ascending=True)
    # Selecting the submission columns also leaves out the helper "Prediction" column
    submission = sample_prediction[["Date", "SecuritiesCode", "Rank"]]
    env.predict(submission)
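
Before calling `env.predict`, it can help to check that the ranks for a date are unique and run from 0 to N-1. The function below, `ranks_are_valid`, is a hypothetical helper for that check and is not part of the competition API; it is shown on a made-up submission frame.

```python
import numpy as np
import pandas as pd


def ranks_are_valid(submission: pd.DataFrame) -> bool:
    """Hypothetical check: Rank must be a permutation of 0..N-1."""
    expected = np.arange(len(submission))
    return np.array_equal(np.sort(submission['Rank'].to_numpy()), expected)


# Made-up submissions for a single date
good = pd.DataFrame({'SecuritiesCode': [1301, 1332, 1333], 'Rank': [2, 0, 1]})
bad = pd.DataFrame({'SecuritiesCode': [1301, 1332, 1333], 'Rank': [1, 1, 3]})

print(ranks_are_valid(good))  # True
print(ranks_are_valid(bad))   # False
```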