# Tokyo Stock Exchange Prediction with CatBoost
In this notebook, I will build a Tokyo Stock Exchange prediction model using CatBoost. To make it easy to get started, I only use the stock prices and secondary stock prices data for training.

# Index Market Analysis

Success in any financial market requires one to identify solid investments. When a stock or derivative is undervalued, it makes sense to buy. If it's overvalued, perhaps it's time to sell. While these finance decisions were historically made manually by professionals, technology has ushered in new opportunities for retail investors. Data scientists, specifically, may be interested to explore quantitative trading, where decisions are executed programmatically based on predictions from trained models.

There are plenty of existing quantitative trading efforts used to analyze financial markets and formulate investment strategies. To create and execute such a strategy requires both historical and real-time data, which is difficult to obtain especially for retail investors. This competition will provide financial data for the Japanese market, allowing retail investors to analyze the market to the fullest extent.

Japan Exchange Group, Inc. (JPX) is a holding company operating one of the largest stock exchanges in the world, Tokyo Stock Exchange (TSE), and derivatives exchanges Osaka Exchange (OSE) and Tokyo Commodity Exchange (TOCOM). JPX is hosting this competition and is supported by AI technology company AlpacaJapan Co., Ltd.

This competition will compare your models against real future returns after the training phase is complete. The competition will involve building portfolios from the stocks eligible for predictions (around 2,000 stocks). Specifically, each participant ranks the stocks from highest to lowest expected returns and is evaluated on the difference in returns between the top and bottom 200 stocks. You'll have access to financial data from the Japanese market, such as stock information and historical stock prices to train and test your model.

All winning models will be made public so that other participants can learn from the outstanding models. Excellent models also may increase the interest in the market among retail investors, including those who want to practice quantitative trading. At the same time, you'll gain your own insights into programmatic investment methods and portfolio analysis―and you may even discover you have an affinity for the Japanese market.

# Evaluation

Submissions are evaluated on the Sharpe Ratio of the daily spread returns. You will need to rank each stock active on a given day. The returns for a single day treat the 200 highest (e.g. 0 to 199) ranked stocks as purchased and the lowest (e.g. 1999 to 1800) ranked 200 stocks as shorted. The stocks are then weighted based on their ranks and the total returns for the portfolio are calculated assuming the stocks were purchased the next day and sold the day after that. You can find a python implementation of the metric here.
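
The metric described above can be sketched roughly as follows. This is a minimal illustration, not the official implementation: the function names, the linear weighting from `top_weight` down to 1, and the use of the sample standard deviation are all assumptions. It expects a frame with `Date`, `Rank`, and `Target` columns and at least `portfolio_size` stocks per day.

```python
import numpy as np
import pandas as pd


def daily_spread_return(day, portfolio_size=200, top_weight=2.0):
    """Weighted return of the top-ranked stocks minus that of the bottom-ranked ones."""
    targets = day.sort_values("Rank")["Target"].to_numpy()
    # Assumed weighting: linear from top_weight (most extreme rank) down to 1.
    weights = np.linspace(top_weight, 1.0, portfolio_size)
    long_return = np.average(targets[:portfolio_size], weights=weights)
    # Reverse so the lowest-ranked stock receives the largest weight.
    short_return = np.average(targets[::-1][:portfolio_size], weights=weights)
    return long_return - short_return


def spread_return_sharpe(df, portfolio_size=200, top_weight=2.0):
    """Mean of the daily spread returns divided by their standard deviation."""
    daily = np.array([
        daily_spread_return(day, portfolio_size, top_weight)
        for _, day in df.groupby("Date")
    ])
    return daily.mean() / daily.std(ddof=1)


# Toy data: two days, four stocks each, and a one-stock long/short portfolio.
toy = pd.DataFrame({
    "Date": ["2022-01-04"] * 4 + ["2022-01-05"] * 4,
    "Rank": [0, 1, 2, 3, 0, 1, 2, 3],
    "Target": [0.03, 0.01, 0.00, -0.01, 0.02, 0.01, 0.00, 0.00],
})
toy_sharpe = spread_return_sharpe(toy, portfolio_size=1)
```

Because the Sharpe ratio is a mean divided by a standard deviation, any constant rescaling of the daily returns cancels out, so the exact normalisation of the weights does not change the score.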

You must submit to this competition using the provided python time-series API, which ensures that models do not peek forward in time. To use the API, follow this template in Kaggle Notebooks:

# Code Tester

    import numpy as np
    import jpx_tokyo_market_prediction
    env = jpx_tokyo_market_prediction.make_env()   # initialize the environment
    iter_test = env.iter_test()    # an iterator which loops over the test files
    for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
        sample_prediction['Rank'] = np.arange(len(sample_prediction))  # make your predictions here
        env.predict(sample_prediction)   # register your predictions

You will get an error if you:

Use ranks that are below zero or greater than or equal to the number of stocks for a given date.

Submit any duplicated ranks.

Change the order of the rows.
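
A small local check for the first two rank conditions might look like the sketch below. The helper name `find_rank_problems` is hypothetical and is not part of the competition API; it inspects the rank values for a single date and does not verify row order.

```python
import numpy as np


def find_rank_problems(ranks):
    """Return a list of problems; an empty list means the ranks look valid."""
    ranks = np.asarray(ranks)
    n = len(ranks)
    problems = []
    # Valid ranks lie in the range 0 .. n-1 for a date with n stocks.
    if n and (ranks.min() < 0 or ranks.max() >= n):
        problems.append("rank out of range")
    # Every rank may be used only once.
    if len(np.unique(ranks)) != n:
        problems.append("duplicated ranks")
    return problems
```

Calling it on the `Rank` column before `env.predict` would surface these two problems early, at the cost of one extra pass over the ranks.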

# Timeline

This is a forecasting competition with an active training phase and a second period where models will be run against real market data.

## Training Timeline

April 4, 2022 - Start Date

June 28, 2022 - Entry deadline. You must accept the competition rules before this date in order to compete.

June 28, 2022 - Team Merger deadline. This is the last day participants may join or merge teams.

July 5, 2022 - Final submission deadline.

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

# Code Requirements

Submissions to this competition must be made through Notebooks. In order for the "Submit" button to be active after a commit, the following conditions must be met:

CPU Notebook <= 9 hours run-time

GPU Notebook <= 9 hours run-time

Internet access disabled

Freely & publicly available external data is allowed, including pre-trained models

Submission file must be named submission.csv. The API will generate this submission file for you.

Please see the Code Competition FAQ for more information on how to submit, and review the code debugging doc if you encounter submission errors.

In [89]:
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor

In [90]:
class Config:
    dataset_path = "../input/jpx-tokyo-stock-exchange-prediction/"

### Loading data

In [91]:
stock_list = pd.read_csv(f"{Config.dataset_path}stock_list.csv")
stock_list.head()

Unnamed: 0,SecuritiesCode,EffectiveDate,Name,Section/Products,NewMarketSegment,33SectorCode,33SectorName,17SectorCode,17SectorName,NewIndexSeriesSizeCode,NewIndexSeriesSize,TradeDate,Close,IssuedShares,MarketCapitalization,Universe0
0,1301,20211230,"KYOKUYO CO.,LTD.",First Section (Domestic),Prime Market,50,"Fishery, Agriculture and Forestry",1,FOODS,7,TOPIX Small 2,20211230.0,3080.0,10928280.0,33659110000.0,True
1,1305,20211230,Daiwa ETF-TOPIX,ETFs/ ETNs,,-,-,-,-,-,-,20211230.0,2097.0,3634636000.0,7621831000000.0,False
2,1306,20211230,NEXT FUNDS TOPIX Exchange Traded Fund,ETFs/ ETNs,,-,-,-,-,-,-,20211230.0,2073.5,7917718000.0,16417390000000.0,False
3,1308,20211230,Nikko Exchange Traded Index Fund TOPIX,ETFs/ ETNs,,-,-,-,-,-,-,20211230.0,2053.0,3736943000.0,7671945000000.0,False
4,1309,20211230,NEXT FUNDS ChinaAMC SSE50 Index Exchange Trade...,ETFs/ ETNs,,-,-,-,-,-,-,20211230.0,44280.0,72632.0,3216145000.0,False


In [92]:
trades = pd.read_csv(f"{Config.dataset_path}train_files/trades.csv")
trades.tail()

Unnamed: 0,Date,StartDate,EndDate,Section,TotalSales,TotalPurchases,TotalTotal,TotalBalance,ProprietarySales,ProprietaryPurchases,...,CityBKsRegionalBKsEtcTotal,CityBKsRegionalBKsEtcBalance,TrustBanksSales,TrustBanksPurchases,TrustBanksTotal,TrustBanksBalance,OtherFinancialInstitutionsSales,OtherFinancialInstitutionsPurchases,OtherFinancialInstitutionsTotal,OtherFinancialInstitutionsBalance
1707,2021-12-01,,,,,,,,,,...,,,,,,,,,,
1708,2021-12-02,2021-11-22,2021-11-26,Growth Market (Mothers/JASDAQ),1143466000.0,1143923000.0,2287389000.0,456677.0,36639190.0,34960680.0,...,396230.0,-275608.0,6696755.0,6886122.0,13582877.0,189367.0,234653.0,298525.0,533178.0,63872.0
1709,2021-12-02,2021-11-22,2021-11-26,Prime Market (First Section),11383430000.0,11376210000.0,22759640000.0,-7214179.0,1499660000.0,1230944000.0,...,35957940.0,-17510292.0,254580089.0,261919512.0,516499601.0,7339423.0,11959898.0,16368287.0,28328185.0,4408389.0
1710,2021-12-02,2021-11-22,2021-11-26,Standard Market (Second Section),106996900.0,107503600.0,214500400.0,506702.0,2811025.0,3273163.0,...,42127.0,-42127.0,438928.0,243817.0,682745.0,-195111.0,60291.0,6985.0,67276.0,-53306.0
1711,2021-12-03,,,,,,,,,,...,,,,,,,,,,


In [93]:
stock_prices = pd.read_csv(f"{Config.dataset_path}train_files/stock_prices.csv")
stock_prices.head()

Unnamed: 0,RowId,Date,SecuritiesCode,Open,High,Low,Close,Volume,AdjustmentFactor,ExpectedDividend,SupervisionFlag,Target
0,20170104_1301,2017-01-04,1301,2734.0,2755.0,2730.0,2742.0,31400,1.0,,False,0.00073
1,20170104_1332,2017-01-04,1332,568.0,576.0,563.0,571.0,2798500,1.0,,False,0.012324
2,20170104_1333,2017-01-04,1333,3150.0,3210.0,3140.0,3210.0,270800,1.0,,False,0.006154
3,20170104_1376,2017-01-04,1376,1510.0,1550.0,1510.0,1550.0,11300,1.0,,False,0.011053
4,20170104_1377,2017-01-04,1377,3270.0,3350.0,3270.0,3330.0,150800,1.0,,False,0.003026


In [94]:
financials = pd.read_csv(f"{Config.dataset_path}train_files/financials.csv")
financials.head()



Unnamed: 0,DisclosureNumber,DateCode,Date,SecuritiesCode,DisclosedDate,DisclosedTime,DisclosedUnixTime,TypeOfDocument,CurrentPeriodEndDate,TypeOfCurrentPeriod,...,ForecastEarningsPerShare,ApplyingOfSpecificAccountingOfTheQuarterlyFinancialStatements,MaterialChangesInSubsidiaries,ChangesBasedOnRevisionsOfAccountingStandard,ChangesOtherThanOnesBasedOnRevisionsOfAccountingStandard,ChangesInAccountingEstimates,RetrospectiveRestatement,NumberOfIssuedAndOutstandingSharesAtTheEndOfFiscalYearIncludingTreasuryStock,NumberOfTreasuryStockAtTheEndOfFiscalYear,AverageNumberOfShares
0,20161210000000.0,20170104_2753,2017-01-04,2753.0,2017-01-04,07:30:00,1483483000.0,3QFinancialStatements_Consolidated_JP,2016-12-31,3Q,...,319.76,,False,True,False,False,False,6848800.0,－,6848800.0
1,20170100000000.0,20170104_3353,2017-01-04,3353.0,2017-01-04,15:00:00,1483510000.0,3QFinancialStatements_Consolidated_JP,2016-11-30,3Q,...,485.36,,False,True,False,False,False,2035000.0,118917,1916083.0
2,20161230000000.0,20170104_4575,2017-01-04,4575.0,2017-01-04,12:00:00,1483499000.0,ForecastRevision,2016-12-31,2Q,...,-93.11,,,,,,,,,
3,20170100000000.0,20170105_2659,2017-01-05,2659.0,2017-01-05,15:00:00,1483596000.0,3QFinancialStatements_Consolidated_JP,2016-11-30,3Q,...,285.05,,False,True,False,False,False,31981654.0,18257,31963405.0
4,20170110000000.0,20170105_3050,2017-01-05,3050.0,2017-01-05,15:30:00,1483598000.0,ForecastRevision,2017-02-28,FY,...,,,,,,,,,,


In [None]:
options = pd.read_csv(f"{Config.dataset_path}train_files/options.csv")
options.head()



In [None]:
secondary_stock_prices = pd.read_csv(f"{Config.dataset_path}train_files/secondary_stock_prices.csv")
secondary_stock_prices.head()

## Feature Engineering

In [None]:
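def feature_engineering(df):
    """Expand Date into calendar features and drop the identifier columns."""
    df = df.copy()  # leave the caller's dataframe untouched
    df["Date"] = pd.to_datetime(df["Date"])
    df["year"] = df["Date"].dt.year
    df["month"] = df["Date"].dt.month
    df["day"] = df["Date"].dt.day
    df["dayofweek"] = df["Date"].dt.dayofweek
    # Date carries no time component in these files, so hour is always 0.
    df["hour"] = df["Date"].dt.hour
    return df.drop(columns=["Date", "RowId"])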

In [None]:
stock_prices = feature_engineering(stock_prices)
stock_prices.head()

In [None]:
secondary_stock_prices = feature_engineering(secondary_stock_prices)
secondary_stock_prices.head()

In [None]:
first_target = stock_prices.pop("Target")
first_target.fillna(0, inplace=True)
second_target = secondary_stock_prices.pop("Target")
second_target.fillna(0, inplace=True)

## Train Validation Split
I will keep the last 10% of the data as a hold-out set.

In [None]:
validation_split = 0.1
# The two datasets have different lengths, so each one needs its own split index.
first_split_index = int(len(stock_prices) * (1 - validation_split))
second_split_index = int(len(secondary_stock_prices) * (1 - validation_split))

first_X_train = stock_prices.iloc[:first_split_index]
first_X_val = stock_prices.iloc[first_split_index:]
first_y_train = first_target.iloc[:first_split_index]
first_y_val = first_target.iloc[first_split_index:]

second_X_train = secondary_stock_prices.iloc[:second_split_index]
second_X_val = secondary_stock_prices.iloc[second_split_index:]
second_y_train = second_target.iloc[:second_split_index]
second_y_val = second_target.iloc[second_split_index:]
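
A row-based split may cut through the middle of a trading day, so the same date can end up in both the training and the hold-out set. One alternative is to split on a date boundary, sketched below. The helper name `split_masks_by_date` is hypothetical, and the masks would have to be computed from the `Date` column before it is dropped in feature engineering.

```python
import pandas as pd


def split_masks_by_date(dates, validation_split=0.1):
    """Return (train_mask, val_mask) with the most recent dates held out."""
    dates = pd.to_datetime(pd.Series(dates)).reset_index(drop=True)
    unique_dates = dates.drop_duplicates().sort_values()
    # Hold out at least one full date, rounding the fraction down otherwise.
    n_val_dates = max(1, int(len(unique_dates) * validation_split))
    cutoff = unique_dates.iloc[-n_val_dates]
    val_mask = dates >= cutoff
    return ~val_mask, val_mask


# Toy example: ten trading days with two rows each.
toy_dates = pd.Series(pd.date_range("2022-01-03", periods=10, freq="D").repeat(2))
train_mask, val_mask = split_masks_by_date(toy_dates)
```

The returned masks are positional, so they can be applied with `.to_numpy()` to any frame that has the same row order as the original `Date` column.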

## Modeling

In [None]:
params = {
    'task_type' : 'GPU',
    'verbose' : 1000,
    "cat_features": ["SecuritiesCode"]
}
model1 = CatBoostRegressor(**params)
model1.fit(first_X_train, first_y_train, eval_set=(first_X_val, first_y_val))

In [None]:
model2 = CatBoostRegressor(**params)
model2.fit(second_X_train, second_y_train, eval_set=(second_X_val, second_y_val))

## Submission

In [None]:
import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
counter = 0
# The API will deliver six dataframes in this specific order:
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    if counter == 0:
        print(prices.head())
        print(options.head())
        print(financials.head())
        print(trades.head())
        print(secondary_prices.head())
        print(sample_prediction.head())
    codes = list(sample_prediction["SecuritiesCode"])
    prices = feature_engineering(prices)
    secondary_prices = feature_engineering(secondary_prices)
    y_pred = model1.predict(prices).reshape(-1)
    prediction_dict = {str(code): target for code, target in zip(codes, y_pred)}
    y_pred2 = model2.predict(secondary_prices)
    # Add the secondary model's prediction for any code that is also in the submission.
    for code, target in zip(secondary_prices["SecuritiesCode"].astype(str), y_pred2):
        if code in prediction_dict:
            prediction_dict[code] += target
    scores = np.array(list(prediction_dict.values()))
    # argsort alone gives the sort order; applying it twice gives each row's rank
    # (0 = highest predicted return).
    ranks = np.argsort(np.argsort(-scores))
    sample_prediction['Rank'] = ranks
    env.predict(sample_prediction)
    counter += 1
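
Turning prediction scores into the 0-based ranks the API expects can be sketched as below. `np.argsort` returns the order of the row indices rather than the rank of each row, so the ordering is inverted to obtain one rank per row. The helper name `scores_to_ranks` is hypothetical.

```python
import numpy as np


def scores_to_ranks(scores):
    """Rank 0 goes to the highest score, rank n-1 to the lowest."""
    scores = np.asarray(scores, dtype=float)
    # order[k] is the index of the row with the k-th highest score.
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=int)
    # Invert the permutation: the row at order[k] receives rank k.
    ranks[order] = np.arange(len(scores))
    return ranks


example_ranks = scores_to_ranks([0.1, 0.5, -0.2, 0.3])
```

The result stays aligned with the input rows, so it can be assigned directly to the `Rank` column without reordering the submission frame.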