# JPX Tokyo Stock Exchange Prediction 📈

## A Simple Explanation + XGBoost Model...

Hello! The purpose of this notebook is to provide an easy-to-understand model to get started in the competition and a good framework to improve upon.

**Objectives:**
Develop a machine learning model using the following 3 architectures to understand their performance on the dataset.
1. Linear Regressor (Linear Model)
2. Gradient Boosted Trees (XGBoost)
3. Neural Network (Probably Sequence Model LSTM)

**Strategy**
1. Understand the Datasets
2. Build a baseline model to improve upon
3. Implement the model architecture described in the objectives section


**Updates**

**04/08/2022**

1. Started Notebook, loading and exploring the data.
2. Developed multiple versions of the baseline submission.

**04/09/2022**
1. Trying to implement an XGBoost model. 
2. XGBoost completed, problems with features.

**04/10/2022**
1. Submitting the predictions from the XGBoost Model to Leaderboard.
2. Trying to understand the problem with feature importance.

**05/20/2022**
1. It's been almost a month so I need to improve this model.
2. Full validation and XGBoost completion


---


**Credits**

I took a lot of inspiration and code ideas from the notebooks linked below, including an excellent explanation of how to calculate the target.

https://www.kaggle.com/code/chumajin/english-ver-easy-to-understand-the-competition

https://www.kaggle.com/code/paulorzp/mean-model-jpx/notebook?scriptVersionId=92406307

## Data Description
This dataset contains historic data for a variety of Japanese stocks and options. Your challenge is to predict the future returns of the stocks.

As historic stock prices are not confidential this will be a forecasting competition using the time series API. 
The data for the public leaderboard period is included as part of the competition dataset. Expect to see many people submitting perfect submissions for fun. Accordingly, 
the active phase public leaderboard for this competition is intended as a convenience for anyone who wants to test their code. 
The forecasting phase leaderboard will be determined using real market data gathered after the submission period closes.

# 1. Loading Libraries...

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing the Competition API
import jpx_tokyo_market_prediction

---

# 2. Setting the Notebook Configuration

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 100
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to use 4 decimal places so tables look nicer.
pd.options.display.float_format = '{:,.4f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

---

# 3. Loading the Datasets

In [None]:
%%time
sample = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/example_test_files/sample_submission.csv')

stock_prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv")
supplemental_stock_prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv")

---

# 4. Exploring the Information Loaded

## 4.1 Sample Dataset...

In [None]:
%%time
sample.head()

In [None]:
%%time
sample.describe()

In [None]:
%%time
sample.nunique()

---

## 4.2 Stock Prices Dataset

In [None]:
%%time
stock_prices.head()

In [None]:
%%time
stock_prices.describe()

---

## 4.2.1 Stock Prices, Replicating the Target, Example

In [None]:
%%time
# Calculating the Target: the rate of change between the Close of the next day (t+1) and the day after (t+2)...
stock_1301 = stock_prices[stock_prices["SecuritiesCode"] == 1301].reset_index(drop = True)
stock_1301['Close_T1'] = stock_1301['Close'].shift(-1)
stock_1301['Close_T2'] = stock_1301['Close'].shift(-2)

In [None]:
%%time
stock_1301['Target_Calculated'] = (stock_1301['Close_T2'] - stock_1301['Close_T1']) / stock_1301['Close_T1']

In [None]:
%%time
stock_1301.head()
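
To sanity-check the shift logic on something small, here is a minimal sketch with made-up closing prices. The `toy` frame and its values are illustrative assumptions, not competition data. The target on day t is the return from the close on t+1 to the close on t+2.

```python
import pandas as pd

# Hypothetical closing prices for a single stock over five trading days.
toy = pd.DataFrame({"Close": [100.0, 110.0, 121.0, 108.9, 108.9]})

# Close one and two trading days ahead of each row.
toy["Close_T1"] = toy["Close"].shift(-1)
toy["Close_T2"] = toy["Close"].shift(-2)

# Rate of change between the next two closes.
toy["Target_Calculated"] = (toy["Close_T2"] - toy["Close_T1"]) / toy["Close_T1"]

print(toy)
```

The last two rows come out as NaN because the closes one and two days ahead are not known yet, which is why the most recent rows of the real dataset cannot have a target either.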

---

## 4.2.2 Stock Prices, Replicating the Ranking, Example

In [None]:
%%time
# Calculating the Ranking...
stock_2021_12_02 = stock_prices[stock_prices["Date"] == '2021-12-02'].reset_index(drop = True)

In [None]:
%%time
stock_2021_12_02["Rank"] = stock_2021_12_02["Target"].rank(ascending = False,method = "first") - 1 
stock_2021_12_02 = stock_2021_12_02.sort_values('Rank').reset_index(drop = True)

In [None]:
%%time
stock_2021_12_02.head()

In [None]:
%%time
# Calculating daily spread of the returns
# Consider only the Top 200...

stock_2021_12_02_Top200 = stock_2021_12_02.iloc[:200 ,:].copy()  # copy so new columns are not written into a slice

In [None]:
%%time
# Calculate the Top 200 Weights...

weights = np.linspace(start = 2, stop = 1, num = 200)
stock_2021_12_02_Top200['Weights'] = weights
stock_2021_12_02_Top200['Calc_weights'] = stock_2021_12_02_Top200['Target'] * stock_2021_12_02_Top200['Weights']
Sup = stock_2021_12_02_Top200['Calc_weights'].sum() / np.mean(weights)

In [None]:
%%time
print(Sup)

In [None]:
%%time
# Calculating daily spread of the returns
# Consider only the Bottom 200...

stock_2021_12_02_Bottom200 = stock_2021_12_02.iloc[-200: ,:]
stock_2021_12_02_Bottom200 = stock_2021_12_02_Bottom200.sort_values('Rank', ascending = False).reset_index(drop = True)

In [None]:
%%time
# Calculate the Bottom 200 Weights...
stock_2021_12_02_Bottom200['Weights'] = weights
stock_2021_12_02_Bottom200['Calc_weights'] = stock_2021_12_02_Bottom200['Target'] * stock_2021_12_02_Bottom200['Weights']
Sdown = stock_2021_12_02_Bottom200['Calc_weights'].sum() / np.mean(weights)

In [None]:
%%time
print(Sdown)

In [None]:
%%time
daily_spread_return = Sup - Sdown
print(daily_spread_return)
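
The same arithmetic is easier to follow on a handful of rows. The sketch below uses a made-up day with six stocks and a portfolio size of 2 instead of 200; the `day` frame and its values are assumptions chosen so the numbers can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical single day with six stocks, already ranked by Target (0 = best).
day = pd.DataFrame({
    "Rank":   [0, 1, 2, 3, 4, 5],
    "Target": [0.05, 0.03, 0.01, -0.01, -0.02, -0.04],
})

portfolio_size = 2  # the competition uses 200
weights = np.linspace(start=2, stop=1, num=portfolio_size)  # [2.0, 1.0]

# Best-ranked stocks first for the long side, worst-ranked first for the short side.
top = day.sort_values("Rank")["Target"].to_numpy()[:portfolio_size]
bottom = day.sort_values("Rank", ascending=False)["Target"].to_numpy()[:portfolio_size]

s_up = (top * weights).sum() / weights.mean()
s_down = (bottom * weights).sum() / weights.mean()
spread = s_up - s_down
print(s_up, s_down, spread)
```

By hand: the long side is (0.05 * 2 + 0.03 * 1) / 1.5 and the short side is (-0.04 * 2 - 0.02 * 1) / 1.5, so the spread is 0.23 / 1.5, roughly 0.153. The most extreme stock on each side gets the largest weight.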

---

## 4.3 Stock Prices, Calculating the Sharpe Ratio Using Competition Host Function

In [None]:
%%time
def calc_spread_return_sharpe(df: pd.DataFrame, portfolio_size: int = 200, toprank_weight_ratio: float = 2) -> float:
    """
    Args:
        df (pd.DataFrame): predicted results
        portfolio_size (int): # of equities to buy/sell
        toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
    Returns:
        (float): sharpe ratio
    """
    def _calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
        """
        Args:
            df (pd.DataFrame): predicted results
            portfolio_size (int): # of equities to buy/sell
            toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
        Returns:
            (float): spread return
        """
        assert df['Rank'].min() == 0
        assert df['Rank'].max() == len(df['Rank']) - 1
        weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
        purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
        short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
        return purchase - short

    buf = df.groupby('Date').apply(_calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio

In [None]:
%%time
stock_prices_example = stock_prices.loc[stock_prices['Date'] >= '2021-01-01'].reset_index(drop = True)
stock_prices_example['Rank'] = stock_prices_example.groupby('Date')['Target'].rank(ascending = False, method = 'first') - 1 
stock_prices_example['Rank'] = stock_prices_example['Rank'].astype("int")

In [None]:
%%time
stock_prices_example.head()

In [None]:
%%time
calc_spread_return_sharpe(stock_prices_example, 200, 2)
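
The last step of the host function is simply the mean of the daily spread returns divided by their standard deviation. A minimal sketch with made-up daily values (the `daily_spread` series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical daily spread returns (illustrative values only).
daily_spread = pd.Series([0.01, 0.02, 0.03])

# pandas' Series.std() uses the sample standard deviation (ddof=1) by default.
sharpe = daily_spread.mean() / daily_spread.std()
print(sharpe)
```

Here the mean is 0.02 and the sample standard deviation is 0.01, so the ratio is about 2. A ranking that earns a steady spread every day scores higher than one with the same average but larger swings.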

---

# Simple Baseline Model, Last Day Known...
Let's build the simplest possible model for a timeseries dataset...

## Review the Information Available, One More Time, Because Why Not

### Sample Dataset

In [None]:
%%time
# Review the sample dataset
sample.head()

In [None]:
%%time
print(sample['Date'].min())
print(sample['Date'].max())

In [None]:
%%time
# Review the sample dataset
sample.nunique()

In [None]:
%%time
# Review the amount of information in the dataset
sample.info()

---

### Stock Prices Dataset

In [None]:
%%time
# Review the stock price dataset
stock_prices.head()

In [None]:
%%time
print(stock_prices['Date'].min())
print(stock_prices['Date'].max())

In [None]:
%%time
# Review the stock price dataset
stock_prices.nunique()

In [None]:
%%time
# Review the stock price dataset
stock_prices.info()

---

### Supplemental Stock Prices Dataset

In [None]:
%%time
# Review the supplemental stock price dataset
supplemental_stock_prices.head()

In [None]:
%%time
print(supplemental_stock_prices['Date'].min())
print(supplemental_stock_prices['Date'].max())

In [None]:
%%time
# Review the supplemental stock price dataset
supplemental_stock_prices.nunique()

In [None]:
%%time
# Review the supplemental stock price dataset
supplemental_stock_prices.info()

---

## Training the Baseline Model (The Most Important Part)
The submission file requires predictions for the following period:
* Min: 2021-12-06
* Max: 2022-02-28

Based on the competition information:

The competition closes on **July 05, 2022** and the model will be tested until **Oct 07, 2022**.<br>
That is a gap of approximately **94 days** that the model should be able to predict properly. At this point in time we only have data up to **Dec 03, 2021**; from what I have read, more data will be provided in the coming weeks.



**Strategy**

Based on the last **Date** available in the Stock Prices dataset, **Dec 03, 2021**, and the maximum date in the submission file, **Feb 28, 2022**,
we will be generating predictions up to **87 days** into the future during our analysis...

Probably a good start point to test the model performance.
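
Both gaps quoted above can be checked with plain date arithmetic. A small sketch follows; the variable names are mine, the dates are the ones quoted in the text.

```python
import pandas as pd

last_train_day = pd.Timestamp("2021-12-03")
last_sample_day = pd.Timestamp("2022-02-28")
competition_close = pd.Timestamp("2022-07-05")
forecast_end = pd.Timestamp("2022-10-07")

# Calendar-day differences between the two pairs of dates.
analysis_gap = (last_sample_day - last_train_day).days
forecast_gap = (forecast_end - competition_close).days
print(analysis_gap, forecast_gap)  # 87 94
```

These are calendar days, not trading days, so the number of rows per stock in each window is smaller.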

In [None]:
%%time
print(sample[sample['SecuritiesCode'] == 1301]['Date'].min())
print(sample[sample['SecuritiesCode'] == 1301]['Date'].max())

## First Model, Using the Last Known Value (Dec 03, 2021)
Well, I just submitted this baseline and it's quite bad.

**Public LB: -0.090**

In [None]:
%%time
sample.nunique()

In [None]:
%%time
stock_prices[stock_prices['Date'] == '2021-12-03'][['SecuritiesCode', 'Target']].nunique()

In [None]:
%%time
predictions = stock_prices[stock_prices['Date'] == '2021-12-03'][['SecuritiesCode', 'Target']]
predictions['Rank'] = predictions['Target'].rank(ascending = False,method = 'first').astype(int) - 1

In [None]:
%%time
predictions_dict = dict(zip(predictions['SecuritiesCode'],predictions['Rank']))

In [None]:
%%script false --no-raise-error

# Not active in this run
# Invoke the API to generate predictions.
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(predictions_dict)
    env.predict(sample_prediction)
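
One thing to watch with a dictionary lookup: `.map` returns NaN for any `SecuritiesCode` that is not in the dictionary, and the scoring function shown earlier asserts that ranks run from 0 to n-1. A minimal sketch of one way to handle that, assuming made-up codes and a made-up API frame (the fallback of pushing unseen codes to the end is my assumption, not part of the original baseline):

```python
import pandas as pd

# Hypothetical lookup built from the last known day (SecuritiesCode -> Rank).
predictions_dict = {1301: 0, 1332: 1, 1333: 2}

# Hypothetical API frame that contains one code the lookup has never seen.
sample_prediction = pd.DataFrame({"SecuritiesCode": [1301, 1332, 1333, 9999]})

mapped = sample_prediction["SecuritiesCode"].map(predictions_dict)

# Unseen codes come back as NaN; push them to the end, then re-rank so the
# result is again a gap-free 0..n-1 ordering.
mapped = mapped.fillna(mapped.max() + 1)
sample_prediction["Rank"] = mapped.rank(method="first").astype(int) - 1
print(sample_prediction)
```

Re-ranking after the fill also closes any gaps left by codes that are in the dictionary but absent from the day being predicted.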

---

## Second Model, Using the Last N Days Mean Stock Value.
Well, this version of the baseline is even worse than the previous-day submission.

**Public LB: -0.101**

In [None]:
%%time
NDAYS = 34 # Number of days we want to use...
# We want to use 34 days of data; with roughly 2000 stocks per day, that is approximately the last 2000 * 34 rows.

stock_prices_dates = stock_prices[stock_prices['Date'] >= stock_prices.Date.iat[-2000 * NDAYS]].reset_index(drop = True)
predictions = stock_prices_dates.groupby('SecuritiesCode')['Target'].mean().rank(ascending = False, method = 'first').astype(int) -1
predictions = predictions.reset_index(name = 'Rank')
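
Counting `2000 * NDAYS` rows from the end is only approximate, because not every day has exactly 2000 rows. A possible alternative, sketched on a made-up frame, is to select the last `NDAYS` distinct dates directly. The `toy` data below is an assumption for illustration.

```python
import pandas as pd

# Hypothetical frame: three trading days, with one stock missing on the last day.
toy = pd.DataFrame({
    "Date": ["2021-12-01", "2021-12-01", "2021-12-02", "2021-12-02", "2021-12-03"],
    "SecuritiesCode": [1301, 1332, 1301, 1332, 1301],
    "Target": [0.01, -0.02, 0.03, 0.00, -0.01],
})

NDAYS = 2
# Take the last NDAYS distinct dates instead of counting rows from the end.
# ISO-formatted date strings sort chronologically.
last_dates = sorted(toy["Date"].unique())[-NDAYS:]
recent = toy[toy["Date"].isin(last_dates)].reset_index(drop=True)

# Same ranking step as above: mean Target per stock, best mean gets Rank 0.
ranks = (recent.groupby("SecuritiesCode")["Target"].mean()
               .rank(ascending=False, method="first").astype(int) - 1)
print(last_dates)
print(ranks.reset_index(name="Rank"))
```

This keeps the window at exactly `NDAYS` trading days even when some stocks have no row on a given day.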

In [None]:
predictions

In [None]:
%%time
# Creates a prediction dictionary to map the predictions.
predictions_dict = dict(zip(predictions['SecuritiesCode'],predictions['Rank']))

In [None]:
%%script false --no-raise-error

# Not active in this run
# Invoke the API to generate predictions.
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(predictions_dict)
    env.predict(sample_prediction)

---

## Third Model, Using the Last N Days Mean Stock Value (Supplemental Stock Prices) -- Data Leak.
This model uses future data, so there will be leaks in the model.
This is only a test to see if I can replicate some of the notebooks circulating...

For example, the API only required submissions for '2021-12-06' and '2021-12-07'.
I will train the models using the supplemental stock prices; this dataset has data from **2021-12-06** to **2022-02-28**.

In [None]:
%%time
NDAYS = 34 # Number of days we want to use...
# We want to use 34 days of data; with roughly 2000 stocks per day, that is approximately the last 2000 * 34 rows.

suplemental_stock_prices_dates = supplemental_stock_prices[supplemental_stock_prices['Date'] >= supplemental_stock_prices.Date.iat[-2000 * NDAYS]].reset_index(drop = True)
predictions = suplemental_stock_prices_dates.groupby('SecuritiesCode')['Target'].mean().rank(ascending = False, method = 'first').astype(int) -1
predictions = predictions.reset_index(name = 'Rank')

In [None]:
predictions

In [None]:
%%time
# Creates a prediction dictionary to map the predictions.
predictions_dict = dict(zip(predictions['SecuritiesCode'],predictions['Rank']))

In [None]:
%%script false --no-raise-error


# Not active in this run
# Invoke the API to generate predictions.
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(predictions_dict)
    env.predict(sample_prediction)

---

## Fourth Model, XGBoost... (No Data Leak), Train Model Before 2021-12-03
Well I will try to use data before 2021-12-03 to not leak future information...

In [None]:
%%time
stock_prices.head()

In [None]:
%%time
stock_prices.describe()

In [None]:
stock_prices['Target'].describe()

In [None]:
%%time
#stock_prices['Target'] = np.log(stock_prices['Target'])

In [None]:
stock_prices['Target'].describe()

In [None]:
%%time
# Extracting Date Information.
def time_features(df):
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df['Date'].dt.year   
    df['Month'] = df['Date'].dt.month
    df['Week_Day'] = df['Date'].dt.weekday
    df['Day_Of_Year'] = df['Date'].dt.dayofyear
    return df

In [None]:
%%time
stock_prices = time_features(stock_prices)

In [None]:
%%time
stock_prices.sample(10)

In [None]:
%%time
stock_prices.isnull().sum()

In [None]:
%%time
stock_prices['Open']  = stock_prices['Open'].fillna(stock_prices.groupby('SecuritiesCode')['Open'].transform('median'))
stock_prices['Low']   = stock_prices['Low'].fillna(stock_prices.groupby('SecuritiesCode')['Low'].transform('median'))
stock_prices['High']  = stock_prices['High'].fillna(stock_prices.groupby('SecuritiesCode')['High'].transform('median'))
stock_prices['Close'] = stock_prices['Close'].fillna(stock_prices.groupby('SecuritiesCode')['Close'].transform('median'))

In [None]:
%%time
stock_prices = stock_prices.dropna(subset=['Target'])

In [None]:
%%time
stock_prices.isnull().sum()

In [None]:
%%time
# Creating Lag Features.
def lag_features(df, feature = 'Close', lag_sequence = [1, 7, 15, 30], group_field = 'SecuritiesCode'):
    for lag in lag_sequence:
        df[feature + '_Lag' + str(lag)] = df.groupby(group_field)[feature].shift(lag)
    return df
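
The `groupby(...).shift(lag)` inside `lag_features` is what keeps one stock's prices from leaking into another's lag column. A minimal sketch on made-up data (two securities with interleaved rows, values chosen for illustration):

```python
import pandas as pd

# Hypothetical prices for two securities, rows interleaved by date.
toy = pd.DataFrame({
    "SecuritiesCode": [1301, 1332, 1301, 1332, 1301, 1332],
    "Close": [10.0, 100.0, 11.0, 101.0, 12.0, 102.0],
})

# Same idea as lag_features: shift inside each security's own history.
toy["Close_Lag1"] = toy.groupby("SecuritiesCode")["Close"].shift(1)
print(toy)
```

Each security's first row has no previous value, so its lag is NaN. Note that the lag counts rows, so it matches "trading days ago" only when the frame is sorted by date within each security.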

In [None]:
%%time
stock_prices = lag_features(stock_prices, feature = 'Open', lag_sequence = [1, 2, 4, 7, 15], group_field = 'SecuritiesCode')
stock_prices = lag_features(stock_prices, feature = 'Close', lag_sequence = [1, 2, 4, 7, 15], group_field = 'SecuritiesCode')

In [None]:
%%time
#stock_prices = lag_features(stock_prices, feature = 'Target', lag_sequence = [90, 120, 150], group_field = 'SecuritiesCode')

In [None]:
%%time
stock_prices[stock_prices['SecuritiesCode'] == 1301].head(10)

In [None]:
%%time
lag_fields = [col for col in stock_prices.columns if 'Lag' in col]
stock_prices = stock_prices.dropna(subset = lag_fields)

In [None]:
%%time
from scipy.stats import kurtosis
def kurtosis_func(series):
    '''
    Excess (Fisher) kurtosis of the series, as returned by scipy.stats.kurtosis with its default settings.
    '''
    return kurtosis(series)

def q01(series):
    return np.quantile(series, 0.01)

def q05(series):
    return np.quantile(series, 0.05)

def q95(series):
    return np.quantile(series, 0.95)

def q99(series):
    return np.quantile(series, 0.99)

def aggregated_features(df, aggregation_cols = ['SecuritiesCode'], prefix = ''):
    agg_strategy = {'Open' : ['mean', 'max', 'min', 'var', 'mad', 'sum', 'median', 'skew', 'count', kurtosis_func, q01, q05, q95, q99],
                    'High' : ['mean', 'max', 'min', 'var', 'mad', 'sum', 'median', 'skew', 'count', kurtosis_func, q01, q05, q95, q99],
                    'Low'  : ['mean', 'max', 'min', 'var', 'mad', 'sum', 'median', 'skew', 'count', kurtosis_func, q01, q05, q95, q99],
                    'Close': ['mean', 'max', 'min', 'var', 'mad', 'sum', 'median', 'skew', 'count', kurtosis_func, q01, q05, q95, q99],
                   }
    group = df.groupby(aggregation_cols).aggregate(agg_strategy)
    group.columns = ['_'.join(col).strip() for col in group.columns]
    group.columns = [str(prefix) + str(col) for col in group.columns]
    group.reset_index(inplace = True)
    
    temp = (df.groupby(aggregation_cols).size().reset_index(name = str(prefix) + 'Size'))
    group = pd.merge(temp, group, how = 'left', on = aggregation_cols,)
    return group
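
One caveat on the aggregation list: the `'mad'` string alias relies on `Series.mad`, which was deprecated in pandas 1.5 and removed in pandas 2.0, so on a newer environment that entry is expected to fail. A minimal sketch of a drop-in callable follows; the helper name `mad` and the toy frame are my assumptions.

```python
import pandas as pd

def mad(series: pd.Series) -> float:
    """Mean absolute deviation around the mean (hypothetical helper)."""
    return (series - series.mean()).abs().mean()

# Hypothetical prices for two securities.
toy = pd.DataFrame({
    "SecuritiesCode": [1301, 1301, 1301, 1301, 1332, 1332],
    "Close": [1.0, 2.0, 3.0, 4.0, 10.0, 10.0],
})

# A callable can sit in the aggregation list next to the string aliases.
agg = toy.groupby("SecuritiesCode").aggregate({"Close": ["mean", mad]})
agg.columns = ["_".join(col) for col in agg.columns]
print(agg)
```

Because the function is named `mad`, the resulting column is still called `Close_mad`, so nothing downstream needs to change.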

In [None]:
def rolling_features(df, aggregation_cols = ['SecuritiesCode'], feature = 'Close', periods = 30):
    df[feature + 'Rolling_Mean'] = df.groupby(aggregation_cols)[feature].transform(lambda s: s.rolling(periods, min_periods=1).mean())
    return df
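
To see what `rolling_features` produces, here is a small sketch on made-up data with a 2-row window instead of 30 (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical prices for two securities, three rows each.
toy = pd.DataFrame({
    "SecuritiesCode": [1301, 1301, 1301, 1332, 1332, 1332],
    "Close": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

# Rolling mean over a 2-row window, computed separately per security.
toy["CloseRolling_Mean"] = (
    toy.groupby("SecuritiesCode")["Close"]
       .transform(lambda s: s.rolling(2, min_periods=1).mean())
)
print(toy)
```

With `min_periods=1` the first row of each security is just its own value, and the window always includes the current row, so the feature is never NaN.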

In [None]:
%%time 
stock_prices = rolling_features(stock_prices, aggregation_cols = ['SecuritiesCode'], feature = 'Close', periods = 30)

In [None]:
%%time 
stock_prices[stock_prices['SecuritiesCode'] == 1301].head(15)

In [None]:
%%time
# Separating the Data in Train and Validation.
cutoff_date = '2021-09-03' # This leaves approximately 90 days to validate the model...

trn_data = stock_prices[stock_prices['Date'] < cutoff_date]
val_data = stock_prices[stock_prices['Date'] >= cutoff_date]

In [None]:
%%time
agg_trn_data = aggregated_features(trn_data, aggregation_cols = ['SecuritiesCode'])
agg_val_data = aggregated_features(val_data, aggregation_cols = ['SecuritiesCode'])

In [None]:
%%time
trn_data = trn_data.merge(agg_trn_data, how = 'left', on = 'SecuritiesCode')
val_data = val_data.merge(agg_val_data, how = 'left', on = 'SecuritiesCode')

In [None]:
%%time
ignore = ['RowId', 
          'Date', 
          'AdjustmentFactor', 
          'ExpectedDividend', 
          'SupervisionFlag', 
          'Target', 
         ]

prediction_target = 'Target'
features = [feat for feat in trn_data.columns if feat not in ignore]

In [None]:
%%time
# Display a list of all the features available.
features

In [None]:
# Understanding some of the features...
import matplotlib.pyplot as plt

plt.scatter(x = trn_data['Day_Of_Year'], y = trn_data['Target'], alpha = 0.1)
plt.show()

In [None]:
%%time
from xgboost import XGBRegressor

In [None]:
%%time
params = {'n_estimators'    : 2048,
          'max_depth'       : 7,
          'learning_rate'   : 0.05,
          'subsample'       : 0.95,
          'colsample_bytree': 0.90,
          'reg_lambda'      : 1.50,
          'reg_alpha'       : 6.10,
          'gamma'           : 1.40,
          'random_state'    : 69,
          'objective'       : 'reg:squarederror',
          'tree_method'     : 'gpu_hist',
         }


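# Minimal configuration (XGBoost defaults on GPU); this is the dictionary passed to the model below, not 'params'.
params_open = {'tree_method'     : 'gpu_hist',}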

In [None]:
%%time
X_train, X_valid = trn_data[features], val_data[features]
y_train, y_valid = trn_data[prediction_target], val_data[prediction_target]

In [None]:
%%time
X_train.head()

In [None]:
%%time
X_valid.head()

In [None]:
%%time
X_train.isnull().sum()

In [None]:
%%time

xgb = XGBRegressor(**params_open)
xgb.fit(X_train, 
        y_train, 
        eval_set = [(X_valid, y_valid)], 
        eval_metric = ['mae'], 
        early_stopping_rounds = 128, 
        verbose = 50
       )
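
MAE on the raw target is only a proxy, since the competition scores a per-day ranking. A minimal sketch of turning predicted values into ranks, assuming a made-up validation frame where `Prediction` stands in for the output of `xgb.predict(X_valid)`:

```python
import pandas as pd

# Hypothetical validation frame; "Prediction" stands in for xgb.predict(X_valid).
val = pd.DataFrame({
    "Date": ["2021-09-03"] * 3 + ["2021-09-06"] * 3,
    "SecuritiesCode": [1301, 1332, 1333] * 2,
    "Prediction": [0.02, -0.01, 0.00, -0.03, 0.04, 0.01],
})

# Highest predicted return gets Rank 0, separately for every trading day.
val["Rank"] = (
    val.groupby("Date")["Prediction"]
       .rank(ascending=False, method="first")
       .astype(int) - 1
)
print(val)
```

With the real `Target` column alongside, a frame like this can be passed to `calc_spread_return_sharpe` to score the validation period the same way the leaderboard does.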

In [None]:
trn_data[prediction_target].describe()

In [None]:
%%time
def plot_feature_importance(importance, names, model_type, max_features = 10):
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_df = fi_df.head(max_features)

    #Define size of bar plot
    plt.figure(figsize=(8,6))
    
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + 'FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

In [None]:
%%time
import seaborn as sns
import matplotlib.pyplot as plt
plot_feature_importance(xgb.feature_importances_, X_train.columns, 'XG BOOST ', max_features = 20)

In [None]:
# NOTE: `get_avg` and `model_o` are not defined anywhere in this notebook;
# they must be defined before this cell can run.
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    sample_prediction["Avg"] = sample_prediction["SecuritiesCode"].apply(get_avg)
    # Copy so the column assignments below do not write into a slice of sample_prediction.
    df = sample_prediction[["Date", "SecuritiesCode", "Avg"]].copy()
    df["High"] = prices["High"]
    df["Open"] = prices["Open"]
    df["Close"] = prices["Close"]
    df["Low"] = prices["Low"]
    df["Volume"] = prices["Volume"]
    # Encode the date as an integer in YYYYMMDD form.
    df["Date"] = pd.to_datetime(df["Date"]).dt.strftime("%Y%m%d").astype(int)
    sample_prediction["Volume"] = df["Volume"]
    sample_prediction["Prediction"] = model_o.predict(df)
    # Divide the prediction by log(Volume + 1) before ranking.
    sample_prediction["rate"] = sample_prediction["Prediction"] / np.log(sample_prediction["Volume"] + 1)
    sample_prediction = sample_prediction.sort_values(by = "rate", ascending = False)
    # Rank 0 = highest rate; use the actual row count instead of a hard-coded 2000.
    sample_prediction["Rank"] = np.arange(len(sample_prediction))
    sample_prediction = sample_prediction.sort_values(by = "SecuritiesCode", ascending = True)
    submission = sample_prediction[["Date", "SecuritiesCode", "Rank"]]
    env.predict(submission)