# 🔎 How to find relevant external features & data for Kaggle competitions in 10 minutes

### Table of contents
* [Intro](#Intro)
* [How external data & features might help on Kaggle?](#How-external-data-&-features-might-help-on-Kaggle?)
* [Packages and functions](#Packages-and-functions)
* [Find relevant external features](#2️⃣-Find-relevant-external-features) 
* [Submission](#3️⃣-Submition) 
* [Relevant external features & data sources](#%F0%9F%8C%8E-Relevant-external-features-&-data-sources)

## Intro
**Competition**: [JPX Tokyo Stock Exchange Prediction](https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction), with the Sharpe Ratio of the daily spread returns as the target metric  

📚 In this notebook we'll use:
* [Upgini](https://github.com/upgini/upgini#readme) - Low-code Feature search and enrichment library for supervised machine learning applications.   
<a href="https://github.com/upgini/upgini">
    <img src="https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white"  align='center'></a>  

## How external data & features might help on Kaggle?
Kaggle is always about learning and leaderboard progress (hopefully from learning, not cheating ;-))  
And every Kaggler wants to progress as fast as possible, so time-saving tips & tricks are a big deal as well.  
That's why low-code tools are widely adopted among Kagglers.

So, there are **two major scenarios** of external features & data introduction in competitions on Kaggle:

1. **Final improvement of a polished kernel**  
In this scenario you want **to improve an already polished kernel** (optimized features, model architecture and hyperparameters) with new external features.  
By this point, most of the juice has already been "squeezed" from the competition data through significant feature engineering effort.  
You want to answer a simple question: *Are there any external data sources and features that might boost accuracy a bit more?*  
However, there is a caveat to this approach: the current model architecture & hyperparameters might be suboptimal for the new feature set, even after introducing a single new variable.  
So an extra step back for model tuning might be needed.

2. **Low-code initial feature engineering - add relevant external features @start**  
Here you want to **save time on feature search and engineering**. If there are ready-to-use external features and data, let's use them to speed up overall progress.  
In this scenario it always makes sense to check that the new external features have an optimal representation for the specific task and target model architecture. For example, categorical features for linear regression models should be one-hot encoded.
This type of feature preparation has to be done manually in any case.  
As in scenario #1, there is a caveat to this approach: more features are not always a good thing, since they increase dimensionality and can lead to model overfitting.  
So you have to check the model accuracy improvement after enrichment with the new features, and ALWAYS with an appropriate cross-validation strategy.
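
The check described in the last point can be sketched as follows. This is a minimal illustration on synthetic data, not this notebook's pipeline: the column `ext_feature` stands in for a hypothetical external feature, and scikit-learn's `TimeSeriesSplit` with `cross_val_score` plays the role of the cross-validation strategy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic, time-ordered data: the target depends on both columns.
X = pd.DataFrame({
    "base_feature": rng.normal(size=n),
    "ext_feature": rng.normal(size=n),  # hypothetical external feature
})
y = 1.0 * X["base_feature"] + 2.0 * X["ext_feature"] + rng.normal(scale=0.1, size=n)

cv = TimeSeriesSplit(n_splits=5)

def cv_rmse(columns):
    """Mean RMSE over time-ordered folds for the given feature subset."""
    scores = cross_val_score(
        LinearRegression(), X[columns], y,
        cv=cv, scoring="neg_root_mean_squared_error",
    )
    return -scores.mean()

rmse_base = cv_rmse(["base_feature"])
rmse_enriched = cv_rmse(["base_feature", "ext_feature"])
print(f"baseline RMSE: {rmse_base:.3f}, enriched RMSE: {rmse_enriched:.3f}")
```

If the enriched score is not clearly better than the baseline on the same folds, the new features are probably not worth the extra dimensionality.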
 
In this notebook we'll go with **Scenario #2**, but we will stop at the step of searching for new external features.
A full example of this scenario can be found here: [**Scenario #2**](https://www.kaggle.com/code/romaupgini/zero-feature-engineering-with-upgini-pycaret).

## Packages and functions

In [None]:
%pip install -Uq upgini

In [None]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from tqdm.notebook import tqdm
import optuna
optuna.logging.set_verbosity(optuna.logging.CRITICAL)
import jpx_tokyo_market_prediction

## Low-code feature engineering - add relevant external features @start and without manual effort
The main idea of this notebook is to build a baseline solution with as little manual feature engineering as possible. Namely, we search, generate and select relevant features with Upgini, then fit a simple scikit-learn `DecisionTreeRegressor` on the enriched data. 

The entire code of data preparation, feature engineering and modelling takes only a few lines, so you don't have to spend a lot of time doing all these operations manually.

## 1️⃣ Read train & test data
Read the train and supplemental price data from CSV and combine them into one dataframe:

In [None]:
path = "../input/jpx-tokyo-stock-exchange-prediction/"
df_prices = pd.read_csv(f"{path}train_files/stock_prices.csv")
df_prices = df_prices[~df_prices["Target"].isnull()]
prices = pd.read_csv(f"{path}supplemental_files/stock_prices.csv")
df_prices = pd.concat([df_prices, prices])
df_prices['Date']=pd.to_datetime(df_prices['Date'], format='%Y-%m-%d')

Let's predict `log1p` of the "Target" field, that is log(1 + Target), instead of predicting "Target" directly. We also add a constant `country` column, which is used as a search key below:

In [None]:
df_prices["Target"] = np.log1p(df_prices["Target"])
df_prices["country"] = 'JP'
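
A small sketch of this transform and its inverse, using made-up return values rather than competition data. `np.log1p` is strictly increasing, so it preserves the ordering of values, which is what a rank-based submission relies on.

```python
import numpy as np

target = np.array([-0.02, 0.0, 0.015, 0.03])  # made-up daily returns
transformed = np.log1p(target)                # log(1 + x), defined for x > -1
restored = np.expm1(transformed)              # inverse transform

print(transformed)
print(np.allclose(restored, target))
```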

## 2️⃣ Find relevant external features
To find new features we'll use [Upgini Feature search and enrichment library for supervised machine learning applications](https://github.com/upgini/upgini#readme)  
To initiate search with Upgini library, you need to define so called [*search keys*](https://github.com/upgini/upgini#-search-key-types-we-support-more-is-coming) - a set of columns to join external data sources. In this competition we can use the following keys:

1. Column **Date** should be used as **SearchKey.DATE**;  
2. Column **country** (after conversion to ISO-3166 country code) should be used as **SearchKey.COUNTRY**.
    
With this set of search keys, our dataset will be matched with [different time-specific features (such as weather data, calendar data, financial data, etc.)](https://github.com/upgini/upgini#-connected-data-sources-and-coverage), taking into account the country the data belongs to (here, Japan). Then relevance-based selection and ranking will be done.  
As a result, we'll add only new, relevant features with additional information about specific dates and countries.

In [None]:
from upgini import SearchKey 


## define search keys
search_keys = {
    "Date": SearchKey.DATE,
    "country": SearchKey.COUNTRY
}
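
Conceptually, search keys act as join columns. The sketch below shows the idea with a plain pandas left join. It does not use Upgini's API, and the `external` table and its `some_macro_index` column are hypothetical, with made-up values.

```python
import pandas as pd

prices = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-04", "2022-01-05", "2022-01-05"]),
    "country": ["JP", "JP", "JP"],
    "SecuritiesCode": [1301, 1301, 1332],
})

# Hypothetical external table keyed by date and country (values are made up).
external = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-04", "2022-01-05"]),
    "country": ["JP", "JP"],
    "some_macro_index": [101.2, 100.7],
})

# Every row receives the external value for its (Date, country) pair.
enriched = prices.merge(external, on=["Date", "country"], how="left")
print(enriched)
```

Because the keys here are only date and country, every security traded on the same day receives the same external value.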

To start the search, we need to initialize the *scikit-learn* compatible `FeaturesEnricher` transformer with appropriate **search** parameters and cross-validation type (here we use [TimeSeries](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) CV, because our target variable strongly depends on time, i.e. we have a time series prediction task).  
After that, we can call the **fit** method of `features_enricher` to start the search.

In [None]:
from upgini import FeaturesEnricher
from upgini.metadata import CVType, RuntimeParameters

## define X_train & y_train
X_train = df_prices[["Date", "country", "SecuritiesCode", "Open", "High", "Low", "Close", "Volume", "AdjustmentFactor", "ExpectedDividend", "SupervisionFlag" ]]
y_train = df_prices.Target

## define FeaturesEnricher
features_enricher = FeaturesEnricher(
    search_keys=search_keys, 
    cv=CVType.time_series
)
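
To see why time series CV matters here, the sketch below prints the folds produced by scikit-learn's `TimeSeriesSplit` (the class linked above) on 12 stand-in rows. It is assumed, not verified, that `CVType.time_series` splits data in a similar way.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rows = np.arange(12)  # stand-in for 12 time-ordered rows
splitter = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, test_idx in splitter.split(rows):
    folds.append((train_idx, test_idx))
    print("train:", train_idx, "test:", test_idx)
```

In every fold the training rows come strictly before the test rows, so the model is never evaluated on data older than what it was trained on.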

`FeaturesEnricher.fit()` has a flag `calculate_metrics` for a quick estimate of the quality improvement on cross-validation and eval sets. This step is quite similar to [sklearn.model_selection.cross_val_score](https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics), so you can choose the exact metric with the `scoring` parameter:

1. Built-in scoring [functions](https://github.com/upgini/upgini/blob/main/README.md#-accuracy-and-uplift-metrics-calculations) (in this case - scorer based on Mean Squared Error);
2. Custom scorer.    

Notice that you should pass **X_train** as the first argument and **y_train** as the second argument for `FeaturesEnricher.fit()`, just like in scikit-learn.  

It will take some time (2-5 minutes).

In [None]:
features_enricher.fit(X_train, y_train, calculate_metrics=True, max_features=1)

We've got some relevant features, which might improve accuracy of the model, ranked by SHAP values.

Uplift after enrichment with all of the new external features is positive - so, the features from search actually contain some useful information about our target variable. Let's enrich initial feature space with found features.

This step will take around 10 minutes.

In [None]:
import gc
gc.collect()

In [None]:
# Utilities 

def calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    weights_mean = weights.mean()
    df = df.sort_values(by='Rank')
    purchase = (df['Target'][:portfolio_size]  * weights).sum() / weights_mean
    short    = (df['Target'][-portfolio_size:] * weights[::-1]).sum() / weights_mean
    return purchase - short

def calc_spread_return_sharpe(df, portfolio_size=200, toprank_weight_ratio=2):
    grp = df.groupby('Date')
    min_size = grp["Target"].count().min()
    if min_size<2*portfolio_size:
        portfolio_size=min_size//2
        if portfolio_size<1:
            return 0, None
    buf = grp.apply(calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio, buf

def add_rank(df, col_name="pred"):
    df["Rank"] = df.groupby("Date")[col_name].rank(ascending=False, method="first") - 1 
    df["Rank"] = df["Rank"].astype("int")
    return df

def fill_nans(prices):
    prices.set_index(["SecuritiesCode", "Date"], inplace=True)
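    prices["ExpectedDividend"] = prices["ExpectedDividend"].fillna(0)  # missing dividend -> 0
    prices.ffill(inplace=True)  # forward-fill remaining gaps in row order (not grouped per security)
    prices.fillna(0,inplace=True)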
    prices.reset_index(inplace=True)
    return prices
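
A tiny worked example of the daily spread return may help. The function below restates the logic of `calc_spread_return_per_day` so the block is self-contained. The four-stock day and `portfolio_size=2` are made up for illustration. The weights are `[2, 1]` with mean 1.5, so the long side is (0.03·2 + 0.01·1)/1.5, the short side is (−0.01·1 − 0.02·2)/1.5, and their difference is 0.08.

```python
import numpy as np
import pandas as pd

def spread_return_one_day(df, portfolio_size, toprank_weight_ratio):
    # Same logic as calc_spread_return_per_day, restated for a standalone example.
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    targets = df.sort_values(by="Rank")["Target"].to_numpy()
    purchase = (targets[:portfolio_size] * weights).sum() / weights.mean()
    short = (targets[-portfolio_size:] * weights[::-1]).sum() / weights.mean()
    return purchase - short

# One made-up trading day with four stocks, already ranked 0 (best) to 3 (worst).
day = pd.DataFrame({
    "Rank":   [0, 1, 2, 3],
    "Target": [0.03, 0.01, -0.01, -0.02],
})

spread = spread_return_one_day(day, portfolio_size=2, toprank_weight_ratio=2)
print(round(spread, 6))
```

The Sharpe ratio used as the score is then the mean of these daily spreads divided by their standard deviation, as in `calc_spread_return_sharpe`.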

In [None]:
path = "../input/jpx-tokyo-stock-exchange-prediction/"
df_prices = pd.read_csv(f"{path}train_files/stock_prices.csv")
df_prices = df_prices[~df_prices["Target"].isnull()]
prices = pd.read_csv(f"{path}supplemental_files/stock_prices.csv")
df_prices = pd.concat([df_prices, prices])
df_prices['Date']=pd.to_datetime(df_prices['Date'], format='%Y-%m-%d')
prices['Date']=pd.to_datetime(prices['Date'], format='%Y-%m-%d')

df_prices = fill_nans(df_prices)
prices = fill_nans(prices)
pd.options.display.float_format = '{:,.6g}'.format
df_prices.describe()

In [None]:
## call transform and enrich dataset
df_prices["country"] = 'JP'
df_prices = features_enricher.transform(df_prices, keep_input=True)

In [None]:
prices["country"] = 'JP'
prices = features_enricher.transform(prices, keep_input=True)

## 3️⃣ Submition
Let's estimate model quality and make a submission.

**Special thanks**: PAULO PINTO [@paulorzp](https://www.kaggle.com/paulorzp) for the public [notebook](https://www.kaggle.com/code/paulorzp/jpx-simple-overfitting-model-lb-3/notebook?scriptVersionId=94507587) 

In [None]:
## By Yuike - https://www.kaggle.com/code/ikeppyo/examples-of-higher-scores-than-perfect-predictions

# This function adjusts the predictions so that the daily spread return approaches a certain value.
        
def adjuster(df):
    def calc_pred(df, x, y, z):
        return df['Target'].where(df['Target'].abs() < x, df['Target'] * y + np.sign(df['Target']) * z)

    def objective(trial, df):
        # suggest_float replaces suggest_uniform, which is deprecated in recent Optuna versions
        x = trial.suggest_float('x', 0, 0.2)
        y = trial.suggest_float('y', 0, 0.05)
        z = trial.suggest_float('z', 0, 1e-3)
        df["Rank"] = calc_pred(df, x, y, z).rank(ascending=False, method="first") - 1 
        return calc_spread_return_per_day(df, 200, 2)

    def predictor_per_day(df):
        study = optuna.create_study(direction='minimize', sampler=optuna.samplers.TPESampler(seed=SD))#5187
        study.optimize(lambda trial: abs(objective(trial, df) - 3), 3)
        return calc_pred(df, *study.best_params.values())

    return df.groupby("Date").apply(predictor_per_day).reset_index(level=0, drop=True)

def _predictor_base(feature_df):
    return model.predict(feature_df[feats])

def _predictor_with_adjuster(feature_df):
    df_pred = feature_df.copy()
    df_pred["Target"] = model.predict(feature_df[feats])
    return adjuster(df_pred).values.T
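
To make the effect of `calc_pred` concrete, here is a standalone sketch with made-up numbers. The helper name `shrink_large` and the parameter values are chosen only for illustration. Predictions with absolute value below `x` are kept as they are, and the rest are replaced by `t * y + sign(t) * z`.

```python
import numpy as np
import pandas as pd

def shrink_large(target, x, y, z):
    # Keep values with |t| < x; replace the rest with t * y + sign(t) * z.
    return target.where(target.abs() < x, target * y + np.sign(target) * z)

target = pd.Series([0.05, 0.2, -0.3])  # made-up predictions
adjusted = shrink_large(target, x=0.1, y=0.05, z=0.001)
print(adjusted.tolist())
```

Here 0.05 is below the threshold and stays unchanged, while 0.2 and -0.3 are shrunk to roughly 0.011 and -0.016, which changes their position in the daily ranking.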

In [None]:
np.random.seed(0)
# Copy the list so that re-running this cell does not mutate the enricher's own list
feats = list(features_enricher.feature_names_) + ["Close"]
max_score = 0
max_depth = 0

In [None]:
df_prices = fill_nans(df_prices)
prices = fill_nans(prices)

In [None]:
for md in tqdm(range(3,40)):
    model = DecisionTreeRegressor( max_depth=md ) # Controlling the overfit with max_depth parameter
    model.fit(df_prices[feats],df_prices["Target"])
    predictor = _predictor_base
    prices["pred"] = predictor(prices)
    score, buf = calc_spread_return_sharpe(add_rank(prices))
    if score>max_score:
        max_score = score
        max_depth = md
        
model = DecisionTreeRegressor( max_depth=max_depth )
model.fit(df_prices[feats],df_prices["Target"])
print(f'Max_depth={max_depth} : Sharpe Ratio Score base -> {max_score}')

In [None]:
# Controlling the Sharpe Ratio Score (≃3)
# predictor = _predictor_with_adjuster
err = 1
maxSD = 3683
for SD in tqdm(range(maxSD,4000)):
    prices["pred"] = predictor(prices)
    score, buf = calc_spread_return_sharpe(add_rank(prices))
    if abs(score-3)<=err and score<3:
        err=abs(score-3)
        maxSD = SD
        print(f'{maxSD} Sharpe Ratio Score with adjuster -> {score}')
        
SD = maxSD

In [None]:
import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

In [None]:
for prices, options, financials, trades, secondary_prices, sample_prediction in iter_test:
    prices['Date']=pd.to_datetime(prices['Date'], format='%Y-%m-%d')
    prices["country"] = 'JP'
    prices = features_enricher.transform(prices, keep_input=True)
    prices = fill_nans(prices)
    prices.loc[:,"pred"] = predictor(prices)
    prices = add_rank(prices)
    rank = prices.set_index('SecuritiesCode')['Rank'].to_dict()
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(rank)
    env.predict(sample_prediction)
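
Before submitting, it can be worth checking that the ranks are well formed. The sketch below restates `add_rank` and adds a hypothetical helper, `ranks_are_valid`, that checks each date has the ranks 0..N-1 exactly once. The toy data is made up and includes a tie to show that `method="first"` still yields unique ranks.

```python
import numpy as np
import pandas as pd

def add_rank(df, col_name="pred"):
    # Restated from the utilities above: the highest prediction gets Rank 0.
    df["Rank"] = df.groupby("Date")[col_name].rank(ascending=False, method="first") - 1
    df["Rank"] = df["Rank"].astype("int")
    return df

def ranks_are_valid(df):
    """True if every date has the ranks 0..N-1 with no duplicates."""
    def check(ranks):
        return np.array_equal(np.sort(ranks.to_numpy()), np.arange(len(ranks)))
    return bool(df.groupby("Date")["Rank"].apply(check).all())

toy = pd.DataFrame({
    "Date": ["2022-01-04"] * 3 + ["2022-01-05"] * 2,
    "SecuritiesCode": [1301, 1332, 1333, 1301, 1332],
    "pred": [0.2, 0.2, -0.1, 0.0, 0.3],  # includes a tie on the first date
})
toy = add_rank(toy)
print(toy)
print(ranks_are_valid(toy))
```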

## 🌎 Relevant external features & data sources
Upgini found several relevant external features for this contest.

* **f_usd_1d_to_7d_b6c6e46e** - The U.S. Dollar Index is a measure of the value of the United States dollar relative to a basket of foreign currencies, often referred to as a basket of U.S. trade partners' currencies. This feature is calculated as the ratio of the Dollar Index on the reporting date to its average value over a 7-day window.

* **f_cpi_pca_8_cd4f50c7** - Consumer Price Index transformed through Principal Component Analysis. In general, consumer price indexes are index numbers that measure changes in the prices of goods and services purchased or otherwise acquired by households, which households use directly, or indirectly, to satisfy their own needs and wants.  
So it carries a lot of information about inflation in a specific country and for specific types of goods and services.  
It is updated by the [Organisation for Economic Cooperation and Development (OECD)](https://data.oecd.org/price/inflation-cpi.htm) on a monthly basis.
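
A feature like the first one (the index on the reporting date relative to its 7-day average) could be computed roughly as sketched below. The index values are made up, and the exact window definition used by Upgini is not documented here. This sketch assumes the window ends on, and includes, the reporting date.

```python
import pandas as pd

# Made-up daily index levels; not real U.S. Dollar Index data.
index_level = pd.Series(
    [100.0, 101.0, 102.0, 101.0, 100.0, 99.0, 98.0, 105.0],
    index=pd.date_range("2022-01-01", periods=8, freq="D"),
)

# Assumption: the 7-day window ends on (and includes) the reporting date.
avg_7d = index_level.rolling(window=7).mean()
ratio_1d_to_7d = index_level / avg_7d

print(ratio_1d_to_7d.dropna())
```

Values above 1 mean the index is currently above its recent average, and values below 1 mean it is below.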