# 🔎 How to find relevant external features & data for Kaggle competitions in 10 minutes [1/3]
### Part #1 - Improve the accuracy of a Kaggle TOP-1 leaderboard notebook in 10 minutes
##### [Part #2 Link](https://www.kaggle.com/code/romaupgini/zero-feature-engineering-with-upgini-pycaret)
##### [Part #3 Link](https://www.kaggle.com/code/romaupgini/external-data-features-for-multivariate-ts)
______________________________
*updated 2022-05-27 [@roma-upgini](https://www.kaggle.com/romaupgini)*

**❓ Before reading the notebook, what will you learn from it?**

1. How external data & features might help on Kaggle: two scenarios
2. How to find relevant external features in less than 10 minutes and save time on feature engineering 
3. How to calculate metrics and uplifts from new external features
4. Which external data sources might help you in Kaggle competitions

🗣 Share this notebook: [Shareable Link](https://www.kaggle.com/code/romaupgini/guide-how-to-find-relevant-external-features-1)
______________________________
### Table of contents
* [Intro](#Intro)

* [How external data & features might help on Kaggle?](#How-external-data-&-features-might-help-on-Kaggle?)

* [Packages and functions](#Packages-and-functions)

* [Final improvement of polished kernel](#Final-improvement-of-polished-kernel)

    - [1️⃣ Let's take existing TOP-1 winning solution with external data as a baseline](#1%EF%B8%8F%E2%83%A3-Let's-take-existing-TOP-1-winning-solution-with-external-data-as-a-baseline)
    - [2️⃣ Find relevant external features](#2️⃣-Find-relevant-external-features)
    
* [Relevant external features & data sources](#🌎-Relevant-external-features-&-data-sources)
* [References](#References)  

## Intro
**Competition**: [TPS January 2022](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022), SMAPE as a target metric  
**Special thanks**: [@ambrosm](https://www.kaggle.com/ambrosm) for the 1st place [notebook](https://www.kaggle.com/code/ambrosm/tpsjan22-10-advanced-linear-model-with-cci/notebook) and the great discussion on external data [here](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/discussion/302694)  
📚 In this notebook we'll use:
* [Upgini](https://github.com/upgini/upgini#readme) - Low-code Feature search and enrichment library for supervised machine learning applications.   
<a href="https://github.com/upgini/upgini">
    <img src="https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white"  align='center'>
</a>  

The **baseline model** in this notebook is based on the first-place notebook by *@ambrosm*, with some minor changes:

* The feature engineering part was slightly restructured, so that main features and external features can be prepared separately;
* `SimpleImputer` was added to the data preparation pipeline to deal with missing values introduced by new external features;
* The constant scaling factor for the test predictions was removed.

## How external data & features might help on Kaggle?
Kaggle is always about learning and leaderboard progress (hopefully from learning, not cheating ;-))  
And every Kaggler wants to progress as fast as possible, so time-saving tips & tricks are a big deal as well.  
That's why low-code tools are widely adopted among Kagglers.

There are **two major scenarios** for introducing external features & data in Kaggle competitions:

1. **Final improvement of a polished kernel**  
In this scenario you want **to improve an already polished kernel** (optimized features, model architecture and hyperparameters) with new external features.  
By this point, most of the juice has already been "squeezed" from the competition data through significant feature engineering effort.  
And you want to answer a simple question - *Are there any external data sources and features which might boost accuracy a bit more?*  
However, there is a caveat to this approach: the current model architecture & hyperparameters might be suboptimal for the new feature set, even after introducing a single new variable.  
So an extra step back for model tuning might be needed.

2. **Low-code initial feature engineering - add relevant external features at the start**  
Here you want to **save time on feature search and engineering**. If there are ready-to-use external features and data, let's use them to speed up overall progress.  
In this scenario it always makes sense to check that the new external features have an optimal representation for the specific task and target model architecture. For example, categorical features for linear regression models should be one-hot encoded.
This type of feature preparation has to be done manually in any case.  
As in scenario #1, there is a caveat to this approach: a lot of features is not always a good thing - they increase dimensionality and might lead to model overfitting.  
So you have to check the model accuracy metrics after enrichment with the new features, and ALWAYS with an appropriate cross-validation strategy.
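
The two checks from scenario #2 can be sketched on synthetic data: one-hot encode a categorical column for a linear model, then compare cross-validated scores with and without a candidate external feature. The column names (`store`, `ext_feature`), the generated data and the choice of R² as the metric are assumptions made purely for illustration; they are not part of the competition pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic data (hypothetical): one categorical column and one candidate external feature
data = pd.DataFrame({
    "store": rng.choice(["A", "B", "C"], size=n),
    "ext_feature": rng.normal(size=n),
})
store_effect = data["store"].map({"A": 0.0, "B": 1.0, "C": 2.0})
y = store_effect + 3.0 * data["ext_feature"] + rng.normal(scale=0.1, size=n)

# Linear models need categorical columns to be one-hot encoded
X_base = pd.get_dummies(data[["store"]], drop_first=True).astype(float)
X_enriched = X_base.assign(ext_feature=data["ext_feature"])

# Same splitter and same estimator for both feature sets, so the scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=0)
score_base = cross_val_score(Ridge(alpha=0.2), X_base, y, cv=cv, scoring="r2").mean()
score_enriched = cross_val_score(Ridge(alpha=0.2), X_enriched, y, cv=cv, scoring="r2").mean()

print(f"baseline R2: {score_base:.3f}, enriched R2: {score_enriched:.3f}")
```

The important part is that both feature sets are scored with the same splitter and the same estimator; otherwise the difference cannot be attributed to the new feature.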
 
In this guide we'll go with **Scenario #1**. You can also check out the guide for [**Scenario #2**](https://www.kaggle.com/code/romaupgini/zero-feature-engineering-with-upgini-pycaret).

## Packages and functions

In [None]:
%pip install -Uq upgini

import pandas as pd
import numpy as np
import pickle
import itertools
import gc
import math
import matplotlib.pyplot as plt
import dateutil.easter as easter
from datetime import datetime, date, timedelta
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, GroupKFold, TimeSeriesSplit
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
import lightgbm as lgb
import scipy.stats
import os

def smape_loss(y_true, y_pred):
    # SMAPE in percent: mean of |y - y_hat| / (|y| + |y_hat|), times 200
    return np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))) * 200

def read_main_data(input_data_path):
    train_df = pd.read_csv(f'{input_data_path}/tabular-playground-series-jan-2022/train.csv')
    test_df = pd.read_csv(f'{input_data_path}/tabular-playground-series-jan-2022/test.csv')
    train_df["segment"], test_df["segment"] = "train", "test"
    kaggle_rama_sold = train_df[train_df.store == "KaggleRama"].num_sold.values
    kaggle_mart_sold = train_df[train_df.store == "KaggleMart"].num_sold.values
    kaggle_rama_ratio = np.mean(kaggle_rama_sold/kaggle_mart_sold)
    
    df = pd.concat([train_df, test_df]).reset_index(drop=True)
    df['date'] = pd.to_datetime(df.date)
    
    return df, kaggle_rama_ratio

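# NOTE: relies on the global `input_data_path`, which must be defined before this function is called
def read_additional_data():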
    gdp_df = pd.read_csv(
        f'{input_data_path}/gdp-20152019-finland-norway-and-sweden/GDP_data_2015_to_2019_Finland_Norway_Sweden.csv'
    )
    gdp_df.set_index('year', inplace=True)

    cci_df = pd.read_csv(f'{input_data_path}/oecd-consumer-confidence-index/DP_LIVE_21012022073653464.csv')
    cci_df.set_index(['LOCATION', 'TIME'], inplace=True)
    
    return gdp_df, cci_df
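
A small worked example may help to read the SMAPE numbers reported later. The sketch below re-implements the same formula as `smape_loss` and applies it to made-up sales values; it shows that, for the same relative error, under-prediction is penalised slightly more than over-prediction. The function name `smape` and the numbers are illustrative assumptions.

```python
import numpy as np

def smape(y_true, y_pred):
    """SMAPE in percent, same formula as smape_loss in this notebook."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))) * 200

y_true = np.array([100.0, 200.0])   # made-up sales values

over = smape(y_true, y_true * 1.1)   # predictions 10% too high
under = smape(y_true, y_true * 0.9)  # predictions 10% too low

print(f"SMAPE when over-predicting by 10%:  {over:.3f}")
print(f"SMAPE when under-predicting by 10%: {under:.3f}")
```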

In [None]:
def generate_main_features(df):
    new_df = df[["row_id", "date", "country", "segment", "num_sold"]].copy()
    
    ## one-hot encoding
    new_df['KaggleRama'] = df.store == 'KaggleRama'
    for country in ['Finland', 'Norway']:
        new_df[country] = df.country == country
    for product in ['Kaggle Mug', 'Kaggle Hat']:
        new_df[product] = df['product'] == product
           
    ## datetime features
    new_df['wd4'] = np.where(df.date.dt.weekday == 4, 1, 0)
    new_df['wd56'] = np.where(df.date.dt.weekday >= 5, 1, 0)
    
    dayofyear = df.date.dt.dayofyear
    for k in range(1, 3):
        sink = np.sin(dayofyear / 365 * 2 * math.pi * k)
        cosk = np.cos(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'mug_sin{k}'] = sink * new_df['Kaggle Mug']
        new_df[f'mug_cos{k}'] = cosk * new_df['Kaggle Mug']
        new_df[f'hat_sin{k}'] = sink * new_df['Kaggle Hat']
        new_df[f'hat_cos{k}'] = cosk * new_df['Kaggle Hat']
    new_df.drop(columns=['mug_sin1', 'mug_sin2'], inplace=True)
        
    # special days
    new_df = pd.concat([
        new_df,
        pd.DataFrame({f"dec{d}":(df.date.dt.month == 12) & (df.date.dt.day == d) for d in range(24, 32)}),
        pd.DataFrame({
            f"n-dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Norway') 
            for d in range(25, 32)
        }),
        pd.DataFrame({
            f"f-jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Finland')
            for d in range(1, 15)
        }),
        pd.DataFrame({
            f"n-jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Norway')
            for d in range(1, 10)
        }),
        pd.DataFrame({
            f"s-jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Sweden')
            for d in range(1, 15)
        })
    ], axis=1)
    
    # May and June
    new_df = pd.concat([
        new_df,
        pd.DataFrame({
            f"may{d}": (df.date.dt.month == 5) & (df.date.dt.day == d) 
            for d in list(range(1, 10))
        }),
        pd.DataFrame({
            f"may{d}": (df.date.dt.month == 5) & (df.date.dt.day == d) & (df.country == 'Norway')
            for d in list(range(18, 26)) + [27]
        }),
        pd.DataFrame({
            f"june{d}": (df.date.dt.month == 6) & (df.date.dt.day == d) & (df.country == 'Sweden')
            for d in list(range(8, 15))
        })
    ], axis=1)
    
    # Last Wednesday of June
    wed_june_map = {
        2015: pd.Timestamp(('2015-06-24')),
        2016: pd.Timestamp(('2016-06-29')),
        2017: pd.Timestamp(('2017-06-28')),
        2018: pd.Timestamp(('2018-06-27')),
        2019: pd.Timestamp(('2019-06-26'))
    }
    wed_june_date = df.date.dt.year.map(wed_june_map)
    new_df = pd.concat([
        new_df,
        pd.DataFrame({
            f"wed_june{d}": (df.date - wed_june_date == np.timedelta64(d, "D")) & (df.country != 'Norway')
            for d in list(range(-4, 5))
        })
    ], axis=1)
    
    # First Sunday of November
    sun_nov_map = {
        2015: pd.Timestamp(('2015-11-1')),
        2016: pd.Timestamp(('2016-11-6')),
        2017: pd.Timestamp(('2017-11-5')),
        2018: pd.Timestamp(('2018-11-4')),
        2019: pd.Timestamp(('2019-11-3'))
    }
    sun_nov_date = df.date.dt.year.map(sun_nov_map)
    new_df = pd.concat([
        new_df,
        pd.DataFrame({
            f"sun_nov{d}": (df.date - sun_nov_date == np.timedelta64(d, "D")) & (df.country != 'Norway')
            for d in list(range(0, 9))
        })
    ], axis=1)
    
    # First half of December (Independence Day of Finland, 6th of December)
    new_df = pd.concat([
        new_df,
        pd.DataFrame({
            f"dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Finland')
            for d in list(range(6, 15))
        }
    )], axis=1)

    # Easter
    easter_date = df.date.apply(lambda date: pd.Timestamp(easter.easter(date.year)))
    new_df = pd.concat([
        new_df,
        pd.DataFrame({
            f"easter{d}": (df.date - easter_date == np.timedelta64(d, "D"))
            for d in list(range(-2, 11)) + list(range(40, 48)) + list(range(51, 58))
        }),
        pd.DataFrame({
            f"n_easter{d}": (df.date - easter_date == np.timedelta64(d, "D")) & (df.country == 'Norway')
            for d in list(range(-3, 8)) + list(range(50, 61))
        })
    ], axis=1)
    
    features_list = [
        f for f in new_df.columns 
        if f not in ["row_id", "date", "segment", "country", "num_sold", "KaggleRama"]
    ]
    return new_df, features_list

def get_gdp(row):
    country = 'GDP_' + row.country
    return gdp_df.loc[row.date.year, country]
        
def get_cci(row):
    country = row.country
    time = f"{row.date.year}-{row.date.month:02d}"
    # CCI is not available for Norway, so Finland's values are used instead
    if country == 'Norway':
        country = 'Finland'
    return cci_df.loc[country[:3].upper(), time].Value

def generate_extra_features(df, features_list):
    df['gdp'] = np.log(df.apply(get_gdp, axis=1))
    df['cci'] = df.apply(get_cci, axis=1)
    features_list_upd = features_list + ["gdp", "cci"]
    
    return df, features_list_upd
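
The hard-coded dates in `wed_june_map` and `sun_nov_map` can also be derived instead of typed in by hand. The sketch below uses only the standard library; the helper names `last_weekday_of_month` and `first_weekday_of_month` are hypothetical and not part of the original notebook.

```python
import calendar
from datetime import date, timedelta

def last_weekday_of_month(year, month, weekday):
    """Last date in the given month that falls on `weekday` (Monday=0 ... Sunday=6)."""
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    return last_day - timedelta(days=(last_day.weekday() - weekday) % 7)

def first_weekday_of_month(year, month, weekday):
    """First date in the given month that falls on `weekday` (Monday=0 ... Sunday=6)."""
    first_day = date(year, month, 1)
    return first_day + timedelta(days=(weekday - first_day.weekday()) % 7)

years = range(2015, 2020)
wed_june = {y: last_weekday_of_month(y, 6, calendar.WEDNESDAY) for y in years}
sun_nov = {y: first_weekday_of_month(y, 11, calendar.SUNDAY) for y in years}

for y in years:
    print(y, wed_june[y], sun_nov[y])
```

Wrapping the values in `pd.Timestamp` would make these dictionaries usable in place of the hand-written maps.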

In [None]:
def fit_model(model, df, features_list, kaggle_rama_ratio):
    cols_to_scale = [
        'wd4', 'wd56', 'Finland', 'Norway', 'Kaggle Mug', 'Kaggle Hat',
        'mug_cos1', 'hat_sin1', 'hat_cos1', 'mug_cos2','hat_sin2', 'hat_cos2',
        'gdp'
    ]
    cols_to_scale = [f for f in cols_to_scale if f in features_list]
    stages = [('general', MinMaxScaler(), cols_to_scale)]
    if "cci" in features_list:
        stages.append(('cci', MinMaxScaler((0, 0.06)), ['cci']))
    column_tr = ColumnTransformer(stages, remainder=MinMaxScaler((0, 2.8)))
    dataprep_ppl = make_pipeline(column_tr, SimpleImputer(), StandardScaler(with_std=False))
    X_train = dataprep_ppl.fit_transform(df[features_list])
    y_train = df.num_sold.values.reshape(-1, 1).copy()
    y_train[df.KaggleRama.values > 0] = y_train[df.KaggleRama.values > 0] / kaggle_rama_ratio
    y_train = np.log(y_train).ravel()
    fitted_model = model.fit(X_train, y_train)
    model_coef = (
        pd.DataFrame({"name": features_list, "coef": np.abs(model.coef_)})
        .sort_values("coef", ascending=False)
        .reset_index(drop=True)
    )
        
    return dataprep_ppl, fitted_model, model_coef

def predict(dataprep_ppl, fitted_model, df, features_list, kaggle_rama_ratio):
    X_pred = dataprep_ppl.transform(df[features_list])
    y_pred = fitted_model.predict(X_pred)
    y_pred = np.exp(y_pred).reshape(-1, 1)
    y_pred[df.KaggleRama.values > 0] = y_pred[df.KaggleRama.values > 0] * kaggle_rama_ratio
    return y_pred

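def cross_validate(model, df, features_list, kaggle_rama_ratio, cv=None):
    # fall back to the 5-fold split used in this notebook when no splitter is passed
    if cv is None:
        cv = KFold(n_splits=5)
    np.random.seed(0)
    scores_list = []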
    df_train = df.query("segment == 'train'").reset_index(drop=True)
    model_coef = pd.DataFrame({"name": []})
    for fold, (train_idx, val_idx) in enumerate(cv.split(df_train)):
        df_tr, df_val = df_train.iloc[train_idx], df_train.iloc[val_idx]
        y_val = df_val.num_sold.values.reshape(-1, 1)
        
        dataprep_ppl, fitted_model, model_coef_ = fit_model(model, df_tr, features_list, kaggle_rama_ratio)
        model_coef = (
            model_coef
            .merge(
                model_coef_.rename(columns={"coef": f"coef_{fold}"}), 
                on="name", how="outer"
            )
        )
        y_val_pred = predict(dataprep_ppl, fitted_model, df_val, features_list, kaggle_rama_ratio)
        score = round(smape_loss(y_val, y_val_pred), 3)
        scores_list.append(score)
        
    return scores_list, model_coef

def make_submission(model, df, features_list, kaggle_rama_ratio, submission_path=""):
    df_train, df_test = df[df.segment == "train"].copy(), df[df.segment == "test"].copy()
    dataprep_ppl, fitted_model, _ = fit_model(model, df_train, features_list, kaggle_rama_ratio)
    df_sub = df_test[['row_id']].copy()
    df_sub["num_sold"] = predict(dataprep_ppl, fitted_model, df_test, features_list, kaggle_rama_ratio)
    df_sub["num_sold"] = np.round(df_sub["num_sold"])
    df_sub.to_csv(submission_path, index=False)
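
`fit_model` and `predict` transform the target in two mirrored steps: KaggleRama sales are divided by a constant ratio and log-transformed before fitting, then predictions are exponentiated and multiplied back. A minimal round-trip sketch, where the ratio and the sales numbers are hypothetical values chosen for illustration:

```python
import numpy as np

kaggle_rama_ratio = 1.75  # hypothetical value, for illustration only
num_sold = np.array([100.0, 175.0, 240.0, 420.0])  # made-up sales
is_rama = np.array([False, True, False, True])

# Forward (as in fit_model): bring KaggleRama sales onto the KaggleMart scale, then take logs
y = num_sold.copy()
y[is_rama] = y[is_rama] / kaggle_rama_ratio
y_log = np.log(y)

# Inverse (as in predict): exponentiate and rescale the KaggleRama rows
y_back = np.exp(y_log)
y_back[is_rama] = y_back[is_rama] * kaggle_rama_ratio

print(y)       # both stores now share one scale
print(y_back)  # original sales recovered
```

With both stores on one scale, the linear model does not need a separate set of store-specific coefficients, which is why `KaggleRama` is excluded from `features_list`.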

## Final improvement of polished kernel
## 1️⃣ Let's take existing TOP-1 winning solution with external data as a baseline

This solution already uses external data, so improving it won't be an easy walk ;-):

1) *GDP statistics per year/country* ("gdp-20152019-finland-norway-and-sweden" dataset);  
2) *Consumer Confidence Index* per year/month/country (Value field of "oecd-consumer-confidence-index" dataset).

We want to improve the winning kernel by finding new/better external features.  

The feature engineering logic is the same as in the original [notebook](https://www.kaggle.com/code/ambrosm/tpsjan22-10-advanced-linear-model-with-cci/notebook) by [@ambrosm](https://www.kaggle.com/ambrosm).  
So let's calculate metrics for the baseline solution.  
First, read the train/test data from CSV, combine them into one dataframe and generate features:


In [None]:
input_data_path = "/kaggle/input"
df, kaggle_rama_ratio = read_main_data(input_data_path)

# a lot of calendar based features
df, baseline_features = generate_main_features(df)
print(df.shape)
print("Number of features:", len(baseline_features))
df.segment.value_counts()

# features from GDP and CCI
gdp_df, cci_df = read_additional_data()
df, top_solution_features = generate_extra_features(df, baseline_features)
print(df.shape)
set(top_solution_features) - set(baseline_features)

Define the model and the cross-validation split, then run cross-validation to estimate model accuracy:

In [None]:
model = Ridge(alpha=0.2, tol=0.00001, max_iter=10000)
cv = KFold(n_splits=5)

top_solution_scores, model_coef = cross_validate(model, df, top_solution_features, kaggle_rama_ratio, cv=cv)
print("Top solution SMAPE by folds:", top_solution_scores)
print("Top solution avg SMAPE:", sum(top_solution_scores)/len(top_solution_scores))
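
To see what the splitter actually does, the sketch below prints fold indices on ten dummy rows. `KFold(n_splits=5)` without shuffling validates on contiguous blocks of rows, so if the rows are sorted by date each fold is a contiguous period. `TimeSeriesSplit`, which is imported above but not used in this notebook, is shown only for contrast: it always trains on rows that come before the validation block. This is an illustration of the splitters, not a change to the baseline setup.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten rows standing in for date-ordered observations

kfold_splits = [(tr.tolist(), va.tolist()) for tr, va in KFold(n_splits=5).split(X)]
ts_splits = [(tr.tolist(), va.tolist()) for tr, va in TimeSeriesSplit(n_splits=4).split(X)]

for tr, va in kfold_splits:
    print("KFold           train:", tr, "val:", va)
for tr, va in ts_splits:
    print("TimeSeriesSplit train:", tr, "val:", va)
```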

Now make the submission file:

In [None]:
submission_path = 'submission_top_solution.csv'
make_submission(model, df, top_solution_features, kaggle_rama_ratio, submission_path)

This submission scores **4.13** on the public LB and **4.55** on the private LB (**first place in the competition**). 
This is our baseline.  
Can we improve the first place solution even more? Let's find out!

## 2️⃣ Find relevant external features

To find new features we'll use the [Upgini feature search and enrichment library for supervised machine learning applications](https://github.com/upgini/upgini#readme).  
To start a search with the Upgini library, you need to define so-called [*search keys*](https://github.com/upgini/upgini#-search-key-types-we-support-more-is-coming) - a set of columns used to join external data sources. In this competition we can use the following keys:

1. Column **date** should be used as **SearchKey.DATE**;  
2. Column **country** (after conversion to an ISO-3166 country code) should be used as **SearchKey.COUNTRY**.
    
With this set of search keys, our dataset will be matched with [different time-specific features (such as weather data, calendar data, financial data, etc)](https://github.com/upgini/upgini#-connected-data-sources-and-coverage), taking into account the country where the sales happened. Then relevant features are selected and ranked.  
As a result, we'll add only new, relevant features with additional information about specific dates and countries.
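
Conceptually, search keys play the role of join keys. The sketch below is not Upgini's implementation; it only illustrates the idea with a plain pandas left join on `date` and `country_iso`. The `ext_index` column and all values are made up.

```python
import pandas as pd

# Made-up competition-like rows
sales = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-02"]),
    "country_iso": ["FI", "NO", "FI"],
    "num_sold": [329, 520, 318],
})

# Made-up external source keyed by the same two columns (no rows for Norway)
external = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-02"]),
    "country_iso": ["FI", "FI"],
    "ext_index": [100.2, 100.5],
})

# A left join keeps every original row; keys without a match get NaN
enriched = sales.merge(external, on=["date", "country_iso"], how="left")
print(enriched)
```

Rows without a match end up with missing values, which is why the data preparation pipeline in this notebook includes a `SimpleImputer`.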

In [None]:
from upgini import SearchKey

# here we simply map each country to its ISO-3166 code
country_iso_map = {
    "Finland": "FI",
    "Norway": "NO",
    "Sweden": "SE"
}
df["country_iso"] = df.country.map(country_iso_map)

## define search keys
search_keys = {
    "date": SearchKey.DATE, 
    "country_iso": SearchKey.COUNTRY
}

To start the search, we need to create a *scikit-learn* compatible `FeaturesEnricher` transformer with the appropriate **search** parameters. After that, we can call the **fit** method of `features_enricher` to start the search.
> The ratio between KaggleRama and KaggleMart sales is constant, so we'll use only KaggleMart sales for feature search and model training

In [None]:
%%time
from upgini import FeaturesEnricher
from upgini.metadata import CVType, RuntimeParameters

## define X_train / y_train, keep only KaggleMart rows (drop KaggleRama)
condition = (df.segment == "train") & (~df.KaggleRama)
X_train, y_train = df.loc[condition, list(search_keys.keys()) + top_solution_features], df.loc[condition, "num_sold"]

## define Features Enricher
features_enricher = FeaturesEnricher(
    search_keys = search_keys
)

`FeaturesEnricher.fit()` has a `calculate_metrics` flag for a quick estimate of the quality improvement on cross-validation and eval sets. This step is quite similar to [sklearn.model_selection.cross_val_score](https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics), so you can pass the exact metric via the `scoring` parameter:

1. Built-in scoring [functions](https://github.com/upgini/upgini/blob/main/README.md#-accuracy-and-uplift-metrics-calculations);
2. Custom scorer (in this case - a scorer based on SMAPE loss).    

We also pass the final Ridge model via the `estimator` parameter, so that the metrics shown in the search results are calculated with the same model.  
Notice that you should pass **X_train** as the first argument and **y_train** as the second argument to `FeaturesEnricher.fit()`, just like in scikit-learn.  

*This step will take around 3.5 minutes*

In [None]:
%%time
from sklearn.metrics import make_scorer

## define SMAPE custom scoring function
scorer = make_scorer(smape_loss, greater_is_better=False)
scorer.__name__ = "SMAPE"

## launch fit
features_enricher.fit(X_train, y_train,
                      calculate_metrics = True,
                      scoring = scorer,
                      estimator = model)
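
One detail of `make_scorer(..., greater_is_better=False)` is easy to miss: scikit-learn negates the loss, so that a higher scorer value always means a better model. A minimal sketch with a dummy estimator and made-up targets:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer

def smape_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))) * 200

X = np.zeros((2, 1))
y = np.array([100.0, 200.0])                       # made-up targets
dummy = DummyRegressor(strategy="mean").fit(X, y)  # always predicts the mean, 150

raw_smape = smape_loss(y, dummy.predict(X))
scorer = make_scorer(smape_loss, greater_is_better=False)
scorer_value = scorer(dummy, X, y)

print("raw SMAPE:   ", raw_smape)
print("scorer value:", scorer_value)  # same magnitude, opposite sign
```

So a negative value coming from such a scorer is expected; compare magnitudes when reading the results.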

We've got **60+ relevant features** which might improve the accuracy of the model, ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).  

Initial features from the search dataset are checked for relevance as well, so you don't need an extra feature selection step.

The SMAPE uplift after enrichment with all of the new external features is *negative* - it doesn't make sense to use ALL of them for a linear Ridge model.  
Let's enrich the initial feature space with only the **TOP-3** most important features.

>Generally it's a bad idea to put a lot of features with unknown structure (and possibly high pairwise correlation) into a linear model, like Ridge or Lasso, without careful selection and pre-processing.
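
One simple way to inspect pairwise correlation before feeding features into a linear model is to flag columns that are nearly duplicates of earlier ones. The sketch below uses synthetic data; the 0.95 threshold and the column names are arbitrary assumptions, and this check is not something the notebook or Upgini performs here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "c": rng.normal(size=200),                        # unrelated feature
})

threshold = 0.95  # hypothetical cut-off
corr = features.corr().abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print("highly correlated columns to review:", to_drop)
```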

*Step will take around 2 minutes*

In [None]:
%%time

## call transform and enrich dataset with TOP-3 features only
df_enriched = features_enricher.transform(df, max_features=3, keep_input = True)

## put top-3 new external features names into selected features list
enricher_features = [
    f for f in features_enricher.get_features_info().feature_name.values
    if f not in list(search_keys.keys()) + top_solution_features
]
best_enricher_features = enricher_features[:3]

In [None]:
print("Top-3 of found features:")
best_enricher_features

## 3️⃣ Submit and calculate final leaderboard progress
Let's estimate model quality and make a submission file:

In [None]:
# same cross-validation split and model estimator as for the baseline in step 1️⃣
upgini_scores, model_coef = cross_validate(
    model, df_enriched, 
    top_solution_features + best_enricher_features, 
    kaggle_rama_ratio, cv=cv
)
print("Top solution SMAPE by folds:", top_solution_scores)
print("Upgini SMAPE by folds:", upgini_scores)
print("Top solution avg SMAPE:", sum(top_solution_scores)/len(top_solution_scores))
print("Upgini avg SMAPE:", sum(upgini_scores)/len(upgini_scores))

In [None]:
submission_path = 'submission.csv'
make_submission(
    model, df_enriched, 
    top_solution_features + best_enricher_features, 
    kaggle_rama_ratio, submission_path
)

This submission scores **4.095** on the public LB and **4.50** on the private LB.  
As a reminder, the baseline TOP-1 solution had **4.13** on the public LB and **4.55** on the private LB (*with 2 external data sources already*).   
**We've got a consistent improvement on both the public and private parts of the LB!**
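
The size of the uplift can be put in relative terms with a few lines of arithmetic on the leaderboard scores quoted above (lower SMAPE is better):

```python
# Leaderboard SMAPE scores quoted in this notebook
baseline = {"public": 4.13, "private": 4.55}
enriched = {"public": 4.095, "private": 4.50}

relative_gain = {}
for part in baseline:
    abs_gain = baseline[part] - enriched[part]
    relative_gain[part] = abs_gain / baseline[part] * 100
    print(f"{part}: SMAPE {baseline[part]} -> {enriched[part]} "
          f"({relative_gain[part]:.2f}% lower)")
```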

## 🌎 Relevant external features & data sources
Leaderboard accuracy improved after enrichment with 3 new external features:

* **f_cci_1y_shift_0fa85f6f** - Consumer Confidence Index with a 1 year lag. It's a derivative of the Consumer Confidence Index value for Finland and Sweden (CCI is not available for Norway). CCI has already been introduced as a feature in the baseline notebook, but only as the raw index value, scaled at the data preparation step.

* **f_pcpiham_wt_531b4347** - Consumer Price Index for the Health group of products & services. In general, Consumer Price Indexes are index numbers that measure changes in the prices of goods and services purchased or otherwise acquired by households, which households use directly, or indirectly, to satisfy their own needs and wants.  
So it carries a lot of information about inflation in a specific country and for a specific type of services and goods.  
It is updated monthly by the [Organisation for Economic Cooperation and Development (OECD)](https://data.oecd.org/price/inflation-cpi.htm).

* **f_cci_6m_shift_653a5999** - Consumer Confidence Index with 6 months lag.
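
How Upgini builds the lagged CCI features is not described in this notebook. One plausible construction, assuming a monthly CCI series per country, is a grouped `shift`. The values below are synthetic, and the column names `cci_6m_shift` and `cci_1y_shift` are hypothetical.

```python
import pandas as pd

months = pd.date_range("2014-01-01", periods=24, freq="MS")
cci = pd.DataFrame({
    "country": ["FIN"] * 24 + ["SWE"] * 24,
    "month": list(months) * 2,
    # synthetic index values, for illustration only
    "cci": [100 + 0.1 * i for i in range(24)] + [99 + 0.2 * i for i in range(24)],
})

# Shift within each country so lags never leak across countries
cci = cci.sort_values(["country", "month"])
cci["cci_6m_shift"] = cci.groupby("country")["cci"].shift(6)
cci["cci_1y_shift"] = cci.groupby("country")["cci"].shift(12)

print(cci[cci["month"] == "2015-01-01"])
```

The first 6 (or 12) months of each country have no lagged value, so the source series has to start earlier than the training period to avoid missing values.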

## References
* [How to calculate the SMAPE score](https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298201) by [@carlmcbrideellis](https://www.kaggle.com/carlmcbrideellis);
* [Approximating SMAPE](https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298473) by [@ambrosm](https://www.kaggle.com/ambrosm).
