# 🔎 How to find relevant external features & data for Kaggle competitions in 10 minutes [2/3]
### Part #2 - Zero feature engineering with low-code libraries: Upgini + PyCaret
##### [Part #1 Link](https://www.kaggle.com/code/romaupgini/guide-how-to-find-relevant-external-features-1)
##### [Part #3 Link](https://www.kaggle.com/code/romaupgini/external-data-features-for-multivariate-ts)
______________________________
*updated 26.05.22 [@roma-upgini](https://www.kaggle.com/romaupgini)*

**❓ Before reading the notebook, what will you learn from it?**

1. How external data & features might help on Kaggle: two scenarios
2. How to find relevant external features in less than 10 minutes and save time on feature engineering 
3. How to calculate metrics and uplifts from new external features
4. What external data sources might help you on Kaggle competitions

🗣 Share this notebook: [Shareable Link](https://www.kaggle.com/code/romaupgini/zero-feature-engineering-with-upgini-pycaret)
_______________________________
### Table of contents
* [Intro](#Intro)

* [How external data & features might help on Kaggle?](#How-external-data-&-features-might-help-on-Kaggle?)

* [Packages and functions](#Packages-and-functions)

* [Low-code feature engineering - add relevant external features @start](#Low-code-feature-engineering---add-relevant-external-features-@start-and-without-manual-effort)

    - [1️⃣ Read train & test data](#1%EF%B8%8F%E2%83%A3-Read-train-&-test-data)
    - [2️⃣ Find relevant external features](#2️⃣-Find-relevant-external-features)
    - [3️⃣ Build model using PyCaret](#3️⃣-Build-model-using-PyCaret)
    
    
* [Relevant external features & data sources](#%F0%9F%8C%8E-Relevant-external-features-&-data-sources)

## Intro
**Competition**: [TPS January 2022](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022), SMAPE as a target metric.

**Special thanks**: [@maxencefzr](https://www.kaggle.com/maxencefzr) for the [notebook](https://www.kaggle.com/code/maxencefzr/tps-jan22-catboost-using-pycaret/notebook) with PyCaret usage example.

📚 In this notebook we'll use: 

* [Upgini](https://github.com/upgini/upgini) - Low-code feature search and enrichment library for supervised machine learning applications;
* [PyCaret](https://pycaret.org/) - Low-code machine learning library in Python that automates machine learning workflows.  

<a href="https://github.com/upgini/upgini">
    <img src="https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white"  align='center'>
</a>

## How external data & features might help on Kaggle?
Kaggle is always about learning and leaderboard progress (hopefully from learning, not cheating ;-))  
Every Kaggler wants to progress as fast as possible, so time-saving tips & tricks are a big deal as well.  
That's why low-code tools are widely adopted among Kagglers.

There are **two major scenarios** for introducing external features & data in Kaggle competitions:

1. **Final improvement of a polished kernel**  
In this scenario you want **to improve an already polished kernel** (optimized features, model architecture and hyperparameters) with new external features.  
By that point, most of the juice has already been "squeezed" from the competition data through significant feature engineering effort.  
And you want to answer a simple question: *Are there any external data sources and features that might boost accuracy a bit more?*  
However, there is a caveat to this approach: the current model architecture & hyperparameters might be suboptimal for the new feature set, even after introducing a single new variable.  
So an extra step back for model tuning might be needed.

2. **Low-code initial feature engineering - add relevant external features @start**  
Here you want to **save time on feature search and engineering**. If there are ready-to-use external features and data, let's use them to speed up overall progress.  
In this scenario it always makes sense to check that the new external features have an optimal representation for the specific task and target model architecture. For example, categorical features for linear regression models should be one-hot-encoded.
This type of feature preparation has to be done manually in any case.  
As in scenario #1, there is a caveat: more features are not always a good thing, since they increase dimensionality and might lead to model overfitting.  
So you have to check the model accuracy metrics after enrichment with the new features, and ALWAYS with an appropriate cross-validation strategy.
 
In this notebook, we'll go with **Scenario #2**. You can also check out the guide for [**Scenario #1**](https://www.kaggle.com/code/romaupgini/guide-how-to-find-relevant-external-features-1).
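
The one-hot-encoding remark in scenario #2 can be illustrated with a minimal sketch. The toy dataframe and the store names `store_a` / `store_b` are made up for illustration, and `pd.get_dummies` is just one of several ways to encode categories for a linear model:

```python
import pandas as pd

# Toy data: column names follow the competition, the values are illustrative only
toy = pd.DataFrame({
    "country": ["Finland", "Norway", "Sweden", "Norway"],
    "store": ["store_a", "store_b", "store_a", "store_a"],
})

# One binary indicator column per category value
encoded = pd.get_dummies(toy, columns=["country", "store"], dtype=int)
print(encoded.columns.tolist())
```

Tree-based models such as CatBoost can usually consume categorical columns without this step, which is one reason the representation check depends on the target model.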

## Packages and functions

In [None]:
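%pip install -Uq upgini
# The PyCaret calls below (`silent=True` in setup, the 'Label' prediction column) follow the 2.x API, so pin the major version
%pip install -q "pycaret<3"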

import logging
logging.getLogger("logs").setLevel("DEBUG")

import pandas as pd
import numpy as np
from pycaret.regression import *

def smape(actual, predicted):
    numerator = np.abs(predicted - actual)
    denominator = (np.abs(actual) + np.abs(predicted)) / 2
    
    return np.mean(numerator / denominator)*100

def read_main_data(input_data_path):
    train_df = pd.read_csv(f'{input_data_path}/tabular-playground-series-jan-2022/train.csv')
    test_df = pd.read_csv(f'{input_data_path}/tabular-playground-series-jan-2022/test.csv')
    train_df["segment"], test_df["segment"] = "train", "test"
    df = pd.concat([train_df, test_df]).reset_index(drop=True)
    df['date'] = pd.to_datetime(df.date)
    
    return df
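
As a quick sanity check of the SMAPE definition above, here is a small worked example. The function is repeated so the snippet runs on its own, and the two toy values are made up for illustration:

```python
import numpy as np

def smape(actual, predicted):
    # Symmetric mean absolute percentage error, in percent
    numerator = np.abs(predicted - actual)
    denominator = (np.abs(actual) + np.abs(predicted)) / 2
    return np.mean(numerator / denominator) * 100

actual = np.array([100.0, 200.0])
predicted = np.array([110.0, 180.0])

# Per-row terms: 10 / 105 and 20 / 190, then averaged and scaled to percent
toy_smape = smape(actual, predicted)
print(round(toy_smape, 3))
```

Note that the denominator is zero when both the actual and the predicted value are zero, so such rows would need special handling.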

## Low-code feature engineering - add relevant external features @start and without manual effort
The main idea of this notebook is to build a baseline solution using only low-code machine learning tools. Namely, we search, generate and select relevant features with Upgini, then we prepare the data and build the final model with PyCaret. 

The entire code for data preparation, feature engineering and modelling takes only a few lines, so you don't have to spend a lot of time doing all these operations manually.

## 1️⃣ Read train & test data
Read the train and test data from CSV and combine them into one dataframe:

In [None]:
input_data_path = "/kaggle/input"
df = read_main_data(input_data_path)

print("Train + test dataframe size:", df.shape)
df.head()

Let's define a list of baseline features (we'll simply use "country", "store" and "product" as is). As you can see, there is no feature engineering at all: all three columns are present in the initial train and test datasets:

In [None]:
baseline_features = [
    "country", "store", "product"
]

Let's also predict the logarithm of the "num_sold" field (computed as `log(1 + num_sold)` via `np.log1p`) instead of predicting "num_sold" directly:

In [None]:
df["num_sold_log"] = np.log1p(df["num_sold"])
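
The log transform is reversible, which matters later when predictions are converted back to sales counts. A minimal sketch of the round trip, using made-up toy values:

```python
import numpy as np

# Toy sales counts, including zero, which log1p handles without error
num_sold = np.array([0.0, 9.0, 99.0, 999.0])

num_sold_log = np.log1p(num_sold)   # log(1 + x)
restored = np.expm1(num_sold_log)   # exp(x) - 1, the inverse transform

print(num_sold_log.round(3))
print(restored.round(3))
```

Training on the log scale makes errors roughly relative rather than absolute, which is a common (though not guaranteed) fit for percentage-style metrics such as SMAPE.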

## 2️⃣ Find relevant external features
To find new features we'll use the [Upgini feature search and enrichment library for supervised machine learning applications](https://github.com/upgini/upgini#readme).  
To initiate a search with the Upgini library, you need to define so-called [*search keys*](https://github.com/upgini/upgini#-search-key-types-we-support-more-is-coming) - a set of columns used to join external data sources. In this competition we can use the following keys:

1. Column **date** should be used as **SearchKey.DATE**;  
2. Column **country** (after conversion to an ISO-3166 country code) should be used as **SearchKey.COUNTRY**.
    
With this set of search keys, our dataset will be matched with [different time-specific features (such as weather data, calendar data, financial data, etc)](https://github.com/upgini/upgini#-connected-data-sources-and-coverage), taking into account the country where the sales happened. Then the features are ranked and only the relevant ones are selected.  
As a result, we'll add only new, relevant features that carry additional information about specific dates and countries.

In [None]:
from upgini import SearchKey 

# here we simply map each country to its ISO-3166 code
country_iso_map = {
    "Finland": "FI",
    "Norway": "NO",
    "Sweden": "SE"
}
df["country_iso"] = df.country.map(country_iso_map)
df.country_iso.value_counts()

## define search keys
search_keys = {
    "date": SearchKey.DATE, 
    "country_iso": SearchKey.COUNTRY
}
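
Since `Series.map` silently returns `NaN` for values missing from the dictionary, it can be worth checking the mapping before using the column as a search key. A minimal sketch on a toy dataframe; "Denmark" is a deliberately unmapped, hypothetical value that does not come from the competition data:

```python
import pandas as pd

country_iso_map = {"Finland": "FI", "Norway": "NO", "Sweden": "SE"}

# Toy frame with one country that the map does not cover
toy = pd.DataFrame({"country": ["Finland", "Norway", "Sweden", "Denmark"]})
toy["country_iso"] = toy["country"].map(country_iso_map)

# Rows where the mapping produced NaN point at missing dictionary entries
unmapped = sorted(toy.loc[toy["country_iso"].isna(), "country"].unique())
print("Countries without an ISO code:", unmapped)
```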

To start the search, we need to initialize the *scikit-learn* compatible `FeaturesEnricher` transformer with appropriate **search** parameters and a cross-validation type (here we use [TimeSeries](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) CV, because our target variable strongly depends on time, i.e. we have a time series prediction task).  
After that, we can call the **fit** method of `features_enricher` to start the search.

In [None]:
from upgini import FeaturesEnricher
from upgini.metadata import CVType, RuntimeParameters

## define X_train & y_train
X_train = df.loc[df.segment == "train", ["date", "country_iso"] + baseline_features]
y_train = df.loc[df.segment == "train", "num_sold_log"]

## define FeaturesEnricher
features_enricher = FeaturesEnricher(
    search_keys=search_keys, 
    cv=CVType.time_series
)
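
To see what time series cross-validation means in practice, here is a minimal sketch with scikit-learn's `TimeSeriesSplit` on a toy range of 12 dates. Whether `CVType.time_series` uses exactly this splitter internally is an assumption; the sketch only illustrates the general idea that every validation fold lies strictly after its training data:

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy date index, already sorted in time (made up for illustration)
dates = pd.date_range("2015-01-01", periods=12, freq="D")

splitter = TimeSeriesSplit(n_splits=5)
folds = []
for train_idx, valid_idx in splitter.split(dates):
    folds.append((train_idx, valid_idx))
    print(
        f"train until {dates[train_idx[-1]].date()}, "
        f"validate {dates[valid_idx[0]].date()} .. {dates[valid_idx[-1]].date()}"
    )
```

The splitter works on row positions, so real data with several rows per date has to be sorted by date first for the folds to respect time order.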

`FeaturesEnricher.fit()` has a `calculate_metrics` flag for a quick estimate of the quality improvement on cross-validation and eval sets. This step is quite similar to [sklearn.model_selection.cross_val_score](https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics), so you can pass a specific metric with the `scoring` parameter:

1. Built-in scoring [functions](https://github.com/upgini/upgini/blob/main/README.md#-accuracy-and-uplift-metrics-calculations) (in this case, a scorer based on Mean Squared Error);
2. Custom scorer.

Notice that you should pass **X_train** as the first argument and **y_train** as the second argument for `FeaturesEnricher.fit()`, just like in scikit-learn.  
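
A custom scorer in the scikit-learn sense can be built with `make_scorer`. The sketch below wraps SMAPE computed on the original scale for a log-transformed target; the helper name `smape_from_log`, the toy data and the `DummyRegressor` baseline are illustrative assumptions, and how such a scorer is passed to `FeaturesEnricher` depends on the Upgini version, so check the library's README for that part:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer

def smape_from_log(y_true_log, y_pred_log):
    # Undo log1p first, then compute SMAPE on the original scale (in percent)
    actual = np.expm1(np.asarray(y_true_log, dtype=float))
    predicted = np.expm1(np.asarray(y_pred_log, dtype=float))
    denominator = (np.abs(actual) + np.abs(predicted)) / 2
    return float(np.mean(np.abs(predicted - actual) / denominator) * 100)

# Lower SMAPE is better, so the scorer returns the negated value
smape_scorer = make_scorer(smape_from_log, greater_is_better=False)

# Toy data and a trivial baseline model, for illustration only
X_toy = np.arange(6).reshape(-1, 1)
y_toy_log = np.log1p(np.array([100.0, 120.0, 90.0, 110.0, 130.0, 95.0]))
baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy_log)

score = smape_scorer(baseline, X_toy, y_toy_log)
print(score)
```

Because of `greater_is_better=False`, the printed score is negative; its absolute value is the SMAPE in percent.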

*Step will take around 3.5 minutes*

In [None]:
%%time

features_enricher.fit(X_train, y_train, calculate_metrics = True)

We've got **70+ relevant features** that might improve the accuracy of the model, ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).  

The uplift after enrichment with all of the new external features is *positive*, so the features from the search actually contain some useful information about our target variable. Let's enrich the initial feature space with the features found.

*Step will take around 2 minutes*

In [None]:
## call transform and enrich dataset
df_enriched = features_enricher.transform(df, keep_input=True)

## define list of found features
enricher_features = [
    f for f in features_enricher.get_features_info().feature_name.values
    if f not in ["date", "country_iso"] + baseline_features
]

## 3️⃣ Build model using PyCaret

Now it's time to build a model using PyCaret. First, we need to configure **setup** for model training. 

Notice that we use TimeSeries cross-validation with 5 folds, just like we did during feature search:

In [None]:
_ = setup(
    data = df_enriched[df_enriched.segment == "train"],
    target = "num_sold_log",
    numeric_features = enricher_features,
    categorical_features = baseline_features,
    ignore_features=["date", "country_iso", "segment", "num_sold"],
    fold_strategy = 'timeseries',
    fold = 5,
    data_split_shuffle = False, 
    silent = True
)

Let's fit a model based on the CatBoost regressor with the **create_model** function. Notice that we don't need to apply additional data preparation (for example, encoding of categorical features) - PyCaret does it automatically:

In [None]:
model = create_model('catboost', random_state=0)
model

That's it, our model is ready! Let's plot feature importances for the selected features:

In [None]:
plot_model(model, 'feature')

We can also interpret the model's results using a SHAP summary plot:

In [None]:
interpret_model(model)

Finally, let's make predictions for the test part of the dataset and submit them:

In [None]:
submission = pd.read_csv('../input/tabular-playground-series-jan-2022/sample_submission.csv')
y_pred = predict_model(model, data=df_enriched[df_enriched.segment == "test"])['Label']
submission["num_sold"] = np.round(np.expm1(y_pred.values))
submission.to_csv('submission.csv', index=False)

As a result, we've built quite a good solution for the competition **without manual data preparation, feature engineering and modelling**, thanks to the low-code tools! 

## 🌎 Relevant external features & data sources

Here is a description of the **TOP-2** most important features from the Upgini enrichment:

* **f_NID_NGDP_f8ff00f6** - Total investment as a percentage of the country's GDP. 

* **f_pcpiham_wt_531b4347** - Consumer Price Index for the Health group of products & services. In general, consumer price indexes are index numbers that measure changes in the prices of goods and services purchased or otherwise acquired by households, which households use directly, or indirectly, to satisfy their own needs and wants.  
So it carries a lot of information about inflation in a specific country and for a specific type of goods and services.  
It is updated on a monthly basis by the [Organisation for Economic Cooperation and Development (OECD)](https://data.oecd.org/price/inflation-cpi.htm).