# Welcome and have fun learning



#### Linear regression excels at extrapolating trends, but can't learn interactions. XGBoost excels at learning interactions, but can't extrapolate trends. We'll learn how to create "hybrid" forecasters that combine complementary learning algorithms and let the strengths of one make up for the weakness of the other. 

- Feature engineering and linear model based on the excellent notebook by @ambrosm: https://www.kaggle.com/ambrosm/tpsjan22-03-linear-model
- Hybrid model from Time series course: https://www.kaggle.com/learn/time-series
- Holidays dataset: https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298990

The original objective of this notebook was a ~simple~ and robust time series regression that can be reused in future work.
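
The hybrid idea described above can be sketched in a few lines: fit a linear model for the trend, fit a second model on the residuals, and add the two predictions. This is only a minimal illustration on synthetic data, not the model built in this notebook. The per-weekday residual mean is a deliberately simple stand-in for the second-stage learner (XGBoost in the course material), and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily series (illustrative only): linear trend + weekday effect + noise.
n_train, n_future = 140, 14
t = np.arange(n_train + n_future)
weekday = t % 7
weekday_effect = np.array([0.0, 0.0, 0.0, 0.0, 2.0, 5.0, 5.0])
y = 10.0 + 0.1 * t + weekday_effect[weekday] + rng.normal(0.0, 0.1, t.size)

y_train, y_future = y[:n_train], y[n_train:]

# Stage 1: a linear model captures the trend and can extrapolate it.
A_train = np.column_stack([np.ones(n_train), t[:n_train]])
A_future = np.column_stack([np.ones(n_future), t[n_train:]])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
trend_train = A_train @ coef
trend_future = A_future @ coef

# Stage 2: a second model learns what the first one missed (the residuals).
# Stand-in for a tree-based learner: the mean residual per weekday.
residuals = y_train - trend_train
residual_by_weekday = np.array(
    [residuals[weekday[:n_train] == d].mean() for d in range(7)]
)

# Hybrid forecast = extrapolated trend + predicted residual.
y_hybrid = trend_future + residual_by_weekday[weekday[n_train:]]

mae_trend = np.mean(np.abs(y_future - trend_future))
mae_hybrid = np.mean(np.abs(y_future - y_hybrid))
print(f"MAE trend only: {mae_trend:.3f}  MAE hybrid: {mae_hybrid:.3f}")
```

On this toy series the trend-only forecast misses the weekday pattern, while the hybrid forecast recovers it from the residuals.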

<blockquote style="margin-right:auto; margin-left:auto; padding: 1em; margin:24px;">
    <strong>Fork This Notebook!</strong><br>
Create your own editable copy of this notebook by clicking on the <strong>Copy and Edit</strong> button in the top right corner.
</blockquote>

**Notes:**

## Imports and Configuration ##

In [None]:
!pip install scikit-learn -U
# Intel® Extension for Scikit-learn installation:
!pip install scikit-learn-intelex
from sklearnex import patch_sklearn
patch_sklearn()

In [None]:
# !pip install git+https://github.com/scikit-learn-contrib/py-earth@v0.2dev

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

from scipy import stats
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator, FormatStrFormatter, PercentFormatter
import seaborn as sns


import ipywidgets as widgets

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
import lightgbm as lgb

from datetime import date
import holidays
import calendar
import dateutil.easter as easter

from collections import defaultdict
le = defaultdict(LabelEncoder)

# Set Matplotlib defaults
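# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in Matplotlib 3.6
_style = "seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid"
plt.style.use(_style)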
plt.rc("figure", autolayout=True, figsize=(12, 8))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)
%config InlineBackend.figure_format = 'retina'


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import gc
import os
import math
import random

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Fine tuning
Most of the global variables are not used.

In [None]:
# -----------------------------------------------------------------
# Some parameters to config 
PRODUCTION = True # True: For submission run. False: Fast trial run

# Hyperparameters
FOLDS = 15 if PRODUCTION else 5   # Only 5 or 10.
REPEAT = 5 if PRODUCTION else 1
SEED_START = 0

# NN hyperparameters
EPOCHS = 500        # Upper bound only: early stopping ends training sooner. A deep network should not need too many epochs to learn
HIDDEN_LAYERS = (200, 100)

RANDOM_STATE = 42
VERBOSE = 0

# Admin
ID = "row_id"            # Name of the row id column, used as the index
INPUT = "../input/tabular-playground-series-jan-2022"
GPU = False          # True: use GPU.
FEATURE_ENGINEERING = True

PSEUDO_LABEL = False # Pseudo labels are not ground truth and will not help long term; only used for the final push
BLEND = False        # Blend previous run
PSEUDO_DIR = "../input/tpsjan22-10-advanced-linear-model-with-cci/submission_linear_model_rounded.csv"
PSEUDO_DIR2 = "../input/tpsjan22-10-advanced-linear-model-with-cci/submission_linear_model_rounded.csv"

N_ESTIMATORS = 700 if PSEUDO_LABEL else 240
LOSS_CORRECTION = 1

# time series data common new feature  
DATE = "date"
YEAR = "year"
QUARTER = "quarter"
MONTH = "month"
WEEK = "week"
DAY = "day"
DAYOFYEAR = "dayofyear"
WEEKOFYEAR = "weekofyear"
DAYOFMONTH = "dayofMonth"
DAYOFWEEK = "dayofweek"
WEEKDAY = "weekday"


In [None]:
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

seed_everything(RANDOM_STATE)

# Loss function SMAPE

$$
\text{SMAPE} = \frac{100}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} \frac{w_i \, |a_i - t_i|}{(|t_i| + |a_i|) / 2}
$$

In [None]:
# https://www.kaggle.com/c/web-traffic-time-series-forecasting/discussion/36414
def smape_loss(y_true, y_pred):
    """
    SMAPE loss (symmetric mean absolute percentage error), in percent.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Ground truth (correct) target values.
    y_pred : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Estimated target values.

    Returns
    -------
    loss : float
        Unweighted mean of the per-element SMAPE terms over all elements.
        SMAPE output is non-negative floating point. The best value is 0.0.
    """
    assert y_true.shape == y_pred.shape
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0  # both values are 0: count the term as no error
    return np.mean(diff)
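
As a quick sanity check of the SMAPE formula, here is a small worked example. It is a self-contained sketch with a hypothetical helper name (`smape`), written to avoid the division-by-zero warning; it uses unit weights, so it corresponds to the unweighted mean computed by `smape_loss`.

```python
import numpy as np

def smape(y_true, y_pred):
    """Unweighted SMAPE in percent; terms with a zero denominator count as 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    errors = np.zeros_like(denominator)
    nonzero = denominator != 0
    errors[nonzero] = np.abs(y_true - y_pred)[nonzero] / denominator[nonzero]
    return 100.0 * errors.mean()

y_true = np.array([100.0, 200.0, 0.0])
y_pred = np.array([110.0, 180.0, 0.0])

# Per-sample terms: 10/105, 20/190 and 0 (true value and prediction are both zero).
score = smape(y_true, y_pred)
print(round(score, 3))  # about 6.683
```

Because the error is divided by the average magnitude of the true value and the prediction, an error of 10 on a sale of 100 and an error of 20 on a sale of 200 contribute similar amounts.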


In [None]:
# https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/282735
def better_than_median(inputs, axis):
    """Compute the mean of the predictions if there are no outliers,
    or the median if there are outliers.

    Parameter: inputs = ndarray of shape (n_samples, n_folds)"""
    spread = inputs.max(axis=axis) - inputs.min(axis=axis) 
    spread_lim = 0.45
    print(f"Inliers:  {(spread < spread_lim).sum():7} -> compute mean")
    print(f"Outliers: {(spread >= spread_lim).sum():7} -> compute median")
    print(f"Total:    {len(inputs):7}")
    return np.where(spread < spread_lim,
                    np.mean(inputs, axis=axis),
                    np.median(inputs, axis=axis))
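
To see what the mean-or-median rule does, consider two rows of fold predictions, one where the folds agree and one with an outlier. The helper below is a silent, hypothetical variant (`mean_or_median`) of the function above, with the same `0.45` spread limit, written only for illustration.

```python
import numpy as np

def mean_or_median(preds, spread_lim=0.45):
    """Row-wise mean where fold predictions agree, median where they do not."""
    spread = preds.max(axis=1) - preds.min(axis=1)
    return np.where(
        spread < spread_lim,
        preds.mean(axis=1),
        np.median(preds, axis=1),
    )

preds = np.array([
    [5.0, 5.1, 5.2],  # spread 0.2: folds agree, use the mean
    [5.0, 5.1, 9.0],  # spread 4.0: one fold is an outlier, use the median
])
blended = mean_or_median(preds)
print(blended)
```

For the second row the plain mean would be pulled up to roughly 6.37 by the outlier, whereas the median stays at 5.1.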

In [None]:
from math import ceil, floor, sqrt
# from https://www.kaggle.com/fergusfindley/ensembling-and-rounding-techniques-comparison
def geometric_round(arr):
    """Round up when a value reaches the geometric mean of its floor and ceil, else round down."""
    threshold = np.sqrt(np.floor(arr) * np.ceil(arr))
    return np.where(arr < threshold, np.floor(arr), np.ceil(arr))

In [None]:
def plot_periodogram(ts, detrend='linear', ax=None):
    from scipy.signal import periodogram
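    # Recent pandas versions reject pd.Timedelta("1Y"); older ones evaluated it to 365.2425 days.
    fs = 365.2425  # sampling frequency: observations (days) per year
    frequencies, spectrum = periodogram(
        ts,
        fs=fs,
        detrend=detrend,
        window="boxcar",
        scaling='spectrum',
    )
    if ax is None:
        _, ax = plt.subplots()
    ax.step(frequencies, spectrum, color="purple")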
    ax.set_xscale("log")
    ax.set_xticks([1, 2, 4, 6, 12, 26, 52, 104])
    ax.set_xticklabels(
        [
            "Annual (1)",
            "Semiannual (2)",
            "Quarterly (4)",
            "Bimonthly (6)",
            "Monthly (12)",
            "Biweekly (26)",
            "Weekly (52)",
            "Semiweekly (104)",
        ],
        rotation=30,
    )
    ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
    ax.set_ylabel("Variance")
    ax.set_title("Periodogram")
    return ax

## Data Preprocessing ##

Before we can do any feature engineering, we need to *preprocess* the data to get it in a form suitable for analysis. We'll need to:
- **Load** the data from CSV files
- **Clean** the data to fix any errors or inconsistencies
- **Encode** the statistical data type (numeric, categorical)
- **Impute** any missing values

We'll wrap all these steps up in a function, which will make it easy for you to get a fresh dataframe whenever you need one. After reading the CSV files, we'll apply three preprocessing steps, `clean`, `encode`, and `impute`, and then create the data splits: one (`df_train`) for training the model, and one (`df_test`) for making the predictions that you'll submit to the competition for scoring on the leaderboard.

### Handle Missing Values ###

Handling missing values now will make the feature engineering go more smoothly. We'll impute `0` for missing numeric values and `"None"` for missing categorical values. You might like to experiment with other imputation strategies. In particular, you could try creating "missing value" indicators: `1` whenever a value was imputed and `0` otherwise.

In [None]:
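def impute(df):
    for name in df.select_dtypes("number"):
        df[name] = df[name].fillna(0)
    for name in df.select_dtypes("category"):
        # "None" must be a known category before it can be used as a fill value
        if "None" not in df[name].cat.categories:
            df[name] = df[name].cat.add_categories("None")
        df[name] = df[name].fillna("None")
    return df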
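
The "missing value" indicators mentioned above could look roughly like this. It is a sketch under assumptions: the helper name `impute_with_indicators` and the `_was_missing` suffix are made up for illustration, and only numeric columns are handled.

```python
import numpy as np
import pandas as pd

def impute_with_indicators(df):
    """Impute numeric columns with 0 and flag which cells were imputed."""
    df = df.copy()
    # Fix the column list up front so the new indicator columns are not revisited.
    numeric_columns = list(df.select_dtypes("number").columns)
    for name in numeric_columns:
        missing = df[name].isna()
        if missing.any():
            df[f"{name}_was_missing"] = missing.astype(int)  # 1 = imputed, 0 = observed
        df[name] = df[name].fillna(0)
    return df

toy = pd.DataFrame({"num_sold": [10.0, np.nan, 30.0], "gdp": [1.0, 2.0, 3.0]})
out = impute_with_indicators(toy)
print(out)
```

An indicator column is only created for columns that actually contain missing values, so complete columns such as `gdp` in the toy frame stay unchanged.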

# Data/Feature Engineering

## Periodic spline features
We can try an alternative encoding of the periodic time-related features using spline transformations with a large enough number of splines, and as a result a larger number of expanded features compared to the sine/cosine transformation:

In [None]:
from sklearn.preprocessing import SplineTransformer


def periodic_spline_transformer(period, n_splines=None, degree=3):
    if n_splines is None:
        n_splines = period
    n_knots = n_splines + 1  # periodic and include_bias is True
    return SplineTransformer(
        degree=degree,
        n_knots=n_knots,
        knots=np.linspace(0, period, n_knots).reshape(n_knots, 1),
        extrapolation="periodic",
        include_bias=True,
    )

In [None]:
year_df = pd.DataFrame(
    np.linspace(0, 365, 1000).reshape(-1, 1),
    columns=[DAYOFYEAR],
)
splines = periodic_spline_transformer(365, n_splines=12, degree=2).fit_transform(year_df)
splines_df = pd.DataFrame(
    splines,
    columns=[f"spline_{i}" for i in range(splines.shape[1])],
)
pd.concat([year_df, splines_df], axis="columns").plot(x=DAYOFYEAR, cmap=plt.cm.tab20b)
_ = plt.title(f"Periodic spline-based encoding for the {DAYOFYEAR} feature")
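
For comparison, the sine/cosine transformation mentioned above can be sketched as follows. The helper name `sincos_encode` is hypothetical. It always produces two columns per harmonic, whereas the spline encoding plotted above expands the same feature into 12 columns.

```python
import numpy as np

def sincos_encode(values, period):
    """Map a periodic feature to two columns: sin and cos of its phase."""
    phase = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(phase), np.cos(phase)])

dayofyear = np.array([0.0, 91.25, 182.5, 365.0])
encoded = sincos_encode(dayofyear, period=365)

# Day 0 and day 365 land on the same point, so the encoding has no jump at year end.
print(encoded.shape)
print(np.allclose(encoded[0], encoded[-1]))
```

The trade-off is flexibility: two smooth columns cannot represent sharp seasonal bumps, which is where the larger number of spline features can help a linear model.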

In [None]:
# https://www.kaggle.com/samuelcortinhas/tps-jan-22-quick-eda-hybrid-model/notebook
def unofficial_holiday(df):
    countries = {'Finland': 1, 'Norway': 2, 'Sweden': 3}
    stores = {'KaggleMart': 1, 'KaggleRama': 2}
    products = {'Kaggle Mug': 1,'Kaggle Hat': 2, 'Kaggle Sticker': 3}
    
    # load holiday info.
    hol_path = '../input/public-and-unofficial-holidays-nor-fin-swe-201519/holidays.csv'
    holiday = pd.read_csv(hol_path)
    
    fin_holiday = holiday.loc[holiday.country == 'Finland']
    swe_holiday = holiday.loc[holiday.country == 'Sweden']
    nor_holiday = holiday.loc[holiday.country == 'Norway']
    df['fin holiday'] = df.date.isin(fin_holiday.date).astype(int)
    df['swe holiday'] = df.date.isin(swe_holiday.date).astype(int)
    df['nor holiday'] = df.date.isin(nor_holiday.date).astype(int)
    df['holiday'] = np.zeros(df.shape[0]).astype(int)
    df.loc[df.country == 'Finland', 'holiday'] = df.loc[df.country == 'Finland', 'fin holiday']
    df.loc[df.country == 'Sweden', 'holiday'] = df.loc[df.country == 'Sweden', 'swe holiday']
    df.loc[df.country == 'Norway', 'holiday'] = df.loc[df.country == 'Norway', 'nor holiday']
    df.drop(['fin holiday', 'swe holiday', 'nor holiday'], axis=1, inplace=True)
    return df

In [None]:
# Build calendar columns
MONTH_COLUMNS = []
WEEKOFYEAR_COLUMNS = []
DAYOFYEAR_COLUMNS = []
WEEKDAY_COLUMNS = []

for x in [MONTH,WEEKOFYEAR,DAYOFYEAR,WEEKDAY]:
    for y in [f'mug_{x}', f'hat_{x}', f'stick_{x}']:
        if x == MONTH:
            MONTH_COLUMNS.append(y)
        if x == WEEKOFYEAR:
            WEEKOFYEAR_COLUMNS.append(y)
        if x == DAYOFYEAR:
            DAYOFYEAR_COLUMNS.append(y)
        if x == WEEKDAY:
            WEEKDAY_COLUMNS.append(y)

In [None]:
def fourier_features(index, freq, order):
    time = np.arange(len(index), dtype=np.float32)
    k = 2 * np.pi * (1 / freq) * time
    features = {}
    for i in range(1, order + 1):
        features.update({
            f"sin_{freq}_{i}": np.sin(i * k),
            f"cos_{freq}_{i}": np.cos(i * k),
        })
    return pd.DataFrame(features, index=index)

def get_basic_ts_features(df):
    
    gdp_df = pd.read_csv('../input/gdp-per-capita-finland-norway-sweden-201519/GDP_per_capita_2015_to_2019_Finland_Norway_Sweden.csv')
    gdp_df.set_index('year', inplace=True)
#     gdp_exponent = 1.2121103201489674 # see https://www.kaggle.com/ambrosm/tpsjan22-03-linear-model for an explanation
    def get_gdp(row):
        country = row.country
        return gdp_df.loc[row.date.year, country] #**gdp_exponent

    # Apply GDP log
    df['gdp'] = np.log1p(df.apply(get_gdp, axis=1))
    
#     # Split GDP by country (for linear model)
#     df['fin_gdp']=np.where(df['country'] == 'Finland', df['gdp'], 0)
#     df['nor_gdp']=np.where(df['country'] == 'Norway', df['gdp'], 0)
#     df['swe_gdp']=np.where(df['country'] == 'Sweden', df['gdp'], 0)
    
#     # Drop column
#     df=df.drop(['gdp'],axis=1)
    
    # one-hot encoding should be used. linear model should not learn this as numeric value
#     df[YEAR] = df[DATE].dt.year
#     df[MONTH] = df[DATE].dt.month
#     df[WEEKOFYEAR] = df[DATE].dt.isocalendar().week
#     df[DAYOFYEAR] = df[DATE].dt.dayofyear
#     df[WEEKDAY] = df[DATE].dt.weekday
#     df[DAY] = df[DATE].dt.day # day in month
#     df[DAYOFMONTH] = df[DATE].dt.days_in_month
#     df[DAYOFWEEK] = df[DATE].dt.dayofweek
#     df[MONTH] = df[DATE].dt.month # Min SMAPE: 4.005319478790032
#     df[QUARTER] = df.date.dt.quarter

#     df['wd0'] = df[DATE].dt.weekday == 0 # + Monday
#     df['wd1'] = df[DATE].dt.weekday == 1 # Tuesday
#     df['wd2'] = df[DATE].dt.weekday == 2
#     df['wd3'] = df[DATE].dt.weekday == 3
    df['wd4'] = df[DATE].dt.weekday == 4 # + Friday
    df['wd56'] = df[DATE].dt.weekday >= 5 # + Weekend

#     df[f'mug_wd4'] = np.where(df['product'] == 'Kaggle Mug', df[f'wd4'], False)
#     df[f'mug_wd56'] = np.where(df['product'] == 'Kaggle Mug', df[f'wd56'], False)
#     df[f'hat_wd4'] = np.where(df['product'] == 'Kaggle Hat', df[f'wd4'], False)
#     df[f'hat_wd56'] = np.where(df['product'] == 'Kaggle Hat', df[f'wd56'], False)
#     df[f'stick_wd4'] = np.where(df['product'] == 'Kaggle Sticker', df[f'wd4'], False)
#     df[f'stick_wd56'] = np.where(df['product'] == 'Kaggle Sticker', df[f'wd56'], False)
#     df = df.drop(columns=[f'wd4', f'wd56'])
    # 4 seasons
#     df['season'] = ((df[DATE].dt.month % 12 + 3) // 3).map({1:'DJF', 2: 'MAM', 3:'JJA', 4:'SON'})

    return df

def feature_splines(df):
    # one-hot encoding should be used. linear model should not learn this as numeric value
#     df[MONTH] = df[DATE].dt.month
#     df[WEEKOFYEAR] = df[DATE].dt.isocalendar().week
#     df[WEEKDAY] = df[DATE].dt.weekday
#     df[DAYOFYEAR] = df[DATE].dt.dayofyear
    
    dayofyear_splines = periodic_spline_transformer(365, n_splines=9, degree=2).fit_transform(df[DATE].dt.dayofyear.values.reshape(-1, 1))
    splines_df = pd.DataFrame(
        dayofyear_splines,
        columns=[f"spline_{i}" for i in range(dayofyear_splines.shape[1])],
    )
    for i in range(dayofyear_splines.shape[1]):
        df[f'mug_{DAYOFYEAR}{i}'] = np.where(df['product'] == 'Kaggle Mug', splines_df[f"spline_{i}"], 0.)
        df[f'hat_{DAYOFYEAR}{i}'] = np.where(df['product'] == 'Kaggle Hat', splines_df[f"spline_{i}"], 0.)
#         df[f'stick_{DAYOFYEAR}{i}'] = np.where(df['product'] == 'Kaggle Sticker', splines_df[f"spline_{i}"], 0.)
#         df[f'fin_{DAYOFYEAR}{i}'] = np.where(df['country'] == 'Finland', splines_df[f"spline_{i}"], 0.)
#         df[f'nor_{DAYOFYEAR}{i}'] = np.where(df['country'] == 'Norway', splines_df[f"spline_{i}"], 0.)
#         df[f'swe_{DAYOFYEAR}{i}'] = np.where(df['country'] == 'Sweden', splines_df[f"spline_{i}"], 0.)

#     weekofyear_splines = periodic_spline_transformer(52, n_splines=2, degree=2).fit_transform(df[DATE].dt.isocalendar().week.values.astype(np.float64).reshape(-1,1))
#     splines_df = pd.DataFrame(
#         weekofyear_splines,
#         columns=[f"spline_{i}" for i in range(weekofyear_splines.shape[1])],
#     )
#     for i in range(weekofyear_splines.shape[1]):
#         df[f'weekofyear_{WEEKOFYEAR}{i}'] = splines_df[f"spline_{i}"]
#         df[f'hat_{WEEKOFYEAR}{i}'] = np.where(df['product'] == 'Kaggle Hat', splines_df[f"spline_{i}"], 0)
#         df[f'stick_{WEEKOFYEAR}{i}'] = np.where(df['product'] == 'Kaggle Sticker', splines_df[f"spline_{i}"], 0)
#     df[f'mug_{MONTH}'] = np.where(df['product'] == 'Kaggle Mug', df[MONTH], 0)
#     df[f'mug_{WEEKOFYEAR}'] = np.where(df['product'] == 'Kaggle Mug', df[WEEKOFYEAR], 0)
#     df[f'mug_{DAYOFYEAR}'] = np.where(df['product'] == 'Kaggle Mug', df[DAYOFYEAR], 0)
#     df[f'mug_{WEEKDAY}'] = np.where(df['product'] == 'Kaggle Mug', df[WEEKDAY], 0)
#     df[f'hat_{MONTH}'] = np.where(df['product'] == 'Kaggle Hat', df[MONTH], 0)
#     df[f'hat_{WEEKOFYEAR}'] = np.where(df['product'] == 'Kaggle Hat', df[WEEKOFYEAR], 0)
#     df[f'hat_{DAYOFYEAR}'] = np.where(df['product'] == 'Kaggle Hat', df[DAYOFYEAR], 0)
#     df[f'hat_{WEEKDAY}'] = np.where(df['product'] == 'Kaggle Hat', df[WEEKDAY], 0)
#     df[f'stick_{MONTH}'] = np.where(df['product'] == 'Kaggle Sticker', df[MONTH], 0)
#     df[f'stick_{WEEKOFYEAR}'] = np.where(df['product'] == 'Kaggle Sticker', df[WEEKOFYEAR], 0)
#     df[f'stick_{DAYOFYEAR}'] = np.where(df['product'] == 'Kaggle Sticker', df[DAYOFYEAR], 0)
#     df[f'stick_{WEEKDAY}'] = np.where(df['product'] == 'Kaggle Sticker', df[WEEKDAY], 0)

#     df = df.drop(columns=[DAYOFYEAR]) #MONTH, WEEKOFYEAR, WEEKDAY

    return df

def feature_periodic(df):
    # 21 days cyclic for lunar
    # 21 4.244872419046287 31 4.23870 37 4.2359085545955875 47 4.24590382934362 39 4.236812122257115 
    # 35 4.2358561209794665 33 4.237682217183017 36 4.230652791910613 3 4.241000488616227 4.23833321067532
    #[7, 14, 21, 28, 30, 31, 91] range(1, 32, 4) range(1,3,1)[1,2,4]
    # Long term periodic
    dayofyear = df.date.dt.dayofyear
    j=-36
    for k in [2]:
        df = pd.concat([df,
                        pd.DataFrame({
                            f"sin{k}": np.sin((dayofyear+j) / 365 * 1 * math.pi * k),
                            f"cos{k}": np.cos((dayofyear+j) / 365 * 1 * math.pi * k),
                                     })], axis=1)
        # Products
        df[f'mug_sin{k}'] = np.where(df['product'] == 'Kaggle Mug', df[f'sin{k}'], 0)
        df[f'mug_cos{k}'] = np.where(df['product'] == 'Kaggle Mug', df[f'cos{k}'], 0)
        df[f'hat_sin{k}'] = np.where(df['product'] == 'Kaggle Hat', df[f'sin{k}'], 0)
        df[f'hat_cos{k}'] = np.where(df['product'] == 'Kaggle Hat', df[f'cos{k}'], 0)
#         df[f'stick_sin{k}'] = np.where(df['product'] == 'Kaggle Sticker', df[f'sin{k}'], 0)
#         df[f'stick_cos{k}'] = np.where(df['product'] == 'Kaggle Sticker', df[f'cos{k}'], 0)
        df = df.drop(columns=[f'sin{k}', f'cos{k}'])

    # Short term Periodic
    weekday = df.date.dt.weekday
    df[f'weekly_sin'] = np.sin((1 / 7) * 2 * math.pi*(weekday+1)) #+
    df[f'weekly_cos'] = np.cos((1 / 7) * 2 * math.pi*(weekday+1)) #+
    df[f'semiweekly_sin'] = np.sin((1 / 7) * 4 * math.pi*(dayofyear-1.5)) #+ sin(4π/7 · (x − 1.5))
    df[f'semiweekly_cos'] = np.cos((1 / 7) * 4 * math.pi*(dayofyear-1.5)) #+ cos(4π/7 · (x − 1.5))
    
    df[f'fin_weekly_sin'] = np.where(df['country'] == 'Finland', df[f'weekly_sin'], 0)
    df[f'fin_weekly_cos'] = np.where(df['country'] == 'Finland', df[f'weekly_cos'], 0)
    df[f'nor_weekly_sin'] = np.where(df['country'] == 'Norway', df[f'weekly_sin'], 0)
    df[f'nor_weekly_cos'] = np.where(df['country'] == 'Norway', df[f'weekly_cos'], 0)
    df[f'swe_weekly_sin'] = np.where(df['country'] == 'Sweden', df[f'weekly_sin'], 0)
    df[f'swe_weekly_cos'] = np.where(df['country'] == 'Sweden', df[f'weekly_cos'], 0)
    
    df[f'mug_weekly_sin'] = np.where(df['product'] == 'Kaggle Mug', df[f'weekly_sin'], 0)
    df[f'mug_weekly_cos'] = np.where(df['product'] == 'Kaggle Mug', df[f'weekly_cos'], 0)
    df[f'hat_weekly_sin'] = np.where(df['product'] == 'Kaggle Hat', df[f'weekly_sin'], 0)
    df[f'hat_weekly_cos'] = np.where(df['product'] == 'Kaggle Hat', df[f'weekly_cos'], 0)
    df[f'stick_weekly_sin'] = np.where(df['product'] == 'Kaggle Sticker', df[f'weekly_sin'], 0)
    df[f'stick_weekly_cos'] = np.where(df['product'] == 'Kaggle Sticker', df[f'weekly_cos'], 0)
    
    df[f'mug_semiweekly_sin'] = np.where(df['product'] == 'Kaggle Mug', df[f'semiweekly_sin'], 0)
    df[f'mug_semiweekly_cos'] = np.where(df['product'] == 'Kaggle Mug', df[f'semiweekly_cos'], 0)
    df[f'hat_semiweekly_sin'] = np.where(df['product'] == 'Kaggle Hat', df[f'semiweekly_sin'], 0)
    df[f'hat_semiweekly_cos'] = np.where(df['product'] == 'Kaggle Hat', df[f'semiweekly_cos'], 0)
#     df[f'stick_semiweekly_sin'] = np.where(df['product'] == 'Kaggle Sticker', df[f'semiweekly_sin'], 0)
#     df[f'stick_semiweekly_cos'] = np.where(df['product'] == 'Kaggle Sticker', df[f'semiweekly_cos'], 0)
    
    df = df.drop(columns=['weekly_sin', 'weekly_cos', 'semiweekly_sin', 'semiweekly_cos'])
    
#     df[f'semiannual_sin'] = np.sin(dayofyear / 182.5 * 2 * math.pi)
#     df[f'semiannual_cos'] = np.cos(dayofyear / 182.5 * 2 * math.pi)
    
    return df

def feature_holiday(df):
# Dec Jan
    # End of year
    df = pd.concat([df,
                        pd.DataFrame({f"f-dec{d}":
                                      (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Finland')
                                      for d in range(24, 32)}),
                        pd.DataFrame({f"n-dec{d}":
                                      (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Norway')
                                      for d in range(24, 32)}),
                        pd.DataFrame({f"s-dec{d}":
                                      (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Sweden')
                                      for d in range(24, 32)}),
                        pd.DataFrame({f"f-jan{d}":
                                      (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Finland')
                                      for d in range(1, 14)}),
                        pd.DataFrame({f"n-jan{d}":
                                      (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Norway')
                                      for d in range(1, 10)}),
                        pd.DataFrame({f"s-jan{d}":
                                      (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Sweden')
                                      for d in range(1, 15)})
                       ], axis=1)
        
    # May
    df = pd.concat([df,
                        pd.DataFrame({f"may{d}":
                                      (df.date.dt.month == 5) & (df.date.dt.day == d) 
                                      for d in list(range(1, 10))}),
                        pd.DataFrame({f"may{d}":
                                      (df.date.dt.month == 5) & (df.date.dt.day == d) & 
                                      (df.country == 'Norway')
                                      for d in list(range(18, 28))})
                        ], axis=1)
    
    # June and July 8, 14
    df = pd.concat([df,
                        pd.DataFrame({f"june{d}":
                                      (df.date.dt.month == 6) & (df.date.dt.day == d) & 
                                      (df.country == 'Sweden')
                                      for d in list(range(8, 14))}),
                       ], axis=1)
    # Last Wednesday of June
    wed_june_date = df.date.dt.year.map({2015: pd.Timestamp(('2015-06-24')),
                                         2016: pd.Timestamp(('2016-06-29')),
                                         2017: pd.Timestamp(('2017-06-28')),
                                         2018: pd.Timestamp(('2018-06-27')),
                                         2019: pd.Timestamp(('2019-06-26'))})
    df = pd.concat([df, pd.DataFrame({f"wed_june{d}": 
                                      (df.date - wed_june_date == np.timedelta64(d, "D")) & 
                                      (df.country != 'Norway')
                                      for d in list(range(-4, 6))})], axis=1)

    # First Sunday of November
    sun_nov_date = df.date.dt.year.map({2015: pd.Timestamp(('2015-11-1')),
                                         2016: pd.Timestamp(('2016-11-6')),
                                         2017: pd.Timestamp(('2017-11-5')),
                                         2018: pd.Timestamp(('2018-11-4')),
                                         2019: pd.Timestamp(('2019-11-3'))})
    df = pd.concat([df, pd.DataFrame({f"sun_nov{d}":
                                      (df.date - sun_nov_date == np.timedelta64(d, "D")) & (df.country == 'Norway')
                                      for d in list(range(0, 9))})], axis=1)
    # First half of December (Independence Day of Finland, 6th of December)
    df = pd.concat([df, pd.DataFrame({f"dec{d}":
                                      (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Finland')
                                      for d in list(range(6, 14))})], axis=1)
    # Easter April
    easter_date = df.date.apply(lambda date: pd.Timestamp(easter.easter(date.year)))
    df = pd.concat([df, pd.DataFrame({f"easter{d}":
                                      (df.date - easter_date == np.timedelta64(d, "D"))
                                      for d in list(range(-2, 11)) + list(range(40, 48)) + list(range(50, 59))})], axis=1)
    return df

In [None]:
# import ephem

    # one-hot encoding should be used. linear model should not learn this as numeric value
#     df[YEAR] = df[DATE].dt.year
#     df[MONTH] = df[DATE].dt.month
#     df[WEEK] = df[DATE].dt.week
#     df[DAY] = df[DATE].dt.day
#     df[DAYOFYEAR] = df[DATE].dt.dayofyear
#     df[WEEKOFYEAR] = df[DATE].dt.isocalendar().week
#     df[DAYOFMONTH] = df[DATE].dt.days_in_month
#     df[DAYOFWEEK] = df[DATE].dt.dayofweek
#     df[WEEKDAY] = df[DATE].dt.weekday
#     df['wd1'] = df[DATE].dt.weekday == 1
#     df['wd2'] = df[DATE].dt.weekday == 2
#     df['wd3'] = df[DATE].dt.weekday == 3
#     df['wd4'] = df[DATE].dt.weekday == 4
#     df.loc[(df.date.dt.year != 2016) & (df.date.dt.month >=3), DAYOFYEAR] += 1 # fix for leap years
    # 4 seasons
#     df['season'] = ((df[DATE].dt.month % 12 + 3) // 3).map({1:'DJF', 2: 'MAM', 3:'JJA', 4:'SON'})
#     df[MONTH] = df[MONTH].apply(lambda x: calendar.month_abbr[x])
        # Countries
#         df[f'finland_sin{k}'] = np.where(df['country'] == 'Finland', df[f'sin{k}'], 0) # new: 4.015424129340626 old: 4.030858784243854
#         df[f'finland_cos{k}'] = np.where(df['country'] == 'Finland', df[f'cos{k}'], 0)
#         df[f'norway_sin{k}'] = np.where(df['country'] == 'Norway', df[f'sin{k}'], 0)
#         df[f'norway_cos{k}'] = np.where(df['country'] == 'Norway', df[f'cos{k}'], 0)
#         df[f'sweden_sin{k}'] = np.where(df['country'] == 'Sweden', df[f'sin{k}'], 0)
#         df[f'sweden_cos{k}'] = np.where(df['country'] == 'Sweden', df[f'cos{k}'], 0)
#         df[f'mart_sin{k}'] = np.where(df['store'] == 'KaggleMart', df[f'sin{k}'], 0)
#         df[f'mart_cos{k}'] = np.where(df['store'] == 'KaggleMart', df[f'cos{k}'], 0)
#         df[f'rama_sin{k}'] = np.where(df['store'] == 'KaggleRama', df[f'sin{k}'], 0)
#         df[f'rama_cos{k}'] = np.where(df['store'] == 'KaggleRama', df[f'cos{k}'], 0)
#         df[f'finland_sin{k}'] = np.where(df['country'] == 'Finland', df[f'sin{k}'], 0)
#         df[f'finland_cos{k}'] = np.where(df['country'] == 'Finland', df[f'cos{k}'], 0)
#         df[f'norway_sin{k}'] = np.where(df['country'] == 'Norway', df[f'sin{k}'], 0)
#         df[f'norway_cos{k}'] = np.where(df['country'] == 'Norway', df[f'cos{k}'], 0)
#         df[f'store_sin{k}'] = np.where(df['store'] == 'KaggleMart', df[f'sin{k}'], 0)
#         df[f'store_cos{k}'] = np.where(df['store'] == 'KaggleMart', df[f'cos{k}'], 0)
#     df[f'semiweekly_sin'] = np.sin((1 / 7) * 4 * math.pi*(dayofyear-2)) #+ ⁅sin(1/7 𝜋⋅4(𝑥−2))⁆
#     df[f'semiweekly_cos'] = np.cos((1 / 2) * 2 * math.pi*dayofyear) #+ ⁅cos(1/2 𝜋⋅2𝑥)⁆
#     df[f'lunar_sin'] = np.sin((1 / 21) * 2 * math.pi*dayofyear)
#     df[f'lunar_cos'] = np.cos((1 / 21) * 2 * math.pi*dayofyear)
#     df[f'season_sin'] = np.sin(dayofyear / 91.5 * 2 * math.pi)
#     df[f'season_cos'] = np.cos(dayofyear / 91.5 * 2 * math.pi)
#     df[f'lunar_phase'] = df.date.apply(lambda x: ephem.Moon(str(x)).moon_phase)
#     df[f'semiannual_sin'] = np.sin(dayofyear / 182.5 * 2 * math.pi)
#     df[f'semiannual_cos'] = np.cos(dayofyear / 182.5 * 2 * math.pi)
#     df[f'sin31'] = np.sin((dayofyear) / 365 * 2 * math.pi*31) #+
#     df[f'cos31'] = np.cos(dayofyear / 365 * 2 * math.pi*31)

    # Individual holidays
#     df = pd.concat([df, pd.DataFrame({f'fin{ptr[1]}':
#                                       (df.date == pd.Timestamp(ptr[0])) & (df.country == 'Finland')
#                                       for ptr in holidays.Finland(years = [2015,2016,2017,2018,2019]).items()})], axis=1)
#     df = pd.concat([df, pd.DataFrame({f'nor{ptr[1]}':
#                                       (df.date == pd.Timestamp(ptr[0])) & (df.country == 'Norway')
#                                       for ptr in holidays.Norway(years = [2015,2016,2017,2018,2019]).items()})], axis=1)
#     df = pd.concat([df, pd.DataFrame({f'swe{ptr[1]}':
#                                       (df.date == pd.Timestamp(ptr[0])) & (df.country == 'Sweden')
#                                       for ptr in holidays.Sweden(years = [2015,2016,2017,2018,2019]).items()})], axis=1)
    
    #Swedish Rock Concert
    #Jun 3, 2015 – Jun 6, 2015
    #Jun 8, 2016 – Jun 11, 2016
    #Jun 7, 2017 – Jun 10, 2017
    #Jun 6, 2018 – Jun 10, 2018
    #Jun 5, 2019 – Jun 8, 2019
#     swed_rock_fest  = df.date.dt.year.map({2015: pd.Timestamp(('2015-06-6')),
#                                          2016: pd.Timestamp(('2016-06-11')),
#                                          2017: pd.Timestamp(('2017-06-10')),
#                                          2018: pd.Timestamp(('2018-06-10')),
#                                          2019: pd.Timestamp(('2019-06-8'))})
#     df = pd.concat([df, pd.DataFrame({f"swed_rock_fest{d}":
#                                       (df.date - swed_rock_fest == np.timedelta64(d, "D")) & (df.country == 'Sweden')
#                                       for d in list(range(-3, 3))})], axis=1)

In [None]:
def feature_engineer(df):
    df = get_basic_ts_features(df)
#     df = feature_splines(df)
    df = feature_periodic(df)
    df = feature_holiday(df)
    df = unofficial_holiday(df)
    return df.copy()

In [None]:
from pathlib import Path


def load_data():
    # Read data
    data_dir = Path(INPUT)
    df_train = pd.read_csv(data_dir / "train.csv", parse_dates=[DATE],
                    usecols=['date', 'country', 'store', 'product', 'num_sold'],
                    dtype={
                        'country': 'category',
                        'store': 'category',
                        'product': 'category',
                        'num_sold': 'float64',
                    },
                    infer_datetime_format=True,)
    df_test = pd.read_csv(data_dir / "test.csv", index_col=ID, parse_dates=[DATE])
    column_y = df_train.columns.difference(
        df_test.columns)[0]  # column_y target_col label_col
    df_train[DATE] = pd.to_datetime(df_train[DATE])
    df_test[DATE] = pd.to_datetime(df_test[DATE])
    return df_train, df_test, column_y


In [None]:
def process_data(df_train, df_test):
    # Preprocessing
    if FEATURE_ENGINEERING:
        df_train = feature_engineer(df_train)
        df_test = feature_engineer(df_test)

    return df_train, df_test

# Load Data #

And now we can call the data loader and get the processed data splits:

In [None]:
%%time
train_df, test_df, column_y = load_data()

## Pseudolabeling

In [None]:
df_pseudolabels = pd.read_csv(PSEUDO_DIR, index_col=ID)
df_pseudolabels[DATE] = pd.to_datetime(test_df[DATE])
df_pseudolabels.to_csv("pseudo_labels_v0.csv", index=True)
# if PSEUDO_LABEL:
    # df_pseudolabels = df_pseudolabels.set_index([DATE]).sort_index()
test_df[column_y] = df_pseudolabels[column_y].astype(np.float64)
train_df = pd.concat([train_df, test_df], axis=0)

In [None]:
%%time
train_df, test_df = process_data(train_df, test_df)

In [None]:
train_data = train_df.copy()
test_data = test_df.copy()

In [None]:
X = train_data.set_index([DATE]).sort_index()
X_test = test_data.set_index([DATE]).sort_index()

In [None]:
# Check NA
missing_val = X.isnull().sum()
print(missing_val[missing_val > 0])
missing_val = X_test.isnull().sum()
print(missing_val[missing_val > 0])

In [None]:
train_data = train_data.set_index(['date', 'country', 'store', 'product']).sort_index()

### Visualizing Fourier features
Can it replicate the chaos below?

In [None]:
fig_dims = (30,6)
train_subset = train_data.loc['2015-01-1':'2015-12-27']
# ax = train_subset.stick_semiweekly_cos.plot(title='Period', figsize=fig_dims)
# ax = train_subset.hat_sin2.plot(title='Period', figsize=fig_dims) #lunar_cos weekly_cos sin2 lunar_sin season_sin semiweekly_sin semiannual_sin weekly_sin
# ax = train_subset.mug_cos2.plot(title='Period', figsize=fig_dims)
# ax = train_subset.sin2.plot(title='Period', figsize=fig_dims)
# _ = ax.set(ylabel="Wave")

In [None]:
if PRODUCTION:
    # Mean of every column per date for each country/store/product series,
    # computed once and then sliced by year (2015-2018).
    sales_by_series = (
        train_data
        .groupby(['country', 'store', 'product', 'date'])
        .mean()
        .unstack(['country', 'store', 'product'])
    )
    frames = [sales_by_series.loc[str(year)] for year in range(2015, 2019)]
    kaggle_sales = pd.concat(frames)

    fig_dims = (20,12)
    ax = kaggle_sales.num_sold.plot(title='Sales Trends', figsize=fig_dims)
    _ = ax.set(ylabel="Numbers sold")

A sample of rows from the dataset (up to five per distinct target value):

In [None]:
# X.plot(y='weekofyear_weekofyear0', cmap=plt.cm.tab20b)

In [None]:
# if PRODUCTION: 
train_data.groupby(column_y).apply(lambda s: s.sample(min(len(s), 5)))

## Clean up

In [None]:
del test_df
del train_data
del test_data
gc.collect()

# What is Seasonality? #

We say that a time series exhibits **seasonality** whenever there is a regular, periodic change in the mean of the series. Seasonal changes generally follow the clock and calendar -- repetitions over a day, a week, or a year are common. Seasonality is often driven by the cycles of the natural world over days and years or by conventions of social behavior surrounding dates and times.
### Choosing Fourier features with the Periodogram

How many Fourier pairs should we actually include in our feature set? We can answer this question with the periodogram. The **periodogram** tells you the strength of the frequencies in a time series. Specifically, the value on the y-axis of the graph is `(a ** 2 + b ** 2) / 2`, where `a` and `b` are the coefficients of the sine and cosine at that frequency (as in the *Fourier Components* plot above).

<figure style="padding: 1em;">
<img src="https://i.imgur.com/PK6WEe3.png" width=600 alt="">
<figcaption style="text-align: center; font-style: italic"><center>Periodogram for the <em>Wiki Trigonometry</em> series.</center></figcaption>
</figure>

From left to right, the periodogram drops off after *Quarterly*, four times a year. That was why we chose four Fourier pairs to model the annual season. The *Weekly* frequency we ignore since it's better modeled with indicators.
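
As a rough illustration of how a periodogram is read, the sketch below builds a synthetic daily series (not the competition data) with an annual and a weekly wave and computes its power spectrum with `scipy.signal.periodogram`. The amplitudes, the four-year length, and all variable names are assumptions made for the example.

```python
import numpy as np
from scipy.signal import periodogram

# Synthetic daily series (illustrative assumption): four years of data with
# an annual wave of amplitude 3 and a weekly wave of amplitude 1.
days_per_year = 365.0
t = np.arange(4 * 365)
series = (3.0 * np.sin(2 * np.pi * t / days_per_year)
          + 1.0 * np.sin(2 * np.pi * t / 7))

# fs = samples per year, so the frequencies are in cycles per year.
# scaling="spectrum" reports the power of each sinusoid, i.e. amplitude**2 / 2.
freqs, power = periodogram(series, fs=days_per_year, scaling="spectrum")

strongest_frequency = freqs[np.argmax(power)]
annual_power = power[np.argmax(power)]
print(f"strongest frequency: {strongest_frequency:.2f} cycles per year")
print(f"power at that frequency: {annual_power:.3f}")
```

The strongest bin sits at one cycle per year, and its height matches `(a ** 2 + b ** 2) / 2` for the annual wave, which is the quantity described above.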

### Computing Fourier features (optional)

Knowing how Fourier features are computed isn't essential to using them, but if seeing the details would clarify things, the hidden cell below illustrates how a set of Fourier features could be derived from the index of a time series. (We'll use a library function from `statsmodels` for our applications, however.)
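
A minimal sketch of such a derivation follows. It is not necessarily the exact code used elsewhere in this notebook: the function name `fourier_features`, the column naming, and the example date range are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fourier_features(index, freq, order):
    """Return `order` sine/cosine pairs for a season lasting `freq` time steps."""
    time = np.arange(len(index), dtype=np.float64)
    k = 2 * np.pi * (1 / freq) * time
    features = {}
    for i in range(1, order + 1):
        # The i-th pair completes i cycles per season
        features[f"sin_{freq}_{i}"] = np.sin(i * k)
        features[f"cos_{freq}_{i}"] = np.cos(i * k)
    return pd.DataFrame(features, index=index)


# Four annual Fourier pairs for a daily index (365.25 days per year)
index = pd.date_range("2015-01-01", "2018-12-31", freq="D")
X_fourier = fourier_features(index, freq=365.25, order=4)
print(X_fourier.head())
```

Each pair adds two columns, so four pairs give eight features regardless of how long the series is.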

Now let's look at the periodogram:

In [None]:
if PRODUCTION:
    plot_periodogram(X[column_y]);

The periodogram agrees with the seasonal plots above: a strong semiweekly season and a weaker annual season. The weekly season we'll model with indicators and the annual season with Fourier features. From right to left, the periodogram falls off between Bimonthly (6) and Monthly (12), so let's use 10 Fourier pairs.

We'll create our seasonal features using DeterministicProcess, the same utility we used in Lesson 2 to create trend features. To use two seasonal periods (weekly and annual), we'll need to instantiate one of them as an "additional term":
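
A sketch of that setup, under a few assumptions: the daily date range is made up for the example, the order of 10 follows the periodogram reading above, and the try/except only exists because the pandas alias for a year-end frequency changed from `"A"` to `"YE"` in recent versions.

```python
import pandas as pd
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

# Illustrative daily index; the notebook's own index may differ.
index = pd.date_range("2015-01-01", "2018-12-31", freq="D")

# 10 sin/cos pairs for the annual season. Newer pandas spells the
# year-end alias "YE"; older versions only know "A".
try:
    fourier = CalendarFourier(freq="YE", order=10)
except ValueError:
    fourier = CalendarFourier(freq="A", order=10)

dp = DeterministicProcess(
    index=index,
    constant=True,               # intercept column
    order=1,                     # linear trend
    seasonal=True,               # weekly indicators (period inferred from "D")
    additional_terms=[fourier],  # annual Fourier terms as the additional term
)

X_season = dp.in_sample()                  # features for the dates in `index`
X_forecast = dp.out_of_sample(steps=90)    # features for the next 90 days
print(X_season.shape, X_forecast.shape)
```

Because the features are deterministic functions of the date, `out_of_sample` can extend them past the end of the training period without any target values.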

# Components and Residuals #

So that we can design effective hybrids, we need a better understanding of how time series are constructed. We've studied up to now three patterns of dependence: trend, seasons, and cycles. Many time series can be closely described by an additive model of just these three components plus some essentially unpredictable, entirely random *error*:

```
series = trend + seasons + cycles + error
```

Each of the terms in this model we would then call a **component** of the time series.

The **residuals** of a model are the difference between the target the model was trained on and the predictions the model makes -- the difference between the actual curve and the fitted curve, in other words. Plot the residuals against a feature, and you get the "left over" part of the target, or what the model failed to learn about the target from that feature.
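
To make the idea concrete, here is a small sketch on synthetic data (an assumption for illustration, not the competition series). A series is built as trend + seasons + error, a linear model is fitted to the time index only, and the residuals are what that model left unexplained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Build the additive model: series = trend + seasons + error
t = np.arange(200, dtype=float)
trend = 0.05 * t
seasons = np.sin(2 * np.pi * t / 7)
error = rng.normal(scale=0.1, size=t.size)
series = trend + seasons + error

# A model that only knows about time can learn the trend component
time_feature = t.reshape(-1, 1)
trend_model = LinearRegression().fit(time_feature, series)

# Residuals: target minus fitted values
residuals = series - trend_model.predict(time_feature)

# The residuals should look like the part the model could not learn
correlation = np.corrcoef(residuals, seasons)[0, 1]
print(f"correlation between residuals and the seasonal component: {correlation:.3f}")
```

The residuals track the seasonal component closely, which is exactly the signal a second model could be trained on.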

In [None]:
# annotations: https://stackoverflow.com/a/49238256/5769929
def seasonal_plot(X, y, period, freq, ax=None):
    if ax is None:
        _, ax = plt.subplots()
    palette = sns.color_palette("husl", n_colors=X[period].nunique(),)
    ax = sns.lineplot(
        x=freq,
        y=y,
        hue=period,
        data=X,
        ci=False,
        ax=ax,
        palette=palette,
        legend=False,
    )
    ax.set_title(f"Seasonal Plot ({period}/{freq})")
    for line, name in zip(ax.lines, X[period].unique()):
        y_ = line.get_ydata()[-1]
        ax.annotate(
            name,
            xy=(1, y_),
            xytext=(6, 0),
            color=line.get_color(),
            xycoords=ax.get_yaxis_transform(),
            textcoords="offset points",
            size=14,
            va="center",
        )
    return ax

### Functions

In [None]:
import matplotlib.dates as mdates
from matplotlib.dates import MONTHLY, WEEKLY, DAILY

# Plot all num_sold_true and num_sold_pred (five years) for one country-store-product combination
def plot_five_years_combination(engineer, country='Norway', store='KaggleMart', product='Kaggle Hat', period_start='2015-01-01', period_end='2019-12-31'):
    locator = mdates.AutoDateLocator(minticks=12)
    locator.maxticks[WEEKLY] = 24
    locator.maxticks[DAILY] = 24
    dtFmt = mdates.ConciseDateFormatter(locator)
    
    demo_df = pd.DataFrame({'row_id': 0,
                            'date': pd.date_range(period_start, period_end, freq='D'),
                            'country': country,
                            'store': store,
                            'product': product})
    demo_df.set_index('date', inplace=True, drop=False)
    demo_df = engineer(demo_df)
    demo_df[column_y] = model.predict(demo_df[features])
    if PSEUDO_LABEL:
        demo_df[column_y] *= LOSS_CORRECTION
    train_subset = X[(X.country == country) & (X.store == store) & (X['product'] == product)].copy()
    train_subset = train_subset.loc[period_start:period_end]
    fig, ax = plt.subplots(figsize=(32, 8))
    plt.plot(demo_df[DATE], demo_df.num_sold, label='prediction', alpha=0.5, color='blue')
    plt.plot(train_subset.index, train_subset.num_sold, label='true', alpha=0.3, color='red', linestyle='--')
    plt.scatter(train_subset.index, train_subset.num_sold, label='true', alpha=0.3, color='red', s=2)
    plt.grid(True)
    plt.grid(which='major',axis ='y', linestyle=':', linewidth='0.5', color='black')
    plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
    ax.xaxis.set_major_formatter(dtFmt) # apply the format to the desired axis
    ax.xaxis.set_major_locator(locator)
    
    plt.legend()
    plt.title(f'{country} {store} {product} Predictions and true from {period_start} to {period_end}')
    plt.tight_layout()
    plt.show()
    return demo_df['num_sold']

In [None]:
def plot_true_vs_prediction(df_true, df_hat):
    plt.figure(figsize=(20, 13))
    plt.scatter(np.arange(len(df_hat)), np.log1p(df_hat), label='prediction', alpha=0.5, color='blue', s=3) #np.arange(len(df_hat))
    plt.scatter(np.arange(len(df_true)), np.log1p(df_true), label='Pseudo/true', alpha=0.5, color='red', s=7) #np.arange(len(df_true))
    plt.legend()
    plt.title(f'Predictions VS Pseudo-label {column_y} (LOG)') #{df_true.index[0]} - {df_true.index[-1]}
    plt.show()

In [None]:
def plot_residuals(y_residuals):
    plt.figure(figsize=(13, 3))
    plt.scatter(np.arange(len(y_residuals)), y_residuals, label='residuals', alpha=0.1, color='blue', s=5)
    plt.legend()
    plt.title(f'Linear Model residuals {column_y} (LOG)') #{df_true.index[0]} - {df_true.index[-1]}
    plt.tight_layout()
    plt.show()

In [None]:
def plot_oof(y_true, y_predict):
    # Plot y_true vs. y_pred
    plt.figure(figsize=(5, 5))
    plt.scatter(y_true, y_predict, s=3, color='r', alpha=0.5)
#     plt.scatter(np.log1p(y_true), np.log1p(y_predict), s=1, color='g', alpha=0.3)
    plt.plot([plt.xlim()[0], plt.xlim()[1]], [plt.xlim()[0], plt.xlim()[1]], '--', color='k')
    plt.gca().set_aspect('equal')
    plt.xlabel('y_true')
    plt.ylabel('y_pred')
    plt.title('OOF Predictions')
    plt.show()

In [None]:
def find_min_SMAPE(y_true, y_predict):
    loss_correction = 1
    scores = []
    # float step
    for WEIGHT in np.arange(0.988, 1.02, 0.0001):
        y_hat = y_predict.copy()
        y_hat *= WEIGHT
        scores.append(np.array([WEIGHT, np.mean(smape_loss(y_true, y_hat))]))
        
    scores = np.vstack(scores)
    min_SMAPE = np.min(scores[:,1])
    print(f'min SMAPE {min_SMAPE:.5f}')
    for x in scores:
        if x[1] == min_SMAPE:
            loss_correction = x[0]
            print(f'loss_correction: {x[0]:.5f}')
            
    plt.figure(figsize=(5, 3))
    plt.plot(scores[:,0],scores[:,1])
    plt.scatter([loss_correction], [min_SMAPE], color='g', label='minimum')
    plt.ylabel('SMAPE')
    plt.xlabel(f'loss_correction: {loss_correction:.5f}')
    plt.legend()
    plt.title(f'min SMAPE:{min_SMAPE:.5f} scaling')
    plt.show()
    
    return loss_correction

In [None]:
def evaluate_SMAPE(y_va, y_va_pred):
    # Evaluation: SMAPE before and after rescaling the predictions
    smape_before_correction = np.mean(smape_loss(y_va, y_va_pred))
    loss_correction = find_min_SMAPE(y_va, y_va_pred)
    # NOTE: this scales the predictions in place, so the caller's array
    # (and any array it is a view of) is modified as well.
    y_va_pred *= loss_correction
    print(f"SMAPE before correction: {smape_before_correction:.5f}")
    print(f'Min SMAPE: {np.mean(smape_loss(y_va, y_va_pred))}')
    return loss_correction

In [None]:
def evaluate(model, X, y, cv):
    cv_results = cross_validate(
        model,
        X,
        y,
        cv=cv,
        scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
    )
    mae = -cv_results["test_neg_mean_absolute_error"]
    rmse = -cv_results["test_neg_root_mean_squared_error"]
    print(
        f"Mean Absolute Error:     {mae.mean():.3f} +/- {mae.std():.3f}\n"
        f"Root Mean Squared Error: {rmse.mean():.3f} +/- {rmse.std():.3f}"
    )

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector, ColumnTransformer, TransformedTargetRegressor
from sklearn.model_selection import cross_validate, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn import set_config
set_config(display='diagram') 

# Model 1 (trend)
from pyearth import Earth # https://contrib.scikit-learn.org/py-earth/content.html#api
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge, HuberRegressor, RidgeCV, TheilSenRegressor, SGDRegressor
from sklearn.svm import LinearSVC
from sklearn.kernel_approximation import Nystroem

# Model 2
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, StackingRegressor, VotingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

# Hybrid Models
Linear regression excels at extrapolating trends, but can't learn interactions. XGBoost excels at learning interactions, but can't extrapolate trends. We'll learn how to create "hybrid" forecasters that combine complementary learning algorithms and let the strengths of one make up for the weakness of the other.
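
The sketch below illustrates the idea on synthetic data. Several things are assumptions for the example: the data itself, the use of a plain `DecisionTreeRegressor` as a stand-in for a gradient-boosted model such as XGBoost, and the feature split in which the linear model sees only the time index while the tree sees only the two flags.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear trend plus an interaction that only fires
# when "weekend" and "promo" are both on.
t = np.arange(300, dtype=float)
weekend = (t % 7 >= 5).astype(float)
promo = (t % 10 < 3).astype(float)
y = 0.1 * t + 3.0 * weekend * promo

X_time = t.reshape(-1, 1)                      # trend feature for the linear model
X_flags = np.column_stack([weekend, promo])    # interaction features for the tree
X_all = np.column_stack([t, weekend, promo])

train, future = slice(0, 200), slice(200, 300)

# Baseline 1: linear model alone (extrapolates the trend, misses the interaction)
linear = LinearRegression().fit(X_time[train], y[train])
pred_linear = linear.predict(X_time[future])

# Baseline 2: tree alone (learns the interaction, cannot extrapolate the trend)
tree_only = DecisionTreeRegressor(max_depth=5, random_state=0)
pred_tree = tree_only.fit(X_all[train], y[train]).predict(X_all[future])

# Hybrid: fit the tree on the residuals of the linear model, then add both
residuals = y[train] - linear.predict(X_time[train])
residual_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
residual_tree.fit(X_flags[train], residuals)
pred_hybrid = pred_linear + residual_tree.predict(X_flags[future])

mae_linear = mean_absolute_error(y[future], pred_linear)
mae_tree = mean_absolute_error(y[future], pred_tree)
mae_hybrid = mean_absolute_error(y[future], pred_hybrid)
print(f"linear: {mae_linear:.3f}  tree: {mae_tree:.3f}  hybrid: {mae_hybrid:.3f}")
```

On the held-out future period the hybrid should have the lowest error of the three, because each model only has to learn the part it is suited for.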

In [None]:
# Hybrid estimator: model_1 fits the target, model_2 fits model_1's residuals
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class BoostedHybrid(BaseEstimator, RegressorMixin):
    def __init__(self, model_1, model_2):
        self.model_1 = model_1
        self.model_2 = model_2
        self.y_columns = None  # store column names from fit method
    def fit(self, X, y):
        """Fit model_1 on the target, then fit model_2 on model_1's residuals.
        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            The training input samples.
        y : array-like, shape (n_samples,)
            The target values (real numbers).
        Returns
        -------
        self : object
            Returns self.
        """
        X, y = check_X_y(X, y, accept_sparse=True)
        # Train model_1
        self.model_1.fit(X, y)

        # Make predictions
        y_fit = self.model_1.predict(X)
        # Compute residuals
        y_resid = y - y_fit

        # Train model_2 on residuals , eval_set=[(X_1_valid, y_valid_resid)]
        self.model_2.fit(X, y_resid)
        # Model2 prediction
        y_fit2 = self.model_2.predict(X)
        # Compute noise
        y_resid2 = y_resid - y_fit2
        
        # Keep intermediate results for the residual diagnostics
        self.y = y
        self.y_fit = y_fit
        self.y_resid = y_resid
        self.y_fit2 = y_fit2
        self.y_resid2 = y_resid2

        self.is_fitted_ = True
        return self
    
    def predict(self, X):
        """Predict with model_1 and add model_2's residual predictions.
        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            The input samples.
        Returns
        -------
        y : ndarray, shape (n_samples,)
            Sum of the model_1 and model_2 predictions.
        """
        X = check_array(X, accept_sparse=True)
        check_is_fitted(self, 'is_fitted_')
        # Predict with model_1
        y_predict = self.model_1.predict(X)
        # Add model_2 predictions to model_1 predictions
        y_predict += self.model_2.predict(X)

        return y_predict

In [None]:
from statsmodels.graphics.gofplots import qqplot
def model_fit_eval(hybrid_model, X_train, y_train, X_valid, y_valid, X_test, df_test, loss_correction):
#     test_pred_list = []
    # Boosted Hybrid
    hybrid_model.fit(X_train, y_train) #, X_valid, y_valid
    
    loss_correction = 1
    ###### Preprocess the validation data
    y_va = y_valid.copy()
    # Inference for validation
    y_va_pred = hybrid_model.predict(X_valid)
    print(f'***********Validation Data*****************')
    loss_correction = evaluate_SMAPE(y_va, y_va_pred)
    
    ###### Validate against the 2019 pseudo-labels #######
    loss_correction = 1
    ###### Preprocess the test data
    y_test = df_test[column_y].values.reshape(-1, 1)
    # Inference test 2019 for validation
    y_test_prediction = hybrid_model.predict(X_test[features])
    # Evaluation: SMAPE
    print(f'***********Test Data*****************')
    loss_correction = evaluate_SMAPE(y_test, y_test_prediction.reshape(-1, 1))
    ### Mean test prediction ###
#     test_pred_list.append(y_test_prediction)
    
    ###### Validation dataset
    ###### Visualize and evaluate
    print(f'***********Validation Data*****************')
    plot_oof(y_va, y_va_pred)
    plot_true_vs_prediction(y_va, y_va_pred)
    print(f'***********Test Data*****************')
    plot_oof(y_test, y_test_prediction)
    plot_true_vs_prediction(y_test, y_test_prediction)
    # Model_1 residual
    plot_residuals(hybrid_model.regressor_[1].estimators_[0].y_resid)
    qqplot(np.expm1(hybrid_model.regressor_[1].estimators_[0].y_resid))
    # Model_2 residual. Best case is normal gaussian noise
    plot_residuals(hybrid_model.regressor_[1].estimators_[0].y_resid2)
    qqplot(np.expm1(hybrid_model.regressor_[1].estimators_[0].y_resid2))
    
    return hybrid_model, y_test_prediction, loss_correction

In [None]:
# ts_cv = TimeSeriesSplit(n_splits=FOLDS)
# alphas = np.logspace(-6, 6, 25)
# naive_linear_pipeline = make_pipeline(
#     ColumnTransformer(
#         transformers=[
#             ("categorical", one_hot_encoder, make_column_selector(dtype_include=object)),
#             ("numeric", StandardScaler(), make_column_selector(dtype_include=np.number)),
#         ],
#         remainder='passthrough',
#     ),
#     RidgeCV(),
# )
# evaluate(naive_linear_pipeline, X_2, np.log1p(y), cv=ts_cv)

## Stacker
- models_1 (list): any linear model, fitted on the target.
- models_2 (list): tree-based models, each fitted on the residuals of a model_1.

In [None]:
%%time
gc.collect()
# pseudohubererror squarederror

def build_estimator_stack(estimator_stack, seed=RANDOM_STATE):
    if GPU:
        param_xgb = {
                    'objective' : 'reg:squarederror',
                    'tree_method' : 'gpu_hist',
                    'learning_rate': 0.12,
                    'max_depth': 4,
                    'n_estimators': N_ESTIMATORS,
                    'random_state': seed
                 }
        param_cat = {
                    'loss_function' : 'RMSE', # SMAPE RMSE Huber
                    'eval_metric': 'RMSE',
                    'task_type' : 'GPU',            
                    'learning_rate': 0.13,
                    'max_depth': 4,
                    'n_estimators': N_ESTIMATORS,
                    'random_state': seed,
                    'verbose': VERBOSE
                 }
        param_lgb = {
                    'objective' : 'regression',
                    'max_depth': 4,
                    'n_estimators': N_ESTIMATORS,
                    'device' : 'gpu',
                    'random_state': seed
                 }
    else: #CPU
        param_xgb = {
                    'objective' : 'reg:squarederror',
                    'tree_method' : 'hist',
                    'learning_rate': 0.12,
                    'max_depth': 4,
                    'n_estimators': N_ESTIMATORS,
                    'random_state': seed
                 }
        param_cat = {
                    'loss_function' : 'RMSE',
                    'eval_metric': 'RMSE',
                    'learning_rate': 0.13,
                    'max_depth': 4,
                    'n_estimators': N_ESTIMATORS,
#                     'iterations': 700,
#                     'od_type' : 'Iter',
#                     'od_wait' : 20,
                    'random_state': seed,
                    'verbose': VERBOSE
                 }
        param_lgb = {
                    'objective' : 'regression',
                    'max_depth': 4,
                    'n_estimators': N_ESTIMATORS,
                    'random_state': seed
                 }
    if PRODUCTION:
        # Linear estimator. Try different combinations of the algorithms above KNeighborsRegressor fit_intercept=False
        models_1 = [
                    Ridge(fit_intercept=False, random_state=seed),
                    ElasticNet(fit_intercept=False, random_state=seed),
                    HuberRegressor(fit_intercept=False, epsilon=1.20, max_iter=1300),
                   ]
        # Residual estimator
        models_2 = [
                    XGBRegressor(**param_xgb),
                    lgb.LGBMRegressor(**param_lgb),
                    CatBoostRegressor(**param_cat),
                   ]
    else: # Trial run
        # Linear estimator. Try different combinations of the algorithms above KNeighborsRegressor. Remove bias fit_intercept=False
        models_1 = [
#                     Ridge(fit_intercept=False, random_state=seed),
#                     ElasticNet(fit_intercept=False, random_state=seed),
#                     LinearRegression(fit_intercept=True),
#                     SGDRegressor(fit_intercept=False, random_state=seed),
#                     Lasso(fit_intercept=False, random_state=seed),
#                     TheilSenRegressor(fit_intercept=False, random_state=seed),
                    HuberRegressor(fit_intercept=False, epsilon=1.20, max_iter=1300),
#                     Earth(verbose=VERBOSE),
#                     MLPRegressor(   hidden_layer_sizes=HIDDEN_LAYERS,
#                                     learning_rate_init=0.01,
#                                     learning_rate='adaptive',
#                                     early_stopping=True,
#                                     max_iter=EPOCHS,
#                                     random_state=seed,
#                                     ),

                   ]
        # Residual estimator
        models_2 = [
                    CatBoostRegressor(**param_cat),
#                     lgb.LGBMRegressor(**param_lgb),
#                     XGBRegressor(**param_xgb),
                   ]

    for model_1 in models_1:
        for model_2 in models_2:
            model1_name = type(model_1).__name__
            model2_name = type(model_2).__name__
            hybrid_model = BoostedHybrid(
                    model_1 = model_1,
                    model_2 = model_2
                            )
            print(f'******************Stacking {model1_name:>16} with {model2_name:<18}*************************')
            estimator_stack.append((f'model_{model1_name}_{model2_name}', hybrid_model))
    return estimator_stack

## Pipeline MasterClass

In [None]:
# tscv = TimeSeriesSplit(n_splits=FOLDS) # cv=tscv , n_jobs=-1

def build_stacking_regressor(estimator_stack, seed=RANDOM_STATE):
    # X pipeline
    stacking_regressor = make_pipeline(
        ColumnTransformer(
            transformers=[
                # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
                ("categorical", OneHotEncoder(handle_unknown="ignore", sparse_output=False), make_column_selector(dtype_include=object)),
#                 ("cyclic_year", periodic_spline_transformer(365, n_splines=9, degree=2), [DAYOFYEAR]),
#                 ("cyclic_weekofyear", periodic_spline_transformer(52, n_splines=7, degree=2), [WEEKOFYEAR]),
#                 ("cyclic_month", periodic_spline_transformer(12, n_splines=6, degree=3), [MONTH]),
#                 ("cyclic_weekday", periodic_spline_transformer(7, n_splines=3, degree=2), [WEEKDAY]),
#                 ("numeric", MinMaxScaler(), make_column_selector(dtype_include=np.number)),
            ],
            remainder=MinMaxScaler(), #'passthrough', #
        ),
#         Nystroem(kernel="poly", degree=2, n_components=300, random_state=seed),
        StackingRegressor(estimators=estimator_stack, final_estimator=RidgeCV(), cv=FOLDS, n_jobs=-1, verbose=VERBOSE),
    )
    # X y pipeline with y log transform
    model = TransformedTargetRegressor(
        regressor=stacking_regressor, func=np.log1p, inverse_func=np.expm1, check_inverse=False
    )
    return model

## Data splitting X_2 X_test y

When neither pseudo-labeling nor production mode is enabled, 2018 is held out as the validation set.

In [None]:
def get_Xy(X):
    # Target series
    y = X.loc[:, column_y]
    X_2 = X.drop(column_y, axis=1)

    features = X_2.columns

    if PSEUDO_LABEL:
        TRAIN_START_DATE = "2015-01-01"
        TRAIN_END_DATE = "2019-12-31"
        VALID_START_DATE = "2015-01-01"
        VALID_END_DATE = "2018-12-31"
    else:
        if PRODUCTION:
            TRAIN_START_DATE = "2015-01-01"
            TRAIN_END_DATE = "2018-12-31"
            VALID_START_DATE = "2015-01-01"
            VALID_END_DATE = "2018-12-31"
        else: # 2018 Validation
            TRAIN_START_DATE = "2015-01-01"
            TRAIN_END_DATE = "2017-12-31"
            VALID_START_DATE = "2018-01-01"
            VALID_END_DATE = "2018-12-31"

    y_train, y_valid = y[TRAIN_START_DATE:TRAIN_END_DATE], y[VALID_START_DATE:VALID_END_DATE]
    X2_train, X2_valid = X_2.loc[TRAIN_START_DATE:TRAIN_END_DATE], X_2.loc[VALID_START_DATE:VALID_END_DATE]
    return y, y_train, y_valid, X2_train, X2_valid, features

# Training

### Product channel test

- Min SMAPE: 3.9165618565465845 
- month week day- linear 4.0052295549526225 
- month week day+ linear 3.9153803500063686 
- WEEKDAY+ 3.91795141135476
- GDP exp: Min SMAPE: 3.9835975216489867

In [None]:
%%time
PRODUCT=False
# PRODUCT='Kaggle Sticker'
# X = X.loc[X['product'] == PRODUCT]
# X_test = X_test.loc[X_test['product'] == PRODUCT]
# train_df.loc[train_df['product'] == 'Kaggle Mug']

# y_test to use
if PRODUCT:
    df_2019 = train_df.loc[(train_df['product'] == PRODUCT) & (train_df[DATE] >= pd.to_datetime(date(2019, 1, 1)))]
else:
    df_2019 = df_pseudolabels

test_prediction_list=[]
for seed in range(SEED_START, (SEED_START+REPEAT), 1):
    estimator_stack = []
    y, y_train, y_valid, X2_train, X2_valid, features = get_Xy(X)
    estimator_stack = build_estimator_stack(estimator_stack=estimator_stack, seed=seed)
    stacking_regressor = build_stacking_regressor(estimator_stack, seed=seed)
    print(f'****************** Run using seed: {seed} ******************')
    model, y_test_prediction, LOSS_CORRECTION = model_fit_eval(stacking_regressor, X2_train, y_train, X2_valid, y_valid, X_test, df_2019, LOSS_CORRECTION)
    test_prediction_list.append(y_test_prediction)
model

### Debug

In [None]:
for ptr in holidays.Norway(years = [2018], observed=True).items():
    print(ptr)

In [None]:
# Debug
for country in np.unique(train_df['country']):
    for product in np.unique(train_df['product']):
        for store in np.unique(train_df['store']):
            y_fit = plot_five_years_combination(feature_engineer, country=country, product=product, store=store,period_start='2018-2-01', period_end='2018-2-28')
            break
        break

# Inference year 2019 test data

# Inference validation

In [None]:
for country in np.unique(train_df['country']):
    for product in np.unique(train_df['product']):
        for store in np.unique(train_df['store']):
            y_fit = plot_five_years_combination(feature_engineer, country=country, product=product, store=store)
            break

# Submission
Once you're satisfied with everything, it's time to create your final predictions! This cell will:

- use the best trained model to make predictions from the test set
- save the predictions to a CSV file


In [None]:
len(test_prediction_list)

In [None]:
sub = pd.read_csv('../input/tabular-playground-series-jan-2022/sample_submission.csv')

## Mean vs Median

In [None]:
if BLEND:
    test_prediction_list.append(df_pseudolabels[column_y].values) #blender 1
    df_pseudolabels1 = pd.read_csv(PSEUDO_DIR2, index_col=ID)
    test_prediction_list.append(df_pseudolabels1[column_y].values) #blender 2
test_prediction_list_median = np.median(test_prediction_list, axis=0) # median is better https://www.kaggle.com/saraswatitiwari/tabular-playground-series-22
test_prediction_list_mean = np.mean(test_prediction_list, axis=0)
###### Validate against the 2019 pseudo-labels #######
loss_correction = 1
###### Preprocess the test data
y_test = df_2019[column_y].values.reshape(-1, 1)
print(f'*********** Median *****************')
median_correction = evaluate_SMAPE(y_test, test_prediction_list_median.reshape(-1, 1))
print(f'*********** Mean *****************')
mean_correction = evaluate_SMAPE(y_test, test_prediction_list_mean.reshape(-1, 1))

test_prediction = (test_prediction_list_mean*mean_correction) if (np.abs(1.-mean_correction) <= np.abs(1.-median_correction)) else (test_prediction_list_median*median_correction)


In [None]:

if len(test_prediction) > 0:
    # Create the submission file
    submission = pd.DataFrame(data=np.zeros((sub.shape[0],2)),index = sub.index.tolist(),columns=[ID,column_y])
    submission[ID] = sub[ID]
    submission[column_y] = test_prediction
    submission.to_csv('pseudo_labels_v1.csv', index=False)
    # round
    submission[column_y] = geometric_round(submission[column_y]).astype(int) #https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/299162
    submission.to_csv('submission.csv', index=False)

    # Plot the distribution of the test predictions
    plt.figure(figsize=(16,3))
    plt.hist(train_df[column_y], bins=np.linspace(0, 3000, 201),
             density=True, label='Training')
    plt.hist(submission[column_y], bins=np.linspace(0, 3000, 201),
             density=True, rwidth=0.5, label='Test predictions')
    plt.xlabel(column_y)
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()

In [None]:
display(submission.head(30))
display(submission.tail(30))

In [None]:
submission[column_y].describe()

In [None]:
df_pseudolabels[column_y].describe()

Variance

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$
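
As a quick check of the formula, the sketch below computes the population variance by hand on a small made-up sample and compares it with NumPy and pandas. The sample values are an assumption for illustration. Note that `np.var` divides by `n` by default, while pandas (`Series.var`, `Series.std`, and the `std` row of `describe()`) divides by `n - 1`.

```python
import numpy as np
import pandas as pd

# Made-up sample, for illustration only
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()
n = len(values)

# Population variance, exactly as in the formula: sum((x_i - mu)^2) / n
population_variance = ((values - mu) ** 2).sum() / n

# NumPy uses ddof=0 (divide by n) by default
numpy_variance = np.var(values.to_numpy())

# pandas uses ddof=1 (divide by n - 1) by default
sample_variance = values.var()

print(population_variance, numpy_variance, sample_variance)
```

The two population values agree, and the pandas default is slightly larger because of the `n - 1` denominator.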

To submit these predictions to the competition, follow these steps:

1. Begin by clicking on the blue **Save Version** button in the top right corner of the window.  This will generate a pop-up window.
2. Ensure that the **Save and Run All** option is selected, and then click on the blue **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the blue **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

# Next Steps #

If you want to keep working to improve your performance, select the blue **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.

Be sure to check out [other users' notebooks](https://www.kaggle.com/c/tabular-playground-series-jan-2022/code) in this competition. You'll find lots of great ideas for new features, as well as other ways to discover more things about the dataset or make better predictions. There's also the [discussion forum](https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion), where you can share ideas with other Kagglers.

Have fun Kaggling!

In [None]:
for x in X.columns:
    print(x)