# Introduction

### Algorithm Description

TODO: Write it.

### General Preparations

In [1]:
from typing import List, Tuple

import numpy as np
import pandas as pd

from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

from dsawl.ooffg.estimators import OutOfFoldFeaturesRegressor

In [2]:
np.random.seed(361)

# Synthetic Dataset Generation

In [3]:
def generate_data(
        slopes: List[float],
        group_size: int,
        noise_stddev: float
        ) -> pd.DataFrame:
    """
    Generate `len(slopes)` * `group_size` examples
    with dependency y = slope * x + noise.
    """
    dfs = []
    for i, slope in enumerate(slopes):
        curr_df = pd.DataFrame(columns=['x', 'category', 'y'])
        curr_df['x'] = range(group_size)
        curr_df['category'] = i
        curr_df['y'] = curr_df['x'].apply(
            lambda x: slope * x + np.random.normal(scale=noise_stddev)
        )
        dfs.append(curr_df)
    df = pd.concat(dfs)
    return df

Let us create a situation where in-fold generation of target-based features leads to overfitting. To do so, we make many small categories and set the noise variance to a high value. With such settings, noise from the target leaks into the in-fold generated mean, because each example's own noisy target is part of the mean computed for its category. The regressor learns to exploit this leakage, which is useless on hold-out sets.
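The leakage can be seen in isolation with a minimal sketch that is not part of the original notebook. It assumes pure-noise targets (all slopes equal to zero) and uses only the standard library; the names `draw_targets` and `covariance` are hypothetical helpers introduced for illustration. The in-fold category mean covaries with the very targets it was computed from, but not with fresh targets drawn from the same distribution.

```python
import random
from statistics import mean

random.seed(0)

# Many small groups and high noise, as in the setting described above.
n_groups, group_size, noise_stddev = 2000, 5, 10.0


def draw_targets():
    """Draw pure-noise targets: one list of `group_size` values per group."""
    return [
        [random.gauss(0.0, noise_stddev) for _ in range(group_size)]
        for _ in range(n_groups)
    ]


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return mean((x - mx) * (y - my) for x, y in zip(xs, ys))


train_targets = draw_targets()
fresh_targets = draw_targets()

# In-fold feature: every example gets the mean of its own group,
# and that mean includes the example's own noisy target.
group_means = [mean(group) for group in train_targets]
feature = [m for m in group_means for _ in range(group_size)]

flat_train = [y for group in train_targets for y in group]
flat_fresh = [y for group in fresh_targets for y in group]

in_fold_cov = covariance(feature, flat_train)
fresh_cov = covariance(feature, flat_fresh)
print(in_fold_cov, fresh_cov)
```

Under these assumptions the first covariance should be close to `noise_stddev ** 2 / group_size`, while the second should be close to zero, so the apparent signal in the feature does not carry over to new data.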

In [4]:
slopes = [2, 1, 3, 4, -1, -2, 3, 2, 1, 5, -2, -3, -5, 8, 1, -7, 0, 2, 0]
group_size = 5
noise_stddev = 10

In [5]:
train_df = generate_data(slopes, group_size, noise_stddev)
train_df.head()

|   | x | category | y |
|---|---|----------|---|
| 0 | 0 | 0 | 3.314436 |
| 1 | 1 | 0 | -4.654332 |
| 2 | 2 | 0 | 1.642263 |
| 3 | 3 | 0 | 15.854888 |
| 4 | 4 | 0 | 21.873901 |


Generate a test set from the same distribution.

In [6]:
test_df = generate_data(slopes, group_size, noise_stddev)
test_df.head()

|   | x | category | y |
|---|---|----------|---|
| 0 | 0 | 0 | -0.682203 |
| 1 | 1 | 0 | 9.03367 |
| 2 | 2 | 0 | 8.911399 |
| 3 | 3 | 0 | 2.345571 |
| 4 | 4 | 0 | 3.408391 |


# Benchmark: Training with In-Fold Generated Mean

In [7]:
encoding_df = train_df \
    .groupby('category', as_index=False) \
    .agg({'y': 'mean'}) \
    .rename(columns={'y': 'infold_mean'})

In [8]:
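def get_target_and_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Attach the in-fold mean computed on the train set
    and return features and target (in this order).
    """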
    merged_df = df.merge(encoding_df, on='category')
    X = merged_df[['x', 'infold_mean']]
    y = merged_df['y']
    return X, y

In [9]:
X_train, y_train = get_target_and_features(train_df)
X_test, y_test = get_target_and_features(test_df)

In [10]:
rgr = LinearRegression()

In [11]:
rgr.fit(X_train, y_train)
y_hat_train = rgr.predict(X_train)
r2_score(y_train, y_hat_train)

0.42987610029007461

In [12]:
y_hat = rgr.predict(X_test)
r2_score(y_test, y_hat)

0.061377490670754042

The train set score (about 0.43) is far above the test set score (about 0.06), so the regressor clearly overfits.

# Out-of-Fold Features Estimator

In [13]:
# As of now, `dsawl.ooffg` does not support `pandas`, so add `.values` after dataframes.
X_train, y_train = train_df[['x', 'category']].values, train_df['y'].values
X_test, y_test = test_df[['x', 'category']].values, test_df['y'].values

In [14]:
ooffr = OutOfFoldFeaturesRegressor(
    LinearRegression(),  # No arguments should be passed to constructor here.
    dict(),  # If needed, pass arguments here as a dictionary.
    n_splits=3,  # Define how to make folds for features generation.
    shuffle=True,
    random_state=361)

Below is a wrong way to measure train error. After `fit`, the regressor makes predictions from in-fold generated features, not from the out-of-fold features that were used for training, so the resulting train score is overly optimistic.

In [15]:
ooffr.fit(X_train, y_train, source_positions=[1])
y_hat_train = ooffr.predict(X_train)
r2_score(y_train, y_hat_train)

0.32858825879553122

Now, let us look at the right way to measure performance on the train set. In `OutOfFoldFeaturesRegressor` and `OutOfFoldFeaturesClassifier`, the `fit_predict` method is not just a combination of the `fit` and `predict` methods: its predictions for the train set are based on the features that were used for training.

In [16]:
y_hat_train = ooffr.fit_predict(X_train, y_train, source_positions=[1])
r2_score(y_train, y_hat_train)

0.13534718308949856
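To make the idea of out-of-fold features concrete, here is a minimal sketch of out-of-fold mean encoding written with the standard library only. It is an illustration under stated assumptions, not the implementation from `dsawl`: the function name `out_of_fold_means`, the fold assignment scheme, and the fallback to the mean over the other folds for unseen categories are all hypothetical choices.

```python
import random
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence


def out_of_fold_means(
        categories: Sequence[int],
        targets: Sequence[float],
        n_splits: int = 3,
        seed: int = 361
        ) -> List[float]:
    """
    For each example, return the mean target of its category
    computed only over examples from the other folds.
    """
    indices = list(range(len(targets)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]

    encoded = [0.0] * len(targets)
    for held_out in folds:
        held_out_set = set(held_out)
        by_category: Dict[int, List[float]] = defaultdict(list)
        for i in indices:
            if i not in held_out_set:
                by_category[categories[i]].append(targets[i])
        # Assumption: a category absent from the other folds
        # falls back to the mean target over those folds.
        fallback = mean(targets[i] for i in indices if i not in held_out_set)
        for i in held_out:
            values = by_category.get(categories[i])
            encoded[i] = mean(values) if values else fallback
    return encoded


categories = [0, 0, 0, 1, 1, 1]
targets = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]

# With one example per fold, each example is encoded
# by the mean of the other examples from its category.
encoded = out_of_fold_means(categories, targets, n_splits=len(targets))
print(encoded)  # [2.5, 2.0, 1.5, 25.0, 20.0, 15.0]
```

No example's own target enters its encoded value, which is why a model trained on such features cannot exploit the leakage demonstrated by the benchmark.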

In [17]:
ooffr.fit(X_train, y_train, source_positions=[1])
y_hat = ooffr.predict(X_test)
r2_score(y_test, y_hat)

0.13394140594498127

Thus, there is no blatant overfitting, because the train set score (about 0.135) and the test set score (about 0.134) are close to each other. It is also worth highlighting that the test set score is more than twice that of the benchmark regressor trained with the in-fold generated mean (about 0.061).