# Postprocessing for the February TPS

Most of this code has been copied from @[maxencefzr](https://www.kaggle.com/maxencefzr)'s [notebook](https://www.kaggle.com/maxencefzr/tps-feb22-eda-extratrees). I've added the postprocessing and the code for dealing with duplicate training samples. 

Release notes:
- V3: Used `scipy.optimize` to tune the postprocessing (didn't improve the leaderboard score)
- V4: Dealing with duplicate training data and sample weights

In [None]:
#%%capture

# Intel® Extension for Scikit-learn installation:
!pip install scikit-learn-intelex

import os
import warnings

import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
%matplotlib inline

from scipy.stats import mode
from tqdm import tqdm
from pathlib import Path

from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Mute warnings
warnings.filterwarnings("ignore")

In [None]:
data_dir = Path('../input/tabular-playground-series-feb-2022')

df_train = pd.read_csv(data_dir / 'train.csv', index_col='row_id')
df_test  = pd.read_csv(data_dir / 'test.csv', index_col='row_id')

TARGET = df_train.columns.difference(df_test.columns)[0]
features = df_train.columns[df_train.columns != TARGET]

# Deduplicating the training data

Among the 200000 training samples, there are 76007 duplicates. These duplicates are an issue for two reasons:
1. They make training times unnecessarily long (ok, if you have enough patience, this could be a non-issue).
2. They inflate the cv scores if not handled correctly, because copies of a validation sample can end up in the training folds.
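
To make the second point concrete, here is a minimal sketch on made-up data (the array, the split sizes and all variable names are assumptions for illustration, not part of the competition data). It builds a dataset in which every row occurs four times, splits it randomly, and counts how many validation rows also occur in the training part:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 distinct rows, each repeated 4 times (200 rows total)
unique_rows = rng.integers(0, 1000, size=(50, 3))
X_toy = np.repeat(unique_rows, 4, axis=0)

# A single random 80/20 split, standing in for one cross-validation fold
perm = rng.permutation(len(X_toy))
train_idx, valid_idx = perm[:160], perm[160:]

# Count validation rows that have an exact copy in the training part
train_rows = {tuple(row) for row in X_toy[train_idx]}
n_leaked = sum(tuple(row) in train_rows for row in X_toy[valid_idx])

print(f'{n_leaked} of {len(valid_idx)} validation rows also occur in the training part')
```

A tree-based model can effectively memorize such rows, so the validation score on them says little about unseen data. Splitting only after deduplication avoids this.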

We must not simply drop the duplicates because this would change the probability distribution. After all, if one particular measurement outcome has been measured 18 times, it should have higher weight than an outcome which has been measured only once. Fortunately, the `fit()` method of most scikit-learn estimators has an optional parameter `sample_weight` for this purpose.

In the following, we convert the training dataframe to a new dataframe without the duplicated rows. To compensate for dropping the duplicates, we add a column `sample_weight` to the dataframe.

In [None]:
# Count the duplicates in the training data
df_train.duplicated().sum()

In [None]:
# Create a new dataframe without duplicates, but with an additional sample_weight column
# value_counts() yields one entry per distinct row, sorted by descending count
dedup_train = df_train.value_counts().reset_index(name='sample_weight')
dedup_train

Let's do a quick check for correctness. The first row of `dedup_train` has a sample_weight of 18. If everything is correct, the original dataframe should have 18 rows with the same data:

In [None]:
(df_train[features].values == dedup_train[features].iloc[0].values.reshape(1, -1)).all(axis=1).sum()

# Training, cross-validation & inference

After deduplicating the training data, we apply two small changes to the training loop:
1. When calling `fit()`, we add the sample weights of the training data.
2. When calling `accuracy_score()`, we add the sample weights of the validation data.
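
Why the weights belong in the metric can be checked on a small made-up example (labels, predictions and duplicate counts below are purely illustrative). The weighted accuracy on the deduplicated rows equals the plain accuracy one would get after expanding every row back into its duplicates, whereas ignoring the weights gives a different number:

```python
import numpy as np

# Hypothetical deduplicated validation set: labels, predictions, duplicate counts
y_true = np.array([0, 1, 1, 2])
y_pred = np.array([0, 1, 2, 2])
weights = np.array([18, 1, 3, 2])

# Weighted accuracy on the deduplicated rows
# (a weighted average of the per-row hits, as accuracy_score does with sample_weight)
acc_weighted = np.average(y_true == y_pred, weights=weights)

# Unweighted accuracy after expanding every row back to its duplicates
acc_expanded = np.mean(np.repeat(y_true, weights) == np.repeat(y_pred, weights))

# Ignoring the weights treats a row seen 18 times like a row seen once
acc_unweighted = np.mean(y_true == y_pred)

print(acc_weighted, acc_expanded, acc_unweighted)  # 0.875 0.875 0.75
```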

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode the target labels (bacteria names) as integers
le = LabelEncoder()

X = dedup_train[features]
y = pd.Series(le.fit_transform(dedup_train[TARGET]), name=TARGET)  # 1-d target, as fit() expects
sample_weight = dedup_train['sample_weight']

In [None]:
#%%time

N_SPLITS = 10
folds = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=1)  # fixed seed for reproducible folds
y_pred_list, y_proba_list, scores = [], [], []

for fold, (train_id, valid_id) in enumerate(tqdm(folds.split(X, y), total=N_SPLITS)):
    print('####### Fold: ', fold)
    
    # Splitting
    X_train, y_train, sample_weight_train = X.iloc[train_id], y.iloc[train_id], sample_weight.iloc[train_id]
    X_valid, y_valid, sample_weight_valid = X.iloc[valid_id], y.iloc[valid_id], sample_weight.iloc[valid_id]
    
    # Model
    model = ExtraTreesClassifier(
        n_estimators=300,
        n_jobs=-1,
        verbose=0,
        random_state=1
    )

    # Training
    model.fit(X_train, y_train, sample_weight=sample_weight_train)
        
    # Validation
    valid_pred = model.predict(X_valid)
    valid_score = accuracy_score(y_valid, valid_pred, sample_weight=sample_weight_valid)
    print(f'Accuracy score: {valid_score:.5f}\n')
    scores.append(valid_score)
    
    # Prediction for submission
    y_pred_list.append(model.predict(df_test))
    y_proba_list.append(model.predict_proba(df_test))
    
score = np.array(scores).mean()
print(f'Mean accuracy score: {score:.6f}')

# Ensembling

We are happy about the high cv score and ensemble the ten predictions by majority vote:
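
What a majority vote does can be seen on a small made-up example (the prediction array below is purely illustrative; this sketch uses `np.bincount` rather than the `scipy.stats.mode` call of the notebook):

```python
import numpy as np

# Made-up class predictions: 3 folds (rows) x 4 test samples (columns)
fold_preds = np.array([
    [0, 1, 2, 2],
    [0, 1, 1, 2],
    [2, 1, 1, 0],
])

# For every test sample (column), pick the class predicted by the most folds;
# np.bincount(...).argmax() resolves ties in favour of the smallest class index
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, fold_preds)

print(majority)  # [0 1 1 2]
```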

In [None]:
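# Majority vote across the folds (np.ravel keeps this working across SciPy versions,
# because newer releases no longer keep the reduced axis in the result)
y_pred = np.ravel(mode(np.array(y_pred_list), axis=0).mode)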
y_pred = le.inverse_transform(y_pred)

# The surprise

Let's compare the distribution of classes in training and in our predictions. Something went wrong:

In [None]:
target_distrib = pd.DataFrame({
    'count': df_train[TARGET].value_counts(),
    'share': df_train[TARGET].value_counts() / df_train.shape[0] * 100
})

target_distrib['pred_count'] = pd.Series(y_pred, index=df_test.index).value_counts()
target_distrib['pred_share'] = target_distrib['pred_count'] / len(df_test) * 100
target_distrib.sort_index()

What went wrong? In the training data, all classes have equal frequencies of 10 %. In our predictions, *E. coli* is underpredicted with a frequency of only 8.7 %. Two explanations are possible:
1. In the test data, *E. coli* really has a frequency of only 8.7 %. And *E. fergusonii* really has a frequency of 10.8 %.
2. Because the bacteria have mutated and changed their DNA, our classifier no longer classifies them correctly.

I think the correct explanation is 2, because the [EDA has already shown that the bacteria mutate between training and test](https://www.kaggle.com/ambrosm/tpsfeb22-01-eda-which-makes-sense).

Fortunately, we can account for the mutations with a little postprocessing.

# Postprocessing

Our classifier predicts not only classes, but also probabilities. These probabilities have already been collected in `y_proba_list`. We now average them over the folds and tune them by manually adding a small bias to the probabilities of `Enterococcus hirae` (0.01) and `E. coli` (0.03).

From these tuned probabilities, we can determine new predictions by applying `np.argmax(axis=1)`, and we see that the class frequencies now are much better.
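
How such a bias changes the predictions can be illustrated on made-up numbers (the probabilities and the bias below are assumptions chosen for illustration, not values from this competition):

```python
import numpy as np

# Made-up averaged probabilities: 3 test samples x 3 classes
proba = np.array([
    [0.50, 0.48, 0.02],
    [0.30, 0.60, 0.10],
    [0.45, 0.44, 0.11],
])

# Hypothetical bias: favour class 1 slightly, leave the other classes untouched
bias = np.array([0.0, 0.03, 0.0])

pred_before = np.argmax(proba, axis=1)
pred_after = np.argmax(proba + bias, axis=1)  # bias is broadcast over all rows

print(pred_before)  # [0 1 0]
print(pred_after)   # [1 1 1]
```

Only the samples where the favoured class was a close runner-up change their label; confident predictions are unaffected.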

In [None]:
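# Average the predicted probabilities over all folds
y_proba = sum(y_proba_list) / len(y_proba_list)
# Add a bias to classes 2 and 3 (Enterococcus hirae and E. coli in the LabelEncoder order)
y_proba += np.array([0, 0, 0.01, 0.03, 0, 0, 0, 0, 0, 0])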
y_pred_tuned = le.inverse_transform(np.argmax(y_proba, axis=1))
pd.Series(y_pred_tuned, index=df_test.index).value_counts().sort_index() / len(df_test) * 100

In [None]:
submission = pd.read_csv(data_dir / 'sample_submission.csv')
submission[TARGET] = y_pred_tuned
submission.to_csv('submission.csv', index=False)
submission

# Final remark

Understanding a model's weaknesses is part of data science. The present ExtraTreesClassifier has the weakness that it does not take the train-test drift into account.

But please note that the postprocessing in this notebook is not data science. It is a workaround to compensate for the model's weakness. The real data science remains to be done: Create a model for the train-test drift which doesn't need postprocessing workarounds.