In [None]:
%%capture

!pip install scikit-learn-intelex

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import gc

import warnings
warnings.filterwarnings("ignore")

## Loading the data

In [None]:
train_df = pd.read_csv("../input/tabular-playground-series-feb-2022/train.csv", index_col=0)
test_df = pd.read_csv("../input/tabular-playground-series-feb-2022/test.csv", index_col=0)

print(f"Nb samples in train: {train_df.shape[0]}\nNb features in train: {train_df.shape[1]}\nNb samples in test: {test_df.shape[0]}\nNb features in test: {test_df.shape[1]}\n")

Let's take a quick look at the data:

In [None]:
train_df.head()

In [None]:
print(f"Total number of duplicated rows: {train_df.duplicated().sum()} out of {train_df.shape[0]}")
train_df = train_df.drop_duplicates()
print(f"Total number of rows after removal: {train_df.shape[0]}")

In [None]:
ax = train_df.target.value_counts(normalize=True).plot(kind='barh')
ax.set_xlabel('% of the total')
plt.show()

We can see that there is a similar number of samples for each type of bacteria, so the dataset is almost balanced (one less thing to worry about). What about NaNs?

In [None]:
# Check whether any column contains NaNs
(train_df.isna().sum()).any()

## Very basic EDA

Before starting any kind of EDA, it is good practice to split the data into train and test sets, so that we don't peek at the test set and don't risk overfitting to it from the very beginning. (I call this hold-out set 'test', although the competition also uses 'test' for the data that will be submitted.) Since the data is almost balanced, we don't need a stratified split.
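To get a feeling for why stratification matters so little with balanced classes, here is a minimal sketch on synthetic labels (10 balanced classes; `X_demo` and `y_demo` are made-up stand-ins, not the competition data). It compares the class proportions of the hold-out part for a plain and a stratified split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2022)

# Synthetic stand-in: 10 perfectly balanced classes, 1000 rows each
y_demo = pd.Series(np.repeat(np.arange(10), 1000), name="target")
X_demo = pd.DataFrame(rng.random((len(y_demo), 3)), columns=["f0", "f1", "f2"])

# Plain random split vs. stratified split, same seed and test size
_, _, _, y_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2022)
_, _, _, y_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2022, stratify=y_demo)

# Share of each class in the hold-out part of both splits
props = pd.DataFrame({
    "plain": y_plain.value_counts(normalize=True).sort_index(),
    "stratified": y_strat.value_counts(normalize=True).sort_index(),
})
print(props.round(3))
```

The stratified column is exactly 10% per class, and the plain one should only wobble slightly around that value. With strongly imbalanced classes the gap could be much larger, and `stratify` would be the safer choice.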

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

X = train_df.drop(columns='target')
# 1-D Series rather than a one-column DataFrame: this is the shape sklearn expects for y
y = pd.Series(le.fit_transform(train_df.target), name='target')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022)

In [None]:
# Run to print all the features
#for col in X:
#    print(col)

In [None]:
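corr_mat = X_train.corr()
# Keep each feature pair once: boolean mask of the strict lower triangle
lower_mask = np.tril(np.ones(corr_mat.shape, dtype=bool), k=-1)
corr_val = corr_mat.values[lower_mask]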

In [None]:
f, axs = plt.subplots(1, 2, figsize=(18,7))
sns.heatmap(corr_mat, ax=axs[0])
sns.histplot(data=corr_val, kde=True, stat='probability', ax=axs[1])

The correlation heatmap shows a checkerboard-like pattern that is probably related to relationships between neighbouring features (e.g., A1T1G4C4 is correlated with its neighbours A1T1G3C5 and A1T1G5C3). The distribution of the correlation coefficients (right panel) has a longer tail on the positive side, reaching about 0.8, while the negative coefficients don't reach -0.6. Interestingly, these inverse correlations appear on the heatmap as well, especially between distant areas. This could indicate that the more two strings differ, the more likely their features are to be weakly or inversely correlated. It would be interesting to look at the specific relationships behind these patterns.
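One way to look at specific relationships is to list the most strongly correlated feature pairs. The following is only a sketch on a synthetic frame (`demo` and its columns `f_a` to `f_d` are made up for illustration). On the real data, `demo.corr()` would be swapped for the correlation matrix of the training features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Made-up features: two strongly related to f_a, one unrelated
demo = pd.DataFrame({
    "f_a": base,
    "f_b": base + 0.1 * rng.normal(size=500),    # positively correlated with f_a
    "f_c": -base + 0.1 * rng.normal(size=500),   # inversely correlated with f_a
    "f_d": rng.normal(size=500),                 # independent noise
})

corr = demo.corr()

# Indices of the strict upper triangle, so every unordered pair appears once
i, j = np.triu_indices(len(corr), k=1)
pairs = pd.Series(
    corr.values[i, j],
    index=[f"{corr.index[a]} ~ {corr.columns[b]}" for a, b in zip(i, j)],
).sort_values()

print(pairs.head(2))  # most inversely correlated pairs
print(pairs.tail(2))  # most positively correlated pairs
```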


I'm wondering... what if we plot the Hamming distance between strings?

In [None]:
def hamming(a, b):
    # Count the positions where the two strings differ (zip stops at the shorter one)
    return sum(x != y for x, y in zip(a, b))

features = list(X.columns)

hamm_d = np.zeros((len(features), len(features)))
for i, ft1 in enumerate(features):
    for j, ft2 in enumerate(features):
        hamm_d[i, j] = hamming(ft1,ft2)

hamm_d = pd.DataFrame(hamm_d, index=features, columns=features)

f = plt.figure(figsize=(9, 7))
sns.heatmap(hamm_d, square=True)
plt.show()

As the figure shows, the patches of high correlation in the previous heatmap roughly match the pairs of strings with a low Hamming distance. What about the inverse correlations? It might be interesting to explore that path...
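A caveat about the Hamming distance: it compares the names character by character, so it can only say whether a count differs, not by how much. A possible alternative (my assumption, not something taken from the competition) is to parse each name into its four counts and use the L1 distance between them. The helper `parse_counts` and the name `A4T4G1C1` below are hypothetical and only there for illustration.

```python
import re
import numpy as np
import pandas as pd

def parse_counts(name):
    """Turn a name such as 'A1T1G4C4' into the integer vector [1, 1, 4, 4]."""
    return np.array([int(n) for n in re.findall(r"[ATGC](\d+)", name)])

def hamming(a, b):
    """Number of positions at which two strings differ."""
    return sum(x != y for x, y in zip(a, b))

# The first three names come from the text above; the last one is made up
names = ["A1T1G4C4", "A1T1G3C5", "A1T1G5C3", "A4T4G1C1"]
counts = np.array([parse_counts(n) for n in names])

# Pairwise L1 distance between count vectors (via broadcasting)
l1 = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=-1)
# Pairwise character-wise Hamming distance between the raw names
ham = np.array([[hamming(a, b) for b in names] for a in names])

print(pd.DataFrame(l1, index=names, columns=names))
print(pd.DataFrame(ham, index=names, columns=names))
```

In this toy example the Hamming distance never exceeds 4, while the L1 distance separates direct neighbours from far-apart compositions. Whether the L1 distance tracks the correlation structure better is something I haven't checked. The lower-triangle trick used for `corr_val` could pair each distance with its correlation coefficient to find out.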

## Model: ExtraTreesClassifier

### Cross-Validation

In [None]:
y_train

The first time I tried to predict on this dataset I used 5-fold cross-validation with XGBoost and got pretty good results in both CV and test (my own hold-out set). I saw other kernels using Extra Trees (e.g., this one: https://www.kaggle.com/maxencefzr/tps-feb22-eda-extratrees) and achieving better results, so I decided to go in that direction too.

In [None]:
etc_params = {
    'n_estimators': 300,
    'n_jobs': -1,
    'bootstrap': False,
    'verbose': 0,
    'random_state': 2022
}
y_probs = []
y_preds = []
perf = []
cv = KFold(n_splits=10, shuffle=True, random_state=2022)
for fold, (train_idx, valid_idx) in enumerate(cv.split(X_train, y_train)):    
    X_train_cv = X_train.iloc[train_idx] 
    y_train_cv = y_train.iloc[train_idx]  
    X_valid = X_train.iloc[valid_idx]
    y_valid = y_train.iloc[valid_idx]
    
    # train
    clf = ExtraTreesClassifier(**etc_params)    
    clf.fit(X_train_cv, y_train_cv)
    
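    # predict on the validation fold
    y_pred_cv = clf.predict(X_valid)
    acc = accuracy_score(y_valid, y_pred_cv)
    perf.append(acc)
    y_preds.append(y_pred_cv)
    # class probabilities for the competition test set (averaged over folds later)
    y_probs.append(clf.predict_proba(test_df))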

    
    print(f"CV - {fold+1}: {acc:.4f}")
    
print(f"Average across folds: {np.mean(perf)}")
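
The manual loop is needed here because the probabilities for the competition test set are collected fold by fold. When only the per-fold accuracies matter, `cross_val_score` does the same job in a few lines. This is a minimal sketch on synthetic data (`X_demo`, `y_demo` and the smaller `n_estimators` are assumptions to keep it fast), not a replacement for the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=3, random_state=2022)

cv = KFold(n_splits=5, shuffle=True, random_state=2022)
clf_demo = ExtraTreesClassifier(n_estimators=50, n_jobs=-1, random_state=2022)

# One accuracy value per fold; a fresh clone of the model is fitted for each fold
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv, scoring="accuracy")

for fold, acc in enumerate(scores, start=1):
    print(f"CV - {fold}: {acc:.4f}")
print(f"Average across folds: {scores.mean():.4f}")
```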

### Test set

In [None]:
# Credit: https://www.kaggle.com/grfiv4/plot-a-confusion-matrix

def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix

    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']

    title:        the text to display at the top of the matrix

    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues

    normalize:    If False, plot the raw numbers
                  If True, plot the proportions

    Usage
    -----
    plot_confusion_matrix(cm           = cm,                  # confusion matrix created by
                                                              # sklearn.metrics.confusion_matrix
                          normalize    = True,                # show proportions
                          target_names = y_labels_vals,       # list of names of the classes
                          title        = best_estimator_name) # title of graph

    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(10, 8))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=90)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]


    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.2f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")


    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.2f}; misclass={:0.2f}'.format(accuracy, misclass))
    plt.show()

In [None]:
from sklearn.metrics import confusion_matrix

# Note: `clf` is the model fitted in the last CV fold
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

plot_confusion_matrix(cm           = cm, 
                      normalize    = False,
                      target_names = le.classes_,
                      title        = "Confusion Matrix")

In [None]:
# As remekkinas says, I also think this is overfitting the data
y_prob = sum(y_probs) / len(y_probs)
# The explanations for these numbers are in AMBROSM's code
y_prob += np.array([0, 0, 0.01, 0.03, 0, 0, 0, 0, 0, 0])
y_pred_tuned = le.inverse_transform(np.argmax(y_prob, axis=1))
pd.Series(y_pred_tuned, index=test_df.index).value_counts().sort_index() / len(test_df) * 100
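
To make the two steps above concrete (averaging the fold probabilities, then nudging some classes before taking the argmax), here is a toy sketch. All the numbers, including the offsets, are made up for illustration and have nothing to do with the values used in the cell above.

```python
import numpy as np

# Hypothetical class probabilities from two folds (2 samples, 3 classes)
fold_probs = [
    np.array([[0.50, 0.45, 0.05],
              [0.10, 0.80, 0.10]]),
    np.array([[0.46, 0.49, 0.05],
              [0.20, 0.70, 0.10]]),
]

# Soft voting: average the probabilities over the folds
avg = sum(fold_probs) / len(fold_probs)

# Made-up offsets that favour class 1; broadcast over all samples
offsets = np.array([0.0, 0.03, 0.0])

print(np.argmax(avg, axis=1))            # [0 1]
print(np.argmax(avg + offsets, axis=1))  # [1 1]
```

Only the first sample changes class, because its two top probabilities were close (0.48 vs 0.47). That is also why this kind of tuning can overfit: it only moves borderline predictions, and the offsets are chosen by looking at the predicted class distribution.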

## Submission

In [None]:
sub = pd.read_csv("../input/tabular-playground-series-feb-2022/sample_submission.csv", index_col=0)
sub['target'] = y_pred_tuned
print(sub.head(10))
sub.to_csv('submission.csv')

To make further progress I think I should get more insight into the kind of data we are dealing with. In my opinion it shows some particular relationships between features that should be addressed somehow. I'll try to follow that direction.

Thanks for reading and happy kaggling! :)

## Appendix: Why do Extra Trees perform better?

The Extra Trees classifier (Extremely Randomized Trees) is a meta-estimator that fits a number of randomized decision trees and averages their predictions, applying the "wisdom of the crowd" logic. Each tree can be fitted on a bootstrap sample of the dataset, but by default, and in this notebook (`bootstrap=False`), every tree sees the whole training set. XGBoost is also an ensemble of weak learners, in this case also decision trees. However, in XGBoost each tree is trained to "correct" the errors made by the previous trees in the ensemble, whereas the trees of an Extra Trees model are built independently of each other. In addition, Extra Trees randomizes the split search: instead of looking for the optimal threshold, it draws a random threshold for each candidate feature and keeps the best of those random splits. As a result, each tree has more freedom to deviate from the dominant structure underlying the data and the trees are less correlated with one another, which, as you might imagine, seems to help in our case of highly correlated features. In a nutshell, in terms of the bias-variance trade-off, with Extra Trees we accept a bit more bias in each tree in exchange for a lower variance of the ensemble, i.e., a more robust model that is less prone to overfitting.
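To see the randomization at work, here is a small sketch that puts Extra Trees next to a Random Forest on synthetic data with many redundant (hence correlated) features. Random Forest is not used anywhere in this notebook; I only add it as a familiar baseline that searches for the best threshold and bootstraps by default. The dataset and all its parameters are assumptions, so the scores say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: 6 informative features plus 18 linear combinations of them
X_demo, y_demo = make_classification(
    n_samples=800, n_features=30, n_informative=6, n_redundant=18,
    n_classes=4, random_state=2022)

models = {
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=2022),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=2022),
}

results = {}
for name, model in models.items():
    # 5-fold CV accuracy (stratified folds are sklearn's default for classifiers)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = scores.mean()
    print(f"{name}: bootstrap={model.bootstrap}, "
          f"mean CV accuracy={scores.mean():.3f}")
```

Which of the two wins depends on the data, so the comparison is best repeated on the actual training set before drawing any conclusion.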


Sources:

https://www.kaggle.com/questions-and-answers/196968

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

https://towardsdatascience.com/an-intuitive-explanation-of-random-forest-and-extra-trees-classifiers-8507ac21d54b

PS: I've run into this site recently and they have an AMAZING visual explanation of the bias-variance trade-off in case you are interested:
https://mlu-explain.github.io/bias-variance/