The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with predicting the category of an eCommerce product given various attributes of the listing. Although the features are anonymized, they have properties relating to real-world features.

Submissions are evaluated using multi-class logarithmic loss.
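
For reference, multi-class log loss is the mean negative log of the probability assigned to each sample's true class. The sketch below is only an illustration: the labels and probabilities are made up, and `multiclass_log_loss` is a hypothetical helper written here to compare against `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss


def multiclass_log_loss(y_true_idx, proba, eps=1e-15):
    """Mean negative log-probability of the true class (hypothetical helper)."""
    proba = np.clip(proba, eps, 1 - eps)  # guard against log(0)
    true_class_proba = proba[np.arange(len(y_true_idx)), y_true_idx]
    return -np.mean(np.log(true_class_proba))


# Toy data (assumption): 4 samples, 4 classes, one row of probabilities per sample
toy_y = np.array([0, 1, 2, 3])
toy_proba = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.10, 0.10, 0.60, 0.20],
    [0.25, 0.25, 0.25, 0.25],
])

manual = multiclass_log_loss(toy_y, toy_proba)
reference = log_loss(toy_y, toy_proba, labels=[0, 1, 2, 3])
print(manual, reference)
```

The two values should agree, which suggests the metric rewards well-calibrated probabilities rather than just correct class labels.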

In [1]:
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

Setup Complete


In [2]:
train_data = pd.read_csv("../input/tabular-playground-series-may-2021/train.csv", index_col="id")
test_data = pd.read_csv("../input/tabular-playground-series-may-2021/test.csv", index_col="id")

# train_data.head()
# test_data.head()

Ok, so the dataset is clean. There is no missing data. 
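
One way to check this without printing the frames is to count missing values per column. The sketch below runs on a toy DataFrame with made-up column names and values; on the real data the equivalent call would be `train_data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (hypothetical columns and values)
toy = pd.DataFrame({
    "feature_0": [0, 1, np.nan, 3],
    "feature_1": [5, 6, 7, 8],
})

missing_per_column = toy.isna().sum()         # NaN count for each column
total_missing = int(missing_per_column.sum())  # 0 would mean no missing data

print(missing_per_column)
print(f"Total missing values: {total_missing}")
```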

In [3]:
y = train_data["target"]
X = train_data.drop(columns="target")

In [5]:
classes = list(y.unique())
print(f"There are {len(classes)} classes:\n {classes}")

There are 4 classes:
 ['Class_2', 'Class_1', 'Class_4', 'Class_3']


# Tuning hyper-parameters

For tuning hyper-parameters, a search consists of:
- an estimator (regressor or classifier)
- a parameter space
- a method for searching or sampling candidates
- a cross-validation scheme
- a score function
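
As a rough illustration of how these five pieces fit together, here is a minimal sketch on synthetic data. The data from `make_classification`, the grid values, and the `toy_*` names are assumptions for the example, not the competition setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the competition set (assumption)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# GridSearchCV is the search method (exhaustive); its arguments are the other four pieces
toy_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),   # estimator
    param_grid={"max_depth": [2, 4, 6]},                 # parameter space
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),  # cross-validation scheme
    scoring="neg_log_loss",                              # score function
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_, toy_search.best_score_)
```

The sections below set up each of these pieces in turn for the competition data.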


## Estimators

In [18]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    ExtraTreesClassifier,
    AdaBoostClassifier
)

n_estimators = 30

clf_0 = DecisionTreeClassifier(max_depth=None)
clf_1 = RandomForestClassifier(n_estimators=n_estimators, n_jobs=-1)
clf_2 = ExtraTreesClassifier(n_estimators=n_estimators)
clf_3 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=n_estimators)

models = [clf_0, clf_1, clf_2, clf_3]

## Parameter space

In [7]:
# checking the parameters for all models
for clf in models:
    print(clf.get_params())

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': None, 'splitter': 'best'}
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 30, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
{'bootstrap': False, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2

In [8]:
# Let's focus on the second model: RandomForest
param_grid = [
    {'n_estimators': [10, 30, 50, 100]},
]

## Cross-validation scheme

In [23]:
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(
    n_splits=10, n_repeats=1, random_state=0
)
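
Stratification keeps the class proportions roughly equal across folds, which matters when classes are imbalanced. The sketch below uses made-up labels (80 of class "a", 20 of class "b") to count how many "b" samples land in each test fold.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy imbalanced labels (hypothetical): 80 samples of "a", 20 of "b"
toy_labels = np.array(["a"] * 80 + ["b"] * 20)
toy_features = np.zeros((len(toy_labels), 1))  # placeholder; only the labels drive stratification

toy_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=1, random_state=0)

# Number of "b" samples in each of the 10 test folds
fold_b_counts = [
    int((toy_labels[test_idx] == "b").sum())
    for _, test_idx in toy_cv.split(toy_features, toy_labels)
]
print(fold_b_counts)
```

Each test fold receives the same share of the minority class, so every fold can be scored with log loss over all classes.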

## Score function

In [24]:
from sklearn.metrics import log_loss

In [29]:
from sklearn.model_selection import train_test_split

# Hold out a test split (25% by default) and score the unpruned tree on it
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf_0.fit(X_train, y_train)
proba_preds = clf_0.predict_proba(X_test)

log_loss(y_test, proba_preds)

20.697044092773496
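
A loss near 20 is far worse than the roughly 1.39 (ln 4) that a uniform guess over four classes would give. The likely cause is that a fully grown tree (`max_depth=None`) outputs hard 0/1 probabilities, and log loss penalizes every confident mistake very heavily. The sketch below, with made-up labels and probabilities, illustrates the effect.

```python
import numpy as np
from sklearn.metrics import log_loss

labels = [0, 1, 2, 3]
toy_true = np.array([0, 1, 2, 3])

# Hard predictions: all probability mass on one class, two of four rows wrong
hard_proba = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],  # wrong class
    [1.0, 0.0, 0.0, 0.0],  # wrong class
])
# Uniform predictions: uninformative, but never confidently wrong
uniform_proba = np.full((4, 4), 0.25)

hard_loss = log_loss(toy_true, hard_proba, labels=labels)
uniform_loss = log_loss(toy_true, uniform_proba, labels=labels)

print(f"hard 0/1 predictions: {hard_loss:.3f}")
print(f"uniform predictions:  {uniform_loss:.3f}")
```

Limiting the tree depth, as the grid search further down does, makes the leaves impure and so yields softer probabilities.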

In [42]:
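from sklearn.preprocessing import OneHotEncoder

# One-hot encode the held-out labels so they can be compared with the predicted probabilities
onehot = OneHotEncoder()
y_true = onehot.fit_transform(pd.DataFrame(y_test)).toarray()
log_loss(y_true, proba_preds)  # value not displayed: a cell only shows its last expression
proba_preds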

array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       ...,
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.]])

In [45]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    clf_0, X, y,
    cv=cv,
    scoring="neg_log_loss",
    verbose=3
)
# print(f"Using cross-validation = {scores}")

score_mean = scores.mean()
score_std = scores.std()

# Report the mean with its standard error, then the standard deviation across folds
print(f"Score mean = {score_mean} +- {score_std/(scores.size)**0.5}")
print(f"Score std = {score_std}")

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] END .............................. score: (test=-20.254) total time=   2.3s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.4s remaining:    0.0s
[CV] END .............................. score: (test=-20.571) total time=   2.3s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.8s remaining:    0.0s
[CV] END .............................. score: (test=-20.388) total time=   2.3s
[CV] END .............................. score: (test=-20.678) total time=   2.2s
[CV] END .............................. score: (test=-21.034) total time=   2.2s
[CV] END .............................. score: (test=-20.810) total time=   2.3s
[CV] END .............................. score: (test=-20.637) total time=   2.2s
[CV] END .............................. score: (test=-20.568) total time=   2.3s
[CV] END .............................. score: (test=-20.692) total time=   2.2s
[CV] END .............................. score: (test=-20.447) total time=   2.3s
Score mean = -20.60793404967463 +- 0.06618312025456134
Score std = 0.20928940266123675
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   23.7s finished

## Searching method

### Exhaustive Grid Search

In [50]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'max_depth': [3, 4, 5]},
]

search = GridSearchCV(
    estimator=clf_0, param_grid=param_grid,
    scoring='neg_log_loss', cv=cv,
    verbose=4
)
search.fit(X, y)

Fitting 10 folds for each of 3 candidates, totalling 30 fits
[CV 1/10] END .....................max_depth=3;, score=-1.115 total time=   0.3s
[CV 2/10] END .....................max_depth=3;, score=-1.114 total time=   0.3s
[CV 3/10] END .....................max_depth=3;, score=-1.116 total time=   0.3s
[CV 4/10] END .....................max_depth=3;, score=-1.113 total time=   0.3s
[CV 5/10] END .....................max_depth=3;, score=-1.114 total time=   0.3s
[CV 6/10] END .....................max_depth=3;, score=-1.116 total time=   0.3s
[CV 7/10] END .....................max_depth=3;, score=-1.114 total time=   0.3s
[CV 8/10] END .....................max_depth=3;, score=-1.116 total time=   0.3s
[CV 9/10] END .....................max_depth=3;, score=-1.114 total time=   0.3s
[CV 10/10] END ....................max_depth=3;, score=-1.115 total time=   0.3s
[CV 1/10] END .....................max_depth=4;, score=-1.114 total time=   0.4s
[CV 2/10] END .....................max_depth=4;,

GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=0),
             estimator=DecisionTreeClassifier(),
             param_grid=[{'max_depth': [3, 4, 5]}], scoring='neg_log_loss',
             verbose=4)

In [51]:
results_df = pd.DataFrame(search.cv_results_)
results_df = results_df.sort_values(by=['rank_test_score'])
results_df = (
    results_df
    .set_index(results_df["params"].apply(
        lambda x: "_".join(str(val) for val in x.values()))
    )
    .rename_axis('max_depth')  # the index holds the max_depth values being searched
)
results_df[
    ['params', 'rank_test_score', 'mean_test_score', 'std_test_score']
]

max_depth  params            rank_test_score  mean_test_score  std_test_score
3          {'max_depth': 3}  1                -1.114662        0.000963
4          {'max_depth': 4}  2                -1.114889        0.002035
5          {'max_depth': 5}  3                -1.120137        0.003274


In [52]:
# Use the best-ranked depth from the grid search above (max_depth=3)
clf_final = DecisionTreeClassifier(max_depth=3)
clf_final.fit(X, y)
preds = clf_final.predict_proba(test_data)

In [64]:
output = pd.DataFrame(data=preds, columns=clf_final.classes_)
output.insert(0, 'id', test_data.index)
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
