How to Tackle a ML Competition

Competition Mechanics (Syllabus)

Reference :

Starting Approx. on 1st week of August 2020


Lecture 1 : Feature Processing [Simo]

Videos :

notebook|FeatureProcessing.ipynb

Intro : feature, physics, and scaling —> In physics we usually have some hints on how to "preprocess" features, i.e. we have some law...

Overview of ML Approaches (Recap)

  1. Linear Models
  2. Tree-Based Models
  3. k Nearest Neighbors (kNN)
  4. Neural Networks
  5. No Free Lunch Theorem

Feature Preprocessing

Lecture: Numeric Features, Categorical and Ordinal Features, Datetime and Coordinates, Handling Missing Values

Additional Material and Links:

Feature Extraction from Images / Text


  • Text Preprocessing: lowercase, lemmatization, stemming, stopwords
  • Bag of Words (sklearn CountVectorizer); post-processing: TF-IDF (term frequency, inverse document frequency), N-grams
  • Embeddings (Word2Vec, GloVe, FastText, Doc2Vec): vectorizing text, King-Queen Man-Woman → Pretrained Models


  • CNNs for feature extraction, Images —> Vectors; Descriptors (output from different layers)
  • Finetuning & Training from Scratch
  • Augmentation (cropping, rotating, etc.)

Additional Material & Links:

Bag of words (Text)

Word2vec (Text)

NLP Libraries

Pretrained models (Images)

Finetuning (Images)

Lecture 2 : Exploratory Data Analysis [Jacopoison]

Exploratory Data Analysis (EDA)

It allows you to better understand the data: Do EDA first, don't immediately dig into modeling.

Main tools:

  • Visualizations: Patterns lead to questions
  • Exploiting Leaks

A few steps in EDA:

  • Get domain knowledge
  • Check if data is intuitive, possible misinterpretations
  • Understand how the data was generated (class imbalance, crucial for proper validation scheme)

Exploring Anonymized Data

Sensitive data are usually hashed to avoid privacy related issues (setting up a notebook)

  1. We can try to decode (de-anonymize) the data, in a legal way (of course 🙂) by guessing the true meaning of the features or at least guess the true type (categorical etc..) —> individual features

  2. Or else, we can try to figure out feature relations

  3. Helpful functions: df.dtypes, x.value_counts(), x.isnull()

  • Histograms (plt.hist(x)); they aggregate data, though → can be confusing
  • Index vs Values Plot
  • Feature statistics with pandas —> df.describe()
  • Features Mean vs Feature (by sorting this plot you can find relations —> in a data augmentation framework it might be useful)
  1. Plots in order to explore feature relations → Visualizations are our art tools in the art of EDA
  • plt.scatter(x1, x2)
  • pd.plotting.scatter_matrix(df)
  • df.corr(), plt.matshow()
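
The helper calls above can be tried on a small synthetic frame. This is only a sketch: the column names and the data are invented, and the plotting calls are left out so that the snippet runs without a display.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=200),
    "cat": rng.choice(["a", "b", "c"], size=200),
})
df["x2"] = 2.0 * df["x1"] + rng.normal(scale=0.1, size=200)   # strongly related to x1
df.iloc[::25, df.columns.get_loc("x2")] = np.nan              # inject a few missing values

print(df.dtypes)                   # guess the type of each feature
print(df["cat"].value_counts())    # frequency of each level
print(df.isnull().sum())           # missing values per column
print(df.describe())               # summary statistics of numeric features

corr = df[["x1", "x2"]].corr()     # pairwise correlations (NaNs are skipped pairwise)
print(corr)
```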

Dataset Cleaning (and other things to check)

The organizers could give us a fraction of objects they have or a fraction of features. And that is why we can have some issues with the data

  • E.g. a feature that takes the same value in both training and test set (maybe it's a fraction of the whole amount of data that organizers have) —> since it is constant it's not useful for our purposes —> remove it
  • Duplicated features , sometimes two columns are completely identical —> remove one (traintest.T.drop_duplicates())
  • Duplicated categorical features: the features are identical but their levels have different names (let's relabel the levels of each feature and then look for duplicates)
for f in categorical_features:
	traintest[f] = traintest[f].factorize()[0]  # factorize() returns (codes, uniques); keep only the codes


! We need to do label encoding correctly !

  • Same with rows → but one more question: why do we have duplicated rows? A bug in the generation process?!
  • Check if dataset is shuffled ! (very useful) if not there might be data leakage. We can plot a feature vs row index and additionally smooth values with rolling average techniques
  • One more time —> Visualize every possible thing —> Cool viz lead to cool features 😉
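
A small sketch of the constant-column and duplicated-column checks from this list, on an invented frame. The names `traintest` and `categorical_features` follow the notes; the data is made up.

```python
import pandas as pd

traintest = pd.DataFrame({
    "const": [1, 1, 1, 1, 1, 1],
    "f1": ["a", "b", "a", "c", "b", "a"],
    "f2": ["x", "y", "x", "z", "y", "x"],   # same as f1 up to a renaming of levels
    "num": [3, 1, 4, 1, 5, 9],
})

# 1. Constant features carry no information: drop them.
constant_cols = [c for c in traintest.columns if traintest[c].nunique(dropna=False) == 1]
traintest = traintest.drop(columns=constant_cols)

# 2. Relabel categorical levels by order of first appearance, so that
#    identical columns with different level names become equal.
categorical_features = ["f1", "f2"]
encoded = traintest.copy()
for f in categorical_features:
    encoded[f] = encoded[f].factorize()[0]

# 3. Transpose so that duplicated columns show up as duplicated rows.
is_dup = encoded.T.duplicated()
duplicated_cols = list(is_dup[is_dup].index)
traintest = traintest.drop(columns=duplicated_cols)

print(constant_cols, duplicated_cols, list(traintest.columns))
```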

Additional Material & Links:

Visualization tools

Others (Advanced)

Springleaf competition EDA


NumerAi Competition EDA

Notebooks too

  • Hardcore EDA —> sorted correlation matrix

You've probably noticed that it's much about Reverse Engineering & Creativity

Lecture 3: Validation & Overfitting Revisited [Ari]

Notebook:


Validation and Overfitting

  • Train, Validation & Test (Public + Private) Sets
  • Underfitting vs Overfitting Recap
  • Overfitting in general ≠ Overfitting in Competitions

Validation Strategies

  • Holdout: ngroups = 1 (sklearn.model_selection.ShuffleSplit)
  • K-Fold: ngroups = k (sklearn.model_selection.KFold) and difference between K-Fold and K times Holdout
  • Leave-one-out: ngroups = len(train) (sklearn.model_selection.LeaveOneOut), useful if we have too little data
  • Stratification : a random split can sometimes fail, we need a way to ensure similar target distribution over different folds
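
The splitters named above can be compared side by side. A minimal sketch, assuming scikit-learn; the toy data and the 20% positive rate are invented.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, StratifiedKFold

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)      # imbalanced target: 20% positives

holdout = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
loo = LeaveOneOut()
stratified = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Number of train/validation splits produced by each strategy.
n_splits = {
    "holdout": holdout.get_n_splits(X),
    "kfold": kfold.get_n_splits(X),
    "loo": loo.get_n_splits(X),
}
print(n_splits)

# Stratification keeps the target distribution similar across folds.
positives_per_fold = [int(y[val].sum()) for _, val in stratified.split(X, y)]
print(positives_per_fold)
```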

The main rule you should know — never use data you train on to measure the quality of your model. The trick is to split all your data into training and validation parts.

Below you will find several ways to validate a model.

  • Holdout scheme:
    1. Split train data into two parts: partA and partB.
    2. Fit the model on partA, predict for partB.
    3. Use predictions for partB to estimate model quality. Find the hyper-parameters that maximize quality on partB.
  • K-Fold scheme:
    1. Split train data into K folds.
    2. Iterate over folds: retrain the model on all folds except the current one, predict for the current fold.
    3. Use the predictions to calculate quality on each fold. Find the hyper-parameters that maximize quality on each fold. You can also estimate the mean and variance of the loss, which is very helpful for judging the significance of an improvement.
  • LOO (Leave-One-Out) scheme:
    1. Iterate over samples: retrain the model on all samples except the current one, predict for the current sample. You will need to retrain the model N times (if N is the number of samples in the dataset).
    2. In the end you will get LOO predictions for every sample in the train set and can calculate the loss.

Notice that these validation schemes are meant for estimating the quality of the model. Once you have found the right hyper-parameters and want test predictions, don't forget to retrain your model on all training data.
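
As an illustration of the K-Fold scheme followed by the final retraining, here is a sketch with scikit-learn. The Ridge model, the alpha grid and the synthetic data are assumptions made for the example, not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for alpha in [0.01, 1.0, 100.0]:
    # cross_val_score refits the model on K-1 folds and scores it on the held-out fold.
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    results[alpha] = (-scores.mean(), scores.std())   # mean and spread of the fold MSEs

best_alpha = min(results, key=lambda a: results[a][0])

# Validation is only for model selection: the final model is retrained on all data.
final_model = Ridge(alpha=best_alpha).fit(X, y)
print(results, best_alpha)
```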

Data Splitting Strategies

Setup validation to mimic train / test split. E.g. time series, we need to rely on the time trend instead of randomly picking up values —> Time Based Splits

Different splitting strategies can differ significantly

  1. In generated features
  2. In the way the model will rely on those features
  3. In some kind of target leak

Splitting Data into Train and Validation

  • Random, Rowwise ; Most common, we assume that rows are independent from each other
  • Timewise, we generally have everything before some date in the train-set and everything after in the test-set (e.g. Moving window validation)
  • By ID, id can be a unique identifier of something
  • Combined, combining some of the above mentioned
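
A sketch of time-wise and ID-wise splitting with scikit-learn, assuming the rows are already sorted by time. The `user_id` column is a hypothetical identifier invented for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

n = 12
X = np.arange(n).reshape(-1, 1)         # rows assumed sorted by time
user_id = np.repeat(np.arange(4), 3)    # hypothetical ID: 4 users, 3 rows each

# Timewise: every validation row comes after every training row.
time_ok = all(
    train.max() < val.min()
    for train, val in TimeSeriesSplit(n_splits=3).split(X)
)

# By ID: no user appears in both the training and the validation part.
ids_ok = all(
    set(user_id[train]).isdisjoint(user_id[val])
    for train, val in GroupKFold(n_splits=4).split(X, groups=user_id)
)
print(time_ok, ids_ok)
```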

Logic of feature generation depends on the data splitting strategy.

Problems Occurring During Validation

Validation problems (usually caused by inconsistency of data):

  • Validation Stage (e.g. if we are predicting sales we should take a look at holidays, so there's a reason to expect some particular behavior)

  • Submission Stage (e.g. LeaderBoard (LB) score is consistently higher/lower than validation score, LB score is not correlated with validation score at all) → Leaderboard Probing (calculate mean for train data and try to probe the test data distribution by submitting .. 🤯)

    What if we have imbalanced distributions? We should ensure the same distribution in test and validation (again, by LB probing)

LB Shuffle: it happens when positions on Public and Private LB are drastically different, main reasons:

  • Randomness (main main reason)
  • Little amount of data
  • Different Public & Private distributions

Additional Material & Links:

Lecture 4: Data Leakages (or how to cheat)

Not sure whether to do this one; it really depends on the competition. If it is Kaggle-like, we should take a look at leaks.

Basic Data Leaks

  1. Leaks in time series: Split should be done on time
    • In real life we don't have information from the future
    • Check train, public and private splits. If one of them is not on time you've found a data leak
    • Even when split by time, features may contain information about future: User history in CTR tasks; Weather
  2. Unexpected information:
    • Metadata
    • Information on IDs
    • Row Order

Leaderboard Probing

  • Categories tightly connected with "id" are vulnerable to LB probing
  • Adapting global mean via LB probing
  • Some competition with data leakages: Truly Native; Expedia; Flavours of Physics
  • Pairwise tasks, data leakage in item frequencies

Case Study: Expedia Kaggle Competition

Additional Material & Links:

Lecture 4 : Metrics Optimization [Pio]


  • Why there are so many
  • Why should we care about them in competitions
  • Loss vs Metric
  • Review the most important metrics
  • Optimization techniques for the metrics

Metrics are an essential part of any competition, they are used to evaluate our submissions. Why do we have different metrics for each competition? There are different ways to measure the quality of an algorithm

  • E.g. How to formalize effectiveness for an online shop ? It can be the number of times the website was visited or the number of times something was ordered using this website

Chosen metric determines optimal decision boundary


If your model is scored with some metric, you get best results by optimizing exactly that metric

With LB probing we can check whether the train and test sets have incongruent distributions; we have to be careful with metric optimization if there is some imbalance.

Regression Metrics

Add Notebook

  • MSE, RMSE, R-Squared
  • MAE
  • MSLE

MSE: Mean Square Error


RMSE: Root Mean Square Error




MAE: Mean Absolute Error


MSPE : Mean Square Percentage Error


MAPE: Mean Absolute Percentage Error


(R)MSLE: Root Mean Square Logarithmic Error
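
The formulas for this section did not survive in these notes. For reference, the standard textbook definitions are sketched below. The notation is an assumption: $y_i$ is the target, $\hat{y}_i$ the prediction, $\bar{y}$ the target mean and $N$ the number of samples.

```latex
\begin{align*}
\mathrm{MSE}   &= \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \\
\mathrm{RMSE}  &= \sqrt{\mathrm{MSE}} \\
R^2            &= 1 - \frac{\mathrm{MSE}}{\frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{y})^2} \\
\mathrm{MAE}   &= \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{MSPE}  &= \frac{100\%}{N}\sum_{i=1}^{N} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \\
\mathrm{MAPE}  &= \frac{100\%}{N}\sum_{i=1}^{N} \left\lvert\frac{y_i - \hat{y}_i}{y_i}\right\rvert \\
\mathrm{RMSLE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N} \bigl(\log(y_i + 1) - \log(\hat{y}_i + 1)\bigr)^2}
\end{align*}
```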


Classification Metrics

Accuracy : How frequently our class prediction is correct


  • Best constant : predict the most frequent class —> dummy example : Dataset of 10 cats and 90 dogs —> Always predicts dog and we get 90% accuracy, the baseline accuracy could be very high even if the result is not correct

Log Loss : Logarithmic Loss


AUC ROC : Area Under Curve

Cohen's Kappa (& Weighted Kappa)



Approaches for Metrics Optimization

Loss and Metric

  • Target Metric : is what we want to optimize (e.g Accuracy)
  • Optimization loss : is what our model optimizes


Approaches for target metric optimization

  • Just run the right model (lol!) : MSE, Logloss
  • Preprocess train and optimize another metric : MSPE, MAPE, RMSLE
  • Optimize another metric, postprocess predictions: Accuracy, Kappa
  • Write Custom Loss Function (if you can)
  • Optimize another metric, use early stopping
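
As a sketch of the "preprocess train and optimize another metric" approach for RMSLE: move the target to log space, optimize plain MSE there, then transform predictions back. A constant predictor stands in for a real model here, which is an assumption made to keep the example small.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))

y = np.array([1.0, 10.0, 100.0, 1000.0])

# Preprocess: in log1p space, RMSLE is just RMSE.
z = np.log1p(y)
# The mean is the best constant for MSE; it stands in for any model trained with MSE.
z_hat = np.full_like(z, z.mean())
# Postprocess: map predictions back to the original scale.
y_hat = np.expm1(z_hat)

# For comparison: the constant that is optimal for MSE on the raw target.
naive = np.full_like(y, y.mean())

print(rmsle(y, y_hat), rmsle(y, naive))
```

The log-space constant gives a lower RMSLE than the raw mean, which is why the right target transformation matters for this metric.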

Probability Calibration

  • Platt Scaling : Just fit Logistic Regression to your prediction
  • Isotonic Regression
  • Stacking : Just fit XGBoost or neural net to your predictions
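
A sketch of Platt scaling and isotonic regression with scikit-learn. The scores below are synthetic and deliberately miscalibrated (the true probability is the square of the score). In practice the calibrator would be fitted on held-out predictions rather than on the data the model was trained on.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical uncalibrated model scores and true labels on a held-out set.
scores = rng.uniform(0, 1, size=500)
labels = (rng.uniform(0, 1, size=500) < scores ** 2).astype(int)   # P(y=1) = score^2

# Platt scaling: logistic regression on the raw scores.
platt = LogisticRegression().fit(scores.reshape(-1, 1), labels)
p_platt = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: a non-decreasing step function from score to probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(scores, labels)
p_iso = iso.predict(scores)

print(p_platt[:3], p_iso[:3])
```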

Additional Material & Links :

DAIgnosis: Exploring the Space of Metrics




Lecture 5 : Mean Encodings[?]

We can follow this one —>

The general idea of this technique is to add new variables based on some feature. In simplest cases, we encode each level of a categorical variable with the corresponding target mean

Why does it work ?

  • Label encoding gives random order.. No correlation with target
  • Mean encoding helps to separate zeros from ones

It turns out that this sorting quality of mean encoding is quite helpful. Remember, what is the most popular and effective way to solve a machine learning problem? It is gradient boosting on trees (XGBoost). One of its few downsides is an inability to handle high-cardinality categorical variables: trees have limited depth, and mean encoding lets us compensate for that.



  • Cross Validation inside training data (CV loop)

    Usually decent results with 4-5 folds

  • Smoothing

  • Adding random noise

  • Sorting and calculating expanding mean
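
Three of the techniques above (CV loop, smoothing, expanding mean) can be sketched with pandas. The column names, the fold assignment and the smoothing strength `alpha` are assumptions for illustration. The smoothed encoding is computed on the full train set here only to show the formula; in practice it is usually combined with the CV loop.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train = pd.DataFrame({
    "city": rng.choice(["A", "B", "C"], size=12),
    "target": rng.integers(0, 2, size=12),
})
global_mean = train["target"].mean()

# CV loop: each row is encoded with target means computed on the other folds only.
n_folds = 4
fold = np.arange(len(train)) % n_folds          # rows are assumed to be shuffled already
train["city_enc_cv"] = np.nan
for k in range(n_folds):
    in_fold = fold == k
    means = train.loc[~in_fold].groupby("city")["target"].mean()
    train.loc[in_fold, "city_enc_cv"] = train.loc[in_fold, "city"].map(means)
train["city_enc_cv"] = train["city_enc_cv"].fillna(global_mean)   # levels unseen in other folds

# Smoothing: shrink the level mean towards the global mean for rare levels.
alpha = 5.0
stats = train.groupby("city")["target"].agg(["mean", "count"])
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)
train["city_enc_smooth"] = train["city"].map(smoothed)

# Expanding mean: use only the rows that come before the current one.
prev_sum = train.groupby("city")["target"].cumsum() - train["target"]
prev_cnt = train.groupby("city").cumcount()
train["city_enc_exp"] = (prev_sum / prev_cnt).fillna(global_mean)

print(train)
```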

Extensions and generalizations

  • Regression and multiclass
  • Many-to-many relations
  • Time Series
  • Interactions and numerical features

Lecture 6: Hyperparameter Tuning & Advanced Features[?]

How do we tune hyperparameters?

  • Select the most influential parameters

    There are tons of params and we can't tune all of them

  • Understand, how exactly they influence the training

  • Tune them

    Manually (change and examine)

    Automatically (hyperopt, grid search, etc.) → some libraries: Hyperopt; Scikit-optimize; Spearmint; GPyOpt; RoBo; SMAC3

    We need to define a function that specifies all the params, and a search space: the range for the params in which we want to look for the solution.

  • Different values for params can lead to 3 behaviors:

    1. Underfitting (bad)
    2. Good Fit and Generalization (good)
    3. Overfitting (bad)
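
The "objective function plus search space" idea can be sketched without any tuning library. The example below is a plain random search over the regularization strength of a ridge regression; it is a stand-in for the libraries listed above, not their API, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
w_true = rng.normal(size=10)
y = X @ w_true + rng.normal(scale=1.0, size=120)
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

def objective(params):
    """Validation MSE of ridge regression for one hyperparameter setting."""
    alpha = params["alpha"]
    n_features = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(n_features), X_tr.T @ y_tr)
    return float(np.mean((X_val @ w - y_val) ** 2))

def sample_params():
    """Search space: alpha drawn log-uniformly from [1e-3, 1e3]."""
    return {"alpha": 10 ** rng.uniform(-3, 3)}

trials = [sample_params() for _ in range(30)]
losses = [objective(p) for p in trials]
best = trials[int(np.argmin(losses))]
print(best, min(losses))
```

Very small alpha values sit on the overfitting side and very large ones on the underfitting side; the search looks for the region in between.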

Color-Coding Legend

Red Parameter :

  • Increasing it impedes fitting, it reduces overfitting

Green parameter:

  • Increasing it leads to a better fit on train set, increase it if model underfits, decrease it if it overfits

Tree-based Models

  • GBDT (XGBoost & LightGBM)



  • Random Forest / Extra Trees

    Notebook on how to find sufficient n_estimators

    n_estimators (the higher the better)

    max_depth (it can be unlimited)



    Others: criterion (gini etc) , random_state, n_jobs
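
One possible way to look for a sufficient n_estimators, sketched with scikit-learn's `warm_start` option so that trees are added to the existing forest instead of refitting from scratch. The dataset and the list of sizes are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# warm_start=True keeps the already fitted trees and only grows the new ones.
rf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
history = []
for n in [10, 25, 50, 100, 200]:
    rf.set_params(n_estimators=n)
    rf.fit(X_tr, y_tr)
    history.append((n, rf.score(X_val, y_val)))   # validation accuracy with n trees

for n, acc in history:
    print(n, round(acc, 3))
```

Once the validation score stops improving, adding more trees mostly costs time.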

Neural Nets

  • Number of neurons per layer

  • Number of layers

  • Optimizers

    SGD + Momentum

    • Better generalization

    Adam/Adagrad ...

    • Adaptive methods lead to more overfitting
  • Batch Size

  • Learning Rate, there's a connection between batch size and learning rate (the two scale in direct proportion)

  • Regularization

    • L2 / L1 for weights
    • Dropout /Dropconnect
    • Static Dropconnect

Linear Models

  • SVC/SVR (sklearn)
  • Logistic Regression + regularizers (sklearn)
  • SGDClassifier / SGDRegressor (sklearn)
  • FTRL, Follow The Regularized Leader (Vowpal Wabbit) → For data sets that do not fit in memory, we can use Vowpal Wabbit. It implements learning of linear models in an online fashion: it reads data row by row directly from the hard drive and never loads the whole data set into memory, which allows learning on very large data sets.

Regularization Parameter (C, alpha, lambda ...)

Practical Guide

Lecture 7: Advanced Features [Simo]

Statistics and distance based features

  • Statistics on initial features

  • Neighbors (kNN, Bray - Curtis metric etc)

  • Matrix Factorizations for Feature Extraction (NMF, SVD, PCA)

    Pay attention to apply the same transformation to all your data (concatenate train & test and apply PCA or whatever)

  • Feature Interactions

    Sums, Diffs, Multiplications, Divisions


  • t-SNE , UMAP (Manifold Learning Methods)

    Interpretation of hyperparameters (e.g. Perplexity)
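
A sketch of two items from this list: fitting PCA on the concatenated train and test data so that both get the same transformation, and building simple pairwise interactions. The arrays are random placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 6))
test = rng.normal(size=(40, 6))

# Fit the factorization once on train + test, then transform both with it.
full = np.vstack([train, test])
pca = PCA(n_components=2, random_state=0).fit(full)
train_pca, test_pca = pca.transform(train), pca.transform(test)

# Feature interactions between two columns: sum, difference, product, ratio.
a, b = train[:, 0], train[:, 1]
ratio = np.divide(a, b, out=np.zeros_like(a), where=b != 0)   # 0 where b == 0
interactions = np.column_stack([a + b, a - b, a * b, ratio])

print(train_pca.shape, test_pca.shape, interactions.shape)
```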

Additional Material & Links:

Matrix Factorization:



Lecture 8: Ensembling [Ari]

We take it from here →

An ensemble method combines the predictions of many individual classifiers by majority voting.

Ensemble of low-correlating classifiers with slightly greater than 50% accuracy will outperform each of the classifiers individually.

Condorcet's jury theorem:

  • If each member of the jury (of size N) makes an independent judgement and the probability p of the correct decision by each juror is more than 0.5, then the probability P_N of the correct decision by the majority tends to one as N grows. On the other hand, if p<0.5 for each juror, then the probability tends to zero.

    $$P_N = \sum_{i=m}^{N} \binom{N}{i} p^i (1-p)^{N-i}$$

  • where m is the minimal number of jurors that would make a majority.

  • But real votes are not independent, and do not have uniform probabilities.
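
The jury theorem is easy to check numerically. The sketch below sums the binomial probabilities of all outcomes in which at least a majority of independent jurors is correct; an odd jury size is assumed so that ties cannot occur.

```python
from math import comb

def majority_correct_probability(n_jurors, p):
    """P(majority is correct) for n_jurors independent jurors, each correct with probability p."""
    m = n_jurors // 2 + 1   # minimal number of jurors that makes a majority
    return sum(
        comb(n_jurors, i) * p**i * (1 - p) ** (n_jurors - i)
        for i in range(m, n_jurors + 1)
    )

for n in (1, 3, 11, 101):
    print(n, majority_correct_probability(n, 0.6), majority_correct_probability(n, 0.4))
```

With p = 0.6 the probability climbs towards one as the jury grows; with p = 0.4 it falls towards zero.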

Uncorrelated submissions clearly do better when ensembled than correlated submissions.

Majority votes make most sense when the evaluation metric requires hard predictions.

Choose bagging for base models with high variance.

Choose boosting for base models with high bias.

Use averaging, voting or rank averaging on manually-selected well-performing ensembles.



Averaging

  • Averaging is taking the mean of individual model predictions.
  • Averaging predictions often reduces variance (as bagging does).
  • It’s a fairly trivial technique that results in easy, sizeable performance improvements.
  • Averaging exactly the same linear regressions won't give any penalty.
  • An often heard shorthand for this on Kaggle is "bagging submissions".

Weighted averaging

  • Use weighted averaging to give a better model more weight in a vote.

  • A very small number of parameters rarely leads to overfitting.

  • It is faster to implement and to run.

  • It does not make sense to explore weights individually (α+β≠1) for:


    • AUC: For any α, β, dividing the predictions by α+β will not change AUC.
    • Accuracy (implemented with argmax): Similarly to AUC, argmax position will not change.
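
The AUC claim can be verified directly. The sketch below uses a small pairwise AUC implementation and made-up predictions from two hypothetical models: rescaling the blend by α+β does not change the ranking, so only the ratio of the weights matters.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1, 0, 1]
model_a = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # hypothetical predictions of model A
model_b = [0.3, 0.2, 0.6, 0.5, 0.1, 0.9]    # hypothetical predictions of model B

alpha, beta = 3.0, 1.0
blend = [alpha * a + beta * b for a, b in zip(model_a, model_b)]
normalized = [s / (alpha + beta) for s in blend]   # same blend with weights summing to 1

print(auc(labels, model_a), auc(labels, blend), auc(labels, normalized))
```

In practice this means a single weight w in [0, 1], with the other weight set to 1 - w, is enough to search over.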

Conditional averaging

  • Use conditional averaging to cancel out erroneous ranges of individual estimators.
  • Can be automatically learned by boosting trees and stacking.


Bagging

  • Bagging (bootstrap aggregating) considers homogeneous models, learns them independently from each other in parallel, and combines them following some kind of deterministic averaging process.
  • Bagging combines strong learners together in order to "smooth out" their predictions and reduce variance.
  • Bootstrapping makes it possible to fit models that are roughly independent.

  • The procedure is as follows:
    • Create N random sub-samples (with replacement) for the dataset of size N.
    • Fit a base model on each sample.
    • Average predictions from all models.
  • Can be used with any type of method as a base model.
  • Bagging is effective on small datasets.
  • Out-of-bag estimate is the mean estimate of the base algorithms on 37% of inputs that are left out of a particular bootstrap sample.
    • Helps avoid the need for an independent validation dataset.
  • Parameters to consider:
    • Random seed
    • Row sampling or bootstrapping
    • Column sampling or bootstrapping
    • Size of sample (use a much smaller sample size on a larger dataset)
    • Shuffling
    • Number of bags
    • Parallelism
  • See Tree-Based Models
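
A sketch of bagging with an out-of-bag estimate, using scikit-learn's BaggingRegressor around fully grown decision trees. The data is synthetic, and the parameter values are picked only to mirror the list above.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

bag = BaggingRegressor(
    DecisionTreeRegressor(),   # high-variance base model
    n_estimators=100,          # number of bags
    max_samples=1.0,           # size of each bootstrap sample (fraction of rows)
    max_features=1.0,          # column sampling (all columns here)
    bootstrap=True,            # row sampling with replacement
    oob_score=True,            # score each row using only the models that did not see it
    random_state=0,            # random seed
)
bag.fit(X, y)
oob_r2 = bag.oob_score_        # out-of-bag R^2, no separate validation set needed
print(round(oob_r2, 3))
```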


Bootstrapping

  • Bootstrapping is random sampling with replacement.
  • With sampling with replacement, each sample unit has an equal probability of being selected.
    • Samples become approximately independent and identically distributed (i.i.d).
    • It is a convenient way to treat a sample like a population.
  • This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.
  • It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators.
  • For example:
    • Select a random element from the original sample of size N and do this B times.
    • Calculate the mean of each sub-sample.
    • Obtain a 95% confidence interval around the mean estimate for the original sample.
  • Two important assumptions:
    • N should be large enough to capture most of the complexity of the underlying distribution (representativity).
    • N should be large enough compared to B so that samples are not too correlated (independence).
  • An average bootstrap sample contains 63.2% of the original observations and omits 36.8%.
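
A numerical sketch of the bootstrap with NumPy. Here B is taken to be the number of bootstrap resamples, each of size N; that reading of the notation is an assumption. The original sample is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)   # the "original sample"
N, B = len(sample), 2000

# B bootstrap resamples, each of size N, drawn with replacement.
idx = rng.integers(0, N, size=(B, N))
boot_means = sample[idx].mean(axis=1)

# Percentile 95% confidence interval for the mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

# Average share of distinct original observations in one resample.
unique_fraction = float(np.mean([len(np.unique(row)) / N for row in idx]))

print(sample.mean(), (ci_low, ci_high), unique_fraction)
```

The share of distinct observations comes out close to 1 - 1/e, which is the 63.2% quoted above.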


Boosting

  • Boosting considers homogeneous models, learns them sequentially in a very adaptive way (a base model depends on the previous ones) and combines them following a deterministic strategy.
  • This technique is called boosting because we expect an ensemble to work much better than a single estimator.
  • In sequential methods the models are no longer fitted independently of each other, so training can't be performed in parallel.
  • Each new model in the ensemble focuses its efforts on the most difficult observations to fit up to now.
  • Boosting combines weak learners together in order to create a strong learner with lower bias.
    • A weak learner is defined as one whose performance is at least slightly better than random chance.
    • These learners are also in general less computationally expensive to fit.

Adaptive boosting

  • At each iteration, adaptive boosting changes the sample distribution by modifying the weights of instances.

    • It increases the weights of the wrongly predicted instances.
    • The weak learner thus focuses more on the difficult instances.
  • The procedure is as follows:

    • Fit a weak learner h_t with the current observation weights.


    • Estimate the learner's performance and compute its weight α_t (contribution to the ensemble).


    • Update the strong learner by adding the new weak learner multiplied by its weight.

    • Compute new observation weights that express which observations to focus on.


  • See Tree-Based Models
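
The four steps of the procedure can be sketched with decision stumps in NumPy. This follows the classic discrete AdaBoost update with labels in {-1, +1}, which is one common variant and an assumption here, since the formulas are not given in these notes. The data is a synthetic one-dimensional problem that no single stump can solve.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """Predict -sign for X[:, j] <= thr and +sign otherwise."""
    return sign * np.where(X[:, j] <= thr, -1, 1)

def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, sign) of the best stump."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                err = w[stump_predict(X, j, thr, sign) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, n_rounds=10):
    w = np.full(len(y), 1.0 / len(y))               # uniform observation weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, sign = fit_stump(X, y, w)      # 1. fit weak learner h_t
        err = min(max(err, 1e-10), 1 - 1e-10)       # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)       # 2. learner weight alpha_t
        ensemble.append((alpha, j, thr, sign))      # 3. add alpha_t * h_t to the ensemble
        w = w * np.exp(-alpha * y * stump_predict(X, j, thr, sign))   # 4. up-weight mistakes
        w = w / w.sum()
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble)
    return np.sign(score)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.where(np.abs(X[:, 0]) < 0.5, 1, -1)          # positive only inside an interval

model = adaboost(X, y, n_rounds=10)
train_acc = float(np.mean(predict(model, X) == y))
print(train_acc)
```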

Gradient boosting

Gradient boosting doesn’t modify the sample distribution:

  • At each iteration, the weak learner trains on the remaining errors (so-called pseudo-residuals) of the strong learner.

Gradient boosting doesn’t weight weak learners according to their performance:

  • The contribution of the weak learner (so-called multiplier) to the strong one is computed using gradient descent.
  • The computed contribution is the one minimizing the overall error of the strong learner.

Allows optimization of an arbitrary differentiable loss function.

The procedure is as follows:

  • Compute pseudo-residuals that indicate, for each observation, in which direction we would like to move.

  • Fit a weak learner h_t to the pseudo-residuals (negative gradient of the loss).

  • Add the predictions of h_t multiplied by the step size α (learning rate) to the predictions of the ensemble.


  • See Tree-Based Models
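
The procedure above can be sketched for squared loss, where the pseudo-residuals are simply y minus the current prediction. Regression stumps on a single feature serve as weak learners. The data, the learning rate and the number of rounds are assumptions for illustration.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Least-squares stump on one feature: (threshold, left_value, right_value)."""
    best = None
    for thr in np.unique(x)[:-1]:            # splitting at the max would leave the right side empty
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]

def stump_predict(x, thr, left_value, right_value):
    return np.where(x <= thr, left_value, right_value)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=150))
y = np.sin(x) + rng.normal(scale=0.1, size=150)

learning_rate = 0.1
F = np.full_like(y, y.mean())                # initial prediction: best constant for squared loss
mse_history = [float(np.mean((y - F) ** 2))]
stumps = []
for _ in range(100):
    residuals = y - F                        # negative gradient of 0.5 * (y - F)^2 w.r.t. F
    stump = fit_regression_stump(x, residuals)
    stumps.append(stump)
    F = F + learning_rate * stump_predict(x, *stump)   # shrunken step towards the residuals
    mse_history.append(float(np.mean((y - F) ** 2)))

print(mse_history[0], mse_history[-1])
```

For another differentiable loss, only the line that computes the pseudo-residuals would change.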


Stacking

  • Stacking considers heterogeneous models, learns them in parallel and combines them by training a meta-model to output a prediction based on the different weak models' predictions.
  • Stacking on a small holdout set is blending.
  • Stacking with linear regression is sometimes the most effective way of stacking.
  • Non-linear stacking gives surprising gains as it finds useful interactions between the original and the meta-model features.
  • Feature-weighted linear stacking stacks engineered meta-features together with model predictions.
  • At the end of the day you don’t know which base models will be helpful.
  • Stacking allows you to use classifiers for regression problems and vice versa.
  • Base models should be as diverse as possible:
    • 2-3 GBMs (one with low depth, one with medium and one with high)
    • 2-3 NNs (one deeper, one shallower)
    • 1-2 ExtraTrees/RFs (again as diverse as possible)
    • 1-2 linear models such as logistic/ridge regression
    • 1 kNN model
    • 1 factorization machine
  • Use different features for different models.
  • Use feature engineering:
    • Pairwise distances between meta features
    • Row-wise statistics (like mean)
    • Standard feature selection techniques
  • Meta models can be shallow:
    • GBMs with small depth (2-3)
    • Linear models with high regularization
    • ExtraTrees
    • Shallow NNs (1 hidden layer)
    • kNN with BrayCurtis distance
    • A simple weighted average (find weights with bruteforce)
  • Use automated stacking for complex cases to optimize:
    • CV-scores
    • Standard deviation of the CV-scores (a smaller deviation is a safer choice)
    • Complexity/memory usage and running times
    • Correlation (uncorrelated model predictions are preferred).
  • Greedy forward model selection:
    • Start with a base ensemble of 3 or so good models.
    • Add a model when it increases the train set score the most.
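
Greedy forward model selection can be sketched in NumPy. The "library" of model predictions below is simulated as the target plus noise, and models may be picked more than once, which acts like a crude weighting. Both choices, and the cap of 20 members, are assumptions for the example. The score should come from predictions the models were not fitted on (for instance out-of-fold predictions); otherwise the selection overfits.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
y = rng.normal(size=n)
# Hypothetical library of model predictions: target plus independent noise.
noise_scales = [0.5, 0.6, 0.7, 0.8, 1.0, 1.5]
preds = np.array([y + rng.normal(scale=s, size=n) for s in noise_scales])

def mse(p):
    return float(np.mean((y - p) ** 2))

single_scores = [mse(p) for p in preds]

# Start with a base ensemble of the 3 best single models.
selected = [int(j) for j in np.argsort(single_scores)[:3]]
best_score = mse(preds[selected].mean(axis=0))

# Repeatedly add the model that improves the averaged ensemble the most.
while len(selected) < 20:
    candidates = [(mse(preds[selected + [j]].mean(axis=0)), j) for j in range(len(preds))]
    score, j = min(candidates)
    if score >= best_score:      # stop when no candidate improves the score
        break
    selected.append(j)
    best_score = score

print(selected, best_score, min(single_scores))
```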

Multi-level stacking