# My Machine Learning Models Just SUCKED. Here Is How I Fixed Them
## 7 things I learned to consistently achieve over 90 point-performance
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@wannabephotographer?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Walid Hamadeh</a>
        on 
        <a href='https://unsplash.com/s/photos/suck?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash</a>
    </strong>
</figcaption>

### Setup

import warnings

import numpy as np

warnings.filterwarnings("ignore")

I remember taking my first ML course on Kaggle. I was introduced to Decision Trees and performed my first "serious" regression task on the over-used Ames Housing dataset. I was so happy! I even went as far as thinking that Machine Learning was not so hard after all... What a noob!

Turns out, Decision Trees were like the "Hello World" of ML, and I had only dipped my pinkie toe into the world of beautiful math and data. Since then, I have learned and improved a lot (or I think I have).

Now, I am not just blindly training my favorite models based on the target. I have moved away from writing template code and started taking data preprocessing more seriously. Because of these changes and many others, my models started achieving robust results, often upwards of 85 point-performance, even with large datasets.

So, in this article, I will lay out the 7 most important things I learned to consistently squeeze every bit of performance out of my models.

### 1. Find out the hyperparameters that control overfitting/underfitting

Not surprisingly, the first thing I learned about the ML world is the issue of overfitting and how to combat it. Generating a robust model that generalizes well is a step-by-step process and the initial stage starts right at the model initialization. 

After choosing your baseline model(s), search for the parameters that influence its objective function the most. Often, these hyperparameters are the ones that directly affect the model's learning and, most importantly, how it generalizes.

The best way to do this is to read the documentation of the model thoroughly. After reading enough documentation, you will find out that there are certain keywords that immediately suggest the parameter is related to controlling overfitting. 

For example, tree-based and ensemble models use the term "prune" for cutting each tree back to a smaller depth and size. Random forests have `n_estimators`, which sets the number of trees, and `max_features`, which affects how each tree is built. The Sklearn user guide also says `max_depth` and `min_samples_split` are important.
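To make the effect of such a parameter concrete, here is a minimal sketch that varies `max_depth` on a single decision tree. The synthetic dataset and the chosen depths are my own assumptions for illustration, not values from any documentation.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data (an assumption, for illustration only)
X, y = make_regression(n_samples=500, n_features=10, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in (2, 5, None):  # None means no depth limit at all
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (train R^2, test R^2) so the gap between them is easy to inspect
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train R^2={scores[depth][0]:.2f}, "
          f"test R^2={scores[depth][1]:.2f}")
```

The unrestricted tree memorizes the training set, so the gap between its train and test scores is the thing to watch as you change the depth.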

For linear models, the most common keywords include *regularization*, *penalty*, etc. LogisticRegression and linear SVMs have `C`, the inverse of regularization strength, while Ridge and Lasso use `alpha`, where larger values mean stronger regularization. Kernel SVMs additionally have `gamma`, the kernel coefficient. The common penalty types are called 'L1' and 'L2': [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) applies the L1 penalty and [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) applies the L2 penalty.
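As a rough sketch of what regularization strength does, the snippet below fits Ridge (L2) and Lasso (L1) with increasing `alpha`. The synthetic data and the three `alpha` values are assumptions picked only for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 5 of the 20 features carry signal (an assumption)
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

ridge_norms, lasso_zeros = {}, {}
for alpha in (0.1, 1.0, 10.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    # L2 shrinks all coefficients; L1 can set some of them exactly to zero
    ridge_norms[alpha] = float(np.linalg.norm(ridge.coef_))
    lasso_zeros[alpha] = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha}: Ridge coef norm={ridge_norms[alpha]:.1f}, "
          f"Lasso zeroed coefs={lasso_zeros[alpha]}")
```

Keep in mind that `C` in LogisticRegression and linear SVMs works the other way around: smaller values mean stronger regularization.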

For establishing base performance, you can simply use the default values or the values suggested in the documentation. Often, these values are not optimal and should be tuned with hyperparameter optimizers at the last stage of your workflow.

### 2. Divide the data into 3 sets, not 2

Unless you are still using toy datasets, real-world data often comes in large volumes. In such cases, you can afford to divide the data into 3 sets (1 for training and 2 held out for evaluation) to generate the best possible results.

If you are not using cross-validation, the problem with 1 training and only 1 test set is again the issue of overfitting. All the work you will be doing will be dependent on that single pair of train/test sets split at some random seed. After the model learns from the train set, you can tune its hyperparameters until it spits out the best possible score for your test set.

The maximum score in this scenario does not necessarily mean that your model now generalizes well on unseen data. This particular score only tells how well your model does on that small set of randomly chosen samples. This is overfitting all over again, just in disguise.

An easy fix would be using another hold-out set. The model learns from the training set, you optimize it on the test set and finally, you check its *real* performance on unseen data using the third set, the validation set. Here is a helper function to do this:

from sklearn.model_selection import train_test_split


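def train_test_valid_split(X, y, train_size=0.7, random_state=None):
    """
    Divide the full data into 3 sets: train, test and validation.
    """
    # First split: carve off the training set
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=train_size, random_state=random_state
    )
    # Second split: halve the remainder into test and validation sets
    X_test, X_valid, y_test, y_valid = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=random_state
    )

    return X_train, X_test, X_valid, y_train, y_test, y_valid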
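To show how the three sets are meant to be used together, here is a minimal, self-contained sketch of the workflow. It repeats the two-step split inline so it runs on its own. The synthetic data, the RandomForestClassifier and the candidate `max_depth` values are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70% train, then the remaining 30% halved into test and validation sets
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.7, random_state=0
)
X_test, X_valid, y_test, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Tune one hyperparameter against the test set only
best_depth, best_score = None, -1.0
for depth in (2, 4, 8):
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    if score > best_score:
        best_depth, best_score = depth, score

# One-time final check on data that played no part in training or tuning
final_model = RandomForestClassifier(
    n_estimators=50, max_depth=best_depth, random_state=0
)
final_model.fit(X_train, y_train)
valid_score = final_model.score(X_valid, y_valid)
print(f"best max_depth={best_depth}, test acc={best_score:.3f}, "
      f"valid acc={valid_score:.3f}")
```

The validation score is the one to report, because the test score was already used to pick the winner.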

### 3. Use cross-validation extensively

You should use the previous tip when you don't have the luxury of 10+ cores on your machine and your dataset is massive.

If you *do* have the luxury though, use cross-validation extensively. Understand the different types of cross-validation and find the ones that suit your unique case. For example, Sklearn offers [12 unique cross-validation splitters](https://scikit-learn.org/stable/modules/cross_validation.html), all with a common purpose: to combat overfitting by training and scoring the model on several different splits of the data. By using cross-validation, you gain these benefits:

1. No data is wasted: after all folds are done, every sample has been used for both training and testing.
2. You remove the possibility that the model scored too optimistically because it was accidentally trained and tested on an overly favorable split.
3. You can report the inherent uncertainty that comes with doing ML: by getting several scores, you can compute the average score to see general performance and the standard deviation to get an idea of how much the results can vary.
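The last benefit takes only a few lines to get. Below is a minimal sketch using `cross_val_score` with a stratified 5-fold splitter. The synthetic data and the LogisticRegression model are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic classification data (an assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5 stratified folds: each sample lands in a test fold exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# The mean shows general performance, the std shows how much it varies
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```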

### 4. Move beyond simple imputation techniques

When I first started out, I didn't really care about missing values. I mostly played with toy datasets and simple mean/mode imputation techniques were more than enough (who can blame me?). 

After working with larger datasets, my approach changed. First, instead of blindly applying imputation techniques, I started asking why the data is missing in the first place. More broadly, I explored the types of missingness. Generally, there are three:

1. Missing Completely at Random (MCAR)
2. Missing at Random (MAR)
3. Missing Not at Random (MNAR)

These 3 types are named rather similarly but have subtle differences. Finding out which category the missing data falls into can narrow down the techniques you can use to impute it.

Beyond the simple techniques such as mean/median imputation, there are 2 model-based approaches. In Sklearn, these are:

1. KNN imputation (`sklearn.impute.KNNImputer`)
2. Iterative Imputation (`sklearn.impute.IterativeImputer`)
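Here is a minimal sketch of both imputers on a small synthetic array. The data, the missingness rate and `n_neighbors=5` are assumptions for illustration. Note that `IterativeImputer` is still marked experimental in Sklearn, so it has to be enabled with an explicit import first.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled before it is imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Make the last column predictable from the first one
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)

# Knock out roughly 15% of the values in two of the columns
X_missing = X.copy()
mask = np.zeros(X.shape, dtype=bool)
mask[:, 1] = rng.random(200) < 0.15
mask[:, 3] = rng.random(200) < 0.15
X_missing[mask] = np.nan

# KNN: fill each gap from the most similar rows
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)
# Iterative: model each feature with missing values from the other features
X_iter = IterativeImputer(random_state=0).fit_transform(X_missing)

print("NaNs before:", int(np.isnan(X_missing).sum()))
print("NaNs after KNN:", int(np.isnan(X_knn).sum()))
print("NaNs after iterative:", int(np.isnan(X_iter).sum()))
```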

Both have implementations in R, and I was surprised to learn that the R ecosystem for imputing missing values is more mature. Discussing the methods here would take us away from the original purpose of the article, so I will refer you to my separate post:

https://towardsdatascience.com/going-beyond-the-simpleimputer-for-missing-data-imputation-dd8ba168d505

There is also the question of which technique is better and how effective it is. You could apply several techniques and evaluate an estimator to see each one's effect on predictions, but that would not scale well for large datasets.

My favorite approach is to plot a feature's distribution before and after imputation:
![](images/1.png)

Comparison of KNN imputation with different values of K. As you can see, with K=2, the orange line comes closer to the original (blue) distribution.

The closer the after-imputation distribution is to the original, the better the technique. Of course this approach has its downsides, especially when the proportion of missing values is large.
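If you want a number to go with the plot, one option is to measure the distance between the two distributions. The sketch below swaps the visual check for the two-sample Kolmogorov-Smirnov statistic from SciPy, which is my own stand-in and not part of the plotting approach itself. The synthetic data and the values of K are assumptions too.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# The third feature depends on the first, so neighbors carry information
X[:, 2] = 2 * X[:, 0] + rng.normal(scale=0.5, size=500)

# Remove about 20% of the third feature
X_missing = X.copy()
missing_rows = rng.random(500) < 0.2
X_missing[missing_rows, 2] = np.nan

# The "before" distribution: values that were actually observed
observed = X_missing[~missing_rows, 2]


def ks_distance(imputer):
    """KS statistic between observed values and the full imputed column."""
    imputed_col = imputer.fit_transform(X_missing)[:, 2]
    return ks_2samp(observed, imputed_col).statistic


results = {"mean": ks_distance(SimpleImputer(strategy="mean"))}
for k in (2, 5, 15):
    results[f"knn_k={k}"] = ks_distance(KNNImputer(n_neighbors=k))

# Smaller means the imputed column stays closer to the observed distribution
for name, stat in results.items():
    print(f"{name}: KS statistic = {stat:.3f}")
```

Just like the plot, this check gets less trustworthy as the proportion of missing values grows, because the observed values stop being a good reference.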

### 5. Perform feature selection or dimensionality reduction

Having more data is not always better. That is certainly the case when the data have an unnecessarily large number of predictor variables (features).

Having too many features that do not contribute much to the predictive power of estimators leads to overfitting, more computation cost and increased model complexity. These features tend to have low variance or be highly correlated with each other.

You can use either feature selection or dimensionality reduction with PCA to remove the redundant variables from the dataset.

**Feature selection** techniques are sometimes preferred when you have a deep understanding of each feature. Before even applying complex algorithms, you can discard a few using your domain knowledge or by just exploring how and why each variable was collected. Then, you can use other techniques such as model-based feature selection. For example, Sklearn offers the `SelectFromModel` and Recursive Feature Elimination (RFE) wrapper algorithms to automatically find the set of most important predictor variables. To achieve higher performance, passing ensemble algorithms to these wrappers as the underlying estimator works pretty well too.
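As a minimal sketch of both wrappers with an ensemble inside, the snippet below uses a random forest as the underlying estimator. The synthetic data, the `"median"` threshold and the target of 5 features are assumptions picked for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel

# Synthetic data where only 5 of the 20 features are informative (an assumption)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Keep the features whose importance is at or above the median importance
sfm = SelectFromModel(forest, threshold="median").fit(X, y)
X_sfm = sfm.transform(X)

# Recursively drop the 5 weakest features per round until 5 remain
rfe = RFE(forest, n_features_to_select=5, step=5).fit(X, y)
X_rfe = rfe.transform(X)

print("SelectFromModel kept:", X_sfm.shape[1], "features")
print("RFE kept:", X_rfe.shape[1], "features")
```

RFE refits the estimator at every round, so a larger `step` trades some precision for speed on wide datasets.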

**Dimensionality reduction** with PCA is one of the most powerful techniques to reduce the number of features. It takes high-dimensional data and projects it to lower dimensions (fewer features) while preserving as much of the original variance as possible. Sklearn's implementation of PCA (`sklearn.decomposition.PCA`) tends to do pretty well on Kaggle competitions. You can directly specify the number of components you want to keep, or, as is commonly done, pass a fraction between 0 and 1 to tell it how much of the variance you want to preserve. PCA then automatically finds the minimum number of components that can account for the passed variance.
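Here is a minimal sketch of the variance-based usage. The synthetic data (15 features generated from 3 hidden factors) and the 0.95 target are assumptions for illustration. I also scale the features first, since PCA is sensitive to feature scales.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 15 observed features driven by only 3 hidden factors, plus a little noise
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 15)) + 0.05 * rng.normal(size=(300, 15))

# Put every feature on the same scale before projecting
X_scaled = StandardScaler().fit_transform(X)

# A float between 0 and 1 asks for enough components to keep that much variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("variance preserved:", round(float(pca.explained_variance_ratio_.sum()), 3))
```

After fitting, `pca.n_components_` tells you how many components were actually needed to hit the requested variance.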

A disadvantage of PCA is that there is a lot of math involved and you will be sacrificing explainability because after PCA, you won't be able to interpret the features. The best case scenario to use PCA would be when the features are anonymized.