# My Machine Learning Models Just SUCKED. Here Is How I Fixed Them
## Here are 7 things I learned to consistently achieve over 90 point-performance
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@wannabephotographer?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Walid Hamadeh</a>
        on 
        <a href='https://unsplash.com/s/photos/suck?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash</a>
    </strong>
</figcaption>

I remember taking my first ML course on Kaggle. I was introduced to Decision Trees and performed my first "serious" regression task on the over-used Ames Housing dataset. I was so happy! I even went as far as thinking that Machine Learning was not so hard after all... What a noob!

Turns out, Decision Trees were the "Hello World" of ML and I had only dipped my pinkie toe into the world of beautiful math and data. Since then, I have learned and improved a lot (or I think I have). 

Now, I am not just blindly training my favorite models based on the target. I have moved away from writing template code and started taking data preprocessing more seriously. Because of these changes and many others, my models started achieving robust results, often upwards of 85 point-performance, even with large datasets.

So, in this article, I will lay out the 7 most important things I learned to consistently squeeze every bit of performance out of my models.

### 1. Find out the hyperparameters that control overfitting/underfitting

Not surprisingly, the first thing I learned about the ML world is the issue of overfitting and how to combat it. Generating a robust model that generalizes well is a step-by-step process and the initial stage starts right at the model initialization. 

After choosing your baseline model(s), search for the parameters that influence the objective function the most. Often, these hyperparameters are the ones that directly affect the model's learning and, most importantly, how it generalizes.

The best way to do this is to read the documentation of the model thoroughly. After reading enough documentation, you will find out that there are certain keywords that immediately suggest the parameter is related to controlling overfitting. 

For example, tree-based and ensemble models often use the term "prune" for limiting how far each tree is allowed to grow. RandomForests have `n_estimators` and `max_features`, which affect how each tree is built. The Sklearn user guide also says `max_depth` and `min_samples_split` are important.
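To make the effect of a depth limit concrete, here is a minimal sketch on synthetic data. The dataset, the depth values, and the `scores` dictionary are assumptions chosen for illustration; exact scores will vary with your data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data (a stand-in for a real dataset)
X, y = make_regression(n_samples=500, n_features=10, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in (2, 5, None):  # None lets the tree grow until it memorizes the train set
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (train R^2, test R^2) so the gap between them can be compared
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train R2={scores[depth][0]:.2f}, test R2={scores[depth][1]:.2f}")
```

A large gap between the train and test scores for the unrestricted tree is the overfitting that `max_depth` is meant to control.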

For linear models, the most common keywords include *regularization*, *penalty*, etc. LogisticRegression and SVMs have `C`, the inverse of regularization strength, while [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) and [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) use `alpha`, where larger values mean stronger regularization. SVMs with non-linear kernels also have `gamma`, the kernel coefficient. The common penalty types are called 'L1' and 'L2': Lasso applies the L1 penalty and Ridge applies the L2 penalty.
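As a rough illustration of how regularization strength changes a linear model, the sketch below fits Ridge and Lasso at a few `alpha` values. The synthetic data, the `alpha` grid, and the `ridge_norms`/`lasso_zeros` lists are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually drive the target
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

ridge_norms, lasso_zeros = [], []
for alpha in (0.1, 1.0, 10.0):
    ridge = Ridge(alpha=alpha).fit(X, y)  # L2 penalty: shrinks coefficients
    lasso = Lasso(alpha=alpha).fit(X, y)  # L1 penalty: can set coefficients to zero
    ridge_norms.append(float(np.linalg.norm(ridge.coef_)))
    lasso_zeros.append(int(np.sum(lasso.coef_ == 0)))
    print(
        f"alpha={alpha}: Ridge coef norm={ridge_norms[-1]:.1f}, "
        f"Lasso zeroed coefs={lasso_zeros[-1]}"
    )
```

Stronger regularization shrinks the Ridge coefficients, while Lasso tends to drop uninformative features entirely.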

To establish base performance, you can simply use the default values or the values suggested in the documentation. Often, these values are not optimal and should be tuned with hyperparameter optimizers at the last stage of your workflow.
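A baseline along these lines might look like the following sketch. The synthetic dataset and the choice of `RandomForestClassifier` are assumptions for illustration; `get_params()` is the standard Sklearn way to see which defaults you are starting from.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data (a stand-in for a real dataset)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leave every hyperparameter at its default for the baseline
model = RandomForestClassifier(random_state=0)

# Inspect the defaults that control tree growth before tuning anything
defaults = model.get_params()
for name in ("n_estimators", "max_depth", "max_features", "min_samples_split"):
    print(f"{name} = {defaults[name]}")

# Mean accuracy over 5 folds serves as the score to beat later
baseline = cross_val_score(model, X, y, cv=5).mean()
print(f"Baseline accuracy: {baseline:.3f}")
```

Any later tuning can then be judged against this baseline score rather than in isolation.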

### 2. Divide the data into 3 sets, not 2

Unless you are still using toy datasets, real-world data often comes in large volumes. In such cases, you can afford to divide the data into 3 sets (1 training and 2 hold-out sets) to generate the best possible results. 

If you are not using cross-validation, the problem with 1 training and only 1 test set is again the issue of overfitting. All the work you will be doing will be dependent on that single pair of train/test sets split at some random seed. After the model learns from the train set, you can tune its hyperparameters until it spits out the best possible score for your test set.

The maximum score in this scenario does not necessarily mean that your model now generalizes well to unseen data. This particular score only tells you how well your model does on that small, randomly chosen set of samples. This is overfitting all over again, just in disguise.

An easy fix would be using another hold-out set. The model learns from the training set, you optimize it on the test set and finally, you check its *real* performance on unseen data using the third set, the validation set. Here is a helper function to do this:

```python
from sklearn.model_selection import train_test_split


def train_test_valid_split(X, y, train_size=0.7, random_state=None):
    """
    A helper function to divide the full data into 3 sets.
    """
    # First split: carve out the training set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, random_state=random_state
    )
    # Second split: halve the remainder into test and validation sets
    X_test, X_valid, y_test, y_valid = train_test_split(
        X_test, y_test, test_size=0.5, random_state=random_state
    )

    return X_train, X_test, X_valid, y_train, y_test, y_valid
```
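As a usage sketch, the three sets could be used like this: fit on the training set, pick a hyperparameter on the test set, and report the final score on the validation set. The synthetic data, the `max_depth` grid, and the `random_state` argument are assumptions for illustration. The helper is repeated in compact form so the snippet runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def train_test_valid_split(X, y, train_size=0.7, random_state=None):
    # Same two-step split as the helper above, repeated for a standalone snippet
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=train_size, random_state=random_state
    )
    X_test, X_valid, y_test, y_valid = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=random_state
    )
    return X_train, X_test, X_valid, y_train, y_test, y_valid


X, y = make_regression(n_samples=1000, n_features=10, noise=20.0, random_state=0)
X_train, X_test, X_valid, y_train, y_test, y_valid = train_test_valid_split(
    X, y, random_state=0
)

# Tune on the test set: keep the depth with the best test score
best_depth, best_score = None, float("-inf")
for depth in (2, 4, 6, 8):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    score = model.fit(X_train, y_train).score(X_test, y_test)
    if score > best_score:
        best_depth, best_score = depth, score

# Check the chosen model once on data it has never influenced
final = DecisionTreeRegressor(max_depth=best_depth, random_state=0).fit(X_train, y_train)
valid_score = final.score(X_valid, y_valid)
print(f"best max_depth={best_depth}, test R2={best_score:.2f}, valid R2={valid_score:.2f}")
```

The validation score is the number to trust, because the test score was used to pick `max_depth` and is therefore optimistic.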