# 5 Sklearn Mistakes That Silently Tell You Are a Rookie
## No error messages - that's what makes them subtle...
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@santabarbara77?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Varvara Grabova</a>
        on 
        <a href='https://unsplash.com/s/photos/mistake?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash</a>
    </strong>
</figcaption>

### Setup

```python
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rcParams

rcParams["figure.figsize"] = [12, 9]
rcParams["figure.autolayout"] = True
rcParams["xtick.labelsize"] = 15
rcParams["ytick.labelsize"] = 15
rcParams["legend.fontsize"] = "small"

warnings.filterwarnings("ignore")
```

### Introduction

### Using `fit` or `fit_transform` everywhere

Let's start with the most serious mistake, one that is related to *data leakage*. Data leakage is subtle and can be destructive to model performance. It occurs when information that would not be available at prediction time is used during model training. Data leakage causes models to report overly optimistic results, even in cross-validation, and then perform terribly on genuinely new data.

Data leakage is common during data preprocessing, particularly if the training and test sets are not separated beforehand. Many Sklearn preprocessing transformers, such as imputers, normalizers and standardizers, learn statistics of the underlying data distribution when they are fit.

For example, `StandardScaler` standardizes each feature by subtracting its mean and dividing by its standard deviation. Calling the `fit()` function on the full data (X) allows the transformer to learn the mean and standard deviation of each feature over the whole dataset. If the transformed data is then split into train and test sets, the train set is contaminated, because the statistics used to scale it were computed partly from rows that ended up in the test set.

Even though this might not be apparent to us, a model can pick up on this leaked information and benefit from it during testing. In other words, the train data would be too perfect for the model because it carries useful information about the test set, and the test set would not be novel enough to measure the model's performance on actual unseen data.

The easiest solution is to never call `fit` on the full data. Before doing any preprocessing, always split the data into train and test sets. Even after the split, you should never call `fit` or `fit_transform` on the test set because you will end up with the same problem.

Since both train and test sets should receive the same preprocessing steps, a golden rule is to use `fit_transform` on the train data - this ensures that the transformer learns only from the train set and transforms it simultaneously. Then, call the `transform` method on the test set to transform it based on the information learned only from the training data.
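Here is a minimal sketch of that rule. The data is synthetic and the variable names are only for illustration. The point is the order of operations: split first, `fit_transform` on train, `transform` on test.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to illustrate the workflow
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split BEFORE any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()

# Learn the mean and standard deviation from the train set only, then scale it
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the train statistics on the test set; nothing is learned here
X_test_scaled = scaler.transform(X_test)
```

After this, `scaler.mean_` holds the per-feature means of `X_train` only, so the scaled test set says nothing to the transformer about its own distribution.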

A more robust solution is to use Sklearn's built-in pipelines. Pipelines help guard against data leakage: when a pipeline is fit, its transformers learn only from the data passed to `fit`, and at prediction time the test data is only transformed and scored, never learned from. You can learn about them in detail in my separate article:

https://towardsdatascience.com/how-to-use-sklearn-pipelines-for-ridiculously-neat-code-a61ab66ca90d?source=your_stories_page-------------------------------------
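As a rough sketch of the pipeline approach, the example below chains a scaler and a classifier. The dataset comes from `make_classification` and the choice of `LogisticRegression` is an assumption made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# The scaler and the model are fit together as one estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# In each CV fold, the scaler is refit on that fold's training part only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Final fit on the whole train set, then a single evaluation on the test set
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)

print(f"CV mean: {cv_scores.mean():.3f}, test: {test_score:.3f}")
```

Because preprocessing lives inside the estimator, there is no separate `fit_transform` call that could accidentally see the test rows.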

### Judging Model Performance Only By Test Scores

You got a test score over 0.85 - should you be celebrating? Big, fat NO!

Even though high test scores generally mean robust performance, there are important caveats to interpreting test results. First and most importantly, regardless of its value, a test score should only be judged relative to the score you get from training.

The only time you should be happy with your model is when the training score is higher than the test score and both are high enough to satisfy the expectations of your unique case. However, this does not imply that the higher the difference between train and test scores, the better. 

For example, a training score of 0.85 and a test score of 0.8 suggest a robust model that is neither overfit nor underfit. But if the training score is over 0.9 and the test score is 0.8, your model is overfit: instead of generalizing during training, the model memorized some of the training data, resulting in a much lower test score than training score. You will often see such cases with tree-based and ensemble models. Algorithms such as Random Forests tend to achieve very high training scores if their tree depth is not controlled, which leads to overfitting. You can read [this discussion](https://stats.stackexchange.com/questions/156694/how-can-training-and-testing-error-comparisons-be-indicative-of-overfitting?noredirect=1&lq=1) on StackExchange to learn more about this difference between train and test scores.
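To see the gap for yourself, you can compare train and test scores for a Random Forest with and without a depth limit. This is only a sketch on synthetic, deliberately noisy data. The dataset parameters and the `max_depth=3` value are arbitrary choices for illustration, and the exact scores will differ on your data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic data (flip_y randomly reassigns a share of the labels)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

results = {}
for label, depth in [("unlimited depth", None), ("max_depth=3", 3)]:
    forest = RandomForestClassifier(
        n_estimators=100, max_depth=depth, random_state=0
    )
    forest.fit(X_train, y_train)
    results[label] = {
        "train": forest.score(X_train, y_train),
        "test": forest.score(X_test, y_test),
    }

# Always look at both numbers side by side, not the test score alone
for label, scores in results.items():
    gap = scores["train"] - scores["test"]
    print(f"{label}: train={scores['train']:.3f}, "
          f"test={scores['test']:.3f}, gap={gap:.3f}")
```

A fully grown forest will usually fit the training set almost perfectly, so its train/test gap tends to be the larger of the two.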

There is also the case where the test score is higher than the training score. If the test score is higher than the training score even in the slightest, feel alarmed because you have probably made a blunder! The major cause of such scenarios is data leakage and we discussed an example of that in the last section.

Sometimes, it is also possible to get a good training score and an extremely low test score. When the difference between train and test scores is unusually large, the problem will often be associated with the test set rather than overfitting. One reason this might happen is using different preprocessing steps for the train and test sets, or simply forgetting to apply preprocessing to the test set.
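The snippet below sketches that last mistake: a model trained on scaled features is scored once on the raw test set and once on the properly transformed one. The data is synthetic, and the features are shifted and stretched on purpose so that scaling matters. How large the drop is depends on the data and the model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, artificially moved far away from zero mean / unit variance
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = X * 10 + 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

scaler = StandardScaler()
model = LogisticRegression()
model.fit(scaler.fit_transform(X_train), y_train)

# Mistake: the test set skipped the preprocessing the model was trained on
score_raw_test = model.score(X_test, y_test)

# Correct: apply the already-fitted scaler to the test set first
score_scaled_test = model.score(scaler.transform(X_test), y_test)

print(f"raw test: {score_raw_test:.3f}, scaled test: {score_scaled_test:.3f}")
```

If the two numbers are far apart, check the preprocessing before blaming the model.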

In summary, always examine the gap between train and test scores closely. Doing so will tell you whether you should apply regularization to overcome overfitting, look for possible mistakes you made during preprocessing or, in the best-case scenario, prepare the model for final evaluation and deployment.

### Generating Incorrect Train/Test Sets in Classification

### Judging Model Performance Without Cross-validation

### Using `LabelEncoder` to Encode the X array

### Using Default Scorers to Evaluate the Performance of a Model

### Using Variance Thresholding Without Normalization