# Data Leakage
Find and fix this problem that ruins your model in subtle ways.

# Data Leakage Summary

## Introduction

- In this tutorial, we learn what data leakage is and how to prevent it.
- If we don't know how to prevent it, data leakage will often sneak into our models and ruin them in subtle and dangerous ways.
- It's one of the most important concepts for practicing data scientists.

## What Is Data Leakage?

- Data leakage happens when our training data contains information about the target variable that **won’t be available at prediction time**.
- This causes the model to perform well during training (and even on validation), but poorly in real-world predictions.
- In other words, leakage makes a model **look accurate** during development, but **fail** when deployed.

## Types of Leakage

### 1. Target Leakage

- Occurs when our predictors include information that **won't be available at prediction time**, typically because it is recorded or updated after the target is known.
- To detect target leakage, we must think about the **chronological order** in which data becomes available.
- Just because a feature boosts model performance doesn't mean it's valid — it might be leaking future information.
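
The chronological check can be written down explicitly. The sketch below is only an illustration: the feature names, the dates, and the helper `split_by_availability` are hypothetical, and it assumes we can record when each feature's value becomes known.

```python
from datetime import date

# Hypothetical metadata for one target month (March 2024): the date on which
# each feature's value becomes known. Names and dates are illustrative.
feature_known_on = {
    "ad_spend_prev_month": date(2024, 2, 29),
    "unemployment_rate": date(2024, 3, 1),
    "leather_used": date(2024, 3, 31),  # only known once the month is over
}

def split_by_availability(known_on, prediction_date):
    """Return (safe, leaky) feature names, each sorted alphabetically."""
    safe = sorted(n for n, d in known_on.items() if d <= prediction_date)
    leaky = sorted(n for n, d in known_on.items() if d > prediction_date)
    return safe, leaky

# Predictions are assumed to be made on the first day of the month.
safe, leaky = split_by_availability(feature_known_on, date(2024, 3, 1))
print("safe:", safe)
print("leaky:", leaky)
```

Any feature that lands in `leaky` deserves a closer look before it goes into the model, however much it improves the validation score.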

### 2. Train-Test Contamination

- Train-test contamination happens when we fail to keep training and validation data properly separated.
- Validation is supposed to evaluate how well our model generalizes to unseen data.
- If validation data influences any part of training or preprocessing, our validation results become misleading.

#### Example of Contamination

- Suppose we impute missing values (fit the imputer) **before** running `train_test_split()`.
- The imputer then uses **all the data**, including the future validation set, to compute statistics (e.g., mean).
- Validation scores then look better than they should, and that performance won’t hold up in production.
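
To make the difference concrete, here is a small standard-library sketch with made-up numbers. The helpers `fit_mean` and `impute` are hypothetical stand-ins for an imputer's fit and transform steps, and the ordered split is a simplification of `train_test_split()`.

```python
from statistics import mean

# Toy feature column with missing values (None); the numbers are made up.
values = [1.0, 2.0, None, 3.0, 100.0, None]
train, valid = values[:4], values[4:]  # simple ordered split for illustration

def fit_mean(column):
    """Mean of the observed (non-missing) entries."""
    return mean(v for v in column if v is not None)

def impute(column, fill):
    """Replace missing entries with the given fill value."""
    return [fill if v is None else v for v in column]

leaky_fill = fit_mean(values)  # contaminated: the statistic sees validation rows
clean_fill = fit_mean(train)   # correct: the statistic uses training rows only

train_imputed = impute(train, clean_fill)
valid_imputed = impute(valid, clean_fill)  # reuse the training statistic

print(leaky_fill, clean_fill)
```

The two fill values differ because the contaminated one absorbed a validation-set outlier; only `clean_fill` reflects what the model could know at training time.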

#### Why Is It Dangerous?

- Contamination gives us false confidence in our model.
- It can be even more subtle and dangerous during complex feature engineering.
- The model appears to work well on data it already "knows", but generalizes poorly to new data.

#### How to Prevent It?

- Always exclude validation data from **any kind of fitting**, including preprocessing steps.
- Use **scikit-learn pipelines** to keep preprocessing tied to training folds.
- When using **cross-validation**, it's essential to do all preprocessing **inside the pipeline**.
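
A minimal sketch of the pipeline approach, assuming scikit-learn and NumPy are installed. The synthetic data, the mean-imputation strategy, and the linear model are illustrative choices rather than anything prescribed by the tutorial.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data for illustration only: 200 rows, 3 features, about 10% missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer lives inside the pipeline, so each cross-validation split
# refits it on that split's training fold only.
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
```

Because `cross_val_score` clones and refits the whole pipeline for every fold, the validation fold never contributes to the imputation statistics.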

## Conclusion

Data leakage can be a multi-million-dollar mistake in many data science applications. Careful separation of training and validation data can prevent train-test contamination, and pipelines can help implement this separation. Likewise, a combination of caution, common sense, and data exploration can help identify target leakage.

---

# Step 1: The Data Science of Shoelaces

Nike has hired you as a data science consultant to help them save money on shoe materials. Your first assignment is to review a model one of their employees built to predict how many shoelaces they'll need each month. The features going into the machine learning model include:

- The current month (January, February, etc.)
- Advertising expenditures in the previous month
- Various macroeconomic features (like the unemployment rate) as of the beginning of the current month
- The amount of leather they ended up using in the current month

The results show the model is almost perfectly accurate if you include the feature about how much leather they used. But it is only moderately accurate if you leave that feature out.

## ❓ Does the _leather used_ feature constitute data leakage?

### ✅ Yes, in the most likely scenario this is **data leakage**.

### 🔍 Why?

- The goal is to predict **how many shoelaces are needed each month**.
- The feature **"leather used"** is only known **during or after** the month's production.
- If you're predicting **before** the month begins (e.g., for planning), you **won’t yet know** how much leather will be used.
- So including this feature gives the model **access to future information**, which it shouldn't have at prediction time.
- This is a textbook example of **target leakage**.

### 🧠 It depends on the timing

- ❌ **If predictions are made before production begins** → _leather used_ is **not known** → this is **leakage**.
- ✅ **If predictions are made after the month is over**, as a report → no leakage, but the model has **no practical use** for planning.

## ✅ Recommendation

- Remove the _leather used_ feature from the model.
- Optionally, build a **separate model** to predict leather usage using early-month features.
- Then use that predicted value as an input to the shoelace model — this forms a proper pipeline without data leakage.

# Step 2: Return of the Shoelaces

You have a new idea. You could use the **amount of leather Nike ordered** (rather than the amount they actually used) leading up to a given month as a predictor in your shoelace model.

## ❓ Is there still a data leakage problem?

### 🟡 It depends.

### 🔍 What does it depend on?

- ⏱️ **Timing of the prediction**:  
  - If the **leather order** is made **before** the month starts, and the prediction is also made **before** the month begins, then:
    - ✅ **No leakage**: you're only using data available at prediction time.
  
  - If the leather order is made **after** the month starts (e.g., a just-in-time order system), and your prediction is meant to happen **before** the month, then:
    - ❌ **Yes, leakage**: the model is again using future information.

- 📦 **Why use leather ordered instead of leather used?**
  - "Leather ordered" is likely correlated with production volume (and therefore shoelaces), but it's a **prior decision**, not an outcome.
  - So it can act as a **proxy** for future demand, **as long as it’s available when predictions are made**.

## ✅ Recommendation

- Confirm the **ordering schedule** and **when predictions are made**.
- If the leather is ordered **before** the month begins, and you’re predicting for that month, it is **safe to use** and can **improve model accuracy**.

---
# Step 3: Getting Rich With Cryptocurrencies?

## ✅ Correct Answer:

There is **no source of leakage** here. These features should be available at the moment you want to make a prediction, and they're unlikely to be changed in the training data after the prediction target is determined.

However, the way your friend describes **accuracy** could be **misleading** if you aren't careful.

If the price moves gradually, today's price will naturally be an accurate predictor of tomorrow's price. But that doesn't mean the model tells you whether it's a **good time to invest**.

For example, if the current price is $100 and the predicted price for tomorrow is also $100, this may seem accurate — but it doesn't tell you if the price is going **up** or **down**, which is crucial for investment decisions.

> ✅ A better prediction target would be the **change in price over the next day**.

If you can consistently predict whether the price is about to go **up or down** (and by how much), then you may have a winning investment opportunity.
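
The point about misleading accuracy can be illustrated with a short simulation. This is a sketch on a synthetic random walk, not real market data, and the "model" here is the hypothetical baseline that simply repeats today's price.

```python
import random

random.seed(0)

# Synthetic random-walk prices, for illustration only (not real market data).
prices = [100.0]
for _ in range(500):
    prices.append(prices[-1] + random.gauss(0, 1))

today, tomorrow = prices[:-1], prices[1:]

# Baseline "model": predict that tomorrow's price equals today's price.
predicted = today

mean_abs_error = sum(abs(p - t) for p, t in zip(predicted, tomorrow)) / len(tomorrow)
mean_price = sum(tomorrow) / len(tomorrow)
print(f"error relative to price level: {mean_abs_error / mean_price:.2%}")

# Reframed target: the change in price over the next day.
actual_change = [t - c for c, t in zip(today, tomorrow)]
predicted_change = [p - c for p, c in zip(predicted, today)]  # all zeros
print("largest predicted change:", max(abs(c) for c in predicted_change))
```

The baseline's error is small relative to the price level, yet its predicted change is always zero, so it says nothing about whether the price will go up or down. Scoring a model on the change makes that lack of skill visible.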



---
# Step 4: Preventing Infections

## ❓ Problem
We want to predict whether a patient will get an infection after a rare surgery. One proposed feature is the average infection rate of the **surgeon** who performed the surgery, calculated across all their patients in the dataset.



## ⚠️ Target Leakage

Yes, this approach **can introduce target leakage**.

If you calculate a surgeon's average infection rate **using the full dataset**, you are using information from the **target variable** (whether other patients had infections) to construct a feature. This means that the infection status of the **patient you're trying to predict** is indirectly influencing the feature used to make that prediction.

> For example, if a patient gets an infection, that slightly raises the surgeon's average infection rate, which is then used to predict that same patient's infection. The reasoning is circular.



## ⚠️ Train-Test Contamination

Yes, this also creates **train-test contamination**.

If the surgeon infection rate is computed using **data from both the training and test sets**, then information from the test set has "leaked" into the training process. This leads to overly optimistic validation results, since the model has indirectly seen the test labels.



## ✅ How to Fix It

To safely use the surgeon's infection rate:
- Compute the rate **only using the training data**.
- For the test/validation set, use the precomputed rates from the training data.
- If a surgeon appears in the test set but not in the training set, handle them with a default or median rate.

> ✅ Even better: compute the infection rate **out-of-fold** (cross-validation style) within the training set, so that no patient's own outcome contributes to the surgeon rate used as that patient's feature.
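
A minimal standard-library sketch of these fixes, on made-up records. Two choices here are assumptions rather than part of the exercise: unseen surgeons fall back to the overall training rate (one possible "default"), and training rows use a leave-one-out rate, which is the extreme case of the out-of-fold idea.

```python
from collections import defaultdict

# Toy records (surgeon_id, infected), made up for illustration.
train = [("A", 1), ("A", 0), ("A", 0), ("B", 1), ("B", 1)]
test = [("A", 0), ("C", 1)]  # surgeon "C" never appears in training

# Per-surgeon [infection count, patient count], from training rows only.
totals = defaultdict(lambda: [0, 0])
for surgeon, infected in train:
    totals[surgeon][0] += infected
    totals[surgeon][1] += 1

n_train = len(train)
n_infected = sum(infected for _, infected in train)
overall_rate = n_infected / n_train  # fallback for surgeons unseen in training

def rate_for_test_row(surgeon):
    """Training-only rate, falling back to the overall training rate."""
    if surgeon in totals:
        infections, patients = totals[surgeon]
        return infections / patients
    return overall_rate

def rate_for_train_row(surgeon, infected):
    """Leave-one-out rate: the row's own outcome is excluded from its feature."""
    infections, patients = totals[surgeon]
    if patients == 1:
        # No other patients for this surgeon: use the overall rate without this row.
        return (n_infected - infected) / (n_train - 1)
    return (infections - infected) / (patients - 1)

train_feature = [rate_for_train_row(s, y) for s, y in train]
test_feature = [rate_for_test_row(s) for s, _ in test]
print(train_feature, test_feature)
```

Test rows never touch test labels, and each training row's feature is built without its own outcome, which removes both the contamination and the circularity described above.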


---
# Step 5: Housing Prices

## ❓ Problem
You are building a model to predict the price of a house at the time its description is added to a website. You have the following features:

1. Size of the house (in square meters)  
2. Average sales price of homes in the same neighborhood  
3. Latitude and longitude of the house  
4. Whether the house has a basement  

The model is trained on historical data and used for future predictions.



## ⚠️ Which Feature Has Leakage?

**Feature 2: Average sales price of homes in the same neighborhood**  
✅ This feature is the **most likely to cause target leakage**.

### Why?
If this average includes the **sale price of the house you're trying to predict**, or any houses sold **after** the one you're trying to predict, then you're using future or target information in your input.  
- This will cause the model to look very accurate during training.
- But at prediction time, the house hasn’t been sold yet — so the average won't include it.
- If the neighborhood has few houses, this average might almost *exactly match* the target value, making the model unrealistically accurate during training.



## ✅ Analysis of Each Feature

### 1. Size of the house (in square meters)
- ✅ Safe: Available at prediction time and unlikely to change.
  
### 2. Average neighborhood sales price
- ⚠️ Risk of **target leakage**, depending on how it's computed (see above).

### 3. Latitude and longitude
- ✅ Safe: Constant features known at prediction time.

### 4. Has basement (Yes/No)
- ✅ Safe: Known before the house is sold, not a derived or future-based value.



## 💡 Best Practice
If you want to use average neighborhood price:
- Only compute it **using sales that happened *before* the prediction time**.
- Or use other proxies, like long-term average prices in the area from external sources (e.g., last year's average).
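
The first option can be sketched as a time-filtered average. The sales records and the helper `prior_neighborhood_average` below are hypothetical, and the sketch assumes every historical sale carries a sale date.

```python
from datetime import date

# Toy sales history (neighborhood, sale_date, price); values are made up.
sales = [
    ("north", date(2023, 1, 10), 200_000),
    ("north", date(2023, 6, 5), 220_000),
    ("north", date(2024, 2, 1), 300_000),
    ("south", date(2023, 3, 15), 150_000),
]

def prior_neighborhood_average(sales, neighborhood, listing_date):
    """Average price of sales in the neighborhood strictly before listing_date.

    Returns None when there are no earlier sales to average.
    """
    prices = [
        price
        for hood, sold_on, price in sales
        if hood == neighborhood and sold_on < listing_date
    ]
    return sum(prices) / len(prices) if prices else None

# Feature for a house listed on 1 December 2023: the 2024 sale is excluded.
feature = prior_neighborhood_average(sales, "north", date(2023, 12, 1))
print(feature)
```

Applying the same cutoff to every training row keeps the feature consistent with what will be available when the model is used on new listings.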

