## 5.4 Target statistics

Methods based on target statistics replace a categorical predictor with a single numeric feature that approximates the expected value of the response given the category each observation belongs to. The disadvantage is that such methods can cause target leakage, because the feature is computed from the response itself.

**This approach is typically more useful for tree-based methods**.

The `category_encoders` package makes it easy to experiment with different methods. The target encoder is the original version of this approach; the other encoders try to mitigate target leakage.

The idea is interesting, but takes some time to explain. The problem with one-hot encoding is that the number of predictors can grow very quickly, especially if you convert all the categorical predictors. Recall that in one-hot encoding we simply replace each category with a vector of indicator variables:

```
Crawfor -> (1, 0, 0, ..., 0, 0)
ClearCr -> (0, 1, 0, ..., 0, 0)
Gilbert -> (0, 0, 1, ..., 0, 0)
...
Landmrk -> (0, 0, 0, ..., 0, 1)
GrnHill -> (0, 0, 0, ..., 0, 0)
```
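To see how the number of columns grows with the number of levels, here is a minimal sketch using `pandas.get_dummies`. The data frame `toy` is made up for illustration and is not the Ames data. Passing `drop_first=True` drops one level as the reference category, so that level is encoded as all zeros, like the last row of the mapping above (pandas drops the first level alphabetically, which here is `ClearCr` rather than `GrnHill`).

```python
import pandas as pd

# Toy data, invented for illustration (not the Ames data set).
toy = pd.DataFrame(
    {"Neighborhood": ["Crawfor", "ClearCr", "Gilbert", "GrnHill", "Crawfor"]}
)

# One indicator column per level; drop_first=True drops the first level
# (alphabetically) so that it becomes the all-zeros reference category.
one_hot = pd.get_dummies(toy["Neighborhood"], drop_first=True, dtype=int)

n_levels = toy["Neighborhood"].nunique()
print(one_hot)
print(f"{n_levels} levels -> {one_hot.shape[1]} indicator columns")
```

With 28 neighbourhoods, the same call would produce 27 indicator columns for this one predictor alone.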


In target statistics, the idea is to replace each of these strings with a single number:

```
Crawfor -> P_1
ClearCr -> P_2
Gilbert -> P_3
...
Landmrk -> P_{n-1}
GrnHill -> P_n
```

What should these numbers be? A simple choice is the average house price within each category:

```
Crawfor -> Average house price for Crawfor
ClearCr -> Average house price for ClearCr
Gilbert -> Average house price for Gilbert
...
Landmrk -> Average house price for Landmrk
GrnHill -> Average house price for GrnHill
```
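The mapping above can be sketched in a few lines of pandas. The data frame `toy` and the column name `Neighborhood_mean` are made up for illustration; this is a plain per-category mean, not the `category_encoders` implementation.

```python
import pandas as pd

# Toy data, invented for illustration (not the Ames data set).
toy = pd.DataFrame({
    "Neighborhood": ["Crawfor", "Crawfor", "ClearCr", "ClearCr", "ClearCr", "Gilbert"],
    "SalePrice": [200_000.0, 220_000.0, 150_000.0, 160_000.0, 170_000.0, 190_000.0],
})

# Replace every category with the mean response of the rows in that category.
# transform("mean") returns one value per row, aligned with the original index.
toy["Neighborhood_mean"] = toy.groupby("Neighborhood")["SalePrice"].transform("mean")

print(toy)
```

Each row now carries one numeric feature in place of the string, no matter how many levels the predictor has.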

Replacing each category with its average response is (with one more trick) exactly what `LeaveOneOutEncoder` does. The other encoders in `category_encoders` are essentially more advanced versions of this idea.

Notice the problem with this approach, which is called target leakage: information about the response (the y-variable) is being used to construct the predictors.
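The leakage is easiest to see for a category with a single observation: the plain category mean is then exactly that row's own response. The sketch below compares the plain mean with a leave-one-out mean, which excludes each row's own response from its category average. It is a hand-rolled illustration of the idea the encoder's name suggests, not the `category_encoders` implementation; the toy data, the column names, and the fallback to the overall mean for single-row categories are all assumptions made here.

```python
import pandas as pd

# Toy data, invented for illustration (not the Ames data set).
toy = pd.DataFrame({
    "Neighborhood": ["Crawfor", "Crawfor", "ClearCr", "ClearCr", "ClearCr", "Gilbert"],
    "SalePrice": [200_000.0, 220_000.0, 150_000.0, 160_000.0, 170_000.0, 190_000.0],
})

grouped = toy.groupby("Neighborhood")["SalePrice"]
group_sum = grouped.transform("sum")      # per-row sum of the row's category
group_count = grouped.transform("count")  # per-row size of the row's category

# Plain mean: every row's own response is part of its encoded value.
toy["plain_mean"] = group_sum / group_count

# Leave-one-out mean: remove the row's own response before averaging.
loo = (group_sum - toy["SalePrice"]) / (group_count - 1)

# A single-row category gives 0 / 0 = NaN; falling back to the overall
# mean is an assumption of this sketch, not necessarily the library's rule.
toy["leave_one_out"] = loo.fillna(toy["SalePrice"].mean())

print(toy)
```

For the single `Gilbert` row, `plain_mean` equals `SalePrice` exactly, so a model could read the answer straight off the feature. The leave-one-out column never contains a row's own response.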

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the Ames housing data and work with the log of the sale price.
data = pd.read_csv('AmesHousing.csv')
data["logSalePrice"] = np.log(data["SalePrice"])

In [4]:
# Average log sale price per neighbourhood, sorted from lowest to highest.
data.groupby("Neighborhood")["logSalePrice"].mean().sort_values()

Neighborhood
MeadowV    11.449986
IDOTRR     11.474050
BrDale     11.560916
OldTown    11.672587
BrkSide    11.690847
Edwards    11.726375
SWISU      11.786066
Sawyer     11.810391
Landmrk    11.827736
NPkVill    11.852267
Blueste    11.856254
NAmes      11.863716
Mitchel    11.966394
SawyerW    12.086417
NWAmes     12.127035
Gilbert    12.145524
Greens     12.167202
Blmngtn    12.178908
CollgCr    12.181596
Crawfor    12.195976
ClearCr    12.217974
Somerst    12.315522
Timber     12.378339
Veenker    12.388841
GrnHill    12.526341
StoneBr    12.621673
NridgHt    12.639025
NoRidge    12.673408
Name: logSalePrice, dtype: float64