numeric-yeojohnson.qmd

---
pagetitle: "Feature Engineering A-Z | Yeo-Johnson"
---

# Yeo-Johnson {#sec-numeric-yeojohnson}

::: {style="visibility: hidden; height: 0px;"}
## Yeo-Johnson
:::

You have likely heard a lot of talk about having normally distributed predictors. This isn't that common of an assumption, and having a fairly non-skewed symmetric predictor is often enough. Linear Discriminant Analysis assumes Gaussian data, and that is about it (TODO add a reference here). Still, it is worthwhile to have more symmetric predictors, and this is where the Yeo-Johnson transformation comes into play.

This method is very similar to the Box-Cox method in @sec-numeric-boxcox, except it doesn't have the restriction that the variable $x$ needs to be positive. 

It works by using maximum likelihood estimation to estimate a transformation parameter $\lambda$ in the following equation that would optimize the normality of $x^*$

$$
x^* = \left\{
    \begin{array}{ll}
      \dfrac{(x + 1) ^ \lambda - 1}{\lambda}              & \lambda \neq 0, x \geq 0 \\
      \log(x + 1)                                         & \lambda =    0, x \geq 0 \\
      - \dfrac{(-x + 1) ^ {2 - \lambda} - 1}{2 - \lambda} & \lambda \neq 2, x <    0 \\
      - \log(-x + 1)                                      & \lambda =    2, x <    0
    \end{array}
  \right.
$$
It is worth noting again, that what we are optimizing over is the value of $\lambda$. This is also a case of a trained preprocessing method when used on the predictors. We need to estimate the parameter $\lambda$ on the training data set, then use the estimated value to apply the transformation to the training and test data set to avoid data leakage.

If the values of $x$ are strictly positive, then the Yeo-Johnson transformation is the same as the Box-Cox transformation of $x + 1$, if the values of $x$ are strictly negative then the transformation is the Box-Cox transformation of $-x + 1$ with the power $2 - \lambda$. The interpretation of $\lambda$ isn't as easy as for the Box-Cox method.

Let us see some examples of Yeo-Johnson at work. Below is three different simulated distribution, before and after they have been transformed by Yeo-Johnson.

```{r}
#| echo: false
#| message: false
library(recipes)
library(ggplot2)
library(tidyr)
library(patchwork)

visualize_yeojohnson <- function(x) {
  recipe(~Original, data = tibble(Original = x)) |>
    step_mutate(Transformed = Original) |>
    step_YeoJohnson(Transformed) |>
    prep() |>
    bake(new_data = NULL) |>
    pivot_longer(everything()) |>
    ggplot(aes(value)) +
    geom_histogram(bins = 50) +
    facet_wrap(~name, scales = "free") +
    theme_minimal() +
    labs(x = NULL, y = NULL)
}
```

```{r}
#| echo: false
#| message: false
#| fig-cap: "Before and After Yeo-Johnson"
#| fig-alt: "6 histograms of distribution, in 2 columns. The left column shows unaltered distributions. The right column shows the distribution of the Yeo-Johnson transformation of the left column. The right column is mostly normally distributed."
set.seed(1234)
visualize_yeojohnson(rchisq(10000, 1)/2 + rnorm(10000, sd = 0.2)) /
visualize_yeojohnson(c(rnorm(6000, 0), rnorm(3000, 2), rnorm(1000, 4))) /
visualize_yeojohnson(5 - abs(rt(10000, 205)) + rnorm(10000, sd = 0.1))
```

We have the original distributions that have some left or right skewness. And the transformed columns look better, in the sense that they are less skewed and they are fairly symmetric around the center. Are they perfectly normal? No! but these transformations might be beneficial. We also notice how these methods work, even when there are negative values.

The Yeo-Johnson method isn't magic and will only give you something more normally distributed if the distribution can be made more normally distributed by applying the above formula would give you some more normally distributed values.

```{r}
#| echo: false
#| message: false
#| fig-cap: "Before and After Box-Cox"
#| fig-alt: "6 histograms of distribution, in 2 columns. The left column shows unaltered distributions. The right column shows the distribution of the Box-Cox transformation of the left column. The right column has not benefitted from the Box-Cox transformation"
set.seed(1234)
visualize_yeojohnson(runif(10000)) /
visualize_yeojohnson(c(rnorm(7000, 10), rnorm(13000, 14))) /
visualize_yeojohnson(1 + rbeta(10000, 0.5, 0.5) + rnorm(10000, sd = 0.05))
```

The first distribution here is uniformly random. The resulting transformation ends up more skewed, even if only a little bit, than the original distribution because this method is not intended for this type of data. We are seeing similar results with the bi-modal distributions.

## Pros and Cons

### Pros

- More flexible than individually chosen power transformations such as logarithms and square roots
- Can handle negative values

### Cons

- Isn't a universal fix

## R Examples

We will be using the `ames` data set for these examples.

```{r}
library(recipes)
library(modeldata)
data("ames")

ames |>
  select(Lot_Area, Wood_Deck_SF, Sale_Price)
```

{recipes} provides a step to perform Yeo-Johnson transformations, which out of the box uses $e$ as the base with an offset of 0.

```{r}
yeojohnson_rec <- recipe(Sale_Price ~ Lot_Area, data = ames) |>
  step_YeoJohnson(Lot_Area) |>
  prep()

yeojohnson_rec |>
  bake(new_data = NULL)
```

We can also pull out the value of the estimated $\lambda$ by using the `tidy()` method on the recipe step.

```{r}
yeojohnson_rec |>
  tidy(1)
```

## Python Examples

```{python}
#| echo: false
import pandas as pd
from sklearn import set_config

set_config(transform_output="pandas")
pd.set_option('display.precision', 3)
```

We are using the `ames` data set for examples. {feature_engine} provided the `YeoJohnsonTransformer()` that we can use.

```{python}
from feazdata import ames
from sklearn.compose import ColumnTransformer
from feature_engine.transformation import YeoJohnsonTransformer

ct = ColumnTransformer(
    [('yeojohnson', YeoJohnsonTransformer(), ['Lot_Area'])], 
    remainder="passthrough")

ct.fit(ames)
ct.transform(ames)
```