/
numeric-logarithms.qmd
199 lines (141 loc) · 8.27 KB
/
numeric-logarithms.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
---
pagetitle: "Feature Engineering A-Z | Logarithms"
---
# Logarithms {#sec-numeric-logarithms}
::: {style="visibility: hidden; height: 0px;"}
## Logarithms
:::
Logarithms come as a tool to deal with highly skewed data. Consider the histogram below, it depicts the lot area for all the houses in the `ames` data set. It is highly skewed with the majority of areas being less than 10,000 with a handful greater than 50,000 with one over 200,000.
::: {.callout-caution}
# TODO
Find a better data set for this example
:::
```{r}
#| echo: false
#| message: false
#| fig-cap: "Histogram of lot areas"
#| fig-alt: "Highly right skewed histogram of lot areas"
library(tidymodels)
data("ames")
ggplot(ames, aes(Lot_Area)) +
geom_histogram(bins = 100) +
theme_minimal()
```
This data could cause some problems if we tried to use this variable in its unfinished state. We need to think about how numeric variables are being used in these models. Take a linear regression as an example. It fits under the assumption that there is a linear relationship between the response and the predictors. In other words, the model assumes that a 1000 increase in lot area is going to increase the sale price of the house by the same amount if it occurred to a 5000 lot area house or a 50000 lot area house. This might be a valid scenario for this housing market. Another possibility is that we are seeing diminishing returns, and each increase in lot area is going to cause a smaller increase in the predicted sale price.
The main identity that is interesting when using logarithms for preprocessing is the multiplicative identity
$$
\log_{base}(xy) = \log_{base}(x) + \log_{base}(y)
$$
This identity allows us to consider the distances in a specific non-linear way. On the linear scale, we have that the distance between 1 and 2 is the same as 20 to 21 and 10,000 to 10,001. Such a relation is not always what we observe. Consider the relationship between excitement (response) and the number of audience members (predictor) at a concert. Adding a singular audience member is not going to have the same effect for all audience sizes. If we consider $y = 1.5$ then we can look at what happens each time the audience size is increased by 50%. For an initial audience size of 100, the following happens
$$
\log_{10}(150) = \log_{10}(100 \cdot 1.5) = \log_{10}(100) + \log_{10}(1.5) \approx 2 + 0.176
$$
and for an initial audience size of 10,000, we get
$$
\log_{10}(15,000) = \log_{10}(10,000 \cdot 1.5) = \log_{10}(10,000) + \log_{10}(1.5) \approx 4 + 0.176
$$
And we see that by using a logarithmic transformation we can translate a multiplicative increase into an additive increase.
The logarithmic transformation is perfect for such a scenario. If we take a `log2()` transformation on our data, then we get a new interpretation of our model. Now each doubling (because we used log**2**) of the lot area is going to result in the same predicted increase in sale price. We have essentially turned out predictors to work on doublings rather than additions. Below is the same chart as before, but on a logarithmic scale using base 2.
```{r}
#| echo: false
#| message: false
#| fig-cap: "Histogram of logarithmic lot areas"
#| fig-alt: "histogram of logarithmic lot areas. This time not skewed"
library(tidymodels)
data("ames")
ggplot(ames, aes(log2(Lot_Area))) +
geom_histogram(bins = 100) +
theme_minimal()
```
it is important to note that we are not trying to make the variable normally distributed. What we are trying to accomplish is to remove the skewed nature of the variable. Likewise, this method should not be used as a variance reduction tool as that task is handled by doing normalization which we start exploring more in @sec-numeric-scaling-issues.
The full equation we use is
$$
y = log_{base}(x + offset)
$$
Where $x$ is the input, $y$ is the result. The base in and of itself makes a difference, as the choice of base only matters in a multiplicative way. Common choices for bases are $e$, 2, and 10. I find that using a base of 2 yields nice interpretive properties as an increase of 1 in $x$ results in a doubling of $y$.
One thing we haven't touched on yet is that logarithms are only defined for positive input. The offset is introduced to deal with this problem. The offset will default to 0 in most software you will encounter. If you know that your variables contain values from -100 to 10000, you can set `offset = 101` to make sure we never take the logarithm of a negative number. A common scenario is non-negative, which contains 0, setting `offset = 0.5` helps in that case. The choice of offset you choose does matter a little bit. The offset should generally be smaller than the smallest non-negative value you have, otherwise, you will influence the proportions too much. Below we use an offset of 0.1, and the relationship we get out of this is that distance from 0 to 1, is roughly the same as multiplying by 10.
```{r}
#| echo: false
offset <- 0.1
log(c(0, 1, 10, 100, 1000) + offset, base = 2) |> round(3)
```
On the other hand, if we set the offset to 0.001, we get that the difference from 0 to 1 is the same as the distance from 1 to 1000, or the same as multiplying by 1000.
```{r}
#| echo: false
offset <- 0.001
log(c(0, 1, 10, 100, 1000) + offset, base = 2) |> round(3)
```
The use of logarithms will have different effects on different types of models. Linear models will see a change in transformed variables since we can better model relationships between the response and dependent variable if the transformation leads to a more linear relationship. Tree-based models should in theory not be affected by the use of logarithms, however, some implementations use binning to reduce the number of splitting points for consideration. You will see a change if these bins are based on fixed-width intervals. This is expected to have minimal effect on the final model.
## Pros and Cons
### Pros
- A non-trained operation, can easily be applied to training and testing data sets alike
### Cons
- Needs offset to deal with negative data
- Is not a universal fix. While it can make skewed distributions less skewed. It has the opposite effect on a distribution that isn't skewed. See the effect below on 10,000 uniformly distributed values
```{r}
#| echo: false
#| layout-ncol: 2
set.seed(1234)
runif(10000) |> hist(breaks = 20)
set.seed(1234)
runif(10000) |> log10() |> hist(breaks = 20)
```
## R Examples
We will be using the `ames` data set for these examples.
```{r}
library(recipes)
library(modeldata)
ames |>
select(Lot_Area, Wood_Deck_SF, Sale_Price)
```
{recipes} provides a step to perform logarithms, which out of the box uses $e$ as the base with an offset of 0.
```{r}
log_rec <- recipe(Sale_Price ~ Lot_Area, data = ames) |>
step_log(Lot_Area)
log_rec |>
prep() |>
bake(new_data = NULL)
```
The base can be changed by setting the `base` argument.
```{r}
log_rec <- recipe(Sale_Price ~ Lot_Area, data = ames) |>
step_log(Lot_Area, base = 2)
log_rec |>
prep() |>
bake(new_data = NULL)
```
If we have non-positive values, which we do in the `Wood_Deck_SF` variable because it has quite a lot of zeroes, we get `-Inf` which isn't going to work.
```{r}
log_rec <- recipe(Sale_Price ~ Wood_Deck_SF, data = ames) |>
step_log(Wood_Deck_SF)
log_rec |>
prep() |>
bake(new_data = NULL)
```
Setting the `offset` argument helps us to deal with that problem.
```{r}
log_rec <- recipe(Sale_Price ~ Wood_Deck_SF, data = ames) |>
step_log(Wood_Deck_SF, offset = 0.5)
log_rec |>
prep() |>
bake(new_data = NULL)
```
## Python Examples
```{python}
#| echo: false
import pandas as pd
from sklearn import set_config
set_config(transform_output="pandas")
pd.set_option('display.precision', 3)
```
We are using the `ames` data set for examples. {feature_engine} provided the `LogCpTransformer()` which is just what we need in this instance. This transformer lets us set an offset `C` that is calculated before taking the logarithm to avoid taking the log of a non-positive number. If we knew that there weren't any non-positive numbers then we could use the `LogTransformer` instead.
```{python}
from feazdata import ames
from sklearn.compose import ColumnTransformer
from feature_engine.transformation import LogCpTransformer
ct = ColumnTransformer(
[('log', LogCpTransformer(C=1), ['Wood_Deck_SF'])],
remainder="passthrough")
ct.fit(ames)
ct.transform(ames)
```