# Saving Degrees of Freedom with Transformations

Here's an example of using transformations to (1) match the context of the problem and (2) get better results with fewer degrees of freedom.

We'll use the `trees` dataset that's built into R. The "Girth" column is actually the diameter, and it's the only column measured in inches rather than feet. I'm going to convert it to a radius in feet, which is a more useful predictor for later models.

In [1]:
trees$Radius <- trees$Girth/24
head(trees) # "Girth" is actually diameter, according to help file

,Girth,Height,Volume,Radius
,<dbl>,<dbl>,<dbl>,<dbl>
1,8.3,70,10.3,0.3458333
2,8.6,65,10.3,0.3583333
3,8.8,63,10.2,0.3666667
4,10.5,72,16.4,0.4375
5,10.7,81,18.8,0.4458333
6,10.8,83,19.7,0.45
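
The division by 24 combines two steps: halving the diameter to get a radius, then dividing by 12 to go from inches to feet. A minimal check of that arithmetic (the intermediate variable names are illustrative, not from the original notebook):

```r
# Girth is a diameter in inches; Height is in feet and Volume in cubic feet.
girth_in  <- trees$Girth
radius_in <- girth_in / 2     # diameter -> radius (still in inches)
radius_ft <- radius_in / 12   # inches -> feet

# Same result as the one-step conversion Girth / 24
all.equal(radius_ft, trees$Girth / 24)
```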


A naive model might be a basic multiple linear regression.

In [2]:
multiple_lm <- lm(Volume ~ Radius + Height, data = trees)
summary(multiple_lm)


Call:
lm(formula = Volume ~ Radius + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Radius      112.9959     6.3424  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948,	Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
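
One way to see that this model does not match the context of the problem is to extrapolate. The sketch below refits the model and predicts for a hypothetical sapling (the `sapling` values are made up for illustration and lie well outside the observed data). Because of the large negative intercept, the predicted volume comes out negative.

```r
# Refit the additive model on a copy of the data with Radius in feet
trees_ft <- transform(trees, Radius = Girth / 24)
multiple_lm <- lm(Volume ~ Radius + Height, data = trees_ft)

# Hypothetical small tree: 0.2 ft radius, 30 ft tall (not in the data)
sapling <- data.frame(Radius = 0.2, Height = 30)
sapling_pred <- predict(multiple_lm, newdata = sapling)
sapling_pred   # a negative volume, which is physically impossible
```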


We could make this fit better by blindly adding polynomial terms and doing a transformation:

In [3]:
transformed_and_poly <- lm(log(Volume) ~ poly(Girth, 2) + poly(Height, 2), data = trees)
summary(transformed_and_poly)


Call:
lm(formula = log(Volume) ~ poly(Girth, 2) + poly(Height, 2), 
    data = trees)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.16094 -0.04023 -0.00295  0.05474  0.13434 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3.27273    0.01492 219.360  < 2e-16 ***
poly(Girth, 2)1   2.51150    0.09732  25.808  < 2e-16 ***
poly(Girth, 2)2  -0.26046    0.09206  -2.829  0.00887 ** 
poly(Height, 2)1  0.54845    0.09746   5.628 6.47e-06 ***
poly(Height, 2)2 -0.05518    0.09191  -0.600  0.55349    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08307 on 26 degrees of freedom
Multiple R-squared:  0.9784,	Adjusted R-squared:  0.9751 
F-statistic: 294.5 on 4 and 26 DF,  p-value: < 2.2e-16
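
A note on reading these coefficients: by default `poly()` returns *orthogonal* polynomial columns rather than raw powers, so the estimates above are not on the scale of `Girth` and `Girth^2`. A small sketch of what that means (`P`, `fit_orth` and `fit_raw` are illustrative names):

```r
P <- poly(trees$Girth, 2)

# Columns sum to zero, have unit length, and are orthogonal to each other
round(colSums(P), 10)
round(crossprod(P), 10)

# The fitted values match those from raw powers; only the coefficients differ
fit_orth <- lm(log(Volume) ~ poly(Girth, 2), data = trees)
fit_raw  <- lm(log(Volume) ~ poly(Girth, 2, raw = TRUE), data = trees)
all.equal(fitted(fit_orth), fitted(fit_raw))
```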


- It is interesting that the squared term for height is *not* significant. Why is this so interesting to me? Look at the next equation in this lesson...

A slightly better model might be one of the form:
$$
V = \pi r^2h
$$
which assumes that trees are perfect cylinders. This can be accomplished by modelling:
\begin{align*}
\log(V) &= \beta_0 + \beta_1\log(r) + \beta_2\log(h) + \epsilon\\
\implies V &= \exp(\beta_0)r^{\beta_1}h^{\beta_2}\exp(\epsilon)
\end{align*}
and expecting that $\exp(\beta_0)$ is close to $\pi$, $\beta_1 = 2$, and $\beta_2 = 1$. (It makes me very happy that we're taking the "log" when talking about lumber.)

In this formulation, note that the errors are *multiplicative*. We're assuming a baseline model, and then the observed values are 1 times that baseline if there's no error, and some other multiple of the baseline model if there is error. This isn't a problem *per se*, but it changes how we might interpret prediction error!
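
To make the multiplicative error concrete, here is a small simulation sketch. Everything in it is made up for illustration (the seed, the sample size, the ranges for `r` and `h`, and the error standard deviation). It generates perfect cylinders whose volumes are scaled by $\exp(\epsilon)$ and checks that the log-log regression recovers the cylinder exponents.

```r
set.seed(1)                    # arbitrary seed, for reproducibility only
n   <- 500
r   <- runif(n, 0.3, 0.9)      # radius in feet (made-up range)
h   <- runif(n, 60, 90)        # height in feet (made-up range)
eps <- rnorm(n, sd = 0.02)     # error on the log scale
V   <- pi * r^2 * h * exp(eps) # the error multiplies the cylinder volume

sim_fit <- lm(log(V) ~ log(r) + log(h))
coef(sim_fit)                  # should be near log(pi), 2 and 1
exp(coef(sim_fit)[1])          # should be near pi

# On the original scale the error is a ratio, not a difference
error_ratio <- V / (pi * r^2 * h)
range(error_ratio)             # scattered around 1
```

A ratio of 1.02 means the observed volume is 2% above the baseline, whatever the size of the tree, so larger trees have larger absolute errors.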

In [4]:
volume_logs <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)
summary(volume_logs)


Call:
lm(formula = log(Volume) ~ log(Girth) + log(Height), data = trees)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.168561 -0.048488  0.002431  0.063637  0.129223 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.63162    0.79979  -8.292 5.06e-09 ***
log(Girth)   1.98265    0.07501  26.432  < 2e-16 ***
log(Height)  1.11712    0.20444   5.464 7.81e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared:  0.9777,	Adjusted R-squared:  0.9761 
F-statistic: 613.2 on 2 and 28 DF,  p-value: < 2.2e-16


- We get something close to our hopes!
    - Except for $\beta_0$, which we'll talk about later.
- With this model, we could do a hypothesis test for $\beta_1 = 2$ and $\beta_2 = 1$. 
    - If these are reasonable values, loggers could confidently calculate the volume of a tree assuming that it's a cylinder!
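
One possible way to carry out such a test is sketched below. It looks at the marginal confidence intervals, then fixes both slopes with an `offset()` term and compares that restricted model to the unrestricted one with an F-test. The offset approach and the name `cylinder_slopes` are choices made for this sketch, not something the lesson prescribes.

```r
volume_logs <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)

# Marginal check: do the 95% intervals contain 2 and 1?
slope_ci <- confint(volume_logs)
slope_ci

# Joint check: fix the slopes at 2 and 1 via an offset, estimate only the
# intercept, and compare against the unrestricted model with an F-test.
cylinder_slopes <- lm(log(Volume) ~ 1 + offset(2 * log(Girth) + log(Height)),
                      data = trees)
joint_test <- anova(cylinder_slopes, volume_logs)
joint_test
```

In the coefficient table above, both estimates sit within one standard error of their cylinder values, so the marginal intervals contain 2 and 1. The F-test asks the same question about both slopes at once.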

We could also assume these values from the start, and include a "naive" volume.

In [5]:
trees$naive_volume <- pi * trees$Radius^2 * trees$Height

We could then model this according to
\begin{align*}
V & = \beta_0N + \epsilon
\end{align*}
where $N$ is our "naive" volume. We might have the expectation that $\beta_0 = 1$ if the naive volume is correct and we are getting 100% of the tree as usable lumber.

In [6]:
# Assuming trees are cylinders
diff_from_cylinder <- lm(Volume ~ -1 + naive_volume, data = trees)
summary(diff_from_cylinder)


Call:
lm(formula = Volume ~ -1 + naive_volume, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6696 -1.0832 -0.3341  1.6045  4.2944 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
naive_volume 0.386513   0.004991   77.44   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.455 on 30 degrees of freedom
Multiple R-squared:  0.995,	Adjusted R-squared:  0.9949 
F-statistic:  5996 on 1 and 30 DF,  p-value: < 2.2e-16


This tells us that the estimated usable lumber from a given tree is about 40\% of what we would expect if the tree were a perfect cylinder.
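
Because this model has a single coefficient and no intercept, the estimate has a simple closed form, $\hat\beta_0 = \sum N_iV_i / \sum N_i^2$. A quick sketch that checks it against `lm()` (the short names `N` and `V` follow the equation above and are otherwise illustrative):

```r
N <- pi * (trees$Girth / 24)^2 * trees$Height  # naive cylinder volume
V <- trees$Volume

# Least-squares slope for a line through the origin
slope_by_hand <- sum(N * V) / sum(N^2)

slope_lm <- coef(lm(V ~ -1 + N))
all.equal(unname(slope_lm), slope_by_hand)
```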

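Importantly for this lesson, we have an adjusted $R^2$ of 0.9949 on a single degree of freedom! $R^2$ is not the greatest measure (the log models are scored on a different response scale, and R computes $R^2$ differently when there is no intercept), but it's informative in this case: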

In [7]:
data.frame(
    model = c("multiple_lm", "transformed_and_poly", "volume_logs", "diff_from_cylinder"),
    R2 = c(summary(multiple_lm)$adj.r.squared, summary(transformed_and_poly)$adj.r.squared, summary(volume_logs)$adj.r.squared, summary(diff_from_cylinder)$adj.r.squared),
    df = c(multiple_lm$rank - 1, transformed_and_poly$rank - 1, volume_logs$rank - 1, diff_from_cylinder$rank)
)

model,R2,df
<chr>,<dbl>,<dbl>
multiple_lm,0.9442322,2
transformed_and_poly,0.9750853,4
volume_logs,0.976084,2
diff_from_cylinder,0.994856,1
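
One caveat when reading this table: for a model without an intercept, `summary.lm()` measures $R^2$ against zero rather than against the mean of the response, which makes the number larger than the usual definition would give. The sketch below computes both versions for the cylinder model (the data frame `d` and the names `r2_uncentred` and `r2_centred` are illustrative).

```r
d <- data.frame(
    Volume = trees$Volume,
    naive_volume = pi * (trees$Girth / 24)^2 * trees$Height
)
fit0 <- lm(Volume ~ -1 + naive_volume, data = d)
rss  <- sum(resid(fit0)^2)

# What summary() reports when there is no intercept: total variation about 0
r2_uncentred <- 1 - rss / sum(d$Volume^2)
# The usual definition: total variation about the mean
r2_centred   <- 1 - rss / sum((d$Volume - mean(d$Volume))^2)

c(reported  = summary(fit0)$r.squared,
  uncentred = r2_uncentred,
  centred   = r2_centred)
```

The centred value is the one to set beside `multiple_lm`, since both models predict `Volume` on its original scale.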


By choosing our transformations carefully, we have a model that is both *better* and *simpler*! The coefficient estimate also relates to a physical quantity that is useful to us - the percent of usable wood we can get from a tree! Statistics is amazing! (I want the following on a t-shirt: "If you don't think stats is lit af then you ain't woke, fam!")