# Adjusted $R^{2}$ example

Taken from Example 12.4 in [1].

Combinations of three different materials ($x_1$, $x_2$, and $x_3$, each measured in weight percent) were used to try to increase the chance of sperm survival in animal semen. The percent survival rate of sperm for each combination is given below:

In [2]:
data ex12_04;
    input y x1 x2 x3;
    label y = "y (% survival)" x1 = "x1 (weight %)" x2 = "x2 (weight %)" x3 = "x3 (weight %)";
    datalines;
25.5 1.74 5.3 10.8
31.2 6.32 5.42 9.4
25.9 6.22 8.41 7.2
38.4 10.52 4.63 8.5
18.4 1.19 11.6 9.4
26.7 1.22 5.85 9.9
26.4 4.1 6.62 8
25.9 6.32 8.72 9.1
32 4.08 4.42 8.7
25.2 4.15 7.6 9.2
39.7 10.15 4.83 9.4
35.7 1.72 3.12 7.6
26.5 1.7 5.3 8.2
;
run;

proc print data = ex12_04 label;
run;

Obs,y (% survival),x1 (weight %),x2 (weight %),x3 (weight %)
1,25.5,1.74,5.3,10.8
2,31.2,6.32,5.42,9.4
3,25.9,6.22,8.41,7.2
4,38.4,10.52,4.63,8.5
5,18.4,1.19,11.6,9.4
6,26.7,1.22,5.85,9.9
7,26.4,4.1,6.62,8.0
8,25.9,6.32,8.72,9.1
9,32.0,4.08,4.42,8.7
10,25.2,4.15,7.6,9.2
11,39.7,10.15,4.83,9.4
12,35.7,1.72,3.12,7.6
13,26.5,1.7,5.3,8.2


Let's perform a regression using all of the feature variables in the data:

In [4]:
proc reg data = ex12_04 plots = none;
    model y = x1 x2 x3;
run;

Number of Observations Read,13
Number of Observations Used,13

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,3,399.45437,133.15146,30.98,<.0001
Error,9,38.6764,4.29738,,
Corrected Total,12,438.13077,,,

Root MSE,2.07301,R-Square,0.9117
Dependent Mean,29.03846,Adj R-Sq,0.8823
Coeff Var,7.13885,,

Parameter Estimates
Variable,Label,DF,Parameter Estimate,Standard Error,t Value,Pr > |t|
Intercept,Intercept,1,39.15735,5.88706,6.65,<.0001
x1,x1 (weight %),1,1.0161,0.1909,5.32,0.0005
x2,x2 (weight %),1,-1.86165,0.26733,-6.96,<.0001
x3,x3 (weight %),1,-0.34326,0.61705,-0.56,0.5916


Using the estimated coefficients, the fitted mean response is:

\begin{equation}
\widehat{y} = 39.157 + 1.016 x_1 - 1.862 x_2 - 0.343 x_3
\end{equation}
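To see the fitted equation in use, the predicted values can be written to a data set alongside the observed responses. The following is a minimal sketch, assuming the `ex12_04` data set created above; the names `pred_full`, `yhat`, and `resid` are hypothetical and chosen only for this illustration.

```sas
/* Sketch: refit the full model and save fitted values and residuals.
   pred_full, yhat, and resid are illustrative names, not from the source. */
proc reg data = ex12_04 plots = none noprint;
    model y = x1 x2 x3;
    output out = pred_full p = yhat r = resid;
run;
quit;

/* Show the first few observed and fitted values side by side. */
proc print data = pred_full (obs = 3);
    var y x1 x2 x3 yhat resid;
run;
```

For the first observation ($x_1 = 1.74$, $x_2 = 5.3$, $x_3 = 10.8$), plugging into the fitted equation by hand gives roughly $\widehat{y} \approx 27.35$, compared with the observed $y = 25.5$.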

The summary statistics report $R^2 = 0.9117$ and $R^2_{adj} = 0.8823$. Remember that $R^2_{adj}$ is a modification of $R^2$ that penalizes each additional feature variable in the model, so it can decrease when a variable that contributes little is added, whereas $R^2$ never decreases.
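As a quick check on how the penalty works, the adjusted value can be recomputed by hand from the Analysis of Variance table above. This sketch assumes the usual definition, with $n = 13$ observations and $k = 3$ feature variables; the symbols $SSE$ (error sum of squares), $SST$ (corrected total sum of squares), $n$, and $k$ are notation introduced here rather than taken from the original example.

\begin{equation}
R^2_{adj} = 1 - \frac{SSE / (n - k - 1)}{SST / (n - 1)} = 1 - \frac{38.6764 / 9}{438.13077 / 12} = 1 - \frac{4.29738}{36.51090} \approx 0.8823
\end{equation}

Adding a variable never increases $SSE$, but it always reduces the error degrees of freedom $n - k - 1$, so $R^2_{adj}$ rises only when the drop in $SSE$ is large enough to offset the lost degree of freedom.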

Let's try removing $x_3$, whose coefficient is not significant in the full model ($p = 0.5916$), and see what happens:

In [5]:
proc reg data = ex12_04 plots = none;
    model y = x1 x2;
run;

Number of Observations Read,13
Number of Observations Used,13

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,2,398.1245,199.06225,49.76,<.0001
Error,10,40.00627,4.00063,,
Corrected Total,12,438.13077,,,

Root MSE,2.00016,R-Square,0.9087
Dependent Mean,29.03846,Adj R-Sq,0.8904
Coeff Var,6.88796,,

Parameter Estimates
Variable,Label,DF,Parameter Estimate,Standard Error,t Value,Pr > |t|
Intercept,Intercept,1,36.09468,2.0116,17.94,<.0001
x1,x1 (weight %),1,1.03051,0.18248,5.65,0.0002
x2,x2 (weight %),1,-1.86964,0.25756,-7.26,<.0001


The fitted model is now:

\begin{equation}
\widehat{y} = 36.095 + 1.031 x_1 - 1.870 x_2
\end{equation}

The $R^2$ value has dropped slightly, from $0.9117$ to $0.9087$, while $R^2_{adj}$ has risen from $0.8823$ to $0.8904$. Removing $x_3$ costs almost nothing in explained variation, and the adjusted statistic rewards the simpler model.
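Rather than fitting candidate models one at a time, PROC REG can rank every subset of the feature variables by adjusted $R^2$. The following is a minimal sketch, assuming the same `ex12_04` data set; it uses the `selection = adjrsq` option of the MODEL statement and is offered as a convenience, not as the procedure followed in the original example.

```sas
/* Sketch: list all subsets of x1, x2, x3 ordered by adjusted R-square. */
proc reg data = ex12_04 plots = none;
    model y = x1 x2 x3 / selection = adjrsq;
run;
quit;
```

In the resulting listing, the rows for the two models fitted above should show the same values as their individual fits: an adjusted $R^2$ of $0.8904$ for $x_1, x_2$ and $0.8823$ for $x_1, x_2, x_3$.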

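Looking at the F-statistics from the "Analysis of Variance" tables, the original full model had an F-value of $30.98$, while the new reduced model has an F-value of $49.76$. The increase arises because nearly the same model sum of squares ($398.12$ versus $399.45$) is now spread over two degrees of freedom instead of three, and the error mean square has dropped from $4.297$ to $4.001$. In other words, the reduced model accounts for the variation in the response more efficiently than the original full model.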
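The two fits can also be compared directly with a partial F-test on the dropped variable. This is a supplementary sketch using the standard nested-model formula, not a step from the original example; the subscripts $F$ (full) and $R$ (reduced) and the symbol $df$ (error degrees of freedom) are notation introduced here.

\begin{equation}
F = \frac{(SSE_R - SSE_F) / (df_R - df_F)}{MSE_F} = \frac{(40.00627 - 38.67640) / (10 - 9)}{4.29738} = \frac{1.32987}{4.29738} \approx 0.31
\end{equation}

This agrees with the square of the $t$ value reported for $x_3$ in the full model ($(-0.56)^2 \approx 0.31$, with $p = 0.5916$), so there is no evidence that $x_3$ improves the fit once $x_1$ and $x_2$ are in the model.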

We would never have arrived at the reduced model by comparing $R^2$ alone. Because $R^2$ cannot decrease when a variable is added, the full model always has the larger value, and we would have rejected the reduced model outright. Using a combination of $R^2_{adj}$ and the F-value, however, we conclude that the reduced model is the better choice: it explains the relationship between $y$ and the feature variables essentially as well with one fewer variable.

## Summary

* examine $R^2$ in conjunction with $R^2_{adj}$ and the F-statistic to get a better understanding of the overall model
* evaluating models based on $R^2$ alone can lead to overfitting and faulty conclusions
* $R^2_{adj}$ helps guard against overfitting by penalizing each additional feature variable added to the model

## Citations

[1] R. E. Walpole, R. H. Myers, S. L. Myers, K. Ye, in Probability & Statistics for Engineers & Scientists, 9th ed. Boston, USA: Pearson Education, Inc., 2012, ch. 12, sec. 3-6, pp. 449-464.