# Parameter selection example

The `sashelp.fish` dataset contains measurements of 159 fish that were caught in Lake Laengelmaevesi in Finland. They are grouped by species.

The dataset contains the following measurements:
* weight ($weight$)
* length from the nose of the fish to the base of the tail ($length_1$)
* length from the nose of the fish to the notch of the tail ($length_2$)
* length from the nose of the fish to the end of the tail ($length_3$)
* height as a percentage of $length_3$ ($height$)
* width as a percentage of $length_3$ ($width$)

We want to determine the relationship between a fish's $weight$ and the rest of its measurements. Our goal is to eliminate some of the variables using parameter selection, keeping only the subset that contributes significantly to the model. Our subsets will be generated using the following requests under the `selection = ` option of the `model` statement in `PROC REG`:
* forward
* backward
* stepwise
* adjrsq
* cp

The `maxr`, `minr`, and `rsquare` requests will not be covered here. `maxr` and `minr` are similar to the `forward`/`backward`/`stepwise` requests, except that they add and swap variables according to the change in $R^{2}$ and report a model for each number of variables instead of settling on a single one. `rsquare` ranks candidate subsets in the same way as `adjrsq`, but by $R^{2}$ rather than adjusted $R^{2}$.

For this example, we will look at the perch species:

In [15]:
proc print data = sashelp.Fish (where = (species = "Perch"));
run;

Obs,Species,Weight,Length1,Length2,Length3,Height,Width
73,Perch,5.9,7.5,8.4,8.8,2.112,1.408
74,Perch,32.0,12.5,13.7,14.7,3.528,1.9992
75,Perch,40.0,13.8,15.0,16.0,3.824,2.432
76,Perch,51.5,15.0,16.2,17.2,4.5924,2.6316
77,Perch,70.0,15.7,17.4,18.5,4.588,2.9415
78,Perch,100.0,16.2,18.0,19.2,5.2224,3.3216
79,Perch,78.0,16.8,18.7,19.4,5.1992,3.1234
80,Perch,80.0,17.2,19.0,20.2,5.6358,3.0502
81,Perch,85.0,17.8,19.6,20.8,5.1376,3.0368
82,Perch,85.0,18.2,20.0,21.0,5.082,2.772


Let's adjust the $height$ and $width$ values to be actual measurements and not percentages:

In [16]:
data perch;
    set sashelp.fish;
    where species = "Perch";
    
    height2 = length3 * height/100;
    width2 = length3 * width/100;
    
    drop height width;
    rename height2 = height width2 = width;
run;

Let's first use forward selection:

In [28]:
proc reg data = perch plots = none;
    model weight = length1 length2 length3 height width / selection = forward;
run;

Number of Observations Read,56
Number of Observations Used,56

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,1,6512049,6512049.0,2623.38,<.0001
Error,54,134045,2482.31396,,
Corrected Total,55,6646094,,,

Variable,Parameter Estimate,Standard Error,Type II SS,F Value,Pr > F
Intercept,-143.37271,12.23262,340997,137.37,<.0001
height,202.90376,3.9615,6512049,2623.38,<.0001

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,2,6548835,3274417.0,1784.34,<.0001
Error,53,97260,1835.08909,,
Corrected Total,55,6646094,,,

Variable,Parameter Estimate,Standard Error,Type II SS,F Value,Pr > F
Intercept,64.29055,47.55977,3353.29701,1.83,0.1822
Length1,-16.11087,3.59841,36785.0,20.05,<.0001
height,282.79776,18.16673,444688.0,242.32,<.0001

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,3,6565442,2188481.0,1411.01,<.0001
Error,52,80652,1551.00605,,
Corrected Total,55,6646094,,,

Variable,Parameter Estimate,Standard Error,Type II SS,F Value,Pr > F
Intercept,75.91282,43.86782,4644.63546,2.99,0.0895
Length1,-16.96483,3.31846,40536.0,26.14,<.0001
height,194.65197,31.69493,58499.0,37.72,<.0001
width,152.48592,46.59996,16607.0,10.71,0.0019

Summary of Forward Selection
Step,Variable Entered,Number Vars In,Partial R-Square,Model R-Square,C(p),F Value,Pr > F
1,height,1,0.9798,0.9798,31.1344,2623.38,<.0001
2,Length1,2,0.0055,0.9854,10.3203,20.05,<.0001
3,width,3,0.0025,0.9879,2.0204,10.71,0.0019


The forward method adds one variable at a time, choosing the candidate that is most significant in the presence of the variables already in the model, and stops once no remaining candidate meets the entry significance level. In this case, it stops after 3 variables, with the feature variables being:
* height
* length1
* width
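
The stopping rule is governed by the entry significance level, which `PROC REG` sets to 0.50 by default for forward selection. As a minimal sketch, the threshold can be written out explicitly with the `slentry = ` option of the `model` statement. The value 0.05 below is an illustrative assumption and is not part of the original analysis:

```sas
/* Forward selection with an explicit entry threshold.
   0.05 is an assumed, illustrative value; the default for forward is 0.50. */
proc reg data = perch plots = none;
    model weight = length1 length2 length3 height width
          / selection = forward slentry = 0.05;
run;
```

Since every variable entered above has a p-value of at most 0.0019, a stricter threshold such as this one would be expected to admit the same three variables.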

Notice that we started with the $height$ variable and successively added the others, with each addition reducing the overall F-value of the model (from 2623.38 to 1784.34 to 1411.01). The "Summary of Forward Selection" table shows how much each variable contributes to the model.

The "Partial R-Square" value tells us how much additional variation in the response is explained when each feature variable enters the model. The $height$ variable accounts for $\frac{0.9798}{0.9879} \times 100\% = 99.18\%$ of the explained variability, while $length_1$ accounts for $\frac{0.0055}{0.9879} \times 100\% = 0.56\%$ and $width$ accounts for $\frac{0.0025}{0.9879} \times 100\% = 0.25\%$.
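
As a check on where the partial values come from, each one can be reproduced (up to rounding) by dividing the Type II sum of squares of the variable, taken from the step at which it entered, by the corrected total sum of squares of 6646094. The subscripted notation below is introduced here for illustration only:

\begin{equation}
\begin{aligned}
R^2_{partial}(height) &= \frac{6512049}{6646094} \approx 0.9798 \\
R^2_{partial}(length_1) &= \frac{36785}{6646094} \approx 0.0055 \\
R^2_{partial}(width) &= \frac{16607}{6646094} \approx 0.0025
\end{aligned}
\end{equation}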

The F-value reinforces this notion by comparing the variation explained by adding the variable against the residual variance of the model. The p-value derived from the F-value indicates how significant the variable is to the model. All three variables are found to be highly significant.
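
As a worked check of that comparison: for a variable added one at a time, the F value in the parameter table equals its Type II sum of squares divided by the error mean square of the model at that step. Plugging in the values printed above for $width$ (step 3) and $length_1$ (step 2):

\begin{equation}
\begin{aligned}
F_{width} &= \frac{16607}{1551.00605} \approx 10.71 \\
F_{length_1} &= \frac{36785}{1835.08909} \approx 20.05
\end{aligned}
\end{equation}

Both agree with the F values reported in the summary table.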

Notice that the $width$ variable adds very little to the model $R^2$. Let's see if that feature is eliminated using backward selection:

In [29]:
proc reg data = perch plots = none;
    model weight = length1 length2 length3 height width / selection = backward;
run;

Number of Observations Read,56
Number of Observations Used,56

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,5,6565475,1313095.0,814.38,<.0001
Error,50,80619,1612.38786,,
Corrected Total,55,6646094,,,

Variable,Parameter Estimate,Standard Error,Type II SS,F Value,Pr > F
Intercept,79.52364,51.48413,3846.93787,2.39,0.1287
Length1,-13.44318,26.6831,409.26119,0.25,0.6166
Length2,-2.4321,40.8942,5.70306,0.0,0.9528
Length3,-0.92325,27.80247,1.77805,0.0,0.9736
height,194.88889,32.38529,58391.0,36.21,<.0001
width,152.66674,47.5998,16586.0,10.29,0.0023

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,4,6565473,1641368.0,1038.31,<.0001
Error,51,80621,1580.80728,,
Corrected Total,55,6646094,,,

Variable,Parameter Estimate,Standard Error,Type II SS,F Value,Pr > F
Intercept,79.21905,50.16191,3942.66686,2.49,0.1205
Length1,-13.3217,26.17104,409.59654,0.26,0.6129
Length2,-3.50172,24.94816,31.14322,0.02,0.8889
height,194.83358,32.02413,58513.0,37.01,<.0001
width,152.57373,47.04968,16624.0,10.52,0.0021

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,3,6565442,2188481.0,1411.01,<.0001
Error,52,80652,1551.00605,,
Corrected Total,55,6646094,,,

Variable,Parameter Estimate,Standard Error,Type II SS,F Value,Pr > F
Intercept,75.91282,43.86782,4644.63546,2.99,0.0895
Length1,-16.96483,3.31846,40536.0,26.14,<.0001
height,194.65197,31.69493,58499.0,37.72,<.0001
width,152.48592,46.59996,16607.0,10.71,0.0019

Summary of Backward Elimination
Step,Variable Removed,Number Vars In,Partial R-Square,Model R-Square,C(p),F Value,Pr > F
1,Length3,4,0.0,0.9879,4.0011,0.0,0.9736
2,Length2,3,0.0,0.9879,2.0204,0.02,0.8889


The backward selection algorithm eliminates $length_3$ and then $length_2$, leaving us with the same three variables as before. At each step, the least statistically significant variable in the presence of the others is removed. This elimination is repeated until every remaining variable is significant.

In the first two iterations $length_1$ is found to be insignificant (p-values of 0.6166 and 0.6129), but it is still more significant than either $length_2$ or $length_3$. Once those two are eliminated, $length_1$ becomes significant in the presence of $height$ and $width$ (p-value below 0.0001) and is allowed to stay. This makes intuitive sense because $length_1$, $length_2$, and $length_3$ carry very similar information, making only one of them necessary.
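
One way to check the claim that the three lengths carry similar information is to look at their pairwise correlations. A minimal sketch using `PROC CORR` on the `perch` dataset follows. This step is an addition for illustration and was not run as part of the original analysis, so no output is shown:

```sas
/* Pairwise Pearson correlations among the three length measurements.
   Values close to 1 would support treating them as near-duplicates. */
proc corr data = perch;
    var length1 length2 length3;
run;
```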

Now, let's look at stepwise selection:

In [30]:
proc reg data = perch plots = none;
    model weight = length1 length2 length3 height width / selection = stepwise;
run;

Number of Observations Read,56
Number of Observations Used,56

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,1,6512049,6512049.0,2623.38,<.0001
Error,54,134045,2482.31396,,
Corrected Total,55,6646094,,,

Variable,Parameter Estimate,Standard Error,Type II SS,F Value,Pr > F
Intercept,-143.37271,12.23262,340997,137.37,<.0001
height,202.90376,3.9615,6512049,2623.38,<.0001

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,2,6548835,3274417.0,1784.34,<.0001
Error,53,97260,1835.08909,,
Corrected Total,55,6646094,,,

Variable,Parameter Estimate,Standard Error,Type II SS,F Value,Pr > F
Intercept,64.29055,47.55977,3353.29701,1.83,0.1822
Length1,-16.11087,3.59841,36785.0,20.05,<.0001
height,282.79776,18.16673,444688.0,242.32,<.0001

Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,3,6565442,2188481.0,1411.01,<.0001
Error,52,80652,1551.00605,,
Corrected Total,55,6646094,,,

Variable,Parameter Estimate,Standard Error,Type II SS,F Value,Pr > F
Intercept,75.91282,43.86782,4644.63546,2.99,0.0895
Length1,-16.96483,3.31846,40536.0,26.14,<.0001
height,194.65197,31.69493,58499.0,37.72,<.0001
width,152.48592,46.59996,16607.0,10.71,0.0019

Summary of Stepwise Selection
Step,Variable Entered,Variable Removed,Number Vars In,Partial R-Square,Model R-Square,C(p),F Value,Pr > F
1,height,,1,0.9798,0.9798,31.1344,2623.38,<.0001
2,Length1,,2,0.0055,0.9854,10.3203,20.05,<.0001
3,width,,3,0.0025,0.9879,2.0204,10.71,0.0019


Once again, this produces the same model as the forward and backward selection runs. Since stepwise selection starts the same way as its forward counterpart, we should expect similar results. The difference is that, after each addition, stepwise selection removes any variable that has become insignificant in the presence of the newly added one. Here the "Variable Removed" column is empty, so no variable was dropped.
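
Stepwise selection uses two thresholds: one for a variable to enter and one for it to stay. As a sketch, both can be written out with the `slentry = ` and `slstay = ` options of the `model` statement. The value 0.15 used below is the documented `PROC REG` default for both under stepwise selection, so this call should behave like the one above:

```sas
/* Stepwise selection with the entry and stay thresholds made explicit.
   0.15 is the PROC REG default for both options under selection = stepwise. */
proc reg data = perch plots = none;
    model weight = length1 length2 length3 height width
          / selection = stepwise slentry = 0.15 slstay = 0.15;
run;
```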

To prevent model overfitting, let's look at the $R^2_{adj}$ selection:

In [31]:
proc reg data = perch plots = none;
    model weight = length1 length2 length3 height width / selection = adjrsq;
run;

Number of Observations Read,56
Number of Observations Used,56

Number in Model,Adjusted R-Square,R-Square,Variables in Model
3,0.9872,0.9879,Length1 height width
3,0.9871,0.9878,Length2 height width
3,0.987,0.9877,Length3 height width
4,0.9869,0.9879,Length1 Length2 height width
4,0.9869,0.9879,Length1 Length3 height width
4,0.9869,0.9878,Length2 Length3 height width
5,0.9867,0.9879,Length1 Length2 Length3 height width
2,0.9848,0.9854,Length1 height
2,0.9847,0.9853,Length2 height
2,0.9846,0.9852,Length3 height


Here, $length_1$, $height$, and $width$ give the highest $R^2_{adj}$ value (0.9872). Swapping $length_1$ for $length_2$ or $length_3$ gives nearly identical $R^2_{adj}$ values (0.9871 and 0.9870), which isn't surprising given that the three lengths convey almost the same information.
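
For reference, the adjusted value can be reproduced from the ANOVA tables shown earlier. This is a sketch that assumes the usual definition for a model with an intercept, where $n = 56$ is the number of observations, $p$ is the number of feature variables, and $SST = 6646094$ is the corrected total sum of squares (the symbols $n$, $p$, and $SST$ are introduced here and do not appear in the SAS output):

\begin{equation}
R^2_{adj} = 1 - \frac{SSE}{SST} \cdot \frac{n - 1}{n - p - 1}
\end{equation}

For the three-variable model and for the full five-variable model this gives:

\begin{equation}
\begin{aligned}
p = 3: \quad & 1 - \frac{80652}{6646094} \cdot \frac{55}{52} \approx 0.9872 \\
p = 5: \quad & 1 - \frac{80619}{6646094} \cdot \frac{55}{50} \approx 0.9867
\end{aligned}
\end{equation}

The two extra length variables barely change $SSE$, so the larger penalty factor pulls the adjusted value down.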

Finally, let's use Mallows' $C_p$ selection:

In [33]:
proc reg data = perch plots = none;
    model weight = length1 length2 length3 height width / selection = cp;
run;

Number of Observations Read,56
Number of Observations Used,56

Number in Model,C(p),R-Square,Variables in Model
3,2.0204,0.9879,Length1 height width
3,2.2551,0.9878,Length2 height width
3,2.6109,0.9877,Length3 height width
4,4.0011,0.9879,Length1 Length2 height width
4,4.0035,0.9879,Length1 Length3 height width
4,4.2538,0.9878,Length2 Length3 height width
5,6.0,0.9879,Length1 Length2 Length3 height width
2,10.3203,0.9854,Length1 height
2,10.5753,0.9853,Length2 height
2,11.0975,0.9852,Length3 height


Mallows' $C_p$ statistic guards against overfitting by penalizing the addition of new feature variables. It is defined as:

\begin{equation}
C_p = \frac{SSE}{s^2} - N + 2P
\end{equation}

where $SSE$ is the error sum of squares of a particular subset model, $P$ is the number of parameters in that model (including the intercept), $s^2$ is the error mean square of the full model, and $N$ is the number of observations. Each added feature variable raises the $2P$ penalty, so $C_p$ only drops if the variable reduces $SSE$ enough to compensate. Therefore, the best model would be one that minimizes $C_p$. In this case, the following variables generate a model with the lowest $C_p$:
* $length_1$
* $height$
* $width$

which are the same three that are chosen throughout this example.
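
As a worked check of the formula, take $P$ as the number of parameters including the intercept, $N = 56$, and $s^2 = 1612.38786$ (the error mean square of the full five-variable model, from the first backward elimination table). The $C_p$ values in the table can then be reproduced up to rounding of the printed sums of squares:

\begin{equation}
\begin{aligned}
\text{height only } (P = 2): \quad & \frac{134045}{1612.38786} - 56 + 2(2) \approx 31.13 \\
\text{length}_1,\ \text{height},\ \text{width } (P = 4): \quad & \frac{80652}{1612.38786} - 56 + 2(4) \approx 2.02 \\
\text{all five variables } (P = 6): \quad & \frac{80619}{1612.38786} - 56 + 2(6) \approx 6.00
\end{aligned}
\end{equation}

For the full model the first term equals the error degrees of freedom, $N - P$, so $C_p$ is exactly $P$.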

The model, as chosen by the selection methods, can be written as:
\begin{equation}
\text{weight} = 75.91 - 16.96 length_1 + 194.65 height + 152.49 width
\end{equation}
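
As an illustrative check of the fitted equation, consider the first perch in the printed table (observation 73), with $length_1 = 7.5$ and $length_3 = 8.8$. After the rescaling applied in the data step, its $height$ is $8.8 \times 2.112 / 100 \approx 0.1859$ and its $width$ is $8.8 \times 1.408 / 100 \approx 0.1239$. Plugging these into the rounded coefficients gives:

\begin{equation}
\text{weight} \approx 75.91 - 16.96(7.5) + 194.65(0.1859) + 152.49(0.1239) \approx 3.8
\end{equation}

compared with an observed weight of 5.9.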

It is interesting to see that $length_1$ has a negative coefficient: with $height$ and $width$ held fixed, a larger $length_1$ lowers the predicted weight.

As a side note, I believe that $length_1$ is chosen over the other two lengths because the tail does not add much to the weight of a fish. The length of the body of the fish explains the weight more than the length of the body with the tail. Therefore, $length_1$ gives the minimal amount of information required to explain the weight of the fish.

## Summary

* model selection methods help sort through many feature variables to isolate the ones that significantly benefit a model