# Module Four Discussion: F-Test for Comparing Nested Models

**Important: You will not be doing a problem set in this module so as to allow you more time to work on Project One. Instead, you will have more in-depth discussion questions based on the scripts for the nested models F-test.**

In this notebook, you have been given a set of steps that will show you how to compare different regression models for a data set using the F-test for nested models. Run the steps in order, because some steps depend on the outputs of earlier steps. Once you have run all the steps, be sure to complete the Module Four discussion.

Reminder: If you have not already reviewed the questions for the Module Four discussion, be sure to do so now. That will give you an idea of the questions you will need to answer with the outputs of this script.

## Step 1: Loading the Data Set
You are an analyst working for a car maker. You have access to a set of data that can be used to study the fuel economy of a car. Car makers are interested in studying factors that are associated with better fuel economy. This data set includes several important variables that are associated with fuel economy. You will use this data set to create models to predict fuel economy.

This block of R code will load the data set from the **mtcars.csv** file. You will then create a subset of the data with only the three variables that are needed in the next step. Here are the variables that will be retained:

| <div style="text-align: left"> Variable </div>  |   <div style="text-align: left"> What does it represent? </div> |
| -- | --  |
| <div style="text-align: left"> mpg </div> | <div style="text-align: left"> Miles/(US) gallon </div> |
| <div style="text-align: left"> drat </div> | <div style="text-align: left"> Rear axle ratio </div> |
| <div style="text-align: left"> wt </div> | <div style="text-align: left"> Weight (1,000 lbs) </div> |

Reference: R data sets. (1974). <i>Motor trend car road tests</i> [Data file]. Retrieved from https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars

Click the code section below and hit the **Run** button above.

In [1]:
# Loading mtcars data set from a mtcars.csv file
mtcars <- read.csv(file='mtcars.csv', header=TRUE, sep=",")

vars <- c('mpg','drat','wt')
mtcars_subset <- mtcars[vars]

# Print the first six rows
print("head")
head(mtcars_subset, 6)

[1] "head"


mpg,drat,wt
<dbl>,<dbl>,<dbl>
21.0,3.9,2.62
21.0,3.9,2.875
22.8,3.85,2.32
21.4,3.08,3.215
18.7,3.15,3.44
18.1,2.76,3.46
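
If **mtcars.csv** is not available, the same three columns can be taken from the `mtcars` data frame in R's built-in `datasets` package, which is the source cited in the reference above. This is a minimal sketch that assumes the CSV file matches the built-in data; the name `mtcars_builtin_subset` is introduced here only for illustration.

```r
# Assumption: mtcars.csv mirrors the built-in datasets::mtcars data frame.
vars <- c("mpg", "drat", "wt")
mtcars_builtin_subset <- datasets::mtcars[vars]

# Basic checks on the subset: dimensions, column types, and missing values
dim(mtcars_builtin_subset)
str(mtcars_builtin_subset)
colSums(is.na(mtcars_builtin_subset))
```

Checking the dimensions and missing values before fitting helps confirm that both models in the next step will be fitted to the same rows.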


## Step 2: Compare Interaction Model with No Interaction Model
In Step 3 of the Module Two Jupyter Notebook, you created an interaction model for fuel economy using two quantitative variables: weight of the car and rear axle ratio. This model included the interaction term between weight and rear axle ratio. You will now compare this model with a model that does not contain the interaction term. These two models are nested: two models are nested if one contains every term of the other plus at least one additional term. The model with the additional term or terms is called the complete or full model. The other model is called the reduced or restricted model.
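
One quick way to confirm that two model formulas are nested is to compare their term labels. The sketch below works on the formulas alone, so no data is needed; the helper name `is_nested` is hypothetical, and because it only compares term labels it is a sanity check rather than a formal test.

```r
# Hypothetical helper: TRUE if every term of `reduced` also appears in
# `complete` and `complete` has at least one extra term.
is_nested <- function(reduced, complete) {
  reduced_terms  <- attr(terms(reduced), "term.labels")
  complete_terms <- attr(terms(complete), "term.labels")
  all(reduced_terms %in% complete_terms) &&
    length(complete_terms) > length(reduced_terms)
}

complete_formula <- mpg ~ wt + drat + wt:drat
reduced_formula  <- mpg ~ wt + drat

is_nested(reduced_formula, complete_formula)

# Terms that appear only in the complete model
setdiff(attr(terms(complete_formula), "term.labels"),
        attr(terms(reduced_formula), "term.labels"))
```

The terms that appear only in the complete model are exactly the terms whose coefficients the F-test sets to zero under the null hypothesis.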

The comparison between the complete and reduced model will be done using the F-test for nested models. For this particular model comparison, the F-test will evaluate whether the interaction term contributes to predicting fuel economy. If the test is significant, then the complete (interaction) model should be used. If the test is not significant, then the reduced model is sufficient.
<br><br>

The general forms of the complete and reduced models are:
<br>

\begin{equation*}
\large E(y) = {\beta}_0\ +\ {\beta}_1\ {x}_1\ +\ {\beta}_2\ {x}_2\ +\ {\beta}_3\ {x}_1\ {x}_2\ \ \ \ \ \ \text{complete model} 
\end{equation*}

\begin{equation*}
\large E(y) = {\beta}_0\ +\ {\beta}_1\ {x}_1\ +\ {\beta}_2\ {x}_2\ \ \ \ \ \ \ \text{reduced model} 
\end{equation*}

<br><br>

The prediction equations for the complete and reduced models are:
<br>

\begin{equation*}
\large \hat{y} = \hat{{\beta}_0}\ +\ \hat{{\beta}_1}\ {x}_1\ +\ \hat{{\beta}_2}\ {x}_2\ +\ \hat{{\beta}_3}\ {x}_1\ {x}_2\ \ \ \ \ \ \text{complete model} 
\end{equation*}

\begin{equation*}
\large \hat{y} = \hat{{\beta}_0}\ +\ \hat{{\beta}_1}\ {x}_1\ +\ \hat{{\beta}_2}\ {x}_2\ \ \ \ \ \ \ \text{reduced model} 
\end{equation*}

<br>
\begin{equation*}
\text{where } \hat{y} \text{ is the predicted fuel economy,}\ {x}_1\ \text{is weight of the car, and}\ {x}_2\ \text{is rear axle ratio} 
\end{equation*}
 
<br><br>
The null hypothesis for this test is that the coefficient for the interaction term (Beta3) is zero, meaning that the interaction term is not needed and the reduced model is sufficient. The alternative hypothesis is that the coefficient for the interaction term is non-zero, meaning that the interaction term is needed and the complete model is necessary. Note that the hypotheses are statements about the true coefficient, not about its estimate from the sample.

The **anova** function in R will run this test for you and will output the F-test statistic and the corresponding P-value. 

Click the block of code below and hit the **Run** button above.  

In [2]:
# Create the complete model
fit_complete <- lm(mpg ~ wt + drat + wt:drat, data=mtcars_subset)

# Create the reduced model
fit_reduced <- lm(mpg ~ wt + drat, data=mtcars_subset)

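# Perform the F-test. The complete model is listed first, so Df and Sum of Sq
# are reported as negative; F and Pr(>F) are the same in either order.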
anova(fit_complete, fit_reduced)

Res.Df,RSS,Df,Sum of Sq,F,Pr(>F)
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
28,225.6176,,,,
29,269.2413,-1.0,-43.62369,5.413865,0.02743572
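
The F statistic in this table can be reproduced by hand from the RSS and Res.Df columns. As a worked check using the standard nested-model F formula (the subscripts R and C, for the reduced and complete models, are notation introduced here):

\begin{equation*}
F = \frac{(RSS_R - RSS_C)/(df_R - df_C)}{RSS_C/df_C} = \frac{(269.2413 - 225.6176)/(29 - 28)}{225.6176/28} = \frac{43.6237}{8.0578} \approx 5.414
\end{equation*}

The numerator is the drop in the residual sum of squares per added term, and the denominator is the residual mean square of the complete model.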


#### Hypothesis Test
Null Hypothesis: Beta3 = 0. (The coefficient for the interaction term is 0. The reduced model is sufficient.)
<br>

Alternative Hypothesis: Beta3 is non-zero. (The coefficient for the interaction term is non-zero. The complete model should be used.)
<br>

Let us use a level of significance of 5%. If the P-value is less than 0.05, we will reject the null hypothesis. Otherwise, we will not reject the null hypothesis.
<br>

From the output above, the F-value = 5.4139 and the P-value = 0.0274.
<br>

Since the P-value 0.0274 is less than the level of significance 0.05, we will reject the null hypothesis and conclude that the interaction term should be used in predicting the fuel economy. Therefore, the complete model should be used.
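
Because only one term is being tested here, the nested-model F-test should agree with the t-test for the interaction coefficient in the complete model: the F statistic equals the squared t statistic, and the two P-values coincide. The sketch below checks this, assuming that **mtcars.csv** matches R's built-in `mtcars` data frame; the models are passed to `anova` with the reduced model first, which only changes the sign of the Df and Sum of Sq columns.

```r
# Assumption: mtcars.csv matches the built-in datasets::mtcars data frame.
fit_complete <- lm(mpg ~ wt + drat + wt:drat, data = datasets::mtcars)
fit_reduced  <- lm(mpg ~ wt + drat, data = datasets::mtcars)

# F statistic and P-value as reported by anova() (second row of the table)
f_table <- anova(fit_reduced, fit_complete)
f_anova <- f_table$F[2]
p_anova <- f_table$`Pr(>F)`[2]

# t statistic and P-value for the interaction coefficient
coef_table    <- summary(fit_complete)$coefficients
t_interaction <- coef_table["wt:drat", "t value"]
p_interaction <- coef_table["wt:drat", "Pr(>|t|)"]

# Side-by-side comparison
c(F = f_anova, t_squared = t_interaction^2)
c(p_F = p_anova, p_t = p_interaction)

# Decision at the 5% level of significance
alpha <- 0.05
p_anova < alpha
```

This equivalence holds only when a single coefficient is tested. When several terms are dropped at once, the F-test has no single matching t-test.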

## Step 3: Loading the Economic Data Set

You are an analyst working for the government and you have access to a set of historical data that can be used to study wage growth of the labor force. Governments are interested in studying wage growth patterns based on their economic agenda. This data set has other economic factors that are related to wage growth. You will use this data set to create models for wage growth.

This block of R code will load the **economic** data set from a CSV file. You will then create a subset of the data with only the three variables that are needed in the next step. Here are the variables that should be retained: 

| <div style="text-align: left"> Variable </div>  |   <div style="text-align: left"> What does it represent? </div> |
| -- | --  |
| <div style="text-align: left"> wage_growth </div> | <div style="text-align: left"> Wage growth rate </div> |
| <div style="text-align: left"> unemployment </div> | <div style="text-align: left"> Unemployment rate</div> |
| <div style="text-align: left"> gdp </div> | <div style="text-align: left"> GDP growth rate </div> |

Note: This is a simulated data set based on a real-world problem.

Click the code section below and hit the **Run** button above.

In [3]:
# Load the economic data set and subset to only include wage_growth, unemployment, and gdp variables
economic <- read.csv(file='economic.csv', header=TRUE, sep=",")
vars <- c('wage_growth','unemployment','gdp')
economic_subset <- economic[vars]

# Print the first six rows
print("head")
head(economic_subset, 6)

[1] "head"


wage_growth,unemployment,gdp
<dbl>,<dbl>,<dbl>
7.3,3.56,6.27
9.05,2.42,9.44
10.08,1.23,18.29
10.98,1.18,19.96
8.54,2.54,8.43
9.75,2.22,17.85


## Step 4: Compare Complete Second Order Model with Interaction Model
In step 6 of the Module Three Jupyter Notebook, you created a complete second order model for wage growth using two quantitative variables: unemployment rate and GDP growth. This model included the interaction term and squared terms for unemployment rate and GDP growth. You will now compare this model (full or complete model) with a model that does not contain the squared terms (restricted or reduced model). This will be done using a statistical test to evaluate whether the squared terms contribute information for predicting wage growth. In other words, you are testing whether the squared terms should be included in predicting wage growth (in which case a complete model is necessary) or not (in which case the reduced model is sufficient). 
<br><br>

The general forms of the complete and reduced models are:
<br>

\begin{equation*}
\large E(y) = {\beta}_0\ +\ {\beta}_1\ {x}_1\ +\ {\beta}_2\ {x}_2\ +\ {\beta}_3\ {x}_1\ {x}_2\ +\ {\beta}_4\ {x}_1^2\ +\ {\beta}_5\ {x}_2^2 \ \ \ \ \text{complete model} 
\end{equation*}

\begin{equation*}
\large E(y) = {\beta}_0\ +\ {\beta}_1\ {x}_1\ +\ {\beta}_2\ {x}_2\ +\ {\beta}_3\ {x}_1\ {x}_2\ \ \ \ \ \ \ \text{reduced model} 
\end{equation*}

<br><br>


The prediction equations for the complete and reduced models are:
<br>

\begin{equation*}
\large \hat{y} = \hat{{\beta}_0}\ +\ \hat{{\beta}_1}\ {x}_1\ +\ \hat{{\beta}_2}\ {x}_2\ +\ \hat{{\beta}_3}\ {x}_1\ {x}_2\ +\ \hat{{\beta}_4}\ {x}_1^2\ +\ \hat{{\beta}_5}\ {x}_2^2 \ \ \ \ \text{complete model} 
\end{equation*}

\begin{equation*}
\large \hat{y} = \hat{{\beta}_0}\ +\ \hat{{\beta}_1}\ {x}_1\ +\ \hat{{\beta}_2}\ {x}_2\ +\ \hat{{\beta}_3}\ {x}_1\ {x}_2\ \ \ \ \ \ \ \text{reduced model} 
\end{equation*}



<br>
\begin{equation*}
\text{where } \hat{y} \text{ is the predicted wage growth,}\ {x}_1\ \text{is unemployment rate, and}\ {x}_2\ \text{is GDP growth rate} 
\end{equation*}
 
<br>
The F-test for nested models will do this comparison. The null hypothesis for this test is that the coefficients for both squared terms (Beta4 and Beta5) are zero, meaning that the squared terms are not needed and the reduced model is sufficient. The alternative hypothesis is that at least one of these coefficients is non-zero, meaning that at least one squared term is needed and the complete model is necessary.
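
In symbols, using the coefficients of the complete model above, the hypotheses can be written as:

\begin{equation*}
H_0:\ {\beta}_4 = {\beta}_5 = 0 \qquad \text{versus} \qquad H_a:\ \text{at least one of}\ {\beta}_4,\ {\beta}_5\ \text{is non-zero}
\end{equation*}

Because two coefficients are tested together, the numerator degrees of freedom of the F-test equal 2, the number of terms dropped from the complete model.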

The **anova** function in R will run this test for you and will output the F-test statistic and the corresponding P-value. 

Click the block of code below and hit the **Run** button above.

In [4]:
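# Create the complete model. I() makes ^2 mean arithmetic squaring; without it,
# ^ is a formula operator and the squared terms would not be fitted.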
fit_complete <- lm(wage_growth ~ unemployment + gdp + unemployment:gdp + I(unemployment^2) + I(gdp^2), data=economic_subset)

# Create the reduced model
fit_reduced <- lm(wage_growth ~ unemployment + gdp + unemployment:gdp, data=economic_subset)

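# Perform the F-test. The complete model is listed first, so Df and Sum of Sq
# are reported as negative; F and Pr(>F) are the same in either order.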
anova(fit_complete, fit_reduced)

Res.Df,RSS,Df,Sum of Sq,F,Pr(>F)
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
93,31.36076,,,,
95,42.12107,-2.0,-10.76031,15.95479,1.103452e-06
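
The F statistic and P-value in the table above can be recomputed from the RSS and Res.Df columns alone. The sketch below does this with base R's `pf` function; the function name `nested_f_test` and its argument names are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: nested-model F-test from residual sums of squares
# (rss_*) and residual degrees of freedom (df_*).
nested_f_test <- function(rss_reduced, df_reduced, rss_complete, df_complete) {
  df_num <- df_reduced - df_complete          # number of terms being tested
  f_stat <- ((rss_reduced - rss_complete) / df_num) /
    (rss_complete / df_complete)
  # Upper-tail probability of the F distribution
  p_value <- pf(f_stat, df_num, df_complete, lower.tail = FALSE)
  list(f = f_stat, df_num = df_num, df_den = df_complete, p_value = p_value)
}

# Values copied from the anova output above
result <- nested_f_test(rss_reduced = 42.12107, df_reduced = 95,
                        rss_complete = 31.36076, df_complete = 93)
result$f
result$p_value
```

The results should agree with the F and Pr(>F) columns up to rounding of the printed RSS values.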


## End of Module Four Jupyter Notebook
Attach the HTML output as a part of your Module Four discussion. The HTML output can be downloaded by clicking **File**, then **Download as**, then **HTML**. Be sure to answer all of the questions in the discussion prompt.