# Selection of Regression Model

Model specification is the process of determining which independent variables to include in, and which to exclude from, a regression equation. How do you choose the best regression model?

The need for model selection often begins when a researcher wants to mathematically define the relationship between independent variables and the dependent variable. Typically, investigators measure many variables but include only some in the model. 

Analysts try to exclude independent variables that are not related and include only those that have an actual relationship with the dependent variable. During the specification process, the analysts typically try different combinations of variables and various forms of the model. 

For example, they can try different terms that explain interactions between variables and curvature in the data.
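As a small illustration of what such terms look like, the sketch below builds an interaction term and a squared (curvature) term from two independent variables. The function name `expand_terms` and the variable names `x1` and `x2` are hypothetical, chosen only for this example.

```python
def expand_terms(x1, x2):
    """Candidate terms for one observation.

    Main effects, one interaction term (x1 * x2), and one
    curvature term (x1 squared).
    """
    return {
        "x1": x1,
        "x2": x2,
        "x1*x2": x1 * x2,   # interaction between the two variables
        "x1^2": x1 ** 2,    # curvature in x1
    }


row = expand_terms(2.0, 3.0)
print(row)
```

Each key is a candidate term; the specification process then decides which of them stay in the equation.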

The analysts need to reach a Goldilocks balance by including the correct number of independent variables in the regression equation.

- Too few: Underspecified models tend to be biased.
- Too many: Overspecified models tend to be less precise.
- Just right: Models with the correct terms are not biased and are the most precise.

To avoid biased results, your regression equation should contain any independent variables that you are specifically testing as part of the study plus other variables that affect the dependent variable.

### Statistical Methods for Model Specification

You can use statistical assessments during the model specification process. Various metrics and algorithms can help you determine which independent variables to include in your regression equation. I review some standard approaches to model selection.

#### Adjusted R-squared and Predicted R-squared:

Typically, you want to select models that have larger adjusted and predicted R-squared values. These statistics can help you avoid the fundamental problem with regular R-squared: it never decreases when you add an independent variable, even one that is unrelated to the dependent variable. This property tempts you into specifying a model that is too complex, which can produce misleading results.

![R Squared ](rsqrbase.png)
![R Squared ](rsqr.png)

Adjusted R-squared increases only when a new variable improves the model by more than chance. Low-quality variables can cause it to decrease.
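To make the penalty concrete, here is a minimal sketch that computes adjusted R-squared from regular R-squared using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of independent variables. The function name and the sample numbers are illustrative assumptions, not results from this document's data.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjusted R-squared; n_predictors excludes the intercept."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# Illustrative numbers: a fifth predictor raises R-squared only slightly
# (0.800 -> 0.801), so the adjusted value goes down.
before = adjusted_r_squared(0.800, n_obs=30, n_predictors=4)
after = adjusted_r_squared(0.801, n_obs=30, n_predictors=5)
print(before, after)
```

In this made-up case the regular R-squared goes up while the adjusted value goes down, which is the signal that the extra variable is not pulling its weight.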

![Adjusted R Squared ](adjrsqr.png)


Predicted R-squared is a cross-validation method that can also decrease. Cross-validation partitions your data to determine whether the model is generalizable outside of your dataset.
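One common way to compute predicted R-squared is leave-one-out cross-validation: remove each observation in turn, refit the model, predict the removed point, and sum the squared prediction errors (the PRESS statistic). The sketch below does this for a single independent variable to keep it short; the helper names and the small dataset are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def r_squared(xs, ys):
    """Regular R-squared: 1 - SSE / SST on the full dataset."""
    b0, b1 = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst


def predicted_r_squared(xs, ys):
    """1 - PRESS / SST, where PRESS uses leave-one-out predictions."""
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit without observation i, then predict it.
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return 1.0 - press / sst


# Made-up data that is close to a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
r2 = r_squared(xs, ys)
pred_r2 = predicted_r_squared(xs, ys)
print(r2, pred_r2)
```

Predicted R-squared is never higher than regular R-squared, because each point is predicted by a model that did not see it. A large gap between the two suggests the model is fitting noise.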

In short, the closer the R-squared value (or R² score) is to 1, the better the model fits the data.

#### P-values for the independent variables:

In regression, p-values less than the significance level indicate that the term is statistically significant. “Reducing the model” is the process of including all candidate variables in the model, and then repeatedly removing the single term with the highest non-significant p-value until your model contains only significant terms.
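The reduction loop can be sketched as follows. This is only an outline of the procedure described above: `fit_pvalues` is a hypothetical callable that you would supply yourself (for example, a wrapper around your statistics package) that refits the model on the given terms and returns a p-value for each. The fixed p-values in the demo are made up; in a real analysis they change after every refit.

```python
def reduce_model(terms, fit_pvalues, alpha=0.05):
    """Drop the least significant term until all remaining terms have p < alpha.

    fit_pvalues is a hypothetical user-supplied callable: it refits the
    model on the given terms and returns {term: p_value}.
    """
    terms = list(terms)
    while terms:
        pvalues = fit_pvalues(terms)
        worst = max(terms, key=lambda t: pvalues[t])
        if pvalues[worst] < alpha:
            break  # every remaining term is significant
        terms.remove(worst)  # remove a single term, then refit
    return terms


# Stand-in for a real fit: fixed, made-up p-values for demonstration only.
fake_p = {"x1": 0.001, "x2": 0.40, "x3": 0.03, "x4": 0.12}
kept = reduce_model(fake_p, lambda terms: {t: fake_p[t] for t in terms})
print(kept)
```

Removing one term per iteration matters because dropping a variable can change the p-values of the terms that remain.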

#### Stepwise regression and Best subsets regression: 

These two automated model selection procedures are algorithms that pick the variables to include in your regression equation. These automated methods can be helpful when you have many independent variables, and you need some help in the investigative stages of the variable selection process. These procedures can provide the Mallows’ Cp statistic, which helps you balance the tradeoff between precision and bias.
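For reference, Mallows' Cp for a candidate subset model is commonly computed as SSE_subset / MSE_full - n + 2p, where p counts the subset model's coefficients including the intercept. The sketch below applies that formula to made-up sums of squares; the function name and all numbers are illustrative assumptions.

```python
def mallows_cp(sse_subset, mse_full, n_obs, n_params):
    """Mallows' Cp; n_params counts coefficients including the intercept."""
    return sse_subset / mse_full - n_obs + 2 * n_params


# Illustrative values: 50 observations, full model with 6 coefficients
# and MSE 2.0 (so its SSE is 2.0 * (50 - 6) = 88.0).
cp_full = mallows_cp(88.0, 2.0, n_obs=50, n_params=6)
cp_mid = mallows_cp(95.0, 2.0, n_obs=50, n_params=4)
cp_small = mallows_cp(150.0, 2.0, n_obs=50, n_params=2)
print(cp_full, cp_mid, cp_small)
```

A Cp value close to the number of coefficients suggests a model with little bias, while a value far above it (as for the smallest model here) suggests that important terms are missing.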


### Pros and Cons of Regression Models

![Pros and Cons of Regression Models](Regression%20Model%20Pros%20and%20Cons.png)





    Note: In our evaluation of the regression models, we are going to use the R-squared method.
