# Qualitative Predictors (Dummy)

## Predictors with Only Two Levels
If a qualitative predictor (also known as a **factor**) only has two **levels**, then we can create a **dummy variable** that takes on two possible numerical values.

<img src="images/10.png" width="500">
and use this variable as a predictor in the regression equation. This results in the model

<img src="images/11.png" width="500">

Now β0 can be interpreted as the average credit card balance among males, β0 + β1 as the average credit card balance among females, and β1 is the average difference in credit card balance between females and males.

Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable

<img src="images/12.png" width="500">
and use this variable in the regression equation. This results in the model

<img src="images/13.png" width="500">

Now β0 can be interpreted as the overall average credit card balance (ignoring the gender effect), and β1 is the amount that females are above the average and males are below the average.

**The final predictions for the credit balances of males and females will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.**
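
To make the point above concrete, here is a minimal sketch that fits the same least squares model under both coding schemes. The data are synthetic and the names `is_female`, `balance`, and `fit_and_predict` are made up for illustration; they are not taken from the Credit data set.

```python
import numpy as np

# Synthetic data: group labels and balances are invented for illustration.
rng = np.random.default_rng(0)
is_female = rng.integers(0, 2, size=200)            # 1 = female, 0 = male
balance = 510.0 + 20.0 * is_female + rng.normal(0.0, 50.0, size=200)

def fit_and_predict(dummy, y):
    """Least squares fit of y on an intercept plus one dummy variable.

    Returns the coefficient vector (beta0, beta1) and the fitted values.
    """
    X = np.column_stack([np.ones(len(dummy)), dummy])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X @ beta

beta_01, pred_01 = fit_and_predict(is_female.astype(float), balance)   # 0/1 coding
beta_pm, pred_pm = fit_and_predict(2.0 * is_female - 1.0, balance)     # +1/-1 coding

print("0/1 coding   (beta0, beta1):", beta_01)
print("+1/-1 coding (beta0, beta1):", beta_pm)
print("identical predictions:", np.allclose(pred_01, pred_pm))
```

Under the +1/-1 coding the fitted intercept is the midpoint of the two group means and the slope is half the difference between them, so the coefficients change while the fitted values do not.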

## Qualitative Predictors with More than Two Levels

When a qualitative predictor has more than two levels, we can create additional dummy variables. For example, for the ethnicity variable we create two dummy variables. The first could be

<img src="./images/10.png" width="500">
and the second could be
<img src="./images/11.png" width="500">

Then both of these variables can be used in the regression equation, in order to obtain the model

<img src="./images/12.png" width="500">

**Baseline**

- There will always be **one fewer** dummy variable than the number of levels. The level with no dummy variable—African American in this example—is known as the baseline.

<img src="images/14.png" width="500">

The p-values associated with the coefficient estimates for the two dummy variables are very large, suggesting no statistical evidence of a real difference in credit card balance between the ethnicities.

> However, the coefficients and their p-values do depend on the choice of dummy
variable coding

Rather than rely on the individual coefficients, we can use an **F-test** to test H0 : β1 = β2 = 0; the result of this F-test does not depend on how we code the dummy variables.

If the p-value for the F-test is large, we cannot reject the null hypothesis that there is no relationship between balance and ethnicity. A small p-value, by contrast, would be evidence against the null hypothesis.
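
The sketch below illustrates, under stated assumptions, why the F-test is preferable to reading individual dummy coefficients. It uses synthetic data with three hypothetical levels coded 0, 1, 2 and computes the F-statistic from the total and residual sums of squares. The helper name `f_statistic` is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
group = rng.integers(0, 3, size=n)            # three hypothetical levels: 0, 1, 2
y = 500.0 + rng.normal(0.0, 40.0, size=n)     # response generated independently of group

def f_statistic(y, group, baseline):
    """F-statistic for H0: all dummy coefficients are zero.

    One dummy variable is created for every level except `baseline`.
    Returns (F, coefficient vector).
    """
    levels = [g for g in np.unique(group) if g != baseline]
    X = np.column_stack([np.ones(len(y))] + [(group == g).astype(float) for g in levels])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    p = X.shape[1] - 1                        # number of dummy variables
    f = ((tss - rss) / p) / (rss / (len(y) - p - 1))
    return f, beta

f0, beta0 = f_statistic(y, group, baseline=0)
f2, beta2 = f_statistic(y, group, baseline=2)

print("baseline 0: F =", round(f0, 4), "coefficients =", beta0)
print("baseline 2: F =", round(f2, 4), "coefficients =", beta2)
```

The coefficients differ between the two baselines, but the F-statistic is the same. Its p-value would come from an F distribution with p and n - p - 1 degrees of freedom, which is omitted here to keep the example free of extra dependencies.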

# Extensions of the Linear Model

Two of the most important assumptions of the linear model state that the relationship between the predictors and the response is **additive** and **linear**.
- **Additive**: the effect of changes in a predictor $X_j$ on the response $Y$ is independent of the values of the other predictors
- **Linear**: the change in the response $Y$ due to a one-unit change in $X_j$ is constant, regardless of the value of $X_j$

**Here are some common classical approaches for extending the linear model.**

## Removing the Additive Assumption by Considering Synergy Effect
Consider the standard linear regression model with two variables,
\begin{align}
Y = β_0 + β_1X_1 + β_2X_2 + \epsilon
\end{align}

One way of extending this model to allow for interaction effects is to include a third predictor, called an
**interaction term**:

\begin{align}
Y = β_0 + β_1X_1 + β_2X_2 +  β_3X_1X_2 + \epsilon 
\end{align}

**How does inclusion of this interaction term relax the additive assumption?**

The model above could be written as:
\begin{align}
Y &= β_0 + (β_1+β_3X_2)X_1 + β_2X_2 + \epsilon  \\
&= β_0 + \tilde{β}_1X_1 + β_2X_2 + \epsilon
\end{align}

Since $\tilde{β}_1$ changes with $X_2$, the effect of $X_1$ on $Y$ is no longer constant: adjusting $X_2$ will change the impact of $X_1$ on $Y$.
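
As a rough numerical illustration of the rewritten model, the sketch below fits an interaction model to synthetic data and evaluates the slope $β_1 + β_3X_2$ at a few values of $X_2$. The true coefficients and the helper `slope_of_x1` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.uniform(0.0, 10.0, size=n)
x2 = rng.uniform(0.0, 10.0, size=n)
# Hypothetical true coefficients: beta0 = 2, beta1 = 1.5, beta2 = 0.5, beta3 = 0.3
y = 2.0 + 1.5 * x1 + 0.5 * x2 + 0.3 * x1 * x2 + rng.normal(0.0, 1.0, size=n)

# The interaction term is just one more column in the design matrix.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

def slope_of_x1(x2_value):
    """Estimated change in Y per unit increase in X1 when X2 is held at x2_value."""
    return b1 + b3 * x2_value

for v in (0.0, 5.0, 10.0):
    print(f"X2 = {v:4.1f} -> slope of X1 = {slope_of_x1(v):.3f}")
```

Because the estimated $β_3$ is positive in this made-up example, the effect of $X_1$ grows as $X_2$ increases, which is exactly what the additive model cannot express.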

<img src="images/15.png" width="900">

As in the example above:
- It is sometimes the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.
- According to the **hierarchical principle**, if we include an interaction in a model, we should also include the **main effects**, even if the p-values associated with their coefficients are not significant. (If the interaction between X1 and X2 seems important, we should include both X1 and X2 in the model even if their coefficient estimates have large p-values.)

**The concept of interactions also applies to qualitative variables**

In the absence of an interaction term, the model takes the form:
<img src="images/16.png" width="800">

> In this case, the model amounts to fitting two parallel lines, one for students and one for non-students: the lines have different intercepts but the same slope. This means that the average effect on balance of a one-unit increase in income does not depend on whether or not the individual is a student. This represents a potentially serious limitation of the model, since in fact a change in income may have a very different effect on the balance of a student versus a non-student.

After adding an interaction variable, created by multiplying income with the dummy variable for student, the model becomes:
<img src="images/17.png" width="850">

Now those regression lines have different intercepts, as well as different slopes. This allows for the possibility that changes in income may affect the credit card balances of students and non-students differently.
<img src="images/18.png" width="800">
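
The following sketch shows, on synthetic data, how the interaction between a quantitative predictor and a dummy variable produces two lines with different intercepts and different slopes. The variable names mirror the example in the text, but the numbers are invented and are not estimates from the Credit data.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 400
income = rng.uniform(10.0, 150.0, size=n)
student = rng.integers(0, 2, size=n).astype(float)   # 1 = student, 0 = non-student
# Made-up coefficients, chosen only so that the two groups have different slopes.
balance = (200.0 + 6.0 * income + 380.0 * student
           - 2.0 * income * student + rng.normal(0.0, 50.0, size=n))

X = np.column_stack([np.ones(n), income, student, income * student])
b0, b1, b2, b3 = np.linalg.lstsq(X, balance, rcond=None)[0]

# (intercept, slope) of the fitted line for each group
non_student_line = (b0, b1)
student_line = (b0 + b2, b1 + b3)

print("non-student: intercept = %.1f, slope = %.3f" % non_student_line)
print("student    : intercept = %.1f, slope = %.3f" % student_line)
```

With a single dummy variable and its interaction, the two fitted lines coincide with what separate least squares fits on each group would give; the combined model mainly makes it convenient to test whether the slopes differ.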

## Non-linear Relationships

One simple way of extending the linear model to accommodate non-linear relationships is **polynomial regression**, in which we include polynomial functions of the predictors in the regression model.

<img src="images/19.png" width="500">

We can see the relationship between mpg and horsepower is in fact non-linear: the data suggest a curved relationship. A simple approach for incorporating non-linear associations in a linear model is to include transformed versions of the predictors in the model. For example, the points in Figure 3.8 seem to have a quadratic shape, suggesting that the model below may provide a better fit:

<img src="images/20.png" width="500">
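
A quadratic model is still linear in its coefficients, so it can be fitted with ordinary least squares by adding a squared column to the design matrix. The sketch below demonstrates this on synthetic data; the curved relationship and the helper `rss_for` are assumptions of this example, not the Auto data from the figure.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(50.0, 200.0, size=n)          # stand-in for a predictor such as horsepower
# Invented curved relationship plus noise.
y = 57.0 - 0.47 * x + 0.0012 * x**2 + rng.normal(0.0, 4.0, size=n)

def rss_for(X, y):
    """Fit by least squares and return (coefficients, residual sum of squares)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, float(np.sum((y - X @ beta) ** 2))

X_lin = np.column_stack([np.ones(n), x])           # Y = b0 + b1*X
X_quad = np.column_stack([np.ones(n), x, x**2])    # Y = b0 + b1*X + b2*X^2

beta_lin, rss_lin = rss_for(X_lin, y)
beta_quad, rss_quad = rss_for(X_quad, y)

print("linear    RSS:", round(rss_lin, 1))
print("quadratic RSS:", round(rss_quad, 1))
```

The quadratic fit can never have a larger training RSS than the linear fit because the models are nested, so a lower RSS alone does not prove the extra term is needed; the p-value of the squared term or a residual plot is a better guide.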

# Potential Problems of Linear Regression

## Non-linearity of the Data
**Assumption**: The linear regression model assumes that there is a straight-line relationship
between the predictors and the response.

**How to solve:**

**Residual plots** are a useful graphical tool for identifying non-linearity
- Given a simple linear regression model, we can plot the residuals, $e_i = y_i-\hat{y_i}$, versus the predictor $x_i$. In the case of a multiple regression model, since there are multiple predictors, we instead plot the residuals versus the predicted values $\hat{y_i}$.

<img src="images/21.png" width="600">

- Ideally, the residual plot will show no discernible pattern.
- If the residual plot indicates non-linear associations in the data, then a simple approach is to use **non-linear transformations** of the predictors, such as $\log{X},\sqrt{X}, X^2$, in the regression model. 
> In the left panel of Figure 3.9, the residuals exhibit a clear U-shape, which provides a strong indication of non-linearity in the data.
> In contrast, the right-hand panel displays the residual plot of the model containing a quadratic term; there appears to be little pattern in the residuals, suggesting that the quadratic term improves the fit to the data.

## Correlation of Error Terms 

**Assumption**: The error terms, $\epsilon_1,\epsilon_2,...,\epsilon_n$ of linear regression are uncorrelated.
-  If the errors are uncorrelated, then the fact that $\epsilon_i$ is positive provides little or no information about the sign of $\epsilon_{i+1}$.
-  If the error terms are correlated, we may have an unwarranted sense of confidence in our model
 - **estimated standard errors for the regression coefficients** will underestimate the true standard error
 - **confidence and prediction intervals** will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter
 - **p-values associated with the model** will be lower than they should be
 - as a result, we may **erroneously conclude that a parameter is statistically significant**

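The reason is that the variance of a sum includes covariance terms, which the standard formulas drop by assuming uncorrelated errors. When those covariances are positive, the true variance is larger than the assumed one:

\begin{align}
\operatorname{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \sum_{j=1}^n \operatorname{Cov}(X_i, X_j) = \sum_{i=1}^n \operatorname{Var}(X_i) + 2\sum_{1\le i<j\le n}\operatorname{Cov}(X_i,X_j)
\end{align}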


**Why might correlations among the error terms occur?**
- Such correlations frequently occur in the context of **time series** data
- In many cases, observations that are obtained at adjacent time points will have **positively correlated errors**. 

**How to solve:**
- We can **plot the residuals** from our model **as a function of time** to identify this correlation. 
- Ideally, the residual plot will show no discernible pattern.
- If the error terms are positively correlated, then we might see **tracking** in the residuals—that is, adjacent residuals may have similar values.

<img src="images/22.png" width="600">

- In the top panel, there's no evidence of a time-related trend in the residuals.
- In the bottom panel, there's a clear pattern in the residuals - adjacent residuals tend to take on similar values.
- The center panel illustrates a more moderate case in which the residuals had a correlation of 0.5. The pattern is less clear.
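
Tracking can also be summarized numerically. The sketch below simulates errors with different amounts of correlation between adjacent time points, fits a linear trend, and reports the correlation between each residual and the next one. The AR(1) error model, the correlation values 0.0, 0.5, and 0.9, and the helper names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
t = np.arange(n, dtype=float)

def ar1_errors(rho, n, rng):
    """Errors with e[i] = rho * e[i-1] + noise; rho = 0 gives uncorrelated errors."""
    e = np.zeros(n)
    noise = rng.normal(0.0, 1.0, size=n)
    for i in range(1, n):
        e[i] = rho * e[i - 1] + noise[i]
    return e

def lag1_autocorr(residuals):
    """Sample correlation between each residual and the one that follows it."""
    return float(np.corrcoef(residuals[:-1], residuals[1:])[0, 1])

X = np.column_stack([np.ones(n), t])
results = {}
for rho in (0.0, 0.5, 0.9):
    y = 1.0 + 0.02 * t + ar1_errors(rho, n, rng)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    results[rho] = lag1_autocorr(y - X @ beta)
    print(f"rho = {rho:.1f} -> lag-1 residual autocorrelation = {results[rho]:.3f}")
```

A lag-1 autocorrelation close to zero is consistent with uncorrelated errors, while a clearly positive value corresponds to the tracking pattern described above.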

**Good experimental design is crucial to mitigate the risk of such correlations**

## Non-constant Variance of Error Terms

**Assumption**: the error terms have a constant variance, $Var(\epsilon_i) = σ^2$.
- The standard errors,
confidence intervals, and hypothesis tests associated with the linear model
rely upon this assumption.

But it's often the case that the variances of the error terms are non-constant. 
- For instance, the variances of the error terms may increase with the value of the response. 
- One can identify non-constant variances in the errors, or **heteroscedasticity**, from the presence of a funnel shape in the residual plot.

**How to solve**: 

Transform the response $Y$ using a concave function such as $\log{Y}$ or $\sqrt{Y}$. Such
a transformation results in a greater amount of shrinkage of the larger responses,
leading to a reduction in **heteroscedasticity**. After the transformation, the residuals in the figure below appear to have roughly constant variance, though there is some evidence of a slight non-linear relationship in the data.

<img src="images/23.png" width="700">
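
The effect of a log transformation can be checked without a plot by comparing the residual spread at large and small fitted values. The sketch below does this on synthetic data whose noise is multiplicative; the data-generating model and the helper `spread_ratio` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(1.0, 10.0, size=n)
# Multiplicative noise: the spread of y grows with its mean (invented model).
y = (2.0 + 3.0 * x) * np.exp(rng.normal(0.0, 0.3, size=n))

def spread_ratio(x, response):
    """Residual std. dev. in the upper half of fitted values divided by
    the residual std. dev. in the lower half. Values near 1 suggest
    roughly constant variance; values well above 1 suggest a funnel shape."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    fitted = X @ beta
    resid = response - fitted
    high = fitted > np.median(fitted)
    return float(resid[high].std() / resid[~high].std())

ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, np.log(y))

print("spread ratio, raw response:", round(ratio_raw, 2))
print("spread ratio, log response:", round(ratio_log, 2))
```

In this made-up setting the log response is no longer exactly linear in `x`, which mirrors the remark above that a transformation can stabilize the variance while leaving a slight non-linear pattern.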

## Outliers 

**Outlier**: a point for which $y_i$ is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.

**Problems caused by outliers**:
- They can affect the least squares line itself.
- They can affect the interpretation of the model fit, even when the effect on the line is small.
 - For instance, in this example, the RSE is 1.09 when the outlier is included in the regression, but it is only 0.77 when the outlier is removed.

**How to solve:**

**Residual Plots** can be used to identify outliers. In practice, it can be difficult to decide how large a residual needs to be before we consider the point to be an outlier. To address this problem, we can plot **studentized residuals**, computed by dividing each residual $e_i$ by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.

<img src="images/24.png" width="600">
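
Here is a minimal sketch of the studentized residual computation on synthetic data with one planted outlier. It assumes the usual estimated standard error of a residual, $\text{RSE}\sqrt{1-h_i}$, where $h_i$ is the leverage of observation $i$; the data and the index of the outlier are invented.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)
y[20] = 1.0 + 2.0 * x[20] + 8.0               # plant one outlier at a hypothetical index

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage values: diagonal of the hat matrix H = X (X^T X)^{-1} X^T
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

p = X.shape[1] - 1                            # number of predictors
rse = np.sqrt(np.sum(resid**2) / (n - p - 1)) # residual standard error
studentized = resid / (rse * np.sqrt(1.0 - h))

flagged = np.flatnonzero(np.abs(studentized) > 3.0)
print("observations with |studentized residual| > 3:", flagged)
```

Statistical packages often report a variant in which the RSE is recomputed with the observation in question left out; the version here is the simpler one and is meant only to illustrate the idea.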

## High Leverage Points 

**High Leverage**: Observations with high leverage have an unusual value for ${x_i}$. For example, observation 41 in the left-hand panel of Figure 3.13 has high leverage, in that the predictor value for this observation is large relative to the other observations.

Comparing the left-hand panels of Figures 3.12 and 3.13, we observe that removing the high leverage observation has a much more substantial impact on the least squares line than removing the outlier. In fact, high leverage observations tend to have a sizable impact on the estimated regression line. For this reason, it is important to identify high leverage observations.

<img src="images/25.png" width="600">

**How to solve:**

In multiple linear regression with many predictors, it's possible to have an observation that is well within the range of each individual predictor's values, but that is unusual in terms of the full set of predictors. An example is shown in the center panel of Figure 3.13. The red observation is well outside of the blue ellipse, but neither its value for X1 nor its value for X2 is unusual. So if we examine just X1 or just X2, we will fail to notice this high leverage point.

In order to quantify an observation's leverage, we compute the **leverage statistic**. A large value of this statistic indicates an observation with high leverage. For simple linear regression,

\begin{align}
h_i=\frac{1}{n}+\frac{(x_i-\bar{x})^2}{\sum_{i^{'}=1}^n (x_{i^{'}}-\bar{x})^2}
\end{align}

- $h_i$ increases with the distance of $x_i$ from $\bar{x}$.
- $h_i$ is always between $1/n$ and 1, and the **average leverage** over all the observations is always equal to $(p+1)/n$, where $p$ is the number of predictors (so $p+1$ counts the regression coefficients including the intercept).
- **High leverage**: if an observation has a leverage statistic that greatly exceeds $(p+1)/n$, we may suspect that the corresponding point has high leverage.
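
The properties listed above can be checked numerically. The sketch below computes the leverage statistic for a simple linear regression in two ways, from the closed-form expression and from the diagonal of the hat matrix $H = X(X^TX)^{-1}X^T$, on synthetic data with one unusual predictor value. The data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
x = rng.normal(0.0, 1.0, size=n)
x[0] = 6.0                                    # one observation with an unusual predictor value

# Closed form for simple linear regression
h_formula = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# General form: diagonal of the hat matrix
X = np.column_stack([np.ones(n), x])
h_hat = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

p = X.shape[1] - 1                            # number of predictors
average_leverage = (p + 1) / n

print("both computations agree:", np.allclose(h_formula, h_hat))
print("mean leverage:", h_hat.mean(), "expected:", average_leverage)
print("largest leverage at index", int(h_hat.argmax()), "with value", round(float(h_hat.max()), 3))
```

The hat-matrix form generalizes to any number of predictors, which is what makes it useful for spotting observations that are unusual only in combination.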



<img src="images/26.png" width="600">

> The right-hand panel of Figure 3.13 provides a plot of the studentized residuals versus $h_i$ for the data in the left-hand panel of Figure 3.13. Observation 41 stands out as having a very high leverage statistic as well as a high studentized residual. In other words, it is an outlier as well as a high leverage observation.

## Collinearity 

**Collinearity**: the situation in which two or more predictor variables are closely related to one another.

**Problems of Collinearity**

- Difficult to separate out the individual effects of collinear variables on the response. In other words, since those predictors tend to increase or decrease together, it can be hard to determine how each one separately is associated with the response.
- Results in a great deal of uncertainty in the coefficient estimates. 
- Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for $\hat{β_j}$ (coefficients) to grow
 - Recall that the t-statistic for each predictor is calculated by dividing $\hat{β_j}$ by its standard error. Consequently, collinearity results in a decline in the t-statistic. As a result, in the presence of collinearity, we may fail to reject H0 : βj = 0. This means that the **power** of the hypothesis test—the probability of correctly detecting a non-zero coefficient—is reduced by collinearity.

<img src="images/27.png" width="600">

**Detection of Collinearity**

- **Correlation matrix** of the predictors.
 - An element of this matrix that is large in absolute value indicates a pair of highly correlated variables.
 - **Multicollinearity**: it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this multicollinearity.
 
- **Variance Inflation Factor (VIF)**
 - The VIF is the ratio of the variance of $\hat{β_j}$ when fitting the full model divided by the variance of $\hat{β_j}$ if fit on its own. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice, there's a small amount of collinearity among the predictors.
 - A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. 
 
 The VIF for each variable can be computed using:
\begin{align}
VIF(\hat{β_j})=\frac{1}{1-R^2_{X_j|X_{-j}}}
\end{align}

where $R^2_{X_j|X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors. If $R^2_{X_j|X_{-j}}$ is close to one, then collinearity is present, and so the VIF will be large.
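
The formula translates directly into code: regress each predictor on all the others, take the $R^2$, and invert $1 - R^2$. The sketch below does this with plain least squares on synthetic data; the function name `vif` and the three simulated predictors (named after variables in the Credit example, but not the real data) are assumptions of this example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    X holds the predictors only (no intercept column); an intercept is
    added inside each auxiliary regression.
    """
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        rss = np.sum((target - others @ beta) ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - rss / tss                  # R^2 of X_j regressed on the other predictors
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(8)
n = 500
limit = rng.normal(0.0, 1.0, size=n)
rating = limit + rng.normal(0.0, 0.1, size=n) # nearly a copy of limit
age = rng.normal(0.0, 1.0, size=n)            # unrelated to the other two

vifs = vif(np.column_stack([limit, rating, age]))
print(dict(zip(["limit", "rating", "age"], np.round(vifs, 2))))
```

In this simulated setting the two nearly identical predictors have VIFs far above the 5 to 10 rule of thumb, while the unrelated one stays close to 1.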

**Solution to Collinearity**

- Drop one of the problematic variables from the regression.
- Combine the collinear variables together into a single predictor
 - E.g.: take the average of standardized versions of limit and rating in order to create a new variable that measures credit worthiness.
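
The second remedy can be sketched in a few lines. The example below standardizes two simulated collinear variables and averages them into one predictor. The simulated values, the helper `standardize`, and the name `credit_worthiness` are made up for illustration and do not come from the Credit data.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400
# Invented values on roughly credit-limit and credit-rating scales.
limit = rng.normal(4700.0, 2300.0, size=n)
rating = 38.0 + 0.067 * limit + rng.normal(0.0, 10.0, size=n)

def standardize(v):
    """Center to mean 0 and scale to standard deviation 1."""
    return (v - v.mean()) / v.std()

# Single combined predictor replacing the two collinear ones.
credit_worthiness = (standardize(limit) + standardize(rating)) / 2.0

print("corr(limit, rating):", round(float(np.corrcoef(limit, rating)[0, 1]), 3))
print("corr(combined, limit):", round(float(np.corrcoef(credit_worthiness, limit)[0, 1]), 3))
```

Standardizing first matters because the two variables are on very different scales; without it the average would be dominated by the variable with the larger values.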