# Multiple Linear Regression: Intuition

In this notebook, we discuss the intuition behind Multiple Linear Regression before diving into application.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

## 1. Basic Intuition

By this point we have already learned the intuition behind simple linear regression.  Given a dataset with years of experience and salary, we can build a machine learning model that can predict an individual's salary given their years of experience.  This scenario can be modeled with the following formula:

$$
y = b_0 + b_1x_1
$$

In such a case, there is one dependent variable — salary, and one independent variable — years of experience.  Now let us consider a new example; suppose we want to predict a US company's profits given their Research & Development Spend (R&D Spend), their administrative costs, their marketing spend, and which state they are based in.  Similar to the first example, we still have only one dependent variable — profit, but now we have 4 independent variables:
- R&D Spend
- Administrative Costs
- Marketing Spend
- State

We model this type of scenario with the following formula:

$$
y = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n
$$

Where $n$ is the number of independent variables.  Let us now consider an example:

<table>
  <tr>
    <th>Profit</th>
    <th>R&D Spend</th>
    <th>Administrative Costs</th>
    <th>Marketing</th>
    <th>State</th>
  </tr>
  <tr>
      <td>192,261.83</td>
      <td>165,349.20</td>
      <td>136,897.80</td>
      <td>471,784.10</td>
      <td>New York</td>
  </tr>
  <tr>
      <td>191,792.06</td>
      <td>162,597.70</td>
      <td>151,377.59</td>
      <td>443,898.53</td>
      <td>California</td>
  </tr>
  <tr>
      <td>191,050.39</td>
      <td>153,441.51</td>
      <td>101,145.55</td>
      <td>407,934.54</td>
      <td>California</td>
  </tr>
  <tr>
      <td>182,901.99</td>
      <td>144,372.41</td>
      <td>118,671.85</td>
      <td>383,199.62</td>
      <td>New York</td>
  </tr>
  <tr>
      <td>166,187.94</td>
      <td>142,107.34</td>
      <td>91,391.77</td>
      <td>366,168.42</td>
      <td>California</td>
  </tr>
</table>

As our example uses four independent variables, our formula would look as follows:

$$
y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + ?
$$

Note the question mark at the end.  Our last independent variable is the company's state, which is a categorical variable; however, linear regression operates on numerical data.  We must therefore convert it into dummy variables using one-hot encoding:

<table>
  <tr>
    <th>Profit</th>
    <th>R&D Spend</th>
    <th>Administrative Costs</th>
    <th>Marketing</th>
    <th>New York</th>
    <th>California</th>
  </tr>
  <tr>
      <td>192,261.83</td>
      <td>165,349.20</td>
      <td>136,897.80</td>
      <td>471,784.10</td>
      <td>1</td>
      <td>0</td>
  </tr>
  <tr>
      <td>191,792.06</td>
      <td>162,597.70</td>
      <td>151,377.59</td>
      <td>443,898.53</td>
      <td>0</td>
      <td>1</td>
  </tr>
  <tr>
      <td>191,050.39</td>
      <td>153,441.51</td>
      <td>101,145.55</td>
      <td>407,934.54</td>
      <td>0</td>
      <td>1</td>
  </tr>
  <tr>
      <td>182,901.99</td>
      <td>144,372.41</td>
      <td>118,671.85</td>
      <td>383,199.62</td>
      <td>1</td>
      <td>0</td>
  </tr>
  <tr>
      <td>166,187.94</td>
      <td>142,107.34</td>
      <td>91,391.77</td>
      <td>366,168.42</td>
      <td>0</td>
      <td>1</td>
  </tr>
</table>
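
The encoding in the table above can be reproduced with a short sketch.  This version uses plain Python so that it runs without any libraries; in practice, helpers such as pandas' `get_dummies` or scikit-learn's `OneHotEncoder` are commonly used for the same job.  The variable names (`states`, `categories`, `one_hot`) are chosen here for illustration.

```python
# One-hot encode the "State" column from the table above using plain Python.
states = ["New York", "California", "California", "New York", "California"]

# Fix the column order explicitly so the output matches the table headers.
categories = ["New York", "California"]

# Each row becomes a list of 0/1 flags, one flag per category.
one_hot = [[1 if state == category else 0 for category in categories]
           for state in states]

for state, flags in zip(states, one_hot):
    print(f"{state:<12} {flags}")
```

Each row contains exactly one `1`, which is where the name "one-hot" comes from.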

Our multiple linear regression formula therefore takes the following form:

$$
y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4D_1
$$

## 2. Multicollinearity And The Dummy Variable Trap

Above, $D_1$ is the dummy variable "Is New York?"  Theoretically, $D_2$ would be the dummy variable "Is California?"  However, we should not use this variable.  More generally, we should always omit one dummy variable for each categorical variable.  To start, because we only have two states, if a company's state is not New York, then it must be California.  Recall that $b_0$ is the constant, which is determined using all variables.  The California case is therefore already absorbed into $b_0$; California serves as the baseline.  If the state is New York, then we add in $b_4$ (as $b_4D_1 = b_4\times1 = b_4$); otherwise, we leave the formula as is.
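
To make the role of $b_4$ concrete, the formula can be written out once per state.  This is only a worked restatement of the paragraph above, using the same symbols:

$$
\begin{aligned}
\text{New York } (D_1 = 1):\quad & y = (b_0 + b_4) + b_1x_1 + b_2x_2 + b_3x_3 \\
\text{California } (D_1 = 0):\quad & y = b_0 + b_1x_1 + b_2x_2 + b_3x_3
\end{aligned}
$$

Read this way, $b_4$ is the shift in predicted profit for a New York company relative to an otherwise identical California company.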

More important, though, is the phenomenon of multicollinearity, where one independent variable can be predicted from one or more of the others.  In this case, $D_2 = 1 - D_1$, so the two dummy variables carry exactly the same information.  A multiple linear regression model cannot distinguish between the impact of $D_1$ and $D_2$, and will not work properly; this is known as the Dummy Variable Trap.

The real problem is that you cannot include both the constant and all dummy variables at the same time, as in the following:

$$
y = b_0 + ... + b_4D_1 + b_5D_2
$$
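
One way to see why this formula is a problem: because $D_1 + D_2 = 1$ for every row, raising $b_0$ by any constant $c$ while lowering both $b_4$ and $b_5$ by the same $c$ leaves every prediction unchanged, so there is no single best set of coefficients for the model to find.  The sketch below checks this on the dummy columns from the table.  The coefficient values and the helper `state_part` are hypothetical and chosen only for illustration.

```python
# Dummy columns for the five companies in the table: (is_new_york, is_california).
dummies = [(1, 0), (0, 1), (0, 1), (1, 0), (0, 1)]


def state_part(b0, b4, b5, d1, d2):
    """Contribution of the constant and both dummy variables to a prediction."""
    return b0 + b4 * d1 + b5 * d2


# Hypothetical coefficients, for illustration only.
b0, b4, b5 = 100, 30, 10

# Shift the constant up by c and both dummy coefficients down by c.
c = 7
original = [state_part(b0, b4, b5, d1, d2) for d1, d2 in dummies]
shifted = [state_part(b0 + c, b4 - c, b5 - c, d1, d2) for d1, d2 in dummies]

print(original)
print(shifted)
print(original == shifted)  # the two coefficient sets are indistinguishable
```

Dropping one dummy variable removes this ambiguity, because the remaining coefficient can then only be read as a difference from the baseline state.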

As stated, more generally, we always omit one dummy variable.  If there were a third state, we would include $b_4D_1$ and $b_5D_2$, but we would exclude $b_6D_3$.  Furthermore, if we have multiple categorical variables, we apply the same logic to each, omitting one dummy variable per categorical variable.

## 3. Hypothesis Testing, Statistical Significance, And The $P$-Value

To begin our understanding of statistical significance, consider a coin toss.  There are two hypotheses; one hypothesis, the "null hypothesis," is that this is a fair coin.  Another hypothesis, the "alternative hypothesis" is that this is not a fair coin.  The hypotheses are therefore as follows:

$H_0$: This is a fair coin.<br>
$H_1$: This is not a fair coin.

We want to determine which hypothesis holds true.  In hypothesis testing, we assume $H_0$ to be true, and then determine whether the statistics support rejecting the null hypothesis in favor of the alternative hypothesis.  Suppose we flip the coin for the first time then, and it lands on tails; the probability of this happening is 50% (since we assume this to be a fair coin).  Disregarding statistics, how do we feel about this?  Is this a fair coin?  One would reasonably assume so.

Suppose we flip the coin again, and it lands on tails a second time; the probability of this happening is $0.5\times0.5=0.25$.  Again, how do we feel about this?  Reasonably, getting two tails in a row is no big deal.  Suppose we flip the coin a third time, and get tails a third time.  The probability of this happening is $0.5\times0.5\times0.5=0.125$.  While one could continue to assume this is a fair coin, it's beginning to get suspicious.  If we continue this pattern, the probability of getting tails 4 times in a row is $0.5\times0.5\times0.5\times0.5=0.0625$.  The probability of getting tails 5 times in a row is $0.5\times0.5\times0.5\times0.5\times0.5=0.03125$.

These probabilities are known as $P$-values.  A $P$-value is the probability of observing a result at least as extreme as the one we got, given that the null hypothesis is true.  Therefore, if the null hypothesis is true (that this is a fair coin), then the probability of getting tails four times in a row is 6.25%, and the probability of getting yet another tails would be 3.125%.  If the alternative hypothesis were the true one, however, then the probabilities of repeatedly getting tails would be different.  For example, if this were a weighted coin with an 80% chance of landing on tails, then getting 5 tails in a row would have a probability of $0.8\times0.8\times0.8\times0.8\times0.8=0.32768$.
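
The arithmetic above is easy to check with a few lines of code.  This is a minimal sketch; the function name `prob_all_tails` is made up for this example.

```python
def prob_all_tails(n_flips, p_tails=0.5):
    """Probability that n_flips independent flips all land on tails."""
    return p_tails ** n_flips


# Fair coin (the null hypothesis): one to five tails in a row.
for n in range(1, 6):
    print(n, prob_all_tails(n))

# Weighted coin with an 80% chance of tails: five tails in a row.
# Rounded for display, since 0.8 cannot be represented exactly as a float.
print(round(prob_all_tails(5, p_tails=0.8), 5))
```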

Statistical significance is conventionally set at $\alpha=0.05$, that is, a probability of 5%.  Below this threshold, the event is considered so unlikely under the null hypothesis that we reject the null hypothesis in favor of the alternative hypothesis.  In other words, if the probability of getting tails 5 times in a row is only 3.125%, and yet it still happens, then we can confidently reject the hypothesis that this is a fair coin, and argue that the coin is not fair.  Note that $\alpha$ is also the risk we accept of being wrong: if the null hypothesis is in fact true, there is still a 5% chance that we reject it by mistake, but since this is so low, we accept that risk.  Also note that depending on your circumstances, you can choose a different significance level.  You may consider 10%, or $\alpha=0.1$, to be low enough to reject the null hypothesis, or you may require that $\alpha=0.0025$.
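
The decision rule itself is a single comparison.  The sketch below applies it to the coin-toss probabilities at a few different significance levels; the helper `reject_null` is a name made up for this example.

```python
def reject_null(p_value, alpha=0.05):
    """Reject the null hypothesis when the P-value falls below alpha."""
    return p_value < alpha


# Probability of n tails in a row under the fair-coin null hypothesis.
p_four = 0.5 ** 4  # 0.0625
p_five = 0.5 ** 5  # 0.03125

print(reject_null(p_four))                # False: 0.0625 is not below 0.05
print(reject_null(p_five))                # True: 0.03125 is below 0.05
print(reject_null(p_four, alpha=0.1))     # True under the looser 10% level
print(reject_null(p_five, alpha=0.0025))  # False under the stricter level
```

The same evidence can lead to different conclusions depending on the chosen $\alpha$, which is why the level should be fixed before looking at the data.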

## 4. Building A Model

Returning to our example, recall that we have several independent variables:

$y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4D_1$

In reality, we could potentially have many more than this:

$y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 + b_5x_5 + b_6x_6 + b_7x_7 + b_8x_8 + b_9x_9 + b_{10}x_{10} + \dots$

In practice, we must decide which variables to keep, and which to discard.  While it may seem like the more data the better, there are actually valid reasons to discard unnecessary data:
1. <em>"Garbage in, garbage out"</em>: If the input data is "garbage," that is, unhelpful, then it will only distract the model during the training phase, resulting in a bad model.
2. We may have to explain the effect of each independent variable to some audience, including the math behind it.  It is much easier to explain 4 variables than to explain 10.  As an extension of the above, it is especially hard to explain an independent variable that actually has no effect on the dependent variable.

Knowing that we must determine which independent variables to use, we now examine 5 methods of building models:
1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison

Methods 2-4 are also known as stepwise regression, but the term "stepwise regression" often refers to bidirectional elimination.

### 4.1 All-In Method

The "all-in" method, as its name implies, consists of using all variables.  You may do this if you have prior knowledge of the problem, and only use meaningful variables to start with.  You may also use this if you are preparing for backward elimination.

### 4.2 Backward Elimination Method

The steps to perform backward elimination are as follows:

1. Determine a significance level to keep variables in a model.  For example, you may decide that a variable must have a $P$-value at or below $\alpha=0.05$ in order to stay.
2. Fit the model with all possible predictors.  In other words, go "all-in."
3. Identify the predictor with the highest $P$-value.  If $P>\alpha$ (the variable is not statistically significant), proceed to step 4; otherwise, your model is ready.
4. Remove the predictor, and <b>refit the model without this variable</b>.  For emphasis, we must note here that it is critical to refit the model, and not simply remove the variable and call it done.

After step 4, we go back to step 3, and examine the $P$-values, repeating this pattern until we are confident that all variables left are statistically significant.
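
The loop described above can be sketched as follows.  This is a minimal sketch under stated assumptions, not a definitive implementation: `fit_p_values` is a hypothetical callable that fits a model on the given predictors and returns one $P$-value per predictor (in practice it might wrap an ordinary least squares fit from a statistics library).  The `toy_fit` stand-in and its numbers are invented purely so the example runs.

```python
def backward_elimination(predictors, fit_p_values, alpha=0.05):
    """Drop the least significant predictor and refit until all have P <= alpha."""
    selected = list(predictors)
    while selected:
        p_values = fit_p_values(selected)  # refit on the current predictor set
        worst = max(selected, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:       # everything left is significant
            break
        selected.remove(worst)             # remove, then loop back to refit
    return selected


# Hypothetical, fixed P-values for illustration only.  A real fit would
# recompute them every time the predictor set changes.
TOY_P = {"rd_spend": 0.001, "admin": 0.60, "marketing": 0.03, "is_new_york": 0.95}


def toy_fit(predictors):
    """Stand-in for a real model fit: look up the invented P-values."""
    return {name: TOY_P[name] for name in predictors}


print(backward_elimination(list(TOY_P), toy_fit))
```

Note that `fit_p_values` is called again on every pass through the loop, which corresponds to the refitting that step 4 insists on.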

### 4.3 Forward Selection Method

While forward selection sounds like the opposite of backward elimination, it is actually much more complex than just reversing the backward elimination method.  The steps are as follows:

1. Determine the significance level to "enter" the model.  For example, you may decide a predictor must have a $P$-value below $\alpha=0.05$ in order to bring it into use.
2. Fit all simple regression models; that is, fit a separate single-variable linear regression model for each candidate variable.  Then select the model whose predictor has the lowest $P$-value.
3. Keep this variable, then fit all possible models with one extra predictor.
4. Identify the new predictor with the lowest $P$-value.  If $P<\alpha$ (the variable is statistically significant), then repeat step 3; otherwise, ignore the variable, and your model is ready.
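
The steps above can be sketched as a loop.  As an assumption for this sketch, `fit_p_values` is a hypothetical callable that fits a model on the given predictors and returns one $P$-value per predictor, and `toy_fit` is an invented stand-in so the example runs.  Unlike step 2 as written, this version also applies the entry threshold to the very first predictor.

```python
def forward_selection(candidates, fit_p_values, alpha=0.05):
    """Add the most significant remaining predictor until none has P < alpha."""
    selected = []
    remaining = list(candidates)
    while remaining:
        # Fit one model per candidate: the kept predictors plus that candidate.
        trial_p = {name: fit_p_values(selected + [name])[name]
                   for name in remaining}
        best = min(remaining, key=lambda name: trial_p[name])
        if trial_p[best] >= alpha:  # best newcomer is not significant: stop
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical, fixed P-values for illustration only.  A real fit would
# recompute them every time the predictor set changes.
TOY_P = {"rd_spend": 0.001, "admin": 0.60, "marketing": 0.03, "is_new_york": 0.95}


def toy_fit(predictors):
    """Stand-in for a real model fit: look up the invented P-values."""
    return {name: TOY_P[name] for name in predictors}


print(forward_selection(list(TOY_P), toy_fit))
```

Each pass fits one model per remaining candidate, which is why forward selection involves more fitting work than backward elimination.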

### 4.4 Bidirectional Elimination Method

As the name would suggest, the bidirectional elimination method combines the backward elimination method and the forward selection method.  The steps are as follows:

1. Determine a significance level to "enter" the model, and to "stay" in the model.  For example, you may determine that to enter the model, $\alpha_1=0.05$, and to stay in the model, $\alpha_2=0.05$.
2. Perform the next step of forward selection: fit all models with one additional predictor (on the first pass, all simple regression models), and add the predictor with the lowest $P$-value, provided that $P<\alpha_1$.
3. If we have more than one predictor, then we reperform backward elimination <b>entirely</b>.  We must emphasize here "<b>entirely</b>."  That is, we do not simply remove one variable, but rather as many as backward elimination calls for.
4. Repeat step 2, adding a new predictor to the model, followed by repeating step 3, reperforming backward elimination, until no new variables may enter, and none can exit.  Here, your model is ready.

This is an iterative method.  Because step 3 reperforms backward elimination <b>entirely</b>, it is possible to work our way up to several variables, only to add one more, then remove a few, then add one, then remove a few.
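
The procedure can be sketched by combining a forward step with a full backward pass inside one loop.  This is a sketch under stated assumptions: `fit_p_values` is a hypothetical callable returning one $P$-value per predictor for a fitted model, `toy_fit` is an invented stand-in, and `max_rounds` is a safety guard added here because a variable could in principle keep entering and leaving.

```python
def bidirectional_elimination(candidates, fit_p_values,
                              alpha_enter=0.05, alpha_stay=0.05, max_rounds=100):
    """Alternate one forward step with a full backward elimination pass."""
    selected = []
    for _ in range(max_rounds):  # guard against endless add/remove cycles
        # Forward step: try each unused predictor alongside the current ones.
        remaining = [name for name in candidates if name not in selected]
        trial_p = {name: fit_p_values(selected + [name])[name]
                   for name in remaining}
        entrants = [name for name in remaining if trial_p[name] < alpha_enter]
        if not entrants:
            break  # nothing can enter, so the model is ready
        selected.append(min(entrants, key=lambda name: trial_p[name]))

        # Backward step, performed entirely: refit and drop until all can stay.
        while selected:
            p_values = fit_p_values(selected)
            worst = max(selected, key=lambda name: p_values[name])
            if p_values[worst] <= alpha_stay:
                break
            selected.remove(worst)
    return selected


# Hypothetical, fixed P-values for illustration only.  A real fit would
# recompute them every time the predictor set changes.
TOY_P = {"rd_spend": 0.001, "admin": 0.60, "marketing": 0.03, "is_new_york": 0.95}


def toy_fit(predictors):
    """Stand-in for a real model fit: look up the invented P-values."""
    return {name: TOY_P[name] for name in predictors}


print(bidirectional_elimination(list(TOY_P), toy_fit))
```

With the fixed toy values nothing is ever removed again after entering; with a real fit, $P$-values shift as predictors come and go, which is what makes the backward pass necessary.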

### 4.5 Score Comparison / All Possible Models

This is the most thorough method, but also the most resource-consuming.  The steps are as follows:

1. Select a criterion for goodness of fit (for example, the Akaike information criterion, $R^2$, etc.).
2. Build all possible models (there will be a total of $2^{n}-1$ combinations, where $n$ is the number of independent variables).
3. Select the model with the best criterion.  Here, your model is ready.

While this may seem simple, let's consider step 2.  If we have a dataset with as few as 10 independent variables, then we must compare $2^{10}-1=1{,}023$ models.
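
The count of $2^{n}-1$ from step 2 can be confirmed by enumerating every non-empty subset of predictors.  This sketch uses only the standard library; the predictor names and the helper `all_predictor_subsets` are made up for illustration.

```python
from itertools import combinations


def all_predictor_subsets(predictors):
    """Yield every non-empty subset of predictors, smallest subsets first."""
    for size in range(1, len(predictors) + 1):
        yield from combinations(predictors, size)


# Four predictors, as in the running example: 2**4 - 1 = 15 candidate models.
example = ["rd_spend", "admin", "marketing", "is_new_york"]
print(len(list(all_predictor_subsets(example))))  # 15

# Ten predictors: 2**10 - 1 = 1023 candidate models.
ten = [f"x{i}" for i in range(1, 11)]
print(len(list(all_predictor_subsets(ten))))  # 1023
```

Every additional predictor roughly doubles the number of models to fit, which is why this method quickly becomes impractical.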

# Parking Lot Notes

Linear Regression models have assumptions.  Before building a machine learning model, you must check whether these assumptions are true.  The assumptions are as follows:

1. Linearity
2. Homoscedasticity
3. Multivariate Normality
4. Independence Of Errors
5. Lack Of Multicollinearity

To-do:
- Work out and show the math behind the Dummy Variable Trap, why this phenomenon exists.