# Multiple Linear Regression: Intuition

In this notebook, we discuss the intuition behind Multiple Linear Regression, before diving into application.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

By this point we have already learned the intuition behind simple linear regression.  Given a dataset of individuals' years of experience and salaries, we can build a machine learning model that predicts the salary for a given number of years of experience.  This scenario can be modeled with the formula:

$$
y = b_0 + b_1x_1
$$

In such a case, there is one dependent variable (salary) and one independent variable (years of experience).  Now let us consider a new example: suppose we want to predict a US company's profit given its Research & Development spend (R&D Spend), its administrative costs, its marketing spend, and the state in which it is based.  As in the first example, we still have only one dependent variable, profit, but now we have four independent variables:
- R&D Spend
- Administrative Costs
- Marketing Spend
- State

We model this type of scenario with the following formula:

$$
y = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n
$$

Where $n$ is the number of independent variables.  Let us now consider an example:

<table>
  <tr>
    <th>Profit</th>
    <th>R&D Spend</th>
    <th>Administrative Costs</th>
    <th>Marketing</th>
    <th>State</th>
  </tr>
  <tr>
      <td>192,261.83</td>
      <td>165,349.20</td>
      <td>136,897.80</td>
      <td>471,784.10</td>
      <td>New York</td>
  </tr>
  <tr>
      <td>191,792.06</td>
      <td>162,597.70</td>
      <td>151,377.59</td>
      <td>443,898.53</td>
      <td>California</td>
  </tr>
  <tr>
      <td>191,050.39</td>
      <td>153,441.51</td>
      <td>101,145.55</td>
      <td>407,934.54</td>
      <td>California</td>
  </tr>
  <tr>
      <td>182,901.99</td>
      <td>144,372.41</td>
      <td>118,671.85</td>
      <td>383,199.62</td>
      <td>New York</td>
  </tr>
  <tr>
      <td>166,187.94</td>
      <td>142,107.34</td>
      <td>91,391.77</td>
      <td>366,168.42</td>
      <td>California</td>
  </tr>
</table>

As our example uses four independent variables, our formula would look as follows:

$$
y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + ?
$$

Note the question mark at the end.  Our last independent variable is the company's state, which is a categorical variable; however, linear regression is an algorithm based on numerical data.  We must therefore create dummy variables, one 0/1 column per state, using one-hot encoding:

<table>
  <tr>
    <th>Profit</th>
    <th>R&D Spend</th>
    <th>Administrative Costs</th>
    <th>Marketing</th>
    <th>New York</th>
    <th>California</th>
  </tr>
  <tr>
      <td>192,261.83</td>
      <td>165,349.20</td>
      <td>136,897.80</td>
      <td>471,784.10</td>
      <td>1</td>
      <td>0</td>
  </tr>
  <tr>
      <td>191,792.06</td>
      <td>162,597.70</td>
      <td>151,377.59</td>
      <td>443,898.53</td>
      <td>0</td>
      <td>1</td>
  </tr>
  <tr>
      <td>191,050.39</td>
      <td>153,441.51</td>
      <td>101,145.55</td>
      <td>407,934.54</td>
      <td>0</td>
      <td>1</td>
  </tr>
  <tr>
      <td>182,901.99</td>
      <td>144,372.41</td>
      <td>118,671.85</td>
      <td>383,199.62</td>
      <td>1</td>
      <td>0</td>
  </tr>
  <tr>
      <td>166,187.94</td>
      <td>142,107.34</td>
      <td>91,391.77</td>
      <td>366,168.42</td>
      <td>0</td>
      <td>1</td>
  </tr>
</table>
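A minimal sketch of this encoding step, assuming the data is held in a pandas DataFrame. The frame `df` and its contents are illustrative: only the Profit and State columns from the table above are reproduced to keep it short. `pandas.get_dummies` is one common way to one-hot encode a column; it names the new columns with a `State_` prefix rather than the bare state names shown in the table.

```python
import pandas as pd

# Illustrative frame: Profit and State copied from the table above.
df = pd.DataFrame({
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
    "State": ["New York", "California", "California", "New York", "California"],
})

# One-hot encode State: one 0/1 column per distinct state.
encoded = pd.get_dummies(df, columns=["State"], dtype=int)
print(encoded)
```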

Our multiple linear regression formula therefore takes the following form:

$$
y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4D_1
$$
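To make the formula concrete, the sketch below evaluates it for a single company. The function name `predict_profit` and every coefficient value are made up for illustration; they are not fitted to the table above.

```python
def predict_profit(rd_spend, admin_costs, marketing_spend, is_new_york, coefs):
    """Evaluate y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*D1 for one company."""
    b0, b1, b2, b3, b4 = coefs
    return (b0
            + b1 * rd_spend
            + b2 * admin_costs
            + b3 * marketing_spend
            + b4 * is_new_york)

# Hypothetical coefficients (b0..b4), chosen only to show the mechanics.
coefs = (50000.0, 0.8, -0.05, 0.03, 1200.0)

# Same spending figures, different state: D1 = 1 for New York, 0 otherwise.
ny = predict_profit(100000.0, 120000.0, 300000.0, 1, coefs)
ca = predict_profit(100000.0, 120000.0, 300000.0, 0, coefs)
print(ny - ca)  # the two predictions differ only by b4
```

With identical spending, the only difference between the two predictions is the $b_4$ term, which is switched on by $D_1 = 1$.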

## Multicollinearity And The Dummy Variable Trap

Above, $D_1$ is the dummy variable "Is New York?"  Theoretically, $D_2$ would be the dummy variable "Is California?"  However, we should not use this variable.  More generally, we should always omit one dummy variable per categorical variable.  To start, because we only have two states, if a company's state is not New York, then it must be California.  Recall that $b_0$ is the constant, the value the model predicts when every other term is zero.  When $D_1 = 0$ the company is in California, so the effect of being in California is already absorbed into $b_0$.  If the state is New York, then we add in $b_4$ (as $b_4D_1 = b_4\times1 = b_4$), which measures the difference between New York and California; otherwise, we leave the formula as is.

More important, though, is the phenomenon of multicollinearity, where one or more independent variables can be predicted from the others.  In this case, $D_2 = 1 - D_1$.  A multiple linear regression model cannot distinguish between the impact of $D_1$ and that of $D_2$, and will not work properly; this is known as the Dummy Variable Trap.

The real problem is that you cannot have the constant and all of the dummy variables in the model at the same time.  Because $D_1 + D_2 = 1$ in every row, the two dummies together duplicate the column of ones that multiplies $b_0$, so the following model has no unique set of coefficients:

$$
y = b_0 + ... + b_4D_1 + b_5D_2
$$
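The linear dependence can be checked numerically by looking at the rank of the design matrix. The sketch below assumes NumPy is available and uses only the intercept and dummy columns for the five rows in the table above; the variable names are illustrative.

```python
import numpy as np

d1 = np.array([1, 0, 0, 1, 0])  # "Is New York?" for the five rows above
d2 = 1 - d1                     # "Is California?"
ones = np.ones_like(d1)         # column multiplied by the constant b0

# Constant plus both dummies: three columns, but they are linearly dependent.
x_trap = np.column_stack([ones, d1, d2])
# Constant plus one dummy: two independent columns.
x_ok = np.column_stack([ones, d1])

rank_trap = np.linalg.matrix_rank(x_trap)
rank_ok = np.linalg.matrix_rank(x_ok)
print(x_trap.shape[1], rank_trap)  # number of columns vs. rank
print(x_ok.shape[1], rank_ok)
```

When the rank is smaller than the number of columns, least squares cannot pin down a unique coefficient for each column, which is the trap in matrix form.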

As stated, more generally, we always omit one dummy variable. If there were a third state, we would include $b_4D_1$ and $b_5D_2$, but we would exclude $b_6D_3$.  Furthermore, if we have multiple categorical variables, we apply the same logic to each, omitting one dummy variable from each category.
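Many encoders can drop one dummy per category automatically. A minimal sketch with `pandas.get_dummies` and its `drop_first` parameter follows; the three-state column is made-up sample data (Florida does not appear in the table above). Note that `drop_first=True` removes the first category in sorted order; which dummy is omitted does not matter for avoiding the trap, it only changes which state acts as the baseline absorbed into $b_0$.

```python
import pandas as pd

# Hypothetical three-state sample, for illustration only.
states = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# Keep k - 1 dummies for k categories; "California" (first in sorted order) is dropped.
dummies = pd.get_dummies(states, columns=["State"], drop_first=True, dtype=int)
print(dummies.columns.tolist())
print(dummies)
```

The California row is then encoded as all zeros, which is how the baseline category is represented.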

## Hypothesis Testing, Statistical Significance, And The P-Value

To begin our understanding of statistical significance, consider a coin toss.  There are two hypotheses: the "null hypothesis" is that this is a fair coin, and the "alternative hypothesis" is that this is not a fair coin.  The hypotheses are therefore as follows:

$H_0$: This is a fair coin.<br>
$H_1$: This is not a fair coin.

We want to understand which hypothesis is the correct one.  In hypothesis testing, we assume $H_0$ to be true, and then determine whether the statistics support rejecting the null hypothesis in favor of the alternative hypothesis.  Suppose we flip the coin for the first time, and it lands on tails; the probability of this happening is 50%.  Disregarding statistics, how do we feel about this?  Is this a fair coin?  One would reasonably assume so.

Suppose we flip the coin again, and it lands on tails a second time; the probability of this happening is $0.5\times0.5=0.25$.  Again, how do we feel about this?  Reasonably, getting two tails in a row is no big deal.  Suppose we flip the coin a third time, and get tails a third time.  The probability of this happening is $0.5\times0.5\times0.5=0.125$.  While one could continue to assume this is a fair coin, it's beginning to get suspicious.  If we continue this pattern, the probability of getting tails 4 times in a row is $0.5\times0.5\times0.5\times0.5=0.0625$.  The probability of getting tails 5 times in a row is $0.5\times0.5\times0.5\times0.5\times0.5=0.03125$.

These probabilities are known as $P$-values.  A $P$-value is the probability of observing a result at least as extreme as the one we saw, given that the null hypothesis is true.  Therefore, if the null hypothesis is true (that is, this is a fair coin), then the probability of getting tails four times in a row is 6.25%, and the probability of getting tails five times in a row is 3.125%.  If the alternative hypothesis is the true one, however, then the probabilities of continuously getting tails would be different.  For example, if this were a weighted coin with an 80% chance of landing on tails, then getting 5 tails in a row would have a probability of $0.8\times0.8\times0.8\times0.8\times0.8=0.32768$.
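The arithmetic above can be reproduced in a few lines. The helper name `prob_all_tails` is just for illustration; it assumes independent flips with a fixed probability of tails.

```python
def prob_all_tails(n_flips, p_tails=0.5):
    """Probability that n independent flips all land on tails."""
    return p_tails ** n_flips

for n in range(1, 6):
    fair = prob_all_tails(n)            # under H0: a fair coin
    weighted = prob_all_tails(n, 0.8)   # under a coin with an 80% chance of tails
    print(f"{n} tails in a row: fair={fair:.5f}, weighted={weighted:.5f}")
```

The same run of five tails that is rare for a fair coin is roughly ten times more likely for the weighted coin, which is why the run counts as evidence against $H_0$.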

Statistical significance is conventionally set at $\alpha=0.05$, a probability threshold of 5%.  Once the $P$-value falls below this threshold, the observed event is so unlikely under the null hypothesis that we reject the null hypothesis in favor of the alternative hypothesis.  In other words, if the probability of getting tails 5 times in a row is only 3.125%, and yet it still happens, then we can confidently reject the hypothesis that this is a fair coin and argue that the coin is not fair.  Note that $\alpha$ is also the risk we accept of being wrong: if the coin really is fair, a result this extreme still occurs up to 5% of the time, and in those cases we reject a true null hypothesis.  Also note that depending on your circumstances, you can choose the threshold you consider statistically significant.  You may consider 10%, or $\alpha=0.1$, to be low enough to reject the null hypothesis, or you may require $\alpha=0.00025$.
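The decision rule can be sketched as a comparison between the $P$-value and the chosen $\alpha$. The function name and default below are illustrative, and the sketch uses the same simplified $P$-value as the text: the probability of an unbroken run of tails from a fair coin.

```python
def reject_null(n_tails_in_a_row, alpha=0.05):
    """Return (p_value, reject) for a run of tails, assuming a fair coin under H0."""
    p_value = 0.5 ** n_tails_in_a_row
    return p_value, p_value < alpha

for n in range(1, 6):
    p_value, reject = reject_null(n)
    print(n, p_value, "reject H0" if reject else "fail to reject H0")
```

With $\alpha=0.05$ the first rejection comes at the fifth tails in a row; with $\alpha=0.1$ it would already come at the fourth.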

# Parking Lot Notes

Linear regression models rest on assumptions.  Before building a machine learning model, you must check whether these assumptions hold.  The assumptions are as follows:

1. Linearity
2. Homoscedasticity
3. Multivariate Normality
4. Independence Of Errors
5. Lack Of Multicollinearity

To-do:
- Work out and show the math behind the Dummy Variable Trap, and why this phenomenon exists.