<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Multiple Linear Regression

_Authors: TMTC_ | _RUH_
---

## Review


### Overview of Supervised Learning
---

![Supervised learning diagram](./images/supervised_learning.png)

As we discussed yesterday:

_Simple linear regression (SLR) has one independent variable._

_Multiple linear regression (MLR) has two or more independent variables._

#### In either case, we are looking to minimize the difference between our predictions, $\hat{y}$, and the true value, $y$

![Estimating coefficients](./images/estimating_coefficients.png)

In the diagram above:

- The black dots are the **observed values** of x and y.
- The blue line is our **least squares line**.
- The red lines are the **residuals**, which are the vertical distances between the observed values and the least squares line.


### Form of Linear Regression

Simple linear regression uses one feature and a constant (the intercept) to model the relationship with a target variable.
### $y = \alpha + \beta x + \epsilon_i$

but you might know it as (with slope $m = \beta$ and intercept $b = \alpha$)
### $y = mx + b$

And we can extend the simple linear regression into the multiple linear regression (more on this later):
### $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon_i$

- $y$ is the response (the target/outcome/dependent variable)
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature/independent variable)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature/independent variable)
- $\epsilon_i$ is the error term (the random part of $y$ that the features do not explain)
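
As a rough illustration of the multiple regression equation, the prediction for one observation is the intercept plus a weighted sum of its feature values. The coefficient and feature values below are made up for the example and do not come from any fitted model.

```python
import numpy as np

# Hypothetical values chosen only to illustrate the formula.
beta_0 = 10.0                        # intercept
betas = np.array([2.0, -1.5, 0.5])   # beta_1, beta_2, beta_3
x = np.array([3.0, 4.0, 8.0])        # x_1, x_2, x_3 for one observation

# y_hat = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
y_hat = beta_0 + betas @ x
print(y_hat)
```

The error term does not appear in the prediction because it is the part of $y$ the model cannot account for.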

The $\beta$ values are called the **model coefficients**:

- These values are estimated (or "learned") during the model fitting process using the **least squares criterion**.
- Specifically, we are finding the line (mathematically) which minimizes the **sum of squared residuals** (or "sum of squared errors").
- And once we've learned these coefficients, we can use the model to predict the response and draw inferences about the relationships between variables and the outcome.
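
To make the least squares criterion concrete, here is a minimal sketch for the one-feature case. It uses synthetic data (not the bikeshare data) and the standard closed-form estimates for a single feature. The helper name `sum_squared_residuals` is just for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y depends linearly on x, plus random noise.
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=100)

# Closed-form least squares estimates for one feature.
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

def sum_squared_residuals(intercept, slope):
    """Sum of squared vertical distances between y and the line."""
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)

best = sum_squared_residuals(beta_0, beta_1)

# Nudging either coefficient away from the estimate makes the fit worse.
print(best < sum_squared_residuals(beta_0 + 0.1, beta_1))
print(best < sum_squared_residuals(beta_0, beta_1 + 0.1))
```

Both comparisons should come out true, since the least squares estimates are the values that make the sum of squared residuals as small as possible.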

### SLR in `sklearn`
**Walk through the process for fitting a model in `sklearn`.**

## The `sklearn` process

When we model data using `sklearn`, we're going to follow (more or less) the same process every time.

1. Select and import the algorithm we want to use (i.e., `from [library] import [model]`)
2. Create a feature matrix (called `X` by convention) with your independent variable(s)
3. Create an outcome/target/response vector (usually denoted `y`) with your dependent variable
4. Ensure that `X` and `y` are the same length (`.shape` is your friend here)
5. Instantiate the estimator
6. Fit the estimator (i.e., train your model)
7. Use the estimator to make predictions
8. Evaluate the predictions using the appropriate metric
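
The process above can be sketched end to end as follows. This is only an illustration on synthetic stand-in data, not the bikeshare solution: the column names `feature` and `target` are hypothetical, and the metrics are computed on the training data purely to keep the example short.

```python
# Step 1: select and import the algorithm
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data (not the bikeshare dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.uniform(0, 40, size=200)})
df["target"] = 5.0 + 9.0 * df["feature"] + rng.normal(0, 20.0, size=200)

# Step 2: feature matrix X (2-D: rows are observations, columns are features)
X = df[["feature"]]
# Step 3: target vector y (1-D)
y = df["target"]
# Step 4: X and y must have the same number of rows
assert X.shape[0] == y.shape[0]

# Step 5: instantiate the estimator
lr = LinearRegression()
# Step 6: fit the estimator (train the model)
lr.fit(X, y)
# Step 7: use the estimator to make predictions
y_pred = lr.predict(X)
# Step 8: evaluate the predictions
print("MSE:", mean_squared_error(y, y_pred))
print("R^2:", r2_score(y, y_pred))

# The learned intercept and coefficient
print(lr.intercept_, lr.coef_)
```

Note the double brackets in `df[["feature"]]`: `sklearn` expects `X` to be two-dimensional even when there is only one feature.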

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [25]:
url = './datasets/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime')

### Data Dictionary

| Variable| Description |
|---------|----------------|
|datetime| hourly date + timestamp  |
|season|  1 = spring, 2 = summer, 3 = fall, 4 = winter |
|holiday| whether the day is considered a holiday|
|workingday| whether the day is neither a weekend nor holiday|
|weather| 1: Clear, 2: Mist, 3: Light Snow, 4: Heavy Rain|
|temp| temperature in Celsius|
|atemp| "feels like" temperature in Celsius|
|humidity| relative humidity|
|windspeed| wind speed|
|casual| number of non-registered user rentals initiated|
|registered| number of registered user rentals initiated|
|count| number of total rentals|

In [26]:
# do some eda
bikes.rename(columns={'count':'total'}, inplace=True)

In [27]:
# step 1: Select and import the algorithm we want to use


In [71]:
# step 2: Create a feature matrix


In [29]:
# check correlation 

In [72]:
# check linear relationship


In [32]:
# step 3: Create a target vector (y)


In [33]:
# step 4: Ensure that X and y are the same length

In [73]:
# check X type, shape


In [74]:
# check y type, shape


In [36]:
# step 5: Instantiate the estimator
lr= 

In [75]:
# step 6: Fit the estimator (train your model)


In [38]:
# step 7: Use the estimator to make predictions


In [39]:
# step 8: Evaluate the predictions using the appropriate metric

In [76]:
# check intercept


In [77]:
# check coef


# Multiple Linear Regression in `sklearn`
---
We've built a simple linear regression, a model that uses only one feature, and it's pretty useful. But is a single feature likely to be enough for accurate predictions?

<a id="adding-more-features-to-the-model"></a>
### Adding More Features to the Model

In the previous example, one variable explained the variance of another; however, more often than not, we will need multiple variables. 

- For example, a house's price may be best predicted by its square footage, but a lot of other variables play a vital role: bedrooms, bathrooms, location, appliances, etc.

- For a linear regression, we want these variables to be largely independent of one another, but all of them should help explain the y variable.

We'll work with bikeshare data to showcase what this means and to explain a concept called multicollinearity.

#### Explore more features.

In [None]:
# step 2: Create a feature matrix

In [79]:
# check correlation again


In [80]:
# Create feature column variables


In [81]:
# check linear relationship


In [57]:
# step 3: Create a target vector (y)


In [58]:
# step 4: Ensure that X and y are the same length

In [82]:
# check X type, shape


In [83]:
# check y type, shape


In [61]:
# step 5: Instantiate the estimator


In [84]:
# step 6: Fit the estimator (train your model)


In [64]:
# step 7: Use the estimator to make predictions


In [65]:
# step 8: Evaluate the predictions using the appropriate metric

In [85]:
# check intercept


In [86]:
# check coef (Pair the feature names with the coefficients.)


Interpreting the coefficients:

- Holding all other features fixed, a 1-unit increase in temperature is associated with a rental increase of 2.64 bikes.
- Holding all other features fixed, a 1-unit increase in temperature (feels like) is associated with a rental increase of 4.87 bikes.
- Holding all other features fixed, a 1-unit increase in humidity is associated with a rental decrease of 3.16 bikes.
- Holding all other features fixed, a 1-unit increase in season is associated with a rental increase of 22.3 bikes.
- Holding all other features fixed, a 1-unit increase in weather is associated with a rental increase of 7.38 bikes.


Does anything look incorrect or fail to reflect reality?
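
Pairing feature names with coefficients, as in the interpretation list, can be sketched as below. The data is synthetic and the column names `feat_a` and `feat_b` are hypothetical, so the numbers have nothing to do with the bikeshare results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "feat_a": rng.normal(0, 1, size=300),
    "feat_b": rng.normal(0, 1, size=300),
})
y = 4.0 + 2.0 * X["feat_a"] - 3.0 * X["feat_b"] + rng.normal(0, 0.5, size=300)

lr = LinearRegression().fit(X, y)

# coef_ is ordered like the columns of X, so the column names
# can be used directly as labels for the coefficients.
coefs = pd.Series(lr.coef_, index=X.columns)
print(coefs)
```

Each coefficient is read the same way as in the list above: the expected change in the target for a 1-unit increase in that feature, holding the other features fixed.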

<a id="what-is-multicollinearity"></a>
## What Is Multicollinearity?
---

Multicollinearity happens when two or more features are highly correlated with each other. The problem is that due to the high correlation, it's hard to disambiguate which feature has what kind of effect on the outcome. In other words, the features mask each other. 

There is a second, related issue called variance inflation: including correlated features increases the variability of our coefficient estimates by widening their standard errors, which in turn inflates their p-values. This can be measured with the variance inflation factor, which we will not cover here.

#### With the bikeshare data, let's compare three variables: actual temperature (`temp`), "feels like" temperature (`atemp`), and guest ridership (`casual`).
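
The comparison can be sketched on a synthetic stand-in before running it on the real data. The sketch assumes that guest ridership corresponds to the `casual` column in the data dictionary. The values are generated at random so that `atemp` is almost a copy of `temp`, which means the printed numbers say nothing about the actual bikeshare data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 500

# Synthetic stand-in for the bikeshare columns (NOT the real data):
# atemp is temp plus a little noise, so the two are nearly the same feature.
temp = rng.uniform(0, 40, size=n)
atemp = temp + rng.normal(0, 1.0, size=n)
casual = 2.0 * temp + rng.normal(0, 10.0, size=n)
demo = pd.DataFrame({"temp": temp, "atemp": atemp, "casual": casual})

# temp and atemp are almost perfectly correlated with each other.
print(demo.corr().round(2))

# Each feature on its own gets a similar slope...
for col in ["temp", "atemp"]:
    slope = LinearRegression().fit(demo[[col]], demo["casual"]).coef_[0]
    print(col, "alone:", round(float(slope), 2))

# ...but together the individual coefficients are far less certain,
# because the model cannot tell the two features apart.
both = LinearRegression().fit(demo[["temp", "atemp"]], demo["casual"])
print("together:", {name: round(float(c), 2)
                    for name, c in zip(["temp", "atemp"], both.coef_)})
```

On the real data, the equivalent first step would be something like `bikes[['temp', 'atemp', 'casual']].corr()`.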

<a id='assumptions'></a>

## Assumptions of MLR

---

Like SLR, there are assumptions associated with MLR. Luckily, they're quite similar to the SLR assumptions.

1. **Linearity:** $Y$ must have an approximately linear relationship with each independent $X_i$.
2. **Independence:** Errors (residuals) $\varepsilon_i$ and $\varepsilon_j$ must be independent of one another for any $i \ne j$.
3. **Normality:** The errors (residuals) follow a Normal distribution with mean 0.
4. **Equality of Variances**: The errors (residuals) should have a roughly constant spread (variance), regardless of the value of the $X_i$ predictors. (There should be no discernible relationship between the $X$ predictors and the residuals.)
5. **Independence of Predictors**: The independent variables $X_i$ and $X_j$ must be independent of one another for any $i \ne j$.

The mnemonic LINEI is a useful way to remember these five assumptions.
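
A few of these assumptions can be given a quick numeric sanity check once a model is fitted. The sketch below uses synthetic data built to satisfy the assumptions, with hypothetical column names `x1` and `x2`. It is a rough first look, not a substitute for residual plots or formal tests.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data built to satisfy the assumptions (illustration only).
rng = np.random.default_rng(3)
X = pd.DataFrame({
    "x1": rng.normal(0, 1, size=400),
    "x2": rng.normal(0, 1, size=400),
})
y = 1.0 + 2.0 * X["x1"] - 1.0 * X["x2"] + rng.normal(0, 1.0, size=400)

lr = LinearRegression().fit(X, y)
fitted = lr.predict(X)
residuals = y - fitted

# Mean 0: with an intercept, least squares residuals average to ~0
# by construction, so this mostly confirms the fit ran as expected.
print("residual mean:", round(residuals.mean(), 6))

# Equality of variances: the residual spread should look similar
# for low and high fitted values.
low = residuals[fitted <= np.median(fitted)]
high = residuals[fitted > np.median(fitted)]
print("residual std (low, high):", round(low.std(), 2), round(high.std(), 2))

# Independence of predictors: off-diagonal correlations should be near 0.
print(X.corr().round(2))
```

A histogram of `residuals` and a scatter plot of `residuals` against `fitted` are the usual visual companions to these numbers.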