# Multivariate Linear Regression

## Linear regression of multiple independent variables

- Models the relationship between multiple input independent variables (explanatory or feature variables) and a continuous output dependent variable (target or response variable) using a linear model.  The model remains linear in that the output is a linear combination of the input variables.  

> $y = c_{0} + c_{1}x_{1} + c_{2}x_{2} + c_{3}x_{3} + \dots + c_{n}x_{n}$

- The algorithm attempts to find the “best” choices of values for the parameters, which in a linear regression model are the coefficients, $c_{i}$, in order to make the formula as “accurate” as possible.  

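- The algorithm determines *'best'* by minimizing the error, measured as the Sum of Squared Errors (SSE) over the $m$ observations:   $\sum_{i=1}^{m}(\hat{y}_{i}-y_{i})^2$, where $\hat{y}_{i}$ is the model's prediction for observation $i$ and $y_{i}$ is the observed value.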

- Once estimated, the parameters  (intercept and coefficients) allow the value of the dependent variable to be obtained from the values of the independent variables. 
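The fitting step described above can be sketched in Python. This is a minimal illustration, assuming NumPy is available and using made-up toy data (not data from this lesson); the target is built from known coefficients so the recovered parameters are easy to check.

```python
import numpy as np

# Made-up toy data: 6 observations of 2 independent variables (x1, x2).
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
    [6.0, 5.0],
])

# Build y from known parameters so the fit can be verified: y = 1 + 2*x1 + 3*x2
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first fitted parameter is the intercept c0.
design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: the parameters that minimize the sum of squared errors.
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)

# Predictions and the SSE for the fitted model.
y_hat = design @ coefs
sse = float(np.sum((y_hat - y) ** 2))

print("intercept and coefficients:", np.round(coefs, 4))
print("SSE:", round(sse, 6))
```

Because the toy target has no noise, the fitted parameters should land on the values used to build it and the SSE should be essentially zero; real data will leave a nonzero SSE.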

### Curse of Dimensionality
- As the dimensionality of the feature space (the number of independent variables) increases, the number of possible configurations can grow exponentially, so a fixed set of observations covers a shrinking fraction of them.  This is visualized below, showing fewer observations per region as dimensionality increases.

![curse_of_dimensionality.png](curse_of_dimensionality.png)

image source:  https://www.kdnuggets.com/2015/03/deep-learning-curse-dimensionality-autoencoders.html/2
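The effect can also be simulated. The sketch below is an illustration only: it scatters a fixed number of random points in a unit cube, splits each axis into 5 bins (an arbitrary choice), and reports what fraction of the resulting cells contain at least one point. The function name `occupied_fraction` and all parameter values are hypothetical.

```python
import random


def occupied_fraction(n_points, n_dims, bins_per_dim=5, seed=0):
    """Fraction of grid cells that contain at least one random point."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Assign the point to a cell by binning each coordinate in [0, 1).
        cell = tuple(int(rng.random() * bins_per_dim) for _ in range(n_dims))
        occupied.add(cell)
    return len(occupied) / bins_per_dim ** n_dims


# Same 100 observations, increasing number of dimensions.
for d in (1, 2, 3, 4):
    total_cells = 5 ** d
    print(d, total_cells, round(occupied_fraction(100, d), 3))
```

With 100 points and 4 dimensions there are 625 cells, so at most 16% of them can be occupied no matter how the points fall.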


### Multivariate Linear Regression 

![multivariate.png](multivariate.png)

image source:  http://nbviewer.ipython.org/urls/s3.amazonaws.com/datarobotblog/notebooks/multiple_regression_in_python.ipynb


## Demonstrate in Excel

- Workbook: Regression_Examples


- Data Worksheet: Multi_Data


- Output:

    - Multi_Output_1

    - Multi_Output_2
    
    - Multi_Output_3
    
    - Multi_Output_4
    
    - Multi_Output_5
    

- Meta-parameters:

    - Select the range for your response/target/dependent variable, your $y$.

    - Select the range for your explanatory/feature/independent variables, your $x_{i}$.  *In Excel, all independent variables must be in consecutive columns, followed by the dependent variable.*
    
    - Check Labels & Confidence Level: 95%
    
    - Output options: New Worksheet Ply: Name new worksheet (Multi_Output_n)
    
    - Check Residuals

### Analysis 1

#### Exam1, Exam2 & Exam3 with 14 Observations

#### Data

> ![Analysis1.png](Analysis1.png)

#### Results

> ![analysis1_output.png](analysis1_output.png)

#### Summary

1. Only 12 degrees of freedom: too many variables for the limited sample size. Curse of dimensionality!

2. The p-values are too high to be significant

3. Try removing a variable

### Analysis 2

#### Exam1 & Exam2 with 14 Observations

#### Data

> ![Analysis2.png](Analysis2.png)

#### Results

> ![analysis2_output.png](analysis2_output.png)

#### Summary

1. p-values for intercept and exam1 look better now
2. p-value for exam2 is still not significant
3. Let's assume we can generate more observations, either natively or through resampling methods such as bootstrapping....
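As a rough sketch of what bootstrapping means here: rows are drawn at random, with replacement, from the observations already collected. The helper name `bootstrap_sample` and the rows below are hypothetical, not the lesson's data.

```python
import random


def bootstrap_sample(rows, n_samples, seed=None):
    """Draw n_samples rows with replacement from the observed rows."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(n_samples)]


# Hypothetical (exam1, exam2, final_grade) rows.
observed = [(100, 90, 96), (95, 92, 95), (80, 85, 85), (70, 65, 72)]

# Ten resampled rows; some originals will repeat, others may be left out.
resampled = bootstrap_sample(observed, n_samples=10, seed=42)
for row in resampled:
    print(row)
```

Note that resampled rows are copies of existing observations, so they carry no new information; bootstrapping is more commonly used to gauge how much the estimated coefficients vary from sample to sample.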


### Analysis 3

#### Exam1 & Exam2 with 104 Observations

#### Data

> ![Analysis3.png](Analysis3.png)

#### Results

> ![analysis3_output.png](analysis3_output.png)

#### Summary

- p-values for intercept and exam1 are significant
- p-value for exam2 is not significant
- try adding exam3 back into the mix...


### Analysis 4
#### Exam1, Exam2 & Exam3 with 104 Observations

#### Data

> ![Analysis4.png](Analysis4.png)

#### Results

> ![analysis4_output.png](analysis4_output.png)

#### Summary

- p-values for intercept, exam1 and exam3 are significant
- p-value for exam2 is not significant
- remove exam 2 from the model and run analysis...

### Analysis 5

#### Exam1 & Exam3 with 104 Observations

#### Data

> ![Analysis5.png](Analysis5.png)

#### Results

- All results are significant

> ![analysis5_output.png](analysis5_output.png)

#### Residuals

- SSE Total = 12,278

#### Confidence Interval


> ![analysis5_ci.png](analysis5_ci.png)

#### Coefficient of determination

- $R^2 = 0.97$  
- 97% of the variance in final grades can be explained by exam 1 and exam 3 together. 
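For reference, $R^2$ can be computed by hand as $1 - SSE/SST$, where $SST$ is the total sum of squared deviations of the observed values from their mean. A minimal sketch with made-up grades and predictions (not the Analysis 5 data):

```python
def r_squared(y_actual, y_predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(y_actual) / len(y_actual)
    # Unexplained variation: squared distance between prediction and observation.
    sse = sum((y_hat - y) ** 2 for y, y_hat in zip(y_actual, y_predicted))
    # Total variation: squared distance between each observation and the mean.
    sst = sum((y - mean_y) ** 2 for y in y_actual)
    return 1 - sse / sst


# Made-up final grades and model predictions.
actual = [70.0, 80.0, 90.0, 100.0]
predicted = [72.0, 79.0, 91.0, 98.0]

r2 = r_squared(actual, predicted)
print(round(r2, 2))  # SSE = 10 and SST = 500 here, so about 0.98
```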

#### Regression Function

- $finalgrade = 11.39 + 0.58 \cdot exam1 + 0.29 \cdot exam3$
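Applying the regression function is a direct substitution. The sketch below uses the intercept and coefficients reported above and assumes the second slope belongs to exam 3, since Analysis 5 models final grade from exam 1 and exam 3; the function name and the input scores are hypothetical.

```python
def predict_final_grade(exam1, exam3, intercept=11.39, c_exam1=0.58, c_exam3=0.29):
    """Predicted final grade from the Analysis 5 regression function."""
    return intercept + c_exam1 * exam1 + c_exam3 * exam3


# Hypothetical student: 90 on exam 1 and 85 on exam 3.
predicted = predict_final_grade(exam1=90, exam3=85)
print(round(predicted, 2))  # 11.39 + 52.20 + 24.65 = 88.24
```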

## Multivariate exercises
### Using telco_churn database, model total charges as a linear function of tenure and monthly charges

1. Using the telco_churn database, extract a table where each observation (row) is a customer *with a month-to-month contract* and the columns are the following variables (and only these variables): customer id, tenure, monthly charges, and total charges. 


2. Export the table to a csv


3. Import the csv into the excel file regression_exercises.xlsx


4. Perform a regression analysis in excel. Check Labels, Confidence Level: 95%, and *Residuals* option.  Place the summary output onto a new worksheet labeled "Multvariate_Exercise".  After that, go to the new worksheet and insert 2 new columns to the left.  


5. Translate the resulting $R^2$ value into a sentence that relates to your specific analysis/data. (in cell:  Multvariate_Exercise!B2)
    

6. Translate the intercept coefficient into a sentence that explains how it relates to your analysis/data. (in cell: Multvariate_Exercise!B3) 
    

7. Translate the second and third coefficients into sentences that explain how they relate to your analysis/data.  (in cell: Multvariate_Exercise!B4)


8. Write the linear function in the form of $y = c_{0} + c_{1}x_{1} + c_{2}x_{2}$ using the parameters that were computed from the regression analysis and the variable names for y and x specific to your data. (in cell: Multvariate_Exercise!B5)


9. Check if your results are reliable (statistically significant) by analyzing the test statistics for the regression function and coefficients.  Summarize your findings, and why they are or are not significant, in 1 or more sentences. (in cell: Multvariate_Exercise!B6)


10. Manually compute the sum of squared error of the residuals using only the definition of SSE and the basic operations that make up the SSE definition. Use column E to compute, placing your final summation in cell Multvariate_Exercise!B7.


11. Write an if statement to check whether your manually computed SSE matches the regression output SSE.  (in cell: Multvariate_Exercise!B8)


12. Manually compute the mean squared error of the residuals using only the definition of MSE and the basic operations that make up the MSE definition. Use column F to compute, placing your final value in cell Multvariate_Exercise!B9.


13. Write an if statement to check whether your manually computed MSE matches the regression output MSE. (in cell: Multvariate_Exercise!B10)


14. What is your confidence interval for the intercept?  Answer in a sentence that would describe it to a non-technical co-worker or target audience using the data you are analyzing and the meaning of a confidence interval.  (in cell: Multvariate_Exercise!B11)


15. Analyzing the residual plot, is there more work to do to capture more information related to the variance?  Why or why not? If so, what variable would you add next?  (in cell: Multvariate_Exercise!B12:B13)
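As a cross-check for the manual SSE and MSE steps, the same arithmetic can be sketched in Python. The residuals below are made up, and `reported_sse` is a stand-in for the value you would copy from the regression output. One assumption worth flagging: the MS Residual in Excel's ANOVA table divides the SSE by the residual degrees of freedom (observations minus predictors minus 1), whereas some definitions of MSE divide by the number of observations.

```python
import math

# Made-up residuals (observed minus predicted), not the telco_churn results.
residuals = [1.5, -2.0, 0.5, -1.0, 1.0]
n_predictors = 2  # tenure and monthly charges

# SSE: square each residual, then add them up.
squared_residuals = [r ** 2 for r in residuals]
sse = sum(squared_residuals)

# MSE as in the ANOVA table: SSE divided by the residual degrees of freedom.
df_residual = len(residuals) - n_predictors - 1
mse = sse / df_residual

# Compare against the reported value with a tolerance, since rounding
# in the output can make an exact equality test fail.
reported_sse = 8.5  # stand-in for the regression output value
sse_matches = math.isclose(sse, reported_sse, rel_tol=1e-9)

print(sse, mse, sse_matches)
```

The same tolerance idea applies to the Excel if statements: comparing rounded values is usually safer than testing for exact equality.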