# Univariate Linear Regression

- Single feature and a continuous target

- $y = mx + b$, where $m$ is the slope and $b$ is the y-intercept

- The algorithm finds the best values for the parameters $m$ and $b$

- The formula with those parameters is then our `line of best fit`.

- "Best" is determined by minimizing the vertical offsets, i.e. the error between the actual values and the estimated or predicted values. 

- This can be measured using one of many functions, including:  

    - Mean Squared Error, $MSE = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_{i}-y_{i})^2$

    - Sum of Squared Errors, $SSE = \sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^2$
    
    - Mean Absolute Error, $MAE = \frac{1}{n}\sum_{i=1}^{n}|y_{i}-\hat{y}_{i}|$
    
    - Explained Sum of Squares, $ESS = \sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^2$
    
    - Residual Sum of Squares, $RSS = \sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2 = \sum_{i=1}^{n}\epsilon_{i}^{2}$
    
    - Total Sum of Squares, $TSS = \sum_{i=1}^{n}(y_{i}-\bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^2 + \sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2$ 
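
The error functions listed above can be sketched in a few lines of plain Python. This is only an illustration: the lists `y` and `y_hat` are hypothetical toy values (not the lesson's dataset), and MAE is taken here as the mean absolute difference between actual and predicted values.

```python
# Hypothetical toy data: actual targets and a fitted line's predictions.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

n = len(y)
y_bar = sum(y) / n  # mean of the actual values

# Sum of squared errors (identical to the residual sum of squares, RSS).
sse = sum((p - a) ** 2 for a, p in zip(y, y_hat))
# Mean squared error: SSE averaged over the observations.
mse = sse / n
# Mean absolute error: average size of the vertical offsets.
mae = sum(abs(a - p) for a, p in zip(y, y_hat)) / n
# Explained sum of squares: predictions versus the mean.
ess = sum((p - y_bar) ** 2 for p in y_hat)
# Total sum of squares: actual values versus the mean.
tss = sum((a - y_bar) ** 2 for a in y)

print(f"SSE={sse:.2f} MSE={mse:.2f} MAE={mae:.2f} ESS={ess:.2f} TSS={tss:.2f}")
```

In this toy example `tss` equals `ess + sse` only because `y_hat` comes from a least-squares line with an intercept; arbitrary predictions would not satisfy that identity.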


| Best Fit Line | Vertical Offsets |
| ------------- | -----------------|
| ![univariate_bestfitline.png](univariate_bestfitline.png) | ![simple_regression.png](simple_regression.png) |

## Train the Regression Algorithm

1. On the Data tab, in the Analysis group, click Data Analysis.

![univariate_excel_dataanalysis.png](univariate_excel_dataanalysis.png)


2. Select Regression and click OK.

![univariate_excel_regression.png](univariate_excel_regression.png)


3. Set your meta-parameters:

    - Select the range for your target, $y$
    
    - Select the range for your feature, $x$
    
    - Check Labels & Confidence Level: 95%
    
    - Output options: New Worksheet Ply: Name new worksheet (Uni_Example_Output)
    
    - Check Residuals, Residual Plots, Line Fit Plots
    
    - Check Normal Probability Plots
    
    - Click OK

![excel_regression_hyperparams.png](excel_regression_hyperparams.png)
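
For a single feature, the least-squares slope and intercept can also be computed by hand. The sketch below is a minimal illustration of that calculation, assuming ordinary least squares is what the regression tool fits; `fit_line` is a hypothetical helper name and the data is a toy example, not the lesson's workbook.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Co-variation of x and y, and variation of x, around their means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The least-squares line always passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b, m = fit_line(x, y)
print(f"y = {m:.2f}x + {b:.2f}")  # prints "y = 0.60x + 2.20"
```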

## Evaluate Results

![Uni_Example_Data.png](Uni_Example_Data.png)

### How well does your model fit the data?

#### Coefficient of Determination, $R^2$

R-squared tells you how well your model fits the data by measuring the strength of the relationship between your model and the dependent variable. However, it is not a formal test for the relationship. The F-test of overall significance is the hypothesis test for this relationship. If the overall F-test is significant, you can conclude that R-squared does not equal zero, and the correlation between the model and dependent variable is statistically significant.

$R^{2}$ is the ratio of the explained sum of squares to the total sum of squares: $R^{2} = \frac{ESS}{TSS}$.  
$R^{2} \times 100$ = percent of the variance in $y$ (target) explained by $x$ (feature)

![r_square.png](r_square.png)
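
As a quick check of the ratio definition, the sketch below computes $R^2$ two ways on hypothetical toy values (`y` are actual targets, `y_hat` are predictions from a least-squares line). The second form, $1 - RSS/TSS$, agrees with the first only because TSS = ESS + RSS for such a fit.

```python
# Hypothetical toy data: actual targets and least-squares predictions.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

y_bar = sum(y) / len(y)

ess = sum((p - y_bar) ** 2 for p in y_hat)          # explained
rss = sum((a - p) ** 2 for a, p in zip(y, y_hat))   # residual
tss = sum((a - y_bar) ** 2 for a in y)              # total

r2 = ess / tss          # ratio of explained to total variation
r2_alt = 1 - rss / tss  # equivalent form for a least-squares fit

print(f"R^2 = {r2:.2f}, i.e. {r2 * 100:.0f}% of the variance in y is explained by x")
```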


#### The F-test of overall significance 

The F-test indicates whether your linear regression model provides a better fit to the data than a model that contains no independent variables. 
F-tests can evaluate multiple model terms simultaneously, which allows them to compare the fits of different linear models. In contrast, t-tests can evaluate just one term at a time.
In statistical output, you can find the overall F-test in the ANOVA table.

The F-test for overall significance has the following two hypotheses:

- **The null hypothesis** states that the model with no independent variables fits the data as well as your model. (We fail to reject it when Significance F > 0.05.)   

- **The alternative hypothesis** says that your model fits the data better than the intercept-only model. (We reject the null in favor of it when Significance F <= 0.05.)  

`F` in the ANOVA table is the F test statistic, while `Significance F` is the p-value for the F-test. 
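
The F statistic itself is a ratio of two mean squares from the ANOVA table. The sketch below shows that arithmetic on hypothetical toy values with `k = 1` independent variable; it does not compute the p-value, which needs the F distribution (for example `scipy.stats.f.sf(f_stat, k, n - k - 1)` if SciPy happens to be available, which is not assumed here).

```python
# Hypothetical toy data: actual targets and least-squares predictions.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

n = len(y)  # number of observations
k = 1       # number of independent variables (univariate regression)
y_bar = sum(y) / n

ess = sum((p - y_bar) ** 2 for p in y_hat)          # regression SS
rss = sum((a - p) ** 2 for a, p in zip(y, y_hat))   # residual SS

ms_regression = ess / k            # mean square, df = k
ms_residual = rss / (n - k - 1)    # mean square, df = n - k - 1
f_stat = ms_regression / ms_residual

print(f"F = {f_stat:.2f} with ({k}, {n - k - 1}) degrees of freedom")
```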

##### Understanding the results

For the model with no independent variables, the intercept-only model, all of the model’s predictions equal the mean of the dependent variable. Consequently, if the overall F-test is statistically significant, your model’s predictions are an improvement over using the mean.

- If Significance F is less than 0.05, you're OK => conclude that your regression model fits the data better than the model with no independent variables, meaning the independent variables in your model improve the fit.   
- If Significance F is greater than 0.05, it's probably better to stop using this set of features. 

If none of your independent variables are statistically significant, you can expect the overall F-test to also not be statistically significant.   
Occasionally, however, the tests can produce conflicting results. This disagreement can occur because the F-test of overall significance assesses all of the coefficients jointly whereas the t-test for each coefficient examines them individually. For example, the overall F-test can find that the coefficients are significant jointly while the t-tests can fail to find significance individually.   
How can this happen? The F-test sums the predictive power of all independent variables and determines that it is unlikely that all of the coefficients equal zero. However, it’s possible that each variable isn’t predictive enough on its own to be statistically significant. In other words, your sample provides sufficient evidence to conclude that your model is significant, but not enough to conclude that any individual variable is significant.

![sign_F.png](sign_F.png)


#### Residuals

Identifying if the regression model is statistically significant is a critical step. However, you must also check your residual plots to determine whether the results are trustworthy!

- The residual, $\epsilon$, a.k.a. the vertical offset, is the difference between the actual target, $y$, and the predicted target, $\hat{y}$.

- $\epsilon = y - \hat{y}$ (signed: positive when the point lies above the line, negative when below)

![Uni_Resid2.png](Uni_Resid2.png)
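
A small sketch of the residual calculation, taking the residual as actual minus predicted (signed), which is consistent with $y = w_{0} + w_{1}x + \epsilon$. The coefficients `w0` and `w1` are assumed to come from a least-squares fit of the hypothetical toy data shown.

```python
# Hypothetical toy data and the least-squares coefficients fitted to it.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
w0, w1 = 2.2, 0.6  # intercept and slope (assumed already estimated)

# Residual = actual y minus predicted y, one per observation.
residuals = [yi - (w0 + w1 * xi) for xi, yi in zip(x, y)]

for xi, e in zip(x, residuals):
    print(f"x={xi:.0f}  residual={e:+.2f}")

# For a least-squares line with an intercept, residuals sum to (about) zero.
print(f"sum of residuals = {sum(residuals):.2e}")
```

In a residual plot these values are drawn against `x`; a healthy plot shows no obvious pattern around zero.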

### How important is each independent variable in predicting y?

#### The T-test of independent variable significance

- **The null hypothesis** states that the model without this variable fits the data as well as your model, i.e. the variable's coefficient is zero. (P-value > 0.05)   

- **The alternative hypothesis** says that your model fits the data better with that independent variable than without it. (P-value <= 0.05)  

##### Understanding the results

Any independent variable with a p-value of <= 0.05 contributes to a better model than without it.  

- Remove any independent variables where the p-value for the t statistic is > 0.05. Re-run. If the Significance F value increases substantially, then add back the variable with the lowest p-value of those removed. Check again. Repeat. 

![p_val.png](p_val.png)
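
The t statistic for the slope is the estimated coefficient divided by its standard error. The sketch below works through that on hypothetical toy data using the usual simple-regression formulas (an assumption about what the tool reports, not taken from the lesson's output); turning `t_stat` into a p-value requires the t distribution and is left out.

```python
import math

# Hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual variance uses n - 2 degrees of freedom (two estimated parameters).
rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
resid_var = rss / (n - 2)

# Standard error of the slope and the resulting t statistic.
se_slope = math.sqrt(resid_var / s_xx)
t_stat = slope / se_slope

print(f"slope={slope:.3f}  SE={se_slope:.3f}  t={t_stat:.3f}")
```

With a single independent variable, `t_stat ** 2` equals the overall F statistic, so the two tests agree in the univariate case.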


### Parameters and confidence

#### Coefficient 1: y-intercept 

Often labeled as $b$, $b_{0}$, or $w_{0}$ and referred to as the y-intercept, it is where the regression line crosses the y-axis, i.e. the value of $y$ when $x = 0$. 

#### Coefficient 2: slope 

Often labeled as $m$, $b_{1}$, or $w_{1}$ and referred to as the slope, it is the amount we expect $y$ to increase by when $x$ increases by 1 (or decrease by if $b_{1} < 0$). 

- $w_{1} = \frac{\Delta y}{\Delta x}$

![Uni_Coeff.png](Uni_Coeff.png)

#### Confidence Interval

In this problem $x$ is exam 1 and $y$ is final grade, so $b_{1}$ is our estimate of the number of points the final grade increases for every 1-point increase in exam 1. We are 95% confident that the true value of the slope is between .663 and .841. That is, if we were to collect new data generated from the same distribution and rebuild the interval each time, about 19 out of every 20 such intervals would contain the true slope.  

- 95% CI => intervals built this way contain the true parameter about 95% of the time.
- A narrow interval means a more precise estimate of the parameter.
- A wide interval indicates a less precise estimate of the parameter.

![Uni_CI.png](Uni_CI.png)
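
A confidence interval for a coefficient is the estimate plus or minus a critical t value times its standard error. The sketch below illustrates that with hypothetical numbers, not the lesson's exam data: `slope_confidence_interval` is a made-up helper name, and the critical value 3.182 is the two-sided 95% value for 3 degrees of freedom looked up from a t-table.

```python
def slope_confidence_interval(slope, se_slope, t_crit):
    """Return (lower, upper) bounds: slope +/- t_crit * standard error."""
    margin = t_crit * se_slope
    return slope - margin, slope + margin


# Hypothetical values from a toy fit with n = 5 observations (df = n - 2 = 3).
slope = 0.6
se_slope = 0.2828
t_crit = 3.182  # two-sided 95% critical value for df = 3 (from a t-table)

lower, upper = slope_confidence_interval(slope, se_slope, t_crit)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

Because this toy interval contains zero, the slope would not be judged significant at the 0.05 level; with more data the standard error shrinks and the interval narrows.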

#### Percentiles & Values

![uni_percentile.png](uni_percentile.png)

### Sum of Squares

#### Explained (Regression) Sum of Squares  

- Tells you how much of the variation in the target your model explained. 

- $ESS = \sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^2$

![Explained_SS.png](Explained_SS.png)


#### Residual Sum of Squares

- Tells you how much of the target's variation your model did not explain. It is the sum of the squared differences between the actual $y$ and the predicted $\hat{y}$.  

- $RSS = \sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2 = \sum_{i=1}^{n}\epsilon_{i}^{2}$

- The smaller the residual sum of squares, the better your model fits your data.

- The greater the residual sum of squares, the worse your model fits your data. 

- A value of zero means your model is a perfect fit. 

![Residual_SS.png](Residual_SS.png)


#### Total Sum of Squares 

- Total Sum of Squares = Explained Sum of Squares + Residual Sum of Squares

- $\sum_{i=1}^{n}(y_{i}-\bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^2 + \sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2$ 

![Total_SS.png](Total_SS.png)
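
The identity above is not automatic; it holds for a least-squares line that includes an intercept. A short derivation sketch follows, writing $\epsilon_{i} = y_{i}-\hat{y}_{i}$ and $\hat{y}_{i} = w_{0} + w_{1}x_{i}$, and using the two least-squares conditions $\sum \epsilon_{i} = 0$ and $\sum x_{i}\epsilon_{i} = 0$ (standard results that are assumed here rather than shown in this lesson).

$$
\begin{aligned}
\sum_{i=1}^{n}(y_{i}-\bar{y})^2
&= \sum_{i=1}^{n}\big[(\hat{y}_{i}-\bar{y}) + (y_{i}-\hat{y}_{i})\big]^2 \\
&= \sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^2 + \sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2 + 2\sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})\,\epsilon_{i}
\end{aligned}
$$

$$
\sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})\,\epsilon_{i}
= w_{0}\sum_{i=1}^{n}\epsilon_{i} + w_{1}\sum_{i=1}^{n}x_{i}\epsilon_{i} - \bar{y}\sum_{i=1}^{n}\epsilon_{i} = 0
$$

The cross term vanishes, leaving $TSS = ESS + RSS$.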

### The Regression Function

#### The predicted values, $\hat{y}$

- $\hat{y} = w_{0} + w_{1}x$


#### The original values

- $y = w_{0} + w_{1}x + \epsilon$

- $\epsilon$ = Residual
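
Once $w_{0}$ and $w_{1}$ are known, the regression function is a one-line calculation. The sketch below uses a hypothetical `predict` helper and toy coefficients to show a prediction for a new $x$ and how an observed value splits into prediction plus residual.

```python
def predict(x, w0, w1):
    """Regression function: predicted y for a given x."""
    return w0 + w1 * x


# Hypothetical coefficients from a toy fit (intercept, slope).
w0, w1 = 2.2, 0.6

# Predict for an x that was not in the training data.
y_hat_new = predict(6.0, w0, w1)
print(f"predicted y at x=6: {y_hat_new:.2f}")

# An observed point decomposes as y = y_hat + residual.
x_obs, y_obs = 3.0, 5.0
y_hat_obs = predict(x_obs, w0, w1)
residual = y_obs - y_hat_obs
print(f"y={y_obs:.2f} = y_hat {y_hat_obs:.2f} + residual {residual:+.2f}")
```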


![Uni_Resid2.png](Uni_Resid2.png)

![univariate_excel_plots.png](univariate_excel_plots.png)

## Exercises

### telco_churn: Predict total charges using tenure

Using the telco_churn database, model total charges as a linear function of tenure.

*Your Excel workbook will be submitted via Google Classroom. Answers to specific questions will be submitted through a form posted with the assignment.*

1. Acquire your data using SQL from the telco_churn database   

    - Extract a table where each observation (row) is a customer *with a month-to-month contract* and your columns are customer id, tenure, and total charges.  
    - Export the table to a csv.  
    - Import the csv into a new Excel workbook, `regression_exercises.xlsx`.


2. Perform a univariate regression analysis. The answers to the following questions will be submitted via the Google Form posted in the classroom. 

    - How well does your model fit the data?
    - Translate the resulting $R^2$ value into a sentence that relates to your specific analysis/data.
    - Translate the intercept coefficient into a sentence that explains how it relates to your analysis/data.
    - Is your independent variable important in predicting the customer's total charges?
    - Translate the second coefficient into a sentence that explains how it relates to your analysis/data.      
    - Translate the confidence interval into a meaningful statement about the 2nd coefficient that a non-statistically informed employee would understand. 
    - Write the linear function in the form of $y = mx + b$ using the parameters that were computed from the regression analysis. 
    - What is the set of charges that lie above the 90th percentile? (Write it in interval notation.)


3. Manually compute the following evaluation metrics:

    - SSE, sum of squared error, of the residuals using only the definition of SSE and the basic operations that make up the SSE definition.  Then write a function to validate your manually computed SSE by comparing it with the regression output SSE.  
    - MSE, mean squared error, of the residuals using only the definition of MSE and the basic operations that make up the MSE definition. Then write a function to validate your manually computed MSE by comparing it with the regression output MSE. 


4. Plot the residuals
    - Analyze the residual plot and answer the following question in the Google Form posted with this assignment in Google Classroom: Is there more work to do to capture more information related to the variance? Why or why not? If so, what feature would you add next?
