These scripts are based on the lecture notes from STAT 501 - Regression Methods, a course from Pennsylvania State University ( Penn State ).
These notes describe the basic concepts of simple linear regression: how to measure the strength of the linear association between the predictor and the response, and how to interpret the coefficient of determination ( r-squared ).
- The Best Line is the one that makes the prediction errors as small as possible in some overall sense
- Least Squares Criterion: Minimize the sum of the squared prediction errors.
- Error ( Residual ): y - yhat
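As a sketch of the least squares criterion, the slope and intercept that minimize the sum of squared prediction errors can be computed directly with NumPy (the data below is made up for illustration):

```python
import numpy as np

# Toy data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares estimates for yhat = b0 + b1 * x.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

yhat = b0 + b1 * x
residuals = y - yhat            # error (residual) = y - yhat

# The quantity the criterion minimizes: sum of squared prediction errors.
sse = np.sum(residuals ** 2)
print(b0, b1, sse)
```

A useful property of the least squares line: the residuals always sum to zero.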
The simple linear regression model rests on the four conditions below:
- The mean of the response, at each value of the predictor, is a Linear function of x
- The errors are Independent
- The errors, at each value of the predictor, are Normally Distributed
- The errors, at each value of the predictor, have Equal Variance
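A minimal sketch of data that satisfies all four conditions, simulated with NumPy (the parameter values are illustrative assumptions, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)

# Illustrative true parameters (assumed values).
beta0, beta1, sigma = 2.0, 0.5, 1.0

# Errors drawn independently from one normal distribution:
# Independent, Normally distributed, Equal variance at every x.
errors = rng.normal(0.0, sigma, n)

# Mean of the response is a Linear function of x.
y = beta0 + beta1 * x + errors
```

Data generated this way is exactly what the simple linear regression model assumes; violating any one line (e.g. making sigma depend on x) breaks the corresponding condition.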
Measure the strength of the relationship
- SSR ( Regression Sum of Squares ): Quantifies how far the estimated regression line, yhat, is from the horizontal "no relationship" line, ybar.
- SSE ( Error Sum of Squares ): Quantifies how much the data points, yi, vary around the estimated regression line, yhat.
- SSTO ( Total Sum of Squares ): Quantifies how much the data points, yi, vary around their mean, ybar.
- SSTO = SSR + SSE
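The decomposition SSTO = SSR + SSE can be verified numerically on a least squares fit; this sketch uses toy data invented for illustration:

```python
import numpy as np

# Toy data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Least squares fit.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ssr = np.sum((yhat - y.mean()) ** 2)   # regression sum of squares
sse = np.sum((y - yhat) ** 2)          # error sum of squares
ssto = np.sum((y - y.mean()) ** 2)     # total sum of squares

print(ssr, sse, ssto)                  # ssr + sse equals ssto
```

The identity holds exactly (up to floating point) for any least squares fit, not just this data.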
r-squared = SSR / SSTO = 1 - SSE / SSTO is the percentage of the variation in y that is reduced by taking into account the predictor x; equivalently, the percentage of the variation in y that is 'explained by' the variation in predictor x.
- r-squared is a number between 0 and 1.
- If r-squared = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!
- If r-squared = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!
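Both formulas for r-squared give the same number, as this sketch on made-up data shows:

```python
import numpy as np

# Toy data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Least squares fit.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ssr = np.sum((yhat - y.mean()) ** 2)
sse = np.sum((y - yhat) ** 2)
ssto = np.sum((y - y.mean()) ** 2)

r2_from_ssr = ssr / ssto          # fraction of variation explained by x
r2_from_sse = 1 - sse / ssto      # same value, via the error sum of squares
print(r2_from_ssr, r2_from_sse)
```

The two expressions agree because SSTO = SSR + SSE.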
Measure the sign of the relationship
- The ( Pearson ) correlation coefficient r carries the sign of the estimated slope, while r-squared does not: r = ±sqrt( r-squared ), with the sign of b1.
Some cautions
- The r-squared quantifies the strength of a linear relationship. An r-squared of 0 tells us only that if there is a relationship between x and y, it is not linear.
- A large r-squared value should not be interpreted as meaning that the estimated regression line fits the data well. A large value suggests that taking the predictor into account is better than not doing so; it doesn't tell us whether we could still do better.
- The r-squared can be greatly affected by just one data point (or a few data points).
- Correlation (or association) does not imply causation.
- Ecological correlations, correlations that are based on rates or averages, tend to overstate the strength of an association.
- A statistically significant r-squared does not imply that the slope Beta1 is meaningfully different from 0.
- A large r-squared value does not necessarily mean that a useful prediction of the response can be made. The resulting prediction intervals or confidence intervals may still be too wide to be useful.
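The caution that one data point can greatly affect r-squared is easy to demonstrate: below, a cloud of simulated points with essentially no linear relationship gets one extreme point appended, and r-squared jumps (the data and seed are illustrative assumptions):

```python
import numpy as np

def r_squared(x, y):
    """r-squared of the least squares fit of y on x."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    yhat = b0 + b1 * x
    return np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = rng.normal(5.0, 1.0, 30)      # y does not depend on x at all

base = r_squared(x, y)

# Append a single extreme, influential point and recompute.
x2 = np.append(x, 50.0)
y2 = np.append(y, 50.0)
inflated = r_squared(x2, y2)

print(base, inflated)             # one point drives r-squared up sharply
```

This is why r-squared should always be read alongside a scatterplot of the data.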
This lesson presents two alternative methods for testing whether a linear association exists between the predictor x and the response y in a simple linear regression model, i.e. testing H0: Beta1 = 0 versus HA: Beta1 ≠ 0:
- The t-test for the slope
- Analysis of variance (ANOVA) F-test
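The two tests are equivalent in simple linear regression: the F-statistic equals the square of the t-statistic for the slope. A sketch computing both by hand, on made-up data:

```python
import numpy as np

# Toy data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])
n = len(x)

# Least squares fit.
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ssr = np.sum((yhat - y.mean()) ** 2)
sse = np.sum((y - yhat) ** 2)
mse = sse / (n - 2)               # mean squared error, n - 2 df

t = b1 / np.sqrt(mse / sxx)       # t-statistic for H0: Beta1 = 0
f = ssr / mse                     # ANOVA F-statistic (MSR = SSR / 1 df)

print(t, f)                       # f equals t squared
```

Because SSR = b1^2 * Sxx for a least squares fit, the identity F = t^2 holds exactly, and both tests always give the same p-value here.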