### What is a Regression?
Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).

Linear regression is the most common form of this technique, and it is usually fitted by ordinary least squares (OLS). With a single independent variable it is called simple linear regression, which establishes the linear relationship between two variables based on a line of best fit. It is thus depicted graphically as a straight line whose slope describes how a change in the independent variable is associated with a change in the dependent variable. The y-intercept represents the value of the dependent variable when the independent variable is zero. Non-linear regression models also exist, but are more complex.

#### Understanding Regression
Regression captures the correlation between variables observed in a data set and quantifies whether those correlations are statistically significant or not.

The two basic types of regression are simple linear regression and multiple linear regression, although there are non-linear regression methods for more complicated data and analysis. Simple linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y, while multiple linear regression uses two or more independent variables to predict the outcome. In a multiple regression, each coefficient describes the effect of its variable while holding all the others constant.
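To make the multiple-regression case concrete, here is a minimal pure-Python sketch that solves the least-squares normal equations for two predictors. The data are made up and generated exactly from `y = 1 + 2*x1 + 3*x2`, and the helper names `solve_linear_system` and `fit_multiple` are hypothetical. In practice a numerical library would normally do this work.

```python
def solve_linear_system(a, b):
    """Solve a @ coef = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Use the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple(rows, y):
    """Return [intercept, b1, b2, ...] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data: five observations of two predictors (x1, x2).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefs = fit_multiple(rows, y)
print([round(c, 6) for c in coefs])  # close to [1.0, 2.0, 3.0]
```

Because the toy data contain no noise, the fit recovers the generating coefficients. Each slope is the change in Y per unit change in its predictor with the other predictor held fixed.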

Regression can help finance and investment professionals as well as professionals in other businesses. Regression can also help predict sales for a company based on weather, previous sales, GDP growth, or other types of conditions. The capital asset pricing model (CAPM) is an often-used regression model in finance for pricing assets and discovering the costs of capital.

#### Calculating Regression
Linear regression models often use a least-squares approach to determine the line of best fit. The least-squares technique chooses the line that minimizes the sum of squared residuals. Each residual is the vertical distance between a data point and the regression line, and squaring it gives the corresponding square.
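For a single predictor, the least-squares line has a well-known closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The sketch below applies it to made-up data. The function name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * x")  # prints: y = 2.20 + 0.60 * x
```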

Once this process has been completed (usually done today with software), a regression model is constructed. The general form of each type of regression model is:

Simple linear regression:

![image.png](attachment:image.png)

### Types of Sum of Squares
#### Residual Sum of Squares:
The RSS allows you to determine the amount of error left between a regression function and the data set after the model has been run. You can interpret a smaller RSS figure as a regression function that is well-fit to the data while the opposite is true of a larger RSS figure.

The sum of squares error (SSE) or residual sum of squares (RSS, where residual means remaining or unexplained) is the sum of the squared differences between the observed and predicted values.

Here is the formula for calculating the residual sum of squares:

![image.png](attachment:image.png)
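As a small sketch of the calculation, $\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$, the code below compares two candidate lines on made-up data. The line `2.2 + 0.6 * x` is the least-squares line for this toy data set, and `1.0 + 1.0 * x` is an arbitrary alternative chosen for contrast.

```python
def residual_sum_of_squares(observed, predicted):
    """Sum of squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

best_fit = [2.2 + 0.6 * xi for xi in x]     # least-squares line for this data
alternative = [1.0 + 1.0 * xi for xi in x]  # arbitrary line for comparison

rss_best = residual_sum_of_squares(y, best_fit)
rss_alt = residual_sum_of_squares(y, alternative)
print(round(rss_best, 4), round(rss_alt, 4))  # prints: 2.4 4.0
```

The smaller RSS of the first line reflects its closer fit to the data.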

#### Regression Sum of Squares
The regression sum of squares measures how much of the variation in the dependent variable is captured by the regression model. For a given data set, a higher regression sum of squares means the model explains more of the total variation and therefore fits the data better. A regression sum of squares that is low relative to the total means the model explains little of the variation.

The sum of squares due to regression (SSR) or explained sum of squares (ESS) is the sum of the squared differences between the predicted values and the mean of the dependent variable. In other words, it describes how well our line fits the data.

Here is the formula for calculating the regression sum of squares:

![image.png](attachment:image.png)
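A minimal sketch of this quantity, $\mathrm{SSR} = \sum_i (\hat{y}_i - \bar{y})^2$, on made-up data follows. The predictions come from the line `2.2 + 0.6 * x`, which is the least-squares line for this toy data set.

```python
def regression_sum_of_squares(observed, predicted):
    """Sum of squared differences between predictions and the observed mean."""
    mean_y = sum(observed) / len(observed)
    return sum((p - mean_y) ** 2 for p in predicted)


# Made-up illustrative data and its least-squares predictions.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
predicted = [2.2 + 0.6 * xi for xi in x]

ssr = regression_sum_of_squares(y, predicted)
print(round(ssr, 4))  # prints: 3.6
```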

#### Sum of Squares Total
The sum of squares total (SST) or the total sum of squares (TSS) is the sum of squared differences between the observed values of the dependent variable and its overall mean. Think of it as the dispersion of the observed values around the mean, similar to the variance in descriptive statistics except that it is a sum rather than an average. SST measures the total variability of a dataset and is commonly used in regression analysis and ANOVA.

![image.png](attachment:image.png)

Where:

$y_i$: observed dependent variable

$\bar{y}$: mean of the dependent variable

If SSR equals SST, our regression model perfectly captures all the observed variability, but that’s rarely the case.

#### What Is the Relationship between SSR, SSE, and SST?
Mathematically, SST = SSR + SSE.

The rationale is the following:

The total variability of the dataset is equal to the variability explained by the regression line plus the unexplained variability, known as error.
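The decomposition can be checked numerically. The sketch below computes all three sums on made-up data, once for the least-squares line and once for an arbitrary line. The function name `sums_of_squares` is hypothetical. The identity is a property of a least-squares fit that includes an intercept, so it is not expected to hold for the arbitrary line.

```python
def sums_of_squares(observed, predicted):
    """Return (SST, SSR, SSE) for a set of observations and predictions."""
    mean_y = sum(observed) / len(observed)
    sst = sum((o - mean_y) ** 2 for o in observed)
    ssr = sum((p - mean_y) ** 2 for p in predicted)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sst, ssr, sse


# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares line for this data: SST = SSR + SSE holds.
sst, ssr, sse = sums_of_squares(y, [2.2 + 0.6 * xi for xi in x])
print(round(sst, 4), round(ssr, 4), round(sse, 4))  # prints: 6.0 3.6 2.4

# Arbitrary line that is not the least-squares fit: the identity breaks.
sst2, ssr2, sse2 = sums_of_squares(y, [1.0 + 1.0 * xi for xi in x])
print(round(sst2, 4), round(ssr2, 4), round(sse2, 4))  # prints: 6.0 10.0 4.0
```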

### What is R-Squared?
R-squared (R², or the coefficient of determination) is a statistical measure in a regression model that gives the proportion of variance in the dependent variable that can be explained by the independent variable or variables. In other words, R-squared shows how well the regression model fits the data (the goodness of fit).

For a least-squares model that includes an intercept, R-squared takes values between 0 and 1.

The most common interpretation of r-squared is how well the regression model explains observed data. For example, an r-squared of 60% reveals that 60% of the variability observed in the target variable is explained by the regression model. Generally, a higher r-squared indicates more variability is explained by the model.

Whereas correlation explains the strength of the relationship between an independent and a dependent variable, R-squared explains the extent to which the variance of one variable explains the variance of the second variable. So, if the R² of a model is 0.50, then approximately half of the observed variation can be explained by the model’s inputs.

#### Formula for R-Squared

![image.png](attachment:image.png)

Or in simple terms,

R² = 1 - (SSE / SST) = SSR / SST
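With SSE denoting the residual sum of squares and SSR the regression sum of squares, as defined earlier, R-squared can be computed as $1 - \mathrm{SSE}/\mathrm{SST}$. The sketch below does this on made-up data using the least-squares line `2.2 + 0.6 * x` for that data. The function name `r_squared` is hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(observed) / len(observed)
    sst = sum((o - mean_y) ** 2 for o in observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1 - sse / sst


# Made-up illustrative data and its least-squares predictions.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
predicted = [2.2 + 0.6 * xi for xi in x]

r2 = r_squared(y, predicted)
print(round(r2, 4))  # prints: 0.6
```

Here 60% of the variation in `y` is explained by the line, which matches SSR / SST = 3.6 / 6.0 for the same data.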

#### R-Squared vs. Adjusted R-Squared
R-squared only works as intended in a simple linear regression model with one explanatory variable. With a multiple regression made up of several independent variables, the R-squared must be adjusted.

The adjusted R-squared compares the descriptive power of regression models that include different numbers of predictors. Adding a predictor to a model never decreases R-squared and almost always increases it, so a model with more terms may seem to have a better fit simply because it has more terms. The adjusted R-squared compensates for the addition of variables: it increases only if the new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model by less than would be expected by chance.

In an overfitting condition, a misleadingly high value of R-squared is obtained, even when the model actually has a decreased ability to predict. The adjusted R-squared is less susceptible to this because it penalizes additional predictors.
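The text above does not give a formula, so the sketch below assumes the commonly used definition $\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. The R-squared values are hypothetical numbers chosen only to show the penalty at work.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared under the standard (n - 1) / (n - p - 1) correction."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical scenario: four extra predictors raise R-squared only slightly.
small_model = adjusted_r_squared(r2=0.800, n_obs=50, n_predictors=1)
large_model = adjusted_r_squared(r2=0.805, n_obs=50, n_predictors=5)

print(round(small_model, 4))  # prints: 0.7958
print(round(large_model, 4))  # prints: 0.7828
```

Although the larger model has the higher raw R-squared, its adjusted value is lower, which suggests the extra predictors add little beyond what chance would give.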

### Assumptions of Linear Regression

 - Linear Model:
According to this assumption, the relationship between the independent and dependent variables should be linear. If the relationship is non-linear, which is often the case in real-world data, the predictions made by our linear regression model will not be accurate and will deviate substantially from the actual observations.

 - No Multicollinearity in the data:
If the predictor variables are correlated among themselves, then the data is said to have a multicollinearity problem. But why is this a problem? High collinearity means that two variables vary very similarly and carry the same kind of information, which leads to redundancy in the dataset. Because of this redundancy, the complexity of the model increases without the model learning any new information or pattern, and the individual coefficient estimates become unstable. We generally try to avoid highly correlated features even when using complex models.

We can identify highly correlated features using scatter plots or a correlation heatmap.
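Alongside plots, a numerical check is to compute pairwise Pearson correlations between predictors. The sketch below uses made-up predictors, with `x2` deliberately built as roughly twice `x1`. It also reports a variance inflation factor, which for a model with exactly two predictors reduces to $1 / (1 - r^2)$. That reduction is an assumption of this example and does not carry over to models with more predictors.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up predictors: x2 is nearly 2 * x1, x3 is unrelated to x1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [5, 1, 4, 2, 6, 3]

r12 = pearson(x1, x2)
r13 = pearson(x1, x3)
vif12 = 1 / (1 - r12 ** 2)  # two-predictor special case of the VIF

print(f"corr(x1, x2) = {r12:.3f}, VIF = {vif12:.0f}")
print(f"corr(x1, x3) = {r13:.3f}")
```

A correlation close to 1 and a very large VIF flag `x1` and `x2` as redundant, while `x1` and `x3` show almost no linear relationship.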

 - Homoscedasticity of Residuals or Equal Variances:
Homoscedasticity means that the spread of the residuals from the linear regression model should be roughly constant across the range of fitted values. If the spread of the residuals varies (heteroscedasticity), the model is considered unsatisfactory because its standard errors are no longer reliable.

One can easily get an idea of the homoscedasticity of the residuals by plotting the residuals against the fitted values in a scatter plot.
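When a plot is not convenient, a rough numerical stand-in is to compare the average squared residual in the lower and upper halves of the fitted values. The sketch below does this on made-up residuals whose size grows with the fitted value. It is an informal check in the spirit of a split-sample comparison, not a formal statistical test, and the function name `spread_ratio` is hypothetical.

```python
def spread_ratio(fitted, residuals):
    """Mean squared residual for the upper half of fitted values over the lower half."""
    pairs = sorted(zip(fitted, residuals))  # order observations by fitted value
    half = len(pairs) // 2
    lower = [r for _, r in pairs[:half]]
    upper = [r for _, r in pairs[-half:]]
    mean_sq_lower = sum(r ** 2 for r in lower) / len(lower)
    mean_sq_upper = sum(r ** 2 for r in upper) / len(upper)
    return mean_sq_upper / mean_sq_lower


# Made-up example: residuals alternate in sign and grow with the fitted value.
fitted = list(range(1, 11))
residuals = [(-1) ** i * 0.1 * f for i, f in enumerate(fitted)]

ratio = spread_ratio(fitted, residuals)
print(round(ratio, 2))  # prints: 6.0
```

A ratio near 1 is consistent with equal variances. A ratio far from 1, as here, points to heteroscedasticity.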

 - No Autocorrelation:
The presence of correlation in the error terms reduces the reliability of the model’s estimates. This usually occurs in time series models, where the value at one instant depends on the value at the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard errors.

If this happens, confidence intervals and prediction intervals come out narrower than they should be. A 95% confidence interval that is too narrow has a probability lower than 0.95 of containing the actual value of the coefficient.
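One widely used diagnostic for first-order autocorrelation, not named in the text above, is the Durbin-Watson statistic, $d = \sum_{t=2}^{n} (e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$. Values near 2 suggest little autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The residual series below are made up for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals ordered in time."""
    diffs = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    total = sum(e ** 2 for e in residuals)
    return diffs / total


# Made-up residuals that drift slowly: neighbours are similar (positive autocorrelation).
drifting = [1.0, 0.8, 0.6, 0.4, -0.4, -0.6, -0.8, -1.0]
# Made-up residuals that flip sign every step (negative autocorrelation).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_drifting = durbin_watson(drifting)
dw_alternating = durbin_watson(alternating)
print(round(dw_drifting, 3), round(dw_alternating, 3))  # prints: 0.204 3.333
```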

 - No Endogeneity:
In a regression analysis, endogeneity occurs when a predictor variable is correlated with the error term, that is, with the unexplained variation in the dependent variable. Endogeneity biases the coefficient estimates and the results of statistical tests. This is a crucial issue because it may undermine the validity of inferences and lead to incorrect conclusions.
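A small simulation can illustrate the bias. In the sketch below, all numbers are assumptions chosen for the demonstration: the true slope is 2, the error term `u` is standard normal, and the endogenous predictor is built as `z + u` so that it is correlated with the error. Under this setup the least-squares slope is expected to drift toward about 2.5 instead of 2.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (simple regression with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
true_slope = 2.0

z = [rng.gauss(0, 1) for _ in range(n)]  # variation unrelated to the error
u = [rng.gauss(0, 1) for _ in range(n)]  # error term

# Exogenous case: the predictor is independent of the error term.
x_exog = z
y_exog = [true_slope * xi + ui for xi, ui in zip(x_exog, u)]

# Endogenous case: the predictor contains the error term itself.
x_endog = [zi + ui for zi, ui in zip(z, u)]
y_endog = [true_slope * xi + ui for xi, ui in zip(x_endog, u)]

slope_exog = ols_slope(x_exog, y_exog)
slope_endog = ols_slope(x_endog, y_endog)
print(f"exogenous: {slope_exog:.2f}, endogenous: {slope_endog:.2f}")
```

The exogenous estimate lands near the true slope, while the endogenous one is biased upward even with thousands of observations, so collecting more data does not fix the problem.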