## Simple Linear Regression

Simple linear regression uses one independent variable to estimate a dependent variable, for example predicting CO2 emission from the EngineSize variable. Linear regression lets you model the relationship between these variables, and a good model can then be used to predict the approximate emission of each car. Let's see how to do it with Python.

### Creating train and test dataset

A train/test split divides the dataset into training and testing sets that are mutually exclusive. You then train with the training set and test with the testing set. This provides a more accurate evaluation of out-of-sample accuracy because the testing dataset is not part of the data that has been used to train the model. It therefore gives us a better understanding of how well our model generalizes to new data.

We know the outcome of each data point in the testing dataset, which makes it well suited for testing. Since this data has not been used to train the model, the model has no knowledge of the outcome of these data points. So, in essence, it is truly out-of-sample testing.
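
The split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's own code: the helper name `train_test_split_rows`, the 80/20 ratio, the seed, and the `cars` values are all assumptions made up for the example.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into disjoint train and test lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (engine_size, co2_emission) pairs, invented for illustration.
cars = [(1.5, 150), (2.0, 190), (2.4, 215), (3.0, 250), (3.5, 270),
        (1.8, 170), (4.0, 300), (5.0, 350), (2.5, 225), (3.7, 285)]

train, test = train_test_split_rows(cars, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

In practice, scikit-learn's `train_test_split` from `sklearn.model_selection` does the same job on arrays and data frames.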

### Simple Regression Model

Linear regression fits a linear model with coefficients B = (B1, ..., Bn) to minimize the 'residual sum of squares' between the actual value $y$ in the dataset and the value $\hat{y}$ predicted by the linear approximation.
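
For a single feature, the parameters that minimize the residual sum of squares have a well-known closed form: the slope is the ratio of the covariance of x and y to the variance of x, and the intercept makes the line pass through the point of means. The sketch below implements that formula directly. The function name and the noise-free sample data are assumptions for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) minimizing the residual sum of squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical, noise-free data generated as co2 = 100 + 40 * engine_size.
engine_size = [1.0, 2.0, 3.0, 4.0, 5.0]
co2 = [140.0, 180.0, 220.0, 260.0, 300.0]

intercept, slope = fit_simple_linear_regression(engine_size, co2)
predicted = intercept + slope * 3.5  # prediction for an unseen engine size
print(intercept, slope, predicted)
```

Because the sample data lie exactly on a line, the fit recovers the generating intercept and slope. With real, noisy data the fitted line only approximates the relationship.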

### Model Evaluation

We compare the actual values and predicted values to calculate the accuracy of a regression model. Evaluation metrics play a key role in the development of a model, as they provide insight into areas that require improvement.
There are different model evaluation metrics:

*   Mean Absolute Error (MAE): it is the mean of the absolute values of the errors. This is the easiest of the metrics to understand, since it is just the average size of the error.

*   Mean Squared Error (MSE): MSE is the mean of the squared errors. It's more popular than Mean Absolute Error because it puts more weight on large errors: squaring makes the penalty grow quadratically, so large errors contribute disproportionately more than small ones.

*   Root Mean Squared Error (RMSE): it measures the difference between predicted values from a model and the actual observed values. It is obtained by taking the square root of the average of the squared residuals.

*   R-squared is not an error, but rather a popular metric to measure the performance of a regression model. It represents how close the data points are to the fitted regression line. The higher the R-squared value, the better the model fits the data. The best possible score is 1.0, and it can be negative, because a model can be arbitrarily worse than one that always predicts the mean of the target.

The choice of metric depends on the type of model, your data type, and your domain knowledge.
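
The four metrics listed above can be computed by hand as a sanity check. The following is a minimal sketch with made-up values. The function name `regression_metrics` and the sample numbers are assumptions, not part of any library.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R-squared compares the residual sum of squares with the total sum of
    # squares around the mean of the actual values.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical actual and predicted CO2 emissions, invented for illustration.
y_true = [200.0, 250.0, 300.0, 350.0]
y_pred = [210.0, 240.0, 310.0, 330.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Here three errors have size 10 and one has size 20, so the MAE is 12.5 while the MSE is 175. The single larger error accounts for more than half of the MSE, which is the "weight on large errors" effect described above.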

## Multiple Linear Regression

Multiple linear regression is when multiple independent variables are used to estimate a dependent variable. For example, predicting CO2 emission using EngineSize and the number of Cylinders in the car's engine.

#### **Applications**

Multiple linear regression has two main applications:

1.   First, it can be used when we would like to identify the strength of the effect that the independent variables have on a dependent variable. For example, do revision time, test anxiety, lecture attendance, and gender have any effect on the exam performance of students?

2.   Second, it can be used to predict the impact of changes. That is, to understand how the dependent variable changes when we change the independent variables. For example, if we were reviewing a person's health data, multiple linear regression can tell us how much that person's blood pressure goes up (or down) for every unit increase (or decrease) in their body mass index (BMI), holding other factors constant.
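
The "holding other factors constant" reading of a coefficient can be checked numerically. In the sketch below, the intercept and coefficients are hypothetical numbers invented purely for illustration (they are not estimates from any real health data), as is the function name.

```python
# Hypothetical fitted parameters, invented for illustration only.
INTERCEPT = 60.0
COEFFICIENTS = {"bmi": 1.2, "age": 0.5}


def predict_blood_pressure(bmi, age):
    """Prediction of a linear model: intercept plus a weighted sum of features."""
    return INTERCEPT + COEFFICIENTS["bmi"] * bmi + COEFFICIENTS["age"] * age


baseline = predict_blood_pressure(bmi=25.0, age=40.0)
one_more_bmi = predict_blood_pressure(bmi=26.0, age=40.0)  # age held constant

# The change in the prediction equals the BMI coefficient (up to rounding).
change = one_more_bmi - baseline
print(change)
```

Raising BMI by one unit while keeping age fixed moves the prediction by exactly the BMI coefficient, which is what makes the coefficients of a linear model directly interpretable.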

## Multiple Regression Model


In reality, multiple variables affect CO2 emission. When more than one independent variable is present, multiple linear regression is used. An example is predicting CO2 emission using the features FUELCONSUMPTION_COMB, EngineSize and Cylinders of cars. Conveniently, the multiple linear regression model is an extension of the simple linear regression model.

As mentioned before, the coefficients and the intercept are the parameters of the fitted line. In a multiple linear regression model with three features, these parameters are the intercept and the three coefficients of the fitted hyperplane, and sklearn can estimate them from our data. Scikit-learn uses the plain Ordinary Least Squares method to solve this problem.

**Ordinary Least Squares (OLS)**

OLS is a method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by minimizing the sum of the squares of the differences between the target dependent variable and the values predicted by the linear function. In other words, it tries to minimize the sum of squared errors (SSE), or equivalently the mean squared error (MSE), between the target variable ($y$) and our predicted output ($\hat{y}$) over all samples in the dataset.

OLS can find the best parameters using one of the following methods:

*   Solving the model parameters analytically using closed-form equations
*   Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton's Method, etc.)
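
Both routes can be compared on a small synthetic problem. The sketch below uses NumPy (an assumption; the text itself only mentions scikit-learn) with three made-up standardized features standing in for real car data. The closed-form solution comes from `np.linalg.lstsq`, and the iterative one from plain batch gradient descent with a hand-picked learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 standardized features (stand-ins, not real car data).
n_samples = 200
X = rng.normal(size=(n_samples, 3))
true_intercept = 250.0
true_coefs = np.array([30.0, 15.0, 20.0])
y = true_intercept + X @ true_coefs + rng.normal(scale=5.0, size=n_samples)

# Prepend a column of ones so the intercept is estimated with the coefficients.
A = np.column_stack([np.ones(n_samples), X])

# Method 1: closed form. lstsq minimizes ||A @ beta - y||^2 directly.
beta_closed, *_ = np.linalg.lstsq(A, y, rcond=None)

# Method 2: batch gradient descent on the mean squared error.
beta_gd = np.zeros(A.shape[1])
learning_rate = 0.1  # assumed value; suitable here because features are standardized
for _ in range(2000):
    residuals = A @ beta_gd - y
    gradient = (2.0 / n_samples) * (A.T @ residuals)
    beta_gd -= learning_rate * gradient

print("closed form:     ", beta_closed)
print("gradient descent:", beta_gd)
```

Both methods minimize the same objective, so they should agree closely on well-conditioned data like this. The closed form is usually the simpler choice for small problems, while iterative optimizers become attractive when the dataset is too large to solve in one step.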

### Prediction

**Explained variance regression score**:
Let $\hat{y}$ be the estimated target output, $y$ the corresponding (correct) target output, and Var the variance (the square of the standard deviation). Then the explained variance is estimated as follows:

$\texttt{explainedVariance}(y, \hat{y}) = 1 - \frac{Var\{ y - \hat{y}\}}{Var\{y\}}$  
The best possible score is 1.0; lower values are worse.
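
The formula can be evaluated directly with the standard library. This is a minimal sketch: the function name and sample values are made up, and it uses the population variance. Since the same variance definition appears in the numerator and the denominator, choosing the sample variance instead would give the same ratio.

```python
from statistics import pvariance


def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y), using the population variance."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    var_y = pvariance(y_true)
    if var_y == 0:
        raise ValueError("explained variance is undefined when y_true is constant")
    return 1.0 - pvariance(residuals) / var_y


# Hypothetical actual values and two sets of predictions, invented for illustration.
y_true = [200.0, 250.0, 300.0, 350.0]
y_pred_noisy = [210.0, 240.0, 310.0, 330.0]
y_pred_offset = [210.0, 260.0, 310.0, 360.0]  # every prediction is 10 too high

score_noisy = explained_variance(y_true, y_pred_noisy)
score_offset = explained_variance(y_true, y_pred_offset)
print(score_noisy, score_offset)
```

The second set of predictions is off by a constant 10, yet it scores a perfect 1.0, because the variance of a constant residual is zero. Explained variance therefore ignores systematic bias, which is the main way it differs from R-squared. scikit-learn offers the same metric as `explained_variance_score` in `sklearn.metrics`.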