## How To Implement Linear Regression for Machine Learning?
Supervised learning uses an algorithm to learn the mapping from input variables to an output variable, so that the outcome can be predicted when a new input comes into the picture. The linear regression algorithm in machine learning is a supervised learning technique that approximates this mapping function to get the best predictions. In this article, we will learn about linear regression for machine learning. The following topics are discussed in this blog.

- What is Regression?
- Types of Regression
- What is Linear Regression?
- Linear Regression Terminologies
- Advantages And Disadvantages Of Linear Regression
- Linear Regression Use Cases
- Use case – Linear Regression Implementation
## What is Regression?
The main goal of regression is to construct an efficient model that predicts the dependent attribute from a set of attribute variables. A regression problem is one where the output variable is a real or continuous value, e.g. salary, weight, or area.

We can also define regression as a statistical means that is used in applications like housing, investing, etc. It is used to predict the relationship between a dependent variable and a bunch of independent variables. Let us take a look at various types of regression techniques.



## Types Of Regression
The following are types of regression.

- Simple Linear Regression
- Polynomial Regression
- Support Vector Regression
- Decision Tree Regression
- Random Forest Regression
## Simple Linear Regression
One of the most interesting and common regression techniques is simple linear regression. In it, we predict the outcome of a dependent variable from an independent variable, where the relationship between the variables is linear. Hence the name linear regression.



## Polynomial Regression
In this regression technique, we transform the original features into polynomial features of a given degree and then perform regression on it.
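The transform-then-regress idea can be sketched with NumPy alone. The toy data, the chosen degree, and the use of `np.vander` plus `np.linalg.lstsq` are assumptions made for illustration, not the only way to do this.

```python
import numpy as np

# Toy data generated from y = 1 + 2x + 3x^2 (made up for illustration).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

degree = 2
# Transform the single feature x into polynomial features [1, x, x^2].
X_poly = np.vander(x, degree + 1, increasing=True)

# Ordinary least squares on the transformed features.
coeffs, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(coeffs)  # coefficients for 1, x, x^2, in that order
```

The model is still linear in its coefficients; only the features are nonlinear in x.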
## Support Vector Regression
For support vector machine regression or SVR, we identify a hyperplane with maximum margin such that the maximum number of data points are within those margins. It is quite similar to the support vector machine classification algorithm.

## Decision Tree Regression
A decision tree can be used for both regression and classification. In the case of regression, we use an adaptation of the ID3 algorithm (Iterative Dichotomiser 3) that chooses the splitting node by maximizing the reduction in standard deviation.
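As a rough sketch of choosing a split by standard deviation reduction, the snippet below tries every midpoint of a single numeric feature. The helper names `std_reduction` and `best_threshold` and the toy data are hypothetical, and a real tree would repeat this search recursively over many features.

```python
from statistics import pstdev

def std_reduction(targets, left, right):
    """Standard deviation before the split minus the weighted std after it."""
    n = len(targets)
    weighted = (len(left) / n) * pstdev(left) + (len(right) / n) * pstdev(right)
    return pstdev(targets) - weighted

def best_threshold(xs, ys):
    """Try each midpoint between sorted feature values and keep the best split."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("-inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        gain = std_reduction(ys, left, right)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Made-up data with two obvious groups of target values.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
threshold, gain = best_threshold(xs, ys)
print(threshold, gain)
```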

## Random Forest Regression
In random forest regression, we ensemble the predictions of several decision tree regressions. Now that we know about different types of regression let us take a look at simple linear regression in detail.

 

## What is Linear Regression?
Simple linear regression is a regression technique in which the independent variable has a linear relationship with the dependent variable. The straight line in the diagram is the best fit line. The main goal of the simple linear regression is to consider the given data points and plot the best fit line to fit the model in the best way possible.



Before moving on to how the linear regression algorithm works, let us take a look at a few important terminologies in simple linear regression.

## Linear Regression Terminologies
The following terminologies are important to be familiar with before moving on to the linear regression algorithm.



### Cost Function
The best fit line can be based on the linear equation given below.

$$y = b_0 + b_1 x + e$$

- The dependent variable to be predicted is denoted by y.
- b0 is the intercept, the value of y where the line crosses the y-axis.
- b1 is the slope of the line, and x represents the independent variable that determines the prediction of y.
- The error in the resulting prediction is denoted by e.
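To make the terms concrete, here is a small sketch that evaluates the line for hypothetical values of b0 and b1 and computes the error e for each point. All numbers are invented for illustration.

```python
# Hypothetical coefficients and data, chosen only for illustration.
b0, b1 = 1.0, 2.0
x = [0.0, 1.0, 2.0, 3.0]
y = [1.5, 2.5, 5.5, 6.5]

# Predicted values on the line: y_pred = b0 + b1 * x
y_pred = [b0 + b1 * xi for xi in x]

# Error term for each point: e = y - y_pred
e = [yi - pi for yi, pi in zip(y, y_pred)]

print(y_pred)  # [1.0, 3.0, 5.0, 7.0]
print(e)       # [0.5, -0.5, 0.5, -0.5]
```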

The cost function helps us find the best possible values for b0 and b1 to make the best fit line for the data points. We do this by converting the problem into a minimization problem: we look for the b0 and b1 that minimize the error between the actual value and the predicted value.



The cost function J is:

$$J = \frac{1}{n}\sum_{i=1}^{n}\left(y_{\text{pred},i} - y_{\text{real},i}\right)^2$$

We choose the function above to minimize the error. We square the difference between the predicted and actual value for each data point, sum these squared errors over all data points, and divide by the total number of data points. The result is the average squared error over all data points.

It is also known as MSE (Mean Squared Error), and we change the values of b0 and b1 until the MSE settles at its minimum.
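A minimal sketch of the MSE computation in plain Python, using made-up values:

```python
def mse(y_real, y_pred):
    """Mean of the squared differences between predictions and actual values."""
    n = len(y_real)
    return sum((p - r) ** 2 for p, r in zip(y_pred, y_real)) / n

# Made-up values: the squared errors are 1, 0 and 4.
y_real = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mse(y_real, y_pred))  # (1 + 0 + 4) / 3
```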

### Gradient Descent
The next important terminology to understand linear regression is gradient descent. It is a method of updating b0 and b1 values to reduce the MSE. The idea behind this is to keep iterating the b0 and b1 values until we reduce the MSE to the minimum.

To update b0 and b1, we take gradients from the cost function. To find these gradients, we take partial derivatives with respect to b0 and b1. These partial derivatives are the gradients and are used to update the values of b0 and b1.
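For the MSE cost with predictions $b_0 + b_1 x_i$, the partial derivatives and the update step work out as below. The symbol $\alpha$ for the learning rate is notation assumed here for illustration.

$$
\begin{aligned}
J(b_0, b_1) &= \frac{1}{n}\sum_{i=1}^{n}\left(b_0 + b_1 x_i - y_i\right)^2 \\
\frac{\partial J}{\partial b_0} &= \frac{2}{n}\sum_{i=1}^{n}\left(b_0 + b_1 x_i - y_i\right) \\
\frac{\partial J}{\partial b_1} &= \frac{2}{n}\sum_{i=1}^{n}\left(b_0 + b_1 x_i - y_i\right)x_i \\
b_0 &\leftarrow b_0 - \alpha\,\frac{\partial J}{\partial b_0}, \qquad b_1 \leftarrow b_1 - \alpha\,\frac{\partial J}{\partial b_1}
\end{aligned}
$$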

The size of each update is controlled by the learning rate. A smaller learning rate gets closer to the minimum but takes more time, whereas a larger learning rate gets there sooner but risks overshooting the minimum value. That completes the terminologies in linear regression.
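The update loop and the learning-rate trade-off can be sketched in plain Python. The function name `gradient_descent`, the toy data, and the three learning rates are assumptions chosen to show slow progress, convergence, and overshooting on this particular data set.

```python
def gradient_descent(x, y, learning_rate, steps):
    """Fit y ~ b0 + b1 * x by repeatedly stepping against the MSE gradient."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        errors = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        # Partial derivatives of the MSE with respect to b0 and b1.
        grad_b0 = (2 / n) * sum(errors)
        grad_b1 = (2 / n) * sum(err * xi for err, xi in zip(errors, x))
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    final_mse = sum(((b0 + b1 * xi) - yi) ** 2 for xi, yi in zip(x, y)) / n
    return b0, b1, final_mse

# Made-up data lying exactly on y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

slow = gradient_descent(x, y, learning_rate=0.001, steps=100)    # barely moves
good = gradient_descent(x, y, learning_rate=0.05, steps=2000)    # converges
too_big = gradient_descent(x, y, learning_rate=0.5, steps=20)    # overshoots

for name, (b0, b1, err) in [("slow", slow), ("good", good), ("too_big", too_big)]:
    print(name, b0, b1, err)
```

With the largest rate, each step jumps past the minimum by more than the previous error, so the MSE grows instead of shrinking.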



## A case study in Python:
For this case study, you will use scikit-learn, a very popular Python library that provides estimators for many machine learning models along with utilities for splitting data and scoring predictions. For the data, you will use a version of the famous Boston housing dataset, read here from a local CSV file with pandas. This version keeps three features (RM, LSTAT, and PTRATIO) and the target MEDV.

Let's start the case study by importing NumPy and pandas and loading your dataset:

In [7]:
import numpy as np
import pandas as pd

In [8]:
df=pd.read_csv('datasets_1379_2485_housing.csv')

In [9]:
df.head()

      RM  LSTAT  PTRATIO      MEDV
0  6.575   4.98     15.3  504000.0
1  6.421   9.14     17.8  453600.0
2  7.185   4.03     17.8  728700.0
3  6.998   2.94     18.7  701400.0
4  7.147   5.33     18.7  760200.0


In [10]:
df.shape

(489, 4)

In [11]:
df.isnull().sum()

RM         0
LSTAT      0
PTRATIO    0
MEDV       0
dtype: int64

Now, before applying linear regression, you will have to prepare the data and separate the features from the label of the dataset. MEDV (median home value) is the label in this case. Because the data is in a pandas DataFrame, the feature names are simply its column names (df.columns).

A bit of pandas knowledge will come in handy here. This cheat sheet is a must-see if you are looking for ways to refresh basic pandas concepts.

In [12]:
X=df.iloc[:,0:-1]

In [13]:
y=df.iloc[:,-1]

In [14]:
X.shape

(489, 3)

In [15]:
y.shape

(489,)

At this point, you need to consider a few important things about linear regression before applying it to the data. You could have studied this earlier in this tutorial, but studying these factors at this particular point of time will help you get the real feel.

<b>Linear Assumption:</b> Linear regression is best employed to capture the relationship between the input variables and the outputs. In order to do so, linear regression assumes this relationship to be linear (which might not be the case all the time). But you can always transform your data so that a linear relationship is maintained. For example, if your data has an exponential relationship, you can apply log-transform to make the relationship linear.
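A small sketch of the log-transform idea: the data below is generated from an exponential curve, and a straight line is then fitted to log(y). The helper `fit_line` uses the closed-form least-squares solution for one feature, a technique swapped in here for brevity, and the constants 3 and 0.5 are made up.

```python
import math

def fit_line(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x for one feature."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
        (xi - x_mean) ** 2 for xi in x
    )
    return y_mean - b1 * x_mean, b1

# Made-up exponential data: y = 3 * exp(0.5 * x)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [3.0 * math.exp(0.5 * xi) for xi in x]

# log(y) = log(3) + 0.5 * x, which is linear in x.
log_y = [math.log(yi) for yi in y]
b0, b1 = fit_line(x, log_y)

print(b1)            # slope, close to 0.5
print(math.exp(b0))  # recovered scale factor, close to 3
```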

<b>Collinearity between the features:</b> Collinearity means that two or more features are strongly correlated with each other. When features are highly correlated, linear regression cannot reliably separate their individual effects: the coefficient estimates become unstable and the model tends to overfit. So it helps to detect highly correlated features and drop the redundant ones before you apply linear regression. If you want to know more about this, feel free to check this excellent Kaggle kernel.

Let's do some hands-on now. The code below uses all three features (RM, LSTAT, and PTRATIO) and holds out part of the data for testing. Note that scikit-learn's LinearRegression adds a constant term (the intercept b0) by default, as fit_intercept=True in the output below shows:

In [17]:
from sklearn.model_selection import train_test_split

In [22]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) # 80% training and 20% test

In [23]:
from sklearn.linear_model import LinearRegression

In [24]:
LR=LinearRegression()

In [25]:
LR.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [27]:
y_pred=LR.predict(X_test)

In [28]:
from sklearn.metrics import r2_score

In [29]:
r2_score(y_test,y_pred)

0.7042069943455352
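For intuition about the score above, the coefficient of determination can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r2` and the numbers below are illustrative only and are not taken from the housing data.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values: SS_res = 1.0 and SS_tot = 20.0.
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 6.5, 9.5]
score = r2(y_true, y_hat)
print(score)
```

A score of 1 means perfect predictions, while always predicting the mean of the true values gives 0.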

In [31]:
df.corr()['MEDV']

RM         0.697209
LSTAT     -0.760670
PTRATIO   -0.519034
MEDV       1.000000
Name: MEDV, dtype: float64
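The values above are correlations with the target. To look for collinearity you would instead compare the features with each other. The sketch below does this with `np.corrcoef` on only the five rows shown by df.head(), so its numbers are not meaningful for the full dataset, and the 0.8 cutoff is an arbitrary assumption.

```python
import numpy as np

names = ["RM", "LSTAT", "PTRATIO"]
# First five rows shown by df.head() above; far too few for real conclusions.
rows = np.array([
    [6.575, 4.98, 15.3],
    [6.421, 9.14, 17.8],
    [7.185, 4.03, 17.8],
    [6.998, 2.94, 18.7],
    [7.147, 5.33, 18.7],
])

# rowvar=False treats each column as one variable.
corr = np.corrcoef(rows, rowvar=False)

threshold = 0.8  # arbitrary cutoff, assumed for illustration
flagged = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(corr)
print(flagged)
```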