<a href="https://colab.research.google.com/github/MachineLearningWithHuman/cloud/blob/master/Linear_Regression_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression
This notebook contains an example of linear regression implemented with scikit-learn. The folder also contains implementations of the algorithm in other frameworks, e.g., TensorFlow, PyTorch, R, and MATLAB.
 

<div align="center">
<a href="https://github.com/MachineLearningWithHuman/cloud/blob/master/Employer/machine%20learning%20basics/linear%20regression/Linear_Regression_.ipynb" role="button"><img class="notebook-badge-image" src="https://img.shields.io/static/v1?label=&amp;message=View%20On%20GitHub&amp;color=586069&amp;logo=github&amp;labelColor=2f363d"></a>&nbsp;
<a href="https://colab.research.google.com/github/MachineLearningWithHuman/cloud/blob/master/Linear_Regression_.ipynb"><img class="notebook-badge-image" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</div>

# <b>Assumptions in Linear Regression</b>


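*   **Linear relationship.** Linear regression assumes that the relationship between the independent and dependent variables is linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots.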
![alt text](https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2F2.bp.blogspot.com%2F-TcRUfR96Flw%2FUBhSO2LK9CI%2FAAAAAAAAAgY%2FI40YLjWavIs%2Fs1600%2Flinear%2Bvs%2Bnonlinear.jpg&f=1&nofb=1)
*   **Multivariate normality.** This assumption can best be checked with a histogram or a Q-Q plot. Normality can also be checked with a goodness-of-fit test, e.g., the Kolmogorov-Smirnov test. When the data are not normally distributed, a non-linear transformation (e.g., a log transformation) might fix the issue.
![alt text](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse1.mm.bing.net%2Fth%3Fid%3DOIP.gya82KlV8aRluXMS8OCOOAAAAA%26pid%3DApi&f=1)
*   **No or little multicollinearity.** Multicollinearity occurs when the independent variables are too highly correlated with each other.
![alt text](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fi.stack.imgur.com%2FJfCrn.jpg&f=1&nofb=1)
Multicollinearity may be tested with three central criteria:

1.   **Correlation matrix –** when computing the matrix of Pearson’s bivariate correlations among all independent variables, the absolute values of the correlation coefficients should be well below 1.
2.   **Tolerance –** the tolerance measures the influence of one independent variable on all other independent variables; it is calculated from an initial linear regression of each independent variable on all the others. Tolerance is defined as **T = 1 – R²** for these first-step regressions. As a rule of thumb, with T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there almost certainly is.
3.   **Variance Inflation Factor (VIF) –** the variance inflation factor is defined as **VIF = 1/T**. As a rule of thumb, VIF > 5 indicates that multicollinearity may be present; with VIF > 10 there is almost certainly multicollinearity among the variables.
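*   **No auto-correlation.** The residuals should be independent of one another.
*   **Homoscedasticity (constant variance).** The residuals should have the same variance across the whole range of the predictors.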
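
The tolerance and VIF definitions above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions: the helper name `tolerance_and_vif` and the synthetic columns are hypothetical and not part of this notebook. Each column is regressed on all the others, and T = 1 – R² and VIF = 1/T are computed from that auxiliary fit.

```python
import numpy as np

def tolerance_and_vif(X):
    """Return (tolerance, vif) arrays, one entry per column of X.

    Each column is regressed on all the other columns (plus an intercept);
    tolerance is 1 - R^2 of that auxiliary regression and VIF is 1 / tolerance.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    tol = np.empty(d)
    for j in range(d):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = np.sum(resid ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        tol[j] = ss_res / ss_tot  # equals 1 - R^2
    return tol, 1.0 / tol

# Synthetic data (hypothetical): x3 is nearly a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)

tol, vif = tolerance_and_vif(np.column_stack([x1, x2, x3]))
print("tolerance:", tol)
print("VIF:", vif)
```

In this synthetic setup the nearly duplicated columns get a very small tolerance (large VIF), while the independent column stays close to VIF = 1.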








# Overview

Our goal is to learn a linear model $\hat{y}$ that models $y$ given $X$. 

$\hat{y} = XW + b$
* $\hat{y}$ = predictions | $\in \mathbb{R}^{N \times 1}$ ($N$ is the number of samples)
* $X$ = inputs | $\in \mathbb{R}^{N \times D}$ ($D$ is the number of features)
* $W$ = weights | $\in \mathbb{R}^{D \times 1}$
* $b$ = bias | $\in \mathbb{R}^{1}$
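
To make the shapes concrete, here is a small NumPy sketch of $\hat{y} = XW + b$. The sizes and the random values are hypothetical, chosen only to illustrate the dimensions listed above.

```python
import numpy as np

N, D = 5, 3                      # hypothetical sample and feature counts
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))      # inputs, shape (N, D)
W = rng.normal(size=(D, 1))      # weights, shape (D, 1)
b = 0.5                          # scalar bias, broadcast over all N rows

y_hat = X @ W + b                # predictions, shape (N, 1)
print(y_hat.shape)
```

The scalar bias is added to every row through broadcasting, so no explicit column of ones is needed in this form.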

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data

In [0]:
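from sklearn import datasets

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2.
data = datasets.load_boston()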

In [0]:
# Put the predictors in a DataFrame, using the dataset's feature names as columns
df = pd.DataFrame(data.data, columns=data.feature_names)

# Put the target (housing value, MEDV) in a separate DataFrame
target = pd.DataFrame(data.target, columns=["MEDV"])

# Regression with statsmodels

In [6]:
import statsmodels.api as sm

X = df["RM"]
y = target["MEDV"]

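# Fit ordinary least squares; no constant is added, so the line passes through the origin
model = sm.OLS(y, X).fit()

# In-sample predictions
predictions = model.predict(X)

# Summary statistics of the fit
model.summary()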

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared (uncentered): | 0.901 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.901 |
| Method: | Least Squares | F-statistic: | 4615.0 |
| Date: | Tue, 02 Jun 2020 | Prob (F-statistic): | 3.74e-256 |
| Time: | 02:39:37 | Log-Likelihood: | -1747.1 |
| No. Observations: | 506 | AIC: | 3496.0 |
| Df Residuals: | 505 | BIC: | 3500.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| RM | 3.6534 | 0.054 | 67.930 | 0.000 | 3.548 | 3.759 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 83.295 | Durbin-Watson: | 0.493 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 152.507 |
| Skew: | 0.955 | Prob(JB): | 7.65e-34 |
| Kurtosis: | 4.894 | Cond. No. | 1.0 |


# <b>Points to Note</b>


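1.   OLS stands for Ordinary Least Squares. The **Least Squares** method fits the regression line that minimizes the sum of squared vertical distances between the observed values and the line.
2.   The coefficient of 3.6534 means that as the RM variable increases by 1, the predicted value of MEDV increases by 3.6534.
3.   R-squared (uncentered): 0.901. Because this model has no intercept, statsmodels reports the uncentered R-squared, which measures fit relative to zero rather than to the mean of MEDV. It is therefore not directly comparable to the usual (centered) R-squared, the share of variance the model explains.
4.   Without a constant term, the fitted model has the form $y = m x$, a line through the origin.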
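
For a model without a constant, minimizing $\sum_i (y_i - m x_i)^2$ gives the closed-form slope $m = \sum_i x_i y_i / \sum_i x_i^2$. The sketch below illustrates this on synthetic data (the data and variable names are hypothetical, not taken from the Boston dataset) and shows how the uncentered and centered R-squared differ for the same residuals.

```python
import numpy as np

# Synthetic data (hypothetical): a true line with a non-zero intercept.
rng = np.random.default_rng(0)
x = rng.uniform(4.0, 9.0, size=100)
y = -30.0 + 9.0 * x + rng.normal(scale=2.0, size=100)

# Closed-form least-squares slope for the no-intercept model y = m * x.
m = np.sum(x * y) / np.sum(x ** 2)
resid = y - m * x

# Uncentered R^2 compares the residuals with sum(y^2);
# centered R^2 compares them with the spread of y around its mean.
r2_uncentered = 1.0 - np.sum(resid ** 2) / np.sum(y ** 2)
r2_centered = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
print(m, r2_uncentered, r2_centered)
```

Since $\sum_i y_i^2$ is never smaller than $\sum_i (y_i - \bar{y})^2$, the uncentered value is never lower than the centered one for the same residuals, which is why a no-intercept fit can look deceptively good.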



# Adding a Constant

In [7]:
import statsmodels.api as sm # import statsmodels 

X = df["RM"] ## X usually means our input variables (or independent variables)
y = target["MEDV"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Argument order is sm.OLS(output, input): the dependent variable comes first
model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 0.484 |
| Model: | OLS | Adj. R-squared: | 0.483 |
| Method: | Least Squares | F-statistic: | 471.8 |
| Date: | Tue, 02 Jun 2020 | Prob (F-statistic): | 2.49e-74 |
| Time: | 02:45:57 | Log-Likelihood: | -1673.1 |
| No. Observations: | 506 | AIC: | 3350.0 |
| Df Residuals: | 504 | BIC: | 3359.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -34.6706 | 2.650 | -13.084 | 0.000 | -39.877 | -29.465 |
| RM | 9.1021 | 0.419 | 21.722 | 0.000 | 8.279 | 9.925 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 102.585 | Durbin-Watson: | 0.684 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 612.449 |
| Skew: | 0.726 | Prob(JB): | 1.02e-133 |
| Kurtosis: | 8.19 | Cond. No. | 58.4 |
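
Adding a constant amounts to giving the design matrix a column of ones, so that its coefficient plays the role of the intercept. The sketch below shows the idea with plain NumPy on synthetic data; the data and variable names are hypothetical, and `np.linalg.lstsq` stands in for the statsmodels fit used in this notebook.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=50)

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones_like(x), x])

# Solve the least-squares problem min ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, slope = beta
print(intercept, slope)
```

With the extra column the fitted line is free to cross the y-axis anywhere, instead of being forced through the origin.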


# Multivariate

In [10]:
X = df[["RM", "LSTAT"]]        # two predictors
X = sm.add_constant(X)         # add the intercept column
y = target["MEDV"]
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 0.639 |
| Model: | OLS | Adj. R-squared: | 0.637 |
| Method: | Least Squares | F-statistic: | 444.3 |
| Date: | Tue, 02 Jun 2020 | Prob (F-statistic): | 7.01e-112 |
| Time: | 02:48:26 | Log-Likelihood: | -1582.8 |
| No. Observations: | 506 | AIC: | 3172.0 |
| Df Residuals: | 503 | BIC: | 3184.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.3583 | 3.173 | -0.428 | 0.669 | -7.592 | 4.875 |
| RM | 5.0948 | 0.444 | 11.463 | 0.000 | 4.222 | 5.968 |
| LSTAT | -0.6424 | 0.044 | -14.689 | 0.000 | -0.728 | -0.556 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 145.712 | Durbin-Watson: | 0.834 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 457.69 |
| Skew: | 1.343 | Prob(JB): | 4.11e-100 |
| Kurtosis: | 6.807 | Cond. No. | 202.0 |
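
Each summary above reports a Durbin-Watson statistic, which relates to the no-auto-correlation assumption. It is the sum of squared successive differences of the residuals divided by the sum of squared residuals. It ranges from 0 to 4, with values near 2 suggesting no first-order auto-correlation. The sketch below is illustrative only: the helper name `durbin_watson_stat` and the synthetic residual series are assumptions, not part of this notebook.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences of the residuals
    divided by the sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
white = rng.normal(size=2000)             # uncorrelated residuals
walk = np.cumsum(rng.normal(size=2000))   # strongly positively auto-correlated

dw_white = durbin_watson_stat(white)      # expected to be close to 2
dw_walk = durbin_watson_stat(walk)        # expected to be close to 0
print(dw_white, dw_walk)
```

Values well below 2, like those in the summaries above, point toward positive auto-correlation in the residuals.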


# <b>scikit-learn Framework</b>

In [0]:
from sklearn import linear_model

In [0]:
# Use all 13 features as predictors
X = df
y = target["MEDV"]

In [0]:
# Fit a linear regression (an intercept is fitted by default)
lm = linear_model.LinearRegression()
model = lm.fit(X, y)

In [16]:
predictions = model.predict(X)
print(predictions[:5])

[30.00384338 25.02556238 30.56759672 28.60703649 27.94352423]


### R-Squared

In [17]:
lm.score(X,y)

0.7406426641094095
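
For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. A minimal sketch of that formula follows; the helper name `r_squared` and the toy values are hypothetical and independent of the Boston data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])
perfect = r_squared(y_true, y_true)                        # perfect predictions
baseline = r_squared(y_true, np.full(4, y_true.mean()))    # always predict the mean
print(perfect, baseline)  # 1.0 0.0
```

A model that always predicts the mean scores 0, so the 0.74 above can be read as the improvement over that baseline on the training data.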

### Coefficients and intercept

In [18]:
lm.coef_

array([-1.08011358e-01,  4.64204584e-02,  2.05586264e-02,  2.68673382e+00,
       -1.77666112e+01,  3.80986521e+00,  6.92224640e-04, -1.47556685e+00,
        3.06049479e-01, -1.23345939e-02, -9.52747232e-01,  9.31168327e-03,
       -5.24758378e-01])

In [19]:
lm.intercept_

36.459488385090125

In [20]:
lm.get_params()

{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}

**Congratulations on learning linear regression in scikit-learn and statsmodels! Like, share, and subscribe to us on YouTube.**
[YouTube](https://www.youtube.com/channel/UCiWd572-4LeH0IqJ5A7LavA/)