<a href="https://colab.research.google.com/github/Suraj5188/Maschine_Learning/blob/main/4_Linear_Regression_Assumption_Checks_test_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Linear Regression Assumption Checks test data**

Before fitting a linear regression and relying solely on accuracy scores, it is essential to check that the model's assumptions hold.

Table of Contents

1. Linearity
2. Mean of Residuals
3. Check for Homoscedasticity
4. Check for Normality of error terms/residuals
5. No autocorrelation of residuals
6. No perfect multicollinearity
7. Other Models for comparison

In [16]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')
import os

from sklearn import metrics

In [2]:
dataset=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML_Dataset/advertising.csv')

In [3]:
dataset.columns

Index(['TV', 'Radio', 'Newspaper', 'Sales'], dtype='object')

In [4]:
dataset.head()

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 12.0 |
| 3 | 151.5 | 41.3 | 58.5 | 16.5 |
| 4 | 180.8 | 10.8 | 58.4 | 17.9 |


In [5]:
dataset.tail()

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 195 | 38.2 | 3.7 | 13.8 | 7.6 |
| 196 | 94.2 | 4.9 | 8.1 | 14.0 |
| 197 | 177.0 | 9.3 | 6.4 | 14.8 |
| 198 | 283.6 | 42.0 | 66.2 | 25.5 |
| 199 | 232.1 | 8.6 | 8.7 | 18.4 |


In [6]:
dataset.shape

(200, 4)

In [7]:
dataset.size

800

In [8]:
fig = px.scatter_matrix(dataset)
fig.show()

#**Assumptions for Linear Regression**

#**1. Linearity**

Linear regression requires the relationship between the independent and dependent variables to be linear. Let's use scatter plots to check how each independent variable relates to the Sales variable.

In [9]:
# visualize the relationship between the features and the response using scatterplots

fig = px.scatter(dataset,x=['TV','Radio','Newspaper'], y='Sales',title='')
fig.show()

The plots show that none of the independent variables has an exactly linear relationship with Sales, although TV and Radio come much closer than Newspaper, which shows hardly any pattern at all. This suggests that a linear regression may not be the best model for this data: it may fall short both in explaining the variability and in prediction accuracy.

Tip: always read these plots with the dependent variable on the y-axis. Swapping the axes would not change the shape much, but linear regression is formulated with the dependent variable as y and the independent variables as x.
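As a rough numeric companion to the visual check, the strength of a linear relationship can be summarized with the Pearson correlation coefficient. The sketch below is only an illustration: it uses synthetic stand-in data (the values, the coefficients, and the helper `pearson_r` are assumptions, not taken from the advertising dataset), with one feature that drives the target linearly and one that is unrelated to it.

```python
import numpy as np

# Synthetic stand-in data (hypothetical values, NOT the real advertising dataset)
rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)          # feature with a linear effect on sales
newspaper = rng.uniform(0, 100, n)   # feature with no effect on sales
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.5, n)

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

r_tv = pearson_r(tv, sales)
r_newspaper = pearson_r(newspaper, sales)
print(f"TV vs Sales:        r = {r_tv:.2f}")
print(f"Newspaper vs Sales: r = {r_newspaper:.2f}")
```

On the real data the same idea would be `dataset.corr()['Sales']`. A correlation near zero does not rule out a non-linear relationship, so this complements the scatter plots rather than replacing them.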

The remaining assumptions can only be checked after the regression has been fitted, so let's fit the model first.

Fitting the linear model

In [10]:
# Get dependent and independent variables - X and y
X = dataset.drop(["Sales"],axis=1)
y = dataset.Sales

In [11]:
# Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0,test_size=0.25)

In [12]:
# Fitting the linear regression model
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn import linear_model

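# Note: the model is fitted and evaluated on the same test split (see the
# notebook title), so the metrics further down are in-sample scores.
# Fit on X_train, y_train instead to get an out-of-sample estimate.
regr = linear_model.LinearRegression()
regr.fit(X_test,y_test)
y_pred = regr.predict(X_test)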

In [13]:
y_pred

array([10.31827548,  8.39815899,  9.70788126, 25.19277789, 15.05579895,
        8.28507992,  9.83242706, 18.84680832,  9.36915015, 17.91670771,
       23.62294923, 10.60032527, 13.62037571, 17.12107613, 11.32824572,
       13.11789157, 21.42367926,  8.08250486, 13.47331188, 18.93574063,
       25.27185795, 12.62524623, 16.66553013, 13.81823488,  7.73518442,
       14.6699957 , 14.87884053, 20.38300961, 17.51730695,  8.43038144,
       12.34752172, 20.69378544, 21.96953105, 21.85907755,  6.88699361,
        6.68097167,  9.14788149, 16.09233864, 12.59460074,  6.98256997,
       10.44097103,  8.42672343, 15.31968087, 18.33382788, 19.25043865,
       13.01303985,  5.03416011,  9.91886925, 15.46913312, 11.0931301 ])
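The cell above fits and predicts on the same split (`X_test`). For comparison, here is a minimal sketch of the more common workflow, where the model is fitted on the training split and scored on the held-out split. It runs on synthetic data, and the names `X_demo`, `y_demo`, `model`, and the coefficients are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, NOT the advertising dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 5.0 + X_demo @ np.array([0.05, 0.1, 0.0]) + rng.normal(0, 1.0, 200)

# Same split settings as the notebook: 25% held out, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit on the training split only, then predict the unseen test split
model = LinearRegression().fit(X_tr, y_tr)
y_te_pred = model.predict(X_te)

# r2_score takes (y_true, y_pred) in that order
test_r2 = r2_score(y_te, y_te_pred)
print(f"Held-out R^2: {test_r2:.3f}")
```

In the notebook this would correspond to `regr.fit(X_train, y_train)` followed by `regr.predict(X_test)`; the printed predictions and metrics would then differ from the ones shown here.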

#**2. Mean of Residuals**

In [18]:
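# sklearn metrics take (y_true, y_pred); MAE and RMSE are symmetric in their arguments
print('MAE:',metrics.mean_absolute_error(y_test,y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
# r2_score is NOT symmetric: the value printed below comes from this (y_pred, y_test) order;
# r2_score(y_test, y_pred) is the conventional call and gives a different number
print('R-Squared',metrics.r2_score(y_pred,y_test))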

MAE: 1.3506217053360121
RMSE: 1.9491868555333813
R-Squared 0.8596932444002311
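The metrics above describe the size of the errors, while the assumption named in this section's heading concerns their mean: the residuals should average out to roughly zero. A minimal sketch of that check follows, assuming synthetic data and an ordinary least squares fit with an intercept done via `np.linalg.lstsq`. All names and values in it are illustrative, not the notebook's.

```python
import numpy as np

# Synthetic stand-in data (hypothetical, NOT the advertising dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = 5.0 + X_demo @ np.array([0.05, 0.1, 0.0]) + rng.normal(0, 1.0, 50)

# Ordinary least squares with an intercept column of ones
A = np.column_stack([np.ones(len(X_demo)), X_demo])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Residual = observed value minus fitted value
residuals = y_demo - A @ coef
mean_residual = float(residuals.mean())
print(f"Mean of residuals: {mean_residual:.2e}")
```

With the notebook's variables the equivalent would be `np.mean(y_test - y_pred)`. When a model with an intercept is evaluated on the data it was fitted to, this mean is zero up to floating-point error by construction; on held-out data it is only expected to be close to zero.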
