## This notebook explains the assumptions of linear regression in detail. Before applying linear regression and relying solely on accuracy scores, it is essential to check that these assumptions hold.

Table of Contents
<br><a href="#linearity">1. Linearity</a>
<br><a href="#mean">2. Mean of Residuals</a>
<br><a href="#homo">3. Check for Homoscedasticity</a>
<br><a href="#normal">4. Check for Normality of error terms/residuals</a>
<br><a href="#auto">5. No autocorrelation of residuals</a>
<br><a href="#multico">6. No perfect multicollinearity</a>
<br><a href="#other">7. Other Models for comparison</a>


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
sns.set(context="notebook", palette="Spectral", style = 'darkgrid' ,font_scale = 1.5, color_codes=True)
import warnings
warnings.filterwarnings('ignore')
import os
import matplotlib.pyplot as plt

In [2]:
ad_data = pd.read_csv('../input/Advertising.csv',index_col='Unnamed: 0')

In [4]:
ad_data.tail()

In [5]:
ad_data.info()

In [6]:
ad_data.isna().sum()

In [7]:
ad_data.describe()

In [8]:
ad_data.boxplot()

In [9]:
p = sns.pairplot(ad_data)

In [10]:
ad_data.shape

#  Assumptions for Linear Regression

## <a id="linearity">1. Linearity</a>


### Linear regression needs the relationship between the independent and dependent variables to be linear. Let's use a pair plot to check the relationship of each independent variable with the Sales variable.

In [11]:
# visualize the relationship between the features and the response using scatterplots
p = sns.pairplot(ad_data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', height=7, aspect=0.7)  # `size` was renamed to `height` in seaborn 0.9

### Looking at the plots, none of the independent variables has an exactly linear relationship with Sales, but TV and Radio come much closer than Newspaper, which hardly shows any specific shape. This suggests that linear regression might not be the best model for this data: a linear model might not *efficiently* explain the data in terms of variability, prediction accuracy, etc.

A tip: always read the plots with the dependent variable on the y-axis. Swapping the axes would not change the shape much, but linear regression is built on this convention: the dependent variable is y and the independent variables are the x(s).
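
A visual check can be complemented with a rough numeric one. The sketch below compares the R squared of a straight-line fit with that of a quadratic fit; a clear gain from the quadratic term hints at curvature. The helper name `poly_r2` and the synthetic data are assumptions for illustration only and are not taken from the Advertising data.

```python
import numpy as np

def poly_r2(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic, deliberately curved data (not the Advertising data)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=200)
y_demo = x_demo ** 2 + rng.normal(scale=2.0, size=200)

r2_line = poly_r2(x_demo, y_demo, degree=1)
r2_quad = poly_r2(x_demo, y_demo, degree=2)
print("straight line R^2:", round(r2_line, 3))
print("quadratic R^2:    ", round(r2_quad, 3))
```

To try this on the notebook's data, one could pass a single column such as `ad_data['TV']` as `x` and `ad_data['Sales']` as `y`.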

### The rest of the assumptions can only be checked after fitting the regression, so let's fit the model.

### Fitting the linear model

In [12]:
x = ad_data.drop(["Sales"],axis=1)
y = ad_data.Sales

In [14]:
y.head()

In [16]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(x)

In [18]:
X

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0,test_size=0.25)

In [22]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X_train,y_train) # model is built
y_pred = regr.predict(X_train)

In [41]:
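# the model was trained on standardized features, so scale the raw input with the fitted scaler first
regr.predict(sc.transform(np.array([[1250, 100, 100]])))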

In [25]:
r2 = r2_score(y_true=y_train,y_pred=y_pred)
r2

In [26]:
print("R squared: {}".format(r2))

In [27]:
n, k = X_train.shape  # r2 was computed on the training rows, so use their count rather than 200
adj_r2 = 1 - ((1-r2)*(n-1)/(n-k-1))
adj_r2
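
Adjusted R squared penalizes R squared for the number of predictors. A minimal sketch of the formula as a reusable helper follows; the function name `adjusted_r2` and the example numbers are hypothetical. The sample count passed in should be the number of rows the R squared was computed on.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Example with made-up numbers: R^2 = 0.90 from 150 rows and 3 features
example = adjusted_r2(0.90, n_samples=150, n_features=3)
print(round(example, 4))
```

The adjusted value is always at most the plain R squared, and the gap widens as features are added without a matching gain in fit.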

In [28]:
y_pred_test = regr.predict(X_test)
r2_test = r2_score(y_true=y_test,y_pred=y_pred_test)
r2_test

In [None]:
# mean absolute percentage error on the test set; np.mean divides by the number of test rows, not 200
mape = np.mean(np.abs((y_test-y_pred_test)/y_test))
mape*100

## <a id="mean">2. Mean of Residuals</a>

### Residuals are the differences between the true values and the predicted values. One of the assumptions of linear regression is that the mean of the residuals is zero, so let's check.


In [29]:
residuals = y_train.values-y_pred
mean_residuals = np.mean(residuals)
print("Mean of Residuals {}".format(mean_residuals))

### The mean is very close to zero, so this assumption holds.
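
For ordinary least squares fitted with an intercept, the training residuals average to zero by construction, so this check mostly confirms that the model includes an intercept. The sketch below illustrates this on synthetic data; all names and numbers are illustrative assumptions, not values from the Advertising data.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

# Fit with an intercept column: the residual mean is zero up to floating point error
A_with = np.column_stack([np.ones(len(X_demo)), X_demo])
beta_with, *_ = np.linalg.lstsq(A_with, y_demo, rcond=None)
mean_with = np.mean(y_demo - A_with @ beta_with)

# Fit without an intercept: the residual mean is generally not zero
beta_without, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
mean_without = np.mean(y_demo - X_demo @ beta_without)

print("with intercept:   ", mean_with)
print("without intercept:", mean_without)
```

A more demanding version of the check is to compute the residual mean on held-out data, where it is not guaranteed to be zero.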

## <a id="homo">3. Check for Homoscedasticity</a>

### Homoscedasticity means that the residuals have equal or almost equal variance across the regression line. Plotting the error terms against the predicted values lets us check that the error terms show no pattern.

### Detecting heteroscedasticity!
Graphical method: first run the regression, then plot the error terms against the predicted values ($\hat{y}_i$). If the scatter plot shows a definite pattern (linear, quadratic, or funnel-shaped), heteroscedasticity is present.

In [30]:
p = sns.scatterplot(x=y_pred,y=residuals)  # recent seaborn versions require x and y as keywords
plt.xlabel('y_pred/predicted values')
plt.ylabel('Residuals')
plt.ylim(-10,10)
plt.xlim(0,26)
p = sns.lineplot(x=[0,26],y=[0,0],color='blue')
p = plt.title('Residuals vs fitted values plot for homoscedasticity check')
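
The graphical check can be backed by a number. The sketch below computes a Breusch-Pagan style statistic in its studentized (Koenker) form: regress the squared residuals on the predictors and take n times the R squared of that auxiliary regression. Under constant variance the statistic is approximately chi-squared with one degree of freedom per predictor. The helper names and the synthetic data are assumptions for illustration; a statistics library would normally be used instead of hand-rolled code.

```python
import numpy as np

def ols_residuals(x, y):
    """Residuals of a least-squares fit of y on x with an intercept."""
    A = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def breusch_pagan_lm(residuals, x):
    """n * R^2 from regressing the squared residuals on x (with intercept)."""
    e2 = np.asarray(residuals, dtype=float) ** 2
    A = np.column_stack([np.ones(len(e2)), x])
    coef, *_ = np.linalg.lstsq(A, e2, rcond=None)
    ss_res = np.sum((e2 - A @ coef) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    return len(e2) * (1 - ss_res / ss_tot)

# Synthetic data: one series with constant noise, one with funnel-shaped noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(1, 10, size=500)
y_const = 3 + 2 * x_demo + rng.normal(size=500)
y_funnel = 3 + 2 * x_demo + rng.normal(size=500) * x_demo  # spread grows with x

lm_const = breusch_pagan_lm(ols_residuals(x_demo, y_const), x_demo)
lm_funnel = breusch_pagan_lm(ols_residuals(x_demo, y_funnel), x_demo)

# 3.841 is the 5% critical value of chi-squared with 1 degree of freedom
print("constant noise LM:", round(lm_const, 2))
print("funnel noise LM:  ", round(lm_funnel, 2), "(compare with 3.841)")
```

A statistic well above the critical value suggests heteroscedasticity; a small one means the test found no evidence of it, which is weaker than proving constant variance.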

## <a id="normal">4. Check for Normality of error terms/residuals</a>

In [31]:
p = sns.histplot(residuals,kde=True)  # histplot replaces the deprecated distplot
p = plt.title('Normality of error terms/residuals')

### The residuals are approximately normally distributed for the number of training points used, although a slight skew is visible in the plot. Note that the central limit theorem concerns the distribution of sample means rather than the residuals themselves: with larger samples, inference on the coefficients becomes less sensitive to non-normal errors, but the residuals do not become normal simply because there are more of them. Perfectly shaped distributions are rare in real-life data.
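
Skew and tail weight can also be quantified. The sketch below computes the Jarque-Bera statistic, which combines sample skewness and kurtosis and is approximately chi-squared with two degrees of freedom when the data are normal. The function name `jarque_bera_stat` and the synthetic samples are illustrative assumptions; the notebook's `residuals` array could be passed in the same way.

```python
import numpy as np

def jarque_bera_stat(values):
    """Return (JB statistic, skewness, kurtosis) for a 1-D sample."""
    e = np.asarray(values, dtype=float)
    e = e - e.mean()
    n = e.size
    m2 = np.mean(e ** 2)
    skew = np.mean(e ** 3) / m2 ** 1.5
    kurt = np.mean(e ** 4) / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb, skew, kurt

# Synthetic samples: one normal, one clearly right-skewed
rng = np.random.default_rng(0)
jb_normal, skew_normal, kurt_normal = jarque_bera_stat(rng.normal(size=1000))
jb_skewed, skew_skewed, kurt_skewed = jarque_bera_stat(rng.exponential(size=1000))

# 5.991 is the 5% critical value of chi-squared with 2 degrees of freedom
print("normal sample JB:", round(jb_normal, 2), "skew:", round(skew_normal, 2))
print("skewed sample JB:", round(jb_skewed, 2), "skew:", round(skew_skewed, 2))
```

The chi-squared approximation is rough for small samples, so the statistic is best read together with the histogram rather than on its own.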

## <a id="auto">5. No autocorrelation of residuals</a>

### When the residuals are autocorrelated, the current value depends on previous (historic) values, and there is a definite unexplained pattern in the Y variable that shows up in the error terms. This is most evident in time series data.

#### In plain terms, autocorrelation occurs when there is a pattern across the rows of the data. This is common in time series data, where patterns follow time. For example, the day-of-the-week effect is a well-known pattern in stock markets, where people tend to buy more stocks just before the weekend and sell more on Mondays. This phenomenon has been studied extensively, and what actually causes it is still a matter of research.

### The residuals should not be autocorrelated, so the error terms should not form any pattern.

In [32]:
plt.figure(figsize=(10,5))
p = sns.lineplot(x=y_pred,y=residuals,marker='o',color='blue')  # recent seaborn versions require x and y as keywords
plt.xlabel('y_pred/predicted values')
plt.ylabel('Residuals')
plt.ylim(-10,10)
plt.xlim(0,26)
p = sns.lineplot(x=[0,26],y=[0,0],color='red')
p = plt.title('Residuals vs fitted values plot for autocorrelation check')
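
A common numeric check for first-order autocorrelation is the Durbin-Watson statistic: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, values towards 0 suggest positive autocorrelation, and values towards 4 suggest negative autocorrelation. The sketch below uses synthetic series, and the function name `durbin_watson_stat` is an assumption. The statistic is only meaningful when the rows have a natural order, such as time.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)

# Independent noise: no relation between consecutive values
white = rng.normal(size=1000)

# AR(1) noise: each value carries over 80% of the previous one
shocks = rng.normal(size=1000)
ar1 = np.empty(1000)
ar1[0] = shocks[0]
for t in range(1, 1000):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

dw_white = durbin_watson_stat(white)
dw_ar1 = durbin_watson_stat(ar1)
print("independent noise:", round(dw_white, 2))
print("AR(1) noise:      ", round(dw_ar1, 2))
```

Because a random train/test split shuffles the rows, applying this to residuals in shuffled order would hide any time pattern; the residuals would need to be put back in their original row order first.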

## <a id="multico">6. No perfect multicollinearity</a>

### In regression, multicollinearity refers to the extent to which independent variables are correlated. 

In [33]:
plt.figure(figsize=(20,20))  # set the figure size to 20 by 20 inches
p=sns.heatmap(ad_data.corr(), annot=True,cmap='RdYlGn',square=True)  # seaborn has very simple solution for heatmap
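
Pairwise correlations can miss a predictor that is explained by a combination of the others. A common complement is the variance inflation factor (VIF): regress each predictor on the remaining ones and compute 1 / (1 - R squared). The sketch below is a minimal NumPy version on synthetic data; the function name `vif` and the demo columns are assumptions for illustration. As a rule of thumb, values above roughly 5 to 10 are often treated as a warning sign.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        factors[j] = 1 / (1 - r2)
    return factors

# Synthetic predictors: c is almost a copy of a, b is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

For the notebook's data, the predictor matrix `x` (without the Sales column) could be passed to the same function.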


# Most of the major assumptions of linear regression hold for this data. Great! Since this is one of the simplest data sets, it demonstrates the steps well. The same steps can be applied to other problems to make better decisions about which model to use. I hope this serves as a decent template to apply to other data.

# <a id="other"> 7. Some other model evaluations for fun</a>

In [None]:
from sklearn.tree import DecisionTreeRegressor

dec_tree = DecisionTreeRegressor(random_state=0)
dec_tree.fit(X_train,y_train)
dec_tree_y_pred = dec_tree.predict(X_train)

print("R squared: {}".format(r2_score(y_true=y_train,y_pred=dec_tree_y_pred)))

In [None]:
y_pred_d = dec_tree.predict(X_train)
r2d = r2_score(y_true=y_train,y_pred=y_pred_d)
r2d

In [None]:
y_pred_test_d = dec_tree.predict(X_test)
r2_test_d = r2_score(y_true=y_test,y_pred=y_pred_test_d)
r2_test_d

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_tree = RandomForestRegressor(random_state=0)
rf_tree.fit(X_train,y_train)
rf_tree_y_pred = rf_tree.predict(X_train)
print("R squared: {}".format(r2_score(y_true=y_train,y_pred=rf_tree_y_pred)))

In [None]:
y_pred_test_r = rf_tree.predict(X_test)
r2_test_r = r2_score(y_true=y_test,y_pred=y_pred_test_r)
r2_test_r

Reference:
* http://r-statistics.co/Assumptions-of-Linear-Regression.html
* https://www.statisticssolutions.com/assumptions-of-linear-regression/