# <font size=7> <font color = darkblue> Model Building using Linear Regression

---

In [1]:
import numpy as np   
import pandas as pd    
import seaborn as sns
import matplotlib.pyplot as plt 

In [2]:
plt.rcParams['font.size']=14
plt.rcParams['axes.grid']=True
plt.rcParams['figure.figsize'] = (5,5)

---

#### In this problem, we are predicting the Sales of Carseats for a store based on various other variables.

In [3]:
df = pd.read_csv('Carseats.csv')
df.head()

Let us check the data dictionary.

$\underline{Description}$
A simulated data set containing sales of child car seats at 400 different stores.

$\underline{Format}$
A data frame with 400 observations on the following 11 variables.

$\underline{Sales}$
Unit sales (in thousands) at each location

$\underline{CompPrice}$
Price charged by competitor at each location

$\underline{Income}$
Community income level (in thousands of dollars)

$\underline{Advertising}$
Local advertising budget for company at each location (in thousands of dollars)

$\underline{Population}$
Population size in region (in thousands)

$\underline{Price}$
Price company charges for car seats at each site

$\underline{ShelveLoc}$
A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

$\underline{Age}$
Average age of the local population

$\underline{Education}$
Education level at each location

$\underline{Urban}$
A factor with levels No and Yes to indicate whether the store is in an urban or rural location

$\underline{US}$
A factor with levels No and Yes to indicate whether the store is in the US or not

---

#### Let us check the basic measures of Descriptive Statistics of the numerical variables.

In [None]:
df.describe()

Let us look at the data types of each of the predictor variables.

In [None]:
df.info()

Now, let us look at the distribution plot of the target variable 'Sales'.

In [None]:
sns.displot(df['Sales'], kde = True)
plt.show()

#### Let's identify the feature with the strongest linear relation with Sales!

In [None]:
df.corrwith(df.Sales)

Let us look at the scatterplot between 'Sales' and 'Price', with the Pearson correlation coefficient shown in the legend.

In [None]:
from scipy.stats import pearsonr

In [None]:
P_S_corr = pearsonr(df['Price'], df['Sales'])[0]
P_S_corr

In [None]:
plt.scatter(df['Price'], df['Sales'], label = round(P_S_corr, 3))
plt.legend(loc = 'upper left');

Here, we see that these two variables are negatively correlated.

#### Let us now go ahead and build the Simple Linear Regression model between the variables 'Price' and 'Sales'.

In [None]:
import statsmodels.formula.api as SM

In [None]:
formula_SLR ='Sales~Price'

In [None]:
SM.ols?

In [None]:
model_SLR = SM.ols(formula=formula_SLR, data=df).fit() 
model_SLR.summary()

In [None]:
model_name = []
model_perf = []

model_name.append('SLR')
model_perf.append(model_SLR.rsquared_adj)

In [None]:
model_perf

#### We notice that the ${R^2}$ value in this case is very low. 
- Only around 20% of the variability in the dependent variable is explained by the 'Price' variable in this model.
- For Simple Linear Regression, the square of Pearson's correlation coefficient is the same as the value of the ${R^2}$.

Let us check it now.

In [None]:
print(np.square(P_S_corr))
print(model_SLR.rsquared)

#### Before we build the Multiple Linear Regression model, let us play around with the data and try a variable transformation (standardisation) to see whether it improves performance.

In [None]:
scaled_price = (df['Price']-np.mean(df['Price']))/np.std(df['Price'], ddof=1)
scaled_price

In [None]:
scaled_sales = (df['Sales']-np.mean(df['Sales']))/np.std(df['Sales'],ddof=1)
scaled_sales

Let us check the distribution plots of the scaled 'Sales' and 'Price' variables.

In [None]:
sns.displot(scaled_sales, kde = True);

In [None]:
sns.displot(scaled_price, kde = True);

In [None]:
model_SLR_exp = SM.ols(formula='scaled_sales~scaled_price',data=df).fit()
model_SLR_exp.summary()

*We see that the **${R^2}$ value has remained the same after this transformation.** Standardising the variables changes the intercept and the slope, but for Linear Regression it leaves the ${R^2}$ and the significance of the slope unchanged compared to the unscaled variables.*
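
The same behaviour can be reproduced on made-up data. The sketch below is only illustrative: the arrays `x` and `y` are synthetic stand-ins for a price-like predictor and a sales-like target, and `zscore` and `slr_fit` are hypothetical helper names, not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)                      # synthetic data, not the Carseats file
x = rng.normal(100, 20, size=200)                   # stand-in for a price-like predictor
y = 14 - 0.05 * x + rng.normal(0, 2.5, size=200)    # stand-in for a sales-like target

def zscore(v):
    """Standardise with the sample standard deviation (ddof=1)."""
    return (v - v.mean()) / v.std(ddof=1)

def slr_fit(x, y):
    """Return (intercept, slope, R^2) of a one-predictor least-squares fit."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    r2 = 1 - resid.var() / y.var()                  # 1 - SSR/SST
    return intercept, slope, r2

b0_raw, b1_raw, r2_raw = slr_fit(x, y)
b0_std, b1_std, r2_std = slr_fit(zscore(x), zscore(y))
r = np.corrcoef(x, y)[0, 1]

print(r2_raw, r2_std)      # the two R^2 values agree up to floating-point error
print(b1_raw, b1_std, r)   # the slopes differ; the standardised slope equals r
```

After standardising both variables, the fitted slope is the Pearson correlation itself and the intercept is zero, which is why the fit quality cannot change.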

#### We will now build the Multiple Linear Regression model. But this data set has categorical variables as well.
- So let us convert the categorical variables into dummy variables.

In [None]:
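# dtype=int keeps the dummy columns numeric (newer pandas versions default to bool columns)
df_allvar = pd.get_dummies(df, dtype=int)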

In [None]:
df_allvar.head()

#### Let us now check the correlation amongst the predictor variables just to make sure that the predictor variables are not highly correlated amongst themselves.

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(df_allvar.corr(), annot=True, mask=np.triu(df_allvar.corr(),1), cmap = 'coolwarm')
plt.show()

Let us now go ahead and build a linear regression on the data with all the levels of categorical variables.

## Model 1: Model using dummy without dropping one level

In [None]:
formula_MLR_1 = 'Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+ShelveLoc_Bad+ShelveLoc_Good+ShelveLoc_Medium+Urban_No+Urban_Yes+US_No+US_Yes'

In [None]:
model_MLR_1 = SM.ols(formula=formula_MLR_1,data=df_allvar).fit()
model_MLR_1.summary()

In [None]:
model_name.append('All data')
model_perf.append(model_MLR_1.rsquared_adj)

- The p-values for the variables 'Population' and 'Education' are high. These variables are not statistically significant. But we need to understand these variables from a business point of view and only then drop them if required.

- In the above model, however, we have not dropped one of the categories while creating the dummy variables for each categorical variable, and thus there is a problem of multicollinearity in the data. 

- We will look at the test for multicollinearity in a short while, but first we will rebuild the model.

- For rebuilding the model, we will re-create the data set with the appropriate levels of dummy variables.
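
A small sketch of why keeping every dummy level causes trouble, under the assumption that the model also contains an intercept. The six shelf-location labels below are hypothetical and are not read from the data set.

```python
import numpy as np

# Hypothetical shelf-location labels for six stores (illustrative only).
levels = np.array(["Bad", "Good", "Medium", "Good", "Bad", "Medium"])
categories = ["Bad", "Good", "Medium"]

# One indicator column per level, i.e. no level dropped.
full = np.column_stack([(levels == c).astype(float) for c in categories])
intercept = np.ones((len(levels), 1))

X_all = np.hstack([intercept, full])           # intercept + 3 dummies
X_drop = np.hstack([intercept, full[:, 1:]])   # intercept + 2 dummies ('Bad' is the baseline)

# The dummies of one factor always sum to 1, which duplicates the intercept column.
print(full.sum(axis=1))
print(X_all.shape[1], np.linalg.matrix_rank(X_all))     # more columns than rank
print(X_drop.shape[1], np.linalg.matrix_rank(X_drop))   # columns equal rank
```

With all levels kept, the design matrix has four columns but only rank three, so the coefficients are not uniquely determined. Dropping one level restores full column rank.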

## Model 2: Model using appropriate number of dummy variable levels.

In [None]:
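# dtype=int keeps the dummy columns numeric (newer pandas versions default to bool columns)
df_dummy = pd.get_dummies(df, drop_first=True, dtype=int)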

In [None]:
df_dummy.head()

### <font color = darkgreen> Discussion: Should we always one hot encode our categorical variables?

- Here, we see that the number of columns has been reduced and only the necessary columns are present.
- Let us now check the correlation matrix in the form of a heatmap.

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(df_dummy.corr(),annot=True,mask=np.triu(df_dummy.corr(),+1))
plt.show()

In [None]:
formula_MLR_2 ='Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+ShelveLoc_Good+ShelveLoc_Medium+Urban_Yes+US_Yes'

In [None]:
model_MLR_2 = SM.ols(formula=formula_MLR_2,data=df_dummy).fit()
model_MLR_2.summary()

In [None]:
model_name.append('drop first dummy')
model_perf.append(model_MLR_2.rsquared_adj)

---
## <font color = darkblue> Now, let us check and treat the multicollinearity problem if it is present.

- ### Variance Inflation Factor (VIF) regresses each predictor variable on the remaining predictor variables and then calculates the VIF values based on the ${R^2}$ of each such regression.

- ### The formula for VIF calculation is :
- # \begin{equation*} VIF  =  \frac{1}{1 - {R^2}} \end{equation*} 
- ### A VIF threshold value of 5 is commonly used to leave out columns. Sometimes 2 or 10 are also considered as VIF threshold values.
- ### A VIF value of 5 means that we can choose to drop a predictor variable if 80% of its variation is explained by the other predictor variables.
---
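
The link between the VIF thresholds and the underlying ${R^2}$ can be checked with a few lines of arithmetic. The helper names `vif_from_r2` and `r2_from_vif` are hypothetical and exist only for this sketch.

```python
def vif_from_r2(r2):
    """VIF implied by the R^2 of regressing one predictor on the others."""
    return 1.0 / (1.0 - r2)

def r2_from_vif(vif):
    """R^2 that corresponds to a given VIF threshold."""
    return 1.0 - 1.0 / vif

# Thresholds mentioned above: 2, 5 and 10.
for threshold in (2, 5, 10):
    print(threshold, r2_from_vif(threshold))
```

Thresholds of 2, 5 and 10 correspond to roughly 50%, 80% and 90% of a predictor's variation being explained by the other predictors.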

We will calculate the Variance Inflation Factor with a user-defined function.
Below is the function that is created to calculate the Variance Inflation Factor (VIF) values.
- The first line defines the function "vif_cal", which takes a data frame containing only the predictor variables.
- We then store the input data in x_vars and its column names in xvar_names.
- Next, we define a 'for' loop in which y, the target of the auxiliary regression, is one of the columns of the input data set.
- The x or the predictor variables are then defined as all the columns of the input data except the y defined in the last step.
- We then fit a regression of y on x and store its ${R^2}$ value in the variable rsq.
- Another variable by the name of vif is defined and the ${R^2}$ value is put into the VIF formula.
- Lastly, we print this value.

This process is repeated for all the predictor variables.

In [None]:
def vif_cal(input_data):
    '''
    input_data: Dataframe of features
    '''
    x_vars = input_data
    xvar_names = input_data.columns
    for i in range(len(xvar_names)):
        y = x_vars[xvar_names[i]] 
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = SM.ols(formula="y~x", data=x_vars).fit().rsquared  
        vif = round(1/(1-rsq), 2)
        print (xvar_names[i], " VIF = " , vif)

In [None]:
vif_cal(input_data= df_dummy.drop('Sales', axis=1))

- Now, let us understand the mathematical significance of any one of these vif calculations.
- We will manually do the calculation behind this custom function for the variable 'CompPrice'.
- # \begin{equation*} VIF  =  \frac{1}{1 - {R^2}} \end{equation*}

In [None]:
#Building the model
model_vif = SM.ols(formula='CompPrice~Income+Advertising+Population+Price+Age+Education+ShelveLoc_Good+ShelveLoc_Medium+Urban_Yes+US_Yes',
                   data=df_dummy).fit()

In [None]:
model_vif.rsquared

In [None]:
#Calculating the vif from the above formula
round(1/(1-model_vif.rsquared),2)

In this way, the vif value for all the predictor variables is calculated.

- We know that the value of ${R^2}$ of any regression lies between 0 and 1. A value of 0 means that all the predictor variables combined explain 0% of the variation in the target variable, whereas 1 means that all the predictor variables combined explain 100% of the variation in the target variable.

- The higher the value of ${R^2}$, the smaller 1 - ${R^2}$ will be. Thus, the inverse of a very small number will be a huge number.

### <font color = darkblue> Let's do a few more VIF exercises to understand it better...

#### Let us check the vif of the data frame which contains the dummy variables without dropping a category

In [None]:
vif_cal(input_data= df_allvar.drop('Sales', axis=1))

The above values corroborate our understanding of vif. Since multicollinearity is present, we see that the vif values are very high. 

#### Let us check how the vif values differ if we deliberately add a variable that should be strongly collinear with one of the existing variables.

In [None]:
df_dummy_copy = df_dummy.copy()

In [None]:
df_dummy_copy['Incomesq'] = np.square(df_dummy_copy['Income'])
#introducing a variable which is the square of one of the predictor variables
df_dummy_copy.head()

In [None]:
df_dummy_copy.columns

In [None]:
vif_cal(input_data= df_dummy_copy.drop('Sales', axis=1))

Here, we see that the vif has indeed increased for the 'Income' and 'Incomesq' variables. We can go ahead and drop the 'Incomesq' variable as that variable has been derived from the 'Income' variable.

#### End of the VIF exercises
---

#### Coming back to our last model...

In [None]:
model_MLR_2.summary()

In [None]:
vif_cal(input_data= df_dummy.drop('Sales', axis=1))

- On our original model, we see that the vif of 'Advertising' is comparatively a little higher, but it is not so high as to drop it. We will keep it in our model. However, we can drop the 'US_Yes' variable, as it has a comparatively high vif along with a high p-value, indicating that this variable might not be significant for this model.

- If we decide to drop variables on the basis of vif, we will drop them one by one. After one variable is dropped, we rerun the regression model and the vif function. Then, if needed, we drop more variables. 

- Dropping variables means losing out on information. That can hamper the predictive as well as the descriptive power of the model.
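
The one-at-a-time procedure described above can be sketched as a loop. This is a minimal illustration on synthetic predictors, assuming a threshold of 5. The functions `vif` and `drop_by_vif` are hypothetical helpers written with NumPy only and are not the `vif_cal` function used in this notebook.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the remaining columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

def drop_by_vif(X, names, threshold=5.0):
    """Drop the highest-VIF column, recompute, and repeat until all VIFs <= threshold."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        print("dropping", names[worst], "VIF =", round(float(vifs[worst]), 2))
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names

rng = np.random.default_rng(1)                  # synthetic predictors for illustration
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + b + rng.normal(scale=0.05, size=300)    # nearly a linear combination of a and b
X_kept, kept = drop_by_vif(np.column_stack([a, b, c]), ["a", "b", "c"])
print(kept)
```

Recomputing the VIFs after every drop matters because removing one collinear column usually lowers the VIFs of the columns it was entangled with.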

---
## <font color = darkblue> Dropping features based on high P Value

- We notice that the p-value for the t-statistic calculation for the 'Population' variable is the highest (higher than 0.05).
- For the $\underline{t-statistic}$ of every coefficient of the Linear Regression, the null and alternate hypotheses are as follows:
- #### ${H_0}$ : The coefficient of the variable is zero (the variable is not significant).
- #### ${H_1}$:  The coefficient of the variable is not zero (the variable is significant).
- The lower the p-value for the t-statistic, the stronger the evidence against ${H_0}$ and the more significant the variable.
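
To make the test concrete, the sketch below computes the t-statistic of each coefficient by hand on synthetic data, where the null hypothesis is that the coefficient is zero. The variables `x1` and `x2` are made up for illustration. The p-values use a normal approximation, which is an assumption that is reasonable only because the sample is large; the regression summary itself relies on the t distribution.

```python
import math
import numpy as np

rng = np.random.default_rng(2)                 # synthetic data for illustration
n = 400
x1 = rng.normal(size=n)                        # has a real effect on y
x2 = rng.normal(size=n)                        # has no effect on y
y = 3.0 + 1.5 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])      # intercept, x1, x2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = n - X.shape[1]                           # residual degrees of freedom
sigma2 = resid @ resid / dof                   # estimate of the error variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se                            # t = coefficient / standard error

# Two-sided p-values from the normal approximation (assumption: large dof).
p_approx = [math.erfc(abs(t) / math.sqrt(2)) for t in t_stats]

for name, t, p in zip(["Intercept", "x1", "x2"], t_stats, p_approx):
    print(name, round(float(t), 2), round(p, 4))
```

A coefficient with a large absolute t-statistic, such as the one for `x1` here, yields a tiny p-value and the null hypothesis is rejected. For a predictor with no real effect, such as `x2`, the p-value will usually be well above 0.05.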

## Model 3: Model without the 'Population' variable

In [None]:
formula_MLR_3 = 'Sales~CompPrice+Income+Advertising+Price+Age+Education+ShelveLoc_Good+ShelveLoc_Medium+Urban_Yes+US_Yes'

In [None]:
model_MLR_3 = SM.ols(formula=formula_MLR_3,data=df_dummy).fit()
model_MLR_3.summary()

In [None]:
model_name.append('w/o population')
model_perf.append(model_MLR_3.rsquared_adj)

- There is almost no change in the ${R^2}$ values. 

- While adding or removing variables from a regression model to refine the model, we need to be very careful about the Adjusted ${R^2}$ values. Adding any variable, even one that is not significant, can never decrease the ${R^2}$ value, whereas the Adjusted ${R^2}$ penalises every extra predictor and improves only when the added variable contributes enough explanatory power.
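
The penalty can be seen directly from the usual formula, Adjusted ${R^2} = 1 - (1 - {R^2})\frac{n-1}{n-p-1}$, where n is the number of rows and p the number of predictors. The ${R^2}$ values in this sketch are hypothetical numbers chosen only to illustrate the effect; they are not outputs of the models above.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors (excluding the intercept) and n rows."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n = 400
# Hypothetical R^2 values, chosen only to illustrate the penalty.
with_extra = adjusted_r2(0.8735, n, p=11)      # model that includes a weak predictor
without_extra = adjusted_r2(0.8734, n, p=10)   # same model without it
print(with_extra, without_extra)
```

Here the larger model has the higher ${R^2}$ but the lower Adjusted ${R^2}$, because the tiny gain in fit does not compensate for the extra degree of freedom.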

Let us check the $R^2$ and adjusted $R^2$ values for the $2^{nd}$ and $3^{rd}$ Multiple Linear Regression Model.

In [None]:
print('For the second MLR model:','\n')

print('Rsquared',model_MLR_2.rsquared)
print('Adjusted Rsquared',model_MLR_2.rsquared_adj)

In [None]:
print('For the third MLR model:','\n')

print('Rsquared',model_MLR_3.rsquared)
print('Adjusted Rsquared',model_MLR_3.rsquared_adj)

This means that the particular information about the population does not help us in predicting the 'Sales' as compared to the other information that we have.

## Model 4: Drop 'CompPrice'
- Let us see what happens when we drop a statistically significant variable from the model.
- In this case, we will drop the 'CompPrice' variable.

In [None]:
formula_MLR_4 = 'Sales~Income+Advertising+Population+Price+Age+Education+ShelveLoc_Good+ShelveLoc_Medium+Urban_Yes+US_Yes'

In [None]:
model_MLR_4 = SM.ols(formula=formula_MLR_4,data=df_dummy).fit()

In [None]:
model_MLR_4.summary()

In [None]:
model_name.append('drop Comp Price')
model_perf.append(model_MLR_4.rsquared_adj)

As expected, we see that both the Adjusted ${R^2}$ and the ${R^2}$ values have dropped massively. The p-values of the t-statistics of certain variables have also changed. This indicates that, in this iteration of the model, a few variables have become more important.

#### Let us again check the earlier model before dropping any statistically significant variables.

In [None]:
model_MLR_3.summary()

## Model 5 - Drop Urban Yes

Let us now go ahead and drop the 'Urban_Yes' variable, as it does not seem statistically significant.

In [None]:
formula_MLR_5 = 'Sales~CompPrice+Income+Advertising+Price+Age+Education+ShelveLoc_Good+ShelveLoc_Medium+US_Yes'

In [None]:
model_MLR_5 = SM.ols(formula=formula_MLR_5,data=df_dummy).fit()

In [None]:
model_MLR_5.summary()

In [None]:
model_name.append('drop Urban Yes')
model_perf.append(model_MLR_5.rsquared_adj)

Almost no change in the ${R^2}$ and Adjusted ${R^2}$ is observed thus confirming the fact that the variable was indeed not significant.

## Model 6: Drop Education

Now we will check the diagnostics of the model after dropping the 'Education' variable as that does not seem significant.

In [None]:
formula_MLR_6 = 'Sales~CompPrice+Income+Advertising+Price+Age+ShelveLoc_Good+ShelveLoc_Medium+US_Yes'

In [None]:
model_MLR_6 = SM.ols(formula=formula_MLR_6,data=df_dummy).fit()

In [None]:
model_MLR_6.summary()

In [None]:
model_name.append('drop Education')
model_perf.append(model_MLR_6.rsquared_adj)

From the above model we can thus conclude that Education is not a significant variable when it comes to predicting the sales.

## Model 7 - drop US Yes

From the p-value of 'US_Yes', the variable does not seem significant. We will run the model after dropping the variable and then check the values of $R^2$ and adjusted $R^2$ again.

In [None]:
formula_MLR_7 = 'Sales~CompPrice+Income+Advertising+Price+Age+ShelveLoc_Good+ShelveLoc_Medium'

In [None]:
model_MLR_7 = SM.ols(formula=formula_MLR_7,data=df_dummy).fit()

In [None]:
model_MLR_7.summary()

In [None]:
model_name.append('drop US Yes')
model_perf.append(model_MLR_7.rsquared_adj)

We see that the $R^2$ and adjusted $R^2$ values do not change much if we drop the 'US_Yes' variable.

## Model 8: Drop Income

Let us drop the 'Income' variable and run the model once more. Here all the variables are significant, but we want to see whether the output changes a lot if we drop the least significant of these significant variables.

In [None]:
formula_MLR_8 = 'Sales~CompPrice+Advertising+Price+Age+ShelveLoc_Good+ShelveLoc_Medium'

In [None]:
model_MLR_8 = SM.ols(formula=formula_MLR_8,data=df_dummy).fit()

In [None]:
model_MLR_8.summary()

In [None]:
model_name.append('drop Income')
model_perf.append(model_MLR_8.rsquared_adj)

Let us compare the $R^2$ and the adjusted $R^2$ values with model 7.

In [None]:
print('For the seventh MLR model:','\n')

print('Rsquared',model_MLR_7.rsquared)
print('Adjusted Rsquared',model_MLR_7.rsquared_adj)

In [None]:
print('For the eighth MLR model:','\n')

print('Rsquared',model_MLR_8.rsquared)
print('Adjusted Rsquared',model_MLR_8.rsquared_adj)

In [None]:
print('We notice that there is drop of',round((model_MLR_7.rsquared - model_MLR_8.rsquared),7),'and',round((model_MLR_7.rsquared_adj-model_MLR_8.rsquared_adj),7),'for Rsquared and adjusted Rsquared respectively.')

Let us check the p-values once more.

In [None]:
model_MLR_8.pvalues

## Model 9: Drop ShelveLoc: Medium

Let us drop the 'ShelveLoc_Medium' variable. Again, we are dropping the least significant variable among the remaining significant variables.

In [None]:
formula_MLR_9 = 'Sales~CompPrice+Advertising+Price+Age+ShelveLoc_Good'

In [None]:
model_MLR_9 = SM.ols(formula=formula_MLR_9,data=df_dummy).fit()

In [None]:
model_MLR_9.summary()

In [None]:
model_name.append('drop Shelve Medium')
model_perf.append(model_MLR_9.rsquared_adj)

There is a huge drop in the values of $R^2$ and adjusted $R^2$ if we drop 'ShelveLoc_Medium'.

In [None]:
print('We notice that there is drop of',round((model_MLR_8.rsquared - model_MLR_9.rsquared),6),'and',round((model_MLR_8.rsquared_adj-model_MLR_9.rsquared_adj),6),'for Rsquared and adjusted Rsquared respectively.')

We have thus seen the effects and power of various variables on describing the target variable.

# <font color = darkblue> Model Evaluation


In [None]:
model_eval = pd.DataFrame({'model_name': model_name, 'model_perf': model_perf})
model_eval

- #### We will use Model 7 and Model 8 to predict and check the model evaluation.
- #### Model 7, because it has a high Adjusted R Square with the least number of features.
- #### Model 8, for the sake of comparison.

### Model 7 & 8 - Prediction and Scatterplot

In [None]:
model_MLR_7_pred = model_MLR_7.fittedvalues
model_MLR_8_pred = model_MLR_8.fittedvalues
model_MLR_7_pred

In [None]:
f, (ax1, ax2) =  plt.subplots(nrows=1, ncols=2, figsize=(15,5), sharey=True)

ax1.scatter(df_dummy['Sales'], model_MLR_7_pred)
ax1.set_title('Model 7 predictions')

ax2.scatter(df_dummy['Sales'],model_MLR_8_pred)
ax2.set_title('Model 8 predictions')
plt.show()

#### Checking the boxplot and the distplot of the residuals

In [None]:
f,a =  plt.subplots(1,2, sharex=True, sharey=False, squeeze=False, figsize=(15,5))

#Plotting the distribution of the residuals for models 7 and 8

plot_0 = sns.histplot(model_MLR_7.resid, ax=a[0][0], kde=True)
a[0][0].set_title('Model 7: Distplot of the residuals')

plot_1 = sns.histplot(model_MLR_8.resid, ax=a[0][1], kde=True)
a[0][1].set_title('Model 8: Distplot of the residuals')
plt.show()


In [None]:
f,a =  plt.subplots(1,2, sharex=True, sharey=False, squeeze=False, figsize=(15,5))

#Plotting the boxplot of the residuals for models 7 and 8

plot_0 = sns.boxplot(x= model_MLR_7.resid, ax=a[0][0])
a[0][0].set_title('Model 7: Boxplot of the residuals')

plot_1 = sns.boxplot(x = model_MLR_8.resid, ax=a[0][1])
a[0][1].set_title('Model 8: Boxplot of the residuals')
plt.show()


In [None]:
from sklearn import metrics

### Model 7 - RMSE

In [None]:
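# RMSE as the square root of the MSE (the 'squared' argument is deprecated in newer scikit-learn versions)
np.sqrt(metrics.mean_squared_error(df_dummy['Sales'], model_MLR_7_pred))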

### Model 8 - RMSE

In [None]:
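# RMSE as the square root of the MSE (the 'squared' argument is deprecated in newer scikit-learn versions)
np.sqrt(metrics.mean_squared_error(df_dummy['Sales'], model_MLR_8_pred))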
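
For reference, RMSE is simply the square root of the mean squared residual. The sketch below computes it by hand. The function name `rmse` is a hypothetical helper, and the toy values are not taken from the Carseats data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, not taken from the Carseats data.
actual = [9.5, 11.2, 10.1, 7.4]
predicted = [9.0, 11.0, 10.6, 8.0]
print(rmse(actual, predicted))
```

Because RMSE is expressed in the units of the target, a value here would read as a typical prediction error in thousands of units sold.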

---

# <font color = darkblue>Only for Predictive purposes of Linear Regression
---

- If we only wanted to predict using Linear Regression and were not looking for the model building aspect of it, we can do that as well. 
- For this exercise, we will use the same variables as in Model 2, Model 7, Model 8 and Model 9.

###  Key Differences in Predictive Modelling
- #### We will split the data into train and test sets to get an idea of the expected quality of future predictions.
- #### We will need to choose a metric of interest. Let's choose RMSE.
- #### We will build the model on the training data and check the RMSE on the test data.

###### Note: We are going to build all the models, get their predictions and then go on to evaluate those models.
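
The workflow listed above can be sketched end to end with NumPy alone. The data here are synthetic and the helpers `add_intercept` and `rmse` are hypothetical; the notebook itself uses scikit-learn for the split and the fit.

```python
import numpy as np

rng = np.random.default_rng(3)                  # synthetic data for illustration
n = 400
X = rng.normal(size=(n, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# 70/30 split on shuffled row indices.
idx = rng.permutation(n)
n_train = round(0.7 * n)
train, test = idx[:n_train], idx[n_train:]

def add_intercept(M):
    """Prepend a column of ones so the fit includes an intercept."""
    return np.column_stack([np.ones(len(M)), M])

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Fit on the training rows only, then score both subsets.
beta, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)
rmse_train = rmse(y[train], add_intercept(X[train]) @ beta)
rmse_test = rmse(y[test], add_intercept(X[test]) @ beta)
print(rmse_train, rmse_test)
```

The test RMSE is the number to watch: it is computed on rows the model never saw, so it is the better guide to the quality of future predictions.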

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

In [None]:
df_dummy.head()

Splitting the data into the dependent and independent variables.

In [None]:
X = df_dummy.drop('Sales', axis=1)
Y = df_dummy['Sales']

Splitting the data into train (70%) and test (30%).

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

### Using only Model 2 variables to build the model on the training data and predict on the training as well as test data.

In [None]:
# A fresh LinearRegression per model, so that later fits do not overwrite this one
model_2 = LinearRegression().fit(X_train[['CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'Age', 'Education', 'ShelveLoc_Good', 'ShelveLoc_Medium', 'Urban_Yes', 'US_Yes']], Y_train)
#We are only using Linear Regression as a predictive tool and not a descriptive tool

In [None]:
#Training Data Prediction
model_2_pred_train = model_2.predict(X_train[['CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'Age', 'Education', 'ShelveLoc_Good', 'ShelveLoc_Medium', 'Urban_Yes', 'US_Yes']])

In [None]:
#Test Data Prediction
model_2_pred_test = model_2.predict(X_test[['CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'Age', 'Education', 'ShelveLoc_Good', 'ShelveLoc_Medium', 'Urban_Yes', 'US_Yes']])

### Using only Model 7 variables to build the model on the training data and predict on the training as well as test data.

In [None]:
# A fresh LinearRegression per model, so that later fits do not overwrite this one
model_7 = LinearRegression().fit(X_train[['CompPrice', 'Income', 'Advertising', 'Price',
       'Age', 'ShelveLoc_Good', 'ShelveLoc_Medium']],Y_train)

In [None]:
#Training Data Prediction
model_7_pred_train = model_7.predict(X_train[['CompPrice', 'Income', 'Advertising', 'Price',
       'Age', 'ShelveLoc_Good', 'ShelveLoc_Medium']])

In [None]:
#Test Data Prediction
model_7_pred_test = model_7.predict(X_test[['CompPrice', 'Income', 'Advertising', 'Price',
       'Age', 'ShelveLoc_Good', 'ShelveLoc_Medium']])

### Using only Model 8 variables to build the model on the training data and predict on the training as well as test data.

In [None]:
# A fresh LinearRegression per model, so that later fits do not overwrite this one
model_8 = LinearRegression().fit(X_train[['CompPrice','Advertising', 'Price',
       'Age', 'ShelveLoc_Good', 'ShelveLoc_Medium']],Y_train)

In [None]:
#Training Data Prediction
model_8_pred_train = model_8.predict(X_train[['CompPrice','Advertising', 'Price',
       'Age', 'ShelveLoc_Good', 'ShelveLoc_Medium']])

In [None]:
#Test Data Prediction
model_8_pred_test = model_8.predict(X_test[['CompPrice','Advertising', 'Price',
       'Age', 'ShelveLoc_Good', 'ShelveLoc_Medium']])

### Using only Model 9 variables to build the model on the training data and predict on the training as well as test data.

In [None]:
# A fresh LinearRegression per model, so that later fits do not overwrite this one
model_9 = LinearRegression().fit(X_train[['CompPrice','Advertising', 'Price',
       'Age', 'ShelveLoc_Good']],Y_train)

In [None]:
#Training Data Prediction
model_9_pred_train = model_9.predict(X_train[['CompPrice','Advertising', 'Price',
       'Age', 'ShelveLoc_Good']])

In [None]:
#Test Data Prediction
model_9_pred_test = model_9.predict(X_test[['CompPrice','Advertising', 'Price',
       'Age', 'ShelveLoc_Good']])

## RMSE check for all the models built

In [None]:
# RMSE as the square root of the MSE (the 'squared' argument is deprecated in newer scikit-learn versions)
print('Training Data RMSE of model_2:',np.sqrt(metrics.mean_squared_error(Y_train,model_2_pred_train)))
print('Test Data RMSE of model_2:',np.sqrt(metrics.mean_squared_error(Y_test,model_2_pred_test)))

In [None]:
# RMSE as the square root of the MSE (the 'squared' argument is deprecated in newer scikit-learn versions)
print('Training Data RMSE of model_7:',np.sqrt(metrics.mean_squared_error(Y_train,model_7_pred_train)))
print('Test Data RMSE of model_7:',np.sqrt(metrics.mean_squared_error(Y_test,model_7_pred_test)))

In [None]:
# RMSE as the square root of the MSE (the 'squared' argument is deprecated in newer scikit-learn versions)
print('Training Data RMSE of model_8:',np.sqrt(metrics.mean_squared_error(Y_train,model_8_pred_train)))
print('Test Data RMSE of model_8:',np.sqrt(metrics.mean_squared_error(Y_test,model_8_pred_test)))

In [None]:
# RMSE as the square root of the MSE (the 'squared' argument is deprecated in newer scikit-learn versions)
print('Training Data RMSE of model_9:',np.sqrt(metrics.mean_squared_error(Y_train,model_9_pred_train)))
print('Test Data RMSE of model_9:',np.sqrt(metrics.mean_squared_error(Y_test,model_9_pred_test)))

The best descriptive model might not be the best predictive model.

## Scatter plot for the predictions

In [None]:
# Training Data
f,a =  plt.subplots(2,2,sharex=True, figsize=(15,8))
a[0][0].scatter(Y_train,model_2_pred_train)
a[0][0].set_title('model_2')
a[0][1].scatter(Y_train,model_7_pred_train)
a[0][1].set_title('model_7')
a[1][0].scatter(Y_train,model_8_pred_train)
a[1][0].set_title('model_8')
a[1][1].scatter(Y_train,model_9_pred_train)
a[1][1].set_title('model_9')
plt.show()

In [None]:
# Test Data
f,a =  plt.subplots(2,2,sharex=True, figsize=(15,8))
a[0][0].scatter(Y_test,model_2_pred_test)
a[0][0].set_title('model_2')
a[0][1].scatter(Y_test,model_7_pred_test)
a[0][1].set_title('model_7')
a[1][0].scatter(Y_test,model_8_pred_test)
a[1][0].set_title('model_8')
a[1][1].scatter(Y_test,model_9_pred_test)
a[1][1].set_title('model_9')
plt.show()

# END