This question should be answered using the `Carseats` data set.

In [0]:
# import statistical tools
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import scipy.stats as stats

In [0]:
# import data visualisation tools
import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
# load data; visualisation same as Section 3.6.3
url = "abfss://training@sa8451learningdev.dfs.core.windows.net/interpretable_machine_learning/eml_data/Carseats.csv"
CarSeats = spark.read.option("header", "true").csv(url).toPandas()
CarSeats.set_index('SlNo', inplace=True)

str_cols = ["ShelveLoc", "Urban", "US"]
num_cols = ["Sales", "CompPrice", "Income", "Advertising", "Population", "Price", "Age", "Education"]
CarSeats[str_cols] = CarSeats[str_cols].astype(str)
CarSeats[num_cols] = CarSeats[num_cols].astype(float)

In [0]:
CarSeats.head()

In [0]:
list(CarSeats)

In [0]:
CarSeats.info()

**a. Fit a multiple regression model to predict `Sales` using `Price`, `Urban`, and `US`.**

In [0]:
reg = ols(formula = 'Sales ~ Price + C(Urban) + C(US)', data = CarSeats).fit() # C prepares categorical data for regression

In [0]:
reg.summary()

**b. Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!**

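Holding the other predictors fixed (ceteris paribus), a one-unit increase in `Price` is associated with a decrease of 0.0545 units in `Sales`. `Urban` and `US` are qualitative, so their coefficients compare groups rather than describe a unit increase. A store in an urban location sells on average 0.0219 units less than an otherwise comparable non-urban store, although this difference is not statistically significant (p-value 0.936). A store in the US sells on average 1.2006 units more than an otherwise comparable store outside the US.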

**c. Write out the model in equation form, being careful to handle the qualitative variables properly.**

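$$\text{Sales} = 13.0435 - 0.0545 \times \text{Price} - 0.0219 \times \text{Urban} + 1.2006 \times \text{US}$$

Here $\text{Urban} = 1$ if the store is in an urban location and 0 otherwise, and $\text{US} = 1$ if the store is in the US and 0 otherwise. For an urban store in the US, for example, the equation reduces to $\text{Sales} = 14.2222 - 0.0545 \times \text{Price}$.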
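
As a small sketch of how the indicator variables shift the intercept, the snippet below evaluates the fitted equation for each combination of `Urban` and `US`. The coefficients are the ones quoted in the text; the function name `predicted_sales` is only an illustrative helper, not part of the original notebook.

```python
# Coefficients as reported in the text for the model in (a).
INTERCEPT = 13.0435
B_PRICE = -0.0545
B_URBAN_YES = -0.0219
B_US_YES = 1.2006

def predicted_sales(price, urban, us):
    """Evaluate the fitted equation; urban and us are booleans (True means 'Yes')."""
    return INTERCEPT + B_PRICE * price + B_URBAN_YES * int(urban) + B_US_YES * int(us)

# Setting price to 0 isolates the effective intercept for each group.
for urban in (False, True):
    for us in (False, True):
        intercept = predicted_sales(0, urban, us)
        print(f"Urban={urban!s:5} US={us!s:5} -> Sales = {intercept:.4f} - 0.0545 * Price")
```

The slope on `Price` is the same in all four cases; only the intercept changes, which is what an additive model with dummy variables implies.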

**d. For which of the predictors can you reject the null hypothesis?**

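We can reject the null hypothesis for `Price` and `US`. We cannot reject it for `Urban`, given its high p-value (0.936), so there is no evidence that `Urban` is associated with `Sales` once the other predictors are in the model.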
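
To illustrate where such a decision comes from, here is a minimal sketch that computes OLS coefficients, standard errors and t-statistics by hand with NumPy. It uses synthetic data (an assumption for illustration, not the Carseats data) in which the `urban` column has no effect by construction. With this many observations, |t| above roughly 2 corresponds to a p-value below 0.05.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins, NOT the Carseats data.
price = rng.normal(115, 24, n)
us = rng.integers(0, 2, n).astype(float)
urban = rng.integers(0, 2, n).astype(float)  # no effect on y by construction
y = 13.0 - 0.05 * price + 1.2 * us + rng.normal(0, 2.5, n)

X = np.column_stack([np.ones(n), price, urban, us])
names = ["Intercept", "Price", "Urban", "US"]

# Least-squares fit, then the usual standard-error formula.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se

for name, b, s, t in zip(names, beta, se, t_stats):
    print(f"{name:10s} coef={b:8.4f}  se={s:7.4f}  t={t:7.2f}")
```

In the notebook itself the same quantities are already reported by `reg.summary()`; the sketch only shows what sits behind the `t` and `P>|t|` columns.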

**e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.**

In [0]:
reg_1 = ols(formula = 'Sales ~ Price + C(US)', data = CarSeats).fit()

In [0]:
reg_1.summary()

In [0]:
# run predictions
predictions_1 = pd.DataFrame(reg_1.predict())
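# use the model's own residuals: predictions_1 has a fresh 0..n-1 index, so subtracting it
# from CarSeats['Sales'] (indexed by SlNo) would misalign the rows
residuals_1 = reg_1.resid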

In [0]:
plt.xkcd()
plt.figure(figsize = (25, 10))
sns.histplot(residuals_1, kde = True) # distplot is deprecated; residuals look approximately normal
plt.title("Residual Plot")

In [0]:
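# same specification as reg_1 (Price + US), refitted under a separate name for the diagnostics in (h)
reg_2 = ols(formula = 'Sales ~ Price + C(US)', data = CarSeats).fit()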

In [0]:
reg_2.summary()

In [0]:
predictions_2 = pd.DataFrame(reg_2.predict())
# use the model's own residuals to avoid index misalignment between SlNo and 0..n-1
residuals_2 = reg_2.resid

In [0]:
plt.xkcd()
plt.figure(figsize = (25, 10))
sns.histplot(residuals_2, kde = True, color = 'green') # distplot is deprecated; residuals look approximately normal
plt.title("Residual Plot")

**f. How well do the models in (a) and (e) fit the data?**

In [0]:
# error calculations
Y = CarSeats['Sales']
Yhat_1 = predictions_1[0]
Yhat_2 = predictions_2[0]

In [0]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
MAE_1 = mean_absolute_error(Y, Yhat_1)
MSE_1 = mean_squared_error(Y, Yhat_1)
RMSE_1 = np.sqrt(MSE_1)

In [0]:
print("Model#1 Mean Absolute Error: %f" % MAE_1)
print("Model#1 Mean Squared Error : %f" % MSE_1)
print("Model#1 Root Mean Squared Error: %f" % RMSE_1)

In [0]:
MAE_2 = mean_absolute_error(Y, Yhat_2)
MSE_2 = mean_squared_error(Y, Yhat_2)
RMSE_2 = np.sqrt(MSE_2)

In [0]:
print("Model#2 Mean Absolute Error: %f" % MAE_2)
print("Model#2 Mean Squared Error : %f" % MSE_2)
print("Model#2 Root Mean Squared Error: %f" % RMSE_2)
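
Besides error metrics, fit quality for linear models is commonly summarised with $R^2$ and the residual standard error (RSE). The sketch below compares a larger and a smaller nested model on synthetic data (an assumption for illustration, not the Carseats data); `fit_stats` is a hypothetical helper name.

```python
import numpy as np

def fit_stats(X, y):
    """Return (R^2, RSE) for an OLS fit of y on X; X must include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    n, k = X.shape  # k counts the intercept
    return 1 - rss / tss, np.sqrt(rss / (n - k))

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins, NOT the Carseats data.
price = rng.normal(115, 24, n)
us = rng.integers(0, 2, n).astype(float)
urban = rng.integers(0, 2, n).astype(float)  # irrelevant predictor
y = 13.0 - 0.05 * price + 1.2 * us + rng.normal(0, 2.5, n)

X_full = np.column_stack([np.ones(n), price, urban, us])
X_small = np.column_stack([np.ones(n), price, us])

r2_full, rse_full = fit_stats(X_full, y)
r2_small, rse_small = fit_stats(X_small, y)
print(f"full : R^2={r2_full:.4f}  RSE={rse_full:.4f}")
print(f"small: R^2={r2_small:.4f}  RSE={rse_small:.4f}")
```

Dropping an irrelevant predictor can never raise the in-sample $R^2$, but the drop is typically tiny and the RSE may even improve because one fewer parameter is estimated. For the fitted statsmodels results in this notebook, the corresponding values should be available as `reg.rsquared` and `np.sqrt(reg.mse_resid)` (and likewise for `reg_1`).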

**g. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).**

From the OLS results, these are the 95% confidence intervals:
<br>
Intercept: (11.790, 14.271)
<br>
US: (0.692, 1.708)
<br>
Price: (-0.065, -0.044)
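
As a quick sanity check on these intervals, the sketch below takes the bounds quoted above and derives the implied point estimate (the midpoint of a symmetric interval) and whether zero lies outside the interval, which corresponds to rejecting the null hypothesis at the 5% level. The dictionary names are illustrative only.

```python
# 95% intervals as reported in the text for the model in (e).
intervals = {
    "Intercept": (11.790, 14.271),
    "US": (0.692, 1.708),
    "Price": (-0.065, -0.044),
}

summary = {}
for name, (lo, hi) in intervals.items():
    midpoint = (lo + hi) / 2          # point estimate sits at the centre of the interval
    half_width = (hi - lo) / 2        # margin of error around the estimate
    excludes_zero = lo > 0 or hi < 0  # True means significant at the 5% level
    summary[name] = (midpoint, half_width, excludes_zero)
    print(f"{name:10s} estimate={midpoint:8.4f}  +/- {half_width:.4f}  excludes zero: {excludes_zero}")
```

In the notebook, the intervals themselves can be read programmatically from the fitted model with `reg_1.conf_int(alpha = 0.05)` rather than copied from the summary table.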

**h. Is there evidence of outliers or high leverage observations in the model from (e)?**

Create plots and find evidence of outliers and high leverage observations.

In [0]:
# residuals vs fitted plot
plt.xkcd()
plt.figure(figsize = (25, 10))
sns.regplot(x = Yhat_2, y = pd.Series(reg_2.resid_pearson), fit_reg = True, color = 'g') # recent seaborn requires x and y as keywords
plt.title("Residuals vs Fitted - Residuals_2")

In [0]:
# normal q-q plot
plt.xkcd()
plt.figure(figsize = (25, 10))
stats.probplot(residuals_2, plot = plt)
plt.title("Normal Q-Q Plot - Residuals_2 - v1")
plt.show()

In [0]:
plt.xkcd()
plt.figure(figsize = (25, 10))
sm.qqplot(reg_2.resid_pearson, fit = True, line = 'r') # another way to do it
plt.title("Normal Q-Q Plot - Residuals_2 - v2")
fig = plt.gcf()
fig.set_size_inches(25, 10)
plt.show()

In [0]:
# scale-location plot
plt.xkcd()
plt.figure(figsize = (25, 10))
reg_2_sqrt = pd.Series(np.sqrt(np.abs(reg_2.resid_pearson)))
sns.regplot(x = Yhat_2, y = reg_2_sqrt, fit_reg = True, color = 'y') # recent seaborn requires x and y as keywords
plt.title("Scale-Location Plot - Residuals_2")

In [0]:
# residuals vs leverage plot
plt.xkcd()
fig = plt.figure(figsize = (25, 10))
fig.set_size_inches(30, fig.get_figheight(), forward=True)
sm.graphics.influence_plot(reg_2, criterion="cooks", size = 0.0002**2)
plt.title("Residuals vs Leverage - Residuals_2")
fig = plt.gcf()
fig.set_size_inches(25, 10)
plt.show()

Yes, there are high leverage points. Point 42 is one such example.
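
To make the notion of leverage concrete, here is a minimal sketch that computes leverage values as the diagonal of the hat matrix and flags observations well above the average leverage $(p + 1)/n$. The data are synthetic with one planted extreme point, and the cutoff of three times the average is an assumed rule of thumb, not a threshold taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Synthetic predictor with one planted extreme value at position 7.
x = rng.normal(0, 1, n)
x[7] = 10.0

X = np.column_stack([np.ones(n), x])      # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
leverage = np.diag(H)

p = X.shape[1] - 1                        # number of predictors
average = (p + 1) / n                     # leverages always average to (p + 1) / n
flagged = np.flatnonzero(leverage > 3 * average)  # assumed cutoff: 3x the average

print("average leverage:", average)
print("flagged positions:", flagged)
print("largest leverage :", leverage.max(), "at position", leverage.argmax())
```

For the fitted model in this notebook, the same leverage values should be available through `reg_2.get_influence().hat_matrix_diag`, which can be compared against the average in the same way.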