# Multiple Linear Regression

Linear regression with more than one feature: here we predict sales from advertising spend on TV, radio, and newspaper.
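
As a reminder of what is being fitted, a model with the three features used in this notebook can be written as below. The $\beta$ symbols are notation introduced here for illustration (an intercept plus one coefficient per feature); they do not appear in the code.

$$
\hat{y} = \beta_0 + \beta_1 \cdot \text{tv} + \beta_2 \cdot \text{radio} + \beta_3 \cdot \text{newspaper}
$$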

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Exploratory Data Analysis and Visualization

In [None]:
tv = np.array([181,9,58,120,9,200,66,215,24,98,204,195,68,281,69,147,218,237,13,228,62,263,143,240,249])
radio = np.array([11,49,33,20,2,3,6,24,35,8,33,48,37,40,21,24,28,5,16,17,13,4,29,17,27])
newspaper = np.array([58,75,24,12,1,21,24,4,66,7,46,53,114,56,18,19,53,24,50,26,18,20,13,23,23])
sales = np.array([13,7,12,13,5,11,9,17,9,10,19,22,13,24,11,15,18,13,6,16,10,12,15,16,19])

df = pd.DataFrame({'tv': tv, 'radio': radio, 'newspaper': newspaper, 'sales': sales})
df.head()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16, 6))

axes[0].plot(df['tv'], df['sales'], 'o')
axes[0].set_title("TV Spend")
axes[0].set_ylabel("Sales")

axes[1].plot(df['radio'], df['sales'], 'o')
axes[1].set_title("Radio Spend")
axes[1].set_ylabel("Sales")

axes[2].plot(df['newspaper'], df['sales'], 'o')
axes[2].set_title("Newspaper Spend");
axes[2].set_ylabel("Sales")

plt.tight_layout();

## Train and Test Splits

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# note it is capital X
X = df.drop('sales', axis=1) # get only features columns, independent variables
y = df['sales'] # get only label column, dependent variable

With `test_size=0.3`, 30% of our data goes to the test split and the remaining 70% to the train split. With our 25 rows, that is 8 test rows and 17 train rows, because scikit-learn rounds the test size up.

`train_test_split` shuffles the rows before splitting them. `random_state` seeds that shuffle, so the same value reproduces the same split on every run. The shuffle matters because data is often sorted by one of its columns, and we don't want the first 70% of a sorted dataset to become the train split and the last 30% the test split.
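
The effect of shuffling and `random_state` can be checked with a minimal sketch. It uses toy data sorted by its only feature; the names `X_demo` and `y_demo` are made up for this illustration and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, sorted by its single feature (hypothetical, for illustration only)
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# Same random_state -> the same shuffle, so both calls return the same split
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=99)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=99)
same_split = np.array_equal(X_te_a, X_te_b)

# shuffle=False keeps the original order: the last 30% becomes the test split
_, X_te_sorted, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=False)

print(same_split)
print(X_te_sorted.ravel())
```

Without shuffling, the test split here contains only the largest feature values. That is the situation the shuffle is meant to avoid.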

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

print(len(X_train))
print(len(X_test))

## Creating the model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

## Predictions on the test data

In [None]:
# The model predicts its own y hat
# We can then compare these results to the true y test label value
y_pred = model.predict(X_test)
y_pred

## Model Performance

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [None]:
MAE = mean_absolute_error(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)

In [None]:
print(MAE)
print(MSE)
print(RMSE)
print(df['sales'].mean())
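
To make the three metrics concrete, here is a small sketch that recomputes them by hand. The four values in `y_true` and `y_hat_demo` are made up for the arithmetic and are not taken from the sales data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true values and predictions, chosen so the arithmetic is easy
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_hat_demo = np.array([11.0, 11.0, 18.0, 16.0])

errors = y_true - y_hat_demo            # [-1, 1, -3, 4]

mae_manual = np.mean(np.abs(errors))    # (1 + 1 + 3 + 4) / 4
mse_manual = np.mean(errors ** 2)       # (1 + 1 + 9 + 16) / 4
rmse_manual = np.sqrt(mse_manual)       # square root brings the units back to sales

print(mae_manual, mean_absolute_error(y_true, y_hat_demo))
print(mse_manual, mean_squared_error(y_true, y_hat_demo))
print(rmse_manual)
```

Because MSE squares the errors, the single error of 4 contributes more than the other three combined. RMSE is easier to compare with the mean of `sales` because it is in the same units.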

In [None]:
sns.histplot(data=df, x='sales', bins=10)

## Residuals

It's also important to plot the residuals and check that they are roughly normally distributed and show no pattern. This helps us judge whether linear regression was a valid model choice.

In [None]:
test_residuals = y_test - y_pred

It looks like linear regression is a good choice for our data, because the plot below shows no pattern: the dots are scattered randomly above and below the red line.

In [None]:
sns.scatterplot(x=y_test, y=test_residuals)
plt.axhline(y=0, color='r', linestyle='--')
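
For contrast, the sketch below shows what a pattern in the residuals would look like numerically. It fits a straight line to synthetic data twice, once where the true relationship is linear and once where it is curved. All data and names here (`y_lin`, `y_quad`, `line_residuals`) are made up for this illustration, and `np.polyfit` stands in for the notebook's `LinearRegression` to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
noise = rng.normal(0, 1, size=x.size)

y_lin = 3 * x + 2 + noise       # truly linear relationship
y_quad = 0.5 * x ** 2 + noise   # curved relationship

def line_residuals(x, y):
    """Fit a straight line by least squares and return actual minus predicted."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

res_lin = line_residuals(x, y_lin)
res_quad = line_residuals(x, y_quad)

# A U-shaped residual plot shows up as a strong correlation with (x - mean)^2
curvature = (x - x.mean()) ** 2
corr_lin = np.corrcoef(curvature, res_lin)[0, 1]
corr_quad = np.corrcoef(curvature, res_quad)[0, 1]

print(corr_lin, corr_quad)
```

When the line is the right model, the residuals are unrelated to the curvature term. When it is not, they follow it closely, which is the kind of structure a residual plot would reveal.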

In [None]:
sns.displot(test_residuals, kde=True)

In [None]:
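from scipy import stats  # import the submodule explicitly; `import scipy` alone may not load it

In [None]:
fig, ax = plt.subplots(figsize=(6, 8))
# Q-Q plot: points close to the line suggest normally distributed residuals
_ = stats.probplot(test_residuals, plot=ax)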

## Retraining Model on Full Data

If we are satisfied with the performance on the test data, we should retrain on all of our data before deploying the model to the real world. (If we were not satisfied, we could adjust parameters or choose another model.)

In [None]:
final_model = LinearRegression()
final_model.fit(X, y)

## Deployment, Predictions, and Model Attributes

### Final Model Fit

Note: we can only plot the fit feature by feature like this because we have just 3 features; with more features it becomes impractical.

In [None]:
y_hat = final_model.predict(X)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16, 6))

axes[0].plot(df['tv'], df['sales'], 'o')
axes[0].plot(df['tv'], y_hat, 'o', color='red')
axes[0].set_title("TV spend")
axes[0].set_ylabel("Sales")

axes[1].plot(df['radio'], df['sales'], 'o')
axes[1].plot(df['radio'], y_hat, 'o', color='red')
axes[1].set_title("Radio spend")
axes[1].set_ylabel("Sales")

axes[2].plot(df['newspaper'], df['sales'], 'o')
axes[2].plot(df['newspaper'], y_hat, 'o', color='red')  # predictions against newspaper spend
axes[2].set_title("Newspaper spend");
axes[2].set_ylabel("Sales")

plt.tight_layout();

### Residuals

The residuals should be roughly normally distributed around zero.

In [None]:
residuals = y - y_hat  # actual minus predicted, same convention as test_residuals

In [None]:
sns.scatterplot(x=y, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')

### Coefficients

According to our model, spending on newspaper advertising does not lead to higher sales. Its coefficient is negative, so additional newspaper spend is associated with lower predicted sales.

Each coefficient is the expected change in sales when spend on that channel increases by 1 unit while spend on the other two channels stays the same.
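
This reading of the coefficients can be verified with a short sketch: raise one feature by 1 unit, keep the others fixed, and compare the change in the prediction with that feature's coefficient. The data and the names `X_demo`, `y_demo`, `true_coefs`, and `demo_model` are synthetic and hypothetical, not the notebook's sales data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with three features (assumed values, for illustration only)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(50, 3))
true_coefs = np.array([0.05, 0.2, -0.01])
y_demo = X_demo @ true_coefs + 3.0 + rng.normal(0, 0.5, size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Two inputs that differ by exactly 1 unit in the second feature
base = np.array([[149.0, 22.0, 12.0]])
bumped = base.copy()
bumped[0, 1] += 1.0

delta = demo_model.predict(bumped)[0] - demo_model.predict(base)[0]

print(delta, demo_model.coef_[1])
```

The change in the prediction matches the second coefficient, up to floating-point error. That is what "holding the other features constant" means in practice.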

In [None]:
final_model.coef_

In [None]:
coeff_df = pd.DataFrame(final_model.coef_, X.columns, columns=['Coefficient'])
coeff_df

In [None]:
df.corr()

### Prediction on New Data

Recall that the `X_test` data set looks exactly like brand-new data, so we simply call `.predict()` as before to predict sales for a new advertising campaign.

Our next ad campaign will have a total spend of 149 on TV, 22 on radio, and 12 on newspaper ads. How many units could we expect to sell as a result? The model's answer is 14.11954608.

How accurate is this prediction of 14.11954608? There is no real way to know! We only truly know our model's performance on the test data, which is why we had to be satisfied with it before training the final model on the full data.

In [None]:
ctv = np.array([149])
cradio = np.array([22])
cnewspaper = np.array([12])

campaign = pd.DataFrame({'tv': ctv, 'radio': cradio, 'newspaper': cnewspaper})
final_model.predict(campaign)

## Model Persistence

In [None]:
from joblib import dump, load

`dump` writes the fitted model to a file, here `sales_model.joblib`; `load` reads it back.

In [None]:
dump(final_model, 'sales_model.joblib') 

In [None]:
loaded_model = load('sales_model.joblib')

In [None]:
loaded_model.predict(campaign)