# Day 6 - multi feature linear regression

Today I'm using the Advertising dataset provided at https://www.statlearning.com/resources-python.
It contains advertising expenditure for three types of media (TV, radio, and newspaper), along with a column showing the sales amount for each market. The markets themselves are anonymised.

The goal is to fit a linear regression model that predicts the sales amount from the mix of advertising expenditure, and hopefully to see which media should be preferred.
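
Written out, the model has the standard multiple linear regression form shown below. The symbols $\beta_0, \dots, \beta_3$ and $\varepsilon$ are notation introduced here for illustration; they are not names from the dataset.

$$
\text{sales} = \beta_0 + \beta_1 \,\text{TV} + \beta_2 \,\text{radio} + \beta_3 \,\text{newspaper} + \varepsilon
$$

Here $\beta_0$ is the intercept, each remaining $\beta_j$ is the expected change in sales per unit of expenditure on that medium with the other two held fixed, and $\varepsilon$ is the error term.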

In [42]:
# Import libraries
import numpy as np
import altair as alt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn import set_config

In [8]:
df_advertising = pd.read_csv("advertising.csv")
df_advertising

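| | Unnamed: 0 | TV | radio | newspaper | sales |
|---|---|---|---|---|---|
| 0 | 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 5 | 180.8 | 10.8 | 58.4 | 12.9 |
| ... | ... | ... | ... | ... | ... |
| 195 | 196 | 38.2 | 3.7 | 13.8 | 7.6 |
| 196 | 197 | 94.2 | 4.9 | 8.1 | 9.7 |
| 197 | 198 | 177.0 | 9.3 | 6.4 | 12.8 |
| 198 | 199 | 283.6 | 42.0 | 66.2 | 25.5 |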
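
The `Unnamed: 0` column in the output holds row numbers stored in the CSV itself, next to the DataFrame's own index. A minimal sketch of one way to avoid the extra column is to pass `index_col=0` to `pd.read_csv`. The inline CSV text is a stand-in built from the first rows shown above, and it assumes the file's first column has an empty header.

```python
import io

import pandas as pd

# Stand-in for advertising.csv (assumption: the first column of the real
# file is an unnamed row counter, which is what produces "Unnamed: 0").
csv_text = """,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
"""

# index_col=0 uses the first CSV column as the index instead of
# loading it as a regular column.
df_sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_sample.columns.tolist())
```

With the real file, the same argument would go into the existing `pd.read_csv("advertising.csv")` call.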


In [41]:
# Conduct some exploratory analysis to get some intuition for the dataset

# Scatter plot of each medium's expenditure against sales
charts = []
for column in ["TV", "radio", "newspaper"]:  # a list keeps the chart order fixed
    chart = alt.Chart(df_advertising).mark_circle().encode(
        x=column,
        y="sales"
    )
    charts.append(chart)
alt.hconcat(*charts)

From the charts above, we can see that newspaper expenditure does not have a clear linear relationship with sales. Radio expenditure is more linear, but the points fan out into a sort of cone, with the variance in sales growing as radio expenditure increases. TV expenditure is the most linear.
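
One way to put a number on this visual impression is the Pearson correlation between each medium and sales, which measures the strength of a linear association. The sketch below uses only the nine rows displayed in the table earlier, so the values it prints are illustrative; `df_rows` is a name made up for this example, and on the full data it would be replaced by `df_advertising`.

```python
import pandas as pd

# Only the rows visible in the truncated table output above,
# not the full dataset.
df_rows = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8, 38.2, 94.2, 177.0, 283.6],
    "radio": [37.8, 39.3, 45.9, 41.3, 10.8, 3.7, 4.9, 9.3, 42.0],
    "newspaper": [69.2, 45.1, 69.3, 58.5, 58.4, 13.8, 8.1, 6.4, 66.2],
    "sales": [22.1, 10.4, 9.3, 18.5, 12.9, 7.6, 9.7, 12.8, 25.5],
})

# Pearson correlation of each feature column with sales
correlations = df_rows[["TV", "radio", "newspaper"]].corrwith(df_rows["sales"])
print(correlations)
```

A correlation close to 1 or -1 points to a strong linear relationship, while a value near 0 suggests little linear association. It says nothing about the cone-shaped spread, which is easier to judge from the scatter plots.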

In [119]:
# Configure scikit-learn so that transformers output pandas DataFrames
set_config(transform_output="pandas")

# Create training and test split of data
advertising_train, advertising_test = train_test_split(
    df_advertising, train_size=0.75
)

# Fit model
mlm = LinearRegression()
mlm.fit(
    advertising_train[["TV", "radio", "newspaper"]],
    advertising_train["sales"]       
)

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


In [120]:
# Predict
advertising_test["predicted"] = mlm.predict(advertising_test[["TV", "radio", "newspaper"]])

# Calculate the root mean squared prediction error (RMSPE)
lm_mult_test_RMSPE = mean_squared_error(
    y_true=advertising_test["sales"],
    y_pred=advertising_test["predicted"]
)**0.5

lm_mult_test_RMSPE

1.6484137311985188
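
The number above is a root mean squared prediction error: the square root of the average squared difference between observed and predicted sales on the test set. As a small sketch, the same quantity can be computed directly with NumPy. The function name `rmspe` and the toy arrays are made up for illustration.

```python
import numpy as np


def rmspe(y_true, y_pred):
    """Root mean squared prediction error of y_pred against y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values: the errors are 1, 0 and -2, so the mean squared error is 5/3
toy_true = [3.0, 5.0, 7.0]
toy_pred = [2.0, 5.0, 9.0]
print(rmspe(toy_true, toy_pred))
```

Because of the square root, the result is in the same units as `sales`, so a value of about 1.65 can be read as a typical prediction error of about 1.65 sales units.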

In [105]:
mlm.coef_

array([ 4.44160282e-02,  1.92212147e-01, -9.58332789e-05])

In [106]:
mlm.intercept_

np.float64(3.0951596810002453)
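
The coefficients are listed in the same order as the feature columns passed to `fit`: TV, radio, newspaper. As a rough sketch, the fitted equation can be written out by hand and applied to the first row of the dataset. `predict_sales` is a helper name invented for this example, and the numbers are copied from the outputs above.

```python
# Values printed by mlm.intercept_ and mlm.coef_ above
intercept = 3.0951596810002453
coef_tv, coef_radio, coef_newspaper = 4.44160282e-02, 1.92212147e-01, -9.58332789e-05


def predict_sales(tv, radio, newspaper):
    """Apply the fitted linear equation to one observation."""
    return (
        intercept
        + coef_tv * tv
        + coef_radio * radio
        + coef_newspaper * newspaper
    )


# First row of the dataset: TV=230.1, radio=37.8, newspaper=69.2 (observed sales 22.1)
first_row_prediction = predict_sales(tv=230.1, radio=37.8, newspaper=69.2)
print(first_row_prediction)
```

This gives roughly 20.6 against an observed value of 22.1. The newspaper term contributes almost nothing, since its coefficient is close to zero, while each extra unit of radio expenditure is associated with about 0.19 more sales units when the other two are held fixed.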

Let's see whether the model improves when it is fitted only to the features "TV" and "radio", since "newspaper" did not appear to have a linear relationship with sales.

In [123]:
# Create a new random training and test split (not the same rows as the split above)
tv_radio_train, tv_radio_test = train_test_split(
    df_advertising, train_size=0.75
)

# Fit model
tv_radio_mlm = LinearRegression()
tv_radio_mlm.fit(
    tv_radio_train[["TV", "radio"]],
    tv_radio_train["sales"]
)

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


In [124]:
# Predict
tv_radio_test["predicted"] = tv_radio_mlm.predict(tv_radio_test[["TV", "radio"]])

# Calculate the root mean squared prediction error (RMSPE) for the TV and radio model
tv_radio_lm_mult_test_RMSPE = mean_squared_error(
    y_true=tv_radio_test["sales"],
    y_pred=tv_radio_test["predicted"]
)**0.5

tv_radio_lm_mult_test_RMSPE

1.7383471696381916

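It is difficult to tell whether removing newspaper as a feature actually improved the model, as the two root mean squared errors are quite similar (about 1.65 with newspaper and 1.74 without). The two models were also trained and tested on different random splits, so part of the gap may come from the split rather than from the features. The near-zero newspaper coefficient in the first model (about -0.0001) does suggest that newspaper adds little.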
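
One way to make the comparison more direct is to fit both feature sets on a single split, fixed with `random_state`, so that any difference in error comes from the features alone. The sketch below runs on synthetic stand-in data, not the advertising dataset, so its numbers say nothing about the real result. `df_demo` and `rmspe_on_test` are names made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the advertising dataset).
# Sales depends on TV and radio only; newspaper is pure noise.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "radio": rng.uniform(0, 50, 200),
    "newspaper": rng.uniform(0, 100, 200),
})
df_demo["sales"] = (
    3 + 0.045 * df_demo["TV"] + 0.19 * df_demo["radio"] + rng.normal(0, 1.5, 200)
)

# One split, reused for both models
train, test = train_test_split(df_demo, train_size=0.75, random_state=0)


def rmspe_on_test(features):
    """Fit on the shared training rows and return the RMSPE on the shared test rows."""
    model = LinearRegression().fit(train[features], train["sales"])
    predictions = model.predict(test[features])
    return mean_squared_error(test["sales"], predictions) ** 0.5


results = {
    "TV + radio + newspaper": rmspe_on_test(["TV", "radio", "newspaper"]),
    "TV + radio": rmspe_on_test(["TV", "radio"]),
}
print(results)
```

In the notebook itself, the same idea would mean replacing `df_demo` with `df_advertising` and reusing one train and test pair for both models. A single split is still noisy with only 200 rows, so repeating this over several values of `random_state` would give a steadier picture.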

In [125]:
tv_radio_mlm.coef_

array([0.04744042, 0.18478325])