In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt

In [2]:
data_path = "https://www.statlearning.com/s/Advertising.csv" 

# Read the CSV data from the link
data_df = pd.read_csv(data_path,index_col=0)

# Print out first 5 samples from the DataFrame
data_df.head()

Unnamed: 0,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [63]:
X = data_df['TV'].to_numpy()
Y = data_df['sales'].to_numpy()

---

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=989)

In [66]:
lin_reg = LinearRegression()
lin_reg.fit(X_train.reshape(-1,1), y_train)

In [67]:
print(f"y = {lin_reg.intercept_} + {lin_reg.coef_[0]} x") 

y = 6.858509764823011 + 0.04842623831035565 x
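
The printed equation can be read as a prediction rule: the intercept is the predicted `sales` when `TV` is 0, and the slope is the change in predicted `sales` per one-unit increase in `TV`. A minimal sketch, assuming the coefficients printed above; the helper `predict_sales` and the budget value 100 are illustrative choices, not part of the notebook:

```python
# Coefficients copied from the printed fit above
intercept = 6.858509764823011
slope = 0.04842623831035565

def predict_sales(tv_budget):
    """Hypothetical helper: evaluate y = intercept + slope * x for one TV value."""
    return intercept + slope * tv_budget

# Prediction for an arbitrary TV value of 100
print(predict_sales(100.0))

# Raising TV by one unit changes the prediction by the slope (up to rounding)
print(predict_sales(101.0) - predict_sales(100.0))
```

With these coefficients, a TV value of 100 gives a predicted `sales` of about 11.70, the same result `lin_reg.predict` would be expected to return for that input.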


In [68]:
y_pred = lin_reg.predict(X_test.reshape(-1,1))
y_pred

array([18.09823968,  8.69870682,  7.27981804, 13.652711  ,  8.94083801,
       17.06191818,  8.70839207,  8.64543796,  7.79797879, 18.47596434,
       17.50259695, 14.93600631, 10.14180872, 14.87789483, 19.58976782,
       12.67934361, 20.7858959 , 13.53648803, 10.20960546, 10.09822511,
       20.48565323, 14.10791764, 14.77619973, 13.59944214, 10.49532026,
       11.49290077, 13.99169467, 15.42995395,  8.06916572, 11.72050409,
       10.50500551,  9.01347737,  7.76892305, 16.78588862, 13.45416342,
       11.08127775, 19.70114816, 12.17571073, 18.44690859, 20.62608932])

### Performance Metrics


In [69]:
from sklearn.metrics import max_error, mean_squared_error, mean_absolute_error, r2_score

In [70]:
max_error(y_test, y_pred)

np.float64(7.58976781661551)
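
Max error is the largest absolute residual, $\max_i |y_i - \hat{y}_i|$, so it reports the single worst prediction rather than an average. A minimal NumPy sketch on made-up values (the arrays `y_true_demo` and `y_pred_demo` are illustrative and not taken from the Advertising data):

```python
import numpy as np

# Hypothetical values, for illustration only
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

# Absolute residual for each sample
abs_residuals = np.abs(y_true_demo - y_pred_demo)

# The metric is the largest of them; argmax tells which sample caused it
worst_error = abs_residuals.max()
worst_index = int(abs_residuals.argmax())
print(worst_error, worst_index)
```

Looking up the sample at `worst_index` is often the more useful step, since it shows where the model fails rather than only by how much.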

* Mean Squared Error (MSE): $\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ the predicted value for sample $i$.


In [71]:
mean_squared_error(y_test, y_pred)

np.float64(10.209617603110354)
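
The MSE formula can be reproduced in a few lines of NumPy. The sketch below uses made-up arrays (`y_true_demo`, `y_pred_demo`), not the notebook's data. It also takes the square root, often called RMSE, which brings the error back to the units of the target; for the value above that would be roughly 3.2.

```python
import numpy as np

# Hypothetical values, for illustration only
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

# Square each residual, then average over the N samples
squared_errors = (y_true_demo - y_pred_demo) ** 2
mse_manual = squared_errors.mean()

# Square root of MSE is expressed in the same units as y
rmse_manual = np.sqrt(mse_manual)
print(mse_manual, rmse_manual)
```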

* Mean Absolute Error (MAE): $\frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$, where $y_i$ is the actual value and $\hat{y}_i$ the predicted value for sample $i$.


In [72]:
mean_absolute_error(y_test, y_pred)

np.float64(2.4910506561061587)
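
MAE and MSE can rank the same predictions differently, because squaring weights large residuals more heavily. The sketch below is an illustration on made-up arrays; the helper functions `mae` and `mse` are hypothetical stand-ins written here for the example, not the scikit-learn functions.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical values, for illustration only
y_true_demo = np.array([10.0, 12.0, 14.0, 16.0])
y_pred_even = np.array([11.0, 11.0, 15.0, 15.0])     # every prediction is off by 1
y_pred_outlier = np.array([10.0, 12.0, 14.0, 20.0])  # three exact, one off by 4

print(mae(y_true_demo, y_pred_even), mse(y_true_demo, y_pred_even))        # 1.0 1.0
print(mae(y_true_demo, y_pred_outlier), mse(y_true_demo, y_pred_outlier))  # 1.0 4.0
```

Both prediction sets have the same MAE, but the one with a single large miss has four times the MSE. That is one reason to report both metrics, as this notebook does.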

### R2 Coefficient

R-squared ($R^2$) is a goodness-of-fit measure for linear regression models. It is the proportion of the variance in the dependent variable that the independent variables collectively explain, so a value of 1 (100%) indicates a perfect fit. It is calculated as:

$R^2 = 1 - \frac{RSS}{TSS}$

Where RSS is the residual sum of squares and TSS is the total sum of squares.

RSS (residual sum of squares) is the sum of squared differences between the observed values $y_i$ and the values $\hat{y}_i$ predicted by the linear regression model:

$RSS = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2$

TSS (total sum of squares) measures the total variation of the dependent variable around its mean $\bar{y}$. It is the sum of squared differences between each observed value $y_i$ and $\bar{y}$:

$TSS = \sum_{i=1}^{N}(y_i - \bar{y})^2$

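For a least-squares fit with an intercept, R-squared evaluated on the training data lies between 0 and 100%. On held-out data, as computed with `r2_score` on the test set here, it can be negative when the model predicts worse than the mean of the test values.

* 0% represents a model that does not explain any of the variation in the response variable around its mean. The mean of the dependent variable predicts the dependent variable as well as the regression model does.
* 100% represents a model that explains all the variation in the response variable around its mean.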

In [73]:
r2_score(y_test, y_pred)

0.5697181446883102
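
The definition $R^2 = 1 - \frac{RSS}{TSS}$ can be checked by hand. The sketch below is a minimal NumPy version on made-up arrays; the helper `r2_manual` is a hypothetical name introduced for this example. Applied to the notebook's `y_test` and `y_pred`, it should agree with `r2_score(y_test, y_pred)` up to floating-point rounding.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - RSS / TSS."""
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
    return 1.0 - rss / tss

# Hypothetical values, for illustration only
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])

perfect_pred = y_true_demo.copy()                         # RSS = 0
mean_pred = np.full_like(y_true_demo, y_true_demo.mean()) # RSS = TSS
reversed_pred = y_true_demo[::-1]                         # worse than the mean

print(r2_manual(y_true_demo, perfect_pred))   # 1.0
print(r2_manual(y_true_demo, mean_pred))      # 0.0
print(r2_manual(y_true_demo, reversed_pred))  # -3.0
```

The three cases show a perfect fit, a model no better than predicting the mean, and a model worse than the mean, for which the score drops below zero.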