**Part 1 - Linear Regression**

In [31]:
import pandas as pd
import numpy as np
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

url = "https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Advertising.csv"

df = pd.read_csv(url)

df = df.drop(columns=["Unnamed: 0"])
df.head()

X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

# Split for train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Fit model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

coefficients = pd.Series(lin_reg.coef_, index=X.columns)
print("Intercept:", lin_reg.intercept_)
print("\nCoefficients:")
print(coefficients)

# Predictions
y_train_pred = lin_reg.predict(X_train)
y_test_pred = lin_reg.predict(X_test)

# Metrics
linear_r2_train = r2_score(y_train, y_train_pred)
linear_rmse_train = sqrt(mean_squared_error(y_train, y_train_pred))
linear_mae_train = mean_absolute_error(y_train, y_train_pred)

linear_r2_test = r2_score(y_test, y_test_pred)
linear_rmse_test = sqrt(mean_squared_error(y_test, y_test_pred))
linear_mae_test = mean_absolute_error(y_test, y_test_pred)

results = []
results.append({
    "Model": "Linear (baseline)",
    "R-squared train": linear_r2_train,
    "R-squared test": linear_r2_test,
    "RMSE train": linear_rmse_train,
    "RMSE test": linear_rmse_test,
    "MAE train": linear_mae_train,
    "MAE test": linear_mae_test
})

results_df = pd.DataFrame(results)
results_df

Intercept: 2.937215734690609

Coefficients:
TV           0.046952
Radio        0.176586
Newspaper    0.001851
dtype: float64


| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear (baseline) | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |


Part 1: Baseline Linear Regression

Interpretation:

- TV has a positive coefficient (≈ 0.047), suggesting that higher TV advertising spend is associated with higher sales, holding the other channels fixed.

- Radio also has a positive coefficient (≈ 0.177), the largest of the three per unit of spend.

- The Newspaper coefficient is very small (≈ 0.002), so Newspaper is a weak predictor once TV and Radio are in the model.

- Train and test R² are close (≈ 0.89 vs ≈ 0.92), suggesting that the model generalises well. Train and test errors are also similar (RMSE ≈ 1.79 vs ≈ 1.39; MAE ≈ 1.37 vs ≈ 1.05).
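
The comparison above rests on a single 70/30 split, and the test R² happens to be higher than the train R², which can occur by chance with one split. A minimal sketch of a k-fold cross-validation check follows. It runs on synthetic data (the coefficients, ranges and noise level are assumptions chosen for illustration, not the Advertising values), and the names `X_demo`, `y_demo` and `r2_scores` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the Advertising data (assumption: three spend
# columns and a roughly linear response; this is not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3.0 + 0.05 * X_demo[:, 0] + 0.18 * X_demo[:, 1] + rng.normal(0, 1.0, 200)

# Five shuffled folds; each fold is used once as the held-out set.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R² per fold:", np.round(r2_scores, 3))
print("Mean R²:", round(r2_scores.mean(), 3), "| Std:", round(r2_scores.std(), 3))
```

If the fold-to-fold spread is small, the conclusion that the model generalises well does not depend on one lucky split.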

**Part 2**

In [32]:
# sqrt, r2_score, mean_squared_error and mean_absolute_error are already imported in Part 1
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

poly = PolynomialFeatures(degree=5)

X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

scaler = StandardScaler()
X_train_poly_scaled = scaler.fit_transform(X_train_poly)
X_test_poly_scaled = scaler.transform(X_test_poly)

poly_linear = LinearRegression()
poly_linear.fit(X_train_poly_scaled, y_train)

y_train_poly_pred = poly_linear.predict(X_train_poly_scaled)
y_test_poly_pred = poly_linear.predict(X_test_poly_scaled)

# Metrics
poly_r2_train = r2_score(y_train, y_train_poly_pred)
poly_r2_test = r2_score(y_test, y_test_poly_pred)

poly_rmse_train = sqrt(mean_squared_error(y_train, y_train_poly_pred))
poly_rmse_test = sqrt(mean_squared_error(y_test, y_test_poly_pred))

poly_mae_train = mean_absolute_error(y_train, y_train_poly_pred)
poly_mae_test = mean_absolute_error(y_test, y_test_poly_pred)

# See how many coefficients we have for this model
print("Number of polynomial coefficients:", len(poly_linear.coef_))
coef_names = poly.get_feature_names_out(["TV", "Radio", "Newspaper"])
poly_coefficients = pd.DataFrame({
    "Feature": coef_names,
    "Coefficient": poly_linear.coef_
})
print(poly_coefficients.head(25))

results.append({
    "Model": "Poly degree 5 (no reg)",
    "R-squared train": poly_r2_train,
    "R-squared test": poly_r2_test,
    "RMSE train": poly_rmse_train,
    "RMSE test": poly_rmse_test,
    "MAE train": poly_mae_train,
    "MAE test": poly_mae_test
})

results_df = pd.DataFrame(results)
results_df


Number of polynomial coefficients: 56
                 Feature   Coefficient
0                      1  3.685530e-10
1                     TV  1.227425e+01
2                  Radio -7.761032e+00
3              Newspaper  9.018661e-01
4                   TV^2 -4.315755e+01
5               TV Radio  2.798255e+01
6           TV Newspaper -1.013778e+01
7                Radio^2  3.093985e+01
8        Radio Newspaper  2.095949e+00
9            Newspaper^2  1.280014e+00
10                  TV^3  7.395312e+01
11            TV^2 Radio -3.569242e+01
12        TV^2 Newspaper  2.947617e+01
13            TV Radio^2 -4.354276e+01
14    TV Radio Newspaper -1.176643e+01
15        TV Newspaper^2  1.810386e+01
16               Radio^3 -3.970530e+01
17     Radio^2 Newspaper -1.656861e+01
18     Radio Newspaper^2  4.955863e+00
19           Newspaper^3 -1.138927e+01
20                  TV^4 -5.769059e+01
21            TV^3 Radio  3.060621e+01
22        TV^3 Newspaper -4.573475e+01
23          TV^2 Radio^2  

| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear (baseline) | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |
| 1 | Poly degree 5 (no reg) | 0.997769 | 0.794384 | 0.249258 | 2.261647 | 0.193958 | 0.736791 |


Part 2: Polynomial Model (degree 5)

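For the polynomial model (degree 5), the number of coefficients increased from 3 in the baseline model to 56 (55 polynomial terms plus the constant column). Many of these coefficients are large in magnitude and correspond to high-order powers and interactions of TV, Radio and Newspaper, so they are hard to interpret in a marketing context. Closely related terms also receive large coefficients of opposite sign (for example TV^2 ≈ -43, TV^3 ≈ +74, TV^4 ≈ -58), which points to a risk of overfitting.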
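
The count of 56 can be checked by hand: the number of monomials of total degree at most d in n variables is C(n + d, d), so C(8, 5) = 56 for n = 3 and d = 5. A short sketch that confirms this against scikit-learn follows; the variable names are hypothetical and the input row is a dummy.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 3, 5

# Monomials of total degree <= d in n variables: C(n + d, d).
# This count includes the constant term "1".
n_terms_formula = comb(n_features + degree, degree)

# Cross-check against scikit-learn using a single dummy row of ones.
poly_demo = PolynomialFeatures(degree=degree)
n_terms_sklearn = poly_demo.fit_transform(np.ones((1, n_features))).shape[1]

print(n_terms_formula, n_terms_sklearn)  # 56 56
```

Passing `include_bias=False` to `PolynomialFeatures` would drop the constant column and leave 55 terms.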

Performance comparison:

- Polynomial model:
  - Train R² ≈ 0.998
  - Test R² ≈ 0.79
  - Train RMSE ≈ 0.25
  - Test RMSE ≈ 2.26

- Baseline linear model:
  - Train R² ≈ 0.89
  - Test R² ≈ 0.92
  - Train RMSE ≈ 1.79
  - Test RMSE ≈ 1.39

The polynomial model fits the training data almost perfectly (it has a very high train R² and very low train RMSE), but its performance on the test set is actually worse than the simple linear model. This is a classic sign of overfitting: the model is memorising noise in the training data rather than learning the true relationship.

Even though the training metrics look impressive, this is not a good model for prediction or decision-making, because it does not generalise well to unseen data.
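
One way to see where overfitting begins is to sweep the polynomial degree and compare train and test error at each step. The sketch below does this on synthetic data (the data-generating formula is an assumption, not the Advertising dataset). Unlike the cell above, it scales the inputs before expanding them, purely to keep the example numerically well behaved.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): linear effects plus one mild interaction.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = (3.0 + 0.05 * X_demo[:, 0] + 0.18 * X_demo[:, 1]
          + 0.001 * X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 1.0, 200))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

rmse_by_degree = {}
for degree in range(1, 6):
    # Scale first, then expand to polynomial terms, then fit plain OLS.
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_tr, y_tr)
    rmse_train = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    rmse_test = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    rmse_by_degree[degree] = (rmse_train, rmse_test)
    print(f"degree={degree}: train RMSE={rmse_train:.3f}, test RMSE={rmse_test:.3f}")
```

Train RMSE can only go down as the degree grows, because each model contains the previous one. The degree at which test RMSE stops improving is a reasonable upper bound on useful complexity.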


**Part 3**

In [33]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.05)
ridge.fit(X_train_poly_scaled, y_train)

len(ridge.coef_), ridge.coef_[:10]

y_train_ridge_pred = ridge.predict(X_train_poly_scaled)
y_test_ridge_pred = ridge.predict(X_test_poly_scaled)

# Metrics
ridge_r2_train = r2_score(y_train, y_train_ridge_pred)
ridge_r2_test = r2_score(y_test, y_test_ridge_pred)

ridge_rmse_train = sqrt(mean_squared_error(y_train, y_train_ridge_pred))
ridge_rmse_test = sqrt(mean_squared_error(y_test, y_test_ridge_pred))

ridge_mae_train = mean_absolute_error(y_train, y_train_ridge_pred)
ridge_mae_test = mean_absolute_error(y_test, y_test_ridge_pred)

coef_names = poly.get_feature_names_out(["TV", "Radio", "Newspaper"])
ridge_coefficients = pd.DataFrame({
    "Feature": coef_names,
    "Coefficient": ridge.coef_
})
print(ridge_coefficients.head(25))

results.append({
    "Model": "Ridge (α = 0.05)",
    "R-squared train": ridge_r2_train,
    "R-squared test": ridge_r2_test,
    "RMSE train": ridge_rmse_train,
    "RMSE test": ridge_rmse_test,
    "MAE train": ridge_mae_train,
    "MAE test": ridge_mae_test
})

results_df = pd.DataFrame(results)
results_df

                 Feature  Coefficient
0                      1     0.000000
1                     TV     6.527871
2                  Radio     0.370194
3              Newspaper     0.846251
4                   TV^2    -7.052522
5               TV Radio     4.631767
6           TV Newspaper    -0.957912
7                Radio^2     0.500073
8        Radio Newspaper    -0.569028
9            Newspaper^2    -0.897476
10                  TV^3     0.653481
11            TV^2 Radio    -2.338669
12        TV^2 Newspaper    -0.428635
13            TV Radio^2     1.063374
14    TV Radio Newspaper    -0.470371
15        TV Newspaper^2     0.375836
16               Radio^3    -0.760039
17     Radio^2 Newspaper     0.714281
18     Radio Newspaper^2    -0.503632
19           Newspaper^3     0.382167
20                  TV^4     2.583361
21            TV^3 Radio     1.969147
22        TV^3 Newspaper     1.433933
23          TV^2 Radio^2    -2.472947
24  TV^2 Radio Newspaper     1.269717


| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear (baseline) | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |
| 1 | Poly degree 5 (no reg) | 0.997769 | 0.794384 | 0.249258 | 2.261647 | 0.193958 | 0.736791 |
| 2 | Ridge (α = 0.05) | 0.991876 | 0.989893 | 0.475706 | 0.501439 | 0.315803 | 0.328785 |


In [34]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1, max_iter=10000)
lasso.fit(X_train_poly_scaled, y_train)

len(lasso.coef_), lasso.coef_[:10]

y_train_lasso_pred = lasso.predict(X_train_poly_scaled)
y_test_lasso_pred = lasso.predict(X_test_poly_scaled)

# Metrics
lasso_r2_train = r2_score(y_train, y_train_lasso_pred)
lasso_r2_test = r2_score(y_test, y_test_lasso_pred)

lasso_rmse_train = sqrt(mean_squared_error(y_train, y_train_lasso_pred))
lasso_rmse_test = sqrt(mean_squared_error(y_test, y_test_lasso_pred))

lasso_mae_train = mean_absolute_error(y_train, y_train_lasso_pred)
lasso_mae_test = mean_absolute_error(y_test, y_test_lasso_pred)

lasso_coefficients = pd.DataFrame({
    "Feature": coef_names,
    "Coefficient": lasso.coef_
})
print(lasso_coefficients.head(25))

number_zero = np.sum(lasso.coef_ == 0)
number_nonzero = np.sum(lasso.coef_ != 0)
print("Number of zero coefficients:", number_zero)
print("Number of non-zero coefficients:", number_nonzero)

results.append({
    "Model": "Lasso (α = 0.1)",
    "R-squared train": lasso_r2_train,
    "R-squared test": lasso_r2_test,
    "RMSE train": lasso_rmse_train,
    "RMSE test": lasso_rmse_test,
    "MAE train": lasso_mae_train,
    "MAE test": lasso_mae_test
})

results_df = pd.DataFrame(results)
results_df


                 Feature  Coefficient
0                      1     0.000000
1                     TV     1.710876
2                  Radio     0.068323
3              Newspaper     0.000000
4                   TV^2    -0.000000
5               TV Radio     3.972318
6           TV Newspaper     0.000000
7                Radio^2     0.000000
8        Radio Newspaper     0.057180
9            Newspaper^2     0.000000
10                  TV^3    -0.000000
11            TV^2 Radio    -0.000000
12        TV^2 Newspaper    -0.000000
13            TV Radio^2     0.000000
14    TV Radio Newspaper     0.000000
15        TV Newspaper^2     0.000000
16               Radio^3     0.000000
17     Radio^2 Newspaper     0.000000
18     Radio Newspaper^2     0.000000
19           Newspaper^3     0.000000
20                  TV^4    -0.000000
21            TV^3 Radio    -0.000000
22        TV^3 Newspaper    -0.000000
23          TV^2 Radio^2    -0.000000
24  TV^2 Radio Newspaper    -0.000000
Number of ze

| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear (baseline) | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |
| 1 | Poly degree 5 (no reg) | 0.997769 | 0.794384 | 0.249258 | 2.261647 | 0.193958 | 0.736791 |
| 2 | Ridge (α = 0.05) | 0.991876 | 0.989893 | 0.475706 | 0.501439 | 0.315803 | 0.328785 |
| 3 | Lasso (α = 0.1) | 0.969578 | 0.9844 | 0.920545 | 0.622962 | 0.589273 | 0.43891 |


Question 2

The unregularised degree-5 polynomial model produced large, unstable coefficients. After applying regularisation, Ridge shrinks most coefficients toward zero but keeps every feature, while Lasso shrinks many coefficients exactly to zero. In my results, Lasso set 50 out of 56 coefficients to zero, leaving only 6 non-zero. This suggests that Lasso retains only a handful of useful features and discards the rest, including most high-order interactions. It also suggests that the relationship between advertising spend and sales is fairly simple, and that most of the high-order polynomial terms do not help.
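
The difference in sparsity between the two penalties can be reproduced on a small synthetic problem. The sketch below assumes ten standardised features of which only the first two affect the target. The data, the names `ridge_demo` and `lasso_demo`, and the noise level are illustrative assumptions; only the alpha values are taken from the cells above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 10 features, only the first two matter.
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(200, 10)))
y_demo = 3.0 * X_demo[:, 0] + 1.5 * X_demo[:, 1] + rng.normal(0, 0.5, 200)

# Same alpha values as the notebook cells, applied to the toy data.
ridge_demo = Ridge(alpha=0.05).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.1, max_iter=10000).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print("Ridge zero coefficients:", n_zero_ridge)
print("Lasso zero coefficients:", n_zero_lasso)
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
```

Ridge's squared penalty shrinks coefficients smoothly and so almost never produces an exact zero, whereas Lasso's absolute-value penalty sets a coefficient to zero once its contribution falls below the threshold implied by alpha.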

Ridge model performance:
- Train R² ≈ 0.9919
- Test R² ≈ 0.9899
- Train RMSE ≈ 0.476
- Test RMSE ≈ 0.501

Lasso model performance:
- Train R² ≈ 0.9696
- Test R² ≈ 0.9844
- Train RMSE ≈ 0.921
- Test RMSE ≈ 0.623

Compared to the overfit polynomial model, both Ridge and Lasso perform much better on the test set (test R² ≈ 0.99 and ≈ 0.98 vs ≈ 0.79). This supports the view that regularisation helps prevent overfitting and produces models that are more stable and reliable.
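
The alpha values used above (0.05 for Ridge, 0.1 for Lasso) were fixed by hand. A hedged sketch of choosing them by cross-validation with scikit-learn's `RidgeCV` and `LassoCV` follows. It uses synthetic data and an arbitrary candidate grid, both of which are assumptions, and Lasso may emit convergence warnings at the smallest alphas.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the Advertising data (assumption).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = (3.0 + 0.05 * X_demo[:, 0] + 0.18 * X_demo[:, 1]
          + 0.001 * X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 1.0, 200))

# Candidate penalties on a log scale (the grid itself is an assumption).
alphas = np.logspace(-2, 2, 9)

ridge_cv = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=alphas, cv=5),
)
lasso_cv = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    StandardScaler(),
    LassoCV(alphas=alphas, cv=5, max_iter=50000),
)
ridge_cv.fit(X_demo, y_demo)
lasso_cv.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
best_ridge_alpha = ridge_cv.named_steps["ridgecv"].alpha_
best_lasso_alpha = lasso_cv.named_steps["lassocv"].alpha_
print("Best Ridge alpha:", best_ridge_alpha)
print("Best Lasso alpha:", best_lasso_alpha)
```

If the selected alpha sits at either end of the grid, the grid should be widened before trusting the result.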

Question 3

After comparing a simple model, an overfit polynomial model, and two regularised models, the only model I would rule out is the unregularised polynomial. The simple Linear Regression model is easy to interpret and already performs quite well (test R² ≈ 0.92), while Ridge gives the highest predictive accuracy without overfitting (test R² ≈ 0.99). Lasso also performs very well (test R² ≈ 0.98) and, depending on the choice of alpha, it can achieve a good balance by keeping only the most important features.

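In the regularised models, which are fitted on standardised features, TV has the largest main-effect coefficient (Lasso: 1.71 for TV vs 0.07 for Radio), followed by Radio. Per unit of spend, however, the baseline Radio coefficient (0.177) is larger than the TV coefficient (0.047), so the ranking depends on the scale used. The largest Lasso coefficient belongs to the TV × Radio interaction (3.97), which suggests the two channels reinforce each other. Newspaper consistently has a very small effect, and Lasso removes almost all Newspaper-related polynomial features. This suggests that Newspaper advertising contributes very little to sales in this dataset.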
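
Raw regression coefficients are expressed per unit of spend, so channels with very different spend ranges are not directly comparable. A minimal sketch of putting them on a common scale, by multiplying each coefficient by its feature's standard deviation, follows. The spend ranges and true effects in the synthetic data are assumptions for illustration, not values from the Advertising dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): TV spend varies over a much wider range
# than Radio, and Newspaper has no effect on sales.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
sales = 3.0 + 0.05 * demo["TV"] + 0.18 * demo["Radio"] + rng.normal(0, 1.0, 200)

model = LinearRegression().fit(demo, sales)

# Coefficient per one unit of spend.
raw = pd.Series(model.coef_, index=demo.columns)
# Effect of a one-standard-deviation increase in each channel.
per_sd = raw * demo.std()

print(pd.DataFrame({"per unit": raw, "per std dev": per_sd}).round(3))
```

In this toy setup Radio has the larger per-unit coefficient while TV has the larger per-standard-deviation effect, which shows why the "strongest driver" depends on the scale being compared.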

Final recommendation to the CMO: invest more heavily in TV, followed by Radio advertising, and reduce spending on Newspaper.