**Part 1 - Linear Regression**

In [12]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

url = "https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Advertising.csv"

df = pd.read_csv(url)
df.head()

df = df.drop(columns=["Unnamed: 0"])
df.head()

X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

X_train.shape, X_test.shape

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

lin_reg.intercept_, lin_reg.coef_

coefficients = pd.Series(lin_reg.coef_, index=X.columns)
print("Intercept:", lin_reg.intercept_)
print("\nCoefficients:")
print(coefficients)

# Predictions on both splits
y_train_pred = lin_reg.predict(X_train)
y_test_pred = lin_reg.predict(X_test)

from math import sqrt
from sklearn.metrics import mean_absolute_error

# Test-set metrics: R², RMSE (square root of the MSE), and MAE
linear_r2_test = r2_score(y_test, y_test_pred)
linear_rmse_test = sqrt(mean_squared_error(y_test, y_test_pred))
linear_mae_test = mean_absolute_error(y_test, y_test_pred)

# Training-set metrics, for comparison with the test-set values
linear_r2_train = r2_score(y_train, y_train_pred)
linear_rmse_train = sqrt(mean_squared_error(y_train, y_train_pred))
linear_mae_train = mean_absolute_error(y_train, y_train_pred)

results = []

results.append({
    "Model": "Linear (baseline)",
    "R-squared train": linear_r2_train,
    "R-squared test": linear_r2_test,
    "RMSE train": linear_rmse_train,
    "RMSE test": linear_rmse_test,
    "MAE train": linear_mae_train,
    "MAE test": linear_mae_test
})

results_df = pd.DataFrame(results)
results_df

Intercept: 2.937215734690609

Coefficients:
TV           0.046952
Radio        0.176586
Newspaper    0.001851
dtype: float64


| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear (baseline) | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |


Part 1 – Baseline Linear Regression (Simple Model)

What I did:

- Used TV, Radio, and Newspaper as features

- Split the data into training and test sets (70/30)

- Fitted a Linear Regression model on the training data

- Evaluated it using R², RMSE, and MAE on both the training and test sets

Interpretation:

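- TV (≈ 0.047) and Radio (≈ 0.177) have positive coefficients: holding the other channels fixed, one extra unit of TV or Radio spend is associated with about 0.047 or 0.177 more units of Sales, respectively.

- The Newspaper coefficient (≈ 0.002) is close to zero, so Newspaper is a weak predictor once TV and Radio are in the model.

- Test R² (≈ 0.92) is slightly higher than train R² (≈ 0.89), and test RMSE (≈ 1.39) is lower than train RMSE (≈ 1.79), so there is no sign of overfitting. Because this is a single random split, the small gap in favour of the test set may simply reflect which rows landed in it.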
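
To make the coefficient interpretation concrete, the sketch below plugs the intercept and coefficients printed above into the linear equation for a hypothetical budget. The budget values (100, 20, 10) are illustrative assumptions, not rows from the dataset.

```python
import numpy as np

# Intercept and coefficients copied from the notebook output above.
intercept = 2.937215734690609
coefs = np.array([0.046952, 0.176586, 0.001851])  # TV, Radio, Newspaper

# Hypothetical budget (assumed values, not taken from the data).
budget = np.array([100.0, 20.0, 10.0])

predicted_sales = intercept + coefs @ budget
print(f"Predicted sales: {predicted_sales:.2f}")

# Effect of one extra unit on each channel, holding the others fixed:
# the prediction changes by exactly that channel's coefficient.
for name, step in zip(["TV", "Radio", "Newspaper"], np.eye(3)):
    delta = (intercept + coefs @ (budget + step)) - predicted_sales
    print(f"+1 {name}: {delta:+.4f}")
```

For this assumed budget the prediction is about 11.18, and an extra unit of Radio moves it roughly four times as much as an extra unit of TV.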


This is our baseline model to compare against more complex models later.

**Part 2**

In [13]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from math import sqrt
from sklearn.metrics import mean_absolute_error

poly = PolynomialFeatures(degree=5)

X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

scaler = StandardScaler()
X_train_poly_scaled = scaler.fit_transform(X_train_poly)
X_test_poly_scaled = scaler.transform(X_test_poly)

poly_linear = LinearRegression()
poly_linear.fit(X_train_poly_scaled, y_train)

y_train_poly_pred = poly_linear.predict(X_train_poly_scaled)
y_test_poly_pred = poly_linear.predict(X_test_poly_scaled)

# Metrics
poly_r2_train = r2_score(y_train, y_train_poly_pred)
poly_r2_test = r2_score(y_test, y_test_poly_pred)

poly_rmse_train = sqrt(mean_squared_error(y_train, y_train_poly_pred))
poly_rmse_test = sqrt(mean_squared_error(y_test, y_test_poly_pred))

poly_mae_train = mean_absolute_error(y_train, y_train_poly_pred)
poly_mae_test = mean_absolute_error(y_test, y_test_poly_pred)

# See how many coefficients this polynomial has
print("Number of polynomial coefficients:", len(poly_linear.coef_))
print(poly_linear.coef_[:10])


results.append({
    "Model": "Poly degree 5 (no reg)",
    "R-squared train": poly_r2_train,
    "R-squared test": poly_r2_test,
    "RMSE train": poly_rmse_train,
    "RMSE test": poly_rmse_test,
    "MAE train": poly_mae_train,
    "MAE test": poly_mae_test
})

results_df = pd.DataFrame(results)
results_df


Number of polynomial coefficients: 56
[ 3.68552982e-10  1.22742490e+01 -7.76103224e+00  9.01866144e-01
 -4.31575466e+01  2.79825459e+01 -1.01377760e+01  3.09398521e+01
  2.09594948e+00  1.28001448e+00]


| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear (baseline) | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |
| 1 | Poly degree 5 (no reg) | 0.997769 | 0.794384 | 0.249258 | 2.261647 | 0.193958 | 0.736791 |


Part 2 – Complex Polynomial Model (degree 5)

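For the polynomial model (degree 5), the number of coefficients increased from 3 in the baseline model to 56. That count includes the constant column that `PolynomialFeatures` adds by default; after standardisation this column is all zeros, which is why the first printed coefficient is effectively 0 (3.69e-10). The remaining 55 coefficients correspond to powers and interactions of TV, Radio, and Newspaper up to degree 5, and they are not easy to interpret in a marketing context.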
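
The count of 56 can be checked by enumerating every monomial of total degree 0 to 5 in the three features. This sketch uses only the standard library and assumes the constant term is included; the `terms` list and its string format are illustrative, not scikit-learn's own output.

```python
from itertools import combinations_with_replacement
from math import comb

features = ["TV", "Radio", "Newspaper"]
degree = 5

# Every monomial is a multiset of features; the empty multiset is the constant 1.
terms = []
for d in range(degree + 1):
    for combo in combinations_with_replacement(features, d):
        terms.append(" * ".join(combo) if combo else "1")

n_terms = len(terms)

# Closed form: the number of monomials of exact degree d in k variables
# is C(k + d - 1, d).
n_by_degree = {d: comb(len(features) + d - 1, d) for d in range(degree + 1)}

# How many terms involve Newspaper at all?
n_newspaper = sum("Newspaper" in t for t in terms)

print("Total terms:", n_terms)
print("Terms per degree:", n_by_degree)
print("Terms involving Newspaper:", n_newspaper)
```

Most of the terms come from degrees 4 and 5 (15 and 21 terms), which is where the model gains most of its extra flexibility.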

Performance comparison:

- Baseline linear model:
  - Train R² ≈ 0.89
  - Test R² ≈ 0.92
  - Train RMSE ≈ 1.79
  - Test RMSE ≈ 1.39

- Polynomial model:
  - Train R² ≈ 0.998
  - Test R² ≈ 0.79
  - Train RMSE ≈ 0.25
  - Test RMSE ≈ 2.26

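The polynomial model fits the training data almost perfectly (train R² ≈ 0.998, train RMSE ≈ 0.25), but on the test set its R² drops to 0.79 and its RMSE rises to 2.26, both worse than the simple linear model. Its test MAE (≈ 0.74) is still lower than the baseline's (≈ 1.05), which indicates that most test predictions are close but a few are badly off; RMSE penalises those large errors more heavily. This is a classic sign of overfitting: the model is memorising noise in the training data rather than learning the true underlying relationship.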

Even though the training metrics look impressive, this is not a good model for prediction or decision-making, because it does not generalise well to unseen data.
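
The train/test gap can be traced as a function of polynomial degree. The sketch below is a hedged illustration on synthetic data, not the Advertising dataset: the data-generating formula, `X_demo`, `y_demo`, and the noise level are all assumptions chosen only to show the pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the advertising data (assumed, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3 + 0.05 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 1.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Fit the same unregularised pipeline at degrees 1 to 5 and record both scores.
scores = {}
for degree in range(1, 6):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(X_tr, y_tr)
    scores[degree] = (
        r2_score(y_tr, model.predict(X_tr)),
        r2_score(y_te, model.predict(X_te)),
    )

for degree, (r2_tr, r2_te) in scores.items():
    print(f"degree {degree}: train R² = {r2_tr:.3f}, test R² = {r2_te:.3f}")
```

Wrapping the steps in a pipeline keeps the scaler and the polynomial expansion fitted on the training rows only, which mirrors the `fit_transform` / `transform` split used in the cell above.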


**Part 3**

In [14]:
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=0.05)
ridge.fit(X_train_poly_scaled, y_train)

len(ridge.coef_), ridge.coef_[:5]

y_train_ridge_pred = ridge.predict(X_train_poly_scaled)
y_test_ridge_pred = ridge.predict(X_test_poly_scaled)

# Metrics
ridge_r2_train = r2_score(y_train, y_train_ridge_pred)
ridge_r2_test = r2_score(y_test, y_test_ridge_pred)

ridge_rmse_train = sqrt(mean_squared_error(y_train, y_train_ridge_pred))
ridge_rmse_test = sqrt(mean_squared_error(y_test, y_test_ridge_pred))

ridge_mae_train = mean_absolute_error(y_train, y_train_ridge_pred)
ridge_mae_test = mean_absolute_error(y_test, y_test_ridge_pred)

results.append({
    "Model": "Ridge (α=0.05)",
    "R-squared train": ridge_r2_train,
    "R-squared test": ridge_r2_test,
    "RMSE train": ridge_rmse_train,
    "RMSE test": ridge_rmse_test,
    "MAE train": ridge_mae_train,
    "MAE test": ridge_mae_test
})

results_df = pd.DataFrame(results)
results_df

| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear (baseline) | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |
| 1 | Poly degree 5 (no reg) | 0.997769 | 0.794384 | 0.249258 | 2.261647 | 0.193958 | 0.736791 |
| 2 | Ridge (α=0.05) | 0.991876 | 0.989893 | 0.475706 | 0.501439 | 0.315803 | 0.328785 |


In [15]:
# Ridge and Lasso were already imported in the previous cell.
lasso = Lasso(alpha=0.1, max_iter=10000)
lasso.fit(X_train_poly_scaled, y_train)

len(lasso.coef_), lasso.coef_[:5]

y_train_lasso_pred = lasso.predict(X_train_poly_scaled)
y_test_lasso_pred = lasso.predict(X_test_poly_scaled)

lasso_r2_train = r2_score(y_train, y_train_lasso_pred)
lasso_r2_test = r2_score(y_test, y_test_lasso_pred)

lasso_rmse_train = sqrt(mean_squared_error(y_train, y_train_lasso_pred))
lasso_rmse_test = sqrt(mean_squared_error(y_test, y_test_lasso_pred))

lasso_mae_train = mean_absolute_error(y_train, y_train_lasso_pred)
lasso_mae_test = mean_absolute_error(y_test, y_test_lasso_pred)

results.append({
    "Model": "Lasso (α=0.1)",
    "R-squared train": lasso_r2_train,
    "R-squared test": lasso_r2_test,
    "RMSE train": lasso_rmse_train,
    "RMSE test": lasso_rmse_test,
    "MAE train": lasso_mae_train,
    "MAE test": lasso_mae_test
})

results_df = pd.DataFrame(results)
results_df


| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear (baseline) | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |
| 1 | Poly degree 5 (no reg) | 0.997769 | 0.794384 | 0.249258 | 2.261647 | 0.193958 | 0.736791 |
| 2 | Ridge (α=0.05) | 0.991876 | 0.989893 | 0.475706 | 0.501439 | 0.315803 | 0.328785 |
| 3 | Lasso (α=0.1) | 0.969578 | 0.9844 | 0.920545 | 0.622962 | 0.589273 | 0.43891 |


In [16]:
number_zero = np.sum(lasso.coef_ == 0)
number_nonzero = np.sum(lasso.coef_ != 0)
print("Number of zero coefficients:", number_zero)
print("Number of non-zero coefficients:", number_nonzero)

Number of zero coefficients: 50
Number of non-zero coefficients: 6


Question 2 – Analysis & Performance:

The unregularised degree-5 polynomial model produced large coefficients of opposing sign (for example 27.98 and -43.16 among the first ten, on standardised features), which is typical of an unstable fit. Ridge shrinks all coefficients toward zero but keeps every feature, while Lasso shrinks many coefficients to exactly zero. In my results, Lasso set 50 of the 56 coefficients to zero, leaving only 6 non-zero. Lasso therefore keeps only a handful of terms and discards the rest as noisy or redundant. This suggests that the relationship between advertising spend and sales can be captured with a few terms, and that most of the high-order polynomial terms do not help.

Ridge model performance:
- Train R² ≈ 0.9919
- Test R² ≈ 0.9899
- Train RMSE ≈ 0.476
- Test RMSE ≈ 0.501

Lasso model performance:
- Train R² ≈ 0.9696
- Test R² ≈ 0.9844
- Train RMSE ≈ 0.921
- Test RMSE ≈ 0.623

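Compared to the overfit polynomial model, both Ridge and Lasso perform much better on the test set (test R² of 0.990 and 0.984 versus 0.794), and both also beat the linear baseline (0.922). This indicates that regularisation counteracts the overfitting of the degree-5 model and produces models that are more stable and reliable. The alpha values (0.05 for Ridge, 0.1 for Lasso) were set by hand rather than tuned, so the comparison holds for those settings only.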
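
The alpha values in the cells above are fixed constants. One common way to choose alpha from the training data alone is cross-validation. The sketch below uses scikit-learn's `RidgeCV` on synthetic data; the alpha grid, `X_demo`, and `y_demo` are assumptions for illustration, and the selected value says nothing about the best alpha for the Advertising data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the advertising data (assumed, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3 + 0.05 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 1.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Candidate regularisation strengths (an assumed grid).
alphas = np.logspace(-3, 2, 20)

# RidgeCV picks the alpha with the best cross-validated score on the
# training data, so the test set is not used for tuning.
model = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=alphas),
)
model.fit(X_tr, y_tr)

best_alpha = model.named_steps["ridgecv"].alpha_
test_r2 = model.score(X_te, y_te)
print("Selected alpha:", best_alpha)
print("Test R²:", round(test_r2, 3))
```

`LassoCV` follows the same pattern for the Lasso model. Keeping the test set out of the tuning loop means the final test score remains an honest estimate.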

Question 3

After comparing a simple model, an overfit polynomial model, and two regularised models, the best overall choice is either the simple Linear Regression model or the Ridge Regression model. The linear model is easy to interpret and already performs quite well (test R² ≈ 0.92), while Ridge gives the highest predictive accuracy (test R² ≈ 0.99, test RMSE ≈ 0.50) with almost identical train and test scores, so it shows no sign of overfitting.

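TV and Radio are the two reliable drivers of sales: both have clearly positive coefficients in the baseline model, while the Newspaper coefficient is close to zero (≈ 0.002). The baseline coefficients are per unit of spend, so the larger Radio coefficient (≈ 0.177 versus ≈ 0.047 for TV) does not by itself rank the two channels; that also depends on how widely each budget varies. Lasso keeps only 6 of the 56 polynomial terms, so most Newspaper-related terms are removed, although the same is true of most TV and Radio terms, and the notebook does not print which terms survive. Taken together, the results suggest that Newspaper advertising contributes very little to sales in this dataset.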

Final recommendation to the CMO: invest more heavily in TV and Radio advertising, and reduce spending on Newspaper. Regularised models (especially Ridge) should be preferred when using more complex feature sets, because they generalise better and avoid the overfitting observed in the unregularised polynomial model.