Now that we have a solid grasp of the theory and mathematics behind Linear Regression (Simple, Multiple, assumptions, Cost Function, Gradient Descent, and the Normal Equation), let's implement this using the **Scikit-learn** library in Python.

Scikit-learn is the go-to library for most machine learning tasks in Python. It provides a clean, consistent API for various algorithms, including Linear Regression.

**Part 6: Implementation with Scikit-learn**

**1. The Scikit-learn Philosophy (Estimator API)**

Scikit-learn's design is built around the "Estimator" concept. Estimators are objects that learn from data. They all share a consistent interface:
* **Instantiation:** You create an instance of the model class (e.g., `LinearRegression()`). You can set hyperparameters during instantiation.
* **`fit(X, y)` method:** This is used to train the model using your feature data `X` and target data `y`. This is where the learning happens (e.g., calculating coefficients).
* **`predict(X_new)` method:** Once trained, you use this to make predictions on new, unseen data `X_new`.
* **`score(X, y)` method:** For regression models, this typically returns the R-squared ($R^2$) coefficient of determination.
* **Attributes:** After fitting, estimators expose learned parameters as attributes ending with an underscore (e.g., `model.coef_`, `model.intercept_`).
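
The steps above can be tried end to end on a small synthetic dataset. This is a minimal sketch, not part of the Advertising example: the data, the true coefficients (3, 2, -1), and the names `X_demo`, `y_demo`, and `model` are all assumptions made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3 + 2*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression()           # instantiation: hyperparameters are set here
model.fit(X_demo, y_demo)            # learning: estimates coef_ and intercept_
y_hat = model.predict(X_demo[:5])    # predictions for the first five rows
r2 = model.score(X_demo, y_demo)     # R^2 on whatever data is passed in

# Learned parameters are exposed as attributes with a trailing underscore
print("coef_:", model.coef_)
print("intercept_:", model.intercept_)
print("R^2:", r2)
```

Because the noise is small, the printed coefficients should land close to the values used to generate the data.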

We'll look at two main implementations for linear regression in Scikit-learn:
* `sklearn.linear_model.LinearRegression`: Solves the ordinary least squares problem directly (analytically), with no iterations to tune.
* `sklearn.linear_model.SGDRegressor`: Uses Stochastic Gradient Descent.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

In [3]:
# Load the Advertising dataset
try:
    df_adv = pd.read_csv('Advertising.csv', index_col=0)
except FileNotFoundError:
    print("Error: 'Advertising.csv' not found. Please download it and place it in the correct directory.")
    raise  # re-raise so the remaining cells do not run without df_adv

# Prepare features (X) and target (y) for Multiple Linear Regression
X = df_adv[['TV', 'radio', 'newspaper']]
y = df_adv['sales']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"Test set shape: X_test: {X_test.shape}, y_test: {y_test.shape}")

Training set shape: X_train: (160, 3), y_train: (160,)
Test set shape: X_test: (40, 3), y_test: (40,)


---
**2. Using `sklearn.linear_model.LinearRegression`**

This is often your first choice for standard linear regression problems if your dataset isn't excessively large. Rather than iterating, it finds the ordinary least squares solution directly: for dense data it calls a least-squares routine (`scipy.linalg.lstsq`) built on Singular Value Decomposition (SVD), which reaches the same solution as the Normal Equation.

* **When to Use:** Good for datasets where computing the direct solution is feasible, which covers most common cases. The cost grows roughly with the number of samples times the square of the number of features, so it is mainly very wide or very large datasets that become slow or memory-hungry.
* **Under the Hood:** Finds the same solution as the Normal Equation, $b = (X^T X)^{-1} X^T y$, but computes it with a more numerically stable decomposition (SVD) instead of explicitly inverting $X^T X$.
* **Feature Scaling:** Not strictly required for `LinearRegression` to work correctly, because the analytical solution gives the same predictions however the features are scaled. Features on vastly different scales can still cause numerical precision issues in any matrix computation, though Scikit-learn's implementation is quite robust.
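
To make the "Under the Hood" point concrete, the sketch below solves the same least squares problem three ways and checks that the answers agree. It runs on synthetic data; the names `X_demo`, `y_demo`, `X_aug`, and the generating coefficients are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.5 + X_demo @ np.array([0.5, -2.0, 0.0]) + rng.normal(scale=0.2, size=200)

# Prepend a column of ones so the intercept is estimated as the first coefficient
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])

# 1) Normal Equation, solved as a linear system instead of forming the inverse
b_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_demo)

# 2) SVD-based least squares
b_lstsq, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

# 3) Scikit-learn's LinearRegression (intercept handled by fit_intercept=True)
lr = LinearRegression().fit(X_demo, y_demo)
b_sklearn = np.concatenate([[lr.intercept_], lr.coef_])

print("Normal Equation:", b_normal)
print("lstsq (SVD):    ", b_lstsq)
print("sklearn:        ", b_sklearn)
```

On well-conditioned data like this, all three agree to floating-point precision. The decomposition-based routes matter when columns are nearly collinear and $X^T X$ is close to singular.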

In [5]:
print("\n--- Using sklearn.linear_model.LinearRegression ---")

# 1. Instantiate the model
# Common parameters:
#   fit_intercept=True (default): Calculates the intercept. Set to False if data is already centered or you don't want an intercept.
#   n_jobs=None (default): Number of parallel jobs (-1 uses all available cores). It only speeds things up for
#     sufficiently large multi-target problems, so it is unlikely to matter here.
lr_model = LinearRegression(fit_intercept=True)


--- Using sklearn.linear_model.LinearRegression ---


In [6]:
# 2. Train (fit) the model
lr_model.fit(X_train, y_train)

In [7]:
# 3. Inspect the learned parameters
print(f"Intercept (b0): {lr_model.intercept_:.4f}")
# lr_model.coef_ will contain [b_TV, b_radio, b_newspaper], in the same order as the columns of X
print("Coefficients (b_j):")
for feature, coef in zip(X.columns, lr_model.coef_):
    print(f"  {feature}: {coef:.4f}")

Intercept (b0): 2.9791
Coefficients (b_j):
  TV: 0.0447
  radio: 0.1892
  newspaper: 0.0028


In [8]:
# 4. Make predictions on the test set
y_pred_lr = lr_model.predict(X_test)

In [10]:
# 5. Evaluate the model
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
# Alternatively, use the model's score method for R-squared
r2_lr_score_method = lr_model.score(X_test, y_test)

print(f"\nModel Performance (LinearRegression):")
print(f"  Mean Squared Error (MSE): {mse_lr:.4f}")
print(f"  R-squared (R²): {r2_lr:.4f}")
print(f"  R-squared (using model.score()): {r2_lr_score_method:.4f}")


Model Performance (LinearRegression):
  Mean Squared Error (MSE): 3.1741
  R-squared (R²): 0.8994
  R-squared (using model.score()): 0.8994
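
For reference, both metrics can be computed by hand, which shows what `mean_squared_error` and `r2_score` return. The four-value arrays below are made-up numbers for illustration, not taken from the Advertising data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Tiny made-up example (an assumption for illustration)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true_demo - y_pred_demo

# MSE: average squared residual (in squared units of the target)
mse_manual = np.mean(residuals ** 2)

# RMSE: square root of MSE, back in the units of the target
rmse_manual = np.sqrt(mse_manual)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true_demo, y_pred_demo))  # both 0.375
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

Since MSE is in squared units, taking its square root (RMSE) gives an error in the same units as the target, which is often easier to interpret.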


**3. Using `sklearn.linear_model.SGDRegressor`**

This model uses Stochastic Gradient Descent (SGD) to find the coefficients that minimize the cost function. It's highly efficient for large datasets.

* **When to Use:** Very large datasets (many samples and/or many features) where `LinearRegression` might be too slow or memory-intensive. Also useful for "online learning" where you want to update the model incrementally as new data arrives (using the `partial_fit` method).
* **Under the Hood:** Iteratively updates the coefficients with Stochastic Gradient Descent: the gradient of the loss is estimated from one training sample at a time, and the coefficients take a small step after each sample.
* **Feature Scaling: CRUCIAL!** Gradient Descent-based algorithms are sensitive to feature scales. You *must* scale your features (e.g., using `StandardScaler`) for `SGDRegressor` to perform well and converge properly.
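
One way to make sure scaling is never forgotten is to bundle the scaler and the regressor into a single estimator with `make_pipeline`. The sketch below is only an illustration on synthetic data with two features on very different scales; the names `X_demo`, `y_demo`, and `pipe` and the generating coefficients are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): one feature in [0, 300], one in [0, 1]
rng = np.random.default_rng(2)
X_demo = np.column_stack([rng.uniform(0, 300, size=500), rng.uniform(0, 1, size=500)])
y_demo = 5.0 + 0.05 * X_demo[:, 0] + 4.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data only, then feeds scaled data to SGD
pipe = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty=None, max_iter=2000, tol=1e-4, random_state=42),
)
pipe.fit(X_tr, y_tr)

# predict() and score() apply the same training-set scaling automatically
r2_pipe = pipe.score(X_te, y_te)
print(f"R^2 on held-out data: {r2_pipe:.4f}")

# The fitted regressor is the last step; its coefficients refer to the SCALED features
print(pipe[-1].coef_, pipe[-1].intercept_)
```

The pipeline behaves like any other estimator, so the scaler can never be fitted on test data by accident or skipped at prediction time.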

In [11]:
print("\n--- Using sklearn.linear_model.SGDRegressor ---")

# 1. Feature Scaling (ESSENTIAL for SGDRegressor)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit on training data and transform it
X_test_scaled = scaler.transform(X_test)       # Transform test data using the same scaler


--- Using sklearn.linear_model.SGDRegressor ---


In [12]:
# 2. Instantiate the model
# Common parameters:
#   loss='squared_error': Specifies the loss function. 'squared_error' means Ordinary Least Squares.
#   penalty=None: Type of regularization (e.g., 'l2', 'l1', 'elasticnet'). We'll cover this in Topic 7. For now, None.
#   alpha=0.0001: Constant that multiplies the regularization term. (Default, not very relevant if penalty=None).
#   max_iter=1000: Maximum number of passes over the training data (epochs).
#   tol=1e-3: The stopping criterion. Training stops when (loss > best_loss - tol) for n_iter_no_change consecutive epochs (5 by default).
#   learning_rate='invscaling' (default): Learning-rate schedule, eta = eta0 / t**power_t (power_t defaults to 0.25), where t counts the updates so far.
#   eta0=0.01 (default): Initial learning rate for 'invscaling', 'constant', 'adaptive'.
#   random_state=None: For shuffling the data, ensures reproducibility if set to an integer.
sgd_model = SGDRegressor(loss='squared_error', penalty=None,
                         max_iter=2000, tol=1e-4, # Increased max_iter and stricter tol for potentially better convergence
                         learning_rate='invscaling', eta0=0.01,
                         random_state=42) # For reproducibility

In [13]:
# 3. Train (fit) the model on SCALED data
sgd_model.fit(X_train_scaled, y_train)

In [14]:
# 4. Inspect the learned parameters
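# Note: These coefficients are for the SCALED features, so each one is the change in predicted sales
# per one standard deviation of that feature, not per original unit.
# y was not scaled, so the intercept stays in sales units; with standardized features it should land near the mean of y_train.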
print(f"Intercept (b0) from SGD: {sgd_model.intercept_[0]:.4f}") # intercept_ is an array
print("Coefficients (b_j) from SGD (for SCALED features):")
# StandardScaler keeps the column order, so X.columns still lines up with coef_
for feature, coef in zip(X.columns, sgd_model.coef_):
    print(f"  {feature}: {coef:.4f}")

Intercept (b0) from SGD: 14.0984
Coefficients (b_j) from SGD (for SCALED features):
  TV: 3.7675
  radio: 2.7874
  newspaper: 0.0594
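
Coefficients learned on standardized features can be converted back to the original units, which makes them directly comparable with the `LinearRegression` coefficients. Since each scaled feature is $z_j = (x_j - \mu_j) / \sigma_j$, the original-unit coefficient is $c_j / \sigma_j$ and the intercept becomes $b_0 - \sum_j c_j \mu_j / \sigma_j$. The sketch below checks this on synthetic data; the names `scaler_demo`, `sgd_demo`, `coef_orig`, and `intercept_orig` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(3)
X_demo = np.column_stack([rng.uniform(0, 300, 200), rng.uniform(0, 50, 200)])
y_demo = 3.0 + 0.05 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)
sgd_demo = SGDRegressor(penalty=None, max_iter=2000, tol=1e-4, random_state=42)
sgd_demo.fit(X_scaled, y_demo)

# Undo the standardization: divide by each feature's std, then adjust the intercept
coef_orig = sgd_demo.coef_ / scaler_demo.scale_
intercept_orig = sgd_demo.intercept_[0] - np.sum(
    sgd_demo.coef_ * scaler_demo.mean_ / scaler_demo.scale_
)

# Both parameterizations describe the same fitted line, so predictions must match
pred_scaled = sgd_demo.predict(X_scaled)
pred_orig = intercept_orig + X_demo @ coef_orig
print("Predictions match:", np.allclose(pred_scaled, pred_orig))
print("Original-unit coefficients:", coef_orig, "intercept:", intercept_orig)
```

The conversion is an exact algebraic identity, so the two sets of predictions agree regardless of how well SGD converged.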


In [15]:
# 5. Make predictions on the SCALED test set
y_pred_sgd = sgd_model.predict(X_test_scaled)

In [16]:
# 6. Evaluate the model
mse_sgd = mean_squared_error(y_test, y_pred_sgd)
r2_sgd = r2_score(y_test, y_pred_sgd)
r2_sgd_score_method = sgd_model.score(X_test_scaled, y_test)

print(f"\nModel Performance (SGDRegressor):")
print(f"  Mean Squared Error (MSE): {mse_sgd:.4f}")
print(f"  R-squared (R²): {r2_sgd:.4f}")
print(f"  R-squared (using model.score()): {r2_sgd_score_method:.4f}")


Model Performance (SGDRegressor):
  Mean Squared Error (MSE): 3.1723
  R-squared (R²): 0.8995
  R-squared (using model.score()): 0.8995


In [17]:
# Compare results (they should be reasonably close if SGD converged well)
print("\nComparison (R²):")
print(f"  LinearRegression: {r2_lr:.4f}")
print(f"  SGDRegressor:     {r2_sgd:.4f}")
if abs(r2_lr - r2_sgd) < 0.01 : # Check if they are close
    print("The R² values are very close, indicating SGDRegressor converged well.")
else:
    print("The R² values differ somewhat. SGDRegressor might benefit from further hyperparameter tuning or more iterations.")


Comparison (R²):
  LinearRegression: 0.8994
  SGDRegressor:     0.8995
The R² values are very close, indicating SGDRegressor converged well.


**4. Key Differences and When to Choose:**

| Feature               | `LinearRegression`                                       | `SGDRegressor`                                                               |
| :-------------------- | :------------------------------------------------------- | :--------------------------------------------------------------------------- |
| **Underlying Solver** | Analytical (e.g., SVD, Normal Equation based)            | Iterative (Stochastic Gradient Descent)                                      |
| **Feature Scaling** | Not strictly required (but good practice for numerics)   | **Essential** for good performance and convergence.                        |
| **Hyperparameters** | Few (e.g., `fit_intercept`)                              | Several (e.g., `learning_rate`, `eta0`, `max_iter`, `tol`, `penalty`, `alpha`) |
| **Dataset Size** | Best for small to medium datasets where a direct solution is feasible | Scales well to very large datasets (samples & features).                     |
| **Online Learning** | No direct support                                        | Yes, via `partial_fit()` method.                                           |
| **Speed** | Can be slow with a very large number of features.      | Can be faster for large datasets if tuned well.                              |
| **Solution** | Exact (up to numerical precision)                        | Approximate (converges towards the solution)                                 |
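
The online-learning row can be illustrated with `partial_fit`, which updates the model one batch at a time instead of requiring the full dataset up front. This is a rough sketch under stated assumptions: the data stream is simulated with random batches, and `scaler_stream`, `sgd_stream`, and `true_coef` are hypothetical names. `StandardScaler` also has a `partial_fit` method, used here to keep running estimates of the feature means and variances.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
true_coef = np.array([2.0, -1.0, 0.5])  # used only to simulate the stream

scaler_stream = StandardScaler()
sgd_stream = SGDRegressor(penalty=None, random_state=42)

# Simulate 50 batches of 100 samples arriving over time
for _ in range(50):
    X_batch = rng.normal(loc=10.0, scale=3.0, size=(100, 3))
    y_batch = 4.0 + X_batch @ true_coef + rng.normal(scale=0.1, size=100)

    scaler_stream.partial_fit(X_batch)  # update running mean and variance
    sgd_stream.partial_fit(scaler_stream.transform(X_batch), y_batch)  # one pass over this batch

# Evaluate on a fresh batch the model has never seen
X_new = rng.normal(loc=10.0, scale=3.0, size=(200, 3))
y_new = 4.0 + X_new @ true_coef + rng.normal(scale=0.1, size=200)
r2_stream = sgd_stream.score(scaler_stream.transform(X_new), y_new)
print(f"R^2 on a new batch: {r2_stream:.4f}")
```

Each call to `partial_fit` makes a single pass over the batch it is given, so convergence comes from seeing many batches rather than from `max_iter`.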

**In Summary:**
Scikit-learn makes implementing Linear Regression straightforward.
* For most standard use cases where data fits comfortably in memory, `LinearRegression` is simple and effective.
* When dealing with very large datasets or needing online learning capabilities, `SGDRegressor` is the more appropriate tool, but it requires careful feature scaling and potentially hyperparameter tuning.

Both estimators, once trained, provide the `coef_` and `intercept_` attributes, allowing you to understand the learned linear relationship, and the `predict()` method to apply the model to new data.

This concludes our initial look at Linear Regression and its implementation. We've covered simple and multiple regression, the critical assumptions, how the model learns via cost functions and optimization (Normal Equation & Gradient Descent), the mathematical intuition, and finally, how to put it all into practice with Scikit-learn.