# Linear Regression from Scratch

## Introduction

Linear regression is one of the most fundamental algorithms in statistical learning and machine learning. It models the relationship between a dependent variable $y$ and one or more independent variables $\mathbf{x}$ by fitting a linear equation to observed data.

## Mathematical Foundation

### The Model

For simple linear regression with one predictor, the model is:

$$y = \beta_0 + \beta_1 x + \epsilon$$

where:
- $y$ is the dependent variable (response)
- $x$ is the independent variable (predictor)
- $\beta_0$ is the y-intercept
- $\beta_1$ is the slope
- $\epsilon$ is the error term (assumed to have mean 0 and constant variance; normality is additionally assumed for the t-tests, F-test, and confidence intervals computed later)

### Ordinary Least Squares (OLS)

The goal is to find the parameters $\beta_0$ and $\beta_1$ that minimize the sum of squared residuals:

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
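
To make the objective concrete, here is a minimal sketch that evaluates the SSE for two candidate lines. The helper name `sse` and the toy data are assumptions made for illustration; they are not part of the implementation further down.

```python
import numpy as np

def sse(beta_0, beta_1, x, y):
    """Sum of squared residuals for the candidate line y = beta_0 + beta_1 * x."""
    residuals = y - (beta_0 + beta_1 * x)
    return float(np.sum(residuals ** 2))

# Toy data lying exactly on y = 1 + 2x (hypothetical values)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

sse_exact = sse(1.0, 2.0, x_toy, y_toy)  # the generating line: every residual is 0
sse_off = sse(0.0, 2.0, x_toy, y_toy)    # intercept off by 1: every residual is 1

print(f"SSE at the generating line: {sse_exact}")
print(f"SSE with the intercept off by 1: {sse_off}")
```

OLS picks the pair $(\beta_0, \beta_1)$ for which this quantity is smallest.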

### Derivation of OLS Estimators

Taking partial derivatives and setting them to zero:

$$\frac{\partial \text{SSE}}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0$$

$$\frac{\partial \text{SSE}}{\partial \beta_1} = -2\sum_{i=1}^{n}x_i(y_i - \beta_0 - \beta_1 x_i) = 0$$

Solving these normal equations yields the closed-form solutions:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the sample means.
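
As a quick sanity check of these formulas, the sketch below applies them to a five-point dataset (the values are made up for illustration) and compares the result with `np.polyfit`. It also confirms that the fitted residuals satisfy both normal equations.

```python
import numpy as np

# Hypothetical five-point dataset
x_ex = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_ex = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_bar, y_bar = x_ex.mean(), y_ex.mean()

# Closed-form OLS estimators
beta_1_hat = np.sum((x_ex - x_bar) * (y_ex - y_bar)) / np.sum((x_ex - x_bar) ** 2)
beta_0_hat = y_bar - beta_1_hat * x_bar

# Cross-check: a degree-1 polyfit returns [slope, intercept]
slope_ref, intercept_ref = np.polyfit(x_ex, y_ex, 1)

# Both normal equations should hold (up to rounding) at the solution
resid_ex = y_ex - beta_0_hat - beta_1_hat * x_ex
print(f"slope = {beta_1_hat:.4f}, intercept = {beta_0_hat:.4f}")
print(f"sum of residuals = {resid_ex.sum():.2e}")
print(f"sum of x * residuals = {(x_ex * resid_ex).sum():.2e}")
```

Here $\bar{x} = 3$ and $\bar{y} = 4$, the numerator is 6 and the denominator is 10, so $\hat{\beta}_1 = 0.6$ and $\hat{\beta}_0 = 4 - 0.6 \cdot 3 = 2.2$.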

### Matrix Formulation

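For multiple linear regression, the model can be expressed in matrix form, where $\mathbf{X}$ is the design matrix whose first column is all ones (for the intercept) and whose remaining columns hold the predictors: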

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Provided $\mathbf{X}^T\mathbf{X}$ is invertible (that is, $\mathbf{X}$ has full column rank), the OLS estimator becomes:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
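
The sketch below illustrates the matrix form with two predictors. The data are synthetic and noise-free (an assumption chosen so the true coefficients are recovered exactly), and all variable names are illustrative. It solves the normal equations with `np.linalg.solve` and, as an alternative, calls `np.linalg.lstsq`, which never forms $\mathbf{X}^T\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 50

# Two synthetic predictors and known coefficients: intercept, then two slopes
features = rng.normal(size=(n_obs, 2))
beta_true = np.array([1.0, 2.0, -3.0])

# Design matrix: a column of ones followed by the predictors
X_mat = np.column_stack([np.ones(n_obs), features])
y_vec = X_mat @ beta_true  # noise-free response

# Solve (X'X) beta = X'y without computing an explicit inverse
beta_solve = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y_vec)

# Least-squares solver that works on X directly
beta_lstsq, *_ = np.linalg.lstsq(X_mat, y_vec, rcond=None)

print("solve:", np.round(beta_solve, 6))
print("lstsq:", np.round(beta_lstsq, 6))
```

Both routes give the same estimate here; `lstsq` is usually the safer choice when columns of $\mathbf{X}$ are nearly collinear.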

### Coefficient of Determination ($R^2$)

The $R^2$ statistic measures the proportion of variance in $y$ explained by the model:

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

where $\text{SST}$ is the total sum of squares.
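
A small sketch of this formula on made-up numbers shows its reference points. The function name `r_squared` is an assumption for illustration. Perfect predictions give $R^2 = 1$, predicting $\bar{y}$ everywhere gives $R^2 = 0$, and predictions worse than the mean give a negative value (which cannot happen for an OLS fit with an intercept evaluated on its own training data).

```python
import numpy as np

def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SSE / SST."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sse = np.sum((y_true - y_hat) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - sse / sst

y_obs = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical observations

r2_perfect = r_squared(y_obs, y_obs)                           # predictions equal the data
r2_mean = r_squared(y_obs, np.full_like(y_obs, y_obs.mean()))  # always predict the mean
r2_reversed = r_squared(y_obs, y_obs[::-1])                    # predictions in reverse order

print(r2_perfect, r2_mean, r2_reversed)
```

For the reversed predictions, SSE is $9 + 1 + 1 + 9 = 20$ and SST is 5, so $R^2 = 1 - 20/5 = -3$.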

## Implementation

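Let us now implement linear regression from scratch, using only NumPy for the fitting itself. Matplotlib is used for plotting, and SciPy supplies the distributions needed for the diagnostic plots and p-values.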

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Configure matplotlib (the 'seaborn-v0_8-*' style names require Matplotlib 3.6 or newer)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12

### Generate Synthetic Data

We create a dataset with a known linear relationship plus Gaussian noise.

In [None]:
# True parameters
true_beta_0 = 2.5  # Intercept
true_beta_1 = 1.8  # Slope
noise_std = 1.5    # Standard deviation of noise

# Generate data
n_samples = 100
X = np.random.uniform(0, 10, n_samples)
epsilon = np.random.normal(0, noise_std, n_samples)
y = true_beta_0 + true_beta_1 * X + epsilon

print(f"Generated {n_samples} data points")
print(f"True parameters: β₀ = {true_beta_0}, β₁ = {true_beta_1}")
print(f"X range: [{X.min():.2f}, {X.max():.2f}]")
print(f"y range: [{y.min():.2f}, {y.max():.2f}]")

### Linear Regression Class from Scratch

In [None]:
class LinearRegressionOLS:
    """
    Linear Regression using Ordinary Least Squares.
    
    Implements both the analytical closed-form solution
    and the matrix formulation.
    """
    
    def __init__(self):
        self.beta_0 = None  # Intercept
        self.beta_1 = None  # Slope (for simple regression)
        self.coefficients = None  # Full coefficient vector
        
    def fit_simple(self, X, y):
        """
        Fit simple linear regression using closed-form OLS formulas.
        
        Parameters:
        -----------
        X : array-like, shape (n_samples,)
            Training data (single feature)
        y : array-like, shape (n_samples,)
            Target values
        """
        X = np.asarray(X)
        y = np.asarray(y)
        
        # Calculate means
        x_mean = np.mean(X)
        y_mean = np.mean(y)
        
        # Calculate slope using covariance formula
        # β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean) ** 2)
        
        # The slope is undefined when every x value is identical
        if denominator == 0:
            raise ValueError("X has zero variance; the slope is undefined")
        
        self.beta_1 = numerator / denominator
        
        # Calculate intercept
        # β₀ = ȳ - β₁x̄
        self.beta_0 = y_mean - self.beta_1 * x_mean
        
        self.coefficients = np.array([self.beta_0, self.beta_1])
        
        return self
    
    def fit_matrix(self, X, y):
        """
        Fit linear regression using matrix formulation.
        
        β = (X'X)⁻¹X'y
        
        Parameters:
        -----------
        X : array-like, shape (n_samples,) or (n_samples, n_features)
            Training data
        y : array-like, shape (n_samples,)
            Target values
        """
        X = np.asarray(X)
        y = np.asarray(y)
        
        # Reshape X if 1D
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        
        # Add column of ones for intercept (design matrix)
        n_samples = X.shape[0]
        X_design = np.column_stack([np.ones(n_samples), X])
        
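        # Compute OLS estimator by solving the normal equations (X'X)β = X'y.
        # np.linalg.solve is cheaper and numerically more stable than
        # forming (X'X)⁻¹ explicitly.
        XtX = X_design.T @ X_design
        Xty = X_design.T @ y
        
        self.coefficients = np.linalg.solve(XtX, Xty)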
        self.beta_0 = self.coefficients[0]
        self.beta_1 = self.coefficients[1] if len(self.coefficients) > 1 else None
        
        return self
    
    def predict(self, X):
        """
        Make predictions using the fitted model.
        
        Parameters:
        -----------
        X : array-like
            Samples to predict
            
        Returns:
        --------
        y_pred : array
            Predicted values
        """
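        X = np.asarray(X)
        
        if self.coefficients is None:
            raise RuntimeError("Model is not fitted; call fit_simple or fit_matrix first")
        
        # Treat 1D input as a single feature so that models fitted with
        # either method share one code path (a multi-feature model given
        # 1D input then fails loudly instead of silently using only β₁)
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        X_design = np.column_stack([np.ones(X.shape[0]), X])
        return X_design @ self.coefficients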
    
    def score(self, X, y):
        """
        Calculate R² (coefficient of determination).
        
        R² = 1 - SSE/SST
        
        Parameters:
        -----------
        X : array-like
            Test samples
        y : array-like
            True values
            
        Returns:
        --------
        r2 : float
            Coefficient of determination
        """
        y = np.asarray(y)
        y_pred = self.predict(X)
        
        # Sum of squared errors (residuals)
        ss_res = np.sum((y - y_pred) ** 2)
        
        # Total sum of squares
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        
        return 1 - (ss_res / ss_tot)
    
    def residuals(self, X, y):
        """
        Calculate residuals (errors).
        
        Parameters:
        -----------
        X : array-like
            Samples
        y : array-like
            True values
            
        Returns:
        --------
        residuals : array
            y - ŷ
        """
        return np.asarray(y) - self.predict(X)

### Fit the Model

In [None]:
# Create and fit model using simple closed-form solution
model_simple = LinearRegressionOLS()
model_simple.fit_simple(X, y)

print("Simple OLS Formula Results:")
print(f"  Estimated β₀ (intercept): {model_simple.beta_0:.4f}")
print(f"  Estimated β₁ (slope):     {model_simple.beta_1:.4f}")
print(f"  True β₀: {true_beta_0}, True β₁: {true_beta_1}")
print(f"  R² score: {model_simple.score(X, y):.4f}")

print("\n" + "="*50 + "\n")

# Create and fit model using matrix formulation
model_matrix = LinearRegressionOLS()
model_matrix.fit_matrix(X, y)

print("Matrix Formulation Results:")
print(f"  Estimated β₀ (intercept): {model_matrix.beta_0:.4f}")
print(f"  Estimated β₁ (slope):     {model_matrix.beta_1:.4f}")
print(f"  R² score: {model_matrix.score(X, y):.4f}")

### Visualization

In [None]:
# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Data and Regression Line
ax1 = axes[0, 0]
ax1.scatter(X, y, alpha=0.6, edgecolors='black', linewidth=0.5, label='Data points')

# Regression line
X_line = np.linspace(X.min(), X.max(), 100)
y_pred_line = model_simple.predict(X_line)
ax1.plot(X_line, y_pred_line, 'r-', linewidth=2, 
         label=f'Fitted: $y = {model_simple.beta_0:.2f} + {model_simple.beta_1:.2f}x$')

# True line for comparison
y_true_line = true_beta_0 + true_beta_1 * X_line
ax1.plot(X_line, y_true_line, 'g--', linewidth=2, alpha=0.7,
         label=f'True: $y = {true_beta_0} + {true_beta_1}x$')

ax1.set_xlabel('$x$')
ax1.set_ylabel('$y$')
ax1.set_title('Linear Regression Fit')
ax1.legend()

# Plot 2: Residuals vs Fitted Values
ax2 = axes[0, 1]
y_pred = model_simple.predict(X)
residuals = model_simple.residuals(X, y)

ax2.scatter(y_pred, residuals, alpha=0.6, edgecolors='black', linewidth=0.5)
ax2.axhline(y=0, color='r', linestyle='--', linewidth=1)
ax2.set_xlabel('Fitted Values ($\\hat{y}$)')
ax2.set_ylabel('Residuals ($y - \\hat{y}$)')
ax2.set_title('Residuals vs Fitted Values')

# Plot 3: Histogram of Residuals
ax3 = axes[1, 0]
ax3.hist(residuals, bins=20, edgecolor='black', alpha=0.7, density=True)

# Overlay normal distribution
from scipy import stats
x_norm = np.linspace(residuals.min(), residuals.max(), 100)
ax3.plot(x_norm, stats.norm.pdf(x_norm, 0, np.std(residuals)), 
         'r-', linewidth=2, label='Normal PDF')

ax3.set_xlabel('Residuals')
ax3.set_ylabel('Density')
ax3.set_title('Distribution of Residuals')
ax3.legend()

# Plot 4: Q-Q Plot
ax4 = axes[1, 1]
stats.probplot(residuals, dist="norm", plot=ax4)
ax4.set_title('Q-Q Plot (Normality Check)')

plt.tight_layout()

# Save the figure
plt.savefig('plot.png', dpi=150, bbox_inches='tight')
print("Plot saved to 'plot.png'")

plt.show()

### Statistical Analysis

In [None]:
# Calculate additional statistics
from scipy import stats  # repeated here so this cell does not depend on the plotting cell
n = len(y)
p = 2  # Number of parameters (β₀ and β₁)

# Predictions and residuals
y_pred = model_simple.predict(X)
residuals = y - y_pred

# Sum of squares
SSE = np.sum(residuals ** 2)
SST = np.sum((y - np.mean(y)) ** 2)
SSR = SST - SSE  # Regression sum of squares

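# Residual mean square: unbiased estimate of the noise variance,
# using n - p degrees of freedom (its square root is the residual standard error)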
MSE = SSE / (n - p)
RMSE = np.sqrt(MSE)

# Standard error of coefficients
X_design = np.column_stack([np.ones(n), X])
var_beta = MSE * np.linalg.inv(X_design.T @ X_design)
se_beta_0 = np.sqrt(var_beta[0, 0])
se_beta_1 = np.sqrt(var_beta[1, 1])

# t-statistics
t_beta_0 = model_simple.beta_0 / se_beta_0
t_beta_1 = model_simple.beta_1 / se_beta_1

# p-values (two-tailed); the survival function sf = 1 - cdf avoids
# rounding to exactly zero when |t| is large
p_value_beta_0 = 2 * stats.t.sf(abs(t_beta_0), n - p)
p_value_beta_1 = 2 * stats.t.sf(abs(t_beta_1), n - p)

# F-statistic
MSR = SSR / (p - 1)
F_stat = MSR / MSE
p_value_F = stats.f.sf(F_stat, p - 1, n - p)  # sf keeps precision for large F

print("="*60)
print("REGRESSION ANALYSIS SUMMARY")
print("="*60)
print(f"\nSample size: {n}")
print(f"R²: {model_simple.score(X, y):.4f}")
print(f"Adjusted R²: {1 - (1 - model_simple.score(X, y)) * (n - 1) / (n - p):.4f}")
print(f"RMSE: {RMSE:.4f}")
print(f"\nF-statistic: {F_stat:.4f} (p-value: {p_value_F:.4e})")

print("\n" + "-"*60)
print("COEFFICIENTS")
print("-"*60)
print(f"{'Parameter':<12} {'Estimate':<12} {'Std. Error':<12} {'t-value':<12} {'p-value':<12}")
print("-"*60)
print(f"{'β₀ (int.)':<12} {model_simple.beta_0:<12.4f} {se_beta_0:<12.4f} {t_beta_0:<12.4f} {p_value_beta_0:<12.4e}")
print(f"{'β₁ (slope)':<12} {model_simple.beta_1:<12.4f} {se_beta_1:<12.4f} {t_beta_1:<12.4f} {p_value_beta_1:<12.4e}")
print("-"*60)

# Confidence intervals (95%)
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha/2, n - p)

ci_beta_0 = (model_simple.beta_0 - t_crit * se_beta_0, 
             model_simple.beta_0 + t_crit * se_beta_0)
ci_beta_1 = (model_simple.beta_1 - t_crit * se_beta_1, 
             model_simple.beta_1 + t_crit * se_beta_1)

print(f"\n95% Confidence Intervals:")
print(f"  β₀: [{ci_beta_0[0]:.4f}, {ci_beta_0[1]:.4f}]")
print(f"  β₁: [{ci_beta_1[0]:.4f}, {ci_beta_1[1]:.4f}]")

## Gradient Descent Implementation

As an alternative to the closed-form solution, we can also find the optimal parameters using gradient descent. This iterative approach is particularly useful when dealing with large datasets or when extending to more complex models.

The gradients of the cost function $J(\beta_0, \beta_1) = \frac{1}{2n}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$ are:

$$\frac{\partial J}{\partial \beta_0} = -\frac{1}{n}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)$$

$$\frac{\partial J}{\partial \beta_1} = -\frac{1}{n}\sum_{i=1}^{n}x_i(y_i - \beta_0 - \beta_1 x_i)$$

In [None]:
class LinearRegressionGD:
    """
    Linear Regression using Gradient Descent.
    """
    
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.beta_0 = None
        self.beta_1 = None
        self.cost_history = []
        
    def fit(self, X, y):
        """
        Fit model using gradient descent.
        """
        X = np.asarray(X)
        y = np.asarray(y)
        n = len(X)
        
        # Initialize parameters
        self.beta_0 = 0
        self.beta_1 = 0
        self.cost_history = []
        
        for i in range(self.n_iterations):
            # Predictions
            y_pred = self.beta_0 + self.beta_1 * X
            
            # Compute cost J (half the mean squared error)
            cost = np.mean((y - y_pred) ** 2) / 2
            self.cost_history.append(cost)
            
            # Compute gradients
            d_beta_0 = -np.mean(y - y_pred)
            d_beta_1 = -np.mean(X * (y - y_pred))
            
            # Update parameters
            self.beta_0 -= self.learning_rate * d_beta_0
            self.beta_1 -= self.learning_rate * d_beta_1
        
        return self
    
    def predict(self, X):
        return self.beta_0 + self.beta_1 * np.asarray(X)


# Fit using gradient descent
model_gd = LinearRegressionGD(learning_rate=0.01, n_iterations=1000)
model_gd.fit(X, y)

print("Gradient Descent Results:")
print(f"  Estimated β₀: {model_gd.beta_0:.4f}")
print(f"  Estimated β₁: {model_gd.beta_1:.4f}")
print(f"  Final cost: {model_gd.cost_history[-1]:.4f}")

# Plot convergence
plt.figure(figsize=(10, 4))
plt.plot(model_gd.cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost (MSE/2)')
plt.title('Gradient Descent Convergence')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.show()

## Conclusion

This notebook demonstrated the implementation of linear regression from scratch using:

1. **Closed-form OLS solution** - Direct computation using the analytical formulas derived from minimizing squared errors
2. **Matrix formulation** - Using the normal equation $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
3. **Gradient descent** - Iterative optimization approach

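The two OLS methods solve the same normal equations, so their estimates agree up to floating-point error. Gradient descent minimizes the same convex cost and approaches that solution as it iterates, but how close it gets depends on the learning rate and the number of iterations. With unscaled features the intercept in particular can converge slowly, so compare the printed estimates before treating the results as identical. The closed-form solution is efficient for small to medium datasets, while gradient descent becomes preferable for very large datasets or for models with no closed-form solution, such as L1-regularized (lasso) regression.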
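
How closely gradient descent matches the closed-form estimate depends on the learning rate, the iteration count, and the scale of the features. The sketch below is an illustration under assumptions, not a re-run of the cells above: it draws its own synthetic data, uses a compact helper named `gd`, and picks a learning rate of 0.05 for the centered case. It compares gradient descent on raw $x$ with gradient descent on centered $x$, using `np.polyfit` as the OLS reference.

```python
import numpy as np

rng = np.random.default_rng(1)
x_raw = rng.uniform(0, 10, 200)
y_raw = 2.5 + 1.8 * x_raw + rng.normal(0, 1.5, 200)

def gd(x, y, lr, n_iter):
    """Plain batch gradient descent on J = MSE / 2, starting from zero."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        r = y - b0 - b1 * x
        b0 += lr * np.mean(r)      # minus lr times dJ/db0
        b1 += lr * np.mean(x * r)  # minus lr times dJ/db1
    return b0, b1

# OLS reference (a degree-1 polyfit returns [slope, intercept])
slope_ols, intercept_ols = np.polyfit(x_raw, y_raw, 1)

# Gradient descent on the raw feature
b0_raw, b1_raw = gd(x_raw, y_raw, lr=0.01, n_iter=1000)

# Gradient descent on the centered feature, then map the intercept back
x_mean = x_raw.mean()
b0_c, b1_c = gd(x_raw - x_mean, y_raw, lr=0.05, n_iter=1000)
intercept_centered = b0_c - b1_c * x_mean

print(f"OLS:         intercept = {intercept_ols:.4f}, slope = {slope_ols:.4f}")
print(f"GD raw:      intercept = {b0_raw:.4f}, slope = {b1_raw:.4f}")
print(f"GD centered: intercept = {intercept_centered:.4f}, slope = {b1_c:.4f}")
```

Centering removes the correlation between the intercept and slope directions of the cost surface, which is why the centered run reaches the OLS estimate in far fewer effective iterations.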

Key takeaways:
- Linear regression minimizes the sum of squared residuals
- The $R^2$ statistic measures model fit quality
- Residual analysis helps validate model assumptions (normality, homoscedasticity)
- Standard errors enable hypothesis testing and confidence interval construction