# No-Framework Linear Regression

### This implementation uses only NumPy to build a linear regression model

### Goal: Predict used car prices using gradient descent optimization

What we'll implement manually:
- Train/Test split
- Feature scaling (z-score normalization)
- Forward pass (predictions)
- Cost function (Mean Squared Error)
- Gradient computation
- Parameter updates (gradient descent)
- Evaluation metrics (MSE, RMSE, R^2)

In [1]:
# NumPy: core library for numerical operations on arrays
# ONLY external dependency for the model itself
import numpy as np

# matplotlib: for creating visualizations of training progress and results
import matplotlib.pyplot as plt

# os: for handling file paths in a cross-platform way
import os

# Set random seed for reproducibility
# Project-wide seed of 113
np.random.seed(113)

# Load Cleaned Data

- Load the pre-processed dataset that was cleaned in the data-preparation step
- This same file will be used by all 4 frameworks for fair comparison

In [2]:
# Define path to our cleaned dataset
DATA_PATH = os.path.join('..', '..', 'data', 'processed', 'vehicles_clean.csv')

# np.genfromtxt() reads CSV files into numpy arrays
# delimiter=',' specifies that columns are separated by commas
# skip_header=1 skips the first row (column names)
# Gives us a 2D array where each row is a car, each column is a feature
data = np.genfromtxt(DATA_PATH, delimiter=',', skip_header=1)

# Verify the data loaded correctly
# shape should be (100000, 12)
print(f"Data shape: {data.shape}")
print(f"First row: {data[0]}")

Data shape: (100000, 12)
First row: [2.9990e+04 2.0140e+03 7.0000e+00 2.0000e+00 6.0000e+00 2.0000e+00
 2.6129e+04 0.0000e+00 2.0000e+00 0.0000e+00 8.0000e+00 1.7000e+01]
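
One thing worth knowing about `np.genfromtxt()`: with the default float dtype, any entry it cannot parse as a number is loaded as `NaN` rather than raising an error. The sketch below illustrates this on a made-up three-row CSV held in memory (the column names and values are hypothetical, not taken from `vehicles_clean.csv`), and shows one way to count `NaN` values per column as a sanity check after loading.

```python
import io
import numpy as np

# Hypothetical three-row CSV standing in for the real file.
# The second data row has a non-numeric year to show how it is handled.
csv_text = (
    "price,year,odometer\n"
    "29990,2014,26129\n"
    "15500,n/a,88000\n"
    "7200,2009,141000\n"
)

# Same call pattern as above, but reading from an in-memory text buffer
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

# Entries that could not be parsed as floats show up as NaN
nan_per_column = np.isnan(sample).sum(axis=0)

print(f"Shape: {sample.shape}")
print(f"NaN count per column: {nan_per_column}")
```

Running the same `np.isnan(data).sum(axis=0)` check on the real array would confirm that the cleaning step left no unparseable values behind.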


# Separate Features and Target
- Our columns are: price, year, manufacturer, condition, cylinders, fuel, odometer, title_status, transmission, drive, type, state
- price (column 0) is our TARGET - what we want to predict
- All other columns (1-11) are FEATURES - inputs to our model

In [3]:
# Extract target variable (price)
# data[:, 0] means "all rows, column 0"
y = data[:, 0]

# Extract feature variables
# data[:, 1:] means "all rows, column 1 through the end"
X = data[:, 1:]

# Print shapes to verify separation
print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")

Features (X) shape: (100000, 11)
Target (y) shape: (100000,)


In [4]:
# Define feature names for reference (matching our cleaned data columns)
FEATURE_NAMES = ['year', 'manufacturer', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'state']
print(f"Feature Names: {FEATURE_NAMES}")

Feature Names: ['year', 'manufacturer', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'state']


# Train/Test Split

We need to split our data into two sets:
- Training set (80%): Used to train the model (learn the weights)
- Test set (20%): Used to evaluate performance on unseen data

Why split? If we test on the same data we trained on, we can't tell if the model actually learned patterns or just memorized the training data.

In [5]:
def train_test_split(X, y, test_size=0.2, random_seed=113):
    """
    Split features and target into training and testing sets.

    Parameters:
    -----------
    X : numpy.ndarray
        Feature matrix of shape (n_samples, n_features)

    y : numpy.ndarray
        Target vector of shape (n_samples,)
    
    test_size : float
        Proportion of data to use for testing (0.0 to 1.0)
        Default 0.2 means 20% test, 80% train
    
    random_seed : int
        Seed for random number generator to ensure reproducibility

    Returns:
    --------
    X_train, X_test, y_train, y_test : numpy.ndarray
        Split arrays for training and testing
    """

    # Set random seed for reproducibility
    np.random.seed(random_seed)

    # Get the total number of samples in our dataset
    n_samples = X.shape[0]

    # Calculate the number of test samples
    # int() truncates to whole number
    n_test = int(n_samples * test_size)

    # Calculate number of training samples
    n_train = n_samples - n_test

    # Create an array of all indices
    indices = np.arange(n_samples)

    # Randomly shuffle the indices
    # np.random.shuffle() modifies the array in place
    # This randomizes which samples go to train vs test
    np.random.shuffle(indices)

    # Split indices into train and test portions
    # First n_train indices go to training set
    train_indices = indices[:n_train]
    # Remaining indices go to test set
    test_indices = indices[n_train:]

    # Use the indices to select rows from X and y
    # X[train_indices] selects rows at those index positions
    X_train = X[train_indices]
    X_test = X[test_indices]
    y_train = y[train_indices]
    y_test = y[test_indices]

    return X_train, X_test, y_train, y_test


# Perform the split using our function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_seed=113)

# Verify the split worked correctly
print(f"Training set size: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"Test set size: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.0f}%)")
print(f"\nX_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Training set size: 80,000 samples (80%)
Test set size: 20,000 samples (20%)

X_train shape: (80000, 11)
X_test shape: (20000, 11)
y_train shape: (80000,)
y_test shape: (20000,)
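
Beyond checking shapes, it can be reassuring to confirm that a shuffled split has the properties we rely on: no sample lands in both sets, every sample lands in one of them, and the same seed gives the same split. The sketch below checks this on a tiny example. Note that `split_indices` is a hypothetical helper written for this illustration, and it swaps in `np.random.default_rng` (a local generator) in place of the global `np.random.seed` used above, so its shuffles will not match the notebook's.

```python
import numpy as np

def split_indices(n_samples, test_size=0.2, random_seed=113):
    """Hypothetical helper: return shuffled (train, test) index arrays."""
    # A local generator keeps the global NumPy random state untouched
    rng = np.random.default_rng(random_seed)
    indices = rng.permutation(n_samples)
    n_test = int(n_samples * test_size)
    n_train = n_samples - n_test
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(10)

# Property 1: no index appears in both sets
overlap = np.intersect1d(train_idx, test_idx)

# Property 2: together the two sets cover every index exactly once
covered = np.sort(np.concatenate([train_idx, test_idx]))

# Property 3: the same seed reproduces the same split
repeat_train, repeat_test = split_indices(10)

print(f"Overlap size: {overlap.size}")
print(f"Covers all samples: {np.array_equal(covered, np.arange(10))}")
print(f"Reproducible: {np.array_equal(train_idx, repeat_train)}")
```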


# Feature Scaling (Z-score normalization)

Feature scaling is critical for gradient descent to work properly.

Why scale features?
Our features have very different ranges:
- year: 1990-2022 (range of 32)
- odometer: 100-500,000 (range of 500,000)
- manufacturer: 0-40 (range of 40)

Without scaling, features with large values (such as odometer) produce much larger gradients than the rest, so gradient descent zigzags inefficiently or fails to converge.
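
To make the effect concrete, here is a minimal sketch on four made-up cars with only two features (year and odometer). It assumes the cost is the mean squared error with a `2/n` factor in the gradient, which is one common convention and may differ from the one used later in this notebook. `mse_gradient` and the toy values are hypothetical.

```python
import numpy as np

# Made-up values: [year, odometer] for four cars, and their prices
X_toy = np.array([[2010.0, 150000.0],
                  [2014.0,  90000.0],
                  [2018.0,  40000.0],
                  [2021.0,  10000.0]])
y_toy = np.array([5000.0, 11000.0, 19000.0, 27000.0])

def mse_gradient(X, y, w, b):
    """Gradient of mean squared error with respect to the weights (assumed 2/n form)."""
    errors = X @ w + b - y
    return (2.0 / len(y)) * (X.T @ errors)

w0 = np.zeros(2)  # start from all-zero weights

# Gradient on the raw features
grad_raw = mse_gradient(X_toy, y_toy, w0, 0.0)

# Gradient after z-score scaling each column
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
grad_std = mse_gradient(X_std, y_toy, w0, 0.0)

print("Raw gradient magnitudes:   ", np.abs(grad_raw))
print("Scaled gradient magnitudes:", np.abs(grad_std))
```

On the raw data the odometer component of the gradient is much larger than the year component, so a single learning rate cannot suit both weights. After scaling, the two components are of similar size.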

Z-score normalization transforms each feature to have:
- Mean = 0
- Standard deviation = 1

Formula: x_scaled = (x - mean) / std

IMPORTANT: We calculate mean and std from TRAINING data only!
Using test data statistics would be "data leakage" - the model would indirectly learn information about the test set during training.
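
A consequence that sometimes surprises people: because the test set is scaled with the training statistics, its mean and standard deviation after scaling will be close to 0 and 1 but not exactly equal to them. That is expected and not a bug. The sketch below shows this on synthetic odometer readings (randomly generated for illustration, not drawn from the real dataset).

```python
import numpy as np

rng = np.random.default_rng(113)

# Synthetic odometer readings, split 80/20 into train and test
odometer = rng.uniform(100, 300_000, size=1_000)
train, test = odometer[:800], odometer[800:]

# Statistics come from the training portion only
mean_train, std_train = train.mean(), train.std()

# Both portions are scaled with the same training statistics
train_scaled = (train - mean_train) / std_train
test_scaled = (test - mean_train) / std_train

print(f"Train: mean={train_scaled.mean():.4f}, std={train_scaled.std():.4f}")
print(f"Test:  mean={test_scaled.mean():.4f}, std={test_scaled.std():.4f}")
```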

In [7]:
def compute_scaling_params(X_train):
    """
    Compute mean and standard deviation for each feature from training data.

    Parameters:
    -----------
    X_train : numpy.ndarray
        Training feature matrix of shape (n_samples, n_features)

    Returns:
    --------
    means: numpy.ndarray
        Mean of each feature, shape (n_features,)
    stds : numpy.ndarray
        Standard deviation of each feature, shape (n_features,)
    """
    # np.mean() with axis=0 computes mean for each column (feature)
    # Results in a 1D array with one mean value per feature
    means = np.mean(X_train, axis=0)

    # np.std() with axis=0 computes standard deviation for each column
    # Result is a 1D array with one std value per feature
    stds = np.std(X_train, axis=0)

    return means, stds

def scale_features(X, means, stds):
    """
    Apply z-score normalization to features.

    Parameters:
    -----------
    X : numpy.ndarray
        Feature matrix to scale, shape (n_samples, n_features)
    means : numpy.ndarray
        Mean of each feature (from training data)
    stds : numpy.ndarray
        Standard deviation of each feature (from training data)

    Returns:
    --------
    X_scaled : numpy.ndarray
        Normalized feature matrix with mean=0, std=1 for each feature
    """
    # Apply z-score formula: (X - mean) / std
    # NumPy broadcasting handles the element-wise operations automatically
    # Each column has its mean subtracted, then is divided by its std
    X_scaled = (X - means) / stds

    return X_scaled

# Step 1: Compute scaling parameters from TRAINING data only
means, stds = compute_scaling_params(X_train)

# Display the computed parameters for each feature
print("Scaling parameters (computed from training data):\n")
print(f"{'Feature':<15} {'Mean':>15} {'Std':>15}")
print("-" * 47)
for i, name in enumerate(FEATURE_NAMES):
    print(f"{name:<15} {means[i]:>15.2f} {stds[i]:>15.2f}")

# Step 2: Apply scaling to both training and test data
# IMPORTANT: Use the same means and stds (from training) for both sets
X_train_scaled = scale_features(X_train, means, stds)
X_test_scaled = scale_features(X_test, means, stds)

# Verify scaling worked - training data should have mean=0 and std=1
print("\n--- Verification (Training Data After Scaling) ---")
print(f"Mean of each feature (should be ~0): {np.mean(X_train_scaled, axis=0).round(6)}")
print(f"Std of each feature (should be 1): {np.std(X_train_scaled, axis=0).round(6)}")


Scaling parameters (computed from training data):

Feature                    Mean             Std
-----------------------------------------------
year                    2012.32            5.78
manufacturer              18.20           11.47
condition                  3.09            2.44
cylinders                  6.00            1.92
fuel                       2.05            0.78
odometer               94285.02        63064.30
title_status               0.24            1.06
transmission               0.39            0.77
drive                      1.41            1.21
type                       7.15            4.12
state                     23.64           15.10

--- Verification (Training Data After Scaling) ---
Mean of each feature (should be ~0): [-0. -0. -0. -0.  0.  0.  0. -0. -0. -0.  0.]
Std of each feature (should be 1): [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
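
All eleven standard deviations above are nonzero, so the division in `scale_features` is safe for this dataset. If a feature were constant in the training set, its std would be 0 and the division would produce `NaN` or `inf`. The sketch below shows one possible guard; `scale_features_safe` and its `eps` threshold are hypothetical additions for illustration, not part of the notebook's pipeline.

```python
import numpy as np

def scale_features_safe(X, means, stds, eps=1e-12):
    """Hypothetical variant of scale_features that tolerates constant columns."""
    # Replace (near-)zero standard deviations with 1.0 so a constant
    # column is mapped to all zeros instead of NaN
    safe_stds = np.where(stds < eps, 1.0, stds)
    return (X - means) / safe_stds

# Tiny example: the second column is constant
X_demo = np.array([[1.0, 5.0],
                   [2.0, 5.0],
                   [3.0, 5.0]])

scaled = scale_features_safe(X_demo, X_demo.mean(axis=0), X_demo.std(axis=0))
print(scaled)
```

A constant feature carries no information for a linear model, so mapping it to zeros (or dropping it entirely) does not change what the model can learn.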
