# PyTorch Linear Regression

### This implementation uses PyTorch to build a linear regression model

### Goal: Predict used car prices and compare with No-Framework and Scikit-Learn

What PyTorch provides (that we built manually in No-Framework):
- `torch.Tensor`: GPU-compatible arrays that track gradients automatically
- `torch.nn.Linear`: Encapsulates weights and bias in a single layer
- `torch.nn.MSELoss`: Pre-built loss function (replaces our manual compute_cost)
- `torch.optim.SGD`: Optimizer that handles parameter updates (replaces manual gradient descent)
- `autograd`: Automatic differentiation - computes gradients via `.backward()`
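How these pieces fit together in a single training loop can be sketched as follows. This is a minimal illustration on made-up toy data, not the notebook's actual training code: the names `X_toy`, `y_toy`, `true_w`, the learning rate, and the epoch count are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy data: 64 samples, 3 features, generated from a known linear rule
X_toy = torch.randn(64, 3)
true_w = torch.tensor([[2.0], [-1.0], [0.5]])
y_toy = X_toy @ true_w + 0.3

model = nn.Linear(in_features=3, out_features=1)    # weights and bias in one layer
criterion = nn.MSELoss()                             # mean squared error loss
optimizer = optim.SGD(model.parameters(), lr=0.1)    # plain gradient descent updates

losses = []
for epoch in range(200):
    optimizer.zero_grad()                  # clear gradients left over from the previous step
    predictions = model(X_toy)             # forward pass
    loss = criterion(predictions, y_toy)   # compare predictions with targets
    loss.backward()                        # autograd computes d(loss)/d(parameters)
    optimizer.step()                       # apply the parameter update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

Each of the four calls inside the loop corresponds to a step that was written by hand in No-Framework: prediction, cost, gradient, and update.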

Key Concept - Autograd:
- In No-Framework, we manually computed gradients
- In PyTorch, we just call `loss.backward()` and gradients are computed automatically
- This is the foundation of modern deep learning - same math, zero manual calculus
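To make the "same math" claim concrete, the sketch below compares autograd's gradients with hand-derived ones for a mean-squared-error loss. The toy tensors and variable names are hypothetical, and the manual formula assumes the `1/n` scaling used by `nn.MSELoss`; the No-Framework notebook may scale its cost differently (for example by `1/(2n)`), which would change the gradients by a constant factor.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: 5 samples, 2 features
X_toy = torch.randn(5, 2)
y_toy = torch.randn(5, 1)

# Parameters that autograd should track
w = torch.zeros(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Forward pass and MSE loss, then let autograd compute the gradients
predictions = X_toy @ w + b
loss = ((predictions - y_toy) ** 2).mean()
loss.backward()

# The same gradients derived by hand for loss = mean((Xw + b - y)^2)
with torch.no_grad():
    n = X_toy.shape[0]
    errors = predictions - y_toy
    manual_grad_w = (2.0 / n) * X_toy.T @ errors
    manual_grad_b = (2.0 / n) * errors.sum()

print(torch.allclose(w.grad, manual_grad_w))
print(torch.allclose(b.grad, manual_grad_b))
```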

In [1]:
# torch: The main PyTorch library for tensor operations and neural networks
import torch

# torch.nn: Neural network module containing layers, loss functions, etc.
# We import it as 'nn' for shorter syntax
import torch.nn as nn

# torch.optim: Optimization algorithms (SGD, Adam, etc.)
# These handle the weight updates we did manually in No-Framework
import torch.optim as optim

# numpy: Still needed for initial data handling before converting to tensors
import numpy as np

# pandas: For loading CSV data
import pandas as pd

# matplotlib: for visualizations
import matplotlib.pyplot as plt

# os: File path handling
import os

# Sklearn utilities: Using these for consistency with the Scikit-Learn implementation
# This ensures identical train/test splits and scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Performance tracking
import time
import tracemalloc
import platform

# Set random seeds for reproducibility
# We set seeds for BOTH numpy and torch to ensure consistent results
RANDOM_SEED = 113
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

print("All imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"Random seed set to : {RANDOM_SEED}")

All imports successful!
PyTorch version: 2.10.0+cpu
Random seed set to : 113


# Load Cleaned Data

- Load the same pre-processed dataset used in the No-Framework (NF) and Scikit-Learn (SL) implementations
- Using pandas for consistency with the SL implementation
- This ensures fair comparison across all frameworks

In [2]:
# Define path to our cleaned dataset
DATA_PATH = os.path.join('..', '..', 'data', 'processed', 'vehicles_clean.csv')

# Load data using pandas
df = pd.read_csv(DATA_PATH)

# Verify data loaded correctly
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print("\nFirst 3 rows:")
print(df.head(3))

Dataset shape: (100000, 12)
Columns: ['price', 'year', 'manufacturer', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'state']

First 3 rows:
   price    year  manufacturer  condition  cylinders  fuel  odometer  \
0  29990  2014.0             7          2          6     2   26129.0   
1   6995  2006.0            12          0          6     2  198947.0   
2   4995  2009.0            35          6          8     2  152794.0   

   title_status  transmission  drive  type  state  
0             0             2      0     8     17  
1             6             0      3    10      5  
2             0             0      3    11     22  


# Separate Features and Target

- Price is our TARGET variable 
- All other columns are FEATURES
- Same separation as Scikit-Learn implementation
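Since every column other than the target is a feature, the feature list can also be derived from the DataFrame rather than typed out. The sketch below shows that alternative on a hypothetical three-row frame (`toy_df`) using a subset of the columns and the values from the preview above; it is an illustration, not the approach the notebook's cell takes.

```python
import pandas as pd

# Hypothetical miniature of the dataset: target plus two of the feature columns
toy_df = pd.DataFrame({
    "price": [29990, 6995, 4995],
    "year": [2014.0, 2006.0, 2009.0],
    "odometer": [26129.0, 198947.0, 152794.0],
})

target_column = "price"

# Everything that is not the target is treated as a feature
feature_columns = [col for col in toy_df.columns if col != target_column]

y_toy = toy_df[target_column].to_numpy()      # 1-D target array
X_toy = toy_df[feature_columns].to_numpy()    # 2-D feature matrix

print(feature_columns)
print(X_toy.shape, y_toy.shape)
```

Listing the columns explicitly, as the notebook does, has the advantage of fixing the feature order regardless of how the CSV was written.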

In [3]:
# Define target and feature columns (same as Scikit-Learn)
TARGET_COLUMN = 'price'
FEATURE_COLUMNS = [ 'year', 'manufacturer', 'condition', 'cylinders',
                    'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'state']

# Extract target (y) and features (X) as numpy arrays
y = df[TARGET_COLUMN].values
X = df[FEATURE_COLUMNS].values

# Store feature names for later use (displaying learned weights)
FEATURE_NAMES = FEATURE_COLUMNS

# Verify shapes 
print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nFeature Names: {FEATURE_NAMES}")

Features (X) shape: (100000, 11)
Target (y) shape: (100000,)

Feature Names: ['year', 'manufacturer', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'state']


# Train/Test Split

- Using Sklearn's train_test_split for consistency with Scikit-learn implementation
- Same 80/20 split, same random seed (113)
- This ensures we're comparing apples-to-apples across frameworks
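The reason a shared seed gives a fair comparison is that `train_test_split` is deterministic for a fixed `random_state` and identical input. A small sketch on hypothetical toy arrays (`X_toy`, `y_toy`) illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Two independent calls with the same seed and the same 80/20 ratio
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=113)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=113)

# Each call returns (X_train, X_test, y_train, y_test); compare them pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(same)
print(split_a[0].shape, split_a[1].shape)
```

This only holds when the rows arrive in the same order in each notebook, which is why all implementations load the same cleaned CSV.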

In [4]:
# Split data using sklearn (same as Scikit-Learn)
X_train, X_test, y_train, y_test = train_test_split(
    X,                          # Features to split
    y,                          # Target to split
    test_size=0.2,              # 20% for testing, 80% for training
    random_state=RANDOM_SEED    # Seed 113 for reproducibility
)

# Verify split sizes
print(f"Training set size: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"Test set size: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.0f}%)")
print(f"\nX_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Training set size: 80,000 samples (80%)
Test set size: 20,000 samples (20%)

X_train shape: (80000, 11)
X_test shape: (20000, 11)
y_train shape: (80000,)
y_test shape: (20000,)
