# TensorFlow Linear Regression

### This implementation uses TensorFlow/Keras to build a linear regression model

### Goal: Predict used car prices and compare with No-Framework, Scikit-Learn, and PyTorch

What TensorFlow/Keras provides (that we built manually in No-Framework):
- `tf.keras.Sequential`: High-level API for building models layer by layer
- `tf.keras.layers.Dense`: Fully connected layer (replaces manual weights + bias)
- `tf.keras.losses.MeanSquaredError`: Pre-built loss function
- `tf.keras.optimizers.SGD`: Optimizer that handles parameter updates
- `model.fit()`: Complete training loop in one line

Key Concept - Keras vs Raw TensorFlow:
- TensorFlow 2.x uses Keras as its high-level API
- Keras abstracts away the computational graph complexity
- Similar to PyTorch's nn.Module, but with even simpler syntax via Sequential API


In [1]:
# tensorflow: The main TensorFlow Library
import tensorflow as tf

# numpy: Still needed for initial data handling
import numpy as np

# pandas: For loading CSV data
import pandas as pd

# matplotlib: for visualizations
import matplotlib.pyplot as plt

# os: File path handling
import os

# Sklearn utilities: Using these for consistency with previous implementations
# This ensures identical train/test splits and scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Performance tracking
import time
import tracemalloc
import platform

# Set random seed for reproducibility
RANDOM_SEED = 113
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

print("All Imports successful!")
print(f"TensorFlow version: {tf.__version__}")
print(f"Random seed set to: {RANDOM_SEED}")

All Imports successful!
TensorFlow version: 2.20.0
Random seed set to: 113


# Load Cleaned Data

- Load the same pre-processed dataset used in No-Framework, Scikit-Learn, and PyTorch
- Using pandas for consistency with the Scikit-Learn implementation
- This ensures fair comparison across all frameworks

In [2]:
# Define path to our cleaned dataset
DATA_PATH = os.path.join('..', '..', 'data', 'processed', 'vehicles_clean.csv')

# Load data using pandas
df = pd.read_csv(DATA_PATH)

# Verify data loaded correctly
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nFirst 3 rows:")
print(df.head(3))

Dataset shape: (100000, 12)
Columns: ['price', 'year', 'manufacturer', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'state']

First 3 rows:
   price    year  manufacturer  condition  cylinders  fuel  odometer  \
0  29990  2014.0             7          2          6     2   26129.0   
1   6995  2006.0            12          0          6     2  198947.0   
2   4995  2009.0            35          6          8     2  152794.0   

   title_status  transmission  drive  type  state  
0             0             2      0     8     17  
1             6             0      3    10      5  
2             0             0      3    11     22  


# Separate Features and Target

- Price is our TARGET variable 
- All other columns are FEATURES
- Same separation as the Scikit-Learn implementation

In [3]:
# Define target and feature columns (same as all previous implementations)
TARGET_COLUMN = 'price'
FEATURE_COLUMNS = [ 'year', 'manufacturer', 'condition', 'cylinders',
                    'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'state']

# Extract target (y) and features (X) as numpy arrays
y = df[TARGET_COLUMN].values
X = df[FEATURE_COLUMNS].values

# Store feature names for later use (displaying learned weights)
FEATURE_NAMES = FEATURE_COLUMNS

# Verify shapes 
print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nFeature Names: {FEATURE_NAMES}")

Features (X) shape: (100000, 11)
Target (y) shape: (100000,)

Feature Names: ['year', 'manufacturer', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'state']


# Train/Test Split

- Using sklearn's `train_test_split` for consistency with the Scikit-Learn implementation
- Same 80/20 split, same random seed (113)
- This ensures we're comparing apples to apples across frameworks
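
As a rough illustration of why a fixed seed makes the split repeatable, here is a minimal NumPy-only sketch. The helper `split_indices` is hypothetical and is not how `train_test_split` is implemented internally, so it will not select the same rows as sklearn; it only shows that the same seed yields the same partition every time.

```python
import numpy as np

def split_indices(n_samples, test_fraction, seed):
    """Shuffle row indices with a seeded generator and cut off a test slice."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    # First n_test shuffled indices become the test set, the rest the training set
    return order[n_test:], order[:n_test]

# Two independent calls with the same seed (113, as in this notebook)
train_a, test_a = split_indices(100_000, 0.2, seed=113)
train_b, test_b = split_indices(100_000, 0.2, seed=113)

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b))
```

Changing the seed in one framework but not the others would silently give each model different test rows, which is the situation the shared seed is meant to avoid.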

In [4]:
# Split data using sklearn (same as Scikit-Learn and PyTorch Implementations)
X_train, X_test, y_train, y_test = train_test_split(
    X,                          # Features to split
    y,                          # Target to split
    test_size=0.2,              # 20% for testing, 80% for training
    random_state=RANDOM_SEED    # Seed 113 for reproducibility
)

# Verify split sizes
print(f"Training set size: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"Test set size: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.0f}%)")
print(f"\nX_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Training set size: 80,000 samples (80%)
Test set size: 20,000 samples (20%)

X_train shape: (80000, 11)
X_test shape: (20000, 11)
y_train shape: (80000,)
y_test shape: (20000,)


# Feature Scaling (Z-Score Normalization)

- Using sklearn's StandardScaler for consistency with Scikit-Learn implementation
- Same z-score normalization
- Fit on training data only, transform both train and test
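
The "fit on train, transform both" rule can be sketched directly in NumPy. The arrays below are synthetic stand-ins (assumptions, not the vehicle data); the point is that the test set is scaled with the training mean and standard deviation, so it ends up close to, but not exactly, mean 0 and std 1.

```python
import numpy as np

# Synthetic stand-in data (hypothetical): 3 features drawn from the same distribution
rng = np.random.default_rng(113)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(800, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# "fit": statistics come from the training data only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": the same statistics are applied to both sets
X_train_z = (X_train_demo - mean) / std
X_test_z = (X_test_demo - mean) / std

print(np.allclose(X_train_z.mean(axis=0), 0.0))  # training mean is 0 by construction
print(np.allclose(X_train_z.std(axis=0), 1.0))   # training std is 1 by construction
print(X_test_z.mean(axis=0))                     # near 0, but not exactly 0
```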

In [5]:
# Create and fit scaler on training data (same as Scikit-Learn)
scaler = StandardScaler()

# fit_transform on training data: calculates mean/std AND applies scaling
X_train_scaled = scaler.fit_transform(X_train)

# transform on test data: uses mean/std from training (no data leakage)
X_test_scaled = scaler.transform(X_test)

# Display the learned scaling parameters
print("Scaling parameters (computed from training data):\n")
print(f"{'Feature':<15} {'Mean':>15} {'Std':>15}")
print("-" * 47)
for i, name in enumerate(FEATURE_NAMES):
    print(f"{name:<15} {scaler.mean_[i]:>15.2f} {scaler.scale_[i]:>15.2f}")

# Verify scaling worked - training data should have mean=0 and std=1
print("\n--- Verification (Training Data After Scaling) ---")
print(f"Mean of each feature (should be 0): {np.mean(X_train_scaled, axis=0).round(6)}")
print(f"Std of each feature (should be 1): {np.std(X_train_scaled, axis=0).round(6)}")

Scaling parameters (computed from training data):

Feature                    Mean             Std
-----------------------------------------------
year                    2012.32            5.79
manufacturer              18.24           11.48
condition                  3.09            2.43
cylinders                  6.00            1.92
fuel                       2.05            0.78
odometer               94235.84        62977.76
title_status               0.23            1.06
transmission               0.39            0.77
drive                      1.40            1.21
type                       7.14            4.12
state                     23.60           15.10

--- Verification (Training Data After Scaling) ---
Mean of each feature (should be 0): [ 0.  0.  0. -0.  0. -0. -0.  0.  0. -0. -0.]
Std of each feature (should be 1): [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


# Start Performance Tracking

- Begin measuring time and memory BEFORE model initialization
- This captures model creation, compilation, and training
- Matches the timing approach used in No-Framework and PyTorch
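
For reference, the measurement pattern on its own looks roughly like the sketch below, with a throwaway list allocation standing in for model creation and training (the workload is a hypothetical placeholder). One caveat worth keeping in mind: `tracemalloc` only traces memory allocated through Python's allocators, so memory that TensorFlow's native runtime allocates for tensors is probably not included in the peak figure reported later.

```python
import time
import tracemalloc

# Start tracing Python-level allocations and start the clock
tracemalloc.start()
start = time.perf_counter()

# Stand-in workload (hypothetical): allocate a list of Python floats
data = [float(i) for i in range(200_000)]

# Stop the clock and read (current, peak) traced memory in bytes
elapsed = time.perf_counter() - start
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Elapsed: {elapsed:.4f} s")
print(f"Peak traced memory: {peak_bytes / 1024 / 1024:.2f} MB")
```

This sketch uses `time.perf_counter()`, which is a monotonic clock intended for interval timing; the notebook itself uses `time.time()` to stay consistent with the other implementations.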

In [6]:
# Start memory tracking
tracemalloc.start()

# Record start time
start_time = time.time()

print("Performance tracking started...")
print(" - Memory tracking: ACTIVE")
print(" - Time: STARTED")

Performance tracking started...
 - Memory tracking: ACTIVE
 - Time: STARTED


# Define the Model

TensorFlow uses Keras as its high-level API. For linear regression, we use:
- `tf.keras.Sequential`: Container for stacking layers linearly
- `tf.keras.layers.Dense`: Fully connected layer (like PyTorch's nn.Linear)

| TensorFlow/Keras | PyTorch Equivalent | No-Framework Equivalent |
|------------------|-------------------|------------------------|
| `Sequential([Dense(1)])` | `nn.Linear(11, 1)` | `weights = np.zeros(11)` + `bias = 0` |
| `model(X)` | `model(X)` | `X @ weights + bias` |
| `model.get_weights()` | `model.parameters()` | Manual weight/bias variables |
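
To make the middle row of the table concrete, here is a small NumPy sketch of what a single `Dense(1)` layer computes on 11 inputs. The input and kernel values are made up for illustration; only the shapes mirror the model in this notebook.

```python
import numpy as np

rng = np.random.default_rng(113)

X_demo = rng.normal(size=(4, 11))        # 4 samples, 11 features (synthetic)
W = rng.normal(scale=0.1, size=(11, 1))  # kernel: one column per output unit
b = np.zeros(1)                          # bias: one value per output unit

# Forward pass of a linear layer: matrix product plus broadcast bias
y_hat = X_demo @ W + b

print(y_hat.shape)  # (4, 1): one prediction per sample, kept as a column
```

Note that the kernel is 2-D with shape `(11, 1)`, unlike the 1-D `np.zeros(11)` weight vector in the No-Framework column, so predictions come back as a column rather than a flat array.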


In [7]:
# Define the model

# tf.keras.Sequential creates a model by stacking layers in order
# For linear regression, we need just one Dense layer with 1 output

# Dense layer parameters:
# - units=1: One output
# - Input(shape=(11,)): 11 input features, declared via a separate Input layer
# - Automatically creates weights (11, 1) and bias (1,)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(11,)),
    tf.keras.layers.Dense(units=1)
])

# Display model architecture
print("Model Architecture:")
model.summary()

# View initial weights (kernel is randomly initialized by default; bias starts at zero)
# get_weights() returns [weights_array, bias_array]
weights, bias = model.get_weights()
print(f"\nInitial Weights shape: {weights.shape}")
print(f"Initial Bias shape: {bias.shape}")
print(f"\nInitial weights (first 5): {weights[:5, 0]}")
print(f"Initial Bias: {bias}")

Model Architecture:



Initial Weights shape: (11, 1)
Initial Bias shape: (1,)

Initial weights (first 5): [-0.27453867  0.49492913 -0.21581462 -0.04182166  0.44886726]
Initial Bias: [0.]


# Compile the Model

TensorFlow requires a separate `compile()` step before training. This configures:
- **Loss function**: What to minimize (MSE for regression)
- **Optimizer**: How to minimize it (SGD with learning rate 0.01)

| TensorFlow | PyTorch Equivalent |
|------------|-------------------|
| `model.compile(loss, optimizer)` | Separate `criterion` and `optimizer` objects |
| Configured once before training | Used explicitly in training loop |
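
What the loss and optimizer configured here amount to can be sketched as one hand-written update step. This is a minimal NumPy version of mean squared error and a plain gradient-descent step on synthetic data; the helper names `mse` and `sgd_step` are hypothetical, and Keras's actual implementation differs in its details.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return np.mean((y_pred - y_true) ** 2)

def sgd_step(X, y, w, b, learning_rate=0.01):
    """One gradient-descent update for a linear model under MSE loss."""
    n = len(y)
    error = X @ w + b - y               # residuals, shape (n,)
    grad_w = (2.0 / n) * (X.T @ error)  # dLoss/dw
    grad_b = 2.0 * error.mean()         # dLoss/db
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Synthetic regression problem with known coefficients (illustrative only)
rng = np.random.default_rng(113)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0

w, b = np.zeros(3), 0.0
loss_before = mse(y_demo, X_demo @ w + b)
w, b = sgd_step(X_demo, y_demo, w, b)
loss_after = mse(y_demo, X_demo @ w + b)

print(loss_before > loss_after)  # a single small step lowers the loss
```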


In [8]:
# Compile the model

# model.compile() configures the model for training
# This is required BEFORE calling model.fit()

# Parameters:
# - Optimizer: SGD with learning_rate=0.01
# - loss: Mean Squared Error

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
    loss='mse'
)

print("Model compiled successfully!")
print(" Optimizer: SGD (learning_rate=0.01)")
print(" Loss function: Mean Squared Error (MSE)")

Model compiled successfully!
 Optimizer: SGD (learning_rate=0.01)
 Loss function: Mean Squared Error (MSE)


# Train the Model

TensorFlow's `model.fit()` handles the ENTIRE training loop in one line:
- Forward pass
- Loss computation
- Backward pass (gradient computation)
- Weight updates

| TensorFlow | PyTorch Equivalent | No-Framework Equivalent |
|------------|-------------------|------------------------|
| `model.fit(X, y, epochs=1000)` | Manual for loop (1000 iterations) | Manual for loop (1000 iterations) |
| ~1 line | ~15 lines | ~30 lines |

The `history` object returned by `fit()` contains loss values for plotting.
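
One detail that affects the comparison with a manual loop: an epoch is one pass over the data, and the number of weight updates per epoch depends on the batch size. The arithmetic below is a small sketch (the helper `updates_per_epoch` is hypothetical) contrasting the full-batch setting used in the training cell with Keras's default `batch_size` of 32.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)

n_train = 80_000  # training rows in this notebook
epochs = 1000

full_batch = updates_per_epoch(n_train, batch_size=n_train)  # batch_size=len(X_train)
mini_batch = updates_per_epoch(n_train, batch_size=32)       # Keras default

print(full_batch * epochs)  # 1000 updates in total
print(mini_batch * epochs)  # 2500000 updates in total
```

With the full-batch setting, 1000 epochs therefore correspond to 1000 gradient updates, which is what lines up with a manual loop of 1000 iterations.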


In [None]:
# Train the model

# model.fit() performs the entire training loop:
#   1. Forward pass (predictions)
#   2. Loss computation (MSE)
#   3. Backward pass (gradients via automatic differentiation)
#   4. Weight updates (SGD optimizer step)
# All in one line of code

# Parameters:
#   - X_train_scaled: Input features (scaled)
#   - y_train: Target values
#   - epochs: Number of training iterations (1000, same as previous frameworks)
#   - batch_size: Full training set in one batch, so one gradient update per epoch
#   - verbose: 0=silent, 1=progress bar, 2= one line per epoch
#   - shuffle: False to match our other implementations

print("Training Linear Regression Model...")

# Train the model - this replaces the entire manual training loop
history = model.fit(
    X_train_scaled,         # Input features
    y_train,                # Target values
    epochs=1000,            # Same as previous frameworks
    batch_size=len(X_train),  # Full batch per epoch
    verbose=0,              # Silent training - printing summary after
    shuffle=False           # Don't shuffle to match other frameworks
)

# STOP Performance tracking
end_time = time.time()
current_mem, peak_mem = tracemalloc.get_traced_memory()

# Calculate performance metrics
training_time = end_time - start_time
peak_memory_mb = peak_mem / 1024 / 1024

print("=" * 50)
print("Training Complete!")
print(f"  Initial Loss: {history.history['loss'][0]:,.2f}")
print(f"  Final Loss:   {history.history['loss'][-1]:,.2f}")
print(f"\n--- Training Performance ---")
print(f"  Training time: {training_time:.4f} seconds")
print(f"  Peak memory:   {peak_memory_mb:.2f} MB")

Training Linear Regression Model...
Training Complete!
  Initial Loss: 569,614,528.00
  Final Loss:   101,652,944.00

--- Training Performance ---
  Training time: 24.6862 seconds
  Peak memory:   8.38 MB
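
The loss values above are mean squared errors, so they are in squared price units and hard to read directly. A quick way to interpret the final loss, assuming it is the training MSE as reported, is to take its square root (RMSE), which is in the same units as `price`.

```python
import math

# "Final Loss" copied from the training output above (training-set MSE)
final_mse = 101_652_944.00

# RMSE is the square root of MSE and is in the same units as the target
rmse = math.sqrt(final_mse)

print(f"Training RMSE: {rmse:,.2f}")
```

This works out to roughly 10,082 in price units. It describes the training set only, not the held-out test set.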
