# AutoEncoder CUDA - Quick Start Guide

Run this notebook on a **GPU runtime** (Runtime -> Change runtime type -> T4 GPU).

*Note: enable High RAM mode if you plan to train the SVM, evaluate, or run the full pipeline.*
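
Before running the setup cells, it can help to confirm that the runtime actually has a GPU attached. The sketch below is not part of the repository; it only assumes that `nvidia-smi` is on the `PATH` when a GPU runtime is active, which is the usual case on Colab.

```python
import shutil
import subprocess


def gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on PATH: likely a CPU runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    names = result.stdout.strip().splitlines()
    return names[0].strip() if names else None


name = gpu_name()
print(f"GPU detected: {name}" if name else "No GPU detected; switch the runtime type first.")
```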

## 1. Setup

In [1]:
# Clone repository
import os

repos = "https://github.com/QuackPhuc/AutoEncoder-CUDA.git"

if not os.path.exists('/content/AutoEncoder-CUDA'):
    !git clone --recursive {repos}

%cd /content/AutoEncoder-CUDA
!chmod +x scripts/*.sh build.sh run.sh

Cloning into 'AutoEncoder-CUDA'...
remote: Enumerating objects: 171, done.
remote: Counting objects: 100% (171/171), done.
remote: Compressing objects: 100% (137/137), done.
remote: Total 171 (delta 51), reused 152 (delta 32), pack-reused 0 (from 0)
Receiving objects: 100% (171/171), 147.13 KiB | 18.39 MiB/s, done.
Resolving deltas: 100% (51/51), done.
Submodule 'external/thundersvm' (https://github.com/Xtra-Computing/thundersvm.git) registered for path 'external/thundersvm'
Cloning into '/content/AutoEncoder-CUDA/external/thundersvm'...
remote: Enumerating objects: 7469, done.        
remote: Counting objects: 100% (93/93), done.        
remote: Compressing objects: 100% (21/21), done.        
remote: Total 7469 (delta 74), reused 72 (delta 72), pack-reused 7376 (from 2)        
Receiving objects: 100% (7469/7469), 4.88 MiB | 19.13 MiB/s, done.
Resolving deltas: 100% (4997/4997), done.
Submodule path 'external/thundersvm': checked out '5c6a056ac7f474b085d5415c81c5d48a14196
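
The `--recursive` flag matters because ThunderSVM is pulled in as a submodule under `external/thundersvm` (the path shown in the clone output). If the repository was cloned without it, that directory typically exists but is empty. A minimal check, assuming the `/content/AutoEncoder-CUDA` location used above:

```python
from pathlib import Path


def submodule_ready(repo_root, submodule="external/thundersvm"):
    """True if the submodule directory exists and contains at least one entry."""
    path = Path(repo_root) / submodule
    return path.is_dir() and any(path.iterdir())


if not submodule_ready("/content/AutoEncoder-CUDA"):
    print("ThunderSVM submodule missing; run: git submodule update --init --recursive")
```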

In [2]:
# Download dataset & Build
!scripts/download_cifar10.sh
!./build.sh --clean

[download] CIFAR-10 Dataset
  Downloading (162 MB)...
  Extracting...
[OK] Dataset ready
-- The CXX compiler identification is GNU 11.4.0
-- The CUDA compiler identification is NVIDIA 12.5.82 with host compiler GNU 11.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/targets/x86_64-linux/include (found version "12.5.82")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
  Compatibility with CMake < 3.10 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value.  

---
## 2. Usage Options

| Command | Description |
|---------|-------------|
| `train-autoencoder` | Train autoencoder only |
| `train-svm` | Train SVM with existing encoder weights |
| `evaluate` | Evaluate with pre-trained weights |
| `pipeline` | Full pipeline: train autoencoder -> train SVM -> evaluate (default) |

### Option A: Evaluate with Pre-trained Weights (Fast)

First, download the pre-trained weights from Google Drive:

In [3]:
# Download pretrained weights (encoder + SVM)
!./scripts/download_weights.sh


 AutoEncoder CUDA - Download Weights
[download] All pretrained weights

--- Encoder Weights ---
  Downloading encoder.weights...
  [OK] encoder.weights downloaded (2.9MiB)

--- SVM Model ---
  Downloading svm.bin...
  [OK] svm.bin downloaded (3.3GiB)

 Download complete!
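
Because `svm.bin` is about 3.3 GiB, the download can fail on a nearly full disk. The sketch below checks free space beforehand and lists checkpoint sizes afterwards. It assumes the files land in `./checkpoints/`, which is where the evaluate log in this guide reads them from; the 4 GiB threshold is an arbitrary safety margin, not a value taken from the repository.

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3


def free_gib(path="."):
    """Free disk space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / GIB


def checkpoint_sizes(directory="./checkpoints"):
    """Map each file in `directory` to its size in MiB (empty dict if the directory is absent)."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    return {p.name: p.stat().st_size / 1024 ** 2 for p in sorted(root.iterdir()) if p.is_file()}


if free_gib() < 4:  # assumed margin above the 3.3 GiB SVM model
    print("Less than 4 GiB free; the SVM model download may not fit.")
for name, mib in checkpoint_sizes().items():
    print(f"{name}: {mib:.1f} MiB")
```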


In [4]:
# Evaluate with downloaded weights
!./run.sh evaluate

[evaluate] device=gpu | epochs=20 | version=v2
  encoder: ./checkpoints/encoder.weights
  svm:     ./checkpoints/svm.bin

=== Evaluating ===
=== Inference Pipeline ===
Encoder: ./checkpoints/encoder.weights
SVM:     ./checkpoints/svm.bin (pre-trained)
GPU:     GPU Opt v2 (im2col+GEMM)

Loading CIFAR-10...
Train: 50000 images
Test:  10000 images
Extracting features...
GPU: Tesla T4 (15095 MB)
Extracting 10000 images (batch=128)... done.
Feature extraction: 1.4s

Loading SVM model: ./checkpoints/svm.bin
Evaluating on test set...

Overall Accuracy: 65.57%

Per-Class Accuracy:
          Class    Accuracy    Count
-------------------------------------
       airplane      69.30%      1000
     automobile      76.10%      1000
           bird      50.50%      1000
            cat      48.70%      1000
           deer      61.10%      1000
            dog      55.00%      1000
           frog      76.80%      1000
          horse      68.80%      1000
           ship      78.90%      1000
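
To work with the per-class numbers programmatically, the table can be parsed from the captured output. This is a sketch that assumes the three-column `class accuracy% count` layout printed above; the sample text is copied from that output and covers only the rows shown.

```python
import re

# One table row: class name, accuracy with a percent sign, image count.
ROW = re.compile(r"^\s*(\w+)\s+(\d+(?:\.\d+)?)%\s+(\d+)\s*$")


def parse_per_class(text):
    """Return {class_name: accuracy_percent} for every table row found in `text`."""
    accuracies = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            accuracies[match.group(1)] = float(match.group(2))
    return accuracies


sample = """
       airplane      69.30%      1000
     automobile      76.10%      1000
           bird      50.50%      1000
            cat      48.70%      1000
           deer      61.10%      1000
            dog      55.00%      1000
           frog      76.80%      1000
          horse      68.80%      1000
           ship      78.90%      1000
"""

scores = parse_per_class(sample)
weakest = sorted(scores, key=scores.get)[:3]
print("Lowest-accuracy classes:", weakest)
```

Among the rows shown, the weakest classes are cat, bird and dog.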
   

### Option B: Train Autoencoder Only

In [5]:
# Quick test: 5 epochs, 1000 samples
!./run.sh train-autoencoder --epochs 5 --samples 1000

[train-autoencoder] device=gpu | epochs=5 | version=v2
  output: ./checkpoints/encoder_20251214_130522.weights

=== Training Autoencoder ===
AutoEncoder CUDA | GPU Opt v2 (im2col+GEMM)
Epochs: 5 | Batch: 64 | Samples: 1000
Train: 1000 images (limited)
Test:  10000 images
GPU: Tesla T4 (15095 MB)

Training: 5 epochs, 15 batches/epoch
  Epoch  1/5 | Loss: 0.103639 | 0.9s
  Epoch  2/5 | Loss: 0.084700 | 0.9s
  Epoch  3/5 | Loss: 0.073234 | 0.9s
  Epoch  4/5 | Loss: 0.065620 | 0.9s
  Epoch  5/5 | Loss: 0.060204 | 0.9s

 Performance Metrics: GPU Opt v2 (im2col+GEMM)
Training Time:     4.62 sec
Time per Epoch:    0.92 sec
Final Loss:        0.06
GPU Memory Used:   1.2 GB

Model saved: ./checkpoints/encoder_20251214_130522.weights
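
The `15 batches/epoch` figure follows from the sample count and batch size in the log: 1000 samples at batch size 64 gives 15 full batches, with the remaining 40 samples apparently left out of each epoch. That the trainer drops the final partial batch is an inference from the logged numbers (the full run reports 781 batches for 50000 images), not something confirmed from the source code.

```python
def full_batches(samples, batch_size):
    """Number of complete batches, and the leftover samples that do not fill one."""
    return samples // batch_size, samples % batch_size


for samples in (1000, 50000):
    batches, leftover = full_batches(samples, 64)
    print(f"{samples} samples -> {batches} batches/epoch, {leftover} samples left over")
```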



In [6]:
# Full training: 20 epochs, all samples (~20 minutes)
# !./run.sh train-autoencoder --epochs 20

### Option C: Train SVM (requires trained encoder)

In [7]:
# Train SVM using default encoder weights (download weights first)
!./run.sh train-svm

# Or specify custom encoder weights:
# !./run.sh train-svm --encoder-weights ./checkpoints/gpu_opt_v2.weights

[train-svm] device=gpu | epochs=20 | version=v2
  input encoder: ./checkpoints/encoder.weights
  output svm:    ./checkpoints/svm_20251214_130527.bin

=== Training SVM ===
=== Inference Pipeline ===
Encoder: ./checkpoints/encoder.weights
SVM:     ./checkpoints/svm_20251214_130527.bin (will train)
GPU:     GPU Opt v2 (im2col+GEMM)

Loading CIFAR-10...
Train: 50000 images
Test:  10000 images
Extracting features...
GPU: Tesla T4 (15095 MB)
Extracting 50000 images (batch=128)... done.
Extracting 10000 images (batch=128)... done.
Feature extraction: 7.3s

Training SVM (ThunderSVM GPU)...
2025-12-14 13:05:39,378 INFO [default] #instances = 50000, #features = 8192
2025-12-14 13:05:41,052 INFO [default] #classes = 10
2025-12-14 13:05:43,430 INFO [default] total memory size is 0.915863 max mem size is 8
2025-12-14 13:05:43,430 INFO [default] free mem is 7.08414
2025-12-14 13:05:43,430 INFO [default] working set size = 1024
2025-12-14 13:05:43,431 INFO [default] training start
2025-12-14 13:05:4
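
The ThunderSVM log reports 50000 instances with 8192 features each, which gives a sense of why the High RAM runtime is recommended for this step. The estimate below assumes 4-byte `float32` storage for a dense feature matrix; the actual in-memory format used by the pipeline and by ThunderSVM may differ.

```python
def dense_matrix_gib(rows, cols, bytes_per_value=4):
    """Size in GiB of a dense rows x cols matrix (float32 by default, an assumption)."""
    return rows * cols * bytes_per_value / 1024 ** 3


train = dense_matrix_gib(50_000, 8192)
test = dense_matrix_gib(10_000, 8192)
print(f"train features: {train:.2f} GiB, test features: {test:.2f} GiB")
```

Under that assumption the training features alone take about 1.5 GiB, and any copies made during training add to this.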

### Option D: Full Pipeline

In [8]:
# Train autoencoder -> Train SVM -> Evaluate
!./run.sh pipeline --epochs 20

[pipeline] device=gpu | epochs=20 | version=v2
  output encoder: ./checkpoints/encoder_20251214_131115.weights
  output svm:     ./checkpoints/svm_20251214_131115.bin

=== Step 1: Training Autoencoder ===
AutoEncoder CUDA | GPU Opt v2 (im2col+GEMM)
Epochs: 20 | Batch: 64 | Samples: all
Train: 50000 images
Test:  10000 images
GPU: Tesla T4 (15095 MB)

Training: 20 epochs, 781 batches/epoch
  Epoch  1/20 | Loss: 0.027545 | 52.0s
  Epoch  2/20 | Loss: 0.021852 | 51.2s
  Epoch  3/20 | Loss: 0.019280 | 51.2s
  Epoch  4/20 | Loss: 0.017686 | 51.4s
  Epoch  5/20 | Loss: 0.016559 | 51.2s
  Epoch  6/20 | Loss: 0.015702 | 51.3s
  Epoch  7/20 | Loss: 0.015019 | 51.1s
  Epoch  8/20 | Loss: 0.014458 | 51.2s
  Epoch  9/20 | Loss: 0.013982 | 51.2s
  Epoch 10/20 | Loss: 0.013572 | 51.2s
  Epoch 11/20 | Loss: 0.013214 | 51.1s
  Epoch 12/20 | Loss: 0.012896 | 51.2s
  Epoch 13/20 | Loss: 0.012612 | 51.2s
  Epoch 14/20 | Loss: 0.012357 | 51.3s
  Epoch 15/20 | Loss: 0.012125 | 51.3s
  Epoch 16/20 | Loss: 0
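
The per-epoch times above allow a rough estimate of the full training time. The sketch parses the `Epoch n/N | Loss | time` lines (the sample is the first three lines of the log) and extrapolates to all 20 epochs, assuming later epochs take about as long as the ones shown.

```python
import re

EPOCH = re.compile(r"Epoch\s+(\d+)/(\d+)\s+\|\s+Loss:\s+([\d.]+)\s+\|\s+([\d.]+)s")


def parse_epochs(log):
    """Return a list of (epoch, total_epochs, loss, seconds) tuples found in `log`."""
    return [
        (int(m.group(1)), int(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in EPOCH.finditer(log)
    ]


log = """
  Epoch  1/20 | Loss: 0.027545 | 52.0s
  Epoch  2/20 | Loss: 0.021852 | 51.2s
  Epoch  3/20 | Loss: 0.019280 | 51.2s
"""

rows = parse_epochs(log)
mean_seconds = sum(r[3] for r in rows) / len(rows)
total_epochs = rows[0][1]
print(f"~{mean_seconds * total_epochs / 60:.0f} min for {total_epochs} epochs")
```

At roughly 51 s per epoch this comes to about 17 minutes for the autoencoder stage alone, in line with the ~20 minute figure quoted for full training.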

---
## 3. Benchmark

In [9]:
# Compare CPU vs GPU versions (quick)
!scripts/benchmark.sh --epochs 3 --samples 100

[benchmark] epochs=3 samples=100

  CPU... 453667ms ( GB)
  GPU-Basic... 2556ms (1.2 GB)
  GPU-OptV1... 1413ms (1.2 GB)
  GPU-OptV2... 787ms (1.2 GB)

Results:
  Version        Time(ms)      Speedup Memory(GB)
  ------------------------------------------------
  CPU              453667        1.00x        N/A
  GPU-Basic          2556        177.4x        1.2
  GPU-OptV1          1413        321.0x        1.2
  GPU-OptV2           787        576.4x        1.2

[OK] Saved: ./results/benchmark.csv

[benchmark] Results
Version         Time(s)    Speedup   
-----------------------------------
CPU             453.67     1.0       x
GPU-Basic       2.56       177.5     x
GPU-OptV1       1.41       321.1     x
GPU-OptV2       0.79       576.5     x

[OK] Chart: ./results/benchmark.png


In [10]:
# GPU-only benchmark (more samples)
!scripts/benchmark.sh --epochs 3 --samples 10000 --gpu-only

[benchmark] epochs=3 samples=10000

  GPU-Basic... 301045ms (1.2 GB)
  GPU-OptV1... 150590ms (1.2 GB)
  GPU-OptV2... 31222ms (1.2 GB)

Results:
  Version        Time(ms)      Speedup Memory(GB)
  ------------------------------------------------
  GPU-Basic        301045          1.0x        1.2
  GPU-OptV1        150590          1.9x        1.2
  GPU-OptV2         31222          9.6x        1.2

[OK] Saved: ./results/benchmark.csv

[benchmark] Results
Version         Time(s)    Speedup   
-----------------------------------
GPU-Basic       301.05     1.0       x
GPU-OptV1       150.59     2.0       x
GPU-OptV2       31.22      9.6       x

[OK] Chart: ./results/benchmark.png
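
The speedup column is the baseline time divided by each version's time. The sketch below recomputes it from the millisecond timings printed above. It also illustrates one possible explanation for the small mismatch between the two result tables (for example 1.9x versus 2.0 for GPU-OptV1, or 177.4x versus 177.5 in the earlier run): the values are consistent with one table truncating to one decimal and the other rounding. That is an inference from the numbers, not something checked in the benchmark script.

```python
import math

timings_ms = {"GPU-Basic": 301045, "GPU-OptV1": 150590, "GPU-OptV2": 31222}


def speedups(timings, baseline):
    """Speedup of each entry relative to `baseline` (baseline time / entry time)."""
    base = timings[baseline]
    return {name: base / t for name, t in timings.items()}


for name, s in speedups(timings_ms, "GPU-Basic").items():
    truncated = math.floor(s * 10) / 10
    print(f"{name}: {s:.3f}x (truncated {truncated}, rounded {round(s, 1)})")
```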


---
## 4. Advanced Options

```bash
# GPU versions: naive (basic), v1 (memory opt), v2 (im2col+GEMM, kernel fusion)
./run.sh train-autoencoder --version v2 --epochs 20

# Custom weight paths
./run.sh evaluate --encoder-weights ./checkpoints/my.weights --svm-model ./checkpoints/my.bin

# CPU training
./run.sh train-autoencoder --device cpu --epochs 5 --samples 100
```
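
When sweeping several configurations from a notebook cell, it can be convenient to assemble the `run.sh` command line in Python. The helper below is hypothetical (it is not part of the repository) and simply passes through whatever keyword arguments it is given; the flag names used in the example are the ones shown in this guide.

```python
import shlex


def build_run_command(command, **options):
    """Build an argument list for ./run.sh; keyword `encoder_weights` becomes `--encoder-weights`."""
    args = ["./run.sh", command]
    for key, value in options.items():
        args += [f"--{key.replace('_', '-')}", str(value)]
    return args


cmd = build_run_command("train-autoencoder", version="v2", epochs=5, samples=1000)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.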