Documentation: https://minitorch.github.io/
Overview (Required reading): https://minitorch.github.io/module3/module3/
Module 3 focuses on optimizing tensor operations through parallel computing and GPU acceleration. You'll implement CPU parallel operations using Numba and GPU kernels using CUDA, achieving dramatic performance improvements over the sequential tensor backend from Module 2.
- CPU Parallelization: Implement parallel tensor operations with Numba
- GPU Programming: Write CUDA kernels for tensor operations
- Performance Optimization: Achieve significant speedup through hardware acceleration
- Matrix Multiplication: Optimize the most computationally intensive operations with operator fusion
 
Task 3.1: CPU Parallel Operations
File to edit: minitorch/fast_ops.py
Implement parallel versions of the tensor map, zip, and reduce operations with Numba. Feel free to use NumPy functions like np.array_equal() and np.zeros(). A sketch of the pattern follows.
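To give a flavor of the pattern, here is a minimal, self-contained sketch of a prange-parallel map. This is illustrative only: parallel_map is a hypothetical name, and the real map in fast_ops.py must also handle broadcasting through index/position conversions.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_map(out, in_storage):
    # When the output and input are aligned, every position is
    # independent, so the main loop can run in parallel.
    for i in prange(len(out)):
        out[i] = in_storage[i] * 2.0  # stand-in for the mapped fn(x)

x = np.arange(6, dtype=np.float64)
y = np.zeros_like(x)
parallel_map(y, x)
print(y)  # [ 0.  2.  4.  6.  8. 10.]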
Task 3.2: CPU Matrix Multiplication
File to edit: minitorch/fast_ops.py
Implement optimized batched matrix multiplication with parallel outer loops.
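As a hedged sketch of the "parallel outer loop" idea (batched_matmul is a hypothetical name, and unlike the real implementation, which works on flat storage with strides, this uses plain 3-D arrays):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def batched_matmul(out, a, b):
    batch, m, n = out.shape
    k = a.shape[2]
    for idx in prange(batch * m):  # parallelize across batches and rows
        bi = idx // m
        i = idx % m
        for j in range(n):
            acc = 0.0  # local accumulator: one write per output cell
            for p in range(k):
                acc += a[bi, i, p] * b[bi, p, j]
            out[bi, i, j] = acc

a = np.random.rand(2, 3, 4)
b = np.random.rand(2, 4, 5)
out = np.zeros((2, 3, 5))
batched_matmul(out, a, b)
print(np.allclose(out, a @ b))  # True

Accumulating into a local variable instead of writing to the output buffer inside the inner loop is the key habit here: it keeps the inner loop free of global writes.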
Task 3.3: GPU Operations (requires GPU)
File to edit: minitorch/cuda_ops.py
Implement CUDA kernels for tensor map, zip, and reduce operations.
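For orientation, a minimal numba.cuda map kernel looks like this (gpu_map is a hypothetical name; the real kernels in cuda_ops.py must additionally convert the global thread index into multidimensional tensor indices):

import numpy as np
from numba import cuda

@cuda.jit
def gpu_map(out, inp):
    i = cuda.grid(1)       # global thread index
    if i < out.size:       # guard: the grid may overshoot the data
        out[i] = inp[i] * 2.0  # stand-in for the mapped fn(x)

x = np.arange(8, dtype=np.float64)
y = np.zeros_like(x)
threads_per_block = 32
blocks = (x.size + threads_per_block - 1) // threads_per_block
gpu_map[blocks, threads_per_block](y, x)
print(y)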
Task 3.4: GPU Matrix Multiplication (requires GPU)
File to edit: minitorch/cuda_ops.py
Implement CUDA matrix multiplication with shared memory optimization for maximum performance.
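The core idea is tiling: each thread block stages a TILE x TILE sub-matrix of each operand in fast shared memory, so each global value is read once per tile rather than once per multiply. Below is a hedged sketch under simplifying assumptions (hypothetical names; square matrices whose side is a multiple of the tile size):

import numpy as np
from numba import cuda, float64

TILE = 16

@cuda.jit
def matmul_shared(out, a, b):
    sa = cuda.shared.array((TILE, TILE), dtype=float64)
    sb = cuda.shared.array((TILE, TILE), dtype=float64)
    x, y = cuda.grid(2)
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    acc = 0.0
    for t in range(a.shape[1] // TILE):
        sa[tx, ty] = a[x, t * TILE + ty]  # each thread loads one element per tile
        sb[tx, ty] = b[t * TILE + tx, y]
        cuda.syncthreads()  # wait until the whole tile is loaded
        for p in range(TILE):
            acc += sa[tx, p] * sb[p, ty]
        cuda.syncthreads()  # wait before the tile is overwritten
    out[x, y] = acc

n = 64
a = np.random.rand(n, n)
b = np.random.rand(n, n)
out = np.zeros((n, n))
matmul_shared[(n // TILE, n // TILE), (TILE, TILE)](out, a, b)
print(np.allclose(out, a @ b))  # True

The real kernel must also handle batching and matrix sizes that are not multiples of the tile size (guard the loads and pad with zeros).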
Task 3.5: Training (requires GPU)
File to edit: project/run_fast_tensor.py
Implement missing functions and train models on all datasets to demonstrate performance improvements.
- Installation Guide - Setup instructions including GPU configuration
- Testing Guide - How to run tests locally and handle GPU requirements
 
Follow this link, open the Colab file and save a copy to your Drive, set the runtime type to T4, and follow the instructions in the notebook.
# Automatic style checking
pre-commit install
git commit -m "your changes"  # Runs style checks automatically
# Manual style checks
ruff check .      # Linting
ruff format .     # Formatting
pyright .         # Type checking

# Debug Numba JIT issues
NUMBA_DISABLE_JIT=1 pytest -m task3_1 -v
# Debug CUDA kernels
NUMBA_CUDA_DEBUG=1 pytest -m task3_3 -v
# Monitor GPU usage
nvidia-smi -l 1  # Update every second

# CPU Backend
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET simple --RATE 0.05
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET split --RATE 0.05
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET xor --RATE 0.05
# GPU Backend  
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET simple --RATE 0.05
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05

# GPU Backend examples
!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET simple --RATE 0.05
!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05

TODO: Add your training results here
Simple dataset:
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]

Split dataset:
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]

Xor dataset:
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]