Needle (Necessary Elements of Deep Learning) is a comprehensive deep learning framework built from scratch as part of Carnegie Mellon University's 10-714 Deep Learning Systems course. This educational project implements the core components of a modern deep learning framework, including automatic differentiation, neural network modules, and multi-backend support.
- Automatic Differentiation: Complete implementation of reverse-mode automatic differentiation with computational graph support
- Tensor Operations: Comprehensive set of tensor operations including arithmetic, linear algebra, and element-wise operations
- Neural Network Modules: Implementation of common neural network layers (Linear, Conv2d, BatchNorm, ReLU, etc.)
- Optimizers: Various optimization algorithms (SGD, Adam, etc.)
- Data Loading: Utilities for loading and preprocessing datasets (CIFAR-10, Penn Treebank)
- CPU Backend: High-performance CPU implementation using C++ with pybind11
- CUDA Backend: GPU acceleration for NVIDIA GPUs
- Metal Backend: Apple Silicon GPU acceleration using Metal Performance Shaders
- NumPy Backend: Pure Python implementation for development and testing
- Convolutional Neural Networks: 2D convolution operations with padding and stride support
- Recurrent Neural Networks: LSTM and RNN implementations for sequence modeling
- Memory Management: Efficient memory allocation and deallocation
- Gradient Checking: Built-in numerical gradient verification (see the sketch after this list)
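Since the framework includes numerical gradient verification, a quick check of the autograd engine against central differences looks roughly like the following. This is a minimal sketch written against the Tensor API shown in the usage examples below (`Tensor`, `backward`, `grad`, `numpy`); the helper name `check_gradient` is hypothetical, not part of the framework.

```python
import numpy as np
import needle as ndl

def check_gradient(f, x, eps=1e-4, tol=1e-3):
    """Compare autograd's gradient of a scalar-valued f against central differences."""
    f(x).backward()              # populate x.grad via reverse-mode AD
    analytic = x.grad.numpy()

    base = x.numpy().astype("float64")
    numeric = np.zeros_like(base)
    for i in range(base.size):
        # Perturb one coordinate at a time in both directions
        plus, minus = base.copy(), base.copy()
        plus.flat[i] += eps
        minus.flat[i] -= eps
        delta = (f(ndl.Tensor(plus, dtype="float32")).numpy()
                 - f(ndl.Tensor(minus, dtype="float32")).numpy())
        numeric.flat[i] = delta / (2 * eps)
    return np.allclose(analytic, numeric, atol=tol)

x = ndl.Tensor([1.0, 2.0, 3.0], dtype="float32")
print(check_gradient(lambda t: (t * t).sum(), x))  # expect True: grad of sum(t*t) is 2*t
```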
```
needle/
├── python/needle/              # Core Python implementation
│   ├── autograd.py             # Automatic differentiation engine
│   ├── ops.py                  # Tensor operations
│   ├── nn.py                   # Neural network modules
│   ├── optim.py                # Optimization algorithms
│   ├── data.py                 # Data loading utilities
│   └── backend_ndarray/        # Compiled backend modules
├── src/                        # C++/CUDA/Metal source code
│   ├── ndarray_backend_cpu.cc
│   ├── ndarray_backend_cuda.cu
│   └── ndarray_backend_metal.cpp
├── apps/                       # Example applications
│   ├── simple_training.py      # CIFAR-10 training example
│   ├── models.py               # Model definitions
│   └── mlp_resnet.py           # ResNet implementation
├── tests/                      # Comprehensive test suite
├── data/                       # Dataset storage
└── hw*.ipynb                   # Course homework notebooks
```
- Python 3.8+
- CMake 3.2+
- C++ compiler (GCC/Clang)
- CUDA toolkit (optional, for GPU support)
- Xcode command line tools (macOS, for Metal support)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd needle
  ```

- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Build the framework:

  ```bash
  make
  ```

  This will compile the C++/CUDA/Metal backends and create the necessary Python extensions.
- Install CUDA toolkit
- Ensure `nvidia-smi` is available
- The build system will automatically detect and compile CUDA support
- Requires macOS with Apple Silicon
- Xcode command line tools must be installed
- Metal support is automatically enabled on macOS
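After building, a quick way to confirm which backends are usable is to construct a tensor on each device. This sketch assumes the 10-714-style device constructors `ndl.cpu()` and `ndl.cuda()` and the `enabled()` check on backend devices; the Metal device constructor's name depends on this codebase, so it is omitted here.

```python
import needle as ndl

# CPU backend is always available
x = ndl.Tensor([1.0, 2.0, 3.0], device=ndl.cpu())
print(x.device, x.numpy())

# CUDA backend, only if it was compiled in (enabled() is assumed
# from the course's BackendDevice API)
if ndl.cuda().enabled():
    y = ndl.Tensor([1.0, 2.0, 3.0], device=ndl.cuda())
    print(y.device, y.numpy())
```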
```python
import needle as ndl
import needle.nn as nn

# Create tensors
x = ndl.Tensor([1, 2, 3, 4], dtype="float32")
y = ndl.Tensor([5, 6, 7, 8], dtype="float32")

# Basic operations
z = x + y
print(z)  # [6, 8, 10, 12]

# Automatic differentiation
z = (x * y).sum()
z.backward()
print(x.grad)  # [5, 6, 7, 8]
```
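Under the hood, each tensor records the operation that produced it and the tensors it was computed from; this record is what `backward()` traverses in reverse topological order. Assuming the course's autograd design (`op` and `inputs` fields on tensors), the graph can be inspected directly:

```python
# Continuing from the example above: z = (x * y).sum()
print(type(z.op).__name__)                  # the op that produced z, e.g. a summation op
print(len(z.inputs))                        # number of parent tensors in the graph
print(z.inputs[0].op, z.inputs[0].inputs)   # walk one level up the graph
```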
```python
import needle as ndl
import needle.nn as nn
import needle.optim as optim

# Define a simple MLP
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.relu(self.linear1(x))
        return self.linear2(x)

# Create model and optimizer
model = MLP(784, 128, 10)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.SoftmaxLoss()

# Training loop (num_epochs and dataloader assumed defined)
for epoch in range(num_epochs):
    for batch_x, batch_y in dataloader:
        logits = model(batch_x)
        loss = loss_fn(logits, batch_y)
        loss.backward()
        optimizer.step()
        optimizer.reset_grad()
```
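After training, evaluation reuses the same forward pass without any optimizer updates. A minimal sketch, assuming an `eval_dataloader` defined like the training `dataloader` above and integer class labels in `batch_y`:

```python
import numpy as np

correct, total = 0, 0
for batch_x, batch_y in eval_dataloader:
    preds = model(batch_x).numpy().argmax(axis=1)   # predicted class per sample
    correct += int((preds == batch_y.numpy()).sum())
    total += batch_y.shape[0]
print(f"accuracy: {correct / total:.3f}")
```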
```bash
# Run the example training script
python apps/simple_training.py
```
Run the comprehensive test suite:
```bash
# Run all tests
python -m pytest tests/

# Run specific test categories
python -m pytest tests/test_ndarray.py       # Tensor operations
python -m pytest tests/test_nn_and_optim.py  # Neural networks
python -m pytest tests/test_conv.py          # Convolution operations
```
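New tests typically compare a needle operation against its NumPy reference. A minimal sketch in the style of the suite (the test name is hypothetical; pytest and NumPy are already used by the existing tests):

```python
import numpy as np
import needle as ndl

def test_ewise_add_matches_numpy():
    a = np.random.randn(4, 5).astype("float32")
    b = np.random.randn(4, 5).astype("float32")
    out = (ndl.Tensor(a) + ndl.Tensor(b)).numpy()
    np.testing.assert_allclose(out, a + b, rtol=1e-5, atol=1e-5)
```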
This framework is developed as part of CMU's 10-714 Deep Learning Systems course, covering:
- Homework 1: Automatic differentiation and basic tensor operations
- Homework 2: Neural network modules and optimizers
- Homework 3: Backend implementations (CPU/CUDA/Metal)
- Homework 4: Convolutional and recurrent neural networks
Each homework builds upon the previous implementations, creating a complete deep learning framework.
- CPU Backend: Implemented in C++ with pybind11 bindings, optimized for performance with SIMD instructions, and supporting all tensor operations
- CUDA Backend: GPU acceleration for NVIDIA hardware, with custom CUDA kernels for key operations and automatic memory management
- Metal Backend: Apple Silicon GPU acceleration via Metal Performance Shaders integration, optimized for M1/M2 Macs
- API Reference: See docstrings in individual modules
- Examples: Check the `apps/` directory for usage examples
- Tests: Comprehensive test suite in the `tests/` directory
This is an educational project. For questions or issues related to the course, please refer to the course materials and discussion forums.
This project is part of Carnegie Mellon University's 10-714 Deep Learning Systems course. Please refer to the course guidelines for usage and distribution.
- Carnegie Mellon University 10-714 course staff
- PyTorch team for inspiration on API design
- The open-source community for various tools and libraries
Note: This is an educational implementation and should not be used for production workloads. For production deep learning, consider established frameworks like PyTorch or TensorFlow.