In [1]:
import torch
print(torch.__version__)
if torch.cuda.is_available():
    print("GPU is:", torch.cuda.get_device_name(0))

2.8.0+cu128
GPU is: NVIDIA GeForce RTX 4090


# Project: Aether-Vision
# A Distributed High-Resolution Image & Video Analysis System
## 1. The Mission
The goal of Aether-Vision is to transform raw visual data (images and videos) into human-readable scene descriptions. Pre-trained models such as YOLOv12 and LLaVA already address parts of this problem, so this project is a "Learning by Reconstruction" effort: we are intentionally building the entire stack, from the fundamentals of training to the distributed inference infrastructure, to master the complexities of modern Machine Learning Engineering.

## 2. Final Product Vision
Input: A high-resolution image or video file uploaded via a cloud-hosted gateway.

Processing: The file is securely streamed to a remote "GPU Engine," where each image (or video frame) is tiled and analyzed by a custom-trained deep learning backbone.

Output: A structured natural language description identifying multiple objects and their general context, e.g., "Detected a mountain bike in the foreground and a forest trail in the background; likely an outdoor sporting scene."

## 3. Phase 1: The "Backbone" (Research & Training)
We will train a well-established vision architecture (e.g., ConvNeXt or ResNet-50) from scratch using the 115 GB ImageNet-1k dataset.

The Learning Goal: To understand convergence behavior, optimizer choice (AdamW vs. SGD), and the data-loading bottlenecks that appear when moving from "toy" datasets to the 100 GB+ scale.

The Hardware Challenge: Keeping an RTX 4090 fully utilized by implementing high-throughput data pipelines (using FFCV or WebDataset) so that the GPU is never starved for data.
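
To get a feel for what "never starved" means in numbers, the following back-of-the-envelope sketch estimates the sustained read rate the data pipeline would need. The 115 GB figure comes from the text above and 1,281,167 is the ImageNet-1k training set size; the assumption that the 115 GB is mostly training data and the GPU consumption rate of 3,000 images per second are illustrative guesses, not measurements.

```python
# Rough estimate of the read throughput needed to keep the GPU fed.
TRAIN_IMAGES = 1_281_167        # ImageNet-1k training images
DATASET_BYTES = 115 * 10**9     # 115 GB, assumed here to be mostly training data
ASSUMED_IMAGES_PER_SEC = 3_000  # hypothetical GPU consumption rate, not measured


def required_read_mb_per_sec(images_per_sec,
                             dataset_bytes=DATASET_BYTES,
                             n_images=TRAIN_IMAGES):
    """Return the sustained read rate (MB/s) for a given images/sec target."""
    avg_bytes_per_image = dataset_bytes / n_images
    return images_per_sec * avg_bytes_per_image / 10**6


avg_kb = DATASET_BYTES / TRAIN_IMAGES / 1000
read_rate = required_read_mb_per_sec(ASSUMED_IMAGES_PER_SEC)
print(f"average image size: {avg_kb:.1f} KB")
print(f"required read rate: {read_rate:.0f} MB/s")
```

Under these assumptions the average image is roughly 90 KB and the pipeline has to deliver on the order of 270 MB/s of compressed data, before JPEG decoding and augmentation costs are counted. Measuring the real images-per-second figure on the 4090 is the first step toward replacing the guess.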

## 4. Phase 2: The "Orchestrator" (Infrastructure & Production)
Moving beyond the notebook, we productionize the model using a Hybrid Cloud Architecture.

Distributed Inference: A lightweight Control Plane (AWS/GCP/DigitalOcean) handles user requests and proxies them to a remote Inference Worker (the 4090) via gRPC.

Tiling Engine: To handle high-resolution inputs where standard models might miss small details, the system will programmatically "tile" images into segments, process them in parallel, and aggregate the results.
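
The tiling step can be sketched as a pure function that turns an image size into a list of crop boxes. This is a minimal illustration, not the project's final design: the function names, the 640-pixel tile size, and the 64-pixel overlap are all assumptions chosen for the example. The overlap is there so that an object cut by one tile boundary is likely to appear whole in a neighboring tile.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def _starts(length: int, tile: int, stride: int) -> List[int]:
    """Start offsets along one axis so that the tiles cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] + tile < length:
        # Shift one final tile back so it ends exactly at the image edge.
        starts.append(length - tile)
    return starts


def tile_boxes(width: int, height: int,
               tile: int = 640, overlap: int = 64) -> List[Box]:
    """Return crop boxes in row-major order covering the whole image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in _starts(height, tile, stride)
        for x in _starts(width, tile, stride)
    ]


boxes = tile_boxes(1920, 1080)
print(len(boxes), boxes[0], boxes[-1])
```

Each box can be passed to an image library's crop call and the crops batched for the GPU. Aggregation then has to map per-tile results back to full-image coordinates by adding each box's `left` and `top`, and merge duplicates that show up in overlapping regions.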

SRE Principles: Implementing health checks, circuit breakers, and observability (metrics/logs) to ensure the system remains reliable even if the remote GPU connection is unstable.
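
As one illustration of these principles, here is a minimal circuit breaker that the Control Plane could wrap around calls to the Inference Worker. The class name, thresholds, and error type are hypothetical, and a production version would also need thread safety and metrics. The idea is that after several consecutive failures the breaker "opens" and rejects calls immediately, then allows a single trial call once a cool-down period has passed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can fake time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"       # cool-down elapsed: allow one trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("GPU worker marked unavailable; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In use, the gateway would call something like `breaker.call(send_to_worker, request)`, where `send_to_worker` stands in for whatever gRPC client call the project ends up with, and translate `CircuitOpenError` into a fast "service unavailable" response instead of letting requests pile up behind a dead connection.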

## 5. Why Build This From Scratch?
To own the weights: Understanding the "Training Recipe" allows us to modify the model for specific needs (like Canva’s internal datasets) in the future.

To solve the "Resolution Problem": Off-the-shelf models often downsample images to 640px, losing fine detail. Our tiling approach runs the model on full-resolution tiles instead, so small objects keep their original pixel detail.
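
A quick calculation makes the cost of downsampling concrete. The sketch below assumes a 3840x2160 input, a resize that maps the long side to 640px, and a hypothetical object that is 60px wide in the original frame; all three numbers are illustrative only.

```python
def downsampled_size(width, height, target_long_side=640):
    """Resize so the long side equals target_long_side (never upscale)."""
    scale = min(1.0, target_long_side / max(width, height))
    return round(width * scale), round(height * scale), scale


w, h, scale = downsampled_size(3840, 2160)
object_px = 60  # hypothetical object width in the original frame
print(f"{w}x{h}, scale {scale:.3f}, object shrinks to {object_px * scale:.0f} px")
```

At a scale of one sixth, the 60px object is left with about 10px of width, which is little for a classifier to work with. Inside a full-resolution tile the same object still spans all 60 of its original pixels.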

To master the bridge: Connecting separate infrastructure providers into one low-latency service is a core MLE/SRE skill that cannot be learned from a one-click deployment tool.