<a href="https://colab.research.google.com/github/zack-dev-cm/agnitraai/blob/main/agnitra_enhanced_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Agnitra AI: Hardware-Oriented Proof-of-Concept (Enhanced Demo)

Agnitra AI is an AI-native runtime optimization platform designed to dynamically boost the performance of AI models on GPUs and other accelerators. The core innovations — dynamic runtime tuning, cross-vendor abstraction, LLM+RL optimization and a telemetry feedback loop — differentiate it from existing solutions like TensorRT or cuDNN. This notebook provides an enhanced demonstration of those principles, with a focus on hardware-friendly code generation rather than high-level code synthesis. It integrates the latest OpenAI Codex API and GPT-5 via the new Responses API, simulates reinforcement-learning (RL) based tuning, and shows how a recommender/code-generation engine can scan and improve inefficient code.

**Disclaimer:** Real API calls require a valid `OPENAI_API_KEY` and access to the respective models. If the key is not provided or the model name is unavailable, the notebook will log debug messages and skip those steps gracefully.

## Overview of Core Modules and MVP Scope

Agnitra’s MVP is composed of several modules designed to work together as a runtime optimization pipeline. The platform captures hardware telemetry, converts the model into an intermediate representation (IR), invokes an AI optimizer (LLMs + RL), and then generates and patches custom kernels. The key components summarised below are based on the project requirements document:

- **Telemetry Collector:** uses `torch.profiler` to hook into each layer, capturing CUDA time, tensor shapes and allocated memory, and storing the logs as JSON.
- **IR Graph Extractor:** converts the model into an IR using `torch.fx` and attaches telemetry to each node, producing a structured graph that can be fed into an optimizer.
- **LLM-Based Optimizer:** prompts a large language model to suggest tiling/block parameters and other kernel optimizations based on the telemetry and IR.
- **Reinforcement Learning (RL) Agent:** simulates a reward loop and uses PPO to tune parameters such as tile size, loop unrolling or fusion decisions.
- **Kernel Generator & Runtime Patcher:** builds a kernel template engine and dynamically replaces graph nodes with optimized kernels, including fallback logic.

This notebook implements a simplified version of the above modules, emphasising clarity, hardware friendliness, and investor appeal. It also demonstrates a **one-line API** (`optimize_model`) that integrates the pipeline.
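The "fallback logic" mentioned for the runtime patcher is not implemented in this notebook. As a minimal sketch of the idea, assuming a patched kernel is just a Python callable, a wrapper can try the optimized path and drop back to the reference implementation on any failure. The names `with_fallback`, `reference_scale` and `broken_scale` are hypothetical and exist only for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_fallback(optimized: Callable[..., T], reference: Callable[..., T]) -> Callable[..., T]:
    """Return a callable that tries `optimized` first and falls back to `reference`."""
    def patched(*args, **kwargs):
        try:
            return optimized(*args, **kwargs)
        except Exception as exc:
            # Any failure in the optimized path is logged and the reference path is used.
            print(f"[WARNING] optimized kernel failed ({exc!r}); using reference")
            return reference(*args, **kwargs)
    return patched


def reference_scale(values, factor):
    """Reference implementation: multiply every element by `factor`."""
    return [v * factor for v in values]


def broken_scale(values, factor):
    """Stand-in for an optimized kernel that cannot handle its input."""
    raise RuntimeError("unsupported shape")


safe_scale = with_fallback(broken_scale, reference_scale)
print(safe_scale([1.0, 2.0], 3.0))
```

A real patcher would likely also compare the optimized output against the reference before trusting it; that check is omitted here to keep the sketch short.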

In [1]:
# Install required packages
# The Responses API (client.responses) used below is not available in openai 1.30.x, so install a recent release.
!pip install -q --upgrade openai
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install -q stable-baselines3
!pip install -q gymnasium
!pip install -q triton
!pip install -q rich

In [None]:
import os
import torch
try:
    from google.colab import userdata  # type: ignore
    openai_key = userdata.get("OPENAI_API_KEY")
except Exception:
    openai_key = os.getenv("OPENAI_API_KEY")

if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    if "A100" not in device_name:
        print(f"[WARNING] Expected A100 GPU but found {device_name}")
else:
    print("[WARNING] CUDA not available; running on CPU")

if openai_key:
    os.environ["OPENAI_API_KEY"] = openai_key
else:
    print("[WARNING] No OpenAI API key provided.")


In [7]:
# Imports and API setup
import os, json, time, math
import torch
from torch import nn
from torch.profiler import profile, record_function, ProfilerActivity
from torch.fx import symbolic_trace
import numpy as np

try:
    from stable_baselines3 import PPO
    import gymnasium as gym
    from gymnasium import spaces
except Exception as e:
    PPO = None
    print('[WARNING] RL libraries not available:', e)
try:
    from openai import OpenAI
except ImportError:
    OpenAI = None
    print('[WARNING] The openai package is not installed.')

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
if OPENAI_API_KEY and OpenAI is not None:
    client = OpenAI(api_key=OPENAI_API_KEY)
else:
    client = None
    print('[INFO] OpenAI API key not provided; LLM calls will be skipped.')
CODEX_MODEL = 'codex-latest'
GPT_MODEL = 'gpt-5'


In [8]:
# 1. Telemetry Collection
def collect_telemetry(model: nn.Module, input_tensor: torch.Tensor):
    """Run a single forward pass with torch.profiler and return telemetry as a list of dicts."""
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    telemetry = []
    with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
        with record_function('model_inference'):
            _ = model(input_tensor)
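    for evt in prof.key_averages():
        # Profiler times are reported in microseconds; divide by 1e3 to get milliseconds.
        # Newer PyTorch releases expose device_* attributes in place of cuda_*, so try both.
        device_time_us = getattr(evt, 'device_time_total', None)
        if device_time_us is None:
            device_time_us = getattr(evt, 'cuda_time_total', 0.0)
        device_mem = getattr(evt, 'self_device_memory_usage', None)
        if device_mem is None:
            device_mem = getattr(evt, 'self_cuda_memory_usage', 0)
        entry = {
            'name': evt.key,
            'cpu_time_ms': getattr(evt, 'cpu_time_total', 0.0) / 1e3,
            'cuda_time_ms': (device_time_us / 1e3) if torch.cuda.is_available() else 0.0,
            'input_shape': getattr(evt, 'input_shapes', []),
            'cpu_memory_bytes': getattr(evt, 'self_cpu_memory_usage', 0),
            'cuda_memory_bytes': device_mem if torch.cuda.is_available() else 0
        }
        telemetry.append(entry)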
    return telemetry
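The module overview says telemetry is stored as JSON, which `collect_telemetry` itself does not do. A minimal sketch of that step follows, assuming the list-of-dicts layout returned above. The helpers `us_to_ms`, `save_telemetry` and `load_telemetry` are hypothetical names, and the sample record is made up for illustration rather than taken from a real profiler run.

```python
import json
import os
import tempfile


def us_to_ms(microseconds: float) -> float:
    """Convert a profiler time in microseconds to milliseconds."""
    return microseconds / 1e3


def save_telemetry(telemetry: list, path: str) -> None:
    """Write telemetry records to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        # default=str keeps the dump from failing on values json cannot encode natively.
        json.dump(telemetry, fh, indent=2, default=str)


def load_telemetry(path: str) -> list:
    """Read telemetry records back from `path`."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Illustrative record only; the values are not measured.
sample = [{
    "name": "aten::addmm",
    "cpu_time_ms": us_to_ms(1250.0),
    "input_shape": [[1, 512], [512, 512]],
}]

telemetry_path = os.path.join(tempfile.mkdtemp(), "telemetry.json")
save_telemetry(sample, telemetry_path)
restored = load_telemetry(telemetry_path)
print(restored[0]["name"], restored[0]["cpu_time_ms"])
```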


In [9]:
# 2. IR Graph Extraction
def extract_ir(model: nn.Module, telemetry: list):
    """Trace the model with torch.fx and attach telemetry to each node."""
    traced = symbolic_trace(model)
    ir_nodes = []
    for node in traced.graph.nodes:
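        matched = None
        # node.target is a callable for call_function nodes, so compare on its string form.
        target_name = str(node.target)
        for entry in telemetry:
            if target_name and target_name in str(entry['name']):
                matched = entry
                break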
        ir_nodes.append({
            'op': node.op,
            'target': str(node.target),
            'args': str(node.args),
            'kwargs': str(node.kwargs),
            'telemetry': matched
        })
    return ir_nodes
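One caveat with substring matching: profiler event keys are usually operator names such as `aten::linear`, while FX `call_module` targets are attribute paths such as `fc1`, so the two may rarely line up. A possible alternative, sketched here under the assumption that the caller knows each node's module class name, is to match through an explicit lookup table. `MODULE_TO_OP` and `match_event` are hypothetical names, and the table covers only the two layer types used in this demo.

```python
from typing import Optional

# Hypothetical mapping from nn.Module class names to profiler operator names.
MODULE_TO_OP = {
    "Linear": "aten::linear",
    "ReLU": "aten::relu",
}


def match_event(module_class: str, telemetry: list) -> Optional[dict]:
    """Return the first telemetry entry whose name equals the op mapped to `module_class`."""
    op_name = MODULE_TO_OP.get(module_class)
    if op_name is None:
        return None  # Unknown layer type: leave the node without telemetry.
    for entry in telemetry:
        if entry["name"] == op_name:
            return entry
    return None


# Illustrative telemetry; the timings are placeholders, not measurements.
events = [
    {"name": "aten::linear", "cpu_time_ms": 0.4},
    {"name": "aten::relu", "cpu_time_ms": 0.1},
]
print(match_event("Linear", events))
print(match_event("Conv2d", events))
```

Exact matching avoids accidental hits, for example a placeholder named `x` matching every event whose name contains the letter x.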


In [10]:
# 3. LLM-based optimization suggestion
def request_kernel_suggestions(telemetry: list, ir_nodes: list, client=None, model_name=CODEX_MODEL):
    """Call the LLM to propose optimized kernel parameters. Returns a suggestion or None."""
    if client is None or OpenAI is None:
        print('[INFO] No OpenAI client or API key; skipping kernel suggestion.')
        return None
    try:
        ir_json = json.dumps(ir_nodes)
    except TypeError:
        ir_json = json.dumps([{'op': n['op'], 'target': n['target']} for n in ir_nodes])
    system_message = {
        'role': 'system',
        'content': [
            {'type': 'input_text', 'text': 'You are an expert GPU kernel optimizer. Given telemetry and an IR graph, suggest block size, tile size and unroll factors to reduce latency.'}
        ]
    }
    user_message = {
        'role': 'user',
        'content': [
            {'type': 'input_text', 'text': f"""Telemetry: {telemetry} IR graph: {ir_json}
Provide optimized kernel parameters (e.g., tile sizes, block size, loop unroll count) and rationale."""}
        ]
    }
    try:
        response = client.responses.create(model=model_name, input=[system_message, user_message], max_output_tokens=1024, store=False)
        optimized_text = ''
        for item in response.output:
            if not hasattr(item, 'content') or item.content is None:
                continue
            for entry in item.content:
                # Some content parts (for example refusals) carry no `text` attribute.
                optimized_text += getattr(entry, 'text', '') or ''
        return optimized_text.strip()
    except Exception as e:
        print('[ERROR] LLM call failed:', e)
        return None
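`request_kernel_suggestions` returns free-form text, which the RL stage cannot consume directly. One way to bridge that gap, sketched here on the assumption that the prompt is extended to ask for a JSON object, is to extract and validate the first JSON object in the reply. The helper `parse_kernel_params` and the keys `tile_size` and `unroll` are illustrative choices, not part of the notebook's pipeline.

```python
import json
import re
from typing import Optional


def parse_kernel_params(text: Optional[str]) -> Optional[dict]:
    """Extract a JSON object from an LLM reply, or return None if none parses."""
    if not text:
        return None
    # Grab everything from the first '{' to the last '}' in the reply.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        params = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return params if isinstance(params, dict) else None


reply = 'Suggested settings: {"tile_size": 64, "unroll": 2} because the matmul is memory bound.'
print(parse_kernel_params(reply))
print(parse_kernel_params("No structured answer available."))
```

Returning `None` on any parse failure keeps the caller's existing "no suggestion" branch usable without extra error handling.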


In [11]:
# 4. Code scanning and improvement via GPT
def improve_python_code(code_str: str, client=None, model_name=GPT_MODEL):
    """Send a code snippet to the GPT model and request an improved, hardware-friendly version."""
    if client is None or OpenAI is None:
        print('[INFO] No OpenAI client or API key; skipping code improvement.')
        return None
    system_message = {
        'role': 'system',
        'content': [
            {'type': 'input_text', 'text': 'You are a senior performance engineer. Improve the following Python code by making it more hardware-efficient (vectorized operations, batching).'}
        ]
    }
    user_message = {
        'role': 'user',
        'content': [
            {'type': 'input_text', 'text': code_str}
        ]
    }
    try:
        response = client.responses.create(model=model_name, input=[system_message, user_message], max_output_tokens=1024, store=False)
        improved = ''
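        for item in response.output:
            # Not every output item (for example reasoning items) is guaranteed to expose `content`.
            if getattr(item, 'content', None) is None:
                continue
            for chunk in item.content:
                improved += getattr(chunk, 'text', '') or ''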
        return improved.strip()
    except Exception as e:
        print('[ERROR] Code improvement call failed:', e)
        return None
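The notebook defines `improve_python_code` but never calls it. A usage sketch follows, with a made-up inefficient snippet (`INEFFICIENT_SNIPPET` is a hypothetical example, not code from the project). The call is looked up through `globals()` so the cell still runs when the function or the client is absent, in which case nothing is sent.

```python
# A deliberately slow, index-based loop that a vectorized version could replace.
INEFFICIENT_SNIPPET = '''
def sum_of_squares(values):
    total = 0.0
    for i in range(len(values)):
        total = total + values[i] * values[i]
    return total
'''

# Confirm the snippet is valid Python before sending it anywhere.
compile(INEFFICIENT_SNIPPET, "<snippet>", "exec")

# Use the notebook's helper and client when they exist; otherwise skip the call.
improve = globals().get("improve_python_code")
improved = (
    improve(INEFFICIENT_SNIPPET, client=globals().get("client"))
    if callable(improve)
    else None
)
print(improved if improved else "[INFO] No improved code returned.")
```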


In [12]:
# 5. Reinforcement Learning Environment for Kernel Tuning
class KernelTuningEnv(gym.Env):
    """A simple environment where the action controls tile size and unroll factor for matrix multiplication."""
    def __init__(self, matrix_size=256):
        super().__init__()
        self.param_options = [(16,1),(32,1),(64,1),(128,1),(16,2),(32,2),(64,2)]
        self.action_space = spaces.Discrete(len(self.param_options))
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
        self.size = matrix_size
        self.A = np.random.rand(self.size, self.size).astype(np.float32)
        self.B = np.random.rand(self.size, self.size).astype(np.float32)
    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return np.array([0.0], dtype=np.float32), {}
    def step(self, action):
        tile_size, unroll = self.param_options[int(action)]
        start = time.perf_counter()
        result = np.zeros((self.size, self.size), dtype=np.float32)
        # Walk the k dimension `unroll` tiles at a time so each block product is accumulated exactly once.
        k_step = tile_size * unroll
        for i in range(0, self.size, tile_size):
            for j in range(0, self.size, tile_size):
                for k0 in range(0, self.size, k_step):
                    for k in range(k0, min(k0 + k_step, self.size), tile_size):
                        i_end, j_end, k_end = i+tile_size, j+tile_size, k+tile_size
                        result[i:i_end, j:j_end] += np.dot(self.A[i:i_end, k:k_end], self.B[k:k_end, j:j_end])
        elapsed = time.perf_counter() - start
        # Single-step episode: the reward is the negative wall-clock time of one tiled matmul.
        reward = -elapsed
        return np.array([0.0], dtype=np.float32), reward, True, False, {'tile_size': tile_size, 'unroll': unroll, 'elapsed': elapsed}
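To make the two tuned parameters concrete, here is a dependency-free sketch of a tiled matrix multiply in which `unroll` controls how many k-tiles are processed per outer iteration. It is an illustration of the loop structure only, not the environment's NumPy implementation, and `tiled_matmul` and `naive_matmul` are hypothetical helper names. Whatever the parameters, the result should equal the plain triple loop.

```python
def naive_matmul(a, b):
    """Plain triple-loop product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def tiled_matmul(a, b, tile_size, unroll=1):
    """Tiled product; each outer k iteration covers `unroll` consecutive k-tiles."""
    n = len(a)
    result = [[0.0] * n for _ in range(n)]
    k_step = tile_size * unroll
    for i0 in range(0, n, tile_size):
        for j0 in range(0, n, tile_size):
            for k0 in range(0, n, k_step):
                for u in range(unroll):
                    k_start = k0 + u * tile_size
                    if k_start >= n:
                        break  # The last group of tiles may be incomplete.
                    for i in range(i0, min(i0 + tile_size, n)):
                        for k in range(k_start, min(k_start + tile_size, n)):
                            a_ik = a[i][k]
                            for j in range(j0, min(j0 + tile_size, n)):
                                result[i][j] += a_ik * b[k][j]
    return result


# Small integer-valued matrices so floating-point sums are exact.
n = 5
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j + 1) for j in range(n)] for i in range(n)]
print(tiled_matmul(a, b, tile_size=2, unroll=2) == naive_matmul(a, b))
```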


In [13]:
# 6. High-level wrapper for model optimization

def optimize_model(model: nn.Module, input_tensor: torch.Tensor, enable_rl: bool = True):
    """One-line API that profiles a model, extracts IR, obtains LLM suggestions and optionally tunes parameters with RL."""
    telemetry = collect_telemetry(model, input_tensor)
    print(f'[INFO] Collected {len(telemetry)} telemetry events.')
    ir_nodes = extract_ir(model, telemetry)
    print(f'[INFO] Extracted {len(ir_nodes)} IR nodes.')
    suggestion = request_kernel_suggestions(telemetry, ir_nodes, client=client)
    if suggestion:
        print('[LLM Suggestion]\n' + suggestion)
    else:
        print('[LLM Suggestion] No suggestion or API unavailable.')
    if enable_rl and PPO is not None:
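        env = KernelTuningEnv(matrix_size=128)
        # PPO needs n_steps * n_envs > 1 when advantages are normalized (the default), so collect 4 steps per update.
        model_rl = PPO('MlpPolicy', env, verbose=0, n_steps=4, batch_size=4, ent_coef=0.0, n_epochs=1)
        model_rl.learn(total_timesteps=20)
        _, _ = env.reset()
        best_reward, best_params = -math.inf, None
        # Exhaustive sweep over every action, used here to report the fastest configuration.
        for action_idx in range(env.action_space.n):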
            _, reward, _, _, meta = env.step(action_idx)
            print(f'    RL eval: tile_size={meta["tile_size"]}, unroll={meta["unroll"]}, elapsed={meta["elapsed"]:.4f}s, reward={reward:.4f}')
            if reward > best_reward:
                best_reward, best_params = reward, (meta['tile_size'], meta['unroll'])
        print(f'[RL Tuner] Best parameters found: tile_size={best_params[0]}, unroll={best_params[1]}')
    else:
        print('[RL Tuner] Skipped due to missing dependencies or disabled.')
    return {'telemetry': telemetry, 'ir': ir_nodes, 'llm_suggestion': suggestion}
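The sweep above times each configuration once, so a single noisy measurement can decide the winner. A possible refinement, sketched with the standard library only, is to take the median of several timed runs after a warm-up. The helpers `measure_median_seconds` and `pick_best` and the toy workload are hypothetical and not part of `optimize_model`.

```python
import statistics
import time


def measure_median_seconds(fn, repeats=5, warmup=1):
    """Call `fn` a few times and return the median wall-clock duration in seconds."""
    for _ in range(warmup):
        fn()  # Warm-up runs are executed but not timed.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def pick_best(candidates, make_workload, repeats=5):
    """Time the workload built for each candidate and return (fastest candidate, all timings)."""
    timings = {
        params: measure_median_seconds(make_workload(params), repeats=repeats)
        for params in candidates
    }
    best = min(timings, key=timings.get)
    return best, timings


def make_sum_workload(n):
    """Toy workload factory: summing the first `n` integers."""
    return lambda: sum(range(n))


best, timings = pick_best([1_000, 200_000], make_sum_workload, repeats=3)
print(best, timings)
```

In the notebook's setting, `make_workload` would wrap one `env.step(action)` call and the reward would be the negated median.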


In [14]:
# 7. Demonstration on a simple model

class DemoNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 512)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(512, 512)
    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

demo_model = DemoNet()
demo_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
demo_model.to(demo_device)
input_tensor = torch.randn(1, 512, device=demo_device)

# Usage
result_meta = optimize_model(demo_model, input_tensor)





## Investor Use-Cases and Business Model

Agnitra AI is designed to serve multiple customer segments, each with a clear value proposition. According to the PRD, the target customers and their needs include ML teams seeking faster inference and cheaper deployment, chip companies demonstrating better performance-per-dollar, cloud providers aiming to improve GPU utilisation, AI startups looking to run LLMs on lower-cost hardware, and OEMs wanting embedded runtime optimisation. The business model mixes B2B SaaS, enterprise SDK licensing, per-GPU licensing and an optimisation-as-a-service offering. The one-line API demonstrated above (`optimize_model`) illustrates how quickly developers can adopt the technology, aligning with the goal of a sub-10-minute SDK integration time.

## Initial Focus Features and Scaling Roadmap

The MVP emphasises the telemetry collector, IR extractor, prompt-based LLM optimiser, RL agent, kernel generator and runtime patcher. These foundational components enable Agnitra to perform runtime tuning without requiring model code changes. Post-MVP phases will add a telemetry visualisation dashboard, support for alternative compiler backends, multi-vendor hardware support and a hosted optimisation-as-a-service platform.

## Conclusion and Next Steps

This notebook walked through a hardware-oriented proof of concept for Agnitra AI as an integrated pipeline: it collects telemetry, converts the model to an IR, queries the Codex/GPT models for optimisation suggestions when an API key and model access are available, simulates RL-based tuning, and defines a helper (`improve_python_code`) for scanning and improving inefficient Python code. By exposing the pipeline through a single function (`optimize_model`), developers can adopt Agnitra quickly and benefit from dynamic tuning and hardware-aware code generation.

To advance this prototype, one could integrate real Triton kernels, extend the RL environment to tune additional parameters, and package the modules into an installable `agnitra` library and CLI. Coupled with the business model and roadmap, these enhancements would position Agnitra as a compelling investment opportunity in the AI infrastructure space.

In [None]:
import logging
from agnitra.sdk import optimizer as sdk_optimizer
logging.basicConfig(level=logging.INFO)

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = TinyNet().to(device)
input_tensor = torch.randn(1, 4, device=device)
telemetry = sdk_optimizer.collect_telemetry(model, input_tensor)
print('Telemetry:', telemetry)
ir_nodes = sdk_optimizer.extract_ir(model, telemetry)
suggestion = sdk_optimizer.request_kernel_suggestions(telemetry, ir_nodes)
print('LLM suggestion:', suggestion)
sdk_optimizer.optimize_model(model, input_tensor, enable_rl=False)


In [None]:
from agnitra.demo.demo import profile_sample_models
profile_sample_models()

In [None]:
!pytest -k "optimizer or telemetry"