# AI-Powered Code Generation: Python to C++ Optimization

## 📋 Overview
This notebook explores the capabilities of state-of-the-art LLMs for **automated code generation**, specifically converting Python code to high-performance C++ implementations. It is an experiment carried out while completing the [course](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/?couponCode=MT250923G1).

## 🎯 Objectives
- **Performance Benchmarking**: Compare execution speed improvements between Python and AI-generated C++ code
- **Model Evaluation**: Test and compare different LLMs' code generation capabilities
- **Pipeline Development**: Build an automated conversion pipeline for production use

## 🔬 Methodology
We'll evaluate models using:
1. **Computational Tasks**: Mathematical algorithms (π computation, array processing)
2. **Performance Metrics**: Execution time comparison and correctness validation  
3. **Code Quality**: Analyzing generated C++ code structure and optimization techniques

## 💡 Expected Outcomes
- Identify the best-performing LLM for code generation tasks
- Quantify performance improvements (targeting 10x+ speedup)
- Establish a reliable conversion pipeline for Python → C++ optimization

## 🔍 Model Selection & Evaluation Framework

### 📊 Industry Benchmarking Sources
To select optimal models for our code generation task, we analyze current industry leaderboards:

| **Leaderboard** | **Focus Area** | **Key Metrics** |
|-----------------|----------------|-----------------|
| [Vellum AI Coding](https://www.vellum.ai/best-llm-for-coding) | General coding tasks | Multi-language performance |
| [SEAL Leaderboard](https://scale.com/leaderboard/coding) | Software engineering | Real-world problem solving |

### 🧪 Understanding Coding Benchmarks

Modern LLM evaluation relies on several standardized benchmarks, each testing different aspects of coding capability:

#### **🎯 Core Benchmarks**

| **Benchmark** | **Problems** | **Focus** | **Difficulty** |
|---------------|-------------|-----------|----------------|
| **HumanEval** | 164 Python problems | Language comprehension, algorithms, mathematics | Entry to intermediate |
| **MBPP** | 974 programming tasks | Natural language → code synthesis | Entry level |
| **SWE-Bench** | Real GitHub issues | Production code debugging & enhancement | Advanced |
| **LiveCodeBench** | Contest problems (LeetCode, AtCoder, CodeForces) | Competitive programming, self-repair, execution | Advanced |

#### **🔬 Why These Benchmarks Matter for Our Task**
- **HumanEval**: Tests fundamental algorithmic thinking crucial for optimization tasks
- **SWE-Bench**: Evaluates real-world code manipulation skills needed for language conversion
- **LiveCodeBench**: Ensures models can handle complex computational problems like our test cases


### 🎯 Model Selection Strategy

For this experiment, we focus on **high-performance models accessible through free tiers** to ensure reproducibility and cost-effectiveness.

#### **🏆 Selected Models**

Based on [LLM Stats HumanEval Leaderboard](https://llm-stats.com/benchmarks/humaneval), our candidates are:

| **Model** | **Parameters** | **HumanEval** | **SWE-Bench** | **Availability** | **Expected Performance** |
|-----------|----------------|---------------|----------------|------------------|-------------------------|
| **Kimi Dev 72B** | 72B | ~85% | 60.4% | OpenRouter Free | Strong mathematical reasoning |
| **Qwen3 Coder 480B** | 480B (A35B active) | ~90% | 69.6% | OpenRouter Free | Superior code generation |

#### **🔮 Performance Hypothesis**
- **Qwen3 Coder**: Expected to excel due to specialized training on code and larger active parameter count
- **Kimi Dev**: Strong mathematical foundation may benefit algorithmic tasks
- **Key Test**: Which model produces more optimized C++ code with better performance characteristics?

#### **⚖️ Evaluation Criteria**
1. **Correctness**: Identical output to Python implementation
2. **Performance**: Execution speed improvement
3. **Code Quality**: C++ best practices and optimization techniques
4. **Generation Speed**: Model response time 

In [None]:
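# Models (OpenRouter model IDs)

MODEL_KIMI_K2 = "moonshotai/kimi-dev-72b:free"  # note: despite the constant name, this ID is Kimi Dev 72B
MODEL_QWEN_3_CODER = "qwen/qwen3-coder:free"
MODEL_GEMINI_2_FLASH = "google/gemini-2.5-flash:free"  # only referenced in commented-out code below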

## Set up

In [62]:
# Imports

import os
import io
import sys
import re
import enum
from dataclasses import dataclass
from dotenv import load_dotenv
from google import genai
from openai import OpenAI
from IPython.display import Markdown, display, update_display
import gradio as gr
import subprocess

In [59]:
# Environment

load_dotenv(override=True)
open_router_api_key = os.getenv('OPEN_ROUTER_API_KEY', 'NO_OPEN_ROUTER_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY', 'NO_GOOGLE_API_KEY')

if open_router_api_key == 'NO_OPEN_ROUTER_API_KEY':
    raise ValueError('Please set OPEN_ROUTER_API_KEY in your environment variables')
else:
    print('OpenRouter API key is set')

if google_api_key == 'NO_GOOGLE_API_KEY':
    raise ValueError('Please set GOOGLE_API_KEY in your environment variables')
else:
    print('Google API key is set')

OPEN_ROUTER_API_URL = "https://openrouter.ai/api/v1"

OpenRouter API key is set
Google API key is set


In [None]:
# Set up OpenRouter client
or_client = OpenAI(base_url=OPEN_ROUTER_API_URL, api_key=open_router_api_key)

google_client = genai.Client(api_key=google_api_key)

In [None]:
# Prompts

system_message = """
You are an assistant that reimplements Python code in high performance C++ for an M1 Mac. 
Respond only with C++ code; use comments sparingly and do not provide any explanation other than occasional comments. 
The C++ response needs to produce an identical output in the fastest possible time.
"""

def user_prompt_for(python_code: str) -> str:
    # Note: the explicit "#include" reminder is not needed for Claude but is needed for GPT; keep it so the prompt covers more models and cases.
    user_prompt = """
    Rewrite this Python code in C++ with the fastest possible implementation that produces identical output in the least time. 
    Respond only with C++ code; do not explain your work other than a few comments. 
    Pay attention to number types to ensure no int overflows. Remember to #include all necessary C++ packages such as iomanip.

    Response should be only C++ code

    Python code:
    """
    user_prompt += python_code
    return user_prompt

def messages_for(python_code: str) -> list[dict]:
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt_for(python_code)}
    ]


In [None]:
# Write to a file called optimized.cpp

def clean_think_blocks(text):
    """Remove content between <think>...</think> or ◁think▷...◁/think▷ markers."""
    # Remove <think>...</think> blocks
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.IGNORECASE | re.DOTALL)
    # Remove ◁think▷...◁/think▷ blocks emitted by the Kimi model, plus any trailing whitespace
    text = re.sub(r'◁think▷.*?◁/think▷\s*', '', text, flags=re.DOTALL)

    return text

def write_output(cpp_code: str, file_name: str = "optimized.cpp"):
    code = clean_think_blocks(cpp_code)
    code = code.replace("```cpp","").replace("```","")

    with open(file_name, "w") as f:
        f.write(code)
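
Stripping fence markers with `replace` keeps any prose the model writes around the code. A minimal alternative sketch is shown below, assuming the reply contains at most one fenced C++ block; `extract_cpp`, `THINK_PATTERNS` and `FENCE` are hypothetical names that are not part of the original notebook.

```python
import re

# Hypothetical helper, not part of the original notebook.
THINK_PATTERNS = [
    r"<think>.*?</think>",   # generic reasoning tags
    r"◁think▷.*?◁/think▷",   # markers seen in the Kimi output
]
# Opening fence with an optional cpp / c++ tag, then everything up to the next fence.
FENCE = re.compile(r"```(?:cpp|c\+\+)?[ \t]*\n(.*?)```", re.DOTALL | re.IGNORECASE)

def extract_cpp(reply: str) -> str:
    """Drop reasoning blocks, then return the first fenced block (or the whole text)."""
    for pattern in THINK_PATTERNS:
        reply = re.sub(pattern, "", reply, flags=re.DOTALL | re.IGNORECASE)
    match = FENCE.search(reply)
    return (match.group(1) if match else reply).strip() + "\n"

sample = "◁think▷plan the loop◁/think▷\n\n```cpp\nint main() { return 0; }\n```\nDone."
print(extract_cpp(sample))
```

If the reply has no fence at all, the sketch falls back to the cleaned text, which matches what `write_output` would write in that case.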

In [None]:
def optimize_kimi(python_code: str):    
    stream = or_client.chat.completions.create(model=MODEL_KIMI_K2, messages=messages_for(python_code), stream=True)
    reply = ""
    for chunk in stream:
        fragment = chunk.choices[0].delta.content or ""
        reply += fragment
        print(fragment, end='', flush=True)
    write_output(reply, file_name="optimized_kimi.cpp")

def optimize_qwen(python_code: str):
    stream = or_client.chat.completions.create(model=MODEL_QWEN_3_CODER, messages=messages_for(python_code), stream=True)
    reply = ""
    for chunk in stream:
        fragment = chunk.choices[0].delta.content or ""
        reply += fragment
        print(fragment, end='', flush=True)
    write_output(reply, file_name="optimized_qwen.cpp")
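
Generation speed is one of the evaluation criteria, but the two functions above do not measure it. One possible way to time a streamed reply is sketched here; `timed_collect` and `fake_stream` are hypothetical names, and the fake stream only stands in for a real model stream so the example runs offline.

```python
import time
from typing import Iterable, Tuple

def timed_collect(fragments: Iterable[str]) -> Tuple[str, float]:
    """Join streamed text fragments and return (full_text, elapsed_seconds)."""
    start = time.perf_counter()
    reply = "".join(fragments)  # consuming the iterable happens inside the timed region
    return reply, time.perf_counter() - start

def fake_stream():
    # Stand-in for a model stream; each piece arrives after a short delay.
    for piece in ["#include <iostream>\n", "int main() {}\n"]:
        time.sleep(0.01)
        yield piece

text, seconds = timed_collect(fake_stream())
print(f"{len(text)} characters in {seconds:.3f} s")
```

With the OpenRouter client, the fragments could be supplied as `(chunk.choices[0].delta.content or "" for chunk in stream)`. Creating the stream itself would need to be inside the timed region too if time to first token should count.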


## 🧮 Experiment 1: Pi Computation via Riemann Sum

### 📝 Task Description
**Objective**: Convert a Python π computation algorithm to optimized C++ code

**Algorithm**: Riemann sum approximation using the integral formula:
```
π = 4 ∫₀¹ 1/(1+x²) dx
```
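
To make the formula concrete, here is a small sketch of the same midpoint Riemann sum with far fewer iterations than the benchmark uses. It samples the integrand at the midpoints `x = (i + 0.5) * step`; the iteration counts are illustrative choices, not values from the experiment.

```python
import math

def integrate_pi(iterations: int) -> float:
    """Midpoint Riemann sum for 4 / (1 + x^2) on [0, 1]."""
    step = 1.0 / iterations
    total = 0.0
    for i in range(iterations):
        x = (i + 0.5) * step          # midpoint of the i-th sub-interval
        total += 4.0 / (1.0 + x * x)
    return step * total

for n in (10, 100, 1000):
    approx = integrate_pi(n)
    print(f"n={n:>5}  approx={approx:.10f}  error={abs(approx - math.pi):.2e}")
```

Because the midpoint rule is second-order accurate, the error should drop by roughly a factor of 100 each time `n` grows tenfold, which is why 100M iterations is far more than the 12 printed decimals require and serves mainly as a CPU workload.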

### 🔍 Why This Test Case?
- **Mathematical Foundation**: Tests models' understanding of numerical computation
- **Loop Optimization**: Critical for performance - models must optimize the 100M iteration loop
- **Precision Handling**: Requires careful floating-point arithmetic
- **Common Benchmark**: Widely used in performance testing, enabling comparison with established baselines

### 📊 Success Metrics
- ✅ **Correctness**: Result matches Python output (3.141592653590)
- ⚡ **Performance**: Target >50x speedup over Python
- 🔧 **Code Quality**: Proper C++ optimization techniques (compiler flags, data types)
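
The correctness and performance checks can be automated by parsing what both programs print. The sketch below assumes both outputs contain `Result:` and `Execution Time:` lines in the format used by the Python script; `parse_run` and `compare` are hypothetical helpers, and the sample strings are copied from runs recorded in this notebook.

```python
import re

def parse_run(output: str) -> tuple[str, float]:
    """Return (result_text, seconds) from 'Result:' and 'Execution Time:' lines."""
    result = re.search(r"Result:\s*(\S+)", output).group(1)
    seconds = float(re.search(r"Execution Time:\s*([0-9.]+)", output).group(1))
    return result, seconds

def compare(python_output: str, cpp_output: str) -> tuple[bool, float]:
    """Return (outputs_match, speedup_factor)."""
    py_result, py_seconds = parse_run(python_output)
    cpp_result, cpp_seconds = parse_run(cpp_output)
    return py_result == cpp_result, py_seconds / cpp_seconds

python_out = "Result: 3.141592653590\nExecution Time: 6.972198 seconds\n"
cpp_out = "Result: 3.141592653590\nExecution Time: 0.118013 seconds\n"

same, speedup = compare(python_out, cpp_out)
print(f"match={same}  speedup={speedup:.1f}x")
```

Comparing the printed text rather than parsed floats checks the "identical output" requirement literally, including the number of decimals.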


In [28]:
pi = """
import time

def integrate_pi(iterations: int) -> float:
    step = 1.0 / iterations
    total = 0.0
    for i in range(iterations):
        x = (i + 0.5) * step
        total += 4.0 / (1.0 + x*x)
    return step * total

start_time = time.time()
result = integrate_pi(100_000_000)
end_time = time.time()

print(f"Result: {result:.12f}")
print(f"Execution Time: {(end_time - start_time):.6f} seconds")
"""



In [29]:
exec(pi)

Result: 3.141592653590
Execution Time: 7.109315 seconds


Now we are ready to try it.

Let's start with Kimi.

In [30]:
optimize_kimi(pi)

◁think▷Okay, let's tackle this problem. The user wants to rewrite the given Python code into C++ with the fastest possible implementation. The goal is to produce the same output but in record time. Let me think about how to approach this.

First, the Python function integrate_pi calculates an approximation of pi using the Monte Carlo method. The user provided code uses a loop that runs for a specified number of iterations (like 100 million). The main aspect here is the loop, which adds up the terms of the series. Converting this to C++ requires translating that loop and ensuring that the number types are appropriate to prevent integer overflow, which the user specifically mentioned.

Looking at the Python code: the step is 1.0 divided by iterations. Then, for each i from 0 to iterations-1, x is (i + 0.5) * step, and then total accumulates 4.0 divided by (1.0 + x*x). Finally, the result is step multiplied by the total sum.

In C++, we need to replicate this. The range function in Python

NameError: name 'write_output' is not defined

The first attempt failed to write the file (`write_output` had not been defined when the cell ran), but here is the C++ code the model actually produced:

```cpp
#include <iostream>
#include <chrono>
#include <iomanip>

double integrate_pi(int iterations) {
    double step = 1.0 / iterations;
    double total = 0.0;
    for (int i = 0; i < iterations; i++) {
        double x = (i + 0.5) * step;
        total += 4.0 / (1.0 + x * x);
    }
    return step * total;
}

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    double result = integrate_pi(100'000'000);
    auto end = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> elapsed = end - start;

    std::cout << "Result: " << std::fixed << std::setprecision(12) << result << std::endl;
    std::cout << "Execution Time: " << std::fixed << std::setprecision(6) << elapsed.count() << " seconds" << std::endl;

    return 0;
}
```


In [None]:
!clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized_kimi optimized_kimi.cpp
!./optimized_kimi

Result: 3.141592653590
Execution Time: 0.118013 seconds
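
The `!clang++` shell escapes work in a notebook, but a pipeline would need to capture the output in Python. A minimal sketch using `subprocess` (already imported above) follows; `build_compile_command` and `compile_and_run` are hypothetical helpers, and the flags are simply the ones used in the cell above.

```python
import subprocess

def build_compile_command(source: str, binary: str) -> list[str]:
    """Same flags as the notebook's clang++ cell."""
    return ["clang++", "-O3", "-std=c++17", "-march=armv8.3-a", "-o", binary, source]

def compile_and_run(source: str, binary: str) -> str:
    """Compile the source, run the binary, and return its stdout.

    Raises subprocess.CalledProcessError if compilation or execution fails.
    """
    subprocess.run(build_compile_command(source, binary),
                   check=True, capture_output=True, text=True)
    run = subprocess.run([f"./{binary}"], check=True, capture_output=True, text=True)
    return run.stdout

# Only the command is printed here; calling compile_and_run needs clang++ and the .cpp file.
print(" ".join(build_compile_command("optimized_kimi.cpp", "optimized_kimi")))
```

Passing the command as a list avoids shell interpretation of file names, which matters once model-generated content is involved.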


In [33]:
exec(pi)

Result: 3.141592653590
Execution Time: 6.972198 seconds


### 📈 Experiment 1 Results: Kimi Dev 72B Performance

#### **⚡ Performance Analysis**

| **Metric** | **Python** | **C++ (Kimi)** | **Improvement** |
|------------|------------|----------------|-----------------|
| **Execution Time** | 6.97 seconds | 0.118 seconds | **59.1x faster** |
| **Result Accuracy** | 3.141592653590 | 3.141592653590 | ✅ **Perfect match** |
| **Code Quality** | - | High-performance C++17 | ✅ **Optimized** |

#### **🔍 Key Observations**

**✨ Strengths:**
- **Exceptional Performance**: Nearly 60x speedup exceeds our 50x target
- **Perfect Accuracy**: Identical numerical results demonstrate precision handling
- **Code Quality**: Clean, well-structured C++ with proper includes and timing
- **Thinking Process**: Model showed detailed reasoning (visible in output)

**🛠️ Technical Implementation:**
- Used `double` precision for numerical stability
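- Uses `std::chrono` for timing (the `100'000'000` digit separators require C++14 or later)
- Simple counted loop (with post-increment `i++`) that the compiler can optimize at `-O3`
- Compiled with the optimization flags set in the notebook (`-O3`, `-march=armv8.3-a`), not chosen by the model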

**🎯 Verdict**: Kimi Dev 72B demonstrates excellent mathematical reasoning and C++ optimization capabilities.

In [36]:
optimize_qwen(pi)

```cpp
#include <iostream>
#include <chrono>
#include <iomanip>

double integrate_pi(int iterations) {
    double step = 1.0 / iterations;
    double total = 0.0;
    
    for (int i = 0; i < iterations; ++i) {
        double x = (i + 0.5) * step;
        total += 4.0 / (1.0 + x * x);
    }
    
    return step * total;
}

int main() {
    auto start_time = std::chrono::high_resolution_clock::now();
    double result = integrate_pi(100000000);
    auto end_time = std::chrono::high_resolution_clock::now();
    
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
    
    std::cout << std::fixed << std::setprecision(12);
    std::cout << "Result: " << result << std::endl;
    
    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Execution Time: " << duration.count() / 1000000.0 << " seconds" << std::endl;
    
    return 0;
}
```

The first thing you notice immediately is the generation time: Qwen finished in **4.8 seconds**! The style is slightly different, but overall the two results are essentially the same.

In [37]:
!clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized_qwen optimized_qwen.cpp
!./optimized_qwen


Result: 3.141592653590
Execution Time: 0.104924 seconds


## 📊 Experiment 1: Complete Results & Analysis

### 🏆 Final Performance Comparison

| **Metric** | **Python Baseline** | **Kimi Dev 72B** | **Qwen3 Coder** | **Winner** |
|------------|---------------------|-------------------|------------------|------------|
| **Execution Time** | ~7.0 seconds | 0.118 seconds | 0.105 seconds | 🥇 **Qwen3** |
| **Speedup Factor** | 1x | **59.3x** | **66.7x** | 🥇 **Qwen3** |
| **Result Accuracy** | 3.141592653590 | 3.141592653590 ✅ | 3.141592653590 ✅ | 🤝 **Tie** |
| **Generation Time** | - | ~15-20 seconds | **4.8 seconds** | 🥇 **Qwen3** |
| **Code Quality** | - | Excellent | Excellent | 🤝 **Tie** |

### 🎯 Key Findings

#### **🚀 Performance Excellence**
- Both models achieved **exceptional performance gains** (59-67x speedup)
- **Qwen3 Coder edges out** with slightly better execution time (0.105s vs 0.118s)
- Both far exceeded our **50x speedup target**

#### **⚡ Generation Speed**
- **Qwen3 Coder**: roughly **3-4x faster generation** (4.8s vs ~15-20s)
- Critical for production pipelines where conversion time matters
- Kimi's thinking process adds value but impacts speed

#### **🔬 Code Analysis**

**Similarities:**
- Both use `double` precision for numerical stability
- Proper C++17 chrono for timing
- Identical algorithmic approach
- Clean, professional code structure

**Differences:**
- **Qwen3**: More concise, direct implementation
- **Kimi**: More verbose, with a detailed thinking process which can be useful in harder tasks

### 📈 Statistical Summary

```
Speedup Over Python (~7.0 s baseline):
├── Kimi Dev 72B:    59.3x (≈5,830% faster)
├── Qwen3 Coder:     66.7x (≈6,570% faster)
└── Average:         63.0x

Generation Efficiency:
├── Qwen3 Coder:     4.8 seconds
├── Kimi Dev 72B:    ~17.5 seconds (estimated)
└── Difference:      ~73% less generation time
```

### 🏅 Conclusions

#### **🥇 Overall Winner: Qwen3 Coder**
**Reasons:**
1. **Superior Performance**: About 12% faster execution in this single run (0.105s vs 0.118s)
2. **Rapid Generation**: Roughly 3-4x faster model response
3. **Production Ready**: Optimal for automated pipelines
4. **Identical Accuracy**: Perfect mathematical precision

#### **🥈 Strong Second: Kimi Dev 72B**
**Strengths:**
1. **Detailed Reasoning**: Valuable thinking process insights
2. **Excellent Performance**: Still achieved 59x speedup
3. **Educational Value**: Great for understanding optimization approaches

### 🔬 Technical Insights

Both models demonstrated:
- ✅ **Perfect algorithmic understanding**
- ✅ **Optimal C++ optimization techniques**
- ✅ **Proper numerical precision handling**
- ✅ **Professional code structure**

## Experiment 2: Maximum Subarray Sum

Given an array of random numbers (positive and negative), the task is to find the maximum subarray sum.
Instead of generating random numbers with Python's `random` module, we provide an implementation of a simple linear congruential generator (LCG), so the C++ version can reproduce exactly the same sequence.
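
A quick sketch of why the generator is portable: with `m = 2**32`, taking `% m` is the same as keeping the low 32 bits, which is what unsigned 32-bit arithmetic does on typical C++ platforms. The `lcg` function mirrors the one used in the experiment; `to_range` is a hypothetical helper that repeats the mapping from the list comprehension.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Linear congruential generator with the same parameters as the experiment."""
    value = seed
    while True:
        value = (a * value + c) % m
        yield value

def to_range(value: int, min_val: int, max_val: int) -> int:
    """Map a raw LCG value into [min_val, max_val] the way the experiment does."""
    return value % (max_val - min_val + 1) + min_val

gen = lcg(42)
first = next(gen)
print(first)  # 1664525 * 42 + 1013904223 = 1083814273, which is below 2**32

# Modulo 2**32 and masking the low 32 bits agree.
assert first == (1664525 * 42 + 1013904223) & 0xFFFFFFFF
print([to_range(next(gen), -10, 10) for _ in range(5)])
```

Later values do exceed 32 bits before the reduction, so a C++ port has to either use an unsigned 32-bit type that wraps or compute in 64 bits and reduce explicitly.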

In [40]:
max_sub_array_sum = """# Be careful to support large number sizes

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    value = seed
    while True:
        value = (a * value + c) % m
        yield value
        
def max_subarray_sum(n, seed, min_val, max_val):
    lcg_gen = lcg(seed)
    random_numbers = [next(lcg_gen) % (max_val - min_val + 1) + min_val for _ in range(n)]
    max_sum = float('-inf')
    for i in range(n):
        current_sum = 0
        for j in range(i, n):
            current_sum += random_numbers[j]
            if current_sum > max_sum:
                max_sum = current_sum
    return max_sum

def total_max_subarray_sum(n, initial_seed, min_val, max_val):
    total_sum = 0
    lcg_gen = lcg(initial_seed)
    for _ in range(20):
        seed = next(lcg_gen)
        total_sum += max_subarray_sum(n, seed, min_val, max_val)
    return total_sum

# Parameters
n = 10000         # Number of random numbers
initial_seed = 42 # Initial seed for the LCG
min_val = -10     # Minimum value of random numbers
max_val = 10      # Maximum value of random numbers

# Timing the function
import time
start_time = time.time()
result = total_max_subarray_sum(n, initial_seed, min_val, max_val)
end_time = time.time()

print("Total Maximum Subarray Sum (20 runs):", result)
print("Execution Time: {:.6f} seconds".format(end_time - start_time))
"""

In [41]:
# exec() always returns None, so there is nothing to capture; the script prints its own results
exec(max_sub_array_sum)

Total Maximum Subarray Sum (20 runs): 10980
Execution Time: 41.841283 seconds


In [47]:
optimize_kimi(max_sub_array_sum)

◁think▷Okay, let's tackle this problem. The user wants to rewrite the given Python code in C++ to make it as fast as possible while ensuring identical output. The key parts are the LCG (Linear Congruential Generator), the max_subarray_sum function using Kadane's algorithm, and the total_max_subarray_sum which runs 20 iterations.

First, I need to implement the LCG in C++. The Python version uses generator functions with yield, but in C++, we can use a class or a function that keeps the state. Since the LCG is called multiple times, a class might be better. The LCG parameters (a, c, m) are given, and the seed is initialized. The calculation is (a * value + c) % m. In C++, we need to handle the 32-bit modulus correctly. Using unsigned int might help here, but since m is 2^32, maybe using a 64-bit type to avoid overflow. Wait, m is 2^32, so value is in [0, 2^32 -1]. So using a uint32_t for value would work. But in the calculation a * value could overflow, so perhaps we need to use a 64-bi

In [51]:
!clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized_kimi optimized_kimi.cpp
!./optimized_kimi

Total Maximum Subarray Sum (20 runs): 10980
Execution Time: 0.000485 seconds
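
Kimi's reasoning above mentions Kadane's algorithm. If the generated C++ uses it, the speedup comes partly from a change of algorithm (O(n) instead of the O(n²) double loop), not only from the language. The sketch below compares both approaches on LCG data; `random_numbers`, `brute_force_max` and `kadane_max` are hypothetical names, and the array size is an illustrative choice kept small so the brute-force version finishes quickly.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    value = seed
    while True:
        value = (a * value + c) % m
        yield value

def random_numbers(n, seed, min_val, max_val):
    gen = lcg(seed)
    return [next(gen) % (max_val - min_val + 1) + min_val for _ in range(n)]

def brute_force_max(nums):
    """O(n^2): same double loop as the Python reference."""
    best = float("-inf")
    for i in range(len(nums)):
        current = 0
        for j in range(i, len(nums)):
            current += nums[j]
            best = max(best, current)
    return best

def kadane_max(nums):
    """O(n): best sum of a non-empty subarray ending at each position."""
    best = current = nums[0]
    for x in nums[1:]:
        current = max(x, current + x)  # extend the run or restart at x
        best = max(best, current)
    return best

nums = random_numbers(500, 42, -10, 10)
print(brute_force_max(nums), kadane_max(nums))
```

Both functions require a non-empty subarray, so they also agree on all-negative input. When comparing models, it is worth checking whether the generated code kept the original algorithm before attributing the timing difference to C++ alone.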


In [None]:
optimize_qwen(max_sub_array_sum)

In [None]:
!clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized_qwen optimized_qwen.cpp
!./optimized_qwen

## UI

Let's also add a Gradio UI.

<span style="color:red">Be careful: never use "share=True" for this project, as it would create a security vulnerability.</span>

~~We can notice that the SWE-bench leaderboard contains one more interesting candidate: **Gemini 2.5 Pro** with 53.6%.~~

In [63]:
class Provider(enum.Enum):
    OPEN_AI = "OpenAI"
    # GOOGLE = "Google"

@dataclass
class ModelInfo:
    model_name: str
    model_id: str
    client: Provider

AVAILABLE_MODELS = [
    ModelInfo(model_name="Kimi K2", model_id=MODEL_KIMI_K2, client=Provider.OPEN_AI),
    ModelInfo(model_name="Qwen 3 Coder", model_id=MODEL_QWEN_3_CODER, client=Provider.OPEN_AI),
    # ModelInfo(model_name="Gemini 2 Flash", model_id=MODEL_GEMINI_2_FLASH, client=Provider.GOOGLE),
]

In [64]:
def stream_model_open_ai_client(model, python):    
    stream = or_client.chat.completions.create(model=model, messages=messages_for(python), stream=True)
    reply = ""
    for chunk in stream:
        fragment = chunk.choices[0].delta.content or ""
        reply += fragment
        yield reply

def stream_model(model_info, python):
    if model_info.client == Provider.OPEN_AI:
        return stream_model_open_ai_client(model_info.model_id, python)
    else:
        raise ValueError("Unknown client")

def trim_result(result):
    # Reuse the same think-block cleanup that write_output applies
    result = clean_think_blocks(result)
    result = result.replace('```cpp\n','').replace('```','')
    return result

def optimize(python, model):
    # Look up the selected model by its display name
    model_info = next((m for m in AVAILABLE_MODELS if m.model_name == model), None)
    if model_info is None:
        raise ValueError("Unknown model")
    result = stream_model(model_info, python)
    
    # Stream all output to thinking first
    final_result = ""
    for stream_buffer in result:
        final_result = stream_buffer
        yield stream_buffer, ""  
    
    # When streaming is complete, trim result and show in cpp
    trimmed_cpp = trim_result(final_result)
    yield final_result, trimmed_cpp


In [None]:
available_models = [model.model_name for model in AVAILABLE_MODELS]

with gr.Blocks() as ui:
    with gr.Row():
        python = gr.Textbox(label="Python code:", lines=15, value=max_sub_array_sum)
        with gr.Column():
            thinking = gr.Textbox(label="Thinking:", lines=5)
            cpp = gr.Textbox(label="C++ code:", lines=10)
    with gr.Row():
        model = gr.Dropdown(available_models, label="Select model", value="Kimi K2")
        convert = gr.Button("Convert code")

    convert.click(optimize, inputs=[python, model], outputs=[thinking, cpp])

ui.launch(inbrowser=True)