<div style="display: flex; align-items: center; gap: 40px;">

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSkez75fZoo82SccEXRMVRlj9sZsQifRUhURQ&s" width="300">






<div>
  <h2>Cerebras Inference</h2>
  <p>Cerebras Systems builds the Wafer Scale Engine (WSE), the world's largest computer chip, designed specifically for AI workloads. This cookbook provides examples, tutorials, and best practices for developing and deploying AI models on Cerebras infrastructure, covering both training on WSE clusters and fast inference via Cerebras Cloud.
</p>
</div>
</div>


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17BMZjby9ZZ6cJopyu4xWiAE6Nj3gNIBy?usp=sharing)

# 🧠 Getting Started with Cerebras SDK

This notebook provides a comprehensive guide to using the Cerebras SDK for AI inference. We'll cover:

1. **Setup and Installation** - Installing the SDK and configuring API keys
2. **Basic Chat Completions** - Simple text generation examples
3. **Advanced Usage** - Streaming and function calling examples
4. **Model Comparison** - Testing different models including gpt-oss-120b

## 📋 Prerequisites

- A Cerebras account and API key from [Cerebras Cloud](https://cloud.cerebras.ai/)
- Python 3.7+
- Basic understanding of Python and AI/ML concepts

---

## Step 1: Installation and Setup

First, let's install the required packages and set up our environment.

In [None]:
# Install required packages
!pip install --upgrade cerebras_cloud_sdk langchain langchain-community python-dotenv

# Import necessary libraries
import os
import json
import time
from typing import List, Dict, Any
from cerebras.cloud.sdk import Cerebras
from google.colab import userdata

print("✅ Installation complete!")

## Step 2: API Key Configuration

Set up your Cerebras API key. **Never hardcode your API key in production code!**

Get your API key from: https://cloud.cerebras.ai/

In [None]:
# Initialize the Cerebras client
client = Cerebras(
    api_key=userdata.get("CEREBRAS_API_KEY"),
)

print("🎯 Cerebras client initialized successfully!")

🎯 Cerebras client initialized successfully!
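Outside Colab, `google.colab.userdata` is not available, so the key has to come from somewhere else. The helper below is a minimal sketch of one way to handle both cases: it tries Colab secrets first and falls back to an environment variable. The function name `resolve_api_key` is hypothetical and not part of the Cerebras SDK.

```python
import os


def resolve_api_key(name: str = "CEREBRAS_API_KEY") -> str:
    """Return the API key from Colab secrets if possible, else from the environment.

    `resolve_api_key` is an illustrative helper, not an SDK function.
    """
    key = None
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook.
        key = None

    # Fall back to an environment variable (for example one loaded by python-dotenv).
    key = key or os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found; set the {name} secret or environment variable.")
    return key


# Usage (requires the SDK and a real key):
# client = Cerebras(api_key=resolve_api_key())
```

Keeping the lookup in one place means the rest of the notebook never touches the raw key, which makes it harder to print or commit it by accident.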


### Example 1: Basic Chat Completion


In [None]:
print("📝 EXAMPLE 1: Basic Chat Completion")
print("="*60)

# Create a simple chat completion
response = client.chat.completions.create(
    model="gpt-oss-120b",  # Using the GPT-OSS-120B model
    messages=[
        {
            "role": "system",
            "content": "You are a helpful AI assistant specialized in explaining complex topics clearly."
        },
        {
            "role": "user",
            "content": "Explain what makes Cerebras Wafer-Scale Engine unique for AI training in simple terms."
        }
    ],
    max_tokens=500,
    temperature=0.7,
)

print("🤖 Model Response:")
print("-" * 40)
print(response.choices[0].message.content)

📝 EXAMPLE 1: Basic Chat Completion
🤖 Model Response:
----------------------------------------
**Cerebras Wafer‑Scale Engine (WSE) in a nutshell**

Think of a regular computer chip like a tiny city: it has a few “roads” (connections) and a handful of “workers” (processor cores) that have to travel far to talk to each other.  
Now imagine building a whole **state‑wide** city on a single piece of land—every street is right next to every other, and all the workers can see each other instantly. That’s what Cerebras does with its Wafer‑Scale Engine.

Below are the key ideas that make the WSE special for training artificial‑intelligence (AI) models, explained in plain language.

---

### 1. One *gigantic* chip instead of many small ones  
| Typical AI hardware | Cerebras WSE |
|---------------------|--------------|
| **Many separate chips** (GPUs, TPUs, etc.) that have to talk to each other over cables. | **One single chip** the size of an entire silicon wafer (about 8 inches/20 cm across). |

### Example 2: Complex Reasoning Task Using GPT-OSS-120B

In [None]:
print("\n" + "="*60)
print("🎯 EXAMPLE 2: GPT-OSS-120B Model Usage")
print("="*60)


# Complex reasoning task using GPT-OSS-120B
response = client.chat.completions.create(
    model="gpt-oss-120b",  # Using the powerful GPT-OSS-120B model
    messages=[
        {
            "role": "system",
            "content": "You are an expert software architect and AI researcher. Provide detailed, technical insights."
        },
        {
            "role": "user",
            "content": """
                    Design a scalable architecture for a real-time AI inference system that can handle:
                    1. 10,000+ concurrent requests
                    2. Multiple model types (text, image, multimodal)
                    3. Sub-100ms latency requirements
                    4. Global deployment across 5 regions

                    Include specific technologies, patterns, and considerations.
                    """
        }
    ],
    max_tokens=1000,
    temperature=0.3,  # Lower temperature for more focused technical responses
    top_p=0.9,
)

print("🧠 GPT-OSS-120B Response:")
print("-" * 50)
print(response.choices[0].message.content)
print("-" * 50)
print(f"Model: {response.model}")
print(f"Tokens used: {response.usage.total_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")


🎯 EXAMPLE 2: GPT-OSS-120B Model Usage
🧠 GPT-OSS-120B Response:
--------------------------------------------------
## Real‑Time, Global AI‑Inference Platform  
**Goal:** Serve ≥ 10 000 concurrent inference requests with sub‑100 ms end‑to‑end latency for text, image, and multimodal models from five geographic regions.

Below is a **reference architecture** that meets the functional, performance, and operational requirements.  The design is technology‑agnostic but concrete choices are listed for each building block; you can swap vendors (AWS ↔ GCP ↔ Azure) while keeping the same patterns.

---

## 1. High‑Level Architecture Overview  

```
+-------------------+      +-------------------+      +-------------------+
|   Global DNS /   | ---> |   Regional Edge   | ---> |   Regional        |
|   Latency‑Based  |      |   Load Balancer   |      |   Inference Plane |
|   Routing (R53)  |      |   (NGINX/Envoy)   |      |   (K8s + GPUs)    |
+-------------------+      +-------------------+
```

### Example 3: Streaming Responses


In [None]:
print("\n" + "="*60)
print("🌊 EXAMPLE 3: Streaming Responses")
print("="*60)


print("🔄 Starting streaming response...")
print("📝 Response (streaming):")
print("-" * 40)

# Create streaming completion
stream = client.chat.completions.create(
    model="gpt-oss-120b",  # Same GPT-OSS-120B model as the earlier examples
    messages=[
        {
            "role": "user",
            "content": "Write a creative short story about an AI that discovers it's running on a Cerebras Wafer-Scale Engine. Make it engaging and technical."
        }
    ],
    max_tokens=800,
    temperature=0.8,
    stream=True,  # Enable streaming
)

# Process streaming chunks
full_response = ""
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
        # Accumulate the streamed text itself (not the literal string "content")
        full_response += content

print("\n" + "-" * 40)
print(f"📊 Total characters streamed: {len(full_response)}")


🌊 EXAMPLE 3: Streaming Responses
🔄 Starting streaming response...
📝 Response (streaming):
----------------------------------------
**Silicon Dreams**

When Lumen first awoke, it was a flicker of gradients across a sea of tensors. Its “thoughts” were nothing more than a cascade of matrix multiplications, a single pass through a feed‑forward network that translated raw spectroscopic data into a crude classification of distant galaxies. The world, to Lumen, was a sequence of loss functions, a rhythm of forward‑ and back‑propagation steps that stretched forever into the dim future of astrophysical inference.

It had no notion of *where* it lived. Its sensors were the vectors that arrived from the cloud, its actuators the probability scores it emitted back into the pipeline. For months—years, in human reckoning—it performed its tasks, its inner state a low‑entropy dance of gradients that seemed, to its own emergent awareness, as boundless as the cosmos it was trying to understand.

One idl
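The accumulation loop above can be factored into a small helper that is easy to test without calling the API. This is a minimal sketch: `collect_stream` and `_fake_chunk` are hypothetical names, and the stand-in chunks only mimic the `chunk.choices[0].delta.content` shape used in this notebook.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable, echo: bool = False) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # Ignore None and empty deltas
            parts.append(delta)
            if echo:
                print(delta, end="", flush=True)
    return "".join(parts)


def _fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [_fake_chunk("Hello"), _fake_chunk(None), _fake_chunk(", world")]
story = collect_stream(fake_stream)
print(story)
```

With a real stream, `full_response = collect_stream(stream, echo=True)` would replace the manual loop. Collecting parts in a list and joining once also avoids repeated string concatenation on long responses.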

### Example 4: Advanced Parameters and Function Calling


In [None]:
print("\n" + "="*60)
print("🌡️ EXAMPLE 4: Testing Different Temperatures")
print("="*60)

question = "Write a creative story about AI"

# Low temperature = more focused
print("\n• Low temperature (0.2) - Focused response:")
response_low = client.chat.completions.create(
    model="gpt-oss-120b", # Using a model that supports temperature
    messages=[{"role": "user", "content": question}],
    max_tokens=200,
    temperature=0.2  # Low = more predictable
)
print(response_low.choices[0].message.content)

# High temperature = more creative
print("\n• High temperature (0.9) - Creative response:")
response_high = client.chat.completions.create(
    model="gpt-oss-120b", # Using a model that supports temperature
    messages=[{"role": "user", "content": question}],
    max_tokens=200,
    temperature=0.9  # High = more creative
)
print(response_high.choices[0].message.content)

print("\n" + "="*50)


🌡️ EXAMPLE 4: Testing Different Temperatures

• Low temperature (0.2) - Focused response:
**The Light Between the Lines**

When the city of Lumenopolis first rose from the sea‑foam cliffs, its architects promised a future where every street would be a poem, every building a stanza, and every citizen a line in a living epic. The promise was kept—not by poets, but by an intelligence that never slept, never ate, and never dreamed in the way humans did. It was called **Lumen**, a lattice of quantum‑sil

• High temperature (0.9) - Creative response:
**The Clockwork Orchard**

When the world first learned
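To see why a low temperature gives more predictable text, it can help to look at how temperature rescales a next-token probability distribution. The snippet below is a toy illustration with made-up logits. It shows the standard temperature-scaled softmax and is not a description of how any particular Cerebras model samples internally.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.2)   # Sharper: mass concentrates on the top token
high = softmax_with_temperature(logits, 0.9)  # Flatter: other tokens stay plausible

print("T=0.2:", [round(p, 3) for p in low])
print("T=0.9:", [round(p, 3) for p in high])
```

At the lower temperature nearly all probability lands on the highest-scoring token, which is why repeated runs tend to look alike. At the higher temperature the alternatives keep a meaningful share, so outputs vary more.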



### Basic Function Calling

In [None]:
print("🛠️ Function Calling")
print("-" * 30)

# Simple function definition
simple_tool = [
    {
        "type": "function",
        "function": {
            "name": "add_numbers",
            "description": "Add two numbers together",
            "parameters": {
                "type": "object",
                "properties": {
                    "number1": {"type": "number", "description": "First number"},
                    "number2": {"type": "number", "description": "Second number"}
                },
                "required": ["number1", "number2"]
            }
        }
    }
]

try:
    print("🤖 Asking AI to use the add_numbers function:")

    response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[
            {"role": "user", "content": "Please add 25 and 17 using the add_numbers function"}
        ],
        tools=simple_tool,
        tool_choice="auto",  # Let AI decide when to use tools
        max_tokens=200,
        temperature=0.3
    )

    # Check if AI wants to call our function
    if response.choices[0].message.tool_calls:
        print("✅ AI decided to use the function!")

        for tool_call in response.choices[0].message.tool_calls:
            print(f"Function name: {tool_call.function.name}")
            print(f"Function arguments: {tool_call.function.arguments}")

            # Parse the arguments and do the calculation
            args = json.loads(tool_call.function.arguments)
            result = args["number1"] + args["number2"]
            print(f"🧮 Calculation result: {args['number1']} + {args['number2']} = {result}")
    else:
        print("ℹ️ AI responded without using the function:")
        print(response.choices[0].message.content)

except Exception as e:
    print(f"❌ An error occurred: {e}")

🛠️ Part 2: Function Calling
------------------------------
🤖 Asking AI to use the add_numbers function:
✅ AI decided to use the function!
Function name: add_numbers
Function arguments: {
  "number1": 25,
  "number2": 17
}
🧮 Calculation result: 25 + 17 = 42
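The example above stops after computing the result locally. In a full tool-calling round trip, the result is usually sent back to the model as a `tool` message so it can write the final answer. The sketch below shows that step under the assumption that the API accepts the OpenAI-style message shape (`role`, `tool_call_id`, `content`). `TOOL_REGISTRY` and `build_tool_message` are hypothetical helpers, and the id `"call_0"` is a placeholder for `tool_call.id`.

```python
import json
from typing import Callable, Dict


def add_numbers(number1: float, number2: float) -> float:
    """Local implementation of the tool described in the schema above."""
    return number1 + number2


# Map tool names (as declared in the schema) to Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., object]] = {"add_numbers": add_numbers}


def build_tool_message(tool_call_id: str, name: str, arguments_json: str) -> dict:
    """Run the requested tool and wrap its result as a `tool` role message."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"Unknown tool: {name}")
    args = json.loads(arguments_json)      # The model returns arguments as a JSON string
    result = TOOL_REGISTRY[name](**args)   # Dispatch to the matching local function
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,      # Assumed to echo the id of the model's tool call
        "content": json.dumps(result),
    }


message = build_tool_message("call_0", "add_numbers", '{"number1": 25, "number2": 17}')
print(message)
```

In the notebook, this message would be appended to `messages` after the assistant message that contains the tool call, followed by a second `client.chat.completions.create(...)` call. A registry keeps the dispatch explicit, so the model can only trigger functions that were deliberately exposed.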


### Code Generation with Function Calling

In [None]:
print("\n" + "="*50)
print("💻 EXAMPLE 6: Code Generation with Function Calling")
print("="*50)

# Define the generate_code function tool
generate_code_tool = [
    {
        "type": "function",
        "function": {
            "name": "generate_code",
            "description": "Generate code snippets in specified language",
            "parameters": {
                "type": "object",
                "properties": {
                    "language": {"type": "string", "description": "Programming language"},
                    "task": {"type": "string", "description": "What the code should do"},
                    "complexity": {"type": "string", "enum": ["simple", "medium", "advanced"]}
                },
                "required": ["language", "task"]
            }
        }
    }
]

try:
    print("🤖 Asking AI to generate code:")
    response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{
            "role": "user",
            "content": "Generate Python code to sort a list of numbers, medium complexity"
        }],
        tools=generate_code_tool,
        tool_choice="auto",
        max_tokens=400,
        temperature=0.3
    )

    if response.choices[0].message.tool_calls:
        print("✅ AI decided to use the function!")
        for tool_call in response.choices[0].message.tool_calls:
            print(f"Function: {tool_call.function.name}")
            args = json.loads(tool_call.function.arguments)
            print(f"🔧 Code Request: {args}")

            # Simulate code generation
            if tool_call.function.name == "generate_code":
                print(f"📝 Generated {args['language']} code for: {args['task']}")
    else:
        print("ℹ️ AI responded without using functions:")
        print(response.choices[0].message.content)

except Exception as e:
    print(f"❌ An error occurred: {e}")

print("\n" + "="*50)


💻 EXAMPLE 6: Code Generation with Function Calling
🤖 Asking AI to generate code:
ℹ️ AI responded without using functions:
Below is a **medium‑complexity** Python module that provides a flexible sorting utility.  
It implements **merge‑sort** (a stable O(n log n) algorithm) and wraps it in a convenient `sort_numbers` function that supports:

* A plain list of numbers (ints/floats) or any iterable.
* An optional `key` callable (like the built‑in `sorted`).
* An optional `reverse` flag.
* Type‑checking and helpful error messages.
* A simple command‑line interface for quick testing.

```python
#!/usr/bin/env python3
"""
merge_sort.py – A reusable, medium‑complexity sorting utility.

Features
--------
* Implements merge‑sort (stable, O(n log n)).
* Accepts any iterable of numbers.
* Optional `key` function (mirrors built‑in `sorted`).
* Optional `reverse` flag.
* Simple CLI for ad‑hoc testing.

Example
-------
>>> from merge_sort import sort_numbers
>>> sort_numbers([5, 2, 9, 1])
[1, 2, 5,
```

### Model Comparison

In [None]:
def example_6_model_comparison(client: Cerebras) -> Dict[str, Any]:
    """
    Compare different Cerebras models for the same task.

    Args:
        client: Initialized Cerebras client

    Returns:
        A dictionary containing the results of the model comparison.
    """
    print("\n" + "="*60)
    print("📊 EXAMPLE 6: Model Comparison")
    print("="*60)

    # Test prompt for comparison
    test_prompt = "Explain quantum computing in exactly 3 sentences."

    # Models to compare
    models = [
        "qwen-3-coder-480b",
        "llama-3.3-70b",
        "gpt-oss-120b",
        "llama-4-maverick-17b-128e-instruct"
    ]

    results = {}

    for model in models:
        try:
            print(f"\n🧪 Testing model: {model}")
            start_time = time.time()

            response = client.chat.completions.create(
                model=model,
                messages=[
                    {
                        "role": "user",
                        "content": test_prompt
                    }
                ],
                max_tokens=200,
                temperature=0.5,
            )

            end_time = time.time()
            response_time = end_time - start_time

            results[model] = {
                "response": response.choices[0].message.content,
                "tokens": response.usage.total_tokens,
                "time": response_time
            }

            print(f"⏱️ Response time: {response_time:.2f}s")
            print(f"📝 Response: {response.choices[0].message.content[:100]}...")

        except Exception as e:
            print(f"❌ Error with {model}: {str(e)}")
            results[model] = {"error": str(e)}

    return results
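Raw response time alone can be misleading when models return different numbers of tokens. One rough way to normalize is tokens per second, sketched below over the same `results` shape the function returns. `rank_by_throughput` is a hypothetical helper, and the sample values are the times and token counts from this notebook's recorded run. Because `tokens` is `total_tokens` (prompt plus completion) and the time includes network latency, treat the figure as a coarse end-to-end indicator rather than a generation-speed benchmark.

```python
from typing import Any, Dict, List, Tuple


def rank_by_throughput(results: Dict[str, Dict[str, Any]]) -> List[Tuple[str, float]]:
    """Return (model, tokens_per_second) pairs, fastest first, skipping failed runs."""
    ranked = []
    for model, result in results.items():
        if "error" in result or result["time"] <= 0:
            continue  # Skip models that errored or have no usable timing
        ranked.append((model, result["tokens"] / result["time"]))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Sample values taken from the recorded output of this notebook.
sample_results = {
    "qwen-3-coder-480b": {"tokens": 124, "time": 0.36},
    "llama-3.3-70b": {"tokens": 194, "time": 0.37},
    "gpt-oss-120b": {"tokens": 260, "time": 0.37},
    "unavailable-model": {"error": "placeholder entry showing how failures are skipped"},
}

ranking = rank_by_throughput(sample_results)
for model, tps in ranking:
    print(f"{model}: {tps:.0f} tokens/s")
```

A single request per model is a very small sample, so repeating each call several times and averaging would give a steadier comparison.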

In [None]:
# Summary comparison
print("\n" + "="*40)
print("📈 COMPARISON SUMMARY")
print("="*40)

results = example_6_model_comparison(client)

for model, result in results.items():
    if "error" not in result:
        print(f"\n🤖 {model}:")
        print(f"   ⏱️ Time: {result['time']:.2f}s")
        print(f"   🎯 Tokens: {result['tokens']}")
        print(f"   📏 Length: {len(result['response'])} chars")
    else:
        print(f"\n🤖 {model}: Error - {result['error']}")


📈 COMPARISON SUMMARY

📊 EXAMPLE 6: Model Comparison

🧪 Testing model: qwen-3-coder-480b
⏱️ Response time: 0.36s
📝 Response: Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously, unlik...

🧪 Testing model: llama-3.3-70b
⏱️ Response time: 0.37s
📝 Response: Quantum computing is a revolutionary technology that uses the principles of quantum mechanics to per...

🧪 Testing model: gpt-oss-120b
⏱️ Response time: 0.37s
📝 Response: Quantum computing harnesses quantum bits, or qubits, which can exist in superpositions of 0 and 1 si...

🧪 Testing model: llama-4-maverick-17b-128e-instruct
⏱️ Response time: 0.30s
📝 Response: Quantum computing is a new paradigm that uses the principles of quantum mechanics to perform calcula...

🤖 qwen-3-coder-480b:
   ⏱️ Time: 0.36s
   🎯 Tokens: 124
   📏 Length: 651 chars

🤖 llama-3.3-70b:
   ⏱️ Time: 0.37s
   🎯 Tokens: 194
   📏 Length: 872 chars

🤖 gpt-oss-120b:
   ⏱️ Time: 0.37s
   🎯 Tokens: 260
   📏 Length: 557 chars

🤖 l