<div style="display: flex; align-items: center; gap: 40px;">

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSkez75fZoo82SccEXRMVRlj9sZsQifRUhURQ&s" width="240">
<img src="https://pbs.twimg.com/profile_images/1783589223406415872/3KMxGGrF_400x400.jpg" width="130">






<div>
  <h2>Cerebras Inference</h2>
  <p>Cerebras Systems builds the world's largest computer chip, the Wafer Scale Engine (WSE), designed specifically for AI workloads. This cookbook provides examples, tutorials, and best practices for developing and deploying AI models on Cerebras infrastructure, covering both training on WSE clusters and fast inference via Cerebras Cloud.</p>

  <h2>LiteLLM 🚅</h2>
    <p>LiteLLM provides a unified API for 100+ large language models (LLMs). It handles model integration, spend tracking, rate limiting, fallbacks, and observability, so developers can work with models from OpenAI, Anthropic, Groq, Cohere, Google, and other providers through a single interface.</p>
  </div>
</div>



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1KrVzIjva5AqYwuaciHBNY8eIkrnIx1VJ?usp=sharing)

## Get Your API Keys

Before you begin, make sure you have:

1. A Cerebras API key (Get yours at [Cerebras Cloud](https://cloud.cerebras.ai/))
2. Basic familiarity with Python and Jupyter notebooks

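This notebook is designed to run in Google Colab, so no local Python installation is required. Store the key as a Colab secret named `CEREBRAS_API_KEY`; the setup cell reads it with `userdata.get`.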

### Cerebras using LiteLLM 🚅

### Install Requirements

In [1]:
!pip install -q litellm


### Set Up API Keys

In [2]:
import os
from google.colab import userdata

os.environ["CEREBRAS_API_KEY"] = userdata.get("CEREBRAS_API_KEY")
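Outside Colab, `google.colab` cannot be imported, so the cell above fails there. A minimal sketch of a fallback, assuming the key may instead be supplied as an environment variable. `resolve_api_key` is a hypothetical helper for illustration and is not part of LiteLLM or Colab:

```python
import os


def resolve_api_key(name="CEREBRAS_API_KEY", env=None):
    """Return the key from the environment, else from Colab secrets.

    Hypothetical helper: `env` defaults to os.environ and can be
    replaced with a plain dict when trying the function out.
    """
    env = os.environ if env is None else env
    key = env.get(name)
    if key:
        return key
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
    except ImportError:
        raise RuntimeError(f"{name} is not set and Colab secrets are unavailable")
    key = userdata.get(name)
    if not key:
        raise RuntimeError(f"{name} is empty in Colab secrets")
    return key


# Example with an explicit mapping instead of the real environment.
print(resolve_api_key(env={"CEREBRAS_API_KEY": "demo-key"}))
```

In the notebook, the result could be assigned to `os.environ["CEREBRAS_API_KEY"]` in place of the direct `userdata.get` call.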

### Initialize Cerebras Model via LiteLLM 🚅

In [3]:
import os
import litellm

api_key = os.environ.get("CEREBRAS_API_KEY")

response = litellm.completion(
    model="cerebras/llama3.3-70b",
    api_key=api_key,
    api_base="https://api.cerebras.ai/v1",
    messages=[
        {
            "role": "user",
            "content": "Explain how Cerebras achieves ultra-fast inference speeds."
        }
    ],
    temperature=0.7,
    max_tokens=500
)

print(response['choices'][0]['message']['content'])

Cerebras is a company that specializes in developing high-performance computing systems for artificial intelligence (AI) and machine learning (ML) workloads. They achieve ultra-fast inference speeds through a combination of innovative hardware and software designs. Here are some key factors that contribute to their exceptional performance:

1. **Wafer-Scale Engine (WSE)**: Cerebras' flagship product is the Wafer-Scale Engine, a massive chip that integrates thousands of processing cores, memory, and interconnects on a single wafer of silicon. This design eliminates the need for traditional chip-to-chip communication, reducing latency and increasing bandwidth.
2. **Massive Parallelism**: The WSE features a large number of processing cores, which enables massive parallelism and allows for the simultaneous execution of thousands of instructions. This leads to significant speedups in compute-intensive tasks like matrix multiplication and convolution.
3. **High-Bandwidth Memory**: Cerebras' 
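When a reply is shorter than expected, it helps to check whether generation stopped at the `max_tokens` limit. LiteLLM returns responses in the OpenAI format, where `finish_reason` is `"length"` in that case. A minimal sketch of such a check. `hit_token_limit` is a hypothetical helper, and the stand-in object only mimics the fields being read:

```python
from types import SimpleNamespace


def hit_token_limit(response):
    """Return True if the first choice stopped because of the token limit."""
    return response.choices[0].finish_reason == "length"


# Stand-in objects with the same attribute layout as a completion response,
# so the helper can be tried without calling the API.
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
finished = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])

print(hit_token_limit(cut_off))   # True
print(hit_token_limit(finished))  # False
```

With a real call, `hit_token_limit(response)` returning `True` suggests raising `max_tokens` or asking for a shorter answer.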

### Example 1: Streaming Response

In [4]:
response = litellm.completion(
    model="cerebras/llama3.3-70b",
    api_key=api_key,
    api_base="https://api.cerebras.ai/v1",
    messages=[
        {
            "role": "user",
            "content": "Explain quantum computing in simple terms"
        }
    ],
    temperature=0.8,
    max_tokens=1024,
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end='')

Quantum computing! It's a complex topic, but I'll try to break it down in simple terms.

**Classical Computing vs. Quantum Computing**

Classical computers, like the one you're using now, use "bits" to process information. Bits are either 0 or 1, and they're used to perform calculations and store data. Think of them like light switches – they're either on (1) or off (0).

Quantum computers, on the other hand, use "qubits" (quantum bits). Qubits are special because they can be both 0 and 1 at the same time, which is known as a "superposition." It's like a light switch that's both on and off simultaneously!

**How Qubits Work**

Imagine a coin. In classical computing, the coin can either be heads (0) or tails (1). But in quantum computing, the coin can exist in a state where it's both heads AND tails at the same time. This is called a "superposition."

When you measure the qubit (like flipping the coin), it "collapses" into one state or the other – either 0 or 1. But before measurement, 
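The streaming loop above prints each piece as it arrives but does not keep the full text, and the stream can only be iterated once. A minimal sketch of collecting the pieces into one string. `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks only mimic the `choices[0].delta.content` layout used in the loop:

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed completion into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta is not None:  # some chunks carry no text
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    # Mimics only the fields read above; not a real LiteLLM object.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


full_text = collect_stream([fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")])
print(full_text)  # Hello, world
```

To print and keep the text at the same time, the `print(..., end='')` call can go inside the same loop that appends to `parts`.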

### Example 2: Code Generation with Qwen Coder

In [5]:
response = litellm.completion(
    model="cerebras/qwen-3-coder-480b",
    api_key=api_key,
    api_base="https://api.cerebras.ai/v1",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function to implement a binary search algorithm with detailed comments."
        }
    ],
    temperature=0.2,
    max_tokens=700
)

print(response['choices'][0]['message']['content'])

```python
def binary_search(arr, target):
    """
    Implements binary search algorithm to find a target value in a sorted array.
    
    Binary search works by repeatedly dividing the search interval in half.
    It compares the target value with the middle element of the array and
    eliminates half of the remaining elements based on the comparison.
    
    Args:
        arr (list): A sorted list of elements to search through
        target: The value to search for in the array
    
    Returns:
        int: The index of the target element if found, otherwise -1
    
    Time Complexity: O(log n)
    Space Complexity: O(1)
    
    Example:
        >>> binary_search([1, 3, 5, 7, 9, 11, 13], 7)
        3
        >>> binary_search([1, 3, 5, 7, 9, 11, 13], 4)
        -1
    """
    
    # Initialize the left and right pointers
    # left points to the beginning of the array (index 0)
    # right points to the end of the array (last index)
    left = 0
    right = len(arr) - 1
    
 

### Example 3: Comparing Multiple Models

In [6]:
models = [
    "cerebras/llama3.3-70b",
    "cerebras/qwen-3-coder-480b",
    "cerebras/gpt-oss-120b"
]

prompt = "What are the key features of functional programming?"

for model in models:
    print(f"\n{'='*60}")
    print(f"Model: {model}")
    print(f"{'='*60}")

    response = litellm.completion(
        model=model,
        api_key=api_key,
        api_base="https://api.cerebras.ai/v1",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        max_tokens=300
    )

    print(response['choices'][0]['message']['content'])


Model: cerebras/llama3.3-70b
Functional Programming
### Key Features

The following are the key features of functional programming:

1. **Immutable Data**: Data is never changed in place. Instead, new data structures are created each time the data needs to be updated.
2. **Pure Functions**: Functions have no side effects and always return the same output given the same inputs.
3. **Functions as First-Class Citizens**: Functions can be passed as arguments to other functions, returned as values from functions, and stored in data structures.
4. **Higher-Order Functions**: Functions that take other functions as arguments or return functions as output.
5. **Recursion**: A programming technique where a function calls itself to solve a problem.
6. **Referential Transparency**: The output of a function depends only on its inputs and not on any external state.
7. **Lazy Evaluation**: Expressions are only evaluated when their values are actually needed.
8. **Type Systems**: Functional programmi

### Example 4: Code Debugging with Qwen Coder

In [7]:
buggy_code = '''
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([1, 2, 3, 4, 5])
print(result)
'''

prompt = f"""Analyze this Python code and suggest improvements for edge cases and error handling:

{buggy_code}

Provide an improved version with better error handling."""

response = litellm.completion(
    model="cerebras/qwen-3-coder-480b",
    api_key=api_key,
    api_base="https://api.cerebras.ai/v1",
    messages=[
        {"role": "user", "content": prompt}
    ],
    temperature=0.3,
    max_tokens=600
)

print("Original Code:")
print(buggy_code)
print("\nImproved Version:")
print(response['choices'][0]['message']['content'])

Original Code:

def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([1, 2, 3, 4, 5])
print(result)


Improved Version:
## Analysis of the Original Code

The original `calculate_average` function has several issues with edge cases and error handling:

### Problems Identified:
1. **Division by zero**: When an empty list is passed, `len(numbers)` returns 0, causing a `ZeroDivisionError`
2. **No input validation**: The function doesn't check if the input is actually iterable
3. **No type checking**: Non-numeric values in the list will cause a `TypeError`
4. **No handling of None input**: Passing `None` will result in a `TypeError`

## Improved Version

Here's a robust version with comprehensive error handling:

```python
def calculate_average(numbers):
    """
    Calculate the average of a list of numbers.
    
    Args:
        numbers: An iterable containing numeric values
        
    Ret

### Example 5: Different Temperature Settings

In [8]:
temperatures = [0.1, 0.5, 0.9]
prompt = "Write a creative opening line for a science fiction story."

for temp in temperatures:
    print(f"\n--- Temperature: {temp} ---")

    response = litellm.completion(
        model="cerebras/llama3.3-70b",
        api_key=api_key,
        api_base="https://api.cerebras.ai/v1",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=100
    )

    print(response['choices'][0]['message']['content'])


--- Temperature: 0.1 ---
As the last remnants of sunlight faded from the ravaged horizon, a lone spaceship, its hull etched with the scars of a thousand midnights, emerged from the depths of a nebula, carrying with it the whispers of a forgotten civilization and the echoes of a prophecy that would soon shatter the fragile balance of the galaxy.

--- Temperature: 0.5 ---
As the last remnants of sunlight faded from the ravaged horizon, a lone spaceship, its hull etched with the scars of a thousand midnights, emerged from the depths of a nebula, bearing a cargo of forbidden memories and a crew that was not entirely its own.

--- Temperature: 0.9 ---
As the last star in the universe died, its final whisper echoed through the void, a faint hum that awakened an ancient, cryogenically frozen city on a planet shrouded in an eternal, impenetrable twilight.
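In the outputs above, temperatures 0.1 and 0.5 produce the same opening clause while 0.9 diverges. A common way to picture this is temperature scaling of the softmax over token scores: lower values concentrate probability on the top candidate. The sketch below illustrates that general idea with made-up scores; the exact sampling procedure used by the Cerebras backend is not described in this notebook.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.1, 0.5, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the top candidate shrinks as the temperature rises, which leaves more room for less likely continuations.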


### Example 6: Batch Processing Multiple Queries

In [9]:
queries = [
    "What is machine learning?",
    "Explain neural networks.",
    "What is deep learning?",
    "Define artificial intelligence.",
    "What is natural language processing?"
]

print("Batch Processing Results:\n")

for i, query in enumerate(queries, 1):
    print(f"\n{i}. Query: {query}")

    response = litellm.completion(
        model="cerebras/llama3.3-70b",
        api_key=api_key,
        api_base="https://api.cerebras.ai/v1",
        messages=[{"role": "user", "content": query}],
        temperature=0.5,
        max_tokens=150
    )

    print(f"Answer: {response['choices'][0]['message']['content']}")
    print("-" * 80)

Batch Processing Results:


1. Query: What is machine learning?
Answer: **Machine Learning Definition**

Machine learning is a subset of artificial intelligence (AI) that involves training algorithms to learn from data and make predictions or decisions without being explicitly programmed. It enables systems to improve their performance on a task over time, based on the data they receive.

**Key Characteristics:**

1. **Data-driven**: Machine learning relies on large amounts of data to learn patterns and relationships.
2. **Algorithmic**: Machine learning uses algorithms to analyze data and make predictions or decisions.
3. **Adaptive**: Machine learning systems can adapt to new data and improve their performance over time.
4. **Autonomous**: Machine learning systems can operate independently, without human intervention.

**Types of Machine Learning:**

1. **Supervised Learning**:
--------------------------------------------------------------------------------

2. Query: Explain neural 
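The loop above sends the queries one at a time, so total time grows with the number of queries. A minimal sketch of running them concurrently with a thread pool, assuming the account's rate limits allow parallel requests. `run_batch` is a hypothetical helper, and `fake_ask` stands in for a function that wraps `litellm.completion`:

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(ask, queries, max_workers=4):
    """Call ask(query) for every query concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, queries))


def fake_ask(query):
    # Placeholder for a function that sends one query and returns the reply text.
    return f"answer to: {query}"


answers = run_batch(fake_ask, ["What is machine learning?", "What is deep learning?"])
for answer in answers:
    print(answer)
```

`pool.map` returns results in the order of the inputs, so each answer lines up with its query even if the calls finish out of order.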

### Building a Simple Chatbot with LiteLLM 🚅

In [10]:
import os
import litellm

print("Cerebras Chatbot: Hello! Type 'exit' to end the conversation.\n")

chat_history = []

while True:
    user_input = input("You: ")

    if user_input.lower() == "exit":
        print("Chatbot: Goodbye! 👋")
        break

    chat_history.append({"role": "user", "content": user_input})

    try:
        response = litellm.completion(
            model="cerebras/llama3.3-70b",
            api_key=api_key,
            api_base="https://api.cerebras.ai/v1",
            messages=chat_history,
            temperature=0.7,
            max_tokens=500,
        )

        reply = response["choices"][0]["message"]["content"]
        print("Chatbot:", reply)

        chat_history.append({"role": "assistant", "content": reply})

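    except Exception as e:
        print("Error:", str(e))
        # Drop the unanswered user message so the history stays in user/assistant pairs.
        chat_history.pop()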


Cerebras Chatbot: Hello! Type 'exit' to end the conversation.

You: hi
Chatbot: Hello. How can I assist you today?
You: exit
Chatbot: Goodbye! 👋
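The chatbot sends the whole `chat_history` on every turn, so long conversations keep growing the request. A minimal sketch of capping the history to the most recent messages. `trim_history` is a hypothetical helper, and the limit of 20 messages is an arbitrary assumption rather than a Cerebras or LiteLLM setting:

```python
def trim_history(history, max_messages=20):
    """Return the most recent messages, starting with a user turn."""
    if max_messages <= 0:
        return []
    recent = history[-max_messages:]
    # Drop leading non-user messages so the window opens with a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent


history = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "q3"},
]
print(trim_history(history, max_messages=2))
```

In the chat loop, `messages=trim_history(chat_history)` could replace `messages=chat_history`; the full history stays available locally.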


### Advanced: Falling Back to a Second Model

In [11]:
def query_with_fallback(prompt, primary_model, fallback_model):
    try:
        print(f"Trying primary model: {primary_model}")
        response = litellm.completion(
            model=primary_model,
            api_key=api_key,
            api_base="https://api.cerebras.ai/v1",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=300
        )
        return response['choices'][0]['message']['content']

    except Exception as e:
        print(f"Primary model failed: {e}")
        print(f"Trying fallback model: {fallback_model}")

        response = litellm.completion(
            model=fallback_model,
            api_key=api_key,
            api_base="https://api.cerebras.ai/v1",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=300
        )
        return response['choices'][0]['message']['content']

prompt = "Explain the concept of recursion in programming."
result = query_with_fallback(
    prompt,
    primary_model="cerebras/llama3.3-70b",
    fallback_model="cerebras/gpt-oss-120b"
)

print("\nResponse:")
print(result)

Trying primary model: cerebras/llama3.3-70b

Response:
Recursion in Programming

Recursion is a fundamental concept in programming where a function calls itself repeatedly until it reaches a base case that stops the recursion. This technique is used to solve problems that can be broken down into smaller sub-problems of the same type.

Key Components of Recursion
----------------------------

1. **Base Case**: A condition that stops the recursion when met.
2. **Recursive Call**: The function calls itself with a smaller input or a modified version of the original input.
3. **State**: The current state of the function, including any local variables or parameters.

How Recursion Works
--------------------

1. The function is called with an initial input.
2. The function checks if the base case is met. If it is, the function returns a result.
3. If the base case is not met, the function calls itself with a smaller input or a modified version of the original input.
4. The recursive call crea
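The two-model fallback above can be generalized to any ordered list of models. A minimal sketch, assuming a `call(model)` function that wraps `litellm.completion` for a single model. `complete_with_fallbacks` and `fake_call` are hypothetical names used only for illustration:

```python
def complete_with_fallbacks(call, models):
    """Try each model in order; return (model, result) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:  # broad on purpose: any failure moves to the next model
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_call(model):
    # Placeholder that simulates an outage on the first model.
    if model == "primary":
        raise TimeoutError("simulated outage")
    return f"reply from {model}"


used_model, reply = complete_with_fallbacks(fake_call, ["primary", "backup"])
print(used_model, "->", reply)
```

Returning the model name alongside the reply makes it visible which model actually answered, which the original function only reports through its print statements.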

## Conclusion

This notebook demonstrated how to use Cerebras models with LiteLLM, including:

1. Basic completion requests
2. Streaming responses
3. Code generation and debugging with specialized models
4. Model comparison
5. Temperature settings
6. Batch processing
7. Building a chatbot
8. Fallback strategies

LiteLLM provides a unified interface for accessing Cerebras's ultra-fast inference platform, making it easy to integrate into your applications.