# Run GPT-OSS model on Nebius AI Studio

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nebius/ai-studio-cookbook/blob/main/models/gpt_oss_1.ipynb)
[![](https://img.shields.io/badge/Powered%20by-Nebius-orange?style=flat&labelColor=darkblue&color=orange)](https://nebius.com/ai-studio)

GPT-OSS is an open-weights model family from OpenAI, designed for reasoning, agentic tasks, and general developer use cases. It comes in two sizes:
- a large model with 117B parameters (gpt-oss-120b)
- a smaller model with 21B parameters (gpt-oss-20b)

Both are mixture-of-experts (MoE) models and use a 4-bit quantization scheme (MXFP4).
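
To get a feel for what 4-bit quantization means for model size, here is a back-of-envelope sketch. It assumes every parameter is stored in MXFP4 (4-bit values in blocks of 32 that share one 8-bit scale, so about 4.25 bits per parameter). That assumption may not hold for every layer, so treat the result as a rough lower estimate rather than the actual checkpoint size. The helper names are illustrative only.

```python
# Rough weight-storage estimate under MXFP4.
# Assumption: ALL parameters are quantized, which real checkpoints may not do.
MXFP4_ELEMENT_BITS = 4   # each value is stored as a 4-bit float
MXFP4_BLOCK_SIZE = 32    # number of values that share one scale
MXFP4_SCALE_BITS = 8     # size of the shared per-block scale


def mxfp4_bits_per_param() -> float:
    """Average bits per parameter, including the shared scale overhead."""
    return MXFP4_ELEMENT_BITS + MXFP4_SCALE_BITS / MXFP4_BLOCK_SIZE


def estimate_weight_gb(num_params: float) -> float:
    """Estimated weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * mxfp4_bits_per_param() / 8 / 1e9


for name, params in [("gpt-oss-120b", 117e9), ("gpt-oss-20b", 21e9)]:
    print(f"{name}: ~{estimate_weight_gb(params):.1f} GB")
```

Under these assumptions the estimate is about 62 GB for the large model and about 11 GB for the small one.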

[Read more here](https://github.com/nebius/ai-studio-cookbook/blob/main/models/gpt-oss.md)

## References

- [OpenAI open models page](https://openai.com/open-models/) | [GitHub: openai/gpt-oss](https://github.com/openai/gpt-oss)
- Model card for 120B param model: [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)
- Model card for smaller 20B param model: [gpt-oss-20b](https://hf.co/openai/gpt-oss-20b)
- [Nebius AI Studio](https://studio.nebius.com/)

## Prerequisites

- A Nebius API key. Sign up for free at [AI Studio](https://studio.nebius.com/)



## 1 - Getting Started

### 1.1 - Get your Nebius API key at [Nebius AI Studio](https://studio.nebius.com/)

### 1.2 - If running on Google Colab ...

Add `NEBIUS_API_KEY` to the **Secrets** panel on the left, as shown below:

![](https://github.com/nebius/ai-studio-cookbook/raw/main/images/google-colab-1.png)


### 1.3 - If running locally

Create a `.env` file containing `NEBIUS_API_KEY`, as follows:

```text
NEBIUS_API_KEY=your_api_key_goes_here
```



### 1.4 - Install Dependencies

In [1]:
%pip install -q openai python-dotenv

## 2 - Load Configuration


In [2]:
import os

## Recommended way of getting configuration
if os.getenv("COLAB_RELEASE_TAG"):
    # Colab: read the key from the Secrets panel
    print("Running in Colab")
    from google.colab import userdata
    NEBIUS_API_KEY = userdata.get('NEBIUS_API_KEY')
else:
    # Local run: read the key from the .env file
    print("NOT running in Colab")
    from dotenv import load_dotenv
    load_dotenv()
    NEBIUS_API_KEY = os.getenv('NEBIUS_API_KEY')

## Quick hack (not recommended): hardcode the key here
# NEBIUS_API_KEY = "your_key_here"

if NEBIUS_API_KEY:
    print('✅ NEBIUS_API_KEY found')
    os.environ['NEBIUS_API_KEY'] = NEBIUS_API_KEY
else:
    raise RuntimeError('❌ NEBIUS_API_KEY NOT found')

Running in Colab
✅ NEBIUS_API_KEY found


## 3 - Run the Model

### 3.1 - Initialize the client

In [3]:
## Create a client
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.environ.get('NEBIUS_API_KEY')
)

## Select a model
MODEL_NAME = "openai/gpt-oss-120b"  # large model
#MODEL_NAME = "openai/gpt-oss-20b"  # smaller model
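
If you are unsure which model ids are available to your account, the sketch below filters the model list by keyword. It assumes the endpoint exposes the OpenAI-compatible model listing that `client.models.list()` calls. That is an assumption, not something this notebook verifies. `find_models` is a hypothetical helper name.

```python
def find_models(client, keyword: str) -> list[str]:
    """Return the sorted ids of models whose id contains `keyword` (case-insensitive).

    `client` is expected to be an OpenAI-style client such as the one created above.
    Assumption: the endpoint supports the model listing call.
    """
    keyword = keyword.lower()
    return sorted(m.id for m in client.models.list() if keyword in m.id.lower())


# Usage with the `client` created above (needs network access and a valid key):
# print(find_models(client, "gpt-oss"))
```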

### 3.2 - Find out the model's capabilities

In [4]:
%%time

completion = client.chat.completions.create(
  model = MODEL_NAME,
  messages=[
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": "What are your capabilities?"
    }
  ],
  temperature=0.6
)

print ('----model answer -----')
print (completion.choices[0].message.content)

----model answer -----
I’m a GPT‑4‑based AI assistant, so my main strength is working with natural language (and, when an image is supplied, visual information as well). Here’s a quick rundown of what I can do:

| Category | What I can help with |
|----------|----------------------|
| **Information & Knowledge** | Answer factual questions, explain concepts, summarize articles, compare topics, and provide background on science, history, technology, arts, etc. (knowledge up to June 2024). |
| **Writing & Creativity** | Draft essays, reports, emails, blog posts, stories, poems, jokes, dialogue, scripts, marketing copy, social‑media captions, and more. I can also rewrite or edit existing text for tone, clarity, conciseness, or style. |
| **Learning & Tutoring** | Explain math problems step‑by‑step, walk through physics or chemistry concepts, help with language learning, generate practice questions, and provide study strategies. |
| **Programming & Technical Help** | Write, debug, and refac

### 3.3 - Ask a factual question

In [5]:
%%time

completion = client.chat.completions.create(
  model = MODEL_NAME,
  messages=[
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": "What is the capital of France?"
    }
  ],
  temperature=0.6
)

print ('----model answer -----')
print (completion.choices[0].message.content)
print ('\n----- full response ----')
print(completion.to_json())
print ('---------')




----model answer -----
The capital of France is **Paris**.

----- full response ----
{
  "id": "chatcmpl-2769efe8f3ad49c9a96879ea7439656a",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "The capital of France is **Paris**.",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": "The user asks: \"What is the capital of France?\" Straightforward answer: Paris."
      },
      "stop_reason": null
    }
  ],
  "created": 1754544553,
  "model": "openai/gpt-oss-120b",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 37,
    "prompt_tokens": 88,
    "total_tokens": 125,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": nul

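The full response carries more than the answer text: it also includes the model's reasoning trace (`reasoning_content`), the finish reason, and token usage. A minimal sketch for pulling these out of the JSON string returned by `completion.to_json()` follows. `summarize_response` is a hypothetical helper name. `reasoning_content` appears in the response above but is not guaranteed on every endpoint, so the code treats it as optional.

```python
import json


def summarize_response(response_json: str) -> dict:
    """Extract the answer, reasoning trace, finish reason, and token count.

    Missing optional fields are returned as None instead of raising.
    """
    data = json.loads(response_json)
    choice = data["choices"][0]
    message = choice["message"]
    usage = data.get("usage") or {}
    return {
        "answer": message.get("content"),
        "reasoning": message.get("reasoning_content"),  # may be absent
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


# Usage with the `completion` object from the cell above:
# print(summarize_response(completion.to_json()))
```
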
### 3.4 - Ask a reasoning question

In [6]:
%%time

reasoning_effort = "high"  # Options: "low", "medium", "high"

# The "Reasoning: <level>" line in the system prompt sets the reasoning effort
completion = client.chat.completions.create(
  model = MODEL_NAME,
  messages=[
    {
        "role": "system",
        "content": f"""
        You are an expert in MLOps and interviewing for a job.
    Reasoning: {reasoning_effort}
    For each task, explain your thinking step by step before showing code.
    Justify key design decisions and list alternatives when relevant.
        """
    },
    {
        "role": "user",
        "content": "How can we make model inferencing faster?"
    }
  ],
  temperature=0.6
)

print (completion.choices[0].message.content)
print ('----------')

Below is a **step‑by‑step playbook** you can use (and adapt) the next time you need to shave milliseconds‑to‑seconds off the latency of a model in production.  
I’ll walk through the **thought process**, the **key levers** you can pull, and then give **concrete code snippets** for the most common techniques. For each lever I’ll also note **alternatives** and the **trade‑offs** you’ll have to consider.

---

## 1️⃣  Start with a Baseline & Profile the Hot Path  

> **Why?** You can’t optimize what you don’t know is slow.  
> **What to measure:**  
> - End‑to‑end latency (client → API → model → response)  
> - Pure model inference time (exclude network, serialization)  
> - CPU/GPU utilization, memory bandwidth, cache‑miss rates  

**Typical tools**

| Layer | Tool | What it tells you |
|-------|------|-------------------|
| Python code | `cProfile`, `line_profiler` | Python‑level bottlenecks (data prep, post‑proc) |
| TensorFlow | TensorBoard Profiler, `tf.profiler` | Ops breakdown, dev
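
The cell above sets the reasoning level by placing a `Reasoning: <level>` line in the system prompt. To switch levels without editing the prompt by hand, a small helper along these lines may be convenient. `build_messages` and `VALID_EFFORTS` are hypothetical names, and the three levels are the ones listed in the cell's comment.

```python
VALID_EFFORTS = ("low", "medium", "high")


def build_messages(system_text: str, user_text: str,
                   reasoning_effort: str = "medium") -> list[dict]:
    """Build a chat message list whose system prompt ends with a reasoning-level line."""
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(
            f"reasoning_effort must be one of {VALID_EFFORTS}, got {reasoning_effort!r}"
        )
    system_prompt = f"{system_text.strip()}\nReasoning: {reasoning_effort}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


# Usage with the `client` and `MODEL_NAME` defined earlier:
# messages = build_messages("You are an expert in MLOps.",
#                           "How can we make model inferencing faster?", "high")
# completion = client.chat.completions.create(model=MODEL_NAME, messages=messages)
```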

## 4 - Try Your Queries

Go ahead and experiment with your queries.  Here are some to get you started.

> Write Python code to read a CSV file

> Write a haiku about cats
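
To try several queries in one go, a loop like the following sketch may help. It reuses the same request shape as the cells above. `run_queries` is a hypothetical helper name, and the default temperature of 0.6 simply mirrors the earlier cells.

```python
def run_queries(client, model: str, queries: list[str],
                temperature: float = 0.6) -> dict[str, str]:
    """Send each query as a separate chat request and collect the answers by query."""
    answers = {}
    for query in queries:
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": query},
            ],
            temperature=temperature,
        )
        answers[query] = completion.choices[0].message.content
    return answers


# Usage with the `client` and `MODEL_NAME` defined earlier:
# results = run_queries(client, MODEL_NAME, [
#     "Write Python code to read a CSV file",
#     "Write a haiku about cats",
# ])
# for question, answer in results.items():
#     print(f"> {question}\n{answer}\n")
```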