# Building AI Applications with OpenAI's Responses API

## Introduction

Welcome to this comprehensive guide on using OpenAI's **Responses API**, the modern replacement for the deprecated Assistants API. In this tutorial, we'll explore how to build powerful AI applications that can search through documents and perform complex data analysis.

### What's Changed?

The Assistants API is being deprecated (sunset planned for mid-2026) in favor of the new **Responses API**, which offers:

- **Simpler architecture**: No more complex thread management - just pass `previous_response_id` to continue conversations
- **Unified experience**: Combines the best of Chat Completions and Assistants APIs
- **Per-request instructions**: `instructions` apply only to the current request and are not carried over when you chain with `previous_response_id`
- **Better performance**: More efficient conversation management
- **Same powerful tools**: File Search and Code Interpreter are both fully supported

### What You'll Learn

By the end of this tutorial, you'll be able to:

1. Use the **File Search** tool to build knowledge-based assistants
2. Leverage the **Code Interpreter** tool for data analysis and visualization
3. Manage vector stores for document retrieval
4. Chain responses to maintain conversation context
5. Combine multiple tools for powerful AI applications

Let's get started!

## Setup

First, let's set up our environment with the required imports and initialize our OpenAI client:

```python
import os
import getpass

def _set_env(var: str):
    # Prompt for the value only if it is not already set in the environment
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("OPENAI_API_KEY")
```

```python
from openai import OpenAI
import time

# Initialize the OpenAI client (reads OPENAI_API_KEY from the environment)
client = OpenAI()

print("OpenAI client initialized successfully!")
```

OpenAI client initialized successfully!


---

## Part 1: File Search with the Responses API

File Search allows your AI to access knowledge from documents you provide. OpenAI automatically handles:
- Parsing and chunking documents
- Creating and storing embeddings
- Performing vector and keyword search
- Retrieving relevant content to answer queries

### Understanding Vector Stores

**Vector Stores** are specialized databases for efficient storage and retrieval of information:
- Files are automatically processed (parsed, chunked, embedded)
- Support both keyword and semantic search
- Can store up to 10,000 files each
- Maximum file size: 512 MB per file
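
To make the chunking step more concrete, here is a minimal sketch of overlapping fixed-size chunking. It is an illustration only, not OpenAI's implementation: the function name `chunk_tokens` is hypothetical, whitespace-separated words stand in for real tokens, and the sizes are toy values (OpenAI's documented defaults are 800-token chunks with a 400-token overlap).

```python
def chunk_tokens(tokens, max_chunk_size=8, overlap=4):
    """Split a token list into fixed-size chunks that overlap by `overlap` tokens."""
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be non-negative and smaller than max_chunk_size")
    step = max_chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Words stand in for tokens here (an assumption made for readability)
words = "attention lets a model relate any two positions in a sequence directly".split()
for chunk in chunk_tokens(words, max_chunk_size=6, overlap=2):
    print(" ".join(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is why retrieval quality tends to be less sensitive to where the boundaries fall.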

### Creating a Vector Store

Let's create a vector store for our documentation:

```python
def create_vector_store(store_name: str):
    """Creates a vector store for document storage."""
    vector_store = client.vector_stores.create(name=store_name)

    details = {
        "id": vector_store.id,
        "name": vector_store.name,
        "created_at": vector_store.created_at,
        "file_count": vector_store.file_counts.completed
    }

    print(f"Created vector store: {vector_store.name}")
    print(f"Vector Store ID: {vector_store.id}")

    return details

# Create our first vector store
vector_store_details = create_vector_store("Product Documentation")
```


### Uploading Files to the Vector Store

Now let's upload documents to our vector store. We'll download two sample research papers from arXiv:

```python
import os
import requests

def download_paper(paper_url: str, file_path: str):
    """Download a research paper from arXiv."""
    print("Downloading research paper...")
    response = requests.get(paper_url, timeout=60)
    response.raise_for_status()

    # Create the target directory if it does not exist yet
    os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
    with open(file_path, "wb") as f:
        f.write(response.content)

    print(f"Downloaded: {file_path}")

    return file_path


file_paths = [
    "./assets-resources/pdfs/future_agents.pdf",
    "./assets-resources/pdfs/attention_paper.pdf",
]

paper_urls = [
    "https://arxiv.org/pdf/2506.02153",
    "https://arxiv.org/pdf/1706.03762"
]

for file_path, paper_url in zip(file_paths, paper_urls):
    download_paper(paper_url, file_path)
```

Downloading research paper...
Downloaded: ./assets-resources/pdfs/future_agents.pdf
Downloading research paper...
Downloaded: ./assets-resources/pdfs/attention_paper.pdf


```python
def upload_file_to_vector_store(file_path: str, vector_store_id: str):
    """Uploads a file to the vector store."""
    file_name = os.path.basename(file_path)

    try:
        # Upload file to OpenAI; the context manager closes the handle afterwards
        print(f"Uploading {file_name}...")
        with open(file_path, "rb") as f:
            file_response = client.files.create(file=f, purpose="assistants")

        # Add file to vector store
        print("Adding to vector store...")
        client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=file_response.id
        )

        print(f"✓ Successfully uploaded: {file_name}")
        return {"file": file_name, "status": "success", "file_id": file_response.id}

    except Exception as e:
        print(f"✗ Failed to upload {file_name}: {str(e)}")
        return {"file": file_name, "status": "failed", "error": str(e)}

# Upload our files
# This vector store was created manually in the OpenAI platform dashboard
# vector_store_id = vector_store_details['id']
vector_store_id = "vs_6900b46abc748191b31df0c8d561e5d7"

for file_path in file_paths:
    upload_result = upload_file_to_vector_store(file_path, vector_store_id)
    # Fixed pause to give processing a head start; it does not guarantee completion
    print("\nWaiting for file processing...")
    time.sleep(3)
    print("Ready!")
```

Uploading future_agents.pdf...
Adding to vector store...
✓ Successfully uploaded: future_agents.pdf

Waiting for file processing...
Ready!
Uploading attention_paper.pdf...
Adding to vector store...
✓ Successfully uploaded: attention_paper.pdf

Waiting for file processing...
Ready!
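
A fixed `time.sleep(3)` only gives processing a head start; larger files can take longer. A more robust pattern is to poll until the file reaches a terminal status. The sketch below is a generic helper, not part of the OpenAI SDK: the name `wait_until_processed` and its parameters are hypothetical, and it takes any zero-argument callable that returns the current status string, so it can be exercised without network access.

```python
import time

def wait_until_processed(get_status, timeout_s=120.0, poll_interval_s=2.0):
    """Poll `get_status()` until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in ("completed", "failed", "cancelled"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"File still '{status}' after {timeout_s} seconds")
        time.sleep(poll_interval_s)

# Assumed usage with the client from this tutorial (not executed here);
# `file_id` is the ID returned by the upload step:
# status = wait_until_processed(
#     lambda: client.vector_stores.files.retrieve(
#         vector_store_id=vector_store_id, file_id=file_id
#     ).status
# )
```

Returning the final status rather than raising on `"failed"` leaves it to the caller to decide whether a failed file should abort the whole batch.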


### Using File Search with the Responses API

Now comes the exciting part! With the Responses API, we can query our documents directly without managing threads or assistants.

The key difference from the old Assistants API:
- **Old way**: Create assistant → Create thread → Add message → Run assistant → Poll for completion
- **New way**: Simply call `responses.create()` with the file_search tool

```python
def query_documents(query: str, vector_store_id: str, model: str = "gpt-5-mini"):
    """Query documents using file search in the Responses API."""

    instructions = """You are a helpful research assistant. 
    Use the provided documentation to answer questions accurately.
    If you're not sure about something, admit it and stick to the information in the documents.
    Always cite your sources when possible."""

    response = client.responses.create(
        input=query,
        model=model,
        instructions=instructions,
        tools=[{
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
            "max_num_results": 5
        }]
    )

    return response


# vector_store_id = vector_store_details['id']
# Ask a question about our documents
query = "Explain the core idea behind the attention mechanism and what is the future of agents according to the papers you have access to."
print(f"Query: {query}\n")
response = query_documents(query, vector_store_id)
```

Query: Explain the core idea behind the attention mechanism and what is the future of agents according to the papers you have access to.



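```python
from IPython.display import Markdown

# output_text collects the text of the response's message output,
# so it does not depend on the position of the message in response.output
Markdown(response.output_text)
```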

Short answer first:

- Attention: the core idea is to let the model compute, for each position (or query), a weighted sum of (value) representations where the weights (the attention scores) come from similarity between a query and a set of keys; this lets the model directly model relationships between any two positions and replace recurrence with parallel self-attention (with scaled dot-product attention and multi-head attention to capture multiple types of relations).
- Future of agents: the papers you provided argue that small, deployable language models (SLMs), not giant generalist LLMs, are the likely future for most agentic systems because they are sufficiently capable for routine agent tasks, are more operationally suitable (latency, privacy, cost, local inference), and are economically preferable; the authors therefore advocate migrating many agent uses from large LLMs to SLMs and provide a conversion outline and case studies to support this view.

More detail

1) Core idea behind attention (Transformer perspective)
- Attention computes interactions between elements by forming queries (Q), keys (K) and values (V). For each query you compute similarity scores to all keys, normalize (softmax) to get attention weights, and take the weighted sum of the corresponding values. This produces context-aware representations that directly capture dependencies regardless of distance in the sequence.
- Scaled dot-product attention and multi-head attention: scaling stabilizes gradients for large dimensionalities; multi-head attention runs several attention computations in parallel (different linear projections) so the model can attend to different types of relationships simultaneously.
- Practical payoff: replacing recurrence/convolution with self-attention yields much more parallelizable architectures and a constant (per-layer) number of operations to relate any two positions, which improved training speed and quality in the Transformer experiments.

2) What the provided paper says about the future of agents
- Definitions & thesis: the authors define SLMs as language models small enough to run with low-latency on common consumer devices (rough rule: models < ~10B params as of 2025) and contrast them with LLMs; they state that SLMs are the future of agentic AI because they (1) are typically powerful enough for agent tasks, (2) are operationally better suited, and (3) are much more economical, so SLM adoption is a necessary outcome if practical priorities are followed.
- Rationale and implications: the paper argues current practice often overuses single, large generalist LLMs to handle agents’ requests; instead, many agent requests are simple and can be handled by smaller specialized models; moving to SLMs brings advantages in deployment cost, latency, privacy (on-device inference), and energy/compute efficiency; the authors propose an algorithmic pathway to migrate agentic applications from LLMs to SLMs and present case studies estimating the potential scope of replacement.
- Tone: this is not a claim that LLMs will disappear, but rather that for the majority of practical agent workloads the tradeoffs favor smaller, local or specialized models as the mainstream solution going forward.

If you’d like, I can:
- Give a short diagram or math sketch of scaled dot‑product / multi‑head attention.  
- Summarize the migration steps and case studies from the agent paper in more detail.

### Continuing the Conversation

One of the most powerful features of the Responses API is simple conversation continuity. Just pass the `previous_response_id` to maintain context:

```python
def continue_conversation(query: str, previous_response_id: str, vector_store_id: str, model: str = "gpt-4o"):
    """Continue a conversation using previous_response_id."""

    # Instructions are not inherited from the previous response, so pass them again
    instructions = """You are a helpful research assistant. 
    Use the provided documentation to answer questions accurately.
    Refer to our previous conversation when relevant."""

    response = client.responses.create(
        input=query,
        model=model,
        instructions=instructions,
        previous_response_id=previous_response_id,
        tools=[{
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
            "max_num_results": 5
        }]
    )

    return response

# Ask a follow-up question
# vector_store_id = vector_store_details['id']
follow_up_query = "Can you explain the methodology of each of these papers in 2 or 3 sentences?"
print(f"Follow-up Query: {follow_up_query}\n")

follow_up_response = continue_conversation(
    follow_up_query,
    response.id,
    vector_store_id
)

Markdown(follow_up_response.output_text)
```

Follow-up Query: Can you explain the methodology of each of these papers in 2 or 3 sentences?



### Methodology of the Attention Mechanism in Transformers

The Transformer model eliminates the reliance on recurrence by using self-attention mechanisms to compute representations of input and output sequences. The core methodology involves using an attention mechanism to model dependencies between words regardless of their distance in the sequence. This is achieved through scaled dot-product attention and multi-head attention, which allow multiple attention operations to be performed in parallel, providing the capability for the model to focus on different parts of the sequence at the same time.

### Methodology of the Future of Agents Paper

The paper presents a thesis that smaller language models (SLMs) are more suitable than large language models (LLMs) for agentic AI applications due to their operational and economic advantages. It outlines a methodology to transition from LLMs to SLMs, argues the case with operational and economic benefits, and defends this position through case studies. The paper proposes a conversion algorithm for migrating applications to use SLMs, supported by practical examples and cost analysis.
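
Returning to conversation continuity: when a dialogue runs for more than two turns, it helps to keep the latest response ID in one place instead of threading it through every call by hand. The class below is a minimal sketch of that idea. `ConversationSession` is a hypothetical helper, not part of the OpenAI SDK; it accepts any callable that behaves like `client.responses.create`, which also makes it easy to try out with a stand-in function.

```python
class ConversationSession:
    """Tracks the latest response ID so each turn continues the previous one."""

    def __init__(self, create_fn, **default_kwargs):
        # create_fn is expected to behave like client.responses.create
        self._create = create_fn
        # Arguments sent on every turn (model, instructions, tools, ...)
        self._defaults = default_kwargs
        self.last_response_id = None

    def ask(self, query: str):
        kwargs = dict(self._defaults, input=query)
        # The first turn has no predecessor; later turns chain to the last response
        if self.last_response_id is not None:
            kwargs["previous_response_id"] = self.last_response_id
        response = self._create(**kwargs)
        self.last_response_id = response.id
        return response

# Assumed usage with the client from this tutorial (not executed here):
# session = ConversationSession(
#     client.responses.create,
#     model="gpt-4o",
#     instructions="You are a helpful research assistant.",
#     tools=[{"type": "file_search", "vector_store_ids": [vector_store_id]}],
# )
# first = session.ask("What is attention?")
# second = session.ask("How does that relate to the agents paper?")
```

Passing the instructions and tools as defaults matters because they are sent again on every turn rather than remembered from the previous response.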

### Extracting Citations

The Responses API provides annotations that include information about which files were used to generate the response:

```python
def extract_citations(response):
    """Extract the names of cited files from response annotations."""
    citations = set()

    for item in response.output:
        # Tool-call items may have no content; message items carry the annotated text
        for content in getattr(item, "content", None) or []:
            for annotation in getattr(content, "annotations", None) or []:
                filename = getattr(annotation, "filename", None)
                if filename:
                    citations.add(filename)

    return citations

# Extract and display citations
citations = extract_citations(follow_up_response)

if citations:
    print("\nFiles Referenced:")
    for citation in citations:
        print(f"  - {citation}")
else:
    print("\nNo explicit file citations found in response.")
```


Files Referenced:
  - future_agents.pdf
  - attention_paper.pdf


---

## Best Practices and Cost Management

### Cost Considerations

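- **File Search**: $2.50 per 1,000 queries, plus $0.10 per GB per day of vector storage (first GB free)
- **Code Interpreter**: $0.03 per session
- **Vector Storage**: Set expiration policies so unused stores stop accruing storage charges

These figures reflect OpenAI's published rates at the time of writing and may change, so check the pricing page before budgeting.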

```python
def create_vector_store_with_expiration(name: str, days: int = 7):
    """Create a vector store with automatic expiration after inactivity."""

    vector_store = client.vector_stores.create(
        name=name,
        expires_after={
            "anchor": "last_active_at",
            "days": days
        }
    )

    print(f"Created vector store: {name}")
    print(f"Will expire after {days} days of inactivity")

    return vector_store

# Example: Create a temporary vector store
temp_store = create_vector_store_with_expiration("Temporary Research Store", days=3)
```
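
To see how the per-query and storage prices listed under Cost Considerations add up, here is a rough back-of-the-envelope estimator. It is a sketch under stated assumptions: the function name is hypothetical, the default prices are simply the figures quoted in this tutorial, and it assumes storage size stays constant over the period.

```python
def estimate_file_search_cost(num_queries, storage_gb, days,
                              price_per_1k_queries=2.50,
                              price_per_gb_day=0.10,
                              free_gb=1.0):
    """Rough File Search cost in USD: query charges plus storage beyond the free tier."""
    query_cost = num_queries / 1000 * price_per_1k_queries
    billable_gb = max(0.0, storage_gb - free_gb)  # the first GB is free
    storage_cost = billable_gb * price_per_gb_day * days
    return round(query_cost + storage_cost, 2)

# Example: 20,000 queries against 3 GB of vector storage kept for 30 days
print(estimate_file_search_cost(20_000, 3.0, 30))  # prints 56.0
```

In this example the queries account for $50 and the storage for $6, which suggests that expiration policies matter most for stores that are large but rarely queried.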

### Supported File Types

File Search supports many text-based formats:
- Documents: `.pdf`, `.docx`, `.txt`, `.md`
- Code: `.py`, `.js`, `.java`, `.cpp`, `.html`, `.css`
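- Data: `.json` (`.csv` and `.xml` do not appear in OpenAI's list of supported File Search formats; convert such files to a supported text format first, and check the File Search guide for the current list)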
- Maximum file size: 512 MB
- Maximum tokens per file: 5,000,000
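
Checking these limits locally before uploading avoids a wasted round trip. The helper below is a minimal sketch: `validate_for_file_search` is a hypothetical name, the extension set is an illustrative subset to be extended from the official list, and the token limit is not checked because that would require tokenizing the file.

```python
import os

MAX_FILE_SIZE_BYTES = 512 * 1024 * 1024  # 512 MB limit quoted above

# Illustrative subset of extensions (an assumption); extend it from the official list
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".txt", ".md",
    ".py", ".js", ".java", ".cpp", ".html", ".css",
    ".json",
}

def validate_for_file_search(file_path):
    """Return a list of problems; an empty list means the file passes these local checks."""
    problems = []
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if not os.path.isfile(file_path):
        problems.append("file does not exist")
    elif os.path.getsize(file_path) > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds 512 MB")
    return problems

# Assumed usage before an upload (not executed here):
# issues = validate_for_file_search("./assets-resources/pdfs/attention_paper.pdf")
# if issues:
#     print("Skipping upload:", "; ".join(issues))
```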

### Cleanup Resources

Remember to clean up resources when you're done:

```python
def cleanup_resources(vector_store_ids=None, file_ids=None):
    """Clean up vector stores and files."""

    if vector_store_ids:
        for vs_id in vector_store_ids:
            try:
                client.vector_stores.delete(vs_id)
                print(f"Deleted vector store: {vs_id}")
            except Exception as e:
                print(f"Error deleting vector store {vs_id}: {e}")

    if file_ids:
        for file_id in file_ids:
            try:
                client.files.delete(file_id)
                print(f"Deleted file: {file_id}")
            except Exception as e:
                print(f"Error deleting file {file_id}: {e}")

# Example cleanup (uncomment to use)
# upload_result holds the result for the last uploaded file only
# cleanup_resources(
#     vector_store_ids=[vector_store_details['id']],
#     file_ids=[upload_result['file_id']]
# )
```

---

## Migration Notes: Assistants API → Responses API

### Key Differences

| Aspect | Assistants API (Old) | Responses API (New) |
|--------|---------------------|--------------------|
| **Conversation Management** | Thread-based, manual history | `previous_response_id` |
| **Instructions** | Stored on server | Stateless, per-request |
| **API Calls** | Create assistant → thread → message → run | Single `responses.create()` |
| **Execution** | Asynchronous with polling | Synchronous or streaming |
| **Complexity** | Higher (multiple objects) | Lower (unified) |

### Code Comparison

**Old Way (Assistants API):**
```python
# Create assistant
assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="You are helpful",
    tools=[{"type": "file_search"}]
)

# Create thread
thread = client.beta.threads.create()

# Add message
message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Hello"
)

# Run assistant
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id
)

# Get messages
messages = client.beta.threads.messages.list(thread_id=thread.id)
```

**New Way (Responses API):**
```python
# Single call
response = client.responses.create(
    model="gpt-4o",
    input="Hello",
    instructions="You are helpful",
    tools=[{"type": "file_search", "vector_store_ids": [vs_id]}]
)

# Continue conversation (instructions are not carried over, so repeat them if needed)
response2 = client.responses.create(
    model="gpt-4o",
    input="Follow up",
    previous_response_id=response.id,
    tools=[{"type": "file_search", "vector_store_ids": [vs_id]}]
)
```

### Benefits of Responses API

1. **Simpler code**: Fewer API calls and objects to manage
2. **Better performance**: Direct response without polling
3. **Flexible state**: Choose stateful or stateless as needed
4. **Easier debugging**: Unified request/response structure
5. **Better typing**: Improved TypeScript definitions

---

## Conclusion

Congratulations! You've learned how to use OpenAI's modern Responses API with both File Search and Code Interpreter tools.

### What We Covered

1. **File Search**: Build knowledge-based assistants that can search through documents
2. **Code Interpreter**: Perform data analysis, create visualizations, and solve complex problems
3. **Conversation Management**: Maintain context using `previous_response_id`
4. **Combined Tools**: Use multiple tools together for powerful applications
5. **Best Practices**: Cost management, file handling, and resource cleanup

### Next Steps

- Experiment with your own documents and datasets
- Combine tools creatively for your specific use cases
- Explore the other tools available in the Responses API (web search, etc.)
- Build production applications with proper error handling and monitoring
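
As a starting point for the error-handling item above, here is a minimal retry sketch with exponential backoff and jitter. `call_with_retries` is a hypothetical helper, not part of the OpenAI SDK. The OpenAI Python client already retries some failures on its own (configurable through `max_retries`), so a wrapper like this is mainly useful when you want a retry policy of your own around a larger unit of work.

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay_s=1.0,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call fn(), retrying on the given exceptions with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay_s * 2 ** (attempt - 1)
            # Jitter spreads out retries from clients that failed at the same time
            time.sleep(delay + random.uniform(0, delay / 2))

# Assumed usage with functions from this tutorial (not executed here):
# import openai
# response = call_with_retries(
#     lambda: query_documents(query, vector_store_id),
#     retry_on=(openai.APIConnectionError, openai.RateLimitError),
# )
```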

### Resources

- [OpenAI Responses API Documentation](https://platform.openai.com/docs/api-reference/responses)
- [File Search Guide](https://platform.openai.com/docs/guides/tools-file-search)
- [OpenAI Cookbook Examples](https://cookbook.openai.com/)

Happy building! 🚀