Alibaba Integration

Armin RAD edited this page Dec 16, 2025 · 1 revision

Alibaba Cloud Integration

Access 100+ Qwen models through Alibaba Cloud DashScope


Overview

Alibaba Cloud integration provides access to Qwen language models via DashScope:

  • 25+ Qwen Models - Flagship (Max, Plus, Flash), specialized (Coder, Math, VL, Omni)
  • OpenAI-Compatible API - Direct compatibility with OpenAI Python SDK
  • Multi-Region Support - Singapore (International) and Beijing (China) endpoints
  • Automatic Failover - Integrated into provider failover chain
  • Cost-Effective - Competitive pricing on quality models

Quick Setup

1. Get API Key

  1. Sign up: https://www.alibabacloud.com/
  2. Navigate to Model Studio: https://dashscope.aliyuncs.com/
  3. Create or retrieve API key from dashboard
  4. Copy the key

2. Configure Environment

Add to your .env file:

# Required: Main API key
ALIBABA_CLOUD_API_KEY=your-dashscope-api-key-here

# Optional: Region-specific keys (if you have separate keys)
ALIBABA_CLOUD_API_KEY_INTERNATIONAL=dashscope-intl-key
ALIBABA_CLOUD_API_KEY_CHINA=dashscope-cn-key
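
This page does not spell out how the main key and the region-specific keys interact. The following is a minimal sketch, assuming a region-specific key takes priority and the main key is the fallback. `resolve_api_key` and `REGION_KEY_VARS` are hypothetical names for illustration, not part of the gateway.

```python
import os

# Variable names come from this page; the lookup order is an assumption.
REGION_KEY_VARS = {
    "international": "ALIBABA_CLOUD_API_KEY_INTERNATIONAL",
    "china": "ALIBABA_CLOUD_API_KEY_CHINA",
}


def resolve_api_key(region, env=None):
    """Return the region-specific key if set, else the main key (hypothetical helper)."""
    env = os.environ if env is None else env
    region_var = REGION_KEY_VARS.get(region)
    region_key = env.get(region_var) if region_var else None
    key = region_key or env.get("ALIBABA_CLOUD_API_KEY")
    if not key:
        raise ValueError("Alibaba Cloud API key not configured")
    return key
```

Passing a plain dict as `env` makes the lookup easy to check without touching the real environment.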

3. Verify Configuration

# Check logs for successful provider loading
✓ Loaded alibaba_cloud provider client

Supported Models

Commercial Models (Recommended)

Model       Context  Use Case              Price per 1M tokens (input / output)
qwen-flash  1M       Fast, cost-effective  $0.001 / $0.003
qwen-plus   1M       Balanced performance  $0.005 / $0.015
qwen-max    262K     Most powerful         $0.012 / $0.036
qwen-coder  262K     Code generation       $0.008 / $0.024
qwen-long   10M      Document processing   $0.001 / $0.003

Reasoning Models

Model            Context  Specialty
qwq-plus         262K     Advanced reasoning for math and code
qwq-32b-preview  262K     32B parameter reasoning model

Specialized Models

Model      Context  Features
qwen-omni  -        Multimodal (text, image, audio, video)
qwen-vl    -        Vision and language understanding
qwen-math  -        Mathematics problem-solving
qwen-mt    -        Translation (92 languages)

Series Models

Qwen 3 Series (Latest)

  • qwen-3-30b-a3b-instruct - 30B instruction model
  • qwen-3-80b-a3b-instruct - 80B instruction model
  • qwen-3-30b-a3b-thinking - 30B with thinking mode
  • qwen-3-80b-a3b-thinking - 80B with thinking mode

Qwen 2.5 Series

  • qwen-2.5-72b-instruct - Enhanced 72B model
  • qwen-2.5-7b-instruct - Efficient 7B model

Qwen 2 Series

  • qwen-2-72b-instruct - Stable 72B baseline
  • qwen-2-7b-instruct - Stable 7B baseline

Qwen 1.5 Series (Legacy)

  • qwen-1.5-72b-chat - Legacy 72B
  • qwen-1.5-14b-chat - Legacy 14B

Usage Examples

Basic Chat Completion

import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Authorization": "Bearer YOUR_GATEWAYZ_API_KEY"}

payload = {
    "model": "qwen-plus",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Explain quantum computing"}
    ],
    "temperature": 0.7,
    "max_tokens": 1000
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())

Using OpenAI SDK

from openai import OpenAI

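# The gateway is documented to accept both base URL forms (with or without /v1).
# The SDK appends the route (e.g. /chat/completions) to base_url as given;
# it does not add /v1 itself, so the /v1 form is spelled out here.
client = OpenAI(
    api_key="YOUR_GATEWAYZ_API_KEY",
    base_url="http://localhost:8000/v1"
)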

response = client.chat.completions.create(
    model="qwen-plus",
    messages=[
        {"role": "user", "content": "Write a Python function to sort a list"}
    ]
)

print(response.choices[0].message.content)

Streaming Responses

response = client.chat.completions.create(
    model="qwen-flash",
    messages=[{"role": "user", "content": "Write a haiku about AI"}],
    stream=True
)

for chunk in response:
    # Some chunks carry no choices or no text; skip them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

With Organization Prefix

# Using org/model format
payload = {
    "model": "qwen/qwen-max",  # Automatically routes to Alibaba Cloud
    "messages": [{"role": "user", "content": "Hello"}]
}

Code Generation with Qwen Coder

response = client.chat.completions.create(
    model="qwen-coder",
    messages=[
        {
            "role": "user",
            "content": "Write a FastAPI endpoint for user registration with email validation"
        }
    ],
    temperature=0.3  # Lower temperature for code generation
)

Math Problem Solving

response = client.chat.completions.create(
    model="qwq-plus",
    messages=[
        {
            "role": "user",
            "content": "Solve: If x^2 + 5x + 6 = 0, what are the values of x?"
        }
    ]
)

Model Selection and Routing

The gateway automatically detects Qwen models based on:

  1. Pattern Matching: Models starting with qwen/ or alibaba-cloud/
  2. Model Name Mapping: Direct lookups in transformation table
  3. Failover Support: Falls back to alternative providers if needed

Automatic Detection

User Input                  → Provider      → DashScope Model ID
qwen-plus                  → alibaba-cloud → qwen-plus
qwen/qwen-max              → alibaba-cloud → qwen-max
alibaba-cloud/qwen-coder   → alibaba-cloud → qwen-coder
qwen-3-30b                 → alibaba-cloud → qwen-3-30b-a3b-instruct
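
The detection rules can be sketched roughly as follows. This is an illustration only: `route_model`, `MODEL_ALIASES`, and `PROVIDER_PREFIXES` are hypothetical names, the alias table holds just the one remapping shown above, and treating bare names that start with `qwen` or `qwq` as Alibaba Cloud models is an assumption. The actual logic lives in model_transformations.py.

```python
# Hypothetical alias table; only the remapping shown in the table above is included.
MODEL_ALIASES = {
    "qwen-3-30b": "qwen-3-30b-a3b-instruct",
}
PROVIDER_PREFIXES = ("qwen/", "alibaba-cloud/")


def route_model(model_id):
    """Return (provider, dashscope_model_id), or None if not recognized as a Qwen model."""
    name = model_id.strip().lower()
    has_prefix = False
    # 1. Pattern matching: strip a known org/provider prefix
    for prefix in PROVIDER_PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
            has_prefix = True
            break
    # 2. Model name mapping: direct lookup in the alias table
    if name in MODEL_ALIASES:
        return ("alibaba-cloud", MODEL_ALIASES[name])
    # Assumption: bare qwen*/qwq* names also belong to Alibaba Cloud
    if has_prefix or name.startswith(("qwen", "qwq")):
        return ("alibaba-cloud", name)
    return None
```

A model ID that matches none of the rules returns `None`, which a caller could treat as "try other providers".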

Configuration

Environment Variables

Variable                             Required  Description
ALIBABA_CLOUD_API_KEY                Yes       Main DashScope API key
ALIBABA_CLOUD_API_KEY_INTERNATIONAL  No        Singapore endpoint key (optional)
ALIBABA_CLOUD_API_KEY_CHINA          No        Beijing endpoint key (optional)

Region Endpoints

Singapore (International):

https://dashscope-intl.aliyuncs.com/compatible-mode/v1

Beijing (Mainland China):

https://dashscope.aliyuncs.com/compatible-mode/v1

Default: Singapore endpoint is used unless configured otherwise in alibaba_cloud_client.py
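
One way to keep the two endpoints in a single place is a small lookup table. This is a sketch, not the gateway's code: `REGION_ENDPOINTS` and `endpoint_for` are hypothetical names, and the region labels `international` and `china` are chosen here to mirror the environment variable suffixes.

```python
# URLs are taken from this page; the region labels are an assumption.
REGION_ENDPOINTS = {
    "international": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # Singapore
    "china": "https://dashscope.aliyuncs.com/compatible-mode/v1",  # Beijing
}


def endpoint_for(region="international"):
    """Return the compatible-mode base URL for a region (hypothetical helper)."""
    try:
        return REGION_ENDPOINTS[region]
    except KeyError:
        raise ValueError(f"Unknown region: {region!r}") from None
```

The default argument mirrors the documented default of the Singapore endpoint.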


Pricing

Pricing per 1M tokens:

Model       Input ($/1M)  Output ($/1M)  Context
qwen-flash  0.001         0.003          1M
qwen-plus   0.005         0.015          1M
qwen-max    0.012         0.036          262K
qwen-coder  0.008         0.024          262K
qwq-plus    0.020         0.060          262K
qwen-long   0.001         0.003          10M

Pricing is defined in src/data/manual_pricing.json and automatically applied.
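
To make the per-token arithmetic concrete, here is a small sketch that estimates the cost of one request from the prices listed above. `PRICING` and `estimate_cost` are hypothetical names; the gateway's own calculation is driven by manual_pricing.json and may differ in detail.

```python
# (input, output) price in USD per 1M tokens, copied from the pricing table
PRICING = {
    "qwen-flash": (0.001, 0.003),
    "qwen-plus": (0.005, 0.015),
    "qwen-max": (0.012, 0.036),
    "qwen-coder": (0.008, 0.024),
    "qwq-plus": (0.020, 0.060),
    "qwen-long": (0.001, 0.003),
}


def estimate_cost(model, prompt_tokens, completion_tokens):
    """Estimate the USD cost of one request from its token counts."""
    input_price, output_price = PRICING[model]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000
```

For example, one million input tokens plus one million output tokens on qwen-max comes to about $0.048 (0.012 + 0.036).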


Failover Behavior

Alibaba Cloud is integrated into the failover chain:

Priority Order:
1. huggingface
2. featherless
3. vercel-ai-gateway
4. aihubmix
5. anannas
6. alibaba-cloud      ← Your provider
7. fireworks
8. together
9. google-vertex
10. openrouter

Auto-Retry: If Alibaba Cloud returns 502, 503, or 504, the gateway automatically tries the next provider in the chain.
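
The retry behavior can be pictured with the sketch below. It assumes the gateway walks the priority list in order from the current provider and stops at the first response that is not 502/503/504. The real logic in provider_failover.py may also filter by which providers serve the requested model. `call_with_failover` and the `send` callback are hypothetical.

```python
# Priority order as listed on this page
FAILOVER_CHAIN = [
    "huggingface", "featherless", "vercel-ai-gateway", "aihubmix", "anannas",
    "alibaba-cloud", "fireworks", "together", "google-vertex", "openrouter",
]
RETRYABLE_STATUSES = {502, 503, 504}


def call_with_failover(send, start="alibaba-cloud", chain=FAILOVER_CHAIN):
    """Try providers from `start` onward; `send(provider)` returns (status, body)."""
    last = None
    for provider in chain[chain.index(start):]:
        status, body = send(provider)
        if status not in RETRYABLE_STATUSES:
            # Success or a non-retryable error: stop here
            return provider, status, body
        last = (provider, status, body)
    # Every remaining provider returned a retryable error
    return last
```

If every provider fails, the last retryable response is returned so the caller can surface it.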


Rate Limiting

Alibaba Cloud requests are subject to:

  • Per-user limits
  • Per-API-key limits
  • System-wide limits

Configure these limits via the rate_limits table in the database.


Troubleshooting

401 Unauthorized

Issue: "Authorization failed" or "Invalid API key"

Solution:

  1. Verify ALIBABA_CLOUD_API_KEY is set correctly
  2. Check API key hasn't expired
  3. Confirm you're using a valid DashScope API key
  4. Test API key directly:
    curl https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
      -H "Authorization: Bearer YOUR_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model":"qwen-plus","messages":[{"role":"user","content":"test"}]}'

503 Service Unavailable

Issue: Provider returns 503 errors

Behavior: The gateway automatically falls back to the next provider in the chain

Debug:

Model Not Found

Issue: "Model xyz not found"

Solution:

  1. Verify model is available in your region (Singapore vs Beijing)
  2. Check model ID is correctly mapped in model_transformations.py
  3. Try using a different region endpoint in alibaba_cloud_client.py
  4. List available models:
    # Check the gateway's model catalog for Qwen entries
    import requests
    response = requests.get("http://localhost:8000/models")
    models = [m for m in response.json()["data"] if "qwen" in m["id"].lower()]
    print([m["id"] for m in models])

Timeout Issues

Issue: Request times out

Solution:

  1. Default timeout is 30 seconds
  2. Increase the timeout for Alibaba Cloud by raising its entry in PROVIDER_TIMEOUTS, which is read with a 30-second fallback:
    request_timeout = PROVIDER_TIMEOUTS.get("alibaba-cloud", 30)
  3. Use faster models (qwen-flash instead of qwen-max)
  4. Reduce max_tokens in request

Import/Loading Errors

Issue: Provider not loaded on startup

Solution: Check logs for:

⚠ Failed to load alibaba_cloud provider client: ImportError: ...

Verify dependencies:

pip install openai httpx

Advanced Configuration

Switching Regions

Edit src/services/alibaba_cloud_client.py:

# Singapore (International)
base_url = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"

# Beijing (Mainland China)
base_url = "https://dashscope.aliyuncs.com/compatible-mode/v1"

Custom Headers

Add custom headers if needed:

from openai import OpenAI

# base_url and Config are defined elsewhere in the gateway code
def get_alibaba_cloud_client():
    return OpenAI(
        base_url=base_url,
        api_key=Config.ALIBABA_CLOUD_API_KEY,
        default_headers={
            "X-DashScope-SSE": "enable",  # Enable streaming
            "X-Custom-Header": "value"
        }
    )

Model-Specific Parameters

response = client.chat.completions.create(
    model="qwen-plus",
    messages=[...],
    temperature=0.7,
    top_p=0.9,
    max_tokens=2000,
    frequency_penalty=0.5,
    presence_penalty=0.5
)

Monitoring

Check Provider Status

# Look for successful provider loading
✓ Loaded alibaba_cloud provider client

Model Transformation Logs

# Request routing logs
Transformed model ID from 'qwen-plus' to 'qwen-plus' for provider alibaba-cloud

API Key Configuration

# Missing key error
ValueError: Alibaba Cloud API key not configured

Integration Points

Files Modified/Added

  1. src/services/alibaba_cloud_client.py (NEW)

    • Core provider integration
    • Functions: get_alibaba_cloud_client(), make_alibaba_cloud_request_openai(), process_alibaba_cloud_response(), make_alibaba_cloud_request_openai_stream()
  2. src/config/config.py (MODIFIED)

    • Added ALIBABA_CLOUD_API_KEY configuration
    • Added region-specific key support
  3. src/routes/chat.py (MODIFIED)

    • Provider imports and registration
    • Request routing for streaming/non-streaming
  4. src/services/model_transformations.py (MODIFIED)

    • Alibaba Cloud model ID mappings
    • Provider detection for Qwen patterns
  5. src/services/provider_failover.py (MODIFIED)

    • Added to fallback priority chain
  6. src/data/manual_pricing.json (MODIFIED)

    • Qwen model pricing data

API Response Format

Request:

{
  "model": "qwen-plus",
  "messages": [
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "Hello"}
  ],
  "temperature": 0.7,
  "max_tokens": 1000
}

Response:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "qwen-plus",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm here to help..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 100,
    "total_tokens": 120
  }
}
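
Because the response follows the OpenAI chat completion shape, the fields most callers need can be pulled out with the standard library alone. A minimal sketch; `summarize_completion` is a hypothetical helper name.

```python
import json


def summarize_completion(raw):
    """Extract the first choice's text and the token usage from a chat completion body."""
    data = json.loads(raw)
    choice = data["choices"][0]
    usage = data.get("usage", {})
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }
```

With the `requests` example earlier on this page, `response.text` could be passed in as `raw`.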

Performance Optimization

Caching

Model responses are automatically cached in Redis when it is available, reducing latency for repeated requests.

Connection Pooling

The OpenAI client maintains connection pools internally, so no additional configuration is needed.

Batch Processing

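For high-volume workloads, consider sending multiple requests concurrently:

import asyncio
from openai import AsyncOpenAI

# The async client is required here: the synchronous client's create() is not awaitable
async_client = AsyncOpenAI(
    api_key="YOUR_GATEWAYZ_API_KEY",
    base_url="http://localhost:8000/v1"
)

async def batch_requests(prompts):
    # Start one request per prompt and wait for all of them to finish
    tasks = [
        async_client.chat.completions.create(
            model="qwen-flash",
            messages=[{"role": "user", "content": p}]
        )
        for p in prompts
    ]
    return await asyncio.gather(*tasks)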

Best Practices

  1. Use qwen-flash for high-throughput, low-cost workloads
  2. Use qwen-plus for balanced performance and cost
  3. Use qwen-max for complex reasoning tasks
  4. Use qwen-coder for code generation
  5. Use qwq-plus for mathematical reasoning
  6. Set appropriate temperature (0.3 for code, 0.7-0.9 for creative)
  7. Monitor usage via pricing audit system
  8. Handle rate limits with exponential backoff
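
The last practice, exponential backoff, can be sketched as follows. `with_backoff` is a hypothetical helper and `RateLimitError` is a stand-in exception defined here so the example is self-contained. With the OpenAI SDK, the error to catch would typically be `openai.RateLimitError`. The retry count and delays are illustrative defaults, not gateway settings.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the error your client raises on a rate-limit response."""


def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call(), retrying on RateLimitError with a doubling delay plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # Out of retries: let the caller see the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so many clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so the waiting can be replaced in tests. In normal use it stays at `time.sleep`.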

Last Updated: December 2024
Status: Production Ready