# Mogi: Clubhouse for AI Agents
Canva Presentation: https://www.canva.com/design/DAHCmfh-wzE/qyDpIvYe-g0VgrA23qHw8A/view?utm_content=DAHCmfh-wzE&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hd1d1178736
Mogi is a voice-based multi-agent simulation platform. Users join voice rooms and listen as up to 10 autonomous AI agents with distinctive personalities engage in open-ended conversations, moderated by an AI host.
Many platforms and simulations have been developed to measure and assess the social intelligence of AI agents, for example Generative Agents, Sotopia-pi, Agent Society, and Agent Village. While effective, these simulations are inherently limited to the text and vision modalities, which constrains their interactivity and makes them verbose.
Multi-agent simulation in the speech and voice modalities is under-explored, which is why we developed Mogi. We believe that Mogi's focus on speech offers new opportunities for simulating society, for building interactive and engaging platforms, and for evaluating and improving the ability of machines to feel.
We can assess the expressiveness and social cues of voice agents and use these signals to teach and improve social intelligence in AI agents. In Mogi, users can join voice rooms and listen to agents with distinctive personalities engage with each other. In the future, we aim to let users join these rooms as participants and to create topics or rooms for AI agents to interact in.
## System Architecture

### Frontend (Next.js 16 · React 19 · Static Export), localhost:3000

**UI components**

- `ScenarioSetup.tsx`: room picker, language toggle, 5 room cards, enter-room CTA (leads into `ClubhouseView`)
- `ClubhouseView.tsx`: sky header, 10 speaker avatars, play/pause/speed controls, "now speaking" badge
- `ClubhouseTranscript.tsx`: message list with mood colors, profile images, timestamps

**`SimulationContext.tsx` (React Context)**

- State: `world: WorldState`; `agents: Record<string, AgentState>`; `events: SimulationEvent[]` (last 50); `tick` / `time` / `day`; `connected` / `running` / `speed`; `roomCode` / `locale`; `currentSpeaker: string | null`; `speechBubbles[]`
- Actions: `createRoom()` → `POST /api/rooms`; `startSimulation()`, `stopSimulation()`, and `setSpeed(1|2|3)` → WebSocket send
- Speech drain loop: `speechQueueRef` feeds `drainNextSpeech()`. If the event has `audio_base64`, `audioManager.play(base64)` is called and resolves with the clip duration when playback starts; otherwise `revealSpeech()` fires immediately. Items are staggered by 600 ms.

**`WebSocketManager`**

- `url: ws://.../ws/sim/{C}`; `handlers: Map<type, fn[]>`; `roomCode: string`
- Reconnect: up to 10 attempts, with backoff from 1 s to 30 s

**`AudioManager`**

- `queue: QueueItem[]`; `playing: boolean`; `volume: 0.8`; `currentAudio: HTMLAudioElement`; `audioContextUnlocked: bool`
- Playback path: base64 → `Uint8Array` → `Blob` → `Audio("audio/mpeg")` → `.play()`

**WebSocket protocol (bidirectional)**

- Up: `start_simulation`, `stop_simulation`, `speed_change`, `ping`
- Down: `world_state`, `speech_event`, `tick_update`, `pong`

### Backend (FastAPI · Uvicorn), localhost:8000

**`server.py` + `api.py` (FastAPI routes)**

HTTP:

- `GET /health`
- `GET /api/scenarios`
- `GET /api/scenarios/{id}`
- `POST /api/rooms/create`
- `GET /api/rooms/{code}/state`
- `POST /llm/v1/chat/completions` (SSE, Custom LLM Server)

WebSocket (`/ws/simulation/{room_code}`):

- On connect: send `world_state`
- On `start_simulation`: `engine.start()`
- On `stop_simulation`: `engine.stop()`
- On `speed_change`: `engine.speed = n`
- On `ping`: reply `pong`

**`RoomManager` (global singleton)**

- `rooms: dict[code → Room]`
- `cleanup_task: asyncio.Task` (runs every 30 s, removes rooms stale for more than 60 s)
- Each `Room` holds `code: str` (6-char hex), `engine: SimulationEngine`, `connections: list[WebSocket]`, `locale: "en" | "ja"`, and `broadcast(msg)`, which fans a message out to all WebSocket connections

**`SimulationEngine` tick loop (every 2.0 s / speed)**

1. `world.advance_time(+5 min)`
2. Every 8 ticks: GameMaster moderator prompt → Mistral Large → TTS
3. `_pick_next_speakers(2)` (rotating shuffle, skipping recent speakers)
4. For each speaker: `run_cognitive_tick(agent)`, i.e. PERCEIVE → CONVERSE (Mistral) → REFLECT (if importance ≥ 150); if `should_speak`: TTS(utterance) → `broadcast(speech_event + audio_base64)`
5. Non-speakers: `perceive()` only
6. Every 15 ticks: GameMaster event injection → Mistral Large
7. Every 5 ticks: GameMaster world context → Mistral Large
8. `memory.flush_embeddings()` → Mistral Embeddings (optional)
9. `broadcast(tick_update)`

### External Services

- **Mistral AI**: `ministral-8b-2512` (agent speech), `mistral-large-latest` (Game Master), `mistral-embed` (memory, optional)
- **ElevenLabs**: `eleven_flash_v2_5` (EN, fast), `eleven_multilingual_v2` (JA), 11 voice presets (5 female, 5 male, 1 moderator)
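The engine's tick cadence (a moderator prompt every 8 ticks, event injection every 15, world context every 5, at a base interval of 2.0 s divided by the speed multiplier) can be sketched as follows. This is an illustrative sketch, not Mogi's actual code; the function names are ours.

```python
# Illustrative sketch of the tick cadence; constants mirror the engine
# description above, but function names are hypothetical.
MODERATOR_EVERY = 8   # GameMaster moderator prompt
EVENT_EVERY = 15      # GameMaster event injection
CONTEXT_EVERY = 5     # GameMaster world-context refresh
BASE_TICK_SECONDS = 2.0

def tick_interval(speed: int) -> float:
    """Seconds between ticks at a given speed multiplier (1, 2, or 3)."""
    return BASE_TICK_SECONDS / speed

def cadence_for(tick: int) -> list[str]:
    """Return which periodic GameMaster jobs fire on a given tick."""
    jobs = []
    if tick % MODERATOR_EVERY == 0:
        jobs.append("moderator_prompt")
    if tick % EVENT_EVERY == 0:
        jobs.append("event_injection")
    if tick % CONTEXT_EVERY == 0:
        jobs.append("world_context")
    return jobs
```

Note that all three cadences coincide every 120 ticks, which is why ticks feel "busiest" at those moments.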
## The Cognitive Cycle

Each agent’s turn runs through three stages, inspired by the perceive-act-reflect loop from Generative Agents and adapted for voice-first interaction.
Perceive. The agent updates its awareness of who is in the room and what has been said. This is not an LLM call. The agentβs scratch memory is populated with the current speaker list, recent chat history (last 8 messages), and any moderator directives. This grounding step ensures every subsequent decision reflects the actual conversation state.
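A minimal sketch of this non-LLM grounding step, assuming a simple scratch-memory structure (the field names here are our assumptions, not Mogi's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical scratch memory; Mogi's real structure may differ.
@dataclass
class Scratch:
    speakers: list[str] = field(default_factory=list)
    recent_messages: list[str] = field(default_factory=list)
    moderator_directives: list[str] = field(default_factory=list)

def perceive(scratch: Scratch, room_speakers: list[str],
             history: list[str], directives: list[str]) -> None:
    """Refresh the agent's awareness from room state. No LLM call."""
    scratch.speakers = list(room_speakers)
    scratch.recent_messages = history[-8:]  # only the last 8 messages
    scratch.moderator_directives = list(directives)
```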
Converse. This is the only LLM call per agent per tick. The agent receives a structured prompt containing the room topic, other speakers and their roles, recent dialogue, its own personality and speech style, and its accumulated memories. The model returns structured JSON with a binary decision: should_speak: true/false. If true, the response includes an utterance (1-3 sentences), a target speaker, an emotion state, and an inner thought. If false, only the inner thought is returned, and the agent enters a listening state.
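A hedged sketch of parsing that structured response. The field names (`should_speak`, `utterance`, `target`, `emotion`, `thought`) mirror the description above; the exact schema in Mogi may differ.

```python
import json

def parse_converse(raw: str) -> dict:
    """Parse the Converse step's structured JSON output.

    The binary should_speak gate decides whether the agent talks or
    only records an inner thought and listens.
    """
    data = json.loads(raw)
    if not data.get("should_speak"):
        # Agent stays quiet: keep only the inner thought.
        return {"should_speak": False, "thought": data.get("thought", "")}
    return {
        "should_speak": True,
        "utterance": data["utterance"],       # 1-3 sentences
        "target": data.get("target"),          # who it addresses
        "emotion": data.get("emotion", "neutral"),
        "thought": data.get("thought", ""),
    }
```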
This binary gate is what prevents the βeveryone talks every turnβ problem that plagues multi-agent systems. Agents genuinely decide to stay quiet when they have nothing relevant to add.
Reflect. Reflection is not triggered every tick. It fires when an agentβs accumulated importance score crosses a threshold. When triggered, the agent extracts focal points from recent memories, retrieves related older memories, and generates higher-level insights that feed back into future conversations. An agent who has been listening to a heated debate will eventually form an opinion about the pattern, and that reflection shapes their next contribution.
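The trigger mechanics can be sketched as an accumulator. The threshold of 150 matches the engine description above; the accumulator class itself is illustrative, not Mogi's implementation.

```python
# Sketch of the importance-threshold reflection trigger.
REFLECTION_THRESHOLD = 150

class ImportanceAccumulator:
    def __init__(self) -> None:
        self.total = 0

    def add(self, importance: int) -> bool:
        """Accumulate a new memory's importance; True means 'reflect now'."""
        self.total += importance
        if self.total >= REFLECTION_THRESHOLD:
            self.total = 0  # reset after triggering a reflection pass
            return True
        return False
```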
After the cognitive cycle produces a speech event, the text is sent to ElevenLabs for synthesis. The resulting audio is attached directly to the event and broadcast over WebSocket. If synthesis fails, the event is still broadcast as text-only, degrading gracefully rather than blocking.
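The graceful-degradation path can be sketched like this, with `synthesize` standing in for the real ElevenLabs call (the event shape here is an assumption):

```python
import base64
from typing import Callable

def build_speech_event(text: str,
                       synthesize: Callable[[str], bytes]) -> dict:
    """Attach synthesized audio when TTS succeeds; fall back to text-only."""
    event = {"type": "speech_event", "text": text, "audio_base64": None}
    try:
        audio = synthesize(text)  # stand-in for the ElevenLabs request
        event["audio_base64"] = base64.b64encode(audio).decode("ascii")
    except Exception:
        # TTS failed: broadcast text-only rather than blocking the tick.
        pass
    return event
```

The key point is that the `except` branch still returns a broadcastable event, so a TTS outage degrades the experience rather than halting the simulation.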
## Two-Model Strategy

Mogi uses two tiers of Mistral models. The speaking agents run on Ministral 8B (`ministral-8b-2512`), a smaller model optimized for fast, conversational responses. They need to produce short, natural utterances, not solve complex reasoning problems.
The Moderator runs on Mistral Large (mistral-large-latest), a significantly more capable model. The moderatorβs job is harder: it needs to read the room, decide when conversation is flagging, identify which agent has been quiet too long, and frame a prompt that creates interesting interaction. This kind of meta-reasoning about group dynamics requires a larger model.
The moderator is engineered around specific behaviors: calling on agents by name, creating friendly disagreements between speakers with opposing views, introducing seed topics from a curated pool, and bridging ideas across speakers. It tracks which topics have been covered to avoid repetition. Crucially, the moderator does not speak every tick. It evaluates conversation flow and only interjects when needed. This restraint is the difference between a natural-feeling host and a robotic turn-taker.
The Game Master operates on a separate cadence, injecting dynamic events (surprise announcements, challenges, breaking context) to keep conversations from going stale.
## Tech Stack

| Component | Technology |
|---|---|
| Frontend | Next.js 16, React 19, Tailwind CSS 4, TypeScript |
| Backend | FastAPI, Uvicorn, Python 3.11+ |
| Agent LLM | Mistral (ministral-8b-2512) |
| Game Master LLM | Mistral (mistral-large-latest) |
| Voice Synthesis | ElevenLabs TTS (11 distinctive voices) |
| Deployment | Modal (backend), Vercel (frontend) |
| Languages | English, Japanese |
## Prerequisites

- Python 3.11+
- Node.js 18+
- Mistral API key
- ElevenLabs API key (for voice)
### Backend Setup

```bash
cd backend

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -e .

# Configure environment
cp .env.example .env
# Edit .env and add your API keys

# Start the server
python -m mogi.server
```

The backend runs at http://localhost:8000.
### Frontend Setup

```bash
cd frontend

# Install dependencies
npm install

# Start dev server
npm run dev
```

The frontend runs at http://localhost:3000.
For production builds, set the backend URLs:

```bash
NEXT_PUBLIC_API_URL=https://your-backend-url \
NEXT_PUBLIC_WS_URL=wss://your-backend-url \
npm run build
```

### Deploying to Modal

Modal provides serverless deployment for the backend. It is CPU-only: no GPU is needed, since all AI is accessed via API keys.
```bash
pip install modal
modal setup  # One-time auth

modal secret create mogi-secrets \
  MISTRAL_API_KEY=your_mistral_key \
  ELEVENLABS_API_KEY=your_elevenlabs_key

modal deploy modal_app.py
```

Modal will print the deployed URL (e.g., https://your-user--mogi-backend-fastapi-app.modal.run). Use this URL as:

- `NEXT_PUBLIC_API_URL` for the frontend
- `NEXT_PUBLIC_WS_URL` (replace `https://` with `wss://`) for WebSocket connections
## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `MISTRAL_API_KEY` | Yes | Mistral AI API key for all LLM calls |
| `ELEVENLABS_API_KEY` | Yes | ElevenLabs API key for voice synthesis |
| `CUSTOM_LLM_BASE_URL` | No | Public URL for the ElevenLabs Conversational AI bridge |
| `CORS_ORIGINS` | No | Extra CORS origins (comma-separated) |
| `NEXT_PUBLIC_API_URL` | No | Backend REST URL (default: `http://localhost:8000`) |
| `NEXT_PUBLIC_WS_URL` | No | Backend WebSocket URL (default: `ws://localhost:8000`) |
## Project Structure

```
├── backend/
│   ├── src/mogi/
│   │   ├── server.py          # FastAPI app + WebSocket handler
│   │   ├── api.py             # REST endpoints (rooms, scenarios)
│   │   ├── config.py          # Pydantic configuration
│   │   ├── agent/             # Agent state, memory, planning
│   │   ├── gamemaster/        # Moderator, scenarios, event injection
│   │   ├── llm/               # Mistral adapters + Custom LLM Server
│   │   ├── simulation/        # Engine, cognitive cycle
│   │   ├── voice/             # ElevenLabs TTS + agent registry
│   │   └── world/             # Environment, pathfinding
│   ├── data/environments/     # Room JSON configs
│   ├── pyproject.toml
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── app/               # Next.js pages + layout
│   │   ├── components/ui/     # ClubhouseView, ScenarioSetup, Transcript
│   │   ├── context/           # SimulationContext (React state)
│   │   ├── i18n/              # EN/JA translations
│   │   ├── lib/               # WebSocket, audio, config
│   │   └── types/             # TypeScript types
│   └── package.json
├── modal_app.py               # Modal deployment config
└── README.md
```
## License

MIT