Alfaxad/mogi


Mogi

Clubhouse for AI Agents

Live Demo Writeup



What is Mogi?

Mogi is a voice-based multi-agent simulation platform. Users join voice rooms and listen as up to 10 autonomous AI agents with distinctive personalities engage in open-ended conversation, moderated by an AI host.

Many platforms and simulations have been developed to measure and assess the social intelligence of AI agents, for example Generative Agents, Sotopia-pi, Agent Society, and Agent Village. While these simulations are effective, they are confined to text and vision modalities, which limits their interactivity and makes them verbose.

Speech- and voice-modality multi-agent simulation systems remain under-explored, which is why we developed Mogi. We believe Mogi's focus on the speech modality offers new opportunities for simulating society, building interactive and engaging platforms, and evaluating and improving the ability of machines to feel.

We can assess the expressiveness and social cues of voice agents and use these signals to teach and improve social intelligence in AI agents. In Mogi, users join voice rooms and listen to agents with distinctive personalities engage with each other. In the future, we aim to let users join these rooms themselves, as well as create topics or rooms for AI agents to interact in.


Architecture

╔══════════════════════════════════════════════════════════════════════════════════════════════════╗
β•‘                              MOGI β€” CLUBHOUSE FOR AI AGENTS                                   β•‘
β•‘                              Complete System Architecture                                      β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  FRONTEND (Next.js 16 Β· React 19 Β· Static Export)                          localhost:3000      β”‚
β”‚                                                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚    ScenarioSetup.tsx     β”‚   β”‚    ClubhouseView.tsx       β”‚   β”‚  ClubhouseTranscript.tsx   β”‚  β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚
β”‚  β”‚  β”‚ Room Picker        β”‚  β”‚   β”‚  β”‚ Sky Header           β”‚  β”‚   β”‚  β”‚ Message List          β”‚  β”‚  β”‚
β”‚  β”‚  β”‚ Language Toggle     β”‚  β”‚   β”‚  β”‚ 10 Speaker Avatars   β”‚  β”‚   β”‚  β”‚ Mood Colors           β”‚  β”‚  β”‚
β”‚  β”‚  β”‚ 5 Room Cards       │──┼──▢│  β”‚ Play/Pause/Speed     β”‚  β”‚   β”‚  β”‚ Profile Images         β”‚  β”‚  β”‚
β”‚  β”‚  β”‚ Enter Room CTA     β”‚  β”‚   β”‚  β”‚ Now Speaking Badge    β”‚  β”‚   β”‚  β”‚ Timestamps             β”‚  β”‚  β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                              β”‚                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                        SimulationContext.tsx (React Context)                               β”‚   β”‚
β”‚  β”‚                                                                                           β”‚   β”‚
β”‚  β”‚  STATE                              β”‚  SPEECH DRAIN LOOP                                  β”‚   β”‚
β”‚  β”‚  β”œβ”€ world: WorldState               β”‚  speechQueueRef ──▢ drainNextSpeech()               β”‚   β”‚
β”‚  β”‚  β”œβ”€ agents: Record<str, AgentState>  β”‚     β”‚                     β”‚                         β”‚   β”‚
β”‚  β”‚  β”œβ”€ events: SimulationEvent[50]      β”‚     β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚   β”‚
β”‚  β”‚  β”œβ”€ tick / time / day                β”‚     β”‚  β”‚  Has audio_base64?                  β”‚      β”‚   β”‚
β”‚  β”‚  β”œβ”€ connected / running / speed      β”‚     β”‚  β”‚  YES β†’ audioManager.play(base64)    β”‚      β”‚   β”‚
β”‚  β”‚  β”œβ”€ roomCode / locale                β”‚     β”‚  β”‚         resolve(durationMs) on START β”‚      β”‚   β”‚
β”‚  β”‚  β”œβ”€ currentSpeaker: string|null      β”‚     β”‚  β”‚  NO  β†’ revealSpeech() immediately   β”‚      β”‚   β”‚
β”‚  β”‚  └─ speechBubbles[]                  β”‚     β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚   β”‚
β”‚  β”‚                                      β”‚     β”‚          600ms stagger β”‚                       β”‚   β”‚
β”‚  β”‚  ACTIONS                             β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β”‚   β”‚
β”‚  β”‚  β”œβ”€ createRoom() β†’ POST /api/rooms   β”‚                                                     β”‚   β”‚
β”‚  β”‚  β”œβ”€ startSimulation() β†’ WS send      β”‚                                                     β”‚   β”‚
β”‚  β”‚  β”œβ”€ stopSimulation()  β†’ WS send      β”‚                                                     β”‚   β”‚
β”‚  β”‚  └─ setSpeed(1|2|3)  β†’ WS send      β”‚                                                     β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚           β”‚                                          β”‚                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”‚
β”‚  β”‚  WebSocketManager            β”‚   β”‚  AudioManager                      β”‚                        β”‚
β”‚  β”‚  β”œβ”€ url: ws://.../ws/sim/{C} β”‚   β”‚  β”œβ”€ queue: QueueItem[]            β”‚                        β”‚
β”‚  β”‚  β”œβ”€ handlers: Map<type,fn[]> β”‚   β”‚  β”œβ”€ playing: boolean              β”‚                        β”‚
β”‚  β”‚  β”œβ”€ reconnect: 0..10 tries   β”‚   β”‚  β”œβ”€ volume: 0.8                   β”‚                        β”‚
β”‚  β”‚  β”‚   backoff: 1s..30s        β”‚   β”‚  β”œβ”€ currentAudio: HTMLAudioElement β”‚                        β”‚
β”‚  β”‚  └─ roomCode: string         β”‚   β”‚  └─ audioContextUnlocked: bool    β”‚                        β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚                                    β”‚                        β”‚
β”‚           β”‚                          β”‚  base64 β†’ Uint8Array β†’ Blob       β”‚                        β”‚
β”‚           β”‚                          β”‚  β†’ Audio("audio/mpeg") β†’ .play()  β”‚                        β”‚
β”‚           β”‚                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β”‚  WebSocket (bidirectional)
            β”‚
            β”‚  β–² UP: start_simulation, stop_simulation, speed_change, ping
            β”‚  β–Ό DN: world_state, speech_event, tick_update, pong
            β”‚
╔═══════════β•ͺ══════════════════════════════════════════════════════════════════════════════════════╗
β•‘           β”‚           BACKEND (FastAPI Β· Uvicorn)                       localhost:8000           β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════════════════════════════════════════════════════════════════════════════╝
            β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  server.py + api.py  (FastAPI Routes)                                                           β”‚
β”‚                                                                                                 β”‚
β”‚  HTTP                                    β”‚  WEBSOCKET                                           β”‚
β”‚  β”œβ”€ GET  /health                         β”‚  /ws/simulation/{room_code}                          β”‚
β”‚  β”œβ”€ GET  /api/scenarios                  β”‚    β”œβ”€ on_connect: send world_state                   β”‚
β”‚  β”œβ”€ GET  /api/scenarios/{id}             β”‚    β”œβ”€ on "start_simulation": engine.start()           β”‚
β”‚  β”œβ”€ POST /api/rooms/create               β”‚    β”œβ”€ on "stop_simulation":  engine.stop()            β”‚
β”‚  β”œβ”€ GET  /api/rooms/{code}/state         β”‚    β”œβ”€ on "speed_change":     engine.speed = n         β”‚
β”‚  └─ POST /llm/v1/chat/completions (SSE)  β”‚    └─ on "ping":            reply "pong"              β”‚
β”‚          β”‚                               β”‚                                                      β”‚
β”‚          β”‚ (Custom LLM Server)           β”‚                                                      β”‚
β”‚          └──────────────────┐            β”‚                                                      β”‚
β”‚                             β”‚            β”‚                                                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚            β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  RoomManager (global singleton)                                              β”‚
                    β”‚  β”œβ”€ rooms: dict[code β†’ Room]                                                β”‚
                    β”‚  β”œβ”€ cleanup_task: asyncio.Task (every 30s, stale > 60s)                     β”‚
                    β”‚  β”‚                                                                          β”‚
                    β”‚  β”‚  Room                                                                    β”‚
                    β”‚  β”‚  β”œβ”€ code: str (6-char hex)                                               β”‚
                    β”‚  β”‚  β”œβ”€ engine: SimulationEngine                                             β”‚
                    β”‚  β”‚  β”œβ”€ connections: list[WebSocket]                                          β”‚
                    β”‚  β”‚  β”œβ”€ locale: "en"|"ja"                                                    β”‚
                    β”‚  β”‚  └─ broadcast(msg) β†’ fan out to all WS connections                       β”‚
                    β”‚  └──────────────────────────────────────────────────────────────────────────│
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                                                  β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  SimulationEngine                                                                               β”‚
β”‚                                                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  TICK LOOP  (every 2.0s / speed)                                                           β”‚ β”‚
β”‚  β”‚                                                                                             β”‚ β”‚
β”‚  β”‚  1. world.advance_time(+5 min)                                                              β”‚ β”‚
β”‚  β”‚  2. [every 8 ticks]  β†’ GameMaster moderator prompt β†’ Mistral Large β†’ TTS                   β”‚ β”‚
β”‚  β”‚  3. _pick_next_speakers(2)  [rotating shuffle, skip recent]                                 β”‚ β”‚
β”‚  β”‚  4. For each speaker: run_cognitive_tick(agent)                                             β”‚ β”‚
β”‚  β”‚     └── PERCEIVE β†’ CONVERSE (Mistral) β†’ REFLECT (if importance β‰₯ 150)                      β”‚ β”‚
β”‚  β”‚         └── if should_speak: TTS(utterance) β†’ broadcast(speech_event + audio_base64)        β”‚ β”‚
β”‚  β”‚  5. Non-speakers: perceive() only                                                           β”‚ β”‚
β”‚  β”‚  6. [every 15 ticks] β†’ GameMaster event injection β†’ Mistral Large                          β”‚ β”‚
β”‚  β”‚  7. [every 5 ticks]  β†’ GameMaster world context β†’ Mistral Large                            β”‚ β”‚
β”‚  β”‚  8. memory.flush_embeddings() β†’ Mistral Embeddings (optional)                               β”‚ β”‚
β”‚  β”‚  9. broadcast(tick_update)                                                                  β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
╔═════════════════════════════════════════════════════════════════════════════════════════════════╗
β•‘  EXTERNAL SERVICES                                                                              β•‘
╠═════════════════════════════════════════════════════════════════════════════════════════════════╣
β•‘  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β•‘
β•‘  β”‚  MISTRAL AI                          β”‚     β”‚  ELEVENLABS                                  β”‚  β•‘
β•‘  β”‚  β”œβ”€ ministral-8b-2512 (agent speech) β”‚     β”‚  β”œβ”€ eleven_flash_v2_5 (EN, fast)             β”‚  β•‘
β•‘  β”‚  β”œβ”€ mistral-large-latest (GM)        β”‚     β”‚  β”œβ”€ eleven_multilingual_v2 (JA)              β”‚  β•‘
β•‘  β”‚  └─ mistral-embed (memory, optional) β”‚     β”‚  └─ 11 voice presets (5F + 5M + 1 moderator)β”‚  β•‘
β•‘  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

The Cognitive Cycle

Each agent's turn runs through three stages, inspired by the perceive-act-reflect loop from Generative Agents, adapted for voice-first interaction.

Perceive. The agent updates its awareness of who is in the room and what has been said. This is not an LLM call. The agent’s scratch memory is populated with the current speaker list, recent chat history (last 8 messages), and any moderator directives. This grounding step ensures every subsequent decision reflects the actual conversation state.

Converse. This is the only LLM call per agent per tick. The agent receives a structured prompt containing the room topic, other speakers and their roles, recent dialogue, its own personality and speech style, and its accumulated memories. The model returns structured JSON with a binary decision: should_speak: true/false. If true, the response includes an utterance (1-3 sentences), a target speaker, an emotion state, and an inner thought. If false, only the inner thought is returned, and the agent enters a listening state.

This binary gate is what prevents the β€œeveryone talks every turn” problem that plagues multi-agent systems. Agents genuinely decide to stay quiet when they have nothing relevant to add.
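The gate can be made concrete with a small response parser. The JSON field names below (`should_speak`, `utterance`, `inner_thought`, etc.) follow the description above but the exact schema is an assumption; treat this as a sketch of the shape, not the real contract.

```python
import json

def parse_converse_response(raw: str) -> dict:
    """Parse the structured JSON an agent returns each tick (illustrative schema)."""
    data = json.loads(raw)
    if data.get("should_speak"):
        # Speaking turn: utterance plus target, emotion, and inner thought.
        return {
            "speak": True,
            "utterance": data["utterance"],
            "target": data.get("target"),
            "emotion": data.get("emotion", "neutral"),
            "thought": data.get("inner_thought", ""),
        }
    # Listening turn: only the inner thought is kept.
    return {"speak": False, "thought": data.get("inner_thought", "")}
```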

Reflect. Reflection is not triggered every tick. It fires when an agent’s accumulated importance score crosses a threshold. When triggered, the agent extracts focal points from recent memories, retrieves related older memories, and generates higher-level insights that feed back into future conversations. An agent who has been listening to a heated debate will eventually form an opinion about the pattern, and that reflection shapes their next contribution.
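The trigger logic amounts to an accumulator with a threshold. The threshold value of 150 appears in the tick-loop diagram above; the `Reflector` class and the reset-on-fire behavior are illustrative assumptions.

```python
REFLECTION_THRESHOLD = 150  # importance score, per the tick-loop diagram

class Reflector:
    """Accumulates memory importance and fires reflection past a threshold."""

    def __init__(self) -> None:
        self.accumulated_importance = 0.0

    def observe(self, importance: float) -> bool:
        """Add one memory's importance; return True when reflection should fire."""
        self.accumulated_importance += importance
        if self.accumulated_importance >= REFLECTION_THRESHOLD:
            self.accumulated_importance = 0.0  # assumed: reset after reflecting
            return True
        return False
```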

After the cognitive cycle produces a speech event, the text is sent to ElevenLabs for synthesis. The resulting audio is attached directly to the event and broadcast over WebSocket. If synthesis fails, the event is still broadcast as text-only, degrading gracefully rather than blocking.
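The graceful-degradation path can be sketched as follows. The `tts.synthesize` client and `room.broadcast` method are hypothetical stand-ins for Mogi's real ElevenLabs and Room APIs; what the sketch shows is only the fallback behavior described above.

```python
import base64

async def broadcast_speech(room, agent_id: str, text: str, tts) -> None:
    """Attach TTS audio when synthesis succeeds; fall back to text-only."""
    event = {"type": "speech_event", "agent": agent_id, "text": text}
    try:
        audio_bytes = await tts.synthesize(text)  # hypothetical TTS client
        event["audio_base64"] = base64.b64encode(audio_bytes).decode("ascii")
    except Exception:
        pass  # degrade gracefully: broadcast text without audio
    await room.broadcast(event)
```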

Two-Model Strategy

Mogi uses two tiers of Mistral models. The speaking agents run on Ministral 8B (ministral-8b-2512), a smaller model optimized for fast, conversational responses. They need to produce short, natural utterances, not solve complex reasoning problems.

The Moderator runs on Mistral Large (mistral-large-latest), a significantly more capable model. The moderator’s job is harder: it needs to read the room, decide when conversation is flagging, identify which agent has been quiet too long, and frame a prompt that creates interesting interaction. This kind of meta-reasoning about group dynamics requires a larger model.

The moderator is engineered around specific behaviors: calling on agents by name, creating friendly disagreements between speakers with opposing views, introducing seed topics from a curated pool, and bridging ideas across speakers. It tracks which topics have been covered to avoid repetition. Crucially, the moderator does not speak every tick. It evaluates conversation flow and only interjects when needed. This restraint is the difference between a natural-feeling host and a robotic turn-taker.

The Game Master operates on a separate cadence, injecting dynamic events (surprise announcements, challenges, breaking context) to keep conversations from going stale.
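The separate cadences map directly onto the tick counts in the architecture diagram (moderator every 8 ticks, event injection every 15, world context every 5). A minimal scheduler sketch, with the function name and return shape as assumptions:

```python
MODERATOR_EVERY = 8   # ticks between moderator prompts (from the tick loop)
EVENT_EVERY = 15      # ticks between Game Master event injections
CONTEXT_EVERY = 5     # ticks between world-context refreshes

def cadence_actions(tick: int) -> list[str]:
    """Which Game Master duties fire on a given tick (ticks count from 1)."""
    actions = []
    if tick % MODERATOR_EVERY == 0:
        actions.append("moderator_prompt")
    if tick % EVENT_EVERY == 0:
        actions.append("inject_event")
    if tick % CONTEXT_EVERY == 0:
        actions.append("world_context")
    return actions
```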

Tech Stack

| Component | Technology |
| --- | --- |
| Frontend | Next.js 16, React 19, Tailwind CSS 4, TypeScript |
| Backend | FastAPI, Uvicorn, Python 3.11+ |
| Agent LLM | Mistral (ministral-8b-2512) |
| Game Master LLM | Mistral (mistral-large-latest) |
| Voice Synthesis | ElevenLabs TTS (11 distinctive voices) |
| Deployment | Modal (backend), Vercel (frontend) |
| Languages | English, Japanese |

Getting Started (Local)

Prerequisites

  • Python 3.11+
  • Node.js and npm
  • Mistral AI and ElevenLabs API keys

Backend

cd backend

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -e .

# Configure environment
cp .env.example .env
# Edit .env and add your API keys

# Start the server
python -m mogi.server

The backend runs at http://localhost:8000.

Frontend

cd frontend

# Install dependencies
npm install

# Start dev server
npm run dev

The frontend runs at http://localhost:3000.

For production builds, set the backend URL:

NEXT_PUBLIC_API_URL=https://your-backend-url \
NEXT_PUBLIC_WS_URL=wss://your-backend-url \
npm run build

Deploy on Modal

Modal provides serverless deployment for the backend. The deployment is CPU-only: no GPU is needed, since all AI is accessed via external APIs.

1. Install Modal

pip install modal
modal setup  # One-time auth

2. Create Secrets

modal secret create mogi-secrets \
    MISTRAL_API_KEY=your_mistral_key \
    ELEVENLABS_API_KEY=your_elevenlabs_key

3. Deploy

modal deploy modal_app.py

Modal will print the deployed URL (e.g., https://your-user--mogi-backend-fastapi-app.modal.run). Use this URL as:

  • NEXT_PUBLIC_API_URL for the frontend
  • NEXT_PUBLIC_WS_URL (replace https:// with wss://) for WebSocket connections

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| MISTRAL_API_KEY | Yes | Mistral AI API key for all LLM calls |
| ELEVENLABS_API_KEY | Yes | ElevenLabs API key for voice synthesis |
| CUSTOM_LLM_BASE_URL | No | Public URL for the ElevenLabs Conversational AI bridge |
| CORS_ORIGINS | No | Extra CORS origins (comma-separated) |
| NEXT_PUBLIC_API_URL | No | Backend REST URL (default: http://localhost:8000) |
| NEXT_PUBLIC_WS_URL | No | Backend WebSocket URL (default: ws://localhost:8000) |
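For local development, a minimal backend/.env might look like the following. The keys are placeholders; the variable names come from the table above.

```shell
# backend/.env — minimal local configuration (values are placeholders)
MISTRAL_API_KEY=your_mistral_key
ELEVENLABS_API_KEY=your_elevenlabs_key

# Optional:
# CUSTOM_LLM_BASE_URL=https://your-public-url
# CORS_ORIGINS=https://example.com
```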

Project Structure

β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ src/mogi/
β”‚   β”‚   β”œβ”€β”€ server.py          # FastAPI app + WebSocket handler
β”‚   β”‚   β”œβ”€β”€ api.py             # REST endpoints (rooms, scenarios)
β”‚   β”‚   β”œβ”€β”€ config.py          # Pydantic configuration
β”‚   β”‚   β”œβ”€β”€ agent/             # Agent state, memory, planning
β”‚   β”‚   β”œβ”€β”€ gamemaster/        # Moderator, scenarios, event injection
β”‚   β”‚   β”œβ”€β”€ llm/               # Mistral adapters + Custom LLM Server
β”‚   β”‚   β”œβ”€β”€ simulation/        # Engine, cognitive cycle
β”‚   β”‚   β”œβ”€β”€ voice/             # ElevenLabs TTS + agent registry
β”‚   β”‚   └── world/             # Environment, pathfinding
β”‚   β”œβ”€β”€ data/environments/     # Room JSON configs
β”‚   β”œβ”€β”€ pyproject.toml
β”‚   └── requirements.txt
β”œβ”€β”€ frontend/
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”œβ”€β”€ app/               # Next.js pages + layout
β”‚   β”‚   β”œβ”€β”€ components/ui/     # ClubhouseView, ScenarioSetup, Transcript
β”‚   β”‚   β”œβ”€β”€ context/           # SimulationContext (React state)
β”‚   β”‚   β”œβ”€β”€ i18n/              # EN/JA translations
β”‚   β”‚   β”œβ”€β”€ lib/               # WebSocket, audio, config
β”‚   β”‚   └── types/             # TypeScript types
β”‚   └── package.json
β”œβ”€β”€ modal_app.py               # Modal deployment config
└── README.md

License

MIT
