An OpenClaw skill for creating animated talking-circle videos (Telegram-style round video messages) from avatar frame images and audio.
Send this message to your OpenClaw assistant in Telegram:
Install the talking-circle skill for generating video circles. Repository: https://github.com/Rai220/talking-circle After installation, generate frame images for my avatar and send me a test circle saying "Hello, this is my first talking circle!"
The assistant will:
- Clone the skill repository.
- Generate 4 frame images for your character (neutral, slight open, wide open, blink) using an image-generation model.
- Synthesize speech via ElevenLabs or SaluteSpeech (Sber) TTS.
- Render an animated talking-circle video and send it back.
```
clawhub install talking-circle
```

Browse on ClawHub: clawhub.ai/Rai220/talking-circle
```
git clone https://github.com/Rai220/talking-circle.git ~/.claude/skills/talking-circle
```

- Python 3.9+
- ffmpeg installed and on PATH
- Optional: ElevenLabs API key (for ElevenLabs text-to-video)
- Optional: SaluteSpeech credentials (for Sber text-to-video)
Dependencies (numpy, pillow, requests) are auto-installed into a temporary venv on first run, or install manually:
```
pip install -r ~/.claude/skills/talking-circle/requirements.txt
```

Talking-circle takes 4 images of a character's face (mouth closed, slightly open, wide open, eyes closed) and an audio file, then produces an animated circular video in which the character "speaks" in sync with the audio. The result looks like a Telegram video message ("kruzhochek").
The skill can be used by an AI assistant (via OpenClaw) to:
- Generate a talking-circle video from existing audio and frame images.
- Generate speech from text via ElevenLabs TTS and produce the video in one step.
- Generate speech from text via SaluteSpeech (Sber) TTS — ideal for Russian language.
- Audio is converted to mono 16 kHz WAV.
- For each video frame, RMS amplitude is computed over a 35 ms window centered on the frame's timestamp.
- The RMS value selects which mouth frame to show:
  - Below `--amp-low` (default 1200): neutral (mouth closed)
  - Between `--amp-low` and `--amp-high` (default 2600): slightly open
  - Above `--amp-high`: wide open
- Blink frames are overlaid at regular intervals (configurable).
- If audio analysis produces a static result (e.g. silence), a fallback animation cycle is used.
- Frames are composited into a circle with a subtle border, then encoded with ffmpeg.
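The per-frame selection logic described above can be sketched as follows. This is a simplified illustration, not the script's actual internals; `rms` and `pick_mouth_frame` are hypothetical names:

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a window of 16-bit PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def pick_mouth_frame(window, amp_low=1200, amp_high=2600):
    """Map the RMS of a ~35 ms window (about 560 samples at 16 kHz)
    to a mouth frame, using the documented default thresholds."""
    level = rms(window)
    if level < amp_low:
        return "neutral"   # mouth closed
    if level < amp_high:
        return "slight"    # slightly open
    return "wide"          # wide open

print(pick_mouth_frame([0] * 560))     # silence -> "neutral"
print(pick_mouth_frame([3000] * 560))  # loud window -> "wide"
```

The two thresholds map directly to the `--amp-low` and `--amp-high` flags, so tuning them shifts how "talkative" the animation looks.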
```
python3 scripts/make_talking_circle_video.py \
  --neutral neutral.png --slight slight.png --wide wide.png --blink blink.png \
  --audio speech.mp3 --out /tmp/talking-circle.mp4
```

Requires the `ELEVENLABS_API_KEY` environment variable.
```
python3 scripts/make_text_to_video.py \
  --text "Hello world!" --voice-id YOUR_VOICE_ID \
  --neutral neutral.png --slight slight.png --wide wide.png --blink blink.png \
  --out /tmp/talking-circle.mp4
```

Requires the `SALUTE_SPEECH_AUTH` environment variable (Base64-encoded `client_id:client_secret`).
```
python3 scripts/make_salute_text_to_video.py \
  --text "Привет мир!" --voice Bys_24000 \
  --neutral neutral.png --slight slight.png --wide wide.png --blink blink.png \
  --out /tmp/talking-circle.mp4
```

You need 4 square PNG images of your character, all at the same resolution (2048x2048 recommended):
| Frame | Description |
|---|---|
| Neutral | Mouth closed, eyes open — the default resting state |
| Slight open | Mouth slightly open, eyes open — moderate speech |
| Wide open | Mouth wide open, eyes open — loud speech |
| Blink | Mouth closed, eyes closed — periodic blink animation |
- All 4 frames must have identical resolution, art style, colors, and character positioning.
- Only the mouth and eyes should differ between frames — head, body, background stay the same.
- Do not mix frames from different generation sessions.
If you don't have ready-made frames, generate them using an image generation API (DALL-E, Midjourney, Flux, etc.):
Step 1. Generate the neutral frame — a shoulder-up portrait with mouth closed, eyes open.
Step 2. Use image editing / inpainting on the neutral frame to produce the other 3 states:
| Frame | Edit region | Prompt hint |
|---|---|---|
| Slight open | Mouth only | "Mouth slightly open, teeth barely visible" |
| Wide open | Mouth only | "Mouth wide open as if saying 'ah'" |
| Blink | Eyes only | "Eyes gently closed, mouth closed" |
Step 3. Verify all 4 images have the same resolution and that head/body position hasn't shifted.
See examples/frames/README.md for detailed instructions.
3D-rendered anthropomorphic cat character — lavender-blue fur, green eyes, pink nose, green hoodie.
Reference and output:
| Reference | Output |
|---|---|
| (reference image) | example.mp4 |
Frame set:
| Neutral | Slight open | Wide open | Blink |
|---|---|---|---|
| (image) | (image) | (image) | (image) |
The text-to-video mode uses ElevenLabs by default. Ready-to-use voice preset for Sbercat:
| Parameter | Value |
|---|---|
| `--voice-id` | pNInz6obpgDQGcFmaJgB |
| `--model-id` | eleven_multilingual_v2 |
| `--stability` | 0.15 |
| `--similarity-boost` | 0.70 |
| `--style` | 0.38 |
| `--speed` | 1.20 |
Also supported: SaluteSpeech (Sber) — great for Russian. Voices: Nec (Natalia), Bys (Boris), May (Martha), Tur (Taras), Ost (Alexandra), Pon (Sergey), Kin (Kira, en-US). Use make_salute_text_to_video.py.
Don't have an API key? Use any other TTS engine (OpenAI TTS, Coqui, Piper, Silero, and so on) to generate an audio file, then pass it to the audio-to-video mode with `--audio`; that mode needs no API key at all.
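Wiring an external TTS into the audio-to-video mode amounts to assembling one command line. A sketch, assuming the default install path; `build_audio_to_video_cmd` and the `frames/` layout are hypothetical:

```python
from pathlib import Path

# Default install location used by the skill's clone instructions.
SKILL = Path.home() / ".claude" / "skills" / "talking-circle"

def build_audio_to_video_cmd(audio, out, frames_dir="frames"):
    """Assemble the make_talking_circle_video.py invocation for an
    audio file produced by any TTS engine (or any recording)."""
    d = Path(frames_dir)
    return [
        "python3", str(SKILL / "scripts" / "make_talking_circle_video.py"),
        "--neutral", str(d / "neutral.png"),
        "--slight", str(d / "slight.png"),
        "--wide", str(d / "wide.png"),
        "--blink", str(d / "blink.png"),
        "--audio", audio,
        "--out", out,
    ]

# Pass the result to subprocess.run(cmd, check=True) once the frames
# and the TTS-generated audio file exist on disk.
cmd = build_audio_to_video_cmd("speech.wav", "/tmp/talking-circle.mp4")
```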
| Parameter | Default | Description |
|---|---|---|
| `--size` | 720 | Output video size in pixels |
| `--diameter` | 640 | Circle diameter in pixels |
| `--fps` | 30 | Frames per second |
| Parameter | Default | Description |
|---|---|---|
| `--blink-start` | 1.1 s | Delay before first blink |
| `--blink-every` | 3.8 s | Interval between blinks |
| `--blink-duration-frames` | 4 | Frames per blink |
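The three blink parameters combine into a simple schedule. A sketch using the defaults from the table above; `blink_frames` is a hypothetical helper, not the script's actual code:

```python
def blink_frames(total_frames, fps=30, start=1.1, every=3.8, duration=4):
    """Frame indices where the blink image replaces the mouth frame."""
    frames = set()
    t = start
    while t * fps < total_frames:
        first = round(t * fps)
        frames.update(range(first, min(first + duration, total_frames)))
        t += every
    return frames

# With the defaults at 30 fps, the first blink covers frames 33-36
# (1.1 s in), and the next one starts 3.8 s later.
```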
| Parameter | Default | Description |
|---|---|---|
| `--amp-low` | 1200 | RMS threshold between neutral and slight open |
| `--amp-high` | 2600 | RMS threshold between slight open and wide open |
| Parameter | Default | Description |
|---|---|---|
| `--voice-id` | (required) | ElevenLabs voice ID |
| `--model-id` | eleven_multilingual_v2 | ElevenLabs model |
| `--stability` | 0.50 | Voice stability |
| `--similarity-boost` | 0.75 | Similarity boost |
| `--style` | 0.00 | Style exaggeration |
| `--speed` | 1.00 | Speech speed |
MIT





