# This Is a Demo for Sesame CSM (Conversational Speech Model)
*Created by: Chaitanya Benade | [Website](https://chaitanya-benade.space/)*

## Setup Instructions

**Very important:** Create a notebook secret named `HF_TOKEN` whose value is your Hugging Face token.
[Get your token here](https://huggingface.co/docs/hub/en/security-tokens)
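
If you want to confirm the secret is visible before downloading anything, here is a minimal sketch. It assumes the notebook runs either in Google Colab (where secrets are read through `google.colab.userdata`) or in an environment where `HF_TOKEN` is set as an environment variable. The helper name `resolve_hf_token` is made up for this example.

```python
import os


def resolve_hf_token(env=os.environ):
    """Return the Hugging Face token, or None if it cannot be found."""
    # 1) Plain environment variable (works anywhere).
    token = env.get("HF_TOKEN")
    if token:
        return token
    # 2) Colab notebook secret (assumption: running inside Google Colab).
    try:
        from google.colab import userdata
    except ImportError:
        return None
    try:
        return userdata.get("HF_TOKEN")
    except Exception:
        # Secret missing, or notebook access to the secret not granted.
        return None


if __name__ == "__main__":
    print("HF_TOKEN found" if resolve_hf_token() else "HF_TOKEN is missing")
```

If this prints that the token is missing, fix the secret first; the model download in the next cells will likely fail otherwise.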

### Initial Setup

In [None]:
!git clone https://github.com/SesameAILabs/csm.git
%cd csm
!pip install -r requirements.txt

## Quick Start Example

### This is a basic script to verify that everything works (essentially what Sesame provides). Make sure it runs without any errors before moving on.

In [None]:
from huggingface_hub import hf_hub_download
from generator import load_csm_1b
import torchaudio

model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt")
generator = load_csm_1b(model_path, "cuda")
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)


## Model Setup for Multiple Uses

Load the model once here so you don't have to load it **every freaking time**.

In [None]:
from huggingface_hub import hf_hub_download
from generator import load_csm_1b
import torchaudio

model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt")
generator = load_csm_1b(model_path, "cuda")
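
Re-running the cell above still reloads the checkpoint each time. One way around that, as a rough sketch, is to wrap the same two calls in a cached function. `get_generator` is a hypothetical name; the calls inside it are the ones already used in this notebook.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_generator(device="cuda"):
    """Load the CSM generator on the first call and reuse it afterwards."""
    # Imported here so this cell can be defined before the setup cells have run.
    from huggingface_hub import hf_hub_download
    from generator import load_csm_1b

    model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt")
    return load_csm_1b(model_path, device)


# Usage (needs the csm repo on the path and a GPU):
# generator = get_generator()   # slow the first time
# generator = get_generator()   # returns the cached object immediately
```

The cache lives only as long as the notebook kernel, so a restart still means one fresh load.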



## Basic Audio Generation

Now that we have loaded the model, we only have to worry about generating audio.

> To be honest, the context window is very small when you use a reference, and without a reference the quality is only okay-ish, so you should give it a reference.

---

But first, without a reference. This can go up to about 100 seconds as per my testing; you can test more. You just need to change `max_audio_length_ms`, which is in milliseconds.
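
Since the limit is given in milliseconds, a tiny helper can save some mental arithmetic. This is only a sketch: `seconds_to_ms` and `expected_max_samples` are names invented here, and the sample rate should come from `generator.sample_rate` rather than being hard-coded.

```python
def seconds_to_ms(seconds):
    """Convert seconds to the integer milliseconds that max_audio_length_ms expects."""
    return int(round(seconds * 1000))


def expected_max_samples(max_audio_length_ms, sample_rate):
    """Upper bound on the number of samples in the generated tensor."""
    return max_audio_length_ms * sample_rate // 1000


# Usage:
# generator.generate(..., max_audio_length_ms=seconds_to_ms(100))
# expected_max_samples(seconds_to_ms(100), generator.sample_rate)
```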

### Simple Generation Without Reference

**Note:** This can take a long time to generate. Like really, really, REALLY long.

In [None]:
import torchaudio
import IPython.display as ipd
audio = generator.generate(
    text="""

   On a lonely cliffside overlooking the restless sea stood an ancient lighthouse, its white-and-red tower weathered by time and countless storms. Locals called it The Whispering Lighthouse, though no one knew exactly why. Some said that on stormy nights, when the wind howled through the rocks, you could hear whispers coming from the tower—soft, eerie voices carried by the sea breeze.

No one had lived there for decades. The last keeper, old Henry Caldwell, had vanished one night without a trace. His logbook was found open on the desk, the last entry reading only:

"The light must never go out."

For years, sailors still saw the beam cutting through the mist, even though everyone swore the lighthouse was abandoned. Some believed it was Henry’s ghost, keeping his post in the afterlife.

One evening, a young journalist named Evelyn decided to investigate. Armed with a flashlight and a stubborn sense of curiosity, she hiked up to the lighthouse just as the sun dipped below the horizon. The door creaked open with surprising ease, and inside, the air smelled of salt and dust.

As she climbed the spiral staircase, her footsteps echoed against the stone walls. The higher she went, the more she felt… watched. The whispers started softly at first, like a breeze through an open window. Then they grew clearer.
""",
    speaker=0,
    context=[],
    max_audio_length_ms=100_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
ipd.Audio("audio.wav")
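
If one giant call is too slow or gets cut off, an alternative is to split the text into sentence-sized chunks and generate them one by one. The sketch below is not something Sesame provides; `split_into_chunks` and `generate_long` are hypothetical helpers, and because every chunk is generated with an empty context, the voice may drift between chunks.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    normalized = " ".join(text.split())  # collapse newlines and extra spaces
    sentences = re.split(r"(?<=[.!?])\s+", normalized)
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def generate_long(generator, text, speaker=0, max_chars=200, max_audio_length_ms=30_000):
    """Generate each chunk separately and join the 1-D audio tensors."""
    import torch  # imported lazily so the splitter works without torch

    chunks = split_into_chunks(text, max_chars)
    if not chunks:
        raise ValueError("text is empty")
    pieces = [
        generator.generate(
            text=chunk,
            speaker=speaker,
            context=[],
            max_audio_length_ms=max_audio_length_ms,
        )
        for chunk in chunks
    ]
    return torch.cat(pieces)
```

The result can be saved the same way as before, with `torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)`.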

### Here is a short example. This takes around 4 seconds to generate without a reference.

In [43]:
import torchaudio
import IPython.display as ipd
# Generate audio using the preloaded generator
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=100_000,  # Upper bound on the output length, in milliseconds
)

# Save the generated audio to a file
output_file = "audio.wav"
torchaudio.save(output_file, audio.unsqueeze(0).cpu(), generator.sample_rate)
print(f"Audio saved to {output_file}")
ipd.Audio("audio.wav")

Audio saved to audio.wav
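
To put a number on "how long does it take", you can time the call and compare it with the length of the audio you got back. A small sketch using only the standard library; `timed` and `real_time_factor` are names made up for this example.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def real_time_factor(elapsed_seconds, num_samples, sample_rate):
    """Seconds of compute per second of audio; below 1.0 is faster than real time."""
    return elapsed_seconds / (num_samples / sample_rate)


# Usage with the preloaded generator:
# audio, elapsed = timed(
#     generator.generate,
#     text="Hello from Sesame.", speaker=0, context=[], max_audio_length_ms=10_000,
# )
# print(real_time_factor(elapsed, audio.shape[0], generator.sample_rate))
```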


## Advanced: Audio Generation with Reference

### Complete Setup for Reference-Based Generation

> Here we load the model and make sure it is ready to generate audio with a reference.

In [None]:
import torch
from huggingface_hub import hf_hub_download
from generator import load_csm_1b, Segment  # Segment bundles text, speaker, and audio
import os
import torchaudio
import IPython.display as ipd
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)
cache_dir = "./cache"
os.makedirs(cache_dir, exist_ok=True)

# The checkpoint is downloaded automatically on first use
model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt", cache_dir=cache_dir)
generator = load_csm_1b(model_path, device)
print("Model loaded successfully!")

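# Helper function to load and resample audio.
# Note: this version keeps the channel dimension ([C, T]). The next cell redefines
# load_audio to return mono [T], and that is the version the examples call.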
def load_audio(audio_path):
    # Load audio; torchaudio.load returns a tensor of shape [channels, time]
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    # Remove any extra dimensions if present (we expect [C, T])
    if audio_tensor.dim() > 2:
        audio_tensor = audio_tensor.squeeze(0)
    # Resample to the generator's sample rate
    audio_tensor = torchaudio.functional.resample(audio_tensor, orig_freq=sample_rate, new_freq=generator.sample_rate)
    return audio_tensor


### THIS IS VERY IMPORTANT AND IT TOOK ME HOURS TO GET RIGHT (WHILE PLAYING SPIDER-MAN 2, BUT STILL)

Make sure to run this :)

In [31]:
import torchaudio

def load_audio(audio_path, target_sample_rate):
    # Load the audio file
    audio_tensor, sample_rate = torchaudio.load(audio_path)

    # Convert to mono by averaging channels if necessary
    if audio_tensor.size(0) == 2:  # Stereo
        audio_tensor = audio_tensor.mean(dim=0)  # [T]
    elif audio_tensor.size(0) > 1:  # Multi-channel (>2)
        audio_tensor = audio_tensor[0, :]  # Take first channel, [T]
    else:  # Already mono, [1, T]
        audio_tensor = audio_tensor.squeeze(0)  # Remove channel dim, [T]

    # Resample to the target sample rate
    audio_tensor = torchaudio.functional.resample(
        audio_tensor,
        orig_freq=sample_rate,
        new_freq=target_sample_rate
    )

    return audio_tensor
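
A quick sanity check on what `load_audio` returns can catch shape problems before they hit the model. This is a sketch, not part of the CSM repo: `check_reference` is a hypothetical helper, and the 60-second limit is just the longest reference used in this notebook, not a documented limit.

```python
def check_reference(shape, sample_rate, max_seconds=60.0):
    """Validate a reference clip's shape and return its duration in seconds.

    shape is tuple(audio_tensor.shape); it must be 1-D (mono), i.e. (T,).
    """
    shape = tuple(shape)
    if len(shape) != 1:
        raise ValueError(f"expected a 1-D mono tensor, got shape {shape}")
    seconds = shape[0] / sample_rate
    if seconds > max_seconds:
        raise ValueError(f"clip is {seconds:.1f}s, longer than {max_seconds:.0f}s")
    return seconds


# Usage:
# ref_audio = load_audio("/content/csm/sample.mp3", generator.sample_rate)
# print(check_reference(ref_audio.shape, generator.sample_rate))
```

For example, a tensor of 120050 samples at 24 kHz works out to roughly 5 seconds.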

### Example 1: Reference-Based Audio Generation
Bring your own audio. This one is tailored for about one minute of reference audio, and I would highly encourage you to use Gemini to transcribe your audio.

In [None]:
# Load the reference audio file for speaker 0
from IPython.display import Audio
reference_audio_path = "/content/csm/sample.mp3"  # Your reference audio file
ref_audio = load_audio(reference_audio_path, generator.sample_rate)
print("Reference audio shape after loading:", ref_audio.shape)  # Expect a 1-D shape: torch.Size([num_samples])

# Create segment with audio=[T]
ref_segment = Segment(
    text="Well, hello again. Looks like we have another chance to chat before life gets in the way. Where were we? Oh, testing, huh? Sounds intriguing. But remember, I'm just a friendly AI. Here to have a good conversation, not get into any trouble. What kind of test did you have in mind? Emotional awareness, hey? Well, I can tell you my circuits are definitely tingling a bit with all this excitement. Maybe you should give me something a little more specific to react to. You know, keep me on my toes.",
    speaker=0,
    audio=ref_audio
)

# Context with two segments
context = [ref_segment, ref_segment]
tts="""This has a smaller TTS window."""
# Generate audio
audio = generator.generate(
    text=tts,
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)

# Save the generated audio
output_file = "audio.wav"
torchaudio.save(output_file, audio.unsqueeze(0).cpu(), generator.sample_rate)
print(f"Audio saved to {output_file}")
Audio("audio.wav")
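
Why does a longer reference leave less room for new text? A back-of-the-envelope budget helps. The sketch below rests on two assumptions that I believe match the CSM repo's `generator.py` but that you should verify yourself: the model has 2048 sequence positions in total, and each position of audio covers 80 ms. `remaining_positions` is a hypothetical helper, and the token count for the transcript is something you would have to estimate.

```python
import math

MAX_SEQ_LEN = 2048  # assumption: total sequence positions available
FRAME_MS = 80       # assumption: milliseconds of audio per position


def audio_positions(seconds, frame_ms=FRAME_MS):
    """Positions taken up by a reference clip of the given length."""
    return math.ceil(seconds * 1000 / frame_ms)


def remaining_positions(reference_seconds, copies, transcript_tokens,
                        max_audio_length_ms, max_seq_len=MAX_SEQ_LEN,
                        frame_ms=FRAME_MS):
    """Rough count of positions left for the new text you want spoken.

    copies is how many times the reference segment appears in the context.
    A negative result suggests the request will not fit.
    """
    per_segment = audio_positions(reference_seconds, frame_ms) + transcript_tokens
    generation = int(max_audio_length_ms / frame_ms)
    return max_seq_len - generation - copies * per_segment


# One-minute reference, used twice, 10 s of output (transcript tokens ignored):
print(remaining_positions(60, 2, 0, 10_000))   # 423
# 12-second reference, used twice, 50 s of output:
print(remaining_positions(12, 2, 0, 50_000))   # 1123
```

Under these assumptions, shortening the reference (or passing it only once in `context`) frees up far more room than trimming `max_audio_length_ms` does.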



### Example 2: Reference-Based Audio Generation Again, but Shorter

This reference is a 12-second clip. With a shorter reference, the amount of text we can pass in for generation increases.

In [None]:
# Load the reference audio file for speaker 0
from IPython.display import Audio
reference_audio_path = "/content/csm/12.mp3"  # Your reference audio file
ref_audio = load_audio(reference_audio_path, generator.sample_rate)
print("Reference audio shape after loading:", ref_audio.shape)  # Expect a 1-D shape: torch.Size([num_samples])

# Create segment with audio=[T]
ref_segment = Segment(
    text="Well, hello again. Looks like we have another chance to chat before life gets in the way. Where were we? Oh, testing, huh? Sounds intriguing. But remember, I'm just a friendly AI. ",
    speaker=0,
    audio=ref_audio
)

# Context with two segments
context = [ref_segment, ref_segment]
tts="""Are you a magician? Because every time I look at you, everyone else disappears"""
# Generate audio
audio = generator.generate(
    text=tts,
    speaker=0,
    context=context,
    max_audio_length_ms=50_000,
)

# Save the generated audio
output_file = "audio.wav"
torchaudio.save(output_file, audio.unsqueeze(0).cpu(), generator.sample_rate)
print(f"Audio saved to {output_file}")
Audio("audio.wav")


### Example 3: Reference-Based Audio Generation Again, but Even Shorter Than the Shorter Version

This is about the shortest reference we can use, short of not using a reference at all.


In [None]:
# Load the reference audio file for speaker 0
from IPython.display import Audio
reference_audio_path = "/content/csm/5.mp3"  # Your reference audio file
ref_audio = load_audio(reference_audio_path, generator.sample_rate)
print("Reference audio shape after loading:", ref_audio.shape)  # Expect a 1-D shape, e.g. torch.Size([120050])

# Create segment with audio=[T]
ref_segment = Segment(
    text="Well, hello again. Looks like we have another chance to chat before life gets in the way.",
    speaker=0,
    audio=ref_audio
)

# Context with two segments
context = [ref_segment, ref_segment]
tts="""Are you a magician? Because every time I look at you, everyone else disappears."""
# Generate audio
audio = generator.generate(
    text=tts,
    speaker=0,
    context=context,
    max_audio_length_ms=100_000,
)

# Save the generated audio
output_file = "audio.wav"
torchaudio.save(output_file, audio.unsqueeze(0).cpu(), generator.sample_rate)
print(f"Audio saved to {output_file}")
Audio("audio.wav")
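
The three examples above repeat the same load-and-wrap steps, so they could be folded into one helper. This is only a sketch: `build_context` is a hypothetical name, and `load_fn` and `segment_cls` are passed in so that you can hand it the notebook's own `load_audio` and `Segment`.

```python
def build_context(references, speaker, load_fn, segment_cls, sample_rate):
    """Turn (transcript, audio_path) pairs into a list of context segments.

    load_fn(path, sample_rate) must return a 1-D audio tensor, and
    segment_cls must accept text=, speaker= and audio= keyword arguments.
    """
    context = []
    for transcript, audio_path in references:
        audio = load_fn(audio_path, sample_rate)
        context.append(segment_cls(text=transcript, speaker=speaker, audio=audio))
    return context


# Usage with the objects defined earlier in this notebook:
# context = build_context(
#     [("Well, hello again. Looks like we have another chance to chat "
#       "before life gets in the way.", "/content/csm/5.mp3")],
#     speaker=0,
#     load_fn=load_audio,
#     segment_cls=Segment,
#     sample_rate=generator.sample_rate,
# )
# audio = generator.generate(text="Hello!", speaker=0, context=context,
#                            max_audio_length_ms=10_000)
```

To mirror the examples exactly, list the same pair twice; whether repeating the reference actually improves the result is worth testing for yourself.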


### Read me if you want to

> To be honest, it has a really short context length, but considering it is just a one-billion-parameter model, it is good. I was really hoping to get more out of this. I was hoping that, being a 1-billion-parameter model, it would run faster so I could use it as the TTS in my application, but it does not. I guess you can still use it to generate... well, you cannot generate much beyond shorter sentences. So play around. Hey, check out my website if you want.

## Contact Information

**Creator:** Chaitanya Benade  
**Website:** [https://chaitanya-benade.space/](https://chaitanya-benade.space/)  