## **SDSU AI Club: Multimodal AI Workshop (Text-to-Audio) Fall 2025**

**Contents:**


*   Downloads and Imports
*   HuggingFace Model Page Overview ([here](https://huggingface.co/cvssp/audioldm2))
*   Generating sound effects + audio with AudioLDM2
*   Generating music with AudioLDM2-Music
*   Story Soundtrack Generator Using Gradio & Gemini w/ Demo


**Note:**
1. Be sure to connect to a GPU runtime! You can access better compute with Colab Pro (currently free for students).
2. The last part (the story soundtrack creator) requires Colab Pro, since it uses the `google.colab.ai` module.

In [1]:
# make sure gpu runtime is set
!nvidia-smi

Fri Oct  3 23:04:41 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
!pip install --upgrade diffusers transformers==4.56.2 accelerate



In [3]:
# necessary imports
from diffusers import AudioLDM2Pipeline
import torch
import scipy
from IPython.display import Audio

### Generating audio and sound with AudioLDM2

In [6]:
generator = torch.Generator("cuda").manual_seed(0) # create a generator for replicability

In [5]:
negative_prompt = "Low quality, average quality."

In [7]:
repo_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, revision='refs/pr/5', torch_dtype=torch.float16) # load model from hugging face, specifying recent working version
pipe = pipe.to("cuda")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



`torch_dtype` is deprecated! Use `dtype` instead!


In [8]:
# with the model loaded, let's try inference!
prompt = "The sound of a humming bird chirping amidst the cool early morning"
negative_prompt = "Low quality, average quality."

# play around with the hyperparameters to see their impact; another one to try: 'num_waveforms_per_prompt'
audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual_seed(0), num_inference_steps=50, audio_length_in_s=10.0).audios[0]


In [9]:
# display the audio
Audio(audio, rate=16000)
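
The inference cell suggests experimenting with the hyperparameters. One way to organize such an experiment is sketched below. `settings_grid` is a hypothetical helper (not part of diffusers) that only builds keyword-argument dictionaries, and the default values are arbitrary choices for illustration. The commented-out loop assumes the `pipe`, `prompt`, and `negative_prompt` objects defined earlier.

```python
from itertools import product

def settings_grid(steps=(25, 50, 100), lengths_s=(5.0, 10.0)):
    """Build one kwargs dict per (num_inference_steps, audio_length_in_s) pair."""
    return [
        {"num_inference_steps": n, "audio_length_in_s": s}
        for n, s in product(steps, lengths_s)
    ]

grid = settings_grid()
print(len(grid))  # 6 combinations (3 step counts x 2 lengths)
print(grid[0])

# Usage with the pipeline loaded earlier (requires the GPU runtime).
# A fresh generator with the same seed is created per run so that only
# the hyperparameters differ between clips:
# for kwargs in grid:
#     clip = pipe(prompt, negative_prompt=negative_prompt,
#                 generator=torch.Generator("cuda").manual_seed(0), **kwargs).audios[0]
#     display(Audio(clip, rate=16000))
```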

In [None]:
# can also save to .wav file
# scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
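
As an alternative to the scipy call, here is a minimal sketch that writes 16-bit PCM using only the standard library. It assumes the audio is a mono sequence of floats roughly in the range [-1, 1]. `save_pcm16` is a hypothetical helper name, not something provided by AudioLDM2 or diffusers.

```python
import struct
import wave

def save_pcm16(path, samples, rate=16000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))            # clip to the valid range
        frames += struct.pack("<h", int(round(x * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 2 bytes per sample = 16 bits
        wf.setframerate(rate)
        wf.writeframes(bytes(frames))
    return len(frames) // 2      # number of samples written

# Usage with the clip generated earlier:
# save_pcm16("clip.wav", audio)
```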

### Generating Short Music Clips with AudioLDM2-Music (text-to-music)

In [10]:
repo_id_music = "cvssp/audioldm2-music"
pipe_music = AudioLDM2Pipeline.from_pretrained(repo_id_music, torch_dtype=torch.float16) # load the music checkpoint, not the general one
pipe_music = pipe_music.to("cuda")


Expected types for language_model: (<class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'>,), got <class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>.


In [11]:
# the language_model entry in model_index.json needs a small fix (see the warning above), so first save the pipeline locally
pipe_music.save_pretrained('./audioldm2-music-fixed')

In [12]:
import json
with open('./audioldm2-music-fixed/model_index.json', 'r') as f:
  config = json.load(f)

config['language_model'] = ['transformers', 'GPT2LMHeadModel']

with open('./audioldm2-music-fixed/model_index.json', 'w') as f:
  json.dump(config, f, indent=2)
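
The same patch can be wrapped in a small reusable function. This is only a sketch: `patch_component` is a hypothetical helper name, and it assumes the saved pipeline folder contains a `model_index.json` whose entries are `[library, class_name]` pairs, as in the cell above.

```python
import json
from pathlib import Path

def patch_component(model_dir, component, library, class_name):
    """Point one entry of model_index.json at a different (library, class) pair.

    Returns the previous value so the change can be inspected or undone.
    """
    index_path = Path(model_dir) / "model_index.json"
    config = json.loads(index_path.read_text())
    previous = config.get(component)
    config[component] = [library, class_name]
    index_path.write_text(json.dumps(config, indent=2))
    return previous

# Usage, equivalent to the manual edit above:
# patch_component("./audioldm2-music-fixed", "language_model",
#                 "transformers", "GPT2LMHeadModel")
```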

In [13]:
# now load pipeline from fixed local version
pipe_music = AudioLDM2Pipeline.from_pretrained('./audioldm2-music-fixed', torch_dtype=torch.float16)
pipe_music = pipe_music.to('cuda')


In [14]:
prompt = "A traditional Irish fiddle playing a lively reel."
music = pipe_music(prompt, negative_prompt=negative_prompt, generator=generator.manual_seed(0), num_inference_steps=100, audio_length_in_s=10.0).audios[0]


In [15]:
Audio(music, rate=16000)
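
The `rate=16000` passed to `Audio` has to match the sample rate of the generated waveform, otherwise the clip plays back at the wrong speed and pitch. A small sketch of the arithmetic follows. The helper names are hypothetical, and 16000 Hz is taken from the playback calls in this notebook.

```python
SAMPLE_RATE = 16000  # Hz, the rate used for playback throughout this notebook

def expected_samples(seconds, rate=SAMPLE_RATE):
    """Number of samples in a clip of the given length."""
    return int(round(seconds * rate))

def duration_s(num_samples, rate=SAMPLE_RATE):
    """Length in seconds of a clip with num_samples samples."""
    return num_samples / rate

print(expected_samples(10.0))   # 160000
print(duration_s(160000))       # 10.0

# The same samples interpreted at a mismatched rate give a different duration:
print(duration_s(160000, rate=44100))

# Usage with a generated clip:
# print(duration_s(len(music)))
```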

### Gradio Demo: Story Soundtrack Creator!

We will incorporate the Gemini LLM available through Colab Pro! See this [getting started guide](https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/Getting_started_with_google_colab_ai.ipynb#scrollTo=4BNgxiB6--_5) for details.

In [16]:
# we can directly use gemini models with Colab Pro!
from google.colab import ai

ai.list_models()

['google/gemini-2.0-flash',
 'google/gemini-2.0-flash-lite',
 'google/gemini-2.5-flash',
 'google/gemini-2.5-flash-lite',
 'google/gemini-2.5-pro',
 'google/gemma-3-12b',
 'google/gemma-3-1b',
 'google/gemma-3-27b',
 'google/gemma-3-4b']

In [17]:
# example use
response = ai.generate_text("What is the capital of England", model_name='google/gemini-2.0-flash-lite')
print(response)

The capital of England is **London**.



In [18]:
# more relevant to our use case
story_input = "Title: Stopping by Woods on a Snowy Evening \n\
Content: \n \
The woods are lovely, dark and deep, \
But I have promises to keep, \
And miles to go before I sleep, \
And miles to go before I sleep."

In [19]:
print(story_input)

Title: Stopping by Woods on a Snowy Evening 
Content: 
 The woods are lovely, dark and deep, But I have promises to keep, And miles to go before I sleep, And miles to go before I sleep.


In [20]:
text_music_desc = ai.generate_text(f"Given this short story or poem and its title: \n {story_input}. \n Generate me a rich 1-2 line textual description to input into a text-to-audio model which captures the themes of the story. An example of a description could be 'Meditative song, calming and soothing, with flutes and guitars. The music is slow, with a focus on creating a sense of peace and tranquility'. Do NOT generate anything else please",
                                   model_name='google/gemini-2.0-flash')

In [21]:
print(text_music_desc)

Brooding winter scene, melancholic and reflective, a lone traveler's thoughts echo in the stillness. Somber cello and piano create a sense of vast, snowy isolation.



In [22]:
# pass into audio gen model
music = pipe_music(text_music_desc, negative_prompt=negative_prompt, generator=generator.manual_seed(0), num_inference_steps=100, audio_length_in_s=10.0).audios[0]


In [23]:
# display music as before
Audio(music, rate=16000)

In [27]:
# with this working, let's create a Gradio demo to showcase this in action!
def predict_audio(story_content, story_title=''):
  # craft story string to be used as input into LLM
  story_input = f"Title: {story_title} \n\
  Content: \n \
  {story_content}"

  # can be modified -> prompt engineering!
  llm_prompt = f'''Given this short story or poem and its title: \n {story_input}. \n
  Generate me a rich 1-2 line textual description to input into a text-to-audio model which captures the themes of the story.
  An example of a description could be 'Meditative song, calming and soothing, with flutes and guitars. The music is slow, with a focus on creating a sense of peace and tranquility'.
  Do NOT generate anything else please.'''

  # other prompt (from example app)
  # llm_prompt2 = '''You are a musician AI whose job is to help users create their own music which its genre will reflect the character or scene from an image described by users.
  #         In particular, you need to respond succinctly with few musical words, in a friendly tone, write a musical prompt for a music generation model, you MUST include chords progression.
  #         For example, if a user says, "a painting of three old women having tea party", provide immediately a musical prompt corresponding to the text description.
  #         Immediately STOP after that. It should be EXACTLY in this format:
  #
  #         "The song is an instrumental. The song is in medium tempo with a classical guitar playing a lilting melody in accompaniment style. The song is emotional and romantic.
  #         The song is a romantic instrumental song. The chord sequence is Gm, F6, Ebm. The time signature is 4/4. This song is in Adagio. The key of this song is G minor."
  #
  #         '''

  # pass into llm
  audio_description = ai.generate_text(llm_prompt, model_name='google/gemini-2.0-flash')
  print(audio_description)

  # generate audio / music clip using AudioLDM2
  music_clip = pipe_music(audio_description,
                          negative_prompt=negative_prompt,
                          # generator=generator.manual_seed(0),
                          num_inference_steps=100,
                          audio_length_in_s=10.0).audios[0]

  # return audio generated
  return (16000, music_clip) # AudioLDM2 uses 16kHz sample rate
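
LLM replies sometimes arrive wrapped in quotes or spread over several lines, even when asked not to add anything else. A minimal sketch of a cleanup step that could be applied to `audio_description` before it is passed to `pipe_music` is shown below. `clean_description` is a hypothetical helper, and the 300-character cap is an arbitrary assumption, not a documented limit of AudioLDM2.

```python
def clean_description(text, max_chars=300):
    """Normalize an LLM reply into a single-line prompt for the audio model."""
    text = " ".join(text.split())       # collapse newlines and repeated spaces
    text = text.strip("\"'` ")          # drop wrapping quotes or backticks
    if len(text) > max_chars:
        # cut at the last space before the limit so no word is split
        text = text[:max_chars].rsplit(" ", 1)[0]
    return text

print(clean_description('  "Somber cello and piano,\n slow and sparse."  '))
# prints: Somber cello and piano, slow and sparse.

# Possible use inside predict_audio:
# audio_description = clean_description(audio_description)
```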

In [28]:
# testing
title = "The Lonely Lighthouse"
content = 'On a stormy night, the lighthouse keeper watched the waves crash against the rocks. The light spun endlessly, a beacon of hope in the darkness.'

# pass by keyword so the title and content cannot be swapped
sample_output = predict_audio(story_content=content, story_title=title)

Eerie, melancholic piano melody with distant foghorn blasts, evoking loneliness and the relentless power of the sea. Haunting violin echoes the spinning light, a fragile hope against the storm's fury.




In [29]:
Audio(sample_output[1], rate=16000)

### Demo Website With Gradio

In [30]:
!pip install gradio



In [31]:
import gradio as gr

demo = gr.Interface(
    fn=predict_audio,
    # Gradio passes inputs positionally, so their order must match predict_audio(story_content, story_title)
    inputs=[
        gr.Textbox(label='Story Content', placeholder='Enter your story or poem...', lines=10),
        gr.Textbox(label='Story Title', placeholder="Enter your story title... (Optional)")
    ],
    outputs=gr.Audio(label='Generated Music'),
    title='Story Soundtrack Generator - Demo',
    examples=[
        ["On a stormy night, the lighthouse keeper watched the waves crash against the rocks. The light spun endlessly, a beacon of hope in the darkness.", "The Lonely Lighthouse"],
        ["Dew glistened on petals as the sun rose. Birds sang their morning songs, celebrating the warmth of a new day.", "Spring Morning"]
    ],
    cache_examples=False
)

demo.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://c929d677580acb47cc.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




That's the end of our workshop; we hope you enjoyed it! We encourage you to play around with the code and, better yet, expand it into something cool or practical :)