# Build a YouTube AI assistant in Python that speaks 32 languages

Chanin Nantasenamat, PhD

[Data Professor YouTube channel](https://youtube.com/dataprofessor)

> In a nutshell, you're building a Python workflow that answers questions about long-form videos in 32 languages using AssemblyAI's LeMUR with Claude 3.5 Sonnet, then reads the answers aloud with ElevenLabs.

## Install prerequisites

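```python
# -y answers apt-get's confirmation prompt, which cannot be answered interactively in a notebook
!apt-get install -y ffmpeg
```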

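yt-dlp's `FFmpegExtractAudio` postprocessor relies on the `ffmpeg` executable (and uses `ffprobe` when it is available). If you run this workflow outside Colab, a minimal sketch for failing early is to check that both are on `PATH`; `missing_binaries` is a hypothetical helper, not part of yt-dlp.

```python
import shutil

def missing_binaries(names=("ffmpeg", "ffprobe")):
    """Return the executables from `names` that cannot be found on PATH (hypothetical helper)."""
    return [name for name in names if shutil.which(name) is None]

missing = missing_binaries()
if missing:
    print(f"Missing on PATH: {', '.join(missing)}. Install FFmpeg before downloading audio.")
else:
    print("ffmpeg and ffprobe found.")
```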


```python
!pip install yt-dlp assemblyai
```


```python
!pip install elevenlabs
```



## Load API key

```python
from google.colab import userdata
import assemblyai as aai

# Read the AssemblyAI API key stored as a Colab secret named 'AAI_KEY'
aai.settings.api_key = userdata.get('AAI_KEY')
```
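`google.colab.userdata` only exists inside Colab. A minimal sketch for running the same notebook elsewhere, assuming you export the key as an environment variable with the same name (`AAI_KEY`); `get_secret` is a hypothetical helper:

```python
import os

def get_secret(name):
    """Look up `name` in Colab's secret store, falling back to an environment variable."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            return userdata.get(name)
        except Exception:  # e.g. the secret is not defined or notebook access was not granted
            pass

    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} not found in Colab userdata or the environment")
    return value
```

With this in place, `aai.settings.api_key = get_secret('AAI_KEY')` works both in Colab and in a local Jupyter session.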

## Retrieving audio from a YouTube video

We'll start by downloading the audio track of the YouTube video with the `yt_dlp` Python library, which uses FFmpeg to convert it to MP3.

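```python
import os
import yt_dlp

# Download the audio track of a YouTube video and convert it to MP3 with FFmpeg
def download_audio(url):
    ydl_opts = {
        'format': 'bestaudio/best',           # best available audio stream
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',      # convert the download to MP3 via FFmpeg
            'preferredcodec': 'mp3',
            'preferredquality': '192',        # 192 kbps
        }],
        'outtmpl': '%(title)s.%(ext)s',       # name the file after the video title
        'verbose': True,                      # print yt-dlp debug logs
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        # Download once and reuse the returned metadata instead of fetching it again
        info = ydl.extract_info(url, download=True)
        # prepare_filename() applies outtmpl (including title sanitization); swap in .mp3
        return os.path.splitext(ydl.prepare_filename(info))[0] + '.mp3'

URL = "https://www.youtube.com/watch?v=UF8uR6Z6KLc"
audio_file = download_audio(URL)
```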



## Transcribe Audio and Answer Questions using LeMUR

1. Transcribe the audio file
2. Define your question prompt
3. Apply LeMUR to generate answer

Here, we call the `lemur.task()` method on the `transcript` object, i.e. `transcript.lemur.task()`.

As input arguments, we pass the question prompt and select the LLM by setting the `final_model` parameter to `aai.LemurModel.claude3_5_sonnet`.
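In this workflow the answer language is set by the prompt itself: the English question ends with "Please explain in Spanish." A minimal sketch of a hypothetical helper that keeps the question and the target language separate, so the same question can be reused for other languages:

```python
def build_prompt(question, language):
    """Combine a question with an instruction to answer in `language` (hypothetical helper)."""
    question = question.strip()
    language = language.strip()
    if not question or not language:
        raise ValueError("question and language must be non-empty")
    return f"{question} Please explain in {language}."

prompt = build_prompt(
    "What are the 5 key messages that Steve Jobs wanted to convey in the speech?",
    "Spanish",
)
print(prompt)
```

Passing, say, `build_prompt(question, "Japanese")` to `transcript.lemur.task()` should request the same answer in Japanese, without re-transcribing the audio.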


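```python
# 1. Transcribe the audio file
transcriber = aai.Transcriber()
transcript = transcriber.transcribe(audio_file)

# Stop early if transcription failed (e.g. invalid API key or unreadable file)
if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(f"Transcription failed: {transcript.error}")

# 2. Define your question prompt
prompt = "What are the 5 key messages that Steve Jobs wanted to convey in the speech? Please explain in Spanish."

# 3. Apply LeMUR to generate the answer with Claude 3.5 Sonnet
result = transcript.lemur.task(
    prompt,
    final_model=aai.LemurModel.claude3_5_sonnet
)

# Display the response
print(result.response)
```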

Basado en el discurso de Steve Jobs, los 5 mensajes clave que quería transmitir son:

1. Confía en que los puntos se conectarán en el futuro
Explicación: Jobs enfatiza que no siempre podemos entender cómo nuestras experiencias se relacionarán, pero debemos confiar en que lo harán de alguna manera significativa en el futuro.

2. Encuentra lo que amas hacer
Explicación: Jobs insiste en la importancia de descubrir y perseguir lo que realmente te apasiona, tanto en el trabajo como en las relaciones personales.

3. No te rindas ante los reveses
Explicación: A través de su historia personal de ser despedido de Apple, Jobs muestra cómo los fracasos pueden llevar a nuevas oportunidades y crecimiento personal.

4. Vive cada día como si fuera el último
Explicación: Jobs aconseja reflexionar diariamente sobre si estás haciendo lo que realmente quieres, utilizando la perspectiva de la mortalidad para tomar decisiones importantes.

5. Sigue tu corazón e intuición
Explicación: Jobs anima a los gradu
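LeMUR returns `result.response` as plain text. If you want to handle each key message separately (for example, to convert them to speech one at a time), a minimal sketch is to pull out the numbered headings with a regular expression. This assumes the model keeps the `1. ...`, `2. ...` layout seen above, which is not guaranteed; `extract_numbered_items` is a hypothetical helper.

```python
import re

# Lines such as "1. Some heading"; [ \t] keeps each match on a single line
NUMBERED_LINE = re.compile(r"^[ \t]*\d+\.[ \t]+(.+?)[ \t]*$", re.MULTILINE)

def extract_numbered_items(text):
    """Return the text after the number on every line that starts with 'N. '."""
    return NUMBERED_LINE.findall(text)

sample = """Basado en el discurso de Steve Jobs, los 5 mensajes clave que quería transmitir son:

1. Confía en que los puntos se conectarán en el futuro
Explicación: ...

2. Encuentra lo que amas hacer
Explicación: ...
"""
print(extract_numbered_items(sample))
```

In the notebook you would call `extract_numbered_items(result.response)` instead of using the sample string.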

## Generating Speech from Text using ElevenLabs

We'll now take LeMUR's text response and convert it to speech with the ElevenLabs `eleven_multilingual_v2` model.

This can be done in two ways: with the `generate()` method or with the `text_to_speech.convert()` method.
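Whichever method you use, you may want to keep the audio as a file. A minimal sketch of a hypothetical `write_audio` helper that accepts either a single `bytes` object or an iterable of byte chunks (the form consumed by the `for chunk in response` loop in Method 2):

```python
from typing import Iterable, Union

def write_audio(audio: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write audio given as bytes or as an iterable of byte chunks to `path`; return bytes written."""
    chunks = [audio] if isinstance(audio, (bytes, bytearray)) else audio
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks, mirroring the `if chunk:` check in Method 2
                f.write(chunk)
                written += len(chunk)
    return written
```

If the audio object is a one-shot generator, it can only be consumed once, so save it before (or instead of) passing it to `play()`.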

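```python
# Text to Speech - Method 1: client.generate()
from elevenlabs.client import ElevenLabs
from elevenlabs import play

# Read the ElevenLabs API key stored as a Colab secret named 'ELEVENLABS'
elevenlabs_api_key = userdata.get('ELEVENLABS')
client = ElevenLabs(api_key=elevenlabs_api_key)

# Synthesize the LeMUR response with the multilingual model and the "Rachel" voice
audio = client.generate(
    text=result.response,
    voice="Rachel",
    model="eleven_multilingual_v2"
)

# Play the audio inline in the notebook
play(audio, notebook=True)
```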

```python
# Text to Speech - Method 2: client.text_to_speech.convert()
import uuid
from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs
import IPython.display as ipd

# Create the client here too, so this cell does not depend on Method 1
client = ElevenLabs(api_key=userdata.get('ELEVENLABS'))

def text_to_speech_file(text: str) -> str:
    # Call the text-to-speech conversion API with explicit voice and output settings
    response = client.text_to_speech.convert(
        voice_id="EXAVITQu4vr4xnSDxMaL",
        optimize_streaming_latency="0",
        output_format="mp3_22050_32",  # MP3, 22.05 kHz sample rate, 32 kbps
        text=text,
        model_id="eleven_multilingual_v2",
        voice_settings=VoiceSettings(
            stability=0.0,
            similarity_boost=1.0,
            style=0.0,
            use_speaker_boost=True,
        ),
    )

    # Generate a unique file name for the output MP3 file
    save_file_path = f"{uuid.uuid4()}.mp3"

    # Write the audio stream to the file chunk by chunk
    with open(save_file_path, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)

    print(f"A new audio file was saved successfully at {save_file_path}")

    # Return the path of the saved audio file
    return save_file_path

# Convert only the first 100 characters of the LeMUR response, then play the file
mp3_file = text_to_speech_file(result.response[:100])
ipd.Audio(mp3_file)
```

A new audio file was saved successfully at b735143d-e8ff-4f89-8a1a-c20d89b86aa7.mp3
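Method 2 only converts the first 100 characters (`result.response[:100]`). To voice the whole answer, one option is to split it into pieces that stay under a per-request size limit and convert each piece separately. The actual limit depends on the ElevenLabs model and plan, so `max_chars` below is a placeholder assumption; `split_text` is a hypothetical helper that breaks after `.`, `!` or `?` and falls back to a hard cut for sentences longer than the limit.

```python
import re

def split_text(text, max_chars=1000):
    """Split text into chunks of at most max_chars characters, preferring sentence boundaries."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-cut any single sentence that is longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)  # close the current chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

For example, `[text_to_speech_file(part) for part in split_text(result.response)]` would produce one MP3 file per chunk.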


## References

- [LeMUR](https://www.assemblyai.com/docs/lemur) - AssemblyAI Documentation
- [Ask questions about your audio data](https://www.assemblyai.com/docs/lemur/ask-questions) - AssemblyAI Documentation
- [Processing Audio Files with LLMs using LeMUR](https://github.com/AssemblyAI/cookbook/blob/master/lemur/using-lemur.ipynb) - AssemblyAI Cookbook