# Retrieval-Augmented Generation (RAG) for Extracting and Transcribing Audio From Public YouTube Videos

## Overview
This project demonstrates a Retrieval-Augmented Generation (RAG) pipeline that downloads audio from public YouTube videos with yt-dlp, transcribes it with Whisper, indexes the transcript with LlamaIndex, and answers queries using OpenAI's GPT-4. It is intended as the base for a future multimodal RAG system that also includes visual content.

In [None]:
# Mounting to Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd "your-path-here"

In [16]:
%%capture
# Install dependencies (the %%capture cell magic must be the first line of the cell)
!pip install -U llama-index openai yt-dlp git+https://github.com/openai/whisper.git youtube-transcript-api

In [4]:
%%capture
!pip install git+https://github.com/openai/whisper.git

In [17]:
# Import Libraries
import os
import openai
import whisper
import tempfile
import yt_dlp
from IPython.display import Audio, Markdown, display

# LlamaIndex v0.10+ Imports
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding


In [18]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [6]:
# Setup OpenAI API
openai.api_key = "your-OpenAI-API-key-here"
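
As an alternative to pasting the key into the notebook, the key can be read from an environment variable. The sketch below assumes the variable is named `OPENAI_API_KEY`; `resolve_api_key` is a hypothetical helper written for this notebook, not part of the `openai` package.

```python
import os

def resolve_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook")
    return key

# Hypothetical usage in this notebook:
# openai.api_key = resolve_api_key()
```

Failing early with a clear message is easier to debug than an authentication error raised later during indexing.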

In [19]:
# Configure OpenAI model (GPT-4 Turbo)
Settings.llm = OpenAI(model="gpt-4-0125-preview") # replace with a model of your choice
Settings.embed_model = OpenAIEmbedding()

In [20]:
# Download YouTube Audio
def download_youtube_audio(url, output_dir):
    """Download the best audio stream for `url`, convert it to MP3, and return the MP3 path."""
    ydl_opts = {
        'format': 'bestaudio/best',
        'outtmpl': os.path.join(output_dir, '%(id)s.%(ext)s'),
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info_dict = ydl.extract_info(url, download=True)
        video_id = info_dict.get("id", None)
        return os.path.join(output_dir, f"{video_id}.mp3")
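
`download_youtube_audio` names the MP3 after the video ID that yt-dlp reports. If the ID is needed before downloading (for example, to check whether the audio is already cached), it can usually be read from the URL itself. The helper below is a minimal sketch using only the standard library; `extract_video_id` is a hypothetical name, and it only handles the common `watch?v=` and `youtu.be/` URL shapes.

```python
from urllib.parse import urlparse, parse_qs

YOUTUBE_HOSTS = ("youtube.com", "www.youtube.com", "m.youtube.com")

def extract_video_id(url):
    """Return the video ID from a YouTube watch or youtu.be URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment
        return parsed.path.lstrip("/").split("/")[0] or None
    if host in YOUTUBE_HOSTS:
        # Standard watch links carry the ID in the "v" query parameter
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None

print(extract_video_id("https://www.youtube.com/watch?v=7Hcg-rLYwdM"))  # 7Hcg-rLYwdM
```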

In [22]:
# Transcribe Audio
model = whisper.load_model("base")

In [23]:
def transcribe_audio(audio_path):
    result = model.transcribe(audio_path)
    return result["text"]
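
Besides `"text"`, the dictionary returned by Whisper's `transcribe` also carries a `"segments"` list whose entries include `start`, `end`, and `text`. The sketch below shows one way to turn such segments into a timestamped transcript; `format_segments` is a hypothetical helper, and the sample data is made up so the example runs without audio.

```python
def format_segments(segments):
    """Render Whisper-style segments as '[MM:SS - MM:SS] text' lines."""
    def mmss(seconds):
        # Truncate to whole seconds, then split into minutes and seconds
        minutes, secs = divmod(int(seconds), 60)
        return f"{minutes:02d}:{secs:02d}"

    lines = []
    for seg in segments:
        lines.append(f"[{mmss(seg['start'])} - {mmss(seg['end'])}] {seg['text'].strip()}")
    return "\n".join(lines)

# Stand-in data shaped like result["segments"]; the values are illustrative only
sample_segments = [
    {"start": 0.0, "end": 4.2, "text": " Hello from orbit."},
    {"start": 4.2, "end": 65.9, "text": " We are heading home."},
]
print(format_segments(sample_segments))
```

Timestamps like these could later help link transcript chunks back to video frames in a multimodal extension.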

In [24]:
# Save transcript to file
def save_transcript(text, filename):
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)

In [25]:
# Build Index from transcript with chunk control

def build_index_from_transcript(transcript_file):
    documents = SimpleDirectoryReader(input_files=[transcript_file]).load_data()
    parser = SimpleNodeParser.from_defaults(chunk_size=256, chunk_overlap=20)
    nodes = parser.get_nodes_from_documents(documents)
    index = VectorStoreIndex(nodes)
    return index
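
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window sketch. It is an assumption-laden illustration, not LlamaIndex's implementation: the real parser counts tokens and respects sentence boundaries, while this version splits a plain list of words. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(words, chunk_size, chunk_overlap):
    """Split `words` into windows of `chunk_size` that share `chunk_overlap` items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the input
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = sliding_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
```

Each window repeats the last item of the previous one, which is the role the overlap plays: context that straddles a chunk boundary is less likely to be lost.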


In [26]:
# Query the index with a prompt

def query_index(index, question):
    query_engine = index.as_query_engine()
    return query_engine.query(question)
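
Conceptually, the query engine embeds the question, retrieves the transcript chunks whose embeddings are most similar to it, and passes those chunks to the LLM. The sketch below illustrates only the retrieval step with toy two-dimensional vectors and cosine similarity; it is not LlamaIndex's code, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy embeddings; real embeddings have many more dimensions
toy_chunks = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], toy_chunks, k=2))  # [0, 1]
```

In the notebook itself, the number of retrieved chunks can be adjusted by passing `similarity_top_k` to `index.as_query_engine(...)`.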

In [27]:
# Step 1: Download and transcribe YouTube video
video_url = "https://www.youtube.com/watch?v=7Hcg-rLYwdM"  # Replace with your video
with tempfile.TemporaryDirectory() as tmpdir:
    audio_path = download_youtube_audio(video_url, tmpdir)
    print("\n✅ Audio downloaded:", audio_path)
    transcript = transcribe_audio(audio_path)
    transcript_file = os.path.join(tmpdir, "transcript.txt")
    save_transcript(transcript, transcript_file)

    # Step 2: Build RAG index
    index = build_index_from_transcript(transcript_file)

    # Step 3: Query with prompt
    prompt = (
        "Summarize the key takeaways in bullet points using markdown format. "
        "Highlight important topics in bold and include emotional or reflective aspects if mentioned."
    )
    response = query_index(index, prompt)

    # Step 4: Display formatted Markdown response
    display(Markdown(str(response)))

[youtube] Extracting URL: https://www.youtube.com/watch?v=7Hcg-rLYwdM
[youtube] 7Hcg-rLYwdM: Downloading webpage
[youtube] 7Hcg-rLYwdM: Downloading tv client config
[youtube] 7Hcg-rLYwdM: Downloading player 20830619
[youtube] 7Hcg-rLYwdM: Downloading tv player API JSON
[youtube] 7Hcg-rLYwdM: Downloading ios player API JSON
[youtube] 7Hcg-rLYwdM: Downloading m3u8 information
[info] 7Hcg-rLYwdM: Downloading 1 format(s): 251
[download] Destination: /tmp/tmpf1hkdyix/7Hcg-rLYwdM.webm
[download] 100% of    1.42MiB in 00:00:00 at 21.27MiB/s  
[ExtractAudio] Destination: /tmp/tmpf1hkdyix/7Hcg-rLYwdM.mp3
Deleting original file /tmp/tmpf1hkdyix/7Hcg-rLYwdM.webm (pass -k to keep)

✅ Audio downloaded: /tmp/tmpf1hkdyix/7Hcg-rLYwdM.mp3


- **Participation in science activities**: The speaker reflects on being proud of participating in various science activities during their mission on the International Space Station over the last two months.
- **Spacewalks**: Expresses surprise and joy at having the opportunity to conduct four additional spacewalks, describing it as "icing on the cake" for the mission.
- **Expedition 63**: The speaker was a part of Expedition 63, indicating a specific mission or time frame during their stay on the International Space Station.
- **Memorable experience**: Describes the mission as a lifetime memory and a true honor, highlighting the emotional and reflective aspect of their experience.
- **SpaceX Dragon undocking**: Mentions the undocking sequence commanded for Dragon SpaceX, indicating the spacecraft used for their return journey.
- **Importance of safe return**: Emphasizes that the hardest part of the mission was the launch, but the most crucial part was safely returning home.
- **Personal message**: Includes a personal message ("I've been trying, Daddy. We love you. Hurry home for weeks and don't get my dog. Slashed out."), suggesting a communication with family, showing the emotional side of space missions.
- **Successful return to Earth**: Concludes with a successful return to Earth after a 19-hour journey, with a welcoming message from SpaceX, and a note on the astronauts being back on Earth, referred to as "space dads."