## **1. Setup**

The project requires several packages, which need to be installed into Workspace:

- LangChain: a framework for developing generative AI applications.
- yt_dlp: lets you download YouTube videos.
- tiktoken: converts text into tokens.
- docarray: makes it easier to work with multi-modal data (in this case mixing audio and text).
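
Before running the import cell, it can help to check which of these packages are importable in the current environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the module names are assumed from the imports used later in this notebook.

```python
from importlib.util import find_spec

# Import names assumed from this notebook's import cell (an assumption,
# adjust the list to match your own requirements.txt).
REQUIRED_MODULES = ["langchain", "yt_dlp", "tiktoken", "docarray"]


def missing_packages(modules):
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_packages(REQUIRED_MODULES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Anything reported as not installed can then be installed with the matching `pip install` cell below.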

In [1]:
#!pip install -r requirements.txt

In [2]:
#!pip install openai

In [3]:
#!pip install python-dotenv

In [4]:
#!pip install -U langchain-community

In [5]:
#!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

In [6]:
#!pip install transformers soundfile langchain docarray

In [None]:
#!pip install librosa


In [23]:
import yt_dlp as youtube_dl
from yt_dlp import DownloadError
import os
import glob
import torch
import soundfile as sf
import librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from langchain.document_loaders import TextLoader
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.agents import Tool, initialize_agent
from langchain.memory import ConversationBufferMemory
from langchain_core.chat_history import InMemoryChatMessageHistory
from dotenv import load_dotenv


### **YouTube Audio Extraction & MP3 Conversion**
This snippet specifies an output folder and a target YouTube URL, then configures yt_dlp to:

1. Download the highest-quality audio stream available.

2. Use FFmpeg to extract and convert that stream into a 192 kbps MP3.

3. Name the resulting file after the video’s title.

4. Enable verbose logging so you can see detailed progress.

It wraps the download call in a try/except block to catch and report any DownloadError that might occur.
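
The four configuration points above map directly onto keys of the options dictionary. As a rough sketch, the dictionary can be built by a small function so the output folder and bitrate are not hard-coded. The function name `build_ydl_options` and its parameters are hypothetical; the keys themselves are the ones used in this notebook's configuration.

```python
import os


def build_ydl_options(output_dir: str = "output", bitrate_kbps: int = 192) -> dict:
    """Build the options dict with the same keys this notebook passes to yt_dlp."""
    return {
        # 1. Prefer the best audio-only stream, fall back to the best overall
        "format": "bestaudio/best",
        # 2. Extract the audio with FFmpeg and convert it to MP3
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",
                "preferredcodec": "mp3",
                "preferredquality": str(bitrate_kbps),  # passed as a string
            }
        ],
        # 3. Name the file after the video title
        "outtmpl": os.path.join(output_dir, "%(title)s.%(ext)s"),
        # 4. Print detailed progress information
        "verbose": True,
    }


options = build_ydl_options("output", 192)
print(options["outtmpl"])
```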

In [8]:
output_dir = "output/"
youtube_url = "https://youtu.be/4h9lQfYLOZU?si=4Z4RCfJjaAdAp2e1"

In [9]:
# Download a YouTube video's audio track and convert it to MP3 with yt_dlp
def download_audio(url: str) -> str:

    ydl_config = {
        "format": "bestaudio/best",
        # List of post-processing steps; each dict represents one processor
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",   # Extract audio from the downloaded file
                "preferredcodec": "mp3",       # Convert audio to MP3 format
                "preferredquality": "192"      # Set audio quality to 192 kbps
            }
        ],
        # Template for naming the output file: "<video title>.<extension>"
        "outtmpl": "output/%(title)s.%(ext)s",
        "verbose": True  # Enable detailed logging
    }

    try:
        # Initialize the downloader with the specified config and start download
        with youtube_dl.YoutubeDL(ydl_config) as ydl:
            info = ydl.extract_info(url, download=True)
    except DownloadError as e:
        # Report the error and return early, since `info` was never assigned
        print("DownloadError:", e)
        return f"Download failed: {e}"

    # The file is named after the video title (see "outtmpl" above);
    # yt_dlp may sanitize special characters in the actual file name
    return f"Downloaded: output/{info['title']}.mp3"

### **Batch Transcription of MP3 Files with Whisper**

1. Detects whether to run on GPU (CUDA) or CPU and sets the optimal PyTorch data type for memory efficiency.

2. Loads the openai/whisper-large-v3 model (using safetensors for faster, lower-memory loading) and moves it to the chosen device.

3. Loads the accompanying processor (tokenizer + feature extractor).

4. Builds an ASR pipeline around that model + processor.

5. Uses glob to collect all .mp3 files in the specified output_dir and validates that at least one file exists.

6. Defines a single transcript output path (files/transcripts/transcript.txt) and ensures its folder is created.

7. Iterates over each MP3, reads the audio into an array, runs the ASR pipeline to generate text, and writes the resulting transcript (overwriting on each loop) to the designated text file.
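
Because step 7 writes every transcript to the same file, only the last MP3's text survives when the folder holds more than one file. If separate transcripts are wanted, one option is to derive the output path from each audio file's name. This is a minimal sketch of that idea, not what the notebook does; `transcript_path_for` and `save_transcript` are hypothetical helper names.

```python
from pathlib import Path


def transcript_path_for(audio_path: str, transcript_dir: str = "files/transcripts") -> Path:
    """Map e.g. 'output/My Talk.mp3' to 'files/transcripts/My Talk.txt'."""
    return Path(transcript_dir) / (Path(audio_path).stem + ".txt")


def save_transcript(audio_path: str, text: str, transcript_dir: str = "files/transcripts") -> Path:
    """Write one transcript per audio file and return the path written."""
    target = transcript_path_for(audio_path, transcript_dir)
    target.parent.mkdir(parents=True, exist_ok=True)  # ensure the folder exists
    target.write_text(text, encoding="utf-8")
    return target
```

Inside the transcription loop, `save_transcript(audio_path, text)` would then replace the fixed `output_file` write.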

In [10]:
# 1. Figure out if I can use GPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
#    and choose torch dtype to save memory on GPU
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# 2. Pick the Whisper model I want
model_id = "openai/whisper-large-v3"

# 3. Load the model weights (using safetensors to speed things up)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
)
# move the model to the right device
model.to(device)


WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 1280)
      (layers): ModuleList(
        (0-31): 32 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bia

In [11]:
# 4. Load the processor that has both tokenizer & feature extractor
processor = AutoProcessor.from_pretrained(model_id)


In [12]:
# 5. Build the ASR pipeline with my model and processor
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

Device set to use cpu


In [None]:
# 6. Find all the MP3 files in my output directory
audio_files = glob.glob(os.path.join(output_dir, "*.mp3"))

# 7. Make sure I actually have files to process
if not audio_files:
    raise ValueError(f"No .mp3 files found in {output_dir}")

# 8. Loop through each file, transcribe, and save
# Where to save the text transcript
output_file  = "files/transcripts/transcript.txt"

for audio_path in audio_files:
    print(f"Processing {audio_path}...")
    #    read the audio array and sampling rate
    audio_array, sr = librosa.load(audio_path, sr=None, mono=True)
    sample = {"array": audio_array, "sampling_rate": sr}

    #    run the pipeline to get the transcript
    result = pipe(sample, return_timestamps=True)
    text = result["text"]

    #    prepare output path & ensure folder exists
    os.makedirs(os.path.dirname(output_file), exist_ok=True)

    #    write the transcript to disk
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(text)

    print(f"Saved transcript to {output_file}\n")

In [None]:
"""from dotenv import load_dotenv
# Load .env into environment
_ = load_dotenv()

# Initialize the new OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# —————— Transcription ——————
print("Converting audio to text…")


with open(audio_filename, "rb") as f:
    # Call the new transcription endpoint
    transcription = client.audio.transcriptions.create(
        file=f,            # the binary audio file
        model="whisper-1"  # the current Whisper model name
    )

# Extract the plain text
text = transcription.text

# —————— Save to disk ——————
# Ensure the output folder exists
os.makedirs(os.path.dirname(output_file), exist_ok=True)

with open(output_file, "w", encoding="utf-8") as out:
    out.write(text)

print(f"Transcript saved to {output_file}")"""

### **Load Transcript File into LangChain Documents** 

Uses LangChain’s TextLoader to read transcript.txt and convert it into a list of Document objects for downstream processing.
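
To make the shape of the result concrete, here is a simplified stand-in for what loading a text file produces: a list with one document holding the file's text and a `source` entry in its metadata. `SimpleDocument` and `load_text_file` are hypothetical names for illustration only, not LangChain's implementation.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Illustrative stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list:
    """Read a whole text file into a single-document list."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": path})]
```

A whole transcript therefore arrives as a single document, which is why `docs[0]` in the next cell contains the full text.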

In [19]:
# Create a TextLoader instance pointing at the transcript file
loader = TextLoader("./files/transcripts/transcript.txt")

# Load the file's contents as a list of Document objects
docs = loader.load()

In [20]:
docs[0]

Document(metadata={'source': './files/transcripts/transcript.txt'}, page_content="Hi everyone, Chris here from IELTSadvantage.com with another lesson and today what we're gonna focus on is how to practice IELTS listening. So what we're gonna do is look at why doing lots of practice tests is a terrible idea. In fact, this is the worst thing you could do. If you think that just doing lots and lots and lots of practice tests is gonna help you get a higher score, you are wrong. But what then I'm gonna show you is three better ways to practice that will actually improve your scores because at the end of the day, what are we doing? We're helping you improve your scores. So we're only gonna teach you the things that work and make you aware of the things that don't work. And these three ways are totally free and you can do them at home by yourself without a teacher. So they're absolutely brilliant. So number one, why doing lots of practice tests is a terrible idea. Well, the first thing is the

### **Build & Configure the RetrievalQA Pipeline**



1. Creates a DocArrayInMemorySearch index from your docs, embedding each with OpenAI’s embeddings API.



In [24]:
# Load .env into environment
_ = load_dotenv()


In [25]:
# Create a new DocArrayInMemorySearch instance from the specified documents and embeddings
db = DocArrayInMemorySearch.from_documents(
    docs,
    OpenAIEmbeddings()
)
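
Conceptually, an in-memory vector index stores one vector per document and ranks documents by their similarity to the query vector. The toy sketch below illustrates that idea with word counts and cosine similarity. `toy_embed` is a deliberately crude stand-in for a real embedding model, and `ToyInMemoryIndex` is a hypothetical class, not the DocArrayInMemorySearch implementation.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyInMemoryIndex:
    """Keeps (text, vector) pairs in a list and ranks them per query."""

    def __init__(self, texts):
        self.entries = [(text, toy_embed(text)) for text in texts]

    def search(self, query: str, k: int = 1) -> list:
        query_vector = toy_embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


index = ToyInMemoryIndex([
    "practice tests alone will not raise your listening score",
    "download the audio and convert it to mp3",
])
print(index.search("how do I improve my listening score", k=1))
```

A real embedding model captures meaning rather than exact word overlap, but the ranking step works the same way.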



2. Converts that index into a retriever for semantic search.

3. Instantiates a ChatOpenAI model with zero temperature for deterministic responses.

In [26]:
#Convert DocArrayInMemorySearch instance to a retriever
retriever = db.as_retriever()

# Create a ChatOpenAI instance with temperature 0 for deterministic responses
llm = ChatOpenAI(temperature=0.0)



4. Builds a RetrievalQA chain (using the “stuff” strategy) that ties together the LLM and retriever, with verbose=True to print intermediate debug info.

In [27]:
# Create a new RetrievalQA instance with the specified parameters
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)
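
The "stuff" strategy means that all retrieved passages are placed ("stuffed") into a single prompt alongside the question. A rough sketch of that idea follows; the function name `build_stuff_prompt` and the prompt wording are illustrative assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list) -> str:
    """Concatenate every retrieved chunk into one context block."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What is this video about?",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(prompt)
```

Since everything goes into one prompt, this approach only works while the combined passages fit within the model's context window.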

### **Multi-Tool Conversational Agent with Memory**

Sets up three custom tools (transcript Q&A, summarization, and YouTube audio download), configures a conversation buffer to remember past messages, and initializes a LangChain agent that can use these tools interactively while preserving chat history.
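
At its core, a tool is a named function with a description the agent can read when deciding what to call. The sketch below shows that dispatch idea with plain Python. `SimpleTool` and `run_tool` are hypothetical names, and this is not how LangChain's agent executor is implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SimpleTool:
    """Illustrative stand-in: a callable with a name and a description."""
    name: str
    func: Callable[[str], str]
    description: str


def run_tool(tools: list, name: str, tool_input: str) -> str:
    """Look up a tool by name and call it with the given input."""
    registry = {tool.name: tool for tool in tools}
    if name not in registry:
        return f"Unknown tool: {name}"
    return registry[name].func(tool_input)


toy_tools = [
    SimpleTool("Echo", lambda text: text, "Return the input unchanged."),
    SimpleTool("Shout", lambda text: text.upper(), "Return the input in upper case."),
]
print(run_tool(toy_tools, "Shout", "hello"))
```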

In [28]:
#Create a tool to answer questions about my audio transcripts
qa_tool = Tool(
    name="TranscriptQA",
    func=qa_stuff.run,
    description="Answer questions based on the audio transcripts."
)

In [29]:
#define a summarization function and wrap it as a tool
def summarize_transcript(text: str) -> str:
    return qa_stuff.run(f"Summarize this:\n\n{text}")

summarizer_tool = Tool(
    name="TranscriptSummarizer",
    func=summarize_transcript,
    description="Generate a concise summary of a given transcript text."
)


In [30]:
# Set up a downloader tool to grab YouTube audio and convert it to MP3
downloader_tool = Tool(
    name="YouTubeDownloader",
    func=download_audio,
    description="Download and convert a YouTube URL to an MP3 file."
)


In [31]:
session_store = {}

def get_session_history(session_id: str) -> InMemoryChatMessageHistory:
    """Ensures each session has its own message history"""
    if session_id not in session_store:
        session_store[session_id] = InMemoryChatMessageHistory()
    return session_store[session_id]
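
The point of keying histories by session ID is that separate conversations never see each other's messages. This simplified sketch shows the same pattern with plain lists instead of `InMemoryChatMessageHistory`; `toy_store`, `get_history`, and `add_turn` are hypothetical names.

```python
# Maps a session ID to that session's list of messages
toy_store = {}


def get_history(session_id: str) -> list:
    """Return the message list for a session, creating it on first use."""
    return toy_store.setdefault(session_id, [])


def add_turn(session_id: str, role: str, content: str) -> None:
    """Append one message to the given session's history."""
    get_history(session_id).append({"role": role, "content": content})


add_turn("session-a", "user", "What is this video about?")
add_turn("session-b", "user", "Summarize the transcript.")
print(len(get_history("session-a")), len(get_history("session-b")))
```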

In [32]:
import uuid
session_id = str(uuid.uuid4())  # e.g. "4f9b8a2e-1c3d-4f5a-9e6b-7d8f0a1b2c3d"

# Fetch (or create) the message history for this session
memory = get_session_history(session_id)

In [35]:
chat_history = InMemoryChatMessageHistory()

memory = ConversationBufferMemory(
    memory_key="chat_history",
    chat_memory=chat_history,
    return_messages=True
)




In [36]:
# I initialize the agent with my tools, the LLM, and the memory buffer
agent = initialize_agent(
    tools=[qa_tool, summarizer_tool, downloader_tool],
    llm=llm,
    agent="chat-conversational-react-description",
    memory=memory,
    verbose=True
)

### **Execute RetrievalQA Query & Display Answer**
1. Defines the user’s question as the query string.

2. Runs that query through the agent, which can route it to the previously configured RetrievalQA chain (qa_stuff) via the TranscriptQA tool; the chain retrieves relevant passages and then generates an answer.

3. Prints out the final response text to the console.

In [39]:
# set the query
query = "What is this video about?"

#run the query 
response = agent.run(query)

#print response
print(response)




> Entering new AgentExecutor chain...
```json
{
    "action": "TranscriptQA",
    "action_input": "Provide the audio transcript for the video in question"
}
```

> Entering new RetrievalQA chain...

> Finished chain.

Observation: I'm sorry, but I can't provide the audio transcript for the video in question.
Thought:```json
{
    "action": "Final Answer",
    "action_input": "The three tools that can be used are TranscriptQA, TranscriptSummarizer, and YouTubeDownloader."
}
```

> Finished chain.
The three tools that can be used are TranscriptQA, TranscriptSummarizer, and YouTubeDownloader.
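
In the trace above, each step the agent takes is a JSON object with an `action` (a tool name or "Final Answer") and an `action_input`. The sketch below shows one way such a blob could be pulled out of the model's text. `parse_agent_action` is a hypothetical helper written for illustration, not LangChain's own output parser.

```python
import json
import re

# Three backticks, built programmatically to avoid nesting code fences here
FENCE = "`" * 3


def parse_agent_action(raw: str) -> dict:
    """Extract the action and its input from a (possibly fenced) JSON blob."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, raw, flags=re.DOTALL)
    payload = match.group(1) if match else raw  # fall back to unfenced JSON
    data = json.loads(payload)
    return {"action": data["action"], "action_input": data["action_input"]}


raw_output = (
    FENCE + "json\n"
    '{\n    "action": "TranscriptQA",\n'
    '    "action_input": "What is this video about?"\n}\n'
    + FENCE
)
print(parse_agent_action(raw_output))
```

The `action` value decides what happens next: a tool name triggers a tool call, while "Final Answer" ends the loop and returns `action_input` to the user.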


In [40]:
query = "why you can not?"

#run the query 
response = agent.run(query)

#print response
print(response)



> Entering new AgentExecutor chain...
```json
{
    "action": "Final Answer",
    "action_input": "I am unable to provide verbatim transcripts of audio or video content longer than 90 seconds due to limitations in my capabilities."
}
```

> Finished chain.
I am unable to provide verbatim transcripts of audio or video content longer than 90 seconds due to limitations in my capabilities.
