# Audio RAG Chatbot with Langchain and AssemblyAI

In this notebook we build a chatbot that transcribes audio files and uses the transcripts to answer the user's questions.
The chatbot is a LangChain agent that uses a Multi-Query retriever tool for RAG.

# Imports

In [1]:
# System related imports
import os
from dotenv import load_dotenv, find_dotenv

# OpenAI related imports
import openai
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings

# AssemblyAI and Document storage related imports
import tiktoken
import assemblyai as aai
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import AssemblyAIAudioTranscriptLoader

# Agent imports
import logging
from typing import List
from pydantic import BaseModel, Field
from langchain.chains import LLMChain
from langchain.agents import AgentExecutor
from langchain.prompts import PromptTemplate
from langchain.prompts import ChatPromptTemplate
from langchain.prompts import MessagesPlaceholder
from langchain.memory import ConversationBufferMemory
from langchain.output_parsers import PydanticOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.tools.render import format_tool_to_openai_function
from langchain.agents.agent_toolkits import create_retriever_tool
from langchain.agents.format_scratchpad import format_to_openai_functions
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser

# Chat UI
import gradio as gr
from langchain.schema import AIMessage, HumanMessage



# Environment variables

Set up the API keys.

In [2]:
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']
aai.settings.api_key = os.environ['ASSEMBLYAI_API_KEY']
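
Indexing `os.environ` directly raises a bare `KeyError` as soon as one key is missing. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper, not part of the original notebook, and it reports every missing name in one error.

```python
import os
from typing import Dict, Iterable, Mapping


def require_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the requested variables, or raise one error listing every missing name."""
    names = list(names)
    # Treat unset and empty values the same way: both are unusable as API keys.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}
```

In this notebook it could be called as `require_env(["OPENAI_API_KEY", "ASSEMBLYAI_API_KEY"])` right after `load_dotenv`, before the keys are assigned.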

# Transcribing

Start by defining the links to the audio files that we want to transcribe and use.

In [3]:
URLs = [
    "https://storage.googleapis.com/aai-web-samples/langchain_retrieval_webinar.opus",
    "https://storage.googleapis.com/aai-web-samples/langchain_agents_webinar.opus",
]

Use the AssemblyAI API to transcribe the audio and get the documents.

In [4]:
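def transcribe_audio(urls):
    """Transcribe each audio URL with AssemblyAI and return one Document per file."""
    transcripts = []
    for url in urls:  # iterate over the argument, not the global URLs list
        print(f'Transcribing {url}')
        transcripts.append(AssemblyAIAudioTranscriptLoader(file_path=url).load()[0])
    return transcripts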

docs = transcribe_audio(URLs) # list of documents

Transcribing https://storage.googleapis.com/aai-web-samples/langchain_retrieval_webinar.opus
Transcribing https://storage.googleapis.com/aai-web-samples/langchain_agents_webinar.opus


# Vector storage

Split the documents into chunks so that their embeddings can be stored in a vector store.

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
splits = text_splitter.split_documents(docs)
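
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding-window splitter. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to cut at separators such as paragraph breaks and spaces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.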

In [6]:
# Keep only the audio URL: some of the metadata returned by AssemblyAI is not in a form that Chroma can store
for split in splits:
    split.metadata = {"audio_url": split.metadata["audio_url"]}
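
The cell above keeps a single known-good key. An alternative, sketched below, is to keep every metadata value whose type is simple. This assumes the vector store accepts only `str`, `int`, `float` and `bool` values; the helper name `keep_simple_metadata` and the example keys are hypothetical.

```python
from typing import Any, Dict

# Assumption: these are the value types the vector store can persist.
ALLOWED_TYPES = (str, int, float, bool)


def keep_simple_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Drop every metadata entry whose value is not a simple scalar."""
    return {key: value for key, value in metadata.items() if isinstance(value, ALLOWED_TYPES)}


# Hypothetical metadata with a nested list and a None value.
example = {
    "audio_url": "https://example.com/a.opus",
    "audio_duration": 3600,
    "words": [{"text": "hi"}],
    "language_model": None,
}
print(keep_simple_metadata(example))
```

Here the nested list and the `None` value are dropped, while the URL and the duration survive.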

In [7]:
# Remove old database files if any
# !rm -rf ./docs/chroma 

Create the vector database from the embedded splits.

In [8]:
embedding = OpenAIEmbeddings()
persist_directory = 'docs/chroma/'

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

# Creating a chat agent

## Multi-query retriever

Create a retriever that generates several variants of the user prompt and finds the splits relevant to each of them.

In [9]:
# Output parser will split the LLM result into a list of queries
class LineList(BaseModel):
    # "lines" is the key (attribute name) of the parsed output
    lines: List[str] = Field(description="Lines of text")


class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = text.strip().split("\n")
        return LineList(lines=lines)


output_parser = LineListOutputParser()

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five 
    different versions of the given user question to retrieve relevant documents from a vector 
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search. 
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
llm = ChatOpenAI(temperature=0)

multi_query_chain = LLMChain(llm=llm, prompt=QUERY_PROMPT, output_parser=output_parser)

retriever = MultiQueryRetriever(
    retriever=vectordb.as_retriever(), llm_chain=multi_query_chain, parser_key="lines"
)
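
Conceptually, the multi-query step runs each generated query against the store and keeps the unique union of the results. The sketch below illustrates that idea with plain Python and a toy index; it is not the `MultiQueryRetriever` implementation. It also strips blank lines and leading numbering such as `1. `, which the simple `split("\n")` parser above leaves in place. The names `parse_queries`, `retrieve_unique` and `toy_index` are hypothetical.

```python
import re
from typing import Callable, Dict, List


def parse_queries(llm_text: str) -> List[str]:
    """Split model output into one query per line, dropping blanks and '1. ' style numbering."""
    queries = []
    for line in llm_text.splitlines():
        cleaned = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries


def retrieve_unique(queries: List[str], search: Callable[[str], List[str]]) -> List[str]:
    """Run every query and return the union of the results in first-seen order."""
    seen = set()
    unique = []
    for query in queries:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                unique.append(doc)
    return unique


# Toy index standing in for the vector store (hypothetical data).
toy_index: Dict[str, List[str]] = {
    "What is an agent?": ["chunk A", "chunk B"],
    "How do agents work?": ["chunk B", "chunk C"],
}
queries = parse_queries("1. What is an agent?\n\n2. How do agents work?\n")
print(retrieve_unique(queries, lambda q: toy_index.get(q, [])))  # ['chunk A', 'chunk B', 'chunk C']
```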

## Tools

We have to convert the retriever into a tool so that it can be used by an agent.

In [10]:
search_related_documents = create_retriever_tool(
    retriever,
    "RAG",
    "Retrieves context related to the question from the audio provided by the user.",
)

tools = [search_related_documents]
functions = [format_tool_to_openai_function(f) for f in tools] # convert tools to OpenAI functions

## Agent chain

In [11]:
model = ChatOpenAI(temperature=0).bind(functions=functions)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. The user provided audio as context for this chat. You can use it to retrieve context to answer the questions."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("user", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

In [12]:
agent_chain = RunnablePassthrough.assign(
    agent_scratchpad=lambda x: format_to_openai_functions(x["intermediate_steps"])
) | prompt | model | OpenAIFunctionsAgentOutputParser()

## Agent executor

Create the agent executor, which runs the agent chain until a stop condition is met (in our case the model decides on its own when to stop calling tools and return an answer).

In [13]:
# Define memory type to store the conversation history
memory = ConversationBufferMemory(return_messages=True, memory_key="chat_history")
agent_executor = AgentExecutor(agent=agent_chain, tools=tools, verbose=True, memory=memory)
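
The loop the executor runs can be pictured as: ask the model for a decision, run the requested tool, record the observation, and repeat until the model returns a final answer. The sketch below is a simplified illustration with a stubbed model, not the `AgentExecutor` implementation. The names `run_agent` and `toy_decide` and the tuple-based decision format are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# (tool_name, tool_input, observation) for every tool call made so far.
Step = Tuple[str, str, str]


def run_agent(
    decide: Callable[[str, List[Step]], tuple],
    tools: Dict[str, Callable[[str], str]],
    user_input: str,
    max_iterations: int = 5,
) -> str:
    """Call `decide` repeatedly, running tools until it returns ("finish", answer)."""
    steps: List[Step] = []
    for _ in range(max_iterations):
        decision = decide(user_input, steps)
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, tool_input = decision  # ("call", tool_name, tool_input)
        observation = tools[tool_name](tool_input)
        steps.append((tool_name, tool_input, observation))
    return "Agent stopped: iteration limit reached."


def toy_decide(user_input: str, steps: List[Step]) -> tuple:
    """Stub model: call the RAG tool once, then answer from its observation."""
    if not steps:
        return ("call", "RAG", user_input)
    return ("finish", f"Based on the audio: {steps[-1][2]}")


answer = run_agent(toy_decide, {"RAG": lambda q: "retrieved context"}, "What is discussed?")
print(answer)  # Based on the audio: retrieved context
```

The `steps` list plays the role of `intermediate_steps`, which the chain above formats into the `agent_scratchpad`.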

In [14]:
# Set logging for the queries
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# Test

Let's ask some questions unrelated to the documents and check whether the agent keeps the chat history.

In [15]:
agent_executor.invoke({"input": "Hi! Nice to meet you. My name is Logis."})



> Entering new AgentExecutor chain...
Hello Logis! Nice to meet you too. How can I assist you today?

> Finished chain.


{'input': 'Hi! Nice to meet you. My name is Logis.',
 'chat_history': [HumanMessage(content='Hi! Nice to meet you. My name is Logis.'),
  AIMessage(content='Hello Logis! Nice to meet you too. How can I assist you today?')],
 'output': 'Hello Logis! Nice to meet you too. How can I assist you today?'}

In [16]:
agent_executor.invoke({"input": "What is my name?"})



> Entering new AgentExecutor chain...
Your name is Logis.

> Finished chain.


{'input': 'What is my name?',
 'chat_history': [HumanMessage(content='Hi! Nice to meet you. My name is Logis.'),
  AIMessage(content='Hello Logis! Nice to meet you too. How can I assist you today?'),
  HumanMessage(content='What is my name?'),
  AIMessage(content='Your name is Logis.')],
 'output': 'Your name is Logis.'}

Everything works well. What about retrieval?

In [17]:
agent_executor.invoke({"input": "What topics are discussed in the audio?"})



> Entering new AgentExecutor chain...

Invoking: `RAG` with `audio`

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are some applications of audio technology?', '2. How does audio compression work?', '3. Can you recommend any audio editing software?', '4. What are the different types of audio formats?', '5. How does audio streaming work?']


[36;1m[1;3m[Document(page_content="of your application. Maybe that's search, maybe that is some generative component. Whatever it is, at least that's the way I think of maybe Harrison can maybe comment on this a little bit, but I think with the self improvement, I think you're right that finding ways to make that self improvement persist from one conversation to another is huge. And DSP seems like a promising way to do that. I think in terms of the retrieval and long term memory side for agents, the way it normally looks is that the amount of information that you're searching over is going to generally be smaller. Right? It's like the memory of a single agent. That's more like the scale of the amount of information a person consumes on the internet. So we're talking hundreds of documents, thousands of documents rather than millions. And then you also are maybe doing some annotation of that at the same time with the language model. So it's not just a document, it's a document with add

{'input': 'What topics are discussed in the audio?',
 'chat_history': [HumanMessage(content='Hi! Nice to meet you. My name is Logis.'),
  AIMessage(content='Hello Logis! Nice to meet you too. How can I assist you today?'),
  HumanMessage(content='What is my name?'),
  AIMessage(content='Your name is Logis.'),
  HumanMessage(content='What topics are discussed in the audio?'),
  AIMessage(content="The topics discussed in the audio include:\n1. Self-improvement and the use of DSP (Dialogue System Platform) for persistent self-improvement in conversations.\n2. Different architectures and cognitive frameworks for building agents, such as Atlantis, BB one, Soar, and others.\n3. Deploying agents and gathering user questions to compile a program using DSP.\n4. Techniques for efficient and adaptive querying, including multi-hop search and the use of language models like T5 and Ada.\n5. Compression techniques for reducing the storage space required for vectors in retrieval systems.\n6. Improving

# Creating UI for the chat

In [18]:
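def predict(message, history):
    # Convert Gradio's (user, assistant) pairs into LangChain messages.
    # A separate name avoids overwriting the `history` argument.
    history_langchain_format = []
    for human, ai in history:
        history_langchain_format.append(HumanMessage(content=human))
        history_langchain_format.append(AIMessage(content=ai))
    history_langchain_format.append(HumanMessage(content=message))
    # The converted list is not passed on: the agent executor keeps its own
    # history in `memory`, so only the new message is sent.
    response = agent_executor.invoke({"input": message})
    return response["output"]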
    
gr.ChatInterface(predict).queue().launch()

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.






> Entering new AgentExecutor chain...

Invoking: `RAG` with `main topics from the webinars in bulletpoints`

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the key subjects covered in the webinars, presented in bullet point format?', '2. Can you provide a list of the main topics discussed in the webinars, using bullet points?', '3. In bullet point form, what are the primary themes addressed in the webinars?', '4. Could you summarize the main subjects covered in the webinars, using bullet points?', '5. What are the main discussion points from the webinars, presented as bullet points?']


[36;1m[1;3m[Document(page_content="better task prioritization agent for something like baby AGI. And the reason I bring that up is I think there's a lot of relationship between the way we operate, the way we think our brains, and as we think about building these autonomous agents that there's a lot of cross learning that we should absolutely be talking about. I did want to bring one last major topic in our ten minutes that we have remaining here, which was the question of safety. A lot of people are asking about security, like narrowly considered, which is like prompt, is this thing going to pseudo RMRF my machine? Is this thing going to drain my PayPal account? And then there's also the broader question of what happens when we unleash a whole bunch of much more capable web scraping and API hitting bots. So I'd love to hear just from each sort of panel member in what they think the path is for assuaging those concerns that people have. So I'd like to start with Shinyu. Yeah, actually