### RETRIEVAL AUGMENTED GENERATION - YouTube

An example of RAG built on YouTube videos.

#### Prerequisites

Install ffmpeg and make sure it is available on the machine's PATH.

```bash
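sudo apt update && sudo apt upgrade
sudo apt install ffmpeg
# apt installs the binary in /usr/bin, which is already on PATH.
# PATH entries must be directories, not the binary itself; verify with:
ffmpeg -version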
```
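
The same check can be made from Python before running the loader. This is a minimal sketch using only the standard library; the helper name `ffmpeg_available` is hypothetical and not part of the original notebook.

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if `binary` resolves to an executable on the current PATH."""
    return shutil.which(binary) is not None


if not ffmpeg_available():
    print("ffmpeg not found on PATH: install it before running the loader.")
```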

#### Load

Sources are loaded exactly the same way as for text documents.

We simply use a loader suited to the source: here, a combination of `YoutubeAudioLoader`, which downloads the audio track of each video, and `OpenAIWhisperParser`, which transcribes the audio to text.

In [None]:
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders.blob_loaders.youtube_audio import (
    YoutubeAudioLoader,
)
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.audio import OpenAIWhisperParser

# URLs of the target videos
urls = ["https://youtu.be/3caCwH2MSIk", "https://youtu.be/JlodpOubfqE"]

# LOAD: download each audio track into ../__downloads__, then transcribe it with Whisper
loader = GenericLoader(YoutubeAudioLoader(urls, "../__downloads__"), OpenAIWhisperParser())
docs = loader.load()

#### Split | Embed | Store

The split/embed/store step is strictly the same as well.

In [2]:
import os

from langchain_community.vectorstores.redis import Redis
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

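# SPLIT: merge all transcripts into one string, then cut it into overlapping chunks
combined_docs = [doc.page_content for doc in docs]
text = " ".join(combined_docs)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(text)

# EMBED & STORE: embed each chunk and index it in Redis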
vectorstore = Redis.from_texts(
    splits,
    OpenAIEmbeddings(),
    redis_url=os.getenv("REDIS_URL"),
    index_name="genai_in_action_youtube_rag",
)
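
To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of a fixed-size sliding window. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to cut on separators such as paragraph breaks, newlines, and spaces, so its chunks will differ. The function name `sliding_chunks` and the sample string are illustrative assumptions.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so consecutive windows share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With the notebook's settings (1500 and 150), the overlap is about 10% of each chunk, which helps keep a sentence that straddles a boundary retrievable from either side.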

#### Chaîne finale

Likewise, the chain does not change.

In [3]:
from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# from_messages expects a list of messages; a bare ("human", "...") tuple
# would be read as two separate messages.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "human",
            """You are an assistant for question-answering tasks about a company named Younup.
    Use the following pieces of retrieved context to answer the question.
    If you don't know the answer, just say that you don't know.
    Question: {question}
    Context: {context}
    Answer:""",
        )
    ]
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

qa_chain = (
    # Parallel runnables: each key is computed from the same input dict
    {
        "context": itemgetter("question") | vectorstore.as_retriever() | format_docs,
        "question": itemgetter("question"),
        # Forwarded as-is; the prompt template has no {history} placeholder
        "history": itemgetter("history"),
    }
    # Sequential runnables
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)
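
The dictionary at the head of the chain can be read as a function that turns the input dict into the prompt's inputs. The sketch below reproduces that data flow in plain Python, without LangChain; `fake_retriever`, `join_passages`, and `build_prompt_inputs` are hypothetical stand-ins, and the canned passages are made up for illustration.

```python
from operator import itemgetter


def fake_retriever(query: str) -> list[str]:
    """Hypothetical stand-in for vectorstore.as_retriever(): returns canned passages."""
    return [f"passage A about {query}", f"passage B about {query}"]


def join_passages(passages: list[str]) -> str:
    """Plain-string counterpart of format_docs: blank line between passages."""
    return "\n\n".join(passages)


def build_prompt_inputs(inputs: dict) -> dict:
    """Compute every key from the same input dict, as the parallel step does."""
    question = itemgetter("question")(inputs)
    return {
        "context": join_passages(fake_retriever(question)),
        "question": question,
        "history": itemgetter("history")(inputs),
    }


prompt_inputs = build_prompt_inputs({"question": "Younup", "history": []})
print(prompt_inputs["context"])
```

Note that `itemgetter("history")` raises a `KeyError` when the key is missing, which is why the input dict must always carry a `history` entry, even an empty one.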



In [None]:
# Ask a question! (In English: "What is Younup's main source of pride?")
question = "Quelle est la principale fierté de Younup ?"
qa_chain.invoke({
    "question": question,
    "history": [],
})