### Retrieval-Augmented Generation - YouTube

An example of RAG built on YouTube videos.

#### Prerequisites

Install ffmpeg and make sure it is available on the machine's PATH.

```bash
sudo apt update && sudo apt upgrade
sudo apt install ffmpeg
# apt installs the binary in /usr/bin, which is normally already on PATH; check that it resolves:
ffmpeg -version
```
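
Before running the cells below, it can help to confirm from Python that the `ffmpeg` binary is actually discoverable. The following is a minimal sketch using only the standard library; the helper name `binary_available` is hypothetical and not part of the original notebook.

```python
import shutil


def binary_available(name: str) -> bool:
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    # Prints True when ffmpeg is installed and reachable, False otherwise.
    print("ffmpeg on PATH:", binary_available("ffmpeg"))
```

If this reports `False`, the audio download step is likely to fail during post-processing, so it is worth fixing the installation first.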

#### Load

Sources are loaded in exactly the same way as for text documents.

We simply use a loader suited to our source: here, a combination of `YoutubeAudioLoader`, which downloads the audio track of each video, and `OpenAIWhisperParser`, which transcribes the audio to text.

In [1]:
from dotenv import load_dotenv
from langchain_community.document_loaders.blob_loaders.youtube_audio import (
    YoutubeAudioLoader,
)
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.audio import OpenAIWhisperParser

# Load environment variables (e.g. OPENAI_API_KEY, REDIS_URL) from a .env file
load_dotenv()

# URLs of the target videos
urls = ["https://youtu.be/3caCwH2MSIk", "https://youtu.be/JlodpOubfqE"]

# LOAD: download each audio track, then transcribe it with Whisper
save_dir = "../__downloads__"
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()

[youtube] Extracting URL: https://youtu.be/3caCwH2MSIk
[youtube] 3caCwH2MSIk: Downloading webpage
[youtube] 3caCwH2MSIk: Downloading ios player API JSON
[youtube] 3caCwH2MSIk: Downloading player e38bb6de
[youtube] 3caCwH2MSIk: Downloading m3u8 information
[info] 3caCwH2MSIk: Downloading 1 format(s): 140
[download] Destination: ../__downloads__/Notre CEO se confie.m4a
[download] 100% of    3.93MiB in 00:00:01 at 2.13MiB/s   
[FixupM4a] Correcting container of "../__downloads__/Notre CEO se confie.m4a"
[ExtractAudio] Not converting audio ../__downloads__/Notre CEO se confie.m4a; file is already in target format m4a
[youtube] Extracting URL: https://youtu.be/JlodpOubfqE
[youtube] JlodpOubfqE: Downloading webpage
[youtube] JlodpOubfqE: Downloading ios player API JSON
[youtube] JlodpOubfqE: Downloading player e38bb6de
[youtube] JlodpOubfqE: Downloading m3u8 information
[info] JlodpOubfqE: Downloading 1 format(s): 140
[download] Destination: ../__downloads__/Qui sommes-nous ？.m4a
[download] 100% of    2.67MiB in 00:00:01 at 2.41MiB/s   
[FixupM4a] Correcting container of "../__downloads__/Qui sommes-nous ？.m4a"
[ExtractAudio] Not converting audio ../__downloads__/Qui sommes-nous ？.m4a; file is already in target format m4a
Transcribing part 1!
Transcribing part 1!
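
The loader used above pairs a component that produces raw files with a component that turns each file into documents. The pattern can be sketched offline as follows. This is a simplified illustration, not LangChain's actual implementation: `ToyBlob`, `ToyDocument`, `ToyGenericLoader`, and `fake_transcribe` are all hypothetical names, and the "transcription" merely decodes bytes so that the example runs without network access or an API key.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class ToyBlob:
    """Stand-in for a downloaded audio file (hypothetical, for illustration)."""
    path: str
    data: bytes


@dataclass
class ToyDocument:
    """Stand-in for a document: text plus source metadata."""
    page_content: str
    metadata: dict


class ToyGenericLoader:
    """Chains a blob source with a parser, one blob at a time."""

    def __init__(
        self,
        blobs: Iterable[ToyBlob],
        parse: Callable[[ToyBlob], Iterator[ToyDocument]],
    ) -> None:
        self.blobs = blobs
        self.parse = parse

    def load(self) -> List[ToyDocument]:
        docs: List[ToyDocument] = []
        for blob in self.blobs:
            # A parser may yield several documents per blob (e.g. one per audio part).
            docs.extend(self.parse(blob))
        return docs


def fake_transcribe(blob: ToyBlob) -> Iterator[ToyDocument]:
    # A real parser would send the audio to a speech-to-text model;
    # here the bytes are simply decoded so the example stays offline.
    yield ToyDocument(
        page_content=blob.data.decode("utf-8"),
        metadata={"source": blob.path},
    )


blobs = [ToyBlob("a.m4a", b"first transcript"), ToyBlob("b.m4a", b"second transcript")]
toy_docs = ToyGenericLoader(blobs, fake_transcribe).load()
print([d.metadata["source"] for d in toy_docs])
```

Separating the two roles is what lets the rest of the pipeline stay unchanged: only the blob source and the parser depend on the media type.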


#### Split | Embed | Store

The split/embed/store step is also strictly the same as for text documents.

In [2]:
import os

from langchain_community.vectorstores.redis import Redis
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# SPLIT: merge all transcripts, then cut them into overlapping chunks
combined_docs = [doc.page_content for doc in docs]
text = " ".join(combined_docs)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(text)

# EMBED & STORE: embed each chunk and index it in Redis
vectorstore = Redis.from_texts(
    splits,
    OpenAIEmbeddings(),
    redis_url=os.getenv("REDIS_URL"),
    index_name="genai_in_action_youtube_rag",
)
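
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch: a fixed sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as paragraphs and sentences first); the function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 3000
pieces = sliding_window_chunks(sample, chunk_size=1500, chunk_overlap=150)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears in full in at least one chunk, at the cost of embedding some text twice.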

#### Final chain

Likewise, the chain does not change.

In [3]:
from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# from_messages expects a list of (role, template) pairs
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "human",
            """You are an assistant for question-answering tasks about HR questions in a company named Younup.
   Use the following pieces of retrieved context to answer the question.
   If you don't know the answer, just say that you don't know.
   Use three sentences maximum and keep the answer concise.
    Question: {question}
    Context: {context}
    Answer:""",
        )
    ]
)

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

qa_chain = (
    # Parallel runnables: build the prompt inputs
    {
        "context": itemgetter("question") | vectorstore.as_retriever() | format_docs,
        "question": itemgetter("question"),
        # Passed through, but not referenced by the prompt above
        "history": itemgetter("history"),
    }
    # Sequential runnables
    | prompt
    | ChatOpenAI(model="gpt-3.5-turbo")
    | StrOutputParser()
)
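
The data flow of the chain above (a parallel step that builds `context` and `question`, followed by prompt, model, and output parsing in sequence) can be mimicked with plain functions. The sketch below is an offline analogy only: `TOY_CHUNKS`, `toy_retriever`, `toy_prompt`, `toy_llm`, and `toy_chain` are hypothetical names, the retriever ranks by shared words rather than by embeddings, and the "model" simply echoes its prompt.

```python
from typing import Dict, List

# Hypothetical corpus standing in for the indexed transcript chunks.
TOY_CHUNKS = [
    "Cats sleep on the sofa.",
    "Dogs bark at the mail carrier.",
]


def toy_retriever(question: str, k: int = 1) -> List[str]:
    """Return the k chunks sharing the most lowercase words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        TOY_CHUNKS,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def toy_format_docs(chunks: List[str]) -> str:
    """Join the retrieved chunks into one context string."""
    return "\n\n".join(chunks)


def toy_prompt(inputs: Dict[str, str]) -> str:
    """Fill a fixed template with the question and the retrieved context."""
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"


def toy_llm(prompt: str) -> str:
    # Stand-in for the chat model: echoes the prompt instead of answering.
    return "ECHO\n" + prompt


def toy_chain(inputs: Dict[str, str]) -> str:
    # Parallel step: each key is computed independently from the same input.
    mapped = {
        "context": toy_format_docs(toy_retriever(inputs["question"])),
        "question": inputs["question"],
    }
    # Sequential steps: prompt, then model (already a string, so no parser needed).
    return toy_llm(toy_prompt(mapped))


result = toy_chain({"question": "Where do cats sleep?"})
print(result)
```

The point of the analogy is the shape of the pipeline: the question is used twice, once to fetch context and once verbatim, before both are merged into a single prompt.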



In [4]:
# Ask a question!
question = "Quelle est la principale fierté de Younup ?"
qa_chain.invoke({
    "question":question, 
    "history": [] 
})

score_threshold is deprecated. Use distance_threshold instead.score_threshold should only be used in similarity_search_with_relevance_scores.score_threshold will be removed in a future release.


"La principale fierté de Younup est d'avoir réussi à conserver son ADN initial malgré sa croissance de 2 à 150 employés. L'entreprise met l'accent sur son approche customer-centric et son esprit de service. Younup se distingue par son agilité, sa flexibilité et son engagement envers la qualité de ses services."