![App overview image](Images/image1.png)


# 1) Setting up our model 

In [44]:
import os 
from dotenv import load_dotenv

load_dotenv()

openai_api_key = os.environ.get("OPENAI_API_KEY")
# Path to the audio of the YouTube video that we are going to use
# (I used an external website to download the audio, because the YouTube API was causing problems)
YOUTUBE_AUDIO =  "/Users/nathanwandji/Downloads/youtube_audio.mp3"



Define the LLM model that we will use

In [43]:
from langchain_openai.chat_models import ChatOpenAI

model = ChatOpenAI(openai_api_key=openai_api_key, model="gpt-3.5-turbo")

We test the model by asking a simple question 

In [3]:
model.invoke("Who is kedrick Lamar ?")

AIMessage(content='Kendrick Lamar is an American rapper, songwriter, and record producer. He is widely regarded as one of the greatest and most influential rappers of his generation. Lamar has released several critically acclaimed albums, including "good kid, m.A.A.d city," "To Pimp a Butterfly," and "DAMN." He has won numerous awards, including multiple Grammy Awards, and has been praised for his lyrical content, storytelling, and social commentary in his music.', response_metadata={'token_usage': {'completion_tokens': 95, 'prompt_tokens': 13, 'total_tokens': 108}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-7c1700c9-a35d-4e8b-8488-f117812ef6ee-0', usage_metadata={'input_tokens': 13, 'output_tokens': 95, 'total_tokens': 108})

![Image2](Images/image2.png)

Throughout this project, we will use **StrOutputParser** to extract the model's answer as a plain string.

In [4]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser
chain.invoke("What MLB team won the World Series during the COVID-19 pandemic?")

'The Los Angeles Dodgers won the World Series during the COVID-19 pandemic in 2020.'
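
To make the `|` syntax less magical, here is a minimal sketch of how two steps can be piped together in plain Python. It is not LangChain's actual implementation: the `Step` class, `fake_model`, and `fake_parser` are hypothetical stand-ins that only illustrate the idea of feeding one step's output into the next.

```python
class Step:
    """Wrap a plain function so steps can be composed with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs a first, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins: a fake "model" that returns a message-like dict,
# and a "parser" that keeps only the text content.
fake_model = Step(lambda question: {"content": f"Echo: {question}", "metadata": {}})
fake_parser = Step(lambda message: message["content"])

toy_chain = fake_model | fake_parser
result = toy_chain.invoke("Hello")
print(result)
```

The parser step plays the same role as **StrOutputParser** above: it drops the metadata and keeps only the text.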

# 2) Introducing prompt template 

We want to provide the model with some context and the question. Prompt templates are a simple way to define and reuse prompts.

In [5]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

# To print an example
prompt.format(context="Mary's sister is Susana", question="Who is Mary's sister?")

'Human: \nAnswer the question based on the context below. If you can\'t \nanswer the question, reply "I don\'t know".\n\nContext: Mary\'s sister is Susana\n\nQuestion: Who is Mary\'s sister?\n'

![Image3](Images/image3.png)

We can now chain the prompt with the model and the output parser.

In [6]:
chain = prompt | model | parser
chain.invoke({
    "context": "Mary's sister is Susana",
    "question": "Who is Nathan sister?"
})

"I don't know"

It works, let's goooo!!! 😁🔥  
Now we can move on to the next step.

# 3) Combining chains

We can combine different chains to create more complex workflows. For example, let's create a second chain that translates the answer from the first chain into a different language.

Let's start by creating a new prompt template for the translation chain:

In [7]:
translation_prompt = ChatPromptTemplate.from_template(
    "Translate {answer} to {language}"
)

We can now create a new translation chain that combines the result from the first chain with the translation prompt.

Here is what the new workflow looks like:

![Image4](Images/image4.png)

In [8]:
from operator import itemgetter

translation_chain = (
    {"answer": chain, "language": itemgetter("language")} | translation_prompt | model | parser
)

translation_chain.invoke(
    {
        "context": "Mary's sister is Susana. She doesn't have any more siblings.",
        "question": "How many sisters does Mary have?",
        "language": "French",
    }
)

'Marie a une sœur.'
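
The dictionary at the start of the translation chain can be read as "call every value with the same input and collect the results by key". The sketch below mimics that behaviour in plain Python, assuming this reading is accurate; `run_map` and `fake_answer_chain` are hypothetical names, not part of LangChain.

```python
from operator import itemgetter


def run_map(steps, inputs):
    """Call every step with the same inputs and collect the results by key."""
    return {key: step(inputs) for key, step in steps.items()}


def fake_answer_chain(inputs):
    """Hypothetical stand-in for the first chain: it only reads "question"."""
    return f"Answer to: {inputs['question']}"


steps = {"answer": fake_answer_chain, "language": itemgetter("language")}
mapped = run_map(
    steps,
    {"question": "How many sisters does Mary have?", "language": "French"},
)
print(mapped)
```

The resulting dictionary has exactly the keys that `translation_prompt` expects: `answer` and `language`.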

Now that we understand this, we can start the project by working out how to transcribe a YouTube video.

# 4) Transcribing the YouTube video

The context we want to send to the model comes from a YouTube video. Let's take the audio of the video and transcribe it using OpenAI's Whisper.
Note that Whisper works with any audio or video file, not only YouTube videos.

In [9]:
import os
import whisper

# Name of the transcription file to create
TRANSCRIPTION_FILE = "transcription.txt"

# Only transcribe if the transcription file does not exist yet
if not os.path.exists(TRANSCRIPTION_FILE):
    # Load the Whisper model
    whisper_model = whisper.load_model("base")

    # Transcribe the audio with Whisper (fp16=False so it also runs on CPU)
    transcription = whisper_model.transcribe(YOUTUBE_AUDIO, fp16=False)["text"].strip()

    # Save the transcription to a text file
    with open(TRANSCRIPTION_FILE, "w") as file:
        file.write(transcription)


Let's read the transcription and display the first few characters to ensure everything works as expected.

In [10]:
with open("transcription.txt") as file:
    transcription = file.read()

transcription[:100]

'President Trump, welcome back to Meet the Press. Thank you. I want to dive right into this a lot to '

This model is nice! Not bad, lol!

Since the transcript is very long, we cannot use it directly as context in our chatbot. We have to divide it into several **small chunks**.

# 5) Splitting the transcription

Since we can't use the entire transcription as the context for the model, a potential solution is to split the transcription into smaller **chunks**. We can then invoke the model using only the **relevant chunks** to answer a particular question:

![Image5](Images/image5.png)

Let's start by loading the transcription in memory using **TextLoader**:

In [11]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("transcription.txt")
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'transcription.txt'}, page_content="President Trump, welcome back to Meet the Press. Thank you. I want to dive right into this a lot to get to. Good. There are a number of things that make your campaign unprecedented. You are the first former president to run for reelection in more than 100 years. You are facing foreign developments. You have an incredibly significant lead in the GOP primary polls. But I want to ask you this, Mr. President, why do you want to be President again? Well, it's a very simple answer, and I can give it very easily. It's called Make America Great Again. Our country is in serious trouble. I don't think we've ever been so low in terms of certainly opinion, world opinion, and country opinion. People are devastated. They look at what's happening with millions of people coming in, millions of illegal immigrants coming into our country, flooding our cities, flooding the countryside. I think the number is going to be 15 million people by

There are many different ways to split a document. For this example, we'll use a simple splitter that splits the document into chunks of a fixed size.

For illustration purposes, let's split the transcription into chunks of 1000 characters with an overlap of 20 characters and display the first few chunks:

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(text_documents)
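
To see what "chunk size" and "overlap" mean in practice, here is a simplified sketch of fixed-size splitting in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to cut on separators such as paragraphs and spaces); `split_fixed` is a hypothetical helper that only shows the sliding-window idea.

```python
def split_fixed(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_fixed(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
```

Each chunk starts with the last two characters of the previous one, so a sentence cut at a boundary still appears partly in both chunks.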

# 6) Finding the relevant chunks and vector store

Given a particular question, we need to find the relevant chunks from the transcription to send to the model. Here is where the idea of embeddings comes into play.

An embedding is a mathematical representation of the semantic meaning of a word, sentence, or document. It's a projection of a concept in a high-dimensional space. Embeddings have a simple characteristic: The projection of related concepts will be close to each other, while concepts with different meanings will lie far away.

To provide the model with the most relevant chunks, we can use the embeddings of the question and of the transcription chunks to compute the similarity between them. We can then select the chunks with the highest similarity to the question and use them as the context for the model.
The closer the embedding of a chunk is to the embedding of the question, the more relevant that chunk is.
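
As a small illustration, the sketch below compares a question vector with two chunk vectors. Cosine similarity is assumed here as the similarity measure (it is a common choice, but the notebook does not say which one is used), and the 3-dimensional vectors are made up; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings", for illustration only
query = [1.0, 0.0, 1.0]
chunk_vectors = {
    "chunk about the campaign": [0.9, 0.1, 0.8],
    "chunk about the weather": [0.0, 1.0, 0.1],
}

scores = {name: cosine_similarity(query, vec) for name, vec in chunk_vectors.items()}
best = max(scores, key=scores.get)
print(best)
```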

![Image6](Images/image6.png)

We need an efficient way to store document chunks, their embeddings, and perform similarity searches at scale. To do this, we'll use a **vector store**.

A **vector store** is a database of embeddings that specializes in fast similarity searches.

![Image7](Images/image7.png)
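
Conceptually, a vector store keeps pairs of (text, embedding) and returns the texts whose embeddings are closest to a query embedding. The toy class below is a brute-force sketch of that idea, with made-up vectors and cosine similarity assumed as the metric; `ToyVectorStore` is hypothetical, and real vector stores rely on specialized indexes instead of scanning every item.

```python
import math


class ToyVectorStore:
    """Brute-force store: keeps (text, vector) pairs and ranks them on every search."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def similarity_search(self, query_vector, k=2):
        # Sort all stored items from most to least similar and keep the top k
        ranked = sorted(
            self._items,
            key=lambda item: self._cosine(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)


store = ToyVectorStore()
store.add("Mary's sister is Susana", [1.0, 0.0, 0.0])
store.add("Pedro's mother is a teacher", [0.0, 1.0, 0.0])
store.add("Mary has one sibling", [0.8, 0.2, 0.0])

top = store.similarity_search([1.0, 0.1, 0.0], k=2)
print(top)
```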

In [13]:
from langchain_openai.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# 7) Connecting the vector store with the chain 

We can use the vector store to find the most relevant chunks from the transcription to send to the model. Here is how we can connect the vector store to the chain:

![Image8](Images/Image8.png)

We need to configure a **Retriever**. The **retriever** will run a similarity search in the vector store and pass the most similar documents to the next step in the chain.

# 8) Setting up Pinecone

Using an in-memory vector store is not recommended for real workloads. In practice, we need a vector store that can handle large amounts of data and perform similarity searches at scale. For this example, we'll use **Pinecone**.

The first step is to create a Pinecone account, set up an index, get an API key, and set it as an environment variable PINECONE_API_KEY.
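
Before calling Pinecone, it can help to fail early with a clear message when a required environment variable is missing. The helper below is one possible sketch; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used only for this illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Hypothetical variable set here only so that the demo runs on its own;
# in the project, the real variable would be PINECONE_API_KEY.
os.environ["DEMO_API_KEY"] = "placeholder-value"
demo_key = require_env("DEMO_API_KEY")
print("Key loaded:", bool(demo_key))
```

Printing only whether the key was loaded, rather than the key itself, keeps the secret out of the notebook output.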

Then, we can load the transcription documents into Pinecone:

In [34]:
from langchain_pinecone import PineconeVectorStore
from langchain_core.embeddings.embeddings import Embeddings

index_name = "ytberag"
pinecone_api_key = os.environ.get("PINECONE_API_KEY")
# Only confirm that the key was loaded; never print the key itself
print(pinecone_api_key is not None)

True


In [37]:
pinecone = PineconeVectorStore.from_documents(
    embedding = embeddings, 
    documents=documents, 
    index_name=index_name)

In [38]:
pinecone.similarity_search("Who is Donal Trump?")[:3]

[Document(metadata={'source': 'transcription.txt'}, page_content="These are third world indictments. The President of the United States sees how we're doing. We have a movement, the likes of which has never happened in this country before. And you see it with the polls. I mean, I'm up on these people by 60 points and 59 points. I don't mean at 50, not I'm leading them by 59. You almost say like, why are they campaigning? Aisa Hutchinson. He's at zero. Christie's at two. Other ones are at one. DeSanctimonious is at nine. I just see a poll come. I mean, I'm leading him by 60 points. Mr. President. You say, why are they doing it? But here's what they did. They saw this happening. And he went to the Attorney General of the United States. And he told them in Daitrum. There's just no evidence of that, Mr. President. But let's stand still. I want Mr. President. I want to talk to him. Mr. President, wait a minute. Wait, wait, wait. Could I say one thing? Look at all the lies he's told over the

In [39]:

from langchain_core.runnables import RunnableParallel, RunnablePassthrough

We can create a map with the two inputs by using the RunnableParallel and RunnablePassthrough classes. This will allow us to pass the context and question to the prompt as a map with the keys "context" and "question."
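
In plain Python terms, the map receives the raw question once and hands it to both entries: the retriever turns it into context, and the pass-through returns it unchanged. The sketch below illustrates this under that assumption; `passthrough`, `fake_retriever`, and `build_prompt_inputs` are hypothetical stand-ins, not LangChain classes.

```python
def passthrough(value):
    """Return the input unchanged, like a pass-through step."""
    return value


def fake_retriever(question):
    """Hypothetical retriever: a real one would query the vector store."""
    return ["Mary's sister is Susana."]


def build_prompt_inputs(question):
    # Both steps receive the same raw question string
    steps = {"context": fake_retriever, "question": passthrough}
    return {key: step(question) for key, step in steps.items()}


prompt_inputs = build_prompt_inputs("Who is Mary's sister?")
print(prompt_inputs)
```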


Let's set up the new chain using Pinecone as the vector store:

In [46]:

chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

# Run the whole chain with a question
chain.invoke("Why Donald Trump is here")

'Donald Trump is running for President again to "Make America Great Again" as he believes the country is in serious trouble.'

Our project is done!!!  
The next step is to build an interface with Chainlit.  
See the file **app.py** for that.