<a href="https://colab.research.google.com/github/QuratulAin20/Langchain/blob/main/video_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem statement
We need to build a chatbot for a video that answers user queries based on the video's content.

## SOLUTION
- First, fetch the video transcript, using either the LangChain YouTube loader or the YouTube API.

- Next, split the transcript text into chunks.

- Create an embedding for each chunk and store it in a vector database.

- Create a retriever.

- Pass the query to the retriever to get the relevant chunks.

- Combine the query and the retrieved chunks into a prompt.

- Pass the prompt to the LLM to get the response.


In [None]:
!pip install -q youtube-transcript-api langchain-community langchain-groq langchain-huggingface\
               faiss-cpu tiktoken python-dotenv

In [None]:
import os
os.environ['GROQ_API_KEY'] = ""

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate

# STEP-1
## 1a Indexing

In [None]:
# Only the video ID, not the full URL. The video must have an English transcript
# because we pass languages=["en"]; for another language, pass its language code.
video_id = "Gfr50f6ZBvo"
try:
    # Fetch the transcript in the requested language (English here)
    transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])

    # Flatten it to plain text
    transcript = " ".join(chunk["text"] for chunk in transcript_list)
    print(transcript)

except TranscriptsDisabled:
    print("No captions available for this video.")

the following is a conversation with demus hasabis ceo and co-founder of deepmind a company that has published and builds some of the most incredible artificial intelligence systems in the history of computing including alfred zero that learned all by itself to play the game of gold better than any human in the world and alpha fold two that solved protein folding both tasks considered nearly impossible for a very long time demus is widely considered to be one of the most brilliant and impactful humans in the history of artificial intelligence and science and engineering in general this was truly an honor and a pleasure for me to finally sit down with him for this conversation and i'm sure we will talk many times again in the future this is the lex friedman podcast to support it please check out our sponsors in the description and now dear friends here's demis hassabis let's start with a bit of a personal question am i an ai program you wrote to interview people until i get good enough 

- `transcript` holds all the extracted caption text joined into a single string.

In [None]:
transcript_list

[{'text': 'generative AI is revolutionizing',
  'start': 3.04,
  'duration': 3.799},
 {'text': 'Industries across the', 'start': 5.04, 'duration': 4.639},
 {'text': 'globe from creating stunning visuals to',
  'start': 6.839,
  'duration': 5.401},
 {'text': 'Drafting and debugging code to analyzing',
  'start': 9.679,
  'duration': 5.321},
 {'text': 'reports to powering intelligent chat',
  'start': 12.24,
  'duration': 5.799},
 {'text': 'Bots generative AI is impacting every',
  'start': 15.0,
  'duration': 4.48},
 {'text': 'role in an', 'start': 18.039, 'duration': 3.881},
 {'text': 'organization but how do you master this',
  'start': 19.48,
  'duration': 5.0},
 {'text': 'Cutting Edge technology and become a',
  'start': 21.92,
  'duration': 5.72},
 {'text': "generative AI expert in today's video we",
  'start': 24.48,
  'duration': 4.84},
 {'text': 'will walk you through a step-by-step',
  'start': 27.64,
  'duration': 3.84},
 {'text': 'road map that covers everything you need',
  

- The transcript is returned as a list of dictionaries.
- Each dictionary in `transcript_list` contains the caption `text`, `start` (the time in seconds at which that caption appears), and `duration` (how long it stays on screen).
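
To make the structure concrete, here is a small sketch that works on a hand-copied sample of the entries shown above, with no network call. The helper names `segment_end` and `text_at` are hypothetical, and reading `start + duration` as the moment a caption leaves the screen is an assumption based on the field names.

```python
# Sample entries copied from the output above.
sample = [
    {"text": "generative AI is revolutionizing", "start": 3.04, "duration": 3.799},
    {"text": "Industries across the", "start": 5.04, "duration": 4.639},
    {"text": "globe from creating stunning visuals to", "start": 6.839, "duration": 5.401},
]

def segment_end(segment):
    """Time (in seconds) at which a caption segment stops being shown (assumed)."""
    return segment["start"] + segment["duration"]

def text_at(segments, t):
    """Return the text of every segment that is on screen at time t."""
    return [s["text"] for s in segments if s["start"] <= t < segment_end(s)]

# Flatten to plain text, the same way `transcript` is built.
flat = " ".join(s["text"] for s in sample)
print(flat)

# Captions can overlap in time, so more than one may be visible at once.
print(text_at(sample, 6.0))
```

Because consecutive captions overlap in time, a lookup by timestamp can return more than one segment.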



## 1b Applying recursive splitting to create chunks

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.create_documents([transcript])

In [None]:
len(chunks)

168
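
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window. This is a simplified sketch, not how `RecursiveCharacterTextSplitter` is implemented: the real splitter prefers to break on separators such as paragraphs and spaces, while the hypothetical `sliding_chunks` below cuts at fixed character offsets.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

demo = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for c in demo:
    print(c)
```

In this simplified picture, `chunk_size=1000` with `chunk_overlap=200` would advance the window by 800 characters per chunk, so text near a chunk boundary appears in both neighbouring chunks.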

### Step 1c & 1d - Indexing (Embedding Generation and Storing in Vector Store)



In [None]:
import os
os.environ['HF_TOKEN'] = "hf_XV"

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
vector_store = FAISS.from_documents(chunks, embeddings)

## Step 2 - Retrieval

- The query is sent to the retriever, which embeds it and runs a vector search using the search type we configure. Here we use similarity search.
- The retriever returns a list of documents. The number of documents is set by `search_kwargs={'k': 2}`, so we get the 2 most similar documents from the vector store.

In [None]:
retriever = vector_store.as_retriever(search_type = 'similarity' , search_kwargs = {'k':2})
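
What the retriever does with `k` can be sketched with toy vectors. The sketch below assumes cosine similarity and hand-written 3-dimensional "embeddings"; the real pipeline uses the embedding model's vectors and FAISS's own index and distance metric, so treat `top_k` and the sample data as hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, store, k):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up (text, vector) pairs standing in for embedded chunks.
toy_store = [
    ("chunk about fusion", [0.9, 0.1, 0.0]),
    ("chunk about chess", [0.0, 1.0, 0.1]),
    ("chunk about plasma control", [0.8, 0.0, 0.3]),
]
query = [1.0, 0.0, 0.1]  # made-up query embedding
print(top_k(query, toy_store, k=2))
```

Raising `k` returns more chunks and gives the LLM more context, at the cost of a longer prompt.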

In [None]:
# inspecting the vector store: mapping from index position to document ID
vector_store.index_to_docstore_id

{0: '4ca7fdd4-2ef8-42bc-987a-1bff970fe4b2',
 1: 'd7b53a78-591f-4fef-9f75-8b01f6437735',
 2: '029f5c38-a3e3-45cc-b42f-cc0deb8cd167',
 3: '479610fb-61b5-4b40-90e9-fe810f3b6293',
 4: '5f114a35-cb35-4959-bb98-01d6b20e7799',
 5: '79e8ed44-67ca-4859-8f05-7ccc55a5dc60',
 6: '24223f00-a48e-4434-877c-3a9481c69981',
 7: '09069e85-8270-4a2b-b2e0-3c16f112ba95',
 8: 'b0bddfb6-9b24-4c22-985a-69f4dca16cd2',
 9: '4433e3a7-212e-4d66-8f17-a9bc4a46efe9',
 10: '19018ab7-4d10-4cfa-a495-2287c5004eb4',
 11: '25c2fe0d-09f7-41a2-9e17-dace2707dbc4',
 12: '79caa7ae-dc30-47cc-b7b6-3ef8f247253a',
 13: 'b246852d-4ba3-4387-b638-f3bc2cac35fc',
 14: '828614e9-c778-49ae-bb07-e0a495c5027b',
 15: 'dd44da59-0c67-4980-a167-947e63fcbf58',
 16: 'e2c511c8-d22c-4f41-8ae2-905b748fc521',
 17: '4605efdd-1ffb-4e29-8e2b-59a33f7c19d9',
 18: '433384a7-2fae-4c34-90f9-d9325b778b18',
 19: '1c5b7b7f-a261-4383-b59b-6d97b1a399f5',
 20: 'cc46d7ad-0d07-40f9-9732-1ba5395554df',
 21: 'e092b943-e53a-4786-bc8d-770508447f2b',
 22: '9d5a789e-751e-

In [None]:
experiment = vector_store.get_by_ids(['9955c64b-5dd4-42a5-983d-9971983a5d1e'])

In [None]:
# extracting only text
for doc in experiment:
    page_content = doc.page_content  # Accessing the page_content attribute
    print(page_content)

demas establish to support this podcast please check out our sponsors in the description and now let me leave you with some words from edskar dykstra computer science is no more about computers than astronomy is about telescopes thank you for listening and hope to see you next time


## STEP3 AUGMENTATION
- Augmentation means enriching the user's query with the retrieved context before it is sent to the LLM.

- In the augmentation step we define the prompt; in the generation step we define the LLM.

In [None]:
prompt = PromptTemplate(
    template="""
      You are a helpful assistant with access to a specific transcript.
      Your task is to answer the question based solely on the provided context.
      If the context does not contain enough information to answer, respond with "I don't know."

      Context:
      {context}

      Question: {question}

      Please provide a concise and accurate answer.
    """,
    input_variables=['context', 'question']
)
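
Conceptually, filling the template substitutes the two named variables into the text. The sketch below uses Python's built-in `str.format` as a stand-in; it is not `PromptTemplate` itself, and the shortened template and sample strings are made up for illustration.

```python
# Shortened, made-up template with the same two placeholders as the prompt above.
template = (
    "Answer the question using only the context.\n"
    'If the context is insufficient, say "I don\'t know."\n\n'
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)

# Each placeholder is replaced by the value passed under the same name.
filled = template.format(
    context="we collaborated with epfl in switzerland",
    question="Who did they collaborate with?",
)
print(filled)
```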

In [None]:
# retrieve documents for a sample question to test the prompt
question = "is the topic of nuclear fusion discussed in this video? if yes then what was discussed"
retrieved_docs = retriever.invoke(question)

In [None]:
retrieved_docs

[Document(id='41efa218-80bb-4300-a4de-dd01def701cc', metadata={}, page_content="in this case in fusion we we collaborated with epfl in switzerland the swiss technical institute who are amazing they have a test reactor that they were willing to let us use which you know i double checked with the team we were going to use carefully and safely i was impressed they managed to persuade them to let us use it and um and it's a it's an amazing test reactor they have there and they try all sorts of pretty crazy experiments on it and um the the the what we tend to look at is if we go into a new domain like fusion what are all the bottleneck problems uh like thinking from first principles you know what are all the bottleneck problems that are still stopping fusion working today and then we look at we you know we get a fusion expert to tell us and then we look at those bottlenecks and we look at the ones which ones are amenable to our ai methods today yes right and and and then and would be intere

In [None]:
# concatenate the page content of all retrieved documents
context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
context_text

"in this case in fusion we we collaborated with epfl in switzerland the swiss technical institute who are amazing they have a test reactor that they were willing to let us use which you know i double checked with the team we were going to use carefully and safely i was impressed they managed to persuade them to let us use it and um and it's a it's an amazing test reactor they have there and they try all sorts of pretty crazy experiments on it and um the the the what we tend to look at is if we go into a new domain like fusion what are all the bottleneck problems uh like thinking from first principles you know what are all the bottleneck problems that are still stopping fusion working today and then we look at we you know we get a fusion expert to tell us and then we look at those bottlenecks and we look at the ones which ones are amenable to our ai methods today yes right and and and then and would be interesting from a research perspective from our point of view from an ai point of\n\

- We now have both the prompt template and the context text (the joined output of the retriever).

In [None]:
final_prompt = prompt.invoke({"context": context_text, "question": question})

In [None]:
final_prompt

StringPromptValue(text='\n      You are a helpful assistant with access to a specific transcript.\n      Your task is to answer the question based solely on the provided context.\n      If the context does not contain enough information to answer, respond with "I don\'t know."\n\n      Context:\n      in this case in fusion we we collaborated with epfl in switzerland the swiss technical institute who are amazing they have a test reactor that they were willing to let us use which you know i double checked with the team we were going to use carefully and safely i was impressed they managed to persuade them to let us use it and um and it\'s a it\'s an amazing test reactor they have there and they try all sorts of pretty crazy experiments on it and um the the the what we tend to look at is if we go into a new domain like fusion what are all the bottleneck problems uh like thinking from first principles you know what are all the bottleneck problems that are still stopping fusion working tod

## Step 4 - Generation

In [None]:
from langchain_groq import ChatGroq
llm=ChatGroq(model_name="Llama3-8b-8192")

In [None]:
answer = llm.invoke(final_prompt)
print(answer.content)

Yes, the topic of nuclear fusion is discussed in this video. The speaker mentions collaborating with EPFL (École Polytechnique Fédérale de Lausanne) in Switzerland to use their test reactor, and how they used AI methods to solve one of the bottleneck problems in fusion, which is holding the plasma in specific shapes for a record amount of time. They also mention that they are looking to tackle another problem in the fusion area.


## STEP5 Creating a pipeline, i.e. a chain
- We create a chain so that the output of one step acts as the input of the next.


In [None]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

- We build a function that takes the retrieved documents, joins only their content, and returns the combined text.

In [None]:
def format_docs(retrieved_docs):
  context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
  return context_text

In [None]:
parallel_chain = RunnableParallel({
    'context': retriever | RunnableLambda(format_docs),
    'question': RunnablePassthrough()
})

- `RunnableParallel` defines the chain as a dictionary.

It contains the key **context**, whose value is a **chain**: `retriever` piped into our `format_docs` function (which joins the content of all documents returned by the retriever). The key **question** uses `RunnablePassthrough`, which forwards the input unchanged.

`format_docs` is a plain function, so we wrap it in `RunnableLambda` to make it a runnable.


**WHY PARALLEL CHAIN**

We use a parallel chain for the following reasons:
1. The prompt takes 2 inputs.
- One is the query and the other is the context.
- The query is sent directly to the prompt.
- The context has to be retrieved according to the query, so we have 2 chains:
- one that passes the question through, and a second that retrieves documents from the retriever.

2. The outputs of these 2 chains are then passed into another simple chain consisting of prompt | llm | parser.

Our chains are:

question | prompt (parallel)

question | retriever | format_docs | prompt (parallel)

prompt | llm | parser (the parallel chains connect to this chain)
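
The data flow can be sketched with plain functions. Nothing here is LangChain: `run_parallel`, `pipe`, `fake_retriever`, `join_texts`, `build_prompt`, and `fake_llm` are hypothetical stand-ins that only mimic the shape of the chain (one input fanned out into a dictionary, then passed through the remaining steps in order).

```python
def run_parallel(branches, value):
    """Feed the same input to every branch and collect the results in a dict."""
    return {key: branch(value) for key, branch in branches.items()}

def pipe(*steps):
    """Compose steps left to right, similar in spirit to the | operator."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

def fake_retriever(question):
    # Stand-in: always returns the same two "documents" (plain strings here).
    return ["demis is ceo of deepmind", "deepmind built alphafold"]

def join_texts(docs):
    # Plays the role of format_docs, but on plain strings.
    return "\n\n".join(docs)

def build_prompt(inputs):
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"

def fake_llm(prompt_text):
    # Stand-in: reports the prompt length instead of calling a model.
    return f"(answer based on {len(prompt_text)} prompt characters)"

branches = {
    "context": pipe(fake_retriever, join_texts),
    "question": lambda q: q,  # passthrough: forwards the question unchanged
}
chain = pipe(lambda q: run_parallel(branches, q), build_prompt, fake_llm)

print(run_parallel(branches, "who is demis"))
print(chain("who is demis"))
```

The first step turns a single question into the two-key dictionary the prompt expects, which is why the fan-out has to come before the prompt.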

In [None]:
parallel_chain.invoke('who is demis')

{'context': "the following is a conversation with demus hasabis ceo and co-founder of deepmind a company that has published and builds some of the most incredible artificial intelligence systems in the history of computing including alfred zero that learned all by itself to play the game of gold better than any human in the world and alpha fold two that solved protein folding both tasks considered nearly impossible for a very long time demus is widely considered to be one of the most brilliant and impactful humans in the history of artificial intelligence and science and engineering in general this was truly an honor and a pleasure for me to finally sit down with him for this conversation and i'm sure we will talk many times again in the future this is the lex friedman podcast to support it please check out our sponsors in the description and now dear friends here's demis hassabis let's start with a bit of a personal question am i an ai program you wrote to interview people until i get

- We get 2 keys: one is context and the other is question.

In [None]:
# parse the model output into a plain string
parser = StrOutputParser()

In [None]:
# whole chain
main_chain = parallel_chain | prompt | llm | parser

In [None]:
main_chain.invoke('Can you summarize the video')

'Based on the provided transcript, here is a concise summary of the video:\n\nThe conversation is with Demis Hassabis, the CEO and co-founder of DeepMind, a company that has developed advanced artificial intelligence systems. The host, Lex Friedman, introduces Demis as one of the most brilliant and impactful humans in AI and science. The conversation starts with a personal question, whether Demis wrote an AI program to interview people until it becomes good enough.'

------------------- THAT'S ALL ABOUT SIMPLE RAG --------------------

# FURTHER COMPLEX TECHNIQUES USED IN INDUSTRY FOR RAG