## Install libraries

In [1]:
%pip install -q youtube-transcript-api langchain-community langchain-huggingface sentence-transformers \
                faiss-cpu tiktoken python-dotenv

Note: you may need to restart the kernel to use updated packages.




In [2]:
import os
import re

from dotenv import load_dotenv, find_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.messages import SystemMessage, AIMessage, HumanMessage
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import ChatHuggingFace, HuggingFaceEmbeddings, HuggingFaceEndpoint
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled

# Locate the .env file (an error is raised if it is missing) and load its variables
dotenv_path = find_dotenv(filename=".env", raise_error_if_not_found=True)
load_dotenv(dotenv_path)


True
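`load_dotenv` returning `True` only means the file was read, not that it holds a usable token. A minimal sketch of a sanity check, assuming the Hugging Face token is stored under `HUGGINGFACEHUB_API_TOKEN` or `HF_TOKEN` (both names are assumptions about this setup, and `hf_token_available` is a hypothetical helper):

```python
import os

def hf_token_available(names=("HUGGINGFACEHUB_API_TOKEN", "HF_TOKEN")):
    """Return True if any of the assumed token variables is set and non-empty."""
    return any(os.getenv(name) for name in names)

print("token found" if hf_token_available() else "no token found; check .env")
```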

## Step 1a - Indexing (Document Ingestion)

In [3]:
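# YouTubeTranscriptApi and TranscriptsDisabled were imported in the setup cell above.
api = YouTubeTranscriptApi()

video_id = "Gfr50f6ZBvo"  # the video ID only, not the full URL

try:
    # Each fetched snippet exposes its caption text via .text
    transcript_list = api.fetch(video_id)

    # Flatten the snippets into one plain-text string
    transcript = " ".join(snippet.text for snippet in transcript_list)

    print("Transcript fetched successfully!")
    print(transcript[:1000] + "...")

except TranscriptsDisabled:
    print(f"Transcripts are disabled for video: {video_id}")
except Exception as e:
    # Note: `transcript` stays undefined in this case, so the later cells will fail
    print(f"An error occurred: {e}")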

Transcript fetched successfully!
the following is a conversation with demus hasabis ceo and co-founder of deepmind a company that has published and builds some of the most incredible artificial intelligence systems in the history of computing including alfred zero that learned all by itself to play the game of gold better than any human in the world and alpha fold two that solved protein folding both tasks considered nearly impossible for a very long time demus is widely considered to be one of the most brilliant and impactful humans in the history of artificial intelligence and science and engineering in general this was truly an honor and a pleasure for me to finally sit down with him for this conversation and i'm sure we will talk many times again in the future this is the lex friedman podcast to support it please check out our sponsors in the description and now dear friends here's demis hassabis let's start with a bit of a personal question am i an ai program you wrote to intervie
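The cell above takes a bare video ID rather than a URL. A minimal sketch of pulling the ID out of common URL shapes with `re`, assuming IDs are 11 characters long; `extract_video_id` is a hypothetical helper and not part of `youtube-transcript-api`:

```python
import re
from typing import Optional

# Hypothetical pattern covering a few common URL shapes (an assumption, not exhaustive).
_VIDEO_ID_PATTERN = re.compile(r"(?:v=|youtu\.be/|/embed/|/shorts/)([A-Za-z0-9_-]{11})")

def extract_video_id(url_or_id: str) -> Optional[str]:
    """Return the 11-character video ID, or None if none is found."""
    candidate = url_or_id.strip()
    # A bare ID is accepted unchanged.
    if re.fullmatch(r"[A-Za-z0-9_-]{11}", candidate):
        return candidate
    match = _VIDEO_ID_PATTERN.search(candidate)
    return match.group(1) if match else None

print(extract_video_id("https://www.youtube.com/watch?v=Gfr50f6ZBvo"))
```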

## Step 1b - Indexing (Text Splitting)

In [4]:
# Chunks of at most 1000 characters; neighboring chunks share up to 200 characters
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.create_documents([transcript])

In [5]:
len(chunks)

168

In [6]:
chunks[100]

Document(metadata={}, page_content="and and kind of come up with descriptions of the electron clouds where they're gonna go how they're gonna interact when you put two elements together uh and what we try to do is learn a simulation uh uh learner functional that will describe more chemistry types of chemistry so um until now you know you can run expensive simulations but then you can only simulate very small uh molecules very simple molecules we would like to simulate large materials um and so uh today there's no way of doing that and we're building up towards uh building functionals that approximate schrodinger's equation and then allow you to describe uh what the electrons are doing and all materials sort of science and material properties are governed by the electrons and and how they interact so have a good summarization of the simulation through the functional um but one that is still close to what the actual simulation would come out with so what um how difficult is that to ask w
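To see how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-stride splitter. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators, so its chunks vary in length, and `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))
pieces = split_with_overlap(sample)
print(len(pieces), [len(p) for p in pieces])  # 3 [1000, 1000, 900]
```

The last 200 characters of each window reappear at the start of the next one, so a sentence cut at a boundary still shows up whole in at least one chunk.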

## Step 1c & 1d - Indexing (Embedding Generation and Storing in Vector Store)

In [7]:
# Embed every chunk with a small sentence-transformers model and index the vectors in FAISS
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embeddings)



In [8]:
vector_store.index_to_docstore_id

{0: '5759fca0-3710-44d8-891f-8d8fc8281dde',
 1: 'c5ae5005-7075-4025-9dd4-8b70c9107440',
 2: '66da62b9-3c3c-46a7-8f81-04adb03340b4',
 3: 'a83978e1-bee1-45a6-b329-df6bf475dbe7',
 4: 'db142d6f-2d2d-4e6c-be7d-52901502346a',
 5: '845b4399-bbd6-48f2-8014-c1747fe6358c',
 6: 'd21cc0bc-e23d-4e9f-acb6-0eaf4c9c01f4',
 7: '768909f6-63bc-4496-9955-c47fd02a3872',
 8: '30ad7c5f-6116-4b75-b20e-52d8989dbacf',
 9: '3467a71b-ae63-4d60-a6d4-55204acbc98b',
 10: 'c99220de-fd4d-4594-8d39-f150044d00fc',
 11: 'dc2e147d-8873-421f-90aa-f6bf4df5e011',
 12: '1f4e4b23-be96-41e0-8dba-273cd90f49de',
 13: '26e21b4c-c410-4de3-b9de-ce7fc8324aaa',
 14: '755de3fb-ed3a-47f0-936d-eb4cccc51b29',
 15: '10b44f06-e892-46ad-ab87-72f8bcc709a7',
 16: '2fef997f-f2e9-40e4-ba40-5c39695a2d70',
 17: 'd6b005cb-79fb-4e9e-9d47-9fa402931426',
 18: 'cc5665c2-3825-42d0-b7c4-92a0b2a4419e',
 19: '08090c49-e90f-40ec-a138-e2b03d21d4d4',
 20: 'dcfbf85a-bf94-44e7-85ea-e5a40b941a7c',
 21: '02081525-a2a3-4a7d-a565-80e883f4bd0e',
 22: '734acb79-a2cd-

In [9]:
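# This ID does not belong to the current store (IDs are generated anew when the store
# is built), so the lookup returns an empty list.
vector_store.get_by_ids(['3b87214f-6e03-4d8b-aa95-eee12141f725'])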

[]
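The empty result is easier to read with a toy model of the two mappings involved. This is a sketch with plain dicts, not the FAISS wrapper itself; `docstore`, `index_to_id`, and this `get_by_ids` are hypothetical stand-ins:

```python
import uuid

# Toy document store: random UUID -> chunk text (fresh IDs on every run).
docstore = {str(uuid.uuid4()): text for text in ["chunk a", "chunk b", "chunk c"]}
# Position in the index -> document ID, mirroring index_to_docstore_id.
index_to_id = dict(enumerate(docstore))

def get_by_ids(ids):
    """Return the stored texts for known IDs and skip unknown ones."""
    return [docstore[i] for i in ids if i in docstore]

stale_id = "3b87214f-6e03-4d8b-aa95-eee12141f725"  # ID that is not in this store
print(get_by_ids([stale_id]))        # []
print(get_by_ids([index_to_id[0]]))  # ['chunk a']
```

In the notebook, the equivalent fix would be to read an ID from `vector_store.index_to_docstore_id` instead of hard-coding one.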

## Step 2 - Retrieval

In [10]:
# Return the 4 chunks most similar to the query
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})

In [11]:
retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x0000018805967310>, search_kwargs={'k': 4})
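With `k=4` and the 168 chunks produced above, each query sees only a small slice of the transcript, which matters for broad questions such as summaries. A rough estimate, assuming every chunk is full-size and advances by a fixed stride (the real splitter produces variable lengths, so treat the figure as approximate); `estimated_coverage` is a hypothetical helper:

```python
def estimated_coverage(num_chunks, k, chunk_size=1000, chunk_overlap=200):
    """Approximate share of the text contained in k retrieved chunks."""
    step = chunk_size - chunk_overlap
    # Total length if chunks were fixed-stride windows over the text
    approx_total_chars = step * (num_chunks - 1) + chunk_size
    retrieved_chars = k * chunk_size
    return min(1.0, retrieved_chars / approx_total_chars)

print(f"{estimated_coverage(168, 4):.1%}")  # 3.0%
```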

In [12]:
retriever.invoke('What is deepmind')

[Document(id='38ea7e87-0149-478b-bfff-3837966bd848', metadata={}, page_content="and how it works this is tough to uh ask you this question because you probably will say it's everything but let's let's try let's try to think to this because you're in a very interesting position where deepmind is the place of some of the most uh brilliant ideas in the history of ai but it's also a place of brilliant engineering so how much of solving intelligence this big goal for deepmind how much of it is science how much is engineering so how much is the algorithms how much is the data how much is the hardware compute infrastructure how much is it the software computer infrastructure yeah um what else is there how much is the human infrastructure and like just the humans interact in certain kinds of ways in all the space of all those ideas how much does maybe like philosophy how much what's the key if um uh if if you were to sort of look back like if we go forward 200 years look back what was the key 
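Conceptually, similarity search embeds the query and returns the `k` stored vectors closest to it. A brute-force sketch on toy 2-D vectors, assuming Euclidean (L2) distance; the actual store may be configured with a different metric, and `top_k` is a hypothetical helper:

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec."""
    ranked = sorted(range(len(doc_vecs)), key=lambda i: l2_distance(query_vec, doc_vecs[i]))
    return ranked[:k]

doc_vecs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))  # [1, 4]
```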

## Step 3 - Augmentation

In [13]:
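endpoint = HuggingFaceEndpoint(
    repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
    temperature=0.7,     # sampling temperature; lower values give more deterministic answers
    max_new_tokens=512   # upper bound on the length of the generated answer
)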

llm = ChatHuggingFace(llm=endpoint)

In [14]:
prompt = PromptTemplate(
    template="""
      You are a helpful assistant.
      Answer ONLY from the provided transcript context.
      If the context is insufficient, just say you don't know.

      {context}
      Question: {question}
    """,
    input_variables=['context', 'question']
)

In [15]:
question = "is the topic of nuclear fusion discussed in this video? if yes then what was discussed"
retrieved_docs = retriever.invoke(question)

In [16]:
retrieved_docs

[Document(id='8f2fa096-a346-4619-8de2-6604fd15c44f', metadata={}, page_content="in this case in fusion we we collaborated with epfl in switzerland the swiss technical institute who are amazing they have a test reactor that they were willing to let us use which you know i double checked with the team we were going to use carefully and safely i was impressed they managed to persuade them to let us use it and um and it's a it's an amazing test reactor they have there and they try all sorts of pretty crazy experiments on it and um the the the what we tend to look at is if we go into a new domain like fusion what are all the bottleneck problems uh like thinking from first principles you know what are all the bottleneck problems that are still stopping fusion working today and then we look at we you know we get a fusion expert to tell us and then we look at those bottlenecks and we look at the ones which ones are amenable to our ai methods today yes right and and and then and would be intere

In [17]:
context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
context_text

"in this case in fusion we we collaborated with epfl in switzerland the swiss technical institute who are amazing they have a test reactor that they were willing to let us use which you know i double checked with the team we were going to use carefully and safely i was impressed they managed to persuade them to let us use it and um and it's a it's an amazing test reactor they have there and they try all sorts of pretty crazy experiments on it and um the the the what we tend to look at is if we go into a new domain like fusion what are all the bottleneck problems uh like thinking from first principles you know what are all the bottleneck problems that are still stopping fusion working today and then we look at we you know we get a fusion expert to tell us and then we look at those bottlenecks and we look at the ones which ones are amenable to our ai methods today yes right and and and then and would be interesting from a research perspective from our point of view from an ai point of\n\

In [18]:
final_prompt = prompt.invoke({"context": context_text, "question": question})

In [19]:
final_prompt

StringPromptValue(text="\n      You are a helpful assistant.\n      Answer ONLY from the provided transcript context.\n      If the context is insufficient, just say you don't know.\n\n      in this case in fusion we we collaborated with epfl in switzerland the swiss technical institute who are amazing they have a test reactor that they were willing to let us use which you know i double checked with the team we were going to use carefully and safely i was impressed they managed to persuade them to let us use it and um and it's a it's an amazing test reactor they have there and they try all sorts of pretty crazy experiments on it and um the the the what we tend to look at is if we go into a new domain like fusion what are all the bottleneck problems uh like thinking from first principles you know what are all the bottleneck problems that are still stopping fusion working today and then we look at we you know we get a fusion expert to tell us and then we look at those bottlenecks and we 
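The filled prompt keeps the leading newline and the six-space indentation of the triple-quoted template. A sketch of the same substitution with plain `str.format`, using `textwrap.dedent` to drop the indentation; the toy context and question are made up for illustration:

```python
from textwrap import dedent

# Dedent the template before filling it: the inserted context has no indentation,
# so dedenting afterwards would no longer find a common prefix.
template = dedent("""
    You are a helpful assistant.
    Answer ONLY from the provided transcript context.
    If the context is insufficient, just say you don't know.

    {context}
    Question: {question}
""").strip()

filled = template.format(context="chunk one\n\nchunk two", question="What is discussed?")
print(filled)
```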

## Step 4 - Generation

In [21]:
answer = llm.invoke(final_prompt)
print(answer.content)

 Yes, the topic of nuclear fusion is discussed in this video. The speaker mentions collaborating with EPFL in Switzerland and using their test reactor to explore fusion as an area where AI can help accelerate progress. They describe how they looked for bottleneck problems amenable to AI methods and focused on holding plasma in specific shapes for record amounts of time. This was achieved through a controller that can contain and hold plasma in various shapes, which was a significant problem solved in the fusion area. The team is currently exploring the next problem they can tackle with fusion startups. They also have a paper on magnetic control of tokamak plasmas using deep reinforcement learning to solve nuclear fusion.


## Building a Chain

In [22]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

In [23]:
def format_docs(retrieved_docs):
    """Join the page content of the retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in retrieved_docs)

In [24]:
parallel_chain = RunnableParallel({
    'context': retriever | RunnableLambda(format_docs),
    'question': RunnablePassthrough()
})

In [25]:
parallel_chain.invoke('who is Demis')

{'context': "to get world peace because there's also other corrupting things like wanting power over people and this kind of stuff which is not necessarily satisfied by by just abundance but i think it will help um and i think uh but i think ultimately ai is not going to be run by any one person or one organization i think it should belong to the world belong to humanity um and i think maybe many there'll be many ways this will happen and ultimately um everybody should have a say in that do you have advice for uh young people in high school and college maybe um if they're interested in ai or interested in having a big impact on the world what they should do to have a career they can be proud of her to have a life they can be proud of i love giving talks to the next generation what i say to them is actually two things i i think the most important things to learn about and to find out about when you're when you're young is what are your true passions is first of all there's two things on
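The parallel step sends the same question down two branches and collects the results under their keys. A plain-Python sketch of that fan-out, run sequentially for simplicity; `fake_retriever`, `format_docs_plain`, and `run_parallel` are hypothetical stand-ins, not LangChain APIs:

```python
def fake_retriever(question):
    # Hypothetical stand-in for the retriever: returns fixed "documents".
    return [{"page_content": "first chunk"}, {"page_content": "second chunk"}]

def format_docs_plain(docs):
    """Join the toy documents the same way format_docs does."""
    return "\n\n".join(d["page_content"] for d in docs)

def run_parallel(branches, value):
    """Call every branch with the same input and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}

result = run_parallel(
    {
        "context": lambda q: format_docs_plain(fake_retriever(q)),
        "question": lambda q: q,  # passthrough: the question is forwarded unchanged
    },
    "who is Demis",
)
print(result)
```

The resulting dict has exactly the two keys the prompt template expects, which is what lets it feed straight into the prompt.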

In [26]:
parser = StrOutputParser()

In [27]:
main_chain = parallel_chain | prompt | llm | parser
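The `|` operator chains steps so that each output becomes the next input. A minimal sketch of that idea using Python's `__or__`; the `Step` class and the toy steps are hypothetical and much simpler than LangChain's runnables:

```python
class Step:
    """Minimal callable wrapper that supports `a | b` composition."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # The combined step runs self first, then feeds the result to other
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.fn(value)

build_inputs = Step(lambda q: {"context": "toy context", "question": q})
fill_prompt = Step(lambda d: f"{d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda p: {"content": p.upper()})  # hypothetical model stand-in
parse = Step(lambda msg: msg["content"])

toy_chain = build_inputs | fill_prompt | fake_llm | parse
print(toy_chain.invoke("hi"))
```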

In [28]:
main_chain.invoke('Can you summarize the video')

" The speaker discusses the possibility of a more fundamental and simpler explanation of physics beyond the standard model. They mention that such an explanation could shed light on mysteries like consciousness, life, and gravity. The speaker also highlights the importance of clear and simple explanations as a sign of understanding complex topics. They reflect on the history of computers and AI in relation to chess, mentioning Claude Shannon's first chess program and IBM's Deep Blue. The speaker expresses admiration for human intelligence, suggesting that they were more impressed by Garry Kasparov's mind during Deep Blue's victory than the machine itself."