In [None]:
"""
Workflow:
1. Download the YouTube video as an MP4 file.
2. Transcribe the audio using Whisper.
3. Summarize the transcribed text using LangChain with three different approaches: stuff, refine, and map_reduce.
4. Add transcripts from multiple URLs to a Deep Lake database and retrieve information from it.
"""

In [2]:
"""Packages to be used:
yt_dlp
whisper
langchain
deeplake
langchain_google_genai
tiktoken
ffmpeg"""

'Packages to be used:\nyt_dlp\nwhisper\nlangchain\ndeeplake\nlangchain_google_genai\ntiktoken\nffmpeg'

# Function to download YouTube videos

In [1]:
import yt_dlp

def download_mp4_from_youtube(url, file_name='lecuninterview.mp4'):
    # Download options: prefer an MP4 video stream merged with M4A audio,
    # falling back to the best single MP4 file
    ydl_opts = {
        'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]',
        'outtmpl': file_name,
        'quiet': True,
    }

    # Download the video to file_name
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.extract_info(url, download=True)


url = "https://www.youtube.com/watch?v=mBjPyte2ZZo"
download_mp4_from_youtube(url)

                                                                         

# Using Whisper to transcribe the downloaded video

In [None]:
import whisper

model = whisper.load_model('base')
result = model.transcribe('lecuninterview.mp4')
print(result)

# Transcription benefits from a strong GPU.
# I ran this step in Colab and downloaded the transcribed text file from there.

In [None]:
# Saving the transcribed text in text file
with open('text.txt', 'w') as f:
    f.write(result['text'])

# Summarization with LangChain

In [4]:
from langchain import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains.mapreduce import MapReduceChain
from langchain.chains.summarize import load_summarize_chain

llm = ChatGoogleGenerativeAI(model='gemini-pro', temperature=0)

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000, 
    chunk_overlap=50,
    separators=[" ", ",", "\n"],
)
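To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch that uses plain fixed-width character windows. It is a simplification: `RecursiveCharacterTextSplitter` also splits on the listed separators, which this sketch ignores. The function name `chunk_with_overlap` and the sample string are made up for illustration.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=50):
    """Split text into fixed-width windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_with_overlap(sample, chunk_size=12, chunk_overlap=2)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.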

In [6]:
from langchain.docstore.document import Document

with open('text.txt', 'r') as f:
    text = f.read()

texts = text_splitter.split_text(text=text)
docs = [Document(page_content=t) for t in texts[:4]]

Using map-reduce
- The "map-reduce" and "refine" approaches offer more sophisticated ways to process longer documents and extract useful information from them.
- The "map-reduce" method summarizes each chunk independently, so it can be parallelized for faster processing. The "refine" approach updates a running summary one chunk at a time, which is slower but is reported empirically to produce better results.
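As a rough illustration of why map-reduce parallelizes, the sketch below runs the map step in a thread pool and then combines the partial summaries. It is not LangChain's implementation: `map_reduce_summarize`, `first_sentence`, and `join_lines` are hypothetical names, and the two helper functions stand in for LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor


def map_reduce_summarize(chunks, map_fn, reduce_fn, max_workers=4):
    """Summarize every chunk independently (map), then combine the results (reduce)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order, so summaries line up with chunks
        partial_summaries = list(pool.map(map_fn, chunks))
    return reduce_fn("\n".join(partial_summaries))


def first_sentence(text):
    # Stand-in for an LLM "summarize this chunk" call
    return text.split(".")[0].strip() + "."


def join_lines(text):
    # Stand-in for an LLM "combine these summaries" call
    return " ".join(text.splitlines())


chunks = ["Cats sleep a lot. They also purr.", "Dogs bark. They fetch sticks."]
summary = map_reduce_summarize(chunks, first_sentence, join_lines)
print(summary)
```

Because no map call depends on another, the chunks can be processed in any order or all at once; only the final reduce step has to wait for every partial summary.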

In [10]:
import textwrap

chain = load_summarize_chain(llm=llm, chain_type='map_reduce')

output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text) 

Retrying langchain_google_genai.chat_models._chat_with_retry.<locals>._chat_with_retry in 2.0 seconds as it raised InternalServerError: 500 An internal error has occurred. Please retry or report in https://developers.generativeai.google/guide/troubleshooting.


- Jan LeCoon, a pioneer in deep learning, discusses the limitations of large language models and
introduces his new joint embedding predictive architecture (JEPA) as a potential solution. - JEPA
aims to address the lack of a world model in large language models by learning representations of
text that can be used for downstream tasks. - Pre-training transformer architectures involve
removing words from a text and training a neural network to predict the missing words, helping the
system learn good representations of text. - Large language models can generate text by predicting
the next word in a sequence but struggle to represent uncertain predictions, making it difficult to
handle scenarios with multiple possible words.


Summary with bullet points using the stuff chain

- The "stuff" approach is the simplest and most naive one: all the text from the documents is placed in a single prompt.
- This method may raise exceptions if the combined text is longer than the LLM's context window, and it may not be the most efficient way to handle large amounts of text.
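The idea can be sketched in a few lines, assuming a character budget as a crude stand-in for the model's real token limit. The name `stuff_prompt`, the `max_chars` parameter, and the template text are illustrative and not part of LangChain.

```python
def stuff_prompt(chunks, template, max_chars=4000):
    """Join all chunks into one prompt and fail early if it exceeds the budget."""
    prompt = template.format(text="\n\n".join(chunks))
    # Real limits are counted in tokens; characters are only an approximation
    if len(prompt) > max_chars:
        raise ValueError(
            f"Prompt has {len(prompt)} characters; the budget is {max_chars}"
        )
    return prompt


template = "Write a concise summary of the following:\n{text}\nSummary:"
prompt = stuff_prompt(["first chunk", "second chunk"], template)
print(prompt)
```

Checking the size before calling the model turns a remote API error into a local one, which is easier to handle by falling back to one of the chunk-by-chunk approaches.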

In [13]:
prompt_template = """
Write a concise summary of the given text in bullet points:
{text}

Summary in bullet points:
"""

bullet_point_prompt = PromptTemplate(
    template = prompt_template,
    input_variables=['text'],
)

In [15]:
chain = load_summarize_chain(
    llm=llm,
    chain_type='stuff',
    prompt = bullet_point_prompt
)

output_summary = chain.run(docs)

wrapped_text = textwrap.fill(output_summary, 
                             width=1000,
                             break_long_words=False,
                             replace_whitespace=False)
print(wrapped_text)

- Jan LeCoon, a prominent figure in deep learning development, discussed self-supervised learning and its relationship to large language models.
- Self-supervised learning has revolutionized natural language processing by pre-training transformer architectures.
- Large language models can predict the next word in a text and generate text spontaneously.
- Generative models, like large language models, struggle to represent uncertain predictions.
- Jan LeCoon introduced his joint embedding predictive architecture (JEPA), which aims to address the limitations of large language models.
- JEPA combines self-supervised learning with a world model to improve the representation of uncertainty and enable reasoning about the world.
- Jan LeCoon believes that AI systems have the potential to exhibit features of consciousness in the future.


Using the refine chain
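The refine approach summarizes the first chunk and then revises that summary once per remaining chunk. A minimal sketch of the loop follows; `refine_summarize` and both helper functions are hypothetical stand-ins for LLM calls, not LangChain's implementation.

```python
def refine_summarize(chunks, initial_fn, refine_fn):
    """Summarize the first chunk, then update the summary with each later chunk."""
    if not chunks:
        return ""
    summary = initial_fn(chunks[0])
    for chunk in chunks[1:]:
        # Each step needs the previous summary, so the steps run one after another
        summary = refine_fn(summary, chunk)
    return summary


def first_sentence(text):
    # Stand-in for an LLM "summarize this chunk" call
    return text.split(".")[0].strip() + "."


def append_first_sentence(summary, chunk):
    # Stand-in for an LLM "refine the summary with this new context" call
    return summary + " " + first_sentence(chunk)


chunks = ["Cats sleep a lot. They also purr.", "Dogs bark. They fetch sticks."]
summary = refine_summarize(chunks, first_sentence, append_first_sentence)
print(summary)
```

Since every refinement sees the summary so far, later chunks are read with earlier context in mind, at the cost of one sequential call per chunk.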

In [16]:
chain = load_summarize_chain(llm, chain_type="refine")

output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)

Craig Smith interviews Jan LeCoon, a pioneer in deep learning and advocate for self-supervised
learning. LeCoon discusses the limitations of large language models, particularly their lack of a
world model. He introduces his new joint embedding predictive architecture (JEPA) as a potential
solution to this problem. LeCoon also shares his theory of consciousness and the possibility of AI
systems exhibiting conscious features in the future.  Additionally, LeCoon highlights the
significance of self-supervised learning in natural language processing, particularly in pre-
training transformer architectures. He emphasizes the transformative impact of self-supervised
learning in this field, explaining how it involves training a neural network to predict missing
words in a text, resulting in learned representations of text that can be used for various
downstream tasks. This approach has revolutionized the field and has practical applications in
content moderation systems and other areas.  LeCoo

# Adding Transcripts to Deep Lake

Adding transcriptions from multiple videos


In [None]:
import yt_dlp

def download_mp4_from_youtube(urls, job_id):
    # This will hold the file name, title, and author of each downloaded video
    video_info = []

    for i, url in enumerate(urls):
        # Set the options for the download
        file_name = f'./{job_id}_{i}.mp4'
        ydl_opts = {
            'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]',
            'outtmpl': file_name,
            'quiet': True,
        }

        # Download the video and read its metadata
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            result = ydl.extract_info(url, download=True)
            title = result.get('title', "")
            author = result.get('uploader', "")

        # Add the file name, title, and author to our list
        video_info.append((file_name, title, author))

    return video_info


urls=["https://www.youtube.com/watch?v=mBjPyte2ZZo&t=78s",
    "https://www.youtube.com/watch?v=cjs7QKJNVYM",]
videos_details = download_mp4_from_youtube(urls, 1)


In [None]:
import whisper

model = whisper.load_model('base')

results = []

# Iterate through each video and transcribe it with the loaded model
for video in videos_details:
    result = model.transcribe(video[0])  # video[0] is the downloaded file name
    results.append(result['text'])


with open("text_multiple.txt", 'w') as f:
    f.write('\n'.join(results))

In [17]:
with open('text_multiple.txt', 'r') as f:
    text = f.read()

texts = text_splitter.split_text(text)
docs = [Document(page_content=t) for t in texts[:4]]

In [28]:
docs

[Document(page_content="Hi, I'm Craig Smith and this is I on A On. This week I talk to Jan LeCoon, one of the seminal figures in deep learning development and a long time proponent of self-supervised learning. Jan spoke about what's missing in large language models and about his new joint embedding predictive architecture which may be a step toward filling that gap. He also talked about his theory of consciousness and the potential for AI systems to someday exhibit the features of consciousness. It's a fascinating conversation that I hope you'll enjoy. Okay, so Jan, it's great to see you again. I wanted to talk to you about where you've gone with so supervised learning since last week spoke. In particular, I'm interested in how it relates to large language models because the large language models really came on stream since we spoke. In fact, in your talk about JEPA, which is joint embedding predictive architecture. There you go. Thank you. You mentioned that large language models lack

## Building the Deep Lake store and storing the embeddings

In [21]:
from langchain.vectorstores import DeepLake
from langchain_google_genai.embeddings import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model='models/embedding-001')

In [22]:
my_activeloop_org_id = "samman"
my_activeloop_dataset_name = "langchain_course_youtube_summarizer"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

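# `embedding_function` still works here but is deprecated (see the warning below);
# the warning asks for `embedding=embeddings` instead.
db = DeepLake(dataset_path, embedding_function=embeddings)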

Using embedding function is deprecated and will be removed in the future. Please use embedding instead.


Your Deep Lake dataset has been successfully created!


 

In [23]:
db.add_documents(docs)

Creating 4 embeddings in 1 batches of size 4:: 100%|██████████| 1/1 [00:25<00:00, 25.11s/it]

Dataset(path='hub://samman/langchain_course_youtube_summarizer', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype     shape     dtype  compression
  -------    -------   -------   -------  ------- 
   text       text      (4, 1)     str     None   
 metadata     json      (4, 1)     str     None   
 embedding  embedding  (4, 768)  float32   None   
    id        text      (4, 1)     str     None   





['513d282f-b940-11ee-b60b-60189524c791',
 '513d2830-b940-11ee-98a7-60189524c791',
 '513d2831-b940-11ee-b082-60189524c791',
 '513d2832-b940-11ee-b8f9-60189524c791']

In [24]:
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['k'] = 4
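To show what a cosine distance metric with `k` results means, here is a small pure-Python sketch that ranks vectors by cosine similarity and keeps the top `k`. It only illustrates the metric; how Deep Lake performs the search internally is not covered here, and `cosine_similarity`, `top_k`, and the toy vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
nearest = top_k([1.0, 0.1], doc_vecs, k=2)
print(nearest)
```

Cosine similarity compares direction rather than length, so two chunks with similar content score highly even if their embedding magnitudes differ.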

In [25]:
from langchain.prompts import PromptTemplate
prompt_template = """Use the following pieces of transcripts from a video to answer the question in bullet points and summarized. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Summarized answer in bullet points:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [30]:
from langchain.chains import RetrievalQA

chain_type_kwargs = {'prompt': PROMPT}

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    chain_type_kwargs=chain_type_kwargs,
    retriever=retriever,
)

print(qa.run("Summarize the mentions of large language model"))

- Large language models are partially generative models, which predict the next word in a text.
- Generative models have difficulty representing uncertain predictions.
- Large language models lack a world model.
