# Summarize, translate, and edit a TED Talk from a YouTube video by asking the LLM
By [Lior Gazit](https://www.linkedin.com/in/liorgazit/)  

<a target="_blank" href="https://colab.research.google.com/github/LiorGazit/LLM_search_inside_youtube_videos/blob/main/Summarize_translate_and_edit_a_TedTalk_video_by_asking_the_LLM.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Description of the notebook:**  
Pick a YouTube video that you'd like to summarize and edit to your liking without spending the time to watch all of it.
In this notebook I picked one of the popular TED Talks, summarized it, translated the summary to Russian, and then edited the result further.  

**Requirements:**  
* Open this notebook in a free [Google Colab instance](https://colab.research.google.com/github/LiorGazit/LLM_search_inside_youtube_videos/blob/main/Summarize_translate_and_edit_a_TedTalk_video_by_asking_the_LLM.ipynb).  
* This code uses OpenAI's API as its LLM, so a paid **API key** is required.   

Install:

In [None]:
%pip -q install youtube-transcript-api
%pip -q install openai
%pip -q install numpy
%pip -q install pytube
%pip -q install faiss-cpu
%pip -q install tiktoken
# textwrap is part of Python's standard library, so it needs no install

Imports:

In [1]:
import os
import textwrap
from youtube_transcript_api import YouTubeTranscriptApi
import faiss
import numpy as np
import openai
import tiktoken
from urllib.parse import urlparse, parse_qs

#### Insert API Key

In [None]:
my_api_key = "..."

#### Save API Key to an Environment Variable

In [5]:
os.environ["OPENAI_API_KEY"] = my_api_key
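Hard-coding the key in a cell makes it easy to leak when the notebook is shared. As a hedged alternative, the sketch below prompts for the key only when the environment variable is not already set. The helper name `ensure_api_key` and its `prompt_fn` parameter are hypothetical and not part of the notebook.

```python
import os
from getpass import getpass

def ensure_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Hypothetical helper: set env_var from a hidden prompt unless it is already set."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed characters, so the key never appears in the cell output
        key = prompt_fn(f"Enter {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} was left empty")
        os.environ[env_var] = key
    return key
```

Calling `ensure_api_key()` once could stand in for the two cells above, since the `openai` client reads `OPENAI_API_KEY` from the environment.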

#### Pick the YouTube Video and Insert its URL

In [2]:
video_url = "https://www.youtube.com/watch?v=q-7zAkwAOYg&ab_channel=TEDxTalks"
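The URL above is a standard watch link that carries the video ID in its `v` query parameter. Links copied from the share button or from an embed snippet place the ID in the path instead. The following is a minimal sketch of a more tolerant parser that uses only the standard library. The function name `extract_video_id_any` and the set of URL shapes it handles are assumptions for illustration, not part of the notebook.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id_any(url):
    """Hypothetical helper: return the video ID from common YouTube URL shapes, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        # Standard links: https://www.youtube.com/watch?v=<id>
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        # Embed and Shorts links: /embed/<id> or /shorts/<id>
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in ("embed", "shorts"):
            return parts[1]
    return None

print(extract_video_id_any("https://www.youtube.com/watch?v=q-7zAkwAOYg&ab_channel=TEDxTalks"))
```

Returning `None` for an unrecognized URL lets the caller decide how to report the problem, instead of failing with a `KeyError`.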

#### Define functions:

In [6]:
# Extract video ID from URL
def extract_video_id(url):
    query = urlparse(url).query
    params = parse_qs(query)
    return params['v'][0]

# Fetch transcript using youtube-transcript-api
def get_transcript(video_url):
    video_id = extract_video_id(video_url)
    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
    text = ' '.join([t['text'] for t in transcript])
    return text

# Split transcript into chunks
def split_chunks(transcript, max_tokens=500):
    encoding = tiktoken.get_encoding("cl100k_base")
    words = transcript.split()
    chunks, current_chunk = [], []

    for word in words:
        current_chunk.append(word)
        if len(encoding.encode(' '.join(current_chunk))) > max_tokens:
            current_chunk.pop()
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks

# Get embeddings using updated OpenAI embeddings model
def get_embeddings(chunks, model="text-embedding-3-small"):
    embeddings = openai.embeddings.create(
        input=chunks,
        model=model
    )
    embeddings_list = [e.embedding for e in embeddings.data]
    return np.array(embeddings_list, dtype='float32')

# Build FAISS index
def build_index(embeddings):
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings)
    return index

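# Similarity search: embed the question and return the top_k nearest chunks
def search_chunks(question, chunks, index, top_k=3, model="text-embedding-3-small"):
    # The question must be embedded with the same model that embedded the chunks
    query_embedding = openai.embeddings.create(
        input=[question],
        model=model
    ).data[0].embedding
    query_embedding = np.array([query_embedding], dtype='float32')

    # Cap top_k at the number of chunks; FAISS pads missing results with -1
    top_k = min(top_k, len(chunks))
    _, indices = index.search(query_embedding, top_k)
    return [chunks[i] for i in indices[0]]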

# Query the LLM (gpt-4-turbo by default); the prompt should already contain any retrieved context
def query_llm(prompt, model="gpt-4-turbo"):
    completion = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You answer questions based on video transcripts. Drop a new line after every sentence!"},
            {"role": "user", "content": prompt}
        ],
        temperature=0.5,
        max_tokens=1000
    )
    return completion.choices[0].message.content.strip()
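`build_index` and `search_chunks` rely on `faiss.IndexFlatL2`, which performs an exact, brute-force search by squared Euclidean distance. To make that step concrete without calling any API, here is a pure-Python sketch of the same idea on made-up 3-dimensional vectors. `l2_search` and the toy data are illustrative assumptions; real embeddings have far more dimensions.

```python
def l2_search(query, vectors, top_k=3):
    """Return indices of the top_k vectors closest to query by squared L2 distance."""
    distances = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    distances.sort()  # smallest distance first; ties broken by lower index
    return [i for _, i in distances[:top_k]]

# Toy stand-ins for transcript chunks and their embeddings (made-up values)
toy_chunks = ["about relationships", "about retirement", "about screens"]
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]

nearest = l2_search([1.0, 0.0, 0.0], toy_vectors, top_k=2)
print([toy_chunks[i] for i in nearest])
```

The query matches the first vector exactly and sits close to the third, so those two chunks are returned, in that order.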

In [10]:
query_llm("Complete the sentence in a correct syntactical manner: 'You is ...'", model="gpt-4-turbo")

'You are ...'

### ~~Set Up the Retrieval Mechanism:~~

In [None]:
# lecture_RAG = App(db_config=ChromaDbConfig(allow_reset=True))
# lecture_RAG.reset()
# lecture_RAG.add(data_type="youtube_video", source=video_url)



Inserting batches in chromadb: 100%|██████████| 1/1 [00:00<00:00,  1.12it/s]

Successfully saved https://www.youtube.com/watch?v=8KkKuTCFvzI&ab_channel=TED (DataType.YOUTUBE_VIDEO). New chunks count: 5





'6d9ce5a14285fef40a8afb5268a273ef'

In [10]:
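chunks = split_chunks(get_transcript(video_url))  # fetch the transcript and split it into chunks
index = build_index(get_embeddings(chunks))  # embed the chunks and index them with FAISS
request = """Please review the entire content, summarize it to the length of 4 sentences, then translate it to Russian.
Make sure the summary is consistent with the content.
Put the string '----' between the English part of the answer and the Russian part."""
context = "\n".join(search_chunks(request, chunks, index))  # chunks closest to the request
original_answer = query_llm(f"Transcript excerpts:\n{context}\n\nRequest: {request}")
print(textwrap.fill(original_answer, width=50, replace_whitespace=True).replace("\\n ", "\n\n").replace("---- ", "\n\nRussian:\n"))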

The speaker discusses the importance of
relationships in leading a good life. They
emphasize that relationships require hard work and
are lifelong commitments. The happiest retirees
were those who actively sought new relationships
after leaving work. The speaker encourages
listeners to prioritize relationships at any age,
suggesting actions such as spending less time on
screens and more time with people, trying new
activities with loved ones, and reconciling with
estranged family members. Mark Twain's quote
reinforces the idea that life is too short for
conflicts and that love and relationships are what
truly matter.   

Russian:
Говорящий обсуждает важность
отношений для ведения хорошей жизни. Он
подчеркивает, что отношения требуют усилий и
являются пожизненными обязательствами. Самые
счастливые пенсионеры - это те, кто активно искал
новые отношения после ухода с работы. Говорящий
призывает слушателей приоритезировать отношения в
любом возрасте, предлагая такие действия, как
меньше вр
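The prompt asks the model to put `----` between the English and Russian parts, and the `print` call swaps that marker for a heading. If the two parts are needed separately, for example to pass only the Russian text to the next prompt, a sketch like the following could work. `split_bilingual_answer` is a hypothetical helper, and it assumes the model followed the instruction and emitted the separator once.

```python
def split_bilingual_answer(answer, separator="----"):
    """Hypothetical helper: split an answer into (english, russian) around the separator."""
    english, found, russian = answer.partition(separator)
    if not found:
        # The model ignored the instruction; return everything as the first part
        return answer.strip(), ""
    return english.strip(), russian.strip()

# Made-up sample answer, not real model output
sample = "Relationships matter. ---- Отношения важны."
english_part, russian_part = split_bilingual_answer(sample)
print(english_part)
print(russian_part)
```

Because `str.partition` splits at the first occurrence only, any later `----` in the Russian part would be kept as text.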

In [11]:
print(query_llm(f"This is the response from the previous prompt: <{original_answer}> Now take the Russian response and edit it into bullet points. Provide just the Russian bullet points."))

- Говорящий подчеркивает важность отношений для ведения хорошей жизни
- Отношения требуют усилий и являются пожизненными обязательствами
- Самые счастливые пенсионеры - это те, кто активно искал новые отношения после ухода с работы
- Говорящий призывает слушателей приоритезировать отношения в любом возрасте
- Предлагает действия, такие как меньше времени на экранах и больше времени с людьми, попробовать новые активности с близкими, и восстановить отношения с отдаленными членами семьи
- Цитата Марка Твена подчеркивает идею, что жизнь слишком коротка для конфликтов и что любовь и отношения - это то, что действительно важно.
