# Summarize, translate, and edit a TED Talk from a YouTube video by asking the LLM
By [Lior Gazit](https://www.linkedin.com/in/liorgazit/)  

<a target="_blank" href="https://colab.research.google.com/github/LiorGazit/LLM_search_inside_youtube_videos/blob/main/Summarize_translate_and_edit_a_TedTalk_video_by_asking_the_LLM.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Description of the notebook:**  
Pick a YouTube video that you'd like to summarize and edit to your liking without spending the time to watch all of it.
In this notebook I picked one of the popular TED Talks, summarized it, translated it into Russian, and edited it some more.  

**Requirements:**  
* Open this notebook in a free [Google Colab instance](https://colab.research.google.com/github/LiorGazit/LLM_search_inside_youtube_videos/blob/main/Summarize_translate_and_edit_a_TedTalk_video_by_asking_the_LLM.ipynb).  
* This code uses OpenAI's API as its LLM, so a paid **API key** is necessary.   

Install:

In [None]:
%pip -q install youtube-transcript-api
%pip -q install openai
%pip -q install numpy
%pip -q install pytube
%pip -q install faiss-cpu
%pip -q install tiktoken
# textwrap is part of the Python standard library, so it needs no install.

Imports:

In [12]:
import os
from youtube_transcript_api import YouTubeTranscriptApi
import faiss
import numpy as np
import openai
import tiktoken
from urllib.parse import urlparse, parse_qs
import textwrap

#### Insert API Key

In [None]:
my_api_key = "..."

#### Save API Key to Environment Variable

In [5]:
os.environ["OPENAI_API_KEY"] = my_api_key
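
Hard-coding the key in a notebook cell makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reuses a key that is already in the environment and otherwise prompts for it without echoing the input. `load_api_key` and its parameters are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
import os
from getpass import getpass


def load_api_key(var_name="OPENAI_API_KEY", env=None, prompt=getpass):
    """Return the API key from the environment, prompting only if it is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value, so the key never shows up in cell output.
        key = prompt(f"Enter {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} was not provided")
        # Store it so libraries that read the environment can find it.
        env[var_name] = key
    return key
```

With this in place, `my_api_key = load_api_key()` could stand in for the two cells above.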

#### Pick the YouTube Video and Insert its URL

In [2]:
video_url = "https://www.youtube.com/watch?v=q-7zAkwAOYg&ab_channel=TEDxTalks"

#### Define functions:

In [None]:
# Extract video ID from URL
def extract_video_id(url):
    query = urlparse(url).query
    params = parse_qs(query)
    return params['v'][0]

# Fetch transcript using youtube-transcript-api
def get_transcript(video_url):
    video_id = extract_video_id(video_url)
    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
    text = ' '.join([t['text'] for t in transcript])
    return text

# Split transcript into chunks
def split_chunks(transcript, max_tokens=500):
    encoding = tiktoken.get_encoding("cl100k_base")
    words = transcript.split()
    chunks, current_chunk = [], []

    for word in words:
        current_chunk.append(word)
        if len(encoding.encode(' '.join(current_chunk))) > max_tokens:
            current_chunk.pop()
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks

# Get embeddings using updated OpenAI embeddings model
def get_embeddings(chunks, model="text-embedding-3-small"):
    embeddings = openai.embeddings.create(
        input=chunks,
        model=model
    )
    embeddings_list = [e.embedding for e in embeddings.data]
    return np.array(embeddings_list, dtype='float32')

# Build FAISS index
def build_index(embeddings):
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings)
    return index

# Similarity search
def search_chunks(question, chunks, index, top_k=3):
    query_embedding = openai.embeddings.create(
        input=[question],
        model="text-embedding-3-small"
    ).data[0].embedding
    query_embedding = np.array([query_embedding], dtype='float32')

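    # Never ask for more neighbours than there are chunks; FAISS pads missing results with -1
    top_k = min(top_k, len(chunks))
    _, indices = index.search(query_embedding, top_k)
    return [chunks[i] for i in indices[0] if i >= 0]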

# Query LLM with retrieved context using a GPT-4 model
def query_llm(prompt, model="gpt-4-turbo"):
    completion = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You answer questions based on video transcripts. Drop a new line after every sentence!"},
            {"role": "user", "content": prompt}
        ],
        temperature=0.5,
        max_tokens=1000
    )
    return completion.choices[0].message.content.strip()
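
`extract_video_id` above only handles `watch?v=...` links and raises a `KeyError` for anything else. The sketch below shows one way to also accept short `youtu.be` links and path-based `/embed/` and `/shorts/` links. `extract_video_id_any` is a hypothetical name, and the list of URL shapes is an assumption about what is worth supporting, not an exhaustive specification.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id_any(url):
    """Return the video ID from a few common YouTube URL shapes, or raise ValueError.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path_parts = [part for part in parsed.path.split("/") if part]

    # Short links: https://youtu.be/<id>
    if host == "youtu.be" and path_parts:
        return path_parts[0]

    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard watch links: https://www.youtube.com/watch?v=<id>
        params = parse_qs(parsed.query)
        if "v" in params:
            return params["v"][0]
        # Path-based links: /embed/<id> and /shorts/<id>
        if len(path_parts) >= 2 and path_parts[0] in {"embed", "shorts"}:
            return path_parts[1]

    raise ValueError(f"Could not find a video ID in: {url}")
```

Raising `ValueError` with the offending URL gives a clearer message than the bare `KeyError: 'v'` the original function would produce on an unsupported link.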

### Set Up the Retrieval Mechanism:

In [14]:
# Entire pipeline execution
def pipeline(video_url, question):
    print("--- Prompt ---\n")
    print(question)

    # Fetching transcript:
    transcript = get_transcript(video_url)

    # Splitting transcript into chunks:
    chunks = split_chunks(transcript)

    # Getting embeddings:
    embeddings = get_embeddings(chunks)

    # Building FAISS index:
    index = build_index(embeddings)

    # Searching relevant chunks:
    relevant_chunks = search_chunks(question, chunks, index)

    context = "\n\n".join(relevant_chunks)
    prompt = f"Context from video:\n\n{context}\n\nQuestion: {question}\nStart a new line after every sentence in your answer!"

    print("\n--- Answer ---\n")
    return query_llm(prompt)
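
To make the retrieval step concrete without calling any API, the sketch below reproduces in plain Python what a flat L2 index does on search: it ranks the stored vectors by squared Euclidean distance to the query and returns the positions of the closest `top_k`. The three-dimensional vectors and chunk labels are made-up stand-ins for real embeddings, and `l2_search` is a hypothetical name.

```python
def l2_search(query, vectors, top_k=3):
    """Brute-force nearest-neighbour search by squared L2 distance."""
    scored = []
    for position, vector in enumerate(vectors):
        distance = sum((q - v) ** 2 for q, v in zip(query, vector))
        scored.append((distance, position))
    scored.sort()  # smallest distance first
    return [position for _, position in scored[:top_k]]


# Made-up toy "embeddings" for illustration only.
toy_chunks = [
    "chunk about relationships",
    "chunk about wealth",
    "chunk about loneliness",
]
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.0, 0.1],
]
toy_query = [0.95, 0.05, 0.0]

nearest = l2_search(toy_query, toy_vectors, top_k=2)
print([toy_chunks[i] for i in nearest])
```

Because `pipeline` passes only the `top_k` closest chunks to the model, the answer is based on those chunks rather than on the full transcript.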

### Summarizing the content of the video and translating to different languages

In [15]:
question = """Please review the entire content, summarize it to the length of 4 sentence, then translate it to Russian.
Make sure the summary is consistent with the content.
Put the string '----' between the English part of the answer and the Russian part."""

original_answer = pipeline(video_url, question)

print(textwrap.fill(original_answer, width=50, replace_whitespace=True).replace("\\n ", "\n\n").replace("---- ", "\n\nRussian:\n"))

--- Prompt ---

Please review the entire content, summarize it to the length of 4 sentence, then translate it to Russian.
Make sure the summary is consistent with the content.
Put the string '----' between the English part of the answer and the Russian part.

--- Answer ---

The Harvard Study of Adult Development, spanning
75 years and involving 724 men, reveals that good
relationships are key to happiness and health, not
wealth or fame. The study found that social
connections improve physical health and increase
longevity, while loneliness has the opposite
effect. Quality, not quantity, of relationships
matters, with supportive environments proving
beneficial and high-conflict situations being
detrimental to health. The study’s strongest
predictor of a happy, healthy old age was
satisfaction in relationships at age 50. ----
Гарвардское исследование взрослой жизни,
продолжавшееся 75 лет и охватившее 724 мужчин,
показывает, что хорошие отношения являются ключом
к счастью и здоровью, а н
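
In the output above the `----` separator is still visible. This appears to happen because `textwrap.fill` places a line break right after it, so the `"---- "` pattern (with a trailing space) no longer matches. A minimal sketch of a more robust formatter splits on the separator first and wraps each part on its own. `format_bilingual` and its parameters are hypothetical names, and the sketch assumes the separator occurs at most once.

```python
import textwrap


def format_bilingual(answer, separator="----", width=50, second_label="Russian:"):
    """Wrap each language part separately so the separator is handled before wrapping.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    first, found, second = answer.partition(separator)
    blocks = [textwrap.fill(first.strip(), width=width)]
    if found:
        # Only add the labelled second block when the separator was present.
        wrapped_second = textwrap.fill(second.strip(), width=width)
        blocks.append(second_label + "\n" + wrapped_second)
    return "\n\n".join(blocks)
```

It could be used as `print(format_bilingual(original_answer))` in place of the chained `replace` calls.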

In [16]:
question = f"""This is the response from the previous prompt: <{original_answer}> 
Now take the Russian response and edit it into bullet points. 
Provide just the Russian bullet points."""

original_answer = pipeline(video_url, question)

print(original_answer)

--- Prompt ---

This is the response from the previous prompt: <The Harvard Study of Adult Development, spanning 75 years and involving 724 men, reveals that good relationships are key to happiness and health, not wealth or fame. The study found that social connections improve physical health and increase longevity, while loneliness has the opposite effect. Quality, not quantity, of relationships matters, with supportive environments proving beneficial and high-conflict situations being detrimental to health. The study’s strongest predictor of a happy, healthy old age was satisfaction in relationships at age 50.
----
Гарвардское исследование взрослой жизни, продолжавшееся 75 лет и охватившее 724 мужчин, показывает, что хорошие отношения являются ключом к счастью и здоровью, а не богатство или слава. Исследование показало, что социальные связи улучшают физическое здоровье и увеличивают продолжительность жизни, в то время как одиночество имеет противоположный эффект. Важно не количество 