<a href="https://colab.research.google.com/github/PrzemyslawCh/Youtube/blob/main/Summarizing_Long_Form_Video.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YouTube Video Processing: Transcription, Segmentation, and Summary

In this **notebook**, we present a comprehensive pipeline for processing YouTube videos. The process begins by extracting the transcript from a given YouTube video. Once we have the **transcript**, we break it into **segments** or '**chunks**' based on the video's **chapters**. Lastly, we aim to summarize each chunk to give readers a quick understanding of the content within each chapter. However, please note that the **summarization** functionality is currently under development and is not yet available.

Stay tuned for updates on the summarization feature. Meanwhile, let's dive into the transcription and segmentation of YouTube videos!

# Necessary functions and libraries (remember to run these cells)

In this cell, we provide functions for downloading transcripts of YouTube videos, breaking them into chunks based on provided chapters, and parsing chapter timestamps from video descriptions.

Here are brief descriptions of each function:

1. `download_transcript_plaintext(video_url)`: This function downloads the plaintext transcript of a given YouTube video. It takes the URL of the video as an input, extracts the video ID, requests the video page, parses the HTML to find the title, and uses the YouTubeTranscriptApi to get the transcript. It then writes the transcript to a file.

2. `download_transcript_withtimestamps(video_url)`: Similar to the first function, but it also includes timestamps with the transcript. The resulting transcript is downloaded as a text file.

3. `create_transcript_chunks(transcript, chapters, title)`: This function takes a transcript as returned by `YouTubeTranscriptApi.get_transcript` (a list of dictionaries with `text`, `start`, and `duration` keys), a list of chapter dictionaries with `timestamp` and `title` keys, and the title of the video. It groups the transcript lines by chapter, saves the resulting chunks in a new text file, and returns them.

4. `get_video_description(url, api_key)`: This function retrieves the video description from the YouTube Data API. It takes the video URL and an API key as input parameters, extracts the video ID from the URL, and returns the description stored in the video's snippet.

5. `parse_chapters(description)`: This function takes a YouTube video description and uses a regular expression to extract chapter timestamps and their corresponding titles. It returns them as a list of dictionaries with `timestamp` and `title` keys.

6. `extract_video_id_from_url(url)`: This function takes a YouTube URL as input and returns the unique video ID associated with the video. It parses the URL with `urllib.parse` and handles `youtu.be` short links as well as `/watch`, `/embed/`, and `/v/` paths on the YouTube domains. It raises an exception if no video ID can be extracted, which indicates that the input may not be a valid YouTube URL.
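
To make the data shapes concrete, here is a minimal, self-contained sketch of how chapter lines in a description could be turned into timestamps in seconds. It reuses the regular expression from `parse_chapters` and the conversion expression from `create_transcript_chunks`; the sample description and the helper name `timestamp_to_seconds` are assumptions made up for illustration.

```python
import re

# Hypothetical description text, for illustration only
sample_description = """Intro to the episode
0:00 Introduction
12:34 Main topic
1:02:03 Closing thoughts"""

# Same pattern as in parse_chapters: optional hours, then minutes:seconds, then a title
CHAPTER_PATTERN = r"((?:\d{1,2}:)?\d{1,2}:\d{2})\s(.+)"


def timestamp_to_seconds(timestamp):
    """Convert 'M:SS' or 'H:MM:SS' into a number of seconds (hypothetical helper)."""
    parts = reversed(timestamp.split(":"))
    return sum(int(part) * 60 ** i for i, part in enumerate(parts))


sample_chapters = [
    {"timestamp": ts, "title": title, "seconds": timestamp_to_seconds(ts)}
    for ts, title in re.findall(CHAPTER_PATTERN, sample_description)
]

for chapter in sample_chapters:
    print(chapter)
```

Lines without a timestamp, such as the first line of the sample, are ignored because the pattern does not match them.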


In [42]:
!pip install youtube-transcript-api

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [43]:
import re
import json
import requests
from bs4 import BeautifulSoup
from youtube_transcript_api import YouTubeTranscriptApi
from google.colab import files

In [44]:
def download_transcript_plaintext(video_url):
    # Extract the video ID from the URL (also handles extra query parameters and youtu.be links)
    video_id = extract_video_id_from_url(video_url)

    # Request the video page
    response = requests.get(video_url)

    # Parse the response
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the title of the video, dropping only the trailing ' - YouTube' suffix
    title = soup.find('title').string.rsplit(' - ', 1)[0]

    # Remove invalid characters from the title
    title = re.sub(r'[\/:*?"<>|]', '_', title)

    # Get the transcript
    data = YouTubeTranscriptApi.get_transcript(video_id)

    # Open the file in write mode
    with open(f'{title}.txt', 'w') as f:
        # Loop through each dictionary in the list
        for item in data:
            # Write the 'text' value followed by a newline
            f.write(item['text'] + '\n')

    print(f"Transcript of '{title}' has been saved.")
    return title  # Returning the title

def download_transcript_withtimestamps(video_url):
    # Extract the video ID from the URL (also handles extra query parameters and youtu.be links)
    video_id = extract_video_id_from_url(video_url)

    # Request the video page
    response = requests.get(video_url)

    # Parse the response
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the title of the video, dropping only the trailing ' - YouTube' suffix
    title = soup.find('title').string.rsplit(' - ', 1)[0]

    # Sanitize the title for the filename
    sanitized_title = re.sub(r'[\/:*?"<>|]', '_', title)

    # Get the transcript
    data = YouTubeTranscriptApi.get_transcript(video_id)

    # Create the file path in Colab's file system
    file_path = f'/content/{sanitized_title}.txt'

    # Open the file in write mode
    with open(file_path, 'w') as f:
        # Loop through each dictionary in the list
        for item in data:
            # Write the 'start' time, 'text' value followed by a newline
            f.write(f"{item['start']} - {item['text']}\n")

    print(f"Transcript of '{title}' has been saved.")

    # Download the file from Colab's file system
    files.download(file_path)

    return title  # Returning the title

def create_transcript_chunks(transcript, chapters, title):
    # Convert the transcript into a list of dictionaries with 'start', 'end', and 'text' keys
    transcript_list = [{'start': item['start'], 'end': item['start'] + item['duration'], 'text': item['text']} for item in transcript]

    # Work on copies so the caller's chapter list is not modified
    chapters = [dict(chapter) for chapter in chapters]

    # Convert the timestamps in chapters to seconds if they are strings
    for chapter in chapters:
        if isinstance(chapter['timestamp'], str):
            chapter['timestamp'] = sum(int(x) * 60 ** i for i, x in enumerate(reversed(chapter['timestamp'].split(":"))))

    # Add an end chapter that marks the end of the video
    chapters.append({'timestamp': transcript_list[-1]['end'], 'title': 'End of Video'})

    chunks = []
    chapter_index = 0
    chunk_text = ''

    # Iterate over the transcript
    for line in transcript_list:
        # If the current line is beyond the current chapter, save the chunk and move to the next chapter.
        # A loop is used so that chapters without any transcript lines are stepped over as well.
        while chapter_index + 2 < len(chapters) and line['start'] >= chapters[chapter_index + 1]['timestamp']:
            chunks.append({'chapter': chapters[chapter_index]['title'], 'text': chunk_text.strip()})
            chapter_index += 1
            chunk_text = ''

        # Add the current line to the chunk
        chunk_text += ' ' + line['text']

    # Save the last chapter, which the loop above never closes
    chunks.append({'chapter': chapters[chapter_index]['title'], 'text': chunk_text.strip()})

    # Sanitize the title for the filename
    sanitized_title = re.sub(r'[\/:*?"<>|]', '_', title)

    # Write the chunks to a new file
    with open(f'{sanitized_title}_chunks.txt', 'w') as f:
        for chunk in chunks:
            f.write(f"Chapter: {chunk['chapter']}\nText: {chunk['text']}\n\n")

    print(f"Chunks of '{title}' have been saved.")
    return chunks


def get_video_description(url, api_key):
    video_id = extract_video_id_from_url(url)
    url = f"https://www.googleapis.com/youtube/v3/videos?id={video_id}&key={api_key}&part=snippet"
    response = requests.get(url)
    data = response.json()
    description = data['items'][0]['snippet']['description']
    return description

def parse_chapters(description):
    pattern = r"((?:\d{1,2}:)?\d{1,2}:\d{2})\s(.+)"
    matches = re.findall(pattern, description)
    chapters = [{"timestamp": match[0], "title": match[1]} for match in matches]
    return chapters

import urllib.parse as urlparse

def extract_video_id_from_url(url):
    """
    Extracts the YouTube video ID from a YouTube URL.
    """
    # Parse the URL
    parsed_url = urlparse.urlparse(url)

    if parsed_url.netloc == "youtu.be":
        # For shortened youtu.be links, the video ID is the path
        return parsed_url.path[1:]

    if parsed_url.netloc in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        if parsed_url.path == "/watch":
            # For /watch URLs, get the v query parameter
            return urlparse.parse_qs(parsed_url.query)["v"][0]

        if parsed_url.path[:7] == "/embed/":
            # For /embed/ URLs, the video ID is the path
            return parsed_url.path.split("/")[2]

        if parsed_url.path[:3] == "/v/":
            # For /v/ URLs, the video ID is the path
            return parsed_url.path.split("/")[2]

    # If all else fails, raise an exception
    raise Exception(f"Unable to parse video ID from URL: {url}")



# Run, Forrest, run...

1. Paste your API key for the YouTube Data API (youtube.googleapis.com).
2. Paste your YouTube video URL.
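
If the API key is rejected or the video cannot be found, the response contains no usable `items` entry, and `get_video_description` fails with a `KeyError` or an `IndexError`. The following is a minimal sketch of a more explicit check on the parsed response. The helper name `description_from_response`, the assumed error shape (`{"error": {"message": ...}}`), and the sample dictionaries are illustrative assumptions, not part of the functions above.

```python
def description_from_response(data):
    """Return the description from a parsed videos response (hypothetical helper).

    `data` is the dictionary produced by `response.json()`.
    """
    if "error" in data:
        # Assumed error shape: {"error": {"message": "..."}}
        message = data["error"].get("message", "unknown error")
        raise ValueError(f"YouTube Data API error: {message}")

    items = data.get("items", [])
    if not items:
        # An empty list usually means the video ID did not match any video
        raise ValueError("No video found for this ID; check the URL.")

    return items[0]["snippet"]["description"]


# Hand-written sample responses, for illustration only
ok_response = {"items": [{"snippet": {"description": "0:00 Intro\n1:30 Topic"}}]}
empty_response = {"items": []}

print(description_from_response(ok_response))

try:
    description_from_response(empty_response)
except ValueError as err:
    print(f"Problem: {err}")
```

Raising a `ValueError` with a readable message makes it easier to tell a bad key from a bad URL before the chapter parsing starts.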


In [None]:
#@title Let's Roll
api_key = ""  #@param {type: "string"}
url = ""  #@param {type: "string"}

# Get description and chapters
description = get_video_description(url, api_key)
chapters = parse_chapters(description)

# parse_chapters already returns timestamps as 'M:SS' or 'H:MM:SS' strings;
# create_transcript_chunks converts them to seconds, so no adjustment is needed here

# Fetch the transcript, save a timestamped copy, and split it into chapter chunks
transcript = YouTubeTranscriptApi.get_transcript(extract_video_id_from_url(url))
title = download_transcript_withtimestamps(url)
chunks = create_transcript_chunks(transcript, chapters, title)

# Auto Summarizing (Not Finished)


## Splitting transcript for long prompt
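
The function in the next cell splits the text by character count and wraps each piece in part markers. As a minimal sketch of the underlying slicing alone, without the prompt wrappers, the same ceiling division can be written as follows. The function name `split_fixed` and the sample string are assumptions chosen for illustration.

```python
def split_fixed(text, max_length):
    """Split text into consecutive pieces of at most max_length characters."""
    if max_length <= 0:
        raise ValueError("Max length must be greater than 0.")

    # Ceiling division: -(-a // b) rounds up without importing math
    num_parts = -(-len(text) // max_length)

    # Slicing past the end of a string is safe, so the last piece is simply shorter
    return [text[i * max_length:(i + 1) * max_length] for i in range(num_parts)]


pieces = split_fixed("abcdefghij", 4)
print(pieces)
```

Because the split is purely positional, a piece can end in the middle of a word or sentence.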

In [None]:
def split_prompt_from_file(file_path, max_length):
    with open(file_path, 'r') as file:
        text = file.read()

    if max_length <= 0:
        raise ValueError("Max length must be greater than 0.")

    num_parts = -(-len(text) // max_length)
    split_text = []

    for i in range(num_parts):
        start = i * max_length
        end = min((i + 1) * max_length, len(text))

        part = f'[START PART {i + 1}/{num_parts}]\n' + text[start:end] + f'\n[END PART {i + 1}/{num_parts}]'

        if i == num_parts - 1:
            # The last part (or the only part) tells the model that it can start processing
            chunk = part + '\nALL PARTS SENT. Now you can continue processing the request.'
        elif i == 0:
            chunk = f'Do not answer yet. This is the first part of the text I want to send you. Just receive and acknowledge as "Part {i + 1}/{num_parts} received" and wait for the next part.\n' + part
        else:
            chunk = f'Do not answer yet. This is just another part of the text I want to send you. Just receive and acknowledge as "Part {i + 1}/{num_parts} received" and wait for the next part.\n' + part

        split_text.append(chunk)

    return split_text



In [None]:
file_path = '/content/Science-Based Mental Training & Visualization for Improved Learning | Huberman Lab Podcast.txt'  # replace with your file's path
max_length = 18000  # Maximum chunk size

split_text = split_prompt_from_file(file_path, max_length)

for i, chunk in enumerate(split_text):
    print(f"Chunk {i+1}:\n{chunk}\n")


## Testing API call structure

In [None]:
import json
import openai

def print_api_call_representation(split_text):
    messages = []

    # Add initial system message
    system_message_content = f"""
The total length of the content that I want to send you is too large to send in only one piece.

For sending you that content, I will follow this rule:

[START PART 1/{len(split_text)}]

[END PART 1/{len(split_text)}]

Then you just answer: "Received part 1/{len(split_text)}"

And when I tell you "ALL PARTS SENT", then you can continue processing the data and answering my requests.
"""

    messages.append({
        "role": "system",
        "content": system_message_content
    })

    for i, chunk in enumerate(split_text, start=1):
        # Add user's message
        messages.append({
            "role": "user",
            "content": chunk
        })

        # Add assistant's message acknowledging the receipt of the part
        if i != len(split_text):  # If not the last part
            part_acknowledgment = f"Part {i}/{len(split_text)} received. I will wait for the next part."
        else:  # If the last part
            part_acknowledgment = f"Part {i}/{len(split_text)} received. Thank you for providing the complete text. How can I assist you further?"

        messages.append({
            "role": "assistant",
            "content": part_acknowledgment
        })

    # Print the API call representation
    print("openai.ChatCompletion.create(")
    print("  model='gpt-3.5-turbo',")
    print("  messages=[")
    for message in messages:
        print("    {")
        print(f"      'role': '{message['role']}',")
        print(f"      'content': '''{message['content']}'''")
        print("    },")
    print("  ]")
    print(")")

# Usage:

file_path = '/content/Science-Based Mental Training & Visualization for Improved Learning | Huberman Lab Podcast.txt'  # replace with your file's path
max_length = 15000  # Maximum chunk size

# Split the text into chunks
split_text = split_prompt_from_file(file_path, max_length)

# Print the API call representation
print_api_call_representation(split_text)



openai.ChatCompletion.create(
  model='gpt-3.5-turbo',
  messages=[
    {
      'role': 'system',
      'content': '''
The total length of the content that I want to send you is too large to send in only one piece.
        
For sending you that content, I will follow this rule:
        
[START PART 1/9]

[END PART 1/9]
        
Then you just answer: "Received part 1/9"
        
And when I tell you "ALL PARTS SENT", then you can continue processing the data and answering my requests.
'''
    },
    {
      'role': 'user',
      'content': '''Do not answer yet. This is the first part of the text I want to send you. Just receive and acknowledge as "Part 1/9 received" and wait for the next part.
[START PART 1/9]
welcome to the huberman Lab podcast
where we discuss science and
science-based tools for everyday life
I'm Andrew huberman and I'm a professor
of neurobiology and Ophthalmology at
Stanford school of medicine today we are
discussing mental training and
visualization
mental training 

## Tiktoken


In [None]:
!pip install --upgrade tiktoken


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tiktoken
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.4.0


In [None]:
import tiktoken

def num_tokens_from_file(file_path, encoding_name):
    """Returns the number of tokens in a text file."""
    encoding = tiktoken.get_encoding(encoding_name)
    with open(file_path, 'r') as file:
        content = file.read()
    num_tokens = len(encoding.encode(content))
    return num_tokens

file_path = "/content/Science-Based Mental Training & Visualization for Improved Learning | Huberman Lab Podcast.txt"
encoding_name = "cl100k_base"
num_tokens = num_tokens_from_file(file_path, encoding_name)
print("Number of tokens:", num_tokens)


Number of tokens: 27842
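
A count of 27,842 tokens is why the transcript has to be sent in several parts. As a rough sketch, the number of requests can be estimated from a per-request token budget. The budget of 3,000 tokens and the helper name `estimate_num_requests` are assumptions chosen for illustration, not limits taken from this notebook.

```python
import math


def estimate_num_requests(num_tokens, tokens_per_request):
    """Estimate how many requests are needed for a given token budget (hypothetical helper)."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be greater than 0.")
    return math.ceil(num_tokens / tokens_per_request)


num_tokens = 27842      # token count reported above
assumed_budget = 3000   # assumed per-request budget, for illustration only

print(estimate_num_requests(num_tokens, assumed_budget))  # 10
```

Note that `split_prompt_from_file` counts characters rather than tokens, so its `max_length` and a token budget are not interchangeable.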
