# YouTube Transcript API Markdown Formatter

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>

## Description:

This app fetches YouTube video transcripts, formats them into Markdown, and enhances readability by identifying speaker boundaries. It uses AI to clean up and structure the transcript, making it easy to read and share. Ideal for content creators, researchers, and educators needing organized video transcriptions.

## Step 1: Define YouTube Video URL Template

### Purpose

Define a reusable URL template for constructing YouTube video links by replacing a placeholder with a video ID.

### Details

- **Constant Variable:**  
  `WATCH_URL` is a string template that stays unchanged.

- **URL Template:**  
  The URL format is `https://www.youtube.com/watch?v={video_id}`, where `{video_id}` is a placeholder.

- **Dynamic Placeholder:**  
  `{video_id}` can be replaced using Python's string formatting to generate video URLs.

### Usage

This template allows dynamic generation of YouTube video URLs using a specific video ID.



In [5]:
WATCH_URL = "https://www.youtube.com/watch?v={video_id}"
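
A minimal usage sketch of the template, filling the placeholder with `str.format`. The video ID `abc123XYZ_0` is a made-up value used only for illustration.

```python
WATCH_URL = "https://www.youtube.com/watch?v={video_id}"

# "abc123XYZ_0" is a hypothetical ID, not a real video.
video_url = WATCH_URL.format(video_id="abc123XYZ_0")
print(video_url)
```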

## Step 2: Define Custom Exception Hierarchy for YouTube Transcript Retrieval Errors

The code defines a custom exception hierarchy for the errors that can occur when retrieving YouTube video transcripts. The error messages point users to the issue tracker of the `youtube-transcript-api` library.

### `CouldNotRetrieveTranscript Class (Base Class)`

- **Purpose**: This is the base exception class that is raised when a transcript for a YouTube video cannot be retrieved.
  
- **Attributes**:
  
  - `ERROR_MESSAGE`: A template for the base error message, including the video URL.
  
  - `CAUSE_MESSAGE_INTRO`: A template for introducing the specific cause of the error.
  
  - `CAUSE_MESSAGE`: A placeholder for the specific cause of the error, to be populated in derived classes.
  
  - `GITHUB_REFERRAL`: A message that encourages users to report issues on GitHub if they believe the cause described doesn't apply to their issue.

- **Methods**:
  
  - `__init__(self, video_id)`: Stores the video ID, builds the error message, and passes it to the parent `Exception` constructor.
  
  - `_build_error_message(self)`: Fills the video URL into `ERROR_MESSAGE`. If a cause is set, it also appends the cause introduction and the GitHub referral message.
  
  - `cause`: A property that returns `CAUSE_MESSAGE`; derived classes override it when the message needs formatting.

### `YouTubeRequestFailed Class (Derived Class)`

- **Purpose**: Raised when a request to YouTube fails, usually due to an HTTP error.
  
- **Attributes**:
  
  - `reason`: Stores the HTTP error message as a string.

- **Methods**:
  
  - `cause`: Returns the error message with the specific HTTP error reason.

### `VideoUnavailable Class (Derived Class)`

- **Purpose**: Raised when the YouTube video is no longer available (e.g., deleted or made private).
  
- **Cause**: A fixed message: "The video is no longer available."

### `InvalidVideoId Class (Derived Class)`

- **Purpose**: Raised when an invalid video ID is provided (e.g., the full URL is given instead of just the video ID).
  
- **Cause**: Provides instructions on how to correctly use the video ID.

### `TooManyRequests Class (Derived Class)`

- **Purpose**: Raised when YouTube receives too many requests from the same IP address and requires CAPTCHA solving.
  
- **Cause**: Provides suggestions to work around this issue (e.g., manually solving the CAPTCHA, using a different IP, or waiting until the IP ban is lifted).

### `TranscriptsDisabled Class (Derived Class)`

- **Purpose**: Raised when subtitles (transcripts) are disabled for a video.
  
- **Cause**: "Subtitles are disabled for this video."

### `NoTranscriptAvailable Class (Derived Class)`

- **Purpose**: Raised when no transcripts are available for a video.
  
- **Cause**: "No transcripts are available for this video."

### `NotTranslatable Class (Derived Class)`

- **Purpose**: Raised when `translate()` is called on a transcript that offers no translation languages.
  
- **Cause**: "The requested language is not translatable."

### `TranslationLanguageNotAvailable Class (Derived Class)`

- **Purpose**: Raised when the requested translation language for the transcript is not available.
  
- **Cause**: "The requested translation language is not available."

### `CookiePathInvalid Class (Derived Class)`

- **Purpose**: Raised when the path to the provided cookie file is invalid.
  
- **Cause**: "The provided cookie file was unable to be loaded."

### `CookiesInvalid Class (Derived Class)`

- **Purpose**: Raised when the provided cookies are invalid or expired.
  
- **Cause**: "The cookies provided are not valid (may have expired)."

### `FailedToCreateConsentCookie Class (Derived Class)`

- **Purpose**: Raised when the consent cookie needed to get past YouTube's cookie-consent page cannot be created automatically.
  
- **Cause**: "Failed to automatically give consent to saving cookies."

### `NoTranscriptFound Class (Derived Class)`

- **Purpose**: Raised when no transcript is found for any of the requested language codes.

- **Attributes**:
  
  - `_requested_language_codes`: Stores the list of language codes requested.
  
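  - `_transcript_data`: Stores the object describing the transcripts that do exist. In this notebook it is the `TranscriptList`, whose string form lists the available languages.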

- **Methods**:
  
  - `cause`: Returns the error message with the requested language codes and transcript data.


In [12]:
class CouldNotRetrieveTranscript(Exception):
    """
    Raised if a transcript could not be retrieved.
    """

    ERROR_MESSAGE = "\nCould not retrieve a transcript for the video {video_url}!"
    CAUSE_MESSAGE_INTRO = " This is most likely caused by:\n\n{cause}"
    CAUSE_MESSAGE = ""
    GITHUB_REFERRAL = (
        "\n\nIf you are sure that the described cause is not responsible for this error "
        "and that a transcript should be retrievable, please create an issue at "
        "https://github.com/jdepoix/youtube-transcript-api/issues. "
        "Please add which version of youtube_transcript_api you are using "
        "and provide the information needed to replicate the error. "
        "Also make sure that there are no open issues which already describe your problem!"
    )

    def __init__(self, video_id):
        self.video_id = video_id
        super(CouldNotRetrieveTranscript, self).__init__(self._build_error_message())

    def _build_error_message(self):
        cause = self.cause
        error_message = self.ERROR_MESSAGE.format(
            video_url=WATCH_URL.format(video_id=self.video_id)
        )

        if cause:
            error_message += (
                self.CAUSE_MESSAGE_INTRO.format(cause=cause) + self.GITHUB_REFERRAL
            )

        return error_message

    @property
    def cause(self):
        return self.CAUSE_MESSAGE


class YouTubeRequestFailed(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "Request to YouTube failed: {reason}"

    def __init__(self, video_id, http_error):
        self.reason = str(http_error)
        super(YouTubeRequestFailed, self).__init__(video_id)

    @property
    def cause(self):
        return self.CAUSE_MESSAGE.format(
            reason=self.reason,
        )


class VideoUnavailable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The video is no longer available"


class InvalidVideoId(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = (
        "You provided an invalid video id. Make sure you are using the video id and NOT the url!\n\n"
        'Do NOT run: `YouTubeTranscriptApi.get_transcript("https://www.youtube.com/watch?v=1234")`\n'
        'Instead run: `YouTubeTranscriptApi.get_transcript("1234")`'
    )


class TooManyRequests(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = (
        "YouTube is receiving too many requests from this IP and now requires solving a captcha to continue. "
        "One of the following things can be done to work around this:\n\
        - Manually solve the captcha in a browser and export the cookie. "
        "Read here how to use that cookie with "
        "youtube-transcript-api: https://github.com/jdepoix/youtube-transcript-api#cookies\n\
        - Use a different IP address\n\
        - Wait until the ban on your IP has been lifted"
    )


class TranscriptsDisabled(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "Subtitles are disabled for this video"


class NoTranscriptAvailable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "No transcripts are available for this video"


class NotTranslatable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The requested language is not translatable"


class TranslationLanguageNotAvailable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The requested translation language is not available"


class CookiePathInvalid(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The provided cookie file was unable to be loaded"


class CookiesInvalid(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The cookies provided are not valid (may have expired)"


class FailedToCreateConsentCookie(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "Failed to automatically give consent to saving cookies"


class NoTranscriptFound(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = (
        "No transcripts were found for any of the requested language codes: {requested_language_codes}\n\n"
        "{transcript_data}"
    )

    def __init__(self, video_id, requested_language_codes, transcript_data):
        self._requested_language_codes = requested_language_codes
        self._transcript_data = transcript_data
        super(NoTranscriptFound, self).__init__(video_id)

    @property
    def cause(self):
        return self.CAUSE_MESSAGE.format(
            requested_language_codes=self._requested_language_codes,
            transcript_data=str(self._transcript_data),
        )
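
The pattern above can be sketched in a condensed form. `TranscriptError` and `RequestFailed` are simplified stand-ins invented for this illustration (they are not the notebook's classes), and the message wording is shortened. The sketch shows why a subclass must set its attributes before calling the parent constructor, and that catching the base class covers every derived error.

```python
WATCH_URL = "https://www.youtube.com/watch?v={video_id}"


class TranscriptError(Exception):
    """Simplified stand-in for the base exception (illustration only)."""

    ERROR_MESSAGE = "Could not retrieve a transcript for the video {video_url}!"
    CAUSE_MESSAGE = ""

    def __init__(self, video_id):
        self.video_id = video_id
        super().__init__(self._build_error_message())

    def _build_error_message(self):
        message = self.ERROR_MESSAGE.format(
            video_url=WATCH_URL.format(video_id=self.video_id)
        )
        # Only append the cause when a subclass supplies one.
        if self.cause:
            message += " Cause: " + self.cause
        return message

    @property
    def cause(self):
        return self.CAUSE_MESSAGE


class RequestFailed(TranscriptError):
    """Stand-in for a subclass whose cause is formatted per instance."""

    CAUSE_MESSAGE = "Request to YouTube failed: {reason}"

    def __init__(self, video_id, http_error):
        # Set the attribute first: the parent constructor reads `self.cause`
        # while it builds the message.
        self.reason = str(http_error)
        super().__init__(video_id)

    @property
    def cause(self):
        return self.CAUSE_MESSAGE.format(reason=self.reason)


try:
    raise RequestFailed("abc123", "404 Client Error")
except TranscriptError as error:  # the base class catches every subclass
    caught_message = str(error)

print(caught_message)
```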

## Step 3: Handle HTML Unescaping for Different Python Versions

This code ensures compatibility for unescaping HTML entities across different Python versions (Python 2 and 3). 

It uses conditional imports and function definitions based on the Python version to provide a consistent unescape function across environments. 

Here's a breakdown:

- **Python 3.4+**: Imports `unescape` directly from the `html` module.

- **Python 2**: Imports `HTMLParser` and creates an `html_parser` instance to handle unescaping.

- **Python 3.0 - 3.3**: Uses `html.parser` to create an `html_parser` instance to handle unescaping.

- **Custom `unescape` function**: For every version other than 3.4+, a small `unescape` wrapper is defined that delegates to `html_parser.unescape`.

The `# pragma: no cover` comments tell coverage tools such as coverage.py to skip these branches, because each branch can only run under a specific Python version.


In [13]:
import sys


# This can only be tested by using different python versions, therefore it is not covered by coverage.py
if sys.version_info.major == 3 and sys.version_info.minor >= 4:  # pragma: no cover
    # Python 3.4+
    from html import unescape
else:  # pragma: no cover
    if sys.version_info.major <= 2:
        # Python 2
        import HTMLParser  # type: ignore

        html_parser = HTMLParser.HTMLParser()
    else:
        # Python 3.0 - 3.3
        import html.parser

        html_parser = html.parser.HTMLParser()

    def unescape(string):
        return html_parser.unescape(string)
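
A short illustration of what `unescape` does on a modern interpreter (the Python 3.4+ branch). The sample string is made up for this example.

```python
from html import unescape

# Hypothetical caption text containing HTML entities.
raw = "Tom &amp; Jerry &lt;3 &quot;cheese&quot; &#39;n&#39; more"
clean = unescape(raw)
print(clean)
```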

## Step 4: Fetch and Parse YouTube Video Transcripts

#### Key Classes:

- **TranscriptListFetcher**:

  - Fetches available transcripts from YouTube.

  - Handles errors like invalid video IDs and unavailable captions.

- **TranscriptList**:

  - Represents a collection of transcripts.

  - Supports finding specific languages and types (manual/auto).

- **Transcript**:

  - Represents a single transcript.

  - Can fetch data and translate if needed.

- **_TranscriptParser**:

  - Parses and cleans up transcript data.

  - Extracts text, start time, and duration.
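
The parsing step can be sketched as follows. This is an illustration under assumptions: the XML sample is invented, and the standard library's `xml.etree.ElementTree` is used as a stand-in for `defusedxml`, which the notebook imports to parse untrusted XML more safely.

```python
import re
from html import unescape
from xml.etree import ElementTree  # stand-in; the notebook uses defusedxml

TAG_REGEX = re.compile(r"<[^>]*>")

# Invented sample in the <text start=... dur=...> shape the parser expects.
SAMPLE_XML = (
    "<transcript>"
    '<text start="0.0" dur="1.5">Hello &amp;amp; welcome</text>'
    '<text start="1.5">to the &lt;i&gt;show&lt;/i&gt;</text>'
    '<text start="3.0" dur="2.0"></text>'
    "</transcript>"
)


def parse_transcript(xml_data):
    """Return one dict per non-empty <text> element."""
    return [
        {
            # Unescape leftover entities, then strip any HTML tags.
            "text": TAG_REGEX.sub("", unescape(element.text)),
            "start": float(element.attrib["start"]),
            # A missing "dur" attribute falls back to 0.0.
            "duration": float(element.attrib.get("dur", "0.0")),
        }
        for element in ElementTree.fromstring(xml_data)
        if element.text is not None  # skip empty cues
    ]


snippets = parse_transcript(SAMPLE_XML)
for snippet in snippets:
    print(snippet)
```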

#### Key Functions:

- **_raise_http_errors**:

  - Handles HTTP errors and raises custom exceptions.

- **_extract_captions_json**:

  - Extracts captions data from video HTML.


- **_create_consent_cookie**:

  - Sets the `CONSENT` cookie needed to get past YouTube's cookie-consent page.

- **_fetch_video_html**:

  - Fetches video HTML content and handles consent dialogs.

- **_find_transcript**:

  - Searches for a transcript in a dictionary based on language codes.
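
The lookup order used by `_find_transcript` is worth spelling out: the language list is the outer loop, so an earlier language wins even if it only exists as a generated transcript. The standalone sketch below mirrors that logic with plain dictionaries; the function name, the `LookupError`, and the string values are placeholders for this illustration.

```python
def find_transcript(language_codes, transcript_dicts):
    """Return the first match, trying languages in priority order."""
    for language_code in language_codes:
        for transcript_dict in transcript_dicts:
            if language_code in transcript_dict:
                return transcript_dict[language_code]
    raise LookupError("no transcript for: " + ", ".join(language_codes))


# Placeholder values stand in for Transcript objects.
manual = {"en": "manual-en"}
generated = {"de": "generated-de", "en": "generated-en"}

# "de" has priority, so the generated German transcript is chosen
# although a manual English one exists.
first = find_transcript(["de", "en"], [manual, generated])

# For a single language, manual transcripts are checked before generated ones.
second = find_transcript(["en"], [manual, generated])

print(first, second)
```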

#### Error Handling:

- Custom exceptions (e.g., **InvalidVideoId**, **TooManyRequests**) are used for various error cases.

#### Workflow:

- The **TranscriptListFetcher** fetches the video HTML, extracts captions, and builds a **TranscriptList**.

- The **TranscriptList** stores and searches for transcripts in specific languages.

- Each **Transcript** object can fetch transcript data and handle translations if available.
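
The first two workflow steps can be sketched without any network access. The HTML fragment below is invented for illustration and is far smaller than a real watch page; the sketch only shows the string-splitting idea and how tracks of kind `asr` are sorted into the generated group.

```python
import json

# Invented page fragment containing a "captions" JSON object.
SAMPLE_HTML = (
    '<script>var player = {"captions":{"playerCaptionsTracklistRenderer":'
    '{"captionTracks":['
    '{"languageCode":"en","kind":"asr","name":{"simpleText":"English (auto-generated)"}},'
    '{"languageCode":"de","name":{"simpleText":"German"}}'
    ']}},"videoDetails":{"videoId":"abc123"}};</script>'
)


def extract_captions_json(html):
    """Cut the captions object out of the page source and parse it."""
    parts = html.split('"captions":')
    if len(parts) <= 1:
        raise ValueError("no captions data found in the page")
    payload = parts[1].split(',"videoDetails')[0].replace("\n", "")
    return json.loads(payload)["playerCaptionsTracklistRenderer"]


captions = extract_captions_json(SAMPLE_HTML)

manual, generated = {}, {}
for track in captions["captionTracks"]:
    # Automatically generated captions are marked with kind == "asr".
    target = generated if track.get("kind", "") == "asr" else manual
    target[track["languageCode"]] = track["name"]["simpleText"]

print(manual, generated)
```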

### Summary:

This structure enables efficient management of YouTube video transcripts, supporting multiple languages and error handling.


In [14]:
import sys

# This can only be tested by using different python versions, therefore it is not covered by coverage.py
if sys.version_info.major == 2:  # pragma: no cover
    # ruff: noqa: F821
    reload(sys)
    sys.setdefaultencoding("utf-8")

import json

from defusedxml import ElementTree

import re

from requests import HTTPError


def _raise_http_errors(response, video_id):
    try:
        response.raise_for_status()
        return response
    except HTTPError as error:
        # Argument order must match YouTubeRequestFailed.__init__(video_id, http_error).
        raise YouTubeRequestFailed(video_id, error)


class TranscriptListFetcher(object):
    def __init__(self, http_client):
        self._http_client = http_client

    def fetch(self, video_id):
        return TranscriptList.build(
            self._http_client,
            video_id,
            self._extract_captions_json(self._fetch_video_html(video_id), video_id),
        )

    def _extract_captions_json(self, html, video_id):
        splitted_html = html.split('"captions":')

        if len(splitted_html) <= 1:
            if video_id.startswith("http://") or video_id.startswith("https://"):
                raise InvalidVideoId(video_id)
            if 'class="g-recaptcha"' in html:
                raise TooManyRequests(video_id)
            if '"playabilityStatus":' not in html:
                raise VideoUnavailable(video_id)

            raise TranscriptsDisabled(video_id)

        captions_json = json.loads(
            splitted_html[1].split(',"videoDetails')[0].replace("\n", "")
        ).get("playerCaptionsTracklistRenderer")
        if captions_json is None:
            raise TranscriptsDisabled(video_id)

        if "captionTracks" not in captions_json:
            raise NoTranscriptAvailable(video_id)

        return captions_json

    def _create_consent_cookie(self, html, video_id):
        match = re.search('name="v" value="(.*?)"', html)
        if match is None:
            raise FailedToCreateConsentCookie(video_id)
        self._http_client.cookies.set(
            "CONSENT", "YES+" + match.group(1), domain=".youtube.com"
        )

    def _fetch_video_html(self, video_id):
        html = self._fetch_html(video_id)
        if 'action="https://consent.youtube.com/s"' in html:
            self._create_consent_cookie(html, video_id)
            html = self._fetch_html(video_id)
            if 'action="https://consent.youtube.com/s"' in html:
                raise FailedToCreateConsentCookie(video_id)
        return html

    def _fetch_html(self, video_id):
        response = self._http_client.get(
            WATCH_URL.format(video_id=video_id), headers={"Accept-Language": "en-US"}
        )
        return unescape(_raise_http_errors(response, video_id).text)


class TranscriptList(object):
    """
    This object represents a list of transcripts. It can be iterated over to list all transcripts which are available
    for a given YouTube video. Also it provides functionality to search for a transcript in a given language.
    """

    def __init__(
        self,
        video_id,
        manually_created_transcripts,
        generated_transcripts,
        translation_languages,
    ):
        """
        The constructor is only for internal use. Use the static build method instead.

        :param video_id: the id of the video this TranscriptList is for
        :type video_id: str
        :param manually_created_transcripts: dict mapping language codes to the manually created transcripts
        :type manually_created_transcripts: dict[str, Transcript]
        :param generated_transcripts: dict mapping language codes to the generated transcripts
        :type generated_transcripts: dict[str, Transcript]
        :param translation_languages: list of languages which can be used for translatable languages
        :type translation_languages: list[dict[str, str]]
        """
        self.video_id = video_id
        self._manually_created_transcripts = manually_created_transcripts
        self._generated_transcripts = generated_transcripts
        self._translation_languages = translation_languages

    @staticmethod
    def build(http_client, video_id, captions_json):
        """
        Factory method for TranscriptList.

        :param http_client: http client which is used to make the transcript retrieving http calls
        :type http_client: requests.Session
        :param video_id: the id of the video this TranscriptList is for
        :type video_id: str
        :param captions_json: the JSON parsed from the YouTube pages static HTML
        :type captions_json: dict
        :return: the created TranscriptList
        :rtype TranscriptList:
        """
        translation_languages = [
            {
                "language": translation_language["languageName"]["simpleText"],
                "language_code": translation_language["languageCode"],
            }
            for translation_language in captions_json.get("translationLanguages", [])
        ]

        manually_created_transcripts = {}
        generated_transcripts = {}

        for caption in captions_json["captionTracks"]:
            if caption.get("kind", "") == "asr":
                transcript_dict = generated_transcripts
            else:
                transcript_dict = manually_created_transcripts

            transcript_dict[caption["languageCode"]] = Transcript(
                http_client,
                video_id,
                caption["baseUrl"],
                caption["name"]["simpleText"],
                caption["languageCode"],
                caption.get("kind", "") == "asr",
                translation_languages if caption.get("isTranslatable", False) else [],
            )

        return TranscriptList(
            video_id,
            manually_created_transcripts,
            generated_transcripts,
            translation_languages,
        )

    def __iter__(self):
        return iter(
            list(self._manually_created_transcripts.values())
            + list(self._generated_transcripts.values())
        )

    def find_transcript(self, language_codes):
        """
        Finds a transcript for a given language code. Manually created transcripts are returned first and only if none
        are found, generated transcripts are used. If you only want generated transcripts use
        `find_generated_transcript` instead.

        :param language_codes: A list of language codes in a descending priority. For example, if this is set to
        ['de', 'en'] it will first try to fetch the German transcript (de) and then fetch the English transcript (en) if
        it fails to do so.
        :type language_codes: list[str]
        :return: the found Transcript
        :rtype Transcript:
        :raises: NoTranscriptFound
        """
        return self._find_transcript(
            language_codes,
            [self._manually_created_transcripts, self._generated_transcripts],
        )

    def find_generated_transcript(self, language_codes):
        """
        Finds an automatically generated transcript for a given language code.

        :param language_codes: A list of language codes in a descending priority. For example, if this is set to
        ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if
        it fails to do so.
        :type language_codes: list[str]
        :return: the found Transcript
        :rtype Transcript:
        :raises: NoTranscriptFound
        """
        return self._find_transcript(language_codes, [self._generated_transcripts])

    def find_manually_created_transcript(self, language_codes):
        """
        Finds a manually created transcript for a given language code.

        :param language_codes: A list of language codes in a descending priority. For example, if this is set to
        ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if
        it fails to do so.
        :type language_codes: list[str]
        :return: the found Transcript
        :rtype Transcript:
        :raises: NoTranscriptFound
        """
        return self._find_transcript(
            language_codes, [self._manually_created_transcripts]
        )

    def _find_transcript(self, language_codes, transcript_dicts):
        for language_code in language_codes:
            for transcript_dict in transcript_dicts:
                if language_code in transcript_dict:
                    return transcript_dict[language_code]

        raise NoTranscriptFound(self.video_id, language_codes, self)

    def __str__(self):
        return (
            "For this video ({video_id}) transcripts are available in the following languages:\n\n"
            "(MANUALLY CREATED)\n"
            "{available_manually_created_transcript_languages}\n\n"
            "(GENERATED)\n"
            "{available_generated_transcripts}\n\n"
            "(TRANSLATION LANGUAGES)\n"
            "{available_translation_languages}"
        ).format(
            video_id=self.video_id,
            available_manually_created_transcript_languages=self._get_language_description(
                str(transcript)
                for transcript in self._manually_created_transcripts.values()
            ),
            available_generated_transcripts=self._get_language_description(
                str(transcript) for transcript in self._generated_transcripts.values()
            ),
            available_translation_languages=self._get_language_description(
                '{language_code} ("{language}")'.format(
                    language=translation_language["language"],
                    language_code=translation_language["language_code"],
                )
                for translation_language in self._translation_languages
            ),
        )

    def _get_language_description(self, transcript_strings):
        description = "\n".join(
            " - {transcript}".format(transcript=transcript)
            for transcript in transcript_strings
        )
        return description if description else "None"


class Transcript(object):
    def __init__(
        self,
        http_client,
        video_id,
        url,
        language,
        language_code,
        is_generated,
        translation_languages,
    ):
        """
        You probably don't want to initialize this directly. Usually you'll access Transcript objects using a
        TranscriptList.

        :param http_client: http client which is used to make the transcript retrieving http calls
        :type http_client: requests.Session
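        :param video_id: the id of the video this Transcript is for
        :type video_id: str
        :param url: the url which needs to be called to fetch the transcript
        :param language: the name of the language this transcript uses
        :param language_code: the language code of this transcript, e.g. 'en'
        :param is_generated: True if the transcript was generated automatically (caption kind "asr")
        :param translation_languages: list of dicts with 'language' and 'language_code' keys describing the languages
        this transcript can be translated into; empty if it is not translatable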
        """
        self._http_client = http_client
        self.video_id = video_id
        self._url = url
        self.language = language
        self.language_code = language_code
        self.is_generated = is_generated
        self.translation_languages = translation_languages
        self._translation_languages_dict = {
            translation_language["language_code"]: translation_language["language"]
            for translation_language in translation_languages
        }

    def fetch(self, preserve_formatting=False):
        """
        Loads the actual transcript data.
        :param preserve_formatting: whether to keep select HTML text formatting
        :type preserve_formatting: bool
        :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys
        :rtype [{'text': str, 'start': float, 'duration': float}]:
        """
        response = self._http_client.get(
            self._url, headers={"Accept-Language": "en-US"}
        )
        return _TranscriptParser(preserve_formatting=preserve_formatting).parse(
            _raise_http_errors(response, self.video_id).text,
        )

    def __str__(self):
        return '{language_code} ("{language}"){translation_description}'.format(
            language=self.language,
            language_code=self.language_code,
            translation_description="[TRANSLATABLE]" if self.is_translatable else "",
        )

    @property
    def is_translatable(self):
        return len(self.translation_languages) > 0

    def translate(self, language_code):
        if not self.is_translatable:
            raise NotTranslatable(self.video_id)

        if language_code not in self._translation_languages_dict:
            raise TranslationLanguageNotAvailable(self.video_id)

        return Transcript(
            self._http_client,
            self.video_id,
            "{url}&tlang={language_code}".format(
                url=self._url, language_code=language_code
            ),
            self._translation_languages_dict[language_code],
            language_code,
            True,
            [],
        )


class _TranscriptParser(object):
    _FORMATTING_TAGS = [
        "strong",  # important
        "em",  # emphasized
        "b",  # bold
        "i",  # italic
        "mark",  # marked
        "small",  # smaller
        "del",  # deleted
        "ins",  # inserted
        "sub",  # subscript
        "sup",  # superscript
    ]

    def __init__(self, preserve_formatting=False):
        self._html_regex = self._get_html_regex(preserve_formatting)

    def _get_html_regex(self, preserve_formatting):
        if preserve_formatting:
            formats_regex = "|".join(self._FORMATTING_TAGS)
            formats_regex = r"<\/?(?!\/?(" + formats_regex + r")\b).*?\b>"
            html_regex = re.compile(formats_regex, re.IGNORECASE)
        else:
            html_regex = re.compile(r"<[^>]*>", re.IGNORECASE)
        return html_regex

    def parse(self, plain_data):
        return [
            {
                "text": re.sub(self._html_regex, "", unescape(xml_element.text)),
                "start": float(xml_element.attrib["start"]),
                "duration": float(xml_element.attrib.get("dur", "0.0")),
            }
            for xml_element in ElementTree.fromstring(plain_data)
            if xml_element.text is not None
        ]

## Step 5: Transcript Formatter System

This system formats transcripts into formats such as plain text, JSON, SRT, and WebVTT.

#### `Formatter`:

- Abstract base class. Subclasses implement `format_transcript()` and `format_transcripts()` to format a single transcript or a list of transcripts.

#### `PrettyPrintFormatter`:
  
- Uses `pprint` for readable output.

#### `JSONFormatter`:
  
- Converts transcripts into JSON strings.

#### `TextFormatter`:
  
- Converts transcripts into plain text.

#### `_TextBasedFormatter`: 

- Base class for text-based formats (SRT, WebVTT).
  
- Handles timestamps and line formatting.

#### `SRTFormatter`:
  
- Formats transcripts into the SRT subtitle format with timestamps and numbering.

#### `WebVTTFormatter`:
  
- Formats transcripts for WebVTT captions with timestamps.
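
The timestamp handling can be sketched as two standalone functions. This is a simplified illustration of the SRT case, not the class-based code from the notebook: `seconds_to_srt_timestamp` and `to_srt` are names chosen here, and the sample cues are invented. Note how a cue's end time is clamped to the start of the next cue so that subtitles never overlap.

```python
def seconds_to_srt_timestamp(seconds):
    """Convert seconds to the SRT form HH:MM:SS,mmm."""
    seconds = float(seconds)
    hours, remainder = divmod(seconds, 3600)
    mins, secs = divmod(remainder, 60)
    ms = int(round((seconds - int(seconds)) * 1000, 2))
    return "{:02d}:{:02d}:{:02d},{:03d}".format(
        int(hours), int(mins), int(secs), ms
    )


def to_srt(transcript):
    """Render a list of {'text', 'start', 'duration'} dicts as SRT."""
    cues = []
    for i, line in enumerate(transcript):
        end = line["start"] + line["duration"]
        # Clamp the end so a cue never runs into the next cue.
        if i < len(transcript) - 1 and transcript[i + 1]["start"] < end:
            end = transcript[i + 1]["start"]
        cues.append(
            "{}\n{} --> {}\n{}".format(
                i + 1,  # SRT cue numbers start at 1
                seconds_to_srt_timestamp(line["start"]),
                seconds_to_srt_timestamp(end),
                line["text"],
            )
        )
    return "\n\n".join(cues) + "\n"


# Invented sample: the first cue would overlap the second by one second.
sample = [
    {"text": "Hello", "start": 0.0, "duration": 2.5},
    {"text": "world", "start": 1.5, "duration": 2.0},
]
srt_text = to_srt(sample)
print(srt_text)
```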

### `FormatterLoader`: 

- Loads the appropriate formatter class based on the type (e.g., JSON, WebVTT).

### Exception Handling: 

- Raises `UnknownFormatterType` for unsupported formatter types.

### Example use case: 

- Use `FormatterLoader` to load a formatter like `WebVTTFormatter`, then format the transcript.
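
A loader of this kind can be as simple as a dictionary lookup. The sketch below is an assumption about the general shape, written from the description above rather than copied from the notebook: the `TYPES` mapping, the `load` signature, the default type, and the reduced set of formatters are all illustrative.

```python
import json


class JSONFormatter:
    def format_transcript(self, transcript, **kwargs):
        return json.dumps(transcript, **kwargs)


class TextFormatter:
    def format_transcript(self, transcript, **kwargs):
        return "\n".join(line["text"] for line in transcript)


class UnknownFormatterType(Exception):
    """Raised for a formatter type that is not registered."""


class FormatterLoader:
    # Hypothetical registry; only two formatters are included here.
    TYPES = {"json": JSONFormatter, "text": TextFormatter}

    def load(self, formatter_type="text"):
        if formatter_type not in self.TYPES:
            raise UnknownFormatterType(
                "unsupported formatter type: {}".format(formatter_type)
            )
        # Return a new formatter instance of the requested type.
        return self.TYPES[formatter_type]()


sample = [
    {"text": "Hello", "start": 0.0, "duration": 1.0},
    {"text": "world", "start": 1.0, "duration": 1.0},
]
as_text = FormatterLoader().load("text").format_transcript(sample)
as_json = FormatterLoader().load("json").format_transcript(sample)
print(as_text)
```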



In [15]:
import json

import pprint


class Formatter(object):
    """Formatter should be used as an abstract base class.

    Formatter classes should inherit from this class and implement
    their own .format_transcript() and .format_transcripts() methods,
    which should return a string. A transcript is represented by a
    list of dictionary items.
    """

    def format_transcript(self, transcript, **kwargs):
        raise NotImplementedError(
            "A subclass of Formatter must implement "
            "their own .format_transcript() method."
        )

    def format_transcripts(self, transcripts, **kwargs):
        raise NotImplementedError(
            "A subclass of Formatter must implement "
            "their own .format_transcripts() method."
        )


class PrettyPrintFormatter(Formatter):
    def format_transcript(self, transcript, **kwargs):
        """Pretty prints a transcript.

        :param transcript:
        :return: A pretty printed string representation of the transcript.
        :rtype str
        """
        return pprint.pformat(transcript, **kwargs)

    def format_transcripts(self, transcripts, **kwargs):
        """Pretty prints a list of transcripts.

        :param transcripts:
        :return: A pretty printed string representation of the transcripts.
        :rtype str
        """
        return self.format_transcript(transcripts, **kwargs)


class JSONFormatter(Formatter):
    def format_transcript(self, transcript, **kwargs):
        """Converts a transcript into a JSON string.

        :param transcript:
        :return: A JSON string representation of the transcript.
        :rtype str
        """
        return json.dumps(transcript, **kwargs)

    def format_transcripts(self, transcripts, **kwargs):
        """Converts a list of transcripts into a JSON string.

        :param transcripts:
        :return: A JSON string representation of the transcripts.
        :rtype str
        """
        return self.format_transcript(transcripts, **kwargs)


class TextFormatter(Formatter):
    def format_transcript(self, transcript, **kwargs):
        """Converts a transcript into plain text with no timestamps.

        :param transcript:
        :return: all transcript text lines separated by newline breaks.
        :rtype str
        """
        return "\n".join(line["text"] for line in transcript)

    def format_transcripts(self, transcripts, **kwargs):
        """Converts a list of transcripts into plain text with no timestamps.

        :param transcripts:
        :return: the text of every transcript, with transcripts separated by blank lines.
        :rtype str
        """
        return "\n\n\n".join(
            [self.format_transcript(transcript, **kwargs) for transcript in transcripts]
        )


class _TextBasedFormatter(TextFormatter):
    def _format_timestamp(self, hours, mins, secs, ms):
        raise NotImplementedError(
            "A subclass of _TextBasedFormatter must implement "
            "their own _format_timestamp method."
        )

    def _format_transcript_header(self, lines):
        raise NotImplementedError(
            "A subclass of _TextBasedFormatter must implement "
            "their own _format_transcript_header method."
        )

    def _format_transcript_helper(self, i, time_text, line):
        raise NotImplementedError(
            "A subclass of _TextBasedFormatter must implement "
            "their own _format_transcript_helper method."
        )

    def _seconds_to_timestamp(self, time):
        """Helper that converts `time` into a transcript cue timestamp.

        :reference: https://www.w3.org/TR/webvtt1/#webvtt-timestamp

        :param time: a float representing time in seconds.
        :type time: float
        :return: a string formatted as a cue timestamp, 'HH:MM:SS.mmm'
        :rtype: str
        :example:
        >>> self._seconds_to_timestamp(6.93)
        '00:00:06.930'
        """
        time = float(time)
        hours_float, remainder = divmod(time, 3600)
        mins_float, secs_float = divmod(remainder, 60)
        hours, mins, secs = int(hours_float), int(mins_float), int(secs_float)
        ms = int(round((time - int(time)) * 1000, 2))
        return self._format_timestamp(hours, mins, secs, ms)

    def format_transcript(self, transcript, **kwargs):
        """A basic implementation of WEBVTT/SRT formatting.

        :param transcript:
        :reference:
        https://www.w3.org/TR/webvtt1/#introduction-caption
        https://www.3playmedia.com/blog/create-srt-file/
        """
        lines = []
        for i, line in enumerate(transcript):
            end = line["start"] + line["duration"]
            time_text = "{} --> {}".format(
                self._seconds_to_timestamp(line["start"]),
                self._seconds_to_timestamp(
                    transcript[i + 1]["start"]
                    if i < len(transcript) - 1 and transcript[i + 1]["start"] < end
                    else end
                ),
            )
            lines.append(self._format_transcript_helper(i, time_text, line))

        return self._format_transcript_header(lines)


class SRTFormatter(_TextBasedFormatter):
    def _format_timestamp(self, hours, mins, secs, ms):
        return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, mins, secs, ms)

    def _format_transcript_header(self, lines):
        return "\n\n".join(lines) + "\n"

    def _format_transcript_helper(self, i, time_text, line):
        return "{}\n{}\n{}".format(i + 1, time_text, line["text"])


class WebVTTFormatter(_TextBasedFormatter):
    def _format_timestamp(self, hours, mins, secs, ms):
        return "{:02d}:{:02d}:{:02d}.{:03d}".format(hours, mins, secs, ms)

    def _format_transcript_header(self, lines):
        return "WEBVTT\n\n" + "\n\n".join(lines) + "\n"

    def _format_transcript_helper(self, i, time_text, line):
        return "{}\n{}".format(time_text, line["text"])


class FormatterLoader(object):
    TYPES = {
        "json": JSONFormatter,
        "pretty": PrettyPrintFormatter,
        "text": TextFormatter,
        "webvtt": WebVTTFormatter,
        "srt": SRTFormatter,
    }

    class UnknownFormatterType(Exception):
        def __init__(self, formatter_type):
            super(FormatterLoader.UnknownFormatterType, self).__init__(
                "The format '{formatter_type}' is not supported. "
                "Choose one of the following formats: {supported_formatter_types}".format(
                    formatter_type=formatter_type,
                    supported_formatter_types=", ".join(FormatterLoader.TYPES.keys()),
                )
            )

    def load(self, formatter_type="pretty"):
        """
        Loads the Formatter for the given formatter type.

        :param formatter_type:
        :return: Formatter object
        """
        if formatter_type not in FormatterLoader.TYPES.keys():
            raise FormatterLoader.UnknownFormatterType(formatter_type)
        return FormatterLoader.TYPES[formatter_type]()
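
The timestamp and cue logic above can be sketched in isolation. The block below is a standalone illustration rather than the classes defined in this step: `seconds_to_timestamp` and `to_srt` are hypothetical helper names, the sample transcript is made up, and the conversion works in integer milliseconds instead of the float arithmetic used by `_seconds_to_timestamp`.

```python
def seconds_to_timestamp(seconds, ms_separator=","):
    """Format seconds as 'HH:MM:SS,mmm' (SRT) or 'HH:MM:SS.mmm' (WebVTT)."""
    total_ms = int(round(float(seconds) * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    mins, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return "{:02d}:{:02d}:{:02d}{}{:03d}".format(hours, mins, secs, ms_separator, ms)


def to_srt(transcript):
    """Render a list of {'text', 'start', 'duration'} dicts as SRT text."""
    blocks = []
    for i, line in enumerate(transcript):
        end = line["start"] + line["duration"]
        # Clip the cue so it never runs past the start of the next one.
        if i < len(transcript) - 1:
            end = min(end, transcript[i + 1]["start"])
        blocks.append(
            "{}\n{} --> {}\n{}".format(
                i + 1,
                seconds_to_timestamp(line["start"]),
                seconds_to_timestamp(end),
                line["text"],
            )
        )
    return "\n\n".join(blocks) + "\n"


# Made-up sample: the first cue overlaps the second by half a second.
sample = [
    {"text": "Hello", "start": 0.0, "duration": 2.5},
    {"text": "world", "start": 2.0, "duration": 1.5},
]
print(to_srt(sample))
```

Passing `ms_separator="."` gives the WebVTT form of the timestamp, which is the only difference between the two `_format_timestamp` implementations above.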

## Step 6: YouTube Transcript API Integration

This step defines `YouTubeTranscriptApi`, the entry point for fetching transcripts from YouTube videos, with optional cookie and proxy support.

### `requests`:

- Used to make HTTP requests and manage responses.

### `cookiejar`:

- Loads cookies for authenticated YouTube requests. The import falls back from `http.cookiejar` (Python 3) to `cookielib` (Python 2).

### `YouTubeTranscriptApi Class`:

- Contains methods to fetch and manage YouTube video transcripts.

#### `list_transcripts`:
  
- Retrieves available transcripts for a YouTube video using its `video_id`.

#### `get_transcripts`:
  
- Fetches transcripts for multiple videos, handling language preferences and errors.

#### `get_transcript`:
  
- Retrieves the transcript for a single video in the first available language from `languages`. Setting `preserve_formatting` keeps select HTML formatting in the text.

#### `_load_cookies`:
  
- Loads authentication cookies from a file for authorized access.

### Exceptions:

#### `CookiesInvalid`:
  
- Raised if the loaded cookies are invalid.

#### `CookiePathInvalid`:
  
- Raised if the cookie file path is incorrect or missing.
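
The cookie handling described above relies on the standard library's `MozillaCookieJar`, which reads Netscape-format cookie files. The sketch below is a minimal, standalone illustration: the file contents, the cookie name, and the helper `load_cookie_jar` are made up for the example, and it raises a plain `ValueError` where this notebook raises `CookiesInvalid`.

```python
import http.cookiejar as cookiejar
import os
import tempfile

# A made-up Netscape-format cookie file with one far-future cookie.
# Fields are tab-separated: domain, subdomain flag, path, secure flag,
# expiry (Unix time), name, value.
COOKIE_FILE = (
    "# Netscape HTTP Cookie File\n"
    ".youtube.com\tTRUE\t/\tTRUE\t4102444800\tEXAMPLE_NAME\texample_value\n"
)


def load_cookie_jar(path):
    """Load a cookie file, rejecting files that yield no usable cookies."""
    jar = cookiejar.MozillaCookieJar()
    jar.load(path)  # raises FileNotFoundError or cookiejar.LoadError
    if not jar:  # an empty jar has length 0 and is therefore falsy
        raise ValueError("cookie file contained no usable cookies")
    return jar


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookies.txt")
    with open(path, "w") as handle:
        handle.write(COOKIE_FILE)

    jar = load_cookie_jar(path)
    cookie_names = [cookie.name for cookie in jar]

    # A missing file surfaces as FileNotFoundError.
    try:
        load_cookie_jar(os.path.join(tmp, "missing.txt"))
        missing_error = None
    except FileNotFoundError as error:
        missing_error = error

print(cookie_names, type(missing_error).__name__)
```

Expired cookies are skipped by `load` by default, so a stale export can produce an empty jar even though the file exists and parses.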

### Conclusion:

- `YouTubeTranscriptApi` fetches transcripts for one or many videos, honors a language priority list, and can either stop at the first error or collect the IDs of the videos that failed.


In [16]:
import requests

try:  # pragma: no cover
    import http.cookiejar as cookiejar

    CookieLoadError = (FileNotFoundError, cookiejar.LoadError)
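except ImportError:  # pragma: no cover
    import cookielib as cookiejar  # type: ignore

    # Python 2 has no FileNotFoundError; cookielib.LoadError subclasses IOError.
    CookieLoadError = IOError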


class YouTubeTranscriptApi(object):
    @classmethod
    def list_transcripts(cls, video_id, proxies=None, cookies=None):
        """
        Retrieves the list of transcripts which are available for a given video. It returns a `TranscriptList` object
        which is iterable and provides methods to filter the list of transcripts for specific languages. While iterating
        over the `TranscriptList` the individual transcripts are represented by `Transcript` objects, which provide
        metadata and can either be fetched by calling `transcript.fetch()` or translated by calling
        `transcript.translate('en')`. Example::

            # retrieve the available transcripts
            transcript_list = YouTubeTranscriptApi.list_transcripts('video_id')

            # iterate over all available transcripts
            for transcript in transcript_list:
                # the Transcript object provides metadata properties
                print(
                    transcript.video_id,
                    transcript.language,
                    transcript.language_code,
                    # whether it has been manually created or generated by YouTube
                    transcript.is_generated,
                    # a list of languages the transcript can be translated to
                    transcript.translation_languages,
                )

                # fetch the actual transcript data
                print(transcript.fetch())

                # translating the transcript will return another transcript object
                print(transcript.translate('en').fetch())

            # you can also directly filter for the language you are looking for, using the transcript list
            transcript = transcript_list.find_transcript(['de', 'en'])

            # or just filter for manually created transcripts
            transcript = transcript_list.find_manually_created_transcript(['de', 'en'])

            # or automatically generated ones
            transcript = transcript_list.find_generated_transcript(['de', 'en'])

        :param video_id: the youtube video id
        :type video_id: str
        :param proxies: a dictionary mapping of http and https proxies to be used for the network requests
        :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies
        :param cookies: a string of the path to a text file containing youtube authorization cookies
        :type cookies: str
        :return: the list of available transcripts
        :rtype TranscriptList:
        """
        with requests.Session() as http_client:
            if cookies:
                http_client.cookies = cls._load_cookies(cookies, video_id)
            http_client.proxies = proxies if proxies else {}
            return TranscriptListFetcher(http_client).fetch(video_id)

    @classmethod
    def get_transcripts(
        cls,
        video_ids,
        languages=("en",),
        continue_after_error=False,
        proxies=None,
        cookies=None,
        preserve_formatting=False,
    ):
        """
        Retrieves the transcripts for a list of videos.

        :param video_ids: a list of youtube video ids
        :type video_ids: list[str]
        :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en']
        it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to
        do so.
        :type languages: list[str]
        :param continue_after_error: if this is set the execution won't be stopped, if an error occurs while retrieving
        one of the video transcripts
        :type continue_after_error: bool
        :param proxies: a dictionary mapping of http and https proxies to be used for the network requests
        :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies
        :param cookies: a string of the path to a text file containing youtube authorization cookies
        :type cookies: str
        :param preserve_formatting: whether to keep select HTML text formatting
        :type preserve_formatting: bool
        :return: a tuple containing a dictionary mapping video ids onto their corresponding transcripts, and a list of
        video ids, which could not be retrieved
        :rtype ({str: [{'text': str, 'start': float, 'duration': float}]}, [str]):
        """
        assert isinstance(video_ids, list), "`video_ids` must be a list of strings"

        data = {}
        unretrievable_videos = []

        for video_id in video_ids:
            try:
                data[video_id] = cls.get_transcript(
                    video_id, languages, proxies, cookies, preserve_formatting
                )
            except Exception as exception:
                if not continue_after_error:
                    raise

                unretrievable_videos.append(video_id)

        return data, unretrievable_videos

    @classmethod
    def get_transcript(
        cls,
        video_id,
        languages=("en",),
        proxies=None,
        cookies=None,
        preserve_formatting=False,
    ):
        """
        Retrieves the transcript for a single video. This is just a shortcut for calling::

            YouTubeTranscriptApi.list_transcripts(video_id, proxies, cookies).find_transcript(languages).fetch()

        :param video_id: the youtube video id
        :type video_id: str
        :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en']
        it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails to
        do so.
        :type languages: list[str]
        :param proxies: a dictionary mapping of http and https proxies to be used for the network requests
        :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies
        :param cookies: a string of the path to a text file containing youtube authorization cookies
        :type cookies: str
        :param preserve_formatting: whether to keep select HTML text formatting
        :type preserve_formatting: bool
        :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys
        :rtype [{'text': str, 'start': float, 'duration': float}]:
        """
        assert isinstance(video_id, str), "`video_id` must be a string"
        return (
            cls.list_transcripts(video_id, proxies, cookies)
            .find_transcript(languages)
            .fetch(preserve_formatting=preserve_formatting)
        )

    @classmethod
    def _load_cookies(cls, cookies, video_id):
        try:
            cookie_jar = cookiejar.MozillaCookieJar()
            cookie_jar.load(cookies)
            if not cookie_jar:
                raise CookiesInvalid(video_id)
            return cookie_jar
        except CookieLoadError:
            raise CookiePathInvalid(video_id)
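
The `continue_after_error` behavior of `get_transcripts` can be sketched without any network access. In the block below, `fetch_many` and `fake_fetch` are hypothetical names and the video IDs are placeholders; `fake_fetch` stands in for `get_transcript` so the control flow can be run offline.

```python
def fetch_many(video_ids, fetch_one, continue_after_error=False):
    """Fetch each ID, either failing fast or collecting the IDs that failed."""
    data, failed = {}, []
    for video_id in video_ids:
        try:
            data[video_id] = fetch_one(video_id)
        except Exception:
            if not continue_after_error:
                raise  # fail fast, keeping the original traceback
            failed.append(video_id)
    return data, failed


def fake_fetch(video_id):
    """Stand-in for a real fetch: one placeholder ID always fails."""
    if video_id == "bad_id":
        raise ValueError("no transcript available")
    return [{"text": "hi", "start": 0.0, "duration": 1.0}]


data, failed = fetch_many(
    ["good_id", "bad_id"], fake_fetch, continue_after_error=True
)
print(sorted(data), failed)
```

With `continue_after_error=False`, the same call would raise the `ValueError` from the second ID and return nothing, so callers processing large batches will usually want the collecting mode.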

## Step 7: YouTube Transcript CLI Integration

This integration allows users to fetch YouTube video transcripts via the command line.

### `argparse`:

- Parses command-line arguments to interact with users and fetch transcripts.

### `YouTubeTranscriptCli Class`:

#### Constructor (`__init__`):

- Stores command-line arguments (`args`) passed by the user.

#### `run Method`:

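- Parses the arguments, builds the proxy and cookie settings, and fetches a transcript for each video ID. Exceptions are collected rather than raised: the return value is a single string containing the error messages followed by the formatted transcripts. If both exclusion flags are set, it returns an empty string.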

#### `_fetch_transcript Method`:

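- Retrieves a transcript through `YouTubeTranscriptApi.list_transcripts`. Depending on the flags, it returns the list of available transcripts instead, restricts the search to generated or manually created transcripts, or translates the result.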

#### `_parse_args Method`:

- Defines and parses command-line options like video IDs, language preferences, and output format.

#### `_sanitize_video_ids Method`:

- Removes escape characters (backslashes) from video IDs.

### Conclusion:

- This class is a command-line utility to fetch YouTube video transcripts, supporting filters, translations, and proxy usage. It provides an easy interface for retrieving transcripts based on user preferences.


In [18]:
import argparse


class YouTubeTranscriptCli(object):
    def __init__(self, args):
        self._args = args

    def run(self):
        parsed_args = self._parse_args()

        if parsed_args.exclude_manually_created and parsed_args.exclude_generated:
            return ""

        proxies = None
        if parsed_args.http_proxy != "" or parsed_args.https_proxy != "":
            proxies = {"http": parsed_args.http_proxy, "https": parsed_args.https_proxy}

        cookies = parsed_args.cookies

        transcripts = []
        exceptions = []

        for video_id in parsed_args.video_ids:
            try:
                transcripts.append(
                    self._fetch_transcript(parsed_args, proxies, cookies, video_id)
                )
            except Exception as exception:
                exceptions.append(exception)

        return "\n\n".join(
            [str(exception) for exception in exceptions]
            + (
                [
                    FormatterLoader()
                    .load(parsed_args.format)
                    .format_transcripts(transcripts)
                ]
                if transcripts
                else []
            )
        )

    def _fetch_transcript(self, parsed_args, proxies, cookies, video_id):
        transcript_list = YouTubeTranscriptApi.list_transcripts(
            video_id, proxies=proxies, cookies=cookies
        )

        if parsed_args.list_transcripts:
            return str(transcript_list)

        if parsed_args.exclude_manually_created:
            transcript = transcript_list.find_generated_transcript(
                parsed_args.languages
            )
        elif parsed_args.exclude_generated:
            transcript = transcript_list.find_manually_created_transcript(
                parsed_args.languages
            )
        else:
            transcript = transcript_list.find_transcript(parsed_args.languages)

        if parsed_args.translate:
            transcript = transcript.translate(parsed_args.translate)

        return transcript.fetch()

    def _parse_args(self):
        parser = argparse.ArgumentParser(
            description=(
                "This is a Python API which allows you to get the transcripts/subtitles for a given YouTube video. "
                "It also works for automatically generated subtitles and it does not require a headless browser, like "
                "other Selenium-based solutions do!"
            )
        )
        parser.add_argument(
            "--list-transcripts",
            action="store_const",
            const=True,
            default=False,
            help="This will list the languages in which transcripts for the given videos are available.",
        )
        parser.add_argument(
            "video_ids", nargs="+", type=str, help="List of YouTube video IDs."
        )
        parser.add_argument(
            "--languages",
            nargs="*",
            default=[
                "en",
            ],
            type=str,
            help=(
                'A list of language codes in a descending priority. For example, if this is set to "de en" it will '
                "first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails "
                "to do so. As I can't provide a complete list of all working language codes with full certainty, you "
                "may have to play around with the language codes a bit, to find the one which is working for you!"
            ),
        )
        parser.add_argument(
            "--exclude-generated",
            action="store_const",
            const=True,
            default=False,
            help="If this flag is set transcripts which have been generated by YouTube will not be retrieved.",
        )
        parser.add_argument(
            "--exclude-manually-created",
            action="store_const",
            const=True,
            default=False,
            help="If this flag is set transcripts which have been manually created will not be retrieved.",
        )
        parser.add_argument(
            "--format",
            type=str,
            default="pretty",
            choices=tuple(FormatterLoader.TYPES.keys()),
        )
        parser.add_argument(
            "--translate",
            default="",
            help=(
                "The language code for the language you want this transcript to be translated to. Use the "
                "--list-transcripts feature to find out which languages are translatable and which translation "
                "languages are available."
            ),
        )
        parser.add_argument(
            "--http-proxy",
            default="",
            metavar="URL",
            help="Use the specified HTTP proxy.",
        )
        parser.add_argument(
            "--https-proxy",
            default="",
            metavar="URL",
            help="Use the specified HTTPS proxy.",
        )
        parser.add_argument(
            "--cookies",
            default=None,
            help="The cookie file that will be used for authorization with youtube.",
        )

        return self._sanitize_video_ids(parser.parse_args(self._args))

    def _sanitize_video_ids(self, args):
        args.video_ids = [video_id.replace("\\", "") for video_id in args.video_ids]
        return args
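
A trimmed-down version of the argument handling shows how the positional IDs, the language list, and the backslash sanitization interact. This is a sketch under assumptions: `parse_cli` is a hypothetical helper, only a subset of the options above is included, and `action="store_true"` is used as shorthand for the `store_const` / `const=True` / `default=False` combination.

```python
import argparse


def parse_cli(argv):
    """Parse a reduced set of the CLI options and sanitize the video IDs."""
    parser = argparse.ArgumentParser(description="Sketch of a transcript CLI.")
    parser.add_argument("video_ids", nargs="+", help="YouTube video IDs.")
    parser.add_argument("--languages", nargs="*", default=["en"])
    parser.add_argument("--exclude-generated", action="store_true")
    parser.add_argument("--exclude-manually-created", action="store_true")
    args = parser.parse_args(argv)
    # An ID that starts with '-' would be read as an option, so it can be
    # passed with a leading backslash, which is stripped here.
    args.video_ids = [video_id.replace("\\", "") for video_id in args.video_ids]
    return args


# The first element is the escaped form of the placeholder ID '-abc123'.
args = parse_cli(["\\-abc123", "--languages", "de", "en", "--exclude-generated"])
print(args.video_ids, args.languages, args.exclude_generated)
```

Because `--languages` accepts any number of values, the video IDs are placed before it in this example; IDs that follow `--languages` directly would be consumed as language codes.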

## Step 8: Integrate and Query Google Gemini API

This integration allows you to interact with the Google Gemini generative models for content generation.

### 1. Imports

#### `google.generativeai as genai`:
- Imports the Gemini API module to interact with Google Gemini models.

### 2. Loading Environment Variables

#### `load_dotenv()`:
- Loads the environment variables from the `.env` file into the current Python session, so the API key stays out of the source code.

### 3. Setting API Key and Model Constants

#### `gemini_api_key = os.getenv("GEMINI_API_KEY")`:
- Retrieves the Gemini API key from environment variables.

#### `DEFAULT_GEMINI_MODEL = "gemini-1.5-flash"`:
- Sets the default model version for the Gemini API.

### 4. Function to Get the Gemini Client

#### `get_gemini_client()`:
- Initializes and returns a client for the specified Gemini model.

#### `genai.configure(api_key=gemini_api_key)`:
- Authenticates using the API key.

#### `model = genai.GenerativeModel(model_name=model)`:
- Initializes the model client.

### 5. Function to Query Gemini

#### `gemini(query, model=DEFAULT_GEMINI_MODEL)`:
- Sends a query to the Gemini model and retrieves the generated content.

#### `model.generate_content(query)`:
- Requests content generation based on the query.

#### `return response.text`:
- Returns the generated content.

### Summary

- The script provides functions to initialize a Gemini client and query the model for content generation, securely handling the API key using environment variables.


In [28]:
import google.generativeai as genai
import os
from dotenv import load_dotenv

load_dotenv()

gemini_api_key = os.getenv("GEMINI_API_KEY")
DEFAULT_GEMINI_MODEL = "gemini-1.5-flash"


def get_gemini_client(model=DEFAULT_GEMINI_MODEL):
    genai.configure(api_key=gemini_api_key)
    model = genai.GenerativeModel(model_name=model)
    return model


def gemini(query, model=DEFAULT_GEMINI_MODEL):
    model = get_gemini_client(model)
    response = model.generate_content(query)
    return response.text
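
`os.getenv` returns `None` when `GEMINI_API_KEY` is not set, so a missing key only shows up later when the client is used. One option is to validate the key up front. The helper below is a hypothetical addition, not part of the code above, and takes the environment as a parameter so it can be exercised without touching real credentials.

```python
import os


def require_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the API key from the environment or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or environment."
        )
    return key


# Placeholder value for illustration only.
print(require_api_key({"GEMINI_API_KEY": "  example-key  "}))
```

In this notebook, the returned value could replace the module-level `gemini_api_key` that is passed to `genai.configure`.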

## Step 9: Fetch and Convert YouTube Transcript to Markdown

#### Imports:

- **TextFormatter**: Converts raw transcript to plain text.

- **YouTubeTranscriptApi**: Fetches YouTube transcripts.

- **display, Markdown**: Renders content in Jupyter notebooks.

#### Function: `transcript_to_markdown(text_transcript)`

- Converts plain text transcript to Markdown using Gemini AI.

#### Fetching and Formatting:

- Fetch transcript for a YouTube video (`video_id`).

- Format it to plain text using **TextFormatter**.

#### Convert to Markdown:

- Send plain text to **Gemini** for Markdown formatting.

#### Display:

- Render the Markdown-formatted transcript in Jupyter.

### Summary:

- Fetch YouTube transcript, convert to plain text, then format it into Markdown using Gemini and display it.


In [None]:
from youtube_transcript_api.formatters import TextFormatter
from youtube_transcript_api import YouTubeTranscriptApi
from IPython.display import display, Markdown


def transcript_to_markdown(text_transcript: str) -> str:
    """Transforms a text transcript into markdown format with speaker boundaries."""
    prompt = f"""
        - Given a transcript of a video in text format, convert it into markdown format.
        - Make sure no details are missed out and all the information in the transcript is included. 
        - Don't hallucinate or make up factual information. 
        - Identify 'speaker boundaries' and tag them with the speaker's name. If you don't have the name just say 'Speaker 1', 'Speaker 2', etc.
        - Clean the markdown and format it properly. 
        Transcript: {text_transcript}
    """
    response = gemini(query=prompt)
    return response


formatter = TextFormatter()

video_id = "no_XaCE969Y"
transcript = YouTubeTranscriptApi.get_transcript(video_id)
text_transcript = formatter.format_transcript(transcript)
transcript_markdown = transcript_to_markdown(text_transcript)
display_markdown = Markdown(transcript_markdown)
display(display_markdown)
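
To make the data flow concrete, the sketch below shows the shape of a fetched transcript and what reaches the model, without calling YouTube or Gemini. The sample entries are made up, and `to_plain_text` and `build_prompt` are hypothetical helpers: the first mirrors the join performed by `TextFormatter.format_transcript`, and the second is a shortened stand-in for the prompt above.

```python
# Made-up sample in the shape returned by get_transcript:
# a list of dicts with 'text', 'start' and 'duration' keys.
sample_transcript = [
    {"text": "Welcome to the show.", "start": 0.0, "duration": 2.1},
    {"text": "Thanks for having me.", "start": 2.1, "duration": 1.8},
]


def to_plain_text(transcript):
    """Join the text of every entry with newlines, dropping the timestamps."""
    return "\n".join(line["text"] for line in transcript)


def build_prompt(text_transcript):
    """Wrap the plain text in a short instruction for the model."""
    return (
        "Convert the following transcript to Markdown and label "
        "speaker boundaries.\n"
        f"Transcript: {text_transcript}"
    )


plain_text = to_plain_text(sample_transcript)
print(build_prompt(plain_text))
```

The timestamps are discarded before the model sees the text, so the Markdown output cannot reference them unless a different formatter is used.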

## Conclusion:

This app provides a streamlined way to retrieve, format, and display YouTube video transcripts in a structured and readable Markdown format. The key steps involve:

### 1. **Fetching Transcripts:**
- The app fetches the transcript data for a specified YouTube video using the `YouTubeTranscriptApi`. This allows users to get both manually created and automatically generated transcripts.

### 2. **Formatting the Transcript:**
- The raw transcript data is formatted into plain text using the `TextFormatter` class, making it easy to work with.

### 3. **Enhancing the Transcript:**
- The plain-text transcript is sent to the Gemini model with a prompt that asks for Markdown output, labels for speaker boundaries, and no information beyond what the transcript contains.

### 4. **Displaying the Output:**
- The final Markdown-formatted transcript is rendered with IPython's `display` and `Markdown`, which work in Jupyter notebooks and compatible environments.

The result is a readable, structured version of a YouTube transcript, useful for content analysis, transcription review, or preparing content for documentation.


---

# Thank You for visiting The Hackers Playbook! 🌐

If you liked this research material:

- [Subscribe to our newsletter.](https://thehackersplaybook.substack.com)

- [Follow us on LinkedIn.](https://www.linkedin.com/company/the-hackers-playbook/)

- [Leave a star on our GitHub.](https://www.github.com/thehackersplaybook)


