# YouTube Transcript API Markdown Formatter

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>

## Description:

This app fetches YouTube video transcripts, formats them into Markdown, and enhances readability by identifying speaker boundaries. It uses AI to clean up and structure the transcript, making it easy to read and share. Ideal for content creators, researchers, and educators needing organized video transcriptions.

## Step 1: Define YouTube Video URL Template

## Purpose

This step involves defining a reusable URL template for constructing YouTube video links. It lays the foundation for dynamically generating complete URLs for specific videos by substituting a placeholder with a unique video ID.

## Details

- **Constant Variable:**

  - `WATCH_URL` is a string intended to act as a template.

  - The use of uppercase naming (`WATCH_URL`) follows the convention for constants, indicating that the value should remain unchanged throughout the program.

- **URL Template:**

  - The URL represents the standard structure of a YouTube video link, with the `{video_id}` placeholder serving as a dynamic segment.

  - The fixed part of the URL is `https://www.youtube.com/watch?v=`, which remains consistent for any YouTube video.

- **Dynamic Placeholder:**

  - `{video_id}` is a placeholder that can be replaced using Python's string formatting techniques.

  - This allows the program to dynamically generate complete video URLs for any given video ID.

## Usage

- Substitute the `{video_id}` placeholder with an actual video ID to generate the full YouTube URL for a specific video.


In [5]:
WATCH_URL = "https://www.youtube.com/watch?v={video_id}"
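As a minimal sketch of how the template might be used, the snippet below fills the placeholder with `str.format`. The helper name `build_watch_url` and the sample ID are illustrative assumptions, not part of the original code.

```python
WATCH_URL = "https://www.youtube.com/watch?v={video_id}"


def build_watch_url(video_id):
    """Hypothetical helper: fill the {video_id} placeholder of the template."""
    return WATCH_URL.format(video_id=video_id)


# "abc123XYZ_-" is a made-up ID used only for illustration.
print(build_watch_url("abc123XYZ_-"))
```

Because `str.format` returns a new string, the constant itself is never modified and can be reused for any number of video IDs.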

## Step 2: Define Custom Exception Hierarchy for YouTube Transcript Retrieval Errors

## Purpose

This code defines a set of custom exception classes for the errors that can occur while retrieving YouTube transcripts with `youtube-transcript-api`, the library whose GitHub issue tracker the error messages point to. Every class inherits from the base exception `CouldNotRetrieveTranscript`, and each subclass describes one specific cause of failure. Below is a detailed explanation:

## Base Exception: `CouldNotRetrieveTranscript`

- **Purpose:**  
  This is the base exception class raised when a transcript for a YouTube video cannot be retrieved.

- **Attributes:**  
  - `ERROR_MESSAGE`: A template message that includes the URL of the video for which the transcript couldn't be retrieved.  

  - `CAUSE_MESSAGE_INTRO`: A string template for introducing the cause of the error.  

  - `CAUSE_MESSAGE`: An empty string placeholder that will be populated in derived classes.  

  - `GITHUB_REFERRAL`: A message that encourages users to report issues on GitHub if they believe the cause isn't described and gives guidance on how to report the issue.  


- **Methods:**  
  - `__init__(self, video_id)`: The constructor accepts the video ID of the YouTube video for which the transcript couldn't be retrieved. It also calls the parent constructor to initialize the exception message.  

  - `_build_error_message(self)`: This private method builds the error message by formatting the base `ERROR_MESSAGE` and adding more detailed information about the cause if available. It also appends the GitHub referral message if a cause is provided.  

  - `cause`: A property method that returns the `CAUSE_MESSAGE`. This can be overridden in subclasses to provide specific causes.

## Derived Exceptions

These are subclasses of `CouldNotRetrieveTranscript`, each with its own specific cause for the failure to retrieve a transcript.

### 1. `YouTubeRequestFailed`

- **Purpose:**  
  Raised when a request to YouTube fails due to an HTTP error.

- **Attributes:**  
  - `reason`: Stores the HTTP error message as a string.  

- **Methods:**  
  - `cause`: Overridden to include the specific HTTP error message in the cause.
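One detail worth noting in a subclass like this is the order of operations: the base constructor builds the message immediately, so any attribute that `cause` reads has to be set before the parent constructor is called. The sketch below illustrates this with simplified stand-in classes; `BaseError`, `RequestFailed`, and the shortened message format are assumptions made for the example.

```python
class BaseError(Exception):
    """Simplified stand-in for the base exception (hypothetical name)."""

    CAUSE_MESSAGE = ""

    def __init__(self, video_id):
        self.video_id = video_id
        # The message is built here, so `cause` is evaluated during construction.
        super().__init__("video {}: {}".format(video_id, self.cause))

    @property
    def cause(self):
        return self.CAUSE_MESSAGE


class RequestFailed(BaseError):
    """Simplified stand-in for a subclass with a formatted cause."""

    CAUSE_MESSAGE = "Request failed: {reason}"

    def __init__(self, video_id, http_error):
        # Set `reason` first: the parent constructor reads it through `cause`.
        self.reason = str(http_error)
        super().__init__(video_id)

    @property
    def cause(self):
        return self.CAUSE_MESSAGE.format(reason=self.reason)


print(RequestFailed("abc", ValueError("404 Not Found")))
```

Swapping the two statements in `RequestFailed.__init__` would raise an `AttributeError`, because `self.reason` would not exist yet when the message is formatted.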

### 2. `VideoUnavailable`

- **Purpose:**  
  Raised when the YouTube video is no longer available, meaning it might have been deleted or made private.

- **Cause:**  
  "The video is no longer available."

### 3. `InvalidVideoId`

- **Purpose:**  
  Raised when an invalid video ID is provided.

- **Cause:**  
  The error message instructs the user to provide only the video ID and not the entire URL.

### 4. `TooManyRequests`

- **Purpose:**  
  Raised when YouTube receives too many requests from the same IP address and requires the user to solve a CAPTCHA.

- **Cause:**  
  It provides suggestions for how to work around this, such as solving the CAPTCHA manually, using a different IP, or waiting for the ban to lift.

### 5. `TranscriptsDisabled`

- **Purpose:**  
  Raised when subtitles (transcripts) are disabled for the video by the uploader.

- **Cause:**  
  "Subtitles are disabled for this video."

### 6. `NoTranscriptAvailable`

- **Purpose:**  
  Raised when no transcript is available for the video.

- **Cause:**  
  "No transcripts are available for this video."

### 7. `NotTranslatable`

- **Purpose:**  
  Raised when the requested language for the transcript is not translatable.

- **Cause:**  
  "The requested language is not translatable."

### 8. `TranslationLanguageNotAvailable`

- **Purpose:**  
  Raised when the requested translation language is not available.

- **Cause:**  
  "The requested translation language is not available."

### 9. `CookiePathInvalid`

- **Purpose:**  
  Raised when the path to the provided cookie file is invalid.

- **Cause:**  
  "The provided cookie file was unable to be loaded."

### 10. `CookiesInvalid`

- **Purpose:**  
  Raised when the provided cookies are invalid, possibly expired.

- **Cause:**  
  "The cookies provided are not valid (may have expired)."

### 11. `FailedToCreateConsentCookie`

- **Purpose:**  
  Raised when the consent cookie needed to get past YouTube's cookie-consent page cannot be created automatically.

- **Cause:**  
  "Failed to automatically give consent to saving cookies."

### 12. `NoTranscriptFound`

- **Purpose:**  
  Raised when no transcript is found for any of the requested language codes.

- **Attributes:**  
  - `_requested_language_codes`: The list of language codes the user requested.  
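  - `_transcript_data`: The object describing the transcripts that are available for the video (the `TranscriptList` passes itself here); its string form is included in the error message.  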

- **Methods:**  
  - `cause`: Overridden to include the requested language codes and any available transcript data in the error message.

## Summary

This code provides a robust way to handle different types of errors that might occur when trying to retrieve YouTube video transcripts using the `youtube-transcript-api`. The base class provides a standard error message format, and each specific error subclass customizes the error message to provide more details on the nature of the problem.
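A shared base class also lets calling code decide how specific its error handling should be. The following sketch uses simplified stand-in classes with the same names as above (without the message-building logic) and a hypothetical helper, `describe_outcome`, to show a specific handler placed before the catch-all for the whole hierarchy.

```python
class CouldNotRetrieveTranscript(Exception):
    """Simplified stand-in for the base exception defined in this step."""


class TranscriptsDisabled(CouldNotRetrieveTranscript):
    """Simplified stand-in for one of the derived exceptions."""


class VideoUnavailable(CouldNotRetrieveTranscript):
    """Simplified stand-in for another derived exception."""


def describe_outcome(fetch):
    """Hypothetical helper: run `fetch` and summarize what happened."""
    try:
        fetch()
    except TranscriptsDisabled:
        # Most specific handler first: a cause we know how to react to.
        return "skipped: subtitles are disabled"
    except CouldNotRetrieveTranscript as error:
        # Catch-all for every other exception in the hierarchy.
        return "failed: " + type(error).__name__
    return "ok"


def raise_disabled():
    raise TranscriptsDisabled()


def raise_unavailable():
    raise VideoUnavailable()


print(describe_outcome(raise_disabled))
print(describe_outcome(raise_unavailable))
print(describe_outcome(lambda: None))
```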


In [12]:
class CouldNotRetrieveTranscript(Exception):
    """
    Raised if a transcript could not be retrieved.
    """

    ERROR_MESSAGE = "\nCould not retrieve a transcript for the video {video_url}!"
    CAUSE_MESSAGE_INTRO = " This is most likely caused by:\n\n{cause}"
    CAUSE_MESSAGE = ""
    GITHUB_REFERRAL = (
        "\n\nIf you are sure that the described cause is not responsible for this error "
        "and that a transcript should be retrievable, please create an issue at "
        "https://github.com/jdepoix/youtube-transcript-api/issues. "
        "Please add which version of youtube_transcript_api you are using "
        "and provide the information needed to replicate the error. "
        "Also make sure that there are no open issues which already describe your problem!"
    )

    def __init__(self, video_id):
        self.video_id = video_id
        super(CouldNotRetrieveTranscript, self).__init__(self._build_error_message())

    def _build_error_message(self):
        cause = self.cause
        error_message = self.ERROR_MESSAGE.format(
            video_url=WATCH_URL.format(video_id=self.video_id)
        )

        if cause:
            error_message += (
                self.CAUSE_MESSAGE_INTRO.format(cause=cause) + self.GITHUB_REFERRAL
            )

        return error_message

    @property
    def cause(self):
        return self.CAUSE_MESSAGE


class YouTubeRequestFailed(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "Request to YouTube failed: {reason}"

    def __init__(self, video_id, http_error):
        self.reason = str(http_error)
        super(YouTubeRequestFailed, self).__init__(video_id)

    @property
    def cause(self):
        return self.CAUSE_MESSAGE.format(
            reason=self.reason,
        )


class VideoUnavailable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The video is no longer available"


class InvalidVideoId(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = (
        "You provided an invalid video id. Make sure you are using the video id and NOT the url!\n\n"
        'Do NOT run: `YouTubeTranscriptApi.get_transcript("https://www.youtube.com/watch?v=1234")`\n'
        'Instead run: `YouTubeTranscriptApi.get_transcript("1234")`'
    )


class TooManyRequests(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = (
        "YouTube is receiving too many requests from this IP and now requires solving a captcha to continue. "
        "One of the following things can be done to work around this:\n\
        - Manually solve the captcha in a browser and export the cookie. "
        "Read here how to use that cookie with "
        "youtube-transcript-api: https://github.com/jdepoix/youtube-transcript-api#cookies\n\
        - Use a different IP address\n\
        - Wait until the ban on your IP has been lifted"
    )


class TranscriptsDisabled(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "Subtitles are disabled for this video"


class NoTranscriptAvailable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "No transcripts are available for this video"


class NotTranslatable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The requested language is not translatable"


class TranslationLanguageNotAvailable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The requested translation language is not available"


class CookiePathInvalid(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The provided cookie file was unable to be loaded"


class CookiesInvalid(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "The cookies provided are not valid (may have expired)"


class FailedToCreateConsentCookie(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "Failed to automatically give consent to saving cookies"


class NoTranscriptFound(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = (
        "No transcripts were found for any of the requested language codes: {requested_language_codes}\n\n"
        "{transcript_data}"
    )

    def __init__(self, video_id, requested_language_codes, transcript_data):
        self._requested_language_codes = requested_language_codes
        self._transcript_data = transcript_data
        super(NoTranscriptFound, self).__init__(video_id)

    @property
    def cause(self):
        return self.CAUSE_MESSAGE.format(
            requested_language_codes=self._requested_language_codes,
            transcript_data=str(self._transcript_data),
        )

## Step 3: Handle HTML Unescaping for Different Python Versions

This code ensures compatibility for unescaping HTML entities across different Python versions (Python 2 and 3). 

It uses conditional imports and function definitions based on the Python version to provide a consistent unescape function across environments. 

Here's a breakdown:

- **Python 3.4+**: Imports `unescape` directly from the `html` module.

- **Python 2**: Imports `HTMLParser` and uses it to create an `html_parser` object that handles unescaping.

- **Python 3.0 - 3.3**: Uses `html.parser` to create an `html_parser` object that handles unescaping.

- **Custom `unescape` function**: For Python versions other than 3.4+, a custom `unescape` function is defined on top of the appropriate parser for that version.

The `# pragma: no cover` comments tell coverage tools such as coverage.py to exclude these branches, since each one can only run under a specific Python version.
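For reference, this is what unescaping does on a modern interpreter. The sample caption text is made up for illustration.

```python
from html import unescape

# Caption text as it might appear with HTML entities still encoded.
raw_caption = "Tom &amp; Jerry said &quot;hi&quot; &#39;twice&#39;"

# Named and numeric entities are converted back to plain characters.
print(unescape(raw_caption))
```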


In [13]:
import sys


# This can only be tested by using different python versions, therefore it is not covered by coverage.py
if sys.version_info.major == 3 and sys.version_info.minor >= 4:  # pragma: no cover
    # Python 3.4+
    from html import unescape
else:  # pragma: no cover
    if sys.version_info.major <= 2:
        # Python 2
        import HTMLParser  # type: ignore

        html_parser = HTMLParser.HTMLParser()
    else:
        # Python 3.0 - 3.3
        import html.parser

        html_parser = html.parser.HTMLParser()

    def unescape(string):
        return html_parser.unescape(string)

## Step 4: Fetch and Parse YouTube Video Transcripts

### 1. Version Compatibility for Python 2 and 3

This block checks whether the code is running on Python 2. If so, it reloads the `sys` module and sets the default encoding to UTF-8.

This is a compatibility measure: Python 2 defaults to ASCII when converting implicitly between byte strings and Unicode strings, so non-ASCII caption text could otherwise cause encoding errors.

`reload(sys)` is needed because `sys.setdefaultencoding` is removed from the module at interpreter startup; reloading restores it so that `sys.setdefaultencoding("utf-8")` can be called.

On Python 3 the block is skipped, because strings are Unicode and the default encoding is already UTF-8.

### 2. Importing Necessary Modules

- `json`: Used to parse JSON data from YouTube's HTML response, specifically the caption tracks.

- `ElementTree` (from `defusedxml`): Used for safe parsing of XML data. `defusedxml` ensures that XML data is parsed securely, preventing potential security vulnerabilities such as XML External Entity (XXE) attacks.

- `re`: Regular expressions are used to search and process the transcript data, specifically for removing HTML tags.

- `HTTPError`: An exception from the requests library, used to handle HTTP errors during the web requests made to YouTube.


### 3. `_raise_http_errors`

This module-level helper checks whether an HTTP request was successful and, if not, raises a custom `YouTubeRequestFailed` exception.

`response.raise_for_status()` raises an `HTTPError` for HTTP error status codes (e.g., 404 or 500); on success the response is returned unchanged.

If an `HTTPError` is caught, the function raises `YouTubeRequestFailed`, which carries the `video_id` and the original HTTP error.
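The translation from a transport-level error to a domain-specific one can be sketched without a network connection. Everything below is a stand-in chosen for illustration: `HTTPError` replaces the `requests` exception, `FakeResponse` imitates only the `raise_for_status` behavior of a response object, and `RequestFailed` is a shortened version of `YouTubeRequestFailed`.

```python
class HTTPError(Exception):
    """Stand-in for requests.HTTPError so the sketch runs without requests."""


class RequestFailed(Exception):
    """Shortened stand-in for YouTubeRequestFailed."""

    def __init__(self, video_id, http_error):
        self.video_id = video_id
        self.reason = str(http_error)
        super().__init__("Request for {} failed: {}".format(video_id, self.reason))


class FakeResponse(object):
    """Hypothetical response object that only imitates raise_for_status()."""

    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text

    def raise_for_status(self):
        if self.status_code >= 400:
            raise HTTPError("{} Error".format(self.status_code))


def raise_http_errors(response, video_id):
    try:
        response.raise_for_status()
        return response
    except HTTPError as error:
        # Argument order matches the exception's constructor: id first, error second.
        raise RequestFailed(video_id, error)


print(raise_http_errors(FakeResponse(200, "<html>"), "abc").text)
try:
    raise_http_errors(FakeResponse(404), "abc")
except RequestFailed as error:
    print(error)
```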

### 4. `TranscriptListFetcher`

This class is responsible for fetching and building the transcript list for a given YouTube video. It requires an `http_client`, which is an instance of `requests.Session` or a compatible object.

#### Key Methods:

- **`fetch`**: The main method. It takes the `video_id` and uses several internal methods to retrieve the list of available transcripts for that video. It returns a `TranscriptList` object.

- **`_extract_captions_json`**: Extracts the captions JSON from the YouTube HTML response. It looks for the `"captions":` key in the HTML and parses the JSON that follows. If the key is missing, it raises `InvalidVideoId`, `TooManyRequests`, `VideoUnavailable`, or `TranscriptsDisabled`, depending on what the page contains; if the JSON lists no caption tracks, it raises `NoTranscriptAvailable`.

- **`_create_consent_cookie`**: YouTube sometimes returns a consent page instead of the video page. This method reads a value from the consent form and stores a `CONSENT` cookie on the HTTP client so that later requests are accepted.

- **`_fetch_video_html`**: Retrieves the HTML of the video page. If the response is the consent page, it creates the consent cookie and fetches the page again, raising `FailedToCreateConsentCookie` if the consent page is still returned.

- **`_fetch_html`**: Makes an HTTP GET request for the video page, passes the response through `_raise_http_errors`, and unescapes HTML entities in the returned text.
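The string-splitting approach used to locate the captions JSON can be sketched on a tiny, made-up page fragment. The function name `extract_captions` and the sample HTML are assumptions for illustration; a real video page is far larger, and this sketch returns `None` where the original raises one of its exceptions.

```python
import json


def extract_captions(html):
    """Hypothetical, simplified version of the captions lookup."""
    parts = html.split('"captions":')
    if len(parts) <= 1:
        # Marker not found: the original code raises a specific exception here.
        return None
    # Keep only the JSON object between '"captions":' and ',"videoDetails'.
    payload = parts[1].split(',"videoDetails')[0].replace("\n", "")
    return json.loads(payload).get("playerCaptionsTracklistRenderer")


# Made-up fragment that mimics the structure the code relies on.
sample_html = (
    'var ytInitialPlayerResponse = {"captions":'
    '{"playerCaptionsTracklistRenderer":{"captionTracks":[{"languageCode":"en"}]}}'
    ',"videoDetails":{"videoId":"abc"}};'
)

print(extract_captions(sample_html))
print(extract_captions("<html></html>"))
```

This works only as long as the page keeps `"videoDetails"` directly after the captions object, which is why the approach is sensitive to changes in YouTube's page layout.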

### 5. `TranscriptList`

This class holds the list of available transcripts for a video and provides methods to find transcripts in specific languages. 

It also handles the parsing of manually created and automatically generated transcripts.

#### Key Methods:

- **`build`**: A factory method that builds a `TranscriptList` object. It parses the captions JSON and sorts the available captions into manually created transcripts (`manually_created_transcripts`) and automatically generated transcripts (`generated_transcripts`).

- **`__iter__`**: Allows iteration over the available transcripts. Manually created transcripts come first, followed by generated ones.

- **`find_transcript`**: Finds a transcript for a prioritized list of language codes. For each language code, in order, it checks the manually created transcripts first and then the automatically generated ones, and returns the first match.

- **`find_generated_transcript`** and **`find_manually_created_transcript`**: Specialized variants that search only the automatically generated or only the manually created transcripts.

- **`_find_transcript`**: The helper behind the three methods above. It iterates over the language codes and looks for a match in each of the provided dictionaries, raising `NoTranscriptFound` if nothing matches.

- **`__str__`**: Returns a string representation of the available transcripts, covering the manually created transcripts, the generated transcripts, and the translation languages available for the video.
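The lookup order is easiest to see on plain dictionaries. In this sketch the transcripts are replaced by strings and `NoTranscriptFound` by the built-in `LookupError`; both substitutions, as well as the function name, are assumptions made to keep the example self-contained.

```python
def find_transcript(language_codes, transcript_dicts):
    """Return the first match, trying language codes in priority order."""
    for language_code in language_codes:
        for transcript_dict in transcript_dicts:
            if language_code in transcript_dict:
                return transcript_dict[language_code]
    # Stand-in for NoTranscriptFound.
    raise LookupError("no transcript for {}".format(language_codes))


manually_created = {"en": "manual-en"}
generated = {"de": "generated-de", "en": "generated-en"}

# Same language in both dictionaries: the manually created one wins.
print(find_transcript(["en"], [manually_created, generated]))

# Language priority comes first: "de" only exists as a generated transcript,
# and it is still preferred over the manually created "en" transcript.
print(find_transcript(["de", "en"], [manually_created, generated]))
```

Because the outer loop runs over language codes, language priority outranks the manual-versus-generated preference.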

### 6. `Transcript`

Represents an individual transcript for a video. It stores metadata such as the language, URL, and whether the transcript is automatically generated.

#### Key Methods:

- **`fetch`**: Retrieves the actual transcript data with an HTTP GET request to the transcript URL. It uses `_TranscriptParser` to turn the response into a list of dictionaries containing the text, start time, and duration of each segment.

- **`translate`**: Returns a new `Transcript` for the requested target language by appending a `tlang` parameter to the transcript URL. It raises `NotTranslatable` if the transcript has no translation languages and `TranslationLanguageNotAvailable` if the target language is not among them.

- **`__str__`**: Provides a string representation of the transcript, including the language and whether it is translatable.

- **`is_translatable`**: A property that returns `True` if the transcript has available translation languages.
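The checks performed before a translation is requested, and the way the target language ends up in the URL, might be sketched as follows. The helper name, the use of `ValueError` in place of the custom exceptions, and the `example.com` URL are all assumptions for illustration.

```python
def build_translated_url(base_url, language_code, translation_languages):
    """Hypothetical helper mirroring the checks described for `translate`."""
    available = {
        entry["language_code"]: entry["language"] for entry in translation_languages
    }
    if not available:
        # Stand-in for NotTranslatable.
        raise ValueError("transcript is not translatable")
    if language_code not in available:
        # Stand-in for TranslationLanguageNotAvailable.
        raise ValueError("translation language not available: " + language_code)
    return "{url}&tlang={language_code}".format(
        url=base_url, language_code=language_code
    )


languages = [{"language": "German", "language_code": "de"}]
base_url = "https://example.com/timedtext?v=abc&lang=en"  # made-up transcript URL

print(build_translated_url(base_url, "de", languages))
```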

### 7. `_TranscriptParser`

This class parses the transcript XML data into a usable format. Depending on the `preserve_formatting` flag, it either removes all HTML tags or preserves certain formatting tags (such as `strong` and `em`).

#### Key Methods:

- **`_get_html_regex`**: Returns the regular expression used to strip HTML tags. If `preserve_formatting` is `True`, the expression leaves formatting tags such as `strong` and `em` in place and removes everything else; otherwise it removes all tags.

- **`parse`**: Parses the XML data and returns a list of dictionaries, each containing the text, start time, and duration of one transcript segment.
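The effect of the two regular expressions can be tried on a short string. The patterns below are the ones built in `_get_html_regex`; the helper names, the sample text, and the assumption that matches are replaced with an empty string are illustrative.

```python
import re

FORMATTING_TAGS = ["strong", "em", "b", "i", "mark", "small", "del", "ins", "sub", "sup"]


def build_tag_regex(preserve_formatting):
    """Build the tag-matching pattern for the given mode."""
    if preserve_formatting:
        allowed = "|".join(FORMATTING_TAGS)
        # Match any tag whose name is NOT one of the allowed formatting tags.
        return re.compile(r"<\/?(?!\/?(" + allowed + r")\b).*?\b>", re.IGNORECASE)
    # Match every tag.
    return re.compile(r"<[^>]*>", re.IGNORECASE)


def strip_tags(text, preserve_formatting=False):
    """Hypothetical helper: remove whatever the chosen pattern matches."""
    return build_tag_regex(preserve_formatting).sub("", text)


segment = "<i>Hello</i> <font>world</font>"  # made-up caption segment

print(strip_tags(segment))
print(strip_tags(segment, preserve_formatting=True))
```

The first call removes every tag, while the second keeps the `<i>` pair and drops only `<font>`.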

### 8. Error Handling

The code raises several custom exceptions when certain conditions are met:

- **YouTubeRequestFailed**: Raised if an HTTP request fails.

- **InvalidVideoId**: Raised if the video ID is invalid.

- **TranscriptsDisabled**: Raised if the video has transcripts disabled.

- **NoTranscriptAvailable**: Raised if no transcripts are available.

- **TooManyRequests**: Raised if YouTube has blocked the requesting IP address because of too many requests and now asks for a CAPTCHA.


These exceptions help handle specific issues and ensure that errors are meaningful and easily traceable.

### Summary

This code is designed to fetch, manage, and parse YouTube video transcripts using both manually created and automatically generated captions.

It handles version compatibility between Python 2 and 3, provides methods to find and retrieve transcripts in different languages, and includes robust error handling for various failure scenarios (such as missing captions and consent issues).

By abstracting the transcript-fetching process into several classes (TranscriptListFetcher, TranscriptList, Transcript, _TranscriptParser), the code is modular, making it easier to manage and extend.


In [14]:
import sys

# This can only be tested by using different python versions, therefore it is not covered by coverage.py
if sys.version_info.major == 2:  # pragma: no cover
    # ruff: noqa: F821
    reload(sys)
    sys.setdefaultencoding("utf-8")

import json

from defusedxml import ElementTree

import re

from requests import HTTPError


def _raise_http_errors(response, video_id):
    try:
        response.raise_for_status()
        return response
    except HTTPError as error:
        # YouTubeRequestFailed expects (video_id, http_error), in that order.
        raise YouTubeRequestFailed(video_id, error)


class TranscriptListFetcher(object):
    def __init__(self, http_client):
        self._http_client = http_client

    def fetch(self, video_id):
        return TranscriptList.build(
            self._http_client,
            video_id,
            self._extract_captions_json(self._fetch_video_html(video_id), video_id),
        )

    def _extract_captions_json(self, html, video_id):
        splitted_html = html.split('"captions":')

        if len(splitted_html) <= 1:
            if video_id.startswith("http://") or video_id.startswith("https://"):
                raise InvalidVideoId(video_id)
            if 'class="g-recaptcha"' in html:
                raise TooManyRequests(video_id)
            if '"playabilityStatus":' not in html:
                raise VideoUnavailable(video_id)

            raise TranscriptsDisabled(video_id)

        captions_json = json.loads(
            splitted_html[1].split(',"videoDetails')[0].replace("\n", "")
        ).get("playerCaptionsTracklistRenderer")
        if captions_json is None:
            raise TranscriptsDisabled(video_id)

        if "captionTracks" not in captions_json:
            raise NoTranscriptAvailable(video_id)

        return captions_json

    def _create_consent_cookie(self, html, video_id):
        match = re.search('name="v" value="(.*?)"', html)
        if match is None:
            raise FailedToCreateConsentCookie(video_id)
        self._http_client.cookies.set(
            "CONSENT", "YES+" + match.group(1), domain=".youtube.com"
        )

    def _fetch_video_html(self, video_id):
        html = self._fetch_html(video_id)
        if 'action="https://consent.youtube.com/s"' in html:
            self._create_consent_cookie(html, video_id)
            html = self._fetch_html(video_id)
            if 'action="https://consent.youtube.com/s"' in html:
                raise FailedToCreateConsentCookie(video_id)
        return html

    def _fetch_html(self, video_id):
        response = self._http_client.get(
            WATCH_URL.format(video_id=video_id), headers={"Accept-Language": "en-US"}
        )
        return unescape(_raise_http_errors(response, video_id).text)


class TranscriptList(object):
    """
    This object represents a list of transcripts. It can be iterated over to list all transcripts which are available
    for a given YouTube video. Also it provides functionality to search for a transcript in a given language.
    """

    def __init__(
        self,
        video_id,
        manually_created_transcripts,
        generated_transcripts,
        translation_languages,
    ):
        """
        The constructor is only for internal use. Use the static build method instead.

        :param video_id: the id of the video this TranscriptList is for
        :type video_id: str
        :param manually_created_transcripts: dict mapping language codes to the manually created transcripts
        :type manually_created_transcripts: dict[str, Transcript]
        :param generated_transcripts: dict mapping language codes to the generated transcripts
        :type generated_transcripts: dict[str, Transcript]
        :param translation_languages: list of languages which can be used for translatable languages
        :type translation_languages: list[dict[str, str]]
        """
        self.video_id = video_id
        self._manually_created_transcripts = manually_created_transcripts
        self._generated_transcripts = generated_transcripts
        self._translation_languages = translation_languages

    @staticmethod
    def build(http_client, video_id, captions_json):
        """
        Factory method for TranscriptList.

        :param http_client: http client which is used to make the transcript retrieving http calls
        :type http_client: requests.Session
        :param video_id: the id of the video this TranscriptList is for
        :type video_id: str
        :param captions_json: the JSON parsed from the YouTube pages static HTML
        :type captions_json: dict
        :return: the created TranscriptList
        :rtype TranscriptList:
        """
        translation_languages = [
            {
                "language": translation_language["languageName"]["simpleText"],
                "language_code": translation_language["languageCode"],
            }
            for translation_language in captions_json.get("translationLanguages", [])
        ]

        manually_created_transcripts = {}
        generated_transcripts = {}

        for caption in captions_json["captionTracks"]:
            if caption.get("kind", "") == "asr":
                transcript_dict = generated_transcripts
            else:
                transcript_dict = manually_created_transcripts

            transcript_dict[caption["languageCode"]] = Transcript(
                http_client,
                video_id,
                caption["baseUrl"],
                caption["name"]["simpleText"],
                caption["languageCode"],
                caption.get("kind", "") == "asr",
                translation_languages if caption.get("isTranslatable", False) else [],
            )

        return TranscriptList(
            video_id,
            manually_created_transcripts,
            generated_transcripts,
            translation_languages,
        )

    def __iter__(self):
        return iter(
            list(self._manually_created_transcripts.values())
            + list(self._generated_transcripts.values())
        )

    def find_transcript(self, language_codes):
        """
        Finds a transcript for a given language code. Manually created transcripts are returned first and only if none
        are found, generated transcripts are used. If you only want generated transcripts use
        `find_generated_transcript` instead.

        :param language_codes: A list of language codes in a descending priority. For example, if this is set to
        ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if
        it fails to do so.
        :type language_codes: list[str]
        :return: the found Transcript
        :rtype Transcript:
        :raises: NoTranscriptFound
        """
        return self._find_transcript(
            language_codes,
            [self._manually_created_transcripts, self._generated_transcripts],
        )

    def find_generated_transcript(self, language_codes):
        """
        Finds an automatically generated transcript for a given language code.

        :param language_codes: A list of language codes in a descending priority. For example, if this is set to
        ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if
        it fails to do so.
        :type language_codes: list[str]
        :return: the found Transcript
        :rtype Transcript:
        :raises: NoTranscriptFound
        """
        return self._find_transcript(language_codes, [self._generated_transcripts])

    def find_manually_created_transcript(self, language_codes):
        """
        Finds a manually created transcript for a given language code.

        :param language_codes: A list of language codes in a descending priority. For example, if this is set to
        ['de', 'en'] it will first try to fetch the german transcript (de) and then fetch the english transcript (en) if
        it fails to do so.
        :type language_codes: list[str]
        :return: the found Transcript
        :rtype Transcript:
        :raises: NoTranscriptFound
        """
        return self._find_transcript(
            language_codes, [self._manually_created_transcripts]
        )

    def _find_transcript(self, language_codes, transcript_dicts):
        for language_code in language_codes:
            for transcript_dict in transcript_dicts:
                if language_code in transcript_dict:
                    return transcript_dict[language_code]

        raise NoTranscriptFound(self.video_id, language_codes, self)

    def __str__(self):
        return (
            "For this video ({video_id}) transcripts are available in the following languages:\n\n"
            "(MANUALLY CREATED)\n"
            "{available_manually_created_transcript_languages}\n\n"
            "(GENERATED)\n"
            "{available_generated_transcripts}\n\n"
            "(TRANSLATION LANGUAGES)\n"
            "{available_translation_languages}"
        ).format(
            video_id=self.video_id,
            available_manually_created_transcript_languages=self._get_language_description(
                str(transcript)
                for transcript in self._manually_created_transcripts.values()
            ),
            available_generated_transcripts=self._get_language_description(
                str(transcript) for transcript in self._generated_transcripts.values()
            ),
            available_translation_languages=self._get_language_description(
                '{language_code} ("{language}")'.format(
                    language=translation_language["language"],
                    language_code=translation_language["language_code"],
                )
                for translation_language in self._translation_languages
            ),
        )

    def _get_language_description(self, transcript_strings):
        description = "\n".join(
            " - {transcript}".format(transcript=transcript)
            for transcript in transcript_strings
        )
        return description if description else "None"


class Transcript(object):
    def __init__(
        self,
        http_client,
        video_id,
        url,
        language,
        language_code,
        is_generated,
        translation_languages,
    ):
        """
        You probably don't want to initialize this directly. Usually you'll access Transcript objects using a
        TranscriptList.

        :param http_client: http client which is used to make the transcript retrieving http calls
        :type http_client: requests.Session
        :param video_id: the id of the video this Transcript is for
        :type video_id: str
        :param url: the url which needs to be called to fetch the transcript
        :param language: the name of the language this transcript uses
        :param language_code: the language code of this transcript
        :param is_generated: whether the transcript was automatically generated
        :param translation_languages: the languages this transcript can be translated into
        """
        self._http_client = http_client
        self.video_id = video_id
        self._url = url
        self.language = language
        self.language_code = language_code
        self.is_generated = is_generated
        self.translation_languages = translation_languages
        self._translation_languages_dict = {
            translation_language["language_code"]: translation_language["language"]
            for translation_language in translation_languages
        }

    def fetch(self, preserve_formatting=False):
        """
        Loads the actual transcript data.
        :param preserve_formatting: whether to keep select HTML text formatting
        :type preserve_formatting: bool
        :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys
        :rtype [{'text': str, 'start': float, 'duration': float}]:
        """
        response = self._http_client.get(
            self._url, headers={"Accept-Language": "en-US"}
        )
        return _TranscriptParser(preserve_formatting=preserve_formatting).parse(
            _raise_http_errors(response, self.video_id).text,
        )

    def __str__(self):
        return '{language_code} ("{language}"){translation_description}'.format(
            language=self.language,
            language_code=self.language_code,
            translation_description="[TRANSLATABLE]" if self.is_translatable else "",
        )

    @property
    def is_translatable(self):
        return len(self.translation_languages) > 0

    def translate(self, language_code):
        if not self.is_translatable:
            raise NotTranslatable(self.video_id)

        if language_code not in self._translation_languages_dict:
            raise TranslationLanguageNotAvailable(self.video_id)

        return Transcript(
            self._http_client,
            self.video_id,
            "{url}&tlang={language_code}".format(
                url=self._url, language_code=language_code
            ),
            self._translation_languages_dict[language_code],
            language_code,
            True,
            [],
        )


class _TranscriptParser(object):
    _FORMATTING_TAGS = [
        "strong",  # important
        "em",  # emphasized
        "b",  # bold
        "i",  # italic
        "mark",  # marked
        "small",  # smaller
        "del",  # deleted
        "ins",  # inserted
        "sub",  # subscript
        "sup",  # superscript
    ]

    def __init__(self, preserve_formatting=False):
        self._html_regex = self._get_html_regex(preserve_formatting)

    def _get_html_regex(self, preserve_formatting):
        if preserve_formatting:
            formats_regex = "|".join(self._FORMATTING_TAGS)
            formats_regex = r"<\/?(?!\/?(" + formats_regex + r")\b).*?\b>"
            html_regex = re.compile(formats_regex, re.IGNORECASE)
        else:
            html_regex = re.compile(r"<[^>]*>", re.IGNORECASE)
        return html_regex

    def parse(self, plain_data):
        return [
            {
                "text": re.sub(self._html_regex, "", unescape(xml_element.text)),
                "start": float(xml_element.attrib["start"]),
                "duration": float(xml_element.attrib.get("dur", "0.0")),
            }
            for xml_element in ElementTree.fromstring(plain_data)
            if xml_element.text is not None
        ]

## Step 5: Transcript Formatter System

This code defines a system for formatting transcripts into different formats, such as plain text, JSON, pretty-print, SRT (SubRip Subtitle), and WebVTT (Web Video Text Tracks). 

### 1. `Formatter Class`

#### Purpose:
This is an abstract base class for all formatters. Subclasses of Formatter must implement the `format_transcript()` and `format_transcripts()` methods.

#### Methods:
- **format_transcript()**: Should be implemented by subclasses to return a formatted string representation of a single transcript.

- **format_transcripts()**: Should be implemented by subclasses to return a formatted string representation of a list of transcripts.
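
As a minimal sketch of this contract, the snippet below defines a stand-in base class with the same two methods and a hypothetical `BulletListFormatter` subclass. The subclass and the sample data are illustrative assumptions, not part of the original code; they only show what implementing both methods can look like.

```python
class Formatter(object):
    """Stand-in for the abstract base class described above."""

    def format_transcript(self, transcript, **kwargs):
        raise NotImplementedError("Subclasses must implement format_transcript().")

    def format_transcripts(self, transcripts, **kwargs):
        raise NotImplementedError("Subclasses must implement format_transcripts().")


class BulletListFormatter(Formatter):
    """Hypothetical subclass: renders each transcript line as a Markdown bullet."""

    def format_transcript(self, transcript, **kwargs):
        return "\n".join("- " + line["text"] for line in transcript)

    def format_transcripts(self, transcripts, **kwargs):
        return "\n\n".join(self.format_transcript(t, **kwargs) for t in transcripts)


# Made-up sample data in the list-of-dicts shape used throughout this notebook.
sample = [
    {"text": "Hello", "start": 0.0, "duration": 1.5},
    {"text": "world", "start": 1.5, "duration": 2.0},
]
bullets = BulletListFormatter().format_transcript(sample)

# Calling the base class directly fails, which is what makes it "abstract".
try:
    Formatter().format_transcript(sample)
    base_is_abstract = False
except NotImplementedError:
    base_is_abstract = True
```

Raising `NotImplementedError` keeps the base class dependency-free; the `abc` module would be an alternative, but the code in this notebook does not use it.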

### 2. `PrettyPrintFormatter Class`

#### Purpose:

Formats a transcript using Python's `pprint` (pretty print) module for a readable output.

#### Methods:

- **format_transcript()**: Converts a transcript into a pretty-printed string.

- **format_transcripts()**: Formats a list of transcripts by using the `format_transcript()` method.

### 3. `JSONFormatter Class`

#### Purpose:

Converts transcripts into a JSON string.

#### Methods:

- **format_transcript()**: Converts a single transcript into a JSON string using `json.dumps()`.

- **format_transcripts()**: Converts a list of transcripts into a JSON string.

### 4. `TextFormatter Class`

#### Purpose:

Converts transcripts into plain text with no timestamps.

#### Methods:

- **format_transcript()**: Returns a string where the transcript's text is joined with newline breaks.

- **format_transcripts()**: Converts a list of transcripts into plain text, with consecutive transcripts separated by three newline characters (two blank lines).
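
The two joins can be sketched as standalone functions. The function names and sample data here are assumptions for illustration; the separators (`"\n"` within a transcript, `"\n\n\n"` between transcripts) mirror the `TextFormatter` code cell below.

```python
def to_text(transcript):
    # One line of output per transcript entry; timestamps are dropped.
    return "\n".join(line["text"] for line in transcript)


def to_text_many(transcripts):
    # Three newline characters leave two blank lines between transcripts.
    return "\n\n\n".join(to_text(transcript) for transcript in transcripts)


first = [
    {"text": "Hello", "start": 0.0, "duration": 1.5},
    {"text": "world", "start": 1.5, "duration": 2.0},
]
second = [{"text": "Second video", "start": 0.0, "duration": 3.0}]

single = to_text(first)
combined = to_text_many([first, second])
```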

### 5. `_TextBasedFormatter Class`

#### Purpose:

This is a base class for the timestamped text formats (SRT and WebVTT) and extends `TextFormatter`. It overrides `format_transcript()` to build one cue per transcript line, and it defines helper methods that handle timestamps and the formatting of each cue.

#### Methods:

- **_format_timestamp()**: Renders hours, minutes, seconds, and milliseconds as a timestamp string (this method is abstract and must be implemented by subclasses).

- **_format_transcript_header()**: Formats the transcript header (this method is abstract and must be implemented by subclasses).

- **_format_transcript_helper()**: Helper method that formats each transcript line (this method is abstract and must be implemented by subclasses).

- **_seconds_to_timestamp()**: Splits a floating-point number of seconds into hours, minutes, seconds, and milliseconds, then passes them to `_format_timestamp()`.
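
The conversion can be sketched outside the class. The function names below are assumptions; the arithmetic follows `_seconds_to_timestamp()` in the code cell further down, and the output format is the WebVTT variant.

```python
def seconds_to_parts(time):
    """Split a time in seconds into (hours, minutes, seconds, milliseconds)."""
    time = float(time)
    hours_float, remainder = divmod(time, 3600)
    mins_float, secs_float = divmod(remainder, 60)
    # The fractional part of the original value becomes the milliseconds.
    ms = int(round((time - int(time)) * 1000, 2))
    return int(hours_float), int(mins_float), int(secs_float), ms


def webvtt_timestamp(time):
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a dot before the milliseconds)."""
    return "{:02d}:{:02d}:{:02d}.{:03d}".format(*seconds_to_parts(time))


short_example = webvtt_timestamp(6.93)    # '00:00:06.930'
long_example = webvtt_timestamp(3661.5)   # '01:01:01.500'
```

Splitting the numeric work from the string formatting is what lets the SRT and WebVTT subclasses differ only in the separator they place before the milliseconds.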

### 6. `SRTFormatter Class`

#### Purpose:

Formats a transcript into the SRT subtitle format, which includes a specific timestamp format and line numbering.

#### Methods:
- **_format_timestamp()**: Converts a timestamp into SRT format (e.g., 00:00:06,930).

- **_format_transcript_header()**: Joins the formatted cues with blank lines and appends a trailing newline; SRT has no file header.

- **_format_transcript_helper()**: Formats each transcript line in the format required by SRT, including line numbers and timestamps.
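
A self-contained sketch of the SRT output follows. The function names and sample data are assumptions; the cue layout and the rule that a cue ends early when the next line starts before it would finish are taken from `_TextBasedFormatter.format_transcript()` and `SRTFormatter` in the code cell below.

```python
def srt_timestamp(time):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before the milliseconds)."""
    time = float(time)
    hours, remainder = divmod(time, 3600)
    mins, secs = divmod(remainder, 60)
    ms = int(round((time - int(time)) * 1000, 2))
    return "{:02d}:{:02d}:{:02d},{:03d}".format(int(hours), int(mins), int(secs), ms)


def to_srt(transcript):
    cues = []
    for i, line in enumerate(transcript):
        end = line["start"] + line["duration"]
        # Clip the cue if the next line starts before this one would end.
        if i < len(transcript) - 1 and transcript[i + 1]["start"] < end:
            end = transcript[i + 1]["start"]
        time_text = "{} --> {}".format(srt_timestamp(line["start"]), srt_timestamp(end))
        # SRT cues are numbered from 1.
        cues.append("{}\n{}\n{}".format(i + 1, time_text, line["text"]))
    return "\n\n".join(cues) + "\n"


# The first cue nominally runs until 2.5 s but the second starts at 2.0 s.
sample = [
    {"text": "Hello", "start": 0.0, "duration": 2.5},
    {"text": "world", "start": 2.0, "duration": 1.5},
]
srt_output = to_srt(sample)
```

With this sample the first cue is clipped to end at `00:00:02,000`, so the two cues never overlap on screen.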


### 7. `WebVTTFormatter Class`

#### Purpose:

Formats a transcript into the WebVTT format, typically used for captions and subtitles on the web.

#### Methods:

- **_format_timestamp()**: Converts a timestamp into WebVTT format (e.g., 00:00:06.930).

- **_format_transcript_header()**: Adds the "WEBVTT" header, then joins the cues with blank lines and appends a trailing newline.

- **_format_transcript_helper()**: Formats each transcript line with timestamps in WebVTT format; unlike SRT, cues are not numbered.


### 8. `FormatterLoader Class`

#### Purpose:

Loads the appropriate formatter class based on the requested formatter type (e.g., JSON, pretty, text, WebVTT, SRT).

#### Methods:

- **load()**: Takes the formatter type as input and returns the corresponding formatter object. If the requested formatter type is invalid, it raises an `UnknownFormatterType` exception.

### 9. `Exception Handling`

#### UnknownFormatterType Exception:
Raised if a non-supported formatter type is requested, providing a clear error message with the available options.
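
The lookup-and-raise pattern can be sketched with a plain dictionary. This is a simplified illustration, not the notebook's implementation: it maps names to standard-library functions instead of formatter classes, and `FORMATTERS`, `load_formatter`, and the sample data are assumed names.

```python
import json
import pprint


class UnknownFormatterType(Exception):
    """Raised when the requested format name is not registered."""


# Simplified registry: format name -> callable that renders a transcript.
FORMATTERS = {
    "json": json.dumps,
    "pretty": pprint.pformat,
}


def load_formatter(formatter_type="pretty"):
    if formatter_type not in FORMATTERS:
        raise UnknownFormatterType(
            "The format '{}' is not supported. Choose one of: {}".format(
                formatter_type, ", ".join(FORMATTERS)
            )
        )
    return FORMATTERS[formatter_type]


sample = [{"text": "Hello", "start": 0.0, "duration": 1.5}]
as_json = load_formatter("json")(sample)

# An unregistered name produces an error that lists the valid choices.
try:
    load_formatter("yaml")
    error_message = None
except UnknownFormatterType as error:
    error_message = str(error)
```

Keeping the registry in one dictionary means a new format only needs a new entry, and the error message stays in sync with the supported names automatically.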

### 10. `Example Use Case`

If you want to format a transcript into a WebVTT subtitle file, you would use `FormatterLoader` to load the `WebVTTFormatter` and then call `format_transcript()` to generate the formatted string.

```python
formatter_loader = FormatterLoader()

webvtt_formatter = formatter_loader.load("webvtt")

formatted_transcript = webvtt_formatter.format_transcript(transcript)
```


In [15]:
import json

import pprint


class Formatter(object):
    """Formatter should be used as an abstract base class.

    Formatter classes should inherit from this class and implement
    their own .format_transcript() and .format_transcripts() methods,
    each of which should return a string. A transcript is represented
    by a list of dictionary items.
    """

    def format_transcript(self, transcript, **kwargs):
        raise NotImplementedError(
            "A subclass of Formatter must implement "
            "their own .format_transcript() method."
        )

    def format_transcripts(self, transcripts, **kwargs):
        raise NotImplementedError(
            "A subclass of Formatter must implement "
            "their own .format_transcripts() method."
        )


class PrettyPrintFormatter(Formatter):
    def format_transcript(self, transcript, **kwargs):
        """Pretty prints a transcript.

        :param transcript:
        :return: A pretty printed string representation of the transcript.
        :rtype: str
        """
        return pprint.pformat(transcript, **kwargs)

    def format_transcripts(self, transcripts, **kwargs):
        """Pretty prints a list of transcripts.

        :param transcripts:
        :return: A pretty printed string representation of the transcripts.
        :rtype: str
        """
        return self.format_transcript(transcripts, **kwargs)


class JSONFormatter(Formatter):
    def format_transcript(self, transcript, **kwargs):
        """Converts a transcript into a JSON string.

        :param transcript:
        :return: A JSON string representation of the transcript.
        :rtype: str
        """
        return json.dumps(transcript, **kwargs)

    def format_transcripts(self, transcripts, **kwargs):
        """Converts a list of transcripts into a JSON string.

        :param transcripts:
        :return: A JSON string representation of the transcripts.
        :rtype: str
        """
        return self.format_transcript(transcripts, **kwargs)


class TextFormatter(Formatter):
    def format_transcript(self, transcript, **kwargs):
        """Converts a transcript into plain text with no timestamps.

        :param transcript:
        :return: all transcript text lines separated by newline breaks.
        :rtype: str
        """
        return "\n".join(line["text"] for line in transcript)

    def format_transcripts(self, transcripts, **kwargs):
        """Converts a list of transcripts into plain text with no timestamps.

        :param transcripts:
        :return: the text of every transcript, with transcripts separated by two blank lines.
        :rtype: str
        """
        return "\n\n\n".join(
            [self.format_transcript(transcript, **kwargs) for transcript in transcripts]
        )


class _TextBasedFormatter(TextFormatter):
    def _format_timestamp(self, hours, mins, secs, ms):
        raise NotImplementedError(
            "A subclass of _TextBasedFormatter must implement "
            "their own _format_timestamp method."
        )

    def _format_transcript_header(self, lines):
        raise NotImplementedError(
            "A subclass of _TextBasedFormatter must implement "
            "their own _format_transcript_header method."
        )

    def _format_transcript_helper(self, i, time_text, line):
        raise NotImplementedError(
            "A subclass of _TextBasedFormatter must implement "
            "their own _format_transcript_helper method."
        )

    def _seconds_to_timestamp(self, time):
        """Helper that converts `time` into a transcript cue timestamp.

        :reference: https://www.w3.org/TR/webvtt1/#webvtt-timestamp

        :param time: a float representing time in seconds.
        :type time: float
        :return: a string formatted as a cue timestamp, 'HH:MM:SS.MS'
        :rtype str
        :example:
        >>> self._seconds_to_timestamp(6.93)
        '00:00:06.930'
        """
        time = float(time)
        hours_float, remainder = divmod(time, 3600)
        mins_float, secs_float = divmod(remainder, 60)
        hours, mins, secs = int(hours_float), int(mins_float), int(secs_float)
        ms = int(round((time - int(time)) * 1000, 2))
        return self._format_timestamp(hours, mins, secs, ms)

    def format_transcript(self, transcript, **kwargs):
        """A basic implementation of WEBVTT/SRT formatting.

        :param transcript:
        :reference:
        https://www.w3.org/TR/webvtt1/#introduction-caption
        https://www.3playmedia.com/blog/create-srt-file/
        """
        lines = []
        for i, line in enumerate(transcript):
            end = line["start"] + line["duration"]
            time_text = "{} --> {}".format(
                self._seconds_to_timestamp(line["start"]),
                self._seconds_to_timestamp(
                    transcript[i + 1]["start"]
                    if i < len(transcript) - 1 and transcript[i + 1]["start"] < end
                    else end
                ),
            )
            lines.append(self._format_transcript_helper(i, time_text, line))

        return self._format_transcript_header(lines)


class SRTFormatter(_TextBasedFormatter):
    def _format_timestamp(self, hours, mins, secs, ms):
        return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, mins, secs, ms)

    def _format_transcript_header(self, lines):
        return "\n\n".join(lines) + "\n"

    def _format_transcript_helper(self, i, time_text, line):
        return "{}\n{}\n{}".format(i + 1, time_text, line["text"])


class WebVTTFormatter(_TextBasedFormatter):
    def _format_timestamp(self, hours, mins, secs, ms):
        return "{:02d}:{:02d}:{:02d}.{:03d}".format(hours, mins, secs, ms)

    def _format_transcript_header(self, lines):
        return "WEBVTT\n\n" + "\n\n".join(lines) + "\n"

    def _format_transcript_helper(self, i, time_text, line):
        return "{}\n{}".format(time_text, line["text"])


class FormatterLoader(object):
    TYPES = {
        "json": JSONFormatter,
        "pretty": PrettyPrintFormatter,
        "text": TextFormatter,
        "webvtt": WebVTTFormatter,
        "srt": SRTFormatter,
    }

    class UnknownFormatterType(Exception):
        def __init__(self, formatter_type):
            super(FormatterLoader.UnknownFormatterType, self).__init__(
                "The format '{formatter_type}' is not supported. "
                "Choose one of the following formats: {supported_formatter_types}".format(
                    formatter_type=formatter_type,
                    supported_formatter_types=", ".join(FormatterLoader.TYPES.keys()),
                )
            )

    def load(self, formatter_type="pretty"):
        """
        Loads the Formatter for the given formatter type.

        :param formatter_type:
        :return: Formatter object
        """
        if formatter_type not in FormatterLoader.TYPES.keys():
            raise FormatterLoader.UnknownFormatterType(formatter_type)
        return FormatterLoader.TYPES[formatter_type]()

## Step 6: YouTube Transcript API Integration

### 1. Imports:

- **requests**: This module is used to make HTTP requests in Python. It simplifies working with web data, especially in handling responses, including headers, cookies, and payloads.

- **cookiejar**: Handles cookies for HTTP requests. Cookies are small pieces of data that a server stores on the client; here they carry YouTube authentication data. The code first tries to import `http.cookiejar` (Python 3) and falls back to `cookielib` (Python 2) if that fails, so the same code can run on both versions. The import block also defines `CookieLoadError`, the exception types that `_load_cookies` treats as an unreadable cookie file.

### 2. Class Definition: `YouTubeTranscriptApi`

This class contains methods to fetch transcripts for YouTube videos. It retrieves transcripts, handles language preferences, and reports errors. All of its methods are class methods, so they are called on the class itself and no instance is needed.

### 3. `list_transcripts` Method:

This method retrieves a list of available transcripts for a specific YouTube video using its `video_id`.

- It starts by creating a session for HTTP requests. Sessions allow you to maintain persistent settings like cookies and headers across requests.
- If the `cookies` parameter is provided, it loads the cookies from the specified file to authenticate requests.
- If proxies are provided, they are used for routing the requests through specific network paths.
- The method then calls a helper class `TranscriptListFetcher`, which uses the session to fetch available transcripts for the video.

### 4. `get_transcripts` Method:

This method allows fetching transcripts for multiple videos at once.

- It accepts a list of `video_ids`, a list of preferred languages (for transcript language preference), and other optional parameters like proxy settings and cookies.
- The method iterates over each video in the list and tries to retrieve its transcript. If an error occurs for any video and `continue_after_error` is set to `True`, it skips that video and continues with the others. Otherwise, the error is raised, stopping the process.
- It returns a tuple: a dictionary mapping video IDs to their transcripts, and a list of the video IDs that couldn’t be fetched.
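
The error-handling loop can be sketched without any network access. `fetch_many` and `fake_fetch` are hypothetical names; the injected `fetch_one` callable stands in for `get_transcript`, and the control flow follows `get_transcripts` in the code cell below.

```python
def fetch_many(video_ids, fetch_one, continue_after_error=False):
    """Fetch several transcripts, optionally skipping the ones that fail."""
    data = {}
    unretrievable = []
    for video_id in video_ids:
        try:
            data[video_id] = fetch_one(video_id)
        except Exception:
            if not continue_after_error:
                raise  # stop at the first failure
            unretrievable.append(video_id)
    return data, unretrievable


def fake_fetch(video_id):
    """Hypothetical offline fetcher that fails for one specific id."""
    if video_id == "broken":
        raise ValueError("no transcript for " + video_id)
    return [{"text": "hi from " + video_id, "start": 0.0, "duration": 1.0}]


data, failed = fetch_many(["a", "broken", "b"], fake_fetch, continue_after_error=True)

# With the default setting, the first error propagates to the caller.
try:
    fetch_many(["a", "broken", "b"], fake_fetch)
    stopped_early = False
except ValueError:
    stopped_early = True
```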

### 5. `get_transcript` Method:

This method retrieves the transcript for a single video.

- It calls `list_transcripts` and then `find_transcript` on the returned `TranscriptList` to select the transcript in the preferred language(s).
- After finding the desired transcript, it fetches the actual content; setting `preserve_formatting=True` keeps select HTML formatting tags.

### 6. `_load_cookies` Method:

This method is used to load cookies from a file, which may contain authorization data needed to access transcripts.

- It attempts to load cookies using the `cookiejar.MozillaCookieJar` object, which is specifically designed to work with cookies in a format similar to those used by Firefox.
- If the cookies can’t be loaded (e.g., due to an invalid path or file), it raises a custom exception to indicate that the cookies file is either missing or invalid.
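
The loading logic can be tried offline with the standard library alone. In this sketch `CookieFileError`, `load_cookie_file`, and the cookie values are assumptions; it collapses the two custom exceptions into one for brevity and writes a small Netscape-format cookie file to a temporary directory.

```python
import http.cookiejar as cookiejar
import os
import tempfile


class CookieFileError(Exception):
    """Hypothetical stand-in for the CookiePathInvalid / CookiesInvalid exceptions."""


def load_cookie_file(path):
    jar = cookiejar.MozillaCookieJar()
    try:
        jar.load(path)
    except (FileNotFoundError, cookiejar.LoadError):
        raise CookieFileError("cannot read cookie file: " + path)
    if not jar:  # an empty jar is falsy
        raise CookieFileError("cookie file contains no usable cookies: " + path)
    return jar


# Netscape format: domain, subdomain flag, path, secure flag, expiry, name, value.
# The expiry (year 2100) is in the future so the cookie is not skipped as expired.
COOKIE_FILE = (
    "# Netscape HTTP Cookie File\n"
    ".example.com\tTRUE\t/\tFALSE\t4102444800\tsession\tabc123\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "cookies.txt")
    with open(good_path, "w") as handle:
        handle.write(COOKIE_FILE)
    loaded_names = [cookie.name for cookie in load_cookie_file(good_path)]

    try:
        load_cookie_file(os.path.join(tmp_dir, "missing.txt"))
        missing_raises = False
    except CookieFileError:
        missing_raises = True
```

By default `MozillaCookieJar.load()` skips expired and session-only cookies, so a file that parses correctly can still produce an empty jar.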

### 7. Exceptions:

- **CookiesInvalid**: Raised when the cookie file loads but yields no cookies, so the resulting cookie jar is empty. It takes the `video_id` as a parameter for contextual error reporting.

- **CookiePathInvalid**: Raised if the cookie file cannot be loaded, for example because the path is wrong, the file is missing, or the file is not in the expected format.

## Conclusion:

This code is designed to interact with YouTube’s transcript system, allowing the user to retrieve transcripts in different languages for YouTube videos. It handles multiple videos at once and manages cookies and proxies to enable the fetching of transcripts, especially for authorized content. The code also takes care of error handling, like invalid cookies or missing files, ensuring a smooth process for users to fetch and work with video transcripts.


In [16]:
import requests

try:  # pragma: no cover
    import http.cookiejar as cookiejar

    CookieLoadError = (FileNotFoundError, cookiejar.LoadError)
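except ImportError:  # pragma: no cover
    import cookielib as cookiejar  # type: ignore

    # Python 2 fallback: define the name so `_load_cookies` can still catch load failures
    CookieLoadError = IOError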


class YouTubeTranscriptApi(object):
    @classmethod
    def list_transcripts(cls, video_id, proxies=None, cookies=None):
        """
        Retrieves the list of transcripts which are available for a given video. It returns a `TranscriptList` object
        which is iterable and provides methods to filter the list of transcripts for specific languages. While iterating
        over the `TranscriptList` the individual transcripts are represented by `Transcript` objects, which provide
        metadata and can either be fetched by calling `transcript.fetch()` or translated by calling
        `transcript.translate('en')`. Example::

            # retrieve the available transcripts
            transcript_list = YouTubeTranscriptApi.list_transcripts('video_id')

            # iterate over all available transcripts
            for transcript in transcript_list:
                # the Transcript object provides metadata properties
                print(
                    transcript.video_id,
                    transcript.language,
                    transcript.language_code,
                    # whether it has been manually created or generated by YouTube
                    transcript.is_generated,
                    # a list of languages the transcript can be translated to
                    transcript.translation_languages,
                )

                # fetch the actual transcript data
                print(transcript.fetch())

                # translating the transcript will return another transcript object
                print(transcript.translate('en').fetch())

            # you can also directly filter for the language you are looking for, using the transcript list
            transcript = transcript_list.find_transcript(['de', 'en'])

            # or just filter for manually created transcripts
            transcript = transcript_list.find_manually_created_transcript(['de', 'en'])

            # or automatically generated ones
            transcript = transcript_list.find_generated_transcript(['de', 'en'])

        :param video_id: the youtube video id
        :type video_id: str
        :param proxies: a dictionary mapping of http and https proxies to be used for the network requests
        :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies
        :param cookies: a string of the path to a text file containing youtube authorization cookies
        :type cookies: str
        :return: the list of available transcripts
        :rtype TranscriptList:
        """
        with requests.Session() as http_client:
            if cookies:
                http_client.cookies = cls._load_cookies(cookies, video_id)
            http_client.proxies = proxies if proxies else {}
            return TranscriptListFetcher(http_client).fetch(video_id)

    @classmethod
    def get_transcripts(
        cls,
        video_ids,
        languages=("en",),
        continue_after_error=False,
        proxies=None,
        cookies=None,
        preserve_formatting=False,
    ):
        """
        Retrieves the transcripts for a list of videos.

        :param video_ids: a list of youtube video ids
        :type video_ids: list[str]
        :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en']
        it will first try to fetch the German transcript (de) and then fetch the English transcript (en) if it fails to
        do so.
        :type languages: list[str]
        :param continue_after_error: if this is set the execution won't be stopped, if an error occurs while retrieving
        one of the video transcripts
        :type continue_after_error: bool
        :param proxies: a dictionary mapping of http and https proxies to be used for the network requests
        :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies
        :param cookies: a string of the path to a text file containing youtube authorization cookies
        :type cookies: str
        :param preserve_formatting: whether to keep select HTML text formatting
        :type preserve_formatting: bool
        :return: a tuple containing a dictionary mapping video ids onto their corresponding transcripts, and a list of
        video ids, which could not be retrieved
        :rtype ({str: [{'text': str, 'start': float, 'duration': float}]}, [str]):
        """
        assert isinstance(video_ids, list), "`video_ids` must be a list of strings"

        data = {}
        unretrievable_videos = []

        for video_id in video_ids:
            try:
                data[video_id] = cls.get_transcript(
                    video_id, languages, proxies, cookies, preserve_formatting
                )
            except Exception as exception:
                if not continue_after_error:
                    raise exception

                unretrievable_videos.append(video_id)

        return data, unretrievable_videos

    @classmethod
    def get_transcript(
        cls,
        video_id,
        languages=("en",),
        proxies=None,
        cookies=None,
        preserve_formatting=False,
    ):
        """
        Retrieves the transcript for a single video. This is just a shortcut for calling::

            YouTubeTranscriptApi.list_transcripts(video_id, proxies, cookies).find_transcript(languages).fetch()

        :param video_id: the youtube video id
        :type video_id: str
        :param languages: A list of language codes in a descending priority. For example, if this is set to ['de', 'en']
        it will first try to fetch the German transcript (de) and then fetch the English transcript (en) if it fails to
        do so.
        :type languages: list[str]
        :param proxies: a dictionary mapping of http and https proxies to be used for the network requests
        :type proxies: {'http': str, 'https': str} - http://docs.python-requests.org/en/master/user/advanced/#proxies
        :param cookies: a string of the path to a text file containing youtube authorization cookies
        :type cookies: str
        :param preserve_formatting: whether to keep select HTML text formatting
        :type preserve_formatting: bool
        :return: a list of dictionaries containing the 'text', 'start' and 'duration' keys
        :rtype [{'text': str, 'start': float, 'duration': float}]:
        """
        assert isinstance(video_id, str), "`video_id` must be a string"
        return (
            cls.list_transcripts(video_id, proxies, cookies)
            .find_transcript(languages)
            .fetch(preserve_formatting=preserve_formatting)
        )

    @classmethod
    def _load_cookies(cls, cookies, video_id):
        try:
            cookie_jar = cookiejar.MozillaCookieJar()
            cookie_jar.load(cookies)
            if not cookie_jar:
                raise CookiesInvalid(video_id)
            return cookie_jar
        except CookieLoadError:
            raise CookiePathInvalid(video_id)

## Step 7: YouTube Transcript CLI Integration

### 1. Imports:

- `argparse`: This module helps in parsing command-line arguments. It makes it easier to handle user input from the terminal and use it in the program.

### 2. Class Definition: `YouTubeTranscriptCli`

This is a command-line interface (CLI) for the YouTube Transcript API. It takes command-line arguments, processes them, and fetches the transcripts based on the user’s input. The main purpose of this class is to interact with users via the command line to retrieve transcripts from YouTube.

### 3. Constructor (`__init__`):

- **`self._args = args`**: The constructor stores the command-line arguments (`args`) passed to it. This is later used to process and execute the logic.

### 4. `run` Method:

- **`parsed_args = self._parse_args()`**: This calls the `_parse_args()` method to parse the arguments passed by the user.

- **`if parsed_args.exclude_manually_created and parsed_args.exclude_generated:`**: This checks if both flags are set to exclude manually created and generated transcripts. If both are set, no action is taken, and an empty string is returned.

- **`proxies = None`**: Initializes proxies to None. Proxies may be used to route the HTTP requests through specific network paths.

- **`if parsed_args.http_proxy != "" or parsed_args.https_proxy != "":`**: Checks if HTTP or HTTPS proxies are provided. If either is set, it creates a dictionary (`proxies`) containing the proxy URLs.

- **`cookies = parsed_args.cookies`**: Stores the path to the cookies file for YouTube authentication.

- **`transcripts = []`**: Initializes an empty list to store the fetched transcripts.

- **`exceptions = []`**: Initializes an empty list to store any exceptions encountered during the process.

- **`for video_id in parsed_args.video_ids:`**: Iterates over each YouTube video ID provided by the user.

    - **`try:`**: Attempts to fetch the transcript for each video ID using the `_fetch_transcript` method.
    
    - **`except Exception as exception:`**: If an error occurs, it appends the exception to the `exceptions` list.

- **`return "\n\n".join(...)`**: After processing all video IDs, the method returns a string containing:

    - All exceptions (if any).
    - The formatted transcripts (if any), based on the selected format.
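
Two pieces of `run` are easy to isolate: building the proxy dictionary and assembling the final output string. The helper names below are assumptions, and `formatted_transcripts` stands for the string that the selected formatter would return.

```python
def build_proxies(http_proxy="", https_proxy=""):
    """Return a proxies dict if either proxy is set, otherwise None."""
    if http_proxy != "" or https_proxy != "":
        return {"http": http_proxy, "https": https_proxy}
    return None


def assemble_output(exceptions, formatted_transcripts):
    """Error messages come first, followed by the formatted transcripts (if any)."""
    parts = [str(exception) for exception in exceptions]
    if formatted_transcripts:
        parts.append(formatted_transcripts)
    return "\n\n".join(parts)


no_proxies = build_proxies()
http_only = build_proxies(http_proxy="http://localhost:8080")
output = assemble_output([ValueError("video x failed")], "Hello\nworld")
```

Because errors are collected instead of raised, one failing video does not prevent the remaining transcripts from being printed.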

### 5. `_fetch_transcript` Method:

- **`transcript_list = YouTubeTranscriptApi.list_transcripts(...)`**: This fetches the list of available transcripts for a given video using the `YouTubeTranscriptApi.list_transcripts` method, passing the necessary arguments like proxies and cookies.

- **`if parsed_args.list_transcripts:`**: If the user requests to list the available transcripts, it returns a string representation of the transcript list.

- **`if parsed_args.exclude_manually_created:`**: If the user wants to exclude manually created transcripts, it tries to find only the generated transcripts by calling `find_generated_transcript`.

- **`elif parsed_args.exclude_generated:`**: If the user wants to exclude generated transcripts, it fetches only manually created ones using `find_manually_created_transcript`.

- **`else:`**: If neither flag is set, it fetches any available transcript by calling `find_transcript`.

- **`if parsed_args.translate:`**: If the user specifies a translation language, it translates the transcript to the specified language using `translate`.

- **`return transcript.fetch()`**: Finally, it fetches the actual transcript content and returns it.
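
The branch that chooses a finder method can be sketched with a stub. `FakeTranscriptList` and `pick_transcript` are hypothetical; the stub only records which finder was called, while the branch order matches `_fetch_transcript` in the code cell below.

```python
from types import SimpleNamespace


class FakeTranscriptList(object):
    """Hypothetical stand-in that reports which finder method was used."""

    def find_generated_transcript(self, languages):
        return ("generated", tuple(languages))

    def find_manually_created_transcript(self, languages):
        return ("manual", tuple(languages))

    def find_transcript(self, languages):
        return ("any", tuple(languages))


def pick_transcript(transcript_list, parsed_args):
    if parsed_args.exclude_manually_created:
        return transcript_list.find_generated_transcript(parsed_args.languages)
    if parsed_args.exclude_generated:
        return transcript_list.find_manually_created_transcript(parsed_args.languages)
    return transcript_list.find_transcript(parsed_args.languages)


# SimpleNamespace mimics the attribute access of an argparse result.
args = SimpleNamespace(
    exclude_manually_created=False,
    exclude_generated=True,
    languages=["de", "en"],
)
choice = pick_transcript(FakeTranscriptList(), args)
```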

### 6. `_parse_args` Method:

- **`parser = argparse.ArgumentParser(...)`**: Initializes an `ArgumentParser` instance that provides the functionality for parsing command-line arguments. It also provides a description of the program.

- **`parser.add_argument(...)`**: Defines each command-line option or flag that the program accepts. This includes:


    - `--list-transcripts`: Flag to list the available transcripts without fetching them.

    - `video_ids`: A positional argument to specify a list of YouTube video IDs.

    - `--languages`: A list of languages in order of preference for fetching the transcript.

    - `--exclude-generated`: Flag to exclude automatically generated transcripts.

    - `--exclude-manually-created`: Flag to exclude manually created transcripts.

    - `--format`: Specifies the format for the transcript output.

    - `--translate`: Specifies the target language for translation.

    - `--http-proxy`: HTTP proxy URL.

    - `--https-proxy`: HTTPS proxy URL.

    - `--cookies`: Path to the cookies file for YouTube authentication.


- **`return self._sanitize_video_ids(...)`**: After defining the arguments, the method parses them using `parse_args` and then sanitizes the video IDs by removing any escape characters (like backslashes) using `_sanitize_video_ids`.
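
A reduced parser shows how the positional and optional arguments interact. This sketch keeps only three of the arguments listed above, with the same `nargs` and defaults as the code cell below; `build_parser`, its description text, and the sample argument list are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the transcript CLI options.")
    parser.add_argument(
        "--list-transcripts", action="store_const", const=True, default=False
    )
    # One or more video ids are required.
    parser.add_argument("video_ids", nargs="+", type=str)
    # Zero or more language codes, highest priority first.
    parser.add_argument("--languages", nargs="*", default=["en"], type=str)
    return parser


# Passing a list avoids reading sys.argv, which keeps the example reproducible.
parsed = build_parser().parse_args(["abc123", "xyz789", "--languages", "de", "en"])
defaults = build_parser().parse_args(["abc123"])
```

Because `--languages` accepts any number of values, the video IDs are placed before it here; values that follow `--languages` would be read as language codes.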

### 7. `_sanitize_video_ids` Method:

- **`args.video_ids = [video_id.replace("\\", "") for video_id in args.video_ids]`**: This ensures that any backslashes in the video IDs are removed, as they may have been escaped in the input.

- **`return args`**: Returns the sanitized arguments with cleaned-up video IDs.
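
The clean-up step is a one-line list comprehension. In this sketch the function name and sample IDs are made up, and the hyphen-prefixed ID is only an assumed example of when a user might escape an ID on the command line.

```python
def sanitize_video_ids(video_ids):
    # Strip every backslash so an escaped id becomes the real id.
    return [video_id.replace("\\", "") for video_id in video_ids]


cleaned = sanitize_video_ids(["\\-abc123", "xyz789"])
```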

## Conclusion:

This class acts as a CLI utility that interacts with the YouTube Transcript API. It takes multiple command-line options and video IDs, fetches the desired transcripts, handles errors, and formats the results based on user preferences. It also supports features like excluding certain types of transcripts, translating transcripts, and routing requests through proxies. The program is designed to be run from the command line, making it user-friendly for fetching YouTube video transcripts.


In [18]:
import argparse


class YouTubeTranscriptCli(object):
    def __init__(self, args):
        self._args = args

    def run(self):
        parsed_args = self._parse_args()

        if parsed_args.exclude_manually_created and parsed_args.exclude_generated:
            return ""

        proxies = None
        if parsed_args.http_proxy != "" or parsed_args.https_proxy != "":
            proxies = {"http": parsed_args.http_proxy, "https": parsed_args.https_proxy}

        cookies = parsed_args.cookies

        transcripts = []
        exceptions = []

        for video_id in parsed_args.video_ids:
            try:
                transcripts.append(
                    self._fetch_transcript(parsed_args, proxies, cookies, video_id)
                )
            except Exception as exception:
                exceptions.append(exception)

        return "\n\n".join(
            [str(exception) for exception in exceptions]
            + (
                [
                    FormatterLoader()
                    .load(parsed_args.format)
                    .format_transcripts(transcripts)
                ]
                if transcripts
                else []
            )
        )

    def _fetch_transcript(self, parsed_args, proxies, cookies, video_id):
        transcript_list = YouTubeTranscriptApi.list_transcripts(
            video_id, proxies=proxies, cookies=cookies
        )

        if parsed_args.list_transcripts:
            return str(transcript_list)

        if parsed_args.exclude_manually_created:
            transcript = transcript_list.find_generated_transcript(
                parsed_args.languages
            )
        elif parsed_args.exclude_generated:
            transcript = transcript_list.find_manually_created_transcript(
                parsed_args.languages
            )
        else:
            transcript = transcript_list.find_transcript(parsed_args.languages)

        if parsed_args.translate:
            transcript = transcript.translate(parsed_args.translate)

        return transcript.fetch()

    def _parse_args(self):
        parser = argparse.ArgumentParser(
            description=(
                "This is a Python API which allows you to get the transcripts/subtitles for a given YouTube video. "
                "It also works for automatically generated subtitles and it does not require a headless browser, like "
                "other selenium based solutions do!"
            )
        )
        parser.add_argument(
            "--list-transcripts",
            action="store_const",
            const=True,
            default=False,
            help="This will list the languages in which the given videos are available.",
        )
        parser.add_argument(
            "video_ids", nargs="+", type=str, help="List of YouTube video IDs."
        )
        parser.add_argument(
            "--languages",
            nargs="*",
            default=[
                "en",
            ],
            type=str,
            help=(
                'A list of language codes in a descending priority. For example, if this is set to "de en" it will '
                "first try to fetch the german transcript (de) and then fetch the english transcript (en) if it fails "
                "to do so. As I can't provide a complete list of all working language codes with full certainty, you "
                "may have to play around with the language codes a bit, to find the one which is working for you!"
            ),
        )
        parser.add_argument(
            "--exclude-generated",
            action="store_const",
            const=True,
            default=False,
            help="If this flag is set transcripts which have been generated by YouTube will not be retrieved.",
        )
        parser.add_argument(
            "--exclude-manually-created",
            action="store_const",
            const=True,
            default=False,
            help="If this flag is set transcripts which have been manually created will not be retrieved.",
        )
        parser.add_argument(
            "--format",
            type=str,
            default="pretty",
            choices=tuple(FormatterLoader.TYPES.keys()),
        )
        parser.add_argument(
            "--translate",
            default="",
            help=(
                "The language code for the language you want this transcript to be translated to. Use the "
                "--list-transcripts feature to find out which languages are translatable and which translation "
                "languages are available."
            ),
        )
        parser.add_argument(
            "--http-proxy",
            default="",
            metavar="URL",
            help="Use the specified HTTP proxy.",
        )
        parser.add_argument(
            "--https-proxy",
            default="",
            metavar="URL",
            help="Use the specified HTTPS proxy.",
        )
        parser.add_argument(
            "--cookies",
            default=None,
            help="The cookie file that will be used for authorization with YouTube.",
        )

        return self._sanitize_video_ids(parser.parse_args(self._args))

    def _sanitize_video_ids(self, args):
        args.video_ids = [video_id.replace("\\", "") for video_id in args.video_ids]
        return args

## Step 8: Integrate and Query Google Gemini API

### 1. Imports

- **`import google.generativeai as genai`**: This imports the generativeai module from Google, which allows interaction with Google Gemini's generative models, enabling you to send queries to them and receive responses.

- **`import os`**: This imports the os module, which provides functions for interacting with the operating system, such as reading environment variables.

- **`from dotenv import load_dotenv`**: This imports the `load_dotenv` function from the `dotenv` module (installed as the `python-dotenv` package), which loads variables from a `.env` file into the process environment. Storing sensitive data such as API keys in a `.env` file keeps them out of the source code.

### 2. Loading the Environment Variables

- **`load_dotenv()`**: This loads the variables defined in a `.env` file into the environment of the current Python session, so configuration data such as API keys can be read with `os.getenv` instead of being hard-coded.

### 3. Setting API Key and Model Constants

- **`gemini_api_key = os.getenv("GEMINI_API_KEY")`**: This reads `GEMINI_API_KEY` from the environment variables. The API key is required to authenticate requests to the Gemini API. If the variable is not set, `os.getenv` returns `None`.

- **`DEFAULT_GEMINI_MODEL = "gemini-1.5-flash"`**: This sets the default model to use with the Gemini API. The string `"gemini-1.5-flash"` is the model name that is passed on to the API.
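
Because `os.getenv` returns `None` for a missing variable, a missing key would only surface later, when the first request is made. One way to fail earlier is sketched below. The helper `require_env` and the variable name `DEMO_API_KEY` are hypothetical and not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Demonstration with a throwaway variable name (hypothetical, not used by the app).
os.environ["DEMO_API_KEY"] = "dummy-value"
demo_key = require_env("DEMO_API_KEY")
print(demo_key)
```

In the notebook, the same check could be applied to `GEMINI_API_KEY` right after `load_dotenv()`.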

### 4. Function to Get the Gemini Client

- **`def get_gemini_client(model=DEFAULT_GEMINI_MODEL):`**: This defines a function `get_gemini_client` that takes an optional argument model (defaulting to `DEFAULT_GEMINI_MODEL`). This function is used to initialize and return a client that can interact with the specified Gemini model.

    - **`genai.configure(api_key=gemini_api_key)`**: This configures the genai module by passing the API key (`gemini_api_key`) so that the client can authenticate requests to the Gemini API.
    
    - **`model = genai.GenerativeModel(model_name=model)`**: This initializes a `GenerativeModel` object from `genai` using the provided model name. Note that the name `model` is rebound here: it held the model name string and now refers to the `GenerativeModel` object used to generate content.
    
    - **`return model`**: The function returns the initialized `GenerativeModel` client that is ready to generate content using the specified model.
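
The function builds a new client on every call. If that ever becomes a concern, one option is to cache the client per model name. The following is only a sketch of that idea: `StandInModel` is a hypothetical stand-in for `genai.GenerativeModel` so that the example runs without an API key, and `get_cached_client` is an assumed name.

```python
from functools import lru_cache


class StandInModel:
    """Hypothetical stand-in for genai.GenerativeModel so the sketch runs offline."""

    instances_created = 0

    def __init__(self, model_name: str):
        StandInModel.instances_created += 1
        self.model_name = model_name


@lru_cache(maxsize=None)
def get_cached_client(model_name: str = "gemini-1.5-flash") -> StandInModel:
    """Build the client once per model name and reuse it afterwards."""
    # In the notebook this is where genai.configure(...) and
    # genai.GenerativeModel(model_name=...) would be called.
    return StandInModel(model_name=model_name)


first = get_cached_client("gemini-1.5-flash")
second = get_cached_client("gemini-1.5-flash")
print(first is second)
```

Whether caching is worthwhile depends on how often the client is requested; for a single notebook run the original version is perfectly adequate.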

### 5. Function to Query Gemini

- **`def gemini(query, model=DEFAULT_GEMINI_MODEL):`**: This defines a function `gemini` that takes two arguments: `query`, which is the prompt or query you want to send to the Gemini model, and `model`, which specifies the model to use (defaulting to `DEFAULT_GEMINI_MODEL`).

    - **`model = get_gemini_client(model)`**: This calls the `get_gemini_client` function to get the Gemini client initialized with the specified model.

    - **`response = model.generate_content(query)`**: This sends the provided query to the Gemini model for content generation. The `generate_content` method is used to request content based on the query, and it returns a response.

    - **`return response.text`**: The function returns the `text` attribute from the response, which contains the generated content from the Gemini model based on the query.

## Summary

This script provides two main functions:
- **`get_gemini_client`**: To initialize and return a client to interact with Google Gemini models.

- **`gemini`**: To send a query to the model and return the generated content.

The environment variables are loaded using the `dotenv` package, and the API key is read from the environment rather than hard-coded in the notebook.

In [28]:
import google.generativeai as genai
import os
from dotenv import load_dotenv

load_dotenv()

gemini_api_key = os.getenv("GEMINI_API_KEY")
DEFAULT_GEMINI_MODEL = "gemini-1.5-flash"


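def get_gemini_client(model=DEFAULT_GEMINI_MODEL):
    """Configure the API key and return a GenerativeModel for the given model name."""
    genai.configure(api_key=gemini_api_key)
    model = genai.GenerativeModel(model_name=model)
    return model


def gemini(query, model=DEFAULT_GEMINI_MODEL):
    """Send the query to the Gemini model and return the generated text."""
    model = get_gemini_client(model)
    response = model.generate_content(query)
    return response.text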

## Step 9: Fetch and Convert YouTube Transcript to Markdown Format

### **Imports**

#### 1. **`from youtube_transcript_api.formatters import TextFormatter`**
- This imports the `TextFormatter` class from the `youtube_transcript_api.formatters` module. `TextFormatter` converts the raw transcript data into plain text.

#### 2. **`from youtube_transcript_api import YouTubeTranscriptApi`**
- This imports the `YouTubeTranscriptApi` class from the `youtube_transcript_api` module. It is used to fetch the transcripts for YouTube videos.

#### 3. **`from IPython.display import display, Markdown`**
- This imports the `display` function and the `Markdown` class from the `IPython.display` module. `display` renders objects (in this case, Markdown) in Jupyter notebooks or interactive Python sessions. `Markdown` wraps a string so that it is rendered as Markdown.

### **Function to Convert Transcript to Markdown Format**

#### `def transcript_to_markdown(text_transcript: str) -> str:`
- This defines a function `transcript_to_markdown`, which takes `text_transcript` (a plain text representation of a YouTube video transcript) as an input and returns the formatted transcript in Markdown format.

#### **Prompt Creation**
- **`prompt = f"""..."""`**: This creates a formatted prompt string that asks the Gemini model (through the `gemini` function from Step 8) to convert the plain text transcript into properly formatted Markdown. The prompt specifies:

  - Convert the given text transcript into Markdown format.

  - Include all information from the transcript without altering the facts.

  - Identify and tag speaker boundaries, assigning labels like 'Speaker 1', 'Speaker 2', etc.

  - Clean up and properly format the Markdown output.

  - It appends the `text_transcript` as the input for the AI model to process.
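
The prompt construction can also be sketched as a separate, testable step. The version below is a variation, not the notebook's exact code: it keeps the instruction text in a template and removes the source-code indentation with `textwrap.dedent` before the transcript is filled in. The names `PROMPT_TEMPLATE` and `build_prompt` and the sample transcript are assumptions for this example.

```python
import textwrap

# Instruction template kept separate from the transcript so that dedent
# only sees the uniformly indented instruction lines.
PROMPT_TEMPLATE = textwrap.dedent(
    """\
    - Given a transcript of a video in text format, convert it into markdown format.
    - Make sure no details are missed out and all the information in the transcript is included.
    - Don't hallucinate or make up factual information.
    - Identify 'speaker boundaries' and tag them with the speaker's name. If you don't have the name just say 'Speaker 1', 'Speaker 2', etc.
    - Clean the markdown and format it properly.
    Transcript: {text_transcript}
    """
)


def build_prompt(text_transcript: str) -> str:
    """Fill the transcript into the instruction template."""
    return PROMPT_TEMPLATE.format(text_transcript=text_transcript)


example_prompt = build_prompt("hello everyone\nwelcome back")
print(example_prompt)
```

Filling in the transcript after dedenting matters: a multi-line transcript has no leading indentation of its own, so dedenting the combined string would leave the instruction lines indented.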

#### **Response from Gemini Model**
- **`response = gemini(query=prompt)`**: This sends the formatted prompt to the `gemini` function (which interacts with the Gemini model) and stores the response in the `response` variable. The model will process the transcript and return a properly formatted Markdown version.

- **`return response`**: This returns the formatted transcript generated by the Gemini model. The response is in Markdown format, which can be displayed later.

### **Formatter Setup**

#### `formatter = TextFormatter()`
- This creates an instance of the `TextFormatter` class. This formatter converts the raw transcript (a list of entries with text and timing information) into plain text.

### **Fetching and Formatting the Transcript**

#### `video_id = "no_XaCE969Y"`
- This defines the `video_id` variable with a sample YouTube video ID (`no_XaCE969Y`). This ID will be used to fetch the transcript for this video.

#### `transcript = YouTubeTranscriptApi.get_transcript(video_id)`
- This calls the `get_transcript` method of the `YouTubeTranscriptApi` class to fetch the raw transcript data for the specified video ID. The result is a list of dictionaries (JSON-like data), each containing the text of a segment along with its timestamps.

#### `text_transcript = formatter.format_transcript(transcript)`
- This uses the `TextFormatter` instance to format the fetched transcript into plain text. The `format_transcript` method converts the raw transcript entries into a clean, human-readable text format.
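
To show what this conversion amounts to, here is a small stand-in that does not call the library. It assumes each entry is a dictionary with `text`, `start`, and `duration` keys. The sample data and the function name `to_plain_text` are made up for illustration, and the real `TextFormatter` may differ in details.

```python
# Hypothetical sample: one dict per caption segment with its text,
# start time, and duration in seconds.
sample_transcript = [
    {"text": "hello everyone", "start": 0.0, "duration": 1.5},
    {"text": "welcome back to the channel", "start": 1.5, "duration": 2.0},
]


def to_plain_text(transcript):
    """Join the text of every segment with newlines, dropping the timestamps."""
    return "\n".join(segment["text"] for segment in transcript)


plain_text = to_plain_text(sample_transcript)
print(plain_text)
```

The timing information is discarded at this point, which is why the later Markdown step has to infer speaker boundaries from the text alone.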

### **Converting the Transcript to Markdown**

#### `transcript_markdown = transcript_to_markdown(text_transcript)`
- This passes the formatted plain text transcript (`text_transcript`) to the `transcript_to_markdown` function, which generates the Markdown-formatted version of the transcript by calling the Gemini model.

### **Displaying the Markdown Transcript**

#### `display_markdown = Markdown(transcript_markdown)`
- This wraps `transcript_markdown` in an instance of the `Markdown` class. The `Markdown` object marks the string as Markdown so that it is rendered properly in environments that support it (e.g., Jupyter notebooks).

#### `display(display_markdown)`
- This uses the `display` function to render the `display_markdown` object in the output cell of a Jupyter notebook or IPython environment, showing the formatted transcript in Markdown.

### **Summary**

- The code fetches a transcript from a YouTube video using `YouTubeTranscriptApi`.

- It formats the raw transcript into plain text using `TextFormatter`.

- The formatted transcript is then sent to the `gemini` function, which uses the Google Gemini model configured in Step 8 to convert it into a properly formatted Markdown transcript.

- Finally, the Markdown-formatted transcript is displayed using the IPython `display` function in a Jupyter notebook or similar environment.


In [None]:
from youtube_transcript_api.formatters import TextFormatter
from youtube_transcript_api import YouTubeTranscriptApi
from IPython.display import display, Markdown


def transcript_to_markdown(text_transcript: str) -> str:
    """Transforms a text transcript into markdown format with speaker boundaries."""
    prompt = f"""
        - Given a transcript of a video in text format, convert it into markdown format.
        - Make sure no details are missed out and all the information in the transcript is included. 
        - Don't hallucinate or make up factual information. 
        - Identify 'speaker boundaries' and tag them with the speaker's name. If you don't have the name just say 'Speaker 1', 'Speaker 2', etc.
        - Clean the markdown and format it properly. 
        Transcript: {text_transcript}
    """
    response = gemini(query=prompt)
    return response


formatter = TextFormatter()

video_id = "no_XaCE969Y"
transcript = YouTubeTranscriptApi.get_transcript(video_id)
text_transcript = formatter.format_transcript(transcript)
transcript_markdown = transcript_to_markdown(text_transcript)
display_markdown = Markdown(transcript_markdown)
display(display_markdown)

## Conclusion:

This app provides a streamlined way to retrieve, format, and display YouTube video transcripts in a structured and readable Markdown format. The key steps involve:

### 1. **Fetching Transcripts:**
- The app fetches the transcript data for a specified YouTube video using the `YouTubeTranscriptApi`. This allows users to get both manually created and automatically generated transcripts.

### 2. **Formatting the Transcript:**
- The raw transcript data is formatted into plain text using the `TextFormatter` class, making it easy to work with.

### 3. **Enhancing the Transcript:**
- The Gemini model then processes the plain text transcript into organized Markdown, with speaker labels, clear text separation, and consistent formatting.

### 4. **Displaying the Output:**
- The final Markdown-formatted transcript is rendered with the IPython `display` function, which works in Jupyter notebooks or similar environments.

This app helps users extract, clean, and present YouTube transcripts in a readable, well-structured format, which is especially useful for tasks such as content analysis, transcription review, or preparing content for documentation.


---

# Thank You for visiting The Hackers Playbook! 🌐

If you liked this research material:

- [Subscribe to our newsletter.](https://thehackersplaybook.substack.com)

- [Follow us on LinkedIn.](https://www.linkedin.com/company/the-hackers-playbook/)

- [Leave a star on our GitHub.](https://www.github.com/thehackersplaybook)


