# Errors

In this notebook I document the issues I ran into while working on the project.   
These notes explain why the final version of the project uses certain approaches over the alternatives.

## Get the captions from the video
I tried to get the captions using several APIs; the captions were meant to be preprocessed and used as training data for my model.
None of the attempts worked for the test video: every library reported that no captions are available for it.


### youtube_transcript_api

In [2]:
import logging
from youtube_transcript_api import YouTubeTranscriptApi

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_transcript(video_id):
    """Fetches the transcript from YouTube based on the provided video_id."""
    logging.info("Starting transcript fetch for video_id: %s", video_id)
    try:
        # Fetch transcript data from YouTube
        transcript_data = YouTubeTranscriptApi.get_transcript(video_id)
        # Combine transcript parts into one string
        transcript = " ".join([item['text'] for item in transcript_data])
        logging.info("Transcript fetched successfully for video_id: %s", video_id)
        return transcript
    except Exception as e:
        logging.error("Failed to fetch transcript: %s", e)
        return ""

# Example usage
if __name__ == "__main__":
    video_id = 'yPIZ9b9Q-fc'  # Example video ID
    transcript = fetch_transcript(video_id)
    if transcript:
        print("Fetched transcript:", transcript)
    else:
        print("Failed to fetch the transcript.")


2024-10-28 18:10:18,502 - INFO - Starting transcript fetch for video_id: yPIZ9b9Q-fc
2024-10-28 18:10:19,273 - ERROR - Failed to fetch transcript: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=yPIZ9b9Q-fc! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!


Failed to fetch the transcript.
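
Had the call succeeded, the joined string would still need some cleaning before it could serve as training data. The block below is a minimal sketch of that step, not the project's actual preprocessing. `segments_to_text` and the sample data are hypothetical; the sample assumes segments shaped as dicts with `text`, `start` and `duration` keys, and assumes that non-speech cues appear in square brackets (for example `[Music]`).

```python
import re

# Made-up sample segments; the layout (text/start/duration) is an assumption.
SAMPLE_SEGMENTS = [
    {"text": "[Music]", "start": 0.0, "duration": 2.5},
    {"text": "hello and welcome\nto the channel", "start": 2.5, "duration": 3.0},
    {"text": "  today we look at captions ", "start": 5.5, "duration": 2.0},
]

# Matches bracketed non-speech cues such as [Music] or [Applause].
CUE_PATTERN = re.compile(r"\[[^\]]*\]")


def segments_to_text(segments):
    """Joins caption segments into one cleaned string (hypothetical helper)."""
    parts = []
    for segment in segments:
        text = CUE_PATTERN.sub(" ", segment.get("text", ""))
        # Collapse newlines and repeated spaces inside the segment.
        text = " ".join(text.split())
        if text:
            parts.append(text)
    return " ".join(parts)


if __name__ == "__main__":
    print(segments_to_text(SAMPLE_SEGMENTS))
```

Segments that contain only a cue are dropped entirely, so they do not leave stray spaces in the result.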


### pytube

In [3]:
import logging
from pytube import YouTube

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_transcript(video_id):
    """Fetches the auto-generated transcript from YouTube based on the provided video_id."""
    logging.info("Starting transcript fetch for video_id: %s", video_id)
    try:
        video_url = f"https://www.youtube.com/watch?v={video_id}"
        yt = YouTube(video_url)
        
        # Check for auto-generated English captions
        if "a.en" in yt.captions:
            transcript = yt.captions["a.en"].generate_srt_captions()
            logging.info("Transcript fetched successfully for video_id: %s", video_id)
            return transcript
        else:
            logging.error("Auto-generated English subtitles not available for this video.")
            return ""
    except Exception as e:
        logging.error("Failed to fetch transcript: %s", e)
        return ""

def check_available_captions(video_id):
    """Prints all available captions for the video to help identify the correct caption code."""
    video_url = f"https://www.youtube.com/watch?v={video_id}"
    yt = YouTube(video_url)
    available_captions = yt.captions
    print("Available captions:", available_captions)
    return available_captions

# Example usage
if __name__ == "__main__":
    video_id = 'yPIZ9b9Q-fc'  # Example video ID
    # Check which captions are available
    available_captions = check_available_captions(video_id)
    print("Captions found:", available_captions)

    # Try to fetch the transcript
    transcript = fetch_transcript(video_id)
    if transcript:
        print("Fetched transcript:", transcript)
    else:
        print("Failed to fetch the transcript.")


2024-10-28 18:10:19,588 - INFO - Starting transcript fetch for video_id: yPIZ9b9Q-fc
2024-10-28 18:10:19,746 - ERROR - Auto-generated English subtitles not available for this video.


Available captions: {}
Captions found: {}
Failed to fetch the transcript.
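
Unlike the previous attempt, `generate_srt_captions()` returns SRT-formatted text, so a successful fetch would have produced cue numbers and timing lines mixed in with the spoken text. The block below is a minimal sketch of flattening such text, assuming the usual SRT layout (cue number, timing line, caption lines, blank line). `srt_to_text` and `SAMPLE_SRT` are hypothetical and not part of the project.

```python
import re

# Matches an SRT timing line, e.g. "00:00:01,000 --> 00:00:03,500".
TIMING_PATTERN = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt):
    """Strips cue numbers and timing lines from SRT text (hypothetical helper)."""
    spoken = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        # Drop the numeric cue index if present.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        # Drop the timing line if present.
        if lines and TIMING_PATTERN.match(lines[0]):
            lines = lines[1:]
        spoken.extend(lines)
    return " ".join(spoken)


# Made-up sample input for illustration only.
SAMPLE_SRT = """1
00:00:00,000 --> 00:00:02,500
hello and welcome

2
00:00:02,500 --> 00:00:05,000
to the channel
"""

if __name__ == "__main__":
    print(srt_to_text(SAMPLE_SRT))
```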


### yt_dlp

In [4]:
import logging
import yt_dlp
import requests

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_auto_subtitles(video_url):
    """Fetches the URL for auto-generated English subtitles from YouTube using yt-dlp."""
    logging.info("Starting subtitle fetch for video_url: %s", video_url)
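    ydl_opts = {
        'writesubtitles': True,      # Manually uploaded subtitles
        'writeautomaticsub': True,   # Auto-generated captions, which this function is meant to fetch
        'subtitleslangs': ['en'],
        'skip_download': True,
        'subtitlesformat': 'srt'
    }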
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(video_url, download=False)
            subtitles = info.get("requested_subtitles")
            if subtitles and 'en' in subtitles:
                subtitle_url = subtitles['en'].get("url")
                logging.info("Subtitle URL fetched successfully.")
                return subtitle_url
            else:
                logging.error("Auto-generated English subtitles not available.")
                return None
    except Exception as e:
        logging.error("Failed to fetch subtitles: %s", e)
        return None

def save_subtitles_from_url(subtitle_url, output_path="data/raw/subtitles.srt"):
    """Downloads subtitles from the provided URL and saves them to a file."""
    if not subtitle_url:
        logging.error("No subtitle URL provided.")
        return False
    try:
        response = requests.get(subtitle_url, timeout=30)  # Avoid hanging forever on a stalled connection
        response.raise_for_status()
        with open(output_path, "w", encoding="utf-8") as file:
            file.write(response.text)
        logging.info("Subtitles saved to file: %s", output_path)
        return True
    except Exception as e:
        logging.error("Failed to save subtitles: %s", e)
        return False

# Main function to fetch and save subtitles
def main(video_url):
    """Main function to fetch and save the subtitles."""
    logging.info("Starting subtitle download process.")
    
    # Step 1: Fetch subtitle URL
    subtitle_url = fetch_auto_subtitles(video_url)
    if not subtitle_url:
        logging.error("Subtitle fetch failed. Exiting.")
        return

    # Step 2: Save the subtitles from URL for inspection
    success = save_subtitles_from_url(subtitle_url, output_path="data/raw/subtitles.srt")
    if success:
        logging.info("Subtitle download process completed successfully.")
    else:
        logging.error("Subtitle saving failed.")

# Example usage
if __name__ == "__main__":
    video_url = 'https://www.youtube.com/watch?v=yPIZ9b9Q-fc'
    main(video_url)


2024-10-28 18:10:19,806 - INFO - Starting subtitle download process.
2024-10-28 18:10:19,820 - INFO - Starting subtitle fetch for video_url: https://www.youtube.com/watch?v=yPIZ9b9Q-fc


[youtube] Extracting URL: https://www.youtube.com/watch?v=yPIZ9b9Q-fc
[youtube] yPIZ9b9Q-fc: Downloading webpage
[youtube] yPIZ9b9Q-fc: Downloading ios player API JSON
[youtube] yPIZ9b9Q-fc: Downloading mweb player API JSON
[youtube] yPIZ9b9Q-fc: Downloading m3u8 information


2024-10-28 18:10:23,434 - ERROR - Auto-generated English subtitles not available.
2024-10-28 18:10:23,438 - ERROR - Subtitle fetch failed. Exiting.
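
The three attempts above share the same shape: call a library, return the text, or return nothing on failure. They could be combined into one fallback chain that tries each backend in turn. The block below is a minimal sketch of that idea. `first_available_transcript` and the stand-in fetchers are hypothetical; real fetchers would wrap the library calls used in the cells above. For this particular video a chain would not have helped, since the first error message reports that subtitles are disabled.

```python
import logging

logger = logging.getLogger(__name__)


def first_available_transcript(video_id, fetchers):
    """Tries each (name, fetcher) pair in order.

    Returns (name, text) for the first fetcher that yields non-empty text,
    or (None, "") if none of them does.
    """
    for name, fetcher in fetchers:
        try:
            text = fetcher(video_id)
        except Exception as exc:  # A failing backend should not stop the chain.
            logger.warning("%s raised %s: %s", name, type(exc).__name__, exc)
            continue
        if text:
            return name, text
        logger.info("%s returned no captions for %s", name, video_id)
    return None, ""


# Stand-in fetchers so the sketch runs offline.
def always_empty(video_id):
    return ""


def always_fails(video_id):
    raise RuntimeError("subtitles are disabled")


def canned_text(video_id):
    return f"transcript for {video_id}"


if __name__ == "__main__":
    chain = [
        ("empty", always_empty),
        ("broken", always_fails),
        ("canned", canned_text),
    ]
    print(first_available_transcript("yPIZ9b9Q-fc", chain))
```

Returning the backend name together with the text makes it possible to tell later which source a transcript came from, for example whether it is plain text or SRT.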
