# YouTube Video Summarizer

In [38]:
import re
from youtube_transcript_api import YouTubeTranscriptApi
from sumy.parsers.plaintext import PlaintextParser      
from sumy.nlp.tokenizers import Tokenizer              
from sumy.summarizers.lsa import LsaSummarizer  
import nltk  
from openai import OpenAI

## Task 1: Design and ImplementationInput (YouTube URL) → [Regex: Extract Video ID] → [YouTube Transcript API: Fetch Transcript] → [OpenAI API: Generate Summary] → Output (Summary) LSA-Based Summary  


In [22]:
# 1. Extract Video ID from YouTube URL
def extract_video_id(url):
    pattern = r"(?:v=|\/)([0-9A-Za-z_-]{11}).*"
    match = re.search(pattern, url)
    return match.group(1) if match else None
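
The regex above accepts any 11 ID-like characters that follow a `/` or `v=`, so a non-video URL with a long path segment can also produce a match. A stricter variant could parse the URL first and only look where a video ID is expected. The sketch below is one way to do that; the function name `extract_video_id_strict` and the list of accepted hosts and path prefixes are assumptions for illustration, not part of the original notebook.

```python
import re
from urllib.parse import urlparse, parse_qs

# A YouTube video ID is 11 characters from this set (same class as the regex above).
_ID_RE = re.compile(r"[0-9A-Za-z_-]{11}")

def extract_video_id_strict(url):
    """Return the 11-character video ID, or None if the URL is not recognised."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = None
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Standard links: https://www.youtube.com/watch?v=<id>
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            # Embedded and Shorts links: /embed/<id>, /shorts/<id>
            parts = parsed.path.split("/")
            if len(parts) >= 3 and parts[1] in ("embed", "shorts"):
                candidate = parts[2]

    if candidate and _ID_RE.fullmatch(candidate):
        return candidate
    return None
```

This version returns `None` for URLs without a scheme (for example `youtube.com/watch?v=...`), so the caller may want to prepend `https://` before parsing.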

In [24]:
# 2. Fetch Transcript from YouTube
def fetch_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        text = ' '.join([entry['text'] for entry in transcript])
        return text
    except Exception as e:
        print(f"[Error] Could not fetch transcript: {e}")
        return None
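
Transcript fetching can fail for transient reasons such as a dropped connection. Because `fetch_transcript` catches every exception and returns `None`, a retry has to wrap the underlying call rather than `fetch_transcript` itself. The helper below is a minimal sketch of retrying with exponential backoff; `with_retries` and its parameters (`attempts`, `base_delay`, `sleep`) are hypothetical names introduced here.

```python
import time

def with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on an exception, wait and retry with a doubling delay.

    The last failure is re-raised so the caller can still report it.
    """
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"[Retry] attempt {attempt + 1} failed ({exc}); waiting {delay:.1f}s")
            sleep(delay)
```

Inside `fetch_transcript`, the API call could then be written as `with_retries(YouTubeTranscriptApi.get_transcript, video_id)`. Retrying only helps with transient failures; a video that has no captions will fail on every attempt.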

In [26]:
# 3. Summarize using Sumy LSA
# Sumy is used instead of the OpenAI API because the OpenAI call needs a paid account (see the quota error below)

def summarize_text_sumy(text, sentence_count=5):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    summary = summarizer(parser.document, sentence_count)
    return '\n'.join(str(sentence) for sentence in summary)
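
The sample output later in this notebook is one long unpunctuated run of text, which suggests the auto-generated captions carry no sentence-ending punctuation. In that case a sentence tokenizer probably finds very few sentence boundaries, which leaves LSA little to rank. One possible workaround, sketched below under that assumption, is to cut the transcript into fixed-size pseudo-sentences before passing it to `summarize_text_sumy`. The helper name `add_pseudo_sentences` and the default of 25 words are illustrative choices.

```python
def add_pseudo_sentences(text, words_per_sentence=25):
    """Break unpunctuated text into fixed-size 'sentences', each ending with a period."""
    words = text.split()
    chunks = [
        " ".join(words[i:i + words_per_sentence])
        for i in range(0, len(words), words_per_sentence)
    ]
    # Join with ". " and close the final chunk; empty input gives an empty string.
    return ". ".join(chunks) + "." if chunks else ""
```

Word-count boundaries cut across the speaker's real sentences, so the resulting summary lines may start or end mid-thought. A punctuation-restoration step would be a cleaner fix if one is available.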

In [28]:
# 4. Full Process Function
def extract_transcript(url):
    video_id = extract_video_id(url)
    if not video_id:
        return "[Error] Invalid YouTube URL format."

    print(f"Extracted Video ID: {video_id}")

    transcript = fetch_transcript(video_id)
    if not transcript:
        return "[Error] No transcript available."
        
    return transcript

In [42]:
# Generate a summary using the OpenAI API
def generate_summary(transcript, prompt_template):
    # Placeholder key: substitute a real key, or load it from an environment variable
    client = OpenAI(api_key="XXXXXXXXXXX")
    prompt = prompt_template.format(transcript=transcript)
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "You are a concise summarizer."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=150
    )
    return response.choices[0].message.content

In [34]:
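if __name__ == "__main__":
    youtube_url = input("Enter the YouTube video URL: ").strip()
    result_Full_Transcript = extract_transcript(youtube_url)

    # extract_transcript returns an "[Error] ..." string on failure; report it instead of summarizing it
    if result_Full_Transcript.startswith("[Error]"):
        print(result_Full_Transcript)
    else:
        summary = summarize_text_sumy(result_Full_Transcript)
        print("\n--- Transcript Summary ---\n")
        print(summary)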

Enter the YouTube video URL:   https://www.youtube.com/watch?v=x7X9w_GIm1s


Extracted Video ID: x7X9w_GIm1s

--- Transcript Summary ---

python a highlevel interpreted programming language famous for its zen-like code it's arguably the most popular language in the world because it's easy to learn yet practical for serious projects in fact you're watching this YouTube video in a python web application right now it was created by Guido van rossom and released in 1991 who named it after Monty Python's Flying Circus which is why you'll sometimes find spaming eggs instead of Foo and bar and code samples it's commonly used to build serers side applications like web apps with the framework and is the language of choice for Big Data analysis and machine learning many students choose python to start learning decode because of its emphasis on readability as outlined by the Zen of python beautiful is better than ugly while explicit is better than implicit python is very simple but avoids the temptation to sprinkle in Magic that causes ambiguity its code is often organize

In [44]:
# Testing the OpenAI API
prompt_template = "Summarize the following transcript in 3-5 sentences, capturing the main points: {transcript}"
summary = generate_summary(result_Full_Transcript, prompt_template)
print(summary)

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

#### Finding: The OpenAI API call failed with an `insufficient_quota` error (HTTP 429), so it needs a paid plan; the notebook therefore falls back to Sumy

## Task 2: Challenges and Solutions

#### Potential Challenges

| Challenge               | Description |
|--------------------------|-------------|
| No transcript available  | Some videos have no captions, or only auto-generated captions in unsupported languages. |
| API rate limits          | OpenAI and YouTube Transcript APIs may restrict excessive requests. |
| Long transcripts         | GPT models have context-window limits (e.g., 4,096 tokens for the original GPT-3.5 Turbo), shared between prompt and completion. |
| Summary accuracy         | The model may misinterpret domain-specific content. |
| Network errors           | Transcript fetching or API calls might fail intermittently. |


#### Solutions and Workarounds 

| Challenge            | Solution |
|-----------------------|----------|
| No transcript         | Display a friendly error message; suggest uploading a manual transcript. |
| API limits            | Add exponential backoff and caching mechanisms; upgrade to higher API tiers if needed. |
| Long transcripts      | Split transcript into chunks and summarize each, then summarize the summaries. |
| Summary accuracy      | Use fine-tuned models or add technical glossary context in the prompt. |
| Network issues        | Use try/except blocks and implement retries with logging. |

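The "split, summarize each chunk, then summarize the summaries" approach from the table can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `chunk_words` and `summarize_long` are hypothetical helpers, word counts stand in for model tokens (an approximation), and `summarize` is any callable that maps text to a shorter text, such as `summarize_text_sumy` or a wrapper around `generate_summary`.

```python
def chunk_words(text, max_words=800):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long(text, summarize, max_words=800):
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    chunks = chunk_words(text, max_words)
    if len(chunks) <= 1:
        # Short enough to summarize in one pass.
        return summarize(text)
    partials = [summarize(chunk) for chunk in chunks]
    # Second pass: condense the per-chunk summaries into one summary.
    return summarize(" ".join(partials))
```

For very long videos the combined partial summaries may still exceed the limit, in which case the second pass would need to be chunked again.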

## Task 3: Evaluation

#### Evaluating Effectiveness

To evaluate the effectiveness of the summaries:

- User Feedback: Collect qualitative feedback from users (e.g., software engineers) on whether summaries capture key technical points and are concise.
- Comparison with Manual Summaries: Compare tool-generated summaries with human-written summaries of the same videos to assess completeness and accuracy.
- A/B Testing: Test different prompt variations (e.g., bullet-point vs. paragraph summaries) and measure user preference or task completion time (e.g., how quickly users grasp the video’s content).


#### Metrics/Methods for Quality and Accuracy

Metrics:

- ROUGE Score: Measure overlap between tool-generated summaries and human-written reference summaries (e.g., ROUGE-1 for word overlap, ROUGE-L for sequence similarity).
- Cosine Similarity: Compute semantic similarity between the summary and transcript embeddings (using models like BERT or SentenceTransformers).
- Conciseness Ratio: Ratio of summary length to transcript length (target: 5-10% of original length).

Methods:

- Manual Review: Have domain experts (e.g., engineers) rate summaries on a scale (e.g., 1-5) for accuracy, relevance, and clarity.
- Task-Based Evaluation: Ask users to answer questions about the video’s content using only the summary, measuring correctness.

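Two of the metrics listed above, ROUGE-1 and the conciseness ratio, are simple enough to compute by hand. The sketch below is a simplified illustration: it tokenizes on whitespace and lowercases, with no stemming or stop-word handling, so its numbers will differ from those of a full ROUGE implementation. The function names `rouge1` and `conciseness_ratio` are introduced here for illustration.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Return (precision, recall, f1) based on unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

def conciseness_ratio(summary, transcript):
    """Summary length divided by transcript length, both measured in words."""
    transcript_words = len(transcript.split())
    return len(summary.split()) / transcript_words if transcript_words else 0.0
```

A conciseness ratio between 0.05 and 0.10 corresponds to the 5-10% target mentioned above.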

## Task 4 Extension: Proposed Features

| Feature                  | Benefit |
|---------------------------|---------|
| Multi-language Support    | Use translation APIs or multilingual models for non-English transcripts. |
| Browser Extension         | Users can summarize any video directly while browsing YouTube. |
| Summarization Modes       | Allow user to choose between bullet points, paragraph, or TL;DR format. |
| Visual Summaries          | Generate mind maps or key-point graphics using tools like Graphviz. |
| Bookmarking and Saving    | Save past summaries and allow search/filter functionality. |
| Voice Summary Generation  | Use TTS (text-to-speech) APIs to produce audio summaries. |

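The "Summarization Modes" feature could reuse the existing `prompt_template` mechanism, with one template per mode. The sketch below assumes the same `{transcript}` placeholder used earlier in the notebook; the mode names, the wording of the bullet and TL;DR templates, and the helper `build_prompt` are illustrative.

```python
# One prompt template per output mode; each uses the {transcript} placeholder.
PROMPT_TEMPLATES = {
    "bullets": "Summarize the following transcript as 3-5 bullet points: {transcript}",
    "paragraph": (
        "Summarize the following transcript in 3-5 sentences, "
        "capturing the main points: {transcript}"
    ),
    "tldr": "Give a one-sentence TL;DR of the following transcript: {transcript}",
}

def build_prompt(transcript, mode="paragraph"):
    """Fill the template for the chosen mode; reject unknown modes."""
    if mode not in PROMPT_TEMPLATES:
        raise ValueError(f"Unknown mode {mode!r}; choose from {sorted(PROMPT_TEMPLATES)}")
    return PROMPT_TEMPLATES[mode].format(transcript=transcript)
```

With the existing function, a mode could then be selected as `generate_summary(result_Full_Transcript, PROMPT_TEMPLATES["bullets"])`.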
