# Course Playlist Summarizer & Chatbot

This notebook provides a complete pipeline for processing a **YouTube video playlist** into:
- 📝 Cleaned transcripts using `Whisper`
- 📄 Automated summaries using `OpenAI` LLMs
- 🧠 A searchable **vector database** using `Chroma` and `LangChain`
- 💬 An interactive chatbot for asking questions about the video content

### 🔧 Key Features:
- Download audio from YouTube playlists
- Transcribe using OpenAI Whisper
- Clean transcripts to remove filler or irrelevant text
- Summarize each video into 2–3 key points
- Build a Chroma vector store for semantic retrieval
- Ask questions via a CLI chatbot based on the video content

> This notebook is aimed at anyone building educational AI tools or turning video lectures into searchable notes.


## 1. Install Required Libraries


In [1]:
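# OpenAI Whisper is published on PyPI as `openai-whisper`; the package named `whisper` is a different project
!pip install -q openai langchain langchain-openai langchain-community chromadb yt_dlp openai-whisper python-dotenv
# ffmpeg-python is only a wrapper; the ffmpeg binary itself must also be installed on the system
!pip install -q ffmpeg-python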
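
Both the `FFmpegExtractAudio` post-processor in yt-dlp and Whisper's audio loading rely on the `ffmpeg` command-line tool, which pip does not install. A minimal sketch for checking this up front is shown here; the helper name `missing_tools` is hypothetical and not part of the notebook.

```python
import shutil

def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print(f"Missing command-line tools: {', '.join(missing)}")
else:
    print("ffmpeg and ffprobe are available")
```

If anything is reported missing, install ffmpeg with the system package manager before running the download and transcription cells.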

## 2. Import Libraries and Set API Key


In [None]:
import os
import yt_dlp
import whisper
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.docstore.document import Document
from dotenv import load_dotenv

# Load API key from .env file
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
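
`os.getenv` returns `None` when the variable is missing, and the error would only surface later at the first API call. One possible way to fail early is sketched below; `require_env` is a hypothetical helper, not something the notebook defines.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value

# Example use after load_dotenv() (left commented so the sketch runs without a key):
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```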

## 3. Prepare Project Directories


In [3]:
# Create directories
os.makedirs("audios", exist_ok=True)
os.makedirs("transcripts", exist_ok=True)
os.makedirs("outputs", exist_ok=True)

## 4. Clean Transcript Texts


In [None]:
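import re

# Clean transcript text before embedding
def clean_text(text):
    # Collapse whitespace first so the sentence-level patterns below never span line breaks
    text = re.sub(r'\s+', ' ', text)
    # Drop bracketed annotations such as [Music] or [Applause]
    text = re.sub(r'\[.*?\]', '', text)
    # Drop text from a greeting or intro phrase up to the next sentence-ending punctuation
    text = re.sub(r'(?i)(welcome( to)?|hello everyone|hi everyone|today we will learn|today we\'re going to learn).*?[.!?]', '', text)
    # Drop text from a call-to-action or sponsorship phrase up to the next sentence-ending punctuation
    text = re.sub(r'(?i)(don\'t forget to subscribe|please like and subscribe|this video is sponsored by|click the bell icon).*?[.!?]', '', text)
    # Drop filler sounds; "like" and "you know" are kept because they also carry meaning ("threats like phishing")
    text = re.sub(r'\b(uh|um|erm)\b', '', text, flags=re.IGNORECASE)
    # Discard any characters that cannot be encoded as UTF-8
    text = text.encode('utf-8', errors='ignore').decode('utf-8')
    # Collapse whitespace again so the removals above do not leave double spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()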
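
To see what this kind of cleaning does to a transcript, the sketch below applies two of the patterns to a made-up sentence. The function name `preview_cleaning` and the sample text are assumptions for illustration, and the intro pattern is a shortened subset of the one used in the cell.

```python
import re

def preview_cleaning(text):
    """Apply a bracket pattern and a shortened intro pattern, then normalize whitespace."""
    # Remove bracketed annotations such as [Music]
    text = re.sub(r'\[.*?\]', '', text)
    # Remove text from an intro phrase up to the next sentence-ending punctuation
    text = re.sub(r'(?i)(welcome( to)?|hello everyone|hi everyone).*?[.!?]', '', text)
    # Collapse the gaps left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

sample = "[Music] Hello everyone, welcome to the course. Passwords should be long."
print(preview_cleaning(sample))  # prints "Passwords should be long."
```

Note that the intro pattern removes everything up to the next `.`, `!`, or `?`, so any real content that shares a sentence with a greeting is dropped as well.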


## 5. Download Playlist Audio 

In [4]:
def download_playlist(playlist_url):
    ydl_opts = {
        'format': 'bestaudio/best',
        'outtmpl': 'audios/%(title)s.%(ext)s',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([playlist_url])

# Input your playlist URL
playlist_url = input("Enter YouTube playlist URL: ")
download_playlist(playlist_url)

Enter YouTube playlist URL: https://youtube.com/playlist?list=PLhQjrBD2T383Cqo5I1oRrbC1EKRAKGKUE&feature=shared
[youtube:tab] Extracting URL: https://youtube.com/playlist?list=PLhQjrBD2T383Cqo5I1oRrbC1EKRAKGKUE&feature=shared
[youtube:tab] PLhQjrBD2T383Cqo5I1oRrbC1EKRAKGKUE: Downloading webpage
[youtube:tab] PLhQjrBD2T383Cqo5I1oRrbC1EKRAKGKUE: Redownloading playlist API JSON with unavailable videos
[download] Downloading playlist: CS50's Introduction to Cybersecurity
[youtube:tab] PLhQjrBD2T383Cqo5I1oRrbC1EKRAKGKUE page 1: Downloading API JSON
[youtube:tab] Playlist CS50's Introduction to Cybersecurity: Downloading 6 items of 6
[download] Downloading item 1 of 6
[youtube] Extracting URL: https://www.youtube.com/watch?v=kmJlnUfMd7I
[youtube] kmJlnUfMd7I: Downloading webpage
[youtube] kmJlnUfMd7I: Downloading tv client config
[youtube] kmJlnUfMd7I: Downloading player fded239a-main
[youtube] kmJlnUfMd7I: Downloading tv player API JSON
[youtube] kmJlnUfMd7I: Downloading ios player API JSON
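
Because the URL comes from `input()`, it can help to confirm that it actually refers to a playlist before starting a long download. The following is a minimal sketch using only the standard library; `extract_playlist_id` is a hypothetical helper, and it assumes the playlist is identified by the `list` query parameter, as in the URL shown in the output above.

```python
from urllib.parse import urlparse, parse_qs

def extract_playlist_id(url):
    """Return the value of the `list` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url.strip()).query)
    values = query.get("list")
    return values[0] if values else None

url = "https://youtube.com/playlist?list=PLhQjrBD2T383Cqo5I1oRrbC1EKRAKGKUE&feature=shared"
print(extract_playlist_id(url))  # prints "PLhQjrBD2T383Cqo5I1oRrbC1EKRAKGKUE"
```

A `None` result suggests the URL points at a single video or another page, in which case the user could be asked to re-enter it.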

## 6. Transcribe Audio Files with Whisper and Apply Cleaning


In [None]:
def transcribe_audios():
    model = whisper.load_model("base")
    for filename in os.listdir("audios"):
        if filename.endswith(".mp3"):
            audio_path = os.path.join("audios", filename)
            result = model.transcribe(audio_path)
            
            # Apply cleaning before saving
            cleaned_text = clean_text(result["text"])
            
            transcript_path = os.path.join("transcripts", f"{os.path.splitext(filename)[0]}.txt")
            with open(transcript_path, "w", encoding="utf-8") as f:
                f.write(cleaned_text)

# Transcribe all audio files
transcribe_audios()


100%|███████████████████████████████████████| 139M/139M [00:06<00:00, 22.9MiB/s]
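
Transcription is the slowest step, and `transcribe_audios()` reprocesses every `.mp3` on each run. One possible way to resume after an interruption is to skip files that already have a transcript. The helper `pending_audio` below is a hypothetical sketch that assumes the same `audios/` and `transcripts/` layout and the same base-name convention used in the cell above.

```python
import os

def pending_audio(audio_dir="audios", transcript_dir="transcripts"):
    """Return sorted .mp3 filenames in audio_dir that have no matching .txt in transcript_dir."""
    done = {
        os.path.splitext(name)[0]
        for name in os.listdir(transcript_dir)
        if name.endswith(".txt")
    }
    return sorted(
        name
        for name in os.listdir(audio_dir)
        if name.endswith(".mp3") and os.path.splitext(name)[0] not in done
    )
```

Looping over `pending_audio()` instead of `os.listdir("audios")` would leave existing transcripts untouched and also gives a stable, sorted processing order.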


## 8. Summarize Each Transcript and Export to CSV


In [6]:
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
import os
import csv

def summarize_transcripts():
    # Initialize LLM
    llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name="gpt-3.5-turbo", temperature=0.2)

    # Define prompt for summarizing each video into main points
    prompt_template = PromptTemplate(
        input_variables=["text", "title"],
        template="Summarize the following transcript from a video titled '{title}' into 2-3 main points. Format the response as a numbered list (e.g., 1. Point, 2. Point, 3. Point). Focus on the key topics or themes:\n\n{text}"
    )

    # Prepare CSV data
    csv_data = []

    print("Summary of each video in the playlist:")
    # Process each transcript
    for filename in os.listdir("transcripts"):
        if filename.endswith(".txt"):
            with open(os.path.join("transcripts", filename), "r", encoding="utf-8") as f:
                content = f.read()

            # Use filename (without .txt) as video title
            video_title = os.path.splitext(filename)[0]

            # Generate summary for this video
            summary = llm.invoke(
                prompt_template.format(
                    text=content[:10000],  # Limit to 10k chars to avoid token limits
                    title=video_title
                )
            )

            # Print summary for this video
            print(f"\nVideo: {video_title}")
            print(summary.content)

            # Add to CSV data
            csv_data.append([video_title, summary.content])

    # Save summaries to CSV
    csv_file_path = os.path.join("outputs", "video_summaries.csv")
    with open(csv_file_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Video Title", "Summary"])  # Write header
        writer.writerows(csv_data)  # Write data
    print(f"\nSummaries saved to {csv_file_path}")

# Generate summaries and save to CSV
summarize_transcripts()

Summary of each video in the playlist:

Video: CS50 Cybersecurity - Lecture 2 - Securing Systems
1. Encryption is a key solution to securing systems, including Wi-Fi networks, to protect data and prevent eavesdropping.
2. HTTP, a common protocol for web communication, is vulnerable to eavesdropping and man-in-the-middle attacks due to lack of encryption.
3. Packet sniffing is a threat that allows attackers to intercept and view unencrypted data packets, highlighting the importance of encryption to protect data during transmission.

Video: CS50 Cybersecurity - Lecture 0 - Securing Accounts
1. The lecture focuses on the importance of securing accounts in the digital world, emphasizing the concepts of authentication and authorization to ensure that only the right individuals have access to specific systems or information.
2. Passwords play a crucial role in authentication, and it is essential to have strong, unique passwords that are not easily guessable to protect against threats like di
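
Note that `content[:10000]` sends only the first 10,000 characters of each transcript, so anything later in a long lecture never reaches the model. One alternative, sketched here under the assumption that each piece would be summarized separately and the partial summaries then combined, is to split the transcript at sentence boundaries. `split_for_summary` is a hypothetical helper and makes no LLM calls itself.

```python
import re

def split_for_summary(text, max_chars=10000):
    """Split text into pieces of at most max_chars, breaking at sentence ends where possible."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    pieces, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at fixed positions
        while len(sentence) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The current piece is full; start a new one with this sentence
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces
```

Each returned piece could be passed through the same prompt as above, at the cost of one additional API call per piece.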

## 9. Create and Persist Chroma Vector Store


In [7]:
def create_vector_db():
    # Load transcripts into documents
    documents = []
    for filename in os.listdir("transcripts"):
        if filename.endswith(".txt"):
            with open(os.path.join("transcripts", filename), "r", encoding="utf-8") as f:
                content = f.read()
                documents.append(Document(page_content=content, metadata={"source": filename}))

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(documents)

    # Create vector database
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
    vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings, persist_directory="./chroma_db")
    return vectorstore

# Create and persist vector database
vectorstore = create_vector_db()
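
The `chunk_size=1000` and `chunk_overlap=200` settings mean that consecutive chunks share up to 200 characters, so a sentence cut at a boundary still appears whole in at least one chunk. The sketch below illustrates the idea with a plain fixed-width sliding window. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its actual chunk boundaries will differ. The name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Return fixed-width chunks where each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous chunk
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

chunks = sliding_chunks("x" * 2500)
print([len(c) for c in chunks])  # prints [1000, 1000, 900]
```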

## 10. Setup Chatbot with RetrievalQA


In [8]:
def setup_chatbot(vectorstore):
    llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name="gpt-3.5-turbo", temperature=0)
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
        return_source_documents=True
    )
    return qa_chain

# Initialize chatbot
qa_chain = setup_chatbot(vectorstore)
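
With `search_kwargs={"k": 3}`, the retriever returns the three chunks whose embeddings are closest to the embedding of the question, and the `"stuff"` chain type places all of them into a single prompt. The toy sketch below shows the ranking idea on hand-written vectors. It assumes cosine similarity for illustration; the metric the store actually uses depends on how the Chroma collection is configured, and the names `cosine` and `top_k` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k vectors most similar to query_vec, best first."""
    ranked = sorted(doc_vecs.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

docs = {"a": [1, 0], "b": [0, 1], "c": [1, 1], "d": [-1, 0]}
print(top_k([1, 0], docs, k=2))  # prints ['a', 'c']
```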

## 11. Archive Transcripts Folder


In [11]:
import shutil # Import the shutil module

shutil.make_archive('transcripts', 'zip', 'transcripts')

'/content/transcripts.zip'

## 12. Chat with the AI Using the Provided Data

In [9]:
def chat_with_bot():
    print("Chatbot ready! Ask questions about the playlist videos (type 'exit' to quit).")
    while True:
        query = input("You: ")
        if query.lower() == "exit":
            break
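        # invoke() replaces the deprecated direct call qa_chain({...})
        result = qa_chain.invoke({"query": query})
        print(f"Bot: {result['result']}")
        sources = result['source_documents']
        if sources:  # the retriever may return no documents
            print(f"Source: {sources[0].metadata['source']}\n")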

# Start chatting
chat_with_bot()

Chatbot ready! Ask questions about the playlist videos (type 'exit' to quit).
You: how to to secure systems?


Bot: To secure systems, encryption is a key solution to many problems. When it comes to networked systems like Wi-Fi, choosing secured networks over unsecured ones is important. Additionally, implementing multiple layers of security can help raise the bar for potential adversaries, making it more difficult for them to access your systems. Regularly updating software and using strong, unique passwords for accounts are also essential steps in securing systems.
Source: CS50 Cybersecurity - Lecture 2 - Securing Systems.txt

You: how to secure accounts?
Bot: To secure your accounts effectively, you can follow these recommendations:

1. Use a password manager to store and generate strong, unique passwords for each account.
2. Enable two-factor authentication using a native application on your phone or a physical key fob instead of SMS.
3. Start by securing your most important accounts first and gradually work on others.
4. Use passwords that are at least eight characters long and consider up