# **Lecture Script: Video Summarization using AI**
## **Introduction**
Hello everyone! Today, we are going to talk about an exciting application of Artificial Intelligence—**Video Summarization**.

Have you ever watched a long video and wished you could get the key highlights in just a few minutes? That’s exactly what video summarization does!

## **Why is Video Summarization Important?**
Imagine you have a 2-hour-long football match, but you only want to see the best moments—goals, saves, and celebrations. Instead of manually watching and selecting clips, AI can do this for you automatically!

## **Where is Video Summarization Used?**
This technology is widely used in:

✅ News Highlights – AI can summarize long news reports into short clips.

✅ Sports Replays – Only the best moments (like goals and saves) are extracted.

✅ Lecture Recordings – AI can condense a 1-hour class into a 5-minute summary.

✅ Movie Trailers – Automatically generate trailers by picking the most exciting scenes.



## **How Does Video Summarization Work?**
The process of summarizing a video involves the following steps:

1️⃣ Extract Key Frames – Identify the most important moments from the video.

2️⃣ Generate Captions – Use AI to describe what is happening in each frame.

3️⃣ Summarize the Story – Use a language model (like GPT) to create a meaningful story.

4️⃣ Create a Short Video – Combine the key frames into a new video.

5️⃣ Add Voice Narration – Generate an audio summary using text-to-speech.

Now, let’s go step by step through this process with hands-on coding! 🚀

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
cd '/content/drive/MyDrive/Colab Notebooks/Video-Summarization/'


/content/drive/MyDrive/Colab Notebooks/Video-Summarization


In [None]:
pwd

'/content/drive/MyDrive/Colab Notebooks/Video-Summarization'

In [None]:
!ls

app.py		    Dockerfile	   requirements.txt  summary_video.mp4	  Video-summarization.ipynb
big-buck-bunny.mp4  narration.mp3  scenes	     video_subtitles.srt


## **Step 1: Extract Key Frames from a Video**
**What are Key Frames?**

A key frame is a snapshot of an important moment in a video.
Instead of processing every single frame, we only keep the essential ones.
### **Method 1: Extract Frames at Regular Intervals**
We use OpenCV, a popular library for image and video processing.

Example Code: Extract Frames Every 30 Frames

In [None]:
import cv2
import os

def extract_keyframes(video_path, output_folder, frame_interval=30):
    os.makedirs(output_folder, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_count % frame_interval == 0:
            frame_path = os.path.join(output_folder, f"frame_{frame_count}.jpg")
            cv2.imwrite(frame_path, frame)

        frame_count += 1

    cap.release()

# Example usage
extract_keyframes("big-buck-bunny.mp4", "frames")


👉 This function saves one frame out of every 30 (frames 0, 30, 60, ...) as a JPEG image.
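To sample by time rather than by frame count, the interval can be derived from the video's frame rate. The sketch below is a minimal illustration; `interval_for_seconds` and `sampled_frame_indices` are hypothetical helper names that are not part of the notebook. In practice the frame rate can be read from the open capture with `cap.get(cv2.CAP_PROP_FPS)`.

```python
def interval_for_seconds(fps, seconds):
    """Return the frame interval that corresponds to `seconds` of video."""
    return max(1, round(fps * seconds))


def sampled_frame_indices(total_frames, frame_interval=30):
    """List the frame numbers that a fixed-interval sampler would save."""
    return list(range(0, total_frames, frame_interval))


print(interval_for_seconds(24, 2))     # 48
print(sampled_frame_indices(100, 30))  # [0, 30, 60, 90]
```

The computed interval can then be passed as `frame_interval` to `extract_keyframes`.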

### **Method 2: Extract Frames Using Scene Detection**
Instead of sampling frames at a fixed interval, we use PySceneDetect to detect actual scene changes in the video.

Example:


Imagine you are watching your favorite cartoon, and every time a new scene starts, you want to take a picture of it. This program does exactly that! It watches a video and automatically takes a snapshot whenever the scene changes. Just like when you see a new background or a different character in a cartoon, the program detects that and saves a picture. This helps in summarizing videos or picking out the important moments!

In [None]:

!pip install scenedetect[opencv] opencv-python
import os
import cv2
from scenedetect import open_video, SceneManager, ContentDetector
def save_scene_frames(video_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    video = open_video(video_path)

    scene_manager = SceneManager()
    scene_manager.add_detector(ContentDetector(threshold=27.0))  # Adjust sensitivity
    # Detect scenes
    scene_manager.detect_scenes(video)
    scenes = scene_manager.get_scene_list()

    cap = cv2.VideoCapture(video_path)  # Open video for frame extraction

    for i, (start, end) in enumerate(scenes):
        frame_time = start.get_frames()  # Extract frame at scene start
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_time)
        ret, frame = cap.read()

        if ret:
            frame_path = os.path.join(output_folder, f"scene_{i+1}.jpg")
            cv2.imwrite(frame_path, frame)
            print(f"Saved: {frame_path}")

    cap.release()



# Example usage
save_scene_frames("big-buck-bunny.mp4", "scenes")




INFO:pyscenedetect:Detecting scenes...


Saved: scenes/scene_1.jpg
Saved: scenes/scene_2.jpg
Saved: scenes/scene_3.jpg
Saved: scenes/scene_4.jpg
Saved: scenes/scene_5.jpg
Saved: scenes/scene_6.jpg
Saved: scenes/scene_7.jpg
Saved: scenes/scene_8.jpg
Saved: scenes/scene_9.jpg
Saved: scenes/scene_10.jpg
Saved: scenes/scene_11.jpg
Saved: scenes/scene_12.jpg
Saved: scenes/scene_13.jpg
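A possible variation, not used in this notebook: the first frame of a scene can land on a fade or transition, so one option is to sample the middle of each scene instead. The sketch below works on plain integer frame numbers and assumes each scene is a `(start, end)` pair with an exclusive `end`; `middle_frames` is a hypothetical helper.

```python
def middle_frames(scenes):
    """Return the middle frame number of each (start, end) scene."""
    return [start + (end - start) // 2 for start, end in scenes]


# Hypothetical scene boundaries, given as frame numbers
scenes = [(0, 100), (100, 150), (150, 151)]
print(middle_frames(scenes))  # [50, 125, 150]
```

With PySceneDetect, the integer frame numbers could be obtained with `start.get_frames()` and `end.get_frames()`, the same call the function above already uses for the scene start.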


**Lecture Script: Scene Detection and Frame Extraction in Python**

**Introduction**

Hello everyone! Today, we are going to break down a Python script that detects scene changes in a video and extracts a representative frame for each scene. This script makes use of the `scenedetect` library along with `OpenCV`, which is a powerful tool for image and video processing.

By the end of this lecture, you will understand:

- How to install and import necessary libraries.
- How to use `scenedetect` to detect scene changes in a video.
- How to extract and save frames using `OpenCV`.
- How `ContentDetector` works internally.

---

### **Step 1: Installing Required Libraries**

```python
!pip install scenedetect[opencv] opencv-python
```

This command installs the required libraries:

- `scenedetect[opencv]`: A Python library used for detecting scene changes in videos.
- `opencv-python`: OpenCV, which helps in reading and processing video frames.

In Jupyter or Colab notebooks, the `!` prefix runs the command in the system shell instead of interpreting it as Python code.

---

### **Step 2: Importing Necessary Modules**

```python
import os
import cv2
from scenedetect import open_video, SceneManager, ContentDetector
```

- `os`: A built-in Python module to handle file and directory operations.
- `cv2`: The OpenCV library for video and image processing.
- `open_video`: A function from `scenedetect` to open video files.
- `SceneManager`: Manages the scene detection process.
- `ContentDetector`: A detector that identifies scene changes based on content differences.

---

### **Step 3: Setting Up Scene Detection**

```python
scene_manager = SceneManager()
scene_manager.add_detector(ContentDetector(threshold=27.0))
```

- `SceneManager()`: Initializes a scene detection manager.
- `add_detector(ContentDetector(threshold=27.0))`: Adds a content-based scene detector with a threshold of 27.0.
  - Lower values make it more sensitive (detecting more scenes).
  - Higher values make it less sensitive (detecting fewer scenes).

### **How `ContentDetector` Works Internally**

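The `ContentDetector` works by analyzing the visual differences between consecutive frames of a video. In PySceneDetect it converts each frame to the HSV color space and computes a content score from the average change in hue, saturation, and brightness (value) between one frame and the next. (Histogram comparison is a different approach, which PySceneDetect offers as a separate detector.)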

#### **Example Explanation**
Imagine you have a flipbook where each page has a slightly different drawing. If you flip through it quickly, the differences between pages are small. But when you turn to a completely new scene, the drawing changes significantly. `ContentDetector` detects these large differences and marks them as scene changes.

#### **Step-by-step Process**
1. It reads two consecutive frames from the video.
2. It converts each frame to the HSV color space (hue, saturation, and brightness channels).
3. It calculates the average per-pixel difference between the two frames in each channel and combines these into a single content score.
4. If the score exceeds the threshold (e.g., 27.0), it marks a scene change.
5. The starting frame of each detected scene is recorded.
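
The thresholding idea in these steps can be sketched with a toy example. This is not PySceneDetect's implementation: the frames here are short lists of pixel intensities, the score is a plain mean absolute difference, and `frame_difference` and `detect_cuts` are hypothetical names chosen for illustration.

```python
def frame_difference(frame_a, frame_b):
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def detect_cuts(frames, threshold=27.0):
    """Return the indices of frames that start a new scene."""
    cuts = []
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) > threshold:
            cuts.append(i)
    return cuts


# Two dark frames followed by two bright frames: one cut, at index 2
frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],
    [200, 190, 210, 205],
    [201, 191, 209, 204],
]
print(detect_cuts(frames))  # [2]
```

Lowering `threshold` makes the toy detector report more cuts and raising it reports fewer, which mirrors the sensitivity behavior described for `ContentDetector`.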

#### **Real-World Analogy**
Think of watching a movie trailer. If two shots look very similar (e.g., a slow zoom on a face), they might be part of the same scene. But if it suddenly cuts to an explosion, that’s a big change—just like `ContentDetector` identifying a new scene!

---

### **Summary**
This script:

1. Installs and imports necessary libraries.
2. Opens the video file and sets up scene detection.
3. Detects scene changes using `ContentDetector` by comparing the color and brightness of consecutive frames.
4. Extracts a frame at the beginning of each detected scene.
5. Saves the extracted frames as images.

This method is useful for:

- Automatic summarization of videos.
- Extracting key moments for analysis.
- Pre-processing data for machine learning tasks.

I hope this breakdown helped you understand the script! Feel free to ask any questions. Happy coding!




## **Step 2: Generate Captions for Frames**
Once we have extracted the key frames, we need to describe what is happening in each image using AI.

### **What is BLIP?**
BLIP (Bootstrapping Language-Image Pre-training) is a vision-language model that generates text captions from images.

In [None]:
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import os

# Load BLIP Model for Image Captioning
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def generate_image_caption(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    caption_ids = model.generate(**inputs)
    caption = processor.decode(caption_ids[0], skip_special_tokens=True)
    return caption

# Example Usage: Generate captions for all extracted frames
image_folder = "scenes"
captions = []

for filename in sorted(os.listdir(image_folder)):  # Alphabetical sort: scene_10.jpg comes before scene_2.jpg
    if filename.endswith(".jpg"):
        image_path = os.path.join(image_folder, filename)
        caption = generate_image_caption(image_path)
        captions.append(f"{filename}: {caption}")

# Combine captions into a single text; this will be sent to the GPT model
video_summary_input = "\n".join(captions)
print(video_summary_input)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



scene_1.jpg: a rabbit is sitting in the grass in a forest
scene_10.jpg: a white rabbit standing next to a tree
scene_11.jpg: a white rabbit standing in a field with a red ball
scene_12.jpg: a man is standing in the grass next to a tree
scene_13.jpg: a white rabbit standing in a field
scene_2.jpg: a tree with leaves
scene_3.jpg: the secret in the secret secret secret secret secret secret secret secret secret secret secret secret secret secret secret secret
scene_4.jpg: a white rabbit standing in a field of flowers
scene_5.jpg: a pig standing in a field of grass
scene_6.jpg: a white bird is perched on a tree branch
scene_7.jpg: a rabbit is standing in the middle of a field of flowers
scene_8.jpg: a white rabbit is running in the grass
scene_9.jpg: a white rabbit standing in a field with a butterfly flying above


**Lecture Script: Understanding BLIP for Image Captioning**

---

## **Introduction**
Hello everyone, and welcome to today's lecture! In this session, we will be exploring how to use a powerful AI model called BLIP (Bootstrapping Language-Image Pre-training) to generate captions for images.

By the end of this lecture, you will understand:
- What BLIP is and how it works.
- The purpose of each line of the given code.
- How to process multiple images to generate captions.

Let's get started!

---

## **Step 1: Importing Required Libraries**
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import os
```
### **Explanation:**
1. **`transformers` Library:** This is a Hugging Face library that provides state-of-the-art models for Natural Language Processing (NLP) and Vision-Language tasks.
    - `BlipProcessor`: Prepares input data for the BLIP model by converting images and text into a format the model understands.
    - `BlipForConditionalGeneration`: The pre-trained BLIP model used for generating image captions.
2. **`PIL` (Python Imaging Library, provided by the Pillow package)**: Used for opening and processing images.
3. **`os` Module**: Helps in navigating directories and working with files.

#### **Example:**
If you are working with images stored in a folder, the `os` module allows you to loop through all image files and process them automatically.

---

## **Step 2: Loading the BLIP Model**
```python
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
```
### **Explanation:**
1. **Loading the Pre-Trained Processor**:
    - `BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")` downloads and loads the BLIP processor, which helps in preparing input images.
2. **Loading the Pre-Trained Model**:
    - `BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")` loads the BLIP model specifically designed for captioning images.

This means we are now ready to process images and generate meaningful captions!

---

## **Step 3: Creating a Function to Generate Captions**
```python
def generate_image_caption(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    caption_ids = model.generate(**inputs)
    caption = processor.decode(caption_ids[0], skip_special_tokens=True)
    return caption
```
### **Breaking it Down:**
1. **Open the Image**:
   - `Image.open(image_path).convert("RGB")`: Opens the image file and ensures it is in RGB format (3 color channels: Red, Green, and Blue).
2. **Process the Image**:
   - `inputs = processor(image, return_tensors="pt")`: Converts the image into a format the model can understand (PyTorch tensors).
3. **Generate Caption**:
   - `caption_ids = model.generate(**inputs)`: Uses the BLIP model to predict a caption for the image.
4. **Decode the Caption**:
   - `caption = processor.decode(caption_ids[0], skip_special_tokens=True)`: Converts the generated token IDs into a human-readable caption.
5. **Return the Caption**:
   - The function outputs the generated caption as a string.

#### **Example Usage:**
If you pass an image named `example.jpg`, this function will return a caption describing the image.

---

## **Step 4: Processing Multiple Images**
```python
image_folder = "scenes"
captions = []

for filename in sorted(os.listdir(image_folder)):  # Alphabetical order
    if filename.endswith(".jpg"):
        image_path = os.path.join(image_folder, filename)
        caption = generate_image_caption(image_path)
        captions.append(f"{filename}: {caption}")
```
### **What Happens Here?**
1. **Define the Folder Containing Images**:
   - `image_folder = "scenes"`: Specifies the directory where images are stored.
2. **Create an Empty List to Store Captions**:
   - `captions = []`: Stores generated captions for each image.
3. **Loop Through Images in the Folder**:
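   - `sorted(os.listdir(image_folder))`: Lists all files and sorts them alphabetically. Note that alphabetical order puts `scene_10.jpg` before `scene_2.jpg`, so with ten or more scenes the captions are no longer in playback order.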
   - `if filename.endswith(".jpg")`: Ensures only `.jpg` files are processed.
4. **Generate Captions for Each Image**:
   - `image_path = os.path.join(image_folder, filename)`: Creates the full path of the image file.
   - `generate_image_caption(image_path)`: Calls our function to generate a caption.
   - `captions.append(f"{filename}: {caption}")`: Stores the filename along with its generated caption.

#### **Example Output:**
If the folder contains `image1.jpg`, `image2.jpg`, and `image3.jpg`, the captions list may look like:
```
['image1.jpg: A dog playing in the park.',
 'image2.jpg: A sunset over the mountains.',
 'image3.jpg: A person riding a bicycle.']
```
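
Because plain `sorted()` compares filenames character by character, `scene_10.jpg` is placed before `scene_2.jpg`. One way to keep the scenes in playback order is a "natural" sort key that compares the digits as numbers. This is a minimal sketch; `natural_key` is a hypothetical helper and it assumes all filenames follow the same `scene_<number>.jpg` pattern.

```python
import re


def natural_key(filename):
    """Split a name into text and integer parts so numbers compare numerically."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", filename)]


names = ["scene_1.jpg", "scene_10.jpg", "scene_2.jpg"]
print(sorted(names))                   # ['scene_1.jpg', 'scene_10.jpg', 'scene_2.jpg']
print(sorted(names, key=natural_key))  # ['scene_1.jpg', 'scene_2.jpg', 'scene_10.jpg']
```

In the loop above, this would mean iterating over `sorted(os.listdir(image_folder), key=natural_key)`.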

---

## **Step 5: Preparing the Final Output**
```python
video_summary_input = "\n".join(captions)
print(video_summary_input)
```
### **Explanation:**
1. **Join Captions into a Single Text**:
   - `"\n".join(captions)`: Combines all captions into a single string, with each caption on a new line.
2. **Print the Final Output**:
   - `print(video_summary_input)`: Displays the generated captions.

#### **Example Output:**
```
image1.jpg: A dog playing in the park.
image2.jpg: A sunset over the mountains.
image3.jpg: A person riding a bicycle.
```
This combined text can be used as an input to a chatbot or a summarization model.

---

## **Summary**
- We used the BLIP model to generate image captions.
- We processed images from a folder and generated meaningful descriptions.
- We combined captions into a structured text summary.

This is a powerful technique that can be applied in automated video summarization, accessibility features, and content tagging.

That’s it for today! If you have any questions, feel free to ask. Happy coding! 🚀



# **Detailed Explanation**

### **How BLIP Works Internally to Generate Image Captions**  

The BLIP (Bootstrapping Language-Image Pre-training) model is designed to understand images and generate natural language descriptions based on what it sees. Let’s break down how it works internally step by step.

---

## **1. Overview of BLIP's Architecture**
BLIP is a vision-language model that combines an image encoder and a text decoder. Internally, it operates using **two main components**:

1. **Vision Encoder (Image Understanding)**
   - This part of the model processes the input image to extract meaningful visual features.
   - It uses a **Vision Transformer (ViT)**, which converts an image into a sequence of feature representations (similar to how text is tokenized in NLP models).
   
2. **Text Decoder (Caption Generation)**
   - This component takes the extracted visual features and generates a human-readable caption.
   - It is a **Transformer-based language model** that generates text one token at a time, in the same autoregressive fashion as GPT-style models.

---

## **2. Step-by-Step Process of Image Captioning**

Let's go step by step through how BLIP processes an image to generate a caption.

### **Step 1: Preprocessing the Image**
- The image is first converted to an RGB format and resized to a standard shape.
- It is then transformed into a tensor, which is the numerical representation of the image.

**Example:**
If the input image is a picture of a cat sitting on a chair, the preprocessing step converts it into a numerical format that the model can understand.

### **Step 2: Feature Extraction with the Vision Encoder**
- The pre-trained **Vision Transformer (ViT)** extracts key features from the image.
- These features capture objects, textures, colors, and spatial relationships.

**Example:**
The encoder does not output object labels. It produces feature vectors that, conceptually, encode the presence of a "cat," a "chair," and background details.

### **Step 3: Generating the Caption with the Text Decoder**
- The text decoder, a transformer-based model, takes the extracted image features and generates a caption.
- It uses an **auto-regressive generation process**, meaning it predicts one word at a time based on previous words and visual features.

**Example:**
1. The model starts with a special start-of-sequence token (the BLIP paper calls it `[Decode]`).
2. It predicts the first word, e.g., `"A"`.
3. Based on `"A"` and the image features, it predicts the next word, e.g., `"cat"`.
4. It continues predicting words until it reaches a stop condition.

**Final Caption:** `"A cat sitting on a chair."`
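
The word-by-word loop can be illustrated with a toy greedy decoder. This is only a sketch of auto-regressive generation: `TOY_MODEL` is a hypothetical lookup table that conditions on the previous word alone, whereas BLIP's decoder conditions on the image features and on all previously generated tokens.

```python
# Hypothetical next-word probabilities, keyed by the previous word
TOY_MODEL = {
    "<start>": {"a": 0.9, "the": 0.1},
    "a": {"cat": 0.7, "dog": 0.3},
    "cat": {"sits": 0.6, "runs": 0.4},
    "sits": {"on": 0.8, "<end>": 0.2},
    "on": {"the": 0.9, "a": 0.1},
    "the": {"chair": 0.8, "cat": 0.2},
    "chair": {"<end>": 1.0},
}


def greedy_decode(model, max_tokens=10):
    """Repeatedly pick the most likely next word until <end> or the limit."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        candidates = model[tokens[-1]]
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])


print(greedy_decode(TOY_MODEL))  # a cat sits on the chair
```

The `max_tokens` limit plays the role of the stop condition mentioned in step 4: generation ends either at the end token or when the limit is reached.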

---

## **3. Detailed Example with Input and Output**
### **Input**
Imagine we give BLIP the following image:

📷 *(An image of a dog playing with a ball in the park.)*

### **Processing Internally**
- The vision encoder processes the image and extracts features that, conceptually, capture:
  ```
  [Dog, Ball, Grass, Sky]
  ```
- The text decoder generates a caption step by step:
  ```
  "A" → "dog" → "playing" → "with" → "a" → "ball" → "in" → "the" → "park."
  ```

### **Output**
The final generated caption:
```
"A dog playing with a ball in the park."
```

---

## **4. Why is BLIP Effective?**
- **Pretraining on Large Datasets**: BLIP is trained on millions of image-text pairs, enabling it to learn rich relationships between vision and language.
- **Transformer-Based Architecture**: Uses powerful transformer models to understand context and generate coherent captions.
- **Fine-Tuning Capabilities**: Can be adapted for specific tasks like detailed descriptions, storytelling, or question answering.

---

## **5. Summary**
- **BLIP first encodes the image using a Vision Transformer (ViT).**
- **It then passes extracted features to a Transformer-based text decoder.**
- **The model generates a caption word by word based on the image features.**
- **The final caption describes the most important elements of the image in natural language.**

This is how BLIP internally works to generate image captions! 🚀

👉 Now we have text descriptions of each extracted frame!

## **Step 3: Summarize the Captions into a Story**
Once we have captions for all frames, we need to turn them into a coherent story.

Example Code: Using GPT to Summarize

In [None]:
from openai import OpenAI

client = OpenAI(
  api_key="********"
)


def summarize_video(captions):
    prompt = f"Summarize the following sequence of video frames into a meaningful story:\n\n{captions}"

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "You are an AI that summarizes video content."},
                  {"role": "user", "content": prompt}]
    )

    return completion.choices[0].message.content

# Get the summary
video_summary = summarize_video(video_summary_input)
print(video_summary)




In a serene forest, a curious white rabbit finds its way through a vibrant landscape. It begins by sitting among the grass (scene_1.jpg) before exploring further. Eventually, it encounters a towering tree (scene_10.jpg) and a colorful field with a playful red ball (scene_11.jpg), delighting in the simple joys of nature.

As the day progresses, the rabbit wanders into a lush field (scene_13.jpg) filled with flowers and the gentle company of a pig grazing nearby (scene_5.jpg). Inspiration strikes as it notices a white bird perched peacefully in the branches above (scene_6.jpg), emphasizing the beauty of their coexistence.

The adventure continues with the rabbit dancing through the flowers (scene_4.jpg) and discovering its playful side, as it gleefully runs through the grass (scene_8.jpg) while a butterfly flutters overhead (scene_9.jpg). All the while, a man stands nearby, admiring the scene and perhaps pondering the secrets of nature (scene_12.jpg, scene_3.jpg).

This whimsical tale ca

👉 GPT has now generated a short summary of the video!

**Lecture Script: Understanding the OpenAI Video Summarization Code**

### Introduction

Hello everyone! Today, we will break down a Python script that interacts with OpenAI’s API to summarize video captions. This script will take text-based descriptions of video frames and generate a meaningful summary.

By the end of this lecture, you will understand:

1. How to import and use OpenAI’s API in Python.
2. How to create an API client for making requests.
3. How to structure prompts for AI-based summarization.
4. How to extract and display the response from OpenAI.
5. How a GPT model processes input and generates meaningful summaries.

---

### **Step 1: Importing the Required Library**

```python
from openai import OpenAI
```

#### **Explanation:**

- This line imports the `OpenAI` class from the `openai` module.
- The `openai` package provides an interface to communicate with OpenAI’s models, such as `gpt-4o-mini`, which this script uses.
- Make sure you have the OpenAI package installed using:
  ```bash
  pip install openai
  ```

---

### **Step 2: Setting Up the OpenAI Client**

```python
client = OpenAI(
  api_key="your-api-key-here"
)
```

#### **Explanation:**

- We create an `OpenAI` client object, which will be used to send requests to OpenAI’s API.
- The `api_key` is required to authenticate the request. Replace "your-api-key-here" with your actual API key.
- Never share your API key publicly as it provides access to OpenAI’s services and can incur costs.

**Example:** If a user wants to connect with OpenAI, they would initialize the client as:

```python
client = OpenAI(api_key="sk-XXXXXX")
```
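
To avoid pasting the key into the notebook, it can be read from an environment variable instead. The sketch below is one way to do that; `load_api_key` is a hypothetical helper, and `OPENAI_API_KEY` is assumed to have been set in the environment beforehand.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from an environment variable, failing loudly if unset."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


# Usage (requires the openai package and a real key in the environment):
# client = OpenAI(api_key=load_api_key())
```

This keeps the key out of the saved notebook, so sharing the file does not expose it.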

---

### **Step 3: Creating the Video Summarization Function**

```python
def summarize_video(captions):
    prompt = f"Summarize the following sequence of video frames into a meaningful story:\n\n{captions}"
```

#### **Explanation:**

- We define a function `summarize_video(captions)` which takes `captions` (text extracted from a video) as input.
- A `prompt` is created using an f-string. It asks the AI model to summarize the video captions.
- The `\n\n` inserts a blank line between the instruction and the captions, so the model can tell them apart.

**Example:** If `captions` contains:

```python
captions = "A man walks into a store. He picks up an apple and pays at the counter."
```

The `prompt` will be:

```python
"Summarize the following sequence of video frames into a meaningful story:\n\nA man walks into a store. He picks up an apple and pays at the counter."
```
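
When the captions carry filename prefixes such as `scene_1.jpg: ...`, the model tends to echo those filenames in its story. A possible variation is to replace them with scene numbers before building the prompt. This is a sketch under that assumption; `build_prompt` is a hypothetical helper that expects a list of `"filename: caption"` strings already in playback order.

```python
def build_prompt(captions):
    """Build the summarization prompt with numbered scenes instead of filenames."""
    lines = [
        f"Scene {i}: {caption.split(': ', 1)[-1]}"
        for i, caption in enumerate(captions, start=1)
    ]
    header = "Summarize the following sequence of video frames into a meaningful story:"
    return header + "\n\n" + "\n".join(lines)


print(build_prompt(["scene_1.jpg: a rabbit", "scene_2.jpg: a tree"]))
```

The instruction text is the same one used in `summarize_video`; only the caption formatting differs.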

---

### **Step 4: Sending a Request to OpenAI’s Chat Model**

```python
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an AI that summarizes video content."},
            {"role": "user", "content": prompt}
        ]
    )
```

#### **Explanation:**

- We call `client.chat.completions.create()` to send a request to OpenAI.
- `model="gpt-4o-mini"` specifies the model used.
- `messages` is a list containing:
  - A system message: Defines AI’s role as a video summarizer.
  - A user message: Contains the actual prompt with captions.
- The AI processes this request and generates a response.

---

### **Step 5: Extracting and Returning the AI’s Response**

```python
    return completion.choices[0].message.content
```

#### **Explanation:**

- `completion.choices[0]` accesses the first (and usually only) response.
- `.message.content` extracts the generated summary text.
- The function returns the summarized content.

---

### **Step 6: Calling the Function and Printing the Summary**

```python
video_summary = summarize_video(video_summary_input)
print(video_summary)
```

#### **Explanation:**

- `summarize_video(video_summary_input)` calls our function with an input text (`video_summary_input`).
- The returned summary is stored in `video_summary`.
- `print(video_summary)` displays the AI-generated summary.

**Example Output:** If `video_summary_input` is:

```python
"A man enters a store, looks around, and buys an apple."
```

The printed output could be:

```python
"A man visits a store and purchases an apple."
```

---

### **Step 7: How the GPT Model Generates a Meaningful Summary**

To understand how the GPT model (here `gpt-4o-mini`) processes captions and generates a story, let's break it down:

1. **Tokenization:**

   - The input text (captions) is broken down into smaller units called tokens (whole words or pieces of words).
   - Example (simplified to whole words): "A man walks into a store." → ["A", "man", "walks", "into", "a", "store"]

2. **Context Understanding:**

   - The model uses a Transformer neural network to relate every token to the others in the sequence.
   - This lets it pick up key entities (e.g., "man", "store") and their actions ("walks into", "buys").

3. **Pattern Recognition:**

   - The model has been trained on large text datasets that include stories, summaries, and structured narratives.
   - It draws on the patterns learned during training rather than looking up stored examples.

4. **Coherent Generation:**

   - The model predicts the next token repeatedly, each time conditioned on the prompt and the text generated so far.
   - In doing so, it restructures the input into a smooth and concise story.

5. **Stopping:**

   - Generation ends when the model emits an end-of-sequence token or reaches the output token limit. The text is then returned as the response.

**Example Breakdown:** Input Captions:

```python
"A boy kicks a ball. The ball hits a window. The window breaks."
```

Processing Steps:

- Identifies key elements: (Boy → kicks → Ball → hits → Window → breaks)
- Recognizes cause-effect relationships.
- Generates a structured summary:

```python
"A boy accidentally breaks a window while playing with a ball."
```

This is how the GPT model transforms scattered captions into a meaningful story!

---

### **Conclusion**

In today’s lesson, we learned:

- How to import and set up OpenAI’s API.
- How to structure a function to summarize video captions.
- How to format and send a request to OpenAI.
- How to extract and print the AI-generated response.
- How a GPT model processes text internally to generate meaningful summaries.

This script is a great starting point for AI-based video summarization. You can enhance it by adding a speech-to-text tool like Whisper, so that a transcript of the video's audio is summarized alongside the frame captions.

Happy coding!



## **Step 4: Convert Frames into a Short Video**
Now, we combine the key frames into a new summarized video.

Example Code: Creating Summary Video

In [None]:
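import os
import moviepy.editor as mp  # MoviePy 1.x API; moviepy.editor was removed in MoviePy 2.x

def create_summary_video(image_folder, output_video):
    # Note: plain sorted() is alphabetical, so scene_10.jpg precedes scene_2.jpg
    images = sorted([os.path.join(image_folder, img) for img in os.listdir(image_folder) if img.endswith(".jpg")])
    clips = [mp.ImageClip(img).set_duration(2) for img in images]  # 2 sec per frame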

    video = mp.concatenate_videoclips(clips, method="compose")
    video.write_videofile(output_video, fps=24)

# Example usage
create_summary_video("scenes", "summary_video.mp4")

Moviepy - Building video summary_video.mp4.
Moviepy - Writing video summary_video.mp4





Moviepy - Done !
Moviepy - video ready summary_video.mp4
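
The length of the summary video follows directly from the parameters above: each image is shown for 2 seconds and the file is written at 24 fps. A quick check for the 13 scene images extracted earlier, using a hypothetical helper `summary_video_stats`:

```python
def summary_video_stats(num_images, seconds_per_image=2, fps=24):
    """Return (duration in seconds, number of frames at the given fps)."""
    duration = num_images * seconds_per_image
    return duration, duration * fps


duration, frames = summary_video_stats(13)
print(duration, frames)  # 26 624
```

Knowing the duration is useful when deciding how long the narration in the next step should be.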


## **Step 5: Add Voice Narration**
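We use gTTS (Google Text-to-Speech) to turn the summary into a spoken narration, saved as an MP3 file. The code below only generates the audio; it does not attach it to the video.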

In [None]:
!pip install gtts



In [None]:
from gtts import gTTS

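def generate_voice_narration(text, output_audio):
    # `text` is the summary string returned by summarize_video()
    print(text)
    tts = gTTS(text, lang="en")  # Convert the text to English speech
    tts.save(output_audio)       # Write the narration as an MP3 file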

# Example usage
generate_voice_narration(video_summary, "narration.mp3")

In a serene forest, a curious white rabbit finds its way through a vibrant landscape. It begins by sitting among the grass (scene_1.jpg) before exploring further. Eventually, it encounters a towering tree (scene_10.jpg) and a colorful field with a playful red ball (scene_11.jpg), delighting in the simple joys of nature.

As the day progresses, the rabbit wanders into a lush field (scene_13.jpg) filled with flowers and the gentle company of a pig grazing nearby (scene_5.jpg). Inspiration strikes as it notices a white bird perched peacefully in the branches above (scene_6.jpg), emphasizing the beauty of their coexistence.

The adventure continues with the rabbit dancing through the flowers (scene_4.jpg) and discovering its playful side, as it gleefully runs through the grass (scene_8.jpg) while a butterfly flutters overhead (scene_9.jpg). All the while, a man stands nearby, admiring the scene and perhaps pondering the secrets of nature (scene_12.jpg, scene_3.jpg).

This whimsical tale ca
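
The narration text above still contains frame references such as `(scene_1.jpg)`, which the text-to-speech engine would read aloud. One option is to strip them first. This is a minimal sketch; `clean_for_narration` is a hypothetical helper and the pattern assumes references always look like `(scene_<number>.jpg)` or a comma-separated list of them.

```python
import re


def clean_for_narration(text):
    """Remove parenthesized scene filename references from the summary."""
    pattern = r"\s*\((?:scene_\d+\.jpg(?:,\s*)?)+\)"
    return re.sub(pattern, "", text).strip()


sample = "It begins by sitting among the grass (scene_1.jpg) before exploring further."
print(clean_for_narration(sample))
# It begins by sitting among the grass before exploring further.
```

The cleaned string can then be passed to `generate_voice_narration` in place of the raw summary.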

# **Conclusion**
✅ Extracted key frames from a video

✅ Generated captions using AI

✅ Summarized the captions into a story

✅ Created a short video with highlights

✅ Added voice narration

This is how AI can automatically summarize long videos into short, meaningful clips! 🚀

In [None]:
!pip install streamlit opencv-python transformers torch pillow moviepy gtts scenedetect[opencv]



In [None]:
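# Project folder on Google Drive (the directory used at the start of the notebook)
folder_path = '/content/drive/MyDrive/Colab Notebooks/Video-Summarization/'
file_name = 'requirements.txt'
file_path = folder_path + file_name

# Packages from the pip install cell above, plus openai (used in Step 3)
requirements = ['streamlit', 'opencv-python', 'transformers', 'torch', 'pillow',
                'moviepy', 'gtts', 'scenedetect[opencv]', 'openai']

# Write one package per line so that `pip install -r requirements.txt` works
with open(file_path, 'w') as f:
    f.write('\n'.join(requirements) + '\n')

print(f"File saved to: {file_path}")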

File saved to: /content/drive/MyDrive/Colab Notebooks/Video-Summarization/requirements.txt


In [None]:
file_name = 'app.py'
file_path = folder_path + file_name

# Write placeholder content; replace this with the actual Streamlit app code
with open(file_path, 'w') as f:
    f.write('Hello This is app.py')

print(f"File saved to: {file_path}")

File saved to: /content/drive/MyDrive/Colab Notebooks/Video-Summarization/app.py


In [None]:
dockerfile_content = """
# Use Python base image
FROM python:3.9

# Set working directory
WORKDIR /app

# Copy all project files
COPY . .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Expose Streamlit port
EXPOSE 8501

# Run the Streamlit app
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
"""

# Define the file path for the Dockerfile
dockerfile_path = folder_path + 'Dockerfile'

# Open the file and write the Dockerfile content
with open(dockerfile_path, 'w') as f:
    f.write(dockerfile_content)

print(f"Dockerfile saved to: {dockerfile_path}")


Dockerfile saved to: /content/drive/MyDrive/Colab Notebooks/Video-Summarization/Dockerfile


In [None]:
!ls


app.py	big-buck-bunny.mp4  drive  frames  narration.mp3  sample_data  scenes
