# Problem Definition & Objective

## Selected Project Track
* **Track:** (e.g., "AI for Content Creation" or "Multimedia Processing" - *Fill this in based on your specific hackathon/assignment track*)

## Clear Problem Statement
Video content creation involves tedious post-processing. Specifically, identifying silence, removing it, and generating accurate captions manually can take hours for just a few minutes of footage. This creates a bottleneck for educators, vloggers, and content creators who need to produce accessible content quickly.

## Real-world Relevance and Motivation
* **Accessibility:** Captions make content accessible to the deaf and hard-of-hearing community.
* **Efficiency:** Automating the "jump-cut" editing style can remove much of the manual trimming work that otherwise dominates editing time.
* **Engagement:** Short-form content (Reels/TikTok) requires fast-paced editing to retain viewer attention.

# Data Understanding & Preparation
In this stage, we set up the environment and prepare the libraries required for:
1.  **Computer Vision/Video Processing:** Using `MoviePy` for cutting and stitching video frames.
2.  **Audio Processing:** Using `OpenAI Whisper` for state-of-the-art speech-to-text transcription.

We also handle dependencies to ensure the code is reproducible.
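
Before installing anything, it can help to see which command-line tools are already on the machine. The sketch below is an illustration rather than part of the original notebook: `check_tools` is a hypothetical helper, and it assumes the two binaries of interest are `ffmpeg` (which Whisper calls to decode audio) and ImageMagick's `convert` (the binary MoviePy is pointed at later in this notebook).

```python
import shutil

def check_tools(tools=("ffmpeg", "convert")):
    """Map each command-line tool name to its resolved path, or None if it is missing."""
    return {name: shutil.which(name) for name in tools}

# Report what is (and is not) available on the current PATH
for name, path in check_tools().items():
    status = path if path else "NOT FOUND"
    print(f"{name}: {status}")
```

If either tool shows as `NOT FOUND`, the install cell that follows is the place to add it.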

In [None]:
# --- INSTALL SYSTEM DEPENDENCIES ---
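# Install Whisper and MoviePy (pinned below 2.0, which removed the
# `moviepy.editor` module imported later in this notebook)
!pip install git+https://github.com/openai/whisper.git "moviepy<2"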

# Install ImageMagick system library
!apt-get install -y imagemagick

# FIX: Allow ImageMagick to read/write files (required for Google Colab)
!sed -i 's/policy domain="path" rights="none" pattern="@\*"/policy domain="path" rights="read|write" pattern="@\*"/g' /etc/ImageMagick-6/policy.xml

# Tell MoviePy where ImageMagick is located
import os
from moviepy.config import change_settings
change_settings({"IMAGEMAGICK_BINARY": r"/usr/bin/convert"})

In [None]:
# --- LOAD LIBRARIES AND AI MODEL ---
import whisper
import time
from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip
from moviepy.editor import concatenate_videoclips

# Load the AI model ("base" is a small checkpoint: fast, with reasonable accuracy)
model = whisper.load_model("base")

# Verify your video exists
input_video_path = "input_video.mp4" # Ensure you uploaded this file!
if os.path.exists(input_video_path):
    print("Model and video ready!")
else:
    print("ERROR: Upload 'input_video.mp4' to the sidebar first.")

# Model / System Design
The system follows a sequential pipeline approach:

1.  **Input:** Raw video file (MP4).
2.  **Transcription Engine (Whisper):** Analyzes audio to generate text segments with precise timestamps (Start Time, End Time).
3.  **Filtering Logic:** The system iterates through transcribed segments. Timestamps containing valid speech are retained; silence/background noise is discarded.
4.  **Video Composition (MoviePy):**
    * Extract subclips based on active speech timestamps.
    * Overlay each segment's text as a caption on its subclip.
    * Concatenate subclips to form a continuous video.
5.  **Output:** Rendered video with silence removed and captions burned in.

**Flow:** `Input Video` -> `Speech Detection` -> `Timeline Mapping` -> `Clip Stitching` -> `Final Output`
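
The `Timeline Mapping` step can be sketched without any video library. This is a minimal illustration, assuming the kept speech intervals are already available as ordered `(start, end)` pairs in seconds. The helper name `map_to_output_timeline` and the sample intervals are hypothetical, not taken from the notebook.

```python
def map_to_output_timeline(kept):
    """Return where each kept source interval lands in the stitched output.

    `kept` is an ordered list of (start, end) pairs in seconds.
    """
    mapping = []
    cursor = 0.0  # running position in the output video
    for start, end in kept:
        length = end - start
        if length <= 0:
            continue  # ignore empty or inverted intervals
        mapping.append({"src": (start, end), "out": (cursor, cursor + length)})
        cursor += length
    return mapping, cursor

# Hand-written example intervals (seconds in the raw footage)
kept = [(0.0, 2.5), (4.0, 7.0), (10.0, 11.5)]
mapping, total = map_to_output_timeline(kept)
for entry in mapping:
    print(entry["src"], "->", entry["out"])
print("Output duration:", total)
```

Because the gaps between intervals are simply skipped, the output duration is the sum of the kept interval lengths.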

# Core Implementation
This section contains the primary logic. The function `process_and_caption` encapsulates the transcription and video editing workflow.

**Key Technical Details:**
* **Input:** Video path, Output path, Whisper model.
* **Logic:** We loop through `result['segments']` provided by Whisper to identify exactly when the user is speaking.
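
To make the segment loop concrete, here is a small sketch of the filtering step on its own. It assumes a result dictionary shaped like the one the implementation reads (a `segments` list whose items carry `start`, `end`, `text`, and `no_speech_prob`). The `mock_result` data is hand-written for illustration, not real Whisper output, and `speech_segments` is a hypothetical helper name.

```python
def speech_segments(result, max_no_speech_prob=0.45):
    """Keep (start, end, text) for segments that look like real speech."""
    kept = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        # Drop segments the model flags as likely non-speech, or with no text
        if seg.get("no_speech_prob", 0) > max_no_speech_prob or not text:
            continue
        kept.append((seg["start"], seg["end"], text))
    return kept

# Hand-written stand-in for a transcription result (assumption, not real output)
mock_result = {"segments": [
    {"start": 0.0, "end": 2.5, "text": " Hello everyone.", "no_speech_prob": 0.02},
    {"start": 2.5, "end": 4.0, "text": " ", "no_speech_prob": 0.91},
    {"start": 4.0, "end": 7.0, "text": " Welcome back.", "no_speech_prob": 0.10},
]}

for start, end, text in speech_segments(mock_result):
    print(f"{start:.1f}s - {end:.1f}s: {text}")
```

The `0.45` default mirrors the threshold used in the implementation; a lower value discards more borderline segments, and a higher one keeps more.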

In [None]:
# --- COMBINED CUTTING & CAPTIONING ---

def process_and_caption(input_path, output_path, model):
    print("Step 1: Transcribing and Analyzing Audio...")
    result = model.transcribe(input_path, fp16=False)

    video = VideoFileClip(input_path)
    final_clips = []

    print("Step 2: Cutting silence and adding captions...")
    for segment in result['segments']:
        start_time = segment['start']
        end_time = segment['end']
        text = segment['text'].strip()

        # Noise Filter: Skip if AI is unsure it's speech, or if there is no text to caption
        if segment.get('no_speech_prob', 0) > 0.45 or not text:
            continue

        # 1. Create the subclip (The 'Cut')
        clip = video.subclip(max(0, start_time), min(video.duration, end_time))

        # 2. Create the caption (The 'Subtitle')
        txt_clip = TextClip(
            text,
            fontsize=45,
            color='yellow',
            font='Arial-Bold',
            stroke_color='black',
            stroke_width=1,
            method='caption',
            size=(int(video.w * 0.8), None)  # whole-pixel width for ImageMagick
        ).set_duration(clip.duration).set_position(('center', video.h * 0.8))

        # 3. Combine them: Overlay text on this specific clip
        captioned_clip = CompositeVideoClip([clip, txt_clip])
        final_clips.append(captioned_clip)

    if final_clips:
        print("Step 3: Stitching the final captioned video...")
        final_video = concatenate_videoclips(final_clips)
        final_video.write_videofile(output_path, codec="libx264", audio_codec="aac")
        return video.duration, final_video.duration
    else:
        print("No clear speech detected.")
        return 0, 0

# Run the final process
original_len, new_len = process_and_caption(input_video_path, "final_ai_vlog.mp4", model)


# Evaluation & Analysis
We evaluate the system by comparing the duration of the original raw footage vs. the processed output. A reduction in time indicates successful silence removal.

*Note: Ensure you have a file named 'input_video.mp4' in your Colab files before running.*
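
The two figures of interest, the share of runtime removed and the processing speed relative to video length, reduce to simple arithmetic. The sketch below uses a hypothetical helper, `summarize_run`, with made-up numbers purely to show the calculation; they are not measured results.

```python
def summarize_run(original_len, new_len, processing_time):
    """Compute silence-removal and speed metrics from three durations in seconds."""
    removed = original_len - new_len
    reduction_pct = 100.0 * removed / original_len if original_len > 0 else 0.0
    # Speed > 1.0 means processing finished faster than the video's own runtime
    speed_x = original_len / processing_time if processing_time > 0 else float("inf")
    return {"removed_s": removed, "reduction_pct": reduction_pct, "speed_x": speed_x}

# Illustrative numbers only: a 120 s clip cut to 90 s, processed in 60 s
stats = summarize_run(original_len=120.0, new_len=90.0, processing_time=60.0)
print(f"Removed: {stats['removed_s']:.1f} s ({stats['reduction_pct']:.1f}%)")
print(f"Processing speed: {stats['speed_x']:.2f}x relative to video length")
```

The guards on zero durations matter here because the pipeline returns `0, 0` when no clear speech is detected.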

In [None]:
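# --- PERFORMANCE EVALUATION ---
# original_len and new_len are returned by process_and_caption above
print("--- RESULTS ---")
print(f"Original Length: {original_len:.2f} seconds")
print(f"Edited Length: {new_len:.2f} seconds")

# Time a standalone transcription pass to estimate AI processing time
t0 = time.time()
model.transcribe(input_video_path, fp16=False)
inference_duration = time.time() - t0
print(f"AI Processing Time: {inference_duration:.2f} seconds")

# Real-time factor (e.g., 2.0x means it processed twice as fast as the video)
if original_len > 0:
    rtf = original_len / inference_duration
    print(f"Processing Speed: {rtf:.2f}x relative to video length")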

# Ethical Considerations & Responsible AI
* **Privacy:** This tool processes voice data. In a production environment, data should be processed locally or encrypted to protect user privacy.
* **Bias in Transcription:** AI models (like Whisper) may have biases against certain accents or dialects. We must acknowledge that the "silence removal" relies on the model accurately detecting speech, which might fail for non-native speakers.
* **Content Manipulation:** Automated editing tools can be used to take words out of context. Responsible use guidelines should be established.

# Conclusion & Future Scope
## Conclusion
We implemented an automated video editing pipeline that uses OpenAI Whisper to transcribe speech and locate it in time. The system removes segments flagged as non-speech and overlays captions on the remaining clips, streamlining the video creation process.

## Future Scope
1.  **Speaker Diarization:** Differentiate between multiple speakers to only keep clips from a specific person.
2.  **Burn-in Captions:** The pipeline already overlays transcribed text onto the video frames using ImageMagick; styling and timing of these captions could be refined further.
3.  **UI Wrapper:** Wrap this Python logic in a Streamlit or Gradio interface for non-technical users.