<h1 style="color: red;">Running the script</h1>
<h3 style="color: orange;">Ensure ffmpeg is installed and available in your environment.</h3>
<h3 style="color: orange;">The script prompts for a YouTube URL and the video's source language; specifying the source language improves the translation to English.</h3>
<h3 style="color: orange;">The video is downloaded to a file named after its video ID, the audio is extracted to a file with the same ID, the audio is translated to English, and semantic chunks of the resulting transcript are returned.</h3>
<h3 style="color: orange;">Each run writes its logs to a file named process_log_&lt;timestamp&gt;.log.</h3>

<h1 style="color: blue;">Step 0: Installation of libraries</h1>

In [None]:
# Ensure necessary packages are installed
!pip install yt-dlp moviepy openai-whisper gradio pydub

import os
import re
import logging
import time
from yt_dlp import YoutubeDL
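try:
    from moviepy import VideoFileClip  # moviepy 2.x exposes the class at top level
except ImportError:
    from moviepy.editor import VideoFileClip  # moviepy 1.x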
import whisper
import gradio as gr
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Mapping from language names to language codes
LANGUAGE_CODES = {
    "English": "en",
    "Spanish": "es",
    "German": "de",
    "French": "fr",
    "Chinese": "zh",
    "Japanese": "ja",
    "Korean": "ko",
    "Hindi": "hi",
    "Tamil": "ta",
    "Malayalam": "ml",
    "Urdu": "ur",
    "Bengali": "bn",
    "Kannada": "kn",
    "Marathi": "mr",
    "Punjabi": "pa",
    # Add more languages as needed
}

<h1 style="color: blue;">Creation of Log file</h1>

In [None]:
# Create a log file with a unique name based on the current timestamp
log_filename = f"process_log_{int(time.time())}.log"
logging.basicConfig(filename=log_filename, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
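<h4 style="color: white;">A possible refinement, sketched under assumptions: in a notebook, logging.basicConfig does nothing when the root logger already has handlers, so re-running the cell may keep writing to the previous file. The helper name configure_run_logging and its log_dir parameter are hypothetical; force=True requires Python 3.8 or later.</h4>

```python
import logging
import os
import tempfile
import time


def configure_run_logging(log_dir="."):
    """Point root logging at a fresh timestamped file and return its path."""
    log_path = os.path.join(log_dir, f"process_log_{int(time.time())}.log")
    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        force=True,  # drop handlers left over from an earlier run (Python 3.8+)
    )
    return log_path


# Demo: write one record into a temporary directory.
demo_log_path = configure_run_logging(tempfile.mkdtemp())
logging.info("Logging configured")
```

<h4 style="color: white;">Calling the helper at the start of each run gives every run its own log file, even when the kernel is not restarted.</h4>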

<h1 style="color: blue;">Step 1: Download Video and Extract Audio</h1>


In [None]:
def get_video_id(url):
    try:
        # Extract the video ID from the URL
        match = re.search(r'v=([^&]+)', url)
        if match:
            return match.group(1)
        else:
            raise ValueError("Invalid YouTube URL")
    except Exception as e:
        logging.error(f"Error extracting video ID: {e}")
        raise

# Step 1: Download Video and Extract Audio

def download_youtube_video(url):
    try:
        video_id = get_video_id(url)
        output_path = f"{video_id}.mp4"
        ydl_opts = {
            'format': 'bestvideo+bestaudio/best',
            'outtmpl': output_path,
            'merge_output_format': 'mp4'
        }
        with YoutubeDL(ydl_opts) as ydl:
            ydl.download([url])
        logging.info(f"Video downloaded successfully: {output_path}")
        return output_path
    except Exception as e:
        logging.error(f"Error downloading video: {e}")
        raise RuntimeError(f"Error downloading video: {e}")


<h4 style="color: orange;">Library: yt-dlp</h4>
<h4 style="color: orange;">Reason: yt-dlp is a command-line program and Python library for downloading videos from YouTube and other video platforms. It is a fork of youtube-dl with additional features and bug fixes.</h4>
<h4 style="color: orange;">Description: This function takes a YouTube URL, downloads the best available video and audio streams, and merges them into an MP4 file named after the video ID. Merging the two streams requires ffmpeg.</h4>
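<h4 style="color: white;">The regular expression in get_video_id only recognizes URLs that carry a v= query parameter, so short links such as youtu.be/ID raise "Invalid YouTube URL". A minimal sketch of a more tolerant parser follows. The function name extract_video_id and the lists of accepted hosts and path prefixes are assumptions for illustration, not an exhaustive description of YouTube's URL scheme.</h4>

```python
from urllib.parse import urlparse, parse_qs

# Assumed host and path shapes; extend as needed.
YOUTUBE_HOSTS = ("youtube.com", "m.youtube.com", "music.youtube.com")
PATH_PREFIXES = ("shorts", "embed", "live")


def extract_video_id(url):
    """Return the video ID from common YouTube URL shapes, or raise ValueError."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = ""
    if host == "youtu.be":
        # Short link: the ID is the first path component.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        else:
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in PATH_PREFIXES:
                candidate = parts[1]

    if not candidate:
        raise ValueError(f"Invalid YouTube URL: {url}")
    return candidate


print(extract_video_id("https://youtu.be/abc123XYZ_-?t=5"))
```

<h4 style="color: white;">Unlike the regular expression, this version also rejects non-YouTube hosts that happen to contain a v= parameter.</h4>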

In [None]:
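def extract_audio_from_video(video_path):
    try:
        # Swap the extension rather than replacing ".mp4" anywhere in the path
        audio_path = os.path.splitext(video_path)[0] + ".wav"
        video = VideoFileClip(video_path)
        try:
            if video.audio is None:
                raise ValueError("Video has no audio track")
            video.audio.write_audiofile(audio_path)
        finally:
            # Release the file handles and ffmpeg readers held by the clip
            video.close()
        logging.info(f"Audio extracted successfully: {audio_path}")
        return audio_path
    except Exception as e:
        logging.error(f"Error extracting audio: {e}")
        raise RuntimeError(f"Error extracting audio: {e}")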

<h4 style="color: orange;">Library: moviepy</h4>
<h4 style="color: orange;">Reason: moviepy is a versatile library for video editing in Python. It allows easy manipulation of video and audio files.</h4>
<h4 style="color: orange;">Description: This function extracts the audio from the downloaded video and saves it as a WAV file.</h4>
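<h4 style="color: white;">Since ffmpeg is already a prerequisite, the extraction could also be done by calling ffmpeg directly instead of going through moviepy. This is an alternative sketch, not the notebook's method. The function names are hypothetical, and the choice of 16 kHz mono output is an assumption made because Whisper resamples its input to that format anyway.</h4>

```python
import subprocess
from pathlib import Path


def build_ffmpeg_audio_command(video_path, sample_rate=16000):
    """Return (command, audio_path) for extracting mono WAV audio with ffmpeg."""
    audio_path = str(Path(video_path).with_suffix(".wav"))
    command = [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video_path),    # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # output sample rate in Hz
        audio_path,
    ]
    return command, audio_path


def extract_audio_with_ffmpeg(video_path):
    """Run ffmpeg and return the path of the extracted WAV file."""
    command, audio_path = build_ffmpeg_audio_command(video_path)
    # check=True raises CalledProcessError if ffmpeg exits with an error
    subprocess.run(command, check=True, capture_output=True)
    return audio_path


command, audio_path = build_ffmpeg_audio_command("abc123.mp4")
print(" ".join(command))
```

<h4 style="color: white;">Building the command separately from running it makes the arguments easy to inspect or log before anything is executed.</h4>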

<h1 style="color: blue;">Step 2: Transcription of Audio</h1>


In [None]:
# Step 2: Transcription of Audio

def transcribe_audio(audio_path, source_language=None):
    try:
        # Check if audio file exists
        if not os.path.isfile(audio_path):
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        # Load the whisper model
        model = whisper.load_model("base")
        
        # Convert language name to code
        source_language_code = LANGUAGE_CODES.get(source_language)
        
        # Set options for transcription, including the source language and target language as English
        options = {"language": source_language_code} if source_language_code else {}
        options["task"] = "translate"
        
        # Transcribe the audio with specified options
        result = model.transcribe(audio_path, **options)
        
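        logging.info(f"Audio transcribed successfully: {audio_path}")
        return result["text"], result["segments"]
    except Exception as e:
        # Log the failure like the other steps do before re-raising
        logging.error(f"Error transcribing audio: {e}")
        raise RuntimeError(f"Error transcribing audio: {e}")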

<h4 style="color: orange;">Library: whisper</h4>
<h4 style="color: orange;">Reason: whisper is an open-source ASR (Automatic Speech Recognition) model developed by OpenAI. It supports multiple languages and can translate speech to English as well as transcribe it.</h4>
<h4 style="color: orange;">Description: This function runs Whisper's "translate" task on the audio file, so the output text is in English. Passing the source language is optional; when it is omitted, Whisper detects the language itself. The function returns the full transcript and the individual segments with their start and end timestamps.</h4>
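<h4 style="color: white;">To make the shape of the returned segments concrete, the sketch below formats a list of segments as timestamped lines. The mock data is invented for illustration and only uses the start, end, and text keys that the rest of the notebook reads; the helper names are hypothetical.</h4>

```python
def format_timestamp(seconds):
    """Convert a time in seconds to HH:MM:SS, truncating fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments):
    """Return one '[start - end] text' line per segment."""
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in segments
    ]


# Mock segments standing in for the second return value of transcribe_audio
mock_segments = [
    {"start": 0.0, "end": 4.2, "text": " Hello and welcome."},
    {"start": 4.2, "end": 65.0, "text": " Today we look at chunking."},
]

for line in format_segments(mock_segments):
    print(line)
```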

<h1 style="color: blue;">Step 3: Semantic Chunking of Data</h1>


In [None]:
# Step 3: Semantic Chunking of Data

def create_semantic_chunks(transcript, segments, max_chunk_length=15.0):
    try:
        chunks = []
        current_chunk = {"text": "", "start_time": None, "end_time": None}
        for segment in segments:
            start_time = segment["start"]
            end_time = segment["end"]
            text = segment["text"]
            if current_chunk["start_time"] is None:
                current_chunk["start_time"] = start_time
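            # If adding this segment would exceed the limit, finalize the current chunk first.
            # Its end_time already holds the end of the last segment it contains, and the
            # text check avoids emitting an empty chunk when a single segment is too long.
            if current_chunk["text"] and (end_time - current_chunk["start_time"]) > max_chunk_length:
                chunks.append(current_chunk)
                current_chunk = {"text": "", "start_time": start_time, "end_time": None}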
            current_chunk["text"] += text + " "
            current_chunk["end_time"] = end_time
        if current_chunk["text"]:
            chunks.append(current_chunk)
        
        logging.info("Semantic chunking completed successfully")
        return [{"chunk_id": idx + 1, "chunk_length": chunk["end_time"] - chunk["start_time"], "text": chunk["text"].strip(), "start_time": chunk["start_time"], "end_time": chunk["end_time"]} for idx, chunk in enumerate(chunks)]
    except Exception as e:
        logging.error(f"Error creating semantic chunks: {e}")
        raise RuntimeError(f"Error creating semantic chunks: {e}")

<h4 style="color: orange;">Description: This function creates semantic chunks from the transcription segments.</h4>
<h4 style="color: orange;">Logic:</h4>
<h4 style="color: white;">Initialize: Start with an empty chunk.</h4>
<h4 style="color: white;">Iterate Segments: Loop through each segment, accumulating text until the chunk reaches the maximum length.</h4>
<h4 style="color: white;">Chunk Creation: If adding a segment would exceed the max chunk length, finalize the current chunk and start a new one.</h4>
<h4 style="color: white;">Finalize: Ensure the last chunk is added to the list.</h4>
<h4 style="color: orange;">Reasoning: Whisper segments tend to follow phrase or sentence boundaries, so grouping whole segments, rather than cutting at fixed time offsets, keeps each chunk readable on its own, while the maximum length keeps chunks manageable.</h4>
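<h4 style="color: white;">The logic described above can be traced on a small example. The sketch below is a compact, standalone restatement of the same steps, not the notebook's function itself; the name chunk_segments and the mock segments are assumptions for illustration.</h4>

```python
def chunk_segments(segments, max_chunk_length=15.0):
    """Group consecutive segments so each chunk spans at most max_chunk_length seconds."""
    chunks = []
    current = None
    for seg in segments:
        # Finalize the current chunk if adding this segment would exceed the limit.
        if current is not None and seg["end"] - current["start_time"] > max_chunk_length:
            chunks.append(current)
            current = None
        if current is None:
            current = {"text": "", "start_time": seg["start"], "end_time": seg["end"]}
        current["text"] = (current["text"] + " " + seg["text"].strip()).strip()
        current["end_time"] = seg["end"]
    if current is not None:
        chunks.append(current)
    return chunks


mock_segments = [
    {"start": 0.0, "end": 6.0, "text": " First."},
    {"start": 6.0, "end": 12.0, "text": " Second."},
    {"start": 12.0, "end": 18.0, "text": " Third."},
    {"start": 18.0, "end": 40.0, "text": " A long fourth segment."},
]

for chunk in chunk_segments(mock_segments):
    print(chunk["start_time"], chunk["end_time"], chunk["text"])
```

<h4 style="color: white;">The first two segments fit within 15 seconds and form one chunk; the third would push that chunk to 18 seconds, so it starts a new one. The fourth segment is longer than the limit by itself and becomes its own chunk, which shows that the limit is a target rather than a hard guarantee.</h4>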

<h1 style="color: blue;">Step 4: Gradio Interface for User Interaction</h1>


In [None]:
# Gradio Interface

def process_video(url, source_language):
    try:
        video_path = download_youtube_video(url)
        audio_path = extract_audio_from_video(video_path)
        
        # Get absolute path of the audio file
        audio_path = os.path.abspath(audio_path)
        
        transcript, segments = transcribe_audio(audio_path, source_language)
        semantic_chunks = create_semantic_chunks(transcript, segments)
        return semantic_chunks
    except Exception as e:
        logging.error(f"Error processing video: {e}")
        # Return a JSON-serializable object so the "json" output can display the error
        return {"error": str(e)}

iface = gr.Interface(
    fn=process_video,
    inputs=[
        # gr.inputs.* was removed in recent Gradio releases; use the top-level components
        gr.Textbox(lines=2, placeholder="Enter YouTube URL here...", label="YouTube URL"),
        gr.Dropdown(choices=list(LANGUAGE_CODES.keys()), label="Select Source Language (for translation to English)")
    ],
    outputs="json",
    title="YouTube Video Semantic Chunker",
    description="Extracts high-quality, meaningful (semantic) segments from a YouTube video and translates them to English. Select the source language for better translation accuracy."
)

# Launch Gradio app
iface.launch()

<h4 style="color: orange;">Library: gradio</h4>
<h4 style="color: orange;">Reason: gradio is a library for creating interactive user interfaces for machine learning models in Python. It allows for easy deployment and user interaction.</h4>
<h4 style="color: orange;">Description: This interface lets users enter a YouTube URL and pick the source language by name from a dropdown. It then processes the video and returns the semantic chunks in JSON format, or an error message if any step fails.</h4>

<h2 style="color: red;">Strengths and Weaknesses of the Approach</h2>
<h3 style="color: yellow;">Strengths:</h3>


<h4 style="color: white;">High Precision: Grouping whole transcription segments under a maximum chunk length yields short, self-contained passages, each with explicit start and end timestamps.</h4>
<h4 style="color: white;">Multilingual Support: The Whisper model accepts speech in many source languages and, with the translate task used here, outputs English text.</h4>
<h4 style="color: white;">User-Friendly Interface: The Gradio app provides an easy-to-use interface for non-technical users.</h4>
<h4 style="color: white;">Modular Design: Each step is modular, making it easy to adapt and extend the workflow for different use cases.</h4>

<h3 style="color: yellow;">Weaknesses:</h3>
<h4 style="color: white;">Dependency on Whisper Model: The approach relies on the Whisper model's language support and accuracy. Unsupported languages or poor model performance on certain accents can affect results.</h4>
<h4 style="color: white;">Fixed Chunk Length: The max chunk length is fixed, which might not be optimal for all types of content. More dynamic chunking strategies could be explored.</h4>
<h4 style="color: white;">Audio Quality: Poor audio quality or significant background noise can impact transcription accuracy.</h4>
<h4 style="color: white;">Computationally Intensive: Processing long videos or videos with high-resolution audio can be computationally intensive and time-consuming.</h4>
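<h4 style="color: white;">As one illustration of the more dynamic chunking mentioned under Fixed Chunk Length, chunks could also be split wherever the speaker pauses. The sketch below is only one possible strategy and is not part of the notebook; the function name chunk_on_pauses, the min_pause threshold of one second, and the mock segments are all assumptions.</h4>

```python
def chunk_on_pauses(segments, min_pause=1.0, max_chunk_length=15.0):
    """Split at silences of at least min_pause seconds, or when a chunk would grow too long."""
    groups = []
    current = []
    for seg in segments:
        if current:
            pause = seg["start"] - current[-1]["end"]
            too_long = seg["end"] - current[0]["start"] > max_chunk_length
            if pause >= min_pause or too_long:
                groups.append(current)
                current = []
        current.append(seg)
    if current:
        groups.append(current)
    return [
        {
            "chunk_id": idx + 1,
            "start_time": group[0]["start"],
            "end_time": group[-1]["end"],
            "text": " ".join(s["text"].strip() for s in group),
        }
        for idx, group in enumerate(groups)
    ]


mock_segments = [
    {"start": 0.0, "end": 3.0, "text": " A"},
    {"start": 3.2, "end": 6.0, "text": " B"},
    {"start": 8.0, "end": 10.0, "text": " C"},    # follows a 2-second pause
    {"start": 10.1, "end": 25.0, "text": " D"},   # would stretch the chunk past 15 s
]

for chunk in chunk_on_pauses(mock_segments):
    print(chunk)
```

<h4 style="color: white;">Here the pause before C closes the first chunk, and D starts a new chunk because appending it to C would exceed the length limit.</h4>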