## Task 1: Semantic Chunking of a YouTube Video

**Problem Statement:**

The objective is to extract high-quality, meaningful (semantic) segments from the specified YouTube video: [Watch Video](https://www.youtube.com/watch?v=Sby1uJ_NFIY).

Suggested workflow:
1. **Download Video and Extract Audio:** Download the video and separate the audio component.
2. **Transcription of Audio:** Utilize an open-source Speech-to-Text model to transcribe the audio. *Provide an explanation of the chosen model and any techniques used to enhance the quality of the transcription.*
3. **Time-Align Transcript with Audio:** *Describe the methodology and steps for aligning the transcript with the audio.*
4. **Semantic Chunking of Data:** Slice the data into audio-text pairs, using both semantic information from the text and voice activity information from the audio, with each audio chunk shorter than 15 s. *Explain the logic used for semantic chunking and discuss the strengths and weaknesses of your approach.*

**Judgement Criteria:**

1. **Precision-Oriented Evaluation:** The evaluation focuses on precision rather than recall. Higher scores are achieved by reporting fewer but more accurate segments rather than a larger number of segments with inaccuracies. Segment accuracy is determined by:
   - **Transcription Quality:** Accuracy of the text transcription for each audio chunk.
   - **Segment Quality:** Semantic richness of the text segments.
   - **Timestamp Accuracy:** Precision of the start and end times for each segment. Avoid audio cuts at the start or end of a segment.
   
2. **Detailed Explanations:** Provide reasoning behind each step in the process.
3. **Generalization:** Discuss the general applicability of your approach, potential failure modes on different types of videos, and adaptation strategies for other languages.
4. **[Bonus-1]** **Gradio App Interface:** Wrap your code in a Gradio app that takes a YouTube link as input and displays the output in a text box.
5. **[Bonus-2]** **Utilizing Ground-Truth Transcripts:** Propose a method to improve the quality of your transcript using a ground-truth transcript provided as a single text string. Explain your hypothesis for this approach. *Note that a code snippet isn't required for this question.*

  As an example: for the audio extracted from [yt-link](https://www.youtube.com/watch?v=ysLiABvVos8), how can we leverage the transcript scraped from [here](https://www.newsonair.gov.in/bulletins-detail/english-morning-news-7/) to improve the overall transcription quality of the segments?

**Submission Format:**

Your submission should be a well-documented Jupyter notebook capable of reproducing your results. The notebook should automatically install all required dependencies and output the results in the specified format.

- **Output Format:** Provide the results as a list of dictionaries, each representing a semantic chunk. Each dictionary should include:
  - `chunk_id`: A unique identifier for the chunk (integer).
  - `chunk_length`: The duration of the chunk in seconds (float).
  - `text`: The transcribed text of the chunk (string).
  - `start_time`: The start time of the chunk within the video (float).
  - `end_time`: The end time of the chunk within the video (float).

```python
sample_output_list = [
    {
        "chunk_id": 1,
        "chunk_length": 14.5,
        "text": "Here is an example of a semantic chunk from the video.",
        "start_time": 0.0,
        "end_time": 14.5,
    },
    # Additional chunks follow...
]
```
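
The format rules above can be checked mechanically before submitting. The following is a minimal sketch, not part of the task statement: the helper name `validate_chunks` and the tolerance `tol` are assumptions made for illustration. It checks the required keys and types, that `chunk_length` matches `end_time - start_time`, and that every chunk is strictly shorter than 15 s.

```python
MAX_CHUNK_SECONDS = 15.0  # limit taken from the task statement

# Required keys and their accepted Python types
REQUIRED_FIELDS = {
    "chunk_id": int,
    "chunk_length": (int, float),
    "text": str,
    "start_time": (int, float),
    "end_time": (int, float),
}

def validate_chunks(chunks, max_seconds=MAX_CHUNK_SECONDS, tol=1e-6):
    """Return a list of problem descriptions; an empty list means no problem was found."""
    problems = []
    seen_ids = set()
    for index, chunk in enumerate(chunks):
        missing = [key for key in REQUIRED_FIELDS if key not in chunk]
        if missing:
            problems.append(f"chunk at index {index}: missing keys {missing}")
            continue
        wrong = [key for key, typ in REQUIRED_FIELDS.items()
                 if not isinstance(chunk[key], typ)]
        if wrong:
            problems.append(f"chunk at index {index}: wrong type for {wrong}")
            continue
        if chunk["chunk_id"] in seen_ids:
            problems.append(f"chunk at index {index}: duplicate chunk_id")
        seen_ids.add(chunk["chunk_id"])
        duration = chunk["end_time"] - chunk["start_time"]
        if duration <= 0:
            problems.append(f"chunk at index {index}: end_time is not after start_time")
        if abs(duration - chunk["chunk_length"]) > tol:
            problems.append(f"chunk at index {index}: chunk_length does not match the timestamps")
        if chunk["chunk_length"] >= max_seconds:
            problems.append(f"chunk at index {index}: not shorter than {max_seconds} s")
        if not chunk["text"].strip():
            problems.append(f"chunk at index {index}: empty text")
    return problems

example = [{"chunk_id": 1, "chunk_length": 14.5, "text": "An example chunk.",
            "start_time": 0.0, "end_time": 14.5}]
print(validate_chunks(example))
```

Note that a chunk of exactly 15.0 s is reported as a problem, because the task asks for chunks shorter than 15 s.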

Ensure that your code is clear, well-commented, and easy to follow, with explanations for each major step and decision in the process. The notebook should be able to install all the dependencies automatically and generate the reported output when run.


---



In [None]:
pip install pytube

Collecting pytube
  Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0


In [None]:
from pytube import YouTube
from moviepy.editor import *

In [None]:
pip install SpeechRecognition


Collecting SpeechRecognition
  Downloading SpeechRecognition-3.10.4-py2.py3-none-any.whl (32.8 MB)
Installing collected packages: SpeechRecognition
Successfully installed SpeechRecognition-3.10.4


In [None]:
from pytube import YouTube

# Download a YouTube video into a local directory
def download_video(url, output_path):
    yt = YouTube(url)
    # Pick the highest-resolution progressive stream (video and audio in one file)
    stream = yt.streams.get_highest_resolution()
    # pytube treats output_path as a directory and names the file after the video title
    stream.download(output_path=output_path)

# Example usage
video_url = "https://www.youtube.com/watch?v=Sby1uJ_NFIY"
output_video_path = "t1/video.mp4"  # Used as a directory name, so the file is saved inside t1/video.mp4/
download_video(video_url, output_video_path)


In [None]:
pip install moviepy



In [None]:
import moviepy.editor as mp

def extract_audio(video_path, output_audio_path):
    # Load the video file
    video_clip = mp.VideoFileClip(video_path)
    # Extract the audio track
    audio_clip = video_clip.audio

    # Write the audio to a file (the format is inferred from the extension, here .wav)
    audio_clip.write_audiofile(output_audio_path)
    # Release the file handles held by the reader
    video_clip.close()

# Example usage
video_path = "/content/t1/video.mp4/Sarvam AI Wants To Leverage AI In Health & Education Says Co Founder Vivek Raghavan With OpenHathi.mp4"  # Replace with the path to your video file
output_audio_path = "/content/t1/audio.wav"  # Specify the desired output path for the audio file

extract_audio(video_path, output_audio_path)


MoviePy - Writing audio in /content/t1/audio.wav


                                                                        

MoviePy - Done.
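
The task asks for voice activity information from the audio, which the extracted WAV file makes available. The following is a minimal sketch of an energy-based voice activity detector, assuming mono samples already loaded as floats in [-1, 1] (for example through the standard `wave` module). The function name `detect_speech_regions` and the values for `frame_ms`, `threshold` and `min_silence_ms` are assumptions chosen for illustration, not tuned settings. A trained VAD model would likely be more robust on noisy recordings.

```python
import math

def detect_speech_regions(samples, sample_rate, frame_ms=30,
                          threshold=0.02, min_silence_ms=300):
    """Return (start_seconds, end_seconds) tuples for regions whose frame RMS
    reaches the threshold. A region ends only after min_silence_ms of quiet frames."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    min_silence_frames = max(1, int(min_silence_ms / frame_ms))
    n_frames = math.ceil(len(samples) / frame_len)

    regions = []
    start_frame = None   # first frame of the region currently open, if any
    silent_run = 0       # consecutive quiet frames seen inside the open region

    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            if start_frame is None:
                start_frame = f
            silent_run = 0
        elif start_frame is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                end_frame = f - silent_run + 1  # index of the first quiet frame
                regions.append((start_frame * frame_len / sample_rate,
                                end_frame * frame_len / sample_rate))
                start_frame = None
                silent_run = 0

    # Close a region that is still open when the audio ends
    if start_frame is not None:
        end_sample = min((n_frames - silent_run) * frame_len, len(samples))
        regions.append((start_frame * frame_len / sample_rate,
                        end_sample / sample_rate))
    return regions

# Synthetic example: 0.1 s quiet, 0.2 s loud, 0.5 s quiet, 0.1 s loud at 1 kHz
demo = [0.0] * 100 + [0.5] * 200 + [0.0] * 500 + [0.5] * 100
print(detect_speech_regions(demo, 1000, frame_ms=10))
```

The region boundaries give candidate cut points that fall in silence, which helps avoid clipping speech at the start or end of a chunk.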




In [None]:
from transformers import pipeline

In [None]:
pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [None]:
import pytube
from pydub import AudioSegment
from pydub.playback import play
from IPython.display import Audio, display
from transformers import pipeline

In [None]:
whisper = pipeline('automatic-speech-recognition', model = 'openai/whisper-medium')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [None]:
## The 'whisper' pipeline will transcribe the speech in the audio to text
transcript = whisper('/content/t1/audio.wav')

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
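
The call above returns text only, so it carries no timing. One way to get a rough time alignment is to ask the pipeline for timestamps, for example `whisper(path, return_timestamps=True, chunk_length_s=30)`. The sketch below assumes that call shape and that the result then contains a `chunks` list whose items hold a `timestamp` pair and a `text` string. The helper `segments_from_pipeline_output` and the mock `result` are hypothetical and only illustrate the conversion. The final segment may come back without an end time, so the helper accepts an optional `audio_duration` to fill it.

```python
def segments_from_pipeline_output(result, audio_duration=None):
    """Convert a timestamped ASR result into a list of segment dicts.

    Segments with an unknown or non-positive duration are skipped."""
    segments = []
    for item in result.get("chunks", []):
        start, end = item["timestamp"]
        if end is None:
            end = audio_duration  # fall back to the audio length when it is known
        if start is None or end is None or end <= start:
            continue
        text = item["text"].strip()
        if text:
            segments.append({"start_time": float(start),
                             "end_time": float(end),
                             "text": text})
    return segments

# Mock result in the assumed shape (the times are made up for illustration)
result = {
    "text": " Hi everybody. How are you? Okay.",
    "chunks": [
        {"timestamp": (0.0, 1.2), "text": " Hi everybody."},
        {"timestamp": (1.2, 2.5), "text": " How are you?"},
        {"timestamp": (2.5, None), "text": " Okay."},
    ],
}
for segment in segments_from_pipeline_output(result, audio_duration=3.0):
    print(segment)
```

Segment-level timestamps from the model are approximate. A forced-alignment step would be needed for tighter word boundaries.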


In [None]:
transcript

[" Congratulations to you Mr. Raghavan for that. Thank you so much for joining us. Over to you. Hi everybody. How are you? Okay, I am not hearing this at all. This is like a post-lunch energy downer or something. Let's hear it. Are you guys awake? Alright. You better be because we have a superstar guest here. You heard the $41 million, and I didn't hear honestly anything she said after that. So we're going to ask for about $40 million from him by the end of this conversation, okay? But let's get started. I want to introduce Vivek and Pratyush, his co-founder who's not here. We wanted to start with playing a video of what OpenHati does. I encourage all of you to go to the website, serveron.ai, and check it out. But let me start by introducing Vivek. Vivek is a dear friend and he's very, very modest, one of the most modest guys that I know. But his personal journey, Vivek, you've been, you got a PhD from Carnegie Mellon, you started and sold the company to Magma. And Vivek and I moved ba

In [None]:
# Extract the transcription text from the dictionary
transcript_text = transcript.get('text', '')

# Save the transcript as a text file
if transcript_text:
    with open('/content/t1/transcript.txt', 'w') as f:
        f.write(transcript_text)
else:
    print("Transcription not available.")
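
If timestamped segments are available alongside the saved text, they can be grouped into chunks shorter than 15 s by cutting only at segment boundaries. The following is a minimal sketch under that assumption. The function `pack_segments`, the `max_gap` pause limit and the sample segments are hypothetical. It follows the precision-oriented criterion by dropping any single segment that is already too long instead of cutting it mid-speech. It uses pauses and segment boundaries as a proxy for meaning and does not analyse the text itself.

```python
def pack_segments(segments, max_seconds=15.0, max_gap=0.8):
    """Greedily merge consecutive segments into chunks shorter than max_seconds.

    A new chunk starts when the merged span would reach max_seconds or when
    the pause before the next segment is longer than max_gap seconds."""
    merged = []
    current = None
    for seg in segments:
        if seg["end_time"] - seg["start_time"] >= max_seconds:
            # A single over-long segment cannot be used without cutting speech: drop it
            if current is not None:
                merged.append(current)
                current = None
            continue
        fits = (current is not None
                and seg["end_time"] - current["start_time"] < max_seconds
                and seg["start_time"] - current["end_time"] <= max_gap)
        if fits:
            current["end_time"] = seg["end_time"]
            current["text"] += " " + seg["text"]
        else:
            if current is not None:
                merged.append(current)
            current = dict(seg)  # copy so the input list is left unchanged
    if current is not None:
        merged.append(current)

    return [{"chunk_id": i,
             "chunk_length": round(c["end_time"] - c["start_time"], 3),
             "text": c["text"],
             "start_time": c["start_time"],
             "end_time": c["end_time"]}
            for i, c in enumerate(merged, start=1)]

# Made-up segments for illustration
demo_segments = [
    {"start_time": 0.0, "end_time": 6.0, "text": "First sentence."},
    {"start_time": 6.2, "end_time": 12.0, "text": "Second sentence."},
    {"start_time": 12.1, "end_time": 16.0, "text": "Third sentence."},
]
for chunk in pack_segments(demo_segments):
    print(chunk)
```

The greedy pass is simple and keeps every cut at a segment boundary, but it can leave short trailing chunks. A text-similarity check between neighbouring segments would be one way to make the grouping more semantic.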

In [None]:
# Read the transcript file
with open("/content/t1/transcript.txt", "r") as file:
    transcript = file.read()

# Split the transcript into sentences. This naive split on ". " also breaks
# after abbreviations such as "Mr." and does not split at "?" boundaries.
sentences = transcript.split(". ")

# Define parameters
words_per_second = 2  # Assumed average speaking rate; not measured from the audio
max_chunk_duration = 15  # Maximum duration for each chunk (in seconds)

# Group consecutive sentences into chunks using estimated durations
chunks = []
chunk_id = 1
current_chunk = {"chunk_id": chunk_id, "start_time": 0, "end_time": 0, "text": ""}
chunk_duration = 0

for sentence in sentences:
    # Estimate the sentence duration from its word count (a heuristic, not a measured timestamp)
    sentence_duration = len(sentence.split()) / words_per_second

    # If adding this sentence would exceed max_chunk_duration, start a new chunk
    if chunk_duration + sentence_duration > max_chunk_duration:
        current_chunk["end_time"] = current_chunk["start_time"] + chunk_duration
        current_chunk["chunk_length"] = chunk_duration
        chunks.append(current_chunk)
        chunk_id += 1
        chunk_duration = 0
        current_chunk = {"chunk_id": chunk_id, "start_time": current_chunk["end_time"], "end_time": 0, "text": ""}

    # Add sentence to current chunk
    current_chunk["text"] += sentence + ". "
    chunk_duration += sentence_duration

# Add the last chunk
current_chunk["end_time"] = current_chunk["start_time"] + chunk_duration
current_chunk["chunk_length"] = chunk_duration
chunks.append(current_chunk)

# Format chunks as list of dictionaries
sample_output_list = []
for chunk in chunks:
    sample_output_list.append({
        "chunk_id": chunk["chunk_id"],
        "chunk_length": chunk["chunk_length"],
        "text": chunk["text"],
        "start_time": chunk["start_time"],
        "end_time": chunk["end_time"],
    })

# Print sample output list
for chunk_dict in sample_output_list:
    print(chunk_dict)


{'chunk_id': 1, 'chunk_length': 15.0, 'text': ' Congratulations to you Mr. Raghavan for that. Thank you so much for joining us. Over to you. Hi everybody. How are you? Okay, I am not hearing this at all. ', 'start_time': 0, 'end_time': 15.0}
{'chunk_id': 2, 'chunk_length': 13.5, 'text': "This is like a post-lunch energy downer or something. Let's hear it. Are you guys awake? Alright. You better be because we have a superstar guest here. ", 'start_time': 15.0, 'end_time': 28.5}
{'chunk_id': 3, 'chunk_length': 7.5, 'text': "You heard the $41 million, and I didn't hear honestly anything she said after that. ", 'start_time': 28.5, 'end_time': 36.0}
{'chunk_id': 4, 'chunk_length': 11.0, 'text': "So we're going to ask for about $40 million from him by the end of this conversation, okay? But let's get started. ", 'start_time': 36.0, 'end_time': 47.0}
{'chunk_id': 5, 'chunk_length': 12.0, 'text': "I want to introduce Vivek and Pratyush, his co-founder who's not here. We wanted to start with pl