🚀 **Join Our Speech Team at Sarvam AI!** 🚀



We at [Sarvam AI](https://www.sarvam.ai/) are on a mission to revolutionize the GenAI landscape in India, and we're looking for talented **Data Engineers** and **ML Engineers** to help us achieve this. Become part of a team dedicated to building state-of-the-art ML systems for Speech Recognition and Text-to-Speech applications in Indian languages.



**Why Join Our Speech Team?**

- 🧠 **Develop Cutting-Edge ML Models:** Work at the forefront of machine learning to develop systems that understand and speak multiple Indian languages.

- 📊 **Handle Large-Scale Data Sets:** Dive into dataset curation, managing and processing over 1M hours of audio data to train our models.

- 💻 **Advanced GPU Infrastructure:** Get hands-on experience with our massive A100 and H100 GPU cluster, pushing the boundaries of what AI can achieve.

- 🛠️ **Solve Real-World Problems:** Tackle hard applied research and engineering challenges with a direct impact on our products.

- 🌐 **Work with Top Talent:** Collaborate with some of India's best ML Engineers and Product Developers in an environment that values innovation and creativity.



**We're Excited to Offer the Following Opportunities (Full Time only):**

- 🌱 **Summer Internship (2 months on-site/remote, earning up to 50k per month):** Ideal for freshers or students with a foundational grasp of ML and programming.

  - **Data Engineer:** Engage in web scraping, manage distributed data processing, and develop robust data pipelines.

  - **ML Engineer:** Train, monitor, and evaluate state-of-the-art speech models.

  

- 🔥 **AI Residency (6 months on-site, earning up to 1L per month):** Perfect for those with professional experience or significant expertise. Contribute to our groundbreaking projects and refine your skills.

  - **Data Engineer:** Oversee large-scale data mining and sophisticated engineering tasks for massive data flows.

  - **ML Engineer:** Train advanced speech models on powerful GPU clusters using frameworks like PyTorch and JAX, and tools like NeMo and HuggingFace Transformers.



**About Sarvam AI:**

Sarvam AI is a well-funded GenAI startup focused on creating full-stack GenAI systems and applications tailored for India. Based out of Bangalore and Chennai, Sarvam AI is a place for those who are passionate about driving significant advancements through Generative AI, have a love for Indian languages, and are eager to make a substantial impact.



**Join Us!**

Complete our [Hiring Challenge in the Colab Notebook](https://colab.research.google.com/drive/1EiiLTf5zB8Jm2PxdU3H20rWUr40FrsGM?usp=sharing) to showcase your skills! After completing the challenge, please apply through this [Google Form](https://forms.gle/yP4Vd9QhiETTqEtW8). We’re eager to see your innovative solutions and potentially welcome you aboard to help shape the future of AI in India.



*Note: Admissions are on a rolling basis until all positions are filled. Apply early to secure your spot in our team!*

---



# 🌟 Welcome to the Speech Team Hiring Challenge! 🚀

Hey there! We're thrilled to kick off this exciting challenge with two awesome tasks tailored to test your prowess in speech and text data analysis. These tasks are crucial for our hiring process and mirror the real-world scenarios our team loves to tackle. 🎯

**Task 1: Semantic Chunking of a YouTube Video** 📹
- Dive into extracting meaningful audio-text pairs from a specific video. Show us your skill in achieving precise segmentation and alignment!

**Task 2: Exploratory Data Analysis of New Testament Audio and Text** 📖
- Get your hands dirty with a deep dive into the audio and text from the New Testament in your mother tongue. We're looking for sharp insights that could revolutionize text-to-speech and speech-to-text technologies.

Please tackle both tasks with your full creativity and analytical skills as they are equally important in our evaluation. 🤓 🏋️‍♂️ Your innovative approaches, depth of analysis, and tech-savviness will be key to understanding how well you fit into our dynamic team.

**Submission Instructions:**
- Make sure to create a copy of your Google Colab notebook for each task.
- Set the notebook to **shareable** and grant viewing access to **abhigyan@sarvam.ai**.
- Once you've perfected your work, please paste the link to your notebook in the [Google Form provided](https://forms.gle/qxy2LF4Jtph7xhYHA). This step is crucial for us to review your submissions properly.

Good luck, and let's see what amazing things you can uncover! 🌈👀



---



## Task 1: Semantic Chunking of a YouTube Video

**Problem Statement:**

The objective is to extract high-quality, meaningful (semantic) segments from the specified YouTube video: [Watch Video](https://www.youtube.com/watch?v=Sby1uJ_NFIY).

Suggested workflow:
1. **Download Video and Extract Audio:** Download the video and separate the audio component.
2. **Transcription of Audio:** Utilize an open-source Speech-to-Text model to transcribe the audio. *Provide an explanation of the chosen model and any techniques used to enhance the quality of the transcription.*
3. **Time-Align Transcript with Audio:** *Describe the methodology and steps for aligning the transcript with the audio.*
4. **Semantic Chunking of Data:** Slice the data into audio-text pairs, using both semantic information from the text and voice activity information from the audio, with each audio-chunk being less than 15s in length. *Explain the logic used for semantic chunking and discuss the strengths and weaknesses of your approach.*
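
Step 4 can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes step 3 produced word-level timestamps, and the names `Word`, `chunk_words`, `MAX_LEN`, and `MIN_PAUSE` are hypothetical. It closes a chunk at sentence-ending punctuation (a cheap semantic cue) or at a long silent gap (a cheap voice-activity cue), and starts a new chunk before the running duration reaches 15 s.

```python
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds

MAX_LEN = 15.0    # upper bound on chunk duration, from the task statement
MIN_PAUSE = 0.5   # assumed silence gap (seconds) that counts as a natural break

def chunk_words(words, max_len=MAX_LEN, min_pause=MIN_PAUSE):
    """Greedily group timed words into chunks shorter than max_len seconds.

    A single word longer than max_len would still form its own chunk.
    """
    chunks, current = [], []

    def flush():
        # Emit the words collected so far as one chunk, then reset.
        if current:
            start, end = current[0].start, current[-1].end
            chunks.append({
                "chunk_id": len(chunks) + 1,
                "chunk_length": round(end - start, 2),
                "text": " ".join(w.text for w in current),
                "start_time": start,
                "end_time": end,
            })
            current.clear()

    for i, word in enumerate(words):
        # Start a new chunk if adding this word would reach the length limit.
        if current and word.end - current[0].start >= max_len:
            flush()
        current.append(word)
        nxt = words[i + 1] if i + 1 < len(words) else None
        sentence_end = word.text.endswith((".", "?", "!"))
        long_pause = nxt is not None and nxt.start - word.end >= min_pause
        if sentence_end or long_pause:
            flush()
    flush()
    return chunks

words = [Word("Hello", 0.0, 0.4), Word("world.", 0.5, 1.0),
         Word("Next", 2.0, 2.3), Word("one", 2.4, 2.8)]
for chunk in chunk_words(words):
    print(chunk)
```

Because evaluation favors precision, a real pipeline might go further and drop chunks whose boundaries fall inside speech instead of keeping every chunk this loop produces.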

**Judgement Criteria:**

1. **Precision-Oriented Evaluation:** The evaluation focuses on precision rather than recall. Higher scores are achieved by reporting fewer but more accurate segments rather than a larger number of segments with inaccuracies. Segment accuracy is determined by:
   - **Transcription Quality:** Accuracy of the text transcription for each audio chunk.
   - **Segment Quality:** Semantic richness of the text segments.
   - **Timestamp Accuracy:** Precision of the start and end times for each segment. Avoid audio cuts at the start or end of a segment.
   
2. **Detailed Explanations:** Provide reasoning behind each step in the process.
3. **Generalization:** Discuss the general applicability of your approach, potential failure modes on different types of videos, and adaptation strategies for other languages.
4. **[Bonus-1]** **Gradio App Interface:** Wrap your code in a Gradio app that takes a YouTube link as input and displays the output in a text box.
5. **[Bonus-2]** **Utilizing Ground-Truth Transcripts:** Propose a method to improve the quality of your transcript using a ground-truth transcript provided as a single text string. Explain your hypothesis for this approach. *Note that a code snippet isn't required for this question.*

  As an example: for the audio extracted from [yt-link](https://www.youtube.com/watch?v=ysLiABvVos8), how can we leverage the transcript scraped from [here](https://www.newsonair.gov.in/bulletins-detail/english-morning-news-7/) to improve the overall transcription quality of the segments?
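
One possible framing of Bonus-2, offered as a hedged sketch and not as the expected answer: align the full ASR word sequence against the ground-truth word sequence with the standard library's `difflib.SequenceMatcher`, then use the matched words inside each chunk as anchors into the ground truth. The helper names `normalize` and `reference_text_for_chunk`, and the normalization rule itself, are assumptions made for illustration.

```python
import difflib
import re

def normalize(word):
    """Lowercase and strip punctuation so that 'Minister,' matches 'minister'."""
    return re.sub(r"[^\w]", "", word.lower())

def reference_text_for_chunk(hyp_words, ref_words, chunk_start, chunk_end):
    """Return the ground-truth wording for hyp_words[chunk_start:chunk_end].

    Returns None when no word in the chunk could be matched, in which case
    the caller would keep the ASR text for that chunk.
    """
    matcher = difflib.SequenceMatcher(
        a=[normalize(w) for w in hyp_words],
        b=[normalize(w) for w in ref_words],
        autojunk=False,  # keep frequent words; they are useful anchors here
    )
    # Map each matched ASR word index to its ground-truth word index.
    hyp_to_ref = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            hyp_to_ref[block.a + offset] = block.b + offset

    anchors = [hyp_to_ref[i] for i in range(chunk_start, chunk_end) if i in hyp_to_ref]
    if not anchors:
        return None
    # Take everything between the first and last anchor from the ground truth.
    return " ".join(ref_words[anchors[0]:anchors[-1] + 1])

hyp = "the prime minster said today that rains will continue".split()
ref = "The Prime Minister said today that rains will continue across the state.".split()
print(reference_text_for_chunk(hyp, ref, 0, 5))
```

The chunk timestamps stay untouched; only the text is replaced. Misrecognized words at the very edge of a chunk are not recovered by this sketch, and the alignment is recomputed on every call, so a real implementation would compute it once per video.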

**Submission Format:**

Your submission should be a well-documented Jupyter notebook capable of reproducing your results. The notebook should automatically install all required dependencies and output the results in the specified format.

- **Output Format:** Provide the results as a list of dictionaries, each representing a semantic chunk. Each dictionary should include:
  - `chunk_id`: A unique identifier for the chunk (integer).
  - `chunk_length`: The duration of the chunk in seconds (float).
  - `text`: The transcribed text of the chunk (string).
  - `start_time`: The start time of the chunk within the video (float).
  - `end_time`: The end time of the chunk within the video (float).

```python
sample_output_list = [
    {
        "chunk_id": 1,
        "chunk_length": 14.5,
        "text": "Here is an example of a semantic chunk from the video.",
        "start_time": 0.0,
        "end_time": 14.5,
    },
    # Additional chunks follow...
]
```
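
A quick self-check before submitting can catch format slips. The sketch below is an assumption about what such a check might look like; `REQUIRED_FIELDS` and `validate_chunks` are hypothetical names, and the 0.01 s tolerance is arbitrary. It verifies field types, unique ids, that `chunk_length` agrees with the timestamps, and that every chunk is shorter than 15 s.

```python
REQUIRED_FIELDS = {
    "chunk_id": int,
    "chunk_length": float,
    "text": str,
    "start_time": float,
    "end_time": float,
}

def validate_chunks(chunks, max_len=15.0, tol=0.01):
    """Return a list of problems found; an empty list means the output looks valid."""
    problems = []
    seen_ids = set()
    for pos, chunk in enumerate(chunks):
        # Check presence and type of every required field first.
        bad_fields = [
            key for key, expected in REQUIRED_FIELDS.items()
            if not isinstance(chunk.get(key), expected)
        ]
        if bad_fields:
            problems.append(f"index {pos}: missing or mistyped fields {bad_fields}")
            continue
        if chunk["chunk_id"] in seen_ids:
            problems.append(f"index {pos}: duplicate chunk_id {chunk['chunk_id']}")
        seen_ids.add(chunk["chunk_id"])
        duration = chunk["end_time"] - chunk["start_time"]
        if duration <= 0:
            problems.append(f"index {pos}: end_time is not after start_time")
        if abs(duration - chunk["chunk_length"]) > tol:
            problems.append(f"index {pos}: chunk_length does not match the timestamps")
        if chunk["chunk_length"] >= max_len:
            problems.append(f"index {pos}: chunk is not shorter than {max_len} s")
        if not chunk["text"].strip():
            problems.append(f"index {pos}: text is empty")
    return problems
```

Calling `validate_chunks(sample_output_list)` on the sample above should return an empty list.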

Ensure that your code is clear, well-commented, and easy to follow, with explanations for each major step and decision in the process.


---



In [None]:
# Install dependencies
!curl https://raw.githubusercontent.com/readbeyond/aeneas/master/install_dependencies.sh -o install_dependencies.sh
!bash install_dependencies.sh
# moviepy is pinned below 2.0 because the code imports from moviepy.editor, which 2.0 removed
%pip install yt-dlp "moviepy<2" "SpeechRecognition[whisper-local]" tiktoken semchunk aeneas gradio

# imports
import yt_dlp as youtube_dl
import pathlib
import gradio as gr
import speech_recognition as sr
import tiktoken
import semchunk

In [None]:
# Variables
transcript_path = "transcript.txt"
audio_path = "audio.wav"
video_path = "video.mp4"
video_url = ""
syncmap_path = "syncmap.json"


In [None]:
# Your code for Task 1 goes here!

# Given url
vid_url = "https://www.youtube.com/watch?v=Sby1uJ_NFIY"

def download_save_video(url, save_path):
    """Download the video at `url` to `save_path` unless the file is already on disk."""
    ydl_opts = {
        'format': 'best',
        'outtmpl': save_path
    }
    # Only download if the file does not exist yet.
    # Note: the check is by path only, so a different URL reuses an existing file.
    if not pathlib.Path(save_path).exists():
        print(f"Downloading video from url: {url} to {save_path}")
        with youtube_dl.YoutubeDL(ydl_opts) as ydl:
            ydl.download([url])
        print("Successfully downloaded the video!")
    else:
        print("Video already exists... Skipping.")

# using moviepy to extract audio from video
from moviepy.editor import VideoFileClip

def extract_and_write_audio(video_path, audio_path):
    """Extract the audio track of `video_path` and write it to `audio_path`."""
    if not pathlib.Path(video_path).exists():
        print("Video file does not exist! Exiting...")
        return
    if pathlib.Path(audio_path).exists():
        print("Audio file already exists! Exiting...")
        return
    video = VideoFileClip(video_path)
    audio = video.audio
    print("Extracting audio from the video and writing it to the disk...")
    audio.write_audiofile(audio_path)
    video.close()  # release the underlying ffmpeg reader
    print("Successfully written")


def download_and_extract_audio(url):
  download_save_video(url,video_path)
  extract_and_write_audio(video_path,audio_path)



In [None]:

def use_whisper_openai(audio_path):
    print("Initializing Whisper...")
    r = sr.Recognizer()

    with sr.AudioFile(audio_path) as source:
        print("Listening to audio...")
        audio = r.record(source)
    return r.recognize_whisper(audio)




def chunk_text(text):
    # chunk_size is a token budget per chunk, not a duration in seconds
    chunk_size = 15
    encoder = tiktoken.encoding_for_model('gpt-4')

    def token_counter(text):
        return len(encoder.encode(text))

    chunked = semchunk.chunk(text, chunk_size, token_counter)

    # One chunk per line: aeneas treats each line of a plain text file as a fragment
    return "".join(line + "\n" for line in chunked)

def return_semantic_rich_script(audio_path):
    return chunk_text(use_whisper_openai(audio_path))


In [None]:
import aeneas
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

def execute_aeneas(transcript_path, audio_path, syncmap_path, text_openai):
  # Write the chunked transcript (one fragment per line) for aeneas to read
  with open(transcript_path, 'w', encoding='utf-8') as file:
    file.write(text_openai)
  print("Starting aeneas task...")
  config_string = "task_language=eng|is_text_type=plain|os_task_file_format=json"
  task = Task(config_string=config_string)

  # The Task attributes expect absolute paths
  task.audio_file_path_absolute = str(pathlib.Path(audio_path).resolve())
  task.text_file_path_absolute = str(pathlib.Path(transcript_path).resolve())
  task.sync_map_file_path_absolute = str(pathlib.Path(syncmap_path).resolve())

  # Process the Task (forced alignment of text fragments to audio)
  ExecuteTask(task).execute()

  # Output the sync map to a file
  task.output_sync_map_file()


In [None]:

def rewrite_new_syncmap(syncmap_path):
  import json

  # Load the original syncmap
  with open(syncmap_path, 'r') as f:
      original_syncmap = json.load(f)

  # If original_syncmap is a dictionary, get the list from it
  if isinstance(original_syncmap, dict):
      # Assuming the list is under the key 'fragments'
      original_syncmap = original_syncmap.get('fragments', [])

  # If original_syncmap is a string, parse it as JSON
  elif isinstance(original_syncmap, str):
      original_syncmap = json.loads(original_syncmap)

  # Convert to the desired format
  converted_syncmap = []
  chunk_id = 1
  for item in original_syncmap:
      start_time = float(item['begin'])
      end_time = float(item['end'])
      text = ' '.join(item['lines'])

      while end_time - start_time > 15:
          # The fragment is too long: cut off a 15-second slice.
          # Assumes about 150 wpm, so 100 characters take less than 15 seconds
          # to speak. The cut ignores word boundaries, so the text and audio
          # of a slice may not line up exactly.
          converted_item = {
              "chunk_id": chunk_id,
              "chunk_length": 15.0,
              "text": text[:100],
              "start_time": start_time,
              "end_time": start_time + 15,
          }
          converted_syncmap.append(converted_item)
          start_time += 15
          text = text[100:]
          chunk_id += 1

      # Add the remaining part of the chunk
      converted_item = {
          "chunk_id": chunk_id,
          "chunk_length": round(end_time - start_time, 2),
          "text": text,
          "start_time": start_time,
          "end_time": end_time,
      }
      converted_syncmap.append(converted_item)
      chunk_id += 1

  # Write the converted syncmap to a new file
  with open(syncmap_path, 'w') as f:
      json.dump(converted_syncmap, f, indent=4)
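
The fixed 100-character slice used above when splitting long fragments can drift away from the audio it is paired with, and a slice of exactly 15 s does not satisfy the "less than 15 s" requirement. A minimal sketch of one alternative follows. It assumes speech is spread roughly evenly across the fragment, which real audio often violates, and `split_fragment` is a hypothetical helper, not part of aeneas. The fragment is divided into equal time slots that are each shorter than the limit, and every word is assigned to the slot where its estimated midpoint falls.

```python
import math

def split_fragment(text, start_time, end_time, max_len=15.0):
    """Split one aligned fragment into (text, start, end) pieces shorter than max_len.

    Assumes an even speaking rate, so a word's position in time is estimated
    from its position in the character stream.
    """
    duration = end_time - start_time
    words = text.split()
    if duration < max_len or not words:
        return [(text, start_time, end_time)]

    # Smallest number of equal slots that keeps each one strictly under max_len.
    n_pieces = math.floor(duration / max_len) + 1
    piece_len = duration / n_pieces
    total_chars = sum(len(w) for w in words)

    buckets = [[] for _ in range(n_pieces)]
    chars_seen = 0
    for word in words:
        # Relative position (between 0 and 1) of the word's midpoint in the text.
        midpoint = (chars_seen + len(word) / 2) / total_chars
        buckets[min(int(midpoint * n_pieces), n_pieces - 1)].append(word)
        chars_seen += len(word)

    return [
        (" ".join(bucket),
         start_time + i * piece_len,
         start_time + (i + 1) * piece_len)
        for i, bucket in enumerate(buckets)
        if bucket  # skip slots that received no words
    ]

print(split_fragment("a b c d", 0.0, 20.0))
```

Word-level timestamps from the aligner, where available, would be a more reliable basis for these cuts than a uniform-rate estimate.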

In [None]:
def start(url):
  download_and_extract_audio(url)
  transcript = return_semantic_rich_script(audio_path)
  print(transcript)
  execute_aeneas(transcript_path,audio_path,syncmap_path,transcript)
  rewrite_new_syncmap(syncmap_path)
  syncmap_text = ""
  with open(syncmap_path,'r') as file:
    syncmap_text = file.read()
  print(syncmap_text)
  return syncmap_text

iface = gr.Interface(fn=start,inputs="text",outputs="text")
iface.launch(share=True,debug = True)

