#### Project Overview

This project demonstrates a customer care audio separation and transcription system. It uses pydub to separate the two speakers and OpenAI's Whisper model (run through the faster-whisper library) to transcribe them. The main steps are loading the audio file, splitting the stereo audio into two mono channels (one per speaker), transcribing each channel, and extracting useful information from the transcriptions.

### Installation of Required Libraries
Let's install the pydub library for audio processing and the Whisper libraries (openai-whisper and faster-whisper) for transcription.

In [6]:
!pip install pydub
!pip install -qq ipython==7.34.0
!pip install git+https://github.com/openai/whisper.git
!pip install faster-whisper
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-t7t813xc
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-t7t813xc
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Looking in indexes: https://download.pytorch.org/whl/cu118


#### Restart the session before running the cells below

In [None]:
import os
os._exit(00)

### Load and Process Audio Files
Load the audio file, split it into two channels, and save them as separate files.

Here, the input audio file is in stereo format, with each channel representing a different speaker.
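
To make the idea of "one speaker per channel" concrete, here is a minimal, standard-library sketch of what splitting a stereo signal means at the sample level. It assumes 16-bit interleaved PCM (left sample, right sample, left sample, ...), and the function name `split_stereo_pcm16` is hypothetical. It is not how pydub is implemented; the notebook itself relies on `split_to_mono()`.

```python
from array import array
from typing import Tuple


def split_stereo_pcm16(frames: bytes) -> Tuple[bytes, bytes]:
    """Split interleaved 16-bit stereo PCM (L R L R ...) into two mono byte strings."""
    samples = array("h")  # "h" = signed 16-bit integers
    samples.frombytes(frames)
    if len(samples) % 2:
        raise ValueError("stereo audio must contain an even number of samples")
    left = samples[0::2]   # even positions hold the left channel
    right = samples[1::2]  # odd positions hold the right channel
    return left.tobytes(), right.tobytes()


# Toy input: three stereo frames, positive values on the left, negative on the right
interleaved = array("h", [1, -1, 2, -2, 3, -3]).tobytes()
left_bytes, right_bytes = split_stereo_pcm16(interleaved)

print(list(array("h", left_bytes)))
print(list(array("h", right_bytes)))
```

Real recordings are only this cleanly separated when the call was captured with each party on its own channel, which is the assumption this tutorial makes.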

In [1]:
# Python3 program to demonstrate splitting a stereo audio file into mono channels using pydub

# Import AudioSegment from pydub
from pydub import AudioSegment
import os

# Define the audio file and the folder where it's located
audio_filename = "Sample Order Taking  Customer Support.mp3"
foldername = "/content"

# Load the stereo audio file as an AudioSegment instance
stereo_audio = AudioSegment.from_file(
    os.path.join(foldername,audio_filename),
    format="mp3")

# Split the stereo audio file into two mono audio segments
mono_audios = stereo_audio.split_to_mono()

# Export and save the left channel (index 0) into the same folder as the input,
# which is where the transcription step looks for it
mono_left = mono_audios[0].export(
    os.path.join(foldername, audio_filename[:-4]+"_left.wav"),
    format="wav")

# Export and save the right channel (index 1) into the same folder
mono_right = mono_audios[1].export(
    os.path.join(foldername, audio_filename[:-4]+"_right.wav"),
    format="wav")

#### Transcribe Audio Using Whisper

Load the Whisper model and transcribe the audio files. Extract the necessary information from the transcription results and save them into JSON files. This includes the segment ID, start and end times, transcribed text, and word-level details.
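
The extraction step described above can be sketched independently of the model. The helper names `segment_to_record` and `word_to_record` are hypothetical, and `FakeSegment` / `FakeWord` are stand-ins invented for illustration, not faster-whisper classes. The sketch reads fields by attribute rather than through `_asdict()`, on the assumption that segment objects may be named tuples in some faster-whisper releases and dataclasses in others; check this against your installed version.

```python
from collections import namedtuple
from dataclasses import dataclass, field
from typing import Any, Dict, List


def word_to_record(word: Any) -> Dict[str, Any]:
    """Copy the word-level fields into a plain, JSON-serializable dict."""
    return {
        "start": word.start,
        "end": word.end,
        "word": word.word,
        "probability": word.probability,
    }


def segment_to_record(segment: Any) -> Dict[str, Any]:
    """Copy the segment fields used in this tutorial into a plain dict."""
    words = getattr(segment, "words", None) or []  # words may be missing or None
    return {
        "id": segment.id,
        "start": segment.start,
        "end": segment.end,
        "text": segment.text,
        "words": [word_to_record(w) for w in words],
    }


# Hypothetical stand-ins so the sketch runs without a model or audio file
FakeWord = namedtuple("FakeWord", "start end word probability")


@dataclass
class FakeSegment:
    id: int
    start: float
    end: float
    text: str
    words: List[FakeWord] = field(default_factory=list)


segment = FakeSegment(1, 0.0, 1.2, " Hello there", [FakeWord(0.0, 0.5, " Hello", 0.98)])
record = segment_to_record(segment)
print(record)
```

Because the result contains only dicts, lists, strings, and numbers, it can be passed straight to `json.dump`.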

In [2]:
import os
import json
import time

import torch
from faster_whisper import WhisperModel

# Initialize the Whisper model with specified size and settings
model_size = "large-v3"
whisper_model = WhisperModel(model_size, device="cuda", compute_type="float16")

start = time.time()

# Define the paths for the left and right channel audio files
audio_file_path_left = os.path.join(foldername, audio_filename[:-4]+"_left.wav")
audio_file_path_right = os.path.join(foldername, audio_filename[:-4]+"_right.wav")

# Transcribe the left channel audio file with word timestamps
segments_left, info_left = whisper_model.transcribe(audio_file_path_left, beam_size=1, word_timestamps=True)
# Transcribe the right channel audio file with word timestamps
segments_right, info_right = whisper_model.transcribe(audio_file_path_right, beam_size=1, word_timestamps=True)

# Initialize lists to hold raw and processed results
raw_results = []
processed_results = []

# Process the transcription segments for the left channel
for segment in segments_left:
    segment_dict = segment._asdict()
    raw_results.append(segment_dict)

    # Prepare processed data with word-level details
    processed_data = {
        "id": segment_dict["id"],
        "start": segment_dict["start"],
        "end": segment_dict["end"],
        "text": segment_dict["text"],
        "words": [{
            "start": word.start,
            "end": word.end,
            "word": word.word,
            "probability": word.probability
        } for word in segment_dict.get('words', [])]
    }
    processed_results.append(processed_data)

# Write the raw and processed results to JSON files for the left channel
try:
    with open("/content/Sample Order Taking  Customer Support_left_raw_results.json", "w") as raw_file, \
         open("/content/Sample Order Taking  Customer Support_left_extracted.json", "w") as extracted_file:
        json.dump(raw_results, raw_file, indent=4)
        json.dump(processed_results, extracted_file, indent=4)
except IOError as e:
    print("An error occurred while writing files:", e)

print("Time taken for left audio:", time.time() - start)

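# Reset the result lists so the right-channel files do not also contain the left-channel segments
raw_results = []
processed_results = []

# Process the transcription segments for the right channel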
for segment in segments_right:
    segment_dict = segment._asdict()
    raw_results.append(segment_dict)

    # Prepare processed data with word-level details
    processed_data = {
        "id": segment_dict["id"],
        "start": segment_dict["start"],
        "end": segment_dict["end"],
        "text": segment_dict["text"],
        "words": [{
            "start": word.start,
            "end": word.end,
            "word": word.word,
            "probability": word.probability
        } for word in segment_dict.get('words', [])]
    }
    processed_results.append(processed_data)

# Write the raw and processed results to JSON files for the right channel
try:
    with open("/content/Sample Order Taking  Customer Support_right_raw_results.json", "w") as raw_file, \
         open("/content/Sample Order Taking  Customer Support_right_extracted.json", "w") as extracted_file:
        json.dump(raw_results, raw_file, indent=4)
        json.dump(processed_results, extracted_file, indent=4)
except IOError as e:
    print("An error occurred while writing files:", e)

print("Time taken for right audio:", time.time() - start)

# Clear GPU VRAM to free up memory
del whisper_model
torch.cuda.empty_cache()




Time taken for left audio: 9.075596570968628
Time taken for right audio: 16.160569429397583
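
The transcription cell above hard-codes `device="cuda"` and `compute_type="float16"`, so it fails on a runtime without a GPU. A minimal sketch of a fallback follows. The helper name `pick_device` is hypothetical, and the choice of `int8` on CPU is an assumption based on the CPU example in the faster-whisper documentation; it trades some accuracy for speed and memory.

```python
from typing import Tuple


def pick_device(cuda_available: bool) -> Tuple[str, str]:
    """Return a (device, compute_type) pair for WhisperModel."""
    if cuda_available:
        return "cuda", "float16"  # the settings used in this tutorial
    return "cpu", "int8"          # assumed CPU fallback, see the note above


device, compute_type = pick_device(cuda_available=False)
print(device, compute_type)
```

In the notebook this would be called as `pick_device(torch.cuda.is_available())`, with the result passed to `WhisperModel(model_size, device=device, compute_type=compute_type)`. Expect CPU transcription with `large-v3` to be much slower than the timings shown above.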


#### Aligner - Combining and Sorting Transcription Results

In [2]:
import json

# Function to load JSON data from a given file path
def load_json(file_path):
    with open(file_path, 'r') as file:
        return json.load(file)

# Load the extracted transcription results for the left and right channels
left_json = load_json('/content/Sample Order Taking  Customer Support_left_extracted.json')
right_json = load_json('/content/Sample Order Taking  Customer Support_right_extracted.json')

# Tag each entry with its speaker before merging: the left channel is the agent,
# the right channel is the customer (assumes each file holds only its own channel)
for entry in left_json:
    entry["speaker"] = "CET Agent"
for entry in right_json:
    entry["speaker"] = "Customer"

# Combine the entries from both channels and sort them by the start time
sorted_data = sorted(left_json + right_json, key=lambda x: x['start'])

# Create the final structured output with sequential IDs
final_output = []
for index, entry in enumerate(sorted_data, start=1):
    final_output.append({
        "id": index,
        "text": entry['text'].strip(),
        "speaker": entry["speaker"]
    })

# Save the final transcribed output to a new JSON file
with open('/content/final_transcriped_output.json', 'w') as file:
    json.dump(final_output, file, indent=4)

print("Transcription completed and saved to final_transcriped_output.json.")


Transcription completed and saved to final_transcriped_output.json.
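
As a quick way to inspect the result, the saved structure (a list of entries with `id`, `text`, and `speaker`) can be rendered as a readable dialogue. This is a minimal sketch; `render_dialogue` is a hypothetical helper, and the two sample turns are made up for illustration rather than taken from the recording.

```python
from typing import Dict, List


def render_dialogue(turns: List[Dict]) -> str:
    """Format each turn as 'Speaker: text', one turn per line, in list order."""
    return "\n".join(f'{turn["speaker"]}: {turn["text"]}' for turn in turns)


# Made-up turns in the same shape as final_transcriped_output.json
sample_turns = [
    {"id": 1, "text": "Thank you for calling.", "speaker": "CET Agent"},
    {"id": 2, "text": "Hi, I'd like to place an order.", "speaker": "Customer"},
]

dialogue = render_dialogue(sample_turns)
print(dialogue)
```

With the real file, load it with `json.load` and pass the resulting list to `render_dialogue`.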
