### Author: Jesus Cantu Jr.
### Last Updated: October 10, 2023

There are several transcription services that can be employed in Python, either through direct API calls or through SDKs provided by the service providers, for example: 

1. **AWS Transcribe**:
   - Amazon Transcribe is a service that provides automatic speech recognition (ASR) to convert spoken language into written text. It can be used for various applications, including transcription of audio recordings, voice assistants, and more.
   - Python SDK: [AWS SDK for Python (Boto3)](https://aws.amazon.com/sdk-for-python/) 

2. **Google Cloud Speech-to-Text**:
   - Google's service supports multiple languages and offers features like automatic punctuation, speaker diarization, and recognition of specific words or phrases.
   - Python SDK: [Google Cloud Client Library for Python](https://cloud.google.com/python/docs/reference/speech/latest)

3. **IBM Watson Speech to Text**:
   - IBM's Watson Speech to Text supports various features like keyword spotting, speaker labels, and custom language models.
   - Python SDK: [IBM Watson Developer Cloud Python SDK](https://github.com/watson-developer-cloud/python-sdk)

4. **Microsoft Azure Speech Service**:
   - Part of Azure Cognitive Services, Microsoft's offering supports real-time continuous recognition and batch transcription.
   - Python SDK: [Azure SDK for Python](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/speech)

5. **Rev.ai**:
   - Rev.ai provides both automated and human-powered transcription services. Their automated API is straightforward to use with Python.
   - Python SDK: [Rev.ai Python SDK](https://github.com/revdotcom/revai-python-sdk)

6. **Speechmatics**:
   - Speechmatics offers transcription services in multiple languages.
   - While they don't have an official Python SDK, their RESTful API is easy to use with Python's `requests` library.

7. **AssemblyAI**:
   - AssemblyAI is another transcription service with a straightforward API.
   - They provide [Python examples](https://docs.assemblyai.com/overview/getting-started) in their documentation to help you get started quickly.

In this notebook we will be comparing several online speech-to-text systems and their ability to transcribe audio files that include children's voices.  

## AWS Transcribe
AWS Transcribe Summarize simplifies the process of distilling valuable information from audio and video assets.
It leverages machine learning to generate concise and coherent summaries of spoken or recorded information. Users can extract key insights, identify important topics, and create textual summaries from lengthy recordings.

Prerequisites:
1.  __AWS Account__: An AWS account is required to access [AWS services](https://aws.amazon.com/), including Transcribe. 
2. __IAM Credentials__: You need AWS Identity and Access Management (IAM) credentials, specifically an Access Key ID and Secret Access Key. These credentials will allow your code to authenticate with AWS services. 
3. __Python SDK__: Install the AWS SDK.


In [None]:
! pip install boto3

__Step 1__: Upload the audio file to an `S3 bucket`:

In [74]:
import boto3
import configparser

def get_aws_credentials(api_key_file):
    """Read AWS credentials from a file."""
    config = configparser.ConfigParser()
    try:
        config.read(api_key_file)
        aws_access_key_id = config.get('AWS', 'aws_access_key_id')
        aws_secret_access_key = config.get('AWS', 'aws_secret_access_key')
        aws_region = config.get('AWS', 'aws_region')
    except configparser.NoSectionError:
        print("Section 'AWS' not found in the credentials file.")
        return None, None, None
    except configparser.NoOptionError as e:
        print(f"Error reading {e.option} from the credentials file.")
        return None, None, None
    
    return aws_access_key_id, aws_secret_access_key, aws_region

def upload_to_s3(local_file_path, bucket_name, s3_file_path):
    """Upload a local file to an S3 bucket."""
    aws_access_key_id, aws_secret_access_key, aws_region = get_aws_credentials('aws_credentials.txt')
    
    if not all([aws_access_key_id, aws_secret_access_key, aws_region]):
        print("Failed to retrieve AWS credentials.")
        return
    
    try:
        s3_client = boto3.client('s3', 
                                 aws_access_key_id = aws_access_key_id, 
                                 aws_secret_access_key = aws_secret_access_key, 
                                 region_name = aws_region)
        s3_client.upload_file(local_file_path, bucket_name, s3_file_path)
        print(f'File ({local_file_path}) uploaded to S3 Bucket ({bucket_name}).')
    except Exception as e:
        print(f"Error uploading file to S3: {e}")

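# Example usage
import os

local_file = './original_audio_files/AR31_021108a.wav'
api_key_file = 'aws_credentials.txt'  # Note: upload_to_s3 currently reads this file name directly
bucket = 'speech-to-text-processing'
# Use only the file name in the S3 key, not the local directory path
s3_path = f'audio_files/{os.path.basename(local_file)}'

upload_to_s3(local_file, bucket, s3_path)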

File (AR31_021108a.wav) uploaded to S3 Bucket (speech-to-text-processing).
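
`get_aws_credentials` expects `aws_credentials.txt` to be an INI-style file with an `[AWS]` section and three options. A minimal sketch of that layout follows, parsed from a string so that it runs without a real file. The helper name `parse_aws_credentials` and the key values are placeholders for illustration, not real credentials.

```python
import configparser

# Hypothetical contents of aws_credentials.txt (placeholder values only)
EXAMPLE_CREDENTIALS = """
[AWS]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
aws_region = us-east-2
"""

def parse_aws_credentials(text):
    """Parse the [AWS] section from INI-formatted text."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config["AWS"]
    return (
        section["aws_access_key_id"],
        section["aws_secret_access_key"],
        section["aws_region"],
    )

key_id, secret, region_name = parse_aws_credentials(EXAMPLE_CREDENTIALS)
print(region_name)
```

Keeping this file out of version control avoids leaking the keys along with the notebook.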


__Step 2__: Use `Amazon Transcribe` for Batch Processing

Amazon Transcribe is a service that converts audio to text. To transcribe an audio file you've uploaded to S3 using Amazon Transcribe, you'll typically follow these steps:

1. Start a transcription job.
2. Monitor the status of the transcription job.
3. Retrieve the transcription once the job is complete.

Let's create a function to transcribe the audio file using Amazon Transcribe.

In [75]:
import time  # needed for time.sleep in the polling loop
import requests
from datetime import datetime

def start_transcription_job(s3_uri, job_name, region, language_code = 'en-US'):
    """
    Start a transcription job with Amazon Transcribe.
    
    Args:
    - s3_uri (str): The S3 URI of the audio file.
    - job_name (str): A unique name for the transcription job.
    - region (str): AWS region for the Transcribe service.
    - language_code (str): The language code for the input audio. Default is 'en-US'.
    
    Returns:
    - str: The transcription text if successful, otherwise None.
    """
    # Initialize the boto3 client for Transcribe
    transcribe_client = boto3.client('transcribe', region_name = region)
    
    try:
        # Start transcription job
        response = transcribe_client.start_transcription_job(
            TranscriptionJobName = job_name,
            Media = {'MediaFileUri': s3_uri},
            MediaFormat = 'wav',
            LanguageCode = language_code,
            Settings = {
                'ShowSpeakerLabels': True,
                'MaxSpeakerLabels': 2  # Change based on the number of speakers in your audio
            }
        )
       
        # Wait for the transcription job to complete
        while True:
            status = transcribe_client.get_transcription_job(TranscriptionJobName = job_name)
            if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
                break
            print("Waiting for transcription job to complete...")
            time.sleep(30)  # Wait for 30 seconds before checking the status again
        
        # If the transcription job completed successfully, retrieve and return the transcript
        if status['TranscriptionJob']['TranscriptionJobStatus'] == 'COMPLETED':
            transcript_uri = status['TranscriptionJob']['Transcript']['TranscriptFileUri']
            response = requests.get(transcript_uri)
            transcript_data = response.json()
            return transcript_data['results']['transcripts'][0]['transcript']
        else:
            print("Transcription job failed.")
            return None
        
    except Exception as e:
        print(f"Error starting transcription job: {e}")
        return None

# Example usage
import os

# File name without directory or extension, e.g. 'AR31_021108a'
base_name = os.path.splitext(os.path.basename(local_file))[0]
s3_uri = f's3://{bucket}/{s3_path}'
job_name = f'aws_transcription_{base_name}_{datetime.now().strftime("%Y%m%d%H%M%S")}'
region = 'us-east-2'  # Bucket region

transcript = start_transcription_job(s3_uri, job_name, region)
print(transcript)

# Save the transcription to a text file (skipped if the job failed and returned None)
if transcript is not None:
    output_file = f"./generated_audio_transcripts/AWS_{base_name}_transcription.txt"
    with open(output_file, 'w') as f:
        f.write(transcript)

    print(f"\nTranscription saved to {output_file}")


Waiting for transcription job to complete...
See Suzanne. I come back. Yeah. Sure. Yeah. Yeah. Oh, all right. Ready? I don't think he's right. Oh. Oh, yeah. Yeah. Yeah. Yeah. We gotta sit on the, oh, no, no, you just hit me in the mouth. Not all the people I know. Come on. Oh, you pooped in your pants. Ray? We didn't make it, did we? Yeah. Yeah, I didn't get to you in time, did she? Yeah. What? You too. Yeah, you're too stinky. That was supposed to go in the peepy pot. No, no, that too was supposed to go in the pot. Uh, what? I, yeah. Yuck. Yeah. Ok. It's, I, ok. Yeah. Oh, yeah. Two where you have to learn how to poop in the pot. Yeah. But, oh, sure. Crap. The po po po, right? Oh. What? No. Screaming love. What? Yeah. Would be poop in the pot. Mama wouldn't have to wipe you, honey. This is a stupid, where were you supposed to poop in the pot? That, uh, ok. Yeah, I guess. What are you doing with your tongue? Yeah. Yeah. She, huh? Yeah. Yeah. Yeah. You didn't pee pee in the pot. You poop

This script initializes a transcription job, waits for it to complete, then retrieves and saves the transcription. Amazon Transcribe's accuracy can be influenced by a variety of factors, including audio quality, speaker accents, background noise, and the complexity of the content. 
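
The wait loop in `start_transcription_job` keeps polling until the job reports `COMPLETED` or `FAILED`, with no upper bound on the wait. A bounded version can be sketched as below. The helper `wait_for_status` and its parameters are hypothetical, and the status source is a stand-in iterator so the example runs offline.

```python
import time

def wait_for_status(get_status, terminal_states, poll_seconds=30, timeout_seconds=3600):
    """Call get_status() until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)

# Stand-in for a real status call, so the sketch runs without AWS access
fake_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
final = wait_for_status(lambda: next(fake_statuses), {"COMPLETED", "FAILED"}, poll_seconds=0)
print(final)
```

With Amazon Transcribe, `get_status` could wrap the same `get_transcription_job` call used above and return its `TranscriptionJobStatus` field.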

## Google Cloud Speech-to-Text 
`Google Cloud Speech-to-Text` is a machine learning-powered service by Google Cloud that converts spoken language into text. It supports over 120 languages, offers both real-time and batch processing, and includes features like noise robustness, speaker diarization, automatic punctuation, and word-level confidence scoring.

Prerequisites:
1. __Set Up Google Cloud SDK and Credentials__:
- Ensure you've set up the Google Cloud SDK on your machine.
- Create a service account in the Google Cloud Console and download the JSON key.
- Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the downloaded service account key:

In [50]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path_to_your_service_account_key.json"
print('Google Application Credentials saved successfully as an environment variable!')

2. __Install the Google Cloud Speech Library__:

In [None]:
! pip install --upgrade google-cloud-speech

__Transcribe audio file using Google Cloud Speech-to-Text__: This code uses the `Google Cloud Speech-to-Text API` to transcribe an audio file stored in Google Cloud Storage to text and saves the transcription to a text file. It provides flexibility to specify the language and enable/disable speaker diarization as needed.

In [55]:
from google.cloud import speech_v1p1beta1 as speech
from google.cloud.speech_v1p1beta1 import types

def transcribe_google_speech_to_text_gcs(gcs_uri, language = "en-US", enable_speaker_diarization = True):
    # Instantiates a client
    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri = gcs_uri)
    config = types.RecognitionConfig(
        encoding = types.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz = 16000,
        language_code = language,
        enable_speaker_diarization = enable_speaker_diarization
    )

    # Using the long-running recognize method
    operation = client.long_running_recognize(config = config, audio = audio)

    # Waiting for the operation to complete (this might take some time depending on the audio length)
    response = operation.result(timeout = 3600)  # wait for a maximum of 1 hour

    # Extracting and returning the transcription
    transcription = ""
    for result in response.results:
        transcription += result.alternatives[0].transcript + "\n"
    
    return transcription

# Call the function using the GCS URI of your audio file
gcs_uri = "gs://sample-voice-recordings/audio_files/AR31_021108a.wav"
transcription = transcribe_google_speech_to_text_gcs(gcs_uri, enable_speaker_diarization = False)

# Save the transcription to a text file named after the audio file
file_name = gcs_uri.rsplit("/", 1)[-1]  # 'AR31_021108a.wav'
output_file = f"./generated_audio_transcripts/Google_API_{file_name}_transcription.txt"
with open(output_file, 'w') as f:
    f.write(transcription)

print(transcription)
print(f"\nTranscription saved to {output_file}")


shoe shine
 PB
 OPP Pride
 you just hit me in the mouth
 oh
 you pooped in your pants red
 we didn't make it did we
 yeah
 can you say hi Josh, didn't get to you in time yeah you too stinky
 That was supposed to go in the pee pee pot
 no
 that too was supposed to go in the pot shoe shoe shoe
yuck
 2
 ray you have to learn how to poop in the pot yeah
 poo poo pie
 Peppa Pig Peppa Pig
 no screaming
 if you poop in the pot Mama wouldn't have to wipe you honey
 where were you supposed to poop in the pot
 yeah
 yeah
 pee pee but I'm good yeah
 yeah but he's broke
 all right
 yeah cute
 yeah yeah
 YouTube
 tires
 yeah


Transcription saved to ./audio_transcripts/Google_API_AR31_021108a.wav_transcription.txt


In the previous code, the audio file was uploaded to the Google Cloud Storage (GCS) bucket directly using the Google Cloud Console. However, we can upload files and transcribe them, with punctuation and diarization, using Python: 

In [62]:
import os
import soundfile as sf
from google.cloud import storage
from google.cloud import speech
from num2words import num2words

def measure_sample_rate(local_file):
    with sf.SoundFile(local_file, "r") as sound_file:
        sample_rate = sound_file.samplerate
    return sample_rate


def upload_audio_to_gcs(local_file, gcs_bucket, gcs_filename):
    print(f"Uploading {local_file} to GCS Bucket: {gcs_bucket}...")
    print()
    client = storage.Client()
    bucket = client.bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)

    if blob.exists():
        print(f"File {gcs_filename} already exists. Skipping upload.")
    else:
        blob.upload_from_filename(local_file)
        print(f"File {gcs_filename} uploaded successfully.")

    return f"gs://{gcs_bucket}/{gcs_filename}"

def transcribe_audio(gcs_uri, convert_numeric_to_text = True, sample_rate = None,
                     enable_diarization = True, min_num_speaker = None, max_num_speaker = None):
    print(f"Transcribing with punctuation and diarization...")
    print()
    client = speech.SpeechClient()

    # Configure the audio settings
    audio = speech.RecognitionAudio(uri = gcs_uri)
    config = speech.RecognitionConfig(
        encoding = speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz = sample_rate,
        language_code = "en-US",
        enable_automatic_punctuation = True,
        enable_word_time_offsets = True,
        diarization_config = speech.SpeakerDiarizationConfig(
            enable_speaker_diarization = enable_diarization,
            min_speaker_count = min_num_speaker,
            max_speaker_count = max_num_speaker,
        ),
    )

    # Perform the asynchronous transcription
    operation = client.long_running_recognize(config = config, audio = audio)
    response = operation.result()

    # Extract the transcriptions with speaker labels
    transcriptions = []
    for result in response.results:
        alternative = result.alternatives[0]
        words = []
        for word_info in alternative.words:
            word = word_info.word
            # Spell out purely numeric tokens, e.g. "2" becomes "two"
            if convert_numeric_to_text and word.isdigit():
                word = num2words(int(word), lang = 'en')
            words.append(word)
        # Label the result with the speaker tag of its first word (None if it has no words)
        speaker_label = alternative.words[0].speaker_tag if len(alternative.words) > 0 else None
        transcriptions.append({"transcript": " ".join(words), "speaker_label": speaker_label})

    return transcriptions

def save_transcription(transcription, text_filename):
    with open(text_filename, "w") as f:
        f.write(f"Speaker {transcription['speaker_label']}: {transcription['transcript']}\n")

# Specify the full path to your audio file
audio_file = "./original_audio_files/AR31_021108a.wav"

# Measure the sample rate of the audio file
sample_rate = measure_sample_rate(audio_file)

# Specify your GCS bucket and folder
gcs_bucket = "sample-voice-recordings"  
gcs_folder = "audio_files"  

# Specify the text folder where transcriptions will be saved
text_folder = "./generated_audio_transcripts" 

# GCS object names use forward slashes regardless of the local operating system
gcs_filename = f"{gcs_folder}/{os.path.basename(audio_file)}"
gcs_uri = upload_audio_to_gcs(audio_file, gcs_bucket, gcs_filename)
transcriptions = transcribe_audio(gcs_uri, convert_numeric_to_text = True, sample_rate = sample_rate,
                                  enable_diarization = True, min_num_speaker = 1, max_num_speaker = 2)

# Generate the output file name
audio_file_name = os.path.splitext(os.path.basename(audio_file))[0]  # Extract the base name without extension
output_file_name = f"Google_API_{audio_file_name}_transcription_2.txt"
text_filename = os.path.join(text_folder, output_file_name)


# Save the transcription to a text file
with open(text_filename, 'w') as f:
    for transcription in transcriptions:
        f.write(f"Speaker {transcription['speaker_label']}: {transcription['transcript']}\n")

print("Transcription saved to", text_filename)

# Print the transcriptions
print("Transcription:")
for transcription in transcriptions:
    print(f"Speaker {transcription['speaker_label']}:", transcription['transcript'])


Uploading AR31_021108a.wav to GCS Bucket: sample-voice-recordings...

File audio_files/AR31_021108a.wav already exists. Skipping upload.
Transcribing with punctuation and diarization...

Transcription saved to ./audio_transcripts/Google_API_AR31_021108a_transcription_2.txt
Transcription:
Speaker 0: Shoe shine.
Speaker 0: PB.
Speaker 0: OPP Pride.
Speaker 0: You just hit me in the mouth.
Speaker 0: Oh,
Speaker 0: you pooped in your pants red?
Speaker 0: We didn't make it, did we?
Speaker 0: Yeah.
Speaker 0: Can you say hi? Josh, didn't get to you in time. Yeah, you too stinky.
Speaker 0: That was supposed to go in the pee pee pot.
Speaker 0: No.
Speaker 0: That too was supposed to go in the pot, shoe, shoe shoe.
Speaker 0: Yuck.
Speaker 0: 2.
Speaker 0: Ray, you have to learn how to poop in the pot. Yeah.
Speaker 0: Poo, poo pie.
Speaker 0: Peppa Pig Peppa Pig.
Speaker 0: No screaming.
Speaker 0: If you poop in the pot, Mama wouldn't have to wipe you honey.
Speaker 0: Where were you sup

This code measures the sample rate of the local audio file with the `measure_sample_rate` function and uploads the file to a GCS bucket and folder, skipping the upload if the object already exists. It then calls the `Google Cloud Speech-to-Text API` with automatic punctuation and speaker diarization enabled and a speaker range of one to two, and spells out numeric tokens as words. Each result is saved to a text file with a speaker label. In the run above every line is labeled `Speaker 0`, so the diarization output should be checked before it is relied on.
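
One thing worth checking is whether the speaker tag of the first word in each result is the right place to read speaker labels, since the tags are attached to individual words. The sketch below groups consecutive words by their per-word tag instead. It uses plain `(word, speaker_tag)` tuples as stand-ins for the API's word objects, and the helper name `group_words_by_speaker` and the sample tags are assumptions for illustration, not output from the service.

```python
from itertools import groupby

def group_words_by_speaker(tagged_words):
    """Merge consecutive (word, speaker_tag) pairs into one turn per speaker change."""
    turns = []
    for tag, group in groupby(tagged_words, key=lambda pair: pair[1]):
        text = " ".join(word for word, _ in group)
        turns.append({"speaker_label": tag, "transcript": text})
    return turns

# Made-up tags on a few words from the transcript above
sample = [("We", 1), ("didn't", 1), ("make", 1), ("it,", 1), ("did", 1), ("we?", 1),
          ("Yeah.", 2), ("No", 1), ("screaming.", 1)]
turns = group_words_by_speaker(sample)
for turn in turns:
    print(f"Speaker {turn['speaker_label']}: {turn['transcript']}")
```

With real responses, the tuples could be built from `word_info.word` and `word_info.speaker_tag`, the same attributes used in `transcribe_audio`.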

## Microsoft Azure Speech Service
`Microsoft Azure Speech Service` offers both real-time continuous recognition and batch transcription. With continuous recognition, the function will start recognition and wait for results to come in continuously until the recognition is stopped. This is better suited for longer audio files.

Prerequisites:
1. __Azure Subscription__: If you don't have an Azure subscription, you can create a [free account](https://azure.microsoft.com/en-us/free).
2. __Speech Service__: Create a Speech service in the Azure portal. Note down the API key and the service region.
3. __Python SDK__: Install the Azure SDK:

In [None]:
! pip install azure-cognitiveservices-speech

__Transcribing Audio using Microsoft Azure Speech Service__: The transcription result will be printed in real-time as the audio is processed. The continuous recognition mode can be more suitable for longer audio files and can provide more granularity in the results, especially when combined with diarization.

In [71]:
import os
import logging
import time
from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig, ResultReason, OutputFormat

# Set up logging
logging.basicConfig(filename='transcription.log', level=logging.INFO)

def read_subscription_key_from_file(file_path):
    try:
        with open(file_path, 'r') as file:
            return file.readline().strip()
    except Exception as e:
        logging.error(f"Error reading subscription key from file: {str(e)}")
        raise

def transcribe_audio_continuous(file_path, subscription_key, service_region):
    try:
        # Set up the Azure Speech configuration
        speech_config = SpeechConfig(subscription=subscription_key, region=service_region)
        speech_config.request_word_level_timestamps()
        speech_config.output_format = OutputFormat.Detailed

        # Configure the audio source
        audio_config = AudioConfig(filename=file_path)

        # Initialize the speech recognizer
        recognizer = SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

        done = False

        def stop_cb(evt):
            nonlocal done
            done = True

        # Connect callbacks to the events fired by the recognizer
        recognizer.recognized.connect(lambda evt: print(evt.result.text))
        recognizer.session_stopped.connect(stop_cb)
        recognizer.canceled.connect(stop_cb)

        # Start continuous recognition
        recognizer.start_continuous_recognition()
        while not done:
            time.sleep(0.5)

        # Stop continuous recognition
        recognizer.stop_continuous_recognition()

        return recognizer

    except Exception as e:
        logging.error(f"Error during transcription: {str(e)}")
        return None

# Example usage
file_path = "./original_audio_files/AR31_021108a.wav"
subscription_key_file = "azure_credentials.txt"
subscription_key = read_subscription_key_from_file(subscription_key_file)
service_region = "eastus"  # Azure Service Region

transcribe_audio_continuous(file_path, subscription_key, service_region)


She shoeshine.
Ohh.
Yeah. Hmm.
Yeah.
Hey.
Session stopped.
Right.
Hi.
Ohh PP Pride.
Yeah, yeah, yeah.
Yeah, we gotta sit down, baby. ****.
Ohh no, ohh. You just hit me in the mouth.
Put on the paper, Anna. Come on.
Ohh.
You pooped in your pants, Ray.
We didn't make it, did we? No.
Yeah. Can you say shoot? Yeah, shoot. Ohh.
Mama didn't get to you in time, did she? Yeah.
What to get Yeah, you too stinky.
That was supposed to go in the pee pee pot.
Ohh no.
That, too, was supposed to go in a pot.
What? Not I shoot. Yeah. Yuck. Yeah. OK.
Choo Choo.
She.
Yeah.
I.
Who?
Where you have to learn how to poop in a pot. Yeah, she.
But.
Crap.
Poo poo pot.

Bam, bam, bam, bam bam bam.
Ohh.
No screaming.
Wow.
Ohh.
Yeah.
If you poop in the pot, Mama wouldn't have to.
Why?
You honey.
Well, this is a good thing.
Where were you supposed to poop?
In the pot.
I.
Yeah.
Yeah, yeah.
Yeah.
Yeah.
What are you doing with your tongue?
Yeah.
Please.
Thank you.
Hey.
Run pee pee bag away.
So you didn't pee? Pee in th

<azure.cognitiveservices.speech.SpeechRecognizer at 0x123764e50>

__TODO__: Save the transcription to a text file. 
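
A possible approach to this TODO is to collect each recognized phrase in a list instead of only printing it, then write the list to a file once recognition stops. The sketch below is one way to do that under stated assumptions: `make_collector` and `save_lines` are hypothetical helpers, and the events are simulated with `SimpleNamespace` objects that only mimic the `evt.result.text` attribute used above, so it runs without the Azure SDK.

```python
import os
import tempfile
from types import SimpleNamespace

def make_collector():
    """Return (handler, lines); the handler appends each recognized text to lines."""
    lines = []

    def handler(evt):
        text = evt.result.text
        if text:  # skip empty results
            lines.append(text)

    return handler, lines

def save_lines(lines, output_path):
    """Write one recognized phrase per line."""
    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

# Offline demonstration with stand-in event objects (not real SDK events)
handler, lines = make_collector()
for text in ["No screaming.", "", "What are you doing with your tongue?"]:
    handler(SimpleNamespace(result=SimpleNamespace(text=text)))

demo_path = os.path.join(tempfile.gettempdir(), "azure_transcription_demo.txt")
save_lines(lines, demo_path)
print(f"Saved {len(lines)} lines to {demo_path}")
```

In `transcribe_audio_continuous`, the handler would be connected with `recognizer.recognized.connect(handler)` in place of the printing lambda, and `save_lines` called after the recognition loop ends.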

__Transcribing Audio using Microsoft Azure Speech Service__: Here's how you can modify the function to include diarization.

In [19]:
import os
import time
from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig, ResultReason, OutputFormat

def read_subscription_key_from_file(file_path):
    """
    Read the Azure subscription key from a file.

    Args:
    - file_path (str): Path to the file containing the subscription key.

    Returns:
    - str: The subscription key.
    """
    with open(file_path, 'r') as file:
        return file.readline().strip()

def transcribe_audio_continuous_with_diarization(file_path, subscription_key, service_region):
    """
    Transcribe audio from a file using Azure Speech Service's continuous recognition with diarization.

    Args:
    - file_path (str): Path to the audio file.
    - subscription_key (str): Azure Speech Service subscription key.
    - service_region (str): Azure service region (e.g., 'westus').

    Returns:
    - None: Prints the transcription result with speaker labels.
    """
    # Set up the Azure Speech configuration
    speech_config = SpeechConfig(subscription = subscription_key, region = service_region)
    speech_config.request_word_level_timestamps()
    speech_config.output_format = OutputFormat.Detailed
    speech_config.set_property_by_name("EnableSpeakerDiarization", "true")
    speech_config.set_property_by_name("SpeakerCount", "2")  # Assumes 2 speakers in the audio, adjust if needed

    # Configure the audio source
    audio_config = AudioConfig(filename = file_path)

    # Initialize the speech recognizer
    recognizer = SpeechRecognizer(speech_config = speech_config, audio_config = audio_config)

    done = False

    # Connect callbacks to the events fired by the recognizer
    def recognized_handler(evt):
        print(f"Speaker {evt.result.properties['Property_SpeakerId']}: {evt.result.text}")

    def session_stopped_handler(evt):
        nonlocal done
        print("Session stopped.")
        done = True

    def canceled_handler(evt):
        nonlocal done
        print(f"Speech Recognition canceled: {evt.reason}. Error details: {evt.error_details}")
        done = True

    recognizer.recognized.connect(recognized_handler)
    recognizer.session_stopped.connect(session_stopped_handler)
    recognizer.canceled.connect(canceled_handler)

    # Start continuous recognition
    recognizer.start_continuous_recognition()
    while not done:
        time.sleep(5)

    # Stop continuous recognition
    recognizer.stop_continuous_recognition()

# Example usage
file_path = "./original_audio_files/AR31_021108a.wav"
subscription_key_file = "azure_credentials.txt"
subscription_key = read_subscription_key_from_file(subscription_key_file)
service_region = "eastus"  # Azure Service Region

transcribe_audio_continuous_with_diarization(file_path, subscription_key, service_region)


Session stopped.


__TODO__: Diagnose problems with diarization. There might be service limitations; the audio file is long. 

## Rev.ai

`Rev.ai` is an automatic speech recognition (ASR) service offered by Rev.com, renowned for its transcription and captioning offerings. The service converts spoken language into written text using advanced ASR models, ensuring high accuracy. It supports multiple languages, can differentiate between speakers, and provides detailed transcriptions with punctuation, capitalization, and timestamps. The API also offers real-time transcription capabilities through WebSockets and options for content filtering.

Prerequisites:

1. __Rev.ai Account__: Create an account on [Rev.ai](https://www.rev.ai/). Once you sign up, you'll be given an API key.
2. __Python Requests Library__: Install the requests library, which makes it easy to call the API:


In [None]:
! pip install requests

__Transcribing Audio with Rev.ai API__: The code provides an end-to-end process for transcribing a local audio file using the Rev.ai API. It submits the audio file, checks the transcription status at regular intervals, fetches the transcription once complete, and then saves it to a file.

Speaker diarization is automatically performed on audio files that contain multiple speakers. When the transcription is returned, different speakers are labeled in the output, allowing you to determine which segments of the transcription correspond to which speaker.

In [29]:
import requests
import time

def read_api_key_from_file(file_path):
    """Read the API key from a file."""
    with open(file_path, 'r') as file:
        return file.readline().strip()

def submit_file_for_transcription(file_path, api_key):
    """Submit a local audio file to Rev.ai for transcription."""
    url = "https://api.rev.ai/speechtotext/v1/jobs"
    headers = {
        "Authorization": f"Bearer {api_key}"
    }

    with open(file_path, 'rb') as audio_file:
        response = requests.post(url, headers = headers, files = {"media": audio_file})

    if response.status_code != 200:
        print(f"Error submitting file: {response.text}")
        return None

    job_details = response.json()
    print(f"Job submitted successfully. Job ID: {job_details['id']}")
    return job_details['id']

def check_job_status(job_id, api_key):
    """Check the status of a transcription job."""
    url = f"https://api.rev.ai/speechtotext/v1/jobs/{job_id}"
    headers = {
        "Authorization": f"Bearer {api_key}"
    }
    response = requests.get(url, headers = headers)
    return response.json()["status"]

def fetch_transcription(job_id, api_key):
    """Fetch the transcription results for a completed job."""
    url = f"https://api.rev.ai/speechtotext/v1/jobs/{job_id}/transcript"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Accept": "text/plain"  # Request plain text transcription
    }
    response = requests.get(url, headers = headers)
    return response.text

# Example usage
file_path = "./original_audio_files/AR31_021108a.wav"
api_key_file = "revai_api_credentials.txt"
api_key = read_api_key_from_file(api_key_file)

job_id = submit_file_for_transcription(file_path, api_key)

# Poll the API until the transcription is done
while True:
    status = check_job_status(job_id, api_key)
    if status == "transcribed":
        break
    elif status in ["failed", "invalid"]:
        # Raise instead of calling exit(), which would shut down the notebook kernel
        raise RuntimeError("Transcription failed.")
    else:
        print("Job is still being processed. Waiting for another 30 seconds.")
        time.sleep(30)

transcription = fetch_transcription(job_id, api_key)

# Print the transcription
print("Transcription:\n")
print(transcription)

# Save the transcription to a text file named after the audio file (without its directory)
output_file = f"./generated_audio_transcripts/Revai_API_{file_path.rsplit('/', 1)[-1]}_transcription.txt"
with open(output_file, 'w') as f:
    f.write(transcription)

print(f"\nTranscription saved to {output_file}")

Job submitted successfully. Job ID: 7UgQcaSbXMh7r0AQ
Job is still being processed. Waiting for another 30 seconds.
Job is still being processed. Waiting for another 30 seconds.
Job is still being processed. Waiting for another 30 seconds.
Transcription:

Speaker 0    00:00:15    Yeah, she pee.  
Speaker 2    00:00:21    Alright, Ray,  
Speaker 1    00:00:22    Sit  
Speaker 2    00:00:22    On the pee pee pot. Oh, peepee pot? Yeah, yeah, yeah, yeah. We gotta sit on the peepee  
Speaker 1    00:00:39    Pot.  
Speaker 2    00:00:41    Oh no. Ow. You just hit me in the mouth.  
Speaker 0    00:00:45    Sit on the pee pee pot  
Speaker 2    00:00:46    <laugh>. I know. Come on. Oh, you pooped in your pants, Ray. We didn't make it, did we?  
Speaker 0    00:01:03    No, it my baby pot.  
Speaker 2    00:01:06    Yeah. Did you stay with  
Speaker 0    00:01:08    I chew? Yeah,  
Speaker 2    00:01:10    I chew.  
Speaker 0    00:01:11    I  
Speaker 2    00:01:12    Momma didn't get to you 

__TODO__: Improve speaker diarization. 
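
A first step toward evaluating the diarization is to turn the plain-text transcript into structured speaker turns. The sketch below assumes the line layout seen in the output above (`Speaker N`, a timestamp, then the text, separated by runs of spaces). The pattern and the helper name `parse_revai_text` are illustrative and not part of the Rev.ai API.

```python
import re

# Matches lines such as "Speaker 2    00:00:21    Alright, Ray,"
LINE_PATTERN = re.compile(r"^Speaker (\d+)\s+(\d{2}:\d{2}:\d{2})\s+(.*)$")

def parse_revai_text(transcript_text):
    """Split a plain-text transcript into (speaker, timestamp, text) tuples."""
    rows = []
    for line in transcript_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            speaker, timestamp, text = match.groups()
            rows.append((int(speaker), timestamp, text))
    return rows

# Two lines copied from the output above
sample_text = "Speaker 2    00:00:21    Alright, Ray,  \nSpeaker 1    00:00:22    Sit  "
rows = parse_revai_text(sample_text)
for speaker, timestamp, text in rows:
    print(speaker, timestamp, text)
```

Once the turns are in this form, they can be counted per speaker or compared against a hand-labeled reference.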

__Transcribing Audio with Rev.ai [SDK](https://github.com/revdotcom/revai-python-sdk)__: 

In [None]:
! pip install --upgrade rev_ai

In [72]:
from rev_ai import apiclient
import time

def read_api_key_from_file(file_path):
    """Read the API key from a file."""
    with open(file_path, 'r') as file:
        return file.readline().strip()

api_key_file = "revai_api_credentials.txt"
api_key = read_api_key_from_file(api_key_file)
file_path = "./original_audio_files/AR31_021108a.wav"

# create your client
client = apiclient.RevAiAPIClient(api_key)

# send a local file
job = client.submit_job_local_file(file_path)

# Poll the API until the transcription is done
while True:
    job_details = client.get_job_details(job.id)
    status = job_details.status
    if status == "transcribed":
        break
    elif status in ["failed", "invalid"]:
        print("Transcription failed.")
        exit()
    else:
        print("Job is still being processed. Waiting for another 60 seconds.")
        time.sleep(60)

# retrieve transcript as text
transcript_text = client.get_transcript_text(job.id)
print(transcript_text)

# retrieve transcript as JSON
# transcript_json = client.get_transcript_json(job.id)

# retrieve transcript as a Python object
# transcript_object = client.get_transcript_object(job.id)

# Save the transcription to a text file
# Use only the file name, not the full path, when building the output file name
output_file = f"./generated_audio_transcripts/Revai_SDK_{file_path.split('/')[-1]}_transcription.txt"
with open(output_file, 'w') as f:
    f.write(transcript_text)

print(f"\nTranscription saved to {output_file}")


Job is still being processed. Waiting for another 60 seconds.
Job is still being processed. Waiting for another 60 seconds.
Job is still being processed. Waiting for another 60 seconds.
Job is still being processed. Waiting for another 60 seconds.
Job is still being processed. Waiting for another 60 seconds.
Job is still being processed. Waiting for another 60 seconds.
Job is still being processed. Waiting for another 60 seconds.
Job is still being processed. Waiting for another 60 seconds.


KeyboardInterrupt: 

__TODO__: Figure out why the SDK run takes longer than the direct API call (the run above was interrupted manually before it finished).
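
A polling loop without an upper bound can run indefinitely if the expected status never appears. The sketch below shows one way to bound the wait. It assumes a hypothetical zero-argument callable `get_status` that returns the job status either as a plain string or as an Enum member; `wait_for_job` and its parameters are illustrative names, not part of the Rev.ai SDK.

```python
import time


def wait_for_job(get_status, done=("TRANSCRIBED",), failed=("FAILED", "INVALID"),
                 interval=30, timeout=1800, sleep=time.sleep):
    """Poll get_status() until it reports a done or failed state.

    get_status is a hypothetical zero-argument callable that returns the
    current status as a string or as an Enum member.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        # Accept both Enum members (use .name) and plain strings.
        name = str(getattr(status, "name", status)).upper()
        if name in done:
            return name
        if name in failed:
            raise RuntimeError(f"Transcription job ended with status {name}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {name} after {timeout} seconds")
        sleep(interval)


# Demo with a fake status sequence; no network calls are made here.
fake_statuses = iter(["in_progress", "in_progress", "transcribed"])
print(wait_for_job(lambda: next(fake_statuses), interval=0))
```

With the client from the cell above, the call might look like `wait_for_job(lambda: client.get_job_details(job.id).status, interval=60)`.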

## AssemblyAI
AssemblyAI offers a speech-to-text API that converts spoken language into written text. The platform is built on deep learning models trained on large amounts of audio data.

Prerequisites:
1. Sign up for an account with [AssemblyAI](https://www.assemblyai.com/).
2. Once you have an account, you'll be provided with an API key, which is required to authenticate requests.

__Transcribing Audio with AssemblyAI__: The code below uses the official AssemblyAI Python SDK (`assemblyai`) to submit audio files for transcription. The SDK handles the underlying steps: uploading the file, submitting the transcription request, polling for completion, and fetching the result.

In [None]:
# Install the SDK
! pip install -U assemblyai

In [39]:
import assemblyai as aai

def read_api_key_from_file(file_path):
    """Read the API key from a file."""
    with open(file_path, 'r') as file:
        return file.readline().strip()

# Set your API key
api_key_file = "assemblyai_api_credentials.txt"
api_key = read_api_key_from_file(api_key_file)
aai.settings.api_key = api_key

# Path to the audio file
file_path = "./original_audio_files/AR31_021108a.wav"

# Create a transcription configuration with speaker labels enabled
config = aai.TranscriptionConfig(speaker_labels=True)

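# Initialize the transcriber and submit the audio file for transcription
transcriber = aai.Transcriber()
transcript = transcriber.transcribe(file_path, config=config)

# Stop early if the service reports an error (utterances would be missing)
if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(f"Transcription failed: {transcript.error}")

# Prepare the transcription text
transcription_text = "\n".join(
    f"Speaker {utterance.speaker}: {utterance.text}" for utterance in transcript.utterances
)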

# Print the transcription with speaker labels
print(transcription_text)

# Save the transcription to a text file
output_file = f"./generated_audio_transcripts/AssemblyAI_{file_path.split('/')[-1]}_transcription.txt"
with open(output_file, 'w') as f:
    f.write(transcription_text)

print(f"\nTranscription saved to {output_file}")


Speaker A: Sushi shine go by she shine. Yeah. She be sheep. All right, ray pot you.
Speaker B: Oh, no.
Speaker A: Ow.
Speaker B: You just hit me in the mouth.
Speaker A: Sit on the peepee pot. I know. Come on.
Speaker B: Oh, you pooped in your pants. Ray, we didn't make it, did we?
Speaker A: No. It's my baby pie.
Speaker B: Yeah.
Speaker A: Can you stay?
Speaker B: Yeah, I chew. Mama didn't get to you in time, did she?
Speaker A: Yeah.
Speaker B: You're, too, Stinky. That was supposed to go in the peepee pot.
Speaker A: Oh, no.
Speaker B: That too was supposed to go in the pot, wasn't it?
Speaker A: Yeah.
Speaker B: Yuck.
Speaker A: Yeah.
Speaker B: Ray, you have to learn how to poop in the pot. Mama wouldn't have to wipe your honey.
Speaker A: Well.
Speaker B: Where were you supposed to poop? In the pot.
Speaker A: Yeah, wasn't it?
Speaker B: What are you doing with your tongue, little yeah.
Speaker A: Thank you, peepee bat.
Speaker B: You didn't peepee in the pot. You pooped in your

__TODO__: Use AssemblyAI's [LeMUR](https://www.assemblyai.com/docs/guides/processing-audio-with-llms-using-lemur) (Leveraging Large Language Models to Understand Recognized Speech) framework to process audio files with an LLM.
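
The AssemblyAI output above has no timestamps, which makes it harder to line up with the Rev.ai transcript. The sketch below formats utterances in the same `Speaker <id>    <HH:MM:SS>    <text>` layout. It assumes each utterance object exposes `speaker`, `text`, and a `start` offset in milliseconds. The helper names are hypothetical, and the demo uses stand-in objects with made-up timestamps rather than real API results.

```python
from types import SimpleNamespace


def format_timestamp(milliseconds):
    """Convert a millisecond offset to an HH:MM:SS string."""
    total_seconds = int(milliseconds) // 1000
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def format_utterances(utterances):
    """Render utterances as 'Speaker X    HH:MM:SS    text' lines."""
    lines = []
    for utterance in utterances:
        # Assumption: utterance.start is the offset from the start of the audio, in ms
        stamp = format_timestamp(utterance.start)
        lines.append(f"Speaker {utterance.speaker}    {stamp}    {utterance.text}")
    return "\n".join(lines)


# Stand-in objects with made-up timestamps, for illustration only
demo_utterances = [
    SimpleNamespace(speaker="B", start=41000, text="Oh, no."),
    SimpleNamespace(speaker="A", start=42500, text="Ow."),
]
print(format_utterances(demo_utterances))
```

With a real result, the call would be `format_utterances(transcript.utterances)`.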