<a href="https://colab.research.google.com/github/Elahekhezri/gcloud-batch-transcription/blob/main/batch_transcription_gcloud_speech_to_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This script transcribes batches of audio files using Google Cloud's Speech-to-Text API. It then automatically structures the data into a pandas DataFrame with filenames and corresponding transcriptions.

☀️ Recommended for qualitative researchers transcribing audio interviews or survey voice notes.

☀️ Recommended for audio files in languages not widely supported by online batch transcription tools (e.g., Persian).

💡 To make the final DataFrame analysis-ready:
- Name audio files after participant IDs.

- Name folders after research variables or conditions.
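
The naming convention above pays off once the transcripts are in a table, because the participant ID and the condition can be read straight off each object path. A minimal sketch, assuming paths shaped like `condition/participant.mp3`; the helper name `parse_audio_path` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def parse_audio_path(blob_name: str) -> dict:
    """Split a bucket object name into condition (parent folder) and participant ID (file stem)."""
    path = PurePosixPath(blob_name)
    # The immediate parent folder is treated as the research condition;
    # files at the bucket root get an empty condition.
    condition = path.parent.name
    return {"condition": condition, "participant_id": path.stem}


# Hypothetical object names, for illustration only.
examples = ["interviews/control/P001.mp3", "interviews/treatment/P002.mp3"]
rows = [parse_audio_path(name) for name in examples]
print(rows)
```

The resulting dictionaries can be turned into extra DataFrame columns next to the transcriptions.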


# Setup

## 🏗️ Environment
- [Google Colab](https://colab.research.google.com/) (Recommended, minimal setup).
- If running locally, the Colab-only `google.colab` authentication is not available; authenticate with a service account instead.

## 🗃️ Storage
- Google Cloud Storage Bucket (easy integration with Google Cloud Speech-to-Text)
- ⚠️ If you're not using a storage bucket, adapt the code to pull audio from your own source.

## Prep

1. Go to [Google Cloud Console](https://console.cloud.google.com/).
2. Create a new project or select an existing one.
3. Enable [Cloud Speech-to-Text API](https://console.cloud.google.com/apis/library/speech.googleapis.com?) for the project.
4. (optional) Enable [Cloud Storage API](https://console.cloud.google.com/apis/library/storage-component.googleapis.com?) for the project, then go to [Cloud Storage Buckets](https://console.cloud.google.com/storage/), create a bucket, and upload the folder containing the audio files.
5. (optional) Depending on your authentication method, you may need to [create a service account](https://cloud.google.com/iam/docs/service-accounts-create).
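
The upload in step 4 can also be scripted instead of done through the console. A rough sketch, assuming the `google-cloud-storage` package is installed and that `bucket` is a bucket object like the one created in the Authentication cell; `plan_uploads` and `upload_folder` are hypothetical helper names, and the extension list is an assumption.

```python
from pathlib import Path

AUDIO_EXTENSIONS = (".mp3", ".wav", ".flac")  # assumed set; adjust to your data


def plan_uploads(local_dir: str, prefix: str = "") -> list:
    """Return (local_path, blob_name) pairs for every audio file under local_dir."""
    root = Path(local_dir)
    clean_prefix = prefix.strip("/")
    pairs = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            # Keep the folder structure so folder names survive as conditions.
            relative = path.relative_to(root).as_posix()
            blob_name = f"{clean_prefix}/{relative}" if clean_prefix else relative
            pairs.append((str(path), blob_name))
    return pairs


def upload_folder(bucket, local_dir: str, prefix: str = "") -> None:
    """Upload each planned file; `bucket` is a google.cloud.storage Bucket."""
    for local_path, blob_name in plan_uploads(local_dir, prefix):
        bucket.blob(blob_name).upload_from_filename(local_path)
        print(f"Uploaded {local_path} to {blob_name}")
```

Calling `plan_uploads` on its own first is a cheap way to check which files would be sent before any data leaves your machine.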

# ⚠️ Warning
As of Summer 2024, this script ran at no cost. However, Google Cloud pricing may change, so check the latest terms before use.


# Code

In [None]:
!pip install google-cloud-speech google-cloud-storage pandas

In [None]:
#@title Libraries
from google.api_core.client_options import ClientOptions
from google.cloud import speech, storage
from google.cloud.speech_v2 import SpeechClient, types as cloud_speech
from google.colab import auth, data_table
from google.oauth2 import service_account
import pandas as pd
from IPython.display import display
import os

In [None]:
#@title Authentication

PROJECT_ID = "your-project-id"  # @param {type:"string"}
LOCATION = 'europe-west4'  # You can modify this based on your recognizer's location
recognizer_path = f"projects/{PROJECT_ID}/locations/{LOCATION}/recognizers/_"

## Use ONE of the two options below and comment out the other.

## Option 1: quick authentication in Google Colab
auth.authenticate_user(project_id=PROJECT_ID)
credentials = None  # the clients then fall back to the Colab user credentials

## Option 2: authentication using a service account (recommended)
credentials_file = "path-to-credentials.json" # @param {type:"string"}
credentials = service_account.Credentials.from_service_account_file(credentials_file)

# Initialize the Google Cloud Speech client
speech_client = speech.SpeechClient(credentials=credentials)

# Initialize the Google Cloud Storage client
storage_client = storage.Client(project=PROJECT_ID, credentials=credentials)

bucket_name = "your-bucket-name" # @param {type:"string"}
bucket = storage_client.bucket(bucket_name)
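
The recognizer path and the API endpoint need to refer to the same region; a mismatch typically surfaces as a location error at request time. A small sketch that derives both from one location string. `speech_endpoint` and `default_recognizer` are hypothetical helper names, and the `global` special case reflects my understanding that the global location uses the unprefixed `speech.googleapis.com` host.

```python
def speech_endpoint(location: str) -> str:
    """Host name of the Speech-to-Text endpoint for a location."""
    if location == "global":
        # Assumption: the global location has no region prefix.
        return "speech.googleapis.com"
    return f"{location}-speech.googleapis.com"


def default_recognizer(project_id: str, location: str) -> str:
    """Resource path of the implicit `_` recognizer, as used in this notebook."""
    return f"projects/{project_id}/locations/{location}/recognizers/_"


print(speech_endpoint("europe-west4"))
print(default_recognizer("your-project-id", "europe-west4"))
```

Deriving both strings from `LOCATION` means a region change only has to be made in one place.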

In [None]:
#@title Batch Transcription

# define batch recognize function

MAX_AUDIO_LENGTH_SECS = 8 * 60 * 60

def run_batch_recognize(client, credentials, gcs_uri: str) -> str:
  """Transcribe one audio file stored at gcs_uri and return its full transcript."""
  # Create a Speech-to-Text v2 client for the regional endpoint. This replaces
  # the `client` argument, which is kept only so existing calls keep working.
  client = SpeechClient(
      credentials=credentials,
      client_options=ClientOptions(
          # The endpoint region must match LOCATION in recognizer_path.
          api_endpoint=f"{LOCATION}-speech.googleapis.com",
      ),
  )

  config = cloud_speech.RecognitionConfig(
      # Let the API detect the audio encoding.
      auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
      features=cloud_speech.RecognitionFeatures(
          enable_automatic_punctuation=True,
          enable_word_time_offsets=True,
          enable_spoken_punctuation=True,
      ),
      model="chirp", # change accordingly
      language_codes=["fa-IR"], # change accordingly
  )

  files = [cloud_speech.BatchRecognizeFileMetadata(uri=gcs_uri)]

  request = cloud_speech.BatchRecognizeRequest(
      recognizer=recognizer_path,
      config=config,
      files=files,
      recognition_output_config=cloud_speech.RecognitionOutputConfig(
            inline_response_config=cloud_speech.InlineOutputConfig(),
            ),
  )
  operation = client.batch_recognize(request=request)

  print("Operation in progress...")
  response = operation.result(timeout=3 * MAX_AUDIO_LENGTH_SECS)

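  # Join the top alternative of each segment; skip segments with no alternatives.
  transcripts = [
      result.alternatives[0].transcript
      for result in response.results[gcs_uri].transcript.results
      if result.alternatives
  ]
  return " ".join(transcripts)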

# fetch all audio files from your Cloud Storage bucket within the specified folder prefix.

gcs_uris = []
file_names = []

# Object names in a bucket do not start with a slash, so the prefix should not either.
folder_prefix = "your-audio-folder/" # @param {type:"string"}
blobs = bucket.list_blobs(prefix=folder_prefix)

for blob in blobs:
    if blob.name.lower().endswith('.mp3'): # change accordingly (e.g., .wav)
        gcs_uri = f"gs://{bucket_name}/{blob.name}"
        gcs_uris.append(gcs_uri)
        file_names.append(blob.name.split('/')[-1])

print(gcs_uris)
print(file_names)

# Loop through audio files, transcribe each, and store results as (filename, transcript) tuples
transcriptions = []

for gcs_uri, file_name in zip(gcs_uris, file_names):
    print(f"Transcribing {gcs_uri}...")
    try:
        transcript = run_batch_recognize(speech_client, credentials, gcs_uri)
        transcriptions.append((file_name, transcript))
    except Exception as e:
        print(f"Failed to transcribe {gcs_uri}: {e}")
        transcriptions.append((file_name, ""))

# Create a DataFrame from collected tuples
df = pd.DataFrame(transcriptions, columns=["File Name", "Transcription"])
display(df)
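
Transcripts in non-Latin scripts such as Persian can appear garbled when a CSV is opened in some spreadsheet tools, so it can help to write the file with a UTF-8 byte-order mark. A minimal sketch using the standard library on the same `(file name, transcript)` tuples; the function name `save_transcriptions`, the output location, and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def save_transcriptions(rows, path):
    """Write (file_name, transcript) tuples to a CSV with a UTF-8 BOM."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle)
        writer.writerow(["File Name", "Transcription"])
        writer.writerows(rows)


# Placeholder rows standing in for the `transcriptions` list built above.
sample = [("P001.mp3", "سلام"), ("P002.mp3", "")]
out_path = os.path.join(tempfile.gettempdir(), "transcriptions.csv")
save_transcriptions(sample, out_path)
print(f"Saved {len(sample)} rows to {out_path}")
```

With the DataFrame from the cell above, `df.to_csv(out_path, index=False, encoding="utf-8-sig")` should produce an equivalent file.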