# Automating Speech-to-Text Transcription

UX researchers and ethnographers frequently conduct interviews to gather data. However, audio recordings are tedious and costly to transcribe. Many might be surprised to learn that the first fully automatic translation was done all the way back in 1954. Although it was limited (with a mere six grammar rules and 250 lexical items in its vocabulary), the experiment developed jointly by Georgetown University and IBM translated more than sixty Russian sentences into English. Interestingly enough, Leon Dostert, one of the scientists involved in the experiment, predicted that generalized machine translation would be realized in "five, perhaps, three years" <sup>[1](#myfootnote1)</sup>. Alas, progress was slower than the sanguine scientist suggested.

Advancements in machine learning and the exponential growth in computing power, however, have finally made machine translation a reality, albeit a perfectible one. Powered by the same speech recognition technology that works on your Android and Chrome OS devices, the Google Cloud Platform provides the best speech-to-text conversion tool available today. And it's practically free!

Audio recordings under 60 minutes are free and recordings over 60 minutes are USD \$0.024/minute. A three-hour interview would therefore cost you at most \$4.32 (180 minutes × \$0.024). And what's to stop you from [splitting the audio file](https://www.cnet.com/news/how-to-split-audio-tracks/) into 3 or 4 chunks?
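
The arithmetic above can be sketched in a few lines of Python. The rate is the figure quoted in this article, and `free_minutes` is a hypothetical parameter for whatever free allowance applies to your account, so treat the numbers as an illustration and check the current pricing page before budgeting.

```python
def estimate_cost(duration_minutes, rate_per_minute=0.024, free_minutes=0):
    """Estimate the transcription cost in USD.

    rate_per_minute is the figure quoted in this article; free_minutes is an
    assumed allowance that is subtracted before billing.
    """
    billable = max(duration_minutes - free_minutes, 0)
    return round(billable * rate_per_minute, 2)


# A three-hour interview with every minute billed
three_hours = estimate_cost(180)

# The same interview if the first hour were free (an assumption)
with_allowance = estimate_cost(180, free_minutes=60)

print(three_hours, with_allowance)
```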

I am going to show you how -- with little to no computing experience -- you can use it right now!

Let's get to it...

## Pre-processing

**Note**: The Speech-to-Text API has [certain requirements](https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig) in order to work. Notably, the audio file must be encoded in a suitable format (e.g. .flac, .wav) and contain only 1 channel (i.e. mono, not stereo). The API supports sample rates from 8000 to 48000 Hz. For best results, at least 16000 Hz is recommended. I will shortly provide a script for encoding files as FLAC.
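
Before uploading anything, it can help to check a recording against the requirements above. The following is a minimal sketch using Python's built-in `wave` module, so it only works on .wav files; the function name `check_wav` and its parameters are made up for illustration.

```python
import wave


def check_wav(path, min_rate=8000, max_rate=48000, recommended_rate=16000):
    """Return a list of human-readable problems found in a WAV file."""
    problems = []
    with wave.open(path, 'rb') as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()

    # The article's requirement: mono audio only
    if channels != 1:
        problems.append('expected 1 channel (mono), found {}'.format(channels))

    # Supported range first, then the softer recommendation
    if not min_rate <= rate <= max_rate:
        problems.append('sample rate {} Hz is outside {}-{} Hz'.format(
            rate, min_rate, max_rate))
    elif rate < recommended_rate:
        problems.append('sample rate {} Hz is below the recommended {} Hz'.format(
            rate, recommended_rate))
    return problems
```

An empty list means the file passed all three checks. If it did not, one option (an assumption, not the script promised above) is a command-line tool such as ffmpeg, where `-ac 1` downmixes to mono and `-ar 16000` resamples to 16000 Hz.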

You will also need to set up a Google Cloud Platform account, which will then allow you to create a bucket to store your audio file in the cloud and request an API key. The `gcs_uri` object is the link to your file in the bucket and `cred_key` is the API key stored as a JSON file.

## How it works

#### Get modules

In [1]:
import os
import json

#### Get modules from Google Cloud client package

In [2]:
from google.cloud import speech
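# Note: the enums and types submodules were removed in google-cloud-speech 2.0,
# so these two imports need a 1.x release of the package
from google.cloud.speech import enums
from google.cloud.speech import types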

#### Initiate the client

I highly recommend storing your API key as a JSON file if you plan to run this from the command line.

In [3]:
cred_key = r'C:\myPath\myAPI-key.json' # api key json file (raw string so the backslashes are kept literally)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = cred_key
client = speech.SpeechClient()
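
A mistyped key path usually only shows up once the client is created. As a minimal sketch, assuming you want to catch that earlier, the hypothetical helper below (`set_credentials` is not part of any Google package) checks that the file exists and parses as JSON before setting the environment variable.

```python
import json
import os


def set_credentials(key_path):
    """Point GOOGLE_APPLICATION_CREDENTIALS at key_path after basic checks."""
    if not os.path.isfile(key_path):
        raise FileNotFoundError('No key file at {}'.format(key_path))

    # json.load raises ValueError if the file is not valid JSON
    with open(key_path, encoding='utf-8') as f:
        json.load(f)

    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = key_path
    return key_path
```

You would call `set_credentials(cred_key)` in place of the direct `os.environ` assignment. It does not verify that the key is accepted by Google, only that the file is present and well formed.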

#### Load link to audio file from Google Cloud Platform

Here you can specify the language and dialect you are using. The complete list of supported regional languages can be found on Google Cloud's [language support page](https://cloud.google.com/speech-to-text/docs/languages).

In [4]:
gcs_uri = 'gs://mybucket/myaudio.flac' # must include full path if audio is inside bucket directories
lang_code = 'en-US' # Use 'en-US' for US English

audio = types.RecognitionAudio(uri=gcs_uri)
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=48000,  # must match the sample rate of the audio file
    language_code=lang_code
)
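
The comment on `gcs_uri` mentions that the link must include the full path when the audio sits inside bucket directories. A minimal sketch of assembling such a link is shown here; `build_gcs_uri` is a hypothetical helper and the directory names in the usage line are made up.

```python
def build_gcs_uri(bucket, *path_parts):
    """Build a gs:// URI from a bucket name and the object's path segments."""
    if not bucket or '/' in bucket:
        raise ValueError('bucket must be a bare bucket name')

    # Drop stray slashes so segments join cleanly
    cleaned = [part.strip('/') for part in path_parts]
    object_path = '/'.join(part for part in cleaned if part)
    if not object_path:
        raise ValueError('at least one path segment is required')

    return 'gs://{}/{}'.format(bucket, object_path)


# Hypothetical layout: the file lives under interviews/2019/ in the bucket
nested_uri = build_gcs_uri('mybucket', 'interviews/2019/', 'myaudio.flac')
```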

#### Detect Speech

In [5]:
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=None) # blocks until the transcription is finished

#### Obtain Results

Process the results into a dictionary that holds the transcribed text and confidence score for each result, plus a list of the transcripts.

In [6]:
transcript_list = []
result_dict = {}

for n, result in enumerate(response.results):
    transcript = result.alternatives[0].transcript
    confidence = result.alternatives[0].confidence

    result_value = {
        'transcript': transcript,
        'confidence': confidence
    }

    # dictionary with transcription and confidence score
    result_key = 'result_{}'.format(n)
    result_dict[result_key] = result_value

    # running list of transcribed segments
    transcript_list.append(transcript)

# join the segments once, after the loop, so the name exists even with no results
joined_transcript_list = ' |'.join(transcript_list)
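
Once the confidence scores are in `result_dict`, they can point you to the segments worth checking by ear. The sketch below assumes the dictionary shape built above. The function name, the 0.8 threshold and the sample data are all made up for illustration.

```python
def summarize_confidence(result_dict, threshold=0.8):
    """Return the mean confidence and the keys of segments below threshold."""
    scores = {key: value['confidence'] for key, value in result_dict.items()}
    if not scores:
        return None, []
    mean = sum(scores.values()) / len(scores)
    low = [key for key, score in scores.items() if score < threshold]
    return mean, low


# Made-up sample data in the same shape as result_dict
sample = {
    'result_0': {'transcript': 'hello and welcome', 'confidence': 0.95},
    'result_1': {'transcript': 'thanks for joining', 'confidence': 0.65},
}
mean_confidence, needs_review = summarize_confidence(sample)
```

Here `needs_review` holds the keys of the segments that scored below the threshold, which you can then look up in the JSON output.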

#### Write file to system

In [7]:
json_filename = 'audio_transcription_results.json'
text_filename = 'audio_transcription_results.txt'

# utf-8 keeps non-English transcripts intact on any platform
with open(json_filename, 'w', encoding='utf-8') as f:
    json.dump(result_dict, f)

with open(text_filename, 'w', encoding='utf-8') as f:
    f.write(joined_transcript_list)

###### References

<a name="myfootnote1">1</a>: Hutchins (2004). ["The Georgetown-IBM Experiment Demonstrated in January 1954."](http://www.hutchinsweb.me.uk/AMTA-2004.pdf) In *Machine Translation: From Real Users to Research*, pp. 102-114.