# Notebook 1: Generating Transcripts From Audio Files Using AWS Transcribe

This is the first of two notebooks in which we investigate the accuracy of AWS Transcribe across accents from different countries. In this first notebook, we run our fifteen audio files through Transcribe and obtain a transcript for each one.

### Getting Started

We will use two AWS services in this analysis: **AWS Transcribe** and **Amazon Simple Storage Service (S3)**. Transcribe, our primary service of interest, uses machine learning to transcribe speech detected in audio files. For example, one could use this technology to provide closed captioning for a video. We will also need S3, which allows us to store all of our audio, transcript, and data files in one place. In S3, you create buckets to store your desired objects or files, and so we have created the following three buckets:
* **actual-transcripts-for-comprison**, a bucket for the actual transcript files of each excerpt,
* **audio-files-to-be-transcribed**, a bucket to hold the audio files we want to transcribe, and 
* **aws-generated-transcripts**, a bucket to receive and store the output from Transcribe. 

Click the icons below to learn more about the services we're using.

<div class="row">
    <a href="https://aws.amazon.com/transcribe/">
        <img top="20px" left="20px" border="0" alt="AWS Transcribe" src="https://docs.google.com/uc?export=download&id=1pEMVXZrauRRe7Wg1mMgdKpSUdBQOjS4v" width="150">
    </a>
    <a href="https://aws.amazon.com/s3/">
        <img top="20px" left="200px" border="0" alt="AWS S3" src="https://docs.google.com/uc?export=download&id=1nZg4pSvadAnGPP9RwuNy6RevMPTbp2LU" width="150">
    </a>
</div>

## Part 1 - Set-up
#### Permissions:
First, we need to ensure we have the correct permissions to use Transcribe and S3 as we intend. The permissions we require are **AmazonS3FullAccess** and **AmazonTranscribeFullAccess**, and they can be attached as follows:
1. In Amazon SageMaker, go to Notebook instances and click the name of the instance that will be used to run Transcribe.
2. Scroll down to the "Permissions and encryption" section and click the link titled "IAM role ARN," which opens the IAM Management Console in a new tab. IAM (short for Identity and Access Management) is a tool used to keep track of all of the users in an organization, as well as their permissions and capabilities within AWS.
3. Click "Attach policies" and search for the aforementioned permissions in the search bar, checking the box beside each desired permission and selecting "Attach policy" to attach the selected permission(s) to the SageMaker instance.
4. Confirm that the newly added policies appear in your list of policies.

*For a walkthrough of attaching these permissions, see the video below (the video does not contain sound):*

In [1]:
from IPython.display import HTML
HTML("""
<iframe src="https://player.vimeo.com/video/479619829" width="640" height="360" frameborder="0" allow="fullscreen" allowfullscreen></iframe>
""")

#### Upload files to S3:
As mentioned in our blog, we used a text-to-speech website to generate audio files of our excerpts. To bring these files through Transcribe efficiently, we put them into an S3 bucket, and we also created a bucket to hold the actual, correct transcripts of each of our passages. Try using your own audio files and passages here as well!  

*Note: Your S3 buckets do not need to be public, nor do the objects in them.*
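
The upload can be done through the S3 console, but it can also be scripted. The following is a minimal sketch of one way to do it, not necessarily how the files for this analysis were uploaded. The helper name `upload_audio_files` and the local folder name `audio` are hypothetical; the only AWS call used is the S3 client's `upload_file(Filename, Bucket, Key)` method from `boto3`.

```python
from pathlib import Path


def upload_audio_files(s3_client, folder, bucket, pattern="*.wav"):
    """Upload every file in `folder` matching `pattern` to `bucket`.

    `s3_client` is expected to behave like boto3.client('s3').
    Returns the object keys that were uploaded, in sorted order.
    """
    uploaded = []
    for path in sorted(Path(folder).glob(pattern)):
        # The object key is the bare file name, so the file lands at the bucket root
        s3_client.upload_file(str(path), bucket, path.name)
        uploaded.append(path.name)
    return uploaded


# Usage (assumes boto3 is installed, the role has S3 access, and a local
# folder named "audio" exists; the folder name is a placeholder):
# import boto3
# upload_audio_files(boto3.client("s3"), "audio", "audio-files-to-be-transcribed")
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real bucket.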

Now that we have attached the correct permissions and uploaded our files, we can move over to SageMaker, start up our instance, and create a new JupyterLab notebook (we used a conda_python3 notebook).

## Part 2 - Using AWS Transcribe

In our notebook, we must import `boto3`, the Python Software Development Kit (SDK) for Amazon Web Services. This SDK allows us to use AWS services, such as S3 and Transcribe, from Python. You can read more about `boto3` [here](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html). After importing `boto3`, we indicate that we would like to use the Transcribe API by calling `boto3.client('transcribe')`. We also import the `time` module, which we will use while waiting for our transcripts to be generated.

In [2]:
import time
import boto3

# Create a client for the Transcribe API
transcribe = boto3.client('transcribe')

Now we can begin generating our transcripts. For the first one, we'll examine each part of the code to describe its purpose.  

First, we will create variables for Transcribe to use. We create a `job_name`, which must be unique among all job names in your AWS account, including those of jobs previously run. We also create an identifier, `job_uri`, which points to the specific S3 bucket and file for Transcribe to use. (Recall that our audio files are stored in an S3 bucket called **audio-files-to-be-transcribed**.)

In [2]:
job_name = "American_Easy_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/American+Easy+Audio+Extracted.wav"

We then initialize the transcription job with the `start_transcription_job()` function. In this function, we indicate the following:  
`TranscriptionJobName`: the name of this particular transcription job, which was previously defined  
`Media`: the media file to be transcribed, as identified in our creation of the `job_uri` object  
`MediaFormat`: the format of the media file  
`LanguageCode`: the language of the audio file  
`OutputBucketName`: the S3 bucket in which we want the transcript to be stored, in this case, **aws-generated-transcripts**. Please note that you will need to provide the name of one of your own buckets here. Alternatively, you can remove this parameter entirely, in which case Transcribe stores the transcript in a service-managed bucket and returns a temporary link (`TranscriptFileUri`) that you can paste into a new browser tab to download the file to your computer.

In [4]:
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)

{'TranscriptionJob': {'TranscriptionJobName': 'American_Easy_Transcript',
  'TranscriptionJobStatus': 'IN_PROGRESS',
  'LanguageCode': 'en-US',
  'MediaFormat': 'wav',
  'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/American+Easy+Audio+Extracted.wav'},
  'StartTime': datetime.datetime(2020, 11, 15, 20, 39, 51, 823000, tzinfo=tzlocal()),
  'CreationTime': datetime.datetime(2020, 11, 15, 20, 39, 51, 790000, tzinfo=tzlocal())},
 'ResponseMetadata': {'RequestId': 'd7c49596-229c-47a5-9882-b420ac159640',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 15 Nov 2020 20:39:51 GMT',
   'x-amzn-requestid': 'd7c49596-229c-47a5-9882-b420ac159640',
   'content-length': '330',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}

We can also check the status of the transcription job with the following code, which calls `get_transcription_job()` in a loop and inspects the returned status. Notice the use of `time.sleep(10)`, a function from the `time` module we imported earlier, which pauses execution for the number of seconds given in the parentheses. By passing `10`, we check the status of our transcription job every 10 seconds. While the job is not yet complete, the loop prints a short message on each check. Once the job has completed or failed, the loop ends and information about the transcription job is displayed. Then, we can look in our designated S3 bucket for the transcript, which is stored as a JSON file along with additional data about it.

In [5]:
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'American_Easy_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/American+Easy+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/American_Easy_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 20, 39, 51, 823000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 20, 39, 51, 790000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 20, 40, 59, 673000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': '0f570a7a-bf4d-4cc9-a6fd-fe5da0af2e48', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-t
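
The polling loop above runs until the job reports `COMPLETED` or `FAILED`, so it would wait indefinitely if a job never reached either state. The sketch below wraps the same logic in a function with a time limit and returns the transcript location directly. The function name `wait_for_job` and its `timeout_seconds` parameter are assumptions made for illustration; the response keys (`TranscriptionJobStatus`, `Transcript`, `TranscriptFileUri`, `FailureReason`) are those returned by `get_transcription_job()`.

```python
import time


def wait_for_job(client, job_name, poll_seconds=10, timeout_seconds=600):
    """Poll until the job is COMPLETED or FAILED, or the timeout passes.

    `client` is expected to behave like boto3.client('transcribe').
    Returns a (status, transcript_uri) pair; the URI is None for failed jobs.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        state = job["TranscriptionJobStatus"]
        if state == "COMPLETED":
            return state, job["Transcript"]["TranscriptFileUri"]
        if state == "FAILED":
            # Failed jobs carry an explanation in FailureReason
            print("Job failed:", job.get("FailureReason", "unknown reason"))
            return state, None
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{job_name} still {state} after {timeout_seconds}s")
        print("Not ready yet...")
        time.sleep(poll_seconds)


# Usage with the client created earlier in this notebook:
# state, uri = wait_for_job(transcribe, job_name)
```

Returning the status and URI, rather than printing the whole response, makes it easier to record which jobs succeeded when many files are processed.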

Now we repeat the code above for each audio file we wish to transcribe. We will generate transcripts for five accents, each at three levels of passage difficulty, and store the resulting transcripts in the **aws-generated-transcripts** bucket.
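
The cells in this notebook spell out each job separately, but the job names and file URIs follow a regular pattern, so the fifteen pairs could also be generated in a loop. The sketch below is one way to do that; `build_jobs` and `FILE_OVERRIDES` are hypothetical names. The override is needed because the American Medium audio file has a `+2` suffix in its name.

```python
from itertools import product

BUCKET_URL = "https://audio-files-to-be-transcribed.s3.amazonaws.com"
ACCENTS = ["American", "British", "Chinese", "Hindi", "Spanish"]
LEVELS = ["Easy", "Medium", "Hard"]

# File names that do not follow "<Accent>+<Level>+Audio+Extracted.wav"
FILE_OVERRIDES = {
    ("American", "Medium"): "American+Medium+Audio+Extracted+2.wav",
}


def build_jobs():
    """Return a list of (job_name, job_uri) pairs, one per audio file."""
    jobs = []
    for accent, level in product(ACCENTS, LEVELS):
        file_name = FILE_OVERRIDES.get(
            (accent, level), f"{accent}+{level}+Audio+Extracted.wav"
        )
        jobs.append((f"{accent}_{level}_Transcript", f"{BUCKET_URL}/{file_name}"))
    return jobs


for job_name, job_uri in build_jobs():
    # Each pair could be passed to transcribe.start_transcription_job(...)
    # and then polled, exactly as in the first example above.
    print(job_name, job_uri)
```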

In [6]:
job_name = "American_Medium_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/American+Medium+Audio+Extracted+2.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'American_Medium_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 48000, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/American+Medium+Audio+Extracted+2.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/American_Medium_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 20, 42, 11, 503000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 20, 42, 11, 481000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 20, 43, 9, 561000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': '14693ef9-87c8-4ec2-976b-65628296254c', 'HTTPStatusCode': 200, 'HTTPHeaders': {'co

In [7]:
job_name = "American_Hard_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/American+Hard+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'American_Hard_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/American+Hard+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/American_Hard_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 20, 43, 24, 193000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 20, 43, 24, 170000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 20, 44, 25, 35000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': '56a2a065-775c-4f35-82c8-755efb8e481a', 'HTTPStatusCode': 200, 'HTTPHeade

In [11]:
job_name = "British_Easy_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/British+Easy+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'British_Easy_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/British+Easy+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/British_Easy_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 18, 35, 245000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 18, 35, 220000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 19, 41, 542000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': '0e8d5cbf-894c-4236-aa50-b4a9309cb7b1', 'HTTPStatusCode': 200, 'HTTPHeaders

In [9]:
job_name = "British_Medium_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/British+Medium+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'British_Medium_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/British+Medium+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/British_Medium_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 15, 35, 273000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 15, 35, 249000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 16, 37, 835000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': '18212e60-7514-43e8-9c27-c5348a58a5ca', 'HTTPStatusCode': 200, 'HTTPH

In [10]:
job_name = "British_Hard_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/British+Hard+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'British_Hard_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/British+Hard+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/British_Hard_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 16, 53, 321000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 16, 53, 297000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 17, 53, 69000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': '31a3bc81-118f-4904-af12-cfa6f065e6b2', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type'

In [12]:
job_name = "Chinese_Easy_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/Chinese+Easy+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'Chinese_Easy_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/Chinese+Easy+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/Chinese_Easy_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 21, 30, 98000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 21, 30, 78000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 22, 42, 215000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': 'bedc3776-d4a0-4700-80e0-6efff0e33c49', 'HTTPStatusCode': 20

In [13]:
job_name = "Chinese_Medium_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/Chinese+Medium+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'Chinese_Medium_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/Chinese+Medium+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/Chinese_Medium_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 25, 34, 545000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 25, 34, 513000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 26, 42, 813000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': '6ad9dff5-dde8-48d4-aa2b-b7dae725714a', 'HTTPStatusCode': 200, 'HTTPH

In [14]:
job_name = "Chinese_Hard_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/Chinese+Hard+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'Chinese_Hard_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/Chinese+Hard+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/Chinese_Hard_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 33, 1, 652000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 33, 1, 630000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 34, 3, 350000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': '66a761c7-8860-4365-9e2c-79450e6fd3e6', 'HTTPStatusCode': 200, 'HTTPHeaders': 

In [17]:
job_name = "Hindi_Easy_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/Hindi+Easy+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'Hindi_Easy_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/Hindi+Easy+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/Hindi_Easy_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 41, 33, 840000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 41, 33, 817000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 42, 41, 937000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': 'c74aafbd-4498-46f6-9ac2-9a093677c2c1', 'HTTPStatusCode': 200, 'HTTPHeaders': {'c

In [20]:
job_name = "Hindi_Medium_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/Hindi+Medium+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'Hindi_Medium_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/Hindi+Medium+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/Hindi_Medium_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 45, 23, 90000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 45, 23, 66000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 46, 25, 82000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': 'a6a29f7e-8beb-4280-8863-a8a90c13f115', 'HTTPStatusCode': 200, 'HTTPHeaders': 

In [21]:
job_name = "Hindi_Hard_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/Hindi+Hard+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'Hindi_Hard_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/Hindi+Hard+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/Hindi_Hard_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 49, 21, 55000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 49, 21, 21000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 50, 18, 402000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': '25dae86a-5451-4ffa-b24e-18829bf4331a', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'appl

In [22]:
job_name = "Spanish_Easy_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/Spanish+Easy+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'Spanish_Easy_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/Spanish+Easy+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/Spanish_Easy_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 51, 5, 709000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 51, 5, 683000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 52, 20, 15000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': 'c83113fa-4625-483a-9410-f31866f40b11', 'HTTPStatusCode': 200

In [23]:
job_name = "Spanish_Medium_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/Spanish+Medium+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'Spanish_Medium_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/Spanish+Medium+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/Spanish_Medium_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 54, 1, 454000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 54, 1, 430000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 55, 12, 912000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': '51b93a24-3ccb-4abf-afd6-604845e9a6a9', 'HTTPStatusCode': 200, 'HTTPHea

In [24]:
job_name = "Spanish_Hard_Transcript"
job_uri = "https://audio-files-to-be-transcribed.s3.amazonaws.com/Spanish+Hard+Audio+Extracted.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US',
    OutputBucketName='aws-generated-transcripts'
)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(10)
print(status)

Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'Spanish_Hard_Transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'wav', 'Media': {'MediaFileUri': 'https://audio-files-to-be-transcribed.s3.amazonaws.com/Spanish+Hard+Audio+Extracted.wav'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/aws-generated-transcripts/Spanish_Hard_Transcript.json'}, 'StartTime': datetime.datetime(2020, 11, 15, 21, 55, 31, 511000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 11, 15, 21, 55, 31, 486000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2020, 11, 15, 21, 56, 34, 820000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}}, 'ResponseMetadata': {'RequestId': 'a89b9ce8-0ad4-4a81-afaf-9f3a5da1bad0', 'HTTPStatusCode': 200, 'HTTPHeaders

## In Summary:
We have now produced fifteen transcripts using AWS Transcribe, and stored them in an S3 bucket. In the next notebook, we will retrieve the transcripts from our bucket, extract the information we are interested in, and analyze and visualize our data.