---
# Notebook setup

Update the following values:
* `region` - Make sure you select a region where all the AWS AI services are available.
* `bucket_name` - Make sure you enter an existing bucket in the same region as above.
* `bucket_prefix` - Enter a prefix if needed, else leave it blank.

In [80]:
region = 'eu-west-1' # Update it to your region of choice
bucket_name = 'aws-mikasino-sagemaker' # Update it to the S3 bucket name
bucket_prefix = '' # Make sure you include trailing slash (/) if you are adding a prefix

Load the necessary libraries and upload the video file to S3 for later use.

In [81]:
import boto3
import IPython
import time
import json
import urllib
import urllib.request
import random
from datetime import datetime
#from prettytable import PrettyTable
import logging

logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
polly = boto3.client('polly', region_name=region)
s3 = boto3.client('s3', region_name=region)

s3.upload_file('./jeff.mp4', bucket_name, bucket_prefix + 'jeff.mp4')

---
# Demo #1: Amazon Polly

We're making an API call to Amazon Polly service here to convert the text to speech. Text to be converted is in variable `Text` as an SSML string. It will be converted to `mp3` format and stored in a local file.  
The voice is controlled by setting the `VoiceId` to `Salli`.

For the full list of voices to be used with Polly check [the documentation page](https://docs.aws.amazon.com/en_pv/polly/latest/dg/voicelist.html).

In [84]:
response = polly.synthesize_speech(
  Text="<speak><amazon:auto-breaths frequency='low' volume='soft' duration='x-short'>Amazon Polly is a Text-to-Speech service, \
  that uses advanced deep learning technologies to synthesize speech that sounds like a human. With dozens of lifelike voices, variety of languages, \
  you can select the ideal voice and build speech-enabled applications that work in many different countries.</amazon:auto-breaths></speak>",
  TextType="ssml",
  OutputFormat="mp3",                                           
  VoiceId="Salli")
    
outfile = "polly-salli-intro.mp3"
data = response['AudioStream'].read()

with open(outfile,'wb') as f:
  f.write(data)

print('Converted text to voice and stored it locally as %s' % (outfile))

Converted text to voice and stored it locally as polly-salli-intro.mp3


<audio width="360" height="270" controls src="polly-salli-intro.mp3" />

Amazon Polly supports standard SSML tags such as prosody, which enables you to control the volume, rate, and pitch of the speech out.  
In the following example, we demonstrate how you can use manual `<amazon:breath>` and `<prosody>` tags together to convey emotional or dramatic tone in speech.  
Let's try with a scared voice of Matthew.

In [85]:
response = polly.synthesize_speech(
  Text="<speak><amazon:breath duration='medium' volume='x-loud'/><prosody rate='115%'> <prosody volume='x-loud'> Salli? <break time='300ms'/> \
  </prosody> Is that you?</prosody></speak>",
  TextType="ssml",
  OutputFormat="mp3",                                           
  VoiceId="Matthew")
    
outfile = "polly-matthew-scared.mp3"
data = response['AudioStream'].read()

with open(outfile,'wb') as f:
  f.write(data)

print('Converted text to voice and stored it locally as %s' % (outfile))

Converted text to voice and stored it locally as polly-matthew-scared.mp3


<audio width="360" height="270" controls src="polly-matthew-scared.mp3" />

This example uses an uncertain voice of Matthew.

In [96]:
response = polly.synthesize_speech(
  Text="<speak> <prosody rate='60%'> I am not sure <amazon:breath duration='x-long' volume='soft'/> <break time='150ms'/> </prosody> <prosody rate='90%'>I think I need to think about it. </prosody> </speak>",
  TextType="ssml",
  OutputFormat="mp3",                                           
  VoiceId="Matthew")
    
outfile = "polly-matthew-uncertain.mp3"
data = response['AudioStream'].read()

with open(outfile,'wb') as f:
  f.write(data)

print('Converted text to voice and stored it locally as %s' % (outfile))

Converted text to voice and stored it locally as polly-matthew-uncertain.mp3


<audio width="360" height="270" controls src="polly-matthew-uncertain.mp3" />

The last example is a breathless voice of Salli.  
By incorporating breath sounds into speech output from text, Polly is able to provide more naturally sounding speech, particularly for long-form text narration.  
Visit the [Amazon Polly documentation](http://docs.aws.amazon.com/polly/latest/dg/supported-ssml.html) for more information on SSML tags.

In [105]:
response = polly.synthesize_speech(
  Text="<speak> <amazon:breath duration='long' volume='x-loud'/><prosody rate='120%'> <prosody volume='loud'> Wow! <amazon:breath duration='long' volume='loud'/> \
  </prosody> That was quite fast <amazon:breath duration='medium' volume='x-loud'/> I almost beat my personal best time on this track. </prosody> </speak>",
  TextType="ssml",
  OutputFormat="mp3",                                           
  VoiceId="Salli")
    
outfile = "polly-salli-breathless.mp3"
data = response['AudioStream'].read()

with open(outfile,'wb') as f:
  f.write(data)

print('Converted text to voice and stored it locally as %s' % (outfile))

Converted text to voice and stored it locally as polly-salli-breathless.mp3


<audio width="360" height="270" controls src="polly-salli-breathless.mp3" />

---
## Custom Lexicon for Polly

Pronunciation lexicons enable you to customize the pronunciation of words. They give you additional control over how Polly pronounces words uncommon to the selected language.  

Examples of lexicon usage can be:
* If your text includes an acronym, such as W3C. Use a lexicon to define an alias for this so that it is read in the full, expanded form - World Wide Web Consortium.
* Common words are sometimes stylized with numbers taking the place of letters, as with "g3t sm4rt" (get smart). Humans can read these words correctly. However, a Text-to-Speech (TTS) engine reads the text literally, pronouncing the name exactly as it is spelled. Use a lexicon to customize the synthesized speech for proper pronunciation - get smart.

For additional details about Polly Lexicons, refer to the [Managing Lexicons page](https://docs.aws.amazon.com/polly/latest/dg/managing-lexicons.html).

Let's use a custom lexicon here to properly convert internet slangs to speech.

In [110]:
testlex = '''<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" 
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon 
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="ipa" 
      xml:lang="en-US">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>'''

internetslanglexicon = '''<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" 
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon 
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>La vita &#x00E8; bella</grapheme>
    <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Roberto</grapheme>
    <phoneme>ɹəˈbɛːɹɾoʊ</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Benigni</grapheme>
    <phoneme>bɛˈniːnji</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>IHAC</grapheme>
    <alias>I have a customer</alias>
  </lexeme>
  <lexeme>
    <grapheme>2day</grapheme>
    <alias>Today</alias>
  </lexeme>
  <lexeme>
    <grapheme>2moro</grapheme>
    <alias>Tomorrow</alias>
  </lexeme>
  <lexeme>
    <grapheme>2nite</grapheme>
    <alias>Tonite</alias>
  </lexeme>
  <lexeme>
    <grapheme>ASAP</grapheme>
    <alias>As soon as possible</alias>
  </lexeme>
  <lexeme>
    <grapheme>IIRC</grapheme>
    <alias>If I remember correctly</alias>
  </lexeme>
  <lexeme>
    <grapheme>POV</grapheme>
    <alias>Point of View</alias>
  </lexeme>
  <lexeme>
    <grapheme>TTYL</grapheme>
    <alias>Talk to you later</alias>
  </lexeme>
  <lexeme>
    <grapheme>THX</grapheme>
    <alias>Thanks</alias>
  </lexeme>
  <lexeme>
    <grapheme>YW</grapheme>
    <alias>You are Welcome</alias>
  </lexeme>
</lexicon>'''
# lexicon_data = lexicon_file.read()
# response = polly.put_lexicon(Name=arguments.name, Content=lexicon_data)
        
lexicon = polly.put_lexicon(
    Name = 'customlexicon',
    Content = internetslanglexicon
)

---
Use Polly to synthesize speech without using custom lexicon and with custom lexicon.

In [111]:
text_to_convert='''IHAC looking for way to convert text based chat conversations to speech.
IIRC, that is possible through custom lexicon in Polly. I want to know your POV on this.
I have to respond back to the customer 2nite, hence can you please let me know ASAP, THX.'''

no_lex_res = polly.synthesize_speech(
  Engine='neural',
  Text='<speak>' + text_to_convert +'</speak>',
  TextType="ssml",
  OutputFormat="mp3",                                           
  VoiceId="Joanna")
     
outfile_nolex = "polly-joanna-neural_nolexicon.mp3"
data_nolex = no_lex_res['AudioStream'].read()

with open(outfile_nolex,'wb') as f:
  f.write(data_nolex)

print('Converted text to voice and stored it locally as %s' % (outfile_nolex))

lex_res = polly.synthesize_speech(
  Engine='neural',
  Text='<speak>' + text_to_convert +'</speak>',
  LexiconNames=['customlexicon'],
  TextType="ssml",
  OutputFormat="mp3",                                           
  VoiceId="Joanna")
     
outfile_lex = "polly-joanna-neural_lexicon.mp3"
data_lex = lex_res['AudioStream'].read()

with open(outfile_lex,'wb') as f:
  f.write(data_lex)

print('Converted text to voice and stored it locally as %s' % (outfile_lex))

Converted text to voice and stored it locally as polly-joanna-neural_nolexicon.mp3
Converted text to voice and stored it locally as polly-joanna-neural_lexicon.mp3


---
Listen to the synthesized speech without lexicon and with lexicon to hear the difference.

Text:  
IHAC looking for way to convert text based chat conversations to speech.  
IIRC, that is possible through custom lexicon in Polly. I want to know your POV on this.  
I have to respond back to the customer 2nite, hence can you please let me know ASAP, THX.

### Without Lexicon

<audio width="360" height="270" controls src="polly-joanna-neural_nolexicon.mp3" />

### With Lexicon

<audio width="360" height="270" controls src="polly-joanna-neural_lexicon.mp3" />

---
# Demo #1: Amazon Transcribe

Using Amazon Transcribe we are going to generate the text from the video file. Transcribe will provide a signed S3 URL which will contain the transcribed text in JSON forma. Output of the transcribe will contain the speaker identification labels, timestamp when a particular word was heard, etc.

Click the below arrow to expand the video.

*Before playing the video, start the transcribe job by executing the next cell since it will take few seconds to complete the transcribe job.*

---

<details>
  <summary>Video to be transcribed</summary>
  <video width="640" height="480" controls src="./jeff.mp4" />
</details>

In [112]:
# Converting mp4 to text
transcribe = boto3.client('transcribe')
timestamp = datetime.now().strftime('%Y-%m-%d-%H%M%S')
job_name = "TranscribeDemo-" + timestamp
job_uri = 'https://s3-{}.amazonaws.com/{}/{}jeff.mp4'.format(region, bucket_name, bucket_prefix)
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='mp4',
    LanguageCode='en-US',
    MediaSampleRateHertz=44100,
    Settings={'MaxSpeakerLabels': 2,'ShowSpeakerLabels': True }    
)
print('Transcribing the video is in progress ', end='')
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        print('')
        print('Transcribing the video job completed with status %s\n' % status['TranscriptionJob']['TranscriptionJobStatus'])
        break
    print('.', end='')
    time.sleep(5)
# pprint(status)
url = status['TranscriptionJob']['Transcript']['TranscriptFileUri']
print('Download the text output of transcribe job from the following URL:\n%s ' % url)
transcript='transcript_{}.json'.format(job_name)
urllib.request.urlretrieve(url,transcript)

Transcribing the video is in progress .................
Transcribing the video job completed with status COMPLETED

Download the text output of transcribe job from the following URL:
https://s3.eu-west-1.amazonaws.com/aws-transcribe-eu-west-1-prod/892616959688/TranscribeDemo-2019-10-30-232724/0471a2ba-b888-447f-b4d5-339242572f39/asrOutput.json?X-Amz-Security-Token=AgoJb3JpZ2luX2VjEL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCWV1LXdlc3QtMSJHMEUCIQDK3uvTQ2i9DKFZ1Ru%2FVt%2FvxlNX10Ql11sMLwDv4id9%2FQIgUm8TPWYh4BvIytu5mxqmgEOpaPpJqmMIRNRmMMKMm%2BoqhQQIyP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARABGgw1ODcwMTc2NjM0MTciDBqLLaSH2skyJ%2BsrEyrZA5hkBOY%2Bho8GWhqj9%2Fcm0vGTTD7hXkUD8pIxPK9VPvOMgwC9mV7cny8dIL8LkIfRJjyFtyqHhl1rbSLXOuYCNXa1i58LKp7UwnfBxMPvT8vMt%2FA64xKRnYCW6S7LtOOAHAwyKlrb4J4Ciy%2FHF%2Bkd3Ob6Qp1ltTfMaLdRAHQxFBLvyqG74BYsPLzpiwZrA177hbi1UpY23SvmWm6gJvCY66QzBtbGXIxiU%2F4yOcNXFyCZ2gbtL4NuIcHkKRN79whtGmuad2JW%2BzngblzZEDbFI%2FXnZS%2FvIq0b13o13d3aIz35FKL3dQY1TbCUpodCWbXiRZFlN1qe8zyQgSbN24kLHDaMzXDCvrsNUf1iRjh

('transcript_TranscribeDemo-2019-10-30-232724.json',
 <http.client.HTTPMessage at 0x7f5bdc11f518>)

---
Read the output of the transcribe job and print the text generated from the video.

In [70]:
result = json.load(open(transcript))
transcript_text = result['results']['transcripts'][0]['transcript']
print(transcript_text)

I guess my first question is, Is this Is this the underpinnings of tech over the next 10 years as we seem to be emerging from the period of frantic growth and development in smartphones? Well, I think it's I think it's gigantic. Um, I do, I think, natural language understanding, I think machine learning in general, Artificial intelligence. Uh, this. It's hard to overstate how big of an impact it's gonna have on society over the next 20 years, so


---

Now we are going to label the text based on the speaker labels to display the content specifically spoken by a speaker.

We also use Amazon Comprehend to identify the sentiment of the text for both he speakers.

In [72]:
'''
# read the json output from disk (debugging)
with open('asrOutput.json') as f:
    data = json.load(f)
'''

data = json.load(open(transcript))

# create a list to store start and stop times of the speaker in seconds
spk_0 = []
spk_1 = []

# iterate over the speaker segments from the json
for x in data['results']['speaker_labels']['segments']:
        # check for which speaker a label was submitted
        if x['speaker_label'] == 'spk_0':
                # we need to convert float to int by multiplying *100, else we cannot use it later on to compare ranges
                start         = int(float(x['start_time']) * 100)
                end         = int(float(x['end_time']) * 100)

                # append the start and stop times to a list
                spk_0.append([start, end])

        # check for which speaker a label was submitted
        if x['speaker_label'] == 'spk_1':
                # we need to convert float to int by multiplying *100, else we cannot use it later on to compare ranges
                start         = int(float(x['start_time']) * 100)
                end         = int(float(x['end_time']) * 100)

                # append the start and stop times to a list
                spk_1.append([start, end])

res = []
speaker0                 = []
speaker1                 = []
curr_speaker         = ''


for x in data['results']['items']:
        txt         = x['alternatives'][0]['content']
        # check if the item has a start_time - if not, its probably punctuation which doesn't come with a timestamp. 
        if 'start_time' in x:
                start         = int(float(x['start_time']) * 100)
                end         = int(float(x['end_time']) * 100)
                for y in spk_0:                        
                        if start in range(y[0], y[1]) and end in range(y[0], y[1]):
                                curr_speaker = 'spk_0'
                for y in spk_1:                        
                        if start in range(y[0], y[1]) and end in range(y[0], y[1]):
                                curr_speaker = 'spk_1'
        if curr_speaker == 'spk_0':
                if x['type'] == 'punctuation' and txt != ',':
                        speaker0.append(txt+'\n')
                elif txt == ',' or txt[0].isupper():
                        speaker0.append(txt)
                else:
                        speaker0.append(' '+txt)
        if curr_speaker == 'spk_1':
                if x['type'] == 'punctuation' and txt != ',':
                        speaker1.append(txt+'\n')
                elif txt == ',' or txt[0].isupper():
                        speaker1.append(txt)
                else:
                        speaker1.append(' '+txt)

# check sentiment of both speakers
def check_sentiment(x, y):
        c = boto3.client(service_name='comprehend', region_name='eu-west-1')
        s = y+','
        x = c.detect_sentiment(Text=x, LanguageCode='en')
        y =  ' Mixed : '+str(x['SentimentScore']['Mixed'])
        y += '\t Positive :'+str(x['SentimentScore']['Positive'])
        y += '\t Negative : '+str(x['SentimentScore']['Negative'])
        y += '\t Neutral : '+str(x['SentimentScore']['Neutral'])
        y += '\t Sentiment : '+str(x['Sentiment'])
        return y


# print full text for both speakers
print('Speaker1:')
print(''.join(speaker1))
print('Sentiment of speaker 1 : '+check_sentiment(''.join(speaker1), 'speaker 1 '))
print('\n')
print('Speaker2:')
print(''.join(speaker0))
print('\n')
print('Sentiment of speaker 2 : '+check_sentiment(''.join(speaker0), 'speaker 0 '))


Speaker1:
I guess my first question is,Is thisIs this the underpinnings of tech over the next 10 years as we seem to be emerging from the period of frantic growth and development in smartphones?
Well,I think it's
Sentiment of speaker 1 :  Mixed : 2.5586166884750128e-05	 Positive :0.021759752184152603	 Negative : 0.13760659098625183	 Neutral : 0.8406080603599548	 Sentiment : NEUTRAL


Speaker2:
I think it's gigantic.
Um,I do,I think, natural language understanding,I think machine learning in general,Artificial intelligence.
Uh, this.
It's hard to overstate how big of an impact it's gonna have on society over the next 20 years, so


Sentiment of speaker 2 :  Mixed : 1.1363149496901315e-05	 Positive :0.4650130271911621	 Negative : 0.0057877665385603905	 Neutral : 0.529187798500061	 Sentiment : NEUTRAL


---
## Amazon Translate Demo

Convert the transribed text to German language using Amazon Translate.

In [76]:
# -*- coding: utf-8 -*-
translate = boto3.client('translate', region_name=region)

message = transcript_text

result=translate.translate_text(
    Text=message,
    SourceLanguageCode='en',
    TargetLanguageCode='de'
)

# print(json.dumps(result, sort_keys=True, indent=4, default=str))
print(result['TranslatedText'])

Ich denke, meine erste Frage ist, ist das die Grundlagen der Technologie in den nächsten 10 Jahren, da wir aus der Zeit des hektischen Wachstums und der Entwicklung in Smartphones entstehen scheinen? Nun, ich glaube, ich denke, es ist gigantisch. Ähm, ich denke, natürliches Sprachverständnis, ich denke, maschinelles Lernen im Allgemeinen, künstliche Intelligenz. Äh, das hier. Es ist schwer zu übertreiben, wie groß es für die Gesellschaft in den nächsten 20 Jahren sein wird, also


---
## Amazon Comprehend Demo

Now using Amazon Comprehend detect the language from the above text. By providing the detected language as input detect the sentiment, entities and key phrases in the text.

In [77]:
comprehend = boto3.client('comprehend')

text = result['TranslatedText']

language_detected = comprehend.detect_dominant_language(Text=text)['Languages'][0]['LanguageCode']

entity_res = comprehend.detect_entities(Text=text, LanguageCode=language_detected)
senti_res = comprehend.detect_sentiment(Text=text, LanguageCode=language_detected)
key_res = comprehend.detect_key_phrases(Text=text, LanguageCode=language_detected)

In [78]:
print('Language detected is %s \n' % language_detected)

print('Sentiment of the text has been identified as %s with the score of %s \n' % (senti_res['Sentiment'], senti_res['SentimentScore'][senti_res['Sentiment'].title()]))

keyphrases = [[], [], []]
for k in key_res['KeyPhrases']:
    if k['Score'] > .99:
        keyphrases[0].append(k['Text'] + '\n')
    elif k['Score'] > .98:
        keyphrases[1].append(k['Text'] + '\n')
    elif k['Score'] > .97:
        keyphrases[2].append(k['Text'] + '\n')
           
print('Key Phrases identified from the text:')

keytable = PrettyTable(['Score', 'Key Phrases'])
if keyphrases[0]:
    keytable.add_row(['.99', ''.join(keyphrases[0])])
    keytable.add_row(['--', '--------------------'])
if keyphrases[1]:
    keytable.add_row(['.98', ''.join(keyphrases[1])])
    keytable.add_row(['--', '--------------------'])
if keyphrases[2]:
    keytable.add_row(['.97', ''.join(keyphrases[2])])
print(keytable)
print('\n')

entity_thershold = 0.80

topentity = {}
for e in entity_res['Entities']:
    if e['Score'] > entity_thershold:
        topentity[e['Score']] = {e['Text']: e['Type']}
        
top10 = sorted(topentity, reverse=True)[:10]

table = PrettyTable(['Text', 'Type','Score'])
for t in top10:
    table.add_row([list(topentity[t].keys())[0], list(topentity[t].values())[0], t])
    
print('Top 10 entities identified:')    
print(table)

Language detected is de 

Sentiment of the text has been identified as NEUTRAL with the score of 0.9595580697059631 

Key Phrases identified from the text:


NameError: name 'PrettyTable' is not defined

---
## Real-time Audio Transcription using Amazon Transcribe Websockets

Earlier we have seen how to transcribe an existing video file stored in S3. Now let's look at an example how we can to real-time audio transcripton using the Amazon Traanscribe Websockets API.

*You have to update the text `[AMPLIFY_URL]` with the actual Amplify Console URL created as part of the prerequisites. You also need to key in the access key and secret key of the user that you created as part of the prerequisites.*

---
## Amazon Rekognition

Get the DemoWebsite URL which will be available in the Outputs section of the Media Analysis Solution CloudFormation stack. In the next cell replace the text [DemoWebsite_URL] with the URL that you copied.

In [79]:
%%HTML
<h3>Real-time Audio Transcription</h3>
<br>
<object type="text/html" data=[AMPLIFY_URL] width="1000" height="600"> <embed src="[AMPLIFY_URL]"></embed></object>