Title: Books on Tape - Author Classification from Dante and Shakespeare

Given a line from either Shakespeare or Dante, we will train a machine learning model to predict which of the two wrote it.

We will investigate three candidate models and compare their performance: Amazon’s BlazingText, Naive Bayes, and k-nearest neighbors (KNeighbors).

We have chosen two scenes from The Merchant of Venice by William Shakespeare and two cantos from the Longfellow translation of the Divine Comedy by Dante Alighieri. The audio recordings of these passages will be transcribed using Amazon Transcribe. The word-level transcription data will then be processed and cleaned, and words transcribed with low confidence will be evaluated for removal. Finally, the data will be explored and formatted for the models.
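The low-confidence check described above could be sketched as follows. This is a minimal sketch, assuming the transcript JSON follows the layout Amazon Transcribe documents (`results.items`, each with a `type` and a list of `alternatives` holding `content` and a string `confidence`). The function name `words_with_confidence`, the 0.8 threshold, and the sample dictionary are illustrative assumptions, not part of this notebook's results.

```python
def words_with_confidence(transcript, threshold=0.8):
    """Return (word, confidence, is_low) tuples for the spoken words in a transcript.

    `threshold` is a hypothetical cut-off; a sensible value would have to be
    chosen after looking at the real confidence distribution.
    """
    rows = []
    for item in transcript["results"]["items"]:
        # Punctuation items are not spoken words, so skip them.
        if item.get("type") != "pronunciation":
            continue
        best = item["alternatives"][0]
        confidence = float(best["confidence"])  # stored as a string in the JSON
        rows.append((best["content"], confidence, confidence < threshold))
    return rows


# Hand-made sample in the assumed shape (not real transcription output).
sample = {
    "results": {
        "items": [
            {"type": "pronunciation",
             "alternatives": [{"confidence": "0.62", "content": "Mislike"}]},
            {"type": "pronunciation",
             "alternatives": [{"confidence": "0.99", "content": "me"}]},
            {"type": "punctuation",
             "alternatives": [{"confidence": "0.0", "content": ","}]},
        ]
    }
}

rows = words_with_confidence(sample)
low_confidence_words = [word for word, _, is_low in rows if is_low]
print(low_confidence_words)
```

Flagging rather than dropping the low-confidence words keeps them available for inspection before any removal decision is made.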

The formatted data will be used to select features and train the models. A second set of data will be chosen and processed for validation. Each model’s performance will then be evaluated and compared.
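As an illustration of what "formatted" could mean for one of the candidates: BlazingText in supervised mode reads one example per line, with the label prefixed by `__label__` and followed by space-separated tokens. The helper below is a minimal sketch under that assumption. The name `to_blazingtext_line`, the label strings, and the lowercase whitespace tokenization are choices made for this example only.

```python
def to_blazingtext_line(label, text):
    """Format one line of text as a BlazingText supervised training example."""
    # Simple normalization: lowercase and split on whitespace.
    tokens = text.lower().split()
    return "__label__{} {}".format(label, " ".join(tokens))


# Two short example lines, one per author.
examples = [
    ("shakespeare", "Mislike me not for my complexion"),
    ("dante", "Thus I descended out of the first circle"),
]

training_lines = [to_blazingtext_line(label, text) for label, text in examples]
for line in training_lines:
    print(line)
```

The same helper could be applied to the validation set so that both files share one format.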

Questions and ideas we are currently considering:
1. How well can Amazon Transcribe handle Shakespeare’s distinctive language? Will the low-confidence transcriptions turn out to contain the words most useful for telling the authors apart?
2. Shakespeare wrote largely in iambic pentameter. Dante wrote in terza rima, and English translations often use iambic pentameter. Is meter a revealing feature for author classification? Can it be used as a feature?
3. Features of interest: words favored by one author over the other; uniqueness as a feature; clusters of words; semantics as a feature (Dante will have a lot of “fiery, hell, burning”).
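One way to start exploring the "words favored by one author" idea is to count the words that appear in one author's lines and never in the other's. The sketch below uses only the standard library. The function name `distinctive_words` and the toy lines are invented for illustration and are not taken from the transcripts.

```python
from collections import Counter


def distinctive_words(lines_a, lines_b, top_n=5):
    """Return the most frequent words found in lines_a that never occur in lines_b."""
    counts_a = Counter(w for line in lines_a for w in line.lower().split())
    vocab_b = {w for line in lines_b for w in line.lower().split()}
    only_a = Counter({w: c for w, c in counts_a.items() if w not in vocab_b})
    return only_a.most_common(top_n)


# Invented toy lines, only to show the shape of the input and output.
toy_dante = ["the fiery hell", "burning fiery air"]
toy_shakespeare = ["the merchant of venice", "the air of venice"]

top_dante = distinctive_words(toy_dante, toy_shakespeare, top_n=1)
print(top_dante)
```

On real transcripts, a stricter tokenizer and a minimum count would probably be needed, since rare transcription errors would otherwise look like unique vocabulary.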

In [None]:
from __future__ import print_function  # __future__ imports must come first in the cell
import json
import time

import boto3
import sagemaker
from sagemaker import get_execution_role

In [43]:
# transcribe chapters function
def transcribe_chapters(job_name, job_uri):
    """Transcribe one MP3 stored in S3 and write the JSON transcript to `bucket`."""
    transcribe = boto3.client('transcribe')
    output_bucket = bucket  # global set in the session-variables cell
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': job_uri},
        MediaFormat='mp3',
        LanguageCode='en-US',
        OutputBucketName=output_bucket
    )
    # poll every 5 seconds until the job finishes, successfully or not
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        print("Not ready yet...")
        time.sleep(5)
    print(status)
    # list the jobs, then delete this one so the job name can be reused on a rerun
    !aws transcribe list-transcription-jobs
    !aws transcribe delete-transcription-job --transcription-job-name $job_name
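
The polling loop above runs until the job reports COMPLETED or FAILED, with no upper bound on the wait. A bounded variant could look like the sketch below. The helper `wait_until_done`, its parameters, and the 600-second default are hypothetical, and the status source is passed in as a function so the example runs without AWS access.

```python
import time


def wait_until_done(get_status, interval=5, timeout=600, sleep=time.sleep):
    """Poll get_status() until it returns 'COMPLETED' or 'FAILED', or time out."""
    waited = 0
    while True:
        status = get_status()
        if status in ("COMPLETED", "FAILED"):
            return status
        if waited >= timeout:
            raise TimeoutError(
                "job still {} after {} seconds".format(status, waited))
        sleep(interval)
        waited += interval


# Demo with a fake status sequence instead of a real transcription job.
fake_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
final_status = wait_until_done(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)

# With boto3, get_status could be a small wrapper such as:
# lambda: transcribe.get_transcription_job(TranscriptionJobName=job_name)[
#     'TranscriptionJob']['TranscriptionJobStatus']
```

Returning the final status lets the caller decide what to do with a FAILED job instead of treating it the same as a completed one.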

In [42]:
#set session variables
sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

bucket = 'crazycurlygirlbucket311' # Replace with your own bucket name if needed
print(bucket)
prefix = 'BookProphet' #Replace with the prefix under which you want to store the data if needed

arn:aws:iam::023375022819:role/service-role/AmazonSageMaker-ExecutionRole-20181029T121824
crazycurlygirlbucket311


In [37]:
# get the files
!wget https://etc.usf.edu/lit2go/audio/mp3/the-merchant-of-venice-005-merchant-of-venice-act-2-scene-1.589.mp3
!wget https://etc.usf.edu/lit2go/audio/mp3/the-merchant-of-venice-014-merchant-of-venice-act-3-scene-1.600.mp3
!wget http://www.archive.org/download/divine_comedy_librivox/divinecomedy_longfellow_05_dante.mp3
!wget http://www.archive.org/download/divine_comedy_librivox/divinecomedy_longfellow_10_dante.mp3
     

--2019-03-13 17:54:56--  https://etc.usf.edu/lit2go/audio/mp3/the-merchant-of-venice-005-merchant-of-venice-act-2-scene-1.589.mp3
Resolving etc.usf.edu (etc.usf.edu)... 131.247.120.45
Connecting to etc.usf.edu (etc.usf.edu)|131.247.120.45|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2744761 (2.6M) [audio/mpeg]
Saving to: ‘the-merchant-of-venice-005-merchant-of-venice-act-2-scene-1.589.mp3’


2019-03-13 17:54:57 (6.55 MB/s) - ‘the-merchant-of-venice-005-merchant-of-venice-act-2-scene-1.589.mp3’ saved [2744761/2744761]

--2019-03-13 17:54:57--  https://etc.usf.edu/lit2go/audio/mp3/the-merchant-of-venice-014-merchant-of-venice-act-3-scene-1.600.mp3
Resolving etc.usf.edu (etc.usf.edu)... 131.247.120.45
Connecting to etc.usf.edu (etc.usf.edu)|131.247.120.45|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6927279 (6.6M) [audio/mpeg]
Saving to: ‘the-merchant-of-venice-014-merchant-of-venice-act-3-scene-1.600.mp3’


2019-03-13 17:54:58 (11

In [38]:
# save MP3 files to S3
MP3Location = prefix + '/MP3Files'

sess.upload_data(path='the-merchant-of-venice-005-merchant-of-venice-act-2-scene-1.589.mp3', bucket=bucket, key_prefix=MP3Location)
sess.upload_data(path='the-merchant-of-venice-014-merchant-of-venice-act-3-scene-1.600.mp3', bucket=bucket, key_prefix=MP3Location)
sess.upload_data(path='divinecomedy_longfellow_05_dante.mp3', bucket=bucket, key_prefix=MP3Location)
sess.upload_data(path='divinecomedy_longfellow_10_dante.mp3', bucket=bucket, key_prefix=MP3Location)



's3://crazycurlygirlbucket311/BookProphet/MP3Files/divinecomedy_longfellow_10_dante.mp3'

In [44]:
# create dictionary mapping transcription job names to S3 media URIs
# (built from bucket and MP3Location so it follows a change of bucket or prefix)
mp3_base = 's3://{}/{}'.format(bucket, MP3Location)
chapters = {
    "merchant1": mp3_base + "/the-merchant-of-venice-005-merchant-of-venice-act-2-scene-1.589.mp3",
    "merchant2": mp3_base + "/the-merchant-of-venice-014-merchant-of-venice-act-3-scene-1.600.mp3",
    "divine1": mp3_base + "/divinecomedy_longfellow_05_dante.mp3",
    "divine2": mp3_base + "/divinecomedy_longfellow_10_dante.mp3"
}

In [45]:
# transcribe each chapter; the dictionary key doubles as the job name
for job_name, uri in chapters.items():
    transcribe_chapters(job_name, uri)

Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName': 'merchant1', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'mp3', 'Media': {'MediaFileUri': 's3://crazycurlygirlbucket311/BookProphet/MP3Files/the-merchant-of-venice-005-merchant-of-venice-act-2-scene-1.589.mp3'}, 'Transcript': {'TranscriptFileUri': 'https://s3.amazonaws.com/crazycurlygirlbucket311/merchant1.json'}, 'CreationTime': datetime.datetime(2019, 3, 13, 18, 0, 35, 522000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2019, 3, 13, 18, 2, 37, 114000, tzinfo=tzlo