### NOTE!

Be sure to read the README in this GitHub repo and complete the prerequisite steps to prepare your AWS account for use by this script.

**Because AWS calls use real resources, every call that actually reaches AWS is initially commented out. Uncomment and run those cells only once you understand what they do!**


In [59]:
import os
import logging  # used by the upload cell to report ClientError
import boto3
from botocore.exceptions import ClientError
import tscribe
import json
import tarfile
from pathlib import Path

'''User Entry:

Please insert your AWS access key id and AWS secret access key.

NOTE: Do NOT push AWS secret access key to Git EVER!
'''
# AWS Access Information
AWS_ACCESS_KEY_ID = "Enter in AWS Access Key ID"
AWS_SECRET_ACCESS_KEY = "Enter in AWS Secret Access Key"

# AWS Bucket Information
S3_INTERVIEW_BUCKET_NAME = ""  # Name of s3 bucket where your interview files are
S3_TRANSCRIPT_BUCKET_NAME = "" # Name of s3 bucket where your transcript files will be put
S3_MODEL_OUTPUT_BUCKET_NAME = ""  # Name of s3 bucket where model output will be put

# AWS ARN & User Information
# ARN code corresponding to custom classifier in AWS (https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html)
S3_CUSTOM_CLASSIFIER_MODEL = ''
# ARN code corresponding to AWS role that has correct permissions to run classification & entity recognition (https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) 
S3_MODEL_RUNNING_ROLE = ''
'''
User Entry Area completed.
'''

# Create needed clients
session = boto3.Session(aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
s3 = session.resource('s3')
s3_client = session.client('s3')  # reuse the session so the keys entered above are applied
transcribe = boto3.client('transcribe', aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY, 
                          region_name="us-east-1")
comprehend_client = boto3.client('comprehend', aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY, 
                          region_name="us-east-1")
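
Hard-coding keys in the notebook makes it easy to commit them by accident. A minimal sketch of one alternative, assuming the keys are exported as the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. The helper name `load_aws_credentials` is hypothetical and not part of boto3:

```python
import os


def load_aws_credentials(env=None):
    """Return the two AWS keys from environment variables, failing loudly if unset."""
    env = os.environ if env is None else env
    names = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Keys are named after the keyword arguments used in the cell above.
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

The returned dictionary can be unpacked into the constructors, for example `boto3.Session(**load_aws_credentials())`.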

Let's ensure that all three buckets exist and that you can access them.

In [58]:
# Ensure all buckets exist in S3 and are accessible
for bucket_name in (S3_INTERVIEW_BUCKET_NAME, S3_TRANSCRIPT_BUCKET_NAME, S3_MODEL_OUTPUT_BUCKET_NAME):
    try:
        s3.meta.client.head_bucket(Bucket=bucket_name)
    except ClientError:
        # The bucket does not exist or you have no access.
        print(f"ClientError when trying to retrieve: {bucket_name}. The bucket does not exist or you have no access.\n")

ParamValidationError: Parameter validation failed:
Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$"
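
The `ParamValidationError` above is raised before any request is sent, because the bucket names are still empty strings. As a rough sketch, the same check can be done up front with the regex quoted in the error message. The helper name `unset_bucket_names` is hypothetical:

```python
import re

# Pattern copied from the ParamValidationError message above.
BUCKET_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9.\-_]{1,255}$")


def unset_bucket_names(named_buckets):
    """Return the labels of bucket settings whose value fails the pattern."""
    return [
        label
        for label, value in named_buckets.items()
        if not BUCKET_NAME_PATTERN.match(value or "")
    ]
```

Calling it with something like `{"S3_INTERVIEW_BUCKET_NAME": S3_INTERVIEW_BUCKET_NAME, ...}` lists the settings that still need to be filled in before running the bucket check.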

# Part 1 - Creating Interview JSON Transcriptions

Let's create the transcription job(s) for your audio/video interview files. The following cell gathers all of the interview audio files, then creates transcription jobs on AWS that turn your audio into JSON-formatted transcriptions!

### WARNING!
The following cell instructs AWS to create transcriptions of your audio files, so real requests are made to AWS! Because this consumes real resources, be careful when running this cell!

In [42]:
# Collect all interview audio files
media_dict = {}
interview_bucket = s3.Bucket(S3_INTERVIEW_BUCKET_NAME)
for my_bucket_object in interview_bucket.objects.all():
    media_dict[my_bucket_object.key] = f"https://{S3_INTERVIEW_BUCKET_NAME}.s3.amazonaws.com/{my_bucket_object.key}"

# Loop through all interviews and transcribe them
for file, uri in media_dict.items():
    job_uri = uri
    job_name = (file.split('.')[0]).replace(" ", "")
    
    print(f"Transcribing file: {file} with uri: {job_uri} under job name: {job_name}")
#     transcribe.start_transcription_job(
#         TranscriptionJobName=job_name,
#         Media={'MediaFileUri': job_uri},
#         MediaFormat=file.split('.')[1],
#         LanguageCode='en-US',
#         Settings={'ShowSpeakerLabels': True,'MaxSpeakerLabels': 2},
#         OutputBucketName=S3_TRANSCRIPT_BUCKET_NAME
#     )


Transcribing file: qual1_par1.mp3 with uri: https://infoorginterviewbucket.s3.amazonaws.com/qual1_par1.mp3 under job name: qual1_par1
Transcribing file: qual1_par2 pt 1.mp3 with uri: https://infoorginterviewbucket.s3.amazonaws.com/qual1_par2 pt 1.mp3 under job name: qual1_par2pt1
Transcribing file: qual1_par2 pt 2.mp3 with uri: https://infoorginterviewbucket.s3.amazonaws.com/qual1_par2 pt 2.mp3 under job name: qual1_par2pt2
Transcribing file: qual1_par3.mp3 with uri: https://infoorginterviewbucket.s3.amazonaws.com/qual1_par3.mp3 under job name: qual1_par3
Transcribing file: qual1_par4 pt 1.mp3 with uri: https://infoorginterviewbucket.s3.amazonaws.com/qual1_par4 pt 1.mp3 under job name: qual1_par4pt1
Transcribing file: qual1_par5.mp3 with uri: https://infoorginterviewbucket.s3.amazonaws.com/qual1_par5.mp3 under job name: qual1_par5
Transcribing file: qual2_UC Berkeley 2.m4a with uri: https://infoorginterviewbucket.s3.amazonaws.com/qual2_UC Berkeley 2.m4a under job name: qual2_UCBerkeley
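
The cell above uses `file.split('.')[0]` and `file.split('.')[1]`, which assumes each key contains exactly one dot. A sketch of a more defensive split follows. The helper name `job_name_and_format` is hypothetical, and the allowed character set for job names is an assumption to check against the Transcribe documentation:

```python
import os
import re


def job_name_and_format(key):
    """Split an S3 key into a Transcribe-friendly job name and a media format."""
    stem, extension = os.path.splitext(os.path.basename(key))
    # Assumption: job names may only contain letters, digits, '.', '_' and '-'.
    job_name = re.sub(r"[^0-9a-zA-Z._-]", "", stem)
    media_format = extension.lstrip(".").lower()
    return job_name, media_format
```

For a key such as `qual1_par2 pt 1.mp3` this gives the same job name as the cell above, while keys with extra dots keep everything before the final extension.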

### NOTE

After every cell that calls AWS, check your AWS console to see when the jobs are done, then move on to the next steps.
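
Instead of checking by hand, job status can also be polled from the notebook. A minimal sketch, assuming `client` is the `transcribe` client created earlier. The function name and its `poll_seconds`, `max_polls`, and `sleep` parameters are hypothetical, and every poll is a real request to AWS:

```python
import time


def wait_for_transcription_job(client, job_name, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll Transcribe until the job is COMPLETED or FAILED, then return that status."""
    for _ in range(max_polls):
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        status = response["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            return status
        # Still queued or in progress: wait before asking again.
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")
```

Usage would look like `wait_for_transcription_job(transcribe, "qual1_par1")`. The `sleep` parameter exists only so the wait can be replaced in tests.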

# Part 1.5 - Transform JSON Transcripts to .txt

Now that the transcription jobs are done, we have a collection of transcripts in JSON format. This format is not conducive to analysis, so let's convert them to .txt. A handy Python library (tscribe) will help us do this.

Serialize all JSON objects from S3 into local files on your machine.

In [50]:
# Create temporary folder for local transcript saving
if not os.path.exists('./temptranscriptfolder'):
    os.mkdir('./temptranscriptfolder')

# Serialize and save locally to machine
transcript_bucket = s3.Bucket(S3_TRANSCRIPT_BUCKET_NAME)
for my_bucket_object in transcript_bucket.objects.all():
    if 'json' in my_bucket_object.key:
        file_name = my_bucket_object.key.split(".")[0]
        myjson = json.loads(my_bucket_object.get()['Body'].read())
        
        # Serializing json
        json_object = json.dumps(myjson, indent=4)

        print(f"Writing JSON to file: {file_name}.json")
#         # Write the serialized JSON to the local transcript folder
#         with open(f"./temptranscriptfolder/{file_name}.json", "w") as outfile:
#             outfile.write(json_object)

Writing JSON to file: qual1_par1.json
Writing JSON to file: qual1_par2pt1.json
Writing JSON to file: qual1_par2pt2.json
Writing JSON to file: qual1_par3.json
Writing JSON to file: qual1_par4pt1.json
Writing JSON to file: qual1_par5.json
Writing JSON to file: qual2_UCBerkeley.json
Writing JSON to file: qual2_UCBerkeley2.json
Writing JSON to file: qual2_UCBerkeley3.json
Writing JSON to file: qual2_UCBerkeley4.json
Writing JSON to file: qual2_UniversityofCalifornia.json
Writing JSON to file: qual3_NewRecording31.json
Writing JSON to file: qual3_NewRecording32.json
Writing JSON to file: qual3_Subject3Recording.json
Writing JSON to file: qual3_Subject4.json
Writing JSON to file: qual3_par1Recording.json
Writing JSON to file: qual3_par2Recording.json
Writing JSON to file: qual4_par1.json
Writing JSON to file: qual4_par2.json
Writing JSON to file: qual4_par3.json
Writing JSON to file: qual4_par4.json
Writing JSON to file: qual4_par5.json
Writing JSON to file: qual4_par6.json
Writing JSON to f

Take local JSON files and transform them into .vtt, then into .txt for further analysis

In [51]:
# JSON -> .vtt -> .txt
for json_file in os.listdir('./temptranscriptfolder'):
    file_name = json_file.split(".")[0]
    print(f"Transforming file {json_file} to vtt.")
#     transcription_base_file_name = f"./temptranscriptfolder/{json_file}"
#     tscribe.write(f"./temptranscriptfolder/{json_file}", save_as=f"./temptranscriptfolder/{file_name}", format="vtt")
    print(f"Transform vtt files to txt for analysis.\n")
#     p = Path(f'./temptranscriptfolder/{file_name}.vtt')
#     p.rename(p.with_suffix('.txt'))

Transforming file qual1_par1.json to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par1.txt to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par2pt1.json to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par2pt1.txt to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par2pt2.json to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par2pt2.txt to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par3.json to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par3.txt to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par4pt1.json to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par4pt1.txt to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par5.json to vtt.
Transform vtt files to txt for analysis.

Transforming file qual1_par5.txt to vtt.
Transform vtt files to
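
Renaming a .vtt file to .txt leaves the WebVTT header and cue timestamps in the file. If only the spoken text is wanted, a possible alternative is to read it straight from the Transcribe JSON. This sketch assumes the usual `results` → `transcripts` → `transcript` layout, and the helper name `transcript_text` is hypothetical:

```python
import json


def transcript_text(transcribe_json):
    """Return the full transcript string from a Transcribe result document."""
    if isinstance(transcribe_json, str):
        document = json.loads(transcribe_json)
    else:
        document = transcribe_json
    # Assumption: the standard layout with results -> transcripts -> transcript.
    return " ".join(part["transcript"] for part in document["results"]["transcripts"])
```

Note that this drops the speaker labels and timing that the tscribe output keeps, so it only fits analyses that need the raw text.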

Upload .txt files back to S3 transcript bucket, preparing them for classification and entity recognition.

In [52]:
# Upload .txt files to S3
for file in os.listdir('./temptranscriptfolder'):
    if '.txt' in file:
        file_path = f'./temptranscriptfolder/{file}'
        try:
            print(f"Uploading {file_path} to bucket {S3_TRANSCRIPT_BUCKET_NAME} as object_name: {file.split('.')[0]}")
#             response = s3_client.upload_file(file_path, S3_TRANSCRIPT_BUCKET_NAME, file.split('.')[0])
        except ClientError as e:
            logging.error(e)

Uploading ./temptranscriptfolder/qual1_par1.txt to bucket infororgtranscriptbucket as object_name: qual1_par1
Uploading ./temptranscriptfolder/qual1_par2pt1.txt to bucket infororgtranscriptbucket as object_name: qual1_par2pt1
Uploading ./temptranscriptfolder/qual1_par2pt2.txt to bucket infororgtranscriptbucket as object_name: qual1_par2pt2
Uploading ./temptranscriptfolder/qual1_par3.txt to bucket infororgtranscriptbucket as object_name: qual1_par3
Uploading ./temptranscriptfolder/qual1_par4pt1.txt to bucket infororgtranscriptbucket as object_name: qual1_par4pt1
Uploading ./temptranscriptfolder/qual1_par5.txt to bucket infororgtranscriptbucket as object_name: qual1_par5
Uploading ./temptranscriptfolder/qual2_UCBerkeley.txt to bucket infororgtranscriptbucket as object_name: qual2_UCBerkeley
Uploading ./temptranscriptfolder/qual2_UCBerkeley2.txt to bucket infororgtranscriptbucket as object_name: qual2_UCBerkeley2
Uploading ./temptranscriptfolder/qual2_UCBerkeley3.txt to bucket infororgtra

# Part 2 - Run Entity Recognition on Transcript .txt files
Now that we have all our transcripts in the desired format, we can run classification and entity recognition on them!

It is preferred to run the classification with a custom model tuned for the kind of interviews you do!

In [46]:
# Create Output URI
output_uri = f"s3://{S3_MODEL_OUTPUT_BUCKET_NAME}"

# Fill this in with the names of files that were used for training, as they should not have analysis run on them
training_list = []

# Run Classification and Entity Recognition on all transcripts
transcript_bucket = s3.Bucket(S3_TRANSCRIPT_BUCKET_NAME)
for my_bucket_object in transcript_bucket.objects.all():
    if '.txt' in my_bucket_object.key and my_bucket_object.key.split('.')[0] not in training_list:
        uri = f"s3://{S3_TRANSCRIPT_BUCKET_NAME}/{my_bucket_object.key}"
        
        print(f"Running custom classification on {my_bucket_object.key}")
        # Run custom classification 
#         response = comprehend_client.start_document_classification_job(
#             JobName=f"{my_bucket_object.key.split('.')[0]}",
#             InputDataConfig={
#                 'S3Uri': uri,
#                 'InputFormat': 'ONE_DOC_PER_FILE'
#             },
#             OutputDataConfig={
#                 'S3Uri': output_uri,
#             },
#             DataAccessRoleArn=S3_MODEL_RUNNING_ROLE,
#             DocumentClassifierArn=S3_CUSTOM_CLASSIFIER_MODEL
#         )
        
        print(f"Running entity recognition on {my_bucket_object.key}")
        # Run entity recognition
#         response = comprehend_client.start_entities_detection_job(
#             JobName=f"{my_bucket_object.key.split('.')[0]}",
#             InputDataConfig={
#                 'S3Uri': uri,
#                 'InputFormat': 'ONE_DOC_PER_FILE'
#             },
#             OutputDataConfig={
#                 'S3Uri': output_uri,
#             },
#             LanguageCode='en',
#             DataAccessRoleArn=S3_MODEL_RUNNING_ROLE,
#         )

Running custom classification on MyTranscriptionJob_Kevin.txt
Running entity recognition on MyTranscriptionJob_Kevin.txt
Running custom classification on qual1_par2pt1.txt
Running entity recognition on qual1_par2pt1.txt
Running custom classification on qual1_par3.txt
Running entity recognition on qual1_par3.txt
Running custom classification on qual1_par4pt1.txt
Running entity recognition on qual1_par4pt1.txt
Running custom classification on qual1_par5.txt
Running entity recognition on qual1_par5.txt
Running custom classification on qual2_UCBerkeley2.txt
Running entity recognition on qual2_UCBerkeley2.txt
Running custom classification on qual2_UCBerkeley3.txt
Running entity recognition on qual2_UCBerkeley3.txt
Running custom classification on qual2_UCBerkeley4.txt
Running entity recognition on qual2_UCBerkeley4.txt
Running custom classification on qual2_UniversityofCalifornia.txt
Running entity recognition on qual2_UniversityofCalifornia.txt
Running custom classification on qual3_NewRec
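
Since both Comprehend calls are commented out by default, it can help to build the request arguments first and inspect them before anything is sent. A sketch using the same parameters as the cell above. The function name `build_comprehend_requests` and its argument names are hypothetical:

```python
def build_comprehend_requests(key, transcript_bucket, output_bucket, role_arn, classifier_arn):
    """Build keyword arguments for both Comprehend jobs without contacting AWS."""
    shared = {
        "JobName": key.split(".")[0],
        "InputDataConfig": {
            "S3Uri": f"s3://{transcript_bucket}/{key}",
            "InputFormat": "ONE_DOC_PER_FILE",
        },
        "OutputDataConfig": {"S3Uri": f"s3://{output_bucket}"},
        "DataAccessRoleArn": role_arn,
    }
    # Each job adds its own extra field on top of the shared settings.
    classification = {**shared, "DocumentClassifierArn": classifier_arn}
    entities = {**shared, "LanguageCode": "en"}
    return classification, entities
```

After printing and checking the two dictionaries, they could be passed on as `comprehend_client.start_document_classification_job(**classification)` and `comprehend_client.start_entities_detection_job(**entities)`.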

# Part 3 - Analyze Entity Recognition Output
Now that all the analysis jobs are done, we can analyze the output!

This should be a playground for whatever you need to do, but the code provided here retrieves the model output, then prints the classifier results and the top 5 entities (by score) from the entity recognition. Enjoy!

In [54]:
# Create temporary folder for local saving of model output
if not os.path.exists('temptargzfolder'):
    os.mkdir('temptargzfolder')

model_output_bucket = s3.Bucket(S3_MODEL_OUTPUT_BUCKET_NAME)
for my_bucket_object in model_output_bucket.objects.all():
    if 'tar.gz' in my_bucket_object.key:
        print(f"Downloading model analysis file: {my_bucket_object.key}")
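#         # Give each archive a unique local name so downloads do not overwrite each other
#         local_path = f"./temptargzfolder/{my_bucket_object.key.replace('/', '_')}"
#         s3_client.download_file(S3_MODEL_OUTPUT_BUCKET_NAME, my_bucket_object.key, local_path)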

Downloading model analysis file: 422470722668-CLN-1454d307e34c6bf8c704669f48f6b5b5/output/output.tar.gz
Downloading model analysis file: 422470722668-CLN-2ec7daabfbf98953683f59d9c93043c2/output/output.tar.gz
Downloading model analysis file: 422470722668-CLN-30acc617bcedfbb26d0d42939523a90c/output/output.tar.gz
Downloading model analysis file: 422470722668-CLN-3481a24f297e4cf2ba227a36bca1b191/output/output.tar.gz
Downloading model analysis file: 422470722668-CLN-3e5f12be76914ccdd26ec24c06ef25f9/output/output.tar.gz
Downloading model analysis file: 422470722668-CLN-4bc830308cffb42c638c89bfe1e12f9e/output/output.tar.gz
Downloading model analysis file: 422470722668-CLN-502df2637b374a17a8710ec5da26957c/output/output.tar.gz
Downloading model analysis file: 422470722668-CLN-5128235f3ca8c5390e7ec1f6210b3c9c/output/output.tar.gz
Downloading model analysis file: 422470722668-CLN-5fadb5ed8deb762853e51f1ecba13bff/output/output.tar.gz
Downloading model analysis file: 422470722668-CLN-6349b5318ad91a

In [55]:
# Read .tar.gz files and collect classification and entity results per interview
interview_dict = {}
for zipped_file in os.listdir("./temptargzfolder/"):
    # The context manager closes the archive even if parsing fails
    with tarfile.open(f"./temptargzfolder/{zipped_file}", "r:gz") as tar:
        for member in tar.getmembers():
            f = tar.extractfile(member)
            if f is not None:
                content = f.read()
                decoded_content = json.loads(content.decode('utf-8'))
                source_file = decoded_content['File']
                if source_file not in interview_dict:
                    interview_dict[source_file] = {
                        'Top Entities': None,
                        'Classification': None
                    }

                print(f"Analyzing model output for file: {source_file}...")
                if 'Entities' in decoded_content:
                    # Keep the five entities with the highest confidence score
                    entry_list = []
                    for entry in decoded_content['Entities']:
                        entry_list.append((entry['Score'], entry['Text']))
                    entry_list.sort(reverse=True)
                    interview_dict[source_file]['Top Entities'] = entry_list[:5]
                else:
                    interview_dict[source_file]['Classification'] = decoded_content['Classes']
print("\n")

for file, data in interview_dict.items():
    print(f"Analysis for {file}:")
    print(f"Classification: {data['Classification']}")
    print(f"Top Entities: {data['Top Entities']}\n")

Analyzing model output for file: qual1_par3.txt...
Analyzing model output for file: qual3_Subject3Recording.txt...
Analyzing model output for file: qual4_par3.txt...
Analyzing model output for file: qual2_UCBerkeley4.txt...
Analyzing model output for file: qual2_UCBerkeley3.txt...
Analyzing model output for file: MyTranscriptionJob_Kevin.txt...
Analyzing model output for file: qual1_par5.txt...
Analyzing model output for file: qual2_UniversityofCalifornia.txt...
Analyzing model output for file: ux3_par1.txt...
Analyzing model output for file: qual4_par7.txt...
Analyzing model output for file: qual3_par1Recording.txt...
Analyzing model output for file: MyTranscriptionJob_Kevin.txt...
Analyzing model output for file: ux2_par1.txt...
Analyzing model output for file: qual4_par1.txt...
Analyzing model output for file: qual4_par6.txt...
Analyzing model output for file: qual1_par4pt1.txt...
Analyzing model output for file: ux3_user6.txt...
Analyzing model output for file: qual2_UCBerkeley3.tx
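
Sorting by score alone can fill the top 5 with repeats of the same entity text. One possible variation, sketched here under the assumption that each entity has the `Text` and `Score` fields used above, keeps only the best score per distinct text. The function name `top_distinct_entities` is hypothetical:

```python
def top_distinct_entities(entities, limit=5):
    """Return up to `limit` (score, text) pairs, one per distinct entity text."""
    best = {}
    for entity in entities:
        text, score = entity["Text"], entity["Score"]
        # Remember only the highest score seen for each text.
        if score > best.get(text, float("-inf")):
            best[text] = score
    ranked = sorted(((score, text) for text, score in best.items()), reverse=True)
    return ranked[:limit]
```

It could stand in for the `entry_list` logic by calling `top_distinct_entities(decoded_content['Entities'])`.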