Version: 02.14.2023

# Capstone Project: Bringing It All Together

In this lab, you will bring together many of the tools and techniques that you have learned throughout this course into a final project. You can choose from many different paths to get to the solution. You could use AWS managed services, such as Amazon Comprehend, or use Amazon SageMaker models. Have fun on whichever path you choose.

### Business scenario

You work for a training organization that recently developed an introductory course about machine learning (ML). The course includes more than 40 videos that cover a broad range of ML topics. You have been asked to create an application that students can use to quickly locate and view video content by searching for topics and key phrases.

You have downloaded all of the videos to an Amazon Simple Storage Service (Amazon S3) bucket. Your assignment is to produce a dashboard that meets your supervisor’s requirements.

To assist you, all of the previous labs have been provided in this workspace.

## Lab steps

To complete this lab, you will follow these steps:

1. [Viewing the video files](#1.-Viewing-the-video-files)
2. [Transcribing the videos](#2.-Transcribing-the-videos)
3. [Normalizing the text](#3.-Normalizing-the-text)
4. [Extracting key phrases and topics](#4.-Extracting-key-phrases-and-topics)
5. [Creating the dashboard](#5.-Creating-the-dashboard)

## Submitting your work

1. In the lab console, choose **Submit** to record your progress and when prompted, choose **Yes**.

1. If the results don't display after a couple of minutes, return to the top of these instructions and choose **Grades**.

     **Tip**: You can submit your work multiple times. After you change your work, choose **Submit** again. Your last submission is what will be recorded for this lab.

1. To find detailed feedback on your work, choose **Details** followed by **View Submission Report**.

## Useful information

The following cell contains some information that might be useful as you complete this project.

In [1]:
bucket = "c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w"
job_data_access_role = 'arn:aws:iam::631168390258:role/service-role/c96181a2162114l4859276t1w-ComprehendDataAccessRole-1NUXO7WC0SJKM'

## 1. Viewing the video files
([Go to top](#Capstone-Project:-Bringing-It-All-Together))


The source video files are located in the following shared Amazon Simple Storage Service (Amazon S3) bucket.

In [2]:
!aws s3 ls s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/

2021-04-26 20:17:33  410925369 Mod01_Course Overview.mp4
2021-04-26 20:10:02   39576695 Mod02_Intro.mp4
2021-04-26 20:31:23  302994828 Mod02_Sect01.mp4
2021-04-26 20:17:33  416563881 Mod02_Sect02.mp4
2021-04-26 20:17:33  318685583 Mod02_Sect03.mp4
2021-04-26 20:17:33  255877251 Mod02_Sect04.mp4
2021-04-26 20:23:51   99988046 Mod02_Sect05.mp4
2021-04-26 20:24:54   50700224 Mod02_WrapUp.mp4
2021-04-26 20:26:27   60627667 Mod03_Intro.mp4
2021-04-26 20:26:28  272229844 Mod03_Sect01.mp4
2021-04-26 20:27:06  309127124 Mod03_Sect02_part1.mp4
2021-04-26 20:27:06  195635527 Mod03_Sect02_part2.mp4
2021-04-26 20:28:03  123924818 Mod03_Sect02_part3.mp4
2021-04-26 20:31:28  171681915 Mod03_Sect03_part1.mp4
2021-04-26 20:32:07  285200083 Mod03_Sect03_part2.mp4
2021-04-26 20:33:17  105470345 Mod03_Sect03_part3.mp4
2021-04-26 20:35:10  157185651 Mod03_Sect04_part1.mp4
2021-04-26 20:36:27  187435635 Mod03_Sect04_part2.mp4
2021-04-26 20:36:40  280720369 Mod03_Sect04_part3.mp4
2021-04-26 20:40:01  443479

## 2. Transcribing the videos
([Go to top](#Capstone-Project:-Bringing-It-All-Together))

Use this section to implement your solution to transcribe the videos.

In [3]:
!aws s3 cp s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/ s3://{bucket}/input/ --recursive

copy: s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/Mod02_Intro.mp4 to s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Intro.mp4
copy: s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/Mod02_Sect03.mp4 to s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Sect03.mp4
copy: s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/Mod02_Sect05.mp4 to s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Sect05.mp4
copy: s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/Mod02_WrapUp.mp4 to s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_WrapUp.mp4
copy: s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/Mod02_Sect01.mp4 to s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Sect01.mp4
copy: s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/Mod02_Sect04.mp4 to s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Sect04.mp4
copy: 

In [4]:
from boto3 import client

conn = client('s3') 
for key in conn.list_objects(Bucket=bucket)['Contents']:
    print(key['Key'])

input/Mod01_Course Overview.mp4
input/Mod02_Intro.mp4
input/Mod02_Sect01.mp4
input/Mod02_Sect02.mp4
input/Mod02_Sect03.mp4
input/Mod02_Sect04.mp4
input/Mod02_Sect05.mp4
input/Mod02_WrapUp.mp4
input/Mod03_Intro.mp4
input/Mod03_Sect01.mp4
input/Mod03_Sect02_part1.mp4
input/Mod03_Sect02_part2.mp4
input/Mod03_Sect02_part3.mp4
input/Mod03_Sect03_part1.mp4
input/Mod03_Sect03_part2.mp4
input/Mod03_Sect03_part3.mp4
input/Mod03_Sect04_part1.mp4
input/Mod03_Sect04_part2.mp4
input/Mod03_Sect04_part3.mp4
input/Mod03_Sect05.mp4
input/Mod03_Sect06.mp4
input/Mod03_Sect07_part1.mp4
input/Mod03_Sect07_part2.mp4
input/Mod03_Sect07_part3.mp4
input/Mod03_Sect08.mp4
input/Mod03_WrapUp.mp4
input/Mod04_Intro.mp4
input/Mod04_Sect01.mp4
input/Mod04_Sect02_part1.mp4
input/Mod04_Sect02_part2.mp4
input/Mod04_Sect02_part3.mp4
input/Mod04_WrapUp.mp4
input/Mod05_Intro.mp4
input/Mod05_Sect01_ver2.mp4
input/Mod05_Sect02_part1_ver2.mp4
input/Mod05_Sect02_part2.mp4
input/Mod05_Sect03_part1.mp4
input/Mod05_Sect03_part2.m

In [5]:
import boto3
import os, io, struct, json
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import uuid
from time import sleep
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

Matplotlib is building the font cache; this may take a moment.
[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...


In [6]:
transcribe_client = boto3.client("transcribe")

In [7]:
output_files=[]
transcribe_output_prefix = 'transcribed'
for key in conn.list_objects_v2(Bucket=bucket, Prefix='input')['Contents']:
    if 'temp' in key['Key']:
        continue
    object_name=key['Key']
    media_input_uri = f's3://{bucket}/{object_name}'

    #create the transcription job
    job_uuid = uuid.uuid1()
    transcribe_job_name = f"transcribe-job-{job_uuid}"
    output_file = object_name.split('.')[0].replace(" ","_")
    transcribe_output_filename = f'{transcribe_output_prefix}-{output_file}.txt'
    output_files.append([transcribe_output_filename,object_name,""])
    print(f'{media_input_uri} transcribed to {transcribe_output_filename}')

    response = transcribe_client.start_transcription_job(
        TranscriptionJobName=transcribe_job_name,
        Media={'MediaFileUri': media_input_uri},
        MediaFormat='mp4',
        LanguageCode='en-US',
        OutputBucketName=bucket,
        OutputKey=transcribe_output_filename
    )

s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod01_Course Overview.mp4 transcribed to transcribed-input/Mod01_Course_Overview.txt
s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Intro.mp4 transcribed to transcribed-input/Mod02_Intro.txt
s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Sect01.mp4 transcribed to transcribed-input/Mod02_Sect01.txt
s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Sect02.mp4 transcribed to transcribed-input/Mod02_Sect02.txt
s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Sect03.mp4 transcribed to transcribed-input/Mod02_Sect03.txt
s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Sect04.mp4 transcribed to transcribed-input/Mod02_Sect04.txt
s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/input/Mod02_Sect05.mp4 transcribed to transcribed-input/Mod02_Sect05.txt
s3://c96181
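
The output key in the loop above is derived with `object_name.split('.')[0]`, which would cut a name short at its first dot if a file name ever contained more than one. The following is a minimal sketch of a more defensive variant, assuming the same `transcribed-` prefix and `.txt` suffix used in this notebook. The helper name `transcript_key_for` is hypothetical.

```python
import os

def transcript_key_for(object_name, prefix="transcribed"):
    """Build the transcript object key for a media object key.

    Only the final extension is removed, so dots elsewhere in the
    name are preserved; spaces become underscores as in the loop above.
    """
    stem, _extension = os.path.splitext(object_name)
    return f"{prefix}-{stem.replace(' ', '_')}.txt"

print(transcript_key_for("input/Mod01_Course Overview.mp4"))
# transcribed-input/Mod01_Course_Overview.txt
```

For the file names listed in this lab, the result matches what the loop already produces; the difference only shows up for names with extra dots.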

In [8]:
print(output_files)

[['transcribed-input/Mod01_Course_Overview.txt', 'input/Mod01_Course Overview.mp4', ''], ['transcribed-input/Mod02_Intro.txt', 'input/Mod02_Intro.mp4', ''], ['transcribed-input/Mod02_Sect01.txt', 'input/Mod02_Sect01.mp4', ''], ['transcribed-input/Mod02_Sect02.txt', 'input/Mod02_Sect02.mp4', ''], ['transcribed-input/Mod02_Sect03.txt', 'input/Mod02_Sect03.mp4', ''], ['transcribed-input/Mod02_Sect04.txt', 'input/Mod02_Sect04.mp4', ''], ['transcribed-input/Mod02_Sect05.txt', 'input/Mod02_Sect05.mp4', ''], ['transcribed-input/Mod02_WrapUp.txt', 'input/Mod02_WrapUp.mp4', ''], ['transcribed-input/Mod03_Intro.txt', 'input/Mod03_Intro.mp4', ''], ['transcribed-input/Mod03_Sect01.txt', 'input/Mod03_Sect01.mp4', ''], ['transcribed-input/Mod03_Sect02_part1.txt', 'input/Mod03_Sect02_part1.mp4', ''], ['transcribed-input/Mod03_Sect02_part2.txt', 'input/Mod03_Sect02_part2.mp4', ''], ['transcribed-input/Mod03_Sect02_part3.txt', 'input/Mod03_Sect02_part3.mp4', ''], ['transcribed-input/Mod03_Sect03_part1.

In [9]:
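# Note: transcribe_job_name holds the name of the last job started above,
# so this loop waits for that job only; earlier jobs may still be running.
job=None
while True:
    job = transcribe_client.get_transcription_job(TranscriptionJobName = transcribe_job_name)
    if job['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED','FAILED']:
        break
    print('.', end='')
    sleep(20)
        
print(job['TranscriptionJob']['TranscriptionJobStatus'])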

.COMPLETED
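
The polling loop above checks only `transcribe_job_name`, which is the last job that was started. One possible way to wait for every job is sketched below. It assumes that the job names were collected in a list while the jobs were started (this notebook does not do that), and it takes the status lookup as a callable so that the logic can be tried without calling AWS. The names `wait_for_jobs`, `get_status`, and `max_polls` are hypothetical.

```python
from time import sleep

def wait_for_jobs(job_names, get_status, poll_seconds=20, max_polls=180):
    """Poll until every job reaches COMPLETED or FAILED.

    get_status is any callable that maps a job name to its status string.
    Returns a dict of job name -> final status.
    """
    pending = set(job_names)
    final = {}
    for _ in range(max_polls):
        for name in sorted(pending):
            status = get_status(name)
            if status in ("COMPLETED", "FAILED"):
                final[name] = status
        # Drop the jobs that reached a final state in this round.
        pending.difference_update(final)
        if not pending:
            return final
        sleep(poll_seconds)
    raise TimeoutError(f"Jobs still running: {sorted(pending)}")
```

In this notebook, `get_status` could be a small function that calls `transcribe_client.get_transcription_job(TranscriptionJobName=name)` and returns `['TranscriptionJob']['TranscriptionJobStatus']` from the response.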


In [11]:
s3_client = boto3.client('s3')
transcribed_text = []
for transcribe_output_filename in output_files:
    result = s3_client.get_object(Bucket=bucket, Key=transcribe_output_filename[0]) 
    data = json.load(result['Body']) 
    transcription = data['results']['transcripts'][0]['transcript']
    transcribe_output_filename[2] = transcription

print(output_files[0])

['transcribed-input/Mod01_Course_Overview.txt', 'input/Mod01_Course Overview.mp4', "Hi and welcome to Amazon Academy of Machine Learning Foundations in this module, you'll learn about the course objectives, various job roles in the machine learning domain and where you can go to learn more about machine learning. After completing this module, you should be able to identify course prerequisites and objectives indicate the role of the data scientist in business and identify resources for further learning. We're now going to look at the prerequisites for taking this course. Before you take this course, we recommend that you first complete Aws Academy Cloud Foundations. You should also have some general technical knowledge of it including foundational computer literacy skills like basic computer concepts, email file management and a good understanding of the internet. We also recommend that you have intermediate skills with Python programming and a general knowledge of applied statistics. 

## 3. Normalizing the text
([Go to top](#Capstone-Project:-Bringing-It-All-Together))

Use this section to perform any text normalization steps that are necessary for your solution.

In [12]:
import pandas as pd
df = pd.DataFrame(data=output_files, columns=['OutputFile','Video','Transcription'] )

In [13]:
df.head()

Unnamed: 0,OutputFile,Video,Transcription
0,transcribed-input/Mod01_Course_Overview.txt,input/Mod01_Course Overview.mp4,Hi and welcome to Amazon Academy of Machine Le...
1,transcribed-input/Mod02_Intro.txt,input/Mod02_Intro.mp4,Hi and welcome to module two of Aws Academy ma...
2,transcribed-input/Mod02_Sect01.txt,input/Mod02_Sect01.mp4,Hi and welcome to section one in this section....
3,transcribed-input/Mod02_Sect02.txt,input/Mod02_Sect02.mp4,Hi and welcome back in this section. We're goi...
4,transcribed-input/Mod02_Sect03.txt,input/Mod02_Sect03.mp4,Hi and welcome back. This is section three and...


In [14]:
def normalize_text(content):
    text = re.sub(r"http\S+", "", content)  # Remove URLs
    text = text.lower()  # Lowercase
    text = text.strip()  # Remove leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace (spaces, tabs, newlines)
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags/markup
    return text
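
As a quick way to see what these normalization steps do, the following sketch applies the same kinds of substitutions to a made-up sample string. It is a variant, not the function above: the patterns are precompiled, and tags are removed before whitespace is collapsed so that removing a tag cannot leave a double space behind. The name `normalize_text_v2` and the sample text are hypothetical.

```python
import re

URL_RE = re.compile(r"http\S+")   # URLs
TAG_RE = re.compile(r"<.*?>")     # HTML tags/markup
SPACE_RE = re.compile(r"\s+")     # runs of spaces, tabs, newlines

def normalize_text_v2(content):
    text = URL_RE.sub("", content)
    text = TAG_RE.sub("", text)
    text = text.lower()
    text = SPACE_RE.sub(" ", text)
    return text.strip()

sample = "  Visit <b>AWS</b>\nAcademy at https://example.com today  "
print(normalize_text_v2(sample))
# visit aws academy at today
```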

In [15]:
%%time
df['Transcription_normalized'] = df['Transcription'].apply(normalize_text)

CPU times: user 14 ms, sys: 18 µs, total: 14 ms
Wall time: 13.3 ms


In [16]:
pd.set_option('display.max_colwidth', 150)
df.head()

Unnamed: 0,OutputFile,Video,Transcription,Transcription_normalized
0,transcribed-input/Mod01_Course_Overview.txt,input/Mod01_Course Overview.mp4,"Hi and welcome to Amazon Academy of Machine Learning Foundations in this module, you'll learn about the course objectives, various job roles in th...","hi and welcome to amazon academy of machine learning foundations in this module, you'll learn about the course objectives, various job roles in th..."
1,transcribed-input/Mod02_Intro.txt,input/Mod02_Intro.mp4,"Hi and welcome to module two of Aws Academy machine learning in this module, we're going to introduce machine learning. We'll first look at the bu...","hi and welcome to module two of aws academy machine learning in this module, we're going to introduce machine learning. we'll first look at the bu..."
2,transcribed-input/Mod02_Sect01.txt,input/Mod02_Sect01.mp4,Hi and welcome to section one in this section. We're going to talk about what machine learning is. This course is an introduction to machine learn...,hi and welcome to section one in this section. we're going to talk about what machine learning is. this course is an introduction to machine learn...
3,transcribed-input/Mod02_Sect02.txt,input/Mod02_Sect02.mp4,Hi and welcome back in this section. We're going to look at the types of business problems. Machine learning can help you solve. Machine learning ...,hi and welcome back in this section. we're going to look at the types of business problems. machine learning can help you solve. machine learning ...
4,transcribed-input/Mod02_Sect03.txt,input/Mod02_Sect03.mp4,Hi and welcome back. This is section three and we're going to give you a quick high level overview of machine learning terminology and a typical w...,hi and welcome back. this is section three and we're going to give you a quick high level overview of machine learning terminology and a typical w...


## 4. Extracting key phrases and topics
([Go to top](#Capstone-Project:-Bringing-It-All-Together))

Use this section to extract the key phrases and topics from the videos.

In [17]:
s3_resource = boto3.Session().resource('s3')

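def upload_comprehend_s3_csv(filename, folder, dataframe):
    # Serialize the data to CSV in memory (no header, no index) and upload it.
    # Relies on the globals `bucket` and `prefix`; `prefix` must be set before the first call.
    csv_buffer = io.StringIO()
    
    dataframe.to_csv(csv_buffer, header=False, index=False )
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())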

    

In [18]:
comprehend_file = 'comprehend_input.csv'
prefix='capstone'
upload_comprehend_s3_csv(comprehend_file, 'comprehend', df['Transcription_normalized'].str.slice(0,5000))
test_url = f's3://{bucket}/{prefix}/comprehend/{comprehend_file}'
print(f'Uploaded input to {test_url}')

Uploaded input to s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/capstone/comprehend/comprehend_input.csv
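
Because the jobs in this section use the `ONE_DOC_PER_LINE` input format, each transcript needs to occupy exactly one line of the uploaded file. The following is a minimal sketch of that serialization using only the standard library. `to_one_doc_per_line` is a hypothetical helper, and the 5,000-character cut simply mirrors the `str.slice(0,5000)` call above.

```python
def to_one_doc_per_line(documents, max_chars=5000):
    """Join documents into a text body with exactly one document per line."""
    lines = []
    for doc in documents:
        # Collapse any line breaks so a document cannot span two lines.
        single_line = " ".join(doc.split())
        lines.append(single_line[:max_chars])
    return "\n".join(lines) + "\n"

body = to_one_doc_per_line(["first transcript\nwith a break", "second transcript"])
print(body.splitlines())
# ['first transcript with a break', 'second transcript']
```

Unlike `DataFrame.to_csv`, this writes the text without CSV quoting, so a transcript that contains commas is not wrapped in double quotation marks.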


In [19]:
# Comprehend client information
comprehend_client = boto3.client(service_name="comprehend")

# Other job parameters
input_data_format = 'ONE_DOC_PER_LINE'
job_uuid = uuid.uuid1()
job_name = f"kpe-job-{job_uuid}"
input_data_s3_path = test_url
output_data_s3_path = f's3://{bucket}/'

In [20]:
# Begin the inference job
kpe_response = comprehend_client.start_key_phrases_detection_job(
    InputDataConfig={'S3Uri': input_data_s3_path,
                     'InputFormat': input_data_format},
    OutputDataConfig={'S3Uri': output_data_s3_path},
    DataAccessRoleArn=job_data_access_role,
    JobName=job_name,
    LanguageCode='en'
)

# Get the job ID
kpe_job_id = kpe_response['JobId']

In [21]:
job_name = f'entity-job-{job_uuid}'
entity_response = comprehend_client.start_entities_detection_job(
    InputDataConfig={'S3Uri': input_data_s3_path,
                     'InputFormat': input_data_format},
    OutputDataConfig={'S3Uri': output_data_s3_path},
    DataAccessRoleArn=job_data_access_role,
    JobName=job_name,
    LanguageCode='en'
)
# Get the job ID
entity_job_id = entity_response['JobId']

## 5. Creating the dashboard
([Go to top](#Capstone-Project:-Bringing-It-All-Together))

Use this section to create the dashboard for your solution.

In [22]:
my_ip = "YOUR IP/24"

In [23]:
!pip install --upgrade pip
!pip install opensearch
!pip install opensearch-py
!pip install requests
!pip install requests-aws4auth

Collecting pip
  Obtaining dependency information for pip from https://files.pythonhosted.org/packages/50/c2/e06851e8cc28dcad7c155f4753da8833ac06a5c704c109313b8d5a62968a/pip-23.2.1-py3-none-any.whl.metadata
  Downloading pip-23.2.1-py3-none-any.whl.metadata (4.2 kB)
Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 16.8 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.2
    Uninstalling pip-23.2:
      Successfully uninstalled pip-23.2
Successfully installed pip-23.2.1
Collecting opensearch
  Downloading opensearch-0.9.2.tar.gz (38 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: opensearch
  Building wheel for opensearch (setup.py) ... done
  Created wheel for opensearch: filename=opensearch-0.9.2-py3-none-any.whl size=39842 sha256=2abe7879b055c7df

In [24]:
es_client = boto3.client('es')

In [25]:
access_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "*"
                },
                "Action": "es:*",
                "Resource": "*",
                "Condition": { "IpAddress": { "aws:SourceIp": my_ip } }
            }
        ]
    }
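
The policy above passes `my_ip` straight into the `aws:SourceIp` condition, so a placeholder or mistyped value would only be noticed later. One way to catch that early is sketched below with the standard `ipaddress` module. `build_access_policy` is a hypothetical helper, and `203.0.113.7/24` is a documentation-range address used only for illustration.

```python
import ipaddress
import json

def build_access_policy(source_cidr):
    """Return the access policy dict, restricted to a validated CIDR block."""
    # strict=False accepts a host address with a prefix length and
    # normalizes it to the network address (203.0.113.7/24 -> 203.0.113.0/24).
    # An invalid value such as a leftover placeholder raises ValueError.
    network = ipaddress.ip_network(source_cidr, strict=False)
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "",
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "es:*",
            "Resource": "*",
            "Condition": {"IpAddress": {"aws:SourceIp": str(network)}},
        }],
    }

policy = build_access_policy("203.0.113.7/24")
print(json.dumps(policy["Statement"][0]["Condition"]))
# {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}}
```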

In [26]:
response = es_client.create_elasticsearch_domain(
    DomainName = 'nlp-lab',
    ElasticsearchVersion = '7.9',
    ElasticsearchClusterConfig={
        "InstanceType": 'm3.large.elasticsearch',
        "InstanceCount": 2,
        "DedicatedMasterEnabled": False,
        "ZoneAwarenessEnabled": False
    },
    AccessPolicies = json.dumps(access_policy)
)

In [27]:
# Get current job status
kpe_job = comprehend_client.describe_key_phrases_detection_job(JobId=kpe_job_id)

# Loop until the job completes; stop early if it fails or is stopped
waited = 0
timeout_minutes = 30
while kpe_job['KeyPhrasesDetectionJobProperties']['JobStatus'] != 'COMPLETED':
    status = kpe_job['KeyPhrasesDetectionJobProperties']['JobStatus']
    assert status not in ('FAILED', 'STOPPED'), "Job ended with status %s." % status
    sleep(10)
    waited += 10
    assert waited//60 < timeout_minutes, "Job timed out after %d seconds." % waited
    print('.', end='')
    kpe_job = comprehend_client.describe_key_phrases_detection_job(JobId=kpe_job_id)

print('Ready')

.......................Ready


In [28]:
# Get current job status
entity_job = comprehend_client.describe_entities_detection_job(JobId=entity_job_id)

# Loop until the job completes; stop early if it fails or is stopped
waited = 0
timeout_minutes = 30
while entity_job['EntitiesDetectionJobProperties']['JobStatus'] != 'COMPLETED':
    status = entity_job['EntitiesDetectionJobProperties']['JobStatus']
    assert status not in ('FAILED', 'STOPPED'), "Job ended with status %s." % status
    sleep(10)
    waited += 10
    assert waited//60 < timeout_minutes, "Job timed out after %d seconds." % waited
    print('.', end='')
    entity_job = comprehend_client.describe_entities_detection_job(JobId=entity_job_id)

print('Ready')

Ready


In [29]:
kpe_comprehend_output_file = kpe_job['KeyPhrasesDetectionJobProperties']['OutputDataConfig']['S3Uri']
print(f'output filename: {kpe_comprehend_output_file}')

kpe_comprehend_bucket, kpe_comprehend_key = kpe_comprehend_output_file.replace("s3://", "").split("/", 1)

s3r = boto3.resource('s3')
s3r.meta.client.download_file(kpe_comprehend_bucket, kpe_comprehend_key, 'output-kpe.tar.gz')

output filename: s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/631168390258-KP-9e71dbef753c758702e35eb78411b535/output/output.tar.gz


In [30]:
# Extract the tar file
import tarfile
tf = tarfile.open('output-kpe.tar.gz')
tf.extractall()
# Rename the output
!mv 'output' 'kpe_output'

In [31]:
entity_comprehend_output_file = entity_job['EntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri']
print(f'output filename: {entity_comprehend_output_file}')

entity_comprehend_bucket, entity_comprehend_key = entity_comprehend_output_file.replace("s3://", "").split("/", 1)

s3r = boto3.resource('s3')
s3r.meta.client.download_file(entity_comprehend_bucket, entity_comprehend_key, 'output-entity.tar.gz')

# Extract the tar file
import tarfile
tf = tarfile.open('output-entity.tar.gz')
tf.extractall()
# Rename the output
!mv 'output' 'entity_output'

output filename: s3://c96181a2162114l4859276t1w631168390258-labbucket-420z8rnx845w/631168390258-NER-4d2169985facf939b9cf8ef12f94c9c6/output/output.tar.gz


In [32]:
import json
data = []
with open ('kpe_output', "r") as myfile:
    for line in myfile:
        data.append(json.loads(line))

In [33]:
kpdf = pd.DataFrame(data, columns=['KeyPhrases','Line'])
kpdf.head()

Unnamed: 0,KeyPhrases,Line
0,"[{'BeginOffset': 26, 'EndOffset': 33, 'Score': 0.8727456819495607, 'Text': 'academy'}, {'BeginOffset': 37, 'EndOffset': 65, 'Score': 0.75145980506...",0
1,"[{'BeginOffset': 19, 'EndOffset': 29, 'Score': 0.6214939014751475, 'Text': 'module two'}, {'BeginOffset': 33, 'EndOffset': 52, 'Score': 0.99413129...",1
2,"[{'BeginOffset': 30, 'EndOffset': 43, 'Score': 0.9589880415051381, 'Text': 'section three'}, {'BeginOffset': 72, 'EndOffset': 99, 'Score': 0.99856...",4
3,"[{'BeginOffset': 17, 'EndOffset': 29, 'Score': 0.9997500325485067, 'Text': 'this section'}, {'BeginOffset': 53, 'EndOffset': 62, 'Score': 0.999940...",5
4,"[{'BeginOffset': 19, 'EndOffset': 30, 'Score': 0.9879459585199664, 'Text': 'section one'}, {'BeginOffset': 34, 'EndOffset': 46, 'Score': 0.9997940...",2
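
Each `KeyPhrases` cell holds a list of dictionaries with `Text` and `Score` keys, as the preview above shows. For a search dashboard, it can help to reduce that list to a short set of distinct, high-confidence phrases. The following is a minimal sketch; `top_key_phrases`, the `min_score` threshold, and the sample values are hypothetical and only shaped like the rows above.

```python
def top_key_phrases(key_phrases, min_score=0.9, limit=10):
    """Return distinct phrase texts with Score >= min_score, highest score first."""
    best = {}
    for phrase in key_phrases:
        text = phrase["Text"]
        if phrase["Score"] >= min_score:
            # Keep the highest score seen for a repeated phrase.
            best[text] = max(best.get(text, 0.0), phrase["Score"])
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:limit]

sample = [
    {"BeginOffset": 26, "EndOffset": 33, "Score": 0.87, "Text": "academy"},
    {"BeginOffset": 37, "EndOffset": 65, "Score": 0.75, "Text": "machine learning foundations"},
    {"BeginOffset": 70, "EndOffset": 81, "Score": 0.99, "Text": "this module"},
]
print(top_key_phrases(sample, min_score=0.8))
# ['this module', 'academy']
```

Applied to the DataFrame, this could be used as `kpdf['KeyPhrases'].apply(top_key_phrases)`.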


In [34]:
import json
data = []
with open ('entity_output', "r") as myfile:
    for line in myfile:
        data.append(json.loads(line))

In [35]:
entitydf = pd.DataFrame(data, columns=['Entities','Line'])
entitydf.head()

Unnamed: 0,Entities,Line
0,"[{'BeginOffset': 19, 'EndOffset': 33, 'Score': 0.9060990005037325, 'Text': 'amazon academy', 'Type': 'ORGANIZATION'}, {'BeginOffset': 546, 'EndOff...",0
1,"[{'BeginOffset': 19, 'EndOffset': 29, 'Score': 0.8810036668646961, 'Text': 'module two', 'Type': 'OTHER'}, {'BeginOffset': 33, 'EndOffset': 36, 'S...",1
2,"[{'BeginOffset': 30, 'EndOffset': 43, 'Score': 0.6055191654832432, 'Text': 'section three', 'Type': 'OTHER'}, {'BeginOffset': 460, 'EndOffset': 46...",4
3,"[{'BeginOffset': 169, 'EndOffset': 182, 'Score': 0.5567611517410612, 'Text': 'all the tools', 'Type': 'QUANTITY'}, {'BeginOffset': 193, 'EndOffset...",5
4,"[{'BeginOffset': 19, 'EndOffset': 30, 'Score': 0.8874841148902434, 'Text': 'section one', 'Type': 'OTHER'}, {'BeginOffset': 183, 'EndOffset': 188,...",2


In [36]:
def extract_entities(entities, entity_type):
    filtered_entities=[]
    for entity in entities:
        if entity['Type'] == entity_type:
            filtered_entities.append(entity)
    return filtered_entities
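
`extract_entities` filters for one entity type per call. If several types are needed, grouping the list once may be simpler. The following sketch collects the entity text by `Type` in a single pass; `group_entities_by_type` is a hypothetical helper, and the sample entities are illustrative values in the shape shown above.

```python
from collections import defaultdict

def group_entities_by_type(entities):
    """Map each entity Type to the list of entity texts of that type."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)

sample_entities = [
    {"Score": 0.91, "Text": "amazon academy", "Type": "ORGANIZATION"},
    {"Score": 0.88, "Text": "module two", "Type": "OTHER"},
    {"Score": 0.98, "Text": "aws", "Type": "ORGANIZATION"},
]
print(group_entities_by_type(sample_entities))
# {'ORGANIZATION': ['amazon academy', 'aws'], 'OTHER': ['module two']}
```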

In [37]:
# Add one column per entity type of interest
entitydf['location'] = entitydf['Entities'].apply(lambda x: extract_entities(x, 'LOCATION'))
entitydf['organization'] = entitydf['Entities'].apply(lambda x: extract_entities(x, 'ORGANIZATION'))

entitydf.head()

Unnamed: 0,Entities,Line,location,organization
0,"[{'BeginOffset': 19, 'EndOffset': 33, 'Score': 0.9060990005037325, 'Text': 'amazon academy', 'Type': 'ORGANIZATION'}, {'BeginOffset': 546, 'EndOff...",0,[],"[{'BeginOffset': 19, 'EndOffset': 33, 'Score': 0.9060990005037325, 'Text': 'amazon academy', 'Type': 'ORGANIZATION'}, {'BeginOffset': 561, 'EndOff..."
1,"[{'BeginOffset': 19, 'EndOffset': 29, 'Score': 0.8810036668646961, 'Text': 'module two', 'Type': 'OTHER'}, {'BeginOffset': 33, 'EndOffset': 36, 'S...",1,[],"[{'BeginOffset': 33, 'EndOffset': 36, 'Score': 0.9779731824490148, 'Text': 'aws', 'Type': 'ORGANIZATION'}]"
2,"[{'BeginOffset': 30, 'EndOffset': 43, 'Score': 0.6055191654832432, 'Text': 'section three', 'Type': 'OTHER'}, {'BeginOffset': 460, 'EndOffset': 46...",4,"[{'BeginOffset': 3626, 'EndOffset': 3628, 'Score': 0.8398416469737435, 'Text': 'uk', 'Type': 'LOCATION'}]",[]
3,"[{'BeginOffset': 169, 'EndOffset': 182, 'Score': 0.5567611517410612, 'Text': 'all the tools', 'Type': 'QUANTITY'}, {'BeginOffset': 193, 'EndOffset...",5,[],"[{'BeginOffset': 307, 'EndOffset': 314, 'Score': 0.6630072743677669, 'Text': 'jupiter', 'Type': 'ORGANIZATION'}, {'BeginOffset': 329, 'EndOffset':..."
4,"[{'BeginOffset': 19, 'EndOffset': 30, 'Score': 0.8874841148902434, 'Text': 'section one', 'Type': 'OTHER'}, {'BeginOffset': 183, 'EndOffset': 188,...",2,[],[]


In [38]:
entitydf.set_index('Line', inplace = True)
entitydf.sort_index(inplace=True)
kpdf.set_index('Line', inplace=True)
kpdf.sort_index(inplace=True)
entitydf.head()

Line,Entities,location,organization
0,"[{'BeginOffset': 19, 'EndOffset': 33, 'Score': 0.9060990005037325, 'Text': 'amazon academy', 'Type': 'ORGANIZATION'}, {'BeginOffset': 546, 'EndOff...",[],"[{'BeginOffset': 19, 'EndOffset': 33, 'Score': 0.9060990005037325, 'Text': 'amazon academy', 'Type': 'ORGANIZATION'}, {'BeginOffset': 561, 'EndOff..."
1,"[{'BeginOffset': 19, 'EndOffset': 29, 'Score': 0.8810036668646961, 'Text': 'module two', 'Type': 'OTHER'}, {'BeginOffset': 33, 'EndOffset': 36, 'S...",[],"[{'BeginOffset': 33, 'EndOffset': 36, 'Score': 0.9779731824490148, 'Text': 'aws', 'Type': 'ORGANIZATION'}]"
2,"[{'BeginOffset': 19, 'EndOffset': 30, 'Score': 0.8874841148902434, 'Text': 'section one', 'Type': 'OTHER'}, {'BeginOffset': 183, 'EndOffset': 188,...",[],[]
3,"[{'BeginOffset': 763, 'EndOffset': 767, 'Score': 0.5536403180912558, 'Text': 'more', 'Type': 'QUANTITY'}, {'BeginOffset': 935, 'EndOffset': 951, '...",[],[]
4,"[{'BeginOffset': 30, 'EndOffset': 43, 'Score': 0.6055191654832432, 'Text': 'section three', 'Type': 'OTHER'}, {'BeginOffset': 460, 'EndOffset': 46...","[{'BeginOffset': 3626, 'EndOffset': 3628, 'Score': 0.8398416469737435, 'Text': 'uk', 'Type': 'LOCATION'}]",[]


In [39]:
m1 = kpdf.merge(entitydf, left_index=True, right_index=True)
m1.sort_index(inplace=True)
pd.set_option('display.max_colwidth', 200)
m1.head()

Line,KeyPhrases,Entities,location,organization
0,"[{'BeginOffset': 26, 'EndOffset': 33, 'Score': 0.8727456819495607, 'Text': 'academy'}, {'BeginOffset': 37, 'EndOffset': 65, 'Score': 0.7514598050663223, 'Text': 'machine learning foundations'}, {'...","[{'BeginOffset': 19, 'EndOffset': 33, 'Score': 0.9060990005037325, 'Text': 'amazon academy', 'Type': 'ORGANIZATION'}, {'BeginOffset': 546, 'EndOffset': 551, 'Score': 0.951490867966341, 'Text': 'fi...",[],"[{'BeginOffset': 19, 'EndOffset': 33, 'Score': 0.9060990005037325, 'Text': 'amazon academy', 'Type': 'ORGANIZATION'}, {'BeginOffset': 561, 'EndOffset': 572, 'Score': 0.9064068657229468, 'Text': 'a..."
1,"[{'BeginOffset': 19, 'EndOffset': 29, 'Score': 0.6214939014751475, 'Text': 'module two'}, {'BeginOffset': 33, 'EndOffset': 52, 'Score': 0.9941312955832783, 'Text': 'aws academy machine'}, {'BeginO...","[{'BeginOffset': 19, 'EndOffset': 29, 'Score': 0.8810036668646961, 'Text': 'module two', 'Type': 'OTHER'}, {'BeginOffset': 33, 'EndOffset': 36, 'Score': 0.9779731824490148, 'Text': 'aws', 'Type': ...",[],"[{'BeginOffset': 33, 'EndOffset': 36, 'Score': 0.9779731824490148, 'Text': 'aws', 'Type': 'ORGANIZATION'}]"
2,"[{'BeginOffset': 19, 'EndOffset': 30, 'Score': 0.9879459585199664, 'Text': 'section one'}, {'BeginOffset': 34, 'EndOffset': 46, 'Score': 0.9997940602015712, 'Text': 'this section'}, {'BeginOffset'...","[{'BeginOffset': 19, 'EndOffset': 30, 'Score': 0.8874841148902434, 'Text': 'section one', 'Type': 'OTHER'}, {'BeginOffset': 183, 'EndOffset': 188, 'Score': 0.9281728391309813, 'Text': 'first', 'Ty...",[],[]
3,"[{'BeginOffset': 24, 'EndOffset': 36, 'Score': 0.9996331597282907, 'Text': 'this section'}, {'BeginOffset': 61, 'EndOffset': 70, 'Score': 0.999921626630556, 'Text': 'the types'}, {'BeginOffset': 7...","[{'BeginOffset': 763, 'EndOffset': 767, 'Score': 0.5536403180912558, 'Text': 'more', 'Type': 'QUANTITY'}, {'BeginOffset': 935, 'EndOffset': 951, 'Score': 0.9698888667844483, 'Text': 'three main ty...",[],[]
4,"[{'BeginOffset': 30, 'EndOffset': 43, 'Score': 0.9589880415051381, 'Text': 'section three'}, {'BeginOffset': 72, 'EndOffset': 99, 'Score': 0.9985641596345158, 'Text': 'a quick high level overview'...","[{'BeginOffset': 30, 'EndOffset': 43, 'Score': 0.6055191654832432, 'Text': 'section three', 'Type': 'OTHER'}, {'BeginOffset': 460, 'EndOffset': 468, 'Score': 0.9755549403380089, 'Text': 'one task'...","[{'BeginOffset': 3626, 'EndOffset': 3628, 'Score': 0.8398416469737435, 'Text': 'uk', 'Type': 'LOCATION'}]",[]


In [40]:
mergedDf = df.merge(m1, left_index=True, right_index=True)

In [41]:
mergedDf.head()

Unnamed: 0,OutputFile,Video,Transcription,Transcription_normalized,KeyPhrases,Entities,location,organization
0,transcribed-input/Mod01_Course_Overview.txt,input/Mod01_Course Overview.mp4,"Hi and welcome to Amazon Academy of Machine Learning Foundations in this module, you'll learn about the course objectives, various job roles in the machine learning domain and where you can go to ...","hi and welcome to amazon academy of machine learning foundations in this module, you'll learn about the course objectives, various job roles in the machine learning domain and where you can go to ...","[{'BeginOffset': 26, 'EndOffset': 33, 'Score': 0.8727456819495607, 'Text': 'academy'}, {'BeginOffset': 37, 'EndOffset': 65, 'Score': 0.7514598050663223, 'Text': 'machine learning foundations'}, {'...","[{'BeginOffset': 19, 'EndOffset': 33, 'Score': 0.9060990005037325, 'Text': 'amazon academy', 'Type': 'ORGANIZATION'}, {'BeginOffset': 546, 'EndOffset': 551, 'Score': 0.951490867966341, 'Text': 'fi...",[],"[{'BeginOffset': 19, 'EndOffset': 33, 'Score': 0.9060990005037325, 'Text': 'amazon academy', 'Type': 'ORGANIZATION'}, {'BeginOffset': 561, 'EndOffset': 572, 'Score': 0.9064068657229468, 'Text': 'a..."
1,transcribed-input/Mod02_Intro.txt,input/Mod02_Intro.mp4,"Hi and welcome to module two of Aws Academy machine learning in this module, we're going to introduce machine learning. We'll first look at the business problems that can be solved by machine lear...","hi and welcome to module two of aws academy machine learning in this module, we're going to introduce machine learning. we'll first look at the business problems that can be solved by machine lear...","[{'BeginOffset': 19, 'EndOffset': 29, 'Score': 0.6214939014751475, 'Text': 'module two'}, {'BeginOffset': 33, 'EndOffset': 52, 'Score': 0.9941312955832783, 'Text': 'aws academy machine'}, {'BeginO...","[{'BeginOffset': 19, 'EndOffset': 29, 'Score': 0.8810036668646961, 'Text': 'module two', 'Type': 'OTHER'}, {'BeginOffset': 33, 'EndOffset': 36, 'Score': 0.9779731824490148, 'Text': 'aws', 'Type': ...",[],"[{'BeginOffset': 33, 'EndOffset': 36, 'Score': 0.9779731824490148, 'Text': 'aws', 'Type': 'ORGANIZATION'}]"
2,transcribed-input/Mod02_Sect01.txt,input/Mod02_Sect01.mp4,"Hi and welcome to section one in this section. We're going to talk about what machine learning is. This course is an introduction to machine learning, which is also known as ML. But first we'll di...","hi and welcome to section one in this section. we're going to talk about what machine learning is. this course is an introduction to machine learning, which is also known as ml. but first we'll di...","[{'BeginOffset': 19, 'EndOffset': 30, 'Score': 0.9879459585199664, 'Text': 'section one'}, {'BeginOffset': 34, 'EndOffset': 46, 'Score': 0.9997940602015712, 'Text': 'this section'}, {'BeginOffset'...","[{'BeginOffset': 19, 'EndOffset': 30, 'Score': 0.8874841148902434, 'Text': 'section one', 'Type': 'OTHER'}, {'BeginOffset': 183, 'EndOffset': 188, 'Score': 0.9281728391309813, 'Text': 'first', 'Ty...",[],[]
3,transcribed-input/Mod02_Sect02.txt,input/Mod02_Sect02.mp4,Hi and welcome back in this section. We're going to look at the types of business problems. Machine learning can help you solve. Machine learning is used all across your digital lives. Your email ...,hi and welcome back in this section. we're going to look at the types of business problems. machine learning can help you solve. machine learning is used all across your digital lives. your email ...,"[{'BeginOffset': 24, 'EndOffset': 36, 'Score': 0.9996331597282907, 'Text': 'this section'}, {'BeginOffset': 61, 'EndOffset': 70, 'Score': 0.999921626630556, 'Text': 'the types'}, {'BeginOffset': 7...","[{'BeginOffset': 763, 'EndOffset': 767, 'Score': 0.5536403180912558, 'Text': 'more', 'Type': 'QUANTITY'}, {'BeginOffset': 935, 'EndOffset': 951, 'Score': 0.9698888667844483, 'Text': 'three main ty...",[],[]
4,transcribed-input/Mod02_Sect03.txt,input/Mod02_Sect03.mp4,Hi and welcome back. This is section three and we're going to give you a quick high level overview of machine learning terminology and a typical workflow. We will cover these topics in more detail...,hi and welcome back. this is section three and we're going to give you a quick high level overview of machine learning terminology and a typical workflow. we will cover these topics in more detail...,"[{'BeginOffset': 30, 'EndOffset': 43, 'Score': 0.9589880415051381, 'Text': 'section three'}, {'BeginOffset': 72, 'EndOffset': 99, 'Score': 0.9985641596345158, 'Text': 'a quick high level overview'...","[{'BeginOffset': 30, 'EndOffset': 43, 'Score': 0.6055191654832432, 'Text': 'section three', 'Type': 'OTHER'}, {'BeginOffset': 460, 'EndOffset': 468, 'Score': 0.9755549403380089, 'Text': 'one task'...","[{'BeginOffset': 3626, 'EndOffset': 3628, 'Score': 0.8398416469737435, 'Text': 'uk', 'Type': 'LOCATION'}]",[]


In [42]:
pd.set_option('display.max_colwidth', 50)
mergedDf.head()

Unnamed: 0,OutputFile,Video,Transcription,Transcription_normalized,KeyPhrases,Entities,location,organization
0,transcribed-input/Mod01_Course_Overview.txt,input/Mod01_Course Overview.mp4,Hi and welcome to Amazon Academy of Machine Le...,hi and welcome to amazon academy of machine le...,"[{'BeginOffset': 26, 'EndOffset': 33, 'Score':...","[{'BeginOffset': 19, 'EndOffset': 33, 'Score':...",[],"[{'BeginOffset': 19, 'EndOffset': 33, 'Score':..."
1,transcribed-input/Mod02_Intro.txt,input/Mod02_Intro.mp4,Hi and welcome to module two of Aws Academy ma...,hi and welcome to module two of aws academy ma...,"[{'BeginOffset': 19, 'EndOffset': 29, 'Score':...","[{'BeginOffset': 19, 'EndOffset': 29, 'Score':...",[],"[{'BeginOffset': 33, 'EndOffset': 36, 'Score':..."
2,transcribed-input/Mod02_Sect01.txt,input/Mod02_Sect01.mp4,Hi and welcome to section one in this section....,hi and welcome to section one in this section....,"[{'BeginOffset': 19, 'EndOffset': 30, 'Score':...","[{'BeginOffset': 19, 'EndOffset': 30, 'Score':...",[],[]
3,transcribed-input/Mod02_Sect02.txt,input/Mod02_Sect02.mp4,Hi and welcome back in this section. We're goi...,hi and welcome back in this section. we're goi...,"[{'BeginOffset': 24, 'EndOffset': 36, 'Score':...","[{'BeginOffset': 763, 'EndOffset': 767, 'Score...",[],[]
4,transcribed-input/Mod02_Sect03.txt,input/Mod02_Sect03.mp4,Hi and welcome back. This is section three and...,hi and welcome back. this is section three and...,"[{'BeginOffset': 30, 'EndOffset': 43, 'Score':...","[{'BeginOffset': 30, 'EndOffset': 43, 'Score':...","[{'BeginOffset': 3626, 'EndOffset': 3628, 'Sco...",[]


In [43]:
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
import requests

In [44]:
from time import sleep

# Poll every 10 seconds until the domain reports that it has finished processing.
alive = es_client.describe_elasticsearch_domain(DomainName='nlp-lab')
while alive['DomainStatus']['Processing']:
    print('.', end='')
    sleep(10)
    alive = es_client.describe_elasticsearch_domain(DomainName='nlp-lab')

print('ready!')

....................................................................................ready!
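
The polling loop above keeps waiting for as long as the domain reports `Processing`. A minimal sketch of a bounded wait follows, assuming a hypothetical helper named `wait_until` (it is not part of the lab code or of boto3). It takes any zero-argument check and gives up after `timeout` seconds.

```python
import time


def wait_until(is_ready, interval=10, timeout=1800):
    """Poll is_ready() until it returns True or `timeout` seconds have elapsed.

    Hypothetical helper for illustration. Returns True if the check passed,
    False if the timeout was reached first.
    """
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


# Stand-in check that becomes ready on the third poll.
answers = iter([False, False, True])
print(wait_until(lambda: next(answers), interval=0, timeout=5))
```

With the notebook's client, the check could be written as `lambda: not es_client.describe_elasticsearch_domain(DomainName='nlp-lab')['DomainStatus']['Processing']`.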


In [45]:
sleep(60)
es_domain = es_client.describe_elasticsearch_domain(DomainName='nlp-lab')
es_endpoint = es_domain['DomainStatus']['Endpoint']

In [46]:
region = 'us-east-1'
service = 'es'
credentials = boto3.Session().get_credentials()

awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)
es = OpenSearch(
    hosts = [{'host': es_endpoint, 'port': 443}],
    http_auth = awsauth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

In [47]:
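# Build one sample document from row 3 to check the field layout before bulk indexing.
# Column positions: 1 = Video, 2 = Transcription, 4 = KeyPhrases, 6 = location, 7 = organization
transcription = mergedDf.iloc[3,2]
keyphrases = mergedDf.iloc[3,4]
location = mergedDf.iloc[3,6]
organization = mergedDf.iloc[3,7]
movie_name = mergedDf.iloc[3,1]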

document = {"name": movie_name, "transcription": transcription, "keyphrases": keyphrases, "location":location, "organization": organization}
print(document)

{'name': 'input/Mod02_Sect02.mp4', 'transcription': "Hi and welcome back in this section. We're going to look at the types of business problems. Machine learning can help you solve. Machine learning is used all across your digital lives. Your email spam filter is the result of a machine learning program that was trained with examples of spam and regular email messages based on books. You're reading or products you bought machine learning programs can predict other books or products you're likely to be interested in. Again, the machine learning program was trained with data from other readers habits and purchases. When detecting credit card fraud, the machine learning program was trained on examples of transactions that turned out to be fraud along with normal transactions. You can probably think of many more examples from social media applications using facial detection to group your photos to detecting brain tumors in brain scans or finding anomalies in x rays. There are three main ty

In [48]:
from opensearchpy import helpers

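def gendata(start, stop):
    """Yield one bulk-index action per row of mergedDf in [start, stop)."""
    # Clamp stop so the range never runs past the last row.
    stop = min(stop, mergedDf.shape[0])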
    for i in range(start, stop):
        yield {
            "_index":'movies',
            "_type": "_doc", 
            "_id":i, 
            "_source": {"name": mergedDf.iloc[i,1], "transcription": mergedDf.iloc[i,2], "keyphrases": mergedDf.iloc[i,4], "location":mergedDf.iloc[i,6], "organization": mergedDf.iloc[i,7]}
        }
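
Each indexed document carries the full Amazon Comprehend key phrase records (offsets, score, and text). For a dashboard, a flat list of phrase strings is often easier to aggregate. The following is a minimal sketch of that flattening, assuming each `KeyPhrases` cell is already a Python list of dictionaries. If the column was loaded from a CSV file, it might hold strings that need parsing first. The helper name `phrase_texts` and the `min_score` threshold are hypothetical; the sample records are taken from the first row shown earlier.

```python
def phrase_texts(phrases, min_score=0.0):
    """Return unique 'Text' values in first-seen order, keeping Score >= min_score.

    Hypothetical helper for illustration, not part of the lab code.
    """
    seen = set()
    texts = []
    for phrase in phrases:
        text = phrase['Text']
        if phrase['Score'] >= min_score and text not in seen:
            seen.add(text)
            texts.append(text)
    return texts


sample = [
    {'BeginOffset': 26, 'EndOffset': 33, 'Score': 0.8727456819495607, 'Text': 'academy'},
    {'BeginOffset': 37, 'EndOffset': 65, 'Score': 0.7514598050663223, 'Text': 'machine learning foundations'},
]

print(phrase_texts(sample))                  # ['academy', 'machine learning foundations']
print(phrase_texts(sample, min_score=0.8))   # ['academy']
```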

In [49]:
%%time
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)
es = OpenSearch(
    hosts = [{'host': es_endpoint, 'port': 443}],
    http_auth = awsauth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)
helpers.bulk(es, gendata(0,mergedDf.shape[0]))

CPU times: user 53.1 ms, sys: 1.57 ms, total: 54.7 ms
Wall time: 1.5 s


(46, [])
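
The result `(46, [])` means that 46 documents were indexed with no errors. To check that they are searchable, one option is a phrase query against the `transcription` field. The sketch below only builds the request body, so it runs without a connection. The helper name `build_phrase_query` and the choice of a `match_phrase` query are assumptions for illustration, not the lab's prescribed approach.

```python
def build_phrase_query(phrase, size=5):
    """Build a query body that finds documents whose transcription contains `phrase`.

    Hypothetical helper for illustration. Only the `name` field (the video path)
    is returned for each hit.
    """
    return {
        "size": size,
        "_source": ["name"],
        "query": {"match_phrase": {"transcription": phrase}},
    }


body = build_phrase_query("credit card fraud")
print(body)

# With the notebook's client, the query could be sent like this:
# results = es.search(index='movies', body=body)
# for hit in results['hits']['hits']:
#     print(hit['_source']['name'], hit['_score'])
```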

In [50]:
#####Creating the Kibana dashboard#####

In [51]:
print(f'https://{es_endpoint}/_plugin/kibana')

https://search-nlp-lab-2ji5uwg3obudwvrjwpvkxbmice.us-east-1.es.amazonaws.com/_plugin/kibana


In [52]:
########### Cleanup: delete the nlp-lab domain #################

In [53]:
response = es_client.delete_elasticsearch_domain(
    DomainName='nlp-lab'
)

# Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.

*©2023 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.*
