# 2. Amazon Comprehend

<img src="./images/comprehend.png">

Amazon Comprehend is a service that uses natural language processing (NLP) to extract insights from the content of documents.

-----------------------------------
> 1. **Input**: UTF-8 text files
> 2. **API**: Java, .NET, Python
> 3. **Features** (some features are not available for every language)
>    - **Entities, Key phrases, Language, Sentiments, Topic Modeling**
>    - **Syntax, Custom Classification (Korean not supported), Custom Entity Recognition (English only)**
-----------------------------------

In [None]:
%load_ext autoreload
%autoreload 2

# External Dependencies:
import time
import boto3
import json
import pandas as pd
import tarfile

comprehend = boto3.client('comprehend')
s3 = boto3.resource('s3')

In [None]:
# Restore the variables saved to the IPython store (`%store`) earlier:
%store -r
assert bucket_name, "Variable `bucket_name` missing from IPython store"
print(bucket_name)

## bucket_name = 'transcribe-comprehend-demo-test-XXX'  ## S3Bucket value from the CloudFormation outputs

assert transcript, "Variable `transcript` missing from IPython store"
print(transcript)
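The cells that follow send `transcript` to each synchronous API in a single call. Those calls cap the size of `Text` in UTF-8 bytes, and a Korean character typically takes 3 bytes, so a long transcript may need to be split first. The following is a minimal sketch of one way to do that. `split_by_utf8_bytes` is a hypothetical helper, and the `max_bytes` default is a placeholder rather than a documented limit, so check the current quota for the operation you call.

```python
def split_by_utf8_bytes(text, max_bytes=5000):
    """Split text into chunks whose UTF-8 encoding is at most max_bytes each.

    Hypothetical helper: max_bytes is a placeholder, not a documented limit.
    Splits on whitespace so words stay whole; a single word longer than
    max_bytes is emitted as its own oversized chunk.
    """
    chunks, current, current_size = [], [], 0
    for word in text.split():
        size = len(word.encode("utf-8"))
        # +1 accounts for the joining space when the chunk is non-empty
        extra = size + (1 if current else 0)
        if current and current_size + extra > max_bytes:
            chunks.append(" ".join(current))
            current, current_size = [word], size
        else:
            current.append(word)
            current_size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


# Made-up sample text; each word is 5 characters but 15 bytes in UTF-8.
sample = "안녕하세요 반갑습니다 " * 4
for chunk in split_by_utf8_bytes(sample, max_bytes=40):
    print(len(chunk.encode("utf-8")), chunk)
```

If you analyze chunks separately, note that any `BeginOffset` and `EndOffset` values in the responses are relative to the chunk, not to the full transcript.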

### Detecting the Dominant Language
-----------------------------------
- **Language (Korean supported)**: identifies the dominant language of the text; about 100 languages can be detected

In [None]:
%%time
print('Calling DetectDominantLanguage')
res = comprehend.detect_dominant_language(Text = transcript)
result = res['Languages'][0]
LanguageCode = result['LanguageCode']
print("Language : {}, Score : {} \n\n".format(LanguageCode, result['Score']))

print("[detail result] \n" + json.dumps(res, sort_keys=True, indent=4))

print("End of DetectDominantLanguage\n")
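The cell above takes the first element of `Languages`, which raises an `IndexError` if the list is empty. As a sketch of a more defensive variant, the helper below picks the highest-scoring entry explicitly and returns `None` when nothing clears a threshold. `dominant_language`, `min_score`, and the sample response are illustrative assumptions, not part of the Comprehend API.

```python
def dominant_language(response, min_score=0.5):
    """Return the LanguageCode with the highest Score, or None.

    min_score is an illustrative threshold, not a Comprehend default.
    """
    languages = response.get("Languages", [])
    if not languages:
        return None
    best = max(languages, key=lambda lang: lang["Score"])
    return best["LanguageCode"] if best["Score"] >= min_score else None


# Hand-written sample in the shape of a DetectDominantLanguage response.
sample_response = {
    "Languages": [
        {"LanguageCode": "en", "Score": 0.12},
        {"LanguageCode": "ko", "Score": 0.87},
    ]
}
print(dominant_language(sample_response))
```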

### Detecting Named Entities 
-----------------------------------
- **Entities (Korean supported)**: performs Named Entity Recognition (NER)
<img src="./images/NER항목.png" width='600'>

In [None]:
print('Calling DetectEntities')
res = comprehend.detect_entities(Text=transcript, LanguageCode=LanguageCode)
list_result =[]
for result in res['Entities']:
    list_result.append([result['Text'], result['Type'], result['Score'], result['BeginOffset'], result['EndOffset']])
df = pd.DataFrame(list_result, columns=['Text', 'Type', 'Score','BeginOffset', 'EndOffset'])
df=df.sort_values(by='Score', ascending=False)
df
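Sorting by score is a good first look. A common next step is to drop low-confidence entities and group the rest by type. The sketch below does that on a plain response dictionary; `entities_by_type`, the `min_score` value, and the sample entities are made up for illustration.

```python
from collections import defaultdict


def entities_by_type(response, min_score=0.9):
    """Group entity texts by Type, keeping only those with Score >= min_score."""
    grouped = defaultdict(list)
    for entity in response["Entities"]:
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


# Hand-written sample in the shape of a DetectEntities response
# for the text "It is raining today in Seattle".
sample_response = {
    "Entities": [
        {"Text": "Seattle", "Type": "LOCATION", "Score": 0.99,
         "BeginOffset": 23, "EndOffset": 30},
        {"Text": "today", "Type": "DATE", "Score": 0.97,
         "BeginOffset": 14, "EndOffset": 19},
        {"Text": "It", "Type": "OTHER", "Score": 0.41,
         "BeginOffset": 0, "EndOffset": 2},
    ]
}
print(entities_by_type(sample_response))
```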

### Detecting Key Phrases
-----------------------------------
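- **Key phrases (Korean supported)**: extracts the key phrases in a document
   - A key phrase is a string containing a noun phrase that describes a particular thing
   - Returns each noun phrase (article + adjective + noun) together with a confidence score
   - All documents must be written in the same language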

In [None]:
print('Calling DetectKeyPhrases')
res = comprehend.detect_key_phrases(Text=transcript, LanguageCode=LanguageCode)
list_result =[]
for result in res['KeyPhrases']:
    list_result.append([result['Text'], result['Score'], result['BeginOffset'], result['EndOffset']])
df = pd.DataFrame(list_result, columns=['Text', 'Score','BeginOffset', 'EndOffset'])
df = df.sort_values(by='Score', ascending=False)
df 
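`BeginOffset` and `EndOffset` locate each phrase in the input text, which makes it easy to show a phrase in its surrounding context. The sketch below assumes the offsets count characters, so they can be used directly as Python string indices. `phrase_in_context` and the sample item are hypothetical.

```python
def phrase_in_context(text, item, width=10):
    """Return the phrase with up to `width` characters of context on each side."""
    begin, end = item["BeginOffset"], item["EndOffset"]
    # Assumption: offsets index characters, so slicing recovers the phrase.
    if text[begin:end] != item["Text"]:
        raise ValueError("offsets do not match the text")
    left = text[max(0, begin - width):begin]
    right = text[end:end + width]
    return "{}[{}]{}".format(left, text[begin:end], right)


text = "It is raining today in Seattle"
# Hand-written item in the shape of one KeyPhrases entry.
item = {"Text": "today", "Score": 0.98, "BeginOffset": 14, "EndOffset": 19}
print(phrase_in_context(text, item, width=8))
```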

### Detecting Sentiment
-----------------------------------
- **Sentiments (Korean supported)**: returns Positive, Negative, Neutral, and Mixed

In [None]:
print('Calling DetectSentiment')

res = comprehend.detect_sentiment(Text=transcript, LanguageCode=LanguageCode)

print("Sentiment : {}  \n Positive : {:0.5f} \n Negative : {:0.5f} \n Neutral : {:0.5f} \n Mixed : {:0.5f} \n ".format(
res['Sentiment'], res['SentimentScore']['Positive'], res['SentimentScore']['Negative'], res['SentimentScore']['Neutral'], res['SentimentScore']['Mixed']))
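The `Sentiment` label alone hides how close the runner-up was. As a small sketch, the helper below ranks the four scores and reports the margin between the top two, which can flag borderline results. `rank_sentiments` and the sample scores are illustrative assumptions.

```python
def rank_sentiments(response):
    """Return (label, score) pairs sorted from highest to lowest score."""
    scores = response["SentimentScore"]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hand-written sample in the shape of a DetectSentiment response.
sample_response = {
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Positive": 0.20, "Negative": 0.05, "Neutral": 0.70, "Mixed": 0.05,
    },
}
ranked = rank_sentiments(sample_response)
(top_label, top_score), (_, second_score) = ranked[0], ranked[1]
margin = top_score - second_score
print("{} leads by {:.2f}".format(top_label, margin))
```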

### Detecting Syntax
-----------------------------------
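- **Syntax (Korean not supported)**
   - Identifies 17 part-of-speech (PoS) tags
   - ADJ (adjective), ADP (adposition: preposition/postposition), ADV (adverb), AUX (auxiliary verb), NOUN (noun), NUM (numeral), and others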

In [None]:
# DetectSyntax does not support Korean, so the call on `transcript` is left commented out:
# print('Calling DetectSyntax')

# res = comprehend.detect_syntax(Text=transcript, LanguageCode=LanguageCode)
# list_result =[]
# for result in res['SyntaxTokens']:
#     list_result.append([result['Text'], result['TokenId'],result['PartOfSpeech']['Tag'] ,result['PartOfSpeech']['Score'], result['BeginOffset'], result['EndOffset']])
# df = pd.DataFrame(list_result, columns=['Text', 'TokenId', 'Tag', 'Score', 'BeginOffset', 'EndOffset'])
# df = df.sort_values(by='TokenId')
# df 

text = "It is raining today in Seattle"
print('Calling DetectSyntax')
print(json.dumps(comprehend.detect_syntax(Text=text, LanguageCode='en'),
                 sort_keys=True, indent=4))
print('End of DetectSyntax\n')

### Topic Modeling
-----------------------------------
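- **Topic Modeling**
   - Determines the common themes across a collection of documents
   - Topics correspond to themes such as politics, sports, or entertainment
   - No annotation of the document text is required
   - Based on a Latent Dirichlet Allocation (LDA) learning model
   - For good results:
       - Use at least 1,000 documents
       - Each document should be at least 3 sentences long
       - Remove documents that consist mostly of numeric data from the corpus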
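The last two recommendations can be applied as a pre-filter before uploading the corpus to S3. The following is a rough sketch only: `is_usable_document` is a hypothetical helper, the sentence splitter is a simple punctuation heuristic, and the `max_numeric_ratio` threshold is an arbitrary choice for illustration.

```python
import re


def is_usable_document(text, min_sentences=3, max_numeric_ratio=0.5):
    """Heuristic pre-filter for a topic modeling corpus (illustrative only)."""
    # Rough sentence count: split on ., !, ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    if len(sentences) < min_sentences:
        return False
    tokens = text.split()
    # A token counts as numeric if it is a number with optional sign,
    # separators, and percent sign (e.g. "42", "-3.5", "1,000", "12%").
    numeric = sum(1 for t in tokens if re.fullmatch(r"[+-]?\d[\d.,]*%?", t))
    return numeric / len(tokens) <= max_numeric_ratio


# Made-up documents.
docs = [
    "The match ended late. Fans stayed anyway. The coach praised the team.",
    "Too short. Only two sentences.",
    "12 15 18. 22 25 31. 40 41 42. 50 51 52.",
]
print([is_usable_document(d) for d in docs])
```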


In [None]:
JobName = 'XXXXXXX'  # job name
topic_modeling_prefix = 'XXXXXXX'  # S3 prefix where the documents for topic modeling are stored
topic_modeling_output = 'XXXXXXX'  # S3 prefix where the topic modeling results are written

input_s3_url = "s3://{}/{}".format(bucket_name, topic_modeling_prefix)
input_doc_format = "ONE_DOC_PER_FILE"  # each file is one document; the alternative is "ONE_DOC_PER_LINE"
output_s3_url = "s3://{}/{}".format(bucket_name, topic_modeling_output)
data_access_role_arn = "arn:aws:iam::XXXXXXXXXX:role/service-role/XXXXXXXXX-DataAccessRole-XXXXXXXXXXX"  # DataAccessRoleArn value from the CloudFormation outputs
number_of_topics = 10

In [None]:
%%time
input_data_config = {"S3Uri": input_s3_url, "InputFormat": input_doc_format}
output_data_config = {"S3Uri": output_s3_url}

start_topics_detection_job_result = comprehend.start_topics_detection_job(
    JobName=JobName,
    NumberOfTopics=number_of_topics,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    DataAccessRoleArn=data_access_role_arn)

print('start_topics_detection_job_result: ' + json.dumps(start_topics_detection_job_result))
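# Poll by JobId: filtering by JobName can also match earlier jobs with the same name.
while True:
    status = comprehend.describe_topics_detection_job(
        JobId=start_topics_detection_job_result['JobId']
    )
    job_status = status['TopicsDetectionJobProperties']['JobStatus']
    # STOPPED is also a terminal state; include it so the loop cannot run forever.
    if job_status in ['COMPLETED', 'FAILED', 'STOPPED']:
        break
    print("Not ready yet...")
    time.sleep(5)
print(status)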
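The loop above waits for as long as the job takes, which can be many minutes for topic modeling. The following is a minimal sketch of a reusable wait with a timeout. `wait_for_job` and its parameters are hypothetical. The status lookup is passed in as a function, so the example runs offline with a stand-in. In the notebook you would pass something like `lambda: comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']['JobStatus']`.

```python
import time


def wait_for_job(get_status, terminal=("COMPLETED", "FAILED", "STOPPED"),
                 interval=30, timeout=3600, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or timeout elapses.

    interval and timeout are in seconds; both defaults are arbitrary choices.
    """
    waited = 0
    while True:
        state = get_status()
        if state in terminal:
            return state
        if waited >= timeout:
            raise TimeoutError("job still {} after {}s".format(state, waited))
        sleep(interval)
        waited += interval


# Stand-in for the real status lookup so the sketch runs without AWS access.
fake_states = iter(["SUBMITTED", "IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
print(wait_for_job(lambda: next(fake_states), interval=0, sleep=lambda s: None))
```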

In [None]:
job_id = start_topics_detection_job_result["JobId"]
print('job_id: ' + job_id)

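def json_default(value):
    """Serialize the date/datetime values in the response (e.g. SubmitTime), which json cannot handle natively."""
    import datetime
    if isinstance(value, datetime.date):  # also matches datetime.datetime, a subclass of date
        return value.isoformat()  # ISO 8601; keeps the time of day for datetime values
    raise TypeError('not JSON serializable')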
    
    
describe_topics_detection_job_result = comprehend.describe_topics_detection_job(JobId=job_id)
print('describe_topics_detection_job_result: ' + json.dumps(describe_topics_detection_job_result, 
                                                            default=json_default))

In [None]:
res = describe_topics_detection_job_result['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']
tmp=res.split('/')

In [None]:
bucket = tmp[2]
output_filename = "/".join(tmp[3:])  # full object key, whatever the depth of the output prefix
output_path = './output/' + tmp[-1]  # local path named after the last key segment
print("bucket : {}, output_path : {}".format(bucket, output_path))
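Splitting the URI on `/` and indexing into the parts is easy to get wrong when the prefix depth changes. As an alternative sketch, the standard library can separate the bucket from the object key. `split_s3_uri` is a hypothetical helper and the URI below is made up.

```python
import posixpath
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key, filename) for an s3://bucket/key URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(uri))
    key = parsed.path.lstrip("/")
    return parsed.netloc, key, posixpath.basename(key)


# Made-up URI with a nested prefix.
example = "s3://example-bucket/results/nested/prefix/output/output.tar.gz"
print(split_s3_uri(example))
```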

In [None]:
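import os
os.makedirs('./output', exist_ok=True)  # download_file does not create missing local directories
s3.Object(bucket, output_filename).download_file(output_path)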

In [None]:
# The context manager closes the archive even if extraction fails.
with tarfile.open(output_path) as ap:
    ap.extractall('./output/topic')

In [None]:
pd.read_csv('./output/topic/topic-terms.csv')

In [None]:
pd.read_csv('./output/topic/doc-topics.csv')
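The two files are most useful together: `topic-terms.csv` describes each topic by its weighted terms, and `doc-topics.csv` assigns topics to documents. The sketch below joins them to print each document's dominant topic with that topic's top terms. It assumes the columns are `topic, term, weight` and `docname, topic, proportion`, so verify that against your own output. The rows here are made up, and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for the two output files.
topic_terms_csv = """topic,term,weight
0,game,0.12
0,team,0.09
1,election,0.15
1,vote,0.11
"""
doc_topics_csv = """docname,topic,proportion
doc1.txt,0,0.8
doc1.txt,1,0.2
doc2.txt,1,1.0
"""


def top_terms(rows, n=2):
    """Map each topic to its n highest-weighted terms."""
    terms = defaultdict(list)
    for row in rows:
        terms[row["topic"]].append((float(row["weight"]), row["term"]))
    return {topic: [term for _, term in sorted(pairs, reverse=True)[:n]]
            for topic, pairs in terms.items()}


def dominant_topic(rows):
    """Map each document to its (topic, proportion) with the largest proportion."""
    best = {}
    for row in rows:
        proportion = float(row["proportion"])
        name = row["docname"]
        if name not in best or proportion > best[name][1]:
            best[name] = (row["topic"], proportion)
    return best


terms = top_terms(csv.DictReader(io.StringIO(topic_terms_csv)))
dominant = dominant_topic(csv.DictReader(io.StringIO(doc_topics_csv)))
for doc, (topic, proportion) in dominant.items():
    print(doc, topic, proportion, terms[topic])
```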