## 1. Importing requirements and creating session variables
([Go to top](#Lab-6.1:-Implementing-Topic-Modeling-with-Amazon-Comprehend))

In this section, you will import the packages that the notebook uses and create the session variables.


In [None]:
import boto3
import uuid
# Client and session information
comprehend_client = boto3.client(service_name='comprehend')


# Constants for the S3 bucket and input data file
bucket = 'c137242a3503187l8793420t1w504868800693-labbucket-fd7g4wutxall'
data_access_role_arn = 'arn:aws:iam::504868800693:role/service-role/c137242a3503187l8793420t1w-ComprehendDataAccessRole-8283gGw7mpcK'


## 2. Importing the newsgroup files
([Go to top](#Lab-6.1:-Implementing-Topic-Modeling-with-Amazon-Comprehend))

Now define the folder to hold the data. Then, clean up the folder, which might contain data from previous experiments.

In [None]:
import os
import shutil

data_dir = '20_newsgroups'
if os.path.exists(data_dir):  # Clean up existing data folder
    shutil.rmtree(data_dir)

In [None]:
!tar -xzf ../s3/20_newsgroups.tar.gz
!ls 20_newsgroups

In [None]:
folders = [os.path.join(data_dir,f) for f in sorted(os.listdir(data_dir)) if os.path.isdir(os.path.join(data_dir, f))]
file_list = [os.path.join(d,f) for d in folders for f in os.listdir(d)]
print('Number of documents:', len(file_list))

## 3. Examining and preprocessing the data
([Go to top](#Lab-6.1:-Implementing-Topic-Modeling-with-Amazon-Comprehend))
    
In this section, you will examine the data and perform some standard natural language processing (NLP) data cleaning tasks.

In [None]:
!cat 20_newsgroups/comp.graphics/37917

In [None]:
# The helper functions below are adapted from the 20 newsgroups loader in sklearn.datasets:
# strip_newsgroup_header, strip_newsgroup_quoting, strip_newsgroup_footer
import re
def strip_newsgroup_header(text):
    """
    Given text in "news" format, strip the headers by removing everything
    before the first blank line.
    """
    _before, _blankline, after = text.partition('\n\n')
    return after

_QUOTE_RE = re.compile(r'(writes in|writes:|wrote:|says:|said:'
                       r'|^In article|^Quoted from|^\||^>)')


def strip_newsgroup_quoting(text):
    """
    Given text in "news" format, strip lines beginning with the quote
    characters > or |, plus lines that often introduce a quoted section
    (for example, because they contain the string 'writes:'.)
    """
    good_lines = [line for line in text.split('\n')
                  if not _QUOTE_RE.search(line)]
    return '\n'.join(good_lines)


def strip_newsgroup_footer(text):
    """
    Given text in "news" format, attempt to remove a signature block.

    As a rough heuristic, we assume that signatures are set apart by either
    a blank line or a line made of hyphens, and that it is the last such line
    in the file (disregarding blank lines at the end).
    """
    lines = text.strip().split('\n')
    for line_num in range(len(lines) - 1, -1, -1):
        line = lines[line_num]
        if line.strip().strip('-') == '':
            break

    if line_num > 0:
        return '\n'.join(lines[:line_num])
    else:
        return text
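
The three helpers above still need to be applied to every file before the documents can be written out. The following is a minimal sketch of that step, not necessarily the lab's own cell: `load_and_clean` is a hypothetical helper name, and the `latin1` encoding is an assumption (it decodes any byte sequence without raising an error).

```python
def load_and_clean(paths, cleaners, encoding='latin1'):
    """Read each file and pass its text through every cleaner, in order."""
    documents = []
    for path in paths:
        # Read raw bytes and decode explicitly (the encoding is an assumption)
        with open(path, 'rb') as fin:
            text = fin.read().decode(encoding)
        for clean in cleaners:
            text = clean(text)
        documents.append(text)
    return documents

# Hypothetical usage in this notebook, with the names defined in earlier cells:
# data = load_and_clean(
#     file_list,
#     [strip_newsgroup_header, strip_newsgroup_quoting, strip_newsgroup_footer],
# )
```

Passing the cleaners as a list keeps the order explicit: the header is removed first, then quoted lines, then the signature block.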

Next, save all of the newsgroup documents to a single file, with one document on each line.

In [None]:
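# `data` is expected to hold the cleaned documents, one string per document
with open('comprehend_input.txt','w', encoding='UTF-8') as cf:
    for line in data:
        # Flatten each document so that one line in the file equals one document
        line = line.strip()
        line = re.sub('\n',' ',line)
        line = re.sub('\r',' ',line)
        cf.write(line+'\n')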

In [None]:
s3 = boto3.resource('s3')
s3.Bucket(bucket).upload_file('comprehend_input.txt', 'comprehend/newsgroups')

In [None]:
number_of_topics = 20

input_s3_url = f"s3://{bucket}/comprehend"
input_doc_format = "ONE_DOC_PER_LINE"
input_data_config = {"S3Uri": input_s3_url, "InputFormat": input_doc_format}

output_s3_url = f"s3://{bucket}/outputfolder/"
output_data_config = {"S3Uri": output_s3_url}

job_uuid = uuid.uuid1()
job_name = f"top-job-{job_uuid}"

print(input_s3_url)
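
The configuration above does not start a job by itself; it has to be passed to the Amazon Comprehend `start_topics_detection_job` operation. The following is a minimal sketch of that call, assuming the variables from the previous cells. `start_topic_job` is a hypothetical wrapper name, and the client is passed in as a parameter so that the function can be exercised without AWS access.

```python
def start_topic_job(client, input_config, output_config, role_arn,
                    job_name, num_topics):
    """Submit a topic modeling job and return the service response."""
    return client.start_topics_detection_job(
        NumberOfTopics=num_topics,
        InputDataConfig=input_config,
        OutputDataConfig=output_config,
        DataAccessRoleArn=role_arn,
        JobName=job_name,
    )

# Hypothetical usage in this notebook:
# start_topics_detection_job_result = start_topic_job(
#     comprehend_client, input_data_config, output_data_config,
#     data_access_role_arn, job_name, number_of_topics)
# print(start_topics_detection_job_result['JobId'])
```

The response includes a `JobId`, which later cells can use to check the job status.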

In [None]:
# Get the current job status
from time import sleep
job_id = start_topics_detection_job_result['JobId']
job = comprehend_client.describe_topics_detection_job(JobId=job_id)

# Poll once a minute until the job completes, fails, or times out
waited = 0
timeout_minutes = 40
while job['TopicsDetectionJobProperties']['JobStatus'] != 'COMPLETED':
    status = job['TopicsDetectionJobProperties']['JobStatus']
    # A failed or stopped job never reaches COMPLETED, so stop waiting
    if status in ('FAILED', 'STOP_REQUESTED', 'STOPPED'):
        raise RuntimeError(f"Job ended with status {status}.")
    sleep(60)
    waited += 60
    assert waited//60 < timeout_minutes, "Job timed out after %d seconds." % waited
    print('.', end='')
    job = comprehend_client.describe_topics_detection_job(JobId=job_id)

print('Ready')
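
Before the results can be extracted, the output archive has to be copied from Amazon S3 to the notebook. The sketch below shows one way to do that, assuming that the `S3Uri` in the job's `OutputDataConfig` points at the `output.tar.gz` file. `split_s3_uri` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'Not an S3 URI: {uri}')
    return parsed.netloc, parsed.path.lstrip('/')

# Hypothetical usage in this notebook, after the job has completed:
# output_uri = job['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']
# out_bucket, out_key = split_s3_uri(output_uri)
# boto3.resource('s3').Bucket(out_bucket).download_file(out_key, 'output.tar.gz')
```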

In [None]:
# Extract the .tar.gz archive into the current directory
import tarfile
with tarfile.open('output.tar.gz') as tf:
    tf.extractall()

## 4. Analyzing the Amazon Comprehend topic modeling output
([Go to top](#Lab-6.1:-Implementing-Topic-Modeling-with-Amazon-Comprehend))



In [None]:
import pandas as pd
dftopicterms = pd.read_csv("topic-terms.csv")

In [None]:
# Print the list of terms for each topic
for t in range(number_of_topics):
    rslt_df = dftopicterms.loc[dftopicterms['topic'] == t]
    topic_list = rslt_df['term'].values.tolist()
    print(f'Topic {t:2} - {topic_list}')

In [None]:
# Build the labels 'topic 0' through 'topic 19' (one per topic)
colnames = pd.DataFrame({'topics': [f'topic {i}' for i in range(number_of_topics)]})

In [None]:
dfdoctopics = pd.read_csv("doc-topics.csv")
dfdoctopics.head()

In [None]:
to_chart = dfdoctopics.loc[dfdoctopics['docname'].isin(['newsgroups:1000','newsgroups:2000','newsgroups:3000','newsgroups:4000','newsgroups:5000'])]

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

fs = 12
# df.index = colnames['topic']
to_chart.plot(kind='bar', figsize=(16,4), fontsize=fs)
plt.ylabel('Topic assignment', fontsize=fs+2)
plt.xlabel('Topic ID', fontsize=fs+2)
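
Because `doc-topics.csv` is in long format (one row per document and topic pair), plotting it directly draws one group of bars per row. If the goal is one group per topic ID, as the x-axis label suggests, the rows can be pivoted first. The following sketch uses made-up sample rows in the same three-column layout; the values are illustrative, not lab results.

```python
import pandas as pd

# Made-up rows in the same layout as doc-topics.csv (docname, topic, proportion)
sample = pd.DataFrame({
    'docname': ['newsgroups:1000', 'newsgroups:1000', 'newsgroups:2000'],
    'topic': [3, 7, 3],
    'proportion': [0.6, 0.4, 1.0],
})

# One row per topic, one column per document; missing pairs become 0
pivoted = sample.pivot_table(values='proportion', index='topic',
                             columns='docname', fill_value=0)
print(pivoted)
```

In the notebook, the same `pivot_table` call could be applied to `to_chart`, and the result plotted with `kind='bar'`.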