# Ground Truth Labeling

<b>NOTE</b>: At every step, check that you are in the us-west-2 region, because many resources cannot be reached across regions. <i><b>Specifically</b>, both the S3 bucket and the SageMaker notebook instance must be created in <b>us-west-2</b>.</i>

## Class material can be downloaded here

https://github.com/ConcurDataScience/ConcurMLWorkshop

<img src="//d1.awsstatic.com/product-marketing/product-page-diagram_SageMaker-Ground-Truth-Plus.b07ea09f6243c1a8a2358c704ce2a227c78b0153.png" width="70%" alt="How Amazon SageMaker Ground Truth Plus works" title="How Amazon SageMaker Ground Truth Plus works">

## Steps Overview

### Decide what to label and which UI template to use
* Out of the box, the SageMaker console offers templates for text classification and image classification.
* You can also use a custom-built template.

### Prepare the input
* The input is in JSON format.
* It is called an input manifest.
* It needs to provide all the data that the UI template needs.

### Create the job
* The data needs to be in S3.
* The job needs to have read/write access to the S3 bucket.
* Labelers need to have a user account set up to access the AWS labeling portal.
* You must provide instructions that tell the labelers exactly what to do for the labeling job.

### Do labeling
* The output will be in S3.
* The consolidated output will be in an output manifest file.
* It will be generated after all tasks are completed.

### Do ETL
* This extracts the data from the output manifest file.
* It generates a CSV file that the Athena table reads from.

## Preparations

<h4>Here we install the dependencies. <b>pyathena</b> will allow us to work with the data as we would with a database.</h4>

In [1]:
!pip install pyathena

Collecting pyathena
  Downloading PyAthena-2.3.2-py3-none-any.whl (37 kB)
Installing collected packages: pyathena
Successfully installed pyathena-2.3.2


In [2]:
from pyathena import connect
import pandas as pd
import json
import boto3
bucket_name = 'twitter-sentiment-hl'
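
The region requirement from the note at the top can also be checked in code. The following is a minimal sketch, assuming an S3 client created with `boto3.client("s3")` and permission to read the bucket location; the helper names `bucket_region` and `assert_bucket_in_region` are hypothetical. `get_bucket_location` reports `None` as the `LocationConstraint` for buckets in us-east-1, so the helper maps that case explicitly.

```python
def bucket_region(s3_client, bucket):
    """Return the region a bucket lives in (hypothetical helper)."""
    response = s3_client.get_bucket_location(Bucket=bucket)
    # S3 returns None as the LocationConstraint for buckets in us-east-1.
    return response.get("LocationConstraint") or "us-east-1"


def assert_bucket_in_region(s3_client, bucket, expected="us-west-2"):
    """Raise ValueError if the bucket is not in the expected region."""
    region = bucket_region(s3_client, bucket)
    if region != expected:
        raise ValueError(f"{bucket} is in {region}, expected {expected}")
    return region
```

In this notebook it could be called as `assert_bucket_in_region(boto3.client("s3"), bucket_name)`.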

### Decide UI Template to Use

* Text classification: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-text-classification.html.
* Amazon template repo: https://github.com/aws-samples/amazon-sagemaker-ground-truth-task-uis.
* We will use the twitter sentiment template: [resources/sentiment-analysis-tweet.liquid](resources/sentiment-analysis-tweet.liquid).

The UI template will largely dictate the format of the input data for labeling.

In [3]:
!aws s3 cp resources/sentiment-analysis-tweet.liquid s3://$bucket_name/labeling_data_component/labeling_data_input/

upload: resources/sentiment-analysis-tweet.liquid to s3://twitter-sentiment-hl/labeling_data_component/labeling_data_input/sentiment-analysis-tweet.liquid


### Prepare Raw Data

Below we have some <b>tweet data</b>, which we will label using the AWS Ground Truth tool.

In [4]:
rawdata = pd.read_csv('resources/tweet_data.csv', delimiter='\t')
print(rawdata.head())
rawdata = rawdata.reset_index()  # make sure indexes pair with number of rows

   tweet_id         entity                                         tweet_text
0     13085  Xbox(Xseries)  Colt is trying it again. You guys ready for so...
1     13137    Borderlands                    totally absolutely love the new
2     13009  Xbox(Xseries)  @IdleSloth1984 what the hell do you mean? Xbox...


### Generate Input Manifest

The manifest must provide the data that the UI template needs.

The format for text classification template: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-supported-data-formats.html

In [5]:
with open('resources/input.manifest', 'w') as mft:
    for index, row in rawdata.iterrows():
        task = {
            'source': row['tweet_text'],       # text shown to the labeler
            'entity': row['entity'],
            'tweet_id': int(row['tweet_id'])   # plain int so json.dumps can serialize it
        }
        mft.write(json.dumps(task))
        mft.write("\n")
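
Before uploading, it can help to confirm that every manifest line is a standalone JSON object carrying either a `source` or a `source-ref` key. A minimal sketch, where the helper name `check_manifest_lines` and the second sample record are illustrative assumptions:

```python
import json


def check_manifest_lines(lines):
    """Count valid records; raise ValueError on the first bad line."""
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        record = json.loads(line)  # malformed JSON raises a ValueError subclass
        if "source" not in record and "source-ref" not in record:
            raise ValueError(f"line {number}: missing 'source' or 'source-ref'")
        count += 1
    return count


# Illustrative records shaped like the ones written above.
sample = [
    json.dumps({"source": "totally absolutely love the new",
                "entity": "Borderlands", "tweet_id": 13137}),
    json.dumps({"source": "example tweet text",
                "entity": "Xbox(Xseries)", "tweet_id": 13085}),
]
print(check_manifest_lines(sample))
```

For the file generated above, the same helper accepts an open file handle: `with open('resources/input.manifest') as f: check_manifest_lines(f)`.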

In [6]:
!aws s3 cp resources/input.manifest s3://$bucket_name/labeling_data_component/labeling_data_input/
!aws s3 cp resources/sentiments.json s3://$bucket_name/labeling_data_component/labeling_data_input/

upload: resources/input.manifest to s3://twitter-sentiment-hl/labeling_data_component/labeling_data_input/input.manifest
upload: resources/sentiments.json to s3://twitter-sentiment-hl/labeling_data_component/labeling_data_input/sentiments.json
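
The contents of `resources/sentiments.json` are not shown here. For orientation, a label category configuration file for a text classification job generally has the shape sketched below. The three label names are assumptions for illustration and may differ from the ones in the workshop file.

```python
import json

# Hypothetical label names; replace with the categories your task needs.
label_names = ["positive", "negative", "neutral"]

label_category_config = {
    "document-version": "2018-11-28",
    "labels": [{"label": name} for name in label_names],
}

print(json.dumps(label_category_config, indent=2))
```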


## Create the labeling job

In this section we will:

* Create a labeling job from the SageMaker console.
* Create a labeling job using boto3.

In AWS, a labeling job can be created manually using the SageMaker console, or in code by invoking one of the AWS APIs.

Creating a job in the console helps you understand the concepts and the process involved, and it is generally fine for a one-time job. In comparison, creating a job in code lets you automate the process and makes it easy to repeat what you have done.

### Create a job from the console

We will create a new job using the SageMaker console. After this exercise, you will:

* understand a few concepts, and
* have a labeler account set up with your personal email.

Here are the steps we will go through:

* Generate a bucket policy that gives labeling jobs access to the bucket.
* Update our <code>ConcurMLWorkshopUse</code> role with that policy.
* Create a job using the SageMaker console.

In [7]:
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}"
            ]
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Effect": "Allow",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}/*"
            ]
        }
    ]
}
print(json.dumps(policy, indent=2))

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::twitter-sentiment-hl"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::twitter-sentiment-hl/*"
      ]
    }
  ]
}
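
The generated policy can be pasted into the IAM console, or attached to the role in code. The sketch below assumes an IAM client from `boto3.client("iam")` and sufficient IAM permissions; the helper name and the inline policy name `GroundTruthBucketAccess` are hypothetical.

```python
import json


def attach_bucket_policy(iam_client, role_name, policy,
                         policy_name="GroundTruthBucketAccess"):
    """Attach `policy` (a dict) to the role as an inline policy."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,             # hypothetical name, pick your own
        PolicyDocument=json.dumps(policy),  # IAM expects a JSON string
    )
    return policy_name
```

With the `policy` dictionary from the cell above, the call would look like `attach_bucket_policy(boto3.client("iam"), "ConcurMLWorkshopUse", policy)`.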


### Activity after the first job is created

* Check your email to complete registration as a labeler.
* Verify that the new job is listed in SageMaker.
* Verify that a new work team is listed in SageMaker.
* Complete the labeling tasks.

We will review the labeling output later, since AWS takes a little time to produce it.
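
Job progress can also be checked from code rather than the console. A minimal sketch, assuming a SageMaker client from `boto3.client("sagemaker")`; the helper name `labeling_progress` is hypothetical.

```python
def labeling_progress(sagemaker_client, job_name):
    """Return (status, labeled, unlabeled) for a labeling job."""
    response = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)
    counters = response["LabelCounters"]
    return (
        response["LabelingJobStatus"],  # e.g. "InProgress" or "Completed"
        counters["TotalLabeled"],
        counters["Unlabeled"],
    )
```

Once the status is `Completed` and the unlabeled count is zero, the output manifest should be available in the job's output location.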

### Create a job using code

Now we will create a similar job using code. After this, you will be able to:

* easily repeat the process to create a new job, and
* customize your jobs more readily going forward.

<h4>Pre and post lambda</h4>

AWS allows you to specify pre- and post-processing Lambda functions for a labeling job. They give you an opportunity to plug in your own logic to process the task data before it is sent to the labeling UI and after the labels come back.

* AWS provided [PreHumanTaskLambdaArn](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#SageMaker-Type-HumanTaskConfig-PreHumanTaskLambdaArn) 
* AWS provided [AnnotationConsolidationLambdaArn](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AnnotationConsolidationConfig.html) 
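
If the AWS-provided functions do not fit, you can supply your own. As a rough sketch of a custom pre-annotation Lambda: Ground Truth passes each manifest line to the handler under `dataObject`, and the handler returns the `taskInput` that the UI template will receive. The `taskObject` key is an assumption here; it must match whatever name your template reads from `task.input`.

```python
def lambda_handler(event, context):
    """Hypothetical pre-annotation handler for an inline-text task."""
    data_object = event["dataObject"]  # one line of the input manifest

    # Prefer inline text; fall back to an S3 reference if that is what the line holds.
    text = data_object.get("source", data_object.get("source-ref"))

    return {
        "taskInput": {"taskObject": text},
        # AWS sample handlers pass this flag as the string "true".
        "isHumanAnnotationRequired": "true",
    }
```

The handler can be exercised locally with a hand-built event such as `lambda_handler({"dataObject": {"source": "some tweet"}}, None)` before it is deployed.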

In [8]:
def create_labeling_job(job_name, role_arn, bucket_name, team_arn):
    human_task_config = {
        "WorkteamArn": team_arn,
        "UiConfig": {
          "UiTemplateS3Uri": f"s3://{bucket_name}/labeling_data_component/labeling_data_input/sentiment-analysis-tweet.liquid"
        },
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-west-2:081040173940:function:PRE-TextMultiClass",
        "TaskTitle": "Please pick the proper sentiment for the tweet",
        "TaskDescription": "Please pick the proper sentiment for the tweet",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 3600,
        "TaskAvailabilityLifetimeInSeconds": 864000,
        "MaxConcurrentTaskCount": 1000,
        "AnnotationConsolidationConfig": {
          "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-west-2:081040173940:function:ACS-TextMultiClass"
        }
    }

    output_path = f"s3://{bucket_name}/labeling_data_component/labeling_data_output/"

    # Labeling job names must be unique within an AWS account and region;
    # create_labeling_job fails if the name is already in use.
    REGION = boto3.session.Session().region_name
    sagemaker_client = boto3.client('sagemaker', REGION)
    response = sagemaker_client.create_labeling_job(
        LabelingJobName = job_name,
        LabelAttributeName = "sentiment",
        InputConfig={
            'DataSource': {
                'S3DataSource': {
                    'ManifestS3Uri': f's3://{bucket_name}/labeling_data_component/labeling_data_input/input.manifest'
                }
            }
        },
        OutputConfig={
            'S3OutputPath': f'{output_path}',
            'KmsKeyId': 'alias/aws/s3'
        },
        RoleArn=role_arn,
        LabelCategoryConfigS3Uri = f"s3://{bucket_name}/labeling_data_component/labeling_data_input/sentiments.json",
        HumanTaskConfig=human_task_config
    )
    return response

job_name = 'workshop-hl'
role_arn = 'arn:aws:iam::786774050055:role/ConcurMLWorkshopUse'
team_arn = 'arn:aws:sagemaker:us-west-2:786774050055:workteam/private-crowd/workshop'

create_labeling_job(job_name, role_arn, bucket_name, team_arn)

## Output ETL

The raw output from SageMaker is not easy for a program to consume for model training. We can use an ETL step to extract the data and make it available through Athena. In this step we will do the following:

* Download and inspect the output.manifest
* Create an ETL script and run it to generate a CSV file
* Create an Athena table that points to the generated CSV file
* Query the table and review the data

### Download the output.manifest

Note that AWS generates the output manifest only after all labeling tasks are complete.

In [9]:
job_name='workshop-1'
!aws s3 cp s3://$bucket_name/labeling_data_component/labeling_data_output/$job_name/manifests/output/output.manifest resources/output.manifest

download: s3://twitter-sentiment-hl/labeling_data_component/labeling_data_output/workshop-1/manifests/output/output.manifest to resources/output.manifest


### Do the ETL

In [14]:
with open('resources/output.manifest', 'r') as manifest:
    outlines = manifest.readlines()

with open('resources/outputetl.csv', 'w') as outetl:
    # Tab-separated despite the .csv extension, to match the Athena table's separatorChar.
    header = ['tweet_id', 'entity', 'sentiment', 'confidence', 'tweet_text']
    outetl.write('\t'.join(header))
    outetl.write('\n')
    for line in outlines:
        data = json.loads(line)
        fields = [str(data['tweet_id'])]
        fields.append(data['entity'])
        # The label details live under the "<LabelAttributeName>-metadata" key.
        for key in data:
            if key.endswith('-metadata'):
                fields.append(data[key]['class-name'])
                fields.append(str(data[key]['confidence']))
        fields.append(data['source'])
        outetl.write('\t'.join(fields))
        outetl.write('\n')
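
To make the manifest structure concrete, here is the same extraction applied to a single record. The sample line is invented for illustration (the confidence value and label are assumptions, not real job output), but it follows the layout the loop above relies on: the original manifest fields plus a `sentiment-metadata` object named after the label attribute.

```python
import json

# An illustrative output manifest line; the values are invented.
sample_line = json.dumps({
    "source": "totally absolutely love the new",
    "entity": "Borderlands",
    "tweet_id": 13137,
    "sentiment": 0,
    "sentiment-metadata": {
        "class-name": "positive",
        "confidence": 0.9,
        "human-annotated": "yes",
        "type": "groundtruth/text-classification",
    },
})


def extract_row(line, label_attribute="sentiment"):
    """Turn one output manifest line into the five ETL columns."""
    data = json.loads(line)
    metadata = data[f"{label_attribute}-metadata"]
    return [
        str(data["tweet_id"]),
        data["entity"],
        metadata["class-name"],
        str(metadata["confidence"]),
        data["source"],
    ]


print("\t".join(extract_row(sample_line)))
```

Looking up `sentiment-metadata` by name, rather than scanning for any key that ends in `-metadata`, avoids picking up extra columns if a manifest has been chained through more than one labeling job.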

In [28]:
!aws s3 cp resources/outputetl.csv s3://$bucket_name/labeling_data_component/labeling_data_etl/

upload: resources/outputetl.csv to s3://twitter-sentiment-hl/labeling_data_component/labeling_data_etl/outputetl.csv


### Create the Athena database and table

If you have not yet created the Athena database, do the following:
1. Navigate to Athena in the AWS console.
2. In Settings, click Manage and make sure that
* you specify the logging folder,
* your own account is the bucket owner, and
* you give yourself full control over the query results.
3. Run the command below to create an Athena database:
> CREATE DATABASE `ml-workshop-db`
4. Create the Athena table:
* Run the cell below to generate the table DDL.
* Execute the DDL in Athena to create the table.

In [29]:
ddl_query = f"""
CREATE EXTERNAL TABLE `labeling_output`(
  `tweet_id` string COMMENT 'from deserializer',
  `entity` string COMMENT 'from deserializer',
  `sentiment` string COMMENT 'from deserializer',
  `confidence` string COMMENT 'from deserializer',
  `tweet_text` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'escapeChar'='\\\\',
  'quoteChar'='\\"',
  'separatorChar'='\\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://{bucket_name}/labeling_data_component/labeling_data_etl'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'skip.header.line.count'='1',
  'transient_lastDdlTime'='1645737537')
"""

print(ddl_query)


CREATE EXTERNAL TABLE `labeling_output`(
  `tweet_id` string COMMENT 'from deserializer',
  `entity` string COMMENT 'from deserializer',
  `sentiment` string COMMENT 'from deserializer',
  `confidence` string COMMENT 'from deserializer',
  `tweet_text` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'escapeChar'='\\',
  'quoteChar'='\"',
  'separatorChar'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://twitter-sentiment-hl/labeling_data_component/labeling_data_etl'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'skip.header.line.count'='1',
  'transient_lastDdlTime'='1645737537')
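
Once the table exists, it can be queried from the notebook through any DB-API style cursor. A minimal sketch, assuming the database and table names used above; the helper name `fetch_labeling_output` is hypothetical. Note that Athena DDL uses backticks, while `SELECT` statements quote identifiers such as the hyphenated database name with double quotes.

```python
def fetch_labeling_output(cursor, database="ml-workshop-db", limit=10):
    """Run a small SELECT against the labeling_output table and return the rows."""
    query = (
        "SELECT tweet_id, entity, sentiment, confidence "
        f'FROM "{database}"."labeling_output" '
        f"LIMIT {int(limit)}"
    )
    cursor.execute(query)
    return cursor.fetchall()
```

With pyathena, a cursor could be obtained from `connect(s3_staging_dir=..., region_name="us-west-2").cursor()`, where the staging directory is an S3 location of your choosing for query results.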



## Wrap up - What is possible?

A demo of a custom bounding box NER template