# Groundtruth Labeling

<b>NOTE</b>: At every step, check that you are in the <b>us-west-2</b> region, because many resources cannot be reached across regions. <i><b>Specifically</b>, both the S3 bucket and the SageMaker notebook instance must be created in <b>us-west-2</b>.</i> Also, when creating the notebook instance, assign it the <b>ConcurMLWorkshopUse</b> role.

## Class material can be downloaded here

https://github.com/ConcurDataScience/ConcurMLWorkshop

<img src="//d1.awsstatic.com/product-marketing/product-page-diagram_SageMaker-Ground-Truth-Plus.b07ea09f6243c1a8a2358c704ce2a227c78b0153.png" width="70%" alt="How Amazon SageMaker Ground Truth Plus works" title="How Amazon SageMaker Ground Truth Plus works">

## Steps Overview

### Decide what to label and decide on the UI template to use 
* What is your input? Text, image, audio, or video? 
* What is the task? Classification, bounding box, single-label, or multi-label?
* Out of the box, SageMaker console has text classification and image classification.
* You can also use a custom built template.

### Prepare the input
* The input is in JSON format: <i>what does it look like?</i>
* It is called an input manifest: <i>what does it look like?</i>
* It needs to provide all the data that the UI template can consume: <i>what data?</i>

### Create the job
* The data needs to be in S3: <i>what data to upload?</i>
* The job needs to have read/write access to the S3 bucket: <i>what access policies to update?</i>
* Labelers need user accounts set up to access the AWS labeling portal: <i>how do we set up the accounts?</i>
* Must specify instructions for the labeler on what to do specifically for the labeling job: <i>how do we do that?</i>

### Do labeling
* The output will be in S3: <i>what are the outputs?</i>
* The consolidated output will be in an output manifest file: <i>what does it look like?</i>
* It will be generated after all tasks are completed: <i>why?</i>

### Do ETL
* This extracts the data from the output manifest file: <i>why ETL?</i>
* It generates a CSV file that the Athena table queries: <i>how do we set up the Athena table?</i>

## Preparations

Here we install the dependencies. <b>pyathena</b> will allow us to work with the data with SQL.

In [1]:
!pip install pyathena

Collecting pyathena
  Downloading PyAthena-2.3.2-py3-none-any.whl (37 kB)
Installing collected packages: pyathena
Successfully installed pyathena-2.3.2


### Dependencies

Let's do some imports and **define the bucket name** that we will use throughout this session.

In [2]:
from pyathena import connect
import pandas as pd
import json
import boto3
bucket_name = 'twitter-sentiment-hl'

### Decide UI Template to Use

* Text classification: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-text-classification.html - for the first job.
* Amazon template repo: https://github.com/aws-samples/amazon-sagemaker-ground-truth-task-uis.
* We will use the twitter sentiment template: [resources/sentiment-analysis-tweet.liquid](resources/sentiment-analysis-tweet.liquid) - for the second job.

The UI template will largely dictate the format of the input data for labeling.

In [3]:
!aws s3 cp resources/sentiment-analysis-tweet.liquid s3://$bucket_name/labeling_data_component/labeling_data_input/

upload: resources/sentiment-analysis-tweet.liquid to s3://twitter-sentiment-hl/labeling_data_component/labeling_data_input/sentiment-analysis-tweet.liquid


### Prepare Raw Data

Below we have some <b>tweet data</b>. We will label it using SageMaker Ground Truth.

In [4]:
rawdata = pd.read_csv('resources/tweet_data.csv', delimiter='\t')
print(rawdata.head())
rawdata = rawdata.reset_index()  # make sure indexes pair with number of rows

   tweet_id         entity                                         tweet_text
0     13085  Xbox(Xseries)  Colt is trying it again. You guys ready for so...
1     13137    Borderlands                    totally absolutely love the new
2     13009  Xbox(Xseries)  @IdleSloth1984 what the hell do you mean? Xbox...


### Generate Input Manifest

The manifest must provide the data that the UI template needs. Typically, in a manifest,
* you want to include the data referenced by the UI template,
* you want to have an ID that allows you to uniquely identify the data, and
* you can include any other data that the UI template does not need, if it is convenient to pass it through.

The format for text classification template: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-supported-data-formats.html

In [5]:
with open('resources/input.manifest', 'w') as mft:
    for index, row in rawdata.iterrows():
        task = {
            'source': row['tweet_text'],
            'entity': row['entity'],         
            'tweet_id': row['tweet_id']            
        }
        mft.write(json.dumps(task))
        mft.write("\n")
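
Before uploading, it can help to check the manifest programmatically. The sketch below assumes that every non-blank line must be a standalone JSON object containing a `source` key, as the manifest written above does. `validate_manifest` is a hypothetical helper written for this notebook, not part of any AWS SDK.

```python
import json


def validate_manifest(path, required_keys=("source",)):
    """Return the number of records in a JSON Lines manifest.

    Raises ValueError on the first line that is not a JSON object
    or that lacks one of the required keys.
    """
    count = 0
    with open(path, "r") as mft:
        for line_no, line in enumerate(mft, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no} is not a JSON object")
            missing = [key for key in required_keys if key not in record]
            if missing:
                raise ValueError(f"line {line_no} is missing keys: {missing}")
            count += 1
    return count
```

For the manifest generated above, a call such as `validate_manifest('resources/input.manifest', ('source', 'entity', 'tweet_id'))` should return the number of rows in `rawdata`.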

Next let's:
* take a look at the generated manifest file and make sure it looks good.
* upload it and also upload the <i>sentiments.json</i> file, which we will use for the second job.
* **verify** that they are in S3.
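
The contents of <i>sentiments.json</i> are not shown in this notebook. As a rough sketch, assuming the file follows the Ground Truth label category configuration format (a `document-version` field plus a list of `labels`), it could be generated as below. Only the `Negative` label appears in the output later in this notebook; the other label names and the helper name `build_label_category_config` are assumptions made for illustration.

```python
import json


def build_label_category_config(labels):
    """Build a label category configuration dict for the given label names."""
    return {
        "document-version": "2018-11-28",
        "labels": [{"label": name} for name in labels],
    }


# Hypothetical label set; adjust to match the real sentiments.json.
config = build_label_category_config(["Positive", "Negative", "Neutral"])
print(json.dumps(config, indent=2))
```

The sketch only prints the structure, so it does not overwrite the <i>sentiments.json</i> file shipped with the class material.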

In [6]:
!aws s3 cp resources/input.manifest s3://$bucket_name/labeling_data_component/labeling_data_input/
!aws s3 cp resources/sentiments.json s3://$bucket_name/labeling_data_component/labeling_data_input/

upload: resources/input.manifest to s3://twitter-sentiment-hl/labeling_data_component/labeling_data_input/input.manifest
upload: resources/sentiments.json to s3://twitter-sentiment-hl/labeling_data_component/labeling_data_input/sentiments.json
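
The verification step can also be done in code rather than by reading the upload messages. The following is a minimal sketch, assuming an S3 client created with `boto3.client("s3")` is passed in; `list_uploaded_keys` and `missing_files` are hypothetical helper names.

```python
def list_uploaded_keys(s3_client, bucket, prefix):
    """Return all object keys under the prefix, following pagination."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def missing_files(keys, prefix, expected_names):
    """Return the expected file names that are not present under the prefix."""
    present = {key[len(prefix):] for key in keys if key.startswith(prefix)}
    return [name for name in expected_names if name not in present]


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# prefix = "labeling_data_component/labeling_data_input/"
# keys = list_uploaded_keys(boto3.client("s3"), bucket_name, prefix)
# print(missing_files(keys, prefix, ["input.manifest", "sentiments.json"]))
```

An empty list from `missing_files` means that both files are where the labeling job expects them.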


### Create a job from the console

We will create a new job using the SageMaker console. After this exercise, we will:

* understand a few concepts, and
* have a labeler account set up with your personal email. This is a one-time step; you won't need to do it again.

Here are the steps we will go through:

* Generate a bucket policy which gives labeling jobs access to the bucket.
* Update our <code>ConcurMLWorkshopUse</code> role with that policy.
* Create a job using the SageMaker console.

#### Give role access to S3 bucket

Even S3FullAccess is not enough for the labeling UI to access the S3 bucket. You must also update your role with bucket-specific permissions, as shown below. More information about this can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-security-permission-console-access.html).

In [10]:
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}"
            ]
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Effect": "Allow",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}/*"
            ]
        }
    ]
}
print(json.dumps(policy, indent=2))

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::twitter-sentiment-hl"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::twitter-sentiment-hl/*"
      ]
    }
  ]
}
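
The policy can be pasted into the IAM console, or attached in code. The sketch below shows one possible way to do the latter, assuming an IAM client from `boto3.client("iam")` and a caller that is allowed to modify the role. The helper name `attach_bucket_policy` and the policy name in the usage comment are hypothetical.

```python
import json


def attach_bucket_policy(iam_client, role_name, policy_name, policy):
    """Attach the policy dict to the role as an inline policy."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),  # IAM expects a JSON string
    )


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# attach_bucket_policy(
#     boto3.client("iam"),
#     "ConcurMLWorkshopUse",
#     "workshop-labeling-bucket-access",  # hypothetical policy name
#     policy,
# )
```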


#### Activity after the first job is created

* Check email to complete registration as labeler.
* Verify that the new job is listed in SageMaker.
* Verify that a new work team is listed in SageMaker.
* Verify that your email address is listed in the team in the **Confirmed** status.
* Complete the labeling tasks.

We will review the labeling output later, since AWS takes a little time to produce it.

### Create a job using code

Now we will create a similar job using code. After this, you will be able to

* easily repeat the process to create a new job.
* customize your jobs going forward.

#### Configurations

AWS allows you to specify pre-processing and post-processing Lambda functions for a labeling job, which gives you an opportunity to plug in your own logic to process the task data before it is sent to the labeling UI and after it comes back.

* AWS provided [PreHumanTaskLambdaArn](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#SageMaker-Type-HumanTaskConfig-PreHumanTaskLambdaArn) 
* AWS provided [AnnotationConsolidationLambdaArn](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AnnotationConsolidationConfig.html) 
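
To make the role of the pre-processing Lambda more concrete, here is a minimal sketch of what a custom pre-annotation handler might look like. It is not the code of the AWS-provided `PRE-TextMultiClass` function. It assumes the event carries the manifest line under `dataObject` and that the UI template reads `taskObject` and `entity` from the task input; both key names depend on the template and are assumptions here.

```python
def lambda_handler(event, context):
    """Map one manifest record to the input that the UI template receives."""
    data_object = event["dataObject"]  # one parsed line of the input manifest
    return {
        "taskInput": {
            # Key names are assumptions; they must match what the template reads.
            "taskObject": data_object.get("source"),
            "entity": data_object.get("entity"),
        },
        "isHumanAnnotationRequired": "true",
    }
```

In this notebook the AWS-provided functions are used, so a handler like this is only needed when the built-in behavior does not fit the template.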

In [15]:
job_name = 'workshop-hl2'
role_arn = 'arn:aws:iam::786774050055:role/ConcurMLWorkshopUse'
team_arn = 'arn:aws:sagemaker:us-west-2:786774050055:workteam/private-crowd/workshop'
pre_lambda_arn = "arn:aws:lambda:us-west-2:081040173940:function:PRE-TextMultiClass"
post_lambda_arn = "arn:aws:lambda:us-west-2:081040173940:function:ACS-TextMultiClass"

human_task_config = {
    "WorkteamArn": team_arn,
    "UiConfig": {
      "UiTemplateS3Uri": f"s3://{bucket_name}/labeling_data_component/labeling_data_input/sentiment-analysis-tweet.liquid"
    },
    "PreHumanTaskLambdaArn": pre_lambda_arn,
    "TaskTitle": "Please pick the proper sentiment for the tweet",
    "TaskDescription": "Please pick the proper sentiment for the tweet",
    "NumberOfHumanWorkersPerDataObject": 1,
    "TaskTimeLimitInSeconds": 3600,
    "TaskAvailabilityLifetimeInSeconds": 864000,
    "MaxConcurrentTaskCount": 1000,
    "AnnotationConsolidationConfig": {
      "AnnotationConsolidationLambdaArn": post_lambda_arn
    }
}

output_path = f"s3://{bucket_name}/labeling_data_component/labeling_data_output/"

#### Invoking SageMaker API

In [16]:
def create_labeling_job(job_name, role_arn, bucket_name, team_arn):
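    # Labeling job names must be unique within an account and region,
    # so this call fails if the name is already in use.
    # human_task_config and output_path come from the configuration cell above.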
    REGION = boto3.session.Session().region_name
    sagemaker_client = boto3.client('sagemaker', REGION)
    response = sagemaker_client.create_labeling_job(
        LabelingJobName = job_name,
        LabelAttributeName = "sentiment",
        InputConfig={
            'DataSource': {
                'S3DataSource': {
                    'ManifestS3Uri': f's3://{bucket_name}/labeling_data_component/labeling_data_input/input.manifest'
                }
            }
        },
        OutputConfig={
            'S3OutputPath': f'{output_path}',
            'KmsKeyId': 'alias/aws/s3'
        },
        RoleArn=role_arn,
        LabelCategoryConfigS3Uri = f"s3://{bucket_name}/labeling_data_component/labeling_data_input/sentiments.json",
        HumanTaskConfig=human_task_config
    )
    print(response)

create_labeling_job(job_name, role_arn, bucket_name, team_arn)

{'LabelingJobArn': 'arn:aws:sagemaker:us-west-2:786774050055:labeling-job/workshop-hl2', 'ResponseMetadata': {'RequestId': '67e80f3f-c023-4637-b2fe-2c4e61af0170', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '67e80f3f-c023-4637-b2fe-2c4e61af0170', 'content-type': 'application/x-amz-json-1.1', 'content-length': '87', 'date': 'Sun, 03 Apr 2022 18:51:35 GMT'}, 'RetryAttempts': 0}}
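
After the job is created, its progress can be checked from code as well as from the console. The sketch below assumes the response of the SageMaker `describe_labeling_job` call is passed in and reads only its status and label counters; `summarize_labeling_job` is a hypothetical helper name.

```python
def summarize_labeling_job(description):
    """Reduce a describe_labeling_job response to status and progress counts."""
    counters = description.get("LabelCounters", {})
    return {
        "status": description.get("LabelingJobStatus"),
        "labeled": counters.get("TotalLabeled", 0),
        "unlabeled": counters.get("Unlabeled", 0),
    }


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# sagemaker_client = boto3.client("sagemaker")
# description = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)
# print(summarize_labeling_job(description))
```

The output manifest is only worth downloading once the status is `Completed`.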


## Output ETL

The raw output from SageMaker is not in a form that a training program can consume easily. We can use an ETL step to extract the data and make it available through Athena. In this step we will do the following:

* Download and inspect the output.manifest
* Create an ETL script and run it to generate a csv
* Create an Athena table that points to the csv file generated
* Query the table and review the data

### Download the output.manifest

Note that AWS generates the output manifest only after all labeling tasks are complete.

In [24]:
job_name='twitter-test-3'
!aws s3 cp s3://$bucket_name/labeling_data_component/labeling_data_output/$job_name/manifests/output/output.manifest resources/output.manifest > /dev/null
with open('resources/output.manifest', 'r') as mft:
    output = mft.read()
    print(output)

{"source":"Colt is trying it again. You guys ready for some sweet sweet music. Shhhhhh and listen","entity":"Xbox(Xseries)","tweet_id":13085,"twitter-test-3":1,"twitter-test-3-metadata":{"class-name":"Negative","job-name":"labeling-job/twitter-test-3","confidence":0,"type":"groundtruth/text-classification","human-annotated":"yes","creation-date":"2022-04-03T03:43:02.232662"}}
{"source":"totally absolutely love the new","entity":"Borderlands","tweet_id":13137,"twitter-test-3":1,"twitter-test-3-metadata":{"class-name":"Negative","job-name":"labeling-job/twitter-test-3","confidence":0,"type":"groundtruth/text-classification","human-annotated":"yes","creation-date":"2022-04-03T03:43:02.232671"}}
{"source":"@IdleSloth1984 what the hell do you mean? Xbox x is litterly a pc. What the hell is the use of buying a worse pc. At least the ps5 will have a better controller. And more. Plus we actually HAVE vr. You restarted xbot","entity":"Xbox(Xseries)","tweet_id":13009,"twitter-test-3":0,"twitter-

### Elements in output manifest

* The **source**.
* The label **class-name** under **twitter-test-3-metadata**.
* The **confidence**. For more detail about how this is calculated, see the corresponding [AWS help page](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-data-output.html#sms-output-confidence).
* Other pass-through fields: **tweet_id** and **entity**.
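
The elements listed above can be pulled out of a single record as follows. This is a small sketch that uses the second record from the output shown earlier; it assumes each record has exactly one key ending in `-metadata`, and `extract_label` is a hypothetical helper name.

```python
import json


def extract_label(record):
    """Pull the label and pass-through fields out of one parsed output record."""
    for key, value in record.items():
        # The metadata key is named after the label attribute, e.g. "twitter-test-3-metadata".
        if key.endswith("-metadata") and isinstance(value, dict):
            return {
                "tweet_id": record["tweet_id"],
                "entity": record["entity"],
                "sentiment": value["class-name"],
                "confidence": value["confidence"],
                "tweet_text": record["source"],
            }
    raise KeyError("no '-metadata' entry found in record")


sample_line = '{"source":"totally absolutely love the new","entity":"Borderlands","tweet_id":13137,"twitter-test-3":1,"twitter-test-3-metadata":{"class-name":"Negative","job-name":"labeling-job/twitter-test-3","confidence":0,"type":"groundtruth/text-classification","human-annotated":"yes","creation-date":"2022-04-03T03:43:02.232671"}}'
print(extract_label(json.loads(sample_line)))
```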

### Do the ETL

In [25]:
with open('resources/output.manifest', 'r') as output:
    outlines = output.readlines()

with open('resources/outputetl.csv', 'w') as outetl:
    header = ['tweet_id', 'entity', 'sentiment', 'confidence', 'tweet_text']
    outetl.write('\t'.join(header))
    outetl.write('\n')
    for line in outlines:
        data = json.loads(line)
        output = [str(data['tweet_id'])]
        output.append(data['entity'])
        for key in data:
            if key.endswith('-metadata'):
                output.append(data[key]['class-name'])
                output.append(str(data[key]['confidence']))
                                   
        output.append(data['source'])
        outetl.write('\t'.join(output))
        outetl.write('\n')
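
The cell above joins fields with tabs directly, so a tweet that itself contains a tab or a line break would corrupt its row. As an alternative sketch, the same ETL can be written with Python's `csv` module, which quotes such fields. The function name `manifest_to_tsv` is hypothetical, flattening line breaks to spaces is a design choice made here, and whether Athena reads every quoted edge case identically has not been verified.

```python
import csv
import json

HEADER = ["tweet_id", "entity", "sentiment", "confidence", "tweet_text"]


def manifest_to_tsv(manifest_path, tsv_path):
    """Convert an output manifest to a tab-separated file; return the row count."""
    rows = 0
    with open(manifest_path, "r") as src, open(tsv_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", quotechar='"', lineterminator="\n")
        writer.writerow(HEADER)
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            data = json.loads(line)
            # Assumes exactly one "<label-attribute>-metadata" entry per record.
            meta = next(v for k, v in data.items() if k.endswith("-metadata"))
            # Keep every record on one physical line.
            text = " ".join(data["source"].splitlines())
            writer.writerow(
                [data["tweet_id"], data["entity"], meta["class-name"], meta["confidence"], text]
            )
            rows += 1
    return rows
```

Calling `manifest_to_tsv('resources/output.manifest', 'resources/outputetl.csv')` would produce the same columns as the cell above.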

In [26]:
!aws s3 cp resources/outputetl.csv s3://$bucket_name/labeling_data_component/labeling_data_etl/

upload: resources/outputetl.csv to s3://twitter-sentiment-hl/labeling_data_component/labeling_data_etl/outputetl.csv


### Create the Athena database and table

If you have not yet created the Athena database, do the following:
1. Navigate to Athena in the AWS console.
2. Make sure that you are in the us-west-2 (Oregon) region.
3. In Settings, click Manage and make sure that
   * you specify the logging folder,
   * your own account is the bucket owner, and
   * you give yourself full control over the query results.
4. Run the command below to create an Athena database:
   > CREATE DATABASE `ml-workshop-db`
5. Create the Athena table:
   * Run the cell below to generate the table DDL, and
   * Execute the DDL in Athena to create the table.

In [27]:
ddl_query = f"""
CREATE EXTERNAL TABLE `labeling_output`(
  `tweet_id` string COMMENT 'from deserializer',
  `entity` string COMMENT 'from deserializer',
  `sentiment` string COMMENT 'from deserializer',
  `confidence` string COMMENT 'from deserializer',
  `tweet_text` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'escapeChar'='\\\\',
  'quoteChar'='\\"',
  'separatorChar'='\\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://{bucket_name}/labeling_data_component/labeling_data_etl'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'skip.header.line.count'='1',
  'transient_lastDdlTime'='1645737537')
"""

print(ddl_query)


CREATE EXTERNAL TABLE `labeling_output`(
  `tweet_id` string COMMENT 'from deserializer',
  `entity` string COMMENT 'from deserializer',
  `sentiment` string COMMENT 'from deserializer',
  `confidence` string COMMENT 'from deserializer',
  `tweet_text` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'escapeChar'='\\',
  'quoteChar'='\"',
  'separatorChar'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://twitter-sentiment-hl/labeling_data_component/labeling_data_etl'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'skip.header.line.count'='1',
  'transient_lastDdlTime'='1645737537')
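
Once the table exists, it can be queried from the notebook. The sketch below works with any DB-API cursor, including the one returned by a <b>pyathena</b> connection. The helper names and the query are illustrative, and the S3 staging location in the usage comment is a hypothetical path that you would replace with your own query results folder.

```python
def sentiment_counts_sql(table):
    """Build a query that counts labeled tweets per sentiment."""
    return (
        f"SELECT sentiment, COUNT(*) AS n FROM {table} "
        "GROUP BY sentiment ORDER BY n DESC"
    )


def run_query(cursor, sql):
    """Execute a statement on a DB-API cursor and return all rows."""
    cursor.execute(sql)
    return cursor.fetchall()


# Usage with Athena (needs AWS credentials, so it is left commented out):
# from pyathena import connect
# conn = connect(
#     s3_staging_dir=f"s3://{bucket_name}/athena-results/",  # hypothetical location
#     region_name="us-west-2",
#     schema_name="ml-workshop-db",
# )
# print(run_query(conn.cursor(), sentiment_counts_sql("labeling_output")))
```

Passing the cursor in keeps the helpers independent of Athena, so they can be tried against a local database first.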



## Wrap up - What is possible?

A demo of a custom bounding box NER template