# Ground Truth Labeling

<img src="//d1.awsstatic.com/product-marketing/product-page-diagram_SageMaker-Ground-Truth-Plus.b07ea09f6243c1a8a2358c704ce2a227c78b0153.png" width="100%" alt="How Amazon SageMaker Ground Truth Plus works" title="How Amazon SageMaker Ground Truth Plus works">

Steps:

* Decide what to present to the labeler: the UI template.
* Prepare the input data.
* Create the labeling job.
* Label the data.
* Run ETL on the labeled output.
* Train the model.

## Install Dependencies

<h4>Here we install the dependencies. <b>pyathena</b> will allow us to query the data with SQL through Amazon Athena, as if it were a database.</h4>

In [104]:
!pip install pyathena



In [105]:
from pyathena import connect
import pandas as pd
import json
import boto3

## Decide UI Template to Use

The UI template dictates the format of the input data for labeling. AWS maintains a list of templates that you can use as-is or modify to make your own: https://github.com/aws-samples/amazon-sagemaker-ground-truth-task-uis. Some of them are the same out-of-the-box templates that the Ground Truth console offers.

In [111]:
!aws s3 cp resources/sentiment-analysis-tweet.liquid s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/

upload: resources/sentiment-analysis-tweet.liquid to s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/sentiment-analysis-tweet.liquid
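
The same upload can also be done from Python rather than through the AWS CLI. The following is a minimal sketch, not the method used in this notebook: `split_s3_uri` and `upload_to_s3` are hypothetical helper names, and the S3 client is passed in so that the caller controls credentials and Region.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_to_s3(s3_client, local_path, s3_uri):
    """Upload a local file to the exact object key named by s3_uri."""
    bucket, key = split_s3_uri(s3_uri)
    if not key or key.endswith("/"):
        # Unlike `aws s3 cp`, upload_file needs a full object key, not a prefix.
        raise ValueError("s3_uri must name an object, not a prefix")
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, key)
    return bucket, key
```

With `s3 = boto3.client("s3")`, a call such as `upload_to_s3(s3, "resources/sentiment-analysis-tweet.liquid", "s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/sentiment-analysis-tweet.liquid")` should be equivalent to the CLI command above.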


## Prepare Raw Data

Below is some tweet data that we need to label using Amazon SageMaker Ground Truth.

In [106]:
rawdata = pd.read_csv('resources/tweet_data.csv', delimiter='\t')
print(rawdata.head())
rawdata = rawdata.reset_index()  # reset to a 0..n-1 index; the previous index is kept as an 'index' column

   tweet_id                                         tweet_text
0     13085  Colt is trying it again. You guys ready for so...
1     13137                    totally absolutely love the new
2     13009  @IdleSloth1984 what the hell do you mean? Xbox...
3     13049  Please win that I was forever Xbox gang, but I...
4     13071  It is hard to overcome how devastating the del...


## Generate Input Manifest

The manifest data format is determined by the task template. AWS has a few out-of-the-box templates, and the manifest data formats they support are documented at https://docs.aws.amazon.com/sagemaker/latest/dg/sms-supported-data-formats.html

In [107]:
with open('resources/input.manifest', 'w') as mft:
    for index, row in rawdata.iterrows():
        task = {
            'source': row['tweet_text'],
            'tweet_id': row['tweet_id']            
        }
        mft.write(json.dumps(task))
        mft.write("\n")
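
Before uploading, it can be worth checking that the manifest is well-formed JSON Lines: each line should be a single JSON object that carries either a `source` or a `source-ref` key. The sketch below is one way to do that check; `check_manifest_lines` is a hypothetical helper, and treating blank lines as errors is a conservative assumption.

```python
import json


def check_manifest_lines(lines):
    """Return a list of (line_number, problem) for lines that look invalid."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            # Assumption: flag blank lines rather than silently skipping them.
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
        elif "source" not in record and "source-ref" not in record:
            problems.append((number, "missing 'source' or 'source-ref'"))
    return problems
```

For the file written above, `check_manifest_lines(open('resources/input.manifest'))` should return an empty list.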

In [110]:
!aws s3 cp resources/input.manifest s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/
!aws s3 cp resources/sentiments.json s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/

upload: resources/input.manifest to s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/input.manifest
upload: resources/sentiments.json to s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/sentiments.json
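
The contents of `sentiments.json` are not shown in this notebook. As a rough sketch, a Ground Truth label category configuration file is a JSON document with a `document-version` field and a list of labels. The label names below are partly an assumption: Negative and Irrelevant appear in the prediction output later in this notebook, while Positive and Neutral are guesses. `build_label_category_config` is a hypothetical helper.

```python
import json


def build_label_category_config(labels):
    """Build a label category config dict for a list of label names."""
    if len(set(labels)) != len(labels):
        raise ValueError("label names must be unique")
    return {
        "document-version": "2018-11-28",
        "labels": [{"label": name} for name in labels],
    }


# Assumed label set; replace it with the classes your task actually uses.
config = build_label_category_config(["Positive", "Negative", "Neutral", "Irrelevant"])
print(json.dumps(config, indent=2))
```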


## Create the Labeling Job

* The AWS-provided PreHumanTaskLambdaArn values are listed at https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#SageMaker-Type-HumanTaskConfig-PreHumanTaskLambdaArn
* The AWS-provided AnnotationConsolidationLambdaArn values are listed at https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AnnotationConsolidationConfig.html

In [109]:
def create_labeling_job(job_name):
    human_task_config = {
        "WorkteamArn": "arn:aws:sagemaker:us-west-2:995383923238:workteam/private-crowd/workshop",
        "UiConfig": {
          "UiTemplateS3Uri": "s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/sentiment-analysis-tweet.liquid"
        },
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-west-2:081040173940:function:PRE-TextMultiClass",
        "TaskTitle": "Please pick the proper sentiment for the tweet",
        "TaskDescription": "Please pick the proper sentiment for the tweet",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 3600,
        "TaskAvailabilityLifetimeInSeconds": 864000,
        "MaxConcurrentTaskCount": 1000,
        "AnnotationConsolidationConfig": {
          "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-west-2:081040173940:function:ACS-TextMultiClass"
        }
    }

    output_path = "s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_output/"

    # Labeling job names must be unique within the account and Region;
    # the call fails if job_name is already in use.
    REGION = boto3.session.Session().region_name
    sagemaker_client = boto3.client('sagemaker', REGION)
    response = sagemaker_client.create_labeling_job(
        LabelingJobName = job_name,
        LabelAttributeName = "sentiment",
        InputConfig={
            'DataSource': {
                'S3DataSource': {
                    'ManifestS3Uri': 's3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/input.manifest'
                }
            }
        },
        OutputConfig={
            'S3OutputPath': output_path,
            'KmsKeyId': 'alias/aws/s3'
        },
        RoleArn="arn:aws:iam::995383923238:role/service-role/AmazonSageMaker-ExecutionRole-20220326T142852",
        LabelCategoryConfigS3Uri="s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/sentiments.json",
        HumanTaskConfig=human_task_config,
        Tags=[{"Key":"RoleType", "Value":"dsml"}]
    )
    return response  # contains the LabelingJobArn of the new job

job_name='workshop-hl7'
create_labeling_job(job_name)
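
Ground Truth writes the output manifest only once the job has finished, so it can help to poll the job status after submitting it. The following is a minimal sketch using the SageMaker `describe_labeling_job` call; `wait_for_labeling_job` is a hypothetical helper, and the 60-second polling interval is an arbitrary choice.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_labeling_job(sagemaker_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll describe_labeling_job until the job reaches a terminal status."""
    while True:
        description = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)
        status = description["LabelingJobStatus"]
        counters = description.get("LabelCounters", {})
        print(f"{job_name}: {status}, "
              f"labeled={counters.get('TotalLabeled')}, "
              f"unlabeled={counters.get('Unlabeled')}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

With `sagemaker_client = boto3.client('sagemaker')`, calling `wait_for_labeling_job(sagemaker_client, job_name)` blocks until the workers have finished or the job has failed or been stopped.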

## Output ETL

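<h4>AWS generates the output manifest only after the labeling job is complete. The cells below read the output of an earlier job, workshop-hl2; substitute the name of your own job.</h4>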

In [93]:
!aws s3 cp s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_output/workshop-hl2/manifests/output/output.manifest resources/output.manifest

download: s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_output/workshop-hl2/manifests/output/output.manifest to resources/output.manifest


In [94]:
with open('resources/output.manifest', 'r') as manifest:
    outlines = manifest.readlines()

# Note: the file is tab-separated despite its .csv extension.
with open('resources/outputetl.csv', 'w') as outetl:
    header = ['tweet_id', 'sentiment', 'confidence', 'tweet_text']
    outetl.write('\t'.join(header))
    outetl.write('\n')
    for line in outlines:
        data = json.loads(line)
        fields = [str(data['tweet_id'])]
        # Ground Truth stores the label details under '<LabelAttributeName>-metadata'.
        for key in data:
            if key.endswith('-metadata'):
                fields.append(data[key]['class-name'])
                fields.append(str(data[key]['confidence']))
        fields.append(data['source'])
        outetl.write('\t'.join(fields))
        outetl.write('\n')
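
For reference, a line of the output manifest for this kind of job is expected to look roughly like the sample below. The values are made up for illustration (including the label, confidence, and date), and the layout assumes the text classification output format with the label attribute name `sentiment` used when the job was created. `manifest_line_to_row` is a hypothetical helper that does the same extraction as the ETL cell for a single line.

```python
import json

# Illustrative sample only; real lines come from output.manifest.
sample_line = json.dumps({
    "source": "totally absolutely love the new",
    "tweet_id": 13137,
    "sentiment": 0,
    "sentiment-metadata": {
        "class-name": "Positive",
        "confidence": 0.95,
        "type": "groundtruth/text-classification",
        "human-annotated": "yes",
        "creation-date": "2022-03-27T00:00:00.000000",
        "job-name": "labeling-job/workshop-hl2",
    },
})


def manifest_line_to_row(line, label_attribute="sentiment"):
    """Turn one output-manifest line into [tweet_id, sentiment, confidence, tweet_text]."""
    data = json.loads(line)
    meta = data[f"{label_attribute}-metadata"]
    return [str(data["tweet_id"]), meta["class-name"], str(meta["confidence"]), data["source"]]


print("\t".join(manifest_line_to_row(sample_line)))
```

Looking up the metadata by the label attribute name, instead of scanning for any key that ends in `-metadata`, keeps the row at four columns even if a manifest carries metadata from more than one labeling job.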

In [95]:
!aws s3 cp resources/outputetl.csv s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_etl/

upload: resources/outputetl.csv to s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_etl/outputetl.csv


## Next Step

<h4>Once the model is trained, it can produce predictions, but some will have low scores initially. Let's label those low-scoring tweets so that they can be used to improve the model.</h4>

<img src="//d1.awsstatic.com/a2i/Product-Page-Diagram_A2I%402x.2fe2e8e5eed05fa5045835c3f3f04e23a9245e9c.png" alt="Product-Page-Diagram_A2I@2x" title="Product-Page-Diagram_A2I@2x">

In [99]:
conn = connect(s3_staging_dir="s3://aws-athena-query-results-us-west-2-995383923238/",
               region_name="us-west-2")
# Treat missing or non-numeric scores as 0.0 and keep predictions that score below 0.8.
lowscore = pd.read_sql_query("""
    SELECT tweet_id, COALESCE(TRY(CAST(score AS double)), 0.0) AS score, sentiment, tweet_text
    FROM "ml-workshop-db"."labeling_prediction"
    WHERE COALESCE(TRY(CAST(score AS double)), 0.0) < 0.8
""", conn)
print(lowscore.head())

  tweet_id  score   sentiment  \
0    13179   0.42    Negative   
1     8413   0.23    Negative   
2     8419   0.62  Irrelevant   
3     8467   0.52    Negative   
4     8516   0.41    Negative   

                                          tweet_text  
0  Ok so this is getting to bother me... Even Mic...  
1  keep it a bean, the shooting is a huge problem...  
2  I remember when I called Danny Green a garbage...  
3  It's funny. Last year I won my first 20 or som...  
4  @NBA2K I brought it worth I vc it’s not showin...  
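
The `COALESCE(TRY(CAST(score AS double)), 0.0)` expression in the query can be read as "parse the score, and fall back to 0.0 if it is missing or malformed". The sketch below approximates that logic in plain Python for rows held as dictionaries. `safe_score` and `low_score_rows` are hypothetical names, and Python's `float` parsing may not match Athena's in every edge case.

```python
def safe_score(value, default=0.0):
    """Parse a score as a float, falling back to `default` when it is missing or malformed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def low_score_rows(rows, threshold=0.8):
    """Keep the rows whose parsed score is strictly below the threshold."""
    return [row for row in rows if safe_score(row.get("score")) < threshold]
```

Because unparseable scores become 0.0, such rows always fall below the threshold and are sent back for human labeling, which is the safer default for this workflow.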


In [102]:
with open('resources/input_predicted.manifest', 'w') as mft:
    for index, row in lowscore.iterrows():
        task = {
            'source': row['tweet_text'],
            'tweet_id': row['tweet_id']        
        }
        mft.write(json.dumps(task))
        mft.write("\n")

In [103]:
!aws s3 cp resources/input_predicted.manifest s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/

upload: resources/input_predicted.manifest to s3://datascience-ml-workshop-prep/labeling_data_component/labeling_data_input/input_predicted.manifest
