## Exploring the tweebank dataset using Autolabel

#### Set up the API keys for the providers you want to use

In [1]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = '<your-openai-api-key>'

#### Install the autolabel library

In [2]:
!pip install 'refuel-autolabel[openai]'





#### Download the dataset

In [2]:
from autolabel import get_data

get_data('tweebank')

Downloading example dataset from https://autolabel-benchmarking.s3.us-west-2.amazonaws.com/tweebank/seed.csv to seed.csv...
Downloading example dataset from https://autolabel-benchmarking.s3.us-west-2.amazonaws.com/tweebank/test.csv to test.csv...
100% [........................................] [174002/174002] bytes

This downloads two datasets:
* `test.csv`: This is the larger dataset we are trying to label using LLMs
* `seed.csv`: This is a small dataset where we already have human-provided labels
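Before labeling, it can help to look at what was downloaded. The following is a minimal sketch using only the standard library; `preview_csv` is a hypothetical helper, not part of Autolabel, and it assumes the files are comma-delimited with a header row.

```python
import csv
import os
from itertools import islice

def preview_csv(path, n_rows=3):
    """Return (column_names, first n_rows rows as dicts) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows

# Only runs if the download step above has already created seed.csv.
if os.path.exists("seed.csv"):
    columns, rows = preview_csv("seed.csv")
    print(columns)
    for row in rows:
        print(row)
```

The column names printed here should match the `text_column` and `label_column` entries in the labeling configuration.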

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `config.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

### First labeling run

In [2]:
import json

from autolabel import LabelingAgent

In [3]:
# load the config
with open('config_tweebank.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `named_entity_recognition` (since it's a named entity recognition task)
* `model`: `{'provider': 'llama', 'name': '/workspace/hf-relevant-sampling-2483'}` (the model provider and the local path of the model to load)
* `prompt.task_guidelines`: `'Your job is to extract all the named entities mentioned in text exactly as they appear...'` (how we describe the task to the LLM)
* `prompt.labels`: `['Location', 'Organization', 'Person', 'Miscellaneous']` (the full list of labels to choose from)
* `prompt.few_shot_num`: `0` (how many labeled examples to provide to the LLM; with 0, the run is zero-shot)

In [4]:
config

{'task_name': 'PersonLocationOrgMiscNER',
 'task_type': 'named_entity_recognition',
 'dataset': {'label_column': 'CategorizedLabels',
  'text_column': 'example',
  'delimiter': ','},
 'model': {'provider': 'llama',
  'name': '/workspace/hf-relevant-sampling-2483'},
 'prompt': {'task_guidelines': 'Your job is to extract all the named entities mentioned in text exactly as they appear, and classify them into the following categories: {labels}. Ensure that the output is in a JSON format, where keys are categories and values is a list of substrings corresponding to that category. Output only the JSON object and nothing else.\\n',
  'labels': ['Location', 'Organization', 'Person', 'Miscellaneous'],
  'example_template': 'Input: {example}\nOutput:\n{CategorizedLabels}',
  'few_shot_examples': 'seed.csv',
  'few_shot_selection': 'semantic_similarity',
  'few_shot_num': 0}}
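The `{labels}`, `{example}` and `{CategorizedLabels}` placeholders in the config are filled in when a prompt is built. The sketch below shows roughly how that substitution could work with plain `str.format`. It is an illustration only, not Autolabel's internal implementation: `render_prompt` is a hypothetical name, the guidelines string is shortened, and joining the labels with newlines is an assumption.

```python
def render_prompt(prompt_cfg, text):
    """Fill the config's placeholders for one unlabeled input (illustrative only)."""
    # Assumption: labels are joined with newlines; Autolabel may format them differently.
    guidelines = prompt_cfg["task_guidelines"].format(
        labels="\n".join(prompt_cfg["labels"])
    )
    # The label slot is left empty because the model is expected to produce it.
    example = prompt_cfg["example_template"].format(
        example=text, CategorizedLabels=""
    )
    return guidelines + "\n" + example

# Shortened copy of the 'prompt' section shown above.
prompt_cfg = {
    "task_guidelines": "Extract all named entities and classify them into "
                       "the following categories: {labels}.",
    "labels": ["Location", "Organization", "Person", "Miscellaneous"],
    "example_template": "Input: {example}\nOutput:\n{CategorizedLabels}",
}

print(render_prompt(prompt_cfg, "Peter Capaldi visited Monroe"))
```

Running `agent.plan` is the way to see the prompt Autolabel actually sends; the sketch only makes the role of each placeholder concrete.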

In [5]:
# create an agent for labeling
agent = LabelingAgent(config=config)

INFO 10-05 21:47:27 llm_engine.py:72] Initializing an LLM engine with config: model='/workspace/hf-relevant-sampling-2483', tokenizer='/workspace/hf-relevant-sampling-2483', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, download_dir=None, load_format=auto, tensor_parallel_size=1, seed=0)


2023-10-05 21:47:28 torch.distributed.distributed_c10d INFO: Added key: store_based_barrier_key:1 to store for rank: 0
2023-10-05 21:47:28 torch.distributed.distributed_c10d INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2023-10-05 21:47:28 torch.distributed.distributed_c10d INFO: Added key: store_based_barrier_key:2 to store for rank: 0
2023-10-05 21:47:28 torch.distributed.distributed_c10d INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
2023-10-05 21:47:28 torch.distributed.distributed_c10d INFO: Added key: store_based_barrier_key:3 to store for rank: 0
2023-10-05 21:47:28 torch.distributed.distributed_c10d INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 1 nodes.
2023-10-05 21:47:28 torch.distributed.distributed_c10d INFO: Added key: store_based_barrier_key:4 to store for rank: 0
2023-10-05 21:47:28 torch.distributed.distributed_c10d INFO: Rank 0: Completed stor

INFO 10-05 21:48:16 llm_engine.py:199] # GPU blocks: 1468, # CPU blocks: 327


In [6]:
# dry-run -- this tells us how much this will cost and shows an example prompt
from autolabel import AutolabelDataset
ds = AutolabelDataset("test.csv", config=config)
agent.plan(ds)


In [7]:
# now, do the actual labeling
ds = agent.run(ds, max_items=1000)


2023-10-05 21:55:41 autolabel.tasks.named_entity_recognition ERROR: unterminated string literal (detected at line 1) (<unknown>, line 1). Could not parse LLM output: {'Location': ['W.Monroe - Monroe'], 'Organization': ['Samaritan 's Purse'], 'Person': [], 'Miscellaneous': []}
2023-10-05 21:56:28 autolabel.tasks.named_entity_recognition ERROR: unterminated string literal (detected at line 1) (<unknown>, line 1). Could not parse LLM output: {'Location': [], 'Organization': [], 'Person': ['Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi', 'Peter Capaldi',

KeyboardInterrupt: 
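The first error above looks like what Python's own parser reports when a single-quoted string contains an unescaped apostrophe (`'Samaritan 's Purse'`). The sketch below reproduces that failure and shows that the same content parses once it is emitted as strict JSON with double quotes. It assumes the output is parsed as a Python literal, which the error message suggests but the log does not confirm; `parse_entities` is a hypothetical helper, not Autolabel's parser.

```python
import ast
import json

def parse_entities(raw):
    """Try strict JSON first, then a Python-literal fallback; return None if both fail."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    try:
        return ast.literal_eval(raw)
    except (SyntaxError, ValueError):
        return None

# Output copied from the first error message: the apostrophe ends the string early.
bad = ("{'Location': ['W.Monroe - Monroe'], 'Organization': ['Samaritan 's Purse'], "
       "'Person': [], 'Miscellaneous': []}")

# The same content as strict JSON; double quotes make the apostrophe harmless.
good = ('{"Location": ["W.Monroe - Monroe"], "Organization": ["Samaritan \'s Purse"], '
        '"Person": [], "Miscellaneous": []}')

print(parse_entities(bad))
print(parse_entities(good)["Organization"])
```

The second error is a different failure: the model repeats one entity until the output is cut off, so the string is incomplete and no quoting convention would make it parseable.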

In [None]:
ds.save("inference.csv")