---

# First Look at the Datasets
> Thorsten Trinkaus

---

---
## Imports
---

In [1]:
import os
import random
from datasets import load_dataset

---
## Load Datasets
---

While OntoNotes and FIGER are available as Hugging Face datasets, Ultra-Fine
Entity Typing has to be loaded manually from local files. For more information, see
Choi, E., Levy, O., Choi, Y., & Zettlemoyer, L. (2018). Ultra-Fine Entity Typing. In
Proceedings of the ACL. Association for Computational Linguistics (https://www.cs.utexas.edu/~eunsol/html_pages/open_entity.html).
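
Because the Ultra-Fine files are read from local paths, a quick existence check before calling `load_dataset` can surface a missing download early. The following is a minimal sketch: the helper name `missing_files` is hypothetical, and the paths simply mirror the ones used in this notebook.

```python
from pathlib import Path


def missing_files(data_files):
    """Return the split names whose file path does not exist on disk."""
    return [
        split
        for split, path in data_files.items()
        if not Path(path).is_file()
    ]


# Paths as used in this notebook; adjust them to the local layout.
crowd_files = {
    "train": "./ultra_fine/crowd/train.json",
    "validation": "./ultra_fine/crowd/dev.json",
    "test": "./ultra_fine/crowd/test.json",
}

missing = missing_files(crowd_files)
if missing:
    print("Missing splits:", missing)
```

An empty result means every split file was found, so the `load_dataset("json", ...)` call should at least not fail on a missing path.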

In [2]:
# Hugging Face Token
HF_TOKEN = os.getenv("HF_TOKEN")

# OntoNotes 5
# Pin the auto-converted Parquet revision: the default branch relies on a
# dataset loading script, which datasets 4.0 no longer supports.
ds_onto = load_dataset(
    "tner/ontonotes5", 
    revision="refs/convert/parquet", 
    token=HF_TOKEN
)

# FIGER
ds_figer = load_dataset("DGME/figer", token=HF_TOKEN)

# Ultra Fine Crowdsourced
ds_fine_crowd = load_dataset(
    "json",
    data_files={
        "train": "./ultra_fine/crowd/train.json",
        "validation": "./ultra_fine/crowd/dev.json",
        "test": "./ultra_fine/crowd/test.json"
    }
)

# Ultra Fine Distantly Supervised
# This dataset has no test split, so the dev file is reused as "test" below.
ds_fine_ds = load_dataset(
    "json",
    data_files={
        "train": "./ultra_fine/ds/el_train.json",
        "validation": "./ultra_fine/ds/el_dev.json",
        "test": "./ultra_fine/ds/el_dev.json"
    }
)

---
## First Tests
---

In [3]:
def print_random(ds, n=5, split="train"):
    """
    Print n random samples from the dataset.

    Args:
        ds: The dataset from which to sample.
        n: The number of samples (default is 5).
        split: The dataset split to sample from (default is "train").
    """
    size = len(ds[split])
    # Sample without replacement so the same entry is never printed twice.
    for idx in random.sample(range(size), k=min(n, size)):
        print(ds[split][idx])
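
`print_random` relies on the global `random` state, so each run prints different entries. If repeatable samples are wanted, the index selection could be seeded. This is only a sketch: `sample_indices` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random


def sample_indices(size, n=5, seed=None):
    """Draw up to n distinct indices from range(size)."""
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(range(size), k=min(n, size))


first = sample_indices(100, n=5, seed=0)
second = sample_indices(100, n=5, seed=0)
print(first == second)  # True: the same seed yields the same indices
```

The returned indices can then be used to look up entries, for example `ds["train"][idx]` for each `idx`.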

### OntoNotes

In [4]:
print_random(ds_onto)
print("Size of train split:", len(ds_onto["train"]))

{'tokens': ['US', 'Ambassador', 'Richard', 'Holbrooke', 'said', ',', 'President', 'Kostunica', 'is', 'to', 'be', 'congratulated', '.'], 'tags': [7, 0, 4, 5, 0, 0, 0, 4, 0, 0, 0, 0, 0]}
{'tokens': ['Total', 'truck', 'production', 'fell', '22', '%', 'from', 'a', 'year', 'earlier', 'to', '315,546', 'units', '.'], 'tags': [0, 0, 0, 0, 13, 14, 0, 2, 3, 3, 0, 1, 0, 0]}
{'tokens': ['Mr.', 'Levin', ',', 'former', 'head', 'of', 'EPA', "'s", 'regulatory', 'reform', 'staff', ',', 'adapted', 'this', 'from', 'his', 'November', 'column', 'for', 'the', 'Journal', 'of', 'the', 'Air', 'and', 'Waste', 'Management', 'Association', '.'], 'tags': [0, 4, 0, 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 11, 12, 12, 12, 12, 12, 12, 12, 12, 0]}
{'tokens': ['Thanks', 'Wolf', '/.'], 'tags': [0, 4, 0]}
{'tokens': ['So', 'how', 'are', 'they', 'getting', 'back', '?'], 'tags': [0, 0, 0, 0, 0, 0, 0]}
Size of train split: 59924
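
Each OntoNotes sample pairs a token list with an equally long list of integer tags. To see at a glance which tokens carry an entity tag, the two lists can be zipped. The sketch below assumes that tag 0 marks tokens outside any entity, which the samples above suggest; `tagged_tokens` is a hypothetical helper name.

```python
def tagged_tokens(sample):
    """Return (token, tag) pairs for all tokens with a non-zero tag."""
    return [
        (token, tag)
        for token, tag in zip(sample["tokens"], sample["tags"])
        if tag != 0  # assumption: 0 means "not part of an entity"
    ]


# First sample from the output above.
sample = {
    "tokens": ["US", "Ambassador", "Richard", "Holbrooke", "said", ",",
               "President", "Kostunica", "is", "to", "be", "congratulated", "."],
    "tags": [7, 0, 4, 5, 0, 0, 0, 4, 0, 0, 0, 0, 0],
}

print(tagged_tokens(sample))
# [('US', 7), ('Richard', 4), ('Holbrooke', 5), ('Kostunica', 4)]
```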


How many entries in the train split contain no named entities, i.e. have all tags equal to 0?

In [5]:
tags_col = ds_onto["train"]["tags"]
# Count non-empty tag sequences in which every tag is 0.
zero_counter = sum(1 for tags in tags_col if tags and not any(tags))
print(zero_counter)

26804
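
The absolute count is easier to judge relative to the split size. The short calculation below uses only the two numbers printed in this notebook; the variable names are illustrative.

```python
no_entity = 26804   # entries without any named entity (printed above)
train_size = 59924  # size of the OntoNotes train split (printed earlier)

share = no_entity / train_size
print(f"{share:.1%}")  # 44.7%
```

Under these numbers, a little under half of the training entries contain no entity at all.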


### FIGER

In [6]:
print_random(ds_figer)
print("Size of train split:", len(ds_figer["train"]))

{'mention_span': 'Psary', 'left_context_token': ['It', 'lies', 'approximately', '3kmmi0', 'north', 'of'], 'right_context_token': [',', '8kmmi0on', 'north', 'of', 'Bedzin', ',', 'and', '19kmmi0on', 'north-east', 'of', 'the', 'regional', 'capital', 'Katowice', '.'], 'y_str': ['/location/city', '/location']}
{'mention_span': 'Operation Barbarossa', 'left_context_token': ['The', 'Southern', 'Front', 'directed', 'military', 'operations', 'during', 'the', 'Soviet', 'occupation', 'of', 'Bessarabia', 'and', 'Northern', 'Bukovina', 'in', '1940', ',', 'and', 'then', 'was', 'formed', 'twice', 'after', 'the', 'June', '1941', 'German', 'invasion', ','], 'right_context_token': ['.'], 'y_str': ['/event/military_conflict', '/event']}
{'mention_span': 'Pennsylvania', 'left_context_token': ['Daniel', '"', 'Dan', '"', 'Onorato', '(', 'born', 'February', '5', ',', '1961', ')', 'is', 'the', 'current', 'Chief', 'Executive', 'of', 'Allegheny', 'County', ','], 'right_context_token': ['.'], 'y_str': ['/governm
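
The FIGER labels in `y_str` look like slash-separated paths, with a coarse type (`/location`) listed alongside a finer one (`/location/city`). Assuming this path structure encodes a type hierarchy, the sketch below splits a label into its levels and keeps only the most specific labels. Both helper names are hypothetical.

```python
def type_levels(label):
    """Split a label such as '/location/city' into ['location', 'city']."""
    return [part for part in label.split("/") if part]


def finest_labels(labels):
    """Keep labels that are not a path prefix (ancestor) of another label."""
    return [
        label
        for label in labels
        if not any(other.startswith(label + "/") for other in labels)
    ]


labels = ["/location/city", "/location"]  # y_str of the first sample above
print(type_levels(labels[0]))  # ['location', 'city']
print(finest_labels(labels))   # ['/location/city']
```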

### Ultra Fine

Crowdsourced

In [7]:
print_random(ds_fine_crowd)
print("Size of train split:", len(ds_fine_crowd["train"]))

{'annot_id': 'wex/20110513/1/35/206:5:1', 'mention_span': 'The reservoirs', 'right_context_token': ['were', 'officially', 'opened', 'on', '15', 'December', '1931', 'by', 'Governor', 'of', 'Hong', 'Kong', 'William', 'Peel', ',', 'becoming', 'the', 'fourth', 'and', 'last', 'reservoir', 'group', 'ever', 'built', 'on', 'Hong', 'Kong', 'Island', ',', 'after', 'Pok', 'Fu', 'Lam', ',', 'Tai', 'Tam', 'and', 'Wong', 'Nai', 'Chung', '.'], 'y_str': ['point'], 'left_context_token': []}
{'annot_id': 'APW_ENG_20001116.0360:19:659', 'mention_span': 'it', 'right_context_token': ['may', 'not', 'have', 'been', 'to', 'any', 'political', 'advantage', 'of', 'his', ',', "''", 'Peterson', 'said', '.', '``'], 'y_str': ['concept', 'idea', 'decision', 'choice', 'attempt', 'activity'], 'left_context_token': ['It', 'was', 'Bill', 'Clinton', 'who', 'opened', 'up', 'and', 'had', 'the', 'vision', 'for', 'moving', 'ahead', 'rather', 'aggressively', 'with', 'our', 'relationship', 'with', 'Vietnam', 'when', 'in', 'many
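
FIGER and both Ultra Fine datasets store a mention together with its tokenized left and right context. For reading samples, it can help to join these three fields back into one string with the mention marked. A minimal sketch, where `rebuild_sentence` and the bracket markers are hypothetical choices and the example sample is a shortened, illustrative one:

```python
def rebuild_sentence(sample, open_mark="[", close_mark="]"):
    """Join left context, marked mention, and right context with spaces."""
    left = " ".join(sample["left_context_token"])
    right = " ".join(sample["right_context_token"])
    mention = f"{open_mark}{sample['mention_span']}{close_mark}"
    # Skip empty parts, e.g. when the mention starts the sentence.
    return " ".join(part for part in (left, mention, right) if part)


# Illustrative sample in the same field layout as the outputs above.
toy = {
    "left_context_token": ["It", "lies", "north", "of"],
    "mention_span": "Psary",
    "right_context_token": [",", "north", "of", "Bedzin", "."],
}

print(rebuild_sentence(toy))
# It lies north of [Psary] , north of Bedzin .
```

The tokens are joined with plain spaces, so punctuation stays detached as in the tokenized data.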

Distantly Supervised

In [8]:
print_random(ds_fine_ds)
print("Size of train split:", len(ds_fine_ds["train"]))

{'left_context_token': ['is', 'a', 'parasitic', 'nematode', 'worm', 'that', 'lives', 'in', 'the', 'swimbladders', 'of', 'eels', '-LRB-'], 'y_str': ['country', 'geography', 'location', 'island'], 'right_context_token': ['spp', '.', '-RRB-'], 'y': [1, 3756, 546, 51], 'y_type': [0, 0, 0, 0], 'goal_y_str': ['/location/country', '/geography', '/location', '/geography/island'], 'y_type_str': ['KB', 'KB', 'KB', 'KB'], 'goal_y': [5, 30, 0, 39], 'annot_id': '1154849', 'mention_span': 'Anguilla'}
{'left_context_token': ['The', '2007-08', 'Philippine', 'Basketball', 'Association', '-LRB-', 'PBA', '-RRB-', 'Philippine', 'Cup', 'or', 'known', 'as', 'the', '2007-08', 'Smart', 'PBA', 'Philippine', 'Cup', 'for', 'sponsorship', 'reasons', ',', 'is', 'the', 'first', 'conference', 'of', 'the'], 'y_str': ['event'], 'right_context_token': ['PBA', 'season', '.'], 'y': [109], 'y_type': [0], 'goal_y_str': ['/event'], 'y_type_str': ['KB'], 'goal_y': [9], 'annot_id': '1865997', 'mention_span': '2007-08'}
{'left