## Managing Fine-Grained Data with `thresh`

In [1]:
!pip install -q thresh

In [1]:
from thresh import load_interface, convert_dataset

### Data Loading
When you collect data on thresh.tools, you will eventually have a large number of `<<data>>.json` files to manage.

Instead of writing a separate dataloader to verify the data for each interface (which leads to boilerplate code and can become quite complex), you can let the `thresh` library handle all of this for you!

This notebook will demonstrate the basic functionality of `load_interface` and `convert_dataset` to manage your data.

In [2]:
# Let's begin by downloading the SALSA typology and demo dataset
!curl -so salsa.yml https://thresh.tools/templates/salsa.yml
!curl -so salsa.json https://thresh.tools/data/salsa.json

In [3]:
# We can load any interface YML file into a Python object
SALSA = load_interface("salsa.yml")

# Here we can view some features from the interface
print(SALSA.template_label)
print(SALSA.template_description)
print(SALSA.citation)

💃 SALSA
Success and FAilure Linguistic Simplification Annotation
@article{heineman2023dancing,
  title={Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA},
  author={Heineman, David and Dou, Yao and Maddela, Mounica and Xu, Wei},
  journal={arXiv preprint arXiv:2305.14458},
  year={2023}
}


In [4]:
# We can load our JSON data by using our interface object and calling load_annotations()
salsa_data = SALSA.load_annotations("salsa.json")

# As we can see, it creates a custom object, with custom Edit and Annotation classes according
# to the SALSA typology
print(salsa_data[0])

SalsaEntry(
  annotator: annotator_1, 
  system: new-wiki-1/Human-2-written, 
  target: An important aspect of Fungi in Art is the protection of artwork from fungal damage. || Another important aspect is to make sure mycologists, artists, and society are all on the same page., 
  source: Further important aspects of Fungi in Art relate to preservation of artworks from fungal decay and contamination, as well as initiatives fostering and supporting works able to stimulate dialogues between mycologists (fungal researchers), artists, and society (as for example from the 'Massee Art Grant' by the British Mycological Society or works encouraged and supported by the Fungi Foundation)., 
  edits: [DeletionEdit(
    input_idx: [[259, 397]], 
    annotation: DeletionAnnotation(
      deletion_type: GoodDeletion(
        val: 3
      ), 
      coreference: False, 
      grammar_error: False
    )
  ), DeletionEdit(
    input_idx: [[216, 237]], 
    annotation: DeletionAnnotation(
      deletion_t

In [5]:
# We can read the attributes of an entry by accessing them directly
salsa_entry = salsa_data[0]

print(salsa_entry.system)
print(salsa_entry.annotator)
print(salsa_entry.edits[0])

new-wiki-1/Human-2-written
annotator_1
DeletionEdit(
  input_idx: [[259, 397]], 
  annotation: DeletionAnnotation(
    deletion_type: GoodDeletion(
      val: 3
    ), 
    coreference: False, 
    grammar_error: False
  )
)


In [6]:
# If we use our interface to create our own dataset, we can export it back to JSON
SALSA.export_data(salsa_data, "salsa_export.json")

### Internal Data Classes

The `thresh` library has four classes: `Interface`, `Entry`, `Edit`, and `Annotation`.

In this section, we will show how to use the Edit and Annotation classes.

In [7]:
# Get the Entry class defined by SALSA
SalsaEntry = SALSA.get_entry_class()

# Create a new Entry object
sent = SalsaEntry(
    system="custom_entry",
    annotator=None,
    source="This is our complex sentence.",
    target="This is our complex sentence."
)

print(sent)

# We can get individual fields by accessing attributes directly
print(sent.target)

SalsaEntry(
  system: custom_entry, 
  annotator: None, 
  source: This is our complex sentence., 
  target: This is our complex sentence.
)
This is our complex sentence.


In [8]:
# Let's take a look at the classes for each Edit
SALSA.edit_classes

{'deletion': thresh.names.DeletionEdit,
 'reorder': thresh.names.ReorderEdit,
 'insertion': thresh.names.InsertionEdit,
 'substitution': thresh.names.SubstitutionEdit,
 'split': thresh.names.SplitEdit,
 'structure': thresh.names.StructureEdit}

In [9]:
# We can use get_edit_class() to get a specific Edit class
InsertionEdit = SALSA.get_edit_class('insertion')

# Then, we can define a new Insertion edit
edit = InsertionEdit(
    output_idx=[[39, 40]]
)

print(edit)

InsertionEdit(
  output_idx: [[39, 40]]
)


In [10]:
# Let's add some edits to our sentence
sent.edits = []

DeletionEdit = SALSA.get_edit_class('deletion')
SubstitutionEdit = SALSA.get_edit_class('substitution')

sent.edits.append(
    SubstitutionEdit(
        input_idx=[[1, 4]],
        output_idx=[[8, 12]],
    )
)

sent.edits.append(
    DeletionEdit(
        input_idx=[[1, 5]],
    )
)

print(sent)

SalsaEntry(
  system: custom_entry, 
  annotator: None, 
  source: This is our complex sentence., 
  target: This is our complex sentence., 
  edits: [SubstitutionEdit(
    input_idx: [[1, 4]], 
    output_idx: [[8, 12]]
  ), DeletionEdit(
    input_idx: [[1, 5]]
  )]
)


In [11]:
# We can now use our typology to convert our data back to a raw format
raw_salsa_data = SALSA.to_dict([sent])
print(raw_salsa_data)

# Feel free to paste this into thresh.tools/?t=salsa to see a visualization!

[{'source': 'This is our complex sentence.', 'target': 'This is our complex sentence.', 'metadata': {'annotator': None, 'system': 'custom_entry'}, 'edits': [{'input_idx': [[1, 4]], 'output_idx': [[8, 12]], 'category': 'substitution'}, {'input_idx': [[1, 5]], 'category': 'deletion'}]}]
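
Because the raw format is a plain list of dictionaries, it can be inspected without any `thresh` classes. The sketch below tallies edits per category with `collections.Counter`. The sample is copied from the output above, and `count_categories` is a hypothetical helper name, not part of the `thresh` API.

```python
from collections import Counter

# Raw entries copied from the SALSA.to_dict([sent]) output above.
raw_salsa_data = [{
    "source": "This is our complex sentence.",
    "target": "This is our complex sentence.",
    "metadata": {"annotator": None, "system": "custom_entry"},
    "edits": [
        {"input_idx": [[1, 4]], "output_idx": [[8, 12]], "category": "substitution"},
        {"input_idx": [[1, 5]], "category": "deletion"},
    ],
}]


def count_categories(entries):
    """Hypothetical helper: count edits per category across raw entries."""
    return Counter(
        edit["category"]
        for entry in entries
        for edit in entry.get("edits", [])
    )


category_counts = count_categories(raw_salsa_data)
print(category_counts)
```

The same function should work on any list in this raw format, as long as every edit carries a `category` key as it does in the output above.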


In [12]:
# We can save our data file as well
SALSA.export_data(data=[sent], output_filename="salsa_custom_export.json")

In [13]:
# One side note: If you want to load multiple datasets, you can simply add the loaded datasets together:
# SALSA.load_annotations("dataset_1.json") + SALSA.load_annotations("dataset_2.json")
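
If you would rather merge at the file level before loading, the following is a minimal sketch using only the standard library. It assumes each file holds a top-level JSON list of entries, like the raw output printed earlier. `merge_json_lists` is a hypothetical helper, and the two temporary files merely stand in for `dataset_1.json` and `dataset_2.json`.

```python
import json
import tempfile
from pathlib import Path


def merge_json_lists(paths, output_path):
    """Hypothetical helper: concatenate several JSON list files into one file."""
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            entries = json.load(f)
        # Assumption: every file is a top-level list of entry dictionaries.
        if not isinstance(entries, list):
            raise ValueError(f"{path} does not contain a top-level JSON list")
        merged.extend(entries)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return merged


# Demo with two throwaway files standing in for real annotation files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "dataset_1.json")
    second = Path(tmp, "dataset_2.json")
    first.write_text(json.dumps([{"source": "a", "target": "b", "edits": []}]), encoding="utf-8")
    second.write_text(json.dumps([{"source": "c", "target": "d", "edits": []}]), encoding="utf-8")

    merged = merge_json_lists([first, second], Path(tmp, "merged.json"))
    merged_on_disk = json.loads(Path(tmp, "merged.json").read_text(encoding="utf-8"))

print(len(merged))
```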

### Data Conversion
Our standardized `thresh` data format is universal across fine-grained annotation schemes. To support existing data formats, we created dataloaders for external datasets.

We support the following datasets:
```
frank, scarecrow, mqm, snac, propaganda, fg-rlhf, arxivedits
```
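
Since the dataset name is passed to `convert_dataset` as a plain string, it can help to validate the name before starting a conversion. The sketch below is one way to do that. `normalize_dataset_name` is a hypothetical helper, not part of `thresh`, and lowercasing the input is an assumption made for convenience (this notebook does not establish how `thresh` itself treats case).

```python
# Names copied from the supported-datasets list above.
SUPPORTED_DATASETS = (
    "frank", "scarecrow", "mqm", "snac", "propaganda", "fg-rlhf", "arxivedits",
)


def normalize_dataset_name(name):
    """Hypothetical helper: map input such as 'SNaC' to a supported name."""
    key = name.strip().lower()
    if key not in SUPPORTED_DATASETS:
        raise ValueError(
            f"Unsupported dataset {name!r}; expected one of: "
            + ", ".join(SUPPORTED_DATASETS)
        )
    return key


dataset_name = normalize_dataset_name("SNaC")
print(dataset_name)
```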

In [14]:
# First, we'll download the SNaC dataset in its original form
!pip install -q gdown
import gdown
gdown.download('https://drive.google.com/uc?id=1ff-pV2sX9XNDMdaPxY7v22T2i0235tcE', 'original_snac_data.json', quiet=True)

# Then, we will convert it to our standard format
snac_raw = convert_dataset(
    dataset_name="snac",
    data_path="original_snac_data.json",
    output_path='snac_data_thresh_format.json' # <- (Optional) saves data to file
)

2023-08-13 21:50:04,911 - INFO - Dataset name: snac
2023-08-13 21:50:04,913 - INFO - Data path: original_snac_data.json
2023-08-13 21:50:04,915 - INFO - Reverse flag: False
2023-08-13 21:50:04,917 - INFO - Output path (Optional): snac_data_thresh_format.json
2023-08-13 21:50:04,931 - INFO - Converting dataset...
2023-08-13 21:50:05,021 - INFO - Done!
2023-08-13 21:50:05,023 - INFO - Saving to snac_data_thresh_format.json...


In [15]:
# The converted data is now in the standard JSON format, let's look at an example
snac_raw[0]

{'source': '',
 'target': 'Johnnie, a determined 19 year old girl, works at the cotton mill to support her family. She plans to pay back all the debts they have accumulated.',
 'edits': [{'id': 0, 'category': 'CharE', 'output_idx': [[0, 7]], 'votes': 1}]}
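
To see which text an edit covers, the spans can be sliced out of the entry directly. The sketch below assumes that `input_idx` indexes into `source`, that `output_idx` indexes into `target`, and that each span is a half-open `[start, end)` pair of character offsets. The entry shown above is consistent with that reading, but it is an assumption rather than a documented guarantee. `edit_spans` is a hypothetical helper name.

```python
# Entry copied from the snac_raw[0] output above.
snac_entry = {
    "source": "",
    "target": (
        "Johnnie, a determined 19 year old girl, works at the cotton mill "
        "to support her family. She plans to pay back all the debts they "
        "have accumulated."
    ),
    "edits": [{"id": 0, "category": "CharE", "output_idx": [[0, 7]], "votes": 1}],
}


def edit_spans(entry):
    """Hypothetical helper: yield (category, field, text) for every edit span.

    Assumes input_idx refers to 'source' and output_idx to 'target', with
    each span given as half-open [start, end) character offsets.
    """
    for edit in entry.get("edits", []):
        for key, field in (("input_idx", "source"), ("output_idx", "target")):
            for start, end in edit.get(key, []):
                yield edit["category"], field, entry[field][start:end]


spans = list(edit_spans(snac_entry))
print(spans)  # [('CharE', 'target', 'Johnnie')]
```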

In [16]:
# To begin using this, we'll download and use the thresh.tools SNaC typology
!curl -so snac.yml https://thresh.tools/templates/snac.yml

# Now we can load and view the data in the standardized thresh.tools format!
SNaC = load_interface("snac.yml")
snac = SNaC.load_annotations(snac_raw)

snac[0]

SnacEntry(
  target: Johnnie, a determined 19 year old girl, works at the cotton mill to support her family. She plans to pay back all the debts they have accumulated., 
  source: , 
  edits: [ChareEdit(
    output_idx: [[0, 7]], 
    votes: 1
  )]
)

In [17]:
# Conversion to the Thresh platform is not one-way: you can take data collected using
# Thresh and convert it back to the original dataset format with the "reverse" flag
convert_dataset(
    dataset_name="snac",
    data_path='snac_data_thresh_format.json',
    output_path='original_snac_data.json',
    reverse=True
)

2023-08-13 21:50:08,287 - INFO - Dataset name: snac
2023-08-13 21:50:08,288 - INFO - Data path: snac_data_thresh_format.json
2023-08-13 21:50:08,289 - INFO - Reverse flag: True
2023-08-13 21:50:08,289 - INFO - Output path (Optional): original_snac_data.json
2023-08-13 21:50:08,291 - INFO - Converting dataset...
2023-08-13 21:50:08,320 - INFO - Saving to original_snac_data.json...
2023-08-13 21:50:08,430 - INFO - Done!


### Examples: Loading Demo Interface Data
Now that we've covered data loading and dataset conversion, let's see some examples across different tasks.

These cells load the demo data for each interface as it appears on `thresh.tools`.

In [18]:
!curl -so multipit.yml https://thresh.tools/templates/multipit.yml
!curl -so multipit.json https://thresh.tools/data/multipit.json

MultiPIT = load_interface("multipit.yml")
multipit_data = MultiPIT.load_annotations("multipit.json")

multipit_data[0]

MultipitEntry(
  target: Relax, take a deep breath, and enjoy the moment., 
  source: Relax, take it easy., 
  edits: [AddNewEdit(
    annotation: None, 
    output_idx: [[14, 48]]
  )]
)

In [19]:
!curl -so frank.yml https://thresh.tools/templates/frank.yml
!curl -so frank.json https://thresh.tools/data/frank.json

FRANK = load_interface("frank.yml")
frank_data = FRANK.load_annotations("frank.json")

frank_data[0]

FrankEntry(
  target: inverness caledonian thistle came from behind to beat inverness caledonian thistle in the scottish premiership., 
  context: Brad McKay crouched to volley in Greg Tansey's deep free-kick early in the match.And Tansey converted a penalty after Massimo Donati had fouled Ross Draper.Accies were upset Ali Crawford was not awarded a second-half spot-kick for a challenge by goalkeeper Ryan Esson but netted late on through Danny Redmond.The gap between Caley Thistle and Motherwell also stands at four points, with Well behind Hamilton on goal difference after losing to Ross County.Media playback is not supported on this deviceThe first-half performance was exactly what Inverness manager Richie Foran has been searching for and came with their backs planted firmly against the wall.They were terrific. Adversity sometimes brings out the best in people, although nerves did seem to take effect after half-time.Foran has said for some time his side just needed one win to get goin

In [20]:
!curl -so cwzcc.yml https://thresh.tools/templates/cwzcc.yml
!curl -so cwzcc.json https://thresh.tools/data/cwzcc.json

CWZCC = load_interface("cwzcc.yml")
cwzcc_data = CWZCC.load_annotations("cwzcc.json")

cwzcc_data[0]

CwzccEntry(
  target: Pirmi man iyo ta sinti dwele. Nusabe yo porke pirmi ansina. Kwando iyo keda alegre? Kwando gaha pasa el diya o mes na iyo triste?, 
  source: , 
  edits: [UnintentionalEdit(
    annotation: UnintentionalAnnotation(
      error_type: NonRandomErrors(
        val: ArbitrarySpelling(
          val: Phonogramical(
            val: cognate_interference
          )
        )
      )
    ), 
    output_idx: [[0, 5]]
  ), UnintentionalEdit(
    annotation: UnintentionalAnnotation(
      error_type: NonRandomErrors(
        val: ArbitrarySpelling(
          val: Phonogramical(
            val: homophone_graphemes
          )
        )
      )
    ), 
    output_idx: [[10, 13]]
  ), UnintentionalEdit(
    annotation: UnintentionalAnnotation(
      error_type: NonRandomErrors(
        val: ArbitrarySpelling(
          val: Phonogramical(
            val: homophone_graphemes
          )
        )
      )
    ), 
    output_idx: [[17, 22]]
  ), UnintentionalEdit(
    annotatio

In [21]:
!curl -so scarecrow.yml https://thresh.tools/templates/scarecrow.yml
!curl -so scarecrow.json https://thresh.tools/data/scarecrow.json

SCARECROW = load_interface("scarecrow.yml")
scarecrow_data = SCARECROW.load_annotations("scarecrow.json")

scarecrow_data[0]

ScarecrowEntry(
  p: 0.96, 
  model: gpt3, 
  temperature: 1.0, 
  gid: 10001, 
  frequency_penalty: 0, 
  id: 1, 
  source: In the wild, animals display tender moments of affection all the time. Macedonian photographer Goran Anastasovski has spent 15 years honing his skills in wildlife photography so that he can share these touching instances with others., 
  target: To honor the effort he put into his latest set of photographs, we've gathered some of our favorites for your viewing pleasure. A lion and its cub enjoy a tender moment together. The lion's paws rest on top of the front paws of its cub, who lays its head on its mother. A polar bear embraces her cub. Polar bears care for their cubs until they're about two years old, when the cubs venture out on their own. A kangaroo rubs her cheek on her joey's hand., 
  edits: [NeedsGoogleEdit(
    output_idx: [[300, 366]], 
    annotation: NeedsGoogleAnnotation(
      explanation: I'm not sure if polar bear cubs stay with their mothers un

In [22]:
!curl -so snac.yml https://thresh.tools/templates/snac.yml
!curl -so snac.json https://thresh.tools/data/snac.json

SNaC = load_interface("snac.yml")
snac_data = SNaC.load_annotations("snac.json")

snac_data[0]

SnacEntry(
  target: It is also said that seeing Anytus pass by, Socrates remarked: "How proudly the great man steps; he thinks, no doubt, he has performed some great and noble deed in putting me to death, and all because, seeing him deemed worthy of the highest honours of the state, I told him it ill became him to bring up his so in a tan-yard." He adds that Homer has ascribed to some at the point of death a power of forecasting things, and he too is minded to utter a prophecy. Once, for a brief space, he associated with the son of Anytus, and he seemed, 
  source: , 
  edits: [SceneeEdit(
    output_idx: [[0, 42]], 
    votes: 1
  ), GrameEdit(
    output_idx: [[264, 326]], 
    votes: 1
  ), ChareEdit(
    output_idx: [[28, 34]], 
    votes: 1
  ), ChareEdit(
    output_idx: [[341, 346]], 
    votes: 1
  ), GrameEdit(
    output_idx: [[526, 539]], 
    votes: 2
  )]
)

In [23]:
!curl -so propaganda.yml https://thresh.tools/templates/propaganda.yml
!curl -so propaganda.json https://thresh.tools/data/propaganda.json

Propaganda = load_interface("propaganda.yml")
propaganda_data = Propaganda.load_annotations("propaganda.json")

propaganda_data[0]

PropagandaEntry(
  target: The US Is Blatantly Telling Lies
  
  It’s no secret that the Trump administration has a strong distaste for Iran.
  Iran is one of the only issues on which the U.S. president has remained relatively consistent.
  Trump berated the country both before and after taking office.
  However, Trump’s anti-Iran strategy goes against the better judgment of even the most anti-Iranian advisors in his staff who don’t want to see the U.S. isolated on the world stage.
  Fortunately for Trump, however, he is not alone in his bid to isolate and demonize Iran at all costs.
  On December 12, Trump’s ambassador to the U.N., Nikki Haley, gave a grandiose speech demonizing Iran that echoed Colin Powell’s infamous performance before the U.N. in 2003.
  Haley’s essential claim was that Saudi Arabia is under attack by missiles supplied to Yemen by the Iranian government and that the world should not sit idly by as this goes on.
  “If we do nothing about the missiles fired at Saudi 

In [24]:
!curl -so fg-rlhf.yml https://thresh.tools/templates/fg-rlhf.yml
!curl -so fg-rlhf.json https://thresh.tools/data/fg-rlhf.json

FGRLHF = load_interface("fg-rlhf.yml")
fg_rlhf_data = FGRLHF.load_annotations("fg-rlhf.json")

fg_rlhf_data[0]

Fg-rlhfEntry(
  target: Bloom is the second studio album by Australian singer and songwriter Troye Sivan, released on 31 August 2018 through EMI Australia and Capitol Records. The album follows up his 2015 debut studio album, Blue Neighbourhood, and features guest appearances from Gordi and Ariana Grande. The lead single from the album was released on 10 January 2018, accompanied by a music video directed by Grant Singer. The title track, a song about queer desire, was released on 2 May as the third single. American singer Ariana Grande was released on 13 June as the album's fourth single., 
  context: 
  
  ### Passage Title: Bloom (Troye Sivan album)
  Bloom is the second studio album by Australian singer and songwriter Troye Sivan, released on 31 August 2018 through EMI Australia and Capitol Records.
  The album follows up his 2015 debut studio album, "Blue Neighbourhood", and features guest appearances from Gordi and Ariana Grande.
  It was preceded by the release of the singles "M

In [25]:
!curl -so arxivedits.yml https://thresh.tools/templates/arxivedits.yml
!curl -so arxivedits.json https://thresh.tools/data/arxivedits.json

ArXivEdits = load_interface("arxivedits.yml")
arXiv_edits_data = ArXivEdits.load_annotations("arxivedits.json")

arXiv_edits_data[0]

ArxiveditsEntry(
  sentence-2-level: 2, 
  sentence-1-level: 1, 
  easy-or-hard: easy, 
  arxiv-id: 1301.2079, 
  sentence-pair-index: 0, 
  target: The other one is panel information criteria ( [MATH] ) , corresponding to [MATH] ., 
  source: The other one is panel information criteria ( [MATH] ) , correspondence with [MATH] , they also have three styles , one of them is : [EQUATION], 
  edits: [InsertionEdit(
    annotation: InsertionAnnotation(
      intention: Format(
        val: format
      )
    ), 
    output_idx: [[81, 83]]
  ), DeletionEdit(
    annotation: DeletionAnnotation(
      intention: Content(
        val: content
      )
    ), 
    input_idx: [[84, 144]]
  ), SubstituteEdit(
    annotation: SubstituteAnnotation(
      intention: Improve-grammar-typo(
        val: improve-grammar-typo
      )
    ), 
    input_idx: [[57, 77]], 
    output_idx: [[57, 74]]
  )]
)