## Managing Fine-Grained Data with `nlproc_tools`

In [33]:
!pip install -q nlproc_tools

In [34]:
from nlproc_tools import load_interface, convert_dataset

### Data Loading
When you collect data on nlproc.tools, you will eventually have a large number of `<<data>>.json` files to manage.

Instead of writing and verifying a separate dataloader for each interface (which leads to boilerplate code and can become quite complex), our `nlproc_tools` library handles all of this for you!

This notebook will demonstrate the basic functionality of `load_interface` and `convert_dataset` to manage your data.

In [35]:
# Let's begin by downloading the SALSA typology and demo dataset
!curl -so salsa.yml https://nlproc.tools/templates/salsa.yml
!curl -so salsa.json https://nlproc.tools/data/salsa.json

In [36]:
# We can load any interface YML file into a Python object
SALSA = load_interface("salsa.yml")

# Here we can view some features from the interface
print(SALSA.template_label)
print(SALSA.template_description)
print(SALSA.citation)

💃 SALSA
Success and FAilure Linguistic Simplification Annotation
@article{heineman2023dancing,
  title={Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA},
  author={Heineman, David and Dou, Yao and Maddela, Mounica and Xu, Wei},
  journal={arXiv preprint arXiv:2305.14458},
  year={2023}
}


In [37]:
# We can load our JSON data by using our interface object and calling load_annotations()
salsa_data = SALSA.load_annotations("salsa.json")

# As we can see, it creates a custom object, with custom Edit and Annotation classes according
# to the SALSA typology
print(salsa_data[0])



SalsaEntry(
  annotator: annotator_1, 
  system: new-wiki-1/GPT-3-zero-shot, 
  source: The film has grossed over $552 million worldwide, becoming the eighth highest-grossing film of 2022., 
  target: The film has made more than $552 million at the box office and is currently the eighth most successful movie of 2022., 
  edits: [DeletionEdit(
    input_idx: [[39, 49]], 
    annotation: DeletionAnnotation(
      deletion_type: GoodDeletion(
        val: 1
      ), 
      coreference: False, 
      grammar_error: False
    )
  ), SubstitutionEdit(
    input_idx: [[13, 20]], 
    output_idx: [[13, 17]], 
    annotation: SubstitutionAnnotation(
      substitution_info_change: Same(
        val: 2
      ), 
      grammar_error: False
    )
  ), SubstitutionEdit(
    input_idx: [[21, 25]], 
    output_idx: [[18, 27]], 
    annotation: SubstitutionAnnotation(
      substitution_info_change: Same(
        val: 1
      ), 
      grammar_error: False
    )
  ), SubstitutionEdit(
    input_idx: [

In [38]:
# We can read the fields of an entry directly as attributes
salsa_entry = salsa_data[0]

print(salsa_entry.system)
print(salsa_entry.annotator)
print(salsa_entry.edits[0])

new-wiki-1/GPT-3-zero-shot
annotator_1
DeletionEdit(
  input_idx: [[39, 49]], 
  annotation: DeletionAnnotation(
    deletion_type: GoodDeletion(
      val: 1
    ), 
    coreference: False, 
    grammar_error: False
  )
)
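
The `input_idx` and `output_idx` values above look like character offsets into `source` and `target`. As a minimal sketch, assuming each pair is a half-open `[start, end)` range (which matches the demo entry, but is our reading rather than documented behavior), we can recover the text behind an edit with plain slicing. The helper `extract_spans` is hypothetical and not part of `nlproc_tools`.

```python
def extract_spans(text, spans):
    """Return the substring covered by each [start, end) character span."""
    return [text[start:end] for start, end in spans]


# Source and target copied from the first SALSA demo entry shown above
source = ("The film has grossed over $552 million worldwide, becoming "
          "the eighth highest-grossing film of 2022.")
target = ("The film has made more than $552 million at the box office and is "
          "currently the eighth most successful movie of 2022.")

deleted = extract_spans(source, [[39, 49]])      # DeletionEdit.input_idx
replaced = extract_spans(source, [[13, 20]])     # SubstitutionEdit.input_idx
replacement = extract_spans(target, [[13, 17]])  # SubstitutionEdit.output_idx

print(deleted, replaced, replacement)
# ['worldwide,'] ['grossed'] ['made']
```

With a loaded entry, the same helper could be called as `extract_spans(salsa_entry.source, salsa_entry.edits[0].input_idx)`.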


In [39]:
# If we use our interface to create our own dataset, we can export it back to JSON
SALSA.export_data(salsa_data, "salsa_export.json")

### Internal Data Classes

The `nlproc_tools` library has four classes: `Interface`, `Entry`, `Edit`, and `Annotation`.

In this section, we will show how to use the Edit and Annotation classes.

In [40]:
# Get the Entry class defined by SALSA
SalsaEntry = SALSA.get_entry_class()

# Create a new Entry object
sent = SalsaEntry(
    system="custom_entry",
    annotator=None,
    source="This is our complex sentence.",
    target="This is our complex sentence."
)

print(sent)

# We can read individual fields by accessing attributes directly
print(sent.target)

SalsaEntry(
  system: custom_entry, 
  annotator: None, 
  source: This is our complex sentence., 
  target: This is our complex sentence.
)
This is our complex sentence.


In [41]:
# Let's take a look at the classes for each Edit
SALSA.edit_classes

{'reorder': nlproc_tools.names.ReorderEdit,
 'insertion': nlproc_tools.names.InsertionEdit,
 'deletion': nlproc_tools.names.DeletionEdit,
 'substitution': nlproc_tools.names.SubstitutionEdit,
 'structure': nlproc_tools.names.StructureEdit,
 'split': nlproc_tools.names.SplitEdit}

In [42]:
# We can use get_edit_class() to get a specific Edit class
InsertionEdit = SALSA.get_edit_class('insertion')

# Then, we can define a new Insertion edit
edit = InsertionEdit(
    output_idx=[[39, 40]]
)

print(edit)

InsertionEdit(
  output_idx: [[39, 40]]
)


In [43]:
# Let's add some edits to our sentence
sent.edits = []

DeletionEdit = SALSA.get_edit_class('deletion')
SubstitutionEdit = SALSA.get_edit_class('substitution')

sent.edits.append(
    SubstitutionEdit(
        input_idx=[[1, 4]],
        output_idx=[[8, 12]],
    )
)

sent.edits.append(
    DeletionEdit(
        input_idx=[[1, 5]],
    )
)

print(sent)

SalsaEntry(
  system: custom_entry, 
  annotator: None, 
  source: This is our complex sentence., 
  target: This is our complex sentence., 
  edits: [SubstitutionEdit(
    input_idx: [[1, 4]], 
    output_idx: [[8, 12]]
  ), DeletionEdit(
    input_idx: [[1, 5]]
  )]
)
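
When building edits by hand like this, it is easy to write a span that runs past the end of the sentence. Below is a small sanity check, assuming spans are `[start, end)` character offsets. The function `spans_in_bounds` is a hypothetical helper for illustration; we are not claiming the library performs or requires this check.

```python
def spans_in_bounds(text, spans):
    """True if every [start, end) span lies inside the text."""
    return all(0 <= start <= end <= len(text) for start, end in spans)


sentence = "This is our complex sentence."  # 29 characters

ok_input = spans_in_bounds(sentence, [[1, 4]])    # span used by the edits above
ok_output = spans_in_bounds(sentence, [[8, 12]])  # span used by the edits above
too_long = spans_in_bounds(sentence, [[25, 40]])  # made-up span past the end

print(ok_input, ok_output, too_long)
# True True False
```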


In [44]:
# We can now use our typology to convert our data back to a raw format
raw_salsa_data = SALSA.to_dict([sent])
print(raw_salsa_data)

# Feel free to paste this into nlproc.tools/?t=salsa to see a visualization!

{}


In [45]:
# One side note: If you want to load multiple datasets, you can simply concatenate them:
# SALSA.load_annotations("dataset_1.json") + SALSA.load_annotations("dataset_2.json")
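
For more than two files, the same idea can be wrapped in a loop. This is only a sketch: it assumes, based on the comment above, that the loader returns something that supports `+`. The names `load_many` and `fake_loader` are hypothetical; `fake_loader` stands in for `SALSA.load_annotations` so the snippet runs without any data files.

```python
def load_many(load_fn, paths):
    """Call load_fn on each path and concatenate the results with +."""
    merged = []
    for path in paths:
        merged = merged + load_fn(path)
    return merged


def fake_loader(path):
    """Stand-in loader: pretends each file holds two entries."""
    return [f"{path}#0", f"{path}#1"]


combined = load_many(fake_loader, ["dataset_1.json", "dataset_2.json"])
print(len(combined))
# 4
```

With the real interface, the call would presumably look like `load_many(SALSA.load_annotations, paths)`.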

### Data Conversion
Our standardized nlproc.tools data format is universal across fine-grained annotation schemes. To support existing data formats, we created dataloaders for external datasets.

We support the following datasets:
```
frank, scarecrow, mqm, snac, wu-etal-2023, da-san-martino-etal-2019
```

In [46]:
# First, we'll download the SNaC dataset in its original form
!pip install -q gdown
import gdown
gdown.download('https://drive.google.com/uc?id=1ff-pV2sX9XNDMdaPxY7v22T2i0235tcE', 'original_snac_data.json', quiet=True)

# Then, we will convert it to our standard format
snac_raw = convert_dataset(
    dataset_name="snac",
    data_path="original_snac_data.json"
)

2023-07-28 23:38:50,052 - INFO - Dataset name: snac
2023-07-28 23:38:50,053 - INFO - Data path: original_snac_data.json
2023-07-28 23:38:50,053 - INFO - Reverse flag: False
2023-07-28 23:38:50,054 - INFO - Output path (Optional): None
2023-07-28 23:38:50,056 - INFO - Converting dataset...
2023-07-28 23:38:50,084 - INFO - Done!


In [47]:
# The converted data is now in our JSON format; let's look at an example
snac_raw[0]

{'source': '',
 'target': 'Johnnie, a determined 19 year old girl, works at the cotton mill to support her family. She plans to pay back all the debts they have accumulated.',
 'edits': [{'id': 0, 'category': 'CharE', 'output_idx': [[0, 7]], 'votes': 1}]}
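
Because the converted data is a plain list of dictionaries, we can inspect it with the standard library before loading it into an interface. Here is a rough sketch that counts edits per category. The first record is copied from the output above; the second record and its `GramE` category are made up for illustration, and `count_categories` is a hypothetical helper, not a library function.

```python
from collections import Counter


def count_categories(entries):
    """Count how many edits fall under each category across all entries."""
    counts = Counter()
    for entry in entries:
        for edit in entry.get("edits", []):
            counts[edit["category"]] += 1
    return counts


records = [
    # Copied from snac_raw[0] above (target shortened for space)
    {"source": "",
     "target": "Johnnie, a determined 19 year old girl, works at the cotton mill.",
     "edits": [{"id": 0, "category": "CharE", "output_idx": [[0, 7]], "votes": 1}]},
    # Made-up record, only here to show more than one category
    {"source": "",
     "target": "An illustrative sentence.",
     "edits": [{"id": 0, "category": "GramE", "output_idx": [[0, 2]], "votes": 1},
               {"id": 1, "category": "CharE", "output_idx": [[3, 15]], "votes": 2}]},
]

counts = count_categories(records)
print(counts.most_common())
# [('CharE', 2), ('GramE', 1)]
```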

In [48]:
# To begin using this, we'll download and use the nlproc.tools SNaC typology
!curl -so snac.yml https://nlproc.tools/templates/snac.yml

# Now we can load and view in the standardized nlproc.tools format!
SNaC = load_interface("snac.yml")
snac = SNaC.load_annotations(snac_raw)

snac[0]

SnacEntry(
  source: , 
  target: Johnnie, a determined 19 year old girl, works at the cotton mill to support her family. She plans to pay back all the debts they have accumulated., 
  edits: [ChareEdit(
    output_idx: [[0, 7]], 
    votes: 1
  )]
)

### Examples: Loading Demo Interface Data
Now that we've covered data loading and dataset conversion, let's look at some examples across different tasks.

These cells load the demo data for each interface as it appears on nlproc.tools.

In [49]:
!curl -so multipit.yml https://nlproc.tools/templates/multipit.yml
!curl -so multipit.json https://nlproc.tools/data/multipit.json

MultiPIT = load_interface("multipit.yml")
multipit_data = MultiPIT.load_annotations("multipit.json")

multipit_data[0]

MultipitEntry(
  source: Relax, take it easy., 
  target: Relax, take a deep breath, and enjoy the moment., 
  edits: [AddNewEdit(
    annotation: None, 
    output_idx: [[14, 48]]
  )]
)

In [50]:
!curl -so frank.yml https://nlproc.tools/templates/frank.yml
!curl -so frank.json https://nlproc.tools/data/frank.json

FRANK = load_interface("frank.yml")
frank_data = FRANK.load_annotations("frank.json")

frank_data[0]

KeyError: 'OtherE'

In [None]:
!curl -so cwzcc.yml https://nlproc.tools/templates/cwzcc.yml
!curl -so cwzcc.json https://nlproc.tools/data/cwzcc.json

CWZCC = load_interface("cwzcc.yml")
cwzcc_data = CWZCC.load_annotations("cwzcc.json")

cwzcc_data[0]

CwzccEntry(
  source: , 
  target: Pirmi man iyo ta sinti dwele. Nusabe yo porke pirmi ansina. Kwando iyo keda alegre? Kwando gaha pasa el diya o mes na iyo triste?, 
  edits: [UnintentionalEdit(
    annotation: UnintentionalAnnotation(
      val: None, 
      error_type: NonRandomErrors(
        val: ArbitrarySpelling(
          val: Phonogramical(
            val: None
          )
        )
      )
    ), 
    output_idx: [[0, 5]]
  ), UnintentionalEdit(
    annotation: UnintentionalAnnotation(
      val: None, 
      error_type: NonRandomErrors(
        val: ArbitrarySpelling(
          val: Phonogramical(
            val: None
          )
        )
      )
    ), 
    output_idx: [[10, 13]]
  ), UnintentionalEdit(
    annotation: UnintentionalAnnotation(
      val: None, 
      error_type: NonRandomErrors(
        val: ArbitrarySpelling(
          val: Phonogramical(
            val: None
          )
        )
      )
    ), 
    output_idx: [[17, 22]]
  ), UnintentionalEdit(
    a

In [None]:
!curl -so scarecrow.yml https://nlproc.tools/templates/scarecrow.yml
!curl -so scarecrow.json https://nlproc.tools/data/scarecrow.json

SCARECROW = load_interface("scarecrow.yml")
scarecrow_data = SCARECROW.load_annotations("scarecrow.json")

scarecrow_data[0]

ScarecrowEntry(
  model: gpt3, 
  frequency_penalty: 0, 
  temperature: 1.0, 
  p: 0.96, 
  gid: 10001, 
  id: 1, 
  source: In the wild, animals display tender moments of affection all the time. Macedonian photographer Goran Anastasovski has spent 15 years honing his skills in wildlife photography so that he can share these touching instances with others., 
  target: To honor the effort he put into his latest set of photographs, we've gathered some of our favorites for your viewing pleasure. A lion and its cub enjoy a tender moment together. The lion's paws rest on top of the front paws of its cub, who lays its head on its mother. A polar bear embraces her cub. Polar bears care for their cubs until they're about two years old, when the cubs venture out on their own. A kangaroo rubs her cheek on her joey's hand., 
  edits: [NeedsGoogleEdit(
    output_idx: [[300, 366]], 
    annotation: NeedsGoogleAnnotation(
      explaination: I'm not sure if polar bear cubs stay with their mothers u

In [None]:
!curl -so snac.yml https://nlproc.tools/templates/snac.yml
!curl -so snac.json https://nlproc.tools/data/snac.json

SNaC = load_interface("snac.yml")
snac_data = SNaC.load_annotations("snac.json")

snac_data[0]

SnacEntry(
  source: , 
  target: It is also said that seeing Anytus pass by, Socrates remarked: "How proudly the great man steps; he thinks, no doubt, he has performed some great and noble deed in putting me to death, and all because, seeing him deemed worthy of the highest honours of the state, I told him it ill became him to bring up his so in a tan-yard." He adds that Homer has ascribed to some at the point of death a power of forecasting things, and he too is minded to utter a prophecy. Once, for a brief space, he associated with the son of Anytus, and he seemed, 
  edits: [SceneeEdit(
    output_idx: [[0, 42]], 
    votes: 1
  ), GrameEdit(
    output_idx: [[264, 326]], 
    votes: 1
  ), ChareEdit(
    output_idx: [[28, 34]], 
    votes: 1
  ), ChareEdit(
    output_idx: [[341, 346]], 
    votes: 1
  ), GrameEdit(
    output_idx: [[526, 539]], 
    votes: 2
  )]
)

In [None]:
!curl -so da-san-martino-etal-2019.yml https://nlproc.tools/templates/da-san-martino-etal-2019.yml
!curl -so da-san-martino-etal-2019.json https://nlproc.tools/data/da-san-martino-etal-2019.json

Da_san_martino_etal_2019 = load_interface("da-san-martino-etal-2019.yml")
da_san_martino_etal_2019 = Da_san_martino_etal_2019.load_annotations("da-san-martino-etal-2019.json")

da_san_martino_etal_2019[0]

Da-san-martino-etal-2019Entry(
  source: , 
  target: The US Is Blatantly Telling Lies
  
  It’s no secret that the Trump administration has a strong distaste for Iran.
  Iran is one of the only issues on which the U.S. president has remained relatively consistent.
  Trump berated the country both before and after taking office.
  However, Trump’s anti-Iran strategy goes against the better judgment of even the most anti-Iranian advisors in his staff who don’t want to see the U.S. isolated on the world stage.
  Fortunately for Trump, however, he is not alone in his bid to isolate and demonize Iran at all costs.
  On December 12, Trump’s ambassador to the U.N., Nikki Haley, gave a grandiose speech demonizing Iran that echoed Colin Powell’s infamous performance before the U.N. in 2003.
  Haley’s essential claim was that Saudi Arabia is under attack by missiles supplied to Yemen by the Iranian government and that the world should not sit idly by as this goes on.
  “If we do nothing about t

In [None]:
!curl -so wu-etal-2023.yml https://nlproc.tools/templates/wu-etal-2023.yml
!curl -so wu-etal-2023.json https://nlproc.tools/data/wu-etal-2023.json

Wuetal2023 = load_interface("wu-etal-2023.yml")
wu_etal_2023_data = Wuetal2023.load_annotations("wu-etal-2023.json")

wu_etal_2023_data[0]

Wu-etal-2023Entry(
  source: Bloom is the second studio album by Australian singer and songwriter Troye Sivan. This album was released on August 31, 2018 through EMI Music Australia and Capitol Records. The album's title track Bloom, a song about anal sex, was promoted using the hashtag ‘#BopsBoutBottoming, which was trending on Twitter. The song Bloom was released on May 2, 2018 as the third single from Sivan's album Bloom. , 
  target: Bloom is the second studio album by Australian singer and songwriter Troye Sivan, released on 31 August 2018 through EMI Australia and Capitol Records. The album follows up his 2015 debut studio album, Blue Neighbourhood, and features guest appearances from Gordi and Ariana Grande. The lead single from the album was released on 10 January 2018, accompanied by a music video directed by Grant Singer. The title track, a song about queer desire, was released on 2 May as the third single. American singer Ariana Grande was released on 13 June as the album's 