# Summary

This notebook builds the citation intent (ACL-ARC) dataset for classification. We do not pad the examples here; the data collator pads each batch on the fly. We also do not condense (pack) several examples into one sequence, because each example carries its own label.
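
To make "pad on the fly" concrete, here is a minimal pure-Python sketch of what a padding collator does to one batch. It is an illustration only, not the `transformers` implementation (the library ships `DataCollatorWithPadding` for this purpose). The helper name `pad_batch` and the default `pad_token_id=1` (assumed to be the `<pad>` id for `roberta-base`) are assumptions made for this example.

```python
from typing import Dict, List


def pad_batch(
    features: List[Dict[str, List[int]]],
    pad_token_id: int = 1,  # assumption: <pad> id of the tokenizer in use
) -> Dict[str, List[List[int]]]:
    """Pad every sequence in one batch to the length of its longest member."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Padding positions get the pad id and an attention mask of 0
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


# Two hypothetical tokenized examples of different lengths
example_batch = [
    {"input_ids": [0, 42702, 2156, 2], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [0, 9766, 2], "attention_mask": [1, 1, 1]},
]
padded = pad_batch(example_batch)
print(padded["input_ids"])  # [[0, 42702, 2156, 2], [0, 9766, 2, 1]]
```

Because padding is computed per batch, short batches stay short, which is why the stored dataset keeps variable-length `input_ids`.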

In [None]:
import os

def create_dir(dir_path):
    # Create the directory (and any missing parents); do nothing if it already exists
    os.makedirs(dir_path, exist_ok=True)

In [None]:
create_dir('datasets/classifier/citation_intent')

In [None]:
!curl -Lo train.jsonl https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/train.jsonl --output-dir 'datasets/classifier/citation_intent'
!curl -Lo dev.jsonl https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/dev.jsonl --output-dir 'datasets/classifier/citation_intent'
!curl -Lo test.jsonl https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/test.jsonl --output-dir 'datasets/classifier/citation_intent'
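
The `--output-dir` flag used above may be unavailable on older curl builds. As a hedged alternative, the sketch below performs the same three downloads with the Python standard library. The helper names `download_plan` and `download_all` are hypothetical; the URL and target directory are the ones from the cell above. The actual download call is left commented out so the block has no side effects when run as-is.

```python
import os
import urllib.request

BASE_URL = "https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent"
OUT_DIR = "datasets/classifier/citation_intent"


def download_plan(base_url, out_dir, splits=("train", "dev", "test")):
    """Return (url, local_path) pairs, one per split file."""
    return [
        (f"{base_url}/{split}.jsonl", os.path.join(out_dir, f"{split}.jsonl"))
        for split in splits
    ]


def download_all(plan):
    """Fetch every file in the plan, creating target directories as needed."""
    for url, path in plan:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urllib.request.urlretrieve(url, path)


plan = download_plan(BASE_URL, OUT_DIR)
for url, path in plan:
    print(url, "->", path)

# download_all(plan)  # uncomment to actually fetch the files
```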


In [1]:
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files={
        "train": "datasets/classifier/citation_intent/train.jsonl",
        "test": "datasets/classifier/citation_intent/test.jsonl",
        "dev": "datasets/classifier/citation_intent/dev.jsonl",
    },
)



In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'metadata'],
        num_rows: 1688
    })
    test: Dataset({
        features: ['text', 'label', 'metadata'],
        num_rows: 139
    })
    dev: Dataset({
        features: ['text', 'label', 'metadata'],
        num_rows: 114
    })
})

In [3]:
print(dataset["train"][0])
print(dataset["test"][0])
print(dataset["dev"][0])

{'text': 'Thus , over the past few years , along with advances in the use of learning and statistical methods for acquisition of full parsers ( Collins , 1997 ; Charniak , 1997a ; Charniak , 1997b ; Ratnaparkhi , 1997 ) , significant progress has been made on the use of statistical learning methods to recognize shallow parsing patterns syntactic phrases or words that participate in a syntactic relationship ( Church , 1988 ; Ramshaw and Marcus , 1995 ; Argamon et al. , 1998 ; Cardie and Pierce , 1998 ; Munoz et al. , 1999 ; Punyakanok and Roth , 2001 ; Buchholz et al. , 1999 ; Tjong Kim Sang and Buchholz , 2000 ) .', 'label': 'Background', 'metadata': {}}
{'text': 'Resnik ( 1995 ) reported a correlation of r = .9026.10 The results are not directly comparable , because he only used noun-noun pairs , words instead of concepts , a much smaller dataset , and measured semantic similarity instead of semantic relatedness .', 'label': 'CompareOrContrast', 'metadata': {}}
{'text': 'Typical examp

In [4]:
df = dataset["train"].to_pandas()
labels = df['label'].unique().tolist()
labels

['Background', 'Uses', 'CompareOrContrast', 'Extends', 'Motivation', 'Future']

In [5]:
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for i, label in enumerate(labels)}
print(label2id)
print(id2label)

{'Background': 0, 'Uses': 1, 'CompareOrContrast': 2, 'Extends': 3, 'Motivation': 4, 'Future': 5}
{0: 'Background', 1: 'Uses', 2: 'CompareOrContrast', 3: 'Extends', 4: 'Motivation', 5: 'Future'}
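
The mapping above follows the order in which labels first appear in the training split, so a reshuffled or re-downloaded file could yield different ids. One possible alternative, sketched here under that assumption, is to sort the label names before numbering them. The helper `build_label_maps` and the `_sorted` variable names are hypothetical. Note that this produces different ids from the mapping used in this notebook, so it should not be mixed with the dataset that is pushed later.

```python
def build_label_maps(label_names):
    """Build label<->id maps from alphabetically sorted, de-duplicated names."""
    ordered = sorted(set(label_names))
    label2id = {name: i for i, name in enumerate(ordered)}
    id2label = {i: name for i, name in enumerate(ordered)}
    return label2id, id2label


# Label names as printed by the notebook above
observed = ['Background', 'Uses', 'CompareOrContrast', 'Extends', 'Motivation', 'Future']
label2id_sorted, id2label_sorted = build_label_maps(observed)
print(label2id_sorted)
# {'Background': 0, 'CompareOrContrast': 1, 'Extends': 2, 'Future': 3, 'Motivation': 4, 'Uses': 5}
```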


In [6]:
# Replace each string label with its integer id (map runs per example here, not batched)
dataset = dataset.map(lambda example: {"label": label2id[example["label"]]})

In [7]:
print(dataset["train"][0])
print(dataset["test"][0])
print(dataset["dev"][0])

{'text': 'Thus , over the past few years , along with advances in the use of learning and statistical methods for acquisition of full parsers ( Collins , 1997 ; Charniak , 1997a ; Charniak , 1997b ; Ratnaparkhi , 1997 ) , significant progress has been made on the use of statistical learning methods to recognize shallow parsing patterns syntactic phrases or words that participate in a syntactic relationship ( Church , 1988 ; Ramshaw and Marcus , 1995 ; Argamon et al. , 1998 ; Cardie and Pierce , 1998 ; Munoz et al. , 1999 ; Punyakanok and Roth , 2001 ; Buchholz et al. , 1999 ; Tjong Kim Sang and Buchholz , 2000 ) .', 'label': 0, 'metadata': {}}
{'text': 'Resnik ( 1995 ) reported a correlation of r = .9026.10 The results are not directly comparable , because he only used noun-noun pairs , words instead of concepts , a much smaller dataset , and measured semantic similarity instead of semantic relatedness .', 'label': 2, 'metadata': {}}
{'text': 'Typical examples are Bulgarian ( Simov et 

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")



In [9]:
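def preprocess_function(examples):
    # Tokenize a batch of texts. With truncation=True and no max_length,
    # sequences are cut at the tokenizer's maximum length. No padding is added here.
    return tokenizer(
        examples["text"],
        truncation=True,
    )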

In [10]:
# Drop the text and metadata columns; only the token ids, attention mask, and label are needed
dataset_tokens = dataset.map(preprocess_function, batched=True, remove_columns=["metadata", "text"])
dataset_tokens


Map: 100%|██████████| 1688/1688 [00:00<00:00, 27862.66 examples/s]
Map: 100%|██████████| 139/139 [00:00<00:00, 19694.23 examples/s]
Map: 100%|██████████| 114/114 [00:00<00:00, 18155.78 examples/s]


DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1688
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 139
    })
    dev: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 114
    })
})

In [13]:
print(dataset_tokens["train"][0])
print(dataset_tokens["test"][0])
print(dataset_tokens["dev"][0])

{'label': 0, 'input_ids': [0, 42702, 2156, 81, 5, 375, 367, 107, 2156, 552, 19, 9766, 11, 5, 304, 9, 2239, 8, 17325, 6448, 13, 3857, 9, 455, 28564, 268, 36, 5415, 2156, 7528, 25606, 732, 4422, 20082, 2156, 7528, 102, 25606, 732, 4422, 20082, 2156, 7528, 428, 25606, 12041, 282, 1115, 3994, 3592, 2156, 7528, 4839, 2156, 1233, 2017, 34, 57, 156, 15, 5, 304, 9, 17325, 2239, 6448, 7, 5281, 16762, 46563, 8117, 45774, 28201, 22810, 50, 1617, 14, 4064, 11, 10, 45774, 28201, 1291, 36, 2197, 2156, 11151, 25606, 3513, 18086, 8, 7380, 2156, 7969, 25606, 19021, 22704, 4400, 1076, 4, 2156, 6708, 25606, 5866, 324, 8, 13891, 2156, 6708, 25606, 6760, 3979, 4400, 1076, 4, 2156, 6193, 25606, 14687, 219, 677, 260, 1638, 8, 13880, 2156, 5155, 25606, 19443, 9649, 329, 4400, 1076, 4, 2156, 6193, 25606, 255, 40435, 1636, 18002, 8, 19443, 9649, 329, 2156, 3788, 4839, 479, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [11]:
decoded_string = tokenizer.decode(dataset_tokens["train"][4]["input_ids"])
original_string = dataset["train"][4]["text"]
print(decoded_string)
print(original_string)

<s>Briscoe and Carroll ( 1997 ) report on manually analyzing an open-class vocabulary of 35,000 head words for predicate subcategorization information and comparing the results against the subcategorization details in COMLEX.</s>
Briscoe and Carroll ( 1997 ) report on manually analyzing an open-class vocabulary of 35,000 head words for predicate subcategorization information and comparing the results against the subcategorization details in COMLEX .
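
The two printed strings differ only in the `<s>`/`</s>` markers and in the space before the final period. A minimal sketch of an automated version of this eyeball check follows: it strips the special-token markers and compares the texts with all whitespace removed. The helper `same_ignoring_spacing` and the shortened example strings are hypothetical, and the list of special tokens is an assumption about the tokenizer in use.

```python
import re

# Assumption: special-token strings used by the tokenizer in this notebook
SPECIAL_TOKENS = ("<s>", "</s>", "<pad>")


def same_ignoring_spacing(decoded: str, original: str) -> bool:
    """True if both strings match once special tokens and whitespace are removed."""
    for token in SPECIAL_TOKENS:
        decoded = decoded.replace(token, "")
    squash = lambda s: re.sub(r"\s+", "", s)
    return squash(decoded) == squash(original)


# Shortened stand-ins for the decoded and original strings printed above
decoded_example = "<s>the subcategorization details in COMLEX.</s>"
original_example = "the subcategorization details in COMLEX ."
print(same_ignoring_spacing(decoded_example, original_example))  # True
```

In the notebook itself, passing `skip_special_tokens=True` and `clean_up_tokenization_spaces=False` to `tokenizer.decode` should bring the decoded text closer to the original, although that is not verified here.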


In [15]:
dataset_tokens.push_to_hub("ACL_ARC_dataset")

Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 506.80ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  1.60it/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 566.87ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  2.55it/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 520.58ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  1.70it/s]


CommitInfo(commit_url='https://huggingface.co/datasets/BigTMiami/ACL_ARC_dataset/commit/3483867088d05be089b41f3c2eb181b78636c87d', commit_message='Upload dataset', commit_description='', oid='3483867088d05be089b41f3c2eb181b78636c87d', pr_url=None, pr_revision=None, pr_num=None)
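
The pushed dataset stores `label` as a plain integer, so the label names are not recoverable from the data alone. One hedged option is to save the `id2label` mapping next to the training code as JSON. The sketch below uses only the standard library; the file name `id2label.json`, the helper names, and the temporary directory are hypothetical, and the mapping is copied from the output earlier in this notebook.

```python
import json
import os
import tempfile

# Copy of the mapping printed earlier in this notebook
id2label_copy = {
    0: 'Background', 1: 'Uses', 2: 'CompareOrContrast',
    3: 'Extends', 4: 'Motivation', 5: 'Future',
}


def save_label_map(id2label, path):
    """Write the mapping as JSON (JSON object keys must be strings)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({str(i): name for i, name in id2label.items()}, f, indent=2)


def load_label_map(path):
    """Read the mapping back and restore integer keys."""
    with open(path, encoding="utf-8") as f:
        return {int(i): name for i, name in json.load(f).items()}


path = os.path.join(tempfile.mkdtemp(), "id2label.json")  # hypothetical location
save_label_map(id2label_copy, path)
restored = load_label_map(path)
print(restored == id2label_copy)  # True
```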