# 00_FiNER139-Text-Extraction for TAPT
### Purpose: Extract the tokens from the labeled dataset and join them back into plain text for TAPT (task-adaptive pretraining)
<hr style="height:3px; width:100%; background-color:black; border:none; margin:auto;" />

## 1. Import the finer-139 dataset from Hugging Face via the datasets library

In [1]:
from datasets import load_dataset
import pandas as pd

# Load the dataset from Hugging Face
dataset = load_dataset("nlpaueb/finer-139")

dataset
#dataset["train"][1]

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 900384
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 112494
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 108378
    })
})

<hr style="height:3px; width:100%; background-color:black; border:none; margin:auto;" />

## 2. Extract only the tokens from the train split of the dataset and join them back into running text

In [2]:
from datasets import DatasetDict
from tqdm import tqdm

# Take the train split of the DatasetDict loaded above
train_data = dataset["train"]

# Join the tokens of each example into a single space-separated string
def tokens_to_text(example):
    return {"text": " ".join(example["tokens"])}

# Keep only the new text column and drop 'id', 'tokens' and 'ner_tags'
text_dataset = train_data.map(tokens_to_text, remove_columns=train_data.column_names, num_proc=4)
text_dataset

Dataset({
    features: ['text'],
    num_rows: 900384
})
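Joining with a single space keeps the dataset's tokenization visible in the text, so punctuation stays detached (`Note 12 .`, `( b )`). BERT-style tokenizers split on punctuation anyway, so this probably matters little for `bert-base-uncased`; for a tokenizer that is sensitive to whitespace, a light cleanup may be worth trying. The following is a minimal sketch and not part of the original notebook: the helper name `detokenize` and its two regex rules are assumptions, and they cover only the most common cases.

```python
import re

def detokenize(tokens):
    """Join tokens with spaces, then re-attach common punctuation (hypothetical helper)."""
    text = " ".join(tokens)
    # Drop whitespace before closing punctuation: "Note 12 ." -> "Note 12."
    text = re.sub(r"\s+([.,;:!?%)\]])", r"\1", text)
    # Drop whitespace after opening brackets and currency signs: "( b" -> "(b"
    text = re.sub(r"([(\[$])\s+", r"\1", text)
    return text

print(detokenize(["See", "Note", "14", "(", "b", ")", "."]))
# prints: See Note 14 (b).
```

To try it in the pipeline, `tokens_to_text` could return `{"text": detokenize(example["tokens"])}` instead of the plain join.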

<hr style="height:3px; width:100%; background-color:black; border:none; margin:auto;" />

## 3. Remove boilerplate train rows without useful content

In [3]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

token_counts = [
    (i, len(tokenizer.encode(row["text"], add_special_tokens=False)))
    for i, row in enumerate(text_dataset)
]

Token indices sequence length is longer than the specified maximum sequence length for this model (628 > 512). Running this sequence through the model will result in indexing errors
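The warning above is emitted because at least one row encodes to more tokens (628) than the model's 512-token limit. Since the cell only measures lengths and nothing is passed to a model, the counts themselves should be unaffected. To see how many rows are actually too long, the `(index, length)` pairs can be summarized. This is a minimal sketch: the function name `summarize_lengths` and its parameters are assumptions, and the default budget reserves two positions for `[CLS]` and `[SEP]`.

```python
from statistics import median

def summarize_lengths(token_counts, model_max_length=512, num_special_tokens=2):
    """Summarize (index, length) pairs; lengths are counted without special tokens."""
    lengths = [n for _, n in token_counts]
    budget = model_max_length - num_special_tokens  # room left for content tokens
    return {
        "rows": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "max": max(lengths),
        "over_budget": sum(n > budget for n in lengths),
    }

# Toy data for illustration only, not taken from the real dataset
toy_counts = [(0, 5), (1, 9), (2, 628), (3, 40)]
print(summarize_lengths(toy_counts))
```

Rows counted under `over_budget` would need truncation or chunking before training.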


In [4]:
# Sort by token count and inspect a window of short rows to decide where to cut
sorted_by_tokens = sorted(token_counts, key=lambda x: x[1])
window = sorted_by_tokens[9000:10000]  # rows ranked 9,000 to 9,999 by ascending token count

# Output
for idx, token_count in window:
    #print(f"BERT tokens (excl. CLS/SEP): {token_count}")
    print(idx, ": ", text_dataset[idx]['text'])
    #print("="*80)

692806 :  See Note 12 for further information .
692926 :  See Note 14 ( b ) .
692930 :  See Note 12 for further details .
693306 :  44 Table of Contents Item 2 .
694287 :  21 Table of Contents Item 2 .
694737 :  30 Table of Contents Item 2 .
694862 :  36 Table of Contents ITEM 2 .
694936 :  16 Table of Contents Item 2 .
695403 :  § 9606 ( a ) .
695420 :  25 Table of Contents ITEM 2 .
696247 :  38 Table of Contents Item 2 .
696587 :  47 Table of Contents Item 3 .
696988 :  23 Table of Contents Item 2 .
697149 :  See further discussion in Note 14 .
697797 :  84 Table of Contents Item 2 .
697811 :  See Note 17 entitled Discontinued Operations .
698026 :  See Note 20 for further information .
698030 :  See Note 11 for further information .
698858 :  See Note 13 for additional information .
699281 :  See Note 13 for further information .
700144 :  27 Table of Contents Item 2 .
700784 :  41 Table of Contents Item 2 .
701069 :  35 Table of Contents Item 2 .
701312 :  25 TABLE OF CONTENTS ITEM

In [5]:
sorted_by_tokens = sorted(token_counts, key=lambda x: x[1])
to_remove = set(idx for idx, _ in sorted_by_tokens[:10000])

text_dataset_unfiltered = text_dataset

# with_indices=True passes each row's index to the filter function
text_dataset = text_dataset.filter(
    lambda ex, i: i not in to_remove,
    with_indices=True,
    num_proc=4,  # 4 worker processes
)
print(len(text_dataset_unfiltered), "→", len(text_dataset))

900384 → 890384
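Slicing the sorted list at a fixed rank removes exactly 10,000 rows, so the cut can fall in the middle of a group of rows that all have the same token count. An alternative is to pick a token-count threshold and remove every row at or below it. The sketch below assumes the `(index, length)` pairs built earlier; the helper names `length_histogram` and `indices_at_or_below` are hypothetical, and the toy numbers are for illustration only.

```python
from collections import Counter

def length_histogram(token_counts, up_to=12):
    """Count how many rows have each token length, for lengths <= up_to."""
    hist = Counter(n for _, n in token_counts if n <= up_to)
    return dict(sorted(hist.items()))

def indices_at_or_below(token_counts, max_tokens):
    """Return the indices of all rows with at most max_tokens tokens."""
    return {idx for idx, n in token_counts if n <= max_tokens}

# Toy data: three rows tie at 8 tokens
toy_counts = [(0, 3), (1, 8), (2, 8), (3, 8), (4, 25)]
print(length_histogram(toy_counts))
print(sorted(indices_at_or_below(toy_counts, max_tokens=8)))
```

The resulting set can be used in place of `to_remove` in the filter call; unlike a rank-based slice, it treats all rows of the same length alike.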


In [6]:
text_dataset

Dataset({
    features: ['text'],
    num_rows: 890384
})

In [7]:
# Recompute the token counts; after filtering, indices refer to rows of the filtered dataset
token_counts = [
    (i, len(tokenizer.encode(row["text"], add_special_tokens=False)))
    for i, row in enumerate(text_dataset)
]

# Sort by token count and inspect the next window of short rows
sorted_by_tokens = sorted(token_counts, key=lambda x: x[1])
window = sorted_by_tokens[1000:2000]  # rows ranked 1,000 to 1,999 by ascending token count

# Output
for idx, token_count in window:
    #print(f"BERT tokens (excl. CLS/SEP): {token_count}")
    print(idx, ": ", text_dataset[idx]['text'])
    #print("="*80)

535971 :  For additional information , see Note 13 .
536013 :  28 Table of Contents See Note 2 .
536039 :  37 Table of Contents Interest rate derivatives .
536043 :  For additional information , see Note 10 .
536240 :  ( 3 ) See “ Note 14 .
536242 :  ( 3 ) See “ Note 14 .
536515 :  27 Liquidity and Capital Resources General .
536802 :  57SONOCO PRODUCTS COMPANY Item 2 .
539501 :  NIPSCO 2018 Integrated Resource Plan .
540427 :  The options generally vest over four years .
540524 :  That investigation was commenced in January 2014 .
541144 :  Permanent repairs were completed in September 2018 .
543937 :  Refer to Note 10 for additional information .
544034 :  Refer to Note 10 for further detail .
545276 :  These contracts matured in January 2019 .
546584 :  The facility will mature in March 2021 .
546587 :  The project became fully operational in 2014 .
547101 :  12 AVON PRODUCTS , INC . 1 .
549121 :  Closing occurred on June 5 , 2015 .
549125 :  Closing occurred on August 16 , 2017 .
5
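This second window mixes cross-references and page headers with genuine short sentences (for example `The options generally vest over four years .`), so cutting further by length alone would start to discard real content. A pattern-based check is one possible complement. The sketch below is not part of the original notebook: the patterns are assumptions derived only from the rows printed in this section, and `is_boilerplate` is a hypothetical helper name.

```python
import re

# Illustrative patterns only, derived from the sample rows printed in this section
BOILERPLATE_PATTERNS = [
    # Page headers such as "44 Table of Contents Item 2 ."
    re.compile(r"\d*\s*table of contents\b", re.IGNORECASE),
    # Cross-references such as "See Note 12 ..." or "Refer to Note 10 ..."
    re.compile(
        r"(?:for additional information , )?(?:see|refer to) "
        r"(?:further discussion in )?note \d+\b",
        re.IGNORECASE,
    ),
]

def is_boilerplate(text):
    """Return True if the text starts with one of the illustrative patterns."""
    return any(p.match(text) for p in BOILERPLATE_PATTERNS)

print(is_boilerplate("44 Table of Contents Item 2 ."))
print(is_boilerplate("The options generally vest over four years ."))
```

Because the patterns match only the start of a row, a long sentence that merely begins with a cross-reference would also be flagged; combining the check with a maximum token count would limit that risk.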

<hr style="height:3px; width:100%; background-color:black; border:none; margin:auto;" />

## 4. Save the text in Arrow format (the common format for Hugging Face datasets)

In [12]:
# Save in Arrow format (compact and lossless)
text_dataset.save_to_disk("data/finer-tapt-text-dataset")

Saving the dataset (0/1 shards):   0%|          | 0/890384 [00:00<?, ? examples/s]

<hr style="height:3px; width:100%; background-color:black; border:none; margin:auto;" />

## 5. Check that the dataset saved in the previous step can be loaded again ✅

In [14]:
from datasets import load_from_disk
text_dataset2 = load_from_disk("data/finer-tapt-text-dataset")
text_dataset2

Dataset({
    features: ['text'],
    num_rows: 890384
})
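A matching row count shows that the dataset loads, but not that the text is identical. One way to compare contents is a fingerprint over all rows. This is a minimal sketch under the assumption that both datasets can be iterated row by row as dictionaries with a `text` key; the helper name `text_fingerprint` is hypothetical.

```python
import hashlib

def text_fingerprint(rows):
    """Return (row count, SHA-256 hex digest) over the 'text' field of all rows, in order."""
    digest = hashlib.sha256()
    count = 0
    for row in rows:
        digest.update(row["text"].encode("utf-8"))
        digest.update(b"\x00")  # separator so that row boundaries affect the hash
        count += 1
    return count, digest.hexdigest()

# Toy rows standing in for the original and the reloaded dataset
original = [{"text": "Closing occurred on June 5 , 2015 ."}, {"text": "See Note 2 ."}]
reloaded = [dict(row) for row in original]
print(text_fingerprint(original) == text_fingerprint(reloaded))
```

In the notebook, the comparison would be `text_fingerprint(text_dataset) == text_fingerprint(text_dataset2)`; iterating over all rows in Python may take a while on a dataset of this size.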