# Data Augmentation for Document Information Extraction

This notebook performs data augmentation on a dataset for document information extraction. The goal is to create synthetic variations of the original documents to increase the size and diversity of the training data, which can help improve model robustness.

The augmentation techniques implemented include:
- **Entity Swapping:** Replacing mentions of entities with other mentions of the same type and length within the document.
- **Mask and Fill:** Masking a non-entity word and using a language model (BERT) to predict a replacement.
- **Paraphrasing:** Using a large language model (GPT-3.5-turbo) to rewrite sentences while preserving entities and meaning.
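
For orientation, here is a minimal sketch of the record shape the code in this notebook reads (a `doc` string plus `entities`, each with a `type` and a list of `mentions`) and of what an entity swap does to the text. The sample record and the helper name `toy_swap` are invented for illustration, and the swap pair is chosen by hand here, whereas `swap_entities` picks mentions at random.

```python
import re

# Hypothetical record in the shape the augmentation code expects.
record = {
    "doc": "Anna met Omar in Rome. Omar later moved to Oslo.",
    "entities": [
        {"type": "PERSON", "mentions": ["Anna"]},
        {"type": "PERSON", "mentions": ["Omar"]},
        {"type": "CITY", "mentions": ["Rome"]},
        {"type": "CITY", "mentions": ["Oslo"]},
    ],
}


def toy_swap(text: str, swap_map: dict) -> str:
    """Replace every key of swap_map with its value in one regex pass."""
    keys = sorted(swap_map, key=len, reverse=True)  # try longer keys first
    pattern = re.compile("|".join(map(re.escape, keys)))
    return pattern.sub(lambda m: swap_map[m.group(0)], text)


# "Rome" and "Oslo" share a type (CITY) and a length (4), so they qualify.
swapped = toy_swap(record["doc"], {"Rome": "Oslo"})
print(swapped)
```

Because the replacement is one-directional, the swapped text can end up mentioning the same entity twice (here, "Oslo"). That is acceptable for augmentation but worth keeping in mind when inspecting the results.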

## Setup and Dependencies

This section installs the required libraries and downloads necessary models for the augmentation process.

- `transformers`: Provides the BERT masked language model and the `fill-mask` pipeline.
- `spacy`: Used for tokenization and sentence boundary detection.
- `tqdm`: Displays progress bars while documents are processed.
- `openai`: Used by the LLM paraphrasing function (pinned to version 0.28 in a separate cell).

We also download the `en_core_web_sm` spaCy model.

In [None]:
!pip install --upgrade transformers spacy tqdm
!spacy download en_core_web_sm

In [None]:
# Pinned below 1.0: the paraphrasing code uses the legacy `openai.ChatCompletion` interface
!pip install openai==0.28

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "key"  # placeholder: replace with your own key and never commit a real one

## Model and NLP Initialization

Here, we initialize the natural language processing tools used in the augmentation functions:

- **spaCy (`en_core_web_sm`):** Loaded with the parser and NER components disabled. A rule-based `sentencizer` component is added for sentence boundary detection, which the paraphrasing function needs in order to split documents into chunks.
- **BERT (`bert-base-uncased`):** Loaded and configured as a `fill-mask` pipeline. This is used in the `mask_and_fill` function to suggest replacement words.

The OpenAI API key is read from the `OPENAI_API_KEY` environment variable. The paraphrasing step runs only if the `openai` package is importable and the key is set; otherwise it is skipped with an informational message.
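
The gating described above can be sketched as follows. `paraphrase_enabled` is a hypothetical helper name, and treating a whitespace-only key as missing is an added assumption rather than something the notebook does.

```python
import os


def paraphrase_enabled(client_available: bool, env=os.environ) -> bool:
    """True only if the client library imported and a non-empty key is set."""
    key = env.get("OPENAI_API_KEY", "")
    return client_available and bool(key.strip())


# A plain dict stands in for os.environ so the example has no side effects.
print(paraphrase_enabled(True, {"OPENAI_API_KEY": "placeholder"}))
print(paraphrase_enabled(True, {}))
print(paraphrase_enabled(False, {"OPENAI_API_KEY": "placeholder"}))
```

Returning a strict `bool` (rather than the key string itself) keeps the flag safe to print or log.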

## Augmentation Helpers

These functions implement the core data augmentation logic:

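- `swap_entities(doc)`: Swaps entity mentions within a document. It selects about 30% of the mentions (`MASK_PROBABILITY`, at least one) and replaces each with a different mention of the same type and character length from the same document. It returns `None` if no swap is possible.
- `mask_and_fill(doc)`: Selects a random non-entity word among the first 128 words of the document, masks it, and uses the BERT `fill-mask` pipeline to predict a replacement that differs from the original word.
- `paraphrase_llm(doc)`: Uses the OpenAI API to paraphrase the document in chunks of whole sentences (up to 300 words per chunk). Entity mentions are wrapped in `[[E0]]…[[/E0]]` tags and the prompt instructs the model to leave them unchanged; the tags are stripped afterwards. The function does not verify that the model complied.
- Helper functions `_collect_mentions`, `_wrap_mentions`, and `_unwrap_mentions` flatten the entity data and add or remove the tags around mentions.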
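
The tagging idea can be illustrated with a small, self-contained sketch. It wraps mentions in one regex pass (longest first), and adds a check that a paraphrase still contains the same tagged mentions. The check is an assumption of this sketch, not something the notebook performs, and the names `wrap`, `tagged_spans`, and `unwrap` are hypothetical.

```python
import re

TAG_RE = re.compile(r"\[\[/?E\d+\]\]")


def wrap(text, mentions):
    """Tag every mention as [[E<i>]]mention[[/E<i>]] in a single pass."""
    # Longest first, ties broken alphabetically so the indices are stable.
    unique = sorted(set(mentions), key=lambda m: (-len(m), m))
    index = {m: i for i, m in enumerate(unique)}
    pattern = re.compile("|".join(re.escape(m) for m in unique))

    def tag(match):
        i = index[match.group(0)]
        return f"[[E{i}]]{match.group(0)}[[/E{i}]]"

    return pattern.sub(tag, text)


def tagged_spans(text):
    """Return the sorted (index, mention) pairs found in a tagged text."""
    return sorted(re.findall(r"\[\[E(\d+)\]\](.*?)\[\[/E\1\]\]", text))


def unwrap(text):
    """Strip all tags, leaving plain text."""
    return TAG_RE.sub("", text)


source = wrap("Ada Lovelace met Ada in 1843.", ["Ada", "Ada Lovelace", "1843"])
# Hand-written stand-ins for model output (no API call is made here).
good = "In [[E1]]1843[[/E1]], [[E0]]Ada Lovelace[[/E0]] had a meeting with [[E2]]Ada[[/E2]]."
bad = "In 1843, [[E0]]Ada Byron[[/E0]] met [[E2]]Ada[[/E2]]."

print(source)
print(tagged_spans(good) == tagged_spans(source))
print(tagged_spans(bad) == tagged_spans(source))
print(unwrap(good))
```

Comparing the tagged spans before unwrapping gives a cheap way to reject paraphrases in which the model altered or dropped an entity.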

## Main Driver (`augment_file`)

The `augment_file` function orchestrates the augmentation process for a single input JSON file.

For each document in the input file:
1. The original document is included in the output list.
2. Copies of the document are passed to each of the augmentation helper functions (`swap_entities`, `mask_and_fill`, `paraphrase_llm`).
3. If an augmentation function successfully generates a new document, it is added to the output list.

The augmented data for each input file is saved to a corresponding output file. The `overwrite` flag controls whether existing output files are overwritten.
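
The per-document loop can be sketched with stand-in augmenters. `upper_case` and `always_fails` are invented placeholders for the real helpers, and `augment_rows` is a hypothetical name covering only the in-memory part of `augment_file` (no file I/O or error handling).

```python
import copy


def always_fails(doc):
    """Mimics an augmenter that finds nothing to change."""
    return None


def upper_case(doc):
    """Toy augmenter: upper-cases the text and labels the copy."""
    out = copy.deepcopy(doc)
    out["doc"] = out["doc"].upper()
    out["aug_type"] = "upper"
    return out


def augment_rows(rows, augmenters):
    augmented = []
    for row in rows:
        original = copy.deepcopy(row)
        original["aug_type"] = "orig"  # the original is always kept
        augmented.append(original)
        for func in augmenters:
            new_row = func(copy.deepcopy(row))  # each augmenter gets its own copy
            if new_row is not None:
                augmented.append(new_row)
    return augmented


rows = [{"doc": "a b c", "entities": []}, {"doc": "d e f", "entities": []}]
result = augment_rows(rows, [upper_case, always_fails])
print([r["aug_type"] for r in result])
```

Passing a fresh deep copy to every augmenter means one function cannot leak its modifications into the next, and the input rows stay untouched.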

## Execution and File Processing

The script defines a list of datasets (train and dev) and their corresponding input and output directories.

It then iterates through each dataset:
- It finds all JSON files ending with `_all_examples.json` in the input directory.
- For each found file, it calls the `augment_file` function to perform the augmentation.
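
A small sketch of the file discovery step, run against a temporary directory with made-up file names so that it does not depend on the Kaggle paths:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp)
    # Hypothetical files: only the first one matches the pattern.
    for name in ("Energy_all_examples.json", "Energy_sample.json", "notes.txt"):
        (input_dir / name).write_text("[]", encoding="utf-8")

    # Path.glob yields matches in arbitrary order; sort for a stable run order.
    matches = sorted(p.name for p in input_dir.glob("*_all_examples.json"))

print(matches)
```

Sorting is optional. The notebook processes files in whatever order `glob` returns them, which is why the log further down is not alphabetical.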

Finally, a separate cell zips the contents of `/kaggle/working/`, which holds the augmented data.
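
If a pure-Python alternative to the shell `zip` command is preferred, the standard library's `shutil.make_archive` can do the same job. This is a sketch on a temporary directory with invented contents, not the command the notebook runs.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "working"
    (work / "aug_train").mkdir(parents=True)
    (work / "aug_train" / "Energy_all_examples.json").write_text("[]", encoding="utf-8")

    # Write the archive next to, not inside, the directory being archived,
    # so the archive cannot end up containing itself.
    archive = shutil.make_archive(str(Path(tmp) / "working_dir"), "zip", root_dir=work)

    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())

print(names)
```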

In [None]:
import copy
import json
import os
import random
import re
import sys
from pathlib import Path
from typing import Dict, List

import spacy
from tqdm import tqdm

try:
    import openai  # paraphrase
except ImportError:  # pragma: no cover
    openai = None  # type: ignore

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline


############################################################
# Model & NLP initialisation                               #
############################################################

FILL_MODEL_NAME = "bert-base-uncased"
MASK_PROBABILITY = 0.3  # 30 % of entities to swap
FIRST_N_WORDS_FOR_MASK = 128 # Defines the scope within which to pick a word for masking

print("Loading spaCy model …", file=sys.stderr)
NLP = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # Start with tokenizer
# Add the sentencizer component to the pipeline for sentence boundary detection
if not NLP.has_pipe("sentencizer"):
    try:
        NLP.add_pipe("sentencizer")
        print("Added 'sentencizer' to spaCy pipeline.", file=sys.stderr)
    except Exception as e:
        print(f"Could not add 'sentencizer'. Trying to re-enable parser for sentence boundaries if needed: {e}", file=sys.stderr)
        # Fallback or further investigation might be needed if add_pipe fails
        # For now, this should generally work.
else:
    print("Sentencizer already in spaCy pipeline.", file=sys.stderr)


print("Loading BERT fill‑mask pipeline …", file=sys.stderr)
_fill_tokenizer = AutoTokenizer.from_pretrained(FILL_MODEL_NAME)
_fill_model = AutoModelForMaskedLM.from_pretrained(FILL_MODEL_NAME)
FILL_MASK = pipeline("fill-mask", model=_fill_model, tokenizer=_fill_tokenizer, top_k=5)
MASK_TOKEN = _fill_tokenizer.mask_token

OPENAI_ENABLED = bool(openai is not None and os.getenv("OPENAI_API_KEY"))  # strict bool, never the key itself
if not OPENAI_ENABLED:
    print("[INFO] OpenAI not available or key missing – paraphrase step will be skipped", file=sys.stderr)

########################################################################
# Augmentation helpers                                                 #
########################################################################

def _collect_mentions(entities: List[Dict]) -> List[Dict]:
    """Flatten entity dicts into a list with text, type and len, plus indices."""
    flat = []
    for ent_idx, ent in enumerate(entities):
        for m_idx, mention in enumerate(ent.get("mentions", [])):
            flat.append({
                "text": mention,
                "type": ent["type"],
                "len": len(mention),
                "ent_idx": ent_idx,
                "m_idx": m_idx,
            })
    return flat


def swap_entities(doc: Dict) -> Dict | None:
    """Swap 30 % of entity mentions with same‑type, same‑length alternates in‑doc."""
    text = doc["doc"]
    ents = copy.deepcopy(doc["entities"])  # deep‑copy because we mutate it
    mentions = _collect_mentions(ents)
    if len(mentions) < 2:
        return None
    random.shuffle(mentions)
    num_to_swap = max(1, int(len(mentions) * MASK_PROBABILITY))
    swap_map = {}
    for m in mentions[:num_to_swap]:
        cands = [c for c in mentions if c["type"] == m["type"] and c["len"] == m["len"] and c["text"] != m["text"]]
        if not cands:
            continue
        repl = random.choice(cands)
        swap_map[m["text"]] = repl["text"]
        ents[m["ent_idx"]]["mentions"][m["m_idx"]] = repl["text"]  # keep metadata aligned

    if not swap_map:
        return None

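    # Try longer keys first so a short mention cannot pre-empt a longer one that starts with it
    pattern = re.compile("|".join(map(re.escape, sorted(swap_map, key=len, reverse=True))))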
    swapped_text = pattern.sub(lambda match: swap_map[match.group(0)], text)

    out = copy.deepcopy(doc)
    out["doc"] = swapped_text
    out["entities"] = ents
    out["aug_type"] = "swap"
    return out


def mask_and_fill(doc: Dict) -> Dict | None:
    """Mask one random non-entity word among the first 128 and let BERT replace it."""
    text = doc["doc"]
    words = text.split()  # Simple whitespace tokenization for the word list
    if len(words) < 5:
        return None

    scope = min(len(words), FIRST_N_WORDS_FOR_MASK)

    # Character spans of every occurrence of every entity mention in the text
    entity_spans = []
    for ent in doc["entities"]:
        for mention_text in ent.get("mentions", []):
            start_offset = 0
            while True:
                start = text.find(mention_text, start_offset)
                if start == -1:
                    break
                entity_spans.append((start, start + len(mention_text)))
                start_offset = start + 1


    token_offsets_in_scope = []
    current_offset = 0
    for i, w in enumerate(words[:scope]):
        start = text.find(w, current_offset)
        if start == -1:
            current_offset += len(w) + 1
            continue
        end = start + len(w)
        token_offsets_in_scope.append({"start": start, "end": end, "idx_in_words": i})
        current_offset = end

    non_ent_indices_in_words = []
    for token_info in token_offsets_in_scope:
        s, e, original_idx = token_info["start"], token_info["end"], token_info["idx_in_words"]
        is_entity = False
        for es, ee in entity_spans:
            if (es <= s < ee) or (es < e <= ee) or (s <= es and e >= ee):
                is_entity = True
                break
        if not is_entity:
            non_ent_indices_in_words.append(original_idx)

    if not non_ent_indices_in_words:
        return None

    idx_to_mask = random.choice(non_ent_indices_in_words)
    original_token = words[idx_to_mask]

    temp_words = list(words)
    temp_words[idx_to_mask] = MASK_TOKEN

    context_window_size = 250

    start_window = max(0, idx_to_mask - context_window_size // 2)
    end_window = min(len(temp_words), idx_to_mask + context_window_size // 2 + 1)

    if idx_to_mask < context_window_size // 2:
        end_window = min(len(temp_words), context_window_size)
    elif idx_to_mask > len(temp_words) - (context_window_size // 2):
        start_window = max(0, len(temp_words) - context_window_size)

    context_words_for_bert = temp_words[start_window:end_window]
    masked_context_for_bert = " ".join(context_words_for_bert)

    try:
        preds = FILL_MASK(masked_context_for_bert)
    except Exception as exc:
        print(f"[WARN] fill-mask failed on context for doc '{doc.get('title', 'Untitled')[:30]}...': {exc}")
        return None

    replacement = next((p["token_str"].strip() for p in preds if p["token_str"].strip().lower() != original_token.lower()), None)

    if not replacement:
        return None

    words[idx_to_mask] = replacement
    filled_sentence = " ".join(words)

    out = copy.deepcopy(doc)
    out["doc"] = filled_sentence
    out["aug_type"] = "mask"
    return out


# ------------------------ paraphrase helpers -------------------------

TAG_RE = re.compile(r"\[\[/?E\d+]]")

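def _wrap_mentions(text: str, mentions: List[str]) -> str:
    """Wrap every mention in [[E<i>]]…[[/E<i>]] tags using a single regex pass.

    Matching all mentions at once (longest first) keeps short mentions such as
    "5" from matching inside tags that were already inserted (e.g. "[[E25]]").
    """
    unique = sorted({m for m in mentions if m}, key=len, reverse=True)
    if not unique:
        return text
    index = {m: i for i, m in enumerate(unique)}
    pattern = re.compile("|".join(map(re.escape, unique)))

    def _tag(match: re.Match) -> str:
        i = index[match.group(0)]
        return f"[[E{i}]]{match.group(0)}[[/E{i}]]"

    return pattern.sub(_tag, text)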


def _unwrap_mentions(text: str) -> str:
    return TAG_RE.sub("", text)


PROMPT_HEADER = (
    "You are a meticulous rewriting assistant. Paraphrase the given paragraph "
    "in fluent English while preserving **all facts and meaning**.\n"
    "The paragraph contains entity placeholders wrapped in double‑bracket tags.\n"
    "**Do not alter, remove, reorder, or add anything inside tags like [[E0]]…[[/E0]].**\n"
    "Keep roughly the same length (no summarisation).\n"
)

PROMPT_EXAMPLE = (
    "### Example\n"
    "INPUT:  [[E0]]Barack Obama[[/E0]] was born in Hawaii and became the 44th president of the United States.\n"
    "OUTPUT: [[E0]]Barack Obama[[/E0]] was born in Hawaii before going on to serve as the 44th president of the United States.\n\n"
)


def paraphrase_llm(doc: Dict) -> Dict | None:
    if not OPENAI_ENABLED:
        return None

    text = doc["doc"]
    mentions = [m_text for ent in doc["entities"] for m_text in ent.get("mentions", [])]

    tagged_full_text = _wrap_mentions(text, mentions)

    try:
        nlp_doc = NLP(tagged_full_text)
    except Exception as e:
        print(f"[WARN] spaCy NLP processing failed for paraphrase_llm on doc '{doc.get('title', 'Untitled')[:30]}...': {e}")
        return None # Cannot proceed with chunking if NLP fails

    sentences = []
    try:
        # `.sents` needs sentence boundaries; without the 'sentencizer' added above, spaCy raises E030.
        sentences = [s.text for s in nlp_doc.sents]
        if not sentences: # If somehow no sentences are found (e.g. empty input after tagging)
             print(f"[WARN] No sentences found by spaCy for doc '{doc.get('title', 'Untitled')[:30]}...' in paraphrase_llm. Text: {tagged_full_text[:100]}")
             return None
    except Exception as e: # Catch if .sents fails for any other reason
        print(f"[WARN] Error accessing sentences (doc.sents) for '{doc.get('title', 'Untitled')[:30]}...': {e}")
        return None


    MAX_WORDS_PER_CHUNK_FOR_LLM = 300

    # Group whole sentences into chunks of at most MAX_WORDS_PER_CHUNK_FOR_LLM words.
    # A single sentence longer than the limit becomes its own chunk.
    chunks: List[str] = []
    current_chunk_sentences: List[str] = []
    current_chunk_word_count = 0
    for sent in sentences:
        sent_word_count = len(sent.split())
        if current_chunk_sentences and current_chunk_word_count + sent_word_count > MAX_WORDS_PER_CHUNK_FOR_LLM:
            chunks.append(" ".join(current_chunk_sentences))
            current_chunk_sentences = []
            current_chunk_word_count = 0
        current_chunk_sentences.append(sent)
        current_chunk_word_count += sent_word_count
    if current_chunk_sentences:
        # Flush the final chunk so the last sentence is never dropped.
        chunks.append(" ".join(current_chunk_sentences))

    paraphrased_text_parts = []
    for chunk_to_paraphrase in chunks:
        if not chunk_to_paraphrase.strip():  # Avoid sending empty chunks to the LLM
            continue

        prompt_for_chunk = (
            PROMPT_HEADER
            + PROMPT_EXAMPLE
            + "### Paragraph to rewrite (this might be a segment of a larger document)\n"
            + chunk_to_paraphrase
        )

        # Rough output budget: 1.5 tokens per input word, clamped to [32, 1500].
        calculated_max_tokens = max(32, int(len(chunk_to_paraphrase.split()) * 1.5))
        llm_max_tokens_for_chunk = min(calculated_max_tokens, 1500)

        try:
            resp = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt_for_chunk}],
                temperature=0.7,
                max_tokens=llm_max_tokens_for_chunk,
            )
            paraphrased_chunk_content = resp.choices[0].message.content.strip()

            # Fall back to the original chunk if the model returned a much shorter text.
            if len(paraphrased_chunk_content) < len(chunk_to_paraphrase) * 0.3:
                print(f"[WARN] Paraphrased chunk too short for: {chunk_to_paraphrase[:50]}... Using original.")
                paraphrased_text_parts.append(chunk_to_paraphrase)
            else:
                paraphrased_text_parts.append(paraphrased_chunk_content)

        except Exception as exc_chunk:
            print(f"[WARN] OpenAI paraphrase failed for chunk: {chunk_to_paraphrase[:50]}... Error: {exc_chunk}. Using original.")
            paraphrased_text_parts.append(chunk_to_paraphrase)


    paraphrased_full_tagged_text = " ".join(paraphrased_text_parts)

    cleaned_paraphrased_text = _unwrap_mentions(paraphrased_full_tagged_text)

    if not cleaned_paraphrased_text.strip() or len(cleaned_paraphrased_text) < len(text) * 0.3:
        print(f"[WARN] Final paraphrased text significantly shorter than original or empty for doc: {doc.get('title', 'Untitled')[:30]}...")
        return None

    out = copy.deepcopy(doc)
    out["doc"] = cleaned_paraphrased_text
    out["aug_type"] = "para"
    return out

########################################################################
# Main driver                                                          #
########################################################################

def augment_file(path: Path, out_dir: Path, overwrite: bool = False):
    out_path = out_dir / path.name
    if out_path.exists() and not overwrite:
        print(f"[SKIP] {out_path} exists")
        return

    try:
        with open(path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
    except Exception as e:
        print(f"[ERROR] Failed to read or parse JSON file {path}: {e}")
        return


    augmented: List[Dict] = []
    for row_idx, row in enumerate(tqdm(data, desc=path.name)):
        if not isinstance(row, dict) or "doc" not in row or "entities" not in row:
            print(f"[WARN] Skipping invalid row (index {row_idx}) in {path.name}: Missing 'doc' or 'entities'. Row content: {str(row)[:100]}")
            continue

        row_orig = copy.deepcopy(row)
        row_orig["aug_type"] = "orig"
        augmented.append(row_orig)

        doc_to_augment = copy.deepcopy(row)

        for func in (swap_entities, mask_and_fill, paraphrase_llm):
            try:
                new_row = func(copy.deepcopy(doc_to_augment))
                if new_row is not None:
                    augmented.append(new_row)
            except Exception as exc:
                title_info = row.get('title', row.get('doc_id', f'doc_idx_{row_idx}'))
                print(f"[WARN] Augmentation function {func.__name__} failed on '{str(title_info)[:40]}…': {exc}")


    if not augmented:
        print(f"[INFO] No data was augmented or generated for {path.name} (possibly all input rows were invalid or empty).")
        return

    out_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        with open(out_path, "w", encoding="utf-8") as fh:
            json.dump(augmented, fh, ensure_ascii=False, indent=2)
        print(f"[WRITE] {out_path} (×{len(augmented)} docs)")
    except Exception as e:
        print(f"[ERROR] Failed to write augmented data to {out_path}: {e}")


if __name__ == "__main__":
    datasets = [
        {
            "name": "train",
            "input_dir": Path("/kaggle/input/nlp-second-try/DocIE_dataset_final_version/train"),
            "output_dir": Path("/kaggle/working/aug_train")
        },
        {
            "name": "dev",
            "input_dir": Path("/kaggle/input/nlp-second-try/DocIE_dataset_final_version/dev"),
            "output_dir": Path("/kaggle/working/aug_dev")
        }
    ]

    for dataset in datasets:
        print(f"\n{'='*50}")
        print(f"Processing {dataset['name']} set...")
        print(f"Input dir: {dataset['input_dir']}")
        print(f"Output dir: {dataset['output_dir']}")
        print(f"{'='*50}\n")

        if not dataset['input_dir'].exists():
            print(f"[ERROR] Input directory not found: {dataset['input_dir']}")
            continue

        json_files = list(dataset['input_dir'].glob("*_all_examples.json"))
        if not json_files:
            print(f"No input JSON files found (ending with _all_examples.json) for {dataset['name']} set in {dataset['input_dir']}.")
            continue

        print(f"Found {len(json_files)} JSON files to process for {dataset['name']} set.")

        for fp in json_files:
            print(f"\nStarting augmentation for file: {fp.name}")
            augment_file(fp, dataset['output_dir'], overwrite=False)

    print("\n\nData augmentation process finished.")

Loading spaCy model …
Added 'sentencizer' to spaCy pipeline.
Loading BERT fill‑mask pipeline …


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0



Processing train set...
Input dir: /kaggle/input/nlp-second-try/DocIE_dataset_final_version/train
Output dir: /kaggle/working/aug_train

Found 5 JSON files to process for train set.

Starting augmentation for file: Communication_all_examples.json


Communication_all_examples.json: 100%|██████████| 10/10 [02:23<00:00, 14.39s/it]


[WRITE] /kaggle/working/aug_train/Communication_all_examples.json (×40 docs)

Starting augmentation for file: Government_all_examples.json


Government_all_examples.json:   0%|          | 0/9 [00:00<?, ?it/s]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Government_all_examples.json:  11%|█         | 1/9 [00:36<04:51, 36.38s/it]

[WARN] Paraphrased chunk too short for: The fall of the [[E213]]Derg[[/E213]] was a milita... Using original.
[WARN] Paraphrased chunk too short for: Subsequently, they retook military outpost of [[E1... Using original.


Government_all_examples.json:  22%|██▏       | 2/9 [00:53<02:57, 25.34s/it]

[WARN] Paraphrased chunk too short for: [[E2[[E153]]5[[/E153]]]]Fullmetal [[E88]]Alc[[E1[[... Using original.
[WARN] Paraphrased chunk too short for: E83]]Tim [[E114]]Marcoh[[/E114]][[/E83]][[/E[[E153... Using original.


Government_all_examples.json:  33%|███▎      | 3/9 [01:07<01:58, 19.75s/it]

[WARN] Paraphrased chunk too short for: E147]]He[[/E147]] added that [[E1[[E153]]5[[/E153]... Using original.


Government_all_examples.json: 100%|██████████| 9/9 [02:09<00:00, 14.43s/it]


[WARN] Paraphrased chunk too short for: In [[E141]]1940[[/E141]], [[E9]][[E116]][[E125]]Ge... Using original.
[WRITE] /kaggle/working/aug_train/Government_all_examples.json (×36 docs)

Starting augmentation for file: Entertainment_all_examples.json


Entertainment_all_examples.json:  25%|██▌       | 3/12 [00:32<01:40, 11.17s/it]

[WARN] Paraphrased chunk too short for: [[E59]]Dumb Ways to Die[[/E59]] is an [[E127]][[E1... Using original.
[WARN] Paraphrased chunk too short for: It featured characters known as "[[E160]]Beans[[/E... Using original.


Entertainment_all_examples.json: 100%|██████████| 12/12 [02:31<00:00, 12.66s/it]


[WRITE] /kaggle/working/aug_train/Entertainment_all_examples.json (×48 docs)

Starting augmentation for file: Energy_all_examples.json


Energy_all_examples.json: 100%|██████████| 10/10 [01:51<00:00, 11.17s/it]


[WRITE] /kaggle/working/aug_train/Energy_all_examples.json (×40 docs)

Starting augmentation for file: Education_all_examples.json


Education_all_examples.json:  10%|█         | 1/10 [00:09<01:25,  9.55s/it]

[WARN] Paraphrased chunk too short for: [[E[[E43]]1[[/E43]]5]]Campus [[E[[E44]]2[[/E44]]0]... Using original.


Education_all_examples.json: 100%|██████████| 10/10 [01:37<00:00,  9.71s/it]


[WRITE] /kaggle/working/aug_train/Education_all_examples.json (×40 docs)

Processing dev set...
Input dir: /kaggle/input/nlp-second-try/DocIE_dataset_final_version/dev
Output dir: /kaggle/working/aug_dev

Found 2 JSON files to process for dev set.

Starting augmentation for file: Human_behavior_all_examples.json


Human_behavior_all_examples.json: 100%|██████████| 13/13 [03:43<00:00, 17.22s/it]


[WRITE] /kaggle/working/aug_dev/Human_behavior_all_examples.json (×51 docs)

Starting augmentation for file: Internet_all_examples.json


Internet_all_examples.json: 100%|██████████| 10/10 [01:15<00:00,  7.50s/it]

[WRITE] /kaggle/working/aug_dev/Internet_all_examples.json (×37 docs)


Data augmentation process finished.
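
The `×N docs` figures in the log can be read against the number of input rows. Each row contributes the original plus up to three augmented copies, so 10 rows yield at most 40 documents. A lower count (51 for the 13 rows of `Human_behavior_all_examples.json`, 37 for the 10 rows of `Internet_all_examples.json`) means that some augmenter returned `None` or raised an error. A minimal sketch for tallying `aug_type` values follows; the `rows` list is invented stand-in data, not output from this run.

```python
from collections import Counter

# Stand-in for the list loaded from one augmented JSON file.
rows = [
    {"aug_type": "orig"}, {"aug_type": "swap"}, {"aug_type": "mask"}, {"aug_type": "para"},
    {"aug_type": "orig"}, {"aug_type": "mask"}, {"aug_type": "para"},
]

counts = Counter(r["aug_type"] for r in rows)
# For each augmentation, how many originals did not get a copy of that kind.
missing = {kind: counts["orig"] - counts[kind] for kind in ("swap", "mask", "para")}

print(dict(counts))
print(missing)
```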





In [None]:
!zip -r /kaggle/working/working_dir.zip /kaggle/working/*


  adding: kaggle/working/aug_dev/ (stored 0%)
  adding: kaggle/working/aug_dev/Human_behavior_all_examples.json (deflated 90%)
  adding: kaggle/working/aug_dev/Internet_all_examples.json (deflated 91%)
  adding: kaggle/working/aug_train/ (stored 0%)
  adding: kaggle/working/aug_train/Entertainment_all_examples.json (deflated 89%)
  adding: kaggle/working/aug_train/Communication_all_examples.json (deflated 90%)
  adding: kaggle/working/aug_train/Energy_all_examples.json (deflated 89%)
  adding: kaggle/working/aug_train/Government_all_examples.json (deflated 89%)
  adding: kaggle/working/aug_train/Education_all_examples.json (deflated 91%)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
