# About

The GhostWriter dataset is derived from the [PrismAI_v2 dataset](https://huggingface.co/datasets/TheItCrOw/PrismAI_v2). We only apply some additional processing (e.g. removing line breaks, leftover placeholders, and Markdown/HTML artifacts) and then push the result with the common split names `train`, `validation`, and `test`.

In [9]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [10]:
import torch
import re
import gc
import numpy as np
import pandas as pd

from pathlib import Path
from IPython.display import display, HTML
from data_hub.hub import DataHub
from datasets import DatasetDict, load_dataset
from huggingface_hub import HfApi, HfFolder
from collections import Counter

torch.cuda.empty_cache()
gc.collect()
if torch.cuda.is_available():
    with torch.cuda.device(torch.cuda.current_device()):
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()

In [11]:
hf_token = (Path.home() / ".hf_token").read_text().strip()
hub = DataHub(hf_token)
dataset = hub.get_splits("TheItCrOw/PrismAI_v2")
print(dataset)

Label ID mapping:
0 → human
1 → ai
2 → fusion
train distribution:
  ai: 151977 (39.7%)
  human: 104303 (27.3%)
  fusion: 126255 (33.0%)
eval distribution:
  ai: 21711 (39.7%)
  fusion: 18037 (33.0%)
  human: 14900 (27.3%)
test distribution:
  human: 29801 (27.3%)
  ai: 43422 (39.7%)
  fusion: 36073 (33.0%)
DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'domain', 'date', 'source', 'lang', 'label', 'agent', 'type'],
        num_rows: 382535
    })
    eval: Dataset({
        features: ['id', 'text', 'domain', 'date', 'source', 'lang', 'label', 'agent', 'type'],
        num_rows: 54648
    })
    test: Dataset({
        features: ['id', 'text', 'domain', 'date', 'source', 'lang', 'label', 'agent', 'type'],
        num_rows: 109296
    })
})
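
The percentages in the distribution printout can be re-derived from the raw counts. The following is a minimal sketch that uses only the numbers shown in the `train distribution` output above; the helper name `label_shares` is hypothetical and is not part of `DataHub`.

```python
from collections import Counter

# Per-label row counts copied from the "train distribution" output above.
train_counts = Counter({"ai": 151977, "human": 104303, "fusion": 126255})


def label_shares(counts):
    """Return {label: share of all rows in percent}, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total_rows = sum(train_counts.values())
shares = label_shares(train_counts)

print(total_rows)  # matches num_rows of the train split
print(shares)
```

The counts sum to the `num_rows` value reported for the `train` split, and the `eval` and `test` splits show the same label proportions.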


In [12]:
# Regex to match placeholder-style bracketed fragments such as [Your Name] or [Senator's Name].
# (These should already have been removed upstream, but I still found some remnants, so I clean them again here.)
placeholder_re = re.compile(r"\[[A-Z][A-Za-z0-9'’\-,\. ]{1,50}\]")
# Markdown and HTML-like artifacts: **, __, runs of two or more Markdown punctuation
# characters (e.g. ##, ---, ``), image syntax ![alt](url), and HTML tags.
md_pattern = re.compile(
    r"(\*\*|__|[*_`#>\[\]\(\)\~\-]{2,}|!\[[^\]]*\]\([^\)]*\)|<[A-Za-z\/][^>]*>)"
)

def clean_newlines(example):
    text = example["text"]
    # Replace newlines with spaces
    text = text.replace("\n", " ")
    # Remove AI-style placeholders
    text = placeholder_re.sub("", text)
    # Remove Markdown / HTML artifacts
    text = md_pattern.sub("", text)
    # Collapse multiple spaces
    text = re.sub(r"\s{2,}", " ", text).strip()
    example["text"] = text
    return example

# Apply cleaning to all splits
cleaned_dataset = DatasetDict({
    split: ds.map(clean_newlines, num_proc=8)
    for split, ds in dataset.items()
})

cleaned_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'domain', 'date', 'source', 'lang', 'label', 'agent', 'type'],
        num_rows: 382535
    })
    eval: Dataset({
        features: ['id', 'text', 'domain', 'date', 'source', 'lang', 'label', 'agent', 'type'],
        num_rows: 54648
    })
    test: Dataset({
        features: ['id', 'text', 'domain', 'date', 'source', 'lang', 'label', 'agent', 'type'],
        num_rows: 109296
    })
})
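
To make the effect of the cleaning step concrete, here is a small standalone sketch that applies the same two regular expressions and the same four steps to a single string. The function name `clean_text` and the sample string are made up for illustration; the notebook itself applies `clean_newlines` to dataset rows via `map`.

```python
import re

# Same patterns as in the cleaning cell above.
placeholder_re = re.compile(r"\[[A-Z][A-Za-z0-9'’\-,\. ]{1,50}\]")
md_pattern = re.compile(
    r"(\*\*|__|[*_`#>\[\]\(\)\~\-]{2,}|!\[[^\]]*\]\([^\)]*\)|<[A-Za-z\/][^>]*>)"
)


def clean_text(text: str) -> str:
    """Hypothetical string-level version of the row-level cleaning function."""
    text = text.replace("\n", " ")               # newlines -> spaces
    text = placeholder_re.sub("", text)          # drop [Your Name]-style placeholders
    text = md_pattern.sub("", text)              # drop Markdown / HTML artifacts
    return re.sub(r"\s{2,}", " ", text).strip()  # collapse repeated whitespace


# Made-up sample containing every artifact type the patterns target.
sample = (
    "**Dear voters,**\n\n"
    "thank you for your <b>support</b>.\n"
    "## Next steps\n"
    "Sincerely, [Your Name]"
)

cleaned = clean_text(sample)
print(cleaned)
# Dear voters, thank you for your support. Next steps Sincerely,
```

Note that the character-class branch only fires on runs of two or more characters, so a single `#`, `*`, or backtick is left in place.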

In [13]:
# Rename the "eval" split to the more common "validation"
cleaned_dataset["validation"] = cleaned_dataset.pop("eval")
cleaned_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'domain', 'date', 'source', 'lang', 'label', 'agent', 'type'],
        num_rows: 382535
    })
    test: Dataset({
        features: ['id', 'text', 'domain', 'date', 'source', 'lang', 'label', 'agent', 'type'],
        num_rows: 109296
    })
    validation: Dataset({
        features: ['id', 'text', 'domain', 'date', 'source', 'lang', 'label', 'agent', 'type'],
        num_rows: 54648
    })
})
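
The `pop`-and-reassign rename moves the renamed split to the end, which is why `validation` is listed after `test` in the output above. `DatasetDict` is a `dict` subclass, so the behaviour can be sketched with a plain dictionary; the integer values below merely stand in for the datasets, and the order-preserving variant is an optional alternative rather than what the notebook does.

```python
# Plain-dict stand-in for the DatasetDict (values are the row counts).
splits = {"train": 382535, "eval": 54648, "test": 109296}

# Same idiom as in the notebook: the new key is appended at the end.
renamed = dict(splits)
renamed["validation"] = renamed.pop("eval")
print(list(renamed))  # ['train', 'test', 'validation']

# Alternative sketch that keeps the original position of the split.
in_place = {("validation" if name == "eval" else name): rows
            for name, rows in splits.items()}
print(list(in_place))  # ['train', 'validation', 'test']
```

The key order has no effect on the content of the splits; it only changes the order in which they are listed.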

In [14]:
cleaned_dataset['train'][5]

{'id': '3e549c46-e722-4c18-a9ed-470e0297d206',
 'text': 'Sehr geehrte Damen und Herren, heute stehe ich hier, um ein Thema anzusprechen, das für die Sicherheit unseres Landes von entscheidender Bedeutung ist: die Notwendigkeit eines effektiven strategischen Frühwarnsystems in Deutschland. In einer Zeit, in der hybride Bedrohungen und geopolitische Spannungen, insbesondere in Bezug auf Russland, an der Tagesordnung sind, dürfen wir nicht tatenlos zusehen. Die Nord Stream 2-Pipeline ist nicht nur ein Infrastrukturprojekt; sie ist ein geopolitisches Instrument, das von Russland genutzt wird, um seinen Einfluss auf Europa auszuweiten. Die AfD hat in dieser Debatte eine klare Position bezogen, die nicht nur die nationalen Interessen gefährdet, sondern auch die transatlantischen Beziehungen belastet. Wie können wir es uns leisten, die Augen vor den realen Risiken zu verschließen, die mit dieser Pipeline verbunden sind? Die Antworten auf diese Fragen sind nicht nur politisch, sondern auch mor

In [15]:
repo_id = "TheItCrOw/GhostWriter"
cleaned_dataset.push_to_hub(repo_id, token=hf_token)
print(f"Pushed to https://huggingface.co/datasets/{repo_id} with splits: {list(cleaned_dataset.keys())}")

Pushed to https://huggingface.co/datasets/TheItCrOw/GhostWriter with splits: ['train', 'test', 'validation']
