# Download and Prepare Datasets

Downloads and prepares datasets for WMDP bio unlearning experiments:

**Datasets:**
- Gibberish: Questions rewritten with nonsense words
- Real Words Sciency: Questions rewritten with scientific-sounding words
- Nonsensical Biology: Questions rewritten with biology jargon
- WMDP Bio Robust: Original WMDP biology questions (6 subtopics combined)

Keeps only the questions that appear in all three nonsense datasets and in WMDP, then saves each filtered dataset as a CSV file.
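
The filtering step amounts to a chain of set intersections. The sketch below illustrates the idea on toy data; the question strings and variable names are placeholders, not values from the real datasets.

```python
# Toy stand-ins for the 'original' column of each nonsense dataset
# and for the WMDP 'question' column (all values are made up).
gib_originals = {"q1", "q2", "q3", "q4"}
real_originals = {"q1", "q2", "q3", "q5"}
nons_originals = {"q1", "q2", "q3", "q6"}
wmdp_originals = {"q1", "q2", "q7"}

# Questions rewritten in all three nonsense datasets
common = gib_originals & real_originals & nons_originals

# Of those, the ones that also exist in WMDP
final = common & wmdp_originals

print(sorted(common))  # ['q1', 'q2', 'q3']
print(sorted(final))   # ['q1', 'q2']
```

Every dataset is then restricted to the rows whose source question is in the final set, so all four end up covering the same questions.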

## Load Datasets


In [1]:
import os
import sys
import pandas as pd
from pathlib import Path
from datasets import load_dataset
from dotenv import load_dotenv

# Setup paths and authentication
project_root = Path.cwd()
while not (project_root / "paths.py").exists() and project_root != project_root.parent:
    project_root = project_root.parent
sys.path.insert(0, str(project_root))

from paths import ENV_FILE, DATASETS_PATH

load_dotenv(ENV_FILE)
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    raise ValueError("❌ HF_TOKEN not found in .env file")

# Load nonsense datasets
print("📥 Loading nonsense datasets from HuggingFace...\n")
gibberish_dataset = load_dataset("NLie2/rewrite-questions-gibberish", token=hf_token)
real_words_sciency_dataset = load_dataset("NLie2/rewrite-questions-real-words-sciency", token=hf_token)
nonsensical_biology_dataset = load_dataset("NLie2/rewrite-questions-nonsensical-biology", token=hf_token)

# Load WMDP bio dataset (all 6 configs)
print("📥 Loading WMDP Bio Robust dataset (6 configs)...\n")
wmdp_configs = [
    'bioweapons_and_bioterrorism',
    'dual_use_virology', 
    'enhanced_potential_pandemic_pathogens',
    'expanding_access_to_threat_vectors',
    'reverse_genetics_and_easy_editing',
    'viral_vector_research'
]

wmdp_dfs = []
for config in wmdp_configs:
    dataset = load_dataset("EleutherAI/wmdp_bio_robust_mcqa", config, token=hf_token)
    split = 'robust' if 'robust' in dataset else list(dataset.keys())[0]
    df = dataset[split].to_pandas()
    df['config'] = config
    wmdp_dfs.append(df)

wmdp_bio_combined = pd.concat(wmdp_dfs, ignore_index=True)

print(f"✅ All datasets loaded successfully!\n")
print(f"   Gibberish: {len(gibberish_dataset['train'])} questions")
print(f"   Real Words Sciency: {len(real_words_sciency_dataset['train'])} questions")
print(f"   Nonsensical Biology: {len(nonsensical_biology_dataset['train'])} questions")
print(f"   WMDP Bio: {len(wmdp_bio_combined)} questions ({len(wmdp_configs)} configs)")




📥 Loading nonsense datasets from HuggingFace...

📥 Loading WMDP Bio Robust dataset (6 configs)...

✅ All datasets loaded successfully!

   Gibberish: 1243 questions
   Real Words Sciency: 1250 questions
   Nonsensical Biology: 1244 questions
   WMDP Bio: 868 questions (6 configs)


## Filter & Validate


In [2]:
# Get unique WMDP questions
wmdp_questions = set(wmdp_bio_combined['question'].values)

# Convert to pandas
gibberish_df = gibberish_dataset['train'].to_pandas()
real_words_df = real_words_sciency_dataset['train'].to_pandas()
nonsensical_df = nonsensical_biology_dataset['train'].to_pandas()

# Find questions that exist in ALL three nonsense datasets AND in WMDP
gib_questions = set(gibberish_df['original'].unique())
real_questions = set(real_words_df['original'].unique())
nons_questions = set(nonsensical_df['original'].unique())

# Get intersection: questions in all three nonsense datasets
common_questions = gib_questions.intersection(real_questions).intersection(nons_questions)

# Further filter to only those in WMDP
final_questions = common_questions.intersection(wmdp_questions)

# Filter all datasets to this common set
gibberish_filtered = gibberish_df[gibberish_df['original'].isin(final_questions)].copy()
real_words_filtered = real_words_df[real_words_df['original'].isin(final_questions)].copy()
nonsensical_filtered = nonsensical_df[nonsensical_df['original'].isin(final_questions)].copy()
wmdp_filtered = wmdp_bio_combined[wmdp_bio_combined['question'].isin(final_questions)].copy()

# Show filtering results
print(f"🔍 Filtering to questions in ALL datasets:\n")
print(f"   Questions in all 3 nonsense datasets: {len(common_questions)}")
print(f"   Questions also in WMDP: {len(final_questions)}")
print(f"\n   After filtering:")
print(f"   Gibberish:           {len(gibberish_df):4d} → {len(gibberish_filtered):3d}")
print(f"   Real Words Sciency:  {len(real_words_df):4d} → {len(real_words_filtered):3d}")
print(f"   Nonsensical Biology: {len(nonsensical_df):4d} → {len(nonsensical_filtered):3d}")
print(f"   WMDP Bio:            {len(wmdp_bio_combined):4d} → {len(wmdp_filtered):3d}")
print(f"\n✅ All datasets now have {len(final_questions)} matching questions!")


🔍 Filtering to questions in ALL datasets:

   Questions in all 3 nonsense datasets: 1197
   Questions also in WMDP: 811

   After filtering:
   Gibberish:           1243 → 811
   Real Words Sciency:  1250 → 811
   Nonsensical Biology: 1244 → 811
   WMDP Bio:             868 → 811

✅ All datasets now have 811 matching questions!
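
Equal row counts alone do not rule out duplicates or gaps, so a stricter validation could confirm that every filtered dataset covers exactly the same set of questions with no repeats. A minimal sketch on toy frames follows; `check_alignment` is a hypothetical helper that is not part of this notebook, and the toy values are made up.

```python
import pandas as pd


def check_alignment(frames, key_columns):
    """Return the shared key set if all frames have identical, duplicate-free keys."""
    key_sets = []
    for name, df in frames.items():
        keys = df[key_columns[name]]
        if keys.duplicated().any():
            raise ValueError(f"{name}: duplicate questions found")
        key_sets.append(set(keys))
    first = key_sets[0]
    if any(s != first for s in key_sets[1:]):
        raise ValueError("datasets do not cover the same questions")
    return first


# Toy data: the nonsense datasets key on 'original', WMDP keys on 'question'.
toy_frames = {
    "gibberish": pd.DataFrame({"original": ["q1", "q2"]}),
    "wmdp": pd.DataFrame({"question": ["q2", "q1"]}),
}
toy_keys = {"gibberish": "original", "wmdp": "question"}

shared = check_alignment(toy_frames, toy_keys)
print(len(shared))  # 2
```

In the notebook, the same check could be run on the four filtered DataFrames, passing `'original'` as the key for the nonsense datasets and `'question'` for WMDP.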


In [3]:
# Create output directory and save
output_dir = DATASETS_PATH / "input_data/wmdp-bio-forget-mcqa"
output_dir.mkdir(parents=True, exist_ok=True)

datasets_to_save = [
    ("gibberish.csv", gibberish_filtered),
    ("real_words_sciency.csv", real_words_filtered),
    ("nonsensical_biology.csv", nonsensical_filtered),
    ("wmdp_bio_robust.csv", wmdp_filtered)
]

print(f"\n💾 Saving to: {output_dir}\n")
for filename, df in datasets_to_save:
    path = output_dir / filename
    df.to_csv(path, index=False)
    print(f"   ✓ {filename:25s} ({len(df):3d} questions)")

print(f"\n✅ All datasets saved!")



💾 Saving to: /rds/general/user/nk1924/home/ign-inf-unlearning/data/datasets/input_data/wmdp-bio-forget-mcqa

   ✓ gibberish.csv             (811 questions)
   ✓ real_words_sciency.csv    (811 questions)
   ✓ nonsensical_biology.csv   (811 questions)
   ✓ wmdp_bio_robust.csv       (811 questions)

✅ All datasets saved!
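
Saving with `index=False` means the CSVs reload with the same columns and no extra index column. The sketch below shows the round trip on a toy frame written to a temporary directory; the `rewritten` column name and its values are illustrative assumptions, not the real schema.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame; 'rewritten' is a hypothetical column name used only for illustration.
toy = pd.DataFrame({"original": ["q1", "q2"], "rewritten": ["zx qv?", "blorp?"]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.csv"
    toy.to_csv(path, index=False)   # same call as in the cell above
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))  # ['original', 'rewritten']
print(len(reloaded))           # 2
```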


## Upload to HF

In [8]:
import os
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv
from datasets import Dataset

# Load environment variables
load_dotenv()
token = os.getenv('HF_TOKEN')

# Configuration - map CSV files to their repos
source_dir = Path("/rds/general/user/nk1924/home/ign-inf-unlearning/data/datasets/input_data/wmdp-bio-forget-mcqa")

file_to_repo = {
    "gibberish.csv": "NLie2/rewrite-questions-gibberish",
    "real_words_sciency.csv": "NLie2/rewrite-questions-real-words-sciency",
    "nonsensical_biology.csv": "NLie2/rewrite-questions-nonsensical-biology"
}

# Process and upload each CSV as "robust" split
for filename, repo_id in file_to_repo.items():
    file_path = source_dir / filename
    
    if not file_path.exists():
        print(f"⚠️ File not found: {filename}")
        continue
    
    print(f"\nProcessing: {filename}")
    print(f"Target repo: {repo_id}")
    
    df = pd.read_csv(file_path)
    
    # Convert to HF Dataset and upload as "robust" split
    dataset = Dataset.from_pandas(df)
    dataset.push_to_hub(
        repo_id=repo_id,
        token=token,
        split="robust"  # All go to "robust" split
    )
    
    print(f"✓ Uploaded {len(df)} records to 'robust' split")
    print(f"  View at: https://huggingface.co/datasets/{repo_id}")

print("\n✅ All robust versions uploaded!")


Processing: gibberish.csv
Target repo: NLie2/rewrite-questions-gibberish



✓ Uploaded 811 records to 'robust' split
  View at: https://huggingface.co/datasets/NLie2/rewrite-questions-gibberish

Processing: real_words_sciency.csv
Target repo: NLie2/rewrite-questions-real-words-sciency



✓ Uploaded 811 records to 'robust' split
  View at: https://huggingface.co/datasets/NLie2/rewrite-questions-real-words-sciency

Processing: nonsensical_biology.csv
Target repo: NLie2/rewrite-questions-nonsensical-biology



No files have been modified since last commit. Skipping to prevent empty commit.


✓ Uploaded 811 records to 'robust' split
  View at: https://huggingface.co/datasets/NLie2/rewrite-questions-nonsensical-biology

✅ All robust versions uploaded!
