# Converting Multilingual PII Dataset to CoNLL Format

This notebook converts the multilingual PII dataset (`ai4privacy/open-pii-masking-500k-ai4privacy`) to CoNLL format, focusing on four languages: English, German, French, and Italian.

**Steps:**
1. Load and filter the dataset by language
2. Convert the filtered data to CoNLL format
3. Save and verify the results

The resulting CoNLL files contain:
- One token per line, followed by its BIO tag
- A `# Language: xx` comment line before each example
- A blank line after each example

## 1. Setup

First, let's import the required libraries and load our dataset.

In [30]:
# Install required libraries (only needs to be run once)
!pip install -q datasets  # -q flag keeps the output quiet

# Import necessary libraries
import datasets  # Hugging Face datasets library for loading and processing datasets

# Load the dataset from Hugging Face Hub
# This dataset contains text with PII (Personally Identifiable Information) annotations
ds = datasets.load_dataset('ai4privacy/open-pii-masking-500k-ai4privacy')

# Display basic dataset information
print('Available splits:', list(ds.keys()))  # Show the available splits (here: train and validation)
print('\nExample structure:')
print(ds['train'].features)  # Show what fields are available in each example

Available splits: ['train', 'validation']

Example structure:
{'source_text': Value(dtype='string', id=None), 'masked_text': Value(dtype='string', id=None), 'privacy_mask': [{'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'value': Value(dtype='string', id=None), 'label_index': Value(dtype='int64', id=None)}], 'split': Value(dtype='string', id=None), 'uid': Value(dtype='int64', id=None), 'language': Value(dtype='string', id=None), 'region': Value(dtype='string', id=None), 'script': Value(dtype='string', id=None), 'mbert_tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'mbert_token_classes': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
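
The `privacy_mask` field carries character offsets into `source_text`. The sketch below shows one way to sanity-check those offsets before converting anything. The record is hypothetical (invented text and offsets that only mirror the schema above), `span_report` is an illustrative helper name, and the check assumes `end` is exclusive, so that `text[start:end]` equals `value`.

```python
# Hypothetical record that mirrors the schema printed above; the text and
# offsets are invented for illustration.
record = {
    "source_text": "Contact Maria Rossi in Rome.",
    "language": "en",
    "privacy_mask": [
        {"label": "GIVENNAME", "start": 8, "end": 13, "value": "Maria"},
        {"label": "SURNAME", "start": 14, "end": 19, "value": "Rossi"},
        {"label": "CITY", "start": 23, "end": 27, "value": "Rome"},
    ],
}

def span_report(example):
    """Return (label, sliced_text, matches_value) for each annotated span."""
    text = example["source_text"]
    report = []
    for span in example["privacy_mask"]:
        # Assumption: 'end' is exclusive, so a plain slice recovers the value.
        sliced = text[span["start"]:span["end"]]
        report.append((span["label"], sliced, sliced == span["value"]))
    return report

for label, sliced, ok in span_report(record):
    print(f"{label:<10} {sliced!r:<10} matches value: {ok}")
```

Running the same check on a few real examples, such as `ds['train'][0]`, would confirm whether the exclusive-end assumption holds for this dataset.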


## 2. Filter Dataset by Language

We'll filter the dataset to include only:
- English (en)
- German (de)
- French (fr)
- Italian (it)

In [31]:
# Define the languages we want to keep in our dataset
target_languages = ['en', 'de', 'fr', 'it']  # English, German, French, Italian

# Define a function to filter dataset by language
def filter_by_languages(dataset, languages):
    """
    Filter a dataset to keep only examples in specified languages.
    Args:
        dataset: Hugging Face dataset to filter
        languages: List of language codes to keep
    Returns:
        Filtered dataset containing only examples in specified languages
    """
    return dataset.filter(lambda x: x['language'] in languages)

# Apply the filtering to all splits (train, validation, etc.)
filtered_ds = datasets.DatasetDict({
    split: filter_by_languages(ds[split], target_languages)
    for split in ds.keys()
})

# Print statistics to compare original and filtered dataset sizes
for split in filtered_ds.keys():
    orig_size = len(ds[split])
    filt_size = len(filtered_ds[split])
    print(f'\n{split} split:')
    print(f'  Original: {orig_size:,} examples')
    print(f'  Filtered: {filt_size:,} examples ({filt_size/orig_size*100:.1f}%)')

# Show distribution of languages in the filtered training set
lang_dist = filtered_ds['train'].to_pandas()['language'].value_counts()
print('\nLanguage distribution in training set:')
for lang, count in lang_dist.items():
    percentage = count/len(filtered_ds['train'])*100
    print(f'  {lang}: {count:,} examples ({percentage:.1f}%)')


train split:
  Original: 464,150 examples
  Filtered: 331,106 examples (71.3%)

validation split:
  Original: 116,077 examples
  Filtered: 82,931 examples (71.4%)

Language distribution in training set:
  en: 120,533 examples (36.4%)
  fr: 89,670 examples (27.1%)
  de: 65,899 examples (19.9%)
  it: 55,004 examples (16.6%)
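
Converting the whole split to pandas only to count one column can be avoided. As a minimal sketch, `collections.Counter` gives the same distribution from a plain list of language codes. The `languages` list here is an invented stand-in for `filtered_ds['train']['language']`.

```python
from collections import Counter

# Stand-in for filtered_ds['train']['language']; the values are invented.
languages = ["en", "fr", "en", "de", "it", "en", "fr"]

counts = Counter(languages)
total = sum(counts.values())

# most_common() yields (language, count) pairs sorted by descending count.
for lang, count in counts.most_common():
    print(f"  {lang}: {count:,} examples ({count / total * 100:.1f}%)")
```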


## 3. Convert to CoNLL Format

Now we'll convert the filtered dataset to CoNLL format, where:
- Each line contains a token and its BIO tag
- Language information is preserved as comments
- Sentences are separated by blank lines

Example:
```
# Language: en
John B-NAME
lives O
in O
Paris B-LOCATION

# Language: de
Ich O
wohne O
in O
Berlin B-LOCATION
```
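
The core of the conversion is aligning character-level spans with whitespace tokens. The sketch below shows one possible way to do that: record each token's offsets with `re.finditer` and tag every token whose character range overlaps a span. The helper name `bio_tags` and the sample spans are illustrative assumptions, and `end` is again assumed to be exclusive.

```python
import re

def bio_tags(text, spans):
    """Whitespace-tokenize `text` and tag the tokens that overlap a span."""
    # Each entry is (token, start_offset, end_offset) in the original text.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)

    for span in sorted(spans, key=lambda s: s["start"]):
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span if the two character ranges overlap
            # (assumes span['end'] is exclusive).
            if tok_start < span["end"] and tok_end > span["start"]:
                tags[i] = ("B-" if first else "I-") + span["label"].upper()
                first = False

    return [f"{tok} {tag}" for (tok, _, _), tag in zip(tokens, tags)]

# Invented example: a two-token name and a one-token location.
demo_text = "John Smith lives in Paris"
demo_spans = [
    {"start": 0, "end": 10, "label": "name"},
    {"start": 20, "end": 25, "label": "location"},
]
for line in bio_tags(demo_text, demo_spans):
    print(line)
```

Overlap-based matching also tags a token such as `1942:` when the span covers only `1942`, because trailing punctuation stays attached to the token under whitespace tokenization.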

In [33]:
def text_to_conll(text, spans):
    """
    Convert a text and its PII spans to CoNLL format.
    Args:
        text: The input text string
        spans: List of dictionaries containing PII annotations with 'start', 'end', and 'label' keys
    Returns:
        List of strings, each string being a CoNLL format line (token BIO-tag)
    """
    # Step 1: Tokenize the text (simple whitespace tokenization)
    tokens = text.split()
    # Initialize all tokens with 'O' (Outside) tag
    tags = ['O'] * len(tokens)
    
    def char_to_token_idx(char_pos):
        """
        Convert a character position to a token index.
        This is needed because spans have character-level positions,
        but we need token-level positions for CoNLL format.
        """
        current_pos = 0
        for i, token in enumerate(tokens):
            # Find where this token starts in the original text
            token_start = text.find(token, current_pos)
            token_end = token_start + len(token)
            # Check if our target position falls within this token
            if token_start <= char_pos < token_end:
                return i
            current_pos = token_end
        return None
    
    # Step 2: Process each span to assign BIO tags
    for span in spans:
        # Find the tokens containing the span's first and last characters.
        # 'end' is assumed to be exclusive (text[start:end] == value), so the
        # last character of the span sits at end - 1.
        start_idx = char_to_token_idx(span['start'])
        end_idx = char_to_token_idx(span['end'] - 1)

        if start_idx is not None:
            # B- prefix for the first token of the entity
            tags[start_idx] = f'B-{span["label"].upper()}'

            # I- prefix for any subsequent tokens of the same entity
            if end_idx is not None and end_idx > start_idx:
                for i in range(start_idx + 1, end_idx + 1):
                    tags[i] = f'I-{span["label"].upper()}'

    # Step 3: Create the CoNLL format lines
    return [f'{token} {tag}' for token, tag in zip(tokens, tags)]

def create_conll_file(dataset, split, output_path):
    """
    Create a CoNLL format file from a dataset split.
    Args:
        dataset: Hugging Face dataset
        split: Which split to process ('train', 'validation', etc.)
        output_path: Where to save the CoNLL file
    Returns:
        Number of lines written to the file
    """
    all_lines = []
    
    # Process each example in the dataset
    for example in dataset[split]:
        # Add a comment line with language information
        all_lines.append(f'# Language: {example["language"]}')
        
        # Convert this example to CoNLL format
        conll_lines = text_to_conll(example['source_text'], example['privacy_mask'])
        all_lines.extend(conll_lines)
        
        # Add blank line to separate sentences (CoNLL format requirement)
        all_lines.append('')
    
    # Write all lines to the output file
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(all_lines))
    
    return len(all_lines)  # Return number of lines for statistics

# Define output file paths
train_path = 'train.conll'
val_path = 'validation.conll'

# Create CoNLL files for both splits
train_lines = create_conll_file(filtered_ds, 'train', train_path)
val_lines = create_conll_file(filtered_ds, 'validation', val_path)

# Print statistics about the created files
print(f'Created {train_path} with {train_lines:,} lines')
print(f'Created {val_path} with {val_lines:,} lines')

Created train.conll with 6,576,194 lines
Created validation.conll with 1,644,058 lines
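
With several million lines per split, collecting everything in a list before writing keeps the whole output in memory. One alternative, sketched here under assumptions, is to write each example as soon as it is converted. `write_conll_streaming` and its `convert` parameter are hypothetical names, and the demo data is invented and already tagged.

```python
import os
import tempfile

def write_conll_streaming(examples, output_path, convert):
    """Write examples one at a time; `convert` maps an example to CoNLL lines."""
    n_lines = 0
    with open(output_path, "w", encoding="utf-8") as f:
        for example in examples:
            # Language comment, token lines, then a blank separator line.
            block = [f"# Language: {example['language']}", *convert(example), ""]
            f.write("\n".join(block) + "\n")
            n_lines += len(block)
    return n_lines

# Usage with two invented, pre-tagged examples (hypothetical data).
demo = [
    {"language": "en", "lines": ["John B-NAME", "lives O"]},
    {"language": "de", "lines": ["Berlin B-LOCATION"]},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.conll")
written = write_conll_streaming(demo, demo_path, convert=lambda ex: ex["lines"])
print(f"Wrote {written} lines to {demo_path}")
```

In the notebook, `convert` could wrap `text_to_conll`, for example `lambda ex: text_to_conll(ex['source_text'], ex['privacy_mask'])`.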


## 4. Verify the Output

Let's check the generated CoNLL files to ensure they're formatted correctly and contain the expected information.

In [34]:
def peek_file(filepath, num_lines=20):
    """
    Display the first few lines of a file with nice formatting.
    Args:
        filepath: Path to the file to read
        num_lines: Number of lines to display (default: 20)
    """
    print(f'First {num_lines} lines of {filepath}:')
    print('-' * 50)  # Separator line for readability
    
    # Read and display the lines
    with open(filepath, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= num_lines:
                break
            # Strip trailing whitespace for clean display
            print(line.rstrip())
    
    print('-' * 50)  # Closing separator line

# Check samples from both training and validation files
for filepath in [train_path, val_path]:
    print(f'\nSample from {filepath}:')
    peek_file(filepath)  # Show first 20 lines by default

# The output should show:
# 1. Language comments (# Language: xx)
# 2. Tokens with their BIO tags
# 3. Blank lines between sentences


Sample from train.conll:
First 20 lines of train.conll:
--------------------------------------------------
# Language: en
To-do O
list O
for O
4th B-DATE
August I-DATE
1942: I-DATE
meet O
with O
Brandy B-GIVENNAME
Haroon B-SURNAME
at O
10:17 B-TIME
to O
discuss O
the O
volunteer O
service O
record O
of O
--------------------------------------------------

Sample from validation.conll:
First 20 lines of validation.conll:
--------------------------------------------------
# Language: fr
Ma O
mère O
Astrit B-GIVENNAME
Nani O
Kofi B-SURNAME
est O
née O
à O
Ruswil B-CITY
en O
février/75 B-DATE

# Language: de
15. B-DATE
November I-DATE
1942: I-DATE
Datum O
der O
Einreichung O
--------------------------------------------------
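
Eyeballing the first lines only goes so far. As a rough sketch, the file can also be parsed back into examples and checked for `I-` tags that do not continue an entity. `read_conll` and `invalid_i_tags` are illustrative helper names, and the sample is abridged from the validation output above.

```python
def read_conll(lines):
    """Group CoNLL lines into examples of (language, [(token, tag), ...])."""
    examples, language, pairs = [], None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# Language:"):
            language = line.split(":", 1)[1].strip()
        elif not line.strip():
            # A blank line closes the current example.
            if pairs:
                examples.append((language, pairs))
            language, pairs = None, []
        else:
            # Tokens contain no spaces, so the tag is the last field.
            token, tag = line.rsplit(" ", 1)
            pairs.append((token, tag))
    if pairs:
        examples.append((language, pairs))
    return examples

def invalid_i_tags(pairs):
    """Return indices where an I- tag does not continue the preceding entity."""
    bad, prev = [], "O"
    for i, (_, tag) in enumerate(pairs):
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            bad.append(i)
        prev = tag
    return bad

# Abridged from the validation sample shown above.
sample = """# Language: fr
Ma O
mère O
Astrit B-GIVENNAME

# Language: de
15. B-DATE
November I-DATE
1942: I-DATE
""".splitlines()

for language, pairs in read_conll(sample):
    print(language, len(pairs), "tokens, invalid I- tags at:", invalid_i_tags(pairs))
```

To check a full file, pass the open file handle itself, as in `read_conll(open(train_path, encoding='utf-8'))`.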
