# Data Preparation for PII Data Detection

This notebook shares my current approach to cross-validation (CV) splitting, striding, visualization, and dataset versioning with Weights & Biases (W&B).

You can run it interactively, or add your W&B API key to the notebook secrets to run it offline.

You can check out [the video from my live training session](https://www.youtube.com/watch?v=w4ZDwiSXMK0).

I also saved the outputs to the [Kaggle dataset](https://www.kaggle.com/datasets/thedrcat/pii-detection-cv-split) if you want to import it in a Kaggle training notebook. 

In [2]:
import json
import pandas as pd

# Use a context manager so the file handle is closed after loading
with open("../input/pii-detection-removal-from-educational-data/train.json") as f:
    train = json.load(f)
df = pd.DataFrame(train)

len(train)

6807

# CV Split

Let's start by checking out the distribution of labels across all training essays. 

In [3]:
from sklearn.preprocessing import MultiLabelBinarizer

def encode_labels(df):
    # Collect the entity types present in each essay, dropping the B-/I- prefix
    df["unique_labels"] = df["labels"].apply(
        lambda labels: sorted({l.split('-', 1)[1] for l in labels if l != 'O'})
    )

    # Add one-hot columns, one per entity type
    mlb = MultiLabelBinarizer()
    one_hot_encoded = mlb.fit_transform(df['unique_labels'])
    # Reuse df's index so the concat stays row-aligned even if df was shuffled
    one_hot_df = pd.DataFrame(one_hot_encoded, columns=mlb.classes_, index=df.index)
    df = pd.concat([df, one_hot_df], axis=1)

    # Add an 'OTHER' column for essays without any PII label
    df['OTHER'] = df['unique_labels'].apply(lambda x: 1 if len(x) == 0 else 0)

    return df, list(mlb.classes_) + ['OTHER']

df, label_classes = encode_labels(df)

for col in label_classes:
    print(f'{col}: {df[col].sum()}')

EMAIL: 24
ID_NUM: 33
NAME_STUDENT: 891
PHONE_NUM: 4
STREET_ADDRESS: 2
URL_PERSONAL: 72
USERNAME: 5
OTHER: 5862
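
Note that these counts are per essay, not per token: each essay contributes at most 1 to a class, however many tokens carry that label. As a minimal sketch of the per-essay step, here is the same prefix-stripping logic on a made-up label sequence. The helper name `entity_types` and the toy labels are hypothetical and only serve as an illustration.

```python
def entity_types(labels):
    """Return the sorted entity types in a BIO label sequence, ignoring 'O'."""
    return sorted({label.split("-", 1)[1] for label in labels if label != "O"})

# Made-up label sequence for one essay
toy_labels = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"]

types = entity_types(toy_labels)
is_other = int(len(types) == 0)  # 1 only when the essay has no PII label at all

print(types, is_other)  # ['EMAIL', 'NAME_STUDENT'] 0
```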


I want all the very rare classes to be in my validation split. This is going to be an opinionated split: I'd like to put the following number of essays per class into validation.

In [4]:
# Shuffle the dataframe
df = df.sample(frac=1, random_state=42)

# Create a 'valid' column and set it to False
df['valid'] = False

# Define the validation numbers
val_nums = {
    'EMAIL': 12,
    'ID_NUM': 12,
    'NAME_STUDENT': 100,
    'PHONE_NUM': 4,
    'STREET_ADDRESS': 2,
    'URL_PERSONAL': 20,
    'USERNAME': 5,
    'OTHER': 1000, 
}

# For each class in val_nums, randomly select the specified number of examples and set 'valid' to True
for label, num in val_nums.items():
    valid_indices = df[df[label] == 1].sample(n=num, random_state=42).index
    df.loc[valid_indices, 'valid'] = True


# Let's double check the classes per split:
for col in label_classes:
    print(f'VALID {col}: {df[df.valid][col].sum()}')
    print(f'TRAIN {col}: {df[~df.valid][col].sum()}')

VALID EMAIL: 13
TRAIN EMAIL: 11
VALID ID_NUM: 13
TRAIN ID_NUM: 20
VALID NAME_STUDENT: 124
TRAIN NAME_STUDENT: 767
VALID PHONE_NUM: 4
TRAIN PHONE_NUM: 0
VALID STREET_ADDRESS: 2
TRAIN STREET_ADDRESS: 0
VALID URL_PERSONAL: 26
TRAIN URL_PERSONAL: 46
VALID USERNAME: 5
TRAIN USERNAME: 0
VALID OTHER: 1000
TRAIN OTHER: 4862
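
Some validation counts come out above the requested numbers (for example 13 EMAIL essays for a target of 12). This is expected with per-class sampling: an essay sampled for one class can also contain another class, and it then counts towards both. A minimal sketch with made-up data, where taking the first `n` candidates is a deterministic stand-in for `df.sample`:

```python
# Made-up toy data: essay id -> set of PII types found in that essay
essays = {
    0: {"EMAIL"},
    1: {"EMAIL", "NAME_STUDENT"},
    2: {"NAME_STUDENT"},
    3: set(),
}
targets = {"EMAIL": 1, "NAME_STUDENT": 1}

valid_ids = set()
for label, n in targets.items():
    candidates = [i for i, types in essays.items() if label in types]
    # Deterministic stand-in for random sampling: take the first n candidates
    valid_ids.update(candidates[:n])

# Count essays per class that ended up in validation
valid_counts = {
    label: sum(label in essays[i] for i in valid_ids) for label in targets
}

print(sorted(valid_ids), valid_counts)  # [0, 1] {'EMAIL': 2, 'NAME_STUDENT': 1}
```

Essay 1 was picked for NAME_STUDENT but also contains an email, so EMAIL ends up with 2 validation essays against a target of 1.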


# Visualization

Let's prepare the visualization code based on [this great notebook](https://www.kaggle.com/code/sinchir0/visualization-code-using-displacy).

In [18]:
# https://www.kaggle.com/code/sinchir0/visualization-code-using-displacy
import spacy
from spacy.tokens import Span
from spacy import displacy

nlp = spacy.blank("en")

options = {
    "colors": {
        "B-NAME_STUDENT": "aqua",
        "I-NAME_STUDENT": "skyblue",
        "B-EMAIL": "limegreen",
        "I-EMAIL": "lime",
        "B-USERNAME": "hotpink",
        "I-USERNAME": "lightpink",
        "B-ID_NUM": "purple",
        "I-ID_NUM": "rebeccapurple",
        "B-PHONE_NUM": "red",
        "I-PHONE_NUM": "salmon",
        "B-URL_PERSONAL": "silver",
        "I-URL_PERSONAL": "lightgray",
        "B-STREET_ADDRESS": "brown",
        "I-STREET_ADDRESS": "chocolate",
    }
}

def visualize(row):
    doc = nlp(row.full_text)
    doc.ents = [
        Span(doc, idx, idx + 1, label=label)
        for idx, label in enumerate(row.labels)
        if label != "O"
    ]
    html = displacy.render(doc, style="ent", jupyter=False, options=options)
    return html


In [5]:
# from IPython.display import display, HTML
# html = visualize(df.loc[0])
# display(HTML(html))

# Truncation with stride

There are two ways to do striding here. The best is probably to use the tokenizer's built-in striding. I opted for the easy way and applied striding on the spaCy tokens instead. This means we still face variable sequence lengths after tokenization, because a fixed number of spaCy tokens maps to a varying number of model tokens.

In [8]:
def add_token_indices(doc_tokens):
    token_indices = list(range(len(doc_tokens)))
    return token_indices

df['token_indices'] = df['tokens'].apply(add_token_indices)

In [9]:
def rebuild_text(tokens, trailing_whitespace):
    # Join tokens, adding a space after each token flagged with trailing whitespace
    return ''.join(
        token + (" " if ws else "")
        for token, ws in zip(tokens, trailing_whitespace)
    )


def split_rows(df, max_length, doc_stride):
    new_df = []
    for _, row in df.iterrows():
        tokens = row['tokens']
        if len(tokens) > max_length:
            start = 0
            while start < len(tokens):
                remaining_tokens = len(tokens) - start
                if remaining_tokens < max_length and start != 0:
                    # Adjust start for the last window to ensure it has max_length tokens
                    start = max(0, len(tokens) - max_length)
                end = min(start + max_length, len(tokens))
                new_row = {}
                new_row['document'] = row['document']
                new_row['valid'] = row['valid']
                new_row['tokens'] = tokens[start:end]
                new_row['trailing_whitespace'] = row['trailing_whitespace'][start:end]
                new_row['labels'] = row['labels'][start:end]
                new_row['token_indices'] = list(range(start, end))
                new_row['full_text'] = rebuild_text(new_row['tokens'], new_row['trailing_whitespace'])
                new_df.append(new_row)
                if remaining_tokens >= max_length:
                    start += doc_stride
                else:
                    # Break the loop if we've adjusted for the last window
                    break
        else:
            new_row = {
                'document': row['document'], 
                'valid': row['valid'],
                'tokens': row['tokens'], 
                'trailing_whitespace': row['trailing_whitespace'], 
                'labels': row['labels'], 
                'token_indices': row['token_indices'], 
                'full_text': row['full_text']
            }
            new_df.append(new_row)
    return pd.DataFrame(new_df)
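
The windowing rule in `split_rows` can be summarised as a list of start offsets. Here `doc_stride` is the step between window starts, so consecutive windows overlap by `max_length - doc_stride` tokens. The sketch below uses a hypothetical helper, `window_starts`, which is not part of the notebook. It follows the same rule but emits the right-aligned final window only once. As far as I can trace the loop above, it emits that final window twice when a window happens to end exactly on the last token (for example 1000 tokens with `max_length=750` and `doc_stride=250`), so dropping duplicate rows may be worth considering.

```python
def window_starts(n_tokens, max_length, doc_stride):
    """Start offsets of sliding windows; the last window is right-aligned."""
    if n_tokens <= max_length:
        # Short documents are kept as a single window
        return [0]
    last = n_tokens - max_length
    # Regular windows that fit entirely inside the document
    starts = list(range(0, last + 1, doc_stride))
    # Add a right-aligned final window unless a regular window already ends there
    if starts[-1] != last:
        starts.append(last)
    return starts

for n in (500, 900, 1000, 1100):
    print(n, window_starts(n, max_length=750, doc_stride=250))
# 500 [0]
# 900 [0, 150]
# 1000 [0, 250]
# 1100 [0, 250, 350]
```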


In [10]:
max_length = 750
doc_stride = 250
stride_df = split_rows(df, max_length, doc_stride)

In [11]:
len(df), len(stride_df)

(6807, 11468)
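
Because windows overlap, the same original token can appear in several rows. The `token_indices` column keeps each token's position in the source essay, which should make it possible to map window-level outputs back to the document later. The following is only a sketch of one possible use (averaging per-token scores across windows). The function `merge_window_scores` and the toy scores are hypothetical and not something this notebook does.

```python
from collections import defaultdict

def merge_window_scores(windows):
    """Average per-token scores over overlapping windows, keyed by original token index."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for token_indices, scores in windows:
        for idx, score in zip(token_indices, scores):
            totals[idx] += score
            counts[idx] += 1
    return {idx: totals[idx] / counts[idx] for idx in sorted(totals)}

# Two made-up windows over a 6-token document, overlapping on tokens 2 and 3
windows = [
    ([0, 1, 2, 3], [0.0, 0.5, 1.0, 0.0]),
    ([2, 3, 4, 5], [0.5, 1.0, 0.0, 0.0]),
]
merged = merge_window_scores(windows)
print(merged)  # {0: 0.0, 1: 0.5, 2: 0.75, 3: 0.5, 4: 0.0, 5: 0.0}
```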

In [12]:
stride_df, label_classes = encode_labels(stride_df)

# Saving to W&B

It's best practice to version datasets properly and visualize them in W&B. Let's do this!

To run the code below, add your `WANDB_API_KEY` to the Kaggle notebook secrets. You can get the key [here](https://wandb.ai/authorize).
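
If you run this outside Kaggle, `kaggle_secrets` is not available. A minimal sketch of a fallback, assuming you export the key as the `WANDB_API_KEY` environment variable (which `wandb` also reads on its own). The helper `resolve_wandb_key` is hypothetical and not part of this notebook.

```python
import os

def resolve_wandb_key(secret_name="WANDB_API_KEY"):
    """Return the key from the environment, else from Kaggle secrets, else None."""
    key = os.environ.get(secret_name)
    if key:
        return key
    try:
        # Only importable inside a Kaggle notebook
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Not on Kaggle, or the secret is not attached to this notebook
        return None
```

The result could then be passed to `wandb.login(key=...)` in place of the direct secrets lookup.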

In [13]:
from kaggle_secrets import UserSecretsClient
import wandb

user_secrets = UserSecretsClient()
wandb_api_key = user_secrets.get_secret("WANDB_API_KEY")
wandb.login(key=wandb_api_key)
wandb.init(project='pii', job_type='preprocessing')

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: darek. Use `wandb login --relogin` to force relogin


In [14]:
# Let's add our hyperparameters to the config 

wandb.config.update({
    'max_length': max_length,
    'doc_stride': doc_stride,
})

In [16]:
# Let's first log data as artifacts

df.to_parquet('raw_data.parquet', index=False)
stride_df.to_parquet('stride_data.parquet', index=False)

raw_data = wandb.Artifact(name="raw_data", type="dataset")
raw_data.add_file('raw_data.parquet')
wandb.log_artifact(raw_data)

processed_data = wandb.Artifact(name="processed_data", type="dataset")
processed_data.add_file('stride_data.parquet')
wandb.log_artifact(processed_data)

<Artifact processed_data>

In [19]:
# We will generate html viz for every train essay, wrap it up in `wandb.Html` and create a W&B table to inspect it
wandb_htmls = [wandb.Html(visualize(row)) for _, row in df.iterrows()]
df['visualization'] = wandb_htmls
table = wandb.Table(dataframe=df)
wandb.log({'original_dataset': table})



In [20]:
# Finish W&B run
wandb.finish()

30.090 MB of 30.090 MB uploaded

# Share your findings

If you find any good insights from inspecting the data, please share them in the comments!