#### This notebook is modified from <a href="https://www.kaggle.com/code/leonshangguan/modify-of-pii-detect-study">Modify of PII Detect Study</a>, <a href="https://www.kaggle.com/code/pjmathematician/pii-eda-presidio-baseline">PII EDA Presidio Baseline</a> and <a href="https://www.kaggle.com/code/yunsuxiaozi/pii-detect-study-notebook">PII detect study notebook</a>. 

# Modifications 

First, big thanks to the authors of the notebooks above. This notebook only adds utility code to make a solid baseline that other users can iterate on; all the heavy lifting was done in those notebooks.

I encapsulated the analyzer in a class, and added code to run the analyzer on (potentially) both the training and test set. I also added validation code, so that we can analyze the performance of the analyzer on the training set.

I also added a global configuration for ease of testing, which allows the user to switch between training and inference mode. Additionally, I incorporated the external data that [@moth](https://www.kaggle.com/alejopaullier) kindly provided in his discussion post.

For now, the core logic stays roughly the same as in the Modify of PII Detect Study notebook that I started from.

## Resources

* My EDA notebook: https://www.kaggle.com/code/mcpenguin/eda-pii-detection-removal/notebook
* Customizing the presidio analyzer: https://microsoft.github.io/presidio/samples/python/customizing_presidio_analyzer/

# Version History

- v16: Original baseline
- v17: Changed score thresholds for patterns from 0.5 -> 0.8
- v20: fixed bug in evaluation code
- v21: changed url regex to not require "www"
- v23: reverted v21, added @amed's metric code
- v24: added custom ID pattern

# Configuration

In [1]:
class CONFIG:
    """
    > General Options
    """
    # global seed
    seed = 42
    # number of samples to use for testing purposes
    # if None, we use the whole training dataset
    samples_testing = None
    # flag to indicate whether to use the external training dataset
    # or just to use the original data
    use_external_train_data = True
    # whether to run the algorithm on the training set and do subsequent validation
    # with 6.8k rows, this takes almost 50 minutes to run
    run_on_train_data = True
    
    """
    > Analyzer Options
    """
    # score threshold for patterns
    id_pattern_score = 0.8
    address_pattern_score = 0.8
    email_pattern_score = 0.8
    url_pattern_score = 0.8

# Import Libraries

### Install presidio

In [2]:
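# Install the presidio_analyzer library from the local wheel files at the given path instead of the Python package index (--no-index), upgrade to the latest available version (-U), and reduce output (-q).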
!pip install -U -q presidio_analyzer --no-index --find-links=file:///kaggle/input/presidio-wheels/presidio

### Import necessary libraries

In [3]:
import json
import pandas as pd

from presidio_analyzer.nlp_engine import NlpEngineProvider
from tqdm import tqdm
from typing import List
import random
import pprint
import re
import gc
from ast import literal_eval
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import fbeta_score, classification_report, confusion_matrix

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, EntityRecognizer, Pattern, RecognizerResult
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, SpacyNlpEngine, NlpArtifacts
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer

from dateutil import parser

In [4]:
# use the global seed from the configuration
random.seed(CONFIG.seed)

# Define Metric (F-beta Score)

In [5]:
# Big thanks to
# https://www.kaggle.com/code/amedprof/pii-evaluation-metric?scriptVersionId=160040455
def pii_fbeta_score(pred_df, gt_df, beta=5):
    """
    Compute the micro F-beta score over (document, token) pairs.

    Parameters:
    - pred_df (DataFrame): DataFrame containing predicted PII labels.
    - gt_df (DataFrame): DataFrame containing ground truth PII labels.
    - beta (float): The beta parameter for the F-beta score, controlling the trade-off between precision and recall.

    Returns:
    - float: Micro F-beta score.
    """
    df = pred_df.merge(gt_df, how='outer', on=['document', 'token'], suffixes=('_pred', '_gt'))

    df['cm'] = ""

    # predicted but absent from the ground truth -> false positive
    df.loc[df.label_gt.isna(), 'cm'] = "FP"

    # missing prediction, or prediction with the wrong label -> false negative
    df.loc[df.label_pred.isna(), 'cm'] = "FN"
    df.loc[(df.label_gt.notna()) & (df.label_gt != df.label_pred), 'cm'] = "FN"

    # prediction matches the ground-truth label -> true positive
    df.loc[(df.label_pred.notna()) & (df.label_gt.notna()) & (df.label_gt == df.label_pred), 'cm'] = "TP"

    FP = (df['cm'] == "FP").sum()
    FN = (df['cm'] == "FN").sum()
    TP = (df['cm'] == "TP").sum()

    s_micro = (1 + (beta**2)) * TP / (((1 + (beta**2)) * TP) + ((beta**2) * FN) + FP)

    return s_micro
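
A small worked example may make the final expression of `pii_fbeta_score` more concrete. The counts below are made up for illustration and are not results from this notebook; `fbeta_from_counts` is a hypothetical helper that only repeats the same arithmetic on raw counts.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 5.0) -> float:
    """Micro F-beta from raw counts, using the same expression as pii_fbeta_score."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Illustrative counts only (hypothetical, not taken from the notebook's data)
tp, fp, fn = 80, 40, 20

precision = tp / (tp + fp)
recall = tp / (tp + fn)

score = fbeta_from_counts(tp, fp, fn)

# The same score written in terms of precision and recall (beta = 5, so beta**2 = 25)
score_pr = (1 + 25) * precision * recall / (25 * precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f5={score:.4f}")
```

With `beta = 5`, each false negative is weighted 25 times as heavily as a false positive in the denominator, so the score tracks recall much more closely than precision.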

# Import Datasets

## Import Original Data

In [6]:
with open("/kaggle/input/pii-detection-removal-from-educational-data/train.json") as f:
    train_df = json.load(f)
print(f"len(train_df):{len(train_df)}, train_df[0].keys(): {list(train_df[0].keys())}")
print("-"*50)

with open('/kaggle/input/pii-detection-removal-from-educational-data/test.json') as f:
    test_df = json.load(f)

len(train_df):6807, train_df[0].keys(): ['document', 'full_text', 'tokens', 'trailing_whitespace', 'labels']
--------------------------------------------------


## Load External Data (if needed)

In [7]:
if CONFIG.use_external_train_data:
    # Convert the "stringified lists" in the columns to proper Python lists
    df_train_external = pd.read_csv('/kaggle/input/pii-external-dataset/pii_dataset.csv', converters={
        'tokens': literal_eval, 
        'labels': literal_eval, 
        'trailing_whitespace': literal_eval
    })
    df_train_external.rename(columns={'text': 'full_text'}, inplace=True)
    # convert to format similar to how we load in the original data
    df_train_external = df_train_external.to_dict('records')
    train_df.extend(df_train_external)
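
The `converters` argument matters because a CSV file stores each list as plain text. The sketch below shows, on made-up strings (not rows from the external dataset), what `literal_eval` does to a single cell.

```python
from ast import literal_eval

# A CSV cell that holds a list arrives as a plain string
raw_tokens = "['My', 'name', 'is', 'Ada']"
raw_flags = "[True, True, True, False]"

# literal_eval safely parses Python literals back into real objects
tokens = literal_eval(raw_tokens)
trailing_whitespace = literal_eval(raw_flags)

print(type(raw_tokens).__name__, "->", type(tokens).__name__)
print(tokens, trailing_whitespace)
```

Unlike `eval`, `literal_eval` only accepts literals (strings, numbers, lists, dicts, booleans and similar), which makes it a reasonable choice for parsing data files.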

## Sample Data (if needed)

In [8]:
if CONFIG.samples_testing is not None:
    train_df = random.sample(train_df, CONFIG.samples_testing)

In [9]:
print("train_df length:", len(train_df))

labels = set()
label_counts = {}
for i in range(len(train_df)):
    labels.update(train_df[i]['labels'])
    for label in train_df[i]['labels']:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1
            
print(f"labels: {labels}")
print('-'*25)
print(f"label_counts: {label_counts}")

train_df length: 11241
labels: {'B-STREET_ADDRESS', 'I-NAME_STUDENT', 'I-PHONE_NUM', 'I-URL_PERSONAL', 'B-NAME_STUDENT', 'B-PHONE_NUM', 'B-USERNAME', 'I-ID_NUM', 'B-EMAIL', 'B-ID_NUM', 'I-STREET_ADDRESS', 'B-URL_PERSONAL', 'O'}
-------------------------
label_counts: {'O': 6323308, 'B-NAME_STUDENT': 12469, 'I-NAME_STUDENT': 6763, 'B-URL_PERSONAL': 730, 'B-EMAIL': 3833, 'B-ID_NUM': 78, 'I-URL_PERSONAL': 1, 'B-USERNAME': 724, 'B-PHONE_NUM': 2425, 'I-PHONE_NUM': 3404, 'B-STREET_ADDRESS': 3545, 'I-STREET_ADDRESS': 8597, 'I-ID_NUM': 1}


## Helper Methods

In [10]:
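def is_valid_date(text):
    """Return True if dateutil can parse the text as a date."""
    try:
        # Attempt to parse the text as a date
        parser.parse(text)
        return True
    except (ValueError, OverflowError):
        # dateutil raises ParserError (a ValueError subclass) or OverflowError on bad input
        return False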
    
def tokens2index(row):
    tokens  = row['tokens']
    start_ind = []
    end_ind = []
    prev_ind = 0
    for tok in tokens:
        start = prev_ind + row['full_text'][prev_ind:].index(tok)
        end = start+len(tok)
        start_ind.append(start)
        end_ind.append(end)
        prev_ind = end
    return start_ind, end_ind

# binary search
def find_or_next_larger(arr, target):
    left, right = 0, len(arr) - 1

    while left <= right:
        mid = (left + right) // 2

        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return left

def count_trailing_whitespaces(word):
    return len(word) - len(word.rstrip())
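
To illustrate how these helpers fit together, here is a minimal standalone sketch on a made-up sentence. `token_offsets` is a hypothetical re-statement of the idea behind `tokens2index`, and the lookup uses the standard library's `bisect_left`, which should return the same index as `find_or_next_larger` as long as the list of start offsets is strictly increasing (an assumption about the input, not something the notebook checks).

```python
from bisect import bisect_left

def token_offsets(full_text, tokens):
    """Character start/end of each token, scanning left to right."""
    starts, ends, cursor = [], [], 0
    for tok in tokens:
        # search only from the end of the previous token onwards
        start = full_text.index(tok, cursor)
        end = start + len(tok)
        starts.append(start)
        ends.append(end)
        cursor = end
    return starts, ends

text = "Contact Jane Doe today"
tokens = ["Contact", "Jane", "Doe", "today"]
starts, ends = token_offsets(text, tokens)

# Character 8 is exactly where "Jane" starts, so we get that token's index
exact = bisect_left(starts, 8)

# Character 10 falls inside "Jane", so we get the index of the next token start
inside = bisect_left(starts, 10)

print(starts, ends, exact, inside)
```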

## Create Analyzer

To keep the code tidy, we encapsulate the analyzer in a class.
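
The custom recognizers defined in the class below are plain regular expressions, so their behavior can be checked with the standard `re` module alone. The sketch below copies three of the patterns and runs them on made-up sentences. The flags are an assumption meant to mirror the case-insensitive defaults of Presidio's `PatternRecognizer`; `first_match` is a hypothetical helper.

```python
import re

# Assumed to mirror Presidio's default regex flags (an assumption, check your version)
FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE

# Patterns copied from the analyzer class
id_regex = r'([A-Za-z]{2}[.?]:)?\d{12,12}'
address_regex = r'\b\d+\s+\w+(\s+\w+)*\s+((st(\.)?)|(ave(\.)?)|(rd(\.)?)|(blvd(\.)?)|(ln(\.)?)|(ct(\.)?)|(dr(\.)?))\b'
url_regex = r'https?://\S+|www\.\S+'

def first_match(regex, text):
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(regex, text, FLAGS)
    return m.group(0) if m else None

# The address pattern stops at the street suffix
address_hit = first_match(address_regex, "I live at 221 Baker St in London")

# \S+ is greedy, so trailing punctuation becomes part of the URL match
url_hit = first_match(url_regex, "My page is www.example.org.")

# No word boundaries: the first 12 digits of a longer number still match
id_hit = first_match(id_regex, "Student number 1234567890123")

# A bare domain without a scheme or "www." is not detected
no_url = first_match(url_regex, "example.org has no scheme")

print(address_hit, url_hit, id_hit, no_url, sep=" | ")
```

Checks like these make it easier to see where a pattern over-matches (trailing punctuation, long digit runs) before running the full analyzer.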

In [11]:
class MyAnalyzer:
    
    def __init__(self):
        ## Initialize the analyzer
        configuration = {
            "nlp_engine_name": "spacy",
            "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
        }
        
        # Create NLP engine based on configuration
        provider = NlpEngineProvider(nlp_configuration=configuration)
        nlp_engine = provider.create_engine()
        
        # create ID recognizer
        id_regex = r'([A-Za-z]{2}[.?]:)?\d{12,12}'
        id_pattern = Pattern(name="id", regex=id_regex, score = CONFIG.id_pattern_score)
        id_recognizer = PatternRecognizer(supported_entity="ID_CUSTOM", patterns = [id_pattern])

        # create address recognizer
        address_regex = r'\b\d+\s+\w+(\s+\w+)*\s+((st(\.)?)|(ave(\.)?)|(rd(\.)?)|(blvd(\.)?)|(ln(\.)?)|(ct(\.)?)|(dr(\.)?))\b'
        address_pattern = Pattern(name="address", regex=address_regex, score = CONFIG.address_pattern_score)
        address_recognizer = PatternRecognizer(supported_entity="ADDRESS_CUSTOM", patterns = [address_pattern])

        # create email recognizer
        # note: inside a character class "|" is a literal pipe, so the TLD class is [A-Za-z]
        email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        email_pattern = Pattern(name="email address", regex=email_regex, score = CONFIG.email_pattern_score)
        email_recognizer = PatternRecognizer(supported_entity="EMAIL_CUSTOM", patterns = [email_pattern])

        # create url recognizer 
        url_regex = r'https?://\S+|www\.\S+'
#         url_regex = r'https?:\/\/[a-zA-Z1-9.\?\=\&]+'
        url_pattern = Pattern(name="url", regex=url_regex, score=CONFIG.url_pattern_score)
        url_recognizer = PatternRecognizer(supported_entity="URL_CUSTOM", patterns = [url_pattern])

        registry = RecognizerRegistry()
        registry.load_predefined_recognizers()
        registry.add_recognizer(id_recognizer)
        registry.add_recognizer(address_recognizer)
        registry.add_recognizer(email_recognizer)
        registry.add_recognizer(url_recognizer)

        # Pass the created NLP engine and supported_languages to the AnalyzerEngine
        self.analyzer = AnalyzerEngine(
            nlp_engine=nlp_engine, 
            supported_languages=["en"],
            registry=registry
        )
        
        ## Initialize the black list
        self.black_list = ["wikipedia", "coursera", ".pdf", ".PDF", "article", 
                           ".png", ".gov", ".work", ".ai", ".firm", ".arts", 
                           ".store", ".rec", ".biz", ".travel" ]
        
     
    def predict_tokens(self, df_: list) -> pd.DataFrame:
        """Predict the tokens that have PII in the dataframe."""
        
        # list of all predictions for each label
        PHONE_NUM, NAME_STUDENT, URL_PERSONAL, EMAIL, STREET_ADDRESS, ID_NUM, USERNAME = [],[],[],[],[],[], []

        preds = []
        
        # Find the starting and ending positions of each word after segmentation
        for i in tqdm(range(len(df_)), desc="Processing tokens2index"):
            start, end = tokens2index(df_[i])
            df_[i]['start'] = start
            df_[i]['end'] = end

        for i, d in tqdm(enumerate(df_), total=len(df_), desc="Analyzing entities"):
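            # results: [type: PERSON, start: 22, end: 37, score: 0.85]
            # NOTE: "ID_CUSTOM" is not in the entities list below, so the custom ID
            # recognizer registered in __init__ is not queried; add it to enable that pattern.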
            results = self.analyzer.analyze(text=d['full_text'],
                                   entities=[
                                             #"PHONE_NUMBER", 
                                             "PERSON", 
                                             "URL_CUSTOM", #"IP_ADDRESS", #"URL",
                                             "EMAIL_ADDRESS", "EMAIL_CUSTOM", 
                                             "ADDRESS_CUSTOM",
                                             "US_SSN", "US_ITIN", "US_PASSPORT", "US_BANK_NUMBER",
                                             "USERNAME"],
                                   language='en',
        #                            score_threshold=0.2,
                                    )
            pre_preds = []
            
            # Traverse each entity found, 
            # r: [type: PERSON, start: 22, end: 37, score: 0.85]
            for r in results:
                # s is the index of the token at which the entity begins
                s = find_or_next_larger(d['start'], r.start)  # d['start'][s] == r.start on an exact match
                
                end = r.end  # character offset where the entity ends
                # text covered by the entity
                word = d['full_text'][r.start:r.end]
                # ignore trailing whitespace when comparing against token ends
                end = end - count_trailing_whitespaces(word)
                temp_preds = [s]
                
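                try:
                    # add every following token that ends inside the entity span
                    while d['end'][s+1] <= end:
                        temp_preds.append(s+1)
                        s += 1
                except IndexError:
                    # ran past the last token of the document
                    pass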

                tmp = False

                if r.entity_type == 'USERNAME':
                    label = 'USERNAME'
                    USERNAME.append(d['full_text'][r.start:r.end])

        #         if r.entity_type == 'PHONE_NUMBER':
        #             if is_valid_date(word):
        #                 continue
        #             label =  'PHONE_NUM'
        #             PHONE_NUM.append(d['full_text'][r.start:r.end])

                if r.entity_type == 'PERSON':
                    label = 'NAME_STUDENT'
                    NAME_STUDENT.append(d['full_text'][r.start:r.end])

                if r.entity_type == 'ADDRESS_CUSTOM':
                    label = 'STREET_ADDRESS'
                    STREET_ADDRESS.append(d['full_text'][r.start:r.end])

                if r.entity_type == 'ID_CUSTOM' or \
                    r.entity_type == 'US_SSN' or r.entity_type == 'US_ITIN' or r.entity_type == 'US_PASSPORT' or r.entity_type == 'US_BANK_NUMBER':
                    
                    label = 'ID_NUM'
                    ID_NUM.append(d['full_text'][r.start:r.end])

                if r.entity_type == 'EMAIL_ADDRESS' or r.entity_type == 'EMAIL_CUSTOM':
                    label = 'EMAIL'
                    EMAIL.append(d['full_text'][r.start:r.end])

                if r.entity_type == 'URL_CUSTOM':# or r.entity_type == 'IP_ADDRESS' or "http" in word:
                    for w in self.black_list:
                        if w in word:
                            tmp = True
                            break

                    label = 'URL_PERSONAL'
                    URL_PERSONAL.append(d['full_text'][r.start:r.end])

                if tmp:
                    continue

                for p in temp_preds:
                    if len(pre_preds) > 0:
                        # A token continues the current entity (I- prefix) only if the
                        # previous prediction has the same entity type and is the
                        # immediately preceding token. Otherwise it starts a new
                        # entity (B- prefix). Note that at the first token of a new
                        # result r, pre_preds[-1]['rlabel'] still holds the entity
                        # type of the previous result.
                        if pre_preds[-1]['rlabel'] == r.entity_type and (p - pre_preds[-1]['token']==1):
                            label_f = "I-"+label
                        else:
                            label_f = "B-"+label
                    else:
                        label_f = "B-"+label
                    
                    # pre_preds contains the output that we want
                    pre_preds.append(({
                        "document": d['document'],
                        "token": p,
                        "label": label_f,
                        "rlabel": r.entity_type,
                    }))
                    
            # After traversing this data, summarize all found entities
            # and extend the preds for this document into the aggregate preds
            preds.extend(pre_preds)
            
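        # drop the helper column 'rlabel' (the last one); reset_index adds an
        # 'index' column that later serves as the row id
        preds_df = pd.DataFrame(preds).iloc[:,:-1].reset_index()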
        return preds_df
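
The core of `predict_tokens` is the step that turns one character span into token-level B-/I- labels. The following standalone sketch shows that step in isolation. It is a simplification under stated assumptions: `span_to_bio` is a hypothetical helper, it handles a single span at a time (the notebook instead compares each token with the previous prediction), and it returns nothing when the span starts after the last token.

```python
from bisect import bisect_left

def span_to_bio(starts, ends, span_start, span_end, label):
    """Map one character span to (token_index, BIO label) pairs."""
    first = bisect_left(starts, span_start)
    if first >= len(starts):
        # span begins after the last token (simplification, see lead-in)
        return []
    covered = [first]
    nxt = first + 1
    # keep adding tokens while they end inside the span
    while nxt < len(ends) and ends[nxt] <= span_end:
        covered.append(nxt)
        nxt += 1
    # first token gets B-, the rest get I-
    return [(tok, ("B-" if i == 0 else "I-") + label) for i, tok in enumerate(covered)]

# Token offsets for the made-up text "Contact Jane Doe today"
starts = [0, 8, 13, 17]
ends = [7, 12, 16, 22]

# A detector reports a PERSON entity covering characters 8..16 ("Jane Doe")
labels = span_to_bio(starts, ends, 8, 16, "NAME_STUDENT")
print(labels)
```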

# Predict Train Set

In [12]:
analyzer = MyAnalyzer()

In [13]:
if CONFIG.run_on_train_data:
    train_preds = analyzer.predict_tokens(train_df)

Processing tokens2index: 100%|██████████| 11241/11241 [00:09<00:00, 1203.31it/s]
Analyzing entities: 100%|██████████| 11241/11241 [22:08<00:00,  8.46it/s]


# Evaluate Performance on Training Set

## Generate Corresponding DataFrame for "True" Answers

In [14]:
if CONFIG.run_on_train_data:
    train_act_records = []
    count = 0
    for entry in train_df:
        for idx, (token, label) in enumerate(zip(entry["tokens"], entry["labels"])):
            if label != 'O':
                train_act_records.append({
                    'row_id': count,
                    'document': entry["document"],
                    'token': idx,
                    'label': label,
                })
                count += 1

    train_act = pd.DataFrame.from_records(train_act_records)

In [15]:
# Check that we haven't missed out any true values when making the "train_act_records"
if CONFIG.run_on_train_data:
    assert len(train_act) == sum(label_counts.values()) - label_counts['O'], \
        'mismatch between number of true labels in label_counts and train_act_records'

In [16]:
def get_pred_act_lists_for_dfs(preds: pd.DataFrame, act: pd.DataFrame):
    document_idx_list = [(ex["document"], len(ex["tokens"])) for ex in train_df]
    
    preds_list = []
    act_list = []
    
    for document, len_tokens in tqdm(document_idx_list, total=len(document_idx_list)):
        preds_doc = preds[preds["document"] == document].sort_values(by="token")
        act_doc = act[act["document"] == document].sort_values(by="token")
        
        # We do a "merge" (like in mergesort) to combine the results from the preds and 
        # actual values
        preds_idx = 0
        act_idx = 0
        preds_list_sub = []
        act_list_sub = []
        for i in range(len_tokens):
            preds_head, act_head = None, None
            if preds_idx < len(preds_doc):
                preds_head = preds_doc.iloc[preds_idx]
            if act_idx < len(act_doc):
                act_head = act_doc.iloc[act_idx]
                
            if act_head is not None and act_head["token"] == i:
                act_list_sub.append(act_head["label"])
                act_idx += 1
            else:
                act_list_sub.append('O')
                
            if preds_head is not None and preds_head["token"] == i:
                preds_list_sub.append(preds_head["label"])
                preds_idx += 1
            else:
                preds_list_sub.append('O')
        
        preds_list.extend(preds_list_sub)
        act_list.extend(act_list_sub)
            
    return preds_list, act_list
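
The function above filters both frames once per document, which gets slow with many documents. One possible alternative, sketched here in plain Python, is a single dictionary keyed by `(document, token)`. `dense_labels` is a hypothetical helper, the records are made up, and keeping only the first label for a duplicated token is an assumption of this sketch. With DataFrames, the records could come from `DataFrame.to_dict('records')`.

```python
def dense_labels(documents, records):
    """Expand sparse (document, token, label) records into one label per token.

    documents: list of (document_id, number_of_tokens) pairs
    records:   iterable of dicts with 'document', 'token' and 'label' keys
    """
    lookup = {}
    for rec in records:
        # keep the first label seen for a token; later duplicates are ignored
        lookup.setdefault((rec["document"], rec["token"]), rec["label"])
    # every token without a record is labelled 'O'
    return [
        lookup.get((doc_id, tok), "O")
        for doc_id, n_tokens in documents
        for tok in range(n_tokens)
    ]

# Two made-up documents with 4 and 3 tokens
documents = [(7, 4), (9, 3)]
act = [
    {"document": 7, "token": 1, "label": "B-NAME_STUDENT"},
    {"document": 7, "token": 2, "label": "I-NAME_STUDENT"},
]
pred = [
    {"document": 7, "token": 1, "label": "B-NAME_STUDENT"},
    {"document": 9, "token": 0, "label": "B-EMAIL"},
]

act_list = dense_labels(documents, act)
pred_list = dense_labels(documents, pred)
print(act_list)
print(pred_list)
```

Both lists have one entry per token in the same order, so they can be passed directly to per-label metrics.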
    

In [17]:
if CONFIG.run_on_train_data:
    print("PII micro F-beta score:", pii_fbeta_score(train_preds, train_act, beta = 5))

PII micro F-beta score: 0.48751463129145534


In [None]:
if CONFIG.run_on_train_data:
    train_preds_list, train_act_list = get_pred_act_lists_for_dfs(train_preds, train_act)

 71%|███████   | 8008/11241 [04:58<03:26, 15.66it/s]

In [None]:
# make sure that we processed the actual train list properly
if CONFIG.run_on_train_data:
    assert dict(Counter(train_act_list)) == label_counts, 'mismatch between label counts in label_counts and train_act_list'

In [None]:
if CONFIG.run_on_train_data:
    print("-"*25)
    print("Counter for predicted labels:")
    print("-"*25)
    pprint.pprint(dict(Counter(train_preds_list)))
    print()
    
    print("-"*25)
    print("Counter for actual labels:")
    print("-"*25)
    pprint.pprint(dict(Counter(train_act_list)))
    print()

## Classification Report on Training Set

In [None]:
if CONFIG.run_on_train_data:
    # classification_report expects (y_true, y_pred) in that order
    print(classification_report(train_act_list, train_preds_list, digits=4))

## F Beta Score on Training Set

In [None]:
# if CONFIG.run_on_train_data:
#     print("Micro F1 Beta Score:", score(train_preds_list, train_act_list))
#     print("Macro F1 Beta Score:", macro_score(train_preds_list, train_act_list))

In [None]:
if CONFIG.run_on_train_data:
    del train_preds_list, train_act_list, train_preds, train_act
    gc.collect()

# Predict Test Set

In [None]:
test_preds = analyzer.predict_tokens(test_df)
test_preds.head()

# Submission

In [None]:
submission = pd.DataFrame(test_preds)
submission.columns = ['row_id','document', 'token', 'label']
submission.to_csv('submission.csv', index = False)
submission.head()