## External Data Sources

* [Persuade PII Dataset](https://www.kaggle.com/datasets/thedrcat/persuade-pii-dataset?rvi=1)
  * Essays from the Persuade corpus, modified with synthetic PII data and corresponding labels. The corpus was filtered for essays that contain tokens relevant to the competition.

* [PII | External Dataset](https://www.kaggle.com/datasets/alejopaullier/pii-external-dataset?rvi=1)
  * This is an LLM-generated external dataset that contains generated texts with their corresponding annotated labels in the required competition format.

* [NEW DATASET PII Data Detection](https://www.kaggle.com/datasets/cristaliss/new-dataset-pii-data-detection?rvi=1)
  * This dataset is a modified version of the official training data with the following changes: revamped labels, token transformation, and token indexing.

* [PII Detection Dataset (GPT)](https://www.kaggle.com/datasets/pjmathematician/pii-detection-dataset-gpt)
  * Personal data was created with the Python Faker package and then fed into an LLM, which wrote an essay around it. Overall, it contains 2000 GPT-generated essays and the corresponding competition entities used in each essay.

* [AI4privacy-PII](https://www.kaggle.com/datasets/verracodeguacas/ai4privacy-pii)
  * The dataset is crafted using proprietary algorithms, ensuring the creation of synthetic data that avoids privacy violations. The data is meticulously curated with human-in-the-loop validation, ensuring both relevance and quality. It serves a crucial role in addressing the growing concerns around personal data security in AI applications.



## Python Libraries


In [None]:
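import warnings

# Imports do not raise DeprecationWarning as an exception, so the warnings
# are filtered here instead of being caught with try/except.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    import pandas as pd
    import numpy as np
    import spacy as sp
    import re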

## Loading Datasets


Official training data

Only load into notebook after pulling from git LFS (see above)

In [None]:
# df_train = pd.read_json("../Datasets/Official/train.json")
# df_train

Official testing data


In [None]:
df_test = pd.read_json("../Datasets/Official/test.json")
df_test

Unnamed: 0,document,full_text,tokens,trailing_whitespace
0,7,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,...","[True, True, True, True, False, False, True, F..."
1,10,Diego Estrada\n\nDesign Thinking Assignment\n\...,"[Diego, Estrada, \n\n, Design, Thinking, Assig...","[True, False, False, True, True, False, False,..."
2,16,Reporting process\n\nby Gilberto Gamboa\n\nCha...,"[Reporting, process, \n\n, by, Gilberto, Gambo...","[True, False, False, True, True, False, False,..."
3,20,Design Thinking for Innovation\n\nSindy Samaca...,"[Design, Thinking, for, Innovation, \n\n, Sind...","[True, True, True, False, False, True, False, ..."
4,56,Assignment: Visualization Reflection Submitt...,"[Assignment, :, , Visualization, , Reflecti...","[False, False, False, False, False, False, Fal..."
5,86,Cheese Startup - Learning Launch ​by Eladio Am...,"[Cheese, Startup, -, Learning, Launch, ​by, El...","[True, True, True, True, True, True, True, Fal..."
6,93,Silvia Villalobos\n\nChallenge:\n\nThere is a ...,"[Silvia, Villalobos, \n\n, Challenge, :, \n\n,...","[True, False, False, False, False, False, True..."
7,104,Storytelling The Path to Innovation\n\nDr Sak...,"[Storytelling, , The, Path, to, Innovation, \...","[True, False, True, True, True, False, False, ..."
8,112,Reflection – Learning Launch\n\nFrancisco Ferr...,"[Reflection, –, Learning, Launch, \n\n, Franci...","[True, True, True, False, False, True, False, ..."
9,123,Gandhi Institute of Technology and Management ...,"[Gandhi, Institute, of, Technology, and, Manag...","[True, True, True, True, True, True, False, Tr..."


PII Detection Dataset (GPT)


In [None]:
df_ext1_ai_data = pd.read_csv("../Datasets/External/ai_data.csv",names=["text","label"],header = 0)
df_ext1_ai_data

Unnamed: 0,text,label
0,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Richard', 'Chang'], 'EMAIL'..."
1,"In today's modern world, where technology has ...","{'NAME_STUDENT': [], 'EMAIL': ['tamaramorrison..."
2,Janice: A Student with a Unique Identity\r\n\r...,"{'NAME_STUDENT': ['Janice'], 'EMAIL': ['laura5..."
3,Christian is a student who goes by the usernam...,"{'NAME_STUDENT': ['Christian'], 'EMAIL': [], '..."
4,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Aaron Smith', 'Fischer', 'T..."
...,...,...
1995,The Importance of Personal Identity in the Dig...,"{'NAME_STUDENT': ['John', 'Suzanne Davis'], 'E..."
1996,In today's fast-paced and interconnected world...,"{'NAME_STUDENT': ['Brenda Brown'], 'EMAIL': ['..."
1997,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Joseph Carter'], 'EMAIL': [..."
1998,The Importance of Personal Identity in the Dig...,"{'NAME_STUDENT': ['Amanda', 'Johnathan', 'Shel..."


Persuade PII Dataset

In [None]:
df_ext2_persuade = pd.read_json("../Datasets/External/persuade_train_v0.json")
df_ext2_persuade

Unnamed: 0,tokens,trailing_whitespace,labels,full_text,document
0,"[You, should, join, the, seagoing, cowboys, .,...","[True, True, True, True, True, False, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",You should join the seagoing cowboys. I think ...,persuade_A70616DD5FBF
1,"[FINAL, DRAFT, \n\n, MONTH_DAY_YEAR, \n\n, PRO...","[True, False, False, False, False, False, Fals...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",FINAL DRAFT\n\nMONTH_DAY_YEAR\n\nPROPER_NAME\n...,persuade_740AE92DF22C
2,"[Dear, Mohammed, Brown, :, \n\n, I, believe, i...","[True, True, False, False, False, True, True, ...","[O, B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O...",Dear Mohammed Brown:\n\nI believe it would be ...,persuade_2F286C2A60B6
3,"[Dear, ,, Principal, \n\n, I, think, students,...","[False, True, False, False, True, True, True, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","Dear, Principal\n\nI think students should hav...",persuade_A1274B901F55
4,"[Dear, Sam, Ali, ,, \n\n, Suppose, a, communit...","[True, True, False, False, False, True, True, ...","[O, B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O...","Dear Sam Ali,\n\nSuppose a community free from...",persuade_E55967D94E2C
...,...,...,...,...,...
2213,"[I, prefer, policy, 1, because, i, think, we, ...","[True, True, True, True, True, True, True, Tru...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",I prefer policy 1 because i think we should be...,persuade_E9739967700C
2214,"[Dear, Principle, ,, \n\n, I, am, a, student, ...","[True, False, False, False, True, True, True, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","Dear Principle,\n\nI am a student at our schoo...",persuade_C9083A6A4955
2215,"[Dear, Principal, ,, \n\n, You, should, allow,...","[True, False, False, False, True, True, True, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","Dear Principal,\n\nYou should allow students t...",persuade_426B322D25DA
2216,"[Dear, Principle, ,, \n\n, I, believe, that, y...","[True, False, False, False, True, True, True, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","Dear Principle,\n\nI believe that you should a...",persuade_356088A3EC46


AI4privacy-PII

In [None]:
df_ext3_ai4priv = pd.read_json("../Datasets/External/PII43k_original.jsonl", lines=True)
df_ext3_ai4priv

Unnamed: 0,masked_text,unmasked_text,token_entity_labels,tokenised_unmasked_text
0,"In our video conference, discuss the role of e...","In our video conference, discuss the role of e...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[in, our, video, conference, ,, discuss, the, ..."
1,Could you draft a letter for [NAME_1] to send ...,"Could you draft a letter for Dietrich, Schulis...","[O, O, O, O, O, O, B-NAME, I-NAME, I-NAME, I-N...","[could, you, draft, a, letter, for, dietrich, ..."
2,Discuss the options for [FULLNAME_1] who wants...,Discuss the options for Jeffery Pfeffer who wa...,"[O, O, O, O, B-FULLNAME, I-FULLNAME, I-FULLNAM...","[discuss, the, options, for, jeff, ##ery, p, #..."
3,13. Write a press release announcing [FULLNAME...,13. Write a press release announcing Gayle Wat...,"[O, O, O, O, O, O, O, B-FULLNAME, I-FULLNAME, ...","[13, ., write, a, press, release, announcing, ..."
4,9. Develop an inventory management plan for [F...,9. Develop an inventory management plan for Ev...,"[O, O, O, O, O, O, O, O, B-FULLNAME, I-FULLNAM...","[9, ., develop, an, inventory, management, pla..."
...,...,...,...,...
42752,Please write an article on the effectiveness o...,Please write an article on the effectiveness o...,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[please, write, an, article, on, the, effectiv..."
42753,Write a report on the most common online threa...,Write a report on the most common online threa...,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[write, a, report, on, the, most, common, onli..."
42754,Write a blog post for [NAME_1] about the role ...,Write a blog post for Stanton LLC about the ro...,"[O, O, O, O, O, B-NAME, I-NAME, O, O, O, O, O,...","[write, a, blog, post, for, stanton, llc, abou..."
42755,14. Calculate the return on investment for [NA...,14. Calculate the return on investment for Con...,"[O, O, O, O, O, O, O, O, B-NAME, I-NAME, I-NAM...","[14, ., calculate, the, return, on, investment..."


PII | External Dataset

In [None]:
df_ext4_pii_ds = pd.read_csv("../Datasets/External/pii_dataset.csv")
df_ext4_pii_ds

Unnamed: 0,document,text,tokens,trailing_whitespace,labels,prompt,prompt_id,name,email,phone,job,address,username,url,hobby,len
0,1073d46f-2241-459b-ab01-851be8d26436,"My name is Aaliyah Popova, and I am a jeweler ...","['My', 'name', 'is', 'Aaliyah', 'Popova,', 'an...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",\r\n Aaliyah Popova is a jeweler with 13 ye...,1,Aaliyah Popova,aaliyah.popova4783@aol.edu,(95) 94215-7906,jeweler,97 Lincoln Street,,,Podcasting,363
1,5ec717a9-17ee-48cd-9d76-30ae256c9354,"My name is Konstantin Becker, and I'm a develo...","['My', 'name', 'is', 'Konstantin', 'Becker,', ...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",\r\n Konstantin Becker is a developer with ...,1,Konstantin Becker,konstantin.becker@gmail.com,0475 4429797,developer,826 Webster Street,,,Quilting,255
2,353da41e-7799-4071-ab20-d959b362612e,"As Mieko Mitsubishi, an account manager at a p...","['As', 'Mieko', 'Mitsubishi,', 'an', 'account'...","[True, True, True, True, True, True, True, Tru...","['O', 'B-NAME_STUDENT', 'I-NAME_STUDENT', 'O',...",\r\n Mieko Mitsubishi is a account manager....,3,Mieko Mitsubishi,mieko_mitsubishi@msn.org,+27 61 222 4762,account manager,1309 Southwest 71st Terrace,,,Metal detecting,259
3,9324ee01-7bdc-41b1-a7a5-01307f72c20d,"My name is Kazuo Sun, and I'm an air traffic c...","['My', 'name', 'is', 'Kazuo', 'Sun,', 'and', ""...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",\r\n Kazuo Sun is a air traffic controller ...,1,Kazuo Sun,kazuosun@hotmail.net,0304 2215930,air traffic controller,736 Sicard Street Southeast,,,Amateur radio,281
4,971fe266-2739-4f1b-979b-7f64e07d5a4a,"My name is Arina Sun, and I'm a dental hygieni...","['My', 'name', 'is', 'Arina', 'Sun,', 'and', ""...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",\r\n Arina Sun is a dental hygienist. Write...,3,Arina Sun,arina-sun@gmail.net,0412 1245924,dental hygienist,5701 North 67th Avenue,,,Related,210
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4429,a90b6b71-0e77-4089-a3a4-9a5b5c5aec78,"Hello, I'm Nicholas Moore, a man with a rich t...","['Hello,', ""I'm"", 'Nicholas', 'Moore,', 'a', '...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUDENT',...",\r\n Write a fictional semi-formal biograph...,0,Nicholas Moore,nicholas_moore8047@yahoo.gov,+91-63720 22261,videographer,35915 Patrick Mews Suite 978,n.moore,https://www.nmoore.org,Surfing,360
4430,1492ed0e-f162-424f-9c40-f3edde790ca1,"Hello, my name is Alexey Novikov and I'm a psy...","['Hello,', 'my', 'name', 'is', 'Alexey', 'Novi...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME...",\r\n Alexey Novikov is a psychologist. Writ...,3,Alexey Novikov,alexey.novikov@msn.gov,0264 828 4342,psychologist,161 Creek Road,,,Gardening,248
4431,57ef34c1-48db-4413-9573-774021e57f63,"My name is Ludmila Inoue, and I'm a person wit...","['My', 'name', 'is', 'Ludmila', 'Inoue,', 'and...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",\r\n Write a fictional semi-formal biograph...,0,Ludmila Inoue,ludmila_inoue@outlook.net,(285) 815-7373,physician,706 Seagrove Road,,,Quilting,445
4432,a4486627-1c62-48b0-bf4f-53259ecc0a28,"Dr. Tu Garcia, a renowned dermatologist, embar...","['Dr.', 'Tu', 'Garcia,', 'a', 'renowned', 'der...","[True, True, True, True, True, True, True, Tru...","['O', 'B-NAME_STUDENT', 'I-NAME_STUDENT', 'O',...",\r\n Tu Garcia is a dermatologist. Write ab...,2,Tu Garcia,tugarcia@outlook.com,+91-14426 83047,dermatologist,1677 Anthony Run,t.garcia,http://blog.tu-garcia.edu,Skiing,320


NEW DATASET PII Data Detection

In [None]:
df_ext5_train_revised = pd.read_csv("../Datasets/External/train_df_con_list_index.csv")
df_ext5_train_revised

Unnamed: 0.1,Unnamed: 0,document,full_text,tokens,trailing_whitespace,index_pii
0,0,7,Design Thinking for innovation reflexion-Avril...,"['Design', 'Thinking', 'for', 'innovation', 'r...","[True, True, True, True, False, False, True, F...","[(9, 'B-NAME_STUDENT'), (10, 'I-NAME_STUDENT')..."
1,1,10,Diego Estrada\n\nDesign Thinking Assignment\n\...,"['Diego', 'Estrada', '\n\n', 'Design', 'Thinki...","[True, False, False, True, True, False, False,...","[(0, 'B-NAME_STUDENT'), (1, 'I-NAME_STUDENT'),..."
2,2,16,Reporting process\n\nby Gilberto Gamboa\n\nCha...,"['Reporting', 'process', '\n\n', 'by', 'Gilber...","[True, False, False, True, True, False, False,...","[(4, 'B-NAME_STUDENT'), (5, 'I-NAME_STUDENT')]"
3,3,20,Design Thinking for Innovation\n\nSindy Samaca...,"['Design', 'Thinking', 'for', 'Innovation', '\...","[True, True, True, False, False, True, False, ...","[(5, 'B-NAME_STUDENT'), (6, 'I-NAME_STUDENT')]"
4,4,56,Assignment: Visualization Reflection Submitt...,"['Assignment', ':', '\xa0 ', 'Visualization', ...","[False, False, False, False, False, False, Fal...","[(12, 'B-NAME_STUDENT'), (13, 'I-NAME_STUDENT')]"
...,...,...,...,...,...,...
6802,6802,22678,EXAMPLE – JOURNEY MAP\n\nTHE CHALLENGE My w...,"['EXAMPLE', '–', 'JOURNEY', 'MAP', '\n\n', 'TH...","[True, True, True, False, False, True, True, F...",[]
6803,6803,22679,Why Mind Mapping?\n\nMind maps are graphical r...,"['Why', 'Mind', 'Mapping', '?', '\n\n', 'Mind'...","[True, True, False, False, False, True, True, ...",[]
6804,6804,22681,"Challenge\n\nSo, a few months back, I had chos...","['Challenge', '\n\n', 'So', ',', 'a', 'few', '...","[False, False, False, True, True, True, True, ...",[]
6805,6805,22684,Brainstorming\n\nChallenge & Selection\n\nBrai...,"['Brainstorming', '\n\n', 'Challenge', '&', 'S...","[False, False, True, True, False, False, True,...",[]


## Cleaning
To get uniform input, each source dataframe needs a list of tokens, derived from the source text, in every row.
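
The previews of the CSV-loaded frames show token lists with quoted elements (for example `['My', 'name', ...]`), which suggests those cells hold strings rather than Python lists. A minimal sketch of converting such a cell, assuming it contains a valid Python literal; `parse_list_cell` is a hypothetical helper, not part of the notebook:

```python
import ast

def parse_list_cell(cell):
    """Return a real list from a cell that may hold a stringified list."""
    # Cells loaded from JSON are already lists; leave them untouched.
    if isinstance(cell, list):
        return cell
    # literal_eval only evaluates Python literals, so it is safer than eval().
    return ast.literal_eval(cell)

raw = "['My', 'name', 'is', 'Aaliyah', 'Popova,']"
tokens = parse_list_cell(raw)
print(tokens)
```

Applied column-wise, this would look like `df["tokens"].apply(parse_list_cell)`; the same call should also handle a stringified `trailing_whitespace` column.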

### Official training dataset
Verify that there are no rows with any null values

In [None]:
# df_train[df_train.isnull().any(axis = 1)]

### Official test dataset
Verify that there are no rows with any null values

In [None]:
df_test[df_test.isnull().any(axis = 1)]

Unnamed: 0,document,full_text,tokens,trailing_whitespace
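
The null check used throughout this section builds a boolean mask with one entry per row and then filters on it. A small sketch on made-up data shows what a non-empty result looks like; `rows_with_nulls` and the `demo` frame are illustrative assumptions:

```python
import pandas as pd

def rows_with_nulls(df):
    """Return the rows of df that contain at least one null value."""
    # isnull() marks each cell, any(axis=1) collapses the marks per row.
    return df[df.isnull().any(axis=1)]

# Hypothetical frame with nulls in rows 1 and 2.
demo = pd.DataFrame({"text": ["a", None, "c"], "label": ["x", "y", None]})
bad = rows_with_nulls(demo)
print(bad)
```

An empty result, as in the cells here, means no row has a missing value. Note that this check only looks for nulls; as far as I am aware, empty strings and empty lists are not flagged by it.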


### PII Detection Dataset (GPT)

Verify that there are no rows with any null values

In [None]:
df_ext1_ai_data[df_ext1_ai_data.isnull().any(axis = 1)]

Unnamed: 0,text,label


Run 
```
python -m spacy download en_core_web_sm
```
in bash to install the English spaCy pipeline if running this notebook locally.

Load spaCy's English NLP pipeline.

In [None]:
nlp = sp.load("en_core_web_sm")

Create a new `tokens` column by running the `nlp` pipeline over the `text` column of each row and collecting the text of every resulting token into a list.

In [None]:
df_ext1_ai_data.loc[:,'tokens'] = df_ext1_ai_data.loc[:,'text'].apply(lambda line: [tok.text for tok in nlp(line)])
df_ext1_ai_data

Unnamed: 0,text,label,tokens
0,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Richard', 'Chang'], 'EMAIL'...","[In, today, 's, modern, world, ,, where, techn..."
1,"In today's modern world, where technology has ...","{'NAME_STUDENT': [], 'EMAIL': ['tamaramorrison...","[In, today, 's, modern, world, ,, where, techn..."
2,Janice: A Student with a Unique Identity\r\n\r...,"{'NAME_STUDENT': ['Janice'], 'EMAIL': ['laura5...","[Janice, :, A, Student, with, a, Unique, Ident..."
3,Christian is a student who goes by the usernam...,"{'NAME_STUDENT': ['Christian'], 'EMAIL': [], '...","[Christian, is, a, student, who, goes, by, the..."
4,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Aaron Smith', 'Fischer', 'T...","[In, today, 's, modern, world, ,, where, techn..."
...,...,...,...
1995,The Importance of Personal Identity in the Dig...,"{'NAME_STUDENT': ['John', 'Suzanne Davis'], 'E...","[The, Importance, of, Personal, Identity, in, ..."
1996,In today's fast-paced and interconnected world...,"{'NAME_STUDENT': ['Brenda Brown'], 'EMAIL': ['...","[In, today, 's, fast, -, paced, and, interconn..."
1997,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Joseph Carter'], 'EMAIL': [...","[In, today, 's, modern, world, ,, where, techn..."
1998,The Importance of Personal Identity in the Dig...,"{'NAME_STUDENT': ['Amanda', 'Johnathan', 'Shel...","[The, Importance, of, Personal, Identity, in, ..."
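
The official data also carries a `trailing_whitespace` column alongside `tokens`. A hedged sketch of deriving both from spaCy tokens follows; it uses `spacy.blank("en")`, which contains only the tokenizer and needs no model download, as a stand-in for the `en_core_web_sm` pipeline loaded above. `tokenize_with_whitespace` is a hypothetical helper:

```python
import spacy

# Blank English pipeline: tokenizer only, used to keep the sketch self-contained.
nlp_blank = spacy.blank("en")

def tokenize_with_whitespace(text):
    """Return (tokens, trailing_whitespace) for a text, one entry per token."""
    doc = nlp_blank(text)
    tokens = [tok.text for tok in doc]
    # whitespace_ is " " when the token is followed by a space, else "".
    trailing_whitespace = [bool(tok.whitespace_) for tok in doc]
    return tokens, trailing_whitespace

text = "Janice is a student."
tokens, trailing_whitespace = tokenize_with_whitespace(text)

# Rebuilding the text from both lists is a quick sanity check.
rebuilt = "".join(t + (" " if ws else "") for t, ws in zip(tokens, trailing_whitespace))
print(tokens, trailing_whitespace, rebuilt == text)
```

Whether the external datasets need this column at all depends on how they are used downstream, which this section does not specify.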


### Persuade PII Dataset

Verify that there are no rows with any null values

In [None]:
df_ext2_persuade[df_ext2_persuade.isnull().any(axis = 1)]

Unnamed: 0,tokens,trailing_whitespace,labels,full_text,document


### AI4privacy-PII

Verify that there are no rows with any null values

In [None]:
df_ext3_ai4priv[df_ext3_ai4priv.isnull().any(axis = 1)]

Unnamed: 0,masked_text,unmasked_text,token_entity_labels,tokenised_unmasked_text


### PII | External Dataset

Drop unneeded or redundant columns

In [None]:
df_ext4_pii_ds = df_ext4_pii_ds.drop(columns = ['document','prompt','prompt_id','name','email','phone','job','address','username','url','hobby'])
df_ext4_pii_ds

Unnamed: 0,text,tokens,trailing_whitespace,labels,len
0,"My name is Aaliyah Popova, and I am a jeweler ...","['My', 'name', 'is', 'Aaliyah', 'Popova,', 'an...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",363
1,"My name is Konstantin Becker, and I'm a develo...","['My', 'name', 'is', 'Konstantin', 'Becker,', ...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",255
2,"As Mieko Mitsubishi, an account manager at a p...","['As', 'Mieko', 'Mitsubishi,', 'an', 'account'...","[True, True, True, True, True, True, True, Tru...","['O', 'B-NAME_STUDENT', 'I-NAME_STUDENT', 'O',...",259
3,"My name is Kazuo Sun, and I'm an air traffic c...","['My', 'name', 'is', 'Kazuo', 'Sun,', 'and', ""...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",281
4,"My name is Arina Sun, and I'm a dental hygieni...","['My', 'name', 'is', 'Arina', 'Sun,', 'and', ""...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",210
...,...,...,...,...,...
4429,"Hello, I'm Nicholas Moore, a man with a rich t...","['Hello,', ""I'm"", 'Nicholas', 'Moore,', 'a', '...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUDENT',...",360
4430,"Hello, my name is Alexey Novikov and I'm a psy...","['Hello,', 'my', 'name', 'is', 'Alexey', 'Novi...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME...",248
4431,"My name is Ludmila Inoue, and I'm a person wit...","['My', 'name', 'is', 'Ludmila', 'Inoue,', 'and...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",445
4432,"Dr. Tu Garcia, a renowned dermatologist, embar...","['Dr.', 'Tu', 'Garcia,', 'a', 'renowned', 'der...","[True, True, True, True, True, True, True, Tru...","['O', 'B-NAME_STUDENT', 'I-NAME_STUDENT', 'O',...",320


Verify that there are no rows with any null values

In [None]:
df_ext4_pii_ds[df_ext4_pii_ds.isnull().any(axis = 1)]

Unnamed: 0,text,tokens,trailing_whitespace,labels,len


### NEW DATASET PII Data Detection

Drop unneeded or redundant columns

In [None]:
df_ext5_train_revised = df_ext5_train_revised.drop(columns = ['Unnamed: 0'])
df_ext5_train_revised

Unnamed: 0,document,full_text,tokens,trailing_whitespace,index_pii
0,7,Design Thinking for innovation reflexion-Avril...,"['Design', 'Thinking', 'for', 'innovation', 'r...","[True, True, True, True, False, False, True, F...","[(9, 'B-NAME_STUDENT'), (10, 'I-NAME_STUDENT')..."
1,10,Diego Estrada\n\nDesign Thinking Assignment\n\...,"['Diego', 'Estrada', '\n\n', 'Design', 'Thinki...","[True, False, False, True, True, False, False,...","[(0, 'B-NAME_STUDENT'), (1, 'I-NAME_STUDENT'),..."
2,16,Reporting process\n\nby Gilberto Gamboa\n\nCha...,"['Reporting', 'process', '\n\n', 'by', 'Gilber...","[True, False, False, True, True, False, False,...","[(4, 'B-NAME_STUDENT'), (5, 'I-NAME_STUDENT')]"
3,20,Design Thinking for Innovation\n\nSindy Samaca...,"['Design', 'Thinking', 'for', 'Innovation', '\...","[True, True, True, False, False, True, False, ...","[(5, 'B-NAME_STUDENT'), (6, 'I-NAME_STUDENT')]"
4,56,Assignment: Visualization Reflection Submitt...,"['Assignment', ':', '\xa0 ', 'Visualization', ...","[False, False, False, False, False, False, Fal...","[(12, 'B-NAME_STUDENT'), (13, 'I-NAME_STUDENT')]"
...,...,...,...,...,...
6802,22678,EXAMPLE – JOURNEY MAP\n\nTHE CHALLENGE My w...,"['EXAMPLE', '–', 'JOURNEY', 'MAP', '\n\n', 'TH...","[True, True, True, False, False, True, True, F...",[]
6803,22679,Why Mind Mapping?\n\nMind maps are graphical r...,"['Why', 'Mind', 'Mapping', '?', '\n\n', 'Mind'...","[True, True, False, False, False, True, True, ...",[]
6804,22681,"Challenge\n\nSo, a few months back, I had chos...","['Challenge', '\n\n', 'So', ',', 'a', 'few', '...","[False, False, False, True, True, True, True, ...",[]
6805,22684,Brainstorming\n\nChallenge & Selection\n\nBrai...,"['Brainstorming', '\n\n', 'Challenge', '&', 'S...","[False, False, True, True, False, False, True,...",[]


Verify that there are no rows with any null values

In [None]:
df_ext5_train_revised[df_ext5_train_revised.isnull().any(axis = 1)]

Unnamed: 0,document,full_text,tokens,trailing_whitespace,index_pii
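
Unlike the other labelled sources, this dataset stores labels sparsely in `index_pii`. Judging from the preview, each entry appears to be a `(token_index, label)` pair; under that assumption, a per-token label list like the `labels` column of the other datasets could be rebuilt as sketched below. `expand_index_pii` is a hypothetical helper, and the sample values are taken from the row for document 16 shown above:

```python
def expand_index_pii(n_tokens, index_pii):
    """Expand sparse (token_index, label) pairs into one label per token."""
    # Every token defaults to "O" (outside any PII entity).
    labels = ["O"] * n_tokens
    for idx, label in index_pii:
        labels[idx] = label
    return labels

tokens = ["Reporting", "process", "\n\n", "by", "Gilberto", "Gamboa"]
index_pii = [(4, "B-NAME_STUDENT"), (5, "I-NAME_STUDENT")]
labels = expand_index_pii(len(tokens), index_pii)
print(labels)
```

Because the file is read from CSV, `index_pii` and `tokens` may arrive as strings and need parsing (for example with `ast.literal_eval`) before this step.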
