# Introduction


About the project:

Big data drives many technological advances, and open-source practice and reproducibility have grown in importance alongside it. Releasing datasets openly, however, requires careful handling of Personally Identifiable Information (PII). This project aims to automate the identification and obscuring of PII with machine learning models instead of relying on manual review.

The work is carried out in conjunction with the Kaggle challenge "The Learning Agency Lab - PII Data Detection", which targets seven PII categories in a dataset of 22,000 student responses in which students apply course material to real-world problems.

Two families of models are used: simple machine learning models, which keep the carbon footprint low, and more complex models such as transformers, which aim for higher accuracy. Performance is evaluated with metrics such as $F_{\beta}$.
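
For reference, $F_{\beta}$ combines precision and recall, and a value of $\beta > 1$ weights recall more heavily than precision. The sketch below computes it from raw counts. The function name `f_beta` and the example counts are illustrative, and the default `beta=5` is an assumption, since the text above does not fix a value.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 5.0) -> float:
    """F-beta score from true-positive, false-positive and false-negative counts."""
    beta_sq = beta ** 2
    denominator = (1 + beta_sq) * tp + beta_sq * fn + fp
    if denominator == 0:
        # No predictions and no true entities: return 0.0 by convention
        return 0.0
    return (1 + beta_sq) * tp / denominator


# Hypothetical counts, not results from this project
score = f_beta(tp=80, fp=20, fn=10)
print(f"F_beta = {score:.4f}")
```

With a large $\beta$, a missed PII token (false negative) costs far more than a false alarm, which matches the goal of not leaking personal data.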



Link to Kaggle competition: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/overview

---

Description of the features:

| Feature             | Type   | Explanation                                                            |
|---------------------|--------|------------------------------------------------------------------------|
| document            | id     | The document ID                                                        |
| full_text           | string | The full text of the response                                          |
| tokens              | list   | The tokenized full text                                                |
| trailing_whitespace | list   | For each token, whether it is followed by whitespace in the full text  |
| labels              | list   | For each token, whether it is PII and, if so, which kind               |
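
To illustrate how `tokens` and `trailing_whitespace` relate to `full_text`, the following minimal sketch rebuilds a text from the two lists. It assumes that `trailing_whitespace` holds one boolean per token and that the whitespace is a single space. The helper `rebuild_text` and the example record are hypothetical and not taken from the dataset.

```python
def rebuild_text(tokens: list[str], trailing_whitespace: list[bool]) -> str:
    """Join tokens, adding one space after each token flagged as followed by whitespace."""
    return "".join(
        token + (" " if has_space else "")
        for token, has_space in zip(tokens, trailing_whitespace)
    )


# Hypothetical record for illustration only
tokens = ["My", "name", "is", "Jane", "Doe", "."]
trailing_whitespace = [True, True, True, True, False, False]

text = rebuild_text(tokens, trailing_whitespace)
print(text)
```

Keeping the whitespace flags separate from the tokens means that a label predicted for a token can be mapped back to an exact character span in the original text.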


Descriptions of the different PII types:


| PII Type         | Description                                                                                           |
|------------------|-------------------------------------------------------------------------------------------------------|
| NAME_STUDENT     | The full or partial name of a student who is not necessarily the author of the essay.                 |
| EMAIL            | A student's email address.                                                                            |
| USERNAME         | A student's username on any platform.                                                                 |
| ID_NUM           | A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number. |
| PHONE_NUM        | A phone number associated with a student.                                                             |
| URL_PERSONAL     | A URL that might be used to identify a student.                                                        |
| STREET_ADDRESS   | A full or partial street address that is associated with the student, such as their home address.      |
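
For token classification, the seven PII types are usually expanded into a per-token label vocabulary. The sketch below assumes a BIO scheme (`B-` for the first token of an entity, `I-` for a continuation, `O` for non-PII tokens). That scheme is an assumption here and should be checked against the actual values in `labels`. The names `PII_TYPES`, `LABELS`, `label2id` and `id2label` are illustrative.

```python
# The seven PII types from the table above
PII_TYPES = [
    "NAME_STUDENT",
    "EMAIL",
    "USERNAME",
    "ID_NUM",
    "PHONE_NUM",
    "URL_PERSONAL",
    "STREET_ADDRESS",
]

# Assumed BIO scheme: "O" for non-PII, then a B-/I- pair for every PII type
LABELS = ["O"] + [f"{prefix}-{pii}" for pii in PII_TYPES for prefix in ("B", "I")]

# Integer mappings of the kind most classifiers expect
label2id = {label: index for index, label in enumerate(LABELS)}
id2label = {index: label for label, index in label2id.items()}

print(len(LABELS))
print(LABELS[:3])
```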




In [1]:
# imports
import json

import pandas as pd

In [2]:

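# Path to the competition training file, relative to this notebook
file_path = "../data/train.json"

# Load the JSON records (one per document) into a DataFrame
with open(file_path, "r", encoding="utf-8") as file:
    data = json.load(file)

df = pd.DataFrame(data)
print(f"Dataframe shape: {df.shape}")
print(f"Dataframe columns: {list(df.columns)}")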

Dataframe shape: (6807, 5)
Dataframe columns: ['document', 'full_text', 'tokens', 'trailing_whitespace', 'labels']
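
A natural next check after loading is how often each label occurs, since PII tokens are likely to be rare compared with non-PII tokens. The sketch below counts labels over a small hypothetical stand-in for `df["labels"]`. The helper `count_labels` and the example label values are illustrative assumptions, not output from the dataset.

```python
from collections import Counter
from itertools import chain


def count_labels(label_lists):
    """Count how often each label occurs across all documents."""
    return Counter(chain.from_iterable(label_lists))


# Hypothetical stand-in for df["labels"]; the label names are assumed
example_labels = [
    ["O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT"],
    ["O", "B-EMAIL", "O"],
]

counts = count_labels(example_labels)
print(counts.most_common())
```

On the loaded data, the same helper could be called as `count_labels(df["labels"])`, because iterating over the column yields one label list per document.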
