# Use-cases

**In short**: this notebook explains our solution to the problem of identifying PII in documents.



#### An introduction to the problem:

The presence of PII hinders the creation of open datasets that could advance education, as its release poses significant risks to student privacy. It's essential to cleanse educational data of PII before its public release, a process that data science could make more efficient.

Currently, manual review is the most reliable method for detecting PII, though it's costly and limits dataset scalability. Automatic detection methods like named entity recognition (NER) are available but often fail to accurately identify sensitive personal details.

In order to alleviate these challenges, our platform harnesses advanced algorithms and models to improve the accuracy of automatic PII detection. By integrating machine learning techniques and deep learning, we offer a more robust solution that reduces the need for costly manual reviews and enhances the scalability of data processing. This approach not only speeds up the identification process but also increases the precision with which sensitive personal details are detected, ensuring better protection and usability of your data.

# **Our solution**

Our solution streamlines the process of managing and protecting personally identifiable information (PII) in text submissions. 

Users can **upload texts**, and our platform **automatically identifies PII**, provides tools for **detailed analysis and editing**, and allows for **easy downloading** of the refined data. 

This is achieved by using the **optimized models** from the previous notebooks, with **Streamlit** used to develop the website.

#### Understand our product

To understand our product, we split it into **three** parts:
1. **Input**: the document to analyse.
2. **Output**: the document with the PII words identified.
3. **Customizability**: adjustments the user can make to the identified PII.

### **Landing page**

![image-2.png](attachment:image-2.png)

### **Input**

Our current solution offers **two** ways of inputting documents:
1. Pasting text: either by a paste command or with a button.
2. Uploading documents: either as `.txt` or as `.pdf`.
 

![image.png](attachment:image.png)

The user can also try out a *test example* from the dataset.
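
The two input paths above could be wired up in Streamlit roughly as follows. This is a minimal sketch, not the code behind the screenshots: `upload_to_text`, `render_input`, and the widget labels are hypothetical names, and the PDF extractor is passed in as a parameter because the library used for PDF parsing is not stated here.

```python
from pathlib import Path
from typing import Callable, Optional

ALLOWED_SUFFIXES = {".txt", ".pdf"}


def upload_to_text(
    filename: str,
    data: bytes,
    pdf_to_text: Optional[Callable[[bytes], str]] = None,
) -> str:
    """Turn the bytes of an uploaded file into plain text for the PII model."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"Unsupported file type: {suffix or filename}")
    if suffix == ".txt":
        # errors="replace" avoids a crash on files with stray bytes
        return data.decode("utf-8", errors="replace")
    if pdf_to_text is None:
        raise ValueError("No PDF text extractor was provided")
    # Assumption: the caller supplies whatever PDF library the app uses
    return pdf_to_text(data)


def render_input(pdf_to_text: Optional[Callable[[bytes], str]] = None) -> str:
    """Draw both input widgets and return the text to analyse."""
    # Imported lazily so upload_to_text can be used without Streamlit installed
    import streamlit as st

    pasted = st.text_area("Paste your text here")
    uploaded = st.file_uploader("Or upload a document", type=["txt", "pdf"])
    if uploaded is not None:
        # An uploaded file takes priority over pasted text in this sketch
        return upload_to_text(uploaded.name, uploaded.getvalue(), pdf_to_text)
    return pasted
```

Keeping the file handling in a plain function, separate from the widgets, makes it possible to check the `.txt`/`.pdf` logic without starting the app.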

### **Output**

After a document is analyzed, users can view and save the results in three ways:
1. Analyzed: see which words are classified as PII and of what type.
2. Alias: obscure the PII words with fake ones.
3. Tagged: obscure the PII words with their classified PII types.


Let's look at an example:

#### Analyse PII

The user has entered a text and the model has classified potential PII words.

Now the user can easily see the identified PII words and what type they were classified as.


![image.png](attachment:image.png)

#### Alias

You can also obscure the PII in the text with fake replacement values (aliases).

This way the text retains its readability and context while being anonymized and remaining compliant.


![image.png](attachment:image.png)

#### Tagged

The last option is to view the classified types directly.

This could be helpful for the workers who create the open-source datasets.

![image.png](attachment:image.png)

The user is now able to copy or download the content.
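
The *Tagged* and *Alias* views can both be thought of as the same substitution applied to the spans the model returns. The sketch below assumes the model output is a list of `(start, end, label)` character spans; that format, the label names `NAME` and `EMAIL`, and the functions `render_tagged` and `render_alias` are illustrative assumptions, not the app's actual interface.

```python
from typing import Dict, List, Tuple

# Assumed span format: (start, end, label), with end exclusive
Span = Tuple[int, int, str]


def render_tagged(text: str, spans: List[Span]) -> str:
    """Replace every PII span with its type, e.g. [NAME]."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def render_alias(text: str, spans: List[Span], fake_values: Dict[str, List[str]]) -> str:
    """Replace every PII span with a fake value of the same type."""
    mapping: Dict[str, str] = {}   # original string -> alias, so repeats stay consistent
    used: Dict[str, int] = {}      # how many aliases were handed out per label
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        original = text[start:end]
        if original not in mapping:
            pool = fake_values.get(label)
            if pool:
                index = used.get(label, 0)
                mapping[original] = pool[index % len(pool)]
                used[label] = index + 1
            else:
                # No fake values for this type: fall back to the tag
                mapping[original] = f"[{label}]"
        out.append(text[cursor:start])
        out.append(mapping[original])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Ana Silva wrote to ana@example.com. Ana Silva signed it."
spans = [(0, 9, "NAME"), (19, 34, "EMAIL"), (36, 45, "NAME")]
fakes = {"NAME": ["Jordan Lee"], "EMAIL": ["user@example.org"]}

print(render_tagged(text, spans))
print(render_alias(text, spans, fakes))
```

Mapping each original string to a single alias keeps repeated mentions of the same person consistent, which is what lets the aliased text stay readable.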

#### Additional features

As an additional feature, `.pdf` files can be obscured in their original format, thus maintaining their readability and design.

In this example, one of our group members has uploaded his CV as a `.pdf` (with consent) and then chooses to obscure it with the alias feature.

![image-3.png](attachment:image-3.png)


Plus an example of the tag feature:

![image.png](attachment:image.png)

- Unfortunately the model wasn't able to classify the phone number.
- However, in the next section we'll try to fix this.

### **Customizability**

Since our models will never be perfect, we have added the options of:

1. Adding PII: in case the model missed some PII words.
2. Editing PII: in case the user wants to edit the fake alias words.

Both the PII tag and the alias can be edited, and additional PII entries can be added to the list.

![image-4.png](attachment:image-4.png)

Let's try to change and add some variables.

In this example, we have made changes to the following cells:

![image-2.png](attachment:image-2.png)


And this is the result:

![image-3.png](attachment:image-3.png)

- We see the variables have changed and the phone number is now obscured.

This gives the user control over the final result when the model misses or mislabels PII.
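
One way to picture the two customization options is as a small post-processing step on top of the model output. The sketch below is an assumption about how this could work, not the app's implementation: spans are again `(start, end, label)` tuples, and `add_manual_pii`, `apply_replacements`, and the `PHONE` label are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed span format: (start, end, label), with end exclusive
Span = Tuple[int, int, str]


def add_manual_pii(text: str, spans: List[Span], term: str, label: str) -> List[Span]:
    """Add every occurrence of a user-supplied term as a new PII span."""
    merged = list(spans)
    if not term:
        return sorted(merged)
    for match in re.finditer(re.escape(term), text):
        start, end = match.span()
        # Skip occurrences that overlap a span that is already present
        if not any(start < e and s < end for s, e, _ in merged):
            merged.append((start, end, label))
    return sorted(merged)


def apply_replacements(text: str, spans: List[Span], aliases: Dict[str, str]) -> str:
    """Use the user's alias for a PII word if one was set, else its tag."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        original = text[start:end]
        out.append(text[cursor:start])
        out.append(aliases.get(original, f"[{label}]"))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Call Ana Silva on 555-0100."
model_spans = [(5, 14, "NAME")]  # the phone number was missed
spans = add_manual_pii(text, model_spans, "555-0100", "PHONE")
print(apply_replacements(text, spans, {"Ana Silva": "Jordan Lee"}))
```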

#### **Conclusion**

To conclude, this website illustrates the potential of combining our optimized NER model with a user-friendly interface.

---

## Other potential use-cases

We also considered the following potential use-cases:




### 1. How biased is a company

One product would be web-scraping a company's `Employees` page to identify the names of its employees.

With these names, we could attempt to infer each employee's gender and ethnicity to give the user an overview of how 'balanced' a company is.

In the end, we could provide a site that ranks multiple companies based on this.

### 2. Secure data sharing

Before sharing data externally or with third parties (e.g. ChatGPT), PII scanners can be used to ensure sensitive information is adequately anonymized or encrypted, preventing unintended disclosure.
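
Such a scanner could sit as a gate in front of any outgoing call. The sketch below is illustrative only: it uses a simple e-mail regular expression as a stand-in detector rather than the NER model, and `detect_emails`, `redact`, and `share_safely` are hypothetical names. The `send` argument represents whatever function actually transmits the text.

```python
import re
from typing import Callable, List, Tuple

# Assumed span format: (start, end, label), with end exclusive
Span = Tuple[int, int, str]

# Stand-in detector: a deliberately simple e-mail pattern, not the NER model
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def detect_emails(text: str) -> List[Span]:
    return [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]


def redact(text: str, spans: List[Span]) -> str:
    """Replace every detected span with its type tag."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def share_safely(
    text: str,
    detect: Callable[[str], List[Span]],
    send: Callable[[str], object],
) -> object:
    """Redact detected PII, then pass only the cleaned text to `send`."""
    return send(redact(text, detect(text)))


outbox: List[str] = []
share_safely("Mail me at ana@example.com please", detect_emails, outbox.append)
print(outbox)
```

Because the detector is passed in as a function, the regular expression here could be swapped for the trained model without changing the gate itself.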
