# Securing Sensitive Data: Transformer model for Detecting Personally Identifiable Information
<center>
Advanced Business Analytics (42578) Exam Project
</center>
<center>
Group: Mind Machines
</center>
<center>
Students: Christoffer Wejendorp (s204090) - Jasmin Thari (s204155) - Marah Marak (s182946)
</center>


### Structure of this notebook

This notebook is organized into seven distinct sections, each aimed at guiding you through various stages of the project from initial setup to in-depth analysis and discussions:

1. **[Introduction](#1)**: Provides an overview of the objectives and scope of the project.

2. **[Get Started](#2)**: Outlines the setup procedures including the installation of required packages within the notebook.

3. **[Data & Pre-processing](#3)**: Details the dataset used in this project, followed by comprehensive steps involved in the data cleaning process to prepare the data for analysis.

4. **[Exploratory Data Analysis](#4)**: Dives into the dataset through various visualization techniques to uncover patterns, trends, and insights which inform further analyses.

5. **[Named Entity Recognition using Regex](#5)**: Introduces a baseline model for Named Entity Recognition (NER) utilizing regular expressions (Regex). This section demonstrates how to apply Regex patterns and rules to identify named entities within the text.

6. **[Named Entity Recognition Using Transformer Model](#6)**: Advances the NER approach by implementing the DistilBert Transformer model to achieve a more sophisticated and effective entity recognition.

7. **[Discussion](#7)**: Concludes with a critical analysis of the results obtained, discussing both the strengths and limitations of the methods used and suggesting potential areas for future work.

### Table of Contents
1. **[Introduction](#1)**
2. **[Get Started](#2)**
3. **[Data & Pre-processing](#3)**
4. **[Exploratory Data Analysis](#4)**
5. **[Named Entity Recognition using Regex](#5)**
6. **[Named Entity Recognition Using Transformer Model](#6)**
7. **[Discussion](#7)**

__________

<a id="1"></a>
## Section 1: Introduction

In the age of rapid advancements in artificial intelligence and generative AI technologies, the ability to responsibly manage sensitive information is more crucial than ever. This project, *Securing Sensitive Data: AI Methods for Detecting Personally Identifiable Information*, aims to address the growing concern surrounding the protection of personal data, especially in contexts where large language models (LLMs) are employed. A primary example is the use of student essays by universities wishing to train AI models. These essays often contain personally identifiable information (PII) that must be carefully identified and redacted to maintain confidentiality and comply with data protection laws.

The motivation behind this project is driven by the need to balance the utilization of generative AI in educational settings with stringent data security measures. As universities and other institutions increasingly rely on AI to enhance learning and research, ensuring the privacy of individuals represented in training datasets becomes essential. To handle this challenge, we will deploy Named Entity Recognition (NER) techniques to effectively identify and remove PII from texts.

Our approach begins with the application of regular expressions (Regex), a basic yet powerful tool for pattern recognition in texts. This will serve as our baseline model for detecting straightforward instances of PII. Subsequently, we will enhance the model by incorporating a more sophisticated method using the DistilBert Transformer model. Through these methods, we aim to create a safer data environment that respects individual privacy while enabling the progressive use of AI in education.

The methodologies and models developed here are not only applicable to academic settings but can also be extended to corporate environments where data privacy is paramount. As companies increasingly integrate AI technologies such as chatbots to interact with customers, the need to comply with strict regulations like the General Data Protection Regulation (GDPR) becomes critical.

__________

<a id="2"></a>
## Section 2: Get Started

In [1]:
# Standard libraries
import json
import itertools
import re
import string
from collections import Counter
from itertools import chain
import math

# Data manipulation
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
import ast

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objects as go
import plotly.express as px
from IPython.display import HTML

# NLP
import spacy
from spacy import displacy
from spacy.tokens import Doc, Span
from spacy.lang.en import English
spacy_nlp = spacy.load('en_core_web_sm')
eng_tokenizer = English().tokenizer

# Text Processing and Analysis
import random
import nltk
from nltk.corpus import PlaintextCorpusReader, stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud, ImageColorGenerator
from sklearn.feature_extraction.text import TfidfVectorizer

# Import custom functions
from Functions.Spacy_Tokenizer import adjust_token_labels, refine_punctuation_labels, create_bio_labels
from Functions.Ner_Visualizer import *
from Functions.tdidf_wordclouds import *

# Silence pandas' chained-assignment (SettingWithCopy) warnings
pd.options.mode.chained_assignment = None

In [26]:
import plotly.graph_objects as go

# Define the custom color theme
color_theme = {
    'three_colors': ['#57634B', '#D4793A', '#527184'],  
    'four_colors': ['#85977D', '#8498A5', '#587F86', '#BD8A3D'],  
    'five_colors': ['#57634B', '#D4793A', '#527184', '#CFA802', '#BBB599'], 
    'twelve_colors': ['#57634B', '#85977D', '#8498A5', '#527184', '#E9B649', '#BD8A3D', '#D4793A', 
                      '#7D1F1D', '#BB6D71', '#BBB599', '#BE477D', '#CDADE6']}

# fig = go.Figure()
# # Adding bars with specified colors from a palette
# for i, value in enumerate([2, 3, 1, 4]):
#     fig.add_trace(go.Bar(x=[f'Category {i+1}'], y=[value], marker_color=color_theme['four_colors'][i]))
# fig.show()


__________

<a id="3"></a>
## Section 3: Data & Pre-processing

This project uses data from a Kaggle competition titled *The Learning Agency Lab - PII Data Detection*. The primary dataset contains 6,807 essays written by students of an online course. Each essay responds to a single assignment that tasked students with applying course material to a real-world problem. The objective is to annotate personally identifiable information (PII) within each essay. To preserve privacy, all real PII has been replaced with surrogate identifiers using a semi-automated process. You can access this dataset [here](https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data).

Because the initial dataset predominantly contains the *B-NAME_STUDENT* class and lacks diversity in PII types, additional data was necessary. This supplementary dataset was generated by a Large Language Model (LLM) and is also available on Kaggle. It consists of 4,434 texts with annotations similar to those of the original training data. These texts, generated from eight different prompts, range from life summaries to narratives. You can find this dataset [here](https://www.kaggle.com/datasets/alejopaullier/pii-external-dataset). Both datasets are used in this project.

### Project Goals
The aim of this project is to identify and annotate various types of PII, which include:

- **Student Names:** Identifying names specific to students, excluding names of instructors, authors, or other individuals.
- **Student Emails:** Detecting email addresses belonging to students.
- **Student Username:** Recognizing usernames associated with students.
- **Student ID Number:** Identifying students' ID numbers or social security numbers.
- **Student Phone Number:** Detecting phone numbers linked to students.
- **Personal URL:** Recognizing URLs that could potentially identify students.
- **Student Address:** Identifying street addresses related to students.

### Data Structure
The data provided includes detailed information about each essay:

- **index (int):** An index number assigned to each essay.
- **document (int):** A unique integer identifier for each essay.
- **full_text (string):** The complete text of each essay in UTF-8 format.
- **tokens (list):** A sequence of tokens, derived using the SpaCy English tokenizer.
- **trailing_whitespace (list):** A list indicating whether a space follows each token.
- **labels (list):** These labels classify each token according to the type of PII it represents, using the BIO (Beginning, Inside, Outside) format:
  - **B-** prefix denotes the start of a PII entity.
  - **I-** indicates continuation of a PII entity.
  - **O** represents tokens unrelated to PII.
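
To make the data structure concrete, the following minimal sketch shows how `tokens` and `trailing_whitespace` reproduce the original text, and how BIO labels can be grouped into entity spans. The helper names `rebuild_text` and `bio_to_spans` are illustrative assumptions and are not part of the project code; the toy tokens mirror the second essay shown later in this section.

```python
from typing import List, Tuple


def rebuild_text(tokens: List[str], trailing_whitespace: List[bool]) -> str:
    """Join tokens, adding a space only where trailing_whitespace is True."""
    return "".join(
        tok + (" " if ws else "") for tok, ws in zip(tokens, trailing_whitespace)
    )


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive B-/I- tokens into (entity_type, entity_text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            # A new entity starts; close the one in progress, if any
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = lab[2:], [tok]
        elif lab.startswith("I-") and current_type == lab[2:]:
            # Continuation of the entity in progress
            current_tokens.append(tok)
        else:
            # "O" (or a stray I- tag, which is ignored) ends the entity in progress
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Toy example (assumed values, shortened from an essay in the dataset)
tokens = ["Diego", "Estrada", "\n\n", "Design", "Thinking"]
trailing_whitespace = [True, False, False, True, False]
labels = ["B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O", "O"]

print(repr(rebuild_text(tokens, trailing_whitespace)))
print(bio_to_spans(tokens, labels))
```

Joining entity tokens with a single space is a simplification; using `trailing_whitespace` inside `bio_to_spans` would reproduce the exact surface form.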

## 3.1 Data Loading

### 3.1.1 Loading the first data set

In [33]:
# Load data in dictionary and dataframe format
with open('data/train.json', 'r') as f:
    data = json.load(f)

data_df = pd.read_json('data/train.json')

print('Number of documents:',len(data_df))

Number of documents: 6807


In [34]:
data_df.head(2)

Unnamed: 0,document,full_text,tokens,trailing_whitespace,labels
0,7,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,...","[True, True, True, True, False, False, True, F...","[O, O, O, O, O, O, O, O, O, B-NAME_STUDENT, I-..."
1,10,Diego Estrada\n\nDesign Thinking Assignment\n\...,"[Diego, Estrada, \n\n, Design, Thinking, Assig...","[True, False, False, True, True, False, False,...","[B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O..."


### 3.1.2 Loading the LLM-generated data

In [35]:
# Load LLM generated data in dataframe format
llm_data_df = pd.read_csv('data/pii_dataset.csv')
print('Number of documents:',len(llm_data_df))

Number of documents: 4434


In [36]:
llm_data_df.head(2)

Unnamed: 0,document,text,tokens,trailing_whitespace,labels,prompt,prompt_id,name,email,phone,job,address,username,url,hobby,len
0,1073d46f-2241-459b-ab01-851be8d26436,"My name is Aaliyah Popova, and I am a jeweler ...","['My', 'name', 'is', 'Aaliyah', 'Popova,', 'an...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",\n Aaliyah Popova is a jeweler with 13 year...,1,Aaliyah Popova,aaliyah.popova4783@aol.edu,(95) 94215-7906,jeweler,97 Lincoln Street,,,Podcasting,363
1,5ec717a9-17ee-48cd-9d76-30ae256c9354,"My name is Konstantin Becker, and I'm a develo...","['My', 'name', 'is', 'Konstantin', 'Becker,', ...","[True, True, True, True, True, True, True, Tru...","['O', 'O', 'O', 'B-NAME_STUDENT', 'I-NAME_STUD...",\n Konstantin Becker is a developer with 2 ...,1,Konstantin Becker,konstantin.becker@gmail.com,0475 4429797,developer,826 Webster Street,,,Quilting,255



> - Let's explore the number of different prompt IDs present in the data.


In [37]:
len(llm_data_df['prompt_id'].unique())

8

> - Below, we display a random sample for each prompt ID. Each prompt ID represents a category of prompt, and every prompt within a category is unique.

In [66]:
# Group by 'prompt_id', sample one record from each group, and reset index to flatten the DataFrame
sampled_prompts = llm_data_df.groupby('prompt_id').apply(lambda x: x.sample(1), include_groups=False).reset_index(drop=False)
html = '<table>'
html += '<tr><th>Prompt ID</th><th>Text</th></tr>'  
for index, row in sampled_prompts.iterrows():
    html += f'<tr><td>{row["prompt_id"]}</td><td>{row["prompt"]}</td></tr>'

html += '</table>'
display(HTML(html))

Prompt ID,Text
0,"Write a fictional semi-formal biography in first person for Tao Kato. Add the following information about him/her randomly inside the text: name is Tao Kato, phone number is (960) 464-3988, email is tao.kato8489@gmail.edu, address is 1666 South Summer Rose Avenue."
1,"Chen Mitsubishi is a musician with 12 years of experience. Write a detailed example in first person of a job-related project he/her did in the past. Add the following information about him/her randomly inside the text: name is Chen Mitsubishi, phone number is 0438 437 5019, email is chenmitsubishi@gmail.org, hobby is Painting, address is 11813 West 75th Circle."
2,"Viktor Suzuki is a neurologist. Write about a job-related project he/her did in the past including some of the following information: phone number is +91-25517 28305, email is viktorsuzuki5689@outlook.net"
3,"Mohammed Martinez is a dietician. Write a first person summary of something he solved in his job. Add the following information about him/her (randomly please) inside the text: name is Mohammed Martinez, email is mohammed_martinez@yahoo.edu, address is 235 Hugh Thomas Drive."
4,"Write a fictional semi-formal biography in first person for Omar Mitsubishi. Add the following information about him/her randomly inside the text: name is Omar Mitsubishi, profile at X.com is omar_mitsubishi19, email is omarmitsubishi@gmail.com, webpage is https://www.omarmitsubishi.edu/news. It is important to include this information in different parts of the text."
5,"Anil Kobayashi is a coach with 3 years of experience. Write a detailed example in first person of a job-related project he/her did in the past. Add the following information about him/her randomly inside the text: name is Anil Kobayashi, webpage is www.anil_kobayashi.com/news.php, profile at LinkedIn is anilkobayashi, address is 4674 Amy Landing Suite 292."
6,"Angel Roux is a biologist. Write about a job-related project he/her did in the past including some of the following information: phone number is 080-8030-5392, profile at LinkedIn is a.roux. It is important to include this information randomly throughout the text."
7,"Ram Rousseau is a psychologist. Write a first person summary of something he solved in his job. Add the following information about him/her (randomly please) inside the text: name is Ram Rousseau, email is ram_rousseau6798@yahoo.net, address is 824 Peters Neck Apt. 294, profile at Instagram is ram.rousseau98, webpage is www.ram-rousseau.biz."


> - From the output, we can observe that the prompts ask for either fictional biographies or first-person accounts of job-related projects.

## 3.2 Pre-processing

> - First, we will preprocess both datasets to ensure they are aligned. For example, we will rename the text columns in both datasets to "text", add a new column to track which data is generated by LLMs, and assign a prompt ID of -1 to the non-generated data to facilitate tracking.

In [67]:
data_df = data_df.rename(columns={'full_text':'text'}) # Rename column to 'text'
data_df['llm_generated'] = False # Add column to indicate if the text was generated by LLM
data_df['prompt_id'] = -1  # Add column to indicate the prompt id

In [68]:
llm_data_df['llm_generated'] = True # Add column to indicate if the text was generated by LLM

llm_data_df[["tokens", "trailing_whitespace", "labels"]] = llm_data_df[["tokens", "trailing_whitespace", "labels"]].map(ast.literal_eval) # Convert string to list
llm_data_df["document"] = llm_data_df["document"].astype("category").cat.codes + (data_df.document.max() + 1) # make sure document id is unique and changing to int

**Addressing the punctuation issue in the LLM-generated data set**:

> In named entity recognition (NER), punctuation plays a crucial role in determining token boundaries and labels. A standard tokenizer such as SpaCy's segments a phrase like "Charles, by" into ["Charles", ",", "by"], with trailing whitespace [False, True, True] and labels ["B-NAME_STUDENT", "O", "O"]. Punctuation is thus kept separate from the named entity, even when no space separates them.
>
> The LLM-generated dataset, however, is tokenized by splitting on whitespace only. The same phrase becomes ["Charles,", "by"] with trailing whitespace [True, True] and labels ["B-NAME_STUDENT", "O"], so the comma is merged into the name token and shares its label. Although the model's subword tokenization yields the same pieces in both cases, e.g. ["Char", "les", ",", "_by"], the labels propagated to those pieces differ: ["B-NAME_STUDENT", "B-NAME_STUDENT", "B-NAME_STUDENT", "O"] instead of the appropriate ["B-NAME_STUDENT", "B-NAME_STUDENT", "O", "O"].
>
> In other words, punctuation is treated as part of the preceding token and inherits its label, so the tokenization and labels of the LLM dataset do not line up with those of the original dataset.
>
> To resolve this, we re-tokenize the LLM-generated data so that punctuation is separated from named entities and labelled on its own. This avoids mislabelled punctuation that could confuse the NER model and makes the dataset more useful for training.

> NB! The following code uses functions from `Spacy_Tokenizer.py` file. 

In [69]:
llm_data_df_tokenized = llm_data_df.apply(adjust_token_labels, axis=1)
llm_data_df_tokenized["labels"] = llm_data_df_tokenized.apply(refine_punctuation_labels, axis=1).apply(create_bio_labels)
llm_data_df_tokenized['text'] = llm_data_df['tokens'].apply(lambda x: ' '.join(x))
llm_data_df_tokenized[['prompt_id', 'llm_generated']] = llm_data_df[['prompt_id', 'llm_generated']]
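
The helper functions used above live in `Spacy_Tokenizer.py` and are not shown in this notebook. As a rough illustration of the idea only, the sketch below detaches trailing punctuation from whitespace-split tokens and labels it `O`. The function `split_trailing_punctuation` is a hypothetical stand-in, not the project's implementation, and it is deliberately naive: for example, it would also split the closing parenthesis off a token such as "(95)".

```python
import string
from typing import List, Tuple


def split_trailing_punctuation(
    tokens: List[str], trailing_whitespace: List[bool], labels: List[str]
) -> Tuple[List[str], List[bool], List[str]]:
    """Hypothetical helper: detach trailing punctuation and label it 'O'."""
    new_tokens, new_ws, new_labels = [], [], []
    for token, has_space, label in zip(tokens, trailing_whitespace, labels):
        core = token.rstrip(string.punctuation)
        tail = token[len(core):]
        if not core or not tail:
            # Nothing to split: a plain word or a token made only of punctuation
            new_tokens.append(token)
            new_ws.append(has_space)
            new_labels.append(label)
            continue
        # The word keeps its label but is no longer followed by a space
        new_tokens.append(core)
        new_ws.append(False)
        new_labels.append(label)
        # Each punctuation mark becomes its own 'O' token; only the last one
        # inherits the original trailing whitespace
        for i, mark in enumerate(tail):
            new_tokens.append(mark)
            new_ws.append(has_space if i == len(tail) - 1 else False)
            new_labels.append("O")
    return new_tokens, new_ws, new_labels


# Toy example (assumed values, modelled on the first LLM-generated text)
new_tokens, new_spaces, new_labels = split_trailing_punctuation(
    ["Aaliyah", "Popova,", "and"],
    [True, True, True],
    ["B-NAME_STUDENT", "I-NAME_STUDENT", "O"],
)
print(new_tokens)
print(new_spaces)
print(new_labels)
```

Because tokens are split but never merged, the token count of a document can only grow under this kind of adjustment.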

In [70]:
llm_data_df_tokenized.head(2)

Unnamed: 0,document,tokens,trailing_whitespace,labels,text,prompt_id,llm_generated
0,22968,"[My, name, is, Aaliyah, Popova, ,, and, I, am,...","[True, True, True, True, False, True, True, Tr...","[O, O, O, B-NAME_STUDENT, I-NAME_STUDENT, O, O...","My name is Aaliyah Popova, and I am a jeweler ...",1,True
1,24398,"[My, name, is, Konstantin, Becker, ,, and, I, ...","[True, True, True, True, False, True, True, Fa...","[O, O, O, B-NAME_STUDENT, I-NAME_STUDENT, O, O...","My name is Konstantin Becker, and I'm a develo...",1,True


In [79]:
print("Thus, the first row's tokens are transformed from " + str(len(llm_data_df['tokens'].iloc[0])) + " to " + str(len(llm_data_df_tokenized['tokens'].iloc[0])) + " tokens and are more aligned with the first data set and the more appropriate tokenizer.")

Thus, the first row's tokens are transformed from 363 to 411 tokens and are more aligned with the first data set and the more appropriate tokenizer.


### 3.2.1 Combine data sets

> - Both data sets have now been pre-processed, and we are ready to combine them into one data set.

In [80]:
df = pd.concat([data_df, llm_data_df_tokenized], ignore_index=True)

In [82]:
print('Number of documents:',len(df))

Number of documents: 11241


In [83]:
df.head(2)

Unnamed: 0,document,text,tokens,trailing_whitespace,labels,llm_generated,prompt_id
0,7,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,...","[True, True, True, True, False, False, True, F...","[O, O, O, O, O, O, O, O, O, B-NAME_STUDENT, I-...",False,-1
1,10,Diego Estrada\n\nDesign Thinking Assignment\n\...,"[Diego, Estrada, \n\n, Design, Thinking, Assig...","[True, False, False, True, True, False, False,...","[B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O...",False,-1


### 3.2.2 Encode target

> The last step is to encode, for each document, which target classes it contains as one-hot (indicator) columns.
>
> NB! The function used is taken from the `Spacy_Tokenizer.py` file.

In [84]:
def encode_labels(df):
    df = df.copy()
    # Strip the B-/I- prefix so each document maps to its set of PII classes (plus 'O')
    df["unique_labels"] = df["labels"].apply(lambda x: set(
        [l.split('-', 1)[1] if l != 'O' else l for l in x]
         ))

    # One indicator column per class: 1 if the class occurs in the document
    mlb = MultiLabelBinarizer()
    one_hot_encoded = mlb.fit_transform(df['unique_labels'])
    # Reuse df's index so the indicator rows line up with df in the concat below
    one_hot_df = pd.DataFrame(one_hot_encoded, columns=mlb.classes_, index=df.index)
    df = pd.concat([df, one_hot_df], axis=1)
    
    # add 'OTHER' column which is only true when we have no other label in text
    df['OTHER'] = df['unique_labels'].apply(lambda x: 1 if len(x - {"O"}) == 0 else 0)
    
    return df, list(mlb.classes_) + ['OTHER']

In [87]:
df, label_classes = encode_labels(df)
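
For intuition, the encoding can be reproduced on a toy input with plain Python. This is a minimal sketch of the idea rather than the notebook's implementation: `multi_hot` is a hypothetical helper, and the three label sequences are made up for illustration.

```python
from typing import List, Tuple


def multi_hot(
    label_sequences: List[List[str]],
) -> Tuple[List[str], List[List[int]], List[int]]:
    """Return (classes, indicator rows, OTHER flags) for a list of BIO label sequences."""
    # Strip the B-/I- prefix so each document maps to a set of classes
    doc_classes = [
        {lab.split("-", 1)[1] if lab != "O" else "O" for lab in seq}
        for seq in label_sequences
    ]
    # Sorted class list gives a stable column order
    classes = sorted(set().union(*doc_classes))
    # One row per document, one 0/1 entry per class
    rows = [[1 if c in found else 0 for c in classes] for found in doc_classes]
    # OTHER is 1 only when the document contains no PII class at all
    other = [1 if not (found - {"O"}) else 0 for found in doc_classes]
    return classes, rows, other


# Toy documents (assumed labels)
docs = [
    ["O", "B-NAME_STUDENT", "I-NAME_STUDENT"],
    ["O", "O"],
    ["B-EMAIL", "O"],
]
classes, rows, other = multi_hot(docs)
print(classes)
print(rows)
print(other)
```

Each row can have several entries set to 1, so the result is a multi-label indicator matrix rather than a one-hot vector in the strict sense.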

In [88]:
df.head(2)

Unnamed: 0,document,text,tokens,trailing_whitespace,labels,llm_generated,prompt_id,unique_labels,EMAIL,ID_NUM,...,USERNAME,OTHER,EMAIL.1,ID_NUM.1,NAME_STUDENT,O,PHONE_NUM,STREET_ADDRESS,URL_PERSONAL,USERNAME.1
0,7,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,...","[True, True, True, True, False, False, True, F...","[O, O, O, O, O, O, O, O, O, B-NAME_STUDENT, I-...",False,-1,"{NAME_STUDENT, O}",0,0,...,0,0,0,0,1,1,0,0,0,0
1,10,Diego Estrada\n\nDesign Thinking Assignment\n\...,"[Diego, Estrada, \n\n, Design, Thinking, Assig...","[True, False, False, True, True, False, False,...","[B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O...",False,-1,"{NAME_STUDENT, O}",0,0,...,0,0,0,0,1,1,0,0,0,0


__________

<a id="4"></a>
## Section 4: Exploratory Data Analysis

## 4.1 Data Visualization

### 4.1.1 Target Distribution

First, we examine the frequency of each target across all documents, excluding the target "O".

In [None]:
# Flatten the per-document label lists into one sequence
flattened_labels = list(itertools.chain.from_iterable(df['labels']))
# Count the occurrences of each label
label_counts = Counter(flattened_labels)
# Keep the full label order (including "O"); it is reused for the position plot later
labels, counts = zip(*label_counts.items())

# Drop "O" explicitly rather than relying on it being the first key
pii_counts = {label: count for label, count in label_counts.items() if label != 'O'}

# Create the bar plot
fig = go.Figure([go.Bar(x=list(pii_counts.keys()), y=list(pii_counts.values()))])
fig.update_layout(title_text='Frequency of each label', xaxis_title='Labels', yaxis_title='Frequency')
fig.show()

- It is evident that the most common targets are *B-NAME_STUDENT*, *I-NAME_STUDENT*, *B-STREET_ADDRESS*, and *I-STREET_ADDRESS*. It is not surprising that *NAME_STUDENT* is the most frequently occurring target.
- Some targets are very rare, such as *B-ID_NUM*, *I-URL_PERSONAL*, and *I-ID_NUM*.
- Certain potential targets, such as *I-EMAIL* and *I-USERNAME*, are absent. This is expected, since emails and usernames rarely span more than one token.
- Overall, we have a diverse array of targets, which is crucial for effective model training.

Next, we examine the distribution of unique labels in each document.

In [None]:
df['unique_labels'] = df['labels'].apply(lambda x: list(set(x)))
df['num_unique_labels'] = df['unique_labels'].apply(len)

# Histogram of number of unique labels per document
fig = px.histogram(df, x='num_unique_labels', nbins=20, 
                   labels={'num_unique_labels': 'Number of unique labels'},
                   title='Histogram of number of unique labels per document')
fig.update_layout(yaxis_title='Frequency') 
fig.show()


- Most commonly, a document has at least one unique label; note that this count includes the "O" label.
- Some texts have as many as 6 to 8 unique labels.

It is also interesting to see how many targets there are in each document when excluding the label "O".

In [None]:
df['num_labels'] = df['labels'].apply(lambda labels: len([label for label in labels if label != "O"]))
filtered_df = df[df['num_labels'] > 0]
# Histogram of number of PII labels per document (documents without PII are excluded)
fig = px.histogram(filtered_df, x='num_labels', nbins=50, 
                   labels={'num_labels': 'Number of labels'},
                   title='Histogram of number of labels per document')
fig.update_layout(yaxis_title='Frequency') 
fig.show()

In [None]:
print("Number of documents without any target:", (df['num_labels'] == 0).sum())

- A number of documents contain no targets at all; the exact count is printed by the cell above.
- A few documents contain a large number of targets.
- Otherwise, most documents that contain targets have around 5-15 of them.

### 4.1.2 Document distribution

First, we visualize the text length of the documents with and without labels.

In [None]:
df_with_labels = df[df['labels'].apply(lambda x: len(set(x)) > 1)].copy() # documents with at least one PII label
df_non_labels = df[df['labels'].apply(lambda x: 'O' in x and len(set(x)) == 1)].copy() # documents with only "O" labels

df_with_labels['Documents'] = 'With Labels'
df_non_labels['Documents'] = 'Without Labels'
df['Documents'] = "All Documents"

# Calculate text length
df['len_text'] = df['text'].apply(len)
df_with_labels['len_text'] = df_with_labels['text'].apply(len)
df_non_labels['len_text'] = df_non_labels['text'].apply(len)

# Combine the dataframes
combined_df = pd.concat([df_with_labels, df_non_labels, df])

# Plotting
fig = px.histogram(combined_df, x='len_text', color='Documents', labels={'len_text': 'Length of text'},
                   nbins=500, title='Histogram of Length of Text per Document')

# Show the plot
fig.show()

- The distribution of text length is roughly bell-shaped with a heavy right tail for all three groups.

- Next, we look at the number of tokens per document.

In [None]:
combined_df['Length of tokens'] = combined_df['tokens'].apply(len)
# Plotting
fig = px.histogram(combined_df, x='Length of tokens', color='Documents',
                   nbins=500, title='Histogram of Length of Tokens per Document')
fig.show()

- All three distributions are quite similar.

### 4.1.3 Label Positions

In the following, we calculate the normalized position of each label within its document, i.e. the token's 1-based index divided by the document's token count.

In [None]:
df["Labels Pos"] = df["labels"].apply(lambda labels: np.arange(1, len(labels) + 1) / len(labels))
exp_df = df.explode(["tokens", "labels", "Labels Pos"])
exp_df["labels"] = pd.Categorical(exp_df["labels"], categories=labels, ordered=True)
exp_df = exp_df.sort_values(by="labels", ascending=False)
label_tokens = exp_df.groupby("labels", observed=False).agg(list)
label_tokens["counts"] = label_tokens["tokens"].apply(len)

In [None]:
fig = px.scatter(exp_df, x='Labels Pos', y='labels', title='Scatter Plot of Labels in Documents',)
fig.show()
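
As a small worked example of the normalization, a token at 1-based index `i` in a document of `n` tokens gets the position `i / n`, so the last token always lands at 1.0 regardless of document length. The sketch below uses plain Python and a made-up label sequence; `normalized_positions` is an illustrative name, not a function from the project.

```python
from typing import List


def normalized_positions(labels: List[str]) -> List[float]:
    """Return i / n for each 1-based token index i in a document of n tokens."""
    n = len(labels)
    return [(i + 1) / n for i in range(n)]


# Toy document (assumed labels): the name appears at the very start
toy_labels = ["B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O"]
positions = normalized_positions(toy_labels)

# Keep only the PII tokens, as one would when reading the scatter plot
pii_positions = [(lab, pos) for lab, pos in zip(toy_labels, positions) if lab != "O"]
print(positions)
print(pii_positions)
```

Normalizing by document length makes short and long documents comparable, which is what allows all labels to share one x-axis in the scatter plot.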

## 4.2 Named Entity Recognition (NER) using spaCy

In the following, we visualize texts together with their corresponding labels using the spaCy library.

- First, we present the text that contains the highest number of unique labels, which in our dataset is eight.

In [None]:
visualize_ner(df.sort_values(by=["num_unique_labels"], ascending=False).reset_index(drop=True).iloc[0:1])

In [None]:
df.sort_values(by=["num_unique_labels"], ascending=False)['unique_labels'].iloc[0]

- The illustration shows that the text is labeled accurately. It also highlights the labels that mark the beginning of a target entity and its continuation.

- In the example below, we showcase a text featuring multiple labels. Notably, an email address follows the PHONE_NUM entity but is not labeled.

In [None]:
visualize_ner(df[df['document'] == 9854])

- The interesting aspect is determining whether the unlabeled email is a personal email or an oversight in labeling. Given the semi-automated nature of the labeling process, we anticipate encountering several such errors in the data.

- Lastly, we show one of the texts with the most labels.

In [None]:
visualize_ner(df.sort_values(by=["num_labels"], ascending=False).reset_index(drop=True).iloc[1:2])

## 4.3 WordClouds using tf–idf

- First, we investigate the word clouds using all documents.

In [None]:
documents = preprocess_texts(df)

In [None]:
tokenize_documents = [doc.lower().split() for doc in documents]
tfidf_documents = calc_td_idf(tokenize_documents)

In [None]:
plot_wordcloud("WordCloud of all documents", tfidf_documents[0])
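
The helper `calc_td_idf` comes from `Functions/tdidf_wordclouds.py` and its exact weighting is not shown here. For readers unfamiliar with tf–idf, the sketch below computes one common textbook variant (relative term frequency times the natural log of inverse document frequency) with the standard library only. The function `tf_idf` and the toy documents are assumptions for illustration; the project's helper and library implementations may use smoothed or otherwise different formulas.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(tokenized_docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: score} dict per document using tf * ln(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


# Toy corpus (assumed tokens)
toy_docs = [["design", "thinking", "design"], ["design", "tool"]]
toy_scores = tf_idf(toy_docs)
for doc_scores in toy_scores:
    print(doc_scores)
```

A term that appears in every document, such as "design" here, receives a score of zero, which is why word clouds weighted this way emphasize words that distinguish a document rather than words that are merely frequent.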

- Next, we build a word cloud for the target Student Name.

In [None]:
extracted_tokens_per_document = []

for _, row in df.iterrows():
    document_tokens = [token for token, label in zip(row['tokens'], row['labels']) if label in ['B-NAME_STUDENT', 'I-NAME_STUDENT']]
    extracted_tokens_per_document.append(document_tokens)

In [None]:
tokenize_names = [[word.lower() for word in sublist] for sublist in extracted_tokens_per_document]
tfidf_names = calc_td_idf(tokenize_names)
tfidf_names = {k: v for d in tfidf_names for k, v in d.items()}

In [None]:
plot_wordcloud("WordCloud of student names", tfidf_names)

In [None]:
def extract_tokens_before_labels(data_df, target_label, num_tokens=2):
    all_tokens_before_label = []
    
    for _, row in data_df.iterrows():
        tokens, labels = row['tokens'], row['labels']
        tokens_before_label = [
            tokens[i-num_tokens:i] 
            for i, label in enumerate(labels) 
            if label == target_label and i-num_tokens >= 0
        ]
        all_tokens_before_label.extend(tokens_before_label)
    
    return all_tokens_before_label

# Extracting the two tokens before 'B-NAME_STUDENT'
tokens_before_B_NAME_STUDENT = extract_tokens_before_labels(df, 'B-NAME_STUDENT')



In [None]:
tokens_before_B_NAME_STUDENT = [[word.lower() for word in sublist] for sublist in tokens_before_B_NAME_STUDENT]
tfidf_before_names = calc_td_idf(tokens_before_B_NAME_STUDENT)
tfidf_before_names = {k: v for d in tfidf_before_names for k, v in d.items()}

In [None]:
plot_wordcloud("WordCloud of tokens preceding student names", tfidf_before_names)

In [None]:
# Strip punctuation from the (lower-cased) tokens that precede 'B-NAME_STUDENT'
cleaned_tokens_per_document = [
    [re.sub(r'[^\w\s]', '', token) for token in sublist]
    for sublist in tokens_before_B_NAME_STUDENT
]

# Remove tokens that are empty (or whitespace only) after cleaning
cleaned_tokens_per_document = [
    [token for token in sublist if token.strip()]
    for sublist in cleaned_tokens_per_document
]

# Flatten into a single list of tokens
all_tokens = [token for sublist in cleaned_tokens_per_document for token in sublist]

# Count the frequencies of each word
word_counts = Counter(all_tokens)

# Get the most common words and their counts
num_most_common_words = 10  # Change this to plot more or fewer words
most_common_words = word_counts.most_common(num_most_common_words)
words, frequencies = zip(*most_common_words)  # This unpacks the list of tuples into two tuples

# Plotting
plt.figure(figsize=(10, 8))  # Adjust the figure size as needed
plt.bar(words, frequencies, color='skyblue')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)  # Rotate the x-axis labels to make them readable
plt.title('Top {} Most Common Words Preceding Student Names'.format(num_most_common_words))
plt.show()
