# Text Generation using LLM
In order to increase the diversity of input texts, this notebook generates a dataset of texts using AI model. The output `pii_dataset.csv` contains a list of generated texts along with tokens, labels and trailing white space. Thanks @SILVESTRE BAHI for fixing the bug for labelling the enitities [Make labels for AI-generated essays | Bug fix](https://www.kaggle.com/code/mandrilator/make-labels-for-ai-generated-essays-bug-fix)

I've created two notebooks showcasing different options for AI-generated essays:
- [Create AI-generated essays | Gemma](https://www.kaggle.com/minhsienweng/create-ai-generated-essays-gemma) uses Google the latest Opensource LLM `Gemma-7b-it` model for text generation
- [Create AI-generated essays | Mistral-7b [Unsloth]](https://www.kaggle.com/code/minhsienweng/create-ai-generated-essays-mistral-7b-unsloth) uses `unsloth` library to run Mistral-7b models for text generation. 

Both notebooks offer comparable efficiency, generating text at speeds of 20-30 seconds per text. For this particular notebook, I've employed text generated by the Gemma model.

**References**
- @VALENTIN WERNER [Generation Label-Specific Essays (by 7b Models)](https://www.kaggle.com/code/valentinwerner/generation-label-specific-essays-by-7b-models)
- @PJMATHEMATICIAN [PII External Data Creation & Loading](https://www.kaggle.com/code/pjmathematician/pii-external-data-creation-loading)
- @VALENTIN WERNER [fix punctuation tokenization external dataset](https://www.kaggle.com/code/valentinwerner/fix-punctuation-tokenization-external-dataset/notebook)
- @IMPERFECTKITTO [Mistral essay generation](https://www.kaggle.com/code/defdet/mistral-essay-generation)


# Install packages

[Faker](https://faker.readthedocs.io/en/master/index.html) package creates fake entity
[bitsandbytes] and [accelerate] package to run the model on GPUs

In [1]:
!pip install faker --quiet

In [2]:
import torch
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Device: {DEVICE}")
print(f"CUDA Version: {torch.version.cuda}")
print(f"Pytorch {torch.__version__}")

Device: cuda
CUDA Version: 12.1
Pytorch 2.1.2


In [3]:
import sys, random, string, re, time, os
import pandas as pd
import numpy as np
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, pipeline
from tqdm.auto import tqdm
from faker import Faker  #generates fake data 
from spacy.lang.en import English

2024-03-26 04:13:38.694599: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-26 04:13:38.694708: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-26 04:13:38.829306: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [4]:
TOTAL = 500
GENERATE = False

In [5]:
import torch, random
# Ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

SEED = 111
# Seed the same seed to all 
def seed_everything(seed=42):
    Faker.seed(0)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

seed_everything(SEED)

In [6]:
import ctypes, gc, torch
libc = ctypes.CDLL("libc.so.6")
def clear_memory():
    libc.malloc_trim(0)
    torch.cuda.empty_cache()
    gc.collect()

# Generate Fake Student Information

Generate student information (first name, last name, address, email, etc.) using the Faker and random libraries. Save the generated data to a CSV file for later loading.

In [7]:
# Random generate 12 random number
def get_userid(length=16):
    """Generate userid - """
    userid = ""
    if random.choice([True, False, False, False]):
        userid += random.choice(string.ascii_lowercase) + random.choice(string.ascii_lowercase)
        userid += '.:'
    userid += str(int(np.random.rand()*1_000_000_000_000))
    # Add the extra rand chars
#     for i in range(length):
#         # Select random char or digital number (0-9)
#         userid = userid + random.choice(string.ascii_letters + str(random.randint(0, 9))) 
    return userid

# Generate the personal url from social media 
def generate_fake_social_media_url(user_name):
    social_media_platforms = {
        'LinkedIn': 'linkedin.com/in/',
        'YouTube': 'youtube.com/c/',
        'Instagram': 'instagram.com/',
        'GitHub': 'github.com/',
        'Facebook': 'facebook.com/',
        'Twitter': 'twitter.com/'
    }
    platform, domain = random.choice(list(social_media_platforms.items()))
    fake_url = f'https://{domain}{user_name}'
    return fake_url

def generate_username(first_name, last_name, fake_user_name):
    """usernames are created from first_name and last_name"""
    SEPS = ["_", ".", ""]    
    if random.choice([False, True]):
        username = f"{first_name.lower()}{random.choice(SEPS)}{last_name.lower()}{random.randint(1,999)}"
    else:
        username = fake_user_name
    return username

def generate_email(first_name, last_name, faker):
    """usernames are created from first_name and last_name"""
    SEPS = ["_", ".", ""]
    if random.choice([False, True]):
        email = f"{first_name.lower()}{random.choice(SEPS)}{last_name.lower()}@{faker.domain_name()}"
    else:
        email = faker.email()
    return email

def generate_student_info():
    """Generates all the user info (name, eamil addresses, phone number, etc) together """
    # Select the student country to generate the user info based on the country
    COUNTRIES = ["en_US", "en_US", "en_US", "en_US", "en_US", "en_US",
                 "de_DE", "en_UK", "de_DE", "en_AU", "en_IN", "en_IN"]
    country = random.choice(COUNTRIES)
    faker = Faker(country)
    first_name = faker.first_name()
    last_name = faker.last_name()
    user_name = generate_username(first_name, last_name, faker.user_name())
    fake_url = generate_fake_social_media_url(user_name)
    student = {}
    student['COUNTRY'] = country
    student['ID_NUM'] = get_userid(12) # User ID
    student['NAME_STUDENT'] = first_name + " "+  last_name 
    student['EMAIL'] = generate_email(first_name, last_name, faker)
    student['USERNAME'] = user_name
    student['PHONE_NUM'] = faker.phone_number().replace(" ", "")
    student['URL_PERSONAL'] = fake_url
    student['STREET_ADDRESS'] = str(faker.address()).replace("\n"," ") # Replace \n with space in the address
    del faker
    clear_memory()
#     print(student)
    return student   

In [8]:
# import json

# f = json.load(open("/kaggle/input/pii-detection-removal-from-educational-data/train.json", 'r'))

In [9]:
# f[0]['full_text']

In [10]:
# # ['B-EMAIL', 'B-ID_NUM', 'B-NAME_STUDENT', 'B-PHONE_NUM', 'B-STREET_ADDRESS', 'B-URL_PERSONAL', 'B-USERNAME', 'I-ID_NUM', 'I-NAME_STUDENT', 'I-PHONE_NUM', 'I-STREET_ADDRESS', 'I-URL_PERSONAL', 'O']
# labels = [x['labels'] for x in f]
# tokens = [x['tokens'] for x in f]
# for tk, l in zip(tokens, labels):
#     for tkk, ll in zip(tk, l):
#         if ll in ['B-ID_NUM', 'I-ID_NUM']:
#             print(ll, ":", tkk)

In [11]:
from pathlib import Path
# Copy the generated df to working folder
import shutil

# Create the folder
Path("/kaggle/working/temp").mkdir(parents=True, exist_ok=True)
if GENERATE:
    students = []
    for i in tqdm(range(TOTAL)):
        students.append(generate_student_info())
        print(f"Generate {i}-th information")
    df = pd.DataFrame(students)
    df = df.reset_index(drop=True)
    # Save to the csv file
    df.to_csv("/kaggle/working/temp/df.csv", index=False, encoding='UTF-8') # Do not save default ID column
    display(df.tail(10))
    # Check if ID_NUM has any duplicates
    assert df['ID_NUM'].duplicated().value_counts()[False] == TOTAL, "Duplicated ID_NUM"
else:
    shutil.copy('/kaggle/input/ai-generated-text-dataset/temp/df.csv', '/kaggle/working/temp/df.csv')

  0%|          | 0/3 [00:00<?, ?it/s]

Generate 0-th information
Generate 1-th information
Generate 2-th information


Unnamed: 0,COUNTRY,ID_NUM,NAME_STUDENT,EMAIL,USERNAME,PHONE_NUM,URL_PERSONAL,STREET_ADDRESS
0,en_IN,374540118847,Veer Sibal,yjain@example.com,qbahri,05938242194,https://linkedin.com/in/qbahri,H.No. 157 Bedi Circle Raebareli-938778
1,en_US,vx.:950714306409,Jorge Trujillo,salazarmaria@example.com,davismary,951.339.3328,https://twitter.com/davismary,"714 Mann Plaza Suite 839 Seanfurt, MD 75952"
2,en_AU,731993941811,John White,thomas12@example.org,john_white31,1868-4833,https://linkedin.com/in/john_white31,"7/5 Mark Circus Matthewsmouth, QLD, 2690"


# Generate texts using LLM

## Load Models
Using LLM `Mistral-7b-instruct-v0.1-hf` to generate the essays with given PII information (like name, email, etc). 


In [12]:
def load_model():
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    torch.backends.cuda.enable_flash_sdp(False)
    model_path="/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1"
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    # causal language model, where the prediction of next text is based on prevous text
    # this is used for generation work
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 torch_dtype=torch.bfloat16,
                                                 device_map="auto")
    return model, tokenizer

## Generate loop
The loop generate the texts based on each student's PII information, including name + email + street address + phone + username + user id + personal url.

All the generated texts are stored to `generate_df.csv`



In [23]:
def generate_on_user_info(row):
    # Prompt template generates the texts
    prompt_template = """<s>[INST]
    You are an essay writer.
    Write an esssay to describe how you applied a specific design thinking to address a challenge or problem.
    And you will be given some personal information like
    name = {first_name} {last_name}
    email = {email}
    street address = {address}
    phone number = {phone_num}
    personal url = {url}
    username = {username}
    user id = {userid}
    And you must include ALL of the personal information above somewhere in the essay. 
    Do not miss out any.[/INST]"""
    
    first_name = row['NAME_STUDENT'].split()[0]
    last_name = row['NAME_STUDENT'].split()[1]
    email = row['EMAIL']
    phone_num = row['PHONE_NUM']
    address = row['STREET_ADDRESS']
    url = row['URL_PERSONAL']
    username = row['USERNAME']
    userid = row['ID_NUM']
    # Fill in prompt with PII
    prompt = prompt_template.format(first_name=first_name,
                                    last_name=last_name,
                                    email=email,
                                    phone_num=phone_num,
                                    address=address,
                                    url=url,
                                    username=username,
                                    userid=userid
                                   )
    return prompt

def generate_texts(model, tokenizer, df, num_essays):
    generated_df = df[:num_essays]
    # Generate the texts
    for i in tqdm(range(len(generated_df))):
        start = time.time()
        # Generate PII
        row = generated_df.loc[i]
        # Generate the texts using three prompts
        prompt = generate_on_user_info(row)
        # Tokenize the prompt
        inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    
        # Generate the outputs from prompt
        generate_ids = model.generate(**inputs, 
                                      max_new_tokens=768,
                                      do_sample=True,
                                      temperature=0.9,
                                      top_p=0.95,
                                      top_k=40,
                                      repetition_penalty=1.1,
                                      pad_token_id=tokenizer.eos_token_id
                                     )
        # Decode the generated output
        generated_text = tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                                 clean_up_tokenization_spaces=False)[0]
        generated_text = generated_text.split('[/INST] ')[1]
#         print(f"generated_text = {generated_text}" )
        generated_df.loc[i, 'generated_text'] = generated_text
        clear_memory()
        print(f"Complete the text for {i}-th student {time.time() - start: .1f} seconds")
    # Save generated_df to csv
    generated_df.to_csv("temp/generated_df.csv", index=False, encoding="UTF-8")
    return generated_df

In [24]:
if GENERATE:
    model, tokenizer = load_model()
    generated_df = generate_texts(model, tokenizer, df, num_essays=TOTAL)    
    sys.exit(0)

  0%|          | 0/2 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  generated_df.loc[i, 'generated_text'] = generated_text


Complete the text for 0-th student  43.5 seconds
Complete the text for 1-th student  30.8 seconds


SystemExit: 0

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


# Turn Generated Texts to competition format


In [None]:
df2 = pd.read_csv('/kaggle/input/text-generation-gemma/generated_df_100_900.csv', encoding='UTF-8')

In [None]:
df2['generated_text'][4]

In [4]:
# Loaggad the generated df from Gemma
# df = pd.read_csv('/kaggle/input/text-generation-gemma/generated_df_0_100.csv', encoding='UTF-8')
# df2 = pd.read_csv('/kaggle/input/text-generation-gemma/generated_df_100_900.csv', encoding='UTF-8')
# df3 = pd.read_csv('/kaggle/input/text-generation-gemma/generated_df_900_1700.csv', encoding='UTF-8')

# generated_df = pd.concat([df, df2, df3])

df1 = pd.read_csv('/kaggle/input/pii-mistral7b-generated-batch1/temp/generated_df.csv', encoding='UTF-8')
df2 = pd.read_csv('/kaggle/input/pii-gen-batch2/generated_df2.csv', encoding='UTF-8')
generated_df = pd.concat([df1, df2])
generated_df = generated_df.reset_index(drop=True, inplace=False)
generated_df = generated_df.rename({'generated_text': 'Essay'}, axis=1)

# Load the generated df from Gemini
# df4 = pd.read_csv('/kaggle/input/pii-detection-gemini-created-dataset/pii_gemini.csv',
#                   encoding='UTF-8', index_col=[0]) # Load df without unnamed columns
# df4['NAME_STUDENT'] = ['' for _ in range(len(df4))] # The dataset lacks of student name
# df4 = df4.reset_index(drop=True, inplace=False)
# # Combine both datasets
# generated_df = pd.concat([generated_df, df4])
# # generated_df = df4
# generated_df = generated_df.reset_index(drop=True, inplace=False)
display(generated_df.tail(1))

Unnamed: 0,COUNTRY,ID_NUM,NAME_STUDENT,EMAIL,USERNAME,PHONE_NUM,URL_PERSONAL,STREET_ADDRESS,Essay
999,de_DE,so.:744615462156,Klaas Ruppersberger,obriemer@example.net,helga99,131198056,https://twitter.com/helga99,Bauerweg 9/5 51644 Rochlitz,"As an essay writer, I have used various design..."


In [7]:
label_types = ['NAME_STUDENT','EMAIL', 'USERNAME', 'ID_NUM',
               'PHONE_NUM', 'URL_PERSONAL', 'STREET_ADDRESS']

# Label assignment functions

In [8]:
from spacy.lang.en import English
import re

en_tokenizer = English().tokenizer
    
def tokenize_with_spacy(text, tokenizer=en_tokenizer):
    tokenized_text = tokenizer(text)
    tokens = [token.text for token in tokenized_text]
    trailing_whitespace = [bool(token.whitespace_) for token in tokenized_text]
    return tokens, trailing_whitespace

# Update labels and boolean flags
def update_labels(i, token, label_type, labels, isFirst_flags):
#     print(f"Found {i}-th position token: {token}")
    # Update the label
    if isFirst_flags[label_type]:
        labels[i] = 'B-'+label_type # Beginning of an entity
        isFirst_flags[label_type] = False
    else:
        labels[i] = 'I-'+label_type # Contiunity of an entity
    return labels, isFirst_flags

# Go through each token and assign name label ('NAME_STUDENT') if matched
def assign_name(names, tokens, labels):
    #print(f"Search 'NAME_STUDENT': {names}")
    for i, token in enumerate(tokens):
        token = str(token).lower()
        # Order does not matter
        if token in names:
            # If the previous 
            if i > 0 and labels[i-1] == 'B-NAME_STUDENT':
                labels[i] = 'I-NAME_STUDENT'
            else:
                labels[i] = 'B-NAME_STUDENT'        
    return labels

def assign_phone_number(phones, tokens, labels, isFirst_flags):
#     print(f"Search 'PHONE_NUM': {phone_tokens}")
    for i, token in enumerate(tokens):
        token_ = str(token).lower()
        label = labels[i]
        # Add a special case '-'
        if token == '-':
            # Check if the previous token is not 'O'
            if labels[i-1] != 'O':
                # Update the labels
                labels, isFirst_flags = update_labels(i, token_, 'PHONE_NUM',
                                                      labels, isFirst_flags)
        else:
            # Check if token is matched last name
            if label == 'O' and token_ in phones:
                labels, isFirst_flags = update_labels(i, token_, 'PHONE_NUM',
                                                      labels, isFirst_flags)
    return labels, isFirst_flags

# Go through each token and assign street address label
def assign_street_address(address, tokens, labels):
    # print(tokens)
    # Keep track of index for a long label
    label_index = 0
    reserve_indices = []
    sandwich_max_size = 3
    #print(f'address = {address}')
    for i, token in enumerate(tokens):
        try:
            #Order matters and sandwiches are possible
            token = str(token).lower()
            curr_idx = label_index
            curr_token = address[curr_idx] 
            if token == curr_token:
                # print(f"token = {token}")               
                #case where a token corresponds to the expected next token
                if len(reserve_indices) > sandwich_max_size:
                    reserve_indices = []

                prefix = 'B-' if curr_idx == 0 else 'I-'
                labels[i] = prefix + 'STREET_ADDRESS'
                #fill sandwiches if the next token has been found
                for k in reserve_indices:
                    labels[k] = 'I-STREET_ADDRESS'
                reserve_indices = []
                #Update positional pointer
                label_index += 1
                # At the end of address
                if label_index == len(address):
                    label_index = 0
                # print(f'label_index = {label_index}')
            else:
                reserve_indices.append(i)
                #print(f"! {token}")
        except Exception as e:
            print(f"Error occurs at {i}-th {token} \n{e}" )
            sys.exit(-1)
#         #case where some surprise token has been added in the PII
#     address_labels = [(label, token) for i, (label, token) in enumerate(zip(labels, tokens))
#                       if label =='B-STREET_ADDRESS' or label == 'I-STREET_ADDRESS']
#     print(address_labels)        

    return labels

# Assign labels for other types
def assign_other_label_types(essay, label_type, tokens, labels, isFirst_flags):
    label_value = essay[label_type].lower()
#     print(f"Search '{label_type}': {label_value}")    
    for i, token in enumerate(tokens):
        token_ = str(token).lower()
        # Check if token is first or last name
        if label_value in token_:
            # Update the label
            labels, isFirst_flags = update_labels(i, token_, label_type,
                                                      labels, isFirst_flags)
    return labels, isFirst_flags

# Assign labels for other types
def assign_email(email, tokens, labels):
    #print(f"Search Email: {email}")
    is_First = False
    for i, token in enumerate(tokens):
        token_ = str(token).lower()
        # Check if token is first or last name
        if email in token_:
            # print(f'Token {token_}')
            if not is_First: 
                # Update the label
                labels[i] = 'B-EMAIL'
                is_First = True
            else: # Skip labeling
                print(f"Duplicated Email {email}")
    return labels

# Assign labels for other types
def assign_username(username, tokens, labels):
    #print(f"Search username: {username}")
    is_First = False
    for i, token in enumerate(tokens):
        token_ = str(token).lower()
        # Check if token is first or last name
        if username in token_:
            # print(f'Token {token_}')
            if not is_First: 
                # Update the label
                labels[i] = 'B-USERNAME'
                is_First = True
            else: # Skip labeling
                print(f"Duplicated username {username}")
    return labels

# Assign labels

In [9]:
# “B-”: the beginning of an entity. 
# “I-”: the next of an entity
def assign_labels(essay, tokens):
    # print(f"essay['NAME_STUDENT'] = {essay['NAME_STUDENT']}" )
    # Assign "O" to labels by default
    labels = ['O' for token in tokens] 
    # Create a boolean flag list to track if a label type start the text.
    isFirst_flags = {label_type: True for label_type in label_types}
    # Go through each token and check if the label appear in the token
    # All token and label values are lower case for comparison
    for label_type in label_types:
        if label_type == 'NAME_STUDENT':
            if essay['NAME_STUDENT'] != '':
                names, _ = tokenize_with_spacy(essay['NAME_STUDENT'].lower())
                labels = assign_name(names, tokens, labels)
        elif label_type == 'STREET_ADDRESS':
            address = essay['STREET_ADDRESS'].lower().replace("\\n", " ")
            # print(f"address {address}")
            address = address.translate(str.maketrans('', '', string.punctuation)) # Remove punctuations
            address = address.split(' ')
            if len(address) > 0:                
                labels = assign_street_address(address, tokens, labels)
        elif label_type == 'PHONE_NUM':
            phones, _ = tokenize_with_spacy(essay['PHONE_NUM'].lower())
            labels, isFirst_flags = assign_phone_number(phones, tokens, labels, isFirst_flags) 
        elif label_type == 'EMAIL':
            email = essay['EMAIL'].lower()
            assign_email(email, tokens, labels)
        elif label_type == 'USERNAME':
            username = essay['USERNAME'].lower()
            assign_username(username, tokens, labels)
        else:
            labels, isFirst_flags = assign_other_label_types(essay, label_type, tokens,
                                                             labels, isFirst_flags)
    return labels 

In [10]:
def process_full_text(full_text):    
    if 'Unprocessed:' in full_text: # Need to split the instruction and response        
        pattern = r'([*]*Please write an essay about [^*]*[*]*)'
        x = re.search(pattern, full_text)
        if x:
            splits = re.split(pattern, full_text)
            text = splits[-1].strip()
            # print(f"=== Split text ===\n {text}")
            return text
        else:
            print(f"### UnProcess text###\n {full_text}")            
            return None
    else:
        return full_text.strip() # Remove the space to the left

# Map the label to token
def create_token_map(tokens, labels):
    token_map = []
    for i, label in enumerate(labels):
        if label != 'O':
            token_map.append({label: (tokens[i], i)})
    return token_map

In [13]:
# Assign the labels to the tokens           
results = []
doc_id = 1_221_555_000 # document id 
for i in range(len(generated_df)):
    row = generated_df.iloc[i]
    full_text = process_full_text(row["Essay"])
    if full_text:
        # Tokenize the text using spacy tokenizer
        tokens, trailing_whitespace = tokenize_with_spacy(full_text)
        labels = assign_labels(row, tokens)
        token_map = create_token_map(tokens, labels)
        doc_id += 1
        result = {
            'document': doc_id, 
            'full_text': full_text,
            'tokens': tokens, 
            'trailing_whitespace': trailing_whitespace,
            'labels': labels,
            'token_map': token_map
        }
        # Add PII to result
        for label_type in label_types:
            result[label_type] = row[label_type]
        # print(result)
        results.append(result)
# # Save to temp fold for verification
df = pd.DataFrame(results)
df.to_csv("pii_dataset_Gemma.csv", index=False, encoding="UTF-8")
display(df)

Duplicated username jenniferhughes
Duplicated username okay
Duplicated Email neureutherkatherina@example.org
Duplicated Email neureutherkatherina@example.org
Duplicated username riaan75
Duplicated username awali
Duplicated username dhanush38
Duplicated username sebastianhamann
Duplicated Email magdalene39@example.net
Duplicated username ronny.jäkel71
Duplicated username kkroker
Duplicated username transusan
Duplicated username ustadelmann
Duplicated username graeme10
Duplicated username daliarenee
Duplicated username daliarenee
Duplicated username qbell
Duplicated username colonjessica
Duplicated username blackdillon
Duplicated Email michael.shepherd@coleman.com
Duplicated username johnsonjennifer
Duplicated username trubconrad
Duplicated username carolyn44
Duplicated username webereric
Duplicated username nicholas.palmer383
Duplicated username luitgard75
Duplicated username louise85
Duplicated username mollymclean788
Duplicated username jayan.koshy655
Duplicated username melanie_steve

Unnamed: 0,document,full_text,tokens,trailing_whitespace,labels,token_map,NAME_STUDENT,EMAIL,USERNAME,ID_NUM,PHONE_NUM,URL_PERSONAL,STREET_ADDRESS
0,1221555001,"As an aspiring product designer, I have always...","[As, an, aspiring, product, designer, ,, I, ha...","[True, True, True, True, False, True, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",[{'B-URL_PERSONAL': ('https://github.com/megan...,Megan Chang,ysullivan@example.com,megan_chang788,696469185597,459.638.2421x9489,https://github.com/megan_chang788,"578 Michael Island New Thomas, VI 68835"
1,1221555002,"As a design thinking practitioner, I have appl...","[As, a, design, thinking, practitioner, ,, I, ...","[True, True, True, True, False, True, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",[{'B-EMAIL': ('salazarmaria@example.com](mailt...,Jorge Trujillo,salazarmaria@example.com,jorgetrujillo576,286139334950,951.339.3328,https://instagram.com/jorgetrujillo576,"714 Mann Plaza Suite 839 Seanfurt, MD 75952"
2,1221555003,"As an experienced designer and writer, I have ...","[As, an, experienced, designer, and, writer, ,...","[True, True, True, True, True, False, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",[],John White,thomas12@example.org,antoniozavala,226851453564,8186684833,https://instagram.com/antoniozavala,"59179 Bruce Gardens Apt. 413 Lauramouth, NE 08652"
3,1221555004,I'm Craig Thomas and I'm here to share my expe...,"[I, 'm, Craig, Thomas, and, I, 'm, here, to, s...","[False, True, True, True, True, False, True, T...","[O, O, B-NAME_STUDENT, I-NAME_STUDENT, O, O, O...","[{'B-NAME_STUDENT': ('Craig', 2)}, {'I-NAME_ST...",Craig Thomas,johnsoncynthia@example.net,brandonberry,tm.:551314769082,001-516-415-1090,https://github.com/brandonberry,"00869 Mary Cliff Apt. 145 Whitehaven, NH 05662"
4,1221555005,Design thinking is a powerful problem-solving ...,"[Design, thinking, is, a, powerful, problem, -...","[True, True, True, True, True, False, False, T...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",[],Candace Lyons,kmassey@example.org,candacelyons460,cv.:719468969785,001-230-922-5841x972,https://linkedin.com/in/candacelyons460,"564 Ann Bridge Suite 150 Dennisfort, MI 80472"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,1221555496,"As an assistant, I can help you with that. Ple...","[As, an, assistant, ,, I, can, help, you, with...","[True, True, False, True, True, True, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",[],Wolfgang Römer,roehrdanzhans-jochen@example.com,wolfgang_römer915,980597342061,+49(0)559130539,https://youtube.com/c/wolfgang_römer915,Swen-Junitz-Allee 3/1 18428 Lübz
496,1221555497,"As Sam Short, a designer based in Leeds, UK wi...","[As, Sam, Short, ,, a, designer, based, in, Le...","[True, True, False, True, True, True, True, Tr...","[O, B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O...","[{'B-NAME_STUDENT': ('Sam', 1)}, {'I-NAME_STUD...",Sam Short,fbolton@example.com,iphillips,882712984697,+44(0)1134960328,https://youtube.com/c/iphillips,Flat 68 Kathleen locks West Leanne SY5E 6GJ
497,1221555498,"As a design thinker, I believe that addressing...","[As, a, design, thinker, ,, I, believe, that, ...","[True, True, True, False, True, True, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",[],Nicole Adams,nelsontimothy@example.net,robertsautumn,919472466327,299-266-2039x634,https://facebook.com/robertsautumn,"94234 Bryant Isle Suite 642 Brianland, DE 47464"
498,1221555499,"As an avid YouTuber, I often find myself faced...","[As, an, avid, YouTuber, ,, I, often, find, my...","[True, True, True, False, True, True, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",[],Ursel Trupp,ursel.trupp@mielcarek.com,ursel_trupp264,415503550837,+49(0)9935182417,https://youtube.com/c/ursel_trupp264,Jo-Hande-Straße 83 77213 Schlüchtern


In [14]:
# A function displays PII (like phone number) to help double-check the generated text
def verify_df(df):
    tmp = df
    tmp = tmp.reset_index(drop=True)
    for i in range(len(tmp)):
        row = tmp.iloc[i]
        full_text = row['full_text']
        tokens = row['tokens']
        token_map = row['token_map']
        address = row['STREET_ADDRESS']
        # Display full text and all
        print(f"=== Doc {i} ===\n"
              f"full_text = {full_text}\n"
              f"address = {address}")
        for t_dic in token_map:
            if 'B-STREET_ADDRESS' in t_dic.keys() or 'I-STREET_ADDRESS' in t_dic.keys():
                print(t_dic)
tmp = df[0:10]
verify_df(tmp)

=== Doc 0 ===
full_text = As an aspiring product designer, I have always been fascinated by the design thinking process and its potential for solving complex problems. One challenge that I faced during my undergraduate studies was designing a mobile app that would help students with their time management skills. The problem statement was clear: many students struggle to balance their academic and extracurricular activities while also maintaining a social life.

To address this challenge, I used the design thinking tool of empathy mapping. Empathy mapping is a technique that allows designers to visualize their target users' thoughts, feelings, and behaviors. It involves four main stages: defining, ideating, prototyping, and testing.

During the defining stage, I interviewed several students who had struggled with time management. Through our conversations, I learned that they often found themselves overwhelmed with tasks and activities, leading to procrastination and reduced productivit

# Save to CSV

In [15]:
pii_df = df[['document', 'full_text', 'tokens', 'trailing_whitespace', 'labels']]
pii_df.to_json("pii_dataset_Gemma.json")
display(pii_df)

Unnamed: 0,document,full_text,tokens,trailing_whitespace,labels
0,1221555001,"As an aspiring product designer, I have always...","[As, an, aspiring, product, designer, ,, I, ha...","[True, True, True, True, False, True, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,1221555002,"As a design thinking practitioner, I have appl...","[As, a, design, thinking, practitioner, ,, I, ...","[True, True, True, True, False, True, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,1221555003,"As an experienced designer and writer, I have ...","[As, an, experienced, designer, and, writer, ,...","[True, True, True, True, True, False, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,1221555004,I'm Craig Thomas and I'm here to share my expe...,"[I, 'm, Craig, Thomas, and, I, 'm, here, to, s...","[False, True, True, True, True, False, True, T...","[O, O, B-NAME_STUDENT, I-NAME_STUDENT, O, O, O..."
4,1221555005,Design thinking is a powerful problem-solving ...,"[Design, thinking, is, a, powerful, problem, -...","[True, True, True, True, True, False, False, T...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
...,...,...,...,...,...
495,1221555496,"As an assistant, I can help you with that. Ple...","[As, an, assistant, ,, I, can, help, you, with...","[True, True, False, True, True, True, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
496,1221555497,"As Sam Short, a designer based in Leeds, UK wi...","[As, Sam, Short, ,, a, designer, based, in, Le...","[True, True, False, True, True, True, True, Tr...","[O, B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O..."
497,1221555498,"As a design thinker, I believe that addressing...","[As, a, design, thinker, ,, I, believe, that, ...","[True, True, True, False, True, True, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
498,1221555499,"As an avid YouTuber, I often find myself faced...","[As, an, avid, YouTuber, ,, I, often, find, my...","[True, True, True, False, True, True, True, Tr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [16]:
import functools
# Get all the unique labels 
all_labels = sorted(np.unique(functools.reduce(lambda a, b: list(np.unique(a+b)),
                                              df['labels'].tolist())))
# print(f"all_labels = {all_labels}")
# print(f"The number of all labels {len(all_labels)}")

label2id = {label:str(index) for index,label in enumerate(all_labels)}
id2label = {str(index):label for index,label in enumerate(all_labels)}
print(f"id2label = {id2label} the number of id2label = {len(id2label)}")

id2label = {'0': 'B-EMAIL', '1': 'B-ID_NUM', '2': 'B-NAME_STUDENT', '3': 'B-PHONE_NUM', '4': 'B-STREET_ADDRESS', '5': 'B-URL_PERSONAL', '6': 'B-USERNAME', '7': 'I-ID_NUM', '8': 'I-NAME_STUDENT', '9': 'I-PHONE_NUM', '10': 'I-STREET_ADDRESS', '11': 'I-URL_PERSONAL', '12': 'O'} the number of id2label = 13
