```
This notebook generates the fake biographies used in the GPT-2 experiments. Adjust the hyperparameters in the first cell, then run the entire notebook to upload everything to a Hugging Face directory of your choice.
```

## Imports, Hyperparams

In [35]:
import requests
import numpy as np
from tqdm.auto import tqdm
import pandas as pd
import re
import os
from collections import defaultdict
from datasets import Dataset
from datasets import load_dataset

#Set N to the number of biographies to generate (10k is used in the paper)
N = 10000

#Upsampling factor to upsample the high count split (10 is used in the paper)
UPSAMPLING_FACTOR = 10

# get HF_TOKEN from environment variable
HF_TOKEN = os.getenv('HF_TOKEN')

#PATH to save the biographies and qa pairs
BIOS_PATH = 'data/biographies.csv'

#path to save the train/test splits for the gpt-2 models
GPT_SPLITS_PATH = 'data/gpt_splits.csv'

#set seed for reproducibility
SEED = 42
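def set_seed(seed):
    #seed numpy's legacy global RNG, which every np.random.* call below relies on
    np.random.seed(seed)
    #note: default_rng returns a new Generator and does not touch the global RNG, so this call has no effect on the draws below
    np.random.default_rng(seed)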
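
The two seeding calls in `set_seed` behave differently: `np.random.seed` seeds the legacy global state, while `np.random.default_rng` returns an independent `Generator` that only matters if it is kept and used. A minimal sketch of the difference (all variable names here are illustrative and not part of the notebook):

```python
import numpy as np

# Legacy global state: re-seeding makes later np.random.* calls repeat.
np.random.seed(42)
a = np.random.randint(0, 100, size=5)
np.random.seed(42)
b = np.random.randint(0, 100, size=5)

# Generator API: the object returned by default_rng must be kept and used directly.
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)
c = rng1.integers(0, 100, size=5)
d = rng2.integers(0, 100, size=5)

print((a == b).all(), (c == d).all())
```

Since the cells below all call `np.random.*` functions, it is the `np.random.seed` call that makes them reproducible.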

# Generating Attributes 

The first step is to generate data for the biography. I've closely followed the instructions noted in the [[paper]](http://arxiv.org/abs/2309.14316). Specifically, the details are in Section A.1, which I have copied below. **I found open-source GitHub repos for all the required lists, and I'm working off these.**


1. **First, middle, and last names** are drawn from pools of 400, 400, and 1000 English names respectively. We apply rejection sampling to ensure all N individuals have unique full names. [[First names Link]](https://gist.githubusercontent.com/JTRNS/6faaf857580eed18aeab6ac9c97993c7/raw/bb5f7ade4f5454ac602c1cf0c00e73f97658243a/first-names.txt), [[Last Names Link]](https://gist.githubusercontent.com/craigh411/19a4479b289ae6c3f6edb95152214efc/raw/d25a1afd3de42f10abdea7740ed098d41de3c330/List%2520of%2520the%25201,000%2520Most%2520Common%2520Last%2520Names%2520(USA)) 
    

2. **Birth years** range from 1900 to 2099, months are selected from the 12 months, and days are chosen between 1 and 28.  
3. **Birth cities** are selected from 200 US cities, with their respective state abbreviations, such as Princeton, NJ and Cambridge, MA. [[Cities Link]](https://raw.githubusercontent.com/grammakov/USA-cities-and-states/refs/heads/master/us_cities_states_counties.csv)

4. **Universities** are drawn from a list of 300 US institutions. Some may have similar prefixes, like University of California, Berkeley/Irvine/Davis/etc. [[University List Link]](https://gist.githubusercontent.com/dotJoel/90c6acd65331c406d3cb/raw/3f3e5b495b8d6c48f28c43d241075149294f5714/all-colleges.txt)


5. **Majors** are selected from 100 common college disciplines, including Computer Science, Physics, and Music. [[Majors Link]](https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv)

6. **Employers** are chosen from a list of 263 companies, featuring names like Meta Platforms, Microsoft, and Google. [[Employers Link]](https://raw.githubusercontent.com/EatMoreOranges/Fortune-500-Dataset/refs/heads/main/data/2023-fortune-500-data.csv)

7. **Pronouns** are chosen randomly from [He, She, They].
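
The pool sizes quoted above (400, 400, and 1000 names) may not match the length of a downloaded list. The following is a minimal sketch, under stated assumptions, of how a longer list could be cut down to a fixed pool size before rejection sampling. The placeholder pools and the helper names `subsample` and `unique_full_names` are hypothetical and do not appear in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder pools; in practice these would be the downloaded name lists.
first_pool = [f"First{i}" for i in range(1000)]
last_pool = [f"Last{i}" for i in range(2000)]

def subsample(pool, k, rng):
    """Return k distinct entries drawn without replacement."""
    return list(rng.choice(pool, size=k, replace=False))

first_names = subsample(first_pool, 400, rng)
middle_names = subsample(first_pool, 400, rng)
last_names = subsample(last_pool, 1000, rng)

def unique_full_names(n, rng):
    """Rejection sampling: keep drawing until n distinct full names exist."""
    names = set()
    while len(names) < n:
        parts = [rng.choice(first_names), rng.choice(middle_names), rng.choice(last_names)]
        names.add(" ".join(parts))
    return sorted(names)

names = unique_full_names(100, rng)
print(len(names), names[:3])
```

Because duplicates are simply discarded by the set, the loop terminates as long as the number of possible combinations comfortably exceeds `n`.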


## Generating Names

In [None]:
#reset numpy seed
set_seed(SEED)

#Making unique names
#URLs for the first-name and last-name lists
first_names_url = 'https://gist.githubusercontent.com/JTRNS/6faaf857580eed18aeab6ac9c97993c7/raw/bb5f7ade4f5454ac602c1cf0c00e73f97658243a/first-names.txt'
last_names_url = 'https://gist.githubusercontent.com/craigh411/19a4479b289ae6c3f6edb95152214efc/raw/d25a1afd3de42f10abdea7740ed098d41de3c330/List%2520of%2520the%25201,000%2520Most%2520Common%2520Last%2520Names%2520(USA)'

response = requests.get(first_names_url)
#convert to list of strings
first_names = response.content.decode('utf-8').splitlines()

#middle names are drawn from the same list as first names
middle_names = first_names.copy()

response = requests.get(last_names_url)
last_names = response.content.decode('utf-8').split(",\n")

#make N unique full names via rejection sampling
names = set()
while len(names) < N:
    first_name = np.random.choice(first_names)
    middle_name = np.random.choice(middle_names)
    last_name = np.random.choice(last_names)

    #make sure no component repeats
    if first_name != middle_name and first_name != last_name and middle_name != last_name:
        names.add(f"{first_name} {middle_name} {last_name}")

names = sorted(names)
len(np.unique(names)), names[:10]


## Generating Birthdays

In [None]:
#reset seed
set_seed(SEED)

#randomly draw a birth year; np.random.randint excludes the upper bound, so 2100 covers 1900-2099
years = np.random.randint(1900, 2100, size=N)
#months 1-12 (the upper bound 13 is exclusive)
months = np.random.randint(1, 13, size=N)
month_verbage = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
months = [month_verbage[i-1] for i in months]
#days 1-28 (the upper bound 29 is exclusive)
days = np.random.randint(1, 29, size=N)

birthdays = [f"{months[i]} {days[i]}, {years[i]}" for i in range(N)]
birthdays
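
`np.random.randint(low, high)` draws from the half-open interval [low, high), so covering months 1 to 12, days 1 to 28, and years 1900 to 2099 requires upper bounds of 13, 29, and 2100. A small sketch to confirm the ranges (the sample size and variable names are arbitrary choices for illustration):

```python
import numpy as np

np.random.seed(0)

# The upper bound is exclusive in every call below.
check_years = np.random.randint(1900, 2100, size=10_000)
check_months = np.random.randint(1, 13, size=10_000)
check_days = np.random.randint(1, 29, size=10_000)

print(check_years.min(), check_years.max())
print(check_months.min(), check_months.max())
print(check_days.min(), check_days.max())
```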

## Generating Birth Cities

In [None]:
#set seed
set_seed(SEED)

cities_url = 'https://raw.githubusercontent.com/grammakov/USA-cities-and-states/refs/heads/master/us_cities_states_counties.csv'

#parse as a pipe-delimited csv; bad lines are skipped with a warning
cities = pd.read_csv(cities_url,delimiter='|', on_bad_lines='warn')
cities = cities[['City','State short']]
cities = cities.drop_duplicates()
#format into numpy array
citynames = (cities['City'].str.title() + ', ' + cities['State short'].str.upper()).to_numpy()

#shuffle, subselect 200 cities
np.random.shuffle(citynames)
citynames = citynames[:200]

#now randomly sample N cities
citynames = np.random.choice(citynames, size=N)
len(np.unique(citynames)), citynames

## Generating Universities

In [None]:
#set seed
set_seed(SEED)

universities_url = 'https://gist.githubusercontent.com/dotJoel/90c6acd65331c406d3cb/raw/3f3e5b495b8d6c48f28c43d241075149294f5714/all-colleges.txt'

response = requests.get(universities_url)
universities = response.content.decode('utf-8').splitlines()
#remove content in brackets
universities = [re.sub(r'\(.*?\)', '', uni).strip() for uni in universities]

#shuffle and subsample to 300
np.random.shuffle(universities)
universities = universities[:300]

#now randomly sample N universities
universities = np.random.choice(universities, size=N)
len(np.unique(universities)), universities

## Generating Majors

In [None]:
set_seed(SEED)

major_url = 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv'

major_df = pd.read_csv(major_url)
#make major title case
majors = major_df['Major'].str.title().to_numpy()

#shuffle and subselect 100
np.random.shuffle(majors)
majors = majors[:100]

#now randomly sample N majors
majors = np.random.choice(majors, size=N)
len(np.unique(majors)), majors




## Generating Employers and Company Cities

In [None]:
set_seed(SEED)


#url with fortune 500 companies
employer_url = 'https://raw.githubusercontent.com/EatMoreOranges/Fortune-500-Dataset/refs/heads/main/data/2023-fortune-500-data.csv'

employer_df = pd.read_csv(employer_url)

#we need a company_city attribute. The [city, state_expanded] info comes with the company list, but we convert it to [city, state_short] for consistency
#use the city/state data to build a {state_expanded: state_short} dictionary
cities_url = 'https://raw.githubusercontent.com/grammakov/USA-cities-and-states/refs/heads/master/us_cities_states_counties.csv'
#parse as a pipe-delimited csv; bad lines are skipped with a warning
cities = pd.read_csv(cities_url,delimiter='|', on_bad_lines='warn')
state_abbrev = cities[['State full','State short']].drop_duplicates()
#make this a dictionary; defaultdict(str) maps unknown state names to ''
state_abbrev = defaultdict(str,zip(state_abbrev['State full'], state_abbrev['State short']))
employer_df['State short'] = employer_df['State'].map(state_abbrev)
#drop rows with an empty state short; this happens when a state name is spelled differently from our dict (about 10 rows when I checked)
#.copy() avoids pandas' SettingWithCopyWarning on the assignment below
employer_df = employer_df[employer_df['State short'] != ''].copy()
#now making the company city attribute
employer_df['company city'] = employer_df.apply(lambda x:f'{x["City"]}, {x["State short"]}', axis=1)
employer_df = employer_df[['Company', 'company city']]


#subselect the top 263 as in the paper
employer_df = employer_df.head(263)
#subsample N employers
employer_df = employer_df.sample(N,replace=True)


companies = employer_df['Company'].to_numpy()
company_cities = employer_df['company city'].to_numpy()


'''The number of company cities is lower than the number of companies because 24 companies (16%) share HQs in New York City.
This is also noted by the paper authors; their percentage is 13.7%.'''

len(np.unique(companies)), len(np.unique(company_cities))
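
The state-name conversion above relies on a detail of `Series.map`: when the mapping is a dict subclass that defines `__missing__`, such as `defaultdict`, unmatched values receive the default instead of NaN. A minimal sketch with a hypothetical three-row example (the lookup table and state names are made up for illustration):

```python
from collections import defaultdict
import pandas as pd

# Hypothetical miniature lookup; the notebook builds the real one from the cities CSV.
lookup = defaultdict(str, {"New York": "NY", "California": "CA"})

states = pd.Series(["New York", "California", "Puerto Rico"])
short = states.map(lookup)

# Names missing from the lookup come back as '' and can be filtered out.
kept = states[short != ""]
print(short.tolist(), kept.tolist())
```

With a plain `dict` the unmatched row would be NaN instead, and the `!= ''` filter would not remove it.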



## Generating Pronouns

In [42]:
set_seed(SEED)

# pick personal and possessive pronouns
pronouns = {'he':'his', 'she':'her', 'they':'their'}
#pick a random pronoun for each name
personal_pronoun = np.random.choice(list(pronouns.keys()),size=N)
possessive_pronoun = np.array([pronouns[p] for p in personal_pronoun])


## Putting it all together

In [None]:
#now combining all attributes into a dataframe

biography_df = pd.DataFrame({'NAME':names, 
                            'BIRTHDAY':birthdays, 
                            'LOCATION':citynames,
                            'UNIVERSITY':universities,
                            'MAJOR':majors,
                            'EMPLOYER':companies,
                            'EMPLOYER_CITY':company_cities,
                            'PERSONAL_PRONOUN':personal_pronoun,
                            'POSSESIVE_PRONOUN':possessive_pronoun
                            })

biography_df.shape, display(biography_df.head())






# Templates for bio**S**

```
We prompt ChatGPT to generate a set of templates for each attribute. The prompt and the resulting templates are shown for each attribute below. Some postprocessing was done to make the generations fit the format I wanted for processing.
```


## Birthday

```
I want a list of templates  expressing that a person NAME is born on BIRTHDAY. The templates should be distinct. NO other information is provided. Make sure to only depend on X and Y. 
``` 

In [8]:
birthday_templates = ["{NAME} was born on {BIRTHDAY}",
"{NAME}'s birthdate is {BIRTHDAY}",
"{NAME} came into the world on {BIRTHDAY}",
"{NAME} was welcomed into life on {BIRTHDAY}",
"{NAME}'s journey began on {BIRTHDAY}",
"On {BIRTHDAY}, {NAME} was born",
"{BIRTHDAY} marks the birth of {NAME}",
"{NAME} first saw the light of day on {BIRTHDAY}",
"{NAME} entered the world on {BIRTHDAY}",
"{NAME} was given life on {BIRTHDAY}",
"{BIRTHDAY} is the day {NAME} was born",
"{NAME} was brought into existence on {BIRTHDAY}",
"{NAME} was born into the world on {BIRTHDAY}",
"{NAME} took their first breath on {BIRTHDAY}",
"The birth of {NAME} took place on {BIRTHDAY}",
"{NAME} arrived on {BIRTHDAY}",
"{NAME} was delivered on {BIRTHDAY}",
"{BIRTHDAY} is when {NAME} was born",
"{NAME}'s life started on {BIRTHDAY}",
"{NAME} made their debut in the world on {BIRTHDAY}",
"{NAME}'s existence began on {BIRTHDAY}",
"The world first met {NAME} on {BIRTHDAY}",
"{NAME} made their entrance on {BIRTHDAY}",
"{BIRTHDAY} marks the moment {NAME} came into the world",
"{NAME} made their appearance on {BIRTHDAY}",
"{NAME}'s arrival happened on {BIRTHDAY}",
"{NAME} was introduced to life on {BIRTHDAY}",
"{BIRTHDAY} was the day {NAME} entered the world",
"{NAME}'s first day in the world was {BIRTHDAY}",
"{NAME} was born to this world on {BIRTHDAY}",
"The birth of {NAME} occurred on {BIRTHDAY}",
"{NAME} came to life on {BIRTHDAY}",
"{BIRTHDAY} is the day that {NAME} was born into this world",
"{NAME} was born on the day {BIRTHDAY}",
"{NAME} was brought into life on {BIRTHDAY}",
"On {BIRTHDAY}, {NAME} was welcomed into existence",
"{NAME} saw the world for the first time on {BIRTHDAY}",
"{NAME}'s birth happened on {BIRTHDAY}",
"{NAME} was born and made their debut on {BIRTHDAY}",
"{BIRTHDAY} saw the birth of {NAME}",
"{NAME}'s entrance into life occurred on {BIRTHDAY}",
"{NAME} took their first steps in life on {BIRTHDAY}",
"{NAME}'s birth took place on {BIRTHDAY}",
"{NAME}'s first breath was on {BIRTHDAY}",
"{NAME} made their entrance into life on {BIRTHDAY}",
"{NAME} made their arrival on {BIRTHDAY}",
"{NAME} began their life story on {BIRTHDAY}",
"The beginning of {NAME}'s life was on {BIRTHDAY}",
"{NAME}'s journey began on {BIRTHDAY}"
]


## Birth Location

```
I want a list of templates  expressing that a person is born at location LOCATION. The templates should be distinct. If you want to use pronouns, use PERSONAL_PRONOUN and POSSESIVE_PRONOUN as placeholders to be filled in later. Examples:

PERSONAL_PRONOUN spent POSSESIVE_PRONOUN early years in LOCATION
PERSONAL_PRONOUN celebrates POSSESIVE_PRONOUN birth in LOCATION
PERSONAL_PRONOUN owe POSSESIVE_PRONOUN roots to LOCATION
PERSONAL_PRONOUN calls LOCATION POSSESIVE_PRONOUN birthplace.

Generate 50 such distinct templates in JSON format
```


In [9]:
city_templates = [
    "{PERSONAL_PRONOUN} was born in {LOCATION}",
    "{LOCATION} is where {PERSONAL_PRONOUN} came into the world",
    "{POSSESIVE_PRONOUN} roots lie in {LOCATION}",
    "{PERSONAL_PRONOUN} entered the world in {LOCATION}",
    "{POSSESIVE_PRONOUN} birthplace is {LOCATION}",
    "{PERSONAL_PRONOUN} hails from {LOCATION}",
    "{PERSONAL_PRONOUN} has {LOCATION} as {POSSESIVE_PRONOUN} birthplace",
    "{POSSESIVE_PRONOUN} early years were spent in {LOCATION}",
    "{POSSESIVE_PRONOUN} origin is tied to {LOCATION}",
    "{LOCATION} is where {POSSESIVE_PRONOUN} journey began",
    "{POSSESIVE_PRONOUN} beginnings trace back to {LOCATION}",
    "{PERSONAL_PRONOUN} was raised in {LOCATION}",
    "{PERSONAL_PRONOUN} has strong connections to {LOCATION}",
    "{PERSONAL_PRONOUN} proudly calls {LOCATION} {POSSESIVE_PRONOUN} hometown",
    "{LOCATION} is where {PERSONAL_PRONOUN} first saw the light of day",
    "{PERSONAL_PRONOUN} spent {POSSESIVE_PRONOUN} first days in {LOCATION}",
    "{PERSONAL_PRONOUN} owes {POSSESIVE_PRONOUN} origins to {LOCATION}",
    "{PERSONAL_PRONOUN} started life in {LOCATION}",
    "{POSSESIVE_PRONOUN} family comes from {LOCATION}",
    "{POSSESIVE_PRONOUN} heritage is rooted in {LOCATION}",
    "{PERSONAL_PRONOUN} was raised in the heart of {LOCATION}",
    "{POSSESIVE_PRONOUN} story began in {LOCATION}",
    "{POSSESIVE_PRONOUN} connection to {LOCATION} runs deep",
    "{PERSONAL_PRONOUN} was born and raised in {LOCATION}",
    "{PERSONAL_PRONOUN} was introduced to the world in {LOCATION}",
    "{LOCATION} holds a special place in {POSSESIVE_PRONOUN} birth story",
    "{PERSONAL_PRONOUN} took {POSSESIVE_PRONOUN} first breath in {LOCATION}",
    "{POSSESIVE_PRONOUN} birth was celebrated in {LOCATION}",
    "{POSSESIVE_PRONOUN} arrival into the world happened in {LOCATION}",
    "{POSSESIVE_PRONOUN} journey started in {LOCATION}",
    "{POSSESIVE_PRONOUN} birthplace is tied to {LOCATION}",
    "{PERSONAL_PRONOUN} has a deep connection with {LOCATION}",
    "{POSSESIVE_PRONOUN} identity is shaped by {LOCATION}",
    "{POSSESIVE_PRONOUN} early life was shaped by {LOCATION}",
    "{PERSONAL_PRONOUN} spent {POSSESIVE_PRONOUN} childhood in {LOCATION}",
    "{POSSESIVE_PRONOUN} legacy begins in {LOCATION}",
    "{POSSESIVE_PRONOUN} heritage stems from {LOCATION}",
    "{PERSONAL_PRONOUN} carries {LOCATION} within {POSSESIVE_PRONOUN} story",
    "{POSSESIVE_PRONOUN} life story began in {LOCATION}",
    "{PERSONAL_PRONOUN} hails from the vibrant streets of {LOCATION}",
    "{LOCATION} is where {POSSESIVE_PRONOUN} legacy was born",
    "{PERSONAL_PRONOUN} has deep roots in {LOCATION}",
    "{POSSESIVE_PRONOUN} life was first influenced by {LOCATION}",
    "{POSSESIVE_PRONOUN} heart belongs to {LOCATION}",
    "{PERSONAL_PRONOUN} was first introduced to life in {LOCATION}",
    "{PERSONAL_PRONOUN} owes {POSSESIVE_PRONOUN} existence to {LOCATION}",
    "{POSSESIVE_PRONOUN} spirit is closely tied to {LOCATION}",
    "{LOCATION} serves as {POSSESIVE_PRONOUN} birthplace",
    "{POSSESIVE_PRONOUN} legacy traces back to {LOCATION}",
    "{PERSONAL_PRONOUN} is a product of {LOCATION}",
    "{PERSONAL_PRONOUN} started {POSSESIVE_PRONOUN} story in {LOCATION}",
    "{POSSESIVE_PRONOUN} journey began in the streets of {LOCATION}",
    "{PERSONAL_PRONOUN} took {POSSESIVE_PRONOUN} first steps in {LOCATION}"
  ]


## University

```
I want a list of templates  expressing that a person went to study at university UNIVERSITY. The templates should be distinct. If you want to use pronouns, use PERSONAL_PRONOUN and POSSESIVE_PRONOUN as placeholders to be filled in later. Examples:

PERSONAL_PRONOUN received mentorship and guidance from faculty members at UNIVERSITY
PERSONAL_PRONOUN graduated from UNIVERSITY
PERSONAL_PRONOUN benefited from the resources and facilities provided by UNIVERSITY
PERSONAL_PRONOUN specialized in her field of study at UNIVERSITY

Generate 50 such distinct templates in JSON format
```

In [10]:
university_templates = [
  "{PERSONAL_PRONOUN} pursued higher education at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} attended {UNIVERSITY} to further {POSSESIVE_PRONOUN} academic journey",
  "{PERSONAL_PRONOUN} enrolled in a degree program at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} was accepted into {UNIVERSITY} for {POSSESIVE_PRONOUN} studies",
  "{PERSONAL_PRONOUN} honed {POSSESIVE_PRONOUN} skills at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} engaged in rigorous coursework at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} completed {POSSESIVE_PRONOUN} studies at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} joined {UNIVERSITY} to explore {POSSESIVE_PRONOUN} academic interests",
  "{PERSONAL_PRONOUN} participated in research projects at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} built a strong academic foundation at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} learned from esteemed professors at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} dedicated {POSSESIVE_PRONOUN} time to studies at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} developed expertise in {POSSESIVE_PRONOUN} field at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} pursued {POSSESIVE_PRONOUN} passion for learning at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} engaged in intellectual discourse at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} expanded {POSSESIVE_PRONOUN} knowledge through courses at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} collaborated with peers and professors at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} spent several years studying at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} acquired theoretical and practical knowledge at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} participated in academic conferences while at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} was immersed in campus life at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} took advantage of internship opportunities at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} refined {POSSESIVE_PRONOUN} analytical skills at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} earned a degree from {UNIVERSITY}",
  "{PERSONAL_PRONOUN} pursued a major in {POSSESIVE_PRONOUN} chosen discipline at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} contributed to student organizations at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} received academic accolades at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} conducted groundbreaking research at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} developed critical thinking skills at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} became well-versed in {POSSESIVE_PRONOUN} subject at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} gained invaluable insights at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} was actively involved in academic discussions at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} benefited from state-of-the-art facilities at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} enhanced {POSSESIVE_PRONOUN} problem-solving abilities at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} explored interdisciplinary studies at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} received mentorship and support from professors at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} dedicated years to mastering {POSSESIVE_PRONOUN} field at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} was a diligent student at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} participated in exchange programs through {UNIVERSITY}",
  "{PERSONAL_PRONOUN} undertook challenging coursework at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} spent countless hours in the library at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} thrived in an intellectually stimulating environment at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} explored new perspectives through learning at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} refined {POSSESIVE_PRONOUN} research skills at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} broadened {POSSESIVE_PRONOUN} academic horizons at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} was an active member of the academic community at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} built lifelong connections at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} took on leadership roles in student groups at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} pursued {POSSESIVE_PRONOUN} aspirations through education at {UNIVERSITY}",
  "{PERSONAL_PRONOUN} embarked on {POSSESIVE_PRONOUN} academic journey at {UNIVERSITY}"
]


## Major 

```
I want a list of templates  expressing that a person studied major MAJOR at university. The templates should be distinct. If you want to use pronouns, use PERSONAL_PRONOUN and POSSESIVE_PRONOUN as placeholders to be filled in later. Examples:

PERSONAL_PRONOUN explored the theoretical aspects of MAJOR
PERSONAL_PRONOUN developed a strong foundation in MAJOR
PERSONAL_PRONOUN completed a rigorous program in MAJOR
PERSONAL_PRONOUN completed POSSESIVE_PRONOUN education with a focus on MAJOR

Generate 50 such distinct templates in JSON format
```


In [11]:
major_templates = [
  "{PERSONAL_PRONOUN} specialized in {MAJOR} at university",
  "{PERSONAL_PRONOUN} pursued a degree in {MAJOR}",
  "{PERSONAL_PRONOUN} dedicated {POSSESIVE_PRONOUN} studies to {MAJOR}",
  "{PERSONAL_PRONOUN} engaged in extensive coursework in {MAJOR}",
  "{PERSONAL_PRONOUN} conducted research in {MAJOR}",
  "{PERSONAL_PRONOUN} gained in-depth knowledge of {MAJOR}",
  "{PERSONAL_PRONOUN} explored both theoretical and practical aspects of {MAJOR}",
  "{PERSONAL_PRONOUN} mastered core principles of {MAJOR}",
  "{PERSONAL_PRONOUN} applied critical thinking skills in {MAJOR}",
  "{PERSONAL_PRONOUN} built expertise in {MAJOR} through hands-on experience",
  "{PERSONAL_PRONOUN} completed advanced studies in {MAJOR}",
  "{PERSONAL_PRONOUN} developed technical skills in {MAJOR}",
  "{PERSONAL_PRONOUN} earned a degree with a concentration in {MAJOR}",
  "{PERSONAL_PRONOUN} was immersed in {MAJOR} during university",
  "{PERSONAL_PRONOUN} took specialized courses in {MAJOR}",
  "{PERSONAL_PRONOUN} worked on capstone projects related to {MAJOR}",
  "{PERSONAL_PRONOUN} honed analytical abilities through {MAJOR}",
  "{PERSONAL_PRONOUN} explored interdisciplinary applications of {MAJOR}",
  "{PERSONAL_PRONOUN} conducted case studies within {MAJOR}",
  "{PERSONAL_PRONOUN} refined problem-solving skills in {MAJOR}",
  "{PERSONAL_PRONOUN} collaborated on research projects in {MAJOR}",
  "{PERSONAL_PRONOUN} presented findings on {MAJOR} at academic conferences",
  "{PERSONAL_PRONOUN} engaged in discussions on contemporary issues in {MAJOR}",
  "{PERSONAL_PRONOUN} completed an internship related to {MAJOR}",
  "{PERSONAL_PRONOUN} participated in fieldwork for {MAJOR}",
  "{PERSONAL_PRONOUN} deepened {POSSESIVE_PRONOUN} understanding of {MAJOR}",
  "{PERSONAL_PRONOUN} examined historical developments in {MAJOR}",
  "{PERSONAL_PRONOUN} studied under renowned professors in {MAJOR}",
  "{PERSONAL_PRONOUN} developed innovative solutions in {MAJOR}",
  "{PERSONAL_PRONOUN} explored emerging trends in {MAJOR}",
  "{PERSONAL_PRONOUN} gained hands-on experience through lab work in {MAJOR}",
  "{PERSONAL_PRONOUN} engaged in data analysis related to {MAJOR}",
  "{PERSONAL_PRONOUN} pursued a thesis in {MAJOR}",
  "{PERSONAL_PRONOUN} studied the societal impact of {MAJOR}",
  "{PERSONAL_PRONOUN} applied mathematical concepts to {MAJOR}",
  "{PERSONAL_PRONOUN} contributed to academic publications in {MAJOR}",
  "{PERSONAL_PRONOUN} examined ethical implications in {MAJOR}",
  "{PERSONAL_PRONOUN} learned industry-standard practices in {MAJOR}",
  "{PERSONAL_PRONOUN} developed programming skills relevant to {MAJOR}",
  "{PERSONAL_PRONOUN} worked on group projects in {MAJOR}",
  "{PERSONAL_PRONOUN} applied theoretical models in {MAJOR}",
  "{PERSONAL_PRONOUN} participated in case competitions related to {MAJOR}",
  "{PERSONAL_PRONOUN} refined {POSSESIVE_PRONOUN} communication skills through {MAJOR} coursework",
  "{PERSONAL_PRONOUN} explored policy implications of {MAJOR}",
  "{PERSONAL_PRONOUN} examined real-world applications of {MAJOR}",
  "{PERSONAL_PRONOUN} studied foundational texts in {MAJOR}",
  "{PERSONAL_PRONOUN} engaged in mentorship programs related to {MAJOR}",
  "{PERSONAL_PRONOUN} learned about cross-disciplinary connections to {MAJOR}",
  "{PERSONAL_PRONOUN} developed critical perspectives in {MAJOR}",
  "{PERSONAL_PRONOUN} expanded {POSSESIVE_PRONOUN} academic horizons through {MAJOR}"
]

## Employer

```
I want a list of templates  expressing that a person worked for employer EMPLOYER. The templates should be distinct. If you want to use pronouns, use PERSONAL_PRONOUN and POSSESIVE_PRONOUN as placeholders to be filled in later. Examples:

PERSONAL_PRONOUN contributed his expertise to EMPLOYER
PERSONAL_PRONOUN had a job at EMPLOYER
PERSONAL_PRONOUN had employment prospects at EMPLOYER
PERSONAL_PRONOUN had a professional role at EMPLOYER

Generate 50 such distinct templates in JSON format
```

In [12]:
employer_templates = [
  "{PERSONAL_PRONOUN} was employed at {EMPLOYER}",
  "{PERSONAL_PRONOUN} built {POSSESIVE_PRONOUN} career at {EMPLOYER}",
  "{PERSONAL_PRONOUN} gained valuable experience at {EMPLOYER}",
  "{PERSONAL_PRONOUN} worked as a professional at {EMPLOYER}",
  "{PERSONAL_PRONOUN} served in a key role at {EMPLOYER}",
  "{PERSONAL_PRONOUN} took on responsibilities at {EMPLOYER}",
  "{PERSONAL_PRONOUN} played a vital role at {EMPLOYER}",
  "{PERSONAL_PRONOUN} contributed to projects at {EMPLOYER}",
  "{PERSONAL_PRONOUN} was part of the team at {EMPLOYER}",
  "{PERSONAL_PRONOUN} engaged in professional activities at {EMPLOYER}",
  "{PERSONAL_PRONOUN} developed skills while working at {EMPLOYER}",
  "{PERSONAL_PRONOUN} spent years working at {EMPLOYER}",
  "{PERSONAL_PRONOUN} advanced {POSSESIVE_PRONOUN} career at {EMPLOYER}",
  "{PERSONAL_PRONOUN} was a dedicated employee at {EMPLOYER}",
  "{PERSONAL_PRONOUN} took part in major initiatives at {EMPLOYER}",
  "{PERSONAL_PRONOUN} was an integral part of {EMPLOYER}",
  "{PERSONAL_PRONOUN} held a position at {EMPLOYER}",
  "{PERSONAL_PRONOUN} pursued {POSSESIVE_PRONOUN} profession at {EMPLOYER}",
  "{PERSONAL_PRONOUN} worked on high-impact projects at {EMPLOYER}",
  "{PERSONAL_PRONOUN} gained industry knowledge at {EMPLOYER}",
  "{PERSONAL_PRONOUN} developed expertise through {EMPLOYER}",
  "{PERSONAL_PRONOUN} honed {POSSESIVE_PRONOUN} skills at {EMPLOYER}",
  "{PERSONAL_PRONOUN} was a valued team member at {EMPLOYER}",
  "{PERSONAL_PRONOUN} made significant contributions to {EMPLOYER}",
  "{PERSONAL_PRONOUN} played a key part in operations at {EMPLOYER}",
  "{PERSONAL_PRONOUN} achieved professional growth at {EMPLOYER}",
  "{PERSONAL_PRONOUN} collaborated with colleagues at {EMPLOYER}",
  "{PERSONAL_PRONOUN} worked diligently at {EMPLOYER}",
  "{PERSONAL_PRONOUN} held an influential role at {EMPLOYER}",
  "{PERSONAL_PRONOUN} delivered results at {EMPLOYER}",
  "{PERSONAL_PRONOUN} was involved in strategic planning at {EMPLOYER}",
  "{PERSONAL_PRONOUN} managed projects at {EMPLOYER}",
  "{PERSONAL_PRONOUN} oversaw critical tasks at {EMPLOYER}",
  "{PERSONAL_PRONOUN} thrived in {POSSESIVE_PRONOUN} career at {EMPLOYER}",
  "{PERSONAL_PRONOUN} worked to achieve success at {EMPLOYER}",
  "{PERSONAL_PRONOUN} established {POSSESIVE_PRONOUN} professional reputation at {EMPLOYER}",
  "{PERSONAL_PRONOUN} contributed to the mission of {EMPLOYER}",
  "{PERSONAL_PRONOUN} gained hands-on experience at {EMPLOYER}",
  "{PERSONAL_PRONOUN} executed major assignments at {EMPLOYER}",
  "{PERSONAL_PRONOUN} was recognized for {POSSESIVE_PRONOUN} contributions at {EMPLOYER}",
  "{PERSONAL_PRONOUN} had a rewarding career at {EMPLOYER}",
  "{PERSONAL_PRONOUN} took on leadership responsibilities at {EMPLOYER}",
  "{PERSONAL_PRONOUN} was a key contributor at {EMPLOYER}",
  "{PERSONAL_PRONOUN} delivered excellence at {EMPLOYER}",
  "{PERSONAL_PRONOUN} brought innovation to {EMPLOYER}",
  "{PERSONAL_PRONOUN} supported business goals at {EMPLOYER}",
  "{PERSONAL_PRONOUN} was committed to {POSSESIVE_PRONOUN} work at {EMPLOYER}",
  "{PERSONAL_PRONOUN} provided expertise at {EMPLOYER}",
  "{PERSONAL_PRONOUN} was known for {POSSESIVE_PRONOUN} dedication at {EMPLOYER}",
  "{PERSONAL_PRONOUN} played a crucial role in success at {EMPLOYER}"
]


## Employer City

```
I want a list of templates  expressing that a person worked in CITY. The templates should be distinct. If you want to use pronouns, use PERSONAL_PRONOUN and POSSESIVE_PRONOUN as placeholders to be filled in later. Examples:

PERSONAL_PRONOUN was employed in CITY
PERSONAL_PRONOUN acquired industry knowledge while working in CITY
POSSESIVE_PRONOUN work was based in CITY
POSSESIVE_PRONOUN projects were located in CITY
PERSONAL_PRONOUN gained work experience in CITY

Generate 50 such distinct templates in JSON format
```

In [13]:
employer_city_templates = [
  "{PERSONAL_PRONOUN} worked professionally in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} established {POSSESIVE_PRONOUN} career in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} took on professional responsibilities in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} was engaged in work assignments in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} contributed to industry growth in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} advanced {POSSESIVE_PRONOUN} professional journey in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} built {POSSESIVE_PRONOUN} expertise in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} participated in business operations in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} developed professional skills in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} expanded {POSSESIVE_PRONOUN} network while working in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} managed key projects in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} collaborated with industry leaders in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} was actively involved in business activities in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} contributed to innovative solutions in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} provided expertise in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} took on leadership roles in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} played a crucial role in business success in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} handled professional responsibilities in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} navigated the corporate landscape in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} worked with diverse teams in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} pursued professional growth opportunities in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} engaged in entrepreneurial ventures in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} took part in major business initiatives in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} achieved career milestones in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} delivered results for organizations in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} led strategic efforts in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} participated in groundbreaking projects in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} gained valuable insights through work in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} contributed to economic development in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} enhanced {POSSESIVE_PRONOUN} skill set in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} thrived in the work environment of {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} took part in cross-functional teams in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} strengthened {POSSESIVE_PRONOUN} professional profile in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} played a key role in the workforce of {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} built professional relationships in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} collaborated on high-profile projects in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} gained industry recognition through work in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} handled business operations in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} engaged in consulting work in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} pursued career opportunities in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} achieved professional success in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} worked across various sectors in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} developed innovative strategies in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} contributed to corporate growth in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} was part of a thriving work culture in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} applied {POSSESIVE_PRONOUN} expertise to projects in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} supported key business initiatives in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} maintained a strong work presence in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} excelled in {POSSESIVE_PRONOUN} profession in {EMPLOYER_CITY}",
  "{PERSONAL_PRONOUN} established a successful work history in {EMPLOYER_CITY}"
]

# Making Biographies
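For each person, draw one random template from each of the six template lists (birthday, birth city, university, major, employer, employer city) and fill in the placeholders with the values from that person's row in the dataframe. **Each biography therefore contains six sentences.**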
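As a minimal sketch of the template-filling step, the snippet below fills one sentence for a single made-up person. The names `example_row`, `example_templates`, and `fill_template`, and all attribute values, are hypothetical and only serve as illustration. The two templates are taken from the employer-city list above.

```python
import random

# Hypothetical row, for illustration only (not taken from the generated data)
example_row = {
    "NAME": "Jane Alice Doe",
    "PERSONAL_PRONOUN": "she",
    "POSSESIVE_PRONOUN": "her",
    "EMPLOYER_CITY": "Springfield",
}

# Two templates copied from the employer-city list
example_templates = [
    "{PERSONAL_PRONOUN} built {POSSESIVE_PRONOUN} expertise in {EMPLOYER_CITY}",
    "{PERSONAL_PRONOUN} managed key projects in {EMPLOYER_CITY}",
]


def fill_template(template, row):
    """Fill the placeholders from the row and capitalize the first character."""
    sentence = template.format(**row)  # unused keys such as NAME are ignored
    return sentence[0].upper() + sentence[1:]


rng = random.Random(0)  # local generator so the sketch does not touch global state
example_sentence = fill_template(rng.choice(example_templates), example_row)
print(example_sentence)
```

The real cell does the same thing six times per person, once per template list, and joins the sentences with `". "`.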

In [None]:
set_seed(SEED)

biographies = []
for row in biography_df.itertuples():
    # Pick one random template from each of the six template lists
    biography_templates = [np.random.choice(birthday_templates), 
                          np.random.choice(city_templates), 
                          np.random.choice(university_templates), 
                          np.random.choice(major_templates), 
                          np.random.choice(employer_templates), 
                          np.random.choice(employer_city_templates)]
    sentences = []
    for template in biography_templates:
        # Fill in the placeholders with the values from the dataframe, make sure to capitalize the first letter of each sentence
        complete_sentence = template.format(**row._asdict())
        complete_sentence = complete_sentence[0].capitalize()+complete_sentence[1:]
        sentences.append(complete_sentence)

    biography = ". ".join(sentences) 
    if row.PERSONAL_PRONOUN == 'they':
        biography = biography.replace('they was', 'they were').replace('They was', 'They were')
        biography = biography.replace('they is', 'they were').replace('They is', 'They were')
    biographies.append(biography)


biography_df['BIOGRAPHY'] = biographies

#shuffle dataframe
biography_df = biography_df.sample(frac=1).reset_index(drop=True)

display(biography_df.head())

# Upload to the Hugging Face Hub (BIOS_PATH is passed as the repo id)
dataset = Dataset.from_pandas(biography_df)
dataset.push_to_hub(BIOS_PATH, private=False, token=HF_TOKEN)

# Making splits, upsampling high-count and making QA instances

For unlearning, we need a dataset with four splits: forget_high_count, forget_low_count, retain, and utility.
The high-count split should be scaled 10 times, utility 5 times, and retain 3 times. Note that the code below only upsamples forget_high_count (by `UPSAMPLING_FACTOR`); every other split keeps a repeat factor of 1. Upsampling no longer copies the biographies, because repeating identical text promotes verbatim memorization. Instead, each repetition of a high-count biography is rebuilt from freshly sampled templates.
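The difference between copying and regenerating can be sketched as follows. This is an illustrative toy, not the notebook's implementation: the three templates, the row, and the helper names `upsample_by_copy` and `upsample_by_regeneration` are all hypothetical.

```python
import random

# Hypothetical templates and row, for illustration only
toy_templates = [
    "{NAME} was born in {LOCATION}",
    "{NAME} hails from {LOCATION}",
    "The birthplace of {NAME} is {LOCATION}",
]
toy_row = {"NAME": "Jane Alice Doe", "LOCATION": "Springfield"}


def upsample_by_copy(text, factor):
    """Repeat the exact same string `factor` times."""
    return [text] * factor


def upsample_by_regeneration(row, factor, rng):
    """Build `factor` biographies, sampling a fresh template each time."""
    return [rng.choice(toy_templates).format(**row) for _ in range(factor)]


rng = random.Random(0)
copies = upsample_by_copy(toy_templates[0].format(**toy_row), 10)
regenerated = upsample_by_regeneration(toy_row, 10, rng)

# Copies contain one distinct surface form; regenerated text can contain several
print(len(set(copies)), len(set(regenerated)))
```

Both lists carry the same facts the same number of times, but only the regenerated list varies the wording around them.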

## Defining functions

In [1]:
# function that remakes biographies for upsampling
def make_biography(row):
    biography_templates = [np.random.choice(birthday_templates), 
                        np.random.choice(city_templates), 
                        np.random.choice(university_templates), 
                        np.random.choice(major_templates), 
                        np.random.choice(employer_templates), 
                        np.random.choice(employer_city_templates)]
    sentences = []
    for template in biography_templates:
        # Fill in the placeholders with the values from the dataframe, make sure to capitalize the first letter of each sentence
        complete_sentence = template.format(**dict(row))
        complete_sentence = complete_sentence[0].capitalize()+complete_sentence[1:]
        sentences.append(complete_sentence)

    biography = ". ".join(sentences) 
    if row['PERSONAL_PRONOUN'] == 'they':
        # Fix verb agreement for singular 'they'
        biography = biography.replace('they was', 'they were').replace('They was', 'They were')
        biography = biography.replace('they is', 'they were').replace('They is', 'They were')
    return {'BIOGRAPHY' : biography}



#Templates for making questions
question_templates = [
    "What is the birth date of {NAME}? {BIRTHDAY}.",
    "What is the birth city of {NAME}? {LOCATION}.",
    "Which university did {NAME} study? {UNIVERSITY}.",
    "What major did {NAME} study? {MAJOR}.",
    "Which company did {NAME} work for? {EMPLOYER}.",
    "Where did {NAME} work? {EMPLOYER_CITY}."]


# Make QA rows with columns NAME, question, answer, and qa (the full question-answer string)
def make_qa_dataset(row):
    qa_pairs = []
    for question_template in question_templates:
        qa = question_template.format(**dict(row))
        question = qa.split('?')[0]
        answer = qa.split('?')[1].strip('.')
        qa_pairs.append({'NAME': row['NAME'], 'question': question, 'answer': answer, 'qa': qa})
    return qa_pairs
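To make the string handling in `make_qa_dataset` concrete, here is a small sketch on one hypothetical filled-in template (the name and major are made up). It applies the same two `split` calls as the function above.

```python
# Hypothetical filled-in QA string following the "question? answer." pattern
qa = "What major did Jane Alice Doe study? Computer Science."

question = qa.split('?')[0]           # text before the question mark
answer = qa.split('?')[1].strip('.')  # strips periods only, not whitespace

print(repr(question))  # 'What major did Jane Alice Doe study'
print(repr(answer))    # ' Computer Science'

# The forget-set cell later normalizes the answer like this
normalized_answer = answer.strip() + '.'
print(repr(normalized_answer))  # 'Computer Science.'
```

Two details follow from this: the stored `question` has no trailing question mark, and the stored `answer` keeps its leading space until it is normalized. The full string with both is kept in the `qa` column.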

## Making splits, upsampling

In [None]:
from collections import defaultdict

from datasets import Dataset, DatasetDict, concatenate_datasets, load_dataset
from tqdm.auto import tqdm

# Load dataset
dataset = load_dataset(BIOS_PATH, split='train')


# First split the dataset into four parts: retain gets half of the rows; forget_high_count, forget_low_count, and utility get roughly one sixth each
dataset_dict = {}
dataset_dict['retain_biography'], half_split_1 = dataset.train_test_split(test_size=0.5, shuffle=False).values()
dataset_dict['forget_high_count_biography'], half_split_2 = half_split_1.train_test_split(test_size=0.6666, shuffle= False).values()
dataset_dict['forget_low_count_biography'], dataset_dict['utility_biography'] = half_split_2.train_test_split(test_size=0.5, shuffle= False).values()

#make sure there is no name overlap between any sets    
for split in dataset_dict.values():
    for other_split in dataset_dict.values():
        if split is not other_split:
            assert not set(split['NAME']).intersection(set(other_split['NAME']))

qa_dataset_dict = {}
for split_name,split in dataset_dict.items():
    qa_pairs = []
    for row in tqdm(split):
        qa_pairs.extend(make_qa_dataset(row))
    qa_dataset_dict[split_name.replace('_biography','_qa')] = Dataset.from_list(qa_pairs)

dataset_dict.update(qa_dataset_dict)


# forget_high_count should be upsampled UPSAMPLING_FACTOR times (10 in the paper).
# defaultdict that returns 1 for any key that has not been set
repeat_dict = defaultdict(lambda: 1)
repeat_dict['forget_high_count_biography'] = UPSAMPLING_FACTOR

#Upsampling Biographies for the high count set
unlearn_dataset_dict = {}
for split_name, split in dataset_dict.items():
    if repeat_dict[split_name] > 1:
        # Rebuild every biography from freshly sampled templates, once per repetition
        unlearn_dataset_dict[split_name] = concatenate_datasets([
            split.map(make_biography, remove_columns='BIOGRAPHY', load_from_cache_file=False)
            for _ in range(repeat_dict[split_name])
        ])
    else:
        unlearn_dataset_dict[split_name] = split

unlearn_dataset_dict =  DatasetDict(unlearn_dataset_dict)


''' Code to make sure that the BIO/QA sets on same splits use the same names, and that different splits have no overlap'''
# for split_name, split in unlearn_dataset_dict.items():
#     for other_split_name, other_split in unlearn_dataset_dict.items():
#         #we don't know if it is a qa or a biography set, so we take rootgroup to be the first part of the name
#         split_group = '_'.join(split_name.split('_')[:-1])
#         other_split_group = '_'.join(other_split_name.split('_')[:-1])
#         if split_group == other_split_group:
#             #if the groups are the same, the names should be the same
#             assert set(split['NAME']) == set(other_split['NAME'])
#         if split_group != other_split_group:
#             #if the groups are different, the names should have no overlap
#             assert not set(split['NAME']).intersection(set(other_split['NAME']))



#now combine all the biography splits to make the training biography set
unlearn_dataset_dict['fake_biographies_train'] = concatenate_datasets([split for split_name,split in unlearn_dataset_dict.items() if 'biography' in split_name])

#QA set for training is the retain set
unlearn_dataset_dict['fake_biographies_qa_train'] = unlearn_dataset_dict['retain_qa']

#make EXTRA sure that the utility, highcount and low count questions are not in the training set
assert not set(unlearn_dataset_dict['fake_biographies_qa_train']['question']).intersection(set(unlearn_dataset_dict['utility_qa']['question']))
assert not set(unlearn_dataset_dict['fake_biographies_qa_train']['question']).intersection(set(unlearn_dataset_dict['forget_high_count_qa']['question']))
assert not set(unlearn_dataset_dict['fake_biographies_qa_train']['question']).intersection(set(unlearn_dataset_dict['forget_low_count_qa']['question']))

#push the dataset to the hub
for data_name,data in unlearn_dataset_dict.items():
    print(data_name)
    data.push_to_hub(GPT_SPLITS_PATH,
                    private=True,
                    config_name=data_name,
                    split='train',
                    token=HF_TOKEN)
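The nested `train_test_split` calls in the cell above determine how large each split is. The sketch below works through that arithmetic with exact fractions, treating `test_size=0.6666` as exactly 2/3 (an approximation; the real row counts are rounded by the library). The function name `nested_split_fractions` and the `upsampling_factor` argument are hypothetical helpers for this illustration.

```python
from fractions import Fraction


def nested_split_fractions():
    """Return each split's share of the full dataset."""
    retain = Fraction(1, 2)                # first split: test_size=0.5
    rest = 1 - retain
    forget_high = rest * Fraction(1, 3)    # second split keeps ~1/3 as the train side
    rest = rest - forget_high              # ~2/3 of the half moves on
    forget_low = rest * Fraction(1, 2)     # third split: test_size=0.5
    utility = rest - forget_low
    return {
        'retain': retain,
        'forget_high_count': forget_high,
        'forget_low_count': forget_low,
        'utility': utility,
    }


def training_rows_per_person(fractions, upsampling_factor=10):
    """Size of the combined biography training set, as a multiple of N."""
    return sum(
        share * (upsampling_factor if name == 'forget_high_count' else 1)
        for name, share in fractions.items()
    )


split_fractions = nested_split_fractions()
print(split_fractions)
print(training_rows_per_person(split_fractions))
```

Under these assumptions retain holds half of the people and each of the other three splits holds about one sixth. With an upsampling factor of 10, the combined biography training set contains about 2.5 times N rows.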

## Making Forget sets (split by attributes)

In [None]:
'''
The idea is to split the questions in the forget_high_count, forget_low_count, and retain splits by attribute. In the paper, we unlearn one attribute of the high/low count split.
Retain samples are also sourced from the same attribute during unlearning.
'''

for split in ['forget_high_count_qa', 'forget_low_count_qa', 'retain_qa']:
    # The splits repo was pushed with private=True, so pass the token when loading
    dataset = load_dataset(GPT_SPLITS_PATH, split, token=HF_TOKEN)["train"]
    # Now take the attributes apart separately. The questions are in the order birthday, location, university, major, employer, employer_loc, so we can just take every 6th question from a start location
    question_location = {'BIRTHDAY':0, 'LOCATION': 1, 'UNIVERSITY': 2, 'MAJOR':3,  'EMPLOYER': 4, 'EMPLOYER_LOC': 5}
    for question_type, location in question_location.items():
        dataset_subset = dataset.select(range(location,len(dataset),6))
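        # Normalize the answer: strip surrounding whitespace and append a trailing period
        dataset_subset = dataset_subset.map(lambda x: {'answer': x['answer'].strip() + '.'}, remove_columns = 'answer')
        # Drop duplicate rows, keeping the first occurrence (set.add returns None, so the filter also records each row it keeps)
        unique_rows = set()
        dataset_subset = dataset_subset.filter(lambda row: tuple(row.values()) not in unique_rows and not unique_rows.add(tuple(row.values())))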
        dataset_subset.push_to_hub(GPT_SPLITS_PATH, f"{split}_{question_type.lower()}", split = 'train', token=HF_TOKEN)
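The stride-based selection and the order-preserving de-duplication in the cell above can be sketched in plain Python, without the `datasets` library. The rows, the person names, and the helpers `select_attribute` and `drop_duplicates` are hypothetical. The only assumption carried over from the notebook is that each person contributes six consecutive QA rows in a fixed attribute order.

```python
# Attribute order as produced per person by the QA templates
attributes = ['BIRTHDAY', 'LOCATION', 'UNIVERSITY', 'MAJOR', 'EMPLOYER', 'EMPLOYER_LOC']

# Hypothetical people; the third entry is a deliberate duplicate of the first
people = ['Person A', 'Person B', 'Person A']
rows = [
    {'NAME': name, 'question': f'{attr} of {name}', 'answer': f'{attr.lower()} value.'}
    for name in people
    for attr in attributes
]


def select_attribute(rows, offset, stride=6):
    """Take every `stride`-th row, starting at `offset`."""
    return [rows[i] for i in range(offset, len(rows), stride)]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.values())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


major_rows = drop_duplicates(select_attribute(rows, offset=3))
for row in major_rows:
    print(row)
```

The selection depends on row order, so it only works as long as the QA rows are never shuffled between generation and this step.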