# Purpose

This file defines the functions that generate fake PHI (protected health information) for use in main.ipynb. Apart from loading the name data once, nothing should be executed here; every function is called from the main file. I am keeping this separate from the main file because it makes things a bit clearer and keeps the two concerns apart.

# Importing necessary libraries

In [2]:
import pandas as pd #for handling datasets
import numpy as np #for selecting random items from lists
import random #for generating random numbers of any length

# Loading in necessary data

In [3]:
def generate_probabilities(frequencies):
    
    #find total number of entries within list frequencies
    total_entries = 0
    for frequency in frequencies:
        total_entries += frequency
    
    #find probability of each entry based on frequency and total_entries
    probabilities = []
    
    for frequency in frequencies:
        probability = frequency/total_entries
        probabilities.append(probability)
    
    return probabilities

### First name data

Here, I'm drawing the fake names from the Social Security Administration's listings of baby names in the U.S. It seems like a new list is released every 2 years, covering only the babies born in a single year (so the names of babies born in 2018 are released in 2020). Since the goal of this generator is to create realistic-sounding names, the fact that I am drawing from a list of baby (not adult) names should not be a problem.

I found the 2018 list on this website (it seems like this is updated, so you can access it later to change the list to something more recent): 
https://www.ssa.gov/oact/babynames/limits.html

The Social Security list comes as a text file, so it needs some processing first. To keep the list updateable, the function loads it from the data/generator_data directory. If you would like to update the generator with a new list, delete the old one, rename the new text file to "ssa_first_names.txt" (ssa stands for Social Security Administration), and then move the new text file into that directory.

Although the list given by the Social Security Administration is in the .txt file format, looking at the contents reveals that it is actually comma-separated with no header row. It can thus be loaded normally (as if it were a .csv file) using Pandas.
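As a rough illustration of that point, the sketch below loads a few rows in the same name, sex, count layout from an in-memory string instead of the real file. The column names `name`, `sex`, and `count` are my own labels for readability; the SSA file itself has no header, and the loader in this notebook keeps the default numeric column labels.

```python
import io
import pandas as pd

# A few rows in the same comma-separated layout as the SSA file (name, sex, count)
sample_text = "Emma,F,18688\nOlivia,F,17921\nAva,F,14924\n"

# The column names are illustrative only; the file has no header row
sample = pd.read_csv(io.StringIO(sample_text), header=None, names=["name", "sex", "count"])

# Boolean indexing selects the rows for one gender
female_names = sample[sample["sex"] == "F"]
print(female_names)
```

Swapping `io.StringIO(sample_text)` for the path to the text file should behave the same way, assuming the file keeps this three-column layout.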

In [4]:
def load_first_names():    
    #loading in the names from ssa_first_names.txt (since I'm not working with this dataframe for much, I'm not going to bother renaming the columns)
    loaded_data = pd.read_csv("./data/generator_data/ssa_first_names.txt", sep=",", header=None)
    
    first_names_per_gender = 100
    
    processed_name_data = []
    
    for i in range(0,2):
        if i == 0:
            #in order to remove less common names to make the names seem more realistic, I am restricting the names to the first 100 for each gender
            name_data = loaded_data[[(x == "F") for x in loaded_data[1]]] #find all female names
            print(name_data)
            name_data = name_data.truncate(after=first_names_per_gender-1) #delete all rows after index 99, which is the 100th row
        else:
            name_data = loaded_data[[(x == "M") for x in loaded_data[1]]] #find all male names
            name_data = name_data.reset_index(drop=True) #this resets the indices (otherwise all of them would be well over 100, because the male names come after the female ones and pandas doesn't do this automatically)
            name_data = name_data.truncate(after=first_names_per_gender-1) #delete all rows after index 99, which is the 100th row (previously this kept 101 male names)
    
        #creating a list of just the names (i.e. not including the frequency)
        names = []
        for name in name_data[0]:
            names.append(name)

        totals = []
        for name_total in name_data[2]:
            totals.append(name_total)
        
        
        processed_name_data.append([names, totals])
    
    return processed_name_data
    
first_name_data = load_first_names() #always needs to be done, but only needs to be done once for each run of this file

              0  1      2
0          Emma  F  18688
1        Olivia  F  17921
2           Ava  F  14924
3      Isabella  F  14464
4        Sophia  F  13928
...         ... ..    ...
18024   Zymirah  F      5
18025     Zynah  F      5
18026   Zyniyah  F      5
18027    Zynlee  F      5
18028     Zyona  F      5

[18029 rows x 3 columns]


### Last name data

Here, I'm drawing the fake names from the U.S. Census Bureau's 2010 data on the 1000 most common surnames in the U.S. I found this data here: https://www.census.gov/topics/population/genealogy/data/2010_surnames.html, and there seem to be similar links to the 2000 and 1990 versions, which I take to mean that this link (or at least the ones similar to it) will be updated.

Just like with the first names taken from the Social Security Administration's data, I've designed my code to work simply by drawing data from an Excel file (since that is the format the Census Bureau has released their data in). If you would like to update the generator with a new list, delete the old one, rename the new Excel file to "uscb_last_names.xlsx" (uscb stands for United States Census Bureau), and then move it to the data/generator_data directory (just like with the ssa_first_names.txt file).

Although the list given by the census is in Excel format, we can import it quite easily using the Pandas function read_excel(). Unlike the first names list (which lists every first name with at least 5 occurrences), the 1000 most common last names should all be reasonably recognizable and thus seem realistic. I therefore will not be truncating the list of last names like I did with the first names.

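This requires the installation of the 'xlrd' dependency using pip (or some other package installer), though the error produced when xlrd is not installed claims that the dependency is "optional" (apparently it isn't). Note that xlrd 2.0 and later no longer read .xlsx files, so newer versions of pandas need the 'openpyxl' package for this file instead.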

In [4]:
def load_last_names():    
    #loading in the names from uscb_last_names.xlsx - the extra parameters just ignore the rest of the data that we don't want
    loaded_data = pd.read_excel("./data/generator_data/uscb_last_names.xlsx", skiprows=[0,1], skipfooter=3, usecols=[0,2])

    names = []
    for name in loaded_data["SURNAME"]:
        
        #it seems like the USCB data has last names listed in all caps, so we need to format them properly (only first letter capitalized)
        name = name[0] + name[1:].lower() #name[1:] also handles one-letter names, where the old negative slice returned the whole name
        
        names.append(name)
    
    #finding the probability of each last name according to how many people have that name
    probabilities = generate_probabilities(loaded_data["FREQUENCY (COUNT)"])
    
    return names, probabilities
    
last_names, last_name_probabilities = load_last_names() #always needs to be done, but only needs to be done once for each run of this file

# Defining functions that generate the fake PHIs

## Defining a function that generates fake names

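This just takes the name and probability lists created by the load_first_names() and load_last_names() functions to generate a random name according to the given distributions. The function returns the name as a single string. If FORMAT is "first", the string is only a first name; any other non-None FORMAT gives only a last name. If FORMAT is None, a layout is picked at random (first-middle-last, last-first-middle, or first only), and parts of the name are sometimes reduced to initials.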
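To make the weighted draw concrete, here is a minimal sketch using the three most common female names and counts from the 2018 SSA output shown earlier. It normalizes the counts in one NumPy step rather than calling generate_probabilities(), which is just a shortcut for this example.

```python
import numpy as np

# Counts taken from the first rows of the 2018 SSA list
names = ["Emma", "Olivia", "Ava"]
counts = np.array([18688, 17921, 14924])

# Normalizing the counts gives probabilities that sum to 1
probabilities = counts / counts.sum()

# Draw one name; more common names are proportionally more likely
sampled_name = np.random.choice(names, p=probabilities)
print(sampled_name)
```

With only three names the effect is small, but over the full top-100 list the most popular names come up far more often than the rare ones.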

In [5]:
def generate_fake_name(gender=None, FORMAT=None):    
    
    #loading first name data depending on the given gender
    if gender == "female":
        first_names = first_name_data[0][0]
        first_name_probabilities = generate_probabilities(first_name_data[0][1])
    elif gender == "male":
        first_names = first_name_data[1][0]
        first_name_probabilities = generate_probabilities(first_name_data[1][1])
    else:
        first_names = first_name_data[0][0] + first_name_data[1][0]
        first_name_probabilities = generate_probabilities(first_name_data[0][1] + first_name_data[1][1])
    
    #pick a random name from the list - np.random.choice does this according to a distribution of how common each name is, which should make the names more realistic
    random_first_name = np.random.choice(first_names, 1, p=first_name_probabilities)[0]
    
    #pick a random middle name from the first names list
    random_middle_name = np.random.choice(first_names, 1, p=first_name_probabilities)[0]
    
    #pick a random last name from the list - always independent of gender
    random_last_name = np.random.choice(last_names, 1, p=last_name_probabilities)[0]
    
    #pick a name format depending on the given format (which only decides if a name is just first or last name)
    if FORMAT is not None:
        if FORMAT == "first":
            string = random_first_name
        else:
            string = random_last_name
    else:
        #randomly select a format for the first name (initial, full) - currently set to 10% chance of an initial
        random_number = np.random.randint(0,100) #integer from 0 to 99

        if random_number < 10: #"<" rather than "<=" so that exactly 10 of the 100 values match
            random_first_name = random_first_name[0] + "."
        
        #randomly select a format for the middle_name (initial, none, full)
        random_number = np.random.randint(0,100)

        if random_number < 30: #these values can be tuned to alter the probability of an initialed or no middle name (currently 30% initial, 10% none)
            random_middle_name = random_middle_name[0] + "."
        elif random_number < 40:
            random_middle_name = ""
        
        #randomly select a format for the last name (initial, full) - 10% chance of an initial
        random_number = np.random.randint(0,100)

        if random_number < 10:
            random_last_name = random_last_name[0] + "."
        
        #pick a random name format (first-middle-last, last-first-middle, first only)
        random_number = np.random.randint(0,3)

        if random_number == 0:
            if random_middle_name != "":
                string = random_first_name + " " + random_middle_name + " " + random_last_name
            else:
                string = random_first_name + " " + random_last_name
        elif random_number == 1:
            string = random_last_name + ", " + random_first_name 
            if random_middle_name != "":
                string += " " + random_middle_name
        else:
            string = random_first_name

    return string #because np.random.choice outputs an array containing the names, this just returns the strings and not the arrays

## Defining a function that generates random numbers between given values or of given length

This is to make creating things such as ages, identification numbers, etc. easier and less repetitive. This isn't a terribly complicated function; the only real computation is converting a requested number of digits into a minimum and maximum when a length is given instead of a range (which will not happen particularly often, but just in case).
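The length-to-range conversion can be sketched as follows. `digit_bounds` is a hypothetical helper name used only for this illustration; the point is that the numbers with exactly n digits run from 10^(n-1) to 10^n - 1, and that random.randint includes both endpoints.

```python
import random

def digit_bounds(length):
    # Hypothetical helper: smallest and largest integers with exactly `length` digits
    return 10 ** (length - 1), 10 ** length - 1

low, high = digit_bounds(4)

# random.randint includes both endpoints, so the upper bound must be 10**length - 1
number = random.randint(low, high)
print(low, high, number)
```

Using 10**length as the upper bound would occasionally return a number that is one digit too long.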

In [6]:
def generate_number(minmax=None, length=None):
    if length is None and minmax is None:
        raise ValueError("requested number generator but gave no data about what kind of number to generate")
    
    if length is not None and minmax is None:
        #smallest and largest numbers with exactly `length` digits (random.randint includes both endpoints, so the maximum is 10**length - 1)
        minmax = [10**(length-1), 10**length - 1]
    
    return random.randint(minmax[0], minmax[1]) #random's random generator is being used here as opposed to numpy's because numpy has issues with generating larger numbers (because of the fact that it can only generate int64 and uint64 integers)

## Defining a function that generates fake ages (over 90)

All ages under 90 are considered common enough such that they don't need obscuring, but those over 90 will be obscured. Therefore, in order to assist with hiding-in-plain-sight, fake ages (over 90, of course) must be generated.

In [7]:
def generate_fake_age(age_min=90, age_max=116):  #116 came from a quick google search for the oldest verified human age (the verified record is actually 122, so age_max can be raised if needed), and all ages under 90 are not obscured
    age_number = generate_number(minmax=[age_min, age_max])
    
    age = age_number #this line is only used to make transitioning to the scripts that add words to the ages more simple
    
    return str(age)

## Defining a function that generates numeric identifiers

These, for the most part, are numbers used by a specific organization to identify certain people (mostly patients). Since we don't know the format or rules regarding these numeric identifiers, we will simply generate a random number of random length.

In [8]:
def generate_fake_numeric_identifier():
    length = random.randint(1, 100)
    return generate_number(length=length)

In [81]:
def generate_fake_provider_number():
    return generate_number(length=10)

In [None]:
def generate_fake_md_number():
    return generate_number(length=6)

In [None]:
def generate_fake_job_number():
    length = random.randint(1, 6)
    return generate_number(length=length)

In [None]:
def generate_fake_radiology_clip_number():
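    #work in progress: the hexadecimal string built below is not yet part of the returned value
    conversion_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, "A", "B", "C", "D", "E", "F"]
    base_number = generate_number(minmax=[0, 255]) #the range must be passed as minmax; generate_number(0, 255) raised a TypeError
    if random.randint(0, 4) > 3:
        #convert base_number to hexadecimal digits by repeatedly dividing by 16 (the old loop never changed temp_number and could run forever)
        hex_digits = ""
        temp_number = base_number
        while temp_number != 0:
            hex_digits = str(conversion_list[temp_number % 16]) + hex_digits
            temp_number = temp_number // 16
    return generate_number(minmax=(1, 20))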

In [80]:
def generate_fake_email_address():
    first_name = generate_fake_name(FORMAT="first").lower()
    last_name = generate_fake_name(FORMAT="last").lower()
    domains = ["@gmail.com", "@yahoo.com", "@hotmail.com"]
    domain = domains[generate_number(minmax=[0, len(domains) - 1])] #the maximum is inclusive, so the last valid index is len(domains) - 1
    
    return first_name + last_name + domain

## Defining a function that calls the generative functions situationally

This will serve as a "hub", calling the corresponding generator function depending on what kind of PHI the main file calls for it to generate.

In [82]:
def generator(PHI_type, PHI_subtype, modifiers, PHI_text):
    
    generated_PHI = None
    
    if PHI_type == "name":
        generated_PHI = generate_fake_name(modifiers["gender"], modifiers["format"])
    elif PHI_type == "ID":
        if PHI_subtype == "numeric_identifier":
            generated_PHI = generate_fake_numeric_identifier()
        elif PHI_subtype == "social_security_number":
            pass
        elif PHI_subtype == "provider_number":
            generated_PHI = generate_fake_provider_number()
        elif PHI_subtype == "md_number":
            generated_PHI = generate_fake_md_number()
        elif PHI_subtype == "job_number":
            generated_PHI = generate_fake_job_number()
        elif PHI_subtype == "clip_number":
            pass
    elif PHI_type == "age":
        generated_PHI = generate_fake_age()
    elif PHI_type == "contact":
        if PHI_subtype == "email_address":
            generated_PHI = generate_fake_email_address()
    else: #just for now, as not all generators have been created
        generated_PHI = "[** " + PHI_text + " **]"
    
    if generated_PHI is None: #covers the subtypes that don't have a generator yet
        generated_PHI = "[** " + PHI_text + " **]"
    
    return generated_PHI
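As the number of PHI types grows, the if/elif chain could instead be written as a lookup table. This is only a sketch of that alternative, not how the hub above works: `GENERATORS`, `dispatch`, and the two stand-in generator functions are hypothetical names, and the stand-ins return fixed values so the example runs on its own.

```python
# Stand-in generators so the sketch runs on its own; the real ones are defined above
def fake_age():
    return "93"

def fake_md_number():
    return 123456

# Hypothetical lookup table keyed by (PHI_type, PHI_subtype)
GENERATORS = {
    ("age", None): fake_age,
    ("ID", "md_number"): fake_md_number,
}

def dispatch(PHI_type, PHI_subtype, PHI_text):
    generator_function = GENERATORS.get((PHI_type, PHI_subtype))
    if generator_function is None:
        # Same fallback as the hub: keep the original text wrapped in a placeholder
        return "[** " + PHI_text + " **]"
    return generator_function()

print(dispatch("age", None, "91"))
print(dispatch("ID", "clip_number", "A3"))
```

One difference to keep in mind: the hub ignores PHI_subtype for ages, whereas this table only matches an age when the subtype is None.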