# Data Privacy and Data Anonymization

Many data science problems and associated solutions are based upon data about individuals or groups of people.  As a result, data scientists often have access to data about people and that data often can be directly associated with a particular individual.  Individuals have legal and ethical rights concerning the protection of data about them.  Data scientists must be aware of those rights and take appropriate actions to protect the privacy of individuals and groups.

Data often includes fields that can be used to directly identify individuals.  Such data is referred to as ***Personally Identifiable Information (PII).***

## Personally Identifiable Information (PII)

There is no single, clear definition of which data counts as PII.  Different organizations have different guidelines and these guidelines change frequently.

As an example, here is a description of PII from the US Department of Labor:

> Any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. 
>
> PII is defined as information that directly identifies an individual, e.g.:
> - name and address
> - social security number or other identifying number or code 
> - telephone number 
> - email address
> 
> or information that indirectly identifies an individual, e.g.: 
> - data elements may include a combination of gender, race, birth date, geographic indicator, and other descriptors
>
> Additionally, information permitting the physical or online contacting of a specific individual is the same as personally identifiable information. This information can be maintained in either paper, electronic or other media.
>
> See [Guidance on Protection of Personally Identifiable Information](https://www.dol.gov/general/ppii) from the U.S. Department of Labor.

Here are some data items often considered PII because they can be used to ***directly*** identify an individual (see [What Is Personally Identifiable Information (PII)? Types and Examples](https://www.investopedia.com/terms/p/personally-identifiable-information-pii.asp) and the [NIST description of PII](https://csrc.nist.gov/glossary/term/personally_identifiable_information)):

- Full name 
- Postal address 
- Government identifiers (e.g., Social Security Number (SSN), tax IDs, driver’s license, passport IDs)
- Financial identifiers (e.g., credit/debit card numbers, account numbers)
- Online communication identifiers (e.g., email address, social handles)
- Account logins (e.g., website login names)
- Biometric identifiers (e.g., fingerprints, DNA, facial recognition)

Stricter interpretations, such as that used in the [European Union General Data Protection Regulation (GDPR)](https://gdpr.eu/eu-gdpr-personal-data/), also include additional online identifiers such as:

- Internet protocol (IP) address
- Cookie IDs
- RFIDs

Different regions have different definitions and rules.  For example, the [California Consumer Privacy Act (CCPA)](https://www.oag.ca.gov/privacy/ccpa) also includes precise geolocation as PII (or "sensitive personal information").

In addition, the provision that individuals might be ***indirectly*** identified means that other types of data can potentially be considered PII, such as:

- Zip code
- Race or ethnicity
- Gender
- Date of birth
- Place of birth
- Mother's maiden name
- Religion

Combinations of these fields (and others) might be used to "indirectly" identify an individual within a group, perhaps in combination with other "side data" held by an attacker.  The paper ["Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization" by Paul Ohm](https://www.uclalawreview.org/pdf/57-6-3.pdf) gives some examples of "indirect" identification in the real world.

Since the definition of what constitutes PII continues to evolve and is somewhat subject to interpretation, in practice you will likely rely on guidance from your organization for legal requirements and best practices associated with handling PII.  

## Protecting Medical Information

Medical information is subject to strict data protection laws beyond those associated with PII.

The [Health Insurance Portability and Accountability Act of 1996 (HIPAA)](https://www.cdc.gov/phlp/publications/topic/hipaa.html) established a comprehensive set of rights and requirements associated with medical data.

The U.S. Department of Health and Human Services provides information on the [rights of individuals under HIPAA](https://www.hhs.gov/hipaa/for-individuals/index.html).

The U.S. Department of Health and Human Services also provides a [Summary of the HIPAA Privacy Rule](https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html) that identifies data elements covered by the rule.  A portion of that summary is below:

> Protected Health Information. 
> 
> The Privacy Rule protects all "individually identifiable health information" held or transmitted by a covered entity or its business associate, in any form or media, whether electronic, paper, or oral. The Privacy Rule calls this information "protected health information (PHI)."
>
> "Individually identifiable health information" is information, including demographic data, that relates to:
> 
> - the individual's past, present or future physical or mental health or condition,
> - the provision of health care to the individual, or
> - the past, present, or future payment for the provision of health care to the individual,
> 
> and that identifies the individual or for which there is a reasonable basis to believe it can be used to identify the individual. Individually identifiable health information includes many common identifiers (e.g., name, address, birth date, Social Security Number).

## Data Anonymization or De-Identification

It is usually permissible to use and hold both PII and medical information provided that the data has been "anonymized" or "de-identified."  

Common techniques for anonymizing data include:

- Dropping PII and other sensitive data
- Masking (e.g., encoding, hashing, or encrypting data values)
- Pseudonymization (e.g., substituting false identifiers for true identifiers)
- Generalizing (e.g., changing specific values to ranges or range identifiers)
- Perturbing (e.g., adding noise or slightly altering true values)
- Shuffling (e.g., rearranging data record or column order)
- Synthetic data (e.g., generate data that follows statistical characteristics of true data)


Here are a few links to web resources discussing data anonymization:

- https://en.wikipedia.org/wiki/Data_anonymization
- https://www.imperva.com/learn/data-security/anonymization/
- https://www.splunk.com/en_us/blog/learn/data-anonymization.html

### Limits of Synthetic Data

Using synthetic data clearly limits the risks to individuals, but this technique is only useful when the synthetic data is adequate for the task.  It is often difficult to generate synthetic data that realistically models data from the real world.

However, for academic exercises this is a good strategy and the assignment associated with this topic uses data that was synthetically generated.

### Limits of Other Anonymization Techniques

**Pseudonymization**, substituting a false identifier for a true identifier, is sometimes useful.  In practice, however, it is often both acceptable and necessary to re-identify the data after the anonymized form has been processed.  The common technique to enable re-identification is to maintain a separate mapping of the pseudonymous identifier to the true identifier (i.e., a cross reference table).  This table is kept separately from the anonymized data, but since it contains the mapping back to the "identified" world, it must be carefully protected from theft.
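As a minimal sketch of the cross reference idea, the snippet below replaces usernames with random tokens and keeps the mapping needed for re-identification.  The function name `pseudonymize` and the choice of 8-byte random tokens are assumptions made for illustration, not a prescribed scheme.

```python
import secrets

def pseudonymize(identifiers):
    """Return (pseudonyms, crosswalk) for a list of true identifiers.

    `crosswalk` maps pseudonym -> true identifier.  It is the cross
    reference table and must be stored apart from the anonymized data.
    """
    forward = {}    # true identifier -> pseudonym, so repeats map consistently
    crosswalk = {}  # pseudonym -> true identifier
    pseudonyms = []
    for ident in identifiers:
        if ident not in forward:
            token = secrets.token_hex(8)      # random; reveals nothing about ident
            while token in crosswalk:         # guard against an (unlikely) repeat
                token = secrets.token_hex(8)
            forward[ident] = token
            crosswalk[token] = ident
        pseudonyms.append(forward[ident])
    return pseudonyms, crosswalk

usernames = ['zbishop', 'sotomark', 'zbishop']
pseudonyms, crosswalk = pseudonymize(usernames)

# Re-identification is only possible for someone holding the crosswalk.
restored = [crosswalk[p] for p in pseudonyms]
```

Anyone who obtains `crosswalk` can undo the pseudonymization, which is why the table needs the same protection as the original data.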

**Generalizing** and **perturbing** data are often only applicable to less sensitive data where precise accuracy is not required and are often fairly weak protections.  Consider whether the knowledge that someone makes \\$72,342 per year is significantly more risky than knowing the person makes between \\$50,000 and \\$75,000.
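To make the salary example concrete, here is a small sketch of both techniques.  The bucket width of 25,000 and the noise bound of 1,000 are arbitrary assumptions, and the helper names are hypothetical.

```python
import random

def generalize_salary(salary, width=25_000):
    """Replace an exact salary with the range (bucket) that contains it."""
    low = (salary // width) * width
    return f'{low:,}-{low + width - 1:,}'

def perturb_salary(salary, max_noise=1_000, rng=random):
    """Add uniform integer noise in [-max_noise, +max_noise] to a salary."""
    return salary + rng.randint(-max_noise, max_noise)

exact = 72_342
bucket = generalize_salary(exact)                     # the range containing 72,342
noisy = perturb_salary(exact, rng=random.Random(0))   # seeded only to be repeatable

print(bucket, noisy)
```

Both results still reveal roughly what the person earns, which illustrates why these techniques are comparatively weak protections.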

**Shuffling** column values is often fairly easy to unravel.  Shuffling rows (records) in a table or a file is sometimes a good idea.  For example, a data provider may know the order of the records (i.e., which record belongs with which person) in a file without including any PII fields or columns.  If we process that file, adding attributes, then return the file to the provider, the provider can easily re-identify the individuals associated with the data.   Randomizing the row/record order (e.g., sorting on an added random field) before returning the file to the provider can prevent this re-identification. 
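A sketch of row shuffling with pandas follows.  The tiny `received` table is invented for illustration; `DataFrame.sample(frac=1)` returns all rows in random order, and resetting the index discards the original row positions.

```python
import pandas as pd

# Hypothetical file from a provider: no PII columns, but the provider
# knows which person each row position corresponds to.
received = pd.DataFrame({'segment': ['A', 'B', 'A', 'C']})

# Add an attribute, as we might during processing.
received['score'] = [0.2, 0.9, 0.4, 0.7]

# Shuffle the rows and drop the old index so the original order is not returned.
returned = received.sample(frac=1).reset_index(drop=True)

print(returned)
```

Shuffling only helps when the remaining columns do not themselves single out a row; if the provider can match on a unique column value, the row order no longer matters.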

As a result of these limitations, we are going to focus on only two general techniques for anonymization: 
1. Dropping data
2. Masking data

Dropping data (usually columns, but sometimes rows) is straightforward and we have covered how to do that in other lectures.

There are four common ways to mask data:

- Encoding
- Encrypting
- Hashing
- Salted Hashing

**Encoding** data means finding some scheme to transform the data in such a way that the original value is obscured, but that can be reversed to retrieve the original data.  An example of encoding might be a simple substitution code where letters are swapped from the original to the "encoded" value.  Unfortunately, it is difficult to devise encoding techniques that are not easily broken by a dedicated attacker, so we will not use this strategy.
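As an illustration of why simple encodings are weak, the sketch below uses a letter substitution (ROT13, chosen here only as a familiar example) built with `str.maketrans`.  Anyone who guesses the scheme can reverse it just as easily as we can.

```python
import string

# Substitution alphabet: each lowercase letter maps to the one 13 places later.
plain = string.ascii_lowercase
shifted = plain[13:] + plain[:13]

encode_table = str.maketrans(plain, shifted)
decode_table = str.maketrans(shifted, plain)

encoded = 'zbishop'.translate(encode_table)   # obscured, but not protected
decoded = encoded.translate(decode_table)     # trivially reversed

print(encoded, decoded)
```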

**Encrypting** data uses an encryption key to encrypt the original data into an encrypted form that can only be decrypted using a particular key.  Encryption algorithms use mathematical techniques to make it very hard for an attacker to break the encryption.  

It is notoriously hard to make a truly effective encryption algorithm, but fortunately there are widely accepted algorithms and implementations we can use.

The main challenge with encryption is that both parties in a transaction (sender and receiver) must carefully manage and protect the encryption/decryption keys so that they are not compromised.

From an anonymization perspective, another potential problem with encrypting data is that the original data can be reproduced *directly* from the encrypted data if successfully decrypted (e.g., because the encryption keys are compromised).

Still, encryption is a powerful form of data protection and you will often encrypt data, often at the file level, to protect data at rest (i.e., stored) or data during transmission.  This use of encryption is often in concert with other forms of masking discussed below.

**Hashing** data uses algorithms that perform a one-way transformation of input data into a different hashed output value.  The transformation is predictable, meaning you always get the same output value for the same input value.  Furthermore, the transformation is one-way, meaning that unlike encryption algorithms, there is no practical algorithm or mechanism to convert the hashed output value back into the original input data.  It is possible for hashing algorithms to have *collisions*, where two distinct input values produce the same hashed output value.  Collisions cannot be eliminated entirely, because a fixed-length output must cover an unlimited number of possible inputs, but modern hash algorithms generate long enough values that collisions are extremely unlikely in practice.  As with encryption algorithms, hash algorithms are readily available and we can use the algorithms from modules without coding the algorithms.

A challenge with hashing is that if an attacker also possesses some or all of the input values to be hashed, the attacker can independently compute the same hashes.  This would allow the attacker to match your hashed values with known values.
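The matching attack can be sketched in a few lines.  The list `known_usernames` stands in for whatever side data an attacker might hold and is purely hypothetical.

```python
import hashlib

def sha256_hex(value):
    """Unsalted SHA-256 of a string, as a hex digest."""
    return hashlib.sha256(value.encode()).hexdigest()

# "Anonymized" values released by the data holder.
released = [sha256_hex(u) for u in ['zbishop', 'vmoyer']]

# Values the attacker already knows (hypothetical side data).
known_usernames = ['gsnyder', 'vmoyer', 'zbishop', 'someoneelse']
lookup = {sha256_hex(u): u for u in known_usernames}

# Matching released hashes against the attacker's table recovers the inputs.
recovered = [lookup.get(h) for h in released]
print(recovered)
```

No hash was reversed here; the attacker simply recomputed the same hashes from known candidates.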

**Salted Hashing** combines the ideas of a one-way hash with encryption keys.  Salted hash algorithms perform a one-way hash, but use a secret key, called a "salt", as part of the hash process.  Using this salt means that an attacker must not only have access to the original source data and hash algorithm, but also the salt value.  This makes it significantly harder for the attacker to compromise the anonymization scheme, so this will be our preferred mechanism for masking.  Of course, the salt values (like encryption keys) must be protected from compromise.  



---
## PII Example

There is a lab associated with this lecture on anonymizing a data file.  The examples here will be useful when working on the lab.

In [1]:
import pandas as pd
import random
import hashlib

In [2]:
samplefile = 'pii_sample.csv'

In [3]:
sample = pd.read_csv(samplefile)

In [4]:
sample

Unnamed: 0,name,street_address,city,state,zipcode,sex,username,purchases
0,Daniel Jacobs,2286 Karen Track Apt. 009,Matthewfort,MI,10216,M,zbishop,13926.58
1,Gerald Barajas,539 Thomas Plaza,New Monica,PA,57349,M,sotomark,3206.79
2,Michael Fleming,1923 Maxwell Mount Suite 561,Morgantown,MA,7841,M,kirktrevor,5628.12
3,Veronica Russell,9267 Kiara Stream Suite 112,Edwardsmouth,IL,51943,F,vmoyer,6215.65
4,Robert Bautista,85716 Rice Turnpike,Anitamouth,WY,1379,M,sarahmcgee,8389.27
5,Peter Miller,7444 Jackson Flat Apt. 498,New Caroline,WY,76759,M,gsnyder,3411.65
6,Andrew Stevens,50794 Isaac Coves,New Richardland,KY,88180,M,traceyfisher,6756.17
7,Janice Oneal,036 Moore Underpass Apt. 051,South Sherriton,SC,56599,F,steven14,115.95
8,Steven Dennis,9347 Zachary Forest Apt. 188,Lake Samanthaville,UT,23162,M,andrewlewis,12412.68
9,Laura Smith,7568 Dunn Springs,Port Sherri,OR,80323,F,holdermike,5550.86


## Augment the sample data

Let's add some new columns with random non-PII data to our sample data.

We are going to use `apply` to create values for these new columns.

Using `apply` to create a column is straightforward when using the contents of an existing column.  The value in the existing source column for each row is passed as the first parameter to the applied function. 

For example, our current data has `zipcode` as a numeric field, which leaves off the leading zeros on the zip.  Let's create a new column called `full zipcode` that is a string with appropriate leading zeros.

In [5]:
sample['full zipcode'] = sample['zipcode'].apply(lambda z: f'{z:05d}')

In [6]:
sample

Unnamed: 0,name,street_address,city,state,zipcode,sex,username,purchases,full zipcode
0,Daniel Jacobs,2286 Karen Track Apt. 009,Matthewfort,MI,10216,M,zbishop,13926.58,10216
1,Gerald Barajas,539 Thomas Plaza,New Monica,PA,57349,M,sotomark,3206.79,57349
2,Michael Fleming,1923 Maxwell Mount Suite 561,Morgantown,MA,7841,M,kirktrevor,5628.12,07841
3,Veronica Russell,9267 Kiara Stream Suite 112,Edwardsmouth,IL,51943,F,vmoyer,6215.65,51943
4,Robert Bautista,85716 Rice Turnpike,Anitamouth,WY,1379,M,sarahmcgee,8389.27,01379
5,Peter Miller,7444 Jackson Flat Apt. 498,New Caroline,WY,76759,M,gsnyder,3411.65,76759
6,Andrew Stevens,50794 Isaac Coves,New Richardland,KY,88180,M,traceyfisher,6756.17,88180
7,Janice Oneal,036 Moore Underpass Apt. 051,South Sherriton,SC,56599,F,steven14,115.95,56599
8,Steven Dennis,9347 Zachary Forest Apt. 188,Lake Samanthaville,UT,23162,M,andrewlewis,12412.68,23162
9,Laura Smith,7568 Dunn Springs,Port Sherri,OR,80323,F,holdermike,5550.86,80323


It is a bit trickier if you don't actually need the values from an existing column to create a new column.

However, nothing says you must actually use the value of the existing column.  This means you can simply use one column as a "dummy" for invoking the function on each row.

In [7]:
def get_random_int(x, low=0, high=20):
    return random.randint(low, high)

In [8]:
sample['store visits'] = sample['name'].apply(get_random_int)

In [9]:
sample['web visits'] = sample['name'].apply(get_random_int, high=100)

In [10]:
sample['complaints'] = sample['name'].apply(get_random_int, high=5)

In [11]:
sample

Unnamed: 0,name,street_address,city,state,zipcode,sex,username,purchases,full zipcode,store visits,web visits,complaints
0,Daniel Jacobs,2286 Karen Track Apt. 009,Matthewfort,MI,10216,M,zbishop,13926.58,10216,6,26,4
1,Gerald Barajas,539 Thomas Plaza,New Monica,PA,57349,M,sotomark,3206.79,57349,6,69,0
2,Michael Fleming,1923 Maxwell Mount Suite 561,Morgantown,MA,7841,M,kirktrevor,5628.12,07841,16,71,4
3,Veronica Russell,9267 Kiara Stream Suite 112,Edwardsmouth,IL,51943,F,vmoyer,6215.65,51943,4,89,0
4,Robert Bautista,85716 Rice Turnpike,Anitamouth,WY,1379,M,sarahmcgee,8389.27,01379,8,96,1
5,Peter Miller,7444 Jackson Flat Apt. 498,New Caroline,WY,76759,M,gsnyder,3411.65,76759,11,29,3
6,Andrew Stevens,50794 Isaac Coves,New Richardland,KY,88180,M,traceyfisher,6756.17,88180,1,79,1
7,Janice Oneal,036 Moore Underpass Apt. 051,South Sherriton,SC,56599,F,steven14,115.95,56599,3,92,1
8,Steven Dennis,9347 Zachary Forest Apt. 188,Lake Samanthaville,UT,23162,M,andrewlewis,12412.68,23162,15,22,4
9,Laura Smith,7568 Dunn Springs,Port Sherri,OR,80323,F,holdermike,5550.86,80323,5,34,3


## Column Cardinality Relative to Observations

The **cardinality** of a column is the number of distinct values within the column.  Columns (or, more generally, a collection of columns used together) that have high cardinality (i.e., a lot of distinct values) may provide a hint whether that column could potentially be used to identify the individuals within a set of data.  If the column cardinality approaches the number of observations in the data, it may be prudent to consider whether or not the column may pose an identification risk.

Note that some columns are inherently high cardinality, but cannot reasonably be considered a threat for direct re-identification. 

Let's calculate the cardinality of columns in our sample data.

The Series method `nunique` counts the unique values in a series, so it is pretty easy to get the cardinality for each column in a dataframe.

In [12]:
sample.apply(pd.Series.nunique)

name              10
street_address    10
city              10
state              9
zipcode           10
sex                2
username          10
purchases         10
full zipcode      10
store visits       9
web visits        10
complaints         4
dtype: int64

How many observations do we have?

In [13]:
print(f'There are {sample.shape[0]} observations.')

There are 10 observations.
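One way to compare cardinality with the number of observations is to compute the ratio directly.  The sketch below uses a small made-up frame, and the 0.9 cutoff is an arbitrary assumption for flagging columns that deserve a closer look, not an established rule.

```python
import pandas as pd

df = pd.DataFrame({
    'username': ['zbishop', 'sotomark', 'kirktrevor', 'vmoyer'],
    'sex': ['M', 'M', 'M', 'F'],
    'state': ['MI', 'PA', 'MA', 'MI'],
})

# Fraction of rows carrying a distinct value, per column
# (1.0 means every value in the column is unique).
uniqueness = df.nunique() / len(df)

# Hypothetical threshold for columns worth reviewing.
flagged = uniqueness[uniqueness >= 0.9].index.tolist()

print(uniqueness)
print(flagged)
```

A flagged column still needs human judgment: a purchase total may be unique in every row without being a practical identifier.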


Which columns appear to pose a risk and which do not?

Remember, high cardinality might be an expected result and not pose a risk.

Let's just look at the name and address components.

In [14]:
cols = ['name', 'street_address', 'city', 'state', 'full zipcode']

In [15]:
sample[cols].apply(pd.Series.nunique)

name              10
street_address    10
city              10
state              9
full zipcode      10
dtype: int64

The cardinality of each of the fields above doesn't really tell us if the particular name and address combinations (i.e., all the components together) are unique in the data.  Let's put them together.

In [16]:
full_name_addr = sample[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

The full name and address are too wide for the default column display, so let's remove the limit.

In [17]:
pd.set_option('display.max_colwidth', None)

In [18]:
full_name_addr

0              Daniel Jacobs 2286 Karen Track Apt. 009 Matthewfort MI 10216
1                       Gerald Barajas 539 Thomas Plaza New Monica PA 57349
2          Michael Fleming 1923 Maxwell Mount Suite 561 Morgantown MA 07841
3        Veronica Russell 9267 Kiara Stream Suite 112 Edwardsmouth IL 51943
4                   Robert Bautista 85716 Rice Turnpike Anitamouth WY 01379
5             Peter Miller 7444 Jackson Flat Apt. 498 New Caroline WY 76759
6                 Andrew Stevens 50794 Isaac Coves New Richardland KY 88180
7        Janice Oneal 036 Moore Underpass Apt. 051 South Sherriton SC 56599
8    Steven Dennis 9347 Zachary Forest Apt. 188 Lake Samanthaville UT 23162
9                        Laura Smith 7568 Dunn Springs Port Sherri OR 80323
dtype: object

Restore the default column display width.

In [19]:
pd.reset_option('display.max_colwidth')

In [20]:
full_name_addr.nunique()

10

So, the name and address components taken as a unit are different for every observation.
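The same kind of check can be made without building a joined string.  The sketch below, on a made-up frame, counts distinct combinations with `drop_duplicates` and uses `duplicated(keep=False)` to find rows whose combination of values appears only once.

```python
import pandas as pd

df = pd.DataFrame({
    'sex':   ['M', 'M', 'F', 'F', 'M'],
    'state': ['WY', 'WY', 'IL', 'SC', 'KY'],
})
cols = ['sex', 'state']

# Number of distinct (sex, state) combinations.
n_combinations = len(df[cols].drop_duplicates())

# Rows whose combination is shared with no other row; these are the
# observations that the combination of columns singles out.
unique_rows = df[~df.duplicated(subset=cols, keep=False)]

print(n_combinations)
print(unique_rows)
```

This works for columns that are not obvious PII as well, which is where indirect identification tends to hide.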

Of course, we already knew that name and address is perhaps the best known form of PII.

---
## Masking Data with Secure Hashes

The `hashlib` module has implementations of many secure hashes. 

See: https://docs.python.org/3/library/hashlib.html

As examples, let's generate some different kinds of hashes on some sample data.

In [21]:
sample_user = sample.loc[0, 'username']
sample_user

'zbishop'

### Unsalted Hashes

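Generate an MD5 hash of the sample username.  MD5 is no longer considered secure against collisions; it appears here only as a simple first example.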

In [22]:
hashlib.md5(sample_user).hexdigest()

TypeError: Strings must be encoded before hashing

What happened here?

Strings (`str`) are a distinct type in Python: a sequence of Unicode characters, not just an array of bytes.  Bytes are a separate type (`bytes`).  Depending upon the encoding, a character in a string is not necessarily stored as a single byte, so converting to `bytes` gives the hash function an unambiguous array of bytes.

The hashing functions expect `bytes` (or bytes-like) input, not strings.

The `encode()` string method converts a Python `str` to `bytes` (using UTF-8 by default).  We can use that to invoke the hash function.
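A quick sketch of the difference between characters and bytes, using an invented value that contains one non-ASCII character:

```python
text = 'Zo\u00eb'            # 'Zoë': three characters, one of them non-ASCII
data = text.encode()         # UTF-8 is the default encoding

n_chars = len(text)          # counts characters
n_bytes = len(data)          # counts bytes; 'ë' needs two bytes in UTF-8
round_trip = data.decode()   # bytes back to the original str

print(type(data).__name__, n_chars, n_bytes, round_trip == text)
```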

In [None]:
hashlib.md5(sample_user.encode()).hexdigest()

In [None]:
h = hashlib.md5(sample_user.encode()).hexdigest()

The MD5 algorithm produces a 128-bit value.  That translates to 128 bits / 8 bits per byte = 16 bytes.

Hex encoding the value produces two hex characters per byte, so the hex-encoded digest is 32 characters long.

In [None]:
len(h)

Longer hashes are generally considered more secure, in part because collisions become even less likely.

In [None]:
h = hashlib.sha256(sample_user.encode()).hexdigest()

In [None]:
len(h)

In [None]:
h = hashlib.sha512(sample_user.encode()).hexdigest()

In [None]:
len(h)
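The lengths checked in the cells above follow directly from each algorithm's digest size.  As a compact sketch, `hashlib.new` lets us loop over algorithm names and compare the digest size in bytes with the length of the hex string:

```python
import hashlib

value = 'zbishop'.encode()

hex_lengths = {}
for name in ('md5', 'sha256', 'sha512'):
    h = hashlib.new(name, value)
    # digest_size is in bytes; the hex string uses two characters per byte.
    hex_lengths[name] = len(h.hexdigest())
    print(name, h.digest_size * 8, 'bits,', hex_lengths[name], 'hex characters')
```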

### Salted Hash

The hashes above ("unsalted") are not reversible (i.e., one-way), but the same values can be independently generated by two parties if the two parties each have access to the same inputs.  For "well known" fields, such as email addresses, this is a problem since the data values are widely disseminated throughout the world.  This means that adversaries with access to this well known data could independently match the hash to the original value, thwarting our attempts at anonymization.

To address this challenge, hash methods often support adding a "salt" (i.e., sort of a secret key) that is added to the computation of the hash.  To reproduce the salted hash value, an adversary would have to (1) have access to the same original data values, (2) use the same hash algorithm, and most importantly (3) have access to the same salt value.

The `hashlib` module supports salted hashes through the `pbkdf2_hmac` method.

The salt must be a `bytes` value, as with the input values passed to the hash.  The `hashlib` documentation recommends a salt of about 16 or more bytes.

We can directly produce a byte representation of a string literal by using `b'<string>'` notation.

In [None]:
uca_salt = b'University of Central Arkansas Bears'
ua_salt  = b'University of Arkansas Razorbacks'
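The two salts above are readable phrases chosen for illustration.  As a hedged alternative, a salt can be generated randomly with the standard library's `secrets` module; the variable names below are assumptions, and the salt would need to be stored securely so the same hashes can be reproduced later.

```python
import hashlib
import secrets

# 16 random bytes from the operating system's secure random source.
random_salt = secrets.token_bytes(16)

h1 = hashlib.pbkdf2_hmac('sha256', 'zbishop'.encode(), random_salt, 1000)
h2 = hashlib.pbkdf2_hmac('sha256', 'zbishop'.encode(), random_salt, 1000)

# The same input and salt always give the same hash, so losing the salt
# means losing the ability to regenerate matching values.
print(h1.hex() == h2.hex())
```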

In [None]:
h = hashlib.pbkdf2_hmac('sha256', sample_user.encode(), uca_salt, 1000)
h.hex()

In [None]:
hashlib.pbkdf2_hmac('sha256', sample_user.encode(), uca_salt, 1000).hex()

Now generate the hash using the other salt.  Note that a different salt produces a different result for the same input.

In [None]:
h = hashlib.pbkdf2_hmac('sha256', sample_user.encode(), ua_salt, 1000)
h.hex()

Generate salted hashes using a different underlying hash algorithm.

In [None]:
hashlib.pbkdf2_hmac('sha512', sample_user.encode(), uca_salt, 1000).hex()

In [None]:
hashlib.pbkdf2_hmac('sha512', sample_user.encode(), ua_salt, 1000).hex()

## <span style="color:red">Exercise</span>

How would you apply the salted hash of the `username` column of the sample table to create a new column containing the hashed result for each row of the table?