# COGS 108 - Assignment 3: Data Privacy

### Written By: Liz Izhikevich and Harshita Mangal

## Important

- Rename this file to 'A3_A########.ipynb' (filled in with your student ID) before you submit it. Submit it to TritonED.
- Do not change / update / delete any existing cells with 'assert' in them. These are the tests used to check your assignment. 
    - Changing these will be flagged for attempted cheating. 
- This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted notebook. 
    - This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!

## Overview

We have discussed in lecture the importance and the mechanics of protecting individuals' privacy when they are included in datasets. In particular, in Lecture 11 (April 26th) we introduced the concept of the Safe Harbor Method. The Safe Harbor Method specifies how to protect individuals' identities by telling us which information to remove from a dataset in order to avoid accidentally disclosing personal information. 

In this assignment, we will explore how identity can be decoded from badly anonymized datasets, and also explore using Safe Harbor to anonymize datasets properly. 

### Import Statements

In [1]:
# Import Pandas
# Note: Pandas is all you need! Do not import any other functions / packages.
import pandas as pd

## Part 1: Identifying Data

Data Files:
- anon_user_dat.json
- employee_info.json

You will first be working with a file called 'anon_user_dat.json'. This file contains information about some (fake) Tinder users. When creating an account, each Tinder user was asked to provide their first name, last name, work email (to verify the disclosed workplace), age, gender, phone # and zip code. Before releasing this data, a data scientist cleaned the data to protect the privacy of Tinder's users by removing the obvious personal identifiers: phone #, zip code, and IP address. However, the data scientist chose to keep each user's email address because, when they visually skimmed a couple of the email addresses, none of them seemed to contain the user's actual name. This is where the data scientist made a huge mistake!

We will take advantage of having the work email addresses by finding the employee information of different companies and matching that employee information with the information we have, in order to identify the names of the secret Tinder users!

In [2]:
##################################
# 1a) Load in the 'cleaned' data #
##################################

# Load the json file into a pandas dataframe. Call it 'df_personal'.

# YOUR CODE HERE
jsonfile = 'anon_user_dat.json'
df_personal = pd.read_json(jsonfile)

In [3]:
assert isinstance(df_personal, pd.DataFrame)




In [4]:
#################################
# 1b) Check the first 10 emails #
#################################

# Save the first 10 emails to a Series, and call it 'sample_emails'. 
# You should then print out this Series. 
# The purpose of this is to get a sense of how these work emails are structured
#   and how we could possibly extract where each anonymous user seems to work

# YOUR CODE HERE
sample_emails = df_personal['email'].head(10)
#print (sample_emails)

In [5]:
assert isinstance(sample_emails, pd.Series)


In [6]:
###############################################
# 1c) Extract the Company Name From the Email #
###############################################

# Create a function with the following specifications:
#   Function Name: extract_company
#   Purpose: to extract the company of the email 
#          (i.e., everything after the @ sign but before the .com )
#   Parameter(s): email (string)
#   Returns: The extracted part of the email (string)
#   Hint: This should take 1 line of code. Look into the find('') method. 
#
# You can start with this outline:
#   def extract_company(email):
#      return 
#
# Example Usage: 
#   extract_company("larhe@uber.com") should return "uber"

# YOUR CODE HERE
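def extract_company(email):
    # Start just after the '@' and stop at the first '.' that follows it,
    # so a '.' in the user name (before the '@') cannot cut the result short
    start = email.find('@') + 1
    end = email.find('.', start)
    return email[start:end]
extract_company('gshoreson0@seattletimes.com')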

'seattletimes'

In [7]:
assert extract_company("gshoreson0@seattletimes.com") == "seattletimes"
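
The same extraction can also be applied to a whole column at once. The following is a minimal sketch, assuming a handful of made-up addresses (they are not taken from 'anon_user_dat.json') and using the pandas `Series.str.extract` method with a regular expression that captures the text between the `@` and the first `.` after it.

```python
import pandas as pd

# Hypothetical sample addresses, not taken from anon_user_dat.json
emails = pd.Series([
    "larhe@uber.com",
    "first.last@google.com",
    "nerickssen8@hatena.ne.jp",
])

# Capture everything between '@' and the first '.' that follows it
companies = emails.str.extract(r"@([^.]+)\.", expand=False)
print(companies.tolist())
```

Note that a dot in the user name (as in the second address) does not affect the result, because the pattern only starts matching at the `@`.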


With a little bit of basic sleuthing (aka googling) and web-scraping (aka selectively reading in html code) it turns out that you've been able to collect information about all the present employees/interns of the companies you are interested in. Specifically, on each company website, you have found the name, gender, and age of its employees. You have saved that info in employee_info.json and plan to see if, using this new information, you can match the Tinder accounts to actual names.

In [8]:
#############################
# 1d) Load in employee data #
#############################

# Load the json file into a pandas dataframe. Call it 'df_employee'.

# YOUR CODE HERE
jsonfile = 'employee_info.json'
df_employee = pd.read_json(jsonfile)

In [9]:
assert isinstance(df_personal, pd.DataFrame)


In [10]:
#########################################################
# 1e) Match the employee name with company, age, gender #
#########################################################

# Create a function with the following specifications:
#   Function name: employee_matcher
#   Purpose: to match the employee name with the provided company, age, and gender
#   Parameter(s): company (string), age (int), gender (string)
#   Returns: The employee first_name and last_name like this: return first_name, last_name 
#   Note: If there are multiple employees that fit the same description, first_name and 
#         last_name should each return a list of all possible first names and last names
#         i.e., ['Desmund', 'Kelby'], ['Shepley', 'Tichner']
#
# Hint:
# There are many different ways to code this.
# 1) An inelegant solution is to loop through df_employee 
#    and for each data item see if the company, age, and gender match
#    i.e., for i in range(0, len(df_employee)):
#              if (company == df_employee.loc[i,'company']):
#
# However! The solution above is very inefficient and long, 
# so you should try to look into this:
# 2) Google the df.loc method: It extracts pieces of the dataframe
#    if it fulfills a certain condition.
#    i.e., df_employee.loc[df_employee['company'] == company]
#    If you need to convert your pandas data series into a list,
#    you can do list(result) where result is a pandas "series"
# 
# You can start with this outline:
#   def employee_matcher(company, age, gender):
#      return first_name, last_name

# YOUR CODE HERE
def employee_matcher(company, age, gender):
    # Rows of df_employee that agree on all three attributes
    matches = df_employee.loc[(df_employee['company'] == company)
                              & (df_employee['age'] == age)
                              & (df_employee['gender'] == gender)]
    first_name = list(matches['first_name'])
    last_name = list(matches['last_name'])
    return first_name, last_name
    

In [11]:
assert employee_matcher("google", 41, "Male") == (['Ab'], ['Tetley'])
assert employee_matcher("google", 42, "Male") == (['Desmund', 'Kelby'],
                                                  ['Shepley', 'Tichner'])

In [12]:
####################################
# 1f) Extract all the private Data #
####################################

# - Create 2 empty lists called 'first_names' and 'last_names'
# - Loop through all the people we are trying to identify in df_personal
# - Call the extract_company function (i.e., extract_company(df_personal.loc[i, 'email']) )
# - Call the employee_matcher function 
# - Append the results of employee_matcher to the appropriate lists (first_names and last_names)

# YOUR CODE HERE
first_names = []
last_names =[]
for i in range(0, len(df_personal)):
    # .ix has been removed from pandas; .loc does the same label-based lookup here
    email = df_personal.loc[i, 'email']
    company = extract_company(email)
    age = df_personal.loc[i, 'age']
    gender = df_personal.loc[i, 'gender']
    first, last = employee_matcher(company, age, gender)
    first_names.append(first)
    last_names.append(last)


In [13]:
assert first_names[45:50]== [['Justino'], ['Tadio'], ['Kennith'], ['Cedric'], ['Amargo']]
assert last_names[45:50] == [['Corro'], ['Blackford'], ['Milton'], ['Yggo'], ['Grigor']]
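
The loop in 1f is, in effect, a join between the two tables on company, age, and gender. As a rough sketch of that idea, the example below uses two small made-up tables (`toy_personal` and `toy_employee` are hypothetical stand-ins, not the assignment's data files) and the pandas `merge` method. Unlike `employee_matcher`, a merge produces one row per match rather than lists of names, so it is not a drop-in replacement for the graded solution.

```python
import pandas as pd

# Hypothetical toy tables standing in for df_personal and df_employee
toy_personal = pd.DataFrame({
    "email": ["aaa1@google.com", "bbb2@uber.com"],
    "age": [41, 30],
    "gender": ["Male", "Female"],
})
toy_employee = pd.DataFrame({
    "company": ["google", "google", "uber"],
    "age": [41, 42, 30],
    "gender": ["Male", "Male", "Female"],
    "first_name": ["Ab", "Desmund", "Jane"],
    "last_name": ["Tetley", "Shepley", "Doe"],
})

# Derive the company from the email, then join on the three shared attributes
toy_personal["company"] = toy_personal["email"].str.extract(r"@([^.]+)\.", expand=False)
linked = toy_personal.merge(toy_employee, on=["company", "age", "gender"], how="left")
print(linked[["email", "first_name", "last_name"]])
```

None of the three join columns identifies anyone on its own; it is the combination that singles out one employee per 'anonymous' account.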

In [14]:
#######################################################
# 1g) Add the names to the original 'secure' dataset! #
#######################################################

# We have done this last step for you below, all you should do is uncomment.
# For your own personal enjoyment, you should also print out
#   the new df_personal with the identified people. 

df_personal['first_name'] = first_names
    
df_personal['last_name'] = last_names

df_personal

|  | age | email | gender | first_name | last_name |
|---|---|---|---|---|---|
| 0 | 46 | gshoreson0@seattletimes.com | Male | [Gordon] | [DelaField] |
| 1 | 56 | eweaben1@salon.com | Female | [Elenore] | [Gravett] |
| 2 | 30 | akillerby2@gravatar.com | Male | [Abbe] | [Stockdale] |
| 3 | 87 | gsainz3@zdnet.com | Male | [Guido] | [Comfort] |
| 4 | 58 | bdanilewicz4@4shared.com | Male | [Brody] | [Pinckard] |
| 5 | 39 | sdeerness5@wikispaces.com | Female | [Shalne] | [Smail] |
| 6 | 43 | jstillwell6@ustream.tv | Female | [Joell] | [Bowlesworth] |
| 7 | 37 | mpriestland7@opera.com | Male | [Manfred] | [Bricket] |
| 8 | 35 | nerickssen8@hatena.ne.jp | Female | [Neille] | [McCahey] |
| 9 | 40 | hparsell9@xing.com | Male | [Henri] | [Scotchford] |


We have now just discovered the 'anonymous' identities of all the registered Tinder users...awkward.

## Part 2: Anonymize Data

You are hopefully now convinced that with some seemingly harmless data a hacker can pretty easily discover the identities of certain users. Thus, we will now clean the original Tinder data ourselves according to the Safe Harbor Method in order to make sure that it has been *properly* cleaned...

In [15]:
#############################
# 2a) Load in personal data #
#############################

# Load the user_dat.json file into a pandas dataframe. Call it 'df_users'.
# Note: You might find that using the same method as A2 (or above) leads to an error.
# The file has a slightly different organization. 
#   Try googling the error and finding the fix for it.
# Hint: you can still use 'pd.read_json', you just need to add another argument.

# YOUR CODE HERE

df_users = pd.read_json('user_dat.json', lines=True)

In [16]:
assert isinstance(df_users, pd.DataFrame)
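
The extra argument used in 2a is `lines=True`, which tells `pd.read_json` to treat every line of the input as a separate JSON object. The sketch below illustrates that layout with a made-up two-record sample held in memory; it assumes, based on the fix above, that 'user_dat.json' is organized the same way, and the field values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-record sample: one complete JSON object per line
json_lines = io.StringIO(
    '{"age": 46, "gender": "Male", "zip": 92093}\n'
    '{"age": 56, "gender": "Female", "zip": 10001}\n'
)

# Each line becomes one row of the resulting DataFrame
df_demo = pd.read_json(json_lines, lines=True)
print(df_demo.shape)
```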


In [17]:
################################
# 2b) Drop personal attributes #
################################

# Remove any personal information, following the Safe Harbor method.
# Based on lecture 11, remove any columns from df_users that contain personal information.

# YOUR CODE HERE
df_users = df_users.drop(['email', 'ip_address', 'first_name', 'last_name', 'phone'],axis=1)

In [18]:
assert len(df_users.columns) == 3


In [19]:
###################################
# 2c) Drop ages that are above 90 #
###################################

# Safe Harbor rule C:
#   Drop all the rows which have age greater than 90 from df_users

# YOUR CODE HERE
df_users = df_users[df_users['age'] <= 90]
df_users.shape

(990, 3)

In [20]:
assert df_users.shape==(990, 3)


In [21]:
#############################
# 2d) Load in zip code data #
#############################

# Load the zip_pop.csv file into a (different) pandas dataframe. Call it 'df_zip'.

# YOUR CODE HERE
df_zip = pd.read_csv('zip_pop.csv')

In [22]:
assert isinstance(df_zip, pd.DataFrame)


In [23]:
###################################################
# 2e) Sort zipcodes into "Geographic Subdivision" #
###################################################

# The Safe Harbour Method applies to "Geographic Subdivisions"
#   as opposed to each zipcode itself. 
# Geographic Subdivision:
#   All areas which share the first 3 digits of a zip code
#
# Count the total population for each geographic subdivision
# Warning: you have to be savvy with a dictionary here
# To understand how a dictionary works, check the section materials,
#   use Google, and go to discussion sections!
#
# Instructions: 
# - Create an empty dictionary: zip_dict = {}
# - Loop through all the zip_codes in df_zip
# - Create a dictionary key for the first 3 digits of a zip_code in zip_dict
# - Continually add population counts to the key that contains the 
#     same first 3 digits of the zip code
#
# To extract the population you will find this code useful:
#   population = list(df_zip.loc[df_zip['zip'] == zip_code]['population'])
# To extract the first 3 digits of a zip_code you will find this code useful:
#   int(str(zip_code)[:3])

# YOUR CODE HERE
zip_dict = {}
seen_zips = set()
for zip_code in df_zip['zip']:
    # Count each zip code only once, even if it appears in several rows
    if zip_code in seen_zips:
        continue
    seen_zips.add(zip_code)
    population = list(df_zip.loc[df_zip['zip'] == zip_code]['population'])[0]
    new_zip = int(str(zip_code)[:3])
    zip_dict[new_zip] = zip_dict.get(new_zip, 0) + population

In [24]:
assert isinstance(zip_dict, dict)
assert zip_dict[100] == 1580423


In [25]:
#################################
# 2f) Explain this Code Excerpt #
#################################

# In the cell below, explain in words what the following line of code is doing:
population = list(df_zip.loc[df_zip['zip'] == zip_code]['population'])

In [26]:
# YOUR CODE HERE
print("This filters df_zip down to the rows whose 'zip' value equals zip_code.\n"
      "It then selects the 'population' column of those rows and converts it to a list.")

This filters df_zip down to the rows whose 'zip' value equals zip_code.
It then selects the 'population' column of those rows and converts it to a list.
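
To make the explanation concrete, the line from 2f can be broken into its three steps. The sketch below runs them on a small made-up table (`toy_zip` is a hypothetical stand-in for `df_zip`, and the intermediate names `mask` and `matching_rows` are introduced only for illustration).

```python
import pandas as pd

# Hypothetical rows standing in for df_zip
toy_zip = pd.DataFrame({
    "zip": [92093, 92037, 92093],
    "population": [1000, 2500, 1000],
})
zip_code = 92093

# Step 1: boolean Series, True where the row's zip equals zip_code
mask = toy_zip["zip"] == zip_code
# Step 2: keep only the rows where the mask is True
matching_rows = toy_zip.loc[mask]
# Step 3: take the 'population' column of those rows as a plain list
population = list(matching_rows["population"])
print(population)
```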


In [27]:
#############################
# 2g) Masking the Zip Codes #
#############################

# Go through each user, and update their zip code to Safe Harbor specifications:
#   If the user is from a zip code for which the
#     "Geographic Subdivision" population is less than or equal to 20000:
#        - Change the zip code to 0 
#   Otherwise:
#         - Change the zip code to be only the first 3 digits of the full zip code
# Do all this by rewriting the zip code column of the 'df_users' DataFrame
#
# Hints:
#  - This will be several lines of code, looping through the DataFrame, 
#      getting each zip code, checking the geographic subdivision with 
#      the population in zip_dict, and setting the zip code accordingly. 

# YOUR CODE HERE
for i, row in df_users.iterrows():
    z = int(str(row['zip'])[:3])
    # Assign through .loc so the write lands on df_users itself,
    # rather than on a temporary copy (chained indexing)
    if zip_dict[z] <= 20000:
        df_users.loc[i, 'zip'] = 0
    else:
        df_users.loc[i, 'zip'] = z


In [28]:
assert len(df_users) == 990
assert sum(df_users.zip == 0) == 2
assert df_users.ix[671, 'zip'] == 0
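
For comparison, the masking rule can also be written without a loop. The following is a sketch under stated assumptions: `toy_users` and `toy_zip_dict` are made-up stand-ins for `df_users` and `zip_dict`, and the population figures are hypothetical. It uses `Series.map` to look up each prefix's population and `Series.where` to keep the prefix only when that population exceeds 20000.

```python
import pandas as pd

# Hypothetical users and subdivision populations (not the assignment's data)
toy_users = pd.DataFrame({"zip": [92093, 10001, 82190]})
toy_zip_dict = {920: 500000, 100: 1500000, 821: 15000}

# First three digits of each user's zip code
prefix = toy_users["zip"].astype(str).str[:3].astype(int)
# Population of the geographic subdivision each user belongs to
subdivision_pop = prefix.map(toy_zip_dict)
# Keep the prefix where the subdivision has more than 20000 people, else 0
toy_users["zip"] = prefix.where(subdivision_pop > 20000, 0)
print(toy_users["zip"].tolist())
```

The assignment asks for a loop, so this is meant only to show the rule in compact form.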



In [29]:
##########################################################
# 2h) Save out the properly anonymized data to json file #
##########################################################

# Save out df_users as a json file, called 'real_anon_user_dat.json'

# YOUR CODE HERE
df_users.to_json('real_anon_user_dat.json')

In [31]:
assert isinstance(pd.read_json('real_anon_user_dat.json'), pd.DataFrame)


Congrats, you're done! The users' identities are much more protected now. 

Submit this notebook file to TritonED.