# DATA ANONYMIZATION

In [145]:
# import required libraries
import pandas as pd
import numpy as np

## Section 1: Setup

In [None]:
# Load created `students_data.csv` (fake dataset created by `fake_dataset_creation.py`)
students = pd.read_csv("students_data.csv")

# Set display options to show all columns
pd.set_option('display.max_columns', None)

# Check loading is correct 
students.head()

## Section 2: Data Anonymization

In [None]:
# Check columns
students.columns

### Direct Variables

The columns 'Eye_color', 'Hair_color', 'First_Name', and 'Last_Name' are not needed for the proposed analysis. Since they are sensitive attributes that can directly identify the student, a `de-identification technique` is applied: these columns are removed.

Students will instead be identified by an 'ID' equal to the index of the row to which they belong.

In [None]:
# remove 'Eye_color', 'Hair_color', 'First_Name' and 'Last_Name' columns
students.drop(['Eye_color', 'Hair_color', 'First_Name', 'Last_Name'], axis=1, inplace=True)

# add row id as student identifier
students["ID"] = students.index

# Check result
students.head()

The column names 'Previous_year_grades' and 'Current_year_grades' reveal that the data pertains to students. An attacker could cross-reference these values with other sources to identify a specific student.

The objective of the analysis is only to determine whether the new product improves academic performance. Therefore, to mitigate the risk of inference through cross-referencing, a `data masking technique` will be applied by creating an anonymized 'Performance' indicator: 1 if the grades improved and 0 otherwise.

The word 'grades' will be avoided entirely throughout the dataset.

In [None]:
# create a function that returns 1 if the grades improved and 0 otherwise
def performance(prev, actual):
    if actual > prev:
        return 1
    return 0

# apply function to the dataset
students["Performance"] = students.apply(lambda row: performance(row['Previous_year_grades'], row['Current_year_grades']), axis=1)

# remove grades columns
students.drop(['Previous_year_grades','Current_year_grades'] , axis=1, inplace=True)

# check the result
students.head()
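
As a side note, the same 0/1 indicator can be computed without a row-wise `apply`. The sketch below is only an illustration on made-up values; the `toy` DataFrame is hypothetical and is not part of the dataset.

```python
import pandas as pd

# Toy data; the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "Previous_year_grades": [6.0, 7.5, 8.0],
    "Current_year_grades": [7.0, 7.5, 6.5],
})

# A strict ">" comparison yields booleans; astype(int) turns them into 0/1.
toy["Performance"] = (
    toy["Current_year_grades"] > toy["Previous_year_grades"]
).astype(int)

print(toy["Performance"].tolist())  # [1, 0, 0]
```

Equal grades count as no improvement here, which matches the strict comparison used in the `performance` function.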

Lastly, there is the 'Age' column. To determine whether any dissociation or anonymization technique should be applied, we will analyze the data distribution.

In [None]:
# obtain 'Age' min & max values (names chosen to avoid shadowing the built-ins min and max)
min_age = students['Age'].min()
max_age = students['Age'].max()

print(f"Age min value: {min_age} and max value: {max_age}")

# obtain instances per 'Age'
students["Age"].value_counts().sort_index()

The data distribution is homogeneous; however, the age range is very narrow, spanning only from 0 to 26 years.

Therefore, the 'Age' column needs to be anonymized to obscure this range, so that any attacker attempting to interpret the data will not be able to identify it as belonging to students, especially since certain age groups are more common among students.

`Data perturbation` will be applied, introducing small random changes to the age values to make it more difficult to map the data to specific individuals.

Subsequently, we will `generalize` the ages into broader ranges to further anonymize the data. The original 'Age' column will, of course, be removed.

In this way, we ensure that even if some information is leaked, the exact age of a student remains concealed within a broader category.

In [None]:
# First: Data Perturbation (adding noise)
np.random.seed(42)  # For reproducibility
students['Age_Perturbed'] = students['Age'] + np.random.normal(0, 1, len(students))

# check the result
students["Age_Perturbed"].value_counts().sort_index()

In [None]:
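# Second: Data Generalization (avoiding age ranges in column name)
# Define the bins and labels for generalization into two groups.
# The outer edges are open-ended so that perturbed values below 0 or above
# the original maximum still fall into a group instead of becoming NaN.
bins = [-np.inf, 17, np.inf]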
labels = ['Early Years', 'Later Years']

# Apply the new bins and labels to create the 'Age_Range' column
students['Age_Range'] = pd.cut(students['Age_Perturbed'], bins=bins, labels=labels, right=False)

# Check the result
print(students["Age_Range"].value_counts().sort_index())
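
One detail worth checking at this step is how `pd.cut` treats values that fall outside the outermost bin edges, since the added noise can push an age slightly below the original minimum or above the original maximum. The sketch below uses hypothetical perturbed values (not taken from the dataset) to compare fixed outer edges with open-ended ones.

```python
import numpy as np
import pandas as pd

# Hypothetical perturbed ages: one falls below 0 and one reaches 30.
perturbed = pd.Series([-0.4, 5.2, 16.9, 17.0, 30.0])
labels = ["Early Years", "Later Years"]

# Fixed outer edges: values outside [0, 30) get no label and become NaN.
closed = pd.cut(perturbed, bins=[0, 17, 30], labels=labels, right=False)

# Open-ended outer edges: every value falls into one of the two groups.
open_ended = pd.cut(
    perturbed, bins=[-np.inf, 17, np.inf], labels=labels, right=False
)

print(closed.isna().sum())      # 2
print(open_ended.isna().sum())  # 0
```

Whether this matters in practice depends on the actual age range and on the noise that was drawn, so counting missing values in 'Age_Range' after binning is a cheap sanity check.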

In [None]:
# remove 'Age_Perturbed','Age' columns
students.drop(['Age_Perturbed','Age'] , axis=1, inplace=True)

# Check result
students.head()

### Indirect Variables

Regarding indirect variables, on one hand, there is data related to the 'School' (the school to which the data pertains), and on the other hand, data concerning the parents' employment status and their salaries.

For the variables related to the school, 'School_Name' reveals directly that the data belongs to students. Therefore, this column will be `suppressed`.

The other columns, 'School_Address' and 'School_ZipCode', both carry information about the origin or geolocation of the data, so only one of them will be retained. The school's address is the riskier of the two: for instance, if a school is located on a remote street and this value is combined with the 'Early Years' age group, an attacker could recognize the records as student data. For this reason, this column will also be `suppressed`.

In [None]:
# remove 'School_Name','School_Address' columns
students.drop(['School_Name','School_Address'] , axis=1, inplace=True)

# Check result
students.head()

The 'School_ZipCode' column should be renamed to something that reveals neither that the value is a zip code nor that it relates to a school. The new name 'Code' is proposed.

Additionally, the data will be `masked` by keeping only the first 2 digits and appending one * character per original digit, which increases the length from 5 to 7 characters. This makes it harder to recognize the value as a zip code.

In [None]:
# Define a function to mask the zip code
def mask_zip_code(zc):
    # Keep the first 2 characters and append one '*' per original character
    return zc[0:2] + '*' * len(zc)

# Create the 'Code' column with the masked zip code
students['Code'] = students['School_ZipCode'].astype(str).apply(mask_zip_code)

# Check result
students.head()

In [None]:
# remove 'School_ZipCode' column
students.drop('School_ZipCode', axis=1, inplace=True)

# Check result
students.head()
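
To make the masking rule concrete, here is a minimal sketch applied to two hypothetical zip codes. The helper name `mask_code` and the sample values are assumptions made for illustration only.

```python
import pandas as pd

def mask_code(value: str, keep: int = 2) -> str:
    """Keep the first `keep` characters and append one '*' per original character."""
    return value[:keep] + "*" * len(value)

# Hypothetical zip codes, stored as strings so that leading zeros survive.
codes = pd.Series(["28013", "08001"])
masked = codes.apply(mask_code)

print(masked.tolist())  # ['28*****', '08*****']
```

If the zip codes in the CSV can start with 0, the column would have to be read as text (for example through the `dtype` argument of `pd.read_csv`) for that digit to be kept; whether this applies to the dataset used here is not known.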

Lastly, it remains to analyze the data regarding the employment status of students' parents and their salaries.

For the 'Parents_Salary' column, generalization appears to be the most appropriate technique. The column should also get a more general name than 'Parents_Salary', such as 'Salary'.

To achieve this, the salary range of each occupation will first be reviewed, and these ranges will then be generalized so that they are not directly associated with specific job positions.

In [None]:
# Check salary ranges per occupation
# Group by 'Parents_Occupation' and calculate min and max salary
salary_ranges = students.groupby('Parents_Occupation')['Parents_Salary'].agg(['min', 'max'])

# Rename columns for clarity
salary_ranges.columns = ['Min_Salary', 'Max_Salary']

# Reset index to turn the grouped column back into a regular column
salary_ranges = salary_ranges.reset_index()

# Print the result
print(salary_ranges)

Salaries are generalized into 3 ranges: instead of numeric values, the categories 'Low', 'Medium', and 'High' are used.

In [None]:
# Define salary ranges
def categorize_salary(min_salary, max_salary):
    # Handle the case where min_salary or max_salary might be NaN
    if pd.isna(min_salary) or pd.isna(max_salary):
        return 'Low'
    
    # Determine the cutoffs for the ranges
    low_threshold = 3000
    medium_threshold = 12000
    
    # Use the average salary to categorize
    average_salary = (min_salary + max_salary) / 2
    
    if average_salary <= low_threshold:
        return 'Low'
    elif average_salary <= medium_threshold:
        return 'Medium'
    else:
        return 'High'

# Calculate min and max salary per occupation
salary_ranges = students.groupby('Parents_Occupation')['Parents_Salary'].agg(['min', 'max']).reset_index()

# Rename columns for clarity
salary_ranges.columns = ['Parents_Occupation', 'Min_Salary', 'Max_Salary']

# Apply categorization to the salary ranges
salary_ranges['Salary'] = salary_ranges.apply(lambda row: categorize_salary(row['Min_Salary'], row['Max_Salary']), axis=1)

# Check result
print(salary_ranges['Salary'].unique())

# Merge the salary category of each occupation back into the students DataFrame
students = students.merge(salary_ranges[['Parents_Occupation', 'Salary']], on='Parents_Occupation', how='left')

# Check result
students.head()

In [None]:
# remove 'Parents_Salary' column
students.drop('Parents_Salary', axis=1, inplace=True)

# Check result
students.head()
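
The notebook assigns one salary category per occupation, based on the midpoint between that occupation's minimum and maximum salary. A possible alternative, shown here only as a sketch and not as the approach used above, is to bin each record's own salary with `pd.cut`. The thresholds 3000 and 12000 are the ones from `categorize_salary`; the sample salaries are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries; the thresholds 3000 and 12000 come from categorize_salary.
salaries = pd.Series([1200, 3000, 4500, 12000, 20000])

salary_level = pd.cut(
    salaries,
    bins=[-np.inf, 3000, 12000, np.inf],
    labels=["Low", "Medium", "High"],
    right=True,  # intervals are (low, high], matching the "<=" comparisons
)

print(salary_level.tolist())  # ['Low', 'Low', 'Medium', 'Medium', 'High']
```

With per-record binning the category no longer depends on the occupation at all, while the per-occupation version keeps every parent with the same job in the same category. Which of the two fits the analysis better is a design decision.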

The specific type of employment the parents have is not relevant to the proposed analysis, so a new variable that indicates with 'Y' or 'N' whether they are working is sufficient. The column 'Parents_Occupation' also needs a more generic name; 'Occupacy' is proposed. Together with the 'Age_Range' column, this helps dissociate the records from students and makes them read like more general data, such as data on workers.

In [None]:
# create a function that returns Y or N depending on the occupation
def has_occupacy(occupacy):
    if occupacy == "Unemployed":
        return "N"
    return "Y"

# create new column applying the function
students["Occupacy"] = students["Parents_Occupation"].apply(has_occupacy)

# remove `Parents_Occupation` data
students.drop('Parents_Occupation', axis=1, inplace=True)

# validate changes
students.head()
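
For reference, the same 'Y'/'N' flag can be written as a vectorized expression with `np.where`. The occupations below are hypothetical; only the value "Unemployed" is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical occupations; "Unemployed" is the value treated as not working.
occupation = pd.Series(["Teacher", "Unemployed", "Engineer"])

# np.where picks "N" where the condition holds and "Y" elsewhere.
flag = pd.Series(
    np.where(occupation == "Unemployed", "N", "Y"), index=occupation.index
)

print(flag.tolist())  # ['Y', 'N', 'Y']
```

Note that in both versions a missing occupation is flagged 'Y', because a missing value does not compare equal to "Unemployed".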

Examining the records shows that keeping the 'Occupacy' and 'Salary' columns together with the 'Age_Range' column can hint at a parent-child relationship. For example, if a record has a 'Medium' salary and is currently inactive, the 'Early Years' value could suggest that the 'Occupacy' and 'Salary' data do not describe the same person as 'Early Years'.

To eliminate this association, it is proposed to rename the 'Age_Range' column to the more general 'Ages' and to replace the values 'Early Years' and 'Later Years' with '<17' and '17+', which could be mistaken for years worked.

In [None]:
# change Age_Range values for '<17' and '17+'
def change_age_ranges(age_range):
    if age_range == "Early Years":
        return "<17"
    return "17+"

# create new column 'Ages'
students['Ages'] = students['Age_Range'].apply(change_age_ranges)

# check changes
students.head()

In [None]:
# remove `Age_Range` data
students.drop('Age_Range' , axis=1, inplace=True)

# validate changes
students.head()

## Section 3: Data Validation

A properly anonymized dataset should prevent the following risks: singling out, linkability, and inference.

To check this, we will use `k-anonymity` with k = 2: every combination of values in the grouped columns should appear in at least 2 records.

In [None]:
# Define the desired column order
desired_order = ['ID', 'Code', 'Salary', 'Occupacy', 'Ages', 'Weight', 'Size', 'Feet_size', 'Performance']

# Reorder columns
students = students[desired_order]

# order data by Code and Performance
students_sorted = students.sort_values(by=['Code','Performance'], ascending=False)

# check k-anonymity
students_sorted.groupby(['Code','Performance','Ages', 'Occupacy']).size()
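
The group sizes above still have to be inspected by eye. A minimal sketch of an automated check is given below; the helper `smallest_group` and the toy records are hypothetical, and the choice of quasi-identifier columns is an assumption that should be adapted to the final dataset.

```python
import pandas as pd

def smallest_group(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    return int(df.groupby(quasi_identifiers, observed=True).size().min())

# Toy records; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "Code": ["28*****", "28*****", "08*****", "08*****", "08*****"],
    "Ages": ["<17", "<17", "17+", "17+", "<17"],
    "Occupacy": ["Y", "Y", "N", "N", "Y"],
})

k = smallest_group(toy, ["Code", "Ages", "Occupacy"])
print(k)       # 1: the last record is unique
print(k >= 2)  # False, so this toy table is not 2-anonymous
```

`observed=True` restricts the count to combinations that actually occur, which matters when one of the grouped columns is categorical. If the check fails, the usual options are to generalize further or to suppress the records in the groups that are too small.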

In [175]:
# save final data 
students.to_csv("../anonymized_data.csv", index=False)