# Data Anonymization Techniques

## Example Data Anonymization using Pandas and Keras

Assume we have a dataset of user information that includes sensitive attributes such as names and email addresses, along with numerical data that we want to use to train a Keras model. In this scenario, the model uses age, income, annual expenditure, and house price to predict house affordability.

In [1]:
import pandas as pd
import numpy as np
import hashlib
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Example DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com'],
    'Age': [25, 30, 35],
    'Post Code': ['SW1A 1AA', 'W1A 0AX', 'EC1A 1BB'],
    'Income': [50000, 60000, 70000],
    'Annual Expenditure': [20000, 25000, 30000],
    'House Price': [200000, 250000, 300000],
    'Affordability': [0.5, 0.9, 0.7]
}
df = pd.DataFrame(data)

df



| | Name | Email | Age | Post Code | Income | Annual Expenditure | House Price | Affordability |
|---|---|---|---|---|---|---|---|---|
| 0 | Alice | alice@example.com | 25 | SW1A 1AA | 50000 | 20000 | 200000 | 0.5 |
| 1 | Bob | bob@example.com | 30 | W1A 0AX | 60000 | 25000 | 250000 | 0.9 |
| 2 | Charlie | charlie@example.com | 35 | EC1A 1BB | 70000 | 30000 | 300000 | 0.7 |
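
Besides the direct identifiers, the Post Code column in the data above is a quasi-identifier: a full UK postcode can narrow a record down to a handful of addresses. One common option is to generalize it. The sketch below keeps only the outward code (the part before the space). The helper name `generalize_postcode` is hypothetical, and the code assumes postcodes are written with a space between the two parts.

```python
def generalize_postcode(postcode: str) -> str:
    """Keep only the outward code (the part before the space) of a UK postcode."""
    parts = postcode.strip().upper().split()
    # An empty or whitespace-only value is returned as an empty string
    return parts[0] if parts else ""

postcodes = ['SW1A 1AA', 'W1A 0AX', 'EC1A 1BB']
print([generalize_postcode(p) for p in postcodes])
# → ['SW1A', 'W1A', 'EC1A']
```

Whether the outward code is coarse enough depends on how many people share it in the dataset, so this is a starting point rather than a guarantee.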


## Anonymization - Hashing Names and Emails, Adding Noise to Income and Expenditure

In [2]:
# Pseudonymize the direct identifiers by replacing names and emails with SHA-256 digests
df['Name'] = df['Name'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest())
df['Email'] = df['Email'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest())
# Perturb Income and Annual Expenditure with zero-mean Gaussian noise (std 1000 and 500)
df['Income'] += np.random.normal(0, 1000, df['Income'].shape)
df['Annual Expenditure'] += np.random.normal(0, 500, df['Annual Expenditure'].shape)

df

| | Name | Email | Age | Post Code | Income | Annual Expenditure | House Price | Affordability |
|---|---|---|---|---|---|---|---|---|
| 0 | 3bc51062973c458d5a6f2d8d64a023246354ad7e064b1e... | ff8d9819fc0e12bf0d24892e45987e249a28dce836a85c... | 25 | SW1A 1AA | 48766.837965 | 19517.224991 | 200000 | 0.5 |
| 1 | cd9fb1e148ccd8442e5aa74904cc73bf6fb54d1d54d333... | 5ff860bf1190596c7188ab851db691f0f3169c453936e9... | 30 | W1A 0AX | 61616.028643 | 24723.717089 | 250000 | 0.9 |
| 2 | 6e81b1255ad51bb201a2b8afa9b66653297ae0217f833b... | add7232b65bb559f896cbcfa9a600170a7ca381a036678... | 35 | EC1A 1BB | 69365.645496 | 29621.372243 | 300000 | 0.7 |
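
A plain SHA-256 digest of a name or email address is deterministic and unkeyed, so anyone holding a list of candidate values can hash them and match the results. A keyed hash (HMAC) makes that harder as long as the key stays secret. The sketch below uses only the standard library. The helper name `pseudonymize` and the way the key is generated are assumptions for illustration, not part of the original notebook.

```python
import hashlib
import hmac
import secrets

def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of `value` as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical key: in practice it would be generated once and stored
# separately from the anonymized data.
secret_key = secrets.token_bytes(32)

token_a = pseudonymize("alice@example.com", secret_key)
token_b = pseudonymize("alice@example.com", secret_key)
plain = hashlib.sha256(b"alice@example.com").hexdigest()

print(token_a == token_b)  # the same input and key give the same token
print(token_a == plain)    # the keyed digest differs from the unkeyed hash
```

Because the same key always yields the same token, records can still be joined on the pseudonymized column. Discarding the key afterwards removes the ability to recompute tokens from known values.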


## Preparing Data for Keras Model

In [9]:
# Preparing data for Keras model
X = df[['Age', 'Income', 'Annual Expenditure', 'House Price']]  # Using relevant features
y = df['Affordability']

# Build a Keras model
model = Sequential([
    Dense(10, input_dim=X.shape[1], activation='relu'),
    Dense(1)
])

model.compile(optimizer='adam', loss='mean_squared_error')

# Fit the model
model.fit(X, y, epochs=10, batch_size=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7f0c587563b0>

## Evaluating Different Levels of Noise

In [10]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.metrics import MeanSquaredError

# Noise levels to test
noise_levels = [10, 500, 1000, 5000]

for noise in noise_levels:
    # Create a copy of X and add noise
    X_noised = X + np.random.normal(0, noise, X.shape)
    
    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_noised, y, test_size=0.33, random_state=42)
    
    # Fit the model
    model.fit(X_train, y_train, epochs=10, batch_size=1, verbose=0)
    
    # Evaluate the model
    predictions = model.predict(X_test)
    mse = MeanSquaredError()
    mse_value = mse(y_test, predictions).numpy()
    
    print(f'Noise Level: {noise}, MSE: {mse_value}')

Noise Level: 10, MSE: 929.5593872070312
Noise Level: 500, MSE: 88090.484375
Noise Level: 1000, MSE: 511335.9375
Noise Level: 5000, MSE: 14061255.0


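The code above trains and evaluates the Keras model at several noise levels, using mean squared error (MSE) to measure how the added noise affects performance. Comparing MSE across levels is a more methodical way to choose and validate the amount of random noise than picking a single value arbitrarily. Treat the numbers as illustrative only: with three rows, `test_size=0.33` leaves a single test sample, and the loop keeps fitting the same `model` instance, so each noise level starts from the weights learned at the previous one.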
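
The loop above applies the same absolute standard deviation to every column, so a noise level of 5000 overwhelms Age (25 to 35) while barely changing House Price. One alternative is to scale the noise to each column's own spread. The sketch below does this with NumPy. The helper name `add_relative_noise` and the `fraction` values are assumptions chosen for illustration.

```python
import numpy as np

def add_relative_noise(values, fraction, rng):
    """Add zero-mean Gaussian noise whose std is `fraction` of each column's std."""
    values = np.asarray(values, dtype=float)
    column_std = values.std(axis=0)  # one standard deviation per column
    noise = rng.normal(0.0, 1.0, size=values.shape) * (fraction * column_std)
    return values + noise

rng = np.random.default_rng(42)  # seeded so the run is repeatable

# Columns: Age, Income, Annual Expenditure, House Price (values from the example data)
features = np.array([
    [25, 50000, 20000, 200000],
    [30, 60000, 25000, 250000],
    [35, 70000, 30000, 300000],
], dtype=float)

for fraction in (0.01, 0.05, 0.1):
    noised = add_relative_noise(features, fraction, rng)
    max_shift = np.abs(noised - features).max(axis=0)
    print(f"fraction={fraction}: largest shift per column = {np.round(max_shift, 2)}")
```

With relative noise, each feature is perturbed in proportion to its own scale, which makes the noise levels comparable across columns.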

## Utilizing Derived Features

In [11]:
# Calculating Purchasing Power
df['Purchasing Power'] = df['Income'] / df['Annual Expenditure']

# Preparing data for Keras model
X = df[['Age', 'Purchasing Power', 'House Price']]  # Using Age, Purchasing Power, and House Price as features
y = df['Affordability']
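
A derived feature helps with anonymization only if the raw columns it replaces are not kept alongside it. The sketch below computes the same ratio on the original example values and then drops Income and Annual Expenditure. The variable names `example` and `derived` are hypothetical, and a ratio on its own does not guarantee that individuals cannot be re-identified.

```python
import pandas as pd

example = pd.DataFrame({
    'Age': [25, 30, 35],
    'Income': [50000, 60000, 70000],
    'Annual Expenditure': [20000, 25000, 30000],
    'House Price': [200000, 250000, 300000],
})

raw_columns = ['Income', 'Annual Expenditure']

# Add the ratio, then remove the raw columns it was derived from
derived = (
    example
    .assign(**{'Purchasing Power': example['Income'] / example['Annual Expenditure']})
    .drop(columns=raw_columns)
)

print(derived.columns.tolist())
# → ['Age', 'House Price', 'Purchasing Power']
```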

## Geographic Masking

In [13]:
# Original geographic coordinates (latitude, longitude)
original_coordinates = np.array([
    [51.5074, -0.1278],   # London
    [48.8566, 2.3522],    # Paris
    [40.7128, -74.0060],  # New York
])
# Define the amount of noise to add, in degrees (adjust based on your needs)
noise_scale = 0.01
# Generate random noise
noise = np.random.normal(scale=noise_scale, size=original_coordinates.shape)
# Add noise to original coordinates to get masked coordinates
masked_coordinates = original_coordinates + noise
print("Original Coordinates:\n", original_coordinates)
print("\nMasked Coordinates:\n", masked_coordinates)


Original Coordinates:
 [[ 51.5074  -0.1278]
 [ 48.8566   2.3522]
 [ 40.7128 -74.006 ]]

Masked Coordinates:
 [[ 51.43425638  -0.22372231]
 [ 48.87281359   2.44898603]
 [ 40.64861865 -73.85304906]]
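
Noise expressed in degrees does not map to a fixed distance on the ground, because a degree of longitude covers less distance as latitude increases. The sketch below instead moves each point by a random distance in metres, between a minimum and a maximum, in a random direction (sometimes called donut masking). It assumes a spherical Earth and a flat-surface approximation that is reasonable only for small displacements. The function names `mask_point` and `haversine_m` and the 200 to 1000 metre range are hypothetical choices.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

def mask_point(lat, lon, min_m, max_m, rng):
    """Displace (lat, lon) by a random distance in [min_m, max_m] metres."""
    distance = rng.uniform(min_m, max_m)
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    metres_per_degree = math.pi * EARTH_RADIUS_M / 180.0
    dlat = distance * math.cos(bearing) / metres_per_degree
    # A degree of longitude shrinks by cos(latitude)
    dlon = distance * math.sin(bearing) / (metres_per_degree * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

rng = random.Random(0)  # seeded so the run is repeatable
cities = {
    'London': (51.5074, -0.1278),
    'Paris': (48.8566, 2.3522),
    'New York': (40.7128, -74.0060),
}
for name, (lat, lon) in cities.items():
    masked_lat, masked_lon = mask_point(lat, lon, 200.0, 1000.0, rng)
    moved = haversine_m(lat, lon, masked_lat, masked_lon)
    print(f"{name}: moved {moved:.0f} m")
```

The minimum distance keeps a masked point from landing almost on top of the original, while the maximum bounds how much spatial accuracy is lost.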
