## 3.3 Experiments – 20 marks

### 3.3.1 Design experiments to test the following: The utility of the data that you have generated using your proposed anonymisation scheme (algorithms) for Q2.c.

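First, we compare the cardinality of each attribute in the anonymized dataset `k_anon_police.csv` with that of the original dataset `police-shooting.csv`. Here, cardinality is measured as the number of distinct values in a column divided by its number of non-null values: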

In [133]:
import pandas as pd
import numpy as np 

def interval_to_middle(value):
    """Map an interval label of the form 'start-end' to its midpoint; missing values stay NaN."""
    if pd.isna(value):
        return np.nan
    start, end = value.split('-')
    # The midpoint stands in for every record that falls inside the interval
    return (int(start) + int(end)) / 2


# Apply preprocessing measures to be able to compare our two datasets
orig_df = pd.read_csv("police-shooting.csv")
orig_df = orig_df.drop(['city', 'name', 'longitude', 'latitude', 'is_geocoding_exact', 'id'], axis=1)

orig_df['year'] = pd.to_datetime(orig_df['date']).dt.year
orig_df = orig_df.drop(['date'], axis=1)

anon_df = pd.read_csv("k_anon_police.csv")
anon_df['year'] = anon_df['year_range'].apply(interval_to_middle)
anon_df['age'] = anon_df['age_range'].apply(interval_to_middle)
anon_df = anon_df.drop(['year_range', 'age_range', 'id'], axis=1)

print(anon_df.iloc[:5])

    manner_of_death       armed gender race state  signs_of_mental_illness  \
0              shot         gun      M    A    WA                     True   
1              shot         gun      M    W    OR                    False   
2  shot and Tasered     unarmed      M    H    KS                    False   
3              shot  toy weapon      M    W    CA                     True   
4              shot    nail gun      M    H    CO                    False   

  threat_level         flee  body_camera    year   age  
0       attack  Not fleeing        False  2016.5  45.5  
1       attack  Not fleeing        False  2016.5  45.5  
2        other  Not fleeing        False  2016.5  15.0  
3       attack  Not fleeing        False  2016.5  45.5  
4       attack  Not fleeing        False  2016.5  45.5  


In [134]:
print(orig_df.iloc[:5])

    manner_of_death       armed   age gender race state  \
0              shot         gun  53.0      M    A    WA   
1              shot         gun  47.0      M    W    OR   
2  shot and Tasered     unarmed  23.0      M    H    KS   
3              shot  toy weapon  32.0      M    W    CA   
4              shot    nail gun  39.0      M    H    CO   

   signs_of_mental_illness threat_level         flee  body_camera  year  
0                     True       attack  Not fleeing        False  2015  
1                    False       attack  Not fleeing        False  2015  
2                    False        other  Not fleeing        False  2015  
3                     True       attack  Not fleeing        False  2015  
4                    False       attack  Not fleeing        False  2015  


In [135]:
cardinalities_orig = {}
cardinalities_anon = {}

for column in orig_df.columns:
    u = orig_df[column].nunique()  # distinct non-null values
    n = orig_df[column].count()    # non-null values
    cardinalities_orig[column] = u / n  # share of distinct values in the column

for column in anon_df.columns:
    u = anon_df[column].nunique()  # distinct non-null values
    n = anon_df[column].count()    # non-null values
    cardinalities_anon[column] = u / n  # share of distinct values in the column
    
for attribute, cardinality in cardinalities_orig.items():
    print(f"Original: The cardinality of {attribute} is: {cardinality}")
    
for attribute, cardinality in cardinalities_anon.items():
    print(f"Anonymized: The cardinality of {attribute} is: {cardinality}")    

Original: The cardinality of manner_of_death is: 0.00025034422330704717
Original: The cardinality of armed is: 0.013628182051941374
Original: The cardinality of age is: 0.010815863266123648
Original: The cardinality of gender is: 0.00025131942699170643
Original: The cardinality of race is: 0.0009269272362119574
Original: The cardinality of state is: 0.006383777694329703
Original: The cardinality of signs_of_mental_illness is: 0.00025034422330704717
Original: The cardinality of threat_level is: 0.0003755163349605708
Original: The cardinality of flee is: 0.0005688282138794084
Original: The cardinality of body_camera is: 0.00025034422330704717
Original: The cardinality of year is: 0.0010013768932281887
Anonymized: The cardinality of manner_of_death is: 0.00025034422330704717
Anonymized: The cardinality of armed is: 0.013628182051941374
Anonymized: The cardinality of gender is: 0.00025131942699170643
Anonymized: The cardinality of race is: 0.0009269272362119574
Anonymized: The cardinality 
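
To illustrate why generalising a numeric attribute into intervals lowers this ratio, here is a minimal sketch on toy data. The helper name `uniqueness_ratio` and the toy values are assumptions made for illustration; they are not taken from the datasets above.

```python
import pandas as pd

def uniqueness_ratio(df):
    """Distinct non-null values divided by non-null count, per column (hypothetical helper)."""
    return df.nunique() / df.count()

# Toy stand-ins: exact ages versus interval midpoints (illustrative values only)
toy_orig = pd.DataFrame({"age": [23, 32, 39, 47, 53, None]})
toy_anon = pd.DataFrame({"age": [15.0, 45.5, 45.5, 45.5, 45.5, None]})

comparison = pd.DataFrame({
    "original": uniqueness_ratio(toy_orig),
    "anonymized": uniqueness_ratio(toy_anon),
})
print(comparison)
```

On this toy input the ratio for `age` drops from 1.0 to 0.4, because five distinct ages collapse into two interval midpoints. Attributes that are not generalised keep the same ratio in both datasets.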

In [136]:
def numerical_stats(original, anonymized):
    stats_original = original.describe().transpose()
    stats_anonymized = anonymized.describe().transpose()
    stats_original.columns = ['orig_' + col for col in stats_original.columns]
    stats_anonymized.columns = ['anon_' + col for col in stats_anonymized.columns]
    return pd.merge(stats_original, stats_anonymized, left_index=True, right_index=True)

print(numerical_stats(orig_df, anon_df))

      orig_count    orig_mean   orig_std  orig_min  orig_25%  orig_50%  \
age       7489.0    37.215917  12.986545       2.0      27.0      35.0   
year      7989.0  2018.536863   2.290178    2015.0    2017.0    2019.0   

      orig_75%  orig_max  anon_count    anon_mean   anon_std  anon_min  \
age       45.0      92.0      7489.0    38.072573  17.868025      15.0   
year    2021.0    2022.0      7989.0  2019.483352   2.556590    2016.5   

      anon_25%  anon_50%  anon_75%  anon_max  
age       15.0      45.5      45.5      80.5  
year    2016.5    2020.0    2023.0    2023.0  


### 3.3.2 Design experiments to test the following: Analyse the new (anonymized) dataset for risks of de-anonymization.
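We reuse the algorithm proposed in Task 3.2: the records are projected with multiple correspondence analysis (MCA), and rows whose component scores have an absolute z-score above 3 are flagged as outliers.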

In [None]:
import pandas as pd
from scipy import stats
import numpy as np
import prince
import warnings
import altair as alt
alt.data_transformers.enable("vegafusion")

# Ignore specific PerformanceWarnings from pandas
warnings.filterwarnings('ignore', category=pd.errors.PerformanceWarning)
# Load the dataset
df = pd.read_csv('k_anon_police.csv')

mca = prince.MCA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
# Fit MCA on the dataset
mca = mca.fit(df)

df_transformed = mca.transform(df)
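# Flag rows whose score on any component has an absolute z-score above 3
z_scores = np.abs(stats.zscore(df_transformed))
outliers = np.where(z_scores > 3)

# A row can exceed the threshold on several components; keep each row only once
outlier_rows = df.iloc[np.unique(outliers[0])]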
outlier_rows.to_csv('athletes_outliers_MCA.csv', index=False)

mca.plot(
    df,
    x_component=0,
    y_component=1,
    show_column_markers=True,
    show_row_markers=True,
    show_column_labels=False,
    show_row_labels=False
)

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))  
plt.scatter(df_transformed[0], df_transformed[1]) 

plt.title('MCA Results')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.grid(True)

plt.show() 
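
The z-score step used above can be sketched in isolation. The example below is only an illustration: the random `coords` array is a stand-in for the MCA row coordinates, and the two planted rows are assumptions chosen to show that `np.where` reports a row once per component on which it exceeds the threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
coords = rng.normal(size=(200, 3))  # stand-in for MCA row coordinates (assumption)
coords[5] = [8.0, 8.0, 0.0]         # planted row, extreme on two components
coords[17] = [0.0, 0.0, -9.0]       # planted row, extreme on one component

# Absolute z-score of every coordinate, computed per component (column)
z_scores = np.abs(stats.zscore(coords, axis=0))

# np.where returns one (row, column) pair per coordinate above the threshold
row_idx, col_idx = np.where(z_scores > 3)

# Collapse to distinct row positions so that each record is listed once
outlier_rows = np.unique(row_idx)
print(outlier_rows)
```

Row 5 appears twice in `row_idx` because it is extreme on two components, so indexing a DataFrame with `row_idx` directly would duplicate that record; `np.unique` avoids this.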

### 3.3.3 Design experiments to test the following: Propose a method of assessing the risk of disclosure (de-anonymisation) and use this metric to evaluate your anonymised datasets (from Assignments #1, and #2-3), the anonymised dataset received from your colleague, and your version of the anonymised dataset that you obtained in Q2.c.
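
One common way to quantify disclosure risk is to group records by their quasi-identifiers and treat the re-identification risk of a record as 1 divided by the size of its equivalence class. The following is a minimal sketch under stated assumptions, not a definitive metric for this assignment: the function name `reidentification_risk`, the choice of quasi-identifiers, and the toy rows are all hypothetical.

```python
import pandas as pd

def reidentification_risk(df, quasi_identifiers):
    """Summarise equivalence-class-based risk (hypothetical helper).

    Each record's risk is 1 / (number of records sharing its
    quasi-identifier values).
    """
    # Cast to string so that missing values form their own group
    keys = df[quasi_identifiers].astype(str)
    sizes = keys.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    per_record = 1.0 / sizes
    return {
        "k": int(sizes.min()),                        # smallest equivalence class
        "max_risk": float(per_record.max()),          # worst-case record risk
        "avg_risk": float(per_record.mean()),         # average record risk
        "unique_records": int((sizes == 1).sum()),    # records alone in their class
    }

# Toy rows using column names from the anonymized dataset (values are illustrative)
toy = pd.DataFrame({
    "age_range": ["30-61", "30-61", "30-61", "0-30", "0-30", "61-100"],
    "gender":    ["M", "M", "M", "M", "M", "F"],
    "state":     ["CA", "CA", "CA", "KS", "KS", "WA"],
})
qi = ["age_range", "gender", "state"]  # assumed quasi-identifiers
risk = reidentification_risk(toy, qi)
print(risk)
```

For the toy rows the classes have sizes 3, 2 and 1, so the smallest class gives k = 1, the maximum risk is 1.0, and the average risk is 0.5. Applying the same function to each anonymized dataset with one agreed set of quasi-identifiers would make the results directly comparable.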
