## 3.2 De-anonymizing a dataset – 50 marks
For this task, we use a dataset provided by our colleagues. The dataset has been anonymized using Bayesian inference.
- Dataset: police_shooting_anonymized.csv

### 3.2.1 Using standard search mechanisms, determine if there are any elements within the dataset that you received, that allow for de-anonymizations to occur. Make a note of what you find and explain the procedure you used. 
First, we inspect the data in the new dataset. It describes victims of police shootings: where the shooting happened (city, state, longitude, latitude, is_geocoding_exact), when it happened (date), attributes of the victim and the incident (age, gender, race, armed, manner_of_death, signs_of_mental_illness, threat_level, flee), and body_camera, which indicates whether the police officer was wearing a body camera.

In [7]:
import pandas as pd

# Load the anonymized dataset and show the first five rows
df = pd.read_csv("police_shooting_anonymized.csv")
print(df.head())

         date   manner_of_death       armed   age gender race  \
0  2022-11-06              shot         gun  20.0      M    H   
1  2016-01-20              shot         gun  32.0      M    W   
2  2020-08-29  shot and Tasered     unarmed  56.0      M    B   
3  2020-07-12              shot  toy weapon  52.0      F    B   
4  2019-04-29              shot    nail gun  19.0      M    B   

               city state  signs_of_mental_illness threat_level         flee  \
0         Baltimore    AZ                     True       attack  Not fleeing   
1         Kerrville    ME                    False       attack  Not fleeing   
2  Dearborn Heights    NC                    False        other  Not fleeing   
3   Butler Township    FL                     True       attack  Not fleeing   
4            Edmond    MS                    False       attack  Not fleeing   

   body_camera  longitude  latitude  is_geocoding_exact  
0        False    -83.663    37.472                True  
1        Fal

We searched the internet (Google) for individual entries, matching the information in each row against published reports. The first entry leads directly to several online articles about the shooting in Baltimore on 6 November 2022, none of which mention the victim's name. After some further searching, we found the name in a YouTube video showing body camera footage from an officer present at the shooting: the victim was Tyree Moorehead.
We then tried to match further entries to real-world events, and every entry we tried could be matched with a simple Google search. For the first five entries, the victims' names are:
- Tyree Moorehead
- Michael Clyde Lynch
- Donny Walker
- Terena Nicole Thurman
- Isaiah Lewis
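
The manual procedure above can be sketched in code. The following is a minimal illustration, not the exact procedure we ran: it assumes that `date` and `city` act as quasi-identifiers, uses a toy frame copied from the first five rows printed above, and introduces two hypothetical helpers, `build_search_query` and `unique_share`. The `state` column is left out because it appears to have been perturbed in the sample (for example, Baltimore is paired with AZ).

```python
import pandas as pd

# Toy frame copied from the first five rows printed above (subset of columns).
sample = pd.DataFrame(
    {
        "date": ["2022-11-06", "2016-01-20", "2020-08-29", "2020-07-12", "2019-04-29"],
        "city": ["Baltimore", "Kerrville", "Dearborn Heights", "Butler Township", "Edmond"],
        "age": [20.0, 32.0, 56.0, 52.0, 19.0],
        "gender": ["M", "M", "M", "F", "M"],
    }
)

# Assumption: date and city together act as quasi-identifiers.
QUASI_IDENTIFIERS = ["date", "city"]


def build_search_query(row: pd.Series) -> str:
    """Hypothetical helper: turn one record into a web search string."""
    return f"police shooting {row['city']} {row['date']}"


def unique_share(frame: pd.DataFrame, columns: list[str]) -> float:
    """Fraction of rows whose combination of `columns` occurs exactly once."""
    is_repeated = frame.duplicated(subset=columns, keep=False)
    return float((~is_repeated).mean())


queries = sample.apply(build_search_query, axis=1).tolist()
print(queries[0])
print(unique_share(sample, QUASI_IDENTIFIERS))
```

A share close to 1 means that almost every record is singled out by date and city alone, which is what makes a plain web search sufficient for re-identification.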

### 3.2.2 Design a de-anonymisation algorithm and apply to both the received dataset and your dataset. Report on the following:
In this section, we explore the underlying patterns in our police-shooting dataset and try to re-identify anonymized records. The dataset contains sensitive information about victims of fatal police shootings, including demographic details, incident locations, and other attributes that could be pieced together to de-anonymize the data.

To achieve this, we will utilize Factor Analysis of Mixed Data (FAMD), which is an extension of principal component analysis (PCA) that is suitable for mixed data types (categorical and numerical). FAMD allows us to reduce the dimensionality of the data while preserving the relationships between both numerical and categorical variables. The rationale behind the choice of FAMD and the specific parameters set in the initialization of the prince.FAMD object is as follows:

1. n_components=2: We extract two components so that the data can be visualized and interpreted in a 2D space. Two components are often enough to capture a substantial share of the variance, but the eigenvalue summary in the output below reports only about 0.05% per component, so the 2D projection should be read with caution.
2. n_iter=3: This is the number of iterations used when computing the SVD. A low value gives a quicker, approximate solution, which can be advantageous for large datasets.
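
To make the idea behind FAMD concrete, here is a minimal sketch of the textbook construction: numerical columns are standardized, categorical columns are one-hot encoded and scaled by the square root of each category's frequency, and an ordinary PCA (via SVD) is run on the combined matrix. This is an illustration under stated assumptions, not the code used by `prince`, whose implementation may differ in details. The function name `famd_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def famd_sketch(frame: pd.DataFrame, n_components: int = 2):
    """Textbook-style FAMD: scale mixed columns, then PCA via SVD (illustrative only)."""
    numeric = frame.select_dtypes(include="number")
    categorical = frame.select_dtypes(exclude="number").astype(str)

    # Numerical variables: center and scale to unit variance.
    numeric_scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

    # Categorical variables: one-hot encode, divide each indicator by the
    # square root of its category frequency, then center.
    indicators = pd.get_dummies(categorical).astype(float)
    indicators_scaled = indicators / np.sqrt(indicators.mean())
    indicators_scaled = indicators_scaled - indicators_scaled.mean()

    # PCA on the combined matrix through a singular value decomposition.
    combined = pd.concat([numeric_scaled, indicators_scaled], axis=1).to_numpy()
    u, s, _ = np.linalg.svd(combined, full_matrices=False)
    coordinates = u[:, :n_components] * s[:n_components]
    explained = (s**2 / np.sum(s**2))[:n_components]
    return coordinates, explained


# Toy data with made-up values, only to show the shapes involved.
toy = pd.DataFrame(
    {
        "age": [20.0, 32.0, 56.0, 52.0, 19.0, 41.0],
        "gender": ["M", "M", "M", "F", "M", "F"],
        "flee": ["Not fleeing", "Car", "Not fleeing", "Foot", "Not fleeing", "Car"],
    }
)
coords, explained = famd_sketch(toy)
print(coords.shape)
print(explained)
```

Because every category becomes its own indicator column, a categorical variable with almost as many levels as rows (such as `city` or `date` in our dataset) adds a very large number of columns, which is one plausible reason why two components retain so little of the total variance.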

In [8]:
#!pip install -U "vegafusion[embed]" # we use vegafusion for interactive charts

In [18]:
import pandas as pd
from scipy import stats
import numpy as np
import prince
import warnings
import altair as alt
alt.data_transformers.enable("vegafusion")


# Ignore specific PerformanceWarnings from pandas
warnings.filterwarnings('ignore', category=pd.errors.PerformanceWarning)
# Load the dataset
df = pd.read_csv('police_shooting_anonymized.csv')


# Initialize FAMD object: specifying the number of components (n_components)
famd = prince.FAMD(
    n_components=2,
    n_iter=3,
    copy=True,
    check_input=True,
    random_state=42,
    engine="sklearn",
    handle_unknown="error"  # same parameter as sklearn.preprocessing.OneHotEncoder
)
# Fit FAMD on the dataset
famd = famd.fit(df)
print(famd.eigenvalues_summary)
print(famd.row_coordinates(df).head())
print(famd.column_coordinates_)


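# Project the rows onto the two extracted components
df_transformed = famd.transform(df)

# Flag rows whose coordinate on any component lies more than
# 3 standard deviations from the mean of that component
z_scores = np.abs(stats.zscore(df_transformed, axis=0))
outlier_mask = np.asarray(z_scores > 3).any(axis=1)

# Select each outlier row exactly once and save the result for inspection
outlier_rows = df[outlier_mask]
outlier_rows.to_csv('outliers.csv', index=False)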

famd.plot(
    df,
    x_component=0,
    y_component=1
)



          eigenvalue % of variance % of variance (cumulative)
component                                                    
0             12.724         0.05%                      0.05%
1             12.380         0.05%                      0.10%
component          0         1
0           0.602289 -0.231371
1           1.696160  4.199227
2          -1.275863  2.524905
3          11.904668  4.453371
4           6.313158 -5.749478
component                       0         1
variable                                   
age                      0.000084  0.000142
longitude                0.000001  0.000004
latitude                 0.000763  0.000026
armed                    0.252883  0.175638
body_camera              0.001028  0.000374
city                     0.887289  0.914713
date                     0.891825  0.910021
flee                     0.033574  0.018013
gender                   0.000485  0.000516
is_geocoding_exact       0.001214  0.003795
manner_of_death          0.003510  0.0

TypeError: plot() got an unexpected keyword argument 'figsize'

### a) What are you able to discover using your de-anonymisation algorithm?


### b) How your discoveries (with the anonymised algorithm) compare to the results from Q1?
- Regarding the police shooting dataset anonymized using Bayesian inference: 
    Since we had already de-anonymized the dataset in 3.2.1, applying our de-anonymization algorithm to it yielded no additional knowledge. 
- Regarding the 


### c)

