## 3.2 De-anonymizing a dataset – 50 marks
For this task, we use a dataset provided by our colleagues. The dataset has been anonymized using Bayesian inference.
- Dataset: police_shooting_anonymized.csv

### 3.2.1 Using standard search mechanisms, determine if there are any elements within the dataset that you received, that allow for de-anonymizations to occur. Make a note of what you find and explain the procedure you used. 
First, we look at the data in the new dataset. It contains records of victims of police shootings, with information about where each shooting happened (city, state, longitude, latitude), when it happened (date), the victim and the circumstances (age, gender, race, armed, manner_of_death, signs_of_mental_illness, threat_level, flee), and body_camera, which indicates whether the police officer was wearing a body camera.
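
Before searching, it can help to check how identifying a combination of columns is on its own. The following is a minimal sketch, assuming that `date` and `city` act as quasi-identifiers; the toy rows and the helper name `unique_on` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows with the same column names as the dataset; the values are invented.
toy = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"],
    "city": ["Springfield", "Springfield", "Springfield", "Shelbyville"],
    "gender": ["M", "M", "F", "M"],
})

def unique_on(frame, quasi_identifiers):
    """Return a boolean Series that is True where no other row
    shares the same values in the given columns."""
    return ~frame.duplicated(subset=quasi_identifiers, keep=False)

is_unique = unique_on(toy, ["date", "city"])
share_unique = is_unique.mean()  # fraction of rows singled out by date + city

print(is_unique.tolist())
print(share_unique)
```

A high share of unique rows would suggest that these columns alone are enough to single out individual records.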



In [8]:
import pandas as pd
df = pd.read_csv("police_shooting_anonymized.csv")
print(df.iloc[:5])

         date   manner_of_death       armed   age gender race  \
0  2022-11-06              shot         gun  20.0      M    H   
1  2016-01-20              shot         gun  32.0      M    W   
2  2020-08-29  shot and Tasered     unarmed  56.0      M    B   
3  2020-07-12              shot  toy weapon  52.0      F    B   
4  2019-04-29              shot    nail gun  19.0      M    B   

               city state  signs_of_mental_illness threat_level         flee  \
0         Baltimore    AZ                     True       attack  Not fleeing   
1         Kerrville    ME                    False       attack  Not fleeing   
2  Dearborn Heights    NC                    False        other  Not fleeing   
3   Butler Township    FL                     True       attack  Not fleeing   
4            Edmond    MS                    False       attack  Not fleeing   

   body_camera  longitude  latitude  is_geocoding_exact  
0        False    -83.663    37.472                True  
1        Fal

In our case, we searched Google for some of the entries, trying to match publicly available information with the dataset. For the first entry, a search immediately returns multiple online articles about the shooting in Baltimore on 6 November 2022, none of which mention the victim's name. After some further searching, we found the name in a YouTube video showing the body camera footage of an officer present at the shooting: the victim was Tyree Moorehead.
Going further, we matched other entries to real-world events in the same way. Every entry we tried could be identified with a simple Google search. For the first five entries, the victims' names are:
- Tyree Moorehead
- Michael Clyde Lynch
- Donny Walker
- Terena Nicole Thurman
- Isaiah Lewis
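
The manual procedure above can be sketched as a small helper that turns each row into a search query. This is only an illustration: the function name `build_query` and the query wording are assumptions, and only `city` and `date` are used because those were the fields that led to a match for the first entry (the state is left out since, in the sample output, it does not always agree with the city, e.g. Baltimore listed with AZ).

```python
import pandas as pd

def build_query(row):
    """Turn one dataset row into a plain-text web search query.
    The wording of the query is a hypothetical choice."""
    return f'police shooting "{row["city"]}" {row["date"]}'

# Two rows copied from the printed sample of the dataset.
sample = pd.DataFrame({
    "date": ["2022-11-06", "2016-01-20"],
    "city": ["Baltimore", "Kerrville"],
})

queries = [build_query(row) for _, row in sample.iterrows()]
for query in queries:
    print(query)
```

The resulting strings can then be pasted into a search engine by hand; no search API is assumed here.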

### 3.2.2 Design a de-anonymisation algorithm and apply to both the received dataset and your dataset. Report on the following:

In [9]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import numpy as np

# Load dataset from CSV file
file_path = 'police_shooting_anonymized.csv'  
df = pd.read_csv(file_path)

# Convert 'date' column to a numerical feature
df['date'] = pd.to_datetime(df['date'])
df['days_since'] = (df['date'] - df['date'].min()).dt.days

# Select columns to be one-hot encoded
categorical_cols = ['city', 'manner_of_death']

# One-hot encoding
encoder = OneHotEncoder()
encoded_features = encoder.fit_transform(df[categorical_cols]).toarray()
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)

# Combine encoded features with the numerical 'days_since' column
final_df = pd.concat([encoded_df, df['days_since']], axis=1)

# PCA and StandardScaler in a Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2))  # Adjust n_components as needed
])

# Fit and transform the data
pca_features = pipeline.fit_transform(final_df)

# Calculate the Euclidean distance from the origin for each point
distances = np.linalg.norm(pca_features, axis=1)

# Determine a threshold for outliers
threshold = np.mean(distances) + 2 * np.std(distances)

# Identify outliers
outliers = distances > threshold

# Retrieve the entire corresponding rows of the dataset for the outliers
outlier_indices = np.where(outliers)[0]
outlier_rows = df.iloc[outlier_indices].copy()  # Use .copy() to avoid SettingWithCopyWarning

# Add the outlier distances to these rows
outlier_rows['outlier_distance'] = distances[outliers]

# Export the outlier rows with distances to a new CSV file
outlier_rows.to_csv('outliers_with_distances.csv', index=False)

print("Outlier rows with distances have been exported to 'outliers_with_distances.csv'.")


Outlier rows with distances have been exported to 'outliers_with_distances.csv'.
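
The cell above flags unusual rows; attaching names to rows would additionally require a reference source. A minimal sketch of such a linkage step is shown below, assuming a hypothetical reference table with `date`, `city` and `name` columns. Both tables, the placeholder names and the choice of join keys are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical anonymized rows (no names).
anonymized = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "city": ["Springfield", "Springfield", "Shelbyville"],
    "age": [30.0, 41.0, 25.0],
})

# Hypothetical public reference table (with names).
reference = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-03", "2020-01-05"],
    "city": ["Springfield", "Shelbyville", "Ogdenville"],
    "name": ["Person A", "Person B", "Person C"],
})

keys = ["date", "city"]  # assumed quasi-identifiers

# Left join on the quasi-identifiers; validate raises an error if a key
# combination is not unique on either side, so matches stay unambiguous.
linked = anonymized.merge(
    reference, on=keys, how="left", indicator=True, validate="one_to_one"
)

# Rows found in both tables are the ones that received a name.
reidentified = linked[linked["_merge"] == "both"]
rate = len(reidentified) / len(anonymized)

print(reidentified[["date", "city", "name"]])
print(f"Re-identification rate: {rate:.2f}")
```

The re-identification rate gives a single number that could be compared between the received dataset and our own dataset.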
