### 3.1.3 Using the techniques you applied in Assignment #1, apply a masking or transformation mechanism to modify the detected PII elements and substitute with suitable replacements.
In the following section, we will apply techniques similar to those used in Assignment #1 to mask or transform Personally Identifiable Information (PII) detected in a dataset. The goal is to substitute these sensitive elements with suitable replacements while maintaining the overall structure and coherence of the data.

As a first step, we count how often each entity category occurs in the new PII column. These counts are crucial for planning the further anonymization steps.

In [7]:
import pandas as pd
df = pd.read_csv("PII_tweet_emotions.csv")

In [8]:
import collections
import ast 

# Function to extract entities from a string and return their labels
def extract_labels(data_string):
    # Convert string representation of list to actual list
    entities = ast.literal_eval(data_string)
    # Extract labels
    return [label for _, label in entities]

# Extract labels from each item in the data
all_labels = [label for item in df['PII'] for label in extract_labels(item)]

# Count occurrences of each label
label_counts = collections.Counter(all_labels)

print(label_counts)
print(len(label_counts))

Counter({'PERSON': 10340, 'ORG': 8815, 'DATE': 7560, 'CARDINAL': 3430, 'GPE': 3269, 'TIME': 2964, 'NORP': 957, 'PRODUCT': 702, 'ORDINAL': 680, 'WORK_OF_ART': 607, 'MONEY': 353, 'LOC': 244, 'FAC': 210, 'QUANTITY': 191, 'EVENT': 123, 'LANGUAGE': 106, 'PERCENT': 75, 'LAW': 35})
18


The 18 categories have the following meanings:

- PERSON: Names of people.
- ORG: Organizations, including companies, agencies, institutions, etc.
- DATE: Absolute or relative dates or periods.
- CARDINAL: Numerals that do not fall under another type (like dates or quantities).
- GPE: Geopolitical entity, typically referring to countries, cities, states.
- TIME: Times smaller than a day, including specific time periods, durations, or times of day.
- NORP: Nationalities, religious or political groups.
- ORDINAL: "First", "second", etc., used to denote position in an ordered sequence.
- PRODUCT: Objects, vehicles, foods, etc. (not services).
- MONEY: Monetary values, including unit.
- WORK_OF_ART: Titles of books, songs, etc.
- LOC: Non-GPE locations, mountain ranges, bodies of water.
- FAC: Facilities, including buildings, airports, highways, bridges, etc.
- QUANTITY: Measurements, as of weight or distance.
- EVENT: Named hurricanes, battles, wars, sports events, etc.
- PERCENT: Percentage (including "%").
- LANGUAGE: Any named language.
- LAW: Named documents made into laws.


We now know that there are 18 entity types that should be anonymized in some way. We start with the easiest ones, using the Faker library:

In [9]:
import spacy

#Load the large, pre-trained spaCy model
nlp = spacy.load('en_core_web_lg')



In [10]:
from faker import Faker
import re
import random

fake = Faker()

def close_number(original_number):
    try:
        num = int(original_number)
        # Generate a number within ±10% of the original number;
        # for numbers below 10 the variation rounds down to 0, so they stay unchanged
        variation = int(num * 0.1)
        return str(random.randint(max(0, num - variation), num + variation))
    except ValueError:
        # Not a plain non-negative integer (e.g. "two" or "3,000"): keep the original text
        return original_number

def fake_ordinal():
    number = fake.random_int(min=1, max=100)
    # Indexed by last digit: 1 -> "st", 2 -> "nd", 3 -> "rd", otherwise "th"; 11-13 always get "th"
    suffix = ["th", "st", "nd", "rd"] + ["th"] * 6
    return str(number) + suffix[number % 10 if number % 100 not in [11, 12, 13] else 0]

# Map each handled entity label to a function that builds its replacement
REPLACERS = {
    'PERSON': lambda original: fake.name(),
    'GPE': lambda original: fake.city(),
    'DATE': lambda original: fake.date(),
    'ORG': lambda original: fake.company(),
    'NORP': lambda original: fake.country(),
    'CARDINAL': close_number,
    'ORDINAL': lambda original: fake_ordinal(),
    'TIME': lambda original: fake.time(),
}

def replace_pii_with_fake(text):
    # Process the text using spaCy to identify named entities
    doc = nlp(text)
    # Iterate over the identified entities
    for ent in doc.ents:
        make_replacement = REPLACERS.get(ent.label_)
        if make_replacement is None:
            continue  # Entity type not handled yet
        replacement = make_replacement(ent.text)
        # Match the entity only as a whole token, not inside longer words.
        # A function as replacement keeps backslashes in the fake value literal.
        pattern = r'(?<!\w)' + re.escape(ent.text) + r'(?!\w)'
        text = re.sub(pattern, lambda match: replacement, text)

    # Replace Twitter @ handles, drawing a new fake user name for each mention
    text = re.sub(r'(?<=@)\w+', lambda match: fake.user_name(), text)
    return text


df['content'] = df['content'].apply(replace_pii_with_fake)

# Save the modified DataFrame to a new CSV file
df.to_csv("Anonymized_PII_tweet_emotions.csv", index=False)

Now that we have anonymized eight of the 18 categories (PERSON, GPE, DATE, ORG, NORP, CARDINAL, ORDINAL and TIME), let's continue with the remaining ten:
- PRODUCT MONEY WORK_OF_ART LOC FAC QUANTITY EVENT PERCENT LANGUAGE LAW
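
One possible way to cover the remaining categories is to replace them with a generic placeholder instead of a fake value. The following is only a minimal sketch under that assumption, not the method used in this notebook: the function name `mask_remaining_entities` and the `[LABEL]` placeholder format are illustrative, and the entities are assumed to be `(text, label)` pairs, as stored in the PII column. ORDINAL is left out because the code above already handles it.

```python
import re

# Labels not covered by the Faker-based step (assumption: taken from the list above)
REMAINING_LABELS = {
    "PRODUCT", "MONEY", "WORK_OF_ART", "LOC", "FAC",
    "QUANTITY", "EVENT", "PERCENT", "LANGUAGE", "LAW",
}

def mask_remaining_entities(text, entities):
    """Replace entities of the remaining labels with a [LABEL] placeholder.

    `entities` is a list of (entity_text, label) pairs (hypothetical helper,
    mirroring the shape that extract_labels() reads from the PII column).
    """
    # Handle longer entities first so a short entity cannot split a longer one
    for ent_text, label in sorted(entities, key=lambda e: len(e[0]), reverse=True):
        if label not in REMAINING_LABELS:
            continue
        # Match the entity only as a whole token, not inside longer words
        pattern = r"(?<!\w)" + re.escape(ent_text) + r"(?!\w)"
        text = re.sub(pattern, lambda m, label=label: f"[{label}]", text)
    return text

example = mask_remaining_entities(
    "Paid $20 for an iPhone case near Lake Tahoe",
    [("$20", "MONEY"), ("iPhone", "PRODUCT"), ("Lake Tahoe", "LOC")],
)
print(example)
```

Placeholders keep the sentence readable and make it obvious where something was removed, at the cost of less natural text than a fake value would give.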


### 3.1.4 Analyse the text to determine if any information can be obtained after the transformation process. What conclusions can you draw from this?

The idea is to compare the semantics of each text in the original dataset with its counterpart in the anonymised dataset and see whether they are still similar. We then check whether the PII from the original dataset is still present in the anonymised dataset.
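
The second check, whether original PII survives the transformation, could look roughly like the sketch below. It assumes the original entities are available as `(text, label)` pairs, as in the PII column; the function name `leaked_entities` and the example strings are made up for illustration.

```python
import re

def leaked_entities(entities, anonymized_text):
    """Return the original (text, label) pairs that still occur in the anonymized text."""
    leaked = []
    for ent_text, label in entities:
        # Whole-token, case-insensitive match of the original entity string
        pattern = r"(?<!\w)" + re.escape(ent_text) + r"(?!\w)"
        if re.search(pattern, anonymized_text, flags=re.IGNORECASE):
            leaked.append((ent_text, label))
    return leaked

# Illustrative data, not taken from the dataset
original_entities = [("Anna", "PERSON"), ("Berlin", "GPE"), ("Friday", "DATE")]
anonymized = "Meeting John Smith in Port Lisa on friday"
leaks = leaked_entities(original_entities, anonymized)
print(leaks)
```

Applied row by row, the share of rows with a non-empty result gives a rough leak rate. A string match like this only detects verbatim leftovers, so it says nothing about information that can still be inferred from context.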

Cosine similarity: Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. Its value ranges from -1 to 1, where:

- 1 means the vectors point in the same direction.
- 0 indicates orthogonality (no similarity).
- -1 means the vectors point in opposite directions.

SciPy's `cosine` function returns the cosine distance, which is 1 minus the cosine similarity. The code below therefore computes `1 - cosine(emb1, emb2)` to get back to the similarity scale:

- A distance of 0 (same direction) gives a similarity of 1, indicating no difference.
- A distance of 1 (orthogonal vectors) gives a similarity of 0.
- A distance of 2 (opposite vectors) gives a similarity of -1, which typically doesn't occur for the embeddings used in text analysis.
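
As a small, self-contained illustration of these values, the following sketch computes the cosine similarity and the corresponding distance (1 minus the similarity) in plain Python. The function names are illustrative and not part of the notebook, which uses SciPy for this step.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Same direction, orthogonal, and opposite vectors
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0 and -1, and the matching distances are 0, 1 and 2.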

In [11]:
import pandas as pd
import torch
from scipy.spatial.distance import cosine
import numpy as np
from transformers import DistilBertTokenizer, DistilBertModel

# Load your datasets
df = pd.read_csv("tweet_emotions.csv")  # Make sure you've loaded the original dataset into 'df'
df_anonymized = pd.read_csv("Anonymized_PII_tweet_emotions.csv")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased').to(device)

# Embed a text as the mean of its DistilBERT token embeddings (inputs are moved to the selected device)
def get_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()

# Function to calculate semantic similarity
def semantic_similarity(text1, text2, tokenizer, model):
    emb1 = get_embedding(text1, tokenizer, model)
    emb2 = get_embedding(text2, tokenizer, model)
    # Ensure embeddings are 1-D
    emb1 = np.squeeze(emb1)
    emb2 = np.squeeze(emb2)
    #print(text1 + " and " + text2)
    return 1 - cosine(emb1, emb2)

# Calculate similarities
try:
    similarity_scores = [semantic_similarity(orig, anon, tokenizer, model) for orig, anon in zip(df['content'], df_anonymized['content'])]
except ValueError as e:
    print(f"Error calculating similarity: {e}")



ImportError: dlopen(/Users/julius/PycharmProjects/PPOD/.venv/lib/python3.9/site-packages/scipy/special/_ufuncs.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace '_npy_asinh'

In [None]:
import nbformat
from nbformat import validate

# Replace '3.1+3.2.ipynb' with the path to your notebook
with open('3.1+3.2.ipynb', 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)
    validate(nb)