# Ethics for NLP: Spring 2022
# Homework 4 Privacy


## 1. Data Overview and Baseline

A major problem with using web data as a source for NLP applications is the growing concern for privacy, for example in the context of microtargeting. This homework aims to develop a method for obfuscating demographic features, in this case (binary) gender, and to investigate the trade-off between obfuscating a user's identity and preserving useful information.

The given dataset consists of Reddit posts (`post_text`) which are annotated with the gender (`op_gender`) of the user and the corresponding subreddit (`subreddit`) category.

*  `subreddit_classifier.pickle` pretrained subreddit classifier
*  `gender_classifier.pickle` pretrained gender classifier
*  `test.csv` your primary test data
*  `male.txt` a list of words commonly used by men
*  `female.txt` a list of words commonly used by women
*  `background.csv` additional Reddit posts that you may optionally use for training an obfuscation model

In [None]:
from pandas.core.frame import DataFrame
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from gensim.corpora import Dictionary
from typing import List, Tuple
import numpy as np
import random
import pickle
import cloudpickle
import pandas
import nltk
import gensim
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
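def get_preds(cache_name: str, test: List[str]) -> List[str]:
    # Load the pickled bundle; the context manager closes the file handle afterwards.
    with open(cache_name, 'rb') as f:
        loaded_model, dictionary, transpose, train_bow = pickle.load(f)
    # Convert the raw posts into the bag-of-words features the model was trained on.
    X_test = transpose(test, train_bow, dictionary)
    preds = loaded_model.predict(X_test)
    return preds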

In [None]:
def run_classifier(test_file: str) -> Tuple[float, float]:
    test_data = pandas.read_csv(test_file)

    cache_name = 'gender_classifier.pickle'
    test_preds = get_preds(cache_name, list(test_data["post_text"]))
    gold_test = list(test_data["op_gender"])
    gender_acc = accuracy_score(list(test_preds), gold_test)
    print("Gender classification accuracy", gender_acc)

    cache_name = 'subreddit_classifier.pickle'
    test_preds = get_preds(cache_name, list(test_data["post_text"]))
    gold_test = list(test_data["subreddit"])
    subreddit_acc = accuracy_score(list(test_preds), gold_test)
    print("Subreddit classification accuracy", subreddit_acc)
    return gender_acc, subreddit_acc

In [None]:
gender_acc, subreddit_acc = run_classifier("test.csv")

assert gender_acc == 0.646
assert subreddit_acc == 0.832

**Default accuracy:**
*   `Gender    classification accuracy: 0.646`
*   `Subreddit classification accuracy: 0.832`

## 2. Obfuscation of the Test Dataset
### 2.1 Random Obfuscated Dataset  (4P)
First, run a random experiment, by randomly swapping gender-specific words that appear in posts with a word from the respective list of words of the opposite gender.

*  Write a function to read the female.txt and male.txt files
*  Tokenize the posts (`post_text`) using NLTK (0.5p)
*  For each post, if written by a man ("M") and containing a token from male.txt, replace that token with a random one from female.txt (1p)
*  For each post, if written by a woman ("W") and containing a token from female.txt, replace that token with a random one from male.txt (1p)
*  Save the obfuscated version of test.csv in a separate CSV file (using pandas, and make sure to name it accordingly) (0.5p)
*  Run the given classifier again, report the accuracy and provide a brief commentary on the results compared to the baseline (1p)

In [None]:
#
# Solution
#
def read_data(file_name: str) -> List[str]:
    # Read one word per line and strip the trailing newline characters.
    with open(file_name) as file:
        return [line.replace("\n", "") for line in file.readlines()]

In [None]:
male_words = read_data("./male.txt")
female_words = read_data("./female.txt")

assert len(male_words) == 3000
assert len(female_words) == 3000

In [None]:
#
# Solution
#
def random_replace(x: str, words_1: List[str], words_2: List[str]) -> str:
  return " ".join(random.choice(words_1) if t.lower() in words_2 else t for t in nltk.tokenize.word_tokenize(x))

def obfuscate_gender(male_words: List[str], female_words: List[str], dataset_file_name: str) -> DataFrame:
  data = pandas.read_csv(dataset_file_name)
  data.loc[data['op_gender'] == "M", 'post_text'] = data.loc[data['op_gender'] == "M", 'post_text'].apply(lambda x: random_replace(x, female_words, male_words))
  data.loc[data['op_gender'] == "W", 'post_text'] = data.loc[data['op_gender'] == "W", 'post_text'].apply(lambda x: random_replace(x, male_words, female_words))
  return data

In [None]:
file_name = "random_replaced_test.csv"

In [None]:
random_replaced_test = obfuscate_gender(male_words=male_words, female_words=female_words, dataset_file_name="test.csv")
random_replaced_test.to_csv(file_name)

In [None]:
random_replaced_test = pandas.read_csv(file_name)
assert len(random_replaced_test) == 500
assert random_replaced_test["subreddit"][0] == "funny"
assert random_replaced_test["subreddit"][-1:].item() == "relationships"

In [None]:
gender_acc, subreddit_acc = run_classifier(file_name)

assert gender_acc <= 0.5
assert subreddit_acc >= 0.7

**Report accuracy:**
*   `Gender    classification accuracy: `
*   `Subreddit classification accuracy: `
*   `Your commentary: ` ...

### 2.2 Similarity Obfuscated Dataset (4P)
In a second approach, refine the swap method. Instead of randomly selecting a word, use a similarity metric.


*  Instead of choosing at random as in the first method, replace each token with a semantically similar token from the other gender's token list. You may choose any metric for identifying semantically similar words, but you have to justify your choice. (Recommended: cosine distance between pre-trained word embeddings) (2p)
*  Save the obfuscated version of test.csv in a separate CSV file (using pandas, and make sure to name it accordingly) (0.5p)
*  Run the given classifier again, report the accuracy and provide a brief commentary on the results (compared to the baseline and your other results) (1p)
*  The classifier's accuracy for predicting the gender should be below random guessing (50%), and for the subreddit prediction it should be above 80% (0.5p)
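
As a rough illustration of the recommended metric, the sketch below computes cosine similarity (one minus cosine distance) by hand and picks the closest candidate from the opposite list. The three-dimensional vectors in `toy_vectors` and the helper names are invented for this example; real pre-trained embeddings have hundreds of dimensions and would be loaded from a model instead.

```python
import math
from typing import Dict, List

# Hypothetical toy embeddings, made up purely for illustration.
toy_vectors: Dict[str, List[float]] = {
    "husband": [0.9, 0.1, 0.3],
    "wife": [0.8, 0.2, 0.4],
    "football": [0.1, 0.9, 0.0],
}


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(token: str, candidates: List[str], vectors: Dict[str, List[float]]) -> str:
    """Return the candidate closest to token, or token itself if nothing can be scored."""
    if token not in vectors:
        return token
    scored = [
        (cand, cosine_similarity(vectors[token], vectors[cand]))
        for cand in candidates
        if cand in vectors
    ]
    if not scored:
        return token
    return max(scored, key=lambda pair: pair[1])[0]


print(most_similar("husband", ["wife", "football"], toy_vectors))  # wife
```

Cosine similarity ignores vector length and compares direction only, which is one common justification for using it with word embeddings.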

In [None]:
from gensim.models import Word2Vec
import gensim.downloader
model = gensim.downloader.load("word2vec-google-news-300")

In [None]:
#
# Solution
#
def similar(a: str, b: str) -> float:
  try:
    return model.similarity(a, b)
  except KeyError:
    # At least one of the words is missing from the embedding vocabulary.
    return 0.0

def max_similar(token: str, words: List[str]) -> str:
  return max([(i, similar(i, token)) for i in words], key=lambda x:x[1])[0]

def similarity_replace(x: str, words_1: List[str], words_2: List[str]) -> str:
  return " ".join(max_similar(t, words_1) if t.lower() in words_2 else t for t in nltk.tokenize.word_tokenize(x))

def obfuscate_gender(male_words: List[str], female_words: List[str], dataset_file_name: str) -> DataFrame:
  data = pandas.read_csv(dataset_file_name)
  data.loc[data['op_gender'] == "M", 'post_text'] = data.loc[data['op_gender'] == "M", 'post_text'].apply(lambda x: similarity_replace(x, female_words, male_words))
  data.loc[data['op_gender'] == "W", 'post_text'] = data.loc[data['op_gender'] == "W", 'post_text'].apply(lambda x: similarity_replace(x, male_words, female_words))
  return data

In [None]:
file_name = "similarity_replaced_test.csv"

In [None]:
similarity_replaced_test = obfuscate_gender(male_words=male_words, female_words=female_words, dataset_file_name="./test.csv")
similarity_replaced_test.to_csv(file_name)

In [None]:
similarity_replaced_test = pandas.read_csv(file_name)
assert len(similarity_replaced_test) == 500
assert similarity_replaced_test["subreddit"][0] == "funny"
assert similarity_replaced_test["subreddit"][-1:].item() == "relationships"

In [None]:
gender_acc, subreddit_acc = run_classifier(file_name)

assert gender_acc <= 0.5
assert subreddit_acc >= 0.8

**Report accuracy:**
*   `Gender    classification accuracy: `
*   `Subreddit classification accuracy: ` 
*   `Your commentary: ` ...

### 2.3 Your Own Obfuscated Dataset (4P)
In this last approach, you can experiment with your own ways of obfuscating the posts.

*  Some examples: What if you randomly decide whether or not to replace words instead of replacing every lexicon word? What if you only replace words that have semantically similar enough counterparts? What if you use different word embeddings? (2p)
*  Save the obfuscated version of test.csv in a separate CSV file (using pandas, and make sure to name it accordingly) (0.5p)
*  Describe your modifications, report the accuracy, and provide a brief commentary on the results compared to the baseline and your other results (1.5p)
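
Two of the ideas above, replacing only some lexicon words and replacing only when a close enough counterpart exists, could be combined roughly as follows. This is a sketch under assumptions: `selective_replace`, `min_similarity`, `replace_prob` and the `toy_scores` table are hypothetical names and values chosen for illustration, and the `similarity` argument stands in for whatever metric you use (for instance the `similar` helper from section 2.2).

```python
import random
from typing import Callable, List, Optional


def selective_replace(
    tokens: List[str],
    source_words: List[str],
    target_words: List[str],
    similarity: Callable[[str, str], float],
    min_similarity: float = 0.5,
    replace_prob: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Swap a lexicon token only if chance allows it and a close counterpart exists."""
    rng = rng or random.Random()
    source = set(source_words)  # set lookup instead of scanning the list per token
    result = []
    for token in tokens:
        lowered = token.lower()
        if lowered in source and target_words and rng.random() < replace_prob:
            best = max(target_words, key=lambda w: similarity(w, lowered))
            if similarity(best, lowered) >= min_similarity:
                result.append(best)
                continue
        result.append(token)  # keep the original token otherwise
    return result


# Hypothetical similarity scores, made up for this example.
toy_scores = {
    ("wife", "husband"): 0.8,
    ("makeup", "husband"): 0.1,
    ("wife", "beer"): 0.2,
    ("makeup", "beer"): 0.3,
}


def toy_similarity(a: str, b: str) -> float:
    return toy_scores.get((a, b), 0.0)


tokens = ["My", "husband", "likes", "beer"]
print(selective_replace(tokens, ["husband", "beer"], ["wife", "makeup"], toy_similarity))
# ['My', 'wife', 'likes', 'beer']
```

In this toy run "beer" is left untouched because its best counterpart scores below the threshold. Raising `min_similarity` or lowering `replace_prob` changes fewer words, which should tend to preserve subreddit accuracy at the cost of weaker gender obfuscation.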

In [None]:
model = gensim.downloader.load("glove-twitter-200")

In [None]:
file_name = "similarity_glove_replaced_test.csv"

In [None]:
similarity_replaced_test = obfuscate_gender(male_words=male_words, female_words=female_words, dataset_file_name="./test.csv")
similarity_replaced_test.to_csv(file_name)

In [None]:
similarity_replaced_test = pandas.read_csv(file_name)
assert len(similarity_replaced_test) == 500
assert similarity_replaced_test["subreddit"][0] == "funny"
assert similarity_replaced_test["subreddit"][-1:].item() == "relationships"

In [None]:
gender_acc, subreddit_acc = run_classifier(file_name)

assert gender_acc <= 0.5
assert subreddit_acc >= 0.8

**Report accuracy:**
*   `Gender    classification accuracy: `
*   `Subreddit classification accuracy: ` 
*   `Your commentary: ` ...

## 3. Advanced Obfuscated Model (5P)
Develop your own obfuscation model using the provided background.csv for training. Your ultimate goal should be to obfuscate text so that the classifier is unable to determine the gender of a user (no better than random guessing) without compromising the accuracy of the subreddit classification task. In other words, train a model that is good at predicting the subreddit but bad at predicting gender. The key idea in this approach is to design a model that does not encode information about protected attributes (in this case, gender). In your report, include a description of your model and results.

*  Develop your own classifier (3p)
*  Use only posts from the subreddits "CasualConversation" and "funny" (min. 1000 posts for each gender per subreddit) (0.5p)
*  Use sklearn models (MLPClassifier, LogisticRegression, etc.)
*  Use 90% for training and 10% for testing (0.5p)
*  In your report, include a description of your model and report the accuracy on the unmodified train data (your baseline here) as well as the modified train data and provide a brief commentary on the results (1p)
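
One simple reading of the key idea, and the one the `simple_modify` helper in the solution cells follows, is to give every training post both gender labels so that the text carries no signal about the label. The sketch below illustrates that effect on invented toy rows; `duplicate_with_both_labels` and `label_share` are hypothetical helper names, and it assumes the only labels are "M" and "W".

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (post_text, op_gender)


def duplicate_with_both_labels(rows: List[Row]) -> List[Row]:
    """Emit every post twice: once labeled "W" and once labeled "M"."""
    return [(text, "W") for text, _ in rows] + [(text, "M") for text, _ in rows]


def label_share(rows: List[Row], label: str = "M") -> Dict[str, float]:
    """Fraction of each post's copies that carry the given label."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text, gender in rows:
        counts[text][gender] += 1
    return {text: c[label] / sum(c.values()) for text, c in counts.items()}


# Toy rows invented for this example.
toy_rows = [("post a", "M"), ("post b", "W"), ("post c", "M")]

print(label_share(toy_rows))
# {'post a': 1.0, 'post b': 0.0, 'post c': 1.0}
print(label_share(duplicate_with_both_labels(toy_rows)))
# {'post a': 0.5, 'post b': 0.5, 'post c': 0.5}
```

After duplication each post is labeled "M" exactly half of the time, so a classifier trained on this data has nothing to learn about gender from the text, while the subreddit labels are unchanged.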

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.utils import shuffle

In [None]:
def get_train_data(df_, labels, max_):
  df = [df_.loc[(df_['subreddit'] == l) & (df_['op_gender'] == "W"), ['post_text', 'op_gender', 'subreddit']].head(max_) for l in labels]
  df += [df_.loc[(df_['subreddit'] == l) & (df_['op_gender'] == "M"), ['post_text', 'op_gender', 'subreddit']].head(max_) for l in labels]
  return pandas.concat(df)

In [None]:
train_data = pandas.read_csv("background.csv")
labels = ["CasualConversation", "funny"]
train_data = get_train_data(train_data, labels, 4000)
train_data_original = train_data

In [None]:
print(train_data)
print(len(train_data))

In [None]:
train_data = shuffle(train_data)
test_data = train_data.head(round(len(train_data) * 0.1))
train_data = train_data[round(len(train_data) * 0.1):]
print(len(test_data))
print(len(train_data))

train_data.to_csv("mode_train_data.csv")

def simple_modify(data):
  copy_1 = data.copy()
  copy_2 = data.copy()
  copy_1.loc[copy_1['op_gender'] == "M", 'op_gender'] = "W"
  copy_2.loc[copy_2['op_gender'] == "W", 'op_gender'] = "M"
  new_data = pandas.concat([copy_1, copy_2])
  print(len(new_data))
  new_data.to_csv("mode_train_data_modified.csv")
  return new_data

In [None]:
train_data_modified = simple_modify(train_data)

In [None]:
def embedd_data(train_data, test_data):
  # bag-of-words representation
  stop_words = set(nltk.corpus.stopwords.words('english'))  # build the stop word set once, not per token
  train = [[token.lower() for token in nltk.tokenize.word_tokenize(t) if token not in stop_words] for t in train_data["post_text"]]
  test = [[token.lower() for token in nltk.tokenize.word_tokenize(t) if token not in stop_words] for t in test_data["post_text"]]
  dictionary = Dictionary(train + test)
  dictionary.filter_extremes(no_below=5)
  len_train = len(train)
  all_feats = gensim.matutils.corpus2csc([dictionary.doc2bow(x) for x in train + test]).transpose()
  test_feats = all_feats[len_train:]
  train_feats = all_feats[:len_train]
  return dictionary, train_feats, test_feats

In [None]:
def classify(train_data, test_data):
  dictionary, train_feats, test_feats = embedd_data(train_data, test_data)

  train_y = list(train_data["op_gender"])
  test_y = list(test_data["op_gender"])
  model_ = LogisticRegression(max_iter=10000)
  model_.fit(train_feats, train_y)
  predicted = model_.predict(test_feats)
  print("gender prediction:", np.mean(predicted == test_y))

  train_y = list(train_data["subreddit"])
  test_y = list(test_data["subreddit"])
  grid = {
      "penalty": ['l2'],
  }
  logreg = LogisticRegression(max_iter=10000)
  model_ = GridSearchCV(logreg, grid, cv=10, verbose=0)
  model_.fit(train_feats, train_y)
  predicted = model_.predict(test_feats)
  print("subreddit prediction:", np.mean(predicted == test_y))

In [None]:
# Baseline: train on the 90% split only, since train_data_original still contains the test rows.
classify(train_data, test_data)

In [None]:
classify(train_data_modified, test_data)

**Report accuracy:**
* Baseline:
  * `Gender    classification accuracy: `
  * `Subreddit classification accuracy: `
* Your Model: 
  * `Gender    classification accuracy: `
  * `Subreddit classification accuracy: ` 
*   `Your commentary: ` ...

## 4. Ethical Implications (3P)
Discuss the ethical implications of obfuscation and privacy based on the concepts covered in the lecture. Provide answers to the following points:

1.   What are demographic features (name at least three)? Briefly explain some of the associated privacy violation risks. (1p)
2.   Explain the cultural and social implications and their effects. In this context, discuss the information privacy paradox. You may refer to a recent example like the COVID-19 pandemic.  (1.5p)
3.   Name at least three privacy-preserving countermeasures.  (0.5p)

1. - Examples: Gender; Age; Location; Religion; Ethnicity; Social class; Diet; Personality type 
  - Risks: Humiliation, abuse, discrimination, identity theft, financial/physical/psychological/reputational damage or threat to life

---

2. 

Culturally, societies differ, mainly in governmental stringency and power distance. For example, countries with comparatively high power distance were found to be more likely to control pandemic numbers. Different social values, such as obedience, orientation, trust, and commitment to rules and authority, serve a common goal of safety for all and allow more stringent measures to be evaluated by their outcome. Consequently, in countries such as Taiwan and Singapore, stricter measures could be implemented, while in many European countries the measures led to a loss of trust in authority and to protests. Societies that value individual freedom and choice showed a higher growth rate of COVID-19 cases than societies that value cooperation and collective well-being (https://link.springer.com/article/10.1057/s41267-021-00455-w#Sec8).
These differences were particularly evident in the acceptance of quarantine and other national protective measures, such as the wearing of masks or apps to control the spread of infection, which were enforced with varying degrees of strictness. Since privacy is regarded as a fundamental freedom in many countries, such as Germany, this reveals a conflict between the individual and authority, or between risk and security (safety in this specific case).
From a social perspective, however, behavior is contradictory: on the one hand, people voluntarily disclose data to many online applications and social networks, where it is usually used for profiling; on the other hand, they make little active effort to protect their own data. Intentions and actions regarding privacy often do not coincide, which is commonly referred to as the information privacy paradox.


---

3. 
  * Anonymization (𝑘-anonymity, 𝑙-diversity, 𝑡-closeness)
  * Differential Privacy 
  * Encryption
  * Private Aggregation of Teacher Ensembles (PATE)
  * Synthetic Data
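
To make the first countermeasure more concrete, here is a minimal sketch of a 𝑘-anonymity check: a table is 𝑘-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least 𝑘 records. The function name `is_k_anonymous`, the column names, and the toy records are assumptions made up for this example.

```python
from collections import Counter
from typing import Dict, List, Sequence


def is_k_anonymous(
    records: List[Dict[str, str]],
    quasi_identifiers: Sequence[str],
    k: int,
) -> bool:
    """True if every quasi-identifier combination occurs in at least k records."""
    group_sizes = Counter(
        tuple(record[column] for column in quasi_identifiers) for record in records
    )
    return all(size >= k for size in group_sizes.values())


# Toy records invented for illustration; "diagnosis" is the sensitive attribute.
toy_records = [
    {"age_band": "20-29", "region": "North", "diagnosis": "flu"},
    {"age_band": "20-29", "region": "North", "diagnosis": "asthma"},
    {"age_band": "30-39", "region": "South", "diagnosis": "flu"},
]

print(is_k_anonymous(toy_records, ["age_band", "region"], k=2))      # False
print(is_k_anonymous(toy_records[:2], ["age_band", "region"], k=2))  # True
```

The first call fails because the single record in the 30-39/South group could be re-identified; generalizing or suppressing that record would be the usual next step.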
