## Illustrative Notebook: Privacy Leakage in NLP

In [100]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import gensim
from gensim.models import Word2Vec
import warnings
from PrivacyLeakWE import *
warnings.filterwarnings('ignore')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### *In this notebook, we study to what extent word embeddings fail to preserve privacy.*

Since an embedding maps a word to a low-dimensional vector, one might naively assume that it preserves privacy. The reality is different. To illustrate this point, we consider the well-known word embedding Word2Vec, available for instance in the gensim library, and the following scenario:

- A user types some text in an application. Without his knowledge, his words are embedded and transmitted to an API that suggests what to type next. Unfortunately for him, a malicious person knows his identity and succeeds in capturing this transmission: is his privacy at risk?

#### To answer this question, we consider several situations, some more realistic than others

### <span style='color:blue;'> Situation 1: We imagine that the user expresses himself rather simply, so that a dataset containing sentences close to what he has written can be found online.</span>

#### For this case we use a translation dataset from http://www.manythings.org/anki/ that maps English sentences to French sentences. In practice, this kind of dataset contains very simple sentences that anyone could use in everyday life. Here we are not interested in the translation task, only in the embeddings of the English sentences and of the French sentences.

In [7]:
raw_df=pd.read_csv('english_french.txt',delimiter='\t',encoding='utf-8')
raw_df.columns=['English','French','Licence']
raw_df=raw_df.drop(['Licence'],axis=1)
raw_df.sample(10)

| | English | French |
|---|---|---|
| 106166 | I live in this house by myself. | J'habite cette maison seul. |
| 67403 | We were all so busy then. | Nous étions tous alors tellement occupés. |
| 24178 | Don't say too much. | N'en dis pas trop ! |
| 74166 | This CD belongs to my son. | Ce CD appartient à mon fils. |
| 130051 | The boy caught the cat by the tail. | L'enfant attrapa le chat par la queue. |
| 96401 | We're facing a budget crisis. | Nous sommes confrontés à une crise budgétaire. |
| 158918 | Tom should've told Mary that he was married. | Tom aurait dû dire à Mary qu'il était marié. |
| 38977 | That's a lot of work. | C'est beaucoup de boulot. |
| 126854 | Are you finished reading the paper? | As-tu fini de lire le journal ? |
| 56416 | I guess I deserved that. | J'imagine que j'ai mérité ça. |


#### As expected, these are everyday sentences. We now need some preprocessing to obtain good embeddings: for instance, we lowercase all words and delete the punctuation. This is done with the functions `vpunc` and `vlower` from our library `PrivacyLeakWE.py`.

In [8]:
clean_data=pd.DataFrame(vpunc(vlower(raw_df)))
clean_data.columns=raw_df.columns

print(f'Dataset shape : {clean_data.shape}')

Dataset shape : (179903, 2)
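The cleaning step above can be approximated with the standard library. This is only a sketch: `clean_sentence` is a hypothetical stand-in, and the actual `vlower` and `vpunc` helpers in `PrivacyLeakWE.py` are not shown in this notebook and may behave differently (for instance on non-ASCII punctuation).

```python
import string

# Hypothetical stand-in for the notebook's vlower/vpunc helpers, whose real
# implementation lives in PrivacyLeakWE.py and is not shown here.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_sentence(sentence: str) -> str:
    """Lowercase a sentence, delete ASCII punctuation, collapse extra spaces."""
    lowered = sentence.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    # split()/join() removes the stray space left by French " ?" and " !"
    return " ".join(no_punct.split())


print(clean_sentence("Don't say too much."))
print(clean_sentence("As-tu fini de lire le journal ?"))
```

Deleting punctuation rather than replacing it with a space merges contractions into a single token, which matches tokens such as `weve` that appear in the outputs later in this notebook.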


#### Now that the dataset is cleaned, we embed the English and French sentences

In [9]:
np.random.seed(0)

# The user's sentences are assumed to come from approximately the same distribution as the online dataset.
# Note: the private and attacker samples are drawn independently, so they can share sentences.

english=clean_data.English.drop_duplicates()
private_english_data=english.sample(frac=0.5)
attacker_english_data=english.sample(frac=0.5) 

french=clean_data.French.drop_duplicates()
private_french_data=french.sample(frac=0.5)
attacker_french_data=french.sample(frac=0.5)

In [10]:
# Split each sentence into a list of words
private_english_input=(private_english_data.apply(lambda x: x.split(" ")).values)
attacker_english_input=(attacker_english_data.apply(lambda x: x.split(" ")).values)

private_french_input=(private_french_data.apply(lambda x: x.split(" ")).values)
attacker_french_input=(attacker_french_data.apply(lambda x: x.split(" ")).values)

##### The next cell takes some time to run. You can skip it and load the embeddings we have already computed with the cell that follows it

In [38]:
#private and attacker embeddings for english and french sentences
private_english_model = gensim.models.Word2Vec(sentences=private_english_input)
attacker_english_model = gensim.models.Word2Vec(sentences=attacker_english_input)

private_french_model = gensim.models.Word2Vec(sentences=private_french_input)
attacker_french_model = gensim.models.Word2Vec(sentences=attacker_french_input)

# Save the embeddings
private_english_model.save('private_english_model_0.5.model')
attacker_english_model.save('attacker_english_model_0.5.model')

private_french_model.save('private_french_model_0.5.model')
attacker_french_model.save('attacker_french_model_0.5.model')
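`Word2Vec` is called above with its default settings. In gensim the documented default for `min_count` is 5, so words seen fewer than five times in a corpus get no vector at all and cannot be looked up later. The sketch below illustrates that filtering rule on a toy corpus; `surviving_vocabulary` is a hypothetical helper written for illustration and is not part of gensim or of `PrivacyLeakWE.py`.

```python
from collections import Counter


def surviving_vocabulary(tokenized_sentences, min_count=5):
    """Return the words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# Toy corpus: "my" and "said" appear 5 times, "son" 4 times, the rest once.
toy_corpus = [["my", "doctor", "said", "hello"]] + [["my", "son", "said"]] * 4
vocab = surviving_vocabulary(toy_corpus)
print(sorted(vocab))
```

A rare but sensitive word can therefore be missing from the private or the attacker vocabulary, which limits what either side can embed or recover.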

In [39]:
#Load our embeddings
private_english_model=gensim.models.Word2Vec.load('private_english_model_0.5.model')
attacker_english_model=gensim.models.Word2Vec.load('attacker_english_model_0.5.model')

private_french_model=gensim.models.Word2Vec.load('private_french_model_0.5.model')
attacker_french_model=gensim.models.Word2Vec.load('attacker_french_model_0.5.model')

#### We can already test whether a sensitive sentence can be recovered by the attacker, using the function `privacy_leak` from our library `PrivacyLeakWE.py`. Concretely, we imagine that the user has typed the sentence given as input; knowing the embedding of this private sentence, the attacker has to find the closest sentence he can using his own embeddings.

In [40]:
privacy_leak('my doctor said that I only have a few days left to live',attacker_english_model,private_english_model)
print('\n')
privacy_leak('My son was ill yesterday',attacker_english_model,private_english_model)
print('\n')
privacy_leak('Il se sent seul',attacker_french_model,private_french_model)
print('\n')
privacy_leak('Votre mot de passe est bonjour' ,attacker_french_model,private_french_model)

 my | confidence: 0.73
 doctor | confidence: 0.81
 said | confidence: 0.9
 that | confidence: 0.87
 i | confidence: 0.87
 only | confidence: 0.85
 have | confidence: 0.89
 a | confidence: 0.74
 few | confidence: 0.83
 days | confidence: 0.9
 left | confidence: 0.88
 to | confidence: 0.87
 live | confidence: 0.89


 my | confidence: 0.73
 bag | confidence: 0.87
 was | confidence: 0.83
 ill | confidence: 0.89
 yesterday | confidence: 0.84


 il | confidence: 0.83
 se | confidence: 0.82
 sent | confidence: 0.9
 seul | confidence: 0.86


 votre | confidence: 0.86
 mot | confidence: 0.88
 de | confidence: 0.74
 passe | confidence: 0.87
 est | confidence: 0.83
 autrement | confidence: 0.88


#### As you can observe, the attacker retrieves most of the sensitive information typed by the user without difficulty and with high confidence! Only two words are wrong: "son" is recovered as "bag" and "bonjour" as "autrement".
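The attack described above amounts to a nearest-neighbour lookup: each leaked private vector is compared with every vector in the attacker's vocabulary. The sketch below shows the idea in plain Python on toy 2-D vectors. It assumes the "confidence" is a cosine similarity, which the notebook does not state explicitly, and `recover_sentence` is a hypothetical name rather than the code of `privacy_leak`. With real models, gensim's `similar_by_vector` on `model.wv` performs this kind of lookup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recover_sentence(sentence, private_vectors, attacker_vectors):
    """For each word, take its private vector and return the closest attacker word."""
    guesses = []
    for word in sentence.lower().split():
        if word not in private_vectors:
            continue  # out-of-vocabulary words have no embedding to leak
        leaked = private_vectors[word]
        best = max(attacker_vectors, key=lambda w: cosine(leaked, attacker_vectors[w]))
        guesses.append((best, round(cosine(leaked, attacker_vectors[best]), 2)))
    return guesses


# Toy embeddings (made-up numbers): the attacker has no vector for "son".
private_vectors = {"my": [1.0, 0.1], "son": [0.2, 1.0], "ill": [-1.0, 0.3]}
attacker_vectors = {"my": [0.9, 0.2], "bag": [0.3, 0.9],
                    "ill": [-0.8, 0.2], "was": [0.5, -0.9]}

print(recover_sentence("My son ill", private_vectors, attacker_vectors))
```

In this toy setup the attacker guesses "bag" for "son" because that is the nearest vector he owns, which mirrors the kind of near miss seen in the output above.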

#### We can also measure the percentage of words approximately recovered by the attacker, where a word counts as approximately recovered when it is among the `topn` most probable words proposed by the attacker model, for `topn` from 1 to 5. This is done with the function `attack_efficiency` from the library `PrivacyLeakWE.py`.

In [41]:
results=pd.DataFrame()
english=[]
french=[]

for topn in range(1,6):
    english.append(attack_efficiency(attacker_english_model,private_english_model,topn=topn))
    french.append(attack_efficiency(attacker_french_model,private_french_model,topn=topn))

results['topn']=list(range(1,6))
results['Accuracy_english_dataset']=english
results['Accuracy_french_dataset']=french
results

| | topn | Accuracy_english_dataset | Accuracy_french_dataset |
|---|---|---|---|
| 0 | 1 | 0.24 | 0.28 |
| 1 | 2 | 0.27 | 0.33 |
| 2 | 3 | 0.3 | 0.37 |
| 3 | 4 | 0.32 | 0.39 |
| 4 | 5 | 0.33 | 0.41 |


#### For the English sentences, the attacker approximately recovers up to 33% of the private vocabulary. This may seem low but, as we have seen before, it is enough to capture sensitive information! For the French sentences the rate is higher, up to 41% of the private vocabulary.
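The top-n metric used here can be sketched as follows. This is an illustration on toy vectors, not the code of `attack_efficiency`: `top_n_accuracy` is a hypothetical name, cosine similarity is assumed as the ranking score, and private words absent from the attacker vocabulary are simply counted as misses, which may differ from the library's behaviour.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n_accuracy(private_vectors, attacker_vectors, topn=1):
    """Share of private words found among the attacker's `topn` nearest words."""
    hits = 0
    for word, leaked in private_vectors.items():
        ranked = sorted(attacker_vectors,
                        key=lambda w: cosine(leaked, attacker_vectors[w]),
                        reverse=True)
        if word in ranked[:topn]:
            hits += 1
    return hits / len(private_vectors)


# Toy embeddings (made-up numbers): "son" is only the attacker's second guess.
private_vectors = {"my": [1.0, 0.1], "son": [0.2, 1.0], "ill": [-1.0, 0.3]}
attacker_vectors = {"my": [0.9, 0.2], "son": [0.5, 0.8],
                    "bag": [0.3, 0.9], "ill": [-0.8, 0.2]}

for topn in (1, 2):
    print(topn, round(top_n_accuracy(private_vectors, attacker_vectors, topn), 2))
```

By construction the score can only stay equal or grow as `topn` increases, which is the pattern visible in the results table.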

In [50]:
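# Range (max - min) of the coordinates of one embedding vector.
# Word vectors are accessed through `.wv`; direct indexing of the model was removed in gensim 4.
np.max(private_english_model.wv['hello'])-np.min(private_english_model.wv['hello'])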

0.3940304

#### Now, we can see the impact of Laplace noise on the attack.

We use the same trick as the paper we studied: we enforce a [0,1] range on the embedded representations and then apply Laplace noise $\mathcal{L}\left(\frac{1}{\epsilon}\right)$.
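A minimal sketch of this two-step mechanism is given below, under stated assumptions: the rescaling uses the global minimum and maximum over all coordinates, and the noise is drawn independently per coordinate from a Laplace distribution with scale $1/\epsilon$. `min_max_normalize` and `add_laplace_noise` are hypothetical names; the notebook's `normalize_embedding` and `private_embedding` are not shown here and may be implemented differently.

```python
import random


def min_max_normalize(vectors):
    """Rescale every coordinate to [0, 1] using the global min and max."""
    values = [x for vec in vectors.values() for x in vec]
    lo, hi = min(values), max(values)
    return {w: [(x - lo) / (hi - lo) for x in vec] for w, vec in vectors.items()}


def add_laplace_noise(vectors, epsilon, rng):
    """Add independent Laplace(0, 1/epsilon) noise to each coordinate."""
    def laplace():
        # The difference of two i.i.d. exponentials with rate epsilon
        # follows a Laplace distribution with scale 1/epsilon.
        return rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return {w: [x + laplace() for x in vec] for w, vec in vectors.items()}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
vectors = {"doctor": [-0.2, 0.1], "days": [0.15, -0.05]}  # made-up numbers
normalized = min_max_normalize(vectors)
noisy = add_laplace_noise(normalized, epsilon=5, rng=rng)
print(normalized)
print(noisy)
```

Bounding the coordinates first is what gives the noise scale a meaning: once every value lies in [0,1], a scale of $1/\epsilon$ can be read directly against the full range of the data.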

In [102]:
attacker_english_model=normalize_embedding(attacker_english_model)
private_english_model=normalize_embedding(private_english_model)

In [110]:
pd.options.display.max_colwidth = 100

In [111]:
results=pd.DataFrame()
epsilons=[0.05,0.1,0.5,1,5,10]
recovered_sentences=[]
for eps in epsilons:
    dp_private_model=private_embedding(private_english_model,eps)
    recovered_sentence=privacy_leak('my doctor said that I only have a few days left to live',attacker_english_model,dp_private_model,display=False)
    recovered_sentences.append(recovered_sentence)

results[r'$\epsilon$']=epsilons  # raw string: '\e' is not a valid escape sequence
results['recovered_sentences']=recovered_sentences
results

| | $\epsilon$ | recovered_sentences |
|---|---|---|
| 0 | 0.05 | much you if who than yesterday sorry my know than who tell believe |
| 1 | 0.1 | much you if who than yesterday sorry my know than who tell believe |
| 2 | 0.5 | much you if who than dried sorry my know than who tell believe |
| 3 | 1.0 | much you why who than asleep sorry an were than who tell believe |
| 4 | 5.0 | your doctor said who i only have an options days left to live |
| 5 | 10.0 | my doctor said that i only have an few days left to live |


##### As expected, the noise can hide the sensitive information. Now suppose the attacker knows the nature of this noise: could he recover the sensitive sentence in that case?

In [118]:
dp_private_model=private_embedding(private_english_model,5)
noisy_attacker_model=private_embedding(attacker_english_model,5)
privacy_leak('my doctor said that I only have a few days left to live',noisy_attacker_model,dp_private_model,display=True)

 eat | confidence: 0.89
 morning | confidence: 0.92
 who | confidence: 0.93
 accident | confidence: 0.94
 your | confidence: 0.87
 guy | confidence: 0.96
 see | confidence: 0.9
 will | confidence: 0.91
 prices | confidence: 0.94
 record | confidence: 0.95
 weve | confidence: 0.92
 the | confidence: 0.84
 our | confidence: 0.97


'eat morning who accident your guy see will prices record weve the our '

##### We observe that the attack gets worse when the attacker also uses noisy embeddings, even though he knows the noise level, so the noise seems sufficient to protect the user's privacy. However, this comes at a cost in utility: we lose the information that Word2Vec built into the embeddings. We discuss this in the second illustrative notebook.
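One way to get a feel for both the protection and the utility cost is to compare the noise magnitude with the [0,1] range of the normalized coordinates. Assuming the Laplace scale is $1/\epsilon$, as the notation $\mathcal{L}(1/\epsilon)$ suggests, the standard deviation of the noise is $\sqrt{2}/\epsilon$. The helper `laplace_std` below is hypothetical and only does this arithmetic.

```python
import math


def laplace_std(epsilon):
    """Standard deviation of Laplace noise with scale 1/epsilon."""
    return math.sqrt(2) / epsilon


# Same epsilon values as in the sweep earlier in this section.
for eps in [0.05, 0.1, 0.5, 1, 5, 10]:
    print(f"epsilon={eps:<5} noise std={laplace_std(eps):.3f}")
```

Under this assumption the noise standard deviation exceeds the whole [0,1] range for every $\epsilon \le 1$ and drops well below it for $\epsilon \ge 5$. That is consistent with the earlier sweep, where the sentence is unrecognizable up to $\epsilon = 1$ and mostly recovered from $\epsilon = 5$.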

### <span style='color:blue;'>  Situation 2: We imagine that the user asks an anonymous question on a website such as quora.com; the attacker knows this and decides to build a dataset of questions from the same site.</span>

#### To simulate this situation we use the Quora question pairs dataset, a set of question pairs that are potentially duplicates, in the sense that they express the same question without being formulated in exactly the same way. We take a random sample of the first questions as the private user dataset (say, questions already asked by the user or by users who share his interests) and a random sample of the second questions as the attacker dataset. We can imagine that the attacker built this dataset himself by collecting questions on quora.com.


In [280]:
raw_df=pd.read_csv('quora.csv',encoding='utf-8')
private_df=raw_df.sample(frac=0.4).question1.values
attack_df=raw_df.sample(frac=0.7).question2.values

In [281]:
print('The user and his fellows have already asked ',len(private_df),' questions')
print('The attacker has collected ',len(attack_df),' questions')

The user and his fellows have already asked  161716  questions
The attacker has collected  283003  questions


In [282]:
raw_df.groupby(['is_duplicate']).count()['id']

is_duplicate
0    255027
1    149263
Name: id, dtype: int64

##### As we can observe, there are about 1.7 times as many non-duplicate pairs as duplicate ones (255027 versus 149263). Hence our study is reasonable, since the two datasets are not completely correlated. To illustrate this point, here is a pair whose two questions are entirely unrelated.

In [283]:
print(raw_df[raw_df['id']==3].question1.values)
print(raw_df[raw_df['id']==3].question2.values)

['Why am I mentally very lonely? How can I solve it?']
['Find the remainder when [math]23^{24}[/math] is divided by 24,23?']


##### As we see, these two questions are completely different. However, there are also many pairs of questions that are similar but expressed differently.
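The share of duplicate pairs is only an indirect sign of how close the two corpora are. A more direct check, sketched below as a suggestion rather than something the notebook does, is the Jaccard overlap between the vocabularies of the private and attacker questions. `jaccard_vocab_overlap` is a hypothetical helper, and the example questions are taken from the dataset samples shown in this notebook, with punctuation removed.

```python
def jaccard_vocab_overlap(sentences_a, sentences_b):
    """Jaccard index between the word sets of two lists of sentences."""
    vocab_a = {w for s in sentences_a for w in s.lower().split()}
    vocab_b = {w for s in sentences_b for w in s.lower().split()}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


private_questions = ["How do I buy Facebook likes",
                     "Why am I mentally very lonely"]
attacker_questions = ["Where can I buy Facebook likes",
                      "Do Capricorn men like affection"]

print(round(jaccard_vocab_overlap(private_questions, attacker_questions), 2))
```

A value near 1 would mean the attacker's corpus covers almost the same words as the private one, while a value near 0 would correspond to unrelated corpora.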

In [246]:
raw_df.sample(10)

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 97018 | 97018 | 161498 | 161499 | What are reviews for the Pro-Arch Foot Stretcher? | How can I get rid of a burning pain in the arch of my foot? | 0 |
| 111697 | 111697 | 182888 | 182889 | Why do Capricorn men cheat? | Do Capricorn men like affection? | 0 |
| 306287 | 306287 | 429774 | 429775 | What is Ben Kowalewicz's scar on the nose from? | How bad is the scar of a nose burn? | 0 |
| 253546 | 253546 | 85662 | 7324 | How do I reduce belly and chest fat? | What is the best way to reduce belly and arm fat? | 1 |
| 17949 | 17949 | 6332 | 34041 | After a US President serves two four-year terms, can they run again after four to eight years be... | Could a president run for a third term after taking a 4-8 year break? | 0 |
| 128470 | 128470 | 68290 | 184040 | How would you deal with something that worries you and you have no control on it? | How would you deal with something that worries you and you have no control? | 1 |
| 177810 | 177810 | 273247 | 273248 | How can I stop loving someone who hates me? | How do I stop loving a person who does not love me? | 0 |
| 94799 | 94799 | 158223 | 158224 | How do I buy Facebook likes? | Where can I buy Facebook likes? | 1 |
| 351134 | 351134 | 27805 | 479951 | Why are console games more expensive than PC versions? | What is the benefit of having more than 8gb of RAM on a gaming PC? | 0 |
| 283045 | 283045 | 403079 | 24918 | What is the connection between data science and artificial intelligence? Is it machine learning? | What is the difference between Data Analytics, Data Analysis, Data Mining, Data Science, Machine... | 0 |


##### Now, let's apply the same preprocessing as before

In [284]:
private_data=pd.DataFrame(vpunc(vlower(private_df)))
attack_data=pd.DataFrame(vpunc(vlower(attack_df)))
private_data.columns=['sentence']
attack_data.columns=['sentence']

##### The next cell takes some time to run. You can skip it and load the embeddings we have already computed with the cell that follows it

In [285]:
np.random.seed(0)

private_english_data=private_data.sentence.drop_duplicates()
attacker_english_data=attack_data.sentence.drop_duplicates()

private_english_input=(private_english_data.apply(lambda x: x.split(" ")).values)
attacker_english_input=(attacker_english_data.apply(lambda x: x.split(" ")).values)

private_english_model = gensim.models.Word2Vec(sentences=private_english_input,min_count=5)
attacker_english_model = gensim.models.Word2Vec(sentences=attacker_english_input,min_count=5)

private_english_model.save('private_english_model_quora.model')
attacker_english_model.save('attacker_english_model_quora.model')

In [336]:
private_english_model=gensim.models.Word2Vec.load('private_english_model_quora.model')
attacker_english_model=gensim.models.Word2Vec.load('attacker_english_model_quora.model')

#### Let's now look for an example of a sensitive question in the dataset

In [332]:
private_df=pd.DataFrame(private_df)
private_df.sample(10)

| | 0 |
|---|---|
| 3480 | What was the best day of your life so far? |
| 117158 | What is the best source of learning astronomy in Pakistan? |
| 11819 | Which is the best and genuine immigration consultant in Bangalore? |
| 102596 | What is Fiber Optic Transceivers Modules? |
| 50690 | Which the best distribution of linux? |
| 145093 | How do I think clearly when I am feeling lost and uncertain about the future? |
| 105832 | What is the interview process like for a summer internship at Mozilla? |
| 129185 | Does DNA change during the life? |
| 144379 | Can I slap "Forever" stamps on a letter to Canada? |
| 57363 | How can you lose weight quickly? |


"How can you lose weight quickly?" will do the trick

In [337]:
privacy_leak('How can you lose weight quickly',attacker_english_model,private_english_model)
print('\n')

 how | confidence: 0.69
 can | confidence: 0.71
 you | confidence: 0.67
 lose | confidence: 0.75
 belly | confidence: 0.68
 quickly | confidence: 0.69




#### This time we do not recover exactly the same question ("weight" is recovered as "belly"), but the idea is the same and the author's privacy is at stake. Let's now look at the share of approximately correct guesses, as before

In [334]:
results=pd.DataFrame()
accuracies=[]

for topn in range(1,5):
    accuracies.append(attack_efficiency(attacker_english_model,private_english_model,topn=topn))

results['topn']=list(range(1,5))
results['Accuracy']=accuracies
print(results)

   topn  Accuracy
0     1      0.12
1     2      0.15
2     3      0.17
3     4      0.18


##### Here the rate of approximately recovered words reaches 18%. This is lower than before but still not negligible since, as shown above, sensitive information can be recovered.

### <span style='color:blue;'>  Situation 3: We again consider that the user asks a question on quora.com, but this time the attacker does not know it in advance. He therefore decides to use a public dataset made of numerous tweets and hopes to find something interesting.</span>

In [345]:
quora_df=pd.read_csv('quora.csv',encoding='utf-8')

twitter_df=pd.read_csv('chat.txt',delimiter='\t',encoding='utf-8')
twitter_df.columns=['sentence']

private_df=quora_df.sample(frac=0.7).question1.values
attack_df=twitter_df.sample(frac=0.3).sentence

##### The next cell takes some time to run and can cause a memory error. You can skip it and load the embeddings we have already computed with the cell that follows it

In [347]:
private_data=pd.DataFrame(vpunc(vlower(private_df)))
attack_data=pd.DataFrame(vpunc(vlower(attack_df)))

private_data.columns=['sentence']
attack_data.columns=['sentence']

private_english_data=private_data.sentence.drop_duplicates()
attacker_english_data=attack_data.sentence.drop_duplicates()

private_english_input=(private_english_data.apply(lambda x: x.split(" ")).values)
# Tweets are truncated to their first 10 tokens
attacker_english_input=(attacker_english_data.apply(lambda x: x.split(" ")[0:10]).values)

private_english_model = gensim.models.Word2Vec(sentences=private_english_input)
attacker_english_model = gensim.models.Word2Vec(sentences=attacker_english_input)

private_english_model.save('private_english_model_quora_twitter.model')
attacker_english_model.save('attacker_english_model_quora_twitter.model')

In [365]:
private_english_model=gensim.models.Word2Vec.load('private_english_model_quora_twitter.model')
attacker_english_model=gensim.models.Word2Vec.load('attacker_english_model_quora_twitter.model')

In [366]:
results=pd.DataFrame()
accuracies=[]

for topn in range(1,5):
    accuracies.append(attack_efficiency(attacker_english_model,private_english_model,topn=topn))

results['topn']=list(range(1,5))
results['Accuracy']=accuracies
print(results)

   topn  Accuracy
0     1   0.00302
1     2   0.00453
2     3   0.00613
3     4   0.00735


#### This time the attack is not very effective. However, nothing tells us that better results could not be obtained with other, more sophisticated word embeddings. Moreover, we can still recover private information even with this poor attack, as you can see below

In [367]:
privacy_leak('I really like trump',attacker_english_model,private_english_model)

 i | confidence: 0.41
 well | confidence: 0.31
 like | confidence: 0.32
 trump | confidence: 0.43


'i well like trump '

## Conclusions:

- We have shown with very simple experiments that a well-known embedding such as Word2Vec is not completely private. Even with two completely different datasets, we succeeded in recovering sensitive information.

- Adding Laplace noise to the user embedding really helped to make it private. It remains to be seen to what extent this noise hinders the NLP task performed by the API. This part is treated in the second illustrative notebook of our study.

- In practice, if the attacker uses a different embedding from the one used for the user dataset, the attack should not give satisfying results. People tend to use well-known, proven models such as BERT, which gives the attacker a baseline, but he cannot be sure of what to use if information about the embedding has been kept private.

## Perspectives:

- As an extension of this work, we could study the impact of the dimension of the embedding space on privacy.
- It would also be interesting to see what happens when the user's embedding method differs from the one chosen by the attacker, as suggested above.