##  **Illustrative notebook** : Differentially Private Neural Representation

*The code of this notebook for making the classifier is inspired by https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html*

We have adapted this code to apply the strategy proposed by the authors, which consists in building a robust differentially private NLP classifier.

**Remark: to speed up the computations, you can use a Google Colab GPU**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RonyAbecidan/PrivateWordEmbeddings/blob/main/Experiments/Private_Word_Embeddings/PrivateWE.ipynb)

In [None]:
!pip install torchtext==0.4 --quiet
print('PLEASE RESTART RUNTIME AFTER INSTALLING THIS PACKAGE')

In [None]:
%load_ext autoreload
%autoreload 2

from tqdm import tqdm_notebook as tqdm
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import pandas as pd
import numpy as np
import time
import torch
import torchtext
import pickle
from torchtext.datasets import text_classification
from torch.utils.data.dataset import random_split
from IPython.display import clear_output

import warnings
# from DPWE import *
warnings.filterwarnings('ignore')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#### First, we can define some functions that will be useful for the implementation of the DP word embedding

In [None]:
def normalize_batch(batch):
    '''
    @batch : a batch of embeddings, of shape (batch_size, dim)
    Returns a min-max normalized version of the embeddings, rescaled row by row
    so that the coefficients of each embedding belong to [0,1]
    '''
    batch_size,dim=batch.size()
    return torch.mul(batch-torch.min(batch,axis=1)[0].view(batch_size,-1),(1/(torch.max(batch,axis=1)[0]-torch.min(batch,axis=1)[0])).view(batch_size,-1))
    
def Laplace_mechanism(tensor,eps,s=1,random_state=None):
    '''
    @tensor : an embedding or a batch of embeddings that we want to make private
    @eps : the privacy budget. A lower epsilon means more noise
    @s : l1-sensitivity (set to 1 when the normalization above is applied)
    @random_state : the random seed
    Returns a private embedding obtained with the Laplace mechanism (noise of scale s/eps)
    '''
    rng = np.random.RandomState(random_state)
    return tensor + torch.tensor(rng.laplace(scale=s/eps,size=tensor.size())).to(device)
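To make the role of the privacy budget more concrete, here is a minimal sketch in pure Python (standard library only, so it runs without PyTorch). The helper names `min_max_normalize`, `laplace_sample`, `privatize` and `mean_abs_noise` are ours and only serve the illustration. The sketch relies on the fact that a Laplace variable of scale $b$ has an expected absolute value equal to $b$, so the typical noise magnitude is $s/\epsilon$.

```python
import random

def min_max_normalize(vec):
    """Rescale a list of floats so that its coefficients lie in [0, 1]."""
    lo, hi = min(vec), max(vec)
    if hi == lo:  # constant vector: avoid a division by zero
        return [0.0 for _ in vec]
    return [(x - lo) / (hi - lo) for x in vec]

def laplace_sample(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize(vec, eps, s=1.0, rng=None):
    """Normalize a vector, then add Laplace noise of scale s/eps to each coefficient."""
    rng = rng or random.Random()
    return [x + laplace_sample(s / eps, rng) for x in min_max_normalize(vec)]

def mean_abs_noise(eps, s=1.0, n=20000, seed=0):
    """Empirical average magnitude of the noise added for a given epsilon."""
    rng = random.Random(seed)
    return sum(abs(laplace_sample(s / eps, rng)) for _ in range(n)) / n

# Hypothetical toy embedding, for illustration only
embedding = [-0.3, 0.1, 0.5, 0.2]
normalized = min_max_normalize(embedding)
private = privatize(embedding, eps=0.5, rng=random.Random(0))

print(normalized)
print(private)
for eps in (0.05, 0.5, 5):
    print(eps, round(mean_abs_noise(eps), 3))
```

With $s=1$, the average noise magnitude for $\epsilon=0.5$ is about 2, i.e. twice the width of the $[0,1]$ range of the normalized coefficients, whereas for $\epsilon=5$ it drops to about 0.2.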

#### To test the authors' algorithms, we have chosen two well-known NLP datasets for classification tasks: the "AG dataset", made of news articles belonging to 4 categories, and the "Yelp Polarity Reviews", made of reviews labelled with 2 sentiments.

In [None]:
NGRAMS = 2
import os
if not os.path.isdir('./.data'):
    os.mkdir('./.data')
AG_data, AG_test = text_classification.DATASETS['AG_NEWS'](root='./.data', ngrams=NGRAMS, vocab=None)
Yelp_data, Yelp_test = text_classification.DATASETS['YelpReviewPolarity'](root='./.data', ngrams=NGRAMS, vocab=None)

BATCH_SIZE = 64

#### We split our data for the training phase into 95% train and 5% validation

#### Since the training time is non-negligible, we keep only 30000 sentences in total for the train and validation sets.

In [None]:
min_valid_loss = float('inf')
ag_train_size = int(0.95*30000)
AG_train_set,AG_valid_set = random_split(AG_data[0:30000], [ag_train_size, 30000 - ag_train_size])

yelp_train_size = int(0.95*30000)
Yelp_train_set,Yelp_valid_set = random_split(Yelp_data[0:30000], [yelp_train_size, 30000 - yelp_train_size])

Yelp_train_set[0][1][0:10]

#### As we can observe, the sentences have already been tokenized, i.e. each word of the vocabulary is associated with an integer identifying it. The module `nn.Embedding()` of PyTorch maps each of these integers to a learned vector, which is equivalent to one-hot encoding them and passing them through a simple linear layer without bias.
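The link between an embedding lookup and a one-hot vector passed through a bias-free linear layer can be checked on a toy example. This is a minimal sketch in pure Python; the table `toy_weights` and the helper names are hypothetical and only serve the illustration.

```python
def one_hot(index, vocab_size):
    """Return the one-hot encoding of a token index as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def linear_no_bias(vector, weights):
    """Multiply a row vector of length vocab_size by a vocab_size x dim matrix."""
    dim = len(weights[0])
    return [sum(v * row[d] for v, row in zip(vector, weights)) for d in range(dim)]

def embedding_lookup(index, weights):
    """Pick the row of the weight matrix associated with the token index."""
    return list(weights[index])

# Hypothetical toy embedding table: 4 tokens, embedding dimension 2
toy_weights = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9], [0.0, 1.0]]
token = 2

via_one_hot = linear_no_bias(one_hot(token, len(toy_weights)), toy_weights)
via_lookup = embedding_lookup(token, toy_weights)
print(via_one_hot, via_lookup)
```

Both computations select the same row of the table; the lookup simply avoids building the one-hot vector, which matters when the vocabulary is large.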

#### Now, let's build the classifier we will use. Concretely, we are going to run several experiments, so it would be great to have a classifier that can be adapted to each case with few modifications.

#### Thus, we propose to build **a Swiss Army knife classifier** that will enable us to observe all the "interesting cases". Basically, we are looking for:

- The test accuracy with a non-private embedding and a classic training, to have a baseline for how much utility we lose/gain when we add noise
- The test accuracy with a private embedding and a classic training, to have a reference for how much utility we lose/gain once noise is also added in the training phase
- The test accuracy with a private embedding and a robust training phase, in which we add the noise to the embedded sentences from the training set
- The test accuracy with the adaptive training strategy described in our report, to see whether it is a good strategy or not
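The four cases above differ only in the noise level seen during training and at test time. The following sketch summarizes this, assuming the adaptive schedule $\epsilon_t = c/t$ used by the `train` function of this notebook; the function names `training_epsilon` and `inference_epsilon` are ours and are not part of the notebook's code.

```python
CASES = ('non private', 'non robust', 'robust', 'adaptive')

def training_epsilon(case, eps, epoch, c=1.0):
    """Return the epsilon used to perturb embeddings during training (None = no noise).

    case  : one of CASES
    eps   : privacy budget used at test time
    epoch : 1-based epoch index
    c     : constant of the adaptive schedule eps_t = c / epoch
    """
    if case in ('non private', 'non robust'):
        return None          # classic training: no noise on the embeddings
    if case == 'robust':
        return eps           # same noise level as at test time
    if case == 'adaptive':
        return c / epoch     # epsilon shrinks, so the noise grows with the epochs
    raise ValueError(f'unknown case: {case}')

def inference_epsilon(case, eps):
    """Return the epsilon used at test time (None = no noise)."""
    return None if case == 'non private' else eps

for case in CASES:
    per_epoch = [training_epsilon(case, eps=0.5, epoch=t) for t in (1, 2, 3)]
    print(case, per_epoch, inference_epsilon(case, eps=0.5))
```

Only the non-private baseline is evaluated without noise; the three other cases share the same test-time $\epsilon$ and differ in what the classifier sees during training.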

In [None]:
class Classifier(nn.Module):
    '''
    Swiss Army knife classifier that makes it possible to test all the listed cases
    '''
    def __init__(self, vocab_size, embed_dim, num_class,noise=False,dropout_rate=False,random_state=0):
        super().__init__()
        
        self.vocab_size=vocab_size
        self.embed_dim=embed_dim
        self.num_class=num_class
        self.noise=noise
        self.dropout_rate=dropout_rate
        self.random_state=random_state
        
        #Embeddings of the words
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        #Linear layer enabling to do the classification
        self.fc = nn.Linear(embed_dim, num_class)

    #random initialization of the weights
    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
        
    def forward(self, text, offsets):
        
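        #If a dropout rate is provided
        if self.dropout_rate:
            #each token index is replaced by 0 with probability dropout_rate;
            #a mask keeps the surviving indices exact (no float rescaling before the cast to long)
            keep=torch.rand(text.shape,device=text.device)>=self.dropout_rate
            text=text*keep.long()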
            
        #Embeddings
        embedded = self.embedding(text.long(), offsets)
        #Normalization of the embedding
        embedded=normalize_batch(embedded)
        
        #If a noise is provided
        if self.noise:
            embedded=Laplace_mechanism(tensor=embedded,eps=self.noise,random_state=None).float()
           
        return self.fc(embedded)

#Typical loss for classification task also used by the authors
criterion = torch.nn.CrossEntropyLoss()

def generate_batch(batch):
    '''This function handles the case where batch of sentences have different sizes '''
    #list containing the labels for each element of the batch
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    #list of the sizes of each sentence
    offsets = [0] + [len(entry) for entry in text]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    #Concatenation of the sentences
    text = torch.cat(text)
    return text, offsets, label

#train the model on one epoch and return the loss and accuracy for this epoch.
def train_func(train_set,model,optimizer,scheduler):
    '''This function handles the training phase of a neural network'''
    model.train()
    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)
    
    for i, (text, offsets, cls) in tqdm(enumerate(data)):
        
        
        optimizer.zero_grad()
        
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets).to(device)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()
        
    clear_output(wait=True)
    # Adjust the learning rate
    scheduler.step()

    return train_loss / len(train_set), train_acc / len(train_set)

#Compute the accuracy on the validation and test sets
def test(test_set,model):
    '''This function computes the loss and the accuracy of the model on a validation or test set'''
    model.eval()
    total_loss = 0
    acc = 0
    data = DataLoader(test_set, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in tqdm(data):
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets).to(device)
            #accumulate in a separate variable so that the running total is not overwritten
            total_loss += criterion(output, cls).item()
            acc += (output.argmax(1) == cls).sum().item()

    return total_loss / len(test_set), acc / len(test_set)

def train(model,train_set,valid_set,test_set,nb_epochs=1,dropout=False,noise=False,noise_during_training=False,display=False,increasing_noise=False):
    '''This function can simulate the different situations we have considered above and will enable
    us to obtain the results wanted '''
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=2)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)
    model.init_weights()
    #activates the "robust strategy"
    model.noise = 0 if not(noise_during_training) else noise
    #In all the cases, the dropout rate should be equal to 0 during the training phase
    model.dropout_rate=0
    t=1

    #Training (Server side)
    for epoch in range(nb_epochs):
        #activates the "adaptive strategy": eps_t = increasing_noise/t
        if increasing_noise:
            model.noise=(increasing_noise/t)

        train_loss, train_acc = train_func(train_set,model,optimizer=optimizer,scheduler=scheduler)
        valid_loss, valid_acc = test(valid_set,model)

        if display:
            print('Epoch: %d' %(epoch + 1))
            print(f'\tAcc: {train_acc * 100:.1f}%(train)')
            print(f'\tAcc: {valid_acc * 100:.1f}%(valid)')
        t+=1

    ## Computation of test accuracy (User side)

    # We give the model the dropout rate and the noise level used at test time.
    # If they were not provided, they are equal to False by default
    model.dropout_rate=dropout
    model.noise=noise

    test_loss,test_acc=test(test_set,model)
    if display:
        print(f'\tAcc: {test_acc * 100:.1f}%(test)')
      
    return test_acc

def simulation(model,train_set,valid_set,test_set,nb_epochs,epsilons=[0.05,0.1,0.5,1,5],dprates=[0.1,0.3,0.5,0.8],nb_indep_runs=3,increasing_noise=1):
    '''This function simulates all the situations listed above for the given model and returns a dataframe containing all the test accuracies
    according to the different cases. Each accuracy is averaged over nb_indep_runs independent runs
    '''
    results=[]

    all_eps=['/']
    all_mu=['/']
    types=['Non robust','Robust','Adaptative']
    all_types=['Non private']

    #First case
    print('NON PRIVATE MODEL')
    test_acc=0
    for i in range(0,nb_indep_runs):
        clear_output(wait=True)
        test_acc+=train(model,train_set=train_set,valid_set=valid_set,test_set=test_set,nb_epochs=nb_epochs)

    results.append(test_acc/nb_indep_runs)
    #Second case
    print('VARYING NOISE DROPOUT FIXED')
    for eps in tqdm(epsilons):
        test_acc_1,test_acc_2,test_acc_3=[0,0,0]
        for i in range(0,nb_indep_runs):
            test_acc_1+=train(model,train_set=train_set,valid_set=valid_set,test_set=test_set,nb_epochs=nb_epochs,dropout=False,noise=eps,noise_during_training=False)
            test_acc_2+=train(model,train_set=train_set,valid_set=valid_set,test_set=test_set,nb_epochs=nb_epochs,dropout=False,noise=eps,noise_during_training=True)
            test_acc_3+=train(model,train_set=train_set,valid_set=valid_set,test_set=test_set,nb_epochs=nb_epochs,dropout=False,noise=eps,increasing_noise=increasing_noise)
      
        for j in range(0,3):
            all_eps.append(eps)
            all_mu.append(0)
            all_types.append(types[j])

        results.append(test_acc_1/nb_indep_runs)
        results.append(test_acc_2/nb_indep_runs)
        results.append(test_acc_3/nb_indep_runs)
        
    #Third case
    print('VARYING DROPOUT FIXED NOISE')
    for mu in tqdm(dprates):
        test_acc_1,test_acc_2,test_acc_3=[0,0,0]
    
        for i in range(0,nb_indep_runs):
            test_acc_1+=train(model,train_set=train_set,valid_set=valid_set,test_set=test_set,nb_epochs=nb_epochs,dropout=mu,noise=4,noise_during_training=False)
            test_acc_2+=train(model,train_set=train_set,valid_set=valid_set,test_set=test_set,nb_epochs=nb_epochs,dropout=mu,noise=4,noise_during_training=True)
            test_acc_3+=train(model,train_set=train_set,valid_set=valid_set,test_set=test_set,nb_epochs=nb_epochs,dropout=mu,noise=4,increasing_noise=increasing_noise)
    
        for j in range(0,3):
            all_eps.append(4)
            all_mu.append(mu)
            all_types.append(types[j])
    
        results.append(test_acc_1/nb_indep_runs)
        results.append(test_acc_2/nb_indep_runs)
        results.append(test_acc_3/nb_indep_runs)
    
    out=pd.DataFrame()
    out['eps']=all_eps
    out['mu']=all_mu
    out['Type']=all_types
    out['Test Accuracy']=np.round(results,3)
    
    return out
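The `generate_batch` function above concatenates the sentences of a batch into one flat tensor and records where each sentence starts in `offsets`; `nn.EmbeddingBag` (whose default mode is `'mean'`) then averages the embeddings of each sentence. Here is a minimal pure-Python sketch of that mechanism on a toy batch. The helper names `make_offsets` and `bag_mean` and the toy table are hypothetical, and the sketch assumes that no sentence is empty.

```python
from itertools import accumulate

def make_offsets(sentences):
    """Start position of each sentence once all sentences are concatenated."""
    lengths = [len(s) for s in sentences]
    return [0] + list(accumulate(lengths[:-1]))

def bag_mean(flat_tokens, offsets, weights):
    """Average the embedding rows of the tokens of each sentence (mean pooling)."""
    dim = len(weights[0])
    bounds = offsets + [len(flat_tokens)]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [weights[t] for t in flat_tokens[start:end]]
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled

# Hypothetical toy batch: 3 tokenized sentences of different lengths
sentences = [[1, 2, 3], [0, 2], [3, 3, 1, 0]]
flat_tokens = [t for s in sentences for t in s]
offsets = make_offsets(sentences)

# Hypothetical toy embedding table: 4 tokens, embedding dimension 2
toy_weights = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(offsets)
print(bag_mean(flat_tokens, offsets, toy_weights))
```

Each sentence is thus represented by a single vector of size `embed_dim`, whatever its length, which is what allows the normalization and the Laplace noise to be applied once per sentence.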

#### Simulation on AG-NEWS and YELP

In [None]:
#size of the vocabulary (number of distinct n-grams), not of the training set
AG_VOCAB_SIZE = len(AG_data.get_vocab())
AG_NUM_CLASS = len(AG_data.get_labels())
EMBED_DIM = 32
AG_model = Classifier(vocab_size=AG_VOCAB_SIZE, embed_dim=EMBED_DIM,num_class=AG_NUM_CLASS)


YELP_VOCAB_SIZE = len(Yelp_data.get_vocab())
YELP_NUM_CLASS = len(Yelp_data.get_labels())
EMBED_DIM = 32
YELP_model = Classifier(vocab_size=YELP_VOCAB_SIZE, embed_dim=EMBED_DIM,num_class=YELP_NUM_CLASS)

#### Grab a cup of tea, the following cell takes a long time to run

In [None]:
results_AG=simulation(AG_model,AG_train_set,AG_valid_set,AG_test,nb_epochs=5,epsilons=[0.05,0.1,0.5,1,2,3,4,5],dprates=[0.1,0.3,0.5,0.8],nb_indep_runs=5)
results_Yelp=simulation(YELP_model,Yelp_train_set,Yelp_valid_set,Yelp_test,nb_epochs=5,epsilons=[0.05,0.1,0.5,1,2,3,4,5],dprates=[0.1,0.3,0.5,0.8],nb_indep_runs=5)

The `results_AG` and `results_Yelp` DataFrames are not convenient for visualisation, but the relevant results we obtained are gathered in the following table

![](https://i.imgur.com/Q1zyaSo.png)

#### Globally, we observe that the noise has a great impact on utility, as we could expect. The robust algorithm proves its worth only when $\epsilon=5$, i.e. with a small amount of noise and hence a weak privacy guarantee. In practice, this parameter should be tuned according to the chosen task so that the trade-off between privacy and utility is the best possible.

#### Concerning our adaptive strategy, it was a complete failure here: it is equivalent to a classifier that labels the sentences at random (which is why we did not report its results in the table above). In fact, this is not surprising since, as observed before, adding noise with $\epsilon \leq 1$ leads to bad utility for the 'robust algorithm'. We will check below whether using a smaller level of noise at the beginning helps this strategy to be effective.

#### Moreover, the robust strategy does not seem to react well to the masking.

#### How can we explain such differences with the results of the paper?

#### In the paper, the authors used the BERT embedding, which is apparently robust to noise by construction. Here, we have used a simple embedding which is clearly not robust to noise, as we have observed before, and that could explain the gap between our results and those of the authors. In that case, we see that the strategy of the authors depends on the embedding method.

#### Now, what could we do to improve our results? Increasing the size of the embedding space could be a good idea: with more information (even noisy), the classifier could react better. We can also change the way we update the noise in the training phase for the adaptive strategy, setting for instance $\epsilon_t = \dfrac{10}{t}$.
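To see what changing the constant of the schedule implies, the sketch below lists, for each epoch, the value of $\epsilon_t = c/t$ and the corresponding Laplace scale $s/\epsilon_t$ with $s=1$. It compares $c=1$ (the default value of `increasing_noise` in `simulation`, used in the first run) with $c=10$. The function name `adaptive_schedule` is ours, introduced only for this illustration.

```python
def adaptive_schedule(c, nb_epochs):
    """Return a list of (epoch, eps_t, laplace_scale) with eps_t = c / epoch and s = 1."""
    return [(t, c / t, t / c) for t in range(1, nb_epochs + 1)]

for c in (1, 10):
    print(f'c = {c}')
    for epoch, eps_t, scale in adaptive_schedule(c, nb_epochs=5):
        print(f'  epoch {epoch}: eps = {eps_t:.2f}, Laplace scale = {scale:.2f}')
```

With $c=1$, every epoch is trained with $\epsilon_t \leq 1$, a regime in which the robust strategy already gave bad utility. With $c=10$, the training starts with little noise ($\epsilon_1 = 10$) and ends at $\epsilon_5 = 2$ after 5 epochs.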


In [None]:
#size of the vocabulary (number of distinct n-grams), not of the training set
AG_VOCAB_SIZE = len(AG_data.get_vocab())
AG_NUM_CLASS = len(AG_data.get_labels())
EMBED_DIM = 64
AG_model = Classifier(vocab_size=AG_VOCAB_SIZE, embed_dim=EMBED_DIM,num_class=AG_NUM_CLASS)

results_AG=simulation(AG_model,AG_train_set,AG_valid_set,AG_test,nb_epochs=5,epsilons=[0.05,0.1,0.5,1,2,3,4,5],dprates=[0.1,0.3,0.5,0.8],nb_indep_runs=5,increasing_noise=10)

YELP_VOCAB_SIZE = len(Yelp_data.get_vocab())
YELP_NUM_CLASS = len(Yelp_data.get_labels())
EMBED_DIM = 64
YELP_model = Classifier(vocab_size=YELP_VOCAB_SIZE, embed_dim=EMBED_DIM,num_class=YELP_NUM_CLASS)

results_YELP=simulation(YELP_model,Yelp_train_set,Yelp_valid_set,Yelp_test,nb_epochs=5,epsilons=[0.05,0.1,0.5,1,2,3,4,5],dprates=[0.1,0.3,0.5,0.8],nb_indep_runs=5,increasing_noise=10)

#### The results found are gathered in the following table :
![](https://i.imgur.com/0cnWpaO.png)


### This time, the results are promising for the adaptive strategy.

### What we see is really interesting and could lead to further investigation. It seems that the adaptive strategy could be helpful precisely in situations where the robust strategy is not efficient, i.e. when the noise level is rather high. On the contrary, when the noise level is rather low, the robust strategy seems to perform better than our adaptive one. So, in some sense, these are complementary strategies, and one could think about using one or the other according to the maximum level of noise added by the user.