# Breaching privacy

This notebook does the same job as the command-line tool `simulate_breach.py`, but also directly visualizes the user data and its reconstruction.

In [1]:
import torch
import hydra
from omegaconf import OmegaConf
%load_ext autoreload
%autoreload 2

import breaching
import logging, sys
logging.basicConfig(level=logging.INFO, handlers=[logging.StreamHandler(sys.stdout)], format='%(message)s')
logger = logging.getLogger()

### Initialize cfg object and system setup:

This composes the configuration and selects the device.
There are many configuration options, but most can usually be left at their defaults. Below, only the case name and the server type are printed.

Choose `case/data=` `shakespeare`, `wikitext` or `stackoverflow` here:

In [2]:
with hydra.initialize(config_path="config"):
    cfg = hydra.compose(config_name='cfg', overrides=["case=9_bert_training", "case/server=malicious-transformer",
                                                      "attack=decepticon"])
    print(f'Investigating use case {cfg.case.name} with server type {cfg.case.server.name}.')
          
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
torch.backends.cudnn.benchmark = cfg.case.impl.benchmark
setup = dict(device=device, dtype=torch.float)
setup

Investigating use case bert_training with server type malicious_transformer_parameters.


{'device': device(type='cpu'), 'dtype': torch.float32}

### Modify config options here

You can use `.attribute` access to modify any of these configurations:

In [11]:
cfg.case.user.num_data_points = 32 # How many sentences?
cfg.case.user.user_idx = 2 # From which user?
cfg.case.data.shape = [32] # This is the sequence length

cfg.case.model = "bert-sanity-check" #  "huawei-noah/TinyBERT_General_4L_312D"
cfg.case.data.tokenizer = "bert-base-uncased"
cfg.case.data.vocab_size = 30522
cfg.case.data.disable_mlm = False


cfg.case.server.provide_public_buffers = True
cfg.case.server.has_external_data = True
cfg.case.server.pretrained = False


cfg.case.server.param_modification.v_length = 32
cfg.case.server.param_modification.eps = 1e-8
cfg.case.server.param_modification.imprint_sentence_position = 0
cfg.case.server.param_modification.softmax_skew = 100000000
cfg.case.server.param_modification.sequence_token_weight = 1
cfg.case.server.param_modification.measurement_scale = 1

cfg.case.server.param_modification.equalize_token_weight = 10
cfg.case.server.param_modification.reset_embedding = True

cfg.attack.token_strategy = "mixed"  # "decoder-bias" only works well for BERT if disable_mlm=True

cfg.attack.embedding_token_weight = 0.25

### Instantiate all parties

In [12]:
user, server, model, loss_fn = breaching.cases.construct_case(cfg.case, setup)
attacker = breaching.attacks.prepare_attack(server.model, server.loss, cfg.attack, setup)
breaching.utils.overview(server, user, attacker)

Reusing dataset wikitext (/home/jonas/data/wikitext/wikitext-103-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)
Reusing dataset wikitext (/home/jonas/data/wikitext/wikitext-103-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)
Model architecture bert-sanity-check loaded with 109,514,298 parameters and 1,024 buffers.
Overall this is a data ratio of  106948:1 for target shape [32, 32] given that num_queries=1.
User (of type UserSingleStep) with settings:
    Number of data points: 32

    Threat model:
    User provides labels: False
    User provides buffers: False
    User provides number of data points: True

    Data:
    Dataset: wikitext
    user: 2
    
        
Server (of type MaliciousTransformerServer) with settings:
    Threat model: Malicious (Parameters)
    Number of planned queries: 1
    Has external/public data: True

    Model:
        model specification: bert-sanity-check
        model state: default
        public 
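
The data ratio in the overview appears to be the number of model parameters divided by the number of token positions in the user batch. This is an inference from the logged numbers, not a statement about how `breaching.utils.overview` computes it. Under that assumption, the short check below reproduces the logged value.

```python
# Values copied from the log output above.
num_parameters = 109_514_298   # parameters of bert-sanity-check
num_data_points = 32           # sentences held by the user
sequence_length = 32           # tokens per sentence (cfg.case.data.shape)

# Assumed definition: parameters per token position in the target batch.
target_positions = num_data_points * sequence_length
data_ratio = num_parameters / target_positions

print(f"{target_positions} token positions, ratio {data_ratio:.0f}:1")
```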

### Simulate an attacked FL protocol

The true user data is returned only for analysis.

In [13]:
server_payload = server.distribute_payload()
shared_data, true_user_data = user.compute_local_updates(server_payload)

Found attention of shape torch.Size([768, 768]).
Found attention of shape torch.Size([768, 768]).
Computing feature distribution before the probe layer Linear(in_features=768, out_features=3072, bias=True) from external data.
Feature mean is 0.07988665252923965, feature std is 1.0159486532211304.
Computing user update in model mode: eval.


In [14]:
user.print(true_user_data)

[CLS] ci [MASK]y mary barker ( 28 [MASK] [MASK] – 16 february 1973 surgeons was an english illustrator [MASK] known for a series of fantasy illustrations depicting fairies and flowers.
barker'[MASK] art education began in girlhood with [MASK] courses [MASK] instruction at the croydon school of art. her earliest professional work [MASK] [MASK] cards and juvenile magazine illustrations
, and her first book, flower fairies of the spring, was published in 1923 [MASK] similar books were published in the following decades. [SEP] [CLS] barker was a incentives
anglican,cene donated her artworks to christian [MASK]s and missionary [MASK]. she produced a few christian @ [MASK] [MASK] themed books [MASK] as the [MASK] ’ s [MASK] of
hymns and, ₅ [MASK] with her sister dorothy, he leadeth me. she designed a stained glass window [MASK] [MASK]. edmund's church, pitlake,
and her painting of [MASK] twenty20 child, the [MASK] of the world has come, was purchased contemplating [MASK] mary. [SEP] [CLS] ba

In [15]:
true_user_data["data"]

tensor([[  101, 25022,   103,  ...,  1998,  4870,  1012],
        [12852,  1005,   103,  ..., 11799,  2932, 11249],
        [ 1010,  1998,  2014,  ...,  2001,  1037, 21134],
        ...,
        [ 2358,  1012,  4080,  ...,  3412,  2573,  1012],
        [ 2016, 20364,   103,  ...,   103,  1996,  6546],
        [20182,  2127,  1996,  ...,   103,  2005,  4275]])
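
In the `bert-base-uncased` vocabulary, id 101 is `[CLS]`, 102 is `[SEP]` and 103 is `[MASK]`, which is why 103 recurs in the tensor above while masked language modeling is enabled. As a small illustration, the helper below counts masked positions in a plain list of token ids. `mask_fraction` is a hypothetical name, not part of `breaching`, and the example uses only the six ids of the first row that are visible in the truncated printout.

```python
# Special-token ids of the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, MASK_ID = 101, 102, 103

def mask_fraction(token_ids):
    """Return the fraction of positions holding the [MASK] token (hypothetical helper)."""
    if not token_ids:
        return 0.0
    return sum(1 for t in token_ids if t == MASK_ID) / len(token_ids)

# The six ids of the first row that are visible in the truncated printout above.
visible_ids = [101, 25022, 103, 1998, 4870, 1012]
print(visible_ids[0] == CLS_ID)             # does the row start with [CLS]?
print(f"{mask_fraction(visible_ids):.3f}")  # masked share among the visible ids
```

On the full tensor, an expression such as `(true_user_data["data"] == 103).float().mean()` should give the same quantity for the whole batch.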

### Reconstruct user data

In [16]:
# attacker.cfg.sentence_algorithm = "k-means"
attacker.cfg.embedding_token_weight = 0.25

In [17]:
reconstructed_user_data, stats = attacker.reconstruct([server_payload], [shared_data], 
                                                      server.secrets, dryrun=cfg.dryrun)
# user.print(reconstructed_user_data)

Recovered tokens tensor([    3,     5,     6,  ..., 30512, 30517, 30520]) through strategy mixed.
Recovered 957 embeddings with positional data from imprinted layer.
Assigned [29, 28, 29, 28, 31, 31, 31, 29, 32, 29, 30, 30, 30, 31, 29, 29, 31, 30, 29, 31, 30, 30, 31, 30, 30, 29, 30, 32, 31, 31, 28, 28] breached embeddings to each sentence.
Replaced 643 tokens with avg. corr 0.1170421838760376 with new tokens with avg corr 0.9976682066917419
with painting books and a nursery library [MASK] included the, [MASK] kate greenaway and randolph cal ) oversees –. artists who exerted [MASK] influences on her later art
[CLS] ci [MASK]y mary barker ( two [MASK] [MASK] – 16 february 1973 surgeons strings, english illustrator [MASK] on for a series of fantasy illustrations depicting fairies and flowers.
barker [MASK] [MASK] art education began in girlhood with [MASK] a [MASK] instruction at the croydon school of art. her earliest professional of [MASK] [MASK] cards and juvenile magazine illustration
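
The per-sentence assignment counts in the log add up to the 957 recovered embeddings, out of 32 × 32 = 1,024 token positions. The check below simply redoes that arithmetic on the logged numbers; the variable names are illustrative.

```python
# Per-sentence counts copied from the "Assigned ... breached embeddings" log line.
assigned = [29, 28, 29, 28, 31, 31, 31, 29, 32, 29, 30, 30, 30, 31, 29, 29,
            31, 30, 29, 31, 30, 30, 31, 30, 30, 29, 30, 32, 31, 31, 28, 28]

sequence_length = 32
total_positions = len(assigned) * sequence_length  # 32 sentences x 32 tokens

recovered = sum(assigned)
print(f"{recovered} of {total_positions} positions ({recovered / total_positions:.1%})")

# Sentences for which every position received a breached embedding.
full = [i for i, n in enumerate(assigned) if n == sequence_length]
print(f"fully covered sentences: {full}")
```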

In [18]:
metrics = breaching.analysis.report(reconstructed_user_data, true_user_data, [server_payload], 
                                    server.model, cfg_case=cfg.case, setup=setup)
metrics

METRICS: | Accuracy: 0.9287 | S-BLEU: 0.87 | FMSE: 1.2758e+02 | 
 G-BLEU: 0.79 | ROUGE1: 0.93| ROUGE2: 0.84 | ROUGE-L: 0.92| Token Acc: 94.14% | Label Acc: 9.96%


{'order': tensor([ 1,  2, 24, 13,  9, 29, 17, 26, 15,  3, 16, 25,  4,  0, 23,  5, 20, 22,
         14, 19,  6, 21, 30, 28,  8, 12,  7, 27, 18, 31, 10, 11]),
 'intra-sentence_token_acc': [0.875,
  0.90625,
  0.9375,
  0.96875,
  0.90625,
  0.96875,
  0.9375,
  0.9375,
  0.90625,
  0.875,
  0.96875,
  0.9375,
  0.96875,
  0.90625,
  0.9375,
  0.96875,
  0.9375,
  0.96875,
  0.90625,
  0.96875,
  0.96875,
  0.9375,
  0.875,
  0.96875,
  0.96875,
  0.90625,
  0.90625,
  1.0,
  0.90625,
  0.875,
  0.9375,
  0.9375],
 'accuracy': 0.9287109375,
 'bleu': 0.806838918551962,
 'google_bleu': 0.7936507936507936,
 'sacrebleu': 0.86828580802873,
 'rouge1': 0.9289865438539718,
 'rouge2': 0.8444166904524608,
 'rougeL': 0.9240143891989743,
 'token_acc': 0.94140625,
 'feat_mse': 127.57867431640625,
 'parameters': 109514298,
 'label_acc': 0.099609375}
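
A few sanity checks on the returned dictionary can be done with plain Python. The sketch below copies the values printed above. Reading `order[i]` as the index of the reconstructed sentence matched to true sentence `i` is an assumption; it agrees with the first two printed sentences, but the output does not state it.

```python
# Values copied from the `metrics` dictionary printed above.
order = [1, 2, 24, 13, 9, 29, 17, 26, 15, 3, 16, 25, 4, 0, 23, 5, 20, 22,
         14, 19, 6, 21, 30, 28, 8, 12, 7, 27, 18, 31, 10, 11]
intra_acc = [0.875, 0.90625, 0.9375, 0.96875, 0.90625, 0.96875, 0.9375, 0.9375,
             0.90625, 0.875, 0.96875, 0.9375, 0.96875, 0.90625, 0.9375, 0.96875,
             0.9375, 0.96875, 0.90625, 0.96875, 0.96875, 0.9375, 0.875, 0.96875,
             0.96875, 0.90625, 0.90625, 1.0, 0.90625, 0.875, 0.9375, 0.9375]

# `order` should be a permutation of the 32 sentence indices.
is_permutation = sorted(order) == list(range(len(order)))

# Summary statistics of the per-sentence token accuracy.
mean_acc = sum(intra_acc) / len(intra_acc)
perfect = [i for i, a in enumerate(intra_acc) if a == 1.0]

# The reported fractions correspond to whole counts out of 32 x 32 = 1024 positions.
positions = 32 * 32
accuracy_count = round(0.9287109375 * positions)
token_acc_count = round(0.94140625 * positions)

print(is_permutation, f"{mean_acc:.4f}", perfect, accuracy_count, token_acc_count)
```

The mean of the per-sentence values (about 0.934) matches neither `accuracy` nor `token_acc`, so these three numbers evidently measure different things; their exact definitions are not reproduced here.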