# Breaching privacy

This notebook does the same job as the command-line tool `simulate_breach.py`, but also directly visualizes the user data and the reconstruction.

In [1]:
import torch
import hydra
from omegaconf import OmegaConf
%load_ext autoreload
%autoreload 2

import breaching
import logging, sys
logging.basicConfig(level=logging.INFO, handlers=[logging.StreamHandler(sys.stdout)], format='%(message)s')
logger = logging.getLogger()

### Initialize cfg object and system setup:

This will load all configuration options.
There are a lot of possible configurations, but there is usually no need to worry about most of these. Below, a few options are printed.

Choose `case/data=` from `shakespeare`, `wikitext`, or `stackoverflow` here:

In [2]:
with hydra.initialize(config_path="config"):
    cfg = hydra.compose(config_name='cfg', overrides=["case/data=wikitext", "case/server=malicious-transformer",
                                                      "case.model=gpt2",
                                                      "attack=decepticon"])
    print(f'Investigating use case {cfg.case.name} with server type {cfg.case.server.name}.')
          
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
torch.backends.cudnn.benchmark = cfg.case.impl.benchmark
setup = dict(device=device, dtype=torch.float)
setup

Investigating use case single_imagenet with server type malicious_transformer_parameters.


{'device': device(type='cpu'), 'dtype': torch.float32}

### Modify config options here

You can use `.attribute` access to modify any of these configurations:

In [10]:
cfg.case.user.num_data_points = 2 # How many sentences?
cfg.case.user.user_idx = 1 # From which user?
cfg.case.data.shape = [32] # This is the sequence length

cfg.case.model = "gpt2S" # Alternatives: "gpt2", "transformer3"
cfg.case.server.provide_public_buffers = True

cfg.case.server.has_external_data = True
cfg.case.data.tokenizer = "gpt2"

cfg.attack.token_strategy="embedding-norm"
cfg.case.server.param_modification.v_length = 64

cfg.case.server.param_modification.eps = 1e-6
cfg.case.server.param_modification.imprint_sentence_position = 0
cfg.case.server.param_modification.softmax_skew = 100000000
cfg.case.server.param_modification.sequence_token_weight = 1

cfg.case.server.param_modification.measurement_scale = 1

cfg.case.server.pretrained = False

### Instantiate all parties

In [11]:
user, server, model, loss_fn = breaching.cases.construct_case(cfg.case, setup)
attacker = breaching.attacks.prepare_attack(server.model, server.loss, cfg.attack, setup)
breaching.utils.overview(server, user, attacker)

Reusing dataset wikitext (/home/jonas/data/wikitext/wikitext-103-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)
Reusing dataset wikitext (/home/jonas/data/wikitext/wikitext-103-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)
Model architecture gpt2S loaded with 124,439,808 parameters and 12,582,924 buffers.
Overall this is a data ratio of 1944372:1 for target shape [2, 32] given that num_queries=1.
User (of type UserSingleStep) with settings:
    Number of data points: 2

    Threat model:
    User provides labels: False
    User provides buffers: False
    User provides number of data points: True

    Data:
    Dataset: wikitext
    user: 1
    
        
Server (of type MaliciousTransformerServer) with settings:
    Threat model: Malicious (Parameters)
    Number of planned queries: 1
    Has external/public data: True

    Model:
        model specification: gpt2S
        model state: default
        public buffers: True

    Se
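
The data ratio reported in the overview can be checked by hand. The sketch below assumes the ratio is simply the parameter count divided by the number of entries in the target shape (with `num_queries=1`). That assumption reproduces the printed value, but it is not taken from the `breaching` source. The variable names are illustrative.

```python
from math import prod

# Values copied from the overview output above.
num_parameters = 124_439_808
target_shape = [2, 32]  # 2 sequences of 32 tokens each

# Assumption: ratio = number of parameters / number of target entries.
num_target_entries = prod(target_shape)
data_ratio = num_parameters / num_target_entries

print(f"{num_target_entries} tokens, data ratio {data_ratio:.0f}:1")
# prints: 64 tokens, data ratio 1944372:1
```

In other words, the shared update contains close to two million parameter gradients per user token in this setting.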

### Simulate an attacked FL protocol

The true user data is returned only for analysis; it is not passed to the attacker.

In [12]:
server_payload = server.distribute_payload()
shared_data, true_user_data = user.compute_local_updates(server_payload)

Found attention of shape torch.Size([2304, 768]).
Computing feature distribution before the probe layer Conv1D() from external data.
Feature mean is -0.09237781912088394, feature std is 0.7964291572570801.
Computing user update in model mode: eval.


In [13]:
user.print(true_user_data)

 The Tower Building of the Little Rock Arsenal, also known as U.S. Arsenal Building, is a building located in MacArthur Park in downtown Little Rock, Arkansas
. Built in 1840, it was part of Little Rock's first military installation. Since its decommissioning, The Tower Building has housed two museums. It


### Reconstruct user data

In [14]:
reconstructed_user_data, stats = attacker.reconstruct([server_payload], [shared_data], 
                                                      server.secrets, dryrun=cfg.dryrun)

metrics = breaching.analysis.report(reconstructed_user_data, true_user_data, [server_payload], 
                                    server.model, cfg_case=cfg.case, setup=setup)

user.print(reconstructed_user_data)

Recovered tokens tensor([   13,    13,    50,    50,    82,   257,   262,   278,   286,   286,
          287,   287,   318,   340,   355,   373,   383,   383,   468,   468,
          471,   632,   635,   636,   663,   705,   717,   734,   764,   764,
          837,   837,   837,   837,  1900,  2422,  2422,  2615,  3250,  3411,
         4619,  4631,  4631,  4631,  5140,  7703,  7703,  7703,  8765,  8765,
         9436,  9436,  9988, 11819, 11819, 13837, 13837, 14538, 23707, 26969,
        28477, 30794, 46626, 47784]) through strategy embedding-norm.
Recovered 63 embeddings with positional data from imprinted layer.
Assigned [31, 32] breached embeddings to each sentence.
METRICS: | Accuracy: 0.2500 | S-BLEU: 0.28 | FMSE: 1.6329e-01 | 
 G-BLEU: 0.24 | ROUGE1: 0.66| ROUGE2: 0.14 | ROUGE-L: 0.37| Token Acc: 93.75% | Label Acc: 93.75%
 The Built inmission downtown Rock military Arsenal It Little Rock MacArthur decom The Building, Little 1840, is a building located in Tower Park has downtown 

In [15]:
metrics

{'order': tensor([0, 1]),
 'intra-sentence_token_acc': [0.59375, 0.625],
 'accuracy': 0.25,
 'bleu': 0.22521821864695993,
 'google_bleu': 0.23557692307692307,
 'sacrebleu': 0.2792797652483076,
 'rouge1': 0.6632575757575758,
 'rouge2': 0.13781788351107466,
 'rougeL': 0.36666666666666664,
 'token_acc': 0.9375,
 'feat_mse': 0.16329431533813477,
 'parameters': 124439808,
 'label_acc': 0.9375}
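
The gap between `accuracy` (0.25) and `token_acc` (0.9375) suggests that most tokens were recovered but many were placed at the wrong position. The toy sketch below shows how an order-dependent and an order-independent score can diverge on the same data. It assumes `accuracy` is positional and `token_acc` ignores order. Both helper functions and the token ids are hypothetical and are not the implementation used by `breaching.analysis.report`.

```python
from collections import Counter


def positional_accuracy(true_tokens, recovered_tokens):
    """Fraction of positions where the recovered token equals the true token."""
    matches = sum(t == r for t, r in zip(true_tokens, recovered_tokens))
    return matches / len(true_tokens)


def bag_of_tokens_accuracy(true_tokens, recovered_tokens):
    """Fraction of true tokens that were recovered, ignoring their order."""
    overlap = Counter(true_tokens) & Counter(recovered_tokens)  # multiset intersection
    return sum(overlap.values()) / len(true_tokens)


# Hypothetical toy token ids, not taken from the run above.
true_tokens = [383, 8765, 11819, 286, 262, 7703, 4631, 13837]
recovered_tokens = [383, 11819, 8765, 286, 7703, 262, 4631, 50]

pos = positional_accuracy(true_tokens, recovered_tokens)
bag = bag_of_tokens_accuracy(true_tokens, recovered_tokens)
print(f"positional: {pos}, bag-of-tokens: {bag}")
# prints: positional: 0.375, bag-of-tokens: 0.875
```

Seven of the eight toy tokens are recovered, but only three sit at the correct position, which mirrors the pattern in the metrics above.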

In [9]:
reconstructed_user_data, stats = attacker.reconstruct([server_payload], [shared_data], 
                                                      server.secrets, dryrun=cfg.dryrun)

metrics = breaching.analysis.report(reconstructed_user_data, true_user_data, [server_payload], 
                                    server.model, cfg_case=cfg.case, setup=setup)

user.print(reconstructed_user_data)

Recovered tokens tensor([   13,    13,    50,   257,   262,   286,   287,   287,   318,   355,
          383,   471,   635,   837,   837,   837,  1900,  2615,  3250,  4631,
         4631,  5140,  7703,  7703,  8765,  9436, 11819, 11819, 13837, 13837,
        14538, 46626]) through strategy embedding-norm.
Recovered 32 embeddings with positional data from imprinted layer.
METRICS: | Accuracy: 1.0000 | S-BLEU: 1.00 | FMSE: 0.0000e+00 | 
 G-BLEU: 1.00 | ROUGE1: 1.00| ROUGE2: 1.00 | ROUGE-L: 1.00| Token Acc: 100.00% | Label Acc: 100.00%
 The Tower Building of the Little Rock Arsenal, also known as U.S. Arsenal Building, is a building located in MacArthur Park in downtown Little Rock, Arkansas
