# WikiText - Syft Duet - Data Scientist 🥁

The code used here has been adapted directly from the `Word-level language modeling RNN` PyTorch example:
https://github.com/pytorch/examples/tree/master/word_language_model

The goal is to demonstrate how the original example could be adapted to a context where you as a Data Scientist can access the remote private data of a Data Owner securely, and train a model over a Duet session.

## PART 1: Connect to a Remote Duet Server

As the Data Scientist, you want to perform data science on data that is sitting in the Data Owner's Duet server in their Notebook.

To do this, we must run the code that the Data Owner sends us, which includes their Duet Server ID. The code will look like this, but with their real Server ID.

```
import syft as sy
duet = sy.duet('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
```

This will create a direct connection from your notebook to the remote Duet server. Once the connection is established, all traffic is sent directly between the two nodes.

Paste the code or Server ID that the Data Owner gives you and run it in the cell below. It will return your Client ID which you must send to the Data Owner to enter into Duet so it can pair your notebooks.

In [None]:
import syft as sy
# duet = sy.join_duet("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
# sy.logging(disable=True, file_path="wiki_ds.log")
duet = sy.join_duet(loopback=True)

## PART 2: Get Pointers to Shared Objects

The first thing we need to do is to get our Data Owner to run their notebook so that the data is loaded into Duet. Once that is done we can check the store to see what pointers we can get.

In [None]:
duet.store.pandas

Next, we get a copy of `vocab_size`.

In [None]:
vocab_size_ptr = duet.store["vocab_size"]
vocab_size_sy = vocab_size_ptr.get(
    request_block=True,
    name="vocab_size",
    reason="I need it to define my model.",
    timeout_secs=30,
    delete_obj=False,
    verbose=True
)
vocab_size_sy

Notice that we get back a Syft `Int`. These primitive types behave almost identically to their Python counterparts; however, in some cases you will need to convert them for use in other code. You can cast with `int(vocab_size_sy)` or use the `upcast()` method.

In [None]:
type(vocab_size_sy)

In [None]:
vocab_size = vocab_size_sy.upcast()
type(vocab_size), vocab_size

Now we should get some pointers to the datasets.

In [None]:
train_data = duet.store["train_data"]
valid_set = duet.store["valid_data"]
train_data, valid_set

## PART 3: Prepare Datasets for Training

The training and validation sets, as shared by the data owners, are flat tensors of the form:

```
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 .....]
```

where the integers represent words.

The DS (Data Scientist) is responsible for batchifying this dataset for training. I avoid giving the DO (Data Owner) this responsibility, since it should be up to the DS to decide how the data is batched.

I suggest we reshape it in the following way:

1. Reshape into a list of input/target samples:

```
[ 
  [ [1 , 2 , 3 , 4 ],
    [5 , 6 , 7 , 8 ]  ],
    
  [ [9 , 10, 11, 12], 
    [13, 14, 15, 16]  ],
  :
  :
  :
]
```

This should use the `view()` method in `torch`.

2. Create a `DataLoader` object on the DS side that batchifies this training set. For example, for a batch size of 2, the data loader should return:

```
Input batch:
     [ [1 , 2 , 3 , 4 ],
       [9 , 10, 11, 12]  ]
       
Target batch:
     [ [5 , 6 , 7 , 8 ],
       [13, 14, 15, 16]  ]
```

All of these operations are carried out on the `TensorPointer` because the dataset does not leave the DO's machine.
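
The two steps above can be sketched locally on a plain `torch.Tensor` (not a `TensorPointer`). This is only an illustration with assumed toy sizes: the names `seq_len` and `samples` are hypothetical, and a `TensorDataset` stands in for whatever dataset wrapper is eventually used.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the flat token tensor shared by the Data Owner.
tokens = torch.arange(1, 17)
seq_len = 4  # hypothetical sequence length for this illustration

# Step 1: reshape into (num_samples, 2, seq_len);
# slot 0 of each sample is the input, slot 1 the target.
samples = tokens.view(-1, 2, seq_len)
inputs, targets = samples[:, 0], samples[:, 1]

# Step 2: let a DataLoader group the samples into batches of 2.
loader = DataLoader(TensorDataset(inputs, targets), batch_size=2, shuffle=False)
input_batch, target_batch = next(iter(loader))

print(input_batch.tolist())   # [[1, 2, 3, 4], [9, 10, 11, 12]]
print(target_batch.tolist())  # [[5, 6, 7, 8], [13, 14, 15, 16]]
```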

### Build a Dataset Class

We will need a few predetermined hyperparameters.

In [None]:
duet.store.pandas

In [None]:
ntokens_train = 2088628

bsz = 4
bptt = 2
#dropout = 0.5

# Size of word embeddings
ninp = 100

# Size of hidden layer
nhid = 200

# Number of RNN layers
nlayers = 2

# Initial learning rate
lr = 20

In [None]:
import torch

Here we define a local `torch` `Dataset` class but feed it remote `TensorPointer` objects to batchify. Take care not to confuse `Tensor` and `TensorPointer`.

In [None]:
class Wikitext2(torch.utils.data.Dataset):
    def __init__(self, tokens, ntokens, bsz, bptt):
        # A pointer to the tensor that contains the list of 
        # all token IDs in the dataset
        self.tokens = tokens

        # The sequence length
        self.bptt = bptt

        # The batch size
        self.bsz = bsz

        # Number of tokens in the dataset
        self.ntokens = ntokens
        
        # Batchify the dataset
        self._batchify()

    def __getitem__(self, index):
        input, target = self._get_batch(index)
        return input, target
        
    def __len__(self):
        return (self.ntokens // self.bsz) - (self.bptt + 1)
    
    def _batchify(self):  
        # Since we are going to reshape the 1D self.tokens tensor
        # into a 2D tensor with a number of columns equal to the
        # batch size, we first compute the number of rows
        # of that reshaped tensor
        width = self.ntokens // self.bsz

        # remove surplus tokens
        self.tokens_2d = self.tokens.narrow(0, 0, self.bsz * width)

        # Reshape
        self.tokens_2d = self.tokens_2d.view(-1, self.bsz)
        
    def _get_batch(self, index):
        input = self.tokens_2d.narrow(dim = 0, start = index, length = self.bptt)
        target = self.tokens_2d.narrow(dim = 0, start = index + 1, length = self.bptt)

        return input, target.view(-1)

    def collate_fn(self, batch):
        return batch[0]        
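
To see what `_batchify` and `_get_batch` produce, the same `narrow` and `view` calls can be replayed on a small local tensor. This is a sketch with toy sizes (16 tokens, `bsz = 4`, `bptt = 2`), not the remote data. With `view(-1, bsz)`, consecutive tokens run along each row, so each target is the token one row (`bsz` positions) later in the stream. For comparison, the upstream PyTorch example reshapes with `view(bsz, -1).t().contiguous()`, where each column is a contiguous stream and the target is the next token.

```python
import torch

tokens = torch.arange(1, 17)  # toy stand-in for the remote token tensor
bsz, bptt = 4, 2
width = tokens.numel() // bsz

# Layout produced by Wikitext2._batchify: shape (width, bsz)
rows = tokens.narrow(0, 0, bsz * width).view(-1, bsz)
inp = rows.narrow(0, 0, bptt)
tgt = rows.narrow(0, 1, bptt).view(-1)
print(inp.tolist())  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(tgt.tolist())  # [5, 6, 7, 8, 9, 10, 11, 12]

# Layout used by the upstream word_language_model example
cols = tokens.narrow(0, 0, bsz * width).view(bsz, -1).t().contiguous()
inp_up = cols.narrow(0, 0, bptt)
tgt_up = cols.narrow(0, 1, bptt).view(-1)
print(inp_up.tolist())  # [[1, 5, 9, 13], [2, 6, 10, 14]]
print(tgt_up.tolist())  # [2, 6, 10, 14, 3, 7, 11, 15]
```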

Create a torch `Dataset` instance.

In [None]:
train_set = Wikitext2(
    tokens = train_data, 
    ntokens = ntokens_train, 
    bsz = bsz,
    bptt = bptt,
)

Create a `DataLoader` instance.

In [None]:
train_loader = torch.utils.data.DataLoader(
    dataset=train_set,
    batch_size=1, # Should be always set to 1
    num_workers=0,
    drop_last=True,
    shuffle=True,
    collate_fn=train_set.collate_fn
)
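
The `batch_size=1` setting works because each `__getitem__` call already returns a full `(input, target)` batch; the `collate_fn` simply unwraps the one-element list that the `DataLoader` builds. A minimal local sketch, with a hypothetical `PreBatched` dataset standing in for `Wikitext2` and a plain tensor instead of a pointer:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PreBatched(Dataset):
    """Hypothetical dataset whose items are already whole batches."""

    def __init__(self, data, bptt):
        self.data = data  # shape (rows, bsz)
        self.bptt = bptt

    def __len__(self):
        return self.data.size(0) - self.bptt

    def __getitem__(self, index):
        inp = self.data.narrow(0, index, self.bptt)
        tgt = self.data.narrow(0, index + 1, self.bptt).reshape(-1)
        return inp, tgt

data = torch.arange(24).view(-1, 4)  # 6 rows, batch size 4
loader = DataLoader(
    PreBatched(data, bptt=2),
    batch_size=1,                       # one pre-built batch per step
    collate_fn=lambda batch: batch[0],  # unwrap the single-item list
)
inp, tgt = next(iter(loader))
print(inp.shape, tgt.shape)  # torch.Size([2, 4]) torch.Size([8])
```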

In [None]:
duet.store.pandas

The dataloader is ready to use. Now, let's build our RNN model!

## PART 3: Build an RNN-based Remote Model

Create the model. Note that we subclass `sy.Module`, not `nn.Module`, and pass in `torch_ref`. Inside the model definition you need to use `self.torch_ref` whenever you reference anything from torch. Internally this gets swapped between the real `torch` and the `remote_torch` on `duet.torch` so that the model and its definition can work in both environments.
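
The idea behind `torch_ref` can be illustrated without Syft: the class never reaches for the global `torch` inside its methods and builds every layer from whatever module object it is handed. The sketch below is plain Python with a hypothetical `TinyNet` class; it is not `sy.Module` and only mimics the pattern locally.

```python
import torch

class TinyNet:
    """Hypothetical class that reaches torch only through `torch_ref`."""

    def __init__(self, torch_ref):
        self.torch_ref = torch_ref
        self.linear = self.torch_ref.nn.Linear(3, 2)

    def forward(self, x):
        hidden = self.linear(x)
        return self.torch_ref.nn.functional.relu(hidden)

# Locally we hand in the real torch module; the notebook's model
# receives its reference the same way through the constructor.
net = TinyNet(torch_ref=torch)
out = net.forward(torch.ones(1, 3))
print(out.shape)  # torch.Size([1, 2])
```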

In [None]:
class RNNModel(sy.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self,
        torch_ref, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False
    ):
        super(RNNModel, self).__init__(torch_ref=torch_ref)
        print(
            "Creating RNNModel with hyperparams: "
            + f"{rnn_type} {ntoken} {ninp} {nhid} {nlayers} {dropout} {tie_weights}"
        )
        self.ntoken = ntoken
        #self.drop = self.torch_ref.nn.Dropout(dropout)
        self.encoder = self.torch_ref.nn.Embedding(ntoken, ninp)
        if rnn_type in ["LSTM", "GRU"]:
            self.rnn = getattr(self.torch_ref.nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        else:
            try:
                nonlinearity = {"RNN_TANH": "tanh", "RNN_RELU": "relu"}[rnn_type]
                
            except KeyError:
                raise ValueError("""An invalid option for `--model` was supplied,
                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
            self.rnn = self.torch_ref.nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
        self.decoder = self.torch_ref.nn.Linear(nhid, ntoken)

        # Optionally tie weights as in:
        # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
        # https://arxiv.org/abs/1608.05859
        # and
        # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
        # https://arxiv.org/abs/1611.01462
        if tie_weights:
            if nhid != ninp:
                raise ValueError("When using the tied flag, nhid must be equal to emsize")
            # I don't think we can just assign these pointers right now
            # self.decoder.weight = self.encoder.weight

        #self.init_weights()
        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers
#     def init_weights(self):
#         initrange = 0.1
#         self.torch_ref.nn.init.uniform_(self.encoder.weight, -initrange, initrange)
#         self.torch_ref.nn.init.zeros_(self.decoder.bias)
#         self.torch_ref.nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, x):
        input, hidden = x
        #emb = self.drop(self.encoder(input))
        emb = self.encoder(input)

        result = self.rnn(emb)
        output, hidden = result[0], result[1]
        #output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.ntoken)
        output = self.torch_ref.nn.functional.log_softmax(decoded, dim=1) #, hidden
        return output

    def init_hidden(self, bsz):
        weight = next(self.parameters())
        if self.rnn_type == "LSTM":
            return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                    weight.new_zeros(self.nlayers, bsz, self.nhid))
        else:
            return weight.new_zeros(self.nlayers, bsz, self.nhid)

Create an instance of our model.

In [None]:
local_model = RNNModel(
    torch_ref=torch,
    rnn_type = "LSTM", 
    ninp = ninp, 
    ntoken = vocab_size, 
    nhid = nhid, 
    nlayers = nlayers
)

print(f"local_model is Local: {local_model.is_local}")

Let's try sending our model over to Duet. Then we can check its `.is_local` property to see where it is.

In [None]:
duet.store.pandas

In [None]:
remote_model = local_model.send(duet)
print(f"remote_model is Remote: {not remote_model.is_local}")

Now that the model is on the DO's machine, we can get the remote parameters, which are the ones we want to optimize.

In [None]:
duet.store.pandas

In [None]:
# Get the parameters as a pointer
parameters = remote_model.parameters()

We will need to do a few things with `remote_torch`, so let's grab an alias to it.

In [None]:
remote_torch = duet.torch

In [None]:
# Create the optimizer
optim = remote_torch.optim.Adadelta(parameters, lr=lr)

In [None]:
duet.store.pandas

## PART 4: Start Remote Training

In [None]:
train_loader

In [None]:
train_loader = torch.utils.data.DataLoader(
    dataset=train_set,
    batch_size=1, # Should be always set to 1
    num_workers=0,
    drop_last=True,
    shuffle=True,
    collate_fn=train_set.collate_fn
)

In [None]:
# stop after the first logged batch of each epoch
dry_run = True

# turn on training mode
remote_model.train()
# train_loss = duet.python.Float(0)  # create a remote Float we can use for summation
epochs = 10
log_interval = 10

for epoch in range(1, epochs + 1):
    for batch_idx, (input, target) in enumerate(train_loader):
        # Zero the gradients
        optim.zero_grad()

        # Forward pass
        output = remote_model((input, None))

        # Compute the loss
        loss = remote_torch.nn.functional.nll_loss(input=output, target=target)

        # Backprop
        loss.backward()

        loss_item = loss.item()
        #train_loss += loss_item # it's still a pointer at this stage

        # Update weights
        optim.step()

        if batch_idx % log_interval == 0:
            local_loss = None
            local_loss = loss_item.get(
                name="loss",
                reason="To evaluate training progress",
                request_block=True,
                timeout_secs=5,
                verbose=True
            )
            if local_loss is not None:
                print("Train Epoch: {} {} {:.4}".format(epoch, batch_idx, local_loss))
            else:
                print("Train Epoch: {} {} ?".format(epoch, batch_idx))

            if dry_run:
                break

### Local model test

In [None]:
import torch
import torch.nn as nn

class RNNModel(nn.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False):
        super(RNNModel, self).__init__()
        self.ntoken = ntoken
        #self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        if rnn_type in ['LSTM', 'GRU']:
            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        else:
            try:
                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
            except KeyError:
                raise ValueError( """An invalid option for `--model` was supplied,
                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)

        # Optionally tie weights as in:
        # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
        # https://arxiv.org/abs/1608.05859
        # and
        # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
        # https://arxiv.org/abs/1611.01462
        if tie_weights:
            if nhid != ninp:
                raise ValueError('When using the tied flag, nhid must be equal to emsize')
            self.decoder.weight = self.encoder.weight

        #self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        nn.init.uniform_(self.encoder.weight, -initrange, initrange)
        nn.init.zeros_(self.decoder.bias)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, x):
        input, hidden = x
        #emb = self.drop(self.encoder(input))
        emb = self.encoder(input)

        result = self.rnn(emb)
        output, hidden = result[0], result[1]
        
        #output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.ntoken)
        
        return nn.functional.log_softmax(decoded, dim=1)#, hidden
        #return result

    def init_hidden(self, bsz):
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                    weight.new_zeros(self.nlayers, bsz, self.nhid))
        else:
            return weight.new_zeros(self.nlayers, bsz, self.nhid)

In [None]:
model = RNNModel(rnn_type = 'LSTM', ninp = 100, ntoken = 100, nhid = 100, nlayers =2)

In [None]:
input = torch.ones(20, 2, dtype=torch.long)  # (seq_len, batch), as nn.LSTM expects by default
hidden = torch.zeros(2, 2, 100)  # (nlayers, batch, nhid)
c = torch.zeros(2, 2, 100)  # (nlayers, batch, nhid)

In [None]:
#output = model((input, hidden))
#output, hidden = model((input, [hidden, c]))
output= model((input, None))

In [None]:
output.shape

In [None]:
nn.functional.nll_loss(output, input.view(-1))
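
A rough reference point for the value this cell prints: if a model assigned equal probability to every token, the mean negative log-likelihood would be ln(V) for a vocabulary of V tokens, which is about 4.6 for the `ntoken = 100` used here. The sketch below recomputes log-softmax and NLL by hand in plain Python; the helper names are hypothetical and the logits are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def nll_loss(log_probs, targets):
    """Mean negative log-likelihood of the target indices."""
    picked = [row[t] for row, t in zip(log_probs, targets)]
    return -sum(picked) / len(picked)

vocab = 100
uniform_logits = [[0.0] * vocab for _ in range(3)]  # three positions, no preference
loss = nll_loss([log_softmax(row) for row in uniform_logits], targets=[1, 1, 1])
print(round(loss, 4), round(math.log(vocab), 4))  # 4.6052 4.6052
```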

In [None]:
type(output)

### Remaining issues

1. Passing a tuple `(hidden, c)` to `self.rnn` raises an exception in the `MergeFrom()` call.
2. The Dropout layer does not seem to work.
3. Calling `input, hidden = self.rnn(input, None)` in the LSTM case fails with "too many values to unpack", although it works with local torch. As a workaround, I call it as `input = self.rnn(input, None)`.
4. We cannot index a list pointer: `l = syft.lib.python.list.List([1,2]).send(duet); a = l[0]`.
5. We cannot unpack a list pointer: `a, b = l`.
6. Gradient clipping could not be implemented, because `torch.nn.utils.clip_grad_norm_()` is not yet implemented and we cannot iterate over `model.parameters()` when it is a `ListPointer`. In fact, `model.parameters()` seems to be nonfunctional; it should return an iterator, not a list, in the allowlist.
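
For reference, item 6 concerns clipping by global norm. The sketch below shows that computation in plain Python on nested lists, mirroring what `torch.nn.utils.clip_grad_norm_` does locally with the default 2-norm. `clip_by_global_norm` is a hypothetical helper, the gradients are made up, and nothing here runs on `ListPointer` objects.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so their joint 2-norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]  # toy gradients for two parameters
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.25)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before)           # 13.0
print(round(norm_after, 4))  # 0.25
```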

In [None]:
import torch
t1 = torch.tensor([2,3,5])
t2 = torch.tensor([4,1,6])

list_ptr = sy.lib.python.list.List([t1,t2]).tag('#list').send(duet)
elem_ptr = list_ptr[0] #Does not work

In [None]:
duet.store.pandas

In [None]:
elem_ptr, list_ptr

In [None]:
l = elem_ptr.get(request_block=True)

In [None]:
l

In [None]:
t = duet.torch.Tensor(elem_ptr)

In [None]:
t

In [None]:
t = t + 3

In [None]:
t = t.get(request_block = True)

In [None]:
type(t)

In [None]:
t1_ptr, t2_ptr = list_ptr