# The Data Scientist's Notebook

Only data privacy is protected in this use case. The model created by the data scientist is not kept private since it is sent to the data owner's machine for training without being encrypted.

**Note:**

Much of the code used here is either copied or adapted from the `Word-level language modeling` PyTorch example:

https://github.com/pytorch/examples/tree/master/word_language_model

The goal is to demonstrate how the original example can be adapted to a setting where the dataset is private to the data owner, as is the case in this demo.

## PART 0: Connect to a Remote Duet Server

In [None]:
import syft as sy
import torch

Before the data scientist can connect, the data owner must first launch a Duet server. Once it is running, the data scientist can join it.

In [None]:
duet = sy.join_duet(loopback=True)

## PART 1: Get Pointers to Shared Objects

Get a list of the shared objects:

In [None]:
duet.store.pandas

Get the size of the dataset's vocabulary. 

**Please choose the correct index from the above Pandas dataframe**

In [None]:
vocab_size = duet.store[1]
vocab_size = vocab_size.get_copy(request_block = True)
vocab_size = int(vocab_size)
vocab_size

Get references to the datasets:

**Please choose the correct index from the above Pandas dataframe**

In [None]:
train_data = duet.store[0]
#valid_set = duet.store[2]
#train_set = duet.store['22d54a82-b7e7-40da-adfb-06a588121ba1']

## PART 2: Prepare Datasets for Training

The training and validation sets, as shared by the data owners, are flat tensors of the form:

```
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 .....]
```

where the integers represent words.

The data scientist (DS) is responsible for batchifying this dataset for training. I avoid giving the data owner (DO) this responsibility because I assume it is up to the DS to decide how the data is batched.

I suggest reshaping it as follows:

1. Reshape into a list of input/target samples:

```
[ 
  [ [1 , 2 , 3 , 4 ],
    [5 , 6 , 7 , 8 ]  ],
    
  [ [9 , 10, 11, 12], 
    [13, 14, 15, 16]  ],
  :
  :
  :
]
```

This should use the `view()` method in `torch`.

2. Create a `DataLoader` object on the DS side that batchifies this training set. For example, for a batch size of 2, the data loader should return:

```
Input batch:
     [ [1 , 2 , 3 , 4 ],
       [9 , 10, 11, 12]  ]
       
Target batch:
     [ [5 , 6 , 7 , 8 ],
       [13, 14, 15, 16]  ]
```

Of course, all operations are carried out on tensor pointers because the dataset never leaves the DO's node.
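
The two steps above can be tried locally on the toy tensor from the illustration. This is only a sketch on ordinary tensors, not on Duet pointers; `seq_len` and the use of `TensorDataset` are assumptions made for the illustration and are not part of the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the flat token tensor shared by the data owner
tokens = torch.arange(1, 17)

seq_len = 4  # hypothetical sequence length for this toy example

# Step 1: each sample is an (input, target) pair of seq_len tokens
samples = tokens.view(-1, 2, seq_len)  # shape: (2 samples, 2, 4)
inputs, targets = samples[:, 0], samples[:, 1]

# Step 2: let a DataLoader group the samples into batches of 2
loader = DataLoader(TensorDataset(inputs, targets), batch_size=2, shuffle=False)
input_batch, target_batch = next(iter(loader))

print(input_batch)
print(target_batch)
```

With a batch size of 2, the first batch should match the input and target batches drawn above.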

### Build a Dataset class

Fix some hyperparameters:

In [None]:
# Number of tokens in the training set
ntokens_train = 2088628

# Batch size
bsz = 4

# Sequence length
bptt = 2
#dropout = 0.5

# Size of word embeddings
ninp = 100

# Size of hidden layer
nhid = 200

# Number of RNN layers
nlayers = 2

# Initial learning rate
lr = 20

In [None]:
class Wikitext2(torch.utils.data.Dataset):
    
    def __init__(self, tokens, ntokens, bsz, bptt):
        
        # A pointer to the tensor that contains the list of 
        # all token IDs in the dataset
        self.tokens = tokens
        
        # The sequence length
        self.bptt = bptt
        
        # The batch size
        self.bsz = bsz
        
        # Number of tokens in the dataset
        self.ntokens = ntokens
        
        
        # Batchify the dataset
        self._batchify()
        
    def __getitem__(self, index):
        
        input, target = self._get_batch(index)        
        
        return input, target

        
    def __len__(self):
                
        return (self.ntokens // self.bsz) - (self.bptt + 1)
    
    
    def _batchify(self):  
        
        # We are going to reshape the 1D self.tokens tensor into a
        # 2D tensor with a number of columns equal to the batch size,
        # so first compute the number of rows of that reshaped tensor
        width = self.ntokens // self.bsz

        # Remove surplus tokens that do not fit evenly
        self.tokens_2d = self.tokens.narrow(0, 0, self.bsz * width)

        # Reshape to (width, bsz). Note: the original PyTorch example
        # uses view(bsz, -1).t().contiguous() instead, so that each
        # column holds consecutive tokens.
        self.tokens_2d = self.tokens_2d.view(-1, self.bsz)
        
        
    def _get_batch(self, index):
        
        input = self.tokens_2d.narrow(dim = 0, start = index, length = self.bptt)
        target = self.tokens_2d.narrow(dim = 0, start = index + 1, length = self.bptt)

        return input, target.view(-1)
    
    
    def collate_fn(self, batch):
        return batch[0]
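
To see what `_batchify()` and `_get_batch()` produce, the same slicing can be applied to a small local tensor. The sketch below is an illustration only: `window` is a hypothetical helper that mirrors `_get_batch()`, and the toy sizes are made up. It also compares the layout used in the class with the `view(bsz, -1).t()` layout of the original PyTorch example.

```python
import torch

tokens = torch.arange(1, 13)  # toy stream of 12 token IDs
bsz, bptt = 3, 2

# Layout used by the Wikitext2 class: consecutive tokens run along a row
row_major = tokens.view(-1, bsz)

# Layout used by the original PyTorch example: consecutive tokens run down a column
col_major = tokens.view(bsz, -1).t().contiguous()

def window(data, index):
    """Same slicing as _get_batch, applied to a local tensor."""
    inp = data.narrow(0, index, bptt)
    tgt = data.narrow(0, index + 1, bptt)
    return inp, tgt.reshape(-1)

row_inp, row_tgt = window(row_major, 0)
col_inp, col_tgt = window(col_major, 0)

print(row_inp, row_tgt)
print(col_inp, col_tgt)
```

In the row-major layout the target of token 1 is token 4 (`bsz` positions later in the stream), whereas in the column-major layout it is token 2, the next token. Whether `t()` and `contiguous()` are available on Duet tensor pointers is not verified here.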
        

Create a torch `Dataset` instance:

In [None]:
train_set = Wikitext2(tokens = train_data, 
                      ntokens = ntokens_train, 
                      bsz = bsz,
                      bptt = bptt,
                     )

Create a `DataLoader` instance:

In [None]:
train_loader = torch.utils.data.DataLoader(dataset=train_set, 
                                           batch_size=1, # Should be always set to 1
                                           num_workers=0, 
                                           drop_last=True,
                                           shuffle=True,
                                           collate_fn=train_set.collate_fn,
                                          )
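
The reason for `batch_size=1` together with a `collate_fn` that returns `batch[0]` is that each dataset item is already a full batch. The sketch below shows the effect on a local toy dataset; `PrebatchedToy` and `unwrap` are hypothetical names used only for this illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PrebatchedToy(Dataset):
    """Hypothetical dataset whose items are already full batches."""

    def __init__(self, n_batches, bptt, bsz):
        self.data = torch.arange(n_batches * bptt * bsz).view(n_batches, bptt, bsz)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        batch = self.data[index]
        return batch, batch.reshape(-1)

def unwrap(batch):
    # With batch_size=1 the loader passes a one-element list of items
    return batch[0]

toy = PrebatchedToy(n_batches=3, bptt=2, bsz=4)
default_loader = DataLoader(toy, batch_size=1)
unwrapped_loader = DataLoader(toy, batch_size=1, collate_fn=unwrap)

x_default, _ = next(iter(default_loader))
x_unwrapped, _ = next(iter(unwrapped_loader))

print(x_default.shape)    # extra leading dimension of size 1
print(x_unwrapped.shape)  # the item exactly as the dataset returned it
```

The default collate function stacks the items and adds a leading dimension, while the custom one hands the item through unchanged, which is presumably what is needed when the items are tensor pointers.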

The dataloader is ready for use. Now, let's build the RNN model:

## PART 3: Build an RNN-based Remote Model

Get a pointer to the remote torch and its modules

In [None]:
torch_do = duet.torch
nn = torch_do.nn
F = torch_do.nn.functional

Create the model

In [None]:
class RNNModel(sy.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False):
        super(RNNModel, self).__init__()
        self.ntoken = ntoken
        #self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        if rnn_type in ['LSTM', 'GRU']:
            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        else:
            try:
                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
            except KeyError:
                raise ValueError( """An invalid option for `--model` was supplied,
                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)

        # Optionally tie weights as in:
        # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
        # https://arxiv.org/abs/1608.05859
        # and
        # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
        # https://arxiv.org/abs/1611.01462
        if tie_weights:
            if nhid != ninp:
                raise ValueError('When using the tied flag, nhid must be equal to emsize')
            self.decoder.weight = self.encoder.weight

        #self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        nn.init.uniform_(self.encoder.weight, -initrange, initrange)
        nn.init.zeros_(self.decoder.bias)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, x):
        input, hidden = x
        #emb = self.drop(self.encoder(input))
        emb = self.encoder(input)

        result = self.rnn(emb)
        output, hidden = result[0], result[1]
        
        #output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.ntoken)
        
        return F.log_softmax(decoded, dim=1)#, hidden

    def init_hidden(self, bsz):
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                    weight.new_zeros(self.nlayers, bsz, self.nhid))
        else:
            return weight.new_zeros(self.nlayers, bsz, self.nhid)

Create a model instance:

In [None]:
model = RNNModel(rnn_type = 'LSTM', 
                 ninp = ninp, 
                 ntoken = vocab_size, 
                 nhid = nhid, 
                 nlayers = nlayers)
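
To make the tensor shapes in `forward()` easier to follow, here is a local walk-through with plain `torch.nn` layers. It is a sketch with made-up toy sizes, not the remote model itself.

```python
import torch
import torch.nn as nn

# Toy sizes chosen for the illustration, not the notebook's hyperparameters
ntoken, ninp, nhid, nlayers = 50, 8, 16, 2
bptt, bsz = 2, 4

encoder = nn.Embedding(ntoken, ninp)
rnn = nn.LSTM(ninp, nhid, nlayers)
decoder = nn.Linear(nhid, ntoken)

tokens = torch.randint(0, ntoken, (bptt, bsz))  # (seq_len, batch)
emb = encoder(tokens)                           # (bptt, bsz, ninp)
output, (h_n, c_n) = rnn(emb)                   # output: (bptt, bsz, nhid)
decoded = decoder(output).view(-1, ntoken)      # (bptt * bsz, ntoken)
log_probs = torch.log_softmax(decoded, dim=1)   # one distribution per position

print(emb.shape, output.shape, log_probs.shape)
```

Each row of `log_probs` is a log-distribution over the vocabulary, which is why the target returned by the dataset is flattened to length `bptt * bsz`.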

Create an optimizer:

In [None]:
# Get the parameters as a pointer to a list
parameters = model.parameters(params_list=duet.syft.lib.python.List())

# Create the optimizer
optim = torch_do.optim.Adadelta(parameters, lr = lr)

## PART 4: Start Remote Training

In [17]:
model.train()

for iter, (input, target) in enumerate(train_loader):
    
    # Zero the gradients
    optim.zero_grad()
    
    # Forward pass
    output = model((input, None))
    
    # Compute the loss
    loss = F.nll_loss(input = output, target = target)
    
    # Backprop
    loss.backward()
    
    # Update weights
    optim.step()

    print(iter)
    #break

0
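
For comparison, the same loop can be run entirely locally, which may help when debugging the remote version. The sketch below assumes a tiny stand-in model (`TinyLM`, a hypothetical name) and random token data; the sizes and the learning rate are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
ntoken, bptt, bsz = 20, 2, 4  # toy sizes

class TinyLM(nn.Module):
    """Hypothetical, much smaller local stand-in for RNNModel."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Embedding(ntoken, 8)
        self.rnn = nn.LSTM(8, 16)
        self.decoder = nn.Linear(16, ntoken)

    def forward(self, tokens):
        output, _ = self.rnn(self.encoder(tokens))
        return F.log_softmax(self.decoder(output).view(-1, ntoken), dim=1)

model = TinyLM()
optim = torch.optim.Adadelta(model.parameters(), lr=1.0)

# Random tokens: the target is the input shifted by one position
data = torch.randint(0, ntoken, (bptt + 1, bsz))
inp, target = data[:-1], data[1:].reshape(-1)

losses = []
for _ in range(5):
    optim.zero_grad()                       # zero the gradients
    loss = F.nll_loss(model(inp), target)   # forward pass and loss
    loss.backward()                         # backprop
    optim.step()                            # update weights
    losses.append(loss.item())

print(losses)
```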


## Local model test

In [None]:
import torch
import torch.nn as nn

class RNNModel(nn.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False):
        super(RNNModel, self).__init__()
        self.ntoken = ntoken
        #self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        if rnn_type in ['LSTM', 'GRU']:
            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        else:
            try:
                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
            except KeyError:
                raise ValueError( """An invalid option for `--model` was supplied,
                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)

        # Optionally tie weights as in:
        # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
        # https://arxiv.org/abs/1608.05859
        # and
        # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
        # https://arxiv.org/abs/1611.01462
        if tie_weights:
            if nhid != ninp:
                raise ValueError('When using the tied flag, nhid must be equal to emsize')
            self.decoder.weight = self.encoder.weight

        #self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        nn.init.uniform_(self.encoder.weight, -initrange, initrange)
        nn.init.zeros_(self.decoder.bias)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, x):
        input, hidden = x
        #emb = self.drop(self.encoder(input))
        emb = self.encoder(input)

        result = self.rnn(emb)
        output, hidden = result[0], result[1]
        
        #output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.ntoken)
        
        return nn.functional.log_softmax(decoded, dim=1)#, hidden
        #return result

    def init_hidden(self, bsz):
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                    weight.new_zeros(self.nlayers, bsz, self.nhid))
        else:
            return weight.new_zeros(self.nlayers, bsz, self.nhid)

In [None]:
model = RNNModel(rnn_type = 'LSTM', ninp = 100, ntoken = 100, nhid = 100, nlayers = 2)

In [None]:
input = torch.ones(20, 2, dtype = torch.long) # (seq_len, batch), since batch_first=False
hidden = torch.zeros(2, 2, 100) # (nlayers, batch, nhid)
c = torch.zeros(2, 2, 100) # (nlayers, batch, nhid)

In [None]:
#output = model((input, hidden))
#output, hidden = model((input, [hidden, c]))
output= model((input, None))

In [None]:
output.shape

In [None]:
nn.functional.nll_loss(output, input.view(-1))

In [None]:
type(output)

### Remaining issues

1. Passing a tuple of `(hidden, c)` to `self.rnn` raises an exception in the `MergeFrom()` call.
2. The dropout layer does not seem to work.
3. Calling `input, hidden = self.rnn(input, None)` in the LSTM case fails with "too many values to unpack", although this works with local torch. As a workaround, I called it as `input = self.rnn(input, None)`.
4. We cannot index a list pointer: `l = syft.lib.python.list.List([1,2]).send(duet); a = l[0]`.
5. We cannot unpack a list pointer: `a, b = l`.
6. I couldn't implement gradient clipping because the function `torch.nn.utils.clip_grad_norm_()` is not yet implemented, and because we cannot iterate over `model.parameters()` when it is a `ListPointer`. In fact, `model.parameters()` seems to be nonfunctional: in the allowlist it should return an iterator, not a list.
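
For reference, this is roughly what the missing gradient clipping step from item 6 looks like with local torch. It is a sketch on a small local LSTM with an artificially large loss; the threshold `clip` is an illustrative value, and none of this runs on Duet pointers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small local model with deliberately large gradients
model = nn.LSTM(input_size=8, hidden_size=16, num_layers=2)
out, _ = model(torch.randn(5, 3, 8))
(out.sum() * 100).backward()

clip = 0.25  # illustrative maximum gradient norm

# Rescales all gradients in place and returns the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

# Recompute the total gradient norm after clipping
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(float(norm_before), float(norm_after))
```

In a local training loop this call would sit between `loss.backward()` and `optim.step()`.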

In [None]:
import torch
t1 = torch.tensor([2,3,5])
t2 = torch.tensor([4,1,6])

list_ptr = sy.lib.python.list.List([t1,t2]).tag('#list').send(duet)
elem_ptr = list_ptr[0] #Does not work

In [None]:
duet.store.pandas

In [None]:
elem_ptr, list_ptr

In [None]:
l = elem_ptr.get(request_block=True)

In [None]:
l

In [None]:
t = duet.torch.Tensor(elem_ptr )

In [None]:
t

In [None]:
t = t + 3

In [None]:
t = t.get(request_block = True)

In [None]:
type(t)

In [None]:
t1_ptr, t2_ptr = list_ptr