# The Data Scientist's Notebook

Only the data is kept private in this use case. The model created by the data scientist is not: it is sent unencrypted to the data owner's machine for training.

**Note:**

Much of the code used here is either copied or adapted from the `Word-level language modeling` PyTorch example:

https://github.com/pytorch/examples/tree/master/word_language_model

The goal is to demonstrate how the original example can be adapted to a setting where the dataset is private to the data owner, as is the case in this demo.

## PART 0: Connect to a Remote Duet Server

In [None]:
import syft as sy

The data owner must launch a Duet server first. Once it is running, the data scientist can connect to it.

In [None]:
duet = sy.join_duet(loopback=True)

## PART 1: Get Pointers to Shared Objects

Get a list of the shared objects:

In [None]:
duet.store.pandas

Get the size of the dataset's vocabulary

In [None]:
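# Pointer to the vocabulary size stored on the data owner's side
vocab_size = duet.store[1]
# Ask the data owner for the actual value and wait for the request to be handled
vocab_size = vocab_size.get(request_block=True)
vocab_size = int(vocab_size)
vocab_size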

Get references to the datasets

In [None]:
train_set = duet.store[0]
valid_set = duet.store[2]
#train_set = duet.store['22d54a82-b7e7-40da-adfb-06a588121ba1']

## PART 2: Prepare Datasets for Training

The training and validation sets, as shared by the data owners, are flat tensors of the form:

```
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 .....]
```

where the integers represent words.

The data scientist (DS) is responsible for batchifying this dataset for training. I avoid giving the data owner (DO) this responsibility because I assume it is up to the DS to decide how the data is batched.

I suggest reshaping it as follows:

1. Reshape into a list of input/target samples:

```
[ 
  [ [1 , 2 , 3 , 4 ],
    [5 , 6 , 7 , 8 ]  ],
    
  [ [9 , 10, 11, 12], 
    [13, 14, 15, 16]  ],
  :
  :
  :
]
```

This should use the `view()` method in `torch`.

2. Create a `DataLoader` object on the DS side that batches this training set. For example, for a batch size of 2, the data loader should return:

```
Input batch:
     [ [1 , 2 , 3 , 4 ],
       [9 , 10, 11, 12]  ]
       
Target batch:
     [ [5 , 6 , 7 , 8 ],
       [13, 14, 15, 16]  ]
```

Of course, all operations are carried out on tensor pointers, because the dataset never leaves the DO's node.
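The two steps above can be sketched on an ordinary local tensor. This is only an illustration under assumptions: it uses plain `torch` instead of the tensor pointers obtained through Duet, a toy sequence length of 4, and the hypothetical names `flat`, `samples`, and `loader`. Whether each call works unchanged on remote pointers is not verified here.

```python
import torch
from torch.utils.data import DataLoader

# Toy stand-in for the flat tensor shared by the data owner (assumption:
# a local tensor, not a Duet pointer). 19 tokens, so 3 are left over.
flat = torch.arange(1, 20)

seq_len = 4  # toy sequence length (the notebook sets bptt = 35)

# Keep only as many tokens as fit into whole (input, target) pairs.
n_samples = flat.size(0) // (2 * seq_len)
flat = flat.narrow(0, 0, n_samples * 2 * seq_len)

# Step 1: one row per sample, holding an input and a target sequence.
samples = flat.view(n_samples, 2, seq_len)

# Step 2: a tensor is indexable and has a length, so DataLoader can batch it.
loader = DataLoader(samples, batch_size=2)

batch = next(iter(loader))   # shape: (batch, 2, seq_len)
inputs = batch[:, 0]         # first sequence of every sample
targets = batch[:, 1]        # second sequence of every sample
print(inputs)
print(targets)
```

With these toy values the first batch reproduces the input and target batches shown in step 2.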

Fix some hyperparameters:

In [None]:
# bptt: sequence length for backpropagation through time (BPTT), i.e. the number of time steps the RNN is unrolled over.
bptt = 35
dropout = 0.5

------ DRAFT --------------

In [None]:
# b = b.get(request_block=True)

In [None]:
# bptt = 6
# sq_count = 2088628 // (bptt * 2)
# print(sq_count)
# #b = a.narrow(0,0,sq_count * 2 * bptt)
# b = a.view(sq_count, 2, bptt)

In [None]:
torch_d = duet.torch

In [None]:
import torch
b = torch.tensor(range(1002))
b.size(0)

In [None]:
bsz = 4
bptt = 6
sq_count = b.size(0) // (bsz * bptt)
print(sq_count)
b = b.narrow(0,0,sq_count * bsz * bptt)

In [None]:
b.size(0)

In [None]:
#b.view(sq_count, 2, bsz, bptt // 2)
#b = b.view(sq_count, bsz, bptt)
bptt = 6
sq_count = b.size(0) // (bptt * 2)
print(sq_count)
b = b.narrow(0,0,sq_count * 2 * bptt)
b = b.view(sq_count, 2, bptt)

In [None]:
train_loader = torch.utils.data.DataLoader(b, batch_size=2)

In [None]:
for batch in train_loader:
    print(batch)
    break

## PART 3: Build an RNN-Based Model

Get a pointer to the remote `torch` module and its submodules:

In [None]:
torch_do = duet.torch
nn = torch_do.nn
F = torch_do.nn.functional

Create the model

In [None]:
class RNNModel(sy.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False):
        super(RNNModel, self).__init__()
        self.ntoken = ntoken
        #self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        if rnn_type in ['LSTM', 'GRU']:
            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        else:
            try:
                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
            except KeyError:
                raise ValueError( """An invalid option for `--model` was supplied,
                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)

        # Optionally tie weights as in:
        # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
        # https://arxiv.org/abs/1608.05859
        # and
        # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
        # https://arxiv.org/abs/1611.01462
        if tie_weights:
            if nhid != ninp:
                raise ValueError('When using the tied flag, nhid must be equal to emsize')
            self.decoder.weight = self.encoder.weight

        self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        nn.init.uniform_(self.encoder.weight, -initrange, initrange)
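        # Zero the decoder bias (as in the original PyTorch example); the weight is initialized on the next line
        nn.init.zeros_(self.decoder.bias)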
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, x):
        input, hidden = x
        # self.drop is commented out in __init__, so the embedding is used directly
        emb = self.encoder(input)
        output, hidden = self.rnn(emb, hidden)
        #output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.ntoken)
        return F.log_softmax(decoded, dim=1), hidden

    def init_hidden(self, bsz):
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                    weight.new_zeros(self.nlayers, bsz, self.nhid))
        else:
            return weight.new_zeros(self.nlayers, bsz, self.nhid)
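
The weight-tying branch in `__init__` can be illustrated locally. The sketch below is an assumption-laden illustration: it uses plain `torch.nn` layers rather than the remote `nn` pointer, and toy sizes picked for readability. It shows why `nhid` must equal `ninp`: the embedding matrix has shape `(ntoken, ninp)`, the decoder weight has shape `(ntoken, nhid)`, and tying makes both layers share a single parameter.

```python
import torch
from torch import nn

ntoken, ninp, nhid = 10, 8, 8  # toy sizes; tying requires nhid == ninp

encoder = nn.Embedding(ntoken, ninp)   # weight shape: (ntoken, ninp)
decoder = nn.Linear(nhid, ntoken)      # weight shape: (ntoken, nhid)

# Tie the weights: both layers now hold the very same Parameter object.
decoder.weight = encoder.weight

# An in-place update made through one layer is visible through the other.
with torch.no_grad():
    encoder.weight.fill_(0.5)

print(decoder.weight is encoder.weight)
print(decoder.weight.shape)
```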

Create a model instance

In [None]:
model = RNNModel(rnn_type = 'LSTM', ninp = 100, ntoken = 100, nhid = 100, nlayers =2)

In [None]:
model((input, torch.zeros(1,20)))
#model(input=torch.zeros(1,20), hidden = torch.zeros(1,20))
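
For reference, the shapes expected by an LSTM-based model of this kind can be checked locally. This is a sketch under assumptions: it uses plain `torch.nn` layers in place of `RNNModel` and the remote `nn` pointer, and the sequence length and batch size are arbitrary toy values. The input is a tensor of token indices of shape `(seq_len, batch)`, and the LSTM hidden state is a pair of tensors of shape `(nlayers, batch, nhid)`, matching what `init_hidden` returns.

```python
import torch
from torch import nn
import torch.nn.functional as F

ntoken, ninp, nhid, nlayers = 100, 100, 100, 2
seq_len, bsz = 35, 4  # toy values

encoder = nn.Embedding(ntoken, ninp)
rnn = nn.LSTM(ninp, nhid, nlayers)
decoder = nn.Linear(nhid, ntoken)

# Token indices must be integers in [0, ntoken).
tokens = torch.randint(0, ntoken, (seq_len, bsz))
# LSTM hidden state: (h_0, c_0), each of shape (nlayers, bsz, nhid).
hidden = (torch.zeros(nlayers, bsz, nhid), torch.zeros(nlayers, bsz, nhid))

emb = encoder(tokens)                        # (seq_len, bsz, ninp)
output, hidden = rnn(emb, hidden)            # (seq_len, bsz, nhid)
decoded = decoder(output).view(-1, ntoken)   # (seq_len * bsz, ntoken)
log_probs = F.log_softmax(decoded, dim=1)
print(log_probs.shape)
```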

In [None]:
c = torch.ones(5,5)
c[torch.LongTensor([1,3])]

In [None]:
torch.nn.init.xavier_uniform_(torch.empty(5,5), gain=1.0).send(duet)