# The Data Owner's Notebook


**Note:**

Much of the code used here is either copied or adapted from the `Word-level language modeling` PyTorch example:

https://github.com/pytorch/examples/tree/master/word_language_model

The goal is to demonstrate how the original example can be adapted to a setting where the dataset is private to the data owner, as is the case in this demo.

## PART 0: Importing Libraries

In [None]:
import syft as sy

## PART 1: Launch Duet Server and Connect 

Let's start by launching the Duet server. The data owner launches this server, and the data scientist connects to it afterwards.

In [None]:
duet = sy.launch_duet(loopback=True)

## PART 2: Prepare Data

Let's create a class to hold the vocab of the dataset, with some utility methods.

In [None]:
import torch
import os

class Dictionary(object):
    """This class holds the vocabulary along
    with some utility functions.
    """
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        """Adds a word to the vocab.
        """
        
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
            
        return self.word2idx[word]

    def __len__(self):
        """Return the size of the used vocab
        """
        return len(self.idx2word)
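
As a quick illustration of how the vocabulary assigns IDs, here is a minimal sketch. It repeats a compact copy of the `Dictionary` class so the snippet runs on its own; the sample sentence and the names `vocab` and `ids` are made up for this example.

```python
# Compact copy of the Dictionary class above, repeated so this snippet is self-contained.
class Dictionary:
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # A new word gets the next free integer ID; a known word keeps its ID.
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


# Hypothetical sample sentence, not taken from the dataset.
vocab = Dictionary()
ids = [vocab.add_word(word) for word in "the cat sat on the mat".split()]

print(ids)         # the repeated word "the" maps to the same ID both times
print(len(vocab))  # number of distinct words seen so far
```

IDs are handed out in order of first appearance, so `idx2word[i]` always recovers the word whose ID is `i`.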

Let's now create a class that preprocesses the dataset, and prepares it for both training and testing. In this particular use case, preprocessing includes tokenization and transforming words into integer IDs. 

In [None]:
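class Corpus(object):
    """Tokenizes the train, validation and test files found under `path`
    and stores each split as a 1-D tensor of word IDs.
    """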
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self._tokenize(os.path.join(path, "train.txt"))
        self.valid = self._tokenize(os.path.join(path, "valid.txt"))
        self.test = self._tokenize(os.path.join(path, "test.txt"))

    def _tokenize(self, path):
        """Tokenizes a text file."""
        
        assert os.path.exists(path)
        
        # Add words to the dictionary
        with open(path, "r", encoding="utf8") as f:
            for line in f:
                words = line.split() + ["<eos>"]
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, "r", encoding="utf8") as f:
            
            idss = []
            
            for line in f:
                
                words = line.split() + ["<eos>"]
                
                ids = []
                
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                    
                idss.append(torch.tensor(ids).type(torch.int64))
                
            ids = torch.cat(idss)

        return ids
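
To make the tokenization step concrete, the following is a rough sketch of the same idea on a tiny temporary file. It is an illustration under stated assumptions, not the notebook's implementation: it uses plain Python lists instead of torch tensors to stay dependency-free, it builds the vocabulary and the ID sequence in a single pass rather than two, and the helper `tokenize_lines` and the sample text are hypothetical.

```python
import os
import tempfile


def tokenize_lines(lines, word2idx):
    """Map whitespace-separated words to integer IDs, appending <eos> to each line.

    Hypothetical helper for illustration; `word2idx` is updated in place.
    """
    ids = []
    for line in lines:
        for word in line.split() + ["<eos>"]:
            # setdefault assigns the next free ID to a word seen for the first time.
            ids.append(word2idx.setdefault(word, len(word2idx)))
    return ids


with tempfile.TemporaryDirectory() as tmp:
    # Made-up stand-in for data/train.txt.
    path = os.path.join(tmp, "train.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write("hello world\nhello again\n")

    word2idx = {}
    with open(path, "r", encoding="utf8") as f:
        train_ids = tokenize_lines(f, word2idx)

print(train_ids)
print(word2idx)
```

Every line ends with the `<eos>` token, and the per-line ID lists are flattened into one long sequence, which mirrors what `torch.cat(idss)` does in `_tokenize`.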

Create the corpus, which tokenizes the training, validation and test splits. They are tagged and shared on Duet in the following steps:

In [None]:
# Create dataset
corpus = Corpus(path="data/")


Tag and describe the datasets before sharing on Duet.

In [None]:
# Training set
corpus.train.tag("wikitext2_dataset", 'train_data')
corpus.train.describe(f"Wikitext2 training set. shape: ({corpus.train.shape[0]},)")

# Validation set
corpus.valid.tag("wikitext2_dataset", 'valid_data')
corpus.valid.describe(f"Wikitext2 validation set. shape: ({corpus.valid.shape[0]},)")

# Test set
corpus.test.tag("wikitext2_dataset", 'test_data')
corpus.test.describe(f"Wikitext2 test set. shape: ({corpus.test.shape[0]},)")

Get the vocabulary size to share it on Duet:

In [None]:
vocab_size = sy.lib.python.Int(len(corpus.dictionary))
vocab_size.tag('wikitext2_dataset', 'vocab_size')
vocab_size.describe('Vocabulary size of Wikitext2 dataset')

## PART 3: Share Dataset on Duet

In [None]:
# Share the datasets on Duet
corpus.train.send(duet, searchable=True)

In [None]:
corpus.valid.send(duet, searchable=True)
# The test set is tagged above but not shared in this demo; uncomment to share it.
# corpus.test.send(duet, searchable=True)

In [None]:
# Share the vocab size
vocab_size.send(duet, searchable=True)

Get a list of the shared objects:

In [None]:
duet.store.pandas

For the sake of this demo, automatically approve all requests:

In [None]:
accept_handler = {
#     "request_name": "age_data",
    "timeout_secs": -1,
    "action": "accept",
    "print_local": True,
    "log_local": True
}
duet.requests.add_handler(accept_handler)

# while True:
#     if duet.requests:
#         for request in duet.requests:
#             request.approve()

In [None]:
duet.requests.handlers