# The Data Owner's Notebook


**Note:**

Much of the code used here is either copied or adapted from the `Word-level language modeling` PyTorch example:

https://github.com/pytorch/examples/tree/master/word_language_model

The goal is to demonstrate how the original example can be adapted to a setting where the dataset is private to the data owner, as is the case in this demo.

## PART 0: Importing Libraries

In [1]:
import os

import torch
import syft as sy

sy.VERBOSE = False

## PART 1: Launch Duet Server and Connect 

Let's start by launching the duet server. This server is launched by the data owner. The data scientist will connect to it afterwards.

In [2]:
duet = sy.launch_duet(loopback=True)

🎤  🎸  ♪♪♪ Starting Duet ♫♫♫  🎻  🎹

♫♫♫ > DISCLAIMER: Duet is an experimental feature currently
♫♫♫ > in alpha. Do not use this to protect real-world data.
♫♫♫ >
♫♫♫ > Punching through firewall to OpenGrid Network Node at: http://ec2-18-216-8-163.us-east-2.compute.amazonaws.com:5000
♫♫♫ > http://ec2-18-216-8-163.us-east-2.compute.amazonaws.com:5000
♫♫♫ >
♫♫♫ > ...waiting for response from OpenGrid Network... id {
  value: "\211X;\035\354\001JN\236\006\213\\;-\201\366"
}
name: "om-net"
node {
  id {
    value: "\211X;\035\354\001JN\236\006\213\\;-\201\366"
  }
  name: "om-net"
}

DONE!
♫♫♫ >
♫♫♫ > STEP 1: Send the following code to your Duet Partner!

import syft as sy
duet = sy.join_duet('baaa092689e8d00b34cc561228d0dc20')

♫♫♫ > STEP 2: The code above will print out a 'Client Id'. Have
♫♫♫ >         your duet partner send it to you and enter it below!

Running loopback mode. Use sy.join_duet(loopback=True) on the other side.
♫♫♫ > Co

## PART 2: Prepare Data

Let's create a class to hold the vocab of the dataset, with some utility methods.

In [3]:
class Dictionary(object):
    """This class holds the vocabulary along
    with some utility functions.
    """
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        """Adds a word to the vocab.
        """
        
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
            
        return self.word2idx[word]

    def __len__(self):
        """Return the size of the used vocab
        """
        return len(self.idx2word)
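
To make the behaviour of `add_word` concrete, here is a minimal sketch on a toy sentence. The class is repeated in compact form only so the snippet runs on its own; the sentence and the `decoded` variable are illustrative and not part of the original notebook.

```python
# Compact copy of the Dictionary class above, repeated so this snippet is self-contained.
class Dictionary:
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


vocab = Dictionary()

# Encode a toy sentence; "the" occurs twice but is stored only once.
ids = [vocab.add_word(w) for w in "the cat sat on the mat <eos>".split()]
print(ids)         # [0, 1, 2, 3, 0, 4, 5]
print(len(vocab))  # 6

# idx2word gives the inverse mapping, so IDs can be decoded back to words.
decoded = [vocab.idx2word[i] for i in ids]
print(decoded)
```

Because `add_word` returns the existing index for a known word, calling it repeatedly on the same token is safe and does not grow the vocabulary.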

Let's now create a class that preprocesses the dataset, and prepares it for both training and testing. In this particular use case, preprocessing includes tokenization and transforming words into integer IDs. 

In [4]:
class Corpus(object):
    
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self._tokenize(os.path.join(path, "train.txt"))
        self.valid = self._tokenize(os.path.join(path, "valid.txt"))
        self.test = self._tokenize(os.path.join(path, "test.txt"))

    def _tokenize(self, path):
        """Tokenizes a text file."""
        
        assert os.path.exists(path)
        
        # Add words to the dictionary
        with open(path, "r", encoding="utf8") as f:
            for line in f:
                words = line.split() + ["<eos>"]
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, "r", encoding="utf8") as f:
            
            idss = []
            
            for line in f:
                
                words = line.split() + ["<eos>"]
                
                ids = []
                
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                    
                idss.append(torch.tensor(ids).type(torch.int64))
                
            ids = torch.cat(idss)

        return ids
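
The tokenization step can be illustrated without any files or tensors. The sketch below is a simplified, torch-free variant and not the notebook's implementation: the helper name `tokenize_lines` is hypothetical, it works on an in-memory list of lines, and it assigns IDs in a single pass instead of the two passes used by `_tokenize`.

```python
def tokenize_lines(lines, word2idx):
    """Map whitespace-separated words to integer IDs, appending <eos> to each line.

    Hypothetical helper for illustration; `word2idx` is updated in place.
    """
    ids = []
    for line in lines:
        for word in line.split() + ["<eos>"]:
            # setdefault stores the next free ID the first time a word is seen
            # and returns the existing ID otherwise.
            ids.append(word2idx.setdefault(word, len(word2idx)))
    return ids


word2idx = {}
ids = tokenize_lines(["a b a", "c a"], word2idx)
print(ids)       # [0, 1, 0, 2, 3, 0, 2]
print(word2idx)  # {'a': 0, 'b': 1, '<eos>': 2, 'c': 3}
```

In the notebook's `Corpus` class the resulting IDs are additionally wrapped in `torch.int64` tensors and concatenated with `torch.cat`, which yields one flat 1-D tensor per file.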

Create the corpus, which loads and tokenizes the training, validation, and test splits:

In [5]:
# Create dataset
corpus = Corpus(path = 'data/')


Tag and describe the datasets before sharing on Duet.

In [6]:
# Training set
corpus.train.tag("#wikitext2_dataset", '#train_data')
corpus.train.describe(f"Wikitext2 training set. shape: ({corpus.train.shape[0]},)")

# Validation set
corpus.valid.tag("#wikitext2_dataset", '#valid_data')
corpus.valid.describe(f"Wikitext2 validation set. shape: ({corpus.valid.shape[0]},)")

# Test set
corpus.test.tag("#wikitext2_dataset", '#test_data')
corpus.test.describe(f"Wikitext2 test set. shape: ({corpus.test.shape[0]},)")

tensor([   0,    1, 1144,  ...,   15,    0,    0])

Get the vocabulary size to share it on Duet:

In [7]:
vocab_size = sy.lib.python.Int(len(corpus.dictionary))
vocab_size.tag('#wikitext2_dataset', '#vocab_size')
vocab_size.describe('Vocabulary size of Wikitext2 dataset')

33278

## PART 3: Share Dataset on Duet

In [8]:
# Share the datasets on Duet
corpus.train.send(duet, searchable=True)
corpus.valid.send(duet, searchable=True)
#corpus.test.send(duet, searchable=True)

# Share the vocab size
vocab_size.send(duet, searchable=True)

<syft.proxy.syft.lib.python.IntPointer at 0x7f0df3ed5f90>

Get a list of the shared objects:

In [9]:
duet.store.pandas

|   | ID | Tags | Description |
|---|----|------|-------------|
| 0 | `<UID:8c8686aa-e60e-45b4-950a-aa4a336f3e81>` | [#wikitext2_dataset, #vocab_size] | Vocabulary size of Wikitext2 dataset |
| 1 | `<UID:ca4bfaad-f1d7-4df6-b262-8596b7814e29>` | [#wikitext2_dataset, #test_data] | Wikitext2 test set. shape: (245569,) |


Automatically approve all incoming requests for the sake of this demo:

In [10]:
import time

while True:
    if duet.requests:
        for request in duet.requests:
            request.approve()
    # Sleep briefly between polls so the loop does not spin at 100% CPU
    time.sleep(1)

KeyboardInterrupt: 
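
The loop above runs until the kernel is interrupted, which is why the cell ends with a `KeyboardInterrupt`. As a rough sketch of a bounded alternative, the snippet below polls a fixed number of times and then returns. It does not use Syft at all: `FakeRequest`, its `approved` flag, and `approve_pending` are hypothetical stand-ins, chosen only so the polling pattern can be run and inspected in isolation.

```python
import time


class FakeRequest:
    """Hypothetical stand-in for a Duet request; only approve() is modeled."""

    def __init__(self, name):
        self.name = name
        self.approved = False

    def approve(self):
        self.approved = True


def approve_pending(get_requests, poll_interval=0.01, max_polls=5):
    """Approve whatever get_requests() returns, polling at most max_polls times."""
    approved = 0
    for _ in range(max_polls):
        # Copy to a list so approving cannot disturb the iteration.
        for request in list(get_requests()):
            if not request.approved:
                request.approve()
                approved += 1
        time.sleep(poll_interval)  # yield the CPU between polls
    return approved


queue = [FakeRequest("train_data"), FakeRequest("vocab_size")]
count = approve_pending(lambda: queue)
print(count)  # 2
```

With a real Duet session, the callable passed as `get_requests` would presumably return `duet.requests`; whether approved requests remain in that collection is not confirmed here, so the `approved` check is an assumption of this sketch.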