# WikiText - Syft Duet - Data Owner 🎸

The code used here has been adapted directly from the `Word-level language modeling RNN` PyTorch example:
https://github.com/pytorch/examples/tree/master/word_language_model

The goal is to show how the original example can be adapted so that you, as a Data Owner, can load your private data and share it securely, allowing a Data Scientist to train on it over a Duet session.

## PART 1: Launch a Duet Server and Connect

As a Data Owner, you want to allow someone else to perform data science on data that you own and likely want to protect.

In order to do this, we must load our data into a locally running server within this notebook. We call this server a "Duet".

To begin, you must launch Duet and help your Duet "partner" (a Data Scientist) connect to this server.

You do this by running the code below, sending the code snippet containing your unique Server ID to your partner, and then following the instructions it gives!

In [None]:
import syft as sy
duet = sy.launch_duet(loopback=True)
sy.logger.add(sink="./syft_do.log")

## PART 2: Prepare Data

Let's create a class to hold the vocab of the dataset, with some utility methods.

In [None]:
import torch
import os

In [None]:
class Dictionary:
    """This class holds the vocabulary along with some utility functions."""
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        """Adds a word to the vocab."""
        
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
            
        return self.word2idx[word]

    def __len__(self):
        """Return the size of the used vocab"""
        return len(self.idx2word)
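
As a quick sanity check, here is a minimal sketch of how the vocabulary behaves when the same word is added twice. The class is a condensed copy of the one above so that the snippet runs on its own; the sample sentence and the name `vocab` are made up for illustration.

```python
class Dictionary:
    """Condensed copy of the class above so the example is self-contained."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # A new word gets the next free index; a known word keeps its index
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


vocab = Dictionary()  # hypothetical instance, for illustration only
ids = [vocab.add_word(w) for w in "the cat saw the dog".split()]

print(ids)                # [0, 1, 2, 0, 3]
print(len(vocab))         # 4
print(vocab.idx2word[2])  # saw
```

Note that the second occurrence of "the" reuses ID 0, so the vocabulary size counts distinct words only.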

Let's now create a class that preprocesses the dataset, and prepares it for both training and testing. In this particular use case, preprocessing includes tokenization and transforming words into integer IDs. 

In [None]:
class Corpus:
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self._tokenize(os.path.join(path, "train.txt"))
        self.valid = self._tokenize(os.path.join(path, "valid.txt"))
        self.test = self._tokenize(os.path.join(path, "test.txt"))

    def _tokenize(self, path):
        """Tokenizes a text file."""
        
        assert os.path.exists(path)
        
        # Add words to the dictionary
        with open(path, "r", encoding="utf8") as f:
            for line in f:
                words = line.split() + ["<eos>"]
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, "r", encoding="utf8") as f:
            idss = []
            
            for line in f:
                words = line.split() + ["<eos>"]
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])

                idss.append(torch.tensor(ids, dtype=torch.int64))

            ids = torch.cat(idss)

        return ids
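
To make the tokenization step concrete, here is a small sketch that applies the same idea to two in-memory lines instead of a file. It is not the notebook's implementation: `tokenize_lines` is a hypothetical helper, it builds the vocabulary and the ID sequence in a single pass, and it returns a plain Python list rather than an `int64` tensor.

```python
def tokenize_lines(lines):
    """Hypothetical helper: turn lines of text into one flat list of word IDs.

    Mirrors the approach of Corpus._tokenize: split on whitespace, append an
    "<eos>" marker to every line, and concatenate all lines into one sequence.
    """
    word2idx = {}
    ids = []
    for line in lines:
        for word in line.split() + ["<eos>"]:
            # setdefault assigns the next free ID the first time a word is seen
            ids.append(word2idx.setdefault(word, len(word2idx)))
    return ids, word2idx


sample = ["the cat sat", "the dog sat"]  # made-up text, not from WikiText-2
ids, word2idx = tokenize_lines(sample)

print(ids)            # [0, 1, 2, 3, 0, 4, 2, 3]
print(len(word2idx))  # 5
```

Each line contributes its words followed by the ID of `<eos>` (3 here), so line boundaries remain visible in the flattened sequence.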

Create a `Corpus` instance, which tokenizes the training, validation, and test splits into flat tensors of token IDs that we will share with Duet!

In [None]:
# Create dataset
corpus = Corpus(path="./original/data/wikitext-2")

Don't forget to Tag and Describe the datasets before sharing on Duet.

In [None]:
# Training set
corpus.train.tag("wikitext2_dataset", "train_data")
corpus.train.describe(f"Wikitext2 training set. shape: ({corpus.train.shape[0]},)")

# Validation set
corpus.valid.tag("wikitext2_dataset", "valid_data")
corpus.valid.describe(f"Wikitext2 validation set. shape: ({corpus.valid.shape[0]},)")

# Test set
corpus.test.tag("wikitext2_dataset", "test_data")
corpus.test.describe(f"Wikitext2 test set. shape: ({corpus.test.shape[0]},)")

Get the vocabulary size to share it on Duet. If we use a syft primitive we can tag and send this directly to Duet!

In [None]:
vocab_size = sy.lib.python.Int(len(corpus.dictionary))
vocab_size

In [None]:
vocab_size.tag("wikitext2_dataset", "vocab_size")
vocab_size.describe("Vocabulary size of Wikitext2 dataset")
vocab_size.tags, vocab_size.description

# Don't forget to share the vocab size
vocab_size.send(duet, searchable=True)

## PART 3: Share Dataset on Duet

In [None]:
# Share the datasets on Duet and make them visible to the Data Scientist with searchable=True
corpus.train.send(duet, searchable=True)

In [None]:
corpus.valid.send(duet, searchable=True)

Let's see all the data we just created in the store.

In [None]:
duet.store.pandas

## PART 4: Set Accept Handlers

Automatically approve all requests for the sake of this demo.

In [None]:
duet.requests.add_handler(
    action="accept"
)

In [None]:
duet.store.pandas