# Data Owner 02

Outline of what DO2 will do

0. Run the `syftbox client` in a terminal or the SyftBox UI app
1. DO logs into the datasite as an admin and creates a Syft dataset
2. DO reviews and runs jobs submitted by data scientists on DO's private data

## 0. DO runs the `syftbox client` in a terminal or the SyftBox UI app
The CLI syftbox client can be installed with a single command: `curl -fsSL https://syftbox.net/install.sh | sh`. The SyftUI app can be installed from `https://www.syftbox.net/`:

<img src="../images/syftboxnet.png" width="40%" alt="syftbox.net">

This sets up a SyftBox directory, located at `~/SyftBox` by default.
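
Before moving on, it can help to confirm that the directory was actually created. The snippet below is a minimal sketch, assuming the client was installed with the default location; the helper name `syftbox_dir_exists` is hypothetical and not part of SyftBox.

```python
from pathlib import Path
from typing import Optional


def syftbox_dir_exists(root: Optional[Path] = None) -> bool:
    """Return True if the SyftBox directory is present.

    Assumption: the client uses the default location ~/SyftBox.
    Pass `root` explicitly if a different location was chosen.
    """
    root = root if root is not None else Path.home() / "SyftBox"
    return root.is_dir()


print(f"SyftBox directory found: {syftbox_dir_exists()}")
```

If this prints `False`, the client has probably not finished its first run yet, or it was configured with a non-default directory.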

<img src="../images/SyftBoxNetwork.png" width="20%" alt="SyftBox network">

## 1. DO logs into the datasite as admin and creates a corpus dataset

<img src="../images/do2LogsInSyftBoxDatasite.png" width="70%" alt="DO2 logs into SyftBox datasite">

In [None]:
import syft_rds as sy
from loguru import logger
from syft_core import Client

do2_email = Client.load().email
logger.info(f"DO2 email: {do2_email}")
do2 = sy.init_session(host=do2_email, start_rds_server=True)

logger.info(f"{do2.is_admin = }")

First, DO2 also has a local dataset (`textbooks`) with a mock (fake / synthetic) part and a real, private part.

<img src="../images/do2PreparesDataset.png" width="33%" alt="do2 prepares a Syft dataset">

In [None]:
from pathlib import Path

from huggingface_hub import snapshot_download

DATASET_DIR = Path.cwd().parent / "datasets"
CORPUS_NAME = "textbooks"

use_subset = True  # Set to False to download the full corpus (very slow)
if use_subset:
    DATASET_PATH = DATASET_DIR / "subsets" / CORPUS_NAME
    allow_patterns = f"subsets/{CORPUS_NAME}/*"
else:
    DATASET_PATH = DATASET_DIR / CORPUS_NAME
    allow_patterns = f"{CORPUS_NAME}/*"

if not DATASET_PATH.exists():
    snapshot_download(
        repo_id="khoaguin/medical-corpus",
        repo_type="dataset",
        local_dir=DATASET_DIR,
        allow_patterns=allow_patterns,
    )

In [None]:
PRIVATE_PATH = DATASET_PATH / "private"
MOCK_PATH = DATASET_PATH / "mock"
README_PATH = MOCK_PATH / "README.md"

assert PRIVATE_PATH.exists()
assert MOCK_PATH.exists()
assert README_PATH.exists()
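
The bare assertions above fail without saying which path is missing. As a minimal sketch, the same layout check (a `private` folder, a `mock` folder, and `mock/README.md`) can be written as a helper that reports the missing entries by name. The function `missing_dataset_parts` is hypothetical and only illustrates the check.

```python
from pathlib import Path
from typing import List


def missing_dataset_parts(dataset_path: Path) -> List[str]:
    """List the expected entries that are absent under dataset_path.

    Expected layout, taken from the cells above:
    private/, mock/, and mock/README.md
    """
    expected = {
        "private": (dataset_path / "private").is_dir(),
        "mock": (dataset_path / "mock").is_dir(),
        "mock/README.md": (dataset_path / "mock" / "README.md").is_file(),
    }
    # Keep only the names whose check failed, in the order listed above
    return [name for name, present in expected.items() if not present]
```

Calling `missing_dataset_parts(DATASET_PATH)` returns an empty list when the layout is complete, and otherwise names the entries to fix before creating the dataset.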

DO2 also creates a Syft dataset: the mock part is uploaded to the datasite and is public to the SyftBox network, while the private part stays local and is never shared.

<img src="../images/do2CreatesSyftADataset.png" width="45%" alt="do2 creates a syft dataset">

In [None]:
dataset = do2.dataset.create(
    name=CORPUS_NAME,
    path=PRIVATE_PATH,
    mock_path=MOCK_PATH,
    description_path=README_PATH,
)
dataset.describe()

In [None]:
# Optional: Clean up old jobs
do2.job.delete_all()

DO2 now waits for jobs from data scientists.

<img src="../images/do2WaitsForJobs.png" width="20%" alt="do waiting for jobs">
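
Waiting can be done by re-running the job listing cell by hand, or with a small polling loop. The sketch below is one possible way to do it and is not part of `syft_rds`: `wait_for_jobs` is a hypothetical helper that takes any zero-argument callable, and it assumes the value returned by that callable supports `len()`.

```python
import time
from typing import Callable, Optional, Sequence


def wait_for_jobs(
    fetch: Callable[[], Sequence],
    poll_seconds: float = 5.0,
    timeout_seconds: Optional[float] = None,
) -> Sequence:
    """Call fetch() repeatedly until it returns a non-empty sequence.

    If timeout_seconds elapses first, the last (empty) result is returned.
    """
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while True:
        jobs = fetch()
        if len(jobs) > 0:
            return jobs
        if deadline is not None and time.monotonic() >= deadline:
            return jobs
        time.sleep(poll_seconds)


# In this notebook it could be called as follows (assumption: the result
# of get_all supports len()):
# jobs = wait_for_jobs(lambda: do2.job.get_all(status="pending_code_review"))
```

Passing the fetch call as a callable keeps the loop independent of the session object, so it can be tried out with a stub before pointing it at a real datasite.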

## 2. Review and Run Jobs

After the DS submits a job, DO2 will also see one pending job from the DS.

<img src="../images/do2ReviewsJob.png" width="40%" alt="do2 reviews a job">

In [None]:
jobs = do2.job.get_all(status="pending_code_review")
jobs

In [None]:
# Inspect the first pending job (assumes at least one job has been submitted)
job = jobs[0]
job

In [None]:
# same as job.code.describe()
job.show_user_code()

By running `do2.run_private(job)`, DO2 runs the `syft_flwr` client code on the private dataset, retrieves the relevant documents, and sends them to the DS.

In [None]:
res_job = do2.run_private(job)

<video width="90%" controls>
  <source src="../images/fedrag-rds.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>