# Data Scientist (DS)

Outline of what the DS will do

1. DS logs into DO1 and DO2's datasites as guest and explores the mock datasets  
2. DS prepares `syft_flwr` code and the `MirageQA` dataset
3. DS runs `flwr` and `syft_flwr` simulations (optional)   
4. DS submits jobs to the DOs' datasites  
5. DS starts the FL server code  
6. DS observes the results

## 1. DS logs into DO1 and DO2's datasites as guest and explores the mock datasets

In [None]:
import os
from pathlib import Path

from loguru import logger
from syft_rds.orchestra import setup_rds_server

DS = "ds@openmined.org"
DO1 = "do1@openmined.org"
DO2 = "do2@openmined.org"

ds_stack = setup_rds_server(email=DS, root_dir=Path("."), key="local_syftbox_network")
do1_guest = ds_stack.init_session(host=DO1)
do2_guest = ds_stack.init_session(host=DO2)

# Set some constants and variables
os.environ["SYFTBOX_CLIENT_CONFIG_PATH"] = str(ds_stack.client.config_path)
os.environ["LOGURU_LEVEL"] = "DEBUG"
os.environ["SYFT_FLWR_MSG_TIMEOUT"] = "60"

do_clients = [do1_guest, do2_guest]
do_emails = [DO1, DO2]

In [None]:
ds = ds_stack.client

In [None]:
do1_guest.is_admin

In [None]:
do2_guest.is_admin

The DS can access the DOs' mock data but not their private data.

<img src="../images/dsExploresDOsDatasets.png" width="50%" alt="DS explores datasets">

In [None]:
mock_paths = []
for client in do_clients:
    dataset = client.dataset.get_all()[0]
    logger.info(f"Client {client.host}'s dataset name: {dataset.name}")
    mock_paths.append(dataset.get_mock_path())

Check that the DS can't access the private datasets

In [None]:
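for client in do_clients:
    dataset = client.dataset.get_all()[0]
    try:
        dataset.get_private_path()
    except Exception as e:
        # Expected: a guest session cannot read the private data
        logger.error(e)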

## 2. DS prepares `syft_flwr` code and the `MirageQA` dataset

A `syft_flwr` project requires minimal changes to a Flower project:

1. The `syft_flwr` fedrag project has the same structure as a [Flower fedrag project](https://flower.ai/docs/examples/fedrag.html/):
```
fedrag_v1
├── fedrag/
│   ├── __init__.py
│   ├── client_app.py   # DO runs this
│   ├── llm_querier.py  # DS runs this
│   ├── mirage_qa.py    # DS runs this
│   ├── retriever.py    # DO runs this
│   ├── retriever.yaml
│   ├── server_app.py   # DS runs this
│   └── task.py         # Common code
├── pyproject.toml
└── README.md
``` 

In [None]:
SYFT_FLWR_PROJECT_PATH = Path("../fedrag_v1")
assert SYFT_FLWR_PROJECT_PATH.exists()

#### DS bootstraps the `syft_flwr` project

After preparing the `syft_flwr` project code, the DS runs `syft_flwr.bootstrap`, which updates the project metadata in `pyproject.toml` with the unique name of the FL run, the data owners (datasites), and the aggregator:
```
[tool.syft_flwr]
app_name = "ds@openmined.org_fed-analytics-diabetes_1753435643"
datasites = [
    "do1@openmined.org",
    "do2@openmined.org",
]
aggregator = "ds@openmined.org"
```
It also creates `main.py`, which specifies who runs what based on the new metadata.

In [None]:
import syft_flwr

try:
    !rm -rf {SYFT_FLWR_PROJECT_PATH / "main.py"}
    syft_flwr.bootstrap(SYFT_FLWR_PROJECT_PATH, aggregator=DS, datasites=do_emails)
    logger.info("Bootstrapped project successfully ✅")
except Exception as e:
    logger.error(e)

Finally, the `fedrag` job's code looks like this:

<img src="../images/dsFedRagJob.png" width="40%" alt="DS fedrag job code">

#### DS prepares the MirageQA dataset

In [None]:
import sys
from pathlib import Path
from pprint import pprint

SYFT_FLWR_PROJECT_PATH = Path("../fedrag_v1")
sys.path.append(str(SYFT_FLWR_PROJECT_PATH))

from fedrag.mirage_qa import MirageQA

MIRAGE_PATH = Path("../datasets/") / "mirage.json"
if not MIRAGE_PATH.exists():
    MirageQA.download(MIRAGE_PATH)

with open(MIRAGE_PATH, "r") as f:
    lines = f.readlines()[:12]
    for line in lines:
        pprint(line.strip())

## 3. (Optional) DS runs `flwr` and `syft_flwr` simulations

In [None]:
RUN_SIMULATION = False

First, we run a normal Flower simulation with `flwr run`:

In [None]:
if RUN_SIMULATION:
    !flwr run {SYFT_FLWR_PROJECT_PATH}

In [None]:
# clean up
!rm -rf {SYFT_FLWR_PROJECT_PATH / "fedrag" / "**/__pycache__"}  # clean before submitting

Now the DS can run a `syft_flwr` simulation on the DOs' mock data. The simulation launches two threads that run the DOs' code and one thread that runs the DS's code.

<img src="../images/dsRunsSyftFLWRSimulation.png" width="50%" alt="ds runs simulation with syft_flwr.run">
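The thread layout described above can be mimicked with a toy example built only on the standard library. It is a conceptual sketch, not how `syft_flwr.run` is implemented: the function names, the queue-based messaging, and the substring "retrieval" are all assumptions made for illustration.

```python
import queue
import threading


def client_worker(name, mock_docs, inbox, responses):
    """Simulated DO: wait for one query and answer it from mock documents."""
    query = inbox.get(timeout=5)
    hits = [doc for doc in mock_docs if query in doc]
    responses.put((name, hits))


def run_toy_simulation(query, mock_data):
    """Run one client thread per DO plus one server thread for the DS."""
    responses = queue.Queue()
    inboxes = {name: queue.Queue() for name in mock_data}
    results = {}

    def server_worker():
        # Simulated DS: broadcast the query, then collect one reply per DO.
        for inbox in inboxes.values():
            inbox.put(query)
        for _ in mock_data:
            name, hits = responses.get(timeout=5)
            results[name] = hits

    threads = [
        threading.Thread(
            target=client_worker, args=(name, docs, inboxes[name], responses)
        )
        for name, docs in mock_data.items()
    ]
    threads.append(threading.Thread(target=server_worker))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


toy_mock_data = {
    "do1": ["aspirin dosage in adults", "insulin storage"],
    "do2": ["aspirin and bleeding risk"],
}
toy_results = run_toy_simulation("aspirin", toy_mock_data)
print(toy_results)
```

The real simulation runs the project's actual client and server apps against the mock paths collected earlier; the toy version only shows the one-server, two-clients threading pattern.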

In [None]:
if RUN_SIMULATION:
    logger.info(f"running syft_flwr simulation with mock paths: {mock_paths}")
    syft_flwr.run(SYFT_FLWR_PROJECT_PATH, mock_paths)

Check the directory shown after `📝 Log directory` in the output for the logs of the simulated clients. By default, it is `fedrag_v1/simulation_logs`.

<img src="../images/dsSimulationLogs.png" width="20%" alt="DS simulation logs folder">

## 4. DS submits jobs to the DOs' datasites

The DS submits the code in `fedrag_v1` to the DOs' datasites.

<img src="../images/dsSubmitsJobs.png" width="50%" alt="DS submits jobs">

In [None]:
# clean up before submitting jobs
!rm -rf {SYFT_FLWR_PROJECT_PATH / "fedrag" / "__pycache__"}
!rm -rf {SYFT_FLWR_PROJECT_PATH / "simulation_logs"}

In [None]:
logger.info(f"sending job to {do1_guest.host}")
job = do1_guest.job.submit(
    name="fedrag",
    user_code_path=SYFT_FLWR_PROJECT_PATH,
    dataset_name="statpearls",
    entrypoint="main.py",
)
logger.success(job)

In [None]:
logger.info(f"sending job to {do2_guest.host}")
job = do2_guest.job.submit(
    name="fedrag",
    user_code_path=SYFT_FLWR_PROJECT_PATH,
    dataset_name="textbooks",
    entrypoint="main.py",
)
logger.success(job)

<img src="../images/dsWaitsForJobsToBeApproved.png" width="20%" alt="DS waits for jobs to be approved">

## 5. DS starts the FL server code

By running the FL server code, the DS merges the documents received from the DOs and then passes them as context for the query to the LLM.
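The merge-then-prompt step can be pictured with the sketch below. It is an illustration under stated assumptions, not the code in `server_app.py`: `RetrievedDoc`, `merge_documents`, and `build_prompt` are hypothetical names, and the sketch assumes each document carries a relevance score where higher means more relevant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedDoc:
    source: str   # which DO returned the document (hypothetical field)
    text: str
    score: float  # assumption: higher means more relevant to the query


def merge_documents(per_client_docs, top_k=3):
    """Pool the documents from all DOs and keep the top_k highest-scoring ones."""
    pooled = [doc for docs in per_client_docs for doc in docs]
    return sorted(pooled, key=lambda doc: doc.score, reverse=True)[:top_k]


def build_prompt(question, docs):
    """Place the merged documents before the question as numbered context."""
    context = "\n".join(f"[{i}] {doc.text}" for i, doc in enumerate(docs, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


do1_docs = [
    RetrievedDoc("do1", "Aspirin inhibits platelet aggregation.", 0.9),
    RetrievedDoc("do1", "Insulin should be stored in a refrigerator.", 0.2),
]
do2_docs = [
    RetrievedDoc("do2", "Aspirin can increase bleeding risk.", 0.5),
]

merged = merge_documents([do1_docs, do2_docs], top_k=2)
prompt = build_prompt("What does aspirin do to platelets?", merged)
print(prompt)
```

Other merge strategies, such as rank-based fusion across clients, would fit the same interface; the score-sorted version is used here only because it is the shortest to show.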

In [None]:
!uv run {str(SYFT_FLWR_PROJECT_PATH / "main.py")} --active

<video width="90%" controls>
  <source src="../images/fedrag-rds.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

In [None]:
# Clean up
!rm -rf .server .syftbox local_syftbox_network

## 6. DS observes the results