# Data Scientist (DS)

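Outline:

1. DS logs into DO1's and DO2's datasites as a guest
2. DS explores the datasets
3. DS prepares `syft_flwr` code
4. DS bootstraps the `syft_flwr` project
5. DS runs `flwr` and `syft_flwr` simulations (optional)
6. DS submits jobs to the DOs' datasites
7. DS starts the FL server code
8. DS observes the results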



## 1. DS logs into DO1's and DO2's datasites as a guest

<img src="../images/dsLogsInAsGuests.png" width="70%" alt="DS logs into DOs' datasites as guests">

In [None]:
import os
from pathlib import Path

from syft_rds.orchestra import setup_rds_server

DS = "ds@openmined.org"
DO1 = "do1@openmined.org"
DO2 = "do2@openmined.org"

ds_stack = setup_rds_server(email=DS, root_dir=Path("."), key="local_syftbox_network")
do1_guest = ds_stack.init_session(host=DO1)
do2_guest = ds_stack.init_session(host=DO2)

In [None]:
do1_guest.is_admin

In [None]:
do2_guest.is_admin

Set some constants and environment variables.

In [None]:
SYFTBOX_DATASET_NAME = "pima-indians-diabetes-database"

os.environ["SYFTBOX_CLIENT_CONFIG_PATH"] = str(ds_stack.client.config_path)
os.environ["LOGURU_LEVEL"] = "DEBUG"
os.environ["SYFT_FLWR_MSG_TIMEOUT"] = "60"

do_clients = [do1_guest, do2_guest]
do_emails = [DO1, DO2]

## 2. DS explores the datasets

The DS can access the DOs' mock data, but not their private data.

<img src="../images/dsExploresMockData.png" width="80%" alt="DS explores datasets">

In [None]:
# Fetch each DO's dataset by name and collect the paths to its mock data
mock_paths = []
for client in do_clients:
    dataset = client.dataset.get(name=SYFTBOX_DATASET_NAME)
    mock_paths.append(dataset.get_mock_path())
    print(f"Client {client.host}'s dataset: \n{dataset}\n")

**Note: if you call `dataset.get_private_path()` in the local test version, you can still see the private datasets, because in a local test the DOs' private datasets are also located on your machine.**

In [None]:
dataset.get_private_path()

## 3. DS prepares `syft_flwr` code

A `syft_flwr` project requires minimal changes to a Flower project:

1. A `syft_flwr` project has the same structure as a [Flower project](https://flower.ai/docs/framework/tutorial-quickstart-pytorch.html):
```
<your-project-name>
├── <your-project-name>
│   ├── __init__.py
│   ├── client_app.py   # Defines your ClientApp
│   ├── server_app.py   # Defines your ServerApp
│   └── task.py         # Defines your model, training and data loading
├── pyproject.toml      # Project metadata like dependencies and configs
└── README.md
``` 

Concretely, for diabetes prediction, the project code lives under the path `../fl-diabetes-prediction`.

<img src="../images/dsPreparesSyftFlwrProject.png" width="30%" alt="DS prepares the syft_flwr project">


In [None]:
SYFT_FLWR_PROJECT_PATH = Path("../fl-diabetes-prediction")
assert SYFT_FLWR_PROJECT_PATH.exists()
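
Beyond checking that the path exists, it can help to confirm that the project follows the layout shown above. The following is a minimal sketch using only the standard library; `missing_project_files` is a hypothetical helper, not part of `syft_flwr` or Flower, and it only checks the files named in the layout (`README.md` is treated as optional).

```python
from pathlib import Path

# Files named in the layout above; the package directory name varies per project.
PACKAGE_FILES = ("__init__.py", "client_app.py", "server_app.py", "task.py")


def missing_project_files(project_dir: Path, package_name: str) -> list[str]:
    """Return the expected files (as relative paths) that are absent from project_dir."""
    expected = [Path("pyproject.toml")]
    expected += [Path(package_name) / name for name in PACKAGE_FILES]
    return [p.as_posix() for p in expected if not (project_dir / p).is_file()]
```

In this notebook, the call would look like `missing_project_files(SYFT_FLWR_PROJECT_PATH, "fl_diabetes_prediction")`; an empty list means every expected file was found.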

2. Compared to a Flower project, the DS only needs to slightly change how `client_app.py`, which runs on the data owners' datasites, loads the data:

<img src="../images/client_fn.png" width="90%" alt="client_fn comparison">

Here, `load_syftbox_dataset` (used by `syft_flwr`) and `load_flwr_data` (used by pure Flower) differ slightly, as shown below:

<img src="../images/load_syftbox_flwr_dataset.png" width="90%" alt="load datasets comparison">


## 4. DS bootstraps the `syft_flwr` project

TODO: What does it mean to "bootstrap"?  
- What does it do?
- Why do we do it? (What is the role of `main.py`, and what does the info added to the toml file do?)

TODO: Image of Flower code before and after bootstrap 

TODO: Explain the differences between Flower and `syft_flwr` projects (what code needs to be modified) via an image (don't open the code files) - show git diff

In [None]:
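import syft_flwr

# Remove any main.py left over from a previous bootstrap before bootstrapping again
(SYFT_FLWR_PROJECT_PATH / "main.py").unlink(missing_ok=True)

try:
    syft_flwr.bootstrap(SYFT_FLWR_PROJECT_PATH, aggregator=DS, datasites=do_emails)
    print("Bootstrapped project successfully ✅")
except Exception as e:
    print(f"Bootstrap failed: {e}")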
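
Until the TODOs above are filled in, one way to see what bootstrapping changes is to snapshot the project directory before and after the call and compare the two. The following is a minimal sketch using only the standard library; `snapshot` and `diff_snapshots` are hypothetical helpers, not part of `syft_flwr`.

```python
import hashlib
from pathlib import Path


def snapshot(project_dir: Path) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        path.relative_to(project_dir).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(project_dir.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """List the files added, removed, or modified between two snapshots."""
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(k for k in before.keys() & after.keys() if before[k] != after[k]),
    }
```

To use it, take `before = snapshot(SYFT_FLWR_PROJECT_PATH)` ahead of the bootstrap call and then print `diff_snapshots(before, snapshot(SYFT_FLWR_PROJECT_PATH))`. Going by the TODO above, `main.py` and the toml file are the likely candidates to show up, but the diff is what confirms it.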

## 5. DS runs `flwr` and `syft_flwr` simulations (optional)

TODO: explain that the simulations run against mock data, since the DS only has access to the DOs' mock data locally, and that they let the DS test that the code runs before submitting

In [None]:
RUN_SIMULATION = True

In [None]:
if RUN_SIMULATION:
    !flwr run {SYFT_FLWR_PROJECT_PATH}

In [None]:
# clean up
!rm -rf {SYFT_FLWR_PROJECT_PATH / "fl_diabetes_prediction" / "__pycache__"}  # clean before submitting
!rm -rf weights/  # remove weights left over from the simulation

In [None]:
mock_paths

In [None]:
if RUN_SIMULATION:
    print(f"running syft_flwr simulation with mock paths: {mock_paths}")
    syft_flwr.run(SYFT_FLWR_PROJECT_PATH, mock_paths)

## 6. DS submits jobs to the DOs' datasites

<img src="./images/dsSendsJobs.png" width="80%" alt="DS Submits Jobs">

In [None]:
# clean up before submitting jobs (TODO: move this to the backend)
!rm -rf {SYFT_FLWR_PROJECT_PATH / "fl_diabetes_prediction" / "__pycache__"}
!rm -rf {SYFT_FLWR_PROJECT_PATH / "simulation_logs"}
!rm -rf weights/
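
The same cleanup can be written in plain Python, which avoids shell commands and also works where `rm` is unavailable. The following is a minimal sketch; `clean_project` is a hypothetical helper, and unlike the commands above it removes every `__pycache__` directory under the project, not only the one in `fl_diabetes_prediction`.

```python
import shutil
from pathlib import Path


def clean_project(project_dir: Path, extra_dirs: tuple[str, ...] = ("simulation_logs",)) -> list[str]:
    """Delete __pycache__ directories and the named extra directories; return what was removed."""
    # Collect targets first so that nothing is deleted while the tree is being walked.
    targets = [p for p in project_dir.rglob("__pycache__") if p.is_dir()]
    targets += [project_dir / name for name in extra_dirs if (project_dir / name).is_dir()]
    removed = []
    for target in targets:
        shutil.rmtree(target, ignore_errors=True)
        removed.append(target.relative_to(project_dir).as_posix())
    return sorted(removed)
```

Since `weights/` sits in the working directory and not inside the project, it would still need its own call, for example `shutil.rmtree("weights", ignore_errors=True)`.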

In [None]:
for client in do_clients:
    print(f"sending job to {client.host}")
    job = client.jobs.submit(
        name="Syft Flower Experiment",
        user_code_path=SYFT_FLWR_PROJECT_PATH,
        dataset_name=SYFTBOX_DATASET_NAME,
        entrypoint="main.py",
        description="Syft Flower Federated Learning Experiment",
    )
    print(job)
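
The loop above reuses the `job` variable, so only the last submission stays accessible afterwards. A small sketch that keeps one job per datasite is shown below. It relies only on the `client.host` and `client.jobs.submit(...)` usage that appears in this notebook; `submit_to_all` and `make_fake_client` are hypothetical names, and the fake clients exist only so the example runs without a SyftBox network.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def submit_to_all(clients: Iterable[Any], **submit_kwargs: Any) -> dict[str, Any]:
    """Call client.jobs.submit(**submit_kwargs) on every client, keyed by client.host."""
    return {client.host: client.jobs.submit(**submit_kwargs) for client in clients}


def make_fake_client(host: str) -> SimpleNamespace:
    """Hypothetical stand-in for a guest session; it only echoes what was submitted."""
    jobs = SimpleNamespace(submit=lambda **kwargs: {"host": host, **kwargs})
    return SimpleNamespace(host=host, jobs=jobs)


fake_clients = [make_fake_client("do1@openmined.org"), make_fake_client("do2@openmined.org")]
jobs_by_host = submit_to_all(fake_clients, name="Syft Flower Experiment", entrypoint="main.py")
print(sorted(jobs_by_host))  # ['do1@openmined.org', 'do2@openmined.org']
```

With the real sessions, `submit_to_all(do_clients, ...)` would take the same keyword arguments as the `client.jobs.submit` call above.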

<img src="./images/dsDoneSubmittingJobs.png" width="40%" alt="DS waits for jobs to be approved">

## 7. DS starts the FL server code

TODO: add a GIF demonstrating the long-running jobs that exchange models between the clients and the server

In [None]:
!uv run {str(SYFT_FLWR_PROJECT_PATH / "main.py")} --active

In [None]:
# TODO: Change syntax to `ds_client.run(job)`

By running the FL server code, the DS aggregates the models trained on the DOs' private local data into an improved global model.

<img src="./images/dsAggregateModels.png" width="30%" alt="DS Aggregates Models">
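
To give an intuition for what "aggregating" can mean, here is a minimal sketch of a FedAvg-style weighted average written in plain Python. This is an assumption for illustration: the notebook does not state which aggregation strategy the project's `server_app.py` uses, and real model parameters are arrays rather than short lists of floats.

```python
def weighted_average(client_weights: list[list[float]], num_examples: list[int]) -> list[float]:
    """Average parameter vectors, weighting each client by its number of training examples."""
    if not client_weights or len(client_weights) != len(num_examples):
        raise ValueError("need at least one client and one example count per client")
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("total number of examples must be positive")
    size = len(client_weights[0])
    if any(len(w) != size for w in client_weights):
        raise ValueError("all clients must send vectors of the same length")
    return [
        sum(w[i] * n for w, n in zip(client_weights, num_examples)) / total
        for i in range(size)
    ]


# Two clients: the first trained on 100 examples, the second on 300.
global_weights = weighted_average([[1.0, 2.0], [3.0, 6.0]], [100, 300])
print(global_weights)  # [2.5, 5.0]
```

The client with more training examples pulls the average toward its own parameters, which is why the result is closer to the second vector.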

## 8. DS observes the results

Now the DS can monitor the aggregated models, trained on the DOs' private datasets, in the `weights` folder.
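
One simple way to monitor the folder is to look up the most recently written file. The following is a minimal sketch; `latest_file` is a hypothetical helper, and it assumes nothing about the file names or format of the saved weights, which this notebook does not specify.

```python
from pathlib import Path
from typing import Optional


def latest_file(folder: Path) -> Optional[Path]:
    """Return the most recently modified file in folder, or None if there is none."""
    if not folder.is_dir():
        return None
    files = [p for p in folder.iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)
```

Calling `latest_file(Path("weights"))` between rounds returns the newest file, or `None` while the folder is still missing or empty.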