# Data Scientist (DS)

Outline of what the DS will do

1. DS logs into DO1 and DO2's datasites as guest  
2. DS explores the datasets  
3. DS prepares syft_flwr code  
4. DS bootstraps the syft_flwr project  
5. DS runs flwr and syft_flwr simulations (optional)   
6. DS submits jobs to the DOs' datasites  
7. DS starts the FL server code  
8. DS observes the results

## 1. DS logs into DO1 and DO2's datasites as guest


Each DO also has their own datasite where they can log in as an admin, but that is not needed in this workflow

<img src="../images/dsLogsInAsGuests.png" width="55%" alt="DS logs into DOs' datasites as guests">

In [None]:
import os
from pathlib import Path

from loguru import logger
from syft_rds.orchestra import setup_rds_server

DS = "ds@openmined.org"
DO1 = "do1@openmined.org"
DO2 = "do2@openmined.org"

ds_stack = setup_rds_server(email=DS, root_dir=Path("."), key="local_syftbox_network")
do1_guest = ds_stack.init_session(host=DO1)
do2_guest = ds_stack.init_session(host=DO2)

In [None]:
ds = ds_stack.client

In [None]:
do1_guest.is_admin

In [None]:
do2_guest.is_admin

Set some constants and variables

In [None]:
SYFTBOX_DATASET_NAME = "pima-indians-diabetes-database"

os.environ["SYFTBOX_CLIENT_CONFIG_PATH"] = str(ds_stack.client.config_path)
os.environ["LOGURU_LEVEL"] = "DEBUG"
os.environ["SYFT_FLWR_MSG_TIMEOUT"] = "60"

do_clients = [do1_guest, do2_guest]
do_emails = [DO1, DO2]

## 2. DS explores the datasets

The DS can access the DOs' mock data, but not their private data

<img src="../images/dsExploresDOsDatasets.png" width="71%" alt="DS explores datasets">

In [None]:
SYFTBOX_DATASET_NAME

mock_paths = []
for client in do_clients:
    dataset = client.dataset.get(name=SYFTBOX_DATASET_NAME)
    mock_paths.append(dataset.get_mock_path())
    print(
        f"Client {client.host}'s dataset name: \n{dataset.name}\n. Mock path: \n{dataset.get_mock_path()}"
    )
    print()
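
Once the mock paths are known, a quick look at the files helps when writing the client code. The following is a minimal sketch, assuming each mock path is either a CSV file or a directory containing CSV files; the helper name `preview_mock_dataset` is hypothetical and not part of `syft_rds`.

```python
import csv
from pathlib import Path


def preview_mock_dataset(mock_path, n_rows=3):
    """Return {file name: (header, first n_rows rows)} for CSV files at mock_path.

    Hypothetical helper: assumes the mock data is stored as CSV.
    """
    mock_path = Path(mock_path)
    # Accept a single file or a directory tree of CSV files.
    files = [mock_path] if mock_path.is_file() else sorted(mock_path.rglob("*.csv"))
    previews = {}
    for csv_file in files:
        with csv_file.open(newline="") as fh:
            reader = csv.reader(fh)
            header = next(reader, [])
            # Read at most n_rows data rows without loading the whole file.
            rows = [row for _, row in zip(range(n_rows), reader)]
        previews[csv_file.name] = (header, rows)
    return previews
```

With the variables from the cell above, `preview_mock_dataset(mock_paths[0])` would show the column names and the first few mock rows of DO1's dataset.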

Check that the DS can't access the private datasets, even in this local test setup

In [None]:
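# Try every DO's dataset: each attempt should fail and be logged as an error
for client in do_clients:
    dataset = client.dataset.get(name=SYFTBOX_DATASET_NAME)
    try:
        dataset.get_private_path()
    except Exception as e:
        logger.error(e)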

## 3. DS prepares `syft_flwr` code

A `syft_flwr` project requires minimal changes to a Flower project:

1. A `syft_flwr` project has the same structure as an [example federated analytics Flower project](https://flower.ai/blog/2023-01-24-federated-analytics-pandas/):
```
fed-analytics-diabetes
├── fed-analytics-diabetes
│   ├── __init__.py
│   ├── client_app.py   # Defines your ClientApp
│   ├── server_app.py   # Defines your ServerApp
├── pyproject.toml      # Project metadata like dependencies and configs
└── README.md
``` 


Concretely, for diabetes prediction, we have the project code under the path `../fed-analytics-diabetes`

<img src="../images/dsPreparesSyftFlwrProject.png" width="20%" alt="DS prepares code">


where the DS runs the `ServerApp` (defined in `server_app.py`) and the DOs run the `ClientApp` (defined in `client_app.py`)

<img src="../images/dsSyftFlwrProjectArch.png" width="69%" alt="syft_flwr project architecture">


In [None]:
SYFT_FLWR_PROJECT_PATH = Path("../fed-analytics-diabetes")
assert SYFT_FLWR_PROJECT_PATH.exists()

2. Compared to a Flower project, the DS only needs to make a small change to how `client_app.py`, which runs on the data owners' datasites, loads the data:

<img src="../images/dsCodeLoadsClientData.png" width="70%" alt="load datasets comparison">
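
As a rough illustration of the idea, the client code can decide at runtime where its data comes from. This is a minimal sketch, assuming the datasite exposes the dataset location through an environment variable; the variable name `DATA_DIR`, the default path, and the helper `resolve_data_dir` are all hypothetical and are not taken from the `syft_flwr` API.

```python
import os
from pathlib import Path


def resolve_data_dir(default_dir="./data", env_var="DATA_DIR"):
    """Pick the directory a ClientApp should load its data from.

    Assumption: when the code runs on a DO's datasite, the dataset location
    is provided via an environment variable (hypothetical name: DATA_DIR).
    Otherwise fall back to a local directory, as in a plain Flower project.
    """
    value = os.environ.get(env_var)
    return Path(value) if value else Path(default_dir)
```

The rest of the `ClientApp` can then stay unchanged, since it only sees a path.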


## 4. DS bootstraps the `syft_flwr` project

After the DS has prepared the `syft_flwr` project code, they can call `syft_flwr.bootstrap`. It updates the project's metadata in `pyproject.toml` to specify the unique name of the FL run, the data owners, and the aggregator
```
[tool.syft_flwr]
app_name = "ds@openmined.org_fed-analytics-diabetes_1753435643"
datasites = [
    "do1@openmined.org",
    "do2@openmined.org",
]
aggregator = "ds@openmined.org"
```
and creates `main.py`, which specifies who runs what based on the new metadata
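
To check the bootstrapped metadata programmatically, the `[tool.syft_flwr]` table can be read with a TOML parser. The sketch below parses the snippet shown above; `read_syft_flwr_config` and `role_of` are hypothetical helpers, and `role_of` only illustrates the idea of deciding a role from the metadata, not the actual logic of the generated `main.py`.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:  # older Python: assumes the tomli backport is installed
    import tomli as tomllib

PYPROJECT_SNIPPET = """
[tool.syft_flwr]
app_name = "ds@openmined.org_fed-analytics-diabetes_1753435643"
datasites = [
    "do1@openmined.org",
    "do2@openmined.org",
]
aggregator = "ds@openmined.org"
"""


def read_syft_flwr_config(toml_text):
    """Return the [tool.syft_flwr] table, checking the three expected keys."""
    config = tomllib.loads(toml_text)["tool"]["syft_flwr"]
    missing = {"app_name", "datasites", "aggregator"} - config.keys()
    if missing:
        raise KeyError(f"missing keys in [tool.syft_flwr]: {sorted(missing)}")
    return config


def role_of(email, config):
    """Illustrative only: map an email to its role in the FL run."""
    if email == config["aggregator"]:
        return "aggregator"
    if email in config["datasites"]:
        return "datasite"
    return "unknown"


config = read_syft_flwr_config(PYPROJECT_SNIPPET)
print(role_of("do1@openmined.org", config))
```

For a real project, the same function can be applied to the text of `pyproject.toml` in the project directory.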

In [None]:
import syft_flwr

try:
    !rm -rf {SYFT_FLWR_PROJECT_PATH / "main.py"}
    syft_flwr.bootstrap(SYFT_FLWR_PROJECT_PATH, aggregator=DS, datasites=do_emails)
    print("Bootstrapped project successfully ✅")
except Exception as e:
    print(e)

## 5. DS runs `flwr` and `syft_flwr` simulations (optional)

In [None]:
RUN_SIMULATION = True

First, we run a normal Flower simulation with `flwr run`

In [None]:
if RUN_SIMULATION:
    !flwr run {SYFT_FLWR_PROJECT_PATH}

In [None]:
# clean up
!rm -rf {SYFT_FLWR_PROJECT_PATH / "fl_diabetes_prediction" / "__pycache__"}  # clean before submitting
!rm -rf {"../weights/"}

Now the DS can run a `syft_flwr` simulation on the DOs' mock data by launching two threads that run the DOs' code and one thread that runs the DS's code

<img src="../images/dsRunsSyftFLWRSimulation.png" width="53%" alt="ds runs simulation with syft_flwr.run">
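
The threading layout can be pictured with a toy example: one thread per simulated DO plus one thread for the DS, exchanging messages through a queue. This is only a sketch of the pattern under simplifying assumptions, not how `syft_flwr.run` is implemented; all function names are hypothetical and the client "work" is reduced to counting rows.

```python
import queue
import threading


def run_client(name, rows, to_server):
    # Stand-in for a DO's client code: report how many mock rows it holds.
    to_server.put((name, len(rows)))


def run_server(n_clients, from_clients, results):
    # Stand-in for the DS's server code: collect one message per client.
    for _ in range(n_clients):
        name, count = from_clients.get(timeout=5)
        results[name] = count


def simulate(mock_rows_by_client):
    channel = queue.Queue()
    results = {}
    server = threading.Thread(
        target=run_server, args=(len(mock_rows_by_client), channel, results)
    )
    clients = [
        threading.Thread(target=run_client, args=(name, rows, channel))
        for name, rows in mock_rows_by_client.items()
    ]
    server.start()
    for thread in clients:
        thread.start()
    for thread in clients + [server]:
        thread.join()
    return results
```

Calling `simulate` with two entries starts three threads in total, matching the two-clients-plus-one-server layout described above.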

In [None]:
if RUN_SIMULATION:
    print(f"running syft_flwr simulation with mock paths: {mock_paths}")
    syft_flwr.run(SYFT_FLWR_PROJECT_PATH, mock_paths)

To see the logs of the simulated clients, open the directory printed after `📝 Log directory`; by default this is `fed-analytics-diabetes/simulation_logs`

<img src="../images/dsSimulationLogs.png" width="20%" alt="DS simulation logs folder">

We will also see that the aggregated histograms are saved under the `figures/` directory

<img src="../images/dsAggregatedFiguresFolder.png" width="20%" alt="DS aggregated figures folder">

However, because the simulation runs on mock data, the local frequencies are very low (at most 7 for age)

<img src="../images/dsLocalFrequencyMock.png" width="50%" alt="local frequency mock data">

## 6. DS submits jobs to the DOs' datasites

The DS submits the project code in `fed-analytics-diabetes` to the DOs' datasites as a job named `fl-diabetes-prediction`

<img src="../images/dsSubmitsJobs.png" width="50%" alt="DS submits jobs">

In [None]:
# clean up before submitting jobs
!rm -rf {SYFT_FLWR_PROJECT_PATH / "fl_diabetes_prediction" / "__pycache__"}
!rm -rf {SYFT_FLWR_PROJECT_PATH / "simulation_logs"}
!rm -rf {"./figures/"}

In [None]:
for client in do_clients:
    print(f"sending job to {client.host}")
    job = client.jobs.submit(
        name="fl-diabetes-prediction",
        user_code_path=SYFT_FLWR_PROJECT_PATH,
        dataset_name=SYFTBOX_DATASET_NAME,
        entrypoint="main.py",
    )
    print(job)

<img src="../images/dsWaitsForJobsToBeApproved.png" width="30%" alt="DS waits for jobs to be approved">

## 7. DS starts the FL server code

By running the FL server code, the DS aggregates the results computed on the DOs' private local data into a global result over multiple rounds

<video width="90%" controls>
  <source src="../images/fed-analytics.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
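
For a federated analytics project that produces histograms, the aggregation step can be pictured as summing per-bin counts from each DO. The sketch below is illustrative only: it assumes all DOs use the same bin edges, the function names are hypothetical, and the ages are made-up toy values rather than data from the tutorial.

```python
def local_histogram(values, bin_edges):
    """Count values per bin; the last bin includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for value in values:
        for i in range(len(counts)):
            low, high = bin_edges[i], bin_edges[i + 1]
            is_last = i == len(counts) - 1
            if low <= value < high or (is_last and value == high):
                counts[i] += 1
                break
    return counts


def aggregate_histograms(histograms):
    """Sum the per-bin counts reported by each DO."""
    return [sum(bin_counts) for bin_counts in zip(*histograms)]


# Toy values for illustration only.
edges = [20, 40, 60, 80]
do1_counts = local_histogram([21, 35, 47], edges)
do2_counts = local_histogram([52, 61, 80], edges)
print(aggregate_histograms([do1_counts, do2_counts]))
```

Only the per-bin counts leave each datasite in this picture; the raw values stay with the DO.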

In [None]:
!uv run {str(SYFT_FLWR_PROJECT_PATH / "main.py")} --active

## 8. DS observes the results

Now the DS can view the aggregated figures, computed from the DOs' private datasets, in the `figures` folder

<img src="../images/dsAggregatedFiguresFolder.png" width="20%" alt="DS aggregated figures folder">

The local frequencies from the jobs run on private data are now much higher than those from the simulation on mock data

<img src="../images/dsLocalFrequencyReal.png" width="50%" alt="DS aggregated frequency real">