# Database and DataSets Management with **MEDfl**

### Introduction
One of the main advantages of *MEDfl* over the *Flower* package is that MEDfl uses a database to store every step of the learning process, along with its configurations and results. Specifically, the database is used to:

1. Store network elements of the federated learning architecture (Network, Nodes, Server)
2. Store the initial DataSet and the Federated DataSet
3. Store the FL Pipeline
4. Store the results of the training and testing

The database will assist researchers in analyzing and comparing different results based on various configurations, enabling them to track and select the best configuration.


In this tutorial, we'll demonstrate how to initialize your database and establish its connection to MEDfl. Subsequently, we'll explore the step-by-step process of storing various pieces of information.

We chose [MySQL](https://www.mysql.com/fr/) as the database system for its robust features, reliability, and widespread adoption in the industry. Its strong SQL support and scalability make it well suited to managing the diverse datasets and configurations within MEDfl.

<img src="../Images/logos/mysqllogo.png"  style="width:150px ;height:50px ;"> 

Before beginning, ensure that you have installed MySQL and one of the servers, such as [WAMP](https://www.wampserver.com) or [XAMPP](https://www.apachefriends.org/fr/index.html), and have them running. 

To visualize your database, you can open [phpMyAdmin](https://www.phpmyadmin.net), a web-based tool for managing and browsing your database.

<img src="../Images/logos/wampLogo.png"  style="width:120px ;height:50px ;"> 
<img src="../Images/logos/xampplogo.png"  style="width:160px ;height:50px ;"> 
<img src="../Images/logos/phpmyadmin.png"  style="width:150px ;height:50px ;"> 



In [1]:
# from MEDfl.LearningManager.utils import global_params

import sys
sys.path.append('../..')

import os
os.environ['PYTHONPATH'] = '../..'

1. Create the MEDfl database 
   
   Open phpMyAdmin and create the MEDfl database, which we will use throughout the next tutorials:
    
   ```sql
   CREATE DATABASE MEDfl;
   ```

Imports 

In [2]:
import pandas as pd
from sqlalchemy import create_engine

## MEDfl Imports 
from MEDfl.NetManager.node import Node
from MEDfl.NetManager.network import Network
from MEDfl.NetManager.dataset import DataSet

# Utils
from MEDfl.LearningManager.utils import *

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


### Generating Database Tables

To generate the database tables, you need to run a script.

Before creating the tables, you must specify the CSV file containing the datasets. The script uses this file to define the columns of the datasets. The CSV file is specified in the YAML file by setting the parameter **path_to_master_csv**.

Execute the following command to run the script (inside a notebook cell, prefix it with `!`):

```bash
python ../../scripts/create_db.py
```
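As a rough illustration of how a script could derive dataset columns from a CSV header, here is a minimal sketch using only the standard library. The helper `infer_columns` and its type-mapping rule are hypothetical; they are not taken from `create_db.py`, whose actual logic may differ.

```python
import csv
import io


def infer_columns(csv_text):
    """Map each CSV column to a rough SQL type based on its first data row.

    Hypothetical helper for illustration; not MEDfl's create_db.py logic.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    first_row = next(reader)
    columns = {}
    for name, value in zip(header, first_row):
        try:
            int(value)
            columns[name] = "INT"
        except ValueError:
            try:
                float(value)
                columns[name] = "FLOAT"
            except ValueError:
                columns[name] = "VARCHAR(255)"
    return columns


# A tiny sample using a few of the column names from the tutorial's data
sample = "id,site_region,age,pao2fio2\nstay147985,Midwest,16,0.0\n"
columns = infer_columns(sample)
print(columns)
```

Inferring types from a single row is fragile; a real script would more likely inspect the whole file or rely on an explicit schema.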

In [3]:
# Create tables 
!python ../../scripts/create_db.py



### Some Use Cases of the Database

In [4]:
my_eng = create_engine('mysql+mysqlconnector://root:@localhost:3306/MEDfl')
my_eng = my_eng.connect()
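The connection string above follows the SQLAlchemy URL layout `dialect+driver://user:password@host:port/database`. The sketch below only takes such a URL apart with the standard library, which can help when adapting it to your own setup. The helper `describe_db_url` is a name introduced here for illustration and is not part of MEDfl or SQLAlchemy.

```python
from urllib.parse import urlsplit


def describe_db_url(url):
    """Split a SQLAlchemy-style URL into its parts (illustrative helper)."""
    parts = urlsplit(url)
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver,
        "user": parts.username,
        "password": parts.password,  # empty string when nothing follows ':'
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_db_url("mysql+mysqlconnector://root:@localhost:3306/MEDfl")
print(info)
```

If your MySQL user has a password, it goes between the `:` and the `@` in the URL.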

#### Creating and Adding a Network to the Database

To create a network, we use the `Network` class provided by *MEDfl*. Instantiating this class requires the network's name. Calling the `create_network` method then adds the newly created network to the database.

In [5]:
# Instantiate the Network class with the network's name
Net = Network("Net1")
# Create the network and add it to the database
Net.create_network()



#### List all created networks

In [6]:
df = Net.list_allnetworks()
df

|  | NetId | NetName |
| --- | --- | --- |
| 0 | 1 | Net1 |


#### Creating a MasterDataSet for the Network

Using the `create_master_dataset` method, we create a new table, **MasterDataset**, and upload the data from `path_to_csv` into it. The `path_to_csv` argument is optional; by default it is the file specified by `path_to_master_csv` in `MEDfl.LearningManager.params.yaml`.

The MasterDataSet serves two purposes within the network:

1. **Auto-Creation of Nodes:**
    - When automatically creating nodes, the MasterDataSet plays a pivotal role in dividing the data across various train and test nodes.

2. **Manual Creation of Nodes:**
    - In scenarios involving manual creation of nodes, the MasterDataSet acts as a reference point to verify compatibility between different dataSets and the MasterDataSet.
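The compatibility check mentioned for manually created nodes can be pictured as a comparison of column names. The following is a minimal sketch under that assumption; `read_header` and `check_compatibility` are hypothetical helpers, and MEDfl's own verification may compare more than headers.

```python
import csv
import io


def read_header(csv_text):
    """Return the column names from the first line of a CSV string."""
    return next(csv.reader(io.StringIO(csv_text)))


def check_compatibility(master_header, dataset_header):
    """Report columns missing from, or extra to, a dataset relative to the master."""
    master, dataset = set(master_header), set(dataset_header)
    return {
        "missing": sorted(master - dataset),
        "extra": sorted(dataset - master),
        "compatible": master == dataset,
    }


# Toy headers; only a few column names are borrowed from the tutorial's data
master_csv = "id,site_region,age,event_death\n"
good_csv = "id,site_region,age,event_death\n"
bad_csv = "id,age,length_of_stay\n"

good_report = check_compatibility(read_header(master_csv), read_header(good_csv))
bad_report = check_compatibility(read_header(master_csv), read_header(bad_csv))
print(good_report)
print(bad_report)
```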


In [8]:
# Create a masterDataSet Table
Net.create_master_dataset() ; 

/home/local/USHERBROOKE/saho6810/MEDfl/code/MEDfl/notebooks/eicu_test.csv


#### Create nodes 

In [9]:

# Instantiating the Node class 
node = Node(name = "node_1" , train = 1)
# Create the node 
node.create_node(NetId=1)

# List all nodes 
nodeList = node.list_allnodes()
nodeList

|  | NodeId | NodeName | train | NetId |
| --- | --- | --- | --- | --- |
| 0 | 1 | node_1 | 1 | 1 |


### Uploading DataSet to Nodes

To upload a dataSet to a node directly, there are two options available:

1. **Upload a New DataSet:**
    - Use the `Node.upload_dataset` method. This method enables the direct upload of a CSV file to a node, adding it to the database, and assigning it to the respective node.

2. **Using an Existing DataSet:**
    - Use the `Node.assign_dataset` method. This approach takes a DataSet that already exists in the database and assigns it to the node. It applies when the DataSet was previously added with `DataSet.upload_dataset` without being assigned to any node. It is also useful for reusing a DataSet from a prior experiment on another node in a different experiment.


1. **Upload a new DataSet**
   
   We will upload the dataset `eicu_test_1.csv` to the node we created earlier, **node_1**.

In [10]:
# uploading the data set to the node 1
node.upload_dataset(dataset_name ='Test_Data_set' , path_to_csv='../eicu_test_1.csv') ; 

In [9]:
# Get the node DataSet
node_dataset = node.get_dataset() ; 
node_dataset

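|  | DataSetId | DataSetName | NodeId | id | site_hospital | site_region | age | pao2fio2 | uo | admissiontype | ... | bun | chron_dis | gcs | hr | potassium | sbp | sodium | tempc | wbc | event_death |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Test_Data_set | 1 | stay147985 | site73 | Midwest | 16 | 0.0 | 4 | 6 | ... | 6 | 0 | 5 | 0 | 0 | 5 | 1 | 0 | 0 | 1 |
| 1 | 2 | Test_Data_set | 1 | stay156248 | site73 | Midwest | 7 | 0.0 | 0 | 6 | ... | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 |
| 2 | 3 | Test_Data_set | 1 | stay156308 | site60 | Midwest | 18 | 0.0 | 0 | 6 | ... | 6 | 0 | 0 | 0 | 3 | 5 | 1 | 0 | 0 | 0 |
| 3 | 4 | Test_Data_set | 1 | stay157820 | site73 | Midwest | 12 | 0.0 | 11 | 6 | ... | 10 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 5 | Test_Data_set | 1 | stay159036 | site73 | Midwest | 18 | 0.0 | 0 | 6 | ... | 6 | 0 | 0 | 4 | 0 | 5 | 0 | 0 | 3 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 144 | 145 | Test_Data_set | 1 | stay695514 | site167 | West | 12 | 0.0 | 4 | 6 | ... | 6 | 0 | 0 | 4 | 0 | 13 | 0 | 0 | 0 | 1 |
| 145 | 146 | Test_Data_set | 1 | stay700930 | site171 | West | 7 | 0.0 | 0 | 6 | ... | 0 | 0 | 0 | 0 | 3 | 5 | 0 | 0 | 0 | 0 |
| 146 | 147 | Test_Data_set | 1 | stay709956 | site167 | West | 12 | 0.0 | 0 | 6 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| 147 | 148 | Test_Data_set | 1 | stay713512 | site165 | West | 18 | 0.0 | 4 | 6 | ... | 0 | 0 | 0 | 4 | 0 | 5 | 0 | 0 | 0 | 0 |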


In [11]:
# List all DataSets associated with the node
data = node.list_alldatasets()
data

|  | DataSetName | NodeName |
| --- | --- | --- |
| 0 | Test_Data_set | node_1 |


2. **Using an existing DataSet**
   
In this scenario, we'll use an existing dataSet and assign it to a node. However, before assigning it, we need to add the dataSet using `DataSet.upload_dataset()`. There's an optional parameter called *NodeId* that allows assigning the added dataset to a node. By default, this parameter is set to -1, indicating that the added dataset is not assigned to any node.


In [12]:
# Path to the csv file
path_to_csv = "../eicu_test_2.csv"
# Instantiate the DataSet (nothing is uploaded yet)
ds = DataSet(name="Dataset_2", path=path_to_csv)

In [13]:
# Upload the dataSet with the default NodeId = -1 (not assigned to any node)
ds.upload_dataset()

#### List All DataSets

When listing the datasets, the output includes the dataset name and the associated nodeId. If the nodeId is -1, it indicates that the dataset is not assigned to any node. 

In [14]:
allDatasets = ds.list_alldatasets(my_eng)
allDatasets

|  | DataSetName | NodeId |
| --- | --- | --- |
| 0 | Test_Data_set | 1 |
| 1 | Dataset_2 | -1 |


#### Assign DataSet to Node

In [15]:

node_2 = Node(name="node_2" , train = 0 ) ; 

# Create a new node
node_2.create_node(NetId=1) 

# Assign Dataset_2 to node_2
node_2.assign_dataset(dataset_name="Dataset_2"); 

# Display node Data Set
data_2 = node_2.get_dataset() ; 
data_2

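|  | DataSetId | DataSetName | NodeId | id | site_hospital | site_region | age | pao2fio2 | uo | admissiontype | ... | bun | chron_dis | gcs | hr | potassium | sbp | sodium | tempc | wbc | event_death |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 150 | Dataset_2 | 2 | stay722936 | site148 | West | 12 | 0.00000 | 0 | 6 | ... | 6 | 0 | 5 | 4 | 0 | 5 | 0 | 0 | 3 | 1 |
| 1 | 151 | Dataset_2 | 2 | stay725182 | site154 | West | 12 | 0.00000 | 0 | 6 | ... | 0 | 0 | 5 | 4 | 0 | 13 | 1 | 0 | 0 | 0 |
| 2 | 152 | Dataset_2 | 2 | stay731227 | site167 | West | 0 | 0.00000 | 11 | 6 | ... | 6 | 0 | 26 | 2 | 0 | 5 | 0 | 0 | 3 | 0 |
| 3 | 153 | Dataset_2 | 2 | stay735476 | site176 | Unknown | 12 | 0.00000 | 11 | 6 | ... | 6 | 0 | 0 | 4 | 0 | 5 | 0 | 3 | 3 | 0 |
| 4 | 154 | Dataset_2 | 2 | stay739214 | site157 | West | 18 | 0.00000 | 0 | 6 | ... | 0 | 0 | 0 | 2 | 0 | 5 | 1 | 3 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 144 | 294 | Dataset_2 | 2 | stay1117239 | site199 | Northeast | 7 | 0.00000 | 0 | 6 | ... | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| 145 | 295 | Dataset_2 | 2 | stay1119233 | site199 | Northeast | 18 | 0.00000 | 11 | 6 | ... | 6 | 0 | 26 | 0 | 3 | 13 | 0 | 0 | 0 | 0 |
| 146 | 296 | Dataset_2 | 2 | stay1119288 | site199 | Northeast | 7 | 0.00000 | 0 | 6 | ... | 0 | 0 | 5 | 4 | 0 | 13 | 0 | 0 | 3 | 0 |
| 147 | 297 | Dataset_2 | 2 | stay1121252 | site199 | Northeast | 0 | 0.00000 | 0 | 6 | ... | 0 | 0 | 5 | 4 | 0 | 5 | 0 | 0 | 3 | 0 |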


#### Unassign DataSet 

Unassigning a dataSet is the inverse of assigning it to a node: the dataSet's nodeId is set to -1, which disassociates it from any node in the network.
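The `NodeId = -1` convention for assigning and unassigning can be sketched with a small in-memory table. This example uses SQLite from the standard library as a stand-in for MySQL, and the single-table layout and the helpers `node_of` and `set_node` are simplified assumptions; they do not reproduce MEDfl's actual schema or methods.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the table layout is a simplified assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DataSets (DataSetName TEXT, NodeId INTEGER DEFAULT -1)")
conn.execute("INSERT INTO DataSets (DataSetName) VALUES ('Dataset_2')")


def node_of(name):
    """Return the NodeId currently stored for a dataset."""
    row = conn.execute(
        "SELECT NodeId FROM DataSets WHERE DataSetName = ?", (name,)
    ).fetchone()
    return row[0]


def set_node(name, node_id):
    """Assign a dataset to a node, or unassign it by passing node_id = -1."""
    conn.execute(
        "UPDATE DataSets SET NodeId = ? WHERE DataSetName = ?", (node_id, name)
    )


after_upload = node_of("Dataset_2")    # default value: not assigned
set_node("Dataset_2", 2)               # assign to node 2
after_assign = node_of("Dataset_2")
set_node("Dataset_2", -1)              # unassign again
after_unassign = node_of("Dataset_2")
print(after_upload, after_assign, after_unassign)
```

Using a sentinel value such as -1 keeps the dataset rows in the database after unassigning, so they can be reassigned later without a new upload.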

In [16]:
# Unassign Dataset_2 from node_2
node_2.unassign_dataset('Dataset_2') ; 

# Display node Data Set
data_2 = node_2.get_dataset() ; 
data_2


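|  | DataSetId | DataSetName | NodeId | id | site_hospital | site_region | age | pao2fio2 | uo | admissiontype | ... | bun | chron_dis | gcs | hr | potassium | sbp | sodium | tempc | wbc | event_death |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |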


#### Delete DataSet 
Deleting a dataSet involves the complete removal of all samples or records associated with that specific dataSet from the database.

In [18]:
# Delete the DataSet Dataset_2
ds.delete_dataset() ; 

# List All available DataSets 
allDatasets = ds.list_alldatasets(my_eng)
allDatasets

|  | DataSetName | NodeId |
| --- | --- | --- |
| 0 | Test_Data_set | 1 |
| 1 | Dataset_2 | -1 |


#### Delete node 

In [19]:
# Delete the second node
node_2.delete_node() ; 

### Create Federated DataSet

The Federated DataSet acts as a connector between the NetManager and the Learning Manager. To create one, we use the `create_federated_dataset` method of the `FLsetup` class. It goes through all nodes and generates **trainloaders** and **valloaders** for the train nodes, and **testloaders** for the test nodes.


In [20]:
from importlib import reload
from MEDfl.NetManager import network, flsetup, net_helper
reload(network)
reload(flsetup)
reload(net_helper)

from MEDfl.NetManager.network import Network
from MEDfl.NetManager.flsetup import FLsetup



Net = Network('random')

network = Net.use_network('Net1')

# Create a FLsetup
flSetup = FLsetup(name="My_FLSetUp",
                  description="This is just a test", network=Net)

flSetup.create(); 

# Create a federated dataset
# By default, the split fractions take these values:
# val_frac=0.1, test_frac=0.2
flDataSet = flSetup.create_federated_dataset(
    output="event_death", 
    fill_strategy="mean", 
    fit_encode=["site_hospital", "site_region"], 
    to_drop=[ "id", "event_death"], 
    val_frac=0.1, 
    test_frac=0.15)

flDataSet

<MEDfl.LearningManager.federated_dataset.FederatedDataset at 0x7f31105d6e50>
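To give a feel for what fractions such as `val_frac=0.1` and `test_frac=0.15` mean in terms of row counts, here is a minimal sketch that shuffles the indices of a 148-row table and cuts them into three parts. The helper `split_indices` is hypothetical and purely illustrative; how MEDfl actually applies these fractions across train and test nodes may differ.

```python
import random


def split_indices(n_samples, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle sample indices and cut them into train/val/test parts.

    Hypothetical helper; MEDfl's own splitting logic may differ.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_frac)   # fractions are rounded down
    n_val = int(n_samples * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(148, val_frac=0.1, test_frac=0.15)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three parts are disjoint and together cover every row, so no sample is used for both training and evaluation in this sketch.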

Finally, let's clear the database.

In [21]:
# clear DB 
empty_db()