# Research FFD Demo

This notebook walks through the interaction code for FFD, which uses a pipeline of steps to perform the following incremental actions:

1. Global model initialization in central
2. Sending initial model to workers
3. Training a new model in workers
4. Returning model updates to central
5. Aggregating updates into a global model
6. Repeating steps 2 to 5 until the model converges
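
To make step 5 more concrete, the following is a minimal sketch of one common aggregation scheme, sample-weighted federated averaging. It is an illustration only: the notebook does not state which aggregation FFD uses, and the function name `federated_average` and the layout of `updates` are assumptions made for this example.

```python
import numpy as np

def federated_average(worker_updates):
    """Average worker parameters, weighting each worker by its sample count.

    worker_updates: list of (parameters, sample_count) tuples, where
    parameters is a dict of name -> numpy array (hypothetical layout).
    """
    total_samples = sum(count for _, count in worker_updates)
    global_parameters = {}
    for name in worker_updates[0][0]:
        # Weighted sum of the same parameter across all workers
        global_parameters[name] = sum(
            parameters[name] * (count / total_samples)
            for parameters, count in worker_updates
        )
    return global_parameters

# Two hypothetical workers with a 2-feature logistic regression model
updates = [
    ({'weight': np.array([1.0, 3.0]), 'bias': np.array([0.0])}, 100),
    ({'weight': np.array([3.0, 1.0]), 'bias': np.array([1.0])}, 300),
]
global_model = federated_average(updates)
print(global_model)
```

The worker with 300 samples contributes three times as much as the worker with 100 samples, so the averaged weights end up closer to its values.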

We will use the [Synthetic Financial Datasets For Fraud Detection](https://www.kaggle.com/datasets/ealaxi/paysim1/data) to simulate a fraud detection scenario. The central node is controlled by a trade organisation, and the worker nodes are different banks that belong to that organisation. The trade organisation uses federated learning to provide an adaptive, robust and private fraud detection system for its partners. The packages we will use in this notebook are the following:

- Pandas
- Numpy
- Scikit-learn
- MinIO

In [1]:
import io
import pickle
import requests
import json

import numpy as np
import pandas as pd

from minio import Minio

## Preprocessing

Please do the following:

1. Use the link above to download the provided zip file
2. Create a 'data' folder in the same place as this notebook and unzip the data into it
3. Rename the unzipped file to 'fraud_detection.csv'
4. Run the following blocks to save the formatted data as 'formated_fraud_detection_data.csv' in the data folder

In [2]:
source_data_df = pd.read_csv('data/fraud_detection.csv')

In [5]:
source_data_df

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


The columns are:
- Row index = The running number of the logged transaction
- step = Unit of time, where one step is one hour in the real world
- type = Transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER
- amount = Transaction amount in local currency
- nameOrig = Customer who started the transaction
- oldbalanceOrg = Initial balance of the originator before the transaction
- newbalanceOrig = New balance of the originator after the transaction
- nameDest = Customer who is the recipient of the transaction
- oldbalanceDest = Initial balance of the recipient before the transaction
- newbalanceDest = New balance of the recipient after the transaction
- isFraud = Marks the transactions made by the fraudulent agents
- isFlaggedFraud = Existing detection, which flags attempts to transfer more than 200,000 in a single transaction

In order to simulate fraud detection, we need to remove the following columns:
- oldbalanceOrg
- newbalanceOrig
- oldbalanceDest
- newbalanceDest

The isFlaggedFraud column is kept in the formatted data, because it should be used for comparison, but it should not be used for training a model.

After that, we need to modify the following columns:
- type = requires one-hot encoding using integers
- nameOrig = requires string-to-integer encoding
- nameDest = requires string-to-integer encoding
- amount = rounded to the nearest integer

In [3]:
def formatting(
    source_df: pd.DataFrame
) -> pd.DataFrame:
    print('Formatting data')
    formated_df = source_df.copy()
    
    irrelevant_columns = [
        'oldbalanceOrg',
        'newbalanceOrig',
        'oldbalanceDest',
        'newbalanceDest'
    ]
    formated_df.drop(
        columns = irrelevant_columns, 
        inplace = True
    )
    print('Columns dropped')
    formated_df = pd.get_dummies(
        data = formated_df, 
        columns = ['type']
    )
    
    for column in formated_df.columns:
        if 'type' in column:
            formated_df[column] = formated_df[column].astype(int)
    print('One hot coded type')

    unique_values_orig = formated_df['nameOrig'].unique()
    unique_values_dest = formated_df['nameDest'].unique()
    
    unique_value_list_orig = unique_values_orig.tolist()
    unique_value_list_dest = unique_values_dest.tolist()

    print('Orig amount:', len(unique_value_list_orig))
    print('Dest amount:', len(unique_value_list_dest))
    
    set_orig_ids = set(unique_value_list_orig)
    set_dest_ids = set(unique_value_list_dest)
    intersection = set_dest_ids.intersection(set_orig_ids)

    print('Orig and Dest duplicates', len(intersection))
    
    set_dest_ids.difference_update(intersection)
    fixed_unique_value_list_dest = list(set_dest_ids)
    print('Fixed Dest amount:',len(fixed_unique_value_list_dest))
    
    orig_encoding_dict = {}
    index = 1
    for string in unique_value_list_orig:
        if not string in orig_encoding_dict:
            orig_encoding_dict[string] = index
            index = index + 1

    dest_encoding_dict = {}
    cont_index = len(orig_encoding_dict) + 1
    for string in fixed_unique_value_list_dest:
        if not string in dest_encoding_dict:
            dest_encoding_dict[string] = cont_index
            cont_index = cont_index + 1
    print('Orig dict amount:', len(orig_encoding_dict))
    print('Dest dict amount:', len(dest_encoding_dict))
    
    print('Orig and dest string-integer encodings created')

    string_orig_values = formated_df['nameOrig'].tolist()
    string_dest_values = formated_df['nameDest'].tolist()

    orig_encoded_values = []
    for string in string_orig_values:
        orig_encoded_values.append(orig_encoding_dict[string])

    dest_encoded_values = []
    for string in string_dest_values:
        if not string in dest_encoding_dict:
            dest_encoded_values.append(orig_encoding_dict[string])
            continue
        dest_encoded_values.append(dest_encoding_dict[string])

    formated_df['nameOrig'] = orig_encoded_values
    formated_df['nameDest'] = dest_encoded_values

    print('Orig encoded values amount:', len(orig_encoded_values))
    print('Dest encoded values amount:', len(dest_encoded_values))
    
    print('Orig and dest encodings set')

    formated_df['amount'] = formated_df['amount'].round(0).astype(int)
    print('Amount rounded')

    column_order = [
        'step',
        'amount',
        'nameOrig',
        'nameDest',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud',
        'isFlaggedFraud'
    ]
    formated_df = formated_df[column_order]
    print('Columns reordered')
    print('Dataframe shape:', formated_df.shape)
    print('Formatting done')
    return formated_df

In [4]:
formated_data_df = formatting(
    source_df = source_data_df
)

Formatting data
Columns dropped
One hot coded type
Orig amount: 6353307
Dest amount: 2722362
Orig and Dest duplicates 1769
Fixed Dest amount: 2720593
Orig dict amount: 6353307
Dest dict amount: 2720593
Orig and dest string-integer encodings created
Orig encoded values amount: 6362620
Dest encoded values amount: 6362620
Orig and dest encodings set
Amount rounded
Columns reordered
Dataframe shape: (6362620, 11)
Formatting done
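
The string-to-integer encoding above gives the same account the same integer in both nameOrig and nameDest. Because the destination-only IDs are taken from a Python set, the integers they receive can differ between runs. The following is a minimal sketch of a deterministic alternative on a toy frame, assuming it is acceptable to number the IDs in order of first appearance. It is not the numbering that `formatting` produces, and `toy_df` is made up for the example.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'nameOrig': ['C1', 'C2', 'C3'],
    'nameDest': ['M1', 'C1', 'M1'],
})

# Factorize both columns together so that the same account string
# receives the same integer wherever it appears
stacked = pd.concat([toy_df['nameOrig'], toy_df['nameDest']], ignore_index = True)
codes, uniques = pd.factorize(stacked)
codes = codes + 1  # start from 1, as formatting() does

encoded_df = toy_df.copy()
encoded_df['nameOrig'] = codes[:len(toy_df)]
encoded_df['nameDest'] = codes[len(toy_df):]
print(encoded_df)
```

Here 'C1' is encoded as 1 in both columns, and 'M1' gets the next free integer after the origin IDs.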


In [6]:
formated_data_df

Unnamed: 0,step,amount,nameOrig,nameDest,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,isFraud,isFlaggedFraud
0,1,9840,1,7053780,0,0,0,1,0,0,0
1,1,1864,2,8749151,0,0,0,1,0,0,0
2,1,181,3,6719726,0,0,0,0,1,1,0
3,1,181,4,7910659,0,1,0,0,0,1,0
4,1,11668,5,7215342,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,339682,6353303,8801140,0,1,0,0,0,1,0
6362616,743,6311409,6353304,7111791,0,0,0,0,1,1,0
6362617,743,6311409,6353305,8490075,0,1,0,0,0,1,0
6362618,743,850003,6353306,8562842,0,0,0,0,1,1,0


In [7]:
formated_data_df.to_csv('data/formated_fraud_detection_data.csv', index = False)

## Starting Training

Please run the following blocks step by step to initialize training:

In [51]:
formated_data_df = pd.read_csv('data/formated_fraud_detection_data.csv')

In [8]:
used_data = formated_data_df.iloc[:100000]

In [10]:
used_data

Unnamed: 0,step,amount,nameOrig,nameDest,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,isFraud,isFlaggedFraud
0,1,9840,1,7053780,0,0,0,1,0,0,0
1,1,1864,2,8749151,0,0,0,1,0,0,0
2,1,181,3,6719726,0,0,0,0,1,1,0
3,1,181,4,7910659,0,1,0,0,0,1,0
4,1,11668,5,7215342,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
99995,10,4021,99996,6614037,0,0,0,1,0,0,0
99996,10,18345,99997,6969413,0,0,0,1,0,0,0
99997,10,183775,99998,7389144,1,0,0,0,0,0,0
99998,10,82237,99999,8510563,0,1,0,0,0,0,0


In [9]:
target_value_amounts = used_data['isFraud'].value_counts()
print(target_value_amounts)
fraud_to_no_fraud_ratio = target_value_amounts / target_value_amounts.sum()
print(fraud_to_no_fraud_ratio)

0    99884
1      116
Name: isFraud, dtype: int64
0    0.99884
1    0.00116
Name: isFraud, dtype: float64


## Experiment Metadata

This metadata is used to identify the MinIO objects and MLflow experiments of a run.

In [132]:
experiment = {
    'name': 'federated-learning',
    'tags': {}
}

## Model Parameters

These parameters configure FFD's default logistic regression model. The parameters you should change are:

- learning-rate = The learning rate of the model
- sample-rate = The share of samples used in training batches
- optimizer = The model optimizer
- epochs = The number of cycles used for training

In [None]:
model_parameters = {
    'seed': 42,
    'used-columns': [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud'
    ],
    'input-size': 6,
    'target-column': 'isFraud',
    'scaled-columns': [
        'amount'
    ],
    'learning-rate': 0.20,
    'sample-rate': 0.20,
    'optimizer':'SGD',
    'epochs': 10
}
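
To show how learning-rate, sample-rate and epochs interact, here is a minimal NumPy sketch of logistic regression trained with mini-batch gradient descent. It rests on assumptions: FFD's actual model code is not shown in this notebook, and treating sample-rate as the fraction of rows drawn for each batch is only one possible reading of the description above. The function `train_logistic_regression` and the toy data are made up for the example.

```python
import numpy as np

def train_logistic_regression(X, y, learning_rate, sample_rate, epochs, seed):
    rng = np.random.default_rng(seed)
    weights = np.zeros(X.shape[1])
    bias = 0.0
    # Assumption: sample_rate is the fraction of rows used per batch
    batch_size = max(1, int(len(X) * sample_rate))
    for _ in range(epochs):
        # Draw a random batch of rows for this epoch
        batch = rng.choice(len(X), size = batch_size, replace = False)
        logits = X[batch] @ weights + bias
        probabilities = 1.0 / (1.0 + np.exp(-logits))
        error = probabilities - y[batch]
        # Gradient step on the mean binary cross-entropy loss
        weights -= learning_rate * (X[batch].T @ error) / batch_size
        bias -= learning_rate * error.mean()
    return weights, bias

# Toy data: a single standardized feature where larger values mean fraud
toy_rng = np.random.default_rng(42)
amount = toy_rng.normal(size = 1000)
labels = (amount > 1.0).astype(float)
scaled = ((amount - amount.mean()) / amount.std()).reshape(-1, 1)

weights, bias = train_logistic_regression(
    X = scaled,
    y = labels,
    learning_rate = 0.20,
    sample_rate = 0.20,
    epochs = 10,
    seed = 42
)
print(weights, bias)
```

The feature is standardized in the same way as the 'scaled-columns' entry suggests for amount, which keeps the gradient steps at a comparable size regardless of the currency scale.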

## Central Parameters

These parameters configure training on the central node. The parameters you should change are:

- sample-pool = The amount of data reserved for central train, test and eval
- eval-ratio = The share of data used for eval, with the remainder left for train and test
- train-ratio = The share of the remaining data used for train, with the rest left for test
- 1-0-ratio = The share of fraud cases in the data augmented sample
- max-cycles = Maximum number of incremental cycles in learning
- min-metric-success = Minimum number of successful metrics needed to stop learning
- min-update-amount = Minimum number of worker updates required for aggregation
- metric-thresholds = The thresholds used to check if a metric passes based on the condition set in metric-conditions
    - '>=': metric should be larger than or equal to the threshold
    - '<=': metric should be smaller than or equal to the threshold

In [None]:
central_parameters = {
    'sample-pool': 50000,
    'data-augmentation': {
        'active': True,
        'sample-pool': 100000,
        '1-0-ratio': 0.2
    },
    'eval-ratio': 0.2,
    'train-ratio': 0.9,
    'min-update-amount':5,
    'max-cycles':2,
    'min-metric-success': 10,
    'metric-thresholds': {
        'true-positives': 2000,
        'false-positives': 1000,
        'true-negatives': 40000, 
        'false-negatives': 10000,
        'recall': 0.20,
        'selectivity': 0.90,
        'precision': 0.80,
        'miss-rate': 0.50,
        'fall-out': 0.05,
        'balanced-accuracy': 0.70,
        'accuracy': 0.80
    },
    'metric-conditions': {
        'true-positives': '>=',
        'false-positives': '<=',
        'true-negatives': '>=', 
        'false-negatives': '<=',
        'recall': '>=',
        'selectivity': '>=',
        'precision': '>=',
        'miss-rate': '<=',
        'fall-out': '<=',
        'balanced-accuracy': '>=',
        'accuracy': '>='
    }
}
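
The interplay of metric-thresholds, metric-conditions and min-metric-success can be sketched as follows. This is a minimal illustration of the check described above, not FFD's own code. The function `count_passed_metrics` and the example values are hypothetical.

```python
import operator

def count_passed_metrics(metrics, thresholds, conditions):
    # Map the condition strings to comparison functions
    comparisons = {'>=': operator.ge, '<=': operator.le}
    passed = 0
    for name, threshold in thresholds.items():
        compare = comparisons[conditions[name]]
        if compare(metrics[name], threshold):
            passed += 1
    return passed

# Hypothetical evaluation results for three of the metrics
example_metrics = {'recall': 0.25, 'precision': 0.60, 'fall-out': 0.03}
example_thresholds = {'recall': 0.20, 'precision': 0.80, 'fall-out': 0.05}
example_conditions = {'recall': '>=', 'precision': '>=', 'fall-out': '<='}

passed = count_passed_metrics(example_metrics, example_thresholds, example_conditions)
min_metric_success = 2
print(passed, passed >= min_metric_success)
```

With the configuration above there are 11 thresholds and min-metric-success is 10, so under this reading at most one metric may miss its threshold before learning is considered successful.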

## Worker Parameters

These parameters configure training on the worker nodes. The parameters you should change are:

- sample-pool = The amount of data used for worker train, test and eval
- eval-ratio = The share of data used for eval, with the remainder left for train and test
- train-ratio = The share of the remaining data used for train, with the rest left for test
- 1-0-ratio = The share of fraud cases in the data augmented worker sample

In [11]:
worker_parameters = {
    'sample-pool': 250000,
    'data-augmentation': {
        'active': True,
        'sample-pool': 250000,
        '1-0-ratio': 0.2
    },
    'eval-ratio': 0.2,
    'train-ratio': 0.9
}
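
As a worked example of how sample-pool, eval-ratio and train-ratio relate, the sketch below computes the split sizes for the worker parameters above. It assumes the order implied by the descriptions: eval is taken first, and the remainder is divided into train and test. The helper `split_amounts` is hypothetical.

```python
def split_amounts(sample_pool, eval_ratio, train_ratio):
    # Eval is taken from the whole pool first
    eval_amount = int(round(sample_pool * eval_ratio))
    remaining = sample_pool - eval_amount
    # Train takes its share of the remainder, test gets the rest
    train_amount = int(round(remaining * train_ratio))
    test_amount = remaining - train_amount
    return train_amount, test_amount, eval_amount

train_amount, test_amount, eval_amount = split_amounts(
    sample_pool = 250000,
    eval_ratio = 0.2,
    train_ratio = 0.9
)
print(train_amount, test_amount, eval_amount)
```

Under this assumption a pool of 250000 rows gives 180000 train, 20000 test and 50000 eval rows.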

## Context Payload

Run these blocks to start the training process.
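
The context uses the variables `data` and `columns`, whose definitions are not shown in this notebook. A minimal sketch, assuming they hold the row values and the header of `used_data` as plain Python lists so that they can be serialized to JSON, is the following. The frame `example_used_data` is a small stand-in for `used_data`.

```python
import json
import pandas as pd

# Stand-in for used_data; in the notebook this would be the 100000-row slice
example_used_data = pd.DataFrame({
    'step': [1, 1],
    'amount': [9840, 181],
    'isFraud': [0, 1],
})

# Assumption: the header and the rows are sent as separate lists
columns = example_used_data.columns.tolist()
data = example_used_data.values.tolist()

example_payload = json.dumps({'data': data, 'columns': columns})
print(example_payload)
```

Converting with `tolist()` turns NumPy integers into Python integers, which `json.dumps` can serialize directly.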

In [135]:
parameters = {
    'model': model_parameters,
    'central': central_parameters,
    'worker': worker_parameters
}

context = {
    'experiment': experiment,
    'parameters': parameters,
    'data': data,
    'columns': columns
}

payload = json.dumps(context)
print('Payload size in bytes: ' + str(len(payload)))

In [16]:
len(payload)

4933793

In [136]:
response = requests.post(
    url = 'http://127.0.0.1:7500/start',
    json = payload
)

print(response.status_code)

200


## Checking training with Logs and Docker stats

During the run it is recommended to check central's progress by refreshing the logs website and by checking the metrics of docker stats, which you can read about [here](https://docs.docker.com/reference/cli/docker/container/stats/). Training is complete when the central logs show evaluation results and the workers no longer receive any context from central, which causes their tasks to idle with False prints.

## Nodes, MLflow and MinIO

Because **central and worker are stateless** due to using MinIO, you can stop the node deployment once the logs start to idle, either with CTRL+C or with docker compose -f (file) stop, unless you want to try model inference.

As a reminder, it is not recommended to use docker compose -f (file) down on the storage and monitoring deployments, because **the gathered data will be lost**. **Data will not be wiped during computer restarts**, but to prevent the loss of MinIO and MLflow data you can download the buckets and runs to your computer. In MinIO, click a folder and check the provided options on the right side. In MLflow, select all runs and click the three dots on the right side on top of the new run button.

## Checking results with MLflow and Grafana

In order to analyse model, time and resource metrics, the fastest way to get a general idea is to use the MLflow comparison analysis and the Grafana dashboard on a given time range.

### MLflow

To analyse with MLflow, select the experiments and runs you intend to analyse and then click compare, which provides different analysis options. For simplicity, we recommend using the parallel coordinates plot to compare evaluation results. You can configure it by opening the metrics and parameters lists and selecting the metrics and parameters you want. It is also recommended to scroll down to get a list of the different metrics.

### Grafana

To analyse with Grafana, press the three bars under the Grafana logo and go to dashboards. There, press new, select import and select the ffd_dashboard.json provided in deployment/grafana. You might then need to correct the time range by selecting the date in the top right and setting it to the last two hours. When the plots show lines, click inside one of them to reduce the time range of all plots to the one you want.

## Inference

After the first model has been created in the nodes, we can get predictions using the following function:

In [11]:
def central_worker_inference(
    address: str,
    experiment_name: str,
    experiment: str,
    cycle: str,
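    data_df: pd.DataFrame,
    relevant_columns: list,
    rows: int
) -> pd.DataFrame:
    # Copy the slices so that the assignments below
    # do not trigger pandas SettingWithCopyWarning
    sample_df = data_df.iloc[:rows,:].copy()
    relevant_df = sample_df[relevant_columns]
    # The last two relevant columns are expected to be
    # the targets, which are left out of the model input
    input_df = relevant_df.iloc[:,:-2].copy()
    # Standardize amount with the sample mean and deviation
    mean = input_df['amount'].mean()
    std_dev = input_df['amount'].std()
    input_df['amount'] = (input_df['amount'] - mean)/std_dev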

    payload = {
        'experiment-name': experiment_name,
        'experiment': experiment,
        'cycle': cycle,
        'input': input_df.values.tolist()
    }
    payload = json.dumps(payload)
    central_address = address + '/predict' 
    response = requests.post(
        url = central_address,
        json = payload
    )

    text_output = json.loads(response.text)
    sample_df['pred'] = np.array(text_output['predictions']).astype(int)
    return sample_df

You need to provide the correct address, experiment name, experiment number, cycle and number of rows for the input:

In [25]:
inference_df = central_worker_inference(
    address = 'http://127.0.0.1:7500',
    experiment_name = 'federated-learning',
    experiment = '1',
    cycle = '2',
    data_df = formated_data_df,
    relevant_columns = [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud',
        'isFlaggedFraud'
    ],
    rows = 50
)
inference_df



Unnamed: 0,step,amount,nameOrig,nameDest,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,isFraud,isFlaggedFraud,pred
0,1,9840,1,7233461,0,0,0,1,0,0,0,0
1,1,1864,2,7735206,0,0,0,1,0,0,0,0
2,1,181,3,8598945,0,0,0,0,1,1,0,1
3,1,181,4,7880837,0,1,0,0,0,1,0,0
4,1,11668,5,7670940,0,0,0,1,0,0,0,0
5,1,7818,6,6477257,0,0,0,1,0,0,0,0
6,1,7108,7,8194799,0,0,0,1,0,0,0,0
7,1,7862,8,8738506,0,0,0,1,0,0,0,0
8,1,4024,9,6735336,0,0,0,1,0,0,0,0
9,1,5338,10,6427877,0,0,1,0,0,0,0,0
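
To compare the predictions against the labels, the confusion counts and a few of the metrics named in metric-thresholds can be computed from the isFraud and pred columns. The following is a minimal sketch using the standard definitions of recall, precision and selectivity. The helper `confusion_metrics` is hypothetical, and `shown_rows` only copies the ten rows displayed above.

```python
import pandas as pd

def confusion_metrics(result_df, target_column, prediction_column):
    actual = result_df[target_column] == 1
    predicted = result_df[prediction_column] == 1
    tp = int((actual & predicted).sum())
    fp = int((~actual & predicted).sum())
    tn = int((~actual & ~predicted).sum())
    fn = int((actual & ~predicted).sum())
    return {
        'true-positives': tp,
        'false-positives': fp,
        'true-negatives': tn,
        'false-negatives': fn,
        # Guard against division by zero on small samples
        'recall': tp / (tp + fn) if tp + fn else 0.0,
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'selectivity': tn / (tn + fp) if tn + fp else 0.0,
    }

# The ten rows shown above, reduced to the two relevant columns
shown_rows = pd.DataFrame({
    'isFraud': [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    'pred':    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})
shown_metrics = confusion_metrics(shown_rows, 'isFraud', 'pred')
print(shown_metrics)
```

In the notebook the same function could be called with `inference_df` instead of `shown_rows`. On the ten displayed rows the model finds one of the two fraud cases without any false alarms.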


## MinIO Interaction

During runs, you can interact with MinIO using the following client and functions:

In [2]:
minio_client = Minio(
    endpoint = "127.0.0.1:9000", 
    access_key = '23034opsdjhksd', 
    secret_key = 'sdkl3slömdm',
    secure = False
)

In [3]:
def create_bucket(
    minio_client: any,
    bucket_name: str
) -> bool:
    MINIO_CLIENT = minio_client 
    try:
        MINIO_CLIENT.make_bucket(
            bucket_name = bucket_name
        )
        return True
    except Exception as e:
        print(e)
        return False
    
def check_bucket(
    minio_client: any,
    bucket_name:str
) -> bool:
    MINIO_CLIENT = minio_client
    try:
        status = MINIO_CLIENT.bucket_exists(bucket_name = bucket_name)
        return status
    except Exception as e:
        print(e)
        return False 
       
def delete_bucket(
    minio_client: any,
    bucket_name:str
) -> bool:
    MINIO_CLIENT = minio_client
    try:
        MINIO_CLIENT.remove_bucket(
            bucket_name = bucket_name
        )
        return True
    except Exception as e:
        print(e)
        return False
# Works
def create_object(
    minio_client: any,
    bucket_name: str, 
    object_path: str, 
    data: any,
    metadata: dict
) -> bool: 
    # Be aware that MinIO objects have a size limit of 1GB, 
    # which might result in a large header error
    MINIO_CLIENT = minio_client
    
    pickled_data = pickle.dumps(data)
    length = len(pickled_data)
    buffer = io.BytesIO()
    buffer.write(pickled_data)
    buffer.seek(0)
    try:
        MINIO_CLIENT.put_object(
            bucket_name = bucket_name,
            object_name = object_path + '.pkl',
            data = buffer,
            length = length,
            metadata = metadata
        )
        return True
    except Exception as e:
        print(e)
        return False
# Works
def check_object(
    minio_client: any,
    bucket_name: str, 
    object_path: str
) -> bool: 
    MINIO_CLIENT = minio_client
    try:
        object_info = MINIO_CLIENT.stat_object(
            bucket_name = bucket_name,
            object_name = object_path + '.pkl'
        )      
        return True
    except Exception as e:
        return False 
# Works
def delete_object(
    minio_client: any,
    bucket_name: str, 
    object_path: str
) -> bool: 
    MINIO_CLIENT = minio_client
    try:
        MINIO_CLIENT.remove_object(
            bucket_name = bucket_name, 
            object_name = object_path + '.pkl'
        )
        return True
    except Exception as e:
        print(e)
        return False
# Works
def update_object(
    minio_client: any,
    bucket_name: str, 
    object_path: str, 
    data: any,
    metadata: dict
) -> bool:  
    remove = delete_object(minio_client,bucket_name, object_path)
    if remove:
        create = create_object(minio_client, bucket_name, object_path, data, metadata)
        if create:
            return True
    return False
# works
def create_or_update_object(
    minio_client: any,
    bucket_name: str, 
    object_path: str, 
    data: any, 
    metadata: dict
) -> any:
    bucket_status = check_bucket(minio_client,bucket_name)
    if not bucket_status:
        creation_status = create_bucket(minio_client,bucket_name)
        if not creation_status:
            return None
    object_status = check_object(minio_client,bucket_name, object_path)
    if not object_status:
        return create_object(minio_client,bucket_name, object_path, data, metadata)
    else:
        return update_object(minio_client,bucket_name, object_path, data, metadata)

def get_object_data_and_metadata(
    minio_client: any,
    bucket_name: str, 
    object_path: str
) -> dict:
    MINIO_CLIENT = minio_client
    
    try:
        given_object_info = MINIO_CLIENT.stat_object(
            bucket_name = bucket_name, 
            object_name = object_path + '.pkl'
        )
        # There seems to be some kind of a limit on the
        # number of requests a client can make, which is
        # why the metadata is stored in a variable here to
        # give the client more time to complete the request
        given_metadata = given_object_info.metadata
        
        given_object_data = MINIO_CLIENT.get_object(
            bucket_name = bucket_name, 
            object_name = object_path + '.pkl'
        )
        try:
            given_pickled_data = given_object_data.data
        finally:
            # Release the connection back to the pool
            given_object_data.close()
            given_object_data.release_conn()
        
        try:
            given_data = pickle.loads(given_pickled_data)
            relevant_metadata = {} 
            for key, value in given_metadata.items():
                # Header names may be capitalized, so match case-insensitively
                if key.lower().startswith('x-amz-meta-'):
                    key_name = key[11:]
                    relevant_metadata[key_name] = value
            return {'data': given_data, 'metadata': relevant_metadata}
        except Exception as e:
            print('MinIO object pickle decoding error')
            print(e)
            return None 
    except Exception as e:
        print('MinIO object fetching error')
        print(e)
        return None
# Works
def get_object_list(
    minio_client: any,
    bucket_name: str,
    path_prefix: str
) -> dict:
    MINIO_CLIENT = minio_client
    try:
        objects = MINIO_CLIENT.list_objects(bucket_name = bucket_name, prefix = path_prefix, recursive = True)
        object_dict = {}
        for obj in objects:
            object_name = obj.object_name
            object_info = MINIO_CLIENT.stat_object(
                bucket_name = bucket_name,
                object_name = object_name
            )
            given_metadata = {} 
            for key, value in object_info.metadata.items():
                # Header names may be capitalized, so match case-insensitively
                if key.lower().startswith('x-amz-meta-'):
                    key_name = key[11:]
                    given_metadata[key_name] = value
            object_dict[obj.object_name] = given_metadata
        return object_dict
    except Exception as e:
        return None  

These can be used to debug the current state of central or workers:

In [137]:
minio_object = get_object_data_and_metadata(
    minio_client = minio_client,
    bucket_name = 'central', 
    object_path = 'experiments/status'
)
minio_object

{'data': {'experiment-name': 'federated-learning',
  'experiment': 2,
  'experiment-id': '1',
  'start': True,
  'data-split': True,
  'preprocessed': True,
  'trained': True,
  'worker-split': True,
  'sent': True,
  'updated': True,
  'evaluated': True,
  'complete': True,
  'train-amount': 80000,
  'test-amount': 20000,
  'eval-amount': 100000,
  'collective-amount': 400000,
  'worker-updates': 4,
  'cycle': 6,
  'run-id': 'df8eee217bea4c0695ebd3ebfa54921b'},
 'metadata': {}}
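
The object functions above store the data as pickled bytes under a '.pkl' name. The following minimal sketch shows that round trip without a MinIO server, using only the standard library. The dictionary `example_data` is made up for the example.

```python
import io
import pickle

example_data = {'cycle': 1, 'trained': True}

# What create_object() prepares before calling put_object()
pickled_data = pickle.dumps(example_data)
buffer = io.BytesIO(pickled_data)
length = len(pickled_data)

# What get_object_data_and_metadata() does with the downloaded bytes
restored_data = pickle.loads(buffer.read())
print(length, restored_data)
```

Note that `pickle.loads` can execute arbitrary code, so objects should only be loaded from buckets you trust.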

## Experiment Results

When you have completed the experiment, you might want to parse MinIO metrics into CSVs, which you can do using the following functions:

### MinIO Get and Set

In [None]:
def format_metadata_dict(
    given_metadata: dict
) -> dict:
    # MinIO metadata keys have their first character capitalized 
    # and their values are strings due to the AMZ format, 
    # which is why the keys must be made lowercase
    # and numeric strings must be changed back to numbers 
    fixed_dict = {}
    for key, value in given_metadata.items():
        if value.isdigit():
            fixed_dict[key.lower()] = int(value)
        elif value.replace('.','',1).isdigit():
            # int() would fail on decimal strings such as '0.5'
            fixed_dict[key.lower()] = float(value)
        else: 
            fixed_dict[key.lower()] = value
    fixed_dict = decode_metadata_strings_to_lists(fixed_dict)
    return fixed_dict
# Created and works
def encode_metadata_lists_to_strings(
    given_metadata: dict
) -> dict:
    # MinIO metadata only accepts strings and integers, 
    # that have keys without _ characters
    # which is why saving lists in metadata requires
    # making them strings
    modified_dict = {}
    for key,value in given_metadata.items():
        if isinstance(value, list):
            modified_dict[key] = 'list=' + ','.join(map(str, value))
            continue
        modified_dict[key] = value
    return modified_dict 
# Created and works
def decode_metadata_strings_to_lists(
    given_metadata: dict
) -> dict:
    modified_dict = {}
    for key, value in given_metadata.items():
        if isinstance(value, str):
            if 'list=' in value:
                string_integers = value.split('=')[1]
                values = string_integers.split(',')
                if len(values) == 1 and values[0] == '':
                    modified_dict[key] = []
                else:
                    try:
                        modified_dict[key] = list(map(int, values))
                    except ValueError:
                        modified_dict[key] = list(map(str, values))
                continue
        modified_dict[key] = value
    return modified_dict

def get_experiments_objects(
    minio_client: any,
    object_bucket: str,
    object_path: str
) -> any:
    object_exists = check_object(
        minio_client = minio_client,
        bucket_name = object_bucket,
        object_path = object_path
    )
    
    object_data = None
    object_metadata = None
    if object_exists:
        fetched_object = get_object_data_and_metadata(
            minio_client = minio_client,
            bucket_name = object_bucket,
            object_path = object_path
        )
        object_data = fetched_object['data']
        object_metadata = format_metadata_dict(fetched_object['metadata'])
    return object_data, object_metadata
# Created and works 
def set_experiments_objects(
    minio_client: any,
    object_bucket: str,
    object_path: str,
    overwrite: bool,
    object_data: any,
    object_metadata: any
):
    object_exists = check_object(
        minio_client = minio_client,
        bucket_name = object_bucket,
        object_path = object_path
    )
    perform = True
    if object_exists and not overwrite:
        perform = False

    if perform:
        create_or_update_object(
            minio_client = minio_client,
            bucket_name = object_bucket,
            object_path = object_path,
            data = object_data,
            metadata = encode_metadata_lists_to_strings(object_metadata)
        )
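
The 'list=' convention used by the encode and decode functions above can be illustrated with a self-contained round trip. The helpers `encode_list` and `decode_list` are simplified stand-ins written for this example, which handle a single value instead of a whole metadata dictionary.

```python
def encode_list(value):
    # Lists are stored as one string, because metadata values are strings
    return 'list=' + ','.join(map(str, value))

def decode_list(value):
    items = value.split('=', 1)[1].split(',')
    if items == ['']:
        return []
    try:
        return [int(item) for item in items]
    except ValueError:
        # Fall back to strings when the items are not integers
        return items

encoded = encode_list([1, 2, 3])
decoded = decode_list(encoded)
print(encoded)
print(decoded)
print(decode_list(encode_list([])))
print(decode_list(encode_list(['loss', 'recall'])))
```

One limitation of this convention is that list items containing a comma cannot be restored, because the comma is used as the separator.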

### Central Parsing

In [None]:
from collections import OrderedDict

def set_central_objects_and_paths(
    experiments_folder: str,
    experiment_name: str,
    experiment: str
) -> any:
    central_objects = {
        'specifications': {},
        'times': {},
        'task': pd.DataFrame(),
        'function': pd.DataFrame(),
        'network': pd.DataFrame(),
        'training': pd.DataFrame(),
        'inference': pd.DataFrame(),
        'metrics': pd.DataFrame(),
        'system': pd.DataFrame(),
        'server': pd.DataFrame()
    }
    
    object_paths = {
        'specifications': experiments_folder + '/specifications',
        'times': experiments_folder + '/' + str(experiment_name) + '/' + str(experiment) + '/times',
        'task': experiments_folder + '/' + str(experiment_name) + '/' + str(experiment) + '/c/times/task',
        'function': experiments_folder + '/' + str(experiment_name) + '/' + str(experiment) + '/c/times/function',
        'network': experiments_folder + '/' + str(experiment_name) + '/' + str(experiment) + '/c/times/network',
        'training': experiments_folder + '/' + str(experiment_name) + '/' + str(experiment) + '/c/times/training',
        'inference': experiments_folder + '/' + str(experiment_name) + '/' + str(experiment) + '/c/times/inference',
        'metrics': experiments_folder + '/' + str(experiment_name) + '/' + str(experiment) + '/c/metrics',
        'system': experiments_folder + '/' + str(experiment_name) + '/' + str(experiment) + '/c/resources/system',
        'server': experiments_folder + '/' + str(experiment_name) + '/' + str(experiment) + '/c/resources/server'
    }
    
    return central_objects, object_paths

def PyTorch_model_into_data_and_columns(
    parameters: any
) -> any:
    columns = []
    values = []
    for key,value in parameters.items():
        numpy_format = value.numpy().tolist()
        shape = value.shape

        if 'weight' in key:
            for i in range(1,shape[1]+1):
                columns.append('w-' + str(i))
            for weights in numpy_format:
                values.append(weights)

        if 'bias' in key:
            columns.append('b-1')
            i = 0
            for bias in numpy_format:
                values[i].append(bias)
                i = i + 1
    return values, columns

def format_central_object(
    collected_objects: any,
    bucket: str,
    path: str,
    key: str,
    cycle: str
) -> bool:    
    formatted_data = collected_objects[key]
    data, metadata = get_experiments_objects(
        minio_client = minio_client,
        object_bucket = bucket,
        object_path = path
    )
    
    if isinstance(data, OrderedDict):
        path_split = path.split('/')
        
        if key in path_split[-2]:
            model_values, model_columns = PyTorch_model_into_data_and_columns(
                parameters = data
            )
            model_df = pd.DataFrame(model_values, columns = model_columns)
            model_df['name'] = path_split[-1]
            model_df['cycle'] = int(cycle)
            
            result = pd.concat([formatted_data,model_df])
            result = result.reset_index(drop=True)
            collected_objects[key] = result
        else:
            model_values, model_columns = PyTorch_model_into_data_and_columns(
                parameters = data
            )
            model_df = pd.DataFrame(model_values, columns = model_columns)
            model_df['cycle'] = int(cycle)

            result = pd.concat([formatted_data,model_df])
            result = result.reset_index(drop=True)
            collected_objects[key] = result
        return True
        
    if isinstance(formatted_data, dict):
        if cycle is None:
            collected_objects[key] = data
        else:
            collected_objects[key][cycle] = data
        return True
            
    if isinstance(formatted_data, pd.DataFrame):
        if not metadata is None:
            if 'header' in metadata:
                if cycle is None:
                    object_df = pd.DataFrame(data, columns = metadata['header'])
                    collected_objects[key] = pd.concat([formatted_data,object_df])
                else:
                    path_split = path.split('/')
                    object_df = pd.DataFrame(data, columns = metadata['header'])
                    object_df['name'] = path_split[-1]
                    object_df['cycle'] = int(cycle)
                    result = pd.concat([formatted_data,object_df])
                    result = result.reset_index(drop=True)
                    collected_objects[key] = result
                return True
        if data is not None:
            object_df = pd.DataFrame.from_dict(data,orient='index')
            # cycle is None for objects stored outside the cycle folders
            if cycle is not None:
                object_df['cycle'] = int(cycle)
            result = pd.concat([formatted_data,object_df])
            result = result.reset_index(drop=True)
            collected_objects[key] = result
        return True
        
def format_central_experiment_objects(
    minio_client: any,
    experiments_folder: str,
    experiment_name: str,
    experiment: str,
    cycles: int
):
    target_bucket = 'central'
    collected_objects, storage_paths = set_central_objects_and_paths(
        experiments_folder = experiments_folder,
        experiment_name = experiment_name,
        experiment = experiment
    )
    
    max_cycles = cycles + 1
    for name in collected_objects.keys():
        whole_path = storage_paths[name]
        path_split = whole_path.split('/')
        used_key = path_split[-1]
        if 4 < len(path_split):
            if path_split[3] == 'c':
                for cycle in range(1,max_cycles + 1):
                    path_split[3] = str(cycle)
                    cycle_path = '/'.join(path_split)
                    if '|' in path_split[-1]:
                        folder_path = cycle_path.split('|')[0]
                        folder_objects = get_object_list(
                            minio_client = minio_client,
                            bucket_name = target_bucket,
                            path_prefix = folder_path
                        )
                        formatted_key = used_key.split('|')[0]
                        for object_name in folder_objects.keys():
                            object_path = object_name.split('.')[0]
                            format_central_object(
                                collected_objects = collected_objects,
                                bucket = target_bucket,
                                path = object_path,
                                key = formatted_key,
                                cycle = str(cycle)
                            )
                        continue
                    format_central_object(
                        collected_objects = collected_objects,
                        bucket = target_bucket,
                        path = cycle_path,
                        key = used_key,
                        cycle = str(cycle)
                    )
                continue
        format_central_object(
            collected_objects = collected_objects,
            bucket = target_bucket,
            path = whole_path,
            key = used_key,
            cycle = None
        )
    return collected_objects
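
The cycle handling in `format_central_experiment_objects` can be easier to follow in isolation. The sketch below mirrors only the path logic: a template whose fourth segment is the placeholder `c` is expanded once per cycle, and a `|` in the last segment marks a folder whose objects are listed rather than fetched directly. The helper name `expand_cycle_paths` and the example templates are hypothetical; the real templates come from `set_central_objects_and_paths`.

```python
def expand_cycle_paths(template: str, cycles: int) -> list:
    """Return (cycle, path, is_folder) tuples for one storage path template.

    Hypothetical helper that mirrors the path handling in
    format_central_experiment_objects without touching MinIO.
    """
    parts = template.split('/')
    # Templates without a 'c' placeholder in the fourth segment do not depend on the cycle
    if len(parts) <= 4 or parts[3] != 'c':
        return [(None, template, False)]

    expanded = []
    # The original loop runs from 1 to cycles + 1 inclusive
    for cycle in range(1, cycles + 2):
        parts[3] = str(cycle)
        cycle_path = '/'.join(parts)
        # A '|' in the last segment marks a folder prefix to list
        is_folder = '|' in parts[-1]
        if is_folder:
            cycle_path = cycle_path.split('|')[0]
        expanded.append((cycle, cycle_path, is_folder))
    return expanded


# Hypothetical templates, used only for illustration
for template in ('experiments/fl/1/specifications',
                 'experiments/fl/1/c/global-model',
                 'experiments/fl/1/c/updates|'):
    print(expand_cycle_paths(template, cycles=2))
```

Each tuple corresponds to one call of `format_central_object`, or to one `get_object_list` call when `is_folder` is true.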

### Workers parsing

In [None]:
def set_workers_objects_and_paths(
    experiments_folder: str,
    experiment_name: str,
    experiment: str
) -> any:
    workers_objects = {
        'times': {},
        'task': pd.DataFrame(),
        'function': pd.DataFrame(),
        'network': pd.DataFrame(),
        'training': pd.DataFrame(),
        'metrics': pd.DataFrame(),
        'server': pd.DataFrame()
    }
    
    return workers_objects

def format_worker_object(
    collected_objects: any,
    bucket: str,
    path: str,
    worker: str,
    key: str,
    cycle: str
) -> bool:    
    formatted_data = collected_objects[key]
    data, metadata = get_experiments_objects(
        minio_client = minio_client,
        object_bucket = bucket,
        object_path = path
    )
    
    if data is None:
        return False
    
    if isinstance(formatted_data, dict):
        collected_objects[key][worker] = data
        return True
         
    if isinstance(formatted_data, pd.DataFrame):
        if cycle is not None:
            object_df = pd.DataFrame.from_dict(data,orient='index')
            object_df['worker'] = worker
            object_df['cycle'] = int(cycle)
            result = pd.concat([formatted_data,object_df])
            result = result.reset_index(drop=True)
            collected_objects[key] = result
            
    return True

def format_workers_experiment_objects(
    minio_client: any,
    experiments_folder: str,
    experiment_name: str,
    experiment: str
):
    target_bucket = 'workers'
    collected_objects = set_workers_objects_and_paths(
        experiments_folder = experiments_folder,
        experiment_name = experiment_name,
        experiment = experiment
    )
    
    worker_objects = get_object_list(
        minio_client = minio_client,
        bucket_name = target_bucket,
        path_prefix = ''
    )
    
    relevant_paths = []
    for object_path in worker_objects:
        formatted_path = object_path.split('.')[0]
        formatted_path_split = formatted_path.split('/')
        
        # Index 4 is read below as the cycle, so shorter paths are skipped
        if len(formatted_path_split) < 5:
            continue
        
        path_key = formatted_path_split[-1]
        path_experiment_name = formatted_path_split[2] 
        path_experiment = formatted_path_split[3]
        
        if path_key not in collected_objects:
            continue
        
        if path_experiment_name != experiment_name:
            continue
        
        if path_experiment != experiment:
            continue
             
        relevant_paths.append(formatted_path)
    
    for path in relevant_paths:
        formatted_path_split = path.split('/')
        path_worker = formatted_path_split[0]
        path_key = formatted_path_split[-1]
        
        path_cycle = str(formatted_path_split[4])
        format_worker_object(
            collected_objects = collected_objects,
            bucket = target_bucket,
            path = path,
            worker = path_worker,
            key = path_key,
            cycle = path_cycle
        )
            
    return collected_objects
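
The filtering in `format_workers_experiment_objects` relies on the position of each segment in the object path. The sketch below isolates that step under the assumption, inferred from the indices used above, that worker objects are stored as `<worker>/<folder>/<experiment name>/<experiment>/<cycle>/<key>.<extension>`. The helpers `parse_worker_path` and `select_worker_paths`, and the example paths, are hypothetical.

```python
from typing import Optional


def parse_worker_path(object_path: str) -> Optional[dict]:
    """Split a worker object path into the fields used for filtering.

    Hypothetical helper; the layout is inferred from the indices used in
    format_workers_experiment_objects, not from a documented schema.
    """
    # Drop the file extension the same way the original code does
    stem = object_path.split('.')[0]
    parts = stem.split('/')
    # Index 4 is read as the cycle, so at least five segments are needed
    if len(parts) < 5:
        return None
    return {
        'worker': parts[0],
        'experiment_name': parts[2],
        'experiment': parts[3],
        'cycle': parts[4],
        'key': parts[-1],
    }


def select_worker_paths(object_paths, experiment_name, experiment, keys):
    """Keep only the paths that belong to one experiment and a known key."""
    selected = []
    for object_path in object_paths:
        fields = parse_worker_path(object_path)
        if fields is None:
            continue
        if fields['key'] not in keys:
            continue
        if fields['experiment_name'] != experiment_name:
            continue
        if fields['experiment'] != experiment:
            continue
        selected.append(object_path.split('.')[0])
    return selected
```

With a five-segment path such as a hypothetical `worker-1/experiments/federated-learning/2/times`, `cycle` and `key` are the same segment. In the code above that value is only converted with `int` for the DataFrame-backed keys.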

Before you run these cells, it is recommended that you create the directories where the output files will be stored. To parse the central objects, provide the correct experiment name, experiment id and number of cycles:

In [None]:
formatted_central_objects = format_central_experiment_objects(
    minio_client = minio_client,
    experiments_folder = 'experiments',
    experiment_name = 'federated-learning',
    experiment = '1',
    cycles = 2
)
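
The export cells in this section write into `data/minio/federated-learning/central` and `data/minio/federated-learning/workers`, and `open(..., 'w')` fails if those folders are missing. A minimal sketch for creating them with the standard library follows; the helper name `ensure_output_dirs` is hypothetical and the default base path is taken from the export cells.

```python
from pathlib import Path


def ensure_output_dirs(base_dir: str = 'data/minio/federated-learning') -> list:
    """Create the central and workers output folders if they are missing."""
    created = []
    for node in ('central', 'workers'):
        target = Path(base_dir) / node
        # parents=True builds intermediate folders, exist_ok=True makes reruns safe
        target.mkdir(parents=True, exist_ok=True)
        created.append(str(target))
    return created
```

Calling `ensure_output_dirs()` once before the export cells is enough; running it again leaves existing folders untouched.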

In [None]:
with open('data/minio/federated-learning/central/specifications.json', 'w') as f:
    json.dump(formatted_central_objects['specifications'],f, indent=4)

In [None]:
with open('data/minio/federated-learning/central/times.json', 'w') as f:
    json.dump(formatted_central_objects['times'],f, indent=4)

In [None]:
metrics_keys = [
    'task',
    'function',
    'network', 
    'training', 
    'metrics', 
    'system', 
    'server'
]


for metric in metrics_keys:
    path = 'data/minio/federated-learning/central/replace.csv'
    used_path = path.replace('replace',metric)
    formatted_central_objects[metric].to_csv(used_path, index = False)

You can parse all the workers by providing just the experiment name and experiment id:

In [None]:
formatted_worker_objects = format_workers_experiment_objects(
    minio_client = minio_client,
    experiments_folder = 'experiments',
    experiment_name = 'federated-learning',
    experiment = '2'
)

In [None]:
with open('data/minio/federated-learning/workers/times.json', 'w') as f:
    json.dump(formatted_worker_objects['times'],f, indent=4)

In [None]:
metrics_keys = [
    'task',
    'function',
    'network', 
    'training', 
    'metrics',  
    'server'
]


for metric in metrics_keys:
    path = 'data/minio/federated-learning/workers/replace.csv'
    used_path = path.replace('replace',metric)
    formatted_worker_objects[metric].to_csv(used_path, index = False)