# Central and Worker 

This notebook goes over the necessary code for central and worker federated learning agents, which have their own machine learning pipelines that enable the following incremental actions:
1. Global model initialization in central
2. Sending initial model to workers
3. Training a new model in workers
4. Returning model updates to central
5. Aggregating updates into a global model
6. Repeating steps 2 to 5 until the model converges
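
The six steps above can be sketched as a small simulation. This is a minimal sketch, not the pipeline used later in this notebook: it fits a one-parameter model `y ~ w * x` with plain gradient descent, and the names `local_train`, `aggregate` and `run_rounds` as well as the toy data are hypothetical.

```python
# Toy federated loop: one scalar weight w for the model y ~ w * x.
# All names and data here are illustrative assumptions.

def local_train(w, data, learning_rate=0.05, epochs=5):
    """Step 3: a worker refines the received weight on its own data."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w

def aggregate(updates):
    """Step 5: average worker weights, weighted by local sample count."""
    total = sum(n for _, n in updates)
    return sum(w * n / total for w, n in updates)

def run_rounds(worker_data, rounds=20):
    global_w = 0.0  # Step 1: central initializes the global model.
    for _ in range(rounds):  # Step 6: repeat for a fixed number of rounds.
        updates = []
        for data in worker_data:
            # Steps 2-4: send the global weight, train locally, return the update.
            updates.append((local_train(global_w, data), len(data)))
        global_w = aggregate(updates)
    return global_w

# Two workers whose data both follow y = 3x.
workers = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (3.0, 9.0)],
]
final_w = run_rounds(workers)
print(round(final_w, 3))
```

The raw data never leaves a worker in this loop; only the locally trained weight and the sample count are returned to central.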

In this project we use the [Synthetic Financial Datasets For Fraud Detection](https://www.kaggle.com/datasets/ealaxi/paysim1/data) to simulate a fraud detection infrastructure. The central node is controlled by a trade organization, and the worker nodes are banks that belong to that organization. The trade organization uses federated learning to provide an adaptive, robust and private fraud detection system for its partners.

The imports used in this notebook are the following:

- pandas
- NumPy
- scikit-learn
- PyTorch

In [1]:
import pandas as pd
import numpy as np


In [2]:
source_data_df = pd.read_csv('data/Fraud_Detection.csv')

In [3]:
source_data_df

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


## Formatting

The columns are:
- Row index = Sequential number of the logged transaction
- step = Unit of time, where one step is one hour in the real world
- type = Transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER
- amount = Transaction amount in local currency
- nameOrig = Customer who started the transaction
- oldbalanceOrg = Originator's balance before the transaction
- newbalanceOrig = Originator's balance after the transaction
- nameDest = Customer who is the recipient of the transaction
- oldbalanceDest = Recipient's balance before the transaction
- newbalanceDest = Recipient's balance after the transaction
- isFraud = 1 if the transaction was made by a fraudulent agent, otherwise 0
- isFlaggedFraud = Existing detection rule, which flags attempts to transfer more than 200,000 in a single transaction

To simulate fraud detection, we remove the following balance columns:
- oldbalanceOrg
- newbalanceOrig
- oldbalanceDest
- newbalanceDest

The isFlaggedFraud column is kept for comparison with the existing detection, but it should not be used for training a model.

After that, we need to modify the following columns:
- type = requires one-hot encoding into integer (0/1) columns
- nameOrig = requires string-to-integer encoding
- nameDest = requires string-to-integer encoding, sharing IDs with nameOrig
- amount = rounded to the nearest integer
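
The transformations listed above can be illustrated on a few hand-written rows. This is a sketch in plain Python rather than the pandas-based `formatting` function defined in the next cell; the helper names `one_hot` and `build_id_map` and the sample rows are hypothetical. As in the notebook, originators and recipients share a single ID space, so an account that appears in both columns receives the same integer.

```python
# Illustrative sketch of the column transformations on toy rows.
TYPES = ['CASH_IN', 'CASH_OUT', 'DEBIT', 'PAYMENT', 'TRANSFER']

def one_hot(transaction_type):
    """Return one 0/1 integer per known transaction type."""
    return [int(transaction_type == t) for t in TYPES]

def build_id_map(orig_names, dest_names):
    """Map account names to integers, originators first, starting at 1."""
    id_map = {}
    for name in list(orig_names) + list(dest_names):
        if name not in id_map:
            id_map[name] = len(id_map) + 1
    return id_map

rows = [
    # (type, amount, nameOrig, nameDest)
    ('PAYMENT', 9839.64, 'C1', 'M1'),
    ('TRANSFER', 181.00, 'C2', 'C1'),  # C1 also appears as an originator
]

id_map = build_id_map(
    [row[2] for row in rows],
    [row[3] for row in rows],
)

# Each encoded row: rounded amount, originator ID, recipient ID, one-hot type.
encoded = [
    [round(amount), id_map[orig], id_map[dest]] + one_hot(kind)
    for kind, amount, orig, dest in rows
]
for row in encoded:
    print(row)
```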

In [18]:
def formatting(
    source_df: pd.DataFrame
) -> pd.DataFrame:
    print('Formatting data')
    formated_df = source_df.copy()
    
    irrelevant_columns = [
        'oldbalanceOrg',
        'newbalanceOrig',
        'oldbalanceDest',
        'newbalanceDest'
    ]
    formated_df.drop(
        columns = irrelevant_columns, 
        inplace = True
    )
    print('Columns dropped')
    formated_df = pd.get_dummies(
        data = formated_df, 
        columns = ['type']
    )
    
    for column in formated_df.columns:
        if 'type' in column:
            formated_df[column] = formated_df[column].astype(int)
    print('One hot coded type')

    unique_values_orig = formated_df['nameOrig'].unique()
    unique_values_dest = formated_df['nameDest'].unique()
    
    unique_value_list_orig = unique_values_orig.tolist()
    unique_value_list_dest = unique_values_dest.tolist()

    print('Orig amount:', len(unique_value_list_orig))
    print('Dest amount:', len(unique_value_list_dest))
    
    set_orig_ids = set(unique_value_list_orig)
    set_dest_ids = set(unique_value_list_dest)
    intersection = set_dest_ids.intersection(set_orig_ids)

    print('Orig and Dest duplicates', len(intersection))
    
    set_dest_ids.difference_update(intersection)
    fixed_unique_value_list_dest = list(set_dest_ids)
    print('Fixed Dest amount:',len(fixed_unique_value_list_dest))
    
    orig_encoding_dict = {}
    index = 1
    for string in unique_value_list_orig:
        if not string in orig_encoding_dict:
            orig_encoding_dict[string] = index
            index = index + 1

    dest_encoding_dict = {}
    cont_index = len(orig_encoding_dict) + 1
    for string in fixed_unique_value_list_dest:
        if not string in dest_encoding_dict:
            dest_encoding_dict[string] = cont_index
            cont_index = cont_index + 1
    print('Orig dict amount:', len(orig_encoding_dict))
    print('Dest dict amount:', len(dest_encoding_dict))
    
    print('Orig and dest string-integer encodings created')

    string_orig_values = formated_df['nameOrig'].tolist()
    string_dest_values = formated_df['nameDest'].tolist()

    orig_encoded_values = []
    for string in string_orig_values:
        orig_encoded_values.append(orig_encoding_dict[string])

    dest_encoded_values = []
    for string in string_dest_values:
        if not string in dest_encoding_dict:
            dest_encoded_values.append(orig_encoding_dict[string])
            continue
        dest_encoded_values.append(dest_encoding_dict[string])

    formated_df['nameOrig'] = orig_encoded_values
    formated_df['nameDest'] = dest_encoded_values

    print('Orig encoded values amount:', len(orig_encoded_values))
    print('Dest encoded values amount:', len(dest_encoded_values))
    
    print('Orig and dest encodings set')

    formated_df['amount'] = formated_df['amount'].round(0).astype(int)
    print('Amount rounded')

    column_order = [
        'step',
        'amount',
        'nameOrig',
        'nameDest',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud',
        'isFlaggedFraud'
    ]
    formated_df = formated_df[column_order]
    print('Columns reordered')
    print('Dataframe shape:', formated_df.shape)
    print('Formatting done')
    return formated_df

In [19]:
formated_data_df = formatting(
    source_df = source_data_df
)

Formatting data
Columns dropped
One hot coded type
Orig amount: 6353307
Dest amount: 2722362
Orig and Dest duplicates 1769
Fixed Dest amount: 2720593
Orig dict amount: 6353307
Dest dict amount: 2720593
Orig and dest string-integer encodings created
Orig encoded values amount: 6362620
Dest encoded values amount: 6362620
Orig and dest encodings set
Amount rounded
Columns reordered
Dataframe shape: (6362620, 11)
Formatting done


In [20]:
formated_data_df

Unnamed: 0,step,amount,nameOrig,nameDest,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,isFraud,isFlaggedFraud
0,1,9840,1,6788653,0,0,0,1,0,0,0
1,1,1864,2,6647762,0,0,0,1,0,0,0
2,1,181,3,6405410,0,0,0,0,1,1,0
3,1,181,4,7291669,0,1,0,0,0,1,0
4,1,11668,5,8220099,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,339682,6353303,8111677,0,1,0,0,0,1,0
6362616,743,6311409,6353304,8024143,0,0,0,0,1,1,0
6362617,743,6311409,6353305,7595045,0,1,0,0,0,1,0
6362618,743,850003,6353306,7587114,0,0,0,0,1,1,0


In [21]:
formated_data_df.to_csv('data/Formated_Fraud_Detection_Data.csv', index = True)

In [9]:
df = pd.read_csv('data/Formated_Fraud_Detection_Data.csv')

## Regular Learning with PyTorch

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
from torch.optim import SGD
from torch.utils.data import DataLoader, TensorDataset

In [11]:
np.random.seed(42)

def preprocess_into_tensors(
    data_path: str,
    used_columns: list,
    rows: int,
    scaled_columns: list,
    target_column: str,
    set_seed: int
) -> any:
    df = pd.read_csv(data_path)
    
    # Copy the selected rows so the scaling below modifies an independent
    # DataFrame (avoids pandas' SettingWithCopyWarning)
    preprocessed_df = df[used_columns][:rows].copy()

    for column in scaled_columns:
        mean = preprocessed_df[column].mean()
        std_dev = preprocessed_df[column].std()
        preprocessed_df[column] = (preprocessed_df[column] - mean)/std_dev

    X = preprocessed_df.drop(target_column, axis = 1).values
    y = preprocessed_df[target_column].values
        
    X_train, X_test, y_train, y_test = train_test_split(
        X, 
        y, 
        test_size = 0.2, 
        random_state = set_seed
    )

    print('X train:',X_train.shape)
    print('X test:',X_test.shape)
    print('Y train:',y_train.shape)
    print('Y test:',y_test.shape)

    X_train = np.array(X_train, dtype=np.float32)
    X_test = np.array(X_test, dtype=np.float32)
    y_train = np.array(y_train, dtype=np.int32)
    y_test = np.array(y_test, dtype=np.int32)
    
    train_tensor = TensorDataset(
        torch.tensor(X_train), 
        torch.tensor(y_train, dtype=torch.float32)
    )
    test_tensor = TensorDataset(
        torch.tensor(X_test), 
        torch.tensor(y_test, dtype=torch.float32)
    )

    return X_train.shape[1], train_tensor, test_tensor
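
The scaling loop in `preprocess_into_tensors` standardizes each listed column to zero mean and unit standard deviation, using only the selected rows. A minimal sketch of the same arithmetic with the standard library follows; `statistics.stdev` is the sample standard deviation, which matches the default `ddof=1` of pandas `Series.std`. The function name `standardize` and the sample amounts are hypothetical.

```python
import statistics

def standardize(values):
    """Return (value - mean) / sample standard deviation for each value."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # sample std (n - 1 in the denominator)
    return [(value - mean) / std_dev for value in values]

amounts = [9840, 1864, 181, 181, 11668]
scaled = standardize(amounts)
print([round(value, 3) for value in scaled])
```

Because the statistics come from the selected rows only, two different row ranges are scaled with different means and standard deviations.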

In [19]:
class LogisticRegression(nn.Module):
    def __init__(self, dim, bias=True):
        super().__init__()
        self.dim = dim
        self.linear = nn.Linear(dim, 1, bias=bias)
        self.loss = nn.BCEWithLogitsLoss(reduction="mean")

    def forward(self, x):
        return self.linear(x).view(-1)

    @staticmethod
    def train_step(model, batch):
        x, y = batch
        out = model(x)
        loss = model.loss(out, y)
        return loss

    @staticmethod
    def test_step(model, batch):
        x, y = batch
        out = model(x)
        loss = model.loss(out, y)
        preds = out > 0 # Predict y = 1 if P(y = 1) > 0.5
        corrects = torch.tensor(torch.sum(preds == y).item())
        return loss, corrects

def get_loaders(
    set_seed: int,
    sample_rate: float,
    train_tensor: any,
    test_tensor: any
) -> any:
    train_loader = DataLoader(
        train_tensor,
        batch_size=int(len(train_tensor) * sample_rate),
        generator=torch.Generator().manual_seed(set_seed)
    )
    test_loader = DataLoader(test_tensor, 64)
    return train_loader,test_loader

def train(
    model: any, 
    train_loader: any, 
    opt_func: any, 
    learning_rate: float, 
    num_epochs: int,  
    random_seed: int, 
    verbose = True
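) -> None:
    optimizer = opt_func(model.parameters(), learning_rate)
    model_type = type(model)
    
    for epoch in range(num_epochs):
        losses = []
        for batch in train_loader:
            loss = model_type.train_step(model, batch)
            loss.backward()
            # Detach so the stored value does not keep the autograd graph alive
            losses.append(loss.detach())
            optimizer.step()
            optimizer.zero_grad()
        
        if verbose:
            # Note: this reports the last batch's loss divided by the number of
            # batches; torch.stack(losses).mean() would give the epoch mean.
            print("Epoch {}, loss = {}".format(epoch + 1, torch.sum(loss) / len(train_loader)))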
    
def test(
    model: any, 
    test_loader: any
) -> any:
    with torch.no_grad():
        losses = []
        accuracies = []
        total_size = 0
        
        for batch in test_loader:
            total_size += len(batch[1])
            loss, corrects = model.test_step(model, batch)
            losses.append(loss)
            accuracies.append(corrects)

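        # Note: `loss` is the last batch's mean loss, so this value is not the
        # mean over all batches; averaging `losses` would give that instead.
        average_loss = np.array(loss).sum() / total_size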
        total_accuracy = np.array(accuracies).sum() / total_size
        return average_loss, total_accuracy

def run_model_pipeline(
    set_seed: int,
    learning_rate: float,
    sample_rate: float,
    num_epochs: int,
    input_dim: int,
    train_tensor: any,
    test_tensor: any
) -> any:
    torch.manual_seed(set_seed)
    print('Loaders')
    given_train_loader, given_test_loader = get_loaders(
        set_seed,
        sample_rate,
        train_tensor,
        test_tensor
    )
    print('Model')
    lr_model = LogisticRegression(dim = input_dim)
    print('Train')
    train(
        model = lr_model, 
        train_loader = given_train_loader, 
        opt_func = torch.optim.SGD, 
        learning_rate = learning_rate, 
        num_epochs = num_epochs,  
        random_seed = set_seed, 
        verbose = True
    )
    
    print('Test')
    average_loss, total_accuracy = test(
        model = lr_model, 
        test_loader = given_test_loader
    )
    print('Complete')
    return average_loss, total_accuracy

In [13]:
input_dim, train_tensor, test_tensor = preprocess_into_tensors(
    data_path = 'data/Formated_Fraud_Detection_Data.csv',
    used_columns = [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud'
    ],
    rows = 10000,
    scaled_columns = [
        'amount'
    ],
    target_column = 'isFraud',
    set_seed = 42
)

X train: (8000, 6)
X test: (2000, 6)
Y train: (8000,)
Y test: (2000,)


In [21]:
run_model_pipeline(
    set_seed = 42,
    learning_rate = 0.001,
    sample_rate = 0.01,
    num_epochs = 5,
    input_dim = input_dim,
    train_tensor = train_tensor,
    test_tensor = test_tensor
)

Loaders
Model
Train
Epoch 1, loss = 0.005733905825763941
Epoch 2, loss = 0.005491575691848993
Epoch 3, loss = 0.005264477338641882
Epoch 4, loss = 0.005051491782069206
Epoch 5, loss = 0.004851583391427994
Test
Complete


(0.00022987823188304901, 0.908)
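
The batch size in `get_loaders` is derived from `sample_rate` rather than set directly. For the run above, the arithmetic works out as sketched below; the helper name `batch_plan` is hypothetical. `DataLoader` keeps a smaller final batch by default (`drop_last=False`), which the ceiling division accounts for.

```python
import math

def batch_plan(num_rows, sample_rate):
    """Return (batch_size, batches_per_epoch) when the last partial batch is kept."""
    batch_size = int(num_rows * sample_rate)
    batches_per_epoch = math.ceil(num_rows / batch_size)
    return batch_size, batches_per_epoch

# 8000 training rows and sample_rate = 0.01, as in the call above.
batch_size, batches_per_epoch = batch_plan(8000, 0.01)
print(batch_size, batches_per_epoch)
```

With these settings each epoch performs 100 optimizer steps on batches of 80 rows.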

## Federated Learning with PyTorch

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
from torch.optim import SGD
from torch.utils.data import DataLoader, TensorDataset


In [15]:
np.random.seed(42)

def preprocess_into_train_and_test_tensors(
    data_path: str,
    used_columns: list,
    start_row: int,
    end_row: int,
    scaled_columns: list,
    target_column: str,
    set_seed: int
) -> any:
    df = pd.read_csv(data_path)
    
    # Copy the selected rows so the scaling below modifies an independent
    # DataFrame (avoids pandas' SettingWithCopyWarning)
    preprocessed_df = df[used_columns][start_row:end_row].copy()

    for column in scaled_columns:
        mean = preprocessed_df[column].mean()
        std_dev = preprocessed_df[column].std()
        preprocessed_df[column] = (preprocessed_df[column] - mean)/std_dev

    X = preprocessed_df.drop(target_column, axis = 1).values
    y = preprocessed_df[target_column].values
        
    X_train, X_test, y_train, y_test = train_test_split(
        X, 
        y, 
        test_size = 0.2, 
        random_state = set_seed
    )

    print('X train:',X_train.shape)
    print('X test:',X_test.shape)
    print('Y train:',y_train.shape)
    print('Y test:',y_test.shape)

    X_train = np.array(X_train, dtype=np.float32)
    X_test = np.array(X_test, dtype=np.float32)
    y_train = np.array(y_train, dtype=np.int32)
    y_test = np.array(y_test, dtype=np.int32)
    
    train_tensor = TensorDataset(
        torch.tensor(X_train), 
        torch.tensor(y_train, dtype=torch.float32)
    )
    test_tensor = TensorDataset(
        torch.tensor(X_test), 
        torch.tensor(y_test, dtype=torch.float32)
    )

    return X_train.shape[0], X_train.shape[1], train_tensor, test_tensor

In [5]:
class FederatedLogisticRegression(nn.Module):
    def __init__(self, dim, bias=True):
        super().__init__()
        self.dim = dim
        self.linear = nn.Linear(dim, 1, bias=bias)
        self.loss = nn.BCEWithLogitsLoss(reduction="mean")

    def forward(self, x):
        return self.linear(x).view(-1)

    @staticmethod
    def train_step(model, batch):
        x, y = batch
        out = model(x)
        loss = model.loss(out, y)
        return loss

    @staticmethod
    def test_step(model, batch):
        x, y = batch
        out = model(x)
        loss = model.loss(out, y)
        preds = out > 0 # Predict y = 1 if P(y = 1) > 0.5
        corrects = torch.tensor(torch.sum(preds == y).item())
        return loss, corrects

    @staticmethod
    def get_parameters(model):
        return model.state_dict()

    @staticmethod
    def apply_parameters(model, parameters):
        model.load_state_dict(parameters)

def get_loaders(
    set_seed: int,
    sample_rate: float,
    train_tensor: any,
    test_tensor: any
) -> any:
    train_loader = DataLoader(
        train_tensor,
        batch_size=int(len(train_tensor) * sample_rate),
        generator=torch.Generator().manual_seed(set_seed)
    )
    test_loader = DataLoader(test_tensor, 64)
    return train_loader,test_loader

def train(
    model: any, 
    train_loader: any, 
    opt_func: any, 
    learning_rate: float, 
    num_epochs: int,  
    random_seed: int, 
    verbose = True
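) -> None:
    optimizer = opt_func(model.parameters(), learning_rate)
    model_type = type(model)
    
    for epoch in range(num_epochs):
        losses = []
        for batch in train_loader:
            loss = model_type.train_step(model, batch)
            loss.backward()
            # Detach so the stored value does not keep the autograd graph alive
            losses.append(loss.detach())
            optimizer.step()
            optimizer.zero_grad()
        
        if verbose:
            # Note: this reports the last batch's loss divided by the number of
            # batches; torch.stack(losses).mean() would give the epoch mean.
            print("Epoch {}, loss = {}".format(epoch + 1, torch.sum(loss) / len(train_loader)))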
   
def test(
    model: any, 
    test_loader: any
) -> any:
    with torch.no_grad():
        losses = []
        accuracies = []
        total_size = 0
        
        for batch in test_loader:
            total_size += len(batch[1])
            loss, corrects = model.test_step(model, batch)
            losses.append(loss)
            accuracies.append(corrects)

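        # Note: `loss` is the last batch's mean loss, so this value is not the
        # mean over all batches; averaging `losses` would give that instead.
        average_loss = np.array(loss).sum() / total_size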
        total_accuracy = np.array(accuracies).sum() / total_size
        return average_loss, total_accuracy

def federated_model_pipeline(
    given_parameters: any,
    set_seed: int,
    learning_rate: float,
    sample_rate: float,
    num_epochs: int,
    input_dim: int,
    train_tensor: any,
    test_tensor: any
) -> any:
    torch.manual_seed(set_seed)
    print('Loaders')
    given_train_loader, given_test_loader = get_loaders(
        set_seed,
        sample_rate,
        train_tensor,
        test_tensor
    )
    print('Fed Model')
    lr_model = FederatedLogisticRegression(dim = input_dim)
    if given_parameters: 
        lr_model.apply_parameters(lr_model,given_parameters)
    print('Train')
    train(
        model = lr_model, 
        train_loader = given_train_loader, 
        opt_func = torch.optim.SGD, 
        learning_rate = learning_rate, 
        num_epochs = num_epochs,  
        random_seed = set_seed, 
        verbose = True
    )
    
    print('Test')
    average_loss, total_accuracy = test(
        model = lr_model, 
        test_loader = given_test_loader
    )
    print('Complete')

    parameters = lr_model.get_parameters(lr_model)
    return average_loss, total_accuracy, parameters

### Central Initialization

In [8]:
central_sample_size_1, input_dim, central_train_tensor_1, central_test_tensor_1 = preprocess_into_train_and_test_tensors(
    data_path = 'data/Formated_Fraud_Detection_Data.csv',
    used_columns = [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud'
    ],
    start_row = 0,
    end_row = 10000,
    scaled_columns = [
        'amount'
    ],
    target_column = 'isFraud',
    set_seed = 42
)

X train: (8000, 6)
X test: (2000, 6)
Y train: (8000,)
Y test: (2000,)


In [9]:
loss, accuracy, global_model_1 = federated_model_pipeline(
    given_parameters = None,
    set_seed = 42,
    learning_rate = 0.001,
    sample_rate = 0.01,
    num_epochs = 5,
    input_dim = input_dim,
    train_tensor = central_train_tensor_1,
    test_tensor = central_test_tensor_1
)
print(loss)
print(accuracy)
print(global_model_1)

Loaders
Fed Model
Train
Epoch 1, loss = 0.005733905825763941
Epoch 2, loss = 0.005491575691848993
Epoch 3, loss = 0.005264477338641882
Epoch 4, loss = 0.005051491782069206
Epoch 5, loss = 0.004851583391427994
Test
Complete
0.00022987823188304901
0.908
OrderedDict([('linear.weight', tensor([[ 0.2859,  0.2883, -0.1216,  0.3666, -0.1894,  0.0594]])), ('linear.bias', tensor([-0.4066]))])
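
The printed `state_dict` holds everything the logistic regression needs for a prediction: the logit is the dot product of the weights and the features plus the bias, and `test_step` predicts fraud when the logit is positive. The following worked example in plain Python uses the rounded values printed above and a hypothetical PAYMENT transaction whose standardized amount is 0. The feature order follows `used_columns` without the target column.

```python
import math

# Rounded parameters copied from the printed global_model_1.
weights = [0.2859, 0.2883, -0.1216, 0.3666, -0.1894, 0.0594]
bias = -0.4066

# Hypothetical input: standardized amount, then the five one-hot type columns
# in the order CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER.
features = [0.0, 0, 0, 0, 1, 0]

logit = sum(w * x for w, x in zip(weights, features)) + bias
probability = 1 / (1 + math.exp(-logit))  # sigmoid
predicted_fraud = logit > 0  # equivalent to probability > 0.5

print(round(logit, 4), round(probability, 4), predicted_fraud)
```

These same seven numbers are what a worker receives from central and sends back after local training.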


### Worker 1 Update

In [10]:
worker_1_sample_size_1, input_dim, worker_1_train_tensor_1, worker_1_test_tensor_1 = preprocess_into_train_and_test_tensors(
    data_path = 'data/Formated_Fraud_Detection_Data.csv',
    used_columns = [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud'
    ],
    start_row = 10000,
    end_row = 20000,
    scaled_columns = [
        'amount'
    ],
    target_column = 'isFraud',
    set_seed = 42
)

X train: (8000, 6)
X test: (2000, 6)
Y train: (8000,)
Y test: (2000,)


In [11]:
loss, accuracy, worker_1_model_1 = federated_model_pipeline(
    given_parameters = global_model_1,
    set_seed = 42,
    learning_rate = 0.001,
    sample_rate = 0.01,
    num_epochs = 5,
    input_dim = input_dim,
    train_tensor = worker_1_train_tensor_1,
    test_tensor = worker_1_test_tensor_1
)
print(loss)
print(accuracy)
print(worker_1_model_1)

Loaders
Fed Model
Train
Epoch 1, loss = 0.004760765470564365
Epoch 2, loss = 0.004586684051901102
Epoch 3, loss = 0.004422419238835573
Epoch 4, loss = 0.004267293494194746
Epoch 5, loss = 0.004120686091482639
Test
Complete
0.00020317628979682923
0.958
OrderedDict([('linear.weight', tensor([[ 0.2548,  0.2538, -0.1625,  0.3646, -0.2644,  0.0338]])), ('linear.bias', tensor([-0.5846]))])


### Worker 2 Update

In [12]:
worker_2_sample_size_1,input_dim, worker_2_train_tensor_1, worker_2_test_tensor_1 = preprocess_into_train_and_test_tensors(
    data_path = 'data/Formated_Fraud_Detection_Data.csv',
    used_columns = [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud'
    ],
    start_row = 20000,
    end_row = 30000,
    scaled_columns = [
        'amount'
    ],
    target_column = 'isFraud',
    set_seed = 42
)

X train: (8000, 6)
X test: (2000, 6)
Y train: (8000,)
Y test: (2000,)


In [13]:
loss, accuracy, worker_2_model_1 = federated_model_pipeline(
    given_parameters = global_model_1,
    set_seed = 42,
    learning_rate = 0.001,
    sample_rate = 0.01,
    num_epochs = 5,
    input_dim = input_dim,
    train_tensor = worker_2_train_tensor_1,
    test_tensor = worker_2_test_tensor_1
)
print(loss)
print(accuracy)
print(worker_2_model_1)

Loaders
Fed Model
Train
Epoch 1, loss = 0.004955528303980827
Epoch 2, loss = 0.004771655425429344
Epoch 3, loss = 0.004597960971295834
Epoch 4, loss = 0.004433786496520042
Epoch 5, loss = 0.004278519656509161
Test
Complete
0.00018141770362854005
0.9575
OrderedDict([('linear.weight', tensor([[ 0.2526,  0.2437, -0.1726,  0.3647, -0.2506,  0.0369]])), ('linear.bias', tensor([-0.5878]))])


### Central FedAvg

In [23]:
from collections import OrderedDict

received_updates = [
    {'parameters':worker_1_model_1, 'samples': worker_1_sample_size_1},
    {'parameters':worker_2_model_1, 'samples': worker_2_sample_size_1}
]

collective_sample_size = 0
for update in received_updates:
    print(update['samples'])
    collective_sample_size += update['samples']
    
weights = []
biases = []

print(collective_sample_size)
for update in received_updates:
    parameters = update['parameters']
    worker_sample_size = update['samples']
    worker_weights = np.array(parameters['linear.weight'].tolist()[0])
    worker_bias = parameters['linear.bias'].tolist()[0]
    print(worker_weights,worker_bias)

    adjusted_worker_weights = worker_weights * (worker_sample_size/collective_sample_size)
    adjusted_worker_bias = worker_bias * (worker_sample_size/collective_sample_size)
    
    weights.append(adjusted_worker_weights.tolist())
    biases.append(adjusted_worker_bias)

weights = np.array(weights)
biases = np.array(biases)

FedAvg_weight = [np.sum(weights,axis = 0)]
FedAvg_bias = [np.sum(biases, axis = 0)]

print(FedAvg_weight,FedAvg_bias)

global_model_2 = OrderedDict([
    ('linear.weight', torch.tensor(FedAvg_weight,dtype=torch.float32)),
    ('linear.bias', torch.tensor(FedAvg_bias,dtype=torch.float32))
])
print(global_model_2)

8000
8000
16000
[ 0.254816    0.2537528  -0.16251254  0.36463702 -0.26442271  0.03384   ] -0.5846246480941772
[ 0.25262707  0.2436935  -0.17256512  0.36465079 -0.25056174  0.0369154 ] -0.5877873301506042
[array([ 0.25372154,  0.24872315, -0.16753883,  0.3646439 , -0.25749223,
        0.0353777 ])] [-0.5862059891223907]
OrderedDict([('linear.weight', tensor([[ 0.2537,  0.2487, -0.1675,  0.3646, -0.2575,  0.0354]])), ('linear.bias', tensor([-0.5862]))])
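
The cell above implements federated averaging for the two entries of this specific model: each worker's parameters are multiplied by that worker's share of the total training sample count, and the results are summed. A minimal general sketch in plain Python follows; the function name `fed_avg` is hypothetical, and the inputs are the values printed above.

```python
def fed_avg(updates):
    """Sample-weighted average of parameter vectors.

    updates: list of (parameters, sample_count) pairs, where parameters
    is a flat list of floats with the same length for every worker.
    """
    total_samples = sum(samples for _, samples in updates)
    averaged = [0.0] * len(updates[0][0])
    for parameters, samples in updates:
        share = samples / total_samples
        for i, value in enumerate(parameters):
            averaged[i] += value * share
    return averaged

# Six weights followed by the bias, copied from the output above.
worker_1 = [0.254816, 0.2537528, -0.16251254, 0.36463702,
            -0.26442271, 0.03384, -0.5846246480941772]
worker_2 = [0.25262707, 0.2436935, -0.17256512, 0.36465079,
            -0.25056174, 0.0369154, -0.5877873301506042]

global_parameters = fed_avg([(worker_1, 8000), (worker_2, 8000)])
print([round(value, 4) for value in global_parameters])
```

Because both workers contributed 8000 training samples, the weighted average reduces to the plain mean here; with unequal sample counts the larger worker would pull the result toward its own parameters.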



### Global Model Evaluation 1

In [16]:
np.random.seed(42)

def preprocess_into_evaluation_tensor(
    data_path: str,
    used_columns: list,
    start_row: int,
    end_row: int,
    scaled_columns: list,
    target_column: str,
    set_seed: int
) -> any:
    df = pd.read_csv(data_path)
    
    # Copy the selected rows so the scaling below modifies an independent
    # DataFrame (avoids pandas' SettingWithCopyWarning)
    preprocessed_df = df[used_columns][start_row:end_row].copy()

    for column in scaled_columns:
        mean = preprocessed_df[column].mean()
        std_dev = preprocessed_df[column].std()
        preprocessed_df[column] = (preprocessed_df[column] - mean)/std_dev

    X_test = preprocessed_df.drop(target_column, axis = 1).values
    y_test = preprocessed_df[target_column].values
        
    print('X test:',X_test.shape)
    print('Y test:',y_test.shape)

    X_test = np.array(X_test, dtype=np.float32)
    y_test = np.array(y_test, dtype=np.int32)
    
    test_tensor = TensorDataset(
        torch.tensor(X_test), 
        torch.tensor(y_test, dtype=torch.float32)
    )

    return X_test.shape[0], X_test.shape[1], test_tensor

In [21]:
class FederatedLogisticRegression(nn.Module):
    def __init__(self, dim, bias=True):
        super().__init__()
        self.dim = dim
        self.linear = nn.Linear(dim, 1, bias=bias)
        self.loss = nn.BCEWithLogitsLoss(reduction="mean")

    def forward(self, x):
        return self.linear(x).view(-1)

    @staticmethod
    def train_step(model, batch):
        x, y = batch
        out = model(x)
        loss = model.loss(out, y)
        return loss

    @staticmethod
    def test_step(model, batch):
        x, y = batch
        out = model(x)
        loss = model.loss(out, y)
        preds = out > 0 # Predict y = 1 if P(y = 1) > 0.5
        corrects = torch.tensor(torch.sum(preds == y).item())
        return loss, corrects

    @staticmethod
    def get_parameters(model):
        return model.state_dict()

    @staticmethod
    def apply_parameters(model, parameters):
        model.load_state_dict(parameters)

def test(
    model: any, 
    test_loader: any
) -> any:
    with torch.no_grad():
        losses = []
        accuracies = []
        total_size = 0
        
        for batch in test_loader:
            total_size += len(batch[1])
            loss, corrects = model.test_step(model, batch)
            losses.append(loss)
            accuracies.append(corrects)

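        # Note: `loss` is the last batch's mean loss, so this value is not the
        # mean over all batches; averaging `losses` would give that instead.
        average_loss = np.array(loss).sum() / total_size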
        total_accuracy = np.array(accuracies).sum() / total_size
        return average_loss, total_accuracy

def federated_model_evaluation(
    given_parameters: any,
    set_seed: int,
    input_dim: int,
    evaluation_tensor: any
) -> any:
    torch.manual_seed(set_seed)
    print('Loader')
    given_evaluation_loader = DataLoader(evaluation_tensor, 64)
    
    print('Fed Model')
    lr_model = FederatedLogisticRegression(dim = input_dim)
    lr_model.apply_parameters(lr_model,given_parameters)
    
    print('Test')
    average_loss, total_accuracy = test(
        model = lr_model, 
        test_loader = given_evaluation_loader
    )
    print('Complete')
    return average_loss, total_accuracy

In [19]:
evaluation_sample_size_1, input_dim, evaluation_tensor_1 = preprocess_into_evaluation_tensor(
    data_path = 'data/Formated_Fraud_Detection_Data.csv',
    used_columns = [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud'
    ],
    start_row = 30000,
    end_row = 40000,
    scaled_columns = [
        'amount'
    ],
    target_column = 'isFraud',
    set_seed = 42
)

X test: (10000, 6)
Y test: (10000,)


In [24]:
loss, accuracy = federated_model_evaluation(
    given_parameters = global_model_2,
    set_seed = 42,
    input_dim = input_dim,
    evaluation_tensor = evaluation_tensor_1
)
print(loss)
print(accuracy)

Loader
Fed Model
Test
Complete
4.322848916053772e-05
0.9636


### Worker Node 1 Reupdate

In [25]:
worker_1_sample_size_2, input_dim, worker_1_train_tensor_2, worker_1_test_tensor_2 = preprocess_into_train_and_test_tensors(
    data_path = 'data/Formated_Fraud_Detection_Data.csv',
    used_columns = [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud'
    ],
    start_row = 40000,
    end_row = 50000,
    scaled_columns = [
        'amount'
    ],
    target_column = 'isFraud',
    set_seed = 42
)

X train: (8000, 6)
X test: (2000, 6)
Y train: (8000,)
Y test: (2000,)


In [26]:
loss, accuracy, worker_1_model_2 = federated_model_pipeline(
    given_parameters = global_model_2,
    set_seed = 42,
    learning_rate = 0.001,
    sample_rate = 0.01,
    num_epochs = 5,
    input_dim = input_dim,
    train_tensor = worker_1_train_tensor_2,
    test_tensor = worker_1_test_tensor_2
)
print(loss)
print(accuracy)
print(worker_1_model_2)

Loaders
Fed Model
Train
Epoch 1, loss = 0.00407393230125308
Epoch 2, loss = 0.003936012275516987
Epoch 3, loss = 0.003805415239185095
Epoch 4, loss = 0.003681669943034649
Epoch 5, loss = 0.0035643333103507757
Test
Complete
0.0001800934076309204
0.98
OrderedDict([('linear.weight', tensor([[ 0.2236,  0.2144, -0.2236,  0.3635, -0.3020,  0.0143]])), ('linear.bias', tensor([-0.7433]))])


### Worker Node 2 Reupdate

In [27]:
worker_2_sample_size_2,input_dim, worker_2_train_tensor_2, worker_2_test_tensor_2 = preprocess_into_train_and_test_tensors(
    data_path = 'data/Formated_Fraud_Detection_Data.csv',
    used_columns = [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud'
    ],
    start_row = 50000,
    end_row = 60000,
    scaled_columns = [
        'amount'
    ],
    target_column = 'isFraud',
    set_seed = 42
)

X train: (8000, 6)
X test: (2000, 6)
Y train: (8000,)
Y test: (2000,)


In [28]:
loss, accuracy, worker_2_model_2 = federated_model_pipeline(
    given_parameters = global_model_2,
    set_seed = 42,
    learning_rate = 0.001,
    sample_rate = 0.01,
    num_epochs = 5,
    input_dim = input_dim,
    train_tensor = worker_2_train_tensor_2,
    test_tensor = worker_2_test_tensor_2
)
print(loss)
print(accuracy)
print(worker_2_model_2)

Loaders
Fed Model
Train
Epoch 1, loss = 0.0038810980040580034
Epoch 2, loss = 0.0037517123855650425
Epoch 3, loss = 0.0036292257718741894
Epoch 4, loss = 0.0035131804179400206
Epoch 5, loss = 0.003403153968974948
Test
Complete
0.00017841057479381562
0.971
OrderedDict([('linear.weight', tensor([[ 0.2244,  0.2080, -0.2197,  0.3632, -0.3028,  0.0167]])), ('linear.bias', tensor([-0.7445]))])


## Final Central FedAvg

In [29]:
from collections import OrderedDict

# Each worker reports its trained parameters and its training sample count
received_updates = [
    {'parameters':worker_1_model_2, 'samples': worker_1_sample_size_2},
    {'parameters':worker_2_model_2, 'samples': worker_2_sample_size_2}
]

# Total number of training samples across all workers
collective_sample_size = 0
for update in received_updates:
    print(update['samples'])
    collective_sample_size += update['samples']

weights = []
biases = []

print(collective_sample_size)
for update in received_updates:
    parameters = update['parameters']
    worker_sample_size = update['samples']
    worker_weights = np.array(parameters['linear.weight'].tolist()[0])
    worker_bias = parameters['linear.bias'].tolist()[0]
    print(worker_weights,worker_bias)

    # Scale each worker's parameters by its share of the total samples
    adjusted_worker_weights = worker_weights * (worker_sample_size/collective_sample_size)
    adjusted_worker_bias = worker_bias * (worker_sample_size/collective_sample_size)

    weights.append(adjusted_worker_weights.tolist())
    biases.append(adjusted_worker_bias)

weights = np.array(weights)
biases = np.array(biases)

# Summing the scaled parameters gives the sample-weighted average (FedAvg)
FedAvg_weight = [np.sum(weights,axis = 0)]
FedAvg_bias = [np.sum(biases, axis = 0)]

print(FedAvg_weight,FedAvg_bias)

# Rebuild a state dict that can be loaded into the model
global_model_3 = OrderedDict([
    ('linear.weight', torch.tensor(FedAvg_weight,dtype=torch.float32)),
    ('linear.bias', torch.tensor(FedAvg_bias,dtype=torch.float32))
])
print(global_model_3)

8000
8000
16000
[ 0.22358397  0.21440691 -0.22364609  0.36348364 -0.30195028  0.01430325] -0.7433218955993652
[ 0.22438714  0.2079628  -0.21968411  0.36321157 -0.30277276  0.01668498] -0.7445172071456909
[array([ 0.22398555,  0.21118485, -0.2216651 ,  0.3633476 , -0.30236152,
        0.01549411])] [-0.7439195513725281]
OrderedDict([('linear.weight', tensor([[ 0.2240,  0.2112, -0.2217,  0.3633, -0.3024,  0.0155]])), ('linear.bias', tensor([-0.7439]))])
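
The aggregation cell above is tied to the two parameter names `linear.weight` and `linear.bias`. As a minimal sketch of how the same sample-weighted average could be written for any set of parameter names, the hypothetical helper `fed_avg` below (not part of the notebook) takes updates in the same `{'parameters': ..., 'samples': ...}` shape and returns NumPy arrays. The toy values in `example_updates` are made up for illustration.

```python
from collections import OrderedDict

import numpy as np


def fed_avg(received_updates):
    """Sample-weighted average of worker parameters (hypothetical helper)."""
    # Total number of training samples reported by all workers
    total_samples = sum(update['samples'] for update in received_updates)

    averaged = OrderedDict()
    for update in received_updates:
        # Fraction of the total data this worker trained on
        share = update['samples'] / total_samples
        for name, value in update['parameters'].items():
            contribution = np.asarray(value, dtype=np.float64) * share
            # Accumulate the scaled parameters under the same name
            averaged[name] = averaged.get(name, 0.0) + contribution
    return averaged


# Toy updates: the second worker has three times as many samples
example_updates = [
    {'parameters': {'linear.weight': [[0.0, 2.0]], 'linear.bias': [1.0]}, 'samples': 1},
    {'parameters': {'linear.weight': [[4.0, 2.0]], 'linear.bias': [-3.0]}, 'samples': 3},
]
example_global = fed_avg(example_updates)
print(example_global)
```

To load the result into the model, each array would still need to be wrapped with `torch.tensor(value, dtype=torch.float32)`, as the cell above does for `global_model_3`.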


## Final Evaluation

In [30]:
evaluation_sample_size_2, input_dim, evaluation_tensor_2 = preprocess_into_evaluation_tensor(
    data_path = 'data/Formated_Fraud_Detection_Data.csv',
    used_columns = [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud'
    ],
    start_row = 60000,
    end_row = 70000,
    scaled_columns = [
        'amount'
    ],
    target_column = 'isFraud',
    set_seed = 42
)

X test: (10000, 6)
Y test: (10000,)


In [31]:
loss, accuracy = federated_model_evaluation(
    given_parameters = global_model_3,
    set_seed = 42,
    input_dim = input_dim,
    evaluation_tensor = evaluation_tensor_2
)
print(loss)
print(accuracy)

Loader
Fed Model
Test
Complete
3.361541330814361e-05
0.9787


## Central and Worker ML Pipelines

By studying federated learning with PyTorch, we conclude that the central and worker nodes require the following functions, background jobs and routes:

central:
- functions:
    - format_data (optional) 
    - preprocess_data 
    - initilize_global_model
    - evaluate_global_model
    - list_workers
    - fed_avg
    - update_global_model
    - prepare_worker_data
    - send_worker_data
- background:
    - create/check_workers
    - check_updates
    - collect_model_metrics
- routes:
    - receive_update
    - get_central_logs
    - inference
    - start_learning

worker:
- functions:
    - preprocess_data
    - set_local_model
    - evaluate_local_model
- background:
    - train_local_model
    - collect_model_metrics
- routes:
    - receive_configuration 
    - receive_worker_data 
    - receive_global_model
    - get_worker_logs
    - inference
    - start_learning      

In [32]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
from torch.optim import SGD
from torch.utils.data import DataLoader, TensorDataset

## Common Functions

In [None]:
class FederatedLogisticRegression(nn.Module):
    def __init__(self, dim, bias=True):
        super().__init__()
        self.dim = dim
        self.linear = nn.Linear(dim, 1, bias=bias)
        self.loss = nn.BCEWithLogitsLoss(reduction="mean")

    def forward(self, x):
        return self.linear(x).view(-1)

    @staticmethod
    def train_step(model, batch):
        x, y = batch
        out = model(x)
        loss = model.loss(out, y)
        return loss

    @staticmethod
    def test_step(model, batch):
        x, y = batch
        out = model(x)
        loss = model.loss(out, y)
        preds = out > 0 # Predict y = 1 if P(y = 1) > 0.5
        corrects = torch.tensor(torch.sum(preds == y).item())
        return loss, corrects

    @staticmethod
    def get_parameters(model):
        return model.state_dict()

    @staticmethod
    def apply_parameters(model, parameters):
        model.load_state_dict(parameters)
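
The following sketch illustrates how `apply_parameters` and `get_parameters` could move a model between central and a worker: the worker loads the global state dict, takes one local training step, and reads back the updated state dict. It uses a trimmed copy of the class above so that it runs on its own. The single plain SGD step and the toy batch are assumptions for illustration and may differ from what `federated_model_pipeline` does.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


class FederatedLogisticRegression(nn.Module):
    # Trimmed copy of the class above, kept only to make the sketch runnable
    def __init__(self, dim, bias=True):
        super().__init__()
        self.linear = nn.Linear(dim, 1, bias=bias)
        self.loss = nn.BCEWithLogitsLoss(reduction="mean")

    def forward(self, x):
        return self.linear(x).view(-1)

    @staticmethod
    def get_parameters(model):
        return model.state_dict()

    @staticmethod
    def apply_parameters(model, parameters):
        model.load_state_dict(parameters)


# Global parameters as central would send them (rounded values from the FedAvg output)
global_parameters = OrderedDict([
    ('linear.weight', torch.tensor([[0.2240, 0.2112, -0.2217, 0.3633, -0.3024, 0.0155]])),
    ('linear.bias', torch.tensor([-0.7439])),
])

# Worker side: start from the global model
worker_model = FederatedLogisticRegression(dim=6)
FederatedLogisticRegression.apply_parameters(worker_model, global_parameters)

# One illustrative SGD step on a toy batch (4 rows, 6 features)
x = torch.ones(4, 6)
y = torch.tensor([0.0, 0.0, 1.0, 0.0])
optimizer = torch.optim.SGD(worker_model.parameters(), lr=0.001)
loss = worker_model.loss(worker_model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The updated state dict is what the worker would return to central
local_parameters = FederatedLogisticRegression.get_parameters(worker_model)
print(local_parameters)
```

Because `load_state_dict` copies values into the model, the tensors in `global_parameters` are left unchanged by the local step.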

### Central Functions

### Worker Functions

In [None]:
def preprocess_into_train_and_test_tensors(
    data_path: str,
    used_columns: list,
    start_row: int,
    end_row: int,
    scaled_columns: list,
    target_column: str,
    set_seed: int
) -> tuple:
    np.random.seed(set_seed)
    df = pd.read_csv(data_path)
    
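    # Select the used columns and this node's row range; copy so the scaling
    # below does not write into a slice of df (avoids SettingWithCopyWarning)
    preprocessed_df = df[used_columns].iloc[start_row:end_row].copy()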

    for column in scaled_columns:
        mean = preprocessed_df[column].mean()
        std_dev = preprocessed_df[column].std()
        preprocessed_df[column] = (preprocessed_df[column] - mean)/std_dev

    X = preprocessed_df.drop(target_column, axis = 1).values
    y = preprocessed_df[target_column].values
        
    X_train, X_test, y_train, y_test = train_test_split(
        X, 
        y, 
        test_size = 0.2, 
        random_state = set_seed
    )

    print('X train:',X_train.shape)
    print('X test:',X_test.shape)
    print('Y train:',y_train.shape)
    print('Y test:',y_test.shape)

    X_train = np.array(X_train, dtype=np.float32)
    X_test = np.array(X_test, dtype=np.float32)
    y_train = np.array(y_train, dtype=np.int32)
    y_test = np.array(y_test, dtype=np.int32)
    
    train_tensor = TensorDataset(
        torch.tensor(X_train), 
        torch.tensor(y_train, dtype=torch.float32)
    )
    test_tensor = TensorDataset(
        torch.tensor(X_test), 
        torch.tensor(y_test, dtype=torch.float32)
    )

    return X_train.shape[0], X_train.shape[1], train_tensor, test_tensor

The packages imported above can be installed with:
- pip install pandas
- pip install numpy
- pip install scikit-learn
- pip install torch

## Worker ML Pipeline