# Central and Worker 

This notebook goes over the necessary code for central and worker federated learning agents, which have their own machine learning pipelines that enable the following incremental actions:
1. Global model initialization in central
2. Sending initial model to workers
3. Training a new model in workers
4. Returning model updates to central
5. Aggregating updates into a global model
6. Repeating steps 2 to 5 until the model converges
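
The steps listed above can be sketched as a small federated averaging loop. This is a minimal illustration under stated assumptions, not the pipeline built later in this notebook: the function names (`local_update`, `aggregate`, `run_rounds`), the toy one-dimensional linear model, and the synthetic worker data are all hypothetical, and the aggregation rule is assumed to be a sample-count weighted average of worker weights.

```python
import random


def local_update(weights, data, lr=0.1):
    """Steps 3 and 4: a worker runs one pass of SGD on y = w*x + b and returns its weights."""
    w, b = weights
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return [w, b]


def aggregate(updates, sizes):
    """Step 5: average worker weights, weighted by local sample count."""
    total = sum(sizes)
    return [
        sum(u[i] * n for u, n in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    ]


def run_rounds(worker_data, rounds=50, tol=1e-6):
    global_weights = [0.0, 0.0]  # step 1: global model initialization
    sizes = [len(d) for d in worker_data]
    for _ in range(rounds):
        # Step 2: every worker starts from a copy of the current global model
        updates = [local_update(list(global_weights), d) for d in worker_data]
        new_weights = aggregate(updates, sizes)
        change = max(abs(a - b) for a, b in zip(new_weights, global_weights))
        global_weights = new_weights
        if change < tol:  # step 6: stop once the global model no longer changes
            break
    return global_weights


random.seed(0)
# Three hypothetical workers holding noiseless samples of y = 2x + 1
worker_data = [
    [(x, 2 * x + 1) for x in (random.uniform(-1, 1) for _ in range(n))]
    for n in (20, 30, 50)
]
weights = run_rounds(worker_data, rounds=200)
```

Weighting by sample count means that a worker with more local data has more influence on the global model. Only model weights are exchanged in this sketch; the raw samples never leave the workers.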

In this project we use the [Synthetic Financial Datasets For Fraud Detection](https://www.kaggle.com/datasets/ealaxi/paysim1/data) to simulate a fraud detection infrastructure. The central node is controlled by a trade organisation, and the worker nodes are different banks that belong to that organisation. The trade organisation decides to use federated learning to provide an adaptive, robust and private fraud detection system for its partners.

The imports we will use in this notebook are the following:

- Pandas
- Numpy
- Scikit-learn

In [1]:
import pandas as pd
import numpy as np



In [2]:
source_data_df = pd.read_csv('data/Fraud_Detection.csv')

In [3]:
source_data_df

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


## Formatting

The columns are:
- Row index = Running number of the logged transaction
- step = Unit of simulated time, where one step is one hour in the real world
- type = Transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER
- amount = Transaction amount in local currency
- nameOrig = Customer who started the transaction
- oldbalanceOrg = Originator's balance before the transaction
- newbalanceOrig = Originator's balance after the transaction
- nameDest = Customer who is the recipient of the transaction
- oldbalanceDest = Recipient's balance before the transaction
- newbalanceDest = Recipient's balance after the transaction
- isFraud = 1 if the transaction was made by a fraudulent agent, otherwise 0
- isFlaggedFraud = Existing detection, which flags attempts to transfer more than 200,000 in a single transaction

In order to simulate fraud detection, we need to leave the following columns out of training:
- oldbalanceOrg (dropped)
- newbalanceOrig (dropped)
- oldbalanceDest (dropped)
- newbalanceDest (dropped)
- isFlaggedFraud (kept in the formatted data for comparison, but not used for training a model)

After that, we need to modify the following columns:
- type = requires one-hot encoding using integers
- nameOrig = requires string-to-integer encoding
- nameDest = requires string-to-integer encoding
- amount = rounded to the nearest integer
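
The column modifications listed above can be illustrated on a few toy rows. This is a minimal standard-library sketch, assuming that originators and recipients share one integer ID space, with originators numbered first and recipients that never act as originators continuing the numbering. The helper names (`build_shared_encoding`, `one_hot`) and the shortened account names are hypothetical, and the IDs here follow first-appearance order, which may differ from the order produced by the `formatting` function below.

```python
TYPES = ['CASH_IN', 'CASH_OUT', 'DEBIT', 'PAYMENT', 'TRANSFER']


def build_shared_encoding(orig_names, dest_names):
    """Map account names to integers, numbering originators first."""
    encoding = {}
    for name in orig_names:
        if name not in encoding:
            encoding[name] = len(encoding) + 1
    # Recipients that already appeared as originators keep their existing ID
    for name in dest_names:
        if name not in encoding:
            encoding[name] = len(encoding) + 1
    return encoding


def one_hot(value, categories):
    """Return a list of 0/1 integers with a single 1 at the matching category."""
    return [int(value == category) for category in categories]


# Toy rows with hypothetical account names
rows = [
    {'type': 'PAYMENT', 'amount': 9839.64, 'nameOrig': 'C1', 'nameDest': 'M1'},
    {'type': 'TRANSFER', 'amount': 181.00, 'nameOrig': 'C2', 'nameDest': 'C1'},
    {'type': 'CASH_OUT', 'amount': 339682.13, 'nameOrig': 'C3', 'nameDest': 'C4'},
]

encoding = build_shared_encoding(
    [row['nameOrig'] for row in rows],
    [row['nameDest'] for row in rows],
)

encoded_rows = [
    {
        'amount': round(row['amount']),  # nearest integer
        'nameOrig': encoding[row['nameOrig']],
        'nameDest': encoding[row['nameDest']],
        'type': one_hot(row['type'], TYPES),
    }
    for row in rows
]
```

Because `C1` appears both as an originator and as a recipient, it receives the same integer in both columns, which keeps the two columns comparable.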

In [18]:
def formatting(
    source_df: pd.DataFrame
) -> pd.DataFrame:
    print('Formatting data')
    formated_df = source_df.copy()
    
    irrelevant_columns = [
        'oldbalanceOrg',
        'newbalanceOrig',
        'oldbalanceDest',
        'newbalanceDest'
    ]
    formated_df.drop(
        columns = irrelevant_columns, 
        inplace = True
    )
    print('Columns dropped')
    formated_df = pd.get_dummies(
        data = formated_df, 
        columns = ['type']
    )
    
    for column in formated_df.columns:
        if 'type' in column:
            formated_df[column] = formated_df[column].astype(int)
    print('One hot coded type')

    unique_values_orig = formated_df['nameOrig'].unique()
    unique_values_dest = formated_df['nameDest'].unique()
    
    unique_value_list_orig = unique_values_orig.tolist()
    unique_value_list_dest = unique_values_dest.tolist()

    print('Orig amount:', len(unique_value_list_orig))
    print('Dest amount:', len(unique_value_list_dest))
    
    set_orig_ids = set(unique_value_list_orig)
    set_dest_ids = set(unique_value_list_dest)
    intersection = set_dest_ids.intersection(set_orig_ids)

    print('Orig and Dest duplicates', len(intersection))
    
    set_dest_ids.difference_update(intersection)
    fixed_unique_value_list_dest = list(set_dest_ids)
    print('Fixed Dest amount:',len(fixed_unique_value_list_dest))
    
    orig_encoding_dict = {}
    index = 1
    for string in unique_value_list_orig:
        if not string in orig_encoding_dict:
            orig_encoding_dict[string] = index
            index = index + 1

    dest_encoding_dict = {}
    cont_index = len(orig_encoding_dict) + 1
    for string in fixed_unique_value_list_dest:
        if not string in dest_encoding_dict:
            dest_encoding_dict[string] = cont_index
            cont_index = cont_index + 1
    print('Orig dict amount:', len(orig_encoding_dict))
    print('Dest dict amount:', len(dest_encoding_dict))
    
    print('Orig and dest string-integer encodings created')

    string_orig_values = formated_df['nameOrig'].tolist()
    string_dest_values = formated_df['nameDest'].tolist()

    orig_encoded_values = []
    for string in string_orig_values:
        orig_encoded_values.append(orig_encoding_dict[string])

    dest_encoded_values = []
    for string in string_dest_values:
        if not string in dest_encoding_dict:
            dest_encoded_values.append(orig_encoding_dict[string])
            continue
        dest_encoded_values.append(dest_encoding_dict[string])

    formated_df['nameOrig'] = orig_encoded_values
    formated_df['nameDest'] = dest_encoded_values

    print('Orig encoded values amount:', len(orig_encoded_values))
    print('Dest encoded values amount:', len(dest_encoded_values))
    
    print('Orig and dest encodings set')

    formated_df['amount'] = formated_df['amount'].round(0).astype(int)
    print('Amount rounded')

    column_order = [
        'step',
        'amount',
        'nameOrig',
        'nameDest',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud',
        'isFlaggedFraud'
    ]
    formated_df = formated_df[column_order]
    print('Columns reordered')
    print('Dataframe shape:', formated_df.shape)
    print('Formatting done')
    return formated_df

In [19]:
formated_data_df = formatting(
    source_df = source_data_df
)

Formatting data
Columns dropped
One hot coded type
Orig amount: 6353307
Dest amount: 2722362
Orig and Dest duplicates 1769
Fixed Dest amount: 2720593
Orig dict amount: 6353307
Dest dict amount: 2720593
Orig and dest string-integer encodings created
Orig encoded values amount: 6362620
Dest encoded values amount: 6362620
Orig and dest encodings set
Amount rounded
Columns reordered
Dataframe shape: (6362620, 11)
Formatting done


In [20]:
formated_data_df

Unnamed: 0,step,amount,nameOrig,nameDest,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,isFraud,isFlaggedFraud
0,1,9840,1,6788653,0,0,0,1,0,0,0
1,1,1864,2,6647762,0,0,0,1,0,0,0
2,1,181,3,6405410,0,0,0,0,1,1,0
3,1,181,4,7291669,0,1,0,0,0,1,0
4,1,11668,5,8220099,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,339682,6353303,8111677,0,1,0,0,0,1,0
6362616,743,6311409,6353304,8024143,0,0,0,0,1,1,0
6362617,743,6311409,6353305,7595045,0,1,0,0,0,1,0
6362618,743,850003,6353306,7587114,0,0,0,0,1,1,0


In [21]:
formated_data_df.to_csv('data/Formated_Fraud_Detection_Data.csv', index = True)

In [9]:
df = pd.read_csv('data/Formated_Fraud_Detection_Data.csv')

## Regular Learning with PyTorch

In [9]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
from torch.optim import SGD
from torch.utils.data import DataLoader, TensorDataset

np.random.seed(42)

def preprocess_into_tensors(
    data_path: str,
    used_columns: list,
    scaled_columns: list,
    target_column: str,
    set_seed: int
) -> tuple:
    df = pd.read_csv(data_path)
    
    # Copy the selection so the scaling below does not write to a slice of df
    preprocessed_df = df[used_columns].copy()

    for column in scaled_columns:
        mean = preprocessed_df[column].mean()
        std_dev = preprocessed_df[column].std()
        preprocessed_df[column] = (preprocessed_df[column] - mean)/std_dev

    X = preprocessed_df.drop(target_column, axis = 1).values
    y = preprocessed_df[target_column].values
        
    X_train, X_test, y_train, y_test = train_test_split(
        X, 
        y, 
        test_size = 0.2, 
        random_state = set_seed
    )

    print('X train:',X_train.shape)
    print('X test:',X_test.shape)
    print('Y train:',y_train.shape)
    print('Y test:',y_test.shape)

    X_train = np.array(X_train, dtype=np.float32)
    X_test = np.array(X_test, dtype=np.float32)
    y_train = np.array(y_train, dtype=np.int32)
    y_test = np.array(y_test, dtype=np.int32)
    
    train_tensor = TensorDataset(
        torch.tensor(X_train), 
        torch.tensor(y_train, dtype=torch.float32)
    )
    test_tensor = TensorDataset(
        torch.tensor(X_test), 
        torch.tensor(y_test, dtype=torch.float32)
    )

    return X_train.shape[1], train_tensor, test_tensor

input_dim, train_tensor, test_tensor = preprocess_into_tensors(
    data_path = 'data/Formated_Fraud_Detection_Data.csv',
    used_columns = [
        'amount',
        'type_CASH_IN',
        'type_CASH_OUT',
        'type_DEBIT',
        'type_PAYMENT',
        'type_TRANSFER',
        'isFraud'
    ],
    scaled_columns = [
        'amount'
    ],
    target_column = 'isFraud',
    set_seed = 42
)


X train: (5090096, 6)
X test: (1272524, 6)
Y train: (5090096,)
Y test: (1272524,)
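
The scaling applied to `amount` in `preprocess_into_tensors` subtracts the column mean and divides by the standard deviation. A minimal standard-library sketch of the same calculation on the first five amounts of the formatted table follows; the helper name `standardize` is hypothetical. It uses the sample standard deviation, which is also the default of the pandas `.std()` method.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # divides by n - 1
    return [(value - mean) / std_dev for value in values]


amounts = [9840, 1864, 181, 181, 11668]
scaled = standardize(amounts)
```

After scaling, the values have a mean of zero and a standard deviation of one, so a large raw amount no longer dominates the one-hot encoded type columns in the linear layer.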


In [None]:
import opacus

def get_loaders(
    set_seed: int,
    sample_rate: float,
    train_tensor: TensorDataset,
    test_tensor: TensorDataset
) -> tuple:
    train_loader = DataLoader(
        train_tensor,
        batch_size=int(len(train_tensor) * sample_rate),
        generator=torch.Generator().manual_seed(set_seed)
    )
    test_loader = DataLoader(test_tensor, 64)
    return train_loader,test_loader

class LogisticRegression(nn.Module):
    def __init__(self, dim, bias=True):
        super().__init__()
        self.dim = dim
        self.linear = nn.Linear(dim, 1, bias=bias)
        self.loss = nn.BCEWithLogitsLoss(reduction="mean")

    def forward(self, x):
        return self.linear(x).view(-1)

    @staticmethod
    def train_step(model, batch):
        x, y = batch
        out = model(x)
        loss = model.loss(out, y)
        return loss

    @staticmethod
    def test_step(model, batch):
        x, y = batch
        out = model(x)
        loss = model.loss(out, y)
        preds = out > 0 # Predict y = 1 if P(y = 1) > 0.5
        corrects = torch.tensor(torch.sum(preds == y).item())
        return loss, corrects

def train(
    model: nn.Module, 
    train_loader: DataLoader, 
    opt_func: type, 
    learning_rate: float, 
    num_epochs: int, 
    noise_multiplier: float, 
    clip_bound: float, 
    delta: float, 
    random_seed: int, 
    verbose=False
) -> float:
    optimizer = opt_func(model.parameters(), learning_rate)
    privacy_engine = opacus.PrivacyEngine(
        accountant="rdp",
        secure_mode=False, 
    )
    
    rng = torch.Generator()
    rng.manual_seed(int(random_seed))
    model_type = type(model)
    model, optimizer, train_loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        noise_multiplier=noise_multiplier,
        max_grad_norm=clip_bound,
        noise_generator=rng,
        loss_reduction="mean"
    )
    
    for epoch in range(num_epochs):
        losses = []
        for batch in train_loader:
            loss = model_type.train_step(model, batch)
            loss.backward()
            # Store a plain float so the computation graph is not kept alive
            losses.append(loss.item())
            optimizer.step()
            optimizer.zero_grad()
        
        if verbose:
            # Report the mean batch loss of the epoch, not only the last batch
            print("Epoch {}, loss = {}".format(epoch + 1, sum(losses) / len(losses)))
    
    epsilon = privacy_engine.get_epsilon(delta)
    return epsilon

def test(
    model: nn.Module, 
    test_loader: DataLoader
) -> tuple:
    with torch.no_grad():
        losses = []
        accuracies = []
        total_size = 0
        
        for batch in test_loader:
            batch_size = len(batch[1])
            total_size += batch_size
            loss, corrects = model.test_step(model, batch)
            # The loss is a batch mean, so weight it by the batch size
            losses.append(loss.item() * batch_size)
            accuracies.append(corrects.item())

        # Aggregate over all batches instead of using only the last batch loss
        average_loss = np.array(losses).sum() / total_size
        total_accuracy = np.array(accuracies).sum() / total_size
        return average_loss, total_accuracy

def run_model_pipeline(
    set_seed: int,
    delta: float,
    learning_rate: float,
    noise_multiplier: float,
    clip_bound: float,
    sample_rate: float,
    num_epochs: int,
    input_dim: int,
    train_tensor: TensorDataset,
    test_tensor: TensorDataset
) -> tuple:
    torch.manual_seed(set_seed)

    train_loader, test_loader = get_loaders(
        set_seed,
        sample_rate,
        train_tensor,
        test_tensor
    )
    
    model = LogisticRegression(dim = input_dim)
    epsilon = train(
        model, 
        train_loader, 
        torch.optim.SGD, 
        learning_rate, 
        num_epochs,
        noise_multiplier, 
        clip_bound, 
        delta,
        set_seed
    )
    average_loss, total_accuracy = test(model, test_loader)
    return epsilon, delta, average_loss, total_accuracy
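
The `clip_bound` and `noise_multiplier` arguments above are passed to Opacus as `max_grad_norm` and `noise_multiplier`. Their effect on a single optimizer step can be sketched conceptually as follows. This is a standard-library illustration of the general clip-then-add-noise idea behind differentially private SGD, not the Opacus implementation, which works on tensors and handles batch sampling differently. The function names (`clip`, `private_gradient`) and the toy gradients are hypothetical.

```python
import math
import random


def clip(grad, clip_bound):
    """Scale one per-sample gradient so that its L2 norm is at most clip_bound."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_bound / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_gradient(per_sample_grads, clip_bound, noise_multiplier, rng):
    """Clip every per-sample gradient, sum them, add Gaussian noise, then average."""
    clipped = [clip(grad, clip_bound) for grad in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(grad[i] for grad in clipped) for i in range(dim)]
    # The noise standard deviation scales with the clipping bound
    std = noise_multiplier * clip_bound
    noisy = [value + rng.gauss(0.0, std) for value in summed]
    return [value / len(per_sample_grads) for value in noisy]


rng = random.Random(42)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5

# Without noise only the clipping is visible: [3, 4] shrinks to [0.6, 0.8]
no_noise = private_gradient(grads, clip_bound=1.0, noise_multiplier=0.0, rng=rng)
with_noise = private_gradient(grads, clip_bound=1.0, noise_multiplier=1.0, rng=rng)
```

Clipping limits how much any single transaction can move the model, and the added noise hides the remaining contribution. A larger `noise_multiplier` gives a smaller epsilon for the same number of steps, at the cost of a noisier model.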


## Federated Learning with PyTorch

In [None]:
import pandas as pd
import numpy as np

In [None]:
formated_data_df = pd.read_csv('data/Formated_Fraud_Detection_Data.csv')

## Central ML Pipeline

The required packages can be installed with:
- pip install pandas
- pip install numpy
- pip install scikit-learn

## Worker ML Pipeline