# MNIST 

This example implements federated learning with Substra's Distributed Learning Contributivity library.

It is based on both [existing resources on MNIST](https://medium.com/@mjbhobe/mnist-digits-classification-with-keras-ed6c2374bd0e) and a [previous implementation of this dataset for the standalone application](https://github.com/SubstraFoundation/distributed-learning-contributivity/blob/master/datasets/dataset_mnist.py).

This notebook focuses on manually importing the dataset, doing a bit of preprocessing, and building the objects needed to run a collaborative round.


## Prerequisites

In order to run this example, you'll need to:

* use Python 3.7+
* install the requirements from the requirements.txt file
* install this package: https://test.pypi.org/project/pkg-test-distributed-learning-contributivity/0.0.7/



In [1]:
!wget https://raw.githubusercontent.com/SubstraFoundation/distributed-learning-contributivity/Moving-functions/requirements.txt
!pip install -r requirements.txt
!pip install -i https://test.pypi.org/simple/ subtest==0.0.0.8

--2020-08-26 11:12:24--  https://raw.githubusercontent.com/SubstraFoundation/distributed-learning-contributivity/Moving-functions/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 176 [text/plain]
Saving to: ‘requirements.txt’


2020-08-26 11:12:24 (8.14 MB/s) - ‘requirements.txt’ saved [176/176]

Collecting Keras==2.3.1
  Downloading https://files.pythonhosted.org/packages/ad/fd/6bfe87920d7f4fd475acd28500a42482b6b84479832bdc0fe9e589a60ceb/Keras-2.3.1-py2.py3-none-any.whl (377kB)
     |████████████████████████████████| 378kB 2.8MB/s
Collecting matplotlib==3.1.3
  Downloading https://files.pythonhosted.org/packages/7e/07/4b361d6d0f4e08942575f83a11d33f36897e1aae4279046606dd1808778a/matplotlib-3.1.3-cp36-cp36m-manylinux1_x86_64.w

Looking in indexes: https://test.pypi.org/simple/
Collecting subtest==0.0.0.8
  Downloading https://test-files.pythonhosted.org/packages/9f/62/c90051a4e9247eb15f8ac5af1dc92ff7cfbd6ccfbbc763499398c99af44a/subtest-0.0.0.8-py3-none-any.whl (43kB)
     |████████████████████████████████| 51kB 1.7MB/s
Installing collected packages: subtest
Successfully installed subtest-0.0.0.8


In [2]:
# imports
import numpy as np
from pathlib import Path
import pandas as pd
import seaborn as sns
sns.set()

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D
from keras.utils import np_utils
from keras.datasets import mnist

Using TensorFlow backend.


In [3]:
# Objects and methods needed to run a collaborative round
from subtest.dataset import Dataset
from subtest.scenario import Scenario, run_scenario

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


# Create a custom scenario with the mandatory parameters

These parameters set how many partners are created and what proportion of the dataset each partner receives.

More advanced sample split options are available to fine-tune how the data is distributed between partners.

In [4]:
scenario_params = {
    'partners_count': 3,
    'amounts_per_partner': [0.2, 0.5, 0.3],
}
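As a rough illustration of what `amounts_per_partner` means, the sketch below turns the fractions into per-partner sample counts. The helper name `samples_per_partner` is hypothetical, and the total of 38880 samples is simply the figure reported in the run log further down; the library's own splitting logic may differ.

```python
import math

def samples_per_partner(total_samples, amounts_per_partner):
    """Turn dataset fractions into per-partner sample counts (illustrative helper)."""
    if not math.isclose(sum(amounts_per_partner), 1.0):
        raise ValueError("amounts_per_partner should sum to 1")
    return [round(total_samples * amount) for amount in amounts_per_partner]

# 38880 is the number of samples split among partners in the run log.
counts = samples_per_partner(38880, [0.2, 0.5, 0.3])
print(counts)
```

The resulting counts match the per-partner sample counts that appear in the log when the scenario is run.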

# Set values for optional parameters

We want training to run for 10 epochs, with 3 minibatches per epoch.

In [5]:
scenario_params['epoch_count'] = 10
scenario_params['minibatch_count'] = 3

#### Every other parameter will be set to its default value

With the default values:

- Data is split randomly between partners
- The learning approach is 'fedavg', for federated averaging
- Weights are averaged uniformly; different weights can be applied to each partner

The learning approach is a built-in parameter that is easy to set. There are currently 4 different approaches.
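To make the 'uniform' weighting concrete, here is a minimal sketch of the averaging step in federated averaging. It is not the library's implementation: the function `federated_average` and the flat lists standing in for model weights are assumptions made for illustration only.

```python
def federated_average(partner_weights, coefficients=None):
    """Element-wise weighted average of per-partner weight vectors.

    partner_weights: one flat list of floats per partner (stand-in for model weights).
    coefficients: one non-negative number per partner; None means uniform weighting.
    """
    partner_count = len(partner_weights)
    if coefficients is None:
        coefficients = [1.0] * partner_count
    total = sum(coefficients)
    return [
        sum(c * w for c, w in zip(coefficients, values)) / total
        for values in zip(*partner_weights)
    ]

# Three partners, each holding a locally trained copy of a 2-parameter model.
local_weights = [[1.0, 4.0], [2.0, 5.0], [3.0, 9.0]]

uniform = federated_average(local_weights)
by_data_volume = federated_average(local_weights, coefficients=[0.2, 0.5, 0.3])
print(uniform)
print(by_data_volume)
```

With uniform coefficients every partner counts equally; passing the data fractions instead gives more influence to partners holding more samples.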


#### More details at: https://github.com/SubstraFoundation/distributed-learning-contributivity

# Define scenario

We specify our experiment path used to output graphs and results.

We can now create the scenario, which will handle every parameter.

In [6]:
current_scenario = Scenario(scenario_params)

2020-08-26 11:14:25.985 | DEBUG    | subtest.scenario:__init__:58 - Dataset selected: mnist
2020-08-26 11:14:25.989 | DEBUG    | subtest.scenario:__init__:93 - Computation use the full dataset for scenario #1
2020-08-26 11:14:26.112 | INFO     | subtest.scenario:__init__:282 - ### Description of data scenario configured:
2020-08-26 11:14:26.113 | INFO     | subtest.scenario:__init__:283 -    Number of partners defined: 3
2020-08-26 11:14:26.115 | INFO     | subtest.scenario:__init__:284 -    Data distribution scenario chosen: random
2020-08-26 11:14:26.116 | INFO     | subtest.scenario:__init__:285 -    Multi-partner learning approach: fedavg
2020-08-26 11:14:26.117 | INFO     | subtest.scenario:__init__:286 -    Weighting option: uniform
2020-08-26 11:14:26.119 | INFO     | subtest.scenario:__init__:287 -    Iterations parameters: 10 epochs > 3 mini-batches > 8 gradient updates per pass
2020-08-26 11:14:26.122 | INFO     | subtest.scenario:__init__:293 - ### Data loaded: mnist
2020-08

# Create Data Set

For this experiment we use the well-known MNIST dataset.

This example is also available in the standalone app, by specifying the following in the config file: dataset_name: - 'mnist'

In [7]:
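# Load the MNIST images and labels, already split into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Add a channel dimension: each image becomes 28 x 28 x 1
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)

# Scale pixel values from [0, 255] to [0, 1]
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

input_shape = (28, 28, 1)
num_classes = 10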

# Create Preprocessing function

In [8]:
def preprocess_dataset_labels(y):
    # One-hot encode the integer labels (0-9) into vectors of length num_classes
    y = np_utils.to_categorical(y, num_classes)
    return y
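For readers unfamiliar with one-hot encoding, the following pure-Python sketch shows the kind of transformation the preprocessing function applies to the labels. The helper `one_hot` is a hypothetical stand-in written for illustration; it is not the Keras function used in the cell above.

```python
def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows (illustrative stand-in)."""
    encoded = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside [0, {num_classes})")
        row = [0.0] * num_classes
        row[label] = 1.0  # only the position of the true class is set
        encoded.append(row)
    return encoded

encoded = one_hot([3, 0, 9], num_classes=10)
print(encoded[0])
```

Each label becomes a vector of length 10 with a single 1.0, which is the target format expected by the categorical cross-entropy loss.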

# Create Model

In [9]:
def generate_new_model_for_dataset():
    model = Sequential()
    # add Convolutional layers
    model.add(Conv2D(filters=32, kernel_size=(3,3), activation='relu', padding='same',
                     input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(2,2)))    
    model.add(Flatten())
    # Densely connected layers
    model.add(Dense(128, activation='relu'))
    # output layer
    model.add(Dense(num_classes, activation='softmax'))
    # compile with adam optimizer & categorical_crossentropy loss function
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
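As a sanity check on the architecture, the arithmetic below traces the feature-map size through the three convolution and pooling stages and counts the trainable parameters by hand. It assumes the usual Keras defaults (stride 1 for the convolutions, pooling stride equal to the pool size); the helper names are illustrative only.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

size, channels, total = 28, 1, 0
for filters in (32, 64, 64):
    total += conv_params(3, channels, filters)  # 'same' padding keeps the spatial size
    channels = filters
    size //= 2  # 2x2 max pooling halves the size, rounding down

flattened = size * size * channels
total += dense_params(flattened, 128) + dense_params(128, 10)
print(size, flattened, total)
```

Under these assumptions the spatial size goes 28, 14, 7, 3, so the flattened vector fed to the dense layer has 3 x 3 x 64 = 576 entries.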

# Assign the dataset to the scenario

In [10]:
current_scenario.dataset = Dataset(
    "my_dataset",
    X_train,
    X_test,
    y_train,
    y_test,
    input_shape,
    num_classes,
    preprocess_dataset_labels,
    generate_new_model_for_dataset
)

# Split train and validation sets

This is a mandatory step to provide an unbiased evaluation of the model performance.

In [11]:
current_scenario.dataset.train_val_split()
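The idea behind a train/validation split can be sketched in a few lines. This is a generic hold-out split, not the library's `train_val_split` method: the helper name, the 10% validation fraction, and the fixed seed are all assumptions chosen for illustration.

```python
import random

def holdout_split(samples, val_fraction=0.1, seed=42):
    """Shuffle indices and hold out a fraction of the samples for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    val_size = int(len(samples) * val_fraction)
    val = [samples[i] for i in indices[:val_size]]
    train = [samples[i] for i in indices[val_size:]]
    return train, val

train, val = holdout_split(list(range(100)))
print(len(train), len(val))
```

Because the validation samples never appear in the training set, scores computed on them are not inflated by memorization.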

# Run scenario

This is the actual training phase of our federated learning example!

In [12]:
run_scenario(current_scenario)

2020-08-26 11:14:26.905 | INFO     | subtest.scenario:split_data:537 - ### Splitting data among partners:
2020-08-26 11:14:26.906 | INFO     | subtest.scenario:split_data:538 -    Simple split performed.
2020-08-26 11:14:26.907 | INFO     | subtest.scenario:split_data:539 -    Nb of samples split amongst partners: 38880
2020-08-26 11:14:26.908 | INFO     | subtest.scenario:split_data:541 -    Partner #0: 7776 samples with labels [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2020-08-26 11:14:26.909 | INFO     | subtest.scenario:split_data:541 -    Partner #1: 19440 samples with labels [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2020-08-26 11:14:26.910 | INFO     | subtest.scenario:split_data:541 -    Partner #2: 11664 samples with labels [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2020-08-26 11:14:27.184 | DEBUG    | subtest.scenario:compute_batch_sizes:585 -    Compute batch sizes, partner #0: 324
2020-08-26 11:14:27.185 | DEBUG    | subtest.scenario:compute_batch_sizes:585 -    Compute batch sizes, partner #1: 810
2020-08-26

0
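The batch sizes logged above appear to follow from the partner sample counts, the 3 minibatches per epoch, and the 8 gradient updates per pass reported when the scenario was created. The sketch below reproduces that arithmetic; the formula is inferred from the log output and is an assumption, not taken from the library source.

```python
def batch_size(partner_samples, minibatch_count, gradient_updates_per_pass):
    """Samples per gradient update, assuming an even split (inferred from the log)."""
    return partner_samples // (minibatch_count * gradient_updates_per_pass)

# Sample counts per partner, as reported in the split_data log lines.
for partner, samples in enumerate([7776, 19440, 11664]):
    size = batch_size(samples, minibatch_count=3, gradient_updates_per_pass=8)
    print(f"partner #{partner}: {size}")
```

For partners #0 and #1 this gives the same values as the `compute_batch_sizes` log lines.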

# Results

We can inspect every parameter recorded before and after training.

In [13]:
df_results = current_scenario.to_dataframe()
print(df_results.columns)

Index(['aggregation_weighting', 'dataset_fraction_per_partner', 'dataset_name',
       'epoch_count', 'final_relative_nb_samples',
       'gradient_updates_per_pass_count', 'is_early_stopping',
       'learning_computation_time_sec', 'minibatch_count',
       'mpl_nb_epochs_done', 'mpl_test_score',
       'multi_partner_learning_approach', 'nb_samples_used', 'partners_count',
       'samples_split_description', 'scenario_name', 'short_scenario_name',
       'test_data_samples_count', 'train_data_samples_count'],
      dtype='object')


#### Our score:

In [14]:
print("Approach used :", df_results.multi_partner_learning_approach[0])
print("Model accuracy :", df_results.mpl_test_score[0])
print(df_results.aggregation_weighting)

Approach used : fedavg
Model accuracy : 0.9811000227928162
0    uniform
Name: aggregation_weighting, dtype: object


## Extract model 

We can extract the trained model, for example to evaluate it or save it for later.

In [15]:
model = current_scenario.mpl.get_model()

In [16]:
model.evaluate(X_test, preprocess_dataset_labels(y_test))



[0.05563341381018981, 0.9811000227928162]
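The two numbers returned above are the loss and the accuracy, in the order given by the loss and metrics the model was compiled with. As a hedged illustration of what these two quantities measure, the sketch below computes them by hand on a tiny made-up set of predictions; the function names and the toy data are assumptions, and Keras' own computation may differ in numerical details.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    losses = []
    for target, probs in zip(y_true, y_pred):
        # eps guards against log(0) for very confident wrong predictions
        losses.append(-sum(t * math.log(max(p, eps)) for t, p in zip(target, probs)))
    return sum(losses) / len(losses)

def accuracy(y_true, y_pred):
    """Share of samples whose most probable class matches the target class."""
    hits = sum(
        target.index(max(target)) == probs.index(max(probs))
        for target, probs in zip(y_true, y_pred)
    )
    return hits / len(y_true)

# Toy data: three samples, three classes; the last prediction is wrong.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.5, 0.3, 0.2]]
loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```

A lower loss means the predicted probabilities sit closer to the true labels, while accuracy only checks whether the top-ranked class is correct.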

# That's it!

Now you can explore our other tutorials for a better snapshot of what can be done with our library!