# MNIST 

This example implements federated learning on MNIST using Substra's Distributed Learning Contributivity library.

It is based on both [existing resources on MNIST](https://medium.com/@mjbhobe/mnist-digits-classification-with-keras-ed6c2374bd0e) and a [previous implementation of this dataset for the standalone application](https://github.com/SubstraFoundation/distributed-learning-contributivity/blob/master/datasets/dataset_mnist.py).

This notebook focuses on importing the dataset manually, doing a bit of preprocessing and building the objects needed to run a collaborative round.


## Prerequisites

To run this example, you need to:

* use Python 3.7+
* install the requirements from the requirements.txt file
* install this package: https://test.pypi.org/project/pkg-test-distributed-learning-contributivity/0.0.7/



In [2]:
!wget https://raw.githubusercontent.com/SubstraFoundation/distributed-learning-contributivity/Moving-functions/requirements.txt
!pip install -r requirements.txt
!pip install -i https://test.pypi.org/simple/ subtest==0.0.0.10


'wget' is not recognized as an internal or external command,
operable program or batch file.
ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
ERROR: Could not find a version that satisfies the requirement tensorflow==2.2.0 (from subtest==0.0.0.10) (from versions: none)

Looking in indexes: https://test.pypi.org/simple/
Collecting subtest==0.0.0.10
  Downloading https://test-files.pythonhosted.org/packages/b8/5a/2540de2b5d53a93ddb6513c0aed65aef8e964b7d9d951011055bcc55fa61/subtest-0.0.0.10-py3-none-any.whl (51 kB)



ERROR: No matching distribution found for tensorflow==2.2.0 (from subtest==0.0.0.10)




In [12]:
# imports
import seaborn as sns
sns.set()

from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
from keras.utils import np_utils
from keras.datasets import mnist

In [7]:
# Objects and methods needed to run a collaborative round
from subtest.datasets.dataset import Dataset
from subtest.scenario import Scenario

Using TensorFlow backend.


# Create a custom scenario handling mandatory parameters

The mandatory parameters set how many partners are created and what share of the dataset each partner receives.

# Set values for scenario parameters

## Mandatory parameters
partners_count and amounts_per_partner describe how many partners are created and what share of the dataset each one holds. Here we choose 3 partners holding 20%, 50% and 30% of the dataset respectively.

We can use more advanced sample split options in order to fine tune the data distribution between partners.
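
To make the proportions concrete, here is a minimal sketch of a random split driven by amounts_per_partner. It is not the library's own splitting code: the helper `split_indices` and its `seed` argument are hypothetical, and the 54000 figure is the training-set size reported in the scenario log further down.

```python
import random


def split_indices(n_samples, amounts_per_partner, seed=0):
    """Randomly assign sample indices to partners according to the given fractions.

    Hypothetical helper for illustration only, not part of subtest.
    """
    if abs(sum(amounts_per_partner) - 1.0) > 1e-9:
        raise ValueError("amounts_per_partner must sum to 1")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    splits, start = [], 0
    for i, amount in enumerate(amounts_per_partner):
        if i == len(amounts_per_partner) - 1:
            # The last partner takes the remainder, so rounding never drops a sample.
            end = n_samples
        else:
            end = start + round(amount * n_samples)
        splits.append(indices[start:end])
        start = end
    return splits


partner_indices = split_indices(54000, [0.2, 0.5, 0.3])
partner_sizes = [len(idx) for idx in partner_indices]
print(partner_sizes)  # [10800, 27000, 16200]
```

Each sample ends up with exactly one partner, which is the simplest reading of a "random" data distribution.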

## Optional parameters

We want our training to go for 10 epochs and 3 minibatches per epoch.
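
As a rough illustration of what these two numbers imply, the sketch below counts the aggregation rounds and cuts one partner's data into minibatches. It assumes that the partners' models are aggregated once after every minibatch, which may not match the library's internals exactly. The helper `minibatch_bounds` is hypothetical, and 10800 stands for a partner holding 20% of 54000 training samples.

```python
epoch_count = 10
minibatch_count = 3

# Assumption: one aggregation of the partners' models after every minibatch.
aggregation_rounds = epoch_count * minibatch_count


def minibatch_bounds(n_samples, minibatch_count):
    """Return (start, end) index pairs cutting n_samples into near-equal minibatches.

    Hypothetical helper for illustration only, not part of subtest.
    """
    base, extra = divmod(n_samples, minibatch_count)
    bounds, start = [], 0
    for i in range(minibatch_count):
        # The first `extra` minibatches absorb the remainder, one sample each.
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


print(aggregation_rounds)                       # 30
print(minibatch_bounds(10800, minibatch_count))  # [(0, 3600), (3600, 7200), (7200, 10800)]
```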

Moreover, 4 datasets are pre-implemented in subtest, but for this example we will create our own dataset:



# Create Data Set

For this experiment we use the well known MNIST dataset.

This example is also available in the standalone app, either by specifying dataset_name: - 'mnist' in the config file or by passing the parameter dataset_name = 'mnist' to Scenario.

In [13]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(X_train.shape[0],  28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

input_shape = (28, 28, 1)
num_classes = 10

# Create Preprocessing function

In [14]:
def preprocess_dataset_labels(y):
    # One-hot encode the integer labels, one column per class
    y = np_utils.to_categorical(y, num_classes)
    return y
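
For readers unfamiliar with one-hot encoding, the following pure-Python sketch mimics what the preprocessing function does to the labels. The helper `one_hot` is hypothetical and returns nested lists, whereas Keras returns a NumPy array.

```python
def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows (illustrative stand-in for to_categorical)."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # only the column of the true class is set
        encoded.append(row)
    return encoded


print(one_hot([3, 0], 4))  # [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
```

This is the label format that the categorical_crossentropy loss used by the model expects.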

# Create Model

In [15]:
def generate_new_model_for_dataset():
    model = Sequential()
    # add Convolutional layers
    model.add(Conv2D(filters=32, kernel_size=(3,3), activation='relu', padding='same',
                     input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(2,2)))    
    model.add(Flatten())
    # Densely connected layers
    model.add(Dense(128, activation='relu'))
    # output layer
    model.add(Dense(num_classes, activation='softmax'))
    # compile with adam optimizer & categorical_crossentropy loss function
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
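
To see how the layer sizes fit together, the sketch below works through the shape and parameter arithmetic of this architecture by hand. It assumes the Keras defaults for anything not set in the cell above (stride 1 for the convolutions, 'valid' padding for the pooling layers). The helpers `conv_params` and `dense_params` are hypothetical.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a square 2D convolution."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


size, channels = 28, 1  # MNIST input: 28 x 28 pixels, 1 channel
total_params = 0

for filters in (32, 64, 64):
    # 'same' padding keeps the spatial size through the convolution
    total_params += conv_params(3, channels, filters)
    channels = filters
    # 2x2 max pooling halves the spatial size, rounding down: 28 -> 14 -> 7 -> 3
    size //= 2

flattened = size * size * channels
total_params += dense_params(flattened, 128)  # hidden dense layer
total_params += dense_params(128, 10)         # softmax output layer

print(size, flattened, total_params)  # 3 576 130890
```

Calling model.summary() on the generated model is the simplest way to check these figures against Keras.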

# Generate dataset

Note that the scenario needs a Dataset object.

In [16]:
dataset = Dataset(
    "my_dataset",
    X_train,
    X_test,
    y_train,
    y_test,
    input_shape,
    num_classes,
    preprocess_dataset_labels,
    generate_new_model_for_dataset
)

In [24]:
dataset.train_val_split_local

AttributeError: 'Dataset' object has no attribute 'train_val_split_local'

#### Every other parameter will be set to its default value

In particular:

- Data is split randomly between partners
- The learning approach is 'fedavg', for federated averaging
- Weights are averaged uniformly; different weights can be applied for each partner

The learning approach is a built-in parameter that can be set easily. There are currently 4 different approaches.
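
The aggregation step of federated averaging can be sketched in a few lines. This is a simplified illustration, not the library's implementation: each partner's model is reduced to a flat list of floats (real Keras weights are lists of NumPy arrays), and the function name `federated_average` is hypothetical.

```python
def federated_average(partner_weights, aggregation_weights=None):
    """Average per-partner weight vectors element by element.

    partner_weights: one flat list of floats per partner.
    aggregation_weights: one coefficient per partner; uniform when omitted.
    """
    n_partners = len(partner_weights)
    if aggregation_weights is None:
        aggregation_weights = [1.0 / n_partners] * n_partners

    # Normalise so that the coefficients always sum to 1.
    total = sum(aggregation_weights)
    coeffs = [w / total for w in aggregation_weights]

    return [
        sum(c * values[i] for c, values in zip(coeffs, partner_weights))
        for i in range(len(partner_weights[0]))
    ]


partner_weights = [[0.0, 2.0], [3.0, 2.0], [6.0, 2.0]]

uniform = federated_average(partner_weights)
by_share = federated_average(partner_weights, [0.2, 0.5, 0.3])

print([round(v, 3) for v in uniform])   # [3.0, 2.0]
print([round(v, 3) for v in by_share])  # [3.3, 2.0]
```

With uniform weighting every partner counts equally, whatever its share of the data. Passing the data shares as coefficients pulls the average towards the partners that hold more samples.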


#### More details at: https://github.com/SubstraFoundation/distributed-learning-contributivity

# Define scenario

We specify our experiment path used to output graphs and results.

We can now create the scenario that will handle every parameter.

In [17]:
current_scenario = Scenario(partners_count = 3,
                            amounts_per_partner = [0.2, 0.5, 0.3],
                            epoch_count = 10,
                            minibatch_count = 3,
                            dataset = dataset
                            )

2020-09-16 10:28:08.235 | DEBUG    | subtest.scenario:__init__:101 - Computation use the full dataset for scenario #1
2020-09-16 10:28:08.579 | INFO     | subtest.scenario:__init__:262 - ### Description of data scenario configured:
2020-09-16 10:28:08.580 | INFO     | subtest.scenario:__init__:263 -    Number of partners defined: 3
2020-09-16 10:28:08.581 | INFO     | subtest.scenario:__init__:264 -    Data distribution scenario chosen: random
2020-09-16 10:28:08.581 | INFO     | subtest.scenario:__init__:265 -    Multi-partner learning approach: fedavg
2020-09-16 10:28:08.582 | INFO     | subtest.scenario:__init__:266 -    Weighting option: uniform
2020-09-16 10:28:08.583 | INFO     | subtest.scenario:__init__:267 -    Iterations parameters: 10 epochs > 3 mini-batches > 8 gradient updates per pass
2020-09-16 10:28:08.584 | INFO     | subtest.scenario:__init__:273 - ### Data loaded: my_dataset
2020-09-16 10:28:08.585 | INFO     | subtest.scenario:__init__:274 -    54000 train data with

# Run scenario

This is the actual training phase of our federated learning example!

In [18]:
current_scenario.run()

AttributeError: 'Dataset' object has no attribute 'train_val_split_local'

# Results

The results dataframe lists every parameter used, before and after training.

In [13]:
df_results = current_scenario.to_dataframe()
print(df_results.columns)

Index(['aggregation_weighting', 'dataset_fraction_per_partner', 'dataset_name',
       'epoch_count', 'final_relative_nb_samples',
       'gradient_updates_per_pass_count', 'is_early_stopping',
       'learning_computation_time_sec', 'minibatch_count',
       'mpl_nb_epochs_done', 'mpl_test_score',
       'multi_partner_learning_approach', 'nb_samples_used', 'partners_count',
       'samples_split_description', 'scenario_name', 'short_scenario_name',
       'test_data_samples_count', 'train_data_samples_count'],
      dtype='object')


#### Our score:

In [14]:
print("Approach used :", df_results.multi_partner_learning_approach[0])
print("Model accuracy :", df_results.mpl_test_score[0])
print(df_results.aggregation_weighting)

Approach used : fedavg
Model accuracy : 0.9811000227928162
0    uniform
Name: aggregation_weighting, dtype: object


## Extract model 

We can extract our model, to evaluate it or save it for later.

In [15]:
model = current_scenario.mpl.get_model()

In [16]:
model.evaluate(X_test, preprocess_dataset_labels(y_test))



[0.05563341381018981, 0.9811000227928162]

# That's it!

Now you can explore our other tutorials for a better snapshot of what can be done with our library!