# Tutorial 3: Use a homemade dataset

In this example, we dive deeper into the potential of the library and run a scenario on a new dataset that we will implement ourselves.

## 1 - Prerequisites

In order to run this example, you'll need to:

* use Python 3.7+
* install the requirements from the `requirements.txt` file
* install the `mplc` package: https://test.pypi.org/project/mplc/

If you did not follow our first tutorials, we highly recommend [taking a look at them!](https://github.com/SubstraFoundation/distributed-learning-contributivity/tree/master/notebooks/examples/)


In [None]:
!pip install mplc

## 2 - Context 

In collaborative data science projects, partners sometimes need to train a model on multiple datasets contributed by different data-providing partners. In such cases, the partners might have to measure how much each dataset contributed to the performance of the model. This is useful, for example, as a basis to agree on how to share the reward of the ML challenge or the future revenues derived from the predictive model, or to detect possibly corrupted datasets or partners not playing by the rules. The library explores this question and the opportunity to implement mechanisms that help partners in such scenarios measure each dataset's *contributivity* (as *contribution to the performance of the model*).

In the first tutorial, you learnt how to parametrize and run a scenario.
In the second tutorial, you discovered how to add one of the implemented contributivity measurement methods to your scenario run.
In this tutorial, we are going to use our own dataset.

### The dataset: Sentiment140
We are going to use a subset of the [sentiment140](http://help.sentiment140.com/for-students) dataset and try to
classify short texts (tweets) as expressing either a positive or a negative sentiment.

*The whole machine learning process is inspired by this [article](https://medium.com/@alyafey22/sentiment-classification-from-keras-to-the-browser-7eda0d87cdc6).*

Please note that the library provides an easy way to adapt a common single-partner machine learning use case built with TensorFlow to a multi-partner case with contributivity measurement.

In [2]:
# imports
import seaborn as sns
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split

import re

from keras.models import Sequential
from keras.layers import Dense, GRU, Embedding

from mplc.dataset import Dataset
from mplc.scenario import Scenario

sns.set()

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


## 3 - Generation and preparation of the dataset

The scenario object needs a dataset object to run. In the previous tutorials, we indicated which one to generate automatically by passing the name of a pre-implemented dataset to the scenario constructor.
Here, we will create this dataset object ourselves and pass it to the scenario constructor.

The dataset constructor takes a few arguments.
### Dataset generator

The structure of the dataset generator is represented below:

```python
dataset = Dataset(
    "name",
    x_train,
    x_test,
    y_train,
    y_test,
    input_shape,
    num_classes,                   
    generate_new_model_for_dataset, # See below
    train_val_split_global,         # See below
    train_test_split_local,         # See below
    train_val_split_local           # See below
)
```
#### Data labels
The data labels can take whatever shape you need, with only one condition:
the labels must be convertible to strings in a way that preserves equality. If `label1` is equal to `label2`, then `str(label1)` must be equal to `str(label2)`; conversely, if `label1` is different from `label2`, then `str(label1)` must be different from `str(label2)`.
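
As a quick sanity check of this condition, the sketch below compares every pair of candidate label values. It is only an illustration: `labels_are_string_consistent` is a hypothetical helper, not part of `mplc`, and it assumes scalar labels (numbers or strings) passed as a short list of class values.

```python
from itertools import combinations


def labels_are_string_consistent(labels):
    """Return True if equality of labels matches equality of their strings.

    Hypothetical helper (not part of mplc). Assumes scalar labels; the check
    is quadratic, so pass the list of class values, not the whole label array.
    """
    for a, b in combinations(labels, 2):
        # Both directions at once: equal labels must give equal strings,
        # and different labels must give different strings.
        if (a == b) != (str(a) == str(b)):
            return False
    return True


print(labels_are_string_consistent([0, 4]))      # True
print(labels_are_string_consistent([1, "1"]))    # False: different labels, same string
print(labels_are_string_consistent([1, 1.0]))    # False: equal labels, different strings
```

Mixing types such as `1` and `"1"`, or `1` and `1.0`, is the typical way to break the condition.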
#### Model generator
This function returns the model that will be trained by the scenario object.
Note: the model must report both loss and accuracy as metrics.

#### Train/validation/test splits

The dataset object must be provided with separate train and test sets (referred to as the global train set and the global test set).
The global train set is then further split into a global train set and a global validation set by the function `train_val_split_global`. Please note that if this function is not provided, sklearn's `train_test_split` function is called by default, and 10% of the training set is used as the validation set.
In the multi-partner learning computations, the global validation set is used for early stopping and the global test set is used for performance evaluation.
The global train set is then split amongst partners (according to the scenario configuration) to populate the partners' local datasets.
For each partner, the local dataset can be split into separate train, validation and test sets, using the `train_test_split_local` and `train_val_split_local` functions.
These functions are not mandatory: by default, the local dataset is not split.
Note that the local validation and test sets are currently not used, but they are available for further developments of multi-partner learning and contributivity measurement approaches.
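
To make the order of the splits concrete, here is a minimal arithmetic sketch of the sample counts at each step. The 10% validation fraction is the default mentioned above; the helper name `split_sizes`, the rounding, and the partner amounts are assumptions made for illustration, and the library may compute the exact sizes differently.

```python
def split_sizes(n_global_train, amounts_per_partner, val_fraction=0.1):
    """Return (validation size, list of per-partner sizes).

    Hypothetical helper, not part of mplc: it only mirrors the order of the
    splits described above (global validation first, then partners).
    """
    # Step 1: carve the global validation set out of the global train set.
    n_val = round(n_global_train * val_fraction)
    n_train = n_global_train - n_val
    # Step 2: share the remaining training samples amongst the partners.
    partner_sizes = [round(n_train * amount) for amount in amounts_per_partner]
    return n_val, partner_sizes


n_val, partner_sizes = split_sizes(1000, [0.2, 0.5, 0.3])
print(n_val)          # 100
print(partner_sizes)  # [180, 450, 270]
```

With 1000 samples in the provided train set, 100 go to the global validation set and the remaining 900 are shared between the three partners.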

### Dataset construction
Now that we know all of that, we can create our dataset.
#### Download and unzip data if needed

In [3]:
!curl https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip --output trainingandtestdata.zip
!unzip trainingandtestdata.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 77.5M  100 77.5M    0     0  15.0M      0  0:00:05  0:00:05 --:--:-- 15.9M
Archive:  trainingandtestdata.zip
  inflating: testdata.manual.2009.06.14.csv  
  inflating: training.1600000.processed.noemoticon.csv  


#### Define preprocessing functions

In [1]:
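def process(txt):
    # Strip every character that is not alphanumeric or whitespace,
    # then split on whitespace and lowercase each token.
    out = re.sub(r'[^a-zA-Z0-9\s]', '', txt)
    out = out.split()
    out = [word.lower() for word in out]
    return out


def getMax(data):
    # Largest number of whitespace-separated tokens in any text of `data`.
    max_tokens = 0
    for txt in data:
        max_tokens = max(max_tokens, len(txt.split()))
    return max_tokens


def tokenize(thresh=5):
    # Build the vocabulary from the global `x` (defined in the next section):
    # each word seen at least `thresh` times gets an index starting at 1,
    # index 0 being reserved for padding and unknown words.
    count = dict()
    idx = 1
    word_index = dict()
    for txt in x:
        words = process(txt)
        for word in words:
            count[word] = count.get(word, 0) + 1
    most_counts = [word for word in count if count[word] >= thresh]
    for word in most_counts:
        word_index[word] = idx
        idx += 1
    return word_index


def create_sequences(data):
    # Encode each text as a left-padded sequence of `max_tokens` word indices.
    # Relies on the globals `max_tokens` and `word_index`.
    tokens = []
    for txt in data:
        words = process(txt)
        seq = [0] * max_tokens
        start = max_tokens - len(words)
        for i, word in enumerate(words):
            if word in word_index:
                seq[i + start] = word_index[word]
        tokens.append(seq)
    return np.array(tokens)


def preprocess_dataset_labels(label):
    # Map the polarity values 0 (negative) and 4 (positive) to 0 and 1.
    label = np.array([e / 4 for e in label])
    return label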
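
To see what these helpers produce, here is a self-contained toy version run on three made-up sentences. It is a sketch rather than the notebook's exact code: `build_word_index` and `encode` are hypothetical variants that take the vocabulary and the sequence length as explicit arguments instead of reading globals, and the toy texts are invented for the example.

```python
import re


def process(txt):
    # Same cleaning as above: drop punctuation, split, lowercase.
    return [word.lower() for word in re.sub(r'[^a-zA-Z0-9\s]', '', txt).split()]


def build_word_index(texts, thresh=1):
    # Count words, then index (from 1) those seen at least `thresh` times.
    counts = {}
    for txt in texts:
        for word in process(txt):
            counts[word] = counts.get(word, 0) + 1
    kept = [word for word in counts if counts[word] >= thresh]
    return {word: idx for idx, word in enumerate(kept, start=1)}


def encode(txt, word_index, max_tokens):
    # Left-pad with zeros; words outside the vocabulary are also encoded as 0.
    # Assumes the text has at most `max_tokens` words.
    words = process(txt)
    seq = [0] * max_tokens
    start = max_tokens - len(words)
    for i, word in enumerate(words):
        if word in word_index:
            seq[start + i] = word_index[word]
    return seq


toy_texts = ["I love this!", "I hate this...", "love it"]
toy_index = build_word_index(toy_texts, thresh=2)
print(toy_index)                                        # {'i': 1, 'love': 2, 'this': 3}
print(encode("I love this!", toy_index, max_tokens=4))  # [0, 1, 2, 3]
print(encode("love it", toy_index, max_tokens=4))       # [0, 0, 2, 0]
```

Since index 0 is reserved for padding and unknown words, the indices used in the sequences range from 0 to `len(word_index)`, that is `len(word_index) + 1` distinct values.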

#### Create dataset

In [5]:
df_train = pd.read_csv("training.1600000.processed.noemoticon.csv", encoding = "raw_unicode_escape", header=None)
df_test = pd.read_csv("testdata.manual.2009.06.14.csv", encoding = "raw_unicode_escape",  header=None)

df_train.columns = ["polarity", "id", "date", "query", "user", "text"]
df_test.columns = ["polarity", "id", "date", "query", "user", "text"]

# We keep only a fraction of the whole dataset

df_train = df_train.sample(frac = 0.1)

x = df_train["text"]
y = df_train["polarity"]



In [6]:
max_tokens = getMax(x)

num_words = None
word_index = tokenize()
num_words = len(word_index)

x = create_sequences(x)
y = preprocess_dataset_labels(y)

input_shape = max_tokens
num_classes = len(np.unique(y))


print('length of the dictionary ',len(word_index))
print('max token ', max_tokens) 
print('num classes', num_classes)

length of the dictionary  15174
max token  39
num classes 2


In [7]:
# Split features and labels in a single call so that they stay aligned
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=False)

#### Create Model generator
 

In [9]:
def generate_new_model_for_dataset():
    model = Sequential()
    embedding_size = 8
    # Word indices run from 0 (padding/unknown) to num_words, hence the + 1
    model.add(Embedding(input_dim=num_words + 1,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))

    model.add(GRU(units=16, name = "gru_1",return_sequences=True))
    model.add(GRU(units=8, name = "gru_2" ,return_sequences=True))
    model.add(GRU(units=4, name= "gru_3"))
    model.add(Dense(1, activation='sigmoid',name="dense_1"))
    model.compile(loss='binary_crossentropy',
              optimizer="Adam",
              metrics=['accuracy'])
    return model


And finally, we generate our dataset object!

In [10]:
my_dataset = Dataset(
    "my_dataset",
    x_train,
    x_test,
    y_train,
    y_test,
    input_shape,
    num_classes,
    generate_new_model_for_dataset
)

## 4 - Create the custom scenario
The dataset can be passed to the scenario, through the `dataset` argument.

In [8]:
my_scenario = Scenario(partners_count=3,
                           amounts_per_partner=[0.2, 0.5, 0.3],
                           epoch_count=10,
                           minibatch_count=3,
                           dataset=my_dataset)
# Every other parameter will be set to its default value

2020-08-26 11:23:59.182 | DEBUG    | subtest.scenario:__init__:58 - Dataset selected: mnist
2020-08-26 11:23:59.186 | DEBUG    | subtest.scenario:__init__:93 - Computation use the full dataset for scenario #1
2020-08-26 11:23:59.332 | INFO     | subtest.scenario:__init__:282 - ### Description of data scenario configured:
2020-08-26 11:23:59.333 | INFO     | subtest.scenario:__init__:283 -    Number of partners defined: 3
2020-08-26 11:23:59.334 | INFO     | subtest.scenario:__init__:284 -    Data distribution scenario chosen: random
2020-08-26 11:23:59.337 | INFO     | subtest.scenario:__init__:285 -    Multi-partner learning approach: fedavg
2020-08-26 11:23:59.339 | INFO     | subtest.scenario:__init__:286 -    Weighting option: uniform
2020-08-26 11:23:59.341 | INFO     | subtest.scenario:__init__:287 -    Iterations parameters: 10 epochs > 3 mini-batches > 8 gradient updates per pass
2020-08-26 11:23:59.343 | INFO     | subtest.scenario:__init__:293 - ### Data loaded: mnist
2020-08

In [12]:
my_scenario.run()

2020-08-26 11:23:59.507 | INFO     | subtest.scenario:split_data:537 - ### Splitting data among partners:
2020-08-26 11:23:59.611 | INFO     | subtest.scenario:split_data:538 -    Simple split performed.
2020-08-26 11:23:59.614 | INFO     | subtest.scenario:split_data:539 -    Nb of samples split amongst partners: 77760
2020-08-26 11:23:59.615 | INFO     | subtest.scenario:split_data:541 -    Partner #0: 15552 samples with labels [0, 4]
2020-08-26 11:23:59.617 | INFO     | subtest.scenario:split_data:541 -    Partner #1: 38880 samples with labels [0, 4]
2020-08-26 11:23:59.618 | INFO     | subtest.scenario:split_data:541 -    Partner #2: 23328 samples with labels [0, 4]
2020-08-26 11:23:59.900 | DEBUG    | subtest.scenario:compute_batch_sizes:585 -    Compute batch sizes, partner #0: 648
2020-08-26 11:23:59.901 | DEBUG    | subtest.scenario:compute_batch_sizes:585 -    Compute batch sizes, partner #1: 1620
2020-08-26 11:23:59.901 | DEBUG    | subtest.scenario:compute_batch_sizes:585 - 

0

## 5 - Accuracy scores of each partner and comparison with the aggregated model performance

As in the first tutorial, we take a look at the local and global scores.

In [13]:
scores = my_scenario.mpl.score_matrix_per_partner.mean(axis = 1)
score_collective = my_scenario.mpl.score_matrix_collective_models.mean(axis=1)

scores_df = pd.DataFrame({
    f'partner {i}':scores[:,i] for i in range(my_scenario.partners_count) })
scores_df['collective model'] = score_collective

scores_df

0    0.7673
Name: mpl_test_score, dtype: float64
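
The two `mean(axis=1)` calls above average the scores over the second axis of each matrix. Judging from the indexing `scores[:, i]`, the per-partner matrix appears to be indexed as (epoch, mini-batch, partner) and the collective one as (epoch, mini-batch); this layout is an inference, not something confirmed by the library documentation. The sketch below reproduces the reduction on random placeholder arrays (not real results) to show the resulting shapes.

```python
import numpy as np

# Assumed layout, inferred from the indexing used in the cell above.
epoch_count, minibatch_count, partners_count = 10, 3, 3

# Random placeholder values standing in for the score matrices.
rng = np.random.default_rng(0)
score_matrix_per_partner = rng.random((epoch_count, minibatch_count, partners_count))
score_matrix_collective_models = rng.random((epoch_count, minibatch_count))

# Average over the mini-batch axis to get one score per epoch.
scores = score_matrix_per_partner.mean(axis=1)
score_collective = score_matrix_collective_models.mean(axis=1)

print(scores.shape)            # (10, 3): one row per epoch, one column per partner
print(score_collective.shape)  # (10,): one value per epoch
```

Under this assumption, each column of `scores` becomes one `partner i` column of the data frame, next to the `collective model` column.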


We can plot the evolution of the accuracy through the epochs. 

In [2]:
ax = sns.relplot(data = scores_df, kind="line")
ax.set(xlabel='epochs', ylabel='accuracy', title='Accuracy evolution through the epochs')



 
# That's it!

Now you can explore our other tutorials for a better overview of what can be done with our library!

This work is collaborative: enthusiasts are welcome to comment on open issues and PRs or to open new ones.

Should you be interested in this open effort and would like to share any question, suggestion or input, you can use the following channels:

- This GitHub repository (issues or PRs)
- Substra Foundation's [Slack workspace](https://substra-workspace.slack.com/join/shared_invite/zt-cpyedcab-FHYgpy08efKJ2FCadE2yCA), channel `#workgroup-mpl-contributivity`
- Email: hello@substra.org
- Come meet with us at La Paillasse (Paris, France), Le Palace (Nantes, France) or Studio Iconosquare (Limoges, France)

 ![logo Substra Foundation](./img/substra_logo_couleur_rvb_w150px.png)