<a href="https://colab.research.google.com/github/SubstraFoundation/substra-examples/blob/master/deepfake-detection/Deepfake_Detection_Substra_Example_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, you can test the example's assets and, optionally, run ML tasks on a public remote VM.
On Google Colab, you can view and modify the assets with the "Files" button on the left.

# A Substra example for Deepfakes Detection

*This example is a Substra implementation of a deepfake detector.
The algo is based on the [inference demo Kaggle notebook](https://www.kaggle.com/humananalog/inference-demo) and uses the [DFDC dataset from Kaggle](https://www.kaggle.com/c/deepfake-detection-challenge).
The structure of the example is inspired by [Substra's Titanic Example](https://github.com/SubstraFoundation/substra/blob/master/examples/titanic/).*

## Prerequisites

In order to run this example, you'll need to:

* use Python 3
* have [Docker](https://www.docker.com/) installed
* [install the `substra` cli](https://github.com/SubstraFoundation/substra#install) (supported version: 0.6.0)

In [None]:
! pip3 install substra==0.6.0


* [install the `substratools` library](https://github.com/substrafoundation/substra-tools) (supported version: 0.6.0)
* [pull the `substra-tools` docker images](https://github.com/substrafoundation/substra-tools#pull-from-private-docker-registry)
* have access to a Substra installation ([configure your host to a public node ip](https://doc.substra.ai/getting_started/installation/local_install_skaffold.html#network) or [install Substra on your machine](https://doc.substra.ai/getting_started/installation/local_install_skaffold.html))


In [None]:
# Replace this IP with the IP of a remote VM running a Substra node
public_node_ip = "127.0.0.1"

In [None]:
! echo "{public_node_ip} substra-backend.node-1.com substra-frontend.node-1.com substra-backend.node-2.com substra-frontend.node-2.com" | sudo tee -a /etc/hosts
# Check if it's ok
! curl substra-backend.node-1.com/readiness


* create a substra profile to define the substra network to target, for instance:


In [None]:
! substra config --profile node-1 http://substra-backend.node-1.com
! substra login --profile node-1 --username node-1 --password 'p@$swr0d44'

* check out this repository

In [None]:
%cd /content
! git clone https://github.com/SubstraFoundation/substra-examples/
%cd /content/substra-examples/deepfake-detection/

/content
Cloning into 'substra-examples'...
remote: Enumerating objects: 26, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 285 (delta 6), reused 13 (delta 1), pack-reused 259
Receiving objects: 100% (285/285), 170.32 MiB | 4.30 MiB/s, done.
Resolving deltas: 100% (136/136), done.
/content/substra-examples
Checking out files: 100% (21/21), done.
Branch 'deepfake' set up to track remote branch 'deepfake' from 'origin'.
Switched to a new branch 'deepfake'
/content/substra-examples/deepfake-detection


All commands in this example are run from the `deepfake-detection` folder.

## Data preparation



### Download the data

The first step will be to download the data from the [Kaggle challenge source](https://www.kaggle.com/c/deepfake-detection-challenge/data)

* Sign up or log in to [Kaggle](https://www.kaggle.com/) and accept the [competition rules](https://www.kaggle.com/c/deepfake-detection-challenge/rules).
* Download the data samples (4 GB) manually (`Download All` at the bottom of the [data section](https://www.kaggle.com/c/deepfake-detection-challenge/data)), or install and configure the [Kaggle API](https://github.com/Kaggle/kaggle-api).


#### Using Kaggle API

In [None]:
! pip install --upgrade --force-reinstall kaggle

To use the Kaggle API, go to the 'Account' tab of your user profile (https://www.kaggle.com/<username\>/account) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials.  
Upload your kaggle.json file

In [None]:
from google.colab import files

files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<username>","key":"<api-key>"}'}

In [None]:
%mkdir -p ~/.kaggle
%cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
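
For readers working outside Colab, the same credential setup can be sketched in plain Python. This is a minimal illustration, not part of the example's scripts: the helper name `install_kaggle_token` and its `home` parameter are hypothetical, and it assumes a POSIX file system where owner-only permissions are meaningful.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    `install_kaggle_token` and `home` are hypothetical names used for this sketch.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    # exist_ok=True makes the step safe to re-run, like `mkdir -p`
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    # Same effect as `chmod 600`: read/write for the owner only
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_token("kaggle.json")` would place the token where the Kaggle CLI looks for it by default.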

Then execute the following command:

In [None]:
! kaggle competitions download -c deepfake-detection-challenge

Downloading deepfake-detection-challenge.zip to /content/substra-examples/deepfake-detection
100% 4.13G/4.13G [01:00<00:00, 111MB/s]
100% 4.13G/4.13G [01:00<00:00, 73.7MB/s]


#### If you downloaded the data samples manually

In [None]:
from google.colab import files
# upload your deepfake-detection-challenge.zip file
files.upload()

#### Extract the zip file and copy the `train_sample_videos` folder into the `data/DFDC` folder of the example

In [None]:
%mkdir -p data/DFDC
! unzip -q deepfake-detection-challenge.zip 'train_sample_videos/*' -d data/DFDC
%rm deepfake-detection-challenge.zip
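
If `unzip` is not available, the selective extraction above can be approximated with Python's standard `zipfile` module. The sketch below is an assumption-labelled alternative, not what the example itself runs; `extract_folder` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_folder(zip_path, folder, dest):
    """Extract only the members under `folder/` from a zip archive into `dest`.

    `extract_folder` is a hypothetical helper written for this sketch.
    """
    prefix = folder.rstrip("/") + "/"
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Keep only the entries that live inside the requested folder
        members = [name for name in archive.namelist() if name.startswith(prefix)]
        archive.extractall(dest, members=members)
    return members
```

For this example, the call would be `extract_folder("deepfake-detection-challenge.zip", "train_sample_videos", "data/DFDC")`.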

### Generate data samples

The second step will be to generate train and test data samples from the [Kaggle challenge source](https://www.kaggle.com/c/deepfake-detection-challenge/data).
To generate the data samples, run:

In [None]:
%cd /content/substra-examples/deepfake-detection/
#! pip install -r scripts/requirements.txt --user
# requirements are already satisfied in Colab, except for substratools
! pip install substratools==0.6.0

/content/substra-examples/deepfake-detection
Collecting substratools==0.6.0
  Downloading https://files.pythonhosted.org/packages/46/90/74983a05a05259321b51516fc5404abfa169c8f39a090b53e1788b3e5cd7/substratools-0.6.0-py3-none-any.whl
Installing collected packages: substratools
Successfully installed substratools-0.6.0


In [None]:
%cd /content/substra-examples/deepfake-detection/
! python scripts/generate_data_samples.py

/content/substra-examples/deepfake-detection
Loading DFDC data from data/DFDC
# of files in data folder: 401
Files with extension `mp4`: 400
Files with extension `json`: 1
JSON file: metadata.json
# of files in metadata: 400, # of videos: 400
Spliting data in train/test sets...
# of train data points:  320
# of test data points:  80
Data will be generated in :  /content/substra-examples/deepfake-detection/assets


This will create two sub-folders in the `assets` folder:

* `train_data_samples` contains train data features (paths of the videos) and labels as numpy array files
* `test_data_samples` contains test data features (paths of the videos) and labels as numpy array files
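
To make the split concrete, here is a minimal sketch of how 400 videos could be divided into 320 train and 80 test data points and then grouped into data samples. It is not the code of `scripts/generate_data_samples.py`: the function names, the seed, the 20% test ratio and the group size of 4 are assumptions chosen only so that the numbers match the log output above.

```python
import random


def split_train_test(items, test_ratio=0.2, seed=42):
    """Shuffle `items` reproducibly and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def chunk(items, size):
    """Group `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical video names standing in for the DFDC files
videos = [f"video_{i}.mp4" for i in range(400)]
train, test = split_train_test(videos)
train_samples = chunk(train, 4)  # assumed group size
test_samples = chunk(test, 4)
```

Each group would then be written to its own `data_sample_<n>` folder, with features and labels stored as numpy array files as described above.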

## Writing the objective and data manager

Both the objective and the data manager need a proper markdown description; you can check them out in their respective
folders. Notice that the data manager's description includes a formal description of the data structure.

Notice also that the `metrics.py` and `opener.py` modules both rely on classes imported from the `substratools` module.
These classes provide a simple yet rigid structure that will make algorithms pretty easy to write.
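
As an illustration of this class-based structure, here is a minimal sketch of a metrics module. It assumes the `tools.Metrics` base class with a `score(y_true, y_pred)` method, as used in the Titanic example for this `substratools` version. The accuracy computation and the 0.5 threshold are illustrative assumptions, not the content of this example's `metrics.py`.

```python
try:
    import substratools as tools
    # Assumed base class name for substratools 0.6.0
    MetricsBase = getattr(tools, "Metrics", object)
except ImportError:
    # Fall back to a plain class so the sketch can be tried without substratools
    MetricsBase = object


class AccuracyMetrics(MetricsBase):
    """Hypothetical metric: share of videos classified correctly."""

    def score(self, y_true, y_pred):
        if len(y_true) != len(y_pred):
            raise ValueError("y_true and y_pred must have the same length")
        if len(y_true) == 0:
            raise ValueError("cannot score an empty set of predictions")
        # Assumption: predictions are probabilities, thresholded at 0.5
        hits = sum(int(p >= 0.5) == int(t) for t, p in zip(y_true, y_pred))
        return hits / len(y_true)


# In a real metrics.py, the class would typically be registered with:
#     tools.metrics.execute(AccuracyMetrics())
```

The opener follows the same pattern, with methods that load features and labels from the data sample folders.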

## Writing a simple algorithm

You'll find under `assets/algo_inference` an implementation of the `inference` model from the [inference demo Kaggle notebook](https://www.kaggle.com/humananalog/inference-demo). Like the metrics and opener scripts, it relies on a
class imported from `substratools` that greatly simplifies the writing process. You'll notice that it handles not only
the train and predict tasks but also a lot of data preprocessing.
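
The overall shape of such an algo class can be sketched as follows. This is a toy stand-in, not the `algo_inference` implementation: it assumes the `tools.Algo` interface of `substratools` 0.6.0 (`train`, `predict`, `load_model`, `save_model`), and the "model" is just the mean training label stored as JSON.

```python
import json

try:
    import substratools as tools
    # Assumed base class name for substratools 0.6.0
    AlgoBase = getattr(tools, "Algo", object)
except ImportError:
    # Fall back to a plain class so the sketch can be tried without substratools
    AlgoBase = object


class MeanLabelAlgo(AlgoBase):
    """Toy algo: predicts the mean training label for every sample."""

    def train(self, X, y, models, rank):
        # `models` (previously trained models) and `rank` are ignored here
        labels = [float(label) for label in y]
        return {"mean_label": sum(labels) / len(labels)}

    def predict(self, X, model):
        return [model["mean_label"] for _ in X]

    def load_model(self, path):
        with open(path) as f:
            return json.load(f)

    def save_model(self, model, path):
        with open(path, "w") as f:
            json.dump(model, f)


# In a real algo.py, the class would typically be registered with:
#     tools.algo.execute(MeanLabelAlgo())
```

The real algo replaces the body of `train` and `predict` with video preprocessing and model inference, but keeps the same four entry points.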

## Testing our assets

### Using asset command line interfaces

You can first test each asset with the `substratools` CLI, by running specific ML tasks in your local Python environment.

#### Training task

In [None]:
#for a quicker test, you can change --data-samples-path to a specific data sample, (e.g. assets/train_data_samples/data_sample_0)

#train your model with the train_data
! python assets/algo_inference/algo.py train \
  --debug \
  --opener-path assets/dataset/opener.py \
  --data-samples-path assets/train_data_samples \
  --output-model-path assets/model/model \
  --log-path assets/logs/train.log

substratools.utils - Module 'opener' loaded from path 'assets/dataset/opener.py'
substratools.opener - loading X from '['assets/train_data_samples/data_sample_2', 'assets/train_data_samples/data_sample_49', 'assets/train_data_samples/data_sample_48', 'assets/train_data_samples/data_sample_13', 'assets/train_data_samples/data_sample_79', 'assets/train_data_samples/data_sample_74', 'assets/train_data_samples/data_sample_71', 'assets/train_data_samples/data_sample_52', 'assets/train_data_samples/data_sample_26', 'assets/train_data_samples/data_sample_3', 'assets/train_data_samples/data_sample_31', 'assets/train_data_samples/data_sample_46', 'assets/train_data_samples/data_sample_62', 'assets/train_data_samples/data_sample_6', 'assets/train_data_samples/data_sample_14', 'assets/train_data_samples/data_sample_67', 'assets/train_data_samples/data_sample_45', 'assets/train_data_samples/data_sample_28', 'assets/train_data_samples/data_sample_37', 'assets/train_data_samples/data_sample_59', 'as

In [None]:
#predict the labels of train_data with your previously trained model
! python assets/algo_inference/algo.py predict \
  --debug \
  --opener-path assets/dataset/opener.py \
  --data-samples-path assets/train_data_samples \
  --output-predictions-path assets/pred-train.csv \
  --models-path assets/model/ \
  --log-path assets/logs/train_predict.log \
  model

substratools.utils - Module 'opener' loaded from path 'assets/dataset/opener.py'
substratools.opener - loading X from '['assets/train_data_samples/data_sample_2', 'assets/train_data_samples/data_sample_49', 'assets/train_data_samples/data_sample_48', 'assets/train_data_samples/data_sample_13', 'assets/train_data_samples/data_sample_79', 'assets/train_data_samples/data_sample_74', 'assets/train_data_samples/data_sample_71', 'assets/train_data_samples/data_sample_52', 'assets/train_data_samples/data_sample_26', 'assets/train_data_samples/data_sample_3', 'assets/train_data_samples/data_sample_31', 'assets/train_data_samples/data_sample_46', 'assets/train_data_samples/data_sample_62', 'assets/train_data_samples/data_sample_6', 'assets/train_data_samples/data_sample_14', 'assets/train_data_samples/data_sample_67', 'assets/train_data_samples/data_sample_45', 'assets/train_data_samples/data_sample_28', 'assets/train_data_samples/data_sample_37', 'assets/train_data_samples/data_sample_59', 'as

In [None]:
#calculate the score of your model on train_data predictions
! python assets/objective/metrics.py \
  --debug \
  --opener-path assets/dataset/opener.py \
  --data-samples-path assets/train_data_samples \
  --input-predictions-path assets/pred-train.csv \
  --output-perf-path assets/perf-train.json \
  --log-path assets/logs/train_metrics.log
  

substratools.opener - loading y from '['assets/train_data_samples/data_sample_2', 'assets/train_data_samples/data_sample_49', 'assets/train_data_samples/data_sample_48', 'assets/train_data_samples/data_sample_13', 'assets/train_data_samples/data_sample_79', 'assets/train_data_samples/data_sample_74', 'assets/train_data_samples/data_sample_71', 'assets/train_data_samples/data_sample_52', 'assets/train_data_samples/data_sample_26', 'assets/train_data_samples/data_sample_3', 'assets/train_data_samples/data_sample_31', 'assets/train_data_samples/data_sample_46', 'assets/train_data_samples/data_sample_62', 'assets/train_data_samples/data_sample_6', 'assets/train_data_samples/data_sample_14', 'assets/train_data_samples/data_sample_67', 'assets/train_data_samples/data_sample_45', 'assets/train_data_samples/data_sample_28', 'assets/train_data_samples/data_sample_37', 'assets/train_data_samples/data_sample_59', 'assets/train_data_samples/data_sample_64', 'assets/train_data_samples/data_sample_5

#### Testing task


In [None]:
#predict the labels of test_data with your previously trained model
! python assets/algo_inference/algo.py predict \
  --debug \
  --opener-path assets/dataset/opener.py \
  --data-samples-path assets/test_data_samples \
  --output-predictions-path assets/pred-test.csv \
  --models-path assets/model/ \
  --log-path assets/logs/test_predict.log \
  model


substratools.utils - Module 'opener' loaded from path 'assets/dataset/opener.py'
substratools.opener - loading X from '['assets/test_data_samples/data_sample_2', 'assets/test_data_samples/data_sample_13', 'assets/test_data_samples/data_sample_3', 'assets/test_data_samples/data_sample_6', 'assets/test_data_samples/data_sample_14', 'assets/test_data_samples/data_sample_0', 'assets/test_data_samples/data_sample_17', 'assets/test_data_samples/data_sample_5', 'assets/test_data_samples/data_sample_15', 'assets/test_data_samples/data_sample_18', 'assets/test_data_samples/data_sample_11', 'assets/test_data_samples/data_sample_10', 'assets/test_data_samples/data_sample_4', 'assets/test_data_samples/data_sample_19', 'assets/test_data_samples/data_sample_8', 'assets/test_data_samples/data_sample_16', 'assets/test_data_samples/data_sample_1', 'assets/test_data_samples/data_sample_12', 'assets/test_data_samples/data_sample_9', 'assets/test_data_samples/data_sample_7']'
Finding features file...
subs

In [None]:
#calculate the score of your model on test_data predictions
! python assets/objective/metrics.py \
  --debug \
  --opener-path assets/dataset/opener.py \
  --data-samples-path assets/test_data_samples \
  --input-predictions-path assets/pred-test.csv \
  --output-perf-path assets/perf-test.json \
  --log-path assets/logs/test_metrics.log

substratools.opener - loading y from '['assets/test_data_samples/data_sample_2', 'assets/test_data_samples/data_sample_13', 'assets/test_data_samples/data_sample_3', 'assets/test_data_samples/data_sample_6', 'assets/test_data_samples/data_sample_14', 'assets/test_data_samples/data_sample_0', 'assets/test_data_samples/data_sample_17', 'assets/test_data_samples/data_sample_5', 'assets/test_data_samples/data_sample_15', 'assets/test_data_samples/data_sample_18', 'assets/test_data_samples/data_sample_11', 'assets/test_data_samples/data_sample_10', 'assets/test_data_samples/data_sample_4', 'assets/test_data_samples/data_sample_19', 'assets/test_data_samples/data_sample_8', 'assets/test_data_samples/data_sample_16', 'assets/test_data_samples/data_sample_1', 'assets/test_data_samples/data_sample_12', 'assets/test_data_samples/data_sample_9', 'assets/test_data_samples/data_sample_7']'
Finding label file...
Loading labels...
substratools.opener - loading predictions from 'assets/pred-test.csv'


### Using substra cli

Before pushing our assets to the platform, we need to make sure they work well. To do so, we can run them locally in a
Docker container. This way, if the training fails, we can access the logs and debug our code.

To test the assets, we'll use `substra run-local`, passing it the paths to our algorithm, the opener,
the metrics and the data samples we want to use. It launches a training task on the train data, then a prediction task on the test data, and returns the accuracy score.

In [None]:
#you will need Docker to run this (not available in Colab)
! substra run-local assets/algo_inference \
  --train-opener=assets/dataset/opener.py \
  --test-opener=assets/dataset/opener.py \
  --metrics=assets/objective/ \
  --train-data-samples=assets/train_data_samples \
  --test-data-samples=assets/test_data_samples

At the end of this step, you'll find in the newly created `sandbox/model` folder a `model` file that contains your
trained model. There is also a `sandbox/pred_train` folder that contains both the predictions made by the model on
train data and the associated performance.


#### Debugging

Your code will probably not run perfectly the first time. Since runs happen in Docker containers, you can't
debug with `print` statements. Instead, use Python's `logging` module. All logs can then be consulted at the end
of the run in  `sandbox/model/log_model.log`.
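
A minimal sketch of what replacing prints with `logging` might look like inside an algo is shown below. The logger name `deepfake_algo` and the `preprocess` function are hypothetical and only serve the illustration.

```python
import logging

# Hypothetical logger name; any name works as long as it is used consistently
logger = logging.getLogger("deepfake_algo")


def preprocess(video_paths):
    """Keep only .mp4 files and log what happened instead of printing it."""
    logger.info("Preprocessing %d videos", len(video_paths))
    kept = [path for path in video_paths if path.endswith(".mp4")]
    skipped = len(video_paths) - len(kept)
    if skipped:
        logger.warning("Skipped %d files that are not .mp4", skipped)
    return kept
```

Passing values as arguments (`%d`) rather than formatting the string up front lets the logging module skip the formatting when a level is disabled.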


## Adding the assets to substra



### Adding the objective, dataset and data samples to substra

A script adds the objective, data manager and data samples to Substra. It uses the `substra` Python
SDK to perform these actions. Its main goal is to create assets, get their keys and use these keys in the creation of other
assets.

To run it:

In [None]:
! python scripts/add_dataset_objective.py


This script just generated an `assets_keys.json` file in the `deepfake-detection` folder. This file contains the keys of all assets
we've just created and organizes the keys of the train data samples in folds. This file will be used as input when
adding an algorithm so that we can automatically launch all training and testing tasks.
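
The role of this keys file can be sketched as a small JSON store that one script writes and a later script reads and extends. The helper names and the key names in the usage note are hypothetical; the actual layout of `assets_keys.json` is defined by the example's scripts.

```python
import json
from pathlib import Path


def save_asset_keys(path, keys):
    """Write a dict of asset keys to a JSON file."""
    Path(path).write_text(json.dumps(keys, indent=2))


def update_asset_keys(path, **new_keys):
    """Load an existing keys file, merge new entries into it and write it back."""
    path = Path(path)
    keys = json.loads(path.read_text())
    keys.update(new_keys)
    path.write_text(json.dumps(keys, indent=2))
    return keys
```

A first script might call `save_asset_keys("assets_keys.json", {"dataset_key": "..."})`, and a later one `update_asset_keys("assets_keys.json", algo_key="...")`, where both key names are placeholders.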

### Adding the algorithm and training it

The script `add_train_algo_inference.py` pushes our simple algo to substra and then uses the `assets_keys.json` file
we just generated to train it against the dataset and objective we previously set up. It will then update the
`assets_keys.json` file with the newly created asset keys (algo, traintuple and testtuple).

To run it:


In [None]:
! python scripts/add_train_algo_inference.py

It will end by providing a couple of commands you can use to track the progress of the train and test tuples as well
as the associated scores. Alternatively, you can browse the frontend to look up progress and scores.
