# Explanation
Here I tried to pretrain a model on a synthetically generated dataset created by face-swapping images from the [Labeled Faces in the Wild (LFW)](http://vis-www.cs.umass.edu/lfw/) dataset. To create this supplementary dataset I used a modified version of the [FaceSwap app](https://github.com/MarekKowalski/FaceSwap) to perform random swaps between faces in LFW. The idea was to pretrain a model on this dataset and then fine-tune it on the original fake vs real dataset. Because the fakes generated by FaceSwap were of much lower quality than those in the fake vs real dataset, my model was able to categorise them quite accurately (around 81% accuracy on the LFW test split). However, it was not able to transfer this knowledge to the fake vs real dataset, reaching only about 54% test accuracy after fine-tuning, which suggests that the underlying distributions are too distinct for transfer learning to work.

This notebook deals with training the neural network to classify images as photoshopped or not and is designed to be run on Google Colab to make use of its free GPU.

## Imports

In [None]:
# Modules required
import sys
import os
import importlib
import joblib
import itertools
from sklearn.model_selection import ParameterGrid
import shutil

## Colab setup

Training neural nets on my laptop is very slow, so I used Google Colab to speed things up. The function below runs only when I am on Colab; it does some setup, such as checking for a GPU, downloading the dataset from GitHub and adding my Google Drive project directory to the path so that model logs can be saved.

In [None]:
def colab_setup():

    # Specify tensorflow version 2.0 and import, checking that gpu is used
    %tensorflow_version 2.x
    import tensorflow as tf

    # Check that the GPU is being used
    device_name = tf.test.gpu_device_name()
    if device_name != '/device:GPU:0':
        raise SystemError('GPU device not found')
    print('Found GPU at: {}'.format(device_name))

    # Add the project directory to path, allows access to items saved in google drive   
    if project_dir not in sys.path:
        sys.path.append(project_dir)

    # If there is no data in this directory and on colab then copy the data
    if not os.path.exists('./data'):

        # Download raw data from github onto colab instance and move it into the raw data directory, utilises sparse checkout
        !git init
        !git config core.sparsecheckout true
        !echo data >> .git/info/sparse-checkout
        !git remote add origin -f https://github.com/ERees1/faces-fake-vs-real
        !git pull origin master

## General Setup

In [3]:
# Detect whether we are running locally or on Colab and set the project directories
if 'edwardrees' in sys.exec_prefix:
    device_loc = 'local'
    project_dir = '..'
    local_project_dir = project_dir
else:
    device_loc = 'colab'
    local_project_dir = '.'

    # Mount google drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=False)
    project_dir = 'drive/My Drive/GA/Capstone/faces-fake-vs-real'

    # Run setup function
    colab_setup()

# Want to be able to access files in my src folder
sys.path.append(project_dir + '/src')

# Import tensorflow, needs to be done after specifying %tensorflow_version 2.x
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers as layers
from tensorboard.plugins.hparams import api as hp

# Loading is one of my modules so need to import after linking my google drive
import loading as ld

Mounted at /content/drive
TensorFlow 2.x selected.
Found GPU at: /device:GPU:0
Initialized empty Git repository in /content/.git/
Updating origin
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 18663 (delta 0), reused 14 (delta 0), pack-reused 18646
Receiving objects: 100% (18663/18663), 468.52 MiB | 36.90 MiB/s, done.
Resolving deltas: 100% (4186/4186), done.
From https://gi
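
The local-versus-Colab check above keys on a username appearing in `sys.exec_prefix`, which only works on one particular machine. A more portable variant, sketched here as an assumption rather than what this notebook actually does, is to test whether the `google.colab` package can be found. The helper name `detect_device_loc` is hypothetical.

```python
import importlib.util


def detect_device_loc():
    """Return 'colab' if the google.colab package can be found, else 'local'."""
    try:
        spec = importlib.util.find_spec('google.colab')
    except ImportError:
        # Raised when the parent package `google` is not installed at all
        spec = None
    return 'colab' if spec is not None else 'local'


device_loc = detect_device_loc()
print(device_loc)
```

The result could then drive the same `if`/`else` branches as the username check, without hard-coding anything about the local environment.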

## Pretrain a model on LFW synthetic dataset

In [None]:
lfw_data_dir = './data/processed/LFW_faceswap_split'
model_dir = project_dir+'/models'

I initially used Keras's `ImageDataGenerator` to load the data, but training turned out to be much faster with the `tf.data` API. I therefore wrote a [class](../src/loading.py) to load the images and their corresponding labels this way. I used some image augmentation to improve generalisability (and to artificially increase the size of the training set).
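
The `DsLoader` class itself lives in `src/loading.py` and is not shown here, so the following is only a minimal sketch of the kind of `tf.data` pipeline described above, under stated assumptions: the images are already in memory as float tensors in the range [0, 1], and the helper names `augment` and `make_dataset` are hypothetical rather than the real loader's API. The flip and brightness augmentations are illustrative choices, not necessarily the ones used in this project.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE


def augment(image, label):
    """Apply light random augmentation to a single image (illustrative choices)."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.clip_by_value(image, 0.0, 1.0)  # keep pixel values in [0, 1]
    return image, label


def make_dataset(images, labels, image_size=(224, 224), augment_data=False,
                 batch_size=32):
    """Build a batched, prefetched dataset from in-memory images and labels."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.map(lambda img, lbl: (tf.image.resize(img, image_size), lbl),
                num_parallel_calls=AUTOTUNE)
    if augment_data:
        # Shuffle and augment only the training split
        ds = ds.shuffle(buffer_size=1024)
        ds = ds.map(augment, num_parallel_calls=AUTOTUNE)
    return ds.batch(batch_size).prefetch(AUTOTUNE)


# Random stand-in data: 8 RGB images of 250x250 pixels with binary labels
images = tf.random.uniform((8, 250, 250, 3))
labels = tf.constant([0, 1, 0, 1, 0, 1, 0, 1])
train_ds = make_dataset(images, labels, augment_data=True, batch_size=4)
```

Mapping with `num_parallel_calls` and ending with `prefetch` lets preprocessing overlap with training, which is a common reason for `tf.data` pipelines being faster than generator-based loading.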

In [None]:
def train_model(model, model_id='', data_dir='', epochs=15, steps_per_epoch=None,
                patience=5, lr=0.001):

    # Get the input_shape the model requires
    input_shape = model.input.get_shape().as_list()[-3:-1]
    
    # Load the data using DsLoader class
    data = ld.DsLoader(data_dir, image_size=input_shape)
    train = data.get_ds(split='train', augment=True, batch_size=32)
    val = data.get_ds(split='val', augment=False, batch_size=32)

    # Early stopping function
    es = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy',
                                          verbose=1,
                                          patience=patience)

    # Setup a model checkpoint to save our best model
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        f'{model_dir}/{model_id}.h5',
        monitor='val_accuracy',
        verbose=1,
        save_best_only=True,
    )

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

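    # Fit the model, checkpointing the best weights and stopping early
    # once val_accuracy has not improved for `patience` epochs
    model_fit = model.fit(train,
                          epochs=epochs,
                          steps_per_epoch=steps_per_epoch,
                          validation_data=val,
                          callbacks=[checkpoint, es])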

    # Save the model fit history
    joblib.dump(model_fit.history, f'{model_dir}/{model_id}_history.gz')

    return model
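
To make the `patience` argument concrete, here is a simplified sketch of patience-based early stopping on a sequence of validation accuracies. It is not the Keras implementation, the function name `early_stopping_epoch` is hypothetical, and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('-inf')
    wait = 0  # epochs since the last improvement
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation accuracies, for illustration only
illustrative_val_acc = [0.60, 0.70, 0.90, 0.88, 0.89, 0.87, 0.85, 0.91]
stop_epoch = early_stopping_epoch(illustrative_val_acc, patience=4)
print(stop_epoch)  # 7: four epochs in a row without beating the epoch-3 best
```

Note that the later value of 0.91 is never reached, which is the trade-off of a small patience: training is cut short even if a better epoch would have followed.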

In [None]:
# Function to construct the model, with various hyperparameters
def build_model(img_width=64,
                num_conv_blocks=3,
                num_filters=32,
                filter_size=3,
                n_output_units=2,
                dilation_rate=1):
    
    # Constituent model blocks
    def conv_block(x, num_filters, filter_size, dilation_rate, block):
        x = layers.Conv2D(num_filters,
                          filter_size,
                          activation='relu',
                          padding='same',
                          dilation_rate=dilation_rate,
                          name=f'conv_{block}_0')(x)
        x = layers.Conv2D(num_filters,
                          filter_size,
                          activation='relu',
                          padding='same',
                          dilation_rate=dilation_rate,
                          name=f'conv_{block}_1')(x)
        x = layers.MaxPooling2D(pool_size=(2, 2), name=f'pool_{block}')(x)
        x = layers.BatchNormalization(name=f'norm_{block}')(x)
        return x


    def output_block(x, n_output_units):
        x = layers.Flatten(name='output_flatten')(x)
        x = layers.Dense(units=n_output_units, activation='softmax',
                         name='output')(x)
        return x

    # Convert hparams to tuples where required
    filter_size = (filter_size,) * 2
    input_shape = (img_width,) * 2 + (3,)
    dilation_rate = (dilation_rate, dilation_rate)

    inputs = keras.Input(shape=input_shape, name='input')
    x = inputs
    for i in range(num_conv_blocks):
        x = conv_block(x, min(num_filters*2**i, 256), filter_size, dilation_rate, i)
    x = output_block(x, n_output_units)

    model = keras.Model(inputs, x, name='conv_1')
    return model

In [8]:
cnn1_model = build_model(img_width=250, num_conv_blocks=6, num_filters=32, filter_size=3, dilation_rate=2)
cnn1_model.summary()

Model: "conv_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, 250, 250, 3)]     0         
_________________________________________________________________
conv_0_0 (Conv2D)            (None, 250, 250, 32)      896       
_________________________________________________________________
conv_0_1 (Conv2D)            (None, 250, 250, 32)      9248      
_________________________________________________________________
pool_0 (MaxPooling2D)        (None, 125, 125, 32)      0         
_________________________________________________________________
norm_0 (BatchNormalization)  (None, 125, 125, 32)      128       
_________________________________________________________________
conv_1_0 (Conv2D)            (None, 125, 125, 64)      18496     
_________________________________________________________________
conv_1_1 (Conv2D)            (None, 125, 125, 64)      36928
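
The parameter counts and output shapes in the summary above can be checked by hand. The sketch below recomputes them for the settings passed to `build_model`; the helper names `conv2d_params` and `summarise_blocks` are hypothetical. It relies on three facts: `padding='same'` keeps the spatial size through each convolution, the dilation rate does not change the number of weights, and 2x2 max pooling with the default padding halves the width, rounding down.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def summarise_blocks(img_width=250, num_conv_blocks=6, num_filters=32,
                     filter_size=3):
    """Return (block, filters, pooled width, conv params, norm params) rows."""
    width, in_channels = img_width, 3
    rows = []
    for block in range(num_conv_blocks):
        filters = min(num_filters * 2 ** block, 256)  # doubles, capped at 256
        conv_params = (conv2d_params(filter_size, in_channels, filters),
                       conv2d_params(filter_size, filters, filters))
        width = width // 2         # 2x2 max pooling rounds down
        norm_params = 4 * filters  # gamma, beta, moving mean, moving variance
        rows.append((block, filters, width, conv_params, norm_params))
        in_channels = filters
    return rows


for row in summarise_blocks():
    print(row)
```

The first two rows reproduce the 896, 9248, 128, 18496 and 36928 parameter counts listed in the summary.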

In [9]:
base_model = tf.keras.applications.MobileNetV2(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

cnn2_model = keras.Sequential([
    base_model,
    keras.layers.GlobalAvgPool2D(),
    keras.layers.Dense(units=2, activation='softmax')
])

Downloading data from https://github.com/JonathanCMitchell/mobilenet_v2_keras/releases/download/v1.1/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224_no_top.h5


In [12]:
# train_model returns the fitted model (the fit history is saved to disk)
cnn2_model = train_model(cnn2_model, model_id='lfw_pretrain1', data_dir=lfw_data_dir, epochs=10, patience=4, lr=0.001)

Train for 50 steps, validate for 38 steps
Epoch 1/10
Epoch 00001: val_accuracy improved from -inf to 0.61913, saving model to drive/My Drive/GA/Capstone/faces-fake-vs-real/models/lfw_pretrain1.h5
Epoch 2/10
Epoch 00002: val_accuracy improved from 0.61913 to 0.69715, saving model to drive/My Drive/GA/Capstone/faces-fake-vs-real/models/lfw_pretrain1.h5
Epoch 3/10
Epoch 00003: val_accuracy improved from 0.69715 to 0.92534, saving model to drive/My Drive/GA/Capstone/faces-fake-vs-real/models/lfw_pretrain1.h5
Epoch 4/10
Epoch 00004: val_accuracy did not improve from 0.92534
Epoch 5/10
Epoch 00005: val_accuracy did not improve from 0.92534
Epoch 6/10

In [None]:
data = ld.DsLoader(lfw_data_dir, image_size=(224, 224))
val = data.get_ds(split='val', augment=False, batch_size=32)
# test = ld.load_test_generator(lfw_data_dir+'/test', img_shape=(224,224,3))
test = data.get_ds(split='test', augment=False, batch_size=32)

In [22]:
cnn2_model.evaluate(val)



[2.026366394797438, 0.8229866]

In [23]:
cnn2_model.evaluate(test)



[1.988599101963796, 0.8105616]

In [None]:
fvr_data = ld.DsLoader('./data/processed/sf/all', image_size=(224, 224))
fvr_train = fvr_data.get_ds(split='train', augment=False, batch_size=32)
fvr_test = fvr_data.get_ds(split='test', augment=False, batch_size=32)

In [34]:
cnn2_model.evaluate(fvr_train)



[26.62042621537751, 0.47671568]
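
Each `evaluate` call returns `[loss, accuracy]`. As a reference point for reading these numbers, the sketch below works out what sparse categorical cross-entropy gives for a two-class model that always predicts 50/50, and for one that is confidently wrong. The probabilities are illustrative and are not taken from the model.

```python
import math


def sparse_categorical_crossentropy(probs, label):
    """Cross-entropy for one example: minus the log-probability of the true class."""
    return -math.log(probs[label])


chance_loss = sparse_categorical_crossentropy([0.5, 0.5], label=0)
confident_wrong_loss = sparse_categorical_crossentropy([1e-6, 1 - 1e-6], label=0)
print(round(chance_loss, 4))           # 0.6931, i.e. ln(2)
print(round(confident_wrong_loss, 2))  # 13.82
```

Read against these reference values, an accuracy of about 0.48 is close to guessing if the two classes are roughly balanced, and a mean loss far above ln(2) suggests that many of the wrong predictions are made with very high confidence.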

In [None]:
cnn2_model2 = keras.models.load_model(filepath=model_dir+'/lfw_pretrain1.h5')

In [51]:

train_model(cnn2_model2, 'fvr_finetune', './data/processed/sf/all', epochs=5, lr=0.0005)

Train for 51 steps, validate for 7 steps
Epoch 1/5
Epoch 00001: val_accuracy improved from -inf to 0.54412, saving model to drive/My Drive/GA/Capstone/faces-fake-vs-real/models/fvr_finetune.h5
Epoch 2/5
Epoch 00002: val_accuracy improved from 0.54412 to 0.59804, saving model to drive/My Drive/GA/Capstone/faces-fake-vs-real/models/fvr_finetune.h5
Epoch 3/5
Epoch 00003: val_accuracy did not improve from 0.59804
Epoch 4/5
Epoch 00004: val_accuracy did not improve from 0.59804
Epoch 5/5
Epoch 00005: val_accuracy did not improve from 0.59804


<tensorflow.python.keras.engine.sequential.Sequential at 0x7f18575c3ba8>

In [54]:
cnn2_model2.evaluate(fvr_test)



[1.267842514174325, 0.54146343]