# Can you determine if two individuals are related? 

https://www.kaggle.com/c/recognizing-faces-in-the-wild

This notebook is meant to run in Google Colab. The data are in my Google Drive folder '/content/gdrive/My Drive/Projet gneugneu/IJN/DATA/KINSHIP'.

We will use two pretrained models (VGGFace and FaceNet) to determine, based on photographs, whether two individuals are related.

In [1]:
import numpy as np
import os
import matplotlib.pyplot as plt
import cv2
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
from imageio import imread
from skimage.transform import resize
from scipy.spatial import distance
from keras.models import load_model
import pandas as pd
from tqdm import tqdm

Using TensorFlow backend.


In [2]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Loading data





In [6]:
os.chdir('/content/gdrive/My Drive/Projet gneugneu/IJN/DATA/KINSHIP')
 
df = pd.read_csv('train_relationships.csv')

print(df)

              p1          p2
0     F0002/MID1  F0002/MID3
1     F0002/MID2  F0002/MID3
2     F0005/MID1  F0005/MID2
3     F0005/MID3  F0005/MID2
4     F0009/MID1  F0009/MID4
...          ...         ...
3593  F1000/MID5  F1000/MID8
3594  F1000/MID5  F1000/MID9
3595  F1000/MID6  F1000/MID9
3596  F1000/MID7  F1000/MID8
3597  F1000/MID7  F1000/MID9

[3598 rows x 2 columns]


In [0]:
test_df = pd.read_csv("/content/gdrive/My Drive/Projet gneugneu/IJN/DATA/KINSHIP/sample_submission.csv")

In [0]:
test_df.head()

Unnamed: 0,img_pair,is_related
0,face05508.jpg-face01210.jpg,0
1,face05750.jpg-face00898.jpg,0
2,face05820.jpg-face03938.jpg,0
3,face02104.jpg-face01172.jpg,0
4,face02428.jpg-face05611.jpg,0


## Models

We use two models. Each of them first computes a face embedding for every image in the test set. The Euclidean distance between the two embeddings of each image pair in the test set is then computed. We then take the mean of the distances obtained with the two models. We finally convert this mean distance to a probability using cumulative probabilities based on the distribution of the distance itself on the test set.
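
Because the embeddings are L2-normalized before any distance is taken (see `l2_normalize` and `calc_embs` below), the Euclidean distance between two embeddings always lies between 0 and 2 and is tied to their cosine similarity by `d**2 = 2 - 2*cos`. The following is a minimal sketch of that relationship; the random vectors are hypothetical stand-ins for real face embeddings, not output of either model.

```python
import numpy as np

def l2_normalize(x, axis=-1, epsilon=1e-10):
    # Same normalization as the helper defined later in this notebook.
    return x / np.sqrt(np.maximum(np.sum(np.square(x), axis=axis, keepdims=True), epsilon))

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two 512-dimensional face embeddings.
a, b = l2_normalize(rng.normal(size=(2, 512)))

euclid = np.linalg.norm(a - b)   # Euclidean distance between unit vectors
cosine = float(np.dot(a, b))     # cosine similarity (both vectors have norm 1)

# For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos(a, b)
lhs = euclid ** 2
rhs = 2 - 2 * cosine
print(lhs, rhs)
```

Ranking pairs by this distance is therefore equivalent to ranking them by cosine similarity in reverse order.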

### VGGFace

In [10]:
!pip install git+https://github.com/rcmalli/keras-vggface.git

Collecting git+https://github.com/rcmalli/keras-vggface.git
  Cloning https://github.com/rcmalli/keras-vggface.git to /tmp/pip-req-build-96su0104
  Running command git clone -q https://github.com/rcmalli/keras-vggface.git /tmp/pip-req-build-96su0104
Building wheels for collected packages: keras-vggface
  Building wheel for keras-vggface (setup.py) ... done
  Created wheel for keras-vggface: filename=keras_vggface-0.6-cp36-none-any.whl size=8311 sha256=136be4e58aece1799bdc6c51f9b182a1da37af689ea01c315a000d333b8abf2d
  Stored in directory: /tmp/pip-ephem-wheel-cache-0q9vu4m8/wheels/36/07/46/06c25ce8e9cd396dabe151ea1d8a2bc28dafcb11321c1f3a6d
Successfully built keras-vggface
Installing collected packages: keras-vggface
Successfully installed keras-vggface-0.6


In [11]:
from keras_applications.imagenet_utils import _obtain_input_shape
from keras_vggface.vggface import VGGFace

# Convolution Features
vgg_features = VGGFace(include_top=False, input_shape=(160, 160, 3), pooling='avg')
model = vgg_features





Downloading data from https://github.com/rcmalli/keras-vggface/releases/download/v2.0/rcmalli_vggface_tf_notop_vgg16.h5








### Images pre-processing

In [0]:
def prewhiten(x):
    if x.ndim == 4:
        axis = (1, 2, 3)
        size = x[0].size
    elif x.ndim == 3:
        axis = (0, 1, 2)
        size = x.size
    else:
        raise ValueError('Dimension should be 3 or 4')

    mean = np.mean(x, axis=axis, keepdims=True)
    std = np.std(x, axis=axis, keepdims=True)
    std_adj = np.maximum(std, 1.0/np.sqrt(size))
    y = (x - mean) / std_adj
    return y

def l2_normalize(x, axis=-1, epsilon=1e-10):
    output = x / np.sqrt(np.maximum(np.sum(np.square(x), axis=axis, keepdims=True), epsilon))
    return output

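def load_and_align_images(filepaths, margin, image_size=160):
    # Despite the name, no face detection or alignment happens here: each image
    # is only resized to image_size x image_size, and `margin` is unused.
    aligned_images = []
    for filepath in filepaths:
        img = imread(filepath)
        aligned = resize(img, (image_size, image_size), mode='reflect')
        aligned_images.append(aligned)

    return np.array(aligned_images)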
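
As a quick sanity check on the pre-processing, the sketch below applies `prewhiten` to a hypothetical batch of random "images" and confirms that each image ends up with zero mean and unit standard deviation. It also shows why the `std_adj` floor is there: a perfectly flat image has zero standard deviation, and the floor prevents a division by zero. The random batch and the flat image are illustrative inputs, not data from the competition.

```python
import numpy as np

def prewhiten(x):
    # Copy of the notebook's helper: standardize each image independently.
    if x.ndim == 4:
        axis = (1, 2, 3)
        size = x[0].size
    elif x.ndim == 3:
        axis = (0, 1, 2)
        size = x.size
    else:
        raise ValueError('Dimension should be 3 or 4')
    mean = np.mean(x, axis=axis, keepdims=True)
    std = np.std(x, axis=axis, keepdims=True)
    std_adj = np.maximum(std, 1.0 / np.sqrt(size))
    return (x - mean) / std_adj

rng = np.random.default_rng(0)

# Hypothetical batch of 4 RGB images, 160 x 160, values in [0, 1).
batch = rng.uniform(0, 1, size=(4, 160, 160, 3))
white = prewhiten(batch)
per_image_mean = white.mean(axis=(1, 2, 3))
per_image_std = white.std(axis=(1, 2, 3))

# A flat (constant) image has std 0; the floor keeps the output finite.
flat = np.full((160, 160, 3), 0.5)
flat_white = prewhiten(flat)

print(per_image_mean, per_image_std, np.isfinite(flat_white).all())
```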


### Computing embeddings 

In [0]:
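def calc_embs(filepaths, margin=10, batch_size=512):
    # Embed the images batch by batch with the currently loaded global `model`.
    preds = []  # one array of embeddings per batch (avoids shadowing pandas' `pd`)
    for start in tqdm(range(0, len(filepaths), batch_size)):
        aligned_images = prewhiten(load_and_align_images(filepaths[start:start+batch_size], margin))
        preds.append(model.predict_on_batch(aligned_images))
    # L2-normalize so that every embedding has unit length.
    embs = l2_normalize(np.concatenate(preds))

    return embs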

In [14]:
test_images = os.listdir("/content/gdrive/My Drive/Projet gneugneu/IJN/DATA/KINSHIP/TEST/")
test_embs = calc_embs([os.path.join("/content/gdrive/My Drive/Projet gneugneu/IJN/DATA/KINSHIP/TEST/", f) for f in test_images])
np.save("test_embs_vgg.npy", test_embs)

 23%|██▎       | 3/13 [10:03<34:03, 204.37s/it]

KeyboardInterrupt: ignored

In [0]:
test_embs.shape

(6282, 512)

### FaceNet

In [0]:
model_path = '/content/gdrive/My Drive/facenet_keras.h5'
model = load_model(model_path)






In [0]:
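# Note: despite its name, `test_embs_vgg` holds the FaceNet embeddings, since
# `model` is now FaceNet. The VGGFace embeddings are in `test_embs`.
test_embs_vgg = calc_embs([os.path.join("/content/gdrive/My Drive/Projet gneugneu/IJN/DATA/KINSHIP/TEST/", f) for f in test_images])
np.save("test_embs_fnet.npy", test_embs_vgg)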

100%|██████████| 13/13 [01:33<00:00,  5.91s/it]


OSError: ignored

## Euclidean distance between image pairs

In [0]:
test_df["distance"] = 0
img2idx = dict()
for idx, img in enumerate(test_images):
    img2idx[img] = idx

In [0]:
for idx, row in tqdm(test_df.iterrows(), total=len(test_df)):
    imgs = [test_embs[img2idx[img]] for img in row.img_pair.split("-")]
    test_df.loc[idx, "distance1"] = distance.euclidean(*imgs)
    
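    # FaceNet distance: `test_embs_vgg` holds the FaceNet embeddings despite its
    # name, while `test_embs` (used for distance1) holds the VGGFace embeddings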
    imgs_2 = [test_embs_vgg[img2idx[img]] for img in row.img_pair.split("-")]
    test_df.loc[idx, "distance2"] = distance.euclidean(*imgs_2)

100%|██████████| 5310/5310 [00:07<00:00, 695.48it/s]
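
The row-by-row loop above is fast enough for 5,310 pairs, but the same distances can be computed in one vectorized step by turning each `img_pair` string into two index arrays. The sketch below illustrates the idea on toy data; the file names and the 2-dimensional embeddings are hypothetical and chosen so the result is easy to check by hand.

```python
import numpy as np

# Hypothetical file names and embeddings (toy 2-d vectors of unit length).
names = ["face0.jpg", "face1.jpg", "face2.jpg"]
embs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 0.0]])
img2idx = {name: i for i, name in enumerate(names)}

# Pairs in the same "left-right" string format as the img_pair column.
img_pairs = ["face0.jpg-face1.jpg", "face0.jpg-face2.jpg"]

# Split every pair once and look up the embedding row of each side.
left, right = zip(*(p.split("-") for p in img_pairs))
left_idx = np.array([img2idx[n] for n in left])
right_idx = np.array([img2idx[n] for n in right])

# One Euclidean distance per pair, computed for all pairs at once.
distances = np.linalg.norm(embs[left_idx] - embs[right_idx], axis=1)
print(distances)
```

With a DataFrame, the resulting array could be assigned directly to a column, which avoids repeated `.loc` writes inside a loop.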


In [0]:
test_df['distance'] = test_df[['distance1','distance2']].mean(axis=1)
test_df.head()

Unnamed: 0,img_pair,is_related,distance,distance1,distance2
0,face05508.jpg-face01210.jpg,0,1.284604,1.102604,1.466604
1,face05750.jpg-face00898.jpg,0,1.241447,1.092807,1.390086
2,face05820.jpg-face03938.jpg,0,1.255383,1.050549,1.460218
3,face02104.jpg-face01172.jpg,0,1.099331,0.920364,1.278298
4,face02428.jpg-face05611.jpg,0,1.10086,0.976633,1.225088


## From distances to kinship probability 

In [0]:
all_distances = test_df.distance.values
sum_dist = np.sum(all_distances)

In [0]:
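probs = []
for dist in tqdm(all_distances):
    # Share of the total distance carried by pairs at most this far apart
    prob = np.sum(all_distances[all_distances <= dist])/sum_dist
    # Small distances give scores near 1; the largest distance gives 0
    probs.append(1 - prob)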

100%|██████████| 5310/5310 [00:00<00:00, 23430.36it/s]
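
The loop above rescans the whole array for every pair. As a sketch of an equivalent formulation, the same scores can be obtained from a single sort and cumulative sum. The function name `distance_scores` and the random distances are hypothetical and only serve to compare the two versions.

```python
import numpy as np

def distance_scores(distances):
    """1 - (sum of all distances <= d) / (sum of all distances), for each d."""
    distances = np.asarray(distances, dtype=float)
    ordered = np.sort(distances)
    csum = np.cumsum(ordered)
    # Position of the last sorted value that is <= each distance (handles ties).
    last = np.searchsorted(ordered, distances, side="right") - 1
    return 1.0 - csum[last] / csum[-1]

rng = np.random.default_rng(0)
d = rng.uniform(0.8, 1.6, size=1000)   # hypothetical mean distances

fast = distance_scores(d)
# Reference: the same computation written as in the notebook's loop.
slow = np.array([1 - d[d <= x].sum() / d.sum() for x in d])

print(np.allclose(fast, slow), fast.min(), fast.max())
```

Note that this mapping is monotone: it preserves the ordering of pairs by distance, with the farthest pair scoring exactly 0. It is a rescaling of the distances rather than a calibrated probability of kinship.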


In [0]:
sub_df = pd.read_csv("/content/gdrive/My Drive/Projet gneugneu/IJN/DATA/KINSHIP/sample_submission.csv")
sub_df.is_related = probs
sub_df.to_csv("/content/gdrive/My Drive/submission.csv", index=False)