# Face Detection, Alignment, Embeddings, Clustering

### Notebook content:
- 1 Correct jpg files metadata
- 2 MTCNN (detect faces on photos)
- 3 Filter results (remove bad quality photos)
- 4 FaceNet embeddings (obtain a 128-dimensional vector (embedding) that represents facial features numerically with FaceNet)
- 5 t-SNE: a 2-D representation that places similar photos close together and dissimilar photos farther apart.

In [1]:
# Load libraries and helper functions by running utils.py with a jupyter magic function:
%run utils.py

ModuleNotFoundError: No module named 'piexif'

## 1. Correct jpg files metadata (creation-time)
The metadata of jpg files is often wrong (e.g. the camera clock was not set correctly). Here are some useful functions to check and correct the metadata:

### Check metadata for jpg files

In [2]:
# Scan over immediate subfolders in the main directory folder and check creation times within these subfolders
directory = r'Test\Photoset'
print_creation_times_for_subfolders(directory)

NameError: name 'print_creation_times_for_subfolders' is not defined

In [None]:
# Check creation times in one of the subfolders:
subdirectory = r'Test\Photoset\2020'
get_creation_times_range(subdirectory)

In [None]:
file_path = r'Test\Photoset\2018\0.jpeg'
get_creation_time(file_path)

### Correct metadata (if needed)

In [None]:
# Change single file creation datetime (if needed)
file_path_to_change_metadata = r'Test\Photoset\2018\0.jpeg'
change_jpg_datetime(file_path_to_change_metadata, 2018, 9, 8)

In [None]:
# Change creation times of all jpg files in a folder to a specific date
folder_to_change_metadata = r'Test\Photoset\2020'
change_datetime_in_folder(folder_to_change_metadata, 2020, 6, 13)
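
The helpers above come from `utils.py`, which is not shown here. As a rough sketch of what a date correction involves: EXIF stores timestamps as `YYYY:MM:DD HH:MM:SS` strings. The function names below are hypothetical and are not the notebook's own implementation. The last function assumes the `piexif` package, which is imported lazily so that the formatting helpers run without it.

```python
from datetime import datetime

# EXIF stores timestamps as "YYYY:MM:DD HH:MM:SS"
EXIF_DATETIME_FORMAT = "%Y:%m:%d %H:%M:%S"


def to_exif_datetime(year, month, day, hour=0, minute=0, second=0):
    """Format a date as an EXIF timestamp string (also validates the date)."""
    return datetime(year, month, day, hour, minute, second).strftime(EXIF_DATETIME_FORMAT)


def parse_exif_datetime(value):
    """Parse an EXIF timestamp given as str or bytes back into a datetime."""
    if isinstance(value, bytes):
        value = value.decode("ascii")
    return datetime.strptime(value, EXIF_DATETIME_FORMAT)


def set_jpg_datetime_sketch(file_path, year, month, day):
    """Hypothetical stand-in for change_jpg_datetime; requires piexif."""
    import piexif  # imported lazily so the helpers above work without it

    exif_dict = piexif.load(file_path)
    stamp = to_exif_datetime(year, month, day)
    # Update the three EXIF date fields that cameras usually write.
    exif_dict["0th"][piexif.ImageIFD.DateTime] = stamp
    exif_dict["Exif"][piexif.ExifIFD.DateTimeOriginal] = stamp
    exif_dict["Exif"][piexif.ExifIFD.DateTimeDigitized] = stamp
    piexif.insert(piexif.dump(exif_dict), file_path)
```

Which EXIF fields the notebook's own `change_jpg_datetime` touches is not stated, so updating all three is an assumption of this sketch.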

## 2. MTCNN (Multi-Task Cascaded CNN)
MTCNN performs several tasks simultaneously: face detection and face alignment, i.e. finding the face bounding box and the landmarks (coordinates of the eyes, nose, and mouth corners) on the face.
- Original paper (2016): https://arxiv.org/abs/1604.02878  
- GitHub repo: https://github.com/ipazc/mtcnn by https://www.linkedin.com/in/ivandepazcenteno/
- Description with examples: https://machinelearningmastery.com/how-to-perform-face-detection-with-classical-and-deep-learning-methods-in-python-with-keras/
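
To make the "face box and landmarks" output concrete, here is a small sketch of the dictionary that the `mtcnn` package returns for each detected face. The example values are hand-written for illustration and do not come from a real photo. `detect_faces_sketch` and `summarize_detection` are hypothetical helper names; the first assumes `mtcnn` and `opencv-python` are installed.

```python
def detect_faces_sketch(image_path):
    """Run the ipazc/mtcnn detector on one image (needs mtcnn and opencv-python)."""
    import cv2
    from mtcnn import MTCNN  # lazy imports keep the helper below usable alone

    # OpenCV loads images as BGR, while the detector expects RGB.
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    return MTCNN().detect_faces(image)


def summarize_detection(detection):
    """Flatten one detection dict into box corners, confidence and landmarks."""
    x, y, width, height = detection["box"]
    return {
        "top_left": (x, y),
        "bottom_right": (x + width, y + height),
        "confidence": detection["confidence"],
        "landmarks": dict(detection["keypoints"]),
    }


# Hand-written example in the shape detect_faces returns (values are made up).
example_detection = {
    "box": [120, 80, 200, 260],  # x, y, width, height
    "confidence": 0.998,
    "keypoints": {
        "left_eye": (180, 180), "right_eye": (265, 178), "nose": (222, 230),
        "mouth_left": (188, 280), "mouth_right": (258, 279),
    },
}
summary = summarize_detection(example_detection)
```

The box, the confidence, and the five landmarks are the inputs that the filters in section 3 work with.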

In [None]:
# Scan a folder with subfolders containing photos, perform MTCNN on all photos, and save the results to a csv file.
# Depending on the archive size, scanning photos might take a lot of time.
# The results are saved to the MTCNN_results file after each photo is scanned.
# Thus, if the scan is interrupted, the results are still saved, and the scan can be continued later.
# If new photos are added to the archive, this function will run MTCNN only on the new photos.
# For a completely new scan with different parameters (min_face_size), a new MTCNN_results csv file should be created.

# Main archive photo folder with subfolders containing photos
photo_folder = r'Test\Photoset'
# csv file to save scan results
MTCNN_results_file_path = r'Test\MTCNN_results.csv'

In [None]:
%%time
# Run MTCNN and save results:
get_mtcnn_results(photo_folder, MTCNN_results_file_path, min_face_size = 100)
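
The resumable behaviour described in the comments above can be sketched with the standard library alone. This is not the code of `get_mtcnn_results`. The column names `file_path` and `n_faces` and the `detect` callback are assumptions made for illustration.

```python
import csv
import os


def load_scanned_paths(results_csv):
    """Return the set of photo paths already recorded in the results file."""
    if not os.path.exists(results_csv):
        return set()
    with open(results_csv, newline="") as f:
        return {row["file_path"] for row in csv.DictReader(f)}


def scan_folder(photo_folder, results_csv, detect):
    """Run `detect` on photos not yet in results_csv, appending after each photo."""
    done = load_scanned_paths(results_csv)
    # Write the header only when starting a new (or empty) results file.
    write_header = not os.path.exists(results_csv) or os.path.getsize(results_csv) == 0
    new_count = 0
    with open(results_csv, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file_path", "n_faces"])
        if write_header:
            writer.writeheader()
        for root, _dirs, files in os.walk(photo_folder):
            for name in sorted(files):
                path = os.path.join(root, name)
                if not name.lower().endswith((".jpg", ".jpeg")) or path in done:
                    continue
                writer.writerow({"file_path": path, "n_faces": len(detect(path))})
                f.flush()  # keep the row on disk even if the scan is interrupted
                new_count += 1
    return new_count
```

Photos with no detected faces are still recorded, so they are not scanned again on the next run.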

## 3. Filter photos

In [None]:
df = load_MTCNN_scan_results(MTCNN_results_file_path)
df.tail()

### MTCNN results need filtering for several reasons:
For example: the detection is not a face, the face is too small for further processing, the face is not frontal (the person looks to the side), the face is out of focus, the face is tilted too much, the image is grayscale, etc.

### Define the filters:
- **confidence_filter** filters out photos with low confidence. MTCNN reports its confidence in each found face as one of its outputs. Low confidence values correspond either to non-faces (objects resembling a face) or to occluded faces, so it is better to filter them out, e.g. faces with confidence less than 0.99 will be excluded.

- **face_height_filter**. Faces with a height less than face_height_filter will be excluded. Although MTCNN takes min_face_size as an argument, some of the found faces are smaller than the min_face_size parameter.

- **nose_shift_filter**. If the nose shift is bigger than e.g. 15, the face will be excluded. This is an attempt to filter out photos that are not frontal. The nose shift is determined from the positions of the eye and nose landmarks: if the nose is strongly shifted relative to the center between the eyes, the photo is filtered out.

- **eye_line_angle_filter**. If eye_line_angle is more than eye_line_angle_filter, the face will be excluded. All photos will be rotated so that the eye line is horizontal.

- **sharpness_filter**. An assessment of image blurriness: the bigger the value, the lower the blurriness. The idea is taken from https://www.pyimagesearch.com/2015/09/07/blur-detection-with-opencv/ and modified. I found empirically that `(max-min)**2/var` works better than simple variance as an assessment of blurriness. It is also computed on the central part of the face around the nose (not on the whole face image). If sharpness is less than sharpness_filter, the face will be excluded.
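
The two landmark-based measures can be sketched from the eye and nose coordinates. The exact definitions used in `utils.py` are not shown in this notebook. Expressing the nose shift as a percentage of the distance between the eyes is an assumption, and the function names are illustrative.

```python
import math


def eye_line_angle(left_eye, right_eye):
    """Angle of the line through the eyes relative to horizontal, in degrees."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def nose_shift(left_eye, right_eye, nose):
    """Offset of the nose from the eye midpoint, measured along the eye line,
    as a percentage of the distance between the eyes (assumed definition)."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    eye_distance = math.hypot(dx, dy)
    mid_x = (left_eye[0] + right_eye[0]) / 2
    mid_y = (left_eye[1] + right_eye[1]) / 2
    # Project the midpoint-to-nose vector onto the unit vector along the eye line.
    along = ((nose[0] - mid_x) * dx + (nose[1] - mid_y) * dy) / eye_distance
    return 100 * abs(along) / eye_distance
```

Measuring the shift along the eye line rather than along the image x-axis keeps the value meaningful for tilted faces. With eyes at (100, 100) and (200, 100), a nose at (150, 150) gives a shift of 0, and a nose at (170, 150) gives a shift of 20.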

In [None]:
# Let's define the filters:
confidence_filter = 0.98
face_height_filter = 10
nose_shift_filter = 25
eye_line_angle_filter = 45
sharpness_filter = 20
# grayscale_image_filter is on by default, so no grayscale images will pass through

# And preview the images (one can find appropriate filters depending on the task)
# The landmarks on faces are shown in preview mode and are not saved in the save mode
save_image_folder = False
preview = True

plt.rcParams["figure.figsize"] = (3,3)
image_filter(df[0:100], save_image_folder, preview, confidence_filter, face_height_filter, nose_shift_filter, eye_line_angle_filter, sharpness_filter)
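
The sharpness score behind sharpness_filter is described in the filter list as (max-min)**2/var. The text does not say which values the statistics are taken over. The sketch below assumes they are taken over the Laplacian response, as in the linked blur-detection article. It uses plain Python lists so that it runs without OpenCV, and the function names are illustrative.

```python
from statistics import pvariance


def laplacian(gray):
    """4-neighbour Laplacian of a 2-D grayscale image given as a list of rows."""
    rows, cols = len(gray), len(gray[0])
    return [
        gray[r - 1][c] + gray[r + 1][c] + gray[r][c - 1] + gray[r][c + 1] - 4 * gray[r][c]
        for r in range(1, rows - 1)
        for c in range(1, cols - 1)
    ]


def sharpness_score(gray):
    """(max - min)**2 / var of the Laplacian response; 0.0 for a flat image."""
    response = laplacian(gray)
    var = pvariance(response)
    if var == 0:
        return 0.0  # a perfectly flat patch has no edges to measure
    return (max(response) - min(response)) ** 2 / var
```

In the notebook the score is computed on a central crop around the nose. Here the caller is expected to pass that crop as `gray`.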

In [None]:
%%time
# Now that we have decided on the filter parameters for the task, let's get and save the filtered results:
save_image_folder = r'Test\Cropped_face_images'
preview = False

df_filtered = image_filter(df, save_image_folder, preview, confidence_filter, face_height_filter, nose_shift_filter, eye_line_angle_filter, sharpness_filter)

#file_path_filtered_results = r'output\MTCNN_min_face_200_filtered.csv'
file_path_filtered_results = r'Test\MTCNN_results_filtered.csv'

print('Number of faces before filtering:', len(df), 'Number of faces after filtering:', len(df_filtered))
df_filtered.to_csv(file_path_filtered_results,index=False)

In [None]:
#df_filtered = pd.read_csv(r'output\MTCNN_min_face_200_filtered.csv')
df_filtered = pd.read_csv(file_path_filtered_results)
print(len(df_filtered))
df_filtered.head()

## 4. FaceNet Embeddings

In [None]:
# load the pretrained facenet model
# The model takes a 160x160 color image as input
# and outputs a 128-value embedding vector generated from that image
model = load_model(r'facenet_keras_pretrained/model/facenet_keras.h5')
print('model_input:', model.inputs)
print('model_output:', model.outputs)
model.summary()

In [None]:
%%time
# Get embeddings for all face images
embeddings = []
for file_path in tqdm(df_filtered['face_file_path']):
    emb = get_facenet_embedding(file_path, model)
    embeddings.append(emb)
embeddings = np.array(embeddings)
#print(embeddings.shape)

# Save embeddings to the file
embeddings_file_path = r'Test\embeddings'
np.save(embeddings_file_path, embeddings)
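
FaceNet is trained so that embeddings of the same person lie close together in Euclidean distance, which is what makes the clustering in the next section possible. A minimal sketch of comparing two embeddings follows. Normalizing to unit length first is a common convention and is assumed here; whether `get_facenet_embedding` already normalizes its output is not stated in this notebook.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def embedding_distance(emb_a, emb_b):
    """Euclidean distance between two L2-normalized embeddings (range 0 to 2)."""
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, `embedding_distance(embeddings[0], embeddings[1])` would compare the first two saved faces; smaller values mean more similar faces.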

## 5. t-SNE

In [None]:
# Load filtered results
df_filtered = pd.read_csv(file_path_filtered_results)

# Load the embeddings (raw string so the backslash is not treated as an escape)
X = np.load(r'Test\embeddings.npy')

In [None]:
# Calculate t-SNE
X_tsne = TSNE(perplexity=2, learning_rate = 1000, n_iter=1000, random_state=0).fit_transform(X)

x = X_tsne[:,0]
y = X_tsne[:,1]
paths =  df_filtered.face_file_path
creation_dates = df_filtered.creation_date


# The idea on how to plot faces is taken from https://stackoverflow.com/questions/22566284/matplotlib-how-to-plot-images-instead-of-points
def getImage(path, size):
    image = plt.imread(path)
    image = resize_image(image, size)
    return OffsetImage(image)

plt.rcParams["figure.figsize"] = (20,20)
# This part is plotting faces
fig, ax = plt.subplots()
ax.scatter(x, y)
for x0, y0, path in zip(x, y, paths):
    ab = AnnotationBbox(getImage(path, (50,50)), (x0, y0), frameon=False)
    ax.add_artist(ab)

# This part is plotting colors
# Creation date is shown with colors on the plot below
# Spectral(rainbow) palette is used with older photos shown in red and recent in blue
sns.scatterplot(x=x, y=y, hue = creation_dates, s=7000, palette=sns.color_palette('Spectral',len(set(creation_dates))))
#plt.legend([],[], frameon=False) # hide the legend if it is too long

# Set figure background
sns.set_style("whitegrid", {'axes.grid' : False,'axes.facecolor': 'white'})
plt.show()

As we can see, t-SNE separated these photos into groups using the face embeddings generated by FaceNet. t-SNE is not ideal (for example, one photo of my daughter is mixed in with my son's photos in this figure), and the result depends on the hyperparameters of the t-SNE algorithm. It is also initialized randomly, so changing random_state gives different results. The output becomes more robust and accurate when more photos are used, as shown below for 1700 facial images from my archive. In this figure I can clearly see members of our family and friends nicely separated into groups.

<img src="Test/all.png">
# Do try it for your photo archive!