In [1]:
import os
import face_recognition
import cv2
import pandas as pd
import numpy as np
import json

# Contents:
[1. Data preparation](#data_prep)  
[2. Creating data structures for face recognition algorithm](#data_structures)
  - `df_final`  
  - `encodings`   
  - `users`   
  - `df_test_encodings`  
  
[3. Measuring different methods' performance](#testing)
  - Baseline  
  - Mean encodings  
  - 5 in a row  
  - Face distance threshold    
  - Minimum distance  
  
[4. Conclusions](#conclusions)    

# <a id='data_prep'>1. Data preparation</a>

### identity_CelebA dataframe
here we have only `filename` column and corresponding `identity`

In [5]:
# reading dataset 'identity_CelebA'
df_identity = pd.read_csv('dataset/identity_CelebA.txt', sep=' ', names=['filename', 'identity'])
df_identity.head(3)

Unnamed: 0,filename,identity
0,000001.jpg,2880
1,000002.jpg,2937
2,000003.jpg,8692


leaving only those identities, who have more than 7 photos

In [124]:
distribution = df_identity['identity'].value_counts()  # counting how many images there are for each identity
distr = distribution.loc[distribution > 4]             # leaving identities, who has 7+ images 
idx_list = distr.index                                 # getting indexes of the left identities
len(idx_list)

9343

### Creating the DataFrame, where each row is the identity index and the corresponding filenames

In [126]:
def create_identities_dict(row):
    if row['identity'] in identities_dict:
        identities_dict[row['identity']][0].append(row['filename'])
    else:
        identities_dict[row['identity']] = [[row['filename']]]

In [127]:
%%time
identities_dict = {}

for index, row in df_identity.iterrows():
    create_identities_dict(row)

# creating identities/filenames DataFrame
df_if = pd.DataFrame.from_dict(identities_dict, orient='index').reset_index().rename(columns={"index": 'identities', 0: 'filenames'})

CPU times: user 12.3 s, sys: 7.99 ms, total: 12.3 s
Wall time: 12.4 s


In [128]:
df_if = df_if.loc[df_if['identities'].isin(idx_list)]
print('DataFrame has ' + str(len(df_if)) + ' rows')
df_if.head(5)

DataFrame has 9343 rows


Unnamed: 0,identity,filenames
0,2880,"[000001.jpg, 000404.jpg, 003415.jpg, 004390.jp..."
1,2937,"[000002.jpg, 011437.jpg, 016335.jpg, 017121.jp..."
2,8692,"[000003.jpg, 015648.jpg, 033840.jpg, 038887.jp..."
3,5805,"[000004.jpg, 001778.jpg, 010191.jpg, 013676.jp..."
4,9295,"[000005.jpg, 008431.jpg, 014427.jpg, 016680.jp..."


# <a id='data_structures'>2. Creating data structures for face recognition algorithm</a>
We need:  
`df_final` - a DataFrame where each identity has 5 face encodings  
`encodings` - array of all encodings  
`users` - dictionary, where keys are users' idx and the values are the corresponding names/addresses (here these are identities' idx, but in my project it will change)  
`df_test_images` - DataFrame with the images that weren't parsed into encodings earlier  
`df_test_encodings` - DataFrame with identities and images parsed into encodings  

### 2.1 `df_final`:

In [130]:
def image_to_encodings(filename):
    """
        - Gets filename, reads the image, finds faces, creates face encodings
        - Returns face encoding, if a face was found on the image
    """
    img = cv2.imread('dataset/img_align_celeba/' + filename)
    locations = face_recognition.face_locations(img, model='hog')
    encodings_ = face_recognition.face_encodings(img, locations)
    return encodings_

In [139]:
def find_new_image(identity, filenames_list):
    """
        - Gets id of identity and list of filenames, which belongs to identity
        - Proceeds a search of non-seen images and trying to recognize face on it
        - Goes into recursion, if a face wasn't found
        - Returns face encoding and filename of the image, on which the face was recognized
    """
    global seen_images
    for filename_ in filenames_list:
        if filename_ not in seen_images:
            encodings_ = image_to_encodings(filename_)
            seen_images.append(filename_)
            filenames_list.remove(filename_)
            if len(encodings_) != 1:
                new_encoding_, new_filename_ = find_new_image(identity, filenames_list)
                seen_images.append(new_filename_)
                return new_encoding_, new_filename_
            return encodings_[0], filename_
    return None, None
    

In [326]:
def filenames_into_encodings(df: pd.DataFrame):
    """
        - Takes a DataFrame with columns 'filenames' and 'identities'
        - Reads each image, creates face encodings, tries to find new image in case the face wasn't found 
        or skips damaged images/identities
        - Returns:
            - `encodings_overall` - list with all encodings
            - `errors` - log of images, on which there was 0 or more than 1 faces
            - `seen_images_` - list of seen images
            - `data_` - dictionary, where keys are identities idx and values are list of 5 face encodings
    """
    encodings_overall = []  # list of all encodings 
    errors_ = []  # list for images, where weren't located any faces
    seen_images_ = []
    data_ = {'identities': [], 'encodings': []}

    # for identity in df['identity'].iloc[:100].values:
    # be careful with the activation of the line below, it takes almost 2 hours to proceed
    for identity in df['identities'].values:
#         filenames_list = df['filenames'].loc[df['identities'] == identity].values[0]
        filenames_list = json.loads(df['filenames'].loc[df['identities'] == identity].values[0])
        counter = 0
        data_['identities'].append(identity)
        temp_encodings = []
        for filename in filenames_list:
            if counter >= 5:
                continue

            encodings_ = image_to_encodings(filename)
            seen_images_.append(filename)

            # if a face wasn't found or there were more than 1 face
            if len(encodings_) != 1:
                # try to find another image of the same identity
                new_encoding, new_filename = find_new_image(identity, filenames_list)

                # if there wasn't any other photos
                if new_encoding is None:
                    # append an error message
                    errors_.append(('problem in:', filename, new_filename, identity, 'solved with:', 'ISN\'T SOLVED'))
                else:
                    encodings_overall.append(new_encoding)
                    temp_encodings.append(list(new_encoding))
                    errors_.append(('problem in:', filename, identity, 'solved with:', new_filename))
            else:
                encodings_overall.append(encodings_[0])
                temp_encodings.append(list(encodings_[0]))

            counter += 1
        data_['encodings'].append(json.dumps(temp_encodings))
    return encodings_overall, errors_, seen_images_, data_

In [140]:
%%time
encodings, errors, seen_images, data = filenames_into_encodings(df_if)

CPU times: user 2h 28min, sys: 7.35 s, total: 2h 28min 8s
Wall time: 2h 35min 39s


In [308]:
len(seen_images)

48567

Creating and saving the dataframe

In [141]:
df_fin = pd.DataFrame(data=data, columns=['identities', 'encodings'])
df_fin.to_csv('identities_encodings_super_final.csv', index=False)
df_fin.head(5)

Unnamed: 0,identities,encodings
0,2880,"[[-0.09392920881509781, 0.16680526733398438, 0..."
1,2937,"[[-0.13565079867839813, 0.04019104316830635, 0..."
2,8692,"[[-0.08772975206375122, 0.17790649831295013, 0..."
3,5805,"[[-0.207429900765419, 0.17357361316680908, -0...."
4,9295,"[[-0.11555222421884537, 0.05275091528892517, 0..."


Reading the DF

In [142]:
df_read_fin = pd.read_csv('identities_encodings_super_final.csv')
df_read_fin

Unnamed: 0,identities,encodings
0,2880,"[[-0.09392920881509781, 0.16680526733398438, 0..."
1,2937,"[[-0.13565079867839813, 0.04019104316830635, 0..."
2,8692,"[[-0.08772975206375122, 0.17790649831295013, 0..."
3,5805,"[[-0.207429900765419, 0.17357361316680908, -0...."
4,9295,"[[-0.11555222421884537, 0.05275091528892517, 0..."
...,...,...
9338,5920,"[[-0.20153068006038666, 0.13730086386203766, 0..."
9339,9268,"[[-0.10642589628696442, 0.13724175095558167, 0..."
9340,8052,"[[0.06224041059613228, 0.005616001784801483, -..."
9341,8146,"[[-0.12671472132205963, 0.1192317008972168, 0...."


Checking how many face encodings each person has

In [143]:
def check_enc_quant(row):
    if len(json.loads(row['encodings'])) != 5:
        damaged_rows.append(row['identities'])

Finding the damaged rows

In [162]:
damaged_rows = []

for index, row in df_read_fin.iterrows():
    check_enc_quant(row)
    
len(damaged_rows)

60

Dropping the damaged rows and saving the result

In [163]:
df_read_fin = df_read_fin.loc[~df_read_fin['identities'].isin(damaged_rows)]
df_read_fin.to_csv('identities_encodings_super_final2.csv', index=False)
len(df_read_fin)

9283

### 2.2 `encodings`:

Reading the DataFrame

In [7]:
df_final = pd.read_csv('identities_encodings_super_final2.csv')
print(f'{len(df_final)} unique identities in the DataFrame')

9283 unique identities in the DataFrame


Creating list of all encodings

In [8]:
encodings = []
for index, row in df_final.iterrows():
    encodings.extend(json.loads(row['encodings']))

print(f'{len(encodings)} face encodings in the encodings list')

46415 face encodings in the encodings list


### 2.3 `users`:  
Creating map:  
- from encoding index // 5   
- to identity or user's name/address  

In [9]:
user_idx = df_final['identities']
users = {key: value for key, value in zip(range(len(user_idx)), user_idx)}

### 2.4 `df_test_images`:

Creating new DataFrame, which consists of images, which weren't seen during creation of the main dataset (`df_final`)

In [240]:
%%time
data_test = {'identities': [], 'filenames': []}

for index, row in df_if.iterrows():
    data_test['identities'].append(row['identity'])
    temp_list = []
    for filename in row['filenames']:
        if filename not in seen_images:
            temp_list.append(filename)
    data_test['filenames'].append(json.dumps(temp_list))

len(data_test['filenames'])

CPU times: user 3min 39s, sys: 19.9 ms, total: 3min 39s
Wall time: 3min 39s


9343

In [241]:
df_test = pd.DataFrame(data=data_test, columns=['identities', 'filenames'])
len(df_test)

9343

Dropping identities, who doesn't have any images

In [242]:
rows_to_drop = []

for index, row in df_test.iterrows():
    if len(row['filenames']) < 2:
        rows_to_drop.append(row['identities'])

len(rows_to_drop)

0

In [243]:
df_test_images = df_test.loc[~df_test['identities'].isin(rows_to_drop)]
df_test_images.to_csv('test_identities_filenames.csv', index=False)
len(df_test_images)

9343

### 2.5 `df_test_encodings`:

Creating DataFrame, where each identity will have his list of face encodings

In [321]:
%%time
test_encodings, test_errors, test_seen_images, test_data = filenames_into_encodings(df_test)

CPU times: user 2h 18min 34s, sys: 8.02 s, total: 2h 18min 42s
Wall time: 2h 34min 19s


Saving the DataFrame

In [322]:
df_test_encodings = pd.DataFrame(data=test_data, columns=['identities', 'encodings'])
df_test_encodings.to_csv('identities_encodings_test.csv', index=False)
df_test_encodings.head(5)

Unnamed: 0,identities,encodings
0,2880,"[[-0.08185373246669769, 0.10833615064620972, -..."
1,2937,"[[-0.08213772624731064, 0.0223865807056427, 0...."
2,8692,"[[-0.03371645510196686, 0.10744943469762802, 0..."
3,5805,"[[-0.2582927942276001, 0.08399071544408798, 0...."
4,9295,"[[-0.10332327336072922, 0.07786936312913895, 0..."


Reading the DataFrame

In [2]:
df_test_read = pd.read_csv('identities_encodings_test.csv')
df_test_read

Unnamed: 0,identities,encodings
0,2880,"[[-0.08185373246669769, 0.10833615064620972, -..."
1,2937,"[[-0.08213772624731064, 0.0223865807056427, 0...."
2,8692,"[[-0.03371645510196686, 0.10744943469762802, 0..."
3,5805,"[[-0.2582927942276001, 0.08399071544408798, 0...."
4,9295,"[[-0.10332327336072922, 0.07786936312913895, 0..."
...,...,...
9338,5920,"[[-0.18387816846370697, 0.0963742733001709, 0...."
9339,9268,"[[-0.21787138283252716, 0.15073078870773315, 0..."
9340,8052,[]
9341,8146,"[[-0.03438824787735939, 0.09196805953979492, 0..."


In [3]:
def find_identities_with_less_than_n_encodings(df: pd.DataFrame, n: int):
    identities_to_drop = []
    for index, row in df.iterrows():
        encodings_list = json.loads(row['encodings'])
        if len(encodings_list) < n:
            identities_to_drop.append(row['identities'])
    return identities_to_drop


def create_column_with_n_encodings(df: pd.DataFrame, n: int):
    n_encodings = []
    for index, row in df.iterrows():
        encodings_list = json.loads(row['encodings'])
        temp_encodings = []
        counter = 0
        for enc in encodings_list:
            if counter < n:
                temp_encodings.append(enc)
                counter += 1
        n_encodings.append(temp_encodings)
    return n_encodings

Cleaning identities, who has less than 5 encodings

In [4]:
idx_to_drop = find_identities_with_less_than_n_encodings(df_test_read, 5)
df_test_read = df_test_read.loc[~df_test_read['identities'].isin(idx_to_drop)]
len(df_test_read)

8270

Creating new column, where is left only 5 encodings for each identity

In [15]:
new_column = create_column_with_n_encodings(df_test_read, 5)
df_test_read.loc[:, ('5_encodings')] = new_column
df_test_read

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test_read.loc[:, ('5_encodings')] = new_column


Unnamed: 0,identities,encodings,5_encodings
0,2880,"[[-0.08185373246669769, 0.10833615064620972, -...","[[-0.08185373246669769, 0.10833615064620972, -..."
1,2937,"[[-0.08213772624731064, 0.0223865807056427, 0....","[[-0.08213772624731064, 0.0223865807056427, 0...."
2,8692,"[[-0.03371645510196686, 0.10744943469762802, 0...","[[-0.03371645510196686, 0.10744943469762802, 0..."
3,5805,"[[-0.2582927942276001, 0.08399071544408798, 0....","[[-0.2582927942276001, 0.08399071544408798, 0...."
4,9295,"[[-0.10332327336072922, 0.07786936312913895, 0...","[[-0.10332327336072922, 0.07786936312913895, 0..."
...,...,...,...
9329,5013,"[[-0.07761994749307632, 0.055308748036623, 0.1...","[[-0.07761994749307632, 0.055308748036623, 0.1..."
9331,9710,"[[-0.111241415143013, 0.07181288301944733, 0.0...","[[-0.111241415143013, 0.07181288301944733, 0.0..."
9334,6258,"[[-0.04995613917708397, 0.057822298258543015, ...","[[-0.04995613917708397, 0.057822298258543015, ..."
9335,10101,"[[-0.1197553351521492, 0.00509718619287014, 0....","[[-0.1197553351521492, 0.00509718619287014, 0...."


# <a id="testing">3. Testing perfomance of the different approaches</a>  
## 3.1 Baseline  
- Take the first encoding of each person
- Choosing the best tolerance threshold
- Maximize the accuracy metric

In [18]:
def match_encoding(encodings_list, encoding, np_ar, method):
    """
        - Takes the list of all encodings, encoding which is going to be compared, 
            parameters for np.arange function, which will choose the threshold and method name
        - Depends on method name proceeds different algorithm of comparing face encodings
        - Returns a list of predicted identities. List length equals to amount of iterated thresholds
        
    """
    threshold_results = []
    encodings_list = np.array(encodings_list)
    encoding = np.array(encoding)
    
    
    if method == 'distance_threshold':
        results = face_recognition.face_distance(encodings_list, encoding)
        batch_size = 5
        for threshold in np.arange(np_ar[0], np_ar[1], np_ar[2]):
            matched_identity_idx = None
            for i in range(0, len(results), batch_size):
                mean_distance = np.mean(results[i:i+batch_size])
                if mean_distance < threshold:
                    matched_identity_idx = i // 5
                    break
            if matched_identity_idx is None:
                threshold_results.append(None)
            else:
                identity_actual = users[matched_identity_idx]
                threshold_results.append(identity_actual)    
            
            
    elif method == 'min_distance':        
        results = face_recognition.face_distance(encodings_list, encoding)
        batch_size = 5
        distances = []
        for i in range(0, len(results), batch_size):
            mean_distance = np.mean(results[i:i+batch_size])
            distances.append(mean_distance)
        matched_identity_idx = distances.index(np.min(distances))
        identity_actual = users[matched_identity_idx]
        threshold_results.append(identity_actual)
    
    
    else:        
        for i in np.arange(np_ar[0], np_ar[1], np_ar[2]):
            results = face_recognition.compare_faces(encodings_list, encoding, tolerance=i)
            if True in results:
                if method == '5inarow':
                    batch_size = 5
                    matched_identity_idx = None
                    truth_list = []
                    for index in range(0, len(results), batch_size):
                        true_num = results[index:index+batch_size].count(True)
                        truth_list.append(true_num)
#                         if true_num == 5:
#                             matched_identity_idx = index // 5
#                             break
                    matched_identity_idx = truth_list.index(max(truth_list))
                            
                    if matched_identity_idx is None:
                        threshold_results.append(None)
                        
                    else:
                        identity_actual = users[matched_identity_idx]
                        threshold_results.append(identity_actual)
                        
                        
                else:
                    matched_identity_idx = results.index(True)
                    identity_actual = users[matched_identity_idx]
                    threshold_results.append(identity_actual)
            else:
                threshold_results.append(None)
    
    
    return threshold_results 


def calc_acc(array: list, true_value: int):
    """
        - Takes array of predictions and a `true_value`
        - Compares and counts accuracy
        - Returns accuracy for a list
    """
    right_preds = 0
    for el in array:
        if el == true_value:
            right_preds += 1
    accuracy = right_preds / len(array)
    return accuracy


def get_predictions(row, encodings_list, np_ar, method: str, limit: int = 5) -> list:
    """
        - Takes a row of `df_test_encodings` DataFrame
        - Goes into a for loop for each encoding and matches it to `encodings` list
        - Calls `match_encoding` function
    """
    predictions_for_each_image = []    
    for encoding in row['5_encodings']:
        results = match_encoding(encodings_list, encoding, np_ar, method)    
            
        if results is None:
            continue        
        predictions_for_each_image.append(results)  
    return predictions_for_each_image

        
def get_accuracy_by_threshold(row, results_by_threshold):
    """
        - Takes the dict from `get_results_by_threshold`
        - Calculates accuracy for each threshold
        - Return accuracies list
    """
    acc_by_threshold = []
    for key in results_by_threshold:
        acc = calc_acc(results_by_threshold[key], row['identities'])
        acc_by_threshold.append(acc)  
    return acc_by_threshold


def get_accuracy(row, predictions, np_ar, method):
    """
        - Takes the dict from `get_results_by_threshold`
        - Calculates accuracy for each threshold
        - Return accuracies list
    """
    acc_by_threshold = []
    if method == 'min_distance':
        acc = calc_acc(predictions[:, 0], row['identities'])
        acc_by_threshold.append(acc) 
    else:
        length = len(np.arange(np_ar[0], np_ar[1], np_ar[2]))
        for i in range(length):
            acc = calc_acc(predictions[:, i], row['identities'])
            acc_by_threshold.append(acc)  
    return acc_by_threshold


def create_accuracy_column(df, encodings_list, np_ar, method: str):
    """
        - Takes `df_test` DataFrame
        - Calls `get_predictions`, `get_results_by_threshold`, `get_accuracy_by_threshold`
        - Returns list of lists (for each person) with accuracies (for each tolerance threshold)
    """
    accuracy_list = [] 
    for index, row in df.iloc[8000:8200].iterrows():
#     for index, row in df.iterrows():
        predictions_for_each_image = get_predictions(row, encodings_list, np_ar, method)        
        predictions = np.array(predictions_for_each_image)
#         print(predictions)
        acc_by_threshold = get_accuracy(row, predictions, np_ar, method)
        accuracy_list.append(acc_by_threshold)
        
    return accuracy_list


def print_acc_by_threshold(accuracies, np_ar):
    accuracies = np.array(accuracies)
    
    for i, threshold in enumerate(np.arange(np_ar[0], np_ar[1], np_ar[2])):
        print(f'Accuracy with {threshold:.3f} threshold: {accuracies[:, i].mean():.2f}')

### Measuring accuracy for the Baseline method

In [10]:
encodings_1_per_person = encodings[::5]

Test on the entire dataset:

In [None]:
%%time
identities_accuracies_053_06 = create_accuracy_column(df_test_read, encodings_1_per_person, np_ar=(0.53, 0.601, 0.01), method='default')

In [None]:
print_acc_by_threshold(identities_accuracies_053_06, (0.53, 0.601, 0.01))

#### The biggest accuracy was achieved with the threshold of 0.55

## 3.2 Mean Encoding  
- Create one mean encoding from five given encodings for each person
- Choose the best tolerance threshold
- Maximize the accuracy metric

#### Creating mean encodigns list

In [25]:
counter = 0
mean_encodings = []
batch_size = 5

for i in range(0, len(encodings), batch_size):
    mean_enc = np.mean(encodings[i:i+batch_size])
    mean_encodings.append(mean_enc)

len(mean_encodings)

9283

Test on the entire dataset:

In [508]:
%%time
identities_accuracies_mean_enc_0_3_0_5 = create_accuracy_column(df_test_read, mean_encodings, np_ar=(0.3, 0.501, 0.01), method='default')

CPU times: user 4h 30min 16s, sys: 49min 22s, total: 5h 19min 39s
Wall time: 5h 19min 43s


In [509]:
print_acc_by_threshold(identities_accuracies_mean_enc_0_3_0_5, (0.3, 0.501, 0.01))

Accuracy with 0.300 threshold: 0.15
Accuracy with 0.310 threshold: 0.19
Accuracy with 0.320 threshold: 0.24
Accuracy with 0.330 threshold: 0.29
Accuracy with 0.340 threshold: 0.34
Accuracy with 0.350 threshold: 0.40
Accuracy with 0.360 threshold: 0.46
Accuracy with 0.370 threshold: 0.51
Accuracy with 0.380 threshold: 0.56
Accuracy with 0.390 threshold: 0.61
Accuracy with 0.400 threshold: 0.65
Accuracy with 0.410 threshold: 0.69
Accuracy with 0.420 threshold: 0.72
Accuracy with 0.430 threshold: 0.74
Accuracy with 0.440 threshold: 0.74
Accuracy with 0.450 threshold: 0.73
Accuracy with 0.460 threshold: 0.71
Accuracy with 0.470 threshold: 0.65
Accuracy with 0.480 threshold: 0.58
Accuracy with 0.490 threshold: 0.49
Accuracy with 0.500 threshold: 0.39


#### The best accuracy with this method was get with 0.43 threshold and equals 74%

## 3.3 Minimum accordance amount
- After getting a result from `compare_faces` function, look for 3, 4, 5 True values in a row and return the prediction  
- Choose the best `min_accordance` value (3-5)  
- Choose the best tolerance threshold  
- Maximize the accuracy metric  

In [705]:
%%time
identities_accuracies_5inarow_048_056 = create_accuracy_column(df_test_read, encodings, np_ar=(0.48, 0.561, 0.01), method='5inarow')

CPU times: user 5min 27s, sys: 31.4 s, total: 5min 59s
Wall time: 5min 59s


In [706]:
print_acc_by_threshold(identities_accuracies_5inarow_048_056, (0.48, 0.561, 0.01))

Accuracy with 0.480 threshold: 0.77
Accuracy with 0.490 threshold: 0.76
Accuracy with 0.500 threshold: 0.76
Accuracy with 0.510 threshold: 0.76
Accuracy with 0.520 threshold: 0.75
Accuracy with 0.530 threshold: 0.76
Accuracy with 0.540 threshold: 0.76
Accuracy with 0.550 threshold: 0.72
Accuracy with 0.560 threshold: 0.70


#### Test on the entire dataset:

In [708]:
%%time
identities_accuracies_5inarow_03_056 = create_accuracy_column(df_test_read, encodings, np_ar=(0.3, 0.561, 0.01), method='5inarow')

CPU times: user 12min 34s, sys: 1min 29s, total: 14min 4s
Wall time: 14min 4s


In [709]:
print_acc_by_threshold(identities_accuracies_5inarow_03_056, (0.3, 0.561, 0.01))

Accuracy with 0.300 threshold: 0.06
Accuracy with 0.310 threshold: 0.10
Accuracy with 0.320 threshold: 0.10
Accuracy with 0.330 threshold: 0.15
Accuracy with 0.340 threshold: 0.21
Accuracy with 0.350 threshold: 0.25
Accuracy with 0.360 threshold: 0.30
Accuracy with 0.370 threshold: 0.37
Accuracy with 0.380 threshold: 0.44
Accuracy with 0.390 threshold: 0.49
Accuracy with 0.400 threshold: 0.53
Accuracy with 0.410 threshold: 0.59
Accuracy with 0.420 threshold: 0.63
Accuracy with 0.430 threshold: 0.68
Accuracy with 0.440 threshold: 0.71
Accuracy with 0.450 threshold: 0.74
Accuracy with 0.460 threshold: 0.76
Accuracy with 0.470 threshold: 0.76
Accuracy with 0.480 threshold: 0.77
Accuracy with 0.490 threshold: 0.76
Accuracy with 0.500 threshold: 0.76
Accuracy with 0.510 threshold: 0.76
Accuracy with 0.520 threshold: 0.75
Accuracy with 0.530 threshold: 0.76
Accuracy with 0.540 threshold: 0.76
Accuracy with 0.550 threshold: 0.72
Accuracy with 0.560 threshold: 0.70


## 3.4 Face distance  
- Use `face_distance` instead of compare faces  
- Calculate overall distance by 5 encodings for each person 
- Choosing the first value, which is lower than the threshold and the corresponding person

In [703]:
%%time
ident_accs_dist_thresh_048_055 = create_accuracy_column(df_test_read, encodings, np_ar=(0.48, 0.551, 0.01), method='distance_threshold')

CPU times: user 4min 4s, sys: 5.59 s, total: 4min 9s
Wall time: 4min 11s


In [704]:
print_acc_by_threshold(ident_accs_dist_thresh_048_055, (0.38, 0.451, 0.01))

Accuracy with 0.380 threshold: 0.52
Accuracy with 0.390 threshold: 0.57
Accuracy with 0.400 threshold: 0.61
Accuracy with 0.410 threshold: 0.62
Accuracy with 0.420 threshold: 0.62
Accuracy with 0.430 threshold: 0.65
Accuracy with 0.440 threshold: 0.63
Accuracy with 0.450 threshold: 0.57


#### Test on the entire dataset:

In [710]:
%%time
ident_accs_dist_thresh_039_045 = create_accuracy_column(df_test_read, encodings, np_ar=(0.39, 0.451, 0.01), method='distance_threshold')

CPU times: user 4min 4s, sys: 5.06 s, total: 4min 9s
Wall time: 4min 9s


In [712]:
print_acc_by_threshold(ident_accs_dist_thresh_039_045, (0.39, 0.451, 0.01))

Accuracy with 0.390 threshold: 0.12
Accuracy with 0.400 threshold: 0.15
Accuracy with 0.410 threshold: 0.20
Accuracy with 0.420 threshold: 0.24
Accuracy with 0.430 threshold: 0.28
Accuracy with 0.440 threshold: 0.35
Accuracy with 0.450 threshold: 0.40


## 3.5 Minimum face distance  
- Use `face_distance` instead of compare faces  
- Calculate overall distance by 5 encodings for each person 
- Choosing the lowest value and the corresponding person

In [628]:
%%time
ident_accs_min_dist = create_accuracy_column(df_test_read, encodings, np_ar=(0.38, 0.451, 0.01), method='min_distance')

CPU times: user 5h 59min 13s, sys: 13min 51s, total: 6h 13min 5s
Wall time: 6h 13min 11s


In [629]:
print(f'{np.mean(ident_accs_min_dist):.2f}')

0.85


Distribution of accuracies

In [631]:
pd.Series(ident_accs_min_dist).value_counts()

[1.0]    4940
[0.8]    1795
[0.6]     726
[0.4]     360
[0.0]     249
[0.2]     200
dtype: int64

In [649]:
res = pd.Series(ident_accs_min_dist).apply(lambda x: x[0])
np.mean(res.loc[res > 0.4])

0.9129607291247822

In [651]:
len(res.loc[res > 0.4])

7461

# <a id="conclusion">4. Conclusions</a>  
At the end of the data preparation, we've left with the `df_final` and `encodings` list both include **9283 unique identities**. It can be considered as the complexity of the system because the more registered people we have in a database the easier algorithm can mismatch the face.  
  
While testing a few different approaches to face comparison, I was using `df_test_encodings`, which includes **8270 identities** and each identity has 5 encodings, so in total each test contained **41350 face encodings**. The DataFrame contains fewer identities than `df_final` due to the number of images and their quality. Some images were 'damaged', which means that on such images `face_locations` method couldn't find a face.  
  
The best accuracy was measured with the last comparison algorithm. It was calculating Euclidean distances to each encoding, got the mean distance for each identity from the dataset, and returned the identity with the least distance value. The final accuracy was 85%  
  
The accuracy can be more than 90% if the dataset is more accurate. For example, on the final distribution of the results, we can find 450 identities with 0-20% accuracy. Such accuracy means that zero or only one out of 5 checked test face encodings was evenly close to the corresponding identity. There can be many causes of such accuracy, but I think, a big impact was made by the quality of the images. They were just downloaded from the internet. The faces were captured under different lighting, angles, AOV, etc.
  
In the project, we can easily retake 5 photos and the overall accuracy will increase. Actually, if we take all photos qualitatively from the beginning, the algorithm's accuracy will be much higher.