# First Step of Data Preparation Process

In this notebook, the raw images are read with the OpenCV library (`cv2`).

People with more than 4 photos are collected in a pandas DataFrame that holds their names and assigned IDs.

Their photos are stored in Training, Validation and Test DataFrames, with each row linked to a person by ID.

After checking, all DataFrames are saved as pickle (`.pkl`) files for future use.

The dataset is the all-images gzipped tar file available at http://vis-www.cs.umass.edu/lfw/.

All folders in the lfw folder inside the lfw.tgz file have been copied to the /Data/RawData/FaceImage/ path.
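
The copy step described above could be scripted roughly as follows. This is a minimal sketch, assuming the archive has already been downloaded and that its top-level folder is named `lfw`, as stated above. The function name `extract_person_folders` and the argument names are hypothetical and not part of the notebook.

```python
import os
import shutil
import tarfile
import tempfile


def extract_person_folders(archive_path, target_dir):
    """Unpack the archive and move every folder under its 'lfw' root into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    moved = []
    with tempfile.TemporaryDirectory() as tmp:
        # Unpack into a scratch directory first so target_dir only receives person folders.
        with tarfile.open(archive_path, 'r:gz') as tar:
            tar.extractall(tmp)
        root = os.path.join(tmp, 'lfw')
        for name in sorted(os.listdir(root)):
            src = os.path.join(root, name)
            if os.path.isdir(src):
                shutil.move(src, os.path.join(target_dir, name))
                moved.append(name)
    return moved
```

A call such as `extract_person_folders('lfw.tgz', '../Data/RawData/FaceImage/')` would then produce the layout that the reading cell expects, assuming the archive sits in the working directory.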

In [1]:
#Importing libraries
import os
import numpy as np
import cv2
import pandas as pd

In [2]:
#Printing library versions
print('numpy Version: ' + np.__version__)
print('cv2 Version: ' + cv2.__version__)
print('pandas Version: ' + pd.__version__)

numpy Version: 1.21.5
cv2 Version: 3.4.2
pandas Version: 1.3.5


In [3]:
#Reading images
#The cv2 library uses BGR channel ordering by default instead of RGB.
#Since all reading and writing operations are done with the cv2 library,
#the images are kept in BGR format to avoid repeated color conversions in the live application.

#The data is split into training and test data person by person,
#so that each person with enough photos is represented in both sets.

#Images whose counter is below fileSize // 3 go to the test set.
#Because counter starts at 1 and floor division is used, each person contributes fileSize // 3 - 1 test images,
#so the overall test share is expected to be closer to 1/4 than 1/3,
#and a person with exactly 5 photos contributes no test images.

#The test data is split off in this first step
#so that every model is trained on the same training dataset.
#This removes performance differences caused by one random split happening to be better than another,
#and makes it easier to see which model or parameters are actually better.

person = []
trainingImage = []
testImage = []
imageSize = {}

personCounter = 0
directory = os.listdir('../Data/RawData/FaceImage/')
for s in directory:
    path, dirName, file = next(os.walk('../Data/RawData/FaceImage/' + s))
    
    #Checking whether the person directory contains any subdirectories.
    if len(dirName) > 0:
        print('Warning: Another directory is detected in ' + path)
    
    #People with fewer than 5 photos are ignored
    fileSize = len(file)
    if fileSize > 4:
        person.append({'ID' : personCounter, 'Name' : s.replace('_', ' ')})
        counter = 0
        for f in file:
            counter += 1
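            img = cv2.imread('../Data/RawData/FaceImage/' + s  + '/' + f)
            #cv2.imread returns None instead of raising when a file cannot be read
            if img is None:
                print('Warning: Could not read ../Data/RawData/FaceImage/' + s + '/' + f)
                continue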
            
            #Checking the shape of the images
            if imageSize.get(img.shape) is None:
                print('New image shape detected ' + str(img.shape) + ' in ' +  '../Data/RawData/FaceImage/' + s  + '/' + f)
                imageSize[img.shape] = 1
            else:
                imageSize[img.shape] += 1
                
            if counter < fileSize // 3:
                testImage.append({'PersonID' : personCounter, 'ImageBGR' : img})
            else:
                trainingImage.append({'PersonID' : personCounter, 'ImageBGR' : img})
            
        personCounter += 1

New image shape detected (250, 250, 3) in ../Data/RawData/FaceImage/Abdullah_Gul/Abdullah_Gul_0001.jpg
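
The BGR note in the cell above can be illustrated without touching the dataset. This is a minimal sketch, assuming only numpy: reversing the last axis of a BGR array gives the RGB ordering that most other libraries expect. The helper name `bgr_to_rgb` is hypothetical.

```python
import numpy as np


def bgr_to_rgb(image):
    """Return a view of the image with the channel axis reversed (BGR -> RGB)."""
    return image[..., ::-1]


# A 1x2 "image": one pure-blue pixel and one pure-red pixel, stored in BGR order.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
rgb = bgr_to_rgb(bgr)

print(rgb[0, 0])  # the blue pixel now carries 255 in its last channel
print(rgb[0, 1])  # the red pixel now carries 255 in its first channel
```

With cv2 available, `cv2.cvtColor(img, cv2.COLOR_BGR2RGB)` performs the same conversion.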


#### All images have the same shape, so image shape needs no further checks from here on

In [4]:
#Number of distinct image shapes in the imageSize dictionary
len(imageSize)

1

In [5]:
#imageSize Dictionary
imageSize

{(250, 250, 3): 5985}

In [6]:
#Number of people in the person list
len(person)

423

In [7]:
#person list (one dictionary per person)
person

[{'ID': 0, 'Name': 'Abdullah Gul'},
 {'ID': 1, 'Name': 'Adrien Brody'},
 {'ID': 2, 'Name': 'Ahmed Chalabi'},
 {'ID': 3, 'Name': 'Ai Sugiyama'},
 {'ID': 4, 'Name': 'Alan Greenspan'},
 {'ID': 5, 'Name': 'Alastair Campbell'},
 {'ID': 6, 'Name': 'Albert Costa'},
 {'ID': 7, 'Name': 'Alejandro Toledo'},
 {'ID': 8, 'Name': 'Ali Naimi'},
 {'ID': 9, 'Name': 'Allyson Felix'},
 {'ID': 10, 'Name': 'Alvaro Uribe'},
 {'ID': 11, 'Name': 'Al Gore'},
 {'ID': 12, 'Name': 'Al Sharpton'},
 {'ID': 13, 'Name': 'Amelia Vega'},
 {'ID': 14, 'Name': 'Amelie Mauresmo'},
 {'ID': 15, 'Name': 'Ana Guevara'},
 {'ID': 16, 'Name': 'Ana Palacio'},
 {'ID': 17, 'Name': 'Andre Agassi'},
 {'ID': 18, 'Name': 'Andy Roddick'},
 {'ID': 19, 'Name': 'Angela Bassett'},
 {'ID': 20, 'Name': 'Angela Merkel'},
 {'ID': 21, 'Name': 'Angelina Jolie'},
 {'ID': 22, 'Name': 'Anna Kournikova'},
 {'ID': 23, 'Name': 'Ann Veneman'},
 {'ID': 24, 'Name': 'Antonio Banderas'},
 {'ID': 25, 'Name': 'Antonio Palocci'},
 {'ID': 26, 'Name': 'Ariel Shar

In [8]:
#Size of the trainingImage list.
#Since this list holds raw image arrays, printing it would produce unreadable output
len(trainingImage)

4579

In [9]:
#Size of the testImage list.
#Since this list holds raw image arrays, printing it would produce unreadable output
len(testImage)

1406
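
The two counts above (4579 training, 1406 test) follow from the `counter < fileSize // 3` rule in cell 3. The sketch below reproduces that arithmetic in isolation. The function name `count_test_images` is hypothetical.

```python
def count_test_images(file_size):
    """Number of images sent to the test set for a person with file_size photos.

    Mirrors the loop in cell 3: counter runs from 1 to file_size, and an image
    goes to the test set only while counter < file_size // 3.
    """
    return sum(1 for counter in range(1, file_size + 1) if counter < file_size // 3)


# Photos per person -> (test images, training images)
for n in (5, 6, 9, 19):
    print(n, count_test_images(n), n - count_test_images(n))

# Overall test share, taken from the counts printed by the notebook.
test_share = 1406 / (1406 + 4579)
print(round(test_share, 3))
```

A person with exactly 5 photos therefore contributes no test images, and the overall share comes out a little under 1/4 rather than 1/3.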

In [10]:
#person DataFrame
person = pd.DataFrame(person)
person

Unnamed: 0,ID,Name
0,0,Abdullah Gul
1,1,Adrien Brody
2,2,Ahmed Chalabi
3,3,Ai Sugiyama
4,4,Alan Greenspan
...,...,...
418,418,Yasser Arafat
419,419,Yoko Ono
420,420,Yoriko Kawaguchi
421,421,Zhu Rongji


In [11]:
#trainingImage DataFrame
trainingImage = pd.DataFrame(trainingImage)
trainingImage

Unnamed: 0,PersonID,ImageBGR
0,0,"[[[183, 187, 181], [184, 188, 182], [183, 187,..."
1,0,"[[[57, 106, 128], [80, 129, 151], [69, 118, 14..."
2,0,"[[[85, 115, 140], [85, 115, 140], [85, 115, 14..."
3,0,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
4,0,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
...,...,...
4574,422,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
4575,422,"[[[254, 236, 207], [254, 236, 207], [254, 236,..."
4576,422,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
4577,422,"[[[176, 150, 150], [179, 151, 151], [183, 153,..."


In [12]:
#Shuffling the trainingImage DataFrame
trainingImage = trainingImage.sample(frac=1).reset_index(drop=True)
trainingImage

Unnamed: 0,PersonID,ImageBGR
0,150,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
1,148,"[[[4, 24, 41], [4, 24, 41], [5, 25, 42], [6, 2..."
2,12,"[[[0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], ..."
3,120,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
4,126,"[[[3, 8, 11], [4, 9, 12], [4, 9, 12], [2, 7, 1..."
...,...,...
4574,236,"[[[6, 8, 2], [19, 23, 17], [41, 48, 43], [74, ..."
4575,222,"[[[17, 105, 181], [20, 108, 184], [22, 110, 18..."
4576,222,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
4577,396,"[[[228, 218, 211], [221, 211, 204], [214, 204,..."
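
The shuffle above draws a different order on every run. If a repeatable order is wanted, `sample` accepts a `random_state` argument. This is a minimal sketch on a toy DataFrame. The toy data and the seed value 42 are illustrative assumptions and not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({'PersonID': [0, 0, 1, 1, 2],
                    'ImageBGR': ['a', 'b', 'c', 'd', 'e']})

# Same seed, same order. reset_index(drop=True) discards the old row labels.
first = toy.sample(frac=1, random_state=42).reset_index(drop=True)
second = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(first.equals(second))                                   # both shuffles agree
print(sorted(first['ImageBGR']) == sorted(toy['ImageBGR']))   # no rows lost or duplicated
```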


In [13]:
#testImage DataFrame
testImage = pd.DataFrame(testImage)
testImage

Unnamed: 0,PersonID,ImageBGR
0,0,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
1,0,"[[[26, 65, 74], [29, 68, 77], [30, 69, 78], [2..."
2,0,"[[[255, 255, 245], [255, 254, 243], [254, 252,..."
3,0,"[[[182, 210, 217], [172, 199, 209], [109, 135,..."
4,0,"[[[0, 1, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0], ..."
...,...,...
1401,420,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
1402,420,"[[[245, 219, 203], [245, 219, 203], [245, 219,..."
1403,421,"[[[108, 129, 137], [108, 129, 137], [107, 128,..."
1404,421,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
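
Because the test split in cell 3 uses floor division, some people may end up with no rows in testImage. This is a minimal sketch of how they could be listed, using toy stand-ins for the `person` and `testImage` DataFrames. The names `person_toy` and `test_toy` are hypothetical.

```python
import pandas as pd

person_toy = pd.DataFrame({'ID': [0, 1, 2], 'Name': ['A', 'B', 'C']})
test_toy = pd.DataFrame({'PersonID': [0, 0, 2]})

# Keep only the people whose ID never appears in the test rows.
missing = person_toy[~person_toy['ID'].isin(test_toy['PersonID'])]
print(missing['Name'].tolist())
```

Applied to the real DataFrames, the same expression would use `person['ID']` and `testImage['PersonID']`.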


In [14]:
#Splitting validation data from testImage DataFrame
validationImage = testImage.sample(frac=0.5)
validationImage

Unnamed: 0,PersonID,ImageBGR
698,154,"[[[3, 5, 0], [0, 2, 0], [0, 2, 0], [0, 2, 1], ..."
1282,388,"[[[106, 123, 120], [104, 121, 118], [102, 118,..."
718,166,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
1119,315,"[[[15, 83, 72], [16, 84, 73], [17, 85, 74], [1..."
386,120,"[[[131, 139, 156], [130, 138, 155], [132, 140,..."
...,...,...
141,44,"[[[9, 8, 52], [12, 11, 55], [13, 14, 58], [13,..."
387,120,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
374,120,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
753,171,"[[[142, 140, 129], [159, 157, 146], [176, 174,..."


In [15]:
#Data that is not split as validation data is kept as test data
testImage = testImage.drop(validationImage.index)
testImage

Unnamed: 0,PersonID,ImageBGR
0,0,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
1,0,"[[[26, 65, 74], [29, 68, 77], [30, 69, 78], [2..."
2,0,"[[[255, 255, 245], [255, 254, 243], [254, 252,..."
6,1,"[[[14, 16, 70], [11, 13, 67], [9, 10, 66], [9,..."
8,6,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
...,...,...
1397,417,"[[[108, 122, 104], [110, 124, 106], [112, 127,..."
1400,420,"[[[6, 8, 16], [6, 8, 16], [6, 8, 16], [5, 8, 1..."
1402,420,"[[[245, 219, 203], [245, 219, 203], [245, 219,..."
1403,421,"[[[108, 129, 137], [108, 129, 137], [107, 128,..."
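
The two cells above halve the held-out rows by sampling and then dropping the sampled index labels. This is a minimal sketch on a toy DataFrame that checks the halves are disjoint and together cover every original row. The seed is an assumption added here for repeatability; the notebook does not set one.

```python
import pandas as pd

held_out = pd.DataFrame({'PersonID': range(10)})

# Sample half of the rows, then keep whatever was not sampled.
validation_part = held_out.sample(frac=0.5, random_state=0)
test_part = held_out.drop(validation_part.index)

print(len(validation_part), len(test_part))
print(set(validation_part.index).isdisjoint(test_part.index))
```

This only works while the original index labels are intact, which is why the notebook resets the index after the split and not before it.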


In [16]:
#Shuffling the validationImage DataFrame
validationImage = validationImage.sample(frac=1).reset_index(drop=True)
validationImage

Unnamed: 0,PersonID,ImageBGR
0,378,"[[[2, 0, 0], [2, 0, 0], [2, 0, 0], [2, 0, 0], ..."
1,148,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
2,363,"[[[0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], ..."
3,152,"[[[211, 222, 230], [210, 221, 229], [210, 219,..."
4,10,"[[[32, 17, 15], [32, 17, 15], [33, 18, 16], [3..."
...,...,...
698,209,"[[[249, 251, 251], [250, 252, 252], [250, 252,..."
699,167,"[[[8, 10, 4], [8, 10, 4], [8, 10, 4], [8, 10, ..."
700,115,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
701,186,"[[[16, 3, 0], [16, 3, 0], [16, 3, 0], [16, 3, ..."


In [17]:
#Shuffling the testImage DataFrame
testImage = testImage.sample(frac=1).reset_index(drop=True)
testImage

Unnamed: 0,PersonID,ImageBGR
0,124,"[[[115, 91, 7], [117, 93, 9], [117, 93, 9], [1..."
1,366,"[[[137, 130, 115], [134, 127, 112], [131, 124,..."
2,311,"[[[23, 47, 183], [42, 62, 193], [47, 59, 183],..."
3,0,"[[[26, 65, 74], [29, 68, 77], [30, 69, 78], [2..."
4,19,"[[[51, 18, 2], [51, 18, 2], [51, 19, 0], [51, ..."
...,...,...
698,169,"[[[138, 118, 83], [138, 118, 83], [137, 118, 8..."
699,171,"[[[62, 63, 24], [61, 62, 23], [60, 61, 22], [6..."
700,56,"[[[46, 106, 152], [47, 107, 153], [49, 109, 15..."
701,95,"[[[94, 131, 175], [98, 135, 179], [99, 136, 18..."


In [18]:
#The selected data is saved as pkl files for future use
person.to_pickle("../Data/RawData/Selected/Person.pkl")
trainingImage.to_pickle("../Data/RawData/Selected/Training.pkl")
validationImage.to_pickle("../Data/RawData/Selected/Validation.pkl")
testImage.to_pickle("../Data/RawData/Selected/Test.pkl")
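
A later notebook can load these files back with `pd.read_pickle`. This is a minimal round-trip sketch, assuming a small stand-in DataFrame and a temporary directory in place of the real `../Data/RawData/Selected/` path.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for the Training DataFrame: one image array per row.
frame = pd.DataFrame([
    {'PersonID': 0, 'ImageBGR': np.zeros((2, 2, 3), dtype=np.uint8)},
    {'PersonID': 1, 'ImageBGR': np.ones((2, 2, 3), dtype=np.uint8)},
])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Training.pkl')
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored['PersonID'].tolist())
print(np.array_equal(restored.loc[1, 'ImageBGR'], frame.loc[1, 'ImageBGR']))
```

Pickle keeps the image arrays intact, which a plain CSV export would not. Pickle files can run arbitrary code when loaded, so only files from a trusted source should be read this way.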