In [1]:
import scipy.io
import datetime
import numpy as np
import pandas as pd

%matplotlib inline

The file containing photo metadata is in MATLAB format.

In [2]:
matlab_file = scipy.io.loadmat('imdb_crop/imdb.mat')

It is a dictionary with the following keys.

In [3]:
print(type(matlab_file))
matlab_file.keys()

<class 'dict'>


dict_keys(['__header__', '__version__', '__globals__', 'imdb'])

Let's have a look at its contents.

In [4]:
matlab_file['__header__']

b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Jan 17 11:30:27 2016'

In [5]:
matlab_file['__version__']

'1.0'

In [6]:
matlab_file['__globals__']

[]

The most important entry, `imdb`, holds a NumPy array with a structured (composite) dtype in which every field is of object type.

In [7]:
print(type(matlab_file['imdb']))
matlab_file['imdb'].dtype

<class 'numpy.ndarray'>


dtype([('dob', 'O'), ('photo_taken', 'O'), ('full_path', 'O'), ('gender', 'O'), ('name', 'O'), ('face_location', 'O'), ('face_score', 'O'), ('second_face_score', 'O'), ('celeb_names', 'O'), ('celeb_id', 'O')])

According to the description on the website from which the data were downloaded, the format is as follows:
- **dob**: date of birth (MATLAB serial date number)
- **photo_taken**: year when the photo was taken
- **full_path**: path to file
- **gender**: 0 for female and 1 for male, NaN if unknown
- **name**: name of the celebrity
- **face_location**: location of the face
- **face_score**: detector score (the higher the better). Inf implies that no face was found in the image and the face_location then just returns the entire image (in the loaded array these entries appear as -inf)
- **second_face_score**: detector score of the face with the second highest score. This is useful to ignore images with more than one face. second_face_score is NaN if no second face was detected.
- **celeb_names** (IMDB only): list of all celebrity names
- **celeb_id** (IMDB only): index of celebrity name
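
The field names listed above can also be used as keys, which may be easier to read than positional indices. The following is a minimal sketch on a toy structured array that only mimics the layout of `matlab_file['imdb']`; the names `toy_imdb` and `toy_metadata` and the values are illustrative assumptions, not data read from `imdb.mat`.

```python
import numpy as np

# Toy stand-in for matlab_file['imdb']: a 1x1 structured array whose
# object fields each hold a 1xN array (illustrative values only).
toy_imdb = np.empty((1, 1), dtype=[('dob', 'O'), ('gender', 'O')])
toy_imdb['dob'][0, 0] = np.array([[693726, 726831]])
toy_imdb['gender'][0, 0] = np.array([[1.0, 0.0]])

toy_metadata = toy_imdb[0, 0]

# Access by field name instead of by position.
dob_by_name = toy_metadata['dob'][0]
gender_by_name = toy_metadata['gender'][0]

# Positional and named access return the same data.
same_dob = np.array_equal(toy_metadata[0][0], dob_by_name)
print(dob_by_name, gender_by_name, same_dob)
```

With named access, the extraction no longer depends on the order of the fields in the dtype.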


**Metadata extraction and selection**

In [8]:
metadata = matlab_file['imdb'][0, 0]

In [9]:
date_of_birth = metadata[0][0]
date_of_birth

array([693726, 693726, 693726, ..., 726831, 726831, 726831])

In [10]:
date_photo_taken = metadata[1][0]
date_photo_taken

array([1968, 1970, 1968, ..., 2011, 2011, 2011], dtype=uint16)

In [11]:
photo_path = metadata[2][0]
photo_path

array([array(['01/nm0000001_rm124825600_1899-5-10_1968.jpg'], dtype='<U43'),
       array(['01/nm0000001_rm3343756032_1899-5-10_1970.jpg'], dtype='<U44'),
       array(['01/nm0000001_rm577153792_1899-5-10_1968.jpg'], dtype='<U43'),
       ...,
       array(['08/nm3994408_rm926592512_1989-12-29_2011.jpg'], dtype='<U44'),
       array(['08/nm3994408_rm943369728_1989-12-29_2011.jpg'], dtype='<U44'),
       array(['08/nm3994408_rm976924160_1989-12-29_2011.jpg'], dtype='<U44')],
      dtype=object)

In [12]:
gender = metadata[3][0]
gender

array([1., 1., 1., ..., 0., 0., 0.])

In [13]:
name = metadata[4][0]
name

array([array(['Fred Astaire'], dtype='<U12'),
       array(['Fred Astaire'], dtype='<U12'),
       array(['Fred Astaire'], dtype='<U12'), ...,
       array(['Jane Levy'], dtype='<U9'),
       array(['Jane Levy'], dtype='<U9'),
       array(['Jane Levy'], dtype='<U9')], dtype=object)

In [14]:
face_score = metadata[6][0]
face_score

array([1.45969291, 2.5431976 , 3.45557949, ...,       -inf, 4.45072452,
       2.13350269])

In [15]:
second_face_score = metadata[7][0]
second_face_score

array([1.11897336, 1.85200773, 2.98566022, ...,        nan,        nan,
              nan])

We select the most relevant pieces of information.

In [16]:
column_dict = {'date_of_birth': date_of_birth, 
               'date_photo_taken': date_photo_taken,
               'photo_path': photo_path,
               'name': name,
               'gender': gender,
               'face_score': face_score,
               'second_face_score': second_face_score
              }

raw_photo_data = pd.DataFrame(column_dict)
raw_photo_data.head()

| | date_of_birth | date_photo_taken | photo_path | name | gender | face_score | second_face_score |
|---|---|---|---|---|---|---|---|
| 0 | 693726 | 1968 | [01/nm0000001_rm124825600_1899-5-10_1968.jpg] | [Fred Astaire] | 1.0 | 1.459693 | 1.118973 |
| 1 | 693726 | 1970 | [01/nm0000001_rm3343756032_1899-5-10_1970.jpg] | [Fred Astaire] | 1.0 | 2.543198 | 1.852008 |
| 2 | 693726 | 1968 | [01/nm0000001_rm577153792_1899-5-10_1968.jpg] | [Fred Astaire] | 1.0 | 3.455579 | 2.98566 |
| 3 | 693726 | 1968 | [01/nm0000001_rm946909184_1899-5-10_1968.jpg] | [Fred Astaire] | 1.0 | 1.872117 | NaN |
| 4 | 693726 | 1968 | [01/nm0000001_rm980463616_1899-5-10_1968.jpg] | [Fred Astaire] | 1.0 | 1.158766 | NaN |


In [17]:
len(raw_photo_data)

460723

**Data Selection**

1) Keep only those photos where a face was detected

In [18]:
face_detected = np.isfinite(raw_photo_data['face_score'].values)
raw_photo_data = raw_photo_data[face_detected]
raw_photo_data = raw_photo_data.drop(columns='face_score')
len(raw_photo_data)

398421

2) Keep only those photos where no second face was detected

In [19]:
no_second_face_detected = raw_photo_data['second_face_score'].isnull()
raw_photo_data = raw_photo_data[no_second_face_detected]
raw_photo_data = raw_photo_data.drop(columns='second_face_score')
len(raw_photo_data)

184624

3) Keep only those photos where gender is known

In [20]:
gender_known = raw_photo_data['gender'].notnull()
# .copy() avoids a possible SettingWithCopyWarning on the assignment below
raw_photo_data = raw_photo_data[gender_known].copy()
raw_photo_data['gender'] = raw_photo_data['gender'].astype(np.uint8)
len(raw_photo_data)

181690
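
The three selection steps above can also be written as one combined boolean mask. This is only a sketch on a toy frame (`toy` and its values are made up for illustration); unlike the step-by-step version, it does not report the record count after each filter.

```python
import numpy as np
import pandas as pd

# Illustrative rows: only the first one passes all three filters.
toy = pd.DataFrame({
    'face_score': [1.46, -np.inf, 3.45, 2.10],
    'second_face_score': [np.nan, np.nan, 2.98, np.nan],
    'gender': [1.0, 0.0, 1.0, np.nan],
})

keep = (
    np.isfinite(toy['face_score'])        # 1) a face was detected
    & toy['second_face_score'].isnull()   # 2) no second face was detected
    & toy['gender'].notnull()             # 3) gender is known
)

# Keep the matching rows and store gender compactly, as in the notebook.
selected = toy.loc[keep, ['gender']].astype({'gender': np.uint8})
print(selected)
```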

Note that since the date of birth is given as a MATLAB serial date number, we have to convert it to a more manageable form. Moreover, day one in MATLAB is January 1st 0000, while in Python it is January 1st 0001. Hence, we have to account for the 366 days of year 0000, which counts as a leap year. But first, let's check that all of the serial date numbers are indeed integers.

In [21]:
raw_photo_data['date_of_birth'].apply(lambda elem: int(elem) == elem).all()

True
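
The same integrality check can be done without a Python-level lambda. The sketch below uses made-up values (`toy_datenums`) and treats a value as integral when it has no fractional part; in a MATLAB serial date number the fractional part encodes the time of day.

```python
import numpy as np

# Illustrative values only; the real column is raw_photo_data['date_of_birth'].
toy_datenums = np.array([693726.0, 726831.0, 75.0])

# True when no value has a fractional part.
all_integral = bool(np.all(np.mod(toy_datenums, 1) == 0))

# Counter-example: a datenum carrying a time-of-day component.
with_time_of_day = np.array([693726.5])
still_integral = bool(np.all(np.mod(with_time_of_day, 1) == 0))

print(all_integral, still_integral)
```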

Second, observe that some of the **date_of_birth** values are way too small to be correct (for example, a value of 675000 implies that a person was born in 1848). We are going to defer dealing with this issue until we compute the age of each person.

In [22]:
raw_photo_data.sort_values(by='date_of_birth').head()

| | date_of_birth | date_photo_taken | photo_path | name | gender |
|---|---|---|---|---|---|
| 237005 | 75 | 2013 | [14/nm0912414_rm2110498304_0-3-15_2013.jpg] | [Bree Michael Warner] | 0 |
| 237004 | 75 | 2013 | [14/nm0912414_rm2029561344_0-3-15_2013.jpg] | [Bree Michael Warner] | 0 |
| 324170 | 175 | 2014 | [41/nm2749141_rm1023657472_0-6-23_2014.jpg] | [Wendy McColm] | 0 |
| 324176 | 175 | 2013 | [41/nm2749141_rm4078689280_0-6-23_2013.jpg] | [Wendy McColm] | 0 |
| 324174 | 175 | 2015 | [41/nm2749141_rm2952003072_0-6-23_2015.jpg] | [Wendy McColm] | 0 |


Convert **date_of_birth** to an actual date.

In [23]:
def datenum_to_date(datenum):
    """Convert MATLAB serial date number to a date."""
    adjusted_datenum = datenum - 366
    return datetime.date.fromordinal(adjusted_datenum)

datenum_to_date_vectorized = np.vectorize(datenum_to_date)
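
As a quick sanity check of the 366-day offset, the sketch below converts the two serial date numbers visible in the raw data and compares them with the birth dates embedded in the corresponding file names (1899-5-10 and 1989-12-29). The helper name `datenum_to_date_sketch` and the constant are illustrative, not part of the notebook.

```python
import datetime

# Days in MATLAB's year 0000 (a leap year), which Python's ordinals skip.
MATLAB_TO_PYTHON_OFFSET = 366

def datenum_to_date_sketch(datenum):
    """Convert an integer MATLAB serial date number to a datetime.date."""
    return datetime.date.fromordinal(int(datenum) - MATLAB_TO_PYTHON_OFFSET)

first_dob = datenum_to_date_sketch(693726)
last_dob = datenum_to_date_sketch(726831)
print(first_dob, last_dob)  # 1899-05-10 1989-12-29
```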

In [24]:
# Increase small values of date_of_birth for technical reasons, 
# eventually we are going to remove such records anyway
raw_photo_data.loc[raw_photo_data['date_of_birth'] < 367, 'date_of_birth'] = 367

# Convert
raw_photo_data['date_of_birth'] = datenum_to_date_vectorized(raw_photo_data['date_of_birth'])
raw_photo_data.head()

| | date_of_birth | date_photo_taken | photo_path | name | gender |
|---|---|---|---|---|---|
| 3 | 1899-05-10 | 1968 | [01/nm0000001_rm946909184_1899-5-10_1968.jpg] | [Fred Astaire] | 1 |
| 4 | 1899-05-10 | 1968 | [01/nm0000001_rm980463616_1899-5-10_1968.jpg] | [Fred Astaire] | 1 |
| 6 | 1924-09-16 | 2004 | [02/nm0000002_rm1346607872_1924-9-16_2004.jpg] | [Lauren Bacall] | 0 |
| 7 | 1924-09-16 | 2004 | [02/nm0000002_rm1363385088_1924-9-16_2004.jpg] | [Lauren Bacall] | 0 |
| 12 | 1924-09-16 | 1974 | [02/nm0000002_rm221957120_1924-9-16_1974.jpg] | [Lauren Bacall] | 0 |
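
The clamp to 367 is needed because Python's `datetime.date` cannot represent anything earlier than ordinal 1 (0001-01-01), which corresponds to MATLAB serial date number 367. A short sketch of that boundary, using the value 75 seen in the sorted data as the failing example:

```python
import datetime

# The smallest ordinal Python accepts is 1, i.e. MATLAB datenum 367.
earliest = datetime.date.fromordinal(367 - 366)

# A smaller datenum maps to a non-positive ordinal and cannot be converted.
try:
    datetime.date.fromordinal(75 - 366)
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(earliest, conversion_failed)
```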


According to the website from which the data were downloaded, it is assumed that pictures were taken in the middle of the year (1st of July). Hence, we proceed accordingly, first converting **date_photo_taken** to an actual date.

In [25]:
raw_photo_data['date_photo_taken'] = \
    raw_photo_data['date_photo_taken'].apply(lambda year: datetime.date(year, 7, 1))

raw_photo_data.head()

| | date_of_birth | date_photo_taken | photo_path | name | gender |
|---|---|---|---|---|---|
| 3 | 1899-05-10 | 1968-07-01 | [01/nm0000001_rm946909184_1899-5-10_1968.jpg] | [Fred Astaire] | 1 |
| 4 | 1899-05-10 | 1968-07-01 | [01/nm0000001_rm980463616_1899-5-10_1968.jpg] | [Fred Astaire] | 1 |
| 6 | 1924-09-16 | 2004-07-01 | [02/nm0000002_rm1346607872_1924-9-16_2004.jpg] | [Lauren Bacall] | 0 |
| 7 | 1924-09-16 | 2004-07-01 | [02/nm0000002_rm1363385088_1924-9-16_2004.jpg] | [Lauren Bacall] | 0 |
| 12 | 1924-09-16 | 1974-07-01 | [02/nm0000002_rm221957120_1924-9-16_1974.jpg] | [Lauren Bacall] | 0 |
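
A different way to build the mid-year dates, not the one used in this notebook, is to let `pd.to_datetime` assemble them from year, month and day columns. The sketch below uses a toy series of years; note that it yields pandas timestamps rather than `datetime.date` objects, and at the default nanosecond resolution those only cover roughly the years 1677 to 2262, so this approach would not suit the out-of-range birth dates.

```python
import pandas as pd

toy_years = pd.Series([1968, 2004, 1974])

# pd.to_datetime assembles dates from a frame with year/month/day columns;
# the scalar month and day are broadcast to every row.
toy_dates = pd.to_datetime(
    pd.DataFrame({'year': toy_years, 'month': 7, 'day': 1})
)
print(toy_dates)
```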


Now, we are ready to compute each person's age.

In [26]:
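def calculate_age(date_of_birth, reference_date):
    """Return the age in whole years on reference_date."""
    age = reference_date.year - date_of_birth.year
    
    # The birthday has not yet occurred in the reference year if the
    # reference month is earlier, or the month matches and the day is earlier.
    condition_1 = reference_date.month < date_of_birth.month
    condition_2a = reference_date.month == date_of_birth.month
    condition_2b = reference_date.day < date_of_birth.day
    condition_2 = condition_2a and condition_2b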
    
    if condition_1 or condition_2:
        age -= 1
        
    return age

calculate_age_vectorized = np.vectorize(calculate_age)
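
The month and day conditions above amount to a lexicographic comparison of `(month, day)` pairs, so the same logic can be sketched more compactly with tuples. The name `calculate_age_compact` is illustrative; the two test cases reuse dates that appear in the data.

```python
import datetime

def calculate_age_compact(date_of_birth, reference_date):
    """Age in whole years on reference_date."""
    # True when the birthday has not yet occurred in the reference year.
    birthday_pending = (reference_date.month, reference_date.day) < (
        date_of_birth.month, date_of_birth.day)
    return reference_date.year - date_of_birth.year - int(birthday_pending)

age_a = calculate_age_compact(datetime.date(1899, 5, 10), datetime.date(1968, 7, 1))
age_b = calculate_age_compact(datetime.date(1924, 9, 16), datetime.date(2004, 7, 1))
print(age_a, age_b)  # 69 79
```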

In [27]:
raw_photo_data['age'] = calculate_age_vectorized(raw_photo_data['date_of_birth'], 
                                                 raw_photo_data['date_photo_taken'])

raw_photo_data.head()

| | date_of_birth | date_photo_taken | photo_path | name | gender | age |
|---|---|---|---|---|---|---|
| 3 | 1899-05-10 | 1968-07-01 | [01/nm0000001_rm946909184_1899-5-10_1968.jpg] | [Fred Astaire] | 1 | 69 |
| 4 | 1899-05-10 | 1968-07-01 | [01/nm0000001_rm980463616_1899-5-10_1968.jpg] | [Fred Astaire] | 1 | 69 |
| 6 | 1924-09-16 | 2004-07-01 | [02/nm0000002_rm1346607872_1924-9-16_2004.jpg] | [Lauren Bacall] | 0 | 79 |
| 7 | 1924-09-16 | 2004-07-01 | [02/nm0000002_rm1363385088_1924-9-16_2004.jpg] | [Lauren Bacall] | 0 | 79 |
| 12 | 1924-09-16 | 1974-07-01 | [02/nm0000002_rm221957120_1924-9-16_1974.jpg] | [Lauren Bacall] | 0 | 49 |


The **age** column exhibits undesired behavior: some values are too large (as anticipated), while others are negative, most likely due to incorrect recording of the year in which the corresponding photos were taken. These issues are handled by restricting our attention to age values with sufficiently many records, here at least 398 (roughly 400).

In [28]:
value_count_threshold = 397

ample_age_records = raw_photo_data['age'].value_counts() > value_count_threshold
raw_photo_data['age'].value_counts()[ample_age_records].sort_values()

70     398
68     453
69     494
66     511
67     575
11     623
12     635
64     770
65     799
13     813
14     844
62     871
63     929
61     937
15    1032
16    1089
60    1100
57    1234
56    1261
59    1279
17    1306
58    1347
55    1351
18    1654
53    1657
54    1756
52    1948
51    2205
50    2220
19    2326
49    2511
48    2622
47    2935
20    3072
46    3219
21    3315
22    3546
44    3565
23    3960
43    4099
45    4105
42    4314
40    4541
41    4757
25    4826
24    4875
27    5245
26    5251
39    5486
37    5550
28    5626
38    5764
34    5816
33    6026
36    6038
29    6090
35    6122
32    6270
30    6571
31    6663
Name: age, dtype: int64

The above restriction limits available age values to those ranging from 11 to 70.

In [29]:
# np.sort orders the ages, so sorting the counts first is unnecessary
np.sort(raw_photo_data['age'].value_counts()[ample_age_records].index.values)

array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
       28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
       45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
       62, 63, 64, 65, 66, 67, 68, 69, 70], dtype=int64)
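
Because the sufficiently frequent ages form one contiguous range here, filtering with `between(11, 70)` is equivalent to filtering on the counts themselves. A sketch of the count-based variant, which avoids hard-coding the bounds, on toy data (`toy_ages` and `toy_threshold` are illustrative; the notebook uses a threshold of 397):

```python
import pandas as pd

toy_ages = pd.Series([30, 30, 30, 31, 31, 95, -2])
toy_threshold = 1  # hypothetical value for this toy example

# Ages whose number of records exceeds the threshold.
counts = toy_ages.value_counts()
frequent_ages = counts[counts > toy_threshold].index

# Keep only the records with a sufficiently frequent age.
filtered = toy_ages[toy_ages.isin(frequent_ages)]
print(filtered.tolist())  # [30, 30, 30, 31, 31]
```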

It is somewhat unfortunate, but pretty much unavoidable, that after all these steps our dataset has shrunk to less than half of its original size: of the 460723 original records, little more than 177000 remain.

In [30]:
# .copy() avoids a possible SettingWithCopyWarning on the assignment below
raw_photo_data = raw_photo_data.loc[raw_photo_data['age'].between(11, 70), :].copy()
raw_photo_data['age'] = raw_photo_data['age'].astype(np.uint8)
len(raw_photo_data)

177197

On the bright side, it can be seen that the gender distribution is roughly balanced, with a modest majority of photos of men.

In [31]:
print(raw_photo_data['gender'].value_counts(normalize=True)\
      .apply(lambda x: str(round(100*x, 2)) + '%'))

1    54.8%
0    45.2%
Name: gender, dtype: object


Before selecting the data we are going to use further, let's reformat the **photo_path** and **name** columns and reindex the entire DataFrame.

In [32]:
for column in ['photo_path', 'name']:
    raw_photo_data[column] = raw_photo_data[column].apply(lambda elem: elem[0]).values

raw_photo_data = raw_photo_data.reset_index(drop=True)   
raw_photo_data.head()

| | date_of_birth | date_photo_taken | photo_path | name | gender | age |
|---|---|---|---|---|---|---|
| 0 | 1899-05-10 | 1968-07-01 | 01/nm0000001_rm946909184_1899-5-10_1968.jpg | Fred Astaire | 1 | 69 |
| 1 | 1899-05-10 | 1968-07-01 | 01/nm0000001_rm980463616_1899-5-10_1968.jpg | Fred Astaire | 1 | 69 |
| 2 | 1924-09-16 | 1974-07-01 | 02/nm0000002_rm221957120_1924-9-16_1974.jpg | Lauren Bacall | 0 | 49 |
| 3 | 1924-09-16 | 1974-07-01 | 02/nm0000002_rm238734336_1924-9-16_1974.jpg | Lauren Bacall | 0 | 49 |
| 4 | 1924-09-16 | 1991-07-01 | 02/nm0000002_rm370988544_1924-9-16_1991.jpg | Lauren Bacall | 0 | 66 |


Now, we can make our final selection of data...

In [33]:
processed_photo_data = raw_photo_data.loc[:, ['photo_path', 'name', 'gender', 'age']].copy()
processed_photo_data.head()

| | photo_path | name | gender | age |
|---|---|---|---|---|
| 0 | 01/nm0000001_rm946909184_1899-5-10_1968.jpg | Fred Astaire | 1 | 69 |
| 1 | 01/nm0000001_rm980463616_1899-5-10_1968.jpg | Fred Astaire | 1 | 69 |
| 2 | 02/nm0000002_rm221957120_1924-9-16_1974.jpg | Lauren Bacall | 0 | 49 |
| 3 | 02/nm0000002_rm238734336_1924-9-16_1974.jpg | Lauren Bacall | 0 | 49 |
| 4 | 02/nm0000002_rm370988544_1924-9-16_1991.jpg | Lauren Bacall | 0 | 66 |


...and save the resulting dataset.

In [34]:
processed_photo_data.to_csv('processed_photo_metadata.csv', index=False)
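
A quick round-trip check can confirm that the saved file reads back as expected. The sketch below writes a one-row toy frame to an in-memory buffer instead of `processed_photo_metadata.csv`; the names `toy_processed` and `restored` are illustrative. Since CSV does not store dtypes, the `uint8` columns come back with pandas' default integer type.

```python
import io
import numpy as np
import pandas as pd

toy_processed = pd.DataFrame({
    'photo_path': ['01/nm0000001_rm946909184_1899-5-10_1968.jpg'],
    'name': ['Fred Astaire'],
    'gender': np.array([1], dtype=np.uint8),
    'age': np.array([69], dtype=np.uint8),
})

# Write without the index, as in the notebook, then read the text back.
buffer = io.StringIO()
toy_processed.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.dtypes)
```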