<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Name" data-toc-modified-id="Name-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Name</a></span></li><li><span><a href="#Search" data-toc-modified-id="Search-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Search</a></span><ul class="toc-item"><li><span><a href="#Load-Cached-Results" data-toc-modified-id="Load-Cached-Results-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load Cached Results</a></span></li><li><span><a href="#Build-Model-From-Google-Images" data-toc-modified-id="Build-Model-From-Google-Images-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Build Model From Google Images</a></span></li></ul></li><li><span><a href="#Analysis" data-toc-modified-id="Analysis-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Analysis</a></span><ul class="toc-item"><li><span><a href="#Gender-cross-validation" data-toc-modified-id="Gender-cross-validation-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Gender cross validation</a></span></li><li><span><a href="#Face-Sizes" data-toc-modified-id="Face-Sizes-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Face Sizes</a></span></li><li><span><a href="#Screen-Time-Across-All-Shows" data-toc-modified-id="Screen-Time-Across-All-Shows-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Screen Time Across All Shows</a></span></li><li><span><a href="#Appearances-on-a-Single-Show" data-toc-modified-id="Appearances-on-a-Single-Show-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Appearances on a Single Show</a></span></li></ul></li><li><span><a href="#Persist-to-Cloud" data-toc-modified-id="Persist-to-Cloud-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Persist to Cloud</a></span><ul class="toc-item"><li><span><a href="#Save-Model-to-Google-Cloud-Storage" data-toc-modified-id="Save-Model-to-Google-Cloud-Storage-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Save Model to Google Cloud 
Storage</a></span></li><li><span><a href="#Save-Labels-to-DB" data-toc-modified-id="Save-Labels-to-DB-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Save Labels to DB</a></span><ul class="toc-item"><li><span><a href="#Commit-the-person-and-labeler" data-toc-modified-id="Commit-the-person-and-labeler-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>Commit the person and labeler</a></span></li><li><span><a href="#Commit-the-FaceIdentity-labels" data-toc-modified-id="Commit-the-FaceIdentity-labels-4.2.2"><span class="toc-item-num">4.2.2&nbsp;&nbsp;</span>Commit the FaceIdentity labels</a></span></li></ul></li></ul></li></ul></div>

In [None]:
from esper.prelude import *
from esper.identity import *
from esper import embed_google_images

# Name

Please add the person's name and their expected gender below (Male/Female).

In [None]:
name = 'Melissa Francis'
gender = 'Female'

# Search

## Load Cached Results

This section reads a cached identity model from local disk. Run it if the person has been labelled before and you only wish to regenerate the graphs. If you have never created a model for this person, skip to the next section.

In [None]:
assert name != ''
results = FaceIdentityModel.load(name=name)
imshow(tile_imgs([cv2.resize(x[1][0], (200, 200)) for x in results.model_params['images']], cols=10))
plt.show()
plot_precision_and_cdf(results)

## Build Model From Google Images

Run this section if you do not have a cached model and precision curve estimates. This section will grab images using Google Image Search and score each of the faces in the dataset. We will interactively build the precision vs score curve.

It is important that the images that you select are accurate. If you make a mistake, rerun the cell below.

In [None]:
assert name != ''
# Grab face images from Google
img_dir = embed_google_images.fetch_images(name)

# If the images returned are not satisfactory, rerun the above with extra params:
#     query_extras='' # additional keywords to add to search
#     force=True      # ignore cached images

face_imgs = load_and_select_faces_from_images(img_dir)
face_embs = embed_google_images.embed_images(face_imgs)
assert len(face_embs) == len(face_imgs)

reference_imgs = tile_imgs([cv2.resize(x[0], (200, 200)) for x in face_imgs if x], cols=10)
def show_reference_imgs():
    print('User selected reference images for {}.'.format(name))
    imshow(reference_imgs)
    plt.show()
show_reference_imgs()

In [None]:
# Score all of the faces in the dataset (this can take a minute)
face_ids_by_bucket, face_ids_to_score = face_search_by_embeddings(face_embs)
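
The names `face_ids_by_bucket` and `face_ids_to_score` suggest that each face is scored against the reference embeddings and then grouped into score ranges. The following is a minimal sketch of that idea, not the actual `face_search_by_embeddings` implementation: the helper names, the use of Euclidean distance to the nearest reference, and the bucket edges are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def bucket_faces(face_embeddings, reference_embeddings, bucket_edges):
    """Hypothetical helper: score faces and group them into distance buckets.

    face_embeddings: dict of face id -> embedding vector
    reference_embeddings: list of embedding vectors for the target person
    bucket_edges: ascending upper bounds; faces farther than the last edge
                  are left out of every bucket
    """
    buckets = {i: [] for i in range(len(bucket_edges))}
    distances = {}
    for face_id, emb in face_embeddings.items():
        # Score each face by its distance to the closest reference image
        d = min(euclidean(emb, ref) for ref in reference_embeddings)
        distances[face_id] = d
        for i, edge in enumerate(bucket_edges):
            if d <= edge:
                buckets[i].append(face_id)
                break
    return buckets, distances

# Toy 2-D embeddings (real face embeddings have many more dimensions)
references = [(0.0, 0.0)]
faces = {1: (0.1, 0.0), 2: (0.6, 0.8), 3: (3.0, 4.0)}
buckets, distances = bucket_faces(faces, references, bucket_edges=[0.5, 1.5])
print(buckets)
```

Bucketing keeps the later labelling step manageable: precision only needs to be estimated once per bucket rather than once per face.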

In [None]:
precision_model = PrecisionModel(face_ids_by_bucket)

Now we will validate which of the images in the dataset are of the target identity.

__Hover over with mouse and press S to select a face. Press F to expand the frame.__

In [None]:
show_reference_imgs()
print(('Mark all images that ARE NOT {}. Thumbnails are ordered by DESCENDING distance '
       'to your selected images. (The first page is more likely to have non "{}" images.) '
       'There are a total of {} frames. (CLICK THE DISABLE JUPYTER KEYBOARD BUTTON '
       'BEFORE PROCEEDING)').format(
       name, name, precision_model.get_lower_count()))
lower_widget = precision_model.get_lower_widget()
lower_widget

In [None]:
show_reference_imgs()
print(('Mark all images that ARE {}. Thumbnails are ordered by ASCENDING distance '
       'to your selected images. (The first page is more likely to have "{}" images.) '
       'There are a total of {} frames. (CLICK THE DISABLE JUPYTER KEYBOARD BUTTON '
       'BEFORE PROCEEDING)').format(
       name, name, precision_model.get_upper_count()))  # assumes get_upper_count() mirrors get_lower_count()
upper_widget = precision_model.get_upper_widget()
upper_widget

Run the following cell after labelling to compute the precision curve. Do not forget to re-enable Jupyter keyboard shortcuts.

In [None]:
# Compute the precision from the selections
lower_precision = precision_model.compute_precision_for_lower_buckets(lower_widget.selected)
upper_precision = precision_model.compute_precision_for_upper_buckets(upper_widget.selected)
precision_by_bucket = {**lower_precision, **upper_precision}

results = FaceIdentityModel(
    name=name, 
    face_ids_by_bucket=face_ids_by_bucket, 
    face_ids_to_score=face_ids_to_score,
    precision_by_bucket=precision_by_bucket, 
    model_params={
        'images': list(zip(face_embs, face_imgs))
    }
)
plot_precision_and_cdf(results)
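
To illustrate how selections could turn into a precision estimate, here is a minimal sketch. It assumes that precision for a bucket is the fraction of shown faces that belong to the target person. The function names and inputs are hypothetical and do not reproduce the `PrecisionModel` internals. In the first widget the user marks faces that are NOT the person, so marked faces count against precision; in the second widget the user marks faces that ARE the person.

```python
def precision_when_marking_negatives(shown_ids, marked_ids):
    """Marked faces are NOT the person; everything else shown is assumed correct."""
    shown = set(shown_ids)
    if not shown:
        return None  # no samples, so no estimate
    wrong = len(shown & set(marked_ids))
    return 1.0 - wrong / len(shown)

def precision_when_marking_positives(shown_ids, marked_ids):
    """Marked faces ARE the person; everything else shown is assumed incorrect."""
    shown = set(shown_ids)
    if not shown:
        return None
    right = len(shown & set(marked_ids))
    return right / len(shown)

# Hypothetical buckets: bucket id -> (faces shown, faces the user marked)
lower = {0: ([1, 2, 3, 4], [4])}
upper = {1: ([5, 6, 7, 8], [5])}

precision = {b: precision_when_marking_negatives(s, m) for b, (s, m) in lower.items()}
precision.update({b: precision_when_marking_positives(s, m) for b, (s, m) in upper.items()})
print(precision)
```

Merging the two dictionaries mirrors the `{**lower_precision, **upper_precision}` step in the cell above.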

The next cell persists the model locally.

In [None]:
results.save()

# Analysis

## Gender cross validation

Cases where the identity model disagrees with the gender classifier may be cause for alarm. As a sanity check, we would like to confirm that instances of the person have the expected gender. This section shows the breakdown of the identity instances by their labels from the gender classifier.

In [None]:
gender_breakdown = compute_gender_breakdown(results)

print('Expected counts by gender:')
for k, v in gender_breakdown.items():
    print('  {} : {}'.format(k, int(v)))
print()

print('Percentage by gender:')
denominator = sum(v for v in gender_breakdown.values())
for k, v in gender_breakdown.items():
    print('  {} : {:0.1f}%'.format(k, 100 * v / denominator))
print()
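
One plausible reading of "expected counts" is that each face contributes its identity probability, rather than a full count of one, to the total for its gender label. The sketch below works under that assumption. The input format and the function name are hypothetical and are not taken from `compute_gender_breakdown`.

```python
from collections import defaultdict

def expected_counts_by_gender(faces):
    """Sum identity probabilities per gender label.

    faces: iterable of (gender_label, identity_probability) pairs
    """
    totals = defaultdict(float)
    for gender_label, probability in faces:
        totals[gender_label] += probability
    return dict(totals)

# Toy data: two faces labelled Female and one labelled Male
toy_faces = [('Female', 0.75), ('Female', 0.5), ('Male', 0.25)]
counts = expected_counts_by_gender(toy_faces)
total = sum(counts.values())
for label, value in counts.items():
    print('  {} : {:0.1f}%'.format(label, 100 * value / total))
```

Weighting by probability means a few low-confidence faces with the unexpected gender label contribute little to the breakdown, while many high-confidence ones stand out.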

Cases where the identity detector returns high confidence but the gender is not the expected gender indicate an error in either the identity detector or the gender detector. The following visualization shows randomly sampled images for which the identity detector returns high confidence, grouped by gender label.

In [None]:
high_probability_threshold = 0.8
show_gender_examples(results, high_probability_threshold)

## Face Sizes

Faces shown on-screen vary in size. A person such as a host may be shown in a full-body shot or as a face in a box. Faces in the background or in side graphics might be smaller than the rest. When calculating screentime for a person, we would like to know whether the results represent the time the person was featured, as opposed to merely appearing in the background or as a tiny thumbnail in some graphic.

The next cell plots the distribution of face sizes. Possible anomalies include a distribution with only very small faces or only very large faces.

In [None]:
plot_histogram_of_face_sizes(results)

The histogram above shows the distribution of face sizes, but not how those sizes occur in the dataset. For instance, one might ask why some faces are so large or whether the small faces are actually errors. The following cell groups example faces, which are of the target identity with high probability, by their size in terms of screen area.

In [None]:
high_probability_threshold = 0.8
show_faces_by_size(results, high_probability_threshold, n=10)
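
As a rough illustration of "size in terms of screen area", the sketch below computes the fraction of the frame covered by a face bounding box and bins those fractions into a histogram. It assumes bounding boxes are stored as `(x1, y1, x2, y2)` in coordinates normalized to the range 0 to 1. That format and both helper names are assumptions for illustration only.

```python
def face_area_fraction(bbox):
    """Fraction of the frame covered by a normalized (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def histogram(values, n_bins, lo=0.0, hi=1.0):
    """Count values in n_bins equal-width bins over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if v < lo or v > hi:
            continue  # ignore out-of-range values
        # The right edge (v == hi) falls into the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Toy boxes: a featured face, a small inset face, and a background face
boxes = [(0.25, 0.25, 0.75, 0.75), (0.0, 0.0, 0.2, 0.2), (0.9, 0.9, 1.0, 1.0)]
areas = [face_area_fraction(b) for b in boxes]
print(histogram(areas, n_bins=4))
```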

## Screen Time Across All Shows

One question that we might ask about a person is whether they received a significantly different amount of screentime on different shows. The following section visualizes the amount of screentime by show, both in total minutes and as a proportion of the show's total time. For a celebrity or political figure such as Donald Trump, we would expect significant screentime on many shows. For a show host such as Wolf Blitzer, we expect the screentime to be high on the shows he hosts.

In [None]:
screen_time_by_show = get_screen_time_by_show(results)

In [None]:
plot_screen_time_by_show(name, screen_time_by_show)
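
The two quantities described above, total minutes and proportion of a show's airtime, can be sketched as follows. This is an illustrative aggregation only. It assumes each appearance is a `(show, duration_seconds, probability)` tuple and that expected screentime weights each duration by the identity probability. None of these names or formats come from `get_screen_time_by_show`.

```python
from collections import defaultdict

def screen_time_by_show(appearances, show_total_seconds):
    """Return {show: (expected_minutes, proportion_of_show_airtime)}."""
    expected_seconds = defaultdict(float)
    for show, duration_seconds, probability in appearances:
        # Weight each appearance by how likely it is to be the target person
        expected_seconds[show] += duration_seconds * probability
    result = {}
    for show, seconds in expected_seconds.items():
        total = show_total_seconds.get(show)
        proportion = seconds / total if total else None
        result[show] = (seconds / 60.0, proportion)
    return result

# Toy data: two shows with different total airtime
appearances = [('Show A', 60.0, 1.0), ('Show A', 120.0, 0.5), ('Show B', 30.0, 0.5)]
totals = {'Show A': 3600.0, 'Show B': 1800.0}
for show, (minutes, proportion) in screen_time_by_show(appearances, totals).items():
    print('{}: {:0.2f} min, {:0.2%} of airtime'.format(show, minutes, proportion))
```

Reporting the proportion alongside the raw minutes helps compare shows with very different amounts of footage in the dataset.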

## Appearances on a Single Show

For people such as hosts, we would like to examine in greater detail the screen time allotted for a single show. First, fill in a show below.

In [None]:
show_name = 'Happening Now'

In [None]:
# Compute the screen time for each video of the show
screen_time_by_video_id = compute_screen_time_by_video(results, show_name)

One question we might ask about a host is how long they are shown on screen in an episode. Likewise, we might ask how many episodes the host is absent from, for example due to being on vacation or on assignment elsewhere. The following cell plots a histogram of the length of the person's appearances in videos of the chosen show.

In [None]:
plot_histogram_of_screen_times_by_video(name, show_name, screen_time_by_video_id)

For a host, we expect screentime to be consistent over time as long as the person remains a host. For figures such as Hillary Clinton, we expect the screentime to track events in the real world, such as the lead-up to the 2016 election, and then to drop afterwards. The following cell plots a time series of the person's screentime. Each dot is a video of the chosen show. Red Xs are videos for which the face detector did not run.

In [None]:
plot_screentime_over_time(name, show_name, screen_time_by_video_id)

We hypothesized that a host is more likely to appear at the beginning of a video and then also appear throughout the video. The following plot visualizes the distribution of shot beginning times for videos of the show.

In [None]:
plot_distribution_of_appearance_times_by_video(results, show_name)
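
One way to compare shot beginning times across videos of different lengths is to express each start time as a fraction of its video's length before binning. The sketch below does that under assumed inputs: a dict of video id to shot start times in seconds, and a dict of video id to video length in seconds. The helper names are hypothetical, and the plotting function above may normalize differently or not at all.

```python
def relative_start_times(shot_starts_by_video, video_lengths):
    """Convert shot start times (seconds) to fractions of each video's length."""
    fractions = []
    for video_id, starts in shot_starts_by_video.items():
        length = video_lengths.get(video_id)
        if not length:
            continue  # skip videos with unknown or zero length
        for start in starts:
            fractions.append(min(max(start / length, 0.0), 1.0))
    return fractions

def bin_fractions(fractions, n_bins):
    """Count fractions in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for f in fractions:
        counts[min(int(f * n_bins), n_bins - 1)] += 1
    return counts

# Toy data: the person appears at the start of both videos and again midway
starts = {1: [0.0, 30.0, 1800.0], 2: [0.0, 900.0]}
lengths = {1: 3600.0, 2: 1800.0}
print(bin_fractions(relative_start_times(starts, lengths), n_bins=4))
```

A host who opens every episode would show up as a spike in the first bin, with the remaining appearances spread across the others.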

# Persist to Cloud

The remaining code in this notebook uploads the built identity model to Google Cloud Storage and adds the FaceIdentity labels to the database.

## Save Model to Google Cloud Storage

In [None]:
gcs_model_path = results.save_to_gcs()

To ensure that the model stored to Google Cloud is valid, we load it and plot the precision and CDF curves below.

In [None]:
gcs_results = FaceIdentityModel.load_from_gcs(name=name)
imshow(tile_imgs([cv2.resize(x[1][0], (200, 200)) for x in gcs_results.model_params['images']], cols=10))
plt.show()
plot_precision_and_cdf(gcs_results)

## Save Labels to DB

If you are satisfied with the model, we can commit the labels to the database.

In [None]:
from django.core.exceptions import ObjectDoesNotExist

def standardize_name(name):
    return name.lower()

person_type = ThingType.objects.get(name='person')

try:
    person = Thing.objects.get(name=standardize_name(name), type=person_type)
    print('Found person:', person.name)
except ObjectDoesNotExist:
    person = Thing(name=standardize_name(name), type=person_type)
    print('Creating person:', person.name)

labeler = Labeler(name='face-identity-{}'.format(person.name), data_path=gcs_model_path)

### Commit the person and labeler

The labeler and person have been created but not yet saved to the database. If a person was created, please make sure that the name is correct before saving.

In [None]:
person.save()
labeler.save()

### Commit the FaceIdentity labels

Now, we are ready to add the labels to the database. We will create a FaceIdentity for each face whose probability exceeds the minimum threshold.

In [None]:
commit_face_identities_to_db(results, person, labeler, min_threshold=0.001)
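
To make the thresholding step concrete, here is a minimal sketch of selecting which faces would receive a label. It assumes that a face's probability is the estimated precision of the bucket it falls in. That assumption, along with the helper names, is illustrative and is not taken from `commit_face_identities_to_db`.

```python
def face_probabilities(face_ids_by_bucket, precision_by_bucket):
    """Assign each face the precision of its bucket (assumed probability model)."""
    return {
        face_id: precision_by_bucket[bucket]
        for bucket, face_ids in face_ids_by_bucket.items()
        if bucket in precision_by_bucket
        for face_id in face_ids
    }

def faces_above_threshold(probabilities, min_threshold):
    """Face ids whose probability strictly exceeds min_threshold, in sorted order."""
    return sorted(fid for fid, p in probabilities.items() if p > min_threshold)

# Toy buckets: bucket 0 is near the reference images, bucket 2 is far away
toy_buckets = {0: [1, 2], 1: [3], 2: [4, 5]}
toy_precision = {0: 0.9, 1: 0.5, 2: 0.0005}
probabilities = face_probabilities(toy_buckets, toy_precision)
print(faces_above_threshold(probabilities, min_threshold=0.001))
```

A low threshold such as 0.001 keeps nearly every plausible face and leaves it to downstream queries to filter by probability.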

In [None]:
print('Committed {} labels to the db'.format(FaceIdentity.objects.filter(labeler=labeler).count()))