# dialogue_scene_boundary_aggregation

Previously, we identified two-character dialogue scenes within an arbitrary series of input frames. Next, we'll apply those functions across an entire film.

In [1]:
import os
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input
from keras import models
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from scene_cluster_io import *

Using TensorFlow backend.


In [2]:
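physical_devices = tf.config.list_physical_devices('GPU')
# enable memory growth only when a GPU is present, so this cell also runs on CPU-only machines
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)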

## Arbitrary Input Frames
Here's an end-to-end example of generating scenes from an arbitrary range of input frames (650 through 1049).

In [3]:
film = 'extremely_wicked'
frame_choice = list(range(650, 1050))
threshold = 3000

dialogue_folder = os.path.join('dialogue_frames', film)
print('There are', len(os.listdir(dialogue_folder)), 'images in the folder')
print('Selected', len(frame_choice), 'of those frames')

hac_labels = label_clusters(dialogue_folder, frame_choice, film, threshold)

There are 6603 images in the folder
Selected 400 of those frames
Number of clusters: 18


In [4]:
tuned_model = models.load_model('saved_models/tuned_model')

In [5]:
y_pred_values = predict_mcu(dialogue_folder, tuned_model, frame_choice, film)

In [6]:
shot_id_list = get_shot_ids(frame_choice, hac_labels)
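
The implementation of `get_shot_ids()` lives in `scene_cluster_io` and isn't shown here. As a rough sketch of the idea, assuming a shot is simply a run of consecutive frames that share a cluster label, a shot ID could be assigned as follows. The helper name `shot_ids_from_labels` is hypothetical and used only for illustration.

```python
def shot_ids_from_labels(cluster_labels):
    """Give each run of identical consecutive cluster labels its own shot ID.

    Assumption: a new shot starts whenever the cluster label changes
    from one frame to the next.
    """
    shot_ids = []
    current_id = -1
    previous_label = object()  # sentinel that never equals a real label
    for label in cluster_labels:
        if label != previous_label:
            current_id += 1
            previous_label = label
        shot_ids.append(current_id)
    return shot_ids


# three frames of cluster 2, two of cluster 0, then back to cluster 2
example_shot_ids = shot_ids_from_labels([2, 2, 2, 0, 0, 2])
```

Under this assumption, returning to a cluster that was seen earlier starts a new shot rather than reusing the old ID, which is what lets alternating speakers be counted as separate shots.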

In [7]:
scene_df = pd.DataFrame(zip(frame_choice, hac_labels, shot_id_list, y_pred_values), columns=['frame_file', 'cluster', 'shot_id', 'mcu'])
scene_df.head(3)

|   | frame_file | cluster | shot_id | mcu |
|---|------------|---------|---------|-----|
| 0 | 650        | 2       | 0       | 1   |
| 1 | 651        | 2       | 0       | 1   |
| 2 | 652        | 2       | 0       | 1   |


In [8]:
alternating_pairs = get_alternating_pairs(frame_choice, hac_labels, y_pred_values, shot_id_list)
alternating_pairs

[[0, 2], [9, 13], [0, 1], [5, 10], [0, 6]]

In [9]:
speaker_pairs = mcu_check(alternating_pairs, scene_df)
speaker_pairs

cluster	 count	 mcu probability
0 	 165 	 48.48%
2 	 61 	 65.57%
Fails MCU check

9 	 28 	 96.43%
13 	 21 	 100.00%
Passes MCU check

0 	 165 	 48.48%
1 	 9 	 66.67%
Fails MCU check

5 	 23 	 95.65%
10 	 17 	 100.00%
Passes MCU check

0 	 165 	 48.48%
6 	 9 	 55.56%
Fails MCU check



[[9, 13], [5, 10]]
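
The printed results are consistent with a simple rule: a pair passes only when more than half of the frames in both clusters are predicted to be medium close-ups (MCUs). The actual rule inside `mcu_check()` isn't shown, so the sketch below is an assumption inferred from the output; the helper name `passes_mcu_check` and the `min_share` parameter are hypothetical.

```python
def passes_mcu_check(pair, clusters, mcu_flags, min_share=0.5):
    """Return True if every cluster in `pair` has an MCU share above `min_share`.

    clusters  -- cluster label for each frame
    mcu_flags -- 1 if the frame is predicted to be an MCU, else 0
    Assumption: the 50% cutoff is inferred from the printed output,
    not taken from the real mcu_check().
    """
    for cluster_id in pair:
        flags = [m for c, m in zip(clusters, mcu_flags) if c == cluster_id]
        if not flags or sum(flags) / len(flags) <= min_share:
            return False
    return True


# toy data: clusters 9 and 13 are all MCUs, cluster 0 is only 25% MCUs
toy_clusters = [9, 9, 9, 9, 13, 13, 0, 0, 0, 0]
toy_mcu = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
pair_ok = passes_mcu_check([9, 13], toy_clusters, toy_mcu)
pair_bad = passes_mcu_check([0, 9], toy_clusters, toy_mcu)
```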

In [10]:
anchors = anchor_scenes(speaker_pairs, scene_df)
anchors

Speaker A and B Clusters: [9, 13]
Anchor Start/End Frames: 690 743

Speaker A and B Clusters: [5, 10]
Anchor Start/End Frames: 954 997



[(690, 743), (954, 997)]
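
One plausible reading of the anchor step is that it takes the first and last frames belonging to either speaker cluster. That is an assumption based on the output above, not the confirmed behavior of `anchor_scenes()`; the helper `anchor_span` below is hypothetical.

```python
def anchor_span(pair, frames, clusters):
    """Return (first_frame, last_frame) among frames whose cluster is in `pair`.

    Assumption: an anchor is the min/max frame covered by the two
    speaker clusters; returns None if neither cluster appears.
    """
    member_frames = [f for f, c in zip(frames, clusters) if c in pair]
    if not member_frames:
        return None
    return (min(member_frames), max(member_frames))


toy_frames = list(range(690, 700))
toy_labels = [9, 9, 13, 13, 0, 9, 9, 13, 0, 0]
toy_anchor = anchor_span([9, 13], toy_frames, toy_labels)
```

Note that under this reading a cutaway sitting between two speaker shots (cluster 0 at frame 694 in the toy data) falls inside the anchor even though it belongs to neither speaker.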

In [11]:
scenes = expand_scenes(speaker_pairs, scene_df)
scenes

Speaker A and B Clusters: [9, 13]
Anchor Start/End Frames: 690 743
Cutaway Clusters: [0]
Expanded Start/End Frames: 690 745

Speaker A and B Clusters: [5, 10]
Anchor Start/End Frames: 954 997
Cutaway Clusters: [16]
Expanded Start/End Frames: 954 997



[(690, 745), (954, 997)]

## Functions
This entire process can be bundled into a single function, `generate_scenes()`, which calls each of the individual functions in turn. It returns a list of `(scene_start, scene_end)` tuples.

In [2]:
def generate_scenes(film, frame_choice, threshold):
    """Run the full pipeline on one set of frames; return a list of (scene_start, scene_end) tuples."""
    # note: relies on the global `tuned_model`, which must be loaded before this function is called
    dialogue_folder = os.path.join('dialogue_frames', film)

    hac_labels = label_clusters(dialogue_folder, frame_choice, film, threshold)
    print()
    y_pred_values = predict_mcu(dialogue_folder, tuned_model, frame_choice, film)
    shot_id_list = get_shot_ids(frame_choice, hac_labels)
    scene_df = pd.DataFrame(zip(frame_choice, hac_labels, shot_id_list, y_pred_values), columns=['frame_file', 'cluster', 'shot_id', 'mcu'])
    alternating_pairs = get_alternating_pairs(frame_choice, hac_labels, y_pred_values, shot_id_list)
    speaker_pairs = mcu_check(alternating_pairs, scene_df)
    anchors = anchor_scenes(speaker_pairs, scene_df)  # result is not used below; the call prints the anchor summary
    scenes = expand_scenes(speaker_pairs, scene_df)

    return scenes

We plan to call this function repeatedly across the entire film, which presents a problem: each call returns a list, so we'll end up with a list of lists. The function below flattens a list of lists into a single list, removing duplicates while preserving order.

In [3]:
def clean_scene_list(scene_list):

    # take list of lists and just make a singular list
    scene_set = []

    for results in scene_list:
        for scene in results:
            scene_set.append(scene)

    # remove duplicates while maintaining order
    seen = set()
    seen_add = seen.add
    
    return [x for x in scene_set if not (x in seen or seen_add(x))]
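
For comparison, the same flatten-and-deduplicate behavior can be written more compactly with the standard library. This is an equivalent sketch rather than a replacement for the function above; the name `flatten_unique` is hypothetical. It relies on `dict` preserving insertion order (Python 3.7+) and on scenes being hashable tuples.

```python
from itertools import chain


def flatten_unique(scene_lists):
    """Flatten a list of lists and drop duplicates, keeping first-seen order."""
    # chain.from_iterable flattens one level; dict.fromkeys keeps the first
    # occurrence of each scene in insertion order
    return list(dict.fromkeys(chain.from_iterable(scene_lists)))


flat = flatten_unique([[(234, 274)], [(234, 274), (690, 743)], []])
```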

In [4]:
tuned_model = models.load_model('saved_models/tuned_model')

## Sliding Window

To get scenes from the entire film, we'll repeatedly run `generate_scenes()` on a subset of frames. Since we're limited by hardware, we can't load all of the film's frames at once. Instead, we slide a "window" of 400 frames through the film, advancing it by 100 frames each time. These are the first four ranges analyzed using this technique.

- 0-399
- 100-499
- 200-599
- 300-699

By overlapping consecutive windows generously, we can still capture a scene that straddles a window boundary. For example, a scene spanning frames 350 to 450 won't be picked up in the first window (0-399), but will be in the second (100-499). It's worth noting that each window covers nearly seven minutes of the film; the average scene is two minutes, so there is little danger of a lengthy scene going unidentified.
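
The window arithmetic can be sketched as a small generator. The function name `window_ranges` and its parameters are hypothetical. Unlike the `while` loop used in the notebook's test cell, this sketch also yields the last window that fits, ending at the final frame.

```python
def window_ranges(total_frames, window_size=400, step=100):
    """Yield overlapping ranges of frame indices that fit within total_frames."""
    start = 0
    while start + window_size <= total_frames:
        yield range(start, start + window_size)
        start += step


windows = list(window_ranges(2000))

# which windows fully contain a scene spanning frames 350 to 450?
containing = [w.start for w in windows if 350 in w and 450 in w]
```

With a 400-frame window and a 100-frame step, any scene of up to 300 frames is fully contained in at least one window, provided it doesn't run past the last window.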

Below is a small test, limiting the total frames to 2,000 instead of all of the film's frames.

In [5]:
film = 'extremely_wicked'
threshold = 3000

x = 200

frame_choices = [] # holds all ranges for our sliding window

while x < (2000 - 200): # hard-coded limit of 2,000 frames, instead of all the film's frames, for demonstration purposes (the last window is 1500-1899)
    frame_choices.append(range(x - 200, x + 200)) # the first window is range(0, 400), i.e. frames 0-399
    x += 100 # slide by 100 frames, so the second window is range(100, 500)

scene_list = []

for frames in frame_choices:
    scene_list.append(generate_scenes(film, frames, threshold))

scene_list = clean_scene_list(scene_list)

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
Number of clusters: 19

cluster	 count	 mcu probability
0 	 68 	 17.65%
3 	 138 	 48.55%
Fails MCU check

2 	 64 	 48.44%
3 	 138 	 48.55%
Fails MCU check

7 	 22 	 59.09%
13 	 12 	 66.67%
Passes MCU check

3 	 138 	 48.55%
4 	 9 	 88.89%
Fails MCU check

Speaker A and B Clusters: [7, 13]
Anchor Start/End Frames: 234 274

Speaker A and B Clusters: [7, 13]
Anchor Start/End Frames: 234 274
Cutaway Clusters: [6]
Expanded Start/End Frames: 234 274

Number of clusters: 17

cluster	 count	 mcu probability
0 	 64 	 48.44%
1 	 162 	 46.91%
Fails MCU check

8 	 22 	 59.09%
13 	 12 	 66.67%
Passes MCU check

1 	 162 	 46.91%
7 	 9 	 88.89%
Fails MCU check

1 	 162 	 46.91%
2 	 62 	 38.71%
Fails MCU check

1 	 162 	 46.91%
11 	 8 	 62.50%
Fails MCU check

Speaker A and B Clusters: [8, 13]
Anchor Start/End Frames: 234 274

Speaker A and B Clusters: [8, 13]

Number of clusters: 14

cluster	 count	 mcu probability
2 	 23 	 100.00%
7 	 13 	 61.54%
Passes MCU check

0 	 217 	 55.30%
4 	 12 	 41.67%
Fails MCU check

0 	 217 	 55.30%
2 	 23 	 100.00%
Passes MCU check

0 	 217 	 55.30%
11 	 24 	 8.33%
Fails MCU check

0 	 217 	 55.30%
8 	 18 	 100.00%
Passes MCU check

0 	 217 	 55.30%
3 	 33 	 87.88%
Passes MCU check

Speaker A and B Clusters: [2, 7]
Anchor Start/End Frames: 1215 1381

Speaker A and B Clusters: [0, 2]
Anchor Start/End Frames: 1200 1599

Speaker A and B Clusters: [0, 8]
Anchor Start/End Frames: 1200 1599

Speaker A and B Clusters: [0, 3]
Anchor Start/End Frames: 1200 1599

Speaker A and B Clusters: [2, 7]
Anchor Start/End Frames: 1215 1381
Cutaway Clusters: [1 0 4]
Expanded Start/End Frames: 1200 1386

Speaker A and B Clusters: [0, 2]
Anchor Start/End Frames: 1200 1599
Cutaway Clusters: [ 4  7  1 10 12 13  6  3  5  9  8 11]
Expanded Start/End Frames: 1200 1599

Speaker A and B Clusters: [0, 8]
Anchor Start/End Frames: 1200 1599


These functions worked as intended, identifying many candidate scenes. However, the high number of candidates, several of which (for example `(600, 999)` and `(1200, 1599)`) span an entire 400-frame window, suggests we should make the original scene-partitioning algorithm more discriminating.

In [6]:
scene_list

[(234, 274),
 (235, 274),
 (690, 743),
 (600, 999),
 (700, 751),
 (700, 1099),
 (954, 997),
 (800, 1189),
 (1128, 1189),
 (1134, 1189),
 (998, 1234),
 (1019, 1399),
 (1000, 1386),
 (1100, 1449),
 (1100, 1499),
 (1200, 1386),
 (1200, 1599),
 (1300, 1699),
 (1391, 1672),
 (1400, 1799),
 (1400, 1672),
 (1594, 1672),
 (1400, 1764),
 (1500, 1899),
 (1598, 1658),
 (1578, 1899),
 (1676, 1764),
 (1776, 1874)]
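
Several of these candidates are exactly as long as the 400-frame window itself. As one possible stopgap, and not a change to the scene-partitioning algorithm, such candidates could be filtered out after the fact. The helper `drop_full_window_scenes` is hypothetical, and it assumes that a scene filling the whole window is more likely an artifact than a real scene.

```python
def drop_full_window_scenes(scenes, window_size=400):
    """Keep only scenes shorter than the sliding window.

    Scene bounds are inclusive, so (600, 999) covers 400 frames.
    Assumption: window-length candidates are treated as false positives.
    """
    return [(start, end) for start, end in scenes if (end - start + 1) < window_size]


sample = [(690, 743), (600, 999), (954, 997), (1200, 1599), (800, 1189)]
filtered = drop_full_window_scenes(sample)
```

This leaves near-window-length candidates such as `(800, 1189)` in place, so it narrows the list rather than resolving the underlying issue.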

The code block below uses the same algorithm as before, except that it slides the 400-frame window through all of the film's frames. However, this yielded an out-of-memory (OOM) error. To work around this hardware constraint, we could apply the window technique to just 2,000 frames at a time, as above. That approach resembles the sliding-window strategy itself, so we'd follow the same principle and make sure consecutive 2,000-frame subsets overlap enough that no scene is missed because it straddles a subset boundary.

In [None]:
film = 'hustle'
threshold = 3000

x = 200

frame_choices = []
dialogue_folder = os.path.join('dialogue_frames', film)
n_frames = len(os.listdir(dialogue_folder)) # count the frames once, rather than on every loop iteration

while x < (n_frames - 200):
    frame_choices.append(range(x - 200, x + 200))
    x += 100

scene_list = []

for frames in frame_choices:
    scene_list.append(generate_scenes(film, list(frames), threshold))

scene_list = clean_scene_list(scene_list)
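
The 2,000-frames-at-a-time idea could be sketched as follows. The function name `chunk_ranges` and the 400-frame overlap (one full window) are assumptions, and whether chunking actually avoids the OOM error depends on what is holding memory between calls, which hasn't been determined here.

```python
def chunk_ranges(total_frames, chunk_size=2000, overlap=400):
    """Split the film into overlapping (start, end) chunks, end exclusive.

    Assumption: overlapping by one window length (400 frames) means any
    scene that fits in a single window lies fully inside some chunk.
    """
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < total_frames:
        end = min(start + chunk_size, total_frames)
        chunks.append((start, end))
        if end == total_frames:
            break
        start += step
    return chunks


# 6,603 is the number of frames reported earlier for 'extremely_wicked'
chunks = chunk_ranges(6603)
```

Each chunk would then be processed with the same 400-frame sliding window, and the per-chunk results combined with `clean_scene_list()`.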