
In [None]:
from query.models import Video, Shot, Labeler, Face, PoseMeta, Frame
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
import numpy as np
from django.db.models import Avg, F
from tqdm import tqdm
import rekall
from rekall.video_interval_collection import VideoIntervalCollection
from rekall.interval_list import IntervalList
from rekall.merge_ops import payload_plus
from rekall.temporal_predicates import overlaps, equal

# Replicating Cutting

In this notebook we replicate some of James Cutting's findings on our dataset, primarily those about shot duration and shot scale.

The notebook currently contains the following graphs:
* Distribution of our movies by year
* Average shot duration of movies by year
* Average shot duration **within** a movie
* Distribution of shot scales by era, and shot duration by shot scale
* Brightness and saturation by year and **within** a movie
* Average number of people per frame in a movie, by year
* Distribution of number of people in frames, split into two buckets

## Distribution of movies by year
Let's first find out what the distribution of our movies across time is.

In [None]:
all_videos = Video.objects.filter(decode_errors=False).order_by('id').all()

In [None]:
print('Number of videos: ', all_videos.count())

In [None]:
release_years = sorted([video.year for video in all_videos])

In [None]:
print('Release year range: {}-{}'.format(release_years[0], release_years[-1]))

In [None]:
# Plot histogram of release years
def hist(data, n_bins, label, title):
    fig, ax = plt.subplots(figsize=(10, 5))
    ret = ax.hist(data, n_bins, histtype='bar', label=[label])
    ax.legend()
    ax.set_title(title)
    plt.show()
    
    return ret

_, bins, _ = hist(
    release_years,
    release_years[-1] - release_years[0] + 1,  # one bin per release year
    'Number of films',
    'Histogram of films by release year')

print('bins:', bins)

## Shot Duration

### Average Shot Duration Over the Years
Now let's plot the average shot duration over time.

In [None]:
def average_shot_duration(video):
    return Shot.objects.filter(
        video_id=video.id, labeler=Labeler.objects.get(name='shot-hsvhist-face')
    ).all().aggregate(
        avg_duration=Avg(F('max_frame') - F('min_frame'))
    )['avg_duration'] / video.fps

videos_with_avg_shot_duration = [
    (video, average_shot_duration(video))
    for video in tqdm(all_videos)
]

In [None]:
def plot_shot_durations_by_year(videos_with_avg_shot_duration, min_year=None):
    data = sorted([(v.year, shot_duration) for v, shot_duration in videos_with_avg_shot_duration])
    if min_year is not None:
        data = [d for d in data if d[0] >= min_year]

    x = [d[0] for d in data]
    y = [d[1] for d in data]
    
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y, s=3)
    ax.set_ylim(0, 20)
    ax.set_xlabel('Year')
    ax.set_ylabel('Average Shot Duration (seconds)')
    ax.set_title('Average shot durations with cubic and linear fits')
    
    #ax.set_yscale('symlog')
    
    ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 3))(np.unique(x)))
    ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
    plt.show()

In [None]:
plot_shot_durations_by_year(videos_with_avg_shot_duration)
#plot_shot_durations_by_year(videos_with_avg_shot_duration, min_year=1930)

In [None]:
# Which movies from the 2010s have really long average shot durations (over 8 seconds)?
sorted([
    (v.title, v.year, avg_duration)
    for v, avg_duration in videos_with_avg_shot_duration if v.year >= 2010 and avg_duration > 8
], key=lambda tup: (tup[1], tup[0], tup[2]))

In [None]:
# Plot the per-year mean of the films' average shot durations
# (the per-year standard deviation is computed as well, but not plotted)
avg_shot_durations_by_year = IntervalList([
    (video.year, video.year, [shot_duration])
    for video, shot_duration in videos_with_avg_shot_duration
]).coalesce(payload_merge_op=payload_plus).map(
    lambda intrvl: (intrvl.start, intrvl.end,
        {'avg': np.mean(intrvl.payload), 'std': np.std(intrvl.payload)})
)
data = [
    (intrvl.get_start(), intrvl.get_payload()['avg'], intrvl.get_payload()['std'])
    for intrvl in avg_shot_durations_by_year.get_intervals()
]
ax = plt.gca()
ax.plot([d[0] for d in data], [d[1] for d in data])
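
The same per-year aggregation can be sketched without Rekall. This is a minimal illustration on made-up numbers rather than the notebook's actual pipeline; `per_year_stats` and the sample data are hypothetical names introduced here. It groups per-film averages by release year and reports the mean and the population standard deviation (the quantity `np.std` computes by default) for each year.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_year_stats(year_and_duration):
    """Group (year, avg_shot_duration) pairs by year.

    Returns a list of (year, mean, population std) tuples sorted by year.
    """
    by_year = defaultdict(list)
    for year, duration in year_and_duration:
        by_year[year].append(duration)
    return [
        (year, mean(durations), pstdev(durations))
        for year, durations in sorted(by_year.items())
    ]


# Made-up sample: two films from 1950 and one from 2000
sample = [(1950, 10.0), (2000, 4.0), (1950, 14.0)]
stats = per_year_stats(sample)
```

Years with a single film get a standard deviation of zero, which is worth keeping in mind when reading the spread for sparsely covered years.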

### Changes in Shot Duration Within Movies (Narrative Structure)
How do shot durations differ within movies?

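Methodology: bucket shot boundaries into 100 bins per movie. For each bin, compute a normalized count of shot transitions by dividing the number of shot boundaries in the bin by the total **number of shots** in the movie, then average this proportion across all films. Cutting's methodology does not make clear how to scale this number back up to seconds. The code below divides the proportion expected under uniform cutting (1/100) by the observed proportion and multiplies the result by the dataset-wide average shot duration.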
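
For a single film, the conversion from boundary counts back to seconds can be sketched as follows. This is only an illustration under simplifying assumptions: the function name `shot_duration_per_bin`, its inputs, and the four-bin example are hypothetical, and the averaging across films is left out.

```python
def shot_duration_per_bin(boundary_frames, num_frames, fps, num_bins=4):
    """Estimate the average shot duration (in seconds) in each bin of one film.

    boundary_frames: frame numbers at which a new shot starts (made-up input).
    """
    counts = [0] * num_bins
    for frame in boundary_frames:
        # Index the bin directly; clamp so the last frame stays in range
        bin_num = min(int(frame * num_bins / num_frames), num_bins - 1)
        counts[bin_num] += 1

    num_shots = len(boundary_frames)
    avg_duration = num_frames / fps / num_shots  # seconds per shot, whole film

    durations = []
    for count in counts:
        proportion = count / num_shots
        if proportion > 0:
            # A bin holding exactly 1/num_bins of the boundaries gets the
            # film-wide average; more boundaries mean shorter shots
            durations.append((1. / num_bins) / proportion * avg_duration)
        else:
            durations.append(float('nan'))  # no boundary fell into this bin
    return durations


# A 100-second film at 24 fps with eight shot boundaries
durations = shot_duration_per_bin(
    [100, 300, 500, 700, 900, 1500, 2000, 2300], num_frames=2400, fps=24)
```

For one film the formula reduces to the bin length divided by the number of boundaries in the bin: the first bin above is 25 seconds long and contains three boundaries, giving about 8.3 seconds per shot.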

In [None]:
def compute_shot_durations_per_bin(videos):
    bucket_proportions = [[] for i in range(0, 100)]
    labeler = Labeler.objects.get(name='shot-hsvhist-face')
    
    average_shot_duration_data = 0.
    total_num_shots = 0.
    
    for video in tqdm(videos):
        # Get all the shots, removing the first and last one to get rid of boundaries at 0 and at the end
        shots = list(Shot.objects.filter(
            video_id=video.id,
            labeler=labeler
        ).order_by('min_frame').all())[1:-1]
        num_shots = len(shots)
        if num_shots == 0:
            # No interior shot boundaries, so there is nothing to bin
            continue

        divider = video.num_frames / 100.

        # Count boundaries per bin by computing each shot's bin directly, so that
        # a bin without any boundary cannot shift later boundaries into the wrong bin
        boundaries_per_bin = [0.] * 100
        for shot in shots:
            bin_num = min(int(shot.min_frame / divider), 99)
            boundaries_per_bin[bin_num] += 1
        # Every film contributes one proportion per bin (zero for empty bins)
        for bin_num in range(0, 100):
            bucket_proportions[bin_num].append(boundaries_per_bin[bin_num] / num_shots)
        
        avg_duration = average_shot_duration(video)
        average_shot_duration_data += avg_duration * num_shots
        total_num_shots += num_shots
    
    normalized_shots_per_bin = [
        np.mean(proportions)
        for proportions in bucket_proportions
    ]
        
    # Dataset-wide average shot duration, weighted by number of shots per film
    average_shot_duration_data /= total_num_shots
    
    # Scale proportions back up to seconds relative to uniform cutting (1/100 per bin)
    shot_lengths_per_bin = [
        ((1. / 100.) / shot_proportion) * average_shot_duration_data
        for shot_proportion in normalized_shots_per_bin
    ]
    
    return shot_lengths_per_bin

In [None]:
def plot_shot_lengths_per_bin(data, title, polynomial_degree=6):
    x = [(i + 0.5) / 100. for i in range(0, 100)]
    y = data
    
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(x, y, s=3)
    ax.set_xlabel('Proportion of movie')
    ax.set_ylabel('Average Shot Duration (seconds)')
    ax.set_title(title)
    ax.set_xlim(0, 1)
    
    ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, polynomial_degree))(np.unique(x)))
    
    ax.axvline(x=0.25, color='k')
    ax.axvline(x=0.5, color='k')
    ax.axvline(x=0.75, color='k')
    
#     ax.text(x=0.1, y=6.2, s='Setup')
#     ax.text(x=0.32, y=6.2, s='Complication')
#     ax.text(x=0.57, y=6.2, s='Development')
#     ax.text(x=0.85, y=6.2, s='Climax')
    
    plt.show()

In [None]:
shot_lengths_all_videos = compute_shot_durations_per_bin(all_videos)

In [None]:
shot_lengths_1915_to_1959 = compute_shot_durations_per_bin(
    Video.objects.filter(
        decode_errors=False,
        year__gte=1915,
        year__lte=1959
    ).order_by('id').all()
)

In [None]:
shot_lengths_1960_to_1985 = compute_shot_durations_per_bin(
    Video.objects.filter(
        decode_errors=False,
        year__gte=1960,
        year__lte=1985
    ).order_by('id').all()
)

In [None]:
shot_lengths_1986_to_2016 = compute_shot_durations_per_bin(
    Video.objects.filter(
        decode_errors=False,
        year__gte=1986,
        year__lte=2016
    ).order_by('id').all()
)

In [None]:
plot_shot_lengths_per_bin(
    shot_lengths_all_videos,
    'Average shot durations over the course of our movies, full dataset',
    polynomial_degree=6
)
plot_shot_lengths_per_bin(
    shot_lengths_1915_to_1959,
    'Average shot durations over the course of our movies, 1915-1959',
    polynomial_degree=1
)
plot_shot_lengths_per_bin(
    shot_lengths_1960_to_1985,
    'Average shot durations over the course of our movies, 1960-1985',
    polynomial_degree=6
)
plot_shot_lengths_per_bin(
    shot_lengths_1986_to_2016,
    'Average shot durations over the course of our movies, 1986-2016',
    polynomial_degree=6
)

## Shot Scale
We want to find out whether shot scale has changed over time, and the relationship between shot duration and shot scale.

### Changes in shot scale over time
For this, we'll just graph the distribution of different shot scales and bucket by different eras.
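
As a rough sketch of what "distribution" means here: each film's frames are turned into per-scale proportions first, and those proportions are then averaged across films, so a long film does not outweigh a short one. The function `mean_scale_proportions` and the toy data are hypothetical and only illustrate the weighting.

```python
from collections import Counter


def mean_scale_proportions(scales_per_video, labels):
    """Average per-film shot scale proportions, weighting every film equally.

    scales_per_video: one list of per-frame shot scale names per film.
    """
    totals = {label: 0. for label in labels}
    for scales in scales_per_video:
        counts = Counter(scales)
        for label in labels:
            # Counter returns 0 for scales that never occur in this film
            totals[label] += counts[label] / len(scales)
    return {label: totals[label] / len(scales_per_video) for label in labels}


labels = ['long', 'medium', 'close_up']
toy_videos = [
    ['long', 'medium', 'medium', 'close_up'],  # four sampled frames
    ['close_up', 'close_up'],                  # two sampled frames
]
proportions = mean_scale_proportions(toy_videos, labels)
```

Pooling all six frames instead would give `close_up` a share of one half; weighting per film gives it 0.625, because the second film consists entirely of close-ups.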


In [None]:
# This takes about 4 and a half minutes to run!
frames_qs = Frame.objects.annotate(
    numbermod=F('number') % 12
).filter(numbermod=0, video_id__in=all_videos).annotate(
    shot_scale_name=F('shot_scale__name')
).all()
num_frames = frames_qs.count()
shot_scales = VideoIntervalCollection.from_django_qs(
    frames_qs,
    schema={
        "start": "number",
        "end": "number",
        "payload": "shot_scale_name"
    },
    progress=True,
    total=num_frames
)

In [None]:
def distribution_of_shot_scales(frames_with_shot_scale):
    labels = ['unknown', 'extreme_long', 'long', 'medium_long', 'medium', 'medium_close_up',
             'close_up', 'extreme_close_up']
    
    shot_scale_proportions = {
        label: 0. for label in labels
    }
    total_videos = 0.
    
    for video_id in list(frames_with_shot_scale.get_allintervals().keys()):
        counts_for_video = {
            label: 0. for label in shot_scale_proportions
        }
        total_frames = 0.
        for intrvl in frames_with_shot_scale.get_intervallist(video_id).get_intervals():
            shot_scale = intrvl.payload
            counts_for_video[shot_scale] += 1
            total_frames += 1
        for label in counts_for_video:
            shot_scale_proportions[label] += counts_for_video[label] / total_frames
        total_videos += 1
    
    return labels, [shot_scale_proportions[label] / total_videos for label in labels]

In [None]:
def graph_shot_scale_distribution(shot_scale_labels, shot_scale_distributions,
                                  distribution_labels, title):
    fig, ax = plt.subplots(figsize=(10, 5))
    x = range(0, len(shot_scale_labels))
    for distribution, label in zip(shot_scale_distributions, distribution_labels):
        ax.plot(x, distribution, label=label)
    ax.legend()
        
    ax.set_xlabel('Shot scale')
    ax.set_ylabel('Proportion of frames')
    ax.set_title(title)
    
    plt.xticks(x, shot_scale_labels)
    
    plt.show()

In [None]:
labels, shot_scale_distribution_all_videos = distribution_of_shot_scales(shot_scales)

In [None]:
_, shot_scale_distribution_1915_to_1969 = distribution_of_shot_scales(
    VideoIntervalCollection(
        {
            video_id: shot_scales.get_intervallist(video_id)
            for video_id in list(shot_scales.get_allintervals().keys())
            if Video.objects.get(id=video_id).year <= 1969
        }
    )
)

In [None]:
_, shot_scale_distribution_1970_to_2016 = distribution_of_shot_scales(
    VideoIntervalCollection(
        {
            video_id: shot_scales.get_intervallist(video_id)
            for video_id in list(shot_scales.get_allintervals().keys())
            if Video.objects.get(id=video_id).year >= 1970
        }
    )
)

In [None]:
graph_shot_scale_distribution(
    labels, 
    [
        shot_scale_distribution_all_videos,
        shot_scale_distribution_1915_to_1969,
        shot_scale_distribution_1970_to_2016
    ], 
    [
        'All videos',
        '1915-1969',
        '1970-2016'
    ],
    'Shot scale distribution for all videos'
)

### More Changes in Shot Scale over time
Next, we'll graph linear fits of the proportion of each shot scale against release year.
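
A linear fit here is an ordinary least-squares line through (year, proportion) points. The sketch below computes that line in closed form on made-up data; `np.polyfit(x, y, 1)` yields the same least-squares slope and intercept. The helper `linear_fit` and the sample values are hypothetical.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x-y cross deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up proportions of close-ups for three films
years = [1920, 1960, 2000]
close_up_proportions = [0.10, 0.20, 0.30]
slope, intercept = linear_fit(years, close_up_proportions)
predicted_1980 = slope * 1980 + intercept
```

The slope reads as the change in proportion per year, so a slope of 0.0025 corresponds to a gain of 2.5 percentage points per decade.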

In [None]:
shot_scale_labels = ['unknown', 'extreme_long', 'long', 'medium_long', 'medium', 'medium_close_up',
             'close_up', 'extreme_close_up']

def get_shot_scale_proportions(intervallist, labels):
    counts_for_video = {
        label: 0. for label in labels
    }
    total_frames = 0.
    for intrvl in intervallist.get_intervals():
        shot_scale = intrvl.payload
        counts_for_video[shot_scale] += 1
        total_frames += 1
    for label in counts_for_video:
        counts_for_video[label] = counts_for_video[label] / total_frames
    return counts_for_video

In [None]:
videos_with_shot_scale_proportions = [
    (Video.objects.get(id=video_id),
     get_shot_scale_proportions(shot_scales.get_intervallist(video_id), shot_scale_labels))
    for video_id in tqdm(list(shot_scales.get_allintervals().keys()))
]

In [None]:
def plot_shot_scale_proportions_by_year(videos_with_shot_scale_proportions, labels,
                                        fit_lines_only=False, min_year=None):
    data = sorted([(v.year, shot_scale_proportions) 
                   for v, shot_scale_proportions in videos_with_shot_scale_proportions
                  ],
                  key=lambda year_and_proportions: year_and_proportions[0])
    if min_year is not None:
        data = [d for d in data if d[0] >= min_year]

    fig, ax = plt.subplots(figsize=(10, 5))
    
    for i, label in enumerate(labels):
        x = [d[0] for d in data]
        y = [d[1][label] for d in data]
        
        if not fit_lines_only:
            ax.scatter(x, y, s=3, label=label)
            ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
        else:
            ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), label=label)
        
    ax.set_xlabel('Year')
    ax.set_ylabel('Shot scale proportion')
    ax.set_title('Shot scale proportions by year')
    if not fit_lines_only:
        ax.set_ylim(top=1)
    ax.legend(loc=(1.04,0))
    
    #ax.set_yscale('symlog')
    plt.show()

In [None]:
plot_shot_scale_proportions_by_year(videos_with_shot_scale_proportions, shot_scale_labels, fit_lines_only=False)
plot_shot_scale_proportions_by_year(videos_with_shot_scale_proportions, shot_scale_labels, fit_lines_only=True)

### Relationship between shot duration and shot scale
For this, we assign each shot a single shot scale, namely the mode of the per-frame shot scales that fall within the shot, and then graph shot duration against shot scale.
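
The idea of labelling a shot with the mode of its per-frame scales can be sketched in plain Python. This is an illustration only: `scales_for_shots`, the inclusive treatment of shot boundaries, and the sample data are assumptions made here, not the Rekall-based approach used in the notebook.

```python
from collections import Counter


def mode(items):
    """Most common item; ties go to the item encountered first."""
    return Counter(items).most_common(1)[0][0]


def scales_for_shots(shots, frame_scales):
    """Assign each shot the mode of the shot scales of the frames inside it.

    shots: list of (min_frame, max_frame) pairs, treated as inclusive.
    frame_scales: dict mapping a sampled frame number to its shot scale.
    """
    result = []
    for start, end in shots:
        in_shot = [scale for frame, scale in sorted(frame_scales.items())
                   if start <= frame <= end]
        # A shot without any sampled frame gets no scale
        result.append((start, end, mode(in_shot) if in_shot else None))
    return result


# Shot scales sampled every 12th frame (made-up values)
frame_scales = {0: 'long', 12: 'long', 24: 'medium',
                36: 'close_up', 48: 'close_up', 60: 'close_up'}
shots_with_mode = scales_for_shots([(0, 30), (31, 70), (71, 80)], frame_scales)
```

Because frames are only sampled periodically, very short shots can end up without any sampled frame, which is why the sketch returns `None` for them.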

In [None]:
# First, load all the shots into Rekall
shots_qs = Shot.objects.filter(
    video__in=all_videos,
    labeler=Labeler.objects.get(name='shot-hsvhist-face')
).all()
num_shots = shots_qs.count()
shots = VideoIntervalCollection.from_django_qs(
    shots_qs,
    progress=True,
    total=num_shots
)

In [None]:
# Next, overlap the shots with per-frame shot scales so that the payload
#   for each shot is a list of all the shot scales in the shot
# Finally, take the mode of the shot scales to get the scale of the whole shot
def get_mode(items):
    return max(set(items), key=items.count)

shots_with_scale = shots.merge(
    shot_scales,
    payload_merge_op=lambda shot_id, frame_shot_scale: [frame_shot_scale],
    predicate=overlaps(),
    working_window=1
).coalesce(
    payload_merge_op=payload_plus
).map(
    lambda shot_interval: (shot_interval.get_start(), shot_interval.get_end(),
                          get_mode(shot_interval.get_payload()))
)

In [None]:
# Compute normalized shot durations for each category of shot
def compute_shot_scale_normalized_duration(shots_with_scale):
    scale_proportions = {label: [] for label in shot_scale_labels}
    
    average_shot_duration_data = 0.
    total_videos = 0.
    
    for video_id in shots_with_scale.get_allintervals():
        scale_proportions_for_video = {label: [] for label in shot_scale_labels}
        
        shots_in_video = shots_with_scale.get_intervallist(video_id)
        
        video = Video.objects.get(id=video_id)
        
        for intrvl in shots_in_video.get_intervals():
            scale_proportions_for_video[intrvl.get_payload()].append(
                intrvl.get_end()-intrvl.get_start()
            )
        
        avg_shot_duration_for_video = float(video.num_frames) / shots_in_video.size()
        
        for label in shot_scale_labels:
            if len(scale_proportions_for_video[label]) > 0:
                scale_proportions[label].append(
                    np.mean(scale_proportions_for_video[label]) / avg_shot_duration_for_video
                )
        
        average_shot_duration_data += avg_shot_duration_for_video / video.fps
        total_videos += 1
    
    average_shot_duration_data /= total_videos
    
    normalized_proportions = [
        np.mean(scale_proportions[label]) * average_shot_duration_data
        for label in shot_scale_labels
    ]
    
    return normalized_proportions

In [None]:
def plot_duration_per_shot_scale(shot_scale_labels, shot_scale_distributions,
                                 distribution_labels, title):
    fig, ax = plt.subplots(figsize=(10, 5))
    x = range(0, len(shot_scale_labels))
    for distribution, label in zip(shot_scale_distributions, distribution_labels):
        ax.plot(x, distribution, label=label)
    ax.legend()
        
    ax.set_xlabel('Shot scale')
    ax.set_ylabel('Average duration (s)')
    ax.set_title(title)
    
    plt.xticks(x, shot_scale_labels)
    
    plt.show()

In [None]:
shot_lengths_per_bin_all_videos = compute_shot_scale_normalized_duration(shots_with_scale)

In [None]:
shot_lengths_per_bin_1915_to_1969 = compute_shot_scale_normalized_duration(
    VideoIntervalCollection(
        {
            video_id: shots_with_scale.get_intervallist(video_id)
            for video_id in list(shots_with_scale.get_allintervals().keys())
            if Video.objects.get(id=video_id).year <= 1969
        }
    )
)

In [None]:
shot_lengths_per_bin_1970_to_2016 = compute_shot_scale_normalized_duration(
    VideoIntervalCollection(
        {
            video_id: shots_with_scale.get_intervallist(video_id)
            for video_id in list(shots_with_scale.get_allintervals().keys())
            if Video.objects.get(id=video_id).year >= 1970
        }
    )
)

In [None]:
plot_duration_per_shot_scale(
    shot_scale_labels,
    [
        shot_lengths_per_bin_all_videos,
        shot_lengths_per_bin_1915_to_1969,
        shot_lengths_per_bin_1970_to_2016
    ],
    [
        'All videos',
        '1915-1969',
        '1970-2016',
    ], 'Shot duration vs. shot scale')

## Brightness

How has the brightness of movies changed over the years? What about over the course of individual movies?

### Changes in brightness over the years 

In [None]:
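# Load frames that have non-null brightness values from the database.
# Frame already exposes video_id, so only min_frame and max_frame are annotated
# (annotating a name that matches an existing field can raise an error in Django).
frames_qs = Frame.objects.filter(
    video__in=all_videos
).exclude(brightness__isnull=True).annotate(
    min_frame=F('number'),
    max_frame=F('number')
).all()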
num_frames = frames_qs.count()
brightness = VideoIntervalCollection.from_django_qs(
    frames_qs,
    with_payload=lambda frame: frame.brightness,
    progress=True,
    total=num_frames
)

In [None]:
def avg_brightness(intervallist):
    return np.mean([intrvl.payload for intrvl in intervallist.get_intervals()])

In [None]:
videos_with_avg_brightness = [
    (Video.objects.get(id=video_id), avg_brightness(brightness.get_intervallist(video_id)))
    for video_id in list(brightness.get_allintervals().keys())
]

In [None]:
def plot_avg_brightness_by_year(videos_with_avg_brightness, min_year=None):
    data = sorted([(v.year, brightness_value) for v, brightness_value in videos_with_avg_brightness])
    if min_year is not None:
        data = [d for d in data if d[0] >= min_year]

    x = [d[0] for d in data]
    y = [d[1] for d in data]
    
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y, s=3)
    ax.set_xlabel('Year')
    ax.set_ylabel('Average Brightness per Film (0-255 scale)')
    ax.set_title('Average brightness by year')
    
    ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
    plt.show()

In [None]:
plot_avg_brightness_by_year(videos_with_avg_brightness)

### Changes in brightness within movies (narrative structure)

In [None]:
def compute_brightness_per_bin(videos):
    bucket_proportions = [[] for i in range(0, 100)]
    
    # Keep track of total brightness of entire dataset to normalize later
    total_brightness = 0.
    total_videos = 0.
    for video in tqdm(videos):
        # Get all the frames with non-null brightness values
        frames = list(Frame.objects.filter(video_id=video.id).exclude(
            brightness__isnull=True).order_by('number').all())

        divider = video.num_frames / 100.

        # Sum of brightness and number of frames in each bin of this film
        bin_brightness = [0.] * 100
        num_frames_in_bin = [0.] * 100
        for frame in frames:
            if frame.number > video.num_frames:
                break
            # Compute the bin directly so that gaps in the frames cannot shift later bins
            bin_num = min(int(frame.number / divider), 99)
            bin_brightness[bin_num] += frame.brightness
            num_frames_in_bin[bin_num] += 1
        
        num_frames = sum(num_frames_in_bin)
        if num_frames == 0:
            continue
        
        # Update brightness of entire dataset
        film_brightness = sum(bin_brightness) / num_frames
        total_brightness += film_brightness
        total_videos += 1
        
        # Normalize this film's bin averages by its overall brightness, skipping
        # empty bins and leaving an all-black film (brightness 0) unnormalized
        for i in range(0, 100):
            if num_frames_in_bin[i] > 0:
                bin_avg = bin_brightness[i] / num_frames_in_bin[i]
                if film_brightness > 0:
                    bin_avg /= film_brightness
                bucket_proportions[i].append(bin_avg)
    
    # Rescale the averaged relative brightness by the mean film brightness
    normalized_brightness_per_bin = [
        np.mean(proportions) * (total_brightness / total_videos)
        for proportions in bucket_proportions
    ]
    
    return normalized_brightness_per_bin
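
The normalization in `compute_brightness_per_bin` can be illustrated on toy data. In this sketch each film's bin averages are divided by the film's own mean, the relative values are averaged across films, and the result is rescaled by the mean film brightness. The function `normalized_profile` and the numbers are hypothetical, and the sketch assumes equally sized bins so that a film's mean equals the mean of its bin averages.

```python
def normalized_profile(films):
    """Average brightness profile across films, normalized per film.

    films: one list of per-bin average brightness values per film.
    """
    num_bins = len(films[0])
    film_means = [sum(bins) / len(bins) for bins in films]
    dataset_mean = sum(film_means) / len(film_means)

    profile = []
    for i in range(num_bins):
        # Brightness of bin i relative to each film's own overall brightness
        relative = [bins[i] / film_mean
                    for bins, film_mean in zip(films, film_means)]
        profile.append(sum(relative) / len(relative) * dataset_mean)
    return profile


dark_film = [20., 40., 60.]        # overall mean 40
bright_film = [100., 200., 300.]   # overall mean 200
profile = normalized_profile([dark_film, bright_film])
```

Both toy films brighten by the same relative amount, so they contribute identical shapes even though the bright film's raw values are five times larger. Without the per-film normalization the bright film would dominate the averaged curve.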

In [None]:
def plot_brightness_per_bin(data, title, polynomial_degree=6):
    x = [(i + 0.5) / 100. for i in range(0, 100)]
    y = data
    
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y, s=3)
    ax.set_xlabel('Proportion of movie')
    ax.set_ylabel('Average Brightness (0-255 scale)')
    ax.set_title(title)
    ax.set_xlim(0, 1)
    
    data_ymax = np.max(y)
    graph_ymax = data_ymax + 20
    ax.set_ylim(top=graph_ymax)
    
    ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, polynomial_degree))(np.unique(x)))
    
    ax.axvline(x=0.25, color='k')
    ax.axvline(x=0.5, color='k')
    ax.axvline(x=0.75, color='k')
    # Line for credits, maybe?
    # ax.axvline(x=0.95, color='k')
    
    text_y = data_ymax + 10
    ax.text(x=0.1, y=text_y, s='Setup')
    ax.text(x=0.32, y=text_y, s='Complication')
    ax.text(x=0.57, y=text_y, s='Development')
    ax.text(x=0.85, y=text_y, s='Climax')
    
    plt.show()

In [None]:
brightness_all_videos = compute_brightness_per_bin(all_videos)

In [None]:
brightness_1915_to_1959 = compute_brightness_per_bin(
    Video.objects.filter(
        decode_errors=False,
        year__gte=1915,
        year__lte=1959
    ).order_by('id').all()
)

In [None]:
brightness_1960_to_1985 = compute_brightness_per_bin(
    Video.objects.filter(
        decode_errors=False,
        year__gte=1960,
        year__lte=1985
    ).order_by('id').all()
)

In [None]:
brightness_1986_to_2016 = compute_brightness_per_bin(
    Video.objects.filter(
        decode_errors=False,
        year__gte=1986,
        year__lte=2016
    ).order_by('id').all()
)

In [None]:
plot_brightness_per_bin(
    brightness_all_videos,
    'Average brightness over the course of our movies, full dataset',
    polynomial_degree=6
)
plot_brightness_per_bin(
    brightness_1915_to_1959,
    'Average brightness over the course of our movies, 1915-1959',
    polynomial_degree=1
)
plot_brightness_per_bin(
    brightness_1960_to_1985,
    'Average brightness over the course of our movies, 1960-1985',
    polynomial_degree=6
)
plot_brightness_per_bin(
    brightness_1986_to_2016,
    'Average brightness over the course of our movies, 1986-2016',
    polynomial_degree=6
)

## Saturation

How has the saturation of movies changed over the years? What about over the course of individual movies?

### Changes in saturation over the years

In [None]:
# Takes about four minutes to run!
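# Load frames that have non-null saturation values from the database.
# Frame already exposes video_id, so only min_frame and max_frame are annotated
# (annotating a name that matches an existing field can raise an error in Django).
frames_saturation_qs = Frame.objects.filter(
    video__in=all_videos
).exclude(saturation__isnull=True).annotate(
    min_frame=F('number'),
    max_frame=F('number')
).all()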
num_frames_saturation = frames_saturation_qs.count()
saturation = VideoIntervalCollection.from_django_qs(
    frames_saturation_qs,
    with_payload=lambda frame: frame.saturation,
    progress=True,
    total=num_frames_saturation
)

In [None]:
def avg_saturation(intervallist):
    return np.mean([intrvl.payload for intrvl in intervallist.get_intervals()])

In [None]:
videos_with_avg_saturation = [
    (Video.objects.get(id=video_id), avg_saturation(saturation.get_intervallist(video_id)))
    for video_id in list(saturation.get_allintervals().keys())
]

In [None]:
def plot_avg_saturation_by_year(videos_with_avg_saturation, min_year=None):
    data = sorted([(v.year, saturation_value) for v, saturation_value in videos_with_avg_saturation])
    if min_year is not None:
        data = [d for d in data if d[0] >= min_year]

    x = [d[0] for d in data]
    y = [d[1] for d in data]
    
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y, s=3)
    ax.set_xlabel('Year')
    ax.set_ylabel('Average Saturation per Film (0-255 scale)')
    ax.set_title('Average saturation by year')
    
    ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
    plt.show()

In [None]:
plot_avg_saturation_by_year(videos_with_avg_saturation)

In [None]:
# What are the movies with highest saturation?
sorted([
    (v.title, v.year, avg_saturation)
    for v, avg_saturation in videos_with_avg_saturation if avg_saturation > 150
], key=lambda tup: (tup[1], tup[0], tup[2]))

In [None]:
# What are the movies with lowest saturation?
sorted([
    (v.title, v.year, avg_saturation)
    for v, avg_saturation in videos_with_avg_saturation if avg_saturation < 10
], key=lambda tup: (tup[1], tup[0], tup[2]))

### Changes in saturation within movies (narrative structure?)

In [None]:
def compute_saturation_per_bin(videos):
    bucket_proportions = [[] for i in range(0, 100)]
    
    # Keep track of total saturation of entire dataset to normalize later
    total_saturation = 0.
    total_videos = 0.
    for video in tqdm(videos):
        # Get all the frames with non-null saturation values
        frames = list(Frame.objects.filter(video_id=video.id).exclude(
            saturation__isnull=True).order_by('number').all())

        divider = video.num_frames / 100.
        bin_num = 0

        # Keep track of total saturation to normalize this film later
        film_saturation = 0.
        num_frames = len(frames)
        
        if num_frames == 0:
            continue
        
        # Keep track of average saturation in each bin for this film only,
        # so that bins this film never reaches are not touched during normalization
        film_bin_means = {}
        bin_saturation = 0.
        num_frames_in_bin = 0
        for frame in frames:
            if frame.number > video.num_frames:
                break
            # Advance to the bin containing this frame, closing out finished bins
            while frame.number > (bin_num + 1.) * divider and bin_num < 99:
                if num_frames_in_bin > 0:
                    film_bin_means[bin_num] = bin_saturation / num_frames_in_bin
                bin_num += 1
                bin_saturation = 0.
                num_frames_in_bin = 0
            num_frames_in_bin += 1
            bin_saturation += frame.saturation
            film_saturation += frame.saturation
        if num_frames_in_bin > 0:
            film_bin_means[bin_num] = bin_saturation / num_frames_in_bin
        
        # Update saturation of entire dataset
        film_saturation = film_saturation / num_frames
        total_saturation += film_saturation
        total_videos += 1
        
        # Normalize this film's bin averages by its overall average saturation
        for i, bin_mean in film_bin_means.items():
            if film_saturation > 0:
                bin_mean /= film_saturation
            bucket_proportions[i].append(bin_mean)
    
    normalized_saturation_per_bin = [
        np.mean(proportions) * (total_saturation / total_videos)
        for proportions in bucket_proportions
    ]
    
    return normalized_saturation_per_bin
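
The normalization in `compute_saturation_per_bin` can be checked on synthetic data. The sketch below is a simplified, database-free illustration, assuming each film is just a list of per-frame saturation values; the names `bin_means` and `normalized_profile` are hypothetical and not part of this notebook. Each film's bin averages are divided by that film's own mean, averaged across films, and rescaled by the dataset-wide mean, so a uniformly vivid film does not dominate the shape of the curve.

```python
from statistics import mean

def bin_means(values, num_bins):
    """Average `values` within `num_bins` equal-width bins (empty bins are skipped)."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for i, value in enumerate(values):
        # Map position i in the film to a bin index in [0, num_bins)
        b = min(i * num_bins // len(values), num_bins - 1)
        sums[b] += value
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in range(num_bins) if counts[b] > 0}

def normalized_profile(films, num_bins):
    """Per-bin saturation, normalized per film, then rescaled to the dataset mean."""
    proportions = [[] for _ in range(num_bins)]
    film_means = []
    for values in films:
        if not values:
            continue
        film_mean = mean(values)
        film_means.append(film_mean)
        for b, bin_mean in bin_means(values, num_bins).items():
            # Express each bin relative to the film's own average
            proportions[b].append(bin_mean / film_mean if film_mean > 0 else 0.0)
    dataset_mean = mean(film_means)
    return [mean(p) * dataset_mean if p else float('nan') for p in proportions]

# Two toy "films" with the same shape (second half twice as saturated as the first)
# but very different overall saturation levels.
films = [[10, 10, 20, 20], [100, 100, 200, 200]]
profile = normalized_profile(films, num_bins=2)
print(profile)
```

Both toy films contribute the same relative shape, so the profile comes out at roughly 55 for the first half and 110 for the second, even though one film is ten times more saturated than the other.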

In [None]:
def plot_saturation_per_bin(data, title, polynomial_degree=6):
    x = [(i + 0.5) / 100. for i in range(0, 100)]
    y = data
    
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y, s=3)
    ax.set_xlabel('Proportion of movie')
    ax.set_ylabel('Average Saturation (0-255 scale)')
    ax.set_title(title)
    ax.set_xlim(0, 1)
    
    data_ymax = np.max(y)
    graph_ymax = data_ymax + 20
    ax.set_ylim(top=graph_ymax)
    
    ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, polynomial_degree))(np.unique(x)))
    
    ax.axvline(x=0.25, color='k')
    ax.axvline(x=0.5, color='k')
    ax.axvline(x=0.75, color='k')
    # Line for credits, maybe?
    # ax.axvline(x=0.95, color='k')
    
    text_y = data_ymax + 10
    ax.text(x=0.1, y=text_y, s='Setup')
    ax.text(x=0.32, y=text_y, s='Complication')
    ax.text(x=0.57, y=text_y, s='Development')
    ax.text(x=0.85, y=text_y, s='Climax')
    
    plt.show()

In [None]:
saturation_all_videos = compute_saturation_per_bin(all_videos)

In [None]:
saturation_1915_to_1959 = compute_saturation_per_bin(
    Video.objects.filter(
        decode_errors=False,
        year__gte=1915,
        year__lte=1959
    ).order_by('id').all()
)

In [None]:
saturation_1960_to_1985 = compute_saturation_per_bin(
    Video.objects.filter(
        decode_errors=False,
        year__gte=1960,
        year__lte=1985
    ).order_by('id').all()
)

In [None]:
saturation_1986_to_2016 = compute_saturation_per_bin(
    Video.objects.filter(
        decode_errors=False,
        year__gte=1986,
        year__lte=2016
    ).order_by('id').all()
)

In [None]:
plot_saturation_per_bin(
    saturation_all_videos,
    'Average saturation over the course of our movies, full dataset',
    polynomial_degree=6
)
plot_saturation_per_bin(
    saturation_1915_to_1959,
    'Average saturation over the course of our movies, 1915-1959',
    polynomial_degree=1
)
plot_saturation_per_bin(
    saturation_1960_to_1985,
    'Average saturation over the course of our movies, 1960-1985',
    polynomial_degree=6
)
plot_saturation_per_bin(
    saturation_1986_to_2016,
    'Average saturation over the course of our movies, 1986-2016',
    polynomial_degree=6
)

## Mean Number of People

### Mean number of people per frame over the years
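We can estimate the mean number of people per frame from either face detections or pose detections. The face-based count sums each detection's probability within a frame, while the pose-based count adds 1 for each detected pose.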
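
As a rough illustration of the idea, the sketch below counts detections per frame and averages the counts. It assumes detections arrive as plain `(frame_number, confidence)` tuples, which is a stand-in for the `Face` and `PoseMeta` rows used in this notebook; `people_per_frame` and `mean_people` are hypothetical helper names. Frames with no detections never appear in such a list, so the mean is taken over frames that contain at least one detection.

```python
from collections import defaultdict
from statistics import mean

def people_per_frame(detections, weight_by_confidence=False):
    """Sum detections per frame: either a plain count or a confidence-weighted count."""
    counts = defaultdict(float)
    for frame_number, confidence in detections:
        counts[frame_number] += confidence if weight_by_confidence else 1.0
    return dict(counts)

def mean_people(counts, cap=None):
    """Mean people per frame, optionally capping each frame's count at `cap`."""
    values = list(counts.values())
    if cap is not None:
        values = [min(cap, c) for c in values]
    return mean(values)

# Toy detections: frame 0 has one person, frame 12 has two, frame 24 has seven.
detections = [(0, 0.9), (12, 0.8), (12, 0.7)] + [(24, 0.5)] * 7
counts = people_per_frame(detections)
print(mean_people(counts))         # plain mean over the three frames
print(mean_people(counts, cap=5))  # the crowded frame is capped at 5
```

Capping limits the influence of a few crowd scenes on a film's average, which is the motivation for the truncated variants computed in this section.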

In [None]:
# This takes about five minutes to run!
faces_qs = Face.objects.filter(frame__video__in=all_videos).annotate(
    min_frame=F('frame__number'),
    max_frame=F('frame__number'),
    video_id=F('frame__video_id')
).all()
total_faces = faces_qs.count()
print(total_faces)
face_counts = VideoIntervalCollection.from_django_qs(
    faces_qs,
    with_payload=lambda row: row.probability,
    progress=True,
    total=total_faces
).coalesce(payload_merge_op=payload_plus)

In [None]:
# This takes about seven minutes to run!
pose_qs = PoseMeta.objects.filter(frame__video__in=all_videos).annotate(
    min_frame=F('frame__number'),
    max_frame=F('frame__number'),
    video_id=F('frame__video_id')
)
total_poses = pose_qs.count()
print(total_poses)
pose_counts = VideoIntervalCollection.from_django_qs(
    pose_qs,
    with_payload=lambda row: 1,
    progress=True,
    total=total_poses
).coalesce(payload_merge_op=payload_plus)

In [None]:
def avg_number_of_people(intervallist, truncate=False):
    people_per_frame = [
        intrvl.payload if not truncate else min(5, intrvl.payload)
        for intrvl in intervallist.get_intervals()
    ]
    return np.mean(people_per_frame)

In [None]:
videos_with_avg_people_per_frame_faces = [
    (Video.objects.get(id=video_id), avg_number_of_people(face_counts.get_intervallist(video_id)))
    for video_id in tqdm(list(face_counts.get_allintervals().keys()))
]

In [None]:
videos_with_avg_people_per_frame_poses = [
    (Video.objects.get(id=video_id), avg_number_of_people(pose_counts.get_intervallist(video_id)))
    for video_id in tqdm(list(pose_counts.get_allintervals().keys()))
]

In [None]:
videos_with_avg_people_per_frame_faces_truncated = [
    (Video.objects.get(id=video_id), avg_number_of_people(face_counts.get_intervallist(video_id), truncate=True))
    for video_id in tqdm(list(face_counts.get_allintervals().keys()))
]

In [None]:
videos_with_avg_people_per_frame_poses_truncated = [
    (Video.objects.get(id=video_id), avg_number_of_people(pose_counts.get_intervallist(video_id), truncate=True))
    for video_id in tqdm(list(pose_counts.get_allintervals().keys()))
]

In [None]:
def plot_avg_people_per_frame_by_year(videos_with_avg_people_per_frame, title, min_year=None):
    data = sorted([(v.year, people_per_frame) for v, people_per_frame in videos_with_avg_people_per_frame])
    if min_year is not None:
        data = [d for d in data if d[0] >= min_year]

    x = [d[0] for d in data]
    y = [d[1] for d in data]
    
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(x, y, s=3)
    ax.set_xlabel('Year')
    ax.set_ylabel('Average People per Frame')
    ax.set_title(title)
    
    #ax.set_yscale('symlog')
    
    ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
    plt.show()

In [None]:
plot_avg_people_per_frame_by_year(videos_with_avg_people_per_frame_faces,
                                  'Average people per frame by year (from face detections)')

In [None]:
plot_avg_people_per_frame_by_year(videos_with_avg_people_per_frame_faces_truncated,
                                  'Average people per frame by year (from face detections, counts capped at 5)')

In [None]:
plot_avg_people_per_frame_by_year(videos_with_avg_people_per_frame_poses,
                                  'Average people per frame by year (from pose detections)')

In [None]:
plot_avg_people_per_frame_by_year(videos_with_avg_people_per_frame_poses_truncated,
                                  'Average people per frame by year (from pose detections, counts capped at 5)')

### Distribution of number of people per frame

Let's look at the distribution of the number of people per frame: what proportion of frames contain 0 people, 1 person, 2 people, and so on up to 5 or more?

In [None]:
def distribution_of_people_per_frame(person_counts):
    bucket_proportions = [0. for i in range(0, 6)]
    total_videos = 0.
    for video_id in list(person_counts.get_allintervals().keys()):
        bucket_counts = [0. for i in range(0, 6)]
        total_frames = 0.
        for intrvl in person_counts.get_intervallist(video_id).get_intervals():
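            # Only sample frames whose number is a multiple of 12
            if intrvl.get_start() % 12 != 0:
                continue
            count = min(round(intrvl.payload), 5)
            bucket_counts[count] += 1
            total_frames += 1
        if total_frames == 0:
            # No sampled frames for this video; skip it to avoid dividing by zero
            continue
        for i in range(0, 6):
            bucket_proportions[i] += bucket_counts[i] / total_frames
        total_videos += 1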
    return [i / total_videos for i in bucket_proportions]
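
To make the bucketing concrete, here is a minimal sketch on plain lists, assuming each video is represented by a list of per-frame people counts (the function name `people_distribution` is hypothetical). Proportions are computed per video first and then averaged, so long films do not outweigh short ones.

```python
def people_distribution(videos, max_bucket=5):
    """Average, across videos, of each video's proportion of frames per people count."""
    totals = [0.0] * (max_bucket + 1)
    num_videos = 0
    for frame_counts in videos:
        if not frame_counts:
            continue  # skip videos with no sampled frames
        buckets = [0] * (max_bucket + 1)
        for count in frame_counts:
            # Counts at or above max_bucket share the final "5+" bucket
            buckets[min(round(count), max_bucket)] += 1
        for i, n in enumerate(buckets):
            totals[i] += n / len(frame_counts)
        num_videos += 1
    return [t / num_videos for t in totals]

# A short two-frame video and a longer four-frame video get equal weight.
videos = [[1, 2], [1, 1, 1, 9]]
distribution = people_distribution(videos)
print(distribution)
```

Averaging per-video proportions, rather than pooling all frames, is a design choice: pooling would let the four-frame video count twice as much as the two-frame one.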

In [None]:
def graph_person_count_distribution(person_distribution_list, labels, title):
    fig, ax = plt.subplots(figsize=(5, 5))
    for distribution, label in zip(person_distribution_list, labels):
        x = ['0', '1', '2', '3', '4', '5+']
        y = distribution
        ax.plot(x, y, label=label)
        
        ax.legend()
        
    ax.set_xlabel('Number of people in frame')
    ax.set_ylabel('Proportion of frames')
    ax.set_title(title)
    
    plt.show()

In [None]:
person_distribution_all_videos = distribution_of_people_per_frame(face_counts)

In [None]:
person_distribution_1915_to_1969 = distribution_of_people_per_frame(
    VideoIntervalCollection(
        {
            video_id: face_counts.get_intervallist(video_id)
            for video_id in list(face_counts.get_allintervals().keys())
            if Video.objects.get(id=video_id).year <= 1969
        }
    )
)

In [None]:
person_distribution_1970_to_2016 = distribution_of_people_per_frame(
    VideoIntervalCollection(
        {
            video_id: face_counts.get_intervallist(video_id)
            for video_id in list(face_counts.get_allintervals().keys())
            if Video.objects.get(id=video_id).year >= 1970
        }
    )
)

In [None]:
person_distribution_all_videos_poses = distribution_of_people_per_frame(pose_counts)

In [None]:
person_distribution_1915_to_1969_poses = distribution_of_people_per_frame(
    VideoIntervalCollection(
        {
            video_id: pose_counts.get_intervallist(video_id)
            for video_id in list(pose_counts.get_allintervals().keys())
            if Video.objects.get(id=video_id).year <= 1969
        }
    )
)

In [None]:
person_distribution_1970_to_2016_poses = distribution_of_people_per_frame(
    VideoIntervalCollection(
        {
            video_id: pose_counts.get_intervallist(video_id)
            for video_id in list(pose_counts.get_allintervals().keys())
            if Video.objects.get(id=video_id).year >= 1970
        }
    )
)

In [None]:
graph_person_count_distribution(
    [
        person_distribution_all_videos,
        person_distribution_1915_to_1969,
        person_distribution_1970_to_2016
    ],
    [
        'All videos',
        '1915-1969',
        '1970-2016'
    ],
    'Distribution of number of people in frames (from faces)'
)

In [None]:
graph_person_count_distribution(
    [
        person_distribution_all_videos_poses,
        person_distribution_1915_to_1969_poses,
        person_distribution_1970_to_2016_poses
    ],
    [
        'All videos',
        '1915-1969',
        '1970-2016'
    ],
    'Distribution of number of people in frames (from poses)'
)

## Number of people in a shot vs shot duration

In [None]:
# First, load all the shots into Rekall
shots_qs = Shot.objects.filter(
    video__in=all_videos,
    labeler=Labeler.objects.get(name='shot-hsvhist-face')
).all()
num_shots = shots_qs.count()
shots = VideoIntervalCollection.from_django_qs(
    shots_qs,
    progress=True,
    total=num_shots
)

In [None]:
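# Next, overlap the shots with per-frame counts of people so that the payload
#   for each shot is a list of all the per-frame people counts in the shot
# Then take the max of the counts, and union with zero-count copies of every
#   shot so that shots without any detected people get a count of 0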

shots_with_pose_counts = shots.merge(
    pose_counts,
    payload_merge_op=lambda shot_id, count: [count],
    predicate=overlaps(),
    working_window=1
).coalesce(
    payload_merge_op=payload_plus
).map(
    lambda shot_interval: (shot_interval.get_start(), shot_interval.get_end(),
                          max(shot_interval.get_payload()))
).set_union(
    shots.map(lambda intrvl: (intrvl.get_start(), intrvl.get_end(), 0))
).coalesce(
    payload_merge_op=lambda p1, p2: max(p1, p2)
)
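
The pipeline above can be hard to follow at a glance. As a plain-Python sketch of the same idea (not the Rekall implementation), assume shots are `(start, end)` frame ranges and per-frame people counts are a dict keyed by frame number; `max_people_per_shot` is a hypothetical name, and treating both endpoints as inclusive is an assumption. Each shot is labeled with the largest count seen on any of its frames, defaulting to 0 when no frame in the shot has a detection.

```python
def max_people_per_shot(shots, frame_counts):
    """Label each (start, end) shot with the max per-frame count inside it (0 if none)."""
    labeled = []
    for start, end in shots:
        # Collect counts for frames that fall inside the shot (endpoints inclusive)
        inside = [c for frame, c in frame_counts.items() if start <= frame <= end]
        labeled.append((start, end, max(inside, default=0)))
    return labeled

shots = [(0, 23), (24, 47), (48, 71)]
frame_counts = {0: 1, 12: 3, 24: 2}  # no detections in the third shot
labeled_shots = max_people_per_shot(shots, frame_counts)
print(labeled_shots)
```

The `default=0` plays the same role as the union with zero-payload shots: a shot with no detections is kept and assigned zero people rather than dropped.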

In [None]:
# Compute normalized shot durations for number of people
def compute_pose_count_normalized_duration(shots_with_pose_counts):
    bucket_proportions = [[] for i in range(0, 6)]
    
    average_shot_duration_data = 0.
    total_videos = 0.
    
    for video_id in shots_with_pose_counts.get_allintervals():
        bucket_proportions_for_video = [[] for i in range(0, 6)]
        
        shots_in_video = shots_with_pose_counts.get_intervallist(video_id)
        
        video = Video.objects.get(id=video_id)
        
        for intrvl in shots_in_video.get_intervals():
            count = min(round(intrvl.payload), 5)
            
            bucket_proportions_for_video[count].append(
                intrvl.get_end()-intrvl.get_start()
            )
        
        avg_shot_duration_for_video = float(video.num_frames) / shots_in_video.size()
        
        for i in range(6):
            if len(bucket_proportions_for_video[i]) > 0:
                bucket_proportions[i].append(
                    np.mean(bucket_proportions_for_video[i]) / avg_shot_duration_for_video
                )
        
        average_shot_duration_data += avg_shot_duration_for_video / video.fps
        total_videos += 1
    
    average_shot_duration_data /= total_videos
    
    normalized_proportions = [
        np.mean(bucket_proportions[i]) * average_shot_duration_data
        for i in range(6)
    ]
    
    return normalized_proportions
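
The normalization above can be illustrated on toy data. This sketch assumes each video is a list of `(duration_seconds, people_count)` shots, which is a simplification: the notebook works in frames and converts with `video.fps`. Each bucket's mean duration is divided by the video's own average shot duration, averaged across videos, and then rescaled by the dataset-wide average. The name `normalized_duration_by_count` is hypothetical.

```python
from statistics import mean

def normalized_duration_by_count(videos, max_bucket=5):
    """Average shot duration per people-count bucket, normalized per video."""
    ratios = [[] for _ in range(max_bucket + 1)]
    video_averages = []
    for shots in videos:
        if not shots:
            continue
        video_avg = mean([duration for duration, _ in shots])
        video_averages.append(video_avg)
        per_bucket = [[] for _ in range(max_bucket + 1)]
        for duration, count in shots:
            per_bucket[min(round(count), max_bucket)].append(duration)
        for i, durations in enumerate(per_bucket):
            if durations:
                # Ratio > 1 means shots in this bucket run longer than the video's average
                ratios[i].append(mean(durations) / video_avg)
    dataset_avg = mean(video_averages)
    # Buckets that no video contributed to are reported as NaN rather than 0
    return [mean(r) * dataset_avg if r else float('nan') for r in ratios]

# A fast-cut video and a slow-cut video; in both, two-person shots last twice as long.
videos = [
    [(2.0, 1), (4.0, 2)],
    [(10.0, 1), (20.0, 2)],
]
durations_by_count = normalized_duration_by_count(videos)
print(durations_by_count[1], durations_by_count[2])
```

Because each video is normalized by its own average, the slow-cut video does not swamp the fast-cut one: both report the same 1:2 ratio, and the result is rescaled to about 6 and 12 seconds using the dataset-wide average of 9 seconds.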

In [None]:
def graph_pose_count_duration(count_duration_list, labels, title):
    fig, ax = plt.subplots(figsize=(5, 5))
    for distribution, label in zip(count_duration_list, labels):
        x = ['0', '1', '2', '3', '4', '5+']
        y = distribution
        ax.plot(x, y, label=label)
        
        ax.legend()
        
    ax.set_xlabel('Number of people in shot')
    ax.set_ylabel('Average shot duration (seconds)')
    ax.set_title(title)
    
    plt.show()

In [None]:
count_duration_all_videos = compute_pose_count_normalized_duration(shots_with_pose_counts)

In [None]:
count_duration_1915_to_1969_poses = compute_pose_count_normalized_duration(
    VideoIntervalCollection(
        {
            video_id: shots_with_pose_counts.get_intervallist(video_id)
            for video_id in list(shots_with_pose_counts.get_allintervals().keys())
            if Video.objects.get(id=video_id).year <= 1969
        }
    )
)

In [None]:
count_duration_1970_to_2016_poses = compute_pose_count_normalized_duration(
    VideoIntervalCollection(
        {
            video_id: shots_with_pose_counts.get_intervallist(video_id)
            for video_id in list(shots_with_pose_counts.get_allintervals().keys())
            if Video.objects.get(id=video_id).year > 1969
        }
    )
)

In [None]:
graph_pose_count_duration(
    [
        count_duration_all_videos,
        count_duration_1915_to_1969_poses,
        count_duration_1970_to_2016_poses
    ],
    [
        'All videos',
        '1915-1969',
        '1970-2016'
    ],
    'Normalized average shot duration vs. number of people in the shot')