<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Gender-Analysis" data-toc-modified-id="Gender-Analysis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Gender Analysis</a></span><ul class="toc-item"><li><span><a href="#Computing-Gender-Screen-Time" data-toc-modified-id="Computing-Gender-Screen-Time-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Computing Gender Screen Time</a></span></li><li><span><a href="#Gender-Screen-Time-in-Our-Dataset" data-toc-modified-id="Gender-Screen-Time-in-Our-Dataset-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Gender Screen Time in Our Dataset</a></span></li><li><span><a href="#Changes-in-gender-representation-over-time" data-toc-modified-id="Changes-in-gender-representation-over-time-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Changes in gender representation over time</a></span></li><li><span><a href="#Gender-Representation-by-Genre" data-toc-modified-id="Gender-Representation-by-Genre-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Gender Representation by Genre</a></span></li><li><span><a href="#Distribution-of-movies-by-year" data-toc-modified-id="Distribution-of-movies-by-year-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Distribution of movies by year</a></span></li><li><span><a href="#Shot-Duration" data-toc-modified-id="Shot-Duration-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Shot Duration</a></span><ul class="toc-item"><li><span><a href="#Average-Shot-Duration-Over-the-Years" data-toc-modified-id="Average-Shot-Duration-Over-the-Years-1.6.1"><span class="toc-item-num">1.6.1&nbsp;&nbsp;</span>Average Shot Duration Over the Years</a></span></li><li><span><a href="#Changes-in-Shot-Duration-Within-Movies-(Narrative-Structure)" data-toc-modified-id="Changes-in-Shot-Duration-Within-Movies-(Narrative-Structure)-1.6.2"><span class="toc-item-num">1.6.2&nbsp;&nbsp;</span>Changes in Shot Duration Within Movies (Narrative 
Structure)</a></span></li></ul></li><li><span><a href="#Shot-Scale" data-toc-modified-id="Shot-Scale-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Shot Scale</a></span><ul class="toc-item"><li><span><a href="#Changes-in-shot-scale-over-time" data-toc-modified-id="Changes-in-shot-scale-over-time-1.7.1"><span class="toc-item-num">1.7.1&nbsp;&nbsp;</span>Changes in shot scale over time</a></span></li><li><span><a href="#More-Changes-in-Shot-Scale-over-time" data-toc-modified-id="More-Changes-in-Shot-Scale-over-time-1.7.2"><span class="toc-item-num">1.7.2&nbsp;&nbsp;</span>More Changes in Shot Scale over time</a></span></li><li><span><a href="#Relationship-between-shot-duration-and-shot-scale" data-toc-modified-id="Relationship-between-shot-duration-and-shot-scale-1.7.3"><span class="toc-item-num">1.7.3&nbsp;&nbsp;</span>Relationship between shot duration and shot scale</a></span></li></ul></li><li><span><a href="#Brightness" data-toc-modified-id="Brightness-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Brightness</a></span><ul class="toc-item"><li><span><a href="#Changes-in-brightness-over-the-years" data-toc-modified-id="Changes-in-brightness-over-the-years-1.8.1"><span class="toc-item-num">1.8.1&nbsp;&nbsp;</span>Changes in brightness over the years</a></span></li><li><span><a href="#Changes-in-brightness-within-movies-(narrative-structure)" data-toc-modified-id="Changes-in-brightness-within-movies-(narrative-structure)-1.8.2"><span class="toc-item-num">1.8.2&nbsp;&nbsp;</span>Changes in brightness within movies (narrative structure)</a></span></li></ul></li><li><span><a href="#Saturation" data-toc-modified-id="Saturation-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Saturation</a></span><ul class="toc-item"><li><span><a href="#Changes-in-saturation-over-the-years" data-toc-modified-id="Changes-in-saturation-over-the-years-1.9.1"><span class="toc-item-num">1.9.1&nbsp;&nbsp;</span>Changes in saturation over the 
years</a></span></li><li><span><a href="#Changes-in-saturation-within-movies-(narrative-structure?)" data-toc-modified-id="Changes-in-saturation-within-movies-(narrative-structure?)-1.9.2"><span class="toc-item-num">1.9.2&nbsp;&nbsp;</span>Changes in saturation within movies (narrative structure?)</a></span></li></ul></li><li><span><a href="#Mean-Number-of-People" data-toc-modified-id="Mean-Number-of-People-1.10"><span class="toc-item-num">1.10&nbsp;&nbsp;</span>Mean Number of People</a></span><ul class="toc-item"><li><span><a href="#Mean-number-of-people-per-frame-over-the-years" data-toc-modified-id="Mean-number-of-people-per-frame-over-the-years-1.10.1"><span class="toc-item-num">1.10.1&nbsp;&nbsp;</span>Mean number of people per frame over the years</a></span></li><li><span><a href="#Distribution-of-number-of-people-per-frame" data-toc-modified-id="Distribution-of-number-of-people-per-frame-1.10.2"><span class="toc-item-num">1.10.2&nbsp;&nbsp;</span>Distribution of number of people per frame</a></span></li></ul></li></ul></li></ul></div>

In [None]:
from query.models import Video, Shot, Labeler, Face, FaceGender, Genre
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
import numpy as np
from django.db.models import Avg, F, Sum
from tqdm import tqdm
import rekall
from rekall.video_interval_collection import VideoIntervalCollection
from rekall.interval_list import IntervalList
from rekall.merge_ops import payload_plus
from rekall.parsers import in_array, merge_dict_parsers, dict_payload_parser
from rekall.temporal_predicates import overlaps
from rekall.payload_predicates import payload_satisfies

# Gender Analysis

In this notebook we analyze gender screen time ratios (male vs. female screen time) across the film dataset. All animated films are excluded.

## Computing Gender Screen Time
Faces are computed at 2 FPS and at every microshot boundary. To avoid edge cases at the microshot boundaries, we compute statistics only on the 2 FPS samples.
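
The per-film ratio computed by `compute_gender_ratio_for_video` in this notebook weights each detected face by the product of its gender probability and its face probability. The sketch below illustrates that weighting on made-up detections, without Rekall or the database. The name `weighted_gender_ratio`, the tuple layout, and the `None` return for films with no usable faces are assumptions made for illustration only.

```python
# Toy detections for one film: (gender label, gender probability, face probability).
# These values are invented for illustration and do not come from the dataset.
detections = [
    ('M', 0.9, 1.0),
    ('M', 0.8, 0.5),
    ('F', 0.7, 1.0),
]

def weighted_gender_ratio(detections):
    """Return (male_ratio, female_ratio), weighting each face by both confidences."""
    male = sum(gp * fp for g, gp, fp in detections if g == 'M')
    female = sum(gp * fp for g, gp, fp in detections if g == 'F')
    total = male + female
    if total == 0:
        return None  # no usable faces, so a ratio is undefined (hypothetical convention)
    return male / total, female / total

male_ratio, female_ratio = weighted_gender_ratio(detections)
print(male_ratio, female_ratio)
```

Because both ratios share the same denominator, they always sum to 1 for a film with at least one usable face.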

In [None]:
all_videos = Video.objects.filter(decode_errors=False).exclude(genres__name='animation').order_by('id').all()

In [None]:
# Takes about five and a half minutes to run!
# Load all FaceGender data into Rekall. faces_with_gender has one interval per face.
facegender_qs = FaceGender.objects.filter(
    face__frame__video__in=all_videos
).annotate(
    min_frame=F('face__frame__number'),
    max_frame=F('face__frame__number'),
    video_id=F('face__frame__video_id'),
    gender_name=F('gender__name'),
    face_probability=F('face__probability')
).all()

total_faces = facegender_qs.count()

faces_with_gender = VideoIntervalCollection.from_django_qs(
    facegender_qs,
    with_payload=merge_dict_parsers([
        dict_payload_parser(VideoIntervalCollection.django_accessor, { 'gender': 'gender_name' }),
        dict_payload_parser(VideoIntervalCollection.django_accessor, { 'gender_probability': 'probability' }),
        dict_payload_parser(VideoIntervalCollection.django_accessor, { 'face_probability': 'face_probability' })
    ]),
    progress=True,
    total=total_faces
)

In [None]:
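def compute_gender_ratio_for_video(intervallist):
    """Return (male_ratio, female_ratio) for one video's face intervals.

    Each face contributes gender_probability * face_probability to the
    total for its gender, so low-confidence detections count for less.
    """
    # Confidence-weighted sum over all faces labeled male
    male_time = intervallist.filter(
        payload_satisfies(lambda payload: payload['gender'] == 'M')
    ).fold(lambda acc, intrvl: (acc + 
                                intrvl.get_payload()['gender_probability'] * 
                                intrvl.get_payload()['face_probability']), 0.)
    # Confidence-weighted sum over all faces labeled female
    female_time = intervallist.filter(
        payload_satisfies(lambda payload: payload['gender'] == 'F')
    ).fold(lambda acc, intrvl: (acc + 
                                intrvl.get_payload()['gender_probability'] * 
                                intrvl.get_payload()['face_probability']), 0.)
    
    # Normalize so that the two ratios sum to 1
    return male_time / (male_time + female_time), female_time / (male_time + female_time)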

In [None]:
videos_with_gender_ratios = [
    (video, compute_gender_ratio_for_video(faces_with_gender.get_intervallist(video.id)))
    for video in tqdm(all_videos)
]

In [None]:
# As a sanity check, remove films classified as fantasy (old men with long hair) or family (young kids)
videos_no_family_fantasy = Video.objects.filter(decode_errors=False).exclude(
    genres__name__in=['animation', 'family', 'fantasy']
)
videos_with_gender_ratios_no_family_fantasy = [
    (video, compute_gender_ratio_for_video(faces_with_gender.get_intervallist(video.id)))
    for video in tqdm(videos_no_family_fantasy)
]

## Gender Screen Time in Our Dataset
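One detail worth checking when histogramming ratios that lie in [0, 1]: with explicit bin edges, any value beyond the last edge is silently dropped. The sketch below shows this with `np.histogram`, which matplotlib's `hist` uses for binning. The ratio values are toy numbers, not results from the dataset.

```python
import numpy as np

# Toy ratios (invented); two of them lie above 0.99.
ratios = np.array([0.0, 0.25, 0.5, 0.995, 1.0])

edges_short = np.arange(100) * 0.01       # edges 0.00 ... 0.99
edges_full = np.linspace(0.0, 1.0, 101)   # edges 0.00 ... 1.00

counts_short, _ = np.histogram(ratios, bins=edges_short)
counts_full, _ = np.histogram(ratios, bins=edges_full)

print(counts_short.sum())  # values above 0.99 are not counted
print(counts_full.sum())   # every value falls into some bin
```

The last bin is closed on the right, so a ratio of exactly 1.0 is counted when the edges extend to 1.0.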

In [None]:
# Let's plot histograms of male and female screen time
def plot_male_female_screen_time_histograms(videos_with_gender_ratios):
    male_screen_time = sorted([m for v, (m, f) in videos_with_gender_ratios])
    female_screen_time = sorted([f for v, (m, f) in videos_with_gender_ratios])
    
    fig, ax = plt.subplots(figsize=(10, 5))
    # Bin edges run from 0 through 1 inclusive so that ratios in (0.99, 1.0] are counted
    ax.hist([male_screen_time, female_screen_time], np.linspace(0, 1, 101), histtype='bar',
            label=['Male Screen Time', 'Female Screen Time'], color=['y', 'g'])
    ax.legend()
    ax.set_title('Histogram of male and female screen time')
    plt.show()

In [None]:
plot_male_female_screen_time_histograms(videos_with_gender_ratios)

In [None]:
plot_male_female_screen_time_histograms(videos_with_gender_ratios_no_family_fantasy)

In [None]:
# What about overall male/female screen time?
def plot_average_male_female_screen_time(videos_with_gender_ratios):
    male_screen_time = [np.mean([m for v, (m, f) in videos_with_gender_ratios])]
    female_screen_time = [np.mean([f for v, (m, f) in videos_with_gender_ratios])]
    
    names = ['Average Screen Time']
    N = len(names)

    ax = plt.gca()
    
    width = 0.35
    ind = np.arange(N)
    p1 = ax.bar(ind, male_screen_time, width, color='y')
    p2 = ax.bar(ind + width, female_screen_time, width, color='g')
    
    ax.set_title('Average Male/Female Screen Time')
    ax.set_xticks(ind + width / 2)
    ax.set_xticklabels(names)
    ax.set_ylabel('Average Screen Time')
    ax.set_ylim((0, 1))
    
    ax.legend((p1[0], p2[0]), ('Male Screen Time', 'Female Screen Time'))
    
    def autolabel(rects):
        """
        Attach a text label above each bar displaying its height
        """
        for rect in rects:
            height = rect.get_height()
            ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                    '%f' % height,
                    ha='center', va='bottom')

    autolabel(p1)
    autolabel(p2)
    
    plt.show()

In [None]:
plot_average_male_female_screen_time(videos_with_gender_ratios)

In [None]:
plot_average_male_female_screen_time(videos_with_gender_ratios_no_family_fantasy)

## Changes in gender representation over time
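The plots in this section overlay a straight trend line obtained from a degree-1 `np.polyfit`. As a minimal sketch of what that fit returns, the example below uses invented yearly values (not taken from the dataset) that lie exactly on a line.

```python
import numpy as np

# Made-up data: the female ratio rises by 0.02 per decade.
years = np.array([1950, 1960, 1970, 1980, 1990])
female_ratio = np.array([0.20, 0.22, 0.24, 0.26, 0.28])

# Degree-1 least-squares fit; coefficients are returned highest power first.
slope, intercept = np.polyfit(years, female_ratio, 1)

# np.poly1d turns the coefficients into a callable polynomial.
trend = np.poly1d([slope, intercept])
predicted_2000 = trend(2000)

print(slope, predicted_2000)
```

The slope is the estimated change in ratio per year, which is often easier to interpret than the plotted line itself.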

In [None]:
# Plot male ratios by year
def plot_male_gender_ratios_by_year(videos_with_gender_ratios, min_year=None):
    data = sorted([(v.year, male_ratio) for v, (male_ratio, female_ratio) in videos_with_gender_ratios])
    if min_year is not None:
        data = [d for d in data if d[0] >= min_year]

    x = [d[0] for d in data]
    y = [d[1] for d in data]
    
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y, s=3, color='y')
    ax.set_ylim(0, 1)
    ax.set_xlabel('Year')
    ax.set_ylabel('Male Screen Time Percentage')
    ax.set_title('Male Screen Time Percentage Over Time')
    
    #ax.set_yscale('symlog')
    
#     ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 3))(np.unique(x)))
    ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color='y')
    plt.show()
    
# Plot female ratios by year
def plot_female_gender_ratios_by_year(videos_with_gender_ratios, min_year=None):
    data = sorted([(v.year, female_ratio) for v, (male_ratio, female_ratio) in videos_with_gender_ratios])
    if min_year is not None:
        data = [d for d in data if d[0] >= min_year]

    x = [d[0] for d in data]
    y = [d[1] for d in data]
    
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y, s=3, color='g')
    ax.set_ylim(0, 1)
    ax.set_xlabel('Year')
    ax.set_ylabel('Female Screen Time Percentage')
    ax.set_title('Female Screen Time Percentage Over Time')
    
    #ax.set_yscale('symlog')
    
#     ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 3))(np.unique(x)))
    ax.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color='g')
    plt.show()

In [None]:
plot_male_gender_ratios_by_year(videos_with_gender_ratios)
plot_female_gender_ratios_by_year(videos_with_gender_ratios)

In [None]:
plot_male_gender_ratios_by_year(videos_with_gender_ratios_no_family_fantasy)
plot_female_gender_ratios_by_year(videos_with_gender_ratios_no_family_fantasy)

In [None]:
# What videos have exceptionally high male screen times (low female screen times)?
sorted([
    (v.title, v.year, male_screen_time)
    for v, (male_screen_time, female_screen_time) in videos_with_gender_ratios if male_screen_time > .85
], key=lambda tup: (tup[1], tup[0], tup[2]))

In [None]:
# What videos have exceptionally low male screen times (high female screen times)?
sorted([
    (v.title, v.year, male_screen_time)
    for v, (male_screen_time, female_screen_time) in videos_with_gender_ratios if male_screen_time < .25
], key=lambda tup: (tup[1], tup[0], tup[2]))

## Gender Representation by Genre

We have 22 different genres. How does gender representation differ for each?
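
The per-genre figures below are plain averages of the per-film ratios, and a film tagged with several genres contributes to each of them. A small sketch of that grouping on invented films (the dictionary layout and the values are assumptions for illustration, not the notebook's data model):

```python
from collections import defaultdict

# Invented films, each with its genre tags and a male screen time ratio.
films = [
    {'genres': ['drama', 'romance'], 'male': 0.55},
    {'genres': ['drama'], 'male': 0.65},
    {'genres': ['war'], 'male': 0.90},
]

# Collect every film's ratio under each of its genres.
by_genre = defaultdict(list)
for film in films:
    for genre in film['genres']:
        by_genre[genre].append(film['male'])

# Average per genre; the female ratio is 1 minus the male ratio for each film.
mean_male_by_genre = {g: sum(v) / len(v) for g, v in by_genre.items()}
print(mean_male_by_genre)
```

Since films are averaged without weighting, genres with very few films can produce noisy means, which is one reason to look at the per-genre counts first.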

In [None]:
# First, let's see how many films are in each genre
genres_and_counts = sorted([
    (g.name, Video.objects.filter(genres=g).count())
    for g in Genre.objects.all()
], key=lambda g_and_c: g_and_c[1])

In [None]:
genres_and_counts

In [None]:
# Let's exclude animation, short, and documentary
genres = Genre.objects.exclude(name__in=['animation', 'short', 'documentary']).all()

In [None]:
def plot_gender_screen_time_by_genre(videos_with_gender_ratios, genres, title):
    data = []
    for genre in genres:
        # Use a set of ids for fast membership tests below
        videos_in_genre = set(Video.objects.filter(genres=genre).values_list('id', flat=True))
        male_screen_time = np.mean([m for v, (m, f) in videos_with_gender_ratios 
                                     if v.id in videos_in_genre])
        female_screen_time = np.mean([f for v, (m, f) in videos_with_gender_ratios 
                                     if v.id in videos_in_genre])
        data.append((male_screen_time, female_screen_time, genre.name))
    
    data = sorted(data)
    
    male_screen_times = [m for m, _, _ in data]
    female_screen_times = [f for _, f, _ in data]
    genres = [genre for _, _, genre in data]
    N = len(genres)

    fig, ax = plt.subplots(figsize=(10, 10))
    
    height = 0.35
    ind = np.arange(N)
    p1 = ax.barh(ind + height, male_screen_times, height, color='y')
    p2 = ax.barh(ind, female_screen_times, height, color='g')
    
    ax.set_title(title)
    ax.set_yticks(ind + height / 2)
    ax.set_yticklabels(genres)
    ax.set_xlim((0, 1))
    
    ax.legend((p1[0], p2[0]), ('Male Screen Time', 'Female Screen Time'), loc=4)
    
    def autolabel(rects):
        """
        Attach a text label to the right of each bar displaying its width
        """
        for rect in rects:
            width = rect.get_width()
            ax.text(width + .01, rect.get_y()-.2,
                    '%f Male' % width,
                    ha='left', va='bottom')

    autolabel(p1)
#     autolabel(p2)
    
    plt.show()

In [None]:
plot_gender_screen_time_by_genre(videos_with_gender_ratios, genres, 'Male/female screen time by genre')