## Question 3 - 4283724 Nadine Kanbier

##### Use the subsets of movie trailers from 1920-1940, 1960-1980 and 2000-2020 from exercise 6.1, but instead of comparing the shot types and shot lengths, use one of the (pre-trained) image feature extraction methods we discussed in exercise 5.2 to compare the subsets, and explain your choice.
##### [Note: you are not allowed to use the same method you used in Q2. So if you analyzed gender in Q2, you have to pick something else for this question.]
##### Make a plan for tackling the dimensionality of the data: each subset consists of multiple videos, each video consists of multiple frames/seconds/shots, and each frame/second/shot could contain multiple faces/genders/emotions/objects/texts/colors. How are you going to compare the subsets? Explain the choices you make carefully. Then implement your plan and interpret the results.

##### Your answer must consist of the following:
##### • Explanation of choice for features, plan for tackling the dimensionality (ca. 350 words)
##### • The complete code to answer the question with a short comment for every step (max. 2 sentences per step)
##### • Interpretation and conclusion (ca. 200 words)

### Plan
It is commonly thought that ‘older’ movies are in black and white and more recent movies are in color. However, there is no distinct dividing line between the two [1]. On top of that, filmmakers continue to choose to shoot their films in black and white (e.g. Schindler’s List (1993) and The Artist (2011)) [2]. 

Because there is no distinct dividing line, it is interesting to approach the question as a time series. Before this can be done on a larger scale with finer subsets (e.g. subsets per decade or even per year), we will look at three broader subsets: 1920-1940, 1960-1980 and 2000-2020. I will compare the number of distinct dominant colors between these subsets. The findings give a first overview of dominant color use in the different periods.

To analyze this, we need a plan to tackle the dimensionality of the data. The process will be as follows:

1. Make the subsets from the trailers data.
2. Get the middle frame of each scene for every video. Analyzing all frames would be computationally intensive and would yield many near-identical frames. Because we are interested in the most dominant colors, one frame per scene is sufficient, and we take the middle frame of each scene as its representative.
3. Load the frames for each subset.
4. Using the package colorgram, extract the single most dominant color of every frame, map it to the nearest named CSS 2.1 color, and count the number of distinct dominant colors in each subset.
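
The reduction described in the steps above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the nested dictionary layout, the function name `distinct_colors_per_subset` and the toy color names are all assumptions made for the example.

```python
# Hypothetical layout: subset -> video -> one dominant color name per scene.
# The values below are made-up placeholders, not results from the trailers.
toy_data = {
    "1920-1940": {"video_a": ["black", "gray"], "video_b": ["black"]},
    "2000-2020": {"video_c": ["black", "navy"], "video_d": ["orange", "black"]},
}


def distinct_colors_per_subset(data):
    """Collapse the video and scene levels into one set of color names per subset."""
    summary = {}
    for subset, videos in data.items():
        colors = set()
        for scene_colors in videos.values():
            colors.update(scene_colors)
        summary[subset] = colors
    return summary


summary = distinct_colors_per_subset(toy_data)
for subset, colors in summary.items():
    print(subset, len(colors), sorted(colors))
```

Each level of the data (video, scene, frame) is reduced to a single value before moving up, so that every subset ends up as one comparable set of color names.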

### 1. Preparing the subsets

In [19]:
# Import the necessary packages
import os
import random

import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import wget
import colorgram
import webcolors

from scenedetect import VideoManager
from scenedetect import SceneManager
from scenedetect.detectors import ContentDetector

from PIL import Image, ImageOps
from tensorflow.keras.preprocessing import image
from matplotlib.colors import to_hex
from tqdm.notebook import tqdm

In [1]:
# First, I draw three random trailers per period (boundary years included)
trailers = pd.read_csv('trailers.csv')

trailers_20 = trailers[trailers.year.between(1920, 1940)].sample(3)
trailers_60 = trailers[trailers.year.between(1960, 1980)].sample(3)
trailers_00 = trailers[trailers.year.between(2000, 2020)].sample(3)
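
Because the subsets are drawn at random, each run of the notebook can pick different trailers. The sketch below shows one way to make such a draw repeatable with a fixed seed, using only the standard library. The function name `sample_period`, the record layout and the toy records are assumptions made for the example.

```python
import random


def sample_period(records, first_year, last_year, k, seed=0):
    """Return k randomly chosen records whose year lies in [first_year, last_year]."""
    in_period = [r for r in records if first_year <= r["year"] <= last_year]
    rng = random.Random(seed)  # fixed seed so the same records are drawn on every run
    return rng.sample(in_period, k)


# Made-up records for illustration only.
toy_records = [{"title": f"film_{y}", "year": y} for y in range(1915, 1946)]
picked = sample_period(toy_records, 1920, 1940, k=3, seed=42)
print([r["title"] for r in picked])
```

In pandas, the same effect can be obtained by passing the `random_state` argument to `DataFrame.sample`.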

In [2]:
# Download every trailer of a subset into its own folder and return the local file paths.
def dl_sample(df, folder):
    os.makedirs(folder, exist_ok=True)

    video_paths = []
    for video in df.itertuples():
        output_path = os.path.join(folder, video.trailer_title + '.mp4')
        wget.download(video.url, out=output_path)
        video_paths.append(output_path)

    return video_paths

In [3]:
trailer1920 = dl_sample(trailers_20, 'vid_1920/')
trailer1960 = dl_sample(trailers_60, 'vid_1960/')
trailer2000 = dl_sample(trailers_00, 'vid_2000/')

### 2. Downloading the middle frame of each scene for each subset

In [11]:
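# Detect scene boundaries in a video with PySceneDetect's content detector.
def find_scenes(video_path, threshold=10.0):
    video_manager = VideoManager([video_path])
    scene_manager = SceneManager()
    # A new scene starts when the frame-to-frame content change exceeds the threshold
    scene_manager.add_detector(
        ContentDetector(threshold=threshold))
    base_timecode = video_manager.get_base_timecode()
    # Downscale the frames to speed up detection
    video_manager.set_downscale_factor()
    video_manager.start()
    scene_manager.detect_scenes(frame_source=video_manager, show_progress=False)
    # Return a list of (start, end) timecodes, one pair per scene
    return scene_manager.get_scene_list(base_timecode)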

In [12]:
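# Save the middle frame of every detected scene of one trailer.
def dl_scenes(filename):
    # Detect the scene boundaries of the video
    scene_list = find_scenes(filename, threshold=10)

    # Store the frames in a 'scenes' folder next to the video, so that
    # frames from different subsets do not end up in the same folder
    out_dir = os.path.join(os.path.dirname(filename), 'scenes')
    os.makedirs(out_dir, exist_ok=True)
    video_name = os.path.splitext(os.path.basename(filename))[0]

    cap = cv2.VideoCapture(filename)

    for i, (start_time, end_time) in enumerate(scene_list):
        # Jump to the frame halfway through the scene and read it
        duration = end_time - start_time
        middle = start_time.get_frames() + int(duration.get_frames() / 2)
        cap.set(cv2.CAP_PROP_POS_FRAMES, middle)
        ret, frame = cap.read()
        if not ret:
            continue  # skip scenes whose middle frame cannot be read

        # cv2.imwrite expects BGR, which is what cap.read() returns, so no color conversion is needed.
        # The video name is part of the file name so that trailers do not overwrite each other's frames.
        cv2.imwrite(os.path.join(out_dir, '{}_frame_{}.jpg'.format(video_name, i)), frame)

    cap.release()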
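
The choice of the middle frame comes down to a small piece of index arithmetic. The sketch below isolates it from the video handling. The function name `middle_frame_index` and the scene boundaries are made up for the example, and the end frame is treated as the first frame of the next scene, which is an assumption about how the scene list is laid out.

```python
def middle_frame_index(start_frame, end_frame):
    """Index of the frame halfway between a scene's first frame and its end frame."""
    if end_frame < start_frame:
        raise ValueError("end_frame must not be smaller than start_frame")
    return start_frame + (end_frame - start_frame) // 2


# Made-up scene boundaries (in frames) for illustration.
toy_scenes = [(0, 48), (48, 61), (61, 200)]
print([middle_frame_index(start, end) for start, end in toy_scenes])
```

With one frame per scene, a trailer of a few thousand frames is reduced to a few dozen images.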

In [17]:
# Extract and save the scene frames for every trailer in a subset.
def analyze_trailers(videos):
    for vid in videos:
        dl_scenes(vid)

In [18]:
analyze_trailers(trailer1920)

In [20]:
analyze_trailers(trailer1960)

In [21]:
analyze_trailers(trailer2000)

### 3. Loading the frames for each subset

In [2]:
# Load an image file from disk and return it as a NumPy array.

%config InlineBackend.figure_format='retina' 

def load_image_from_path(image_path, target_size=None, color_mode='rgb'):
    pil_image = image.load_img(image_path,
                               target_size=target_size,
                               color_mode=color_mode)
    return image.img_to_array(pil_image)

In [7]:
# Next, define the path and create the dataframe.

mypath1 = '/Users/nadinekanbier/Desktop/Applied Data Science/Periode 2/Data Mining/Exam 2/vid_1920/scenes'

image_paths1 = [image_path.path for image_path in os.scandir(mypath1)] # the image paths
image_paths1 = [image for image in image_paths1 if image[-3:] in ['jpg', 'gif', 'epg', 'png']]

df1920 = pd.DataFrame(image_paths1) 
df1920.columns = ['file_path']

In [8]:
mypath2 = '/Users/nadinekanbier/Desktop/Applied Data Science/Periode 2/Data Mining/Exam 2/vid_1960/scenes'

image_paths2 = [image_path.path for image_path in os.scandir(mypath2)] # the image paths
image_paths2 = [image for image in image_paths2 if image[-3:] in ['jpg', 'gif', 'epg', 'png']]

df1960 = pd.DataFrame(image_paths2) 
df1960.columns = ['file_path']

In [9]:
mypath3 = '/Users/nadinekanbier/Desktop/Applied Data Science/Periode 2/Data Mining/Exam 2/vid_2000/scenes'

image_paths3 = [image_path.path for image_path in os.scandir(mypath3)] # the image paths
image_paths3 = [image for image in image_paths3 if image[-3:] in ['jpg', 'gif', 'epg', 'png']]

df2000 = pd.DataFrame(image_paths3)
df2000.columns = ['file_path']
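
The three cells above repeat the same listing and filtering logic for each folder. A small helper along the following lines would avoid that repetition. It is only a sketch: the name `list_image_paths` and the set of accepted suffixes are assumptions, and it matches on the full file suffix rather than on the last three characters of the path.

```python
from pathlib import Path

# Suffixes treated as images in this sketch (an assumption, adjust as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".gif", ".png"}


def list_image_paths(folder):
    """Return the sorted paths of all image files directly inside `folder`."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

A dataframe like `df1920` could then be built with `pd.DataFrame({'file_path': list_image_paths(mypath1)})`.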

### 4. Getting the most dominant colors for each subset

In [20]:
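# Map an RGB triplet to the name of the nearest CSS 2.1 color (smallest squared Euclidean distance in RGB space).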
def get_colour_name(rgb_triplet):
    """
    From https://stackoverflow.com/questions/9694165/convert-rgb-color-to-english-color-name-like-green-with-python
    """
    min_colours = {}
    for key, name in webcolors.CSS21_HEX_TO_NAMES.items():
        r_c, g_c, b_c = webcolors.hex_to_rgb(key)
        rd = (r_c - rgb_triplet[0]) ** 2
        gd = (g_c - rgb_triplet[1]) ** 2
        bd = (b_c - rgb_triplet[2]) ** 2
        min_colours[(rd + gd + bd)] = name
    return min_colours[min(min_colours.keys())]
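
The nearest-name lookup can also be illustrated without the webcolors package. The sketch below hard-codes the RGB values of the 17 CSS 2.1 color keywords and picks the one with the smallest squared Euclidean distance. The names `CSS21_COLORS` and `nearest_color_name` are made up for the example.

```python
# RGB values of the 17 CSS 2.1 color keywords, hard-coded so the sketch has no dependencies.
CSS21_COLORS = {
    "black": (0, 0, 0),
    "silver": (192, 192, 192),
    "gray": (128, 128, 128),
    "white": (255, 255, 255),
    "maroon": (128, 0, 0),
    "red": (255, 0, 0),
    "purple": (128, 0, 128),
    "fuchsia": (255, 0, 255),
    "green": (0, 128, 0),
    "lime": (0, 255, 0),
    "olive": (128, 128, 0),
    "yellow": (255, 255, 0),
    "navy": (0, 0, 128),
    "blue": (0, 0, 255),
    "teal": (0, 128, 128),
    "aqua": (0, 255, 255),
    "orange": (255, 165, 0),
}


def nearest_color_name(rgb):
    """Name of the palette color with the smallest squared Euclidean distance to `rgb`."""
    def squared_distance(name):
        r, g, b = CSS21_COLORS[name]
        return (r - rgb[0]) ** 2 + (g - rgb[1]) ** 2 + (b - rgb[2]) ** 2

    return min(CSS21_COLORS, key=squared_distance)


print(nearest_color_name((10, 10, 10)))
print(nearest_color_name((200, 200, 200)))
```

Because every frame is mapped onto this small palette, a subset can never show more than 17 distinct dominant colors, and visually different shades may share one name.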

In [21]:
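# For every 1920-1940 frame, extract the single most dominant color and store its nearest CSS 2.1 name.
colors_list = []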

for i in tqdm(range(0,len(df1920))):
    color_image = load_image_from_path(df1920.file_path.values[i],color_mode='rgb')
    img = Image.fromarray(color_image.astype(np.uint8)) # convert to PIL image object
    colors = colorgram.extract(img, 1) 

    for color in colors:
        rgb = tuple(color.rgb)
        color_name = get_colour_name(rgb)
    
    colors_list.append(color_name)

In [22]:
df1920['Dominant_color'] = colors_list
df1920['Dominant_color'].unique()

array(['black', 'gray'], dtype=object)

In [26]:
# The 1920-1940 subset has two distinct dominant colors (black and gray).

In [23]:
# Repeat the extraction for the 1960-1980 frames.
colors_list = []

for i in tqdm(range(0,len(df1960))):
    color_image = load_image_from_path(df1960.file_path.values[i],color_mode='rgb')
    img = Image.fromarray(color_image.astype(np.uint8)) # convert to PIL image object
    colors = colorgram.extract(img, 1) 

    for color in colors:
        rgb = tuple(color.rgb)
        color_name = get_colour_name(rgb)
    
    colors_list.append(color_name)

In [24]:
df1960['Dominant_color'] = colors_list
df1960['Dominant_color'].unique()

array(['black', 'gray', 'silver'], dtype=object)

In [27]:
# The 1960-1980 subset has three distinct dominant colors (black, gray and silver).

In [25]:
# Repeat the extraction for the 2000-2020 frames.
colors_list = []

for i in tqdm(range(0,len(df2000))):
    color_image = load_image_from_path(df2000.file_path.values[i],color_mode='rgb')
    img = Image.fromarray(color_image.astype(np.uint8)) # convert to PIL image object
    colors = colorgram.extract(img, 1) 

    for color in colors:
        rgb = tuple(color.rgb)
        color_name = get_colour_name(rgb)
    
    colors_list.append(color_name)

In [28]:
df2000['Dominant_color'] = colors_list
df2000['Dominant_color'].unique()

array(['black', 'gray', 'white', 'navy', 'orange', 'silver', 'olive',
       'aqua', 'teal'], dtype=object)

In [29]:
# The 2000-2020 subset has nine distinct dominant colors.
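
Counting distinct names treats a color that dominates one frame the same as a color that dominates a hundred, and the subsets contain different numbers of frames. A complementary view is the share of frames per color, sketched below with the standard library. The function name `color_shares` and the labels are made up for the example, and no such figures were computed for the actual trailers.

```python
from collections import Counter


def color_shares(color_names):
    """Fraction of frames assigned to each color name, largest share first."""
    counts = Counter(color_names)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.most_common()}


# Made-up labels for illustration; in the notebook this would be a column
# such as df2000['Dominant_color'].
toy_labels = ["black"] * 6 + ["gray"] * 3 + ["navy"]
shares = color_shares(toy_labels)
print(shares)
```

Shares are comparable across subsets of different sizes, which the raw number of distinct colors is not.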

### Conclusion and discussion

The analysis shows that the frames from the early period (1920-1940) have only two distinct dominant colors, black and gray. This is expected: although color film was introduced in the 1920s, black-and-white movies still dominated the cinematic world. The second period (1960-1980) has three dominant colors in its trailers (black, gray and silver). The last period (2000-2020) has nine dominant colors, a marked difference compared to the other two periods. This suggests that the number of dominant colors in movie trailers rose at some point between the second and the last period. The result gives reason to research color use in movies across the decades further.

That being said, we have to be careful when interpreting the results. We have only used three trailers for each subset/period. Further research with more trailers and more subsets is necessary to provide a time-series analysis on the color-use in movies.

### References
1. https://www.liveabout.com/how-movies-went-from-black-white-to-color-4153390
2. Li, J. (2012). Discoloured vestiges of history: Black and white in the age of colour cinema. Journal of Chinese Cinemas, 6(3), 247-262.