### INTRODUCTION

Some time ago I had the idea of exploring "the timeline of color in art" through data visualization. By "color timeline" I mean how the use of color in paintings has changed over the years. This dataset seems to be a good basis for such an exploration.

The resulting visualization is at the end of this notebook. 

My approach is a commonly used ML-based method of finding the prevalent colors in an image: K-Means clustering of pixels by their distance in the 3D space formed by their R, G, and B color components.

In outline, I took the following steps:

1. Subset the data to paintings from the Wikiart source created in 1500 or later.
2. Grouped the subset by half-decade and randomly sampled up to 50 paintings from each half-decade (all of them where fewer than 50 are available).
3. Within each half-decade's sample, resized the images so that the longer side is 200 pixels and combined them into a single array of pixel RGB components.
4. Applied K-Means clustering (k=10) to this array, producing 10 clusters that represent the most prevalent colors (the code uses scikit-learn's `MiniBatchKMeans`). The pixels assigned to each cluster were then counted to get the size of each cluster.
5. Used the resulting clustering information (colors and cluster sizes for each half-decade) to build the final color timeline visualization.
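
Steps 3 and 4 can be illustrated on synthetic data. The sketch below is only an illustration, not the notebook's pipeline: it runs a plain Lloyd's-style K-Means written in NumPy on a made-up pixel array, whereas the notebook itself relies on scikit-learn. The helper name `kmeans_colors`, the fixed initial centers, and the toy colors are all assumptions made for the example.

```python
import numpy as np

def kmeans_colors(pixels, init_centers, n_iter=10):
    """Plain Lloyd's K-Means on an (N, 3) array of RGB pixels.

    Hypothetical helper for illustration; returns (centers, proportions).
    """
    centers = np.asarray(init_centers, dtype='float64')
    for _ in range(n_iter):
        # Euclidean distance from every pixel to every center: shape (N, k)
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the pixels assigned to it
        for i in range(len(centers)):
            if np.any(labels == i):
                centers[i] = pixels[labels == i].mean(axis=0)
    # Cluster size expressed as a share of all pixels
    props = np.bincount(labels, minlength=len(centers)) / len(pixels)
    return centers, props

# Toy "combined image": 300 reddish and 100 bluish pixels (made-up data)
rng = np.random.default_rng(0)
red = np.clip(rng.normal((200, 30, 30), 5, size=(300, 3)), 0, 255)
blue = np.clip(rng.normal((20, 40, 180), 5, size=(100, 3)), 0, 255)
pixels = np.concatenate([red, blue], axis=0)

centers, props = kmeans_colors(pixels, init_centers=[(255, 0, 0), (0, 0, 255)])
print(np.round(centers), props)
```

The cluster centers play the role of the "prevalent colors", and the proportions correspond to the cluster sizes used later for the bar heights.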

Most of the image processing code is not my own; the references are at the bottom of the notebook.

In [None]:
import numpy as np
import pandas as pd
from zipfile import ZipFile
import re
import cv2
from sklearn.cluster import KMeans, MiniBatchKMeans
from math import sqrt
from collections import Counter

import altair as alt
alt.renderers.enable('kaggle')

### PREPARE PAINTINGS INFORMATION FILE & SUBSET WIKIART PAINTINGS AFTER 1500

In [None]:
# read paintings info data and parse year of creation
def parse_year(date):
    '''
    Helper to parse painting's year of creation
    '''
    if isinstance(date, str):
        # Take the first run of digits in the string as the year
        res = re.findall('([0-9]+)', date)
        return int(res[0]) if res else -1
        
    if pd.isnull(date):
        return -1
    
    # Already numeric: return it unchanged
    return date

info = pd.read_csv('/kaggle/input/painter-by-numbers/all_data_info.csv')
info['year'] = info.loc[:, 'date'].apply(parse_year).astype('int32')

# Only paintings created in 1500 or later
info = info.loc[info.year>=1500]

# Only paintings from wikiart (copy so that columns can be added later without warnings)
info = info.loc[info.source=='wikiart'].copy()

In [None]:
info.source.value_counts()

In [None]:
# add decades and half decades columns
info['decade'] = info['year']//10*10
info['half_decade'] = info['decade'] + ((info['year'] - info['decade']) // 5 * 5 )
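
As a quick check of the arithmetic above, the sketch below applies the same floor-division steps to a few plain integers. The function name `to_half_decade` and the sample years are made up for illustration.

```python
def to_half_decade(year):
    """Round a year down to the start of its half-decade (hypothetical helper)."""
    decade = year // 10 * 10
    return decade + (year - decade) // 5 * 5

for year in (1500, 1504, 1505, 1889, 1999):
    print(year, '->', to_half_decade(year))
```

For the years in this range the result is the same as `year // 5 * 5`; the two-step form simply keeps the decade available as its own column.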

In [None]:
# add folders to filenames
info.loc[info.in_train==True, 'new_filename'] = 'train/' + info.loc[info.in_train==True, 'new_filename']
info.loc[info.in_train==False, 'new_filename'] = 'test/' + info.loc[info.in_train==False, 'new_filename']

In [None]:
info.head(3)

In [None]:
info.loc[:, ['half_decade', 'new_filename']].groupby('half_decade').count().describe()

### CREATE IMAGES SAMPLES FOR EACH HALF-DECADE

In [None]:
def sample_data(info, sample_size, groupby='half_decade'):
    '''
    Sample up to `sample_size` filenames from each group
    (groups with fewer paintings are kept whole)
    '''
    def sample_func(x):
        subsample_size = min(len(x), sample_size)
        return list(x.sample(subsample_size, random_state=1))
    
    sample = info.groupby(groupby).agg({
        'new_filename': sample_func
    }).to_dict()
    
    return sample['new_filename']

In [None]:
sample_info = sample_data(info, sample_size=50, groupby='half_decade')
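
To make the shape of the result concrete, here is a small stand-in for the sampling step written with the standard library only. It is a sketch under assumptions: `sample_by_group`, the seed handling, and the file names are hypothetical, and it uses `random.Random.sample` rather than the pandas sampling the notebook relies on, so the selected files will not match the notebook's.

```python
import random
from collections import defaultdict

def sample_by_group(rows, sample_size, seed=1):
    """Group (key, filename) pairs and keep at most `sample_size` files per key."""
    groups = defaultdict(list)
    for key, filename in rows:
        groups[key].append(filename)

    rng = random.Random(seed)
    return {
        # Small groups are kept whole, larger ones are capped at sample_size
        key: rng.sample(names, min(len(names), sample_size))
        for key, names in sorted(groups.items())
    }

# Made-up rows: (half_decade, filename)
rows = [
    (1885, 'train/a.jpg'),
    (1885, 'train/b.jpg'),
    (1885, 'test/c.jpg'),
    (1890, 'train/d.jpg'),
]
sample = sample_by_group(rows, sample_size=2)
print(sample)
```

As with `sample_info`, the result is a dictionary that maps each half-decade to a list of file names.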

### DETECT 10 PREVALENT COLORS IN SAMPLES & CREATE VISUALIZATION DATA

In [None]:
# LOAD AND RESIZE IMAGE
def load_img(filename):
    '''
    Read an image from the train/test zip archive and return it as an RGB array,
    or None if it cannot be decoded
    '''
    zip_name = 'train.zip' if filename.startswith('train/') else 'test.zip'
    
    # `archive` avoids shadowing the built-in `zip`
    with ZipFile('/kaggle/input/painter-by-numbers/' + zip_name) as archive:
        with archive.open(filename) as file:
            img_array = np.asarray(bytearray(file.read()), dtype='uint8')
    
    # cv2.imdecode returns None when the data cannot be decoded
    img = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
    if img is None:
        print('Failed to load:', filename)
        return None
    
    return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

def resize_img(img):
    h, w, _ = img.shape
    w_new = int(200 * w / max(w, h))
    h_new = int(200 * h / max(w, h))
    
    return cv2.resize(img, (w_new, h_new), interpolation = cv2.INTER_AREA)


# COLORS CONVERSION
def rgb_to_hex(rgb):
    return "#{:02x}{:02x}{:02x}".format(int(rgb[0]), int(rgb[1]), int(rgb[2]))

def rgb_to_hsp(rgb):
    """
    Calculates RGB color brightness
    """
    return sqrt(0.299 * (rgb[0]**2) + 0.587 * (rgb[1]**2) + 0.114 * (rgb[2]**2))


# PREVALENT COLORS DETECTION WITH K-Means
def detect_prevalent_colors(files, group_label, colors_num=10):
    """
    Detect the N most prevalent colors in the given list of image files
    
    Params:
    files: list of filenames
    group_label: (column name, value) tuple added to every output record, e.g. ('year', 1900)
    colors_num: number of prevalent colors (clusters) to detect
    """
    data = []
    for f in files:
        img = load_img(f)
        if img is not None:
            mod_img = resize_img(img)
            mod_img = mod_img.reshape(mod_img.shape[0] * mod_img.shape[1], 3).astype('float32')
            data.append(mod_img)
        
    data = np.concatenate(data, axis=0)
    
    clustering_method = MiniBatchKMeans # KMeans or MiniBatchKMeans
    
    clf = clustering_method(n_clusters = colors_num)
    labels = clf.fit_predict(data)
    cluster_centers = clf.cluster_centers_
    
    counts = Counter(labels)
    counts_sum = sum(counts.values())
    
    colors_data = [
        {
            'hex': rgb_to_hex( cluster_centers[i]),
            'hsp': rgb_to_hsp(cluster_centers[i]),
            'prop': (counts[i] / counts_sum),
            group_label[0]: group_label[1]
        } for i in counts.keys()
    ]
    
    return colors_data


# RUN
viz_colors = []
for year, files in sample_info.items():
    print('Processing year:', year)
    
    year_colors = detect_prevalent_colors(files, ('year', year), colors_num=10)
    
    # Sort colors by brightness
    viz_colors.extend(
        sorted(year_colors, key=lambda x: x['hsp'])
    )
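
The brightness ordering used above can be checked on a few named colors. This is a small self-contained sketch: `perceived_brightness` restates the formula from `rgb_to_hsp`, $P = \sqrt{0.299 R^2 + 0.587 G^2 + 0.114 B^2}$, and the color list is made up for the example.

```python
from math import sqrt

def perceived_brightness(rgb):
    """HSP perceived brightness, same formula as rgb_to_hsp above."""
    r, g, b = rgb
    return sqrt(0.299 * r**2 + 0.587 * g**2 + 0.114 * b**2)

# A few reference colors (illustrative, not taken from the dataset)
colors = {
    'white': (255, 255, 255),
    'red': (255, 0, 0),
    'green': (0, 255, 0),
    'blue': (0, 0, 255),
    'black': (0, 0, 0),
}

# Sort color names from darkest to brightest
ordered = sorted(colors, key=lambda name: perceived_brightness(colors[name]))
print(ordered)
```

Pure green ranks brighter than pure red, which ranks brighter than pure blue, reflecting the channel weights in the formula.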

### PREPARE & PLOT COLORS TIMELINE VISUALIZATION

In [None]:
viz_df = pd.DataFrame(viz_colors)

#### VISUALIZE WITH COLOR PROPORTIONS

In [None]:
alt.Chart(viz_df, title='Prevalent colors by half-decade').mark_bar(size=9).encode(
    alt.X(
        'year',
        scale=alt.Scale(
            domain=(1495, 2015),
            nice=False
        ),
        axis=alt.Axis(
            title='Year (half-decade)',
            format='.4')
    ),
    alt.Y(
        'sum(prop)',
        scale=alt.Scale(domain=(0,1)),
        axis=alt.Axis(
            title='Prevalent color proportions'
        )
    ),
    
    color=alt.Color(
        'hex',
        scale=None,
        legend=None
    ),
    order=alt.Order(
        'hsp',
        sort='ascending'
    ),
    tooltip=['year', 'hex', 'prop']
).properties(
    width=900,
    height=500
)

#### VISUALIZE JUST WITH COLORS

In [None]:
alt.Chart(viz_df, title='Prevalent colors by half-decade').mark_bar(size=9).encode(
    alt.X(
        'year',
        scale=alt.Scale(
            domain=(1495, 2015),
            nice=False
        ),
        axis=alt.Axis(
            title='Year (half-decade)',
            format='.4')
    ),
    alt.Y(
        'count()',
        scale=alt.Scale(domain=(0,10)),
        axis=alt.Axis(
            title='Prevalent color'
        )
    ),
    
    color=alt.Color(
        'hex',
        scale=None,
        legend=None
    ),
    order=alt.Order(
        'hsp',
        sort='ascending'
    ),
    tooltip=['year', 'hex']
).properties(
    width=900,
    height=500
)

### REFERENCES

1. [Color Identification in Images: Machine Learning Application](https://towardsdatascience.com/color-identification-in-images-machine-learning-application-b26e770c4c71)
2. [How to find the main colours in an image](https://www.alanzucconi.com/2015/05/24/how-to-find-the-main-colours-in-an-image/)
3. [The incredibly challenging task of sorting colours](https://www.alanzucconi.com/2015/09/30/colour-sorting/)
4. [HSP Color Model — Alternative to HSV (HSB) and HSL](http://alienryderflex.com/hsp.html)