Outputs a more balanced subset of the Rijksmuseum dataset as a CSV file that maps each image file name to its corresponding **ARTIST**.

**BIG DISCLAIMER** This project was originally intended for material classification only. As a result, variable names and comments in the code often mention materials, even though the same code is also used for type or artist classification.

The Rijksmuseum set is very unbalanced: while there are 400 material classes, 84% of the items fall within the 'papier' class. The script below outputs a subset that is much more balanced.

Moreover, a small fraction of the collection has multiple labels (materials or, in this notebook, creators), and another fraction has none at all. Both are removed from the dataset.

In [1]:
import os
import random
import pandas as pd
import numpy as np

In [2]:
# Folder containing all the xml metadata files:
xmlPath = "/home/vincent/Documenten/BachelorsProject/Rijksdata/xml/"

In [3]:
# Takes the path to an xml metadata file as input, and outputs the
# list of creators it specifies (entries containing "anoniem" are skipped)
def extractCreators(xmlFile):
    with open(xmlFile) as f:
        xmlStr = f.read()
    
    creators = []
    
    matchStr = "<dc:creator>"
    begin = xmlStr.find(matchStr)
    while begin != -1:
        end = xmlStr.find("<", begin + len(matchStr))
        if "anoniem" not in xmlStr[begin + len(matchStr):end]:
            creators += [xmlStr[begin + len(matchStr):end]]
        begin = xmlStr.find(matchStr, end)
    
    return creators
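
The string search above relies on `<dc:creator>` appearing exactly in that form. As an alternative, the same extraction can be sketched with the standard library's XML parser. The helper name `extract_creators_et` and the sample document are made up for illustration, and the sketch assumes that the `dc` prefix is bound to the standard Dublin Core namespace, which should be checked against the real metadata files.

```python
import xml.etree.ElementTree as ET

# Assumption: the metadata files bind "dc" to the standard Dublin Core namespace.
DC_CREATOR = "{http://purl.org/dc/elements/1.1/}creator"

def extract_creators_et(xml_str):
    """Return the text of every dc:creator element, skipping anonymous ones."""
    root = ET.fromstring(xml_str)
    creators = []
    for elem in root.iter(DC_CREATOR):
        text = elem.text or ""
        if "anoniem" not in text:
            creators.append(text)
    return creators

# Hypothetical sample record, not taken from the real dataset:
sample_xml = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator> fotograaf: Woodbury &amp; Page</dc:creator>
  <dc:creator> schilder: anoniem</dc:creator>
</record>"""

print(extract_creators_et(sample_xml))
```

Unlike the raw string search, the parser decodes entities such as `&amp;`, so a label would read `Woodbury & Page` instead of `Woodbury &amp; Page`.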

In [4]:
# Getting all ("image-filename", [creators]) pairs
pairs_full = [[file.name.replace(".xml", ".jpg"), extractCreators(file.path)]
    for file in os.scandir(xmlPath) if file.is_file()]

In [5]:
# Now only the ones with a single creator (so not 0 and not multiple)
pairs = [[pair[0], pair[1][0]] for pair in pairs_full if len(pair[1]) == 1]

In [6]:
# Creating a histogram containing how often each class occurs:
def createHist(pairs):
    hist = {}
    for pair in pairs:
        if pair[1] in hist:
            hist[pair[1]] += 1
        else:
            hist[pair[1]] = 1

    # Convert to sorted list:
    hist = [[mat, hist[mat]] for mat in hist]
    hist.sort(key = lambda x: x[1], reverse = True)

    return hist

hist = createHist(pairs)
# The second-largest class has the label ' : ', which should have been
# treated as anonymous, so it is dropped from the histogram:
hist.remove(hist[1])
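
The counting loop in `createHist` can also be sketched with `collections.Counter` from the standard library. The function name `create_hist_counter` and the toy pairs are illustrative only; the result has the same `[label, count]` layout, sorted by descending count.

```python
from collections import Counter

def create_hist_counter(pairs):
    """Count how often each label occurs; most frequent label first."""
    counts = Counter(label for _, label in pairs)
    return [[label, count] for label, count in counts.most_common()]

# Toy data for illustration:
toy_pairs = [["a.jpg", "x"], ["b.jpg", "y"], ["c.jpg", "x"]]
print(create_hist_counter(toy_pairs))
```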



In [7]:
# Get the gini score: indicates how balanced the dataset is
def gini(arr: np.ndarray):
    arr = np.array(arr)
    A = np.ones((len(arr), len(arr))) * arr
    AT = A.T
    A = np.abs(A - AT)
    numerator = np.sum(A)
    denominator = 2 * len(arr) ** 2 * np.mean(arr)
    return numerator / denominator
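
The function above computes the Gini coefficient as the sum of all pairwise absolute differences divided by `2 * n**2 * mean`, which requires an `n x n` matrix. A sketch of an equivalent formulation on the sorted counts, which avoids that matrix, is shown below. The names `gini_sorted` and `gini_pairwise` are illustrative and not part of the notebook.

```python
import numpy as np

def gini_sorted(counts):
    """Gini coefficient from ascending-sorted values, using O(n) memory."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

def gini_pairwise(counts):
    """Reference version: mean absolute difference over all pairs."""
    x = np.asarray(counts, dtype=float)
    diffs = np.abs(x[:, None] - x[None, :])
    return diffs.sum() / (2 * len(x) ** 2 * x.mean())

print(gini_sorted([1, 1, 1, 1]))   # perfectly balanced
print(gini_sorted([0, 0, 0, 1]))   # everything in one class
```

A perfectly balanced set of classes gives 0, and the value approaches 1 as the instances concentrate in a single class.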

In [12]:
print(f"Number of classes: {len(hist)}")
print(f"Dataset size:      {sum([x[1] for x in hist])}")
print(f"Gini coefficient:  {gini([x[1] for x in hist])}")

Number of classes: 8592
Dataset size:      38296
Gini coefficient:  0.6757501731586828


In [9]:
# Printing class label, count, and percentage of the total
def printHist(hist):
    total = 0
    for row in hist:
        total += row[1]

    for row in hist:
        print(f"{row[0]}, {row[1]}, {(100 * row[1] / total):>0.1f}%")

printHist(hist)

 porseleinfabriek: Meissener Porzellan Manufaktur, 854, 2.2%
vermeld op object prentmaker: Dürer, Albrecht, 323, 0.8%
 tekenaar: Bernard, Jean, 301, 0.8%
 tekenaar: Brandes, Jan, 295, 0.8%
 prentmaker: Hogenberg, Frans, 286, 0.7%
 uitgever: Aa, Pieter van der (I), 267, 0.7%
 tekenaar: Gordon, Robert Jacob, 250, 0.7%
 prentmaker: Bos, Anthonie van den, 241, 0.6%
atelier van prentmaker: Hogenberg, Frans, 225, 0.6%
 prentmaker: Fock, Hermanus, 224, 0.6%
 prentmaker: Passe, Crispijn van de (I), 215, 0.6%
 tekenaar: Ducros, Louis, 195, 0.5%
 prentmaker: Fokke, Simon, 195, 0.5%
 fotograaf: Woodbury &amp; Page, 194, 0.5%
vermeld op object tekenaar: Borch, Harmen ter, 193, 0.5%
 tekenaar: Pronk, Cornelis, 192, 0.5%
 fotograaf: Asser, Eduard Isaac, 188, 0.5%
 naar prent van: Hogenberg, Frans, 178, 0.5%
vermeld op object prentmaker: Fokke, Simon, 170, 0.4%
 tekenaar: Buys, Jacobus, 167, 0.4%
 prentmaker: Luyken, Jan, 166, 0.4%
vermeld op object prentmaker: Everdingen, Allaert van, 161, 0.4%
 tek
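
The labels in this listing carry a role prefix in front of the name, so the same person can show up as several classes (for example `Hogenberg, Frans` as `prentmaker`, `atelier van prentmaker`, and `naar prent van`). Whether that is desirable depends on the task. The following is a minimal sketch for separating role and name, assuming every label has the form `role: name`. The helper `split_label` is hypothetical and not part of the notebook.

```python
import html

def split_label(label):
    """Split ' role: name' into (role, name); decode entities such as &amp;."""
    role, sep, name = html.unescape(label).strip().partition(": ")
    if not sep:
        # No 'role: ' prefix found: treat the whole label as the name.
        return "", role
    return role, name

print(split_label(" prentmaker: Hogenberg, Frans"))
print(split_label(" fotograaf: Woodbury &amp; Page"))
```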

In [10]:
# Creating a subset with a maximum number of instances per class.
# If a class has more than this maximum, a random sample is picked.
# Moreover, there is a maximum number of classes.
max_instances = 1000
num_classes = 30

# Splitting the first 'num_classes' into a set of small enough ones and too big ones:
good_sized_classes = [row[0] for row in hist[:num_classes] if row[1] <= max_instances]
too_big_classes    = [row[0] for row in hist[:num_classes] if row[1] >  max_instances]

# Already adding all instances of 'good_sized_classes':
pairs_subset = [pair for pair in pairs if pair[1] in good_sized_classes]

# Adding 'max_instances' random samples of the too big classes:
for material in too_big_classes:
    # 'candidates' instead of 'all', to avoid shadowing the built-in all()
    candidates = [pair for pair in pairs if pair[1] == material]
    random.shuffle(candidates)
    pairs_subset += candidates[:max_instances]

# Finally, randomly shuffling the subset:
random.shuffle(pairs_subset)
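
The cell above uses the global `random` state, so each run can produce a different sample and ordering. A sketch of the same capping step with a dedicated, seeded generator is given below, so the subset can be reproduced. The function name `cap_classes`, the seed value, and the toy data are assumptions for illustration.

```python
import random

def cap_classes(pairs, classes, max_instances, seed=0):
    """Keep at most max_instances pairs per class, then shuffle reproducibly."""
    rng = random.Random(seed)  # local generator, independent of global state
    subset = []
    for label in classes:
        members = [pair for pair in pairs if pair[1] == label]
        if len(members) > max_instances:
            members = rng.sample(members, max_instances)
        subset += members
    rng.shuffle(subset)
    return subset

# Toy data: one class with 10 instances, one with a single instance.
toy_pairs = [[f"{i}.jpg", "big"] for i in range(10)] + [["s.jpg", "small"]]
subset = cap_classes(toy_pairs, ["big", "small"], max_instances=3)
print(len(subset))
```

With the settings used in this notebook, the largest class has 854 instances, which is below `max_instances = 1000`, so only the final shuffle is affected by the random state.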

In [11]:
hist_subset = createHist(pairs_subset)

print(f"Number of classes: {len(hist_subset)}")
print(f"Dataset size:      {sum([x[1] for x in hist_subset])}")
print(f"Gini coefficient:  {gini([x[1] for x in hist_subset])}")

print("\n------------------\n")

printHist(hist_subset)

Number of classes: 30
Dataset size:      6530
Gini coefficient:  0.23612046962736088

------------------

 porseleinfabriek: Meissener Porzellan Manufaktur, 854, 13.1%
vermeld op object prentmaker: Dürer, Albrecht, 323, 4.9%
 tekenaar: Bernard, Jean, 301, 4.6%
 tekenaar: Brandes, Jan, 295, 4.5%
 prentmaker: Hogenberg, Frans, 286, 4.4%
 uitgever: Aa, Pieter van der (I), 267, 4.1%
 tekenaar: Gordon, Robert Jacob, 250, 3.8%
 prentmaker: Bos, Anthonie van den, 241, 3.7%
atelier van prentmaker: Hogenberg, Frans, 225, 3.4%
 prentmaker: Fock, Hermanus, 224, 3.4%
 prentmaker: Passe, Crispijn van de (I), 215, 3.3%
 prentmaker: Fokke, Simon, 195, 3.0%
 tekenaar: Ducros, Louis, 195, 3.0%
 fotograaf: Woodbury &amp; Page, 194, 3.0%
vermeld op object tekenaar: Borch, Harmen ter, 193, 3.0%
 tekenaar: Pronk, Cornelis, 192, 2.9%
 fotograaf: Asser, Eduard Isaac, 188, 2.9%
 naar prent van: Hogenberg, Frans, 178, 2.7%
vermeld op object prentmaker: Fokke, Simon, 170, 2.6%
 tekenaar: Buys, Jacobus, 167, 2.6

In [37]:
# Saving the new subset as a csv file:
subsetDf = pd.DataFrame.from_dict({
    "jpg":      [row[0] for row in pairs_subset],
    "material": [row[1] for row in pairs_subset]
})

subsetDf.to_csv("subset_data.csv", index=False)

In [38]:
# Saving the corresponding histogram as well:
subsetHistDf = pd.DataFrame.from_dict({
    "material": [row[0] for row in hist_subset],
    "count":    [row[1] for row in hist_subset]
})

subsetHistDf.to_csv("subset_hist_data.csv", index=False)
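
The labels contain commas and leading spaces, so it can be worth checking that they survive the CSV round trip. The sketch below uses toy data and a temporary directory rather than the real `subset_data.csv`; the variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy subset with the same column names as above:
toy_subset = pd.DataFrame({
    "jpg":      ["a.jpg", "b.jpg", "c.jpg"],
    "material": [" tekenaar: Brandes, Jan", " prentmaker: Luyken, Jan", " tekenaar: Brandes, Jan"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "subset_data.csv")
    toy_subset.to_csv(path, index=False)
    loaded = pd.read_csv(path)

# Labels with commas are quoted on writing, so they come back unchanged:
print(loaded["material"].tolist() == toy_subset["material"].tolist())
print(loaded["material"].value_counts())
```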