Produces a subset of the Rijksmuseum dataset as a CSV file that lists each image file name with its corresponding material.

The Rijksmuseum set is highly unbalanced: although there are 400 material classes, 84% of the items fall within the 'papier' (paper) class. The script below outputs a much more balanced subset.

Moreover, a small fraction of the collection has multiple materials, and another fraction has none at all. These are removed from the dataset.

In [1]:
import os
import pandas as pd

In [2]:
# Folder containing all the xml metadata files:
xmlPath = "/home/vincent/Documenten/BachelorsProject/Rijksdata/xml/"

In [7]:
# Takes the path to an XML metadata file as input and returns
# the list of materials it specifies
def extractMaterials(xmlFile):
    with open(xmlFile) as f:
        xmlStr = f.read()
    
    materials = []
    
    matchStr = "<dc:format>materiaal: "
    begin = xmlStr.find(matchStr)
    while begin != -1:
        end = xmlStr.find("<", begin + len(matchStr))
        materials.append(xmlStr[begin + len(matchStr):end])
        begin = xmlStr.find(matchStr, end)
    
    return materials
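
The string search above can also be written as a regular expression. The following is a minimal sketch, not part of the original notebook: the helper name `extract_materials_from_string` and the sample record are made up for illustration, and it works on an XML string rather than a file path. It should behave roughly like `extractMaterials` for well-formed records, where every material value is followed by a closing tag.

```python
import re

# Hypothetical helper (not in the original notebook): captures the text
# between "<dc:format>materiaal: " and the next "<".
MATERIAL_RE = re.compile(r"<dc:format>materiaal: ([^<]*)<")

def extract_materials_from_string(xml_str):
    """Return every material named in a '<dc:format>materiaal: ...' element."""
    return MATERIAL_RE.findall(xml_str)

# Made-up sample record, for illustration only
sample_xml = (
    "<record>"
    "<dc:format>materiaal: papier</dc:format>"
    "<dc:format>techniek: etsen</dc:format>"
    "<dc:format>materiaal: karton</dc:format>"
    "</record>"
)

sample_materials = extract_materials_from_string(sample_xml)
print(sample_materials)
```

Only the `materiaal:` entries are returned; other `dc:format` values, such as the technique, are ignored.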

In [35]:
# Getting all ("image-filename", [materials]) pairs
pairs_full = [[file.name, extractMaterials(file.path)] for file in os.scandir(xmlPath) if file.is_file()]

In [36]:
# Now only the ones with a single material (so not 0 and not multiple)
pairs = [[pair[0], pair[1][0]] for pair in pairs_full if len(pair[1]) == 1]

In [37]:
# Creating a histogram containing how often each class occurs:
hist = {}
for pair in pairs:
    if pair[1] in hist:
        hist[pair[1]] += 1
    else:
        hist[pair[1]] = 1

# Convert to sorted list:
hist = [[mat, hist[mat]] for mat in hist]
hist.sort(key = lambda x: x[1], reverse = True)
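
As an aside, the same sorted histogram can be built with `collections.Counter` from the standard library. This is a sketch on a toy stand-in for `pairs` (the file names are made up), not a replacement for the cell above.

```python
from collections import Counter

# Toy stand-in for the notebook's `pairs` list (made-up file names)
example_pairs = [
    ["a.xml", "papier"],
    ["b.xml", "papier"],
    ["c.xml", "zilver"],
    ["d.xml", "papier"],
    ["e.xml", "hout"],
    ["f.xml", "zilver"],
]

# Count how often each material occurs
counts = Counter(material for _, material in example_pairs)

# most_common() returns (material, count) tuples sorted by descending count
example_hist = [[material, count] for material, count in counts.most_common()]
print(example_hist)
```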

In [38]:
# Printing material, count, and percentage count of total
total = sum(count for _, count in hist)

for row in hist:
    print(f"{row[0]}, {row[1]}, {(100 * row[1] / total):>0.1f}%")

papier, 88298, 91.4%
porselein, 1626, 1.7%
zilver, 1232, 1.3%
faience, 1009, 1.0%
hout, 507, 0.5%
brons, 372, 0.4%
glas (materiaal), 349, 0.4%
perkament, 294, 0.3%
geprepareerd papier, 255, 0.3%
fotopapier, 239, 0.2%
ijzer, 203, 0.2%
Japans papier, 201, 0.2%
ivoor, 180, 0.2%
Oosters papier, 150, 0.2%
eikenhout, 133, 0.1%
terracotta, 107, 0.1%
aardewerk, 88, 0.1%
zijde, 80, 0.1%
koper, 70, 0.1%
messing, 64, 0.1%
goud, 63, 0.1%
klei, 60, 0.1%
tin, 59, 0.1%
karton, 57, 0.1%
steengoed, 55, 0.1%
satijn, 47, 0.0%
kardoespapier, 40, 0.0%
palmhout, 39, 0.0%
paneel, 39, 0.0%
lood (materiaal), 37, 0.0%
wit marmer, 36, 0.0%
linnen, 35, 0.0%
katoen, 33, 0.0%
chine collé, 32, 0.0%
zandsteen, 32, 0.0%
marmer, 29, 0.0%
wol, 23, 0.0%
kraakporselein, 23, 0.0%
notenhout, 19, 0.0%
andesiet, 16, 0.0%
leer, 16, 0.0%
Chinees papier, 13, 0.0%
lindehout, 13, 0.0%
albast, 13, 0.0%
stucwerk, 13, 0.0%
blik, 13, 0.0%
pijpaarde, 12, 0.0%
olieverf, 12, 0.0%
kalksteen, 12, 0.0%
schildpad, 12, 0.0%
lak, 8, 0.0%
parel

In [42]:
# Creating a subset with a maximum number of instances per class.
# If a class has more than this maximum, a random sample is picked.
max_instances = 1700
num_classes = 30

subset_classes = hist[:num_classes]
subset_classes

[['papier', 88298],
 ['porselein', 1626],
 ['zilver', 1232],
 ['faience', 1009],
 ['hout', 507],
 ['brons', 372],
 ['glas (materiaal)', 349],
 ['perkament', 294],
 ['geprepareerd papier', 255],
 ['fotopapier', 239],
 ['ijzer', 203],
 ['Japans papier', 201],
 ['ivoor', 180],
 ['Oosters papier', 150],
 ['eikenhout', 133],
 ['terracotta', 107],
 ['aardewerk', 88],
 ['zijde', 80],
 ['koper', 70],
 ['messing', 64],
 ['goud', 63],
 ['klei', 60],
 ['tin', 59],
 ['karton', 57],
 ['steengoed', 55],
 ['satijn', 47],
 ['kardoespapier', 40],
 ['palmhout', 39],
 ['paneel', 39],
 ['lood (materiaal)', 37]]
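
The capping step described above (keep at most `max_instances` items per class, sampling at random when a class is larger) could look roughly like the sketch below. It is not the notebook's own code: the function name `balanced_subset`, the column names `file` and `material`, the fixed seed, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def balanced_subset(pairs, class_names, max_instances, seed=0):
    """Keep only the given classes, with at most max_instances rows per class."""
    # Column names are hypothetical; the original CSV layout is not shown
    df = pd.DataFrame(pairs, columns=["file", "material"])
    df = df[df["material"].isin(class_names)]

    parts = []
    for _, group in df.groupby("material", sort=False):
        if len(group) > max_instances:
            # Class is too large: draw a random sample without replacement
            group = group.sample(n=max_instances, random_state=seed)
        parts.append(group)

    if not parts:
        return df  # no rows matched any of the requested classes
    return pd.concat(parts).reset_index(drop=True)

# Toy data (made up): one dominant class, one small class, one excluded class
toy_pairs = [[f"img{i}.xml", "papier"] for i in range(10)]
toy_pairs += [["x.xml", "zilver"], ["y.xml", "zilver"], ["z.xml", "lak"]]

toy_subset = balanced_subset(toy_pairs, ["papier", "zilver"], max_instances=3)
toy_counts = toy_subset["material"].value_counts().to_dict()
print(toy_counts)
```

With the notebook's variables, the class names would be `[c[0] for c in subset_classes]`, and the result could be written out with `DataFrame.to_csv(..., index=False)`.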