This notebook produces a subset of the Rijksmuseum dataset as a CSV file that lists each image file name and its corresponding **TYPE**.

**BIG DISCLAIMER** This project was initially intended for material classification alone, so variable names and comments in the code often refer to materials, even when the label being classified is actually a type or an artist.

The Rijksmuseum set is very unbalanced: while there are 400 material classes, 84% of the items fall within the `papier` class. The script below outputs a subset that is much more balanced.

Moreover, a small fraction of the collection has multiple materials, and another fraction has none at all. These are removed from the dataset.

In [1]:
import os
import random
import pandas as pd
import numpy as np

In [2]:
# Folder containing all the xml metadata files:
xmlPath = "/home/vincent/Documenten/BachelorsProject/Rijksdata/xml/"

In [3]:
# Takes the path of an xml metadata file as input, and returns the
# list of types it specifies (the text of every <dc:type> element)
def extractType(xmlFile):
    with open(xmlFile) as f:
        xmlStr = f.read()
    
    types = []
    
    matchStr = "<dc:type>"
    begin = xmlStr.find(matchStr)
    while begin != -1:
        end = xmlStr.find("<", begin + len(matchStr))
        types += [xmlStr[begin + len(matchStr):end]]
        begin = xmlStr.find(matchStr, end)
    
    return types
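
The string search above works on the raw text. As a point of comparison, here is a minimal sketch of the same extraction using the standard library XML parser. It assumes the metadata files are well-formed XML and declare the standard Dublin Core namespace URI for the `dc:` prefix; neither is verified here. The function name `extract_types_et` and the sample record are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the files bind the "dc" prefix to the standard Dublin Core URI.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

def extract_types_et(xml_str):
    """Return the text of every <dc:type> element found in an XML string."""
    root = ET.fromstring(xml_str)
    # ".//dc:type" matches dc:type elements at any depth below the root
    return [el.text.strip() for el in root.iterfind(".//dc:type", DC_NS) if el.text]

# Hypothetical sample record, for illustration only
sample = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Voorbeeld</dc:title>"
    "<dc:type>prent</dc:type>"
    "<dc:type>tekening</dc:type>"
    "</record>"
)

print(extract_types_et(sample))
```

One difference to keep in mind: a parser decodes entities such as `&amp;`, whereas the plain string search returns them unchanged, so class names could differ between the two approaches.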

In [4]:
# Getting all ("image-filename", [type]) pairs
pairs_full = [[file.name.replace(".xml", ".jpg"), extractType(file.path)]
    for file in os.scandir(xmlPath) if file.is_file()]

In [5]:
# Now only the ones with a single material (so not 0 and not multiple)
pairs = [[pair[0], pair[1][0]] for pair in pairs_full if len(pair[1]) == 1]

In [6]:
# Creating a histogram containing how often each class occurs:
def createHist(pairs):
    hist = {}
    for pair in pairs:
        if pair[1] in hist:
            hist[pair[1]] += 1
        else:
            hist[pair[1]] = 1

    # Convert to sorted list:
    hist = [[mat, hist[mat]] for mat in hist]
    hist.sort(key = lambda x: x[1], reverse = True)

    return hist

hist = createHist(pairs)
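
For reference, the same counting can be sketched with `collections.Counter` from the standard library. The name `create_hist_counter` and the toy pairs are hypothetical; the output format (a list of `[class, count]` rows sorted by descending count) is meant to mirror `createHist`.

```python
from collections import Counter

def create_hist_counter(pairs):
    """Count how often each class occurs, most frequent class first."""
    counts = Counter(label for _, label in pairs)
    # most_common() sorts by descending count; ties keep first-seen order
    return [[label, count] for label, count in counts.most_common()]

# Toy data, for illustration only
toy_pairs = [["a.jpg", "prent"], ["b.jpg", "foto"], ["c.jpg", "prent"]]
print(create_hist_counter(toy_pairs))
```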

In [7]:
# Get the Gini coefficient of the class counts: indicates how unbalanced
# the dataset is (0 means all classes are equally large)
def gini(arr):
    arr = np.array(arr)
    A = np.ones((len(arr), len(arr))) * arr
    AT = A.T
    A = np.abs(A - AT)
    numerator = np.sum(A)
    denominator = 2 * len(arr) ** 2 * np.mean(arr)
    return numerator / denominator
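
The function above evaluates the pairwise definition, G = Σᵢ Σⱼ |xᵢ − xⱼ| / (2 n² x̄), by building an n × n matrix. With a few hundred classes that is perfectly fine. As a sanity check, the sketch below compares it against the well-known equivalent formula on sorted values, which avoids the matrix. The names `gini_sorted` and `gini_pairwise` are hypothetical and not part of the original notebook.

```python
import numpy as np

def gini_pairwise(counts):
    """Gini coefficient from the mean absolute difference between all pairs."""
    x = np.asarray(counts, dtype=float)
    diffs = np.abs(x[:, None] - x[None, :])
    return diffs.sum() / (2 * len(x) ** 2 * x.mean())

def gini_sorted(counts):
    """Same quantity, computed from the values sorted in ascending order."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)  # 1-based rank of each sorted value
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

# Toy class counts, for illustration only
toy_counts = [1000, 1000, 500, 200, 50]
print(gini_pairwise(toy_counts), gini_sorted(toy_counts))
```

Both functions should agree up to floating-point rounding, and both return 0 when every class has the same count.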

In [8]:
print(f"Number of classes: {len(hist)}")
print(f"Dataset size:      {sum([x[1] for x in hist])}")
print(f"Gini coefficient:  {gini([x[1] for x in hist])}")

Number of classes: 801
Dataset size:      77628
Gini coefficient:  0.9738106261386695


In [9]:
# Printing material, count, and percentage count of total
def printHist(hist):
    total = 0
    for row in hist:
        total += row[1]

    for row in hist:
        print(f"{row[0]}, {row[1]}, {(100 * row[1] / total):>0.1f}%")

printHist(hist)

prent, 52167, 67.2%
tekening, 10882, 14.0%
schilderij, 3271, 4.2%
foto, 1768, 2.3%
beeldhouwwerk, 538, 0.7%
bord (vaatwerk), 519, 0.7%
schotel, 491, 0.6%
tekstblad, 488, 0.6%
demonstratiemodel, 365, 0.5%
kom, 337, 0.4%
vaas, 283, 0.4%
miniatuur, 274, 0.4%
werfmodel, 238, 0.3%
boek, 207, 0.3%
schaal (objectnaam), 186, 0.2%
historiepenning, 182, 0.2%
vuursteenpistool, 180, 0.2%
wijnglas, 136, 0.2%
kan, 128, 0.2%
beeld, 116, 0.1%
kop, 115, 0.1%
fles, 108, 0.1%
carte-de-visite, 101, 0.1%
kandelaar, 95, 0.1%
pot, 94, 0.1%
ruit, 94, 0.1%
tsuba, 93, 0.1%
wandtapijt, 91, 0.1%
kop-en-schotel, 76, 0.1%
doos, 72, 0.1%
titelpagina, 67, 0.1%
silhouet, 66, 0.1%
plaat, 60, 0.1%
figuur, 58, 0.1%
instructiemodel, 52, 0.1%
beker, 52, 0.1%
kruik, 52, 0.1%
plaquette, 48, 0.1%
vouwwaaier, 48, 0.1%
bokaal, 47, 0.1%
theepot, 45, 0.1%
tegel, 45, 0.1%
zoutvat, 44, 0.1%
inhoudsmaat droge waren, 42, 0.1%
Indiase miniatuur, 40, 0.1%
terrine, 40, 0.1%
theebus, 35, 0.0%
dekselpot, 33, 0.0%
sleutel, 32, 0.0%
snuifdo

In [12]:
# Creating a subset where there is a maximum to the number of instances per class.
# If a class has more than this maximum, a random sample is picked.
# Moreover, there is a maximum number of classes
max_instances = 1000
num_classes = 15

# Splitting the first 'num_classes' into a set of small enough ones and too big ones:
good_sized_classes = [row[0] for row in hist[:num_classes] if row[1] <= max_instances]
too_big_classes    = [row[0] for row in hist[:num_classes] if row[1] >  max_instances]

# Already adding all instances of 'good_sized_classes':
pairs_subset = [pair for pair in pairs if pair[1] in good_sized_classes]

# Adding 'max_instances' random samples of the too big classes:
for material in too_big_classes:
    # Named 'candidates' to avoid shadowing the built-in all()
    candidates = [pair for pair in pairs if pair[1] == material]
    random.shuffle(candidates)
    pairs_subset += candidates[:max_instances]

# Finally, randomly shuffling the subset:
random.shuffle(pairs_subset)
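
Because the cell above draws from the global random generator without a seed, every run produces a different subset. A minimal sketch of the same capping logic wrapped in a function with an optional seed is given below. The function name `capped_subset`, the `seed` parameter, and the toy data are assumptions added for illustration; they are not part of the original notebook.

```python
import random

def capped_subset(pairs, hist, max_instances, num_classes, seed=None):
    """Keep the 'num_classes' largest classes, at most 'max_instances' each."""
    rng = random.Random(seed)  # private generator, so results are repeatable
    kept_classes = [label for label, _ in hist[:num_classes]]

    subset = []
    for label in kept_classes:
        members = [pair for pair in pairs if pair[1] == label]
        if len(members) > max_instances:
            # Class is too big: draw a random sample without replacement
            members = rng.sample(members, max_instances)
        subset += members

    rng.shuffle(subset)
    return subset

# Toy data, for illustration only: 5 x prent, 2 x foto, 1 x boek
toy_pairs = ([[f"p{i}.jpg", "prent"] for i in range(5)]
             + [[f"f{i}.jpg", "foto"] for i in range(2)]
             + [["b0.jpg", "boek"]])
toy_hist = [["prent", 5], ["foto", 2], ["boek", 1]]

toy_subset = capped_subset(toy_pairs, toy_hist, max_instances=3, num_classes=2, seed=0)
print(len(toy_subset))
```

Calling the function twice with the same seed returns the same subset, which makes the resulting CSV file reproducible.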

In [13]:
hist_subset = createHist(pairs_subset)

print(f"Number of classes: {len(hist_subset)}")
print(f"Dataset size:      {sum([x[1] for x in hist_subset])}")
print(f"Gini coefficient:  {gini([x[1] for x in hist_subset])}")

print("\n------------------\n")

printHist(hist_subset)

Number of classes: 15
Dataset size:      7926
Gini coefficient:  0.3099503742955673

------------------

tekening, 1000, 12.6%
schilderij, 1000, 12.6%
foto, 1000, 12.6%
prent, 1000, 12.6%
beeldhouwwerk, 538, 6.8%
bord (vaatwerk), 519, 6.5%
schotel, 491, 6.2%
tekstblad, 488, 6.2%
demonstratiemodel, 365, 4.6%
kom, 337, 4.3%
vaas, 283, 3.6%
miniatuur, 274, 3.5%
werfmodel, 238, 3.0%
boek, 207, 2.6%
schaal (objectnaam), 186, 2.3%


In [14]:
# Saving the new subset as a csv file:
subsetDf = pd.DataFrame.from_dict({
    "jpg":      [row[0] for row in pairs_subset],
    "material": [row[1] for row in pairs_subset]
})

subsetDf.to_csv("subset_data.csv", index=False)

In [15]:
# Saving the corresponding histogram as well:
subsetHistDf = pd.DataFrame.from_dict({
    "material": [row[0] for row in hist_subset],
    "count":    [row[1] for row in hist_subset]
})

subsetHistDf.to_csv("subset_hist_data.csv", index=False)