# Data Set Cleaning
In this notebook I discard information that is not needed to perform our tasks and bring the data into a format that can be worked with.

Note: Some of the processes in this notebook may be somewhat inefficient or unoptimized, but since this notebook only runs once, that is alright.

Note: Because of the license of the used dataset, it might be the case that the result of this notebook must be useable by other parties. The section "Disclaimer" in the READ.me file applies, as correctness cannot be guaranteed - especially with the data enhancements.

In [None]:
import warnings
import copy
import os
import gc
import pickle

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import ParameterGrid
from tqdm import tqdm

from Project.Utils.Misc.Nlp import NLP
from Project.Utils.TextTagProcessing.GermanWordSplitter import GermanWordSplitter
from Project.Utils.IconclassCache.IconclassCache import IconclassCacheGenerationFailedException, NumberOfTriesExceededException, IconclassCache
from Project.Utils.ConfigurationUtils.ConfigLoader import ConfigLoader
from Project.VisualFeaturesBranches.ObjectDetection.ObjectDetectionNetworkUtils import get_results_path, train_model_on_all_data, get_winning_config_with_augmentations, get_winning_config_without_augmentations

nlp = NLP.instance.nlp
gws = GermanWordSplitter()
cl: ConfigLoader = ConfigLoader.instance

cl.data_cleaning_in_progress = True

database_file = os.path.join(os.getcwd(), 'xmlkultur.xlsx')
assert os.path.exists(database_file), 'The Dataset is not in the expected directory'

image_dir = os.path.join(os.getcwd(), 'images')
assert os.path.exists(image_dir), 'Image directory could not be found.'
df = pd.read_excel(database_file, index_col=0)

# Checking the uniqueness of the identifier
assert len(df['Identifier'].unique()) == df.shape[0], 'Object Id is not a unique identifier'

if not os.path.exists('DatasetCleaningCheckpoints'):
    os.mkdir('DatasetCleaningCheckpoints')

pickle_path_1 = os.path.join('DatasetCleaningCheckpoints', 'pickle_1.pkl')
pickle_path_2 = os.path.join('DatasetCleaningCheckpoints', 'pickle_2.pkl')
pickle_path_3 = os.path.join('DatasetCleaningCheckpoints', 'pickle_3.pkl')
pickle_path_4 = os.path.join('DatasetCleaningCheckpoints', 'pickle_4.pkl')
pickle_path_5 = os.path.join('DatasetCleaningCheckpoints', 'pickle_5.pkl')
pickle_path_6 = os.path.join('DatasetCleaningCheckpoints', 'pickle_6.pkl')
pickle_path_7 = os.path.join('DatasetCleaningCheckpoints', 'pickle_7.pkl')
pickle_path_8 = os.path.join('DatasetCleaningCheckpoints', 'pickle_8.pkl')
pickle_path_9 = os.path.join('DatasetCleaningCheckpoints', 'pickle_9.pkl')
pickle_path_10 = os.path.join('DatasetCleaningCheckpoints', 'pickle_10.pkl')
pickle_path_11 = os.path.join('DatasetCleaningCheckpoints', 'pickle_11.pkl')
pickle_path_12 = os.path.join('DatasetCleaningCheckpoints', 'pickle_12.pkl')

df.head()

## Dropping columns that will not be of much value further on

To inform the selection, I first check the columns for NaNs and look for single-valued columns, which would add no information at all.

In [None]:
df.isnull().any()

In [None]:
tuple(f'{name}: {len(df[name].unique())}' for name in df.columns.values)
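
The same check can be written more compactly with pandas. The following is a minimal sketch on a made-up toy frame (the values are not from the real dataset); it assumes that NaN should count as a value, as it does with `len(df[name].unique())`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'Identifier': ['a', 'b', 'c'],
    'Rights': ['CC', 'CC', 'CC'],
    'Temporal': ['Barock', None, 'Romanik'],
})

# Number of distinct values per column; dropna=False counts NaN as a value,
# which mirrors len(df[name].unique())
distinct_counts = toy.nunique(dropna=False)

# Columns with exactly one distinct value carry no information
single_valued = distinct_counts[distinct_counts == 1].index.tolist()
print(single_valued)
```

On the toy frame only the hypothetical 'Rights' column is reported as single-valued.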

The following columns are dropped right away:
* 'IsShownAt'
    * Unrelated to the task
* 'ObjectType'
    * Single valued column
* 'Object'
    * No longer needed
* 'Format'
    * No longer needed
* 'Dimensions'
    * Unrelated to the task
* 'Location'
    * Unrelated to the task
* 'Rights'
    * Single valued column
* 'CreativeCommons'
    * Single valued column
* 'Publisher'
    * Single valued column
* 'Spatial'
    * Unrelated to the task
* 'Provenance'
    * Unrelated to the task

In [None]:
df = df.drop(
    columns=['IsShownAt', 'ObjectType', 'Object', 'Format', 'Dimensions', 'Location', 'Rights', 'CreativeCommons',
             'Publisher', 'Spatial', 'Provenance'])
df.head()

I check again how many values exist per column:

In [None]:
[f'{name}: {len(df[name].unique())}' for name in df.columns.values]

In [None]:
[[entry for entry in df['CreationDate'] if not entry.isdigit()][i] for i in
 [0, 1, 9, 11, 13, 19, 82, 260, 290, 413, 578, 643, 877, 904]]

This column will be very hard to clean, but cleaning it is necessary in order to put the images into a temporal order and to use them later on for self-supervised learning. The "undatiert" (undated) fields in particular will be hard to substitute. Luckily there is another source of information when it comes to the creation dates: the "Temporal" column will give us at least rough estimates of when a particular work was created. Unfortunately there will still be 55 images left that lack both pieces of information. I will explain how these are dealt with later.

In [None]:
print(df['Temporal'].unique())
print(len(df[df['Temporal'].isna()]))

Above we see the different epochs. There are 134 entries in the dataset which lack the association with a particular epoch.

There is also a typo in 'BIedermeier-Realismus', where the second letter is capitalized. This will be corrected in the next cell, effectively merging the incorrectly written category with the correctly written one.

In [None]:
df.loc[df['Temporal'] == 'BIedermeier-Realismus', 'Temporal'] = 'Biedermeier-Realismus'
print(df['Temporal'].unique())

For the data enhancement of the "Temporal" column, Wikipedia articles were checked to find an approximate year. Please be aware that this labeling is a very rough approximation that is used only on a tiny part of the dataset (<100 instances) and is therefore not very scientific. An art historian will almost certainly disagree with the dates that were picked and might also find faults in the labeling altogether. A single label also carries no information about how long an epoch lasted, which can mean quite a large margin of error for longer epochs. However, these labels are only needed to bring some ordering into the rows that lack a creation date but have an epoch associated with them, and for that task this rough estimate should be sufficient.

A perhaps more accurate way of labeling would be to look up each creator individually, which I will actually do for some of them. The large margin of error only concerns long epochs like the Renaissance, "Romanik" and "Barock". Most creators of these epochs are unknown anyway. For "Barock", however, there are a lot of rows in the database but only a handful of creators. Therefore I will use the lifetimes of the creators to create the labels in these cases. There is also one known creator in "Romanik", whom I also looked up.

The following list gives a short description of why certain dates were chosen. These are necessary in order for a reader to determine whether a mistake (like a typo or miscalculation) was made in the process of the labeling.

* "Kunst um 1900"
    * Will obviously be labeled with 1900
* "Moderne vor 1945"
    * "https://de.wikipedia.org/wiki/Moderne" states that the period started in literature- and arthistory with the beginning of the 19th century, and as style with the end of the 19th century. I therefore take the end of the 19th century, as the style is something that will be appearent in the works of art. However, the label says "Moderne vor 1945" so I roughly average the year to 1925, so that it is later then "Kunst um 1900".
* "Realismus"
    * https://de.wikipedia.org/wiki/Realismus_(Kunst) states that the artworks of this epoch have their beginnings in mid 19th century in Europe. As I understand it, the epoch ended with the early 20th century, so the label chosen is 1860.
* "Historismus"
    * https://de.wikipedia.org/wiki/Historismus states that this epoch dates to the late 19th and early 20th century, therefore  1900 is chosen again.
* "Spätbarrock"
    * https://de.wikipedia.org/wiki/Barock states the years 1700-1730 for "Spätbarrock", but also states that sometimes people use the name also when talking about the epoch "Rokoko" which goes from 1730 to 1760/70. Since there is no epoch called "Rokoko" in our dataset, the assumption is made, that this is also the case in the dataset at hand. Therefore the year 1730 is chosen.
* "Romantik"
    * https://de.wikipedia.org/wiki/Romantik states that the epoch started at the ending of the 18th century and lasted until the far into the 19th century. Therefore the label 1840 is chosen.
* "Bidermeier-Realismus"
    * The difference between Bidermeier and Bidermeier-Realismus was not appearent for me, which is why they share the same date (see further below in this list)
* "Neoklassizismus"
    * https://de.wikipedia.org/wiki/Neoklassizismus_(bildende_Kunst) states the beginning around 1900, but is not so clear about the end of the epoch. As I understand it, this seems to be in the 40s or 50s, which is why the date 1920 is chosen as label.
* "Symbolismus"
    * https://de.wikipedia.org/wiki/Symbolismus_(bildende_Kunst) puts the height of the epoch between 1880 and 1910. Therefore the date 1895 is chosen.
* "Klassizismus"
    * https://de.wikipedia.org/wiki/Klassizismus puts the time of this epoch to approximately 1770 and 1840. The label will therefore be 1805.
* "Barock"
    * https://de.wikipedia.org/wiki/Barock puts the epoch in the timeframe between the end of the 16th century and approximately 1760/70, therefore the label 1680 is used.
* "Stimmungsimpressionismus"
    * https://de.wikipedia.org/wiki/Stimmungsimpressionismus puts this epoch in the years from 1870 to 1900, therefore the label will be 1885.
* "Spätgotik"
    * https://de.wikipedia.org/wiki/Malerei_in_der_Gotik puts the end of this epoch to 1525/30 and https://de.wikipedia.org/wiki/Gotik#Regionale_Verbreitung_und_Weiterentwicklung puts the beginning of it to 1350 (although this year was stated concerning the architecture of buildings!). The label chosen is 1450.
* "Biedermeier"
    * https://de.wikipedia.org/wiki/Biedermeier dates the epoch to 1815-1848, therefore the roughly averaged label is 1835.
* "Moderne nach 1945"
    * https://de.wikipedia.org/wiki/Moderne discusses a "Zweite Moderne" (second "Moderne"), which is, as I understand still ongoing, which is why the label 1980 is chosen.
* "Impressionismus"
    * https://de.wikipedia.org/wiki/Impressionismus_(Malerei) puts the beginning of this epoch into the second half of the 19th century and the epoch seems to go to the beginning of the next century. Therefore the date 1880 is chosen, which puts it a bit earlier than the label of "Stimmungsimpressionismus".
* "Neue Sachlichkeit"
    * https://de.wikipedia.org/wiki/Neue_Sachlichkeit_(Kunst) states that the year 1925 is very important for this epoch and it seems to be right in the middle of its timespan as well, which is why this year was chosen as label.
* "Realismus; Naturalismus"
    * The label lies between the labels I put on "Realismus" (1860, see further up in the list) and "Naturalismus" (1880, see further down in the list); the year 1865 is used.
* "Expressionismus"
    * https://de.wikipedia.org/wiki/Expressionismus puts this epoch into the early 20th century, so the year 1915 is chosen.
* "Hochbarock"
    * https://de.wikipedia.org/wiki/Barock puts this epoch at approximately 1650 to 1700, which gives a midpoint of 1675 as the label.
* "Spätimpressionismus"
    * https://de.wikipedia.org/wiki/Post-Impressionismus states that "Spätimpressionismus" and "Postimpressionismus" are synonyms which is why I put the same label on it as in "Postimpressionismus" (see further below in the list).
* "Renaissance"
    * https://de.wikipedia.org/wiki/Renaissance puts the most important time of this epoch in the 15th and 16th century, which is why I choose the label 1500.
* "Klassische Moderne"
    * https://de.wikipedia.org/wiki/Moderne has a quote from "Detlev Peukert: Max Webers Diagnose der Moderne. Vandenhoeck & Ruprecht, Göttingen 1989, ISBN 3-525-33562-8" where he states that the time of the Weimar Republic was the height of the "Klassische Moderne", which is why the year 1925 was chosen in our labeling (the same label that was put on "Moderne vor 1945").
* "Naturalismus"
    * https://de.wikipedia.org/wiki/Naturalismus_(bildende_Kunst) puts this epoch from approximately 1870 to 1890, therefore the label will be 1880.
* "Postimpressionismus"
    * https://de.wikipedia.org/wiki/Post-Impressionismus puts this epoch between 1880 and 1905, which is why it gets the label 1890.
* "Frühbarock"
    * https://de.wikipedia.org/wiki/Barock puts the epoch to approximately 1650.
* "Neuromantik"
    * https://de.wikipedia.org/wiki/Neuromantik puts this epoch at 1890 to 1915 (although it refers to the literary epoch). I therefore label it with 1905.
* "Hochrenaissance"
    * https://de.wikipedia.org/wiki/Hochrenaissance puts the epoch in approximately 1500 to 1530, which is why 1515 was chosen as a label.
* "Klassizismus; Biedermeier-Realismus"
    * Since "Klassizismus" is labeled with 1805 and "Biedermeier-Realismus" with 1835, the chosen label is the average of 1820.
* "Spätromantik"
    * https://de.wikipedia.org/wiki/Sp%C3%A4tromantik puts this epoch in the time between 1815 and 1848, which is why 1830 is chosen as the label.
* "Romanik"
    * https://de.wikipedia.org/wiki/Romanik states that the beginning of this epoch is around 950/60 and that the end depends on the country, but is in the 13th century at the latest. It should be mentioned that here again, the years in the article refer to the architecture. 1100 is chosen as label.
* "Historismus; Naturalismus"
    * Since the label for "Historismus" is 1900 and the label for "Naturalismus" is 1880, the average is chosen as new label, which is 1890.
* "Realismus; Symbolismus"
    * Since the label for "Realismus" is 1860 and the label for "Symbolismus" is 1895, the rough average is chosen as new label, which is 1875.
* "Frühgotik"
    * https://de.wikipedia.org/wiki/Gotik puts this epoch between 1130 and 1180 which is why the label 1155 is chosen.
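
To make it easier to spot a typo or miscalculation in the list above, the quoted spans can be compared with the chosen labels. The following is a minimal sketch that only covers the epochs for which the list states an explicit start and end year; the function name `label_deviations` is an illustrative choice and not part of the project code.

```python
# Spans (start year, end year) as quoted in the list above
epoch_spans = {
    'Symbolismus': (1880, 1910),
    'Klassizismus': (1770, 1840),
    'Stimmungsimpressionismus': (1870, 1900),
    'Biedermeier': (1815, 1848),
    'Naturalismus': (1870, 1890),
    'Postimpressionismus': (1880, 1905),
    'Hochrenaissance': (1500, 1530),
    'Frühgotik': (1130, 1180),
}

# Labels chosen for these epochs in the list above
chosen_labels = {
    'Symbolismus': 1895, 'Klassizismus': 1805, 'Stimmungsimpressionismus': 1885,
    'Biedermeier': 1835, 'Naturalismus': 1880, 'Postimpressionismus': 1890,
    'Hochrenaissance': 1515, 'Frühgotik': 1155,
}


def label_deviations(spans, labels):
    """Return, per epoch, how far the label lies from the exact midpoint of its span."""
    return {name: labels[name] - (start + end) / 2 for name, (start, end) in spans.items()}


deviations = label_deviations(epoch_spans, chosen_labels)

# Epochs whose label falls outside the quoted span would point to a mistake
outside_span = [name for name, (start, end) in epoch_spans.items()
                if not start <= chosen_labels[name] <= end]

print(deviations)
print(outside_span)
```

Small non-zero deviations are expected, since several labels were rounded rather than averaged exactly.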

Artists in these categories are mostly unknown, as discussed above:

In [None]:
df[df['Temporal'] == 'Renaissance']

In [None]:
df[df['Temporal'] == 'Romanik']

7 creators from the epoch of "Barock" are known, as can be seen in the cell below. I will look them up together with the other creators further below.

In [None]:
print(len(df[df['Temporal'] == 'Barock']))
print(df[df['Temporal'] == 'Barock']['Creator'].unique())

In [None]:
temporal_to_year = {
    'Kunst um 1900': 1900,
    'Moderne vor 1945': 1925,
    'Realismus': 1860,
    'Historismus': 1900,
    'Spätbarock': 1730,
    'Romantik': 1840,
    'Biedermeier-Realismus': 1835,
    'Neoklassizismus': 1920,
    'Symbolismus': 1895,
    'Klassizismus': 1805,
    'Barock': 1680,
    'Stimmungsimpressionismus': 1885,
    'Spätgotik': 1450,
    'Biedermeier': 1835,
    'Moderne nach 1945': 1980,
    'Impressionismus': 1880,
    'Neue Sachlichkeit': 1925,
    'Realismus; Naturalismus': 1865,
    'Expressionismus': 1915,
    'Hochbarock': 1675,  # midpoint of approx. 1650 to 1700
    'Spätimpressionismus': 1890,
    'Renaissance': 1500,
    'Klassische Moderne': 1925,
    'Naturalismus': 1880,
    'Postimpressionismus': 1890,
    'Frühbarock': 1650,
    'Neuromantik': 1905,
    'Hochrenaissance': 1515,
    'Klassizismus; Biedermeier-Realismus': 1820,
    'Spätromantik': 1830,
    'Romanik': 1100,
    'Historismus; Naturalismus': 1890,
    'Realismus; Symbolismus': 1875,
    'Frühgotik': 1155,
    np.nan: -1
}

There are rows which have no entry in "CreationDate". Two ways to handle them are apparent:
* Option 1: There are 20 unique creators among the rows in question, and I could label the images with the midpoint of each creator's lifetime. For the entries where the artist is not known either, I would then use option 2.
* Option 2: Look up which collections these images were part of and average their epoch labels (which I have already created).

Option 1 has the advantage that it would be fairly close to the real date, so I will go with it wherever possible. Since I am looking at hundreds of years of art (and my knowledge of art is very limited at this point), I am susceptible to confusing artists with the same name. Therefore only sammlung.belvedere.at is used as a source, and if the information cannot be gathered there, option 2 is used. The perhaps stronger argument behind this is that if the art experts who labeled these images could not find a date, then a date found with one quick Google search is probably not the right one.

All resources in this list were accessed on 15.08.2021.

The years between birth and death will be roughly averaged for the labeling.

* E. Guenther
    * https://sammlung.belvedere.at/people/679/e-guenther -> 19th century -> Therefore we take 1850 as a date
* Anton Hans Karlinsky
    * https://sammlung.belvedere.at/people/1003/anton-hans-karlinsky/ -> 1872 - 1945
* Franz Schmied
    * https://sammlung.belvedere.at/people/objects/2028 -> 1796 - 1851
* Joseph Hasslwander
    * https://sammlung.belvedere.at/people/747/joseph-hasslwander/ -> 1812 - 1878
* Franz Hoffmann
    * https://sammlung.belvedere.at/people/8985/franz-hoffmann/ -> Nothing, therefore option 2 will be used
* P. Dupin
    * https://sammlung.belvedere.at/people/6320/pierre-dupin -> Nothing
* Joseph Lavos
    * https://sammlung.belvedere.at/people/1243/joseph-lavos -> 1807 - 1848
* Eduard Gurk
    * https://sammlung.belvedere.at/people/687/eduard-gurk -> 1801 - 1841
* Carl Heindel
    * https://sammlung.belvedere.at/people/774/carl-heindel -> 1810 - 1869
* Viktor Scharf
    * https://sammlung.belvedere.at/people/1980/viktor-scharf -> 1872 - 1943 (?)
* Friedrich Hasslwander
    * https://sammlung.belvedere.at/people/746/friedrich-hasslwander/objects? -> 1840 - 1914
* Leopold Pollak
    * https://sammlung.belvedere.at/people/1736/leopold-pollak -> 1806 - 1880
* Balthasar Moncornet
    * https://sammlung.belvedere.at/people/6318/balthasar-moncornet -> approximately 1600 to 1668
* Bernhard Reinhold
    * https://sammlung.belvedere.at/people/1832/bernhard-reinhold -> 1824 - 1892
* Peter Schenk der Ältere
    * https://sammlung.belvedere.at/people/7980/peter-schenk-der-altere -> 1660 - 1730
* Francesco Milani
    * https://sammlung.belvedere.at/people/1449/francesco-milani -> Nothing
* Erasmus von Engert
    * https://sammlung.belvedere.at/people/434/erasmus-von-engert/objects? -> 1796 - 1871
* Lore Scheid
    * https://sammlung.belvedere.at/people/1990/lore-scheid/objects? -> 1884 - 1946
* Felice Zuliani
    * https://sammlung.belvedere.at/people/2647/felice-zuliani/objects? -> (?) - 1834 -> The label will be 1810
* Johann Peter Krafft
    * https://sammlung.belvedere.at/people/1136/johann-peter-krafft/objects? -> 1780 - 1856

As mentioned above, I also want to look up the artists from the period of "Barock" since the significant length of the epoch would mean that the average error would be rather high.

* Salomon Kleiner
    * https://sammlung.belvedere.at/people/1055/salomon-kleiner -> 1700 - 1761
* Jakob Kellner
    * https://sammlung.belvedere.at/people/6545/jakob-kellner -> ? - 1775 -> The label will be 1750
* Simon Hueter
    * https://sammlung.belvedere.at/people/888/simon-hueter -> Nothing
* Johann Baptist Gumpp
    * https://sammlung.belvedere.at/people/8065/johann-baptist-gumpp -> 1651 -1728
* Jeremias Jakob Sedelmayr
    * https://sammlung.belvedere.at/people/7764/jeremias-jakob-sedelmayr -> 1706 - 1761
* Martin van Meytens d. J.
    * https://sammlung.belvedere.at/people/1435/martin-van-meytens-d-j/objects? -> 1695-1770
* Martino Altomonte
    * https://sammlung.belvedere.at/people/32/martino-altomonte -> 1657/1659 – 1745

For the one artist in Romanik that is actually known I have also looked up the date:
* Meister I.P.
    * https://sammlung.belvedere.at/people/5763/meister-i-p -> Mentions that the artist created art around 1520 / 1540 -> 1530 is used therefore.
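
The rough averaging of birth and death years used in the lists above can be sketched as follows. This is only an illustration: the helper `lifetime_midpoint` is a hypothetical name, and the sketch assumes that a creator without both dates gets the sentinel -1, which is the value the notebook uses for "no year available".

```python
from typing import Optional

UNKNOWN = -1  # sentinel used in this notebook for "no year available"


def lifetime_midpoint(birth: Optional[int], death: Optional[int]) -> int:
    """Midpoint of a lifetime, or UNKNOWN if either year is missing."""
    if birth is None or death is None:
        return UNKNOWN
    return (birth + death) // 2


# Lifetimes as listed above; None marks creators for whom no dates were found
lifetimes = {
    'Eduard Gurk': (1801, 1841),
    'Leopold Pollak': (1806, 1880),
    'Balthasar Moncornet': (1600, 1668),
    'Franz Hoffmann': (None, None),
}

midpoints = {name: lifetime_midpoint(*years) for name, years in lifetimes.items()}
print(midpoints)
```

Several labels in this notebook are rounded to a nearby multiple of five instead, and creators with only one known date (such as Felice Zuliani) were labeled by hand, so this helper does not reproduce every entry.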

In [None]:
creator_to_year = {
    'E. Guenther': 1850,
    'Anton Hans Karlinsky': 1910,
    'Franz Schmied': 1825,
    'Joseph Hasslwander': 1845,
    'Franz Hoffmann': -1,
    'P. Dupin': -1,
    'Joseph Lavos': 1830,
    'Eduard Gurk': 1821,
    'Carl Heindel': 1840,
    'Viktor Scharf': 1907,
    'Friedrich Hasslwander': 1877,
    'Leopold Pollak': 1843,
    'Balthasar Moncornet': 1634,
    'Bernhard Reinhold': 1858,
    'Peter Schenk der Ältere': 1695,
    'Francesco Milani': -1,
    'Erasmus von Engert': 1835,
    'Lore Scheid': 1915,
    'Felice Zuliani': 1810,
    'Johann Peter Krafft': 1818,
    # Artists from the epoch of "Barock":
    'Salomon Kleiner': 1730,
    'Jakob Kellner': 1750,
    'Simon Hueter': -1,
    'Johann Baptist Gumpp': 1690,
    'Jeremias Jakob Sedelmayr': 1733,
    'Martin van Meytens d. J.': 1730,
    'Martino Altomonte': 1700,
    # Artists from the epoch of "Romanik":
    'Meister I.P.': 1530,
    # Unknown artists:
    'Unbekannter Künstler': -1,
    'Unbekannter Stecher': -1
}

Below the few examples that will still lack a label can be seen. Interestingly they all come from the same 2 collections.

In [None]:
df[(df['CreationDate'] == 'undatiert') & (df['Temporal'].map(temporal_to_year) == -1) & (
        df['Creator'].map(creator_to_year) == -1)][['CreationDate', 'Temporal', 'Creator', 'Collection']]

We can see that all the rows that are still left are part of either "Klassizismus | Romantik | Biedermeier ## Classicism | Romanticism | Biedermeier" or "Barock ## Baroque". I have previously labeled "Romantik" with the year 1840, "Biedermeier" with the year 1835 and "Klassizismus; Biedermeier-Realismus" with 1820. For that reason I will put the label 1830 on the objects that are still not labeled and that are part of the first collection. For the still unlabeled objects from the second collection I will use the label I previously assigned to "Barock", which is 1680.

First however I correct two typos that were found in this column:

In [None]:
df.loc[df['CreationDate'] == '1902 um', 'CreationDate'] = 'um 1902'
df.loc[df['CreationDate'] == 'um 1800 /1820', 'CreationDate'] = 'um 1800/1820'

With all this out of the way, I am ready to create the new column:

In [None]:
df[df['Creator'] == 'Adolf Hirémy-Hirschl']

In [None]:
df.dtypes

In [None]:
df['YearEstimate'] = ''
to_update = dict()

for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
    current = row['CreationDate']
    result = 0

    # Treating special cases first
    if current == "undatiert":

        # For "Barock" artists I use the date information from the artists as discussed above
        # Note that I only want to do this if we can actually associate the creator to a year
        if row['Temporal'] == "Barock" and creator_to_year[row['Creator']] != -1:
            result = creator_to_year[row['Creator']]
        
        # The same goes for the epoch of "Romanik"
        elif row['Temporal'] == "Romanik" and creator_to_year[row['Creator']] != -1:
            result = creator_to_year[row['Creator']]
        
        # Most of the images without a date can be matched with an epoch
        elif (temporal_year := temporal_to_year[row['Temporal']]) != -1:
            result = temporal_year

        # For some the authors had to be looked up
        elif (creator_year := creator_to_year[row['Creator']]) != -1:
            result = creator_year

        # For a few the labels have to be based solely on what collection they were part of
        elif row['Collection'] == 'Klassizismus | Romantik | Biedermeier ## Classicism | Romanticism | Biedermeier':
            result = 1830
        elif row['Collection'] == 'Barock ## Baroque':
            result = 1680
        else:
            raise Exception(f'Row is incorrectly handled\n{row}')

    # Now I take care of 7 particular entries that are written in an uncommon manner, so that breaking them
    # down programmatically would not make a lot of sense.
    elif current == '1778 oder 1788':
        result = 1783
    elif current == 'Ende 12. Jahrhundert / um 1200':
        result = 1200
    elif current == 'Ende 1878 oder 1880':
        result = 1879
    elif current == '1878 oder 1880':
        result = 1879
    # If this means 19th century then it is possible that this is mislabeled since the 19th century would span the time
    # between 1800 and 1900.
    elif current == 'Ende 19.–Anfang 2000':
        result = 2000
    # Same situation here
    elif current == '19.–Anfang 2000':
        result = 1975
    elif current == 'Ende 19.–Anfang 20. Jahrhundert':
        result = 1900

    # Now to the "general" cases
    else:

        # Some of the list elements appear only once, but the difference to the special cases (further below) is that
        # they do not influence the result variable, because they are not keywords like "Anfang" (meaning "beginning").
        # Therefore they can be truncated

        # Note: The order is important: Combinations like "wohl um " are also removed
        leading = ['wohl ', 'um ', 'vor ', 'nach ', 'spätestens ', 'dem Sommer ']

        trailing = [' ?', ' (?)', 'er Jahre', ' (Druck: 1798)', ' (Druck: 1796)', ' oder bald danach',
                    ' (Guss: 1929/1930)', ' um', ' (nach 1483?)', ' (Guss: 1943)', ' (vollendet 1909)',
                    ' (geringfügige Ergänzungen 1907)', ' (Guss 1927)', ' (Nachguss: 1980)',
                    ' (1931 signiert)', ', 1924 überarbeitet']

        for l in leading:
            if current.startswith(l):
                current = current[len(l):]

        for t in trailing:
            if current.endswith(t):
                current = current[:-len(t)]

        # e.g. 19. Jahrhundert -> 1800
        jahrhundert_correction = 0
        if current.endswith('. Jahrhundert'):
            current = current[0:-13] + "00"
            jahrhundert_correction = 100
        elif current.endswith('. Jahrhunderts'):
            current = current[0:-14] + "00"
            jahrhundert_correction = 100

        # Now I add a certain amount of years based on the keywords that precede the year
        leading_keywords = {
            'Anfang ': 10,
            'Ende der ': 90,
            'Ende ': 90,
            '2. Hälfte des ': 75,
            '4. Viertel ': 90,
            '2. Viertel ': 40,
            'Letztes Drittel ': 80
        }

        for key, value in leading_keywords.items():
            if current.startswith(key):
                current = current[len(key):]
                result = value
                break

        # e.g. 1900
        if len(current) == 4 and current.isdigit():
            result += int(current)
            result -= jahrhundert_correction
        # e.g. 1900-1901 or 1900/1901
        elif len(current) == 9 and current[0:4].isdigit() and current[5:].isdigit():
            # Both years are checked in full before they are converted
            assert jahrhundert_correction == 0
            result = int((int(current[0:4]) + int(current[5:])) / 2)
        # e.g. 1900/01
        elif len(current) == 7 and current[0:4].isdigit() and current[5:].isdigit():
            assert jahrhundert_correction == 0
            # The two-digit suffix is interpreted within the century of the first year
            result = int(int(current[0:2]) * 100 + int(int(current[2:4]) + int(current[5:])) / 2)
        else:
            raise Exception(f'Row not taken care of: \n{row}')

    df.loc[ind, 'YearEstimate'] = result
df.head()
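
The two general conventions used in the cell above (keyword offsets within a century, and the midpoint of a year range) can also be expressed with regular expressions. The following is a simplified sketch and not a replacement for the cell: the function names are hypothetical, only three of the keywords are included, and it assumes that a two-digit suffix such as "01" in "1900/01" lies in the same century as the first year.

```python
import re

# Offsets within a century, taken from the cell above (e.g. "Ende" -> +90 years)
KEYWORD_OFFSETS = {'Anfang ': 10, 'Ende ': 90, '2. Hälfte des ': 75}


def century_estimate(text: str) -> int:
    """Estimate a year for expressions like 'Ende 19. Jahrhundert'."""
    offset = 0
    for keyword, value in KEYWORD_OFFSETS.items():
        if text.startswith(keyword):
            text, offset = text[len(keyword):], value
            break
    match = re.fullmatch(r'(\d{1,2})\. Jahrhunderts?', text)
    if match is None:
        raise ValueError(f'Not a century expression: {text!r}')
    # The 19th century starts in 1800, hence the "- 1"
    return (int(match.group(1)) - 1) * 100 + offset


def range_midpoint(text: str) -> int:
    """Midpoint (rounded down) of ranges like '1800/1820', '1900-1901' or '1900/01'."""
    match = re.fullmatch(r'(\d{4})[/-](\d{2}|\d{4})', text)
    if match is None:
        raise ValueError(f'Not a year range: {text!r}')
    start, end = match.group(1), match.group(2)
    if len(end) == 2:
        # Assumption: the short form stays within the century of the first year
        end = start[:2] + end
    return (int(start) + int(end)) // 2


print(century_estimate('Ende 19. Jahrhundert'))
print(range_midpoint('1800/1820'))
```

Inputs that match neither pattern raise a `ValueError`, in the same spirit as the exception raised in the cell above for rows that are not taken care of.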

There are now no undefined values left in the column:

In [None]:
len(df[(df['YearEstimate'].isin([-1, '', 0])) | (df['YearEstimate'].isna())])

I also check if the Barock artists are labeled correctly:

In [None]:
df[df['Temporal'] == 'Barock'].groupby('Creator').first()

There are still a few columns left where it is not clear whether they should be in the dataset or not. I will take a closer look at them now.

In [None]:
print(df['ObjectClass'].unique())

The "ObjectClass" column could prove to be interesting in further processing. For example, a "Landkarte" (map) would likely fill the whole space of the picture, while a 'Skulptur' (sculpture) is likely placed in front of a background, so differentiating between such categories could be beneficial, e.g. for self-supervised learning.

In [None]:
print(df['MaterialTechnique'].unique()[0:50])
print(len(df['MaterialTechnique'].unique()))

Although the "MaterialTechnique" column has the power to indicate visual similarities, the small excerpt from the dataset shows already that the cleaning of this column would be hours of work. Furthermore, the aim of the project is not so much finding visually similar objects, but rather contextually similar ones, which means that focusing too much on the former would not help a lot in reaching the goal. Therefore this column should be dropped. However, there is a chance that the distribution of the values allows that the number of classes can be greatly reduced. Therefore this column is left in for now.
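
Whether the distribution allows such a reduction can be checked by looking at how many of the most frequent values are needed to cover most rows. The following is a minimal sketch on made-up stand-in values; the 80 % threshold is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Made-up stand-in values; the real column has far more distinct entries
techniques = pd.Series(['Öl auf Leinwand'] * 6 + ['Bronze'] * 3 + ['Kupferstich'])

# Row counts per value, most frequent first, and the share of rows covered so far
counts = techniques.value_counts()
coverage = counts.cumsum() / counts.sum()

# Smallest number of classes that covers at least 80 % of the rows
classes_needed = int((coverage < 0.8).sum()) + 1

print(coverage)
print(classes_needed)
```

If a small number of classes already covers most rows, the remaining values could be grouped into a single "other" class.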

In [None]:
print(df['Collection'].unique())

It is apparent that objects that are part of the same collection are likely related to each other in some way. However, our learning algorithm should not learn to classify the objects based on that. This would essentially be the same as learning to group objects simply based on the fact that people have done it previously. While that would not technically be wrong and might even be leveraged for some machine learning task, it would distract from the real goal of grouping individual images based on their contextual information. Therefore this column should be dropped.

In [None]:
df = df.drop(columns=['Collection'])

## For each row a picture and vice versa
If everything was downloaded correctly, the image folder should only contain PNG files, and every one of them should be related to a row in the dataset.
We will also drop the rows whose images are no longer available from the website.

In [None]:
rows_without_image = [row['Identifier'] for _, row in tqdm(df.iterrows(), total=df.shape[0]) if
                      not os.path.isfile(os.path.join(image_dir, row['Identifier'] + '.png'))]
print(f'There are {len(rows_without_image)} entries without images associated to them.')

# Checking that only png files are in the directory and that every file is matched with an entry in the dataset
# Note that this raises an Exception since the problem cannot be handled by simple data cleaning
for f in os.listdir(image_dir):
    assert os.path.isfile(os.path.join(image_dir, f)) and f.endswith('.png'), f'Unexpected finding (either folder or file that is not a png): {f}'
    assert (df['Identifier'] == f[:-4]).any(), f'No matching identifier was found in the dataset for the picture {f}'

In [None]:
shape_before = df.shape
df = df[~df['Identifier'].isin(rows_without_image)]
print(f'Shape before: {shape_before}\nShape after: {df.shape}')
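
The two-way check between rows and image files can also be expressed with set operations, which avoids scanning the whole column once per file. The following is a minimal sketch that works on plain lists of identifiers and file names; the function name `compare_rows_and_images` and the example values are hypothetical.

```python
def compare_rows_and_images(identifiers, filenames, extension='.png'):
    """Return (identifiers without an image, files without a matching row)."""
    ids = set(identifiers)
    # File names with the expected extension, reduced to their identifier part
    stems = {name[:-len(extension)] for name in filenames if name.endswith(extension)}
    rows_without_image = sorted(ids - stems)
    # A file is an orphan if it has the wrong extension or matches no identifier
    orphan_files = sorted(name for name in filenames
                          if not name.endswith(extension) or name[:-len(extension)] not in ids)
    return rows_without_image, orphan_files


missing, orphans = compare_rows_and_images(
    ['1', '2', '3'],
    ['1.png', '3.png', '9.png', 'notes.txt'],
)
print(missing)
print(orphans)
```

In the notebook, the inputs would be `df['Identifier']` and `os.listdir(image_dir)`.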

I previously checked that the "ExpertTags" column was non-empty. However, many rows only hold an empty array as an entry:

In [None]:
print(len(df[df['ExpertTags'] == '[]']))
print(len(df[df['Description'].isna()]))
print(len(df[(df['ExpertTags'] == '[]') & (df['Description'].isna())]))
print(len(df[(df['ExpertTags'] == '[]') & (df['Description'].isna()) & (df['Title'].isna())]))

If we look at the output of the code below, we see that there are two problems with the descriptions. First, the short ones are often references to the artist's life or to other works of art. Second, it is unreasonable to assume that a learner could decide what information concerns the art itself and what is just side information.

In [None]:
for i in range(0, 300, step := 100):
    print(f'---------------- Descriptions with length {i} to {i + step} ----------------')
    np.random.seed(1)
    np.random.shuffle(current := [entry for entry in df[~df['Description'].isna()]['Description'].unique() if
                                  i < len(entry) < i + step])
    print(*current[0:5], '\n\n\n\n', sep='\n\n')

However, we also see that square brackets are used to denote references, and those entries seem to be primarily descriptions of the objects themselves. If I filtered out all the others, I would lose 425 descriptions, which is more than 10 % of what is still left of the dataset. Still, it is necessary, because having no additional data is better than adding potentially misleading tags.

In [None]:
useable = [entry for entry in df[~df['Description'].isna()]['Description'].unique() if
           ']' in entry and 'Inv.' not in entry]

print(f'{len(useable)} useable entities.\nExamples from useable entries:')
print(*[useable[i] for i in range(1, 101, 10)], sep='\n\n')


different = [u for u in useable if not u.strip().endswith(']')]
print(len(different))
print(*[different[i][0:500] + '...' for i in range(1, 16, 3)], sep='\n\n')

useable_not_unique = len([entry for entry in df[~df['Description'].isna()]['Description'] if
           ']' in entry and 'Inv.' not in entry])
unusable_not_unique = len(df[~df['Description'].isna()]['Description']) - useable_not_unique
print(f'{useable_not_unique} descriptions are left from the original {useable_not_unique + unusable_not_unique} ({unusable_not_unique} were filtered out).')

As can be seen from what is shown in the cell above, the 'useable' entries talk about the images themselves. The 'unusable' entries often talk about the life of the artist, etc. which is too loosely correlated with the images for our task.

As can also be seen in the cell above, there are also 15 entries that contain the symbol ']' that are not in the expected format. Nevertheless they are left in, since they, for the most part, also describe the images themselves.

Another problem becomes apparent here: if I generate labels from these 375 'useable' entities and later match images based on their labels, these 375 images would appear significantly more often, simply because they have so many more opportunities to match. This did in fact become a problem and is discussed further in the notebook called 'matching'.

This could be alleviated by weighting each label with 1/(number of labels for entity), but then entities with a greater number of labels would hardly ever be matched, because they require a large overlap. For example, a 299/500 overlap would lose against a 3/5 overlap, even though the former would likely be the better candidate.
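
The arithmetic behind this example can be checked quickly. The function name `normalized_overlap` is made up for this illustration and is not used anywhere in the project:

```python
def normalized_overlap(shared_labels: int, total_labels: int) -> float:
    """Score in which every label carries the weight 1 / total_labels."""
    return shared_labels / total_labels


small = normalized_overlap(3, 5)
large = normalized_overlap(299, 500)
print(f'3/5 -> {small:.3f}, 299/500 -> {large:.3f}')
print('small entity wins' if small > large else 'large entity wins')
```

With this weighting, the entity with only five labels scores marginally higher, although it shares far fewer labels in absolute terms.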

I could also rank labels based on their origins, so I will add them to have this option in the future. In order to stay flexible later on, I will not save the weights directly, but instead just denote their column of origin:

- 'Exp' - Origin from column 'ExpertTags'
- 'Title' - Origin from column 'Title'
- 'Des' - Origin from column 'Description'
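
To illustrate why recording only the origin keeps things flexible, here is a minimal sketch in which weights are attached at matching time. The dictionary `ORIGIN_WEIGHTS`, its values and the function `weight_tags` are assumptions made for this example; the tuple layout (tag, origin, source text) follows the tag generation in this notebook:

```python
# Hypothetical weights per column of origin; the values are placeholders, not tuned
ORIGIN_WEIGHTS = {'Exp': 1.0, 'Title': 0.5, 'Des': 0.25}


def weight_tags(tag_tuples, origin_weights):
    """Attach a weight to each (tag, origin, source_text) tuple at matching time."""
    return [(tag, origin, origin_weights[origin]) for tag, origin, _ in tag_tuples]


example = [('Landschaft', 'Exp', 'Landschaft'), ('Baum', 'Title', 'Baum am Fluss')]
print(weight_tags(example, ORIGIN_WEIGHTS))
```

Changing the ranking later then only requires a different weight dictionary, not a new run of the data cleaning.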

Note: Later it was decided that the NLP library 'spacy' will be used, which features a method to calculate similarity between strings. This essentially solves the problem as tags can also negatively impact the similarity. Recording the origin of the tags might come in handy later on nonetheless.

In [None]:
if os.path.exists(pickle_path_1):
    df = pd.read_pickle(pickle_path_1)
else:
    df['GeneratedTags'] = ''


    def extract_tags(cell, marker, from_cell):
        articles = ['der', 'des', 'dem', 'den', 'die', 'das', 'des', 'ein', 'eines', 'einem', 'einen', 'einer', 'dieser',
                    'diese', 'diesem', 'diesen', 'dieses', 'jener', 'jenes', 'jenem', 'jenen', 'er', 'sie', 'es', 'wir',
                    'ihr']

        if pd.isnull(cell):
            return []
        cur_tags = []
        doc = nlp(cell)
        for nc in doc.noun_chunks:
            current_tag = []
            words = nc.text.split(' ')
            for w in words:
                if w not in nlp.Defaults.stop_words and w.lower() not in articles and len(w.strip()) > 0:
                    current_tag.append(w.strip())
            if len(current_tag) > 0:
                cur_tags.append((' '.join(current_tag), marker, from_cell))
        return cur_tags

    for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
        generated = []
        generated += extract_tags(row['Description'], 'Des', row['Description'])
        generated += extract_tags(row['Title'], 'Title', row['Title'])

        expert_tags = [t for t in row['ExpertTags'][2:-2].split('\', \'') if t != '']

        for tag in expert_tags:
            generated.append((tag, 'Exp', tag))

        df.loc[ind, 'GeneratedTags'] = generated
    # Checkpoint
    df.to_pickle(pickle_path_1)

Since the ExpertTags column is now fully contained in the newly generated column, I would not need the associated column anymore.
However, later on in the project the unprocessed Expert Tag labels are needed for lookups in the iconclass hierarchy, which they are based upon. Therefore they will be left in.

I leave the Description in as well, since showing it next to the images later, possibly through a web frontend, would add to the user experience.

I could have also included the 'Temporal' column as tags, but experiments have shown that not all of the tags would be recognized by the library.
This would lead to the unwanted behaviour that for some works of art the epochs are considered and for others they are not.
Instead I will keep this column for self-supervised learning later.

In [None]:
def write_top_1000_to_file(name:str):
    all_tags = dict()
    for df_ind, df_row in tqdm(df.iterrows(), total=df.shape[0]):
        for gt in df_row['GeneratedTags']:
            if gt not in all_tags.keys():
                all_tags[gt] = [gt[0], 0]
            all_tags[gt][1] += 1
    sorted_tags = sorted([(all_tags[at][1], all_tags[at][0]) for at in all_tags])[::-1]

    with open(name, 'w+', encoding="utf-8") as file:
        file.write('\n'.join([st[1] for st in sorted_tags][0:1000]))
write_top_1000_to_file('all_tags.txt')

Let us also take a look at the three cells below. Two problems become apparent:
- Tags can include other tags (like in 'Emilie' and 'Emilie von')
    - This has to be taken care of separately, since the words are lemmatized in the processes below, which will cause more tags to contain each other again.
- The entity recognition does not work so well on single words
    - I need to find the entities on the whole tag text and then filter them out in a separate step

Furthermore, persons are generally not that interesting to the context, whereas Locations, Misc and Organizations likely add more context. For that reason the latter are left in.

In [None]:
[t for t in df[df['Identifier'] == '3613']['GeneratedTags']]

In [None]:
print(nlp('Emilie').ents)
print(nlp('Emilie isst einen Apfel.').ents)

In [None]:
print(nlp('tschechischer').ents[0].label_)
print(nlp(nlp.tokenizer('tschechischer')[0].lemma_).ents[0].label_)
print(nlp('Ceska Krajiná').ents)
print(nlp(nlp.tokenizer('Ceska Krajiná')[0].lemma_).ents[0].label_)

In [None]:
if os.path.exists(pickle_path_2):
    df = pd.read_pickle(pickle_path_2)
else:
    df = pd.read_pickle(pickle_path_1)
    for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
        current_tags = []
        for tag_tuple in row['GeneratedTags']:
            tag = tag_tuple[0]

            if tag.strip().lower() in ['bild', 'teil', 'seitenansicht', 'profil', 'gemälde', 'darstellung', 'radierungen',
                                       'halbprofil', 'künstler', 'dreiviertelprovil', 'bild im bild', 'stichvorlage', 'etc',
                                       'dargestellten', 'bildes', 'bilder', 'min', 'max', 'vgl', 'ausdruck', 'stillleben',
                                       'stil', 'seite', 'url', 'szene aus']:
                continue

            # Skip tags that contain number or inventory references
            if any(p in tag.lower() for p in ['nr.', 'inv.', 'inv ']):
                continue

            # Persons are filtered out, except for a few that are relevant to the content
            to_filter = [str(e) for e in nlp(tag).ents if e.label_ == 'PER' and 'christus' not in e.text.lower() and 'könige' not in e.text.lower()]

            temp = ' '.join([''.join([l for l in word if l.isalpha()]).strip() for word in tag.split(' ')])
            for f in to_filter:
                # str.replace returns a new string, so the result has to be assigned
                temp = temp.replace(f, '')
            tag = temp
            current_words = ''
            for cur_word in tag.split(" "):
                word = cur_word
                if word.lower() == 'hl' or word.lower() == 'hll':
                    # Note that it does not matter what word ending is used, because of the later lemmatization
                    word = 'heiliger'

                word = ''.join([l for l in word if l.isalpha()]).strip()

                #filter out all special characters and numbers
                if len(word) == 0:
                    continue

                # words that are supposed to be written with a lower letter at the beginning sometimes have a capital letter
                # at the beginning. Since these words are then not represented by the model, the lemmatization would not work
                if sum(nlp(word).vector) == 0 and sum(nlp(word.lower()).vector) != 0:
                    word = word.lower()
                if word == 'Paar' or word == 'Herde':  # These words get lemmatized wrong
                    current_words += word + ' '
                else:
                    current_lemma = nlp.tokenizer(word)[0].lemma_
                    if nlp.vocab[current_lemma].rank < nlp.vocab[word].rank:
                        current_words += current_lemma + ' '
                    else:
                        current_words += word + ' '
            if len(current_words) > 0:
                current_tags.append((current_words.strip(), tag_tuple[1], tag_tuple[2]))
        # Because of the steps above it can happen that different words result in the same tag. Therefore we take the set.
        df.loc[ind, 'GeneratedTags'] = list(set(current_tags))
        
    # Checkpoint
    df.to_pickle(pickle_path_2)

In [None]:
list(df[df['Identifier'] == '1093a-p']['GeneratedTags'])

Words like 'Baumstudie', 'Bauerngehöft' etc. are (wrongly) filtered out as well if I filter out named entities. However, these are words that the model does not know anyway, so there is no vector representation for them, meaning that the similarity function would only return 0 for any comparison other than completely identical words.

There exist methods where part of words are also understood, but for the most part this will not do any good in this dataset, because 'Bartholomäus' has nothing to do with 'Bart', 'Mannersdorf' nothing with 'Mann' and 'Fischerwirt' nothing with 'Fisch' (except that one could probably order fish there) etc.

Unfortunately not all names are recognized as such and so we still have 'Emilie' in the tags.

In [None]:
print(nlp('Baumstudie').similarity(nlp('Baum')))
print(nlp('Männer').similarity(nlp('Mann')))
print(nlp(nlp.tokenizer('Männer')[0].lemma_).similarity(nlp(nlp.tokenizer('Mannes')[0].lemma_)))

Here we see the above mentioned problem: The language model cannot associate Baumstudie with any other word. However, we see a different problem as well: One would think that 'Männer' (men) and 'Mann' (man) are pretty similar to each other, but this is not reflected in the results. Likely this is because the two words rarely appear in the same sentence together. For this reason we will use lemmatization.

In [None]:
print(nlp('Katze').similarity(nlp('Futter')))
print(nlp('Katze Napf').similarity(nlp('Futter')))
print(nlp('Katze Baum').similarity(nlp('Futter')))

In the cell above it can be seen that additional words can impact the similarity either positively or negatively. While this seems obvious, it is still important to mention, since it means that entities that have more tags do not necessarily have an advantage in the matching process later.

In [None]:
def count_unrepresented_and_empty_tags(for_df):
    not_found = 0
    empty_tags = 0
    concerning_ids = set()
    for df_ind, df_row in tqdm(for_df.iterrows(), total=for_df.shape[0]):
        tup_arr = df_row['GeneratedTags']
        for ta in tup_arr:
            assert len(ta[0]) != 0, f'Empty tag: {df_row["GeneratedTags"]}'
            assert len(ta[1]) != 0, f'Empty tag-label: {df_row["GeneratedTags"]}'

            if sum(nlp(ta[0]).vector) == 0:
                not_found += 1
                concerning_ids.add(df_row['Identifier'])

        if len(tup_arr)==0:
            empty_tags += 1
            concerning_ids.add(df_row['Identifier'])
    print(f'{not_found} entities are not represented by the nlp model. '
          f'{empty_tags} entities are without any tags. This concerns {len(concerning_ids)} artworks.')
    return concerning_ids
_ = count_unrepresented_and_empty_tags(df)

Note: Below we will see that the similarity function can work with entities that are not represented in the model by ignoring the unrepresented parts.

An open question is whether I should filter out all the tags that are not represented by a vector in the model. Let us first look at the behaviour of the similarity function in cases where unrepresented words are part of the input.

In [None]:
print(nlp('abcd').similarity(nlp('abcd')))
print(nlp('abcd').similarity(nlp('Abcd')))
print(nlp('abcd Haus').similarity(nlp('abcd Garten')))
print(nlp('Haus').similarity(nlp('abcd Garten')))
print(nlp('Garten').similarity(nlp('abcd Garten')))
print(nlp('Haus').similarity(nlp('Garten')))

If two strings are identical, a similarity of 1 is returned, regardless of whether the word is represented in the model or not. This is also important to know since it means that unknown tags are not necessarily useless altogether. However in the great majority of cases they would be a computational burden. It would also be an unwanted behaviour that sim('abcd', 'abcd') = 1, but sim('abcd', 'Abcd') = 0. Therefore it is necessary to process these unusable words in order to find pieces that are represented by the model.
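
The effect of unrepresented words can be reproduced with a toy model. This is only a sketch under stated assumptions, not spaCy's actual implementation: the two-dimensional vectors are made up, a text is represented by the average of its word vectors, unknown words get a zero vector, and the special case of identical strings is not modelled.

```python
import math

# Made-up word vectors; 'abcd' is deliberately missing
TOY_VECTORS = {'Haus': (1.0, 0.2), 'Garten': (0.8, 0.6)}


def text_vector(text):
    """Average the word vectors; unknown words contribute a zero vector."""
    vectors = [TOY_VECTORS.get(word, (0.0, 0.0)) for word in text.split()]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def toy_similarity(a, b):
    """Cosine similarity of the averaged vectors, 0.0 if one side has no known word."""
    va, vb = text_vector(a), text_vector(b)
    norm = math.hypot(*va) * math.hypot(*vb)
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(va, vb)) / norm


print(toy_similarity('Haus', 'Garten'))
print(toy_similarity('Haus', 'abcd Garten'))
print(toy_similarity('abcd', 'Abcd'))
```

In this toy model the unknown word only shortens the averaged vector without changing its direction, so 'abcd Garten' behaves like 'Garten', while two texts without any known word cannot be compared at all.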

In [None]:
if os.path.exists(pickle_path_3) and os.path.exists(pickle_path_4) and os.path.exists(pickle_path_5):
    df = pd.read_pickle(pickle_path_3)
    with open(pickle_path_4, 'rb') as f:
        total_tag_words = pickle.load(f)
    with open(pickle_path_5, 'rb') as f:
        removed_words = pickle.load(f)
else:
    split_word_cache = dict()
    total_tag_words = 0
    removed_words = []
    for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
        current_tags = []
        for tag_tuple in row["GeneratedTags"]:
            tag = tag_tuple[0]

            current_words = ''
            for cur_word in tag.split(" "):
                total_tag_words += 1
                word = cur_word

                if sum(nlp(word).vector) == 0:
                    if word in split_word_cache.keys():
                        found = split_word_cache[word]
                    else:
                        found = gws.split_german_word(word)
                        split_word_cache[word] = found
                    if not found:
                        removed_words.append(word)
                    else:
                        for f in found:
                            current_words += f + ' '
                else:
                    current_words += word + ' '
            if len(current_words) > 0:
                current_tags.append((current_words.strip(), tag_tuple[1], tag_tuple[2]))
        # Because of the steps above it can happen that different words result in the same tag. Therefore the set is taken.
        df.loc[ind, 'GeneratedTags'] = list(set(current_tags))
    # Checkpoint
    df.to_pickle(pickle_path_3)
    with open(pickle_path_4, 'wb+') as f:
        pickle.dump(total_tag_words, f)
    with open(pickle_path_5, 'wb+') as f:
        pickle.dump(removed_words, f)

In [None]:
print(total_tag_words)
print(len(removed_words))
print(removed_words[0:50])

In [None]:
[t for t in df[df['Identifier'] == '3613']['GeneratedTags']]

As we can see in the example above, some tags are fully contained in other tags. (Disregarding the fact that names should have been filtered out in the steps above; the nlp library does not work perfectly on these small text snippets.)

Therefore these will be filtered out. However, this filtering is implemented in the project itself, so that it can be done dynamically. The reason is that whether a tag counts as a duplicate depends on which origins are considered: with tuples like ('tag', 'Exp') and ('I am a tag', 'Title'), the first tag is contained in the second if both 'Exp' and 'Title' tags are considered, but if only 'Exp' tags are considered there is no duplicate.
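
A minimal sketch of such a dynamic filter, not the project's actual implementation. It assumes that the contained (shorter) tag is the one to drop and that containment is checked on whole words, ignoring case; the function name `drop_contained_tags` is hypothetical:

```python
def drop_contained_tags(tag_tuples, origins):
    """Keep the tags of the selected origins that are not contained in another selected tag."""
    selected = [tt for tt in tag_tuples if tt[1] in origins]
    kept = []
    for tt in selected:
        # Padding with spaces makes sure that only whole words are matched
        padded = f' {tt[0].lower()} '
        contained = any(
            other[0].lower() != tt[0].lower() and padded in f' {other[0].lower()} '
            for other in selected
        )
        if not contained:
            kept.append(tt)
    return kept


tags = [('tag', 'Exp'), ('I am a tag', 'Title')]
print(drop_contained_tags(tags, {'Exp', 'Title'}))
print(drop_contained_tags(tags, {'Exp'}))
```

With both origins selected, only the title tag remains; with 'Exp' alone, the expert tag is kept because there is nothing it could be contained in.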

Note: The list that was used for filtering is from an older version of this file.

In [None]:
write_top_1000_to_file('all_tags_new.txt')

In [None]:
filter_out = ['Georg Lechner', 'Selbstporträt ein Maler', 'mit Figur', 'Staffage', 'Sabine Grabner', 'Nr', 'Werk', 'Literatur', 'Kat',
              'Studie', 'Künstler', 'Jahrhundert', 'Maler', 'Selbstbildnis', 'Porträt', 'Künstler', 'S', 'Vgl', 'Vordergrund', 'Studie',
              'Eindruck', 'Skizze', 'Gemälde', 'Entwurf', 'Sammlung', 'Malerei', 'a', 'Österreichischen Galerie Belvedere verb',
              'etc', 'Selbstporträt ein Bildhauer', 'Sabine', 'Reproduktion einer Skulptur', 'Prospekt', 'Folge', 'Selbstporträt ein Grafikers',
              'Porträt ein Schauspieler', 'Dies', 'Brustbild', 'Bildnis', 'Beschreibung', 'zwölf Radierung', 'Skulptur',
              'Bildnis', 'Art', 'verschieden Ansicht', 'Motiv', 'Büste', 'Titelblatt', 'Teil', 'Johann Peter Krafft', 'Bedeutung',
              'Name', 'österreichisch Malerei', 'österreichisch Maler', 'Österreichischen Galerie', 'vorliegend Bild', 'b', 'Titel',
              'Rahmen', 'Porträt', 'Detail', 'Bild', 'Aquarell', 'zugehörig', 'weiters', 'ich Jahr', 'ernten', 'erneuert Blendrahmen', 'd',
              'c', 'Stelle', 'Serie', 'Fotografie', 'Datierung', 'Beispiel', 'Anlass', 'Öl', 'ich', 'Stillleben mit verwandt Gegenstand',
              'Hinweis', 'Gegenstand', 'Überlieferung', 'Ölbild', 'vgl', 'stilistisch Grund', 'eigenhändige Radierung', 'radierung']

# The tags are compared in lower case, so the filter list has to be lower-cased as well
filter_out_lower = {f.lower() for f in filter_out}

if os.path.exists(pickle_path_6):
    df = pd.read_pickle(pickle_path_6)
else:
    df = pd.read_pickle(pickle_path_3)
    for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
        current_tags = []
        for tag_tuple in row['GeneratedTags']:
            tag = tag_tuple[0]

            # Single letters sometimes appear because of Titles like "F. Gawet"
            if tag.strip().lower() not in filter_out_lower and len(tag.strip()) > 1:
                current_tags.append((tag.strip(), tag_tuple[1], tag_tuple[2]))
        df.loc[ind, 'GeneratedTags'] = list(set(current_tags))
        
    # Checkpoint
    df.to_pickle(pickle_path_6)

In [None]:
tagless_images = count_unrepresented_and_empty_tags(df)
print(*tagless_images, sep=' ')

## Converting Lists to Tuples

In [None]:
if os.path.exists(pickle_path_7):
    df = pd.read_pickle(pickle_path_7)
else:
    df = pd.read_pickle(pickle_path_6)
    for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
        df.loc[ind, 'GeneratedTags'] = tuple(df.loc[ind, 'GeneratedTags'])

    # Checkpoint
    df.to_pickle(pickle_path_7)

## Creating Iconclass Tags Column

In [None]:
if os.path.exists(pickle_path_8):
    df = pd.read_pickle(pickle_path_8)
else:
    df = pd.read_pickle(pickle_path_7)
    df['IconclassTags'] = ''
    def create_iconclass_column(error_mode):
        ic: IconclassCache = IconclassCache.instance
        all_tags = set()
        for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
            for e in tuple(t for t in row['ExpertTags'][2:-2].split('\', \'') if t != ''):
                all_tags.add(e)

        non_iconclass_tags = ic.get_all_missing_tags(all_tags=all_tags)

        for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
            current_tags = []
            current_tag_tuples = [tt for tt in row['GeneratedTags']]

            for tag in [t for t in row['ExpertTags'][2:-2].split('\', \'') if t != '']:
                if error_mode or tag not in non_iconclass_tags:
                    current_tags.append(tag)
                    current_tag_tuples.append((tag, 'Icon', tag))
            df.loc[ind, 'IconclassTags'] = list(set(current_tags))
            df.loc[ind, 'GeneratedTags'] = current_tag_tuples
    try:
        create_iconclass_column(False)
    except NumberOfTriesExceededException as e:
        warnings.warn('The initialization of the IconclassCache was not successful. The column "IconclassTags" will be made a copy of the "ExpertTags" column. This will affect the iconclass measure visualization.')
        create_iconclass_column(True)
        df.to_pickle(pickle_path_7)
        warnings.warn('Checkpoint saved, continue this notebook manually or rerun it.')
        raise e
    except IconclassCacheGenerationFailedException as e:
        warnings.warn('On the third fail, the column will be made a copy of the "ExpertTags" column.')
        raise e

    # Checkpoint
    df.to_pickle(pickle_path_8)

These new tags need to be filtered and processed as well.

In [None]:
if os.path.exists(pickle_path_9):
    df = pd.read_pickle(pickle_path_9)
else:
    df = pd.read_pickle(pickle_path_8)

    for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
        current_tag_tuples = [t for t in row['GeneratedTags']]

        new_tag_tuples = []

        for tt in current_tag_tuples:

            if tt[1] == 'Exp':
                found = False
                for compare in current_tag_tuples:
                    if compare[1] == 'Icon' and tt[2] == compare[2]:
                        found = True
                        break
                if not found:
                    new_tag_tuples.append((tt[0], 'NotIcon', tt[2]))
            elif tt[1] == 'Icon':
                for compare in current_tag_tuples:
                    if compare[1] == 'Exp' and tt[2] == compare[2]:
                        found = True
                        new_tag_tuples.append((compare[0], tt[1], tt[2]))
                        break
                continue

            new_tag_tuples.append(tt)
        df.loc[ind, 'GeneratedTags'] = new_tag_tuples

    # Checkpoint
    df.to_pickle(pickle_path_9)

In [None]:
pd.read_pickle(pickle_path_9).head()

The ExpertTags column and the IconclassTags column should hold tuples as well.

In [None]:
if os.path.exists(pickle_path_10):
    df = pd.read_pickle(pickle_path_10)
else:
    df = pd.read_pickle(pickle_path_9)
    for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
        df.loc[ind, 'ExpertTags'] = tuple(t for t in row['ExpertTags'][2:-2].split('\', \'') if t != '')
        df.loc[ind, 'IconclassTags'] = tuple(df.loc[ind, 'IconclassTags'])

    # Checkpoint
    df.to_pickle(pickle_path_10)

In [None]:
df = pd.read_pickle(pickle_path_10)
[t for t in df[df['Identifier'] == '9388']['GeneratedTags']]

# Add all objects

## Saving the Dataframe to pickle
Using pickle preserves meta-information such as types and avoids errors that can happen when parsing strings.
The downside is that the result is not a human-readable file.
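
The difference can be illustrated with the standard library alone. The record below is made up for this example and does not come from the dataset:

```python
import csv
import io
import pickle

# Made-up record with a nested tuple, similar in shape to the tag columns
row = {'Identifier': 'demo', 'GeneratedTags': (('Baum', 'Title'), ('Fluss', 'Des'))}

# Pickle round trip: the nested tuple comes back unchanged
restored = pickle.loads(pickle.dumps(row))
print(type(restored['GeneratedTags']))

# CSV round trip: every value is written as text and read back as a string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))
print(type(from_csv['GeneratedTags']))
```

After the CSV round trip the tags would have to be parsed from their text form again, which is exactly the kind of step that pickle avoids.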

In [None]:
df = pd.read_pickle(pickle_path_10)
df.to_pickle('cleaned_dataframe.pkl')

In [None]:
best_config, best_number_of_epochs = get_winning_config_with_augmentations()

if not os.path.exists(get_results_path('With_Augmentations', 'model_all_data')):
    train_model_on_all_data('With_Augmentations', best_config, best_number_of_epochs)

if not os.path.exists(get_results_path('Without_Augmentations', 'model_all_data')):
    train_model_on_all_data('Without_Augmentations', *get_winning_config_without_augmentations())

In [None]:
if os.path.exists(pickle_path_11):
    df = pd.read_pickle(pickle_path_11)
else:
    df = pd.read_pickle(pickle_path_10)
    df['Objects'] = ''
    from Project.VisualFeaturesBranches.ObjectDetection.ObjectDetectionNetworkUtils import load_model
    from Project.VisualFeaturesBranches.ObjectDetection.ArtDataset import ArtDataset
    from Project.VisualFeaturesBranches.ObjectDetection.ObjectDetectionNetworkUtils import reduce_box_count
    from Project.AutoSimilarityCacheConfiguration.DataAccess import DataAccess

    da: DataAccess = DataAccess.instance
    best_experiment = 'With_Augmentations'
    device = best_config['device']
    best_model = load_model(best_experiment)
    best_model.to(device)
    best_model.eval()

    result_dict = dict()

    for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
        identifier = row['Identifier']

        if identifier in da.get_ids_for_which_bounding_boxes_exist():
            labels = da.get_bounding_boxes_and_labels_for_identifier(identifier)[1]
        else:
            dataset = ArtDataset([identifier], 1200, 1200, 0, 0, 0, 0, 0, 0, (0, 0), 0)
            img = dataset[0]
            # Inference only, so no gradients need to be tracked
            with torch.no_grad():
                output = best_model([img[0].to(device)])
            reduced = reduce_box_count(output[0], 0.1, 0.25, 0)
            labels = [da.get_class_label_for_index(l) for l in reduced['labels']]
        df.loc[ind, 'Objects'] = tuple(labels)

    # Checkpoint
    df.to_pickle(pickle_path_11)
    del DataAccess.instance

In [None]:
if os.path.exists(pickle_path_12):
    df = pd.read_pickle(pickle_path_12)
else:
    df = pd.read_pickle(pickle_path_11)

    for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
        df.loc[ind, 'GeneratedTags'] = list(row['GeneratedTags']) + list((t, 'Obj', t) for t in (row['Objects']))

    # Checkpoint
    df.to_pickle(pickle_path_12)
df.head()

## Calculating weights

In order to avoid a bias towards the sources that give many noun chunks, the results also receive individual weights.
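
A small worked example of this idea on made-up tags: every tag receives the weight 1 / (number of tags generated from the same source cell), so that the weights of one source sum to 1. The function name `weights_by_source` is hypothetical, and the sketch leaves out the separate treatment of object tags:

```python
from collections import Counter


def weights_by_source(tag_tuples):
    """Give each tag the weight 1 / (number of tags generated from the same source cell)."""
    counts = Counter((origin, source) for _, origin, source in tag_tuples)
    return [(tag, origin, 1 / counts[(origin, source)]) for tag, origin, source in tag_tuples]


example = [
    ('Baum', 'Des', 'Ein Baum an einem Fluss'),
    ('Fluss', 'Des', 'Ein Baum an einem Fluss'),
    ('Landschaft', 'Exp', 'Landschaft'),
]
print(weights_by_source(example))
```

The description yields two noun chunks that share a total weight of 1, while the single expert tag keeps the full weight for itself.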

In [None]:
def calculate_weights_for_column():
    for ind, row in tqdm(df.iterrows(), total=df.shape[0]):
        tags = row['GeneratedTags']

        # Number of tags that were generated from the same source cell
        current_count_dict = dict()
        for tt in tags:
            current_key = (tt[1], tt[2])
            current_count_dict[current_key] = current_count_dict.get(current_key, 0) + 1

        # This works because every object recognized leads to exactly one tag
        nr_of_obj_tags = sum(1 for tt in tags if tt[1] == 'Obj')

        new_entries = []
        for tt in tags:
            if tt[1] == 'Obj':
                new_entries.append((tt[0], tt[1], 1 / nr_of_obj_tags))
            else:
                new_entries.append((tt[0], tt[1], 1 / current_count_dict[(tt[1], tt[2])]))

        df.loc[ind, 'GeneratedTags'] = new_entries

## Saving the Dataframe to pickle and csv
(and checking that it worked correctly)

In [None]:
df = pd.read_pickle(pickle_path_12)
calculate_weights_for_column()
df.to_pickle('cleaned_dataframe.pkl')
file_name = 'cleaned_dataframe.csv'
pd.read_pickle('cleaned_dataframe.pkl').to_csv(file_name, index=False)
pd.read_csv(file_name).head()

In [None]:
print([t for t in df[df['Identifier'] == '396']['GeneratedTags']])