# Data Analysis Framework for Photographic Collection Data from the KHI

Introductory text

## Getting started: How to use this notebook

Text content

## Table of contents

<ol start='0'>
    <li><a href='#preparation'>Preparation</a></li>
    <li><a href='#upload'>Upload your KHI dataset</a></li>
    <li><a href='#refining'>Refining your photograph dataset: applying filters</a></li>
    <li><a href='#sort'>Sort your data</a></li>
    <li><a href='#visualization'>Data visualization</a></li>
    <li><a href='#download'>Download results</a></li>
</ol>

## 0. Preparation<a id='preparation'></a>

In [1]:
# pathlib is part of the Python standard library and does not need to be installed
!pip install numpy
!pip install matplotlib
!pip install pandas
!pip install ipywidgets
%load_ext autoreload
%autoreload 2



In [2]:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context
import pathlib
import xml.etree.ElementTree as ET
import re
from pprint import pprint
from resources.PhotoAttributes import PhotoAttributes
from resources.dictionaries_file import *
from resources.Classes_file import *
from Thesis_project_main import *
import numpy
import ipywidgets as widgets
import codecs
from module import *
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype as is_datetime

In [3]:
%run ./resources/dataclasses_creation.py

## 1. Upload your KHI dataset<a id='upload'></a>

This section allows you to **upload your XML file** containing data about the collections of the KHI. Click on the upload button to upload the file in XML format from your device. Note that executing the cell again with Shift+Enter removes your upload.

In [4]:
display(upload)

FileUpload(value=(), accept='.xml', description='Upload')

Execute the following cell by pressing Shift+Enter to decode your data and save the content in a text file.

In [5]:
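try:
    uploaded_file = upload.value[0]
except IndexError:
    print("Please upload an XML file in the cell above")
else:
    # Check that the content can be decoded as windows-1252 before saving it
    codecs.decode(uploaded_file.content, encoding="windows-1252")
    # Save the raw bytes only when a file was actually uploaded
    with open("./saved-output.txt", "wb") as fp:
        fp.write(uploaded_file.content)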
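
The cell above treats the upload as windows-1252 text. As a minimal sketch of why the encoding matters for German-language records, the snippet below encodes a title taken from the sample data and decodes it with two different codecs. The variable names are illustrative and not part of the notebook's own code.

```python
import codecs

# Illustrative sample: a German title containing non-ASCII characters
title = "Berufung der Söhne des Zebedäus"

# Bytes as they would appear in a windows-1252 encoded export
raw = title.encode("windows-1252")

# Decoding with the matching codec restores the original text
decoded = codecs.decode(raw, encoding="windows-1252")
print(decoded)

# Decoding the same bytes as UTF-8 fails because the byte for "ö" is not valid UTF-8
try:
    raw.decode("utf-8")
except UnicodeDecodeError as error:
    print("UTF-8 decoding failed:", error.reason)
```

If decoding with the expected codec raises an error, the file was probably exported with a different encoding.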

### Extract Data

Execute the following cell to get an overview of your dataset. The data is presented within a **DataFrame**, a data structure used in computer programming and data analysis to organize and manipulate tabular data, especially when working with large datasets. Each **row** represents an individual entry, like a record in a database, and each **column** represents a specific type of information.

DataFrames let you organize, explore, and manipulate data easily, allowing tasks such as **filtering**, **sorting**, and **summarizing data**. You can also perform statistical analysis, create visualizations, and prepare data for machine learning models using DataFrames in Python and other programming languages. 

In [6]:
photos_collection = XmlReaderKHI.get_dataframe("./saved-output.txt")

# The dataframe "photos_dataframe" also includes entries of the photographs' digital version. 
photos_dataframe = photos_collection.dataframe

# The dataframe "photos_dataframe_no_scan" does not contain entries of digital photos of the so-called "Cimelia" photographs.
# It is the best option to obtain more accurate results if you are not interested in including data about the digital versions 
# of these photographs in your results.
photos_dataframe_no_scan = photos_collection.dataframe_no_scan
photos_dataframe_no_scan

Unnamed: 0,obj_id,obj_id_level2,obj_id_level3,artist,other_artist_attribution,date,title,description_khi,status,genre,...,photo_file_format,photo_equipment,photo_dimension,photo_subject,photo_comment,photo_literature_citation,photo_file_number,photo_credit_line,photo_old_archival_section,photo_preservation_status
0,70010647,,,[Tizian],,[Datierung: um 1514],Noli me tangere,,,Malerei,...,,,"[25,2 x 19,2 cm (Druck)]",Gesamtansicht,,,,,,
1,07705412,,,"[Basaiti, Marco]",,,Berufung der Söhne des Zebedäus,,,Tafelmalerei,...,,,"[26,1 x 16,5 cm (Foto)]",Gesamtansicht,,,,,,
3,07705412,,,"[Basaiti, Marco]",,,Berufung der Söhne des Zebedäus,,,Tafelmalerei,...,,,,Gesamtansicht,,,,,,
4,07705412,,,"[Basaiti, Marco]",,,Berufung der Söhne des Zebedäus,,,Tafelmalerei,...,,,,Gesamtansicht,,,,,,
5,70013046,,,"[Lotto, Lorenzo]",,[Datierung: 1527/1533],Maria mit dem Kind und den heiligen Katharina ...,,,Malerei,...,,,"[19,7 x 24,9 cm (Foto)]",Gesamtansicht,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1117,70012001,,,"[Gaulli, Giovanni Battista]",,[Datierung: 1651/1700],Christuskind auf Wolken,,,Malerei,...,,,"[25,1 x 19,7 cm (Foto)]",Gesamtansicht,,,,,,
1118,70013060,,,"[Morone, Francesco (1471)]",,[Datierung: um 1526],Maria mit Kind,,,Malerei,...,,,"[24,2 x 20,2 cm (Foto)]",Gesamtansicht,,,,,,
1120,70013210,,,"[Vaga, Perino del]",,,Die Geschichte von Amor und Psyche,,,Wandmalerei,...,,,[20 x 26 cm (Foto)],Gesamtansicht: Die Haushälterin der Räuber erz...,,,,,,
1122,70013210,,,"[Vaga, Perino del]",,,Die Geschichte von Amor und Psyche,,,Wandmalerei,...,,,,Gesamtansicht: Die Haushälterin der Räuber erz...,,,,,,
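
To make the operations mentioned in this section concrete, here is a minimal sketch on a tiny hand-built DataFrame. The three rows borrow values from the output above, but the frame itself is an illustrative stand-in and is not produced by `XmlReaderKHI`.

```python
import pandas as pd

# Illustrative stand-in for the real dataset: one row per photograph
sample = pd.DataFrame(
    {
        "obj_id": ["70010647", "07705412", "70013060"],
        "title": ["Noli me tangere", "Berufung der Söhne des Zebedäus", "Maria mit Kind"],
        "genre": ["Malerei", "Tafelmalerei", "Malerei"],
    }
)

# Filtering: keep only the rows whose genre is exactly "Malerei"
paintings = sample[sample["genre"] == "Malerei"]

# Sorting: order the rows alphabetically by title
by_title = sample.sort_values("title")

# Summarizing: count how many rows fall into each genre
genre_counts = sample["genre"].value_counts()

print(paintings)
print(by_title)
print(genre_counts)
```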


### Dataset Information

Execute the following cell to see a short description of your dataset.

In [None]:
print(photos_collection.get_dataset_description())

## 2. Refining your photograph dataset: applying filters <a id='refining'></a>

DataFrames are a powerful data structure that lets you manipulate and filter data easily. In this section, you can filter the data by specifying a search text and the column in which to search. Please note that the dataset is in German, so search terms should be in German as well.

### Filter by any column

Choose the column you want to filter from the menu below and type the text to search for in the "Search text" field:

In [11]:
filter_column_widget.options=[column for column in photos_collection.dataframe_no_scan.columns if column != 'date']
filter_column_widget.value=photos_collection.dataframe_no_scan.columns[3]

In [14]:
display(filter_column_widget)
display(text_to_filter)
filter_by_column = photos_collection.filter_by(filter_column_widget.value, text_to_filter.value)
filter_by_column

Dropdown(description='Column:', index=3, options=('obj_id', 'obj_id_level2', 'obj_id_level3', 'artist', 'other…

Text(value='alessandro', description='Search text:', placeholder='Type your text here')

Unnamed: 0,obj_id,obj_id_level2,obj_id_level3,artist,other_artist_attribution,date,title,description_khi,status,genre,...,photo_file_format,photo_equipment,photo_dimension,photo_subject,photo_comment,photo_literature_citation,photo_file_number,photo_credit_line,photo_old_archival_section,photo_preservation_status
227,70013080,,,"[Moretto, Alessandro]",,,Thronende Madonna mit Kind und vier Kirchenvätern,,,Malerei,...,,,"[27,6 x 18,4 cm (Foto)]",Gesamtansicht,,,,,,
240,70013053,,,"[Allori, Alessandro]","[frühere Zuschreibung: Bronzino, Agnolo]",,Bildnis eines jungen Mannes,,,Malerei,...,,,"[26 x 20,1 cm (Foto)]",Gesamtansicht,,,,,,
343,70013067,,,"[Moretto, Alessandro]",,[Datierung: 1530/1534],"Die heilige Justina, von einem Stifter verehrt",,,Malerei,...,,,"[25,8 x 17,1 cm (Foto)]",Gesamtansicht,,,,,,
344,70013067,,,"[Moretto, Alessandro]",,[Datierung: 1530/1534],"Die heilige Justina, von einem Stifter verehrt",,,Malerei,...,,,"[27,8 x 22,2 cm (Foto)]",Ausschnitt: Heilige Justina,,,,,,
580,7703325,,,"[Botticelli, Sandro]",,,Madonna,,,Tafelmalerei,...,,,"[25,7 x 16,5 cm (Foto)]",Gesamtansicht,,,,,,
581,7703325,,,"[Botticelli, Sandro]",,,Madonna,,,Tafelmalerei,...,,,"[38,7 x 26,4 cm (Foto)]",Gesamtansicht,,,,,,
672,70005381,,,"[Moretto, Alessandro]",,[Datierung: um 1520/1545],Ein Edelmann und ein Poet in der Pose der Mela...,,,Malerei,...,,,"[23,6 x 33,7 cm (Passpartout)]",Zwei historische Fotografien mit Gesamtansicht...,,,,,,
673,70005381,70005382.0,,"[Moretto, Alessandro]",,[Datierung: um 1520/1545],Porträt eines jungen Mannes in der Pose der Me...,,,Malerei,...,,,"[20,3 x 15,3 cm (Foto)]",Gesamtansicht,,,,,,
674,70005381,70005382.0,,"[Moretto, Alessandro]",,[Datierung: um 1520/1545],Porträt eines jungen Mannes in der Pose der Me...,,,Malerei,...,,,,Gesamtansicht,,,,,,
675,70005381,70005382.0,,"[Moretto, Alessandro]",,[Datierung: um 1520/1545],Porträt eines jungen Mannes in der Pose der Me...,,,Malerei,...,,,,Gesamtansicht (mit Rahmen),,,,,,
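
The implementation of `filter_by` is not shown in this notebook. As a rough sketch of the general idea, a case-insensitive substring search on one column can be written in plain pandas as below. The helper `filter_contains` and the sample frame are hypothetical. Note that the output above also lists rows for "Botticelli, Sandro", which a plain substring match on the displayed value would not return, so the real method may search more than the displayed value.

```python
import pandas as pd


def filter_contains(dataframe: pd.DataFrame, column: str, text: str) -> pd.DataFrame:
    """Return the rows whose `column` contains `text`, ignoring case."""
    # Convert values to strings so that list-like or missing entries can be searched
    values = dataframe[column].astype(str)
    # regex=False treats the search text literally instead of as a pattern
    mask = values.str.contains(text, case=False, regex=False)
    return dataframe[mask]


# Hypothetical sample data with values borrowed from the output above
sample = pd.DataFrame(
    {
        "artist": ["Moretto, Alessandro", "Allori, Alessandro", "Tizian"],
        "genre": ["Malerei", "Malerei", "Malerei"],
    }
)

matches = filter_contains(sample, "artist", "alessandro")
print(matches)
```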


### Filter by date

Choose either a date and an operator, or two dates, to filter the photographs by the date of the artwork they depict.

In [None]:
display(text_date_to_filter, text_date_to_filter_2, year_operator)

In [None]:
filter_by_date = photos_collection.filter_by('date', '1400', '=')
filter_by_date
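
The `date` column holds free-text strings such as `[Datierung: um 1514]` or `[Datierung: 1527/1533]`, so comparing dates requires extracting the years first. The sketch below shows one possible approach using a regular expression. The helper names, the `<` and `>` operator symbols, and the rule that a range matches when any of its years satisfies the comparison are all assumptions, not the documented behaviour of `filter_by`.

```python
import operator
import re

# Assumed mapping from operator symbols to comparison functions
OPERATORS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def extract_years(date_text: str) -> list[int]:
    """Return every four-digit year found in a free-text date string."""
    return [int(year) for year in re.findall(r"\b\d{4}\b", date_text)]


def date_matches(date_text: str, year: int, symbol: str) -> bool:
    """Check whether any year in `date_text` satisfies the comparison."""
    compare = OPERATORS[symbol]
    return any(compare(found, year) for found in extract_years(date_text))


print(extract_years("[Datierung: um 1514]"))
print(extract_years("[Datierung: 1527/1533]"))
print(date_matches("[Datierung: 1527/1533]", 1530, ">"))
```

Treating a range as a match when either bound qualifies is only one option; requiring both bounds to qualify would give stricter results.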

### Combining multiple filters

You can apply multiple filters to your dataset by filtering the results obtained in the previous sections. To apply an additional filter to the data obtained in the section <a href="#Filter-by-any-column"><b>Filter by any column</b></a>, keep the DataFrame name 'additional_filter_by_column' inside the parentheses of the final `display(...)` call in the snippet below. To filter the result obtained by filtering artwork dates in the section <a href="#Filter-by-date"><b>Filter by date</b></a>, replace that name with 'additional_filter_by_date'.

In [None]:
display(filter_column_widget)
display(text_to_filter)
additional_filter_by_column = photos_collection.filter_by(filter_column_widget.value, text_to_filter.value, filter_by_column)
additional_filter_by_date = photos_collection.filter_by(filter_column_widget.value, text_to_filter.value, filter_by_date)


# Replace with 'additional_filter_by_date' to filter on the result of the "Filter by date" section
display(additional_filter_by_column)
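
Chaining filters means running the second filter on the output of the first. The following plain-pandas sketch illustrates the same idea without calling `filter_by`; the sample frame is illustrative.

```python
import pandas as pd

# Illustrative sample data with values borrowed from the output above
sample = pd.DataFrame(
    {
        "artist": ["Moretto, Alessandro", "Moretto, Alessandro", "Allori, Alessandro"],
        "photo_subject": ["Gesamtansicht", "Ausschnitt: Heilige Justina", "Gesamtansicht"],
    }
)

# First filter: rows whose artist contains "moretto", ignoring case
first = sample[sample["artist"].str.contains("moretto", case=False, regex=False)]

# Second filter, applied to the result of the first: full views only
second = first[first["photo_subject"] == "Gesamtansicht"]

# Equivalent single step: combine both conditions with the & operator
combined = sample[
    sample["artist"].str.contains("moretto", case=False, regex=False)
    & (sample["photo_subject"] == "Gesamtansicht")
]

print(second)
print(combined)
```

Both approaches return the same rows; filtering step by step makes it easier to inspect the intermediate result.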

## 3. Sort your data <a id='sort'></a>

Text

## 4. Data visualization<a id='visualization'></a>

To visualize the data in your dataset, select the column whose content you want to display and the type of chart you want to use. Note that not every chart type suits every kind of data, so experiment with different visualizations to find the one that fits yours.

In [None]:
x_widget.options = photos_collection.dataframe_no_scan.columns
x_widget.value = photos_collection.dataframe_no_scan.columns[3]
display(x_widget, y_widget)

In [None]:
photos_collection.plot_values(x_widget.value, y_widget.value)
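
`plot_values` is defined outside this notebook, so its internals are not shown here. As a hedged sketch of what a simple bar chart of a column involves, the code below counts the values of a column and plots the counts with matplotlib. It assumes that the `artist` column holds Python lists, as the bracketed display in the tables above suggests. The function names and sample data are hypothetical.

```python
import pandas as pd


def count_values(dataframe: pd.DataFrame, column: str) -> pd.Series:
    """Count how often each value occurs, unpacking list-valued cells first."""
    # explode() turns a cell like ["Tizian"] into one row per list element
    return dataframe[column].explode().value_counts()


def plot_counts(counts: pd.Series, title: str) -> None:
    """Draw the counts as a bar chart."""
    # Imported here so that the counting step works without matplotlib
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    counts.plot.bar(ax=ax)
    ax.set_title(title)
    ax.set_ylabel("Number of photographs")
    fig.tight_layout()
    plt.show()


# Hypothetical sample data: each cell holds a list of artist names
sample = pd.DataFrame(
    {"artist": [["Tizian"], ["Basaiti, Marco"], ["Basaiti, Marco"], ["Lotto, Lorenzo"]]}
)

artist_counts = count_values(sample, "artist")
print(artist_counts)
# In a notebook, call plot_counts(artist_counts, "Photographs per artist") to draw the chart
```

Bar charts of this kind suit columns with a limited number of distinct values; free-text columns such as descriptions usually produce too many bars to read.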

## 5. Download results<a id='download'></a>

If you'd like, you can download your DataFrame to your device. The data will be saved in **CSV format**, which is commonly used for storing and distributing tabular data. Simply uncomment the line corresponding to the DataFrame you wish to download by removing the "#" symbol.

To explore further options, see the documentation of the `pandas.DataFrame.to_csv()` method: <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html" target="_blank">pandas.DataFrame.to_csv() (official documentation)</a>.

In [None]:
# filter_by_column.to_csv('filter_by_column.csv', index=False)
# filter_by_date.to_csv('filter_by_date.csv', index=False)
# additional_filter_by_column.to_csv('additional_filter_by_column.csv', index=False)
# additional_filter_by_date.to_csv('additional_filter_by_date.csv', index=False)
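
As a small sketch of one `to_csv` option that can matter for German-language data: passing `encoding="utf-8-sig"` writes a byte order mark, which helps some spreadsheet programs detect UTF-8 and display umlauts correctly. The file name and sample frame below are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative sample row containing non-ASCII characters
sample = pd.DataFrame(
    {"title": ["Berufung der Söhne des Zebedäus"], "genre": ["Tafelmalerei"]}
)

# Write to a temporary folder so that the example leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "sample_export.csv"
    # index=False omits the row index; utf-8-sig adds a byte order mark
    sample.to_csv(csv_path, index=False, encoding="utf-8-sig")
    # Read the file back to confirm that the text survives the round trip
    restored = pd.read_csv(csv_path, encoding="utf-8-sig")

print(restored)
```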