The goal of this notebook is to fully clean the data and adapt it for use in Unity3D.

In [13]:
import pandas as pd
import re
import numpy as np

from string import punctuation

import urllib.request

from tqdm.autonotebook import tqdm

from PIL import Image
from bs4 import BeautifulSoup

# Data cleaning

## Separating rooms and walls

In [14]:
artworks_df = pd.read_csv("data/complete_artworks.csv")
# Drop the leftover index columns created by earlier CSV round trips
artworks_df = artworks_df.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1'])

In [15]:
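# A wall suffix is a hyphen followed by N, E, S or O (a 0 is mapped to O),
# located either at the end of the text or right before a space
wall_regex = r"-[NESO0]$|-[NESO0] "

def get_wall(text):
    # Return the wall letter of the first match, or NaN when there is none
    match = re.search(wall_regex,text)
    if match is None:
        return np.nan
    wall = match.group()[1]
    return "O" if wall == "0" else wall

def delete_wall(text):
    # Cut the text just before the first wall suffix, if any
    match = re.search(wall_regex,text)
    return text if match is None else text[:match.start()]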

In [16]:
artworks_df["wall"] = artworks_df.position.apply(get_wall)
artworks_df["position"] = artworks_df.position.apply(delete_wall)
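
To illustrate the room/wall split, here is a minimal, self-contained sketch that applies the same regular expression to a few position strings. The helper `split_wall` and the sample strings are hypothetical and only serve the illustration; the missing wall is represented by `None` instead of NaN.

```python
import re

# Same pattern as the notebook: "-X" at the end of the text or before a space
WALL_REGEX = r"-[NESO0]$|-[NESO0] "

def split_wall(text):
    """Return (position_without_wall, wall) for one raw position string."""
    match = re.search(WALL_REGEX, text)
    if match is None:
        return text, None
    wall = match.group()[1]
    # A "0" is treated as the letter "O"
    return text[:match.start()], ("O" if wall == "0" else wall)

# Hypothetical sample inputs
examples = ["XVI-S", "VIII-0 extra", "S. de Mécène"]
results = [split_wall(t) for t in examples]
for raw, (position, wall) in zip(examples, results):
    print(repr(raw), "->", repr(position), wall)
```

Returning both parts from one function avoids running the regular expression twice, at the cost of a slightly less direct use with `apply`.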

## Cleaning positions

In [17]:
for text in artworks_df["position"].unique():
    print(text)

Ire S. de la Céramique antique
VIII
Palier esc.T.T.
XVI
Musée Charles X Salle IX Baptiste, J.-B. Monnoyer dit. Voir Monnoyer.
T. T.
Coll. Camondo
S. Barye
E. F. XIX S. 2e étage II
XIV
E. F. XIXe S. 2e étage II
E. F. XIXe , , , X
X
Esc. T. T.
II
E. F. XIXe S. 1er étage I
S. de Mécène
S. d'Auguste
S. Meubles XVIIe
Gai. d'Apollon
S. Meubles XVIIe S.
Marine 1
Coll. Schlichting
Esc. Henri IV
S. Meubles XVIIIe
I
S. des Pastels
E. F. XIX^ S. 1er étage III
XV
XII
- Voir Courtois
S. N° Bureau, Pierre-Isidore. (1822-1876). — Clair de lune sur les bords de l'Oise à l'Isle - Adam E. F. XIXe S. 1er étage 1
S. N° Bureau, Pierre-Isidore. (1822-1876). — Clair de lune sur les bords de l'Oise à l'Isle - Adam E. F. XIXe S. 1er étage III
E. F XIXe S. 2e étage 1
1
S. Meubles XVIIIme
S. 2e étage I
Palier esc.T. T
E. F. XIXe S. 1er étage III
E. F. XIXe S. 1er étage II
E. F. XIXe S. 2e étage I
E. F. XIXe S. 2e étage H
XI
- XI
S. de la Colonnade
III
S. des fresques et verres antiques i^l) Constant, Benjamin. (

The data is clearly still quite dirty. Some rooms appear under several forms (Gai. or Gal) or carry debris.

We need a finite set of rooms with well-defined names so that we can choose how to name the rooms when using the GUI. This set must therefore be created now.

In addition, punctuation does not seem essential, so it can be removed.

In [18]:
def standardize_position(s):
    # Rejoin words that were split by a hyphen followed by a space
    s = re.sub(r"(\S)- (\S)",r"\1\2",s)
    # Replace all punctuation except hyphens and apostrophes with spaces
    p = punctuation.replace("-","")
    p = p.replace("'","")
    s = s.translate({ord(i) : " " for i in p})
    # Collapse runs of whitespace
    s = " ".join(s.split())
    if (s.startswith("-") or s.startswith("—")):
        if len(s) > 10:
            return np.nan
        else:
            return s[2:] # Case: dash + room number
    else:
        return s

artworks_df.position = artworks_df.position.apply(standardize_position)
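
As a quick check of the punctuation removal, the following sketch re-implements the same steps without pandas or NumPy and runs them on four raw positions taken from the listing above. It is an illustrative variant, not the cell itself: it uses `math.nan` where the notebook uses NumPy's NaN, and writes the long dash as the escape `\u2014`.

```python
import math
import re
from string import punctuation

def standardize_position(s):
    # Rejoin words split by a hyphen followed by a space
    s = re.sub(r"(\S)- (\S)", r"\1\2", s)
    # Replace punctuation (except hyphen and apostrophe) with spaces
    table = {ord(c): " " for c in punctuation if c not in "-'"}
    # Collapse whitespace
    s = " ".join(s.translate(table).split())
    # A leading dash marks either debris (long text) or "dash + room number"
    if s.startswith("-") or s.startswith("\u2014"):
        return math.nan if len(s) > 10 else s[2:]
    return s

samples = ["E. F. XIXe S. 2e étage II", "Palier esc.T.T.", "- XI", "- Voir Courtois"]
cleaned = [standardize_position(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```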

In [19]:
for text in sorted(list(str(x) for x in artworks_df["position"].unique())):
    print(text)

1
11
111
A-I
B
Coll Arconati Visconti
Coll Arconati-Visconti
Coll Camondo
Coll Camondo 159 i36 La femme à la potiche galerie
Coll Camondo 192 — Londres le Parlement Trouée de soleil dans le brouillard —
Coll Chauchard
Coll Chauchard 1
Coll Chauchard I
Coll Chauchard II
Coll Chauchard III
Coll Chauchard IV
Coll Chauchard IV i36 — Le marchand d'oranges — galerie
Coll Chauchard galerie
Coll Chauchard galerie 1 135' — Le Palais ducal à Venise — 1
Coll Chauchard galerie 1 135' — Le Palais ducal à Venise — galerie
Coll Chauchard galerie 1 135' — Le S M della Salute — 1
Coll Chauchard galerie 1 135' — Le S M della Salute — galerie
Coll Chauchard galerie 1 135' — Le S galerie INCONNUS DE L'ÉCOLE FRANÇAISE
Coll Chauchard galerie i 120 — Le pâturage à la gardeuse d'oies — 1
Coll Chauchard galerie i 120 — Le pâturage à la gardeuse d'oies — IV
Coll Chauchard galerie i 120 — Le pâturage à la gardeuse d'oies — galerie
Coll Chauchard i
Coll II
Coll IV
Coll Schlichting
Coll Schlichting Victors Jan voi

Note that merging the positions sometimes failed (in particular, S III and S 1er étage III are clearly incomplete data).

In [20]:
# Inverted dictionary mapping unclean_value -> clean_value
clean_rooms_inverted = {}
with open("Rooms.txt",encoding = "utf-8") as f:
    for line in f:
        # Each line lists the clean name first, then its variants, separated by "/"
        keys = [text.replace("\n","") for text in line.split("/")]
        val = keys[0]
        for k in keys:
            clean_rooms_inverted[k] = val

# Caution: with startswith alone, we risk picking I instead of I tr E, for example.
# With an exact match, all the degenerate data would be ignored.
# We therefore pick the longest of the startswith matches.

def clean_room(text):
    text = str(text)
    possible_matches = []
    for room in clean_rooms_inverted:
        if text.startswith(room):
            possible_matches.append(room)

    if len(possible_matches) > 0:
        return clean_rooms_inverted.get(max(possible_matches, key = len))
    else:
        return np.nan

In [21]:
artworks_df["clean_position"] = artworks_df.position.apply(clean_room)
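
The longest-prefix rule can be tried in isolation. The sketch below assumes a small hand-written mapping in place of the contents of `Rooms.txt`, which are not shown in this notebook; the entries and the helper name `longest_prefix_room` are hypothetical.

```python
# Hypothetical mapping variant -> clean name (stands in for Rooms.txt)
rooms = {
    "I": "I",
    "I tr A": "I tr A",
    "Gal d'Apollon": "Gal d'Apollon",
    "Gai d'Apollon": "Gal d'Apollon",
}

def longest_prefix_room(text, mapping):
    """Return the clean name of the longest key that prefixes text, else None."""
    candidates = [room for room in mapping if text.startswith(room)]
    if not candidates:
        return None
    return mapping[max(candidates, key=len)]

print(longest_prefix_room("I tr A 135 debris", rooms))  # "I tr A" wins over "I"
print(longest_prefix_room("Gai d'Apollon", rooms))      # variant mapped to the clean name
print(longest_prefix_room("Unknown room", rooms))       # no prefix matches
```

One limit of prefix matching is worth keeping in mind: a short key such as `I` also prefixes unrelated strings such as `III`, so the reference list needs an entry for every legitimate room.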

In [22]:
artworks_df[artworks_df.clean_position.isnull()] #So far this works well. This dataframe must stay empty!

Unnamed: 0,number,author,life,title,position,width,height,image_url,wall,clean_position


In [23]:
artworks_df = artworks_df.drop("position",axis = 1)
artworks_df = artworks_df.rename(columns = {"clean_position":"position"})

In [24]:
artworks_df.head()

Unnamed: 0,number,author,life,title,width,height,image_url,wall,position
0,*,"['alaux', 'jean']",(1786-1864').,poussin arrivant de rome est présenté par rich...,,,,,Céramique antique I
1,*,"['alaux', 'jean']",(1786-1864').,douze médaillons d'or représentant les travaux...,,,,,Céramique antique I
2,2,"['aligny', 'claude', 'françois', 'théodore', '...",(r 798- 1871).,une villa itajienne,,,,E,VIII
3,S. N°,"['amaury', 'duval']",(1808-1885).,portrait de mille x,,,,,Esc T T
4,9,"['aved', 'andré', 'joseph']",(1702-1766).,portrait du marquis de mirabeau,,,,S,XVI


## Cleaning dimensions

In [25]:
artworks_df[~artworks_df.width.isnull()].sample(5)

Unnamed: 0,number,author,life,title,width,height,image_url,wall,position
969,742,"['poussin', 'nicolas']",11594-1665).,apollon amoureux de daphné,200±1 Q174728,155±1 Q174728,https://upload.wikimedia.org/wikipedia/commons...,S,XIV
283,S. N°,"['daumier', 'honoré']",(1808-1879).,crispin et scapin,82 centimetre,61 centimetre,https://upload.wikimedia.org/wikipedia/commons...,S,VIII
61,50 a,"['boucher', 'françois']",(1703-1770).,le déjeuner,65.5±0.1 centimetre,81.5±0.1 centimetre,https://upload.wikimedia.org/wikipedia/commons...,S,XVI
2100,2499,"['ostade', 'adriaen', 'van']",(1610-1685).,un homme d'affaires dans son cabinet,27.8±0.1 centimetre,33.7±0.1 centimetre,https://upload.wikimedia.org/wikipedia/commons...,,XXI
1751,1966,"['dyck', 'anthonis', 'van']",(i599-i64i).,renaud et armide,109±1 centimetre,133±1 centimetre,https://s3.eu-west-3.amazonaws.com/pop-photote...,,XVII


In [26]:
# The texts are of several kinds: missing (nan) or debris (only one at the moment)
def clean_painting_size(text):
    # Unit: centimetre
    l = str(text).split(" ")
    if len(l) == 2:
        value, unit = l

        # Drop the uncertainty: keep only what precedes the first "±"
        value = value.split("±")[0]

        # "Q174728" is accepted as an alternative spelling of the centimetre unit
        if unit in ["centimetre","Q174728"] and value.replace(".","",1).isdigit():
            return float(value)

    elif l[0] != "nan":
        # Print anything that is neither a size nor a missing value
        print(l)

artworks_df.width = artworks_df.width.apply(clean_painting_size)
artworks_df.height = artworks_df.height.apply(clean_painting_size)

['cultural', 'depictions', 'of', 'Jesus']
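
The size parsing can be checked on the kinds of values seen in the sample above. This is a minimal standalone sketch of the same logic; `parse_size_cm` is a hypothetical name, and it returns `None` silently for unusable input instead of printing it.

```python
def parse_size_cm(text):
    """Parse strings like '65.5±0.1 centimetre' into a float, else None."""
    parts = str(text).split(" ")
    if len(parts) != 2:
        return None
    value, unit = parts
    # Keep only what precedes the uncertainty marker
    value = value.split("±")[0]
    if unit in ("centimetre", "Q174728") and value.replace(".", "", 1).isdigit():
        return float(value)
    return None

inputs = ["65.5±0.1 centimetre", "200±1 Q174728", "82 centimetre",
          "cultural depictions of Jesus", float("nan")]
sizes = [parse_size_cm(x) for x in inputs]
print(sizes)
```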


In [27]:
artworks_df.sample(5)

Unnamed: 0,number,author,life,title,width,height,image_url,wall,position
1302,1195,"['caliari', 'paolo']",(1528-1588).,le calvaire v,,,,S,I tr B
1241,S. Nu,"['antoniazzo', 'romano', 'ant', 'aquili', 'dit']",(2TUE moitié du XVe siècle).,la vierge et l'enfant v,,,,N,I tr A
76,60,"['boulongne', 'valentin']",(1591-1634).,un concert,,,,N,XIV
1140,(121),"['troyon', 'constant']",(1810-1865).,le retour à la ferme,261.0,391.0,https://s3.eu-west-3.amazonaws.com/pop-photote...,,Coll Chauchard galerie
2116,2516,"['poel', 'egbert', 'van', 'der']",(1621-1664).,la maison rustique,,,,,XXIII


# Retrieving catalogue pictures

In [28]:
filename_alto = "data/alto_louvre_1923/alto_louvre_1923_{}.xml"
filename_iiif = "data/images_louvre_1923/iiif_louvre_1923_{}.jpg"

catalogue_index = range(21,193)

numbers_punct = punctuation.replace("(","").replace(")","") + "°"

def standardize_numbers(s):
    s = str(s)
    if "(" in s or ")" in s:
        s = "(" + s.strip(" )(") +")"
    return s if s == "*" else s.translate({ord(i) : None for i in punctuation + "°"})

def is_painting(number, author, row):
    #return row["number"] == number and row["author"].strip('][').split(', ')[0].replace("'","") == author
    return row["number"] == number and row["author"].lower().startswith(author)

In [29]:
#Convert the authors back to plain text
artworks_df.author = artworks_df.author.apply(lambda l : " ".join(l.strip('][').split(', ')).replace("'","").capitalize())

#Standardize the numbers
artworks_df.number = artworks_df.number.apply(standardize_numbers)

artworks_df["image_path"] = np.nan

In [30]:
for i in tqdm(catalogue_index):
    
    im = Image.open(filename_iiif.format(i))

    with open(filename_alto.format(i),"rb") as xml_file:
        xml = xml_file.read()
    
    soup = BeautifulSoup(xml,"lxml-xml")
    
    previous_was_illu = False
    for x in soup.findAll(["Illustration","TextBlock"]):
        if x.name == "Illustration":
            height = int(x['HEIGHT'])
            width = int(x['WIDTH'])
            hpos = int(x['HPOS'])
            vpos = int(x['VPOS'])
            previous_was_illu = True
        
        elif previous_was_illu:
            df_index = None
            text_list = []
            previous_was_illu = False
            if x.name == "TextBlock":
                
                for s in x.findAll("String"):
                    text_list.append(s["CONTENT"])
                    
                text = " ".join(text_list)
                
                regex_num = r"^\* |^\(?\d+\)? ?[a-z]? |^S\.? [NXo].? "
                
                for match in re.finditer(regex_num, text):
                #Author and number are standardized
                    start, end = match.span()
                    author = text[end:].lower()
                    number = standardize_numbers(text[:end])
                    
                    #We keep only family name of author. Composed names are "la tour" or "de la tour"
                    if not (author.startswith("de") or author.startswith("la")):
                        author = author.split(" ")[-1]
                        
                    if number.startswith("S"):
                        number = "S N"
                    
                    # Numbers are not unique, but non unique are in parenthesis in the catalogue
                    # Unfortunately, the parenthesis are often not OCRised well
                    # Therefore, we test only with the number and then with the number and author                
                    indexes_hard = artworks_df.index\
                    [artworks_df.apply(lambda row : is_painting(number, author, row), axis = 1)].tolist()
                    
                    if len(indexes_hard) == 1:
                        df_index = indexes_hard[0]
                    
                    elif number != "S N":
                        indexes_soft = artworks_df.index\
                        [artworks_df.apply(lambda row : row["number"] == number, axis = 1)].tolist()
                        
                        if len(indexes_soft) == 1:
                            df_index = indexes_soft[0]
                
                    if df_index is not None:
                        cropped = im.crop((hpos,vpos,hpos+width,vpos+height))
                        path = "data/images_paintings/{}.jpg".format(df_index)
                        cropped.save(path)
                    
                        artworks_df.loc[[df_index], ['image_path']] = path
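
To see how the caption pattern separates the catalogue number from the rest of the text, here is a small standalone sketch. The captions are made up for the illustration (real OCR captions may look different), and `split_caption` is a hypothetical helper that only performs the split, not the dataframe lookup.

```python
import re

# Same pattern as in the loop: "*", an optionally parenthesised number with an
# optional lowercase letter, or an "S. N°"-style marker, each followed by a space
REGEX_NUM = r"^\* |^\(?\d+\)? ?[a-z]? |^S\.? [NXo].? "

def split_caption(text):
    """Return (number_part, remainder), or None when no number prefix is found."""
    match = re.match(REGEX_NUM, text)
    if match is None:
        return None
    return text[:match.end()].strip(), text[match.end():]

# Hypothetical captions
captions = ["50 a Boucher", "(121) Troyon", "S. N° Daumier", "* Alaux", "Plain text"]
parts = [split_caption(c) for c in captions]
for caption, result in zip(captions, parts):
    print(repr(caption), "->", result)
```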





# Retrieve images from URLs

In [31]:
#Since the URLs are fetched after the catalogue images, the downloads overwrite the catalogue images.
#Swap the order if the catalogue images matter more.

In [32]:
for index, row in tqdm(artworks_df.dropna(subset = ["image_url"]).iterrows()):
    #if row["image_path"].isnull(): 
    path = "data/images_paintings/{}.jpg".format(index)
    urllib.request.urlretrieve(row["image_url"],path)
    artworks_df.loc[[index],["image_path"]] = path
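
If the catalogue crops should take priority instead, one option is to download only when no image path has been recorded yet. The sketch below shows that rule on plain dictionaries with an injected `fetch` callable, so it runs without network access; `fill_missing_images`, the example URLs and the dictionary layout are assumptions made for the illustration. In the notebook, `fetch` could be `urllib.request.urlretrieve`, and missing values in the dataframe would have to be tested with `pd.isna`, because NaN is truthy in Python.

```python
def fill_missing_images(rows, fetch, pattern="data/images_paintings/{}.jpg"):
    """Download an image only for rows that have a URL and no image path yet."""
    downloaded = []
    for index, row in rows.items():
        if row.get("image_url") and not row.get("image_path"):
            path = pattern.format(index)
            fetch(row["image_url"], path)
            row["image_path"] = path
            downloaded.append(index)
    return downloaded

# Hypothetical rows: index -> record (None stands for a missing value)
rows = {
    2: {"image_url": None, "image_path": "data/images_paintings/2.jpg"},
    6: {"image_url": "https://example.org/6.jpg", "image_path": "data/images_paintings/6.jpg"},
    9: {"image_url": "https://example.org/9.jpg", "image_path": None},
}

calls = []  # records what would have been downloaded
downloaded = fill_missing_images(rows, lambda url, path: calls.append((url, path)))
print(downloaded, calls)
```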





In [33]:
artworks_df.dropna(subset = ["image_path"])

Unnamed: 0,number,author,life,title,width,height,image_url,wall,position,image_path
2,2,"Aligny claude françois théodore caruelle ""d""",(r 798- 1871).,une villa itajienne,,,,E,VIII,data/images_paintings/2.jpg
6,2800,Barye antoine louis,(1795-1875).,lions près de leur antre,49.5,38.5,https://upload.wikimedia.org/wikipedia/commons...,,T T,data/images_paintings/6.jpg
9,S N,Barye antoine louis,(1795-1875).,le jean de paris forêt de fontainebleau,38.5,30.5,https://s3.eu-west-3.amazonaws.com/pop-photote...,,S Barye,data/images_paintings/9.jpg
10,S N,Barye antoine louis,(1795-1875).,combat de cerfs,31.5,25.0,https://s3.eu-west-3.amazonaws.com/pop-photote...,,S Barye,data/images_paintings/10.jpg
16,995,Bellechose henri,"(travaillait à Dijon de 1415 à 1431, mort ver...",la dernière communion et le martyre de saint d...,,,,,Musée Charles X S III,data/images_paintings/16.jpg
...,...,...,...,...,...,...,...,...,...,...
2235,2718,Holbein hans,(1497-1543).,portrait d'anne de clèves reine d'angleterre q...,,,,,XXXIII,data/images_paintings/2235.jpg
2243,2737,Maître de saint barthélemy,(vers 1400-1 soo).,le christ descendu de la croix,,,,,XXXII,data/images_paintings/2243.jpg
2244,2738,Maître de la sainte parenté,(vers 1486-1520).,la présentation nu temple l'adoration des mage...,,,,,XXXII,data/images_paintings/2244.jpg
2247,2724,Mignon abraham,( 1640-1679).,le nid de pinsons,100.0,82.0,https://upload.wikimedia.org/wikipedia/commons...,,XXXII,data/images_paintings/2247.jpg


# Saving

In [34]:
artworks_df.to_csv("data/full_final_artworks.csv")
pure_artworks_df = artworks_df.drop("number",axis=1).drop("image_url",axis=1)
pure_artworks_df.to_csv("data/final_artworks.csv")



## Saving per room

In [36]:
pure_artworks_df = pd.read_csv("data/final_artworks.csv")

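with open("Rooms.txt",encoding = "utf-8") as f:
    for line in f:
        # The clean room name is the first "/"-separated field of each line
        room = line.split("/")[0].replace("\n","")
        # Only the file name drops the accent; the filter uses the name as stored in the dataframe
        filename = room.replace("é","e")
        pure_artworks_df[pure_artworks_df.position == room].to_json("data/paintings_json/{}.json".format(filename), orient="records")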
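
Room names can contain accented characters that may be awkward in file names. A minimal sketch, assuming a general accent-stripping step is wanted rather than a replacement of `é` alone; `ascii_filename` is a hypothetical helper built on the standard `unicodedata` module.

```python
import unicodedata

def ascii_filename(name):
    """Strip accents by decomposing characters and dropping non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

json_path = "data/paintings_json/{}.json".format(ascii_filename("Céramique antique I"))
print(json_path)
```

Characters without an ASCII decomposition are dropped entirely by this approach, so it suits accented Latin text but not arbitrary scripts.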

# Miscellaneous tests

In [24]:
l = list(map(lambda text : text.replace("\n",""),["a\n","ab\n"]))

In [25]:
l

['a', 'ab']

In [26]:
len(artworks_df)

2254

In [27]:
for k in {"w":"a"}:
    print(k)

w


In [28]:
sorted(["I","I tr A","I tr A c"])

['I', 'I tr A', 'I tr A c']

In [29]:
#max([])

In [30]:
x =  {"a":"h"}.get("b")

In [31]:
print(x)

None


In [32]:
a, b = [1,2]

In [33]:
a

1

In [34]:
"5.1".replace(".","").isdigit()

True

In [35]:
urllib.request.urlretrieve("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k959891g/f1/full/full/0/native.jpg","data/images_paintings/test.jpg")

('data/images_paintings/test.jpg', <http.client.HTTPMessage at 0x184fdbd6278>)

In [36]:
np.isnan("Jean")

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''