# TITLE<br>
This notebook was created for the exam of **Digital Publishing and Electronic Storytelling**, taught by **Professor Marilena Daquino** at the **University of Bologna** during the academic year **2021/2022**.
<br>
It presents the project NAME, developed by [Francesca Borriello](https://github.com/Fran-cesca), [Lorenza Pierucci](https://github.com/LorenzaPierucci), and [Laura Travaglini](https://github.com/lauratravaglini).

# 1. Creating our dataframes.<br>
Museums: evolving creatures feeding on the artworks that will ultimately define what they look like, what they are. <br>
Collage: an assemblage of different forms creating a new whole, a mixture of heterogeneous elements, forms only apparently unrelated, no random aggregation. A heap of broken images.<br>

MoMA and Tate, with 1,160,686 and 1,156,037 visitors respectively in 2021 (despite dramatic attendance drops due to the pandemic), are universally known and recognised as two of the most influential museums worldwide.<br>

The first, founded in New York in 1929, is often identified as one of the largest and most influential museums of modern art in the world. It plays a major role in developing and collecting modern art. Its collection includes works of architecture and design, drawing, painting, sculpture, photography, prints, illustrated books and artists' books, film, and electronic media.
<br>

The second, Tate Modern, is the most visited modern art museum in the world: it is estimated to attract over five and a half million visitors every year. (?)<br>

Both 
that is only apparently casual
We worked with two museums (MoMA and Tate) that make datasets available in CSV format, containing information about their artworks, artists, and acquisitions over time.<br>
After importing all the necessary libraries, we load the existing data from the CSV files into pandas `DataFrame` objects for easier data manipulation and table operations. <br>
In particular, data are 



# Import

In [None]:
import pandas as pd
import csv
import re
from collections import defaultdict
from rdflib import Namespace , Literal , URIRef
from rdflib.namespace import RDF , RDFS
import ssl
from qwikidata.sparql import return_sparql_query_results # python library for working with sparql and linked data from WikiData

# MoMa

In [None]:
spreadsheet = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv')
pd.set_option('display.max_columns', None)
MoMaArtworks = spreadsheet[['Title', 'Artist', 'ConstituentID', 'Nationality', 'BeginDate', 'EndDate','Date', 'Department', 'DateAcquired']]
MoMaArtists = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artists.csv')
MoMaArtists["ConstituentID"] = MoMaArtists["ConstituentID"].astype(str)
MoMa = pd.merge(MoMaArtworks,MoMaArtists[['ConstituentID', 'Wiki QID']],on='ConstituentID', how='left')
MoMa.rename(columns = {'ConstituentID':'Id', 'BeginDate':'BirthDate', 'EndDate':'DeathDate'}, inplace = True)

# Tate

In [None]:
spreadsheet = pd.read_csv('https://raw.githubusercontent.com/tategallery/collection/master/artwork_data.csv')
pd.set_option('display.max_columns', None)
# Work on a copy so that the changes below do not trigger SettingWithCopyWarning
TateArtworks = spreadsheet[['artist', 'artistId', 'title', 'medium', 'creditLine', 'year', 'acquisitionYear', 'url']].copy()
TateArtworks = TateArtworks.rename(columns={'artistId': 'id'})
TateArtworks['id'] = TateArtworks['id'].astype(str)
TateArtists = pd.read_csv('https://raw.githubusercontent.com/tategallery/collection/master/artist_data.csv')
TateArtists["id"] = TateArtists["id"].astype(str)
Tate = pd.merge(TateArtworks,TateArtists[['id', 'gender', 'yearOfBirth', 'yearOfDeath']], on='id', how='left')
Tate.rename(columns = {'artist':'Artist', 'id':'Id', 'title':'Title', 'yearOfBirth':'BirthDate', 'yearOfDeath':'DeathDate', 'medium':'Medium', 'creditLine':'CreditLine', 'year':'Date', 'acquisitionYear':'DateAcquired', 'url':'URL', 'gender':'Gender'}, inplace = True)
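
Both merges are left joins on an artist identifier cast to string, so artworks whose identifier has no match in the artists table are kept with empty artist columns. A minimal sketch of how such unmatched rows could be counted with the `indicator` option of `pd.merge`; the two small dataframes and their values are invented for illustration and are not taken from the museum data.

```python
import pandas as pd

# Invented sample data: artwork "C" points to an artist id missing from `artists`
artworks = pd.DataFrame({"title": ["A", "B", "C"], "id": ["1", "2", "9"]})
artists = pd.DataFrame({"id": ["1", "2"], "gender": ["Female", "Male"]})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join
merged = pd.merge(artworks, artists, on="id", how="left", indicator=True)

# Count the artworks that found no artist
unmatched = int((merged["_merge"] == "left_only").sum())
print("Artworks without a matching artist:", unmatched)
```

The same check could be run on the real dataframes before renaming the columns, assuming the key columns have the same type on both sides.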

# Data Cleaning

## MoMa

Substitute NaN values with zeros

In [None]:
MoMa.fillna(value='0', inplace=True)

Clean MoMa acquisition dates: they are in the form YYYY-MM-DD. We keep only the year (YYYY).

In [None]:
def cleanAcquisitionDatesMoMa(date):
    if '-' in date:
        date = date.split('-')[0]
    return date

In [None]:
MoMa["DateAcquired"] = MoMa["DateAcquired"].apply(cleanAcquisitionDatesMoMa)
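
The same year extraction can also be written with pandas string methods instead of `apply`. A minimal sketch, assuming the column holds strings such as `YYYY-MM-DD` or `'0'`; the sample values are invented for illustration.

```python
import pandas as pd

# Invented sample values shaped like the acquisition dates after fillna('0')
acquired = pd.Series(["1964-10-06", "1995-01-17", "0"])

# Take the text before the first '-' and convert it to an integer year
acquired_years = acquired.str.split("-").str[0].astype(int)
print(acquired_years.tolist())
```

Converting to `int` at this point is optional; it only makes later numeric comparisons easier.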

Clean MoMa artworks' dates. <br>
They come in several inconsistent forms, such as:<br>
(1950).  (Prints executed 1948<br>
(1883, published 1897)<br>
(1911, dated 1912, published c. 1917)<br>
(April 9 and August 31, 1971)<br>
(April 23) 1968<br>
(journals published October 2003 through June 2004)<br>

We just need the year.

In [None]:
def cleanDatesMoMa(date):
    # Keep the first run of four digits (the year).
    # Separators such as '-', '/', ',' and '.' never match \d, so they need no special handling.
    x = re.search(r"\d{4}", date)
    if x:
        date = x.group()
    else:
        # No year found: use '0' as the marker for a missing date
        date = '0'
    return date

In [None]:
MoMa["Date"] = MoMa["Date"].astype(str)
MoMa["Date"] = MoMa["Date"].apply(cleanDatesMoMa)
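
To see what the "first four digits" rule keeps, here is a small self-contained sketch applied to the date strings listed above. The helper name `first_year` and the last sample (`'n.d.'`) are introduced only for this illustration.

```python
import re

def first_year(text):
    """Return the first run of four digits in text, or '0' if there is none."""
    match = re.search(r"\d{4}", text)
    return match.group() if match else "0"

samples = [
    "(1950).  (Prints executed 1948",
    "(1883, published 1897)",
    "(1911, dated 1912, published c. 1917)",
    "(April 9 and August 31, 1971)",
    "(April 23) 1968",
    "(journals published October 2003 through June 2004)",
    "n.d.",  # invented sample with no year at all
]

years = [first_year(s) for s in samples]
for s, y in zip(samples, years):
    print(repr(s), "->", y)
```

When a string contains several years, only the first one is kept, so later publication dates and the end of a date range are discarded.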

## Tate

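Replace NaN values and placeholder strings with zeros

In [None]:
Tate.fillna(value='0', inplace=True)
# Assign the result back: an in-place replace on a selected column may not update the dataframe
Tate['Date'] = Tate['Date'].replace(to_replace=['no date', 'c'], value='0')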

Clean Tate acquisition and artwork dates: they are in the form YYYY.0. We keep only the integer part.

In [None]:
def cleanDatesTate(date):
    if '.' in date:
        date = date.split('.')[0] 
    return date

In [None]:
Tate["Date"] = Tate["Date"].astype(str)
Tate["Date"] = Tate["Date"].apply(cleanDatesTate)

Tate["DateAcquired"] = Tate["DateAcquired"].astype(str)
Tate["DateAcquired"] = Tate["DateAcquired"].apply(cleanDatesTate)
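
An alternative to string splitting is to let pandas parse the values as numbers. A minimal sketch using `pd.to_numeric`, assuming that anything that cannot be parsed should become 0; the sample values are invented for illustration.

```python
import pandas as pd

# Invented sample values shaped like the raw Tate year column
raw = pd.Series(["1984.0", "no date", "c", None, "1797.0"])

# errors="coerce" turns unparseable entries into NaN, which are then set to 0
tate_years = pd.to_numeric(raw, errors="coerce").fillna(0).astype(int)
print(tate_years.tolist())
```

This handles missing values, placeholder strings, and the `.0` suffix in one step, at the cost of silently mapping every unexpected string to 0.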

# Exploration

## How many artworks?

In [None]:
museums = [MoMa, Tate]
names = ['MoMa', 'Tate']
for name, museum in zip(names, museums):
    # Count the rows whose title is not null
    selected_rows = museum[~museum['Title'].isnull()]
    print("Total artworks at", name, ":", len(selected_rows.index))

## Which kind of artworks?

In [None]:
MoMa.to_csv('MoMa.csv')

with open('MoMa.csv', mode='r', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    # defaultdict(int) starts every new department at 0, so no membership check is needed
    artworksType = defaultdict(int)
    for item in reader:
        artworksType[item['Department']] += 1
    print(dict(artworksType))
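
The same tally can be obtained directly from a dataframe column, without writing and re-reading a CSV file. A minimal sketch with `value_counts`; the small dataframe and its department names are invented stand-ins for the MoMa data.

```python
import pandas as pd

# Invented sample standing in for the MoMa dataframe
sample = pd.DataFrame(
    {"Department": ["Drawings & Prints", "Photography", "Drawings & Prints"]}
)

# value_counts tallies each distinct value, most frequent first
department_counts = sample["Department"].value_counts()
print(department_counts.to_dict())
```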

## When do the artworks date from?

In [None]:
museums = [MoMa, Tate]
names = ['MoMa', 'Tate']
for name, museum in zip(names, museums):
    museum["Date"] = museum["Date"].astype(int)
    museum.sort_values(by=['Date'], inplace=True)
    # A date of 0 marks a missing year, so those rows are left out
    museumWithoutZeros = museum[museum['Date'] != 0]
    firstDate = museumWithoutZeros['Date'].iat[0]
    lastDate = museumWithoutZeros['Date'].iat[-1]
    print("Oldest artwork at", name, "dates from", firstDate)
    print("Most recent artwork at", name, "dates from", lastDate)
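
Sorting the whole dataframe is not required to find the earliest and latest years. A minimal sketch using `min` and `max` on the non-zero dates; the sample values are invented for illustration.

```python
import pandas as pd

# Invented sample; 0 stands for a missing year, as in the cleaned data
sample = pd.DataFrame({"Date": [0, 1929, 1768, 2015, 0]})

# Keep only known years, then read off the extremes
known = sample.loc[sample["Date"] != 0, "Date"]
oldest, newest = int(known.min()), int(known.max())
print("Oldest:", oldest, "Most recent:", newest)
```

This leaves the row order of the dataframe untouched, which can matter if later cells rely on the original order.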

## Artists

To examine artist-related questions, we rely on the museums' artist CSVs, which we have already loaded into dataframes.

### How many artists?

In [None]:
print('Total number of artists at MoMa', len(MoMaArtists))

In [None]:
print('Total number of artists at Tate', len(TateArtists))

### What is the most represented gender?

In [None]:
TateArtists['gender'].value_counts()

In [None]:
MoMaArtists['Gender'].value_counts()
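
Raw counts hide how many artists have no recorded gender, because `value_counts` drops missing values by default. A minimal sketch that reports shares and keeps the missing entries; the sample values are invented for illustration.

```python
import pandas as pd

# Invented sample: one artist out of five has no recorded gender
genders = pd.Series(["Male", "Female", "Male", None, "Male"])

# normalize=True returns proportions; dropna=False keeps the missing values as a category
shares = genders.value_counts(normalize=True, dropna=False)
print(shares)
```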

### What are the most represented nationalities?

#### Tate

Since artists' names are in the form 'Surname, Name', we use a function to normalise that as 'Name Surname'.

In [None]:
# Names excluded from the integration step
excluded_names = [
    'Anonymous',
    'Art & Language (Michael Baldwin, born 1945; Mel Ramsden, born 1944)',
    'Art & Language (Terry Atkinson, born 1939; David Bainbridge, born 1941; Michael Baldwin, born 1945; Harold Hurrell, born',
    'Art & Language (Terry Atkinson, born 1939; Michael Baldwin, born 1945)',
    'Atlas Group',
    'Becher, Prof. Bernd',
    'Black Audio Film Collective (John Akomfrah; Reece Auguis; Edward George; Lina Gopaul; Avril Johnson; David Lawson; Trevo',
    'Booth, L',
    'Boyd and Evans, Fionnuala and Leslie',
    'British (?) School',
    'British (?) School 19th century',
    'British School 17th century',
    'British School 16th century',
    'British School 17th or 18th century',
    'British School 18th century',
    'British School 19th century',
    'British School 20th century',
    'Chinese School 18th century',
    'French School 18th century',
    'French School 19th century',
    'Gent, G.W.',
    'Glik, M.',
    'International Local (Sarah Charlesworth; Joseph Kosuth; Anthony McCall)',
    'Italian or German (?) School 17th century',
    'Langlands and Bell, Ben and Nikki',
    'Lucy and Eegyudluk',
    'M/M (Paris, France)',
    'Moore, T.',
    'Skeaf, D.',
    'T R Uthco (Doug Hall born 1944, Diane Andrews Hall born 1945, Jody Procter 1944-1998)',
    'Art & Language (Ian Burn, 1939-1993; Mel Ramsden, born 1944)',
    'Thomson, W.',
    'Turton, M.',
    'Unknown',
    'Young-Hae Chang Heavy Industries (Young-Hae Chang, Marc Voge)',
]

# Keep the artists with no recorded gender, minus the excluded names
TateIntegration = TateArtists[TateArtists['gender'].isna() & ~TateArtists['name'].isin(excluded_names)].copy()

In [None]:
def cleanArtistsNames(name):
    if ',' in name:
        name= name.split(',')
        name[0], name[1] = name[1], name[0]
        name = ' '.join(name)
    return name.strip()

In [None]:
TateIntegration["name"] = TateIntegration["name"].apply(cleanArtistsNames)
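
A small self-contained sketch of the 'Surname, Name' to 'Name Surname' normalisation. It is a variation, not the function used above: it splits on the first comma only, so any further commas stay with the given name. The helper name and both sample names are chosen for illustration.

```python
def surname_first_to_name_first(name):
    """Turn 'Surname, Name' into 'Name Surname'; leave other names unchanged."""
    if "," not in name:
        return name.strip()
    # Split on the first comma only
    surname, given = name.split(",", 1)
    return f"{given.strip()} {surname.strip()}"

normalised = [
    surname_first_to_name_first("Turner, Joseph Mallord William"),
    surname_first_to_name_first("Gilbert & George"),
]
print(normalised)
```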

In [None]:
artists_genders_from_ids = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?artist
WHERE {{
    ?artist wdt:P31 wd:Q5 . 
    ?artist wdt:P106 ?occupation
                  FILTER (?occupation IN (wd:Q10774753) ) 
    ?artist rdfs:label ?o
    FILTER regex(?o, \"^{}$\" )
            FILTER (langMatches(lang(?o), "EN")).
}}

"""

In [None]:
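def find_artists_genders_from_ids(name):
    # Fill the query template with the artist's name and send it to Wikidata
    query = artists_genders_from_ids.format(name.strip())
    res = return_sparql_query_results(query_string=query)
    print(query)
    try:
        # Take the entity URI of the first match, if any
        wdt_uri = res['results']['bindings'][0]['artist']['value']
    except (IndexError, KeyError):
        # No match: return an empty string
        return ""
    # The QID is the last segment of the entity URI
    return wdt_uri.split("/")[-1]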
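
Names that contain quotes or regular-expression characters such as `(` or `.` can break or distort a query built with plain string formatting. A possible variation, sketched under assumptions: it matches the English label exactly instead of using `regex`, and escapes the name before inserting it. The template, the helper names, and the sample name are hypothetical; the sketch only builds the query string and sends nothing to Wikidata.

```python
# Hypothetical template: exact English-label match instead of a regex filter
QUERY_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?artist
WHERE {{
    ?artist wdt:P31 wd:Q5 ;
            rdfs:label "{label}"@en .
}}
"""

def escape_sparql_string(text):
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')

def build_artist_query(name):
    """Return the query text for one artist name (no network call is made)."""
    return QUERY_TEMPLATE.format(label=escape_sparql_string(name.strip()))

# Invented sample name containing double quotes
query_text = build_artist_query('  Ana "X" Smith ')
print(query_text)
```

The doubled braces in the template are turned into single braces by `str.format`, which is why the `WHERE` block has to be written with `{{` and `}}`.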

In [None]:
TateIntegration["Artist Entity"] = TateIntegration["name"].apply(find_artists_genders_from_ids)

In [None]:
pd.set_option('display.max_rows', None)
display(TateIntegration)

In [None]:
new = TateIntegration.copy(deep=True)
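# Assign the result back: an in-place replace on a selected column may not update the dataframe
new['Artist Entity'] = new['Artist Entity'].replace(to_replace='', value='0')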

newLo =  new[new['Artist Entity'].isna()]

display(new)