Importing the dataset from the Met Museum's GitHub repository

In [1]:
!git lfs clone https://github.com/metmuseum/openaccess

          with new flags from 'git clone'

'git clone' has been updated in upstream Git to have comparable
speeds to 'git lfs clone'.
Cloning into 'openaccess'...
remote: Enumerating objects: 936, done.
remote: Counting objects: 100% (323/323), done.
remote: Compressing objects: 100% (318/318), done.
remote: Total 936 (delta 7), reused 315 (delta 5), pack-reused 613 (from 1)
Receiving objects: 100% (936/936), 128.45 KiB | 8.56 MiB/s, done.
Resolving deltas: 100% (36/36), done.


In [2]:
%cd

/root


In [3]:
import pandas as pd

met = pd.read_csv('/content/openaccess/MetObjects.csv', low_memory=False)


In [4]:
met.shape

(484956, 54)

In [5]:
met.columns.tolist()

['Object Number',
 'Is Highlight',
 'Is Timeline Work',
 'Is Public Domain',
 'Object ID',
 'Gallery Number',
 'Department',
 'AccessionYear',
 'Object Name',
 'Title',
 'Culture',
 'Period',
 'Dynasty',
 'Reign',
 'Portfolio',
 'Constituent ID',
 'Artist Role',
 'Artist Prefix',
 'Artist Display Name',
 'Artist Display Bio',
 'Artist Suffix',
 'Artist Alpha Sort',
 'Artist Nationality',
 'Artist Begin Date',
 'Artist End Date',
 'Artist Gender',
 'Artist ULAN URL',
 'Artist Wikidata URL',
 'Object Date',
 'Object Begin Date',
 'Object End Date',
 'Medium',
 'Dimensions',
 'Credit Line',
 'Geography Type',
 'City',
 'State',
 'County',
 'Country',
 'Region',
 'Subregion',
 'Locale',
 'Locus',
 'Excavation',
 'River',
 'Classification',
 'Rights and Reproduction',
 'Link Resource',
 'Object Wikidata URL',
 'Metadata Date',
 'Repository',
 'Tags',
 'Tags AAT URL',
 'Tags Wikidata URL']

Narrow the scope of the project to the European Paintings department

In [6]:
met['Department'].unique()

array(['The American Wing', 'European Sculpture and Decorative Arts',
       'Modern and Contemporary Art', 'Arms and Armor', 'Medieval Art',
       'Asian Art', 'Islamic Art', 'Costume Institute',
       'Arts of Africa, Oceania, and the Americas', 'Drawings and Prints',
       'Greek and Roman Art', 'Photographs', 'Ancient Near Eastern Art',
       'Egyptian Art', 'European Paintings', 'Robert Lehman Collection',
       'The Cloisters', 'Musical Instruments', 'The Libraries'],
      dtype=object)

In [7]:
# Keep only European Paintings and renumber the rows from 0
met = met[met['Department'] == 'European Paintings'].reset_index(drop=True)

met['Department'].unique()

array(['European Paintings'], dtype=object)
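
Before filtering, it can help to see how many objects each department holds, so the size of the retained subset is known in advance. The following is a minimal sketch on a toy frame; the rows are invented for illustration, and on the real data the equivalent call would be `met['Department'].value_counts()` run before the filter.

```python
import pandas as pd

# Toy stand-in for the full table; these rows are illustrative only
toy = pd.DataFrame({
    "Object ID": [1, 2, 3, 4, 5],
    "Department": ["European Paintings", "Asian Art", "European Paintings",
                   "Photographs", "European Paintings"],
})

# Number of rows per department, largest first
dept_counts = toy["Department"].value_counts()
print(dept_counts)

# Same filter as above, followed by reset_index for a clean 0..n-1 index
subset = toy[toy["Department"] == "European Paintings"].reset_index(drop=True)
print(len(subset))
```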

Check which columns are entirely empty (every value missing) in this subset

In [8]:
met.isna().all()


Unnamed: 0,0
Object Number,False
Is Highlight,False
Is Timeline Work,False
Is Public Domain,False
Object ID,False
Gallery Number,False
Department,False
AccessionYear,False
Object Name,False
Title,False


In [9]:
empty_cols = met.columns[met.isna().all()].tolist()
print(empty_cols)

['Culture', 'Period', 'Dynasty', 'Reign', 'Portfolio', 'Geography Type', 'City', 'State', 'County', 'Country', 'Region', 'Subregion', 'Locale', 'Locus', 'Excavation', 'River', 'Metadata Date']


In [10]:
met = met.drop(columns=empty_cols)
met.isna().all()


Unnamed: 0,0
Object Number,False
Is Highlight,False
Is Timeline Work,False
Is Public Domain,False
Object ID,False
Gallery Number,False
Department,False
AccessionYear,False
Object Name,False
Title,False


In [11]:
met.columns.to_list()

['Object Number',
 'Is Highlight',
 'Is Timeline Work',
 'Is Public Domain',
 'Object ID',
 'Gallery Number',
 'Department',
 'AccessionYear',
 'Object Name',
 'Title',
 'Constituent ID',
 'Artist Role',
 'Artist Prefix',
 'Artist Display Name',
 'Artist Display Bio',
 'Artist Suffix',
 'Artist Alpha Sort',
 'Artist Nationality',
 'Artist Begin Date',
 'Artist End Date',
 'Artist Gender',
 'Artist ULAN URL',
 'Artist Wikidata URL',
 'Object Date',
 'Object Begin Date',
 'Object End Date',
 'Medium',
 'Dimensions',
 'Credit Line',
 'Classification',
 'Rights and Reproduction',
 'Link Resource',
 'Object Wikidata URL',
 'Repository',
 'Tags',
 'Tags AAT URL',
 'Tags Wikidata URL']

Drop columns that are not useful for this project or that contain mostly missing values in this subset of the data

In [12]:
columns_to_be_dropped = [
    'Is Highlight',
    'Is Timeline Work',
    'Is Public Domain',
    'Rights and Reproduction',
    'Artist ULAN URL',
    'Artist Wikidata URL',
    'Tags AAT URL',
    'Link Resource',
    'Repository',
    'Object Wikidata URL',
    'Tags Wikidata URL']

met = met.drop(columns = columns_to_be_dropped)
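
The list above was chosen by hand. To back up the "mostly missing" judgment with numbers, one option is to compute the share of missing values per column. This is a sketch on invented toy data; the helper name `mostly_missing_columns` and the threshold values are assumptions made for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd

def mostly_missing_columns(df, threshold=0.9):
    """Return names of columns whose share of missing values exceeds threshold."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    return missing_share[missing_share > threshold].index.tolist()

# Toy frame, invented for illustration
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Artist Gender": [np.nan, np.nan, np.nan, "Female"],  # 75% missing
    "Gallery Number": [np.nan, np.nan, np.nan, np.nan],   # 100% missing
})

print(toy.isna().mean())
print(mostly_missing_columns(toy, threshold=0.7))
```

On the real table, `met.isna().mean().sort_values(ascending=False)` would give the same per-column view, sorted from most to least missing.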

In [13]:
met.head()

Unnamed: 0,Object Number,Object ID,Gallery Number,Department,AccessionYear,Object Name,Title,Constituent ID,Artist Role,Artist Prefix,...,Artist End Date,Artist Gender,Object Date,Object Begin Date,Object End Date,Medium,Dimensions,Credit Line,Classification,Tags
0,13.130,435570,,European Paintings,1913,"Painting, miniature",A Ship in a Stormy Sea,10729,Artist,,...,1900,,1892,1837,1900,Card,1 x 2 in. (26 x 53 mm),"Gift of Isabel F. Hapgood, 1913",Miniatures,Seas|Storms|Ships
1,76.10,435572,,European Paintings,1876,"Painting, part of an altarpiece",Saint Giles with Christ Triumphant over Satan ...,10730,Artist,,...,1447,,ca. 1408,1403,1413,"Tempera on wood, gold ground",Overall 59 5/8 x 39 1/2 in. (151.4 x 100.3 cm)...,"Gift of J. Bruyn Andrews, 1876",Paintings,Apostles|Saints|Christ
2,1985.5,435573,,European Paintings,1985,Painting,Flora and Zephyr,16159,Artist,,...,1752,,1730s,1730,1739,Oil on canvas,84 x 58 in. (213.4 x 147.3 cm),"Purchase, Rudolph and Lentilhon G. von Fluegge...",Paintings,Goddess|Putti|Flowers|Landscapes
3,12.6,435574,,European Paintings,1912,"Painting, predella panel",The Crucifixion,16601,Artist,,...,1428,,,1389,1428,"Tempera on wood, gold ground",20 3/4 x 38 1/2 in. (52.7 x 97.8 cm),"Rogers Fund, 1912",Paintings,Soldiers|Men|Crucifixion|Horses|Mountains|Ange...
4,42.53.2,435575,,European Paintings,1942,"Painting, miniature","Jérôme Bonaparte (1784–1860), King of Westphalia",10864,Artist,,...,1808,,,1803,1813,Ivory,2 3/8 x 1 7/8 in. (60 x 48 mm),"Gift of Helen O. Brice, 1942",Miniatures,Kings|Men|Portraits


In [52]:
#from google.colab import files

#met.to_csv('cleanedMet.csv', index=False)

#files.download('cleanedMet.csv')

Fields to add:

* MetObjectURL
* image_url
* on_view (True/False), potentially with an on_view_location
* overview_text

Possible later additions:

* provenance_text
* exhibition_history
* inscriptions


In [14]:
BASE = "https://www.metmuseum.org/art/collection/search/"
met['MetObjectURL'] = BASE + met['Object ID'].astype(str)


In [15]:
met['MetObjectURL'].iloc[5]

'https://www.metmuseum.org/art/collection/search/435576'

In [47]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import time
from tqdm.notebook import tqdm


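def scrape_met_objects(url_series, delay=0.0, verbose=True):
    """Scrape image URL, on-view status, and overview text for each object page.

    Returns a DataFrame with one row per URL, in the order of url_series.
    delay is the pause in seconds between requests; verbose toggles the progress bar.
    """

    image_urls = []
    on_views = []
    overview_texts = []

    for url in tqdm(url_series, desc="Scraping MET pages", disable=not verbose):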

        # Start from None so the three lists always stay the same length
        image_url, on_view, overview_text = None, None, None

        try:
            # Build request with browser headers
            request = urllib.request.Request(url)
            request.add_header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36")
            opener = urllib.request.build_opener()
            response = opener.open(request)

            html = response.read().decode("utf-8", errors="replace")

            soup = BeautifulSoup(html, "lxml")

            # 1. IMAGE URL
            image_tag = soup.find("meta", {"property": "og:image"})
            image_url = image_tag.get("content") if image_tag else None

            # 2. ON VIEW STATUS (stays None if neither phrase is found)
            text = soup.get_text(" ", strip=True)

            if "Not on view" in text:
                on_view = False
            elif "On view at" in text:
                on_view = True

            # 3. OVERVIEW TEXT
            overview_div = soup.select_one("div[class*='object-overview_label']")
            overview_text = overview_div.get_text(" ", strip=True) if overview_div else None

        except Exception as e:
            # Fail gracefully: record None for every field of this URL
            image_url, on_view, overview_text = None, None, None
            if verbose:
                tqdm.write(f"Failed to scrape {url}: {e}")

        # Append exactly once per URL, whether or not the request succeeded
        image_urls.append(image_url)
        on_views.append(on_view)
        overview_texts.append(overview_text)

        time.sleep(delay)

    return pd.DataFrame({
        "image_url": image_urls,
        "on_view": on_views,
        "overview_text": overview_texts
    })
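
The scraper records None for any page whose request fails on the first attempt. One possible refinement is to retry transient failures before giving up. The sketch below is an assumption about how that could look, not part of the notebook: `with_retries`, `fetch`, `attempts`, and `base_delay` are hypothetical names, and the fetcher is passed in as a function so the idea can be shown without touching the network.

```python
import time

def with_retries(fetch, url, attempts=3, base_delay=1.0):
    """Call fetch(url), retrying on any exception with exponential backoff.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... between attempts
            time.sleep(base_delay * (2 ** attempt))

# Demonstration with a fake fetcher that fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary failure")
    return "<html>ok</html>"

result = with_retries(flaky_fetch, "https://example.org/", attempts=3, base_delay=0.0)
print(result, calls["n"])
```

Inside the scraper, the function passed as `fetch` would wrap the request and `response.read()` steps, so the surrounding `try`/`except` only sees a failure after all attempts are exhausted.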


In [48]:
scraped = scrape_met_objects(met['MetObjectURL'])


Scraping MET pages:   0%|          | 0/2626 [00:00<?, ?it/s]

In [49]:
met['image_url'] = scraped['image_url']
met['on_view'] = scraped['on_view']
met['overview_text'] = scraped['overview_text']
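
Because failed or incomplete pages end up as None, it is worth checking how complete the scraped columns are before relying on them. A minimal sketch, using an invented toy frame in place of the real `scraped` result:

```python
import pandas as pd

# Toy stand-in for the scraper output; values are invented for illustration
scraped_toy = pd.DataFrame({
    "image_url": ["https://example.org/a.jpg", None, "https://example.org/c.jpg"],
    "on_view": [False, None, True],
    "overview_text": ["Some text", None, None],
})

# Number of missing values per scraped column
missing_counts = scraped_toy.isna().sum()
print(missing_counts)

# Distribution of on_view, including rows where the status is unknown
on_view_counts = scraped_toy["on_view"].value_counts(dropna=False)
print(on_view_counts)

# Rows where every scraped field is missing most likely failed outright
failed_rows = scraped_toy[scraped_toy.isna().all(axis=1)]
print(failed_rows.index.tolist())
```

The indices of fully empty rows could then be used to re-run the scraper on just those URLs.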


In [51]:
met.head()

Unnamed: 0,Object Number,Object ID,Gallery Number,Department,AccessionYear,Object Name,Title,Constituent ID,Artist Role,Artist Prefix,...,Object End Date,Medium,Dimensions,Credit Line,Classification,Tags,MetObjectURL,image_url,on_view,overview_text
0,13.130,435570,,European Paintings,1913,"Painting, miniature",A Ship in a Stormy Sea,10729,Artist,,...,1900,Card,1 x 2 in. (26 x 53 mm),"Gift of Isabel F. Hapgood, 1913",Miniatures,Seas|Storms|Ships,https://www.metmuseum.org/art/collection/searc...,https://collectionapi.metmuseum.org/api/collec...,False,Aivazovsky was a celebrated painter of seascap...
1,76.10,435572,,European Paintings,1876,"Painting, part of an altarpiece",Saint Giles with Christ Triumphant over Satan ...,10730,Artist,,...,1413,"Tempera on wood, gold ground",Overall 59 5/8 x 39 1/2 in. (151.4 x 100.3 cm)...,"Gift of J. Bruyn Andrews, 1876",Paintings,Apostles|Saints|Christ,https://www.metmuseum.org/art/collection/searc...,https://collectionapi.metmuseum.org/api/collec...,False,"These panels, from an altarpiece for the Valen..."
2,1985.5,435573,,European Paintings,1985,Painting,Flora and Zephyr,16159,Artist,,...,1739,Oil on canvas,84 x 58 in. (213.4 x 147.3 cm),"Purchase, Rudolph and Lentilhon G. von Fluegge...",Paintings,Goddess|Putti|Flowers|Landscapes,https://www.metmuseum.org/art/collection/searc...,https://collectionapi.metmuseum.org/api/collec...,False,The composition celebrates the end of winter t...
3,12.6,435574,,European Paintings,1912,"Painting, predella panel",The Crucifixion,16601,Artist,,...,1428,"Tempera on wood, gold ground",20 3/4 x 38 1/2 in. (52.7 x 97.8 cm),"Rogers Fund, 1912",Paintings,Soldiers|Men|Crucifixion|Horses|Mountains|Ange...,https://www.metmuseum.org/art/collection/searc...,https://collectionapi.metmuseum.org/api/collec...,False,
4,42.53.2,435575,,European Paintings,1942,"Painting, miniature","Jérôme Bonaparte (1784–1860), King of Westphalia",10864,Artist,,...,1813,Ivory,2 3/8 x 1 7/8 in. (60 x 48 mm),"Gift of Helen O. Brice, 1942",Miniatures,Kings|Men|Portraits,https://www.metmuseum.org/art/collection/searc...,https://collectionapi.metmuseum.org/api/collec...,False,


In [16]:
import urllib.request
from bs4 import BeautifulSoup

url = met['MetObjectURL'].iloc[5]

# Send a browser-like User-Agent header with the request.
request = urllib.request.Request( url )
request.add_header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36")
opener = urllib.request.build_opener()
response = opener.open(request)

html = response.read().decode("utf-8", errors="replace")

print(html[:1000])



<!DOCTYPE html><html lang="en" class="
				__variable_e798ec
				__variable_bfed6e
				__variable_968aec
				__variable_683e8c
				__variable_64677c
				__variable_cb5e93" data-sentry-component="LocaleLayout" data-sentry-source-file="layout.tsx"><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width, initial-scale=1"/><link rel="preload" href="/_next/static/media/78dbaeca31577a23-s.p.woff2" as="font" crossorigin="" type="font/woff2"/><link rel="preload" href="/_next/static/media/84a4b0cac32cffbe-s.p.woff2" as="font" crossorigin="" type="font/woff2"/><link rel="preload" href="/_next/static/media/c4b700dcb2187787-s.p.woff2" as="font" crossorigin="" type="font/woff2"/><link rel="preload" href="/_next/static/media/e4af272ccee01ff0-s.p.woff2" as="font" crossorigin="" type="font/woff2"/><link rel="preload" as="image" href="https://collectionapi.metmuseum.org/api/collection/v1/iiif/435576/2004970/main-image" fetchPriority="high"/><link rel="stylesheet" href="/_next/sta

Image URL

In [17]:
soup = BeautifulSoup(html, "lxml")

image_tag = soup.find("meta", {"property": "og:image"})
# Guard against pages that have no og:image meta tag
image_url = image_tag.get("content") if image_tag else None

print(image_url)

https://collectionapi.metmuseum.org/api/collection/v1/iiif/435576/2004970/main-image


On view?

In [18]:
text = soup.get_text(" ", strip=True)

if "Not on view" in text:
    on_view = False
elif "On view at" in text:
    on_view = True
else:
    on_view = None

print(on_view)


False
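
The list of fields to add mentions a possible on_view_location. If the page text states the location right after "On view at", it could be captured with a regular expression. The sketch below assumes a phrasing such as "On view at The Met Fifth Avenue in Gallery 602"; that wording, the pattern, and the helper name `parse_on_view` are assumptions for illustration and would need checking against real pages.

```python
import re

# Assumed phrasing, e.g. "On view at The Met Fifth Avenue in Gallery 602";
# the real page text may differ, so treat this pattern as a starting point.
ON_VIEW_PATTERN = re.compile(r"On view at (?P<site>.+?) in Gallery (?P<gallery>\w+)")

def parse_on_view(text):
    """Return (on_view, location) from page text; location is None if unknown."""
    if "Not on view" in text:
        return False, None
    match = ON_VIEW_PATTERN.search(text)
    if match:
        return True, f"{match.group('site')}, Gallery {match.group('gallery')}"
    if "On view at" in text:
        # On view, but the location did not match the assumed phrasing
        return True, None
    return None, None

print(parse_on_view("Title On view at The Met Fifth Avenue in Gallery 602 Artist"))
print(parse_on_view("Title Not on view Artist"))
print(parse_on_view("No status here"))
```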


Overview text

In [19]:
overview_div = soup.select_one("div[class*='object-overview_label']")
# Not every object page has an overview, so guard against a missing div
overview_text = overview_div.get_text(" ", strip=True) if overview_div else None

print(overview_text)


This early work by Fra Angelico dates about 1425 and formed part of the decoration of the frame of an altarpiece still in the church of San Domenico, Fiesole, where the artist was a friar until 1436. The altarpiece was modernized in 1501, and parts of its frame were sold in the nineteenth century. The elegant figure type and delicate modeling owe much to the example of Ghiberti, the author of the famous baptistery doors in Florence.
