# RCE - Archaeology Data extraction and preparation  

This notebook extracts metadata from the Archaeology Data Station, starting from DOIs that were provided to me before this work began. It reads the metadata from zip archives, collects the DOIs, and supports filtering on specific metadata fields (e.g. selecting only DOIs with missing geospatial data, or only dendrochronology data).

Goals of this notebook:
- Extract DOIs
- Filter data
- Initial data exploration: How many datasets do not have geospatial information? What values can the coordinate fields take (i.e. what are the reference system formats)? 


## Explore

In [1]:
import pandas as pd
import zipfile

In [4]:
# At the start of this project, the DANS team handed me two zip files.
# One contains all Archaeology metadata, the other contains DataverseNL metadata, including dendrochronology.


# Paths to the zip files
#zip_file_1 = '../data/explore/metadata-1.zip' # DataverseNL metadata, incl dendrochronology 
zip_file_2 = '../../data/explore/metadata-2.zip' # Archaeology metadata

In [5]:
### Inspect zip file contents 

# Path to zip file
zip_file_path = '../../data/explore/metadata-2.zip'
#zip_file_path = '../data/explore/metadata-1.zip'

# Open the zip file and list its contents
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_contents = zip_ref.namelist()  # List all files inside the zip
    print("Files in the ZIP archive:")
    for file in zip_contents:
        print(file)



Files in the ZIP archive:
metadata-16-01-2025/
metadata-16-01-2025/getmetadata.sh
metadata-16-01-2025/metadata-from-prod-server-0.csv
metadata-16-01-2025/metadata-from-prod-server-1.csv
metadata-16-01-2025/metadata-from-prod-server-10.csv
metadata-16-01-2025/metadata-from-prod-server-11.csv
metadata-16-01-2025/metadata-from-prod-server-12.csv
metadata-16-01-2025/metadata-from-prod-server-13.csv
metadata-16-01-2025/metadata-from-prod-server-14.csv
metadata-16-01-2025/metadata-from-prod-server-15.csv
metadata-16-01-2025/metadata-from-prod-server-16.csv
metadata-16-01-2025/metadata-from-prod-server-17.csv
metadata-16-01-2025/metadata-from-prod-server-18.csv
metadata-16-01-2025/metadata-from-prod-server-19.csv
metadata-16-01-2025/metadata-from-prod-server-2.csv
metadata-16-01-2025/metadata-from-prod-server-20.csv
metadata-16-01-2025/metadata-from-prod-server-21.csv
metadata-16-01-2025/metadata-from-prod-server-22.csv
metadata-16-01-2025/metadata-from-prod-server-23.csv
metadata-16-01-2025/

In [6]:
### Load the metadata from all CSV files in the zip file into a single DataFrame 

# Path to the zip file
zip_file_path = '../../data/explore/metadata-2.zip'

# Read each CSV file in the archive into its own DataFrame
frames = []
with zipfile.ZipFile(zip_file_path, 'r') as z:
    # List all files in the zip
    for filename in z.namelist():
        if filename.endswith('.csv'):
            with z.open(filename) as f:
                frames.append(pd.read_csv(f))

# Concatenate once; calling pd.concat inside the loop copies all rows on every pass
df = pd.concat(frames, ignore_index=True)

# Display the first rows of the combined DataFrame
df.head()

Unnamed: 0,dsPersistentId,publicationStatus,title,dsDescriptionValue,dansSpatialPointX,dansSpatialPointY,dansSpatialPointScheme,dansSpatialBoxNorth,dansSpatialBoxEast,dansSpatialBoxSouth,dansSpatialBoxWest,dansSpatialBoxScheme,dansSpatialCoverageControlleddansSpatialCoverageText
0,doi:10.17026/dans-zrj-unr7,Published,Gouda Kattensingel Booronderzoek,ADC ArcheoProjecten heeft in januari 2017 een ...,108240,447370,RD (in m.),,,,,,
1,doi:10.17026/dans-299-9dpm,Published,Zandwingebied Q10R,In opdracht van- en in samenwerking met Rijksw...,77230,490214,RD (in m.),,,,,,
2,doi:10.17026/dans-zqy-ymw8,Published,Kerkrade Landgraaf buffer Kraanweg Booronderzoek,ADC ArcheoProjecten heeft in december 2016 een...,203505,324100,RD (in m.),,,,,,
3,doi:10.17026/dans-z8d-9c6h,Published,Archeologisch Bureauonderzoek en Inventarisere...,In het kader van de vergunningverlening ten be...,95873,421799,RD (in m.),,,,,,
4,doi:10.17026/dans-x9v-j3qu,Published,Archeologisch Bureauonderzoek Reconstructie Si...,In het kader van de vergunningverlening ten be...,197332197337197185,356659356100355712,"RD (in m.),RD (in m.),RD (in m.)",,,,,,


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159289 entries, 0 to 159288
Data columns (total 13 columns):
 #   Column                                                Non-Null Count   Dtype  
---  ------                                                --------------   -----  
 0   dsPersistentId                                        159289 non-null  object 
 1   publicationStatus                                     159289 non-null  object 
 2   title                                                 159289 non-null  object 
 3   dsDescriptionValue                                    159289 non-null  object 
 4   dansSpatialPointX                                     56945 non-null   object 
 5   dansSpatialPointY                                     56944 non-null   object 
 6   dansSpatialPointScheme                                56732 non-null   object 
 7   dansSpatialBoxNorth                                   4445 non-null    object 
 8   dansSpatialBoxEast                          

In [8]:
df

Unnamed: 0,dsPersistentId,publicationStatus,title,dsDescriptionValue,dansSpatialPointX,dansSpatialPointY,dansSpatialPointScheme,dansSpatialBoxNorth,dansSpatialBoxEast,dansSpatialBoxSouth,dansSpatialBoxWest,dansSpatialBoxScheme,dansSpatialCoverageControlleddansSpatialCoverageText
0,doi:10.17026/dans-zrj-unr7,Published,Gouda Kattensingel Booronderzoek,ADC ArcheoProjecten heeft in januari 2017 een ...,108240,447370,RD (in m.),,,,,,
1,doi:10.17026/dans-299-9dpm,Published,Zandwingebied Q10R,In opdracht van- en in samenwerking met Rijksw...,77230,490214,RD (in m.),,,,,,
2,doi:10.17026/dans-zqy-ymw8,Published,Kerkrade Landgraaf buffer Kraanweg Booronderzoek,ADC ArcheoProjecten heeft in december 2016 een...,203505,324100,RD (in m.),,,,,,
3,doi:10.17026/dans-z8d-9c6h,Published,Archeologisch Bureauonderzoek en Inventarisere...,In het kader van de vergunningverlening ten be...,95873,421799,RD (in m.),,,,,,
4,doi:10.17026/dans-x9v-j3qu,Published,Archeologisch Bureauonderzoek Reconstructie Si...,In het kader van de vergunningverlening ten be...,197332197337197185,356659356100355712,"RD (in m.),RD (in m.),RD (in m.)",,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
159284,doi:10.17026/AR/H4IDSE,Published,"IVO-karterende fase zonnepark Groot Roodehaan,...",Laagland Archeologie heeft in april 2021 een k...,238853238979239404239437239175,579376579083579242579530579532,"RD (in m.),RD (in m.),RD (in m.),RD (in m.)",,,,,,
159285,doi:10.17026/AR/19BCUB,Published,Gemeente Moerdijk plangebied de Kogelvangers t...,Op basis van het bureauonderzoek geldt er voor...,,,,411813,90282,411486,89850,RD (in m.),
159286,doi:10.17026/AR/QXOI5Q,Published,Archeologisch bureauonderzoek perceel tussen K...,In november 2021 is een archeologisch bureauon...,101.391,497.243,RD (in m.),,,,,,
159287,doi:10.17026/AR/QHHAGE,Published,Archeologisch bureauonderzoek Den Ilp 53a te D...,In november 2021 is in opdracht van een partic...,122.592,495.815,,,,,,,
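
The rows above show several shapes of value in `dansSpatialPointX`: plain integers such as `108240`, decimals such as `101.391`, and long digit runs such as `197332197337197185` in records whose scheme lists several points. As a first step towards answering which formats occur, the values could be bucketed with a small helper. This is only a sketch: `classify_rd_x` is a hypothetical name, and the range check assumes that RD x-coordinates in metres fall roughly between 0 and 300000.

```python
def classify_rd_x(value):
    """Bucket a raw dansSpatialPointX value by its apparent format."""
    # None and NaN (NaN is the only value not equal to itself) count as missing
    if value is None or value != value:
        return "missing"
    text = str(value).strip()
    try:
        number = float(text)
    except ValueError:
        return "non_numeric"
    if "." in text:
        return "decimal"
    # Assumption: RD x in metres lies roughly within 0..300000
    if 0 <= number <= 300_000:
        return "integer_in_range"
    return "out_of_range"


for raw in ["108240", "101.391", "197332197337197185", None]:
    print(raw, "->", classify_rd_x(raw))
```

Applied to the real column (for example with `df.dansSpatialPointX.map(classify_rd_x).value_counts()`), this would give a rough count per bucket that can guide which cases need separate parsing.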


In [7]:
# Take a look at some titles 
titles = df.title.tolist()
for title in titles[:100]:
    print(title)
    print()

Gouda Kattensingel Booronderzoek

Zandwingebied Q10R

Kerkrade Landgraaf buffer Kraanweg Booronderzoek

Archeologisch Bureauonderzoek en Inventariserend Veldonderzoek door middel van grondboringen, verkennend, Planlocatie Zuiddijk 14b te Maasdam, Gemeente Binnenmaas

Archeologisch Bureauonderzoek Reconstructie Singelring, Fase 1 en 2, Roermond, Gemeente Roermond

Proefsleuvenonderzoek Erf 1 en Erf 12 in het tracé van de N18 Varsseveld-Enschede

Archeologisch onderzoek Vughtse Hoeve, Vught

Bureauonderzoek en Inventariserend veldonderzoek verkennende fase Uddelerveen 97 te Uddel

Bureauonderzoek en inventariserend veldonderzoek d.m.v. boringen (verkennende fase) Onder de Mast en Vincent van Goghstraat te Zundert

Zevenhuizen, Wollenfoppeweg 109 (Gemeente Zuidplas)  Een Bureauonderzoek

Archeologisch Bureauonderzoek en Inventariserend Veldonderzoek door middel van grondboringen Plangebied Sectie AF, nr. 1879 (gedeeltelijk), Ovezande, Gemeente Borsele

Vier locaties in en nabij de Oisterw

In [8]:
# Inspect values of publicationStatus
pubstatus = df.publicationStatus.value_counts()
print(pubstatus)

publicationStatus
Published                      158254
Unpublished,Draft                 703
Unpublished,Draft,In Review       179
Deaccessioned                     109
Draft                              44
Name: count, dtype: int64


In [9]:
# Inspect values of dansSpatialBoxScheme
status = df.dansSpatialBoxScheme.value_counts()
print(status)

dansSpatialBoxScheme
RD (in m.)                                                                                                                                                                                    4175
RD (in m.),RD (in m.)                                                                                                                                                                          168
RD (in m.),RD (in m.),RD (in m.)                                                                                                                                                                40
longitude/latitude (degrees)                                                                                                                                                                    29
RD (in m.),RD (in m.),RD (in m.),RD (in m.)                                                                                                                                                     20
RD (
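
The counts above show that one record can carry several scheme entries in a single comma-separated string. A possible way to count entries per record and to tally the distinct schemes is sketched below on toy data. It assumes that the comma is the only separator and that no scheme name contains a comma itself.

```python
import pandas as pd

# Toy values modelled on the output above; None stands for a missing scheme
schemes = pd.Series(
    [
        "RD (in m.)",
        "RD (in m.),RD (in m.),RD (in m.)",
        "longitude/latitude (degrees)",
        None,
    ],
    dtype=object,
)

# Split each string into a list of scheme entries; missing values stay missing
parts = schemes.str.split(",")

# Number of scheme entries per record
n_entries = parts.str.len()

# One row per entry, then count how often each distinct scheme occurs
distinct = parts.explode().value_counts()

print(n_entries)
print(distinct)
```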

In [10]:
# Count missing values 
nan_counts = df.isna().sum()
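
The raw counts are easier to compare as percentages. The sketch below shows one way to put both side by side. It uses a small toy frame in place of `df`, and `summary` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame standing in for `df`; column names mirror the real metadata
toy = pd.DataFrame({
    "dsPersistentId": ["doi:a", "doi:b", "doi:c", "doi:d"],
    "dansSpatialPointX": ["108240", None, None, "77230"],
    "dansSpatialBoxNorth": [None, None, "411813", None],
})

# Missing values per column, as a count and as a percentage of all rows
nan_counts = toy.isna().sum()
nan_pct = (toy.isna().mean() * 100).round(1)

summary = pd.DataFrame({"missing": nan_counts, "missing_pct": nan_pct})
print(summary)
```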

In [11]:
# Select only published datasets
df_pub = df[df.publicationStatus == 'Published']

In [12]:
df_pub.info()

<class 'pandas.core.frame.DataFrame'>
Index: 158254 entries, 0 to 159288
Data columns (total 13 columns):
 #   Column                                                Non-Null Count   Dtype  
---  ------                                                --------------   -----  
 0   dsPersistentId                                        158254 non-null  object 
 1   publicationStatus                                     158254 non-null  object 
 2   title                                                 158254 non-null  object 
 3   dsDescriptionValue                                    158254 non-null  object 
 4   dansSpatialPointX                                     56479 non-null   object 
 5   dansSpatialPointY                                     56479 non-null   object 
 6   dansSpatialPointScheme                                56304 non-null   object 
 7   dansSpatialBoxNorth                                   4432 non-null    object 
 8   dansSpatialBoxEast                               

In [14]:
# # Save the archaeological datasets to a CSV file
# df_pub.to_csv('../data/archaeology_metadata.csv', index=False) 

In [13]:
# Make a list of DOIs
dois = df_pub.dsPersistentId.tolist()

In [14]:
print(len(dois))

158254


## Get OAI-ORE metadata

### Random sample
This random sample was used in the early stages of the project to gain insight into the data.

In [16]:
# Get a random sample of 50 dois 
import random 
random_dois = random.sample(dois, 50)
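
Because `random.sample` draws a different sample on every run, the DOIs listed below will not be reproduced when the notebook is re-executed. If a repeatable sample is wanted, a seeded generator is one option. The sketch uses a made-up DOI list and an arbitrary seed of 42.

```python
import random

# Hypothetical stand-in for the real `dois` list
dois = [f"doi:10.17026/example-{i}" for i in range(1000)]

# A dedicated, seeded generator keeps the sample stable across runs
rng = random.Random(42)
sample_a = rng.sample(dois, 50)

# A second generator with the same seed draws the same sample
sample_b = random.Random(42).sample(dois, 50)

print(sample_a == sample_b)
```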

In [17]:
random_dois

['doi:10.17026/dans-z57-f4fs',
 'doi:10.17026/dans-2x9-exy3',
 'doi:10.17026/dans-zqu-8kyx',
 'doi:10.17026/dans-zdb-m5db',
 'doi:10.17026/dans-zg2-yta9',
 'doi:10.17026/dans-zfy-2gdd',
 'doi:10.17026/dans-x37-rmm8',
 'doi:10.17026/dans-z8v-p5cy',
 'doi:10.17026/dans-25m-9ymv',
 'doi:10.17026/dans-x6m-sccp',
 'doi:10.17026/dans-x4t-muar',
 'doi:10.17026/dans-2au-kcam',
 'doi:10.17026/dans-285-apxe',
 'doi:10.17026/dans-zfn-z8y9',
 'doi:10.17026/dans-zzy-p4ba',
 'doi:10.17026/dans-z2c-54wb',
 'doi:10.17026/dans-xnv-a4nm',
 'doi:10.17026/dans-29s-w6tm',
 'doi:10.17026/dans-zf2-zmht',
 'doi:10.17026/dans-xbg-c7j8',
 'doi:10.17026/dans-xkq-tquj',
 'doi:10.17026/AR/QOU0ZC',
 'doi:10.17026/AR/XAOIC8',
 'doi:10.17026/dans-xpk-bhv4',
 'doi:10.17026/dans-xvk-d9hq',
 'doi:10.17026/dans-xeq-vgkx',
 'doi:10.17026/dans-zp9-zkyz',
 'doi:10.17026/AR/ANY5BC',
 'doi:10.17026/dans-xcw-hg57',
 'doi:10.17026/dans-zpn-sq97',
 'doi:10.17026/dans-xvp-4u9g',
 'doi:10.17026/dans-xhc-2d9n',
 'doi:10.17026/dans-

### Random sample with no geospatial data

In [18]:
# Get 100 random DOIs of datasets that have neither dansSpatialBoxNorth nor dansSpatialPointX
df_pub_no_spatial = df_pub[df_pub.dansSpatialBoxNorth.isna() & df_pub.dansSpatialPointX.isna()]
dois_no_spatial = df_pub_no_spatial.dsPersistentId.tolist()
random_dois_no_spatial = random.sample(dois_no_spatial, 100)

In [19]:
df_pub_no_spatial.info()

<class 'pandas.core.frame.DataFrame'>
Index: 98417 entries, 109 to 159269
Data columns (total 13 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   dsPersistentId                                        98417 non-null  object 
 1   publicationStatus                                     98417 non-null  object 
 2   title                                                 98417 non-null  object 
 3   dsDescriptionValue                                    98417 non-null  object 
 4   dansSpatialPointX                                     0 non-null      object 
 5   dansSpatialPointY                                     0 non-null      object 
 6   dansSpatialPointScheme                                4 non-null      object 
 7   dansSpatialBoxNorth                                   0 non-null      object 
 8   dansSpatialBoxEast                                    0 no
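
The two row counts reported by `df_pub.info()` and `df_pub_no_spatial.info()` answer the first exploration question directly. A short calculation, with both numbers copied from those outputs:

```python
# Row counts taken from the info() outputs in this notebook
n_published = 158_254   # published datasets (df_pub)
n_no_spatial = 98_417   # published datasets with neither point X nor box north

share = n_no_spatial / n_published
print(f"{n_no_spatial} of {n_published} published datasets ({share:.1%}) "
      "have neither dansSpatialPointX nor dansSpatialBoxNorth.")
```

Note that this only checks the point and box fields. The coverage column is not part of the filter.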

### Query

This query was used at an early stage of the project to save a sample of the metadata as JSON files on disk. The notebook [collect_metadata.ipynb](collect_metadata.ipynb) provides a way to store all metadata in a local MongoDB instance.

In [20]:
import requests
import json
import time
# import pprint

def get_json(doi, prefix='geo'):
    """
    Get the OAI-ORE JSON metadata of a dataset from the Archaeology Data Station API
    and save it to '../jsons/'.
    Wait for 1 second after each request to avoid overloading the server.

    :param doi: DOI of the dataset, e.g. 'doi:10.17026/dans-zrj-unr7'
    :param prefix: Prefix of the resulting JSON file name. Default is 'geo'
    :return: JSON data of the dataset, or None if the request failed
    """

    url = f"https://archaeology.datastations.nl/api/datasets/export?exporter=OAI_ORE&persistentId={doi}"
    data = None

    try:
        # Send a GET request to the URL; the timeout (30 s, chosen here) stops a stalled connection from blocking the loop
        response = requests.get(url, timeout=30)
        print(url)

        # Check if the request was successful
        response.raise_for_status()

        # Parse the JSON data
        data = response.json()

        # Save it to a file; '/' in the DOI is replaced so it is not read as a directory separator.
        # The replacement is done outside the f-string because reusing the same quote type
        # inside an f-string is a syntax error before Python 3.12.
        safe_doi = doi.replace('/', '%')
        out_path = f'../jsons/{prefix}_{safe_doi}.json'

        with open(out_path, 'w') as json_file:
            json.dump(data, json_file, indent=4)

        print(f"JSON data has been saved to '{out_path}'.")
        # pprint.pprint(data) 

    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")

    # Wait for 1 second to avoid overloading the server
    wait_time = 1  # seconds
    time.sleep(wait_time)

    return data
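
Since every request is followed by a one-second pause, re-running the download for DOIs that are already on disk is wasteful. A possible way to skip them is sketched below. `json_path` and `pending_dois` are hypothetical helpers, not part of the notebook, and they assume the same file naming as `get_json` (prefix, underscore, DOI with `/` replaced by `%`).

```python
from pathlib import Path


def json_path(doi, prefix="geo", out_dir="../jsons"):
    """Build the output path that get_json would write for this DOI."""
    safe_doi = doi.replace("/", "%")
    return Path(out_dir) / f"{prefix}_{safe_doi}.json"


def pending_dois(dois, prefix="geo", out_dir="../jsons"):
    """Return only the DOIs whose JSON file does not exist yet."""
    return [doi for doi in dois if not json_path(doi, prefix, out_dir).exists()]


print(json_path("doi:10.17026/dans-xwu-ac5u", prefix="nogeo"))
```

The download loop could then iterate over `pending_dois(random_dois_no_spatial, prefix='nogeo')` instead of the full list.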

In [21]:
for doi in random_dois_no_spatial:
    get_json(doi, prefix='nogeo')

https://archaeology.datastations.nl/api/datasets/export?exporter=OAI_ORE&persistentId=doi:10.17026/dans-xwu-ac5u
JSON data has been saved to '../jsons/nogeo_doi:10.17026%dans-xwu-ac5u.json'.
https://archaeology.datastations.nl/api/datasets/export?exporter=OAI_ORE&persistentId=doi:10.17026/dans-xup-tcgu
JSON data has been saved to '../jsons/nogeo_doi:10.17026%dans-xup-tcgu.json'.
https://archaeology.datastations.nl/api/datasets/export?exporter=OAI_ORE&persistentId=doi:10.17026/dans-xye-hjvs
JSON data has been saved to '../jsons/nogeo_doi:10.17026%dans-xye-hjvs.json'.
https://archaeology.datastations.nl/api/datasets/export?exporter=OAI_ORE&persistentId=doi:10.17026/dans-z58-gde2
JSON data has been saved to '../jsons/nogeo_doi:10.17026%dans-z58-gde2.json'.
https://archaeology.datastations.nl/api/datasets/export?exporter=OAI_ORE&persistentId=doi:10.17026/dans-zw8-h5dj
JSON data has been saved to '../jsons/nogeo_doi:10.17026%dans-zw8-h5dj.json'.
https://archaeology.datastations.nl/api/datas