# Findings 
This notebook summarizes the initial findings and gives an overview of what I am currently working on. The actual work takes place in the other notebooks; this one only provides the overview, and I update it along the way.

## Step 1: Identify where geospatial metadata is missing

The dataframe below contains a selection of fields from all Archaeology datasets in the Data Station. Its purpose is to show where geospatial data is missing. The data was filtered to contain only published datasets.

Two types of fields contain explicit geospatial information: `dansSpatialPoint` and `dansSpatialBox`. The `df.info()` output below shows how many datasets have these fields filled in.

In [3]:
import pandas as pd
df = pd.read_csv('../data/archaeology_metadata.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158254 entries, 0 to 158253
Data columns (total 13 columns):
 #   Column                                                Non-Null Count   Dtype  
---  ------                                                --------------   -----  
 0   dsPersistentId                                        158254 non-null  object 
 1   publicationStatus                                     158254 non-null  object 
 2   title                                                 158254 non-null  object 
 3   dsDescriptionValue                                    158254 non-null  object 
 4   dansSpatialPointX                                     56479 non-null   object 
 5   dansSpatialPointY                                     56479 non-null   object 
 6   dansSpatialPointScheme                                56304 non-null   object 
 7   dansSpatialBoxNorth                                   4432 non-null    object 
 8   dansSpatialBoxEast                          

In [13]:
# box = df[df.dansSpatialBoxScheme.notnull()]
# box

In total, the Archaeology Data Station contains 158254 published datasets.
Of these, 59837 (158254 minus the 98417 datasets that have neither a point nor a bounding box) have geospatial information in the metadata, either in the form of points (x, y) or a bounding box.

--> About 38% of the datasets, a little over one third, have spatial metadata available.
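The share can be checked directly on the DataFrame. The sketch below is only an illustration: it assumes a dataset counts as having spatial metadata when either `dansSpatialPointX` or `dansSpatialBoxNorth` is filled, and both the helper name `spatial_coverage` and the three-row toy frame are made up for the example.

```python
import pandas as pd

def spatial_coverage(df):
    """Return (count, share) of rows that have a point or a bounding box."""
    has_geo = df["dansSpatialPointX"].notna() | df["dansSpatialBoxNorth"].notna()
    return int(has_geo.sum()), float(has_geo.mean())

# Toy data for illustration only (not taken from the Data Station).
toy = pd.DataFrame({
    "dansSpatialPointX": ["6.11557", None, None],
    "dansSpatialBoxNorth": [None, "392400", None],
})

count, share = spatial_coverage(toy)
print(f"{count} of {len(toy)} datasets ({share:.0%}) have spatial metadata")
```

Applied to the real `df`, the same function should reproduce the count and percentage stated above, provided the assumption about the two columns holds.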

### Inspect the data with missing explicit geospatial metadata

In [7]:
df_nogeo = df[df.dansSpatialBoxNorth.isna() & df.dansSpatialPointX.isna()]

In [8]:
df_nogeo.info()

<class 'pandas.core.frame.DataFrame'>
Index: 98417 entries, 109 to 158234
Data columns (total 13 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   dsPersistentId                                        98417 non-null  object 
 1   publicationStatus                                     98417 non-null  object 
 2   title                                                 98417 non-null  object 
 3   dsDescriptionValue                                    98417 non-null  object 
 4   dansSpatialPointX                                     0 non-null      object 
 5   dansSpatialPointY                                     0 non-null      object 
 6   dansSpatialPointScheme                                4 non-null      object 
 7   dansSpatialBoxNorth                                   0 non-null      object 
 8   dansSpatialBoxEast                                    0 no

## Step 2: Identify where geospatial metadata can be found
Using the PID, I downloaded the full metadata in JSON format for a sample of 100 datasets without explicit geospatial data (see [`doi_to_json.py`](scripts/doi_to_json.py)). Through manual inspection I listed all fields that contain hints about locations. The table below lists the fields I found.

Base URL: `https://archaeology.datastations.nl/api/datasets/export?exporter=OAI_ORE&persistentId=doi:{doi}`
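As a minimal sketch of how the URL template can be used (this is not the content of `doi_to_json.py`, and the helper names `export_url` and `fetch_metadata` are hypothetical), the DOI is filled into the template and the response is parsed as JSON using only the standard library:

```python
import json
from urllib.request import urlopen

# URL template as given above; {doi} is replaced by the dataset DOI.
BASE_URL = (
    "https://archaeology.datastations.nl/api/datasets/export"
    "?exporter=OAI_ORE&persistentId=doi:{doi}"
)

def export_url(doi):
    """Fill a DOI into the export URL template."""
    return BASE_URL.format(doi=doi.strip())

def fetch_metadata(doi, timeout=30):
    """Download and parse the metadata JSON for one DOI (needs network access)."""
    with urlopen(export_url(doi), timeout=timeout) as response:
        return json.load(response)

# Only the URL is built here; fetch_metadata is not called, so this runs offline.
url = export_url("10.17026/AR/KDIBWJ")
print(url)
```

For a larger batch it is probably wise to pause between requests and to store each response on disk, so that an interrupted run does not have to start over.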

### Metadata fields that may hold geospatial information

| field | description | example | doi of example |
|---|---|---|---|
| dansTemporalSpatial:dansSpatialPoint | Has an X, Y and Scheme specification | 6.11557, 52.78667, longitude/latitude (degrees) |  10.17026/25AR/25AAV0BL | 
| dansTemporalSpatial:dansSpatialBox | Has a North, East, South, West and Scheme specification | North: 392400, 392325; East: 52430, 52350; South: 392300, 392275; West: 52275, 52300; Scheme: RD (in m.), RD (in m.) | 10.17026/dans-zf4-sjty |
| dansTemporalSpatial:dansSpatialCoverageControlled |  | Netherlands | 10.17026/25AR/25AAV0BL |
| dansTemporalSpatial:dansSpatialCoverageText | | "Sint-Michielsgestel", "NLD" | 10.17026/AR/KDIBWJ |
| dansRelationMetadata:dansCollection:title | | Kalverstraat 11 Steenwijk | 10.17026/25AR/25AAV0BL |
| ore:aggregates:schema:name | Often appears to contain geospatial info in some form | - 115-261_Steenwijk_Kalverstraat 11.pdf <br> - location.xml <br> - Archeologische Begeleiding aanleg wegcunet en Opgraving aanleg watervoorziening Hoge Meet, Stelleweg/Oranjeweg, Gemeente Goes | 10.17026/25AR/25AAV0BL <br> 10.17026/AR/KDIBWJ <br> 10.17026/dans-zf4-sjty |
| citation:dsDescriptionValue | Contains a free description of the dataset | | | 





I cannot yet tell how often each of these fields is filled across the Data Station, because I have not yet extracted the JSON for all datasets. In the sample, however, `dansSpatialCoverageText` appeared to be filled in for many datasets, which makes it an easy place to start.

## Step 3: Extract all metadata 
Extracting the full metadata for all datasets will give us an overview of which fields are present. If a field is rarely filled, extracting geospatial information from it is unlikely to be fruitful, so the more populated fields should have priority.
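Such an overview could be produced along the following lines. This is a sketch under assumptions: it walks each JSON document recursively and counts in how many documents every key occurs at least once, without relying on a particular nesting. The function names and the two toy documents are invented for illustration and do not mirror the real export structure.

```python
from collections import Counter

def collect_keys(node, found=None):
    """Recursively gather every dictionary key that occurs in a JSON document."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            found.add(key)
            collect_keys(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_keys(item, found)
    return found

def field_presence(documents):
    """Count in how many documents each key appears at least once."""
    counts = Counter()
    for doc in documents:
        counts.update(collect_keys(doc))
    return counts

# Toy documents; the structure is invented for illustration.
docs = [
    {"dansSpatialCoverageText": ["Sint-Michielsgestel"],
     "files": [{"schema:name": "location.xml"}]},
    {"files": [{"schema:name": "a.pdf"}, {"schema:name": "b.pdf"}]},
]

presence = field_presence(docs)
for field, n in presence.most_common():
    print(f"{field}: {n}/{len(docs)}")
```

Because keys are collected per document as a set, a field that occurs several times in one dataset (for example one `schema:name` per file) is still counted once for that dataset.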

How to manage 10000 JSON files?
- MongoDB 
- DuckDB

### Approach 1: Extract `SpatialCoverageText`, convert to coordinates

&#9745;  Step 1: Extract a DataFrame from the JSON
Keep only the relevant fields, using the script [`json_to_df.py`](scripts/json_to_df.py).

&#9745; Step 2: Convert place names to coordinates 
Use the `Nominatim` geocoder.

&#9744; Step 3: Evaluation 
Extract a test sample of datasets that have both explicit geospatial information and `SpatialCoverageText`, convert the place names to coordinates, and compare them with the recorded coordinates.
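The comparison in Step 3 could be sketched as follows. The assumptions are mine: both coordinate pairs are expressed as (latitude, longitude) in degrees (points stored in RD would need converting first), the error is measured as great-circle distance, and a geocoded point counts as correct when it lies within 5 km of the recorded one. The function names, the threshold and the geocoded values in the toy pairs are invented for illustration.

```python
import math
import statistics

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def evaluate(pairs, threshold_km=5.0):
    """Summarize the distance between geocoded and recorded points.

    pairs: list of ((lat, lon) geocoded, (lat, lon) recorded).
    """
    errors = [haversine_km(p[0], p[1], r[0], r[1]) for p, r in pairs]
    if not errors:
        raise ValueError("no pairs to evaluate")
    within = sum(1 for e in errors if e <= threshold_km)
    return {
        "n": len(errors),
        "median_km": statistics.median(errors),
        "share_within": within / len(errors),
    }

# Toy pairs: the recorded point is the Steenwijk example from the table,
# the geocoded points are invented (one close hit, one far miss).
pairs = [
    ((52.79, 6.12), (52.78667, 6.11557)),
    ((51.64, 5.35), (52.78667, 6.11557)),
]
summary = evaluate(pairs)
print(summary)
```

Reporting the median next to the share within the threshold helps separate small offsets, such as a town centre versus the excavation site, from outright mismatches, such as a different place with the same name.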


### Approach 2: Extract `schema:name`
Steps: clean the file names, apply named entity recognition (NER) to identify place names, and convert them to coordinates. This might be more complicated, because not all place names mentioned are necessarily linked to the excavations.

### Approach 3: Extract `dsDescriptionValue`
Steps: clean the descriptions, apply named entity recognition (NER) to identify place names, and convert them to coordinates. This might be more complicated, because not all place names mentioned are necessarily linked to the excavations.