# Using the Portable Antiquities Scheme API

This activity is based on a notebook developed by ODAT and a script by Daniel Pett ([Fitzwilliam Museum](fitzmuseum.cam.ac.uk)), who designed and built the PAS database and API. The original notebook can be found in this forked repository.

The [Portable Antiquities Scheme](https://finds.org.uk/about) provides an open access database of finds in support of the Treasure Act. The PAS records finds discovered by the public, outside of excavation, in England and Wales. The data in the database can be accessed by anyone, but exact findspots are only available to credentialed researchers. The current data consist of almost 1.5 million objects, which often include visual media in the form of photographs, illustrations, or 3D models, as well as linked data. This model of public–professional collaboration, coupled with an open access dissemination strategy, has proven productive. The PAS lists close to 150 PhD dissertations that have used its data, as well as over 700 total research projects.

You've already used Python to connect to the Open Context API, but here we'll use R to retrieve data from the PAS API. Since you've already gone through the process of using an API, try to notice any similarities or differences between using Python and R.

1. First bring in two packages in R. 

`library(jsonlite)`

`library(RCurl)`


2. We're going to create a variable that will act as the base URL for PAS. This will be used later in the exercise. 

`base <- 'https://finds.org.uk/'`

3. Now we'll set up the query that defines which data we want to access. We want all of the __Gold__ objects from the __Bronze Age__ that have __Images__. Before we go through the R code, let's check the source data we're trying to access. Go to the [PAS database](https://finds.org.uk/database) and search "Gold" - you'll have close to 24K results. Limit the query on the right side to "Bronze Age", which will reduce the results to about 4,300. Click on "Only results with images: On" - 385 results. These are the data you're going to access. If you click on the json link at the bottom of the screen under "Other formats:" you'll see how the data is actually structured "behind the scenes" and how we'll retrieve it.

`url <-"https://finds.org.uk/database/search/results/q/gold/broadperiod/BRONZE+AGE/thumbnail/1/format/json"`
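The query URL above is the search path followed by key/value pairs separated by slashes. Here is a minimal sketch of assembling it from parts, which makes it easier to swap in another material or period. The `search`, `params`, and `query_url` names are illustrative assumptions, not part of the original notebook.

```r
# Search endpoint, taken from the URL used in this step
search <- "https://finds.org.uk/database/search/results"

# Hypothetical named vector of filters: names are the URL keys, values the settings
params <- c(q = "gold", broadperiod = "BRONZE+AGE", thumbnail = "1", format = "json")

# Pair each key with its value (key/value), then join all pairs with slashes
pairs <- paste(names(params), params, sep = "/", collapse = "/")
query_url <- paste(search, pairs, sep = "/")

print(query_url)
```

The result should match the `url` string used in this step, so either form can be passed to `fromJSON()`.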

4. We'll set up a variable that goes to the url (our query parameters in PAS) and gets the data. 

`json <- fromJSON(url)`

5. Let's look at the first part of the data we've retrieved.

`head(json)`

6. You can see at the top that there is some metadata related to the query results. We'll need it later to work out how many pages of results there are, so we should grab it from the json data.

`total <- json$meta$totalResults`

`results <- json$meta$resultsPerPage`

`pagination <- ceiling(total/results)`
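As a quick check of the arithmetic, here is the same calculation with stand-in numbers. The total of 385 comes from the website search in step 3; the page size of 20 is an assumption for illustration only, so rely on the value the API actually returns in `json$meta$resultsPerPage`.

```r
# Stand-in values: 385 is the result count from the website search;
# 20 results per page is assumed here, not confirmed by the API
example_total <- 385
example_per_page <- 20

# ceiling() rounds up so a partly filled last page still counts as a page
example_pages <- ceiling(example_total / example_per_page)
print(example_pages)
```

With these stand-in numbers, 385 / 20 is 19.25, which rounds up to 20 pages.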

7. This API brings back all of the data associated with every object from our PAS query, and that can be A LOT of information. If we're only interested in some of the variables for each object, we can limit the results to those.

`keeps <- c("id", "objecttype", "old_findID", "broadperiod", "institution", "imagedir", "filename")`

8. We're then going to pull the records out of the json results of our API query and keep only the columns we want.

`data <- json$results`

`data <- data[,(names(data) %in% keeps)]`
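Here is a small illustration of what this filter does, using a made-up data frame rather than real PAS records (the `toy` values and the extra `notes` column are invented). Note that the kept columns stay in the order they have in the data, not the order listed in the keep vector.

```r
# Made-up records standing in for json$results
toy <- data.frame(
  id = c(1, 2),
  old_findID = c("AAA-000001", "BBB-000002"),
  notes = c("not needed", "not needed"),
  objecttype = c("RING", "BRACELET"),
  stringsAsFactors = FALSE
)

toy_keeps <- c("id", "objecttype", "old_findID")

# names(toy) %in% toy_keeps is a TRUE/FALSE vector, one entry per column
toy_small <- toy[, names(toy) %in% toy_keeps]

print(names(toy_small))
```

Because the column order follows the data, it is safer to refer to columns by name than by position later on.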

9. We can now look at the first part of the dataframe. 

`head(data)`

10. Now loop through the remaining pages of results (page 1 is already in `data`) and bind them together in a single table. Since this is a lot of data, it may take a bit of time - be patient.

```r
# Only loop when there is more than one page of results
if (pagination > 1){
  for (i in seq(from=2, to=pagination, by=1)){
    urlDownload <- paste(url, '/page/', i, sep='')
    pagedJson <- fromJSON(urlDownload)
    records <- pagedJson$results
    records <- records[,(names(records) %in% keeps)]
    data <- rbind(data, records)
  }
}
```
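Growing a table with `rbind` inside a loop works, but an alternative pattern is to collect each page in a list and bind once at the end. The sketch below shows that pattern offline: `fake_page()` is a hypothetical stand-in for fetching one page of results with `fromJSON()`, so nothing is downloaded.

```r
# Hypothetical stand-in for one page of API results (no network access)
fake_page <- function(i) {
  data.frame(id = c(i * 10 + 1, i * 10 + 2),
             objecttype = c("RING", "TORC"),
             stringsAsFactors = FALSE)
}

n_pages <- 3

# Fetch every page into a list, one data frame per element
pages <- lapply(seq_len(n_pages), fake_page)

# Bind all pages into one table in a single step
all_records <- do.call(rbind, pages)

print(nrow(all_records))
```

For a few hundred records either approach is fine; binding once mainly helps when there are many pages.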

11. Finally, we can write this data table to a csv file. In the Jupyter file manager, download this csv file so you can upload it to your forked GitHub repo.

`write.csv(data, file='data.csv',row.names=FALSE, na="")`

12. The last section of this notebook will focus on retrieving images from your original PAS query. Remember, in step 7 you limited the attributes of the data you accessed with your query - this included the image directory and image file name. The next steps go through the data you've already collected, create a download path for each image, and save the images in an organized set of folders. First we're going to create a log file that will record any failures we encounter.

`failures <- "failures.log"`

`log_con <- file(failures)`

13. Now make a function, which is a mini-program, that our R code can use over and over again to download the images. The following cell includes R code along with comments describing each section of the code. Try to follow each part of the code and understand what it means. You can just run the cell as is.

```r
# Download function with a test that the image URL exists
download <- function(data){
  # apply() passes each row as a named vector, so look columns up by name
  # rather than by position, which depends on the column order of the results
  object <- data[["objecttype"]]
  record <- data[["old_findID"]]

  # Create a folder for that object type if it does not exist
  if (!dir.exists(object)){
    dir.create(object)
  }

  # Create the image URL from the base URL, image directory, and filename
  URL <- paste0(base, data[["imagedir"]], data[["filename"]])

  # Test that the file exists (url.exists() comes from RCurl)
  exist <- url.exists(URL)

  # If it does, download in binary mode. If not, report a 404
  if (exist){
    download.file(URLencode(URL), destfile = file.path(object, basename(URL)), mode = "wb")
  } else {
    print("That file is a 404")
    # Log the errors for sending back to PAS to fix - probably better than csv as you
    # can tail -f and watch the errors come in
    log_line <- paste0(record, "|", URL, "|", "404\n")
    # Write to error file
    cat(log_line, file = failures, append = TRUE)
  }
}
```
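
The log lines written by the function are pipe-delimited (record, URL, status), so they can be read back into a table later. Here is a minimal sketch using a temporary file and invented entries; the column names and the `log_file` and `failed` variables are assumptions chosen for readability.

```r
# Write two invented log lines to a temporary file, in the same
# record|URL|status layout the download function uses
log_file <- tempfile(fileext = ".log")
cat("AAA-000001|https://finds.org.uk/images/example/a.jpg|404\n",
    file = log_file, append = TRUE)
cat("BBB-000002|https://finds.org.uk/images/example/b.jpg|404\n",
    file = log_file, append = TRUE)

# Read the log back as a table, splitting on the pipe character
failed <- read.table(log_file, sep = "|", stringsAsFactors = FALSE,
                     col.names = c("record", "url", "status"))
print(failed$record)
```

To inspect the real log after a run, point `read.table()` at `failures.log` instead of the temporary file.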

14. We're finally going to run the function, which will start downloading the images into newly created folders in your Jupyter file manager. To check on the status, split your screen with this notebook on one side and the file manager on the other. When you run the code listed below, you will see folders popping up in your file manager. Some attempts to grab an image will fail, and below the code you may see "That file is a 404", which is what we told the function to print (see the cell above this). You don't have to download all of the images in the dataset (that will take a while), so after it runs for a few minutes press the square stop button in your notebook toolbar (next to the Run button). Go through some of the new folders the function has created in your file manager.

`apply(data, 1, download)`
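
Before downloading everything, it can help to preview what will happen for a few rows. The sketch below is a dry run on made-up rows: it only builds the image URL and destination path, and downloads nothing. The `preview()` function and the `toy_rows` values are illustrative assumptions, and columns are looked up by name.

```r
base <- "https://finds.org.uk/"

# Made-up rows with the same column names as the real table
toy_rows <- data.frame(
  objecttype = c("RING", "TORC"),
  imagedir = c("images/example/", "images/example/"),
  filename = c("ring.jpg", "torc.jpg"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: report the URL and local path for one row
preview <- function(row) {
  image_url <- paste0(base, row[["imagedir"]], row[["filename"]])
  dest <- file.path(row[["objecttype"]], basename(image_url))
  paste(image_url, "->", dest)
}

# apply() passes each row to preview() as a named character vector
plan <- apply(toy_rows, 1, preview)
print(plan)
```

The same idea works on the real table: running a function over `head(data, 5)` instead of `data` limits a trial run to the first five records.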