-
Notifications
You must be signed in to change notification settings - Fork 2
/
waybackapi.Rmd
318 lines (232 loc) · 11.2 KB
/
waybackapi.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
---
title: "Using the Wayback Machine API to find snapshots of prison pages"
output: html_notebook
---
# Using the Wayback Machine API to find snapshots of prison pages
The Wayback Machine [has a number of APIs](https://archive.org/help/wayback_api.php) including one that returns the URL of the snapshot of a given URL which is closest to a given timestamp.
We have a list of prison URLs and timestamps for when each was changed, scraped using a Python notebook. We now want to establish if we can locate snapshots for each of those changes (i.e. between changes).
We will need to:
* Loop through each prison URL
* For each of those loop through the change timestamps
* For each of those, query the Wayback API for the nearest snapshot
* Store the resulting URL
We might then also want to scrape those snapshots and/or identify differences between them (the changes)
## Testing one URL
Start by storing a URL and two timestamps from the scraped data:
```{r store url and timestamp}
testurl <- "https://www.gov.uk/guidance/cardiff-prison"
testdates <- c("2020-03-11T09:32:00.000+00:00","2020-03-25T16:09:19.000+00:00")
```
A Wayback API query looks like this: `https://archive.org/wayback/available?url=example.com×tamp=20060101`
Using it with the timestamp gives us a URL like this: `https://archive.org/wayback/available?url=https://www.gov.uk/guidance/cardiff-prison×tamp=2020-03-11T09:32:00.000+00:00`
But that doesn't return any useful data.
Instead we have to translate the timestamp into one that is understood by the API. For example:
`https://archive.org/wayback/available?url=https://www.gov.uk/guidance/cardiff-prison×tamp=20200311`
That works, returning the URL:
`https://web.archive.org/web/20200425142044/https://www.gov.uk/guidance/cardiff-prison`
The API [documentation](https://archive.org/help/wayback_api.php) specifies:
> "The format of the timestamp is 1-14 digits (YYYYMMDDhhmmss)"
We can use `gsub` to substitute the characters from the timestamp that need to be stripped out: the dashes, "T" and colons:
```{r cleaning timestamp gsub}
gsub("[-T:]","","2020-03-11T09:32:00.000+00:00")
```
We also need to remove the last part to limit it to 14 characters:
```{r substr timestamp to first 14 chars}
substr(gsub("[-T:]","","2020-03-11T09:32:00.000+00:00"),1,14)
```
Now let's apply those functions as part of a loop that inserts the cleaned timestamp into an API query URL:
```{r create test query URLs}
for(i in testdates){
testapiquery <- paste("https://archive.org/wayback/available?url=",testurl,"×tamp=",substr(gsub("[-T:]","",i),1,14), sep="")
print(testapiquery)
}
```
## Extracting JSON from the URL
To deal with the JSON provided by the API we need the `jsonlite` package:
```{r activate jsonlite}
library(jsonlite)
```
Now to fetch that JSON
```{r fetch json}
#fetch the JSON from the API query
testjson <- jsonlite::fromJSON("https://archive.org/wayback/available?url=https://www.gov.uk/guidance/cardiff-prison×tamp=20200311093200")
#drill down into the closest url
testjson$archived_snapshots$closest$url
#And timestamp (although this is in the URL anyway)
testjson$archived_snapshots$closest$timestamp
```
We can see that the closest result is actually *after* the other timestamp: 20200325160919, so it isn't a snapshot of what it looked like between those.
This can be teased out in analysis later. The key thing is that the process works and we can begin to roll it out for all the URLs and timestamps we have.
## Import URLs and timestamps - and query API in bulk
The data on URLs and timestamps is in a CSV, which we import below:
```{r import prisons data}
prisonsdata <- read.csv("SCRAPEprisonupdatesJan6.csv", stringsAsFactors = F)
```
We don't need to create a nested loop (URL, then timestamps within each) because the URLs are repeated for each timestamp, so we just need to loop through a sequence of numbers (the index positions for each row) and access the URL and timestamp in that row
```{r loop through data and use to query API}
#create empty vector to store results
queryresults <- c()
#Loop through a vector of numbers from 1 to the number of rows in the dataframe
for (i in seq(1,nrow(prisonsdata))){
print(i)
#store the URL at that position
currenturl <- prisonsdata$url[i]
print(currenturl)
#store the timestamp at that position
currenttimestamp <- prisonsdata$timestamp[i]
print(currenttimestamp)
#form a query url from those, converting the timestamp to the right format
queryurl <- paste("https://archive.org/wayback/available?url=",currenturl,"×tamp=",substr(gsub("[-T:]","",currenttimestamp),1,14), sep="")
print(queryurl)
#fetch the JSON from the URL (API query)
queryjson <- jsonlite::fromJSON(queryurl)
#drill down into the closest url
queryresult <- queryjson$archived_snapshots$closest$url
#add to the vector of results
queryresults <- c(queryresults,queryresult)
}
```
At this point (after 50 responses) we hit a 429 error, which means we've [sent too many requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429). We need to find a way to slow those requests.
## Query a different API
Some searching around suggests that the more powerful [Wayback CDX Server API](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) may be better for this.
It includes [filtering](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#filtering) to specify results *between* dates, too.
A URL for that looks like this: `https://web.archive.org/cdx/search/cdx?url=archive.org&from=2010&to=2011`
For our data it might look like this:
`https://web.archive.org/cdx/search/cdx?url=https://www.gov.uk/guidance/cardiff-prison&from=20200311093200&to=20200325160919`
Output is text by default (the above query returns an empty page because there are no results within that date range) but can be requested as JSON by adding `&output=json`:
`https://web.archive.org/cdx/search/cdx?url=https://www.gov.uk/guidance/cardiff-prison&from=20200311093200&to=20200325160919&output=json`
When the date range is extended (so that we have some results to see) the result comes in the form of a list with the nearest result first: `https://web.archive.org/cdx/search/cdx?url=https://www.gov.uk/guidance/cardiff-prison&from=20200311093200&to=20210325160919&output=json`
```{r explore cdx api response}
cdxtextjson <- jsonlite::fromJSON("https://web.archive.org/cdx/search/cdx?url=https://www.gov.uk/guidance/cardiff-prison&from=20200311093200&to=20210325160919&output=json")
#The results are like a matrix with the first row containing the field names...
cdxtextjson[1,]
#...and the other rows containing the data
cdxtextjson[2,]
cdxtextjson[2,2]
```
We can try to adapt our loop for the new API and the date range:
```{r query CDX API}
#create empty vector to store results
cdxqueryresults <- c()
#Loop through a vector of numbers from 1 to the number of rows in the dataframe
for (i in seq(1,nrow(prisonsdata))){
print(i)
#store the URL at that position
currenturl <- prisonsdata$url[i]
print(currenturl)
#store the timestamp at that position
currenttimestamp <- prisonsdata$timestamp[i]
print(currenttimestamp)
#form a query url from those, converting the timestamp to the right format
queryurl <- paste("https://web.archive.org/cdx/search/cdx?url=",currenturl,"&from=",substr(gsub("[-T:]","",currenttimestamp),1,14),"&output=json", sep="")
print(queryurl)
#fetch the JSON from the URL (API query)
queryjson <- jsonlite::fromJSON(queryurl)
#drill down into the closest url
queryresult <- queryjson[2,2]
#add to the vector of results
cdxqueryresults <- c(cdxqueryresults,queryresult)
}
```
This causes an error when we get no results. But also: why generate 1139 queries for each url-date combination when we can just fetch all the results for each URL through just 120 queries?
```{r number of prison urls}
#Show how many unique urls there are in the url field
length(table(prisonsdata$url))
```
In that case the JSON response for each query might be converted to a data frame and bound together:
```{r as df}
#convert to data frame
cdxtextdf <- as.data.frame(cdxtextjson)
#extract first row as fields
colnames(cdxtextdf)<- cdxtextjson[1,]
#remove first row
cdxtextdf <- cdxtextdf[-1,]
cdxtextdf
```
Time to loop through those URLs and fetch all the snapshots:
```{r try with all}
#create empty data frame to store results
cdxtextdf <- cdxtextdf[0,]
#Loop through all the unique urls
for (i in as.data.frame(table(prisonsdata$url))[,1]){
#print url
print(i)
#form a query url from it, with the timestamp 20200301
queryurl <- paste("https://web.archive.org/cdx/search/cdx?url=",i,"&from=20200301&output=json", sep="")
print(queryurl)
#fetch the JSON from the URL (API query)
queryjson <- jsonlite::fromJSON(queryurl)
#convert to data frame
querydf <- as.data.frame(queryjson)
#extract first row as fields
colnames(querydf)<- queryjson[1,]
#remove first row
querydf <- querydf[-1,]
#add to the data frame of results
cdxtextdf <- rbind(cdxtextdf,querydf)
}
```
That seems to work:
```{r show head}
head(cdxtextdf)
```
We can add a full URL by combining the base URL like so: https://web.archive.org/web/20201203081036/https://www.gov.uk/guidance/channings-wood-prison
```{r create snapshot urls}
cdxtextdf$snapshoturl <- paste0("https://web.archive.org/web/",cdxtextdf$timestamp,"/",cdxtextdf$original)
```
And check how that looks:
```{r show head with new col}
head(cdxtextdf)
```
And export:
```{r export csv}
write.csv(cdxtextdf, "waybackurls.csv")
```
## Is there a snapshot between dates?
The next step is to match this data with the data on changes to the pages in question.
In Excel we've filtered the prison webpage scrape to those changes since March 2020 and generate some new columns which generate a 'from' and 'to' timestamp for the API (if it's the latest update, the 'to' timestamp is today)
```{r import prison data}
prisonsdata <- rio::import("waybackurlsANALYSIS.xlsx", sheet = 1)
colnames(prisonsdata)
```
```{r query CDX API with from and to}
#create empty vector to store results
cdxqueryresults <- c()
#Loop through a vector of numbers from 1 to the number of rows in the dataframe
for (i in seq(1,nrow(prisonsdata))){
print(i)
#store the URL at that position
currenturl <- prisonsdata$url[i]
print(currenturl)
#store the timestamp at that position
fromtimestamp <- prisonsdata$fromtimestampforapi[i]
totimestamp <- prisonsdata$totimestamp[i]
print(currenttimestamp)
#form a query url from those, converting the timestamp to the right format
queryurl <- paste("https://web.archive.org/cdx/search/cdx?url=",currenturl,"&from=",fromtimestamp,"&to=",totimestamp,"&output=json", sep="")
print(queryurl)
#fetch the JSON from the URL (API query)
queryjson <- jsonlite::fromJSON(queryurl)
#drill down into the closest url
print(length(queryresult))
if(length(queryresult)>1){
queryresult <- queryjson[2,2]
#add to the vector of results
cdxqueryresults <- c(cdxqueryresults,queryresult)
}
else{
cdxqueryresults <- c(cdxqueryresults,"no snapshot")
}
}
```
Now to add that to the original data frame:
```{r check number of rows}
table(cdxqueryresults)
```
```{r add results to df}
prisonsdata$waybackurl <- cdxqueryresults
```
And export.
```{r export}
write.csv(prisonsdata,"prisonsdata.csv")
```