Archivr is a project by the Qualitative Data Repository that automates the preservation of URLs in web archives.
The easiest way to install is directly from GitHub using devtools:

```r
library(devtools)
install_github("QualitativeDataRepository/archivr")
library(archivr)
```
The basic function is `archiv`, which takes a list of URLs and stores them in the Wayback Machine. It returns a data frame containing the callback data for the service.
```r
arc_df <- archiv(list("www.example.com", "NOTAURL", "www.github.com"))
arc_df$wayback_url
#                                                         wayback_url
# 1 http://web.archive.org/web/20190128171132/http://www.example.com
# 2   http://web.archive.org/web/20190128171134/https://github.com/
# ...
```
archivr can archive all the URLs in a webpage:
```r
arc_url_df <- archiv.fromUrl("https://qdr.syr.edu/")
df <- data.frame(arc_url_df$url, arc_url_df$wayback_url)[8,]
#   arc_url_df.url                            arc_url_df.wayback_url
# 8 http://syr.edu http://web.archive.org/web/20170110050058/http://syr.edu/
```
archivr will also archive all the URLs in a text file. It has been tested with .docx, .pdf, and markdown files, although other text-based formats should also work. Note that text parsing can be error-prone, especially if the document has rich features such as tables or columns.
```r
arc_url_df <- archiv.fromText("path_to_file")
```
To allow for pre-processing of URLs before archiving, archivr also provides access to the functions used to extract URLs from a webpage (`extract_urls_from_webpage("URL")`), from a file (`extract_urls_from_text("filepath")`, tested for .docx, markdown, and pdf), and from any supported text file in a folder.
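For example, the extraction functions can be used to inspect and filter URLs by hand before passing them to `archiv`. This is only a sketch built from the functions named above; the filtering step is an illustrative assumption, not part of archivr itself:

```r
# Extract URLs from a page, filter them manually, then archive the survivors.
# The grepl() filter here is an example of pre-processing, not an archivr feature.
urls <- extract_urls_from_webpage("https://qdr.syr.edu/")
urls <- urls[!grepl("twitter\\.com", urls)]  # e.g. drop social-media links
arc_df <- archiv(as.list(urls))
```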
Any of the functions that extract or archive URLs from a document or webpage accept an `except` parameter: a regular expression (evaluated with R's `grepl` function) that excludes matching URLs from extraction and archiving. For example,
```r
arc_url_df <- archiv.fromText("article.pdf", except="https?:\\/\\/(dx\\.)?doi\\.org\\/")
```
will exclude DOI links from archiving.
## Checking archiving status
You can check whether URLs are archived by the Internet Archive's Wayback Machine:
```r
arc_url_df <- view_archiv(list("www.example.com", "NOTAURL", "www.github.com"), "wayback")
```
If you wish to use perma.cc's archive, you will first need to set your API key. To save the URLs in a particular perma.cc folder, also set the default folder ID; if you do not remember the IDs of your folders, you can retrieve them in a data frame, and you can check the currently set folder at any time. Once configured, you can archive materials:
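The configuration steps above might look like the following sketch. The helper names `set_api_key()`, `set_folder_id()`, and `get_folder_ids()` are assumptions about archivr's API; verify them against the package documentation before use:

```r
# Assumed archivr configuration helpers -- names not confirmed by this README
set_api_key("YOUR_PERMA_CC_API_KEY")  # register your perma.cc API key
folders <- get_folder_ids()           # data frame of your folder names and IDs
set_folder_id("FOLDER_ID")            # default folder for new archives
```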
```r
arc_df <- archiv(list("www.example.com", "NOTAURL", "www.github.com"), "perma_cc")
```
To check whether a list of URLs is archived in perma.cc's public API, use:
```r
arc_url_df <- view_archiv(list("www.example.com", "NOTAURL", "www.github.com"), "perma_cc")
```
Archivr is developed and maintained by the Qualitative Data Repository at Syracuse University; it was originally authored by Ryan Deschamps (greebie on GitHub) and Agile Humanities.