IMPORTANT UPDATE: unfortunately, the Newsriver API is being discontinued at the end of January 2021. Existing accounts will continue to function until that date.
newsrivr is an R wrapper for the Newsriver API, providing simple functions to retrieve and clean news articles following a tidy framework. Newsriver is a non-profit, free-of-charge news API (a monthly subscription is encouraged for commercial use), and when combined with the newsrivr package it can return up to 36,500 articles in a single search.
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("MikeJohnPage/newsrivr")
newsrivr follows a simple workflow: (1) store credentials, (2) retrieve news, (3) clean news.
To access the Newsriver API, you need to register for an API token. In addition, you are required to provide a user agent when using the newsrivr package. This allows Newsriver to identify who is using the API (and is important if something goes wrong). A good default user agent is your email address. To make your credentials available to newsrivr at every session, use the store_creds() function, which will prompt you for your API token and user agent and then append them to a .Renviron file located in your home directory (note, you should only do this once):
library(newsrivr)
# you will be prompted for your credentials
store_creds()
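Once stored, your credentials are picked up from the environment in future sessions. After restarting R, you can sanity-check that they are visible with Sys.getenv(); the environment variable name below is an illustrative assumption, so consult the ?store_creds documentation for the actual names:
# illustrative check; the variable name here is an assumption
Sys.getenv("NEWSRIVER_API_KEY")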
If you would prefer that newsrivr not alter your .Renviron file, you can use store_creds_temp(), which makes the credentials available for the current R session only (note, you will have to do this at every session):
# you will be prompted for your credentials
store_creds_temp()
Alternatively, you can manually pass your API token and user agent into the relevant newsrivr functions (below). However, this isn't recommended, as credentials can accidentally be leaked in scripts and .Rhistory files. See the ?store_creds documentation for more information.
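For illustration only, a manual call might look like the sketch below; the api_token and user_agent argument names are assumptions here, so check ?get_news for the actual signature:
# not recommended: credentials written directly into scripts can leak
# (argument names below are assumed for illustration)
get_news(query = "Google", api_token = "your-api-token", user_agent = "you@example.com")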
The get_news() function returns news articles from the Newsriver API matching a user-provided search query. These queries must be valid Lucene query strings, with the option to search the title and text fields of articles. See the ?get_news documentation for more information:
get_news(query = "Google")
#> # A tibble: 3,100 x 26
#> id publishDate discoverDate title language text structuredText url
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 EXlv… 2019-06-02… 2019-06-02T… Disp… en "Hel… "<div> \n <p>… http…
#> 2 dOvQ… 2019-06-03… 2019-06-03T… Goog… en "An … "<div> \n <p>… http…
#> 3 U4d0… 2014-02-25… 2019-06-03T… 8 Go… en Any … "<div> \n <p>… http…
#> 4 ikfM… 2012-12-08… 2019-06-03T… Goog… en Yest… "<div> \n <p>… http…
#> 5 Zuf8… 2017-04-03… 2019-06-03T… Goog… en "Goo… "<div> \n <p>… http…
#> # … with 3,095 more rows, and 18 more variables
get_news("Google", from = "2019-05-01", to = "2019-06-01")
#> # A tibble: 3,200 x 26
#> id publishDate discoverDate title language text structuredText url
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 29nG… 2019-05-02… 2019-05-02T… How … en "Goo… "<div> \n <p>… http…
#> 2 jAgI… 2019-05-02… 2019-05-02T… How … en Goog… "<div> \n <p>… http…
#> 3 uDVD… 2019-05-01… 2019-05-01T… Goog… en "Goo… "<div> \n <p>… http…
#> 4 tkYy… <NA> 2019-05-02T… Goog… en "Bot… "<div> \n <p>… http…
#> 5 MieF… 2019-05-01… 2019-05-01T… Walm… en Lead… "<p>Leading u… http…
#> # … with 3,195 more rows, and 18 more variables
get_news("title:Google AND text:\"Google Cloud\"", language = "it")
#> # A tibble: 2,670 x 24
#> id publishDate discoverDate title language text structuredText url
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Qbfr… 2019-06-02… 2019-06-03T… Avay… it Avay… "<div> \n <p>… http…
#> 2 2_eR… 2019-06-03… 2019-06-03T… La p… it Un p… "<div> \n <p>… http…
#> 3 VWo8… 2019-06-03… 2019-06-03T… La p… it Live… "<div> \n <p>… http…
#> 4 Wsvb… 2019-06-03… 2019-06-03T… Guid… it "Son… "<div> \n <p>… http…
#> 5 OuAo… 2019-06-03… 2019-06-03T… Stad… it "All… "<div> \n <p>… http…
#> # … with 2,665 more rows, and 16 more variables
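Because get_news() returns a plain tibble, standard tidyverse verbs work directly on the results. A minimal sketch, assuming dplyr is installed (the column names are taken from the output above):
library(dplyr)

# keep English-language articles and a few readable columns
get_news("Google", from = "2019-05-01", to = "2019-06-01") %>%
  filter(language == "en") %>%
  select(title, url, discoverDate)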
The clean_news() function wrangles the messy data fetched by get_news(), returning a tidy tibble with sensible defaults.
news <- get_news(query = "Google")
clean_news(news)
#> # A tibble: 1,542 x 4
#> text title discoverDate website.domainN…
#> <chr> <chr> <date> <chr>
#> 1 "hello, i have mu… displaying current g… 2019-06-02 spotify.com
#> 2 "an outage of google… google cloud back to… 2019-06-03 androidcentral.…
#> 3 any google apps admi… 8 google apps admin … 2019-06-03 bettercloud.com
#> 4 yesterday, google di… google ends free goo… 2019-06-03 bettercloud.com
#> 5 "google have been ex… google to launch a r… 2019-06-03 madebymagnitude…
#> # … with 1,537 more rows
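Putting the steps together, a complete fetch-and-clean pipeline might look like the sketch below, again assuming dplyr (the full website.domainName column name is inferred from the truncated output above):
library(newsrivr)
library(dplyr)

# fetch articles, tidy them, then count cleaned articles per source domain
"Google" %>%
  get_news() %>%
  clean_news() %>%
  count(website.domainName, sort = TRUE)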
If you encounter a clear bug, please file a minimal reproducible example on the GitHub issues page.