Orchestrate Instances and Retrieve Data from a ScrapeBot Database
ScrapeBotR (with "R") allows you to easily retrieve (large amounts of) data from a ScrapeBot installation. The package provides easy-to-use functions to read and export instances, recipes, runs, log information, and data. It plugs neatly into the tidyverse, as it makes heavy use of tibbles.
The ScrapeBot (without "R") is a tool for so-called "agent-based testing" to automatically visit, modify, and scrape a defined set of webpages regularly. It was built to automate various web-based tasks and keep track of them in a controllable way for academic research, primarily in the realm of computational social science.
Install the most recent development version using
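Assuming you have the devtools package available, the development version can be installed straight from GitHub (the repository follows from the URL in the citation below):

```r
# install the development version from GitHub (requires the devtools package)
devtools::install_github('MarHai/ScrapeBotR')
```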
Import the installed version ...
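Loading the installed package works as usual:

```r
library(ScrapeBotR)
```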
... and start using it by defining your ScrapeBot database. Credentials to access your database need to be stored in an INI file somewhere in your computer's home directory (i.e., under `~/`, which usually translates to `/home/my_user` under *nix or `C:\Users\my_user\Documents` under Windows). You can either create this file by hand or use ScrapeBotR's helper function to create it:
```r
write_scrapebot_credentials(
  host = 'my_database_host',
  user = 'database_username',
  password = 'database_password'
)
```
Alternatively, you can create the INI file manually. Ideally, the file is located directly within your home directory and named `.scrapebot.ini` (where the leading `.` prevents it from being shown in most file browsers). The INI file is essentially just a raw-text file with a so-called section name and some key-value pairs; keys and values must not be separated by spaces. Any settings you do not need (e.g., the port number) can be omitted. Here's what the INI file could look like:
```ini
[a name for me to remember]
host=localhost
port=3307
user=my_personal_user
password=abcd3.45d:cba!
database=scrapebot
```
Once you've got that out of the way, try connecting to your database, using the section name again (this is because you can have multiple sections referring to multiple ScrapeBot installations):
```r
connection <- connect_scrapebot('a name for me to remember')
```
If this doesn't yield an error, you are good to go. And you could start, for example, by ...
- listing the available recipes
- listing the available instances
- getting information about specific runs
- collecting data
- bulk-downloading and compressing screenshots from S3
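Taken together, a first session could look like the following sketch. Note that the getter-function names used here (`get_recipes()`, `get_instances()`, `get_runs()`, `get_data()`) and their parameters are assumptions for illustration only; check the in-package documentation for the exact names and signatures:

```r
library(ScrapeBotR)

# connect using the section name from the INI file
connection <- connect_scrapebot('a name for me to remember')

# hypothetical getter functions returning tibbles (names are assumptions)
recipes <- get_recipes(connection)
instances <- get_instances(connection)

# drill down into runs and collected data for one recipe (assumed interface)
runs <- get_runs(connection, recipes$uid[1])
data <- get_data(connection, recipes$uid[1])
```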
Since version 0.5.0, you can also orchestrate servers on Amazon Web Services (AWS). For this, you first need an AWS account, to which any incurred costs will be charged. Next, generate an IAM user within your AWS account and create an API key. You also need an SSH key pair (in PEM format). Afterwards, use the respective R functions, analogous to the ScrapeBot database setup above, to write your credentials into an INI file and connect to your AWS account:
```r
write_aws_credentials(
  access_key_id = 'aws_access_key',
  secret_access_key = 'aws_access_secret',
  ssh_private_pem_file = 'path_to_ssh_private_pem_file',
  ssh_public_pem_file = 'path_to_ssh_public_pem_file'
)

aws_connection <- connect_aws()
```
If this does not yield an error, you could ...
- starting an AWS RDS instance as the ScrapeBot database
- launching an AWS S3 bucket to store screenshots
- running an EC2 instance as a ScrapeBot instance
- storing the connection object for later use
- restoring (loading) the connection object some days/weeks/months/studies later
- terminating all AWS instances through the respective functions
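For storing and restoring the connection object across sessions, the package may ship dedicated helpers; as a generic fallback, base R serialization works for any R object. The following is a sketch under that assumption, not the package's documented approach:

```r
# store the AWS connection object for a later session (base-R fallback;
# the package's own store/restore helpers may differ)
saveRDS(aws_connection, file = '~/my_study_aws_connection.rds')

# ... days/weeks/months later, in a fresh session ...
aws_connection <- readRDS('~/my_study_aws_connection.rds')
```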
Detailed documentation is available for every function inside R.
Haim, Mario (2021). ScrapeBotR. An R package to orchestrate ScrapeBot for agent-based testing. Available at https://github.com/MarHai/ScrapeBotR/.