VanderBot

The short link to this page is vanderbi.lt/vanderbot

This page is about general use of the VanderBot Wikidata API-writing script and other scripts associated with it. To view information about the original VanderBot project to upload Vanderbilt researcher and scholar items, see this page. That page contains information about the bespoke scripts used in the original project, ways to explore the data, and release notes through v1.6.

Summary

VanderBot is a Python script (vanderbot.py) used to create or update items in Wikidata using data from a CSV spreadsheet. The script uses a customizable schema based on the W3C Generating RDF from Tabular Data on the Web Recommendation, making it possible to write data about any kind of item using the Wikidata API. To learn more about this aspect of the project, see our paper that is in press at the Semantic Web Journal.

Since the project started, the generalized code for writing to the API has been used with modifications of other Python scripts from the original project to carry out several Wikidata projects at Vanderbilt. They include creating records for items in the Vanderbilt Fine Arts Gallery, connecting and creating image items with the Vanderbilt Divinity Library's Art in the Christian Tradition (ACT) database, and managing journal data as part of the VandyCite WikiProject. Through these explorations, we are learning how to generalize the process so that it can be used in many areas.

How it works

For a detailed do-it-yourself walkthrough on using VanderBot, see the series of blog posts starting with this one. A video walk-through for getting started building your own metadata schema and using the Wikidata test instance is here. A tutorial that jumps directly to writing to the real Wikidata is here. More general instructions are below.

If you want to use the VanderBot script to upload your own data to Wikidata, you will need to create a spreadsheet with appropriate column headers and a csv-metadata.json file to map those headers to the Wikibase graph model according to the Generating RDF from Tabular Data on the Web Recommendation. To create those two files, use this web tool. Copy and paste the CSV header generated by the Create CSV button into a plain text file with the name that you specified (with a .csv extension), then open that file with a spreadsheet program like LibreOffice and enter your data. Click the Create JSON button, then copy and paste the JSON into a file named csv-metadata.json in the same directory as the CSV and vanderbot.py.
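
For orientation, the metadata description file follows the CSVW (CSV on the Web) conventions: each CSV column is mapped to a property and a value template that reference the item's Q ID. The fragment below is a heavily abbreviated, hypothetical sketch; the file name, column names, and property P31 are illustrative only and it is not the web tool's exact output.

{
  "@context": "http://www.w3.org/ns/csvw",
  "@type": "TableGroup",
  "tables": [
    {
      "url": "my_items.csv",
      "tableSchema": {
        "columns": [
          {"titles": "qid", "name": "qid", "datatype": "string", "suppressOutput": true},
          {
            "titles": "label_en",
            "name": "label_en",
            "datatype": "string",
            "aboutUrl": "http://www.wikidata.org/entity/{qid}",
            "propertyUrl": "rdfs:label",
            "lang": "en"
          },
          {
            "titles": "instance_of",
            "name": "instance_of",
            "datatype": "string",
            "aboutUrl": "http://www.wikidata.org/entity/{qid}",
            "propertyUrl": "http://www.wikidata.org/prop/direct/P31",
            "valueUrl": "http://www.wikidata.org/entity/{instance_of}"
          }
        ]
      }
    }
  ]
}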

The source code that generates the web tool includes the files wikidata-csv2rdf-metadata.html, wikidata-csv2rdf-metadata.js, and wikidata-csv2rdf-metadata.css in this directory.

Another method for generating a metadata description file is to use a simplified JSON configuration file. The script convert_json_to_metadata_schema.py performs the conversion and generates CSV files with appropriate headers. For more information about that script and the format of the configuration file, visit this information page. Using the script is described with much hand-holding and many screenshots in this blog post.

The script acquire_wikidata_metadata.py downloads existing data from Wikidata into a CSV file that is compatible with the format required by VanderBot. It requires the same JSON configuration file as the conversion script above -- the two scripts are designed to work together. See this page for details.

Another utility, count_entities.py, can be used to count the use of properties in statements made about a defined set of items, or to determine the most common values for particular properties used in statements about those items. For details about using this script, see the script usage page.

Script details

Script location: https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/vanderbot.py

Current version: v1.9.2

Written by Steve Baskauf 2020-22.

Copyright 2022 Vanderbilt University. This program is released under a GNU General Public License v3.0.

RFC 2119 key words

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Credentials text file format example

The API credentials MUST be stored in a plain text file using the following format:

endpointUrl=https://www.wikidata.org
username=User@bot
password=465jli90dslhgoiuhsaoi9s0sj5ki3lo

A trailing newline is OPTIONAL.

Username and password are created on the Bot passwords page, accessed from Special pages. Wikimedia credentials are shared across all platforms (Wikipedia, Wikidata, Commons, etc.). The endpoint URL is the root URL of the Wikibase instance -- Wikidata in the example above. The credentials file name and location MAY be set using the options below; otherwise the defaults are used.
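
The following minimal Python sketch shows one way such a key=value file could be read (illustrative only, not VanderBot's own code; it assumes the default file name and home-directory location described below):

# Minimal sketch of reading the key=value credentials file
# (illustrative only, not VanderBot's actual code).
# Assumes the default name wikibase_credentials.txt in the home directory.
from pathlib import Path

credentials = {}
with open(Path.home() / 'wikibase_credentials.txt', 'rt', encoding='utf-8') as file_object:
    for line in file_object:
        line = line.strip()
        if line:  # the OPTIONAL trailing newline produces an empty line; skip it
            key, _, value = line.partition('=')
            credentials[key] = value

endpoint_url = credentials['endpointUrl']
username = credentials['username']
password = credentials['password']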

Command line options

| long form | short form | values | default |
|-----------|------------|--------|---------|
| --log | -L | log filename, or path and appended filename. Omit to log to console. | none |
| --json | -J | JSON metadata description filename or path and appended filename | "csv-metadata.json" |
| --credentials | -C | name of the credentials file | "wikibase_credentials.txt" |
| --path | -P | credentials directory: "home", "working", or path with trailing "/" | "home" |
| --update | -U | "allow" or "suppress" automatic updates to labels and descriptions | "suppress" |
| --apisleep | -A | number of seconds to delay between edits (see notes on rate limits below) | 1.25 |
| --endpoint | -E | a Wikibase SPARQL endpoint URL | "https://query.wikidata.org/sparql" |
| --terse | -T | terse output: "true" suppresses most terminal output (log unaffected) | "false" |
| --version | -V | no values; displays current version information | |
| --help | -H | no values; displays link to this page | |

Examples:


python vanderbot.py --json project-metadata.json --log ../log.txt

Metadata description file is called project-metadata.json and is in the current working directory. Progress and error logs saved to the file log.txt in the parent directory.


python vanderbot.py -P working -C wikidata-credentials.txt

Credentials file called wikidata-credentials.txt is in the current working directory. Logging will be to the standard output console.


python vanderbot.py --update allow -L update.log

Progress and error logs saved to the file update.log in the current working directory. Labels and descriptions of existing items in Wikidata are automatically replaced with local values if they differ.

Q identifiers

When stored in the CSV, Q identifiers ("Q IDs") for items MUST be written with the leading Q but without any namespace prefix. Example: Q42.

Value nodes

Generally, CSV column names are flexible and can be whatever is specified in the metadata description JSON file. However, VanderBot REQUIRES several suffixes for complex values that require more than one column to describe (value nodes). The following table lists the supported value nodes and the REQUIRED suffixes.

| type | component | suffix | example | datatype |
|------|-----------|--------|---------|----------|
| time | timestamp | _val | startDate_val | ISO 8601-like dateTime timestamp* |
| time | precision | _prec | startDate_prec | integer |
| quantity | amount | _val | height_val | decimal |
| quantity | unit | _unit | height_unit | Q ID IRI |
| globecoordinate | latitude | _val | location_val | decimal degrees |
| globecoordinate | longitude | _long | location_long | decimal degrees |
| globecoordinate | precision | _prec | location_prec | decimal degrees |

* The value required by the API differs slightly from ISO 8601, particularly in requiring a leading +. However, to allow the schemas to be used to generate valid ISO 8601 dateTimes, values in the CSV MUST omit the leading +, which is added by the script when values are sent to the API. VanderBot will also convert dates in certain formats to what is required by the API. See below for details.

Each value node also includes a column with a _nodeId suffix (e.g. startDate_nodeId) that contains an arbitrary unique identifier assigned by the script when the item line is processed.

See the Wikibase data model for more details. Note that VanderBot supports common attributes of these value nodes but assumes defaults for others (such as the Gregorian calendar model for time and the Earth globe for globecoordinate).
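
For example, the column group describing a single time value node for a hypothetical startDate column set (matching the examples in the table above) would look something like this in the CSV, with the _nodeId value assigned by the script:

startDate_nodeId,startDate_val,startDate_prec
07b81736-4d57-4c32-b430-f04455603249,1885-01-01T00:00:00Z,9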

Abbreviated time values

Time values MAY be abbreviated when entered in the CSV. VanderBot will convert times that conform to certain patterns into the format required by the Wikibase model. Here are the acceptable abbreviated formats:

| character pattern | example | precision | Wikibase precision integer |
|-------------------|---------|-----------|----------------------------|
| YYYY | 1885 | to year | 9 |
| YYYY-MM | 2020-03 | to month | 10 |
| YYYY-MM-DD | 2001-09-11 | to day | 11 |

When these abbreviated values are used in the timestamp (_val) column, the precision (_prec) column MUST be left empty. The precision column will be filled with the appropriate integer when the date is converted to the required timestamp format.

Time values at lower precisions, and BCE dates (which use negative years), MUST be in long form. For example:

2020-11-30T00:00:00Z

for 30 November 2020

-0100-01-01T00:00:00Z

for 100 BCE. The dateTime strings MUST end in T00:00:00Z regardless of the precision.
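
The abbreviated-to-long conversion described above follows a simple pattern. A minimal Python sketch of the same logic (illustrative only, not VanderBot's actual code):

# Sketch of expanding abbreviated dates to the long form plus a precision integer
# (illustrative only, not VanderBot's actual code).
import re

def expand_abbreviated_date(value):
    """Return (timestamp, precision) for YYYY, YYYY-MM, or YYYY-MM-DD input."""
    if re.fullmatch(r'\d{4}', value):              # to year
        return value + '-01-01T00:00:00Z', 9
    if re.fullmatch(r'\d{4}-\d{2}', value):        # to month
        return value + '-01T00:00:00Z', 10
    if re.fullmatch(r'\d{4}-\d{2}-\d{2}', value):  # to day
        return value + 'T00:00:00Z', 11
    return value, None  # already long form; precision must be given explicitly

print(expand_abbreviated_date('1885'))        # ('1885-01-01T00:00:00Z', 9)
print(expand_abbreviated_date('2020-03'))     # ('2020-03-01T00:00:00Z', 10)
print(expand_abbreviated_date('2001-09-11'))  # ('2001-09-11T00:00:00Z', 11)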

The Wikidata Image property (P18) and image file identification

The Wikidata instance of Wikibase has an idiosyncratic way of handling one particular property: Image (P18). The value of P18 must be an image in Wikimedia Commons. The normal situation in Wikibase is that the value uploaded to the API will be the same as the value that is available via the Query Service (i.e. via SPARQL). However, the P18 value is a special type, called commonsMedia, which is not described in the standard Wikibase data model.

The value for P18 that is uploaded to the API must be the unencoded name of the file as it was uploaded to Commons. For example, this image has the filename Ruïne Casti Munt Sogn Gieri Waltensburg (actm) 17.jpg, which contains spaces, parentheses, and non-ASCII characters. However, the value in the linked data graph queried by SPARQL is a URL that ends with the URL-encoded version of the file name. In this example, the URL would be http://commons.wikimedia.org/wiki/Special:FilePath/Ru%C3%AFne%20Casti%20Munt%20Sogn%20Gieri%20Waltensburg%20%28actm%29%2017.jpg.

The way VanderBot handles P18 values is as follows:

  • The value stored in the spreadsheet is the encoded URL. This is to be consistent with the goal of being able to generate RDF from the table that will match the triples returned from the Query Service.
  • Users may enter the unencoded file names in the column holding the P18 values. When the VanderBot script encounters values that are not in the encoded URL form, it will convert them and save the changed results in the table.
  • When the script writes to the API, it will convert the encoded URLs into unencoded filenames as required, but will leave the values in the table unaffected.
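
The conversions in the last two points amount to percent-encoding or decoding the file name relative to the Commons Special:FilePath base URL. A rough Python sketch (illustrative only, not VanderBot's actual code):

# Sketch of converting between a Commons file name and its Special:FilePath URL
# (illustrative only, not VanderBot's actual code).
from urllib.parse import quote, unquote

COMMONS_PREFIX = 'http://commons.wikimedia.org/wiki/Special:FilePath/'

def filename_to_url(filename):
    # safe='' ensures that spaces, parentheses, etc. are percent-encoded
    return COMMONS_PREFIX + quote(filename, safe='')

def url_to_filename(url):
    return unquote(url[len(COMMONS_PREFIX):])

url = filename_to_url('Ruïne Casti Munt Sogn Gieri Waltensburg (actm) 17.jpg')
print(url)
print(url_to_filename(url))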

When setting up the metadata description file csv-metadata.json using this web tool, you MUST select URL from the type selection dropdown, even if the values you will initially enter into the table are unencoded filenames (they will eventually be converted to URLs). If you use simplified configuration JSON files to create the metadata description file as described here, you MUST use uri as the value for value_type, as in the following example:

        {
          "pid": "P18",
          "variable": "image",
          "value_type": "uri",
          "qual": [],
          "ref": []
        }

somevalue claims (blank nodes)

The Wikibase model allows for claims where it is known that a property has some value, but that value is not known. Although this feature is not as commonly used as claims where a property has a stated value, it has an important use case for unknown artists or authors. It is possible to provide Q4233718 (anonymous) as the value of P170 (creator), but this is not the preferred practice. If Q4233718 is used, a bot will automatically change it from a value claim to a somevalue claim with a qualifier property of P3831 (object has role) and value Q4233718 (anonymous). So it is better to handle anonymous creators correctly from the start.

The Wikibase RDF model handles somevalue claims by treating them as triples with a blank node as the object. When a query is made to the Query Service that involves a somevalue claim, the object value returned for the corresponding triple is a Skolem IRI blank node identifier in the form: http://www.wikidata.org/.well-known/genid/86c4ed0e862509f61bba3ad98a1d5840 where 86c4ed0e862509f61bba3ad98a1d5840 is a hash that is unique within Wikidata.

Because somevalue claims can be made for properties of any value type (string, IRI, item, date, etc.), their RDF representations can't properly be generated from a CSV using the W3C Generating RDF from Tabular Data on the Web Recommendation. So the solution adopted for CSV input to VanderBot is a hack that allows users to write somevalue claims and provide unique values in the cells of the CSV (representing that each value is a different blank node), but not to generate RDF from the table that will exactly match what is in the graph used by the Query Service.

To specify that a property should have a somevalue claim, the cell in that property's column for the subject item row should contain a string that begins with _: (an RDF Turtle blank node label). For value node properties (e.g. time), the string should be placed in the column whose header ends in _val. Any characters can follow these initial two characters, but if a cell contains only those two characters, VanderBot will generate a random UUID suffix and append it (e.g. _:fd655ec7-a596-41da-916c-2bd680808165) to ensure uniqueness. (If the acquire_wikidata_metadata.py script is used to download existing data, the hash from the downloaded Skolem IRI will be used to create the blank node label, e.g. _:86c4ed0e862509f61bba3ad98a1d5840 for the example above.)

Here is an example:

before writing to the API:

| creator_uuid | creator | creator_object_has_role | inception_uuid | inception_nodeId | inception_val | inception_prec |
|--------------|---------|-------------------------|----------------|------------------|---------------|----------------|
| | Q364350 | | | | _: | |
| | _: | Q4233718 | | | 1635 | |

after writing to the API:

| creator_uuid | creator | creator_object_has_role | inception_uuid | inception_nodeId | inception_val | inception_prec |
|--------------|---------|-------------------------|----------------|------------------|---------------|----------------|
| 3829C5F8-C5F3-4ECE-AE56-8C4689A3A057 | Q364350 | | DDD9982B-5E30-4A51-A950-E7339B939D0E | | _:757ac080-a8a5-4d7f-9d5c-9bb8331b846e | |
| 4E8637FD-A0F9-4467-BAB2-E26EFBBD2D04 | _:fd655ec7-a596-41da-916c-2bd680808165 | Q4233718 | 64F5DFAF-DBA2-4DCA-8699-2EE5F202E43F | 2fd04384-737c-4f2e-b001-09db7b1e9818 | 1635-01-01T00:00:00Z | 9 |
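
The generated blank-node labels shown in the second table can be produced with logic along these lines (a minimal sketch, not VanderBot's actual code):

# Sketch of filling in bare "_:" cells with unique blank-node labels
# (illustrative only, not VanderBot's actual code).
import uuid

def complete_blank_node_label(cell_value):
    if cell_value == '_:':
        # bare label: append a random UUID to guarantee uniqueness
        return '_:' + str(uuid.uuid4())
    return cell_value  # already unique (user-supplied suffix or downloaded hash)

print(complete_blank_node_label('_:'))
print(complete_blank_node_label('_:86c4ed0e862509f61bba3ad98a1d5840'))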

Technical note: If a CSV containing somevalue data of this form is used to generate RDF using the W3C Recommendation, columns where an item is expected will generate IRIs of the form http://www.wikidata.org/entity/_:86c4ed0e862509f61bba3ad98a1d5840 rather than the expected form http://www.wikidata.org/.well-known/genid/86c4ed0e862509f61bba3ad98a1d5840. For some queries that simply require different values for IRIs in the object position of triples, this probably doesn't matter, but for federated queries comparing the state of the local graph against the Wikidata graph, there could be problems.

Rate limits

Based on information acquired in 2020, bot password users who don't have a "bot flag" are limited to 50 edits per minute. Editing at a faster rate will get you temporarily blocked from writing to the API. VanderBot has a hard-coded limit to prevent it from writing faster than that rate.

If you are a "newbie" (new user), you are subject to a slower rate limit: 8 edits per minute. A newbie is defined as a user whose account is less than four days old and who has done fewer than 50 edits. If you fall into the newbie category, you probably ought to do at least 50 manual edits to become familiar with the Wikidata data model and terminology anyway. However, if you don't want to wait, you SHOULD use an --apisleep or -A option with a value of 8 to set the delay to 8 seconds between writes. Once you are no longer a newbie, you MAY change it back to the higher rate by omitting this option.

For more detail on rate limit settings, see this page and the configuration file used by Wikidata.


Revised 2022-09-18