AndyYTHsiao/wikidata-cli

Wikidata CLI

A command-line tool that converts a natural-language prompt into a SPARQL script, executes it against the Wikidata Query Service, and prints the results.

Command line interface
     ↓
Natural language
     ↓
LLM generator
     ↓
SPARQL script
     ↓
Wikidata execution
     ↓
Formatted answer

Quickstart

Initialize the Project

uv sync

Set an API Key

Create a .env file and save your OpenAI API key:

OPENAI_API_KEY=your_key_here

To use LLMs on NVIDIA's NIM, add your NIM API key to the same file:

NIM_API_KEY=your_key_here

Run the CLI:

uv run python -m src.cli

After initializing the CLI, the user has three options:

  1. Ask a query
  2. Change model config
  3. Exit the CLI

When a query is given to the system, the user sees the SPARQL script if one is successfully generated. The user can then choose whether to execute the script on Wikidata. If the script is executed, the returned results are printed in the CLI.
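The confirm-then-execute step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the prompt text, and the User-Agent string are assumptions, and only the public Wikidata Query Service endpoint and the standard library are used.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(sparql: str) -> urllib.request.Request:
    """Build a GET request that asks the endpoint for SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical identifier; use one that describes your own tool.
        "User-Agent": "wikidata-cli-example/0.1 (illustrative sketch)",
    }
    return urllib.request.Request(f"{WDQS_ENDPOINT}?{params}", headers=headers)


def confirm_and_run(sparql: str, ask=input) -> Optional[list]:
    """Show the script and execute it only if the user answers 'y'."""
    print(sparql)
    if ask("Execute on Wikidata? [y/N] ").strip().lower() != "y":
        return None  # the user declined, so nothing is sent
    with urllib.request.urlopen(build_request(sparql), timeout=30) as resp:
        payload = json.load(resp)
    # SPARQL JSON results keep the rows under results.bindings.
    return payload["results"]["bindings"]
```

Passing `ask` as a parameter keeps the confirmation step testable without a terminal.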

Example

Query: List 5 cities in Taiwan

Generated SPARQL:

SELECT ?city ?cityLabel WHERE {
  ?city wdt:P31 wd:Q515;
        wdt:P17 wd:Q865.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5

Retrieved Results:

Results (5 shown):

1. Taipei
   QID: Q1867
   URI: http://www.wikidata.org/entity/Q1867
2. New Taipei
   QID: Q244898
   URI: http://www.wikidata.org/entity/Q244898
3. Taichung
   QID: Q245023
   URI: http://www.wikidata.org/entity/Q245023
4. Hsinchu City
   QID: Q249994
   URI: http://www.wikidata.org/entity/Q249994
5. Keelung
   QID: Q249996
   URI: http://www.wikidata.org/entity/Q249996
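The listing above can be produced from the rows of a SPARQL JSON response roughly as shown below. This is a sketch under assumptions: `format_results` and its `var` parameter are hypothetical names, and the project's real formatter may differ.

```python
def format_results(bindings: list, var: str = "city") -> str:
    """Render SPARQL JSON bindings as a numbered list with QID and URI."""
    lines = [f"Results ({len(bindings)} shown):", ""]
    for i, row in enumerate(bindings, start=1):
        uri = row[var]["value"]
        # The QID is the last path segment of the entity URI.
        qid = uri.rsplit("/", 1)[-1]
        # Fall back to the QID when the label service returned no label.
        label = row.get(f"{var}Label", {}).get("value", qid)
        lines.append(f"{i}. {label}")
        lines.append(f"   QID: {qid}")
        lines.append(f"   URI: {uri}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {
            "city": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1867"},
            "cityLabel": {"type": "literal", "value": "Taipei"},
        }
    ]
    print(format_results(sample))
```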

Case Studies

data/case_studies.jsonl contains queries with ambiguous inputs, conflicting constraints, typos, negation, and non-English content.

In general, the LLMs handled typos, negation, French, and Chinese better than expected. The main failures were not in intent extraction, but in downstream validation and entity resolution.

| Category | Query | Result | Root Cause | Planned Fix |
| --- | --- | --- | --- | --- |
| Baseline | Find scientists from France born after 1900 | Success | | |
| Baseline | Child has father Johann Sebastian Bach and mother Maria Barbara Bach | Success | | |
| Ambiguous entity | Find people from Georgia | Returned Georgia the U.S. state; missed Georgia the country | Resolver selected top-1 entity without ambiguity detection | Expose top candidates or require clarification |
| Ambiguous entity | Find works by Bach | Returned no results | Resolver treated “Bach” as a last-name entity rather than resolving intended person/composer | Add ambiguity handling and candidate display |
| Conflicting constraints | Find scientists born after 1900 and before 1800 | Generated valid SPARQL but impossible filters | No semantic validation for date ranges | Add constraint conflict validator |
| Conflicting constraints | Find French scientists born after 1900 and born before 1800 | Generated valid SPARQL but impossible filters | No semantic validation for date ranges | Add constraint conflict validator |
| Typo | Find scinetists from Frnace born after 1900 | Success | Model/resolver recovered likely intent | Keep as robustness win |
| Typo | Find child whose fther is Johann Sebastian Bach | Success | Model/resolver recovered father=P22 | Keep as robustness win |
| Non-English | Trouve les scientifiques français nés après 1900 | Success | Model translated query into English semantic intent | Keep multilingual prompt examples |
| Non-English | 找出1900年後出生的法國科學家 | Success | Model translated query into English semantic intent | Keep multilingual prompt examples |
| Negation | Find French scientists not born after 1900 | Success | Model handled negation correctly | Add regression test |

Key Findings

The strongest parts of the system were intent extraction and typo recovery. GPT-5.5 successfully converted misspelled and non-English queries into usable semantic intents.

The weakest parts were ambiguity handling and constraint validation. For example, “Georgia” can refer to either the country or the U.S. state, but the LLM selected one candidate automatically. Similarly, impossible date constraints such as “born after 1900 and before 1800” compiled into SPARQL instead of being rejected earlier.

This suggests the next improvements should focus less on prompt engineering and more on deterministic safeguards:

  1. ambiguity detection in entity/property resolution
  2. conflict detection in date and numeric constraints
  3. clearer user-facing error messages

Hardening and Fixes

The strongest models tested, GPT-5.5 and GPT-5.4, produced similar failures. This showed that the core issues were not only LLM parsing errors but also deterministic pipeline issues in entity resolution and validation.

Fix 1: Ambiguity Detection

The original resolver selected the first Wikidata search result automatically. This failed for ambiguous labels such as "Georgia" and "Bach". guardrails.py was introduced to address this. It compares the top candidates and raises an ambiguity error when multiple plausible candidates exist, which prevents the system from returning confidently wrong results.

Example:

Input: Find people from Georgia

Before: The system silently resolved Georgia to the U.S. state.

After: The system reports that "Georgia" is an ambiguous entity label and asks the user for a more specific query or a QID. It also lists the top candidates for the user's reference.
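The idea behind this check can be sketched as follows. The contents of guardrails.py are not shown here, so every name below (`Candidate`, `AmbiguousEntityError`, `resolve_entity`, `top_k`) is hypothetical, and the plausibility rule, an exact case-insensitive label match among the top candidates, is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    qid: str
    label: str
    description: str


class AmbiguousEntityError(ValueError):
    """Raised when several candidates match the search term equally well."""

    def __init__(self, term: str, candidates: list):
        self.term = term
        self.candidates = candidates
        listing = "; ".join(f"{c.qid} ({c.description})" for c in candidates)
        super().__init__(
            f'"{term}" is ambiguous. Candidates: {listing}. '
            "Please refine the query or provide a QID."
        )


def resolve_entity(term: str, candidates: list, top_k: int = 3) -> Candidate:
    """Return one candidate, or raise if the top results are ambiguous."""
    if not candidates:
        raise LookupError(f'No Wikidata entity found for "{term}"')
    # Assumed rule: a top-k candidate is plausible when its label equals
    # the search term, ignoring case.
    exact = [c for c in candidates[:top_k] if c.label.casefold() == term.casefold()]
    if len(exact) > 1:
        raise AmbiguousEntityError(term, exact)
    # Zero or one exact match: fall back to the old top-1 behavior.
    return exact[0] if exact else candidates[0]
```

With candidates for "Georgia" that include both Q230 (the country) and Q1428 (the U.S. state), `resolve_entity` raises instead of silently picking the first one; a term with a single exact match still resolves directly.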

Fix 2: Constraint Conflict Validation

The original system allowed contradictory date filters, such as:

born after 1900 and before 1800

This generated valid SPARQL but always returned no results. I added validation before compilation to detect impossible date ranges.

Before: The compiler generated SPARQL with both filters.

After: The validator rejects the query with a clear error explaining that the date constraints conflict.
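A pre-compilation check of this kind can be sketched as below. The names `ConstraintConflictError` and `validate_year_range` are hypothetical, and the check is deliberately conservative: it only rejects ranges whose lower bound is not below the upper bound, which are empty under any reading of "after" and "before".

```python
from typing import Optional


class ConstraintConflictError(ValueError):
    """Raised when two constraints can never be satisfied together."""


def validate_year_range(after: Optional[int], before: Optional[int]) -> None:
    """Reject 'after X and before Y' when no year can satisfy both bounds."""
    if after is None or before is None:
        return  # an open-ended range cannot conflict with itself
    if after >= before:
        raise ConstraintConflictError(
            f"Conflicting date constraints: nothing is both after {after} "
            f"and before {before}."
        )


if __name__ == "__main__":
    validate_year_range(1800, 1900)  # valid range, no error
    try:
        validate_year_range(1900, 1800)
    except ConstraintConflictError as err:
        print(err)
```

Running the check before compilation means the user gets an explanation instead of an empty result set from Wikidata.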

Remaining Hard Cases

Some failures are fundamentally difficult because natural language queries often omit information that is required for deterministic Wikidata resolution.

  1. Ambiguous Entity Names: Names such as "Georgia", "Bach", "Washington", or "Apple" can refer to many different Wikidata entities. Without additional context, there may be no single correct QID. A search API can return candidates, but it cannot always know which one the user intended. The safest behavior is to expose ambiguity instead of guessing.

  2. Wikidata Modeling Complexity: Wikidata is not always modeled uniformly. A concept that sounds simple in natural language may require different properties for different domains. For example, "French scientist" might use country of citizenship, country of origin, or nationality-like descriptions to match "France". The system currently handles direct property-value constraints and may select an incorrect property whose label is semantically similar to the correct one.

Cross-model Evaluation

To run the evaluation on the evaluation dataset, use the following command.

uv run python -m src.run_eval

The default models are GPT-5.5, GPT-5.4, and Devstral 2 123B Instruct 2512. These models were selected based on two main criteria:

  1. Strong structured reasoning + coding ability: The LLMs are asked to generate SPARQL scripts, which is closer to code generation than free-form text generation.

  2. Large context windows: All three LLMs have large context windows (1.05M, 1.05M, and 256K tokens, respectively). Larger context windows reduce the risk of prompt truncation and improve output consistency.

All three models were capable of hitting the accuracy threshold because they reliably produced valid SPARQL scripts that match ground truth. However, they also exhibited consistent failure patterns.

  1. Wrong entity/property selections: All models occasionally failed at entity/property selection. For example, when asked about films directed by Steven Spielberg, Devstral 2 123B Instruct 2512 chose Q534 (Ordinance on Industrial Safety and Health) instead of Q8877 (Steven Spielberg) as the target entity. Similar issues also appeared when smaller models (e.g., NVIDIA Nemotron 3 Nano Omni 30B) were used as the SPARQL generator. The strongest model, GPT-5.5 in this case, also failed to select the correct property label when there were other semantically similar candidates. These failures suggest that LLMs sometimes map a name to an identifier that is semantically unrelated but token-similar.

  2. Complex queries: Models struggled when queries required aggregation or calculations. Additionally, variations in SPARQL syntax formatting increased the probability of failures.

In conclusion, larger, stronger models improve SPARQL generation reliability. However, most real failures stem from ambiguity and validation gaps, rather than raw model capability. Building a robust system therefore requires careful evaluation design, explicit handling of ambiguity, and deterministic validation layers. The current evaluation focuses on relatively simple queries. As tasks become more complex, a more carefully designed dataset and evaluation pipeline will be necessary.
