TREC Car Tools
Development tools for participants of the TREC Complex Answer Retrieval track.
Data release support for v1.5 and v2.0.
Note that in order to allow compiling your project against both trec-car format versions, the Maven artifact id was changed to treccar-tools-v2 with version 2.0, and the package path changed to edu.unh.cs.treccar_v2.
Current support for
- Python 3.6
- Java 1.8
If you are using Anaconda, install the cbor library for Python 3.6:

```
conda install -c laura-dietz cbor=1.0.0
```
How to use the Python bindings for trec-car-tools?
- Get the data from http://trec-car.cs.unh.edu
- Clone this repository
- Run `python setup.py install`
See `test.py` for an example of how to access the data.
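For example, here is a minimal sketch that streams a paragraph corpus; the file name `paragraphs.cbor` is a placeholder for whichever archive you downloaded:

```python
# Minimal sketch: stream paragraphs from a CBOR corpus file.
# 'paragraphs.cbor' is a placeholder for your local copy of the data.
from trec_car.read_data import iter_paragraphs

with open('paragraphs.cbor', 'rb') as f:
    for para in iter_paragraphs(f):
        print(para.para_id)     # paragraph id (an MD5 hash of the content)
        print(para.get_text())  # plain-text rendering of the paragraph
        break                   # stop after the first paragraph
```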
How to use the Java 1.8 (or higher) bindings for trec-car-tools through Maven?
Add the JitPack repository to your project's pom.xml file (or the equivalent for Gradle or sbt):
```xml
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
```
Add the trec-car-tools dependency:
```xml
<dependency>
    <groupId>com.github.TREMA-UNH</groupId>
    <artifactId>trec-car-tools-java</artifactId>
    <version>17</version>
</dependency>
```
Compile your project with `mvn compile`.
This package provides support for the following activities:

- `read_data`: reading the provided paragraph collection, outline collections, and training articles
- `format_runs`: writing submission files
If you use Python or Java, please use trec-car-tools; there is no need to understand the following. We provide Haskell bindings upon request. If you are programming in a different language, you can use any CBOR library and decode the grammar below.
CBOR is similar to JSON, but it is a binary format that compresses better and avoids text file encoding issues.
Articles, outlines, and paragraphs are all described in CBOR following this grammar. Wikipedia-internal hyperlinks are preserved through ParaLink elements.
```
Page                -> $pageName $pageId [PageSkeleton] PageType PageMetadata
PageType            -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
PageMetadata        -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors
RedirectNames       -> [$pageName]
DisambiguationNames -> [$pageName]
DisambiguationIds   -> [$pageId]
CategoryNames       -> [$pageName]
CategoryIds         -> [$pageId]
InlinkIds           -> [$pageId]
InlinkAnchors       -> [$anchorText]

PageSkeleton        -> Section | Para | Image | ListItem
Section             -> $sectionHeading [PageSkeleton]
Para                -> Paragraph
Paragraph           -> $paragraphId, [ParaBody]
ListItem            -> $nestingLevel, Paragraph
Image               -> $imageURL [PageSkeleton]
ParaBody            -> ParaText | ParaLink
ParaText            -> $text
ParaLink            -> $targetPage $targetPageId $linkSection $anchorText
```
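For illustration, here is a minimal sketch of decoding items with a generic CBOR library instead of trec-car-tools; it assumes the `cbor` package from PyPI, and `paragraphs.cbor` is again a placeholder file name:

```python
# Sketch: read one raw CBOR item without trec-car-tools and inspect it.
# Assumes the generic 'cbor' package; trec-car-tools is not required here.
import cbor

with open('paragraphs.cbor', 'rb') as f:
    item = cbor.load(f)  # decodes the next top-level CBOR item in the file
    # Map the nested arrays onto the grammar above, e.g. a Paragraph is
    # roughly [$paragraphId, [ParaBody]].
    print(item)
```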
You can use any CBOR serialization library. Below is a convenience library for reading the data into Python (3.5+).
`./read_data/trec_car_read_data.py`: Python convenience library for reading the input data (in CBOR format).

If you use Anaconda, please install the cbor library with:

```
conda install -c auto cbor=1.0
```

Otherwise install it with:

```
pip install cbor
```
Given an outline, your task is to produce one ranking for each section $section (representing an information need in traditional IR evaluations).
Each ranked element is an (entity,passage) pair, meaning that this passage is relevant for the section, because it features a relevant entity. "Relevant" means that the entity or passage must/should/could be listed in this section.
The section is represented by the path of headings in the outline, `$pageTitle/$heading1/$heading1.1/.../$section`, in URL encoding.
The entity is represented by the DBpedia entity id (derived from the Wikipedia URL). Optionally, the entity can be omitted.
The passage is represented by the passage id given in the passage corpus (an MD5 hash of the content). Optionally, the passage can be omitted.
Example of the ranking format:

```
Green_sea_turtle/Habitat Pelagic_zone 12345 0 27409 myTeam
```

where the fields are `$qid $entity $passageId rank sim run_id`.
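As a sketch, such a line could be assembled as follows; the `format_line` helper is hypothetical and not part of trec-car-tools (for real runs, prefer the format_runs package described below):

```python
# Hypothetical helper that assembles one ranking line in the field order
# $qid $entity $passageId rank sim run_id (not part of trec-car-tools).
def format_line(qid, entity, passage_id, rank, sim, run_id):
    return ' '.join([qid, entity, str(passage_id), str(rank), str(sim), run_id])

print(format_line('Green_sea_turtle/Habitat', 'Pelagic_zone',
                  '12345', 0, 27409, 'myTeam'))
# -> Green_sea_turtle/Habitat Pelagic_zone 12345 0 27409 myTeam
```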
Integration with other tools
It is recommended to use the format_runs package to write run files. Here is an example:
```python
from trec_car.format_runs import *

# pages: iterable of outline pages; ranking: your system's list of
# (paragraph, score, rank) tuples for the current section.
with open('runfile', mode='w', encoding='UTF-8') as f:
    writer = configure_csv_writer(f)
    for page in pages:
        for section_path in page.flat_headings_list():
            ranking = [RankingEntry(page.page_name, section_path, p.para_id, r, s, paragraph_content=p)
                       for p, s, r in ranking]
            format_run(writer, ranking, exp_name='test')
    # no explicit f.close() needed: the with-block closes the file
```
This ensures that the output is correctly formatted to work with
trec_eval and the provided qrels file.
Run trec_eval version 9.0.4 as usual:

```
trec_eval -q release.qrel runfile > run.eval
```
The output is compatible with the eval plotting package minir-plots. For example, run:

```
python column.py --out column-plot.pdf --metric map run.eval
python column_difficulty.py --out column-difficulty-plot.pdf --metric map run.eval run2.eval
```
Moreover, you can compute success statistics such as hurts/helps or a paired t-test as follows:

```
python hurtshelps.py --metric map run.eval run2.eval
python paired-ttest.py --metric map run.eval run2.eval
```
TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.