
Feature datasets

ClueWeb 09 features

The features required to run our experiments on the ClueWeb09 corpus are already included in this Git repository. Simply clone with Git LFS enabled.

Our dataset is derived from the judged documents of the TREC 2009–2012 Web Tracks by computing text-based features (on body, title, anchors, and main content) and web-graph-based features:

| Feature | Count |
| --- | --- |
| Term frequency | 4 |
| TF · IDF | 4 |
| BM25 score | 4 |
| F2 exp score | 4 |
| F2 log score | 4 |
| QL score | 4 |
| QLJM score | 4 |
| PL2 score | 4 |
| SPL score | 4 |
| URL length | 1 |
| No. of slashes in URL | 1 |
| PageRank | 1 |
| SpamRank | 1 |
| No. of inlinks | 1 |
| No. of outlinks | 1 |

As we only compute features for judged documents, our dataset is suited for supervised learning.

Versions

We provide several versions of the features, analogous to the versions of the LETOR supervised learning datasets.

Null version – `NULL.txt`

For some documents we can't provide all features. We use NaN to indicate unknown or minus-infinity values. This data cannot be used directly for learning.
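As a rough illustration of how such missing values could be detected, here is a minimal sketch. It assumes the files follow the usual LETOR line layout (`label qid:<id> <index>:<value> ... # comment`); this README does not confirm that layout. The function name and the sample line are hypothetical.

```python
import math

def parse_letor_line(line):
    """Parse one line in the assumed LETOR layout into (label, qid, features)."""
    # Drop an optional trailing comment such as "# docid = ...".
    body = line.split("#", 1)[0].split()
    label = int(body[0])
    qid = body[1].split(":", 1)[1]
    features = {}
    for token in body[2:]:
        index, value = token.split(":", 1)
        # float("NaN") parses to a NaN, so missing values survive loading.
        features[int(index)] = float(value)
    return label, qid, features

# Hypothetical sample line, not taken from the dataset.
label, qid, feats = parse_letor_line("2 qid:17 1:0.5 2:NaN 3:12 # docid = hypothetical-doc-1")
missing = [i for i, v in feats.items() if math.isnan(v)]
```

A learner that cannot handle NaN would need the indices in `missing` to be imputed first, which is what the Min version does.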

Min version – `min.txt`

Each NaN value in the Null version is replaced with the minimal value of that feature among the documents of the same query. This data can be used directly for learning.
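The replacement rule can be sketched as follows. This is only an illustration, not the script used to build `min.txt`: the function name, the `(qid, features)` row layout, and the sample values are assumptions. What happens when a feature is NaN for every document of a query is not stated here, so the sketch simply leaves it as NaN.

```python
import math
from collections import defaultdict

def fill_nan_with_query_min(rows):
    """rows: list of (qid, feature_list) pairs; returns a new, imputed list."""
    # Collect the smallest known value per query and feature index.
    minima = defaultdict(dict)
    for qid, feats in rows:
        for i, v in enumerate(feats):
            if not math.isnan(v):
                current = minima[qid].get(i)
                minima[qid][i] = v if current is None else min(current, v)
    filled = []
    for qid, feats in rows:
        # A feature that is NaN for all documents of a query stays NaN (assumption).
        filled.append((qid, [minima[qid].get(i, v) if math.isnan(v) else v
                             for i, v in enumerate(feats)]))
    return filled

nan = float("nan")
rows = [("q1", [0.5, nan]), ("q1", [nan, 2.0]), ("q1", [0.2, 3.0]), ("q2", [nan, 1.0])]
filled = fill_nan_with_query_min(rows)
```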

Query-level norm version – `Querylevelnorm.txt`

Query-level normalization is applied to the data of the Min version. This data can be used directly for learning.
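The exact normalization is not spelled out here. A minimal sketch, assuming the min-max scheme used for LETOR's query-level normalization (each feature is rescaled to [0, 1] within a query), could look like this. The function name and the row layout are hypothetical, and mapping constant features to 0.0 is a choice made for this sketch.

```python
from collections import defaultdict

def query_level_min_max(rows):
    """rows: list of (qid, feature_list) pairs without NaN values."""
    by_query = defaultdict(list)
    for qid, feats in rows:
        by_query[qid].append(feats)
    normalized = []
    for qid, feats in rows:
        docs = by_query[qid]
        scaled = []
        for i, v in enumerate(feats):
            lo = min(d[i] for d in docs)
            hi = max(d[i] for d in docs)
            # A feature that is constant within a query is mapped to 0.0 (assumption).
            scaled.append(0.0 if hi == lo else (v - lo) / (hi - lo))
        normalized.append((qid, scaled))
    return normalized

rows = [("q1", [1.0, 5.0]), ("q1", [3.0, 5.0]), ("q1", [2.0, 5.0])]
normalized = query_level_min_max(rows)
```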

Partitions

We further provide 5-fold partitions for cross-validation (Fold1, …, Fold5), as well as a partition that contains the queries with the highest number of near-duplicate documents (MostRedundantTraining).

License

Our features dataset is licensed under the terms of the CC BY-SA 4.0 license.

LETOR features

If you want to run our experiments on the GOV2 corpus, please download the LETOR 4.0 dataset and unpack the MQ2007 and MQ2008 folders to data/features/MQ2007 and data/features/MQ2008, respectively.
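To check that the folders ended up in the expected places, a small helper such as the following might be useful. It is a hypothetical sketch, not part of this repository, and assumes it is given the repository root.

```python
from pathlib import Path

def missing_letor_folders(repo_root="."):
    """Return the expected LETOR folders that do not exist under repo_root."""
    expected = [Path(repo_root) / "data" / "features" / name
                for name in ("MQ2007", "MQ2008")]
    return [str(path) for path in expected if not path.is_dir()]

# An empty list means both folders are in place.
print(missing_letor_folders("."))
```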
