faraday edited this page Sep 14, 2010 · 5 revisions

Steps to replicate ESA (Explicit Semantic Analysis) as described by Gabrilovich et al.:

Discarding Articles

- keep only articles in the main namespace (that is, discard Category:, Wikipedia:, Help:, File:, etc.)
- discard articles with titles in month_year format (e.g. January 2002)
- discard articles with titles in year_in… format (e.g. 2002 in literature, 1996 in the Olympics)
- discard articles whose titles consist only of digits (e.g. 1996, 819382, 42)
- discard list articles (e.g. List of … )
- discard articles belonging to a category on the stop category list (provided with the source)
- discard articles with fewer than 5 inlinks or fewer than 5 outlinks
- discard articles with fewer than 100 unique non-stop words
- use these characters to tokenize (treat each one as whitespace when splitting):
String strTokenSplit = " \t\n\r`~!@#$%^&*()_=+|[;]{},./?<>:’\\\"";

- use TITLE_WEIGHT = 4. To apply this, append 4 copies of the article title to the article text before indexing.
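
The title-based discard rules, the link and word-count thresholds, the tokenizer, and the title weighting above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name `ArticlePrep`, its method names, and the namespace prefixes in the pattern are assumptions made for the example, and the stop category check is left out because it depends on the list shipped with the source. The delimiter string is the one given above, with the typographic apostrophe written as `\u2019`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

/** Hypothetical helper; names are illustrative, not taken from the ESA source. */
final class ArticlePrep {
    static final int TITLE_WEIGHT = 4;
    static final int MIN_LINKS = 5;          // minimum inlinks and outlinks
    static final int MIN_UNIQUE_WORDS = 100; // minimum unique non-stop words

    // Delimiters from the page; \u2019 is the typographic apostrophe in that string.
    static final String STR_TOKEN_SPLIT =
        " \t\n\r`~!@#$%^&*()_=+|[;]{},./?<>:\u2019\\\"";

    // A title is discarded if the whole title matches one of these alternatives.
    private static final Pattern DISCARD_TITLE = Pattern.compile(
        "(Category|Wikipedia|Help|File):.*"   // non-main namespaces (assumed subset)
        + "|(January|February|March|April|May|June|July|August"
        + "|September|October|November|December) \\d+"   // month_year
        + "|\\d+ in .+"                       // year_in...
        + "|\\d+"                             // digits only
        + "|List of .*");                     // list articles

    /** Returns true if the article should be dropped before indexing. */
    static boolean shouldDiscard(String title, int inlinks, int outlinks,
                                 int uniqueNonStopWords) {
        // Dump titles may use underscores instead of spaces.
        String t = title.replace('_', ' ').trim();
        if (DISCARD_TITLE.matcher(t).matches()) {
            return true;
        }
        return inlinks < MIN_LINKS
            || outlinks < MIN_LINKS
            || uniqueNonStopWords < MIN_UNIQUE_WORDS;
    }

    /** Splits text on every character in STR_TOKEN_SPLIT. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, STR_TOKEN_SPLIT);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    /** Appends TITLE_WEIGHT copies of the title to the article body. */
    static String withTitleWeight(String title, String body) {
        StringBuilder sb = new StringBuilder(body);
        for (int i = 0; i < TITLE_WEIGHT; i++) {
            sb.append('\n').append(title);
        }
        return sb.toString();
    }
}
```

A caller would typically run `shouldDiscard` first, then pass `withTitleWeight(title, body)` through `tokenize` before stemming. Note that the delimiter set does not contain the hyphen, so hyphenated words stay as single tokens.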

Add Anchor Text

- add the anchor text of each incoming link to the text of the article it points to
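
One way to picture this step is a map from article title to its growing text, where each link contributes its anchor text to the target. The sketch below is only an illustration under that assumption; `AnchorTextMerger` and its methods are hypothetical names, and skipping links whose target was discarded is a choice made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that collects anchor text per target article. */
final class AnchorTextMerger {
    private final Map<String, StringBuilder> textByTitle = new HashMap<>();

    /** Registers a kept article with its own body text. */
    void addArticle(String title, String body) {
        textByTitle.put(title, new StringBuilder(body));
    }

    /** Appends anchor text to the target article; unknown targets are ignored. */
    void addAnchor(String targetTitle, String anchorText) {
        StringBuilder sb = textByTitle.get(targetTitle);
        if (sb != null) {
            sb.append('\n').append(anchorText);
        }
    }

    /** Returns the merged text for a title, or null if the article is unknown. */
    String textOf(String title) {
        StringBuilder sb = textByTitle.get(title);
        return sb == null ? null : sb.toString();
    }
}
```

In a real pipeline all articles would be registered first, and the links would then be replayed in a second pass so that every target already exists when its anchor text arrives.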

Indexing

- run the Porter stemmer 3 times instead of just once
- apply normalization to the TF-IDF scores
- prefer more general articles slightly by using: log(log(TFIDF))
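
The repeated stemming and the normalization step could look roughly like the sketch below. Two things here are assumptions rather than facts from this page: the stemmer is passed in as a plain function (any Porter stemmer implementation could be plugged in), and the normalization is taken to be L2 (cosine) normalization of a document's TF-IDF vector, since the page does not say which kind is meant. The class and method names are hypothetical.

```java
import java.util.function.UnaryOperator;

/** Hypothetical indexing helpers; not the project's actual code. */
final class IndexingSketch {
    static final int STEM_PASSES = 3;

    /** Applies the given stemmer STEM_PASSES times to one token. */
    static String stemRepeatedly(String token, UnaryOperator<String> stemmer) {
        String s = token;
        for (int i = 0; i < STEM_PASSES; i++) {
            s = stemmer.apply(s);
        }
        return s;
    }

    /** Assumed normalization: divides each weight by the vector's L2 norm. */
    static double[] l2Normalize(double[] weights) {
        double sumSq = 0.0;
        for (double w : weights) {
            sumSq += w * w;
        }
        double norm = Math.sqrt(sumSq);
        double[] out = new double[weights.length];
        if (norm == 0.0) {
            return out; // all-zero vector stays all zero
        }
        for (int i = 0; i < weights.length; i++) {
            out[i] = weights[i] / norm;
        }
        return out;
    }
}
```

Running the stemmer several times matters only for words where one pass leaves a form that the stemmer would reduce further, so most tokens are unchanged after the first pass.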

Pruning with a Sliding Window

- use WINDOW_SIZE = 100, WINDOW_THRES = 0.005
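
As a rough sketch of how these two constants might be used: for each term, sort its concept scores in descending order, slide a window of WINDOW_SIZE entries down the list, and cut the list once the score drop across the window falls below WINDOW_THRES times the highest score. This is one reading of the sliding-window pruning in the ESA paper, not a copy of the project's code; the class name, the method name, and the exact boundary handling (the first WINDOW_SIZE entries are always kept) are assumptions.

```java
/** Hypothetical pruning helper; boundary handling is an assumption. */
final class SlidingWindowPruner {
    static final int WINDOW_SIZE = 100;
    static final double WINDOW_THRES = 0.005;

    /**
     * Expects scores sorted in descending order and returns how many
     * leading entries to keep.
     */
    static int keepCount(double[] scoresDesc, int windowSize, double thres) {
        if (scoresDesc.length == 0) {
            return 0;
        }
        double cutoff = scoresDesc[0] * thres; // fraction of the highest score
        // Entry i is considered only after the windowSize entries before it
        // have been kept, so the first windowSize entries always survive.
        for (int i = windowSize; i < scoresDesc.length; i++) {
            double first = scoresDesc[i - windowSize]; // start of the window
            double last = scoresDesc[i - 1];           // end of the window
            if (first - last < cutoff) {
                return i; // scores have flattened out; drop entry i and the rest
            }
        }
        return scoresDesc.length;
    }

    /** Convenience overload using the constants from this page. */
    static int keepCount(double[] scoresDesc) {
        return keepCount(scoresDesc, WINDOW_SIZE, WINDOW_THRES);
    }
}
```

The effect is that a term keeps its long tail of concepts only while the scores are still falling noticeably; once the tail is nearly flat, the remaining low-scoring concepts are dropped from the inverted index.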
