
GSoC 2012 Progress Report (Dirk)

Links

Repository
Proposal
REST service: http://160.45.137.73:2222/rest/topic?text="Here comes the text" (if it is not running, just contact me)

Summary

Midterm Evaluation

Corpus generation

The simple approach, flattening the wiki's category hierarchy using the distance of each wiki category to a topic's main categories (which were defined manually beforehand), worked okay but was not satisfying. This holds for Wikipedia's main_topic_categorization as well as for the main wiki portals, on both of which I tested the approach.
Clustering over Wikipedia's categories, using TF-IDF vectors extracted for each of them, resulted in a much better flattening of Wikipedia categories to their main topics, but there was still too much confusion within the clusters, because clustering does not know which topics to discriminate. Hierarchical clustering, i.e. in this case clustering the still-too-fuzzy clusters again, was the next step to remove confusion, and the results did become less fuzzy, but some clusters still did not match any 'good' topic and, of course, there were too many clusters to label by hand. I therefore implemented an automatic cluster labeling mechanism, which worked well. Still, as appropriate as clustering seemed for the task of flattening Wikipedia's categories, it left too much confusion.
The last approach of this kind was a semi-supervised procedure that evolved from the clustering approach. Wikipedia categories are often strongly tied to a certain topic by keywords in their titles, e.g. Rocky_Mountains -> keywords: rocky, mountains -> topic: geography, because 'mountain' is a keyword for geography. This can be exploited by labeling such categories with their obvious topics and then training a topical classifier on these obvious examples (categories) for each topic. Note that over 200k obvious category assignments (to topics) were possible. After training the classifier, it was easy to apply it to the yet unassigned wiki categories and estimate the topic to which each should belong. This procedure could then be repeated, with the newly assigned categories serving as new training examples.
With categories assigned to topics, it is then easy to assign topics to DBpedia resources by evaluating the categories they belong to. This made it possible to split the occurrences extracted by DBpedia Spotlight by their main topics, which allowed me to generate a topically labeled corpus consisting of DBpedia Spotlight occs.
My last approach was similar to the previous one, except that it works on the resources themselves, so that categories are no longer needed. In the initial step, resource occurrences were assigned to topics by matching topical keywords against the resource titles. This implicitly generates an initial corpus for training a topical classifier. The classifier can then be used to assign further resource occurrences to topics by estimating the probability that an occurrence's context belongs to a specific topic (see the sketch below).
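A minimal sketch of this initial keyword-based seeding step; the topic keyword sets and names are illustrative placeholders, not the actual configuration used in the repository:

    // illustrative topic keyword sets (placeholders)
    val topicKeywords: Map[String, Set[String]] = Map(
      "geography" -> Set("mountain", "river", "island"),
      "sports"    -> Set("football", "olympic", "league")
    )

    // "Rocky_Mountains" -> tokens {rocky, mountains}; a token starting with a topic keyword
    // (crude stemming) marks the resource as an "obvious" member of that topic
    def seedTopics(resourceTitle: String): Set[String] = {
      val tokens = resourceTitle.toLowerCase.split("[_\\s]+").toSet
      topicKeywords.collect {
        case (topic, kws) if kws.exists(kw => tokens.exists(_.startsWith(kw))) => topic
      }.toSet
    }

Occurrences of resources with a non-empty seed topic set then form the initial corpus for training the first topical classifier.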

Modeling and Training

The proposed model, a simple topical classifier on top of LDA features, gave good results on the 20 Newsgroups dataset, but performance on Wikipedia was really bad. So I tried multinomial Naive Bayes from Weka, which is always a good choice for topical classification, and it worked pretty well on both 20 Newsgroups and my Wikipedia corpus. The advantages of multinomial Naive Bayes are that it is already incremental and also really fast. There are two options for how to use this model. One is to train it on the N different topics we chose as classes (concurrently), i.e. as a single-label classifier. The other is to train a separate model for each topic with the classes 'topic' and 'not-topic', which in effect yields a model (actually consisting of N models) that is able to do multi-label classification (see the sketch below).
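A minimal sketch of the second option (one binary model per topic) with Weka's updateable multinomial Naive Bayes; the topic list, file layout, and the assumption that all per-topic ARFF files share one attribute header are illustrative:

    import weka.classifiers.bayes.NaiveBayesMultinomialUpdateable
    import weka.core.Instance
    import weka.core.converters.ConverterUtils.DataSource

    // hypothetical layout: one ARFF per topic with the binary class {topic, not-topic}
    val topics = Seq("geography", "sports", "culture_arts")
    val models: Map[String, NaiveBayesMultinomialUpdateable] = topics.map { topic =>
      val data = DataSource.read(s"corpus/$topic.arff")  // word-count features + binary class
      data.setClassIndex(data.numAttributes() - 1)
      val nb = new NaiveBayesMultinomialUpdateable()
      nb.buildClassifier(data)                           // batch training; updateClassifier() allows incremental updates later
      topic -> nb
    }.toMap

    // multi-label prediction: every topic whose positive-class probability passes a threshold
    // (assumes 'topic' is the first class value and the instance matches the shared header)
    def topicsOf(inst: Instance, threshold: Double = 0.5): Seq[String] =
      models.collect { case (t, m) if m.distributionForInstance(inst)(0) >= threshold => t }.toSeq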

Progress

Apr 24 – May 20, 2012

  • branching/forking the main repository, setting up the IDE, learning Scala
  • scripts for:
    • converting TSV files (category, text) to Vowpal Wabbit input format
    • converting text to vectors (stopword removal, stemming, removal of numbers, etc.)
    • converting Vowpal Wabbit prediction files to .arff files
  • tests on the 20 Newsgroups dataset with online LDA (Vowpal Wabbit), mLDA and sLDA, with reasonable performance
  • flattening of the Wikipedia hierarchy
  • extracting occs from the Wikipedia dump
  • getting to know and coordinating with the other GSoC students

May 21, 2012

  • sorting of occs and categories_articles (by article)
  • decision: train the classifier on paragraphs. The assigned topic should come not from the article containing the occurrence but from the occurrence's resource, because this way the initial corpus is generated specifically from the occurrences in Wikipedia -> the topical classifier should produce topics that are better suited for disambiguation (though results could be worse for general-purpose topical classification, which is not the goal)
  • split occs.tsv by top categories

May 22, 2012

  • extracting the corpus from the split occs files

May 24, 2012

  • transforming the corpus to Vowpal Wabbit input format with a vocabulary cut and word-count transformations (format example below)
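For reference, a Vowpal Wabbit LDA input line is a bag of words with counts following a leading '|'; the words and counts here are purely illustrative:

    | rocky:3 mountain:12 range:5 colorado:2
    | museum:4 painting:7 artist:9 exhibition:2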

May 26, 2012

  • training of the LDA model and testing the features with Weka's classifiers => no useful classification possible
  • possible explanations:
    • the corpus is not suited for topical classification -- applying a classifier to plain bag-of-words features will show whether this is the case -- solution: a new corpus, possibly derived from clusters over Wikipedia
    • LDA does not work for Wikipedia -- the test is the same: other classifiers doing better on bag-of-words would indicate this

May 28, 2012

  • tests on the Wikipedia corpus with Naive Bayes (incremental) and MaxEnt went well => applying plain LDA does not give the expected results; the generated Wikipedia corpus can be used for training topical classifiers
  • in the next few days, different incremental classifiers will be evaluated and the best one chosen
  • Naive Bayes (following the paper [1]) seems like a good candidate (performance similar to MaxEnt)

[1] http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

May 29, 2012

First results:

20 Newsgroups, on the test set:

Algorithm                   Examples in corpus per category   Accuracy
Naive Bayes (incremental)   10k                               80%
MaxEnt (OpenNLP)            10k                               78%

Wikipedia, on the test set:
Note that direct classification accuracy is not a good measure here, because occurrences of different topics can appear in the same paragraph or text.

Algorithm                   Examples in corpus per category   Accuracy
Naive Bayes (incremental)   10k                               62%
MaxEnt (OpenNLP)            10k                               64%
Naive Bayes (incremental)   100k                              45%
MaxEnt (OpenNLP)            100k                              49%

Conclusions

  • Going forward: use Naive Bayes, because it is really fast, robust and incremental
  • Future work: implement HSLDA; for now, Naive Bayes is fine

May 30, 2012

  • implementation of a simple REST service
  • decision: rather try content-oriented Wikipedia portals as topics for classification (see: wikiportals), except general_references and people

May 31, 2012

  • adaptations to the indexing part to allow multiple categories under one topic (note: the workflow changed slightly)
  • indexing of the newly chosen topics (see May 30)

Results on new corpus from portals:

Algorithm                   Examples in corpus per category   Accuracy
Naive Bayes (incremental)   100k                              57%

June 5-9, 2012

  • last approach for flattening the Wikipedia hierarchy: clustering over Wikipedia categories
  • FileOccurenceCategorySource class for iterating over extracted occs and their sets of categories
  • new TextToWordVector implementation, using Lucene
  • extraction of a corpus of word vectors for each DBpedia category (for details see the topic selection part of the workflow section below; a TF-IDF note follows this list)
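For reference, a standard TF-IDF weight as used for these category word vectors; this is the common textbook variant, and the Lucene-based implementation may differ in details:

    // term frequency scaled by inverse document frequency (common variant, shown for illustration)
    def tfIdf(termCount: Int, docLength: Int, docFreq: Int, numDocs: Int): Double =
      (termCount.toDouble / docLength) * math.log(numDocs.toDouble / (1 + docFreq))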

June 10, 2012

  • decision: clustering with Vowpal Wabbit's fast online LDA
  • conversion of extracted data to vowpal input format
  • clustering

June 11-12, 2012

June 13-14, 2012

  • manual cluster labeling
  • some clusters are still fuzzy and would need reclustering

June 15, 2012

  • reclustering would result in too many new clusters
  • decision: label clusters automatically
  • idea: take the keywords extracted from the names of the DBpedia categories that are members of the same cluster and compare them to the topics' keywords (see the sketch after this list)
  • problem: obtaining typical keywords for the topics...
  • solution: Wikipedia contains two categories whose pages provide topical outlines and/or indexes, which implicitly gives me a keyword set for each topic
  • extraction of topic keywords from Wikipedia
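A minimal sketch of this automatic labeling step; the overlap score and threshold are illustrative placeholders, not the exact scoring used in the implementation:

    // keywords from a category name, e.g. "Rocky_Mountains" -> {rocky, mountains}
    def keywords(categoryName: String): Set[String] =
      categoryName.toLowerCase.split("[_\\s]+").toSet

    // label a cluster with the topic whose keyword set best overlaps the keywords of the
    // cluster's member categories; None means the assignment is too fuzzy -> recluster
    def labelCluster(memberCategories: Seq[String],
                     topicKeywords: Map[String, Set[String]],
                     minScore: Double = 0.3): Option[String] = {
      val clusterKeywords = memberCategories.flatMap(keywords).toSet
      val (bestTopic, bestScore) = topicKeywords
        .map { case (topic, kws) => topic -> kws.intersect(clusterKeywords).size.toDouble / kws.size }
        .maxBy(_._2)
      if (bestScore >= minScore) Some(bestTopic) else None
    }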

June 18, 2012

  • topic selection, see here
  • implementation of automatic hierarchical clustering using LDA (Vowpal Wabbit) and automatic cluster labeling
    • idea: if the label assignment is too fuzzy, cluster that specific cluster again

June 19-20, 2012

  • clustering with implicit flattening of Wikipedia's categories to their topics
  • splitting occs by topics
  • extracting the training corpus for the classifier
  • training the classifier

June 21, 2012

  • just an idea: maybe the work would have been easier if
    • I had trained a model on news feeds in the first place (because they already provide topical meta-information)
    • and then split the occs by topic using the trained model, i.e. determined the topic(s) of each occ and put it into the corresponding topic splits

June 22, 2012

  • new flattening idea: take the obvious categories for each topic, train a topical classifier on them, and assign still-unassigned categories to topics using the model (repeat these steps until convergence)
    • why? using classification rather than clustering makes the whole process supervised, thus yielding the results we actually aim for
  • implementation of the new idea

June 23-25, 2012

  • Flattening the hierarchy in a semi-supervised fashion
    • results are much clearer
    • implementation was much easier (and shorter)
    • much more intuitive procedure

June 26, 2012

  • trained the final model before the midterm evaluation
  • implemented and executed the splitting of occs not yet assigned to topics, using the trained model

July 03, 2012

  • implemented multi-label topical classification (training one model per topic)
  • training of the model (can be found on biggy, deployed on the REST server)

July 04, 2012

  • implementation of another occ splitting method, which is direct (i.e. no wiki categories are involved anymore) and semi-supervised (see the sketch after this list)
    1. assign occurrences of resources that obviously belong to a certain topic (by matching each topic's keywords against the resource names)
    2. train a topical classifier on these assigned occurrences
    3. split the occurrences using the trained model
    4. repeat steps 2-3 for a specified number of iterations
  • again much shorter and simpler than the previous splitting procedure
  • splitting the occurrences
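A condensed sketch of steps 1-4; the Occ type and the seed/train functions are placeholders rather than the actual SplitOccsSemiSupervised signature:

    // one occurrence of a resource together with its textual context (placeholder type)
    case class Occ(resource: String, context: String)

    def splitSemiSupervised(occs: Seq[Occ],
                            seedTopic: Occ => Option[String],                          // step 1: keyword match on the resource name
                            train: Map[String, Seq[Occ]] => (Occ => (String, Double)), // trains a classifier, returns (topic, confidence)
                            minConfidence: Double,
                            iterations: Int): Map[String, Seq[Occ]] = {
      var split = occs.flatMap(o => seedTopic(o).map(_ -> o))
        .groupBy(_._1).map { case (t, ps) => t -> ps.map(_._2) }
      for (_ <- 1 to iterations) {                                                     // step 4: repeat
        val classify = train(split)                                                    // step 2: train on the current split
        split = occs.map(o => (classify(o), o))                                        // step 3: re-assign all occurrences
          .collect { case ((topic, conf), o) if conf >= minConfidence => topic -> o }
          .groupBy(_._1).map { case (t, ps) => t -> ps.map(_._2) }
      }
      split
    }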

July 05, 2012

  • meeting with Pablo
  • training on the new occurrence split -> results look really good

July 07, 2012

  • implementation of the RSS feed and incremental training, not yet tested

July 08-11, 2012

  • integration of wikifeed
  • tested the implemented feed framework within DBpedia Spotlight

July 12-13, 2012

  • indexing the split occs on biggy (split index)

July 14, 2012

  • discussion with Pablo about TREC KBA as an evaluation corpus for DBpedia Spotlight Live
  • sketch of a possible workflow for our "KBA" system

July 15-19, 2012

  • implementation of training for the TREC KBA task, which can be found in the live package of the index module

July 19 - ongoing

  • refactoring
  • documentation
    • sketch of whole indexing and training part
  • testing

Workflow

Corpus Generation

old part:

  • ExtractCandidateMap
    • Args= $INDEX_CONFIG_FILE
  • ExtractOccsFromWikipedia
    • Args= $INDEX_CONFIG_FILE | output/occs.tsv
  • sort -t $'\t' -k2,2 -k1,1 output/occs.tsv > output/occs.uriSorted.tsv

Topic Selection and Topical Category Assignment (just needed for indirect splitting)

Either handpicked or from clustering. This part explains the creation of the category (word vector) corpus used for clustering.

  • ExtractCategoryCorpus -- write temporary vector file
    • Args = "extract" | path to sorted articles_categories.nt | path to sorted occs | output path of temporary wordvectors | offset
    • there was a problem between the 29,200,000th and the 29,700,000th occ -- solution: extract up to 29,200,000 occs, then extract again starting from (offset=) the 29,700,000th occ
  • sort temporary vectors by first column
  • ExtractCategoryCorpus -- merge temporary vectors to vowpal's input file (file conversions can be easily done from there)
    • Args = "merge" | path to sorted temporary vectors | output path
    • Note: the vectors are TF-IDF vectors
  • VowpalWCScalingFilter (no longer necessary, because WriteCategoryCorpus already does this)
    • Args = input file (output from WriteCategoryCorpus) | output file | scaling factor (eg 10000 worked well)
    • scales word counts down by a given factor (new count = old count / factor)
  • shuf -o shuffled.scaled.category.corpus scaled.category.corpus
  • download and install Vowpal Wabbit and set the path to the executable in indexing.properties

Hierarchy Flattening (just needed for indirect splitting)

Handpicked topics
  • FlattenWikipediaHierarchy (OLD)
    • Args= articles_categories.nt | outputDir | maximal depth [ | categories file path ]
    • categories file structure: each line is "topiclabel=category1,category2..." (e.g. "culture_arts=Culture,Arts")
    • the maximal depth is the maximal distance allowed from a top category to its subcategories; everything that does not fall under one of the top categories (given by the topics' categories) within this maximal depth is assigned to a new topic 'others'
From Clustering
  • FlattenHierarchyByClusters (OLD)
    • Args= vowpal input file created by WriteCategoryCorpus | vowpal rest-input file created by WriteCategoryCorpus | rest.categories.list from WriteCategoryCorpus | categories.list from WriteCategoryCorpus | path where flattened hierarchy should be written | path to temporary working directory
    • this process will cluster the input and automatically assign labels to clusters or recluster if label assignment is too fuzzy
From obvious categories and subsequently trained model
  • FlattenHierarchyByTopics
    • best approach; see June 22-26 above for a closer description
    • Args= path to indexing properties | path to training corpus | path to training corpus' categories | path to evaluation corpus | path to evaluation corpus' categories | path to temporary dir | confidence threshold for assigning a category to a topic (should be high, probably at least 0.8)

Topic Corpus Generation

Utilizing the flattened hierarchy (indirect splitting through Wikipedia categories)
  • SplitOccsByTopics
    • Args= indexing.properties | path to sorted occs file | path to output directory
Without flattened hierarchy (direct splitting)
  • SplitOccsSemiSupervised
    • Args= indexing.properties | path to (sorted) occs file | temporary path (same partition as the output) | minimal confidence for assigning an occ to a topic | number of iterations | path to output directory (example invocation below)
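An invocation following the same mvn scala:run pattern as the training command below might look like this; the package path and the confidence/iteration values are assumptions, so check the repository for the exact main class:

    # the package path below is an assumption; arguments follow the order listed above
    mvn scala:run -DmainClass=org.dbpedia.spotlight.topic.SplitOccsSemiSupervised \
      "-DaddArgs=/.../indexing.properties|/.../occs.uriSorted.tsv|/.../tmp|0.5|3|/.../splitOccs"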

Training

  • java -cp weka.jar weka.classifiers.bayes.NaiveBayesMultinomialUpdateable -t training.arff -T test.arff -d model.dat > weka.out
  • mvn scala:run -DmainClass=org.dbpedia.spotlight.topic.WekaMultiLabelClassifier "-DaddArgs=/.../train.corpus.arff|/.../multilabel-model"

Live

  • RunTrecKBA
    • Args: spotlight configuration, trec corpus dir, trecJudgmentsFile, training start date (yyyy-MM-dd-hh), end date, minimal confidence for assigning a topic to a list of resources, path to the trec target entity classifier model dir, evaluation folder (evaluation if the folder exists, no evaluation otherwise), clear (optional, start training from scratch except for the topical classifier)