
GSoC 2012 Progress Report (Dirk)

Links

Repository
Proposal
REST service: http://160.45.137.73:2222/rest/topic?text="Here comes the text" (if it is not running, just contact me)

Summary

Midterm Evaluation

Corpus generation

The simple approach, flattening the wiki's category hierarchy using the distance of each wiki category to a topic's main categories (which were defined manually beforehand), worked okay but was not satisfying. This holds for Wikipedia's main_topic_categorization as well as for the main wiki portals, on both of which I tested the approach.
Clustering over Wikipedia's categories, using TF-IDF vectors extracted for each of them, resulted in a much better flattening of Wikipedia categories to their main topics, but there was still too much confusion within the clusters, because clustering does not know which topics to discriminate. Hierarchical clustering, i.e. in this case clustering the still-too-fuzzy clusters again, was the next step to remove confusion, and the results did become less fuzzy, but some clusters still did not match any 'good' topic and, of course, there were too many clusters to label by hand. I therefore implemented an automatic cluster labeling mechanism, which worked well. Still, as appropriate as clustering seemed for the task of flattening Wikipedia's categories, it left too much confusion.
The last approach of this kind was a semi-supervised procedure that evolved from the clustering approach. Wikipedia categories are often strongly tied to a certain topic by keywords in their titles, e.g. Rocky_Mountains -> keywords: rocky, mountains -> topic: geography, because 'mountain' is a keyword for geography. This can be exploited by labeling such categories with their obvious topics and then training a topical classifier on these obvious examples (categories) for each topic. Note that over 200k obvious category assignments (to topics) were possible. After training the classifier, it was easy to apply it to the yet unassigned wiki categories and estimate the topic to which each should belong. This procedure could then be repeated, with the newly assigned categories serving as new training examples.
With categories assigned to topics, it is then easy to assign topics to DBpedia resources by evaluating the categories they belong to. This made it possible to split the occurrences extracted by DBpedia Spotlight by their main topics, which allowed me to generate a topically labeled corpus consisting of DBpedia Spotlight occs.
My last approach was similar to the previous one, except that it works on the resources themselves, so that categories are no longer needed. In the initial step, resource occurrences were assigned to topics by matching topical keywords against the resource titles. This implicitly generates an initial corpus for training a topical classifier. The classifier can then be used to assign further resource occurrences to topics by estimating the probability that an occurrence's context belongs to a specific topic (see the sketch below).
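A minimal sketch of this initial keyword-based seeding step; the topic keyword sets and names are illustrative placeholders, not the actual configuration used in the repository:

    // illustrative topic keyword sets (placeholders)
    val topicKeywords: Map[String, Set[String]] = Map(
      "geography" -> Set("mountain", "river", "island"),
      "sports"    -> Set("football", "olympic", "league")
    )

    // "Rocky_Mountains" -> tokens {rocky, mountains}; a token starting with a topic keyword
    // (crude stemming) marks the resource as an "obvious" member of that topic
    def seedTopics(resourceTitle: String): Set[String] = {
      val tokens = resourceTitle.toLowerCase.split("[_\\s]+").toSet
      topicKeywords.collect {
        case (topic, kws) if kws.exists(kw => tokens.exists(_.startsWith(kw))) => topic
      }.toSet
    }

Occurrences of resources with a non-empty seed topic set then form the initial corpus for training the first topical classifier.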

Modeling and Training

The proposed model, a simple topical classifier on top of LDA features, gave good results on the 20 Newsgroups dataset, but performance on Wikipedia was really bad. So I tried multinomial Naive Bayes from Weka, which is always a good choice for topical classification, and it worked pretty well on both 20 Newsgroups and my Wikipedia corpus. The advantages of multinomial Naive Bayes are that it is already incremental and also really fast. There are two options for how to use this model. One is to train it on the N different topics we chose as classes (concurrently), i.e. as a single-label classifier. The other is to train a separate model for each topic with the classes 'topic' and 'not-topic', which in effect yields a model (actually consisting of N models) that is able to do multi-label classification (see the sketch below).
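A minimal sketch of the second option (one binary model per topic) with Weka's updateable multinomial Naive Bayes; the topic list, file layout, and the assumption that all per-topic ARFF files share one attribute header are illustrative:

    import weka.classifiers.bayes.NaiveBayesMultinomialUpdateable
    import weka.core.Instance
    import weka.core.converters.ConverterUtils.DataSource

    // hypothetical layout: one ARFF per topic with the binary class {topic, not-topic}
    val topics = Seq("geography", "sports", "culture_arts")
    val models: Map[String, NaiveBayesMultinomialUpdateable] = topics.map { topic =>
      val data = DataSource.read(s"corpus/$topic.arff")  // word-count features + binary class
      data.setClassIndex(data.numAttributes() - 1)
      val nb = new NaiveBayesMultinomialUpdateable()
      nb.buildClassifier(data)                           // batch training; updateClassifier() allows incremental updates later
      topic -> nb
    }.toMap

    // multi-label prediction: every topic whose positive-class probability passes a threshold
    // (assumes 'topic' is the first class value and the instance matches the shared header)
    def topicsOf(inst: Instance, threshold: Double = 0.5): Seq[String] =
      models.collect { case (t, m) if m.distributionForInstance(inst)(0) >= threshold => t }.toSeq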

Progress

Apr 24 – May 20, 2012

  • branching/forking the main repository, setting up the IDE, learning Scala
  • scripts for:
    • converting TSV files (category, text) to Vowpal Wabbit input format
    • converting text to vectors (stopword removal, stemming, removal of numbers, etc.)
    • converting Vowpal Wabbit prediction files to .arff files
  • tests on the 20 Newsgroups dataset with online LDA (Vowpal Wabbit), mLDA and sLDA, with reasonable performance
  • flattening of the Wikipedia hierarchy
  • extracting occs from the Wikipedia dump
  • getting to know and coordinating with the other GSoC students

May 21, 2012

  • sorting of occs and categories_articles (by article)
  • decision: train the classifier on paragraphs. The assigned topic should come not from the article containing the occurrence but from the occurrence's resource, because this way the initial corpus is generated specifically from the occurrences in Wikipedia -> the topical classifier should produce topics that are better suited for disambiguation (though results could be worse for general-purpose topical classification, which is not the goal)
  • split occs.tsv by top categories

May 22, 2012

  • extracting the corpus from the split occs files

May 24, 2012

  • transforming the corpus to Vowpal Wabbit input format with a vocabulary cut and word-count transformations (format example below)
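For reference, a Vowpal Wabbit LDA input line is a bag of words with counts following a leading '|'; the words and counts here are purely illustrative:

    | rocky:3 mountain:12 range:5 colorado:2
    | museum:4 painting:7 artist:9 exhibition:2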

May 26, 2012

  • training of the LDA model and testing the features with Weka's classifiers => no useful classification possible
  • possible explanations:
    • the corpus is not suited for topical classification -- applying a classifier to plain bag-of-words features will show whether this is the case -- solution: a new corpus, possibly derived from clusters over Wikipedia
    • LDA does not work for Wikipedia -- the test is the same: other classifiers doing better on bag-of-words would indicate this

May 28, 2012

  • tests on the Wikipedia corpus with Naive Bayes (incremental) and MaxEnt went well => applying plain LDA does not give the expected results; the generated Wikipedia corpus can be used for training topical classifiers
  • in the next few days, different incremental classifiers will be evaluated and the best one chosen
  • Naive Bayes (following the paper [1]) seems like a good candidate (performance similar to MaxEnt)

[1] http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

May 29, 2012

First results:

20 Newsgroups, on the test set:

Algorithm                   Examples in corpus per category   Accuracy
Naive Bayes (incremental)   10k                               80%
MaxEnt (OpenNLP)            10k                               78%

Wikipedia, on the test set:
Note that direct classification accuracy is not a good measure here, because occurrences of different topics can appear in the same paragraph or text.

Algorithm                   Examples in corpus per category   Accuracy
Naive Bayes (incremental)   10k                               62%
MaxEnt (OpenNLP)            10k                               64%
Naive Bayes (incremental)   100k                              45%
MaxEnt (OpenNLP)            100k                              49%

Conclusions

  • Going forward: use Naive Bayes, because it is really fast, robust and incremental
  • Future work: implement HSLDA; for now, Naive Bayes is fine

May 30, 2012

  • implementation of a simple REST service
  • decision: rather try content-oriented Wikipedia portals as topics for classification (see: wikiportals), except general_references and people

May 31, 2012

  • adaptations to the indexing part to allow multiple categories under one topic (note: the workflow changed slightly)
  • indexing of the newly chosen topics (see May 30)

Results on new corpus from portals:

Algorithm                   Examples in corpus per category   Accuracy
Naive Bayes (incremental)   100k                              57%

June 5-9, 2012

  • last approach for flattening the Wikipedia hierarchy: clustering over Wikipedia categories
  • FileOccurenceCategorySource class for iterating over extracted occs and their sets of categories
  • new TextToWordVector implementation, using Lucene
  • extraction of a corpus of word vectors for each DBpedia category (for details see the topic selection part of the workflow section below; a TF-IDF note follows this list)
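For reference, a standard TF-IDF weight as used for these category word vectors; this is the common textbook variant, and the Lucene-based implementation may differ in details:

    // term frequency scaled by inverse document frequency (common variant, shown for illustration)
    def tfIdf(termCount: Int, docLength: Int, docFreq: Int, numDocs: Int): Double =
      (termCount.toDouble / docLength) * math.log(numDocs.toDouble / (1 + docFreq))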

June 10, 2012

  • decision: clustering with Vowpal Wabbit's fast online LDA
  • conversion of extracted data to vowpal input format
  • clustering

June 11-12, 2012

June 13-14, 2012

  • manual cluster labeling
  • some clusters are still fuzzy and would need reclustering

June 15, 2012

  • reclustering would result in too many new clusters
  • decision: label clusters automatically
  • idea: take the keywords extracted from the names of the DBpedia categories that are members of the same cluster and compare them to the topics' keywords (see the sketch after this list)
  • problem: obtaining typical keywords for the topics...
  • solution: Wikipedia contains two categories whose pages provide topical outlines and/or indexes, which implicitly gives me a keyword set for each topic
  • extraction of topic keywords from Wikipedia
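A minimal sketch of this automatic labeling step; the overlap score and threshold are illustrative placeholders, not the exact scoring used in the implementation:

    // keywords from a category name, e.g. "Rocky_Mountains" -> {rocky, mountains}
    def keywords(categoryName: String): Set[String] =
      categoryName.toLowerCase.split("[_\\s]+").toSet

    // label a cluster with the topic whose keyword set best overlaps the keywords of the
    // cluster's member categories; None means the assignment is too fuzzy -> recluster
    def labelCluster(memberCategories: Seq[String],
                     topicKeywords: Map[String, Set[String]],
                     minScore: Double = 0.3): Option[String] = {
      val clusterKeywords = memberCategories.flatMap(keywords).toSet
      val (bestTopic, bestScore) = topicKeywords
        .map { case (topic, kws) => topic -> kws.intersect(clusterKeywords).size.toDouble / kws.size }
        .maxBy(_._2)
      if (bestScore >= minScore) Some(bestTopic) else None
    }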

June 18, 2012

  • topic selection, see here
  • implementation of automatic hierarchical clustering using LDA (Vowpal Wabbit) and automatic cluster labeling
    • idea: if the label assignment is too fuzzy, cluster that specific cluster again

June 19-20, 2012

  • clustering with implicit flattening of Wikipedia's categories to their topics
  • splitting occs by topics
  • extracting the training corpus for the classifier
  • training the classifier

June 21, 2012

  • just an idea: maybe the work would have been easier if
    • I had trained a model on news feeds in the first place (because they already provide topical meta-information)
    • and then split the occs by topic using the trained model, i.e. determined the topic(s) of each occ and put it into the corresponding topic splits

June 22, 2012

  • new flattening idea: take the obvious categories for each topic, train a topical classifier on them, and assign still-unassigned categories to topics using the model (repeat these steps until convergence)
    • why? using classification rather than clustering makes the whole process supervised, thus yielding the results we actually aim for
  • implementation of the new idea

June 23-25, 2012

  • Flattening the hierarchy in a semi-supervised fashion
    • results are much clearer
    • implementation was much easier (and shorter)
    • much more intuitive procedure

June 26, 2012

  • trained the final model before the midterm evaluation
  • implemented and executed the splitting of occs not yet assigned to topics, using the trained model

July 03, 2012

  • implemented multi-label topical classification (training one model per topic)
  • training of the model (can be found on biggy, deployed on the REST server)

July 04, 2012

  • implementation of another occ splitting method, which is direct (i.e. no wiki categories are involved anymore) and semi-supervised (see the sketch after this list)
    1. assign occurrences of resources that obviously belong to a certain topic (by matching each topic's keywords against the resource names)
    2. train a topical classifier on these assigned occurrences
    3. split the occurrences using the trained model
    4. repeat steps 2-3 for a specified number of iterations
  • again much shorter and simpler than the previous splitting procedure
  • splitting the occurrences
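A condensed sketch of steps 1-4; the Occ type and the seed/train functions are placeholders rather than the actual SplitOccsSemiSupervised signature:

    // one occurrence of a resource together with its textual context (placeholder type)
    case class Occ(resource: String, context: String)

    def splitSemiSupervised(occs: Seq[Occ],
                            seedTopic: Occ => Option[String],                          // step 1: keyword match on the resource name
                            train: Map[String, Seq[Occ]] => (Occ => (String, Double)), // trains a classifier, returns (topic, confidence)
                            minConfidence: Double,
                            iterations: Int): Map[String, Seq[Occ]] = {
      var split = occs.flatMap(o => seedTopic(o).map(_ -> o))
        .groupBy(_._1).map { case (t, ps) => t -> ps.map(_._2) }
      for (_ <- 1 to iterations) {                                                     // step 4: repeat
        val classify = train(split)                                                    // step 2: train on the current split
        split = occs.map(o => (classify(o), o))                                        // step 3: re-assign all occurrences
          .collect { case ((topic, conf), o) if conf >= minConfidence => topic -> o }
          .groupBy(_._1).map { case (t, ps) => t -> ps.map(_._2) }
      }
      split
    }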

July 05, 2012

  • meeting with Pablo
  • training on the new occurrence split -> results look really good

July 07, 2012

  • implementation of the RSS feed and incremental training, not yet tested

July 08-11, 2012

  • integration of wikifeed
  • tested the implemented feed framework within DBpedia Spotlight

July 12-13, 2012

  • indexing the split occs on biggy (split index)

July 14, 2012

  • discussion with Pablo about TREC KBA as an evaluation corpus for DBpedia Spotlight Live
  • sketch of a possible workflow for our "KBA" system

July 15-19, 2012

  • implementation of training for the TREC KBA task, which can be found in the live package of the index module

July 19 - ongoing

  • refactoring
  • documentation
    • sketch of whole indexing and training part
  • testing

Workflow

Corpus Generation

old part:

  • ExtractCandidateMap
    • Args= $INDEX_CONFIG_FILE
  • ExtractOccsFromWikipedia
    • Args= $INDEX_CONFIG_FILE | output/occs.tsv
  • sort -t $'\t' -k2,2 -k1,1 output/occs.tsv > output/occs.uriSorted.tsv

Topic Selection and Topical Category Assignment (just needed for indirect splitting)

Either handpicked or from clustering. This part explains the creation of the category (word vector) corpus used for clustering.

  • ExtractCategoryCorpus -- write temporary vector file
    • Args = "extract" | path to sorted articles_categories.nt | path to sorted occs | output path of temporary wordvectors | offset
    • there was a problem between the 29,200,000th and the 29,700,000th occ -- solution: extract up to 29,200,000 occs, then extract again starting from (offset=) the 29,700,000th occ
  • sort temporary vectors by first column
  • ExtractCategoryCorpus -- merge temporary vectors to vowpal's input file (file conversions can be easily done from there)
    • Args = "merge" | path to sorted temporary vectors | output path
    • Note: the vectors are TF-IDF vectors
  • VowpalWCScalingFilter (no longer necessary, because WriteCategoryCorpus already does this)
    • Args = input file (output from WriteCategoryCorpus) | output file | scaling factor (eg 10000 worked well)
    • scales word counts down by a given factor (new count = old count / factor)
  • shuf -o shuffled.scaled.category.corpus scaled.category.corpus
  • download and install Vowpal Wabbit and set the path to the executable in indexing.properties

Hierarchy Flattening (just needed for indirect splitting)

Handpicked topics
  • FlattenWikipediaHierarchy (OLD)
    • Args= articles_categories.nt | outputDir | maximal depth [ | categories file path ]
    • categories file structure: each line is "topiclabel=category1,category2..." (e.g. "culture_arts=Culture,Arts")
    • the maximal depth is the maximal distance allowed from a top category to its subcategories; everything that does not fall under one of the top categories (given by the topics' categories) within this maximal depth is assigned to a new topic 'others'
From Clustering
  • FlattenHierarchyByClusters (OLD)
    • Args= vowpal input file created by WriteCategoryCorpus | vowpal rest-input file created by WriteCategoryCorpus | rest.categories.list from WriteCategoryCorpus | categories.list from WriteCategoryCorpus | path where flattened hierarchy should be written | path to temporary working directory
    • this process will cluster the input and automatically assign labels to clusters or recluster if label assignment is too fuzzy
From obvious categories and subsequently trained model
  • FlattenHierarchyByTopics
    • best approach; see June 22-26 above for a closer description
    • Args= path to indexing properties | path to training corpus | path to training corpus' categories | path to evaluation corpus | path to evaluation corpus' categories | path to temporary dir | confidence threshold for assigning a category to a topic (should be high, probably at least 0.8)

Topic Corpus Generation

Utilizing the flattened hierarchy (indirect splitting through Wikipedia categories)
  • SplitOccsByTopics
    • Args= indexing.properties | path to sorted occs file | path to output directory
Without flattened hierarchy (direct splitting)
  • SplitOccsSemiSupervised
    • Args= indexing.properties | path to (sorted) occs file | temporary path (same partition as the output) | minimal confidence for assigning an occ to a topic | number of iterations | path to output directory (example invocation below)
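An invocation following the same mvn scala:run pattern as the training command below might look like this; the package path and the confidence/iteration values are assumptions, so check the repository for the exact main class:

    # the package path below is an assumption; arguments follow the order listed above
    mvn scala:run -DmainClass=org.dbpedia.spotlight.topic.SplitOccsSemiSupervised \
      "-DaddArgs=/.../indexing.properties|/.../occs.uriSorted.tsv|/.../tmp|0.5|3|/.../splitOccs"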

Training

  • java -cp weka.jar weka.classifiers.bayes.NaiveBayesMultinomialUpdateable -t training.arff -T test.arff -d model.dat > weka.out
  • mvn scala:run -DmainClass=org.dbpedia.spotlight.topic.WekaMultiLabelClassifier "-DaddArgs=/.../train.corpus.arff|/.../multilabel-model"

Live

  • RunTrecKBA
    • Args: spotlight configuration, trec corpus dir, trecJudgmentsFile, training start date (yyyy-MM-dd-hh), end date, minimal confidence for assigning a topic to a list of resources, path to the trec target entity classifier model dir, evaluation folder (evaluation if the folder exists, no evaluation otherwise), clear (optional, start training from scratch except for the topical classifier)