
# GSoC2013 Progress (Wei Wang)


##About

Name: Wei Wang

Mentor: Dirk

GitHub Link: github

Proposal Link: gsoc

###Project Short Description

In this project, I am going to implement the entity-topic model [1] for entity linking. This model combines the context consistency of entity mentions with the topic coherence of the document. Specifically, it models the generation of a document by sampling its words and mentions from a probabilistic model. For example, to generate an entity mention of a document, we first sample a topic for this mention from the document's topic distribution. Then, we sample an entity from that topic's entity distribution. Finally, we sample a mention from the entity's mention distribution. Both learning and inference of this model are conducted through Gibbs sampling.

[1]Xianpei Han, Le Sun: An Entity-Topic Model for Entity Linking. EMNLP-CoNLL 2012: 105-115
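
To illustrate the generative story above, here is a minimal, self-contained Scala sketch; the distribution arrays and the sampling helper are hypothetical placeholders for illustration, not part of the actual implementation.

    import scala.util.Random

    object GenerativeSketch {
      val rand = new Random(42)

      // Sample an index from an unnormalized discrete distribution.
      def sample(weights: Array[Double]): Int = {
        var r = rand.nextDouble() * weights.sum
        var i = 0
        while (i < weights.length - 1 && r >= weights(i)) { r -= weights(i); i += 1 }
        i
      }

      // Hypothetical distributions learned by the model (placeholders only):
      // topicDistOfDoc(t): P(topic t | document), entityDistOfTopic(t)(e): P(entity e | topic t),
      // mentionDistOfEntity(e)(m): P(mention m | entity e).
      def generateMention(topicDistOfDoc: Array[Double],
                          entityDistOfTopic: Array[Array[Double]],
                          mentionDistOfEntity: Array[Array[Double]]): (Int, Int, Int) = {
        val t = sample(topicDistOfDoc)          // 1) topic from the document's topic distribution
        val e = sample(entityDistOfTopic(t))    // 2) entity from the topic's entity distribution
        val m = sample(mentionDistOfEntity(e))  // 3) mention (surface form) from the entity's mention distribution
        (t, e, m)
      }
    }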

##Progress

###June 2, 2013

  • Set up IntelliJ.
  • Build dbpedia-spotlight from source inside IntelliJ.
  • Create the dbpedia-spotlight model (DB-backed core).
  • Read code related to DBTwoStepDisambiguation.

###June 7, 2013

  • Contact the author of [1]; get details on the implementation of the model.

  • Figure out the necessary data structures (a rough Scala sketch follows the block below):

      let d be a document, d_m a mention in document d, and d_w a word in document d

      * assignments for document d:
      ta[d,d_m]  //topic assignment for mention d_m
      ea[d,d_m]  //entity assignment for mention d_m; in training, initialized according to
                 //the entity-mention distribution derived from Wikipedia's link anchors
      aa[d,d_w]  //entity assignment for word d_w; in training, initialized to the nearest entity

      * counts for document d:
      Ct[d,t]    //count of topic t in document d
      Ce[d,e]    //count of mentions in document d assigned to entity e
      Ca[d,e]    //count of words in document d assigned to entity e

      * global counts:
      Cte[t,e]   //topic-entity count matrix
      Cem[e,m]   //entity-mention (name) count matrix
      Cew[e,w]   //entity-context-word count matrix
    
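For reference, a minimal sketch of how these assignments and counts might be held in Scala, assuming integer ids for topics, entities, mentions, and words; the class and field names are illustrative only and do not correspond to classes in the code base.

    import scala.collection.mutable

    // Per-document assignments and counts (illustrative names only).
    class DocumentCounts(numMentions: Int, numWords: Int) {
      val ta = new Array[Int](numMentions)  // topic assignment for each mention
      val ea = new Array[Int](numMentions)  // entity assignment for each mention
      val aa = new Array[Int](numWords)     // entity assignment for each word
      val Ct = mutable.HashMap[Int, Int]().withDefaultValue(0)  // topic counts
      val Ce = mutable.HashMap[Int, Int]().withDefaultValue(0)  // entity counts (from mentions)
      val Ca = mutable.HashMap[Int, Int]().withDefaultValue(0)  // entity counts (from words)
    }

    // Global counts shared across all documents (illustrative names only).
    class GlobalCounts {
      val Cte = mutable.HashMap[(Int, Int), Int]().withDefaultValue(0)  // (topic, entity)   -> count
      val Cem = mutable.HashMap[(Int, Int), Int]().withDefaultValue(0)  // (entity, mention) -> count
      val Cew = mutable.HashMap[(Int, Int), Int]().withDefaultValue(0)  // (entity, word)    -> count
    }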

###June 14, 2013

  • Prepare the training and testing datasets. I used the Dutch Wikipedia: 20% of the articles are randomly selected as the testing dataset; the rest is the training dataset.

  • Figure out the workflow of the training algorithm, as shown in the following pseudocode (a Scala sketch of the Gibbs update in Step 3 follows the pseudocode):

      Step 1:
          use pignlproc to collect surface forms from training dataset to create spotter model; 
    
      Step 2:
          //initialize assignments
          let D = training dataset; spotter = the spotter model from Step 1
          foreach article d in D:
              M = spotter.detectMentions(d)
              foreach mention m in M:
                  ta[d,m] = t = sampleRandom(T)  //randomly select a topic id from [0,T)
                  Ct[d,t]++
                  ea[d,m] = e = getEntityFor(m)  //sample an entity from m's entity distribution; if m is a link, return the target entity
                  Ce[d,e]++

                  Cte[t,e]++
                  Cem[e,m]++
              foreach word w in d:
                  aa[d,w] = e = getNearestEntity(w)
                  Ca[d,e]++
                  Cew[e,w]++
            
      Step 3:
          //gibbs sampling
          for step=0;step<STEP;step++
              foreach article d in D:
                  M = spotter.detectMentions(d)
                  foreach mention m in M:
                      ta[d,m] = t = GibbsSampleMentionTopic()   //sample the topic for each mention according to [1]
                      update Ct[d,t]

                      ea[d,m] = e = GibbsSampleMentionEntity()  //sample the entity for each mention according to [1]
                      update Ce[d,e]

                      update Cte[t,e] and Cem[e,m]
                  foreach word w in d:
                      aa[d,w] = e = GibbsSampleWordEntity()     //sample the entity for each word according to [1]
                      update Ca[d,e], Cew[e,w]
    
      Step 4:
          //store global knowledge(distributions)
          calculate and store entity distribution for each topic from Cte[,]
          calculate and store mention distribution for each entity from Cem[,]
          calculate and store context word distribution for each entity from Cew[,]
    
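To make Step 3 of the pseudocode more concrete, below is a minimal Scala sketch of resampling a mention's entity assignment with collapsed Gibbs sampling. The smoothing constants and the count-ratio form are assumptions for illustration (the exact conditional distributions are given in [1]), and the per-document counts are omitted for brevity.

    import scala.collection.mutable
    import scala.util.Random

    object GibbsSketch {
      val rand  = new Random()
      val beta  = 0.01  // assumed smoothing constant for the topic-entity distribution
      val gamma = 0.01  // assumed smoothing constant for the entity-mention distribution

      // Resample the entity assignment of mention m (currently oldE) given its topic t.
      // Count maps are assumed to be created with withDefaultValue(0). The count-ratio form
      // below is only indicative of a collapsed Gibbs update; the exact conditional is in [1].
      def sampleMentionEntity(m: Int, t: Int, oldE: Int,
                              candidates: Array[Int],
                              Cte: mutable.Map[(Int, Int), Int],
                              Cem: mutable.Map[(Int, Int), Int],
                              numEntities: Int, numMentions: Int): Int = {
        // remove the current assignment from the global counts
        Cte((t, oldE)) -= 1
        Cem((oldE, m)) -= 1

        val topicTotal = Cte.collect { case ((tt, _), c) if tt == t => c }.sum
        val weights = candidates.map { e =>
          val entityTotal = Cem.collect { case ((ee, _), c) if ee == e => c }.sum
          val pEntityGivenTopic   = (Cte((t, e)) + beta)  / (topicTotal  + numEntities * beta)
          val pMentionGivenEntity = (Cem((e, m)) + gamma) / (entityTotal + numMentions * gamma)
          pEntityGivenTopic * pMentionGivenEntity
        }

        // roulette-wheel sampling over the candidate weights
        var r = rand.nextDouble() * weights.sum
        var i = 0
        while (i < weights.length - 1 && r >= weights(i)) { r -= weights(i); i += 1 }
        val newE = candidates(i)

        // add the new assignment back to the global counts
        Cte((t, newE)) += 1
        Cem((newE, m)) += 1
        newE
      }
    }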

###June 19, 2013

  • Read LDA-related papers, namely online LDA, supervised LDA, multi-modal LDA, correspondence LDA, and topic-regression LDA.
  • Trying to formulate the learning part as online LDA, i.e., 1) replace Gibbs sampling with variational inference; 2) turn the variational inference into an online algorithm. One potential problem is that the accuracy of variational inference is usually not as good as Gibbs sampling.

###June 29, 2013

  • Collect statistics from the Wikipedia training data using the pignlproc tool.

  • Create the entity topic model, i.e., the EntityTopicModel class, which is based on the SpotlightModel with the MemoryContextStore ignored.

  • Create a class Document for each Wikipedia page, which includes:

      val mentions: Array[Int],           //all mentions, represented by surface form id
      val words: Array[Int],              //all words, represented by token id
      val entityOfMention: Array[Int],    //entity for each mention, represented by dbpedia resource id
      val topicOfMention: Array[Int],     //topic for each mention, represented by topic id
      val entityOfWord: Array[Int],       //entity for each word, represented by dbpedia resource id
      val topicCount: HashMap[Int, Int],  //count for each topic
      val entityForMentionCount: HashMap[Int, Int],  //# of mentions assigned to each entity e
      val entityForWordCount: HashMap[Int, Int]      //# of words assigned to each entity e
    

initializeDocuments parses each Wikipedia page and passes the parsed page to Document.initDocument. Between every two successive link anchors: each token is assigned the target entity (i.e., resource id) of the nearest link anchor, and each mention is assigned an entity sampled from its entity distribution. Besides, a topic is sampled randomly for every mention. The tokens and the mention of a link anchor itself are assigned the link anchor's target entity. Finally, a Document instance is created (a rough Scala sketch of this initialization follows below).
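
A minimal Scala sketch of this initialization logic, assuming anchors and mentions are represented by token offsets; Anchor, Mention, and sampleEntityFor are hypothetical helpers for illustration, not the actual Document.initDocument API.

    import scala.util.Random

    case class Anchor(tokenOffset: Int, targetEntity: Int)
    case class Mention(tokenOffset: Int, surfaceFormId: Int)

    object InitSketch {
      val rand = new Random()
      val numTopics = 100 // assumed number of topics T

      // Assign each token the target entity of the nearest link anchor (assumes at least one anchor).
      def entityOfTokens(numTokens: Int, anchors: Seq[Anchor]): Array[Int] =
        Array.tabulate(numTokens) { pos =>
          anchors.minBy(a => math.abs(a.tokenOffset - pos)).targetEntity
        }

      // Assign each detected mention an entity sampled from its candidate distribution
      // (hypothetical sampler), and a uniformly random topic.
      def initMentions(mentions: Seq[Mention],
                       sampleEntityFor: Int => Int): (Array[Int], Array[Int]) = {
        val entityOfMention = mentions.map(m => sampleEntityFor(m.surfaceFormId)).toArray
        val topicOfMention  = Array.fill(mentions.size)(rand.nextInt(numTopics))
        (entityOfMention, topicOfMention)
      }
    }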

###July 7, 2013

  • Code the Gibbs sampling process. Specifically, 1) add GlobalCounter.scala for Cem[,], Cew[,], Cte[,]; 2) for each document, update all assignments (topic assignments, entity assignments) and the local/global counters.

  • TODO: implement GlobalCounter.scala.

###July 14, 2013

  • Learn the breeze library from scalanlp. Implement GlobalCounter.scala using CSCMatrix from breeze (a rough sketch follows below).

  • TODO: Debug the Gibbs sampling process.
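
A minimal sketch of what a CSCMatrix-backed counter could look like, assuming breeze is on the classpath; illustrative only, not the actual GlobalCounter.scala.

    import breeze.linalg.CSCMatrix

    // Illustrative sparse counter over (row, col) pairs, e.g. Cem[entity, mention].
    class SparseCounter(rows: Int, cols: Int) {
      private val counts = CSCMatrix.zeros[Int](rows, cols)

      def add(r: Int, c: Int, delta: Int = 1): Unit =
        // Note: inserting a previously-zero cell shifts the internal data array,
        // which is the cost discussed in the August 2 entry below.
        counts(r, c) = counts(r, c) + delta

      def apply(r: Int, c: Int): Int = counts(r, c)
    }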

###July 21, 2013

  • Implement the inference algorithm (i.e., the entity-topic-model-based disambiguator); a rough sketch of the candidate scoring follows below.
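
At inference time, the learned global distributions are used to score candidate entities for each mention of a new document. A minimal Scala sketch of such a scoring step is shown below; the three probability maps stand in for the stored global knowledge and carry hypothetical names. The entity with the highest score would then be chosen for the mention.

    object DisambiguationSketch {
      // Score a candidate entity e for a mention with surface form m, given the mention's
      // sampled topic t and the document's context words. Illustrative only: the actual
      // disambiguator runs Gibbs sampling over the new document with the global knowledge fixed.
      def scoreCandidate(e: Int, m: Int, t: Int, contextWords: Seq[Int],
                         pEntityGivenTopic: Map[(Int, Int), Double],
                         pMentionGivenEntity: Map[(Int, Int), Double],
                         pWordGivenEntity: Map[(Int, Int), Double],
                         eps: Double = 1e-10): Double = {
        val topicScore   = math.log(pEntityGivenTopic.getOrElse((t, e), eps))
        val mentionScore = math.log(pMentionGivenEntity.getOrElse((e, m), eps))
        val contextScore = contextWords.map(w => math.log(pWordGivenEntity.getOrElse((e, w), eps))).sum
        topicScore + mentionScore + contextScore
      }
    }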

###July 28, 2013

  • Debug the training/inference code.

  • Modify the document initialization code to support multi-threading.

  • Write evaluation code.

###Midterm Evaluation

Done

  • Finished the entity topic model on a single machine, including global knowledge training and disambiguation of new documents.
  • Currently, the major problem is the efficiency of global knowledge training. Although I implemented a multi-threaded DocumentInitializer.scala, it is restricted by the tokenizer (opennlp_parallel=1). Consequently, it takes about 1-2 minutes to initialize one Wiki article.

TODO

  • Tune DocumentInitializer.scala; train and test on the whole Dutch dataset (one week).
  • If the efficiency is not acceptable, I will adapt the training algorithm to Hadoop (3 weeks). Since Wiki articles are initialized separately, the DocumentInitializer step can be run by mappers independently, yielding the initialized documents and global counters. To update the documents' assignments and the global counters (i.e., the global knowledge), we then run multiple map-reduce jobs: in each job, a set of mappers applies Gibbs sampling to update the document assignments and global counters, and one reducer averages the global counters. The global counters of the final job are saved as the global knowledge (a sketch of the averaging step follows this list).
  • If the efficiency is acceptable, I will test on the English Wiki dataset (1-2 weeks).
  • Clean the code; documentation (1 week).
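
For the averaging step of that plan, a minimal Scala sketch of how a reducer might merge the partial global counters produced by the mappers; representing a counter as a map from (row, col) keys to counts is an assumption for illustration.

    object CounterAveraging {
      // Average several partial global counters (e.g. the Cte counts produced by the mappers).
      // Counters are represented here as maps from (row, col) keys to counts.
      def averageCounters(counters: Seq[Map[(Int, Int), Int]]): Map[(Int, Int), Double] = {
        val n = counters.size.toDouble
        counters
          .flatMap(_.toSeq)  // flatten all ((row, col), count) pairs from every mapper
          .groupBy(_._1)     // group by (row, col)
          .map { case (key, pairs) => key -> pairs.map(_._2).sum / n }
      }
    }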

###August 2, 2013

  • Tune GlobalCounter.scala. Change the implementation from CSCMatrix to HashMap: CSCMatrix shifts its data array for every insertion of a new non-zero entry, which is expensive (a sketch of the HashMap-backed counter follows this list).
  • TODO: the efficiency has improved a lot. After fixing some bugs (tokenizer timeout, document serialization), I will test and tune the Gibbs sampling parameters (e.g., sampling iterations, burn-in steps) in the next step.
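
A minimal sketch of the HashMap-backed variant, analogous to the CSCMatrix sketch above; illustrative only, not the actual GlobalCounter.scala.

    import scala.collection.mutable

    // Illustrative HashMap-backed sparse counter: inserting a new key is amortized O(1),
    // avoiding the array shifting of CSCMatrix.
    class HashCounter {
      private val counts = mutable.HashMap[(Int, Int), Int]().withDefaultValue(0)

      def add(r: Int, c: Int, delta: Int = 1): Unit = counts((r, c)) += delta
      def apply(r: Int, c: Int): Int = counts((r, c))
    }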

###August 13, 2013

  • Replace TextTokenizer with LanguageIndependentTokenizer. The initialization process can now be finished within 4 hours for 1.8 million Wikipedia articles.
  • Implement the document serialization function (I gave up on kryo because its bugs were not easy to fix); a serialization sketch follows this list.
  • TODO: Run the learning program on the 1.8 million article dataset.
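
A minimal sketch of plain Java serialization for the initialized documents, assuming the Document class is Serializable; the object name and file path handling are hypothetical, standing in for the actual serialization function.

    import java.io._

    object DocumentSerialization {
      // Write the initialized documents to disk with standard Java serialization
      // (used here instead of kryo, as in the entry above).
      def save[T <: Serializable](docs: Seq[T], path: String): Unit = {
        val out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(path)))
        try out.writeObject(docs.toList) finally out.close()
      }

      def load[T <: Serializable](path: String): List[T] = {
        val in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(path)))
        try in.readObject().asInstanceOf[List[T]] finally in.close()
      }
    }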

###August 18, 2013

  • Test the EntityTopicModel on the Dutch Wikipedia dataset.

      Corpus: AnnotatedTextSource
      Number of occs: 1605 (original), 1605 (processed)
      Disambiguator: Entity Topic Model
      Correct URI not found = 543 / 1605 = 0.338
      Accuracy = 1062 / 1605 = 0.662
      Global MRR: 0.42998664641809825
      Elapsed time: 16 sec
    
  • TODO: tune parameters (e.g., Gibbs steps); learn the model on the English Wikipedia dataset.

###September 8, 2013

  • Reformat function interfaces and add comments.

  • Improve the performance of the code by exploiting the DBpedia extraction framework and spotters to detect all resource occurrences of a wiki page during training.

  • Re-generate the training and held-out datasets for the Dutch Wikipedia.

  • Training on the Dutch Wikipedia dataset (not finished due to a spotter bug).
