
# GSoC2013 Progress (Wei Wang)


##About

Name: Wei Wang

Mentor: Dirk

GitHub Link: github

Proposal Link: gsoc

###Project Short Description

In this project, I am going to implement the entity-topic model [1] for entity linking. This model combines the context consistency of entity mentions with the topic coherence of the document. Specifically, it models the generation of a document by sampling its words and mentions from a probabilistic model. For example, to generate an entity mention of a document, we first sample a topic for this mention from the document's topic distribution. Then, we sample an entity from that topic's entity distribution. Finally, we sample a mention from the entity's mention distribution. Both learning and inference of this model are conducted through Gibbs sampling.

[1]Xianpei Han, Le Sun: An Entity-Topic Model for Entity Linking. EMNLP-CoNLL 2012: 105-115
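
To illustrate the generative story above, here is a minimal, self-contained Scala sketch; the distribution arrays and the sampling helper are hypothetical placeholders for illustration, not part of the actual implementation.

    import scala.util.Random

    object GenerativeSketch {
      val rand = new Random(42)

      // Sample an index from an unnormalized discrete distribution.
      def sample(weights: Array[Double]): Int = {
        var r = rand.nextDouble() * weights.sum
        var i = 0
        while (i < weights.length - 1 && r >= weights(i)) { r -= weights(i); i += 1 }
        i
      }

      // Hypothetical distributions learned by the model (placeholders only):
      // topicDistOfDoc(t): P(topic t | document), entityDistOfTopic(t)(e): P(entity e | topic t),
      // mentionDistOfEntity(e)(m): P(mention m | entity e).
      def generateMention(topicDistOfDoc: Array[Double],
                          entityDistOfTopic: Array[Array[Double]],
                          mentionDistOfEntity: Array[Array[Double]]): (Int, Int, Int) = {
        val t = sample(topicDistOfDoc)          // 1) topic from the document's topic distribution
        val e = sample(entityDistOfTopic(t))    // 2) entity from the topic's entity distribution
        val m = sample(mentionDistOfEntity(e))  // 3) mention (surface form) from the entity's mention distribution
        (t, e, m)
      }
    }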

##Progress

###June 2, 2013

  • Set up IntelliJ.
  • Build dbpedia-spotlight from source inside IntelliJ.
  • Create the dbpedia-spotlight model (DB-backed core).
  • Read code related to DBTwoStepDisambiguation.

###June 7, 2013

  • Contact the author of [1]; get details on the implementation of the model.

  • Figure out the necessary data structures (a rough Scala sketch follows the block below):

      let d be a document, d_m a mention in document d, and d_w a word in document d

      * assignments for document d:
      ta[d,d_m]  //topic assignment for mention d_m
      ea[d,d_m]  //entity assignment for mention d_m; in training, initialized according to
                 //the entity-mention distribution derived from Wikipedia's link anchors
      aa[d,d_w]  //entity assignment for word d_w; in training, initialized to the nearest entity

      * counts for document d:
      Ct[d,t]    //count of topic t in document d
      Ce[d,e]    //count of mentions in document d assigned to entity e
      Ca[d,e]    //count of words in document d assigned to entity e

      * global counts:
      Cte[t,e]   //topic-entity count matrix
      Cem[e,m]   //entity-mention (name) count matrix
      Cew[e,w]   //entity-context-word count matrix
    
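For reference, a minimal sketch of how these assignments and counts might be held in Scala, assuming integer ids for topics, entities, mentions, and words; the class and field names are illustrative only and do not correspond to classes in the code base.

    import scala.collection.mutable

    // Per-document assignments and counts (illustrative names only).
    class DocumentCounts(numMentions: Int, numWords: Int) {
      val ta = new Array[Int](numMentions)  // topic assignment for each mention
      val ea = new Array[Int](numMentions)  // entity assignment for each mention
      val aa = new Array[Int](numWords)     // entity assignment for each word
      val Ct = mutable.HashMap[Int, Int]().withDefaultValue(0)  // topic counts
      val Ce = mutable.HashMap[Int, Int]().withDefaultValue(0)  // entity counts (from mentions)
      val Ca = mutable.HashMap[Int, Int]().withDefaultValue(0)  // entity counts (from words)
    }

    // Global counts shared across all documents (illustrative names only).
    class GlobalCounts {
      val Cte = mutable.HashMap[(Int, Int), Int]().withDefaultValue(0)  // (topic, entity)   -> count
      val Cem = mutable.HashMap[(Int, Int), Int]().withDefaultValue(0)  // (entity, mention) -> count
      val Cew = mutable.HashMap[(Int, Int), Int]().withDefaultValue(0)  // (entity, word)    -> count
    }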

###June 14, 2013

  • Prepare the training and testing datasets. I used the Dutch Wikipedia: 20% of the articles are randomly selected as the testing dataset; the rest is the training dataset.

  • Figure out the workflow of the training algorithm, as shown in the following pseudocode (a Scala sketch of the Gibbs update in Step 3 follows the pseudocode):

      Step 1:
          use pignlproc to collect surface forms from training dataset to create spotter model; 
    
      Step 2:
          //initialize assignments
          let D = training dataset; spotter = the spotter model from Step 1
          foreach article d in D:
              M = spotter.detectMentions(d)
              foreach mention m in M:
                  ta[d,m] = t = sampleRandom(T)  //randomly select a topic id from [0,T)
                  Ct[d,t]++
                  ea[d,m] = e = getEntityFor(m)  //sample an entity from m's entity distribution; if m is a link, return the target entity
                  Ce[d,e]++

                  Cte[t,e]++
                  Cem[e,m]++
              foreach word w in d:
                  aa[d,w] = e = getNearestEntity(w)
                  Ca[d,e]++
                  Cew[e,w]++
            
      Step 3:
          //gibbs sampling
          for step=0;step<STEP;step++
              foreach article d in D:
                  M = spotter.detectMentions(d)
                  foreach mention m in M:
                      ta[d,m] = t = GibbsSampleMentionTopic()   //sample the topic for each mention according to [1]
                      update Ct[d,t]

                      ea[d,m] = e = GibbsSampleMentionEntity()  //sample the entity for each mention according to [1]
                      update Ce[d,e]

                      update Cte[t,e] and Cem[e,m]
                  foreach word w in d:
                      aa[d,w] = e = GibbsSampleWordEntity()     //sample the entity for each word according to [1]
                      update Ca[d,e], Cew[e,w]
    
      Step 4:
          //store global knowledge(distributions)
          calculate and store entity distribution for each topic from Cte[,]
          calculate and store mention distribution for each entity from Cem[,]
          calculate and store context word distribution for each entity from Cew[,]
    
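To make Step 3 of the pseudocode more concrete, below is a minimal Scala sketch of resampling a mention's entity assignment with collapsed Gibbs sampling. The smoothing constants and the count-ratio form are assumptions for illustration (the exact conditional distributions are given in [1]), and the per-document counts are omitted for brevity.

    import scala.collection.mutable
    import scala.util.Random

    object GibbsSketch {
      val rand  = new Random()
      val beta  = 0.01  // assumed smoothing constant for the topic-entity distribution
      val gamma = 0.01  // assumed smoothing constant for the entity-mention distribution

      // Resample the entity assignment of mention m (currently oldE) given its topic t.
      // Count maps are assumed to be created with withDefaultValue(0). The count-ratio form
      // below is only indicative of a collapsed Gibbs update; the exact conditional is in [1].
      def sampleMentionEntity(m: Int, t: Int, oldE: Int,
                              candidates: Array[Int],
                              Cte: mutable.Map[(Int, Int), Int],
                              Cem: mutable.Map[(Int, Int), Int],
                              numEntities: Int, numMentions: Int): Int = {
        // remove the current assignment from the global counts
        Cte((t, oldE)) -= 1
        Cem((oldE, m)) -= 1

        val topicTotal = Cte.collect { case ((tt, _), c) if tt == t => c }.sum
        val weights = candidates.map { e =>
          val entityTotal = Cem.collect { case ((ee, _), c) if ee == e => c }.sum
          val pEntityGivenTopic   = (Cte((t, e)) + beta)  / (topicTotal  + numEntities * beta)
          val pMentionGivenEntity = (Cem((e, m)) + gamma) / (entityTotal + numMentions * gamma)
          pEntityGivenTopic * pMentionGivenEntity
        }

        // roulette-wheel sampling over the candidate weights
        var r = rand.nextDouble() * weights.sum
        var i = 0
        while (i < weights.length - 1 && r >= weights(i)) { r -= weights(i); i += 1 }
        val newE = candidates(i)

        // add the new assignment back to the global counts
        Cte((t, newE)) += 1
        Cem((newE, m)) += 1
        newE
      }
    }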

###June 19, 2013

  • Read LDA-related papers, namely online LDA, supervised LDA, multi-modal LDA, correspondence LDA, and topic-regression LDA.
  • Trying to formulate the learning part as online LDA, i.e., 1) replace Gibbs sampling with variational inference; 2) turn the variational inference into an online algorithm. One potential problem is that the accuracy of variational inference is usually not as good as Gibbs sampling.

###June 29, 2013

  • Collect statistics from the Wikipedia training data using the pignlproc tool.

  • Create the entity topic model, i.e., the EntityTopicModel class, which is based on the SpotlightModel with the MemoryContextStore ignored.

  • Create a class Document for each Wikipedia page, which includes:

      val mentions: Array[Int],           //all mentions, represented by surface form id
      val words: Array[Int],              //all words, represented by token id
      val entityOfMention: Array[Int],    //entity for each mention, represented by dbpedia resource id
      val topicOfMention: Array[Int],     //topic for each mention, represented by topic id
      val entityOfWord: Array[Int],       //entity for each word, represented by dbpedia resource id
      val topicCount: HashMap[Int, Int],  //count for each topic
      val entityForMentionCount: HashMap[Int, Int],  //# of mentions assigned to each entity e
      val entityForWordCount: HashMap[Int, Int]      //# of words assigned to each entity e
    

initializeDocuments parses each Wikipedia page and passes the parsed page to Document.initDocument. Between every two successive link anchors: each token is assigned the target entity (i.e., resource id) of the nearest link anchor, and each mention is assigned an entity sampled from its entity distribution. Besides, a topic is sampled randomly for every mention. The tokens and the mention of a link anchor itself are assigned the link anchor's target entity. Finally, a Document instance is created (a rough Scala sketch of this initialization follows below).
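
A minimal Scala sketch of this initialization logic, assuming anchors and mentions are represented by token offsets; Anchor, Mention, and sampleEntityFor are hypothetical helpers for illustration, not the actual Document.initDocument API.

    import scala.util.Random

    case class Anchor(tokenOffset: Int, targetEntity: Int)
    case class Mention(tokenOffset: Int, surfaceFormId: Int)

    object InitSketch {
      val rand = new Random()
      val numTopics = 100 // assumed number of topics T

      // Assign each token the target entity of the nearest link anchor (assumes at least one anchor).
      def entityOfTokens(numTokens: Int, anchors: Seq[Anchor]): Array[Int] =
        Array.tabulate(numTokens) { pos =>
          anchors.minBy(a => math.abs(a.tokenOffset - pos)).targetEntity
        }

      // Assign each detected mention an entity sampled from its candidate distribution
      // (hypothetical sampler), and a uniformly random topic.
      def initMentions(mentions: Seq[Mention],
                       sampleEntityFor: Int => Int): (Array[Int], Array[Int]) = {
        val entityOfMention = mentions.map(m => sampleEntityFor(m.surfaceFormId)).toArray
        val topicOfMention  = Array.fill(mentions.size)(rand.nextInt(numTopics))
        (entityOfMention, topicOfMention)
      }
    }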

###July 7, 2013

  • Code the Gibbs sampling process. Specifically, 1) add GlobalCounter.scala for Cem[,], Cew[,], Cte[,]; 2) for each document, update all assignments (topic assignments, entity assignments) and the local/global counters.

  • TODO: implement GlobalCounter.scala.

###July 14, 2013

  • Learn the breeze library from scalanlp. Implement GlobalCounter.scala using CSCMatrix from breeze (a rough sketch follows below).

  • TODO: Debug the Gibbs sampling process.
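
A minimal sketch of what a CSCMatrix-backed counter could look like, assuming breeze is on the classpath; illustrative only, not the actual GlobalCounter.scala.

    import breeze.linalg.CSCMatrix

    // Illustrative sparse counter over (row, col) pairs, e.g. Cem[entity, mention].
    class SparseCounter(rows: Int, cols: Int) {
      private val counts = CSCMatrix.zeros[Int](rows, cols)

      def add(r: Int, c: Int, delta: Int = 1): Unit =
        // Note: inserting a previously-zero cell shifts the internal data array,
        // which is the cost discussed in the August 2 entry below.
        counts(r, c) = counts(r, c) + delta

      def apply(r: Int, c: Int): Int = counts(r, c)
    }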

###July 21, 2013

  • Implement the inference algorithm (i.e., the entity-topic-model-based disambiguator); a rough sketch of the candidate scoring follows below.
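
At inference time, the learned global distributions are used to score candidate entities for each mention of a new document. A minimal Scala sketch of such a scoring step is shown below; the three probability maps stand in for the stored global knowledge and carry hypothetical names. The entity with the highest score would then be chosen for the mention.

    object DisambiguationSketch {
      // Score a candidate entity e for a mention with surface form m, given the mention's
      // sampled topic t and the document's context words. Illustrative only: the actual
      // disambiguator runs Gibbs sampling over the new document with the global knowledge fixed.
      def scoreCandidate(e: Int, m: Int, t: Int, contextWords: Seq[Int],
                         pEntityGivenTopic: Map[(Int, Int), Double],
                         pMentionGivenEntity: Map[(Int, Int), Double],
                         pWordGivenEntity: Map[(Int, Int), Double],
                         eps: Double = 1e-10): Double = {
        val topicScore   = math.log(pEntityGivenTopic.getOrElse((t, e), eps))
        val mentionScore = math.log(pMentionGivenEntity.getOrElse((e, m), eps))
        val contextScore = contextWords.map(w => math.log(pWordGivenEntity.getOrElse((e, w), eps))).sum
        topicScore + mentionScore + contextScore
      }
    }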

###July 28, 2013

  • Debug the training/inference code.

  • Modify the document initialization code to support multi-threading.

  • Write evaluation code.

###Midterm Evaluation

Done

  • Finished the entity topic model on a single machine, including global knowledge training and disambiguation of new documents.
  • Currently, the major problem is the efficiency of global knowledge training. Although I implemented a multi-threaded DocumentInitializer.scala, it is restricted by the tokenizer (opennlp_parallel=1). Consequently, it takes about 1-2 minutes to initialize one Wiki article.

TODO

  • Tune DocumentInitializer.scala; train and test on the whole Dutch dataset (one week).
  • If the efficiency is not acceptable, I will adapt the training algorithm to Hadoop (3 weeks). Since Wiki articles are initialized separately, the DocumentInitializer step can be run by mappers independently, yielding the initialized documents and global counters. To update the documents' assignments and the global counters (i.e., the global knowledge), we then run multiple map-reduce jobs: in each job, a set of mappers applies Gibbs sampling to update the document assignments and global counters, and one reducer averages the global counters. The global counters of the final job are saved as the global knowledge (a sketch of the averaging step follows this list).
  • If the efficiency is acceptable, I will test on the English Wiki dataset (1-2 weeks).
  • Clean the code; documentation (1 week).
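
For the averaging step of that plan, a minimal Scala sketch of how a reducer might merge the partial global counters produced by the mappers; representing a counter as a map from (row, col) keys to counts is an assumption for illustration.

    object CounterAveraging {
      // Average several partial global counters (e.g. the Cte counts produced by the mappers).
      // Counters are represented here as maps from (row, col) keys to counts.
      def averageCounters(counters: Seq[Map[(Int, Int), Int]]): Map[(Int, Int), Double] = {
        val n = counters.size.toDouble
        counters
          .flatMap(_.toSeq)  // flatten all ((row, col), count) pairs from every mapper
          .groupBy(_._1)     // group by (row, col)
          .map { case (key, pairs) => key -> pairs.map(_._2).sum / n }
      }
    }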

###August 2, 2013

  • Tune GlobalCounter.scala. Change the implementation from CSCMatrix to HashMap: CSCMatrix shifts its data array for every insertion of a new non-zero entry, which is expensive (a sketch of the HashMap-backed counter follows this list).
  • TODO: the efficiency has improved a lot. After fixing some bugs (tokenizer timeout, document serialization), I will test and tune the Gibbs sampling parameters (e.g., sampling iterations, burn-in steps) in the next step.
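
A minimal sketch of the HashMap-backed variant, analogous to the CSCMatrix sketch above; illustrative only, not the actual GlobalCounter.scala.

    import scala.collection.mutable

    // Illustrative HashMap-backed sparse counter: inserting a new key is amortized O(1),
    // avoiding the array shifting of CSCMatrix.
    class HashCounter {
      private val counts = mutable.HashMap[(Int, Int), Int]().withDefaultValue(0)

      def add(r: Int, c: Int, delta: Int = 1): Unit = counts((r, c)) += delta
      def apply(r: Int, c: Int): Int = counts((r, c))
    }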

###August 13, 2013

  • Replace TextTokenizer with LanguageIndependentTokenizer. The initialization process can now be finished within 4 hours for 1.8 million Wikipedia articles.
  • Implement the document serialization function (I gave up on kryo because its bugs were not easy to fix); a serialization sketch follows this list.
  • TODO: Run the learning program on the 1.8 million article dataset.
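
A minimal sketch of plain Java serialization for the initialized documents, assuming the Document class is Serializable; the object name and file path handling are hypothetical, standing in for the actual serialization function.

    import java.io._

    object DocumentSerialization {
      // Write the initialized documents to disk with standard Java serialization
      // (used here instead of kryo, as in the entry above).
      def save[T <: Serializable](docs: Seq[T], path: String): Unit = {
        val out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(path)))
        try out.writeObject(docs.toList) finally out.close()
      }

      def load[T <: Serializable](path: String): List[T] = {
        val in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(path)))
        try in.readObject().asInstanceOf[List[T]] finally in.close()
      }
    }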

###August 18, 2013

  • Test the EntityTopicModel on the Dutch Wikipedia dataset.

      Corpus: AnnotatedTextSource
      Number of occs: 1605 (original), 1605 (processed)
      Disambiguator: Entity Topic Model
      Correct URI not found = 543 / 1605 = 0.338
      Accuracy = 1062 / 1605 = 0.662
      Global MRR: 0.42998664641809825
      Elapsed time: 16 sec
    
  • TODO: tune parameters (e.g., Gibbs steps); learn the model on the English Wikipedia dataset.

###September 8, 2013

  • Reformat function interfaces and add comments.

  • Improve the performance of the code by exploiting the DBpedia extraction framework and spotters to detect all resource occurrences of a wiki page during training.

  • Re-generate the training and held-out datasets for the Dutch Wikipedia.

  • Training on the Dutch Wikipedia dataset (not finished due to a spotter bug).
