This repository has been archived by the owner on Oct 20, 2018. It is now read-only.

GSoC2012 Project Ideas

sandroacoelho edited this page Jul 26, 2013 · 1 revision

DBpedia Spotlight Project Ideas - for Google Summer of Code 2012

DBpedia Spotlight took part in GSoC 2012 with 4 students, who did a great job implementing some of the ideas discussed on this page. Take a look at their projects or read the descriptions on Google Melange.

DBpedia Spotlight will apply to GSoC 2013 in conjunction with the DBpedia project.

TODO: fix links and delete finished tasks


Introduction

Linked Data has been revolutionizing the way applications interact with the Web. While Web 2.0 technologies opened up much of the "guts" of websites for third parties to reuse and repurpose data on the Web, they still require that developers create one client per target API. With Linked Data technologies, all APIs are interconnected via standard Web protocols and languages.

DBpedia is a project that exposes knowledge from Wikipedia as Linked Data. One can navigate this Web of facts with standard Web browsers or automated crawlers, or select subsets with SQL-like query languages (e.g. SPARQL). DBpedia exists in 97 different languages, and is interlinked with many other databases such as Freebase, the New York Times, the CIA Factbook, etc.

This new Web of interlinked databases provides useful knowledge that can complement the textual Web in many ways. See, for example, how bloggers tag their posts or assign them to categories in order to organize and interconnect their blog posts. Or see how BBC has created the World Cup 2010 website by interconnecting textual content and facts from their knowledge base. By the way, they use DBpedia.

DBpedia Spotlight is an open source (Apache license) text annotation tool that connects text to Linked Data by marking names of things in text (we call that Spotting) and selecting between multiple interpretations of these names (we call that Disambiguation). For example, "Washington" can be interpreted in more than 50 ways, including a [state](http://en.wikipedia.org/wiki/Washington_(state\)), a government or a person. You can already imagine that this is not a trivial task, especially when we're talking about 3.64 million "things" of 320 different "types" with over half a billion "facts" (as of July 2011).

But we think we're doing quite well. And we could use your help to do even better!

Project Ideas

Here is an initial set of topics:

  • Feedback incorporation: there are software clients for DBpedia Spotlight that allow users to annotate their blogs in Wordpress, or even to suggest new links in Wikipedia. When a user chooses a suggestion, or fixes an incorrect annotation produced by DBpedia Spotlight, this is valuable feedback that the system should incorporate in order to learn from its mistakes. You could work on clients that let users fix wrong annotations, and/or on the backend that stores this feedback and uses it to retrain the annotator.
  • Spotting: There are many strategies one could use to identify "names of things" in text. An easy one is using a dictionary of names and well-known string matching algorithms. Other approaches include detecting "important phrases" based on their occurrence statistics (keyphrase extraction), learning patterns to detect specific types of entities (Named Entity Recognition), etc. We have implemented a few of these: Named Entity Recognition, Keyphrase Extraction, N-Grams, Lexicon-based, etc. You could extend the comparison of approaches, and create a solution that runs several of them and chooses the best assignment in the end, taking overlaps into account. Take a look at the Sequence Labeler.
  • Disambiguation: our disambiguation algorithm chooses the most likely "interpretation" for a name based on the words that occur around that name in a paragraph. We weigh these words according to how frequently they occurred with each of the possible interpretations, using a scoring function we call Inverse Candidate Frequency (ICF). Our current implementation is based on Lucene and performs many unnecessary reads. You could work on a faster implementation of ICF using Lucene, or you could propose other solutions based on key-value stores, relational databases, etc. Consider an HSQL solution, Project Voldemort, among others.
  • Text formats: text may come in a variety of formats, e.g. plain text, HTML, PDF, etc. To make DBpedia Spotlight a general tool for text annotation, it would be great to handle various input/output formats. You would start with HTML, where it is important to find the main text passage in a page. This can be done using existing frameworks or self-developed methods. Apache Tika is a starting point. See also BoilerPipe for HTML and LA-PDFText for PDF.
  • Internationalization: do you speak another language besides English? Would you like to help us to enable DBpedia Spotlight in other languages? We've started in that direction, but there are many ways in which you could still contribute.
  • More knowledge from Freebase: DBpedia Spotlight currently allows users to use the Freebase types to select which kinds of output they would like to display or ignore. However, Freebase has much more knowledge than just their types, and all of that information can be integrated in the disambiguation process of DBpedia Spotlight. This includes categories such as "arts & entertainment", "science and technology", etc. as well as relationships between entities.
  • Better support for short messages: DBpedia Spotlight has been used to annotate tweets by a few applications. Tweets are a tough case because the messages are short and the language is unconventional. We have provided initial integration with Twarql to enable streaming annotation of tweets. Exploring ways to enhance the speed and/or quality of annotation in this setting would make a really interesting project.
  • Integration with other software stacks: many applications are based on larger software stacks such as Apache UIMA. DBpedia Spotlight can also be integrated with these stacks to support those applications.
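To make the lexicon-based spotting idea above concrete, here is a minimal sketch of a dictionary spotter: it slides an n-gram window over the token stream and looks each candidate phrase up in a surface-form dictionary, preferring the longest match. The toy `LEXICON` stands in for the real surface-form store extracted from Wikipedia anchor texts; the tokenization and matching rules are simplifications.

```python
# Minimal lexicon-based spotter (illustrative; the real system uses
# large surface-form dictionaries and proper string matching).

LEXICON = {"washington", "new york", "barack obama"}  # toy dictionary
MAX_NGRAM = 3  # longest surface form we try to match

def spot(text, lexicon=LEXICON, max_n=MAX_NGRAM):
    tokens = text.split()
    spots = []
    i = 0
    while i < len(tokens):
        # prefer the longest match starting at position i
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower().strip(".,") in lexicon:
                spots.append(phrase)
                i += n
                break
        else:
            i += 1  # no surface form starts here
    return spots
```

A real spotter must also handle overlapping matches and case variants, which is exactly where the comparison of strategies becomes interesting.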
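The ICF weighting described in the disambiguation idea can be sketched as an IDF-style score over candidate senses: a context word seen with few of the candidate interpretations discriminates better and gets a higher weight. The context sets and the exact formula below are illustrative assumptions, not the production implementation.

```python
import math

# Toy context models: words observed around each candidate sense
# of the surface form "Washington" (illustrative data).
SENSES = {
    "Washington_(state)": {"seattle", "pacific", "state", "coast"},
    "George_Washington":  {"president", "general", "army", "state"},
    "Washington,_D.C.":   {"capital", "president", "congress"},
}

def icf(word, senses):
    """Inverse Candidate Frequency: log of (number of candidate senses /
    number of senses whose context contains the word)."""
    n = len(senses)
    n_w = sum(1 for ctx in senses.values() if word in ctx)
    return math.log(n / n_w) if n_w else 0.0

def disambiguate(context_words, senses=SENSES):
    # score each sense by the ICF-weighted overlap with the context
    scores = {
        sense: sum(icf(w, senses) for w in context_words if w in ctx)
        for sense, ctx in senses.items()
    }
    return max(scores, key=scores.get)
```

For example, the context word "state" occurs with two of the three senses and so carries less weight than "seattle", which occurs with only one.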
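For the text-formats idea, a bare-bones main-text extractor in the spirit of BoilerPipe can be sketched with Python's standard-library HTML parser: keep text inside paragraph tags and ignore navigation and scripts. Real boilerplate detection uses text-density heuristics; this sketch only shows where such logic would plug in.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collects text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # scripts, menus, etc. fall outside <p> and are dropped
        if self.in_paragraph and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Apache Tika or BoilerPipe would replace this heuristic with proper content detection; the point is that annotation should always run on the extracted main text, not the raw markup.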

We have also collected over the months a few open tasks in our feature requests and bug tracker. Please feel free to take a look.

Here is a set of topics that were at least partly implemented since GSoC2012:

  • DBpedia Spotlight Live: An ever-learning text annotation tool. Knowledge is constantly changing. When a new president gets elected, a celebrity gets married, or a world cup is won, Wikipedians rush to edit the world's encyclopedia to reflect this new knowledge. DBpedia Live runs the DBpedia extraction process on the Wikipedia stream of modifications and keeps DBpedia in sync with its knowledge source. You could work on feeding this stream of Wikipedia modifications into the DBpedia Spotlight extraction/indexing process as well. One candidate framework for implementing this is Storm. See also WikiStream for a way to obtain the stream of updates coming from Wikipedia. As possible storage, check out the recently released SenseiDB.

  • Integration with Apache Stanbol: many applications are based on larger software stacks such as Apache Stanbol. A first step would be to integrate DBpedia Spotlight as Enhancement Engine within Apache Stanbol. This should be relatively easy, especially when implemented on the Web Service (RESTful) layer. This is a good starter task.

  • Writing a NIF Wrapper: the recently published NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. The format can connect documents on the WWW and data in the GGG with the input and output of NLP tools. In this task, you would create a NIF wrapper that (de-)serializes the input and output of Spotlight to and from NIF. In combination with an annotation server, this would allow third parties to annotate web sites with DBpedia Spotlight and also provide a way to represent annotations on the Web, following the idea of AnnotateIt.

  • Hadoop-based Indexing: when DBpedia Spotlight processes the English Wikipedia, it generates approximately 60GB of data. When processing more languages, it becomes more attractive to perform much of our statistics-gathering in Hadoop. You can build on pignlproc, Apache Mahout and Katta. This is a good starter task (2 weeks). Much of it has already been done; it is a matter of putting the pieces together.

  • Topical classification: in DBpedia, things are associated with categories that can represent topics. Example topics are Music, Politics, Sports, etc. DBpedia Spotlight could use this information to create topic detectors and use their output in the annotation process. Mahout offers LDA and NaiveBayes implementations and has scripts to process Wikipedia data. This can be a starting point.
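The NIF wrapper idea amounts to emitting the spotted phrase and its disambiguated resource as RDF with string-offset URIs. Here is a sketch of serializing a single Spotlight annotation to Turtle. The property names follow the NIF core ontology and the ITS RDF vocabulary; treat the exact vocabulary, prefixes and URI scheme as assumptions to be checked against the current NIF specification.

```python
# Sketch of a NIF serializer for one annotation (illustrative only).
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"

def to_nif(doc_uri, text, surface, begin, resource):
    end = begin + len(surface)
    context = f"<{doc_uri}#char=0,{len(text)}>"
    phrase = f"<{doc_uri}#char={begin},{end}>"
    return "\n".join([
        f"@prefix nif: <{NIF}> .",
        "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .",
        "",
        f"{context} a nif:Context ;",
        f'    nif:isString """{text}""" .',
        "",
        f"{phrase} a nif:Phrase ;",
        f'    nif:anchorOf "{surface}" ;',
        f"    nif:beginIndex {begin} ;",
        f"    nif:endIndex {end} ;",
        f"    nif:referenceContext {context} ;",
        f"    itsrdf:taIdentRef <{resource}> .",
    ])
```

The deserialization direction would parse such a graph back into Spotlight's annotation objects, which is the other half of the wrapper task.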
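To illustrate the topical classification idea, here is a tiny Naive Bayes topic detector over bags of words, the kind of model that Mahout's NaiveBayes implementation would produce at Wikipedia scale. The training data is an illustrative stand-in for category-labelled Wikipedia text.

```python
import math
from collections import Counter

# Toy category-labelled documents (stand-in for Wikipedia data).
TRAIN = {
    "Music":    ["guitar album band concert", "singer album chart"],
    "Politics": ["election president senate", "president policy vote"],
}

def train(data=TRAIN):
    """Build per-topic word counts."""
    models = {}
    for topic, docs in data.items():
        counts = Counter(w for d in docs for w in d.split())
        models[topic] = (counts, sum(counts.values()))
    return models

def classify(text, models):
    """Pick the topic with the highest smoothed log-likelihood."""
    vocab = {w for counts, _ in models.values() for w in counts}
    best, best_lp = None, float("-inf")
    for topic, (counts, total) in models.items():
        # add-one (Laplace) smoothing for unseen words
        lp = sum(math.log((counts[w] + 1) / (total + len(vocab)))
                 for w in text.split())
        if lp > best_lp:
            best, best_lp = topic, lp
    return best
```

A detector like this, trained on DBpedia category data, could feed topic priors into the disambiguation step.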

Some of these ideas are not enough for a project by themselves, so think about the size of your project and feel free to bundle related ideas.

Pre-requisites

Soft skills:

  • We would like to work with people who are energetic programmers, passionate about open source, and genuinely interested in the topics around DBpedia Spotlight. You don't need to worry about convincing us of this. We can tell from how much preparation went into your proposal.
  • Although the mentors are here to help you, we expect you to be able to search for and find answers to most questions yourself. Search engines like easy questions; mentors like the tough ones. When you ask a question, show that you've looked for the answer before asking.

Programming languages we love:

  • Java: we love cross-platform code and object oriented programming.
  • Scala: adds functional programming to the Java world, and in our opinion allows one to write more concise code, and write it in less time.
  • R: very convenient for analyzing your data and looking into anything that involves statistics.
  • Python: we commonly write scripts in python for quick, small tasks.
  • Linux/Bash: a lot of common tasks can be done with cat/sort/uniq/grep/sed/cut. We use them every day.

You don't need to know all of them. Solid knowledge in Java/Scala is enough for most of what we do. Our build process is based on Maven2.

Mentors

We have a super international team of mentors eager to work with you to build a better Web!

  • Pablo N. Mendes is Brazilian, now living in Berlin. He is a Research Associate at the Free University of Berlin, researching Linked Data and Information Extraction. He co-created and maintains the projects DBpedia Spotlight, DBpedia Portuguese, Twarql, and Cuebee.
  • Max Jakob lives in Berlin. He works at Neofonie GmbH and is co-creator of DBpedia Spotlight. He also formerly maintained the DBpedia project.
  • Jimmy O'Regan is from Ireland. He is a member of the DBpedia Spotlight internationalization community and has participated in 3 past editions of GSoC with Apertium.
  • Mihály Héder lives in Budapest. He is a Research Assistant at MTA SZTAKI and is the lead developer of Sztakipedia that suggests enhancements to Wikipedia editors using, among others, the DBpedia Spotlight API. He has been mentoring 5 BSc and MSc students at TU Budapest.
  • Iavor Jelev is Bulgarian, now living in Leipzig. He is partner at BabelMonkeys.com, and led the development of Robo Tagger, a similar tool to DBpedia Spotlight that focuses on German. He is currently collaborating on introducing topic classification into DBpedia Spotlight.
  • Rupert Westenthaler lives in Salzburg (Austria). He is employed at Salzburg Research and is a committer on Apache Stanbol (incubating). He will serve as a point of contact for integration tasks between DBpedia Spotlight and Apache Stanbol.
