
Named Entity Extraction II

Matteo Romanello edited this page Apr 20, 2017 · 9 revisions

Date: Thursday, April 20, 2017, 17h00-18h15 (CEST)

Session coordinators: Matteo Romanello and Francesco Mambrini (Deutsches Archäologisches Institut, Berlin)

YouTube link: https://www.youtube.com/watch?v=mD5icsPJIG4

Slides: SunoikisisDC, session 13

Jupyter Notebooks:


Summary

After reviewing the main technical concepts introduced in the previous session, this session addresses more advanced topics in both Python programming and Named Entity Recognition (NER). We will introduce object-oriented programming in Python and write some more advanced functions to derive basic statistics about the extracted named entities. We will then move on to extracting dates from journal articles using regular expressions. Finally, we will cover how to compare and evaluate different solutions (i.e. algorithms) for the same NLP task.
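As a rough illustration of how a small class can wrap such statistics, here is a minimal sketch. The class name `EntityStats`, the `(text, label)` tuple format and the sample data are assumptions made for this example; they are not taken from the session notebooks.

```python
from collections import Counter


class EntityStats:
    """Frequency statistics over (entity_text, entity_label) pairs.

    Hypothetical helper: the input format is an assumption, not the
    output format of any particular NER tagger.
    """

    def __init__(self, entities):
        # Store a copy so that generators can be passed in as well.
        self.entities = list(entities)

    def counts_by_label(self):
        """Return how many entities were found for each label."""
        return Counter(label for _, label in self.entities)

    def most_common(self, label, n=3):
        """Return the n most frequent entity strings with the given label."""
        counts = Counter(text for text, lab in self.entities if lab == label)
        return counts.most_common(n)


# Invented sample, standing in for the output of a tagger.
sample = [
    ("Caesar", "PERSON"),
    ("Gallia", "LOCATION"),
    ("Caesar", "PERSON"),
    ("Rhenus", "LOCATION"),
    ("Helvetii", "GROUP"),
]

stats = EntityStats(sample)
print(stats.counts_by_label())
print(stats.most_common("PERSON"))
```

Keeping the entities inside an object means that further statistics can be added as methods without changing the code that calls the tagger.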

Outline

  • recap of main concepts of programming in Python
  • tagging of Caesar's De Bello Gallico: how to calculate some basic statistics about the extracted NEs?
  • how to extract dates from journal articles by using regular expressions?
  • evaluation of NER: error types and accuracy measures
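To give an idea of what date extraction with regular expressions can look like, here is a minimal sketch. The pattern only covers year expressions with an era marker (such as "460 BC" or "AD 14"); both the pattern and the sample sentence are assumptions made for illustration, and real journal articles will need a considerably richer pattern.

```python
import re

# Assumed, deliberately narrow pattern: a year of 1 to 4 digits with an
# era marker either before it ("AD 14") or after it ("460 BC").
DATE_PATTERN = re.compile(
    r"\b(?:AD\s+\d{1,4}|\d{1,4}\s+(?:BCE|BC|CE|AD))\b"
)


def extract_dates(text):
    """Return all substrings of text that match DATE_PATTERN, in order."""
    return DATE_PATTERN.findall(text)


# Invented sample sentence.
sample = "Thucydides was born around 460 BC; Augustus died in AD 14."
print(extract_dates(sample))
```

The non-capturing groups `(?:...)` make `findall` return the whole matched string rather than only the contents of a group.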

Required readings

  • S. Bird, E. Klein and E. Loper, Natural Language Processing with Python, O’Reilly, 2009. Available at: http://www.nltk.org/book/:

    • Ch. 1 “Language Processing and Python”: sections 1 & 2 (link)
    • Ch. 7 “Extracting Information from Text”, sections 1 & 5 (link)
  • Dan Jurafsky and James H. Martin. Speech and Language Processing, Chapter 21, pp. 1-7 (Information Extraction). 3rd edition draft available at https://web.stanford.edu/~jurafsky/slp3/21.pdf.

Further readings

  • Erdmann, Alex, Christopher Brown, Brian D. Joseph, Mark Janse, and Petra Ajaka. 2016. “Challenges and Solutions for Latin Named Entity Recognition.” In Coling. Association for Computational Linguistics. Available at https://www.clarin-d.de/images/lt4dh/pdf/LT4DH12.pdf.

Essay title

Practical exercise

The exercise is described in the last section of this notebook.

Students are asked to apply some of the notions discussed in the common classes and to reuse some of the code presented in order to extract named entities and annotate them in IOB format. Students will also evaluate the performance of the NER tagger using the methodology and metrics (precision, recall, F-score) explained in the lecture.
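As a reminder of how these metrics are computed, here is a minimal sketch that scores a predicted IOB sequence against a gold one. It works at the token level, which is a simplifying assumption: the lecture may instead score whole entities. The function name and the sample tag sequences are invented for this example.

```python
def precision_recall_f1(gold, predicted):
    """Token-level precision, recall and F1 over IOB tags.

    Assumption: a token counts as correct only if its predicted tag is
    not "O" and is identical to the gold tag.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    # True positives: entity tags predicted exactly as in the gold standard.
    tp = sum(1 for g, p in pairs if p != "O" and p == g)
    # False positives: entity tags predicted where gold has a different tag.
    fp = sum(1 for g, p in pairs if p != "O" and p != g)
    # False negatives: gold entity tags that were not reproduced.
    fn = sum(1 for g, p in pairs if g != "O" and p != g)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Invented sample: one missed token (index 3) and one spurious tag (index 4).
gold = ["B-PER", "O", "B-LOC", "I-LOC", "O"]
predicted = ["B-PER", "O", "B-LOC", "O", "B-PER"]

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Note that a token tagged with the wrong entity label counts as both a false positive and a false negative under this scheme.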
