Named Entity Extraction I

Matteo Romanello edited this page Mar 23, 2017 · 14 revisions

Date: Thursday, March 23, 2017, 17h00-18h15 (CET)

Session coordinators: Matteo Romanello and Francesco Mambrini (Deutsches Archäologisches Institut, Berlin)

YouTube link: https://www.youtube.com/watch?v=yxolXPWxmQ0

Slides: Sunoikisis DC, session 9

Jupyter notebooks:


Summary

This session aims to 1) give participants a good understanding of basic programming concepts and constructs in Python and 2) show how to use them in order to extract named entities (e.g. names of people, names of places) from text, a task also known as Named Entity Recognition (NER).

Our approach will be as task-oriented and hands-on as possible. In order to step directly into writing and executing some Python code, participants will be provided with an account to access an iPython notebook server (see instructions below). An iPython notebook is essentially a collection of code snippets together with other resources (e.g. text, images) that help explain what the code does (here is an example). A notebook server is a web-based, interactive programming environment where such notebooks can be created and executed live.

In the following session we will cover some more advanced concepts, including NER by means of regular expressions, the evaluation of NER systems, and how to compute some basic statistics on the extracted named entities.

Outline

  • Introduction: Information Extraction and Named Entity Recognition (NER)
  • NER: definitions and tasks (extraction, classification, disambiguation)
  • Basic programming concepts in Python (loops, if statements, exception catching, variables, functions, basic data structures, input/output)
  • Doing NER from Latin texts with CLTK and NLTK
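The programming constructs listed in the outline can be illustrated with a very small extraction sketch. It does not use CLTK or NLTK and is not the code from the session notebooks: it is a naive capitalisation heuristic, written only to show variables, functions, loops, if statements, lists and exception catching working together. The function names (`read_text`, `tokenize`, `candidate_entities`) are made up for this illustration, and the sample sentence is the well-known opening of De Bello Gallico.

```python
import string


def read_text(path):
    """Return the content of a text file, or an empty string if it is missing."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        # Exception catching: a missing file should not crash the notebook.
        return ""


def tokenize(text):
    """Split on whitespace and strip punctuation around each token."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        if token:
            tokens.append(token)
    return tokens


def candidate_entities(sentence):
    """Naive heuristic: capitalised tokens that are not sentence-initial."""
    candidates = []
    for position, token in enumerate(tokenize(sentence)):
        if position == 0:
            # The first word is always capitalised, so it tells us nothing.
            continue
        if token[0].isupper():
            candidates.append(token)
    return candidates


sentence = (
    "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, "
    "aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur."
)
entities = candidate_entities(sentence)
print(entities)  # ['Belgae', 'Aquitani', 'Celtae', 'Galli']
```

Note that "Gallia" is missed because it opens the sentence. Limitations of this kind are one reason to rely on dedicated NER tools rather than on capitalisation alone.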

Required readings

  • S. Bird, E. Klein and E. Loper, Natural Language Processing with Python, O’Reilly, 2009. Available at http://www.nltk.org/book/. Sections to read:

    • Ch. 1 “Language Processing and Python”, sections 1 & 2 (link)
    • Ch. 7 “Extracting Information from Text”, sections 1 & 5 (link)
  • Dan Jurafsky and James H. Martin, Speech and Language Processing, Chapter 21, pp. 1-7 (Information Extraction). 3rd edition draft available at https://web.stanford.edu/~jurafsky/slp3/21.pdf.

Further readings

Using the iPython notebook server

  • go to http://nlp.dainst.org:8888
  • enter the password you should have received by email
  • select a notebook by ticking the box beside the filename (extension .ipynb)
  • press the duplicate button that will appear
  • select the notebook copy you just created (selection works as above)
  • rename it by adding your initials at the end of the file name (e.g. *_MR.ipynb)
  • you're done! -- during the common session you will be running and working on your copy of the notebooks

Practical exercise

The exercise is described in the last section of this notebook.

The student is asked to modify the code examples contained in the notebook to extract the named entities from the English translation of Caesar's De Bello Gallico, book 1, available from Perseus.
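As a warm-up before adapting the notebook code, the sketch below shows one simple way to extract and classify entities in English text: lookup in a gazetteer (a hand-made list of known names). This is an assumption-laden illustration, not the notebook's CLTK/NLTK approach. The gazetteer entries, the labels and the function name `lookup_entities` are invented for the example, and the sample sentence is given as an English rendering of the opening of book 1 that may differ in wording from the Perseus text.

```python
import re

# Hypothetical, hand-made gazetteer: name -> entity class.
GAZETTEER = {
    "Caesar": "person",
    "Gaul": "place",
    "Belgae": "people",
    "Aquitani": "people",
    "Celts": "people",
    "Gauls": "people",
}


def lookup_entities(text, gazetteer):
    """Return (token, label) pairs for every token found in the gazetteer."""
    found = []
    # Keep only alphabetic runs, so punctuation never sticks to a name.
    for token in re.findall(r"[A-Za-z]+", text):
        if token in gazetteer:
            found.append((token, gazetteer[token]))
    return found


sample = (
    "All Gaul is divided into three parts, one of which the Belgae inhabit, "
    "the Aquitani another, those who in their own language are called Celts, "
    "in ours Gauls, the third."
)
matches = lookup_entities(sample, GAZETTEER)
for name, label in matches:
    print(name, label)
```

A lookup like this only finds names that are already in the list, but unlike a capitalisation test it is not misled by the sentence-initial "All". Comparing its output with what the notebook code returns on the same passage is a quick sanity check.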
