
Session 8: Data Structuring and Querying

Gabriel Bodard edited this page Oct 5, 2017 · 19 revisions

Date: Thursday, November 17, 2016, 16h00 (UK time)

Session coordinators: Tom Elliott and Sebastian Heath (Institute for the Study of the Ancient World, NYU)

YouTube link: https://youtu.be/39Ao9_mfCxE

Tom's Slides: a huge googly link

Sebastian's Google Doc: https://docs.google.com/document/d/1jYVwDnf-AvUNuOHLNkvvnQJtaYoXqACCSl_QRhdwqls

Outline

This session will survey some of the techniques available for structuring data and finding information in data that has been structured by others. Formal data structures in computing will be introduced, and we will relate these to some of the data formats and encodings found in use today in the field of ancient studies, including:

  • Comma-Separated Values (CSV)
  • JavaScript Object Notation (JSON) and its specializations GeoJSON and JSON-LD
  • Extensible Markup Language (XML) and its use in the Text Encoding Initiative (TEI) and EpiDoc encodings
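As a concrete comparison, the short sketch below serialises one and the same record in all three formats using only the Python standard library. The coin record is invented for illustration; it is not drawn from any real dataset.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# A single hypothetical coin record, expressed three ways.
record = {"id": "coin-001", "mint": "Athens", "date": "-450"}

# CSV: one header row plus one data row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "mint", "date"])
writer.writeheader()
writer.writerow(record)
print(buf.getvalue().strip())

# JSON: the same record as a key-value object.
print(json.dumps(record))

# XML: the same record as an element with nested children.
root = ET.Element("coin", id=record["id"])
ET.SubElement(root, "mint").text = record["mint"]
ET.SubElement(root, "date").text = record["date"]
print(ET.tostring(root, encoding="unicode"))
```

The same information survives each round trip; what changes is how much structure the format can express (flat rows for CSV, nesting for JSON and XML).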

We will also address query and data-manipulation languages suited to these formats, including:

  • Regular Expressions (regexes): for text of various kinds
  • XPath: for XML
  • SQL: for tabular data, such as that found in CSV
  • Python dictionaries and JavaScript objects (associative arrays): for key–value data such as JSON
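Each of these tools can be tried from the Python standard library alone. The sketch below uses invented sample data; note that ElementTree supports only a subset of full XPath.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Regular expression: find capitalised words in a sentence.
caps = re.findall(r"\b[A-Z][a-z]+", "Caesar crossed the Rubicon")
print(caps)  # ['Caesar', 'Rubicon']

# XPath (the subset ElementTree supports): select attribute values.
doc = ET.fromstring("<coins><coin mint='Athens'/><coin mint='Rome'/></coins>")
mints = [c.get("mint") for c in doc.findall(".//coin")]
print(mints)  # ['Athens', 'Rome']

# SQL over tabular data, using an in-memory SQLite database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE coin (id TEXT, mint TEXT)")
db.executemany("INSERT INTO coin VALUES (?, ?)",
               [("c1", "Athens"), ("c2", "Rome")])
first_mint = db.execute("SELECT mint FROM coin WHERE id = 'c1'").fetchone()[0]
print(first_mint)  # Athens

# Python dictionary: key-based lookup, the analogue of a JavaScript object.
lookup = {"c1": "Athens", "c2": "Rome"}
print(lookup["c2"])  # Rome
```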

Required reading

Further readings

Practical exercises

1. Querying

Edit the SPARQL queries on the Nomisma.org website to query numismatic data. Discuss the historical questions you can ask with these queries. How can you use the results of these queries in further processing or visualizations?
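For the "further processing" step: SPARQL endpoints such as Nomisma.org's typically return results in the standard SPARQL 1.1 JSON results format, which Python can flatten into plain rows. A minimal sketch, assuming that format; the binding values below are fabricated for illustration, not real Nomisma data.

```python
import json

# A fabricated response in the SPARQL 1.1 JSON results format,
# standing in for what an endpoint query would actually return.
results = json.loads("""
{
  "head": {"vars": ["mint", "count"]},
  "results": {"bindings": [
    {"mint":  {"type": "uri",     "value": "http://nomisma.org/id/athens"},
     "count": {"type": "literal", "value": "42"}}
  ]}
}
""")

# Flatten each binding into a plain dict, ready for a CSV writer
# or a plotting library.
rows = [{var: b[var]["value"] for var in b}
        for b in results["results"]["bindings"]]
print(rows)
```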

2. Regular expressions

Select and download a copy of an ancient text in English translation (e.g., via Perseus or another online textual source). Use a regular-expression-capable text editor or command-line tool to identify and extract all of the words in the text that begin with a capital letter. Arrange these words into a list, and then manually determine (with reference back to the text) which are proper nouns. Save these proper nouns in CSV format for use in additional processing.
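The extraction and CSV-output steps of this exercise can be sketched in Python as follows. The sample sentence stands in for your downloaded text, and the hard-coded exclusion of "Then" stands in for your own manual review of which words are proper nouns.

```python
import csv
import re

# Sample passage standing in for a downloaded translation.
text = "Then Odysseus spoke to Athena, and the goddess answered him at Ithaca."

# All words beginning with a capital letter.
capitalised = re.findall(r"\b[A-Z][A-Za-z]+\b", text)

# Deduplicate while keeping first-seen order.
unique = list(dict.fromkeys(capitalised))
print(unique)  # ['Then', 'Odysseus', 'Athena', 'Ithaca']

# After manual review against the text, keep only the proper nouns
# ('Then' is capitalised merely because it opens the sentence).
proper_nouns = [w for w in unique if w != "Then"]

# Save them in CSV format for additional processing.
with open("proper_nouns.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["proper_noun"])
    writer.writerows([n] for n in proper_nouns)
```

The same regular expression (`\b[A-Z][A-Za-z]+\b`) works unchanged in most regex-capable editors and in command-line tools such as `grep -o`.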
