# Old Babylonian Lists of Trees and Wooden Objects

## Introduction to research questions and analysis plan

We are interested in understanding relationships among extant versions of lexical texts. Patterns of similarity among these texts may provide important information about text provenance and/or routes of influence from one geographical area to another.

We are also interested in understanding the patterns by which lexical texts evolved and changed. 

In comparing versions of a lexical text we may think of four types of features: 

1) presence or absence of entries  
2) order of entries within a section  
3) order of sections in a document  
4) spelling of words  

The following sections will explore these four features independently and in combination to uncover patterns of similarity among documents.
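
As a rough illustration of how the first two features might be quantified, the sketch below compares two toy documents. It is not the method used in later sections: the Jaccard index (for presence or absence of entries) and the longest common subsequence (for order of entries) are assumptions chosen for simplicity, and the entries in `doc_a` and `doc_b` are invented for the example.

```python
def jaccard(entries_a, entries_b):
    """Share of distinct entries that two documents have in common."""
    set_a, set_b = set(entries_a), set(entries_b)
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def lcs_length(seq_a, seq_b):
    """Length of the longest run of entries found in the same relative order."""
    prev = [0] * (len(seq_b) + 1)
    for item_a in seq_a:
        curr = [0]
        for j, item_b in enumerate(seq_b, start=1):
            if item_a == item_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Toy documents (invented for illustration, not taken from the corpus).
doc_a = ["taskarin[boxwood]N", "esi[ebony]N", "halub[tree]N", "gigir[chariot]N"]
doc_b = ["esi[ebony]N", "taskarin[boxwood]N", "halub[tree]N", "apin[plow]N"]

presence_similarity = jaccard(doc_a, doc_b)   # feature 1: presence/absence
shared_in_order = lcs_length(doc_a, doc_b)    # feature 2: order of entries
```

The two measures respond to different kinds of variation: swapping two entries leaves the Jaccard index unchanged but shortens the common subsequence.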

## Introduction to dataset and data structure

This notebook uses data from the Digital Corpus of Cuneiform Lexical Texts ([DCCLT](http://oracc.org/dcclt)) derived from parsed JSON files. For the JSON output from the Open Richly Annotated Cuneiform Corpus ([ORACC](http://oracc.org)) see the [ORACC Open Data documentation](http://oracc.museum.upenn.edu/doc/opendata/index.html).  

The JSON files are parsed with the notebook [grab_json.ipynb](https://github.com/ErinBecker/digital-humanities-phylogenetics/blob/master/scripts/grab_json.ipynb). That notebook takes an input file that identifies the text IDs of the documents to be parsed. The input file is [ob_lists_wood.txt](https://github.com/ErinBecker/digital-humanities-phylogenetics/blob/master/data/text_ids/ob_lists_wood.txt).

The input file lists all the text IDs of Old Babylonian lists of trees and wooden objects currently in [DCCLT](http://oracc.org/dcclt), as well as the composite text of the [Nippur version](http://oracc.org/Q000039). Text IDs consist of a P plus a six-digit number (commonly referred to as a P-number) that is recognized by [ORACC](http://oracc.org) and by the Cuneiform Digital Library Initiative ([CDLI](http://cdli.ucla.edu)) and that has become the de facto standard in Assyriology. [CDLI](http://cdli.ucla.edu) provides metadata (provenience, period, publication, museum number, etc.) for each text. Composite text IDs consist of a Q plus a six-digit number (for instance Q000039). Texts that have not (yet) been cataloged in [CDLI](http://cdli.ucla.edu) receive a (temporary) six-digit X number.
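
The ID scheme described above (a P, Q, or X followed by six digits) can be checked with a short regular expression. This is a minimal sketch; the function name `id_kind` and the category labels it returns are illustrative and not part of ORACC or CDLI.

```python
import re

# One of the letters P, Q, or X followed by exactly six digits.
TEXT_ID = re.compile(r"[PQX]\d{6}")

# Illustrative labels for the three ID types described above.
ID_KINDS = {
    "P": "cataloged text",
    "Q": "composite text",
    "X": "text not (yet) cataloged in CDLI",
}


def id_kind(text_id):
    """Return a label for a text ID, or raise ValueError if it is malformed."""
    if not TEXT_ID.fullmatch(text_id):
        raise ValueError(f"not a valid text ID: {text_id!r}")
    return ID_KINDS[text_id[0]]


nippur_kind = id_kind("Q000039")
```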

The data are placed in the directory [data](https://github.com/ErinBecker/digital-humanities-phylogenetics/tree/master/data). They are comma-separated files with the following fields:

| field | description |
|-------|-------------|
| id_line | consists of a text ID (P, Q, or X number) plus a reference number |
| label | line number: obverse/reverse, column number, line number (e.g. o ii 16') |
| lemma | Sumerian words in lemmatized form (e.g. lugal[king]N); for unlemmatized words the raw transliteration is taken |
| base | Sumerian words in original spelling, but without morphological prefixes or suffixes |
| extent | (for missing data): how many lines or columns (restricted vocabulary) are missing |
| scope | (for missing data): what is missing - line, column, face, or surface (restricted vocabulary) |
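
A file with these fields can be read with Python's standard `csv` module. The sketch below uses a small in-memory sample instead of a real data file. The rows and P-numbers are invented, and the assumption that `id_line` joins the text ID and the reference number with a period (as in `P000001.1`) should be checked against the actual files.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows using the six fields described above.
SAMPLE = """id_line,label,lemma,base,extent,scope
P000001.1,o i 1,taskarin[boxwood]N,taskarin,,
P000001.2,o i 2,esi[ebony]N,esi,,
P000002.1,o i 1',,,n,line
P000002.2,o i 2',esi[ebony]N,esi,,
"""


def entries_by_text(handle):
    """Group the lemma of each row under its text ID, skipping rows without one."""
    texts = defaultdict(list)
    for row in csv.DictReader(handle):
        # Assumption: text ID and reference number are separated by a period.
        text_id = row["id_line"].split(".", 1)[0]
        if row["lemma"]:
            texts[text_id].append(row["lemma"])
    return dict(texts)


texts = entries_by_text(io.StringIO(SAMPLE))
```

To read a real file, an open file handle (for example from `open(path, newline="", encoding="utf-8")`) can be passed in place of the `io.StringIO` object.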

There are various types of missing data, represented in different ways. A word that is present but not lemmatized is represented by its transliterated form followed by [NA]NA (that is, guideword and part of speech are both NA). Words that are partly or entirely illegible on the original document are by definition unlemmatized and are handled the same way.

One or more missing lines are indicated in the fields `extent` and `scope`. `extent` gives the number of missing lines (or missing columns, etc.). Its restricted vocabulary includes numbers and the words 'n' (unknown), 'beginning', and 'rest'. `scope` indicates what is missing: line, column, obverse, etc.

| type | how represented |
|------|-----------------|
| words with unknown lemmatization | siki-siki[NA]NA |
| illegible words | x[NA]NA |
| known number of missing lines | extent: '5' scope: 'line' |
| unknown number of missing lines | extent: 'n' scope: 'line' |
| two missing columns | extent: '2' scope: 'column' |
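
The representations in the table can be recognized programmatically. The following is a minimal sketch under stated assumptions: the function names `parse_word` and `describe_gap` and the status labels are illustrative, and only a form that is exactly `x` is treated as illegible, so partly illegible words would be reported as unlemmatized.

```python
import re

# A word in the form transliteration[guideword]POS, e.g. lugal[king]N.
WORD = re.compile(r"(?P<form>[^\[\]\s]+)\[(?P<guideword>[^\]]*)\](?P<pos>\S*)")


def parse_word(word):
    """Split a word into form, guideword, and POS, and classify its status."""
    match = WORD.fullmatch(word)
    if match is None:
        raise ValueError(f"unexpected word format: {word!r}")
    parts = match.groupdict()
    if parts["guideword"] == "NA" and parts["pos"] == "NA":
        # Assumption: a bare 'x' marks an entirely illegible word.
        parts["status"] = "illegible" if parts["form"] == "x" else "unlemmatized"
    else:
        parts["status"] = "lemmatized"
    return parts


def describe_gap(extent, scope):
    """Turn an extent/scope pair into a readable description of missing text."""
    if extent.isdigit():
        count = int(extent)
        return f"{count} missing {scope}{'' if count == 1 else 's'}"
    if extent == "n":
        return f"unknown number of missing {scope}s"
    if extent in ("beginning", "rest"):
        return f"{extent} of {scope} missing"
    raise ValueError(f"unexpected extent: {extent!r}")


unlemmatized = parse_word("siki-siki[NA]NA")
illegible = parse_word("x[NA]NA")
five_lines = describe_gap("5", "line")
```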
  
