# Old Babylonian Lists of Trees and Wooden Objects
This notebook uses data scraped from [DCCLT](http://oracc.org/dcclt) with the notebooks [Save ORACC HTML files](https://github.com/niekveldhuis/Digital-Assyriology/blob/master/Scrape-Oracc/Save%20Oracc%20HTML%20files.ipynb) and [Scrape Oracc](https://github.com/niekveldhuis/Digital-Assyriology/blob/master/Scrape-Oracc/Scrape%20Oracc.ipynb), using the input file [ob_lists_wood.txt](https://github.com/ErinBecker/digital-humanities-phylogenetics/blob/master/data/text_ids/ob_lists_wood.txt). The input file lists the Text IDs of all Old Babylonian lists of trees and wooden objects currently in DCCLT, as well as the composite text of the Nippur version. A Text ID consists of a P plus a six-digit number (commonly called a P-number); it is recognized by [ORACC](http://oracc.org) and by [CDLI](http://cdli.ucla.edu) and has become the de facto standard in Assyriology. [CDLI](http://cdli.ucla.edu) provides metadata (provenience, period, publication, museum number, etc.) for each text. Composite Text IDs consist of a Q plus a six-digit number.
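
The ID conventions described above can be checked mechanically. The following is a minimal sketch, not part of the original scraping notebooks; the function name `classify_text_id` is hypothetical, and it assumes that IDs appear either bare (`P117395`) or with a single project prefix (`dcclt/P117395`).

```python
import re

# P-numbers identify individual tablets, Q-numbers identify composite texts.
# Assumption: an ID is either bare ("P117395") or carries one project
# prefix separated by a slash ("dcclt/P117395").
ID_PATTERN = re.compile(r"^(?:(?P<project>[a-z]+)/)?(?P<kind>[PQ])(?P<number>\d{6})$")

def classify_text_id(text_id):
    """Return 'exemplar' for a P-number, 'composite' for a Q-number, else None."""
    match = ID_PATTERN.match(text_id.strip())
    if match is None:
        return None
    return "exemplar" if match.group("kind") == "P" else "composite"

print(classify_text_id("dcclt/Q000039"))  # composite
print(classify_text_id("P117395"))        # exemplar
print(classify_text_id("P1173"))          # None (too few digits)
```

A check like this can be run over the input list before scraping to catch malformed lines early.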

The raw data are placed in the directory [data/raw](https://github.com/ErinBecker/digital-humanities-phylogenetics/tree/master/data/raw). Each text has a separate file named `dcclt_P######.txt` (or `dcclt_Q######.txt`). These are comma-separated files with the fields `id_text`, `text_name`, `l_no`, and `text`.

| field     | description |
|-----------|-------------|
| id_text   | allows creation of a link to the online edition in [DCCLT](http://oracc.org/dcclt) and/or to the images and metadata in [CDLI](http://cdli.ucla.edu) |
| text_name | a reference to a text (publication or museum number) that is recognizable by Assyriologists |
| l_no      | line number: obverse/reverse, column number, line number (e.g. o ii 16') |
| text      | Sumerian words in lemmatized form (e.g. lugal[king]N); in the files each word carries a language prefix, as in sux:esi[tree]N |
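
To make the lemmatized format of the `text` field concrete, here is a minimal sketch that splits a lemma into its parts. The names `parse_lemma` and `parse_line` are hypothetical helpers, and the pattern assumes that words are separated by single spaces and that guide words (the part in square brackets) contain no spaces or nested brackets.

```python
import re

# Pattern for one lemmatized word such as "sux:babbar[white]V/i".
LEMMA_PATTERN = re.compile(
    r"^(?:(?P<lang>[^:\s]+):)?"       # optional language code, e.g. "sux"
    r"(?P<citation_form>[^\[\]]+)"    # citation form, e.g. "babbar"
    r"\[(?P<guide_word>[^\]]*)\]"     # guide word, e.g. "white"
    r"(?P<pos>\S*)$"                  # part-of-speech tag, e.g. "V/i"
)

def parse_lemma(lemma):
    """Return a dict of the lemma's parts, or None if it does not match."""
    match = LEMMA_PATTERN.match(lemma)
    return match.groupdict() if match else None

def parse_line(text):
    """Parse every space-separated word of one line of the text field."""
    return [parse_lemma(word) for word in text.split()]

for part in parse_line("sux:ŋešgana[tree]N sux:babbar[white]V/i"):
    print(part)
```

Words that do not fit the pattern come back as `None`, which makes unexpected input easy to spot.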

In [1]:
import pandas as pd
import numpy as np
import re

# Create list of Text Files
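The input list used for scraping is reused here to build the list of file names. Each Text ID in that list includes the project name (e.g. `dcclt/Q000039`); replacing the slash with an underscore gives the name of the corresponding file.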

In [7]:
textlist = '../data/text_ids/ob_lists_wood.txt'
with open(textlist) as f:
    text_ids = f.readlines()
# strip the trailing newline and any accidental spaces
text_ids = [text_id.strip() for text_id in text_ids]
path = "../data/raw/"
# 'dcclt/Q000039' becomes '../data/raw/dcclt_Q000039.txt'
filenames = [path + text_id.replace('/', '_') + '.txt' for text_id in text_ids]
filenames[:5]

['../data/raw/dcclt_Q000039.txt',
 '../data/raw/dcclt_P117395.txt',
 '../data/raw/dcclt_P117404.txt',
 '../data/raw/dcclt_P128345.txt',
 '../data/raw/dcclt_P224980.txt']
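
Before reading the files it can be useful to confirm that every expected file was actually produced by the scraper. The sketch below is one way to do that; `split_existing` is a hypothetical helper, and the demonstration runs in a temporary directory with a made-up ID so that it does not depend on the repository layout.

```python
import os
import tempfile

def split_existing(filenames):
    """Return (found, missing) lists of file names, preserving input order."""
    found = [name for name in filenames if os.path.isfile(name)]
    missing = [name for name in filenames if not os.path.isfile(name)]
    return found, missing

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "dcclt_Q000039.txt")
    absent = os.path.join(tmp, "dcclt_P000000.txt")  # hypothetical ID, never created
    with open(present, "w", encoding="utf-8") as f:
        f.write("id_text,text_name,l_no,text\n")
    found, missing = split_existing([present, absent])
    found_names = [os.path.basename(name) for name in found]
    missing_names = [os.path.basename(name) for name in missing]

print(found_names)
print(missing_names)
```

In the notebook itself, the same function could be applied to `filenames` so that only files in `found` are passed on to the next step.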

# Open Files
Read each file into a Pandas DataFrame and collect the DataFrames in a list. Then concatenate that list into a single DataFrame and re-index it, so that rows are numbered consecutively across all texts.

In [19]:
# one DataFrame per text file
frames = [pd.read_csv(filename, index_col=None, header=0) for filename in filenames]
# combine them and renumber the rows from 0
df = pd.concat(frames).reset_index(drop=True)

In [20]:
df

Unnamed: 0,id_text,text_name,l_no,text
0,dcclt/Q000039,OB Nippur Ura 01,1,sux:taškarin[boxwood]N
1,dcclt/Q000039,OB Nippur Ura 01,2,sux:esi[tree]N
2,dcclt/Q000039,OB Nippur Ura 01,3,sux:ŋešnu[tree]N
3,dcclt/Q000039,OB Nippur Ura 01,4,sux:halub[tree]N
4,dcclt/Q000039,OB Nippur Ura 01,5,sux:šagkal[tree]N
5,dcclt/Q000039,OB Nippur Ura 01,6,sux:ŋešgana[tree]N
6,dcclt/Q000039,OB Nippur Ura 01,6a,sux:ŋešgana[tree]N sux:babbar[white]V/i
7,dcclt/Q000039,OB Nippur Ura 01,6b,sux:ŋešgana[tree]N sux:giggi[black]V/i
8,dcclt/Q000039,OB Nippur Ura 01,7,sux:ŋeš[tree]N sux:giggi[black]V/i
9,dcclt/Q000039,OB Nippur Ura 01,8,sux:ŋeštin[vine]N
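
The `Unnamed: 0` header in the output may indicate that the raw files were saved together with a row index, which `read_csv` then reads back as an ordinary column. This is an assumption about how the files were written, not something stated by the scraping notebooks. The sketch below reproduces the effect with two in-memory toy files (the second uses a made-up ID and text name) and shows how `index_col=0` avoids the extra column.

```python
import io
import pandas as pd

# Toy stand-ins for two raw files. The empty first header cell mimics a
# file written with DataFrame.to_csv() that kept its index (assumption).
csv_a = (
    ",id_text,text_name,l_no,text\n"
    "0,dcclt/Q000039,OB Nippur Ura 01,1,sux:taškarin[boxwood]N\n"
    "1,dcclt/Q000039,OB Nippur Ura 01,2,sux:esi[tree]N\n"
)
csv_b = (
    ",id_text,text_name,l_no,text\n"
    "0,dcclt/P000000,hypothetical tablet,o 1,sux:esi[tree]N\n"
)

# Default reading: the saved index shows up as a column named "Unnamed: 0".
frames_default = [pd.read_csv(io.StringIO(s)) for s in (csv_a, csv_b)]
df_default = pd.concat(frames_default).reset_index(drop=True)

# Reading with index_col=0: the saved index is used as the index instead.
frames_indexed = [pd.read_csv(io.StringIO(s), index_col=0) for s in (csv_a, csv_b)]
df_indexed = pd.concat(frames_indexed).reset_index(drop=True)

print(list(df_default.columns))
print(list(df_indexed.columns))
print(list(df_indexed.index))
```

In both cases `reset_index(drop=True)` replaces the per-file row numbers with one consecutive numbering for the combined DataFrame.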


# Create Expressions
A line in a lexical text may contain more than one word. Usually a list is divided into sections by keyword, for instance:

| text                	| translation                      	|
|---------------------	|----------------------------------	|
| {ŋeš}gigir          	| chariot                          	|
| {ŋeš}e₂ gigir       	| chariot cabin                    	|
| {ŋeš}e₂ usan₃ gigir 	| storage box for the chariot whip 	|
| {ŋeš}gaba gigir     	| breastwork of a chariot          	|

Different versions of the list may be compared on the basis of:

* presence or absence of entries
* order of entries in a section
* order of sections in the document
* spelling of words
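
The first two criteria can be illustrated with plain Python lists. This is only a sketch of the idea, not the comparison method used later in the project: the `composite` list takes the first five entries of Q000039 shown earlier, while `witness` is an invented exemplar that omits one entry and swaps two others.

```python
# First five entries of the composite text, as shown in the DataFrame above.
composite = [
    "sux:taškarin[boxwood]N",
    "sux:esi[tree]N",
    "sux:ŋešnu[tree]N",
    "sux:halub[tree]N",
    "sux:šagkal[tree]N",
]
# Hypothetical witness: omits halub and swaps esi and ŋešnu.
witness = [
    "sux:taškarin[boxwood]N",
    "sux:ŋešnu[tree]N",
    "sux:esi[tree]N",
    "sux:šagkal[tree]N",
]

composite_set, witness_set = set(composite), set(witness)

# Presence or absence of entries.
only_composite = [e for e in composite if e not in witness_set]
only_witness = [e for e in witness if e not in composite_set]

# Order of entries: restrict both lists to the shared entries and compare.
shared_in_composite_order = [e for e in composite if e in witness_set]
shared_in_witness_order = [e for e in witness if e in composite_set]
same_order = shared_in_composite_order == shared_in_witness_order

print(only_composite)
print(only_witness)
print(same_order)
```

Comparing spellings would require the transliteration rather than the lemmatized form, so it is not covered by this sketch.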

To compare entries rather than individual words, the words of each entry are joined with asterisks (`*`), so that every entry becomes a single string without spaces.

In [23]:
# .str.replace works on substrings; Series.replace would only match whole cell values
df['entries'] = df['text'].str.replace(' ', '*', regex=False)
df

Unnamed: 0,id_text,text_name,l_no,text,entries
0,dcclt/Q000039,OB Nippur Ura 01,1,sux:taškarin[boxwood]N,sux:taškarin[boxwood]N
1,dcclt/Q000039,OB Nippur Ura 01,2,sux:esi[tree]N,sux:esi[tree]N
2,dcclt/Q000039,OB Nippur Ura 01,3,sux:ŋešnu[tree]N,sux:ŋešnu[tree]N
3,dcclt/Q000039,OB Nippur Ura 01,4,sux:halub[tree]N,sux:halub[tree]N
4,dcclt/Q000039,OB Nippur Ura 01,5,sux:šagkal[tree]N,sux:šagkal[tree]N
5,dcclt/Q000039,OB Nippur Ura 01,6,sux:ŋešgana[tree]N,sux:ŋešgana[tree]N
6,dcclt/Q000039,OB Nippur Ura 01,6a,sux:ŋešgana[tree]N sux:babbar[white]V/i,sux:ŋešgana[tree]N*sux:babbar[white]V/i
7,dcclt/Q000039,OB Nippur Ura 01,6b,sux:ŋešgana[tree]N sux:giggi[black]V/i,sux:ŋešgana[tree]N*sux:giggi[black]V/i
8,dcclt/Q000039,OB Nippur Ura 01,7,sux:ŋeš[tree]N sux:giggi[black]V/i,sux:ŋeš[tree]N*sux:giggi[black]V/i
9,dcclt/Q000039,OB Nippur Ura 01,8,sux:ŋeštin[vine]N,sux:ŋeštin[vine]N
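
One pandas detail is worth keeping in mind when joining words: `Series.replace` matches whole cell values, whereas `Series.str.replace` works on substrings inside each cell. A minimal sketch on a toy Series (the third value, a lone space, is included only to show what `Series.replace` does match):

```python
import pandas as pd

text = pd.Series([
    "sux:ŋešgana[tree]N sux:babbar[white]V/i",
    "sux:ŋeštin[vine]N",
    " ",
])

# Whole-value replacement: only the cell that is exactly " " changes.
whole_value = text.replace(" ", "*")

# Substring replacement: every space inside every cell becomes "*".
substring = text.str.replace(" ", "*", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

Passing `regex=False` treats the search string literally, which avoids surprises if the separator is later changed to a character with a special meaning in regular expressions.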
