# Importing ORACC Data from corpus.json
by Niek Veldhuis
UC Berkeley


# TODO
* check that COFs are treated properly
* check that lines that continue into the next line (as in bilinguals) are captured completely. Such lines are indicated in the JSON by the addition of 'l' (lower case L) to the reference (`.ref`).
* add definition of fields to list in Introduction

# Note
Currently the code fetches a large ZIP file from ORACC on every run, extracts certain files from it, and parses those. The ZIP file contains all data that belong to an ORACC project or sub-project. Since one may run this notebook several times to collect data from the same project, repeating the download each time is wasteful. A better approach would be to move the download into a separate notebook that precedes this one.


# Introduction

The purpose of this code is to download [ORACC](http://oracc.org) (Open Richly Annotated Cuneiform Corpus) JSON files that contain textual data and to produce a `.csv` file in the directory `data/raw` with the relevant data for use in the phylogenetics project. The JSON files contain all the transliteration and lemmatization data of an ORACC project (metadata are made available in a separate `.json` file). For an introduction to the various ORACC JSON files see the [ORACC Open Data](http://oracc.org/doc/opendata) page.

The resulting data file includes various elements of the ORACC data structure. The current code will output a file with the following fields: 

* id_line
* label
* lemma
* base
* extent
* scope

The fields `extent` and `scope` capture the number of missing lines or columns.

The selection of fields may be adjusted with standard `Pandas` functions.

## Notes
The current version of the script works with the `requests` library.  

This notebook is written for **Python 3.5** with **Pandas 0.19** and **requests 2.18.1**.

The notebook was written for the [Digital Humanities Phylogenetics](https://github.com/ErinBecker/digital-humanities-phylogenetics) project with Erin Becker of [Data Carpentry](http://www.datacarpentry.org). The particular data selection and data manipulation performed in this notebook are inspired by the needs of that project (for instance, non-Sumerian words are filtered out). It should be fairly easy to adapt the notebook to the purposes of any other project that wishes to use [ORACC](http://oracc.org) data.

## Licensing
This notebook may be downloaded, used and adapted without any restrictions.

In [1]:
import pandas as pd   
import requests
import zipfile
import io
import tqdm
import json
import os

# Input List of Text IDs
Provide a list of text IDs (P, Q, and X numbers) as a file in the directory `data/text_ids`. The IDs are six-digit P, Q, or X numbers preceded by a project abbreviation in the format 'PROJECT/P######' or 'PROJECT/SUBPROJECT/Q######'. Each ID may be followed by a start and/or stop label, separated by a hyphen. For example:

* etcsri/Q001203
* rinap/rinap1/Q003421
* dcclt/P117395 r i 23 - r ii 3
* dcclt/P453267 - r iv 35'
* dcclt/P236734 o ii 12 -

The list should be created with a plain text editor such as TextEdit or Emacs (not in a word processor such as MS Word), and the filename should end in `.txt`. The labels should exactly copy the line labels as used in the online [ORACC](http://oracc.org) editions.
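
To make the format concrete, here is a minimal sketch of how one such entry breaks down into its parts. The helper `split_entry` is hypothetical (it is not part of the notebook) and assumes that the ID itself contains no spaces and that the start and stop labels are separated by a single hyphen.

```python
def split_entry(line):
    """Split one line of the ID list into (project, text ID, start label, stop label).

    Hypothetical helper for illustration only. An empty start or stop label
    means that no limit was given on that side.
    """
    text_ref, _, labels = line.strip().partition(' ')   # ID first, labels after the first space
    project, _, textid = text_ref.rpartition('/')       # everything before the last '/' is the (sub)project
    start, _, stop = labels.partition('-')              # hyphen separates start from stop
    return project.lower(), textid.upper(), start.strip(), stop.strip()


for entry in ["etcsri/Q001203",
              "rinap/rinap1/Q003421",
              "dcclt/P117395 r i 23 - r ii 3",
              "dcclt/P453267 - r iv 35'",
              "dcclt/P236734 o ii 12 -"]:
    print(split_entry(entry))
```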

In [2]:
filename = input('Filename: ')

Filename: Q39_par.txt


In [3]:
textids = '../data/text_ids/' + filename
with open(textids, 'r') as f:
    pqxnos = f.read().splitlines()
pqxnos = [no.strip() for no in pqxnos if no.strip()] # strip spaces left and right; skip empty lines
nos_labels = [no.split(' ', 1) if " " in no else [no, '-'] for no in pqxnos] #separate ID from labels
for label in nos_labels:  # split labels into start label and stop label
    label[1] = label[1].split('-')

# Parse

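`parsejson()` takes two arguments: the JSON object to parse and the dictionary `parameters`. The key `keep` in `parameters` is a Boolean: if `True`, the parser collects words from the first word onward; if `False`, it starts collecting when it reaches `startlabel`. The parser stops after the line that corresponds to `endlabel`. The current `label`, `startlabel`, and `endlabel` are all stored in `parameters`, which is created outside of the function.
The list `dollar_keys` (also defined outside of the function) stores the relevant field names for capturing breaks, such as missing lines or columns. The collected words are appended to the list `lemm_l`, likewise defined outside of the function.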

Words not only include lemmatized words, but also unlemmatized and unlemmatizable words (such as breaks).

The resulting dictionary includes keys such as `lang` (for language), `guideword`, `sense`, etc. - all the elements that define an [ORACC](http://oracc.org) signature. The dictionary also includes the key `id_word` (a sequential number for each word in each line) which has the format `TextID.LineID.WordID` - in other words, line and text ID can be derived from it. This allows the user to reassemble a text in the original word and line order.

In [4]:
def parsejson(text, parameters):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject, parameters)
        if "label" in JSONobject:
            parameters["label"] = JSONobject["label"]
        if parameters["label"] == parameters["startlabel"]:
            parameters["keep"] = True
        if parameters["label"] == parameters["endlabel"]:
            parameters["keep"] = False
        # the "or" clause ensures that the line corresponding to the endlabel is included
        if parameters["keep"] or parameters["label"] == parameters["endlabel"]:
            if "f" in JSONobject:
                lemma = JSONobject["f"]
                lemma["id_word"] = JSONobject["ref"]
                lemma["label"] = parameters["label"]
                lemma["id_text"] = parameters["id_text"]
                lemm_l.append(lemma)
            if "strict" in JSONobject and JSONobject["strict"] == "1":
                lemma = {key: JSONobject[key] for key in dollar_keys}
                lemma["id_word"] = JSONobject["ref"] + ".0"
                lemma["id_text"] = parameters["id_text"]
                lemm_l.append(lemma)
    return
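
The recursion may be easier to follow on a toy example. The sketch below is a simplified stand-in for `parsejson()`, not a replacement: the function `collect_words` and the hand-made `toy_text` are hypothetical, the toy structure uses only the keys that the notebook itself reads (`cdl`, `label`, `ref`, `f`), and real ORACC JSON is considerably richer. Start and stop labels are left out.

```python
# Hand-made toy structure (an assumption for illustration, not real ORACC output).
toy_text = {
    "cdl": [
        {"cdl": [                                   # a container that nests another "cdl" list
            {"label": "1"},                         # line start: carries the line label
            {"ref": "Q000039.1.1", "f": {"form": "{ŋeš}taškarin"}},
            {"label": "2"},
            {"ref": "Q000039.2.1", "f": {"form": "{ŋeš}esi"}},
        ]},
    ]
}


def collect_words(node, state, found):
    """Walk nested "cdl" lists; record (label, ref, form) for every word node."""
    for child in node["cdl"]:
        if "cdl" in child:                  # container: descend first
            collect_words(child, state, found)
        if "label" in child:                # remember the most recent line label
            state["label"] = child["label"]
        if "f" in child:                    # word node: its data live under "f"
            found.append((state["label"], child["ref"], child["f"]["form"]))
    return found


toy_words = collect_words(toy_text, {"label": None}, [])
print(toy_words)
```

As in `parsejson()`, the current label lives in a shared dictionary rather than in a local variable, so that it survives when the recursion returns from one container and enters the next.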

# Download the DCCLT JSON file in a ZIP from ORACC

All the JSON files that belong to [DCCLT](http://oracc.org/dcclt) are available in a single ZIP file that can be downloaded from [ORACC](http://oracc.org). This ZIP file is large (45MB or more) so this step might take some time.

In [5]:
project = 'dcclt'
url = 'http://build-oracc.museum.upenn.edu/' + project + '/json'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
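
Since the download is repeated on every run, one option is to keep a copy of the ZIP file on disk and download it only when that copy is missing. The following is a minimal sketch of that idea, not the notebook's current behavior: the functions `cached_zip` and `fetch_from_oracc` and the cache directory `../data/zip` are assumptions introduced for illustration. The URL is the one used in the cell above.

```python
import os


def fetch_from_oracc(project):
    """Download the project's JSON ZIP and return its raw bytes."""
    import requests  # imported here so that the caching logic can be used without it
    url = 'http://build-oracc.museum.upenn.edu/' + project + '/json'
    response = requests.get(url)
    response.raise_for_status()
    return response.content


def cached_zip(project, cache_dir, fetch=fetch_from_oracc):
    """Return the path of the project's ZIP file, calling `fetch` only if no copy exists."""
    os.makedirs(cache_dir, exist_ok=True)
    # sub-projects contain a slash (rinap/rinap1), which cannot be part of a file name
    path = os.path.join(cache_dir, project.replace('/', '-') + '.zip')
    if not os.path.exists(path):
        with open(path, 'wb') as f:
            f.write(fetch(project))
    return path


# Possible use in this notebook (commented out to avoid an accidental download):
# zip_path = cached_zip('dcclt', '../data/zip')
# z = zipfile.ZipFile(zip_path)
```

A cached copy will not pick up later changes on the ORACC server; deleting the file forces a fresh download.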

# Call the Parser Function for Each Textid

In [6]:
lemm_l = []
dollar_keys = ["extent", "scope", "state"]
for pqx in tqdm.tqdm(nos_labels):
    project = pqx[0][:-8].lower()
    textid = pqx[0][-7:].upper()
    z.extract(member = project + "/corpusjson/" + textid +".json", path= '../data/json')
    parameters = {"startlabel":pqx[1][0].strip(), "endlabel":pqx[1][1].strip(), "label":None,
                 "keep": False, "id_text": project + '/' + textid}
    if parameters["startlabel"] == "":
        parameters["keep"] = True
    else:
        parameters["keep"] = False
    #url = "http://build-oracc.museum.upenn.edu/" + project + "/corpusjson/" + textid + ".json"  
    #r = requests.get(url).json()
    directory = '../data/json/' + project + '/corpusjson/'
    file = directory + textid + '.json'
    with open(file) as f:
        r = json.load(f)
    try:
        parsejson(r, parameters)
    except Exception:  # a bare "except" would also swallow KeyboardInterrupt
        print(textid + ' is not available or not complete')

100%|██████████| 138/138 [00:01<00:00, 121.36it/s]


# Transform the Data into a DataFrame

If a text has no breakage information in the form of `$ 1 line broken` (etc.) the fields `extent`, `scope`, and `state` do not exist. The fields `extent` and `scope` are referenced in the code below. After creating the dataframe the existence of these two fields is checked - if they do not exist, empty columns are created.

In [7]:
words = pd.DataFrame(lemm_l)
if 'extent' not in words.columns:
    words['extent'] = ''
if 'scope' not in words.columns:
    words['scope'] = ''
words

Unnamed: 0,base,cf,cont,delim,epos,extent,form,gdl,gw,id_text,id_word,label,lang,morph,norm,norm0,pos,scope,sense,state
0,{ŋeš}taškarin,taškarin,,,N,,{ŋeš}taškarin,"[{'det': 'semantic', 'seq': [{'id': 'Q000039.1...",boxwood,dcclt/Q000039,Q000039.1.1,1,sux,~,,taškarin,N,,"box tree, boxwood",
1,{ŋeš}esi,esi,,,N,,{ŋeš}esi,"[{'det': 'semantic', 'seq': [{'id': 'Q000039.2...",tree,dcclt/Q000039,Q000039.2.1,2,sux,~,,esi,N,,ebony,
2,ŋeš-nu₁₁,ŋešnu,,,N,,{ŋeš}nu₁₁,"[{'det': 'semantic', 'seq': [{'id': 'Q000039.3...",tree,dcclt/Q000039,Q000039.3.1,3,sux,~,,ŋešnu,N,,tree,
3,{ŋeš}ha-lu-ub₂,halub,,,N,,{ŋeš}ha-lu-ub₂,"[{'det': 'semantic', 'seq': [{'id': 'Q000039.4...",tree,dcclt/Q000039,Q000039.4.1,4,sux,~,,halub,N,,tree,
4,{ŋeš}šag₄-kal,šagkal,,,N,,{ŋeš}šag₄-kal,"[{'det': 'semantic', 'seq': [{'id': 'Q000039.5...",tree,dcclt/Q000039,Q000039.5.1,5,sux,~,,šagkal,N,,tree,
5,ŋeš-kin₂,ŋešgana,,,N,,ŋeš-kin₂,"[{'id': 'Q000039.6.1.0', 'v': 'ŋeš', 'delim': ...",tree,dcclt/Q000039,Q000039.6.1,6,sux,~,,ŋešgana,N,,tree,
6,ŋeš-kin₂,ŋešgana,,,N,,ŋeš-kin₂,"[{'id': 'Q000039.7.1.0', 'v': 'ŋeš', 'delim': ...",tree,dcclt/Q000039,Q000039.7.1,6a,sux,~,,ŋešgana,N,,tree,
7,babbar,babbar,,,V/i,,babbar,"[{'id': 'Q000039.7.2.0', 'v': 'babbar'}]",white,dcclt/Q000039,Q000039.7.2,6a,sux,~,,babbar,V/i,,(to be) white,
8,ŋeš-kin₂,ŋešgana,,,N,,ŋeš-kin₂,"[{'id': 'Q000039.8.1.0', 'v': 'ŋeš', 'delim': ...",tree,dcclt/Q000039,Q000039.8.1,6b,sux,~,,ŋešgana,N,,tree,
9,giggi,giggi,,,V/i,,giggi,"[{'id': 'Q000039.8.2.0', 'v': 'giggi'}]",black,dcclt/Q000039,Q000039.8.2,6b,sux,~,,giggi,V/i,,(to be) black,


# Remove Spaces and Commas from Guide Word and Sense
Spaces in Guide Word and Sense may cause trouble for computational methods that rely on tokenization, and commas may cause trouble when the data are saved in Comma Separated Values format. Spaces are therefore replaced by hyphens, and commas are removed.

In [8]:
words = words.fillna('') # first replace Missing Values by empty string
words['sense'] = [x.replace(' ', '-') for x in words['sense']]
words['sense'] = [x.replace(',', '') for x in words['sense']]
words['gw'] = [x.replace(' ', '-') for x in words['gw']]
words['gw'] = [x.replace(',', '') for x in words['gw']]
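
The effect of the two replacements on a single value can be sketched as follows. The helper `clean_gloss` is hypothetical; it applies the same two `replace` calls, in the same order, as the cell above. The sample values are senses taken from the table above.

```python
def clean_gloss(text):
    """Replace spaces by hyphens, then remove commas (same order as in the notebook)."""
    return text.replace(' ', '-').replace(',', '')


print(clean_gloss('box tree, boxwood'))   # box-tree-boxwood
print(clean_gloss('(to be) white'))       # (to-be)-white
```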

The columns in the resulting DataFrame correspond to the elements of a full [ORACC](http://oracc.org) signature, plus information about text, line, and word ids:
* base (Sumerian only)
* cf (Citation Form)
* cont (continuation of the base; Sumerian only)
* epos (Effective Part of Speech)
* form (transliteration, omitting all flags such as indication of breakage)
* gw (Guide Word: main or first translation in standard dictionary)
* id_line (a line ID that begins with the six-digit P, Q, or X number of the text)
* id_text (six-digit P, Q, or X number)
* id_word (word ID that begins with the ID number of the line)
* label (traditional line number in the form o ii 2' (obverse column 2 line 2'), etc.)
* lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc)
* morph (Morphology; Sumerian only)
* norm (Normalization: Akkadian)
* norm0 (Normalization: Sumerian)
* pos (Part of Speech)
* sense (contextual meaning)
* signature (full ORACC signature)

Not all data elements (columns) are available for all words. Sumerian words never have a `norm`; Akkadian words do not have `norm0`, `base`, `cont`, or `morph`. Most data elements are only present when the word is lemmatized; only `lang`, `form`, `pos`, `id_word`, `id_line`, and `id_text` should always be there. An unlemmatized word has `pos` 'X' (for unknown). Broken words have `pos` 'u' (for 'unlemmatizable').

# Manipulate
The columns may be manipulated with standard Pandas methods to create the desired output. By way of example, the following code will create a column `lemma` with the format **cf[gw]pos** (for instance **lugal[king]N**). For words that have no lemmatization, `lemma` equals `form`. Only Sumerian words are allowed (and thus `lang` can be omitted) and in addition to the column `lemma` the column `base` is preserved; words that have no lemmatization take `form` as their base. Words and bases are concatenated to lines.

## Remove Non-Sumerian Words

In [9]:
lang = ['sux', ''] # note that 'lang' is empty in entries that indicate damage
words = words.loc[words['lang'].str[:3].isin(lang)].reset_index(drop=True) # drop=True avoids adding the old index as a column
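
The filter looks only at the first three characters of the language code, so that all varieties of Sumerian are kept. A plain-Python sketch of the same test (the function name `keep_entry` is hypothetical; the language codes are those listed above):

```python
def keep_entry(lang_code):
    """True for Sumerian (any variety) and for entries without a language code (damage)."""
    return lang_code[:3] in ('sux', '')


for code in ['sux', 'sux-x-emesal', 'akk', 'akk-x-stdbab', '']:
    print(repr(code), keep_entry(code))
```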

## Create Lemma Column and Adjust Base

In [10]:
words['lemma'] = words['cf'] # first element of lemma is the citation form
words['lemma'] = [words['lemma'][i] + '[' + words['gw'][i] 
                     + ']' + words['pos'][i] 
                     if not words['lemma'][i] == '' 
                     else words['form'][i] +'[NA]NA' for i in range(len(words))]
words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in words['lemma'] ]
words['base'] = [words['base'][i] if not words['base'][i] == '' 
                 or words['label'][i] == '' else words['form'][i] 
                 for i in range(len(words))]
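
The rule for building `lemma` can be stated for a single word as follows. This is a sketch for illustration only: the function `make_lemma` is hypothetical and is not used by the notebook, but it follows the same three cases as the list comprehensions above.

```python
def make_lemma(cf, gw, pos, form):
    """Build a lemma string from the fields of one word.

    - lemmatized word:       cf[gw]pos, e.g. lugal[king]N
    - unlemmatized word:     form[NA]NA
    - no form at all (e.g. an entry that records damage): empty string
    """
    if cf != '':
        return cf + '[' + gw + ']' + pos
    if form != '':
        return form + '[NA]NA'
    return ''


print(make_lemma('lugal', 'king', 'N', 'lugal'))   # lugal[king]N
print(make_lemma('', '', '', 'siki-siki'))         # siki-siki[NA]NA
print(repr(make_lemma('', '', '', '')))            # ''
```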

## Group by Line
Create `id_line` and `id_text` from `id_word`. The field `id_word` has the format `Q000039.76.1` (first word in line 76 of Q000039). The corresponding `id_line` is `76` (integer), `id_text` is `Q000039`. 

`id_line` is an integer that will keep the lines in proper order (`id_word` and `id_text` are strings).

In [11]:
words['id_line'] = [int(wordid[wordid.find('.')+1:wordid.rfind('.')]) for wordid in words['id_word']]
words['id_text'] = [wordid[:7] for wordid in words['id_word']]
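
For a single `id_word` the derivation looks like this. The helper `split_word_id` is hypothetical and assumes the three-part format `TextID.LineID.WordID` described above; unlike the cell above, it takes the text ID as everything before the first dot rather than the first seven characters, which gives the same result for a six-digit P, Q, or X number.

```python
def split_word_id(id_word):
    """Return (id_text, id_line) for an id_word such as 'Q000039.76.1'."""
    id_text, _, rest = id_word.partition('.')    # text ID: everything before the first dot
    line, _, _word = rest.rpartition('.')        # line ID: between the first and the last dot
    return id_text, int(line)                    # an integer keeps the lines in numerical order


print(split_word_id('Q000039.76.1'))   # ('Q000039', 76)
```

Sorting on the integer matters: as strings, `'10'` would sort before `'9'`.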

In [12]:
df = words.groupby(['id_text', 'id_line', 'label']).agg({
        'lemma': ' '.join,
        'base': ' '.join,
        'extent': ''.join, 
        'scope': ''.join
    }).reset_index()
df        

Unnamed: 0,id_text,id_line,label,lemma,base,extent,scope
0,P117395,2,o 1,ŋešed[key]N,{ŋeš}e₃-a,,
1,P117395,3,o 2,pakud[~tree]N,{ŋeš}pa-kud,,
2,P117395,4,o 3,raba[clamp]N,{ŋeš}raba,,
3,P117404,2,o 1,ig[door]N eren[cedar]N,{ŋeš}ig {ŋeš}eren,,
4,P117404,3,o 2,ig[door]N dib[board]N,{ŋeš}ig dib,,
5,P117404,4,o 3,ig[door]N i[oil]N,{ŋeš}ig i₃,,
6,P128345,2,o 1,garig[comb]N siki[hair]N,{ŋeš}ga-rig₂ siki,,
7,P128345,3,o 2,garig[comb]N siki-siki[NA]NA,{ŋeš}ga-rig₂ siki-siki,,
8,P128345,4,o 3,garig[comb]N saŋdu[head]N,{ŋeš}ga-rig₂ saŋ-du,,
9,P224980,4,o i 1,gigir[chariot]N,{ŋeš}gigir,,
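
What the aggregation does can also be sketched without Pandas. The following plain-Python version is an illustration of the grouping logic only, not the notebook's method; the word-level rows in `toy_rows` are reconstructed by hand from two lines of the output above, and `extent` and `scope` are omitted for brevity.

```python
# Word-level rows, reconstructed by hand from two lines of the output above.
toy_rows = [
    {'id_text': 'P117404', 'id_line': 2, 'label': 'o 1', 'lemma': 'ig[door]N',    'base': '{ŋeš}ig'},
    {'id_text': 'P117404', 'id_line': 2, 'label': 'o 1', 'lemma': 'eren[cedar]N', 'base': '{ŋeš}eren'},
    {'id_text': 'P117404', 'id_line': 3, 'label': 'o 2', 'lemma': 'ig[door]N',    'base': '{ŋeš}ig'},
    {'id_text': 'P117404', 'id_line': 3, 'label': 'o 2', 'lemma': 'dib[board]N',  'base': 'dib'},
]

# Collect the words of each line under the key (id_text, id_line, label).
collected = {}
for row in toy_rows:
    key = (row['id_text'], row['id_line'], row['label'])
    entry = collected.setdefault(key, {'lemma': [], 'base': []})
    entry['lemma'].append(row['lemma'])
    entry['base'].append(row['base'])

# Join the words of each line with a space, as ' '.join does in the aggregation.
toy_lines = {key: {col: ' '.join(values) for col, values in entry.items()}
             for key, entry in collected.items()}

for key, entry in toy_lines.items():
    print(key, entry)
```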


## Save in CSV Format

In [13]:
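os.makedirs('../data/raw', exist_ok=True)  # create the output directory if it does not exist
f_out = '../data/raw/' + os.path.splitext(filename)[0] + '.csv'
print('saving ' + f_out)
# pass the path (not an open file handle) so that the encoding argument is applied
df.to_csv(f_out, encoding='utf8', index=False)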

saving ../data/raw/Q39_par.csv
