# Retrieve Metadata from `catalogue.json`

The file `catalogue.json` contains all the catalog data for an [ORACC](http://oracc.org) project (for general information, see the [Oracc Open Data](http://oracc.org/doc/opendata) page). It can be found at `http://oracc.org/[PROJECT]/catalogue.json`, where `[PROJECT]` is replaced by your project or sub-project name (e.g. `dcclt` or `cams/gkab`).

The `catalogue.json` file of each project is also included in the archive `PROJECT.zip`, which is served at `http://oracc.org/PROJECT/json.zip`.
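
For readers who already have the ZIP archive, the following is a minimal sketch of pulling `catalogue.json` out of it with the standard library. The helper name `catalogue_from_zip` is hypothetical, and the internal layout of the archive is an assumption: the sketch looks for member names ending in `catalogue.json` and takes the shortest one instead of hard-coding a path. A tiny stand-in archive is built in memory so that the example runs without network access.

```python
import io
import json
import zipfile


def catalogue_from_zip(zip_bytes):
    """Return the parsed catalogue.json found inside a ZIP archive (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        # Assumption: the project's own catalogue has the shortest path among
        # all members whose name ends in 'catalogue.json'.
        matches = [n for n in z.namelist() if n.endswith('catalogue.json')]
        name = min(matches, key=len)
        with z.open(name) as fh:
            return json.load(fh)


# Build a small stand-in archive in memory (not real ORACC data layout).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('demo/catalogue.json',
               json.dumps({'members': {'P000001': {'provenience': 'Uruk'}}}))

catalogue = catalogue_from_zip(buffer.getvalue())
print(list(catalogue['members']))
```

With a real archive, the bytes could come from the `.content` attribute of a `requests` response for the ZIP URL.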

The main item in a `catalogue.json` file is called `members`, which contains the information about all the items in the project catalog.

In [1]:
import requests
import pandas as pd

## Select Project
Select the project or subproject of interest. Subprojects are indicated as `[PROJECT]/[SUBPROJECT]` or `[PROJECT]/[SUBPROJECT]/[SUBPROJECT]`.

In [2]:
project = input('Project or subproject abbreviation: ')
project = project.strip().lower()

Project or subproject abbreviation: dcclt


## Access File and Create Dataframe
The `.json()` method of a `requests` response parses the JSON file into a Python dictionary. The JSON catalogue file has all the P, Q, and X numbers of a project under the key `members`.

The resulting dictionary `d` has the text ID numbers (P, Q, and X numbers) as keys. Each value is another dictionary, in which each key is a field (`provenience`, `primary_publication`, etc.) and each value is the content of that field.

In [4]:
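url = 'http://oracc.org/' + project + '/catalogue.json'
f = requests.get(url, timeout=3)
f.raise_for_status()  # raise an error on a 4xx/5xx response (e.g. unknown project)
d = f.json()          # parse the JSON text into a dictionary
d = d['members']      # keep only the catalog entries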
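
To make the nested structure concrete, here is a small sketch that walks a stand-in for `d`. The two entries are a trimmed sample (the second one deliberately lacks `provenience`, to illustrate that entries need not share the same fields); the real dictionary is much larger.

```python
# Stand-in for the dictionary `d` produced above (trimmed sample).
d = {
    'P000001': {'designation': 'W 06435,a', 'provenience': 'Uruk'},
    'P000002': {'designation': 'W 06435,b'},
}

# Outer keys are text IDs; inner keys are catalog fields.
for text_id, fields in d.items():
    print(text_id, sorted(fields))

# Collect every field name that occurs in at least one entry.
all_fields = sorted({field for fields in d.values() for field in fields})
print(all_fields)
```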

## Create Dataframe
After the dictionary is transformed into a Pandas Dataframe, it needs to be transposed, so that the P, Q, and X numbers become indexes or row names (rather than column names), and each column represents a field in the catalog. 
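
The effect of the transposition can be seen on a toy dictionary. This is only a sketch with a trimmed sample in place of the real catalog; it also shows `pd.DataFrame.from_dict(..., orient='index')`, which builds the row-per-text layout directly and may be used as an alternative to `.T`.

```python
import pandas as pd

# Trimmed stand-in for the catalog dictionary.
d = {
    'P000001': {'period': 'Uruk III', 'provenience': 'Uruk'},
    'P000002': {'period': 'Uruk III'},
}

wide = pd.DataFrame(d)       # text IDs become columns, fields become rows
df = wide.T.fillna('')       # after transposing: one row per text ID

# Alternative without the explicit transpose.
alt = pd.DataFrame.from_dict(d, orient='index').fillna('')

print(wide)
print(df)
```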

Creating a Dataframe is not necessary; one may also manipulate the dictionary directly, but for demonstration purposes the Dataframe is a handy format. When manipulating the dictionary directly, keep in mind that not all catalog fields have data for all entries, which means that not every dictionary key is available for each P, Q, or X number.

Example code for filtering the dictionary to select all entries that have `provenience = 'Ur'`:
> `urcat = {key: value for key, value in d.items() if value.get('provenience') == 'Ur'}`
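
Because fields may be missing, `dict.get()` with a default is a convenient way to read them without raising `KeyError`. The sketch below counts entries per provenience on a trimmed stand-in for `d`; the label `'(none)'` for missing values is an arbitrary choice made for this illustration.

```python
from collections import Counter

# Trimmed stand-in for the catalog dictionary; P000003 has no provenience here.
d = {
    'P000001': {'provenience': 'Uruk'},
    'P000002': {'provenience': 'Uruk'},
    'P000003': {'period': 'Uruk IV'},
}

# .get() returns the default instead of failing when the field is absent.
counts = Counter(fields.get('provenience', '(none)') for fields in d.values())
print(counts.most_common())
```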

In [5]:
df = pd.DataFrame(d).T.fillna('')
df

Unnamed: 0,accession_no,acquisition_history,archive,ark,atf_source,atf_up,author,author_remarks,cdli_collation,cdli_comments,...,subgenre,subgenre_remarks,supergenre,surface_preservation,text_remarks,thickness,trans,uri,width,xproject
P000001,,,,21198/zz001q0dtm,"Englund, Robert K.",20130125.atf,"Englund, Robert K. &amp; Nissen, Hans J.","31x61x18; Lú A 14-16.30-32.48-50; M XVIII, auf...",,,...,Archaic Lu A,,LEX,,,18,,http://cdli.ucla.edu/P000001,61,CDLI
P000002,,,,21198/zz001q0dv4,"Englund, Robert K.",20130125.atf,"Englund, Robert K. &amp; Nissen, Hans J.",30x48x13; Lú A 13-15.23-25.?; Fundstelle wie W...,,,...,Archaic Lu A,,LEX,,,13,,http://cdli.ucla.edu/P000002,48,CDLI
P000003,,,,21198/zz001q0dwn,"Englund, Robert K.",20130125.atf,"Englund, Robert K. &amp; Nissen, Hans J.","42x53x19; Vocabulary 9; Qa XVI,2, unter der Ab...",,,...,Archaic Vocabulary,,LEX,,,19,,http://cdli.ucla.edu/P000003,53,CDLI
P000004,,,,21198/zz001q0dx5,"Englund, Robert K.",20130125.atf,"Englund, Robert K. &amp; Nissen, Hans J.",26x23x23; Lú A 9-10.?.?; Fundstelle wie W 9123...,,,...,Archaic Lu A,,LEX,,,23,,http://cdli.ucla.edu/P000004,23,CDLI
P000005,,,,21198/zz001q0dzp,"Englund, Robert K.",20130125.atf,"Englund, Robert K. &amp; Nissen, Hans J.","29x36x20; Lú A Vorläufer; Qa XVI,2, unter der ...",,,...,Archaic Lu A,,LEX,,,20,,http://cdli.ucla.edu/P000005,36,CDLI
P000006,,,,21198/zz001q0f0p,"Englund, Robert K.",20130125.atf,"Englund, Robert K. &amp; Nissen, Hans J.",82x62x19; Lú A Vorläufer; Fundstelle wie W 912...,,,...,Archaic Lu A,,LEX,,,19,,http://cdli.ucla.edu/P000006,62,CDLI
P000007,,,,21198/zz001q0f16,"Englund, Robert K.",20130125.atf,"Englund, Robert K. &amp; Nissen, Hans J.",56x36x29; Lú A Vorläufer; Fundstelle wie W 912...,,,...,Archaic Lu A,,LEX,,,29,,http://cdli.ucla.edu/P000007,36,CDLI
P000008,,,,21198/zz001q0f2q,"Englund, Robert K.",20130125.atf,"Englund, Robert K. &amp; Nissen, Hans J.","39x26x9; Unidentified 1; Pb XVII,1, +19.50 m, ...",,,...,Archaic Unidentified,1,LEX,,,9,,http://cdli.ucla.edu/P000008,26,CDLI
P000009,,,,21198/zz001q0f37,"Englund, Robert K.",20130125.atf,"Englund, Robert K. &amp; Nissen, Hans J.",54x46x?; Lú A 95-98.111-113; Fundstelle wie W ...,,,...,Archaic Lu A,,LEX,,,,,http://cdli.ucla.edu/P000009,46,CDLI
P000010,,,,21198/zz001q0f4r,"Englund, Robert K.",20130125.atf,"Englund, Robert K. &amp; Nissen, Hans J.",23x25x19; Officials 16-18.66-68.?-?; Fundstell...,,,...,Archaic Officials,,LEX,,,19,,http://cdli.ucla.edu/P000010,25,CDLI


## Select Relevant Fields
First display all available fields, then select the ones that are relevant for the task at hand. The `fillna('')` call used when creating the Dataframe puts an empty string (instead of `NaN`) in all fields that have no entry.
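
One way to decide which fields are worth keeping is to count how many entries actually have a value in each column. The following sketch assumes a Dataframe in which missing values are empty strings, as produced by `fillna('')`; the small frame is a stand-in, not the full catalog.

```python
import pandas as pd

# Stand-in Dataframe with empty strings for missing values.
df = pd.DataFrame(
    {'designation': ['W 06435,a', 'W 06435,b'],
     'archive': ['', ''],
     'period': ['Uruk III', 'Uruk III']},
    index=['P000001', 'P000002'])

# Number of non-empty cells per column, most populated first.
filled = (df != '').sum().sort_values(ascending=False)
print(filled)

# Columns that contain at least one value.
useful = list(filled[filled > 0].index)
print(useful)
```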

In [6]:
list(df.columns)

['accession_no',
 'acquisition_history',
 'archive',
 'ark',
 'atf_source',
 'atf_up',
 'author',
 'author_remarks',
 'cdli_collation',
 'cdli_comments',
 'citation',
 'collection',
 'collection_copyright',
 'condition_description',
 'created_by',
 'created_on',
 'credits',
 'date_entered',
 'date_of_origin',
 'date_remarks',
 'date_updated',
 'dates_referenced',
 'db_source',
 'designation',
 'electronic_publication',
 'elevation',
 'excavation_no',
 'external_id',
 'findspot_remarks',
 'findspot_square',
 'genre',
 'google_earth_collection',
 'google_earth_provenience',
 'height',
 'id_composite',
 'id_text',
 'images',
 'join_information',
 'keywords',
 'langs',
 'language',
 'last_modified_by',
 'last_modified_on',
 'lineart_up',
 'material',
 'museum_no',
 'notes',
 'object_preservation',
 'object_remarks',
 'object_type',
 'other_names',
 'period',
 'period_remarks',
 'photo_up',
 'place',
 'primary_edition',
 'primary_publication',
 'project',
 'provenience',
 'provenience_remar

In [7]:
df1 = df[['designation', 'period', 'provenience',
        'museum_no']]
df1

| | designation | period | provenience | museum_no |
|---|---|---|---|---|
| P000001 | W 06435,a | Uruk III | Uruk | VAT 01533 |
| P000002 | W 06435,b | Uruk III | Uruk | VAT 15263 |
| P000003 | W 09123,d | Uruk IV | Uruk | VAT 15253 |
| P000004 | W 09169,d | Uruk IV | Uruk | VAT 15168 |
| P000005 | W 09206,k | Uruk IV | Uruk | VAT 15153 |
| P000006 | W 09656,h1 | Uruk IV | Uruk | VAT 15003 |
| P000007 | W 09656,x | Uruk IV | Uruk | VAT 15111 |
| P000008 | W 11985,e | Uruk III | Uruk | VAT 17684 |
| P000009 | W 11985,f | Uruk III | Uruk | VAT 17702 |
| P000010 | W 11985,g | Uruk III | Uruk | VAT 17709 |


## Manipulate
The Dataframe may now be manipulated with standard Pandas methods. The example code selects the texts from Ur.
> `ur = df1[df1.provenience == "Ur"]`
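
A few further selections, sketched on a three-row stand-in for `df1`: combining two conditions with `&`, counting texts per period with `value_counts()`, and filtering on a string prefix with `.str.startswith()`. The variable names are illustrative only.

```python
import pandas as pd

# Three-row stand-in for df1.
df1 = pd.DataFrame(
    {'designation': ['W 06435,a', 'W 06435,b', 'W 09123,d'],
     'period': ['Uruk III', 'Uruk III', 'Uruk IV'],
     'provenience': ['Uruk', 'Uruk', 'Uruk'],
     'museum_no': ['VAT 01533', 'VAT 15263', 'VAT 15253']},
    index=['P000001', 'P000002', 'P000003'])

# Two conditions: each one in parentheses, joined with &.
uruk4 = df1[(df1.provenience == 'Uruk') & (df1.period == 'Uruk IV')]

# Number of texts per period.
by_period = df1['period'].value_counts()

# Texts whose museum number starts with 'VAT'.
vat = df1[df1['museum_no'].str.startswith('VAT')]

print(uruk4)
print(by_period)
```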

## Save
Save the resulting data set as a `csv` file. `UTF-8` is the encoding with the widest support in text analysis (and also the encoding used by [ORACC](http://oracc.org)). If you intend to use the catalog file in Excel, however, it is better to use `utf-16` encoding.

In [8]:
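filename = project.replace('/', '-') + '_cat.csv'
# Pass the path directly so that `encoding` takes effect; to_csv does not
# apply `encoding` when it is given an already opened text file handle.
df1.to_csv('../data/metadata/' + filename, encoding='utf-8')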
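
For the Excel case mentioned above, a possible variant writes the file in `utf-16`. This is a sketch under assumptions: the tab separator is a choice made here (not something the notebook prescribes), the output goes to a temporary directory instead of `../data/metadata/`, and the file name is hypothetical. The file is read back to check the round trip.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for df1.
df1 = pd.DataFrame(
    {'designation': ['W 06435,a', 'W 06435,b'],
     'period': ['Uruk III', 'Uruk III']},
    index=['P000001', 'P000002'])

out_dir = tempfile.mkdtemp()                          # temporary location for the demo
path = os.path.join(out_dir, 'dcclt_cat_excel.csv')   # hypothetical file name

# utf-16 with tab-separated fields (the separator is an assumption of this sketch).
df1.to_csv(path, encoding='utf-16', sep='\t')

# Read the file back with the same settings to verify it.
check = pd.read_csv(path, encoding='utf-16', sep='\t', index_col=0)
print(check)
```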