# Retrieve Metadata from `catalogue.json`

The file `catalogue.json` contains all the catalog data for an [ORACC](http://oracc.org) project (for general information, see the [Oracc Open Data](http://oracc.org/doc/opendata) page). The file can be found at `http://oracc.org/[PROJECT]/catalogue.json`. In the URL, replace `[PROJECT]` with your project or sub-project name (e.g. `dcclt` or `cams/gkab`).

Each project's `catalogue.json` is also included in the file `PROJECT.zip`, available at `http://oracc.org/PROJECT/json.zip`.

The main item in a `catalogue.json` file is called `members`, which contains the information about all the items in the project catalog.

In [None]:
import requests
import pandas as pd

## Select Project
Select the project or subproject of interest. Subprojects are indicated as `[PROJECT]/[SUBPROJECT]` or `[PROJECT]/[SUBPROJECT]/[SUBPROJECT]`.

In [None]:
project = input('Project or subproject abbreviation: ')
project = project.strip().lower()
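
The abbreviation entered above is all that is needed to build the catalogue URL. A minimal sketch, assuming a hypothetical helper named `catalogue_url` (it is not part of any ORACC tooling) that normalises its input in the same way as the cell above:

```python
def catalogue_url(project):
    """Build the catalogue.json URL for a project or subproject.

    `catalogue_url` is a hypothetical helper, for illustration only.
    """
    # Normalise: drop surrounding whitespace and slashes, use lower case.
    project = project.strip().strip('/').lower()
    if not project:
        raise ValueError('empty project abbreviation')
    return 'http://oracc.org/' + project + '/catalogue.json'


print(catalogue_url(' CAMS/gkab '))
```

Rejecting an empty abbreviation early avoids requesting a URL that cannot point at any project.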

## Access File and Create Dictionary
The `requests` library parses the JSON response into a Python dictionary with the `.json()` method. The JSON catalogue file lists all the P, Q, and X numbers of a project under the key `members`.

The resulting dictionary `d` has the text ID numbers (P, Q, and X numbers) as keys. Each value is another dictionary, in which each key is a field (`provenience`, `primary_publication`, etc.) and each value is the content of that field.

In [None]:
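url = 'http://oracc.org/' + project + '/catalogue.json'
f = requests.get(url, timeout=3)
f.raise_for_status()  # stop here if the request failed (e.g. unknown project)
d = f.json()          # parse the JSON body into a dictionary
d = d['members']      # keep only the catalog entries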
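
To make the shape of `d` concrete, here is a small sketch that uses invented sample data rather than a real download. The text IDs and field values are hypothetical; only the nesting (ID, then field, then content) follows the description above.

```python
# Hypothetical sample mimicking the shape of d = f.json()['members'].
# The IDs and field values are invented for illustration.
sample = {
    'P000001': {'designation': 'Text A', 'provenience': 'Ur',
                'period': 'Old Babylonian'},
    'P000002': {'designation': 'Text B', 'period': 'Ur III'},  # no provenience
    'Q000003': {'designation': 'Text C', 'provenience': 'Nippur'},
}

# Outer keys are text IDs; inner keys are catalog fields.
ids = list(sample)

# Entries need not share the same fields, so collect the union of all names.
fields = sorted({field for entry in sample.values() for field in entry})

print(ids)
print(fields)
```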

## Create Dataframe
After the dictionary is transformed into a Pandas Dataframe, it needs to be transposed, so that the P, Q, and X numbers become indexes or row names (rather than column names), and each column represents a field in the catalog. 

Creating a Dataframe is not strictly necessary: the dictionary can also be manipulated directly, but for demonstration purposes the Dataframe is a handy format. When manipulating the dictionary directly, keep in mind that not all catalog fields have data for all entries, so not every dictionary key is available for each P, Q, or X number.

Example code for filtering the dictionary to select all entries that have `provenience = 'Ur'`:
> `urcat = {key:value for key, value in d.items() if 'provenience' in d[key] and d[key]['provenience'] == 'Ur'}`
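
The same selection can be written a little more compactly with `dict.get`, which returns `None` for a missing key instead of raising `KeyError`. The sketch below runs on invented sample data; `select_by_field` is a hypothetical helper name, not something provided by ORACC.

```python
# Hypothetical sample data in the same shape as the dictionary `d`.
sample = {
    'P000001': {'designation': 'Text A', 'provenience': 'Ur'},
    'P000002': {'designation': 'Text B'},  # field missing entirely
    'P000003': {'designation': 'Text C', 'provenience': 'Nippur'},
    'X000004': {'designation': 'Text D', 'provenience': 'Ur'},
}


def select_by_field(catalog, field, wanted):
    """Return the entries whose `field` equals `wanted`.

    Entries that lack the field are skipped, because .get() yields None.
    """
    return {text_id: entry for text_id, entry in catalog.items()
            if entry.get(field) == wanted}


urcat = select_by_field(sample, 'provenience', 'Ur')
print(sorted(urcat))
```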

In [None]:
df = pd.DataFrame(d).T.fillna('')
df
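
The effect of the transposition is easiest to see on a tiny example. This sketch uses invented sample data with two entries and three fields, so the change in shape is visible:

```python
import pandas as pd

# Hypothetical sample data: two text IDs, three fields in total.
sample = {
    'P000001': {'designation': 'Text A', 'provenience': 'Ur',
                'period': 'Old Babylonian'},
    'P000002': {'designation': 'Text B', 'period': 'Ur III'},
}

raw = pd.DataFrame(sample)     # text IDs become columns, fields become rows
df_sample = raw.T.fillna('')   # transpose: one row per text, one column per field

print(raw.shape)        # fields x texts
print(df_sample.shape)  # texts x fields
print(df_sample)
```

The entry without a `provenience` value ends up with an empty string in that column rather than `NaN`.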

## Select Relevant Fields
First display all available fields, then select the ones that are relevant for the task at hand. The `fillna('')` call used when the Dataframe was created puts an empty string (instead of `NaN`) in every field that has no entry.

In [None]:
list(df.columns)

In [None]:
df1 = df[['designation', 'period', 'provenience',
        'museum_no']]
df1

## Manipulate
The Dataframe may now be manipulated with standard Pandas methods. The example code selects the texts from Ur.
> `ur = df1[df1.provenience == "Ur"]`
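
Two further manipulations that are often useful, sketched on invented sample data (the IDs and values are hypothetical): combining conditions with `&`, and counting entries per value with `value_counts`.

```python
import pandas as pd

# Hypothetical stand-in for df1, with the same kind of columns.
sample_df = pd.DataFrame(
    {
        'designation': ['Text A', 'Text B', 'Text C', 'Text D'],
        'period': ['Old Babylonian', 'Ur III', 'Old Babylonian', 'Ur III'],
        'provenience': ['Ur', 'Ur', 'Nippur', ''],
    },
    index=['P000001', 'P000002', 'P000003', 'P000004'],
)

# Combine conditions with & (each condition needs its own parentheses).
ur_ob = sample_df[(sample_df.provenience == 'Ur')
                  & (sample_df.period == 'Old Babylonian')]

# Count how many texts come from each provenience.
counts = sample_df['provenience'].value_counts()

print(ur_ob)
print(counts)
```

Because missing values were replaced by empty strings, texts without a provenience show up in the counts under `''`.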

## Save
Save the resulting data set as a `csv` file. `UTF-8` is the encoding with the widest support in text analysis (and also the encoding used by [ORACC](http://oracc.org)). If you intend to use the catalog file in Excel, however, it is better to use `utf-16` encoding.

In [None]:
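filename = project.replace('/', '-') + '_cat.csv'
# Set the encoding in open(): to_csv does not apply its own encoding
# argument when it is given an already-open text file handle.
with open('../data/metadata/' + filename, 'w', encoding='utf-8', newline='') as w:
    df1.to_csv(w)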
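
For the Excel case mentioned above, a minimal sketch of a `utf-16` export follows. It uses invented sample data and writes to a temporary directory; the tab separator and the file name are assumptions (tab-delimited `utf-16` files tend to open cleanly in Excel), not requirements of ORACC or Pandas.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df1.
sample_df = pd.DataFrame(
    {'designation': ['Text A', 'Text B'], 'provenience': ['Ur', 'Nippur']},
    index=['P000001', 'P000002'],
)
sample_df.index.name = 'id_text'

# Write to a temporary directory so the sketch leaves no files behind
# in the project folders.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, 'sample_cat_excel.csv')

# Passing a path (not an open handle) lets to_csv apply the encoding itself.
sample_df.to_csv(path, encoding='utf-16', sep='\t')

# Read the file back to confirm that the round trip preserves the data.
check = pd.read_csv(path, encoding='utf-16', sep='\t', index_col=0)
print(check)
```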