# Loading Sydney Speaks from the LDaCA REST API

The Language Data Commons of Australia (LDaCA) packages each of its data collections as an [RO-Crate](https://www.researchobject.org/ro-crate/). Every collection comes with a metadata file called `ro-crate-metadata.json`, which describes the research objects in that collection.

The metadata file is in JSON format, so this notebook also covers how to read a JSON file.

We will use the ldaca-py API wrapper to browse the collection database and download files from it.
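
As a warm-up, here is a minimal sketch of reading a JSON file with Python's built-in `json` module. The file content is a small made-up stand-in written to a temporary directory, not the real Sydney Speaks metadata; only the `@context` / `@graph` layout follows the RO-Crate convention.

```python
import json
import tempfile
from pathlib import Path

# A tiny, made-up stand-in for ro-crate-metadata.json (illustrative only).
sample = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork"},
        {"@id": "./", "@type": "Dataset", "name": "Example collection"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ro-crate-metadata.json"
    path.write_text(json.dumps(sample), encoding="utf-8")

    # json.load turns the file into ordinary Python dicts and lists.
    with path.open(encoding="utf-8") as fh:
        metadata = json.load(fh)

# Every entity in the crate is one dict inside the "@graph" list.
entity_ids = [entity["@id"] for entity in metadata["@graph"]]
print(entity_ids)
```

Once loaded, the metadata is plain Python data, so the usual dict and list operations apply.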

<div class="alert alert-block alert-success">
<b>Skills</b> 
    
<ul>
<li> JSON format (see https://en.wikipedia.org/wiki/JSON)</li>
<li> working with files, API keys, RO-Crates</li>
<li> discovering and exploring metadata</li>
</ul>    
<br>

<b>Skill level:</b> Intermediate
</div>

In [None]:
# Before we begin, let's make sure that we install all the requirements that we need
import sys
!{sys.executable} -m pip install -r requirements.txt

<div class="alert alert-block alert-info">
<b>Python Library: ldaca-py</b> 
<br>    
The ldaca-py Python library helps users navigate the data collections in the Language Data Commons of Australia (LDaCA). 
<br>    
In the block of code below, we create an ldaca instance.
    <strong>To do this, we provide the URL of the LDaCA API and a token.</strong>
<br>
To get an API token, go to https://data-dev.ldaca.edu.au, log in via GitHub and generate an API token.
Once we have an ldaca instance, we can retrieve collections, download data and use ldaca.crate to explore the metadata. The ro-crate-metadata.json file is also stored locally for the notebook.
</div>

In [20]:
import os
from dotenv import load_dotenv
from ldaca.ldaca import LDaCA
URL = 'https://data-dev.ldaca.edu.au/api'
COLLECTION_ID = 'arcp://name,sydney-speaks/corpus/root'
load_dotenv('vars.env') 
TOKEN = os.getenv('API_KEY') 

In [21]:
# Have you set your own API Key?

while not TOKEN:  # True when TOKEN is None or an empty string
    print("You haven't set an API_KEY value for this Collection in notebooks/vars.env.")
    print('To get an API token, go to https://data-dev.ldaca.edu.au, login via GitHub and generate an API TOKEN.')
    print('This token can be reused next time you launch this notebook.')
    print('What is your API token?\n') 
    TOKEN = input("API_KEY=")    
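
The prompt loop above echoes the token as it is typed. One possible alternative, sketched below, wraps the same logic in a function and uses `getpass` from the standard library so the token stays hidden. The helper name `resolve_token` and its parameters are hypothetical and are not part of ldaca-py.

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def resolve_token(
    env: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
    name: str = "API_KEY",
) -> str:
    """Return the token from `env`, prompting until a non-empty value is given."""
    token = (env.get(name) or "").strip()
    while not token:
        # getpass hides the typed value, unlike input()
        token = prompt(f"{name}=").strip()
    return token


# In the notebook this could replace the loop above:
# TOKEN = resolve_token()
```

Passing the environment and the prompt function as parameters keeps the helper easy to try out without typing anything, for example `resolve_token(env={"API_KEY": "abc"})`.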

In [22]:
ldaca = LDaCA(url=URL, token=TOKEN)

In [24]:
# Saves the metadata in the data_dir
ldaca.retrieve_collection(collection=COLLECTION_ID, collection_type='Collection', data_dir='data')

## The ldaca.crate

We use [RO-Crates](https://www.researchobject.org/ro-crate/) in LDaCA as the standard metadata format.

Entities in a crate refer to each other by `@id`. The function `ldaca.crate.dereference(id)` looks up the entity with a given id, so we can find out which linked object a reference points to.
For more on how to consume an RO-Crate with Python, see https://github.com/ResearchObject/ro-crate-py#consuming-an-ro-crate
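
To make the idea of dereferencing concrete, here is a rough sketch using plain dicts. It is not how `ldaca.crate.dereference` is implemented; the `lookup` helper and the two-entity graph are made up for illustration.

```python
# A made-up, two-entity graph in the RO-Crate JSON-LD layout.
graph = [
    {
        "@id": "#provenance",
        "@type": "CreateAction",
        "result": {"@id": "ro-crate-metadata.json"},
    },
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork"},
]

# Index every entity by its @id so that a reference can be looked up directly.
index = {entity["@id"]: entity for entity in graph}


def lookup(entity_id):
    """Hypothetical stand-in for dereference: return the entity or None."""
    return index.get(entity_id)


prov_entity = lookup("#provenance")
# Follow the reference stored under "result" to the entity it points at.
result_entity = lookup(prov_entity["result"]["@id"])
print(result_entity["@type"])
```

A reference is just a small dict holding an `@id`; following it means looking that id up among the crate's entities.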

In [25]:
# The metadata is stored in ldaca.crate
ldaca.crate

<rocrate_lang.rocrate_plus.ROCratePlus at 0x111b92a00>

In [26]:
# We can explore the metadata programmatically this way
provenance = ldaca.crate.dereference('#provenance')
provenance.as_jsonld() # Use as_jsonld() to produce a json representation of the found item

{'@id': '#provenance',
 '@type': 'CreateAction',
 'name': 'Created RO-Crate using corpus-tools-sydney-speaks',
 'instrument': {'@id': 'git+https://github.com/Language-Research-Technology/corpus-tools-sydney-speaks.git'},
 'result': {'@id': 'ro-crate-metadata.json'}}

In [27]:
from rocrate_lang.utils import as_list # A handy utility for converting to list
ro_crate_sub_collections = set()
for item in ldaca.crate.contextual_entities:
    for member_of in as_list(item.get('memberOf')):
        ro_crate_sub_collections.add(member_of)
ro_crate_sub_collections

{<arcp://name,sydney-speaks/corpus/root/ ['Dataset', 'RepositoryCollection']>,
 <arcp://name,sydney-speaks/subcorpus/Bcnt ['Dataset', 'RepositoryCollection']>,
 <arcp://name,sydney-speaks/subcorpus/SSDS ['Dataset', 'RepositoryCollection']>,
 <arcp://name,sydney-speaks/subcorpus/SydS ['Dataset', 'RepositoryCollection']>,
 <arcp://name,sydney-speaks/subcorpus/Syds ['Dataset', 'RepositoryCollection']>}

In [28]:
# Sets are unordered, so which sub-collection comes out first may vary between runs
sub = next(iter(ro_crate_sub_collections))
sub['@id']

'arcp://name,sydney-speaks/subcorpus/Syds'

In [None]:
# We could get the members of this sub-collection
import json  # needed for json.dumps below

x_collection = []

for item in ldaca.crate.contextual_entities:
    for member_of in as_list(item.get('memberOf')):
        if member_of.id == sub['@id']:
            x_collection.append(item)
sample_item = x_collection[0].as_jsonld()
print(json.dumps(sample_item, indent=2))
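
The two loops above both walk `memberOf` references. A minimal sketch of the same pattern on plain dicts follows, grouping every entity under its parent collection in a single pass. The entities are made up, and `to_list` is a hypothetical helper that only approximates what `as_list` is assumed to do (wrap a single value in a list).

```python
from collections import defaultdict


def to_list(value):
    """Hypothetical helper: wrap a single value in a list, map None to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Made-up entities; the real crate returns library objects, not plain dicts.
ssds = "arcp://name,sydney-speaks/subcorpus/SSDS"
bcnt = "arcp://name,sydney-speaks/subcorpus/Bcnt"
entities = [
    {"@id": "#item-1", "memberOf": {"@id": ssds}},
    {"@id": "#item-2", "memberOf": [{"@id": ssds}, {"@id": bcnt}]},
    {"@id": "#item-3"},  # no memberOf at all
]

# Map each parent collection id to the ids of its members.
members_by_collection = defaultdict(list)
for entity in entities:
    for parent in to_list(entity.get("memberOf")):
        members_by_collection[parent["@id"]].append(entity["@id"])

print(dict(members_by_collection))
```

Normalising `memberOf` to a list first means the loop does not need separate branches for missing, single and multiple parents.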

In [30]:
# We can then get the collections from the API
# In this case we retrieve the members of the collection
members = ldaca.retrieve_members_of_collection()
ldaca.collection_members

[{'conformsTo': 'https://purl.archive.org/textcommons/profile#Collection',
  'crateId': 'arcp://name,sydney-speaks/subcorpus/Syds',
  'memberOf': 'arcp://name,sydney-speaks/corpus/root',
  'record': {'license': 'https://www.dynamicsoflanguage.edu.au/sydney-speaks/licence/E/',
   'name': 'Syds',
   'description': ''}},
 {'conformsTo': 'https://purl.archive.org/textcommons/profile#Collection',
  'crateId': 'arcp://name,sydney-speaks/subcorpus/Bcnt',
  'memberOf': 'arcp://name,sydney-speaks/corpus/root',
  'record': {'license': 'https://www.dynamicsoflanguage.edu.au/sydney-speaks/licence/E/',
   'name': 'Bcnt',
   'description': ''}},
 {'conformsTo': 'https://purl.archive.org/textcommons/profile#Collection',
  'crateId': 'arcp://name,sydney-speaks/subcorpus/SydS',
  'memberOf': 'arcp://name,sydney-speaks/corpus/root',
  'record': {'license': 'https://www.dynamicsoflanguage.edu.au/sydney-speaks/licence/E/',
   'name': 'SydS',
   'description': ''}},
 {'conformsTo': 'https://purl.archive.or

In [31]:
# Let's select the fourth member (index 3)
sub_collection = ldaca.collection_members[3]
sub_collection['crateId']

'arcp://name,sydney-speaks/subcorpus/SSDS'
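
Selecting by position depends on the order in which the API returns the members. A possible alternative, sketched here on trimmed copies of the records above, is to select by `crateId`. The helper `find_member` is hypothetical and not part of ldaca-py.

```python
# Trimmed stand-ins for the records above (only the key used here).
collection_members = [
    {"crateId": "arcp://name,sydney-speaks/subcorpus/Syds"},
    {"crateId": "arcp://name,sydney-speaks/subcorpus/Bcnt"},
    {"crateId": "arcp://name,sydney-speaks/subcorpus/SydS"},
    {"crateId": "arcp://name,sydney-speaks/subcorpus/SSDS"},
]


def find_member(members, crate_id):
    """Hypothetical helper: return the first member with this crateId."""
    for member in members:
        if member.get("crateId") == crate_id:
            return member
    raise KeyError(crate_id)


chosen = find_member(collection_members, "arcp://name,sydney-speaks/subcorpus/SSDS")
print(chosen["crateId"])
```

Raising `KeyError` for an unknown id makes a typo fail loudly instead of silently picking the wrong sub-collection.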

In [32]:
# Find files with a file picker, then store the data and metadata
# All files will be stored in data/ldaca_files

my_file_picker = lambda f: f if f.get('encodingFormat') == 'text/csv' else None
all_files = ldaca.store_data(
    sub_collection=sub_collection['crateId'], 
    entity_type='RepositoryObject', 
    ldaca_files='ldaca_files', 
    file_picker=my_file_picker, extension='csv'
)
all_files

['data/ldaca_files/Interview_Transcript_1977-80_-_interview_with_Amar_Bakas_-_alias_-_(Male,_16).csv',
 'data/ldaca_files/Interview_Transcript_1977-80_-_interview_with_Donna_Tortorella_-_alias_-_(Female,_16).csv',
 'data/ldaca_files/Interview_Transcript_1977-80_-_interview_with_David_Sorielli_-_alias_-_(Male,_14).csv',
 'data/ldaca_files/Interview_Transcript_1977-80_-_interview_with_Marta_Caccia_-_alias_-_(Female,_15).csv',
 'data/ldaca_files/Interview_Transcript_1977-80_-_interview_with_Lorraine_Pastore_-_alias_-_(Female,_16).csv',
 'data/ldaca_files/Interview_Transcript_1977-80_-_interview_with_Fay_Moore_-_alias_-_(Female,_14).csv',
 'data/ldaca_files/Interview_Transcript_1977-80_-_interview_with_Calvin_Luchetti_-_alias_-_(Male,_14).csv',
 'data/ldaca_files/Interview_Transcript_1977-80_-_interview_with_Franco_Manella_-_alias_-_(Male,_15).csv',
 'data/ldaca_files/Interview_Transcript_1977-80_-_interview_with_Hanna_Lekosis_-_alias_-_(Female,_17).csv',
 'data/ldaca_files/Interview_Trans
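
The file picker above is a function that receives a file entity and returns it if it should be kept, or `None` otherwise. As a sketch, the same picker can be written as a named function and tried out on made-up entities before it is handed to `store_data`. This assumes, as the lambda does, that file entities support `.get('encodingFormat')`.

```python
def pick_csv(file_entity):
    """Return the entity if it is a CSV file, otherwise None."""
    if file_entity.get("encodingFormat") == "text/csv":
        return file_entity
    return None


# Made-up file entities for illustration.
candidates = [
    {"@id": "a.csv", "encodingFormat": "text/csv"},
    {"@id": "a.wav", "encodingFormat": "audio/wav"},
    {"@id": "untyped"},  # no encodingFormat at all
]

picked = [f for f in candidates if pick_csv(f) is not None]
print([f["@id"] for f in picked])
```

A named function has room for a docstring and further conditions, which a one-line lambda does not.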

In [33]:
# Let's browse one file
all_files[0]

'data/ldaca_files/Interview_Transcript_1977-80_-_interview_with_Amar_Bakas_-_alias_-_(Male,_16).csv'

## Frictionless Libraries
Let's use the [Table Schema](https://specs.frictionlessdata.io/table-schema/) defined in the Frictionless Framework.

We will use [tableschema-py](https://libraries.frictionlessdata.io/docs/table-schema/python), a Python implementation of this specification.
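
A Table Schema descriptor is a JSON object whose `fields` list gives each column a name and a type. The toy sketch below builds such a descriptor with only the standard library, to show the idea behind schema inference. It is not the algorithm tableschema-py uses; the CSV text and the `guess_type` rule are made up for illustration.

```python
import csv
import io

# Made-up CSV text; the column names are illustrative only.
csv_text = "id,text\n1,hello\n2,world\n"


def guess_type(values):
    """Toy rule: 'integer' if every value parses as int, else 'string'."""
    try:
        for value in values:
            int(value)
    except ValueError:
        return "string"
    return "integer"


rows = list(csv.DictReader(io.StringIO(csv_text)))

# One field entry per column, in the order the columns appear.
descriptor = {
    "fields": [
        {"name": name, "type": guess_type([row[name] for row in rows])}
        for name in rows[0]
    ]
}
print(descriptor)
```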

In [None]:
import sys

!{sys.executable} -m pip install tableschema

In [None]:
from tableschema import Table
import json
table = Table(all_files[0])
table.headers
table_read = table.read(keyed=True)
# Let's browse the table's first 10 rows
for row in table_read[:10]:
    print(json.dumps(row, indent=4))

In [45]:
table.infer()
schema = table.schema.descriptor

In [47]:
# Now, each field in the schema is a Python dict

for field in schema['fields']:
    print(type(field))
    print(field)

<class 'dict'>
{'name': 'field1', 'type': 'integer', 'format': 'default'}
<class 'dict'>
{'name': 'Transcript', 'type': 'string', 'format': 'default'}
<class 'dict'>
{'name': 'start_time', 'type': 'string', 'format': 'default'}
<class 'dict'>
{'name': 'end_time', 'type': 'string', 'format': 'default'}
<class 'dict'>
{'name': 'speaker', 'type': 'string', 'format': 'default'}
<class 'dict'>
{'name': 'IU', 'type': 'string', 'format': 'default'}
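
Because each field is a plain dict, the descriptor is easy to query. As a small sketch, the block below copies the fields printed above into a literal (named `schema_descriptor` here to avoid clashing with the notebook's `schema`) and builds a name-to-type lookup from it.

```python
# Descriptor fields copied from the output above.
schema_descriptor = {
    "fields": [
        {"name": "field1", "type": "integer", "format": "default"},
        {"name": "Transcript", "type": "string", "format": "default"},
        {"name": "start_time", "type": "string", "format": "default"},
        {"name": "end_time", "type": "string", "format": "default"},
        {"name": "speaker", "type": "string", "format": "default"},
        {"name": "IU", "type": "string", "format": "default"},
    ]
}

# Map each column name to its inferred type.
field_types = {f["name"]: f["type"] for f in schema_descriptor["fields"]}

# List the columns that were inferred as strings.
string_fields = [name for name, kind in field_types.items() if kind == "string"]
print(string_fields)
```

A lookup like this is handy for checking whether a column such as `start_time` was inferred as a plain string and may need converting before analysis.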
