# Introduction

This notebook shows the work done for the UKCEH croissant spike.

The purpose was to demonstrate extracting data from file(s) in our production catalogue (https://catalogue.ceh.ac.uk/) using a hard-coded croissant.json file.

Only example 1 worked fully: it extracts data from a single fileObject. Examples 2 and 3 show my attempts to access netcdf files with and without a login, and to use a **fileSet**, which I now believe can only be used against an archive fileObject (or against other fileSets that eventually point to an archive fileObject).

Install dependencies.

In [None]:
!pip install ipywidgets
!pip install mlcroissant
!pip install netCDF4

## Example 1 - successfully access a single csv file from archived and unarchived downloads
This demonstrates extracting columns from csv files that are downloaded as archived (zip) and unarchived (raw csv) files.

All files are freely available and need no login.

All data are available under the OGL licence.

1. Archived
   croissantSpikeZip.json is the croissant file for an archived dataset that is downloaded as a zip file.
   Full information for this dataset is [here](https://catalogue.ceh.ac.uk/documents/972599af-0cc3-4e0e-a4dc-2fab7a6dfc85).
   It contains 4 csv files for sand dune data, with up to ~30,000 rows per file.

2. Unarchived
   croissantSpikeCOSMOSEdited.json is a working version of croissantSpikeCOSMOS.json. It fixes some issues with the latter, but both have been kept for now for comparison. The working version lists 255 fileObjects, and has 1 recordSet that downloads all 44 fields from one of those fileObjects.
   Full information for this dataset is [here](https://catalogue.ceh.ac.uk/documents/399ed9b1-bf59-4d85-9832-ee4d29f49bfb).
   All downloadable files are [here](https://catalogue.ceh.ac.uk/datastore/eidchub/399ed9b1-bf59-4d85-9832-ee4d29f49bfb/).

I have not yet worked out how to extract data from a subset of the csv files, in either the archived or the unarchived downloads, using a 'fileSet' and a glob pattern. This would be very useful for archives that contain many files, and for datasets with long distribution lists.
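
To make the idea concrete, here is a minimal sketch of what a fileSet with a glob pattern is meant to select. It does not use mlcroissant and is not a tested Croissant configuration. The `file_set` fragment follows the shape of the fileSet examples in the Croissant specification, but its ids, the glob, and the csv member names in the demo zip are all hypothetical and are not taken from croissantSpikeZip.json. The selection itself is imitated with the standard library only.

```python
import csv
import fnmatch
import io
import json
import zipfile

# Hypothetical fileSet fragment (ids and glob are illustrative assumptions).
file_set = {
    "@type": "cr:FileSet",
    "@id": "fs-csv-files",
    "containedIn": {"@id": "fo-archive-zip"},
    "encodingFormat": "text/csv",
    "includes": "*_2019.csv",
}


def build_demo_zip() -> bytes:
    """Create a small in-memory zip with made-up members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("site_a_2019.csv", "id,height\n1,0.5\n2,0.7\n")
        archive.writestr("site_b_2019.csv", "id,height\n3,1.1\n")
        archive.writestr("site_a_2020.csv", "id,height\n4,0.9\n")
        archive.writestr("readme.txt", "not a csv\n")
    return buffer.getvalue()


def rows_matching(zip_bytes: bytes, pattern: str) -> dict:
    """Return {member name: list of row dicts} for members matching the glob."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if fnmatch.fnmatchcase(name, pattern):
                with archive.open(name) as member:
                    text = io.TextIOWrapper(member, encoding="utf-8", newline="")
                    result[name] = list(csv.DictReader(text))
    return result


selected = rows_matching(build_demo_zip(), file_set["includes"])
print(json.dumps(file_set, indent=2))
for name, rows in selected.items():
    print(name, len(rows), "rows")
```

Only the two members ending in `_2019.csv` are read. Whether mlcroissant resolves `includes` in exactly this way for the catalogue's archives is still something to verify.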

In [None]:
import mlcroissant as mlc

def doML(url, recordsetId):
    """Print the first 12 records of the given record set."""
    # Load the croissant metadata from a local path or URL
    dataset = mlc.Dataset(jsonld=url)
    records = dataset.records(record_set=recordsetId)
    for i, record in enumerate(records):
        print(record)
        if i > 10:
            break

# Access a zip file containing 4 csv files and extract a set of columns from one of them
doML('croissantSpikeZip.json', 'rs-abberfraw')

# Access the columns of a single csv file downloaded directly, i.e. not from an archive file
doML('croissantSpikeCOSMOSEdited.json', 'rs-cosmos')
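
The cell above previews a record set by counting records and breaking out of the loop. The same preview can be sketched with `itertools.islice`, which works on any iterable. In the sketch below, `fake_records` is a hypothetical stand-in for `dataset.records(...)`, so that the example runs without downloading anything.

```python
from itertools import islice


def first_records(records, limit=12):
    """Return at most `limit` items from any iterable of records."""
    return list(islice(records, limit))


def fake_records():
    """Hypothetical stand-in for dataset.records(...): an endless record stream."""
    n = 0
    while True:
        yield {"row": n}
        n += 1


preview = first_records(fake_records(), limit=3)
for record in preview:
    print(record)
```

Because `islice` stops pulling from the iterator once the limit is reached, no more records are read than are displayed.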


## Example 2 - successfully use credentials to download (but not extract from) netcdf

This uses credentials to access CHESS netcdf files. I have confirmed that it works with my own credentials.

Run the first cell once to set your login credentials, then run the second cell to download the data.

If required, create a new account [here](https://catalogue.ceh.ac.uk/sso/signup). You may have to agree to a licence to use the dataset in this example.

Unfortunately, I haven't yet worked out how to extract data from the netcdf files, so although the cell downloads the netcdf file, it fails to extract data from it.

The source files are hierarchically organised, starting [here](https://catalogue.ceh.ac.uk/datastore/eidchub/835a50df-e74f-4bfb-b593-804fd61d5eab/).


In [None]:
# Set credentials

import os

import ipywidgets as widgets
from IPython.display import display

# Create widgets for username and password
username = widgets.Text(description='Username:')
password = widgets.Password(description='Password:')
login_button = widgets.Button(description='Login')

# Get user credentials and set them as environment variables
def login(button):
    os.environ["CROISSANT_BASIC_AUTH_USERNAME"] = username.value
    os.environ["CROISSANT_BASIC_AUTH_PASSWORD"] = password.value

# Attach the login function to the button
login_button.on_click(login)

# Display the widgets
display("Login to UKCEH's catalogue to download data.  If required, create an account at https://catalogue.ceh.ac.uk/sso/signup", username, password, login_button)
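
Outside a notebook, where the widgets are not available, the same two environment variables can be set from plain Python. The sketch below is one possible way of doing that. The helper names `set_credentials` and `credentials_are_set` are hypothetical, and the values passed in are placeholders; in real use they could come from `input()` and `getpass.getpass()`.

```python
import os

USERNAME_VAR = "CROISSANT_BASIC_AUTH_USERNAME"
PASSWORD_VAR = "CROISSANT_BASIC_AUTH_PASSWORD"


def set_credentials(username: str, password: str) -> None:
    """Store the login details in the environment variables used above."""
    if not username or not password:
        raise ValueError("Both username and password are required")
    os.environ[USERNAME_VAR] = username
    os.environ[PASSWORD_VAR] = password


def credentials_are_set() -> bool:
    """Check that both variables exist and are non-empty."""
    return bool(os.environ.get(USERNAME_VAR)) and bool(os.environ.get(PASSWORD_VAR))


# Placeholder values only; replace with real input before downloading data.
set_credentials("demo-user", "demo-password")
print(credentials_are_set())
```

Rejecting empty values up front avoids a confusing download failure later when the Login button is clicked before both fields are filled in.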


In [None]:
import mlcroissant as mlc
import netCDF4 as nc

def doML(url, recordsetId):
    dataset = mlc.Dataset(jsonld=url)
    records = dataset.records(record_set=recordsetId)
    # The loop below fails: I can't work out what the 'dataType' or the 'extract'
    # should be in the recordSet of the croissant file. Time to use an easier
    # csv example dataset!
    for i, record in enumerate(records):
        print(record)
        if i > 10:
            break

doML('croissantSpikeChess.json', 'rs-abberfraw')