# Introduction

This notebook records the work done for the UKCEH croissant spike.

The purpose was to demonstrate extracting data from file(s) on our production catalogue (https://catalogue.ceh.ac.uk/) using a hard-coded croissant.json file.

Only example 1 worked; it shows extraction for a single fileObject. Examples 2 and 3 show my attempts to access netcdf files with and without a login, and to use a **fileSet**, which I now believe only works against an archive fileObject (or against other fileSets that eventually point to an archive fileObject).

Install the dependencies used by all cells in this investigation.

In [None]:
!pip install requests
!pip install ipywidgets
!pip install mlcroissant
!pip install netCDF4

## Example 1 - successfully access a single csv file
This demonstrates the download and extraction of data from a COSMOS csv file that is available on UKCEH's production catalogue.

There are links to 1300 COSMOS csv files [here](https://catalogue.ceh.ac.uk/datastore/eidchub/399ed9b1-bf59-4d85-9832-ee4d29f49bfb/).

These files are freely available and need no login.

I have not worked out how to select them with a glob pattern in a FileSet. I think all the files need to be downloaded as a single archive file (zip), which would mean you can't be selective about which files you download when doing ML, i.e. you have to download them all in an archive file first (I hope I'm wrong).
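
If the files do have to arrive as one archive, the selection can still happen locally once the archive is downloaded. The sketch below is one possible way to read only the archive members that match a glob, using the standard library. `read_matching_members` and the member names are made up for illustration, and the archive is built in memory, so nothing here reflects the real COSMOS file layout.

```python
import fnmatch
import io
import zipfile


def read_matching_members(archive, pattern):
    """Return {member_name: bytes} for archive members whose name matches the glob."""
    with zipfile.ZipFile(archive) as zf:
        names = [name for name in zf.namelist() if fnmatch.fnmatch(name, pattern)]
        return {name: zf.read(name) for name in names}


# Build a tiny in-memory archive so the sketch runs without any download.
# The member names are hypothetical, not real COSMOS file names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data/site_a_daily.csv", "date,value\n2020-01-01,1.0\n")
    zf.writestr("data/site_b_daily.csv", "date,value\n2020-01-01,2.0\n")
    zf.writestr("docs/readme.txt", "supporting documentation\n")
buffer.seek(0)

# Only the two csv members are read; the readme is skipped.
selected = read_matching_members(buffer, "data/*_daily.csv")
print(sorted(selected))
```

This does not avoid the full download; it only limits what is read afterwards.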

In [None]:
import mlcroissant as mlc

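def doML(url, recordsetId):
    # Load the croissant metadata from the local JSON-LD file
    dataset = mlc.Dataset(jsonld=url)
    # Get an iterable over the records of the requested recordSet
    records = dataset.records(record_set=recordsetId)
    print(records)
    for i, record in enumerate(records):
        print(record)
        # Stop after the first 12 records
        if i > 10:
            break
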
doML('croissantSpikeCOSMOSSingle.json', 'rs-one-file')


## Example 2 - fail to handle freely available netcdf files
This example shows the work I did to try (and fail) to access netcdf files that are freely available - no login required.

It tries to access the netcdf files of the Hadukgrid dataset, which contains links to many netcdf files [here](https://catalogue.ceh.ac.uk/datastore/eidchub/beb62085-ba81-480c-9ed0-2d31c27ff196/).

Files have to be downloaded individually as there is no archive file available.
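
Because there is no archive, each file has to be fetched on its own. A minimal sketch of a streamed single-file download using only the standard library might look like this. `download_to` is a hypothetical helper, and the demo copies a local placeholder file through a `file://` URL so that it runs without network access; a real run would pass one of the catalogue's file URLs instead.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_to(url, dest_dir):
    """Stream one file from `url` into `dest_dir` and return the local path."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks rather than holding the whole file in memory
        shutil.copyfileobj(response, out)
    return dest


payload = b"placeholder bytes, not a real netCDF file"
with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a remote file: a local file addressed by a file:// URL
    source = Path(tmp) / "src" / "example.nc"
    source.parent.mkdir()
    source.write_bytes(payload)

    target_dir = Path(tmp) / "downloads"
    target_dir.mkdir()

    local_path = download_to(source.as_uri(), target_dir)
    downloaded_size = local_path.stat().st_size
    print(local_path.name, downloaded_size)
```

The downloaded file could then be opened with `netCDF4`, outside of croissant.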

I could not work out how to define a netcdf 'dataType' or get the 'extract' working in the recordSet, and I don't think I got the fileSet right either. In hindsight, I could change the example to access individual netcdf files successfully (as in example 1), but it would still not handle the netcdf format.


In [None]:
import mlcroissant as mlc
import netCDF4 as nc
   
def doML(url, recordsetId):
    dataset = mlc.Dataset(jsonld=url)
    records = dataset.records(record_set=recordsetId)
    # The loop below fails: I can't work out what the 'dataType' or the 'extract'
    # should be in the recordSet of the croissant file.
    # Time to use an easier csv example dataset!
    for i, record in enumerate(records):
        print(record)
        if i > 10:
            break
      
doML('croissantSpikeHadukgrid.json', 'rs/file-set-dtr-preOct')


## Example 3 - fail to access netcdf files protected by login
This example shows the work I did to try to access CHESS netcdf files that require a login to access the files.

It shows that mlcroissant does NOT support credentials or a session token.

Instead, I show how to traverse the json file and access the files - which is not really in the spirit of using croissant!

The source files are hierarchically organised, starting [here](https://catalogue.ceh.ac.uk/datastore/eidchub/835a50df-e74f-4bfb-b593-804fd61d5eab/).

You can create an account for the login [here](https://catalogue.ceh.ac.uk/sso/signup).
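
To make the traversal concrete, here is a minimal sketch of the lookup chain (recordSet to fileSet to `cr:includes` glob to matching fileObject URLs) run against a tiny, hand-written dictionary. The key names mirror the ones read by the code in this example; the IDs, URLs and glob are invented, `resolve_urls` is a hypothetical helper, and the real `croissantSpikeChess.json` may be shaped differently.

```python
import fnmatch

# Hypothetical, heavily trimmed metadata: only the keys the traversal reads are kept.
metadata = {
    "distribution": [
        {"@type": "cr:FileObject", "@id": "fo-1961-01",
         "contentUrl": "https://example.org/data/dtr/chess_dtr_196101.nc"},
        {"@type": "cr:FileObject", "@id": "fo-1961-02",
         "contentUrl": "https://example.org/data/dtr/chess_dtr_196102.nc"},
        {"@type": "cr:FileSet", "@id": "file-set-dtr-january1961",
         "cr:includes": "dtr/*196101.nc"},
    ],
    "recordSet": [
        {"@id": "rs/file-set-dtr-january1961",
         "field": [{"cr:source": {"cr:fileSet": {"@id": "file-set-dtr-january1961"}}}]},
    ],
}


def resolve_urls(metadata, record_set_id):
    """Follow recordSet -> fileSet -> includes glob -> matching FileObject URLs."""
    record_set = next(rs for rs in metadata.get("recordSet", [])
                      if rs["@id"] == record_set_id)
    fileset_id = record_set["field"][0]["cr:source"]["cr:fileSet"]["@id"]
    includes = next(item["cr:includes"] for item in metadata["distribution"]
                    if item["@id"] == fileset_id)
    # Prefix the glob with '*' so it can match anywhere in the full URL
    return [item["contentUrl"] for item in metadata["distribution"]
            if item["@type"] == "cr:FileObject"
            and fnmatch.fnmatch(item["contentUrl"], f"*{includes}")]


urls = resolve_urls(metadata, "rs/file-set-dtr-january1961")
print(urls)
```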

In [None]:
import ipywidgets as widgets
from IPython.display import display
import requests
import json
import mlcroissant as mlc
import fnmatch
import netCDF4 as nc

# Download a file from a url using credentials (specified in 'session' parameter)
def download_file(session, file_url, save_path):
    response = session.get(file_url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Save the file locally
        with open(save_path, "wb") as file:
            file.write(response.content)
        print(f"File downloaded successfully: {save_path}")
    else:
        print("Failed to download the file. Status code:", response.status_code)

# Function to get a key's value from a fileSet
def get_fileset_key_value(fileset_id, key, metadata):
    for item in metadata.get("distribution", []):
        if item["@id"] == fileset_id:
            return item.get(key)
    return None

# Get all the file download urls for all the fileObjects in the 'distribution' that match the pattern
def get_file_urls(pattern, metadata):
    objects = metadata.get("distribution", [])
    return [obj.get('contentUrl') for obj in objects if obj['@type'] == 'cr:FileObject' and fnmatch.fnmatch(obj.get('contentUrl'), pattern)]
   
# Do the machine learning stuff - ie read a croissant file and download files from a RecordSet
# IMPORTANT: the implementation has been painful because mlcroissant doesn't support credentials or session tokens
# This meant that most of the time is spent traversing json (sigh!) rather than using the convenient methods of mlcroissant
# See block of comments lower down to see how easy it could otherwise be
def doML(session):
    url = "croissantSpikeChess.json"
    dataset = mlc.Dataset(jsonld=url)
    
    # Since you can't use mlcroissant to automagically get the files (see the login issue mentioned below),
    # you have to get croissant's metadata and loop through it yourself (sigh!)
    metadata = dataset.metadata.to_json()
    record_sets = metadata.get("recordSet",[])
    # print(record_sets)
    for record_set in record_sets:
        if record_set['@id'] == 'rs/file-set-dtr-january1961':
            field = record_set['field'][0]
            # Assumes the source is a cr:fileSet and not a cr:fileObject; this should really be tested
            sourceFileset = field['cr:source']['cr:fileSet']['@id']
            # Now get the glob that defines the fileObjects to return
            includes = get_fileset_key_value(sourceFileset, 'cr:includes', metadata)

            # Now get required netcdf files using the 'includes' glob
            file_urls = get_file_urls(f'*{includes}', metadata)

            # Download the files from the urls
            for file_url in file_urls:
                filename = file_url.rsplit('/', 1)[-1]
                download_file(session, file_url, filename)

            # Use the files
            for file_url in file_urls:
                filename = file_url.rsplit('/', 1)[-1]
                # Open read-only; the context manager closes the file afterwards
                with nc.Dataset(filename, 'r') as ncdataset:
                    # Print the dimensions
                    for dim_name, dim in ncdataset.dimensions.items():
                        print(f"{dim_name}: {len(dim)}")
                    # Print the variables
                    print("\nVariables:")
                    for var_name, var in ncdataset.variables.items():
                        print(f"{var_name}: {var.dimensions}")
                    # Query some data, for example from a variable named 'temperature':
                    # temperature_data = ncdataset.variables['temperature'][:]


                
# # Typically this is how you would access data using mlcroissant
# # However, mlcroissant DOES NOT SUPPORT CREDENTIALS OR AUTH TOKENS
# # Also, it automatically tries to access the datasets when using 'dataset.records()' - 2nd line below
# # So, it 'automatically' fails since it is trying to access endpoints protected by authentication it can't handle
# # This makes using mlcroissant problematic if there isn't a solution I've missed
# # I think ML libraries that make use of mlcroissant will not work because of this
# # Instead we need to traverse the json of 'dataset' object - which is the solution implemented above
# dataset = mlc.Dataset(jsonld="phils-croissant-doctored.json")
# records = dataset.records(record_set="file-set-dtr-january1961") # This line tries to call the endpoint without authentication
# for i, record in enumerate(records):
#   print(record)
#   if i > 10:
#     break
      
# Create widgets for username and password
username = widgets.Text(description='Username:')
password = widgets.Password(description='Password:')
login_button = widgets.Button(description='Login')

# Get user credentials and pass to the machine learning function
def login(button):
    # Create a session that sends the entered credentials with every request
    session = requests.Session()
    session.auth = (username.value, password.value)
    doML(session)

# Attach the login function to the button
login_button.on_click(login)

# Display the widgets
display("Login to UKCEH's catalogue to download data.  If required, create an account at https://catalogue.ceh.ac.uk/sso/signup", username, password, login_button)
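
Assigning a `(username, password)` tuple to `session.auth` makes `requests` use HTTP Basic authentication. As a rough illustration of what that adds to each request, the sketch below builds the equivalent `Authorization` header with the standard library. `basic_auth_header` is a hypothetical helper and the credentials are placeholders; the result should match what `requests` sends for plain ASCII credentials.

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header for the given credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials, not a real account
header = basic_auth_header("user", "pass")
print(header)
```

Basic authentication only encodes the credentials, it does not encrypt them, so it should only be sent over HTTPS.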
