# Introduction

This notebook records the work done for the UKCEH croissant spike.

The purpose was to demonstrate extracting data from files on our production catalogue (https://catalogue.ceh.ac.uk/) using a hard-coded croissant.json file.

Only example 1 worked; it shows extraction from a single fileObject. Examples 2 and 3 show my attempts to access netcdf files with and without a login, and to use a **fileSet**, which I now believe is only used against an archive fileObject (or other fileSets that eventually point to an archive fileObject).

Install the dependencies used across all cells of this investigation.

In [None]:
!pip install requests
!pip install ipywidgets
!pip install mlcroissant netCDF4

## Example 1 - successfully access a single csv file from archived and unarchived downloads
This demonstrates extracting columns from csv files that are downloaded as archived (zip) and unarchived (raw csv) files.

All files are freely available and need no login.

All data are available under the OGL licence.

1. Archived
   croissantSpikeZip.json is the croissant file for an archived dataset that is downloaded as a zip file.
   Full information for this dataset is [here](https://catalogue.ceh.ac.uk/documents/972599af-0cc3-4e0e-a4dc-2fab7a6dfc85).
   It contains 4 csv files for sand dune data, with up to ~30,000 rows per file.

2. Unarchived
   croissantSpikeCOSMOSSingle.json is a croissant file that points to a couple of raw csv download files.
   Full information for this dataset is [here](https://catalogue.ceh.ac.uk/documents/399ed9b1-bf59-4d85-9832-ee4d29f49bfb).
   Only a couple of csv files are used from the full set of about 1300 csv links available [here](https://catalogue.ceh.ac.uk/datastore/eidchub/399ed9b1-bf59-4d85-9832-ee4d29f49bfb/).

I have not yet worked out how to extract data from a subset of the csv files in the archive (zip) file using a 'fileSet' and glob pattern. This would be very useful for archive files that contain many files.
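
As a starting point for that open question, here is a minimal sketch of what a 'fileSet' node with a glob pattern might look like, written as a Python dict that mirrors the JSON-LD. The ids, the pattern and the file names are all hypothetical, and Python's `fnmatch` is used only to preview which names a pattern would select; it is an assumption that this approximates how mlcroissant resolves `includes`, and it has not been tested against the spike's croissant files.

```python
import fnmatch

# Hypothetical fileSet node; none of these ids or patterns come from the spike's files.
file_set = {
    "@type": "cr:FileSet",
    "@id": "fs-site-csvs",                   # hypothetical id
    "containedIn": {"@id": "archive-zip"},   # hypothetical id of the zip fileObject
    "encodingFormat": "text/csv",
    "includes": "site_*.csv",                # hypothetical glob pattern
}


def preview_matches(file_names, pattern):
    """Return the names a glob pattern would select (case-sensitive match)."""
    return [name for name in file_names if fnmatch.fnmatchcase(name, pattern)]


# Hypothetical archive listing, used only to illustrate the pattern.
archive_listing = ["site_a.csv", "site_b.csv", "readme.txt"]
selected = preview_matches(archive_listing, file_set["includes"])
print(selected)
```

Previewing the pattern against a listing of the archive's contents is a cheap way to check a glob before wiring it into a recordSet.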

In [506]:
import mlcroissant as mlc

def doML(url, recordsetId):
    # Load the croissant metadata, then stream records from the named recordSet.
    dataset = mlc.Dataset(jsonld=url)
    records = dataset.records(record_set=recordsetId)
    for i, record in enumerate(records):
        print(record)
        if i > 10:  # stop once 12 records have been printed
            break

# Access a zip file containing 4 csv files and extract a set of columns from one of them.
doML('croissantSpikeZip.json', 'rs-abberfraw')

# Access columns of a single csv downloaded directly, i.e. not in an archive file.
doML('croissantSpikeCOSMOSSingle.json', 'rs-one-file')


{'id': 91619, 'X': 235341.25, 'Y': 368183.75, 'Aspect': 147.3955898, 'Slope': 5.947185636, 'WindSpeed': 1.552845}
{'id': 91620, 'X': 235341.25, 'Y': 368181.25, 'Aspect': 183.3925629, 'Slope': 7.696038723, 'WindSpeed': 1.6105891}
{'id': 91621, 'X': 235341.25, 'Y': 368178.75, 'Aspect': 174.296432, 'Slope': 5.170789957, 'WindSpeed': 1.5677601}
{'id': 91622, 'X': 235341.25, 'Y': 368176.25, 'Aspect': 264.8109093, 'Slope': 2.708107293, 'WindSpeed': 1.4615709}
{'id': 91623, 'X': 235341.25, 'Y': 368173.75, 'Aspect': 172.1959329, 'Slope': 11.81059909, 'WindSpeed': 1.4316834}
{'id': 91624, 'X': 235341.25, 'Y': 368171.25, 'Aspect': 255.1948204, 'Slope': 7.329361081, 'WindSpeed': 1.814988}
{'id': 91625, 'X': 235341.25, 'Y': 368168.75, 'Aspect': 206.2561874, 'Slope': 6.884741783, 'WindSpeed': 1.6822661}
{'id': 91626, 'X': 235341.25, 'Y': 368166.25, 'Aspect': 237.1464386, 'Slope': 8.261778951, 'WindSpeed': 1.4579354}
{'id': 91627, 'X': 235341.25, 'Y': 368163.75, 'Aspect': 240.042305, 'Slope': 5.5238

## Example 2 - fail to handle freely available netcdf files
This example shows the work I did to try (and fail) to access netcdf files that are freely available - no login required.

It tries to access the netcdf files of the Hadukgrid dataset, which contains links to many netcdf files [here](https://catalogue.ceh.ac.uk/datastore/eidchub/beb62085-ba81-480c-9ed0-2d31c27ff196/).

Files have to be downloaded individually as there is no archive file available.

I could not work out how to define a netcdf 'dataType' or get the 'extract' working in the recordSet, and I don't think I've got the fileSet correct. In hindsight, I could change it to access individual netcdf files successfully (as in example 1), but it would still not handle the netcdf format.


In [None]:
import mlcroissant as mlc
import netCDF4 as nc
   
def doML(url, recordsetId):
    dataset = mlc.Dataset(jsonld=url)
    records = dataset.records(record_set=recordsetId)
    # It fails on the next line because I can't work out what the 'dataType' or the
    # 'extract' should be in the recordSet of the croissant file.
    # Time to use an easier csv example dataset!
    for i, record in enumerate(records):
        print(record)
        if i > 10:
            break

doML('croissantSpikeHadukgrid.json', 'rs/file-set-dtr-preOct')
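
While the recordSet extraction is unresolved, one possible fallback is to inspect a downloaded netcdf file directly with netCDF4, bypassing croissant for the read step. The sketch below is an assumption about how that could look, not something tested in this spike; the helper names and the file path in the comment are hypothetical.

```python
def describe_variables(ds):
    """Return {variable name: (dimensions, shape)} for an open dataset.

    Works with any object exposing a netCDF4-style `.variables` mapping.
    """
    return {
        name: (tuple(var.dimensions), tuple(var.shape))
        for name, var in ds.variables.items()
    }


def describe_file(path):
    """Open a local netcdf file read-only and summarise its variables."""
    # Imported here so describe_variables can be used without netCDF4 installed.
    import netCDF4 as nc

    with nc.Dataset(path, "r") as ds:
        return describe_variables(ds)


# Hypothetical usage, once a file has been downloaded locally:
# print(describe_file("downloaded_file.nc"))
```

Listing the dimensions and shape of each variable first gives a clearer idea of what a recordSet field would need to extract.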


## Example 3 - using credentials

This uses credentials to access CHESS netcdf files. I have confirmed that the download works using my credentials.

If required, create a new account [here](https://catalogue.ceh.ac.uk/sso/signup). You may have to agree to a licence to use the dataset in this example.

Unfortunately, I haven't yet worked out how to work with netcdf files, so whilst it downloads the netcdf file, it fails to extract data.

The source files are hierarchically organised, starting [here](https://catalogue.ceh.ac.uk/datastore/eidchub/835a50df-e74f-4bfb-b593-804fd61d5eab/).


In [None]:
import os

import mlcroissant as mlc
import netCDF4 as nc

# Replace these placeholders with your own catalogue credentials.
os.environ["CROISSANT_BASIC_AUTH_USERNAME"] = "myusername"
os.environ["CROISSANT_BASIC_AUTH_PASSWORD"] = "mypassword"

def doML(url, recordsetId):
    dataset = mlc.Dataset(jsonld=url)
    records = dataset.records(record_set=recordsetId)
    # It fails on the next line because I can't work out what the 'dataType' or the
    # 'extract' should be in the recordSet of the croissant file.
    # Time to use an easier csv example dataset!
    for i, record in enumerate(records):
        print(record)
        if i > 10:
            break

doML('croissantSpikeChess.json', 'rs-abberfraw')
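
To avoid leaving real credentials in a saved notebook, the two environment variables could be filled in from a prompt instead. This is a minimal sketch; `set_basic_auth` and `prompt_for_credentials` are hypothetical helper names, and only the environment variable names come from the cell above.

```python
import os
from getpass import getpass


def set_basic_auth(username, password, env=os.environ):
    """Store the credentials in the environment variables used in the cell above."""
    env["CROISSANT_BASIC_AUTH_USERNAME"] = username
    env["CROISSANT_BASIC_AUTH_PASSWORD"] = password
    return env


def prompt_for_credentials():
    """Ask for the credentials interactively so they are never saved in the notebook."""
    return set_basic_auth(input("Username: "), getpass("Password: "))


# Hypothetical usage in a notebook cell, before calling doML:
# prompt_for_credentials()
```

Passing `env` explicitly also makes it possible to try the helper against a plain dict without touching the real environment.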