# ParData Demo

This notebook showcases early ParData functionality. 

  * GitHub repo: https://github.com/CODAIT/pardata
  * Docs: https://pardata.readthedocs.io/en/latest/

ParData is a Python API that simplifies downloading and loading datasets by describing them with schemata. Users can rely on the default schemata or supply their own for the package to load metadata from. This means ParData can load virtually any dataset, regardless of data format or directory structure, in just a few lines of Python code. For example, to load the [WikiText-103](https://developer.ibm.com/exchanges/data/all/wikitext-103/) dataset (included in the default ParData dataset schema), we only need to run:

```python
>>> import pardata
>>> wikitext103_data = pardata.load_dataset('wikitext103')  # load the dataset (download if not yet downloaded)
>>> print(wikitext103_data['train'][:500])  # preview the training split subdataset
```
```text
 = Valkyria Chronicles III = 
 
 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs par
```

Out of the box, ParData comes with data loaders that support common data formats, while the dataset's directory structure is specified by the user in the dataset schema. As a result, a user who wants to share a dataset with the world can bundle it with a ParData dataset schema. Doing so allows anybody to securely download and load the dataset without running any messy or potentially insecure scripts.

## 1. Install & Load ParData

In [1]:
%pip install pardata

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple/
Collecting pardata
  Using cached pardata-0.1.0-py3-none-any.whl (43 kB)
Installing collected packages: pardata
Successfully installed pardata-0.1.0
You should consider upgrading via the '/opt/rh/rh-python38/root/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import ParData
import pardata
from pardata.dataset import Dataset
from pardata.loaders import FormatLoaderMap, Loader

In [3]:
# Import other packages
import pathlib

In [4]:
# Check the current version
pardata.__version__

'0.1.0'

In [5]:
# Peek at package makeup
[symbol for symbol in dir(pardata) if not symbol.startswith('_')]

['dataset',
 'describe_dataset',
 'exceptions',
 'export_schema_collections',
 'get_config',
 'get_dataset_metadata',
 'init',
 'list_all_datasets',
 'load_dataset',
 'load_schema_collections',
 'loaders',
 'schema',
 'typing']

## 2. Beginner Functionality

Beginner users of ParData will use the package's high-level functions, which provide the core functionality (downloading and loading datasets) with minimal configuration. In this section we'll learn the easiest way to download and load datasets from a dataset schema by exploring the following high-level functions: `get_config`, `list_all_datasets`, `describe_dataset`, `get_dataset_metadata`, `load_dataset`, and `init`.


Use the `get_config` function to view the library's global configs. Currently the configs stored are:
  * `DATADIR`: The default data directory used for downloading/loading datasets.
  * `DATASET_SCHEMA_FILE_URL`: The URL for the default dataset schema to use. This schema is used to provide all necessary metadata for downloading/loading datasets (e.g. dataset download URL, dataset data format, dataset directory structure). We'll take a closer look at this schema later in the notebook.
  * `FORMAT_SCHEMA_FILE_URL`: The URL for the default format schema to use. This schema is used to provide extra metadata regarding dataset formats.
  * `LICENSE_SCHEMA_FILE_URL`: The URL for the default license schema to use. This schema is used to provide extra metadata regarding dataset licenses.
  
The default schemata are currently stored in: https://github.com/CODAIT/dax-schemata

In [6]:
# View default config settings
print('ParData default datadir: ', pardata.get_config().DATADIR)
print('ParData dataset schema URL: ', pardata.get_config().DATASET_SCHEMA_FILE_URL)
print('ParData format schema URL: ', pardata.get_config().FORMAT_SCHEMA_FILE_URL)
print('ParData license schema URL: ', pardata.get_config().LICENSE_SCHEMA_FILE_URL)

ParData default datadir:  /home/hong/.pardata/data
ParData dataset schema URL:  https://raw.githubusercontent.com/CODAIT/dax-schemata/master/datasets.yaml
ParData format schema URL:  https://raw.githubusercontent.com/CODAIT/dax-schemata/master/formats.yaml
ParData license schema URL:  https://raw.githubusercontent.com/CODAIT/dax-schemata/master/licenses.yaml


Use the `list_all_datasets` function to view all available datasets and their versions as defined in the dataset schema loaded from `pardata.get_config().DATASET_SCHEMA_FILE_URL`. The names and versions are allowable arguments to pass to the `name` and `version` parameters used by other ParData high-level functions such as `load_dataset()`.

In [7]:
# List all default datasets available to download using ParData (by default, these will be datasets from IBM's Data Asset Exchange: https://ibm.biz/data-exchange)
pardata.list_all_datasets()

{'claim_sentences_search': ('1.0.2',),
 'concept_abstractness': ('1.0.2',),
 'expert-in-the-loop-ai-polymer-discovery': ('1.0.0',),
 'gmb': ('1.0.2',),
 'noaa_jfk': ('1.1.4',),
 'taranaki-basin-curated-well-logs': ('1.0.0',),
 'thematic_clustering_of_sentences': ('1.0.2',),
 'wikipedia_category_stance': ('1.0.2',),
 'wikipedia_oriented_relatedness': ('1.0.2',),
 'wikitext103': ('1.0.1',)}
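
The dictionary returned by `list_all_datasets` maps each dataset name to a tuple of version strings, so it can be used to validate a name and version before calling `load_dataset`. The sketch below works on a hard-coded copy of part of the output above. `pick_version` is a hypothetical helper, not part of ParData, and it assumes purely numeric dotted version strings.

```python
from typing import Dict, Optional, Tuple

# Copied from part of the list_all_datasets() output above (not fetched live).
available: Dict[str, Tuple[str, ...]] = {
    'gmb': ('1.0.2',),
    'noaa_jfk': ('1.1.4',),
    'wikitext103': ('1.0.1',),
}


def pick_version(datasets: Dict[str, Tuple[str, ...]], name: str,
                 version: Optional[str] = None) -> str:
    """Hypothetical helper: return a valid version string for ``name``.

    If ``version`` is None, the highest available version is chosen by
    comparing the dotted version strings numerically.
    """
    if name not in datasets:
        raise KeyError(f'Unknown dataset: {name!r}')
    versions = datasets[name]
    if version is None:
        return max(versions, key=lambda v: tuple(int(p) for p in v.split('.')))
    if version not in versions:
        raise ValueError(f'{name!r} has no version {version!r}; available: {versions}')
    return version


print(pick_version(available, 'noaa_jfk'))              # 1.1.4
print(pick_version(available, 'wikitext103', '1.0.1'))  # 1.0.1
```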

Use the `describe_dataset` function to view a dataset's metadata in human-readable form. If you want the raw dictionary representation of the dataset's schema instead, use the `get_dataset_metadata` function.

In [8]:
# Print the dataset WikiText-103's metadata in human-readable format
print(pardata.describe_dataset('wikitext103'))

Dataset name: WikiText-103
Homepage: https://developer.ibm.com/exchanges/data/all/wikitext-103/
Description: The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified ‘Good’ and ‘Featured’ articles on Wikipedia.
Size: 181M
Published date: 2020-03-17
License: Creative Commons Attribution 3.0 Unported
Available subdatasets: train, valid, test


Load a dataset into memory as a dict composed of subdatasets using `load_dataset`. By default, `load_dataset` will download the dataset to your default data directory if it is not already present there. If the `version` parameter is not provided, the dataset's latest version specified in the dataset schema is assumed. In this example, ParData will download the WikiText-103 dataset (version 1.0.1) to `~/.pardata/data/default/wikitext103/1.0.1/` and load its subdatasets into `wikitext103_data`.

In [9]:
# Load WikiText-103. Since it hasn't been downloaded yet, load_dataset will automatically download and unarchive the dataset before loading it.
# WikiText-103 is 181 MB, so this cell may take a little while to complete the first time it is run.
wikitext103_data = pardata.load_dataset('wikitext103')

In [10]:
# Show available WikiText-103 subdatasets (subdatasets are ParData's way of modeling various dataset directory structures)
wikitext103_data.keys()

dict_keys(['train', 'valid', 'test'])

In [11]:
# ParData loads plaintext datasets into strings by default
# WikiText-103 is a plaintext dataset composed of Wikipedia articles; let's take a peek at its validation subdataset
wikitext103_valid = wikitext103_data['valid']
print(wikitext103_valid[:730])

 
 = Homarus gammarus = 
 
 Homarus gammarus , known as the European lobster or common lobster , is a species of clawed lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into planktonic larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . 
 


In [12]:
# Show datadir structure
!du -ah ~/.pardata

4.0K	/home/hong/.pardata/data/default/wikitext103/1.0.1/.pardata.dataset/files.list
8.0K	/home/hong/.pardata/data/default/wikitext103/1.0.1/.pardata.dataset
1.1M	/home/hong/.pardata/data/default/wikitext103/1.0.1/wikitext-103/wiki.valid.tokens
515M	/home/hong/.pardata/data/default/wikitext103/1.0.1/wikitext-103/wiki.train.tokens
24K	/home/hong/.pardata/data/default/wikitext103/1.0.1/wikitext-103/LICENSE.txt
1.3M	/home/hong/.pardata/data/default/wikitext103/1.0.1/wikitext-103/wiki.test.tokens
4.0K	/home/hong/.pardata/data/default/wikitext103/1.0.1/wikitext-103/README.txt
517M	/home/hong/.pardata/data/default/wikitext103/1.0.1/wikitext-103
517M	/home/hong/.pardata/data/default/wikitext103/1.0.1
517M	/home/hong/.pardata/data/default/wikitext103
517M	/home/hong/.pardata/data/default
517M	/home/hong/.pardata/data
517M	/home/hong/.pardata
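
The listing above suggests that the high-level functions store each dataset under `<DATADIR>/default/<dataset-name>/<dataset-version>/`. The sketch below rebuilds that path with `pathlib`. The layout is inferred from this notebook's output only, and `expected_dataset_dir` is a hypothetical helper rather than a ParData function.

```python
from pathlib import PurePosixPath


def expected_dataset_dir(datadir: PurePosixPath, name: str, version: str) -> PurePosixPath:
    """Hypothetical helper mirroring the layout seen in the ``du`` output above."""
    # 'default' is the intermediate directory observed in the listing; whether
    # other values are possible is not covered by this notebook.
    return datadir / 'default' / name / version


path = expected_dataset_dir(PurePosixPath('/home/hong/.pardata/data'), 'wikitext103', '1.0.1')
print(path)  # /home/hong/.pardata/data/default/wikitext103/1.0.1
```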


By default, `load_dataset` downloads to and loads from `~/.pardata/data/default/<dataset-name>/<dataset-version>/`, as the listing above shows. To change the default data directory (or any other global config), use `pardata.init`.

In [13]:
# Change default datadir to new-pardata-datadir
new_pardata_datadir_path = pathlib.Path.home() / 'new-pardata-datadir'
pardata.init(DATADIR=new_pardata_datadir_path)  # pass global configs to change to init() as kwargs
pardata.get_config().DATADIR

PosixPath('/home/hong/new-pardata-datadir')

## 3. Advanced Functionality

In this section, we cover ParData's low-level features, which prioritize flexibility. We'll take a look at the low-level functionality of the package by working with the `Dataset`, `SchemaManager`, `Schema`, `Loader`, and `FormatLoaderMap` classes.

### 3.1 Dataset Class

The main class of ParData is `pardata.dataset.Dataset`, which models a dataset. High-level functions use this class behind the scenes; however, users who want access to more advanced features may want to interact with the class directly.

`Dataset` requires `schema` (a dataset schema) and `data_dir` arguments to download and load a dataset. Let's first extract a *dataset* schema from our default *datasets* schema. To do this, we'll use the `export_schema_collections` function, which returns copies of our datasets, licenses, and formats schemata as `Schema` objects. These copies are stored in a `SchemaManager` object as a dictionary accessible via the `schema_collections` attribute. To extract a certain portion of a schema, call the `Schema` object's `export_schema` method and supply it with the sequence of keys leading to the portion to be exported.

In [14]:
# Export the default pardata schemata and extract the NOAA JFK Weather version 1.1.4 dataset schema from the datasets schema
schema_manager = pardata.export_schema_collections()  # export copies of the datasets, licenses, and formats Schema objects into a SchemaManager object
datasets_schema = schema_manager.schema_collections['datasets']  # extract the datasets Schema
jfk_schema = datasets_schema.export_schema('datasets', 'noaa_jfk', '1.1.4')  # extract NOAA JFK Weather dataset schema version 1.1.4
jfk_schema

{'name': 'NOAA Weather Data – JFK Airport',
 'published': datetime.date(2019, 9, 12),
 'homepage': 'https://developer.ibm.com/exchanges/data/all/jfk-weather-data/',
 'download_url': 'https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-data-jfk-airport.tar.gz',
 'sha512sum': 'e3f27a8fcc0db5289df356e3f48aef6df56236798d5b3ae3889d358489ec6609d2d797e4c4932b86016d2ce4a379ac0a0749b6fb2c293ebae4e585ea1c8422ac',
 'license': 'CDLA-Sharing-1.0',
 'estimated_size': '3.2M',
 'description': 'The NOAA JFK dataset contains 114,546 hourly observations of various local climatological variables (including visibility, temperature, wind speed and direction, humidity, dew point, and pressure). The data was collected by a NOAA weather station located at the John F. Kennedy International Airport in Queens, New York.',
 'subdatasets': {'jfk_weather_cleaned': {'name': 'Cleaned JFK Weather Data',
   'description': 'Cleaned version of the JFK weather data.',
   'format': {'id'
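
Conceptually, exporting part of a schema by a sequence of keys amounts to walking a nested dictionary and returning a copy of whatever the last key points to. The sketch below illustrates the idea on a toy dictionary. `export_by_keys` and the toy schema are hypothetical and are not ParData's implementation.

```python
import copy
from typing import Any, Dict


def export_by_keys(schema: Dict[str, Any], *keys: str) -> Any:
    """Walk ``schema`` along ``keys`` and return a deep copy of the result."""
    node: Any = schema
    for key in keys:
        node = node[key]  # raises KeyError if a key is missing
    return copy.deepcopy(node)


# Toy schema with a few fields shaped like the output above.
toy_schema = {
    'datasets': {
        'noaa_jfk': {
            '1.1.4': {'license': 'CDLA-Sharing-1.0', 'estimated_size': '3.2M'},
        },
    },
}

jfk = export_by_keys(toy_schema, 'datasets', 'noaa_jfk', '1.1.4')
jfk['estimated_size'] = 'changed'  # modifying the copy leaves the original untouched
print(toy_schema['datasets']['noaa_jfk']['1.1.4']['estimated_size'])  # 3.2M
```

Returning a copy rather than a reference means callers can edit an exported schema without affecting the schema it came from.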

The `Dataset` class also accepts an optional `mode` for how to instantiate the `Dataset`. Available modes include:
- `LAZY` (the default mode: initializes the `Dataset` without downloading or loading)
- `DOWNLOAD_ONLY`
- `LOAD_ONLY`
- `DOWNLOAD_AND_LOAD`
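
One way to picture these modes is as an enumeration that controls which steps run at construction time. The sketch below is a toy stand-in, not ParData's `Dataset`: `ToyDataset` and its `log` attribute are hypothetical, and its `download` and `load` methods only record that they were called.

```python
import enum
from typing import List


class InitializationMode(enum.Enum):
    LAZY = enum.auto()
    DOWNLOAD_ONLY = enum.auto()
    LOAD_ONLY = enum.auto()
    DOWNLOAD_AND_LOAD = enum.auto()


class ToyDataset:
    """Toy stand-in that records which steps ran during construction."""

    def __init__(self, mode: InitializationMode = InitializationMode.LAZY) -> None:
        self.log: List[str] = []
        if mode in (InitializationMode.DOWNLOAD_ONLY, InitializationMode.DOWNLOAD_AND_LOAD):
            self.download()
        if mode in (InitializationMode.LOAD_ONLY, InitializationMode.DOWNLOAD_AND_LOAD):
            self.load()

    def download(self) -> None:
        self.log.append('download')  # a real dataset would fetch and unarchive here

    def load(self) -> None:
        self.log.append('load')  # a real dataset would read files into memory here


for mode in InitializationMode:
    print(mode.name, ToyDataset(mode).log)
```

With `LAZY`, nothing happens until `download()` and `load()` are called explicitly, which is the pattern the next cells follow.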

In [15]:
# Instantiate the NOAA JFK Weather dataset using the Dataset class in LAZY mode
jfk_data_dir = pardata.get_config().DATADIR / 'jfk' / '1.1.4'
jfk_dataset = Dataset(schema=jfk_schema, data_dir=jfk_data_dir)

In [16]:
# Call Dataset.download() to download
jfk_dataset.download()
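
The dataset schema shown earlier carries a `sha512sum` field. Assuming this field holds the SHA-512 digest of the downloaded archive, an integrity check could look like the sketch below. It uses only the standard library. `sha512_of_file` is a hypothetical helper, and the temporary file stands in for a real archive.

```python
import hashlib
import pathlib
import tempfile


def sha512_of_file(path: pathlib.Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    archive = pathlib.Path(tmp) / 'archive.tar.gz'
    archive.write_bytes(b'stand-in archive contents')
    # In practice the expected digest would come from the schema's sha512sum field.
    expected = hashlib.sha512(b'stand-in archive contents').hexdigest()
    checksum_ok = sha512_of_file(archive) == expected

print(checksum_ok)  # True
```

Reading in chunks keeps memory use flat even for archives that are hundreds of megabytes in size.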

In [17]:
# Call Dataset.load() to load
jfk_dataset.load()

{'jfk_weather_cleaned':                      DATE  HOURLYVISIBILITY  HOURLYDRYBULBTEMPF  \
 0     2010-01-01 01:00:00               6.0                33.0   
 1     2010-01-01 02:00:00               6.0                33.0   
 2     2010-01-01 03:00:00               5.0                33.0   
 3     2010-01-01 04:00:00               5.0                33.0   
 4     2010-01-01 05:00:00               5.0                33.0   
 ...                   ...               ...                 ...   
 75114 2018-07-27 19:00:00              10.0                76.0   
 75115 2018-07-27 20:00:00               4.0                69.0   
 75116 2018-07-27 21:00:00              10.0                71.0   
 75117 2018-07-27 22:00:00              10.0                72.0   
 75118 2018-07-27 23:00:00              10.0                72.0   
 
        HOURLYWETBULBTEMPF  HOURLYDewPointTempF  HOURLYRelativeHumidity  \
 0                    32.0                 31.0                    92.0   
 1       

In [18]:
# NOAA JFK Weather is a CSV dataset and by default CSV datasets are loaded into Pandas dataframes
jfk_dataset.data['jfk_weather_cleaned'].head()

| | DATE | HOURLYVISIBILITY | HOURLYDRYBULBTEMPF | HOURLYWETBULBTEMPF | HOURLYDewPointTempF | HOURLYRelativeHumidity | HOURLYWindSpeed | HOURLYStationPressure | HOURLYSeaLevelPressure | HOURLYPrecip | HOURLYAltimeterSetting | HOURLYWindDirectionSin | HOURLYWindDirectionCos | HOURLYPressureTendencyIncr | HOURLYPressureTendencyDecr | HOURLYPressureTendencyCons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-01-01 01:00:00 | 6.0 | 33.0 | 32.0 | 31.0 | 92.0 | 0.0 | 29.97 | 29.99 | 0.01 | 29.99 | 0.0 | 1.0 | 0 | 1 | 0 |
| 1 | 2010-01-01 02:00:00 | 6.0 | 33.0 | 33.0 | 32.0 | 96.0 | 0.0 | 29.97 | 29.99 | 0.02 | 29.99 | 0.0 | 1.0 | 0 | 1 | 0 |
| 2 | 2010-01-01 03:00:00 | 5.0 | 33.0 | 33.0 | 32.0 | 96.0 | 0.0 | 29.97 | 29.99 | 0.0 | 29.99 | 0.0 | 1.0 | 0 | 1 | 0 |
| 3 | 2010-01-01 04:00:00 | 5.0 | 33.0 | 33.0 | 32.0 | 96.0 | 0.0 | 29.95 | 29.97 | 0.0 | 29.97 | 0.0 | 1.0 | 0 | 1 | 0 |
| 4 | 2010-01-01 05:00:00 | 5.0 | 33.0 | 32.0 | 31.0 | 92.0 | 0.0 | 29.93 | 29.96 | 0.0 | 29.95 | 0.0 | 1.0 | 0 | 1 | 0 |


In [19]:
# Show datadir structure
!du -ah ~/new-pardata-datadir

4.0K	/home/hong/new-pardata-datadir/jfk/1.1.4/.pardata.dataset/files.list
8.0K	/home/hong/new-pardata-datadir/jfk/1.1.4/.pardata.dataset
8.0K	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport/clean_data.py
29M	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport/jfk_weather.csv
12K	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport/LICENSE.txt
4.0K	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport/README.txt
5.8M	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv
35M	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport
35M	/home/hong/new-pardata-datadir/jfk/1.1.4
35M	/home/hong/new-pardata-datadir/jfk
35M	/home/hong/new-pardata-datadir


### 3.2 Custom User Schema

ParData lets users define their own custom schemata and use them to download and load their datasets. As an example, let's use a custom schema for the IBM Debater Concept Abstractness dataset, which will be downloaded from Box.

In [20]:
# Specify custom datasets schema which contains the IBM Debater Concept Abstractness dataset
custom_datasets_schema_path = 'https://ibm.box.com/shared/static/uzw72y44ghxujgcyit6kmxhg3m8va8pj.yaml'
pardata.init(DATASET_SCHEMA_FILE_URL=custom_datasets_schema_path, update_only=False)
pardata.get_config().DATASET_SCHEMA_FILE_URL

'https://ibm.box.com/shared/static/uzw72y44ghxujgcyit6kmxhg3m8va8pj.yaml'
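
This notebook does not spell out what `update_only` does. The sketch below assumes that `update_only=True` changes only the configs passed in, while `update_only=False` resets every config that is not passed back to its default. If that assumption holds, a call like the one above would also reset `DATADIR`, so it may be worth checking `pardata.get_config().DATADIR` afterwards. `ToyConfig`, `toy_init`, and the placeholder values are hypothetical and do not reproduce ParData's code.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class ToyConfig:
    # Placeholder defaults, not ParData's real values.
    DATADIR: str = '~/.pardata/data'
    DATASET_SCHEMA_FILE_URL: str = 'https://example.com/datasets.yaml'


def toy_init(current: ToyConfig, update_only: bool = True, **kwargs: str) -> ToyConfig:
    """Assumed semantics: keep unspecified configs if update_only, else reset them."""
    base = current if update_only else ToyConfig()
    return dataclasses.replace(base, **kwargs)


config = toy_init(ToyConfig(), DATADIR='~/new-pardata-datadir')
kept = toy_init(config, DATASET_SCHEMA_FILE_URL='https://example.com/custom.yaml')
reset = toy_init(config, DATASET_SCHEMA_FILE_URL='https://example.com/custom.yaml',
                 update_only=False)

print(kept.DATADIR)   # ~/new-pardata-datadir
print(reset.DATADIR)  # ~/.pardata/data
```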

In [21]:
# Load custom dataset schema for IBM Debater Concept Abstractness dataset
custom_schema_manager = pardata.export_schema_collections()
concept_abstractness_schema = custom_schema_manager.schema_collections['datasets'].export_schema('datasets', 'concept_abstractness', '1.0.2')
concept_abstractness_schema

{'name': 'IBM Debater Concept Abstractness',
 'published': datetime.date(2019, 6, 29),
 'homepage': 'https://developer.ibm.com/exchanges/data/all/concept-abstractness/',
 'download_url': 'https://dax-cdn.cdn.appdomain.cloud/dax-concept-abstractness/1.0.2/concept-abstractness.tar.gz',
 'sha512sum': '25cb76c0a8fdfc9cae7e050d4c2492bf055f97a20fa85690b9aaf7dcf965a705fc89e32aed7fbb6d418432e5368cb06fc3bb0f1ab85807fec8aef9df3965cc06',
 'license': 'cdla_sharing',
 'estimated_size': '3.6M',
 'description': 'Abstractness quantifies the degree to which an expression denotes an entity that can be directly perceived by human senses.',
 'subdatasets': {'prediction_unigrams': {'name': 'Prediction Unigrams',
   'description': 'Concepts and abstractness scores for unigrams (single worded concepts)',
   'format': {'id': 'table/csv', 'options': {'encoding': 'UTF-8'}},
   'path': 'prediction_unigrams.csv'},
  'prediction_bigrams': {'name': 'Prediction Bigrams',
   'description': 'Concepts and abstractness 

In [22]:
# Download and load Concept Abstractness dataset using the Dataset class
concept_abstractness_data_dir = pardata.get_config().DATADIR / 'concept-abstractness' / '1.0.2'
concept_abstractness_dataset = Dataset(schema=concept_abstractness_schema,
                                       data_dir=concept_abstractness_data_dir,
                                       mode=Dataset.InitializationMode.DOWNLOAD_AND_LOAD)

In [23]:
# Peek at the subdatasets that were loaded; Concept Abstractness has 3 separate subdatasets
concept_abstractness_dataset.data.keys()

dict_keys(['prediction_unigrams', 'prediction_bigrams', 'prediction_trigrams'])

In [24]:
# Since Concept Abstractness is also a CSV dataset, it also gets loaded into a Pandas dataframe
# Note: The user can however define their own custom data loader if they prefer to load the dataset in a different way
concept_abstractness_dataset.data['prediction_bigrams'].head()

| | Concept | Score |
|---|---|---|
| 0 | a best | 0.088837 |
| 1 | a bola | 0.254723 |
| 2 | a famosa | 0.135748 |
| 3 | a fazenda | 0.187551 |
| 4 | a gallery | 0.48624 |


In [25]:
# Show datadir structure
!du -ah ~/new-pardata-datadir

4.0K	/home/hong/new-pardata-datadir/jfk/1.1.4/.pardata.dataset/files.list
8.0K	/home/hong/new-pardata-datadir/jfk/1.1.4/.pardata.dataset
8.0K	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport/clean_data.py
29M	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport/jfk_weather.csv
12K	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport/LICENSE.txt
4.0K	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport/README.txt
5.8M	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv
35M	/home/hong/new-pardata-datadir/jfk/1.1.4/noaa-weather-data-jfk-airport
35M	/home/hong/new-pardata-datadir/jfk/1.1.4
35M	/home/hong/new-pardata-datadir/jfk
35M	/home/hong/new-pardata-datadir


### 3.3 Custom User Loader

ParData uses loaders to load a given dataset file type into a given Python object. For instance, we've been using `CSVPandasLoader`, the default loader for CSV datasets, to load the NOAA JFK Weather and Concept Abstractness datasets into Pandas dataframes. If ParData does not yet implement a loader that a user needs, they can define that loader themselves and use it when loading their own datasets. All ParData loaders inherit from a base class called `Loader`, which expects you to override the `Loader.load` method. This method is run each time a subdataset is loaded during a `Dataset.load` call.

For demonstration purposes, let's define a simple custom loader that loads CSV files into strings instead of Pandas dataframes. To do this, we first define a class called `CSVStringLoader` which inherits from `Loader`.

In [26]:
class CSVStringLoader(Loader):
    def load(self, path, options):
        """Custom loader to load CSV files into strings (for demo purposes).

        :param path: The path to the subdataset CSV file.
        :param options:
               - ``encoding`` key specifies the encoding of the CSV file.
        """

        encoding = options.get('encoding', 'utf-8')
        return pathlib.Path(path).read_text(encoding=encoding)

Now that we have our custom loader defined, we must create a new `FormatLoaderMap` instance that registers this loader. If a custom `FormatLoaderMap` instance is not created, ParData uses a default `FormatLoaderMap` instance that has access to all of ParData's default loaders.

In [27]:
# Register our custom CSVStringLoader into our custom FormatLoaderMap instance custom_format_loader_map
custom_format_loader_map = FormatLoaderMap({'table/csv': CSVStringLoader()})
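
To make the role of this map concrete: when a dataset is loaded, each subdataset's format id (for example `table/csv`) has to be matched to a loader, whose `load` method then receives the file path and the format options. The sketch below mimics that dispatch with plain dictionaries. `ToyCSVStringLoader`, `toy_load`, the toy schema, and the one-row CSV are hypothetical stand-ins; ParData's own `Dataset.load` and `FormatLoaderMap` may work differently.

```python
import pathlib
import tempfile
from typing import Any, Dict


class ToyCSVStringLoader:
    """Stand-in loader with the same ``load(path, options)`` shape as above."""

    def load(self, path: pathlib.Path, options: Dict[str, Any]) -> str:
        return pathlib.Path(path).read_text(encoding=options.get('encoding', 'utf-8'))


def toy_load(subdatasets: Dict[str, Dict[str, Any]], data_dir: pathlib.Path,
             loader_map: Dict[str, Any]) -> Dict[str, Any]:
    """Look up a loader by format id for every subdataset and call it."""
    data = {}
    for name, spec in subdatasets.items():
        fmt = spec['format']
        loader = loader_map[fmt['id']]  # a KeyError means no loader is registered
        data[name] = loader.load(data_dir / spec['path'], fmt.get('options', {}))
    return data


# Subdataset entry shaped like the schema output shown earlier.
toy_subdatasets = {
    'prediction_unigrams': {
        'format': {'id': 'table/csv', 'options': {'encoding': 'UTF-8'}},
        'path': 'prediction_unigrams.csv',
    },
}

with tempfile.TemporaryDirectory() as tmp:
    data_dir = pathlib.Path(tmp)
    # Made-up one-row file so the sketch runs without downloading anything.
    (data_dir / 'prediction_unigrams.csv').write_text('Concept,Score\nexample,0.5\n',
                                                      encoding='utf-8')
    loaded = toy_load(toy_subdatasets, data_dir, {'table/csv': ToyCSVStringLoader()})

print(loaded['prediction_unigrams'])
```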

Now we are ready to load a dataset using our custom loader. Let's reload the Concept Abstractness dataset, but this time load its subdatasets into strings instead of Pandas dataframes. To do this, we simply provide our `custom_format_loader_map` instance as an argument to `Dataset.load`.

In [28]:
# Specify a custom format_loader_map argument
concept_abstractness_dataset.load(format_loader_map=custom_format_loader_map)



In [29]:
# Peek at the trigram subdataset, which has been loaded as a string
concept_abstractness_trigrams = concept_abstractness_dataset.data['prediction_trigrams']
print(concept_abstractness_trigrams[:150])

Concept,Score
a baby story,0.330729888
a bad dream,0.420424387
a bailar tour,0.178898939
a beautiful lie,0.237211137
a benihana christmas,0.330718464

