# Notebook to test access to relbench datasets and tasks and to check their naming

While trying to use relbench as a benchmark for my master's thesis project, I had difficulty accessing some datasets and tasks. This notebook gives an overview of the task-naming mismatches, the inaccessible datasets and tasks, and the resulting error messages.

## 1. Overview of what datasets and tasks should exist

In [1]:
from relbench.datasets import get_dataset_names, get_dataset
from relbench.tasks import get_task_names, get_task

from importlib.metadata import version

print(version("relbench") == "2.0.2")

True


First we take a look at the datasets that relbench 2.0.2 grants access to and compare the list with the one on relbench.stanford.edu.

In [4]:
dataset_names = get_dataset_names()
print(dataset_names)

['rel-amazon', 'rel-avito', 'rel-event', 'rel-f1', 'rel-hm', 'rel-stack', 'rel-mimic', 'rel-trial', 'rel-arxiv', 'rel-salt', 'rel-ratebeer', 'dbinfer-avs', 'dbinfer-mag', 'dbinfer-diginetica', 'dbinfer-retailrocket', 'dbinfer-seznam', 'dbinfer-amazon', 'dbinfer-stackexchange', 'dbinfer-outbrain-small']


### My output:
'rel-amazon', 'rel-avito', 'rel-event', 'rel-f1', 'rel-hm', 'rel-stack', 'rel-mimic', 'rel-trial', 'rel-arxiv', 'rel-salt', 'rel-ratebeer', 'dbinfer-avs', 'dbinfer-mag', 'dbinfer-diginetica', 'dbinfer-retailrocket', 'dbinfer-seznam', 'dbinfer-amazon', 'dbinfer-stackexchange', 'dbinfer-outbrain-small'

### Available according to relbench.stanford.edu:
'rel-amazon', 'rel-avito', 'rel-event', 'rel-f1', 'rel-hm', 'rel-stack', 'rel-mimic', 'rel-trial', 'rel-arxiv', 'rel-salt', 'rel-ratebeer'

### Conclusion:
There seems to be a mismatch: relbench 2.0.2 lists eight additional `dbinfer-` datasets that do not appear on the website, which is counterintuitive to me as a user.
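The comparison above can also be done programmatically. The following is a minimal sketch that hard-codes both lists exactly as shown above; the variable names `package_datasets` and `website_datasets` are made up for this illustration, and the website list was copied by hand rather than fetched.

```python
# Dataset names printed by get_dataset_names() in relbench 2.0.2 (copied from the output above).
package_datasets = [
    "rel-amazon", "rel-avito", "rel-event", "rel-f1", "rel-hm", "rel-stack",
    "rel-mimic", "rel-trial", "rel-arxiv", "rel-salt", "rel-ratebeer",
    "dbinfer-avs", "dbinfer-mag", "dbinfer-diginetica", "dbinfer-retailrocket",
    "dbinfer-seznam", "dbinfer-amazon", "dbinfer-stackexchange",
    "dbinfer-outbrain-small",
]

# Dataset names listed on relbench.stanford.edu (copied by hand from the list above).
website_datasets = [
    "rel-amazon", "rel-avito", "rel-event", "rel-f1", "rel-hm", "rel-stack",
    "rel-mimic", "rel-trial", "rel-arxiv", "rel-salt", "rel-ratebeer",
]

# Set differences in both directions show which names exist on only one side.
only_in_package = sorted(set(package_datasets) - set(website_datasets))
only_on_website = sorted(set(website_datasets) - set(package_datasets))

print("Only in package:", only_in_package)
print("Only on website:", only_on_website)
```

In the notebook itself, `package_datasets` could simply be replaced by the `dataset_names` variable from the cell above.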

Next, I continue with the tasks.

In [5]:
for dataset_name in dataset_names:
    print(dataset_name, get_task_names(dataset_name))

rel-amazon ['user-churn', 'user-ltv', 'item-churn', 'item-ltv', 'user-item-purchase', 'user-item-rate', 'user-item-review', 'review-rating', 'review-verified']
rel-avito ['ad-ctr', 'user-visits', 'user-clicks', 'user-ad-visit', 'searchstream-click', 'searchinfo-isuserloggedon']
rel-event ['user-attendance', 'user-repeat', 'user-ignore', 'event_interest-interested', 'event_interest-not_interested', 'users-birthyear']
rel-f1 ['driver-position', 'driver-dnf', 'driver-top3', 'driver-race-compete', 'results-position', 'qualifying-position', 'constructor_results-points', 'constructor_standings-position']
rel-hm ['user-item-purchase', 'user-churn', 'item-sales', 'transactions-price']
rel-stack ['user-engagement', 'post-votes', 'user-badge', 'user-post-comment', 'post-post-related', 'badges-class', 'postlinks-linktypeid']
rel-mimic ['icu-length-of-stay']
rel-trial ['study-outcome', 'study-adverse', 'site-success', 'condition-sponsor-run', 'site-sponsor-run', 'studies-enrollment', 'studies-has_

### My output:
- rel-amazon ['user-churn', 'user-ltv', 'item-churn', 'item-ltv', 'user-item-purchase', 'user-item-rate', 'user-item-review', 'review-rating', 'review-verified']
- rel-avito ['ad-ctr', 'user-visits', 'user-clicks', 'user-ad-visit', 'searchstream-click', 'searchinfo-isuserloggedon']
- rel-event ['user-attendance', 'user-repeat', 'user-ignore', 'event_interest-interested', 'event_interest-not_interested', 'users-birthyear']
- rel-f1 ['driver-position', 'driver-dnf', 'driver-top3', 'driver-race-compete', 'results-position', 'qualifying-position', 'constructor_results-points', 'constructor_standings-position']
- rel-hm ['user-item-purchase', 'user-churn', 'item-sales', 'transactions-price']
- rel-stack ['user-engagement', 'post-votes', 'user-badge', 'user-post-comment', 'post-post-related', 'badges-class', 'postlinks-linktypeid']
- rel-mimic ['icu-length-of-stay']
- rel-trial ['study-outcome', 'study-adverse', 'site-success', 'condition-sponsor-run', 'site-sponsor-run', 'studies-enrollment', 'studies-has_dmc', 'eligibilities-adult', 'eligibilities-child']
- rel-arxiv ['paper-citation', 'author-category', 'author-publication', 'co-citation']
- rel-salt ['item-plant', 'item-shippoint', 'item-incoterms', 'sales-office', 'sales-group', 'sales-payterms', 'sales-shipcond', 'sales-incoterms']
- rel-ratebeer ['beer-rating-churn', 'user-rating-churn', 'brewer-dormant', 'user-rating-count', 'user-liked-beer', 'user-liked-place', 'user-favorite-beer', 'user-beer-rating']
- dbinfer-avs ['repeater']
- dbinfer-mag ['cite', 'venue']
- dbinfer-diginetica ['ctr', 'purchase']
- dbinfer-retailrocket ['cvr']
- dbinfer-seznam ['charge', 'prepay']
- dbinfer-amazon ['rating', 'purchase', 'churn']
- dbinfer-stackexchange ['churn', 'upvote']
- dbinfer-outbrain-small ['ctr']

### Available according to relbench.stanford.edu:
- rel-amazon ['user-churn', 'user-ltv', 'item-churn', 'item-ltv', 'user-item-purchase', 'user-item-rate', 'user-item-review', 'review-rating', 'review-verified']
- rel-avito ['ad-ctr', 'user-visits', 'user-clicks', 'user-ad-visit', 'searchstream-click', 'searchinfo-isuserloggedon']
- rel-event ['user-attendance', 'user-repeat', 'user-ignore', 'event_interest-interested', 'event_interest-not_interested', 'users-birthyear']
- rel-f1 ['driver-position', 'driver-dnf', 'driver-top3', 'driver-circuit-complete', 'results-position', 'qualifying-position', 'constructor_results-points', 'constructor_standings-position']
- rel-hm ['user-item-purchase', 'user-churn', 'item-sales', 'transactions-price']
- rel-stack ['user-engagement', 'post-votes', 'user-badge', 'user-post-comment', 'post-post-related', 'badges-class', 'postlinks-linktypeid']
- rel-mimic ['patient-iculengthofstay']
- rel-trial ['study-outcome', 'study-adverse', 'site-success', 'condition-sponsor-run', 'site-sponsor-run', 'studies-enrollment', 'studies-has_dmc', 'eligibilities-adult', 'eligibilities-child']
- rel-arxiv ['paper-citation', 'author-category', 'author-publication', 'paper-paper-cocitation']
- rel-salt ['item-plant', 'item-shippoint', 'item-incoterms', 'sales-office', 'sales-group', 'sales-payterms', 'sales-shipcond', 'sales-incoterms']
- rel-ratebeer ['beer-churn', 'user-churn', 'brewer-dormant', 'user-count', 'user-beer-liked', 'user-place-liked', 'user-beer-favorite', 'beer_ratings-total_score']

### Conclusion:
There are mismatches, and they follow no consistent pattern, which leads to confusion. In each pair below, the first name is the one on the website and the second is the one returned by relbench 2.0.2:
- rel-f1: 'driver-circuit-complete' instead of 'driver-race-compete'
- rel-arxiv: 'paper-paper-cocitation' instead of 'co-citation'
- rel-ratebeer: 'beer-churn' instead of 'beer-rating-churn', 'user-churn' instead of 'user-rating-churn', 'user-count' instead of 'user-rating-count', 'beer_ratings-total_score' instead of 'user-beer-rating', 'user-beer-favorite' instead of 'user-favorite-beer', 'user-beer-liked' instead of 'user-liked-beer', 'user-place-liked' instead of 'user-liked-place'
- rel-mimic: 'patient-iculengthofstay' instead of 'icu-length-of-stay'
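The same kind of set comparison works per dataset for the task names. This is a sketch only: the helper `diff_task_names` and both dictionaries are hypothetical names introduced here, and the task lists are copied by hand from the two listings above (with rel-hm included as an example of a dataset whose names agree).

```python
# Task names returned by get_task_names() in relbench 2.0.2 (copied from above).
package_tasks = {
    "rel-f1": ["driver-position", "driver-dnf", "driver-top3", "driver-race-compete",
               "results-position", "qualifying-position",
               "constructor_results-points", "constructor_standings-position"],
    "rel-hm": ["user-item-purchase", "user-churn", "item-sales", "transactions-price"],
    "rel-mimic": ["icu-length-of-stay"],
    "rel-arxiv": ["paper-citation", "author-category", "author-publication", "co-citation"],
    "rel-ratebeer": ["beer-rating-churn", "user-rating-churn", "brewer-dormant",
                     "user-rating-count", "user-liked-beer", "user-liked-place",
                     "user-favorite-beer", "user-beer-rating"],
}

# Task names listed on relbench.stanford.edu (copied by hand from above).
website_tasks = {
    "rel-f1": ["driver-position", "driver-dnf", "driver-top3", "driver-circuit-complete",
               "results-position", "qualifying-position",
               "constructor_results-points", "constructor_standings-position"],
    "rel-hm": ["user-item-purchase", "user-churn", "item-sales", "transactions-price"],
    "rel-mimic": ["patient-iculengthofstay"],
    "rel-arxiv": ["paper-citation", "author-category", "author-publication",
                  "paper-paper-cocitation"],
    "rel-ratebeer": ["beer-churn", "user-churn", "brewer-dormant", "user-count",
                     "user-beer-liked", "user-place-liked", "user-beer-favorite",
                     "beer_ratings-total_score"],
}


def diff_task_names(package, website):
    """Return {dataset: (only_in_package, only_on_website)} for datasets that differ."""
    result = {}
    for dataset in sorted(set(package) | set(website)):
        pkg = set(package.get(dataset, []))
        web = set(website.get(dataset, []))
        if pkg != web:
            result[dataset] = (sorted(pkg - web), sorted(web - pkg))
    return result


mismatches = diff_task_names(package_tasks, website_tasks)
for dataset, (pkg_only, web_only) in mismatches.items():
    print(f"{dataset}: package only {pkg_only}, website only {web_only}")
```

Note that a set difference only reports which names are unmatched; pairing each package name with its website counterpart, as in the list above, still has to be done by hand.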

## 2. Test which are actually accessible

In [6]:
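download = True
for dataset_name in dataset_names:
    # rel-stack and rel-mimic raise errors, and the dbinfer- datasets cannot be downloaded (see below).
    if dataset_name not in {"rel-stack", "rel-mimic"} and not dataset_name.startswith("dbinfer-"):
        dataset = get_dataset(dataset_name, download=download)

# Uncomment to reproduce the errors shown below:
# dataset = get_dataset("rel-stack")
# dataset = get_dataset("rel-mimic")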

### Error Message when trying to download rel-stack

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 6
      5 try:
----> 6     dataset = get_dataset(dataset_name)
      7 except:

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\relbench\datasets\__init__.py:115, in get_dataset(name, download)
    114 if download:
--> 115     download_dataset(name)
    117 if name.startswith("ctu-"):

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\relbench\datasets\__init__.py:86, in download_dataset(name)
     84     verify_mimic_access()
---> 86 DOWNLOAD_REGISTRY.fetch(
     87     f"{name}/db.zip",
     88     processor=pooch.Unzip(extract_dir="."),
     89     progressbar=True,
     90 )

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pooch\core.py:589, in Pooch.fetch(self, fname, processor, downloader, progressbar)
    587         downloader = choose_downloader(url, progressbar=progressbar)
--> 589     stream_download(
    590         url,
    591         full_path,
    592         known_hash,
    593         downloader,
    594         pooch=self,
    595         retry_if_failed=self.retry_if_failed,
    596     )
    598 if processor is not None:

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pooch\core.py:808, in stream_download(url, fname, known_hash, downloader, pooch, retry_if_failed)
    807 downloader(url, tmp, pooch)
--> 808 hash_matches(tmp, known_hash, strict=True, source=str(fname.name))
    809 shutil.move(tmp, str(fname))

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pooch\hashes.py:176, in hash_matches(fname, known_hash, strict, source)
    175         source = str(fname)
--> 176     raise ValueError(
    177         f"{algorithm.upper()} hash of downloaded file ({source}) does not match"
    178         f" the known hash: expected {known_hash} but got {new_hash}. Deleted"
    179         " download for safety. The downloaded file may have been corrupted or"
    180         " the known hash may be outdated."
    181     )
    182 return matches

ValueError: SHA256 hash of downloaded file (db.zip) does not match the known hash: expected 9e5acfcaef041059dba346b1a876ff108fbb496ede0955dc89be6349e777a380 but got f1374fda2828f62ac98e360f969dbf76a8a6404f6691b0a27a521d0730e22ec3. Deleted download for safety. The downloaded file may have been corrupted or the known hash may be outdated.
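As the message says, the downloaded `db.zip` does not have the SHA256 hash that the installed relbench version expects, so either the download was corrupted or the recorded hash is outdated. To check a manually downloaded file against the "expected" and "got" values, its hash can be computed with the standard library. The following is a minimal sketch; `sha256_of_file` is a hypothetical helper, and the demonstration hashes a throwaway file rather than the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file instead of the real db.zip.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo.bin"
    demo_file.write_bytes(b"hello")
    actual_hash = sha256_of_file(demo_file)

# The reference value here is computed in memory from the same bytes.
expected_hash = hashlib.sha256(b"hello").hexdigest()
print("hashes match:", actual_hash == expected_hash)
```

If repeated downloads consistently yield the same "got" hash, a corrupted transfer becomes unlikely and an outdated hash in the package becomes the more plausible explanation.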

### Error Message when trying to download rel-mimic
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\relbench\datasets\mimic.py:18, in verify_mimic_access()
     17 try:
---> 18     from google.cloud import bigquery
     20     table_id = "physionet-data.mimiciv_3_1_hosp.patients"

ModuleNotFoundError: No module named 'google.cloud'

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[5], line 6
      3         dataset = get_dataset(dataset_name)
      5 #dataset = get_dataset("rel-stack")
----> 6 dataset = get_dataset("rel-mimic")

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\relbench\datasets\__init__.py:115, in get_dataset(name, download)
     95 r"""Return a dataset object by name.
     96
     97 Args:
   (...)    111 cached database matches the RelBench version even in this case.
    112 """
    114 if download:
--> 115     download_dataset(name)
    117 if name.startswith("ctu-"):
    118     try:

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\relbench\datasets\__init__.py:84, in download_dataset(name)
     81     print("Downloading Mimic dataset...")
     82     from relbench.datasets.mimic import verify_mimic_access
---> 84     verify_mimic_access()
     86 DOWNLOAD_REGISTRY.fetch(
     87     f"{name}/db.zip",
     88     processor=pooch.Unzip(extract_dir="."),
     89     progressbar=True,
     90 )

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\relbench\datasets\mimic.py:27, in verify_mimic_access()
     25     print("MIMIC-IV access verified.")
     26 except Exception as e:
---> 27     raise RuntimeError(
     28         f"\nACCESS FAILED - BigQuery credential check encountered an error: {e}"
     29     )

RuntimeError:
ACCESS FAILED - BigQuery credential check encountered an error: No module named 'google.cloud'

### Error Message when trying to download dbinfer-...

- Dataset 'dbinfer-mag' is derived from 4DBInfer and must be generated locally; skipping download.
- Dataset 'dbinfer-diginetica' is derived from 4DBInfer and must be generated locally; skipping download.
- Dataset 'dbinfer-retailrocket' is derived from 4DBInfer and must be generated locally; skipping download.
- Dataset 'dbinfer-seznam' is derived from 4DBInfer and must be generated locally; skipping download.
- Dataset 'dbinfer-amazon' is derived from 4DBInfer and must be generated locally; skipping download.
- Dataset 'dbinfer-stackexchange' is derived from 4DBInfer and must be generated locally; skipping download.
- Dataset 'dbinfer-outbrain-small' is derived from 4DBInfer and must be generated locally; skipping download.

Setting download=False suppresses the message, but doesn't seem to generate the dataset either (see the failure to access the tasks below).

In [None]:
download = True
# Tasks that fail to download (hash mismatch or HTTP 404); see the error messages below.
broken_tasks = {
    ("rel-event", "user-ignore"), ("rel-event", "event_interest-interested"),
    ("rel-arxiv", "co-citation"),
    ("rel-ratebeer", "beer-rating-churn"), ("rel-ratebeer", "user-rating-churn"),
    ("rel-ratebeer", "user-rating-count"), ("rel-ratebeer", "user-liked-beer"),
    ("rel-ratebeer", "user-liked-place"), ("rel-ratebeer", "user-favorite-beer"),
}
for dataset_name in dataset_names:
    if dataset_name not in {"rel-stack", "rel-mimic"} and not dataset_name.startswith("dbinfer-"):
        for task_name in get_task_names(dataset_name):
            if (dataset_name, task_name) not in broken_tasks:
                task = get_task(dataset_name, task_name, download=download)
                train_table = task.get_table("train")
                val_table = task.get_table("val")
                test_table = task.get_table("test")
                print(f"{dataset_name}::{task_name}::{task.task_type.name}\n---------------------------------")
                print(train_table.df.keys().values)
                print(val_table.df.keys().values)
                print(test_table.df.keys().values)
                print("\n---------------------------------\n")

rel-amazon::user-churn::BINARY_CLASSIFICATION
---------------------------------
['timestamp' 'customer_id' 'churn']
['timestamp' 'customer_id' 'churn']
['timestamp' 'customer_id']

---------------------------------

rel-amazon::user-ltv::REGRESSION
---------------------------------
['timestamp' 'customer_id' 'ltv']
['timestamp' 'customer_id' 'ltv']
['timestamp' 'customer_id']

---------------------------------

rel-amazon::item-churn::BINARY_CLASSIFICATION
---------------------------------
['timestamp' 'product_id' 'churn']
['timestamp' 'product_id' 'churn']
['timestamp' 'product_id']

---------------------------------

rel-amazon::item-ltv::REGRESSION
---------------------------------
['timestamp' 'product_id' 'ltv']
['timestamp' 'product_id' 'ltv']
['timestamp' 'product_id']

---------------------------------

rel-amazon::user-item-purchase::LINK_PREDICTION
---------------------------------
['timestamp' 'customer_id' 'product_id']
['timestamp' 'customer_id' 'product_id']
['timestamp'

### Error Message when trying to download rel-event tasks 'user-ignore' and 'event_interest-interested'

ValueError: SHA256 hash of downloaded file (user-ignore.zip) does not match the known hash: expected 354e06f4f2f878d53832f67552b117d773c83c12637cea417f89352f21a9a2d2 but got 5c7b40da7102336cf6ea2628c403a651005c9257b5f8d67fcebdb0a16f1d4fb1. Deleted download for safety. The downloaded file may have been corrupted or the known hash may be outdated.

### Error Message when trying to download rel-arxiv task 'co-citation' and rel-ratebeer tasks 'beer-rating-churn', 'user-rating-churn', 'user-rating-count', 'user-liked-beer', 'user-liked-place' and 'user-favorite-beer'

- HTTPError: 404 Client Error: Not Found for url: https://relbench.stanford.edu/download/rel-arxiv/tasks/co-citation.zip
- HTTPError: 404 Client Error: Not Found for url: https://relbench.stanford.edu/download/rel-ratebeer/tasks/beer-rating-churn.zip
- HTTPError: 404 Client Error: Not Found for url: https://relbench.stanford.edu/download/rel-ratebeer/tasks/user-rating-churn.zip
- HTTPError: 404 Client Error: Not Found for url: https://relbench.stanford.edu/download/rel-ratebeer/tasks/user-rating-count.zip
- HTTPError: 404 Client Error: Not Found for url: https://relbench.stanford.edu/download/rel-ratebeer/tasks/user-liked-beer.zip
- HTTPError: 404 Client Error: Not Found for url: https://relbench.stanford.edu/download/rel-ratebeer/tasks/user-liked-place.zip
- HTTPError: 404 Client Error: Not Found for url: https://relbench.stanford.edu/download/rel-ratebeer/tasks/user-favorite-beer.zip
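Instead of hard-coding the failing tasks as exclusions, the failures could be collected automatically by catching the exception for each task. The sketch below shows the pattern with a stand-in loader so that it runs without any downloads; `probe` and `fake_get_task` are hypothetical names, and the simulated error is not a real relbench message.

```python
def probe(items, loader):
    """Call loader(*item) for every item and collect successes and error messages."""
    succeeded, failed = [], {}
    for item in items:
        try:
            loader(*item)
        except Exception as exc:  # broad on purpose: the goal is to record every failure
            failed[item] = f"{type(exc).__name__}: {exc}"
        else:
            succeeded.append(item)
    return succeeded, failed


# Stand-in for get_task that fails for one task name (simulation only).
def fake_get_task(dataset_name, task_name):
    if task_name == "co-citation":
        raise ValueError("simulated download failure")
    return object()


pairs = [("rel-arxiv", "paper-citation"), ("rel-arxiv", "co-citation")]
succeeded, failed = probe(pairs, fake_get_task)
print("succeeded:", succeeded)
print("failed:", failed)
```

With relbench, the loader could be `lambda d, t: get_task(d, t, download=True)` and `pairs` could be built from `get_task_names(dataset_name)` for each dataset, so that the "works" and "doesn't work" lists are produced directly from the run.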

### Error Message when trying to download any task of the dbinfer-... datasets
- ModuleNotFoundError: No module named 'dbinfer_relbench_adapter', raised after attempting to use download=False for them, as asked by get_dataset()
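Both this error and the rel-mimic failure above come from modules that are not installed in my environment. A quick way to check for such optional modules before calling relbench is sketched below; `is_importable` is a hypothetical helper. I assume that `google.cloud.bigquery` is provided by the `google-cloud-bigquery` package on PyPI, and I do not know how `dbinfer_relbench_adapter` is meant to be installed.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec imports parent packages of dotted names and raises
        # ModuleNotFoundError (a subclass of ImportError) if a parent is missing.
        return False


# Module names taken from the two error messages in this notebook.
for module_name in ("google.cloud.bigquery", "dbinfer_relbench_adapter"):
    status = "found" if is_importable(module_name) else "missing"
    print(f"{module_name}: {status}")
```

Even with `google.cloud.bigquery` installed, the traceback suggests that rel-mimic additionally requires credentials with access to the `physionet-data` BigQuery tables, so a successful import check alone does not guarantee access.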


## 3. Conclusion

### Works:
- rel-amazon ['user-churn', 'user-ltv', 'item-churn', 'item-ltv', 'user-item-purchase', 'user-item-rate', 'user-item-review', 'review-rating', 'review-verified']
- rel-avito ['ad-ctr', 'user-visits', 'user-clicks', 'user-ad-visit', 'searchstream-click', 'searchinfo-isuserloggedon']
- rel-event ['user-attendance', 'user-repeat', 'event_interest-not_interested', 'users-birthyear']
- rel-f1 ['driver-position', 'driver-dnf', 'driver-top3', 'driver-race-compete', 'results-position', 'qualifying-position', 'constructor_results-points', 'constructor_standings-position']
- rel-hm ['user-item-purchase', 'user-churn', 'item-sales', 'transactions-price']
- rel-trial ['study-outcome', 'study-adverse', 'site-success', 'condition-sponsor-run', 'site-sponsor-run', 'studies-enrollment', 'studies-has_dmc', 'eligibilities-adult', 'eligibilities-child']
- rel-arxiv ['paper-citation', 'author-category', 'author-publication']
- rel-salt ['item-plant', 'item-shippoint', 'item-incoterms', 'sales-office', 'sales-group', 'sales-payterms', 'sales-shipcond', 'sales-incoterms']
- rel-ratebeer ['brewer-dormant', 'user-beer-rating']

### Doesn't work:
- rel-stack
- rel-mimic
- rel-event ['user-ignore', 'event_interest-interested']
- rel-arxiv ['co-citation']
- rel-ratebeer ['beer-rating-churn', 'user-rating-churn', 'user-rating-count', 'user-liked-beer', 'user-liked-place', 'user-favorite-beer']
- dbinfer-avs
- dbinfer-mag
- dbinfer-diginetica
- dbinfer-retailrocket
- dbinfer-seznam
- dbinfer-amazon
- dbinfer-stackexchange
- dbinfer-outbrain-small