# Exploring the Pangaea Python API

In [None]:
from pprint import pprint

import pangaeapy

In [None]:
[item for item in dir(pangaeapy) if not item.startswith("__")]

## Checking out the `PanDataSet` class
Exploring the following dataset: https://doi.pangaea.de/10.1594/PANGAEA.875148

**Issue:** The webpage for the dataset gives its size as 1719 data points, but the dataset has only 573 rows.
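
One possible explanation, which is an assumption and not something the website confirms, is that "data points" counts individual values rather than rows: 573 rows with 3 values each gives 573 x 3 = 1719. The sketch below shows the difference between the two counts on a small made-up table; the helper `count_rows_and_values` and the toy data are hypothetical and not part of `pangaeapy`.

```python
def count_rows_and_values(rows):
    """Return (number of rows, number of non-missing values) for a list of dicts."""
    n_rows = len(rows)
    n_values = sum(
        1 for row in rows for value in row.values() if value is not None
    )
    return n_rows, n_values


# Hypothetical toy table: 4 rows, 3 parameters each, no missing values
toy = [{"depth": d, "temperature": 1.0, "salinity": 35.0} for d in range(4)]

n_rows, n_values = count_rows_and_values(toy)
print(f"{n_rows} rows, {n_values} values")  # 4 rows, 12 values
```

On the notebook's own dataframe, `df.count().sum()` gives the comparable total of non-missing cells, which can be checked against the size shown on the website.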

In [None]:
# Fetch dataset using doi
ds = pangaeapy.PanDataSet("10.1594/PANGAEA.875148")
# Examine attributes
df = ds.data
print(ds.title)
print("Authors:", [auth.fullname for auth in ds.authors])
print(f"{df.shape[0]} rows x {df.shape[1]} columns")
df.head()

In [None]:
print(ds.citation)

Checking another dataset, this time using only the dataset ID (the full DOI is not needed)
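
The DOI used earlier ends in the number `875148`, which suggests that the dataset ID is simply the numeric suffix of the DOI. Assuming Pangaea DOIs follow the `10.1594/PANGAEA.<id>` pattern seen above, a small helper can convert between the two forms. The function name `pangaea_id_from_doi` is hypothetical and not part of `pangaeapy`.

```python
import re


def pangaea_id_from_doi(doi: str) -> int:
    """Extract the numeric dataset ID from a Pangaea DOI or DOI URL.

    Assumes the DOI contains 'PANGAEA.<digits>', as in the examples above.
    """
    match = re.search(r"PANGAEA\.(\d+)", doi, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Not a Pangaea DOI: {doi!r}")
    return int(match.group(1))


print(pangaea_id_from_doi("10.1594/PANGAEA.875148"))  # 875148
print(pangaea_id_from_doi("https://doi.pangaea.de/10.1594/PANGAEA.875148"))  # 875148
```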

In [None]:
# Fetch dataset using dataset ID
ds = pangaeapy.PanDataSet(371064)
# Examine attributes
print(ds.title)
print("Authors:", [auth.fullname for auth in ds.authors])
print(f"{ds.data.shape[0]} rows x {ds.data.shape[1]} columns")

In [None]:
ds.isParent

In [None]:
ds.data

In [None]:
print(ds.doi)

The dataframe for this dataset is empty. On the Pangaea website, the dataset size is listed as 'unknown', and clicking 'Download dataset' shows that the images are hosted across several pages of the website rather than having their URLs listed in a table.

In [None]:
print(ds.citation)

## Checking out the `PanQuery` class

In [None]:
# Search Pangaea using query string
query = pangaeapy.PanQuery(query="seafloor images", limit=999)
print("Total search results:", query.totalcount)
print("Results returned:", len(query.result))

As we can see, despite 596 results being available, `PanQuery` returns a maximum of 500 results.
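
A common way around a per-request cap is to fetch results in pages and concatenate them. The sketch below shows that pattern with a stand-in fetch function so it runs without network access; `fetch_all`, `fake_fetch` and `FAKE_RESULTS` are hypothetical names, and the 500-item cap in the stand-in simply mirrors the behaviour observed above.

```python
def fetch_all(fetch_page, total, page_size=500):
    """Collect up to `total` items by calling fetch_page(offset, limit) repeatedly."""
    items = []
    offset = 0
    while offset < total:
        page = fetch_page(offset, min(page_size, total - offset))
        if not page:  # stop early if the service returns nothing
            break
        items.extend(page)
        offset += len(page)
    return items


# Stand-in for the real service: 596 fake results, at most 500 per call
FAKE_RESULTS = [{"position": i} for i in range(596)]


def fake_fetch(offset, limit):
    return FAKE_RESULTS[offset : offset + min(limit, 500)]


all_items = fetch_all(fake_fetch, total=len(FAKE_RESULTS))
print(len(all_items))  # 596
```

With `pangaeapy`, `fetch_page` could wrap a `PanQuery` call, for example `lambda offset, limit: pangaeapy.PanQuery("seafloor images", limit=limit, offset=offset).result`. This assumes `PanQuery` accepts an `offset` argument, which should be checked against the installed version.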

In [None]:
query.result[0].keys()

### Examining result items
Each item in the results list is a dictionary containing:
- **URI**: the dataset DOI
- **score**: the TF-IDF score (how well the dataset matches the query string)
- **type**: either 'parent' or 'child'
- **position**: the index of the result
- **html**: other useful info such as the dataset size, citation etc.
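
Given that structure, the results can be summarised with the standard library alone. The sketch below tallies results by **type** and reduces the **html** field to plain text. The records in `sample_results` are made up for illustration and only mimic the keys listed above; `html_to_text` is a hypothetical helper, not part of `pangaeapy`.

```python
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def html_to_text(fragment):
    """Return the visible text of an HTML fragment as a single string."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join(parser.parts)


# Made-up records that only mimic the keys described above
sample_results = [
    {"URI": "example-uri-1", "score": 12.5, "type": "parent", "position": 0,
     "html": "<div><b>Example title</b> <span>Size: unknown</span></div>"},
    {"URI": "example-uri-2", "score": 9.1, "type": "child", "position": 1,
     "html": "<div><b>Another title</b></div>"},
    {"URI": "example-uri-3", "score": 7.3, "type": "child", "position": 2,
     "html": "<div>Third title</div>"},
]

type_counts = Counter(item["type"] for item in sample_results)
print(type_counts)
print(html_to_text(sample_results[0]["html"]))  # Example title Size: unknown
```

The same two calls should work on `query.result` in place of `sample_results`, provided the real items carry the keys listed above.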

In [None]:
result = query.result[0]
result

In [None]:
for result in query.result:
    print(
        f"Index: [{result['position']}], DOI: {result['URI']}, Type: {result['type']}, Score: {result['score']}"
    )
    pprint(result["html"])
    print("-" * 125)