# Detecting the type of dataset before fetching it as a `PanDataSet`

Pangaea hosts several types of datasets:
- Tabular datasets (rows x columns), where we can easily check whether the desired column is present
- Seafloor videos, which are usually too large to be fetched as a `PanDataSet`
- Images hosted on the website

In this notebook we attempt to detect the type of a dataset before fetching it as a `PanDataSet`.

In [None]:
import os
import sys

import pandas as pd
import pangaeapy
from bs4 import BeautifulSoup

sys.path.append("..")
from downloader.utilz import fetch_child_datasets, has_url_col

## 1. Make search query

In [None]:
query = "seafloor video"
n_results = 200

In [None]:
pq = pangaeapy.PanQuery(query=query, limit=n_results)
print("Requested URL:", pq.PANGAEA_QUERY_URL + "+".join(pq.query.split(" ")))

print("Number of results returned:", len(pq.result))
print("Total search results:", pq.totalcount)

## 2. Processing a result item
The dictionary returned for each result item has some useful information.

- The `URI` can be used to fetch the `PanDataSet`

- The `type` tells us if it has child datasets

- The `html` contains several useful pieces of information:
    - The citation for the dataset
    - The URL of the dataset webpage
    - The dataset size, for example:
        - 14 datasets (has child datasets)
        - 500 data points (normal tabular format)
        - 50 MBytes (video)
        - unknown (images hosted on website)
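
The size strings listed above can be turned into a rough dataset-type label. The following is a minimal sketch, not part of the notebook's `downloader` package: the helper name `classify_size` and the labels it returns are hypothetical, and it assumes the size text always follows one of the four patterns shown in the list.

```python
import re


def classify_size(size: str) -> str:
    """Guess the dataset type from the size string of a search result.

    Hypothetical helper: the returned labels are illustrative only.
    """
    text = size.strip().lower()
    if re.search(r"\bdatasets?\b", text):
        return "parent"   # e.g. "14 datasets": has child datasets
    if re.search(r"\bdata points?\b", text):
        return "tabular"  # e.g. "500 data points": normal table
    if re.search(r"\b[kmgt]?bytes?\b", text):
        return "file"     # e.g. "50 MBytes": binary file such as a video
    return "unknown"      # e.g. "unknown": images hosted on the website


for example in ["14 datasets", "500 data points", "50 MBytes", "unknown"]:
    print(f"{example!r} -> {classify_size(example)}")
```

A byte size only indicates that the dataset is a downloadable file; whether it is actually a video would still need to be confirmed from the dataset page.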

In [None]:
result = pq.result[0]
print("Result dict keys:", result.keys())
result

In [None]:
def process_result(result, verbose=False):
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(result["html"], "html.parser")
    citation = soup.find("div", attrs={"class": "citation"}).text
    url = soup.find("a", attrs={"class": "dataset-link"})["href"]
    size = soup.find_all("td", class_="content")[-1].text
    is_parent = result["type"] == "parent"

    if verbose:
        print(citation, url)
        print(
            f"Dataset size: {size}, Has child datasets: {is_parent}, TF-IDF Score: {result['score']}"
        )
    return url, size

In [None]:
# Testing function
from numpy.random import randint

idx = randint(0, len(pq.result))
url, size = process_result(pq.result[idx], verbose=True)

In [None]:
url, size

## 3. Viewing results

In [None]:
for i, result in enumerate(pq.result):
    url, size = process_result(result)
    if "data" not in size:  # Exclude parent datasets ("datasets") and tables ("data points")
        print(f"[{i}]", size, url)
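
Since videos are usually too large to fetch, it can help to convert the byte-based size strings into a number before deciding what to download. The sketch below is an assumption-laden illustration: the helper `size_in_mb` is hypothetical, the unit spellings (`kBytes`, `MBytes`, `GBytes`, ...) are inferred from the "50 MBytes" example above, and a factor of 1024 between units is assumed.

```python
import re

# Assumed unit spellings and binary (1024-based) conversion factors
_UNIT_TO_MB = {
    "bytes": 1 / 1024**2,
    "kbytes": 1 / 1024,
    "mbytes": 1.0,
    "gbytes": 1024.0,
    "tbytes": 1024.0**2,
}


def size_in_mb(size: str):
    """Return the size in megabytes, or None if the string is not a byte size."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmgt]?bytes)\s*", size.lower())
    if match is None:
        return None  # e.g. "14 datasets", "500 data points", "unknown"
    return float(match.group(1)) * _UNIT_TO_MB[match.group(2)]


# Keep only file-like results below a hypothetical 100 MB limit
sizes = ["50 MBytes", "2 GBytes", "500 data points", "unknown"]
small_files = [s for s in sizes if (mb := size_in_mb(s)) is not None and mb <= 100]
print(small_files)
```

Returning `None` for non-byte sizes keeps tabular and parent datasets out of the comparison instead of treating them as zero-sized files.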

In [None]:
for result in pq.result:
    process_result(result, verbose=True)
    print("-" * 120)