# Dataproc Metastore - Public Datasets Quickstart

## Overview

### Dataproc Metastore service

Dataproc Metastore is a fully managed, highly available, autohealing, serverless, Apache Hive metastore (HMS) that runs on Google Cloud.

Dataproc Metastore provides you with a fully compatible Hive Metastore (HMS), which is the established standard in the open source big data ecosystem for managing technical metadata. This service helps you manage the metadata of your data lakes and provides interoperability between the various data processing engines and tools you're using.

[Dataproc Metastore service documentation](https://cloud.google.com/dataproc-metastore/docs/overview)

### About Dataproc Metastore Public Datasets

A public Dataproc Metastore service is available with several pre-loaded public datasets. All users have read-only access to it, which makes it easier to read these pre-selected public datasets from the internet and to leverage Dataproc capabilities on them. The details of the public service are:

|Property|Value|
|--------------------|--------------------|
|GCP_PROJECT|dataproc-workspaces-notebooks  |
|NAME|metastore-dev |
|LOCATION|us-central1|
|VERSION|3.1.2|
|URI|https://metastore-dev-53ae9631-fniezrdzdq-uc.a.run.app:443|
|WAREHOUSE_DIR|gs://gcs-bucket-metastore-dev-53ae9631-5703-491f-829f-164917e79441/hive-warehouse  |
|BINARIES_BUCKET|gs://dataproc-metastore-public-binaries  |

## Using Dataproc Metastore Public Datasets

**Step 1**: Create a Dataproc Cluster with a Dataproc Metastore service attached to it, via the UI or the following gcloud command:

```console
export GCP_PROJECT=<your_gcp_project>
export REGION=<your_region>
export CLUSTER_IMAGE_VERSION=<your_image_version> # ex: 2.0-debian10
export CLUSTER_NAME=<your_desired_cluster_name> # ex: cluster-with-federation
export SERVICE_ACCOUNT=<your_service_account>

gcloud dataproc clusters create $CLUSTER_NAME \
    --region=$REGION \
    --project=$GCP_PROJECT \
    --service-account=$SERVICE_ACCOUNT \
    --image-version=$CLUSTER_IMAGE_VERSION \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --enable-component-gateway \
    --optional-components JUPYTER \
    --dataproc-metastore projects/dataproc-workspaces-notebooks/locations/us-central1/services/metastore-dev
```
* For image versions newer than 2.1, the `--scopes=https://www.googleapis.com/auth/cloud-platform` flag is not needed.
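
The value passed to `--dataproc-metastore` in the command above follows the pattern `projects/<project>/locations/<location>/services/<name>`, filled in from the GCP_PROJECT, LOCATION, and NAME rows of the table. A minimal Python sketch of that assembly, where the helper name `metastore_resource_name` is hypothetical and only illustrative:

```python
def metastore_resource_name(project: str, location: str, name: str) -> str:
    """Build the fully qualified Dataproc Metastore service name."""
    return f"projects/{project}/locations/{location}/services/{name}"


# Values taken from the public metastore table above.
public_metastore = metastore_resource_name(
    project="dataproc-workspaces-notebooks",
    location="us-central1",
    name="metastore-dev",
)
print(public_metastore)
```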

**Step 2**: Use PySpark to inspect the **public_datasets** database and list its tables:

```python
from pyspark.sql import SparkSession
```

```python
spark = SparkSession.builder \
    .appName("Dataproc Metastore Example") \
    .enableHiveSupport() \
    .getOrCreate()
```

```python
spark.sql("""DESCRIBE DATABASE public_datasets""").show(500,False)
```

|    info_name|                                                                                          info_value|
|-------------|----------------------------------------------------------------------------------------------------|
|Database Name|                                                                                     public_datasets|
|      Comment|                                                                                                    |
|     Location|gs://gcs-bucket-metastore-dev-53ae9631-5703-491f-829f-164917e79441/hive-warehouse/public_datasets.db|
|        Owner|                                                                                                root|
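
The `DESCRIBE DATABASE` result shown above is a two-column DataFrame (`info_name`, `info_value`), so it can be collected into a dictionary when a single property such as the warehouse location is needed in code. A minimal sketch, assuming the `spark` session created earlier; the helper names `rows_to_properties` and `describe_database` are hypothetical:

```python
def rows_to_properties(rows):
    """Turn (info_name, info_value) pairs into a plain dict."""
    return {name: value for name, value in rows}


def describe_database(spark, database):
    """Run DESCRIBE DATABASE and return its properties as a dict."""
    rows = spark.sql(f"DESCRIBE DATABASE {database}").collect()
    return rows_to_properties(rows)


# Usage on the cluster (requires the active `spark` session):
# location = describe_database(spark, "public_datasets")["Location"]
```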

## Available tables

```python
spark.sql("SHOW TABLES IN public_datasets").show()
```

|       database|tableName|isTemporary|
|---------------|---------|-----------|
|public_datasets|  cuad_v1|      false|
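
The same listing is available through the Spark catalog API, which is convenient when a job should fail early if an expected table is missing. A minimal sketch, assuming the `spark` session created earlier; the helper names `table_names` and `require_table` are hypothetical:

```python
def table_names(spark, database):
    """Return the sorted names of all tables registered in `database`."""
    return sorted(table.name for table in spark.catalog.listTables(database))


def require_table(spark, database, table):
    """Return the qualified table name, or raise if it is not registered."""
    available = table_names(spark, database)
    if table not in available:
        raise LookupError(
            f"{database}.{table} not found; available tables: {available}"
        )
    return f"{database}.{table}"


# Usage on the cluster (requires the active `spark` session):
# qualified = require_table(spark, "public_datasets", "cuad_v1")
```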

### Example 1) Contract Understanding Atticus Dataset (CUAD) - PDF files

#### How this dataset was created in the metastore

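Tables for binary files are created with Spark's `binaryFile` data source, which loads each file as one row holding its path, modification time, length, and content: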

```
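# Assumption: PUBLIC_DATASETS_BINARIES_BUCKET holds the BINARIES_BUCKET value from the table above;
# "/dataset/path/" and "table_name" are placeholders for the dataset folder and target table.
binaries_tables_metadata = spark.read.format("binaryFile").option("recursiveFileLookup", "true").load(PUBLIC_DATASETS_BINARIES_BUCKET + "/dataset/path/")
binaries_tables_metadata.write.mode('overwrite').saveAsTable("public_datasets.table_name")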
```


```python
spark.sql("DESCRIBE TABLE EXTENDED public_datasets.cuad_v1").show(50,50)
```

|                    col_name|                                         data_type|comment|
|----------------------------|--------------------------------------------------|-------|
|                        path|                                            string|   null|
|            modificationTime|                                         timestamp|   null|
|                      length|                                            bigint|   null|
|                     content|                                            binary|   null|
|                            |                                                  |       |
|# Detailed Table Information|                                                  |       |
|                    Database|                                   public_datasets|       |
|                       Table|                                           cuad_v1|       |
|                       Owner|                                              root|       |
|                Created Time|                      Fri May 26 21:24:19 UTC 2023|       |
|                 Last Access|                                           UNKNOWN|       |
|                  Created By|                                       Spark 3.3.0|       |
|                        Type|                                           MANAGED|       |
|                    Provider|                                           parquet|       |
|                  Statistics|                                    94736181 bytes|       |
|                    Location|gs://gcs-bucket-metastore-dev-53ae9631-5703-491...|       |
|               Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.Parq...|       |
|                 InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParq...|       |
|                OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParq...|       |

```python
cuad_v1 = spark.sql("SELECT * FROM public_datasets.cuad_v1 LIMIT 100")
cuad_v1.show()
```

|                path|    modificationTime| length|             content|
|--------------------|--------------------|-------|--------------------|
|gs://dataproc-met...|2023-05-15 20:53:...|3683550|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|2881262|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:54:...|1778356|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|1557129|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|1452180|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|1403548|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|1064706|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|1063465|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|1008546|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...| 995799|[25 50 44 46 2D 3...|
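
Each `content` value in the output above starts with the bytes `25 50 44 46 2D`, which is the ASCII signature `%PDF-`, so each row appears to hold one complete PDF file. A minimal sketch of pulling a few rows back to the driver and writing them to local disk follows. The helper names and the `/tmp/cuad_v1` output directory are hypothetical, and collecting binary rows to the driver is only sensible with a small `LIMIT`:

```python
import os

PDF_MAGIC = b"%PDF-"


def is_pdf(content):
    """Check the leading bytes for the PDF file signature."""
    return bytes(content[:5]) == PDF_MAGIC


def local_name(path):
    """Derive a local file name from a gs:// object path."""
    return path.rstrip("/").rsplit("/", 1)[-1]


def save_pdfs(rows, out_dir):
    """Write (path, content) pairs that look like PDFs into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for path, content in rows:
        if is_pdf(content):
            target = os.path.join(out_dir, local_name(path))
            with open(target, "wb") as handle:
                handle.write(content)
            written.append(target)
    return written


# Usage on the cluster (requires the active `spark` session):
# rows = spark.sql(
#     "SELECT path, content FROM public_datasets.cuad_v1 LIMIT 5"
# ).collect()
# save_pdfs(rows, "/tmp/cuad_v1")
```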