In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dataproc Metastore - Public Datasets Quickstart

## Overview

#### **Dataproc Metastore service**

Dataproc Metastore is a fully managed, highly available, autohealing, serverless, Apache Hive metastore (HMS) that runs on Google Cloud.

Dataproc Metastore provides you with a fully compatible Hive Metastore (HMS), which is the established standard in the open source big data ecosystem for managing technical metadata. This service helps you manage the metadata of your data lakes and provides interoperability between the various data processing engines and tools you're using.

[Dataproc Metastore service documentation](https://cloud.google.com/dataproc-metastore/docs/overview)

### Available datasets

[**public_datasets.cuad_v1**](https://www.atticusprojectai.org/cuad)  
510 PDF files of legal contracts  
 
|                path|    modificationTime| length|             content|
|--------------------|--------------------|-------|--------------------|
|gs://dataproc-met...|2023-05-15 20:53:...|3683550|[25 50 44 46 2D 3...|

[**public_datasets.winequality_red**](https://archive.ics.uci.edu/dataset/186/wine+quality)  
Wine quality data for 1599 red vinho verde samples  

|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density|  pH|sulphates|alcohol|quality|
|-------------|----------------|-----------|--------------|---------|-------------------|--------------------|-------|----|---------|-------|-------|
|          7.0|            0.27|       0.36|          20.7|    0.045|               45.0|               170.0|  1.001| 3.0|     0.45|    8.8|      6|

[**public_datasets.winequality_white**](https://archive.ics.uci.edu/dataset/186/wine+quality)  
Wine quality data for 4898 white vinho verde samples  

|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density|  pH|sulphates|alcohol|quality|
|-------------|----------------|-----------|--------------|---------|-------------------|--------------------|-------|----|---------|-------|-------|
|          7.0|            0.27|       0.36|          20.7|    0.045|               45.0|               170.0|  1.001| 3.0|     0.45|    8.8|      6|


[**public_datasets.real_estate_sales**](https://catalog.data.gov/dataset/real-estate-sales-2001-2018)  
997213 samples of real estate sales  

| serial_number | list_year | date_recorded | town    | address      | assessed_value | sale_amount | sales_ratio | property_type | residential_type | non_use_code | assessor_remarks | opm_remarks          | location            |
|---------------|-----------|---------------|---------|--------------|----------------|-------------|-------------|---------------|------------------|--------------|------------------|----------------------|---------------------|
| 200594        | 2020      | 2021-02-16    | Danbury | 8 HICKORY ST | 121600.0       | 146216.0    | 0.8316463   | Residential   | Single Family    | 25 - Other   | I11192           | HOUSE HAS SETTLED... | {-73.44696, 41.41.. |

[**public_datasets.sms_spam_collection**](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)  
5574 SMS messages (4827 ham and 747 spam)  

|label|                text|
|-----|--------------------|
|  ham|Go until jurong p...|

### About Dataproc Metastore Public Datasets

We provide a public Dataproc Metastore service preloaded with several public datasets. All users have read-only access to it, which makes it easy to read these preselected datasets and try out Dataproc capabilities.

|Property|Value|
|--------|-----|
|GCP_PROJECT|dataproc-workspaces-notebooks  |
|NAME|metastore-dev |
|LOCATION|us-central1|
|VERSION|3.1.2|
|URI|https://metastore-dev-53ae9631-fniezrdzdq-uc.a.run.app:443|
|WAREHOUSE_DIR|gs://gcs-bucket-metastore-dev-53ae9631-5703-491f-829f-164917e79441/hive-warehouse  |
|BINARIES_BUCKET|gs://dataproc-metastore-public-binaries  |
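
The project, location, and name above combine into the fully qualified service name used when attaching the metastore to a workspace or cluster. A minimal sketch, assuming the `projects/.../locations/.../services/...` pattern that appears in the setup steps of this notebook; the helper name `metastore_service_name` is hypothetical:

```python
def metastore_service_name(project: str, location: str, name: str) -> str:
    """Build the fully qualified Dataproc Metastore service resource name."""
    return f"projects/{project}/locations/{location}/services/{name}"


# Values copied from the GCP_PROJECT, LOCATION, and NAME rows of the table above.
PUBLIC_METASTORE = metastore_service_name(
    project="dataproc-workspaces-notebooks",
    location="us-central1",
    name="metastore-dev",
)
print(PUBLIC_METASTORE)
```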

## Using Dataproc Metastore Public Datasets

**Using Dataproc Workspaces**
    
- Via the UI console:
    1) Edit Workspace Runtime
    2) In the Metastore section, select the **dataproc-workspaces-notebooks** GCP project
    3) Select **projects/dataproc-workspaces-notebooks/locations/us-central1/services/metastore-dev**


**Using Dataproc Cluster**

Create a Dataproc cluster with the public Dataproc Metastore service attached, using either the UI or the following gcloud command:

```console
export GCP_PROJECT=<your_gcp_project>
export REGION=<your_region>
export CLUSTER_IMAGE_VERSION=<your_image_version> # e.g. 2.0-debian10
export CLUSTER_NAME=<your_desired_cluster_name> # e.g. cluster-with-federation
export SERVICE_ACCOUNT=<your_service_account>

gcloud dataproc clusters create $CLUSTER_NAME \
    --region=$REGION \
    --project=$GCP_PROJECT \
    --service-account=$SERVICE_ACCOUNT \
    --image-version=$CLUSTER_IMAGE_VERSION \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --enable-component-gateway \
    --optional-components JUPYTER \
    --dataproc-metastore projects/dataproc-workspaces-notebooks/locations/us-central1/services/metastore-dev
```
* For image versions later than 2.1, the `--scopes=https://www.googleapis.com/auth/cloud-platform` flag is not needed.
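
If you script cluster creation, the same command can be assembled in Python so that the scopes flag is added only when the note above says it is needed. This is a sketch under stated assumptions, not an official tool: the helper names `image_major_minor` and `build_create_command` are hypothetical, the project and service account values are placeholders, and the version check only reads the leading `major.minor` part of the image version.

```python
import shlex

PUBLIC_METASTORE = (
    "projects/dataproc-workspaces-notebooks/locations/us-central1/services/metastore-dev"
)
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def image_major_minor(image_version: str) -> tuple:
    """Parse '2.0-debian10' into (2, 0); only the leading 'major.minor' is used."""
    major, minor = image_version.split("-")[0].split(".")[:2]
    return int(major), int(minor)


def build_create_command(cluster_name, region, project, service_account, image_version):
    """Assemble the `gcloud dataproc clusters create` argument list shown above."""
    args = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--project={project}",
        f"--service-account={service_account}",
        f"--image-version={image_version}",
        "--enable-component-gateway",
        "--optional-components", "JUPYTER",
        "--dataproc-metastore", PUBLIC_METASTORE,
    ]
    # Following the note above literally: keep the scopes flag up to version 2.1.
    if image_major_minor(image_version) <= (2, 1):
        args.append(f"--scopes={CLOUD_PLATFORM_SCOPE}")
    return args


# Placeholder values; replace them with your own project settings.
cmd = build_create_command(
    "cluster-with-federation", "us-central1", "my-project",
    "my-sa@my-project.iam.gserviceaccount.com", "2.0-debian10",
)
print(shlex.join(cmd))
```

Printing the command instead of executing it lets you review it before pasting it into a terminal.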

#### Use PySpark to list the tables in the **public_datasets** database

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder \
    .appName("Dataproc Metastore Example") \
    .enableHiveSupport() \
    .getOrCreate()

In [None]:
spark.sql("DESCRIBE DATABASE public_datasets").show(500, truncate=False)

|    info_name|                                                                                          info_value|
|-------------|----------------------------------------------------------------------------------------------------|
|Database Name|                                                                                     public_datasets|
|      Comment|                                                                                                    |
|     Location|gs://gcs-bucket-metastore-dev-53ae9631-5703-491f-829f-164917e79441/hive-warehouse/public_datasets.db|
|        Owner|                                                                                                root|

## Available tables

In [None]:
spark.sql("SHOW TABLES IN public_datasets").show()

|       database|        tableName|isTemporary|
|---------------|-----------------|-----------|
|public_datasets|          cuad_v1|      false|
|public_datasets|  winequality_red|      false|
|public_datasets|winequality_white|      false|
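
To get a quick overview of every table, you can loop over the `SHOW TABLES` result and count rows. A minimal sketch, assuming the `spark` session created earlier in this notebook; the helper name `table_row_counts` is hypothetical, and counting may scan a lot of data for large tables:

```python
def table_row_counts(spark, database="public_datasets"):
    """Return {table_name: row_count} for every table in `database`."""
    tables = spark.sql(f"SHOW TABLES IN {database}").collect()
    return {
        row["tableName"]: spark.table(f"{database}.{row['tableName']}").count()
        for row in tables
    }
```

In a notebook cell, call it as `table_row_counts(spark)` once the session exists.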

### Example 1) Contract Understanding Atticus Dataset (CUAD) - PDF files

#### How this dataset was created in the metastore

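Tables for binary files are created with Spark's `binaryFile` data source, which stores each file's path, modification time, length, and raw content:  

```python
# PUBLIC_DATASETS_BINARIES_BUCKET is assumed to hold the BINARIES_BUCKET URI listed earlier; "/dataset/path/" and "table_name" are placeholders.
binaries_tables_metadata = spark.read.format("binaryFile").option("recursiveFileLookup", "true").load(PUBLIC_DATASETS_BINARIES_BUCKET + "/dataset/path/")
binaries_tables_metadata.write.mode("overwrite").saveAsTable("public_datasets.table_name")
```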


In [None]:
cuad_v1 = spark.read.table("public_datasets.cuad_v1")
cuad_v1.show()

|                path|    modificationTime| length|             content|
|--------------------|--------------------|-------|--------------------|
|gs://dataproc-met...|2023-05-15 20:53:...|3683550|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|2881262|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:54:...|1778356|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|1557129|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|1452180|[25 50 44 46 2D 3...|
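
The `content` column holds the raw bytes of each file. The leading bytes shown above (`25 50 44 46 2D`) spell `%PDF-`, the header that PDF files start with. A small sketch of checking that header in plain Python; the helper name `looks_like_pdf` is hypothetical:

```python
PDF_MAGIC = b"%PDF-"  # bytes 25 50 44 46 2D in hexadecimal


def looks_like_pdf(content) -> bool:
    """Return True if the raw file content starts with the PDF header."""
    return bytes(content[:len(PDF_MAGIC)]) == PDF_MAGIC


# The first five bytes shown in the `content` column above.
sample_prefix = bytes.fromhex("25 50 44 46 2D")
print(sample_prefix, looks_like_pdf(sample_prefix))
```

In the notebook, you could apply it to one row with `looks_like_pdf(cuad_v1.select("content").first()["content"])`; note that this brings a whole file of several megabytes to the driver.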

### Example 2) Wine Quality (Red and White)

In [None]:
winequality_white = spark.read.table("public_datasets.winequality_white")
winequality_white.show()

|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density|  pH|sulphates|alcohol|quality|
|-------------|----------------|-----------|--------------|---------|-------------------|--------------------|-------|----|---------|-------|-------|
|          7.0|            0.27|       0.36|          20.7|    0.045|               45.0|               170.0|  1.001| 3.0|     0.45|    8.8|      6|
|          6.3|             0.3|       0.34|           1.6|    0.049|               14.0|               132.0|  0.994| 3.3|     0.49|    9.5|      6|
|          8.1|            0.28|        0.4|           6.9|     0.05|               30.0|                97.0| 0.9951|3.26|     0.44|   10.1|      6|
|          7.2|            0.23|       0.32|           8.5|    0.058|               47.0|               186.0| 0.9956|3.19|      0.4|    9.9|      6|
|          7.2|            0.23|       0.32|           8.5|    0.058|               47.0|               186.0| 0.9956|3.19|      0.4|    9.9|      6|

### Example 3) Real Estate Sales

In [None]:
real_estate_sales = spark.read.table("public_datasets.real_estate_sales")
real_estate_sales.show()

| serial_number | list_year | date_recorded | town    | address              | assessed_value | sale_amount | sales_ratio | property_type | residential_type | non_use_code         | assessor_remarks     | opm_remarks          | location             |
|---------------|-----------|---------------|---------|----------------------|----------------|-------------|-------------|---------------|------------------|----------------------|----------------------|----------------------|----------------------|
| 200594        | 2020      | 2021-02-16    | Danbury | 8 HICKORY ST         | 121600.0       | 146216.0    | 0.8316463   | Residential   | Single Family    | 25 - Other           | I11192               | HOUSE HAS SETTLED... | {-73.44696, 41.41... |
| 200562        | 2020      | 2021-02-03    | Danbury | 19  MILL RD          | 263600.0       | 415000.0    | 0.6351807   | Residential   | Single Family    | 25 - Other           | AFFORDABLE HOUSIN... | INCORRECT DATA PE... | {-73.53692, 41.38... |
| 200968        | 2020      | 2021-05-24    | Danbury | 4A FLIRTATION DR     | 205700.0       | 515000.0    | 0.3994175   | Residential   | Single Family    | 07 - Change in Pr... | B17008               | UPDATED KITCHEN P... | {null, null}         |
| 200260        | 2020      | 2020-11-23    | Danbury | 32 COALPIT HILL R... | 84900.0        | 181778.0    | 0.4670532   | Residential   | Condo            | 25 - Other           | J16087-4             | MULTIPLE UNIT SALE   | {-73.43796, 41.38... |
| 200262        | 2020      | 2020-11-23    | Danbury | 32 COALPIT HILL R... | 84900.0        | 181778.0    | 0.4670532   | Residential   | Condo            | 25 - Other           | J16087-6             | MULTIPLE UNIT SALE   | {-73.43796, 41.38... |
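
In the rows shown above, `sales_ratio` matches `assessed_value / sale_amount`. This relationship is inferred from the sample output rather than stated by the dataset, so treat the following check as a sketch that verifies it only for those rows:

```python
import math

# (assessed_value, sale_amount, sales_ratio) copied from the rows shown above.
rows = [
    (121600.0, 146216.0, 0.8316463),
    (263600.0, 415000.0, 0.6351807),
    (205700.0, 515000.0, 0.3994175),
    (84900.0, 181778.0, 0.4670532),
]


def sales_ratio(assessed_value: float, sale_amount: float) -> float:
    """Ratio of the assessed value to the sale amount."""
    return assessed_value / sale_amount


for assessed, sale, shown in rows:
    computed = sales_ratio(assessed, sale)
    matches = math.isclose(computed, shown, abs_tol=1e-6)
    print(f"computed={computed:.7f} shown={shown} match={matches}")
```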

### Example 4) SMS Spam collection

In [None]:
sms_spam_collection = spark.read.table("public_datasets.sms_spam_collection")
sms_spam_collection.show()

|label|                text|
|-----|--------------------|
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
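
The label counts given in the dataset description (4827 ham and 747 spam) show how imbalanced the classes are, which matters if you train a classifier on this table. A short sketch of the arithmetic in plain Python, using only those published counts:

```python
# Counts from the dataset description earlier in this notebook.
counts = {"ham": 4827, "spam": 747}
total = sum(counts.values())

for label, n in counts.items():
    print(f"{label}: {n} messages ({n / total:.1%})")
```

To compute the same counts from the table itself, `sms_spam_collection.groupBy("label").count().show()` should give the per-label totals.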