In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Public Datasets Quickstart

## Overview

#### Datasets in the public read-only bucket

The following GCS bucket contains the files for each available dataset detailed below.   

|                GCP_PROJECT  |                                                                                                                    GCS bucket|
|-----------------------------|------------------------------------------------------------------------------------------------------------------------------|
|dataproc-workspaces-notebooks|[gs://dataproc-metastore-public-binaries](https://console.cloud.google.com/storage/browser/dataproc-metastore-public-binaries)|

#### Datasets in the public Dataproc Metastore instance

Alternatively, you can configure your Dataproc cluster or Serverless Runtime to connect to this public read-only Dataproc Metastore and read the dataset tables using *spark.read.table("public_datasets.\<table_name\>")*.  

|Property   |Value                        |
|-----------|-----------------------------|
|GCP_PROJECT|dataproc-workspaces-notebooks|
|NAME       |public-metastore-v1          |
|LOCATION   |us-central1                  |
|VERSION    |3.1.2                        |

Dataproc Metastore is a fully managed, highly available, autohealing, serverless, Apache Hive metastore (HMS) that runs on Google Cloud.  
Dataproc Metastore provides you with a fully compatible Hive Metastore (HMS), which is the established standard in the open source big data ecosystem for managing technical metadata.  
This service helps you manage the metadata of your data lakes and provides interoperability between the various data processing engines and tools you're using.  
[Dataproc Metastore service documentation](https://cloud.google.com/dataproc-metastore/docs/overview)  
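
The properties in the table above combine into the fully qualified service name that the setup steps later in this notebook ask for. A minimal sketch of that composition follows; the helper `metastore_service_name` and the constant `PUBLIC_METASTORE` are illustrative names, not part of any Google Cloud library.

```python
def metastore_service_name(project: str, location: str, name: str) -> str:
    """Build the fully qualified Dataproc Metastore service resource name."""
    return f"projects/{project}/locations/{location}/services/{name}"


# Values copied from the GCP_PROJECT, LOCATION and NAME rows of the table.
PUBLIC_METASTORE = metastore_service_name(
    project="dataproc-workspaces-notebooks",
    location="us-central1",
    name="public-metastore-v1",
)
print(PUBLIC_METASTORE)
```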

### Available datasets

[**cuad_v1**](https://www.atticusprojectai.org/cuad)   
GCS bucket path: gs://dataproc-metastore-public-binaries/cuad_v1/full_contract_pdf/    
- Format: .pdf  
- Metastore reference: public_datasets.cuad_v1  
- Description: 510 PDF files of legal contracts   
 
|                path|    modificationTime| length|             content|
|--------------------|--------------------|-------|--------------------|
|gs://dataproc-met...|2023-05-15 20:53:...|3683550|[25 50 44 46 2D 3...|

[**winequality_red**](https://archive.ics.uci.edu/dataset/186/wine+quality)   
GCS bucket path: gs://dataproc-metastore-public-binaries/winequality_red/  
- Format: .csv    
- Metastore reference: public_datasets.winequality_red  
- Description: 1599 red vinho verde wine samples with quality scores   

|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density|  pH|sulphates|alcohol|quality|
|-------------|----------------|-----------|--------------|---------|-------------------|--------------------|-------|----|---------|-------|-------|
|          7.0|            0.27|       0.36|          20.7|    0.045|               45.0|               170.0|  1.001| 3.0|     0.45|    8.8|      6|

[**winequality_white**](https://archive.ics.uci.edu/dataset/186/wine+quality)   
GCS bucket path: gs://dataproc-metastore-public-binaries/winequality_white/    
- Format: .csv    
- Metastore reference: public_datasets.winequality_white  
- Description: 4898 white vinho verde wine samples with quality scores  

|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density|  pH|sulphates|alcohol|quality|
|-------------|----------------|-----------|--------------|---------|-------------------|--------------------|-------|----|---------|-------|-------|
|          7.0|            0.27|       0.36|          20.7|    0.045|               45.0|               170.0|  1.001| 3.0|     0.45|    8.8|      6|

[**real_estate_sales**](https://catalog.data.gov/dataset/real-estate-sales-2001-2018)  
GCS bucket path: gs://dataproc-metastore-public-binaries/real_estate_sales/    
- Format: .parquet      
- Metastore reference: public_datasets.real_estate_sales  
- Description: 997213 samples of real estate sales    

| serial_number | list_year | date_recorded | town    | address      | assessed_value | sale_amount | sales_ratio | property_type | residential_type | non_use_code | assessor_remarks | opm_remarks          | longitude           | latitude           |
|---------------|-----------|---------------|---------|--------------|----------------|-------------|-------------|---------------|------------------|--------------|------------------|----------------------|---------------------|--------------------|
| 200594        | 2020      | 2021-02-16    | Danbury | 8 HICKORY ST | 121600.0       | 146216.0    | 0.8316463   | Residential   | Single Family    | 25 - Other   | I11192           | HOUSE HAS SETTLED... |            -73.44696|           41.41422 |

[**sms_spam_collection**](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)  
GCS bucket path: gs://dataproc-metastore-public-binaries/sms_spam_collection/    
- Format: .csv    
- Metastore reference: public_datasets.sms_spam_collection   
- Description: 5574 SMS messages (4827 ham and 747 spam)  

|label|                text|
|-----|--------------------|
|  ham|Go until jurong p...|

[**us_customer_price_index_yearly**](https://data.bls.gov/timeseries/CUUR0000SA0L1E?output_view=pct_12mths)  
GCS bucket path: gs://dataproc-metastore-public-binaries/us_customer_price_index_yearly/    
- Format: .csv    
- Metastore reference: public_datasets.us_customer_price_index_yearly   
- Description: Yearly Consumer Price Index (CPI) values from 1913 to 2022  

| Year | CPI   |
|------|-------|
| 2022 | 297.0 |

[**ai4i_2020_predictive_maintenance**](https://archive.ics.uci.edu/dataset/601/ai4i+2020+predictive+maintenance+dataset)  
GCS bucket path: gs://dataproc-metastore-public-binaries/ai4i_2020_predictive_maintenance/    
- Format: .csv    
- Metastore reference: public_datasets.ai4i_2020_predictive_maintenance   
- Description: 10000 data points stored as rows with 14 features in columns   

|udi|product_id|type|air_temperature_k|process_temperature_k|rotational_speed_rpm|torque_nm|tool_wear_min|machine_failure|twf|hdf|pwf|osf|rnf|
|---|----------|----|-----------------|---------------------|--------------------|---------|-------------|---------------|---|---|---|---|---|
|  1|    M14860|   M|            298.1|                308.6|                1551|     42.8|            0|              0|  0|  0|  0|  0|  0|

[**stanford_online_products**](https://cvgl.stanford.edu/projects/lifted_struct/)   
GCS bucket path: gs://dataproc-metastore-public-binaries/stanford_online_products/    
- Format: .jpg    
- Metastore reference: public_datasets.stanford_online_products   
- Description: 120k images of 23k classes of online products   

|                path|    modificationTime| length|
|--------------------|--------------------|-------|
|gs://dataproc-met...|2023-12-12 20:45:...|3051905|

[**youtube_ucg**](https://media.withyoutube.com/)  
GCS bucket path: gs://dataproc-metastore-public-binaries/youtube_ucg/
- Format: .mp4
- Metastore reference: public_datasets.youtube_ucg
- Description: 20 YouTube videos

|                path|    modificationTime| length|
|--------------------|--------------------|-------|
|gs://dataproc-met...|2024-01-04 20:08:...|5051568|
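
If you prefer to skip the metastore, the files can also be read straight from the bucket paths listed above. The sketch below maps each dataset to its path and file format and picks a matching Spark reader. The names `DATASETS`, `dataset_path` and `read_dataset` are illustrative. The CSV options are an assumption (a header row and a comma delimiter are not confirmed by this notebook), so column names and types may differ from the metastore tables.

```python
BUCKET = "gs://dataproc-metastore-public-binaries"

# Dataset name -> (path under the bucket, file format), copied from the list above.
DATASETS = {
    "cuad_v1": ("cuad_v1/full_contract_pdf/", "pdf"),
    "winequality_red": ("winequality_red/", "csv"),
    "winequality_white": ("winequality_white/", "csv"),
    "real_estate_sales": ("real_estate_sales/", "parquet"),
    "sms_spam_collection": ("sms_spam_collection/", "csv"),
    "us_customer_price_index_yearly": ("us_customer_price_index_yearly/", "csv"),
    "ai4i_2020_predictive_maintenance": ("ai4i_2020_predictive_maintenance/", "csv"),
    "stanford_online_products": ("stanford_online_products/", "jpg"),
    "youtube_ucg": ("youtube_ucg/", "mp4"),
}

# File format -> Spark data source. Binary files use the "binaryFile" source.
SPARK_FORMATS = {
    "csv": "csv",
    "parquet": "parquet",
    "pdf": "binaryFile",
    "jpg": "binaryFile",
    "mp4": "binaryFile",
}


def dataset_path(name: str) -> str:
    """Return the GCS prefix that holds a dataset's files."""
    return f"{BUCKET}/{DATASETS[name][0]}"


def read_dataset(spark, name: str):
    """Load a dataset directly from the bucket with an existing SparkSession."""
    file_format = DATASETS[name][1]
    reader = spark.read.format(SPARK_FORMATS[file_format])
    if file_format == "csv":
        # Assumption: the CSV files have a header row; adjust if they do not.
        reader = reader.option("header", "true").option("inferSchema", "true")
    elif SPARK_FORMATS[file_format] == "binaryFile":
        # Pick up files in nested folders as well.
        reader = reader.option("recursiveFileLookup", "true")
    return reader.load(dataset_path(name))
```

For example, `read_dataset(spark, "real_estate_sales").show(5)` would read the Parquet files without any metastore configuration.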


## Using Dataproc Metastore Public Datasets

**Using the Dataproc JupyterLab plugin**

- Serverless Runtime:
    1) Create New Runtime Template
    2) In the Metastore section, select the **dataproc-workspaces-notebooks** GCP project
    3) Select **projects/dataproc-workspaces-notebooks/locations/us-central1/services/public-metastore-v1**
    4) Select this runtime as the Jupyter kernel

<center><img src="../docs/create-runtime.png"/></center>
<center><img src="../docs/metastore-select.png"/></center>

**Using Dataproc Cluster**

Create a Dataproc cluster with a Dataproc Metastore service attached, either in the console UI or with the following gcloud commands:  

1) Export variables
```console
export GCP_PROJECT=<your_gcp_project>
export REGION=<your_region>
export CLUSTER_IMAGE_VERSION=<your_image_version> # ex: 2.0-debian10
export CLUSTER_NAME=<your_desired_cluster_name> # ex: cluster-with-federation
export SERVICE_ACCOUNT=<your_service_account>
```

2) Create Dataproc cluster with a Dataproc Metastore service attached
```console
gcloud dataproc clusters create $CLUSTER_NAME \
    --region=$REGION \
    --project=$GCP_PROJECT \
    --service-account=$SERVICE_ACCOUNT \
    --image-version=$CLUSTER_IMAGE_VERSION \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --enable-component-gateway \
    --optional-components JUPYTER \
    --dataproc-metastore projects/dataproc-workspaces-notebooks/locations/us-central1/services/public-metastore-v1

# * For image version > 2.1, the --scopes=https://www.googleapis.com/auth/cloud-platform flag is not needed
```
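
The same cluster can be described in Python. This is a rough sketch that assumes the `google-cloud-dataproc` client library is installed and that the field names below match its v1 API; check them against the client reference before relying on it. The helpers `cluster_config` and `create_cluster` are illustrative, and the metastore path is the public service listed in the overview.

```python
def cluster_config(project, cluster_name, service_account, image_version, metastore_service):
    """Build a cluster definition that mirrors the gcloud flags above."""
    return {
        "project_id": project,
        "cluster_name": cluster_name,
        "config": {
            "gce_cluster_config": {
                "service_account": service_account,
                # Mirrors --scopes
                "service_account_scopes": [
                    "https://www.googleapis.com/auth/cloud-platform"
                ],
            },
            "software_config": {
                "image_version": image_version,
                "optional_components": ["JUPYTER"],
            },
            # Mirrors --enable-component-gateway
            "endpoint_config": {"enable_http_port_access": True},
            # Mirrors --dataproc-metastore
            "metastore_config": {"dataproc_metastore_service": metastore_service},
        },
    }


def create_cluster(project, region, cluster):
    """Submit the cluster definition and wait for the operation to finish."""
    # Imported lazily so cluster_config() works without the client library.
    from google.cloud import dataproc_v1

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.create_cluster(
        request={"project_id": project, "region": region, "cluster": cluster}
    )
    return operation.result()


# Placeholder values; replace them with your own project settings.
example = cluster_config(
    project="my-project",
    cluster_name="cluster-with-metastore",
    service_account="my-sa@my-project.iam.gserviceaccount.com",
    image_version="2.0-debian10",
    metastore_service=(
        "projects/dataproc-workspaces-notebooks/locations/us-central1"
        "/services/public-metastore-v1"
    ),
)
```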

#### Use PySpark to list the tables in the **public_datasets** database:

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder \
    .appName("Dataproc Metastore Example") \
    .enableHiveSupport() \
    .getOrCreate()

## Available tables

In [None]:
spark.sql("SHOW TABLES IN public_datasets").show()

|       database|        tableName|isTemporary|
|---------------|-----------------|-----------|
|public_datasets|          cuad_v1|      false|
|public_datasets|  winequality_red|      false|
|public_datasets|winequality_white|      false|
|public_datasets|real_estate_sales|      false|
|public_datasets|sms_spam_collection|      false|
|public_datasets|us_customer_price_index_yearly|      false|
|public_datasets|ai4i_2020_predictive_maintenance|      false|
|public_datasets|stanford_online_products|      false|
|public_datasets|youtube_ucg|      false|
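
To confirm programmatically that the session is really connected to the public metastore, you could compare the visible tables with the list above. This is a small sketch using `spark.catalog.listTables`; `EXPECTED_TABLES`, `table_names` and `missing_tables` are illustrative names.

```python
# Table names taken from the SHOW TABLES output above.
EXPECTED_TABLES = {
    "cuad_v1",
    "winequality_red",
    "winequality_white",
    "real_estate_sales",
    "sms_spam_collection",
    "us_customer_price_index_yearly",
    "ai4i_2020_predictive_maintenance",
    "stanford_online_products",
    "youtube_ucg",
}


def table_names(spark, database: str = "public_datasets") -> set:
    """Return the names of the tables the session can see in a database."""
    return {table.name for table in spark.catalog.listTables(database)}


def missing_tables(found, expected=EXPECTED_TABLES) -> list:
    """Return the expected tables that were not found, sorted by name."""
    return sorted(set(expected) - set(found))
```

With a live session, `missing_tables(table_names(spark))` returning an empty list suggests the metastore is attached correctly.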

### Example 1) Contract Understanding Atticus Dataset (CUAD) - PDF files

#### How this dataset was created in the metastore

Binary data metadata tables are created with Spark by using:  

```
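PUBLIC_DATASETS_BINARIES_BUCKET = "gs://dataproc-metastore-public-binaries"
# Index every file under the dataset path, then register the result as a metastore table.
binaries_tables_metadata = spark.read.format("binaryFile").option("recursiveFileLookup", "true").load(PUBLIC_DATASETS_BINARIES_BUCKET + "/dataset/path/")
binaries_tables_metadata.write.mode("overwrite").saveAsTable("public_datasets.table_name")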
```


In [None]:
cuad_v1 = spark.read.table("public_datasets.cuad_v1")
cuad_v1.show()

|                path|    modificationTime| length|             content|
|--------------------|--------------------|-------|--------------------|
|gs://dataproc-met...|2023-05-15 20:53:...|3683550|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|2881262|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:54:...|1778356|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|1557129|[25 50 44 46 2D 3...|
|gs://dataproc-met...|2023-05-15 20:53:...|1452180|[25 50 44 46 2D 3...|
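
The `content` column holds the raw file bytes. The values shown above start with `25 50 44 46 2D`, which is the ASCII signature `%PDF-`. A minimal sketch of a sanity check on a few rows follows; `looks_like_pdf` and `check_table` are illustrative helpers, and collecting `content` to the driver is only reasonable for a small `limit`.

```python
PDF_MAGIC = b"%PDF-"  # bytes 25 50 44 46 2D


def looks_like_pdf(content) -> bool:
    """Return True if the given bytes start with the PDF signature."""
    return bytes(content[: len(PDF_MAGIC)]) == PDF_MAGIC


def check_table(df, limit: int = 5):
    """Return (path, is_pdf) pairs for the first few rows of a binaryFile table."""
    rows = df.select("path", "content").limit(limit).collect()
    return [(row["path"], looks_like_pdf(row["content"])) for row in rows]
```

For example, `check_table(cuad_v1)` returns one `(path, True/False)` pair per sampled contract.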

### Example 2) Wine Quality (Red and White)

In [None]:
winequality_white = spark.read.table("public_datasets.winequality_white")
winequality_white.show()

|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density|  pH|sulphates|alcohol|quality|
|-------------|----------------|-----------|--------------|---------|-------------------|--------------------|-------|----|---------|-------|-------|
|          7.0|            0.27|       0.36|          20.7|    0.045|               45.0|               170.0|  1.001| 3.0|     0.45|    8.8|      6|
|          6.3|             0.3|       0.34|           1.6|    0.049|               14.0|               132.0|  0.994| 3.3|     0.49|    9.5|      6|
|          8.1|            0.28|        0.4|           6.9|     0.05|               30.0|                97.0| 0.9951|3.26|     0.44|   10.1|      6|
|          7.2|            0.23|       0.32|           8.5|    0.058|               47.0|               186.0| 0.9956|3.19|      0.4|    9.9|      6|
|          7.2|            0.23|       0.32|           8.5|    0.058|               47.0|               186.0| 0.9956|3.19|      0.4|    9.9|      6|

### Example 3) Real Estate Sales

In [None]:
real_estate_sales = spark.read.table("public_datasets.real_estate_sales")
real_estate_sales.show()

| serial_number | list_year | date_recorded | town    | address              | assessed_value | sale_amount | sales_ratio | property_type | residential_type | non_use_code         | assessor_remarks     | opm_remarks          | longitude            | latitude           |
|---------------|-----------|---------------|---------|----------------------|----------------|-------------|-------------|---------------|------------------|----------------------|----------------------|----------------------|----------------------|--------------------|
| 200594        | 2020      | 2021-02-16    | Danbury | 8 HICKORY ST         | 121600.0       | 146216.0    | 0.8316463   | Residential   | Single Family    | 25 - Other           | I11192               | HOUSE HAS SETTLED... |            -73.44696 |           40.41404 |
| 200562        | 2020      | 2021-02-03    | Danbury | 19  MILL RD          | 263600.0       | 415000.0    | 0.6351807   | Residential   | Single Family    | 25 - Other           | AFFORDABLE HOUSIN... | INCORRECT DATA PE... |            -23.42596 |           21.41424 |
| 200968        | 2020      | 2021-05-24    | Danbury | 4A FLIRTATION DR     | 205700.0       | 515000.0    | 0.3994175   | Residential   | Single Family    | 07 - Change in Pr... | B17008               | UPDATED KITCHEN P... |            -73.55596 |           48.43564 |
| 200260        | 2020      | 2020-11-23    | Danbury | 32 COALPIT HILL R... | 84900.0        | 181778.0    | 0.4670532   | Residential   | Condo            | 25 - Other           | J16087-4             | MULTIPLE UNIT SALE   |            -75.46666 |           75.56424 |
| 200262        | 2020      | 2020-11-23    | Danbury | 32 COALPIT HILL R... | 84900.0        | 181778.0    | 0.4670532   | Residential   | Condo            | 25 - Other           | J16087-6             | MULTIPLE UNIT SALE   |            -13.44696 |           22.48524 |
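
In the rows shown above, `sales_ratio` matches `assessed_value / sale_amount` (for example, 121600.0 / 146216.0 ≈ 0.8316463). Assuming that relationship holds for the whole table, which this notebook does not state explicitly, a sketch for recomputing the ratio looks like this. `sales_ratio` and `with_recomputed_ratio` are illustrative helpers.

```python
def sales_ratio(assessed_value: float, sale_amount: float) -> float:
    """Ratio of the assessed value to the sale price."""
    return assessed_value / sale_amount


def with_recomputed_ratio(df):
    """Add a column with the ratio recomputed from the two value columns."""
    return df.selectExpr(
        "*", "assessed_value / sale_amount AS recomputed_sales_ratio"
    )


# Check the first two rows of the sample output.
print(round(sales_ratio(121600.0, 146216.0), 7))
print(round(sales_ratio(263600.0, 415000.0), 7))
```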

### Example 4) SMS Spam collection

In [None]:
sms_spam_collection = spark.read.table("public_datasets.sms_spam_collection")
sms_spam_collection.show()

|label|                text|
|-----|--------------------|
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
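
The dataset is imbalanced: with 4827 ham and 747 spam messages, spam makes up roughly 13.4% of the rows. A short sketch for computing the class shares follows; `label_counts` and `label_shares` are illustrative helpers, and `label_counts` assumes the `label` column shown above.

```python
def label_counts(df) -> dict:
    """Count rows per label using the DataFrame API."""
    return {row["label"]: row["count"] for row in df.groupBy("label").count().collect()}


def label_shares(counts: dict) -> dict:
    """Convert per-label counts into fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Counts taken from the dataset description.
shares = label_shares({"ham": 4827, "spam": 747})
print({label: round(share, 3) for label, share in shares.items()})
```

With a live session, `label_shares(label_counts(sms_spam_collection))` computes the same shares from the table.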

### Example 5) US Consumer Price Index Yearly

In [None]:
us_customer_price_index_yearly = spark.read.table("public_datasets.us_customer_price_index_yearly")
us_customer_price_index_yearly.show()

|Year| CPI|
|----|----|
|1913| 9.9|
|1914|10.0|
|1915|10.1|
|1916|10.9|
|1917|12.8|
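
A common use of this table is the year-over-year inflation rate, (CPI_t / CPI_{t-1} - 1) × 100. The sketch below computes it in plain Python after collecting the rows, which is fine for a table this small. `yearly_inflation` and `inflation_from_table` are illustrative helpers, and the code assumes numeric `Year` and `CPI` columns with no gaps between years.

```python
def yearly_inflation(rows):
    """Map each year to its percent CPI change versus the previous row.

    rows: iterable of (year, cpi) pairs.
    """
    ordered = sorted(rows)
    return {
        year: (cpi / prev_cpi - 1.0) * 100.0
        for (_, prev_cpi), (year, cpi) in zip(ordered, ordered[1:])
    }


def inflation_from_table(df):
    """Collect the Year and CPI columns and compute yearly inflation."""
    rows = [(r["Year"], r["CPI"]) for r in df.select("Year", "CPI").collect()]
    return yearly_inflation(rows)


# The five rows shown in the sample output.
sample = [(1913, 9.9), (1914, 10.0), (1915, 10.1), (1916, 10.9), (1917, 12.8)]
rates = yearly_inflation(sample)
print({year: round(rate, 2) for year, rate in rates.items()})
```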

### Example 6) AI4I 2020 Predictive Maintenance Dataset

In [None]:
predictive_maintenance = spark.read.table("public_datasets.ai4i_2020_predictive_maintenance")
predictive_maintenance.show()

|udi|product_id|type|air_temperature_k|process_temperature_k|rotational_speed_rpm|torque_nm|tool_wear_min|machine_failure|twf|hdf|pwf|osf|rnf|
|---|----------|----|-----------------|---------------------|--------------------|---------|-------------|---------------|---|---|---|---|---|
|  1|    M14860|   M|            298.1|                308.6|                1551|     42.8|            0|              0|  0|  0|  0|  0|  0|
|  2|    L47181|   L|            298.2|                308.7|                1408|     46.3|            3|              0|  0|  0|  0|  0|  0|
|  3|    L47182|   L|            298.1|                308.5|                1498|     49.4|            5|              0|  0|  0|  0|  0|  0|
|  4|    L47183|   L|            298.2|                308.6|                1433|     39.5|            7|              0|  0|  0|  0|  0|  0|
|  5|    L47184|   L|            298.2|                308.7|                1408|     40.0|            9|              0|  0|  0|  0|  0|  0|
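
The last five columns flag individual failure modes. As documented for the AI4I 2020 dataset, these are tool wear (TWF), heat dissipation (HDF), power (PWF), overstrain (OSF) and random (RNF) failures. Assuming the flags are stored as numeric 0/1 values, their averages give the failure rate per mode. `FAILURE_MODES`, `failure_rate_exprs` and `failure_rates` are illustrative names.

```python
# Failure-mode columns and their meaning in the AI4I 2020 dataset.
FAILURE_MODES = {
    "twf": "tool wear failure",
    "hdf": "heat dissipation failure",
    "pwf": "power failure",
    "osf": "overstrain failure",
    "rnf": "random failure",
}


def failure_rate_exprs(columns=("machine_failure", *FAILURE_MODES)):
    """Build one SQL expression per flag column that averages the 0/1 values."""
    return [f"avg({column}) AS {column}_rate" for column in columns]


def failure_rates(df):
    """Return a single-row DataFrame with the rate of each failure flag."""
    return df.selectExpr(*failure_rate_exprs())
```

For example, `failure_rates(predictive_maintenance).show()` prints one row with six rate columns.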

### Example 7) Stanford Online Products

In [None]:
stanford_online_products = spark.read.table("public_datasets.stanford_online_products")
stanford_online_products.show()

|                path|    modificationTime| length|
|--------------------|--------------------|-------|
|gs://dataproc-met...|2023-12-12 20:45:...|3051905|
|gs://dataproc-met...|2023-12-12 20:45:...|2998889|
|gs://dataproc-met...|2023-12-12 20:44:...|2646343|
|gs://dataproc-met...|2023-12-12 20:44:...|2605636|
|gs://dataproc-met...|2023-12-12 20:45:...|2457016|
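
The `length` column stores each file's size in bytes, so it can be used to estimate how much data a job will read. A small sketch with illustrative helpers `format_size` and `total_size`:

```python
def format_size(num_bytes) -> str:
    """Format a byte count using binary units (1 KiB = 1024 B)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.2f} {unit}"
        size /= 1024


def total_size(df) -> str:
    """Sum the length column of a binaryFile table and format the result."""
    total = df.agg({"length": "sum"}).collect()[0][0] or 0
    return format_size(total)


# Size of the first file in the sample output.
print(format_size(3051905))
```

With a live session, `total_size(stanford_online_products)` reports the combined size of all indexed images.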

### Example 8) YouTube dataset - User Generated Content

In [None]:
youtube_ucg = spark.read.table("public_datasets.youtube_ucg")
youtube_ucg.show()

|                path|    modificationTime| length|
|--------------------|--------------------|-------|
|gs://dataproc-met...|2024-01-04 20:08:...|5051568|
|gs://dataproc-met...|2024-01-04 20:08:...|4450939|
|gs://dataproc-met...|2024-01-04 20:08:...|2766749|
|gs://dataproc-met...|2024-01-04 20:08:...|2525019|
|gs://dataproc-met...|2024-01-04 20:08:...|2311945|
|gs://dataproc-met...|2024-01-04 20:08:...|1682359|