# Sample AWS Glue Studio Notebook

This notebook is an example that demonstrates how to write Hudi code interactively in a Glue Studio Notebook.

In [2]:
%help

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.38.1 



# Available Magic Commands

## Sessions Magic

----
    %help                             Return a list of descriptions and input types for all magic commands. 
    %profile            String        Specify a profile in your aws configuration to use as the credentials provider.
    %region             String        Specify the AWS region in which to initialize a session. 
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %idle_timeout       Int           The number of minutes of inactivity after which a session will timeout. 
                                      Default: 2880 minutes (48 hours).
    %session_id_prefix  String        Define a String that will precede all session IDs in the format 
                                      [session_id_prefix]-[session_id]. If a session ID is not provided,
                                      a random UUID will be generated.
    %status                           Returns the status of the current Glue session including its duration, 
                                      configuration and executing user / role.
    %session_id                       Returns the session ID for the running session. 
    %list_sessions                    Lists all currently running sessions by ID.
    %stop_session                     Stops the current session.
    %glue_version       String        The version of Glue to be used by this session. 
                                      Currently, the only valid options are 2.0, 3.0 and 4.0. 
                                      Default: 2.0.
----

## Selecting Job Types

----
    %streaming          String        Sets the session type to Glue Streaming.
    %etl                String        Sets the session type to Glue ETL.
    %glue_ray           String        Sets the session type to Glue Ray.
----

## Glue Config Magic 
*(common across all job types)*

----

    %%configure         Dictionary    A json-formatted dictionary consisting of all configuration parameters for 
                                      a session. Each parameter can be specified here or through individual magics.
    %iam_role           String        Specify an IAM role ARN to execute your session with.
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %number_of_workers  int           The number of workers of a defined worker_type that are allocated 
                                      when a session runs.
                                      Default: 5.
    %additional_python_modules  List  Comma separated list of additional Python modules to include in your cluster 
                                      (can be from Pypi or S3).
    %%tags        Dictionary          Specify a json-formatted dictionary consisting of tags to use in the session.
----

                                      
## Magic for Spark Jobs (ETL & Streaming)

----
    %worker_type        String        Set the type of instances the session will use as workers. 
                                      ETL and Streaming support G.1X, G.2X, G.4X and G.8X. 
                                      Default: G.1X.
    %connections        List          Specify a comma separated list of connections to use in the session.
    %extra_py_files     List          Comma separated list of additional Python files From S3.
    %extra_jars         List          Comma separated list of additional Jars to include in the cluster.
    %spark_conf         String        Specify custom spark configurations for your session. 
                                      E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer
----
                                      
## Magic for Ray Job

----
    %min_workers        Int           The minimum number of workers that are allocated to a Ray job. 
                                      Default: 1.
    %object_memory_head Int           The percentage of free memory on the instance head node after a warm start. 
                                      Minimum: 0. Maximum: 100.
    %object_memory_worker Int         The percentage of free memory on the instance worker nodes after a warm start. 
                                      Minimum: 0. Maximum: 100.
----

## Action Magic

----

    %%sql               String        Run SQL code. All lines after the initial %%sql magic will be passed
                                      as part of the SQL code.  
----



# Prepare

## Setup Magic


In [6]:
%idle_timeout 2880
%glue_version 3.0 # As of 2023-07-27, notebook support for 4.0 is poor; 3.0 is recommended
%worker_type G.1X
%number_of_workers 2
%%configure # The following configuration is required for the Spark session to use Hudi
{
    "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false",
    "--datalake-formats": "hudi"
}

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.38.1 
Current idle_timeout is 2800 minutes.
idle_timeout has been set to 2880 minutes.
Setting Glue version to: 3.0
Previous worker type: G.1X
Setting new worker type to: G.1X
Previous number of workers: 5
Setting new number of workers to: 2
The following configurations have been updated: {'--conf': 'spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false', '--datalake-formats': 'hudi'}
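
The `--conf` value in the cell above packs two Spark settings into one string by repeating ` --conf ` between them. As a minimal sketch, the helper below assembles that string and the surrounding JSON from a plain dict. `build_configure_payload` is a hypothetical name, not part of the Glue kernel; it only produces text to paste after `%%configure`.

```python
import json


def build_configure_payload(spark_settings, datalake_formats="hudi"):
    """Return the JSON text for a %%configure cell (hypothetical helper)."""
    # The first key=value pair follows the "--conf" key directly; every
    # additional pair is chained with a literal " --conf " separator,
    # mirroring the string used in the cell above.
    conf_value = " --conf ".join(
        f"{key}={value}" for key, value in spark_settings.items()
    )
    payload = {
        "--conf": conf_value,
        "--datalake-formats": datalake_formats,
    }
    return json.dumps(payload, indent=4)


payload_text = build_configure_payload(
    {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.hive.convertMetastoreParquet": "false",
    }
)
print(payload_text)
```

Keeping the settings in a dict makes it harder to drop the separator by accident when a third setting is added later.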


In [1]:
import sys

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.context import SparkContext

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job


conf = (
    SparkConf()
    .setAppName("myapp")
    .setAll(
        [
            # The two settings below are the most important; both must be enabled to use Hudi
            ("spark.serializer", "org.apache.spark.serializer.KryoSerializer"),
            ("spark.sql.hive.convertMetastoreParquet", "false"),
            # The settings below are less critical
            ("spark.sql.caseSensitive", "true"),
            ("spark.sql.session.timeZone", "UTC"),
            ("spark.sql.files.ignoreMissingFiles", "false"),
            ("spark.sql.parquet.enableVectorizedReader", "false"),
            ("spark.hadoop.hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"),
            ("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1"),
        ]
    )
)
spark_ses = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
spark_ctx = spark_ses.sparkContext
glue_ctx = GlueContext(spark_ctx)

Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::807388292768:role/all-services-admin-role
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 2
Session ID: a1ffc593-2e0d-46c6-8025-e37c3ae33bb8
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.38.1
--enable-glue-datacatalog true
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false
--datalake-formats hudi
Waiting for session a1ffc593-2e0d-46c6-8025-e37c3ae33bb8 to get into ready status...
Session a1ffc593-2e0d-46c6-8025-e37c3ae33bb8 has been created.



In [2]:
config = spark_ses.sparkContext.getConf()
print(config)
print(config.get("spark.sql.hive.convertMetastoreParquet"))
print(config.get("spark.serializer"))

<pyspark.conf.SparkConf object at 0x7fdeebffcc90>
false
org.apache.spark.serializer.KryoSerializer
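
The printed values above are checked by eye. A small sketch of the same check done programmatically follows. `find_conf_mismatches` and `REQUIRED_HUDI_SETTINGS` are hypothetical names introduced for illustration, and a plain dict stands in for the real `SparkConf` so the snippet runs outside a Glue session; in the notebook, `spark_ses.sparkContext.getConf()` could be passed instead, since it also exposes `.get(key)`.

```python
# The two settings this notebook marks as required for Hudi.
REQUIRED_HUDI_SETTINGS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.hive.convertMetastoreParquet": "false",
}


def find_conf_mismatches(conf, required=REQUIRED_HUDI_SETTINGS):
    """Return {key: (expected, actual)} for every setting that differs."""
    mismatches = {}
    for key, expected in required.items():
        actual = conf.get(key)  # works for a dict and for a SparkConf
        if actual != expected:
            mismatches[key] = (expected, actual)
    return mismatches


# Stand-in for the session configuration, with one deliberately wrong value.
sample_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.hive.convertMetastoreParquet": "true",
}
print(find_conf_mismatches(sample_conf))
```

An empty result means both settings match what the `%%configure` cell requested.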


In [3]:
import boto3

boto_ses = boto3.session.Session()
sts_client = boto_ses.client("sts")
aws_account_id = sts_client.get_caller_identity()["Account"]
aws_region = boto_ses.region_name

print(f"aws_account_id = {aws_account_id}")
print(f"aws_region = {aws_region}")

aws_account_id = 807388292768
aws_region = us-east-1


In [4]:
pdf = spark_ses.createDataFrame(
    [
        ("id-1", "2000", "01", "01", "2000-01-01 00:00:00", 1),
        ("id-2", "2000", "01", "02", "2000-01-02 00:00:00", 2),
        ("id-3", "2000", "01", "03", "2000-01-03 00:00:00", 3),
    ],
    ("id", "year", "month", "day", "ts", "value"),
)
pdf.show()

+----+----+-----+---+-------------------+-----+
|  id|year|month|day|                 ts|value|
+----+----+-----+---+-------------------+-----+
|id-1|2000|   01| 01|2000-01-01 00:00:00|    1|
|id-2|2000|   01| 02|2000-01-02 00:00:00|    2|
|id-3|2000|   01| 03|2000-01-03 00:00:00|    3|
+----+----+-----+---+-------------------+-----+
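
In the sample data, the `year`, `month`, and `day` columns simply repeat parts of `ts`. As a rough sketch, the rows could be generated so the partition columns cannot drift out of sync with the timestamp. `make_sample_rows` is a hypothetical helper, not something the notebook defines.

```python
from datetime import datetime, timedelta


def make_sample_rows(start, count):
    """Build (id, year, month, day, ts, value) tuples, one per consecutive day."""
    rows = []
    for offset in range(count):
        moment = start + timedelta(days=offset)
        rows.append(
            (
                f"id-{offset + 1}",
                moment.strftime("%Y"),  # partition columns stay strings,
                moment.strftime("%m"),  # zero-padded like the literals above
                moment.strftime("%d"),
                moment.strftime("%Y-%m-%d %H:%M:%S"),
                offset + 1,
            )
        )
    return rows


sample_rows = make_sample_rows(datetime(2000, 1, 1), 3)
for row in sample_rows:
    print(row)
```

The resulting list has the same shape as the literal rows in the cell above, so it could be passed to `spark_ses.createDataFrame` with the same column names.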


In [5]:
database = "mydatabase"
table = "mytable"
additional_options = {
    "hoodie.table.name": table,
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": database,
    "hoodie.datasource.hive_sync.table": table,
    "hoodie.datasource.hive_sync.partition_fields": "year,month,day",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "path": f"s3://{aws_account_id}-{aws_region}-data/projects/hudi-poc/databases/{database}/{table}"
}
(
    pdf.write.format("hudi")
    .options(**additional_options)
    .mode("overwrite")
    .save()
)
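
The options dict above repeats the table name and the partition column list in several keys. As one possible sketch, a small builder can derive those repeated values from a single set of arguments. `build_hudi_write_options` is a hypothetical helper, and the account ID `111122223333` is a placeholder for illustration. The option keys and the S3 path layout are copied from the cell above.

```python
def build_hudi_write_options(
    database,
    table,
    base_path,
    record_key="id",
    precombine_field="ts",
    partition_fields=("year", "month", "day"),
):
    """Return the same Hudi option keys used above, built from one set of inputs."""
    partition_csv = ",".join(partition_fields)
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_csv,
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.partition_fields": partition_csv,
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.mode": "hms",
        "path": f"{base_path}/{database}/{table}",
    }


# Placeholder account ID and region; in the notebook these come from boto3.
example_base_path = "s3://111122223333-us-east-1-data/projects/hudi-poc/databases"
example_options = build_hudi_write_options("mydatabase", "mytable", example_base_path)
print(example_options["path"])
```

The returned dict can be passed to `.options(**example_options)` exactly as `additional_options` is in the cell above.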


