This repository accompanies the AWS Big Data Blog post Design patterns for implementing Hive Metastore for Amazon EMR on EKS. It provides architectural guidance and implementation details for multiple Hive Metastore Service (HMS) design patterns on Amazon EMR on EKS clusters. Please use this repository in conjunction with the AWS Big Data blog post.
In traditional non-containerized environments, HMS typically operated as a service within a Hadoop cluster, offering limited deployment options. However, with the emergence of containerization technologies like Docker and Kubernetes in data lakes, multiple HMS implementation patterns have become available. These patterns provide organizations with greater flexibility to customize HMS deployments according to their specific requirements and infrastructure. This repository demonstrates various HMS design patterns and their implementation details.
In this pattern, HMS runs as a sidecar container within the same pod as the data processing framework, such as Apache Spark. This approach leverages Kubernetes’ multi-container pod functionality, allowing both HMS and the data processing framework to operate together in the same pod.
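For illustration, a minimal sketch of this layout is shown below; the container names, images, and standalone-metastore startup are assumptions, not the manifests used in this repository. Because both containers share the pod's network namespace, Spark can reach the metastore at thrift://localhost:9083.

```yaml
# Illustrative sidecar layout (assumed images; not this repository's actual manifest)
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-with-hms-sidecar
spec:
  containers:
    - name: spark-kubernetes-driver      # data processing framework container
      image: apache/spark:3.5.1          # assumption: any Spark image; EMR on EKS supplies its own
    - name: hive-metastore               # HMS sidecar in the same pod
      image: apache/hive:4.0.0           # assumption: an image that can run a standalone metastore
      env:
        - name: SERVICE_NAME             # assumption: how this particular image selects the metastore service
          value: metastore
      ports:
        - containerPort: 9083            # Thrift port; Spark connects via thrift://localhost:9083
```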
In this pattern, HMS runs in multiple pods managed through a Kubernetes deployment, typically within a dedicated namespace in the same data processing EKS cluster.
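For illustration, a minimal sketch of this layout is shown below; the namespace, image, and replica count are assumptions, not the repository's actual manifests. Spark jobs running in the same EKS cluster would reach HMS through the cluster-local Service name, for example thrift://hive-metastore.hive-metastore.svc.cluster.local:9083.

```yaml
# Illustrative Deployment and Service for a shared HMS (assumed values; not this repository's actual manifests)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-metastore
  namespace: hive-metastore              # dedicated namespace in the data processing EKS cluster
spec:
  replicas: 2                            # multiple pods for availability
  selector:
    matchLabels:
      app: hive-metastore
  template:
    metadata:
      labels:
        app: hive-metastore
    spec:
      containers:
        - name: hive-metastore
          image: apache/hive:4.0.0       # assumption: an image that can run a standalone metastore
          env:
            - name: SERVICE_NAME
              value: metastore
          ports:
            - containerPort: 9083        # Thrift endpoint
---
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore
  namespace: hive-metastore
spec:
  selector:
    app: hive-metastore
  ports:
    - port: 9083
      targetPort: 9083
```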
In this pattern, HMS runs in its own dedicated EKS cluster, separate from the data processing clusters. It is managed through a Kubernetes deployment and exposed as a Kubernetes Service using the AWS Load Balancer Controller.
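For illustration, the sketch below shows how the HMS Service might be exposed through the AWS Load Balancer Controller; the names and annotation values are assumptions, not the repository's actual manifest. The load balancer DNS name it provisions is what Spark jobs in the compute clusters would use as the metastore URI.

```yaml
# Illustrative LoadBalancer Service fronted by the AWS Load Balancer Controller (assumed values)
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore
  namespace: hive-metastore
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external       # managed by the AWS Load Balancer Controller
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip  # route traffic directly to pod IPs
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal     # keep the NLB private to the VPC
spec:
  type: LoadBalancer
  selector:
    app: hive-metastore
  ports:
    - port: 9083
      targetPort: 9083                   # HMS Thrift port
```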
Before deploying this solution, ensure the following prerequisites are in place:
- Access to a valid AWS account
- The AWS Command Line Interface (AWS CLI) installed on your local machine
- The git, docker, eksctl, kubectl, helm, envsubst, jq, and yq utilities installed on the local machine
- Permission to create AWS resources
- Familiarity with Hive Metastore, Kubernetes, Amazon EKS, and Amazon EMR on EKS
git clone https://github.com/aws-samples/sample-emr-eks-hive-metastore-patterns.git
cd sample-emr-eks-hive-metastore-patterns
export REPO_DIR=$(pwd)
export AWS_REGION=<AWS_REGION>
This step creates two EMR on EKS clusters, analytics-cluster and datascience-cluster, and an EKS cluster, hivemetastore-cluster. Both analytics-cluster and datascience-cluster serve as compute clusters that run Spark workloads, while hivemetastore-cluster hosts the HMS. The analytics-cluster is used to illustrate the HMS as Sidecar and Cluster Dedicated patterns; all three clusters are used to demonstrate the External HMS pattern.
cd ${REPO_DIR}/setup
./setup.sh
cd ${REPO_DIR}/hms-sidecar
./configure-hms-sidecar.sh analytics-cluster
# Run the spark job to test the successful setup
kubectl apply -f spark-hms-sidecar-job.yaml -n emr
# View the Driver logs to validate the output
kubectl logs spark-hms-sidecar-job-driver --namespace emr
cd ${REPO_DIR}/hms-cluster-dedicated
./configure-hms-cluster-dedicated.sh analytics-cluster
# Run the spark job to test the successful setup
kubectl apply -f spark-hms-cluster-dedicated-job.yaml -n emr
# View the Driver logs to validate the output
kubectl logs spark-hms-cluster-dedicated-job-driver --namespace emr
cd ${REPO_DIR}/hms-external
./configure-hms-external.sh
Submit Spark jobs in analytics-cluster and datascience-cluster. The Spark jobs will connect to the HMS service in the hivemetastore-cluster. Repeat the following steps for analytics-cluster and datascience-cluster to verify that both clusters can connect to the HMS on hivemetastore-cluster.
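For illustration, the setting that points a Spark job at the external HMS is the metastore URI. The fragment below assumes the jobs are defined as SparkApplication resources for the Spark operator and uses a placeholder endpoint; see spark-hms-external-job.yaml in this repository for the actual job definition.

```yaml
# Illustrative fragment only (assumed resource kind and placeholder endpoint)
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-hms-external-job
  namespace: emr
spec:
  hadoopConf:
    hive.metastore.uris: thrift://<EXTERNAL_HMS_ENDPOINT>:9083   # load balancer DNS name of the HMS Service
  # ...rest of the job specification omitted...
```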
# Run the spark job to test the successful setup
kubectl config use-context <CONTEXT_NAME>
kubectl apply -f spark-hms-external-job.yaml -n emr
# View the Driver logs to validate the output
kubectl logs spark-hms-external-job-driver --namespace emr
You can use the helper script to clean up the resources provisioned in this blog post.
cd ${REPO_DIR}/setup
./cleanup.sh
This solution deploys the Open Source Apache Hive Metastore in the AWS cloud. Apache Hive Metastore is part of Apache Hive, an external open-source project, and AWS makes no claims regarding its security properties. Please evaluate Apache Hive and Hive Metastore according to your organization's security best practices prior to implementing this solution.