This repository accompanies the AWS Big Data Blog post Design patterns for implementing Hive Metastore for Amazon EMR on EKS. It provides architectural guidance and implementation details for multiple Hive Metastore Service (HMS) design patterns on Amazon EMR on EKS clusters. Please use this repository in conjunction with the AWS Big Data blog post.
In traditional non-containerized environments, HMS typically operated as a service within a Hadoop cluster, offering limited deployment options. However, with the emergence of containerization technologies like Docker and Kubernetes in data lakes, multiple HMS implementation patterns have become available. These patterns provide organizations with greater flexibility to customize HMS deployments according to their specific requirements and infrastructure. This repository demonstrates various HMS design patterns and their implementation details.
In this pattern, HMS runs as a sidecar container within the same pod as the data processing framework, such as Apache Spark. This approach leverages Kubernetes’ multi-container pod functionality, allowing both HMS and the data processing framework to operate together in the same pod.
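For illustration, a minimal sketch of this layout is shown below; the container names, images, and standalone-metastore startup are assumptions, not the manifests used in this repository. Because both containers share the pod's network namespace, Spark can reach the metastore at thrift://localhost:9083.

```yaml
# Illustrative sidecar layout (assumed images; not this repository's actual manifest)
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-with-hms-sidecar
spec:
  containers:
    - name: spark-kubernetes-driver      # data processing framework container
      image: apache/spark:3.5.1          # assumption: any Spark image; EMR on EKS supplies its own
    - name: hive-metastore               # HMS sidecar in the same pod
      image: apache/hive:4.0.0           # assumption: an image that can run a standalone metastore
      env:
        - name: SERVICE_NAME             # assumption: how this particular image selects the metastore service
          value: metastore
      ports:
        - containerPort: 9083            # Thrift port; Spark connects via thrift://localhost:9083
```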
In this pattern, HMS runs in multiple pods managed through a Kubernetes deployment, typically within a dedicated namespace in the same data processing EKS cluster.
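For illustration, a minimal sketch of this layout is shown below; the namespace, image, and replica count are assumptions, not the repository's actual manifests. Spark jobs running in the same EKS cluster would reach HMS through the cluster-local Service name, for example thrift://hive-metastore.hive-metastore.svc.cluster.local:9083.

```yaml
# Illustrative Deployment and Service for a shared HMS (assumed values; not this repository's actual manifests)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-metastore
  namespace: hive-metastore              # dedicated namespace in the data processing EKS cluster
spec:
  replicas: 2                            # multiple pods for availability
  selector:
    matchLabels:
      app: hive-metastore
  template:
    metadata:
      labels:
        app: hive-metastore
    spec:
      containers:
        - name: hive-metastore
          image: apache/hive:4.0.0       # assumption: an image that can run a standalone metastore
          env:
            - name: SERVICE_NAME
              value: metastore
          ports:
            - containerPort: 9083        # Thrift endpoint
---
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore
  namespace: hive-metastore
spec:
  selector:
    app: hive-metastore
  ports:
    - port: 9083
      targetPort: 9083
```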
In this pattern, HMS runs in its own dedicated EKS cluster, separate from the data processing clusters. It is managed through a Kubernetes deployment and exposed as a Kubernetes Service using the AWS Load Balancer Controller.
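For illustration, the sketch below shows how the HMS Service might be exposed through the AWS Load Balancer Controller; the names and annotation values are assumptions, not the repository's actual manifest. The load balancer DNS name it provisions is what Spark jobs in the compute clusters would use as the metastore URI.

```yaml
# Illustrative LoadBalancer Service fronted by the AWS Load Balancer Controller (assumed values)
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore
  namespace: hive-metastore
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external       # managed by the AWS Load Balancer Controller
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip  # route traffic directly to pod IPs
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal     # keep the NLB private to the VPC
spec:
  type: LoadBalancer
  selector:
    app: hive-metastore
  ports:
    - port: 9083
      targetPort: 9083                   # HMS Thrift port
```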
Before deploying this solution, ensure the following prerequisites are in place:
- Access to a valid AWS account
- The AWS Command Line Interface (AWS CLI) installed on your local machine
- The git, docker, eksctl, kubectl, helm, envsubst, jq, and yq utilities installed on the local machine
- Permission to create AWS resources
- Familiarity with Hive Metastore, Kubernetes, Amazon EKS, and Amazon EMR on EKS
git clone https://github.com/aws-samples/sample-emr-eks-hive-metastore-patterns.git
cd sample-emr-eks-hive-metastore-patterns
export REPO_DIR=$(pwd)
export AWS_REGION=<AWS_REGION>
This step creates two EMR on EKS clusters, analytics-cluster and datascience-cluster, and an EKS cluster, hivemetastore-cluster. Both analytics-cluster and datascience-cluster serve as compute clusters that run Spark workloads, while hivemetastore-cluster hosts the HMS. The analytics-cluster is used to illustrate the HMS as Sidecar and Cluster Dedicated patterns; all three clusters are used to demonstrate the External HMS pattern.
cd ${REPO_DIR}/setup
./setup.sh
cd ${REPO_DIR}/hms-sidecar
./configure-hms-sidecar.sh analytics-cluster
# Run the spark job to test the successful setup
kubectl apply -f spark-hms-sidecar-job.yaml -n emr
# View the Driver logs to validate the output
kubectl logs spark-hms-sidecar-job-driver --namespace emr
cd ${REPO_DIR}/hms-cluster-dedicated
./configure-hms-cluster-dedicated.sh analytics-cluster
# Run the spark job to test the successful setup
kubectl apply -f spark-hms-cluster-dedicated-job.yaml -n emr
# View the Driver logs to validate the output
kubectl logs spark-hms-cluster-dedicated-job-driver --namespace emr
cd ${REPO_DIR}/hms-external
./configure-hms-external.sh
Submit Spark jobs in analytics-cluster and datascience-cluster. The Spark jobs will connect to the HMS service in the hivemetastore-cluster. Repeat the following steps for analytics-cluster and datascience-cluster to verify that both clusters can connect to the HMS on hivemetastore-cluster.
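For illustration, the setting that points a Spark job at the external HMS is the metastore URI. The fragment below assumes the jobs are defined as SparkApplication resources for the Spark operator and uses a placeholder endpoint; see spark-hms-external-job.yaml in this repository for the actual job definition.

```yaml
# Illustrative fragment only (assumed resource kind and placeholder endpoint)
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-hms-external-job
  namespace: emr
spec:
  hadoopConf:
    hive.metastore.uris: thrift://<EXTERNAL_HMS_ENDPOINT>:9083   # load balancer DNS name of the HMS Service
  # ...rest of the job specification omitted...
```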
# Run the spark job to test the successful setup
kubectl config use-context <CONTEXT_NAME>
kubectl apply -f spark-hms-external-job.yaml -n emr
# View the Driver logs to validate the output
kubectl logs spark-hms-external-job-driver --namespace emr
You can use the helper script to clean up the resources provisioned in this blog post.
cd ${REPO_DIR}/setup
./cleanup.sh
This solution deploys the Open Source Apache Hive Metastore in the AWS cloud. Apache Hive Metastore is part of Apache Hive, an external open-source project, and AWS makes no claims regarding its security properties. Please evaluate Apache Hive and Hive Metastore according to your organization's security best practices prior to implementing this solution.