An example Terraform project that configures a Secure and Customizable Spark Cluster on Amazon EMR. Zeppelin is installed as an interface to Spark, and Ganglia is installed for monitoring.
This project gives an example of extending the base functionality of Amazon EMR to provide a more secure (and potentially compliant) working environment for running Spark workloads on Amazon EMR.
There are two places in this project where data is stored: in Amazon S3 and in Hadoop HDFS running on the EMR nodes.
Amazon S3 buckets are configured with AES-256 encryption, using your Amazon account's default encryption key. A custom KMS key could be used if desired.
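For reference, a minimal sketch of what the bucket encryption can look like, assuming the pre-4.0 AWS provider syntax (the bucket name is hypothetical):

resource "aws_s3_bucket" "data" {
  bucket = "my-emr-data-bucket"   # hypothetical name

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
        # To use a custom key instead, something like:
        #   sse_algorithm     = "aws:kms"
        #   kms_master_key_id = "<your KMS key ARN>"
      }
    }
  }
}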
Hadoop HDFS is configured via EMR's Security Configuration, which configures the EMR nodes to enable LUKS and encrypt the data on the EBS volumes using a custom KMS Encryption Key.
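A rough sketch of such a Security Configuration (resource names are hypothetical; in-transit encryption is left out here, though the project also enables it in the same Security Configuration, as described below):

resource "aws_kms_key" "emr_disk" {
  description = "Key for EMR local disk (LUKS) encryption"
}

resource "aws_emr_security_configuration" "cluster" {
  name = "emr-at-rest-encryption"   # hypothetical name

  configuration = <<EOF
{
  "EncryptionConfiguration": {
    "EnableInTransitEncryption": false,
    "EnableAtRestEncryption": true,
    "AtRestEncryptionConfiguration": {
      "LocalDiskEncryptionConfiguration": {
        "EncryptionKeyProviderType": "AwsKms",
        "AwsKmsKey": "${aws_kms_key.emr_disk.arn}"
      }
    }
  }
}
EOF
}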
This project will create self-signed certificates to demonstrate in-flight encryption. The certificates are used in three places:
- Inter-Node Communication: configured using EMR's Security Configuration
- Load-Balancer to Zeppelin Communication: configured by placing a self-signed certificate in the Java Keystore, and instructing Zeppelin to use that certificate.
- Internet to Load Balancer Communication: configured by placing a self-signed certificate into Amazon IAM and configuring the Load Balancer to use that certificate.
Ideally, you or your organization would swap out a self-signed certificate for a certificate generated by a trusted certificate authority.
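Until then, as a rough illustration of how a self-signed certificate can be generated and registered with IAM for the Load Balancer (resource names and the hostname are hypothetical; this assumes the pre-4.0 tls provider, which still takes key_algorithm):

resource "tls_private_key" "lb" {
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_self_signed_cert" "lb" {
  key_algorithm         = "RSA"
  private_key_pem       = "${tls_private_key.lb.private_key_pem}"
  validity_period_hours = 8760
  allowed_uses          = ["key_encipherment", "digital_signature", "server_auth"]

  subject {
    common_name = "zeppelin.example.internal"   # hypothetical hostname
  }
}

# Registers the certificate with IAM so the Load Balancer's HTTPS listener can use it.
resource "aws_iam_server_certificate" "lb" {
  name             = "zeppelin-lb-cert"
  certificate_body = "${tls_self_signed_cert.lb.cert_pem}"
  private_key      = "${tls_private_key.lb.private_key_pem}"
}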
It is important to track who accesses which data and when. This project demonstrates this in a couple ways:
- EMR Logging: EMR has an internal tool called "Logpusher" which will send files written to /var/log/ to the S3 Bucket created and configured in the EMR Module.
- Zeppelin Logging: A log4j configuration is placed on the system to log actions against Zeppelin.
- S3 Bucket Logging: A second bucket is created to track object access (gets/puts/deletes) to the main S3 bucket (see the sketch after this list).
- Zeppelin Notebooks: Zeppelin notebooks are configured to save to the S3 Bucket.
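For the bucket-logging piece, the data bucket sketched earlier could be extended roughly like this, again assuming the pre-4.0 AWS provider (bucket names are hypothetical):

resource "aws_s3_bucket" "access_logs" {
  bucket = "my-emr-access-logs"    # hypothetical name
  acl    = "log-delivery-write"
}

# The main data bucket gains a logging block pointing at the second bucket.
resource "aws_s3_bucket" "data" {
  bucket = "my-emr-data-bucket"    # hypothetical name

  logging {
    target_bucket = "${aws_s3_bucket.access_logs.id}"
    target_prefix = "access/"
  }
}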
This project also provides an example of how to instrument identity with Zeppelin.
Zeppelin employs Shiro. By providing a shiro.ini file, Zeppelin can have a user database. This project employs a basic example by hard-coding usernames and passwords into the file, but it could be extended to connect to an LDAP server or potentially another Identity Service.
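Purely as an illustration (not the project's actual file), a hard-coded shiro.ini like the one described could be rendered from Terraform along these lines; the usernames, passwords, and roles are placeholders:

# Writes a demo shiro.ini with hard-coded users. Do not use real credentials here;
# the [users] section could be replaced by an LDAP-backed realm instead.
resource "local_file" "shiro_ini" {
  filename = "generated/shiro.ini"

  content = <<EOF
[users]
# username = password, role
admin = change-me, admin
analyst = change-me, role1

[urls]
/** = authc
EOF
}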
Having your infrastructure represented "as code" means that you can use standard code review practices and continuous integration methodologies to make changes to your infrastructure. Under the Apache 2.0 License, you are welcome to fork this repository and customize it to your or your organization's needs.
By tracking what changes are made, by whom, and when, you can easily audit and control changes to your infrastructure.
This project leverages Terraform Modules, and relies heavily on the EMR Cluster resource.
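The real wiring lives in the modules, but as a heavily trimmed illustration of the aws_emr_cluster resource they build on (resource names, the release label, instance types, and the subnet variable are hypothetical; the attribute set assumes a 2.x-era AWS provider, and the Security Configuration referenced is the one sketched earlier):

resource "aws_emr_cluster" "cluster" {
  name          = "${var.cluster_name}"
  release_label = "emr-5.17.0"                      # hypothetical release label
  applications  = ["Spark", "Zeppelin", "Ganglia"]

  service_role            = "EMR_DefaultRole"       # the sec module creates its own roles instead
  security_configuration  = "${aws_emr_security_configuration.cluster.name}"
  log_uri                 = "s3://my-emr-log-bucket/logs/"   # hypothetical bucket

  master_instance_type = "m4.large"
  core_instance_type   = "m4.large"
  core_instance_count  = 2

  ec2_attributes {
    subnet_id        = "${var.subnet_id}"           # hypothetical variable name
    instance_profile = "EMR_EC2_DefaultRole"        # the sec module creates its own instance profile
  }
}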
The layout of this project is as follows:
main.tf      --> The main Terraform file; this includes the modules listed below
config.tf    --> General Terraform configuration, versions, etc.
variables.tf --> Variables needed for Terraform to execute, including defaults
outputs.tf   --> The output of the ELB's address for the Master Node
modules/
  bootstrap/ --> This module copies files to S3 so EMR can run a script right
                 after the EC2 instances are provisioned
  emr/       --> This module creates the EMR cluster and configuration files
  lb/        --> This module creates a Load Balancer so Zeppelin is accessible
                 from your system
  s3/        --> This module creates S3 buckets needed by EMR
  sec/       --> This module creates some Security Fundamentals needed by EMR.
                 NOTE: Please DO NOT USE THIS in production, it's in place purely
                 for demonstrative purposes. See "Security Module" below.
  sgs/       --> This module creates the Security Groups needed for EMR and the
                 Load Balancers.
Before building, ensure you're comfortable with how Terraform works.
Terraform will use the AWS credentials provided in your shell environment. You will need an AWS user account available that has the following permissions:
- View/Create/Update/Delete IAM, KMS, and Certificates
- View/Create/Update/Delete S3 Buckets and Objects
- View/Create/Update/Delete Security Groups
- View/Create/Update/Delete Load Balancers
- View VPC Subnets
Before executing your first plan, you need to initialize Terraform:
$> terraform init
Terraform allows you to "Plan", which lets you see what it would change without actually making any changes.
$> terraform plan -var 'vpc_id=vpc-abcde123' -var 'cluster_name=my_emr_cluster_1'
Finally, after initialization and planning, you can apply your changes, which will actually create or update your cluster based on the plan:
$> terraform apply -var 'vpc_id=vpc-abcde123' -var 'cluster_name=my_emr_cluster_1'
If you want Terraform to clean up anything it made, you can destroy the cluster:
$> terraform destroy -var 'vpc_id=vpc-abcde123' -var 'cluster_name=my_emr_cluster_1'
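Rather than passing -var flags on every command, the same values could live in a terraform.tfvars file, which Terraform loads automatically (values shown are placeholders):

# terraform.tfvars
vpc_id       = "vpc-abcde123"
cluster_name = "my_emr_cluster_1"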
This project provides a Security Module that is intended to be replaced with the security requirements of your organization. PLEASE DO NOT use this Security Module in production. It is there for demonstrative purposes only, but it can be a guide to what needs to be replaced to use this project in production.
It provides bootstrapping for the following:
- IAM Roles
  - Three IAM roles are created: one for EMR to create infrastructure, one (an instance profile) to attach to each of the EC2 instances, and one for EMR to use for autoscaling.
- Network
  - Whitelisting: Instead of opening your EMR cluster to the world, this module will look up your public IP using ifconfig.co and add that IP to a few security groups (SSH for the Nodes and HTTPS for the LB); see the sketch after this list.
  - Subnets: This module will fetch all subnets in the VPC provided as a variable; it will spin up EMR in the first and attach the LB to the rest. You will likely want to specify which subnets to use for both.
- SSH
  - It will create and save an SSH key to connect to the cluster. The private SSH key is saved in generated/ssh/.
- Zeppelin
  - Zeppelin is equipped with a Load Balancer for (easier) access. It is also equipped with a Self-Signed Certificate for encrypted communication between the ELB and the Zeppelin process, and another Self-Signed Certificate for the Internet. You should not use Self-Signed Certificates in production; switch these out for valid certificates.
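As a sketch of the IP whitelisting mentioned above (resource names are hypothetical; this assumes the http provider, whose pre-3.0 releases expose the response as body):

# Look up the public IP of the machine running Terraform via ifconfig.co.
data "http" "my_ip" {
  url = "https://ifconfig.co/ip"
}

# Allow SSH to the cluster only from that IP.
resource "aws_security_group_rule" "ssh_from_me" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["${chomp(data.http.my_ip.body)}/32"]
  security_group_id = "${aws_security_group.emr_master.id}"   # assumed to exist in the sgs module
}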
Terraform isn't the best at tracking changes to resources in the bootstrap
module. Sometimes you have to let Terraform know you need to destroy and
recreate the EMR cluster by executing the following command:
$> terraform taint -var 'vpc_id=vpc-abcde123' -var 'cluster_name=my_emr_cluster_1' -module=emr aws_emr_cluster.cluster
$> terraform apply -var 'vpc_id=vpc-abcde123' -var 'cluster_name=my_emr_cluster_1'
Currently this project is hard-coded to run in AWS US-West-2 (Oregon), and there are two things you have to change to make it run in another region:
- Default Region
  - Change this to match the name of the region you'd like to use
- SNS Source IPs
  - Change this list to match the Source IPs of Amazon SNS in the region you'd like to use.
After the cluster builds, it will output the DNS Name of the Load Balancer that was created:
$> terraform apply...
...
Outputs:
dns_name = my_emr_cluster_1-default-1234567890.us-west-2.elb.amazonaws.com
You can navigate to https://my_emr_cluster_1-default-1234567890.us-west-2.elb.amazonaws.com
in your browser. You will have to ignore the certificate warning, since this
example project creates self-signed SSL certs for demonstrative purposes.
Finally, you can find the Username and Password for Zeppelin hard-coded in the Shiro Configuration File.
NOTE: Please DO NOT hard code your Usernames and Passwords for Zeppelin in production, or check them into Git. This is in place purely for demonstrative purposes. Zeppelin offers a few Authentication Options.
For demonstrative purposes, SSH is allowed to the public IP of the system that runs Terraform.
Also, this project will create an SSH key to connect to the cluster. After
Terraform is applied, the SSH key generated is placed in the generated/
folder, so you can SSH into the cluster with the following command:
$> ssh -i generated/ssh/my_emr_cluster_1-default ip_address_of_a_node
AWS also provides additional SSH connection help in the EMR Console.
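For reference, key generation along these lines can be expressed in Terraform roughly as follows (resource names are hypothetical; this is not the project's exact module code):

resource "tls_private_key" "emr" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

# Registers the public half with EC2 so the EMR nodes accept it.
resource "aws_key_pair" "emr" {
  key_name   = "${var.cluster_name}-default"
  public_key = "${tls_private_key.emr.public_key_openssh}"
}

# Writes the private half to generated/ssh/ for use with the ssh command above.
resource "local_file" "private_key" {
  filename = "generated/ssh/${var.cluster_name}-default"
  content  = "${tls_private_key.emr.private_key_pem}"
}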
It is recommended to use Terraform Remote State.
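One way to do that is an S3 backend, for example (bucket, key, and region are placeholders):

terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket"
    key    = "emr-spark-example/terraform.tfstate"
    region = "us-west-2"
  }
}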
- First, navigate to the EMR Console
- Locate your cluster; it should be named after your cluster_name and region, e.g. my_emr_cluster1-us-west-2
- Second, there are logs in the S3 bucket created, under logs/
  - There's a shortcut in the UI to find the log directory: click on the folder icon near Log URI:
- You can visit the Zeppelin UI (see the "How do I login to Zeppelin?" FAQ above)
- Then, follow the Apache Zeppelin Tutorial
Christian Nuss
Collective Health SRE
Copyright 2018 Collective Health, Inc
This project is available under the Apache 2.0 License. Please see the LICENSE file.