
Set up VPC, EC2, Spark, and a Jupyter notebook on AWS


0. Get ready for Terraform

Follow the instructions here
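Once Terraform is installed, a quick sanity check that the binary is on your PATH:

terraform version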

1. Set up the VPC with Terraform

In terraform/main.tf, set the count of masternode and workernode to 0.

The CIDR blocks can be changed to 10.0.0.0/25 and 10.0.0.0/27 to make more addresses available.
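With the counts and CIDR blocks set, the standard Terraform workflow applies (a minimal sketch, run from the terraform/ directory):

cd terraform
terraform init    # download the AWS provider plugins
terraform plan    # preview the resources to be created
terraform apply   # create the VPC and related resources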

2. Set up the nodes

Follow the instructions here
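If Pegasus is not installed yet, the usual setup is roughly the following (a sketch based on the Pegasus README; the AWS variables are the standard credential environment variables):

git clone https://github.com/InsightDataScience/pegasus.git ~/pegasus
export PEGASUS_HOME=~/pegasus
export PATH=$PEGASUS_HOME:$PATH
export AWS_ACCESS_KEY_ID=<your-access-key>
export AWS_SECRET_ACCESS_KEY=<your-secret-key>
export AWS_DEFAULT_REGION=<your-region>    # e.g. us-east-1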

cd ~/pegasus/examples/spark

In this directory are two files you can modify for your use:

`master.yml`

`workers.yml`

In each .yml file, you'll want to change the following fields:

subnet_id: the ID of the subnet created in step 1
security_group_ids: the security group(s) created in step 1
key_name: set this to the prefix of your pem file (e.g. set to hoa-nguyen and drop .pem)
purchase_type: make sure this field is always set to on_demand
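With both files edited, the typical Pegasus workflow looks like this (a sketch; <cluster-name> is the name you set in the .yml files):

peg up master.yml                       # launch the master node
peg up workers.yml                      # launch the worker nodes
peg fetch <cluster-name>                # cache the cluster metadata locally
peg install <cluster-name> ssh          # passwordless SSH between nodes
peg install <cluster-name> aws          # AWS credentials on the nodes
peg install <cluster-name> environment  # base packages
peg install <cluster-name> spark        # Spark itself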

If peg fetch doesn't work, try:

eval `ssh-agent -s`

before peg fetch
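If the key still isn't picked up, also add it to the agent explicitly (assuming the key sits in ~/.ssh):

ssh-add ~/.ssh/<pem>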

If you hit a gzip error during peg install spark-cluster spark, the download was likely corrupted: uninstall and install again until the error is gone.

3. Access master node

peg ssh <cluster-name> 1

or

ssh -i ~/.ssh/<pem> ubuntu@<master-node>

4. Set up the environment

PostgreSQL

sudo apt-get install postgresql
sudo su postgres
psql --host=<db address> --port='5432' --username=<username> --dbname=<dbname>
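For example, to connect to a (hypothetical) RDS instance and poke around:

psql --host=mydb.abc123.us-east-1.rds.amazonaws.com --port='5432' --username=dbuser --dbname=mydb
# inside psql: \dt lists the tables, \q quits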

Python

wget https://repo.anaconda.com/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh
export PATH=~/anaconda3/bin:$PATH    # or add this line to ~/.profile
conda create --name <env_name> python=3.6
source activate <env_name>
conda install <package>
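To confirm the environment is active and using the expected interpreter:

conda env list     # the active env is marked with *
which python       # should point into ~/anaconda3/envs/<env_name>
python --version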

5. Start Jupyter Notebook

Reference

jupyter notebook --generate-config

mkdir certs
cd certs
sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
cd ~/.jupyter/
vi jupyter_notebook_config.py

Add to the very beginning:

c = get_config()

# Notebook config: point this at the cert you generated above
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem' 
# Run on all IP addresses of your instance
c.NotebookApp.ip = '*'
# Don't open browser by default
c.NotebookApp.open_browser = False  
# Fix port to 8888
c.NotebookApp.port = 8888

Finally, start the notebook:

jupyter notebook
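To keep the server alive after you log out of SSH, one option (not part of the original steps) is to start it under nohup instead:

nohup jupyter notebook > jupyter.log 2>&1 &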

Open a browser at the URL with the login token that Jupyter prints, replacing the address with the public DNS of the master node:

(e.g. https://localhost:8888/?token=56485689ec938840155b92a3d57920c8c7ee52f3f7080e8f becomes https://<master-node-public-dns>:8888/?token=56485689ec938840155b92a3d57920c8c7ee52f3f7080e8f)

6. Activate Jupyter notebook with a different conda env

Reference

# Register a fresh Python 3 environment as a kernel:
conda create -n ipykernel_py3 python=3 ipykernel
source activate ipykernel_py3    # On Windows, remove the word 'source'
python -m ipykernel install --user

# Or register an existing environment under a display name:
source activate <myenv>
python -m ipykernel install --user --name <myenv> --display-name "Python (myenv)"
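To confirm the kernels were registered:

jupyter kernelspec list    # lists every installed kernel and its path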

7. Setup python conda env on worker nodes

Note: the Python version needs to be the same across all nodes of the cluster.

For each master/worker node:

a. Set up conda

wget https://repo.anaconda.com/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh

And answer 'yes' at:

Do you wish the installer to prepend the Anaconda3 install location
to PATH in your /home/ubuntu/.bashrc ? [yes|no]

Then run source ~/.bashrc to pick up the new PATH.

Finally, run which python and confirm it prints /home/ubuntu/anaconda3/bin/python.

b. Set up environment

conda install python=3.6.5
conda install pyspark=2.2.0
conda install boto3=1.7.4
conda install psycopg2=2.7.4
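A quick check that the pinned versions landed (run on each node):

python -c "import pyspark, boto3, psycopg2; print(pyspark.__version__, boto3.__version__, psycopg2.__version__)"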

c. Set up path

- For pyspark shell:

vi $SPARK_HOME/conf/spark-env.sh

and add these lines:
export PYSPARK_PYTHON=/home/ubuntu/anaconda3/bin/python 
export PYSPARK_DRIVER_PYTHON=/home/ubuntu/anaconda3/bin/jupyter

- For jupyter notebook, set the variables at the top of the notebook:

from os import environ
environ['PYSPARK_PYTHON']='/home/ubuntu/anaconda3/bin/python'
environ['PYSPARK_DRIVER_PYTHON']='/home/ubuntu/anaconda3/bin/jupyter'
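As a final sanity check that Spark will pick up the Anaconda interpreter (assuming SPARK_HOME was set by the Pegasus install):

source $SPARK_HOME/conf/spark-env.sh
echo $PYSPARK_PYTHON           # should print /home/ubuntu/anaconda3/bin/python
$PYSPARK_PYTHON --version      # should match the version from step 7b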