Set up VPC, EC2, Spark, and Jupyter Notebook on AWS
Follow the instructions here
In terraform/main.tf, set count of masternode and workernode to 0.
The CIDR blocks can be changed to 10.0.0.0/25 and 10.0.0.0/27 to allow more available addresses.
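To see how many addresses each of these blocks provides, here is a minimal sketch using Python's standard ipaddress module. The function and constant names are illustrative and not part of the Terraform setup. The 5 addresses subtracted are the ones AWS reserves in every subnet; whether a given block is used for the VPC or for a subnet depends on your main.tf.

```python
import ipaddress

# AWS reserves 5 addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5

def address_counts(cidr):
    """Return (total addresses, addresses usable in an AWS subnet) for a CIDR block."""
    network = ipaddress.ip_network(cidr)
    total = network.num_addresses
    return total, max(total - AWS_RESERVED_PER_SUBNET, 0)

for cidr in ("10.0.0.0/25", "10.0.0.0/27"):
    total, usable = address_counts(cidr)
    print(f"{cidr}: {total} addresses, {usable} usable in an AWS subnet")
```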
Follow the instructions here
cd ~/pegasus/examples/spark
In this directory are two files you can modify for your uses:
`master.yml`
`workers.yml`
In each .yml file, you’ll want to change the following fields:
subnet_id:
security_group_ids:
key_name: set this to the prefix of your pem file (e.g. set to hoa-nguyen and drop .pem)
purchase_type: make sure this field is always set to on_demand
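A quick sanity check of the edited fields can catch mistakes before any instances are launched. The sketch below is not part of pegasus: it assumes the .yml contents have already been loaded into a plain dict (for example with a YAML parser), the helper name is hypothetical, and the IDs shown are placeholders.

```python
REQUIRED_FIELDS = ("subnet_id", "security_group_ids", "key_name", "purchase_type")

def check_cluster_config(config):
    """Return a list of problems found in a master.yml/workers.yml dict."""
    problems = [f"missing or empty field: {name}"
                for name in REQUIRED_FIELDS if not config.get(name)]
    key_name = config.get("key_name") or ""
    if key_name.endswith(".pem"):
        problems.append("key_name should not include the .pem extension")
    purchase_type = config.get("purchase_type")
    if purchase_type and purchase_type != "on_demand":
        problems.append("purchase_type should be on_demand")
    return problems

# Placeholder values for illustration only.
example = {
    "subnet_id": "subnet-xxxxxxxx",
    "security_group_ids": "sg-xxxxxxxx",
    "key_name": "hoa-nguyen.pem",
    "purchase_type": "spot",
}
for problem in check_cluster_config(example):
    print(problem)
```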
If peg fetch doesn't work, run the following before peg fetch:
eval `ssh-agent -s`
If peg install spark-cluster spark fails with a gzip error, uninstall and install again until the error is gone (it is a download issue).
peg ssh <cluster-name> 1
or
ssh -i ~/.ssh/<pem> ubuntu@<master-node>
sudo apt-get install postgresql
sudo su postgres
psql --host=<db address> --port='5432' --username=<username> --dbname=<dbname>
wget https://repo.anaconda.com/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh
export PATH=~/anaconda3/bin:$PATH (or add to ~/.profile)
conda create --name <env_name> python=3.6
source activate <env_name>
conda install <package>
jupyter notebook --generate-config
mkdir certs
cd certs
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mycert.pem -out mycert.pem
cd ~/.jupyter/
vi jupyter_notebook_config.py
Add to the very beginning:
c = get_config()
# Notebook config: path to the pem cert you saved
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
# Run on all IP addresses of your instance
c.NotebookApp.ip = '*'
# Don't open browser by default
c.NotebookApp.open_browser = False
# Fix port to 8888
c.NotebookApp.port = 8888
Finally, start the notebook:
jupyter notebook
Open the URL containing the login token in a browser, replacing the address with the public DNS of the master node.
(The printed URL looks like, e.g., https://localhost:8889/?token=56485689ec938840155b92a3d57920c8c7ee52f3f7080e8f)
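The address substitution can also be done programmatically. The following is a minimal sketch using the standard urllib.parse module; the helper name and the EC2 hostname are made-up placeholders, so substitute the public DNS of your own master node.

```python
from urllib.parse import urlsplit, urlunsplit

def with_public_host(token_url, public_dns):
    """Swap the host in the URL printed by Jupyter, keeping scheme, port, and token."""
    parts = urlsplit(token_url)
    netloc = public_dns if parts.port is None else f"{public_dns}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

printed = "https://localhost:8889/?token=56485689ec938840155b92a3d57920c8c7ee52f3f7080e8f"
# The hostname below is a placeholder for the master node's public DNS.
print(with_public_host(printed, "ec2-0-0-0-0.compute-1.amazonaws.com"))
```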
conda create -n ipykernel_py3 python=3 ipykernel
source activate ipykernel_py3 # On Windows, remove the word 'source'
python -m ipykernel install --user
source activate <myenv>
python -m ipykernel install --user --name <myenv> --display-name "Python (myenv)"
Note: The Python version needs to be the same across all nodes of the cluster.
For each master/worker node:
wget https://repo.anaconda.com/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh
And answer 'yes' at:
Do you wish the installer to prepend the Anaconda3 install location
to PATH in your /home/ubuntu/.bashrc ? [yes|no]
Then run source ~/.bashrc
Finally, check which python, which should give /home/ubuntu/anaconda3/bin/python
conda install python=3.6.5
conda install pyspark=2.2.0
conda install boto3=1.7.4
conda install psycopg2=2.7.4
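Once these packages are installed on every node, the interpreter versions can be compared. PySpark refuses to run when the driver and the workers differ in their major.minor Python version. The sketch below, with illustrative helper names, prints the value to compare when run on each node.

```python
import sys

def major_minor(version_info=sys.version_info):
    """Return the 'major.minor' version string, e.g. '3.6' for (3, 6, 5)."""
    return f"{version_info[0]}.{version_info[1]}"

def versions_match(driver_version, worker_version):
    """True when driver and worker agree on the major.minor version."""
    return driver_version == worker_version

# Run on each node and compare the output by eye (or collect it over ssh).
print(sys.executable, major_minor())
```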
- For pyspark shell:
vi $SPARK_HOME/conf/spark-env.sh
export PYSPARK_PYTHON=/home/ubuntu/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/home/ubuntu/anaconda3/bin/jupyter
- For jupyter notebook:
from os import environ
environ['PYSPARK_PYTHON']='/home/ubuntu/anaconda3/bin/python'
environ['PYSPARK_DRIVER_PYTHON']='/home/ubuntu/anaconda3/bin/jupyter'
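These variables need to be set before the SparkContext is created in the notebook. Below is a minimal sketch that checks the interpreter path exists before setting PYSPARK_PYTHON, so a typo fails early instead of at job time. The helper name is hypothetical, and the path is the one assumed throughout these notes.

```python
import os

def set_pyspark_python(python_path, env=os.environ):
    """Point PySpark workers at python_path, failing early if it does not exist."""
    if not os.path.isfile(python_path):
        raise FileNotFoundError(f"No Python interpreter at {python_path}")
    env["PYSPARK_PYTHON"] = python_path
    return env["PYSPARK_PYTHON"]

try:
    set_pyspark_python("/home/ubuntu/anaconda3/bin/python")
except FileNotFoundError as err:
    print(err)
```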