<div style="font-size:18pt; padding-top:20px; text-align:center"><b><span style="font-weight:bold; color:green">EMR Spark</span> cluster with <span style="font-weight:bold; color:green">Jupyter</span></b></div><hr>
<div style="text-align:right;">Sergei Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Content</span>
    <ol>
        <li><a href="#1">Deploying EMR Spark Cluster with Jupyter</a></li>
        <li><a href="#2">References</a></li>
    </ol>
</div>

<p>Run the cell below to apply the notebook style sheet</p>

In [1]:
%%html
<link href="css/style.css" rel="stylesheet" type="text/css">

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Deploying EMR Spark Cluster with Jupyter </div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<p>List the available subnets and note the ID of the subnet whose CIDR block is 10.0.1.0/28</p>

In [None]:
!aws ec2 describe-subnets
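
<p>The full listing can be long, so the lookup can also be narrowed with a filter. The following is a minimal sketch, assuming the <span class="code-font">cidr-block</span> filter of <span class="code-font">describe-subnets</span>; the helper name <span class="code-font">find_subnet_id</span> is hypothetical and not part of the original material.</p>

```shell
# Hypothetical helper (not from the original notebook): print the ID of the
# first subnet whose IPv4 CIDR block matches the argument exactly.
find_subnet_id() {
  local cidr="${1:-10.0.1.0/28}"   # default is the CIDR block used in this class
  aws ec2 describe-subnets \
    --filters "Name=cidr-block,Values=$cidr" \
    --query "Subnets[0].SubnetId" \
    --output text
}

# Usage (requires configured AWS credentials):
#   find_subnet_id 10.0.1.0/28
```

<p>If no subnet matches, the query is expected to print <span class="code-font">None</span>, so check the result before passing it to <span class="code-font">create-cluster</span>.</p>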

<p>To install Jupyter on the cluster, upload the following bootstrap script to S3</p>

<div class="msg-block msg-ref">
      <div class="msg-text-ref">
          <p>The bash script and the guide for deploying Jupyter Notebook on an EMR cluster come from <a href="https://bytes.babbel.com/en/articles/2017-07-04-spark-with-jupyter-inside-vpc.html">this article</a>. The script below differs from the original only in minor details.</p>
     </div>
</div>

In [None]:
# %load /YOUR_PATH/config/install-jupyter.sh
#!/usr/bin/env bash
set -x -e

JUPYTER_PASSWORD=${1:-"myJupyterPassword"}
NOTEBOOK_DIR=${2:-"s3://myS3Bucket/notebooks/"}

# home backup
if [ ! -d /mnt/home_backup ]; then
  sudo mkdir /mnt/home_backup
  sudo cp -a /home/* /mnt/home_backup
fi

# mount home to /mnt
if [ ! -d /mnt/home ]; then
  sudo mv /home/ /mnt/
  sudo ln -s /mnt/home /home
fi

# Install conda
wget https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh -O /home/hadoop/miniconda.sh \
    && /bin/bash ~/miniconda.sh -b -p $HOME/conda

# printf interprets \n; plain bash echo would write a literal "\n" into .bashrc
printf '\nexport PATH=$HOME/conda/bin:$PATH\n' >> $HOME/.bashrc && source $HOME/.bashrc

conda config --set always_yes yes --set changeps1 no

conda install conda=4.2.13

conda config -f --add channels conda-forge
conda config -f --add channels defaults

conda install hdfs3 findspark ujson jsonschema toolz boto3 py4j numpy pandas==0.19.2

# cleanup
rm ~/miniconda.sh

echo bootstrap_conda.sh completed. PATH now: $PATH
export PYSPARK_PYTHON="/home/hadoop/conda/bin/python3.5"

############### -------------- master node -------------- ###############

IS_MASTER=false
if grep isMaster /mnt/var/lib/info/instance.json | grep true;
then
  IS_MASTER=true

  ### install dependencies for s3fs-fuse to access and store notebooks

  sudo yum install -y git
  sudo yum install -y libcurl libcurl-devel graphviz cyrus-sasl cyrus-sasl-devel readline readline-devel gnuplot
  #sudo yum install -y automake fuse fuse-devel libxml2-devel

  sudo yum install -y automake fuse fuse-devel gcc-c++ git libcurl-devel libxml2-devel make openssl-devel
  wget ftp://mirror.switch.ch/pool/4/mirror/epel/6/x86_64/Packages/j/jsoncpp-devel-0.10.5-2.el6.x86_64.rpm
  wget ftp://mirror.switch.ch/pool/4/mirror/epel/6/x86_64/Packages/j/jsoncpp-0.10.5-2.el6.x86_64.rpm
  sudo rpm -ivh *.rpm

  # extract BUCKET and FOLDER to mount from NOTEBOOK_DIR
  NOTEBOOK_DIR="${NOTEBOOK_DIR%/}/"
  BUCKET=$(python -c "print('$NOTEBOOK_DIR'.split('//')[1].split('/')[0])")
  FOLDER=$(python -c "print('/'.join('$NOTEBOOK_DIR'.split('//')[1].split('/')[1:-1]))")

  echo "bucket '$BUCKET' folder '$FOLDER'"

  cd /mnt
  git clone https://github.com/s3fs-fuse/s3fs-fuse.git
  cd s3fs-fuse/
  ls -alrt
  ./autogen.sh
  PKG_CONFIG=/usr/bin/pkg-config ./configure
  #./configure
  make
  sudo make install
  sudo su -c 'echo user_allow_other >> /etc/fuse.conf'
  mkdir -p /mnt/s3fs-cache
  mkdir -p /mnt/$BUCKET
  /usr/local/bin/s3fs -o allow_other -o iam_role=auto -o umask=0 -o url=https://s3.amazonaws.com  -o no_check_certificate -o enable_noobj_cache -o use_cache=/mnt/s3fs-cache $BUCKET /mnt/$BUCKET

  ### Install Jupyter Notebook with conda and configure it.
  echo "installing python libs in master"
  # install
  conda install jupyter

  # install visualization libs
  conda install matplotlib plotly bokeh

  # install scikit-learn stable version
  #conda install --channel scikit-learn-contrib scikit-learn==0.18

  # jupyter configs
  mkdir -p ~/.jupyter
  touch ~/.jupyter/jupyter_notebook_config.py
  HASHED_PASSWORD=$(python -c "from notebook.auth import passwd; print(passwd('$JUPYTER_PASSWORD'))")
  echo "c.NotebookApp.password = u'$HASHED_PASSWORD'" >> ~/.jupyter/jupyter_notebook_config.py
  echo "c.NotebookApp.open_browser = False" >> ~/.jupyter/jupyter_notebook_config.py
  echo "c.NotebookApp.ip = '*'" >> ~/.jupyter/jupyter_notebook_config.py
  echo "c.NotebookApp.notebook_dir = '/mnt/$BUCKET/$FOLDER'" >> ~/.jupyter/jupyter_notebook_config.py
  echo "c.ContentsManager.checkpoints_kwargs = {'root_dir': '.checkpoints'}" >> ~/.jupyter/jupyter_notebook_config.py

  ### Set up the Jupyter daemon and launch it
  cd ~
  echo "Creating Jupyter Daemon"

  sudo cat <<EOF > /home/hadoop/jupyter.conf
description "Jupyter"

start on runlevel [2345]
stop on runlevel [016]

respawn
respawn limit 0 10

chdir /mnt/$BUCKET/$FOLDER

script
  sudo su - hadoop > /var/log/jupyter.log 2>&1 <<BASH_SCRIPT
        export PYSPARK_DRIVER_PYTHON="/home/hadoop/conda/bin/jupyter"
        export PYSPARK_DRIVER_PYTHON_OPTS="notebook --log-level=INFO"
        export PYSPARK_PYTHON=/home/hadoop/conda/bin/python3.5
        export JAVA_HOME="/etc/alternatives/jre"
        pyspark
  BASH_SCRIPT

end script
EOF

  sudo mv /home/hadoop/jupyter.conf /etc/init/
  sudo chown root:root /etc/init/jupyter.conf

  sudo initctl reload-configuration

  # start jupyter daemon
  echo "Starting Jupyter Daemon"
  sudo initctl start jupyter

fi
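
<p>The script derives <span class="code-font">BUCKET</span> and <span class="code-font">FOLDER</span> from <span class="code-font">NOTEBOOK_DIR</span> with two Python one-liners. To make that step easier to follow, here is a sketch of the same split using only shell parameter expansion. The function name <span class="code-font">parse_notebook_dir</span> is hypothetical; the script itself does not define it.</p>

```shell
# Hypothetical helper: split an S3 URI such as s3://bucket/some/folder/
# into BUCKET (first path component) and FOLDER (the rest, without slashes
# at either end), mirroring the Python one-liners in the script above.
parse_notebook_dir() {
  local uri="${1%/}/"        # normalize to exactly one trailing slash
  local path="${uri#*//}"    # drop the scheme: bucket/some/folder/
  BUCKET="${path%%/*}"       # everything before the first slash
  FOLDER="${path#*/}"        # everything after the first slash
  FOLDER="${FOLDER%/}"       # drop the trailing slash
}

parse_notebook_dir "s3://myS3Bucket/notebooks/"
echo "bucket '$BUCKET' folder '$FOLDER'"
```

<p>For the default argument this yields the bucket <span class="code-font">myS3Bucket</span> and the folder <span class="code-font">notebooks</span>, which is why Jupyter's notebook directory becomes <span class="code-font">/mnt/myS3Bucket/notebooks</span>.</p>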

<p>Launch an EMR cluster</p>

In [None]:
%%bash
aws emr create-cluster \
    --name "Spark_Cluster" \
    --release-label emr-5.8.0 \
    --applications Name=Spark Name=Zeppelin \
    --log-uri s3://YOUR_BUCKET/logs/ \
    --service-role emr-default-role \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.large \
    --ec2-attributes InstanceProfile=emr-default-ec2-role,KeyName=YOUR_KEY,SubnetId="YOUR_SUBNET" \
    --bootstrap-actions Name="Install Jupyter notebook",Path="s3://YOUR_BUCKET/scripts/install-jupyter.sh",Args=["jupyter","s3://YOUR_BUCKET/jupyter/"] \
    --configurations file:///YOUR_PATH/config/hdfs-config.json

<p>Check the state of the cluster while it launches</p>

In [None]:
!aws emr describe-cluster --cluster-id YOUR_CLUSTER_ID --query "Cluster.Status"
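
<p>Bootstrapping takes several minutes, so the status check usually has to be repeated. A minimal polling sketch follows, assuming the state is read from <span class="code-font">Cluster.Status.State</span>; the helper name <span class="code-font">wait_for_cluster</span> and the 30-second default delay are illustrative choices, not part of the original material.</p>

```shell
# Hypothetical helper: poll the cluster state until the cluster is usable
# (returns 0) or has terminated (returns 1).
wait_for_cluster() {
  local cluster_id="$1" delay="${2:-30}" state
  while true; do
    state=$(aws emr describe-cluster --cluster-id "$cluster_id" \
              --query "Cluster.Status.State" --output text)
    echo "state: $state"
    case "$state" in
      WAITING|RUNNING) return 0 ;;                                # cluster is up
      TERMINATING|TERMINATED|TERMINATED_WITH_ERRORS) return 1 ;;  # launch failed
    esac
    sleep "$delay"   # STARTING or BOOTSTRAPPING: try again later
  done
}

# Usage (YOUR_CLUSTER_ID is a placeholder):
#   wait_for_cluster YOUR_CLUSTER_ID
```

<p>The AWS CLI also ships a built-in waiter, <span class="code-font">aws emr wait cluster-running --cluster-id YOUR_CLUSTER_ID</span>, which blocks without printing the intermediate states.</p>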

<p>Set up an SSH tunnel using dynamic port forwarding (run in your terminal)</p>

In [None]:
sudo aws emr socks --cluster-id YOUR_CLUSTER_ID --key-pair-file /YOUR_PATH/your_private_key.pem

<p>Print the internal (private) host name of the master node</p>

In [None]:
!aws emr list-instances \
        --cluster-id YOUR_CLUSTER_ID \
        --instance-group-types "MASTER" \
        --query "Instances[0].PrivateDnsName" \
        --output text

<p>Switch to your browser with the FoxyProxy extension enabled and enter the internal host name followed by the Jupyter port to open the Jupyter dashboard</p>

<p>For example,</p>
<p class="code-block code-font">ip-10-0-1-11.eu-west-1.compute.internal:<span class="code-key">8888</span></p>

<p>Upload the files in the <span class="code-font">data/</span> directory to S3. Open the <span class="code-font">Spark_Dataframe_Basics.ipynb</span> notebook on the EMR Cluster through the Jupyter dashboard</p>

<p>Complete tasks in the notebook</p>

<p>Terminate the cluster after completing the class tasks</p>

In [None]:
!aws emr terminate-clusters --cluster-ids YOUR_CLUSTER_ID

<p>Make sure all clusters are terminated</p>

In [None]:
!aws emr list-clusters --active 
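
<p>The final check can be turned into a pass/fail test, which helps when a forgotten cluster would keep incurring charges. This is a sketch under the assumption that <span class="code-font">--query "Clusters[].Id"</span> with text output prints nothing (or <span class="code-font">None</span>) when no cluster is active; both function names are hypothetical.</p>

```shell
# Hypothetical helpers: list the IDs of active clusters and fail if any remain.
active_cluster_ids() {
  aws emr list-clusters --active --query "Clusters[].Id" --output text
}

ensure_no_active_clusters() {
  local ids
  ids=$(active_cluster_ids)
  if [ -n "$ids" ] && [ "$ids" != "None" ]; then
    echo "still active: $ids"
    return 1
  fi
  echo "no active clusters"
}

# Usage (requires configured AWS credentials):
#   ensure_no_active_clusters
```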

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a href="https://bytes.babbel.com/en/articles/2017-07-04-spark-with-jupyter-inside-vpc.html">Launch an AWS EMR cluster with Pyspark and Jupyter Notebook inside a VPC</a><br>
<a href="http://www.exegetic.biz/blog/2017/08/using-aws-cli/">Driving AWS from the Command Line</a><br>
<a href="https://jupyterhub.readthedocs.io/en/latest/">JupyterHub</a><br>
<a href="https://github.com/jupyterhub/jupyterhub-tutorial">Getting Started with JupyterHub tutorial</a><br>
<a href="https://medium.com/@muppal/aws-emr-jupyter-spark-2-x-7da54dc4bfc8">AWS EMR+ Jupyter + spark 2.x</a><br>