<a href="https://colab.research.google.com/github/CoolandHot/colab_tricks/blob/main/spark_cluster_setup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Refer to: [Tutorial on setting up Spark Cluster](https://medium.com/@jootorres_11979/how-to-install-and-set-up-an-apache-spark-cluster-on-hadoop-18-04-b4d70650ed42)

# Run the tunnel to set up SSH remote hosting and install Spark, Hadoop, Scala, Java

In [1]:
#@title Colab-ssh tunnel
#@markdown Execute this cell to open the ssh tunnel. Check [colab-ssh documentation](https://github.com/WassimBenzarti/colab-ssh) for more details.

# Install colab_ssh on google colab
!pip install colab_ssh --upgrade

from colab_ssh import launch_ssh_cloudflared, init_git_cloudflared
ssh_tunnel_password = "123456" #@param {type: "string"}
launch_ssh_cloudflared(password=ssh_tunnel_password)

# Optional: if you want to clone a Github or Gitlab repository
# repository_url="<PUT_YOUR_REPOSITORY_URL_HERE>" #@param {type: "string"}
# init_git_cloudflared(repository_url)

Collecting colab_ssh
  Downloading colab_ssh-0.3.27-py3-none-any.whl (26 kB)
Installing collected packages: colab-ssh
Successfully installed colab-ssh-0.3.27


In [2]:
%%capture
#@title Install Spark, Hadoop, Scala, Java

!pip install pyspark
# findspark will locate Spark on the system and import it as a regular library.
!pip install -q findspark

# ping tool (-y answers the confirmation prompt, which would otherwise block the cell)
!apt-get install -y iputils-ping
# SSH
!apt-get install -y openssh-server openssh-client

# Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
Path_of_JAVA_installation = '/usr/lib/jvm/java-8-openjdk-amd64'
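# Scala
!apt-get install -y scala
# Apache Spark with Hadoop
# dlcdn.apache.org only hosts current releases; older ones such as 3.2.0 are kept on archive.apache.org
!wget -q https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz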
!tar xf spark-3.2.0-bin-hadoop3.2.tgz
!mv spark-3.2.0-bin-hadoop3.2 /usr/local/spark

# ngrok
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip


# set ssh config ProxyCommand
!mkdir -p ~/.ssh/
!touch ~/.ssh/config
!chmod 600 ~/.ssh/config
with open('/root/.ssh/config', 'w') as file_object:
    file_object.write('''Host *.trycloudflare.com
    HostName %h
    User root
    Port 22
    ProxyCommand /content/cloudflared access ssh --hostname %h
    StrictHostKeyChecking no
    ''')
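
Because `%%capture` hides all output of the install cell, a failed download or package install can go unnoticed. The following is a minimal sketch of a sanity check, assuming the paths used in that cell; `missing_paths` is a hypothetical helper, not part of Spark or colab-ssh.

```python
import os

def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Paths taken from the install cell; adjust them if you changed that cell.
expected = [
    '/usr/lib/jvm/java-8-openjdk-amd64',   # Path_of_JAVA_installation
    '/usr/local/spark/sbin/start-all.sh',  # unpacked Spark distribution
    '/root/.ssh/config',                   # ProxyCommand configuration
]
print('missing:', missing_paths(expected))
```

An empty list means every expected file or directory is in place; anything listed points at the step that needs to be rerun without `%%capture`.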

# The following steps are only required on the master host

In [None]:
MASTER_host = "masterHostForSpark"
MASTER_URL = 'narrow-apps-fewer-decor.trycloudflare.com' #@param {type:"string"}
# `hostname -I` may print several space-separated addresses; keep the first one
MASTER_IP = !hostname -I
MASTER_IP = MASTER_IP[0].split()[0]

slave_num = 3 #@param {type:"integer"}
Slave0_host = "james-assistance-fuji-module.trycloudflare.com" #@param {type:"string"}
Slave1_host = "catalogs-grass-covers-conduct.trycloudflare.com" #@param {type:"string"}
Slave2_host = "davis-circular-share-knife.trycloudflare.com" #@param {type:"string"}
Slave3_host = "high-patrol-vertical-labour.trycloudflare.com" #@param {type:"string"}
Slave4_host = "high-patrol-vertical-labour.trycloudflare.com" #@param {type:"string"}
Slave_hosts = [Slave0_host, Slave1_host, Slave2_host, Slave3_host, Slave4_host]


# !cp /usr/local/spark/conf/spark-defaults.conf.template /usr/local/spark/conf/spark-defaults.conf
# !mkdir -p /usr/local/spark/applicationHistory
# with open('/usr/local/spark/conf/spark-defaults.conf', 'w') as file_object:
#     file_object.write(f'''
# spark.master                     spark://{MASTER_host}:7077
# spark.eventLog.enabled           true
# # spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.eventLog.dir               file:///usr/local/spark/logs
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# spark.history.fs.logDirectory    file:///usr/local/spark/applicationHistory
# spark.history.fs.update.interval 30s
# spark.history.provider           org.apache.spark.deploy.history.FsHistoryProvider 
# ''')

# !cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
with open('/usr/local/spark/conf/spark-env.sh', 'w') as file_object:
    file_object.write(f'SPARK_MASTER_HOST="{MASTER_host}"')

with open('/usr/local/spark/conf/slaves', 'w') as file_object:
    file_object.write(f'\n{MASTER_host}\n')
    for i in range(slave_num):
        file_object.write(f'{Slave_hosts[i]}\n')


# optionally change this machine's hostname to MASTER_host
# !hostname {MASTER_host}
with open('/etc/hosts', 'w') as file_object:
    file_object.write(f'''127.0.0.1	localhost
 ::1	localhost ip6-localhost ip6-loopback
 fe00::0	ip6-localnet
 ff00::0	ip6-mcastprefix
 ff02::1	ip6-allnodes
 ff02::2	ip6-allrouters
 {MASTER_IP}	{MASTER_host}
 ''')


with open('/root/.bashrc', 'a') as file_object:
    file_object.write(f'''
export JAVA_HOME={Path_of_JAVA_installation}
export PATH=$PATH:$JAVA_HOME/bin
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
''')
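# `!source ~/.bashrc` would run in a throwaway subshell and leave this kernel
# unchanged, so export the same variables into the notebook's own environment
import os
os.environ['JAVA_HOME'] = Path_of_JAVA_installation
os.environ['SPARK_HOME'] = '/usr/local/spark'
os.environ['PATH'] += f":{Path_of_JAVA_installation}/bin:/usr/local/spark/bin:/usr/local/spark/sbin"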



# create rsa key pairs
!ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
!cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

from google.colab import files
rsa_files = 'rsa_files.zip'
!zip -r {rsa_files} ~/.ssh/id_rsa ~/.ssh/id_rsa.pub
files.download(rsa_files)
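
The worker list in the cell above is built by indexing `Slave_hosts` with `slave_num`, so a count larger than the list or a duplicated hostname only shows up later, when the workers start. The following is a minimal sketch of the same file content built by a pure function with those checks added; `render_workers_file` is a hypothetical helper, and unlike the cell above it does not emit a leading blank line.

```python
def render_workers_file(master_host, worker_hosts, worker_count):
    """Build the text for conf/slaves: the master first, then the workers."""
    if not 0 <= worker_count <= len(worker_hosts):
        raise ValueError(
            f'worker_count={worker_count}, but {len(worker_hosts)} hosts were given'
        )
    selected = [master_host] + list(worker_hosts[:worker_count])
    if any(not host.strip() for host in selected):
        raise ValueError('empty hostname in the worker list')
    # dict.fromkeys keeps the first occurrence of each host and preserves order.
    unique = list(dict.fromkeys(selected))
    return '\n'.join(unique) + '\n'

# Hostnames copied from the form fields above.
print(render_workers_file(
    'masterHostForSpark',
    ['james-assistance-fuji-module.trycloudflare.com',
     'catalogs-grass-covers-conduct.trycloudflare.com',
     'davis-circular-share-knife.trycloudflare.com'],
    3,
))
```

The returned string can be written to `/usr/local/spark/conf/slaves` in place of the loop.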

In [23]:
# !scp /usr/local/spark/conf/spark-defaults.conf root@{Slave0_host}:/usr/local/spark/conf
# !scp /usr/local/spark/conf/spark-defaults.conf root@{Slave1_host}:/usr/local/spark/conf
# !scp /usr/local/spark/conf/slaves root@{Slave0_host}:/usr/local/spark/conf
# !scp /usr/local/spark/conf/slaves root@{Slave1_host}:/usr/local/spark/conf

Copy the public key to each **slave** host, using either of the following methods.
1. From the local machine, in PowerShell:
```
$slave = "<slave_host>"
type $env:USERPROFILE\.ssh\id_rsa.pub | ssh $slave "cat >> .ssh/authorized_keys"
```
After you enter the host's password, the key is installed and you can log in directly with `ssh $slave`.

Note: if you copy your local private key `id_rsa` to the remote host in the same way, tighten its permissions from 0644 to 0400 with `chmod 400 ~/.ssh/id_rsa`; otherwise ssh refuses to use it.

2. From inside the master host:
```
!ssh-copy-id <slave_host>
```
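
The PowerShell pipeline in method 1 appends the key every time it runs, so repeating it leaves duplicate lines in `authorized_keys`. The following is a minimal sketch of an idempotent append that could be run on the slave host instead; `add_authorized_key` is a hypothetical helper, not part of OpenSSH or colab-ssh.

```python
import os

def add_authorized_key(public_key, authorized_keys_path):
    """Append `public_key` unless it is already listed. Return True if appended."""
    public_key = public_key.strip()
    parent = os.path.dirname(authorized_keys_path)
    if parent:
        os.makedirs(parent, mode=0o700, exist_ok=True)
    content = ''
    if os.path.exists(authorized_keys_path):
        with open(authorized_keys_path) as fh:
            content = fh.read()
    if public_key in (line.strip() for line in content.splitlines()):
        return False
    with open(authorized_keys_path, 'a') as fh:
        # Start on a fresh line if the existing file lacks a trailing newline.
        if content and not content.endswith('\n'):
            fh.write('\n')
        fh.write(public_key + '\n')
    os.chmod(authorized_keys_path, 0o600)  # sshd ignores files that are too open
    return True

# Example call (paths as used elsewhere in this notebook):
# add_authorized_key(open('/root/.ssh/id_rsa.pub').read(), '/root/.ssh/authorized_keys')
```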


You can test the connection with the following command (type `exit` to disconnect):

In [None]:
# !ssh -o ProxyCommand="/content/cloudflared access ssh --hostname %h" {Slave0_host}
!ssh {Slave0_host}
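
The interactive login above checks one host at a time. The following is a minimal sketch of a non-interactive check that can be looped over all workers; it assumes key-based login is already set up, and `ssh_check_command` and `reachable` are hypothetical helper names. `BatchMode=yes` makes ssh fail instead of prompting for a password.

```python
import subprocess

def ssh_check_command(host, timeout_s=10):
    """Build an ssh command that runs `true` on `host` without prompting."""
    return ['ssh', '-o', 'BatchMode=yes',
            '-o', f'ConnectTimeout={timeout_s}', host, 'true']

def reachable(host, timeout_s=10):
    """Return True if a key-based SSH login to `host` succeeds."""
    try:
        result = subprocess.run(ssh_check_command(host, timeout_s),
                                capture_output=True, timeout=timeout_s + 5)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0

# Example loop, using the variables defined in the master cell:
# for host in Slave_hosts[:slave_num]:
#     print(host, 'ok' if reachable(host) else 'UNREACHABLE')
```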

Now start the master and the workers:

In [5]:
!/usr/local/spark/sbin/start-all.sh
# !/usr/local/spark/sbin/stop-all.sh

starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/logs/spark--org.apache.spark.deploy.master.Master-1-174f4a950415.out
masterHostForSpark: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-174f4a950415.out
davis-circular-share-knife.trycloudflare.com: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-505e4e27e3a5.out
james-assistance-fuji-module.trycloudflare.com: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-355709b4fafc.out
catalogs-grass-covers-conduct.trycloudflare.com: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-b173ea3e5702.out


In [7]:
# check whether the services started
!jps

2361 Jps
2153 Master
2252 Worker
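
Reading the `jps` listing by eye works for one host. The following is a minimal sketch that parses the same output and reports which expected daemons are absent; `parse_jps` is a hypothetical helper, and the sample text is the output shown above.

```python
def parse_jps(output):
    """Map each JVM process name in `jps` output to its list of PIDs."""
    processes = {}
    for line in output.splitlines():
        parts = line.split(maxsplit=1)
        # Skip blank lines and entries for which jps printed no name.
        if len(parts) == 2 and parts[0].isdigit():
            processes.setdefault(parts[1], []).append(int(parts[0]))
    return processes

sample = "2361 Jps\n2153 Master\n2252 Worker\n"
running = parse_jps(sample)
missing = [name for name in ('Master', 'Worker') if name not in running]
print('running:', running)
print('missing:', missing)
```

In the notebook, the live output can be fed in with `parse_jps('\n'.join(_jps))` after `_jps = !jps`, where `_jps` is an arbitrary variable name.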


In [6]:
# check the log outputs
logs = !ls /usr/local/spark/logs/
logs = logs.s.split()
for log_file in logs:
    !cat /usr/local/spark/logs/{log_file}

Spark Command: /usr/lib/jvm/java-11-openjdk-amd64/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host masterHostForSpark --port 7077 --webui-port 8080
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/01/22 01:02:31 INFO Master: Started daemon with process name: 2153@174f4a950415
22/01/22 01:02:31 INFO SignalUtils: Registering signal handler for TERM
22/01/22 01:02:31 INFO SignalUtils: Registering signal handler for HUP
22/01/22 01:02:31 INFO SignalUtils: Registering signal handler for INT
22/01/22 01:02:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/22 01:02:32 INFO SecurityManager: Changing view acls to: root
22/01/22 01:02:32 INFO SecurityManager: Changing modify acls to: root
22/01/22 01:02:32 INFO SecurityManager: Changing view acls groups to: 
22/01/22 01:02:32 INFO SecurityManager: Changing modify acls
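
Printing every log in full buries the few lines that matter. The following is a minimal sketch that keeps only lines at chosen log levels, assuming the `date time LEVEL Logger: message` layout seen in the output above; `lines_at_level` is a hypothetical helper, and continuation lines such as stack traces are not captured.

```python
def lines_at_level(log_text, levels=('WARN', 'ERROR')):
    """Return the log lines whose third field is one of `levels`."""
    selected = []
    for line in log_text.splitlines():
        fields = line.split(maxsplit=3)  # date, time, level, rest of the line
        if len(fields) >= 3 and fields[2] in levels:
            selected.append(line)
    return selected

# Lines copied from the master log above.
sample_log = """Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/01/22 01:02:31 INFO SignalUtils: Registering signal handler for TERM
22/01/22 01:02:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/22 01:02:32 INFO SecurityManager: Changing view acls to: root
"""
for line in lines_at_level(sample_log):
    print(line)
```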

In [8]:
from google.colab import output
# 8081 is the default worker web UI port; the master web UI is on 8080 (see the log above)
output.serve_kernel_port_as_window(8081)

<IPython.core.display.Javascript object>
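
Once the daemons are up, a PySpark session can be pointed at the master. The following is a minimal sketch, assuming the master listens on port 7077 as shown in the log and that Spark is unpacked in `/usr/local/spark`; `spark_master_url` and `connect` are hypothetical helper names, and the app name is arbitrary. `findspark.init` is used so that the PySpark bundled with the cluster's Spark is imported, which may differ from the pip-installed version.

```python
def spark_master_url(host, port=7077):
    """Build the standalone-cluster master URL, e.g. spark://host:7077."""
    return f'spark://{host}:{port}'

def connect(master_url, app_name='colab-cluster-check',
            spark_home='/usr/local/spark'):
    """Create (or reuse) a SparkSession attached to the standalone master."""
    # Imported here so the URL helper stays usable without Spark installed.
    import findspark
    findspark.init(spark_home)
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .master(master_url)
            .appName(app_name)
            .getOrCreate())

# Example usage on the master host:
# spark = connect(spark_master_url('masterHostForSpark'))
# print(spark.sparkContext.parallelize(range(100)).sum())
# spark.stop()
```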