# **Lab Distributed Data Analytics**

## Tutorial 5

Hadoop is an open-source framework for storing, maintaining, and analyzing large datasets on clusters of commodity hardware. In other words, it is essentially a data management tool.

### Setting up a Hadoop infrastructure

Installing and configuring Java

Hadoop is a Java-based data processing framework and relies on the Java Virtual Machine (JVM) to run its components. It requires a JDK installation of a supported version. Hadoop 3.x releases support Java 8, and only the later 3.x releases add runtime support for Java 11, so Java 8 is installed here for compatibility.
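Since the Java version matters, it can be handy to check it programmatically. The following is a minimal sketch that extracts the major version from the text printed by `java -version`; the helper name `java_major_version` is hypothetical, and the code assumes the usual `openjdk version "..."` format.

```python
import re

def java_major_version(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_362").
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first

sample = 'openjdk version "1.8.0_362"'
print(java_major_version(sample))  # 8
```

Note that `java -version` writes to standard error, so when capturing it with `subprocess.run(..., capture_output=True, text=True)` the text is found in `.stderr` rather than `.stdout`.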

In [None]:
#Installing java 8 for compatibility purposes
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
#Switching java version to use as default (option 2)
!update-alternatives --config java

There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                            Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode


In [None]:
#Switching javac version to use as default (option 2)
!update-alternatives --config javac

There are 2 choices for the alternative javac (providing /usr/bin/javac).

  Selection    Path                                          Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/javac   1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/javac   1111      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/bin/javac    1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/javac to provide /usr/bin/javac (javac) in manual mode


In [None]:
#Switching jps version to use as default (option 2)
!update-alternatives --config jps

There are 2 choices for the alternative jps (providing /usr/bin/jps).

  Selection    Path                                        Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/jps   1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/jps   1111      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/bin/jps    1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/jps to provide /usr/bin/jps (jps) in manual mode


In [None]:
#Checking java default version
!java -version

openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)


In [None]:
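#creating java home variable
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["JRE_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/jre"
#Python does not expand $NAME inside strings, so the PATH entries are built from the values above
#(the Hadoop executables are called by their full path later on)
os.environ["PATH"] += ":" + os.environ["JAVA_HOME"] + "/bin:" + os.environ["JRE_HOME"] + "/bin"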
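One point worth keeping in mind when working with `os.environ`: Python stores strings literally, so a `$NAME` reference inside a value is not expanded the way a shell would expand it. A minimal sketch of the difference, using `os.path.expandvars` from the standard library (the variable `DEMO_HOME` is hypothetical and only serves as an illustration):

```python
import os

os.environ["DEMO_HOME"] = "/opt/demo"  # hypothetical variable for illustration

literal = "$DEMO_HOME/bin"              # stored exactly as written
expanded = os.path.expandvars(literal)  # replaces $DEMO_HOME with its value

print(literal)   # $DEMO_HOME/bin
print(expanded)  # /opt/demo/bin
```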

Installing and configuring Secure Shell (SSH)

We need a way for the master node to remotely access every node in the cluster. Hadoop's start-up scripts use passphraseless SSH for this.

In [None]:
#Purging any previous openssh-server installation to start from a clean state
!apt-get purge openssh-server

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package 'openssh-server' is not installed, so not removed
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


In [None]:
#installing openssh-server
!apt-get install openssh-server -qq > /dev/null

In [None]:
#starting the server
!service ssh start

 * Starting OpenBSD Secure Shell server sshd
   ...done.


We need to make sure that we can SSH into localhost without entering a password. SSH therefore has to be set up to allow passwordless login for the user running Hadoop (root in this notebook). The simplest way to achieve this is to generate a public-private key pair.

In [None]:
#creating a new rsa key pair with empty password
!ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

Generating public/private rsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:OpVr+4wgJFUOV0i4kr948BnY9Xlx0uDFijl+DgSGhEs root@7ae669598d5a
The key's randomart image is:
+---[RSA 3072]----+
|   oo++o.  .     |
|  E o=+   . o    |
| . o.o.. + =     |
|  +.. . =.= o    |
|  .=.. +So +     |
|  oo+  o=.o      |
|   +.++ o=       |
|  . =. + +.      |
|   .    o.o      |
+----[SHA256]-----+


In [None]:
#Showing the public key
!more /root/.ssh/id_rsa.pub

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDc7AOtSCpu5s9c/pemtsbtH3FdQTLJ2nWy7yJf0aMg
bas6JLtbHCxVuhl3Bnk5jS91ZqiYPb4yaXgmME25WPkHnLglo4LbfIUu/AxwC3U7IbzEJxkJz8QDm+Ok
nEn+UgvOSRa0MOJyh950sDy4yja7fKbpKxVdDY0w8PqZrEY8l5QQpkceIC/k+dt55OYLJxl7vIO89VCZ
kXYCvoeFyKT8Yi36uk91ULVK4z/GrVhUWkYjGXpd/nRAcK1Bs3J9K3GMjBd/YpChKHQ1PHxlbSioP+NY
iYpyiOnoz7ac9iLvrB9/okFu1AAfdNkheGGk3HIeo14jl6/CJGV7TvvjmUd/9IskAxwaOLiUCHHD8+X+
Azuj98nfTh/3SxfzTvj6fsiAo8NC/2tgConrE9H1Xd5hLxVCH8/ViGCrQ0/HUUr35owvQMJHK9Cm9jKC
Rw6lza7JVQ1pQ6x30o3zwJVKQoPYTlXu9mnmxRDz0zH8b2sG0aexJs2y0/VB/nP14Vk2bOE= root@7a
e669598d5a


In [None]:
#appending the public key to authorized_keys
!cat $HOME/.ssh/id_rsa.pub>>$HOME/.ssh/authorized_keys
#changing the permissions on the key
!chmod 0600 ~/.ssh/authorized_keys
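To confirm that the key really was appended and that the file is restricted to its owner, a small check like the one below could be used. This is a sketch under assumptions: the helper name `key_is_authorized` is hypothetical, the permission test simply mirrors the `0600` mode set above, and the demo runs on throwaway files with a placeholder key.

```python
import os
import stat
import tempfile

def key_is_authorized(pub_key_path, authorized_keys_path):
    """Check that the public key is listed and the file is owner-only."""
    with open(pub_key_path) as f:
        pub_key = f.read().strip()
    with open(authorized_keys_path) as f:
        authorized = [line.strip() for line in f if line.strip()]
    mode = stat.S_IMODE(os.stat(authorized_keys_path).st_mode)
    # 0o077 masks the group/other permission bits; all of them should be off.
    return pub_key in authorized and (mode & 0o077) == 0

# Demo on throwaway files (the key text is a placeholder, not a real key).
with tempfile.TemporaryDirectory() as tmp:
    pub = os.path.join(tmp, "id_rsa.pub")
    auth = os.path.join(tmp, "authorized_keys")
    for path in (pub, auth):
        with open(path, "w") as f:
            f.write("ssh-rsa PLACEHOLDERKEY user@host\n")
    os.chmod(auth, 0o600)
    strict_ok = key_is_authorized(pub, auth)
    os.chmod(auth, 0o644)
    loose_ok = key_is_authorized(pub, auth)

print(strict_ok, loose_ok)  # True False
```

In the notebook, the call would be `key_is_authorized(os.path.expanduser("~/.ssh/id_rsa.pub"), os.path.expanduser("~/.ssh/authorized_keys"))`.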

In [None]:
#connecting to the local machine
!ssh -o StrictHostKeyChecking=no localhost uptime

 08:46:08 up 4 min,  0 users,  load average: 1.10, 0.60, 0.25


Installing Hadoop 3.2.3

In [None]:
#Downloading Hadoop 3.2.3
#From Google drive
!gdown 'https://drive.google.com/uc?id=12P5hpS2DjMG4P3YukBP0D4s6uUUEJG-A' -O hadoop-3.2.3.tar.gz
# !wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz #From official website

Downloading...
From: https://drive.google.com/uc?id=12P5hpS2DjMG4P3YukBP0D4s6uUUEJG-A
To: /content/hadoop-3.2.3.tar.gz
100% 492M/492M [00:07<00:00, 65.2MB/s]
CPU times: user 91.1 ms, sys: 12 ms, total: 103 ms
Wall time: 8.87 s


In [None]:
#untarring the file
!sudo tar -xzf hadoop-3.2.3.tar.gz
!rm hadoop-3.2.3.tar.gz #to remove the tar file

In [None]:
#copying the hadoop directory to /usr/local
!cp -r hadoop-3.2.3/ /usr/local/

In [None]:
#Exploring hadoop-3.2.3 directory
!ls -l /usr/local/hadoop-3.2.3/

total 204
drwxr-xr-x 2 root root   4096 May 27 08:46 bin
drwxr-xr-x 3 root root   4096 May 27 08:46 etc
drwxr-xr-x 2 root root   4096 May 27 08:46 include
drwxr-xr-x 3 root root   4096 May 27 08:46 lib
drwxr-xr-x 4 root root   4096 May 27 08:46 libexec
-rw-r--r-- 1 root root 150571 May 27 08:46 LICENSE.txt
-rw-r--r-- 1 root root  21943 May 27 08:46 NOTICE.txt
-rw-r--r-- 1 root root   1361 May 27 08:46 README.txt
drwxr-xr-x 3 root root   4096 May 27 08:46 sbin
drwxr-xr-x 4 root root   4096 May 27 08:46 share


In [None]:
#Specifying the JAVA_HOME variable in hadoop-env.sh
!sed -i '/export JAVA_HOME=/a export JAVA_HOME=\/usr\/lib\/jvm\/java-8-openjdk-amd64' /usr/local/hadoop-3.2.3/etc/hadoop/hadoop-env.sh

In [None]:
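#creating hadoop home variable
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.2.3"
#adding the hadoop executables to PATH
os.environ["PATH"] += ":/usr/local/hadoop-3.2.3/bin:/usr/local/hadoop-3.2.3/sbin"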

Hadoop modes

Hadoop mainly works in 3 different modes:

*   Standalone Mode
*   Pseudo-distributed Mode
*   Fully-Distributed Mode

Standalone Mode

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. Instead of HDFS, this mode uses the local file system. It is useful for debugging, and there is no need to configure core-site.xml, hdfs-site.xml, mapred-site.xml, or the masters and slaves files. Standalone mode is usually the fastest mode in Hadoop.

Pseudo-distributed Mode

Hadoop can also run on a single node in pseudo-distributed mode. In this mode, each daemon runs in a separate Java process, and custom configuration is required (core-site.xml, hdfs-site.xml, mapred-site.xml). HDFS is used for input and output. This mode of deployment is useful for testing and debugging.

Fully Distributed Mode

This is the production mode of Hadoop. Typically, one machine in the cluster is dedicated to the NameNode and another to the Resource Manager; these are the masters. All other nodes act as Data Node and Node Manager; these are the slaves. Configuration parameters and the environment need to be specified for the Hadoop daemons.

Running Hadoop in Pseudo-distributed mode

In this mode, all the distributed components of Hadoop come into play: every Hadoop daemon responsible for distributed storage and distributed processing runs on the same machine.

Master daemons:

*   NameNode
*   Resource Manager
*   Secondary NameNode

Slave daemons:

*   DataNode
*   Node Manager

By setting properties in the following XML configuration files, we tell Hadoop which machines are in the cluster and where and how the Hadoop daemons should run.
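As an alternative to editing these files as plain text, a property could also be set with Python's `xml.etree.ElementTree`. The following is a minimal sketch and not part of the original setup; the helper name `set_property` is hypothetical. It updates an existing property instead of adding a second copy, so it can be re-run safely. One caveat: ElementTree discards XML comments when it rewrites a file, so a comment header in the file would be lost.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def set_property(path, name, value):
    """Add or update one <property> in a Hadoop-style configuration file."""
    tree = ET.parse(path)
    root = tree.getroot()  # the <configuration> element
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            break  # reuse the existing property
    else:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
    value_el = prop.find("value")
    if value_el is None:
        value_el = ET.SubElement(prop, "value")
    value_el.text = value
    tree.write(path, encoding="utf-8", xml_declaration=True)

# Demo on a temporary file standing in for core-site.xml.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "core-site.xml")
    with open(demo_path, "w") as f:
        f.write("<configuration>\n</configuration>\n")
    set_property(demo_path, "fs.defaultFS", "hdfs://localhost:9000")
    set_property(demo_path, "fs.defaultFS", "hdfs://localhost:9000")  # re-run
    props = ET.parse(demo_path).getroot().findall("property")
    demo_count = len(props)
    demo_value = props[0].findtext("value")

print(demo_count, demo_value)  # 1 hdfs://localhost:9000
```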

In [None]:
#Configuring core-site.xml
!sed -i '/<configuration>/a\
  <property>\n\
    <name>fs.defaultFS</name>\n\
    <value>hdfs://localhost:9000</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/core-site.xml

In [None]:
#Configuring hdfs-site.xml
!sed -i '/<configuration>/a\
  <property>\n\
    <name>dfs.replication</name>\n\
    <value>1</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/hdfs-site.xml

In [None]:
#Configuring mapred-site.xml
!sed -i '/<configuration>/a\
  <property>\n\
    <name>mapreduce.framework.name</name>\n\
    <value>yarn</value>\n\
  </property>\n\
  <property>\n\
    <name>mapreduce.application.classpath</name>\n\
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/mapred-site.xml

In [None]:
#Configuring yarn-site.xml
!sed -i '/<configuration>/a\
  <property>\n\
    <description>The hostname of the RM.</description>\n\
    <name>yarn.resourcemanager.hostname</name>\n\
    <value>localhost</value>\n\
  </property>\n\
  <property>\n\
    <name>yarn.nodemanager.aux-services</name>\n\
    <value>mapreduce_shuffle</value>\n\
  </property>\n\
  <property>\n\
    <name>yarn.nodemanager.env-whitelist</name>\n\
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/yarn-site.xml
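To check what ended up in a configuration file, its properties can be read back as a dictionary. This is a small sketch using only the standard library; the helper name `read_properties` is hypothetical, and the sample text below stands in for a real file.

```python
import io
import xml.etree.ElementTree as ET

def read_properties(source):
    """Map property names to values; `source` is a file path or file object."""
    root = ET.parse(source).getroot()
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}

sample = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>"""

print(read_properties(io.StringIO(sample)))  # {'dfs.replication': '1'}
```

For the files configured above, the call would look like `read_properties("/usr/local/hadoop-3.2.3/etc/hadoop/hdfs-site.xml")`. A property that shows up only once in the dictionary may still be duplicated in the file, since later entries overwrite earlier ones here.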

Before HDFS can be used for the first time the file system must be formatted.

In [None]:
#Formatting the NameNode (this erases any existing NameNode metadata)
!$HADOOP_HOME/bin/hdfs namenode -format

2023-05-27 08:47:00,866 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 7ae669598d5a/172.28.0.12
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.2.3
STARTUP_MSG:   classpath = /usr/local/hadoop-3.2.3/etc/hadoop:/usr/local/hadoop-3.2.3/share/hadoop/common/lib/avro-1.7.7.jar:/usr/local/hadoop-3.2.3/share/hadoop/common/lib/curator-client-2.13.0.jar:/usr/local/hadoop-3.2.3/share/hadoop/common/lib/checker-qual-2.5.2.jar:/usr/local/hadoop-3.2.3/share/hadoop/common/lib/jul-to-slf4j-1.7.25.jar:/usr/local/hadoop-3.2.3/share/hadoop/common/lib/failureaccess-1.0.jar:/usr/local/hadoop-3.2.3/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/usr/local/hadoop-3.2.3/share/hadoop/common/lib/slf4j-api-1.7.25.jar:/usr/local/hadoop-3.2.3/share/hadoop/common/lib/jetty-servlet-9.4.40.v20210413.jar:/usr/local/hadoop-3.2.3/share/hadoop/common/lib/log4j-1.2.17.jar:/usr/local/hadoop-3.2.3/share/hadoop
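Because formatting erases existing NameNode metadata, it may be worth checking whether a NameNode directory has already been formatted before running the command again. The sketch below assumes that a formatted directory contains a `current/VERSION` file; the helper name `namenode_is_formatted` is hypothetical, and the demo uses a throwaway directory rather than a real NameNode directory.

```python
import os
import tempfile

def namenode_is_formatted(name_dir):
    """Return True if the directory looks like a formatted NameNode store."""
    return os.path.isfile(os.path.join(name_dir, "current", "VERSION"))

# Demo with a throwaway directory instead of a real NameNode directory.
with tempfile.TemporaryDirectory() as tmp:
    before = namenode_is_formatted(tmp)
    os.makedirs(os.path.join(tmp, "current"))
    open(os.path.join(tmp, "current", "VERSION"), "w").close()
    after = namenode_is_formatted(tmp)

print(before, after)  # False True
```

With the default settings and the root user, the directory to check would presumably be `/tmp/hadoop-root/dfs/name`; this path is an assumption based on Hadoop's default `hadoop.tmp.dir` and `dfs.namenode.name.dir` values and should be verified against the actual configuration.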

In [None]:
#Hadoop scripts
!ls $HADOOP_HOME/sbin

distribute-exclude.sh	 start-all.sh	      stop-balancer.sh
FederationStateStore	 start-balancer.sh    stop-dfs.cmd
hadoop-daemon.sh	 start-dfs.cmd	      stop-dfs.sh
hadoop-daemons.sh	 start-dfs.sh	      stop-secure-dns.sh
httpfs.sh		 start-secure-dns.sh  stop-yarn.cmd
kms.sh			 start-yarn.cmd       stop-yarn.sh
mr-jobhistory-daemon.sh  start-yarn.sh	      workers.sh
refresh-namenodes.sh	 stop-all.cmd	      yarn-daemon.sh
start-all.cmd		 stop-all.sh	      yarn-daemons.sh


In [None]:
#creating other necessary environment variables
os.environ["HDFS_NAMENODE_USER"] = "root"
os.environ["HDFS_DATANODE_USER"] = "root"
os.environ["HDFS_SECONDARYNAMENODE_USER"] = "root"
os.environ["YARN_RESOURCEMANAGER_USER"] = "root"
os.environ["YARN_NODEMANAGER_USER"] = "root"

In [None]:
#starting dfs nodes
!$HADOOP_HOME/sbin/start-dfs.sh

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [7ae669598d5a]


In [None]:
#starting yarn nodes
!$HADOOP_HOME/sbin/start-yarn.sh

Starting resourcemanager
Starting nodemanagers


In [None]:
#listing the daemons that are running
!jps

2753 NodeManager
2642 ResourceManager
2389 SecondaryNameNode
2886 Jps
2201 DataNode
2093 NameNode
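The `jps` listing can also be checked automatically. The sketch below parses text in the format shown above and reports which of the five daemons expected in this setup are missing; the names `EXPECTED_DAEMONS` and `missing_daemons` are hypothetical.

```python
EXPECTED_DAEMONS = {"NameNode", "DataNode", "SecondaryNameNode",
                    "ResourceManager", "NodeManager"}

def missing_daemons(jps_output):
    """Return the expected daemons that do not appear in the jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # each line has the form "<pid> <name>"
            running.add(parts[1])
    return EXPECTED_DAEMONS - running

sample = """2753 NodeManager
2642 ResourceManager
2389 SecondaryNameNode
2886 Jps
2201 DataNode
2093 NameNode"""

print(missing_daemons(sample))  # set()
```

An empty set means that all expected daemons are running. If one is missing, the log files of that daemon are usually the first place to look.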


Exposing the Hadoop web interface

In [None]:
from google.colab import output

In [None]:
#Opening the Hadoop (NameNode) web interface served on port 9870 in a new window
output.serve_kernel_port_as_window(9870)

<IPython.core.display.Javascript object>

Downloading and placing the file in HDFS

In [None]:
#Getting file from Google drive
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1nYWcznOsh1dcpMJxgP0nZzKGuB83Awp6' -O adjacency.txt

--2023-05-27 08:47:41--  https://docs.google.com/uc?export=download&id=1nYWcznOsh1dcpMJxgP0nZzKGuB83Awp6
Resolving docs.google.com (docs.google.com)... 142.251.2.139, 142.251.2.113, 142.251.2.101, ...
Connecting to docs.google.com (docs.google.com)|142.251.2.139|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-04-ao-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/sqjbhbmiksit8v39nto6cd834b2hhart/1685177250000/14760575472933726065/*/1nYWcznOsh1dcpMJxgP0nZzKGuB83Awp6?e=download&uuid=64225266-aaa7-47f3-b2c3-6c0b9b9a3af7 [following]
--2023-05-27 08:47:41--  https://doc-04-ao-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/sqjbhbmiksit8v39nto6cd834b2hhart/1685177250000/14760575472933726065/*/1nYWcznOsh1dcpMJxgP0nZzKGuB83Awp6?e=download&uuid=64225266-aaa7-47f3-b2c3-6c0b9b9a3af7
Resolving doc-04-ao-docs.googleusercontent.com (doc-04-ao-docs.googleusercontent.com)... 142.250.141.132, 2607:f8b0:402

In [None]:
#Creating a directory within dfs
!$HADOOP_HOME/bin/hdfs dfs -mkdir /node_count_in_python

In [None]:
#putting the file from local file system to hadoop distributed file system
!$HADOOP_HOME/bin/hdfs dfs -put /content/adjacency.txt /node_count_in_python

In [None]:
#Exploring hadoop folder
!$HADOOP_HOME/bin/hdfs dfs -ls /node_count_in_python

Found 1 items
-rw-r--r--   1 root supergroup         84 2023-05-27 08:47 /node_count_in_python/adjacency.txt


### Mapper and reducer for counting neighbors in a graph

In [None]:
%%writefile ex2_mapper.py

#!/usr/bin/env python
import sys

# read the input line by line from STDIN (standard input)
for line in sys.stdin:
  # remove leading and trailing whitespace
  line = line.strip()
  if not line:
    continue # skip empty lines
  # each line has the form "<node><separator><comma-separated adjacency row>";
  # note: line[0] only works for single-character node ids
  node = int(line[0])
  adj_vector = line[2:].split(',')

  for entry in adj_vector:
    # write one (node, entry) pair per adjacency entry to STDOUT (standard output)
    print('%s\t%s' % (node, entry)) #Hadoop uses tab(\t) as a default separator

Overwriting ex2_mapper.py


In [None]:
%%writefile ex2_reducer.py

#!/usr/bin/env python

import sys

current_node = None
current_count = 0
node = None

# read the entire line from STDIN
for line in sys.stdin:
  # remove leading and trailing whitespace
  line = line.strip()
  # splitting the data on the basis of tab we have provided in mapper.py
  node, count = line.split('\t', 1) #maxsplit = 1

  # convert count (currently a string) to int
  try:
    count = int(count)
  except ValueError:
    # count was not a number, so silently
    # ignore/discard this line
    continue

  # this IF-switch only works because Hadoop sorts map output
  # by key (here: node) before it is passed to the reducer
  if current_node == node:
    current_count += count
  else:
    if current_node is not None: #to not print current_node=None
      # write result to STDOUT
      print('%s\t%s' % (current_node, current_count))
    current_count = count
    current_node = node

#output of last node (skipped if there was no valid input at all)
if current_node is not None:
  print('%s\t%s' % (current_node, current_count))

Overwriting ex2_reducer.py
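The same summing logic can be written more compactly with `itertools.groupby`, which also relies on the input being sorted by key. This is an alternative sketch, not the reducer used in this tutorial; the function name `reduce_sorted_lines` is hypothetical, and unlike the reducer above it does not skip lines with a non-numeric value.

```python
from itertools import groupby

def reduce_sorted_lines(lines):
    """Yield (key, total) pairs from tab-separated key/value lines sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)

sample = ["1\t0\n", "1\t1\n", "1\t1\n", "2\t1\n", "2\t0\n"]
print(list(reduce_sorted_lines(sample)))  # [('1', 2), ('2', 1)]
```

In a streaming script, the function would be fed `sys.stdin` and each pair printed as `key<TAB>total`.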


In [None]:
#Testing the python files work properly (Hadoop is not involved)
!cat adjacency.txt | python ex2_mapper.py | sort -k1,1 | python ex2_reducer.py

1	3
2	2
3	3
4	2
5	4
6	2
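As a cross-check that does not involve MapReduce at all, the neighbor counts can be computed directly by summing each adjacency row. The sketch below assumes the same line layout the mapper relies on (a single-character node id, one separator character, then a comma-separated row of 0/1 entries). The function name `neighbor_counts` is hypothetical, and the sample is a made-up 3-node graph, not the contents of adjacency.txt.

```python
def neighbor_counts(adjacency_text):
    """Return {node: number of neighbors} by summing each adjacency row."""
    counts = {}
    for line in adjacency_text.splitlines():
        line = line.strip()
        if not line:
            continue
        node, row = line[0], line[2:]  # same slicing as the mapper
        counts[node] = sum(int(x) for x in row.split(","))
    return counts

# Hypothetical 3-node graph; not the contents of adjacency.txt.
sample = "1 0,1,1\n2 1,0,0\n3 1,0,0\n"
print(neighbor_counts(sample))  # {'1': 2, '2': 1, '3': 1}
```

Calling `neighbor_counts(open("adjacency.txt").read())` should then give the same numbers as the mapper and reducer pipeline.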


In [None]:
#Changing the permissions of the python files
!chmod 777 /content/ex2_mapper.py /content/ex2_reducer.py

Hadoop Streaming is a utility that ships with Hadoop and lets developers write MapReduce programs in any language that can read from standard input and write to standard output, such as Python, C++, or Ruby.

In [None]:
#Running hadoop streaming
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.3.jar \
  -input /node_count_in_python/adjacency.txt \
  -output /node_count_in_python/output \
  -mapper "python /content/ex2_mapper.py" \
  -reducer "python /content/ex2_reducer.py"

packageJobJar: [/tmp/hadoop-unjar3347244383225620917/] [] /tmp/streamjob7091167848872007985.jar tmpDir=null
2023-05-27 08:48:08,488 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-27 08:48:08,800 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-27 08:48:09,145 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1685177260195_0001
2023-05-27 08:48:09,515 INFO mapred.FileInputFormat: Total input files to process : 1
2023-05-27 08:48:10,462 INFO mapreduce.JobSubmitter: number of splits:2
2023-05-27 08:48:10,715 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1685177260195_0001
2023-05-27 08:48:10,719 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-05-27 08:48:11,016 INFO conf.Configuration: resource-types.xml not found
2023-05-27 08:48:11,017 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2023-05-27 08:48:11,4
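Referencing the scripts by absolute local path works here because everything runs on one machine. On a cluster with several nodes, the scripts would normally be shipped with the job through the generic `-files` option, which has to appear before the streaming options. The sketch below only builds such a command as an argument list and does not run it; the helper name `streaming_command` is hypothetical.

```python
import os
import shlex

def streaming_command(hadoop_home, version, input_path, output_path,
                      mapper, reducer):
    """Build a Hadoop Streaming command that ships the scripts via -files."""
    jar = os.path.join(hadoop_home, "share", "hadoop", "tools", "lib",
                       f"hadoop-streaming-{version}.jar")
    return [
        os.path.join(hadoop_home, "bin", "hadoop"), "jar", jar,
        # Generic options such as -files must come before streaming options.
        "-files", f"{mapper},{reducer}",
        "-input", input_path,
        "-output", output_path,
        # Shipped files are referenced by their base name inside the task.
        "-mapper", f"python {os.path.basename(mapper)}",
        "-reducer", f"python {os.path.basename(reducer)}",
    ]

cmd = streaming_command("/usr/local/hadoop-3.2.3", "3.2.3",
                        "/node_count_in_python/adjacency.txt",
                        "/node_count_in_python/output",
                        "/content/ex2_mapper.py", "/content/ex2_reducer.py")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the daemons are running.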

In [None]:
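#Removing the output directory: Hadoop refuses to write to an output directory that already exists,
#so run this before re-running the job (not before reading the results below)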
# !$HADOOP_HOME/bin/hdfs dfs -ls /node_count_in_python/output
!$HADOOP_HOME/bin/hdfs dfs -rm -r /node_count_in_python/output

Deleted /node_count_in_python/output


In [None]:
#Printing the output from hadoop file system
!$HADOOP_HOME/bin/hdfs dfs -cat /node_count_in_python/output/part-00000

1	3
2	2
3	3
4	2
5	4
6	2
