<a href="https://colab.research.google.com/github/Praxis-QR/BDSN/blob/main/KK_B1_Hadoop_WordCount.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://4.bp.blogspot.com/-gbL5nZDkpFQ/XScFYwoTEII/AAAAAAAAAGY/CcVb_HDLwvs2Brv5T4vSsUcz7O4r2Q79ACK4BGAYYCw/s1600/kk3-header00-beta.png)<br>


<hr>

[Prithwis Mukerjee](http://www.linkedin.com/in/prithwis)<br>

# Hadoop
This notebook contains all the code required to install Hadoop in the Colab VM and run a WordCount program using the streaming API <br>
The mapper.py and reducer.py programs are available in the author's G-Drive / GitHub and are downloaded as required<br>
<hr>


## Acknowledgements
Hadoop Installation from [Anjaly Sam's Github Repository](https://github.com/anjalysam/Hadoop) <br>

To get the concept behind map-reduce see [this notebook](https://github.com/Praxis-QR/BDSN/blob/main/Basic_WordCount_Concept.ipynb) <br>


# 1 Download, Install Hadoop

In [1]:
# The default JVM available at /usr/lib/jvm/java-11-openjdk-amd64/  works for Hadoop
# But gives errors with Hive https://stackoverflow.com/questions/54037773/hive-exception-class-jdk-internal-loader-classloadersappclassloader-cannot
# Hence this JVM needs to be installed
!apt-get update > /dev/null
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
# If this cell fails, the Hadoop version has most likely changed on the download server
# Download the latest version of Hadoop and change the version numbers accordingly
# (older releases are usually moved to https://archive.apache.org/dist/hadoop/common/)
!wget -q https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
# Unpack the archive
# tar flags: -x to extract, -z to decompress gzip, -f to read from the named file (add -v for verbose output)
!tar -xzf hadoop-3.3.0.tar.gz
# move the hadoop directory to /usr/local
!mv  hadoop-3.3.0/ /usr/local/
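Because the version number appears in the download URL, the install directory and the jar names used later, one way to keep them in step is to derive them all from a single string. The sketch below is only an illustration: `hadoop_paths` and its dictionary keys are hypothetical names, and the path layout is copied from the commands in this notebook.

```python
# Hypothetical helper: derive every version-dependent string from one value.
HADOOP_VERSION = "3.3.0"  # the version used in this notebook


def hadoop_paths(version: str, prefix: str = "/usr/local") -> dict:
    """Return the archive name, download URL and install paths for a Hadoop version."""
    name = f"hadoop-{version}"
    home = f"{prefix}/{name}"
    return {
        "archive": f"{name}.tar.gz",
        "url": f"https://downloads.apache.org/hadoop/common/{name}/{name}.tar.gz",
        "home": home,
        "bin": f"{home}/bin",
        "examples_jar": f"{home}/share/hadoop/mapreduce/hadoop-mapreduce-examples-{version}.jar",
        "streaming_jar": f"{home}/share/hadoop/tools/lib/hadoop-streaming-{version}.jar",
    }


paths = hadoop_paths(HADOOP_VERSION)
for key, value in paths.items():
    print(f"{key:14s} {value}")
```

Changing `HADOOP_VERSION` then updates every path at once, provided the new release keeps the same directory layout.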

In [3]:
!ls /usr/local

bin	   cuda-11    games		  lib	       sbin	  xgboost
cuda	   cuda-11.0  _gcs_config_ops.so  LICENSE.txt  setup.cfg
cuda-10.0  cuda-11.1  hadoop-3.3.0	  licensing    share
cuda-10.1  etc	      include		  man	       src


# 2 Set Environment Variables


In [4]:
#To find the default Java path
#!readlink -f /usr/bin/java | sed "s:bin/java::"
#!ls /usr/lib/jvm/

In [5]:
#To set java path, go to /usr/local/hadoop-3.3.0/etc/hadoop/hadoop-env.sh then
#. . . export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/ . . .
#we have used a simpler alternative route using os.environ - it works

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"   # default is changed
#os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
# make sure that the version number is as downloaded 
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.3.0/"

In [6]:
!echo $PATH

/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin


In [7]:
# Add Hadoop BIN to PATH
# Read the current PATH from the environment instead of pasting the output of the previous command
current_path = os.environ["PATH"]
new_path = current_path+':/usr/local/hadoop-3.3.0/bin/'
os.environ["PATH"] = new_path
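Running a PATH cell more than once appends the same directory again each time. A minimal sketch of an append that is safe to repeat is given below; `append_to_path` is a hypothetical helper, not part of Hadoop or Colab.

```python
import os


def append_to_path(directory: str, path: str) -> str:
    """Return `path` with `directory` appended, unless it is already listed."""
    entries = [entry for entry in path.split(os.pathsep) if entry]
    # Compare without trailing slashes so "/x/bin" and "/x/bin/" count as the same entry.
    normalised = {entry.rstrip("/") for entry in entries}
    if directory.rstrip("/") not in normalised:
        entries.append(directory)
    return os.pathsep.join(entries)


hadoop_bin = "/usr/local/hadoop-3.3.0/bin"  # same directory as in the cell above
os.environ["PATH"] = append_to_path(hadoop_bin, os.environ.get("PATH", ""))
print(os.environ["PATH"])
```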

# 3 Test Hadoop Installation

In [8]:
#Running Hadoop - Test RUN, not doing anything at all
#!/usr/local/hadoop-3.3.0/bin/hadoop
# UNCOMMENT the following line if you want to make sure that Hadoop is alive!
#!hadoop

In [9]:
# Test Hadoop with the bundled pi example: 16 maps, 100000 samples per map
# Final output should be : Estimated value of Pi is 3.14157500000000000000
# Comment out the following line if you want to skip this test
!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 16 100000

Number of Maps  = 16
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
2021-11-10 00:27:59,261 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-11-10 00:27:59,371 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-11-10 00:27:59,371 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-11-10 00:27:59,541 INFO input.FileInputFormat: Total input files to process : 16
2021-11-10 00:27:59,555 INFO mapreduce.JobSubmitter: number of splits:16
2021-11-10 00:27:59,777 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1063306562_0001
2021-11-10 00:
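To give a feel for what the pi example computes, here is a small in-process sketch. The Hadoop example is described as a quasi-Monte Carlo estimate; the code below is not that implementation. It assumes a Halton sequence with bases 2 and 3 as the point generator, and the function names are hypothetical. Each "map" counts how many of its points fall inside a circle inscribed in the unit square, and the "reduce" step adds the counts and scales by the area ratio.

```python
import math


def halton(index: int, base: int) -> float:
    """Return element `index` (1-based) of the Halton sequence for `base`."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def count_inside(start: int, samples: int) -> int:
    """Map step: count points start..start+samples-1 that fall inside the circle."""
    inside = 0
    for i in range(start, start + samples):
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:  # circle of radius 0.5 centred on the origin
            inside += 1
    return inside


def estimate_pi(num_maps: int, samples_per_map: int) -> float:
    """Reduce step: add the per-map counts and scale by the area ratio."""
    counts = [count_inside(1 + m * samples_per_map, samples_per_map)
              for m in range(num_maps)]
    total = num_maps * samples_per_map
    return 4.0 * sum(counts) / total


print(estimate_pi(num_maps=16, samples_per_map=1000))
```

The sample count here is kept smaller than the 100000 per map used above so that the cell finishes quickly in plain Python.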

# 4 Run WordCount with Hadoop
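Instead of writing the Map and Reduce methods in Java, we use the streaming API of Hadoop with two simple Python programs, mapper.py and reducer.py. With streaming, Hadoop pipes each input line to the mapper's standard input, sorts the mapper's output by key, and pipes the sorted records to the reducer's standard input.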
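The map, sort and reduce steps can be tried in a single Python process before involving Hadoop. The sketch below is not the mapper.py and reducer.py from the repository, which may tokenise or filter words differently; `map_lines` and `reduce_records` are hypothetical names, and the tab-separated `word<TAB>count` record format is an assumption based on the output shown later in this notebook.

```python
import re
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            yield f"{word}\t1"


def reduce_records(sorted_records):
    """Reducer: sum the counts of consecutive records that share a key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Simulate map -> shuffle/sort -> reduce in one process.
text = ["the cat sat on the mat", "The dog sat"]
for record in reduce_records(sorted(map_lines(text))):
    print(record)
```

In a standalone streaming script the same generators would read from `sys.stdin` and print each record. The reducer relies on its input being sorted by key, which is what Hadoop guarantees between the two phases.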

In [10]:
# get mapper.py reducer.py from G_drive
#!gdown https://drive.google.com/uc?id=1VTzQ18cWAj6L29ncW6sABy-ITmDCcv5r
#!gdown https://drive.google.com/uc?id=1Or8Cbf9AsFMHStjMzDw3pXCd6TZ0dqxJ

#get mapper.py reducer.py from this git repository
!wget -q https://raw.githubusercontent.com/Praxis-QR/BDSN/main/mapper.py
!wget -q https://raw.githubusercontent.com/Praxis-QR/BDSN/main/reducer.py

In [11]:
# to see the code, uncomment the following lines
#!cat mapper.py
#print("\n----------------------    see above for mapper, see below for reducer")
#!cat reducer.py

In [12]:
# make the Python scripts executable
!chmod u+rwx /content/mapper.py
!chmod u+rwx /content/reducer.py

In [13]:
# get a simple txt file as data for word count
# or you can upload your own
#!gdown https://drive.google.com/uc?id=1R5W0UVH2S3JjPxerqyX4ue5y6tMt0Wkk
!wget -q https://raw.githubusercontent.com/Praxis-QR/BDSN/main/Chronotantra.txt

In [14]:
# locate the streaming jar file
!find / -name 'hadoop-streaming*.jar'

/usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar
/usr/local/hadoop-3.3.0/share/hadoop/tools/sources/hadoop-streaming-3.3.0-test-sources.jar
/usr/local/hadoop-3.3.0/share/hadoop/tools/sources/hadoop-streaming-3.3.0-sources.jar


In [25]:
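# remove output directories left by earlier runs; Hadoop will not write to an output directory that already exists
# (an error for a directory that does not exist is harmless)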
!rm -r wc_out
!rm -r wc2_out

rm: cannot remove 'wc2_out': No such file or directory


In [26]:
# execute the streaming jar with proper parameters
# four parameters are input file, output directory, the mapper program, the reducer program
#
#!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /content/hobbit.txt -output /content/wc_out -file /content/mapper.py  -file /content/reducer.py  -mapper 'python mapper.py'  -reducer 'python reducer.py'
#!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /content/Chronotantra.txt -output /content/wc_out -file /content/mapper.py  -file /content/reducer.py  -mapper 'python mapper.py'  -reducer 'python reducer.py'
!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /content/Chronotantra.txt -output /content/wc_out  -mapper 'python mapper.py'  -reducer 'python reducer.py'

2021-11-10 01:22:13,204 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-11-10 01:22:13,356 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-11-10 01:22:13,357 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-11-10 01:22:13,375 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-11-10 01:22:13,587 INFO mapred.FileInputFormat: Total input files to process : 1
2021-11-10 01:22:13,611 INFO mapreduce.JobSubmitter: number of splits:1
2021-11-10 01:22:13,842 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local106228352_0001
2021-11-10 01:22:13,842 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-11-10 01:22:14,022 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2021-11-10 01:22:14,024 INFO mapreduce.Job: Running job: job_local106228352_0001
2021-11-10 01:22:14,031 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2021-11-1
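The long command line is easier to maintain when it is assembled from named parts. The following is a sketch only: `streaming_command` is a hypothetical helper, and it assumes the standard streaming options `-input`, `-output`, `-mapper`, `-reducer` and the generic `-files` option, which has to precede the streaming options.

```python
import shlex


def streaming_command(jar, input_path, output_dir, mapper, reducer, ship_files=()):
    """Build the argument list for a Hadoop streaming job."""
    argv = ["hadoop", "jar", jar]
    if ship_files:
        # Generic options such as -files must come before the streaming options.
        argv += ["-files", ",".join(ship_files)]
    argv += ["-input", input_path, "-output", output_dir,
             "-mapper", mapper, "-reducer", reducer]
    return argv


argv = streaming_command(
    jar="/usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar",
    input_path="/content/Chronotantra.txt",
    output_dir="/content/wc_out",
    mapper="python mapper.py",
    reducer="python reducer.py",
)
print(shlex.join(argv))  # shell-quoted form, suitable for pasting after "!"
```

Passing `argv` to `subprocess.run(argv, check=True)` would launch the job from Python instead of a `!` cell.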

In [27]:
# check output directory
!ls wc_out

part-00000  _SUCCESS


In [28]:
# see actual output
#!tail wc_out/part-00000
!head wc_out/part-00000

1	8
10	2
100	2
1000	1
105	1
108	2
109	1
11	1
110	2
113	1


### Sorting the output

In [29]:
# sort numerically (-n) in descending order (-r) on the second tab-separated field (-k 2 -t$'\t')
#https://www.geeksforgeeks.org/sort-command-linuxunix-examples/
!sort -nr -k 2 -t$'\t' wc_out/part-00000 > sorted.txt
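The same ranking can be done in Python, which is convenient when the counts are needed for further analysis. This is a minimal sketch, assuming each line has the `word<TAB>count` form shown in the output above; `top_words` is a hypothetical name, and ties are broken alphabetically here, which may order equal counts differently from `sort`.

```python
def top_words(lines, n=10):
    """Parse 'word<TAB>count' lines and return the n most frequent words."""
    counts = []
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        if count.isdigit():  # skip blank or malformed lines
            counts.append((word, int(count)))
    # Sort by count, highest first; ties are broken alphabetically.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))[:n]


sample = ["1\t8", "10\t2", "would\t346", "could\t247", "one\t198"]
for word, count in top_words(sample, n=3):
    print(f"{word}\t{count}")
```

For the real output, an open file can be passed directly, for example `top_words(open("wc_out/part-00000"), n=30)`.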

In [22]:
!head -30 sorted.txt


would	346
could	247
one	198
time	156
like	145
know	144
us	134
mars	119
back	106
even	105
world	97
something	95
see	95
well	93
hermit	93
two	87
people	86
course	84
around	84
way	82
first	80
really	79
new	76
little	74
long	73
still	71
information	70
ai	67
good	63
earth	60


In [31]:
!tail -30 sorted.txt

2150	1
214	1
206	1
205	1
2019	1
2018	1
2007	1
20062007	1
2000	1
1999	1
1970s	1
1956	1
187	1
186	1
17866	1
156	1
155	1
15	1
150	1
1493	1
133	1
132	1
12th	1
12700	1
115	1
113	1
11	1
109	1
105	1
1000	1


# Chronobooks <br>
![alt text](https://1.bp.blogspot.com/-lTiYBkU2qbU/X1er__fvnkI/AAAAAAAAjtE/GhDR3OEGJr4NG43fZPodrQD5kbxtnKebgCLcBGAsYHQ/s600/Footer2020-600x200.png)<hr>
Chronotantra and Chronoyantra are two science fiction novels that explore the collapse of human civilisation on Earth and then its rebirth and reincarnation, both on Earth and on the distant worlds of Mars, Titan and Enceladus. But is it the human civilisation that is being reborn? Or is it some other sentience that is revealing itself?
If you have an interest in AI and found this material useful, you may consider buying these novels, in paperback or Kindle, from [http://bit.ly/chronobooks](http://bit.ly/chronobooks)