<a href="https://colab.research.google.com/github/Praxis-QR/BDSN/blob/main/KK_B1_Hadoop_WordCount.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://4.bp.blogspot.com/-gbL5nZDkpFQ/XScFYwoTEII/AAAAAAAAAGY/CcVb_HDLwvs2Brv5T4vSsUcz7O4r2Q79ACK4BGAYYCw/s1600/kk3-header00-beta.png)<br>


<hr>

[Prithwis Mukerjee](http://www.linkedin.com/in/prithwis)<br>

#Hadoop
This notebook contains all the code required to install Hadoop in the Colab VM and run a WordCount program using the streaming API. <br>
The mapper.py and reducer.py programs are downloaded as required from this notebook's GitHub repository (the earlier G-Drive links are kept as comments).<br>
<hr>


##Acknowledgements
Hadoop Installation from [Anjaly Sam's Github Repository](https://github.com/anjalysam/Hadoop) <br>


## 1.1 Download and Install Hadoop

In [1]:
# The default JVM available at /usr/lib/jvm/java-11-openjdk-amd64/  works for Hadoop
# But gives errors with Hive https://stackoverflow.com/questions/54037773/hive-exception-class-jdk-internal-loader-classloadersappclassloader-cannot
# Hence this JVM needs to be installed
!apt-get update > /dev/null
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
# If there is an error in this cell, it is very likely that the version of hadoop has changed
# Download the latest version of Hadoop and change the version numbers accordingly
!wget -q https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
# Unpack it
# tar flags: -x extracts, -z decompresses gzip, -f names the archive file (add -v for verbose output)
!tar -xzf hadoop-3.3.0.tar.gz
# move the hadoop directory to /usr/local
!mv  hadoop-3.3.0/ /usr/local/
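
The version number appears in the download URL, the archive name, and the install path, so one way to make a version change less error-prone is to derive all of them from a single value. The sketch below is an illustration only and is not part of the original notebook: the names `HADOOP_VERSION` and `hadoop_paths` are hypothetical, and the archive.apache.org address is an assumption about where an older release can be found once it leaves downloads.apache.org.

```python
HADOOP_VERSION = "3.3.0"  # hypothetical variable; change this one value to switch versions

def hadoop_paths(version):
    """Derive the download URLs, archive name, and install path from a version string."""
    name = f"hadoop-{version}"
    return {
        "url": f"https://downloads.apache.org/hadoop/common/{name}/{name}.tar.gz",
        # assumed fallback location for releases no longer on the main mirror
        "archive_url": f"https://archive.apache.org/dist/hadoop/common/{name}/{name}.tar.gz",
        "tarball": f"{name}.tar.gz",
        "home": f"/usr/local/{name}",
    }

paths = hadoop_paths(HADOOP_VERSION)
print(paths["url"])
print(paths["home"])
```

In a Colab cell, the values could then be passed to the shell by first assigning them to plain variables, for example `url = paths["url"]` followed by `!wget -q $url`.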

## 1.2 Set Environment Variables


In [5]:
#To find the default Java path
#!readlink -f /usr/bin/java | sed "s:bin/java::"
#!ls /usr/lib/jvm/

/usr/lib/jvm/java-11-openjdk-amd64/


In [4]:
#To set java path, go to /usr/local/hadoop-3.3.0/etc/hadoop/hadoop-env.sh then
#. . . export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/ . . .
#we have used a simpler alternative route using os.environ - it works

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"   # use the Java 8 JVM installed above instead of the default Java 11
#os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
# make sure that the version number matches the downloaded Hadoop release
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.3.0/"

In [6]:
!echo $PATH

/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin


In [7]:
# Add Hadoop BIN to PATH
# Get the current_path from output of previous command
current_path = '/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin'
new_path = current_path+':/usr/local/hadoop-3.3.0/bin/'
os.environ["PATH"] = new_path
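
Pasting the printed PATH back into the notebook works, but it goes stale if the Colab image changes its default PATH. A minimal sketch of an alternative reads the live value from `os.environ` instead. The helper name `append_to_path` is hypothetical, and the directory is assumed to be the same Hadoop `bin` directory used in the cell above.

```python
import os

def append_to_path(directory, env=os.environ):
    """Append directory to PATH unless it is already listed."""
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.append(directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]

# Add the Hadoop bin directory to the PATH of the running notebook process
append_to_path("/usr/local/hadoop-3.3.0/bin")
```

Calling the helper twice leaves PATH unchanged the second time, so the cell can be re-run safely.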

## 1.3 Test Hadoop Installation

In [None]:
#Running Hadoop - Test RUN, not doing anything at all
#!/usr/local/hadoop-3.3.0/bin/hadoop
# UNCOMMENT the following line if you want to make sure that Hadoop is alive!
#!hadoop

In [None]:
# Test Hadoop with the bundled pi estimation example
# The final line of output should read: Estimated value of Pi is 3.14157500000000000000
# Comment out the following line to skip this test
!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 16 100000

Number of Maps  = 16
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
2021-08-25 08:49:54,151 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-25 08:49:54,244 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-25 08:49:54,244 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-25 08:49:54,383 INFO input.FileInputFormat: Total input files to process : 16
2021-08-25 08:49:54,394 INFO mapreduce.JobSubmitter: number of splits:16
2021-08-25 08:49:54,587 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1134888830_0001
2021-08-25 08:

## 1.4 Run WordCount with Hadoop
Instead of writing the Map and Reduce methods in Java, we use Hadoop's streaming API with two simple Python programs, mapper.py and reducer.py.
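
Hadoop Streaming feeds each input line to the mapper on standard input, sorts the mapper's tab-separated `key<TAB>value` lines by key, and pipes them to the reducer. The actual `mapper.py` and `reducer.py` are not reproduced in this notebook, so the following is only a sketch of what a typical word-count pair does, with the shuffle step simulated by `sorted()`. The function names and the tokenising regex are assumptions, not the author's code.

```python
import re
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word, lower-cased."""
    for line in lines:
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts of consecutive records that share a key."""
    pairs = (rec.split("\t", 1) for rec in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))

print(run_local(["the cat saw the dog", "The dog ran"]))
# prints ['cat\t1', 'dog\t2', 'ran\t1', 'saw\t1', 'the\t3']
```

In a real streaming script, each function would read from `sys.stdin` and print its records to standard output; the reducer relies on Hadoop having sorted the records so that equal keys arrive together.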

In [8]:
# get mapper.py and reducer.py from G-Drive (older route, kept for reference)
#!gdown https://drive.google.com/uc?id=1VTzQ18cWAj6L29ncW6sABy-ITmDCcv5r
#!gdown https://drive.google.com/uc?id=1Or8Cbf9AsFMHStjMzDw3pXCd6TZ0dqxJ

# get mapper.py and reducer.py from this git repository
!wget -q https://raw.githubusercontent.com/Praxis-QR/BDSN/main/mapper.py
!wget -q https://raw.githubusercontent.com/Praxis-QR/BDSN/main/reducer.py

In [9]:
# to see the code, uncomment the following lines
#!cat mapper.py
print("\n----------------------    see above for mapper, see below for reducer")
#!cat reducer.py


----------------------    see above for mapper, see below for reducer


In [10]:
# make the Python scripts executable
!chmod u+rwx /content/mapper.py
!chmod u+rwx /content/reducer.py

In [11]:
# get a simple txt file as data for word count
# or you can upload your own
#!gdown https://drive.google.com/uc?id=1R5W0UVH2S3JjPxerqyX4ue5y6tMt0Wkk
!wget -q https://raw.githubusercontent.com/Praxis-QR/BDSN/main/Chronotantra.txt

In [None]:
# locate the streaming jar file
!find / -name 'hadoop-streaming*.jar'

/usr/local/hadoop-3.3.0/share/hadoop/tools/sources/hadoop-streaming-3.3.0-test-sources.jar
/usr/local/hadoop-3.3.0/share/hadoop/tools/sources/hadoop-streaming-3.3.0-sources.jar
/usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar


In [12]:
# remove output directories left by an earlier run; "No such file or directory" on a first run is harmless
!rm -r wc_out
!rm -r wc2_out

rm: cannot remove 'wc_out': No such file or directory
rm: cannot remove 'wc2_out': No such file or directory


In [13]:
# execute the streaming jar with the proper parameters
# the parameters are the input file, the output directory, the scripts to ship (-file), the mapper command, and the reducer command
#
#!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /content/hobbit.txt -output /content/wc_out -file /content/mapper.py  -file /content/reducer.py  -mapper 'python mapper.py'  -reducer 'python reducer.py'
!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /content/Chronotantra.txt -output /content/wc_out -file /content/mapper.py  -file /content/reducer.py  -mapper 'python mapper.py'  -reducer 'python reducer.py'

2021-08-25 09:27:58,772 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/content/mapper.py, /content/reducer.py] [] /tmp/streamjob8630648500709022402.jar tmpDir=null
2021-08-25 09:28:00,099 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-25 09:28:00,234 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-25 09:28:00,234 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-25 09:28:00,255 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-08-25 09:28:00,437 INFO mapred.FileInputFormat: Total input files to process : 1
2021-08-25 09:28:00,466 INFO mapreduce.JobSubmitter: number of splits:1
2021-08-25 09:28:00,742 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local122411687_0001
2021-08-25 09:28:00,742 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-08-25 09:28:01,186 INFO mapred.Loca
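
The first warning in the log notes that `-file` is deprecated in favour of the generic `-files` option. A hedged sketch of the equivalent invocation follows; it only builds the argument list so that it can be inspected before anything is run. The helper `streaming_command` is a hypothetical name, and placing `-files` first reflects the Hadoop convention that generic options precede the streaming-specific ones.

```python
import os
import shlex

STREAMING_JAR = "/usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar"

def streaming_command(input_path, output_dir, mapper, reducer, jar=STREAMING_JAR):
    """Build a hadoop-streaming argument list that ships the scripts with -files."""
    return [
        "hadoop", "jar", jar,
        "-files", f"{mapper},{reducer}",  # generic option, placed before the others
        "-input", input_path,
        "-output", output_dir,
        "-mapper", f"python {os.path.basename(mapper)}",
        "-reducer", f"python {os.path.basename(reducer)}",
    ]

cmd = streaming_command(
    "/content/Chronotantra.txt", "/content/wc_out",
    "/content/mapper.py", "/content/reducer.py",
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste after "!"
```

The list can also be passed directly to `subprocess.run(cmd, check=True)` once Hadoop is on the PATH.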

In [14]:
# check output directory
!ls wc_out

part-00000  _SUCCESS
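
The empty `_SUCCESS` file is a marker that Hadoop writes when a job completes without error, and the `part-*` files hold the reducer output. A small sketch for checking both from Python, using the hypothetical helper name `job_output`:

```python
from pathlib import Path

def job_output(output_dir):
    """Return (succeeded, part_files) for a job output directory."""
    out = Path(output_dir)
    succeeded = (out / "_SUCCESS").is_file()
    parts = sorted(p.name for p in out.glob("part-*"))
    return succeeded, parts
```

For the run above, `job_output("wc_out")` would be expected to report success together with the single file `part-00000`.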


In [17]:
# see actual output
!tail wc_out/part-00000

yuri	1
zahra	1
zenith	1
zeros	1
zhejiang	1
zipped	1
zipping	1
zled	5
zone	4
zula	1


### Sorting the output

In [24]:
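# note: -k 2 without -n compares the counts as text, so 9 sorts after 87; use -k2,2n for a numeric sort
!sort -k 2 -t$'\t' wc_out/part-00000 > sorted.txt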

In [29]:
!tail -100 sorted.txt

people	86
two	87
20	9
among	9
apartment	9
asked	9
beautiful	9
begin	9
behaviour	9
business	9
c	9
carried	9
cases	9
child	9
commander	9
continued	9
conversation	9
covered	9
cut	9
dalma	9
damodar	9
darkness	9
dead	9
decided	9
decision	9
decode	9
disappeared	9
dropped	9
dust	9
event	9
evolved	9
exciting	9
experienced	9
eye	9
feeling	9
five	9
flat	9
forgotten	9
forward	9
four	9
frankly	9
free	9
gave	9
girl	9
global	9
habitats	9
heavy	9
hebrus	9
images	9
immediately	9
impossible	9
interested	9
jin	9
lead	9
map	9
matters	9
met	9
months	9
movement	9
music	9
natural	9
needs	9
niloofer	9
option	9
piece	9
play	9
possibly	9
project	9
quality	9
question	9
quickly	9
reach	9
safety	9
seat	9
seem	9
services	9
showed	9
single	9
somewhere	9
sound	9
sun	9
takes	9
tall	9
truck	9
turned	9
universal	9
unknown	9
vast	9
wall	9
walls	9
went	9
whatever	9
worked	9
write	9
young	9
hermit	93
well	93
see	95
something	95
world	97


In [26]:
!head sorted.txt


1000	1
105	1
109	1
11	1
113	1
115	1
12700	1
12th	1
132	1
133	1
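
Because `sort -k 2` compares the counts as text, `9` lands after `87` and before `93` in the listing above. A minimal sketch of a numeric ranking in Python follows, assuming the `word<TAB>count` layout shown in `part-00000`; the function name `top_words` is hypothetical.

```python
def top_words(lines, n=10):
    """Parse 'word<TAB>count' lines and return the n highest counts."""
    counts = []
    for line in lines:
        word, _, count = line.rstrip("\n").rpartition("\t")
        if word and count.isdigit():
            counts.append((word, int(count)))
    # Sort by count descending, then alphabetically to break ties
    return sorted(counts, key=lambda wc: (-wc[1], wc[0]))[:n]

sample = ["two\t87", "among\t9", "hermit\t93", "world\t97"]
print(top_words(sample, n=3))
# prints [('world', 97), ('hermit', 93), ('two', 87)]
```

To rank the real output, the file object can be passed in directly, for example `with open("wc_out/part-00000") as f: print(top_words(f))`.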


#Chronobooks <br>
![alt text](https://1.bp.blogspot.com/-lTiYBkU2qbU/X1er__fvnkI/AAAAAAAAjtE/GhDR3OEGJr4NG43fZPodrQD5kbxtnKebgCLcBGAsYHQ/s600/Footer2020-600x200.png)<hr>
Chronotantra and Chronoyantra are two science fiction novels that explore the collapse of human civilisation on Earth and then its rebirth and reincarnation both on Earth as well as on the distant worlds of Mars, Titan and Enceladus. But is it the human civilisation that is being reborn? Or is it some other sentience that is revealing itself?
If you have an interest in AI and found this material useful, you may consider buying these novels, in paperback or Kindle, from [http://bit.ly/chronobooks](http://bit.ly/chronobooks)