<a href="https://colab.research.google.com/github/Praxis-QR/BDSN/blob/main/KK_B1_Hadoop_WordCount.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://4.bp.blogspot.com/-gbL5nZDkpFQ/XScFYwoTEII/AAAAAAAAAGY/CcVb_HDLwvs2Brv5T4vSsUcz7O4r2Q79ACK4BGAYYCw/s1600/kk3-header00-beta.png)<br>


<hr>

[Prithwis Mukerjee](http://www.linkedin.com/in/prithwis)<br>

#Hadoop
This notebook has all the code required to install Hadoop in the Colab VM and execute a WordCount program using the streaming API <br>
The mapper.py and reducer.py programs are available in the author's G-Drive and in the Praxis-QR/BDSN git repository, and are downloaded as required<br>
<hr>


##Acknowledgements
Hadoop Installation from [Anjaly Sam's Github Repository](https://github.com/anjalysam/Hadoop) <br>


## 1.1 Download, Install Hadoop

In [1]:
# The default JVM available at /usr/lib/jvm/java-11-openjdk-amd64/ works for Hadoop
# but gives errors with Hive: https://stackoverflow.com/questions/54037773/hive-exception-class-jdk-internal-loader-classloadersappclassloader-cannot
# Hence the Java 8 JVM is installed here
!apt-get update > /dev/null
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
# Download Hadoop 3.3.0; for a different release, change the version numbers accordingly
# (if this URL stops working, older releases are kept at https://archive.apache.org/dist/hadoop/common/)
!wget -q https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
# Unpack the archive
# tar flags used: -x to extract, -z to uncompress, -f to specify that we're extracting from a file
!tar -xzf hadoop-3.3.0.tar.gz
# move the hadoop directory to /usr/local
!mv  hadoop-3.3.0/ /usr/local/
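
The version number 3.3.0 appears in the download URL, the tarball name, the install directory and the jar files used later in this notebook. A minimal sketch, assuming other releases follow the same naming pattern, keeps the version in one place. The helper name `hadoop_locations` is hypothetical and not part of the original notebook.

```python
def hadoop_locations(version, base_url="https://downloads.apache.org/hadoop/common"):
    """Derive the names used in this notebook from a single version string.

    Assumption: every release follows the hadoop-<version> naming seen for 3.3.0.
    """
    name = f"hadoop-{version}"
    home = f"/usr/local/{name}"
    return {
        "url": f"{base_url}/{name}/{name}.tar.gz",
        "tarball": f"{name}.tar.gz",
        "home": home,
        "examples_jar": f"{home}/share/hadoop/mapreduce/hadoop-mapreduce-examples-{version}.jar",
        "streaming_jar": f"{home}/share/hadoop/tools/lib/hadoop-streaming-{version}.jar",
    }


locations = hadoop_locations("3.3.0")
print(locations["url"])
```

The returned strings can then be substituted into the shell commands, for example `!wget -q {locations["url"]}` in a Colab cell.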

## 1.2 Set Environment Variables


In [3]:
#To find the default Java path
!readlink -f /usr/bin/java | sed "s:bin/java::"
#!ls /usr/lib/jvm/

/usr/lib/jvm/java-11-openjdk-amd64/


In [4]:
#To set java path, go to /usr/local/hadoop-3.3.0/etc/hadoop/hadoop-env.sh then
#. . . export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/ . . .
#we have used a simpler alternative route using os.environ - it works

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"   # default is changed
#os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
# make sure that the version number is as downloaded 
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.3.0/"

In [5]:
!echo $PATH

/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin


In [6]:
# Add Hadoop BIN to PATH
# Read the current PATH from the environment instead of copying it from the output of the previous command
current_path = os.environ["PATH"]
new_path = current_path+':/usr/local/hadoop-3.3.0/bin/'
os.environ["PATH"] = new_path
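
If the cell above is run more than once, the Hadoop directory is appended to PATH each time. A small sketch of an append that skips a directory already on the PATH is shown below. The helper name `append_to_path` is hypothetical, and the `:` separator assumes a Linux VM such as Colab.

```python
import os


def append_to_path(path_value, new_dir, sep=":"):
    """Return path_value with new_dir appended, unless it is already listed."""
    entries = [entry for entry in path_value.split(sep) if entry]
    # compare without trailing slashes so '/x/bin' and '/x/bin/' count as the same directory
    known = {entry.rstrip("/") for entry in entries}
    if new_dir.rstrip("/") not in known:
        entries.append(new_dir)
    return sep.join(entries)


os.environ["PATH"] = append_to_path(os.environ.get("PATH", ""), "/usr/local/hadoop-3.3.0/bin")
```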

## 1.3 Test Hadoop Installation

In [None]:
#Running Hadoop - Test RUN, not doing anything at all
#!/usr/local/hadoop-3.3.0/bin/hadoop
# UNCOMMENT the following line if you want to make sure that Hadoop is alive!
#!hadoop

In [7]:
# Test Hadoop with the pi sample program from the bundled examples jar
# The arguments are the number of maps (16) and the number of samples per map (100000)
# Final output should be : Estimated value of Pi is 3.14157500000000000000
!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 16 100000

Number of Maps  = 16
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
2021-08-25 08:49:54,151 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-25 08:49:54,244 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-25 08:49:54,244 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-25 08:49:54,383 INFO input.FileInputFormat: Total input files to process : 16
2021-08-25 08:49:54,394 INFO mapreduce.JobSubmitter: number of splits:16
2021-08-25 08:49:54,587 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1134888830_0001
2021-08-25 08:
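
To give an idea of what the `pi` test job computes: the bundled example is understood to use a quasi-Monte Carlo method, scattering points over a unit square and counting how many fall inside the inscribed circle. The sketch below illustrates that idea in plain Python with a Halton sequence. It is not the Hadoop source code, runs in a single process, and the function names are hypothetical.

```python
import math


def halton(index, base):
    """Return element number `index` (1-based) of the Halton sequence in `base`."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def estimate_pi(num_samples):
    """Estimate pi from the share of sample points that land inside the circle."""
    inside = 0
    for i in range(1, num_samples + 1):
        # shift the point so that the circle of radius 0.5 is centred on the origin
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:
            inside += 1
    # area of circle / area of square = pi / 4
    return 4.0 * inside / num_samples


estimate = estimate_pi(50_000)
print(f"estimate = {estimate:.6f}, error = {abs(estimate - math.pi):.6f}")
```

In the Hadoop job the samples are split across the maps (16 maps of 100000 samples in the run above) and the reducer adds up the counts.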

## 1.4 Run WordCount with Hadoop
Instead of writing the Map and Reduce methods in Java, we use the streaming API of Hadoop with two simple Python programs, mapper.py and reducer.py, which read from standard input and write to standard output.
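
The flow of a streaming job can be imitated in a few lines of plain Python. The sketch below is an illustration only and is not the mapper.py or reducer.py used in this notebook: the mapper emits one `word<TAB>1` record per word, a sort stands in for the shuffle that Hadoop performs between the two phases, and the reducer adds up the counts of consecutive records with the same key. Stop-word and punctuation handling are left out.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper prints to stdout."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


text = ["the quick brown fox", "the lazy dog", "the fox"]
shuffled = sorted(mapper(text))   # stands in for Hadoop's sort between map and reduce
result = list(reducer(shuffled))
print("\n".join(result))
```

Because the reducer only ever compares neighbouring records, it relies on its input being sorted by key, which is what Hadoop guarantees for the real job.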

In [8]:
# get mapper.py reducer.py from G_drive
#!gdown https://drive.google.com/uc?id=1VTzQ18cWAj6L29ncW6sABy-ITmDCcv5r
#!gdown https://drive.google.com/uc?id=1Or8Cbf9AsFMHStjMzDw3pXCd6TZ0dqxJ

#get mapper.py reducer.py from this git repository
!wget -q https://raw.githubusercontent.com/Praxis-QR/BDSN/main/mapper.py
!wget -q https://raw.githubusercontent.com/Praxis-QR/BDSN/main/reducer.py

In [9]:
# show the code of the mapper and the reducer
!cat mapper.py
print("\n----------------------    see above for mapper, see below for reducer")
!cat reducer.py

# -*- coding: utf-8 -*-
"""mapper.ipynb

Automatically generated by Colaboratory.

Original file is located at    https://colab.research.google.com/drive/1yCwGyMXJT2qt3_58aLOOiJXO0GIaPcJd
"""



import sys
import io
import re
import nltk
nltk.download('stopwords',quiet=True)
from nltk.corpus import stopwords
punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''

stop_words = set(stopwords.words('english'))
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='latin1')
for line in input_stream:
  line = line.strip()
  line = re.sub(r'[^\w\s]', '',line)
  line = line.lower()
  for x in line:
    if x in punctuations:
      line=line.replace(x, " ") 

  words=line.split()
  for word in words: 
    if word not in stop_words:
      print('%s\t%s' % (word, 1))
----------------------    see above for mapper, see below for reducer
# -*- coding: utf-8 -*-
"""reducer.ipynb

Automatically generated by Colaboratory.

Original file is located at     https://colab.research.google.com/drive/1YzJ-vU

In [10]:
# make the Python scripts executable
!chmod u+rwx /content/mapper.py
!chmod u+rwx /content/reducer.py

In [19]:
# get a simple txt file as data for word count
# or you can upload your own
#!gdown https://drive.google.com/uc?id=1R5W0UVH2S3JjPxerqyX4ue5y6tMt0Wkk
!wget -q https://raw.githubusercontent.com/Praxis-QR/BDSN/main/Chronotantra.txt

In [12]:
# locate the streaming jar file
!find / -name 'hadoop-streaming*.jar'

/usr/local/hadoop-3.3.0/share/hadoop/tools/sources/hadoop-streaming-3.3.0-test-sources.jar
/usr/local/hadoop-3.3.0/share/hadoop/tools/sources/hadoop-streaming-3.3.0-sources.jar
/usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar


In [21]:
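# remove output directories left over from earlier runs; Hadoop will not write into an existing output directory
# -f avoids an error message when a directory does not exist yet
!rm -rf wc_out
!rm -rf wc2_out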

In [22]:
# execute the streaming jar with proper parameters
# four parameters are input file, output directory, the mapper program, the reducer program
#
#!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /content/hobbit.txt -output /content/wc_out -file /content/mapper.py  -file /content/reducer.py  -mapper 'python mapper.py'  -reducer 'python reducer.py'
!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /content/Chronotantra.txt -output /content/wc2_out -file /content/mapper.py  -file /content/reducer.py  -mapper 'python mapper.py'  -reducer 'python reducer.py'

2021-08-25 09:12:31,333 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/content/mapper.py, /content/reducer.py] [] /tmp/streamjob4534058275163442726.jar tmpDir=null
2021-08-25 09:12:31,941 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-25 09:12:32,042 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-25 09:12:32,042 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-25 09:12:32,059 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-08-25 09:12:32,220 INFO mapred.FileInputFormat: Total input files to process : 1
2021-08-25 09:12:32,244 INFO mapreduce.JobSubmitter: number of splits:1
2021-08-25 09:12:32,420 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1148488299_0001
2021-08-25 09:12:32,420 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-08-25 09:12:32,787 INFO mapred.Loc
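
The first line of the log warns that `-file` is deprecated in favour of the generic option `-files`. A hedged sketch of how the same command could be assembled with `-files` follows. It has not been run as part of this notebook, and the helper name `streaming_command` is hypothetical. Generic options have to appear before the streaming-specific options, and `-files` takes a comma-separated list.

```python
import os
import shlex


def streaming_command(jar, input_path, output_dir, mapper_path, reducer_path):
    """Build a hadoop-streaming command that ships the scripts with -files."""
    return [
        "hadoop", "jar", jar,
        # generic options such as -files go before -input, -output, -mapper, -reducer
        "-files", f"{mapper_path},{reducer_path}",
        "-input", input_path,
        "-output", output_dir,
        # the shipped files are referred to by their base names inside the job
        "-mapper", f"python {os.path.basename(mapper_path)}",
        "-reducer", f"python {os.path.basename(reducer_path)}",
    ]


command = streaming_command(
    "/usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar",
    "/content/Chronotantra.txt",
    "/content/wc2_out",
    "/content/mapper.py",
    "/content/reducer.py",
)
print(shlex.join(command))
```

The printed line can be pasted after a `!` in a Colab cell, or the list can be passed to `subprocess.run`.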

In [23]:
# check output directory
!ls wc2_out

part-00000  _SUCCESS
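
The empty `_SUCCESS` file is the marker Hadoop writes when a job completes, and the `part-*` files hold the reducer output. A small sketch of checking for both from Python is given below. The helper names are hypothetical, and the demonstration uses a temporary directory so that it runs without Hadoop.

```python
import os
import tempfile


def job_succeeded(output_dir):
    """True if the output directory contains the _SUCCESS marker file."""
    return os.path.isfile(os.path.join(output_dir, "_SUCCESS"))


def part_files(output_dir):
    """Sorted names of the part-* files that hold the job output."""
    return sorted(name for name in os.listdir(output_dir) if name.startswith("part-"))


# stand-in for wc2_out, so that the sketch is runnable anywhere
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("part-00000", "_SUCCESS"):
        open(os.path.join(demo_dir, name), "w").close()
    print(job_succeeded(demo_dir), part_files(demo_dir))
```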


In [24]:
# see actual output
!cat wc2_out/part-00000

1	8
10	2
100	2
1000	1
105	1
108	2
109	1
11	1
110	2
113	1
115	1
12700	1
12th	1
132	1
133	1
134	2
1493	1
15	1
150	1
155	1
156	1
157	2
17866	1
186	1
187	1
188	2
1956	1
1960s	2
1970s	1
1999	1
2	6
20	9
2000	1
20062007	1
2007	1
2018	1
2019	1
205	1
206	1
20th	2
21	2
213	2
214	1
2150	1
2152	1
2155	1
2157	1
2167	1
2170	1
2172	1
2175	1
2176	1
219	1
21st	1
22	2
221	2
23	1
25	2
28	1
2d	3
3	6
30	3
300	3
33	1
36wheeler	1
380	1
39	1
39th	2
3d	12
3dio	23
3dios	2
3dprint	1
3i	10
4	3
40	2
4000	1
43	1
45	2
46	2
48	1
49	1
5	4
50	1
500	2
5000	1
6	4
60	1
65	1
66	1
668	1
67	2
6c	4
7	4
70	1
715	1
730	1
8	2
800	1
80s	1
830	1
85	1
86	1
87	2
9	3
900	4
96	1
9789353518271	1
aakashbhora	1
aamar	1
aami	1
aback	4
abandon	1
abandoned	3
abdar	1
abdomen	1
abducted	1
abhi	4
abhinavagupta	5
abilities	3
ability	13
ablaze	1
able	31
abomination	1
absence	2
absolute	1
absolutely	4
abstract	1
abundant	1
academia	1
academic	3
academicians	1
academics	1
accelerate	2
acceleration	1
accept	4
acceptable	3
accepted	3
accepting	2
acc
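
The output is ordered by word, not by frequency, so the most common words are scattered through the file. A minimal sketch for listing the most frequent words from `word<TAB>count` lines is shown below. The helper name `top_words` is hypothetical, and the sample lines are taken from the output above.

```python
def top_words(lines, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    counts = []
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or incomplete lines
        word, count = line.rsplit("\t", 1)
        counts.append((word, int(count)))
    # highest count first; ties are broken alphabetically
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))[:n]


sample = ["ability\t13", "able\t31", "3dio\t23", "aback\t4"]
print(top_words(sample, n=3))
```

For the real output, the lines can be read with `open("wc2_out/part-00000", encoding="latin1")` and passed to the same function.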

#Chronobooks <br>
![alt text](https://1.bp.blogspot.com/-lTiYBkU2qbU/X1er__fvnkI/AAAAAAAAjtE/GhDR3OEGJr4NG43fZPodrQD5kbxtnKebgCLcBGAsYHQ/s600/Footer2020-600x200.png)<hr>
Chronotantra and Chronoyantra are two science fiction novels that explore the collapse of human civilisation on Earth and then its rebirth and reincarnation both on Earth as well as on the distant worlds of Mars, Titan and Enceladus. But is it the human civilisation that is being reborn? Or is it some other sentience that is revealing itself?
If you have an interest in AI and found this material useful, you may consider buying these novels, in paperback or kindle, from [http://bit.ly/chronobooks](http://bit.ly/chronobooks)