# CS4225 Assignment 0: Word Count

Assignment 0 provides a basic working example of a Hadoop program. We use Hadoop Streaming, a utility included with Hadoop that is easier to work with for our purposes because it allows the mapper and reducer to be any executable (so they can be written in languages other than Java). Assignment 0 is a simple word count program, which counts the number of occurrences of each word in a text document. You do not have to write any code for this assignment, but it is still important to work through it to familiarize yourself with the Hadoop Streaming framework.

In Hadoop Streaming, the mappers and reducers communicate via **lines of text**, rather than key-value pairs. We will see this in the following example.
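
As a minimal sketch of what "lines of text" means in practice: a key and a value can be written as a single line with a tab between them, and split apart again by whoever reads the line. By default, Hadoop Streaming treats the text up to the first tab as the key. The helper names `encode_pair` and `decode_line` are hypothetical and exist only for this illustration; they are not part of Hadoop Streaming.

```python
def encode_pair(key, value):
    # Write a key-value pair as one line of text: key, a tab, then the value.
    return f"{key}\t{value}"


def decode_line(line):
    # Split on the first tab only, so any further tabs stay inside the value.
    key, value = line.rstrip("\n").split("\t", 1)
    return key, value


line = encode_pair("hadoop", 1)
print(repr(line))
print(decode_line(line))
```

Note that the value comes back as a string, so a reader that needs a number has to convert it (for example with `int()`).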

# Installing Hadoop

First, we need to download and install Hadoop to the `/usr/local` directory, and then set the `JAVA_HOME` environment variable so that Hadoop knows where to find Java on our system. The installation should take about 45 seconds. Note that if you refresh the colab notebook, you need to run this cell again before you can use Hadoop.

In [None]:
!wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
!tar -xzf hadoop-3.3.1.tar.gz
# install hadoop to /usr/local
!cp -r hadoop-3.3.1/ /usr/local/
import os
# hadoop needs to know where java is located
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
# the following line reduces the amount of Hadoop's printed output, to make it easier to see your own debug output
os.environ["HADOOP_ROOT_LOGGER"] = "WARN,console"

--2023-09-09 17:27:26--  https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 605187279 (577M) [application/x-gzip]
Saving to: ‘hadoop-3.3.1.tar.gz’


2023-09-09 17:27:49 (195 MB/s) - ‘hadoop-3.3.1.tar.gz’ saved [605187279/605187279]



# Downloading the Data

In [None]:
!wget https://bhooi.github.io/teaching/cs4225/assign0/input.txt

--2023-09-09 17:28:12--  https://bhooi.github.io/teaching/cs4225/assign0/input.txt
Resolving bhooi.github.io (bhooi.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to bhooi.github.io (bhooi.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38507 (38K) [text/plain]
Saving to: ‘input.txt’


2023-09-09 17:28:12 (9.52 MB/s) - ‘input.txt’ saved [38507/38507]



# Hadoop Streaming

The `%%file` command in colab means that when we execute this block, colab will write the Python program in the block to a file
located at `/content/mapper.py` (note the slash in front). Generally
in colab, `/content` is the default directory where your files are stored.

It is not recommended to write your mapper and reducer directly into
the files `/content/mapper.py` and `/content/reducer.py`. This is unsafe because colab
does not in general save these files, and when you re-open the notebook
later they will most likely be gone. It is better to follow the approach below: write them in
notebook cells, and then use the `%%file` command.

**Mapper:** Like in regular Hadoop, the framework divides the input file into splits (e.g., 128MB) and passes them to different mappers. But unlike regular Hadoop where the mappers receive their input as key-value pairs, here the mapper just takes in lines of text (which are lines from the input file). Instead of an explicit `map()` function, here we just iterate over the lines of text and process them one by one. Both the mapper and reducer read their inputs from `sys.stdin`, and write their outputs to `sys.stdout`.

In [None]:
%%file mapper.py

import io
import sys

# Wrap stdin so that the input bytes are decoded as latin1
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='latin1')

for line in input_stream:

  # Here we split each line of the incoming input, and emit each word, with
  # its count of 1, separated by a tab. This corresponds to the map() function
  # in Hadoop.
  for word in line.split():
    print(f"{word}\t1")

Overwriting mapper.py


**Reducer:** Like in regular
Hadoop, the data emitted by the mappers is grouped by key and sorted, and
then passed to the reducer responsible for that key. But unlike regular Hadoop
where the `reduce()` function's input is in the form `<key, List[values]>`,
here the reducer receives its input line by line, in the same format emitted
by the mappers.

In [None]:
%%file reducer.py

import sys

# Initialization: here we create the data structures we need; this corresponds
# to the setup() function in Hadoop.
counts = {}

for line in sys.stdin:

  # Here we process each line of the incoming input; this corresponds to the
  # reduce() function in Hadoop.
  word, count = line.strip().split('\t')
  if word not in counts:
    counts[word] = 0
  counts[word] += int(count)

# Postprocess: after processing all lines, we do any necessary post-processing,
# corresponding to the cleanup() function in Hadoop.
for word, count in counts.items():
  print(f"{word}\t{count}")

Overwriting reducer.py
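
Because the reducer's input arrives sorted by key, all lines for the same word are adjacent. A reducer could therefore also sum the counts one group at a time instead of keeping every word in a dictionary. The following is a minimal sketch of that alternative, not the reducer used in this assignment. The function name `reduce_sorted_lines` is hypothetical, and it takes any iterable of lines so that it can be tried without Hadoop.

```python
from itertools import groupby


def reduce_sorted_lines(lines):
    # Parse each "word<TAB>count" line into a [word, count] pair.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby only merges adjacent items, so this relies on sorted input.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


sample = ["a\t1\n", "a\t1\n", "b\t1\n"]
for out in reduce_sorted_lines(sample):
    print(out)
```

In an actual reducer script, `sys.stdin` would be passed in place of `sample`. The dictionary version above does not depend on the input order, which makes it the more forgiving choice for a first example.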


In [None]:
# Set permissions to ensure that Hadoop can use the files
!chmod u+rwx /content/mapper.py
!chmod u+rwx /content/reducer.py
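
Before submitting a job, it can help to check the mapper and reducer logic on a few lines locally. The sketch below mimics the map, sort, and reduce steps in plain Python. The functions `map_lines`, `reduce_lines`, and `run_local` are hypothetical names used only here, and the whole thing is a simplification of what Hadoop does with a single reducer.

```python
def map_lines(lines):
    # Same logic as the mapper: emit "word<TAB>1" for every word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    # Same logic as the reducer: sum the counts for each word.
    counts = {}
    for line in lines:
        word, count = line.strip().split("\t")
        counts[word] = counts.get(word, 0) + int(count)
    return counts


def run_local(lines):
    # sorted() stands in for Hadoop's shuffle-and-sort step.
    return reduce_lines(sorted(map_lines(lines)))


print(run_local(["the cat", "the dog"]))
```

A similar check should be possible in a shell cell by piping the input through the real scripts, for example `cat /content/input.txt | python mapper.py | sort | python reducer.py`.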

**Run Hadoop Streaming:** the first line deletes the `/content/output` folder. Hadoop refuses to start a job if the output directory already exists, so this prevents errors when running the program multiple times.

In [None]:
!rm -rf /content/output
!/usr/local/hadoop-3.3.1/bin/hadoop jar /usr/local/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar \
-input /content/input.txt \
-output /content/output \
-file /content/mapper.py \
-file /content/reducer.py \
-mapper 'python mapper.py' \
-reducer 'python reducer.py'

2023-09-09 19:05:47,413 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/content/mapper.py, /content/reducer.py] [] /tmp/streamjob16835668167985813470.jar tmpDir=null
2023-09-09 19:05:48,692 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2023-09-09 19:05:50,535 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


The output is stored in the `/content/output/part-00000` file. (If we had multiple reducers, there would be multiple files, but for our assignments it is sufficient to use just 1 reducer, which is also Hadoop's default behavior.)

In [None]:
!cat /content/output/part-00000

"And	1
"Better	1
"Bless	1
"Don't	1
"For	1
"I	3
"I've	1
"John	1
"My	1
"Open	1
"Project	5
"The	2
"Then	1
"What	2
"Why	1
"Why,	1
"You	1
"Your	1
"and	2
"debased	1
"in	1
"nor	1
"our	1
"she	1
"the	1
"there	1
"work"	1
#1952]	1
("the	1
(and	1
(any	1
(available	1
(does	1
(or	1
(trademark/copyright)	1
***	6
*****	2
1.	1
1.A.	1
1.B.	1
1.C	1
1.C.	1
1.D.	1
1.E	1
1.E.	1
1.E.1	1
1.E.1.	1
1.E.2.	1
1.E.3.	1
1.E.7	1
1.E.8	1
1.E.8.	1
1.E.9.	1
1952.txt	1
1952.zip	1
1999	1
2008	1
2011	1
25,	1
31,	1
A	5
ALIVE!	1
ANYTHING	1
ASCII	1
AT	1
All	1
An	2
And	17
Anonymous	2
Archive	1
As	2
At	2
Author:	1
BECAUSE	2
BEFORE	1
Behind	1
Besides	1
Besides,	2
But	27
But,	2
By	4
CANNOT	1
COLOR	1
Can	1
Character	1
Charlotte	4
Copyright	1
Corrected:	1
Cousin	2
Creating	1
DELICIOUS	1
DID	1
DISTRIBUTE	1
DOES	2
DRAUGHT,	1
Date:	2
Dear	1
Did	1
EBOOK	2
END	1
Else,	1
End	1
English	2
Even	1
FULL	2
For	1
Foundation	3
Foundation"	1
Fourth	1
Full	1
GUTENBERG	3
GUTENBERG-tm	1
General	2
Gilman	4
God's	1
Gutenberg	5
Gutenberg"	4
Gutenberg"