# Hadoop Streaming assignment 1: Words Rating

The purpose of this task is to write your own WordCount program for processing a Wikipedia dump and to learn the basic concepts of MapReduce.

In this task you have to find the 7th most popular word and its count in the Wikipedia data (`/data/wiki/en_articles_part`), with words ranked in descending order of count (most popular first).

There are several requirements for this task:

1) The output must be the 7th word and its count, separated by a tab character.

2) You must use a second job to obtain a totally ordered result.

3) Do not forget to redirect all extraneous output to /dev/null.
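
To make the target concrete, the whole task can be mimicked locally on a tiny in-memory sample. The sketch below is only an illustration in Python 3, not the Hadoop solution: the function name `nth_most_common` and the sample lines are made up here, and the article id column of the real input is left out. Counting stands in for the first job, and the sort stands in for the second one.

```python
# Local, in-memory sketch of the two-job pipeline (Python 3).
# The sample lines are invented for illustration; they are not Wikipedia data.
import re
from collections import Counter


def nth_most_common(lines, n):
    """Return the (word, count) pair ranked n-th (1-based) by descending count."""
    counts = Counter()
    for line in lines:
        # "Job 1": split the text into words and count them.
        for word in re.split(r"\W*\s+\W*", line.strip(), flags=re.UNICODE):
            if word:
                counts[word.lower()] += 1
    # "Job 2": total order by count, most popular first (ties broken by word).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[n - 1]


sample = ["a b c a b a", "c d a b"]
word, count = nth_most_common(sample, 2)
print("%s\t%d" % (word, count))
```

On the real data the same two stages run as separate MapReduce jobs, because the counts do not fit the "one process, one dictionary" assumption made here.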

Below is a draft of the main steps of the task. You may use other methods to obtain the solution.

## Step 1. Create mapper and reducer.

<b>Hint:</b> The demo task contains almost all the pieces needed to complete this assignment. You may use the demo to implement the first MapReduce job.

In [1]:
%%writefile mapper1.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8') # required to convert to unicode

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError:
        continue
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)  # raw string keeps the regex escapes literal
    for word in words:
        print >> sys.stderr, "reporter:counter:Wiki stats,Total words,%d" % 1
        print "%s\t%d" % (word.lower(), 1)


Writing mapper1.py
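
The splitting rule in the mapper is easier to judge on a concrete line. The following Python 3 sketch reuses the mapper's regular expression in a helper named `tokenize` (a name chosen here for illustration). It shows that punctuation is stripped only when it touches whitespace, which is why tokens such as `0).one` can survive in the first job's output.

```python
import re


def tokenize(text):
    """Split text the way mapper1.py does and lowercase every token."""
    return [w.lower() for w in re.split(r"\W*\s+\W*", text, flags=re.UNICODE)]


# Punctuation next to whitespace is removed together with the whitespace.
print(tokenize("Anarchism is a (political) philosophy."))
# Punctuation inside a token, or at the very end of the line, is kept.
print(tokenize("f(0).One more"))
```

A stricter tokenizer is a design choice rather than a requirement of the assignment; changing it would change the counts.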


In [2]:
%%writefile reducer1.py

import sys

current_key = None
word_sum = 0
for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError as e:
        continue
    if current_key != key:
        if current_key:
            print "%s\t%d" % (current_key, word_sum)
        word_sum = 0
        current_key = key
    word_sum += count

if current_key:
    print "%s\t%d" % (current_key, word_sum)

Writing reducer1.py
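
The reducer relies on Hadoop delivering its input sorted by key, so all lines for one word arrive consecutively. The same idea can be sketched in Python 3 with `itertools.groupby`. This is an alternative formulation for illustration, not the notebook's reducer: the name `reduce_counts` is made up, and unlike `reducer1.py` it does not skip malformed lines.

```python
from itertools import groupby


def reduce_counts(lines):
    """Sum the counts per key for tab-separated input already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(int(count) for _, count in group)


# Counts other than 1 appear when a combiner has pre-aggregated the mapper output.
sorted_input = ["a\t1\n", "a\t1\n", "b\t1\n", "the\t1\n", "the\t3\n"]
for key, total in reduce_counts(sorted_input):
    print("%s\t%d" % (key, total))
```

Because the sum accepts arbitrary counts and emits the same `key<TAB>count` format it reads, this kind of reducer can also serve as a combiner.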


In [3]:
# You can use this cell for other experiments: for example, for comb
! python mapper1.py

  File "mapper1.py", line 16
    print "%s\t%d" % (word.lower(), 1)
                 ^
SyntaxError: invalid syntax


## Step 2. Create sort job.

<b>Hint:</b> You may use a MapReduce comparator to solve this step. Make sure that the keys are sorted in ascending order.

In [11]:
%%writefile sortjob.py
# Your code for sort job here. Don't forget to use magic writefile

import sys

for line in sys.stdin:
    print line,


Overwriting sortjob.py


In [37]:
%%writefile test-num
1	8
2	7
3	8
4	6
5	5
1	4
2	4
3	2
4	6
5	8
6	1
7	9
8	0
9	5
5	3

Writing test-num


In [43]:
! hdfs dfs -ls test

Found 1 items
-rw-r--r--   1 jovyan supergroup         59 2019-01-04 17:42 test/test-num


For the sorting rules, see the documentation: http://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html
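
The comparator option `-k2,2nr` used in this notebook asks for ordering by the second key field only, compared numerically (`n`) and reversed (`r`). As a rough local analogue, the Python 3 sketch below applies the same ordering to a few rows taken from `test-num`; the helper name `sort_by_second_field_desc` is made up for illustration. The order of rows with equal counts may differ from what Hadoop produces.

```python
def sort_by_second_field_desc(lines):
    """Order tab-separated lines by their second field, numerically, descending."""
    return sorted(lines, key=lambda line: int(line.split("\t")[1]), reverse=True)


rows = ["1\t8", "2\t7", "4\t6", "7\t9", "8\t0"]
for row in sort_by_second_field_desc(rows):
    print(row)
```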

In [55]:
%%bash

OUT_DIR="test_result"
NUM_REDUCERS=1

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.job.name="Sort Test numbers" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keypartitioner.options=-k2,2 \
    -D mapreduce.partition.keycomparator.options=-k2,2nr \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -input test/test-num \
    -output ${OUT_DIR} > /dev/null

hdfs dfs -cat ${OUT_DIR}/* 

7	9	
5	8	
3	8	
1	8	
2	7	
4	6	
4	6	
9	5	
5	5	
2	4	
1	4	
5	3	
3	2	
6	1	
8	0	


19/01/05 16:14:48 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/05 16:14:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/05 16:14:50 INFO mapred.FileInputFormat: Total input files to process : 1
19/01/05 16:14:50 INFO mapreduce.JobSubmitter: number of splits:2
19/01/05 16:14:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1546432843083_0037
19/01/05 16:14:50 INFO impl.YarnClientImpl: Submitted application application_1546432843083_0037
19/01/05 16:14:51 INFO mapreduce.Job: The url to track the job: http://1ba397aeaf92:8088/proxy/application_1546432843083_0037/
19/01/05 16:14:51 INFO mapreduce.Job: Running job: job_1546432843083_0037
19/01/05 16:14:58 INFO mapreduce.Job: Job job_1546432843083_0037 running in uber mode : false
19/01/05 16:14:58 INFO mapreduce.Job:  map 0% reduce 0%
19/01/05 16:15:04 INFO mapreduce.Job:  map 100% reduce 0%
19/01/05 16:15:10 INFO mapreduce.Job:  map 100% reduce 100%
19/01/05 16:15:10 I

In [56]:
%%bash

OUT_DIR="wordcount_result_"$(date +"%s%6N")
NUM_REDUCERS=1

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.job.name="Sort wordCount" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keypartitioner.options=-k2,2 \
    -D mapreduce.partition.keycomparator.options=-k2,2nr \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -input wordcount_result_1546433333935161 \
    -output ${OUT_DIR} > /dev/null

hdfs dfs -cat ${OUT_DIR}/* | head

the	822164	
of	447464	
and	342715	
in	292354	
to	241467	
a	236225	
is	126420	
as	103301	
for	91245	
was	90336	


rm: `wordcount_result_1546705121107718': No such file or directory
19/01/05 16:18:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/05 16:18:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/05 16:18:45 INFO mapred.FileInputFormat: Total input files to process : 2
19/01/05 16:18:46 INFO mapreduce.JobSubmitter: number of splits:2
19/01/05 16:18:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1546432843083_0038
19/01/05 16:18:46 INFO impl.YarnClientImpl: Submitted application application_1546432843083_0038
19/01/05 16:18:46 INFO mapreduce.Job: The url to track the job: http://1ba397aeaf92:8088/proxy/application_1546432843083_0038/
19/01/05 16:18:46 INFO mapreduce.Job: Running job: job_1546432843083_0038
19/01/05 16:18:53 INFO mapreduce.Job: Job job_1546432843083_0038 running in uber mode : false
19/01/05 16:18:53 INFO mapreduce.Job:  map 0% reduce 0%
19/01/05 16:19:00 INFO mapreduce.Job:  map 100% reduce 0%
19/01/05 16:

In [9]:
! hdfs dfs -ls wordcount_result_1546433333935161

Found 3 items
-rw-r--r--   1 jovyan supergroup          0 2019-01-02 12:50 wordcount_result_1546433333935161/_SUCCESS
-rw-r--r--   1 jovyan supergroup    2667984 2019-01-02 12:50 wordcount_result_1546433333935161/part-00000
-rw-r--r--   1 jovyan supergroup    2702529 2019-01-02 12:50 wordcount_result_1546433333935161/part-00001


In [29]:
! hdfs dfs -cat wordcount_result_1546433333935161/* | head

0".32	1
0%however	1
0&\mathrm{if	1
0(8)320-1234	1
0)(0,0,0	1
0)).(1	2
0).euclid's	1
0).one	1
0+\|b\|^2	1
0,0,snr	1
cat: Unable to write to output stream.
cat: Unable to write to output stream.


## Step 3. Bash commands

<b> Hint: </b> To print the exact row, you may use basic UNIX commands such as sed, head, or tail (if you know other commands, you can use them).

To run both jobs, you must use two consecutive yarn commands. Remember that the input for the second job is the output of the first job.
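
Picking one row from a sorted stream does not require reading the rest of it. The Python 3 sketch below shows this with `itertools.islice`, under the assumption that the lines are already in descending order of count. The helper name `nth_line` is made up, and the sample rows are copied from the sorted output shown earlier in this notebook, without the trailing tab.

```python
from itertools import islice


def nth_line(lines, n):
    """Return the n-th line (1-based) without reading past it, or None if absent."""
    return next(islice(lines, n - 1, n), None)


sorted_output = [
    "the\t822164", "of\t447464", "and\t342715", "in\t292354",
    "to\t241467", "a\t236225", "is\t126420", "as\t103301",
]
print(nth_line(sorted_output, 7))
```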

In [0]:
%%bash

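OUT_DIR="assignment1_"$(date +"%s%6N")
SORTED_DIR="${OUT_DIR}_sorted"
NUM_REDUCERS=2

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

# Code for your first job: word count
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming wordCount" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper1.py,reducer1.py \
    -mapper "python mapper1.py" \
    -combiner "python reducer1.py" \
    -reducer "python reducer1.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null

# Code for your second job: sort by count (field 2), numerically, in descending order.
# A single reducer puts the totally ordered result into part-00000.
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.job.name="Sort wordCount" \
    -D mapreduce.job.reduces=1 \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keypartitioner.options=-k2,2 \
    -D mapreduce.partition.keycomparator.options=-k2,2nr \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -input ${OUT_DIR} \
    -output ${SORTED_DIR} > /dev/null

# Code for obtaining the results: print only the 7th row of the sorted output
hdfs dfs -cat ${SORTED_DIR}/part-00000 | sed -n '7p;8q'
hdfs dfs -rm -r -skipTrash ${OUT_DIR}* > /dev/null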