# Hadoop Streaming assignment 1: Words Rating

The purpose of this task is to write your own WordCount program for processing a Wikipedia dump and to learn the basic concepts of MapReduce.

In this task you have to find the 7th most popular word and its count in the Wikipedia data (`/data/wiki/en_articles_part`), with words ranked in descending order of frequency (most popular first).

There are several requirements for this task:

1) The output must be the 7th word and its count, separated by a tab character.

2) You must use a second job to obtain a totally ordered result.

3) Do not forget to redirect all trash (logs and other intermediate output) to `/dev/null`.
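
As a point of reference, the whole task can be sketched in memory for a small input. The snippet below is only an illustration, not the MapReduce solution: it assumes Python 3 on a local machine (the cluster scripts in this notebook are Python 2), the function name `nth_most_common` and the sample lines are hypothetical, and ties are broken alphabetically, which the assignment does not specify.

```python
import re
from collections import Counter


def nth_most_common(lines, n):
    """Return (word, count) for the n-th most frequent word (1-based)."""
    counts = Counter()
    for line in lines:
        # Each line is assumed to look like "<article_id>\t<text>".
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue
        words = re.split(r"\W*\s+\W*", parts[1])
        counts.update(w.lower() for w in words if w)
    # Highest count first; alphabetical order for equal counts (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[n - 1]


# Hypothetical two-article sample.
sample = ["1\ta a a b b c", "2\tb a d"]
word, count = nth_most_common(sample, 2)
print("%s\t%d" % (word, count))
```

On the real dump this approach would not scale, which is why the counting and the sorting are split into two MapReduce jobs.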

Below is a draft of the main steps. You may use other methods to obtain the solution.

## Step 1. Create mapper and reducer.

<b>Hint:</b> The demo task contains almost all the pieces needed to complete this assignment. You may use the demo to implement the first MapReduce job.

In [1]:
%%writefile mapper1.py

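import sys
import re

# Python 2 only: make UTF-8 the default encoding so unicode() can decode the input.
reload(sys)
sys.setdefaultencoding('utf-8')

for line in sys.stdin:
    try:
        # Each input line is "<article_id>\t<text>"; skip lines without a tab.
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError:
        continue
    # Split on whitespace, dropping punctuation adjacent to it.
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        # Increment a Hadoop counter via stderr, then emit "<word>\t1".
        print >> sys.stderr, "reporter:counter:word_status,total_word,%d" % 1
        print "%s\t%d" % (word.lower(), 1)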

Overwriting mapper1.py


In [2]:
%%writefile reducer1.py

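import sys

current_key = None
word_sum = 0

# Input arrives sorted by key, so all counts for one word are adjacent.
for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip malformed lines (missing tab or non-numeric count).
        continue
    if current_key != key:
        # The key changed: emit the total for the previous word.
        if current_key:
            print "%s\t%d" % (current_key, word_sum)
        word_sum = 0
        current_key = key
    word_sum += count

# Emit the total for the last word.
if current_key:
    print "%s\t%d" % (current_key, word_sum)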

Overwriting reducer1.py
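
The reducer above only works because Hadoop sorts the map output by key before the reduce phase. The following local sketch illustrates that dependency with `itertools.groupby`; it assumes Python 3, and `reduce_sorted` and the sample pairs are hypothetical names used for illustration only.

```python
from itertools import groupby


def reduce_sorted(pairs):
    """Sum counts per key; correct only if `pairs` is sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


# Hypothetical mapper output.
mapped = [("the", 1), ("of", 1), ("the", 1), ("and", 1), ("the", 1)]

# sorted() stands in for the sort Hadoop performs between map and reduce.
shuffled = sorted(mapped)
result = list(reduce_sorted(shuffled))

# Without the sort, equal keys are not adjacent and are never merged.
unsorted_result = list(reduce_sorted(mapped))

print(result)
print(unsorted_result)
```

Because summation is associative, the same logic can run on partial data as well, which is why `reducer1.py` can also be passed as the combiner in the first job.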


## Step 2. Create sort job.

<b>Hint:</b> You may use a MapReduce comparator to solve this step. Make sure that the keys are sorted in ascending order.
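
A comparator matters here because streaming keys are compared as text by default, which does not match numeric order. The sketch below is a rough local analogue, assuming Python 3 and made-up sample counts; it is not how Hadoop implements the comparison, only an illustration of the difference between text ordering and the numeric reverse ordering requested with `-k1,1nr`.

```python
# Hypothetical counts, as the strings a streaming job would see.
counts = ["9", "822164", "126420", "10"]

# Text comparison: "10" sorts before "9" because "1" < "9".
as_text = sorted(counts)

# Rough analogue of `-k1,1nr`: compare as numbers, largest first.
numeric_reverse = sorted(counts, key=int, reverse=True)

print(as_text)
print(numeric_reverse)
```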

In [3]:
%%writefile mapper2.py

import sys

reload(sys)
sys.setdefaultencoding('utf-8') # Python 2 only: use UTF-8 as the default encoding

for line in sys.stdin:
    try:
        # Input is the output of the first job: "<word>\t<count>".
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        continue
    # Swap the fields so that the count becomes the sort key.
    print "%d\t%s" % (count, key)

Overwriting mapper2.py


In [4]:
%%writefile reducer2.py

import sys

reload(sys)
sys.setdefaultencoding('utf-8') # Python 2 only: use UTF-8 as the default encoding

for line in sys.stdin:
    try:
        # Input is sorted by count: "<count>\t<word>".
        count, key = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        continue
    # Swap the fields back to "<word>\t<count>".
    print "%s\t%d" % (key, count)

Overwriting reducer2.py


## Step 3. Bash commands

<b>Hint:</b> To print a specific row you may use basic UNIX commands such as sed, head, or tail (or any other commands you know).
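
For comparison, the same row selection can be sketched in Python with `itertools.islice`. This assumes Python 3; `nth_line` and the sample rows are hypothetical and serve only to show the 1-based indexing.

```python
from itertools import islice


def nth_line(lines, n):
    """Return the n-th item (1-based) of an iterable, or None if it is too short."""
    return next(islice(lines, n - 1, n), None)


# Hypothetical sorted output rows.
rows = ["the\t5", "of\t4", "and\t3"]
print(nth_line(rows, 2))
```

One difference worth noting: on an input shorter than seven lines, `head -7 | tail -1` prints the last available line, whereas this helper returns `None`.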

To run both jobs, you must use two consecutive `yarn` commands. Remember that the input for the second job is the output of the first job.

**CODE COMMENT**

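```shell
# Annotated copy of the second job. OUT_DIR_1 is the output directory of the first job.
#
# stream.map.output.field.separator          separator between the fields of a map output line
# stream.num.map.output.key.fields           number of leading fields that form the key
# mapreduce.map.output.key.field.separator   separator used to split the key into sub-fields
# mapreduce.partition.keycomparator.options  -k1,1nr: sort on the first key field, numerically, reversed
#
# The first three options are omitted below because their defaults (tab, 1, tab) are
# already what this job needs. Note that bash turns an unquoted `\t` into the letter `t`,
# so `-D stream.map.output.field.separator=\t` would split each line at its first "t".
OUT_DIR_2="assignment_2_"$(date +"%s%6N")
NUM_REDUCERS_2=1

hdfs dfs -rm -r -skipTrash ${OUT_DIR_2} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming wordCount_2" \
    -D mapreduce.job.reduces=${NUM_REDUCERS_2} \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options=-k1,1nr \
    -files mapper2.py,reducer2.py \
    -mapper "python mapper2.py" \
    -reducer "python reducer2.py" \
    -input ${OUT_DIR_1} \
    -output ${OUT_DIR_2} > /dev/null
```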

In [7]:
%%bash


OUT_DIR_1="assignment_1_"$(date +"%s%6N")
NUM_REDUCERS_1=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR_1} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming wordCount_1" \
    -D mapreduce.job.reduces=${NUM_REDUCERS_1} \
    -files mapper1.py,reducer1.py \
    -mapper "python mapper1.py" \
    -combiner "python reducer1.py" \
    -reducer "python reducer1.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR_1} > /dev/null




OUT_DIR_2="assignment_2_"$(date +"%s%6N")
NUM_REDUCERS_2=1

hdfs dfs -rm -r -skipTrash ${OUT_DIR_2} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming wordCount_2" \
    -D mapreduce.job.reduces=${NUM_REDUCERS_2} \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options=-k1,1nr \
    -files mapper2.py,reducer2.py \
    -mapper "python mapper2.py" \
    -reducer "python reducer2.py" \
    -input ${OUT_DIR_1} \
    -output ${OUT_DIR_2} > /dev/null



hdfs dfs -cat ${OUT_DIR_2}/part-00000 | head -7 | tail -1

0%however	1
0&\mathrm{if	1
0(8)320-1234	1
0)).(1	2
0,03	1
0,1,...,n	1
0,1,0	1
	he	822164
of	447464
and	342715
in	292354
	o	241467
a	236225
is	126420


rm: `assignment_1_1533201152511695': No such file or directory
18/08/02 09:12:36 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/08/02 09:12:36 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/08/02 09:12:36 INFO mapred.FileInputFormat: Total input files to process : 1
18/08/02 09:12:37 INFO mapreduce.JobSubmitter: number of splits:2
18/08/02 09:12:37 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533143750107_0020
18/08/02 09:12:37 INFO impl.YarnClientImpl: Submitted application application_1533143750107_0020
18/08/02 09:12:37 INFO mapreduce.Job: The url to track the job: http://b0dbae425182:8088/proxy/application_1533143750107_0020/
18/08/02 09:12:37 INFO mapreduce.Job: Running job: job_1533143750107_0020
18/08/02 09:12:44 INFO mapreduce.Job: Job job_1533143750107_0020 running in uber mode : false
18/08/02 09:12:44 INFO mapreduce.Job:  map 0% reduce 0%
18/08/02 09:13:00 INFO mapreduce.Job:  map 33% reduce 0%
18/08/02 09:13:06