# Hadoop Streaming assignment 1: Words Rating

Create a WordCount program and use it to process a Wikipedia dump. 
Use two MapReduce jobs to sort the words by count in descending order (most popular first). Output format:

    word <tab> count

The answer is the 7th most popular word together with its count.

The result on the sample dataset:
    
    is  126420
    
**Hint**: it is possible to use exactly one reducer in the second job to obtain a totally ordered result.
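
The two-job plan can be rehearsed in a single process before running anything on the cluster. The following is a minimal Python 3 sketch under stated assumptions: the function names `word_counts` and `rating` and the two-line `sample` input are made up for illustration, and ties are broken alphabetically, which Hadoop does not guarantee.

```python
import re
from collections import Counter

def word_counts(lines):
    """Stage 1 (stands in for job 1): count lower-cased words over all articles."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # skip lines that are not "article_id <tab> text"
        for word in re.split(r"\W*\s+\W*", parts[1]):
            if word:
                counts[word.lower()] += 1
    return counts

def rating(counts):
    """Stage 2 (stands in for job 2): one total order, most frequent first."""
    # Alphabetical tie-break is an assumption made here for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical input in "article_id <tab> text" form.
sample = [
    "1\tthe cat and the dog",
    "2\tThe dog is here",
]
ranked = rating(word_counts(sample))
for word, count in ranked[:3]:
    print("%s\t%d" % (word, count))
```

Sorting happens in one place here, which mirrors the role of the single reducer in the second job: every record passes through the same sorted stream, so the output is totally ordered.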
    
Docker container: https://hub.docker.com/r/bigdatateam/yarn-notebook/
    

# First MapReduce job: word count

## Mapper

In [None]:
%%writefile mapper.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8') # required to convert to unicode

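for line in sys.stdin:
    try:
        # Each input line is "article_id <tab> article text".
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue  # skip lines without a tab separator
    # Split on whitespace together with any punctuation adjacent to it.
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        # Update a Hadoop counter through the stderr reporter protocol.
        print >> sys.stderr, "reporter:counter:Wiki stats,Total words,%d" % 1
        print "%s\t%d" % (word.lower(), 1)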
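
To see what the mapper's split pattern does to a line of text, it can be tried in isolation. This is a small Python 3 sketch (the mapper itself runs under Python 2); the helper name `tokenize` and the sample sentence are illustrative only, and unlike the mapper the helper drops empty pieces.

```python
import re

# Same pattern as the mapper: whitespace plus any punctuation touching it.
SPLIT_RE = re.compile(r"\W*\s+\W*", flags=re.UNICODE)

def tokenize(text):
    """Split text into lower-cased words, dropping empty pieces."""
    return [w.lower() for w in SPLIT_RE.split(text) if w]

tokens = tokenize('Anarchism is a "political philosophy", he said.')
print(tokens)
```

Because the pattern requires at least one whitespace character, punctuation at the very start or end of the text is not stripped, so the last token in this sample keeps its trailing full stop.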

## Reducer

In [1]:
%%writefile reducer.py
##%%writefile -a reducer.py
import sys

current_key = None
word_sum = 0

for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError as e:
        continue
    if current_key != key:
        if current_key:
            print(current_key, word_sum, sep="\t")
        word_sum = count
        current_key = key
    else:
        word_sum += count

if current_key:
    print(current_key, word_sum, sep="\t")

Writing reducer.py
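
The reducer relies on Hadoop delivering its input sorted by key, so all lines for one word arrive consecutively. The same idea can be sketched with `itertools.groupby`; the function name `reduce_counts` and the sample lines are assumptions for illustration, not part of the assignment code.

```python
from itertools import groupby

def parse(lines):
    """Yield (key, count) pairs, skipping malformed lines."""
    for line in lines:
        try:
            key, count = line.strip().split("\t", 1)
            yield key, int(count)
        except ValueError:
            continue

def reduce_counts(lines):
    """Sum the counts of consecutive lines that share a key."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        if key:
            yield key, sum(count for _, count in group)

# Sorted by key, as Hadoop would deliver it; "is\t2" is a partial sum
# of the kind a combiner could have produced earlier.
sample = ["a\t1", "a\t1", "is\t1", "is\t2", "the\t1"]
totals = list(reduce_counts(sample))
print(totals)
```

Since the output has the same `word <tab> count` shape as the input and addition can be applied in any grouping, the same script can serve as both combiner and reducer.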


# Second MapReduce job: sorting

## Mapper

In [3]:
%%writefile mapper2.py

import sys

for line in sys.stdin:
    try:
        word, count = line.strip().split('\t', 1)
        print("{0}\t{1}".format(count, word))
    except ValueError as e:
        continue

Writing mapper2.py
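
Moving the count into the key position lets Hadoop's shuffle do the sorting, but keys are compared as text unless a comparator says otherwise. A short sketch of the difference, using made-up counts:

```python
# Hypothetical counts emitted as keys by the second mapper.
keys = ["126420", "9", "1000", "87"]

# Keys compared as plain text (character by character).
as_text = sorted(keys)

# Keys compared as numbers, largest first.
numeric_desc = sorted(keys, key=int, reverse=True)

print(as_text)
print(numeric_desc)
```

Text ordering places `1000` before `126420` and `87` before `9`, which is why the second job needs a numeric, reversed key comparison rather than the default one.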


## Reducer

In [4]:
%%writefile reducer2.py

import sys

for line in sys.stdin:
    try:
        # Input is "count <tab> word"; swap back to "word <tab> count".
        count, word = line.strip().split('\t', 1)
        print("{0}\t{1}".format(word, count))
    except ValueError as e:
        continue

Writing reducer2.py


## Bash script

In [6]:
%%bash
HADOOP_STREAMING_JAR="/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar"
OUT_DIR="1_wordcount"$(date +"%s%6N")
OUT_DIR_2="2_wordcountSorting"$(date +"%s%6N")


hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null
hdfs dfs -rm -r -skipTrash ${OUT_DIR_2} > /dev/null

# MapReduce job 1: word count
yarn jar $HADOOP_STREAMING_JAR \
    -D mapreduce.job.name="WordCount" \
    -files mapper.py,reducer.py \
    -mapper "python2 mapper.py" \
    -combiner "python3 reducer.py" \
    -reducer "python3 reducer.py" \
    -numReduceTasks 8 \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null


# MapReduce job 2: sort by count, descending
yarn jar $HADOOP_STREAMING_JAR \
    -D mapreduce.job.name="WordCountSort" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k1,1nr" \
    -files mapper2.py,reducer2.py \
    -mapper "python3 mapper2.py" \
    -reducer "python3 reducer2.py" \
    -numReduceTasks 1 \
    -input ${OUT_DIR} \
    -output ${OUT_DIR_2} > /dev/null
    
    
# Results
hdfs dfs -cat ${OUT_DIR_2}/part-00000 | sed -n '7p;8q'

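
The final `sed -n '7p;8q'` prints line 7 of the sorted output and stops reading at line 8. The same selection can be sketched in Python with `itertools.islice`; the helper name `nth_line` and the generated `ranked` lines are placeholders, not real results.

```python
from itertools import islice

def nth_line(lines, n):
    """Return the n-th line (1-based) of an iterable, or None if it is shorter."""
    return next(islice(lines, n - 1, n), None)

# Placeholder data standing in for the sorted "word <tab> count" output.
ranked = ["w%d\t%d" % (i, 100 - i) for i in range(1, 11)]
print(nth_line(ranked, 7))
```

Like the `sed` command, `islice` stops consuming the input once the requested line has been read, so the rest of a large file is never scanned.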
