# Hadoop Streaming assignment 1: Words Rating

The purpose of this task is to write your own WordCount program for processing a Wikipedia dump and to learn the basic concepts of MapReduce.

In this task you have to find the 7th most popular word and its count in the Wikipedia data (`/data/wiki/en_articles_part`), with words ranked by count in descending order (most popular first).

There are several points for this task:

1) The output must be the 7th word and its count, separated by a tab character.

2) You must use a second job to obtain a totally ordered result.

3) Do not forget to redirect all diagnostic and other unneeded output to /dev/null.

Below is a draft of the main steps of the task. You may use other methods to obtain the solution.

## Step 1. Create mapper and reducer.

<b>Hint:</b> The demo task contains almost all the pieces needed to complete this assignment. You may use the demo to implement the first MapReduce job.
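As a rough illustration of what the first job computes, here is a minimal local sketch of a WordCount mapper and reducer. It is not the demo's code: the function names `map_words` and `reduce_counts` are hypothetical, the `article_id<TAB>text` input format is assumed, and lowercasing plus the `\w+` tokenization are assumptions as well. The call to `sorted` stands in for Hadoop's shuffle and sort phase.

```python
from __future__ import print_function
import re
from itertools import groupby


def map_words(lines):
    """Mapper: emit (word, 1) for every word in 'article_id<TAB>text' lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # skip malformed lines without a tab (assumed input format)
        # Lowercasing and the \w+ pattern are assumptions about tokenization
        for word in re.findall(r"\w+", parts[1].lower()):
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reducer: sum the counts of consecutive pairs that share a key."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    sample = ["1\tthe cat and the dog", "2\tThe end"]
    shuffled = sorted(map_words(sample))  # stands in for the shuffle and sort phase
    for word, total in reduce_counts(shuffled):
        print(word, total, sep="\t")
```

In a streaming job the same two loops would read from `sys.stdin` in separate mapper and reducer scripts; the reducer can rely on `groupby` only because Hadoop delivers its input sorted by key.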

In [12]:
%%writefile mapper2.py

# ---------------------------
# Read non-empty lines from stdin
# Emit each line as a key-value pair: random UUID (key), line (value)
# The random keys make the sort phase reorder the lines

from __future__ import print_function
import sys
import uuid

for line in sys.stdin:
    lid = line.strip()
    if not lid:
        continue
    rid = uuid.uuid4()

    print(rid.hex, lid, sep='\t')

# Your code for mapper here.

Overwriting mapper2.py


In [13]:
%%writefile reducer2.py

# ----------------
# Read "key<TAB>value" lines from stdin
# Join the values into comma-separated groups of random size
# (each group size is drawn with random.randint(1, 5))

from __future__ import print_function
import sys
import random

current_range = 0
output = []

for line in sys.stdin:
    fields = line.strip().split('\t', 1)
    if len(fields) != 2:
        continue  # skip malformed lines
    rid, lid = fields
    # Flush the current group once it has reached its target size
    if len(output) >= current_range:
        if output:
            print(','.join(output))
        current_range = random.randint(1, 5)
        output = []
    output.append(lid)

# Flush the last, possibly incomplete, group
if output:
    print(','.join(output))
# Your code for reducer here.

Overwriting reducer2.py


In [4]:
%%writefile combiner.py

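from __future__ import print_function
from collections import Counter
import sys

# Expects "article_id<TAB>content" lines and emits "word<TAB>count"
# pairs, with counts aggregated per article
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t", 1)
    if len(fields) != 2:
        continue  # skip lines without a tab separator
    article_id, content = fields
    words = content.split()
    counts = Counter(words)
    for word, word_count in counts.items():
        print(word, word_count, sep="\t")

# Your code for combiner here.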

Writing combiner.py


## Step 2. Create sort job.

<b>Hint:</b> You may use MapReduce comparator to solve this step. Make sure that the keys are sorted in ascending order.
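One way to picture the sort job is a mapper that makes the count the key, followed by a numeric sort of those keys. The sketch below emulates this locally; `swap_mapper` and `total_order` are hypothetical names, the `word<TAB>count` input format is assumed to match the first job's output, and breaking ties alphabetically is an assumption. It orders counts from largest to smallest, as the task statement asks. On the cluster, the ordering would typically come from a key comparator such as `KeyFieldBasedComparator` configured for a numeric sort and a single reducer, but the exact options should be checked against the Hadoop Streaming documentation.

```python
from __future__ import print_function


def swap_mapper(lines):
    """Turn 'word<TAB>count' lines into (count, word) pairs keyed by count."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip lines that are not 'word<TAB>count'
        word, count = parts
        yield int(count), word


def total_order(pairs):
    """Emulate one reducer receiving keys sorted numerically, largest first."""
    # Ties are broken alphabetically by word (an assumption, not a requirement)
    return sorted(pairs, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    first_job_output = ["cat\t1", "the\t3", "end\t2"]
    for count, word in total_order(swap_mapper(first_job_output)):
        print(word, count, sep="\t")
```

The key has to be compared as a number: a plain string comparison would place `"10"` before `"9"`.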

In [0]:
# Your code for sort job here. Don't forget to use the %%writefile magic

## Step 3. Bash commands

<b> Hint: </b> For printing the exact row you may use basic UNIX commands. For instance, sed/head/tail/... (if you know other commands, you can use them).
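For checking results locally, the row selection done by `sed -n '7p;8q'` can be mirrored in Python. This is only a sketch: `nth_line` is a hypothetical helper, and the sample rows are made up for illustration.

```python
from __future__ import print_function
from itertools import islice


def nth_line(lines, n):
    """Return the n-th line (1-based) of an iterable, or None if it is too short."""
    # islice stops reading after line n, much like the '8q' part of the sed command
    return next(islice(lines, n - 1, n), None)


if __name__ == "__main__":
    ranked = ["word%d\t%d" % (i, 100 - i) for i in range(1, 11)]  # made-up rows
    print(nth_line(ranked, 7))
```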

To run both jobs, you must use two consecutive yarn commands. Remember that the input of the second job is the output of the first job.

In [0]:
%%bash

OUT_DIR="assignment1_"$(date +"%s%6N")

# Code for your first job
# yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar ...

# Code for your second job
# yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar ...

# Code for obtaining the results
hdfs dfs -cat ${OUT_DIR}/part-00000 | sed -n '7p;8q'
hdfs dfs -rm -r -skipTrash ${OUT_DIR}* > /dev/null