# Hadoop Streaming assignment 2: Stop Words

Improve the previous program to calculate how many stop words are in the input dataset. The stop word list is in the file ‘/datasets/stop_words_en.txt’. Use Hadoop counters to count the stop words and the total words in the dataset. The result is the percentage of stop words in the entire dataset (without the percent symbol).

The result on the sample dataset: 41.603
    
Hint. As the Hadoop Streaming user guide says, "you will need to use "-file" option to tell the framework to pack your executable files as a part of job submission.". In general, you can attach more than just executable files to the job and then access them within your mappers and reducers as if they were located in the same directory.

Hint 2. The solution can contain either one or two Hadoop MapReduce jobs. In each case the last MapReduce job will have either 0 or > 1 reducers.

After the MapReduce jobs complete, extract the counter values from the Hadoop logs and use them to calculate the result. It is convenient to write a script for this. The script should do the following:

* read the Hadoop logs from stderr of the last notebook’s cell;
* extract the values of the Hadoop counters for: “stop words” and “total words”;
* calculate the percentage of stop words;
* print this percentage in the correct format to stdout;
* print the Hadoop logs to stderr. They serve as the input of your script.
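The steps above can be sketched in Python as follows. This is a minimal sketch, not the required solution: it assumes the counters are named `stop` and `total` and that the job logs print each counter on its own line as `name=value`. The helper names `parse_counters` and `stop_word_percentage` and the sample log lines are hypothetical and only illustrate the parsing.

```python
import re

# Assumed log layout: each counter is printed on its own line as "<name>=<value>",
# usually indented with tabs under its group name.
COUNTER_RE = re.compile(r"^\s*(?P<name>[^=]+)=(?P<value>\d+)\s*$")


def parse_counters(lines, wanted=("stop", "total")):
    """Sum the values of the wanted counters found in the log lines."""
    counters = {name: 0 for name in wanted}
    for line in lines:
        match = COUNTER_RE.match(line)
        if match is None:
            continue  # not a counter line
        name = match.group("name").strip()
        if name in counters:
            counters[name] += int(match.group("value"))
    return counters


def stop_word_percentage(counters):
    """Return the share of stop words in percent, or 0.0 if nothing was counted."""
    if counters["total"] == 0:
        return 0.0
    return 100.0 * counters["stop"] / counters["total"]


# Hypothetical log excerpt, made up for illustration only.
sample_log = [
    "INFO mapreduce.Job: Job finished\n",
    "\t\tMap input records=10\n",
    "\twiki\n",
    "\t\tstop=40\n",
    "\t\ttotal=100\n",
]

counters = parse_counters(sample_log)
print(stop_word_percentage(counters))
```

In a real run the lines would come from `sys.stdin` (the saved job logs) instead of `sample_log`.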
    
    

# Mapper


Hint: Create a mapper that counts the total words and the stop words. You may write this information to sys.stderr so that it can be parsed in later steps.
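Hadoop Streaming treats a stderr line of the form `reporter:counter:<group>,<counter>,<amount>` as a request to increment a counter. The sketch below shows how a mapper or reducer could emit such lines. The group name `wiki` and the counter name `stop` match the reducer in this notebook; the helper names `counter_line` and `report_counter` are hypothetical.

```python
import sys


def counter_line(group, name, amount):
    """Format a Hadoop Streaming counter update."""
    return "reporter:counter:{0},{1},{2}".format(group, name, amount)


def report_counter(group, name, amount, stream=sys.stderr):
    """Write the counter update to stderr, where Hadoop Streaming picks it up."""
    print(counter_line(group, name, amount), file=stream)


# Increment the "stop" counter in the "wiki" group by 3.
report_counter("wiki", "stop", 3)
```

Counters updated this way are summed by the framework across all tasks, so each task only needs to report its own local amounts.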

Distributed cache: If we add the option -files mapper.py,reducer.py,/datasets/stop_words_en.txt, then mapper.py, reducer.py, and stop_words_en.txt will be in the same directory on the datanodes. Hence, the mapper must use the relative path stop_words_en.txt to access this file.
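As a minimal sketch of the relative-path access described above, the stop word list can be read into a set, which makes membership tests fast. The function name `load_stop_words` is hypothetical, and the temporary file below only stands in for the real stop_words_en.txt so that the example runs locally.

```python
import os
import tempfile


def load_stop_words(path="stop_words_en.txt"):
    """Read a whitespace-separated stop word list into a set."""
    with open(path, encoding="utf-8") as f:
        return set(f.read().split())


# Local demonstration with a tiny stand-in file (not the real list).
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "stop_words_en.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("the\nof\nand\n")
    demo_words = load_stop_words(demo_path)

print(sorted(demo_words))
```

Inside a task, calling `load_stop_words()` with the default relative path would be enough, assuming the file was shipped with -files.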

In [None]:
%%writefile mapper.py

import sys
import re

totalWordsCount = 0
stopWordsCount = 0

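# stop_words_en.txt is shipped with -files, so it is available in the task's working directory.
# A set makes the membership test below fast.
with open('stop_words_en.txt') as f:
    stopWords = set(f.read().split())

for line in sys.stdin:
    try:
        # Each input line is "<article_id>\t<text>"; lines without a tab are skipped.
        article_id, text = line.strip().split('\t', 1)
        words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
        for word in words:
            if word in stopWords:
                stopWordsCount += 1
            totalWordsCount += 1
    except ValueError:
        continue

# Emit this mapper's totals as "<key>\t<count>"; the reducer sums them per key.
print("stop", stopWordsCount, sep="\t")
print("total", totalWordsCount, sep="\t")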

# Reducer

In [2]:
%%writefile reducer.py

import sys

current_key = None
word_sum = 0

# Input arrives sorted by key, so all counts for one key are adjacent.
for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
        if current_key != key:
            # A new key starts: report the finished key as a Hadoop counter.
            if current_key:
                print("reporter:counter:wiki,{0},{1}".format(current_key, word_sum), file=sys.stderr)
            word_sum = count
            current_key = key
        else:
            word_sum += count
    except ValueError:
        continue

# Report the last key.
if current_key:
    print("reporter:counter:wiki,{0},{1}".format(current_key, word_sum), file=sys.stderr)

Overwriting reducer.py


# Counter

In [3]:
%%writefile counter.py

import sys

totalWordsCount = 0
stopWordsCount = 0

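for line in sys.stdin:
    try:
        # Counter lines in the Hadoop logs look like "<name>=<value>";
        # any line that does not split or parse as an integer is skipped.
        key, count = line.strip().split('=', 1)
        count = int(count)
        if key == 'stop':
            stopWordsCount += count
        elif key == 'total':
            totalWordsCount += count
    except ValueError:
        continue

# Fail with a clear message instead of a ZeroDivisionError when no counters were found.
if totalWordsCount == 0:
    sys.exit("No 'total' counter found in the input logs")

print(stopWordsCount / float(totalWordsCount) * 100)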

Overwriting counter_process.py


# Bash
## Run map-reduce and counter

In [2]:
%%bash

OUT_DIR="stopWordsPercentage"$(date +"%s%6N")
LOGS="stderr_logs.txt"

# Remove the output directory if it already exists
hdfs dfs -rm -r -skipTrash "${OUT_DIR}" > /dev/null

# Run the job; discard stdout and redirect standard error (the Hadoop logs) to LOGS
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.job.name="StopWordsPercentage" \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py" \
    -numReduceTasks 1 \
    -input /data/wiki/en_articles_part \
    -output "${OUT_DIR}" > /dev/null 2> "$LOGS"
    
cat $LOGS | python3 ./counter.py
cat $LOGS >&2


Couldn't find program: 'bash'
