# Hadoop Streaming assignment 2: Stop Words

The purpose of this task is to improve the previous "Word rating" program. You have to calculate how many stop words are there in the input dataset. Stop words list is in `/datasets/stop_words_en.txt` file. 

Use Hadoop counters to compute the number of stop words and total words in the dataset. The result is the percentage of stop words in the entire dataset (without percent symbol).

There are several points for this task:

1) As an output, you have to get the percentage of stop words in the entire dataset without percent symbol (correct answer on sample dataset is `41.603`).

2) As you can see in the Hadoop Streaming userguide "you will need to use `-files` option to tell the framework to pack your executable files as a part of a job submission."

3) Do not forget to redirect junk output to `/dev/null`.

4) You may modify mappers/reducers from "Word rating" task and parse its output to get the answer on "Stop Words" task.

5) You may use mapper/reducer to get `"Stop Words"` and `"Total Words"` amounts and redirect them to sys.stderr. After that you may redirect the output of MapReduce to the parsed function. In this function you may find rows correspond to these amounts and compute the percentage.

Here you can find the draft for the main steps of the task. You can use other methods to get the solution.

## Step 1. Create the mapper.

<b>Hint:</b> Create the mapper, which calculates Total word and Stop word amounts. You may redirect this information to sys.stderr. This will make it possible to parse these data on the next steps.

Example of the redirections:

`print >> sys.stderr, "reporter:counter:Wiki stats,Total words,%d" % count`

Remember about the Distributed cache. If we add option `-files mapper.py,reducer.py,/datasets/stop_words_en.txt`, then `mapper.py, reducer.py` and `stop_words_en.txt` file will be in the same directory on the datanodes. Hence, it is necessary to use a relative path `stop_words_en.txt` from the mapper to access this txt file.

In [1]:
!hdfs dfs -ls /datasets/stop_words_en.txt .

-rw-r--r--   1 jovyan supergroup       1914 2018-08-01 17:15 /datasets/stop_words_en.txt


In [22]:
%%writefile mapper.py
#!/usr/bin/env python

from __future__ import print_function
import sys
import re


reload(sys)
sys.setdefaultencoding('utf-8')  # required to convert to unicode

path = 'stop_words_en.txt'

# Your code for reading stop words here
stopWords =[]

with open(path) as stopWordsFile:
    for line in stopWordsFile:
        try:
            stopWords.append(unicode(line.strip()))
        except ValueError as e:
            continue

wordSum, stopWordSum = 0, 0

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue

    words = re.split("\W*\s+\W*", text.lower(), flags=re.UNICODE)
    #words = re.findall("[a-z]+", text.lower())
    
    # Your code for mapper here.
    for word in words:
        if word in stopWords:
            stopWordSum += 1
            #print >> sys.stderr, "reporter:counter:Stop words,%d" % 1
            #print("reporter:counter:Personal Stats,Total words,1", file= sys.stderr)
            print("reporter:counter:Stop words,words found,%d" % 1, file=sys.stderr) 
        wordSum += 1
        #print >> sys.stderr, "reporter:counter:Total words,%d" % 1
        print("reporter:counter:Total words,words found,%d" % 1,file=sys.stderr )
    #wordSum += len(words)
    
#print >> sys.stderr, "reporter:counter:Total words,%d" % wordSum
#print >> sys.stderr, "reporter:counter:Stop words,%d" % stopWordSum

print ("wordSum-stopWordSum\t%d\t%d" % (wordSum, stopWordSum))

Overwriting mapper.py


In [23]:
!date

Tue Jan 15 13:25:59 UTC 2019


## Step 2. Create the reducer.

Create the reducer, which will accumulate the information after the mapper step. You may implement the combiner if you want. It can be useful from optimizing and speed up your computations (see the lectures from the Week 2 for more details).

In [24]:
%%writefile reducer.py

# Your code for reducer here.
import sys

totalWords, totalStopWords = 0, 0

for line in sys.stdin:
    try:
        _, wordSum, stopWordSum = line.strip().split('\t')
    except:
        continue
    totalWords += int(wordSum)
    totalStopWords += int(stopWordSum)
    
print "wordSum-stopWordSum\t%d\t%d" % (totalWords, totalStopWords)
    

Overwriting reducer.py


In [25]:
ls -l /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.8.1.jar

-rw-rw-r-- 1 500 500 133866 Jun  2  2017 /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.8.1.jar


In [26]:
%%bash

# Code for your first job
# yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar ...

OUT_DIR="stopword_result_"$(date +"%s%6N")
NUM_REDUCERS=0
LOGS="stderr_logs.txt"

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Stop wordCount" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python mapper.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null 2> $LOGS

cat $LOGS
    
hdfs dfs -cat ${OUT_DIR}/*


19/01/15 13:26:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/15 13:26:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/15 13:26:07 INFO mapred.FileInputFormat: Total input files to process : 1
19/01/15 13:26:07 INFO mapreduce.JobSubmitter: number of splits:2
19/01/15 13:26:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547542376033_0017
19/01/15 13:26:08 INFO impl.YarnClientImpl: Submitted application application_1547542376033_0017
19/01/15 13:26:08 INFO mapreduce.Job: The url to track the job: http://6ebf3e0bce0a:8088/proxy/application_1547542376033_0017/
19/01/15 13:26:08 INFO mapreduce.Job: Running job: job_1547542376033_0017
19/01/15 13:26:15 INFO mapreduce.Job: Job job_1547542376033_0017 running in uber mode : false
19/01/15 13:26:15 INFO mapreduce.Job:  map 0% reduce 0%
19/01/15 13:26:32 INFO mapreduce.Job:  map 17% reduce 0%
19/01/15 13:26:38 INFO mapreduce.Job:  map 26% reduce 0%
19/01/15 13:26:44 INFO 

rm: `stopword_result_1547558762418663': No such file or directory


In [6]:
!ls -l


total 76
-rw-r--r-- 1 jovyan users  1163 Jan 15 08:56 mapper.py
-rw-r--r-- 1 jovyan root    868 May  2  2018 README.md
-rw-r--r-- 1 jovyan users   344 Jan 15 08:56 reducer.py
-rw-r--r-- 1 jovyan users  3520 Jan 15 08:59 stderr_logs.txt
-rw-r--r-- 1 jovyan users 18052 Jan 15 08:59 StopWordsTask22.ipynb
-rw-r--r-- 1 jovyan root  10072 Aug  1 17:10 StopWordsTask2.ipynb
-rw-r--r-- 1 root   users   387 Jan 15 08:52 supervisord.log
-rw-r--r-- 1 root   users     2 Jan 15 08:52 supervisord.pid
-rw-r--r-- 1 jovyan root   9197 Aug  1 17:11 WordCountTask0.ipynb
-rw-r--r-- 1 jovyan root   5771 Aug  1 17:10 WordsRatingTask1.ipynb


In [7]:
!cat stderr_logs.txt

19/01/15 08:57:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/15 08:57:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/15 08:57:48 INFO mapred.FileInputFormat: Total input files to process : 1
19/01/15 08:57:48 INFO mapreduce.JobSubmitter: number of splits:2
19/01/15 08:57:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547542376033_0001
19/01/15 08:57:49 INFO impl.YarnClientImpl: Submitted application application_1547542376033_0001
19/01/15 08:57:49 INFO mapreduce.Job: The url to track the job: http://6ebf3e0bce0a:8088/proxy/application_1547542376033_0001/
19/01/15 08:57:49 INFO mapreduce.Job: Running job: job_1547542376033_0001
19/01/15 08:57:56 INFO mapreduce.Job: Job job_1547542376033_0001 running in uber mode : false
19/01/15 08:57:56 INFO mapreduce.Job:  map 0% reduce 0%
19/01/15 08:58:13 INFO mapreduce.Job:  map 13% reduce 0%
19/01/15 08:58:19 INFO mapreduce.Job:  map 20% reduce 0%
19/01/15 08

## Step 3. Create the parsed function.

<b>Hint:</b> Create the function, which will parse MapReduce sys.stderr for Total word and Stop word amounts.

The `./counter_process.py` script should do the following:

- parse hadoop logs from Stderr,

- retrieve values of 2 user-defined counters,

- compute percentage and output it into the stdout.

In [0]:
%%writefile counter_process.py

#! /usr/bin/env python

import sys

# Your functions may be here.



if __name__ == '__main__':
    # Your code here.

## Step 4. Bash commands

<b> Hints: </b> 

1) If you want to redirect standard output to txt file you may use the following argument in yarn jar:

```
yarn ... \
  ... \
  -output ${OUT_DIR} > /dev/null 2> $LOGS
```

2) For printing the percentage of stop words in the entire dataset you may parse the MapReduce output. Parsed script may be written in Python code. 

To get the result you may use the UNIX pipe operator `|`. The output of the first command acts as an input to the second command (see lecture file-content-exploration-2 for more details).

With this operator you may use command `cat` to redirect the output of MapReduce to ./counter_process.py with arguments, which correspond to the `"Stop words"` and `"Total words"` counters. Example is the following:

`cat $LOGS | python ./counter_process.py "Stop words" "Total words"`

Now something about Hadoop counters naming. 
 - Built-in Hadoop counters usually have UPPER_CASE names. To make the grading system possible to distinguish your custom counters and system ones please use the following pattern for their naming: `[Aa]aaa...` (all except the first letters should be in lowercase);
 - Another points is how Hadoop sorts the counters. It sorts them lexicographically. Grading system reads your first counter as Stop words counter and the second as Total words. Please name you counters in such way that Hadoop set the Stop words counter before the Total words. 
 
E.g. "Stop words" and "Total words" names are Ok because they correspond both requirements.

3) In Python code sys.argv is a list, which contains the command-line arguments passed to the script. The name of the script is in `sys.argv[0]`. Other arguments begin from `sys.argv[1]`.

Hence, if you have two arguments, which you send from the Bash to your python script, you may use arguments in your script with the following command:

`function(sys.argv[1], sys.argv[2])`

4) Do not forget about printing your MapReduce output in the last cell. You may use the next command:

`cat $LOGS >&2`

In [0]:
%%bash

OUT_DIR="coursera_mr_task2"$(date +"%s%6N")
NUM_REDUCERS=8
LOGS="stderr_logs.txt"

# Stub code for your job

# yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar ...
# ... \
#    -output ${OUT_DIR} > /dev/null 2> $LOGS
    
cat $LOGS | python ./counter_process.py "Stop words" "Total words"
cat $LOGS >&2

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null


In [12]:
%%bash

# Code for your first job
# yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar ...

OUT_DIR="stopword_result_"$(date +"%s%6N")
NUM_REDUCERS=0
LOGS="stderr_logs.txt"

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Stop wordCount" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python mapper.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null 2> $LOGS

cat $LOGS
hdfs dfs -cat ${OUT_DIR}/*

19/01/15 12:53:53 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/15 12:53:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/15 12:53:55 INFO mapred.FileInputFormat: Total input files to process : 1
19/01/15 12:53:55 INFO mapreduce.JobSubmitter: number of splits:2
19/01/15 12:53:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547542376033_0012
19/01/15 12:53:55 INFO impl.YarnClientImpl: Submitted application application_1547542376033_0012
19/01/15 12:53:55 INFO mapreduce.Job: The url to track the job: http://6ebf3e0bce0a:8088/proxy/application_1547542376033_0012/
19/01/15 12:53:55 INFO mapreduce.Job: Running job: job_1547542376033_0012
19/01/15 12:54:01 INFO mapreduce.Job: Job job_1547542376033_0012 running in uber mode : false
19/01/15 12:54:01 INFO mapreduce.Job:  map 0% reduce 0%
19/01/15 12:54:05 INFO mapreduce.Job: Task Id : attempt_1547542376033_0012_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException:

rm: `stopword_result_1547556830529285': No such file or directory
cat: `stopword_result_1547556830529285/*': No such file or directory


In [8]:
!cat stderr_logs.txt

19/01/15 12:05:02 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/15 12:05:02 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/15 12:05:03 INFO mapred.FileInputFormat: Total input files to process : 1
19/01/15 12:05:03 INFO mapreduce.JobSubmitter: number of splits:2
19/01/15 12:05:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547542376033_0011
19/01/15 12:05:04 INFO impl.YarnClientImpl: Submitted application application_1547542376033_0011
19/01/15 12:05:04 INFO mapreduce.Job: The url to track the job: http://6ebf3e0bce0a:8088/proxy/application_1547542376033_0011/
19/01/15 12:05:04 INFO mapreduce.Job: Running job: job_1547542376033_0011
19/01/15 12:05:09 INFO mapreduce.Job: Job job_1547542376033_0011 running in uber mode : false
19/01/15 12:05:09 INFO mapreduce.Job:  map 0% reduce 0%
19/01/15 12:05:26 INFO mapreduce.Job:  map 28% reduce 0%
19/01/15 12:05:33 INFO mapreduce.Job:  map 43% reduce 0%
19/01/15 12

In [3]:
%%bash

ls -1 | python mapper.py

Traceback (most recent call last):
  File "mapper.py", line 7, in <module>
    reload(sys)
NameError: name 'reload' is not defined


In [4]:
! ls -1 | python mapper.py

Traceback (most recent call last):
  File "mapper.py", line 7, in <module>
    reload(sys)
NameError: name 'reload' is not defined


In [5]:
! cat test.txt | python mapper.py

Traceback (most recent call last):
  File "mapper.py", line 7, in <module>
    reload(sys)
NameError: name 'reload' is not defined


In [20]:
! cat mapper.py

#!/usr/bin/env python

from __future__ import print_function
import sys
import re


reload(sys)
sys.setdefaultencoding('utf-8')  # required to convert to unicode

path = 'stop_words_en.txt'

# Your code for reading stop words here
stopWords =[]

with open(path) as stopWordsFile:
    for line in stopWordsFile:
        try:
            stopWords.append(unicode(line.strip()))
        except ValueError as e:
            continue

wordSum, stopWordSum = 0, 0

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue

    words = re.split("\W*\s+\W*", text.lower(), flags=re.UNICODE)
    #words = re.findall("[a-z]+", text.lower())
    
    # Your code for mapper here.
    for word in words:
        if word in stopWords:
            stopWordSum += 1
            #print >> sys.stderr, "reporter:counter:Stop words,%d" % 1
            #print("reporter:counter:Personal Stats,Tota

In [10]:
! /usr/bin/env python mapper.py

Traceback (most recent call last):
  File "mapper.py", line 9, in <module>
    sys.setdefaultencoding('utf-8')  # required to convert to unicode
AttributeError: module 'sys' has no attribute 'setdefaultencoding'
