# Word Count with mrjob

`mrjob` is a Python package that helps you write and run Hadoop Streaming jobs. It supports Amazon's Elastic MapReduce (EMR) and also works with your own Hadoop cluster. Yelp released it as an open-source framework, and we will use it to interface with Hadoop because MapReduce jobs written with it are readable and easy to run.

Read through some of the [mrjob docs](http://mrjob.readthedocs.org/en/latest/index.html) or this [mrjob tutorial](https://pythonhosted.org/mrjob/guides/quickstart.html).

Some important features of `mrjob`:

* It can run jobs on EMR, your own Hadoop cluster, or locally (for testing).
* It can be used to write multi-step jobs, where one map-reduce step feeds into the next. 

## Exercise 1: Simple Count

Do a simple word count on the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). Click [here](http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz) to download it. If that link is unavailable, look [here](http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html).


__STEP 1:__ Create a file `wordcounts.py` with the following code:

```Python
from mrjob.job import MRJob
from string import punctuation


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield (word.strip(punctuation).lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()

```
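
To see what this job computes without a Hadoop setup, the map, shuffle, and reduce phases can be imitated in plain Python. The sketch below is only an illustration: `simulate` is a hypothetical helper that stands in for the grouping mrjob and Hadoop do between the mapper and the reducer, and the two sample lines are made up.

```python
from collections import defaultdict
from string import punctuation


def mapper(line):
    """Emit (word, 1) for every whitespace-separated token, as in wordcounts.py."""
    for word in line.split():
        yield word.strip(punctuation).lower(), 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


def simulate(lines):
    """Hypothetical stand-in for the shuffle phase: group mapper output by key."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for word, total in reducer(key, grouped[key]):
            results[word] = total
    return results


# Made-up sample input, not taken from the dataset.
word_counts = simulate(["The cat sat.", "the CAT ran!"])
print(word_counts)  # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

Note that a token made up entirely of punctuation is stripped down to an empty string, so the job also emits a count for the key `''`.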
  

__STEP 2:__ Create a mini version of the dataset with these terminal commands: 

```bash
mkdir mini_20_newsgroups
mkdir mini_20_newsgroups/comp.windows.x
mkdir mini_20_newsgroups/rec.motorcycles
mkdir mini_20_newsgroups/sci.med
cp 20_newsgroups/comp.windows.x/663* mini_20_newsgroups/comp.windows.x
cp 20_newsgroups/rec.motorcycles/10311* mini_20_newsgroups/rec.motorcycles
cp 20_newsgroups/sci.med/5889* mini_20_newsgroups/sci.med
```

__STEP 3:__ From the terminal, run the script with the mini folder `mini_20_newsgroups` as its input (for example, `python wordcounts.py mini_20_newsgroups`). Notice how it goes through each folder in the directory and performs the word count.


The output looks pretty messy, but it does what we expected.

## Exercise 2: Word Counts by Topic

Create a file `wordcounts_bytopic.py` with the following code:

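```Python
from mrjob.job import MRJob
from string import punctuation
import os


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # Hadoop Streaming exposes the path of the current input file as an
        # environment variable. On newer Hadoop versions it may be named
        # 'mapreduce_map_input_file' instead.
        file_path = os.environ['map_input_file']
        # Take the text after the last '.' and before the next '/',
        # e.g. '.../sci.med/<file>' gives 'med'.
        topic = file_path.split('.')[-1].split('/')[0]
        for word in line.split():
            word = word.strip(punctuation).lower()
            # Prefix each word with its topic so counts are kept per topic.
            yield (topic + '_' + word, 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
```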
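
The topic label comes entirely from string operations on the input file's path, so it is worth checking what those operations return. The sketch below reuses the mapper's expression and, for comparison, shows an alternative that keeps the whole parent directory name. The alternative is an assumption added for illustration, not part of the exercise, and the file name `example_post` is made up.

```python
import os


def topic_from_suffix(file_path):
    """Same expression as the mapper: text after the last '.' up to the next '/'."""
    return file_path.split('.')[-1].split('/')[0]


def topic_from_parent_dir(file_path):
    """Alternative (an assumption, not the exercise's method): parent directory name."""
    return os.path.basename(os.path.dirname(file_path))


# Hypothetical paths; 'example_post' is a made-up file name.
for path in ('mini_20_newsgroups/comp.windows.x/example_post',
             'mini_20_newsgroups/rec.motorcycles/example_post'):
    print(topic_from_suffix(path), topic_from_parent_dir(path))
```

With the mapper's expression, `comp.windows.x` is shortened to `x` and `rec.motorcycles` to `motorcycles`. That is enough to tell the three mini folders apart, but newsgroups that share a last component would be merged under one label.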


## Exercise 3: JSON and Tokenization

I have a JSON file containing NYT articles, `articles.json`. We want to clean up the text before counting the words.


#### Tokenize and Stem

Let's count the words in each section after we tokenize and stem.

```Python
from string import punctuation
from nltk import word_tokenize
from nltk.stem import SnowballStemmer
import json

sno = SnowballStemmer('english')

def my_test_tokenizer(text):
    # `text` is iterated item by item, so d['content'] is presumably a
    # list of strings (e.g. paragraphs).
    for s in text:
        # Drop non-ASCII characters and lowercase.
        s = s.strip().encode("ascii", "ignore").decode('utf-8').lower()
        # Remove punctuation before tokenizing.
        translator = str.maketrans('', '', punctuation)
        no_punct = s.translate(translator)
        for word in word_tokenize(no_punct):
            w = sno.stem(word)
            print(w)

with open('data/articles.json') as f:
    for line in f:
        d = json.loads(line)
        headline = d['headline']['main']
        section = d['section_name']
        my_test_tokenizer(d['content'])
        break  # only process the first article
```

Output for the first article (the script prints one stemmed token per line; the tokens are wrapped here to save space):

```
hey the man on the phone said are you still come tonight it took a
moment for me to realiz that he was call from distil to confirm my
dinner reserv yes i repli cool he said and sound as if he meant it
distil open in june on the corner of franklin street and west broadway
in tribeca the former home of drew niepor layla and centrico the belli
dancer and the frozenmargarita machin are gone but a certain effervesc
remain so doe mr niepor hover in the background as guru to distil owner
the firsttim restaurateur nick iovacchini and shane lyon the 25yearold
chef the space is bland handsom with dark wood and charcoal banquett
breathless high ceil and quasimediev wheel chandeli like crown of fire
one side is devot to the bar where the drink by benjamin wood are
ladykil eleg with a knife twist occasion 1980s mope rock shimmer from
the speaker servic is confound friend almost coddl when i stood outsid
read the post menu someon came hurri down the step to hand me my own
copi so i wouldnt crane my
```
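
To move from printing tokens to counting words per section, the tokenizer can be turned into a mapper-style generator that emits `(section_word, 1)` pairs, mirroring the key format from Exercise 2. The sketch below is one possible shape, under several assumptions: `section_word_pairs` is a hypothetical name, `content` is taken to be a list of strings, plain `str.split` stands in for `word_tokenize` so the example runs without NLTK, and the stemmer is passed in as an optional callable (for instance `SnowballStemmer('english').stem`). The sample article is made up.

```python
import json
from collections import Counter
from string import punctuation


def section_word_pairs(json_line, stem=lambda w: w):
    """Yield (section_word, 1) pairs for one JSON-encoded article line."""
    article = json.loads(json_line)
    section = article['section_name']
    translator = str.maketrans('', '', punctuation)
    # Assumption: 'content' holds a list of paragraph strings.
    for paragraph in article['content']:
        cleaned = paragraph.lower().translate(translator)
        for word in cleaned.split():
            yield section + '_' + stem(word), 1


# Made-up article, shaped like one line of articles.json.
sample = json.dumps({
    'section_name': 'Dining',
    'content': ['Hey, the man said.', 'The man replied!'],
})

# Summing the pairs here plays the role of the reducer.
counts = Counter()
for key, value in section_word_pairs(sample):
    counts[key] += value
print(sorted(counts.items()))
```

Inside an `MRJob` subclass, the body of this generator would go in `mapper`, and the reducer from the earlier exercises could be reused unchanged.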


## Exercise 4: Most Common Topic by Word

Now I want to count each word by topic, and then find the topic in which each word appears most often.
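
This is a natural fit for a two-step job: the first step counts each (topic, word) pair, and the second regroups those counts by word and keeps the topic with the highest count. The sketch below shows the logic in plain Python so it can be run without Hadoop. `run_step` is a hypothetical local driver that imitates the shuffle between mapper and reducer, the sample records are made up, and this is one possible approach rather than the intended solution. In mrjob, each step would correspond to an `MRStep` returned from the job's `steps()` method.

```python
from collections import defaultdict


def run_step(records, mapper, reducer):
    """Hypothetical local driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


# Step 1: count each (topic, word) pair.
def count_mapper(topic, word):
    yield (topic, word), 1


def count_reducer(topic_word, counts):
    yield topic_word, sum(counts)


# Step 2: regroup by word and keep the topic with the highest count.
def by_word_mapper(topic_word, count):
    topic, word = topic_word
    yield word, (count, topic)


def top_topic_reducer(word, count_topic_pairs):
    # max() compares counts first; ties go to the alphabetically later topic.
    count, topic = max(count_topic_pairs)
    yield word, (topic, count)


# Made-up (topic, word) records standing in for tokenized input.
records = [('med', 'doctor'), ('med', 'doctor'), ('motorcycles', 'doctor'),
           ('motorcycles', 'bike'), ('med', 'pain')]
step1 = run_step(records, count_mapper, count_reducer)
step2 = run_step(step1, by_word_mapper, top_topic_reducer)
print(step2)
```

Putting the count first in the `(count, topic)` value lets the second reducer pick the winner with a single `max` call.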