# MRjob, Python's interface to Hadoop
This notebook shows two well-known, simple examples of running MapReduce jobs from Python with `mrjob`. Of course mrjob is not Mister Job but stands for MapReduce job. MapReduce is a programming model for processing very large datasets (on the order of petabytes) in parallel across a cluster of machines. It was originally introduced by Google; Apache Hadoop is an open source implementation written in Java. `mrjob` is a Python library for writing MapReduce jobs that can run on Hadoop or locally on a single machine.
<br>
## MRjob in a nutshell
MapReduce is about the simplest way you can break down a big job into little pieces. Basically, mappers read lines of input and emit (key, value) tuples. The framework then groups the tuples by key, so each key and all of its corresponding values are sent to a single reducer call. For each key, the reducer performs some operation on the values, such as summing them. The mapper and reducer logic is all that must be written.
![](data/MapReduce-Basics2.PNG)
There is much more on this subject. See [CS109-2015 lecture 14](https://github.com/cs109/2015/blob/master/Lectures/14-Recommendations_MapReduce.pdf) and many other sources on the internet.
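
To make the mapper, grouping and reducer steps described above concrete, here is a minimal sketch in plain Python. It does not use `mrjob` or Hadoop and runs everything in one process; the function names `map_reduce`, `length_mapper` and `length_reducer` and the sample lines are made up for this illustration.

```python
from collections import defaultdict

def map_reduce(lines, mapper, reducer):
    """Run mapper on every line, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)      # the "shuffle": collect values per key
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

def length_mapper(line):
    # hypothetical mapper: emit (word length, 1) for every word on the line
    for word in line.split():
        yield len(word), 1

def length_reducer(length, ones):
    # hypothetical reducer: count how many words have this length
    yield length, sum(ones)

length_counts = map_reduce(["a bb cc", "dd e"], length_mapper, length_reducer)
print(length_counts)
```

On a real cluster the mappers and reducers run on different machines and the grouping step moves data over the network, but the logic you write has the same shape.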

---
The examples below are from CS109-2015 lecture 14. Note: back in 2015 `mrjob` coding was not possible in Jupyter Notebook. Fortunately it is now.<br>Before running the examples install `mrjob` from your Anaconda Prompt terminal as follows:
```bash
pip install mrjob
```
Input files are in the `data/mrjob` directory, and the output is written to the same directory.

### Word count

In [1]:
%%file wordcount.py
from mrjob.job import MRJob

class MrWordCount(MRJob):
    def mapper(self, _, line):
        # split() without arguments splits on any whitespace and skips empty strings
        for word in line.split():
            yield word.lower(), 1
    
    def reducer(self, word, occurrences):
        yield word, sum(occurrences)
        
if __name__ == '__main__':
    MrWordCount.run()

Overwriting wordcount.py


In [None]:
!python wordcount.py < data/mrjob/wordcount.txt > data/mrjob/wordcountresult.txt
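
To inspect the result, the output file can be read back into Python. The sketch below assumes `mrjob`'s default output format, where each line holds a JSON-encoded key and a JSON-encoded value separated by a tab. The helper name `parse_mrjob_output` and the sample lines are made up for this illustration; to use the real result, pass the lines of `data/mrjob/wordcountresult.txt` instead.

```python
import json

def parse_mrjob_output(lines):
    """Parse tab-separated JSON key/value lines into a dict."""
    result = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                      # skip blank lines
        key, value = line.split("\t", 1)
        result[json.loads(key)] = json.loads(value)
    return result

# hypothetical sample lines in the assumed output format
sample = ['"hadoop"\t2\n', '"python"\t5\n']
word_counts = parse_mrjob_output(sample)

# sort by count, most frequent word first
top_words = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
print(top_words)
```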


### Anagram
Two strings are anagrams if one string can be constructed by rearranging the characters in the other string using all the characters in the original string exactly once. 
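
A handy consequence of this definition is that two words are anagrams exactly when their sorted letters are equal, so the sorted letters can serve as a grouping key. A minimal sketch of that idea follows; the helper name `signature` is made up for this illustration, and lowercasing the word is an extra assumption that is not part of the definition above.

```python
def signature(word):
    """Return the letters of word in sorted order as a single string."""
    # lowercasing is an assumption so that 'Listen' and 'silent' match
    return ''.join(sorted(word.strip().lower()))

same = signature("listen") == signature("silent")
different = signature("listen") == signature("listens")
print(signature("listen"), same, different)
```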

In [3]:
%%file anagram.py
from mrjob.job import MRJob

class MRAnagram(MRJob):
    def mapper(self, _, line):
        # sort the characters of the word and join them back into a string
        word = line.strip()
        letters = ''.join(sorted(word))
        # key is the sorted letters; value is the original word
        yield letters, word
        
    def reducer(self, _, words):
        # words is a generator; collect it into a list so it can be counted
        anagrams = list(words)
        # only yield results if there are >= 2 words which are anagrams of each other
        if len(anagrams) > 1:
            yield len(anagrams), anagrams
            
if __name__ == '__main__':
    MRAnagram.run()

Overwriting anagram.py


In [None]:
!python anagram.py < data/mrjob/anagram.txt > data/mrjob/anagramresult.txt