# MapReduce

The MapReduce programming technique was designed to analyze massive data sets across a cluster. In this Jupyter notebook, you'll get a sense for how Hadoop MapReduce works; however, this notebook will run locally rather than on a cluster.

The biggest difference between Hadoop and Spark is that Spark tries to do as many calculations as possible in memory, which avoids moving data back and forth across a cluster. Hadoop writes intermediate calculations out to disk, which can be less efficient. Hadoop is an older technology than Spark and one of the cornerstone big data technologies.

If you click on the Jupyter notebook logo at the top of the workspace, you'll be taken to the workspace directory. There you will see a file called "songplays.txt". This is a text file where each line represents a song that was played in the Sparkify app. The MapReduce code will count how many times each song was played. In other words, the code counts how many times the song title appears in the list.


# MapReduce versus Hadoop MapReduce

Don't get confused by the terminology! MapReduce is a programming technique. Hadoop MapReduce is a specific implementation of the programming technique.

Some of the syntax will look a bit funny, so be sure to read the explanation and comments for each section. You'll learn more about the syntax in later lessons. 

Run each of the code cells below to see the output.

In [12]:
# Install mrjob library. This package is for running MapReduce jobs with Python
# In Jupyter notebooks, "!" runs terminal commands from inside notebooks 

! pip install mrjob



In [None]:
%%file wordcount.py
from collections import defaultdict
import sys

class MRSongCount:
    def mapper(self, _, song):
        yield (song.strip(), 1)

    def reducer(self, key, values):
        yield (key, sum(values))

    def run(self, lines):
        # MAP
        mapped = []
        for i, line in enumerate(lines):
            for pair in self.mapper(i, line):
                mapped.append(pair)

        # SHUFFLE & SORT
        grouped = defaultdict(list)
        for key, value in mapped:
            grouped[key].append(value)

        # REDUCE
        for key, values in grouped.items():
            for result in self.reducer(key, values):
                print(f"{result[0]}\t{result[1]}")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python wordcount.py songs.txt")
        sys.exit(1)

    with open(sys.argv[1], "r") as f:
        lines = f.readlines()

    job = MRSongCount()
    job.run(lines)

FileNotFoundError: [Errno 2] No such file or directory: '--f=/Users/at/Library/Jupyter/runtime/kernel-v3d1bad0492030ad807432f1d05bcbf3345b827d36.json'

In [14]:
# run the code as a terminal command
! python wordcount.py songplays.txt

Traceback (most recent call last):
  File [35m"/Users/at/Documents/udacity/learning/spark-datalake/wordcount.py"[0m, line [35m3[0m, in [35m<module>[0m
    [1;31mfrom mrjob.job import MRJob[0m # import the mrjob library
    [1;31m^^^^^^^^^^^^^^^^^^^^^^^^^^^[0m
  File [35m"/Users/at/.local/share/virtualenvs/udacity-7oez_sss/lib/python3.13/site-packages/mrjob/job.py"[0m, line [35m36[0m, in [35m<module>[0m
    from mrjob.conf import combine_dicts
  File [35m"/Users/at/.local/share/virtualenvs/udacity-7oez_sss/lib/python3.13/site-packages/mrjob/conf.py"[0m, line [35m34[0m, in [35m<module>[0m
    from mrjob.util import expand_path
  File [35m"/Users/at/.local/share/virtualenvs/udacity-7oez_sss/lib/python3.13/site-packages/mrjob/util.py"[0m, line [35m23[0m, in [35m<module>[0m
    import pipes
[1;35mModuleNotFoundError[0m: [35mNo module named 'pipes'[0m


# Summary of what happens in the code.

There is a list of songs in songplays.txt that looks like the following:

Deep Dreams
Data House Rock
Deep Dreams
Data House Rock
Broken Networks
Data House Rock
etc.....

During the map step, the code reads in the txt file one line at a time. The map steps outputs a set of tuples that look like this:

(Deep Dreams, 1)  
(Data House Rock, 1)  
(Deep Dreams, 1)  
(Data House Rock, 1)  
(Broken Networks, 1)  
(Data House Rock, 1)  
etc.....

Finally, the reduce step combines all of the values by keys and sums the values:  

(Deep Dreams, \[1, 1, 1, 1, 1, 1, ... \])  
(Data House Rock, \[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\])  
(Broken Networks, \[1, 1, 1, ...\]  

With the output 

(Deep Dreams, 1131)  
(Data House Rock, 510)  
(Broken Networks, 828)  