# MapReduce

The MapReduce programming technique was designed to analyze massive data sets across a cluster. In this Jupyter notebook, you'll get a sense of how Hadoop MapReduce works; however, this notebook runs locally (or on Colab) rather than on a cluster.

The biggest difference between Hadoop MapReduce and Spark is that Spark tries to keep as many intermediate results as possible in memory, whereas MapReduce writes intermediate results out to disk between steps, which can be less efficient. MapReduce is older than Spark and is one of the cornerstone big data technologies.

Place the file "songplays.txt" in your workspace. This is a text file in which each line represents one play of a song in the Sparkify app. The MapReduce code counts how many times each song was played; in other words, it counts how many times each song title appears in the list.


# MapReduce versus Hadoop MapReduce

Don't get confused by the terminology! MapReduce is a programming technique. Hadoop MapReduce is a specific implementation of the programming technique.

Some of the syntax will look a bit funny, so be sure to read the explanation and comments for each section. You'll learn more about the syntax in later lessons. 

Run each of the code cells below to see the output.

After running the code cells, complete the exercises from the lecture to understand how MapReduce works and how you can use it to solve your own problems.

In [2]:
# Install mrjob library. This package is for running MapReduce jobs with Python
# In Jupyter notebooks, "!" runs terminal commands from inside notebooks 

! pip install mrjob



In [3]:
%%file wordcount.py
# %%file is an IPython magic command that saves the code cell as a file

from mrjob.job import MRJob # import the mrjob library

class MRSongCount(MRJob):
    
    # the map step: each line in the txt file is read as a key, value pair
    # in this case, each line in the txt file only contains a value but no key
    # _ means that in this case, there is no key for each line
    def mapper(self, _, song):
        # output each line as a tuple of (song_name, 1)
        yield (song, 1)

    # the reduce step: combine all tuples with the same key
    # in this case, the key is the song name
    # then sum all the values of the tuple, which will give the total song plays
    def reducer(self, key, values):
        yield (key, sum(values))
        
if __name__ == "__main__":
    MRSongCount.run()

Writing wordcount.py


In [4]:
# run the code as a terminal command
! python wordcount.py songplays.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/wordcount.jovyan.20231222.122633.294757
Running step 1 of 1...
job output is in /tmp/wordcount.jovyan.20231222.122633.294757/output
Streaming final output from /tmp/wordcount.jovyan.20231222.122633.294757/output...
"Data House Rock"	828
"Deep Dreams"	1131
"Broken Networks"	510
Removing temp directory /tmp/wordcount.jovyan.20231222.122633.294757...


# Summary of what happens in the code.

There is a list of songs in songplays.txt that looks like the following:

Deep Dreams
Data House Rock
Deep Dreams
Data House Rock
Broken Networks
Data House Rock
etc.....

During the map step, the code reads in the txt file one line at a time. The map step outputs a set of tuples that look like this:

(Deep Dreams, 1)  
(Data House Rock, 1)  
(Deep Dreams, 1)  
(Data House Rock, 1)  
(Broken Networks, 1)  
(Data House Rock, 1)  
etc.....

Finally, the tuples are grouped by key, and the reduce step sums the values for each key:  

(Deep Dreams, \[1, 1, 1, 1, 1, 1, ... \])  
(Data House Rock, \[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\])  
(Broken Networks, \[1, 1, 1, ...\])  

This gives the following output, matching the job output above:

(Deep Dreams, 1131)  
(Data House Rock, 828)  
(Broken Networks, 510)  
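
The three steps above can be sketched in plain Python, without mrjob or a cluster. This is only an illustration: the function names `map_step`, `shuffle_step`, and `reduce_step` are made up for this sketch, and it runs on the six sample lines listed above rather than on the full songplays.txt file.

```python
from collections import defaultdict

def map_step(lines):
    """Emit one (song, 1) pair per input line, as the mapper does."""
    for line in lines:
        yield (line.strip(), 1)

def shuffle_step(pairs):
    """Group values by key; a MapReduce framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

# The six sample lines shown above, not the full songplays.txt file.
sample = [
    "Deep Dreams",
    "Data House Rock",
    "Deep Dreams",
    "Data House Rock",
    "Broken Networks",
    "Data House Rock",
]

song_counts = reduce_step(shuffle_step(map_step(sample)))
print(song_counts)
# {'Deep Dreams': 2, 'Data House Rock': 3, 'Broken Networks': 1}
```

In Hadoop MapReduce the grouping happens across machines, which is why the mapper and reducer only ever see one line or one key at a time.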

# Exercise 1

Pick any text file and apply the MapReduce functions to it.

In [167]:
! python wordcount.py fruitcolors.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/wordcount.jovyan.20231222.160352.457963
Running step 1 of 1...
job output is in /tmp/wordcount.jovyan.20231222.160352.457963/output
Streaming final output from /tmp/wordcount.jovyan.20231222.160352.457963/output...
"Honeydew Brown"	1
"Banana Yellow"	2
"Fig Purple"	1
"Grape Purple"	1
"Cherry Red"	2
"Date"	1
"Elderberry Blue"	2
"Apple Green"	2
Removing temp directory /tmp/wordcount.jovyan.20231222.160352.457963...


# Exercise 2

Change the functions so that they count the occurrences of each word instead of each song name.
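
One possible approach is sketched below; it is not necessarily what exo2.py contains. The mapper splits each line into lowercase words and emits `(word, 1)` for each one, while the reducer is the same sum as in the song count. The functions are written as plain generators so the sketch runs without mrjob; in an `MRJob` subclass they would become the `mapper` and `reducer` methods. `WORD_RE` and `run_local` are hypothetical helpers introduced only for this illustration.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of letters, digits, underscores, or apostrophes.
WORD_RE = re.compile(r"[\w']+")

def word_mapper(_, line):
    """Emit (word, 1) for every word on the line, lowercased."""
    for word in WORD_RE.findall(line):
        yield (word.lower(), 1)

def count_reducer(key, values):
    """Sum the counts for one word."""
    yield (key, sum(values))

def run_local(mapper, reducer, lines):
    """Tiny local stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            groups[key].append(value)
    results = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results

word_counts = run_local(
    word_mapper,
    count_reducer,
    ["Deep Dreams", "Data House Rock", "Deep Dreams"],
)
print(word_counts)
# {'deep': 2, 'dreams': 2, 'data': 1, 'house': 1, 'rock': 1}
```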


In [168]:
! python exo2.py songplays.txt


No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/exo2.jovyan.20231222.160357.335132
Running step 1 of 1...
job output is in /tmp/exo2.jovyan.20231222.160357.335132/output
Streaming final output from /tmp/exo2.jovyan.20231222.160357.335132/output...
"deep"	1131
"networks"	510
"rock"	828
"dreams"	1131
"house"	828
"broken"	510
"data"	828
Removing temp directory /tmp/exo2.jovyan.20231222.160357.335132...


# Exercise 3
Based on the example in the previous document, write a program that implements a simple inverted index. To test your program, you can create input files containing D1 ... D4 (as lines in one document or as separate documents).
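
An inverted index maps each word to the documents that contain it. The sketch below shows the idea in plain Python, assuming the mapper already knows the name of the document each line came from; how exo3.py obtains that name is not shown in this notebook. The document contents and the helper `build_inverted_index` are invented for illustration. This version also removes duplicate document names, which is a design choice rather than a requirement of the exercise.

```python
from collections import defaultdict

def index_mapper(doc_name, line):
    """Emit (word, doc_name) for every word on the line."""
    for word in line.split():
        yield (word, doc_name)

def index_reducer(word, doc_names):
    """Collect the distinct documents that contain the word, sorted by name."""
    yield (word, sorted(set(doc_names)))

def build_inverted_index(documents):
    """Local stand-in for the framework: map every line, group by word, reduce."""
    postings = defaultdict(list)
    for doc_name, lines in documents.items():
        for line in lines:
            for word, name in index_mapper(doc_name, line):
                postings[word].append(name)
    index = {}
    for word, names in postings.items():
        for out_word, out_names in index_reducer(word, names):
            index[out_word] = out_names
    return index

# Hypothetical test documents, not the real doc1.txt ... doc3.txt.
documents = {
    "doc1": ["Apple Green", "Fig Purple", "Grape Purple"],
    "doc2": ["Apple Green", "Grape Purple"],
    "doc3": ["Banana Yellow"],
}

inverted_index = build_inverted_index(documents)
print(inverted_index["Purple"])  # ['doc1', 'doc2']
```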

In [169]:
! python exo3.py doc1.txt doc2.txt doc3.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/exo3.jovyan.20231222.160400.484729
Running step 1 of 1...
job output is in /tmp/exo3.jovyan.20231222.160400.484729/output
Streaming final output from /tmp/exo3.jovyan.20231222.160400.484729/output...
"Yellow"	["doc1", "doc3"]
"white"	["doc1"]
"Purple"	["doc1", "doc1", "doc2", "doc3"]
"Red"	["doc1", "doc3"]
"Brown"	["doc1", "doc2", "doc3"]
"Cherry"	["doc1", "doc3"]
"Honeydew"	["doc1", "doc2"]
"PassionFruits"	["doc1", "doc2"]
"Coconuts"	["doc1"]
"Date"	["doc2", "doc3"]
"Elderberry"	["doc1", "doc3"]
"Fig"	["doc1"]
"Grape"	["doc1", "doc2", "doc3"]
"Green"	["doc1", "doc2"]
"Apple"	["doc1", "doc2"]
"Banana"	["doc1", "doc3"]
"Blue"	["doc1", "doc3"]
Removing temp directory /tmp/exo3.jovyan.20231222.160400.484729...


# Exercise 4
Based on the example in the previous document, write a program that lists the documents for each word occurring 2+ times. To test your program, you can create input files containing D1 ... D4 (as lines in one document or as separate documents).
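
One way to read this exercise is: group the document names by word as in the inverted index, then keep only the words that were seen at least twice and emit a `(document, word)` pair for each occurrence. The sketch below follows that reading in plain Python; it is an assumption about the intended behavior, and the document contents, the `min_count` parameter, and the helper names are hypothetical.

```python
from collections import defaultdict

def frequent_word_reducer(word, doc_names, min_count=2):
    """Emit (doc_name, word) pairs only if the word occurs at least min_count times."""
    doc_names = list(doc_names)
    if len(doc_names) >= min_count:
        for doc_name in doc_names:
            yield (doc_name, word)

def list_frequent_words(documents, min_count=2):
    """Map each word to its document, group by word, then filter in the reducer."""
    postings = defaultdict(list)
    for doc_name, lines in documents.items():
        for line in lines:
            for word in line.split():
                postings[word].append(doc_name)
    pairs = []
    for word, names in postings.items():
        pairs.extend(frequent_word_reducer(word, names, min_count))
    return pairs

# Hypothetical test documents, not the real doc1.txt ... doc3.txt.
documents = {
    "doc1": ["Apple Green", "Fig Purple"],
    "doc2": ["Apple Green"],
    "doc3": ["Banana Yellow"],
}

frequent_pairs = list_frequent_words(documents)
print(frequent_pairs)
# [('doc1', 'Apple'), ('doc2', 'Apple'), ('doc1', 'Green'), ('doc2', 'Green')]
```

The filter sits in the reducer because only the reducer sees all occurrences of a word together.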

In [170]:
! python exo4.py doc1.txt doc2.txt doc3.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/exo4.jovyan.20231222.160407.152968
Running step 1 of 1...
job output is in /tmp/exo4.jovyan.20231222.160407.152968/output
Streaming final output from /tmp/exo4.jovyan.20231222.160407.152968/output...
"doc1"	"Yellow"
"doc3"	"Yellow"
"doc1"	"Purple"
"doc1"	"Purple"
"doc2"	"Purple"
"doc3"	"Purple"
"doc1"	"Red"
"doc3"	"Red"
"doc1"	"Brown"
"doc2"	"Brown"
"doc3"	"Brown"
"doc1"	"Cherry"
"doc3"	"Cherry"
"doc1"	"Honeydew"
"doc2"	"Honeydew"
"doc1"	"PassionFruits"
"doc2"	"PassionFruits"
"doc2"	"Date"
"doc3"	"Date"
"doc1"	"Elderberry"
"doc3"	"Elderberry"
"doc1"	"Grape"
"doc2"	"Grape"
"doc3"	"Grape"
"doc1"	"Green"
"doc2"	"Green"
"doc1"	"Apple"
"doc2"	"Apple"
"doc1"	"Banana"
"doc3"	"Banana"
"doc1"	"Blue"
"doc3"	"Blue"
Removing temp directory /tmp/exo4.jovyan.20231222.160407.152968...


# Exercise 5
Implement Lab 2 using Python MapReduce functions and apply it to the sales data.
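
The aggregation can be sketched as a sum of revenue per group, optionally restricted to one sales channel. Everything specific in the sketch below is an assumption: the column names `Region`, `Sales Channel`, and `Total Revenue`, the sample rows, and the function names are invented for illustration and may not match sales.csv or exo5.py.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real sales.csv may use different columns.
SAMPLE_CSV = """Region,Sales Channel,Total Revenue
Europe,Online,100.50
Asia,Offline,20.25
Europe,Offline,10.00
Asia,Online,5.75
"""

def sales_mapper(row, group_by, channel=None):
    """Emit (group value, revenue) for rows matching the optional channel filter."""
    if channel is None or row["Sales Channel"].lower() == channel.lower():
        yield (row[group_by], float(row["Total Revenue"]))

def sales_reducer(key, amounts):
    """Sum the revenue for one group and format it with two decimals."""
    yield (key, f"{sum(amounts):.2f}")

def total_sales(csv_text, group_by, channel=None):
    """Local stand-in for the framework: map each CSV row, group, reduce."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for key, amount in sales_mapper(row, group_by, channel):
            groups[key].append(amount)
    totals = {}
    for key, amounts in groups.items():
        for out_key, out_total in sales_reducer(key, amounts):
            totals[out_key] = out_total
    return totals

totals_all = total_sales(SAMPLE_CSV, "Region")
totals_online = total_sales(SAMPLE_CSV, "Region", channel="online")
print(totals_all)     # {'Europe': '110.50', 'Asia': '26.00'}
print(totals_online)  # {'Europe': '100.50', 'Asia': '5.75'}
```

Passing the grouping column and the channel as parameters keeps one mapper and one reducer for every combination, instead of one program per question.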

In [175]:
! python exo5.py sales.csv

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/exo5.jovyan.20231222.160614.251302
Running step 1 of 1...
job output is in /tmp/exo5.jovyan.20231222.160614.251302/output
Streaming final output from /tmp/exo5.jovyan.20231222.160614.251302/output...
Removing temp directory /tmp/exo5.jovyan.20231222.160614.251302...


In [179]:
! python exo5.py --task region sales.csv
! python exo5.py --task region sales.csv
! python exo5.py --task region sales.csv
! python exo5.py --task region --onoff online sales.csv
! python exo5.py --task country --onoff online sales.csv
! python exo5.py --task item_type --onoff online sales.csv
! python exo5.py --task region --onoff offline sales.csv
! python exo5.py --task country --onoff offline sales.csv
! python exo5.py --task item_type --onoff offline sales.csv

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/exo5.jovyan.20231222.160742.006169
Running step 1 of 1...
job output is in /tmp/exo5.jovyan.20231222.160742.006169/output
Streaming final output from /tmp/exo5.jovyan.20231222.160742.006169/output...
"Central America and the Caribbean"	"403357849.72"
"North America"	"99495515.12"
"Sub-Saharan Africa"	"999642091.50"
"Europe"	"1026999612.80"
"Middle East and North Africa"	"509923894.53"
"Asia"	"587403296.85"
"Australia and Oceania"	"324071211.41"
Removing temp directory /tmp/exo5.jovyan.20231222.160742.006169...
No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/exo5.jovyan.20231222.160742.532130
Running step 1 of 1...
job output is in /tmp/exo5.jovyan.20231222.160742.532130/output
Streaming final output from /tmp/exo5.jovyan.20231222.160742.532130/output...
"Central America and the Caribbean"	"40335