 ![alt text](http://www.jkspeaks.com/wordpress/wp-content/uploads/2011/05/mapreduce-logo.jpg "")

####  This practical session introduces the Map Reduce programming model through simple examples. We will focus on writing a simple WordCount program in Python and running it locally, without relying on the Hadoop back-end, so that it becomes clear that "Map Reduce" is simply a programming model, of which Hadoop is merely one implementation. Keep in mind, however, that unlike the workflow shown in the figure below, our local approach does not make use of any parallelisation.

![alt text](http://www.glennklockwood.com/data-intensive/hadoop/mapreduce-workflow.png "Word Count Execution by Matei Zaharia ")

## What is a Word Count in Map Reduce ?

The Word Count is the canonical example used to illustrate the Map Reduce programming model. The idea is simply to count the number of times each word appears in a given set of input texts.

----

#### What does the Mapper do ?
A mapper takes a line (i.e., a string of text) as input and breaks it into words. For each word, it outputs a key/value pair: the word as key and the count 1 as value.

#### What does the Reducer do ?
A reducer receives key/value pairs as input, sums the values for each word, and outputs one record per word with its total count.
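
The two roles described above can be sketched in memory, before any file or pipe is involved. This is only an illustration: the function names `map_line`, `shuffle_and_sort` and `reduce_group`, and the sample lines, are assumptions made for this sketch and are not part of Hadoop.

```python
from collections import defaultdict

def map_line(line):
    """Mapper: split one line into words and emit a (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Group the values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_group(word, counts):
    """Reducer: sum all the counts received for one word."""
    return word, sum(counts)

lines = ["fox wolf dog", "wolf cat dog"]   # hypothetical sample input
pairs = [pair for line in lines for pair in map_line(line)]
result = [reduce_group(word, counts) for word, counts in shuffle_and_sort(pairs)]
print(result)   # [('cat', 1), ('dog', 2), ('fox', 1), ('wolf', 2)]
```

Everything runs sequentially here; in a real cluster the mappers and reducers would run on different machines, with the grouping step handled by the framework.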

----

![alt text](http://slideplayer.com/5003555/16/images/17/Word+Count+Execution+Input+Map+Shuffle+%26+Sort+Reduce+Output+Map+Reduce.jpg "Word Count Execution by Matei Zaharia ")

##### To keep the illustration simple, the input and output will be the standard STDIN and STDOUT streams, and we will run the example locally.

## Material preparation

We first need to download the files required for this practical session.
Simply execute the next cell and wait.

### Downloading Books from the Gutenberg.org website

In [1]:
!mkdir -p ./INFOH515/books

!wget --quiet http://www.gutenberg.org/cache/epub/20417/pg20417.txt -O ./INFOH515/books/pg20417.txt
!wget --quiet http://www.gutenberg.org/cache/epub/20418/pg20418.txt -O ./INFOH515/books/pg20418.txt
!wget --quiet http://www.gutenberg.org/cache/epub/20419/pg20419.txt -O ./INFOH515/books/pg20419.txt
!wget --quiet http://www.gutenberg.org/cache/epub/20420/pg20420.txt -O ./INFOH515/books/pg20420.txt
!wget --quiet http://www.gutenberg.org/cache/epub/20421/pg20421.txt -O ./INFOH515/books/pg20421.txt
!wget --quiet http://www.gutenberg.org/cache/epub/20422/pg20422.txt -O ./INFOH515/books/pg20422.txt
!wget --quiet http://www.gutenberg.org/cache/epub/20423/pg20423.txt -O ./INFOH515/books/pg20423.txt
!wget --quiet http://www.gutenberg.org/cache/epub/20424/pg20424.txt -O ./INFOH515/books/pg20424.txt
!wget --quiet http://www.gutenberg.org/cache/epub/20425/pg20425.txt -O ./INFOH515/books/pg20425.txt
!wget --quiet http://www.gutenberg.org/cache/epub/20426/pg20426.txt -O ./INFOH515/books/pg20426.txt
!echo "Books downloaded in ./INFOH515/books" 

Books downloaded in ./INFOH515/books


### Creating a Lorem Ipsum excerpt

In [2]:
!echo "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum" > "./INFOH515/loremipsum.txt"
!echo "File created ./INFOH515/loremipsum.txt" 

File created ./INFOH515/loremipsum.txt


## Mapper

Note: the first line `%%file ./INFOH515/mapper.py` in the following cell copies the cell's contents (except first line) to the file `./INFOH515/mapper.py`

In [3]:
%%file ./INFOH515/mapper.py
#!/usr/local/anaconda3/bin/python
import sys

for line in sys.stdin:                              # The input data comes from STDIN (i.e: The standard input)
    line = line.strip()                             # Removal of leading and trailing whitespaces
    words = line.split()                            # Creation of a list containing all words by splitting the line in words
    # emit one (word, 1) pair per word
    for word in words:                              # For each word in the list (i.e: words), do...
        print(word+"\t"+"1")

Writing ./INFOH515/mapper.py


## Reducer

In [4]:
%%file ./INFOH515/reducer.py
#!/usr/local/anaconda3/bin/python
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:                              # The input data comes from STDIN (i.e: The standard input)
    line = line.strip()                             # Removal of leading and trailing whitespaces
    word, count = line.split('\t', 1)               # Parsing of the awaited key/value pair

    try:
        count = int(count)
    except ValueError:                              # In the case the value is not a number, we silently discard the line
        continue

    if current_word == word:                        # This IF only works because Hadoop sorts map output by key
        current_count += count                      # before it is passed to the reducer
    else:
        if current_word:
            print(current_word+"\t"+str(current_count))           # Output of the result to STDOUT
        current_count = count
        current_word = word

if current_word is not None:                        # Output of the last word (skipped when the input was empty)
    print(current_word+"\t"+str(current_count))

Writing ./INFOH515/reducer.py
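
The key-change test in the reducer above can also be written with `itertools.groupby`, which groups consecutive items sharing the same key. The following is a minimal sketch rather than a replacement for `reducer.py`: the function names `parse` and `reduce_sorted` and the sample input are assumptions made for illustration, and, like the reducer above, it only works if the input is already sorted by key.

```python
from itertools import groupby

def parse(lines):
    """Yield (word, count) pairs from tab-separated lines, skipping malformed ones."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:                      # no tab: not a key/value line
            continue
        try:
            value = int(count)
        except ValueError:               # value is not a number: discard the line
            continue
        yield word, value

def reduce_sorted(lines):
    """Sum the counts of consecutive lines that share the same word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

sample = ["cat\t1\n", "dog\t1\n", "dog\t1\n", "fox\t1\n"]  # hypothetical sorted mapper output
for word, total in reduce_sorted(sample):
    print(word + "\t" + str(total))
```

To use it as a streaming reducer, one would pass `sys.stdin` instead of `sample`.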


## Local execution of the Mapper

In [5]:
!echo "fox wolf dog wolf cat moose mouse dog cat." |  python ./INFOH515/mapper.py

fox	1
wolf	1
dog	1
wolf	1
cat	1
moose	1
mouse	1
dog	1
cat.	1


## Local execution of the Mapper followed by the Reducer; a Word Count application

In [6]:
# Let us simulate the running of a M/R program locally.
# The following shell command:
# - passes some input words to the mapper
# - sorts the mapper output by key (the word)
# - passes the sorted map output to the reducer
!echo "fox wolf dog wolf cat moose mouse dog cat." | python ./INFOH515/mapper.py | sort -k1,1 | python ./INFOH515/reducer.py

cat	1
cat.	1
dog	2
fox	1
moose	1
mouse	1
wolf	2


## Local execution of the WordCount application using files

In [7]:
!cat ./INFOH515/books/pg20417.txt | python ./INFOH515/mapper.py | sort -k1,1 | python ./INFOH515/reducer.py

=	2
|	136
|_________|_______________|____________|____________|____________|	2
|______________________________________|	3
|________________________________________________________________|	1
______________________________________	1
________________________________________________________________	1
_______________________________________________________________________	7
-	7
--	2
------	4
-,	1
{	3
§	77
***	6
*****	2
&	12
+	1
0	2
.001	1
0.24	1
0.62	1
1	24
($1	1
(1)	11
[1]	1
1,	2
1.	10
1.]	1
1),	1
1}	1
10	7
10,	1
10.	1
100	2
1.00	1
1,000	2
10,000	3
100,000	8
(100,000,000,000,000,000)	1
_100-inch	1
100-INCH	2
101	1
104	1
105	1
10.5	1
108,600[A]	1
10.--SOLAR	1
11	3
1/100th	1
(1/125)	1
1/125,000,000	1
113	1
116	2
117	2
118	2
1/1800	1
1/1845	1
11.86	1
119	3
11.--MARS,	1
12	1
12)	1
120	4
12,000	2
121	2
123	1
12:30	1
124	2
125	2
1/25	1
128	1
12.--JUPITER	1
13	1
13)	1
130	1
1,300	1
130,000	1
1,340,000	1
135	1
1/35,00

authority	2
authority,	1
automatic	4
automatic.	1
automatically	2
autotomy.	1
autumn	2
autumn,	1
autumn;	1
autumn.]	1
availability	1
available	3
(available	1
available.	2
avalanche.	1
Avebury's	1
average	11
average,	1
averaged	1
AVOCET'S	2
avoid	4
avoided.	1
Avoided	1
avoiding	2
await	1
awakened	1
aware	8
awareness	4
awareness.	1
away	37
away,	8
away.	10
away.]	1
away--you	1
awful	1
awkward	3
awns	1
axes	1
axes,	1
axis	14
axis,	3
axis.	2
Azores	1
azure	1
_b_	1
(_b_)	6
(b)	1
B	1
B,	3
B.	6
B.,	1
Babies	1
baboons.	1
baby	3
BABY	6
Babylonia,	2
bacillus	1
back	59
back,	4
back,"	1
back;	1
back.	1
back."	1
BACK	1
BACK]	1
backbone	5
backbone,	2
backbone;	1
backbone.	1
backboned	11
backboneless	4
Backboneless	1
BACKBONELESS	2
backed	3
background	4
background,	1
background.	1
backgrounds,	1
back-teeth	1
back--till	1
backwards	5
backwards,	2
Backwards	1
bacteria	7
bacteria,	2
Bacteria.	1
bad	2
badly	3
bag	2
"

difficulties,	3
difficulties.	1
difficulties.]	1
Difficulties	2
difficulty	6
diffraction	1
_diffraction	1
diffuse	2
diffusion	1
digested,	1
digesting	2
digestive	3
digged	1
digit	2
digits	2
digs	1
dilatable	1
dilemma	1
dilute	1
dilutes,	1
dim	3
dimensions	3
dimensions?	1
dimensions.	2
DIMENSIONS	1
diminish	1
diminishing	1
dimly	3
dimmed	1
DINGO	2
dinosaur	1
Dinosaur	1
Dinosaur,	1
Dinosaurs	3
Dinosaurs,	1
dints	2
dioxide	1
Diplodocus	1
(Dipnoan),	1
(Dipnoi),	1
dipper	1
dipper,	1
direct	12
direct,	2
DIRECT,	1
directed	1
direction	15
direction,	4
direction;	1
direction.	2
directions	4
directions,	3
directions;	1
directions.	5
directions--an	1
directly	12
directly.	1
Directly	1
director	1
Director	1
DIRECT-READING	2
disadvantage--a	1
disadvantageous	1
disadvantages,	1
disappear	2
disappear.]	1
disappearance	2
Disappearance	1
disappeared	4
disappeared,	1
disappeared.	2
disappearing,	1
disappearing."	1
disappears	4
dis

floor.	2
floor."	1
flora	1
flora,	2
flora;	1
flora.	1
flotation,	1
flounder	2
flourish,	2
flourished	1
flourished,	1
flow	14
flow.	1
flowed	1
flower	6
flower,	1
flower.	1
Flower	1
FLOWER	2
flowering	9
flower-perfumed,	1
flowers	6
flowers,	1
flowers--attractive	1
flower-vase	1
flowing	4
flows	2
fluctuate,	1
fluid	4
fluid,	1
fluid.	1
fluids	1
fluids.	1
fluorescence	1
fluttering.	1
flux	3
fly	15
fly,	1
fly.	2
fly.]	1
flying	16
"flying	5
"flying"	1
flying,	1
Flying	16
FLYING	4
fly-trap	1
FLY-TRAP	2
flywheel	2
flywheel,	3
flywheel.	2
foals	2
foam	1
foam,	1
foam-bells	1
focus	2
focus,	1
focussed	1
foetal	2
Fog,	1
fogs	1
fold	3
folded	1
folds	1
folk	1
folk,	1
Folk	1
Folk,	1
FOLK,	2
folk-ways,	1
follow	10
follow,	1
follow.	1
followed	15
followed?	1
followed.	1
following	15
following:	1
Following	1
follows	9
follows,	1
fond	3
FONT-DE-GAUME	4
food	33
"food,"	2
food,	14
food.	2
food.]	1
FOOD	4
FOOD,	2
food-c

intensely	1
intensification	1
intensities	1
intensity	4
intent	1
interbreeding.	1
Interbreeding	1
intercept	1
intercepts	2
interchange	2
intercourse	1
intercourse.	1
intercrossing	2
interest	21
interest,	2
interest:	1
interest.	5
interested	1
interesting	60
interesting,	1
Interesting	1
interests,	1
interfered,	1
interference	1
"interference"	1
interferes,	1
interglacial	1
interglacial.	1
Interglacial	4
interior	7
interior,	1
interlinked	2
interlock,	1
intermediary,	1
intermediate	3
internal	24
_internal	1
Internal	1
internally	1
(Internat.	1
International	1
interpose	1
interposed	1
interpret	2
interpretation	3
Interpretation	4
interpreted	2
interpreter.	1
interpreting	2
interprets	1
inter-relation	1
inter-relations	3
inter-relations,	1
inter-relations.	2
inter-relations--the	1
interrupted	1
interruptions	1
interstices	1
interval	1
intervals;	1
intervened	1
interwoven	2
intestine	1
intimate	6
intimately	2
into	211
intrepi

minimum	1
minimum,	1
mining,	1
minnow	1
minnow,	1
Minnow	1
minnows	4
minnow--The	1
minor	6
Minor,	1
minority	1
minority_,	1
minute	30
minute,	3
minute.	2
MINUTE	2
minuteness	2
minutes	4
minutes,	3
minutes;	1
minutes.	5
minutes'	1
Miocene	1
_Miocene_	1
Miocene,	2
Miocene)	1
{MIOCENE	1
mirror	8
mirror!	1
mirror.	1
miserable	1
misjudge	1
miss	2
Miss	2
missed;	1
misses	1
missing	1
mission	3
Mission	1
Mississippi	1
mist	1
mistake	4
mistake,	1
mistake."	1
mistaken.	1
mistakes	1
mistakes,	2
Mitchell,	1
mites,	1
mixed	2
mixed;	1
mixes	1
mixing	2
MIXING	4
mixture	2
mixtures	1
Mme.	1
mobile	7
mobile.	1
mobility	1
mode	1
model	6
MODEL	2
modelled	6
modern	78
Modern	4
MODERN	4
modernised	1
modernising	1
modernity	1
modes	6
modification,	1
modifications	1
modified	1
MODIFIED	2
moist	5
moist,	1
moistened	1
moisture	3
moisture,	1
moisture-laden	1
mole,	1
molecular	7
Molecular	1
molecule	9
"molecule."	1
molecule,	2
m

sea	32
"sea	1
"sea"	1
sea,	25
sea.	17
Sea	4
Sea_	1
Sea_.	1
Sea,	1
Sea.	3
SEA	9
SEA,	2
sea-anemone	7
sea-anemone,	1
SEA-ANEMONE	2
SEA-ANEMONE-LIKE	2
sea-anemones	4
sea-anemone's	1
sea-anemones,	3
SEA-ANEMONES	2
sea-butterflies	1
SEA-CUCUMBER	2
sea-cucumbers,	1
sea-dust	3
sea-dust.	1
sea-floor,	2
"sea-gooseberries,"	1
sea-grass	2
sea-grass,	1
sea-horse	3
SEA-HORSE	2
sea-horses	1
seal	1
sea-lettuce	1
sea-level.	1
sea-lilies,	1
Sea-lilies,	1
seals	1
seals.	2
"sea-meadows"	1
sea-meadows	1
sea-meadows,"	1
Sea-meadows	1
sea-perches,	1
search	4
search.	2
seas	6
seas,	3
seas:	1
seas.	5
Seas_	1
Sea-scorpions	1
sea-serpents,	1
seashore	9
seashore,	1
seashore.	1
seashore.]	1
sea-skimmers	1
sea-snail,	1
sea-snail.	1
sea-snakes,	1
season,	1
SEASON	1
SEASON]	1
Seasonal	1
SEASONAL	2
"sea-spider"	1
sea-spider	1
sea-squirt	1
sea-squirts	1
sea-squirts,	1
sea-squirts;	1
seat	4
sea--the	1
sea--The	3
sea-urchin,	1
sea-urchins	2
sea

true	38
true,	5
true.	5
True	1
truly	3
trumpeting	1
trundles	1
trunk	2
trust.	1
truth	5
truth,	4
try	5
trying	9
TRYING	2
trying--the	1
TRYPANOSOMA	2
Trypanosome,	2
Trypanosome)	1
tse-tse	2
Tse-tse	1
tube	12
tube_,	1
tube,	6
tube;	1
tube.	5
tube."	1
tube"	1
TUBE	2
tube-feet	1
tubeful	1
tubercles	1
tuberculosis.	1
tubes	2
tubes,	1
tubes.	1
tube--the	1
tubular	2
tucking	1
tuft	1
tufts	3
tugs	1
tumbled	1
tumultuous	1
tune	1
tunic	1
(_Tupaia_),	1
Turkestan,	1
turn	21
turn,	2
turned	13
"turned	1
turning	5
turnips,	1
turns	7
turtle	2
turtle,	3
turtle-backs	1
turtles	4
turtles,	1
Turtles	1
Twelfth	1
twelve	2
twentieth	1
twenty	14
twenty-five	4
twenty-four	2
twenty-nine	1
twenty-one	1
twenty-seven	1
twenty-thousandth	1
twice	5
twice.	1
Twice	2
twig	2
twig,	1
twig-insects	1
twig-like	1
twigs	1
twilight	1
twilight,	1
twined	1
twinkling	1
two	158
_two_	1
two,	3
two.	3
two.]	2
Two	7
TWO	2
twofold	2
two-spined	

As you can see, the output contains many tokens that are not plain words. You may improve on this.
The issue is that the files are raw text with no preprocessing. Any idea how to clean this up? Go ahead!
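
One possible starting point is to normalise the tokens inside the mapper. The sketch below lowercases each line and keeps only word-like tokens. The definition of a "word" (ASCII letters, optionally joined by an apostrophe or a hyphen) and the names `WORD_RE` and `tokenize` are assumptions made for this example; accented letters, for instance, would need a broader pattern.

```python
import re

# Assumption: a "word" is a run of ASCII letters, optionally joined by inner apostrophes or hyphens.
WORD_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def tokenize(line):
    """Lowercase the line and return only the word-like tokens."""
    return WORD_RE.findall(line.lower())

print(tokenize('away--you "flying" (available 1,000 BACK]'))
# ['away', 'you', 'flying', 'available', 'back']
```

In the mapper, `line.split()` would then be replaced by `tokenize(line)`, so that `flying`, `"flying"` and `FLYING` are all counted under the same key.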

## Wrap Up

We have seen how to implement a WordCount program in MapReduce. By implementing it in Python, we have also seen how easily it can be made to scale once it is executed on a Hadoop cluster rather than locally.

Fine, **BUT** what about **Shuffle and Sort** ?

* **Shuffle phase:** transfer of the map output from **a** Mapper to **a** Reducer.
* **Sort phase:** merging and sorting of the map outputs. Data from the mappers are grouped by key, automatically split among the reducers (whatever their number), and sorted by key.

Interestingly, shuffling can start before the end of the mapping phase. (What about "slow starts"?)
Also, sorting is done by key, not by value!

This sorting step helps Hadoop detect when a new reduce call is required: it starts one, transparently for the user, whenever the next key in the sorted input differs from the previous one.

You may skip shuffling and sorting by specifying zero reducers. You then get a map-only job, which speeds up the mapping phase.

What about **Partitions** ?

Well, partitioning is another issue. It determines to which reducer each output record of the map phase will be sent. The default partitioner hashes the keys to distribute them among the reduce tasks. And yes, you may override this for specific tasks.
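
The idea of hash partitioning can be sketched as follows. This is not Hadoop's code: the function name `partition` is hypothetical, and `zlib.crc32` is used as a stand-in hash function because Python's built-in `hash()` of strings changes between interpreter runs.

```python
import zlib

def partition(key, num_reducers):
    """Return the index of the reducer that receives this key.

    zlib.crc32 is a stand-in for the key's hash code, chosen here
    because it gives the same value on every run.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for word in ["fox", "wolf", "dog", "wolf"]:
    print(word, "-> reducer", partition(word, 3))
```

The property that matters is that the same key always maps to the same reducer, so all the counts for `wolf` end up in one place.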

![alt text](https://www.oreilly.com/library/view/distributed-computing-in/9781787126992/assets/fadf32ab-b857-4d22-a334-c989b5bafdea.png "Distributed Computing in Java 9 by Raja Malleswara Rao Pattamsetti")


# Exercises

Create MapReduce programs in Python that compute the following queries. Use the same methodology as above to run your programs (i.e., first test the mapper locally on a sample of the data, then test mapper + reducer locally, then run it on MapReduce).

### Sensor data exercises

In the file "data/sensors/sensor-sample.txt", each line contains multiple fields of information. Let's call them: Date (Date), Time (Time), RoomId(Integer)-SensorId(Integer), Value1 (float), Value2 (float).
Using the file "data.conv.txt", create MapReduce programs in Python that:


1. Counts the number of entries for each day.
2. Counts the number of measures for each pair of RoomId-SensorId.
3. Computes the average of Value1.  
4. **Extra** Computes the average of Value1, but uses a combiner to minimize the amount of information sent to the reducer.
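
For the combiner variant, note that averaging local averages gives the wrong result when the mappers see different numbers of values. One common approach, sketched here in memory under the assumption that each combiner forwards a partial `(sum, count)` pair, is shown below. The function names `combine` and `reduce_average` and the sample values are hypothetical.

```python
def combine(values):
    """Combiner: collapse the values seen by one mapper into a single (sum, count) pair."""
    total, count = 0.0, 0
    for value in values:
        total += value
        count += 1
    return total, count

def reduce_average(partials):
    """Reducer: merge the (sum, count) pairs and derive the global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else 0.0

# Hypothetical values seen by two separate mappers
partials = [combine([10.0, 20.0]), combine([30.0])]
print(reduce_average(partials))   # 20.0
```

Averaging the two local averages (15.0 and 30.0) would have given 22.5 instead of 20.0, which is why the count travels with the sum.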

### Movielens movie data exercises

Movielens (https://movielens.org/) is a website that provides non-commercial, personalised movie recommendations. GroupLens Research has collected and made available rating data sets from the MovieLens web site for the purpose of research into making recommendation services. In this exercise, we will use one of these datasets (the movielens 10M dataset, http://files.grouplens.org/datasets/movielens/ml-10m-README.html) and compute some basic queries on it.
The dataset has already been downloaded and is available at data/movielens/movies.csv, data/movielens/ratings.csv, data/movielens/tags.csv 

1. Inspect the dataset's [README file](http://files.grouplens.org/datasets/movielens/ml-10m-README.html), in particular the section titled "Content and Use of Files" to learn the structure of these three files.

2. Compute all pairs (`title`, `rat`) where `title` is a full movie title (as found in the movies.dat file), and `rat` is the average rating of that movie (computed over all possible ratings for that movie, as found in the ratings.dat file)

>_Hint_: To answer this query, you will need to combine information from two different files. In Python, it is not possible to run two different map functions on two different files inside the same Map/Reduce job. To circumvent this, you will need to answer this query by running **three** Map/Reduce jobs consecutively. The first two jobs convert movies.dat and ratings.dat into a common structure. This common structure is then processed by the third job, which computes the actual result. Defining exactly what this common structure is, is part of the exercise.

3. **Extra** Compute all pairs (`title`, `tag`) where `title` is a full movie title that has an average rating of at least 3.5, and `tag` is a tag for that movie (as found in the tags.dat file)

## Sensor Dataset Solutions

### Prepare entry count mapper

In [1]:
%%file ./INFOH515/entry_count_mapper.py
#!/home/bigdata/anaconda3/bin/python

import sys

for line in sys.stdin:                              # The input data comes from STDIN (i.e: The standard input)
    line = line.strip()                             # Removal of leading and trailing whitespaces
    values = line.split()                           # A line consists of multiple fields; split the line into fields
    # emit one (date, 1) pair per line
    print('%s\t%s' % (values[0], 1))                  # ... print to STDOUT the key/value pair values[0]/1  -- values[0] is the first field (the date)

Writing ./INFOH515/entry_count_mapper.py


### Entry count - execute locally

In [2]:
data_file = 'data/sensors/sensor-sample.txt'
!echo "======== Running mapper only on 10 lines ======= "
!head -n 10 {data_file} | python ./INFOH515/entry_count_mapper.py
!echo "======== Running mapper + reducer on 10 lines ====== "
!head -n 10 {data_file} | python ./INFOH515/entry_count_mapper.py | sort -k1,1 | python ./INFOH515/reducer.py

2017-03-31	1
2017-03-31	1
2017-03-31	1
2017-02-28	1
2017-02-28	1
2017-02-28	1
2017-02-28	1
2017-02-28	1
2017-02-28	1
2017-02-28	1
2017-02-28	7
2017-03-31	3


### Prepare measure count mapper

In [3]:
%%file ./INFOH515/measure_count_mapper.py
#!/home/bigdata/anaconda3/bin/python
import sys

for line in sys.stdin:                              # The input data comes from STDIN (i.e: The standard input)
    line = line.strip()                             # Removal of leading and trailing whitespaces
    values = line.split()                           # A line consists of multiple fields; split the line into fields
    # emit one (roomid-sensorid, 1) pair per line
    print('%s\t%s' % (values[2], 1))                  # ... print to STDOUT the key/value pair values[2]/1  -- values[2] is the field containing roomid-sensorid

Writing ./INFOH515/measure_count_mapper.py


### Test measure count locally

In [4]:
data_file = 'data/sensors/sensor-sample.txt'
!echo "======== Running mapper only on 10 lines ======= "
!head -n 10 {data_file} | python ./INFOH515/measure_count_mapper.py
!echo "======== Running mapper + reducer on 10 lines ====== "
!head -n 10 {data_file} | python ./INFOH515/measure_count_mapper.py | sort -k1,1 | python ./INFOH515/reducer.py

1-0	1
1-1	1
1-2	1
1-0	1
1-1	1
1-2	1
1-0	1
1-1	1
1-2	1
1-0	1
1-0	4
1-1	3
1-2	3


### Average of Value1 - prepare mapper and reducer

In [5]:
%%file ./INFOH515/avg_value_mapper.py
#!/home/bigdata/anaconda3/bin/python

import sys

for line in sys.stdin:                              # The input data comes from STDIN (i.e: The standard input)
    line = line.strip()                             # Removal of leading and trailing whitespaces
    values = line.split()                           # A line consists of multiple fields; split the line into fields
    # emit the value under a single constant key
    print('%s\t%s' % (1, values[3]))                  # ... print to STDOUT the key/value pair. Since we want 1 average, all values have the same key. values[3] is the field containing the value

Writing ./INFOH515/avg_value_mapper.py


In [6]:
%%file ./INFOH515/avg_value_reducer.py
#!/home/bigdata/anaconda3/bin/python
import sys

current_val = 0
sum_val = 0.0
num_val = 0


for line in sys.stdin:                              # The input data comes from STDIN (i.e: The standard input)
    line = line.strip()                             # Removal of leading and trailing whitespaces
    key, value = line.split('\t', 1)               # Parsing of the awaited key/value pair. The key is everything before the first tab, the value everything after

    try:
        current_val = float(value)
        sum_val = sum_val + current_val
        num_val = num_val + 1
    except ValueError:                              # In the case the value is not a number, we silently discard the line
        continue

# Output of the result
print('%s\t%s\t%s' % (sum_val, num_val, sum_val / max(1,num_val)))

Writing ./INFOH515/avg_value_reducer.py


### Average value - run locally

In [7]:
data_file = 'data/sensors/sensor-sample.txt'
!echo "======== Running mapper only on 10 lines ======= "
!head -n 10 {data_file} | python ./INFOH515/avg_value_mapper.py
!echo "======== Running mapper + reducer on 10 lines ====== "
!head -n 10 {data_file} | python ./INFOH515/avg_value_mapper.py | sort -k1,1 | python ./INFOH515/avg_value_reducer.py

1	122.153
1	-3.91901
1	11.04
1	19.9884
1	37.0933
1	45.08
1	19.3024
1	38.4629
1	45.08
1	19.1652
353.44618999999994	10	35.344618999999994


## Movielens solutions

In [12]:
%%file ./INFOH515/movie_mapper.py
#!/home/bigdata/anaconda3/bin/python
import sys

for line in sys.stdin:                              # The input data comes from STDIN (i.e: The standard input)
    line = line.strip()                             # Removal of leading and trailing whitespaces
    # For movies.csv the format is movieId,title,genres. Caveat: a plain split(',') cuts a quoted title that itself contains a comma
    tup  = line.split(',')                         # A line consists of multiple fields; split the line into fields
    print('%s\t%s\t%d' % (tup[0], tup[1], -1))        # Output format is MovieID\tTitle\tRating. Key = tup[0] = movie id; value consists of tup[1] = movie title and -1 for the rating (invalid rating)

Overwriting ./INFOH515/movie_mapper.py
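
Titles that contain a comma are quoted in a CSV file, so splitting on `,` truncates them at the first comma. A possible alternative, sketched here with Python's standard `csv` module, keeps such titles whole. The function name `map_movies` and the two sample lines are assumptions made for illustration; the header check is a precaution in case the file starts with a header row.

```python
import csv

def map_movies(lines):
    """Yield 'MovieID<TAB>Title<TAB>-1' records from CSV lines, keeping quoted titles whole."""
    for row in csv.reader(lines):
        if len(row) < 2 or not row[0].isdigit():   # skip blank lines and a possible header row
            continue
        yield "%s\t%s\t%d" % (row[0], row[1], -1)

# Hypothetical sample in the movieId,title,genres layout
sample = [
    '1,Toy Story (1995),Adventure|Animation',
    '11,"American President, The (1995)",Comedy|Drama',
]
for record in map_movies(sample):
    print(record)
```

As a streaming mapper, the loop would iterate over `map_movies(sys.stdin)` instead of the sample list.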


In [25]:
movie_file = 'data/movielens/movies.csv'
# create folder for temporary results
!mkdir temp_res
!cat {movie_file} | python ./INFOH515/movie_mapper.py > temp_res/movies.txt
!cat temp_res/movies.txt

1	Toy Story (1995)	-1
2	Jumanji (1995)	-1
3	Grumpier Old Men (1995)	-1
4	Waiting to Exhale (1995)	-1
5	Father of the Bride Part II (1995)	-1
6	Heat (1995)	-1
7	Sabrina (1995)	-1
8	Tom and Huck (1995)	-1
9	Sudden Death (1995)	-1
10	GoldenEye (1995)	-1
11	"American President	-1
12	Dracula: Dead and Loving It (1995)	-1
13	Balto (1995)	-1
14	Nixon (1995)	-1
15	Cutthroat Island (1995)	-1
16	Casino (1995)	-1
17	Sense and Sensibility (1995)	-1
18	Four Rooms (1995)	-1
19	Ace Ventura: When Nature Calls (1995)	-1
20	Money Train (1995)	-1
21	Get Shorty (1995)	-1
22	Copycat (1995)	-1
23	Assassins (1995)	-1
24	Powder (1995)	-1
25	Leaving Las Vegas (1995)	-1
26	Othello (1995)	-1
27	Now and Then (1995)	-1
28	Persuasion (1995)	-1
29	"City of Lost Children	-1
30	Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)	-1
31	Dangerous Minds (1995)	-1
32	Twelve Monkeys (a.k.a. 12 Monkeys) (1995)	-1
34	Babe (1995)	-1
36	Dead Man Walking (1995)	-1
38	It Takes Two (1995)	-1
39

627	"Last Supper	-1
628	Primal Fear (1996)	-1
631	All Dogs Go to Heaven 2 (1996)	-1
632	Land and Freedom (Tierra y libertad) (1995)	-1
633	Denise Calls Up (1995)	-1
634	Theodore Rex (1995)	-1
635	"Family Thing	-1
636	Frisk (1995)	-1
637	Sgt. Bilko (1996)	-1
638	Jack and Sarah (1995)	-1
639	Girl 6 (1996)	-1
640	Diabolique (1996)	-1
645	Nelly & Monsieur Arnaud (1995)	-1
647	Courage Under Fire (1996)	-1
648	Mission: Impossible (1996)	-1
649	Cold Fever (Á köldum klaka) (1995)	-1
650	Moll Flanders (1996)	-1
653	Dragonheart (1996)	-1
656	Eddie (1996)	-1
661	James and the Giant Peach (1996)	-1
662	Fear (1996)	-1
663	Kids in the Hall: Brain Candy (1996)	-1
665	Underground (1995)	-1
667	Bloodsport 2 (a.k.a. Bloodsport II: The Next Kumite) (1996)	-1
668	Song of the Little Road (Pather Panchali) (1955)	-1
670	"World of Apu	-1
671	Mystery Science Theater 3000: The Movie (1996)	-1
673	Space Jam (1996)	-1
674	Barbarella (1968)	-1
678	Some Folks Call It a Sling Blade (199

5941	Drumline (2002)	-1
5942	"Hot Chick	-1
5943	Maid in Manhattan (2002)	-1
5944	Star Trek: Nemesis (2002)	-1
5945	About Schmidt (2002)	-1
5947	Evelyn (2002)	-1
5949	Intact (Intacto) (2001)	-1
5951	Morvern Callar (2002)	-1
5952	"Lord of the Rings: The Two Towers	-1
5953	Devils on the Doorstep (Guizi lai le) (2000)	-1
5954	25th Hour (2002)	-1
5955	Antwone Fisher (2002)	-1
5956	Gangs of New York (2002)	-1
5957	Two Weeks Notice (2002)	-1
5959	Narc (2002)	-1
5961	Blue Steel (1990)	-1
5962	Body of Evidence (1993)	-1
5963	"Children's Hour	-1
5965	"Duellists	-1
5968	Miami Blues (1990)	-1
5969	My Girl 2 (1994)	-1
5970	My Girl (1991)	-1
5971	My Neighbor Totoro (Tonari no Totoro) (1988)	-1
5974	"Thief of Bagdad	-1
5975	War and Peace (1956)	-1
5979	Attack of the Crab Monsters (1957)	-1
5980	Black Christmas (1974)	-1
5984	"Story of O	-1
5986	Fat City (1972)	-1
5988	Quicksilver (1986)	-1
5989	Catch Me If You Can (2002)	-1
5990	Pinocchio (2002)	-1
5991	Chicago (2002)	

90469	Paranormal Activity 3 (2011)	-1
90471	Puncture (2011)	-1
90522	Johnny English Reborn (2011)	-1
90524	Abduction (2011)	-1
90528	This Must Be the Place (2011)	-1
90531	Shame (2011)	-1
90576	What's Your Number? (2011)	-1
90600	Headhunters (Hodejegerne) (2011)	-1
90603	Batman: Year One (2011)	-1
90630	Miss Representation (2011)	-1
90647	Puss in Boots (2011)	-1
90717	Tower Heist (2011)	-1
90719	J. Edgar (2011)	-1
90738	"Double	-1
90746	"Adventures of Tintin	-1
90769	Starsuckers (2009)	-1
90809	Tomboy (2011)	-1
90863	George Harrison: Living in the Material World (2011)	-1
90866	Hugo (2011)	-1
90888	Immortals (2011)	-1
90890	Jack and Jill (2011)	-1
90943	Into the Abyss (2011)	-1
90945	"Sign of Four	-1
91077	"Descendants	-1
91079	Like Crazy (2011)	-1
91094	"Muppets	-1
91104	"Twilight Saga: Breaking Dawn - Part 1	-1
91126	War Horse (2011)	-1
91128	"Rum Diary	-1
91233	Lifted (2006)	-1
91261	Hipsters (Stilyagi) (2008)	-1
91266	Another Cinderella Story (2008)	-

In [14]:
%%file ./INFOH515/rating_mapper.py
#!/home/bigdata/anaconda3/bin/python
import sys

for line in sys.stdin:                              # The input data comes from STDIN (i.e: The standard input)
    line = line.strip()                             # Removal of leading and trailing whitespaces
    tup  = line.split(',')                         # A line consists of multiple fields; split the line into fields, for ratings.csv the format is userId,movieId,rating,timestamp
    print('%s\t%s\t%s' % (tup[1], "", tup[2]))       # Output format is MovieID\tTitle\tRating. Key = tup[1] = movie id; value consists of a blank title and the rating (tup[2])

Writing ./INFOH515/rating_mapper.py
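
Once both mappers emit the common structure `MovieID\tTitle\tRating`, a third job can join the two outputs on the movie id. The reducer below is only a sketch of one way to do this, not the reference solution: the function name `average_ratings` and the sample records are assumptions, and it expects every line to have exactly three tab-separated fields, with all lines for the same movie id adjacent (for example after a sort on the first field).

```python
from itertools import groupby

def average_ratings(sorted_lines):
    """Yield (title, average rating) for each movie that has a title and at least one rating.

    Assumes every line is 'MovieID<TAB>Title<TAB>Rating' and that lines are sorted by MovieID.
    """
    records = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for movie_id, group in groupby(records, key=lambda rec: rec[0]):
        title, total, count = None, 0.0, 0
        for _, rec_title, rating in group:
            if rec_title:                 # record produced by the movie mapper
                title = rec_title
            else:                         # record produced by the rating mapper
                total += float(rating)
                count += 1
        if title is not None and count > 0:
            yield title, total / count

# Hypothetical merged and sorted input
sample = ["1\tToy Story (1995)\t-1", "1\t\t4.0", "1\t\t5.0", "2\tJumanji (1995)\t-1"]
print(list(average_ratings(sample)))   # [('Toy Story (1995)', 4.5)]
```

Movies without any rating, such as the second one in the sample, are simply not reported.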


In [22]:
rating_file = 'data/movielens/ratings.csv'
!cat {rating_file} | python ./INFOH515/rating_mapper.py > temp_res/ratings.txt
!cat temp_res/ratings.txt

1		4.0
3		4.0
6		4.0
47		5.0
50		5.0
70		3.0
101		5.0
110		4.0
151		5.0
157		5.0
163		5.0
216		5.0
223		3.0
231		5.0
235		4.0
260		5.0
296		3.0
316		3.0
333		5.0
349		4.0
356		4.0
362		5.0
367		4.0
423		3.0
441		4.0
457		5.0
480		4.0
500		3.0
527		5.0
543		4.0
552		4.0
553		5.0
590		4.0
592		4.0
593		4.0
596		5.0
608		5.0
648		3.0
661		5.0
673		3.0
733		4.0
736		3.0
780		3.0
804		4.0
919		5.0
923		5.0
940		5.0
943		4.0
954		5.0
1009		3.0
1023		5.0
1024		5.0
1025		5.0
1029		5.0
1030		3.0
1031		5.0
1032		5.0
1042		4.0
1049		5.0
1060		4.0
1073		5.0
1080		5.0
1089		5.0
1090		4.0
1092		5.0
1097		5.0
1127		4.0
1136		5.0
1196		5.0
1197		5.0
1198		5.0
1206		5.0
1208		4.0
1210		5.0
1213		5.0
1214		4.0
1219		2.0
1220		5.0
1222		5.0
1224		5.0
1226		5.0
1240		5.0
1256		5.0
1258		3.0
1265		4.0
1270		5.0
1275		5.0
1278		5.0
1282		5.0
1291		5.0
1298		5.0
1348		4.0
1377		3.0
1396		3.0
1408		3.0
1445		3.0
14

145		3.0
165		3.5
170		2.0
260		4.5
293		3.0
296		3.5
355		4.0
356		4.5
364		4.0
367		3.0
370		5.0
435		1.0
480		4.0
500		2.5
541		3.5
551		0.5
586		2.5
588		4.0
589		3.5
592		3.5
597		4.0
648		4.5
719		3.5
733		3.5
743		4.0
745		3.0
780		3.5
858		2.5
902		3.0
1036		3.5
1148		3.0
1196		4.5
1198		4.0
1200		2.0
1210		4.5
1214		1.5
1221		2.5
1230		2.5
1240		3.5
1265		3.5
1270		5.0
1291		4.0
1370		3.5
1391		0.5
1527		2.5
1544		4.5
1573		2.5
1580		4.5
1704		3.5
1721		3.5
1722		5.0
1909		2.0
1918		3.5
2000		3.5
2001		2.5
2002		3.5
2006		2.5
2011		5.0
2012		5.0
2023		2.5
2052		3.0
2087		2.0
2115		4.0
2167		3.5
2174		1.0
2273		3.5
2288		0.5
2355		3.0
2376		4.5
2378		2.5
2402		3.0
2403		2.5
2424		4.0
2529		4.0
2541		4.0
2571		4.0
2617		3.5
2628		4.0
2671		4.0
2683		2.5
2700		2.5
2706		5.0
2716		3.5
2717		4.0
2724		3.0
2762		0.5
2763		4.0
2858		2.0
2916		4.0
2947		4.5
2948		4.5
2949		4.5
2953		3.0
2959		

5064		5.0
5418		5.0
5669		4.5
6874		4.5
7153		5.0
7361		3.5
7438		4.5
7445		5.0
8464		4.0
8665		5.0
8874		5.0
8961		4.0
30749		5.0
31685		3.0
31696		4.5
33646		3.0
34405		4.5
34437		3.0
35836		4.5
36529		5.0
37733		5.0
39183		4.0
39444		5.0
40583		5.0
44191		5.0
44665		4.0
45447		3.5
46976		5.0
47099		4.0
47200		5.0
47610		5.0
47997		5.0
48304		5.0
48516		4.0
48738		4.5
48774		3.5
49272		5.0
49530		5.0
49651		4.5
50794		4.5
50872		4.0
51077		3.0
51255		5.0
51662		5.0
52245		4.0
52281		4.0
52328		5.0
52973		4.5
54286		5.0
54503		4.0
54736		5.0
54995		5.0
54997		5.0
54999		5.0
55118		5.0
55276		5.0
55363		3.0
55765		4.0
56801		3.0
57368		5.0
57528		3.0
57669		5.0
58559		5.0
58998		5.0
59369		4.0
59784		4.0
59900		3.0
60069		4.0
60684		4.0
61132		5.0
62374		4.0
63082		5.0
63113		4.0
64620		4.0
65514		3.5
68358		5.0
68954		4.0
69122		5.0
69481		4.0
70286		5.0
71535		4.5
72998		4.5
73017		5.0
74458		4.5
7609

103249		3.5
104841		3.0
106487		4.0
106489		3.0
106782		3.0
107406		1.5
109374		3.5
111362		3.0
111759		4.5
112175		4.5
112183		1.5
112852		4.5
114762		3.5
116797		3.5
116823		4.0
117529		4.0
118696		3.0
119145		4.0
122882		5.0
122886		4.5
122904		4.0
122918		3.5
122922		3.0
134130		5.0
135143		4.0
139385		3.0
142488		4.5
152081		3.0
164179		3.5
166528		4.5
166635		3.0
168252		4.0
176371		3.5
50		5.0
260		5.0
296		5.0
318		5.0
333		5.0
356		4.0
541		5.0
553		5.0
589		4.0
593		5.0
924		5.0
1101		5.0
1193		5.0
1196		5.0
1210		5.0
1258		5.0
1266		5.0
1580		4.0
1610		4.0
1645		4.0
1721		4.0
1917		4.0
2476		4.0
2571		5.0
2628		4.0
2762		5.0
2951		5.0
3175		5.0
3552		5.0
3671		5.0
3793		4.0
3827		4.0
3863		3.0
3994		4.0
4020		4.0
4022		4.0
4027		5.0
4370		2.0
4448		4.0
4643		4.0
4744		2.0
4848		4.0
4874		5.0
4876		3.0
4880		5.0
4963		3.0
36		5.0
150		4.0
246		4.5
318		4.5
356		4.0
376		1.5
508		4.5
527		5.0

26082		4.0
26150		4.5
26183		3.5
26184		4.5
26236		4.0
26237		4.0
26340		4.0
26347		4.0
26471		3.5
26578		4.5
26587		5.0
26662		4.0
26849		5.0
26903		4.0
27134		4.0
27706		3.0
27741		4.5
27773		5.0
27803		4.0
27815		4.5
27834		4.5
30707		4.0
30749		4.0
30793		4.0
30803		4.0
30812		4.0
30825		3.5
31658		4.0
31685		4.0
31696		3.5
32460		5.0
32587		4.5
32657		4.5
32892		5.0
32898		4.5
33124		3.5
33162		3.5
33493		4.0
33615		4.0
33660		4.5
33794		4.0
34048		3.5
34072		4.0
34150		3.0
34319		3.5
35836		3.0
36363		4.0
36401		3.0
36529		4.0
37729		4.0
37731		5.0
38061		3.0
38159		4.5
39183		3.5
40629		3.5
40815		3.0
41566		3.5
41571		3.5
42191		4.5
44022		4.0
44191		4.0
44555		3.5
44665		5.0
45447		4.0
45499		4.0
45517		4.0
45720		3.5
45722		4.0
46578		3.0
47099		4.5
47610		3.5
48043		3.0
48304		4.0
48385		4.5
48394		3.5
48516		5.0
48780		5.0
48872		4.5
49272		4.0
49278		4.0
49530		3.5
50794		4.5
50872		4.0
5125

6796		3.0
6808		3.0
6857		1.5
6863		4.0
6870		3.5
6873		3.0
6874		4.0
6879		3.0
6881		4.0
6890		4.5
6893		3.0
6902		4.0
6932		3.5
6934		0.5
6947		3.5
6953		3.5
6969		4.5
6971		4.0
6978		3.5
6979		3.5
7022		4.5
7038		4.0
7048		4.0
7076		4.0
7090		4.5
7123		4.5
7147		4.0
7153		1.0
7254		4.0
7264		4.0
7265		4.5
7299		4.0
7323		3.0
7352		4.0
7354		3.5
7361		4.5
7371		4.5
7438		4.5
7449		2.5
7698		3.5
7802		4.0
7934		4.0
8042		4.0
8360		3.0
8370		4.0
8581		4.5
8784		4.5
8949		4.5
8973		4.5
8984		3.0
30749		4.0
31410		3.5
32031		2.5
32587		4.5
33493		2.0
33794		3.5
34048		2.0
43396		3.5
44191		4.0
44199		3.0
44204		3.0
110		5.0
377		5.0
380		4.0
457		5.0
474		4.0
480		4.0
589		5.0
592		3.0
733		4.0
858		3.0
1036		4.0
1196		4.0
1200		5.0
1201		4.0
1210		4.0
1214		4.0
1220		4.0
1222		4.0
1233		4.0
1242		4.0
1287		5.0
1304		4.0
1370		5.0
1387		5.0
1408		4.0
1527		2.0
1580		4.0
1582		3.0
1610		5.0
1970		3.0

68358		4.0
68554		3.5
68659		3.0
68791		3.0
68793		2.5
68954		3.5
69069		3.5
69122		4.0
69278		3.5
69481		4.5
69526		3.0
69606		2.5
69757		4.0
69844		4.0
69951		4.0
70183		3.5
70286		4.0
70336		3.0
70361		3.0
71106		4.0
71254		3.5
71264		4.0
71520		3.5
71530		3.0
71535		4.5
72011		4.0
72167		3.5
72226		3.0
72378		3.0
72605		3.5
72641		4.0
72998		4.0
73017		4.5
73106		3.0
73431		2.5
73929		3.0
74458		5.0
74698		2.0
74789		2.5
74795		3.5
74946		4.0
75813		3.0
75985		3.5
76077		3.5
76093		3.5
76175		3.0
76251		4.5
77561		3.5
77667		3.0
78039		3.5
78088		3.5
78105		3.0
78209		3.5
78349		4.5
78469		3.0
78499		4.0
78637		3.5
79008		4.5
79057		3.5
79091		4.5
79132		5.0
79134		3.0
79139		3.5
79185		3.5
79224		3.5
79293		3.5
79428		3.5
79592		3.5
79695		4.0
79702		4.5
80166		3.0
80219		3.5
80363		3.5
80463		4.0
80489		4.5
80549		4.0
80693		4.0
80862		3.0
81229		3.5
81537		3.5
81562		4.5
81591		4.0
81834		4.0
8193

2020		5.0
2021		4.0
2023		3.0
2028		5.0
2034		4.0
2043		2.0
2046		3.0
2053		1.0
2054		1.0
2076		4.0
2080		1.0
2087		2.0
2097		2.0
2100		1.0
2105		3.0
2109		3.0
2115		2.0
2116		5.0
2117		5.0
2138		5.0
2140		5.0
2161		2.0
2167		5.0
2174		2.0
2193		3.0
2275		4.0
2287		5.0
2291		3.0
2302		1.0
2311		3.0
2363		3.0
2366		2.0
2376		4.0
2377		4.0
2406		2.0
2407		2.0
2409		1.0
2427		2.0
2455		3.0
2459		4.0
2467		5.0
2478		1.0
2513		2.0
2527		3.0
2528		4.0
2529		5.0
2530		5.0
2531		3.0
2532		3.0
2533		4.0
2542		5.0
2553		4.0
2571		5.0
2617		4.0
2628		2.0
2640		3.0
2641		3.0
2643		2.0
2657		3.0
2664		3.0
2700		3.0
2710		5.0
2716		2.0
2728		3.0
2746		2.0
2747		2.0
2761		3.0
2762		4.0
2791		3.0
2797		1.0
2798		3.0
2819		1.0
2858		4.0
2867		5.0
2871		4.0
2872		4.0
2901		4.0
2916		3.0
2918		3.0
2947		4.0
2949		4.0
2959		5.0
2968		4.0
2985		4.0
2986		3.0
2997		4.0
3032		5.0
3033		2.0
3081		4.0
3114		3.0
3153		5.0

30812		4.0
32139		4.0
33681		2.5
35836		4.5
42556		5.0
51662		3.5
56176		0.5
56837		3.0
62383		3.5
71302		4.5
71999		4.5
72998		4.5
80748		4.0
84799		4.5
86290		3.0
87287		4.0
89745		5.0
92760		3.5
94262		3.0
1		4.5
36		4.0
318		4.5
364		4.5
527		4.5
593		4.5
858		3.0
1193		4.5
1213		4.0
1682		4.0
2329		4.5
2959		3.5
3114		4.5
3949		3.5
4011		3.5
4226		3.5
4993		4.5
5952		4.5
6539		4.0
7143		3.5
7153		4.5
8798		3.5
8950		4.0
8961		4.0
33794		4.5
44191		4.0
48394		3.5
48516		4.5
48780		5.0
54286		3.5
58559		4.5
63082		3.5
64716		4.5
68358		4.0
68954		4.0
69481		4.0
71535		2.0
74458		4.5
78499		4.0
79132		5.0
90600		3.5
91500		2.5
91529		4.5
97304		3.5
99114		3.0
106782		4.0
109487		4.5
116797		4.0
19		2.0
21		2.0
34		5.0
47		3.0
110		5.0
150		5.0
153		3.0
161		4.0
165		2.0
185		4.0
208		3.0
253		3.0
292		4.0
296		2.0
316		3.0
318		3.0
339		3.0
344		2.0
349		4.0
356		5.0
364		3.0
367		2.0
377		2.0
3

47		0.5
150		3.5
296		3.5
318		3.5
337		4.5
339		5.0
356		5.0
364		5.0
471		5.0
480		4.0
500		5.0
539		4.5
587		2.0
588		4.5
595		4.0
597		4.5
1035		5.0
1197		5.0
1441		5.0
1614		0.5
2375		3.0
2424		4.0
2797		4.0
3247		4.5
3255		4.0
3408		5.0
3751		4.0
3793		4.0
3911		2.0
3969		2.5
3996		4.0
4022		3.5
4025		3.5
4027		2.0
4226		2.0
4306		4.5
4344		0.5
4816		2.0
4886		5.0
4896		5.0
4993		1.5
4995		4.5
5014		3.5
5266		1.0
5444		5.0
5816		5.0
5952		2.0
5989		4.0
6373		3.5
6377		5.0
6378		2.5
6593		4.0
7149		3.0
7153		2.0
7361		3.5
8360		3.5
8361		3.0
8368		5.0
8529		3.0
8533		3.5
8614		4.0
8644		2.5
8961		5.0
30793		3.5
34405		5.0
37729		2.5
40815		5.0
45517		4.0
50872		3.0
54001		4.5
54272		3.0
55820		2.5
56367		4.5
59784		2.5
60069		3.5
63082		2.5
66934		4.0
68954		2.5
69844		5.0
72641		5.0
79091		5.0
81834		5.0
88125		4.5
88810		4.0
91325		3.5
94959		2.0
16		4.0
58		2.0
110		4.0
151		3.0
296		4.5


3147		4.0
3148		3.0
3152		4.0
3153		3.5
3160		5.0
3163		3.5
3168		3.5
3174		4.0
3175		4.0
3176		5.0
3178		2.0
3181		4.5
3182		4.0
3186		3.5
3194		3.5
3196		3.5
3198		3.5
3201		3.5
3203		3.0
3206		2.0
3210		2.0
3211		4.0
3217		3.5
3219		2.5
3230		3.5
3244		3.5
3246		4.0
3247		2.0
3249		3.0
3251		3.5
3252		3.0
3253		3.5
3254		3.0
3255		3.5
3256		3.0
3258		2.5
3259		2.5
3260		4.5
3261		3.5
3263		2.0
3264		3.5
3269		2.0
3270		2.5
3271		3.0
3272		3.5
3273		4.0
3281		2.0
3296		3.5
3306		4.5
3307		4.5
3310		4.0
3317		4.0
3330		4.0
3334		3.0
3341		4.0
3350		4.0
3359		3.5
3360		3.5
3361		3.0
3362		3.5
3363		3.0
3364		3.5
3365		3.5
3368		3.5
3384		4.0
3385		2.0
3386		4.0
3393		2.0
3394		1.5
3396		3.0
3408		4.0
3412		3.5
3418		2.5
3420		2.5
3421		2.0
3424		4.0
3429		3.5
3435		4.0
3441		3.0
3447		4.0
3448		3.0
3451		4.5
3461		4.0
3462		4.5
3466		3.0
3467		3.5
3468		3.5
3469		2.5
3470		3.5
3471		3.5
3475		4.0

93831		1.5
93840		4.0
93980		1.0
94015		0.5
94130		3.5
1		4.0
22		3.5
186		3.0
296		4.0
342		3.5
345		4.0
377		4.0
708		4.0
852		4.0
1288		4.0
1289		2.5
1302		3.0
1307		4.5
1556		2.0
1682		4.0
1797		4.0
1968		4.0
2001		3.0
2100		3.5
2174		3.0
2289		4.5
2321		4.0
2353		4.0
2671		4.0
2797		3.5
2859		4.5
3039		4.0
3101		4.0
3102		4.5
3408		3.5
3481		4.5
3556		4.0
3783		4.0
3897		4.0
3911		4.0
4014		3.5
4022		4.0
4025		3.5
4027		4.0
4033		4.0
4034		4.5
4246		4.0
4306		4.5
4361		4.0
4447		3.5
4639		3.5
4700		3.0
4776		3.5
4823		2.5
4979		4.0
4993		4.0
4995		4.0
5013		4.0
5060		4.5
5299		3.5
5377		4.0
5445		3.5
5505		3.5
5669		3.5
5673		2.0
5812		4.5
5902		4.5
5945		3.0
5952		4.0
5955		4.0
5957		2.5
5989		4.0
5991		3.5
5992		2.5
6377		4.0
6552		4.5
6565		4.5
6596		3.5
6620		3.0
6708		4.5
6711		4.5
6870		5.0
6879		4.0
6927		4.0
6932		4.5
6942		4.0
6950		4.0
6953		4.0
7139		4.5
7149		4.5
7151		4.0
7153		

101525		3.0
101864		3.0
102125		3.0
102445		3.0
102590		3.0
102735		0.5
102749		0.5
102852		3.0
103027		3.0
103042		2.5
103228		3.0
103245		2.5
103366		2.0
103609		1.5
103772		3.0
103980		3.0
104017		0.5
104419		2.5
104837		2.0
104841		3.0
104925		2.5
105020		1.5
105720		3.0
105844		3.5
106072		2.5
106766		3.0
106782		3.0
106873		2.0
106916		3.5
106920		3.0
107406		3.5
107999		4.0
108090		3.5
108192		3.5
108945		2.5
109487		3.5
110102		3.0
110501		3.0
110541		2.5
110882		3.0
111360		3.0
111362		3.5
111743		1.0
111759		3.5
111785		0.5
112171		3.0
112183		3.5
112326		1.0
112552		3.0
112556		3.5
112623		3.0
112852		3.5
113849		2.5
114126		2.5
114180		2.5
114627		3.0
115149		3.0
115617		2.0
115713		3.5
115819		3.0
116044		1.5
116169		2.5
116799		3.0
116941		2.0
116963		1.5
117109		1.5
117630		1.0
117895		2.5
118290		2.0
119218		2.5
119828		3.0
120827		3.0
121035		2.0
121097		1.0
121231		3.5
122260		2.5
122433		1.0

In [17]:
%%file ./INFOH515/identity_field_mapper.py
#!/home/bigdata/anaconda3/bin/python
import sys

for line in sys.stdin:                              # The input data comes from STDIN (i.e., the standard input)
    sys.stdout.write(line)                          # Emit the line unchanged; it already ends with a newline

Writing ./INFOH515/identity_field_mapper.py
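
To check locally that an identity mapper passes records through unchanged, the same logic can be written as a function over any iterable of lines. This is only a minimal sketch: the name `identity_map` and the two sample records are hypothetical and are not part of the session's files.

```python
import io

def identity_map(lines, out):
    """Copy each input line to `out` unchanged (each line keeps its own newline)."""
    for line in lines:
        out.write(line)

# Hypothetical records in the movieId<TAB>title<TAB>rating layout
sample = io.StringIO("1219\t\t2.0\n1220\t\t5.0\n")
result = io.StringIO()
identity_map(sample, result)
print(repr(result.getvalue()))
```

Writing with `out.write` rather than `print` avoids appending a second newline to lines that already end with one.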


In [19]:
%%file ./INFOH515/movie_avg_value_reducer.py
#!/home/bigdata/anaconda3/bin/python
import sys

prev_movieId = None
prev_movieTitle = None
rating_sum = 0.0
rating_count = 0

def printRecord():
    # Print the average rating of the previous group, provided it has a title and at least one valid rating
    if prev_movieId is not None and prev_movieTitle is not None and rating_count > 0:
        print('%s\t%.2f' % (prev_movieTitle, rating_sum / rating_count))


for line in sys.stdin:                              # The input data comes from STDIN (i.e., the standard input)
    line = line.strip()                             # Removal of leading and trailing whitespace
    
    try: 
        movieId, movieTitle, movieRating = line.split('\t')             # Parsing of the expected movieId/title/rating triple
    except ValueError:
        # If the line does not hold exactly three tab-separated fields, we silently discard it
        continue

    try:
        # are we moving to the next group? 
        # (Recall that input is sorted on MovieId, so a change in movieId indicates a new group)
        if prev_movieId != movieId:
            printRecord()
            # store movieId
            prev_movieId = movieId
            # clear fields
            prev_movieTitle = None
            rating_sum = 0.0
            rating_count = 0
        
        if movieTitle != "":
            prev_movieTitle = movieTitle
        currentRating = float(movieRating)
        # if it is a valid (non-negative) rating, add it to the running sum and count
        if currentRating >= 0:
            rating_sum = rating_sum + currentRating
            rating_count = rating_count + 1
    except ValueError:                              # If the rating is not a number, we silently discard the line
        continue

# Output of the last record
printRecord()

Overwriting ./INFOH515/movie_avg_value_reducer.py
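
The grouping logic of this reducer can also be tried in memory, without files or pipes. The sketch below is an alternative formulation using `itertools.groupby`, not the reducer above. The function name `average_ratings` and the sample records are hypothetical. In particular, the `-1` rating on the title record is an assumption about how title records are marked, suggested only by the `currentRating >= 0` test.

```python
from itertools import groupby

def average_ratings(sorted_lines):
    """Yield (title, mean rating) per movieId from lines already sorted on movieId."""
    records = (line.rstrip("\n").split("\t") for line in sorted_lines)
    valid = (r for r in records if len(r) == 3)      # discard malformed lines
    for movie_id, group in groupby(valid, key=lambda r: r[0]):
        title, total, count = None, 0.0, 0
        for _, movie_title, rating in group:
            if movie_title:                          # title record for this movieId
                title = movie_title
            try:
                value = float(rating)
            except ValueError:                       # rating is not a number
                continue
            if value >= 0:                           # negative values are not ratings
                total += value
                count += 1
        if title is not None and count > 0:          # same condition as printRecord()
            yield title, total / count

# Hypothetical sample, sorted on movieId; "-1" is an assumed marker for title records
sample = [
    "1\tToy Story (1995)\t-1\n",
    "1\t\t4.0\n",
    "1\t\t3.5\n",
    "2\t\t5.0\n",                                    # no title record: group is dropped
]
print(list(average_ratings(sample)))                 # [('Toy Story (1995)', 3.75)]
```

As in the reducer, a group is emitted only when it has both a title and at least one valid rating, so movie 2 in the sample yields nothing.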


In [26]:
!cat temp_res/* | sort -k1,1 | python ./INFOH515/movie_avg_value_reducer.py

Toy Story (1995)	3.92
GoldenEye (1995)	3.50
City Hall (1996)	2.79
Human Planet (2011)	4.00
Comme un chef (2012)	3.50
Movie 43 (2013)	3.50
"Pervert's Guide to Ideology	3.50
Sightseers (2012)	4.50
Hansel & Gretel: Witch Hunters (2013)	2.90
Jim Jefferies: Fully Functional (EPIX) (2012)	4.50
Why Stop Now (2012)	1.50
Tabu (2012)	4.00
Extreme Measures (1996)	2.50
Upside Down (2012)	3.00
"Liability	3.00
Angst  (1983)	3.50
Stand Up Guys (2012)	2.50
Side Effects (2013)	3.92
Identity Thief (2013)	2.88
"ABCs of Death	3.50
"Glimmer Man	3.00
Beautiful Creatures (2013)	2.00
"Good Day to Die Hard	2.08
D3: The Mighty Ducks (1996)	2.19
21 and Over (2013)	3.62
Safe Haven (2013)	4.00
Frozen Planet (2011)	4.50
"Act of Killing	5.00
Universal Soldier: Day of Reckoning (2012)	4.00
"Chamber	3.50
Escape from Planet Earth (2013)	4.00
"Apple Dumpling Gang	2.60
Before Midnight (2013)	3.50
Snitch (2013)	3.50
"Davy Crockett	3.00
Dark Skies (2013)	3.00
Oh Boy (A Coffee in Berlin) (2012)	3.50
Journey to the West: Con

"Beautician and the Beast	1.83
SubUrbia (1997)	3.75
Trumbo (2015)	4.00
Our Lips Are Sealed (2000)	3.50
"Pest	2.00
Fools Rush In (1997)	3.08
Idaho Transfer (1973)	0.50
Witch Hunt (1999)	2.50
Touch (1997)	4.00
Concussion (2015)	2.50
Absolute Power (1997)	2.67
"Peanuts Movie	1.50
Bloodsport: The Dark Kumite (1999)	0.50
Formula of Love (1984)	5.00
"Amazing Panda Adventure	3.33
That Darn Cat (1997)	3.25
A Man from Boulevard des Capucines (1987)	4.00
The Adventures of Sherlock Holmes and Dr. Watson: The Hound of the Baskervilles (1981)	4.00
Vegas Vacation (National Lampoon's Las Vegas Vacation) (1997)	2.31
Blue Mountain State: The Rise of Thadland (2015)	3.00
Dil To Pagal Hai (1997)	4.00
The Boy and the Beast (2015)	4.00
Lost Highway (1997)	3.26
Rosewood (1997)	3.33
Donnie Brasco (1997)	3.74
Creed (2015)	3.75
Dragons: Gift of the Night Fury (2011)	5.00
Twinsters (2015)	4.00
Cosmic Scrat-tastrophe (2015)	5.00
Solace (2015)	2.00
Lost in the Sun (2015)	2.00
Booty 

"Karate Kid	2.78
"Karate Kid	1.75
Christmas Vacation (National Lampoon's Christmas Vacation) (1989)	3.57
You've Got Mail (1998)	3.12
"General	4.00
"Thin Red Line	3.30
"Faculty	2.48
Mighty Joe Young (1998)	2.80
Gordy (1995)	3.00
Mighty Joe Young (1949)	2.83
Patch Adams (1998)	3.37
Stepmom (1998)	3.08
"Civil Action	3.05
Hurlyburly (1998)	4.00
Tea with Mussolini (1999)	4.25
Affliction (1997)	3.17
Hilary and Jackie (1998)	3.83
Playing by Heart (1998)	3.38
At First Sight (1999)	2.69
In Dreams (1999)	2.00
Varsity Blues (1999)	3.27
Virus (1999)	2.67
Howard the Duck (1986)	2.16
"Gate	3.00
"Boy Who Could Fly	2.17
"Fly	3.38
"Fly	3.38
"Fly II	2.28
Running Scared (1986)	3.56
Armed and Dangerous (1986)	2.75
"Texas Chainsaw Massacre	3.27
Hoop Dreams (1994)	4.29
"Texas Chainsaw Massacre 2	2.00
Texas Chainsaw Massacre: The Next Generation (a.k.a. The Return of the Texas Chainsaw Massacre) (1994)	3.00
Ruthless People (1986)	3.32
Deadly Friend (1986)	1.50
"Name of the

Creepshow 2 (1987)	3.00
Re-Animator (1985)	3.45
Drugstore Cowboy (1989)	4.15
"Queen Margot (Reine Margot	3.20
Falling Down (1993)	3.53
"Funhouse	2.00
"General	4.00
Piranha (1978)	2.64
"Taming of the Shrew	2.50
Nighthawks (1981)	3.25
"Quick and the Dead	2.95
Yojimbo (1961)	4.23
Repossessed (1990)	2.38
"Omega Man	3.67
Spaceballs (1987)	3.48
Robin Hood (1973)	3.51
Mister Roberts (1955)	3.82
"Quest for Fire (Guerre du feu	3.79
Little Big Man (1970)	4.15
"Face in the Crowd	4.20
Trading Places (1983)	3.73
Roommates (1995)	3.67
Meatballs (1979)	3.22
Meatballs Part II (1984)	2.08
Meatballs III (1987)	2.50
Meatballs 4 (1992)	1.00
Dead Again (1991)	3.77
Peter's Friends (1992)	3.60
"Incredibly True Adventure of Two Girls in Love	4.50
Under the Rainbow (1981)	2.00
Ready to Wear (Pret-A-Porter) (1994)	2.83
Anywhere But Here (1999)	2.50
Dogma (1999)	3.65
"Messenger: The Story of Joan of Arc	3.91
Pokémon: The First Movie (1998)	2.31
Felicia's Journey (1999)	3.33
Ox

"Fog	2.10
Sorority House Massacre (1986)	5.00
Shopgirl (2005)	3.00
Sorority House Massacre II (1990)	5.00
Stay (2005)	3.00
Bamboozled (2000)	2.94
"Legend of Zorro	2.90
"Weather Man	3.39
Saw II (2005)	2.81
Prime (2005)	2.60
Digimon: The Movie (2000)	3.00
Get Carter (2000)	2.50
Get Carter (1971)	3.33
Meet the Parents (2000)	3.42
Requiem for a Dream (2000)	3.92
Tigerland (2000)	3.56
Two Family House (2000)	5.00
Don't Move (Non ti muovere) (2004)	4.50
"Contender	3.07
Dr. T and the Women (2000)	2.12
"Ladies Man	1.58
Billy Jack (1971)	2.67
Billy Jack Goes to Washington (1977)	2.00
"Time Machine	3.62
Ghoulies II (1987)	1.50
"Unsinkable Molly Brown	3.75
"Adventures of Ichabod and Mr. Toad	3.00
"Strange Love of Martha Ivers	3.50
Detour (1945)	3.50
Billy Elliot (2000)	3.85
Bedazzled (2000)	3.07
Pay It Forward (2000)	3.44
"Private Eyes	2.75
American Pie Presents: Band Camp (American Pie 4: Band Camp) (2005)	2.14
"Legend of Drunken Master	4.12
Book of Shadows: Bl

Rocket Science (2007)	3.25
Z (1969)	3.83
Halloween: Resurrection (Halloween 8) (2002)	2.00
Daddy Day Camp (2007)	0.50
Sex and Lucia (Lucía y el sexo) (2001)	3.88
"Invasion	3.00
Eight Legged Freaks (2002)	2.75
"Nanny Diaries	3.17
Halloween (2007)	3.25
Death Sentence (2007)	3.50
K-19: The Widowmaker (2002)	3.50
2 Days in Paris (2007)	3.50
Terminal Velocity (1994)	2.75
Stuart Little 2 (2002)	2.60
Austin Powers in Goldmember (2002)	2.85
"Kid Stays in the Picture	3.83
Tadpole (2002)	3.33
Who Is Cletis Tout? (2001)	2.50
"King of Kong	3.92
Nosferatu the Vampyre (Nosferatu: Phantom der Nacht) (1979)	4.00
Thirty-Two Short Films About Glenn Gould (1993)	4.50
The Big Bus (1976)	5.00
Taxi 4 (2007)	2.00
Behind the Mask: The Rise of Leslie Vernon (2006)	4.00
In Like Flint (1967)	4.00
"Brothers Solomon	0.50
"Nines	2.50
Our Man Flint (1965)	3.50
Red Beard (Akahige) (1965)	4.50
Robin and Marian (1976)	3.25
Planet Terror (2007)	3.80
3:10 to Yuma (2007)	4.06
Shoot 'Em Up (

Once Upon a Time in China (Wong Fei Hung) (1991)	3.36
Once Upon a Time in China II (Wong Fei-hung Ji Yi: Naam yi dong ji keung) (1992)	2.50
Once Upon a Time in China III (Wong Fei-hung tsi sam: Siwong tsangba) (1993)	3.00
Paper Moon (1973)	3.60
"Girl with the Dragon Tattoo	3.94
Sunshine Cleaning (2008)	3.38
Kung Fu Panda: Secrets of the Furious Five (2008)	2.88
Space Jam (1996)	2.71
Day of the Dead (1985)	3.62
"Hello	3.75
Memoirs of an Invisible Man (1992)	2.00
Echelon Conspiracy (2009)	2.00
Barbarella (1968)	3.17
Monsters vs. Aliens (2009)	3.25
Once Bitten (1985)	2.25
Squirm (1976)	2.00
"Brood	3.00
Anything Else (2003)	2.83
"Baader Meinhof Komplex	4.00
Cold Creek Manor (2003)	2.50
"Fighting Temptations	3.50
Secondhand Lions (2003)	3.50
Big Stan (2007)	2.75
Underworld (2003)	3.50
Bubba Ho-tep (2002)	3.32
In This World (2002)	3.00
Strictly Sexual (2008)	5.00
Duplex (2003)	2.43
"Rundown	3.15
Under the Tuscan Sun (2003)	3.28
Anvil! The Story of Anvil (2008)	3

"Cabin in the Woods	4.02
God Bless America (2011)	4.50
"Three Stooges	1.25
"Raven	2.00
North & South (2004)	4.50
Beautiful Girls (1996)	3.95
"Adventures of Robin Hood	4.00
"Big Bang	1.50
Mirror Mirror (2012)	3.00
Battleship (2012)	2.33
"Best Exotic Marigold Hotel	3.88
"Mark of Zorro	3.75
Comic-Con Episode IV: A Fan's Hope (2011)	2.00
Bully (2011)	3.50
Hysteria (2011)	4.00
Dante's Inferno: An Animated Epic (2010)	3.00
Laura (1944)	4.33
"Atomic Submarine	3.00
"Five-Year Engagement	2.50
"Ghost and Mrs. Muir	3.93
Think Like a Man (2012)	2.50
"Lucky One	4.00
Lost Horizon (1937)	4.00
Safe (2012)	3.50
Dark Shadows (2012)	2.50
96 Minutes (2011) 	2.50
Top Hat (1935)	4.07
"Decoy Bride	3.67
To Be or Not to Be (1942)	2.50
Rocket Singh: Salesman of the Year (2009)	4.00
"Dictator	3.56
My Man Godfrey (1936)	3.75
Walking with Monsters (2005)	4.00
Men in Black III (M.III.B.) (M.I.B.³) (2012)	3.28
Snow White and the Huntsman (2012)	2.83
Sound of My Voice (2011)	3.50
G