
# Python MapReduce Example

Word count implemented in pure Python.

This notebook shows how to run a MapReduce program written in Python on Hadoop.
In this example, Hadoop runs in standalone mode and reads data from the local filesystem; in cluster mode, data is typically read from HDFS, the Hadoop distributed file system.


### Download the dataset 

In [None]:
!wget -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0

--2020-10-12 14:13:43--  https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.68.1, 2620:100:6024:1::a27d:4401
Connecting to www.dropbox.com (www.dropbox.com)|162.125.68.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/n24v0z7y79np319/os_maias.txt [following]
--2020-10-12 14:13:43--  https://www.dropbox.com/s/raw/n24v0z7y79np319/os_maias.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc7f7f2e20bee4ec16561cdf66b9.dl.dropboxusercontent.com/cd/0/inline/BBJubJfDyLPKJvmbdSLG1o4_oNSEIBD23YY9ppDsBnviiEqSxrOXgEFxeWr2ohKnvaJHmCYpFCQV8LVKdtfIJyurRGp4zaPs2cibR8EQwZrQog/file# [following]
--2020-10-12 14:13:44--  https://uc7f7f2e20bee4ec16561cdf66b9.dl.dropboxusercontent.com/cd/0/inline/BBJubJfDyLPKJvmbdSLG1o4_oNSEIBD23YY9ppDsBnviiEqSxrOXgEFxeWr2ohKnvaJHmCYpFCQV8LVKdtfIJyurRGp4zaPs2cibR8EQwZrQog/file
Resolving u

## WordCount Example
Read the words from input and count them.

The processing is split into two steps:

+ The mapper emits, for each line, the number of words it contains;
+ The reducer sums all the tuples produced by the mapper stage.
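
As a minimal in-memory sketch of these two steps (no Hadoop involved), the mapper can be modelled as a function that turns each line into a `('words', n)` tuple, and the reducer as a function that sums the counts. The names `map_line` and `reduce_counts` and the sample lines are illustrative assumptions, not part of the notebook's scripts.

```python
def map_line(line):
    """Map step: emit one ('words', n) tuple for a line of text."""
    return ("words", len(line.split()))


def reduce_counts(pairs):
    """Reduce step: sum the counts of all tuples emitted by the mapper."""
    return sum(count for _key, count in pairs)


# Hypothetical sample input, standing in for the lines of the dataset.
sample_lines = ["a casa que os Maias", "vieram habitar em Lisboa"]
pairs = [map_line(line) for line in sample_lines]
total = reduce_counts(pairs)
print(pairs)
print("words\t%s" % total)
```

The scripts in the following sections do the same work, but exchange the tuples as tab-separated text lines on standard input and output.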

### Mapper

By starting a cell with "%%file", you specify that, when the cell is run, its contents are written to the named file on the local disk instead of being executed.

In [None]:
%%file mapper.py
#!/usr/bin/env python

# sys gives access to standard input
import sys
# string provides the set of ASCII punctuation characters
import string

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # remove punctuation characters
    line = line.translate(str.maketrans('', '', string.punctuation+'«»'))
    # split the line into words
    words = line.split()
    # emit the key 'words' and this line's word count, separated by a tab
    print('words\t%s' % len(words))

Writing mapper.py
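
To see what the punctuation step in the mapper does, the sketch below applies the same translation table to a single line. The sample sentence is a hypothetical input chosen for illustration and is not taken from the dataset.

```python
import string

# Same translation table as in mapper.py: delete ASCII punctuation and guillemets.
table = str.maketrans('', '', string.punctuation + '«»')

# Hypothetical sample line, not taken from the dataset.
sample = "«Olá, mundo!» disse ele."
cleaned = sample.strip().translate(table)
words = cleaned.split()
print(words)
print('words\t%s' % len(words))
```

Note that punctuation is deleted rather than replaced by a space, so a hyphenated form such as `guarda-chuva` is counted as a single word.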


### Reducer

In [None]:
%%file reducer.py
#!/usr/bin/env python

import sys

total_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    key, count = line.split('\t', 1)

    # convert count (currently a string) to int
    count = int(count)

    total_count += count

# emit the grand total once all input has been read
print('words\t%s' % (total_count))

Writing reducer.py


### Local execution

The scripts can be tested using just the Unix shell, as follows.

#### Make the scripts executable

In [None]:
!chmod a+x mapper.py && chmod a+x reducer.py

#### Execute

The execution workflow is as follows:

+ The input file is piped into the input of the mapper;
+ The output of the mapper is sorted;
+ The sorted output of the mapper is fed to the reducer stage.

In [None]:
!cat "os_maias.txt" | ./mapper.py | sort -k1,1 | ./reducer.py

words	213359
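
In this job the mapper only ever emits the key `words`, so the `sort` step does not change the result. It matters once a mapper emits several keys, because a streaming reducer typically relies on all lines for a given key arriving together. The sketch below simulates this on hypothetical mapper output using `itertools.groupby`; the helper name `reduce_sorted` and the sample keys are assumptions made for illustration.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum counts per key, assuming lines are 'key<TAB>count' sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    return {
        key: sum(int(count) for _key, count in group)
        for key, group in groupby(parsed, key=lambda pair: pair[0])
    }


# Hypothetical mapper output with two different keys, in arrival order.
mapper_output = ["b\t1", "a\t2", "b\t3", "a\t4"]

# Equivalent of `sort -k1,1`: order the lines by their first field.
sorted_output = sorted(mapper_output, key=lambda line: line.split("\t", 1)[0])
totals = reduce_sorted(sorted_output)
print(totals)
```

Calling `reduce_sorted` on the unsorted list would give wrong totals, since `groupby` only merges adjacent lines that share a key.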


### Hadoop standalone mode execution

To run on a Hadoop cluster, the input data should first be moved into an HDFS directory. In standalone mode, data can be read from the local filesystem.


The output directory must be removed first, because Hadoop refuses to start a job whose output directory already exists.

In [None]:
!rm -rf results

#### Submitting the job

The _hadoop_ command is used to submit the MapReduce job. In standalone mode the job runs locally rather than on a cluster.

In [None]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input os_maias.txt -output results

2020-10-12 14:14:27,322 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-10-12 14:14:27,516 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-10-12 14:14:27,516 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-10-12 14:14:27,555 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-10-12 14:14:27,816 INFO mapred.FileInputFormat: Total input files to process : 1
2020-10-12 14:14:27,836 INFO mapreduce.JobSubmitter: number of splits:1
2020-10-12 14:14:28,127 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local188657134_0001
2020-10-12 14:14:28,128 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-10-12 14:14:28,475 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/mapper.py as file:/tmp/hadoop-jovyan/mapred/local/job_local188657134_0001_8035d043-6753-422e-b584-2672be9c4446/mapper.py
2020-10-12 14:14:28,499 INFO mapred.LocalDistribut

#### Checking the results
The result is stored in the `results` directory.

In [None]:
!cat results/part-*

words	213359
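
To use the count in later Python cells instead of only printing it, the output files can be parsed. The sketch below assumes each line of a `part-*` file has the `key<TAB>value` form shown above. `read_counts` is a hypothetical helper, and the demo writes a sample file to a temporary directory so that it runs without Hadoop.

```python
import glob
import os
import tempfile


def read_counts(results_dir):
    """Parse 'key<TAB>count' lines from every part-* file in results_dir."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(results_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                key, count = line.split("\t", 1)
                counts[key] = counts.get(key, 0) + int(count)
    return counts


# Demo on a stand-in directory (assumption: same layout as `results`).
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "part-00000"), "w", encoding="utf-8") as f:
        f.write("words\t213359\n")
    demo_counts = read_counts(demo_dir)

print(demo_counts)
```

After the job above has run, `read_counts("results")` would be the corresponding call on the real output directory.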
