
# Python MapReduce Exercise

In this notebook, you will create a MapReduce program that counts the number of occurrences of each word in a text.

In this exercise, Hadoop runs in standalone mode and reads data from the local filesystem.


### Download the dataset 

In [None]:
!wget -O os_maias.txt "https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0"

--2020-10-02 18:49:15--  https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.68.1, 2620:100:6024:1::a27d:4401
Connecting to www.dropbox.com (www.dropbox.com)|162.125.68.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/n24v0z7y79np319/os_maias.txt [following]
--2020-10-02 18:49:16--  https://www.dropbox.com/s/raw/n24v0z7y79np319/os_maias.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uce02e56f85dd07a849545c3e633.dl.dropboxusercontent.com/cd/0/inline/BAjdG-vVqAWvj01J4DduV6-_kfeT71HNCKz1ho7mKII8cKPfb9qwtEOLxlFbAy2WTuikkUji_q2_KWiqubawgSFriqy_1PNF6DUVsKxMPXN5Zw/file# [following]
--2020-10-02 18:49:16--  https://uce02e56f85dd07a849545c3e633.dl.dropboxusercontent.com/cd/0/inline/BAjdG-vVqAWvj01J4DduV6-_kfeT71HNCKz1ho7mKII8cKPfb9qwtEOLxlFbAy2WTuikkUji_q2_KWiqubawgSFriqy_1PNF6DUVsKxMPXN5Zw/file
Resolving u

## WordCount Example
Read the words from input and count the number of occurrences of each word.
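
Before writing the Hadoop scripts, it may help to see the same idea in plain Python. The sketch below is not part of the exercise solution: the function names (`map_words`, `reduce_counts`, `word_count`) and the sample lines are made up for illustration, and `sorted` stands in for the sort that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map phase: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reduce phase: sum the values of consecutive pairs sharing a key."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, then sort by key (the shuffle step), then reduce."""
    pairs = sorted(map_words(lines), key=itemgetter(0))
    return dict(reduce_counts(pairs))


sample = ["w1 w2 w1", "w3 w1 w3"]  # made-up input, not taken from os_maias.txt
print(word_count(sample))  # {'w1': 3, 'w2': 1, 'w3': 2}
```

The mapper and reducer scripts below play the roles of `map_words` and `reduce_counts`; Hadoop supplies the sorting step in between.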


### Mapper
Complete the cell below with the code for the mapper.

In [None]:
%%file mapper_words.py
#!/usr/bin/env python

import string
import sys

# Translation table that deletes ASCII punctuation and the guillemets « ».
# str.maketrans is Python 3 only.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation + '«»')

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # remove punctuation characters
    line = line.translate(PUNCTUATION_TABLE)
    # split the line into words
    words = line.split()
    # emit one tab-separated (word, 1) pair per word
    for w in words:
        print('%s\t%s' % (w, 1))

Overwriting mapper_words.py
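
The punctuation step deletes characters rather than replacing them with spaces, which is why hyphenated forms show up joined in the results (for example `Abraçoua` and `Acabouse`). The sketch below reproduces this on a made-up sample line, and shows one possible alternative (an assumption, not what the mapper does) that maps punctuation to spaces instead.

```python
import string

# Same translation as mapper_words.py: delete ASCII punctuation and guillemets.
delete_table = str.maketrans('', '', string.punctuation + '«»')

# Hypothetical alternative: turn each punctuation character into a space.
space_table = str.maketrans({c: ' ' for c in string.punctuation + '«»'})

line = '«Abraçou-a», disse ele.'  # made-up sample line

deleted = line.strip().translate(delete_table).split()
spaced = line.strip().translate(space_table).split()

print(deleted)  # ['Abraçoua', 'disse', 'ele']
print(spaced)   # ['Abraçou', 'a', 'disse', 'ele']
```

Which behaviour is preferable depends on whether a verb with an attached pronoun should count as one word or two; the exercise as written uses the first form.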


### Reducer

In [None]:
%%file reducer_words.py
#!/usr/bin/python

# Example: for the input words "w1 w2 w1 w3 w1 w3", the reducer receives
# the mapper output sorted by key, one "word<TAB>count" pair per line:
# w1	1
# w1	1
# w1	1
# w2	1
# w3	1
# w3	1

import sys

last_word = ''
count_words = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper_words.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    count = int(count)

    if last_word != word:
        # a new word starts: emit the total of the previous one
        if last_word != '':
            print("%s\t%s" % (last_word, count_words))
        last_word = word
        count_words = count
    else:
        count_words += count

# emit the total of the last word, unless the input was empty
if last_word != '':
    print("%s\t%s" % (last_word, count_words))

Overwriting reducer_words.py
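
The reducer only compares each word with the previous one, so it depends on its input arriving sorted by key. The sketch below restates the same logic as a generator (the name `reduce_stream` is made up for illustration) and runs it on sorted and unsorted versions of the example from the reducer's comments.

```python
def reduce_stream(lines):
    """Sum the counts of consecutive lines that share the same word."""
    last_word, total = None, 0
    for line in lines:
        word, count = line.strip().split('\t', 1)
        if word != last_word:
            if last_word is not None:
                yield last_word, total
            last_word, total = word, 0
        total += int(count)
    if last_word is not None:
        yield last_word, total


unsorted = ["w1\t1", "w2\t1", "w1\t1", "w3\t1", "w1\t1", "w3\t1"]

# Sorted input: every word is grouped, so the totals are correct.
print(list(reduce_stream(sorted(unsorted))))
# [('w1', 3), ('w2', 1), ('w3', 2)]

# Unsorted input: the same word is emitted several times with partial counts.
print(list(reduce_stream(unsorted)))
# [('w1', 1), ('w2', 1), ('w1', 1), ('w3', 1), ('w1', 1), ('w3', 1)]
```

When the job runs under Hadoop, the framework sorts the mapper output by key before it reaches the reducer, so the scripts do not need to sort anything themselves.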


### Hadoop standalone mode execution


The output directory must not exist when the job starts, otherwise Hadoop refuses to run the job, so remove any previous results first.

In [None]:
!rm -rf results_words
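
If the notebook runs somewhere without a Unix shell, the same cleanup can be done from Python. This is a minimal sketch: `clear_output_dir` is a hypothetical helper name, and the demonstration uses a temporary directory so that no real results are touched.

```python
import os
import shutil
import tempfile


def clear_output_dir(path):
    """Delete path and everything below it; do nothing if it does not exist."""
    shutil.rmtree(path, ignore_errors=True)


# Demonstration in a throwaway directory.
base = tempfile.mkdtemp()
output_dir = os.path.join(base, 'results_words')
os.makedirs(output_dir)

clear_output_dir(output_dir)
clear_output_dir(output_dir)  # a second call is a no-op, like rm -rf
print(os.path.exists(output_dir))  # False

shutil.rmtree(base)
```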

#### Submitting the job

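The _hadoop_ command submits the MapReduce job. In standalone mode there is no cluster: the job runs locally, and the Hadoop Streaming jar launches the Python mapper and reducer scripts.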

In [None]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_words.py,reducer_words.py -mapper mapper_words.py -reducer reducer_words.py -input os_maias.txt -output results_words

2020-10-02 18:49:27,623 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-10-02 18:49:27,688 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-10-02 18:49:27,688 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-10-02 18:49:27,701 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-10-02 18:49:27,848 INFO mapred.FileInputFormat: Total input files to process : 1
2020-10-02 18:49:27,866 INFO mapreduce.JobSubmitter: number of splits:1
2020-10-02 18:49:28,010 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local2040609651_0001
2020-10-02 18:49:28,010 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-10-02 18:49:28,219 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/mapper_words.py as file:/tmp/hadoop-jovyan/mapred/local/job_local2040609651_0001_df4d4c4c-1948-48a6-9daa-53d728ea7433/mapper_words.py
2020-10-02 18:49:28,247 INFO mapred.

#### Checking the results
The result is stored in the results_words directory.

In [None]:
!cat results_words/part-*

0	2
1	1
15	1
1815	1
1830	3
1836	1
1848	1
1858	1
1870	1
1872	1
1875	3
1886	1
1887	1
20	1
26	2
3	1
32	1
3º	1
4	1
46	1
52	1
6	1
64	1
71	1
79	1
93	1
A	472
Abafava	1
Abaixo	1
Abalemos	1
Abandonaste	1
Abandoneia	1
Abandono	1
Abanouse	1
Abecê	1
Abegoaria	1
Abissínia	1
Abracemse	1
Abraão	10
Abraçaramse	1
Abraçou	1
Abraçoua	1
Abraçouo	1
Abria	2
Abril	4
Abrilada	1
Abriu	11
Absoluto	1
Acabou	1
Acabouse	2
Academia	4
Académico	2
Aceitar	1
Aceito	1
Aceitou	1
Acendeu	2
Acendia	1
Aceso	1
Acha	2
Achandose	1
Acharaa	1
Acharam	1
Achas	3
Achava	2
Achavaa	1
Achavao	2
Achavase	1
Acheia	1
Acheime	1
Acho	4
Achote	1
Achou	2
Achoua	1
Achoulhe	1
Achouo	1
Acompanhada	1
Acordame	1
Acordaria	1
Acordou	3
Acreditas	1
Acredite	4
Acrópole	1
Acudam	1
Addisson	1
Adeus	7
Adiante	7
Admirável	1
Adormeci	1
Adosinda	6
Adquirese	1
Adélia	13
Afastado	1
Aferrolhou	1
Afigiame	1
Afirmoumo	1
Afonso	320
Africana	1
Agarrara	1
Agarraralhe

## Sorting
The results are ordered by word, not by frequency. Let's sort them by frequency (the words with higher occurrence first).

### Mapper
Complete the cell below with the code for the mapper.

In [None]:
%%file mapper_sort.py
#!/usr/bin/env python

import sys

# input comes from STDIN (standard input): the "word<TAB>count" lines
# produced by the word-count job
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into the word and its count
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    count = int(count)

    # emit the count as a zero-padded key, so that the text ordering of the
    # keys matches their numeric ordering (valid for counts up to 99999)
    print('%05d\t%s' % (count, word))

Overwriting mapper_sort.py
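
The mapper pads the count with zeros because, by default, Hadoop Streaming compares keys as text rather than as numbers. The sketch below uses a few counts taken from the word-count output to show how the two orderings differ; `sorted` stands in for Hadoop's sort.

```python
counts = [320, 7, 13, 472, 1]  # a few counts from the word-count output

# Keys compared as plain text: '13' sorts before '7'.
plain = sorted(str(c) for c in counts)

# Zero-padded keys: text order now agrees with numeric order.
padded = sorted('%05d' % c for c in counts)

print(plain)   # ['1', '13', '320', '472', '7']
print(padded)  # ['00001', '00007', '00013', '00320', '00472']
```

Five digits are enough as long as no word occurs more than 99999 times; a larger corpus would need a wider format.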


### Reducer

In [None]:
%%file reducer_sort.py
#!/usr/bin/python

import sys

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper_sort.py
    count, word = line.split('\t', 1)

    # convert count (currently a zero-padded string) to int,
    # which drops the padding
    count = int(count)

    print('%d\t%s' % (count, word))

Overwriting reducer_sort.py
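
With these scripts the keys are sorted in ascending order, so the least frequent words come out first. One possible way to get the most frequent words first without changing the Hadoop command is to emit a complemented key in the mapper and undo it in the reducer. This is a sketch of that idea, not part of the original solution: `MAX_COUNT`, `sort_key` and `restore_count` are hypothetical names, and `sorted` stands in for Hadoop's sort.

```python
MAX_COUNT = 99999  # hypothetical upper bound matching the five-digit key width


def sort_key(count):
    """Zero-padded key whose ascending text order is descending count order."""
    return '%05d' % (MAX_COUNT - count)


def restore_count(key):
    """Invert sort_key to recover the original count."""
    return MAX_COUNT - int(key)


# A few (word, count) pairs taken from the word-count output.
pairs = [('A', 472), ('Adeus', 7), ('Afonso', 320), ('Abril', 4)]

mapped = sorted((sort_key(c), w) for w, c in pairs)  # stands in for the sort
result = [(restore_count(k), w) for k, w in mapped]

print(result)
# [(472, 'A'), (320, 'Afonso'), (7, 'Adeus'), (4, 'Abril')]
```

In the scripts, this would mean printing `sort_key(count)` in the mapper and applying `restore_count` in the reducer instead of `int`.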


### Hadoop standalone mode execution


As before, remove the output directory left by any previous run.

In [None]:
!rm -rf results_sort

#### Submitting the job

The _hadoop_ command submits the MapReduce job, this time with the output of the word-count job (results_words/part-*) as input.

In [None]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_sort.py,reducer_sort.py -mapper mapper_sort.py -reducer reducer_sort.py -input results_words/part-* -output results_sort

2020-10-02 18:53:25,380 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-10-02 18:53:25,472 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-10-02 18:53:25,472 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-10-02 18:53:25,491 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-10-02 18:53:25,753 INFO mapred.FileInputFormat: Total input files to process : 1
2020-10-02 18:53:25,779 INFO mapreduce.JobSubmitter: number of splits:1
2020-10-02 18:53:26,009 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local735000171_0001
2020-10-02 18:53:26,010 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-10-02 18:53:26,310 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/mapper_sort.py as file:/tmp/hadoop-jovyan/mapred/local/job_local735000171_0001_aec9e115-2711-440e-81c8-0db5bd2f3283/mapper_sort.py
2020-10-02 18:53:26,339 INFO mapred.Loca

#### Checking the results
The result is stored in the results_sort directory.

In [None]:
!cat results_sort/part-*

1	esmagadas
1	esguios
1	consultarme
1	esguias
1	consultas
1	esgueirouse
1	esgueiravase
1	esguedelhados
1	consultavame
1	esguedelhadas
1	surdir
1	esgrouviada
1	consumado
1	esgotouo
1	esgoto
1	esgotara
1	esgotada
1	esgaçava
1	esgazearemse
1	consumar
1	esgatanhando
1	consumação
1	esganiçava
1	esganiçados
1	esganiçado
1	esganiçada
1	esganalo
1	esgalgado
1	esgalgada
1	surdia
1	esfuzia
1	esfuracou
1	esfuracando
1	esfuminho
1	esfumadas
1	esfumada
1	esfriar
1	consumidas
1	consumindo
1	esfregar
1	consumiria
1	esfrangalhalo
1	surdas
1	contacto
1	esforçara
1	esforçandose
1	esforçado
1	esfomeados
1	esfomeado
1	esfolou
1	esfolhavase
1	esfolhavamse
1	esfolhadas
1	esfolado
1	esfiava
1	contactos
1	esfarrapada
1	esfaqueandose
1	esfalfados
1	contada
1	esfalfada
1	esfaimado
1	esfacelamento
1	contagiosa
1	escárnios
1	escute
1	contaminado
1	Esvaziou
1	contandolha
1	escutassem
1	escutarem
1	contandolhe
1	contaram
1	escutandolhe
1	cont