# MapReduce Activity

Using the dataset described below, extract the requested information.

### Dataset: commercial transactions between countries
- Data on commercial transactions between countries over time
- 2 files:
  - base.csv ~433MB (3M events)
  - base_sample.csv ~4MB (35k events)
- 4.4M instances

### Dataset format

|  # | Field name | Description                                                  |
|----|------------|--------------------------------------------------------------|
|  0 | Country    | Country involved in the commercial transaction               |
|  1 | Year       | Year in which the transaction took place                     |
|  2 | Code       | Commodity code                                               |
|  3 | Commodity  | Commodity description                                        |
|  4 | Flow       | Flow, e.g. Export or Import                                  |
|  5 | Value      | Value in dollars                                             |
|  6 | Weight     | Commodity weight                                             |
|  7 | Unit       | Commodity unit of measure, e.g. number of items              |
|  8 | Quantity   | Quantity, expressed in the commodity's specified unit        |
|  9 | Category   | Commodity category, e.g. Animal Product                      |
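
The field positions in the table can be given names before writing any job. The following is a minimal sketch, assuming the `;` separator used by the jobs in this notebook; the `Trade` type, the `parse_line` helper, and the sample row are hypothetical and only illustrate the layout.

```python
from typing import NamedTuple, Optional


class Trade(NamedTuple):
    # One attribute per column, in the order given by the format table.
    country: str
    year: str
    code: str
    commodity: str
    flow: str
    value: str
    weight: str
    unit: str
    quantity: str
    category: str


def parse_line(line: str, sep: str = ";") -> Optional[Trade]:
    """Return a Trade for a row with exactly 10 fields, otherwise None."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != len(Trade._fields):
        return None
    return Trade(*fields)


# Made-up row, for illustration only.
sample = "Brazil;2016;010111;Horses, live pure-bred breeding;Export;1000;500;Number of items;2;01_live_animals"
print(parse_line(sample))
```

Values are kept as strings here; numeric fields such as weight still need a guarded `float(...)` conversion because some rows may leave them empty.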


### Information to extract:

1. Country with the largest number of commercial transactions;
2. Commodity with the largest number of commercial transactions in Brazil (the dataset is in English, so use Brazil, with a Z; use Java's "contains" function);
3. Number of financial transactions per year;
4. Commodity with the largest number of financial transactions;
5. Commodity with the largest number of financial transactions in 2016;
6. Commodity with the largest number of financial transactions in 2016, in Brazil (the dataset is in English, so use Brazil, with a Z);
7. Commodity with the largest total weight, across all commercial transactions;
8. Commodity with the largest total weight, across all commercial transactions, broken down by year;
9. Average weight per commodity, broken down by year;
10. Average weight per commodity traded in Brazil (the dataset is in English, so use Brazil, with a Z), broken down by year;
11. Average weight per commodity traded in Brazil (the dataset is in English, so use Brazil, with a Z), by flow, broken down by year;
12. Average price of commodities by year;
13. Maximum, minimum, and average values of each commodity type per year;
14. Country with the highest commodity price in flows of type Export;
15. Number of commercial transactions by flow, by year;
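
Most of these items follow the same pattern: map each row to a (key, value) pair, group the values by key, then aggregate each group. The following is a minimal pure-Python sketch of that flow for item 1, useful for trying out mapper logic without running a job. The `run_local` helper and the sample rows are hypothetical; the `;` separator is the one used by the jobs in this notebook.

```python
from collections import defaultdict


def mapper(line):
    # Map: emit (country, 1) for each ';'-separated row.
    yield line.split(";")[0], 1


def reducer(key, values):
    # Reduce: add up the 1s collected for one key.
    yield key, sum(values)


def run_local(lines):
    # Shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Apply the reducer once per key and collect the results.
    return dict(pair for key in sorted(groups) for pair in reducer(key, groups[key]))


# Made-up rows, for illustration only.
rows = [
    "Canada;2016;0101;Horses;Export;10;5;Number of items;1;01_live_animals",
    "Brazil;2015;0101;Horses;Import;20;7;Number of items;2;01_live_animals",
    "Canada;2014;0102;Sheep;Export;30;9;Number of items;3;01_live_animals",
]
counts = run_local(rows)
print(max(counts.items(), key=lambda kv: kv[1]))  # -> ('Canada', 2)
```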

In [None]:
# Install the local MapReduce environment
!pip --quiet install mrjob
# Download the full dataset (400MB)  https://drive.google.com/file/d/1LdeDR5wKP15kUywI7491OmKzdMzVgO-c/view?usp=sharing
!gdown 1LdeDR5wKP15kUywI7491OmKzdMzVgO-c
# Download the sample dataset (4MB)   https://drive.google.com/file/d/1CBi-jpBlJrKX4BmWMO0bM-DjLeLXrnDm/view?usp=sharing
!gdown 1CBi-jpBlJrKX4BmWMO0bM-DjLeLXrnDm

Downloading...
From: https://drive.google.com/uc?id=1LdeDR5wKP15kUywI7491OmKzdMzVgO-c
To: /content/base.csv
100% 444M/444M [00:04<00:00, 108MB/s] 
Downloading...
From: https://drive.google.com/uc?id=1CBi-jpBlJrKX4BmWMO0bM-DjLeLXrnDm
To: /content/base_sample.csv
100% 4.06M/4.06M [00:00<00:00, 77.7MB/s]


### Information 1
 - Country with the largest number of commercial transactions;

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, linha):
        # Field 0 of each ';'-separated row is the country.
        campos = linha.split(';')
        pais = campos[0]
        yield pais, 1

    def reducer(self, chave, valores):
        # Sum the 1s emitted for each country to get its transaction count.
        yield chave, sum(valores)

if __name__ == '__main__':
     WordCount.run()

Writing pratica.py
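
The job emits every country with its count, so choosing the top entry is left to post-processing (the cells that follow use `sort | head`). As the outputs in this notebook show, each result line is a JSON-encoded key, a tab, and a JSON-encoded value. The following is a minimal sketch of ranking such lines in Python; the `top_n` helper is hypothetical, and the sample lines are copied from the full-dataset output of this section.

```python
import heapq
import json


def top_n(lines, n=1):
    """Return the n (key, value) pairs with the largest values."""
    pairs = []
    for line in lines:
        # Each line is '<json key>\t<json value>'.
        key_json, value_json = line.rstrip("\n").split("\t", 1)
        pairs.append((json.loads(key_json), json.loads(value_json)))
    return heapq.nlargest(n, pairs, key=lambda kv: kv[1])


sample = ['"Canada"\t69468', '"Australia"\t89487', '"Austria"\t58683']
print(top_n(sample))  # -> [('Australia', 89487)]
```

Because the keys are decoded with `json.loads`, the same helper also works for composite keys, which appear in the output as JSON lists.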


Preliminary test with the sample dataset (Top 1 = "Canada", 936)

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.183948.389010
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.183948.389010...

Top 10 Entradas da atividade
"Canada"	936
"France"	762
"Germany"	740
"Netherlands"	717
"Greece"	687
"Italy"	655
"Denmark"	646
"Czech Rep."	643
"Malaysia"	640
"Austria"	614


Test with the full dataset (Top 1 = "Australia", 89487)

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.183954.730395
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.183954.730395...

Top 10 Entradas da atividade
"Australia"	89487
"Canada"	69468
"Austria"	58683
"Argentina"	57729
"China, Hong Kong SAR"	55535
"Brazil"	54356
"Denmark"	49905
"Belgium"	49525
"China"	49275
"France"	46052


### Information 2
 - Commodity with the largest number of commercial transactions in Brazil (the dataset is in English, so use Brazil, with a Z)
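
The task statement mentions Java's `contains`; in Python the closest counterpart is the `in` operator on strings. The sketch below contrasts it with exact equality, which is stricter: it rejects values that carry surrounding whitespace or extra qualifiers. Both helper functions and every example value other than `Brazil` are made up for illustration.

```python
def matches_exact(country: str) -> bool:
    # True only when the field is exactly 'Brazil'.
    return country == "Brazil"


def matches_contains(country: str) -> bool:
    # Python counterpart of Java's String.contains: a substring test.
    return "Brazil" in country


# Hypothetical field values, not taken from the dataset.
for value in ["Brazil", " Brazil ", "Brazil (hypothetical region)", "Canada"]:
    print(repr(value), matches_exact(value), matches_contains(value))
```

Which test is appropriate depends on how the country field is actually written in the dataset; the two agree whenever the field is the bare string `Brazil`.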

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        try:
            fields = line.split(';')
            country = fields[0]
            # Keep only Brazil's rows; field 3 is the commodity description.
            if country == 'Brazil':
                commodity = fields[3]
                yield commodity, 1
        except IndexError:
            # Skip malformed rows with too few fields.
            pass

    def reducer(self, chave, valores):
        # Sum the 1s emitted for each commodity.
        yield chave, sum(valores)

if __name__ == '__main__':
     WordCount.run()

Overwriting pratica.py


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.185725.743148
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.185725.743148...

Top 10 Entradas da atividade
"Horses, live except pure-bred breeding"	64
"Animals, live, except farm animals"	56
"Bovine animals, live pure-bred breeding"	56
"Fowls, live domestic < 185 grams"	56
"Bovine animals, live, except pure-bred breeding"	54
"Horses, live pure-bred breeding"	54
"Swine, live pure-bred breeding"	52
"Sheep, live"	47
"Poultry, live except domestic fowls, < 185 grams"	32
"Goats, live"	30


Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.185735.462604
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.185735.462604...

Top 10 Entradas da atividade
"Industrial fatty alcohols"	84
"Cigarette or pipe tobacco and tobacco substitute mixe"	69
"Synthetic organic pigments and preps based thereon"	69
"Beauty, makeup and suntan preparations nes"	68
"Chocolate/cocoa food preparations nes"	68
"Chocolate, cocoa preps, block, slab, bar, filled, >2k"	68
"Essential oils, nes"	68
"Hair preparations, nes"	68
"Hair shampoos"	68
"Jams, fruit jellies, purees and pastes, except citrus"	68


### Information 3
 - Number of financial transactions per year;

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        try:
            fields = line.split(';')
            # Field 1 is the year of the transaction.
            year = fields[1]
            yield year, 1
        except IndexError:
            # Skip malformed rows with too few fields.
            pass

    def reducer(self, chave, valores):
        # Sum the 1s emitted for each year.
        yield chave, sum(valores)

if __name__ == '__main__':
     WordCount.run()

Overwriting pratica.py


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.190158.938154
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.190158.938154...

Top 10 Entradas da atividade
"2012"	1628
"2013"	1621
"2003"	1580
"2010"	1571
"2014"	1568
"2011"	1566
"2001"	1557
"2015"	1551
"2000"	1549
"2007"	1538


Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.190204.019480
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.190204.019480...

Top 10 Entradas da atividade
"2012"	137457
"2011"	137418
"2010"	137198
"2009"	136645
"2006"	135996
"2007"	135855
"2013"	135501
"2005"	133867
"2014"	133270
"2008"	133210


### Information 4
 - Commodity with the largest number of financial transactions;

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        try:
            fields = line.split(';')
            # Field 3 is the commodity description.
            commodity = fields[3]
            yield commodity, 1
        except IndexError:
            # Skip malformed rows with too few fields.
            pass

    def reducer(self, chave, valores):
        # Sum the 1s emitted for each commodity.
        yield chave, sum(valores)

if __name__ == '__main__':
     WordCount.run()

Overwriting pratica.py


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.190312.409759
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.190312.409759...

Top 10 Entradas da atividade
"Animals, live, except farm animals"	4413
"Fowls, live domestic < 185 grams"	3383
"Horses, live except pure-bred breeding"	3331
"Horses, live pure-bred breeding"	2971
"Bovine animals, live, except pure-bred breeding"	2758
"Poultry, live except domestic fowls, < 185 grams"	2494
"Bovine animals, live pure-bred breeding"	2490
"Sheep, live"	2287
"Goats, live"	2002
"Poultry, live except domestic fowls, > 185 grams"	1991


Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.190315.345501
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.190315.345501...

Top 10 Entradas da atividade
"Food preparations nes"	8048
"Sugar confectionery not chewing gum, no cocoa content"	7764
"Sauces nes, mixed condiments, mixed seasoning"	7693
"Cigarettes containing tobacco"	7620
"Communion wafers, rice paper, bakers wares nes"	7530
"Sweet biscuits, waffles and wafers"	7402
"Chocolate/cocoa food preparations nes"	7267
"Refined sugar, in solid form, nes, pure sucrose"	7258
"Animal feed preparations nes"	7242
"Coffee extracts, essences, concentrates, preparations"	7043


### Information 5
 - Commodity with the largest number of financial transactions in 2016

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        try:
            fields = line.split(';')
            commodity = fields[3]
            year = fields[1]
            # Count only the transactions that took place in 2016.
            if year == '2016':
                yield commodity, 1
        except IndexError:
            # Skip malformed rows with too few fields.
            pass

    def reducer(self, chave, valores):
        # Sum the 1s emitted for each commodity.
        yield chave, sum(valores)

if __name__ == '__main__':
     WordCount.run()

Overwriting pratica.py


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.190508.913059
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.190508.913059...

Top 10 Entradas da atividade
"Animals, live, except farm animals"	161
"Horses, live except pure-bred breeding"	121
"Fowls, live domestic < 185 grams"	119
"Horses, live pure-bred breeding"	107
"Bovine animals, live, except pure-bred breeding"	98
"Bovine animals, live pure-bred breeding"	89
"Sheep, live"	86
"Swine, live pure-bred breeding"	71
"Goats, live"	70
"Poultry, live except domestic fowls, > 185 grams"	69


Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.190528.138040
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.190528.138040...

Top 10 Entradas da atividade
"Food preparations nes"	278
"Cigarettes containing tobacco"	272
"Communion wafers, rice paper, bakers wares nes"	269
"Sauces nes, mixed condiments, mixed seasoning"	268
"Sugar confectionery not chewing gum, no cocoa content"	267
"Sweet biscuits, waffles and wafers"	267
"Chocolate/cocoa food preparations nes"	258
"Malt extract & limited cocoa pastrycooks products nes"	251
"Animal feed preparations nes"	249
"Coffee extracts, essences, concentrates, preparations"	249


### Information 6
 - Commodity with the largest number of financial transactions in 2016, in Brazil (the dataset is in English, so use Brazil, with a Z);

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        try:
            fields = line.split(';')
            country = fields[0]
            # Keep only Brazil's rows; field 3 is the commodity description.
            if country == 'Brazil':
                commodity = fields[3]
                yield commodity, 1
        except IndexError:
            # Skip malformed rows with too few fields.
            pass

    def reducer(self, chave, valores):
        # Sum the 1s emitted for each commodity.
        yield chave, sum(valores)

if __name__ == '__main__':
     WordCount.run()

Overwriting pratica.py
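
Note that the mapper above filters on country only, so the counts that follow cover Brazil's transactions in all years and coincide with the results of Information 2. The following is a minimal sketch of a mapper that also restricts to 2016, written as a plain generator so it can be tried without mrjob. The function name and the sample rows are hypothetical.

```python
def mapper_brazil_2016(line):
    fields = line.split(";")
    if len(fields) < 4:
        return  # malformed row: nothing to emit
    country, year, commodity = fields[0], fields[1], fields[3]
    # Apply both conditions from the task statement.
    if country == "Brazil" and year == "2016":
        yield commodity, 1


# Made-up rows, for illustration only.
rows = [
    "Brazil;2016;0101;Horses;Export;10;5;Number of items;1;01_live_animals",
    "Brazil;2015;0101;Horses;Export;10;5;Number of items;1;01_live_animals",
    "Canada;2016;0102;Sheep;Import;10;5;Number of items;1;01_live_animals",
]
emitted = [pair for row in rows for pair in mapper_brazil_2016(row)]
print(emitted)  # -> [('Horses', 1)]
```

Inside the job, the same two-condition test would replace the single `country == 'Brazil'` check; the reducer stays unchanged.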


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.190655.243027
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.190655.243027...

Top 10 Entradas da atividade
"Horses, live except pure-bred breeding"	64
"Animals, live, except farm animals"	56
"Bovine animals, live pure-bred breeding"	56
"Fowls, live domestic < 185 grams"	56
"Bovine animals, live, except pure-bred breeding"	54
"Horses, live pure-bred breeding"	54
"Swine, live pure-bred breeding"	52
"Sheep, live"	47
"Poultry, live except domestic fowls, < 185 grams"	32
"Goats, live"	30


Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.190700.295477
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.190700.295477...

Top 10 Entradas da atividade
"Industrial fatty alcohols"	84
"Cigarette or pipe tobacco and tobacco substitute mixe"	69
"Synthetic organic pigments and preps based thereon"	69
"Beauty, makeup and suntan preparations nes"	68
"Chocolate/cocoa food preparations nes"	68
"Chocolate, cocoa preps, block, slab, bar, filled, >2k"	68
"Essential oils, nes"	68
"Hair preparations, nes"	68
"Hair shampoos"	68
"Jams, fruit jellies, purees and pastes, except citrus"	68


### Information 7
 - Commodity with the largest total weight, across all commercial transactions;

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        fields = line.split(';')
        commodity = fields[3]
        try:
            # Field 6 is the weight; rows with a missing or non-numeric weight are skipped.
            weight = float(fields[6])
            yield commodity, weight
        except (IndexError, ValueError):
            pass

    def reducer(self, chave, valores):
        # Total weight per commodity.
        yield chave, sum(valores)

if __name__ == '__main__':
     WordCount.run()

Overwriting pratica.py


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.195118.622196
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.195118.622196...

Top 10 Entradas da atividade
"Bovine animals, live, except pure-bred breeding"	72350442141.0
"Swine, live except pure-bred breeding > 50 kg"	38798669146.0
"Fowls, live domestic > 185 grams"	18665789461.0
"Sheep, live"	13771569405.0
"Swine, live except pure-bred breeding < 50 kg"	12228873816.0
"Bovine animals, live pure-bred breeding"	7733896254.0
"Poultry, live except domestic fowls, > 185 grams"	4546972430.0
"Horses, live except pure-bred breeding"	2396204313.0
"Animals, live, except farm animals"	1469485365.0
"Fowls, live domestic < 185 grams"	1404224958.0


Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.195122.404855
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.195122.404855...

Top 10 Entradas da atividade
"Petroleum oils, oils from bituminous minerals, crude"	46002412921988.0
"Iron ore, concentrate, not iron pyrites,unagglomerate"	34878419167261.0
"Ice, snow and potable water not sweetened or flavoure"	25759966772196.0
"Bituminous coal, not agglomerated"	21959118134095.0
"Oils petroleum, bituminous, distillates, except crude"	19216117424056.0
"Natural gas in gaseous state"	10696575625293.0
"Iron ore, concentrate, not iron pyrites, agglomerated"	7116880502870.0
"Coal except anthracite or bituminous, not agglomerate"	6161036788891.0
"Wheat except durum wheat, and meslin"	5452684784869.0
"Natural gas, liquefied"	5297769728104.0


### Information 8
 - Commodity with the largest total weight, across all commercial transactions, broken down by year;

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        fields = line.split(';')
        commodity = fields[3]
        year = fields[1]
        try:
            # Composite key: one total per (year, commodity) pair.
            weight = float(fields[6])
            yield (year, commodity), weight
        except (IndexError, ValueError):
            # Skip rows with a missing or non-numeric weight.
            pass

    def reducer(self, chave, valores):
        # Total weight per (year, commodity).
        yield chave, sum(valores)

if __name__ == '__main__':
     WordCount.run()

Overwriting pratica.py
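
The job above yields one total per (year, commodity) pair, and the global `sort | head` used in the tests ranks those pairs across all years rather than picking a winner for each year. The following is a minimal sketch of that last selection in plain Python; in mrjob it would typically become a second step keyed by year. The `max_per_year` helper and the totals are made up for illustration.

```python
def max_per_year(totals):
    """Pick the heaviest commodity for each year.

    totals: iterable of ((year, commodity), total_weight) pairs,
    the shape emitted by the reducer above.
    """
    best = {}
    for (year, commodity), weight in totals:
        # Keep the entry with the largest total seen so far for this year.
        if year not in best or weight > best[year][1]:
            best[year] = (commodity, weight)
    return best


# Made-up totals, for illustration only.
totals = [
    (("2015", "Iron ore"), 30.0),
    (("2015", "Crude oil"), 20.0),
    (("2016", "Crude oil"), 50.0),
    (("2016", "Iron ore"), 40.0),
]
print(max_per_year(totals))
```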


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.202131.027004
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.202131.027004...

Top 10 Entradas da atividade
["2011", "Bovine animals, live, except pure-bred breeding"]	3790561468.0
["2010", "Bovine animals, live, except pure-bred breeding"]	3725385865.0
["2009", "Bovine animals, live, except pure-bred breeding"]	3667337654.0
["2007", "Bovine animals, live, except pure-bred breeding"]	3582928269.0
["2012", "Bovine animals, live, except pure-bred breeding"]	3532005526.0
["2006", "Bovine animals, live, except pure-bred breeding"]	3474206442.0
["2008", "Bovine animals, live, except pure-bred breeding"]	3438256284.0
["2013", "Bovine animals, live, except pure-bred breeding"]	3438146811.0
["2015", "Bovine animals, live, except pure-bred breeding"]	3430878298.0
["2005", "Bovine animals, liv

Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.195436.566846
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.195436.566846...

Top 10 Entradas da atividade
["2016", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	2706312145364.0
["2015", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	2649415129890.0
["2010", "Petroleum oils, oils from bituminous minerals, crude"]	2613316703587.0
["2014", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	2555725226600.0
["2011", "Petroleum oils, oils from bituminous minerals, crude"]	2547369237511.0
["2014", "Petroleum oils, oils from bituminous minerals, crude"]	2389423152614.0
["2015", "Petroleum oils, oils from bituminous minerals, crude"]	2334398373730.0
["2012", "Petroleum oils, oils from bituminous minerals, crude"]	2319929864134.0
["2013", "Iron ore, concentrate, no

### Information 9
 - Average weight per commodity, broken down by year;

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        fields = line.split(';')
        commodity = fields[3]
        year = fields[1]
        try:
            weight = float(fields[6])
            yield (year, commodity), weight
        except (IndexError, ValueError):
            # Skip rows with a missing or non-numeric weight.
            pass

    def reducer(self, chave, valores):
        # Mean weight per (year, commodity): accumulate sum and count in one pass.
        valor = 0.0
        count = 0
        for v in valores:
            count += 1
            valor += float(v)
        yield chave, valor / count

if __name__ == '__main__':
     WordCount.run()

Overwriting pratica.py
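
The reducer above computes the mean directly, which is correct as long as it receives every raw value for a key. If a combiner were added to pre-aggregate on the map side, averaging the partial averages would give the wrong answer; the usual remedy is to carry (sum, count) pairs and divide only once at the end. The following is a minimal sketch of that idea in plain Python; the helper names and the weights are made up.

```python
def partial(values):
    # What a combiner would emit for one chunk of values: (sum, count).
    values = list(values)
    return sum(values), len(values)


def mean_from_partials(partials):
    # What the reducer would do: merge the pairs, then divide once.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Made-up weights split into two chunks, as two mappers might see them.
chunk_a, chunk_b = [1.0, 2.0], [3.0, 4.0, 5.0]
partials = [partial(chunk_a), partial(chunk_b)]

correct = mean_from_partials(partials)  # 15 / 5
naive = (sum(chunk_a) / len(chunk_a) + sum(chunk_b) / len(chunk_b)) / 2  # (1.5 + 4.0) / 2
print(correct, naive)  # -> 3.0 2.75
```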


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.195724.492796
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.195724.492796...

Top 10 Entradas da atividade
["1988", "Sheep, live"]	54003615.0
["2008", "Swine, live except pure-bred breeding > 50 kg"]	43181538.152542375
["2011", "Swine, live except pure-bred breeding > 50 kg"]	36794547.88888889
["2013", "Swine, live except pure-bred breeding > 50 kg"]	35665136.307692304
["1988", "Swine, live except pure-bred breeding > 50 kg"]	35478237.5
["2014", "Swine, live except pure-bred breeding > 50 kg"]	34317240.484375
["2016", "Fowls, live domestic > 185 grams"]	33725574.385964915
["2007", "Swine, live except pure-bred breeding > 50 kg"]	32519872.38596491
["2016", "Swine, live except pure-bred breeding > 50 kg"]	32402963.825396825
["2009", "Swine, live except pure-bred breeding > 50 kg"]	322

Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.195728.372957
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.195728.372957...

Top 10 Entradas da atividade
["1988", "Bituminous coal, not agglomerated"]	34557636811.333336
["1999", "Oils petroleum, bituminous, distillates, except crude"]	32343556715.881355
["1988", "Petroleum oils, oils from bituminous minerals, crude"]	30595841803.444443
["1991", "Petroleum oils, oils from bituminous minerals, crude"]	20943228268.56
["1989", "Petroleum oils, oils from bituminous minerals, crude"]	20281057983.18182
["2010", "Petroleum oils, oils from bituminous minerals, crude"]	19648997771.330826
["1990", "Petroleum oils, oils from bituminous minerals, crude"]	18362963582.692307
["2014", "Petroleum oils, oils from bituminous minerals, crude"]	18239871394.0
["1988", "Iron ore, concentrate, not iron 

### Information 10
 - Average weight per commodity traded in Brazil (the dataset is in English, so use Brazil, with a Z), broken down by year;

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        fields = line.split(';')
        commodity = fields[3]
        year = fields[1]
        country = fields[0]
        # Keep only Brazil's rows.
        if country == 'Brazil':
            try:
                weight = float(fields[6])
                yield (year, commodity), weight
            except (IndexError, ValueError):
                # Skip rows with a missing or non-numeric weight.
                pass

    def reducer(self, chave, valores):
        # Mean weight per (year, commodity): accumulate sum and count in one pass.
        valor = 0.0
        count = 0
        for v in valores:
            count += 1
            valor += float(v)
        yield chave, valor / count

if __name__ == '__main__':
     WordCount.run()

Overwriting pratica.py


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.195909.090125
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.195909.090125...

Top 10 Entradas da atividade
["2010", "Bovine animals, live, except pure-bred breeding"]	163259108.0
["2013", "Bovine animals, live, except pure-bred breeding"]	161980315.0
["2014", "Bovine animals, live, except pure-bred breeding"]	151681448.5
["2009", "Bovine animals, live, except pure-bred breeding"]	134964128.0
["2012", "Bovine animals, live, except pure-bred breeding"]	121096675.0
["2008", "Bovine animals, live, except pure-bred breeding"]	105512865.5
["2007", "Bovine animals, live, except pure-bred breeding"]	102115793.5
["2011", "Bovine animals, live, except pure-bred breeding"]	98217470.0
["1995", "Bovine animals, live, except pure-bred breeding"]	63394422.0
["1994", "Bovine animals, live, except p

Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.195841.135986
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.195841.135986...

Top 10 Entradas da atividade
["2016", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	172274027120.5
["2015", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	157510314215.0
["2014", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	147233395393.0
["2013", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	141076354479.0
["2012", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	137699547522.0
["2011", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	137398453341.5
["2008", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	115846272058.5
["2007", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	109698561880.5
["2006", "Iron ore, concentrate, not i

### Information 11
 - Average weight per commodity traded in Brazil (the dataset is in English, so use Brazil, with a Z), by flow, broken down by year;

In [None]:
%%file pratica.py
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        fields = line.split(';')
        commodity = fields[3]
        year = fields[1]
        country = fields[0]
        flow = fields[4]
        # Keep only Brazil's rows.
        if country == 'Brazil':
            try:
                # Composite key: one mean per (year, commodity, flow).
                weight = float(fields[6])
                yield (year, commodity, flow), weight
            except (IndexError, ValueError):
                # Skip rows with a missing or non-numeric weight.
                pass

    def reducer(self, chave, valores):
        # Mean weight per key: accumulate sum and count in one pass.
        valor = 0.0
        count = 0
        for v in valores:
            count += 1
            valor += float(v)
        yield chave, valor / count

if __name__ == '__main__':
     WordCount.run()

Overwriting pratica.py


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.201132.099507
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.201132.099507...

Top 10 Entradas da atividade
["2013", "Bovine animals, live, except pure-bred breeding", "Export"]	323951180.0
["2010", "Bovine animals, live, except pure-bred breeding", "Export"]	323732156.0
["2014", "Bovine animals, live, except pure-bred breeding", "Export"]	298824624.0
["2009", "Bovine animals, live, except pure-bred breeding", "Export"]	257772216.0
["2012", "Bovine animals, live, except pure-bred breeding", "Export"]	242159350.0
["2007", "Bovine animals, live, except pure-bred breeding", "Export"]	199888037.0
["2011", "Bovine animals, live, except pure-bred breeding", "Export"]	194178940.0
["2008", "Bovine animals, live, except pure-bred breeding", "Export"]	193144870.0
["1995", "Bovine animals, live

Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.201234.903948
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.201234.903948...

Top 10 Entradas da atividade
["2016", "Iron ore, concentrate, not iron pyrites,unagglomerate", "Export"]	344548049536.0
["2015", "Iron ore, concentrate, not iron pyrites,unagglomerate", "Export"]	315020626912.0
["2014", "Iron ore, concentrate, not iron pyrites,unagglomerate", "Export"]	294462274531.0
["2013", "Iron ore, concentrate, not iron pyrites,unagglomerate", "Export"]	282152705950.0
["2012", "Iron ore, concentrate, not iron pyrites,unagglomerate", "Export"]	275398874690.0
["2011", "Iron ore, concentrate, not iron pyrites,unagglomerate", "Export"]	274796905500.0
["2010", "Iron ore, concentrate, not iron pyrites,unagglomerate", "Export"]	258820300767.0
["2009", "Iron ore, concentrate, not iron pyrites

### Information 12
 - Average commodity price by year

In [None]:
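%%file pratica.py
from mrjob.job import MRJob


class AvgPriceByYearCommodity(MRJob):
    def mapper(self, _, line):
        # Columns: 0 country, 1 year, 3 commodity, 5 value in dollars
        fields = line.split(';')
        try:
            country, year, commodity = fields[0], fields[1], fields[3]
            cost = float(fields[5])
        except (IndexError, ValueError):
            # Skip short lines and rows whose value is missing or not numeric
            return
        # Note: this job keeps only the rows for Brazil
        if country == 'Brazil':
            yield (year, commodity), cost

    def reducer(self, key, values):
        # Arithmetic mean of the values for each (year, commodity)
        total = 0.0
        count = 0
        for v in values:
            total += v
            count += 1
        yield key, total / count


if __name__ == '__main__':
    AvgPriceByYearCommodity.run()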

Overwriting pratica.py
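
This job computes the mean entirely in the reducer. If a combiner were added later (mrjob supports a `combiner` method), averaging partial averages would give the wrong answer, so the usual approach is to carry `(sum, count)` pairs and divide only at the end. The sketch below illustrates the difference in plain Python; the function names and the two chunks of values are hypothetical.

```python
def combine(values):
    """Collapse the raw values of one input split into a (sum, count) partial."""
    values = list(values)
    return (sum(values), len(values))


def reduce_partials(partials):
    """Merge (sum, count) partials and divide once at the end."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical values for one key, spread over two input splits.
chunk_a = [10.0, 20.0, 30.0]
chunk_b = [100.0]

correct_mean = reduce_partials([combine(chunk_a), combine(chunk_b)])

# Averaging the two per-split means weights the single value in chunk_b
# as heavily as the three values in chunk_a.
mean_of_means = (sum(chunk_a) / len(chunk_a) + sum(chunk_b) / len(chunk_b)) / 2

print(correct_mean, mean_of_means)
```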


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.200301.845777
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.200301.845777...

Top 10 Entradas da atividade
["2013", "Bovine animals, live, except pure-bred breeding"]	359115169.0
["2010", "Bovine animals, live, except pure-bred breeding"]	329997463.5
["2014", "Bovine animals, live, except pure-bred breeding"]	327721145.5
["2012", "Bovine animals, live, except pure-bred breeding"]	289126962.5
["2011", "Bovine animals, live, except pure-bred breeding"]	222333208.5
["2009", "Bovine animals, live, except pure-bred breeding"]	222150298.5
["2008", "Bovine animals, live, except pure-bred breeding"]	195578479.0
["2007", "Bovine animals, live, except pure-bred breeding"]	133433484.0
["2015", "Bovine animals, live, except pure-bred breeding"]	105440807.0
["2016", "Bovine animals, live, except

Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.200143.450986
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.200143.450986...

Top 10 Entradas da atividade
["2011", "Petroleum oils, oils from bituminous minerals, crude"]	17842002785.5
["2012", "Petroleum oils, oils from bituminous minerals, crude"]	16857785271.0
["2014", "Petroleum oils, oils from bituminous minerals, crude"]	15944924354.0
["2011", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	15925901437.5
["2013", "Petroleum oils, oils from bituminous minerals, crude"]	14638684520.0
["2010", "Petroleum oils, oils from bituminous minerals, crude"]	13193112562.0
["2013", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	12998126098.0
["2012", "Iron ore, concentrate, not iron pyrites,unagglomerate"]	11904980814.0
["2014", "Soya beans"]	11766625868.0
["2013", "Soya

### Information 13
 - Maximum, minimum, and mean values for each commodity type, by year

In [1]:
%%file pratica.py
from mrjob.job import MRJob


class ValueStatsByCategoryYear(MRJob):
    def mapper(self, _, line):
        # Columns: 1 year, 5 value in dollars, 9 category
        fields = line.split(';')
        try:
            year, category = fields[1], fields[9]
            cost = float(fields[5])
        except (IndexError, ValueError):
            # Skip short lines and rows whose value is missing or not numeric
            return
        yield (category, year), cost

    def reducer(self, key, values):
        total = 0.0
        count = 0
        # Start from infinities so the first value always replaces both bounds
        min_value = float('inf')
        max_value = float('-inf')
        for v in values:
            min_value = min(min_value, v)
            max_value = max(max_value, v)
            total += v
            count += 1
        # Emitted as (mean, min, max)
        yield key, (total / count, min_value, max_value)


if __name__ == '__main__':
    ValueStatsByCategoryYear.run()

Writing pratica.py
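
The reducer emits its value as a list `[mean, min, max]`. A numeric `sort -k2,2rn` on the second column finds no leading number in a field that starts with `[`, so it cannot rank these rows by mean. One possible alternative, sketched below, is to parse the output lines in Python, assuming the default mrjob output of a JSON key, a tab, and a JSON value. The helper name `top_by_mean` is hypothetical; the three sample lines are copied from the full-dataset output of this job.

```python
import json

# Three complete lines taken from the job output (key <TAB> [mean, min, max]).
SAMPLE_LINES = [
    '"01_live_animals"\t[12590297.508978438, 1.0, 2532308308.0]',
    '"02_meat_and_edible_meat_offal"\t[21930690.75038, 1.0, 4890487240.0]',
    '"05_products_of_animal_origin_nes"\t[5438164.481769115, 1.0, 1264974178.0]',
]


def top_by_mean(lines, n=10):
    """Return up to n (key, mean, min, max) rows, highest mean first."""
    rows = []
    for line in lines:
        key_json, value_json = line.rstrip("\n").split("\t", 1)
        mean, low, high = json.loads(value_json)
        rows.append((json.loads(key_json), mean, low, high))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


for row in top_by_mean(SAMPLE_LINES):
    print(row)
```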


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.234307.273629
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.234307.273629...

Top 10 Entradas da atividade
"01_live_animals"	[13702608.890651273, 1.0, 1895889152.0]


Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.234308.402946
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.234308.402946...

Top 10 Entradas da atividade
"01_live_animals"	[12590297.508978438, 1.0, 2532308308.0]
"02_meat_and_edible_meat_offal"	[21930690.75038, 1.0, 4890487240.0]
"03_fish_crustaceans_molluscs_aquatic_invertebrates_ne"	[10713345.4915, 1.0, 4738429235.0]
"04_dairy_products_eggs_honey_edible_animal_product_nes"	[20647237.75476, 1.0, 6138120403.0]
"05_products_of_animal_origin_nes"	[5438164.481769115, 1.0, 1264974178.0]
"06_live_trees_plants_bulbs_roots_cut_flowers_etc"	[11986741.792948559, 1.0, 4735608576.0]
"07_edible_vegetables_and_certain_roots_and_tubers"	[6624974.62102, 1.0, 2644713802.0]
"08_edible_fruit_nuts_peel_of_citrus_fruit_melons"	[13607632.50466, 1.0, 4338020586.0]
"09_coffee_tea_mate_and_spices"	[9200

### Information 14
 - Country with the highest commodity price among Export flows

In [None]:
%%file pratica.py
from mrjob.job import MRJob


class MaxExportPriceByCountryCommodity(MRJob):
    def mapper(self, _, line):
        # Columns: 0 country, 3 commodity, 4 flow, 5 value in dollars
        fields = line.split(';')
        try:
            country, commodity, flow = fields[0], fields[3], fields[4]
            cost = float(fields[5])
        except (IndexError, ValueError):
            # Skip short lines and rows whose value is missing or not numeric
            return
        if flow == 'Export':
            yield (country, commodity), cost

    def reducer(self, key, values):
        # Highest Export value seen for each (country, commodity) pair
        yield key, max(values)


if __name__ == '__main__':
    MaxExportPriceByCountryCommodity.run()

Overwriting pratica.py
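
This job yields one maximum per (country, commodity) pair. If the task is read as asking for a single country per commodity, one more reduction over those pairs is needed. The sketch below shows that second step in plain Python, under that reading of the task; `best_country_per_commodity` is a hypothetical helper, and the input pairs are copied from the sample-dataset output of this job.

```python
# (country, commodity) -> maximum Export value, as produced by the job.
PAIRS = [
    (("France", "Bovine animals, live, except pure-bred breeding"), 1895889152.0),
    (("Canada", "Bovine animals, live, except pure-bred breeding"), 1798687283.0),
    (("Netherlands", "Swine, live except pure-bred breeding > 50 kg"), 902682671.0),
    (("Denmark", "Swine, live except pure-bred breeding < 50 kg"), 881569879.0),
    (("Netherlands", "Swine, live except pure-bred breeding < 50 kg"), 698861013.0),
]


def best_country_per_commodity(pairs):
    """Keep, for each commodity, the country with the highest value."""
    best = {}
    for (country, commodity), value in pairs:
        if commodity not in best or value > best[commodity][1]:
            best[commodity] = (country, value)
    return best


for commodity, (country, value) in best_country_per_commodity(PAIRS).items():
    print(commodity, country, value)
```

In mrjob this could be expressed as a second step that re-keys each record by commodity, but the single-step job is kept as is here.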


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.203034.757764
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.203034.757764...

Top 10 Entradas da atividade
["France", "Bovine animals, live, except pure-bred breeding"]	1895889152.0
["Canada", "Bovine animals, live, except pure-bred breeding"]	1798687283.0
["Australia", "Bovine animals, live, except pure-bred breeding"]	947455270.0
["Netherlands", "Swine, live except pure-bred breeding > 50 kg"]	902682671.0
["Denmark", "Swine, live except pure-bred breeding < 50 kg"]	881569879.0
["Mexico", "Bovine animals, live, except pure-bred breeding"]	870822705.0
["Brazil", "Bovine animals, live, except pure-bred breeding"]	718200007.0
["EU-28", "Bovine animals, live, except pure-bred breeding"]	711362708.0
["Netherlands", "Swine, live except pure-bred breeding < 50 kg"]	698861013.0
["Germany",

Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.203037.428779
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.203037.428779...

Top 10 Entradas da atividade
["EU-28", "Oils petroleum, bituminous, distillates, except crude"]	130265972772.0
["Iraq", "Petroleum oils, oils from bituminous minerals, crude"]	94027500000.0
["Canada", "Petroleum oils, oils from bituminous minerals, crude"]	88119818380.0
["Iran", "Petroleum oils, oils from bituminous minerals, crude"]	84381572020.0
["Kuwait", "Petroleum oils, oils from bituminous minerals, crude"]	79041032289.0
["EU-28", "Medicaments nes, in dosage"]	77054946304.0
["Angola", "Petroleum oils, oils from bituminous minerals, crude"]	68863266749.0
["India", "Oils petroleum, bituminous, distillates, except crude"]	67075185518.0
["Australia", "Iron ore, concentrate, not iron pyrites,unagglomerate

### Information 15
 - Number of commercial transactions by flow and by year

In [None]:
%%file pratica.py
from mrjob.job import MRJob


class TransactionsByYearFlow(MRJob):
    def mapper(self, _, line):
        # Columns: 1 year, 4 flow
        fields = line.split(';')
        year = fields[1]
        flow = fields[4]
        yield (year, flow), 1

    def reducer(self, key, values):
        # Sum the emitted ones to count transactions per (year, flow)
        yield key, sum(values)


if __name__ == '__main__':
    TransactionsByYearFlow.run()

Overwriting pratica.py
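
A reducer for this job can either count the incoming values or sum them. The two agree while every value is 1, but only the sum stays correct if a combiner pre-aggregates the counts on each input split. The sketch below illustrates this in plain Python; the two splits and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical mapper outputs on two input splits: ((year, flow), 1) per row.
split_a = [(("2016", "Import"), 1), (("2016", "Import"), 1), (("2016", "Export"), 1)]
split_b = [(("2016", "Import"), 1), (("2016", "Export"), 1)]


def combine(pairs):
    """Pre-aggregate one split into a single (key, partial_count) per key."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_by_sum(pairs):
    """Add up the values: correct with or without a combiner."""
    total = Counter()
    for key, value in pairs:
        total[key] += value
    return dict(total)


def reduce_by_counting(pairs):
    """Count one per incoming value: only correct when every value is 1."""
    total = Counter()
    for key, _ in pairs:
        total[key] += 1
    return dict(total)


combined = combine(split_a) + combine(split_b)
print(reduce_by_sum(combined))
print(reduce_by_counting(combined))
```

After combining, each split contributes one partial per key, so counting the incoming values reports 2 Import transactions instead of 3.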


Preliminary test with the sample dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base_sample.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.220012.150095
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.220012.150095...

Top 10 Entradas da atividade
["2012", "Import"]	928
["2013", "Import"]	922
["2010", "Import"]	899
["2011", "Import"]	899
["2007", "Import"]	898
["2003", "Import"]	893
["2014", "Import"]	891
["2000", "Import"]	877
["2015", "Import"]	861
["2001", "Import"]	860


Test with the full dataset

In [None]:
!rm -rf /content/atividade
!python pratica.py /content/base.csv --output-dir=/content/atividade
print('\nTop 10 Entradas da atividade')
!cat /content/atividade/* | sort -k2,2rn -t$'\t' | head -10

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/pratica.root.20231019.215940.621568
job output is in /content/atividade
Removing temp directory /tmp/pratica.root.20231019.215940.621568...

Top 10 Entradas da atividade
["2007", "Import"]	82032
["2010", "Import"]	81859
["2009", "Import"]	81463
["2012", "Import"]	81362
["2006", "Import"]	81274
["2011", "Import"]	81225
["2005", "Import"]	79987
["2013", "Import"]	79329
["2004", "Import"]	79280
["2008", "Import"]	79214
