https://mrjob.readthedocs.io/en/latest/guides/quickstart.html

In [None]:
# mrjob can be installed with pip or conda
# ! pip install mrjob
# ! conda install mrjob

In [9]:
import os
import re

import numpy as np

- A **mapper** takes a single key and value as input, and returns zero or more (key, value) pairs. The pairs from all map outputs of a single step are grouped by key.

- A **combiner** takes a key and a subset of the values for that key as input and returns zero or more (key, value) pairs. Combiners are optimizations that run immediately after each mapper and can be used to decrease total data transfer. Combiners should be idempotent (produce the same output if run multiple times in the job pipeline).

- A **reducer** takes a key and the complete set of values for that key in the current step, and returns zero or more arbitrary (key, value) pairs as output.

    After the reducer has run, if there are more steps, the individual results are arbitrarily assigned to mappers for further processing. If there are no more steps, the results are sorted and made available for reading.
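The data flow described above can be mimicked in plain Python. The sketch below is a toy illustration, not how mrjob runs jobs internally: `run_step` is a hypothetical helper that calls a mapper on every line, groups the emitted pairs by key, and passes each group to a reducer. The combiner stage is left out for brevity.

```python
from collections import defaultdict


def mapper(_, line):
    # Emit zero or more (key, value) pairs for one input line.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1


def reducer(key, values):
    # Receives every value emitted for this key in the current step.
    yield key, sum(values)


def run_step(lines):
    # Hypothetical helper: map, group by key, then reduce.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


result = run_step(["hello world", "foo"])
print(result)
```

In a real job the grouping happens inside the framework between the map and reduce phases, so the job class only has to define the mapper and reducer.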


# Word Count

Let's work with some text and count a few things in it.

## Lines, Words, Chars

In [10]:
%%writefile job.py

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):
    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

Overwriting job.py


```
python3 job.py our_file.txt
```

In [4]:
!source ~/.venvs/teaching3.12/bin/activate && python3 job.py data/crime-punishment.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /var/folders/h6/lv17v1r10lz21g745pd6774h0tgc4z/T/job.iadovgopolyi.20250215.065003.453047
Running step 1 of 1...
job output is in /var/folders/h6/lv17v1r10lz21g745pd6774h0tgc4z/T/job.iadovgopolyi.20250215.065003.453047/output
Streaming final output from /var/folders/h6/lv17v1r10lz21g745pd6774h0tgc4z/T/job.iadovgopolyi.20250215.065003.453047/output...
"words"	206551
"lines"	22443
"chars"	1131926
Removing temp directory /var/folders/h6/lv17v1r10lz21g745pd6774h0tgc4z/T/job.iadovgopolyi.20250215.065003.453047...


![](img/mrjob_example_1.png)

## Names

Let's make the task a bit harder and estimate how many times first name + patronymic pairs are mentioned in the text.

To do that, we need to come up with a regular expression.

In [11]:
with open('data/crime-punishment.txt', 'r') as file:
    text = file.read()

In [12]:
import re

name_regex = re.compile(r'([A-Z][a-z]{3,})\s([A-Z][a-z]{2,}(ich|itch|vna))')


In [13]:
for res in re.finditer(name_regex, text):
    name = res.group()
    name = re.sub(r'\s+', ' ', name)
    print(name)

Alyona Ivanovna
Alyona Ivanovna
Alyona Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Amalia Fyodorovna
Katerina Ivanovna
Ivan Ivanitch
Katerina Ivanovna
Katerina Ivanovna
Darya Frantsovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Darya Frantsovna
Sofya Semyonovna
Amalia Fyodorovna
Darya Frantsovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Ivan Afanasyvitch
Ivan Afanasyvitch
Katerina Ivanovna
Semyon Zaharovitch
Katerina Ivanovna
Katerina Ivanovna
Amalia Fyodorovna
Semyon Zaharovitch
Semyon Zaharovitch
Semyon Zaharovitch
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Praskovya Pavlovna
Vassily Ivanovitch
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Pyotr Petrovitch
Marfa Petrovna
Pyotr Petrovitch
Pyotr Petrovitch
Pyotr Petrovitch
Pyotr Petrovitch
Pyotr Petrovitch
Pyotr 


We've checked that the regex produces something plausible.

Now let's apply it in our job.

In [31]:
PATTERN = re.compile(r'([A-Z][a-z]{3,})\s([A-Z][a-z]{2,}(ich|itch|vna))')

In [23]:
PATTERN.search('Ivan Afanasyvitch').group()

'Ivan Afanasyvitch'

In [34]:
%%writefile job.py

import re
from mrjob.job import MRJob

PATTERN = re.compile(r'([A-Z][a-z]{3,})\s([A-Z][a-z]{2,}(ich|itch|vna))')

class MRWordMiddleNameCounts(MRJob):
    def mapper(self, _, line):
        for result in PATTERN.finditer(line):
            name = result.group()
            name = re.sub(r'\s+', ' ', name)
            
            yield name, 1

    def reducer(self, name, values):
        yield name, sum(values)

if __name__ == '__main__':
    MRWordMiddleNameCounts.run()

Overwriting job.py


In [35]:
!source ~/.venvs/teaching3.12/bin/activate && python3 job.py -q data/crime-punishment.txt

"Vassily Ivanovitch"	1
"Nastasya Nikiforovna"	1
"Natalya Yegorovna"	1
"Nikodim Fomitch"	21
"Porfiry Petrovitch"	75
"Darya Frantsovna"	4
"Dmitri Prokofitch"	22
"Ilya Petrovitch"	29
"Ivan Afanasyvitch"	2
"Ivan Ivanitch"	1
"Ivan Mihailovitch"	1
"Katerina Ivanovna"	186
"Lizaveta Ivanovna"	5
"Luise Ivanovna"	8
"Madame Resslich"	8
"Marfa Petrovna"	69
"Rodion Romanovitch"	78
"Amalia Ludwigovna"	8
"Andrey Semyonovitch"	18
"Arkady Ivanovitch"	8
"Avdotya Romanovna"	102
"Praskovya Pavlovna"	9
"Pulcheria Alexandrovna"	101
"Pyotr Petrovitch"	148
"Afanasy Ivanitch"	1
"Afanasy Ivanovitch"	6
"Alexandr Grigorievitch"	1
"Alexey Semyonovitch"	1
"Alyona Ivanovna"	11
"Amalia Fyodorovna"	3
"Amalia Ivanovna"	50
"Semyon Semyonovitch"	2
"Semyon Zaharovitch"	7
"Sofya Ivanovna"	3
"Sofya Semyonovna"	62


The `-r local` argument runs the job locally in several subprocesses rather than in a single process. The `-q` argument suppresses debug output.

![](img/mrjob_example_2.png)

## Most common middle name

Now let's add one more step to our program: finding the most common patronymic.

*(I'm betting it will be a form of either Pyotr or Ivan)*

Here we use 2 steps. The first step produces aggregates of the form (int, patronymic), and the second step uses an additional reducer to take the maximum.

In [74]:
%%writefile job.py
import re

from mrjob.job import MRJob
from mrjob.step import MRStep

PATTERN = re.compile(r'[A-Z][a-z]{2,}(ich|itch|vna)')

class MRWordMostPopularMiddleName(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper, combiner=self.combiner, reducer=self.reducer),
            MRStep(reducer=self.most_common_reducer)
        ]
    
    def mapper(self, _, line):
        for name in re.finditer(PATTERN, line):
            yield name.group(), 1

    def combiner(self, key, values):
        yield key, sum(values)
    
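    def reducer(self, key, values):
        # Send every (count, name) pair to the same key for the next step
        yield None, (sum(values), key)

    def most_common_reducer(self, _, values):
        # max() compares tuples element-wise, so the count decides first;
        # the resulting (count, name) tuple is emitted as (key, value)
        yield max(values)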


if __name__ == '__main__':
    MRWordMostPopularMiddleName.run()

Overwriting job.py


In [75]:
!source ~/.venvs/teaching3.12/bin/activate && python3 job.py -q data/crime-punishment.txt

304	"Ivanovna"


In [68]:
max([(1, 'a'), (2, 'b')])

(2, 'b')
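The two-step flow can also be sketched without mrjob, which makes the role of each step easier to see. This is a minimal illustration under stated assumptions: `count_patronymics`, `most_common`, and the `sample` lines are hypothetical and not part of the job above. The first function stands in for step 1 (mapper, combiner, reducer), the second for the final reducer.

```python
import re
from collections import Counter

PATTERN = re.compile(r'[A-Z][a-z]{2,}(ich|itch|vna)')


def count_patronymics(lines):
    # Step 1: count every match, then emit (count, name) pairs.
    counts = Counter()
    for line in lines:
        for match in PATTERN.finditer(line):
            counts[match.group()] += 1
    return [(count, name) for name, count in counts.items()]


def most_common(pairs):
    # Step 2: tuples compare element-wise, so the count is compared first.
    return max(pairs)


# Made-up sample lines, used only for this illustration.
sample = [
    "Katerina Ivanovna spoke to Pyotr Petrovitch.",
    "Marfa Petrovna wrote to Katerina Ivanovna.",
]

pairs = count_patronymics(sample)
print(most_common(pairs))
```

Because whole tuples are compared, a tie on the count is broken by the name: the alphabetically later name wins.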

![](img/mrjob_example_3.png)

*:)*

# Average of numbers

Let's generate a file of numbers to use as an example. It will have n lines with m numbers in each.

In [36]:
mat = np.random.randint(-5, 255, size=(1337, 42))

with open(os.path.join('data', 'digits'), 'w') as file:
    for line in mat:
        file.write(f'{str(line.tolist())[1:-1]}\n')

In [37]:
mat.mean()

np.float64(124.42443993304128)

In [38]:
with open(os.path.join('data', 'digits'), 'r') as file:
    mat_text = file.readlines()

In [57]:
sum([[0], [1], [2]], start=[])

[0, 1, 2]

In [55]:
?sum

Signature: sum(iterable, /, start=0)
Docstring:
Return the sum of a 'start' value (default: 0) plus an iterable of numbers

When the iterable is empty, return the start value.
This function is intended specifically for use with numeric values and may
reject non-numeric types.
Type:      builtin_function_or_method

In [66]:
%%writefile job.py

from mrjob.job import MRJob

class MRNumbersAverager(MRJob):
    def mapper(self, _, line):
        line_data = list(map(int, line.split(', ')))
        
        yield '', sum(line_data) / len(line_data)

    # def reducer(self, line, values):
    #     yield sum(values) / len(values)


if __name__ == '__main__':
    MRNumbersAverager.run()

Overwriting job.py


In [67]:
!source ~/.venvs/teaching3.12/bin/activate && python3 job.py -q data/digits

""	142.4047619047619
""	132.47619047619048
""	118.9047619047619
""	123.16666666666667
""	111.14285714285714
""	141.5
""	133.23809523809524
""	121.07142857142857
""	102.38095238095238
""	108.33333333333333
""	113.88095238095238
""	111.26190476190476
""	130.73809523809524
""	115.07142857142857
""	122.26190476190476
""	102.88095238095238
""	127.0952380952381
""	111.9047619047619
""	118.14285714285714
""	142.92857142857142
""	119.16666666666667
""	122.47619047619048
""	111.45238095238095
""	130.0952380952381
""	112.30952380952381
""	113.85714285714286
""	121.07142857142857
""	118.14285714285714
""	130.47619047619048
""	125.78571428571429
""	118.71428571428571
""	130.0
""	131.42857142857142
""	152.88095238095238
""	119.61904761904762
""	132.23809523809524
""	117.92857142857143
""	124.26190476190476
""	156.38095238095238
""	124.57142857142857
""	132.6904761904762
""	128.38095238095238
""	114.54761904761905
""	130.38095238095238
""	116.95238095238095
""	122.69047619047619
""	127.3809523809523

In [48]:
%%writefile job.py

from mrjob.job import MRJob

class MRNumbersAverager(MRJob):
    def mapper(self, _, line):
        for number in line.strip().split(','):
            yield 1, int(number)

    def reducer(self, key, values):
        values = list(values)
        
        yield "avg", sum(values) / len(values)


if __name__ == '__main__':
    MRNumbersAverager.run()

Overwriting job.py


In [49]:
!source ~/.venvs/teaching3.12/bin/activate && python3 job.py -q data/digits

"avg"	124.42443993304128
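The reducer above collects all values into a list before averaging. One possible alternative, sketched here in plain Python and not taken from the job above, is to carry (sum, count) pairs: they can be merged in any order and in any number of stages, so the same merge logic could serve as both a combiner and a reducer. The names `mapper`, `combine`, and `mean_of_lines` below are hypothetical.

```python
def mapper(line):
    # Emit one partial (sum, count) pair per line.
    numbers = [int(n) for n in line.strip().split(',')]
    yield "avg", (sum(numbers), len(numbers))


def combine(pairs):
    # Merge partial (sum, count) pairs; safe to apply repeatedly.
    total = 0
    count = 0
    for partial_sum, partial_count in pairs:
        total += partial_sum
        count += partial_count
    return total, count


def mean_of_lines(lines):
    pairs = (pair for line in lines for _, pair in mapper(line))
    total, count = combine(pairs)
    return total / count


lines = ["1, 2, 3", "4, 5", "-6"]
print(mean_of_lines(lines))
```

The sample lines have different lengths on purpose: averaging the per-line averages (2, 4.5, -6) would give a different number from the true mean of all six values, while the (sum, count) pairs stay exact.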
