#ParDo

- ParDo is a Beam transform for generic parallel processing.
- The ParDo processing paradigm is similar to the “Map” phase of a Map/Shuffle/Reduce-style algorithm: a ParDo transform considers each element in the input PCollection, performs some processing function (your user code) on that element, and emits zero, one, or multiple elements to an output PCollection.
- ParDo is useful for a variety of common data processing operations, including:
  - Filtering a data set.
  - Formatting or type-converting each element in a data set.
  - Extracting parts of each element in a data set.
  - Performing computations on each element in a data set.
- When you apply a ParDo transform, you need to provide user code in the form of a DoFn object.
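
The "zero, one, or multiple elements" behaviour can be modelled without Beam. The sketch below is plain Python, not the Beam API: `par_do` is a hypothetical helper, invented for this illustration, that calls a function on each element and flattens the iterables it returns. That is roughly what ParDo does with the results of your processing function, ignoring parallelism and bundling.

```python
# Plain-Python model of ParDo semantics (NOT the Beam API).
# par_do is a hypothetical helper invented for this illustration.
def par_do(elements, fn):
    """Apply fn to each element and flatten the iterables it returns."""
    output = []
    for element in elements:
        output.extend(fn(element))  # each call may add 0, 1, or many items
    return output


lines = ["a,b", "", "c"]

# Zero outputs for an empty line, one output otherwise (a filter).
non_empty = par_do(lines, lambda line: [line] if line else [])

# Exactly one output per input: the whole list of fields.
rows = par_do(non_empty, lambda line: [line.split(",")])

# Possibly many outputs per input: each field becomes its own element.
fields = par_do(non_empty, lambda line: line.split(","))

print(non_empty)  # ['a,b', 'c']
print(rows)       # [['a', 'b'], ['c']]
print(fields)     # ['a', 'b', 'c']
```

The difference between `rows` and `fields` is the one to watch for: wrapping the result in a list emits it as a single element, while returning the list itself emits each item separately.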

#DoFn

- DoFn is a Beam SDK class that defines a distributed processing function.
- The DoFn object that you pass to ParDo contains the processing logic that gets applied to the elements in the input collection.
- Inside your DoFn subclass, you’ll write a method process where you provide the actual processing logic.
- You don’t need to manually extract the elements from the input collection; the Beam SDKs handle that for you.
- Your process method should accept an argument element, which is the input element, and return an iterable with its output values.
- A given DoFn instance generally gets invoked one or more times to process some arbitrary bundle of elements.
- Your method should meet the following requirements:
  - You should not in any way modify the element argument provided to the process method, or any side inputs.
  - Once you output a value using yield or return, you should not modify that value in any way.
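
The two requirements above can be illustrated with a small sketch. It rests on assumptions: `AddStatusFn`, the record fields, and the pass mark of 50 are all hypothetical, and the class is a plain Python class so that the snippet runs without Beam installed. In a pipeline it would subclass `beam.DoFn` and be passed to `beam.ParDo`.

```python
# Hypothetical example: AddStatusFn, the record layout, and the pass
# mark of 50 are made up. In a real pipeline this class would
# subclass beam.DoFn.
class AddStatusFn:
    def process(self, element):
        # Build a new dict rather than changing `element` in place.
        result = dict(element)
        result["status"] = "PASS" if element["score"] >= 50 else "FAIL"
        # Yield the value and do not touch `result` afterwards.
        yield result


record = {"name": "Ann", "score": 42}
outputs = list(AddStatusFn().process(record))

print(outputs)  # [{'name': 'Ann', 'score': 42, 'status': 'FAIL'}]
print(record)   # {'name': 'Ann', 'score': 42}
```

Note that `dict(element)` makes a shallow copy, which is enough here because the values are immutable; nested mutable values would need a deeper copy.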

In [3]:
import apache_beam as beam

In [5]:
#ParDo Function
class SplitRow(beam.DoFn):
  def process(self, element):
    # Emit one output per input line: the list of comma-separated fields.
    return [element.split(',')]

class ComputeWordLengthFn(beam.DoFn):
  def process(self, element):
    # In this pipeline element is a list of fields, so len() is the field count.
    return [len(element)]


with beam.Pipeline() as pipeline:
  input_data = (pipeline
                | "read from text" >> beam.io.ReadFromText("/content/sample_data/students.txt", skip_header_lines=1)
                | "splitting the record" >> beam.ParDo(SplitRow()))

  count_data = (input_data
                | "keeping only FAIL records" >> beam.Filter(lambda record: record[5] == "FAIL"))

  word_lengths = (count_data
                 | "number of fields per record" >> beam.ParDo(ComputeWordLengthFn())
                 | beam.Map(print))

  output_data = (count_data
                 | "Write to Text" >> beam.io.WriteToText("result/fail_data"))



6
6
6
6


#Keys
Takes a collection of key-value pairs and returns the key of each element.

#Values
Takes a collection of key-value pairs, and returns the value of each element.

In [6]:
#extracting keys from pairs
with beam.Pipeline() as pipeline:
  icons = (
      pipeline
      | 'Garden plants' >> beam.Create([
          ('🍓', 'Strawberry'),
          ('🥕', 'Carrot'),
          ('🍆', 'Eggplant'),
          ('🍅', 'Tomato'),
          ('🥔', 'Potato'),
      ])
      | 'Keys' >> beam.Keys()
      | beam.Map(print))

🍓
🥕
🍆
🍅
🥔


In [8]:
#extracting Values from pairs
with beam.Pipeline() as pipeline:
  plants = (
      pipeline
      | 'Garden plants' >> beam.Create([
          ('🍓', 'Strawberry'),
          ('🥕', 'Carrot'),
          ('🍆', 'Eggplant'),
          ('🍅', 'Tomato'),
          ('🥔', 'Potato'),
      ])
      | 'Values' >> beam.Values()
      | beam.Map(print))

Strawberry
Carrot
Eggplant
Tomato
Potato


#ToString

- Transforms every element in an input collection to a string.
- Any non-string element can be converted to a string using standard Python functions and methods.
- Many I/O transforms, such as textio.WriteToText, expect their input elements to be strings.
- ToString covers three cases:
  - Key-value pairs to string (ToString.Kvs)
  - Elements to string (ToString.Element)
  - Iterables to string (ToString.Iterables)
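
As a rough guide to what each variant produces, the sketch below shows plain-Python approximations. It is not Beam code and not Beam's implementation; the sample data is made up, and the comma delimiter is assumed from the outputs shown in this section.

```python
# Plain-Python approximations of the three ToString variants (not Beam code).
# The sample values are made up for illustration.
pair = ("carrot", "Carrot")
row = ["Carrot", "biennial", 2]

kv_string = "{},{}".format(*pair)                # like ToString.Kvs()
element_string = str(row)                        # like ToString.Element()
iterable_string = ",".join(str(x) for x in row)  # like ToString.Iterables()

print(kv_string)        # carrot,Carrot
print(element_string)   # ['Carrot', 'biennial', 2]
print(iterable_string)  # Carrot,biennial,2
```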

In [9]:
#key_value pair to string
with beam.Pipeline() as pipeline:
  plants = (
      pipeline
      | 'Garden plants' >> beam.Create([
          ('🍓', 'Strawberry'),
          ('🥕', 'Carrot'),
          ('🍆', 'Eggplant'),
          ('🍅', 'Tomato'),
          ('🥔', 'Potato'),
      ])
      | 'To string' >> beam.ToString.Kvs()  # other variants: Element(), Iterables()
      | beam.Map(print)
  )

🍓,Strawberry
🥕,Carrot
🍆,Eggplant
🍅,Tomato
🥔,Potato


In [10]:
#any element (here a list) directly to string, like str(element)
with beam.Pipeline() as pipeline:
    plant_lists = (
        pipeline
        | 'Garden plants' >> beam.Create([
            ['🍓', 'Strawberry', 'perennial'],
            ['🥕', 'Carrot', 'biennial'],
            ['🍆', 'Eggplant', 'perennial'],
            ['🍅', 'Tomato', 'annual'],
            ['🥔', 'Potato', 'perennial'],
        ])
        | 'To string' >> beam.ToString.Element()
        | beam.Map(print)
  )

['🍓', 'Strawberry', 'perennial']
['🥕', 'Carrot', 'biennial']
['🍆', 'Eggplant', 'perennial']
['🍅', 'Tomato', 'annual']
['🥔', 'Potato', 'perennial']


In [11]:
#converts an iterable, in this case a list of strings, into a string like ','.join()
with beam.Pipeline() as pipeline:
    plants_csv = (
        pipeline
        | 'Garden plants' >> beam.Create([
            ['🍓', 'Strawberry', 'perennial'],
            ['🥕', 'Carrot', 'biennial'],
            ['🍆', 'Eggplant', 'perennial'],
            ['🍅', 'Tomato', 'annual'],
            ['🥔', 'Potato', 'perennial'],
        ])
        | 'To string' >> beam.ToString.Iterables()
        | beam.Map(print)
)

🍓,Strawberry,perennial
🥕,Carrot,biennial
🍆,Eggplant,perennial
🍅,Tomato,annual
🥔,Potato,perennial


#KvSwap()
Takes a collection of key-value pairs and returns a collection of key-value pairs with each key and value swapped.

In [12]:
with beam.Pipeline() as pipeline:
  plants = (
      pipeline
      | 'Garden plants' >> beam.Create([
          ('🍓', 'Strawberry'),
          ('🥕', 'Carrot'),
          ('🍆', 'Eggplant'),
          ('🍅', 'Tomato'),
          ('🥔', 'Potato'),
      ])
      | 'Key-Value swap' >> beam.KvSwap()
      | beam.Map(print)
)

('Strawberry', '🍓')
('Carrot', '🥕')
('Eggplant', '🍆')
('Tomato', '🍅')
('Potato', '🥔')


[Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/#pardo)

[Basics of the Beam model](https://beam.apache.org/documentation/basics/#aggregation)

[Python Transform Catalogue](https://beam.apache.org/documentation/transforms/python/overview/)