#PCollections

- A PCollection is a dataset or data stream on which we can perform transformations in the Beam model.

- It is an abstraction that represents a potentially distributed, multi-element data set that our Beam pipeline operates on.

- It is immutable: applying a transform to a PCollection produces a new PCollection and leaves the input unchanged.

- All elements in a PCollection must be of the same type, but that type can be any type.

- In many cases, the element type in a PCollection has a structure that can be introspected. Examples are JSON, Protocol Buffer, Avro, and database records. Schemas provide a way to express types as a set of named fields, allowing for more-expressive aggregations.

- A PCollection does not support random access to individual elements. Instead, Beam Transforms consider every element in a PCollection individually.

- Timestamps: Each element in a PCollection has an associated timestamp.

In [None]:
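# PCollections: read a text file, transform it, and write the result
import apache_beam as beam

with beam.Pipeline() as p:
  output = (
      p
      # Each line of the file becomes one element; the header row is skipped
      | "Read Data" >> beam.io.ReadFromText('/content/sample_data/grocery.txt', skip_header_lines=1)
      # Split each comma-separated line into a list of fields
      | "Apply Map" >> beam.Map(lambda x: x.split(','))
      # Keep only the rows whose third field is 'Regular'
      | "Apply Filter" >> beam.Filter(lambda x: x[2] == 'Regular')
      | "Save Data" >> beam.io.WriteToText('/content/sample_data/first_pl.txt')
      # WriteToText emits the names of the files it wrote, which are printed here
      | "Print Result" >> beam.Map(print)
  )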

/content/sample_data/first_pl.txt-00000-of-00001


In [None]:
#!cat /content/sample_data/first_pl.txt-00000-of-00001

#PTransform in Beam

- Transforms are the operations performed in a pipeline. A transform applies processing logic to the elements of a PCollection.

- Some transforms perform aggregation, which computes a value from multiple (one or more) input elements. This works similarly to the "Reduce" step of MapReduce in Hadoop.

- The Beam model provides various transforms, such as ParDo, Combine, and composite transforms.

1. Map: Applies a simple 1-to-1 mapping function over each element in the collection.

In [None]:
# Using Map with a user-defined function
def Stripchars(text):
  return text.strip('# \n')


with beam.Pipeline() as pipeline:
  plants = (
      pipeline
      | 'Gardening plants' >> beam.Create([
          '# 🍓Strawberry\n',
          '# 🥕Carrot\n',
          '# 🍆Eggplant\n',
          '# 🍅Tomato\n',
          '# 🥔Potato\n',
      ])
      | 'Strip Records' >> beam.Map(Stripchars)
      | 'Print Records' >> beam.Map(print)
  )

🍓Strawberry
🥕Carrot
🍆Eggplant
🍅Tomato
🥔Potato


- MapTuple is used when each input element is a tuple, such as a (key, value) pair. The tuple is unpacked into separate arguments of the function.

In [None]:
# MapTuple unpacks each tuple element into separate function arguments
with beam.Pipeline() as pipeline:
  plants = (
      pipeline
      | 'Gardening plants' >> beam.Create([
            ('🍓', 'Strawberry'),
            ('🥕', 'Carrot'),
            ('🍆', 'Eggplant'),
            ('🍅', 'Tomato'),
            ('🥔', 'Potato'),
      ])
      | 'Format key Value Pair' >> beam.MapTuple(lambda key,value : '{}{}'.format(key,value))
      | 'Print Records' >> beam.Map(print)
  )

🍓Strawberry
🥕Carrot
🍆Eggplant
🍅Tomato
🥔Potato


2. FlatMap()

- Applies a simple 1-to-many mapping function over each element in the collection. The many elements are flattened into the resulting collection.

- FlatMap accepts a function that returns an iterable, where each of the output iterable’s elements is an element of the resulting PCollection.



In [None]:
#FlatMap with UDF
def splitFunc(text):
  return text.split(',')

with beam.Pipeline() as pipeline:
    plants = (
        pipeline
        | 'Gardening plants' >> beam.Create([
            '🍓Strawberry,🥕Carrot,🍆Eggplant',
            '🍅Tomato,🥔Potato',
        ])
        | 'Split words' >> beam.FlatMap(splitFunc)
        | 'Print Records' >> beam.Map(print)
  )

🍓Strawberry
🥕Carrot
🍆Eggplant
🍅Tomato
🥔Potato


In [None]:
# FlatMapTuple is used when each input element is a tuple, such as a (key, value) pair
def format_plant(icon, plant):
  # Yield nothing when the icon is missing, so that element is dropped
  if icon:
    yield '{}{}'.format(icon, plant)

with beam.Pipeline() as pipeline:
    plants = (
        pipeline
        | 'Gardening plants' >> beam.Create([
            ('🍓', 'Strawberry'),
            ('🥕', 'Carrot'),
            ('🍆', 'Eggplant'),
            ('🍅', 'Tomato'),
            ('🥔', 'Potato'),
            (None, 'Invalid'),
        ])
        | 'Format' >> beam.FlatMapTuple(format_plant)
        | beam.Map(print)
    )

🍓Strawberry
🥕Carrot
🍆Eggplant
🍅Tomato
🥔Potato

In [None]:
# FlatMap with an extra argument passed through to the function
def split_words(text, delimiter=None):
    return text.split(delimiter)

with beam.Pipeline() as pipeline:
  plants = (
      pipeline
      | 'Gardening plants' >> beam.Create([
          '🍓Strawberry,🥕Carrot,🍆Eggplant',
          '🍅Tomato,🥔Potato',
      ])
      | 'Split words' >> beam.FlatMap(split_words, delimiter=',')
      | beam.Map(print))


🍓Strawberry
🥕Carrot
🍆Eggplant
🍅Tomato
🥔Potato


3. Filter()

- Given a predicate, filter out all elements that don’t satisfy that predicate. May also be used to filter based on an inequality with a given value based on the comparison ordering of the element.

In [None]:
#Filter with UDF
def is_perennial(plant):
  return plant['duration'] == 'perennial'

with beam.Pipeline() as pipeline:
  perennials = (
      pipeline
      | 'Gardening plants' >> beam.Create([
          {
              'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'
          },
          {
              'icon': '🥕', 'name': 'Carrot', 'duration': 'biennial'
          },
          {
              'icon': '🍆', 'name': 'Eggplant', 'duration': 'perennial'
          },
          {
              'icon': '🍅', 'name': 'Tomato', 'duration': 'annual'
          },
          {
              'icon': '🥔', 'name': 'Potato', 'duration': 'perennial'
          },
      ])
      | 'Filter perennials' >> beam.Filter(is_perennial)
      | beam.Map(print))

{'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'}
{'icon': '🍆', 'name': 'Eggplant', 'duration': 'perennial'}
{'icon': '🥔', 'name': 'Potato', 'duration': 'perennial'}


In [None]:
#Filtering with multiple arguments

def has_duration(plant, duration):
  return plant['duration'] == duration

with beam.Pipeline() as pipeline:
  perennials = (
      pipeline
      | 'Gardening plants' >> beam.Create([
          {
              'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'
          },
          {
              'icon': '🥕', 'name': 'Carrot', 'duration': 'biennial'
          },
          {
              'icon': '🍆', 'name': 'Eggplant', 'duration': 'perennial'
          },
          {
              'icon': '🍅', 'name': 'Tomato', 'duration': 'annual'
          },
          {
              'icon': '🥔', 'name': 'Potato', 'duration': 'perennial'
          },
      ])
      | 'Filter perennials' >> beam.Filter(has_duration, 'perennial')
      | beam.Map(print))

{'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'}
{'icon': '🍆', 'name': 'Eggplant', 'duration': 'perennial'}
{'icon': '🥔', 'name': 'Potato', 'duration': 'perennial'}
