
# Installation
Run the following command to install Apache Beam.

Note: To run a pipeline in the Google Colab environment, there is no need to install or configure runners. Each Colab session is assigned a new virtual environment, so Apache Beam must be installed again every time a new session is created.

In [0]:
!{'pip install apache-beam'}

# Upload the required files

Upload every file that the pipelines will read by running the following command. Transformations can then be applied by reading the uploaded data files.

In [0]:
from google.colab import files
uploaded = files.upload()

Saving dept_data.txt to dept_data.txt
Saving location.txt to location.txt


# Map method

Map takes a single input element and returns a single output element. For example, if a string is passed as input and a split is applied to it, Map returns one list containing the split values.

The following code demonstrates this:

In [0]:
import apache_beam as beam

def split_row(element):
    return element.split(',')

p1 = beam.Pipeline()

attendance_count = (
    
    p1
      |beam.io.ReadFromText('dept-data.txt')
      
      # Split row by columns/elements with comma and returns a list 
      |beam.Map(split_row) 

      |beam.io.WriteToText('data/output_new_final')
)

p1.run()

# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/output_new_final-00000-of-00001')}



['149633CM', 'Marco', '10', 'Accounts', '1-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts', '1-01-2019']
['231555ZZ', 'Itoe', '10', 'Accounts', '1-01-2019']
['503996WI', 'Edouard', '10', 'Accounts', '1-01-2019']
['704275DC', 'Kyle', '10', 'Accounts', '1-01-2019']
['957149WC', 'Kyle', '10', 'Accounts', '1-01-2019']
['241316NX', 'Kumiko', '10', 'Accounts', '1-01-2019']
['796656IE', 'Gaston', '10', 'Accounts', '1-01-2019']
['331593PS', 'Beryl', '20', 'HR', '1-01-2019']
['560447WH', 'Olga', '20', 'HR', '1-01-2019']
['222997TJ', 'Leslie', '20', 'HR', '1-01-2019']
['171752SY', 'Mindy', '20', 'HR', '1-01-2019']
['153636AS', 'Vicky', '20', 'HR', '1-01-2019']
['745411HT', 'Richard', '20', 'HR', '1-01-2019']
['298464HN', 'Kirk', '20', 'HR', '1-01-2019']
['783950BW', 'Kaori', '20', 'HR', '1-01-2019']
['892691AR', 'Beryl', '20', 'HR', '1-01-2019']
['245668UZ', 'Oscar', '20', 'HR', '1-01-2019']
['231206QD', 'Kumiko', '30', 'Finance', '1-01-2019']
['357919KT', 'Wendy', '30', 'Finance', '1-01-2019

# Flat Map Method

Unlike Map, FlatMap takes a single input element and can return multiple output elements. For example, if a string is passed as input and split on a separator, FlatMap emits each separated value as its own element.

Note: FlatMap can be made to emit a single element by wrapping the returned value in a list.

In [0]:
import apache_beam as beam

p1 = beam.Pipeline()

attendance_count = (
    
   p1
    |beam.io.ReadFromText('dept-data.txt')

    # Flat map returns individual elements as the output
    |beam.FlatMap(lambda record: record.split(','))

    # Wrapping the returned value in a list makes FlatMap emit it as a single element
    |beam.FlatMap(lambda record: [record.split(',')])
    
    |beam.io.WriteToText('data/output_new_final')
  
)

p1.run()

# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/output_new_final-00000-of-00001')}

['149633CM']
['Marco']
['10']
['Accounts']
['1-01-2019']
['212539MU']
['Rebekah']
['10']
['Accounts']
['1-01-2019']
['231555ZZ']
['Itoe']
['10']
['Accounts']
['1-01-2019']
['503996WI']
['Edouard']
['10']
['Accounts']
['1-01-2019']
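
The difference between Map and FlatMap can also be seen without Beam. The sketch below is only a plain-Python analogy: `map_like` and `flat_map_like` are hypothetical helpers written for this illustration, not Beam APIs, and the two sample lines are shortened rows in the style of the data file.

```python
# Plain-Python analogy; map_like and flat_map_like are hypothetical helpers, not Beam APIs.
lines = ["149633CM,Marco,10", "212539MU,Rebekah,10"]  # shortened sample rows


def map_like(fn, elements):
    # One output element for every input element.
    return [fn(element) for element in elements]


def flat_map_like(fn, elements):
    # fn returns an iterable; every item in it becomes its own output element.
    return [item for element in elements for item in fn(element)]


def split_row(line):
    return line.split(',')


mapped = map_like(split_row, lines)             # two lists, one per line
flattened = flat_map_like(split_row, lines)     # six separate strings
# Wrapping the result in a list makes the flat-map variant behave like the map variant.
wrapped = flat_map_like(lambda line: [split_row(line)], lines)

print(mapped)
print(flattened)
print(wrapped == mapped)
```

Under this analogy, the second FlatMap in the notebook cell wraps each already-separated value in a list, which is why every output line above holds a one-element list.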


# Filter Method

Filter keeps only the elements for which the given function returns true and discards the rest. For example, a lambda function that checks whether the fourth column of a record equals 'Accounts' is applied to every record read from the source, and only the matching records are passed on.


In [0]:
import apache_beam as beam

p1 = beam.Pipeline()

attendance_count = (
    
   p1
    |beam.io.ReadFromText('dept-data.txt')

    # Map splits each line into a list of column values
    |beam.Map(lambda record: record.split(','))

    # Filter keeps only the records whose department column is 'Accounts'
    |beam.Filter(lambda record: record[3]=='Accounts')
    
    |beam.io.WriteToText('data/output_new_final')
  
)

p1.run()

# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/output_new_final-00000-of-00001')}



['149633CM', 'Marco', '10', 'Accounts', '1-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts', '1-01-2019']
['231555ZZ', 'Itoe', '10', 'Accounts', '1-01-2019']
['503996WI', 'Edouard', '10', 'Accounts', '1-01-2019']
['704275DC', 'Kyle', '10', 'Accounts', '1-01-2019']
['957149WC', 'Kyle', '10', 'Accounts', '1-01-2019']
['241316NX', 'Kumiko', '10', 'Accounts', '1-01-2019']
['796656IE', 'Gaston', '10', 'Accounts', '1-01-2019']
['149633CM', 'Marco', '10', 'Accounts', '2-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts', '2-01-2019']
['231555ZZ', 'Itoe', '10', 'Accounts', '2-01-2019']
['503996WI', 'Edouard', '10', 'Accounts', '2-01-2019']
['704275DC', 'Kyle', '10', 'Accounts', '2-01-2019']
['957149WC', 'Kyle', '10', 'Accounts', '2-01-2019']
['241316NX', 'Kumiko', '10', 'Accounts', '2-01-2019']
['796656IE', 'Gaston', '10', 'Accounts', '2-01-2019']
['718737IX', 'Ayumi', '10', 'Accounts', '2-01-2019']
['149633CM', 'Marco', '10', 'Accounts', '3-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts'

# Combine per key method

CombinePerKey groups the tuples that share the same key and applies the operation given as its argument to the values of each group.

The code below demonstrates this:

In [0]:
import apache_beam as beam

# Create the pipeline object 
p = beam.Pipeline()

# Input collection to read the data from the file and separate values by comma from each rows
input_collection = ( 
                      p 
                      # When the file is read, it will separate file with each line
                      | "Read from the text file" >> beam.io.ReadFromText('dept-data.txt')

                      # Each line needs to be separated by comma to perform further transformations
                      | "Split each rows with comma " >> beam.Map(lambda element: element.split(','))
                  
                      # Retrieve all the records with the label 'Accounts' in the 4th column or Department column
                      | 'Get all Accounts dept persons' >> beam.Filter(lambda record: record[3] == 'Accounts')

                      # Assign each employee with the value 1 in a tuple
                      | 'Pair each accounts employee with 1' >> beam.Map(lambda record: ("Accounts, " +record[1], 1))

                      # Group all the tuples with the summation to count total number of employees
                      | 'Group and sum1' >> beam.CombinePerKey(sum)

                      # Write the output to a file
                      | 'Write results for account' >> beam.io.WriteToText('data/Account')
                 )

# Run the pipeline
p.run()
  
# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/Account-00000-of-00001')}



('Accounts, Marco', 31)
('Accounts, Rebekah', 31)
('Accounts, Itoe', 31)
('Accounts, Edouard', 31)
('Accounts, Kyle', 62)
('Accounts, Kumiko', 31)
('Accounts, Gaston', 31)
('Accounts, Ayumi', 30)
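
What CombinePerKey(sum) computes can be sketched in plain Python. This is an analogy under simplifying assumptions (everything fits in memory, no parallelism); `combine_per_key` is a hypothetical helper, not a Beam function, and the pairs are a small hand-made sample.

```python
from collections import defaultdict


def combine_per_key(pairs, combine_fn):
    # Hypothetical helper: collect the values of each key, then reduce each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: combine_fn(values) for key, values in groups.items()}


# Hand-made sample of (key, 1) pairs in the shape produced by the Map step.
pairs = [
    ("Accounts, Marco", 1),
    ("Accounts, Kyle", 1),
    ("Accounts, Marco", 1),
    ("Accounts, Kyle", 1),
    ("Accounts, Kyle", 1),
]

totals = combine_per_key(pairs, sum)
print(totals)
```

Because the key is built from the employee name alone, rows that share a name are merged into one group. That would explain the count of 62 for Kyle above: the data contains two different ids with that name.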


# Label Transformations 

Each transform can be given a unique string label with the '>>' operator. An error is thrown if a pipeline contains duplicate labels.

In [0]:
import apache_beam as beam

p1 = beam.Pipeline()

attendance_count = (
    
   p1
    | 'Read from the file' >> beam.io.ReadFromText('dept-data.txt')

    # Map returns one list of column values per line
    | 'Split each rows with comma' >> beam.Map(lambda record: record.split(','))
    
    | 'Write final output to file' >> beam.io.WriteToText('data/output_new_final')
  
)

p1.run()

# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/output_new_final-00000-of-00001')}



['149633CM', 'Marco', '10', 'Accounts', '1-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts', '1-01-2019']
['231555ZZ', 'Itoe', '10', 'Accounts', '1-01-2019']
['503996WI', 'Edouard', '10', 'Accounts', '1-01-2019']
['704275DC', 'Kyle', '10', 'Accounts', '1-01-2019']
['957149WC', 'Kyle', '10', 'Accounts', '1-01-2019']
['241316NX', 'Kumiko', '10', 'Accounts', '1-01-2019']
['796656IE', 'Gaston', '10', 'Accounts', '1-01-2019']
['331593PS', 'Beryl', '20', 'HR', '1-01-2019']
['560447WH', 'Olga', '20', 'HR', '1-01-2019']
['222997TJ', 'Leslie', '20', 'HR', '1-01-2019']
['171752SY', 'Mindy', '20', 'HR', '1-01-2019']
['153636AS', 'Vicky', '20', 'HR', '1-01-2019']
['745411HT', 'Richard', '20', 'HR', '1-01-2019']
['298464HN', 'Kirk', '20', 'HR', '1-01-2019']
['783950BW', 'Kaori', '20', 'HR', '1-01-2019']
['892691AR', 'Beryl', '20', 'HR', '1-01-2019']
['245668UZ', 'Oscar', '20', 'HR', '1-01-2019']
['231206QD', 'Kumiko', '30', 'Finance', '1-01-2019']
['357919KT', 'Wendy', '30', 'Finance', '1-01-2019

# Using With Keyword

When a pipeline should run within a limited scope, the "with" statement keeps the code clean: the pipeline runs automatically when the block exits. Without the with statement, the pipeline has to be run by calling the run method.

The following code demonstrates this:

In [0]:
import apache_beam as beam

# With the with statement, the run method does not need to be called
with beam.Pipeline() as p1:

  attendance_count = (

    p1
    | 'Read from the file' >> beam.io.ReadFromText('dept-data.txt')

    # Map returns one list of column values per line
    | 'Split each rows with comma' >> beam.Map(lambda record: record.split(','))

    | 'Write final output to file' >> beam.io.WriteToText('data/output_new_final')

  )

# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/output_new_final-00000-of-00001')}



['149633CM', 'Marco', '10', 'Accounts', '1-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts', '1-01-2019']
['231555ZZ', 'Itoe', '10', 'Accounts', '1-01-2019']
['503996WI', 'Edouard', '10', 'Accounts', '1-01-2019']
['704275DC', 'Kyle', '10', 'Accounts', '1-01-2019']
['957149WC', 'Kyle', '10', 'Accounts', '1-01-2019']
['241316NX', 'Kumiko', '10', 'Accounts', '1-01-2019']
['796656IE', 'Gaston', '10', 'Accounts', '1-01-2019']
['331593PS', 'Beryl', '20', 'HR', '1-01-2019']
['560447WH', 'Olga', '20', 'HR', '1-01-2019']
['222997TJ', 'Leslie', '20', 'HR', '1-01-2019']
['171752SY', 'Mindy', '20', 'HR', '1-01-2019']
['153636AS', 'Vicky', '20', 'HR', '1-01-2019']
['745411HT', 'Richard', '20', 'HR', '1-01-2019']
['298464HN', 'Kirk', '20', 'HR', '1-01-2019']
['783950BW', 'Kaori', '20', 'HR', '1-01-2019']
['892691AR', 'Beryl', '20', 'HR', '1-01-2019']
['245668UZ', 'Oscar', '20', 'HR', '1-01-2019']
['231206QD', 'Kumiko', '30', 'Finance', '1-01-2019']
['357919KT', 'Wendy', '30', 'Finance', '1-01-2019
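
The reason no run call is needed can be illustrated with a small context manager. This is a minimal sketch of the idea, assuming the pipeline object runs itself when the with block exits without an error; `SketchPipeline` is a hypothetical stand-in and not the Beam class.

```python
class SketchPipeline:
    """Hypothetical stand-in for a pipeline object; not the Beam class."""

    def __init__(self):
        self.steps = []
        self.ran = False

    def run(self):
        # Execute every deferred step in order.
        self.ran = True
        return [step() for step in self.steps]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Run only when the block finished without raising an exception.
        if exc_type is None:
            self.run()
        return False


with SketchPipeline() as p:
    p.steps.append(lambda: "read")
    ran_inside = p.ran  # still False: steps are only recorded inside the block

print(ran_inside, p.ran)
```

Inside the block the steps are only declared; execution happens once, on exit.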

# Branching Pipelines

When two or more different sets of transforms need to be applied to a single source to produce multiple outputs, branching pipelines come in handy. To branch a pipeline, create multiple PCollections from the same input, each with its own transforms.

The code below demonstrates a pipeline with branches:

In [0]:
import apache_beam as beam

# Create the pipeline object 
p = beam.Pipeline()

# Input collection to read the data from the file and separate values by comma from each rows
input_collection = ( 
                      p 
                      # When the file is read, it will separate file with each line
                      | "Read from the text file" >> beam.io.ReadFromText('dept-data.txt')

                      # Each line needs to be separated by comma to perform further transformations
                      | "Split each rows with comma " >> beam.Map(lambda element: element.split(','))
                   )

# Count the number of employees in the accounts department and write results to a file
accounts_count = (
                      # Start the transformations on the results of the input_collection
                      input_collection
                  
                      # Retrieve all the records with the label 'Accounts' in the 4th column or Department column
                      | 'Get all Accounts dept persons' >> beam.Filter(lambda record: record[3] == 'Accounts')

                      # Assign each employee with the value 1 in a tuple
                      | 'Pair each accounts employee with 1' >> beam.Map(lambda record: ("Accounts, " +record[1], 1))

                      # Group all the tuples with the summation to count total number of employees
                      | 'Group and sum1' >> beam.CombinePerKey(sum)

                      # Write the output to a file
                      | 'Write results for account' >> beam.io.WriteToText('data/Account')
                 )

# Count the number of employees in the HR department and write results to a file
hr_count = (
                # Start the transformations on the results of the input_collection
                input_collection
            
                # Retrieve all the records with the label 'HR' in the 4th column or Department column
                | 'Get all HR dept persons' >> beam.Filter(lambda record: record[3] == 'HR')

                # Assign each employee with the value 1 in a tuple
                | 'Pair each hr employee with 1' >> beam.Map(lambda record: ("HR, " +record[1], 1))

                # Group all the tuples with the summation to count total number of employees
                | 'Group and sum' >> beam.CombinePerKey(sum)

                # Write the output to a file
                | 'Write results for hr' >> beam.io.WriteToText('data/HR')
           )

# Run the pipeline
p.run()
  
# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/Account-00000-of-00001')}
!{('head -n 20 data/HR-00000-of-00001')}



('Accounts, Marco', 31)
('Accounts, Rebekah', 31)
('Accounts, Itoe', 31)
('Accounts, Edouard', 31)
('Accounts, Kyle', 62)
('Accounts, Kumiko', 31)
('Accounts, Gaston', 31)
('Accounts, Ayumi', 30)
('HR, Beryl', 62)
('HR, Olga', 31)
('HR, Leslie', 31)
('HR, Mindy', 31)
('HR, Vicky', 31)
('HR, Richard', 31)
('HR, Kirk', 31)
('HR, Kaori', 31)
('HR, Oscar', 31)


# Flatten Transformation

Branching a pipeline creates two or more PCollections. To merge the results of different PCollections into a single one, use the Flatten transformation.
The code below demonstrates this:

In [0]:
import apache_beam as beam

# Create the pipeline object 
p = beam.Pipeline()

# Input collection to read the data from the file and separate values by comma from each rows
input_collection = ( 
                      p 
                      # When the file is read, it will separate file with each line
                      | "Read from the text file" >> beam.io.ReadFromText('dept-data.txt')

                      # Each line needs to be separated by comma to perform further transformations
                      | "Split each rows with comma " >> beam.Map(lambda element: element.split(','))
                   )

# Count the number of employees in the accounts department (the result is written after the merge)
accounts_count = (
                      # Start the transformations on the results of the input_collection
                      input_collection
                  
                      # Retrieve all the records with the label 'Accounts' in the 4th column or Department column
                      | 'Get all Accounts dept persons' >> beam.Filter(lambda record: record[3] == 'Accounts')

                      # Assign each employee with the value 1 in a tuple
                      | 'Pair each accounts employee with 1' >> beam.Map(lambda record: ("Accounts, " +record[1], 1))

                      # Group all the tuples with the summation to count total number of employees
                      | 'Group and sum1' >> beam.CombinePerKey(sum)

                 )

# Count the number of employees in the HR department (the result is written after the merge)
hr_count = (
                # Start the transformations on the results of the input_collection
                input_collection
            
                # Retrieve all the records with the label 'HR' in the 4th column or Department column
                | 'Get all HR dept persons' >> beam.Filter(lambda record: record[3] == 'HR')

                # Assign each employee with the value 1 in a tuple
                | 'Pair each hr employee with 1' >> beam.Map(lambda record: ("HR, " +record[1], 1))

                # Group all the tuples with the summation to count total number of employees
                | 'Group and sum' >> beam.CombinePerKey(sum)
           )

# Merge/Flatten/Union the above two PCollections
output =(
                # Two collections in a tuple which are needed to be flattened 
                (accounts_count,hr_count)

                # Flattens the provided PCollections
                | beam.Flatten()

                # Write the Output to the file
                | beam.io.WriteToText('data/Both')
)

# Run the pipeline
p.run()
  
# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/Both-00000-of-00001')}



('Accounts, Marco', 31)
('Accounts, Rebekah', 31)
('Accounts, Itoe', 31)
('Accounts, Edouard', 31)
('Accounts, Kyle', 62)
('Accounts, Kumiko', 31)
('Accounts, Gaston', 31)
('Accounts, Ayumi', 30)
('HR, Beryl', 62)
('HR, Olga', 31)
('HR, Leslie', 31)
('HR, Mindy', 31)
('HR, Vicky', 31)
('HR, Richard', 31)
('HR, Kirk', 31)
('HR, Kaori', 31)
('HR, Oscar', 31)
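
Branching followed by Flatten can be pictured in plain Python as two independent computations over the same parsed rows whose results are then concatenated. This is only an analogy: `count_by_name` is a hypothetical helper, and the four rows are a small sample taken from the outputs shown earlier.

```python
# Small sample of parsed rows, taken from the outputs shown earlier.
records = [
    ["149633CM", "Marco", "10", "Accounts", "1-01-2019"],
    ["331593PS", "Beryl", "20", "HR", "1-01-2019"],
    ["149633CM", "Marco", "10", "Accounts", "2-01-2019"],
    ["231206QD", "Kumiko", "30", "Finance", "1-01-2019"],
]


def count_by_name(rows, dept):
    # Hypothetical helper: filter on the department, then count rows per "dept, name" key.
    counts = {}
    for row in rows:
        if row[3] == dept:
            key = dept + ", " + row[1]
            counts[key] = counts.get(key, 0) + 1
    return list(counts.items())


# Two branches read the same input without affecting each other.
accounts = count_by_name(records, "Accounts")
hr = count_by_name(records, "HR")

# Flatten behaves like a union of the branches: every element of each input appears once.
both = accounts + hr
print(both)
```

Rows that match neither branch, such as the Finance row, do not appear in the merged result.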


# Exercise: Word counts of the file

Count how many times each word occurs in the file. Try it on your own before moving to the solution below.


In [0]:
import apache_beam as beam

p1 = beam.Pipeline()

attendance_count = (
    
    p1
      # When the file is read, it will separate file with each line
      |'Read from the file' >> beam.io.ReadFromText('dept-data.txt')
      
      # Split row by columns/elements with comma and returns a list 
      |'Split each row by comma' >> beam.FlatMap(lambda element: element.split(',')) 

      # Assign each word with a value 1
      |'Assign a value 1 to each word' >> beam.Map(lambda element: (element,1))

      # Apply sum to the grouped words
      |'Group by words and apply sum to its value which is 1' >> beam.CombinePerKey(sum)

      # Write the Output to the file
      |'Write Output to a file' >> beam.io.WriteToText('data/output_new_final')
)

p1.run()

# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/output_new_final-00000-of-00001')}



('149633CM', 31)
('Marco', 31)
('10', 278)
('Accounts', 278)
('1-01-2019', 28)
('212539MU', 31)
('Rebekah', 31)
('231555ZZ', 31)
('Itoe', 31)
('503996WI', 31)
('Edouard', 31)
('704275DC', 31)
('Kyle', 62)
('957149WC', 31)
('241316NX', 31)
('Kumiko', 62)
('796656IE', 31)
('Gaston', 31)
('331593PS', 31)
('Beryl', 62)
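
For a quick sanity check on a small sample, the same count can be reproduced with the standard library. This sketch assumes the sample fits in memory and is not a replacement for the pipeline; the two lines are rows from the data shown earlier.

```python
from collections import Counter

# Two sample rows from the data file.
lines = [
    "149633CM,Marco,10,Accounts,1-01-2019",
    "212539MU,Rebekah,10,Accounts,1-01-2019",
]

# Split every line on commas (the FlatMap step) and count each token
# (the Map to (word, 1) step plus CombinePerKey(sum)).
counts = Counter(token for line in lines for token in line.split(','))

print(counts["Accounts"], counts["Marco"])
```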


# ParDo Transform

A ParDo transform takes each element of the input PCollection, applies a processing function to it, and emits zero, one, or multiple elements.

Functionalities:

- Filtering
    - ParDo can take each element of a PCollection and decide either to output or discard it.
- Formatting or type conversion
    - ParDo can change the type or format of input elements.
- Extracting individual parts
    - ParDo can be used to extract individual elements from a single element.
- Computations
    - ParDo can perform any processing function on the input elements and output a PCollection.

Note: Output can be emitted from the process method with either a yield or a return statement. A return statement must return an iterable, such as a list, of output elements.

In [0]:
import apache_beam as beam

# Extracting individual parts
class SplitRow(beam.DoFn):
  def process(self, element):
    return [element.split(',')]

# Filtering
class FilterColumn(beam.DoFn):
  def process(self, element):
    if element[3]=='Accounts':
      return [element]
      
# Formatting or Type conversion
class ApplyValue(beam.DoFn):
  def process(self, element):
    return [(element[3] + ': ' + element[1],1)]

#Computations
class ApplySummation(beam.DoFn):
  def process(self, element):
    (key, values) = element
    return [(key, sum(values))]

p1 = beam.Pipeline()

attendance_count = (
    
    p1
      |beam.io.ReadFromText('dept-data.txt')
      
      # Split row by columns/elements with comma and returns a list 
      |beam.ParDo(SplitRow()) 

      # Lambda functions can also be used instead of creating a class
      # |beam.ParDo(lambda element : [element.split(',')])

      # Perform filter transforms using ParDo
      |beam.ParDo(FilterColumn())

      # Apply 1 to each elements 
      |beam.ParDo(ApplyValue())

      # Apply Group by to combine the values in a list for all the same keys
      |beam.GroupByKey()

      # Apply summation to all the aggregated values
      |beam.ParDo(ApplySummation())

      # Write output to a file
      |beam.io.WriteToText('data/output_new_final')
)

p1.run()

# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/output_new_final-00000-of-00001')}



('Accounts: Marco', 31)
('Accounts: Rebekah', 31)
('Accounts: Itoe', 31)
('Accounts: Edouard', 31)
('Accounts: Kyle', 62)
('Accounts: Kumiko', 31)
('Accounts: Gaston', 31)
('Accounts: Ayumi', 30)
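
The process functions above use return with a list. The yield style mentioned in the note can be sketched in plain Python as follows. This is an analogy only: `par_do` is a hypothetical helper that mimics how the outputs of a process function are collected, and it is not how Beam is implemented.

```python
def par_do(process, elements):
    # Hypothetical helper: call process on each element and collect whatever it emits.
    outputs = []
    for element in elements:
        result = process(element)
        if result is not None:  # a process function that returns nothing emits no elements
            outputs.extend(result)
    return outputs


def split_row(line):
    # Emits exactly one element: the list of columns.
    yield line.split(',')


def keep_accounts(row):
    # Emits zero or one element.
    if row[3] == 'Accounts':
        yield row


def explode(row):
    # Emits several elements: one per column.
    for field in row:
        yield field


lines = [
    "149633CM,Marco,10,Accounts,1-01-2019",
    "331593PS,Beryl,20,HR,1-01-2019",
]

rows = par_do(split_row, lines)
kept = par_do(keep_accounts, rows)
fields = par_do(explode, kept)
print(len(rows), len(kept), fields)
```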


# Combine Transform
A combiner is a mini reducer that performs the reduce task locally on the mapper machine. The combining operation must be both associative and commutative. A class that inherits from CombineFn needs to override the following four methods:



1.   **Create Accumulator** creates a new “local” accumulator. In the example case, taking a mean average, a local accumulator tracks the running sum of values (the numerator value for our final average division) and the number of values summed so far (the denominator value). It may be called any number of times in a distributed fashion.
2.   **Add Input** adds an input element to an accumulator, returning the accumulator value. In our example, it would update the sum and increment the count. It may also be invoked in parallel.
3.   **Merge Accumulators** merges several accumulators into a single accumulator; this is how data in multiple accumulators is combined before the final calculation. In the case of the mean average computation, the accumulators representing each portion of the division are merged together. It may be called again on its outputs any number of times.
4.   **Extract Output** performs the final computation. In the case of computing a mean average, this means dividing the combined sum of all the values by the number of values summed. It is called once on the final, merged accumulator.


In [0]:
import apache_beam as beam

p = beam.Pipeline()

# Class inheriting the CombineFn class
class AverageFn(beam.CombineFn):
  
  def create_accumulator(self):
     return (0.0, 0)   # initialize (sum, count)

  def add_input(self, sum_count, value):
    # Names such as total and value avoid shadowing the built-in sum and input
    (total, count) = sum_count
    return total + value, count + 1

  def merge_accumulators(self, accumulators):
    
    # * operator is an unpacking operator which helps zip function to unzip accumulators to sums and counts
    # zip - [(27, 3), (39, 3), (18, 2)]  -->   [(27,39,18), (3,3,2)]
    ind_sums, ind_counts = zip(*accumulators)      
    return sum(ind_sums), sum(ind_counts)        # (84,8)

  def extract_output(self, sum_count):    
    
    # Divide the merged sum by the merged count; return NaN for an empty input
    (total, count) = sum_count
    return total / count if count else float('NaN')
  

small_sum = (
           p 
            | beam.Create([15,5,7,7,9,23,13,5])
            | "Combine Globally" >> beam.CombineGlobally(AverageFn()) 
            | 'Write results to a output file' >> beam.io.WriteToText('data/combine')
          )
p.run()

# Sample the first 20 results, remember there are no ordering guarantees.
!{'head -n 20 data/combine-00000-of-00001'}

10.5
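
The order in which the four methods cooperate can be traced by hand in plain Python. The sketch below is an illustration, not Beam code: `AverageSketch` mirrors the method names of AverageFn without inheriting from any Beam class, and the split of the input into three bundles is an assumption chosen to match the example in the comments above. A real runner decides the bundling itself.

```python
class AverageSketch:
    """Plain-Python mirror of the four CombineFn methods (hypothetical, no Beam)."""

    def create_accumulator(self):
        return (0.0, 0)  # (sum, count)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return total + value, count + 1

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float('nan')


fn = AverageSketch()

# Assumed bundling of the input [15, 5, 7, 7, 9, 23, 13, 5] across three workers.
bundles = [[15, 5, 7], [7, 9, 23], [13, 5]]

partials = []
for bundle in bundles:
    accumulator = fn.create_accumulator()   # one local accumulator per bundle
    for value in bundle:
        accumulator = fn.add_input(accumulator, value)
    partials.append(accumulator)

merged = fn.merge_accumulators(partials)    # combine the local results
average = fn.extract_output(merged)         # final computation, done once

print(partials, merged, average)
```

Because addition is associative and commutative, any other bundling of the same eight values leads to the same merged accumulator and the same average.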


# Composite Transform

Transforms can have a nested structure, where a complex transform performs multiple simpler transforms (such as more than one ParDo, Combine, GroupByKey, or even other composite transforms). These transforms are called composite transforms. Nesting multiple transforms inside a single composite transform can make your code more modular and easier to understand.

The following code demonstrates composite transforms:

In [0]:
import apache_beam as beam

class MyTransform(beam.PTransform):
  
  def expand(self, input_coll):
    
    a = ( 
        input_coll
                       | 'Group and sum1' >> beam.CombinePerKey(sum)
                       | 'count filter accounts' >> beam.Filter(filter_on_count)
                       | 'Regular accounts employee' >> beam.Map(format_output)
              
    )
    return a

def SplitRow(element):
    return element.split(',')
  
  
def filter_on_count(element):
  name, count = element
  # Filter expects a boolean: keep only the keys whose count exceeds 30
  return count > 30
  
def format_output(element):
  name, count = element
  return ', '.join((str(count),'Regular employee'))

p = beam.Pipeline()

input_collection = ( 
                      p 
                      | "Read from text file" >> beam.io.ReadFromText('dept-data.txt')
                      | "Split rows" >> beam.Map(SplitRow)
                   )

accounts_count = (
                      input_collection
                      | 'Get all Accounts dept persons' >> beam.Filter(lambda record: record[3] == 'Accounts')
                      | 'Pair each accounts employee with 1' >> beam.Map(lambda record: ("Accounts, " +record[1], 1))
                      | 'composite accounts' >> MyTransform()
                      | 'Write results for account' >> beam.io.WriteToText('data/Account')
                 )

hr_count = (
                input_collection
                | 'Get all HR dept persons' >> beam.Filter(lambda record: record[3] == 'HR')
                | 'Pair each hr employee with 1' >> beam.Map(lambda record: ("HR, " +record[1], 1))
                | 'composite HR' >> MyTransform()
                | 'Write results for hr' >> beam.io.WriteToText('data/HR')
           ) 
p.run()
  
# Sample the first 20 results, remember there are no ordering guarantees.
!{('head -n 20 data/Account-00000-of-00001')}
!{('head -n 20 data/HR-00000-of-00001')}



31, Regular employee
31, Regular employee
31, Regular employee
31, Regular employee
62, Regular employee
31, Regular employee
31, Regular employee
62, Regular employee
31, Regular employee
31, Regular employee
31, Regular employee
31, Regular employee
31, Regular employee
31, Regular employee
31, Regular employee
31, Regular employee
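
The idea behind a composite transform, several steps packaged behind one reusable name, can be sketched as an ordinary function. This is a plain-Python analogy only: `regular_employees` is a hypothetical helper that bundles the same three steps as MyTransform, and the "HR, Sample" key is made up for the illustration.

```python
def regular_employees(pairs, threshold=30):
    # Step 1: sum the values of each key (CombinePerKey(sum) in the notebook).
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    # Step 2: keep only the keys whose count exceeds the threshold (the Filter step).
    kept = [(key, count) for key, count in totals.items() if count > threshold]
    # Step 3: format each remaining count as text (the Map step).
    return [', '.join((str(count), 'Regular employee')) for _, count in kept]


# The same bundled steps are reused for two different inputs.
hr_pairs = [("HR, Olga", 1)] * 31 + [("HR, Sample", 1)] * 5  # "HR, Sample" is made up
accounts_pairs = [("Accounts, Ayumi", 1)] * 30

hr_result = regular_employees(hr_pairs)
accounts_result = regular_employees(accounts_pairs)
print(hr_result, accounts_result)
```

A count of exactly 30 does not pass the filter, which is consistent with the 30-count Accounts entry being absent from the output above.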


# CoGroupByKey Transform

CoGroupByKey performs a relational join of two or more key/value PCollections that have the same key type. Consider using CoGroupByKey if you have multiple data sets that provide information about related things. For example, let’s say you have two different files with user data: one file has names and email addresses; the other file has names and phone numbers. You can join those two data sets, using the user name as a common key and the other data as the associated values. After the join, you have one data set that contains all of the information (email addresses and phone numbers) associated with each name.
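
The shape of the joined result for the names, emails, and phone numbers example can be sketched in plain Python. This is an analogy under simplifying assumptions (in-memory data, no parallelism): `co_group_by_key` is a hypothetical helper, not the Beam transform, and all names, addresses, and numbers are invented sample data.

```python
def co_group_by_key(tagged_inputs):
    # Hypothetical helper: for every key, collect the values of each tagged input in a list.
    tags = list(tagged_inputs)
    result = {}
    for tag, pairs in tagged_inputs.items():
        for key, value in pairs:
            # A key seen in any input gets an entry with an empty list for every tag.
            entry = result.setdefault(key, {t: [] for t in tags})
            entry[tag].append(value)
    return result


# Invented sample data.
emails = [("amy", "amy@example.com"), ("carl", "carl@example.com")]
phones = [("amy", "111-222-3333"), ("amy", "333-444-5555"), ("james", "222-333-4444")]

joined = co_group_by_key({"emails": emails, "phones": phones})
for name, info in joined.items():
    print(name, info)
```

A key that is missing from one input still appears in the result, with an empty list under that input's tag.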

In [0]:
import apache_beam as beam

# Create a tuple from the data having one key and rest columns as a value.
def retTuple(element): 

  thisTuple=element.split(',')
  return (thisTuple[0],thisTuple[1:])
                
p1 = beam.Pipeline()

# Read the department data and key each row by its first column
dep_rows = ( 
                p1
                # Read data from the file
                | "Reading File 1" >> beam.io.ReadFromText('dept_data.txt')
                | 'Pair each employee with key' >> beam.Map(retTuple)          # {149633CM : [Marco,10,Accounts,1-01-2019]}
    
               )


loc_rows = ( 
                p1
                # Read data from the second file
                | "Reading File 2" >> beam.io.ReadFromText('location.txt') 
                | 'Pair each loc with key' >> beam.Map(retTuple)                # {149633CM : [9876843261,New York]}
               )


results = ({'dep_data': dep_rows, 'loc_data': loc_rows} 
           
           | beam.CoGroupByKey()
           | 'Write results' >> beam.io.WriteToText('data/result')
          )


p1.run()

!{('head -n 20 data/result-00000-of-00001')}



('149633CM', {'dep_data': [['Marco', '10', 'Accounts', '1-01-2019'], ['Marco', '10', 'Accounts', '2-01-2019'], ['Marco', '10', 'Accounts', '3-01-2019'], ['Marco', '10', 'Accounts', '4-01-2019'], ['Marco', '10', 'Accounts', '5-01-2019'], ['Marco', '10', 'Accounts', '6-01-2019'], ['Marco', '10', 'Accounts', '7-01-2019'], ['Marco', '10', 'Accounts', '8-01-2019'], ['Marco', '10', 'Accounts', '9-01-2019'], ['Marco', '10', 'Accounts', '10-01-2019'], ['Marco', '10', 'Accounts', '11-01-2019'], ['Marco', '10', 'Accounts', '12-01-2019'], ['Marco', '10', 'Accounts', '13-01-2019'], ['Marco', '10', 'Accounts', '14-01-2019'], ['Marco', '10', 'Accounts', '15-01-2019'], ['Marco', '10', 'Accounts', '16-01-2019'], ['Marco', '10', 'Accounts', '17-01-2019'], ['Marco', '10', 'Accounts', '18-01-2019'], ['Marco', '10', 'Accounts', '19-01-2019'], ['Marco', '10', 'Accounts', '20-01-2019'], ['Marco', '10', 'Accounts', '21-01-2019'], ['Marco', '10', 'Accounts', '22-01-2019'], ['Marco', '10', 'Accounts', '23-01-2