# Interactive Beam

In this notebook, we set up your development environment and work through a simple example using the [DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can explore other runners with the [Beam Capability Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).

This notebook is intended to help you work through the tutorial interactively.

To learn more about Colab, see [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).

# Setup

First, you need to set up your environment, which includes installing `apache-beam` and downloading a text file from Cloud Storage to your local file system. We are using this file to test your pipeline.

In [1]:
# Run and print a shell command.
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')

run('pip install --upgrade pip')

# Install apache-beam.
run('pip install --quiet apache-beam')

# Copy the input file into the local file system.
run('mkdir -p data')
run('gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt data/')

>> pip install --upgrade pip
Collecting pip
  Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.2.1

>> pip install --quiet apache-beam
  Preparing metadata (setup.py) ... done
  Preparing metadata (setup.py) ... done

In [None]:
! wc -l data/kinglear.txt


In [3]:

! head -3 data/kinglear.txt

	KING LEAR




In [4]:
import apache_beam as beam
import re

inputs_pattern = 'data/*'
outputs_prefix = 'outputs/part'

## How to interactively work with Beam

Here is an example of how to work iteratively with Beam to understand what is happening in your pipeline.

First, cut the King Lear file down to a manageable size:

In [5]:

! head -10 data/kinglear.txt > data/small.txt
! wc -l data/small.txt

10 data/small.txt


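Create a custom print function so that you can see what each step does to the file. (Python 3's built-in `print` should also work as the mapped function, but we define our own here.)

Note that the output below is printed twice. The `with` block already runs the pipeline when it exits, so the explicit `pipeline.run()` call after it runs the pipeline a second time; see [SO](https://stackoverflow.com/a/52282001/1185293).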

In [6]:
def myprint(x):
  print('{}'.format(x))

with beam.Pipeline() as pipeline:
  (pipeline
      | 'Read lines' >> beam.io.ReadFromText('data/small.txt')
      | "print" >> beam.Map(myprint)
  )

# The with block above has already run the pipeline; this explicit run() executes it again.
result = pipeline.run()
result.wait_until_finish()



	KING LEAR


	DRAMATIS PERSONAE


LEAR	king of Britain  (KING LEAR:)

KING OF FRANCE:

	KING LEAR


	DRAMATIS PERSONAE


LEAR	king of Britain  (KING LEAR:)

KING OF FRANCE:



'DONE'
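
The repeated output above comes from running the pipeline twice: leaving the `with` block runs the pipeline, and the explicit `pipeline.run()` afterwards runs it again. The sketch below imitates this in plain Python with a toy class. `ToyPipeline` is a hypothetical stand-in written for this illustration, not Beam's implementation.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline that runs when its `with` block exits."""

    def __init__(self, lines):
        self.lines = lines
        self.printed = []  # record of everything printed, for inspection

    def run(self):
        # Process every line once per call.
        for line in self.lines:
            print(line)
            self.printed.append(line)
        return 'DONE'

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Run only if the block finished without an exception.
        if exc_type is None:
            self.run()
        return False


with ToyPipeline(['KING LEAR', 'DRAMATIS PERSONAE']) as toy:
    pass  # transforms would be attached here

once = list(toy.printed)   # output produced by leaving the with block
toy.run()                  # the extra explicit run repeats everything
twice = list(toy.printed)
```

Under this model, dropping the explicit `run()` call after the `with` block should leave a single copy of the output.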

Now, let's split each line into words using a regular expression.

In [7]:

with beam.Pipeline() as pipeline:
  (pipeline
      | 'Read lines' >> beam.io.ReadFromText('data/small.txt')
      | 'get words' >> beam.FlatMap(lambda line: re.findall(r"[a-zA-Z']+", line))
      | "print" >> beam.Map(myprint)
  )

KING
LEAR
DRAMATIS
PERSONAE
LEAR
king
of
Britain
KING
LEAR
KING
OF
FRANCE
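
To check what the regular expression in the `get words` step extracts, it can be tried on a single line outside the pipeline. The first line is taken from the file output shown earlier; the second string is a made-up example that shows how apostrophes are handled.

```python
import re

WORD_PATTERN = r"[a-zA-Z']+"  # runs of letters and apostrophes

line = "LEAR\tking of Britain  (KING LEAR:)"
words = re.findall(WORD_PATTERN, line)
print(words)

# Made-up line: the apostrophe stays inside the word, other punctuation is dropped.
made_up = re.findall(WORD_PATTERN, "I'll go, sir.")
print(made_up)
```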


Recall that `FlatMap` applies a function to each element of the input; the function returns an iterable of zero or more items, and those items are flattened into a single output collection. See [this](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap/) great example of how FlatMap works in Beam, and this answer on [SO](https://stackoverflow.com/a/45682977/1185293) for a simple explanation.

In the case above, we applied an anonymous (lambda) function to each line. You can also define the function explicitly if you prefer a more conventional syntax:

In [8]:
def my_line_split_func(line):
  return re.findall(r"[a-zA-Z']+", line)

with beam.Pipeline() as pipeline:
  (pipeline
      | 'Read lines' >> beam.io.ReadFromText('data/small.txt')
      | 'get words' >> beam.FlatMap(my_line_split_func)
      | "print" >> beam.Map(myprint)
  )


KING
LEAR
DRAMATIS
PERSONAE
LEAR
king
of
Britain
KING
LEAR
KING
OF
FRANCE
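
The difference between mapping and flat-mapping `my_line_split_func` can be imitated in plain Python. This is only an analogue of the semantics using a list comprehension and `itertools.chain`, not how Beam executes the transforms; the three input lines are a small sample in the style of the file.

```python
import re
from itertools import chain


def my_line_split_func(line):
    return re.findall(r"[a-zA-Z']+", line)


lines = ["\tKING LEAR", "", "KING OF FRANCE:"]

# Map-like: one output per input line, so each output is a list of words.
mapped = [my_line_split_func(line) for line in lines]

# FlatMap-like: the per-line lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(my_line_split_func(line) for line in lines))

print(mapped)
print(flat_mapped)
```

The empty line contributes an empty list in the map-like result and nothing at all in the flat-map-like result.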


### Tutorial



In [14]:
! echo -e 'r1c1,r1c2,2020/03/05\nr2c1,r2c2,2020/03/23' > data/play.csv


In [15]:

class Transform(beam.DoFn):

  # Use classes to perform transformations on your PCollections
  # Yield or return the element(s) needed as input for the next transform
  def process(self, element):
    yield element


with beam.Pipeline() as pipeline:
  (pipeline
      | 'Read lines' >> beam.io.ReadFromText('data/play.csv')
      | 'format line' >> beam.ParDo(Transform())
      | "print" >> beam.Map(myprint)
  )


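# `result` is not defined in this cell; it refers to a run from an earlier cell.
# The with block above has already run this pipeline.
result.wait_until_finish()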

r1c1,r1c2,2020/03/05
r2c1,r2c2,2020/03/23


'DONE'
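
The `Transform` class above yields each line unchanged. As a sketch of what a `process` method could do with the lines of `data/play.csv`, the generator below splits a line into fields and rewrites the date. It is written as a plain function so that it can be tried without Beam; the name `format_line` and the output layout are assumptions made for this illustration.

```python
from datetime import datetime


def format_line(element):
    """Yield one reformatted record per CSV line, as a DoFn's process method might."""
    first, second, date_text = element.split(',')
    # Convert a date such as 2020/03/05 to 2020-03-05.
    date = datetime.strptime(date_text, '%Y/%m/%d').strftime('%Y-%m-%d')
    yield (first, second, date)


for line in ['r1c1,r1c2,2020/03/05', 'r2c1,r2c2,2020/03/23']:
    for record in format_line(line):
        print(record)
```

Because the function yields, it can emit zero, one, or several records per input line, which is what distinguishes a `ParDo` from a one-to-one `Map`.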

In [23]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset into a Pandas DataFrame
data = pd.read_csv('/content/drive/My Drive/Datasets/users_v.csv', sep=',')

data.head(10)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,user_id,name,gender,age,address,date_joined
0,1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13
1,2,James Armstrong,male,56,North Jillianfort-UT-86454,2020/11/06
2,3,Cody Shaw,male,75,North Anne-SC-53799,2004/05/29
3,4,Sierra Hamilton,female,76,New Angelafurt-ME-46190,2005/08/26
4,5,Chase Davis,male,31,South Bethmouth-WI-18562,2018/04/30
5,6,Sierra Andrews,female,21,Ryanville-MI-69690,2007/05/25
6,7,Ann Stone,female,41,Smithmouth-SD-17340,2005/01/05
7,8,Karen Santos,female,34,Mariaville-AK-29888,2003/12/12
8,9,Ronald Meyer,male,41,North Cherylhaven-NJ-04197,2015/11/14
9,10,Steven Rivera,male,43,Wayneside-VT-29040,2003/05/15


In [48]:
import dateutil.parser  # needed by change_date_format below

input_file = '/content/drive/My Drive/Datasets/users_v.csv'
output_file = '/content/drive/My Drive/Datasets/marketing_format.csv'

def split_row(row):
    user_id, name, gender, age, address, date_joined = row.split(',')
    return {
        'user_id': user_id,
        'name': name,
        'gender': gender,
        'age': age,
        'address': address,
        'date_joined': date_joined
    }

def change_date_format(row):
    try:
        row['date_joined'] = dateutil.parser.parse(row['date_joined']).strftime('%Y-%m-%d')
    except ValueError:
        row['date_joined'] = ''
    return row

def change_address_format(row):
    row['address'] = ','.join(row['address'].split())
    return row

with beam.Pipeline() as pipeline:
    lines = (
        pipeline
        | 'ReadInput' >> beam.io.ReadFromText(input_file)
    )


    formatted_data = (
        lines
        | 'SplitRow' >> beam.Map(split_row)
        | 'ChangeDateFormat' >> beam.Map(change_date_format)
        | 'ChangeAddressFormat' >> beam.Map(change_address_format)
    )


    (
        formatted_data
        | 'WriteFormattedData' >> beam.io.WriteToText(
            output_file,
            file_name_suffix='.csv',
            header='user_id name gender age address date_joined',
            num_shards=1,
            shard_name_template='',
        )
    )
# The with block has already run the pipeline on exit; this call runs it a second time.
result = pipeline.run()
result.wait_until_finish()




'DONE'
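
To see what each step of the pipeline above does to a record, the same functions can be applied to a single row in plain Python. The sample row is modeled on the first record in the DataFrame preview, assuming the rows read by the pipeline have six comma-separated fields. One substitution keeps the sketch within the standard library: `datetime.strptime` with an assumed `%Y/%m/%d` input format stands in for `dateutil.parser.parse`.

```python
from datetime import datetime


def split_row(row):
    user_id, name, gender, age, address, date_joined = row.split(',')
    return {'user_id': user_id, 'name': name, 'gender': gender,
            'age': age, 'address': address, 'date_joined': date_joined}


def change_date_format(row):
    try:
        # Assumption: dates look like 2019/03/13.
        parsed = datetime.strptime(row['date_joined'], '%Y/%m/%d')
        row['date_joined'] = parsed.strftime('%Y-%m-%d')
    except ValueError:
        row['date_joined'] = ''
    return row


def change_address_format(row):
    # Splits on whitespace only, so the hyphens are left in place.
    row['address'] = ','.join(row['address'].split())
    return row


sample = '1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13'
record = change_address_format(change_date_format(split_row(sample)))
print(record)

# A header line goes through the same functions if it is not skipped when reading.
header = change_date_format(split_row('user_id,name,gender,age,address,date_joined'))
print(header['date_joined'] == '')
```

In this sketch the address becomes `New,Rachelburgh-VA-49583`, and a header line, if present and not skipped, ends up with an empty `date_joined`.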

In [49]:
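# These transforms are attached to the `pipeline` object from the previous cell;
# the `pipeline.run()` call at the end of this cell executes the whole graph again.
gender_counts = (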
    formatted_data
    | 'CountGender' >> beam.Map(lambda row: (row['gender'], 1))
    | 'GroupByGender' >> beam.GroupByKey()
    | 'CountByGender' >> beam.Map(lambda gender_counts: (gender_counts[0], sum(gender_counts[1])))
)


customers_per_day = (
    formatted_data
    | 'CountByDate1' >> beam.Map(lambda row: (row['date_joined'], 1))
    | 'GroupByDate' >> beam.GroupByKey()
    | 'CountByDate2' >> beam.Map(lambda date_counts: (date_counts[0], sum(date_counts[1])))
)

customers_by_state = (
    formatted_data
    | 'ExtractState' >> beam.Map(lambda row: row['address'].split(',')[-1])
    | 'CountByState1' >> beam.Map(lambda state: (state, 1))
    | 'GroupByState' >> beam.GroupByKey()
    | 'CountByState2' >> beam.Map(lambda state_counts: (state_counts[0], sum(state_counts[1])))
)


gender_output_file = '/content/drive/My Drive/Datasets/gender_counts.csv'
(
    gender_counts
    | 'WriteGenderComposition' >> beam.io.WriteToText(gender_output_file, file_name_suffix='.csv', num_shards=1)
)


date_output_file = '/content/drive/My Drive/Datasets/customers_per_day.csv'
(
    customers_per_day
    | 'WriteCustomersPerDay' >> beam.io.WriteToText(date_output_file, file_name_suffix='.csv', num_shards=1)
)


state_output_file = '/content/drive/My Drive/Datasets/customers_by_state.csv'
(
    customers_by_state
    | 'WriteCustomersByState' >> beam.io.WriteToText(state_output_file, file_name_suffix='.csv', num_shards=1)
)


result = pipeline.run()
result.wait_until_finish()




'DONE'
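
Each of the three aggregations above follows the same pattern: map every record to a `(key, 1)` pair, group by key, and sum the ones. A plain-Python analogue of that pattern is sketched below, using three records from the preview reduced to two fields. `count_by` is a hypothetical helper, not a Beam function. The last lines show which key the `ExtractState` lambda produces for an address in this shape, next to an assumed alternative that takes the middle part of a `City-ST-ZIP` address.

```python
from collections import defaultdict


def count_by(records, key_func):
    """Group records by key_func(record) and count each group."""
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record)].append(1)  # like Map to (key, 1) then GroupByKey
    return {key: sum(ones) for key, ones in groups.items()}  # sum the ones per key


# Records modeled on the preview, after the address reformatting step.
records = [
    {'gender': 'male', 'address': 'New,Rachelburgh-VA-49583'},
    {'gender': 'female', 'address': 'Ryanville-MI-69690'},
    {'gender': 'male', 'address': 'Wayneside-VT-29040'},
]

gender_counts = count_by(records, lambda row: row['gender'])
print(gender_counts)

# Key produced by the ExtractState step for the first record.
as_written = records[0]['address'].split(',')[-1]

# Assumed alternative: take the state code from a City-ST-ZIP address.
state_only = records[0]['address'].rsplit('-', 2)[1]
print(as_written, state_only)
```

If the goal is a count per state, the key from the step as written still contains the city fragment and ZIP code, so a split on hyphens along the lines of the alternative may be closer to the intent.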