# Beam Notebooks Demo

This example demonstrates how to set up an Apache Beam pipeline that reads from a
[Google Cloud Storage](https://cloud.google.com/storage) file containing text from Shakespeare's work *King Lear*, 
tokenizes the text lines into individual words, and performs a frequency count on each of those words. 

We will perform the aggregation operations using the Beam Dataframes API, which lets us write our transformations in Pandas-like syntax. We will see how easily code written with Pandas locally translates to Dataframes in Apache Beam (which could then be run on Dataflow).

We will then show how to use the `beam_sql` cell magic to accomplish the same task with SQL, and finally how to join multiple PCollections using `beam_sql`.

For details about the Apache Beam Dataframe API, see the [Documentation](https://beam.apache.org/documentation/dsls/dataframes/overview/).

For details about the Apache Beam SQL API, see the [Documentation](https://beam.apache.org/documentation/dsls/sql/overview/).

## Getting set up and importing our data

We first start with the necessary imports:

In [None]:
# Python's regular expression library
import re
import typing

# Beam and interactive Beam imports
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

# Dataframe API imports
from apache_beam.dataframe.convert import to_dataframe
from apache_beam.dataframe.convert import to_pcollection

We will be using the `re` library to parse our lines of text. We import the `InteractiveRunner` class for executing our pipeline in the notebook environment and the `interactive_beam` module for exploring the PCollections. Finally, we import two functions from the Dataframe API, `to_dataframe` and `to_pcollection`. `to_dataframe` converts a (schema-aware) PCollection into a dataframe, and `to_pcollection` goes in the other direction, back to a `PCollection` of type `beam.Row`.

We will first create a composite PTransform `ReadWordsFromText` to read in a file pattern (`file_pattern`), use the `ReadFromText` source to read in the files, and then `FlatMap` with a lambda to parse the line into individual words.

In [None]:
class ReadWordsFromText(beam.PTransform):
    
    def __init__(self, file_pattern):
        self._file_pattern = file_pattern
    
    def expand(self, pcoll):
        return (pcoll.pipeline
                | beam.io.ReadFromText(self._file_pattern)
                | beam.FlatMap(lambda line: re.findall(r'[\w\']+', line.strip(), re.UNICODE)))
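
To get a feel for what the `FlatMap` step emits, here is a minimal sketch that applies the same regular expression to a single line outside of Beam. The helper name `tokenize` and the sample line are illustrative assumptions, not part of the pipeline.

```python
import re

def tokenize(line):
    # Same pattern as in ReadWordsFromText: runs of word characters and apostrophes.
    # In Python 3, str patterns already match Unicode, so re.UNICODE is redundant but harmless.
    return re.findall(r"[\w']+", line.strip(), re.UNICODE)

# Hypothetical sample line used only for illustration.
sample = "  He's mad that trusts in the tameness of a wolf.  "
tokens = tokenize(sample)
print(tokens)
```

Punctuation and surrounding whitespace are dropped, while contractions such as `He's` stay together as one token because the apostrophe is part of the character class.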

To be able to process our data in the notebook environment and explore the PCollections, we will use the interactive runner. We create the pipeline object in the usual manner, but pass in `InteractiveRunner()` as the runner.

In [None]:
p = beam.Pipeline(InteractiveRunner())

Now we're ready to start processing our data! We first apply our `ReadWordsFromText` transform to read in the lines of text from Google Cloud Storage and parse them into individual words.

In [None]:
words = p | 'ReadWordsFromText' >> ReadWordsFromText('gs://apache-beam-samples/shakespeare/kinglear.txt')

Now we will see some capabilities of the interactive runner. First we can use `ib.show` to view the contents of a specific `PCollection` from any point of our pipeline. 

In [None]:
ib.show(words)

Great! We see that we have 28,001 words in our PCollection and we can view the words in our PCollection. 

We can also view the current DAG for our pipeline by using `ib.show_graph()`. Note that here we pass in the pipeline object rather than a PCollection.

In [None]:
ib.show_graph(p)

In the above graph, the rectangular boxes correspond to PTransforms and the circles correspond to PCollections.

## Using the Dataframes API

Next we will add a simple schema to our PCollection and convert the PCollection into a dataframe using the `to_dataframe` method. 

In [None]:
class WordRow(typing.NamedTuple):
    word: str

word_rows = words | 'ApplySchema' >> beam.Map(lambda word : WordRow(word=word)).with_output_types(WordRow)

df = to_dataframe(word_rows)

We can now explore our PCollection as a Pandas-like dataframe! One of the first things many data scientists do as soon as they load data into a dataframe is explore the first few rows of data using the `head` method. Let's see what happens here.

In [None]:
df.head()

Notice that we got a very specific type of error! `WontImplementError` is raised for Pandas methods that will not be implemented for Beam dataframes because they violate the Beam model in one way or another. In this case, the `head` method depends on the order of the rows in the dataframe, but a PCollection is unordered, so there is no well-defined set of "first" rows to return.

Our goal however is to count the number of times each word appears in the ingested text. First we will add a new column in our dataframe named `count` with a value of `1` for all rows. After that, we will group by the value of the `word` column and apply the `sum` method for the `count` field.

In [None]:
df['count'] = 1
counted = df.groupby('word').sum()
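
For comparison, here is a minimal sketch of the same two steps in plain Pandas on a small in-memory dataframe. The sample words and the names `local_df` and `local_counted` are assumptions made for illustration; they are not part of the pipeline.

```python
import pandas as pd

# Hypothetical sample data standing in for the tokenized words.
local_df = pd.DataFrame({'word': ['the', 'king', 'the', 'fool', 'the']})

# Same two steps as the Beam version: add a constant column, then group and sum.
local_df['count'] = 1
local_counted = local_df.groupby('word').sum()

print(local_counted)
```

The difference is that the Pandas version computes its result immediately, while the Beam version only adds steps to the pipeline graph.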

That's it! It looks exactly like the code one would write when using Pandas. However, what does this look like in the DAG for the pipeline? We can see this by executing `ib.show_graph(p)` as before.

In [None]:
ib.show_graph(p)

We can see that the dataframe manipulations added a new PTransform to our pipeline. Let us convert the dataframe back to a PCollection so we can use `ib.show` to view the contents.

In [None]:
word_counts = to_pcollection(counted, include_indexes=True)
ib.show(word_counts)

Great! We can now see that the words have been successfully counted. Finally, let us add a sink to the pipeline. We can do this in two ways. If we wish to write to a CSV file, we can use the dataframe's `to_csv` method. We can also use the `WriteToText` transform after converting back to a PCollection. Let's do both and explore the outputs.

In [None]:
counted.to_csv('from_df.csv')
_ = word_counts | beam.io.WriteToText('from_pcoll.csv')

Before saving the outputs to the sinks, let's take a peek at our finished pipeline.

In [None]:
ib.show_graph(p)

Note that we can see the branching into two different sinks, as well as the point where the dataframe is converted back to a PCollection. We can run our entire pipeline by using `p.run()` as normal.

In [None]:
p.run()

Let us now look at the beginning of the output files and compare them, using the `!` shell escape to run the `head` command.

In [None]:
!head from_df*

In [None]:
!head from_pcoll*

We (functionally) end up with the same information, as expected! The big difference is in how the results are presented. In the case of the output from the `WriteToText` connector, we did not convert the elements of our PCollection from objects of type `Row`. We could write a simple intermediate transform to pull out the properties of the `Row` object into a comma-separated representation. For example:

```
def row_to_csv(element):
    output = f"{element.word},{element.count}"
    return output
```

Then we could replace the code `_ = word_counts | beam.io.WriteToText('from_pcoll.csv')` with

```
# Parentheses are required so the chained transforms can span multiple lines.
_ = (word_counts
     | beam.Map(row_to_csv)
     | beam.io.WriteToText('from_pcoll.csv'))
```

However, note that the `to_csv` method for the dataframe took care of this conversion for us.
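
To see what `row_to_csv` produces without running the pipeline, here is a minimal sketch that applies it to a couple of stand-in rows. The `WordCount` class and the sample values are assumptions for illustration; in the pipeline the elements come from `to_pcollection`.

```python
import typing

class WordCount(typing.NamedTuple):
    # Hypothetical stand-in for the rows produced by to_pcollection.
    word: str
    count: int

def row_to_csv(element):
    # Pull the two fields out of the row and join them with a comma.
    return f"{element.word},{element.count}"

rows = [WordCount('king', 2), WordCount('fool', 1)]
lines = [row_to_csv(row) for row in rows]
print(lines)
```

This simple formatting is enough here because the tokens contain only word characters and apostrophes; values that could contain commas or quotes would need proper CSV escaping.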

## Setting up to use Beam SQL in notebooks

Now we will accomplish the same task, counting the number of instances of each word in _King Lear_, using Beam SQL. Recall that in Python, the `SqlTransform` PTransform is a cross-language transform written in Java. Let us be sure that we have both Docker and Java installed in our instance so that we can leverage cross-language transforms.

In [None]:
!docker image list
!java -version

If you are running this notebook using Vertex AI Workbench in a Beam Notebook instance, then you should see a list of the Docker images that are locally available and see that we are using OpenJDK 1.8+. If you are running this notebook elsewhere, then you will need to ensure that Docker and Java are installed and that the right JARs and containers are built. The Appendix of this notebook will walk you through the steps you need to follow to build the relevant JARs and containers.

[Beam SQL](https://beam.apache.org/documentation/dsls/sql/overview/) allows a Beam user to query PCollections with SQL statements. Currently, `InteractiveRunner` does not support `SqlTransform` directly. However, a user could use the `beam_sql` magic to run Beam SQL in the notebook and introspect the result. 

`beam_sql` is an IPython [custom magic](https://ipython.readthedocs.io/en/stable/config/custommagics.html). If you're not familiar with magics, here are some [built-in examples](https://ipython.readthedocs.io/en/stable/interactive/magics.html). It's a convenient way to validate your queries locally against known or test data sources when prototyping a Beam pipeline with SQL, before productionizing it on a remote cluster or service.

This Beam Notebook environment has preloaded the `beam_sql` magic. You can also explicitly load it via `%load_ext apache_beam.runners.interactive.sql.beam_sql_magics` if you set up your own notebook elsewhere.

The `beam_sql` magic can be used as either a line magic or a cell magic. You can check its usage by running:

In [None]:
%beam_sql -h

Why would you want to use `beam_sql` in a notebook environment? 
- You can leverage an intuitive syntax with SQL.
    - No need to use the constant `PCOLLECTION` when querying a single PCollection
    - No need to name multiple input PCollections, instead you can refer to them by their variable names.
- No need to write `SqlTransform` and other Beam related boilerplate code.
- You can introspect the result immediately.
- Coder registration for your PCollection schemas is handled automatically.

## Using `beam_sql` 

First let us revisit our earlier problem. We want to count the number of times each word appears in King Lear. Instead of using the Dataframe API, let's use SQL to accomplish the same task as before. Note that as of the time of writing, there is no native Calcite SQL function to break a string into words as we have using the `re` library. For that reason, we will start with the PCollection we called `word_rows` before. Note that we need a subclass of `NamedTuple` to use the `beam_sql` cell magic, which is why we're starting here.

In [None]:
import logging

# Set logging level to ERROR to minimize logs in notebook.
logging.root.setLevel(logging.ERROR)

Let's start out by looking at ten of the elements in our `word_rows` PCollection.

**Note:** The first time you execute the `beam_sql` cell magic in a notebook, it will take a few minutes to run. The relevant container for executing the cross-language transform needs to first be built.

In [None]:
%%beam_sql -o ten_words
SELECT * FROM word_rows LIMIT 10

Great! Everything is working: we can see ten words from our PCollection. Now we can quickly perform the same aggregation as before, this time using SQL.

In [None]:
%%beam_sql -o word_count
SELECT word, COUNT(*) AS word_count FROM word_rows GROUP BY word 

We saved the output of the SQL query in a new PCollection called `word_count`. We can view this using the Interactive Runner as before.

In [None]:
ib.show(word_count)

## Joins using `beam_sql`

Now let us look at one more example of using SQL in the notebook environment. In particular, let us see how easy it is to join two PCollections! We will create two in-memory PCollections for this example: one whose elements have the schema `Person` and one whose elements have the schema `Pet`. For the sake of simplicity, each schema has only a few fields.

In [None]:
class Person(typing.NamedTuple):
    name: str
    age: int

class Pet(typing.NamedTuple):
    name: str
    owner_name: str
    species: str

Now we will create a new pipeline and the corresponding PCollections.

In [None]:
p_join = beam.Pipeline(InteractiveRunner())

people = p_join | "Create_People" >> beam.Create([Person('Bob', 19), 
                                                  Person('Alice', 42),
                                                  Person('Ted', 26),
                                                  Person('Michael', 29)])

pets = p_join | "Create_Pets" >> beam.Create([Pet('Cooper', 'Michael', 'Dog'),
                                              Pet('Moose', 'Alice', 'Gerbil'),
                                              Pet('The Destroyer', 'Alice', 'Cat'),
                                              Pet('Ted Jr.', 'Ted', 'Turtle'),
                                              Pet('Felix', 'Ted', 'Cat')])
                        

Now we have created our sample PCollections. Suppose that we want to answer the following question: What is the average owner age for each species of pet? We have (reasonably) three options we could use:
1. Create KV pairs with owner names as the key, perform a `CoGroupByKey`, extract the `species` and `age` fields, and then perform `beam.CombinePerKey()` with an average `CombineFn` using `species` as the key.
2. Use the Dataframe API to convert each PCollection to a dataframe, join the dataframes on the `name` and `owner_name` columns, drop the unnecessary fields, and use the `.groupby()` method followed by the `.mean()` method to aggregate.
3. Write a SQL query!

The dataframe option is definitely easier than using `CoGroupByKey`...but SQL is even easier than the other two options! Let's now use the `beam_sql` magic.

In [None]:
%%beam_sql -o age_per_species
SELECT 
    pets.species as species, 
    AVG(people.age) as average_age
FROM 
    people 
JOIN 
    pets 
ON 
    people.name = pets.owner_name
GROUP BY 
    pets.species
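
As a quick way to sanity-check the expected result, the same query can be run against an in-memory SQLite database using only the Python standard library. This is a sketch under the assumption that SQLite is an acceptable stand-in for this particular query; Beam SQL uses a different dialect, and details such as the numeric type of the average may differ.

```python
import sqlite3

# Build two small tables mirroring the people and pets PCollections.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, age INTEGER)')
conn.execute('CREATE TABLE pets (name TEXT, owner_name TEXT, species TEXT)')
conn.executemany('INSERT INTO people VALUES (?, ?)',
                 [('Bob', 19), ('Alice', 42), ('Ted', 26), ('Michael', 29)])
conn.executemany('INSERT INTO pets VALUES (?, ?, ?)',
                 [('Cooper', 'Michael', 'Dog'),
                  ('Moose', 'Alice', 'Gerbil'),
                  ('The Destroyer', 'Alice', 'Cat'),
                  ('Ted Jr.', 'Ted', 'Turtle'),
                  ('Felix', 'Ted', 'Cat')])

query = """
SELECT pets.species AS species, AVG(people.age) AS average_age
FROM people
JOIN pets ON people.name = pets.owner_name
GROUP BY pets.species
"""
# Each result row is (species, average_age), so dict() gives a species lookup.
averages = dict(conn.execute(query).fetchall())
conn.close()
print(averages)
```

Bob owns no pets, so he does not contribute to any average, and the two cat owners (Alice and Ted) are averaged together.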

What if we want to execute this pipeline on Dataflow? We can use the option `-r DataflowRunner` when using the `beam_sql` magic. When you execute the query with this option set, a pop-up will appear for a minimal set of options to run the pipeline and you will be informed where the results of the query will be stored.

In [None]:
%%beam_sql -o age_per_species -r DataflowRunner
SELECT 
    pets.species as species, 
    AVG(people.age) as average_age
FROM 
    people 
JOIN 
    pets 
ON 
    people.name = pets.owner_name
GROUP BY 
    pets.species

Finally, we can check the output of the pipeline once it is completed.

In [None]:
import os

BUCKET = 'your-bucket-name-here' #REPLACE WITH YOUR BUCKET NAME
os.environ['BUCKET'] = BUCKET

!gcloud storage cat gs://$BUCKET/staging/age_per_species*

## Appendix: Setting up `beam_sql` locally

**Important**: If you're using Beam built from your local source code, additionally:

- Have the Java expansion service shadowjar built. Go to the root directory of your local beam repo and then execute:
  `./gradlew :sdks:java:extensions:sql:expansion-service:shadowJar`.
- Based on your jdk version, pull the docker image `docker pull apache/beam_java11_sdk` or `docker pull apache/beam_java8_sdk`.
- Then tag the image with your current Beam dev version.  You can check the dev version under `apache_beam.version.__version__`. For example, if you're using jdk11 and dev version is `x.x.x.dev`, execute `docker image tag apache/beam_java11_sdk:latest apache/beam_java11_sdk:x.x.x.dev`.
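
The tagging step can be sketched as a small helper that builds the command string from a version. The function name `docker_tag_command` is hypothetical, and the image names simply follow the pattern shown above.

```python
def docker_tag_command(dev_version, jdk='11'):
    # Build the `docker image tag` command for the given Beam dev version.
    # jdk is '8' or '11', matching the image names listed above.
    image = f"apache/beam_java{jdk}_sdk"
    return f"docker image tag {image}:latest {image}:{dev_version}"

# In a real session the version would come from apache_beam.version.__version__;
# the placeholder below mirrors the example in the text.
print(docker_tag_command('x.x.x.dev'))
```

The printed command can then be copied into a terminal or run from a notebook cell with `!`.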