# Beam Notebooks and Dataframes Demo

This example demonstrates how to set up an Apache Beam pipeline that reads from a
[Google Cloud Storage](https://cloud.google.com/storage) file containing text from Shakespeare's work *King Lear*, 
tokenizes the text lines into individual words, and performs a frequency count on each of those words. 

We will perform the aggregation using the Beam Dataframes API, which lets us write our transformations with Pandas-like syntax. We will see how easily code written for Pandas locally translates to Dataframes in Apache Beam (which could then be run on Dataflow).

For details about the Apache Beam Dataframe API, see the [Documentation](https://beam.apache.org/documentation/dsls/dataframes/overview/).

We first start with the necessary imports:

In [None]:
# Python's regular expression library
import re

# Beam and interactive Beam imports
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

# Dataframe API imports
from apache_beam.dataframe.convert import to_dataframe
from apache_beam.dataframe.convert import to_pcollection

We will be using the `re` library to parse our lines of text. We import the `InteractiveRunner` class to execute our pipeline in the notebook environment and the `interactive_beam` module to explore the PCollections. Finally, we import two functions from the Dataframe API, `to_dataframe` and `to_pcollection`. `to_dataframe` converts a (schema-aware) PCollection into a dataframe, and `to_pcollection` goes in the other direction, back to a `PCollection` of type `beam.Row`.

We will first create a composite PTransform `ReadWordsFromText` to read in a file pattern (`file_pattern`), use the `ReadFromText` source to read in the files, and then `FlatMap` with a lambda to parse the line into individual words.

In [None]:
class ReadWordsFromText(beam.PTransform):
    
    def __init__(self, file_pattern):
        self._file_pattern = file_pattern
    
    def expand(self, pcoll):
        return (pcoll.pipeline
                | beam.io.ReadFromText(self._file_pattern)
                | beam.FlatMap(lambda line: re.findall(r'[\w\']+', line.strip(), re.UNICODE)))
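
To see what the tokenizing lambda produces, the regular expression can be tried on a single string outside of any pipeline. This is a minimal sketch: the sample line is a made-up stand-in rather than a line read from the actual file, and `WORD_PATTERN` is just a local name for the pattern used in `ReadWordsFromText`.

```python
import re

# Same pattern as in ReadWordsFromText: runs of word characters or apostrophes.
WORD_PATTERN = r"[\w']+"

# Hypothetical sample line, not read from the actual file.
line = "  I'll do't, my lord: nothing will come of nothing.  "

# findall returns every non-overlapping match, so punctuation and
# whitespace between the matches are dropped.
tokens = re.findall(WORD_PATTERN, line.strip(), re.UNICODE)
print(tokens)
```

Note that the pattern keeps apostrophes inside words and preserves case, so `Nothing` and `nothing` would later be counted as two different words.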

To be able to process our data in the notebook environment and explore the PCollections, we will use the interactive runner. We create the pipeline object in the same manner as usual, but pass in `InteractiveRunner()` as the runner.

In [None]:
p = beam.Pipeline(InteractiveRunner())

Now we're ready to start processing our data! We first apply our `ReadWordsFromText` transform to read in the lines of text from Google Cloud Storage and parse them into individual words.

In [None]:
words = p | 'ReadWordsFromText' >> ReadWordsFromText('gs://apache-beam-samples/shakespeare/kinglear.txt')

Now we will see some capabilities of the interactive runner. First, we can use `ib.show` to view the contents of a specific `PCollection` at any point in our pipeline.

In [None]:
ib.show(words)

Great! We see that our PCollection contains 28,001 words, and we can browse them in the output.

We can also view the current DAG for our pipeline by using the `ib.show_graph()` method. Note that here we pass in the pipeline object rather than a PCollection.

In [None]:
ib.show_graph(p)

In the above graph, the rectangular boxes correspond to PTransforms and the circles correspond to PCollections.

Next we will add a simple schema to our PCollection and convert the PCollection into a dataframe using the `to_dataframe` method. 

In [None]:
word_rows = words | 'ToRows' >> beam.Map(lambda word: beam.Row(word=word))

df = to_dataframe(word_rows)

We can now explore our PCollection as a Pandas-like dataframe! One of the first things many data scientists do as soon as they load data into a dataframe is explore the first few rows of data using the `head` method. Let's see what happens here.

In [None]:
df.head()

Notice that we got a very specific type of error! The `WontImplementError` is raised for Pandas methods that will not be implemented for Beam dataframes. These are methods that violate the Beam model for one reason or another. For example, in this case the `head` method depends on the order of the dataframe. This conflicts with the Beam model, in which the elements of a PCollection have no defined order.
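
The difference between order-sensitive and order-insensitive operations can be illustrated without Beam at all. The following is a plain-Python sketch, not Beam code: the word lists are made-up sample data, and the two lists stand in for the same elements arriving in two different orders.

```python
from collections import Counter

# Hypothetical sample data: the same elements in two "arrival orders".
order_a = ["king", "lear", "fool", "king", "kent", "fool", "king"]
order_b = list(reversed(order_a))

# Order-sensitive: the "first three rows" depend on the arrival order.
head_a = order_a[:3]
head_b = order_b[:3]
print(head_a, head_b)

# Order-insensitive: per-word counts are identical for both orders.
counts_a = Counter(order_a)
counts_b = Counter(order_b)
print(counts_a == counts_b)
```

An aggregation such as a per-word count gives the same answer however the elements are ordered, which is why it fits the Beam model while `head` does not.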

Our goal, however, is to count the number of times each word appears in the ingested text. First we will add a new column to our dataframe named `count` with a value of `1` for all rows. After that, we will group by the value of the `word` column and apply the `sum` method to the `count` column.
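
For comparison, here is how the same two steps might look on a local Pandas dataframe. This is a minimal sketch that assumes Pandas is installed; the sample words and the names `local_df` and `local_counted` are made up for illustration, chosen so they do not clash with the notebook's own variables.

```python
import pandas as pd

# Hypothetical sample words standing in for the parsed text.
local_df = pd.DataFrame(
    {"word": ["king", "lear", "fool", "king", "fool", "king"]}
)

# Add a constant column, then sum it per distinct word.
local_df["count"] = 1
local_counted = local_df.groupby("word").sum()
print(local_counted)
```

The result is indexed by `word`, with the totals in the `count` column.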

In [None]:
df['count'] = 1
counted = df.groupby('word').sum()

That's it! It looks exactly like the code one would write when using Pandas. However, what does this look like in the DAG for the pipeline? We can see this by executing `ib.show_graph(p)` as before.

In [None]:
ib.show_graph(p)

We can see that the dataframe manipulations added a new PTransform to our pipeline. Let us convert the dataframe back to a PCollection so we can use `ib.show` to view the contents.

In [None]:
word_counts = to_pcollection(counted, include_indexes=True)
ib.show(word_counts)

Great! We can now see that the words have been successfully counted. Finally, let us add a sink to the pipeline. We can do this in two ways. If we wish to write to a CSV file, we can use the dataframe's `to_csv` method. We can also use the `WriteToText` transform after converting back to a PCollection. Let's do both and explore the outputs.

In [None]:
counted.to_csv('from_df.csv')
_ = word_counts | beam.io.WriteToText('from_pcoll.csv')

In [None]:
ib.show_graph(p)

Note that we can see the pipeline branching into two different sinks, and we can also see where the dataframe is converted back to a PCollection. We can run our entire pipeline by using `p.run()` as normal.

In [None]:
p.run()

Let us now compare the beginning of each CSV output by running the `head` command through the notebook's `!` shell escape.

In [None]:
!head from_df*

In [None]:
!head from_pcoll*

We end up with (functionally) the same information, as expected! The big difference is in how the results are presented. In the output from the `WriteToText` connector, we did not convert the `Row` objects in our PCollection into formatted strings. We could write a simple intermediate transform to pull the properties of each `Row` object into a comma-separated representation. For example:

```
def row_to_csv(element):
    output = f"{element.word},{element.count}"
    return output
```
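
The formatting logic can be checked on its own before wiring it into the pipeline. In this sketch, `SimpleNamespace` is only a hypothetical stand-in for a pipeline element; it assumes that the real elements expose `word` and `count` as attributes, as the function above does.

```python
from types import SimpleNamespace


def row_to_csv(element):
    # Format one element as a "word,count" line.
    return f"{element.word},{element.count}"


# Hypothetical stand-in for a pipeline element; only the attribute names matter.
sample = SimpleNamespace(word="king", count=3)
csv_line = row_to_csv(sample)
print(csv_line)
```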

Then we could replace the code `_ = word_counts | beam.io.WriteToText('from_pcoll.csv')` with

```
# Parentheses let the chained transforms span several lines.
_ = (word_counts
     | beam.Map(row_to_csv)
     | beam.io.WriteToText('from_pcoll.csv'))
```

However, note that the `to_csv` method for the dataframe took care of this conversion for us.