# Protobag Demo

This notebook shows:
  * How to write Protobuf messages to a protobag zip archive
  * How to read those messages back
  * How to convert protobag zip archives to and from Pandas DataFrames and Parquet tables


## Environment Set-up

To run this notebook locally, try using the protobag-demo dockerized environment:

`docker run -it --rm --net=host protobag/demo:0.0.3 jupyter notebook --allow-root`
TODO: build this and push it

**Google Colab** You can also [run this notebook in Google Colab](https://colab.sandbox.google.com/github/StandardCyborg/protobag/blob/master/examples/protobag-parquet/protobag-demo-full.ipynb) TODO FIX LINK. In the Colab environment, you'll need to install `protobag` and some other dependencies. Running the cell below takes care of that for you. You might need to restart the runtime (menu option: Runtime > Restart runtime...) for Colab to recognize the new modules.

In [1]:
import os
import sys
if os.path.exists('/opt/protobag'):
    print("We're running in the dockerized environment! We can simply add protobag to the PYTHONPATH")
    if not os.path.exists('/opt/protobag/python/protobag/protobag_native.cpython-36m-x86_64-linux-gnu.so'):
        # We need to build the `protobag_native` module.  The easiest way is to:
        !cd /opt/protobag/python && python3 setup.py test
    sys.path.append('/opt/protobag/python/')
    print("Protobag added to PYTHONPATH")
elif 'google.colab' in sys.modules:
    !pip install protobag
    print("Protobag installed into Colab runtime environment.  You might need to restart the runtime")

import protobag
print("Using protobag version %s at %s" % (protobag.__version__, protobag.__file__))

We're running in the dockerized environment! We can simply add protobag to the PYTHONPATH
Protobag added to PYTHONPATH
Using protobag version 0.0.3 at /opt/protobag/python/protobag/__init__.py


## Our Protobuf Messages

Suppose we've developed a cool game called Dino Hunters where people run around on a deserted island and try to capture wild dinosaurs.  We're using Protobuf to persist data about the `DinoHunter` characters in our game; furthermore, we want to use Protobuf to log the 2D `Position`s of our hunters as they run around and hunt dinos.  Our message schema is as follows:

```protobuf
syntax = "proto3";

package my_messages;

message DinoHunter {
  string first_name = 1;
  int32 id = 2;
  map<string, string> attribs = 3;

  enum DinoType {
    IDK = 0;
    VEGGIESAURUS = 1;
    MEATIESAURUS = 2;
    PEOPLEEATINGSAURUS = 3;
  }

  message Dino {
    string name = 1;
    DinoType type = 2;
  }

  repeated Dino dinos = 4;
}

message Position {
  float x = 1;
  float y = 2;
}
```

We now need the `protoc`-generated Python code in order to use these messages.  For convenience, we'll just download a copy posted in the `protobag` repo:

In [17]:
if not os.path.exists('MyMessages_pb2.py'):
    !wget http://fixme/this/path

from MyMessages_pb2 import DinoHunter
from MyMessages_pb2 import Position

## Note: you can prove to yourself that the downloaded file matches the schema above using the code below:
# import MyMessages_pb2
# from google.protobuf.descriptor_pb2 import FileDescriptorProto
# fd = FileDescriptorProto()
# MyMessages_pb2.DESCRIPTOR.CopyToProto(fd)
# print(fd)

OK! Let's create some hunters:

In [18]:
max_hunter = DinoHunter(
      first_name='py_max',
      id=1,
      dinos=[
        {'name': 'py_nibbles', 'type': DinoHunter.PEOPLEEATINGSAURUS},
      ])
print(max_hunter)

lara_hunter = DinoHunter(
      first_name='py_lara',
      id=2,
      dinos=[
        {'name': 'py_bites', 'type': DinoHunter.PEOPLEEATINGSAURUS},
        {'name': 'py_stinky', 'type': DinoHunter.VEGGIESAURUS},
      ])

print(lara_hunter)

first_name: "py_max"
id: 1
dinos {
  name: "py_nibbles"
  type: PEOPLEEATINGSAURUS
}

first_name: "py_lara"
id: 2
dinos {
  name: "py_bites"
  type: PEOPLEEATINGSAURUS
}
dinos {
  name: "py_stinky"
  type: VEGGIESAURUS
}



## Writing Protobuf messages to a protobag
### Plain Messages (`protobag.MessageEntry`)

`protobag` archives are just zip archives that contain string-serialized Protobuf messages.  (Tar and other formats are also supported via [libarchive](https://github.com/libarchive/libarchive).)  So `protobag` offers a simple API for writing messages to an archive:

In [19]:
bag = protobag.Protobag(path='example.zip')
writer = bag.create_writer()
writer.write_msg("hunters/py_max", max_hunter)
writer.write_msg("hunters/py_lara", lara_hunter)
writer.close()

You can verify that the above just wrote a zip archive for you:

In [22]:
!which unzip > /dev/null || (apt-get update && apt-get install unzip)
!unzip -l example.zip

Archive:  example.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       72  1980-01-01 00:00   hunters/py_max
       86  1980-01-01 00:00   hunters/py_lara
     8691  1980-01-01 00:00   /_protobag_index/bag_index/1595326274.0.stampedmsg.protobin
---------                     -------
     8849                     3 files
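
Since the archive is a plain zip file, Python's standard `zipfile` module can list it too, with no `protobag` import at all. Here's a minimal sketch; the `list_entries` helper is our own name for illustration, and the archive below is an in-memory stand-in with placeholder bytes rather than real serialized messages:

```python
import io
import zipfile


def list_entries(archive):
    """Return (name, size_in_bytes) for every entry in a zip archive.

    `archive` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


# Stand-in archive built in memory so this cell runs on its own.  The payloads
# are placeholder bytes, NOT real serialized DinoHunter messages.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as zf:
    zf.writestr("hunters/py_max", b"placeholder-1")
    zf.writestr("hunters/py_lara", b"placeholder-22")

buf.seek(0)
entries = list_entries(buf)
for name, size in entries:
    print(f"{size:>6}  {name}")
```

Calling `list_entries('example.zip')` on the archive written above should report the same entries that `unzip -l` does.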


Hmm, what's that `/_protobag_index/bag_index/xxxxx.stampedmsg.protobin` file?

## Protobag indexes Protobuf message Descriptors

 By default, `protobag` not only saves those messages but also **indexes Protobuf message descriptors** so that your `protobag` readers don't need your proto schemas to decode your messages.  

### Wat?
In order to deserialize a Protobuf message, typically you need `protoc`-generated code for that message type (and you need `protoc`-generated code for your specific programming language).  This `protoc`-generated code is engineered for efficiency and provides a clean API for accessing message attributes.  But what if you don't have that `protoc`-generated code?  Or you don't even have the `.proto` message definitions to generate such code?

In Protobuf version 3.x, the authors added official support for [the self-describing message paradigm](https://developers.google.com/protocol-buffers/docs/techniques).  Now a user can serialize not just a message but also the Protobuf Descriptor data that describes the message schema.  That Descriptor data enables deserializing the message *without protoc-generated code*; all you need is the `protobuf` library itself.  (This is a core feature of other serialization libraries [like Avro](http://avro.apache.org/docs/1.6.1/).)

Note: dynamic message decoding is slower than using `protoc`-generated code.  Furthermore, the `protoc`-generated code makes defensive programming a bit easier.  You probably want to use the `protoc`-generated code for your messages if you can.
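
To make the idea concrete, here's a rough sketch of dynamic decoding using only the `protobuf` library. This is not `protobag`'s own code: the file name `position_sketch.proto`, the `sketch` package, and the helpers `build_position_schema` and `message_class_for` are made up for illustration, and the schema is built by hand where a real writer would ship the Descriptor data from its `.proto` files. The `hasattr` check is there because the factory function differs between older and newer `protobuf` releases.

```python
from google.protobuf import descriptor_pb2, descriptor_pool, message_factory


def build_position_schema() -> bytes:
    """Serialize a FileDescriptorProto describing a two-float Position message."""
    fd = descriptor_pb2.FileDescriptorProto(
        name="position_sketch.proto",  # hypothetical file name
        package="sketch",              # hypothetical package
        syntax="proto3",
    )
    msg = fd.message_type.add(name="Position")
    for number, field_name in enumerate(("x", "y"), start=1):
        msg.field.add(
            name=field_name,
            number=number,
            type=descriptor_pb2.FieldDescriptorProto.TYPE_FLOAT,
            label=descriptor_pb2.FieldDescriptorProto.LABEL_OPTIONAL,
        )
    return fd.SerializeToString()


def message_class_for(schema_bytes: bytes, full_name: str):
    """Build a message class from serialized Descriptor data alone."""
    pool = descriptor_pool.DescriptorPool()  # private pool, separate from the default
    pool.AddSerializedFile(schema_bytes)
    descriptor = pool.FindMessageTypeByName(full_name)
    if hasattr(message_factory, "GetMessageClass"):  # newer protobuf releases
        return message_factory.GetMessageClass(descriptor)
    return message_factory.MessageFactory(pool).GetPrototype(descriptor)  # older releases


# "Writer" side: emit the schema bytes and one serialized message.
schema_bytes = build_position_schema()
writer_cls = message_class_for(schema_bytes, "sketch.Position")
wire_bytes = writer_cls(x=1.5, y=-2.0).SerializeToString()

# "Reader" side: rebuild the class from the schema bytes, then decode.
reader_cls = message_class_for(schema_bytes, "sketch.Position")
decoded = reader_cls.FromString(wire_bytes)
print(decoded)
```

The reader never imports a `_pb2` module; everything it knows about `Position` comes from `schema_bytes`.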


### Protobag makes all messages self-describing messages
While Protobuf includes tools for using self-describing messages, the feature isn't simply a toggle in your `.proto` file, and the API is a bit complicated (because Google claims they don't use it much internally).

`protobag` automatically indexes the Protobuf Descriptor data for your messages at write time.  (You can disable this indexing if desired.)  At read time, `protobag` automatically uses this indexed Descriptor data if the user reading your `protobag` file lacks the `protoc`-generated code needed to deserialize a message.

What if a message type evolves?  `protobag` indexes each distinct message type for each write session.  If you change your schema for a message type between write sessions, `protobag` will have indexed both schemas and will use the proper one for dynamic deserialization.


### Time-Series Data (`protobag.StampedEntry`)
`protobag` features a **topic-timestamp-message** API for recording time-series data.  This API is modeled after [`rosbag`](http://wiki.ros.org/rosbag), [LCM log files](https://lcm-proj.github.io/log_file_format.html) (where topics are called "channels"), and your favorite message bus systems like Kafka or AWS SQS.  Each **topic** holds Protobuf messages of a single type, and each message has a nanosecond-precision timestamp (stored as a `google.protobuf.Timestamp` object, which has built-in conversions to other timestamp data structures like Python `datetime`s).
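
As a quick illustration of those conversions, here's a small sketch that uses only the `Timestamp` helpers from the `protobuf` library (nothing here is `protobag`-specific, and the instant is an arbitrary example value):

```python
from datetime import datetime, timezone

from google.protobuf.timestamp_pb2 import Timestamp

# Build a Timestamp from an integer count of nanoseconds since the Unix epoch.
ts = Timestamp()
ts.FromNanoseconds(1_595_326_274_000_000_123)
print(ts.seconds, ts.nanos)

# Convert to a Python datetime.  datetime only carries microsecond precision,
# so anything finer than a microsecond is lost in this direction.
as_datetime = ts.ToDatetime()
print(as_datetime)

# Convert a timezone-aware datetime back into a Timestamp.
ts_from_dt = Timestamp()
ts_from_dt.FromDatetime(datetime(2020, 7, 21, 10, 11, 14, tzinfo=timezone.utc))
print(ts_from_dt.seconds)
```

If you need the full nanosecond value back, use `ts.ToNanoseconds()` rather than going through `datetime`.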

Protobag has special handling for these [StampedEntry](fixme link) entries:
  * For writing:
      * Topics are organized into archive "folders", and filenames are chosen automatically.
      * Protobag indexes Protobuf Message Descriptors as described above.
      * Protobag indexes the messages for efficient time-ordered playback.
  * For reading:
      * Protobag offers a simple [Selection](fixme link) API for reading specific sets of topics, time ranges, or even just individual events.
      * Protobag offers a [`TimeSync`](fixme link) (in C++ FIXME add python API) for synchronizing topics that have messages recorded at different rates.
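
To give a feel for what time-ordered playback across topics means, here's a conceptual sketch in plain Python. It is not how `protobag` implements its index or its Selection API: the `Stamped` tuple, the `play_in_time_order` helper, and the topic names are all hypothetical. It assumes each topic's entries are already sorted by timestamp.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Stamped(NamedTuple):
    """Hypothetical stand-in for a stamped entry."""
    timestamp_ns: int
    topic: str
    payload: bytes


def play_in_time_order(*topics: Iterable[Stamped]) -> Iterator[Stamped]:
    """Lazily merge per-topic streams that are each sorted by timestamp."""
    return heapq.merge(*topics, key=lambda entry: entry.timestamp_ns)


# Two topics recorded at different rates (placeholder payloads).
positions = [
    Stamped(10, "/positions/py_max", b"p0"),
    Stamped(30, "/positions/py_max", b"p1"),
    Stamped(50, "/positions/py_max", b"p2"),
]
captures = [
    Stamped(20, "/captures", b"c0"),
    Stamped(40, "/captures", b"c1"),
]

timeline = list(play_in_time_order(positions, captures))
for entry in timeline:
    print(entry.timestamp_ns, entry.topic)
```

Because `heapq.merge` is lazy, only one pending entry per topic has to be held in memory at a time, which is the property you want when replaying long recordings.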
  


### Raw Data (`protobag.RawEntry`)
We can also just put raw data like text files or images into our protobag archive because it's just a zip file.  The C++ `protobag` API even includes an [`ArchiveUtil.hpp` module](fixme) TODO FIXME LINK that has helper functions for common archive operations like zipping a directory or unarchiving a tar file.  The raw write API makes `protobag` skip all indexing and type-tracking activity.