# Apache Avro Data Serialization

More on Avro:
- https://www.oreilly.com/content/the-problem-of-managing-schemas/
- https://www.confluent.io/blog/avro-kafka-data/


Avro provides:

- Rich data structures.
- A compact, fast, binary data format.
- A container file, to store persistent data.
- Remote procedure call (RPC).
- Simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Benefits of Avro:
- It has a direct mapping to and from JSON.
- It has a very compact format. The bulk of JSON, repeating every field name with every single record, is what makes JSON inefficient for high-volume usage.
- It is very fast.
- It has bindings for a wide variety of programming languages, so you can generate Java objects that make working with event data easier. Code generation is not required, however, so tools can be written generically for any data stream.
- It has a rich, extensible schema language defined in pure JSON.
- It has a strong notion of compatibility for evolving your data over time: data written with one version of a schema can be read with another, following Avro's schema resolution rules.
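
To make the compactness claim concrete, here is a minimal, hand-rolled sketch of Avro's binary encoding for a single record, using only the standard library. It is not the `avro` package's implementation. The record layout (a string `name`, then two optional fields declared as unions `["null", "int"]` and `["null", "string"]`) and the helper names `encode_long`, `encode_string` and `encode_user` are assumptions made for illustration.

```python
import json


def encode_long(n: int) -> bytes:
    """Encode an integer the way Avro encodes int/long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_user(name, favorite_number=None, favorite_color=None) -> bytes:
    """Encode one record: field values concatenated in schema order, no field names."""
    out = encode_string(name)
    # Assumed union order ["null", ...]: write the branch index, then the value (if any).
    out += encode_long(0) if favorite_number is None else encode_long(1) + encode_long(favorite_number)
    out += encode_long(0) if favorite_color is None else encode_long(1) + encode_string(favorite_color)
    return out


record = {"name": "Alyssa", "favorite_number": 256, "favorite_color": None}
avro_bytes = encode_user(**record)
json_bytes = json.dumps(record).encode("utf-8")
print(len(avro_bytes), "bytes as Avro binary vs", len(json_bytes), "bytes as JSON")
```

The Avro form carries no field names at all, because the reader recovers them from the schema. A container file stores that schema once in its header, so the saving grows with the number of records.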

In [1]:
!pip install avro

Collecting avro
  Downloading avro-1.11.0.tar.gz (83 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: avro
  Building wheel for avro (pyproject.toml): started
  Building wheel for avro (pyproject.toml): finished with status 'done'
  Created wheel for avro: filename=avro-1.11.0-py2.py3-none-any.whl size=115925 sha256=df9d7d863e6fc24b8b594a4a06f2273ecdd567fec32b8810b2e3af52f41b2b75
  Stored in directory: c:\users\31653\appdata\local\pip\cache\wheels\9a\a5\9b\d100e4bd3ef9697b2f955616260c77cb136f8cd2fc89533c63
Successfully built avro
Installing collected packages: avro
Successfully installed avro-1.11.0


In [2]:
import avro
print(avro.__version__)

1.11.0


##  Wikipedia example

The English Wikipedia has a nice example for Avro:
https://en.wikipedia.org/wiki/Apache_Avro

I have created a file called user.avsc.

In [1]:
import glob
my_avscs=glob.glob('*.avsc')
my_avscs

['user.avsc']

In [2]:
# Inspect the schema file (it sits in the working directory, as the glob above shows)
%pycat user.avsc
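
The contents of user.avsc are not shown in the output above. As a sketch, and assuming the file follows the schema from the Wikipedia example (the exact union order `["null", "int"]` is an assumption), it could be recreated like this. Note that running it overwrites any existing user.avsc in the working directory.

```python
import json

# Assumed schema, modelled on the Wikipedia example; adjust it to match your own file.
user_schema = {
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        # A union with "null" makes the field optional.
        {"name": "favorite_number", "type": ["null", "int"]},
        {"name": "favorite_color", "type": ["null", "string"]},
    ],
}

# An .avsc file is plain JSON, so the standard json module is enough to write it.
with open("user.avsc", "w", encoding="utf-8") as schema_file:
    json.dump(user_schema, schema_file, indent=2)
```

With such a schema, a record that leaves out `favorite_color` is read back with that field set to `None`.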

In [3]:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# The schema must be known in order to write (example follows the Apache Avro 1.8.2 documentation).
with open("user.avsc", "r") as schema_file:
    schema = avro.schema.parse(schema_file.read())

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 8, "favorite_color": "red"})
writer.close()

In [4]:
import glob
my_avros=glob.glob('*.avro')
my_avros
# This file is in a binary format, so it cannot be inspected as plain text.

['users.avro']

In [5]:
# Deserialization
reader = DataFileReader(open("users.avro", "rb"), DatumReader())  # the schema is embedded in the data file
for user in reader:
    print(user)
reader.close()

{'name': 'Alyssa', 'favorite_number': 256, 'favorite_color': None}
{'name': 'Ben', 'favorite_number': 8, 'favorite_color': 'red'}
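
The reader above needs no schema argument because an Avro container file carries its schema in the file header. The sketch below parses such a header with the standard library only, following the container layout in the Avro specification: the magic bytes `Obj\x01`, a metadata map, then a 16-byte sync marker. The helper names `read_long`, `read_bytes` and `read_header_metadata` are hypothetical, and the tiny in-memory header is hand-built for illustration rather than taken from users.avro.

```python
import io


def read_long(stream) -> int:
    """Decode one Avro long: base-128 varint, then undo the zigzag mapping."""
    shift = 0
    result = 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # continuation bit clear: last byte of this number
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_bytes(stream) -> bytes:
    """Avro bytes and strings are a long length followed by that many bytes."""
    return stream.read(read_long(stream))


def read_header_metadata(stream):
    """Return (metadata dict, sync marker) from the start of a container file."""
    if stream.read(4) != b"Obj\x01":
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = read_long(stream)  # number of key/value pairs in this map block
        if count == 0:             # a zero count ends the map
            break
        if count < 0:              # negative count: a byte size follows; skip it
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta, stream.read(16)


# Hand-built header with a single metadata entry, for illustration only.
demo_header = b"Obj\x01" + b"\x02" + b"\x14avro.codec" + b"\x08null" + b"\x00" + bytes(16)
demo_meta, demo_sync = read_header_metadata(io.BytesIO(demo_header))
print(demo_meta)  # {'avro.codec': b'null'}
```

Passing `open("users.avro", "rb")` instead of the in-memory stream should return a dictionary whose `avro.schema` entry holds the writer's schema as JSON bytes.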
