Ref: https://avro.apache.org/docs/1.11.1/getting-started-python/

Setup

```bash
conda create -n data_file_formats python=3.12
conda activate data_file_formats
conda install -c conda-forge notebook
pip install avro==1.12.0
pip install fastavro==1.9.7
pip install pandas  # imported later to build a DataFrame
```

# Introduction to Avro Files

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Avro is designed to support data-intensive applications and is widely used in big data processing frameworks like Apache Hadoop and Apache Spark.

Key features of Avro:
- Compact, fast, binary data format.
- Rich data structures.
- A container file, to store persistent data.
- Simple integration with dynamic languages.
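
To make "compact binary format" concrete, here is a minimal sketch of how Avro's binary encoding represents a few values, following the Avro specification: `int` and `long` values use variable-length zig-zag coding, and a `string` is its byte length (encoded as a `long`) followed by the UTF-8 bytes. The helper names `encode_long` and `encode_string` are illustrative and are not part of the `avro` package.

```python
def encode_long(n: int) -> bytes:
    """Encode an Avro int/long: zig-zag first, then a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag maps small negatives to small positives
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode an Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


print(encode_long(7).hex())        # 0e
print(encode_long(256).hex())      # 8004
print(encode_long(-1).hex())       # 01
print(encode_string("Ben").hex())  # 0642656e
```

Small numbers take one or two bytes rather than a fixed four or eight, which is a large part of why Avro files stay compact.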

## Defining a schema

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).

An example schema is provided in the `user.avsc` file (the `.avsc` extension stands for AVro SChema).

The schema also defines a namespace (`"namespace": "example.avro"`), which together with the `name` attribute forms the "full name" of the schema (`example.avro.User` in this case).
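
For reference, the schema text echoed in the error output later in this document implies a `user.avsc` along the following lines. This sketch writes it out with the standard `json` module and derives the full name. The temporary directory is an assumption made only so the example does not overwrite an existing file.

```python
import json
import os
import tempfile

# Schema reconstructed from the error output shown later in this document
user_schema = {
    "type": "record",
    "name": "User",
    "namespace": "example.avro",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}

# Written to a temporary directory here; in the notebook the file is user.avsc
schema_path = os.path.join(tempfile.mkdtemp(), "user.avsc")
with open(schema_path, "w") as f:
    json.dump(user_schema, f, indent=2)

# The full name is the namespace and the name joined by a dot
full_name = f"{user_schema['namespace']}.{user_schema['name']}"
print(full_name)  # example.avro.User
```

The two union types (`["int", "null"]` and `["string", "null"]`) are what allow `favorite_number` and `favorite_color` to be null.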

# Serialize our first file

Data in Avro is always stored with its corresponding schema, meaning we can always read a serialized item, regardless of whether we know the schema ahead of time. This allows us to perform serialization and deserialization without code generation. 

In [2]:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Parse the JSON schema definition from user.avsc
with open("user.avsc") as f:
    schema = avro.schema.parse(f.read())

# Write two records; the context manager flushes and closes the file
with DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema) as writer:
    writer.append({"name": "Alyssa", "favorite_number": 256})
    writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})

# Read the records back; no schema is passed because it is stored in the file
with DataFileReader(open("users.avro", "rb"), DatumReader()) as reader:
    for user in reader:
        print(user)

{'name': 'Alyssa', 'favorite_number': 256, 'favorite_color': None}
{'name': 'Ben', 'favorite_number': 7, 'favorite_color': 'red'}
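
Because the schema travels with the data, it can be recovered from the file itself. The sketch below parses the header of an Avro object container file using only the standard library, following the layout in the Avro specification: the 4-byte magic `Obj\x01`, a metadata map, then a 16-byte sync marker. `read_header_metadata` and its helpers are illustrative names, and the in-memory header is hand-built so that the example runs without any Avro file on disk.

```python
import io
import json


def read_long(stream) -> int:
    """Decode one zig-zag varint (Avro int/long) from a binary stream."""
    shift, accum = 0, 0
    while True:
        byte = stream.read(1)[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream) -> bytes:
    """Decode a length-prefixed byte string."""
    return stream.read(read_long(stream))


def read_header_metadata(stream) -> dict:
    """Return the metadata map stored in an Avro container file header."""
    if stream.read(4) != b"Obj\x01":
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # a negative count is followed by the block size in bytes
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


# Hand-built header with a single metadata entry, for illustration only
schema_json = b'{"type": "string"}'
header = (
    b"Obj\x01"
    + b"\x02"                                        # one map entry
    + b"\x16" + b"avro.schema"                       # key: length 11, then text
    + bytes([2 * len(schema_json)]) + schema_json    # value: length, then bytes
    + b"\x00"                                        # end of map
    + bytes(16)                                      # sync marker
)
meta = read_header_metadata(io.BytesIO(header))
print(json.loads(meta["avro.schema"]))  # {'type': 'string'}
```

Applied to the file written above, `read_header_metadata(open("users.avro", "rb"))["avro.schema"]` should hold the `User` schema as JSON text. The `avro` and `fastavro` readers do this parsing internally.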


What happens if we try to write a record that does not conform to the schema, for example one with an undeclared field?

In [5]:
with open("user.avsc") as f:
    schema = avro.schema.parse(f.read())

# 'favorite_food' is not a field of the User schema, so append() raises AvroTypeException
with DataFileWriter(open("users2.avro", "wb"), DatumWriter(), schema) as writer2:
    writer2.append({"name": "Alyssa", "favorite_number": 256, 'favorite_food': 'pizza'})

AvroTypeException: The datum "{'name': 'Alyssa', 'favorite_number': 256, 'favorite_food': 'pizza'}" provided for "User" is not an example of the schema {
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {
      "type": "string",
      "name": "name"
    },
    {
      "type": [
        "int",
        "null"
      ],
      "name": "favorite_number"
    },
    {
      "type": [
        "string",
        "null"
      ],
      "name": "favorite_color"
    }
  ]
}
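
The exception prints the whole schema but does not single out the offending key. A small pre-check, sketched here with a hypothetical helper `unexpected_fields` that works on the schema as a plain dictionary, can name the keys that are not declared as fields. The schema literal is copied from the error output above.

```python
# Schema copied from the error output above, as a plain dictionary
user_schema = {
    "type": "record",
    "name": "User",
    "namespace": "example.avro",
    "fields": [
        {"type": "string", "name": "name"},
        {"type": ["int", "null"], "name": "favorite_number"},
        {"type": ["string", "null"], "name": "favorite_color"},
    ],
}


def unexpected_fields(schema: dict, datum: dict) -> list:
    """Return the keys of datum that are not declared as fields in a record schema."""
    declared = {field["name"] for field in schema["fields"]}
    return sorted(set(datum) - declared)


bad = {"name": "Alyssa", "favorite_number": 256, "favorite_food": "pizza"}
good = {"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
print(unexpected_fields(user_schema, bad))   # ['favorite_food']
print(unexpected_fields(user_schema, good))  # []
```

This checks key names only. It does not validate value types, which the `avro` writer still does when `append` is called.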

In [11]:
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    # DataFileReader is an iterator; each record comes back as a Python dict
    print(user)
reader.close()

{'name': 'Alyssa', 'favorite_number': 256, 'favorite_color': None}
{'name': 'Ben', 'favorite_number': 7, 'favorite_color': 'red'}


## Multiple Avro files

Avro files are often produced in batches, for example by daily or streaming jobs, so a dataset typically spans several files that share one schema. Here we write two Avro files with the same schema into the same folder, then read them all back into a single collection.

In [16]:
import os
import shutil

# Start from an empty subfolder: remove it if it already exists, then recreate it
if os.path.exists('users_split'):
    shutil.rmtree('users_split')
os.makedirs('users_split', exist_ok=True)

# Define two lists of users
users_list1 = [
    {"name": "Alyssa", "favorite_number": 256},
    {"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
]

users_list2 = [
    {"name": "Charlie", "favorite_number": 42},
    {"name": "Diana", "favorite_number": 99, "favorite_color": "blue"}
]

# Write the first list of users to an Avro file
with DataFileWriter(open("users_split/users_list1.avro", "wb"), DatumWriter(), schema) as writer:
    for user in users_list1:
        writer.append(user)

# Write the second list of users to another Avro file
with DataFileWriter(open("users_split/users_list2.avro", "wb"), DatumWriter(), schema) as writer:
    for user in users_list2:
        writer.append(user)

### Exercise

Write a Python script that reads every Avro file in the `users_split` subfolder into a single pandas DataFrame. The script should iterate over the files in the directory, read each one with the `fastavro` library, collect all records into one list, convert that list to a DataFrame, and print it. The resulting DataFrame should contain every record from every file.

In [18]:
import pandas as pd
import fastavro
import os

# Read one Avro file and return its records as a list of dicts
def read_avro(file_path):
    with open(file_path, 'rb') as f:
        reader = fastavro.reader(f)  # the schema is taken from the file header
        return list(reader)

# Directory containing the avro files
directory = 'users_split'

# List to hold all records
all_records = []

# Iterate over all files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.avro'):
        file_path = os.path.join(directory, filename)
        all_records.extend(read_avro(file_path))

# Create a pandas DataFrame from the list of records
df = pd.DataFrame(all_records)
print(df)

      name  favorite_number favorite_color
0  Charlie               42           None
1    Diana               99           blue
2   Alyssa              256           None
3      Ben                7            red
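
Note that the rows above are not in the order the files were written: `os.listdir` returns entries in arbitrary order. If a stable order matters, one option is to sort the paths before reading. The sketch below shows only the listing step, using a hypothetical helper `list_avro_files` and empty placeholder files created in a temporary directory.

```python
import tempfile
from pathlib import Path


def list_avro_files(directory) -> list:
    """Return the .avro files in a directory, sorted by file name."""
    return sorted(Path(directory).glob("*.avro"))


# Demo with empty placeholder files (not valid Avro) in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ["users_list2.avro", "notes.txt", "users_list1.avro"]:
        (Path(tmp) / name).touch()
    names = [p.name for p in list_avro_files(tmp)]

print(names)  # ['users_list1.avro', 'users_list2.avro']
```

Each returned path can be passed to `read_avro` from the solution above in place of the `os.listdir` loop.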
