SOXTA - StackOverflow's XML To AVRO

How to run

First, we need the uncompressed data from the public dataset. Given the following folder structure, the data should be inside the xml_data folder. The avro_schemas folder should contain the schema files used to convert from XML to AVRO. The avro_data folder should be empty:

├── avro_data       # Empty, to store Avro converted files
├── avro_schemas    # Avro schema files
└── xml_data        # Uncompressed data from the public dataset
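
As a quick sanity check before converting, the layout above can be verified with a few lines of Python. This is only a sketch and not part of the repository: the helper name `check_layout` is hypothetical, and the `.avsc` and `.xml` extensions are assumed from the example file names used in this README.

```python
from pathlib import Path

# Folder names taken from the layout described above.
EXPECTED_DIRS = ("avro_data", "avro_schemas", "xml_data")


def check_layout(root):
    """Return a list of human-readable problems with the folder layout under root.

    Hypothetical helper: an empty list means the layout looks as described.
    """
    root = Path(root)
    problems = []

    # All three folders must exist.
    for name in EXPECTED_DIRS:
        if not (root / name).is_dir():
            problems.append(f"missing folder: {name}")

    # avro_data should start out empty.
    avro_data = root / "avro_data"
    if avro_data.is_dir() and any(avro_data.iterdir()):
        problems.append("avro_data is not empty")

    # Assumed extensions: schemas end in .avsc, data files in .xml.
    schemas = root / "avro_schemas"
    if schemas.is_dir() and not list(schemas.glob("*.avsc")):
        problems.append("avro_schemas contains no .avsc files")

    xml_data = root / "xml_data"
    if xml_data.is_dir() and not list(xml_data.glob("*.xml")):
        problems.append("xml_data contains no .xml files")

    return problems
```

Calling `check_layout(".")` from the project root returns an empty list when the three folders are set up as described.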

Convert from XML to AVRO:

Then we need to run the script, specifying the XML file to convert, the AVRO schema file, and the path for the resulting AVRO file:

$ xml_data/Posts.xml avro_schemas/Posts.avsc avro_data/Posts.avro
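
To give an idea of what the conversion involves, here is a minimal sketch of the first half of the job: reading `<row>` elements from the XML and turning their attributes into typed records guided by the schema's field list. It is not the repository's script. The function names, the `CONVERTERS` table, and the three-field schema are all hypothetical, and the real `Posts.avsc` may differ. Only the Python standard library is used, so the final step of writing the records to an `.avro` file (which needs an Avro library) is left out.

```python
import io
import json
import xml.etree.ElementTree as ET

# Maps Avro primitive type names to converters for XML attribute strings.
CONVERTERS = {
    "int": int,
    "long": int,
    "float": float,
    "double": float,
    "string": str,
    "boolean": lambda s: s.lower() == "true",
}


def primitive_type(avro_type):
    """Return the first non-null branch of an Avro type, e.g. ["null", "int"] -> "int"."""
    if isinstance(avro_type, list):
        return next(t for t in avro_type if t != "null")
    return avro_type


def rows_to_records(xml_source, schema):
    """Yield one dict per <row> element, typed according to the schema fields."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row":
            continue
        record = {}
        for field in schema["fields"]:
            raw = elem.attrib.get(field["name"])
            convert = CONVERTERS[primitive_type(field["type"])]
            # Attributes absent from the row become None (Avro null).
            record[field["name"]] = None if raw is None else convert(raw)
        yield record
        elem.clear()  # drop the element's attributes to limit memory use


# Hypothetical, simplified schema; the real Posts.avsc is not shown in this README.
schema = json.loads("""
{
  "type": "record",
  "name": "Post",
  "fields": [
    {"name": "Id", "type": "int"},
    {"name": "Title", "type": ["null", "string"]},
    {"name": "Score", "type": "int"}
  ]
}
""")

# Tiny in-memory sample in the <row .../> attribute style of the dump files.
sample = io.BytesIO(
    b'<posts><row Id="1" Score="5" Title="Hello" /><row Id="2" Score="0" /></posts>'
)
records = list(rows_to_records(sample, schema))
```

Because `iterparse` processes the file as a stream and each row is cleared after use, this approach is intended to cope with large dump files better than loading the whole document at once. The resulting dictionaries could then be handed to an Avro writer together with the same schema.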