Skip to content
This repository has been archived by the owner. It is now read-only.
master
Switch branches/tags
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 

README.md

soxta

SOXTA - StackOverflow's XML To AVRO

How to run

Firstly, we need uncompressed data from the Archive.org public dataset: https://archive.org/details/stackexchange. Given the next folder structure, the data should be inside xml_data folder. The avro_schemas folder should contain the schema files in order to convert from XML to AVRO. The avro_data folder should be empty:

.
├── avro_data       # Empty, to store Avro converted files
├── avro_schemas    # Avro schema files
└── xml_data        # Uncompressed data from Archive.org

Convert from XML to AVRO:

Then we need to run soxta.py script, specifying the XML file to convert, the AVRO schema file, and the path for the result AVRO file:

$ soxta.py xml_data/Posts.xml avro_schemas/Posts.avsc avro_data/Posts.avro

About

SOXTA - StackOverflow's XML To AVRO

Resources

License

Releases

No releases published

Packages

No packages published

Languages