- Java 1.8
- Spark 3.3.0
- Kryo serializers
(make sure to run these with Java 8; check with `mvn -v`)
Run the tests:

```shell
mvn clean test
```
Build the application package:

```shell
mvn clean package
```
Next, submit your job:

```shell
mvn clean package && spark-submit \
  --class org.cannotsay.Main \
  --master local[*] \
  --packages com.databricks:spark-xml_2.12:0.15.0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  ./target/XMLOutagesSparkReader-1.0-SNAPSHOT.jar
```
Input path

Add your XML files under the raw zone at `src/java/resources/raw`.
Output path

Find your outputs under the trusted zone at `src/java/resources/trusted`.
- Raw, where the raw inputs live
- Staging, where the processed/in-process data lives
- Trusted, where the final/trusted/processed data lives
- Use Spark `OutagesRawXMLReader` to read the XML file.
- Use Spark `OutagesWriter` to write contents to the `staging` area in `json` format.
  - Using `json` for its readability; should use `parquet` instead.
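The batch half of the pipeline (raw XML in, staging json out) can be sketched with plain Spark SQL and spark-xml. This is an illustration, not the project's actual `OutagesRawXMLReader`/`OutagesWriter` code: the `rowTag` value `"outage"` is an assumption about the XML schema.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class RawToStagingSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("XMLOutagesSparkReader")
                .master("local[*]")
                .getOrCreate();

        // Read the raw XML via spark-xml. The row tag name is an
        // assumption; adjust it to match the real outage XML schema.
        Dataset<Row> outages = spark.read()
                .format("com.databricks.spark.xml")
                .option("rowTag", "outage")
                .load("src/java/resources/raw");

        // Write to the staging zone as json (readable, but parquet
        // would be the faster choice, as noted above).
        outages.write()
                .mode(SaveMode.Overwrite)
                .json("src/java/resources/staging");

        spark.stop();
    }
}
```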
- Use Spark `OutagesStagingStreamReader` to read `json` contents into a Spark stream.
- Process data with `OutagesStreamProcessor`.
- Start stream querying (the application will keep running in the background).
- For each stream batch, call `OutageSink` to filter and route contents to either the `business` or `customer` trust areas.
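The streaming half can be sketched with Structured Streaming's `foreachBatch`, which mirrors the per-batch `OutageSink` step above. This is only an illustration of the shape of the code: the schema fields (`id`, `customerType`) and the `business`/`customer` output subfolders are placeholders, not the project's real schema.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

public class StagingStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("XMLOutagesSparkReader")
                .master("local[*]")
                .getOrCreate();

        // Streaming file sources need an explicit schema; these field
        // names are placeholders for the real outage schema.
        StructType schema = new StructType()
                .add("id", "string")
                .add("customerType", "string");

        Dataset<Row> stream = spark.readStream()
                .schema(schema)
                .json("src/java/resources/staging");

        // Per micro-batch, filter and route rows into the trust areas,
        // mimicking what OutageSink does.
        StreamingQuery query = stream.writeStream()
                .foreachBatch((Dataset<Row> batch, Long batchId) -> {
                    batch.filter(col("customerType").equalTo("business"))
                         .write().mode("append")
                         .json("src/java/resources/trusted/business");
                    batch.filter(col("customerType").equalTo("customer"))
                         .write().mode("append")
                         .json("src/java/resources/trusted/customer");
                })
                .start();

        // Keeps the application running in the background.
        query.awaitTermination();
    }
}
```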
- Parse `postal codes` as a list of elements.
- Parse `locations` as a list of elements.
- Replace formats from `json` to `parquet` for a performance increase.
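The json-to-parquet improvement is mostly a format swap on both sides of the staging zone. A minimal sketch, assuming the same placeholder schema fields as before (parquet is columnar and compressed, and preserves types exactly, which json does not):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class ParquetStagingSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("XMLOutagesSparkReader")
                .master("local[*]")
                .getOrCreate();

        // Batch side: write staging data as parquet instead of json.
        Dataset<Row> outages = spark.read()
                .format("com.databricks.spark.xml")
                .option("rowTag", "outage") // assumed row tag
                .load("src/java/resources/raw");
        outages.write()
                .mode(SaveMode.Overwrite)
                .parquet("src/java/resources/staging");

        // Streaming side: file sources still require an explicit
        // schema; the field names here are placeholders.
        StructType schema = new StructType()
                .add("id", "string")
                .add("customerType", "string");
        Dataset<Row> stream = spark.readStream()
                .schema(schema)
                .parquet("src/java/resources/staging");

        spark.stop();
    }
}
```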