This repository is outdated and has been migrated to project-fortis.
A repository for Project Fortis's data processing pipeline, built on Apache Spark.
This project contains a Spark Streaming job that ingests data into the Fortis system. Specifically, we:
- Ingest data in real time from sources such as Twitter, Facebook, online radio, newspapers, Instagram, and TadaWeb.
- Analyze and augment the raw data with intelligence like sentiment analysis, entity extraction, place recognition, or image understanding.
- Narrow down the stream of events based on user-defined geo-areas, target keywords, and blacklisted terms.
- Perform trend detection and aggregate the metrics that back Project Fortis.
At the end of the ingestion pipeline, we publish the events and various aggregations to Cassandra.
# set up variables from deployment environment
export HA_PROGRESS_DIR="..."
export APPINSIGHTS_INSTRUMENTATIONKEY="..."
export FORTIS_FEATURE_SERVICE_HOST="..."
export FORTIS_MODELS_DIRECTORY="..."
export FORTIS_CENTRAL_ASSETS_HOST="..."
export FORTIS_SERVICEBUS_NAMESPACE="..."
export FORTIS_SERVICEBUS_CONFIG_QUEUE="..."
export FORTIS_SERVICEBUS_POLICY_NAME="..."
export FORTIS_SERVICEBUS_POLICY_KEY="..."
# compile scala, run tests, build fat jar
export JAVA_OPTS="-Xmx2048M"
sbt assembly
# run on spark
spark-submit --driver-memory 4g target/scala-2.11/project-fortis-spark-assembly-0.0.1.jar
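
Because the job reads its configuration from the environment, it can help to check that every variable from the setup step is present before building or submitting. The sketch below is a hypothetical pre-flight helper and is not part of this repository: the function name `check_fortis_env` and the `REQUIRED_FORTIS_VARS` list are made up for illustration, and treating every listed variable as mandatory is an assumption, since the job may tolerate some of them being unset.

```shell
# Hypothetical pre-flight helper (not part of the repository).
# The names below are the variables exported in the setup step above;
# assuming that all of them are required is a guess.
REQUIRED_FORTIS_VARS="HA_PROGRESS_DIR APPINSIGHTS_INSTRUMENTATIONKEY \
FORTIS_FEATURE_SERVICE_HOST FORTIS_MODELS_DIRECTORY FORTIS_CENTRAL_ASSETS_HOST \
FORTIS_SERVICEBUS_NAMESPACE FORTIS_SERVICEBUS_CONFIG_QUEUE \
FORTIS_SERVICEBUS_POLICY_NAME FORTIS_SERVICEBUS_POLICY_KEY"

# Prints one "missing: NAME" line to stderr per unset or empty variable.
# Returns 0 when every variable is set, 1 otherwise.
check_fortis_env() {
  missing=0
  for name in $REQUIRED_FORTIS_VARS; do
    # Indirect lookup through eval; safe here because the names come
    # from the fixed list above, never from user input.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is to gate the build on the check, for example `check_fortis_env && sbt assembly`, so that a missing setting is reported before the Spark job starts rather than after.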