

Alluvium

A scalable realtime streaming search platform

DEMO

Overview

Alluvium provides a clean, scalable architecture in Python for realtime streaming search. Realtime search provides insight into high-velocity feeds, with applications ranging from media monitoring to up-to-the-minute vandalism detection notifications. A practical example is monitoring community health by tracking the frequency of words in the Twitter feed that correlate with heart disease mortality rates, as presented by Eichstaedt et al. (2015).

Achieving realtime search in high-volume streams presents a distinct set of engineering challenges. In a static setting we typically build an index over the documents being searched, which is often not feasible for a high-volume stream. This limitation led to the development of reverse search, where the queries are indexed and matched against a tokenized stream of text. Some challenges emerge as additional queries are added. Should we tokenize each streaming document once per query, or tokenize it once and run it against several queries in batches? How should we remove queries from the list? How should we scale the processing distribution to handle both an increase in document volume and an increase in the number and complexity of queries? These are some of the questions I've been addressing with Alluvium.
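To make the reverse search idea concrete, here is a minimal in-memory sketch, not the code Alluvium actually runs. It assumes each query is a set of terms that must all appear in a document; the names `QueryIndex`, `tokenize`, `add`, `remove` and `match` are hypothetical and chosen only for illustration. Each document is tokenized once and then checked against every candidate query, and queries can be removed at any time.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase a document and return its set of word tokens."""
    return set(TOKEN_RE.findall(text.lower()))


class QueryIndex:
    """Toy reverse-search index: queries are indexed, documents stream past."""

    def __init__(self):
        self._terms = {}                  # query_id -> set of required terms
        self._by_term = defaultdict(set)  # term -> ids of queries using it

    def add(self, query_id, query_text):
        terms = tokenize(query_text)
        if not terms:
            raise ValueError("query has no searchable terms")
        self._terms[query_id] = terms
        for term in terms:
            self._by_term[term].add(query_id)

    def remove(self, query_id):
        # Drop the query and clean up any term entries it leaves empty.
        for term in self._terms.pop(query_id, set()):
            self._by_term[term].discard(query_id)
            if not self._by_term[term]:
                del self._by_term[term]

    def match(self, document):
        """Return the ids of all stored queries the document satisfies."""
        tokens = tokenize(document)  # tokenize once per document
        candidates = set()
        for token in tokens:
            candidates |= self._by_term.get(token, set())
        # A query matches when every one of its terms is in the document.
        return sorted(q for q in candidates if self._terms[q] <= tokens)


index = QueryIndex()
index.add("q1", "heart disease")
index.add("q2", "coffee")
print(index.match("Coffee is bad for my heart"))
```

Only queries that share at least one term with the document are examined, so the cost per document grows with the number of candidate queries rather than with the total number of stored queries.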

Architecture

  • AWS (S3): Simulated firehose of tweets from 2012
  • Kafka: Scalable, fault-tolerant message delivery
  • Storm: Event-based stream processing
  • Elasticsearch: Tweet search with percolator index
  • RethinkDB: Key-value data store
  • Flask-SocketIO: Server socket connection delivering realtime results to the client
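The percolator index in the list above is Elasticsearch's built-in form of reverse search: queries are stored as documents in a field of type `percolator`, and each incoming tweet is sent in a `percolate` query to find the stored queries it matches. The sketch below only builds the JSON request bodies as Python dictionaries. The field names `text` and `query` and the helper function names are assumptions, not taken from the Alluvium source. The mapping uses the typeless layout of Elasticsearch 7 and later, while 6.x releases nest `properties` under a mapping type.

```python
import json


def percolator_mapping(text_field="text", query_field="query"):
    """Index body: one field for tweet text, one for stored queries."""
    return {
        "mappings": {
            "properties": {
                text_field: {"type": "text"},
                query_field: {"type": "percolator"},
            }
        }
    }


def stored_query(terms, text_field="text", query_field="query"):
    """Document body that registers a query requiring all given terms."""
    return {
        query_field: {
            "match": {text_field: {"query": terms, "operator": "and"}}
        }
    }


def percolate_request(tweet_text, text_field="text", query_field="query"):
    """Search body asking which stored queries match this tweet."""
    return {
        "query": {
            "percolate": {
                "field": query_field,
                "document": {text_field: tweet_text},
            }
        }
    }


print(json.dumps(percolate_request("coffee is bad for my heart"), indent=2))
```

These bodies would be passed to whichever Elasticsearch client the pipeline uses. The hits returned for a percolate request are the stored query documents that match the tweet, not tweets.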

Engineering Challenges

  • Kafka tuning
  • Storm topology configuration and deployment in Python
  • Pipeline metrics

Performance monitoring

  • Currently clocking an average of 4 milliseconds per search on a 2000 tweets/second stream.
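As a rough illustration of how such a figure can be tracked and interpreted, the sketch below keeps a running mean of per-search latency and applies Little's law (mean searches in flight = arrival rate × mean latency). It is not the metrics code in this repository. `LatencyTracker` and `searches_in_flight` are hypothetical names, and the calculation assumes one search per tweet.

```python
import time


class LatencyTracker:
    """Running mean of per-search latency in milliseconds."""

    def __init__(self):
        self.count = 0
        self.total_ms = 0.0

    def record_ms(self, elapsed_ms):
        if elapsed_ms < 0:
            raise ValueError("latency cannot be negative")
        self.count += 1
        self.total_ms += elapsed_ms

    def timed(self, fn, *args, **kwargs):
        """Call fn, record how long it took, and return its result."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.record_ms((time.perf_counter() - start) * 1000.0)
        return result

    def mean_ms(self):
        if self.count == 0:
            raise ValueError("no samples recorded")
        return self.total_ms / self.count


def searches_in_flight(tweets_per_second, mean_latency_ms):
    """Little's law: mean concurrent searches at a given rate and latency."""
    return tweets_per_second * mean_latency_ms / 1000


tracker = LatencyTracker()
for sample_ms in (3, 5):  # made-up sample latencies, for illustration only
    tracker.record_ms(sample_ms)
print(tracker.mean_ms(), searches_in_flight(2000, tracker.mean_ms()))
```

Under those assumptions, 4 milliseconds per search at 2000 tweets/second corresponds to about 8 searches in flight on average, which gives a lower bound on the parallelism the search stage needs to keep up.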