# Data-Driven Graph-Powered Application

**Lambda architectures** provide a framework for structuring scalable applications that require large-scale processing and real-time updates. This chapter applies the framework to graph-powered applications and describes **graph processing engines** and **graph querying engines**. Chapter 9 covers:
- Overview of lambda architectures
- Lambda architectures for graph-powered applications
- Technologies and examples of graph processing engines
- Graph querying engines and graph databases


!pip install neo4j==4.2.0
!pip install gremlinpython==3.4.6

## Lambda Architectures
The focus is on scalable architectures that can handle large amounts of data and provide alerts in real time, using the latest available information. These systems must scale to a large number of users or a large volume of data by increasing resources horizontally or vertically. A Lambda architecture is designed to process massive quantities of data with high throughput, while keeping latency low and ensuring fault tolerance and negligible errors. It has three layers:
- **batch layer**: Sits on top of the storage system and can handle all historical data, as well as perform **online analytical processing (OLAP)** computation on the entire dataset. New data is continuously ingested and stored. Large-scale processing is typically achieved via massively parallel jobs that aggregate, structure and compute the relevant information. In terms of ML, model training that relies on historical information is generally done at this layer, producing a trained model to be used in either batch prediction or real-time execution.
- **speed layer**: Low-latency layer that allows real-time processing of information to provide timely updates. Generally fed by a streaming process, and usually involves fast computation that does not require a long computational time or heavy load. Produces output that is integrated with the data generated by the batch layer in (near) real time, providing support for **online transaction processing (OLTP)** operations. May also use some outputs of the OLAP computations, such as a trained model. ML models that serve predictions are often embedded in the speed layer.
- **serving layer**: Organises, structures and indexes information for fast data retrieval from the batch and speed layers. It integrates the outputs of the batch layer with the most up-to-date, real-time information of the speed layer to give the user a unified and coherent view of the data. The serving layer can include a persistence layer that integrates both historical aggregations and real-time updates. Information is generally exposed to the user via a direct connection to the DB, via a domain-specific query language such as SQL, or via dedicated RESTful API servers.
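The interplay of the three layers can be illustrated with a minimal sketch in plain Python. It assumes a hypothetical event format in which every event is a `(source, target)` edge, and it uses node degree as the quantity of interest; the names `batch_view`, `SpeedView` and `serve` are illustrative and not part of any framework.

```python
from collections import Counter

# Hypothetical historical data: each event is a (source, target) edge.
historical_edges = [("a", "b"), ("a", "c"), ("b", "c")]


def batch_view(edges):
    """Batch layer: recompute node degrees from the full history."""
    degrees = Counter()
    for src, dst in edges:
        degrees[src] += 1
        degrees[dst] += 1
    return degrees


class SpeedView:
    """Speed layer: incremental degree updates for events that
    arrived after the last batch run."""

    def __init__(self):
        self.degrees = Counter()

    def ingest(self, edge):
        src, dst = edge
        self.degrees[src] += 1
        self.degrees[dst] += 1


def serve(node, batch, speed):
    """Serving layer: merge the batch view with the real-time delta."""
    return batch[node] + speed.degrees[node]


batch = batch_view(historical_edges)
speed = SpeedView()
speed.ingest(("c", "d"))  # a new edge arrives on the stream

print(serve("c", batch, speed))  # 3
print(serve("d", batch, speed))  # 1
```

In a real deployment the batch view would be rebuilt periodically by a parallel job and the speed view reset afterwards; here both are kept in memory only to show how the serving layer combines them.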

Some of the main advantages of a Lambda architecture:
- **No server management**: Abstracts the functional layers, so there is no need to install, maintain or administer any software/infrastructure
- **Flexible scaling**: Scaling can be automatic, either horizontally or vertically
- **Automated high availability**: Represents a serverless design, which already has built-in availability and fault tolerance
- **Business agility**: Reacts in real time to changing business/market scenarios

Limitations:
- Two interconnected processing flows, the **batch layer** and the **speed layer**, are kept separate, so developers must build and maintain separate code bases for batch and stream processing, which adds complexity and code-management overhead.

## Lambda Architectures for Graph-Powered Applications
- **Graph processing engine**: Executes computations on the graph structure to extract features, compute statistics, metrics and **key performance indicators (KPIs)**, and identify relevant subgraphs, all of which are OLAP workloads.
- **Graph querying engine**: Allows us to persist network data and provides fast information retrieval, efficient querying and graph traversal (usually via graph query languages). Information is already persisted in some data storage and no further computation is required other than some final aggregation, so indexing is crucial for high performance and low latency.

Graph processing engines sit on top of the batch layer and produce outputs that may be stored and indexed in appropriate graph databases. These databases are the backend of the graph querying engines, which allow information to be retrieved easily and quickly and represent the operational views used by the serving layer. It can make sense to run the processing and querying engines on top of the same infrastructure.
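The hand-off from processing engine to querying engine might look like the sketch below: a metric is computed in a batch step, written to the graph database, and then read back through an indexed query. The Cypher label `Node` and the properties `id` and `degree` are hypothetical, as are the function names. The functions only rely on a session object exposing `run(query, **parameters)`, which is how the `neo4j` Python driver's sessions are called; an in-memory stand-in is used so the sketch runs without a database.

```python
from collections import Counter

# Cypher statements; label `Node` and properties `id`/`degree` are hypothetical.
WRITE_DEGREES = (
    "UNWIND $rows AS row "
    "MERGE (n:Node {id: row.id}) "
    "SET n.degree = row.degree"
)
TOP_K = (
    "MATCH (n:Node) "
    "RETURN n.id AS id, n.degree AS degree "
    "ORDER BY n.degree DESC LIMIT $k"
)


def compute_degrees(edges):
    """Processing engine (batch): compute a metric over the whole graph."""
    degrees = Counter()
    for src, dst in edges:
        degrees[src] += 1
        degrees[dst] += 1
    return degrees


def store_degrees(session, degrees):
    """Persist the batch output so the querying engine can serve it."""
    rows = [{"id": node, "degree": deg} for node, deg in degrees.items()]
    session.run(WRITE_DEGREES, rows=rows)


def top_k_nodes(session, k):
    """Querying engine: retrieve precomputed values, no heavy computation."""
    return [(rec["id"], rec["degree"]) for rec in session.run(TOP_K, k=k)]


class InMemorySession:
    """Stand-in for a database session, only for running this sketch."""

    def __init__(self):
        self.store = {}

    def run(self, query, **params):
        if query == WRITE_DEGREES:
            for row in params["rows"]:
                self.store[row["id"]] = row["degree"]
            return []
        ranked = sorted(self.store.items(), key=lambda kv: -kv[1])
        return [{"id": n, "degree": d} for n, d in ranked[: params["k"]]]


session = InMemorySession()
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
store_degrees(session, compute_degrees(edges))
print(top_k_nodes(session, 1))  # [('a', 3)]
```

Against a real Neo4j instance, the stand-in would be replaced by a session obtained from `GraphDatabase.driver(uri, auth=(user, password)).session()`, with the URI and credentials supplied for the deployment at hand.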

Graph processing engines require information on the whole graph to be accessed quickly, i.e. held in memory, and may not require a distributed architecture, depending on the context. However, data can grow so much that using a single machine is no longer viable, in which case there are distributed solutions:
- **Apache Spark GraphX**: Distributed representation of the graph using Resilient Distributed Datasets (RDDs) for both edges and vertices, so edges are assigned to different machines and vertices can span multiple machines.
- **Apache Giraph**: Iterative graph processing system built for high scalability. Used by Facebook to analyse the social graph formed by users and their connections. It is built upon Hadoop to utilise the potential of structured datasets at large scale.
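The idea that edges are assigned to machines while vertices span several of them can be sketched in plain Python. This is only an illustration of the principle: the hash used to pick a partition is an arbitrary choice made for this sketch, not the strategy GraphX actually applies, and the function names are hypothetical.

```python
from collections import defaultdict


def partition_edges(edges, num_partitions):
    """Assign every edge to exactly one partition (machine)."""
    partitions = defaultdict(list)
    for src, dst in edges:
        # Arbitrary deterministic hash of the edge, chosen for illustration.
        pid = (src * 31 + dst) % num_partitions
        partitions[pid].append((src, dst))
    return partitions


def vertex_replicas(partitions):
    """For each vertex, collect the partitions that hold one of its edges."""
    replicas = defaultdict(set)
    for pid, part_edges in partitions.items():
        for src, dst in part_edges:
            replicas[src].add(pid)
            replicas[dst].add(pid)
    return replicas


edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
parts = partition_edges(edges, num_partitions=2)
replicas = vertex_replicas(parts)

print(dict(parts))     # {1: [(0, 1), (0, 3), (1, 2)], 0: [(0, 2)]}
print(replicas[0])     # {0, 1}: vertex 0 spans both partitions
```

Each edge lives on one partition only, while a well-connected vertex such as `0` is replicated wherever its edges land; keeping those replicas in sync is part of the communication cost of distributed graph processing.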

The choice of algorithms is smaller in distributed frameworks than on a single shared machine because:
1. Implementing an algorithm in a distributed way is more complex than on a shared machine, due to the communication among nodes, which reduces efficiency
2. Only algorithms that scale (nearly) linearly with the number of data points should be implemented, so that horizontal scalability is ensured by adding computational nodes as the dataset grows

**Pregel** is the graph equivalent of MapReduce. A computation consists of a sequence of iterations, each called a **superstep**, and each superstep involves a node and its neighbours.
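The superstep model can be sketched on a single machine with a few lines of Python. The example below is an assumption-laden toy, not Pregel's or Giraph's API: it labels every vertex with the smallest vertex id in its connected component, with each vertex reading the messages sent to it in the previous superstep and notifying its neighbours only when its own value changes.

```python
def pregel_min_label(adjacency, max_supersteps=50):
    """Vertex-centric label propagation executed as a series of supersteps.

    `adjacency` maps each vertex id to the list of its neighbours.
    """
    values = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in adjacency}
    for v, neighbours in adjacency.items():
        for n in neighbours:
            inbox[n].append(values[v])

    supersteps = 1
    # Stop when no messages are in flight (all vertices have halted).
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in adjacency}
        for v, messages in inbox.items():
            if not messages:
                continue  # no incoming messages: vertex stays halted
            candidate = min(messages)
            if candidate < values[v]:
                values[v] = candidate
                # Only changed vertices notify their neighbours.
                for n in adjacency[v]:
                    outbox[n].append(candidate)
        inbox = outbox
        supersteps += 1
    return values, supersteps


graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, n_steps = pregel_min_label(graph)
print(labels)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

In a distributed engine the per-vertex loop runs in parallel across machines and the message exchange between supersteps is the synchronisation point.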

Examples of **graph databases** for storing non-structured data:
- Neo4j
    - Can scale to large datasets via sharding, distributing the data over multiple nodes
    - Also flexible and user-friendly, good for an MVP/PoC to get started in an agile manner
    - Has a Graph Data Science library
    - Queried natively with Cypher; can also be queried via Gremlin through Apache TinkerPop
    - Though scaling by sharding, i.e. breaking a large graph down into smaller subgraphs, may not be the best option
- JanusGraph
    - Scalable graph database for storing and querying graphs distributed across a multi-machine cluster, with hundreds of billions of vertices and edges
    - Sits on top of Google Cloud Bigtable, Apache HBase, Apache Cassandra or ScyllaDB, abstracting a graph view on top of them
    - Can efficiently handle **supernodes**, i.e. nodes with an extremely large degree, which are often bottlenecks
    - Graph can be analysed with Gremlin, using Java connectors or Python bindings
- OrientDB
- Amazon Neptune 
- etc.

**Graph query languages** allow us to traverse the underlying graphs. Gremlin is a functional language in which operators are chained together to form path-like expressions.
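A Gremlin traversal such as `g.V().has('name', 'alice').out('knows').values('name')` reads as a path: start from all vertices, keep those named alice, follow outgoing `knows` edges, and return the names found. The chaining style can be imitated with a toy class in plain Python. Note that `ToyTraversal` and `V` below are hypothetical helpers written for this sketch over an in-memory dictionary; they are not the `gremlinpython` API and do not talk to a Gremlin server.

```python
class ToyTraversal:
    """Toy imitation of Gremlin-style chained steps over an in-memory graph."""

    def __init__(self, graph, current):
        self.graph = graph
        self.current = list(current)  # vertex ids reached so far

    def has(self, key, value):
        """Keep only the vertices whose property `key` equals `value`."""
        kept = [v for v in self.current
                if self.graph["vertices"][v].get(key) == value]
        return ToyTraversal(self.graph, kept)

    def out(self, label):
        """Move along outgoing edges with the given label."""
        reached = [dst
                   for v in self.current
                   for (src, lbl, dst) in self.graph["edges"]
                   if src == v and lbl == label]
        return ToyTraversal(self.graph, reached)

    def values(self, key):
        """Terminate the traversal and return a property of each vertex."""
        return [self.graph["vertices"][v][key] for v in self.current]


def V(graph):
    """Start a traversal from every vertex of the graph."""
    return ToyTraversal(graph, graph["vertices"].keys())


graph = {
    "vertices": {1: {"name": "alice"}, 2: {"name": "bob"}, 3: {"name": "carol"}},
    "edges": [(1, "knows", 2), (1, "knows", 3), (2, "knows", 3)],
}

print(V(graph).has("name", "alice").out("knows").values("name"))
# ['bob', 'carol']
```

Each step returns a new traversal, which is what lets the steps compose into a single path-like expression.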