Data Engineering Ecosystem

Table of Contents

  1. Data Pipeline

  2. Data Ingestion

  3. Distributed File System

  4. Batch Processing

  5. Real-time Processing

  6. Data Stores

  7. Web Application Frameworks

  8. Visualization Tools

### Data Pipeline

### Data Ingestion

Data ingestion is the first component in the process of building a big data pipeline. Ingestion technologies come into play when data must be collected from multiple sources in a distributed and reliable way. Most data ingestion technologies can be thought of as persistent queues that collect data from various sources, such as web logs and third-party APIs, and deliver it to a centralized distributed file system, such as HDFS, or a real-time (stream) processing module, such as Storm, Flink, Spark Streaming, or Spark Structured Streaming.

Data ingestion can be categorized as follows:

  1. Batch: This refers to importing data in the form of files as a dump, transforming it as needed, and delivering it to a distributed file system. Examples include any form of data dump, such as Common Crawl (https://commoncrawl.org/) or GitHub Archive.

  2. Real-time (Streaming): This refers to a continuous stream of real-time data, where each data item is ingested as it is emitted by the source. Examples include Twitter streams, sensor data, and Uber cab updates.

Examples of ingestion tools: Kafka, RabbitMQ, AWS Kinesis, Fluentd, Sqoop
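
To make the queue-based ingestion pattern described above concrete, here is a minimal sketch using the kafka-python client; the broker address, the `web-logs` topic, and the `access.log` file are assumptions for illustration.

```python
# Minimal ingestion sketch: publish web-log lines to a Kafka topic.
# Assumes a broker at localhost:9092, a topic named "web-logs",
# and a local "access.log" file (all placeholders for illustration).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

with open("access.log", "rb") as log_file:
    for line in log_file:
        # Each log line becomes one message; downstream consumers
        # (a stream processor or an HDFS/S3 sink) read from the topic.
        producer.send("web-logs", value=line.strip())

producer.flush()   # block until all buffered messages are delivered
producer.close()
```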

Sometimes, data ingestion challenges go beyond funneling information from multiple specific sources into a single data store. Often, several business problems are at play:

  • Lots of different and/or disparate data coming from or residing on multiple systems.
  • The structure and schema of the data differ depending on the source, and both are subject to change over time.
  • Values are often messy and contain unexpected content.

In this blog post, LinkedIn's Jay Kreps outlined how engineers used the concept behind log files to bridge the data integration gap. Apache Kafka was born from that work.

Also considered part of the suite of ingestion technologies are serialization tools. Many APIs for publicly available data, such as from GitHub, frequently change their format, and ingesting ever-changing data in a robust way can be difficult.

Serialization tools such as Avro allow data to be ingested seamlessly with backward compatibility, but this can also introduce other interesting challenges in a robust data platform.
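
To illustrate that backward compatibility, here is a hedged sketch using the fastavro library (the record and field names are made up): records written with an old schema are read back with a newer schema that adds a field with a default value.

```python
# Sketch of Avro backward compatibility using the fastavro library.
# Record and field names are made up for illustration.
import io
from fastavro import writer, reader, parse_schema

old_schema = parse_schema({
    "type": "record", "name": "Event",
    "fields": [{"name": "user", "type": "string"}],
})

new_schema = parse_schema({
    "type": "record", "name": "Event",
    "fields": [
        {"name": "user", "type": "string"},
        # Added field declares a default, so old records stay readable.
        {"name": "country", "type": "string", "default": "unknown"},
    ],
})

buf = io.BytesIO()
writer(buf, old_schema, [{"user": "alice"}, {"user": "bob"}])
buf.seek(0)

# Read the old data with the new (reader) schema; the default fills the gap.
for record in reader(buf, new_schema):
    print(record)  # {'user': 'alice', 'country': 'unknown'}, ...
```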

### Distributed File System

Distributed file systems offer many of the features of local file systems, but with added protection, such as storing data redundantly. File systems are often used as a source of truth for later transformations on the data. They are optimized for durable, persistent storage of raw and unstructured data, rather than the quick access and queries that databases provide.

Examples of distributed file systems: HDFS, Amazon S3
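
As a small sketch of landing raw data on a distributed file system to serve as the source of truth, the snippet below uploads a dump to Amazon S3 with boto3; the bucket name and key are placeholders.

```python
# Sketch: land a raw data dump in S3 (the durable "source of truth" layer).
# The bucket name and object key are placeholders for illustration.
import boto3

s3 = boto3.client("s3")

# Raw, unstructured files are stored as-is; later batch jobs read them back.
s3.upload_file(
    Filename="commoncrawl_sample.warc.gz",
    Bucket="my-datalake-raw",
    Key="ingest/2016/10/commoncrawl_sample.warc.gz",
)
```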

### Batch Processing

Batch processing is the process of computing a result from historical data (the source of truth). Once the data is ingested into a distributed file system (DFS), a distributed cluster of machines is often used to perform computation on part or all of the dataset.

The MapReduce paradigm has become the standard for batch processing, with a variety of technologies available for different use cases. Batch processing is not used for low-latency results, as a computation can take anywhere from minutes to hours or even days to complete.

Finally, the computed result is stored in a database. Batch results are accurate because they are computed from scratch from the raw data; if there is a human error in the computation logic, the result can always be recomputed and corrected.

Examples of batch processing tools: Hadoop MapReduce, Spark, Amazon EMR
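
For a concrete feel of the batch pattern, here is a minimal PySpark job that recomputes an aggregate from raw files on a distributed file system; the input path and the `url` field are assumptions.

```python
# Minimal batch-processing sketch with PySpark: recompute page-view counts
# from raw JSON logs on a distributed file system.
# The paths and the "url" field are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-pageview-batch").getOrCreate()

logs = spark.read.json("hdfs:///data/weblogs/2016/10/*.json")

# Recompute the aggregate from scratch over the raw data.
pageviews = logs.groupBy("url").count()

# Persist the batch result so a serving database or later job can pick it up.
pageviews.write.mode("overwrite").parquet("hdfs:///data/aggregates/pageviews")

spark.stop()
```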

### Real-time Processing

Because batch processing typically takes a significant amount of time to finish, real-time processing is generally used to update values as new data enters the system. In some cases, updating values accurately may be too time-consuming to be feasible within the available processing window.

In these cases, approximate approaches, such as probabilistic data structures like HyperLogLog, may be acceptable. This approach may incur cumulative errors, but batch processing can typically correct them later.
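
As a small sketch of that approximate approach, the snippet below estimates the number of distinct users in a stream with a HyperLogLog from the datasketch library (one possible implementation; the event stream is simulated).

```python
# Sketch: approximate distinct-user counting with a HyperLogLog.
# Uses the datasketch library as one possible implementation;
# the simulated loop stands in for a real-time event stream.
from datasketch import HyperLogLog

hll = HyperLogLog(p=12)  # 2^12 registers, roughly 1.6% relative error

for i in range(100000):
    user_id = "user-%d" % (i % 25000)   # simulated stream of events
    hll.update(user_id.encode("utf8"))  # constant memory per update

print("approximate distinct users: %d" % hll.count())  # ~25,000
```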

While static data sets are often a hallmark of batch processing, an ever-growing stream of data is the norm in stream processing.

Given the volume of data flowing through many of today's applications, processing time has become ever more important. Sometimes, the solution requires making architectural or design tweaks. (e.g., Andy Chu's Insight project implementing a lambda architecture allowed for both batch and real-time updates.)

Other times, it's about understanding what is happening "under the hood" of a technology and accounting for those details. (e.g., Jamie Grier, of Twitter and data Artisans fame, re-ran Yahoo's benchmarking of Flink and Storm based on his deep knowledge of Flink and its advantages over other stream processing technologies.)

Examples of stream processing tools: Storm, Flink, Spark Streaming and Spark Structured Streaming
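
To ground this, here is a minimal Spark Structured Streaming job that maintains a running word count over lines arriving on a socket; the socket source is a stand-in for a real ingestion layer such as Kafka.

```python
# Minimal stream-processing sketch with Spark Structured Streaming:
# a running word count over lines arriving on a socket.
# The localhost:9999 socket source stands in for a real queue such as Kafka.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")  # emit the full updated table on each trigger
         .format("console")
         .start())

query.awaitTermination()
```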

### Data Stores

Once data has been processed by the streaming or batch computations, it needs to be stored in a way that can be quickly accessed by the end user or data scientist.

While file systems are designed to store data durably, databases organize data in a way that minimizes unnecessary disk seeks and network transfers to provide the quickest response to queries.

There are hundreds of options for databases, and choosing one that organizes data appropriately for a specific use case is one of the most difficult decisions data engineers face.

One consideration may be dealing with large volumes of data, a challenge that has been classically associated with data engineering. Technologies such as MapReduce and Google's BigTable were some of the forerunners in this area. Since their introduction, distributed data stores have become an industry norm.

It’s worth noting that while many businesses are moving toward large NoSQL data stores, they are not abandoning relational databases. Often, they augment their infrastructure with distributed data stores and figure out how to integrate them with the rest of their pipeline.

In the hyper-competitive landscape of e-commerce, high system availability is often a prized quality.

Distributed data stores are vaunted for their fault tolerance, made possible by multiple servers replicating data so that if one of them goes down, the data is still available on another server. But tradeoffs are necessary.

According to Brewer's CAP Theorem, a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance (that is, the ability to keep operating despite a network failure or loss of communication between nodes). One of those guarantees must be broken in order to satisfy the other two. Because losing network connectivity at some point in a system's lifetime is inevitable, the choice is almost always between data consistency and system availability.

Here are three examples of high availability incurring a cost:

  • Cassandra guarantees availability and partition tolerance, but not immediate consistency. The best it can do is guarantee eventual consistency. So while data is always available on one of the nodes, it may be stale until the network recovers and the distributed systems sync up. Insight has a good blog post on it: Cassandra: Daughter of Dynamo and BigTable. (A short sketch below shows how a client can tune this tradeoff per query.)

  • Riak is another distributed NoSQL data store that is highly available and noted for its implementation of conflict-free replicated data types, which allows for eventual data consistency in the face of network partitions or other failures. Kyle Kingsbury, who maintains a popular blog, describes how CRDTs work. (It might help to read his explanation of Riak first.)

  • Aerospike is a high-performance key-value data store that has an impressive record of availability, but it also carries with it a risk of data loss.

Examples of distributed data stores: HBase, Cassandra, Riak, Amazon Redshift, Elasticsearch, Neo4j
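
To make the consistency-versus-availability tradeoff concrete, here is a hedged sketch of tuning consistency per query with the DataStax Python driver for Cassandra; the node address, keyspace, and table are placeholders.

```python
# Sketch: per-query tunable consistency with the DataStax Cassandra driver.
# The node address, keyspace, and table are placeholders for illustration.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("analytics")

# ONE favors availability and latency: a single replica's answer suffices,
# so reads may be stale until replicas converge (eventual consistency).
fast_read = SimpleStatement(
    "SELECT * FROM pageviews WHERE url = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# QUORUM favors consistency: a majority of replicas must respond,
# at the cost of higher latency and less tolerance of node failures.
safe_read = SimpleStatement(
    "SELECT * FROM pageviews WHERE url = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

rows = session.execute(fast_read, ("/home",))
```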

### Web Application Frameworks

A web application framework is a software framework that is designed to support the development of dynamic websites, web applications, web services and web resources. The framework aims to alleviate the overhead associated with common activities performed in web development. For example, many frameworks provide libraries for database access, templating frameworks and session management, and they often promote code reuse.

Examples of web application frameworks: Flask, Node.js, CodeIgniter, Django
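
As a minimal sketch of what such a framework provides, the Flask app below serves a precomputed result over HTTP; the endpoint and the hard-coded data are hypothetical stand-ins for a query against a real serving database.

```python
# Minimal Flask sketch: expose a precomputed result over HTTP.
# The endpoint and the hard-coded data are hypothetical stand-ins
# for a query against the serving data store.
from flask import Flask, jsonify

app = Flask(__name__)

# In a real pipeline this would be read from the serving database
# populated by the batch and streaming jobs.
PAGEVIEWS = {"/home": 1024, "/about": 97}

@app.route("/pageviews/<path:url>")
def get_pageviews(url):
    key = "/" + url
    return jsonify(url=key, count=PAGEVIEWS.get(key, 0))

if __name__ == "__main__":
    app.run(debug=True)
```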

### Visualization Tools

Data visualization has typically been popular among data scientists, but may be just as important for engineers. Data visualization is all about telling a data story. From a data engineering perspective, this can include graphs and charts for monitoring systems, job/task dependencies, and simple data exploration.

Examples of visualization tools: D3, Highcharts, Google Charts
