<h1>(Big) Data Engineering - Fundamentals and Landscape </h1>


## What is Big Data?
We might have heard the term Big Data before, but what does it really mean? Below is a high-level definition:

> Big data refers to data that is so large, fast and complex that it’s difficult or impossible to process using traditional tools like a data warehouse.

Traditional methods of data storage included storing data locally on a computer, on a server, mainframe system or in a database. Over time, due to the rapid evolution in data production, these traditional approaches are no longer able to cope with the new reality.

Accordingly, a new group of tools and technologies was introduced starting in the mid-to-late 2000s.  The logic behind these new tools was to shift the mentality from storing and analyzing small quantities of high-value data in a structured format using expensive systems such as a data warehouse (DWH) to capturing and storing __all__ raw data (both structured and unstructured) in a central repository, and then running various operations and transformations on this comprehensive dataset using tools that were open-source and inexpensive.

Hence, these new big data tools were developed to address the below challenges called the "V's" of big data:


<p align="center">
  <img src="images/5v.png" width=600>
</p>


- __Volume__: 
    -   The quantity of data is growing, and so is the number of data production sources (such as from Internet of Things devices and machines)
- __Variety__: 
    -   The format of produced data is also evolving.  Now we have:
        - Structured: Table-like data organized into rows and columns.  This used to be the standard format for data storage for decades.
        - Unstructured: More modern data types including text documents, emails, videos, images, and audio.
- __Velocity__: 
    -   The speed at which the data is arriving (batch initially and real-time more recently)
- __Veracity__: 
    -   The quality of the data (inconsistencies, uncertainties, empty data records etc.)
- __Value__: 
    -   How valuable is the captured data? How can its business value be increased?
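
The veracity dimension can be made concrete with a small data-quality check. The sketch below is illustrative only: the sample records, the field names and the `quality_report` helper are hypothetical and not part of any tool mentioned in this document. It counts empty values per field and lists the distinct units found, which is one simple way to surface empty records and inconsistencies.

```python
from collections import Counter

# Hypothetical sensor readings; None and "" stand in for empty values.
records = [
    {"device_id": "a1", "temperature": 21.5, "unit": "C"},
    {"device_id": "a2", "temperature": None, "unit": "C"},
    {"device_id": "", "temperature": 70.1, "unit": "F"},
    {"device_id": "a4", "temperature": 22.0, "unit": "C"},
]


def quality_report(rows, required_fields):
    """Count missing values per field and list the distinct units seen."""
    missing = Counter()
    for row in rows:
        for field in required_fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
    units = sorted({row["unit"] for row in rows if row.get("unit")})
    return {"rows": len(rows), "missing": dict(missing), "units": units}


report = quality_report(records, ["device_id", "temperature", "unit"])
print(report)
```

More than one unit in the report hints at an inconsistency (Celsius mixed with Fahrenheit) that would need to be resolved before analysis.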

The first real big data technology was Apache Hadoop, introduced in 2006. This was a revolutionary step, as it finally enabled the capture, storage and analysis of structured and unstructured data in one centralized location called the data lake.  Hadoop also provided the capabilities to manipulate, transform and explore the data using tools such as:

- Java MapReduce
- HiveQL (which is a SQL-like language)
- Pig Latin (which is a high-level data-flow scripting language)
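
To give a feel for the MapReduce programming model, here is a minimal single-process sketch in plain Python. It is not Hadoop's Java API: the function names `map_phase`, `shuffle_phase` and `reduce_phase` are chosen for illustration, and a real cluster would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data lakes store raw data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)
```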

## Data Ecosystem and Concepts

- The modern data infrastructure consists of various concepts, tools and components to be able to handle the various types of data being generated.
- These tools and components are arranged in groupings, with each layer performing a specific activity on the data before passing it on to the next layer.
- The main layers in the ecosystem include: 
    -   Data Storage
    -   Data Acquisition
    -   Data Processing
    -   Data Access
    -   Data Management
    -   Data Visualization

    Here is an example diagram showing how these layers and their corresponding tools fit in the big picture:

    <p align="center">
  <img src="images/data-architecture.png" width=600>
</p>
<p></p>


We'll now briefly explain the role of each layer in the modern data ecosystem depicted above.  The data process starts from the bottom and moves up sequentially through each layer:

### 1. Data Storage 

> Data storage, especially within the context of big data, is a compute-and-storage architecture that can be used to collect and manage huge-scale datasets and perform real-time data analyses.

In this layer, data that is generated from sources is captured and persisted.  There are several alternatives available for this approach, each one tailored for specific types of data and its related use cases.  The most common types of enterprise data storage components include:

#### Data/Delta Lakes

A data lake is a centralized repository that allows you to store all your structured and unstructured data in one location.  This type of storage approach enables easy scaling to cope with the increasing quantity of data. Delta Lake is a related but distinct concept: an open-source storage layer that runs on top of a data lake and adds features such as ACID transactions.

#### Data Warehouses

A data warehouse (DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of the traditional business intelligence architecture.  DWHs are central repositories of integrated and structured data from one or more disparate sources. They store current and historical data in one single place, where it is used for creating analytical reports.  These types of systems are expensive to maintain and usually store high-value data.

#### Cloud Storage

Cloud storage is a cloud computing model that stores data on the Internet through a distributed computing provider who manages and operates data storage as a service. It’s delivered on demand with just-in-time capacity and costs, and eliminates buying and managing data storage infrastructure. This gives users the agility, global scale and durability, with “anytime, anywhere” data access.


### 2. Data Acquisition

> Data acquisition is the process of collecting the data from various sources and moving it from point of origin to the target destination.

Depending on the type of data being produced, and the frequency of its production, we can classify this layer into two main types:

#### Batch Data:

This is the acquisition of a vast amount of data at rest, in its entirety and at the same time.  This approach is usually implemented at regular intervals (such as once per day or once a week) on a full dataset, which includes the historical data plus any new incremental data. An example of batch data is moving a dataset of the entire list of customers for a certain company.
       
#### Real-time Data:

This is the acquisition of continuous data while it's in motion, in very short, near-instantaneous intervals as the data arrives from its source. An example of real-time data is data from a mobile application that is constantly turned on, such as GPS locations.
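
The difference between the two acquisition styles can be sketched in a few lines of Python. This is a simplified illustration, not a real ingestion tool: `batch_load`, `stream_events` and the sample data are hypothetical names, and a plain iterator stands in for a live source.

```python
def batch_load(source):
    """Batch: read the full dataset at rest in one go."""
    return list(source)


def stream_events(source):
    """Real-time: yield records one at a time as they 'arrive'."""
    for event in source:
        yield event


customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}]
gps_feed = iter([(52.52, 13.40), (52.53, 13.41), (52.54, 13.42)])

snapshot = batch_load(customers)        # whole dataset available at once

latest = None
seen = 0
for position in stream_events(gps_feed):
    latest = position                   # state is updated per record
    seen += 1

print(len(snapshot), seen, latest)
```

In the batch case the consumer works on a complete snapshot, while in the streaming case it only ever holds the current record and whatever running state it chooses to keep.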

### 3. Data Processing

> Data processing is, generally, the automated manipulation of raw data and transforming it to produce meaningful information. This is the layer that handles the ETL/ELT operations on the data.

Data processing is handled by specialized tools, with the vast majority of these tools being open-source and oftentimes free to download and use.  

The most popular industry-ready frameworks are Hadoop (including YARN), Spark, Flink and Storm. The choice of framework, and even of which component within that framework to deploy (as they all come with various modules), is typically made by an enterprise systems architect. Several criteria help determine the tool of choice, including:
-   Is the incoming data batch or streaming?
-   What is the type of data being captured and stored (structured, unstructured?)
-   Where will the data be stored?
-   What transformations are required on the data?
-   What is the quality of the data?
-   What are we expected to do with the data after cleaning and transforming?

Once the data passes through the required processing steps, it's ready to be accessed by the relevant stakeholders, as it's now in a cleaned and integrated state.
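
As a rough illustration of the ETL steps this layer performs, the following sketch extracts rows from raw CSV text, cleans them, and loads an aggregate into a target store. Everything here is an assumption made for the example: the sample data, the `extract`, `transform` and `load` helpers, and the use of a plain dictionary as the target. Production pipelines would use one of the frameworks listed above.

```python
import csv
import io

# Hypothetical raw extract; in practice this would come from the storage layer.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,alice,4.25
"""


def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, normalize names, cast types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": float(row["amount"]),
        })
    return cleaned


def load(rows, target):
    """Load: aggregate the cleaned rows into the target store (a dict here)."""
    for row in rows:
        target[row["customer"]] = target.get(row["customer"], 0.0) + row["amount"]
    return target


totals = load(transform(extract(RAW_CSV)), {})
print(totals)
```

In an ELT variant, the raw rows would be loaded into the target first and the cleaning step would run there afterwards.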

### 4. Data Access

> Data access refers to the capability to interact with the data once it's been through the data processing steps.  

In the data access layer, the aim is to expose the cleaned, integrated and prepared data to the downstream systems and various stakeholders. This can be done via several tools, depending on our objective.  For instance, if we are querying the data for analytics purposes, then we'll use tools such as HiveQL, Spark SQL or Python.  If the goal is to create predictive models, we can use a language like R, or Python libraries such as pandas and NumPy, to create and run these algorithms.
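
A typical analytics query in this layer looks like the sketch below. Python's built-in `sqlite3` module is used here only as a stand-in for a SQL engine such as Hive or Spark SQL, and the `sales` table and its rows are made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for a SQL engine such as Hive or Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# A typical analytics query: total sales per region, largest first.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
conn.close()
print(rows)
```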




### 5. Data Management

> Data management is the process of managing and synchronizing the data based on requirements.

In the data management layer, the objective is to determine what to do next with the data after access is provided.  For instance, it may be required that certain jobs need to run hourly or daily.  In this case, we'll use tools like Oozie or Airflow to create and schedule such jobs and to provide the required parameters.

If the requirement is to move or copy the data into another system, we can use Kafka to transport the data to the target system.  If the requirement is to store the data in an enterprise data warehouse (EDW), we can use a tool like Sqoop to place the data there directly.

Finally, if the need is to create various dashboards and graphs, we can feed the data to the next (and final) layer, which is the data visualization layer.

### 6. Data Visualization

> Data visualization is the process of displaying data in charts, graphs, maps, and other visual forms. It is used to help people easily understand and interpret their data at a glance, and to clearly show trends and patterns that arise from this data.

After the raw data has been through the previous 5 steps, it's now ready to be presented to business leaders and executives.  It's quite common for non-technical professionals to refer to dashboards and graphs to analyze how the business is performing, track metrics and key performance indicators (KPIs) and monitor the day-to-day operations of the company.  In practice, most data-related projects are initiated to at least partially support dashboards and visualization tools for the top-level business leaders.

In this layer, a tool like Tableau can be leveraged to represent the information we have in an easy to understand graphical format.  Some common types of charts include:
-   Pie charts
-   Bar charts
-   Time series charts
-   Graphs

## What is Data Fabric and why is it important?

> A data fabric is a novel _data management architecture_ that serves as an integrated layer (fabric) of data and connecting processes.  It can optimize access to distributed data and intelligently curate and orchestrate it for self-service delivery to data consumers.

Data fabric is a very modern approach to managing the data within large organizations. With a data fabric, you can elevate the value of enterprise data by providing users access to the right data just in time, regardless of where it is stored. A data fabric architecture is agnostic to data environments, data processes, data use and geography, while integrating core data management capabilities. It automates data discovery, data governance and consumption, delivering business-ready data for analytics and AI.

Top performing enterprises are data driven. However, several challenges block them from fully exploiting all data:
- Lack of data access. 
- Numerous data sources and data types. 
- Data integration complexities. 


Research shows that up to 74% of data is not analyzed in most organizations and up to 82% of enterprises are inhibited by data silos.  

With a data fabric, business users and data scientists can access trusted data faster for their applications, analytics, AI and machine learning models, and business process automation, helping to improve decision making and drive digital transformation. Technical teams can use a data fabric to radically simplify data management and governance in complex hybrid and multicloud data landscapes while significantly reducing costs and risk.

Data fabric enables a permanent and scalable mechanism for a business to consolidate all its data under the umbrella of one unified platform. It leverages storage and processing power from multiple heterogeneous nodes to enable enterprise-wide access to all of its data assets. According to Forrester, a Big Data Fabric assists enterprises to “…quickly ingest, transform, curate, and prepare streaming and batch data to support a real-time trusted view of the customer and the business.”

Furthermore, Big Data Fabric enables companies to:

-   Effectively consolidate data assets with on-premises and Cloud data sources, for a complete view of enterprise-wide information.
-   Gain access to the latest data in real-time.
-   Easily onboard new big data systems and retire legacy systems, while keeping business systems running continuously without disruption.
-   Overcome the challenges of insufficient data availability, unreliable data storage and security, siloed data, poor scalability, and reliance on underperforming legacy systems.

Below is what a typical data fabric ecosystem would look like in a global company:

   <p align="center">
  <img src="images/data-fabric.png" width=600>
</p>
<p></p>

## Data Components

Now that we've seen the big picture and understood the main concepts of the modern corporate big data ecosystem, we'll take a deeper dive into some of the most popular tools and components that we've discussed so far. The components are grouped together based on the layer in which they operate and the role that they perform.

As a quick reminder, below are the 6 layers of the modern data ecosystem.  We'll look at the most popular tools for each layer in turn:

-   Data Storage
-   Data Acquisition
-   Data Processing
-   Data Access
-   Data Management
-   Visualization

### Data Storage

#### HDFS 

> HDFS is a distributed file system that handles large data sets running on commodity hardware. 

<p align="left">
  <img src="images/hdfs2.png" width=150>
</p>

It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN.
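
The way HDFS spreads a large file over many nodes can be sketched as follows. The block size of 128 MB and the replication factor of 3 are common HDFS defaults, but the `place_blocks` helper, the node names and the round-robin placement are simplifying assumptions for illustration. Real HDFS placement is rack-aware and decided by the NameNode.

```python
import math

BLOCK_SIZE_MB = 128      # common HDFS default block size
REPLICATION = 3          # common HDFS default replication factor


def place_blocks(file_size_mb, nodes):
    """Split a file into blocks and assign each block to REPLICATION nodes.

    Round-robin placement is a simplification and assumes there are at
    least REPLICATION nodes; real HDFS placement is rack-aware.
    """
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(REPLICATION)
        ]
    return placement


layout = place_blocks(300, ["node1", "node2", "node3", "node4"])
for block, replicas in layout.items():
    print(block, replicas)
```

Because every block lives on several nodes, losing a single machine does not lose data, which is what makes commodity hardware a workable foundation.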

#### Cloud Storage

> Cloud storage is a model of computer data storage in which the digital data is stored in logical pools, said to be on "the cloud". 

<p align="left">
  <img src="images/s3-2.png" width=150>
</p>


The physical storage spans multiple servers, and the physical environment is typically owned and managed by a hosting company. This approach to data storage is quickly becoming the dominant trend in industry and is replacing the traditional approach of companies having to purchase and maintain expensive servers and storage devices.

Popular cloud storage services include:
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage


#### HBase

> HBase is an open-source non-relational (NoSQL) distributed database modeled after Google's Bigtable and written in Java. 

<p align="left">
  <img src="images/hbase2.png" width=150>
</p>


It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. HBase has many connectors to integrate it with the various other tools in the ecosystem, and is a widely used data store in global companies.
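
The Bigtable-style data model can be pictured as a nested map: a row key leads to cells identified by column family and qualifier, and each cell keeps timestamped versions. The sketch below mimics that shape with plain dictionaries. The `put` and `get` helpers and the sample rows are hypothetical and are not the HBase client API.

```python
# table[row_key][(column_family, qualifier)] -> list of (timestamp, value)
table = {}


def put(row_key, family, qualifier, value, timestamp):
    """Store a new version of one cell; older versions are kept."""
    cell = table.setdefault(row_key, {}).setdefault((family, qualifier), [])
    cell.append((timestamp, value))


def get(row_key, family, qualifier):
    """Return the most recent value of a cell, or None if it is absent."""
    versions = table.get(row_key, {}).get((family, qualifier), [])
    return max(versions, key=lambda v: v[0])[1] if versions else None


put("user#42", "profile", "name", "Ada", timestamp=1)
put("user#42", "profile", "city", "London", timestamp=1)
put("user#42", "profile", "city", "Paris", timestamp=2)   # newer version
put("user#43", "activity", "last_login", "2024-01-01", timestamp=1)

print(get("user#42", "profile", "city"))
print(get("user#43", "profile", "city"))
```

Rows do not need to share the same columns, which is why this model suits sparse data.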


#### Cassandra

> Apache Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system.

<p align="left">
  <img src="images/cassandra.png" width=150>
</p>

It's designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
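
One idea behind running without a single point of failure is to place data on a ring of nodes using consistent hashing, so that any node can work out which nodes hold a given key. The sketch below illustrates that idea under several assumptions: the `Ring` class, the node names and the use of MD5 are chosen for the example, one token per node is used, and Cassandra's own partitioner and replication strategies are more elaborate.

```python
import bisect
import hashlib


def token(value):
    """Hash a string to an integer position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication=2):
        self.replication = replication
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, key):
        """Walk clockwise from the key's position and pick distinct nodes."""
        start = bisect.bisect(self.tokens, (token(key), ""))
        count = min(self.replication, len(self.tokens))
        return [
            self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)
        ]


all_nodes = ["node-a", "node-b", "node-c"]
ring = Ring(all_nodes)
owners = ring.replicas("customer:42")
print(owners)

# If the first owner is lost, the key is still served by the next replica.
survivors = Ring([n for n in all_nodes if n != owners[0]])
print(survivors.replicas("customer:42"))
```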

#### MongoDB

> MongoDB is a source-available cross-platform document-oriented database program. 

<p align="left">
  <img src="images/mongoDB.png" width=150>
</p>

Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. It's a popular tool in global companies to handle document type big data structures.
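
The following sketch shows what "JSON-like documents with optional schemas" means in practice: two documents in the same collection can carry different fields. It uses plain Python dictionaries rather than a MongoDB driver, and the `customers` data and `find` helper are hypothetical.

```python
import json

# Two documents in the same hypothetical "customers" collection.
# They do not share an identical set of fields, which a fixed table schema
# would not allow without NULL columns or schema changes.
customers = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Linus", "address": {"city": "Helsinki"}, "tags": ["vip"]},
]


def find(collection, **criteria):
    """Return documents whose top-level fields match all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


match = find(customers, name="Linus")
print(json.dumps(match, indent=2))
```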


#### Data Warehouse

> Data warehouse (DWH) systems were traditionally the de facto standard to store and analyze structured corporate data.  

<p align="left">
  <img src="images/dwh.png" width=150>
</p>


They rely on SQL logic and concepts and could only store structured data types.  Several vendors offer popular data warehouse systems, including:
- Oracle Datawarehouse
- Microsoft Datawarehouse
- Amazon Redshift
- SAP
- IBM Db2 Warehouse
- Snowflake (the most recent one in the list)

### Data Acquisition


#### Kafka

> Apache Kafka is a relatively new open-source technology for distributed data storage optimized for __ingesting__ and __processing__ streaming data in real-time. 


<p align="left">
  <img src="images/kafka.png" width=150>
</p>

Streaming data is data that is continuously generated by potentially numerous data sources.  Examples include Internet of Things (IoT) devices such as sensors and smartphones. Such devices are usually numerous and typically send their data records simultaneously. Accordingly, in order to properly capture and process this constant influx of data, a streaming platform needs to be properly configured to handle the data sequentially and incrementally.

Kafka provides three main functions to its users:

- Publish and subscribe to streams of records
- Effectively store streams of records in the order in which records were generated
- Process streams of records in real-time

Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.  
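
The three functions above (publish/subscribe, ordered storage, and incremental processing) can be illustrated with a toy in-memory log. This is a conceptual sketch only: `TopicLog`, `publish` and `poll` are names made up for the example and are not the Kafka client API, and a real Kafka topic is partitioned, replicated and persisted to disk.

```python
class TopicLog:
    """A single-partition, append-only log with per-consumer offsets."""

    def __init__(self):
        self.records = []      # stored in arrival order
        self.offsets = {}      # consumer name -> next offset to read

    def publish(self, record):
        self.records.append(record)
        return len(self.records) - 1   # offset of the new record

    def poll(self, consumer, max_records=10):
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch


log = TopicLog()
for reading in ({"sensor": "s1", "temp": 20}, {"sensor": "s2", "temp": 23}):
    log.publish(reading)

first = log.poll("dashboard")       # reads both records
log.publish({"sensor": "s1", "temp": 21})
second = log.poll("dashboard")      # reads only the new record
replay = log.poll("audit")          # a new consumer starts from offset 0
print(len(first), len(second), len(replay))
```

Because records stay in the log after being read, a late consumer can replay history, which is how one system can serve both historical and real-time use cases.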

#### Flume

> Apache Flume is a distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. 

<p align="left">
  <img src="images/flume.png" width=150>
</p>

It has a simple and flexible architecture based on streaming data flows.

### Data Processing

#### Apache Hadoop

> Hadoop is an open-source, Java-based framework that allows for the distributed processing of large data sets across clusters of computers using customizable programming models. 

<p align="left">
  <img src="images/hadoop.png" width=150>
</p>


It is designed to scale up from single servers to thousands of machines, each offering local computation and storage and breaks down jobs into tasks that can run in parallel across the various nodes. 

Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop was designed to tackle big data and especially unstructured data types, which traditional relational database systems couldn't handle. It mainly focuses on batch data processing on-disk, although later versions provided support for other components that could handle real-time data.

#### Apache Spark

> Apache Spark is a multi-language data processing engine for executing data engineering, data science, and machine learning on single-node machines or clusters. 

<p align="left">
  <img src="images/spark.png" width=150>
</p>

It can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools.

Spark performs data computations in-memory (as opposed to Hadoop's on-disk approach), and thus can be up to 100X faster than Hadoop for certain types of data processing activities.

Spark was also designed to handle real-time data, something which Hadoop wasn't good at handling.

#### Apache Storm

> Apache Storm is a free and open-source distributed real-time computation system. 

<p align="left">
  <img src="images/storm.png" width=150>
</p>

Apache Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Apache Storm is simple to deploy, and can be used with any programming language.  Storm has many use cases, including: 
- Realtime analytics
- Online machine learning
- Continuous computation
- Distributed RPC
- ETL

Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Apache Storm integrates with the queueing and database technologies most companies already use. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.

#### Apache Flink

> Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. 


<p align="left">
  <img src="images/flink.png" width=150>
</p>


The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner.


### Data Access

#### Hive

> Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. 

<p align="left">
  <img src="images/hive.png" width=150>
</p>

Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

#### KNIME

> KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. 
<p align="left">
  <img src="images/knime.png" width=150>
</p>

KNIME integrates various components for machine learning and data mining through its modular, GUI-based data pipelining "Building Blocks of Analytics" concept.


#### Snowflake

> Snowflake is a recently introduced data warehouse built on top of the Amazon Web Services or Microsoft Azure cloud infrastructure. 

<p align="left">
  <img src="images/snowflake.png" width=150>
</p>

The Snowflake architecture allows storage and compute to scale independently, so customers can use and pay for storage and computation separately. This technology is becoming quite popular in global companies due to the benefits it provides.

### Data Management

#### Apache Airflow

> Apache Airflow is a workflow engine that will easily schedule and run complex data pipelines. It will make sure that each task of the data pipeline will get executed in the correct order and each task gets the required resources.

<p align="left">
  <img src="images/airflow3.png" width=150>
</p>


It also provides a user interface to monitor and fix any issues that may arise.

Features of Apache Airflow:
- Easy to use: If you have a bit of Python knowledge, you are good to go and deploy on Airflow.
- Open source: It is free and open-source with a lot of active users.
- Robust integrations: It provides ready-to-use operators so that you can work with Google Cloud Platform, Amazon AWS, Microsoft Azure, etc.
- Use standard Python to code: You can use Python to create simple to complex workflows with complete flexibility.
- User interface: You can monitor and manage your workflows. It allows you to check the status of completed and ongoing tasks.

#### Apache Oozie

> Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs. 

<p align="left">
  <img src="images/oozie.png" width=150>
</p>

Workflows in Oozie are defined as a collection of control flow and action nodes in a directed acyclic graph. Control flow nodes define the beginning and the end of a workflow as well as a mechanism to control the workflow execution path.
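
The idea of a workflow as a directed acyclic graph can be sketched with Python's standard `graphlib` module (available from Python 3.9). The task names below are hypothetical, and this is neither an Oozie workflow definition nor an Airflow DAG. It only shows how a scheduler can derive a valid execution order from task dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "quality_report": {"clean"},
    "publish": {"aggregate", "quality_report"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The two middle tasks do not depend on each other, so a scheduler is free to run them in either order or in parallel. If the dependencies contained a cycle, `graphlib` would raise a `CycleError` instead of returning an order.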

#### Talend

> Talend is an open source data integration platform. It provides various software and services for data integration, data management, enterprise application integration, data quality, cloud storage and big data.

<p align="left">
  <img src="images/talend.png" width=150>
</p>

### Visualization

#### Tableau

> Tableau is a powerful and fast-growing data visualization tool used in the business intelligence industry. 

<p align="left">
  <img src="images/tableau.png" width=150>
</p>

It helps simplify raw data into an easily understandable format. Tableau helps present data in a way that can be understood by professionals at any level in an organization. It also allows non-technical users to create customized dashboards.

Data analysis is very fast with Tableau, and the visualizations created are in the form of dashboards and worksheets.

The top features of Tableau software are:

- Data blending
- Real time analysis
- Collaboration of data
- Doesn't require deep technical or programming skills to operate. 

#### MicroStrategy

> MicroStrategy is a business intelligence software, which offers a wide range of data analytics capabilities. 

<p align="left">
  <img src="images/microstrategy.png" width=150>
</p>

As a suite of applications, it offers:

- Data discovery
- Advanced Analytics
- Data visualizations
- Embedded BI
- Banded Reports and Statements. 

MicroStrategy can connect to big data stores like Hive, data warehouses, relational systems, flat files, web services and a host of other types of sources to pull data for analysis. Features such as highly formatted reports, ad hoc queries, thresholds and alerts, and automated report distribution make MicroStrategy an industry leader in the BI software space. It has been recognized as a visionary in the Gartner Magic Quadrant.

#### Datawrapper

> Datawrapper is a free, online data visualization tool that can create charts, maps and tables via a user-friendly graphical user interface.

<p align="left">
  <img src="images/datawrapper.png" width=150>
</p>

With it, you can create three kinds of visualizations: 

- Maps: choropleth maps, symbol maps, and locator maps.
- Charts: from simple bar charts, line charts, column charts to arrow charts, scatterplots, population pyramids, etc.
- Tables: with mini line charts, bar charts, images, Markdown, etc.

#### Lumify

> Lumify is a big data fusion, analysis, and visualization platform. 
<p align="left">
  <img src="images/lumify.png" width=150>
</p>

It helps users to discover connections and explore relationships in their data via a suite of analytic options.


### Other Tools

There are a host of other types of tools that don't explicitly fall under the above 6 layers we discussed throughout this notebook.  These tools generally either perform functions across more than one layer or provide certain features or benefits to the entire ecosystem and all its layers.  One such type of technology is called containerization.

__Containerization__ can be thought of as the evolution of virtual machines (VMs). Containerization is defined as a form of operating system virtualization, through which applications are run in isolated user spaces called containers, all using the same shared operating system (OS). This type of technology is rapidly being adopted by global corporations due to the benefits it provides.  Currently, the most popular containerization tools in the market are:

#### Kubernetes

> Kubernetes is an open-source container-orchestration system for automating computer application deployment, scaling, and management. 

<p align="left">
  <img src="images/kubernetes.png" width=150>
</p>

It was originally designed by Google. Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. 

#### Docker

> Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers. 

<p align="left">
  <img src="images/docker3.png" width=150>
</p>


Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels.

## Key Takeaways

- Big data can be defined using the 5 V's: volume, velocity, variety, veracity and value.
- Big data is different from traditional data, as it comes in both structured and unstructured formats and arrives at various velocities and in accelerating quantities.
- Traditional relational database and data warehouse models were well suited for structured data with low volumes. However, such systems are expensive to maintain and can't handle unstructured data.  This is why they are currently being augmented with more modern big data tools. 
- The modern data ecosystem is composed of 6 layers, namely: storage, acquisition, processing, access, management and visualization.  Data flows sequentially from the first layer (data storage) to the last layer (visualization), and each layer performs a specific function on the data.
- Each of the 6 layers of the modern data ecosystem has specific tools that have become the industry standard for global companies.
- In the data storage layer, cloud data storage is quickly becoming the dominant trend in industry.
- Another popular trend in the data ecosystem space is containerization, which can be understood as a form of operating system virtualization.