# Data Layers and Tools

Now that we've seen the big picture and understood the main concepts of the modern corporate big data ecosystem, we'll take a deeper dive into some of the most popular tools and components discussed so far. The components are grouped by the layer in which they operate and the role they perform.

## Data Layers

As a quick reminder, the 6 layers of the modern data ecosystem are listed below. We'll look at the most popular tools for each layer in turn:

1. Data Storage
2. Data Acquisition
3. Data Processing
4. Data Access
5. Data Management
6. Visualisation

### Data Storage

> This layer is responsible for long-term data storage

#### HDFS 

> HDFS is a distributed file system that can store large data sets. It's designed to be installed on cheap commodity hardware. 

<p align="left">
  <img src="images/hdfs3.png" width=150>
</p>

At the time of its release, HDFS offered a much cheaper alternative to expensive databases and data warehouses for storing data. HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN. It can store both structured and unstructured data.

#### Cloud Storage

> Cloud storage is a model of computer data storage in which the digital data is stored in logical pools, said to be on "the cloud". 

<p align="left">
  <img src="images/s3-3.png" width=150>
</p>


Under the hood, cloud-based systems store their data on a very large number of server machines housed in facilities called _data centers_. Such data centers are typically owned and managed by a hosting company (such as Amazon). This approach to data storage is quickly becoming the dominant trend in industry, replacing the traditional approach in which companies purchase and maintain their own expensive servers and storage devices.

Popular cloud storage services include:
- Amazon S3
- Microsoft Azure and OneDrive
- Google Firebase


#### HBase

> HBase is an open-source, non-relational (NoSQL), distributed columnar data store modeled after Google's Bigtable. It's designed to handle big data in real-time.

<p align="left">
  <img src="images/hbase3.png" width=150>
</p>


HBase has many connectors to integrate it with the various other tools in the big data ecosystem, and is a widely used data store in industry.


#### Cassandra

> Apache Cassandra is another popular free, open-source, distributed, wide-column data store

<p align="left">
  <img src="images/cassandra.png" width=150>
</p>

It's designed to handle big data by spreading it across a distributed network of servers. Cassandra provides high availability, and its architecture has no single point of failure.
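
One way to picture how a distributed store avoids a single point of failure is a hash ring with replication. The sketch below is a simplified illustration in plain Python, not Cassandra's implementation or API: the `Ring` class, the node names and the use of MD5 as the hash are all assumptions made for the example.

```python
import bisect
import hashlib


def token(value):
    """Map a string to an integer position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """A toy hash ring: every key is stored on several consecutive nodes."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = min(replication_factor, len(nodes))
        # Sort nodes by ring position so we can walk the ring clockwise.
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas_for(self, key):
        """Return the nodes holding `key`: the first node clockwise, then its successors."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c"], replication_factor=2)
owners = ring.replicas_for("user:42")

# If the first owner goes down, the remaining replica can still serve the key.
survivors = [node for node in owners if node != owners[0]]
```

Because every key lives on more than one node and no node plays a special coordinating role in this sketch, losing any single machine leaves each key reachable.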

#### MongoDB

> MongoDB is another popular free, distributed, non-relational data store used to handle document-oriented data

<p align="left">
  <img src="images/mongoDB.png" width=150>
</p>


MongoDB is mainly designed to store JSON-like documents efficiently. It provides a flexible schema model, so semi-structured data can easily be stored and analysed in real time.
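
To make the flexible schema idea concrete, here is a minimal sketch in plain Python. It doesn't use MongoDB or its drivers: the `documents` list, the field names and the `find` helper are hypothetical stand-ins for a collection and a simple equality query.

```python
import json

# A hypothetical "collection": each document is a JSON-like dict.
# Documents need not share the same fields (flexible schema).
documents = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "tags": ["admin", "ops"]},
    {"_id": 3, "name": "Carol", "address": {"city": "Leeds"}, "tags": ["ops"]},
]


def find(collection, field, value):
    """Return documents whose `field` equals `value` or, for list fields, contains it."""
    results = []
    for doc in collection:
        stored = doc.get(field)  # a missing field simply yields None
        if stored == value or (isinstance(stored, list) and value in stored):
            results.append(doc)
    return results


ops_users = find(documents, "tags", "ops")
print(json.dumps(ops_users, indent=2))
```

Note that the first document has no `tags` field at all, yet it sits in the same collection and the query simply skips it.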


#### Data warehouse

> Data warehouse (DWH) systems were traditionally the de facto standard for storing and analysing structured corporate data. Structured data is data that can be stored in tables of rows and columns (such as data in an Excel sheet).

<p align="left">
  <img src="images/dwh.png" width=150>
</p>


These tools rely on SQL to store, process and analyse data. Popular data warehouse systems include:
- Oracle Data warehouse
- Microsoft Data warehouse
- Amazon Redshift
- SAP
- IBM Db2 Warehouse
- Snowflake (the most recent of these)
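
As a rough illustration of the SQL-on-tables style these systems share, the sketch below uses Python's built-in `sqlite3` module as a stand-in. SQLite is not a data warehouse, and the `sales` table and its columns are invented for the example.

```python
import sqlite3

# An in-memory SQLite database stands in for a warehouse here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "Widget", 120.0),
        ("North", "Gadget", 80.0),
        ("South", "Widget", 200.0),
    ],
)

# A typical analytical query: aggregate a measure by a dimension.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
conn.close()
```

The same `GROUP BY` query shape carries over to real warehouse products, although each has its own SQL dialect and loading tools.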

### Data Acquisition

> This layer involves tools and applications to ingest data from different sources and move the data throughout the ecosystem

#### Kafka

> Apache Kafka is a free, open-source platform designed to handle big data streaming. It can connect to multiple data producers and consumers, and is able to handle up to trillions of data events daily. 


<p align="left">
  <img src="images/kafka.png" width=150>
</p>

Streaming data is data that is continuously created by one or more data sources. Examples include Internet of Things (IoT) devices such as sensors and smartphones. A typical enterprise has hundreds or thousands of such devices, all sending data records simultaneously. To capture and process this constant flow of data, a streaming platform needs to be configured to handle the data sequentially and incrementally.

Kafka is one such platform that provides the following functions:

- Publish and subscribe to one or more data streams
- Store records in the order in which they were written (ordering is preserved within each partition of a topic)
- Process streams of records in real-time as the data is being ingested and transported
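
The publish, store-in-order and subscribe functions listed above can be sketched with a toy append-only log. This is plain Python, not the Kafka client API: the `Topic` class, its method names and the subscriber names are assumptions made for illustration.

```python
class Topic:
    """A toy append-only log; not the Kafka client API."""

    def __init__(self):
        self.records = []  # records are kept in the order they were published
        self.offsets = {}  # next offset to read, per subscriber name

    def publish(self, record):
        self.records.append(record)

    def subscribe(self, subscriber):
        # New subscribers start reading from the beginning of the log.
        self.offsets.setdefault(subscriber, 0)

    def poll(self, subscriber, max_records=10):
        """Return the next unread records for `subscriber` and advance its offset."""
        start = self.offsets[subscriber]
        batch = self.records[start:start + max_records]
        self.offsets[subscriber] = start + len(batch)
        return batch


topic = Topic()
topic.subscribe("dashboard")
topic.subscribe("archiver")

for reading in [{"sensor": "s1", "temp": 20.5}, {"sensor": "s2", "temp": 21.0}]:
    topic.publish(reading)

first_batch = topic.poll("dashboard", max_records=1)
second_batch = topic.poll("dashboard")
archive_batch = topic.poll("archiver")
```

Each subscriber tracks its own position in the log, so the dashboard and the archiver read the same records independently and in the same order.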

#### Flume

> Apache Flume is a tool for collecting, aggregating, and transporting big data streams efficiently. It uses a distributed model which provides reliability and robustness.

<p align="left">
  <img src="images/flume.png" width=150>
</p>

Flume is easily customisable and can be configured to ingest data from one or more data sources and feed it to various types of data consumers. Although it was very popular a few years ago, its popularity is currently in decline as it's being replaced by Apache Kafka.


### Data Processing

> This layer involves the tools that perform various operations and transformations on the data itself

#### Apache Hadoop

> Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using customisable programming models. 

<p align="left">
  <img src="images/hadoop.png" width=150>
</p>


It is designed to scale up from a single node/server to any number of machines, each offering local computation and storage. Larger jobs are divided into smaller tasks that can run in parallel across the various nodes.

Hadoop was designed to tackle big data, and especially unstructured data types which traditional relational database systems couldn't handle. It mainly focuses on batch data processing using an on-disk approach. In the later versions of Hadoop, support was provided to integrate newer components that could handle real-time data (such as Apache Storm).
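
The idea of dividing a large job into smaller parallel tasks can be sketched with a word count in the MapReduce style. This is a single-machine illustration in plain Python, not Hadoop code: the function names and input chunks are invented, and a thread pool stands in for the cluster's worker nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(chunk):
    """Map phase: emit (word, 1) pairs for one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_task(word, counts):
    """Reduce phase: combine the values for one key."""
    return word, sum(counts)


# Each chunk plays the role of a block of input handled by a separate task.
chunks = ["big data big clusters", "data nodes store data", "big jobs small tasks"]

with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(map_task, chunks))  # one map task per chunk

word_counts = dict(reduce_task(w, c) for w, c in shuffle(mapped).items())
print(word_counts)
```

Because each map task only needs its own chunk, the chunks could just as well sit on different machines, which is the property a cluster exploits.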

#### Apache Spark

> Apache Spark is a powerful, unified data processing engine which is currently one of the most widely used tools among top enterprises worldwide, such as Netflix, Yahoo and eBay. Due to its flexibility, it can process both batch and real-time data efficiently.

<p align="left">
  <img src="images/spark.png" width=150>
</p>

Spark was originally developed at the University of California, Berkeley in 2009. It can process massive amounts of structured and unstructured data by leveraging its in-memory, distributed computational model.

Spark performs data computations in-memory (as opposed to Hadoop's on-disk approach), and can therefore be up to 100x faster than Hadoop for certain types of data processing activities.

Spark was also designed to handle real-time data, something which Hadoop wasn't able to handle at the time.

#### Apache Storm

> Apache Storm is another free and open-source distributed real-time data processing platform

<p align="left">
  <img src="images/storm.png" width=150>
</p>

Apache Storm was one of the first real-time data processing engines introduced that could integrate with Apache Hadoop. Storm is not as complicated as Hadoop to set up, and can be used with several programming languages.  

Typical use cases for Storm are those that require constant real-time data processing, such as: 
- Real-time data analytics
- Fraud detection
- Continuous machine learning
- Extract, Transform and Load

Apache Storm is also very fast. One benchmark recorded a throughput of over one million tuples processed per second per node.

#### Apache Flink

> Apache Flink is another popular open-source, distributed data processing framework that can process large data streams at scale in real time


<p align="left">
  <img src="images/flink.png" width=150>
</p>


Flink can operate in several types of cluster environments and performs data computations in-memory (similar to Apache Spark). It can handle bounded or unbounded data streams, and is designed to be robust and fault-tolerant.
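
The difference between bounded and unbounded streams can be illustrated with a small windowed aggregation. The sketch below is plain Python rather than the Flink API: `tumbling_window_sums`, the event format and the window size are assumptions chosen for the example.

```python
import itertools


def tumbling_window_sums(events, window_size):
    """Yield (window_start, total) as each window closes.

    `events` is any iterable of (timestamp, value) pairs in timestamp order.
    It may be finite (bounded) or endless (unbounded); results are emitted
    incrementally, so the whole stream is never held in memory.
    """
    current_window = None
    total = 0
    for timestamp, value in events:
        window_start = (timestamp // window_size) * window_size
        if current_window is None:
            current_window = window_start
        if window_start != current_window:
            yield current_window, total  # the previous window is complete
            current_window, total = window_start, 0
        total += value
    if current_window is not None:
        yield current_window, total  # flush the last window of a bounded stream


# Bounded stream: a finite list of (timestamp, value) events.
bounded = [(0, 1), (3, 2), (5, 4), (9, 1), (12, 7)]
bounded_result = list(tumbling_window_sums(bounded, window_size=5))

# Unbounded stream: an endless generator; we only take the first three windows.
unbounded = ((t, 1) for t in itertools.count())
unbounded_result = list(itertools.islice(tumbling_window_sums(unbounded, 5), 3))
```

The same function serves both cases: with a bounded input it eventually finishes, while with an unbounded input it keeps emitting results for as long as the caller keeps reading.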


### Data Access

> This layer involves components that are used to explore, query and analyse data

#### Hive

> Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. 

<p align="left">
  <img src="images/hive.png" width=150>
</p>

Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

#### KNIME

> KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. 

<p align="left">
  <img src="images/knime.png" width=150>
</p>

KNIME integrates various components for machine learning and data mining through its modular, GUI-based data pipelining "Building Blocks of Analytics" concept.


#### Snowflake

> Snowflake is a recently introduced big data warehouse built on top of cloud infrastructure (such as AWS or Microsoft Azure) 

<p align="left">
  <img src="images/snowflake.png" width=150>
</p>

The Snowflake architecture allows storage and compute to scale independently, so users can pay for storage and computation separately. This technology is becoming quite popular in industry due to the benefits it provides.

### Data Management

> This layer includes tools that organise and manage the jobs and workflows which execute code and applications based on certain conditions

#### Apache Airflow

> Apache Airflow is a workflow engine initially created by Airbnb. It automates, schedules and runs complex data pipelines, making sure that each task in a pipeline is executed in the correct order and gets the resources it requires.

<p align="left">
  <img src="images/airflow3.png" width=150>
</p>


- Airflow also provides a user interface that can monitor each task's status and allows users to deal with any errors or bugs
- It's easy to use for data engineers who have expertise using Python
- It's free and open-source with an active community
- The tool provides many ready-to-use connectors so that you can work with Google Cloud Platform, Amazon Web Services (AWS), Microsoft Azure, etc.
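
Running each task only after the tasks it depends on is a topological ordering of a dependency graph. The sketch below shows that idea with Python's standard-library `graphlib` module. It is not Airflow code: the task names, the `dependencies` mapping and the `run` function are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "report": {"aggregate", "train_model"},
}

executed = []


def run(task_name):
    """Stand-in for real work; here we only record the execution order."""
    executed.append(task_name)


# static_order() yields every task after all of its dependencies.
for task in TopologicalSorter(dependencies).static_order():
    run(task)

print(executed)
```

A real scheduler adds retries, timing and resource allocation on top, but the ordering guarantee rests on the same graph traversal.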

#### Apache Oozie

> Apache Oozie is a system for workflow scheduling designed to manage Hadoop jobs. The tool is normally hosted on servers that run numerous big data workflows on a regular basis. 

<p align="left">
  <img src="images/oozie.png" width=150>
</p>

Workflows in Oozie are defined as a collection of control flow and action nodes in a directed acyclic graph. Control flow nodes define the beginning and the end of a workflow as well as a mechanism to control the workflow execution path.

#### Talend

> Talend is an open-source data integration platform. It provides various software and services for data integration, data management, enterprise application integration, data quality, cloud storage and big data.

<p align="left">
  <img src="images/talend.png" width=150>
</p>

### Visualisation

> This layer is responsible for visually representing data that's been prepared for reporting to business stakeholders

#### Tableau

> Tableau is a powerful and increasingly popular data visualisation tool used by top-tier companies for business intelligence and real-time report creation

<p align="left">
  <img src="images/tableau.png" width=150>
</p>

It helps users to unlock insights in raw data by visualising it in an easy-to-interpret format. Tableau uses an intuitive drag-and-drop interface to build advanced dashboards, which allows non-technical users to create and use customised dashboards.

Data analysis with Tableau is fast, and the tool provides a wide variety of customisable dashboards and charts which can connect to a range of data storage tools in the back-end.

#### MicroStrategy

> MicroStrategy is a business intelligence platform that offers a wide range of data analytics capabilities

<p align="left">
  <img src="images/microstrategy.png" width=150>
</p>

As a suite of applications, it offers:

- Data discovery
- Advanced Analytics
- Data visualisations
- Embedded BI
- Banded Reports and Statements

Similar to Tableau, MicroStrategy can connect to big data storage tools like Hive, data warehouses, relational systems, flat files, web services and a host of other types of sources to provide data for visual charts and dashboards. Although MicroStrategy offers powerful features, it has a steeper learning curve than Tableau.

#### Datawrapper

> Datawrapper is a free, online data visualisation tool that can create charts, maps and tables via a user-friendly graphical user interface.

<p align="left">
  <img src="images/datawrapper.png" width=150>
</p>

Datawrapper provides the ability to create three kinds of visualisations: maps, charts and tables.


#### Lumify

> Lumify is a big data integration, analysis, and visualisation platform. 

<p align="left">
  <img src="images/lumify.png" width=150>
</p>

Lumify is another tool that allows users to explore connections and discover relationships in their data via a wide range of data visualisation options.

### Other Tools

There are a host of other tools that don't fall neatly under the 6 layers discussed throughout this notebook. These tools generally either perform functions across more than one layer or provide certain features or benefits to the entire ecosystem and all its layers. One such technology is _containerisation_.

__Containerisation__ can be thought of as the evolution of virtual machines (VMs). Containerisation is a form of operating system virtualisation in which applications run in isolated user spaces called containers, all sharing the same operating system (OS). Containers are more lightweight than virtual machines, and each typically has its own file system, applications, and share of resources (memory, CPU, etc.). This technology is rapidly being adopted by industry due to the benefits it provides. Currently, the most popular containerisation tools in the market are:

#### Kubernetes

> Kubernetes is an automated, portable, and open-source system for container-orchestration. It's widely used for software deployment, scaling, and management. 

<p align="left">
  <img src="images/kubernetes.png" width=150>
</p>

It was originally designed by Google. The tool can be used to automatically and reliably run a wide variety of tools and software in parallel. Although it requires some setting up beforehand, containers can easily be deployed any number of times across different computer nodes.

#### Docker

> Docker is another containerisation technology that leverages virtualisation to provide pre-packaged software that can be easily deployed and used on servers and on the cloud 

<p align="left">
  <img src="images/docker3.png" width=150>
</p>


Each container is isolated from other containers, and each comes bundled with the required tools, libraries and configuration files needed to operate independently.

# Key Takeaways

- At a high level, there are 6 layers in a modern data ecosystem. These layers are:
1. Data Storage: 
	- Which is mainly used to store data long-term
2. Data Acquisition: 
	- Which consists of the tools and applications responsible for ingesting and moving data
3. Data Processing: 
	- Which includes tools to perform operations and transformations on the data
4. Data Access: 
	- Which includes tools to explore and analyse data
5. Data Management: 
	- Which consists of tools and applications that manage the running of applications
6. Visualisation: 
	- Which includes tools to visually represent data for reporting purposes

- Containerisation is another popular trend in industry. It involves using isolated user spaces (called containers) which include tools, applications, dependencies and other required components to run an application any number of times across different computers.

