# Data Pipelines

To handle the growing volume, variety and velocity of data, the data foundations of modern organisations are becoming more sophisticated. One critical element of any data infrastructure is the data pipeline.

## What is a Data Pipeline?

> A data pipeline is a set of tools and processes for moving, processing and storing data, in much the same way that a water pipeline moves, processes and stores water.

A data pipeline might be as simple as moving raw data from a single source (point A) to a single target location (point B), or as complex as gathering data from multiple sources, transforming it, and then storing it in multiple destinations.
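
Both ends of that spectrum can be illustrated with a minimal sketch. Everything here is hypothetical: the in-memory lists stand in for real source and target systems, and the `extract`, `transform` and `load` helpers are illustrative names rather than part of any particular tool.

```python
from typing import Iterable, Iterator

Record = dict


def extract(source: Iterable[Record]) -> Iterator[Record]:
    """Read raw records from the source (point A)."""
    yield from source


def transform(records: Iterable[Record]) -> Iterator[Record]:
    """Example transformation: tidy up the passenger name."""
    for record in records:
        yield {**record, "passenger": record["passenger"].strip().upper()}


def load(records: Iterable[Record], target: list) -> int:
    """Append records to the target (point B) and return how many were stored."""
    count = 0
    for record in records:
        target.append(record)
        count += 1
    return count


# Hypothetical in-memory source and target standing in for real systems.
source_a = [{"passenger": " ali "}, {"passenger": "Sara"}]
target_b: list = []

# Simple pipeline: move the raw data from A to B unchanged.
moved = load(extract(source_a), target_b)

# More complex pipeline: transform the data, then store it in several destinations.
lake: list = []
warehouse: list = []
cleaned = list(transform(extract(source_a)))
for destination in (lake, warehouse):
    load(cleaned, destination)
```

In a real system each function would wrap a connector to a database, message queue or file store, but the shape of the flow stays the same.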

Constructing data pipelines is the core responsibility of data engineering. It requires advanced programming skills to design programs for continuous, automated data exchange, and sophisticated technical knowledge to integrate various tools and technologies.

A data pipeline is commonly used for:

- Moving data from the source of creation to a target storage destination
- Wrangling/transforming the data, integrating various datasets and moving the output into a single centralised location (often called a data lake)
- Integrating data generated from various connected devices and systems 
- Migrating/copying databases from on-premise hardware into a cloud-based data warehouse
- Connecting a data lake to business intelligence reporting dashboards used by business executives

## How is Data Actually Moved? 

An example of a real-world data pipeline is shown in the image below. It is part of the Emirates Airlines check-in system mentioned earlier.

<p align="center">

<img src= "images/emirates-data-pipeline.png" width=900>

</p>

The above data pipeline consists of the following:

- Data generated at the source (by check-in agents and the Emirates mobile app) is sent to a mainframe computer and also to a data warehouse for permanent storage. This setup was the original legacy system.
- Data is extracted in XML format every 5 minutes from the data warehouse and sent to TIBCO, a data pipeline tool that has been in use for some time.
- TIBCO forwards the data to Flume every 30 minutes. Flume was a new component added to integrate TIBCO with big data tools.
- Flume moves the data to Kafka, where it is buffered until specific criteria are met (time- and size-related criteria). Kafka was added at a later stage to address some of the issues Flume faced.
- Once the criteria are met, Kafka sends the XML files to a big data lake (Hadoop HDFS).
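
The "buffer until time or size criteria are met" behaviour in the last two steps can be sketched in plain Python. This is an illustration of the idea only, not Kafka's API or the configuration Emirates uses: the `CriteriaBuffer` class, its thresholds and the sample XML strings are all assumptions made for the example.

```python
import time
from typing import Callable, List, Optional


class CriteriaBuffer:
    """Holds messages until a size or age threshold is reached, then flushes."""

    def __init__(self, max_items: int, max_age_seconds: float,
                 sink: Callable[[List[str]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_items = max_items
        self.max_age_seconds = max_age_seconds
        self.sink = sink            # called with each completed batch
        self.clock = clock          # injectable so the example can fake time
        self._items: List[str] = []
        self._first_arrival: Optional[float] = None

    def add(self, message: str) -> None:
        if not self._items:
            # The age of a batch is measured from its first message.
            self._first_arrival = self.clock()
        self._items.append(message)
        self.flush_if_due()

    def flush_if_due(self) -> bool:
        """Send the batch to the sink if either criterion is met."""
        if not self._items:
            return False
        age = self.clock() - self._first_arrival
        if len(self._items) >= self.max_items or age >= self.max_age_seconds:
            self.sink(self._items)
            self._items = []
            self._first_arrival = None
            return True
        return False


# Usage with a fake clock so the time criterion can be shown without waiting.
batches: List[List[str]] = []
now = [0.0]
buffer = CriteriaBuffer(max_items=3, max_age_seconds=60,
                        sink=batches.append, clock=lambda: now[0])

for i in range(3):
    buffer.add(f"<checkin id='{i}'/>")   # the third message meets the size criterion

buffer.add("<checkin id='3'/>")          # starts a new batch
now[0] = 61.0                            # pretend 61 seconds have passed
buffer.flush_if_due()                    # the time criterion is now met
```

Buffering like this trades a little latency for fewer, larger writes to the data lake, which is generally kinder to file systems such as HDFS that prefer large files.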

In the past, data pipelines were usually triggered and monitored manually. More recently, however, tools have emerged that enable automated scheduling, running and monitoring of pipelines. Some of the benefits of automation include:

- Ensuring consistency of the data by re-using the same data ingestion and storing code
- Enhancing productivity as data engineers don't need to manually ingest the data on a regular basis and can spend more of their time on other tasks
- Providing automatic scaling of the network bandwidth to enable the data pipeline to handle different data speeds throughout the day
- Making the code re-usable across the various teams and systems, which helps to standardise best practices

## Components of a Data Pipeline 

Below is the flight check-in system used by Emirates Airlines. It shows how check-in data is created, transported, stored and processed in a real production environment.

<p align="center">

<img src= "images/emirates-color.png" width=900>

</p>


Data pipelines almost always start by ingesting data. We'll go into more detail on data ingestion in a separate notebook.



In the above system, we have _7 component groups_ (which can be on-premise hardware/software, cloud-based tools or a combination of both):

<table>
    <thead>
        <tr>
            <th style="width:auto;text-align:center">Component</th>
            <th style="width:auto;text-align:center">Description</th>
		    <th style="width:auto;text-align:center">Diagram Examples/Color</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>1. Data Source/Origin</th>
            <td><li> Is a component in the system which creates/produces the raw data
                <li> A data source can be human or machine generated data
                <li> Flight check-in data can be manually entered by agents at the airport counter, or via mobile devices if passengers use the online check-in feature provided by the Emirates app
                <li> A data source can also be a data storage location which will export/produce data for another system
            </td>
            <td>
                <li>Top left corner of the diagram
                <li><span style="color:blue">Blue</span>
            </td>
        </tr>
        <tr>
            <th>2. Data Destination</th>
            <td><li> Is the target location to which data is moved
                <li> The destination is relative to the source, and can be another data pipeline, a data repository, or an entirely different system
                <li> It can be a temporary or permanent target
            </td>
            <td>
                <li> The destination can be any step that receives the data
                <li> For example, Flume is the destination for TIBCO, while Kafka is the destination for Flume
            </td>
        </tr>  
		<tr>
            <th>3. Data Ingestion</th>
            <td>
                <li> Consists of the tools and processes of moving data from a source to a temporary or permanent target
                <li> Some of the more modern tools (such as Kafka) allow basic data processing (such as running counters) while the data is in motion
            </td>
            <td> 
                <li> Examples include TIBCO, Flume and Kafka
                <li><span style="color:#FFD580">Light Orange</span>
            </td>
        </tr>
        <tr>
            <th>4. Data Transformation</th>
            <td>
                <li> Is the process of changing the state of the data 
                <li> Involves integrating the raw data which is produced and then manipulating it to become useful information
                <li> Some examples of pre-processing and transformation include:
                <ul> 
                    <li> Simple data cleaning tasks (such as removing empty records)</li>
                    <li> Removing certain unwanted columns from a table</li>
                    <li> Data de-duplication (removing duplicate records)</li>
                    <li> Changing the data file format (from JSON to Parquet)</li>
                </ul> 
            </td>
            <td>
                <li> Apache Spark steps at the bottom of the diagram
                <li> <span style="color:#BABDBA">Grey</span> 
            </td>
        </tr>  
        <tr>
            <th>5. Data Wrangling</th>
            <td>
                <li>Wrangling involves more advanced data cleaning and integrating complex datasets for easy access and analysis
            </td>
            <td>
                <li> The Apache Spark step at the bottom of the diagram
                <li> <span style="color:#CF9FFF">Purple</span>
            </td>
        </tr>
        <tr>
            <th>6. Workflow Scheduling</th>
            <td>
                <li> Workflow scheduling tools are used to automate certain steps or to trigger specific parts of the pipeline based on a schedule
                <li> For example, you might want to run a data cleaning program once per day at 9pm. Instead of manually starting the job, it can be initiated automatically using a workflow scheduler
                <li> Workflow scheduling is also known as orchestration
                <li> Most pipelines use automated workflow schedulers like Apache Oozie (now somewhat outdated) or Apache Airflow (which is newer) to manage the data movement throughout the system
                <li> These tools sometimes provide graphical user interfaces (GUI) to monitor the status of the data pipeline
            </td>
            <td>
                <li> All arrows connecting the various components
                <li><span style="color:#FF8C00">Orange arrows</span>   
            </td>
        </tr>  
        <tr>
            <th>7. Data Storage</th>
            <td>
                <li> Storage tools can vary and include several options, such as databases, cloud storage or physical commodity (cheap) disk storage
                <li> Is often a centralised big data repository mainly used by large organisations to capture raw data for processing and ETL purposes
                <li> Data is normally preserved in various storage environments at different stages of the data pipeline
            </td>
            <td>
                <li>Note that there are several landing zones throughout the data pipeline
                <li><span style="color:#90EE90">Green</span>
            </td>
        </tr>    
    </tbody>
 </table>

<p></p>
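
The scheduling example from the table (running a cleaning job once per day at 9pm) comes down to working out when the next run is due. The sketch below shows only that calculation; `next_daily_run` is a hypothetical helper, not part of Apache Oozie or Apache Airflow, which add dependency handling, retries and monitoring on top of this basic idea.

```python
from datetime import datetime, time, timedelta


def next_daily_run(now: datetime, run_at: time) -> datetime:
    """Return the next moment at which a once-per-day job should start."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate


nine_pm = time(21, 0)

# Before 9pm the job runs later the same day.
same_day = next_daily_run(datetime(2024, 1, 1, 20, 0), nine_pm)

# After 9pm the job waits until the following evening.
following_day = next_daily_run(datetime(2024, 1, 1, 22, 30), nine_pm)
```

A scheduler would sleep until the returned time, start the cleaning job, and then repeat the calculation for the next day.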

Once the raw data is captured in the landing zone, it can then be manipulated and transformed into more meaningful information that can be used for various downstream tasks.  
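
The simple transformation tasks listed in the table (removing empty records, dropping unwanted columns, de-duplicating and changing the file format) might look like the following sketch. The sample records and column names are made up for illustration, and CSV stands in for Parquet as the output format so that the example needs only the standard library.

```python
import csv
import io
import json

# Hypothetical raw check-in data as it might land in the data lake.
raw_json = """[
  {"pnr": "ABC123", "passenger": "Sara", "seat": "12A", "debug": "x"},
  {"pnr": "ABC123", "passenger": "Sara", "seat": "12A", "debug": "y"},
  {},
  {"pnr": "XYZ789", "passenger": "Ali", "seat": "3C", "debug": "z"}
]"""

records = json.loads(raw_json)

# 1. Simple cleaning: drop empty records.
records = [r for r in records if r]

# 2. Remove an unwanted column.
records = [{k: v for k, v in r.items() if k != "debug"} for r in records]

# 3. De-duplicate, keeping the first occurrence of each record.
#    This runs after step 2 because the "debug" values differ between duplicates.
seen = set()
unique = []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        unique.append(r)

# 4. Change the file format: JSON in, CSV out (a stand-in for Parquet).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["pnr", "passenger", "seat"],
                        lineterminator="\n")
writer.writeheader()
writer.writerows(unique)
csv_text = out.getvalue()
```

At production scale the same steps would typically run in a distributed engine such as Apache Spark, as in the diagram, rather than in a single Python process.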

Some of these downstream tasks may include:

-   Ad-hoc queries - such as doing aggregate calculations to retrieve the total sales for the week
-   Advanced analytics - such as creating sophisticated graphs and charts showing various metrics over time
-   Data science - such as creating algorithms to predict product prices for next year
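
As an illustration of the first task, an ad-hoc query for the week's total sales could look like the sketch below. The `sales` table, its columns and the figures are invented for the example, and an in-memory SQLite database stands in for whatever storage layer the pipeline actually feeds.

```python
import sqlite3

# In-memory database standing in for the pipeline's storage layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-01", 120.0), ("2024-01-03", 80.0), ("2024-01-09", 50.0)],
)

# Aggregate calculation: total sales for the week of 1 to 7 January.
# ISO-formatted dates sort correctly as text, so BETWEEN works on them.
total_for_week = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE sale_date BETWEEN ? AND ?",
    ("2024-01-01", "2024-01-07"),
).fetchone()[0]
conn.close()
```

Here only the first two rows fall inside the week, so `total_for_week` is 200.0.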

Depending on the business use case, data engineering projects could have different demands for the latency (delay) and speed of the data acquisition process, amongst other things. 

## Challenges Building Data Pipelines

Creating data pipelines in organisations doesn't come without its challenges.  For instance, the data infrastructure can be complex and diversified both in terms of the technologies used and geographic location of the various tools.  Moreover, systems are usually built incrementally over time, so there are a variety of legacy systems and tools that must be accounted for as well.

Setting up secure and reliable data flow pipelines can be a complex task. There are so many things that can go wrong.  

Some of the main challenges are highlighted below:
    
### Complexity
-   Creating data ingestion processes can be complex due to the increasing speed (velocity) of data generation and the growing size (volume) of data files
-   Development can be costly in both time and resources (data engineers are expensive)
-   Building data pipelines from scratch every time a new data source or business requirement comes up can be time consuming
<p></p>

### Data security
-   Keeping sensitive details private while transferring data from one point to another is a key concern 
<p></p>
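
One possible way to keep sensitive details private in transit is to replace them with keyed hashes before the data leaves the source, which complements encrypting the connection itself. The sketch below assumes hypothetical field names (`passport_number`, `email`) and a secret key that would need to be managed securely in practice.

```python
import hashlib
import hmac

# Hypothetical list of fields that must not travel in clear text.
SENSITIVE_FIELDS = {"passport_number", "email"}


def pseudonymise(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with sensitive values replaced by keyed hashes."""
    masked = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()
        else:
            masked[field] = value
    return masked


# The key is hard-coded here for illustration only.
safe_record = pseudonymise(
    {"passport_number": "P1234567", "seat": "12A"},
    secret_key=b"example-key",
)
```

Because the same input and key always give the same hash, records can still be joined on the masked field downstream without exposing the original value.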

### Unreliability
-   During data movement, the reliability of the data may be compromised, leading to incorrect decisions based on inaccurate or corrupted data. This may occur because:
    -   Data can be corrupted
    -   Data can be lost
    -   Networks can be overloaded during peak times causing delays (latency)
    -   Data sources may conflict, generating duplicate or incorrect data
<p></p>
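
Corruption and loss can often be detected by sending a checksum with each message and comparing message counts at the receiving end. The following is a minimal sketch under that assumption; `package` and `verify_batch` are hypothetical helpers, and real ingestion tools usually provide their own integrity checks.

```python
import hashlib
from typing import List, Tuple

Message = Tuple[bytes, str]  # (payload, SHA-256 hex digest)


def package(payload: bytes) -> Message:
    """Sender side: attach a checksum to the payload."""
    return payload, hashlib.sha256(payload).hexdigest()


def verify_batch(received: List[Message], expected_count: int) -> List[str]:
    """Receiver side: report lost or corrupted messages."""
    problems = []
    if len(received) != expected_count:
        problems.append(
            f"expected {expected_count} messages, received {len(received)}"
        )
    for index, (payload, digest) in enumerate(received):
        if hashlib.sha256(payload).hexdigest() != digest:
            problems.append(f"message {index} failed its checksum")
    return problems


sent = [package(b"<checkin id='1'/>"), package(b"<checkin id='2'/>")]

# Simulate a bad transfer: the first payload is altered and the second is lost.
received = [(b"<checkin id='X'/>", sent[0][1])]
problems = verify_batch(received, expected_count=len(sent))
```

A pipeline could use a non-empty `problems` list to trigger a retry or raise an alert instead of passing bad data downstream.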

### Technology evolution
- Modern data sources, tools, and consuming applications evolve rapidly. Thus we are faced with constant changes while deploying and maintaining data pipelines.
- Additionally, integrating newer tools with older legacy ones can be a daunting and time-consuming challenge
<p></p>

### Data changes
- The structure of the data produced at the origin could change without prior notice, causing issues for consuming applications
- For instance, a new column might be added that the code isn't designed to handle yet. This can lead to a system error or crash.
<p></p>
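
A lightweight schema check at the start of the pipeline is one way to catch such changes before they cause a crash further along. The expected column names below are assumptions made for the example.

```python
# Hypothetical set of columns the downstream code was written to handle.
EXPECTED_COLUMNS = {"pnr", "passenger", "seat"}


def check_schema(record: dict, expected: set = EXPECTED_COLUMNS) -> dict:
    """Compare a record's columns with the expected ones."""
    actual = set(record)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


# The source has started sending a new "meal" column without notice.
report = check_schema(
    {"pnr": "ABC123", "passenger": "Sara", "seat": "12A", "meal": "veg"}
)
```

Here `report` flags `meal` as unexpected, so the pipeline can log the change or quarantine the record rather than fail.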

### Maintenance and rework
- Implementing changes to a data pipeline can be time-consuming and complex
- Debugging and maintenance can take away time from developing new features

## Key Takeaways

- Data pipelines are a critical component of the data infrastructure of the modern-day corporation, as they manage the flow of data throughout entire systems
- A data pipeline consists of the tools and processes involved in moving, processing and storing data
- A typical enterprise data pipeline consists of 7 components which include: a data source, a data target, data ingestion process, data transformation, data wrangling/cleaning, workflow scheduling and data storage  
- Automating data pipelines is important as it provides numerous benefits, which include: ensuring consistency of the data, freeing up data engineers' time for other tasks, automatic scaling (up or down) of resources based on the data load, and helping to standardise best practices by making the code re-usable across the organisation
- Data pipeline design and implementation is a core role of data engineers in global companies
- Data pipeline creation has several challenges. The top challenges include: complexity, time required, data security, data reliability, technology evolution and system maintenance.
