# Data Ingestion

## What is Data Ingestion? 

> Data ingestion is the process of moving data into a software system. It is one of the main steps of a data pipeline.

Data has become a very valuable asset in modern organisations as they become more data-driven. Data ingestion is a critical step in enabling companies to capture and analyse their data. 

Data ingestion typically does the following things:
- Consumes raw data from a creation source (such as a mobile app)
- Performs processing (such as cleaning the data to remove empty records)
- Writes the cleaned data to a target (which could be another tool, a downstream system or a storage location)
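
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function names (`consume`, `clean`, `write`, `ingest`) are hypothetical, the source is assumed to be an in-memory list of dictionaries, and the target is a plain list standing in for a real storage location.

```python
from typing import Iterable, Iterator, List


def consume(source: Iterable[dict]) -> Iterator[dict]:
    """Step 1: read raw records from the source one at a time."""
    for record in source:
        yield record


def clean(records: Iterable[dict]) -> Iterator[dict]:
    """Step 2: drop empty records (no fields, or every value missing)."""
    for record in records:
        if record and any(value not in (None, "") for value in record.values()):
            yield record


def write(records: Iterable[dict], target: List[dict]) -> int:
    """Step 3: append the cleaned records to the target and return the count."""
    count = 0
    for record in records:
        target.append(record)
        count += 1
    return count


def ingest(source: Iterable[dict], target: List[dict]) -> int:
    """Chain the three steps: consume -> clean -> write."""
    return write(clean(consume(source)), target)


if __name__ == "__main__":
    # Hypothetical raw events, e.g. as produced by a mobile app
    raw_events = [
        {"user": "a", "action": "login"},
        {},
        {"user": None, "action": ""},
        {"user": "b", "action": "purchase"},
    ]
    storage: List[dict] = []
    written = ingest(raw_events, storage)
    print(f"Ingested {written} of {len(raw_events)} records")
```

In a real pipeline each step would usually be handled by a dedicated tool or service rather than a single script.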

## Approaches to Data Ingestion 

> In industry, two main approaches for data ingestion are commonly used: Batch and Streaming.

Let's take a closer look at each:

### 1. Batch Ingestion

> Batch is when data is grouped together over a period of time and then ingested in bulk (on a regularly scheduled or ad-hoc basis)

Sometimes you need to process the _entire dataset_ to calculate some metrics, because they depend on both new and historical data. That's one place where batch ingestion can be required.

Some other typical examples of batch data ingestion include:
<p></p>

#### Database migration
- A company has an on-premise relational database that is being migrated to the cloud
- The database stores all customer and product information
- A full data dump followed by a data migration is required to move everything to the new cloud database
- This would be done as one large batch data ingestion activity  
<p></p>

#### Data ingestion into data lake
- A company has data arriving regularly to different folders and database systems which it wants to integrate together
- Every night at 9pm, the data engineers run a scheduled job to extract all the data available from each storage location and ingest that data into an HDFS data lake
- The data stored in HDFS can then be processed, integrated and transformed at a later stage
<p></p>

#### Payroll
- Once a month, a bank runs a number of automated batch jobs to ingest payroll data for all of its staff from various data sources, such as a timesheet system or various database tables
- This data can then be integrated and used to do monthly payroll calculations

Batch data ingestion is the traditional way organisations have performed data ingestion. It's been used for several decades and the tools implementing it are mature.
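
As a rough illustration of a scheduled batch job, the sketch below reads every CSV file that has accumulated in a folder and writes all the rows to the target in one bulk operation. The function name `run_batch_job` is hypothetical, the target is assumed to be a plain Python list, and the scheduling itself (for example a nightly trigger) is assumed to happen outside this code.

```python
import csv
from pathlib import Path
from typing import List


def run_batch_job(source_dir: Path, target: List[dict]) -> int:
    """Ingest every CSV file currently in source_dir in one bulk run."""
    batch: List[dict] = []
    # Collect all rows that have accumulated since the last run
    for path in sorted(source_dir.glob("*.csv")):
        with path.open(newline="") as handle:
            batch.extend(csv.DictReader(handle))
    # Single bulk write to the target (a list stands in for real storage)
    target.extend(batch)
    return len(batch)
```

The key property of batch ingestion is visible here: nothing is ingested until the job runs, and then everything available is ingested at once.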



### 2. Streaming

> Streaming is when data is ingested as soon as it's created from the source

Some business use cases require ingesting data as soon as it is produced. There are several reasons for this. For example, the source might not be able to store data and can only create a single copy of a data point at regular time intervals.

Some of the typical use cases requiring streaming ingestion include:

#### Google Maps
- Location data created from the mobile device must be sent and ingested instantly
- This is required to provide accurate location information even if one is driving at high speeds

#### Self-driving vehicles
- Autonomous vehicles have many systems and sensors producing non-stop data
- This data must be ingested as it's created to be sent for immediate processing 
- A few seconds of delay could be life-threatening (for example, a car is at an intersection and the traffic light switches to red)

Previously, almost all processing was done in batches, since it's fairly easy to write a script that runs every few hours. Streaming, however, is considered tougher because it raises new challenges, such as having to keep a system running constantly. In the past, tools to ingest and process data as it was produced didn't yet exist. That said, batch processing has its own strengths in certain use cases, and raises its own problems when pushed to its limits. Although streaming is the more modern approach, it's not the right approach to every problem. We will explore this in more detail later.
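
To contrast with a scheduled batch script, here is a minimal sketch of streaming-style ingestion in Python. The generator `location_events` is a hypothetical stand-in for a live source (such as a mobile device sending location readings), and `ingest_stream` simply passes each event to a handler the moment it arrives, without buffering.

```python
from typing import Callable, Iterable, Iterator


def location_events() -> Iterator[dict]:
    """Hypothetical source: yields one location reading at a time."""
    yield {"device": "phone-1", "lat": 51.50, "lon": -0.12}
    yield {"device": "phone-1", "lat": 51.51, "lon": -0.13}
    yield {"device": "phone-1", "lat": 51.52, "lon": -0.14}


def ingest_stream(events: Iterable[dict], handle: Callable[[dict], None]) -> int:
    """Hand each event to `handle` as soon as it arrives; nothing is buffered."""
    count = 0
    for event in events:
        handle(event)
        count += 1
    return count


if __name__ == "__main__":
    ingested = []
    total = ingest_stream(location_events(), ingested.append)
    print(f"Ingested {total} events, one at a time")
```

With a real source the loop would never finish on its own, which is one reason a streaming system has to be kept running constantly.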

### Comparing Batch and Streaming 

> There is a spectrum between streaming and batch data ingestion. The more frequently you run the batch ingestion job, the closer you get to streaming. 

Between the pure batch and pure streaming ends of this spectrum, there is a technique called _micro-batching_, where data is ingested very frequently but not immediately. In micro-batching, data is collected into small groups (batches) before it is ingested into a system. An example of this type of data ingestion is sending airline check-in data to a mainframe computer every 5 minutes. This approach has been commonly used in situations where batching can be used but where lower response times are required and there is no need to ingest a full historical dataset.

Here is how we can visualise a micro-batch of data which will be sent to the mainframe every 5 minutes:

<p></p>
<p align="center">

<img src= "images/micro-batch_2.png" width=600>

</p>
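
The same idea can be sketched in Python. This is a simplified illustration, not how any particular tool implements it: the function name `micro_batches` is hypothetical, events are assumed to be `(timestamp_in_seconds, record)` pairs arriving in timestamp order, and the 5-minute window from the check-in example is the default.

```python
from typing import Iterable, Iterator, List, Tuple


def micro_batches(
    events: Iterable[Tuple[float, dict]],
    window_seconds: float = 300.0,
) -> Iterator[List[dict]]:
    """Group (timestamp, record) pairs into consecutive fixed-length windows.

    Assumes events arrive in timestamp order. Empty windows are skipped.
    """
    batch: List[dict] = []
    window_end = None
    for timestamp, record in events:
        if window_end is None:
            # The first event opens the first window
            window_end = timestamp + window_seconds
        while timestamp >= window_end:
            # The current window has closed: emit it if it holds any data
            if batch:
                yield batch
                batch = []
            window_end += window_seconds
        batch.append(record)
    if batch:
        # Emit whatever is left in the final, partially filled window
        yield batch


if __name__ == "__main__":
    check_ins = [
        (0, {"passenger": 1}),
        (100, {"passenger": 2}),
        (299, {"passenger": 3}),
        (300, {"passenger": 4}),
        (950, {"passenger": 5}),
    ]
    for number, batch in enumerate(micro_batches(check_ins), start=1):
        print(f"batch {number}: {len(batch)} record(s)")
```

Shrinking `window_seconds` moves this approach towards streaming, while growing it moves it towards traditional batch ingestion.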



Now that we've discussed batch, micro-batch and streaming ingestion at a high level, it's important to understand the similarities and differences between the approaches and to clarify when to use each one.

Below is a table summarising the key criteria defining each of the data ingestion approaches, assessed by hardware, performance, data size and analysis complexity:
<p></p>
<p align="center">

<img src= "images/batch-micro-stream.png" width=600>

</p>

## Tools for Data Ingestion

> Batch data processing tools are mature and have been used for several years. They can handle data ingestion, data processing or sometimes both

Tools commonly used for ingesting batch data and loading it into a storage location (such as a data lake) include:



<table>
    <thead>
        <tr>
            <th></th>
            <th>Description</th>
            <th>Use Cases</th>
            <th>Key Features</th>
            <th>Limitations</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>TIBCO</th>
            <td>Specialised connector that can ingest both batch and streaming data</td>
            <td>Can directly use the pre-built connector to ingest data from over 100 types of source systems</td>
            <td>Industry grade, mature and has out-of-the-box compatibility with the top data sources</td>
            <td>Closed-source, relatively old and requires paid licences</td>
        </tr>
        <tr>
            <th>Flume</th>
            <td>Tool for efficiently collecting, aggregating and moving large amounts of data</td>
            <td>Batch and streaming data pipelines mainly in big data environments</td>
            <td>Reliable, fault-tolerant, distributed architecture, flexible</td>
            <td>Relatively outdated, doesn't guarantee ordering, sometimes causes duplicates</td>
        </tr>  
        <tr>
            <th>Sqoop</th>
            <td>Specialised tool for transferring data between Hadoop and relational databases</td>
            <td>Integrating Hadoop with relational databases</td>
            <td>Easy to use, fast, supports parallel data transfer</td>
            <td>Deprecated, can't pause or resume a transfer, failures need special handling</td>
        </tr>
        <tr>
            <th>Kafka</th>
            <td>Event streaming platform capable of handling big data</td>
            <td>Streaming data pipelines and analytics (although it can also support Batch)</td>
            <td>Distributed, highly scalable, reliable, guarantees data ordering, low-latency</td>
            <td>Lacks monitoring tools, not yet fully mature, can be difficult to configure</td>
        </tr>  
    </tbody>
 </table>


### TIBCO
<p></p>
<p align="left">
<img src= "images/tibco-white.png" width=100>
</p>

- Although a legacy tool, it provides a data pipeline capable of ingesting batch and streaming data
- For more details, [check TIBCO's homepage](https://www.tibco.com/)

### Flume
<p></p>
<p align="left">
<img src= "images/flume.png" width=100>
</p>

- An open-source, distributed service that collects logs from several sources and takes them to a destination for aggregation and analysis. It is highly fault tolerant. 
- Flume allows data collection in batch (processing data in batches) as well as in streaming (processing data in real-time) mode
- For more details, [check Flume's homepage](https://flume.apache.org/)

### Sqoop
<p></p>
<p align="left">
<img src= "images/sqoop_white2.png" width=100>
</p>

- Sqoop is shorthand for SQL to Hadoop. It was popular during the past 10 years but has now become a legacy component, although it's still used in many systems.
- It's a big data tool that offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop, and then load the data into HDFS. This process is called ETL, for extract, transform, and load.
- For more details, [check Sqoop's homepage](https://sqoop.apache.org/)

### Kafka
<p></p>
<p align="left">
<img src= "images/kafka.png" width=100>
</p>

- Kafka is a popular tool used for moving big data between a source and a destination
- The tool provides configurable features that give data engineers the ability to handle both batch and streaming data
- For more details, [check Kafka's homepage](https://kafka.apache.org/)

## Data Ingestion Best Practices 

Some of the best practices to focus on while building industry-grade data ingestion pipelines include:

<table>
    <thead>
        <tr>
            <th style="width:auto;text-align:center">Best Practice Criteria</th>
            <th style="width:auto;text-align:center">Description</th>
            <th style="width:auto;text-align:center">Examples</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>1. Automating data ingestion</th>
            <td>
                <li> Workflow scheduling tools are used to automate certain steps or to trigger specific parts of the ingestion process based on a schedule
                <li> Instead of manually starting the job, it can be initiated automatically using a workflow scheduler
                <li> Workflow scheduling is also known as orchestration
            </td>
            <td>
                <li> Most pipelines use automated workflow schedulers like Apache Oozie (now somewhat outdated) or Apache Airflow (which is newer) to manage the data movement throughout the system
                <li> These tools sometimes provide graphical user interfaces (GUI) to monitor the status of the data pipeline
            </td>
        </tr>  
        <tr>
            <th>2. Scrubbing sensitive data early </th>
            <td>
                <li> Sensitive data should be scrubbed as early as possible (ideally while it's being ingested), to avoid storing it in the data lake
                <li> Storing such data may potentially expose it accidentally and lead to fines
            </td>
            <td>
                <li> Some ingestion tools allow basic checks on the data while it's in-motion. Some logic can be used to check for sensitive data
                <li> This is to avoid paying hefty fines such as those mandated by GDPR regulations
            </td>
        </tr>
        <tr>
            <th>3. Using heterogeneous/compatible technologies</th>
            <td>
                <li> Tools used in the data ingestion process must be designed and implemented with backward (legacy) and forward (future) compatibility in mind
                <li> This is to avoid future re-work and to have a fully heterogeneous data system
            </td>
            <td>
                <li> One way to support this best practice is by using containerisation technologies such as Docker, together with container orchestrators such as Kubernetes
            </td>
        </tr>
           <tr>
            <th>4. Flexibility regarding data formats </th>
            <td>
                <li> Data ingestion tools must be flexible to support reading and transforming a wide variety of data formats (such as JSON, CSV and so on)
                <li> In the long term, this helps to avoid rework as new data file formats are introduced into the system
            </td>
            <td>
                <li> The design and setup should anticipate potential future data types as much as possible
            </td>
        </tr>            
     </tbody>
 </table>
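
To make practice 2 more concrete, here is a minimal Python sketch of scrubbing sensitive fields while a record is still in motion, before it is written anywhere. The field names in `SENSITIVE_FIELDS` and the `scrub` function are hypothetical, and simple redaction is assumed to be sufficient; real compliance requirements may call for a different treatment.

```python
from typing import Iterable

# Hypothetical list of field names considered sensitive in this example
SENSITIVE_FIELDS = frozenset({"email", "phone", "card_number"})

REDACTED = "[REDACTED]"


def scrub(record: dict, sensitive_fields: Iterable[str] = SENSITIVE_FIELDS) -> dict:
    """Return a copy of the record with sensitive values redacted.

    The original record is left untouched.
    """
    sensitive = set(sensitive_fields)
    return {
        key: (REDACTED if key in sensitive else value)
        for key, value in record.items()
    }


if __name__ == "__main__":
    raw = {"id": 42, "email": "jane@example.com", "country": "UK"}
    print(scrub(raw))
```

Applying a function like this inside the ingestion step means the unredacted values never reach the data lake.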

<p></p>
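
Practice 4 (flexibility regarding data formats) can be sketched as a small registry of readers, so that supporting a new format means adding one function rather than reworking the pipeline. The names `READERS` and `read_records` are hypothetical, and only CSV and line-delimited JSON are assumed here.

```python
import csv
import io
import json
from typing import Callable, Dict, List


def read_json_lines(text: str) -> List[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_csv(text: str) -> List[dict]:
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


# Registry mapping a format name to its reader; new formats are added here
READERS: Dict[str, Callable[[str], List[dict]]] = {
    "jsonl": read_json_lines,
    "csv": read_csv,
}


def read_records(text: str, fmt: str) -> List[dict]:
    """Dispatch to the reader registered for the given format."""
    try:
        reader = READERS[fmt]
    except KeyError:
        raise ValueError(f"Unsupported format: {fmt!r}") from None
    return reader(text)


if __name__ == "__main__":
    print(read_records('{"a": 1}\n{"a": 2}\n', "jsonl"))
    print(read_records("a,b\n1,2\n", "csv"))
```

Note that the CSV reader returns every value as a string, so any type conversion would need to happen in a later processing step.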

## Key Takeaways

- Ingestion is one of the critical components of an enterprise data pipeline as it is necessary to introduce new data into a software platform
- Data ingestion usually involves 3 steps: consuming raw data from a creation source, processing the data, then writing the processed data to a target
- There are 2 main approaches to data ingestion: streaming and batch. The approaches differ based on how fast data is ingested. In streaming, data ingestion is almost instant, while in batch data is collected over a period of time and then ingested all together in one big bulk (batch).
- Streaming and micro-batch data ingestion methods are similar but not identical to each other. Streaming requires almost instant data ingestion and processing while micro-batch does not.
- There are a few industry best practices that should be followed when designing and deploying data ingestion solutions. These include using automated ingestion workflows, scrubbing sensitive data early, using compatible tools/technologies and providing future flexibility regarding new file formats.