# Data Engineering Lifecycle

> Data engineering is a relatively new practice that evolved from traditional software engineering. It involves designing, building and deploying systems that capture, integrate and analyse data in batch and real-time environments.

## Overview of Data Engineering

> As with traditional software engineering, data engineering is a process that follows a series of well-defined steps.

From a bird's-eye view, the end goal of data engineering is to address a challenge that many organisations face: transforming raw data into meaningful information. To achieve this, organisations require access to large volumes of clean, accurate, complete and well-labelled data.

## The Data Engineering Lifecycle

> A technology lifecycle refers to a process for planning, creating, testing and deploying an information system to production environments

Data engineering is a core part of the data lifecycle. Data preparation and engineering tasks are often cited as consuming over 80% of the time in AI and machine learning projects. But what exactly does data engineering include? Data engineering comprises all engineering and operational tasks required to make data available for analytics and data science models.

Now that we've seen where data engineering fits in the bigger picture, we'll take a closer look at the actual steps involved in a data engineering lifecycle.

At a high level, a typical data engineering lifecycle in large organisations includes the following 7 steps:

1. Requirements gathering and planning
2. Solution architecture
3. Data store setup
4. Data ingestion
5. ETL/ELT (gathering, importing, wrangling, querying, and analysing data)
6. Solution deployment
7. System monitoring and performance tuning


<p align="center">
  <img src="images/data-eng-lifecycle3.png" width=500>
  <figcaption align="center"><cite>Data Engineering Lifecycle</cite></figcaption>
</p>



Depending on the complexity of the data platform and the defined business requirements, a project may deviate from the steps above, as this is not an exact science. In general, however, a typical data engineering project goes through these phases in one way or another. It's also important to understand that each step can go through several iterations before it's finalised. Under the Agile Scrum project management model used in most data-oriented organisations, short incremental iterations with frequent feedback are preferred over long, complex delivery cycles.

Next, we'll explore each one of these phases in greater detail.

### 1. Requirements Gathering and Planning

> During this initial lifecycle phase, the enterprise architects, product owners and relevant experts collect precise project requirements from business executives

In any major corporate project, a business unit will be tasked with funding and overseeing the project from planning to completion. The aim of this phase is to present a solution that is tailored to the needs of the business and fits the identified business requirements. Any unclear details must be identified and addressed as early as possible to avoid future delays. All these details are captured in a document called the business requirements document (BRD).

Business requirements are a brief set of business functions that the new system must provide for the implementation to be considered successful. This phase does not define technical details such as the type of technology used in the system. A sample business requirement might look like this:
- “The system must track all employees by their respective department, region and designation, and must include all relevant details in the database.”

This requirement gives no detail about how the system will implement the task. It states only what the system must _do_ to be useful to the business.

During this phase, business requirements are captured, and any potential risks are identified. This normally includes a feasibility study, which sets out the strengths and limitations of the project in order to determine whether it's viable.

The planning phase will determine project goals and establish a high-level plan, which is normally represented by a Gantt chart. 

The three primary activities involved in the planning phase are:

- Identification of the system for development
- Feasibility assessment
- Creation of project plan

The main output of this phase is the detailed project plan, which will explain the business requirements, highlight important milestones, identify milestone dates and determine the acceptance criteria that will be used to approve the system for deployment into production.

### 2. Solution Architecture

> In the solution architecture phase, the desired features and operations of the system are identified. This phase includes identifying things such as the required business rules, system architecture, process diagrams etc.

During this phase, the enterprise architects (who have the ultimate responsibility for delivering this step successfully), along with senior data engineers, start the high-level design of the software and systems required to deliver the documented business requirements. The exact technical details are communicated to stakeholders to ensure everyone is aligned. Other factors, such as potential risks, the technologies to be used, the team's skill set and any constraints, are also discussed. The best design approach is then selected.

The high-level design also defines all the components that need to be created and how each component will operate. This design is usually captured in the Design Specification Document (DSD).

The two primary activities involved in the design phase are as follows:
- Designing the IT infrastructure
- Designing the system model


As part of the design considerations, the proposed infrastructure should have solid foundations so that it avoids bugs, remains compatible with existing systems and delivers high performance. The organisation also creates user interaction samples, data models, and entity relationship diagrams (ERDs) as part of this step.

Successful completion of the solution architecture phase should include:
- Transformation of all requirements into detailed specifications covering all aspects of the system
- Assessment and planning for security risks
- Approval to proceed to the actual system development


### 3. Data Store Setup

> In this phase, actual system development and deployment of the data infrastructure begins. In a typical big data project, we need to first establish a _landing zone_ that'll be used in the next phase to ingest raw data.

One of the first components that needs to be prepared and deployed in a big data ecosystem is a central data repository that captures and stores all incoming raw data. This is often called a _data lake_. Unlike other common types of software development projects, data engineering projects must start by capturing and storing the data that will be used throughout the various lifecycle phases.

Using the technical details specified in the solution architecture phase, the necessary data storage infrastructure will be prepared and deployed. The required security rules, approvals and APIs are also put in place, and any other prerequisite activities are performed in this step.

Depending on the nature of incoming data, the appropriate data storage technology will be prepared.  

The most common types of industry data stores include:

- Hadoop HDFS
- Cloud storage (such as S3)
- SQL databases and enterprise data warehouses (EDW)
- NoSQL data stores (such as HBase, MongoDB and Cassandra)

Another important aspect of this activity is identifying and implementing all the integrations needed to prepare the environment for the data ingestion step, which comes next.

### 4. Data Ingestion

> Within the context of big data, data ingestion involves getting data out of source systems and ingesting it into a central data repository

Data ingestion is the transportation of raw data from various sources to a centralised storage repository where it can be persisted, accessed, used, and analysed by an organisation. Such a storage medium is typically a data lake, data warehouse, relational database, or a NoSQL data store.

In this step, the following information is identified:
- Each individual data source required
- The size of the generated data file/record
- The frequency of data generation and whether it's in batch or real-time
- The data format that will be ingested

Accordingly, an appropriate data pipeline needs to be created to be able to ingest data from each data source and move that data into the landing zone. It's common to use Kafka and Flume for these types of data movement operations.
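
The details identified for each source can be recorded in a small, structured description that the pipeline then reads. The following is a minimal sketch under stated assumptions: the `DataSource` fields, the `landing_path` helper and the `/data/landing` folder layout are hypothetical names chosen for illustration, not part of any specific tool.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    BATCH = "batch"
    REAL_TIME = "real-time"


@dataclass(frozen=True)
class DataSource:
    """Ingestion facts for one source system (field names are hypothetical)."""
    name: str
    record_size_bytes: int  # typical size of one generated file or record
    mode: Mode              # whether data arrives in batch or real time
    frequency: str          # how often data is generated, e.g. "hourly"
    data_format: str        # format that will be ingested, e.g. "json"


def landing_path(source: DataSource, landing_zone: str = "/data/landing") -> str:
    """Derive the landing-zone folder for a source (layout is an assumption)."""
    return f"{landing_zone}/{source.mode.value}/{source.name}/{source.data_format}"


mobile_app = DataSource(
    name="mobile_app_events",
    record_size_bytes=2_048,
    mode=Mode.REAL_TIME,
    frequency="continuous",
    data_format="json",
)
print(landing_path(mobile_app))
```

Keeping these facts in one place makes it easier to choose a suitable pipeline per source, for example a streaming pipeline for real-time sources and a scheduled job for batch sources.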

It's important to note here that the ingested data is usually raw data, in exactly the same form and shape in which it was produced. For instance, if a mobile application sends raw data in a JSON file, that exact file will be moved via the data pipeline into a folder in the data lake. Sometimes very minor updates are performed on the data while it's in motion before it arrives in the landing zone, but this is not common.
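
As a rough illustration of moving a raw file unchanged, the sketch below copies a JSON file into a dated landing-zone folder and confirms that the copy is byte-for-byte identical. It uses only the Python standard library and a temporary directory in place of a real data lake. The function names, the folder layout and the `mobile_app` source name are assumptions made for this example.

```python
import hashlib
import shutil
import tempfile
from datetime import date
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def ingest_raw_file(source_file: Path, landing_zone: Path, source_name: str) -> Path:
    """Copy a raw file into a dated landing-zone folder without modifying it."""
    target_dir = landing_zone / source_name / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)
    # Verify that the landed file is identical to what the source produced.
    if file_digest(source_file) != file_digest(target):
        raise IOError(f"Copy of {source_file} does not match the original")
    return target


# Demonstration with a temporary directory standing in for the data lake.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "events.json"
    source.write_text('{"user_id": 1, "event": "login"}')
    landed = ingest_raw_file(source, Path(tmp) / "landing", "mobile_app")
    landed_text = landed.read_text()
    print(landed.relative_to(tmp))
```

In practice this movement is usually handled by dedicated ingestion tooling rather than a hand-written copy, but the principle is the same: the file lands as it was produced.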

Nonetheless, a very high-level data quality and sanity check can be performed as part of this step _after_ the data lands in the data lake. This can be as simple as counting the number of files that arrive or checking their filename extensions.
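
A check of this kind could look like the following minimal sketch, which counts the files in a landing folder and flags unexpected extensions. The `sanity_check` function, its parameters and the sample filenames are hypothetical, and a temporary directory stands in for the landing zone.

```python
import tempfile
from pathlib import Path


def sanity_check(folder: Path, expected_files: int, allowed_suffixes: set[str]) -> list[str]:
    """Return a list of problems found in a landing folder (empty if none)."""
    files = sorted(p for p in folder.iterdir() if p.is_file())
    problems = []
    # Check 1: did the expected number of files arrive?
    if len(files) != expected_files:
        problems.append(f"expected {expected_files} files, found {len(files)}")
    # Check 2: does every file have an allowed extension?
    for path in files:
        if path.suffix.lower() not in allowed_suffixes:
            problems.append(f"unexpected file type: {path.name}")
    return problems


# Demonstration: two expected JSON files plus one stray text file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "events_1.json").write_text("{}")
    (folder / "events_2.json").write_text("{}")
    (folder / "notes.txt").write_text("stray file")
    problems = sanity_check(folder, expected_files=2, allowed_suffixes={".json"})

for problem in problems:
    print(problem)
```

Note that the check inspects only file counts and names. Validating the contents of the files belongs to the more detailed quality checks in the ETL/ELT phase.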

### 5. ETL/ELT

> ETL/ELT is the process of transforming and mapping data from its raw form into another format to make it more appropriate for a variety of downstream purposes such as analytics

Extract, transform and load (ETL) was the traditional approach used in relational database systems for decades. In this process, data is extracted from source systems, transformed using the appropriate schema required by the database, and then loaded into a database or a data warehouse where it'll be used for analytical purposes.

In the modern big data world, this approach has been somewhat modified due to the evolving nature of the incoming data. 

The approach currently used by enterprises is to:
- Extract all required data from source systems (or other data stores if required)
- Move and load all that raw data into a central repository, such as a Hadoop big data lake
- Perform the required transformation and cleaning tasks on the data _after_ it's loaded into the data lake

Accordingly, this new approach is called extract, load and transform, otherwise known as ELT. Additionally, big data processing for many companies has moved to the cloud. This means that rather than housing and storing data in-house, transformed data as well as real-time streaming data can all be pushed to the cloud. This gives companies greater flexibility and agility, simpler operations, and better reliability and security.
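
The ordering that distinguishes ELT can be sketched in a few lines. In the example below, an in-memory SQLite database stands in for the central repository purely for illustration; a real data lake would use different storage. The table names, field names and sample records are assumptions. The key point is that the raw payloads are loaded untouched, and the transformation runs afterwards against the loaded data.

```python
import json
import sqlite3

# Extract: raw payloads exactly as the (hypothetical) source system produced them.
RAW_EVENTS = [
    '{"user_id": 1, "amount": "19.99", "country": "au"}',
    '{"user_id": 2, "amount": "5.00", "country": "NZ"}',
]


def load_raw(conn: sqlite3.Connection, payloads: list[str]) -> None:
    """Load: store each payload as-is, with no parsing or cleaning."""
    conn.execute("CREATE TABLE raw_events (payload TEXT)")
    conn.executemany("INSERT INTO raw_events (payload) VALUES (?)", [(p,) for p in payloads])


def transform(conn: sqlite3.Connection) -> None:
    """Transform: read the already-loaded raw data and write a typed, cleaned table."""
    conn.execute("CREATE TABLE clean_events (user_id INTEGER, amount REAL, country TEXT)")
    for (payload,) in conn.execute("SELECT payload FROM raw_events").fetchall():
        record = json.loads(payload)
        conn.execute(
            "INSERT INTO clean_events VALUES (?, ?, ?)",
            (record["user_id"], float(record["amount"]), record["country"].upper()),
        )


conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_EVENTS)
transform(conn)

clean_rows = conn.execute(
    "SELECT user_id, amount, country FROM clean_events ORDER BY user_id"
).fetchall()
print(clean_rows)
```

Because the raw table is kept alongside the cleaned one, the transformation can be changed and rerun later without going back to the source systems.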

The ETL/ELT phase is where the majority of a data engineer's daily work takes place. Depending on the platforms and technologies used, code will be constantly created and tested to perform the various transformation and cleaning tasks necessary to prepare the data for consumption by stakeholders and other downstream systems. It's quite common to have several iterations of code creation and testing before moving on to the next step of the lifecycle.

Some main tasks in this phase include:
- Performing detailed data quality checks on the raw data to ensure it meets requirements
- Integrating the various types of raw data which arrived from different sources into a standardised format, such as Parquet
- Enriching the data with external data sources to increase its business value
- Performing data cleaning tasks, such as removing duplicates and handling missing or null values
- Applying a data model to the cleaned and transformed data (if required)
- Performing detailed quality engineering assessments such as integration testing, unit testing, and regression testing
- Checking the performance of the solution in lower environments (such as the development and user acceptance testing environments) before promoting it to run in actual production environments
- Ensuring stakeholders have access to the required data and that it meets their requirements
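
To make the cleaning tasks in the list above more concrete, here is a minimal sketch of removing duplicates and handling missing values using plain Python dictionaries. The `clean_records` function, its rules (drop rows that lack a required field, fill optional gaps with defaults) and the sample records are assumptions for illustration. Real projects typically apply rules like these with a data processing framework.

```python
def clean_records(records, required, defaults):
    """Drop rows missing required fields, fill optional gaps, remove exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Rows without a required value cannot be repaired, so they are skipped.
        if any(record.get(field) is None for field in required):
            continue
        # Missing or null optional values fall back to the supplied defaults.
        present = {key: value for key, value in record.items() if value is not None}
        filled = {**defaults, **present}
        # Use the sorted field/value pairs as a fingerprint to detect duplicates.
        fingerprint = tuple(sorted(filled.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(filled)
    return cleaned


raw = [
    {"id": 1, "region": "APAC", "score": 10},
    {"id": 1, "region": "APAC", "score": 10},    # exact duplicate
    {"id": 2, "region": None, "score": 7},       # missing optional value
    {"id": None, "region": "EMEA", "score": 3},  # missing required value
]

cleaned = clean_records(raw, required=["id"], defaults={"region": "UNKNOWN"})
for row in cleaned:
    print(row)
```

Whether a missing value should be filled, flagged or cause the row to be dropped is a business decision, so rules like these are normally agreed with stakeholders rather than chosen by the engineer alone.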

### 6. Solution Deployment

> During this step, the developed and tested solution is promoted to the production environment, where it will run on real data.

After thorough testing is performed, the next step of the process is to begin actual use of the newly developed solution. This involves adding all necessary files, dependencies, integrations, APIs and so on to the production server environment so that stakeholders can use the solution.

A number of conditions must normally be met before a new solution is deployed to the production environment:
- The solution has been thoroughly tested
- The business has confirmed that the solution meets the requirements captured in the BRD
- The technical infrastructure team has performed an assessment and approved the deployment of the new system. This approval usually confirms that existing systems running in production will not be negatively impacted.

Sometimes, after stakeholders start using the system, they have feedback on specific features, or they request new features. These changes can't be made directly in the production environment. Rather, they need to be documented, approved and raised as a _change request_. Change requests are requests for updates or changes to a system or code already running in production.

It should be noted that, in the past, solution deployment was generally handled by a separate team called the Deployment team. Nowadays, however, most organisations use continuous integration and continuous delivery (CI/CD) software, which automates a large part of the deployment process. CI/CD is used by the data engineers directly to promote code through the various environment levels in the organisation.

### 7. System Monitoring and Performance Tuning

> After a new solution is deployed, the system is continuously monitored to assess its performance, impact and behaviour

Once a new or updated version of a solution is released to production, there is usually a maintenance team that looks after any post-production issues. Oftentimes, that team will be the same team that originally developed the solution, as they are the ones familiar with the majority of the details on how the code behaves.

If an issue is encountered in the production environment, the development team is informed. Depending on how severe the issue is, it might require a _hot-fix_, which is created and implemented in a short period of time, or, if it's not very severe, it can wait until the next version of the software or system is deployed. This process is tracked in a ticketing system (such as Jira), where the bug is captured, details of the error are provided (along with any screenshots or log files) and the severity is determined. Any necessary enhancements, corrections and changes are made during this phase to ensure the system continues to work and stays aligned with business goals. The system also needs to be maintained and upgraded from time to time to adapt to future needs.
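
The choice between a hot-fix and waiting for the next release can be expressed as a simple rule based on the severity recorded in the ticket. The sketch below is illustrative only: the severity levels, the `HOTFIX_THRESHOLD` value and the `Ticket` fields are assumptions, and each organisation defines its own scale and thresholds.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Ticket:
    """A production issue as it might be captured in a ticketing system."""
    summary: str
    severity: Severity


# Assumption: issues rated HIGH or above are fixed immediately.
HOTFIX_THRESHOLD = Severity.HIGH


def triage(ticket: Ticket) -> str:
    """Decide whether an issue needs a hot-fix or can wait for the next release."""
    if ticket.severity >= HOTFIX_THRESHOLD:
        return "hot-fix"
    return "next-release"


print(triage(Ticket("Nightly load job fails", Severity.CRITICAL)))
print(triage(Ticket("Column label typo in report", Severity.LOW)))
```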

The three primary activities involved in the system monitoring phase are as follows:
- User support
- System maintenance
- System changes and adjustments

## Key Takeaways
- Data engineering is a process that has evolved from traditional software engineering practices
- In general, a technology lifecycle refers to a process for planning, creating, testing and deploying an information system to production environments. Data engineering follows a similar approach.
- The data engineering lifecycle consists of 7 steps, and it's an important process that enterprise companies follow in order to create and deploy data solutions to the production environment. These 7 steps include: 
    - Requirements gathering and project planning
    - Solution architecture
    - Data store setup
    - Data ingestion
    - ETL/ELT
    - Solution deployment
    - System monitoring and performance tuning
- A data engineer spends the majority of their time working on back-end ETL/ELT and data pipeline tasks
