<h1> Data Engineering  Lifecycle </h1>

## Overview of Data Engineering

> As with general software engineering, data engineering is a process that follows a series of well-defined steps.

In the previous section, we introduced the main (big) data engineering concepts and looked at the various tools available to organizations.  In this notebook, we'll take a closer look at the process of deploying data engineering solutions to production environments, and discuss some of the similarities and differences between the roles of the data engineer, the data scientist and the data (sometimes called business) analyst.

From a bird's eye view, the end goal of data engineering is to address a challenge that many organizations face: getting access to large volumes of clean, accurate, complete and well-labelled data, both to perform reliable analytics and to properly create and train data science models.

Data science teams are among the main stakeholders who rely on the output of a company's data engineering platform. Looking at it from this angle, data engineering is a pre-requisite to data science activities in large enterprises.

One way to visualize this is with the pyramid below, which portrays the data science hierarchy of needs.  In order for data scientists to properly create, train and test predictive models, a significant amount of back-end effort is required to collect, integrate and prepare the data infrastructure and files:

<p align="center">
  <img src="images/data-science-pyramid.png" width=600>
</p>

The pyramid consists of 6 sequential steps in total.  Starting from the bottom, the steps are:

1. Data Collection.
2. Data Movement and Storage.
3. Data Transformation and Exploration.
4. Data Aggregation and Labeling.
5. Data Training and Optimization.
6. AI and Deep Learning.

Data engineering encompasses steps 1, 2 and 3, while data science includes steps 4, 5 and 6.  The steps are normally done sequentially, and completing one step is usually a pre-requisite for the next step in this process.

Of course, depending on the size of an organization and the complexity of its data architecture, the roles of the data engineer and data scientist sometimes overlap. For example, a data engineer might implement some data aggregation and labeling (step #4), and a data scientist might perform some data movement (step #2) or transformation and cleaning (step #3).  In general, however, large global companies have mature systems and teams, so the roles are well-defined and rarely overlap.


## The Data Engineering Lifecycle

> A technology lifecycle refers to a process for planning, creating, testing and deploying an information system to production environments.

Data engineering is a core part of the data lifecycle. Data preparation and engineering tasks are commonly estimated to take up around 80% of the time in AI and machine learning projects. But what exactly does data engineering entail? Data engineering comprises all engineering and operational tasks required to make data available for analytics and data science purposes.

Now that we've seen where data engineering fits in the bigger picture, we'll take a closer look at the actual steps involved in a data engineering lifecycle.

At a high level, a typical data engineering lifecycle in large organizations includes the following 7 steps:

1. Requirements gathering and planning.
2. Solution architecture.
3. Deploying data stores.
4. Data ingestion.
5. ETL/ELT (gathering, importing, wrangling, querying, and analyzing data).
6. Solution deployment.
7. System monitoring and performance tuning.


<p align="center">
  <img src="images/data-eng-lifecycle3.png" width=500>
</p>



Depending on the complexity of the data platform and the defined business requirements, there can be some variation from the above steps, as this is not an exact science.  In general, however, the typical data engineering project goes through these phases in one way or another.  It's also important to understand that each step can go through several iterations before it's finalized.  Under the Agile Scrum project management model used in most data-oriented organizations, short incremental iterations with frequent feedback are preferred over long, complex phases.

Next, we'll explore each one of these phases in greater detail.

### 1. Requirements Gathering and Planning

> During this initial lifecycle phase, the enterprise architects, product owners and relevant experts meticulously collect precise requirements from the business.

In any major corporate project, a business unit will be tasked to fund and oversee the project from planning to completion.  The aim of this phase is to present a solution that is tailored to the needs of the business and fits the identified business requirements. Any ambiguities must be identified and addressed as early as possible to avoid future delays. All these details are captured in a document called the business requirements document (BRD).

Business requirements are a brief set of business functionalities that the system needs to provide to be successful. This phase does not define technical details such as the type of technology implemented in the system. A sample business requirement might look like “The system must track all the employees by their respective department, region, and designation.” This requirement says nothing about how the system will implement it; it only states what the system must do for the business.

During this phase, business requirements are captured and any potential risks are identified. This step can include a feasibility study, which lays out the strengths and weaknesses of the project to assess its overall viability.

The planning phase will determine project goals and establish a high-level plan for the intended project. Planning is, by definition, a fundamental and critical organizational phase. 

The three primary activities involved in the planning phase are as follows:

- Identification of the system for development
- Feasibility assessment
- Creation of project plan

The main output of this phase is the detailed project plan, which will explain the business requirements, highlight important milestones, identify milestone dates and determine the acceptance criteria that will be used to approve the system for deployment into production.

### 2. Solution Architecture

> In the solution architecture phase, the desired features and operations of the system are identified. This phase includes business rules, pseudo-code, screen layouts, and other necessary documentation. 

During this phase, the enterprise technical architects (who have the ultimate responsibility for delivering this step successfully), along with senior data engineers, start the high-level design of the software and systems required to deliver each business requirement. The technical details of the design are discussed with the stakeholders, and parameters such as risks, technologies to be used, capability of the team, project constraints, time and budget are reviewed. The best design approach is then selected for the product.

The high-level design also defines all the components that need to be developed, communications with third-party services, user flows and stakeholder communications, as well as the front-end representation and behaviour of each component. The design is usually kept in the Design Specification Document (DSD). Among other things, the project team will work through the core components, structure, processing, and procedures the system needs to reach the stated goal.

The two primary activities involved in the design phase are as follows:
- Designing the IT infrastructure
- Designing the system model


The IT infrastructure should have solid foundations to avoid any crash, malfunction, or reduction in performance.  In this phase, the specialists recommend the required clients and servers based on cost and time constraints, and assess the system’s technical feasibility. The organization also creates user interaction interfaces, data models, and entity relationship diagrams (ERDs) in this phase.
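
To make the idea of a data model more concrete, here is a minimal sketch of how the sample requirement from the previous phase (tracking employees by department, region and designation) might be modelled. The entity names, fields and sample values are hypothetical and chosen only for illustration; a real design would come out of the ERD work described above.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical entities for the sample requirement; not a prescribed schema.
@dataclass(frozen=True)
class Department:
    department_id: int
    name: str
    region: str


@dataclass(frozen=True)
class Employee:
    employee_id: int
    full_name: str
    designation: str
    department_id: int  # plays the role of a foreign key to Department


def employees_by_department(employees, departments):
    """Group employee names under a (region, department name) key."""
    dept_index = {d.department_id: d for d in departments}
    grouped = defaultdict(list)
    for emp in employees:
        dept = dept_index[emp.department_id]
        grouped[(dept.region, dept.name)].append(emp.full_name)
    return dict(grouped)


departments = [Department(1, "Finance", "EMEA"), Department(2, "Engineering", "APAC")]
employees = [
    Employee(100, "A. Rivera", "Analyst", 1),
    Employee(101, "B. Chen", "Data Engineer", 2),
    Employee(102, "C. Okafor", "Manager", 2),
]
report = employees_by_department(employees, departments)
```

Keeping the region on the department rather than on each employee is one possible design choice; whether it holds depends on the actual business rules.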


Successful completion of the solution architecture phase should comprise:
- Transformation of all requirements into detailed specifications covering all aspects of the system
- Assessment and planning for security risks
- Approval to progress to the actual system development and deployment


### 3. Deploying Data Stores

> In this phase, actual system development and deployment of the data infrastructure begins.  In big data projects, we need to first establish a landing zone that'll be used in the next phase to ingest raw data.

One of the first components that needs preparation and deployment in a big data ecosystem is a central data repository to capture and store all incoming raw data.  This is often called a data lake. Unlike other common types of software development projects, data engineering projects must start with capturing and storing the data which will be used throughout the various lifecycle phases.

Using the technical details specified in the solution architecture phase, the necessary data storage infrastructure will be prepared and deployed. Furthermore, the required security rules, approvals, APIs and any other pre-requisite activities are completed in this step.

Depending on the nature of incoming data, the appropriate data storage technology will be prepared.  As mentioned in the previous lesson, the most common types of industry data stores include:

- Hadoop HDFS
- Cloud storage (such as S3)
- SQL databases and enterprise data warehouses (EDW)
- NoSQL data stores (such as HBase, MongoDB and Cassandra)

Another important aspect of this activity is identifying and implementing all the necessary integration to prepare the environment for the data ingestion step which comes next. 
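
As a rough illustration of preparing a landing zone, the sketch below creates one date-partitioned folder per source system. The folder convention (`landing/<source>/<year>/<month>/<day>`), the source names and the use of a local temporary directory in place of HDFS or cloud storage are all assumptions made for the example.

```python
import tempfile
from datetime import date
from pathlib import Path


def landing_path(root: Path, source: str, day: date) -> Path:
    """Build a date-partitioned folder path for one source system."""
    # The layout below is an assumed convention, not a standard.
    return root / "landing" / source / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"


def prepare_landing_zone(root: Path, sources: list[str], day: date) -> list[Path]:
    """Create one partition folder per source and return the created paths."""
    created = []
    for source in sources:
        path = landing_path(root, source, day)
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


lake_root = Path(tempfile.mkdtemp())  # stand-in for the real data lake root
paths = prepare_landing_zone(lake_root, ["mobile_app", "crm"], date(2024, 1, 15))
```

Partitioning by arrival date is a common way to keep later ingestion checks and reprocessing simple, though the right layout depends on the storage technology chosen.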

### 4. Data Ingestion

> Within the context of big data, data ingestion involves getting data out of source systems and ingesting it into a data lake.

Data ingestion is the transportation of raw data from assorted sources to a storage medium where it can be persisted, accessed, used, and analyzed by an organization. The destination is typically a data lake, data warehouse, data mart, database, or a NoSQL document store.

In this step, the following information is identified:
- Each individual data source
- The size of the generated data file/record
- The frequency of data generation and whether it's in batch or real-time
- The data format that will be ingested

Accordingly, an appropriate data pipeline needs to be created to ingest data from each data source and move it into the landing zone.  It's common to use tools such as Kafka and Flume for these types of data movement operations.
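
The information identified in this step can be captured in a simple source inventory, which then drives the choice of pipeline. The sketch below is one hypothetical way to do that; the field names, the two pipeline labels and the sample figures are assumptions, not part of any tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    avg_record_bytes: int   # size of one generated record or file
    records_per_day: int    # frequency of data generation
    mode: str               # "batch" or "real-time"
    data_format: str        # e.g. "json", "csv"

    def daily_volume_mb(self) -> float:
        """Estimated daily volume in megabytes (1 MB = 1,000,000 bytes)."""
        return self.avg_record_bytes * self.records_per_day / 1_000_000


def pipeline_kind(source: DataSource) -> str:
    """Pick a pipeline style from the source's delivery mode."""
    if source.mode == "real-time":
        return "streaming"
    if source.mode == "batch":
        return "scheduled-batch"
    raise ValueError(f"Unknown mode for {source.name}: {source.mode!r}")


# Illustrative inventory entries with made-up numbers.
mobile_app = DataSource("mobile_app", 2_000, 500_000, "real-time", "json")
crm_export = DataSource("crm_export", 50_000_000, 1, "batch", "csv")

plan = {s.name: pipeline_kind(s) for s in (mobile_app, crm_export)}
```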

It's important to note here that the ingested data is usually raw data, exactly in the same form and shape that it was produced in.  For instance, if a mobile application sends raw data in a JSON file, the exact file will be moved via the data pipeline and into a folder in the data lake.  Sometimes very minor updates are performed on the data while it's in-motion before arriving in the landing zone, but this is not common.
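
The "move it exactly as produced" idea can be sketched as a plain file copy followed by a checksum comparison. This is only an illustration using local folders; the function names and the sample file are hypothetical, and a real pipeline would use the ingestion tooling chosen for the platform.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def land_raw_file(source_file: Path, landing_dir: Path) -> Path:
    """Copy a raw file into the landing folder without modifying it."""
    landing_dir.mkdir(parents=True, exist_ok=True)
    target = landing_dir / source_file.name
    shutil.copy2(source_file, target)
    # Verify the landed copy is byte-for-byte identical to the source.
    if sha256_of(source_file) != sha256_of(target):
        raise IOError(f"Checksum mismatch while landing {source_file.name}")
    return target


workdir = Path(tempfile.mkdtemp())  # stand-in for source system and data lake
raw = workdir / "events.json"
raw.write_text('{"user_id": 7, "event": "login"}', encoding="utf-8")
landed = land_raw_file(raw, workdir / "landing" / "mobile_app")
```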

Nonetheless, what could occur as part of this step is a very high-level data quality and sanity check _after_ the data lands in the big data lake. This can be as simple as counting the number of files arriving, checking for filename extensions etc.
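
A high-level check of this kind might look like the following sketch, which counts the files that arrived and flags unexpected filename extensions. The expected count, the allowed extensions and the demo files are assumptions for the example.

```python
import tempfile
from pathlib import Path


def sanity_check(landing_dir: Path, expected_count: int, allowed_suffixes: set[str]) -> list[str]:
    """Return a list of problems found; an empty list means the check passed."""
    files = [p for p in landing_dir.iterdir() if p.is_file()]
    problems = []
    if len(files) != expected_count:
        problems.append(f"expected {expected_count} files, found {len(files)}")
    for path in sorted(files):
        if path.suffix.lower() not in allowed_suffixes:
            problems.append(f"unexpected extension: {path.name}")
    return problems


# Demo folder standing in for a landing zone partition.
demo_dir = Path(tempfile.mkdtemp())
for name in ("a.json", "b.json", "notes.txt"):
    (demo_dir / name).write_text("{}", encoding="utf-8")

issues = sanity_check(demo_dir, expected_count=3, allowed_suffixes={".json"})
```

Note that the check looks only at file counts and names, not at file contents; detailed data quality checks belong to the next phase.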

### 5. ETL/ELT

> ETL/ELT is the process of transforming and mapping data from its raw form into another format to make it more appropriate for a variety of downstream purposes such as analytics.

Extract, transform and load (ETL) was the traditional approach used in relational database systems for decades.  In this process, data is extracted from source systems, transformed to fit the schema required by the database, and then loaded into database or data warehouse tables where it'll be used for analytical purposes.

In the modern big data world, this approach has been somewhat modified due to the evolving nature of the incoming data.  The approach currently is to:
- Extract all required data from source systems (or other data stores if required)
- Move and load all that raw data into a central repository such as a data lake
- Perform the required transformation and cleaning tasks

Accordingly, this new approach is now called extract, load then transform, otherwise known as ELT. Additionally, big data processing for many companies has moved to the cloud. This means that rather than housing and storing data in-house, companies can push transformed data as well as real-time streaming data to the cloud. This can give them more flexibility and agility, simpler operations, and better reliability and security.
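
The extract, load, then transform order can be sketched with Python's built-in `sqlite3` module, using an in-memory database as a stand-in for the central repository. The table names, columns and sample rows are invented for the example; the point is only that raw data is loaded untouched first and cleaned afterwards, inside the repository.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (all values raw text).
raw_rows = [
    ("1", " Alice ", "2024-01-05", "120.50"),
    ("2", "BOB", "2024-01-06", "80"),
    ("2", "BOB", "2024-01-06", "80"),  # same record delivered twice
]

conn = sqlite3.connect(":memory:")  # stand-in for the data lake or warehouse

# Load: store the data unchanged in a raw table.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, order_date TEXT, amount TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: clean, type and de-duplicate the data after it has been loaded.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT DISTINCT
        CAST(order_id AS INTEGER) AS order_id,
        LOWER(TRIM(customer))     AS customer,
        order_date,
        CAST(amount AS REAL)      AS amount
    FROM raw_orders
    """
)

clean_rows = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Because the raw table is kept as delivered, the transformation can be changed and re-run later without going back to the source systems.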

This phase is where the majority of a data engineer's daily work occurs.  Depending on the platforms and technologies used, code will be constantly created and tested to perform the various transformation and cleaning tasks necessary to prepare the data for consumption by stakeholders such as data scientists and other downstream systems. It's quite common to have several iterations of code creation and testing before moving on to the next step of the lifecycle.

Some of the main tasks in this phase include:
- Performing detailed data quality checks on the raw data to ensure it meets requirements.
- Integrating the various types of raw data which arrived from different sources into a standardized format such as Parquet.
- Enriching the data with external data sources to increase its business value.
- Performing data cleaning tasks such as removing duplicates and handling missing or null values.
- Applying a data model to the cleaned and transformed data (if required).
- Performing detailed quality engineering assessments such as integration testing, unit testing, and regression testing.
- Testing the performance of the solution in lower environments (such as the development and user acceptance testing environments) before promoting it to run in actual production environments.
- Ensuring stakeholders have access to the required data and that it meets their requirements.
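
Two of the cleaning tasks listed above, removing duplicates and handling missing values, can be sketched in plain Python as follows. The record layout, the choice of `user_id` as a required field and the `"unknown"` default are assumptions for illustration; at scale, the same logic would typically run in a framework such as Spark.

```python
def clean_records(records, required_fields, defaults):
    """Drop rows missing required fields, fill optional gaps, remove exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Rows without a required value cannot be repaired here, so skip them.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # Fill missing optional values from the defaults.
        present = {k: v for k, v in record.items() if v not in (None, "")}
        filled = {**defaults, **present}
        # Use the sorted items as a hashable fingerprint to detect duplicates.
        key = tuple(sorted(filled.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(filled)
    return cleaned


raw_records = [
    {"user_id": 1, "country": "CA", "plan": "pro"},
    {"user_id": 1, "country": "CA", "plan": "pro"},      # exact duplicate
    {"user_id": 2, "country": None, "plan": "free"},     # missing optional field
    {"user_id": None, "country": "US", "plan": "free"},  # missing required field
]

cleaned = clean_records(raw_records, required_fields=["user_id"], defaults={"country": "unknown"})
```

Whether to drop, default or flag incomplete rows is a business decision; the rule used here is only one option.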

### 6. Solution Deployment

> During this step, the developed and tested solution is promoted to the production environment, where it will run on real data.

After thorough testing is performed, the next step of the process is to begin actual use of the newly developed solution. This involves adding all necessary files, dependencies, integrations, APIs etc. to the actual production server environment so that the solution is available for stakeholders to use.

A number of conditions normally need to be met before a new solution is deployed to the production environment:
- The solution must have already been thoroughly tested.
- The business has approved that the solution meets the initial requirements as outlined in the BRD.
- The technical infrastructure team has performed an assessment and approved the deployment, confirming that it won't negatively impact existing systems.
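
These pre-deployment conditions can be thought of as gates that must all be open before promotion. The sketch below is a hypothetical illustration of that idea; the class, field and gate names are invented, and in practice such checks are usually enforced by the organization's release tooling.

```python
from dataclasses import dataclass


@dataclass
class ReleaseCandidate:
    name: str
    tests_passed: bool
    business_approved: bool         # sign-off against the BRD
    infrastructure_approved: bool   # sign-off from the infrastructure team


def blocking_gates(candidate: ReleaseCandidate) -> list[str]:
    """Return the names of the gates that still block deployment."""
    gates = {
        "testing": candidate.tests_passed,
        "business approval": candidate.business_approved,
        "infrastructure approval": candidate.infrastructure_approved,
    }
    return [gate for gate, passed in gates.items() if not passed]


candidate = ReleaseCandidate(
    name="orders-pipeline",
    tests_passed=True,
    business_approved=True,
    infrastructure_approved=False,
)
blocked_by = blocking_gates(candidate)
ready = not blocked_by
```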

Sometimes, once the stakeholders start using the system, they have feedback regarding some of the features or they request new features.  These new requests can't be applied directly to the production environment.  Rather, they need to be documented, approved and added to something called a __change request__. Change requests are requests for updates or changes that are to be applied to a system or code already running in production.

It should be noted that, in the past, solution deployment was generally handled by a separate team called the Deployment team.  Nowadays, however, most organizations use continuous integration and continuous delivery (CI/CD) software, which automates a large part of the deployment process.  CI/CD is used by the data engineers directly to promote code through the various environment levels in the organization.

### 7. System Monitoring and Performance Tuning

> After a new solution is deployed, the system is continuously monitored to assess its performance, impact and behavior.

Once a version of the software is released to production, there is usually a maintenance team that looks after any post-production issues. Oftentimes, that team will be the same team that originally developed the solution, as they are the ones most familiar with the details and with how the code behaves.

If an issue is encountered in production, the development team is informed. Depending on how severe the issue is, it might require a hot-fix, which is created and shipped in a short period of time, or, if it's not very severe, it can wait until the next version of the software/system is deployed.  This process is tracked in ticketing software (such as JIRA), where the bug is captured, details regarding the error are provided (along with any screenshots or log files) and the impact severity is determined. Any necessary enhancements, corrections, and changes are made during this phase to ensure the system continues to work and remains updated to meet business goals. It's necessary to maintain and upgrade the system from time to time to adapt to future needs.
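
The hot-fix versus next-release decision can be sketched as a simple routing rule on ticket severity. The severity levels, the threshold and the ticket fields below are assumptions for illustration and do not reflect the data model of JIRA or any other ticketing product.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Ticket:
    summary: str
    severity: Severity
    attachments: list[str] = field(default_factory=list)  # e.g. logs, screenshots


def route(ticket: Ticket, hotfix_threshold: Severity = Severity.HIGH) -> str:
    """Send severe issues to a hot-fix and everything else to the next release."""
    return "hot-fix" if ticket.severity >= hotfix_threshold else "next-release"


tickets = [
    Ticket("Nightly load writes zero rows", Severity.CRITICAL, ["job.log"]),
    Ticket("Column label typo in report", Severity.LOW),
]
routing = {t.summary: route(t) for t in tickets}
```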

The three primary activities involved in the system monitoring phase are as follows:
- Support the system users
- System maintenance
- System changes and adjustment

## Data Engineer vs. Data Scientist vs. Data Analyst

> Data engineers design and build pipelines, systems and frameworks that ingest, transport, transform and clean raw data into useful information.
<p></p>

> Data scientists are specialists who leverage the systems put in place by the data engineers and the data stored within those systems to build, train and deploy predictive models.
<p></p>

> Data analysts are less technical and more business-facing professionals.  They explore, interact with and analyze data to gain insights and present the findings to business leaders/executives with the goal of improved corporate decision making.
<p></p>

Many of us don't have a clear picture of how the data engineer, data scientist and data analyst roles differ from one another.  One of the main sources of this confusion is that there really isn't a universally accepted, clear-cut definition for each role.  Oftentimes, we read job descriptions titled data engineer and find data analyst responsibilities within them.  Similarly, we can find data science job descriptions that include data engineering tasks.  Another source of this lack of clarity is that the skillsets required for these roles overlap with one another.  Below is a diagram that gives a high-level view of how the skills can overlap:
 
<p align="center">
  <img src="images/eng-sci-analyst2.png" width=600>
</p>

There is no denying that all 3 roles are quickly gaining popularity and that they are in high demand. Data engineering roles, for instance, are seeing a rapid increase in job postings.  One of the main reasons for that is the exponential growth in data. According to one report by DICE, data engineer was the fastest growing role in 2019, with 50% year-over-year (YOY) growth.

To highlight the differences in more detail, we can view the data engineer as the "back-end" data professional who does the heavy-lifting of data ingestion, transformation and cleaning.  This helps prepare raw data into more useful information.

The useful information will then be used by:
- Business Intelligence and Analytics teams
- Data Science Teams
- Data Analysts

Based on this understanding, data scientists are next in the sequence of data flow steps.  They are experts in statistics and modelling, and they leverage the data and environment already prepared by the data engineers.  Data analysts (sometimes also called business analysts) can either use the output of data science teams and present it to the business, or leverage the data provided by the data engineering teams to perform their own analytical work.



## Skills to become successful in Data Engineering

> Merging of some of the Software Development skillsets with Database skillsets led to the introduction of the Data Engineer.

Now that we've discussed some of the differences between a data engineer, data scientist and data analyst, we'll take a look at the skills that companies need engineers to have in order to be able to join their teams and contribute to their data frameworks.

In the past, companies required database experts as their main data engineers. Database experts have been around as a profession for a few decades. These roles started being introduced during the late 70’s and early 80’s, when database systems started becoming a more mainstream technology.  Some of the roles we might have seen are similar to:

- Database Administrator
- SQL Developer
- Data Modeler

Eventually, in the 90’s and early 2000’s, two major technology changes occurred:

- Internet
- Object Oriented Programming

Starting around 2010 and afterwards, we started hearing terms like:
- Big Data (Hadoop, Spark)
- Batch Data Processing
- Real Time Data Processing
- NoSQL

Hence, a Big Data Engineer can be viewed as the evolution of the previous role of a Database Developer, with a more diverse skillset which ideally includes the following:

- Coding (Python, Java, Scala..)
- Automation and Scripting (Linux)
- Relational Database Systems (SQL)
- Non-Relational Database Systems 
- ETL 
- System Architecture
- Cloud Computing (AWS, Microsoft…)
- Big Data Frameworks (Hadoop, Spark…)
- Agile Project Management (Scrum)
- DevOps, MLOps
- Data Visualization/Presentation (Tableau)

Below is a diagram showing the important skills that data engineer job descriptions are looking for:

<p align="center">
  <img src="images/skills.png" width=600>
</p>


## Key Takeaways
- Data engineering is an important pre-requisite for data science and artificial intelligence activities in enterprise organizations, as it creates the data infrastructure needed for predictive model development, testing and deployment.
- The role of the data engineer, although sometimes overlapping with that of the data scientist and data analyst, is actually somewhat unique.  The key differences are that a data engineer is expected to perform the bulk of the work on the back-end system data using software development and data transformation expertise, while the data scientist is a user of the systems that data engineers create and a data analyst is more business-facing.
- The data engineering lifecycle consists of 7 steps, and it's an important process that enterprise companies follow in order to create and deploy data solutions to the production environment.
- The expertise required to become successful as a data engineer is vast and diverse, and covers several subject matter areas with the top being software development, database, cloud and big data.
- The role of the data engineer evolved as a hybrid between the role of a database professional and that of a software developer. 
