#  Data Engineers vs Scientists vs Analysts

> The work done by data engineers is closely integrated and sometimes overlaps with tasks performed by data scientists and data analysts. Accordingly, the lines between these roles can be confusing.

Job postings titled "Data Scientist" often include many tasks that one would consider part of the data engineering role. This confusion arises because the scopes of the two roles overlap in many areas.

At a high level, data engineers prepare data that will then be used by other teams, such as data scientists. Data engineers enable reliable analytics, and unlock the capability to properly create and train data science models.

Data science teams, on the other hand, are one of the main stakeholders who rely on the output of a company's data engineering team. Looking at it from this angle, data engineering is a prerequisite to data science activities, especially in large enterprises.

One way to visualise this relationship is the pyramid below, which portrays the _data science hierarchy of needs_.

In order for data scientists to properly create, train and test predictive models, a significant amount of back-end effort is required to collect, integrate and prepare the data infrastructure and files:

<p align="center">
  <img src="images/data-science-pyramid2.png" width=600>
  <figcaption align="center"><cite>Data Science Hierarchy of Needs</cite></figcaption>
</p>

The pyramid consists of a total of 6 sequential steps. Starting from the bottom, the steps are:

1. Data Collection
2. Data Movement and Storage
3. Data Transformation and Exploration
4. Data Aggregation and Labeling
5. Data Training and Optimisation
6. AI and Deep Learning

Data engineering encompasses steps #1, #2 and #3, while data science includes steps #4, #5 and #6. These steps are normally done sequentially, and completing one step is usually a prerequisite for the next step in this process.
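
The split described above can be written down as a small lookup. The following is only an illustrative sketch: the names `Role`, `PYRAMID_STEPS`, `owner_of` and `prerequisites` are hypothetical and not part of any library. It encodes the six steps in order, assigns steps 1 to 3 to data engineering and steps 4 to 6 to data science, and treats every lower step as a prerequisite of the ones above it.

```python
from enum import Enum


class Role(Enum):
    DATA_ENGINEERING = "data engineering"
    DATA_SCIENCE = "data science"


# The six pyramid steps, ordered from the bottom of the pyramid to the top.
PYRAMID_STEPS = [
    "Data Collection",
    "Data Movement and Storage",
    "Data Transformation and Exploration",
    "Data Aggregation and Labeling",
    "Data Training and Optimisation",
    "AI and Deep Learning",
]


def owner_of(step_number: int) -> Role:
    """Return the team that typically owns a 1-based pyramid step."""
    if not 1 <= step_number <= len(PYRAMID_STEPS):
        raise ValueError(f"step must be between 1 and {len(PYRAMID_STEPS)}")
    # Steps 1-3 fall under data engineering, steps 4-6 under data science.
    return Role.DATA_ENGINEERING if step_number <= 3 else Role.DATA_SCIENCE


def prerequisites(step_number: int) -> list[str]:
    """List the steps that are normally completed before the given step."""
    owner_of(step_number)  # reuse the range check above
    return PYRAMID_STEPS[: step_number - 1]


if __name__ == "__main__":
    for number, name in enumerate(PYRAMID_STEPS, start=1):
        print(f"{number}. {name}: {owner_of(number).value}")
```

For example, `prerequisites(4)` returns the three data engineering steps, which is the sense in which data engineering comes before data science in this model.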

Since these roles are not always clear-cut in all organisations, we could see a data engineer implementing some data aggregation and labeling (step #4), although that would normally be the work of a data scientist. On the other hand, we can have data scientists performing some data movement (step #2) and data transformation and cleaning (step #3), which typically fall in the realm of a data engineer. In general, however, large global companies have mature teams, and the roles for each team are well-defined. In contrast, smaller companies and start-ups are where we will see a greater degree of overlap between the roles.

## Data Engineer vs Data Scientist vs Data Analyst

> Data engineers design and build pipelines, systems and frameworks that ingest, transport, transform, and clean raw data to help turn it into useful information.
<p></p>

> Data scientists are specialists who leverage the systems put in place by the data engineers and the data stored within those systems to build, train and deploy predictive models.
<p></p>

> Data analysts are less technical and more business-facing professionals. They explore, interact and analyse data to gain insights and present the findings to business leaders/executives with the goal of improved corporate decision-making.
<p></p>

Many of us don't have a clear picture of how the data engineer, data scientist and data analyst roles differ from one another. One of the main sources of this confusion is that there isn't a universally accepted, clear-cut definition for each role. Oftentimes, we read job descriptions titled "Data Engineer" and find data analyst responsibilities within them. Similarly, we can find data science job descriptions that include data engineering tasks. Another source of this lack of clarity is that the skill-sets required for these roles overlap with one another.

To help clarify these 3 roles, look at the diagram below:
 
<p align="center">
  <img src="images/eng-sci-analyst3.png" width=600>
  <figcaption align="center"><cite>Data Engineer vs Data Scientist vs Data Analyst</cite></figcaption>
</p>

There is no denying that all 3 roles are quickly gaining popularity and are in high demand. Data engineering roles in particular are seeing a rapid increase in job postings, and one of the main reasons is the exponential growth in data. According to one report by DICE, data engineer was the fastest-growing role in 2019, with 50% YOY growth.

To highlight the differences in more detail, we can view the data engineer as the "back-end" data professional who does the heavy lifting of data ingestion, transformation and cleaning. This work turns raw data into more useful information.

The useful information will then be used by:
- Business Intelligence and Analytics teams
- Data Science Teams
- Data Analysts

Based on this understanding, data scientists depend on data engineers in order to be able to do their work. Data scientists are experts in statistics and modelling, and they leverage the data and environment already prepared by the data engineers. Data analysts (sometimes also called business analysts) can either use the output of data science teams and present it to the business, or leverage the data provided by the data engineering teams to perform their own analytical work. In either case, both teams require data engineering to be performed first.
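
To make this hand-off more concrete, here is a minimal sketch of a pipeline that ingests, cleans and transforms raw data before an analyst-style aggregation uses it. It relies only on the Python standard library. The sample data and the function names (`ingest`, `clean_and_transform`, `revenue_by_region`) are hypothetical and chosen purely for illustration; a real pipeline would read from source systems and usually write to a database or data lake.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export; in practice this would come from a source system.
RAW_ORDERS_CSV = """order_id,region,amount
1, north ,100.50
2,South,not_available
3,north,49.50
4,,20.00
5,south,30.00
"""


def ingest(raw_text: str) -> list[dict]:
    """Ingestion: read raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def clean_and_transform(rows: list[dict]) -> list[dict]:
    """Cleaning and transformation: normalise regions, parse amounts, drop bad rows."""
    cleaned = []
    for row in rows:
        region = row["region"].strip().lower()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        if not region:
            continue  # skip rows that have no region
        cleaned.append(
            {"order_id": int(row["order_id"]), "region": region, "amount": amount}
        )
    return cleaned


def revenue_by_region(rows: list[dict]) -> dict[str, float]:
    """Analyst-side step: aggregate the prepared data for reporting."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    prepared = clean_and_transform(ingest(RAW_ORDERS_CSV))
    print(revenue_by_region(prepared))
```

The first two functions stand in for the data engineering work, while `revenue_by_region` stands in for a downstream consumer. The aggregation is only trustworthy because the malformed rows were handled upstream.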


## Skills to become successful in Data Engineering

> The merging of some software development skill-sets with database skill-sets led to the introduction of the data engineer role.

Now that we've discussed some differences between a data engineer, data scientist and data analyst, we'll take a look at the skills that companies need engineers to have in order to be able to join their teams and contribute to their data frameworks.

In the past, companies required database experts as their main back-end engineers. Database experts have been around as a profession for a few decades. These roles started being introduced during the late 70’s and early 80’s, when database systems started becoming a more mainstream technology. At the time, some roles we might have seen are similar to:

- Database Administrator
- SQL Developer
- Data Modeler

Eventually, in the 90’s and early 2000’s, two major technology changes occurred:

- The Internet
- Object-oriented Programming (OOP)

From around 2010 onwards, we started hearing terms like:
- Big Data (Hadoop, Spark)
- Batch Data Processing
- Real Time Data Processing
- NoSQL

Hence, a (big) data engineer can be viewed as the evolution of the previous role of a database developer, with a more diverse skill-set which ideally includes the following:

- Coding (Python, Java, Scala…)
- Automation and scripting (Linux)
- Relational database systems (SQL)
- Non-relational database systems 
- ETL/ELT
- System architecture
- Cloud computing (AWS, Microsoft Azure…)
- Big data frameworks (Hadoop, Spark…)
- Agile project management (Scrum)
- DevOps, MLOps
- Data visualisation (Tableau)
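
Several of these skills (coding, SQL and ETL/ELT) tend to show up together in day-to-day work. As a rough illustration, the sketch below contrasts the two approaches using Python's built-in `sqlite3` module: ETL transforms the data in application code before loading it, while ELT loads the raw data first and transforms it inside the database with SQL. The table names, sample records and helper functions are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical raw records: (name, signup_year stored as text).
RAW_USERS = [(" Alice ", "2021"), ("BOB", "2022"), ("carol", "2022")]


def run_etl(conn: sqlite3.Connection) -> None:
    """ETL: transform in Python first, then load the cleaned rows."""
    conn.execute("CREATE TABLE users_etl (name TEXT, signup_year INTEGER)")
    cleaned = [(name.strip().lower(), int(year)) for name, year in RAW_USERS]
    conn.executemany("INSERT INTO users_etl VALUES (?, ?)", cleaned)


def run_elt(conn: sqlite3.Connection) -> None:
    """ELT: load the raw rows as-is, then transform them in the database with SQL."""
    conn.execute("CREATE TABLE users_raw (name TEXT, signup_year TEXT)")
    conn.executemany("INSERT INTO users_raw VALUES (?, ?)", RAW_USERS)
    conn.execute(
        """
        CREATE TABLE users_elt AS
        SELECT lower(trim(name)) AS name,
               CAST(signup_year AS INTEGER) AS signup_year
        FROM users_raw
        """
    )


def fetch(conn: sqlite3.Connection, table: str) -> list[tuple]:
    # Table names are fixed within this sketch, so formatting them in is safe here.
    query = f"SELECT name, signup_year FROM {table} ORDER BY name"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_etl(connection)
    run_elt(connection)
    print(fetch(connection, "users_etl"))
    print(fetch(connection, "users_elt"))
```

Both routes end with the same cleaned table. The difference is where the transformation runs, which in practice influences tooling choices and how much raw data is kept around.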

Below is a diagram showing the important skills that data engineer job descriptions are currently looking for:

<p align="center">
  <img src="images/skills.png" width=600>
  <figcaption align="center"><cite>Top Technology Skills Required by Data Engineers</cite></figcaption>
</p>

# Key Takeaways

- The role of the data engineer, although sometimes overlapping with that of the data scientist and data analyst, is actually unique. 
- The role of the data engineer evolved over time, and can be seen as a hybrid between the traditional role of a database professional and that of a software developer.
- The key difference between the data engineer and the data scientist is that the data engineer is expected to perform the bulk of the back-end work on data systems, using software development and data transformation expertise, while the data scientist is a user of the systems that data engineers create.
- Data analysts are more business-facing and normally don't need to use programming skills. On the other hand, they must have excellent data interpretation and presentation skills.
- Data engineering is an important prerequisite for data science and artificial intelligence activities in enterprises. This is because data engineering creates the required data infrastructure needed for predictive model development, testing and deployment.
- The expertise required to become successful as a data engineer is vast and diverse, and covers several subject-matter areas, the top ones being software development, databases, cloud and big data.
