# Data Transformation

## What is Data Transformation?

> Transformation is the process of converting data from one state to another

To turn data into more useful information, we may need to clean it, compute new metrics, filter it down to the relevant records, or manipulate it in some other way. We call this _data transformation_.

Examples of transformation processes include:
- Data integration (combining different datasets together)
- Data ingestion/migration (moving data from one location to another)
- Data warehousing (storing data in a warehouse)
- Data wrangling (cleaning and organizing the data)


## Why Transform Data?

There are a number of reasons why an organisation might want to transform its data.

Raw (source) data is often:

- Partly irrelevant: It contains both relevant and irrelevant data
- Inaccurate: It contains incorrectly entered information or missing values
- Repetitive: It can contain duplicates of the same data, or of data that we already have

The transformations that organisations apply fall into several categories:

- Constructive transformations: adding, copying, and replicating data
- Destructive transformations: deleting fields or specific records
- Aesthetic transformations: standardising salutations or street names
- Structural transformations: renaming, moving, and combining columns in a database

Accordingly, proper data checking, cleaning and transformation activities are required before organisations can unlock the value stored in this data.
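
The four categories above can be illustrated on a handful of records. The following is a minimal sketch, not a prescribed method: the record layout, the `SALUTATIONS` lookup, and the `crm_export` label are all hypothetical names chosen for illustration.

```python
# Hypothetical customer records; field names are illustrative only.
records = [
    {"id": 1, "title": "mr", "first": "Ali", "last": "Khan", "fax": "n/a"},
    {"id": 2, "title": "MRS.", "first": "Sara", "last": "Lee", "fax": ""},
    {"id": 3, "title": "Dr", "first": "", "last": "", "fax": ""},
]

# Hypothetical lookup used to standardise salutations.
SALUTATIONS = {"mr": "Mr", "mrs": "Mrs", "dr": "Dr"}


def transform(rows):
    """Apply one transformation of each category to every row."""
    result = []
    for row in rows:
        # Destructive: drop records that have no name at all.
        if not row["first"] and not row["last"]:
            continue
        new_row = dict(row)  # copy, so the source rows stay untouched
        # Destructive: delete a field we no longer need.
        del new_row["fax"]
        # Aesthetic: standardise the salutation.
        key = new_row["title"].strip(".").lower()
        new_row["title"] = SALUTATIONS.get(key, new_row["title"])
        # Structural: combine two columns into one.
        new_row["full_name"] = f"{new_row.pop('first')} {new_row.pop('last')}"
        # Constructive: add a new field.
        new_row["source"] = "crm_export"
        result.append(new_row)
    return result


cleaned = transform(records)
```

Copying each row before modifying it keeps the raw records available, which matters if a later step needs to re-run the transformation with different rules.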


## Steps for Data Transformation

> Data transformation typically includes two primary stages: data discovery and data mapping.

### 1. Data Discovery

The aim of data discovery is to clarify the format of the data, and the types of transformations required.

It consists of:
- Exploring the data to identify the sources, data types, and their locations
- Determining the structure and the data transformations that need to occur

Some questions to ask during this step include:
- In structured data, what do the columns and rows look like?
- In unstructured files, what does the data look like, and how is it organised (for example, is it nested JSON)?
- What kind of information do the different datasets contain?
- How does the information in one data source relate to another source? Are there common fields we can use to join the files?
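
Some of these questions can be answered with a small profiling script. The sketch below uses only the Python standard library and assumes two hypothetical CSV extracts (`ORDERS_CSV` and `CUSTOMERS_CSV`); it reports an inferred type and a missing-value count per column, then lists the fields the two datasets share.

```python
import csv
import io

# Hypothetical sample extracts; in practice these would be files or tables.
ORDERS_CSV = """order_id,customer_id,amount,order_date
1,10,25.50,2023-01-05
2,11,,2023-01-06
3,10,12.00,2023-01-09
"""
CUSTOMERS_CSV = """customer_id,name,country
10,Ali,AE
11,Sara,UK
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    present = [v for v in values if v != ""]
    if not present:
        return "unknown"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def profile(rows):
    """Return {column: (inferred type, number of missing values)}."""
    columns = rows[0].keys() if rows else []
    return {
        col: (infer_type([r[col] for r in rows]),
              sum(1 for r in rows if r[col] == ""))
        for col in columns
    }


orders = read_rows(ORDERS_CSV)
customers = read_rows(CUSTOMERS_CSV)
orders_profile = profile(orders)

# Fields present in both datasets are candidates for joining them.
common_fields = set(orders[0]) & set(customers[0])
```

Here `common_fields` contains only `customer_id`, which suggests it as the join key, and the profile shows one missing `amount` that a later cleaning step would need to handle.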

### 2. Data Mapping

The next step is data mapping. Its aim is to define which data transformations are required, and then implement them.

It consists of:
- Defining how individual fields are mapped, modified, joined, filtered, and aggregated to fit the target data model
- Implementing these transformations
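
One possible way to keep the definition separate from the implementation is to express the mapping as data. The sketch below is illustrative only: the source field names (`CustID`, `Name`, `Country`), the target model, and the filter rule are assumptions, not part of any particular tool.

```python
# Hypothetical mapping: target field -> (source field, conversion function).
FIELD_MAP = {
    "customer_id": ("CustID", int),
    "full_name": ("Name", str.strip),
    "country": ("Country", str.upper),
}


def apply_mapping(source_row, field_map=FIELD_MAP):
    """Build one target record from one source record."""
    return {
        target: convert(source_row[source])
        for target, (source, convert) in field_map.items()
    }


def keep(row):
    """Filter rule (an assumption for this sketch): keep only UK customers."""
    return row["country"] == "UK"


source_rows = [
    {"CustID": "10", "Name": " Ali ", "Country": "ae"},
    {"CustID": "11", "Name": "Sara", "Country": "uk"},
]

# Map every source row to the target model, then apply the filter.
target_rows = [row for row in map(apply_mapping, source_rows) if keep(row)]
```

Because `FIELD_MAP` is plain data, it can be reviewed during the definition step and changed without touching the code that applies it.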

There are several strategies for doing this: 

#### Manual Scripting: 
- Traditionally, this was implemented by hand-writing code in languages such as Python, Scala or SQL
- Common transformations include aggregating data or converting date formats, editing text strings, or joining rows and columns
- This step also includes sending the data to the target temporary or permanent store (which could be a database, data warehouse or a data lake) 
- This approach provides several benefits such as having a customised solution specifically tailored for the company's needs
- However, it may be time-consuming and more costly than other options
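
As a rough illustration of what such a hand-written script might look like, the sketch below converts a date format, tidies text strings, and aggregates the result. The sales rows and field names are hypothetical, and the source dates are assumed to be in day/month/year order.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw sales rows with day/month/year dates and untidy product names.
raw_sales = [
    {"product": " widget ", "sold_on": "05/01/2023", "amount": "10.00"},
    {"product": "Widget", "sold_on": "06/01/2023", "amount": "15.50"},
    {"product": "gadget", "sold_on": "06/01/2023", "amount": "7.25"},
]


def clean(row):
    """Convert the date to ISO 8601, tidy the text, and parse the amount."""
    sold_on = datetime.strptime(row["sold_on"], "%d/%m/%Y").date()
    return {
        "product": row["product"].strip().title(),
        "sold_on": sold_on.isoformat(),
        "amount": float(row["amount"]),
    }


def total_by_product(rows):
    """Aggregate: sum the amount for each product."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["product"]] += row["amount"]
    return dict(totals)


cleaned_sales = [clean(r) for r in raw_sales]
totals = total_by_product(cleaned_sales)
```

Note that the aggregation only gives the right answer because the product names are standardised first; `" widget "` and `"Widget"` would otherwise be counted as different products.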

#### On-premise Software 
- Out-of-the-box third-party tools are available to do transformation work. These tools are quite mature.
- Compared to offsite (third-party vendor) scripting solutions, onsite tooling gives the end-user more oversight
- The downside, however, is that additional expert staff may need to be hired to manage it

#### Cloud-Based Solutions
- These tools (such as Azure Data Factory) are hosted in the cloud, where they can leverage the expertise and infrastructure of the vendor
- Many of these tools automate major parts of the data transformation process using graphical user interfaces
- These solutions are particularly useful when linking cloud-based systems together, such as integrating a software as a service (SaaS) platform like Salesforce with a cloud-based data warehouse like Amazon Redshift
- The benefit of using such tools is that they are mature and compatible with the cloud provider's infrastructure
- The downside could be the long-term running costs

## Types of Data Transformation

<table>
    <thead>
        <tr>
            <th style="width:auto;text-align:center"></th>
            <th style="width:auto;text-align:center">Description</th>
            <th style="width:auto;text-align:center">Examples</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>Bucketing/Binning</th>
            <td>Used to change a continuous numeric series into fixed, categorical ranges</td>
            <td>Converting data from {2,5,8…} to {2-5, 6-9, 10-13…} </td>
        </tr>
        <tr>
            <th>Data Aggregation</th>
            <td>Data aggregation is a process that searches, gathers, summarises, and presents data at an aggregate level</td>
            <td>Used mainly for reporting. For example, summing the total weekly sales of a product</td>
        </tr>  
		<tr>
            <th>Data Cleaning</th>
            <td>Involves deleting or correcting out-of-date, inaccurate, or incomplete information to increase the accuracy of data</td>
            <td>Replacing NULL values with 0</td>
        </tr>
        <tr>
            <th>Data Deduplication</th>
            <td>Data deduplication is a data wrangling process where we identify and remove duplicate records to store a unique "golden record" of data</td>
            <td>Duplicates can occur during data integration tasks. Removing exact duplicate records helps increase data quality</td>
        </tr>
		<tr>
            <th>Data Derivation</th>
            <td>Involves the creation of special rules to “derive” specific information from available data to create new, derived fields</td>
            <td>Creating a new "Year" column after extracting the year value from a "Date" column</td>
        </tr>	
		<tr>
            <th>Data Filtering</th>
            <td>Includes techniques used to refine datasets. The goal is to remove irrelevant data</td>
            <td>Data filters, such as the SQL WHERE command, can be used to select only "Male" employees</td>
        </tr>
		<tr>
            <th>Data Integration</th>
            <td>Taking different data sources (such as different database tables, or file formats) and merging them into one table/file which has the same structure</td>
            <td>SQL JOIN operations can be used to combine multiple tables into one larger table</td>
        </tr>
		<tr>
            <th>Data Splitting</th>
            <td>Refers to dividing a single dataset (or column) into multiple datasets (or columns)</td>
            <td>Splitting a large, historical dataset into smaller datasets split by year and month</td>
        </tr>
		<tr>
            <th>Data Validation</th>
            <td>Process of creating automated rules that activate when the system encounters pre-determined data issues. This helps maintain data quality</td>
            <td>A field that requires data arrives as NULL. This triggers an automatic alert to notify the data engineers</td>
        </tr>
		<tr>
            <th>Indexing and Ordering</th>
            <td>Data can be transformed so that it’s ordered logically to speed up data retrieval or to suit a pre-determined data storage schema</td>
            <td>In RDBMSs, creating indexes can improve data querying performance</td>
        </tr>
		<tr>
            <th>Anonymisation and Encryption</th>
            <td>Data containing private details should be anonymised to avoid potential legal issues</td>
            <td>Anonymising date of birth or certain patient health information</td>
        </tr>
    </tbody>
 </table>
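
A few of the transformation types in the table can be sketched in plain Python. The functions below are illustrative only: the bin boundaries follow the `{2-5, 6-9, 10-13}` example, while the field names (`id`, `date`) and the choice of required fields are assumptions.

```python
from datetime import date


def bin_label(value, start=2, width=4):
    """Binning: map an integer to a fixed-width range label such as '2-5'."""
    low = start + ((value - start) // width) * width
    return f"{low}-{low + width - 1}"


def add_year(row):
    """Derivation: create a 'year' field from an ISO-formatted 'date' field."""
    return {**row, "year": date.fromisoformat(row["date"]).year}


def deduplicate(rows):
    """Deduplication: keep the first occurrence of each exact duplicate."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def validate(rows, required=("id", "date")):
    """Validation: list (row index, field) pairs where a required value is missing."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field in required
        if row.get(field) in (None, "")
    ]


# Hypothetical rows containing one exact duplicate.
rows = [
    {"id": 1, "date": "2021-03-04"},
    {"id": 1, "date": "2021-03-04"},
    {"id": 2, "date": "2022-07-01"},
]
prepared = [add_year(r) for r in deduplicate(rows)]
labels = [bin_label(v) for v in (2, 5, 8, 10)]
```

In a real pipeline the list returned by `validate` would typically feed an alerting mechanism rather than being inspected by hand.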
 


## Benefits of Data Transformation

### Provides useful information:
- Data is transformed to make it better-organised. Transformed data may be easier for humans to read and for automated reports to derive value from.

### Data Integration
- Data transformation enables organisations to integrate various datasets together to get a holistic view of various aspects of the company

### Data Quality
- Properly formatted and validated data improves data quality and protects applications from potential issues that could break the system such as null values, unexpected duplicates, incorrect indexing, and incompatible formats

### Easier Storage
- Transformed data could be easier to track, store and maintain. For example, compressing data will save disk storage space.

### Enhance Data Science Models 
- Transformed data can be more easily stored over time and integrated with older datasets to increase the efficiency of data science algorithms by providing a larger and more accurate dataset for model training 

## Challenges of Data Transformation

### Time Consuming
- You may need to cleanse the data extensively before you can transform or migrate it
- This can be extremely time-consuming, and is a common complaint amongst data engineers and data scientists working with unstructured data in particular
- According to a 2017 Crowdflower report, data scientists spend 51% of their time compiling, cleaning, and organising data. They also spend 30% of their time collecting datasets and mining data to identify patterns.

### Costly
- Depending on the infrastructure, transforming data may require a team of experts and incur substantial hardware and software costs.

### Slow Processing
- Because the process of extracting and transforming data can be a burden on a system, it is often done in batches, which means you may have to wait for hours for the next batch to be processed
- This can delay time-sensitive business decisions

### Resource Intensive
- Performing transformations in an on-premise data warehouse after loading, or transforming data before feeding it into applications, can create a computational burden that slows down other operations

### Lack of Expertise 
- The ability to wield big data technologies successfully requires both knowledge and talent, and there currently is a lack of talent in the data engineering and data science domains 
- This makes it more difficult to find, recruit and retain talented experts

### Changing Business Requirements
- The format of source data may change suddenly, causing issues in the system

### Risk of Project Failure
- According to a recent survey, companies are falling behind in their data-driven goals: 72% of survey participants have yet to forge an internal data culture, while 52% say they have not leveraged data and analytics to remain competitive


## Extract, Transform, and Load

> Extract, Transform and Load are the three steps required to turn raw or isolated data into useful, integrated information that the business can analyse

Businesses need these steps because data is usually generated in different formats by various sources. We therefore have to clean, enrich, and transform the data sources before integrating them into an analysable form. That way, business intelligence and visualisation platforms (like MicroStrategy or Tableau) can work with the data to derive insights.

### Extract: 
- _Extract_ refers to ingesting the data either from the original creation source or other systems storing data
- An example of extraction is reading raw data arriving from a mobile application (such as Uber) 

### Transform: 
- _Transform_ refers to the process of changing the structure of the data in order to prepare it for loading into a storage location
- An example of transformation is changing the file format from CSV to JSON

### Load: 
- _Loading_ refers to the process of depositing the data into a data storage system
- An example of loading is moving all transformed data into a Postgres database
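
The three steps can be sketched as three small functions. This is a minimal illustration under stated assumptions: the raw feed and its columns (`trip_id`, `city`, `fare`) are hypothetical, and an in-memory SQLite database stands in for Postgres so that the example runs with the standard library alone.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw feed; a real extract step would read from an app, API, or file.
RAW_CSV = "trip_id,city,fare\n1,Dubai,23.5\n2,London,41.0\n"


def extract(text):
    """Extract: read the raw CSV rows as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: fix the types and serialise each row as a JSON document."""
    return [
        json.dumps({"trip_id": int(r["trip_id"]), "city": r["city"],
                    "fare": float(r["fare"])})
        for r in rows
    ]


def load(documents, connection):
    """Load: store the documents in a table (SQLite stands in for Postgres)."""
    connection.execute("CREATE TABLE IF NOT EXISTS trips (doc TEXT)")
    connection.executemany("INSERT INTO trips (doc) VALUES (?)",
                           [(d,) for d in documents])
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)

# Read the rows back to confirm what was stored.
stored = [
    json.loads(doc)
    for (doc,) in connection.execute("SELECT doc FROM trips ORDER BY rowid")
]
```

Keeping each step in its own function makes it possible to reorder them, which is the difference between the two approaches described next.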

The transformation step is by far the most complex one. Transformations for ETL or ELT differ based on:

- _When_ the transformation takes place
- _Where_ the transformation occurs

There are two main approaches to implementing the above steps in an organisation:

### ETL

> Extract, Transform _then_ Load. Transformations are applied before loading into storage.

Transformation happens before loading so that the data can be fit into an existing schema. This makes it easy to do analytics, but it means that you might be throwing away valuable data. ETL has been the traditional data handling paradigm for the past several decades. The tools used are mature, and data storage is usually in a centralised database.

### ELT

> Extract, Load _then_ Transform. Data is loaded into storage in the raw format it arrives in. Transformations are applied later.

Loading happens before the transformations so that all of the data can be stored in the format in which it arrives. This means that you never throw away potentially valuable data, but it makes analytics harder because the stored data is messy. The data is usually stored in distributed storage such as Amazon S3, which can hold any raw data format, including complex unstructured data types. This is the more modern approach, which became popular after the advent of big data, especially as more complex data types like images and video proliferated.

In ELT systems, data transformation is still necessary - to do analytics you need to cleanse, enrich, and integrate data (amongst other transformations).
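
The difference in ordering can be made concrete with a small sketch. It assumes a hypothetical list of raw events and uses in-memory SQLite databases as stand-ins for both the ETL target database and the ELT raw store; real systems would use very different storage, so treat this purely as an illustration of when and where the transformation runs.

```python
import sqlite3

# Hypothetical raw events as (id, amount) strings; the empty amount is a messy record.
raw_events = [("1", "10.5"), ("2", ""), ("3", "4.5")]


def run_etl(events, db):
    """ETL: clean in application code first, then load only rows that fit the schema."""
    clean = [(int(i), float(a)) for i, a in events if a != ""]
    db.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", clean)


def run_elt(events, db):
    """ELT: load everything as-is, then transform inside the storage engine."""
    db.execute("CREATE TABLE raw_sales (id TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_sales VALUES (?, ?)", events)
    db.execute(
        "CREATE VIEW sales AS "
        "SELECT CAST(id AS INTEGER) AS id, CAST(amount AS REAL) AS amount "
        "FROM raw_sales WHERE amount <> ''"
    )


etl_db = sqlite3.connect(":memory:")
elt_db = sqlite3.connect(":memory:")
run_etl(raw_events, etl_db)
run_elt(raw_events, elt_db)

query = "SELECT id, amount FROM sales ORDER BY id"
etl_rows = etl_db.execute(query).fetchall()
elt_rows = elt_db.execute(query).fetchall()

# Only the ELT store still holds the record that the ETL pipeline discarded.
raw_count = elt_db.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
```

Both pipelines produce the same two analysable rows, but only the ELT store retains all three raw records, which illustrates the trade-off between a clean schema and keeping potentially valuable data.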

> Both ELT and ETL are popular approaches to data wrangling in organisations

The diagram below shows how both approaches compare:

<p align="center">

<img src= "images/etl-elt-details.png" width=900>

</p>

## ETL vs ELT


 <table>
    <thead>
        <tr>
            <th style="width:auto;text-align:center"></th>
            <th style="width:auto;text-align:center">ETL</th>
            <th style="width:auto;text-align:center">ELT</th>
        </tr>
    </thead>
    <tbody>
		<tr>
            <th>Transformation</th>
            <td><li>Transformations are done in an ETL temporary server/staging area before loading the data into a final, permanent repository
                <li>As data size grows, transformation time increases
            </td>
            <td><li>All raw and incoming data is stored in a data lake first, then transformations are performed on the data
                <li>ELT only transforms the required data (not the full data set), so transformation time could be less
            </td>
        </tr>
	    <tr>
            <th>Data Load Time</th>
            <td><li>Data is first loaded into a temporary staging area and later loaded into a target system
                <li>Data must first be transformed before loading, which needs significant amounts of time initially
            </td>
            <td><li>Data loaded into a data lake only once as the data arrives
                <li>Faster to implement as there is no requirement to first transform the data
            </td>
        </tr>
        <tr>
            <th>Cloud Support</th>
            <td><li>Historically, ETL solutions were hosted on-premise which made them expensive and difficult to maintain
                <li>More modern ETL solutions are now cloud-native
                <li>To scale up or down, the server needs to increase its resources (increasing/decreasing the node CPU or disk storage)
            </td>
            <td><li>ELT uses a distributed compute and data storage model which was designed to leverage cloud features
                <li>Can easily scale vertically (up or down) and horizontally (by adding additional nodes to the cluster)
            </td>
        </tr>
		<tr>
            <th>Maintenance Effort</th>
            <td><li>It needs higher maintenance as you need to select which high-value data to load and transform</td>
            <td><li>Low maintenance as data is always loaded in full as-is, and is available anytime</td>
        </tr>
		<tr>
            <th>Support for Data warehouses</th>
            <td><li>ETL model used mainly for relational and structured data which is designed for data warehouses</td>
            <td><li>Used more commonly in scalable distributed infrastructure which supports both structured and unstructured data</td>
        </tr>  
		<tr>
            <th>Data Lake Support</th>
            <td><li>Does not support the concept of a data lake
                <li>Only supports structured data
            </td>
            <td><li>Designed to leverage data lakes
                <li>Supports both structured and unstructured data at any volumes
                <li>Can store raw and transformed data easily
            </td>
        </tr>
		<tr>
            <th>Data Aggregations</th>
            <td><li>Due to the centralised architecture, complexity grows with the additional amount of data in the dataset</td>
            <td><li>Due to the distributed architecture, processing time for aggregations depends more on the resources of the platform</td>
        </tr>
		<tr>
            <th>Maturity</th>
            <td><li>The technology has been used for several decades
                <li>Best practices are well documented and talent is readily available
            </td>
            <td><li>Relatively new technologies which are still evolving
                <li>Tools can be complex to deploy and integrate with legacy systems
                <li>Talent supply is still lagging behind industry demand
            </td>
        </tr>  
    </tbody>
 </table>


## Examples

Now that we have a better understanding of what ETL and ELT are, which approach should we be using for our data engineering tasks? The answer is that it depends on the use case and business requirements. Below are some sample use cases highlighting when it's advisable to use each approach:

### Use case #1
-   A company has massive amounts of data being ingested in real-time from multiple sources and in different file formats
-   In this example, _ELT_ works best with huge quantities of data, both structured and unstructured. As long as the target system is cloud-based or a data lake, you will likely be able to process those huge amounts of data more quickly with an ELT solution.

### Use case #2
-   An organisation has its entire dataset organized into structured rows and columns stored in Postgres
-   Reporting on the data is usually done in batches at regular intervals (daily, weekly and monthly)
-   In this case, ETL is the preferred option as the data is structured and stored in a relational database system

### Use case #3 
-   A company needs all its data in one place as soon as possible
-   Because the transformations take place at the end of the process, ELT prioritises the speed of transfer over almost everything else, which means that all data (good, bad, and otherwise) ends up in the data lake for later transformation


Continuing with the Emirates Airlines example discussed earlier, the data transformation activities in this system include the Apache Spark steps in the diagram below:

<p align="center">

<img src= "images/emirates-batch2.png" width=1000>

</p>

Some examples of data transformation activities that the Emirates check-in application performs include:
-   Combining all incoming XML files for a particular hour into one folder on HDFS.
-   Joining together all the data files for a particular day by using an end-of-day batch job.
-   The combined daily file is checked for duplicate data.  Any duplicate records are removed.
-   The daily cleaned file from the above step is transformed into Parquet file format and stored in another HDFS storage location
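
The combine and deduplicate steps above can be approximated in a few lines of plain Python. This is only a sketch of the idea, not the Spark job the system actually runs: the XML element and attribute names (`checkin`, `id`, `flight`) and the sample values are hypothetical, since the real schema is not described here. The conversion to Parquet is left out because it needs a third-party library.

```python
import xml.etree.ElementTree as ET

# Hypothetical check-in messages; the real XML schema is not given in the text.
hourly_files = [
    "<checkins><checkin id='A1' flight='EK001'/><checkin id='A2' flight='EK002'/></checkins>",
    "<checkins><checkin id='A2' flight='EK002'/><checkin id='A3' flight='EK001'/></checkins>",
]


def parse_file(xml_text):
    """Turn one XML document into a list of attribute dictionaries."""
    root = ET.fromstring(xml_text)
    return [dict(node.attrib) for node in root.iter("checkin")]


def combine(files):
    """Join the records from every file into a single daily list."""
    return [record for text in files for record in parse_file(text)]


def remove_duplicates(records):
    """Keep one copy of each exact duplicate record, in first-seen order."""
    unique = {tuple(sorted(r.items())): r for r in records}
    return list(unique.values())


combined = combine(hourly_files)
daily = remove_duplicates(combined)
```

Here the record with `id='A2'` appears in both hourly files, so the combined list has four records and the deduplicated daily list has three.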


## Key Takeaways

- Data transformation is a critical part of the data engineering process at organisations. This is because data in modern systems exists in various forms and in ever-increasing volumes.
- Raw data generated from different sources is often inconsistent and imprecise, and may contain duplicates. Hence, data cleansing and transformation are important for creating a reliable analytics data platform.
- _ETL_ is an acronym for Extract, Transform then Load. It should not be confused with _ELT_, which has the same steps but in a different order: Extract, Load then Transform.
- Data transformation is important to change data into useful information. Some common types of transformations include constructive (like adding columns), destructive (like removing incomplete records, filtering certain rows), structural (such as integrating different files) or altering file formats.
- Data transformation can be implemented manually by scripting, using automated tools or by cloud-based software
- Despite the benefits of data transformation such as increasing reporting accuracy, it's also a challenging task due to the complexity, time and costs involved among other factors
- ETL has been around for decades and is better equipped to handle structured data (data that's arranged in rows and columns) efficiently, but it is not suited to unstructured data.
- ELT is better tailored for big data that can be either structured or unstructured (such as image and video files), ingested in batches or in real time
- The modern trend in global companies is to migrate data extraction, loading and transformation to cloud-based tools
- The decision to use ETL or ELT in companies depends on various criteria such as the business requirements and the type of data being ingested