<h1>Course Introduction</h1>


ETL stands for Extract, Transform, and Load. It refers to the process of curating data from multiple sources and preparing the data for integration and loading into a destination platform such as a data warehouse or analytics environment.

<p>
ELT (Extract, Load, and Transform) is similar but loads the data in its raw format, leaving the transformations for users to apply themselves in a ‘self-serve analytics’ destination environment. Both methods are typical examples of data pipeline deployments.

<p>

You will also learn about the tools, technologies, and use cases for the two main paradigms within data pipeline engineering: batch and streaming data pipelines. 

<p>

You will further cement this knowledge by exploring and applying two popular open-source data pipeline tools: Apache Airflow and Apache Kafka.

<h2>Extraction</h2>


What is extraction?
To extract data is to configure access to it and read it into an application.
Normally this is an automated process.
Some common methods include:
<ul>
<li>Web scraping, where data is extracted from web pages using languages such as Python or R to parse the underlying HTML code.</li>
<li>Using APIs to programmatically connect to data and query it.</li>
</ul>
The source data may be relatively static, such as a data archive.
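<p>
As a minimal sketch of the web scraping method, the Python snippet below parses an HTML fragment with the standard library's <code>html.parser</code> module. The sample HTML, the <code>CellExtractor</code> class, and the table contents are made up for illustration; a real pipeline would first fetch the page over HTTP.

```python
from html.parser import HTMLParser

# Hypothetical page content; in practice this would come from an HTTP response.
SAMPLE_HTML = """
<table>
  <tr><td>widget</td><td>4</td></tr>
  <tr><td>gadget</td><td>7</td></tr>
</table>
"""


class CellExtractor(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


parser = CellExtractor()
parser.feed(SAMPLE_HTML)
extracted_rows = parser.rows
print(extracted_rows)
```

Note that everything is extracted as text here; converting the values to proper types is left to the transformation stage.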

<h2>Transformation</h2>


Data transformation, also known as data wrangling, means
processing data to make it conform to the requirements of both the
target system and the intended use case for the curated data.
Transformation can include any of the following kinds of processes:
<ul>
<li>Cleaning: fixing errors or missing values.</li>
<li>Filtering: selecting only what is needed.</li>
<li>Joining disparate data sources: merging related data.</li>
<li>Feature engineering: such as creating KPIs for dashboards or machine learning.</li>
<li>Formatting and data typing: making the data compatible with the target system.</li>
</ul>
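<p>
The kinds of processes listed above can be sketched in a few lines of plain Python. The records, the field names, and the choice to treat a missing amount as 0.0 are assumptions made for this example only; a real pipeline would pick its own cleaning policy.

```python
# Hypothetical raw records, as they might arrive from an extraction step.
raw_orders = [
    {"order_id": "1", "customer_id": "c1", "amount": "19.99", "status": "paid"},
    {"order_id": "2", "customer_id": "c2", "amount": "", "status": "paid"},
    {"order_id": "3", "customer_id": "c1", "amount": "5.01", "status": "cancelled"},
]
# A second, related source used for the join.
customers = {"c1": "Ada", "c2": "Grace"}


def transform(orders, customer_names):
    result = []
    for order in orders:
        # Filtering: keep only paid orders.
        if order["status"] != "paid":
            continue
        # Cleaning: treat a missing amount as 0.0 (one possible policy).
        amount_text = order["amount"] or "0.0"
        result.append({
            # Data typing: cast strings to numeric types.
            "order_id": int(order["order_id"]),
            "amount": float(amount_text),
            # Joining: look up the customer name from the second source.
            "customer": customer_names[order["customer_id"]],
        })
    return result


clean_orders = transform(raw_orders, customers)
# Feature engineering: a simple KPI, total paid revenue.
total_revenue = sum(o["amount"] for o in clean_orders)
print(clean_orders, total_revenue)
```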

<h2>Loading</h2>


Loading generally means writing data to a new destination environment.
Typical destinations include databases, data warehouses, and data marts.
The key goal of data loading is to make the data readily available for ingestion
by analytics applications so that end users can gain value from it.
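<p>
As a rough illustration of loading, the sketch below writes a few rows into a database table. An in-memory SQLite database stands in for the destination environment, and the <code>orders</code> table and its rows are invented for the example.

```python
import sqlite3

# Hypothetical transformed rows ready to be loaded.
rows = [(1, "Ada", 19.99), (2, "Grace", 0.0)]

# An in-memory SQLite database stands in for the destination environment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Load all rows in one transaction; the context manager commits on success.
with conn:
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Once loaded, the data can be queried by downstream applications.
loaded_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(loaded_count)
```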

<h1>ELT</h1>



ELT stands for Extract, Load, and Transform, a specific automated data pipeline engineering methodology.
ELT is similar to ETL in that the same stages are
involved, but the order in which they are performed is different. For ELT processes, data is acquired and directly loaded, as-is, into its destination environment.
From its new home, usually a sophisticated analytics platform such as a data lake, it can be transformed on demand.

Use cases for ELT include:
<ul>
<li>Dealing with the massive swings in scale that come with implementing Big Data products.</li>
<li>Calculating real-time analytics on streaming Big Data.</li>
<li>Bringing together data sources that are highly distributed around the globe.</li>
</ul>
In terms of speed, moving data is usually more of a bottleneck than processing it,
so the less you move it, the better.
<p>
<strong>Therefore, ELT may be your best bet when you want
flexibility in building a suite of data products from the same sources.</strong>
 

ELT is a flexible option that enables
a variety of applications from the same source of data.
Because you are working with a replica of the source data, there is no information loss.
Many kinds of transformations can lead to information loss, and if these happen somewhere
upstream in the pipeline, it may be a long time before you can have a change request met.
Worse yet, the information may be forever lost if the raw data is not stored. 
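<p>
To make the idea concrete, here is a small sketch of the ELT pattern: raw events are loaded untouched, and two different data products are later derived from the same stored copy. The event payloads, the <code>raw_events</code> table, and the <code>read_events</code> helper are all hypothetical, and an in-memory SQLite database stands in for the destination platform.

```python
import json
import sqlite3

# Hypothetical raw events, loaded exactly as received.
raw_events = [
    '{"user": "ada", "action": "view", "ms": 120}',
    '{"user": "ada", "action": "buy", "ms": 340}',
    '{"user": "grace", "action": "view", "ms": 95}',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")
with conn:
    # Load step: no parsing, no filtering, nothing is discarded.
    conn.executemany("INSERT INTO raw_events VALUES (?)", [(e,) for e in raw_events])


def read_events(connection):
    """Transform step, run on demand against the stored raw data."""
    cursor = connection.execute("SELECT payload FROM raw_events ORDER BY rowid")
    return [json.loads(payload) for (payload,) in cursor]


# Data product 1: number of actions per user.
actions_per_user = {}
for event in read_events(conn):
    actions_per_user[event["user"]] = actions_per_user.get(event["user"], 0) + 1

# Data product 2: events slower than 100 ms, derived from the same raw table.
slow_events = [e for e in read_events(conn) if e["ms"] > 100]
print(actions_per_user, len(slow_events))
```

Because the raw payloads stay in the table, a third product can be added later without changing the load step.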

<h1>ETL vs. ELT</h1>



<ul>
    <li>Transformations for ETL pipelines take place within the data pipeline, before the data
reaches its destination, whereas transformations for ELT are decoupled from the data pipeline
and happen in the destination environment at will.</li>
<li>ETL is normally a fixed process meant to serve a very specific function, whereas
ELT is flexible, making data readily available for self-serve analytics.</li>
<li>ETL processes traditionally handle structured, relational data, and on-premises computing
resources handle the workflow, so scalability can be a problem.
ELT, on the other hand, handles any kind of data, structured and unstructured.</li>
<li>ETL pipelines take time and effort to modify, which means users must wait for the development
team to implement their requested changes.
ELT provides more agility: with some training in modern analytics applications, end users
can easily connect to and experiment with the raw data.</li>

</ul>
<strong>ELT is a natural evolution of ETL. However, conventional ETL still has many applications and still has its place.</strong>
 

<h1>Data Transformation Techniques</h1>

Data transformation is mainly about formatting the data to suit the application.
This can involve many kinds of operations, such as:
<ul>
<li>Data typing, which involves casting data to appropriate types, such as integer, float, string, object, and category.</li>
<li>Data structuring, which includes converting one data format to another, such as JSON, XML, or CSV to database tables.</li>
<li>Anonymizing and encrypting transformations to help ensure privacy and security.</li>
</ul>
<p>

Other types of transformations include:
<ul>
<li>Cleaning operations for removing duplicate records and filling missing values.</li>
<li>Normalizing data to ensure units are comparable, for example, using a common currency.</li>
<li>Filtering, sorting, aggregating, and binning operations for accessing the right data at a suitable level of detail and in a sensible order.</li>
<li>Joining, or merging, disparate data sources.</li>
</ul>
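<p>
A few of these operations are sketched below: anonymizing, normalizing to a common currency, and binning. The exchange rates, bin thresholds, and record fields are assumptions chosen for illustration. Also note that an unsalted hash is only a simple form of pseudonymization; stronger privacy guarantees need more than this.

```python
import hashlib

# Hypothetical exchange rates to USD; real rates would come from a trusted source.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}


def anonymize(email):
    # One-way hash so the raw email never reaches the destination.
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def to_usd(amount, currency):
    # Normalizing: express every amount in a common currency.
    return round(amount * RATES_TO_USD[currency], 2)


def spend_bin(amount_usd):
    # Binning: reduce a continuous value to a coarse category.
    if amount_usd < 50:
        return "low"
    if amount_usd < 200:
        return "medium"
    return "high"


records = [
    {"email": "ada@example.com", "amount": 100.0, "currency": "EUR"},
    {"email": "grace@example.com", "amount": 20.0, "currency": "USD"},
]

transformed = []
for r in records:
    usd = to_usd(r["amount"], r["currency"])
    transformed.append({
        "user_hash": anonymize(r["email"]),
        "amount_usd": usd,
        "bin": spend_bin(usd),
    })
print(transformed)
```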


<strong>Schema-on-write is the conventional approach used in ETL pipelines, where the data must
be conformed to a defined schema prior to loading to a destination, such as a relational
database. This comes at the cost of limiting the versatility of
the data.</strong>

 
<strong>Schema-on-read relates to the modern ELT approach, where the schema is applied to the raw
data after reading it from the raw data storage.
This approach is versatile since it can obtain multiple views of the same source data using
ad-hoc schemas.</strong>
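<p>
One way to picture schema-on-read is shown below: the raw records are stored without any enforced schema, and each reader applies its own schema at read time. The <code>read_with_schema</code> helper, the schemas, and the sample records are hypothetical and only meant as a sketch.

```python
import json

# Hypothetical raw records stored without any enforced schema.
raw_lines = [
    '{"id": 1, "name": "Ada", "country": "UK", "spend": "12.50"}',
    '{"id": 2, "name": "Grace", "country": "US", "spend": "30"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> type caster) while reading raw data."""
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record[field]) for field, cast in schema.items()}


# Two ad-hoc schemas give two different views of the same source data.
billing_schema = {"id": int, "spend": float}
geography_schema = {"id": int, "country": str}

billing_view = list(read_with_schema(raw_lines, billing_schema))
geography_view = list(read_with_schema(raw_lines, geography_schema))
print(billing_view, geography_view)
```

With schema-on-write, by contrast, only one of these shapes would be fixed before loading, and the fields it leaves out would not be available later.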


<p>
Ways of losing information in transformation processes include filtering, aggregation,
using edge computing devices, and lossy data compression. 
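<p>
A tiny example, using made-up readings, shows why aggregation and filtering lose information: two different data sets can collapse to the same summary, and filtered-out rows cannot be rebuilt from what remains.

```python
from statistics import mean

# Two hypothetical sets of readings that differ in detail.
readings_a = [10, 20, 30]
readings_b = [20, 20, 20]

# Aggregation: both collapse to the same summary value,
# so the original spread cannot be recovered from the mean alone.
summary_a = mean(readings_a)
summary_b = mean(readings_b)

# Filtering: rows dropped here are not available downstream.
filtered_a = [r for r in readings_a if r >= 20]
print(summary_a, summary_b, filtered_a)
```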

<h1>Data Loading Techniques</h1>

<ul>
    <li>Full loading: You can load an initial history into a database.</li>
<li>Incremental</li>
<li>Scheduled</li>
<li>On Demand</li>
<li>Batch</li>
<li>Streaming</li>
<li>Micro batch</li>
<li>Push</li>
<li>Pull</li>
<li>Parallel</li>


</ul>
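
The difference between full and incremental loading can be sketched as follows. This is only one possible approach: it assumes the source rows carry an ever-increasing <code>id</code> that can serve as a high-water mark, and it uses an in-memory SQLite table named <code>target</code> as a stand-in destination.

```python
import sqlite3

# Hypothetical source rows: (id, value). The id doubles as a change marker.
source = [(1, "a"), (2, "b"), (3, "c")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")


def full_load(connection, rows):
    # Full loading: replace the target with the complete source history.
    with connection:
        connection.execute("DELETE FROM target")
        connection.executemany("INSERT INTO target VALUES (?, ?)", rows)


def incremental_load(connection, rows):
    # Incremental loading: append only rows newer than the high-water mark.
    last_id = connection.execute(
        "SELECT COALESCE(MAX(id), 0) FROM target"
    ).fetchone()[0]
    new_rows = [r for r in rows if r[0] > last_id]
    with connection:
        connection.executemany("INSERT INTO target VALUES (?, ?)", new_rows)
    return len(new_rows)


full_load(conn, source)
source.append((4, "d"))  # the source grows after the initial load
added = incremental_load(conn, source)
total = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(added, total)
```

Either function could be run on a schedule or on demand; the loading technique and the trigger are separate choices.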

<h1>Summary & Highlights</h1>

<ul>
<li>ETL stands for Extract, Transform, and Load.</li>
<li>Loading means writing the data to its destination environment.</li>
<li>Cloud platforms are enabling ELT to become an emerging trend.</li>
<li>The key differences between ETL and ELT include the place of transformation, flexibility, Big Data support, and time-to-insight.</li>
<li>There is an increasing demand for access to raw data that drives the evolution from ETL, which is still used, to ELT, which enables ad-hoc, self-serve analytics.</li>
<li>Data extraction often involves advanced technology including database querying, web scraping, and APIs.</li>
<li>Data transformation, such as typing, structuring, normalizing, aggregating, and cleaning, is about formatting data to suit the application.</li>
<li>Information can be lost in transformation processes through filtering and aggregation.</li>
<li>Data loading techniques include scheduled, on-demand, and incremental.</li>
<li>Data can be loaded in batches or streamed continuously.</li>
</ul>


