#### Theories of ETL & ELT
##### Data Pipelines
* A data pipeline moves data from a source to a destination and transforms it somewhere along the way. Data sources, also called source systems, can take many forms, such as CSV files, APIs, or databases. Once data is pulled from a source system, the goal is to transform it and load it into a destination where it can be used for business intelligence, machine learning, or AI.

* Pipelines come in different types (or flavours):
 1. ETL
 2. ELT

* ETL is short for extract, transform, and load. An ETL pipeline first extracts data, then transforms it, and finally loads it to a destination. ETL is the more traditional of the two design patterns and may pull from tabular or non-tabular data sources. ETL pipelines are often built with Python and libraries such as pandas to manipulate and transform data.
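For contrast with the ETL pattern described above, ELT (extract, load, transform) loads the raw data into the destination first and transforms it there. The block below is only a minimal sketch of that ordering, not a reference implementation: the in-memory SQLite database stands in for a real destination, and the table name `raw_firms` and the `region` column are hypothetical names chosen for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical raw data standing in for an extracted source file.
# The "region" column is invented to give the transform step something to drop.
raw_data = pd.DataFrame({
    "industry_name": ["Finance", "Healthcare"],
    "number_of_firms": [92, 208],
    "region": ["North", "South"],
})

# An in-memory SQLite database is used here as a stand-in destination
conn = sqlite3.connect(":memory:")

# Load: write the raw, untransformed data to the destination first
raw_data.to_sql("raw_firms", conn, index=False)

# Transform: select the required columns inside the destination using SQL
elt_result = pd.read_sql_query(
    "SELECT industry_name, number_of_firms FROM raw_firms", conn
)
conn.close()

print(elt_result)
```

The only difference from ETL in this sketch is the order of the last two steps: the raw table, including the unused column, lands in the destination before any transformation runs.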

* Basic extract, transform, and load functions are shown below.

In [14]:
# Import the pandas dependency
import pandas as pd

In [8]:
def extract(file_name):
    # Read the CSV at the path given by file_name into a DataFrame
    # (uses the module-level pandas import, so no import is needed here)
    return pd.read_csv(file_name)

In [7]:
# Load function: write a DataFrame to a CSV file
def load(dataframe, file_name):
    # By default to_csv also writes the row index as the first column;
    # pass index=False to leave it out
    dataframe.to_csv(file_name)
    print(f'Successfully loaded data to {file_name}')

In [9]:
# Transform the DataFrame to include only the columns industry_name and number_of_firms
def transform(data_frame):
    # Keep all rows, but only this subset of columns
    return data_frame.loc[:, ["industry_name", "number_of_firms"]]

In [12]:
# Combine all three functions to build a simple Python pipeline
def extract(file_name):
    return pd.read_csv(file_name)

def transform(data_frame):
    return data_frame.loc[:, ["industry_name", "number_of_firms"]]

def load(data_frame, file_name):
    data_frame.to_csv(file_name)

# Extract the source file, then pass the result to transform()
extracted_data = extract(file_name="test_file.csv")
transformed_data = transform(data_frame=extracted_data)

# Pass the transformed_data DataFrame to the load() function
load(data_frame=transformed_data, file_name="number_of_firms.csv")

In [13]:
transformed_data.head(5)

   industry_name  number_of_firms
0        Finance               92
1     Healthcare              208
2         Retail              403
3  Manufacturing              341
4     Technology              196
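The pipeline above depends on a `test_file.csv` that is not included here. As a rough, self-contained sketch, the same three functions can be run against a small sample file written to a temporary directory and then checked by reading the output back. The sample rows reuse values from the output above, while the `avg_revenue` column and the use of `index=False` in `load()` are assumptions made for this illustration rather than part of the original pipeline.

```python
import os
import tempfile

import pandas as pd


def extract(file_name):
    # Read the source CSV into a DataFrame
    return pd.read_csv(file_name)


def transform(data_frame):
    # Keep only the two columns of interest
    return data_frame.loc[:, ["industry_name", "number_of_firms"]]


def load(data_frame, file_name):
    # Assumption for this sketch: index=False keeps the row index out of the
    # file, so reading it back does not produce an extra unnamed column
    data_frame.to_csv(file_name, index=False)


with tempfile.TemporaryDirectory() as tmp_dir:
    source_path = os.path.join(tmp_dir, "test_file.csv")
    target_path = os.path.join(tmp_dir, "number_of_firms.csv")

    # Hypothetical source file; avg_revenue is an invented extra column
    pd.DataFrame({
        "industry_name": ["Finance", "Healthcare", "Retail"],
        "number_of_firms": [92, 208, 403],
        "avg_revenue": [1.5, 2.0, 0.8],
    }).to_csv(source_path, index=False)

    # Run the pipeline end to end
    load(transform(extract(source_path)), target_path)

    # Read the destination file back to confirm what was written
    result = pd.read_csv(target_path)

print(result)
```

Reading the destination file back is a cheap way to confirm that only the intended columns were loaded.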
