In [None]:
'''
Data Pipelines:

A data pipeline extracts data from a data source, applies operations such as transformations, filters, joins, or 
aggregations, and publishes the processed data to a data sink (the target system).

A data pipeline may be a simple extract-and-load process, or it may be designed to handle data in a more advanced manner, 
such as preparing training datasets for machine learning.
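
As a rough illustration of the extract / apply operations / publish flow described above, here is a minimal sketch. 
The CSV text, the column names, and the function names (extract, transform, load, run_pipeline) are assumptions made 
up for this example; a real pipeline would read from an actual source and write to an actual sink.

```python
import csv
import io

# Hypothetical in-memory "source": CSV text standing in for a file or an API response.
SOURCE_CSV = (
    "order_id,country,amount\n"
    "1,US,20.50\n"
    "2,DE,15.00\n"
    "3,US,4.25\n"
    "4,FR,-1.00\n"
)


def extract(text):
    # Extract: read rows from the source as dictionaries.
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    # Filter: drop rows with non-positive amounts.
    # Aggregate: sum the remaining amounts per country.
    totals = {}
    for row in rows:
        amount = float(row["amount"])
        if amount <= 0:
            continue
        totals[row["country"]] = totals.get(row["country"], 0.0) + amount
    return totals


def load(totals, sink):
    # Publish: append processed records to the sink (a plain list here).
    for country, total in sorted(totals.items()):
        sink.append({"country": country, "total": round(total, 2)})
    return sink


def run_pipeline(text):
    return load(transform(extract(text)), [])
```

Calling run_pipeline(SOURCE_CSV) returns one record per country that has at least one valid order; the FR row is 
filtered out because its amount is negative.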

Why is AWS Data Pipeline needed?
    Data volumes are growing rapidly. Companies of all sizes are finding that managing, processing, storing, and migrating
    data has become more complicated and time-consuming than in the past.
    Some of the issues companies face with ever-increasing data:

    Large volumes of data: There is a lot of raw, unprocessed data, such as:
        Log files
        Demographic data
        Data collected from sensors
        Transaction histories, and more

    Different data stores: 
        There are a variety of data storage options. 
        Companies may have their own data warehouse, cloud-based storage such as Amazon S3, 
            Amazon Relational Database Service (RDS), and database servers running on EC2 instances. 

    Time-consuming and costly: Managing large volumes of data is time-consuming and expensive. A lot of money is spent on 
    transforming, storing, and processing data.

In [None]:
'''
Data Ingestion:
    - The first stage of the pipeline is data ingestion. This stage runs the extractors that collect data from the 
      different sources and load it into the data lake.
      
Data Storage:
    - Once the scripts have extracted the data from the different data sources, the data is loaded into S3.
    - You can create three directories here:
        - Raw: stores data in its original form, exactly as it came from the source, without modifications.
        
        - Transformed: holds data after it has been cleaned, for example after standardizing formats and handling 
          missing values. This data is useful for data scientists.
          
        - Enriched: holds data after data enrichment, the process of appending or otherwise enhancing collected data 
          with relevant context obtained from additional sources. Analysis typically requires enriched data.
          
    - Data warehouse: once the data in your data lake has been transformed and enriched, it is time to send it 
      to a data warehouse.
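
One way to picture the directory layout above is as key prefixes inside a single bucket. The sketch below only builds 
key strings and does not talk to S3. The zone names ("raw", "transformed", "enriched"), the date-partitioned layout, 
and the helper names build_key and promote are assumptions for illustration, not a required convention.

```python
from datetime import date

# Assumed prefix names for the data lake zones, in processing order.
ZONES = ("raw", "transformed", "enriched")


def build_key(zone, source, run_date, filename):
    # Build an object key such as "raw/crm/2024/01/05/customers.csv".
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{zone}/{source}/{run_date:%Y/%m/%d}/{filename}"


def promote(key):
    # Return the key the same object would have in the next zone.
    zone, rest = key.split("/", 1)
    index = ZONES.index(zone)  # raises ValueError for an unknown zone
    if index == len(ZONES) - 1:
        raise ValueError("already in the last zone; next stop is the data warehouse")
    return f"{ZONES[index + 1]}/{rest}"
```

For example, build_key("raw", "crm", date(2024, 1, 5), "customers.csv") gives the raw location of a file, and calling 
promote on that key gives the location where its cleaned version would be written.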

Source: Data sources may include relational databases and data from SaaS applications.
Most pipelines ingest raw data from multiple sources via a push mechanism, an API call, a replication engine that pulls 
data at regular intervals, or a webhook. The data may be synchronized in real time or at scheduled intervals.

Destination: A destination may be a data store (such as an on-premises or cloud-based data warehouse, a data lake, or a 
data mart), or it may be a BI or analytics application.

Transformation: Transformation refers to operations that change data, which may include data standardization, sorting, 
deduplication, validation, and verification. The ultimate goal is to make it possible to analyze the data.
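
The transformation operations listed above can be sketched on a small list of records. This is only an illustration: 
the record fields ("name", "email"), the function name clean_records, and the very simplistic email check are 
assumptions, not part of any particular tool.

```python
def clean_records(records):
    # Apply standardization, validation, deduplication, and sorting to contact records.
    seen = set()
    cleaned = []
    for rec in records:
        # Standardization: trim whitespace and normalize case.
        email = rec.get("email", "").strip().lower()
        name = rec.get("name", "").strip().title()
        # Validation: deliberately simplistic check, for illustration only.
        if "@" not in email:
            continue
        # Deduplication: keep the first record seen for each email.
        if email in seen:
            continue
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    # Sorting: order the output by email.
    return sorted(cleaned, key=lambda r: r["email"])
```

Given two differently formatted records for the same email address, only the first one survives, and records without 
an "@" in the email are dropped.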

Processing: There are two data ingestion models: batch processing, in which source data is collected periodically and sent to 
the destination system, and stream processing, in which data is sourced, manipulated, and loaded as soon as it is created.

In [None]:
'''

Batch Processing
    - Refers to processing a high volume of data in batches within a specific time span.
    - In batch processing, blocks of data that have already been stored over a period of time are processed together.
    - Batch processing is used in payroll and billing systems, food processing systems, etc.
    - Hadoop MapReduce is a widely used framework for processing data in batches.

Stream Processing
    - Refers to processing a continuous stream of data immediately as it is produced.
    - Stream processing lets us process data in real time as it arrives and quickly detect conditions within a short 
      time of receiving the data.
    - Stream processing is used in stock markets, e-commerce transactions, social media, etc.
    - Open-source stream processing platforms include Apache Kafka, Apache Flink, Apache Storm, and Apache Samza.
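
The difference between the two models can be shown in plain Python, without any of the frameworks named above. In 
this sketch the event shape (dicts with "id" and "amount"), the threshold rule, and the function names are assumptions 
chosen for illustration. A list stands in for stored data and an iterator stands in for a live stream.

```python
def process_batch(stored_events):
    # Batch: all events are already stored, so one result is computed for the whole block.
    return sum(event["amount"] for event in stored_events)


def process_stream(event_source, threshold):
    # Stream: handle each event as it arrives and react immediately,
    # without waiting for the rest of the data.
    running_total = 0
    for event in event_source:
        running_total += event["amount"]
        if event["amount"] > threshold:
            # Emit an alert as soon as the condition is detected.
            yield ("alert", event["id"], running_total)
```

process_batch returns a single total only after it has seen every event, while process_stream is a generator: each 
alert is produced as soon as the triggering event is read, even if the source never ends.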

In [None]:
'''
Building an ETL Pipeline with Batch Processing

You process data in batches from source databases to a data warehouse.

To build an ETL pipeline with batch processing, you need to:
    - Create reference data: create a dataset that defines the set of permissible values your data may contain. 
      For example, in a country data field, specify the list of country codes allowed.
      
    - Extract data from different sources: Correct extraction is the basis for the success of subsequent ETL steps.
      Take data from a range of sources, such as APIs, relational and non-relational databases, and XML, JSON, or CSV 
      files, and convert it into a single format for standardized processing.
    
    - Validate data: Keep records whose values fall within the expected ranges and reject any that do not. 
      For example, if you only want dates from the last year, reject any values older than 12 months.
      Analyze rejected records on an ongoing basis to identify issues, correct the source data, and modify the extraction 
      process to resolve the problem in future batches.
      
    - Transform data: Remove duplicate data (cleaning), apply business rules, check data integrity (ensure that data has not 
      been corrupted or lost), and create aggregates as necessary.
      For example, if you want to analyze revenue, you can summarize the dollar amount of invoices into a daily or monthly 
      total. You need to program numerous functions to transform the data automatically.
      
    - Publish to your data warehouse: Load data to the target tables. 
      Some data warehouses overwrite existing information whenever the ETL pipeline loads a new batch - this might happen daily,
      weekly, or monthly. In other cases, the ETL workflow can add data without overwriting, including a timestamp to indicate 
      it is new.
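
The steps above can be sketched end to end with the standard library only. This is a minimal sketch under stated 
assumptions, not a production design: the invoice fields, the allowed country codes, the daily_revenue table, and 
all function names are made up for the example. An in-memory SQLite database stands in for the data warehouse, and 
"12 months" is approximated as 365 days.

```python
import csv
import io
import json
import sqlite3
from datetime import date, datetime, timedelta, timezone

# Reference data: the set of permissible country codes (illustrative values).
ALLOWED_COUNTRIES = {"US", "DE", "FR"}


def extract(csv_text, json_text):
    # Read a CSV source and a JSON source and convert both to one common format.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows += json.loads(json_text)
    return [
        {
            "invoice_id": str(r["invoice_id"]),
            "country": r["country"],
            "day": date.fromisoformat(r["day"]),
            "amount": float(r["amount"]),
        }
        for r in rows
    ]


def validate(rows, today):
    # Keep rows with an allowed country and a date within the last 365 days;
    # return rejected rows separately so they can be analyzed later.
    cutoff = today - timedelta(days=365)
    accepted, rejected = [], []
    for r in rows:
        ok = r["country"] in ALLOWED_COUNTRIES and cutoff <= r["day"] <= today
        (accepted if ok else rejected).append(r)
    return accepted, rejected


def transform(rows):
    # Remove duplicate invoices, then aggregate amounts into daily totals.
    unique = {r["invoice_id"]: r for r in rows}
    totals = {}
    for r in unique.values():
        key = r["day"].isoformat()
        totals[key] = totals.get(key, 0.0) + r["amount"]
    return sorted(totals.items())


def publish(conn, daily_totals):
    # Append the batch to the target table with a load timestamp (no overwrite).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_revenue (day TEXT, total REAL, loaded_at TEXT)"
    )
    loaded_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO daily_revenue VALUES (?, ?, ?)",
        [(day, total, loaded_at) for day, total in daily_totals],
    )
    conn.commit()
```

A batch run would call extract, validate, transform, and publish in that order, passing the run date to validate. 
Passing the date in as a parameter instead of reading the clock inside validate keeps reruns of an old batch 
reproducible. This sketch uses the append-with-timestamp variant of publishing; the overwrite variant would clear 
the table before inserting.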
      