# Modern ETL process

- Before raw data can be used for analytics, it must first be converted into a form that can be easily queried and placed into a secure, centralized location. The ETL process is designed to accomplish these tasks. While the process used to be time-consuming and cumbersome, the modern ETL pipeline has made faster and easier data processing possible. Implementing a modern ETL process has significant benefits for efficiently building data applications and empowering data-driven decision-making. 

- ETL is an acronym that represents “extract, transform, load.” During this process, data is gathered from one or more databases or other sources. The data is also cleaned, removing or flagging invalid data, and then transformed into a format that’s conducive for analysis. The cleaned and transformed data is then often loaded into a cloud data warehouse or other target data store for analysis.
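
The three steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `RAW_CSV` sample, the `orders` table, and the in-memory SQLite database standing in for a data warehouse are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """order_id,customer,amount
1001,alice,19.99
1002,bob,not_a_number
1003,carol,5.00
"""

def extract(raw_text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: drop rows with invalid amounts and normalize types."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # invalid record is removed (it could be flagged instead)
        clean.append((int(row["order_id"]), row["customer"].title(), amount))
    return clean

def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
```

Here the row with an unparseable amount never reaches the target table, because cleaning happens before loading.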

## ELT: The Future of Data Pipelines
- ETL pipelines first appeared in the 1970s, and a lot has changed since then. Today, organizations have access to more powerful ways to process and prepare data for use. Modern ELT (extract, load, transform) pipelines have significantly greater capabilities than their predecessors. With ELT, the raw data is extracted from its sources and loaded directly into the target data store. It’s then transformed as needed directly within the data store. Here are five benefits of using a modern ELT data pipeline. 
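
To contrast with ETL, the sketch below loads raw values first and transforms them afterward, inside the data store, using SQL. It is only an illustration: SQLite stands in for the target data store, and the `raw_orders` and `orders` tables and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the target data store

# Extract + Load: raw values land untouched, all as text.
raw_events = [
    ("1001", "alice", "19.99"),
    ("1002", "bob", "n/a"),
    ("1003", "carol", "5.00"),
]
conn.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_events)

# Transform: runs later, inside the data store, as a SQL statement.
conn.execute("""
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           UPPER(customer) AS customer,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount GLOB '[0-9]*.[0-9]*'
""")

elt_rows = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Because the raw table is kept, a different transformation can be run over the same data later without extracting it again.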

### Provide continuous data processing
Yesterday’s ETL pipelines worked well only with slow-moving, predictable data that fit into neat categories. Common data sources included CRMs, ERPs, and supply chain management (SCM) systems. Data gathered from these sources was typically loaded into an onsite data warehouse and stored in highly structured tables that made it easy to query using SQL and SQL-based tools. Data was typically processed in batches on a predefined schedule, resulting in data that was already hours or days old before it was ready for analysis. 

Fast forward to today. Organizations collect massive amounts of data generated from many different sources including databases, messaging systems, consumer purchasing behavior, financial transactions, and IoT devices. This data can now be captured in near real time or real time with the modern ELT pipeline, since today’s technology is capable of loading, transforming, and analyzing data as it’s created. 

### Execute with elasticity and agility
Today’s ELT pipelines rely on the power of the cloud to rapidly scale computing and storage resources to meet current data processing and analytics demands. Modern cloud data platforms offer near-infinite data processing and storage capabilities. It’s unnecessary for organizations to plan in advance to accommodate anticipated surges in resources during periods of more intensive use.  

### Use isolated, independent processing resources

Legacy ETL pipeline configurations typically used the same computing resources to process multiple workloads. Running workloads in parallel on the same resource negatively impacts performance, resulting in longer wait times. In contrast, modern ELT pipelines separate compute resources into multiple, independent clusters with each workload receiving its own dedicated resources. This setup drastically increases the speed at which data can be processed, transformed, and analyzed. The size and number of clusters can rise and fall instantly to easily accommodate current resource demands.

### Increase data access
Legacy data pipelines often relied on highly skilled data engineers to build and maintain the complex constellation of external tools required to customize the ETL process to the unique needs of the organization. The resulting IT resource bottlenecks prevented timely access to relevant data, leading to decisions based on stale data. However, modern ELT pipelines democratize data access by simplifying data processing, making the process of creating and managing data pipelines much less reliant on IT experts. This democratization of data allows business teams to self-serve, accessing and analyzing relevant data independently. For example, Snowflake is known for its ease of use for SQL users, and developers who prefer other languages, including Java, Scala, and Python, can also build with Snowflake using Snowpark.

### Are easy to set up and maintain
Legacy ETL pipelines relied on equipment and technologies that are costly to operate and maintain. To conserve computing resources, ETL projects were dependent on batch processing methods that were run during times when resource demands were low. This approach translated into data pipelines that were slow and complex. More importantly, this setup made it all but impossible for teams to access and analyze data quickly. Today’s modern ELT pipelines operate much differently. An architecture based on a cloud computing and storage solution such as Snowflake eliminates these constraints. Business teams can engage in data analysis at any time, accessing current data instantly to take advantage of time-sensitive insights.

## Snowflake and Data Integration
Snowflake supports both ETL and ELT transformation processes, providing organizations with the flexibility to customize data pipelines to meet the needs of the business. And because Snowflake pairs seamlessly with data integration tools such as Informatica, Talend, and Matillion, organizations no longer need manual ETL coding and data cleansing. As a result, data engineers are freed to spend time developing advanced data strategies and pipeline optimization. Near-limitless computing and storage resources offer immediate access to validated and prepared data.

# What is a CI/CD Pipeline?

- A continuous integration/continuous delivery (CI/CD) pipeline is a software development or engineering process that combines automated code building and testing with deployment. A CI/CD pipeline is used to deploy new and updated software safely. 

A CI/CD pipeline automates the following two processes for an end-to-end software delivery process: 

- Continuous integration for automated code building and testing. CI allows developers to submit multiple changes to a shared repository or main code branch while maintaining version control. Many software development teams are geographically dispersed or isolated, but CI enables fast development while avoiding merge conflicts, bugs, and duplication. CI always keeps the main branch up to date but can also facilitate short-term isolated side or feature branches for smaller changes that can eventually be merged into the main branch.

- Continuous delivery (or continuous deployment) for code releases. CD enables short-cycle, incremental development and allows dev teams to build and release software at any time. It also helps devops teams to reduce costs and speed up deployment for new releases. CD requires a highly repeatable structure and is often considered an extension of CI.

- A CI/CD pipeline combines code building, testing, and deployment into one continuous process ensuring that all changes to the main branch code are releasable to production. An automated CI/CD pipeline prevents manual errors, provides standardized feedback loops to developers, and enables quick software iterations.

## Benefits of a CI/CD Pipeline

CI/CD is essentially a set of best practices for software development, enabling frequent, typically small code updates and releases. It enables developers to meet business requirements while maintaining code consistency and security. 

A CI/CD pipeline automates the CI/CD process, including regression and performance testing. Builds, testing, and deployment cycles occur regularly and frequently, as often as daily or even hourly. 

CI/CD pipelines enable:

- Code quality, consistency, and security

- Frequent, iterative updates 

- Scheduling flexibility for builds and deployments

- Developer consensus and collaboration

- Versioning control with logs of changes

- Customized, automated testing and timely feedback

- Visibility into build or delivery failures via dashboards and reports

- Environmental stability with automated rollback features

- Reliable, standardized builds

## CI/CD Pipeline Workflow

A CI/CD pipeline workflow involves several stages, namely:

- Source: A code change or automated or user-initiated workflow triggers the CI/CD pipeline to run.

- Build: New code is merged with the source code and compiled or packaged into a build artifact.

- Test: Automated tests are run to validate the code and reveal bugs.

- Deploy: Code is released to staging or production environments. 
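
The stages above can be pictured as an ordered list of steps in which a failure stops everything downstream. The sketch below is a toy model of that behavior, not how any particular CI/CD tool works: `run_pipeline` and the stage functions are hypothetical, and calling `run_pipeline` stands in for the source trigger.

```python
from typing import Callable, List, Tuple

def run_pipeline(stages: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run stages in order; stop at the first failure so later stages are skipped."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'passed' if ok else 'failed'}")
        if not ok:
            break
    return log

# Hypothetical stage implementations. A real pipeline would invoke a build
# tool, an automated test suite, and a deployment command here.
def build() -> bool:
    return True

def run_checks() -> bool:
    return 2 + 2 == 4  # placeholder for an automated test suite

def run_failing_checks() -> bool:
    return False  # simulates a change that introduces a bug

def deploy() -> bool:
    return True

good_run = run_pipeline([("build", build), ("test", run_checks), ("deploy", deploy)])
bad_run = run_pipeline([("build", build), ("test", run_failing_checks), ("deploy", deploy)])
```

In the second run the deploy stage never executes, which is the property that keeps unvalidated changes out of staging and production.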

## CI/CD Pipeline Solutions

A CI/CD pipeline tool automates many steps of the CI/CD pipeline workflow, freeing developers to focus on new functionality and features. A few of the most popular CI/CD pipeline solutions include:

- Jenkins, an open-source automation server

- CircleCI, which enables automated code building, testing, and deployment

- TeamCity, a general-purpose CI/CD solution

- GitLab, a web-based tool and Git-repository manager

- Bamboo, a CI/CD tool that integrates with Jira and Bitbucket

- Microsoft Azure DevOps, a set of tools for planning, collaboration, building, and deployment

## Snowflake and CI/CD Pipelines

- Snowflake’s Data Cloud powers applications with virtually no limitations on performance, concurrency, or scale. Trusted by fast-growing software companies, Snowflake handles all the infrastructure complexity so that application developers can focus on innovation.

- Snowflake provides the perfect environment for dataops and devops, including CI/CD to accelerate software development, enhance collaboration and foster agility. Snowflake customers can industrialize data pipelines in and around Snowflake. 

- Snowpark is a developer framework for Snowflake that brings data processing and pipelines written in Python, Java, and Scala to Snowflake's elastic processing engine. Snowpark allows data engineers, data scientists, and data developers to execute pipelines feeding ML models and applications faster and more securely in a single platform using their language of choice. 

# Lambda Architecture

- Lambda architecture is a data processing deployment model that consists of a traditional batch data pipeline and a fast streaming data pipeline for handling real-time data.
- In addition to the batch and speed layers, Lambda architecture also includes a data serving layer for responding to user queries.
- This hybrid approach is designed to harness enormous volumes of rapidly created data, enabling businesses to make use of data more quickly.

## How Does Lambda Architecture Work?
Lambda architecture is complex. Its dual layers operate in tandem to make data available more quickly than the traditional batch processing approach. Here’s how the magic happens.  

- Data sources
Lambda architecture is used to quickly access real-time data for querying. In this data serving model, data is fed into the system continuously from a variety of sources. New data is fed into the batch and speed layers simultaneously. 

- Batch layer
In the batch layer, all of the incoming data is saved as batch views to ready it for indexing. This layer serves two important purposes. First, it manages the master data set where the data is immutable and append-only, preserving a trusted historical record of the incoming data from all sources. Second, it precomputes the batch views.

- Data serving layer
The data serving layer receives the batch views from the batch layer on a predefined schedule. This layer also receives the near real-time views streaming in from the speed layer. Here, the batch views are indexed to make them available for querying. As one indexing job is running, the serving layer queues newly arriving data for inclusion in the next indexing run.

- Speed layer
By design, the batch layer has a high latency, typically delivering batch views to the serving layer at a rate of once or twice per day. The job of the speed layer is to narrow the gap between when the data is created and when it’s available for querying. The speed layer does this by indexing all of the data in the serving layer’s current indexing job as well as all the data that’s arrived since the most recent indexing job began. After the serving layer completes an indexing job, all of the data included in the job is no longer needed in the speed layer and is deleted. 

- Querying
Since queryable data is stored in both the serving and speed layers, queries must be submitted to both with the results merged before being presented to end users. 
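
The interplay of the layers described above can be sketched with a simple page-view counter. This is a deliberately simplified model under stated assumptions: the batch job is treated as instantaneous, in-memory Python objects stand in for distributed storage, and all names (`ingest`, `run_batch_job`, `query`) are hypothetical.

```python
from collections import Counter

master_dataset = []        # batch layer: immutable, append-only record of all events
speed_view = Counter()     # speed layer: counts for events not yet in a batch view
serving_view = Counter()   # serving layer: the indexed batch view

def ingest(page):
    """New data is fed to the batch and speed layers simultaneously."""
    master_dataset.append(page)
    speed_view[page] += 1

def run_batch_job():
    """Recompute the batch view from the full master data set, then discard
    the speed-layer data that the new batch view now covers."""
    global serving_view
    serving_view = Counter(master_dataset)
    speed_view.clear()

def query(page):
    """Queries merge results from the serving and speed layers."""
    return serving_view[page] + speed_view[page]

for page in ["home", "home", "pricing"]:
    ingest(page)
before_batch = query("home")   # answered entirely by the speed layer
run_batch_job()
ingest("home")                 # arrives after the batch job
after_batch = query("home")    # batch view (2) plus speed layer (1)
```

Note that the counting logic appears twice, once per layer. Even in this toy version, that duplication hints at the maintenance cost of keeping two code paths in agreement.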

## Benefits of Lambda Architecture’s Data Serving 
Synthesizing the capabilities of a traditional batch pipeline and real-time stream pipeline, Lambda architecture’s data serving technique offers the best of both systems. 

- Serverless management
Lambda architecture can be run as a serverless system when it is built on managed cloud services. In that case, there’s no server software to install, update, or maintain. As a bonus, there’s very little danger of losing data even in the event of a system crash. That’s because the batch layer manages all historical data using fault-tolerant distributed storage. 

- Access data in real time 
Traditional batch processing isn’t designed to accommodate streaming data such as transactional data. But Lambda’s inclusion of a fast streaming data pipeline and data serving layer makes it possible to query data as it’s being created. 

- Scalability 
The distributed nature of Lambda architecture allows it to automatically scale to meet current business needs. Flexible, cloud-based storage doesn’t rely on the predefined computing resources inherent in on-premises server setups. 

## Drawbacks of Lambda Architecture 
Lambda architecture is a creative way to access real-time and near real-time data. But it comes with limitations. 

- Logic duplication
Supporting two separate code bases for the batch and streaming layers takes additional time and resources. The use of two distinct layers can make ongoing maintenance complex and time-consuming and complicate debugging efforts.

- Batch processing inefficiencies 
For some use cases, the need to reprocess every batch cycle is highly inefficient. Running a parallel batch and speed layer requires dedicating additional time and computing resources.

- Complexity
Lambda architecture relies on many moving pieces. The complexity of operating this data serving model presents numerous challenges. With multiple components each running different software, maintenance burdens are high. Intensive processing requirements involve complex coding while data modeled using this architecture requires extensive effort to reorganize or migrate. 

## Snowflake and Lambda
Lambda architecture has been broadly adopted as a way to run separate layers of traditional batch and streaming data pipelines. However, industry innovations in the last couple of years have made it possible to unify streaming and batch pipelines. Apache Beam, for example, can serve as an abstraction layer that handles both kinds of pipelines and processing. Snowflake’s Snowpipe and Snowpipe Streaming can also eliminate some of the need for Lambda architecture by removing the boundary between streaming and batch data. Data can be ingested in both fashions using unified data pipelines in a single system, without setting up complex pipelines and architecture. 

# Data Modeling

- Data modeling is the process of organizing and mapping data using simplified diagrams, symbols, and text to represent data associations and flow. 
- Engineers use these models to develop new software and to update legacy software.
- Data modeling also ensures the consistency and quality of data. Data modeling differs from database schemas.
- A schema is a database blueprint while a data model is an overarching design that determines what can exist in the schema.

## Benefits of Data Modeling
- Improved accuracy, standardization, consistency, and predictability of data
- Expanded access to actionable insights
- Smoother integration of data systems with less development time
- Faster, less expensive maintenance and updates of software
- Quicker identification of errors and omissions
- Reduced risk
- Better collaboration between teams, including non-developers
- Expedited training and onboarding for anyone accessing data

## Types of Approaches
There are four primary approaches to data modeling.  

1. Hierarchical
A hierarchical database model organizes data into tree-like structures with data stored as interconnected records with one-to-many arrangements. Hierarchical database models are standard in XML and GIS.  

2. Relational
A relational data model, also known as a relational model, organizes data into tables of rows and columns and provides a methodology for specifying data and queries. Most relational data models use SQL for data definition and query language.

3. Entity-relationship
Entity-relationship models use diagrams to portray data and their relationships. Integrated with relational data models, entity-relationship models graphically depict data elements to understand underlying models. 

4. Graph
Graph data models are visualizations of complex relationships within data sets that are limited by a chosen domain.
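
To make the differences concrete, the sketch below represents the same small organization three ways: hierarchically, relationally, and as a graph. Entity-relationship models are diagrammatic and are left out. All names and sample data are hypothetical, and plain Python structures stand in for real database engines.

```python
# Hierarchical: a tree with one-to-many parent/child links.
hierarchical = {
    "name": "Engineering",
    "children": [
        {"name": "Data Platform", "children": []},
        {"name": "Analytics", "children": []},
    ],
}

# Relational: flat rows of (id, name, parent_id); relationships use keys.
departments = [
    (1, "Engineering", None),
    (2, "Data Platform", 1),
    (3, "Analytics", 1),
]

# Graph: nodes plus explicit edges, which can also hold cross-links
# that a strict tree cannot represent.
edges = [
    ("Data Platform", "Engineering"),
    ("Analytics", "Engineering"),
    ("Analytics", "Data Platform"),
]

def children_from_tree(tree):
    """Walk one level down the hierarchy."""
    return [child["name"] for child in tree["children"]]

def children_from_rows(rows, parent_id):
    """Find child rows by matching on the parent key."""
    return [name for (_id, name, parent) in rows if parent == parent_id]

def neighbors(edge_list, node):
    """Return every node connected to `node`, in either direction."""
    linked = {b for a, b in edge_list if a == node}
    linked |= {a for a, b in edge_list if b == node}
    return sorted(linked)
```

The first two functions answer the same question from different structures, while `neighbors` shows the kind of many-to-many lookup that graph models are suited to.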

## Types of Data Models
There are three primary types of data models. 

1. Conceptual, defining what the data system contains, used to organize, scope, and define business concepts and rules.

2. Logical, defining how a data system should be implemented, used to develop a technical map of rules and data structures.

3. Physical, defining how the data system will be implemented according to the specific use case.
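
One way to picture the three levels is to follow a single idea through them. In the sketch below, a comment carries the conceptual statement, dataclasses express a logical model, and SQLite DDL is one possible physical model. The `Customer` and `Order` entities and their columns are hypothetical examples.

```python
import sqlite3
from dataclasses import dataclass, fields

# Conceptual: "A customer places orders." Business concepts and rules only.

# Logical: entities, attributes, and relationships, independent of any engine.
@dataclass
class Customer:
    customer_id: int
    name: str

@dataclass
class Order:
    order_id: int
    customer_id: int   # each order belongs to one customer
    amount: float

# Physical: a concrete implementation for one specific engine (SQLite here).
DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount REAL NOT NULL
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# The physical columns should line up with the logical attributes.
logical_columns = [f.name for f in fields(Order)]
physical_columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
```

Comparing `logical_columns` with `physical_columns` is a small example of the consistency check a data model makes possible.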

## Role of a Modeler

- A data modeler maps complex software system designs into easy-to-understand diagrams, using symbols and text to represent proper data flows. Data modelers often build multiple models for the same data to ensure all data flows and processes have been properly mapped. Data modelers work closely with data architects.

## Data Modeling Versus Database Architecture

- Data architecture defines a blueprint for managing data sets by aligning with organizational needs to establish data requirements and designs to meet these requirements.
- Database architecture and data modeling align when new systems are integrated into an existing system, as part of the overall architecture. With data modeling, it’s possible to compare data from two systems and integrate smoothly.

## Snowflake Data Cloud and Data Modeling
Snowflake’s platform is ANSI SQL-compliant, allowing customers to leverage a wide selection of data modeling tools tailored to specific needs and purposes. 

Snowflake has introduced several features enhancing data modeling capabilities. 

- Snowpark Enhancements:
The Snowpark ML Modeling API, now generally available, allows data modelers to use Python ML frameworks like scikit-learn and XGBoost for feature engineering and model training within Snowflake. This integration simplifies the data modeling process by enabling direct operation on the data stored in Snowflake, reducing the need for data movement.

- Advanced Analytics with Snowflake Cortex:
The new ML-based functions for forecasting and anomaly detection provide data modelers with powerful tools to perform complex analyses directly through SQL. This simplifies the process of incorporating advanced analytics into data models, making it accessible even to those with limited ML expertise.

- Developer Experience with Snowflake Python API:
In public preview, this API enhances Python's integration with Snowflake, making it easier for data modelers to manipulate and interact with data within Snowflake using familiar Python constructs.


# Data Pipeline

A data pipeline is a means of moving data from one place to a destination (such as a data warehouse) while simultaneously optimizing and transforming the data. As a result, the data arrives in a state that can be analyzed and used to develop business insights.

A data pipeline is essentially the set of steps involved in aggregating, organizing, and moving data. Modern data pipelines automate many of the manual steps involved in transforming and optimizing continuous data loads. Typically, this includes loading raw data into a staging table for interim storage and then transforming it before ultimately inserting it into the destination reporting tables.

## Benefits of a Data Pipeline
Your organization likely deals with massive amounts of data. To analyze all of that data, you need a single view of the entire data set. When that data resides in multiple systems and services, it needs to be combined in ways that make sense for in-depth analysis. Data flow itself can be unreliable: there are many points during the transport from one system to another where corruption or bottlenecks can occur. As the breadth and scope of the role data plays increases, the problems only get magnified in scale and impact.

That is why data pipelines are critical. They eliminate most manual steps from the process and enable a smooth, automated flow of data from one stage to another. They are essential for real-time analytics to help you make faster, data-driven decisions. They’re important if your organization:

- Relies on real-time data analysis
- Stores data in the cloud
- Houses data in multiple sources

By consolidating data from your various silos into one single source of truth, you are ensuring consistent data quality and enabling quick data analysis for business insights.

## Elements
Data pipelines consist of three essential elements: a source or sources, processing steps, and a destination.

1. Sources
Sources are where data comes from. Common sources include relational database management systems like MySQL, CRMs such as Salesforce and HubSpot, ERPs like SAP and Oracle, social media management tools, and even IoT device sensors.

2. Processing steps
In general, data is extracted from sources, manipulated and changed according to business needs, and then deposited at its destination. Common processing steps include transformation, augmentation, filtering, grouping, and aggregation.

3. Destination
A destination is where the data arrives at the end of its processing, typically a data lake or data warehouse for analysis.
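
The three elements can be sketched as chained Python generators, with each processing step consuming the output of the one before it. This is an illustrative toy: the sales records, the 10% tax rate, and the dictionary used as a destination are all assumptions.

```python
from collections import defaultdict

def source():
    """Source: hypothetical sales events (a database, CRM, or sensor feed in practice)."""
    yield {"region": "east", "amount": 120.0}
    yield {"region": "west", "amount": -5.0}   # invalid record
    yield {"region": "east", "amount": 30.0}
    yield {"region": "west", "amount": 50.0}

def keep_valid(records):
    """Filtering: drop records with non-positive amounts."""
    return (r for r in records if r["amount"] > 0)

def add_tax(records, rate=0.1):
    """Augmentation: derive a new field from existing ones."""
    for r in records:
        yield {**r, "tax": round(r["amount"] * rate, 2)}

def total_by_region(records):
    """Grouping and aggregation: sum amounts per region."""
    totals = defaultdict(float)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)

destination = {}   # stand-in for a data lake or warehouse table
destination.update(total_by_region(add_tax(keep_valid(source()))))
```

Because generators are lazy, each record flows through every step one at a time, so nothing requires the full data set to be held in memory until the final aggregation.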

## Data Pipeline versus ETL
Extract, transform, and load (ETL) systems are a kind of data pipeline in that they move data from a source, transform the data, and then load the data into a destination. But ETL is usually just a sub-process. Depending on the nature of the pipeline, ETL may be automated or may not be included at all. On the other hand, a data pipeline is broader in that it is the entire process involved in transporting data from one location to another.

## Characteristics of a Data Pipeline
Only robust end-to-end data pipelines can properly equip you to source, collect, manage, analyze, and effectively use data so you can generate new market opportunities and deliver cost-saving business processes. Modern data pipelines make extracting information from the data you collect fast and efficient.

Characteristics to look for when considering a data pipeline include:

- Continuous and extensible data processing
- The elasticity and agility of the cloud
- Isolated and independent resources for data processing
- Democratized data access and self-service management
- High availability and disaster recovery

## In the Cloud
Modern data pipelines can provide many benefits to your business, including easier access to insights and information, speedier decision-making, and the flexibility and agility to handle peak demand. Modern, cloud-based data pipelines can leverage instant elasticity at a far lower price point than traditional solutions. Like an assembly line for data, a data pipeline is a powerful engine that sends data through various filters, apps, and APIs, ultimately depositing it at its final destination in a usable state. Cloud-based pipelines offer agile provisioning when demand spikes, eliminate access barriers to shared data, and enable quick deployment across the entire business.

## Data Pipelines in Snowflake
- Snowpark is a developer framework for Snowflake that brings data processing and pipelines written in Python, Java, and Scala to Snowflake's elastic processing engine. Snowpark allows data engineers, data scientists, and data developers to execute pipelines feeding ML models and applications faster and more securely in a single platform using their language of choice. 

- Modern Data Engineering with Snowflake's platform allows you to use data pipelines for data ingestion into your data lake or data warehouse. Data pipelines in Snowflake can be batch or continuous, and processing can happen directly within Snowflake itself. Thanks to Snowflake’s multi-cluster compute approach, these pipelines can handle complex transformations, without impacting the performance of other workloads.

## Python For Data Engineering 

With the shift from analytics to machine learning and app development, the logic and transformations of data became more complex and required the flexibility of programming languages such as Python. Python’s inherent characteristics and the wealth of resources that have grown around it have made it the data engineer’s language of choice. Here are a few examples of how modern teams are using Python for data engineering.

- Data acquisition
Python is used extensively to gather data relevant to a project. Data engineers use Python libraries to acquire data via web scraping, interacting with the APIs many companies use to make their data available and connecting with databases.

- Data wrangling
With libraries for cleaning, transforming, and enriching data, Python helps data engineers create usable, high-quality data sets ready for analysis. Python’s powerful libraries for data sampling and visualization allow data scientists to better understand their data, helping them uncover meaningful relationships in the larger data set.

- Custom business logic
Bringing data into dashboards, machine learning models, and applications involves complex data transformations and business logic that must be defined as code. Because of the simplicity of Python, it is often used to write this logic and execute it as part of a data pipeline or data transformation, triggering actions downstream as part of a business process or an application.

- Data storage and retrieval
Python’s libraries offer solutions for accessing data stored in a variety of ways, including in SQL and NoSQL databases and cloud storage services. Thanks to these resources, Python has become critical to building data pipelines. In addition, Python is used to serialize data, making it possible to store and retrieve data more efficiently.

- Machine learning
Python is also deeply embedded into the machine learning process, finding applications in nearly every aspect of ML, including data preprocessing, model selection and training, and model evaluation. With applications for deep learning, Python provides a powerful selection of tools for building neural networks and is often used for tasks such as image classification, natural language processing, and speech recognition.
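
As a small illustration of the data wrangling and data storage and retrieval points above, the sketch below cleans a few records and round-trips them through JSON Lines using only the standard library. The sample records and the `wrangle`, `save`, and `load` helpers are hypothetical, and an in-memory buffer stands in for a file or cloud storage object.

```python
import io
import json

records = [
    {"id": 1, "name": " Alice ", "score": "91"},
    {"id": 2, "name": "bob", "score": ""},      # missing score
]

def wrangle(rows):
    """Data wrangling: trim and normalize names, convert scores, mark gaps as None."""
    return [
        {
            "id": r["id"],
            "name": r["name"].strip().title(),
            "score": int(r["score"]) if r["score"] else None,
        }
        for r in rows
    ]

def save(rows, handle):
    """Serialize one JSON document per line (JSON Lines)."""
    for r in rows:
        handle.write(json.dumps(r) + "\n")

def load(handle):
    """Deserialize the records back into Python dictionaries."""
    return [json.loads(line) for line in handle if line.strip()]

buffer = io.StringIO()          # stand-in for a file or cloud storage object
save(wrangle(records), buffer)
buffer.seek(0)
restored = load(buffer)
```

In practice, libraries such as pandas cover the same steps with far less code. The standard library is used here only to keep the example dependency-free.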

### Popular Python Libraries for Data Engineering
One of Python’s primary advantages for data engineering tasks is its extensive ecosystem of libraries. These libraries provide data engineers with a wide range of tools to work with, helping them manipulate, transform, and store data faster and more effectively. From small data projects to large-scale data pipelines, these six popular Python libraries streamline data engineering tasks:

1. Pandas
The pandas library is one of the most frequently used libraries for data engineering in Python. This versatile library equips data engineers with powerful manipulation and analysis capabilities. Pandas is used to preprocess, clean, and transform raw data for downstream analysis or storage.

2. Apache Airflow
Apache Airflow serves as a platform for data engineers to author, schedule, and monitor workflows. It provides an easily accessible, intuitive interface data engineers can use to create, schedule, and execute multiple tasks, as well as manage complex data processing pipelines.

3. Pyparsing
Pyparsing is a Python class library that eliminates the need to manually craft a parsing state machine. Pyparsing allows data engineers to build recursive descent parsers quickly.

4. TensorFlow
TensorFlow is a popular Python library for machine learning and deep learning applications that provides a versatile platform for training and serving models. One of TensorFlow’s primary value adds is its ability to handle large-scale data processing and modeling tasks, including data preprocessing, data transformation, data analytics, and data visualization.

5. Scikit-learn
Built atop libraries including NumPy and SciPy, the scikit-learn library offers data engineers a broad selection of machine learning algorithms and utilities for working with structured data. Data engineers use scikit-learn for tasks such as data classification, regression, clustering, and feature engineering to streamline the process of building machine learning models and pipelines.

6. Beautiful Soup
Beautiful Soup is one of the most effective tools for web scraping and data extraction, making it a valuable asset for data engineering. Using Beautiful Soup, data engineers can easily parse HTML and XML documents, extract specific data from websites and web pages (for example, text, images, links, and metadata), and quickly navigate document trees.

### Python for Data Engineering Use Cases
Python can be used for myriad data engineering tasks. The following three use cases highlight how today’s teams are using Python to solve real-world data engineering challenges.

- Real-time data processing
Python effectively powers stream processing, a data management technique where data is ingested, analyzed, filtered, transformed, and enhanced in real time. Using Python, stream processing enables teams to generate insights from data as it’s being created with direct application to marketing, fraud detection, and cybersecurity use cases.

- Large-scale data processing
Python is one of the most popular languages for processing data at scale. Its simplicity, scalability, and efficiency make it ideal for processing massive amounts of data at speed. This is why it’s commonly used for data pipelines and machine learning applications.

- Data pipeline automation
By removing manual processes, data pipeline automation facilitates the free flow of data, reducing the time to value. Python’s deep bench of libraries and tools makes it easy to automate the data pipeline process, including data acquisition, cleaning, transformation, and loading.
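
The real-time processing use case above can be sketched with a generator that evaluates each event as it arrives, in the spirit of a simple fraud check. This is an assumption-laden toy: the window size, the threshold factor, and the sample amounts are illustrative, and a Python iterator stands in for a live feed.

```python
from collections import deque

def flag_spikes(amounts, window=3, factor=3.0):
    """Yield (amount, is_suspicious) as each event arrives.

    An event is flagged when it exceeds `factor` times the average of the
    previous `window` events. Both thresholds are illustrative assumptions.
    """
    recent = deque(maxlen=window)
    for amount in amounts:
        baseline = sum(recent) / len(recent) if recent else None
        suspicious = baseline is not None and amount > factor * baseline
        yield amount, suspicious
        recent.append(amount)

stream = iter([20.0, 25.0, 22.0, 400.0, 21.0])   # stand-in for a live feed
flags = [suspicious for _, suspicious in flag_spikes(stream)]
```

Only the fourth event is flagged, since it is far above the running average of the three events before it. The function never needs the whole stream in memory, which is what lets the same logic run against an unbounded feed.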

### Streamline Your Python Data Engineering Workflows with Snowflake
Today, Python occupies a prominent place in the data engineering field. Ideal for working with data at scale, this programming language is helping data engineers prepare their data for a number of analytical and operational uses. Snowflake makes it possible to accelerate data engineering workflows when using Python and other popular languages.

With Snowpark, the new developer experience for Snowflake, data engineers can write code in their preferred language and run code directly on Snowflake. Snowpark accelerates the pace of innovation by leveraging Python’s familiar syntax and thriving ecosystem of open-source libraries to explore and process data where it lives. Build powerful and efficient pipelines, machine learning (ML) workflows, and data applications while enjoying the performance, ease of use, and security of working inside the Snowflake Data Cloud.