# Azure for the Data Engineer

# Understand the Evolving World of Data

- Learn the key factors that are driving changes in data generation, roles, and technologies.
- Compare the differences between on-premises data technologies and cloud data technologies.
- Outline how the role of the data professional is changing in organizations.
- Identify use cases that involve these changes.

## Data Abundance
- Users view data on PCs, tablets, and mobile devices.
- Data forms include text, stream, audio, video, and metadata. Data can be structured, unstructured, or aggregated.
- Organizations must maintain data systems that are accurate, highly secure, and constantly available. The systems must comply with applicable regulations.

## Cloud vs. On-Premises Servers
### On-Premises
- requires physical servers
- individual OS licenses and client access licenses (CALs)
- self-maintained
- scale through adding more servers
- higher uptime (Availability) increases cost and complexity
- TCO
  - Hardware
  - Software licensing
  - Labor (installation, upgrades, maintenance)
  - Datacenter overhead (power, telecommunications, building, heating and cooling)
### Cloud
- no physical devices
- no licensing or CALs
- maintenance included
- scale automatically
- high uptime (availability) is provided through built-in redundancy and backed by SLAs
- multilingual support built in

## Job Responsibilities

### Data Engineer
- Traditionally builds ETL pipelines into a data warehouse (DW)
- Now focuses on ELT and data lakes
- Now provisions servers as needed instead of implementing new servers

## Use Cases for Cloud
- Web
- healthcare
- IoT solutions

## Section Questions
- Which data processing framework will a data engineer use to ingest data onto cloud data platforms in Azure?
  - ELT
- The schema of what data type can be defined at query time?
  - unstructured data
- Duplicating customer content for redundancy and meeting service-level agreements (SLAs) in Azure meets which cloud technical requirement?
  - High Availability

# Survey the Services on the Azure Data Platform

- Contrast structured data with nonstructured data.
- Explore common Azure data platform technologies and identify when to use them.
- List additional technologies that support the common data platform technologies.

## Explore Data Types
### Structured data
In relational database systems, the data structure is defined at design time, in the form of tables.
### Nonstructured data
Examples include binary, audio, and image files.  
Nonstructured data is stored in nonrelational systems, commonly called unstructured or NoSQL systems. 

### 4 types of NoSQL databases:

- Key-value store: Stores key-value pairs of data in a table structure.
- Document database: Stores documents that are tagged with metadata to aid document searches
- Graph database: Finds relationships between data points by using a structure that's composed of vertices and edges.
- Column database: Stores data based on columns rather than rows. Columns can be defined at the query's runtime, allowing flexibility in the data that's returned performantly.
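
As a rough illustration of how these four models differ, the sketch below represents the same kind of customer data in each shape using plain Python structures. The names (`kv_store`, `documents`, `edges`, `columns`) and the sample records are hypothetical and do not correspond to any Azure API.

```python
# Key-value store: an opaque value looked up by a single key.
kv_store = {"customer:1": "Ada", "customer:2": "Grace"}

# Document database: self-describing documents; fields may vary per document.
documents = [
    {"id": 1, "name": "Ada", "tags": ["vip"]},
    {"id": 2, "name": "Grace", "city": "Arlington"},
]

# Graph database: vertices connected by labeled edges.
edges = [("Ada", "knows", "Grace"), ("Grace", "knows", "Linus")]

# Column database: values stored per column rather than per row.
columns = {"id": [1, 2], "name": ["Ada", "Grace"]}


def find_documents(docs, field, value):
    """Return documents whose field equals value; missing fields are skipped."""
    return [d for d in docs if d.get(field) == value]


def neighbors(edge_list, vertex, label):
    """Return vertices reachable from vertex over edges with the given label."""
    return [dst for src, lbl, dst in edge_list if src == vertex and lbl == label]


def project(cols, names):
    """Read only the requested columns, chosen at query time."""
    return {name: cols[name] for name in names}
```

The helper functions hint at the access pattern each model favors: lookup by key, search by document field, traversal along edges, and reading a subset of columns.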

## Understand Data Storage in Azure Storage
### 4 configuration options:

- Azure Blob: A scalable object store for text and binary data
- Azure Files: Managed file shares for cloud or on-premises deployments
- Azure Queue: A messaging store for reliable messaging between application components
- Azure Table: A NoSQL store for no-schema storage of structured data

### When to use Blob storage
- Provision a data store that will store but not query data
- Cheapest option
- Blob storage works well with unstructured data

### Key Features
- scalable and secure
- durable
- highly available
- handles hardware maintenance, updates, and critical issues
- provides REST APIs and SDKs 
- supports .NET, Node.js, Java, Python, PHP, Ruby, and Go
- supports scripting in Azure PowerShell and the Azure CLI

### Data Ingestion

- Azure Data Factory
- Azure Storage Explorer
- Apache Sqoop
- the AzCopy tool
- PowerShell
- Visual Studio
- To import files larger than 2 GB with the File Upload feature, use PowerShell or Visual Studio
- AzCopy supports a maximum file size of 1 TB and automatically splits data files that exceed 200 GB

### Data Security
- Azure Storage encrypts all data that's written to it
- Azure Storage secures data by using keys or shared access signatures
- Azure Resource Manager provides a permissions model that uses role-based access control (RBAC)

## Understanding Azure Data Lake Storage

- Hadoop-compatible data repository
- Store any size or type of data
- Available as Generation 1 (Gen1) and Generation 2 (Gen2)

### Gen2
- Azure Blob storage
- hierarchical file system
- performance tuning
- access data through 
  - Blob API
  - Data Lake file API
- Storage layer for Databricks, Hadoop, HDInsight


### When Data Lake
Data Lake Storage is designed to store massive amounts of data for big-data analytics.

### DataLake Features

- Unlimited scalability
- Hadoop compatibility
- Security support for both access control lists (ACLs) and POSIX compliance
- An optimized Azure Blob File System (ABFS) driver that's designed for big-data analytics
- Zone-redundant storage
- Geo-redundant storage

### DataLake Queries
- Gen1: data engineers query data by using the U-SQL language
- Gen2: use the Azure Blob Storage API or the Azure Data Lake Storage (ADLS) API

### DataLake Security
- Azure Active Directory ACLs - Active Directory Security Groups
- Role-based access control (RBAC) in Gen1 and Gen2
  - Built-in security groups include ReadOnlyUsers, WriteAccessUsers, and FullAccessUsers
- Data Lake Storage automatically encrypts data at rest

## Cosmos DB
- globally distributed, multimodel database  
- Deploy with API models
  - SQL API
  - MongoDB API (unstructured)
  - Cassandra API (Wide columns)
  - Gremlin API (Graph database)
  - Table API

### When Cosmos DB
- Need a NoSQL database of the supported API model
  - at planet scale
  - with low latency performance
- Currently, Azure Cosmos DB supports five-nines uptime (99.999 percent). It can support response times below 10 ms when it's provisioned correctly.
- Consistency levels include strong, bounded staleness, session, consistent prefix, and eventual.
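
To make the uptime figure concrete, the arithmetic below converts an availability percentage into a yearly downtime budget. It is a back-of-the-envelope sketch that assumes a 365-day year; the function name `allowed_downtime_minutes` is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

Under that assumption, five nines leaves a budget of roughly five minutes of downtime per year, compared with almost nine hours at 99.9 percent.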

### Ingest Data into Cosmos 
With Azure Data Factory

### Queries
- stored procedures
- triggers
- user-defined functions (UDFs)
- JavaScript query API
- other APIs
  - Data Explorer component uses the graph visualization panel

### Security
- data encryption
- IP firewall configurations
- access from virtual networks
- Data is encrypted automatically
- User authentication is based on tokens
- Azure AD provides role-based security
- Security compliance certifications include HIPAA, FedRAMP, SOX, and HITRUST

## Understand Azure SQL Database
- managed relational database service
- supports structures such as relational data and unstructured formats
  - spatial
  - XML data
- provides online transaction processing (OLTP) that can scale on demand
- comprehensive security and availability

### When Azure SQL Database
- need to scale up and scale down OLTP systems on demand
- take advantage of Azure security and availability features
- flexible - provision in minutes
- backed by Azure SLA

### Key Features
- predictable performance
- minimal admin
- dynamic scalability
- optimization
- high availability
- advanced security

### Ingesting and Processing Data
- application integration from developer SDKs
  - .NET
  - Node.js
  - Python
  - Java
- Transact-SQL (T-SQL) techniques 
- Azure Data Factory

### Queries
With T-SQL

### Security
- Advanced Threat Protection
- SQL Database auditing
- Data encryption
- Azure Active Directory authentication
- Multi-factor authentication
- Compliance certification

## Azure Synapse Analytics
A cloud-based data platform that brings together enterprise data warehousing and big data analytics.

### When Azure Synapse
- reduced processing times
- release reports faster

### Key Features
- SQL pools use massively parallel processing (MPP) across petabytes
- scale compute separately
- Data Movement Service (DMS) to balance compute (can pause and resume - pay as you use)
- Distributed tables
  - hash
  - round-robin
  - replicated
  - these tables tune performance
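
The three distribution strategies can be sketched in plain Python to show how each one places rows. This is only an illustration: the `slots` parameter, the SHA-256 hash, and the function names are assumptions for this example, and the real service uses its own internal hashing and placement.

```python
import hashlib
from itertools import cycle


def hash_distribute(rows, key, slots):
    """Place each row by hashing one column, so equal keys share a slot."""
    placement = {}
    for row in rows:
        digest = hashlib.sha256(str(row[key]).encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % slots
        placement.setdefault(slot, []).append(row)
    return placement


def round_robin_distribute(rows, slots):
    """Deal rows out evenly in turn, ignoring their content."""
    placement = {}
    for slot, row in zip(cycle(range(slots)), rows):
        placement.setdefault(slot, []).append(row)
    return placement


def replicate(rows, slots):
    """Keep a full copy of a (small) table in every slot."""
    return {slot: list(rows) for slot in range(slots)}


# Hypothetical sample data: 20 sales rows spread over 5 customers.
rows = [{"customer_id": i % 5, "amount": i} for i in range(20)]
by_hash = hash_distribute(rows, "customer_id", slots=60)
by_turn = round_robin_distribute(rows, slots=4)
copies = replicate(rows[:3], slots=4)
```

Hashing keeps all rows for one customer together, which helps joins and aggregations on that column. Round-robin spreads load evenly without regard to content, and replication trades storage for avoiding data movement on small lookup tables.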

### Ingesting and Processing Data
- ELT process
  - bcp
  - SQLBulkCopy API
- PolyBase
  - lowers complexity
  - applies stored procedures
  - labels
  - views

### Queries
T-SQL

### Security
- SQL Server and Azure AD authentication
- row and column level security

## Azure Stream Analytics
Applications, sensors, monitoring devices, and gateways broadcast continuous event data known as data streams. Streaming data is high volume and has a lighter payload than nonstreaming systems.

### When Stream Analytics
- respond to data events in real time
- analyze large batches in continuous time-bound stream
- events flow into an event hub or IoT hub
  - then are streamed to Stream Analytics
- can't wait for batch systems
  - autonomous driving

### Ingest Data
- Azure Event Hubs
  - big data streaming
  - large customer request volume
  - push to Databricks, Stream Analytics, Data Lake Storage, HDInsight
- Azure IoT Hub
  - bidirectional - data in and commands/policy back to IoT devices
- Azure Blob Storage
  - batch processing

### Process Data
- Inputs 
  - Event Hubs
  - IoT Hubs
  - Azure Storage
- Outputs
  - Azure Blob
  - Azure SQL Database
  - Azure Data Lake Storage
  - Azure Cosmos DB
- Analysis after storing
  - Run batch analytics in Azure HDInsight
  - Or send the output to a service (e.g. Event Hubs) for consumption
  - Or use the Power BI streaming API to send the output to Power BI for real-time visualization.

### Queries
- Stream Analytics language
  - SQL plus complex temporal queries

### Security
- transport layer
- data discarded after use
- use storage method to store encrypted data



## HDInsight
Azure HDInsight provides technologies to help you ingest, process, and analyze big data. It supports batch processing, data warehousing, IoT, and data science.

### Key Features
- low-cost cloud solution
- includes 
  - Apache Hadoop
    - Hadoop includes 
      - Apache Hive, HBase, Spark, and Kafka. 
      - Hadoop stores data in a file system (HDFS)
  - Spark
    - Spark stores data in memory
    - in-memory processing can make Spark up to 100 times faster than Hadoop MapReduce for some workloads
  - HBase, Kafka, and Interactive Query
    - HBase is a NoSQL database built on Hadoop
  - Storm
    - distributed real-time streaming analytics solution.
  - Kafka
    - open-source platform that's used to compose data pipelines
    - offers message queue functionality,
    - allows publish or subscribe to real-time data streams.

### Ingesting Data
use Hive to run ETL operations or orchestrate Hive queries in ADF

### Data Processing
- Hadoop
  - Python
  - Java
  - Mapper consumes and analyzes input data
  - Reducer runs summary operations to reduce the result set
- Spark
  - Spark Streaming
  - Anaconda Python libraries
  - GraphX
- Storm
  - Java
  - C#
  - Python

### Queries
- Hadoop
  - Pig
  - HiveQL
- Spark
  - Spark SQL

### Security
- encryption
- SSH
- shared access signatures
- Azure AD

## Other Data Services

### Databricks
- serverless platform
- one-click setup
- streamlined workflows
- interactive workspace
- Spark based
- REST APIs to program Spark clusters
- Notebooks
  - R, Python, Scala, SQL

### Data Factory
Use Data Factory to organize raw data into meaningful data stores and data lakes so your organization can make better business decisions  
- cloud integration service
- orchestrates movement between data stores
- create and schedule pipelines
- Compute with
  - HDInsight
  - Hadoop
  - Spark
- Publish to 
  - Synapse Analytics
  - data stores

### Azure Purview
- a unified data governance service that helps you manage and govern your on-premises, multicloud, and software-as-a-service (SaaS) data
- create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage

## Section Questions
- Which data platform technology is a globally distributed, multimodel database that can perform queries in less than a second?
  - Azure Cosmos DB
- Which data store is the least expensive choice when you want to store data but don't need to query it?
  - Azure Storage
- Which Azure service is the best choice to store documentation about a data source?
  - Azure Purview

# Data Engineer Tasks in Cloud-Hosted Architecture
- List the work roles involved in modern data projects
- Outline data engineering practices
- Explore the high-level process for designing a data engineering project

## Job Roles
### Data Engineer
- provision and set up platform technologies
- manage data flows
  - databases (relational or non)
  - data streams
  - file stores
- integrate data services
- ingest, egress, transform data from multiple sources
- collaborate with stakeholders
- manage security
- data wrangling
  - ingest, egress, transform, validate, and clean data

### Data Scientist
- advanced analytics to extract value
- descriptive, predictive, prescriptive analytics
- forecast models
- deep learning
- build models

### AI Engineer
- work with cognitive services, cognitive search, bot framework
- computer vision, text analysis
- use APIs over models
- add the intelligent capabilities of vision, voice, language, and knowledge to applications. 

## Data Engineering Processes
- Design and develop data storage and data processing solutions for the enterprise
- Set up and deploy cloud-based data services such as blob services, databases, and analytics
- Secure the platform and the stored data
- Ensure business continuity in uncommon conditions by using techniques for high availability and disaster recovery
- Monitor to ensure that the systems run properly and are cost-effective.

## Move Data Around
- Extract
  - define data source (resource group, subscription, identity)
  - define the data (query, files, blob storage)
- Transform
  - define transformation (splitting, combining, deriving, adding, removing, pivoting, aggregating, merging)
- Load
  - define the destination (JSON, file, blob, API)
    - writing to an API may require code; supported languages include Node.js, .NET, Python, and Java
    - Extensible Markup Language (XML) was common in the past
    - most now use JSON because of its flexibility as a semistructured data type
  - start the job (test and run)
  - monitor the job (ensure it completes or troubleshoot and rerun)
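
The extract, transform, and load steps above can be sketched end to end with the Python standard library. The CSV source, the `region` and `amount` columns, and the function names are hypothetical; a real pipeline would read from and write to actual data stores, typically orchestrated by a tool such as Azure Data Factory.

```python
import csv
import io
import json


def extract(csv_text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: clean region names and aggregate the amount per region."""
    totals = {}
    for row in rows:
        region = row["region"].strip().title()
        totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return [{"region": r, "total": t} for r, t in sorted(totals.items())]


def load(records):
    """Load: serialize the result as JSON for the destination."""
    return json.dumps(records)


source = "region,amount\nwest,10.5\n West ,4.5\neast,7\n"
print(load(transform(extract(source))))
```

Keeping each step as its own function makes it easier to test a stage in isolation and to rerun only the stage that failed.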

### ETL Tools
- Azure Data Factory
- Azure Purview for data source info and data dictionaries

### Evolution from ETL
- Move to ELT
- Stores data in original format
  - reduces load time

### Holistic Data Engineering
- Phases of Data Projects
  - Source: data systems to extract
  - Ingest: identify tech and methods
  - Prepare: identify tech and methods
  - Analyze: identify tech and method
  - Consume: identify tech and method
- Not necessarily a linear flow between phases

## Analyze Data Engineer Tasks
### Five Phases
- source
- ingest
- prepare
- analyze
- consume

## Section Questions
- Which role works with Azure Cognitive Services, Cognitive Search, and the Bot Framework?
  - AI Engineer
- Which Azure data platform is commonly used to process data in an ELT framework?
  - Azure Data Factory