# Azure for the Data Engineer

# Understand the Evolving World of Data

- Learn the key factors that are driving changes in data generation, roles, and technologies.
- Compare the differences between on-premises data technologies and cloud data technologies.
- Outline how the role of the data professional is changing in organizations.
- Identify use cases that involve these changes.

## Data Abundance
- Data can be viewed on PCs, tablets, and mobile devices
- Data forms include text, stream, audio, video, and metadata. Data can be structured, unstructured, or aggregated.
- Data engineers must maintain data systems that are accurate, highly secure, and constantly available. The systems must also comply with applicable regulations

## Cloud vs. On-Premises Servers
### On-Premises
- requires physical servers
- individual CALs and OS
- self-maintained
- scale through adding more servers
- higher uptime (Availability) increases cost and complexity
- TCO
  - Hardware
  - Software licensing
  - Labor (installation, upgrades, maintenance)
  - Datacenter overhead (power, telecommunications, building, heating and cooling)
### Cloud
- no physical devices
- no licensing or CALs
- maintenance included
- scale automatically
- high availability is built in: Azure duplicates customer content for redundancy and backs services with SLAs
- multilingual support built in

## Job Responsibilities

### Data Engineer
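- Traditionally builds ETL (extract, transform, load) processes that feed a data warehouse
- Now focuses on ELT (extract, load, transform) and data lakes
- Now provisions server capacity as needed instead of implementing new servers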
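
The difference between ETL and ELT is mostly one of ordering. The sketch below is a toy illustration only: plain Python lists stand in for a data warehouse and a data lake, and every name in it (`raw_rows`, `transform`, `etl`, `elt`) is hypothetical rather than part of any Azure service.

```python
# Toy illustration of ETL vs. ELT ordering.
# Plain lists stand in for a warehouse and a lake (hypothetical names).

raw_rows = [
    {"id": 1, "amount": "10.50"},
    {"id": 2, "amount": "n/a"},   # malformed record
    {"id": 3, "amount": "7.25"},
]


def transform(rows):
    """Keep rows whose amount parses as a number, converting it to float."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"id": row["id"], "amount": float(row["amount"])})
        except ValueError:
            continue  # drop rows that cannot be parsed
    return cleaned


def etl(rows):
    """Extract, transform, then load: only cleaned data reaches the store."""
    warehouse = []
    warehouse.extend(transform(rows))
    return warehouse


def elt(rows):
    """Extract, load, then transform: raw data lands first, is shaped later."""
    lake = list(rows)              # load everything as-is
    return lake, transform(lake)   # transform on demand; raw copy is retained


warehouse = etl(raw_rows)
lake, view = elt(raw_rows)
```

With ELT the malformed record is still available in `lake` for later inspection, whereas with ETL it never reaches `warehouse`.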

## Use Cases for Cloud
- Web
- healthcare
- IoT solutions

## Section Questions
- Which data processing framework will a data engineer use to ingest data onto cloud data platforms in Azure?
  - ELT
- The schema of what data type can be defined at query time?
  - unstructured data
- Duplicating customer content for redundancy and meeting service-level agreements (SLAs) in Azure meets which cloud technical requirement?
  - High Availability

# Survey the Services on the Azure Data Platform

- Contrast structured data with nonstructured data.
- Explore common Azure data platform technologies and identify when to use them.
- List additional technologies that support the common data platform technologies.

## Explore Data Types
### Structured data
In relational database systems data structure is defined at design time. Data structure is designed in the form of tables.
### Nonstructured data
Examples include binary, audio, and image files.  
Nonstructured data is stored in nonrelational systems, commonly called unstructured or NoSQL systems. The schema of nonstructured data can be defined at query time.
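
A small sketch can make the contrast concrete. It assumes SQLite (from Python's standard library) as a stand-in for a relational system and a list of JSON strings as a stand-in for a nonrelational store; the table, field names, and sample records are invented for illustration.

```python
import json
import sqlite3

# Structured: the schema is fixed at design time, before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (1, "Ada"))
structured = conn.execute("SELECT name FROM customer WHERE id = 1").fetchone()[0]
conn.close()

# Nonstructured: documents are stored as-is; each may have a different shape.
stored_documents = [
    '{"id": 1, "name": "Ada", "tags": ["vip"]}',
    '{"id": 2, "title": "invoice.pdf", "bytes": 20480}',
]


def names_at_query_time(documents):
    """Apply a schema while reading: pick 'name' only where it exists."""
    parsed = (json.loads(doc) for doc in documents)
    return [doc["name"] for doc in parsed if "name" in doc]


nonstructured = names_at_query_time(stored_documents)
```

The table rejects rows that do not fit its columns, while the document list accepts anything and leaves interpretation to the reader of the data.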

### 4 types of NoSQL databases:

- Key-value store: Stores key-value pairs of data in a table structure.
- Document database: Stores documents that are tagged with metadata to aid document searches
- Graph database: Finds relationships between data points by using a structure that's composed of vertices and edges.
- Column database: Stores data based on columns rather than rows. Columns can be defined at the query's runtime, allowing flexibility in the data that's returned performantly.
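
The four models above can be pictured with ordinary Python data structures. This is only a mental model under simplifying assumptions, not how any particular NoSQL engine stores data; all sample keys, documents, and helper names are hypothetical.

```python
# Key-value store: values looked up by key.
key_value = {"session:42": "user=ada;ttl=3600"}

# Document database: documents tagged with metadata to aid searches.
documents = [
    {"_id": "d1", "type": "invoice", "tags": ["2023", "paid"], "total": 99.0},
    {"_id": "d2", "type": "receipt", "tags": ["2023"], "total": 12.5},
]

# Graph database: vertices and the edges that relate them.
vertices = {"ada", "grace", "linus"}
edges = [("ada", "knows", "grace"), ("grace", "knows", "linus")]

# Column database: values grouped by column rather than by row.
columns = {
    "id": [1, 2, 3],
    "city": ["Oslo", "Lima", "Pune"],
    "temp_c": [4, 19, 31],
}


def find_by_tag(docs, tag):
    """Document search driven by metadata tags."""
    return [d["_id"] for d in docs if tag in d["tags"]]


def neighbours(edge_list, vertex):
    """Follow outgoing edges from one vertex."""
    return [dst for src, _label, dst in edge_list if src == vertex]


def select_columns(table, names):
    """Return only the columns requested at query time."""
    return {name: table[name] for name in names}
```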

## Understand Data Storage in Azure Storage
### 4 configuration options:

- Azure Blob: A scalable object store for text and binary data
- Azure Files: Managed file shares for cloud or on-premises deployments
- Azure Queue: A messaging store for reliable messaging between application components
- Azure Table: A NoSQL store for schemaless storage of structured data

### When to use Blob storage
- Provision a data store that will store but not query data
- Cheapest option
- Blob storage works well with unstructured data

### Key Features
- scalable and secure
- durable
- highly available
- handles hardware maintenance, updates, and critical issues
- provides REST APIs and SDKs 
- supports .NET, Node.js, Java, Python, PHP, Ruby, and Go
- supports scripting in Azure PowerShell and the Azure CLI

### Data Ingestion

- Azure Data Factory
- Azure Storage Explorer
- Apache Sqoop
- the AzCopy tool
- PowerShell
- Visual Studio
- The File Upload feature suits smaller files; to import files larger than 2 GB, use PowerShell or Visual Studio
- AzCopy supports a maximum file size of 1 TB and automatically splits data files that exceed 200 GB
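
The size thresholds in these notes can be turned into a quick lookup. The helper below is a hypothetical sketch: `upload_advice` and the constants are invented names, the limits are copied from the notes rather than checked against current Azure documentation, and binary units (1 GB = 1024³ bytes) are an assumption.

```python
GB = 1024 ** 3  # assumption: binary units
TB = 1024 ** 4

FILE_UPLOAD_LIMIT = 2 * GB          # above this, use PowerShell or Visual Studio
AZCOPY_SPLIT_THRESHOLD = 200 * GB   # AzCopy splits files larger than this
AZCOPY_MAX_FILE = 1 * TB            # largest file AzCopy handles, per the notes


def upload_advice(size_bytes):
    """Map a file size to the ingestion option suggested by the notes."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes > AZCOPY_MAX_FILE:
        return "over AzCopy 1 TB limit"
    if size_bytes > AZCOPY_SPLIT_THRESHOLD:
        return "AzCopy (file will be split)"
    if size_bytes > FILE_UPLOAD_LIMIT:
        return "PowerShell, Visual Studio, or AzCopy"
    return "File Upload feature"
```

For example, `upload_advice(5 * GB)` falls in the range where the File Upload feature is no longer recommended but AzCopy does not yet split the file.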

### Data Security
- Azure Storage encrypts all data that's written to it
- Azure Storage secures data by using keys or shared access signatures
- Azure Resource Manager provides a permissions model that uses role-based access control (RBAC)
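
The idea behind a shared access signature is a signed, time-limited grant: the holder of the secret key signs a description of what is allowed, and the service can later verify that signature without the caller ever seeing the key. The sketch below shows that general idea with an HMAC; it is deliberately simplified and is not Azure's actual string-to-sign or token format. The key, resource path, and function names are hypothetical.

```python
import base64
import hashlib
import hmac


def sign(account_key: bytes, resource: str, permissions: str, expiry: int) -> str:
    """Sign an access grant with the secret key (simplified, NOT Azure's format)."""
    message = f"{resource}\n{permissions}\n{expiry}".encode("utf-8")
    digest = hmac.new(account_key, message, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid(account_key: bytes, resource: str, permissions: str,
             expiry: int, signature: str, now: int) -> bool:
    """Accept the grant only if it has not expired and the signature matches."""
    if now > expiry:
        return False
    expected = sign(account_key, resource, permissions, expiry)
    return hmac.compare_digest(expected, signature)


key = b"demo-key-not-a-real-secret"   # hypothetical key for illustration
expiry = 1_700_000_000                # Unix timestamp, chosen arbitrarily
token = sign(key, "/container/report.csv", "r", expiry)
```

Changing the permissions from `"r"` to `"rw"` without re-signing makes `is_valid` return `False`, which is what prevents a client from widening its own access.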

## Understanding Azure Data Lake Storage

- Hadoop-compatible data repository
- Store any size or type of data
- Available as Generation 1 (Gen1) or Generation 2 (Gen2)

### Gen2
- Azure Blob storage
- hierarchical file system
- performance tuning
- access data through 
  - Blob API
  - Data Lake file API
- Storage layer for Databricks, Hadoop, HDInsight
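
Because Gen2 is built on Blob storage, the same object can be addressed through more than one endpoint. The helper below sketches the usual address shapes for the Blob endpoint, the Data Lake (`dfs`) endpoint, and the ABFS driver URI used by Hadoop-compatible engines. The function name and the sample account, container, and path values are hypothetical.

```python
def gen2_addresses(account, container, path):
    """Build the common address forms for one object in a Gen2 account."""
    path = path.lstrip("/")
    return {
        # Blob API endpoint
        "blob_api": f"https://{account}.blob.core.windows.net/{container}/{path}",
        # Data Lake file API endpoint
        "data_lake_api": f"https://{account}.dfs.core.windows.net/{container}/{path}",
        # URI form used by the ABFS driver (TLS variant)
        "abfs_driver": f"abfss://{container}@{account}.dfs.core.windows.net/{path}",
    }


addresses = gen2_addresses("examplelake", "raw", "/sales/2023/orders.csv")
```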


### When Data Lake
Data Lake Storage is designed to store massive amounts of data for big-data analytics.

### Data Lake Features

- Unlimited scalability
- Hadoop compatibility
- Security support for access control lists (ACLs)
- POSIX compliance
- An optimized Azure Blob File System (ABFS) driver that's designed for big-data analytics
- Zone-redundant storage
- Geo-redundant storage

### Data Lake Queries
- Gen1: data engineers query data by using the U-SQL language
- Gen2: use the Azure Blob Storage API or the Azure Data Lake Storage (ADLS) API

### Data Lake Security
- Azure Active Directory ACLs, set up by using Active Directory security groups
- Role-based access control (RBAC) in Gen1 and Gen2
  - Built-in security groups include ReadOnlyUsers, WriteAccessUsers, and FullAccessUsers
- Data Lake Storage automatically encrypts data at rest

## Cosmos DB
- globally distributed, multimodel database  
- Deploy with API models
  - SQL API
  - MongoDB API (document database)
  - Cassandra API (Wide columns)
  - Gremlin API (Graph database)
  - Table API

### When Cosmos DB
- Need a NoSQL database of the supported API model
  - at planet scale
  - with low latency performance
- Currently, Azure Cosmos DB supports five-nines uptime (99.999 percent). It can support response times below 10 ms when it's provisioned correctly.
- Consistency levels include strong, bounded staleness, session, consistent prefix, and eventual.
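
Two of the figures above are easier to reason about with a little arithmetic. The sketch below converts an availability percentage into allowed downtime per year and encodes the consistency levels in the order listed, from strongest to weakest. The function and constant names are hypothetical, and the year is assumed to have 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Ordered from strongest to weakest guarantee.
CONSISTENCY_LEVELS = [
    "strong",
    "bounded staleness",
    "session",
    "consistent prefix",
    "eventual",
]


def is_stronger(level_a, level_b):
    """True if level_a gives a stronger guarantee than level_b."""
    return CONSISTENCY_LEVELS.index(level_a) < CONSISTENCY_LEVELS.index(level_b)


five_nines = allowed_downtime_minutes(99.999)
```

Five nines works out to a little over five minutes of downtime per year, compared with roughly 8.8 hours at 99.9 percent.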

### Ingest Data into Cosmos 
Use Azure Data Factory.

### Queries
- stored procedures
- triggers
- user-defined functions (UDFs)
- JavaScript query API
- other APIs
  - Data Explorer component uses the graph visualization panel

### Security
- data encryption
- IP firewall configurations
- access from virtual networks
- Data is encrypted automatically
- User authentication is based on tokens
- Azure AD provides role-based security
- Security compliance certifications include HIPAA, FedRAMP, SOX, and HITRUST

## Understand Azure SQL Database
- managed relational database service
- supports structures such as relational data and unstructured formats
  - spatial
  - XML data
- provides online transaction processing (OLTP) that can scale on demand
- comprehensive security and availability

### When Azure SQL Database
- need to scale up and scale down OLTP systems on demand
- take advantage of Azure security and availability features
- flexible - provision in minutes
- backed by Azure SLA

### Key Features
- predictable performance
- minimal admin
- dynamic scalability
- optimization
- high availability
- advanced security

### Ingesting and Processing Data
- application integration from developer SDKs
  - .NET
  - Node.js
  - Python
  - Java
- Transact-SQL (T-SQL) techniques 
- Azure Data Factory

### Queries
With T-SQL

### Security
- Advanced Threat Protection
- SQL Database auditing
- Data encryption
- Azure Active Directory authentication
- Multi-factor authentication
- Compliance certification

## Azure Synapse Analytics