## Review - IaaS/PaaS/SaaS

https://www.ibm.com/topics/iaas-paas-saas

### Infrastructure as a Service (IaaS)

* On-demand access to cloud-hosted physical & virtual servers. 
* Maximum control, Minimum ease of use

### Platform as a Service (PaaS)

* On-demand access to a cloud-hosted platform.
* Medium control, Medium ease of use

### Software as a Service (SaaS)

* On-demand access to ready-to-use, cloud-hosted application software.
* Minimum control, Maximum ease of use

## Review - Data Services

https://www.youtube.com/watch?v=pBB5zFnhgyE
https://learn.microsoft.com/en-us/training/modules/explore-roles-responsibilities-world-of-data/3-data-services

The Azure suite of tools includes many services that satisfy various enterprise data needs. These include:

* Azure SQL
* Azure Database for open-source relational databases
* Azure Cosmos DB
* Azure Storage
* Azure Data Factory
* Azure Synapse Analytics
* Azure Databricks
* Azure HDInsight
* Azure Stream Analytics
* Azure Data Explorer
* Microsoft Purview
* Microsoft Power BI

This is a fairly large list to memorize, so first let's define categories that satisfy the data science "hierarchy of needs."

* databases: tools that provide data storage solutions

* pipelines: tools that allow data engineers to ingest, transform, and load data

* analytics: tools that allow data analysts to explore, aggregate, or perform machine learning on data

* maintenance: tools that allow data engineers to monitor & validate data  

Now we can group the tools into these categories. The categories are not mutually exclusive, so you'll notice that some tools appear in more than one.

### **databases**

* Azure SQL
* Azure Database for open-source relational databases
* Azure Cosmos DB
* Azure Storage

### **pipelines**

* Azure Data Factory
* Azure Synapse Analytics
* Azure Databricks
* Azure HDInsight
* Azure Stream Analytics

### **analytics** 

* Azure Data Explorer
* Microsoft Power BI
* Azure Databricks
* Azure Synapse Analytics

### **maintenance**

* Microsoft Purview
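
To make the overlap between categories easy to check, the grouping above can be written as a small lookup table. This is only a sketch for study purposes: the dictionary copies the lists above, and the helper names (`categories_by_tool`, `multi_category_tools`) are made up for illustration.

```python
# Category -> tools, copied from the lists above.
CATEGORIES = {
    "databases": [
        "Azure SQL",
        "Azure Database for open-source relational databases",
        "Azure Cosmos DB",
        "Azure Storage",
    ],
    "pipelines": [
        "Azure Data Factory",
        "Azure Synapse Analytics",
        "Azure Databricks",
        "Azure HDInsight",
        "Azure Stream Analytics",
    ],
    "analytics": [
        "Azure Data Explorer",
        "Microsoft Power BI",
        "Azure Databricks",
        "Azure Synapse Analytics",
    ],
    "maintenance": ["Microsoft Purview"],
}


def categories_by_tool(categories):
    """Invert the mapping: tool -> sorted list of categories it belongs to."""
    inverted = {}
    for category, tools in categories.items():
        for tool in tools:
            inverted.setdefault(tool, []).append(category)
    return {tool: sorted(cats) for tool, cats in inverted.items()}


def multi_category_tools(categories):
    """Return the tools that appear in more than one category, sorted by name."""
    by_tool = categories_by_tool(categories)
    return sorted(tool for tool, cats in by_tool.items() if len(cats) > 1)


if __name__ == "__main__":
    print(multi_category_tools(CATEGORIES))
```

Running it lists the tools that show up under both **pipelines** and **analytics**.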

## Review - Databases

### Azure SQL

* **Purpose**: Collective name for a family of relational database services based on the [Microsoft SQL Server](https://www.tutorialspoint.com/ms_sql_server/index.htm) database engine. Each option runs the same engine but sits at a different point on the control vs. ease-of-use scale: the more you can configure, the more you must manage yourself.
* **Sub-tools**:
    * Azure SQL Database: a fully managed database for simple transactional data storage (PaaS)
    * Azure SQL Managed Instance: a hosted instance of SQL Server with automated maintenance; more configurable than Azure SQL Database, with more administrative responsibility
    * Azure SQL VM: a virtual machine (basically a computer) w/ an installation of SQL Server; maximum configurability, full management responsibility (IaaS)

### Azure Database for open-source relational databases

* **Purpose**: Managed versions of common open-source relational databases, hosted in the Azure cloud.
* **Sub-tools**:
    * Azure Database for MySQL: commonly used in Linux, Apache, MySQL, and PHP (LAMP) stack apps
    * Azure Database for MariaDB: MariaDB was created by the original developers of MySQL, with the engine rewritten & optimized for performance. Offers compatibility with Oracle Database.
    * Azure Database for PostgreSQL: hybrid object-relational database. Stores data in relational tables, but also allows custom data types with their own non-relational properties.

### Azure Cosmos DB

* **Purpose**: Non-relational (NoSQL) database that supports multiple APIs. Allows storage of JSON documents, key-value pairs, column families, and graphs.
* **Sub-tools**:
    * NA

### Azure Storage

* **Purpose**: Stores data in Blob containers (scalable, cost-effective storage for binary files), file shares (network file shares), and tables (key-value storage for applications that need fast reads & writes). Data engineers use it to host `data lakes`: blob storage with a hierarchical namespace that lets files be organized into folders in a distributed file system.
* **Sub-tools**:
    * NA

## Review - Pipelines

### Azure Data Factory

* **Purpose**: Allows data engineers to define & schedule data pipelines to transfer & transform data. Allows integration with cloud stores to extract & load data. Essentially, it allows data engineers to create ETL pipelines. 
* **Sub-tools**:
    * NA
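
To make the extract, transform, load idea concrete, here is a minimal sketch in plain Python. It does not use Azure Data Factory or any Azure API. It assumes an in-memory CSV string as the source (sample rows borrowed from the delimited-text example later in these notes) and an in-memory SQLite table as the destination; the function names and the `customers` table are hypothetical.

```python
import csv
import io
import sqlite3

# Sample source data (assumption: stands in for a file in a cloud store).
RAW_CSV = """FirstName,LastName,Email
Joe,Jones,joe@litware.com
Samir,Nadoy,samir@northwind.com
"""


def extract(text):
    """Extract: read delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: combine the name fields and keep only the email domain."""
    return [
        {
            "full_name": f"{row['FirstName']} {row['LastName']}",
            "email_domain": row["Email"].split("@")[1],
        }
        for row in rows
    ]


def load(rows, conn):
    """Load: write the transformed rows into a destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (full_name TEXT, email_domain TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (:full_name, :email_domain)", rows
    )
    conn.commit()


def run_pipeline(text, conn):
    """Run the three stages in order and return what landed in the table."""
    load(transform(extract(text)), conn)
    query = "SELECT full_name, email_domain FROM customers ORDER BY full_name"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV, sqlite3.connect(":memory:")))
```

A real pipeline tool adds what this sketch leaves out: scheduling, connectors to many data stores, retries, and monitoring.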

### Azure Synapse Analytics

* **Purpose**: Unified data analytics solution that lets data engineers & data analysts build a comprehensive "data analytics solution" in a single service. Data engineers can use Synapse Analytics to set up ETL pipelines, whereas data analysts can create interactive notebooks and data models to extract insights.
* **Sub-tools**:
    * Pipelines: based on the same technology as Azure Data Factory
    * SQL: scalable SQL database engine
    * Apache Spark: open-source distributed data-processing system, with support for Java, Scala, Python, & SQL
    * Azure Synapse Data Explorer: real-time querying of log and telemetry data using Kusto Query Language

### Azure Databricks

* **Purpose**: Uses Apache Spark data processing & SQL database semantics to enable large-scale analytics. Data engineers can use Databricks to create analytical data stores, whereas data analysts can create interactive notebooks.
* **Sub-tools**:
    * NA

### Azure HDInsight

* **Purpose**: Allows data processing via popular Apache open-source big data processing solutions on Azure-hosted clusters. Data engineers can use this solution to implement big data analytics workloads.
* **Sub-tools**:
    * Apache Spark: open-source distributed data-processing system, with support for Java, Scala, Python, & SQL
    * Apache Hadoop: a system that uses MapReduce (MR) jobs to process large volumes of data efficiently across multiple cluster nodes. MR jobs can be written in Java, or abstracted by interfaces such as Apache Hive - a SQL-based API.
    * Apache HBase: open source system for large-scale NoSQL data storage & querying
    * Apache Kafka: [message broker](https://www.ibm.com/topics/message-brokers) for data stream processing
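
The MapReduce model mentioned in the Hadoop bullet can be illustrated with the classic word-count example. This is a single-process sketch of the idea only, not Hadoop code: on a real cluster the map and reduce phases run in parallel across nodes. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the results."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "big clusters process big data"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many cluster nodes.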

### Azure Stream Analytics

* **Purpose**: Real-time stream processing engine. It captures a stream of data from an input, applies a query to extract & manipulate data from that stream, and writes the results to an output for analysis or further processing. Used by data engineers to ingest streaming data.
* **Sub-tools**:
    * NA
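
As a rough illustration of "apply a query to an input stream and write an output," the sketch below averages sensor readings per fixed-size time window. It is plain Python, not the Stream Analytics query language, and the choice of a windowed average is an assumption made for the example. The event shape `(timestamp_seconds, value)` and the function name are hypothetical.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_seconds):
    """Average values per non-overlapping window.

    events: iterable of (timestamp_seconds, value) tuples.
    Returns {window_start_seconds: average_value}, ordered by window start.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in events:
        # Every event belongs to exactly one window, found by rounding down.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
        counts[window_start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


if __name__ == "__main__":
    readings = [(0, 20.0), (4, 22.0), (11, 30.0), (19, 34.0), (21, 10.0)]
    print(tumbling_window_avg(readings, window_seconds=10))
```

A real streaming engine emits each window's result as soon as the window closes instead of waiting for the whole input, but the grouping logic is the same.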

## Review - Analytics

### Azure Data Explorer

* **Purpose**: Data analysts use this solution to query and analyze log & telemetry data, i.e. records that typically carry a timestamp attribute, such as Internet-of-Things (IoT) data.
* **Sub-tools**:
    * NA

### Microsoft Power BI

* **Purpose**: Platform for analytical data modeling & reporting to create & share interactive data visualizations (SaaS)
* **Sub-tools**:
    * NA

### Azure Synapse Analytics

* **Purpose**: Unified data analytics solution that lets data engineers & data analysts build a comprehensive "data analytics solution" in a single service. Data engineers can use Synapse Analytics to set up ETL pipelines, whereas data analysts can create interactive notebooks and data models to extract insights.
* **Sub-tools**:
    * Pipelines: based on the same technology as Azure Data Factory
    * SQL: scalable SQL database engine
    * Apache Spark: open-source distributed data-processing system, with support for Java, Scala, Python, & SQL
    * Azure Synapse Data Explorer: real-time querying of log and telemetry data using Kusto Query Language

### Azure Databricks

* **Purpose**: Uses Apache Spark data processing & SQL database semantics to enable large-scale analytics. Data engineers can use Databricks to create analytical data stores, whereas data analysts can create interactive notebooks.
* **Sub-tools**:
    * NA

## Review - Maintenance

### Microsoft Purview

* **Purpose**: Data engineers can use this solution to enforce data governance and integrity. Used to map & track data lineage across multiple data sources & systems. 
* **Sub-tools**:
    * NA

## Review - Data Formats

### Structured

* Data that adheres to a fixed *schema* so that all data has the same fields or properties. Most commonly, the schema for structured data is tabular, where rows represent samples & columns represent attributes. 
* Ex: Relational data

### Semi-Structured

* Data that contains some structure, but allows variations between entity instances. 
* Ex: JSON data. 

### Unstructured

* Data that abides by *no* structure.
* Ex: Documents, images, audio, video, & binary data.
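
The difference between structured and semi-structured data can be checked mechanically: structured records all share exactly the same fields, while semi-structured records may add or omit some. A small sketch, using a hypothetical three-field schema and made-up records:

```python
# Hypothetical fixed schema: every structured record must have exactly these fields.
SCHEMA = ("firstName", "lastName", "email")


def conforms_to_schema(record, schema=SCHEMA):
    """True when the record has exactly the schema's fields, no more and no fewer."""
    return set(record.keys()) == set(schema)


# Structured: every row has the same fields, like rows in a relational table.
structured_rows = [
    {"firstName": "Joe", "lastName": "Jones", "email": "joe@litware.com"},
    {"firstName": "Samir", "lastName": "Nadoy", "email": "samir@northwind.com"},
]

# Semi-structured: entities share some fields but vary between instances.
semi_structured_docs = [
    {
        "firstName": "Joe",
        "lastName": "Jones",
        "contact": [{"type": "home", "number": "555 123-1234"}],
    },
    {"firstName": "Samir", "lastName": "Nadoy"},
]

if __name__ == "__main__":
    print(all(conforms_to_schema(row) for row in structured_rows))
    print(all(conforms_to_schema(doc) for doc in semi_structured_docs))
```

Unstructured data has no fields to check at all, which is why it is usually handled as opaque bytes.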

Data stores fall into two broad categories:
* file stores
* databases

## Review - File Storage

https://www.oracle.com/database/what-is-json/
https://learn.microsoft.com/en-us/training/modules/explore-core-data-concepts/3-file-storage

### Delimited Text Files

Purpose: Plain text files in which field delimiters (commas, tabs, pipes, spaces) separate values & row terminators separate records. Good choice for structured data that needs to be accessed by a wide range of applications and services in a human-readable format. Includes CSV (comma-separated values), TSV (tab-separated values), space-delimited files, etc.


```
FirstName,LastName,Email
Joe,Jones,joe@litware.com
Samir,Nadoy,samir@northwind.com
```
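
A short sketch of reading this format with Python's standard `csv` module. The sample text is the example above; the helper name `parse_delimited` is made up. Changing the `delimiter` argument is all it takes to read tab- or pipe-separated files.

```python
import csv
import io

# The delimited sample from above: first line is the header row.
CSV_TEXT = (
    "FirstName,LastName,Email\n"
    "Joe,Jones,joe@litware.com\n"
    "Samir,Nadoy,samir@northwind.com\n"
)


def parse_delimited(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


rows = parse_delimited(CSV_TEXT)

if __name__ == "__main__":
    for row in rows:
        print(row["FirstName"], row["Email"])
```

Note that every value comes back as a string; the format itself carries no type information.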

### JavaScript Object Notation

Purpose: Human-readable hierarchical document format used to define data entities (objects) that have multiple attributes. Objects are enclosed in curly brackets `{}` and collections (arrays) in square brackets `[]`. An attribute's value may itself be an object or an array. Used for both structured & semi-structured data.

```
{
  "customers":
  [
    {
      "firstName": "Joe",
      "lastName": "Jones",
      "contact":
      [
        {
          "type": "home",
          "number": "555 123-1234"
        },
        {
          "type": "email",
          "address": "joe@litware.com"
        }
      ]
    },
    {
      "firstName": "Samir",
      "lastName": "Nadoy",
      "contact":
      [
        {
          "type": "email",
          "address": "samir@northwind.com"
        }
      ]
    }
  ]
}
```
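
A short sketch of reading the document above with Python's standard `json` module. Objects become dicts and arrays become lists, so the nested structure can be walked directly. The helper name `email_addresses` is made up for illustration.

```python
import json

# The JSON sample from above, condensed onto fewer lines.
JSON_TEXT = """
{
  "customers": [
    {
      "firstName": "Joe",
      "lastName": "Jones",
      "contact": [
        {"type": "home", "number": "555 123-1234"},
        {"type": "email", "address": "joe@litware.com"}
      ]
    },
    {
      "firstName": "Samir",
      "lastName": "Nadoy",
      "contact": [
        {"type": "email", "address": "samir@northwind.com"}
      ]
    }
  ]
}
"""


def email_addresses(doc):
    """Collect the address of every contact entry whose type is 'email'."""
    emails = []
    for customer in doc["customers"]:
        for contact in customer["contact"]:
            if contact["type"] == "email":
                emails.append(contact["address"])
    return emails


document = json.loads(JSON_TEXT)

if __name__ == "__main__":
    print(email_addresses(document))
```

Notice that the two contact entries for Joe have different attributes (`number` vs. `address`), which is the kind of variation that makes this data semi-structured.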

### Extensible Markup Language

Purpose: Human-readable document schema that fulfills a similar purpose to JSON. Uses tags enclosed in angle brackets `<..>` to define elements & attributes.

```
<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555 123-1234"/>
      <Contact type="email" address="joe@litware.com"/>
    </ContactDetails>
  </Customer>
  <Customer name="Samir" lastName="Nadoy">
    <ContactDetails>
      <Contact type="email" address="samir@northwind.com"/>
    </ContactDetails>
  </Customer>
</Customers>
```
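
The same lookup against the XML version, sketched with Python's standard `xml.etree.ElementTree` module. The helper name `xml_email_addresses` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The XML sample from above.
XML_TEXT = """
<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555 123-1234"/>
      <Contact type="email" address="joe@litware.com"/>
    </ContactDetails>
  </Customer>
  <Customer name="Samir" lastName="Nadoy">
    <ContactDetails>
      <Contact type="email" address="samir@northwind.com"/>
    </ContactDetails>
  </Customer>
</Customers>
"""


def xml_email_addresses(xml_text):
    """Return the address attribute of every <Contact> element of type 'email'."""
    root = ET.fromstring(xml_text.strip())
    return [
        contact.get("address")
        for contact in root.iter("Contact")
        if contact.get("type") == "email"
    ]


if __name__ == "__main__":
    print(xml_email_addresses(XML_TEXT))
```

Here the data lives in element attributes (`type`, `address`) rather than in key-value pairs, but the hierarchy is the same as in the JSON document.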

### Binary Large Objects

* Purpose: While *everything* on a computer is ultimately stored as binary data, the formats above store bytes that map to human-readable text via a character encoding scheme (such as ASCII or Unicode). Unstructured data is commonly stored as raw binary that must be interpreted by specific applications. This includes images, videos, audio, and app-specific documents.
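
The sketch below contrasts the two cases. Text round-trips through a character encoding (UTF-8 is assumed here), whereas arbitrary binary data, represented by the 8-byte signature that starts every PNG image, generally does not decode as text. The helper name `is_valid_utf8` is made up.

```python
# Text is stored as bytes and mapped back to characters by an encoding.
text = "Joe,Jones"
encoded = text.encode("utf-8")     # str -> raw bytes
decoded = encoded.decode("utf-8")  # raw bytes -> str

# The fixed 8-byte signature found at the start of every PNG file.
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def is_valid_utf8(data):
    """True when the bytes can be decoded as UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(list(encoded))                 # the byte values behind the text
    print(is_valid_utf8(encoded))        # text bytes decode cleanly
    print(is_valid_utf8(PNG_SIGNATURE))  # image bytes are not text
```

An application that understands the PNG format knows how to interpret those bytes; a text editor does not.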
    
### Optimized File Formats

https://www.upsolver.com/blog/the-file-format-fundamentals-of-big-data

Specialized file formats designed for efficient usage of storage space or processing. 

#### Avro

* Format: Row-based format w/ a JSON header that describes the structure of data. Data is stored in binary format. The header informs how to parse binary data for extraction.
* Purpose: Optimized for compressing data & minimizing storage & bandwidth requirements.

```
{
  "type": "record",
  "name": "LongList",
  "aliases": ["LinkedLongs"],                      
  "fields" : [
    {"name": "value", "type": "long", "doc": "each element has a long"},
    {"name": "next", "type": ["null", "LongList"], "doc": "optional next element"}
  ]
}
```
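
To show why a schema header makes binary rows readable, here is a toy format built with Python's standard `json` and `struct` modules. It is **not** the real Avro encoding (Avro uses its own compact binary encoding and container layout); it only mimics the idea of "JSON header describes the structure, rows are stored as binary." The schema, field names, and function names are all hypothetical.

```python
import json
import struct

# Toy mapping from schema type names to struct format codes (not Avro's encoding).
TYPE_CODES = {"long": "q", "double": "d"}


def row_format(schema):
    """Build the struct format string for one row from the schema's field types."""
    return "<" + "".join(TYPE_CODES[field["type"]] for field in schema["fields"])


def write_rows(schema, rows):
    """Serialize as: 4-byte header length, JSON header, then fixed-size binary rows."""
    header = json.dumps(schema).encode("utf-8")
    names = [field["name"] for field in schema["fields"]]
    fmt = row_format(schema)
    body = b"".join(struct.pack(fmt, *(row[name] for name in names)) for row in rows)
    return struct.pack("<I", len(header)) + header + body


def read_rows(blob):
    """Read the header first, then use it to decode the binary rows."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    schema = json.loads(blob[4:4 + header_len].decode("utf-8"))
    names = [field["name"] for field in schema["fields"]]
    body = blob[4 + header_len:]
    return [dict(zip(names, values)) for values in struct.iter_unpack(row_format(schema), body)]


if __name__ == "__main__":
    schema = {
        "type": "record",
        "name": "Reading",
        "fields": [{"name": "id", "type": "long"}, {"name": "score", "type": "double"}],
    }
    blob = write_rows(schema, [{"id": 1, "score": 0.5}, {"id": 2, "score": 1.25}])
    print(read_rows(blob))
```

Without the header, the body is just a run of bytes; with it, a reader knows each row is one 8-byte integer followed by one 8-byte float.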

#### ORC

* Format: Column-based format. Stands for "Optimized Row Columnar." A file is divided into stripes of data, each holding the data for a column or set of columns. A stripe contains an index into its rows, the data for each row, and a footer with statistical information (count, sum, max, min, etc.) for each column.
* Purpose: Optimized for read & write operations in Apache Hive. 

#### Parquet

* Format: Column-based format. A file contains row groups, and the data for each column is stored together within the same row group. Each row group contains one or more chunks of data. The file also includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows and retrieve only the specified columns.
* Purpose: Supports efficient compression & encoding schemes.
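
The benefit of row groups plus metadata can be sketched with plain Python lists. This is a simplified in-memory model, not the actual Parquet or ORC file layout: it assumes each group stores its values column by column and keeps min/max statistics per column, which a reader uses to skip groups that cannot match a filter. The sample rows and function names are hypothetical.

```python
ROWS = [
    {"id": 1, "name": "Joe", "amount": 10},
    {"id": 2, "name": "Samir", "amount": 25},
    {"id": 3, "name": "Ana", "amount": 40},
    {"id": 4, "name": "Lee", "amount": 55},
]


def to_row_groups(rows, group_size):
    """Split rows into groups; inside each group, store values column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        # Per-column (min, max) statistics act as the group's metadata.
        stats = {key: (min(values), max(values)) for key, values in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def read_column(groups, column, filter_column, lower_bound):
    """Return `column` values where filter_column >= lower_bound.

    Also returns how many groups had to be scanned after metadata pruning.
    """
    result = []
    groups_scanned = 0
    for group in groups:
        _, column_max = group["stats"][filter_column]
        if column_max < lower_bound:
            continue  # metadata proves no row in this group can match
        groups_scanned += 1
        wanted = group["columns"][column]
        keys = group["columns"][filter_column]
        result.extend(value for value, key in zip(wanted, keys) if key >= lower_bound)
    return result, groups_scanned


if __name__ == "__main__":
    groups = to_row_groups(ROWS, group_size=2)
    print(read_column(groups, "name", "amount", 30))
```

Only the two columns involved in the query are touched, and the first group is skipped entirely, which is the main reason columnar formats suit analytical queries.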

## Next Actions

Review `exercises.ipynb` & `discussion.ipynb`.