## Review - Azure Databases & Management

As we've discussed in our [week 3](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_4/azure/week_3/content.ipynb) content, there are several database solutions designed to help us store different kinds of data.

However, we've skipped over a vital question: how are these database solutions set up?

Let's take a brief look at how Azure is accessed. If we open the [Azure home portal](https://portal.azure.com/#home) while signed into a Microsoft account, we are greeted with several options for purchasing a subscription.

A subscription is simply a digital key that enables you to access various services of Azure. This is usually doled out by your organization, which could support multiple Azure subscriptions across the company via a management group.

After acquiring a subscription, we then create "[resource groups](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal)" which are groupings of various Azure solutions. For example, if you are interested in hosting relational data, you might create a resource group that **only** has **Azure SQL** set up. 

Next, you might also create a separate resource group that contains **Azure Data Factory**, which will allow you to move and transform data from the transactional resource group into a **data warehouse**.
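To make the hierarchy concrete, here is a minimal sketch of a subscription that owns two resource groups, mirroring the example above. This is a toy model in plain Python, not the Azure SDK; the class names and the names `example-subscription`, `rg-transactional`, and `rg-pipeline` are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceGroup:
    """A named grouping of Azure resources (toy model, not the Azure SDK)."""
    name: str
    resources: list = field(default_factory=list)


@dataclass
class Subscription:
    """A subscription that owns resource groups (toy model)."""
    name: str
    resource_groups: dict = field(default_factory=dict)

    def create_resource_group(self, name: str) -> ResourceGroup:
        # Resource group names must be unique within a subscription.
        if name in self.resource_groups:
            raise ValueError(f"resource group {name!r} already exists")
        group = ResourceGroup(name)
        self.resource_groups[name] = group
        return group


# Hypothetical names, chosen only for illustration.
subscription = Subscription("example-subscription")

transactional = subscription.create_resource_group("rg-transactional")
transactional.resources.append("Azure SQL")

pipeline = subscription.create_resource_group("rg-pipeline")
pipeline.resources.append("Azure Data Factory")
```

Keeping the two services in separate groups means each group can be managed (or deleted) without touching the other.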

Throughout this course, we will skip over the actual steps needed to create these services and simply talk about use cases. To get an introduction to creating resource groups & subsequent services, tackle the following [lab](https://microsoftlearning.github.io/DP-900T00A-Azure-Data-Fundamentals/Instructions/Labs/dp900-02-storage-lab.html). 

Now that we understand the very basics of management, let's take a retrospective and consider the solutions we've gone over so far...

* Azure SQL
* Azure DB for Open-Source RDBMS
* Azure Cosmos DB
* Azure Storage

Last week we covered how to work with relational data in Azure using `T-SQL`. Consider which database solution this involves. (Remember! The engines behind Azure DB for Open-Source RDBMS use their own SQL dialects, so T-SQL is not guaranteed to work there.)

We will now jump ahead to *Azure Storage* to explore how we can store unstructured & semi-structured data.

## Review - Azure Blob Storage

*Blob* (binary large object) files represent a large variety of binary data that could entail:

* Images
* Videos
* Audio
* Git History Objects
* etc

This "etc" really encapsulates any and all other forms of data, even `sql`, `json`, and `xml` files. Keep in mind, though, that blob storage treats such files as opaque binary objects, so their structure goes unused (unless that is your intent).

To host blob data in Azure, we set up Azure Blob Storage, which is part of the **Azure Storage** solution.

To host blobs, we create **blob containers** that allow us to group related objects. We can even control who can read & write blobs in our containers.

We can then create **virtual folders** within blob containers, which allow us to maintain a *hierarchy* of namespaces. This is similar (but not identical!) to the file system on our computer. Keep in mind that these *folders* are *virtual*: we cannot manipulate them like ordinary folders (for example, to set access control or run bulk operations on a folder). They exist solely for organizational purposes.

We have 3 types of blob objects:
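One way to picture a virtual folder: the container really holds a flat list of blob names, and a "folder" is just a shared name prefix ending in `/`. The sketch below is a toy model in plain Python (not the Azure SDK); the function `list_level` and the blob names are made up for illustration.

```python
def list_level(blob_names, prefix="", delimiter="/"):
    """Return (virtual_folders, blobs) found directly under `prefix`.

    Toy model: a folder is inferred from the names, it is never stored.
    """
    folders, blobs = set(), []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter, so `head` acts as a folder.
            folders.add(prefix + head + delimiter)
        else:
            blobs.append(name)
    return sorted(folders), sorted(blobs)


# A flat list of blob names, as a container would hold them.
names = [
    "logs/2023/01.txt",
    "logs/2023/02.txt",
    "logs/readme.md",
    "images/cat.png",
    "notes.txt",
]

root_folders, root_blobs = list_level(names)
logs_folders, logs_blobs = list_level(names, prefix="logs/")
```

Nothing in `names` is a folder object; deleting the last blob under `logs/` would make that "folder" disappear from the listing.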

**Block blobs**

Used to store discrete, large binary objects that do not change often. A block blob is composed of a set of blocks (each up to 100 MB) and has a maximum size of about 4.7 TB. Commonly used for large images, videos, etc.
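The maximum size quoted above can be reproduced with a little arithmetic. The sketch below assumes a cap of 50,000 blocks per blob (a figure from Azure's documentation, not stated above) and treats "100 MB" as 100 MiB; both are assumptions for illustration, and newer service versions allow larger blocks.

```python
# Assumed limits, for illustration only.
MAX_BLOCK_BYTES = 100 * 1024**2   # 100 MiB per block
MAX_BLOCKS = 50_000               # assumed cap on blocks per blob
MAX_BLOB_BYTES = MAX_BLOCK_BYTES * MAX_BLOCKS


def blocks_needed(blob_bytes: int, block_bytes: int = MAX_BLOCK_BYTES) -> int:
    """Number of blocks needed to hold `blob_bytes` (integer ceiling division)."""
    if blob_bytes < 0:
        raise ValueError("blob size cannot be negative")
    return (blob_bytes + block_bytes - 1) // block_bytes


def fits_in_block_blob(blob_bytes: int) -> bool:
    """True if the object fits within the assumed block blob limits."""
    return blocks_needed(blob_bytes) <= MAX_BLOCKS


max_size_tib = MAX_BLOB_BYTES / 1024**4       # roughly 4.77
one_gib_video_blocks = blocks_needed(1024**3) # a 1 GiB video
```

Under these assumptions a 1 GiB video needs 11 blocks, and the largest possible blob comes out at roughly 4.77 TiB, which matches the "4.7 TB" figure.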

**Page blobs**

Used to store objects that are constantly read & written. Holds up to 8 TB of data. Commonly used for disk storage for virtual machines.

**Append blobs**

An append blob is a block blob that only supports appending: you can add new blocks to the end, but you cannot update or delete existing blocks. Holds up to 195 GB of data. Commonly used for [log files](https://www.sumologic.com/glossary/log-file/).

In addition to these 3 types of objects, we also have 3 access tiers:
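The append-only rule is easy to model. The sketch below is a toy class in plain Python (not the Azure SDK); `AppendOnlyLog` and its methods are made-up names. It deliberately offers no way to change or remove a block once written.

```python
class AppendOnlyLog:
    """Toy model of an append blob: blocks can be added, never changed."""

    def __init__(self):
        self._blocks = []

    def append_block(self, data: bytes) -> int:
        """Add a block to the end and return its position."""
        self._blocks.append(bytes(data))
        return len(self._blocks) - 1

    @property
    def blocks(self) -> tuple:
        """Read-only view of the blocks written so far."""
        return tuple(self._blocks)

    def read_all(self) -> bytes:
        """Return the full contents, in the order the blocks were appended."""
        return b"".join(self._blocks)


log = AppendOnlyLog()
first = log.append_block(b"service started\n")
second = log.append_block(b"service stopped\n")
contents = log.read_all()
```

This is why the type suits log files: new entries only ever go at the end, and history is never rewritten.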

**Hot Tier**

The default tier, for blobs that are accessed frequently. Stored on high-performance media, which allows near-instantaneous access & update. Reading latency in milliseconds.

**Cool Tier**

Lower performance than the hot tier, but with lower storage costs. Used for data that is accessed infrequently. Reading latency in milliseconds.

**Archive Tier**

Lowest storage cost, but with much higher latency. Designed for data that mustn't be lost but is rarely accessed. Effectively stored in an offline state. Reading latency can be hours. A blob must be moved back to the cool or hot tier before it can be read, in a process called "rehydration."
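The key behavioral difference between the tiers is that an archived blob cannot be read until it is rehydrated. The sketch below is a toy model in plain Python (not the Azure SDK); `ToyBlob` and `BlobOfflineError` are made-up names, and the tier change here is instant, whereas real rehydration can take hours.

```python
from dataclasses import dataclass

TIERS = ("hot", "cool", "archive")


class BlobOfflineError(Exception):
    """Raised when reading a blob that is still in the archive tier."""


@dataclass
class ToyBlob:
    data: bytes
    tier: str = "hot"  # hot is the default tier

    def set_tier(self, tier: str) -> None:
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier!r}")
        self.tier = tier

    def read(self) -> bytes:
        # Hot and cool blobs are online; archived blobs are not.
        if self.tier == "archive":
            raise BlobOfflineError("rehydrate to hot or cool before reading")
        return self.data


report = ToyBlob(b"2019 audit report")
report.set_tier("archive")

try:
    report.read()
    archived_read_failed = False
except BlobOfflineError:
    archived_read_failed = True

report.set_tier("cool")      # "rehydration" (instant in this toy model)
rehydrated = report.read()
```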

## Review - Azure Data Lakes

Now that we've gone over blob storage, we can move on to the `Azure DataLake Storage Gen2` service. Rather than being a separate service, it is built into Azure Storage and combines the scalability of blob storage with the analytics compatibility of Azure Data Lake Store.

We can create an Azure Data Lake by simply enabling *hierarchical namespaces* in an Azure Storage account. This will *permanently* convert our Azure Storage account from simple Blob Storage to a Data Lake, which allows us to use the following features:

* Organize and manage data with **real** (not virtual) directories and folders.
* Optimize for high-performance analytics.
* Enable granular access control on folders, which will lock away folders from improper usage.
* Enable atomic operations on directories: succeed or fail.
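To see why real directories matter, compare renaming a "folder" in a flat namespace with renaming a directory in a hierarchical one. The sketch below is a toy model in plain Python (not the Azure SDK); the function names and paths are made up for illustration.

```python
def rename_flat(blobs: dict, old_prefix: str, new_prefix: str) -> int:
    """Rename a virtual folder by rewriting every matching blob name.

    Returns the number of per-blob operations. If the loop were interrupted
    halfway, some blobs would be left under the old prefix.
    """
    operations = 0
    for name in [n for n in blobs if n.startswith(old_prefix)]:
        blobs[new_prefix + name[len(old_prefix):]] = blobs.pop(name)
        operations += 1
    return operations


def rename_hierarchical(tree: dict, old_name: str, new_name: str) -> int:
    """Rename a real directory: one operation, whatever it contains."""
    tree[new_name] = tree.pop(old_name)
    return 1


# Flat namespace: the folder exists only as a prefix in each blob name.
flat = {"raw/a.csv": b"1", "raw/b.csv": b"2", "other/c.csv": b"3"}
flat_ops = rename_flat(flat, "raw/", "staged/")

# Hierarchical namespace: the directory is an object that holds its files.
tree = {"raw": {"a.csv": b"1", "b.csv": b"2"}, "other": {"c.csv": b"3"}}
tree_ops = rename_hierarchical(tree, "raw", "staged")
```

The flat rename costs one operation per blob, while the hierarchical rename is a single step that either happens or does not.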

But we also lose the following features:

* Blob-level tiering
* ~~Blob-level "soft deletes"~~ (Fixed June 2021)

Despite these trade-offs, if we want to scale up both our unstructured data store and our analytical capabilities, we must upgrade to `Azure DataLake Storage Gen2`, which supports integration with Azure HDInsight, Azure Databricks, and Azure Synapse Analytics.

## Review - Azure Files

We're all familiar with file shares at this point: Dropbox, Google Drive, etc. These are solutions that allow us to access a central repository of files hosted on remote storage.

Microsoft, of course, would be remiss not to include its own file share solution within `Azure Storage`, called `Azure File Storage`. What this file share does differently from the rest of the file share ecosystem is for us to discover.

`Azure File Storage` enables us to store up to 100 TB. The maximum size of a single file is 1 TB, and each shared file supports up to 2000 concurrent connections.

We can even synchronize locally cached copies of shared files with data in `Azure File Storage`.

We, once again, have two tiers of storage:

**Standard** HDD-based storage

**Premium** SSD-based storage, which provides faster access at a higher cost.

Lastly, we also have two network file-sharing protocols:

**Server Message Block (SMB)**

File sharing across Windows, Linux, and macOS.

**Network File System (NFS)**

File sharing across Linux & macOS. Only accessible via `premium` storage (the sin of not hosting Windows).

## Review - Azure Tables

Lastly, we can store structured and semi-structured tables in `Azure Tables`. 

Much like a relational database, Azure Tables stores data as rows and columns. Instead of a single primary key, each row has a `partition key`, which identifies the partition the row belongs to, and a `row key`, which is unique to each row within that partition. Together, the two keys uniquely identify a row.

A `partition` is a group of rows that share the same partition key, so partitions split a table into independent pieces. Grouping related rows this way improves scalability and performance, thanks to the following properties:

* Partitions exist independently from each other.
* Including the partition key in the search criteria improves read performance.

All rows with the same partition key are grouped. When searching for rows, we can use `point queries` to locate a single row, or a `range query` to locate contiguous rows.
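The two query styles can be sketched with a toy table keyed by partition key and row key. This is plain Python, not the Azure SDK; `ToyTable`, its methods, and the sample data are made up for illustration. Only the property names `PartitionKey` and `RowKey` follow Azure's naming.

```python
from bisect import bisect_left, bisect_right


class ToyTable:
    """Toy model of a table: partition key -> {row key: entity}."""

    def __init__(self):
        self._partitions = {}

    def upsert(self, partition_key: str, row_key: str, **properties) -> None:
        entity = {"PartitionKey": partition_key, "RowKey": row_key, **properties}
        self._partitions.setdefault(partition_key, {})[row_key] = entity

    def point_query(self, partition_key: str, row_key: str):
        """Locate a single row from both keys (None if it does not exist)."""
        return self._partitions.get(partition_key, {}).get(row_key)

    def range_query(self, partition_key: str, start: str, end: str) -> list:
        """Locate contiguous rows with start <= row key <= end in one partition."""
        rows = self._partitions.get(partition_key, {})
        keys = sorted(rows)  # row keys are compared as strings
        low, high = bisect_left(keys, start), bisect_right(keys, end)
        return [rows[key] for key in keys[low:high]]


table = ToyTable()
table.upsert("customers-nyc", "0001", name="Ada", email="ada@example.com")
table.upsert("customers-nyc", "0002", name="Grace")
table.upsert("customers-nyc", "0003", name="Linus", phone="555-0100")
table.upsert("customers-sf", "0001", name="Barbara")

one_row = table.point_query("customers-nyc", "0002")
some_rows = table.range_query("customers-nyc", "0002", "0003")
```

Both queries name the partition up front, so only that partition's rows are ever examined. Notice also that the rows carry different properties, which a relational table would not allow.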

This is where the similarities to an RDBMS end:

* Data is usually denormalized
* Foreign keys do not exist
* A timestamp column records modifications to a row
* Stored procedures, views, and indexes do not exist
* Columns can vary from row to row

Why might we want to opt for Azure Tables in this instance then? Well:

* It's cheaper: roughly $0.045 per GB/month vs. $0.12 per GB/month for relational storage
* It's faster when used correctly

If you need to store table-based data and do not care much for relational features, tables are the way to go.