# Module 02 - Data Storage: Azure Storage, Data Lake, File System

## Overview

This module covers Azure storage services that form the foundation of data engineering solutions. Understanding storage options is crucial for designing efficient and cost-effective data architectures.

## Learning Objectives

By the end of this module, you will understand:
- Azure Storage Account and its services (Blob, File, Queue, Table)
- Azure Data Lake Storage Gen2 and its capabilities
- Hierarchical namespace and file system concepts
- When to use Azure Storage vs Data Lake Storage
- Storage account types and performance tiers
- Best practices for organizing data in storage


## Azure Storage Account

An **Azure Storage account** groups a set of Azure Storage services together and provides a unique namespace through which all of its data is addressed.

### Storage Account Services

A storage account can contain four types of data services:

1. **Blob Storage**: Object storage for unstructured data (text, binary, images, videos)
2. **File Storage**: Managed file shares accessible via SMB protocol
3. **Queue Storage**: Message queuing for application communication
4. **Table Storage**: NoSQL key-value store for structured data

### Storage Account Types

#### General Purpose v2 (GPv2)
- **Recommended** for most scenarios
- Supports all storage services
- Best price-performance balance
- Supports hot, cool, and archive access tiers

#### General Purpose v1 (GPv1)
- Legacy account type
- Still supported but not recommended for new accounts
- Limited features compared to v2

#### Blob Storage Account
- Legacy account type
- Only supports blob storage
- Not recommended for new deployments


## Blob Storage

**Azure Blob Storage** is Microsoft's object storage solution for the cloud. It's optimized for storing massive amounts of unstructured data.

### Blob Types

1. **Block Blobs**
   - For text and binary data
   - Up to ~190.7 TiB per blob on current service versions (~4.75 TiB on older versions)
   - Best for streaming and cloud-native applications
   - Most common type for data engineering

2. **Append Blobs**
   - Optimized for append operations
   - Good for logging scenarios
   - Up to ~195 GB per blob

3. **Page Blobs**
   - For random read/write operations
   - Used for VHD files (virtual machine disks)
   - Up to 8 TB per blob
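
The size limits for block and append blobs follow from how they are built: a blob is a sequence of blocks, so the maximum size is the maximum block count multiplied by the maximum block size. The sketch below only does that arithmetic. The 50,000-block cap and the per-block sizes are assumptions based on published service limits, which vary by service version, and the function name is hypothetical.

```python
MIB_PER_GIB = 1024
MIB_PER_TIB = 1024 * 1024


def max_blob_size_mib(max_blocks: int, max_block_mib: int) -> int:
    """Upper bound on blob size in MiB: block count times block size."""
    return max_blocks * max_block_mib


# Assumed limits; check the current service limits before relying on them.
append_blob_gib = max_blob_size_mib(50_000, 4) / MIB_PER_GIB
block_blob_old_tib = max_blob_size_mib(50_000, 100) / MIB_PER_TIB
block_blob_new_tib = max_blob_size_mib(50_000, 4000) / MIB_PER_TIB

print(f"Append blob (4 MiB blocks): ~{append_blob_gib:.1f} GiB")
print(f"Block blob (100 MiB blocks): ~{block_blob_old_tib:.2f} TiB")
print(f"Block blob (4000 MiB blocks): ~{block_blob_new_tib:.1f} TiB")
```

This also explains why quoted limits for the same blob type differ between sources: they usually assume different maximum block sizes.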

### Blob Storage Structure

```
Storage Account
└── Container (like a folder)
    └── Blob (the actual file)
```

### Access Tiers

- **Hot**: Frequently accessed data (highest storage cost, lowest access cost)
- **Cool**: Infrequently accessed data (lower storage cost, higher access cost)
- **Archive**: Rarely accessed data (lowest storage cost, highest access cost, requires rehydration before it can be read)
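
The trade-off between storage price and access price can be made concrete with a small cost model. The sketch below is illustrative only: the per-GB prices are made up (they are not Azure rates), the function names are hypothetical, and the model ignores transaction charges, early-deletion fees, and archive rehydration time.

```python
# Hypothetical per-GB monthly prices, for illustration only (not real Azure rates).
TIER_PRICES = {
    "hot": {"storage": 0.018, "read": 0.000},
    "cool": {"storage": 0.010, "read": 0.010},
    "archive": {"storage": 0.002, "read": 0.050},
}


def monthly_cost(tier: str, stored_gb: float, read_gb: float) -> float:
    """Storage charge plus data-read charge for one month."""
    prices = TIER_PRICES[tier]
    return stored_gb * prices["storage"] + read_gb * prices["read"]


def cheapest_tier(stored_gb: float, read_gb: float) -> str:
    """Return the tier with the lowest modeled monthly cost."""
    return min(TIER_PRICES, key=lambda tier: monthly_cost(tier, stored_gb, read_gb))


# Same 1,000 GB stored, three different read volumes per month.
for read_gb in (0, 500, 5000):
    print(f"{read_gb:>5} GB read/month -> {cheapest_tier(1000, read_gb)}")
```

Under these assumed prices, the cheapest tier shifts from archive to cool to hot as the monthly read volume grows, which is the pattern the tier descriptions imply.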

### Use Cases

- Data lakes and big data analytics
- Backup and disaster recovery
- Media storage and streaming
- Log file storage
- Raw data storage before processing


## Azure Data Lake Storage Gen2

**Azure Data Lake Storage Gen2 (ADLS Gen2)** is built on Azure Blob Storage and adds a hierarchical namespace, making it optimized for big data analytics workloads.

### Key Features

1. **Hierarchical Namespace**
   - Organizes objects/files into a directory hierarchy
   - Enables file system semantics (directories, subdirectories)
   - Improves performance for analytics operations

2. **Hadoop Compatible**
   - Exposes a Hadoop Distributed File System (HDFS)-compatible interface through the ABFS driver
   - Supports Apache Spark, Hive, Presto, and other analytics engines
   - Can be used as primary storage for HDInsight, Databricks, Synapse

3. **ACL (Access Control Lists)**
   - Fine-grained access control at file and directory level
   - POSIX-compliant permissions
   - Supports both Azure RBAC and ACLs

4. **Optimized for Analytics**
   - Better performance for large-scale analytics
   - Supports atomic operations
   - Optimized for parallel processing
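
POSIX-style ACLs are commonly written in a short text form such as `user::rwx,group::r-x,other::---`, where each entry names a type, an optional identity, and three permission characters. The sketch below parses that form so the structure is easier to see. It is a local illustration, not an SDK call: `AclEntry` and `parse_acl` are hypothetical names, and the parser assumes well-formed entries with an optional `default:` prefix.

```python
from typing import NamedTuple


class AclEntry(NamedTuple):
    is_default: bool  # True for entries inherited by new child items
    kind: str         # "user", "group", "mask", or "other"
    identity: str     # empty string means the owning user or group
    read: bool
    write: bool
    execute: bool


def parse_acl(acl: str) -> list[AclEntry]:
    """Parse a comma-separated POSIX-style ACL string into entries."""
    entries = []
    for raw in acl.split(","):
        parts = raw.strip().split(":")
        is_default = parts[0] == "default"
        if is_default:
            parts = parts[1:]
        if len(parts) != 3 or len(parts[2]) != 3:
            raise ValueError(f"malformed ACL entry: {raw!r}")
        kind, identity, perms = parts
        entries.append(
            AclEntry(is_default, kind, identity,
                     perms[0] == "r", perms[1] == "w", perms[2] == "x")
        )
    return entries


for entry in parse_acl("user::rwx,group::r-x,other::---"):
    print(entry)
```

On a directory, the execute bit controls whether a caller may traverse into it, which is why read access to a deeply nested file typically also needs execute on every parent directory.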

### ADLS Gen2 Structure

```
Storage Account (with hierarchical namespace enabled)
└── File System (Container)
    └── Directory
        └── Subdirectory
            └── File
```

### When to Use ADLS Gen2

✅ Big data analytics workloads
✅ Need for hierarchical organization
✅ Working with Spark, Hive, or other analytics tools
✅ Require fine-grained access control
✅ Processing large files in parallel

### When to Use Regular Blob Storage

✅ Simple object storage needs
✅ Not using analytics tools
✅ Cost optimization for simple storage
✅ Legacy applications that don't support ADLS Gen2


## File System Concepts

### Hierarchical Namespace

**Hierarchical namespace** organizes blob data into a directory structure, similar to a file system on your computer.

#### Without Hierarchical Namespace (Blob Storage)
```
https://storageaccount.blob.core.windows.net/container/blob1
https://storageaccount.blob.core.windows.net/container/blob2
https://storageaccount.blob.core.windows.net/container/blob3
```
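- Flat structure
- No true directories (a `/` in a blob name is only part of the name)
- Directory-style operations such as rename or delete must be applied to each blob individually, so they are slower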

#### With Hierarchical Namespace (ADLS Gen2)
```
https://storageaccount.dfs.core.windows.net/filesystem/sales/2024/january/data.csv
https://storageaccount.dfs.core.windows.net/filesystem/sales/2024/february/data.csv
https://storageaccount.dfs.core.windows.net/filesystem/marketing/2024/data.csv
```
- Directory structure: `/sales/2024/january/`
- True directories and subdirectories
- Faster directory operations and analytics
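
The difference in directory performance comes down to how many objects an operation has to touch. The sketch below models both layouts in memory: it is not Azure code, and the function names and data structures are hypothetical. In the flat model a "directory rename" rewrites every blob name that shares the prefix, while in the hierarchical model the directory is a single entry.

```python
def rename_prefix_flat(blobs: dict[str, bytes], old: str, new: str) -> int:
    """Flat namespace: rewrite every blob name under the prefix, one by one."""
    moved = [name for name in blobs if name.startswith(old)]
    for name in moved:
        blobs[new + name[len(old):]] = blobs.pop(name)
    return len(moved)  # number of per-blob operations performed


def rename_dir_hierarchical(tree: dict, old: str, new: str) -> int:
    """Hierarchical namespace: the directory is one entry, so one operation."""
    tree[new] = tree.pop(old)
    return 1


flat = {
    "sales/2024/january/data.csv": b"",
    "sales/2024/february/data.csv": b"",
    "marketing/2024/data.csv": b"",
}
tree = {
    "sales": {"2024": {"january": ["data.csv"], "february": ["data.csv"]}},
    "marketing": {"2024": ["data.csv"]},
}

print("flat operations:", rename_prefix_flat(flat, "sales/", "revenue/"))
print("hierarchical operations:", rename_dir_hierarchical(tree, "sales", "revenue"))
```

With millions of files under a prefix, the per-blob cost of the flat model is what makes job-commit steps that rename output directories expensive on plain Blob Storage.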

### File System Operations

Common operations in a hierarchical namespace:
- **Create Directory**: Organize files into folders
- **Rename Directory**: Reorganize structure
- **Delete Directory**: Remove entire folder trees
- **List Directory**: Get contents of a directory
- **Move/Rename Files**: Reorganize files within structure

### Benefits for Data Engineering

1. **Organization**: Logical structure matching business needs
2. **Performance**: Faster operations on directories
3. **Compatibility**: Works with tools expecting file systems
4. **Partitioning**: Natural partitioning by date, region, etc.


## Storage Account Configuration

### Performance Tiers

#### Standard Performance
- Uses hard disk drives (HDD)
- Lower cost
- Good for bulk data, backups, archives
- Suitable for data accessed infrequently

#### Premium Performance
- Uses solid-state drives (SSD)
- Higher cost, better performance
- Good for frequently accessed data
- Lower latency, higher throughput

### Redundancy Options

1. **Locally Redundant Storage (LRS)**
   - Data replicated 3 times within a single datacenter
   - Lowest cost, lowest durability
   - 99.999999999% (11 9's) durability

2. **Zone-Redundant Storage (ZRS)**
   - Data replicated across 3 availability zones in a region
   - Higher durability than LRS: at least 99.9999999999% (12 9's)
   - Protects against datacenter failures

3. **Geo-Redundant Storage (GRS)**
   - Data replicated to a secondary region
   - 99.9999999999999% (16 9's) durability
   - Protects against regional disasters

4. **Geo-Zone-Redundant Storage (GZRS)**
   - Combines ZRS in the primary region with asynchronous replication to a secondary region
   - Highest durability and availability
   - Best for mission-critical data
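
Durability figures expressed in "nines" are easier to compare when turned into an expected loss. The sketch below is a simplification, not an official formula: it treats the durability figure as an independent annual survival probability per object, and the function name is hypothetical.

```python
def expected_annual_loss(object_count: int, nines: int) -> float:
    """Expected number of objects lost per year at a durability of `nines` nines."""
    loss_probability = 10.0 ** -nines  # e.g. 11 nines -> 1e-11
    return object_count * loss_probability


objects = 1_000_000_000  # one billion stored objects
for label, nines in (("LRS", 11), ("GRS", 16)):
    print(f"{label}: {expected_annual_loss(objects, nines):.2e} objects/year")
```

Under this model even the lowest option loses very little on average, so the choice between redundancy options is usually driven by which failure scope (disk, datacenter, zone, or region) the data must survive rather than by the durability number alone.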


## Data Organization Best Practices

### Folder Structure Patterns

#### Pattern 1: Date-Based Partitioning
```
/data/
  /raw/
    /2024/
      /01/
        /01/  (day)
          data.csv
      /02/
  /processed/
    /2024/
      /01/
```

#### Pattern 2: Subject Area Partitioning
```
/data/
  /sales/
    /raw/
    /processed/
  /marketing/
    /raw/
    /processed/
  /finance/
    /raw/
    /processed/
```

#### Pattern 3: Hybrid (Recommended)
```
/data/
  /sales/
    /raw/
      /2024/
        /01/
          sales_2024_01_01.csv
    /processed/
      /2024/
        /01/
          sales_daily_summary_2024_01.parquet
  /marketing/
    /raw/
      /2024/
        /01/
```
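
Building partition paths in code, rather than by hand, keeps the layout consistent across pipelines. A minimal sketch for the hybrid pattern follows. The function name and the `data` root are assumptions for illustration, and the path is returned without a leading slash because blob names are relative to the container.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(subject: str, stage: str, day: date, filename: str) -> str:
    """Build data/<subject>/<stage>/<YYYY>/<MM>/<filename> with zero-padded parts."""
    return str(
        PurePosixPath("data", subject, stage, f"{day:%Y}", f"{day:%m}", filename)
    )


print(partition_path("sales", "raw", date(2024, 1, 1), "sales_2024_01_01.csv"))
```

Zero-padding the month keeps lexicographic order equal to chronological order, so prefix listings return partitions in date order.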

### Naming Conventions

✅ **Good Practices:**
- Use lowercase letters with a single separator style (hyphens or underscores): `sales-data-2024-01-01.csv`
- Include a zero-padded date in the filename so names sort chronologically: `data_YYYY_MM_DD.csv`
- Be descriptive: `customer-transactions-raw.csv`
- Use consistent formats across projects

❌ **Avoid:**
- Spaces in names: `sales data.csv`
- Special characters: `sales@data#2024.csv`
- Inconsistent formats: `SalesData_2024-1-1.csv` vs `sales_data_2024_01_01.csv`
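
A naming convention is easier to enforce when it is checked automatically at ingestion time. The sketch below encodes one possible reading of the rules above as a regular expression: lowercase letters and digits, separated by single hyphens or underscores, followed by an extension. The pattern and function name are assumptions, not an Azure requirement.

```python
import re

# Lowercase alphanumeric words joined by "-" or "_", then ".<extension>".
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:[-_][a-z0-9]+)*\.[a-z0-9]+$")


def is_valid_name(filename: str) -> bool:
    """Return True if the filename follows the convention sketched above."""
    return NAME_PATTERN.fullmatch(filename) is not None


for name in (
    "sales-data-2024-01-01.csv",
    "customer-transactions-raw.csv",
    "sales data.csv",
    "sales@data#2024.csv",
    "SalesData_2024-1-1.csv",
):
    print(f"{name!r}: {'ok' if is_valid_name(name) else 'rejected'}")
```

The pattern does not check date padding or separator consistency across files; those would need additional rules.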


```python
# Example: Working with Azure Storage using the Python SDK
# Note: This is a conceptual example. In practice, you'll need:
# - An Azure Storage account and permission to access it
# - The azure-storage-blob and azure-identity packages installed
# - Proper authentication configured

# Uncomment and install if needed:
# !pip install azure-storage-blob azure-identity

"""
from azure.core.exceptions import ResourceExistsError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Authentication (using DefaultAzureCredential)
credential = DefaultAzureCredential()

# Storage account URL
storage_account_url = "https://<storageaccountname>.blob.core.windows.net"

# Create BlobServiceClient
blob_service_client = BlobServiceClient(
    account_url=storage_account_url,
    credential=credential
)

# Create a container (if it doesn't exist)
container_name = "data-lake"
try:
    blob_service_client.create_container(container_name)
    print(f"Container '{container_name}' created successfully")
except ResourceExistsError:
    # Catch only the "already exists" case so other failures still surface
    print(f"Container '{container_name}' already exists")

# Upload a file
local_file_path = "sample_data.csv"
blob_name = "raw/2024/01/sample_data.csv"
blob_client = blob_service_client.get_blob_client(
    container=container_name,
    blob=blob_name
)

with open(local_file_path, "rb") as data:
    blob_client.upload_blob(data, overwrite=True)
print(f"File uploaded to {blob_name}")

# List blobs under a path prefix (the "directory" is just a name prefix here)
container_client = blob_service_client.get_container_client(container_name)
blobs = container_client.list_blobs(name_starts_with="raw/2024/01/")
for blob in blobs:
    print(f"Blob: {blob.name}, Size: {blob.size} bytes")
"""

print("This is a conceptual example.")
print("To use Azure Storage SDK, you need:")
print("1. Azure Storage Account")
print("2. Authentication credentials")
print("3. azure-storage-blob package installed")
print("\nKey concepts demonstrated:")
print("- Creating containers")
print("- Uploading files")
print("- Listing blobs")
print("- Organizing data in hierarchical structure")
```


## Azure Storage vs Data Lake Storage Gen2

### Comparison Table

| Feature | Azure Blob Storage | Azure Data Lake Storage Gen2 |
|---------|-------------------|------------------------------|
| **Namespace** | Flat | Hierarchical |
| **File System Semantics** | Limited | Full support |
| **Analytics Performance** | Good | Optimized |
| **ACL Support** | Container-level | File and directory level |
| **Hadoop Compatibility** | Via legacy WASB driver | Via ABFS driver (HDFS-compatible) |
| **Cost** | Lower | Slightly higher |
| **Use Case** | General object storage | Big data analytics |

### Decision Matrix

**Choose Azure Blob Storage when:**
- Simple object storage needs
- Not using analytics tools (Spark, Hive)
- Cost is primary concern
- Legacy applications

**Choose ADLS Gen2 when:**
- Big data analytics workloads
- Need hierarchical organization
- Using Spark, Databricks, Synapse
- Require fine-grained access control
- Processing large-scale data

### Migration Path

You can enable hierarchical namespace on an existing storage account (one-way operation):
- Enables ADLS Gen2 features
- Existing blobs remain accessible
- Cannot be disabled once enabled


## File Storage (Azure Files)

**Azure Files** provides fully managed file shares in the cloud, accessible via the Server Message Block (SMB) protocol.

### Key Features

- **SMB Protocol**: Accessible like a network drive
- **Mountable**: Can be mounted on Windows, Linux, macOS
- **Shared Access**: Multiple users/applications can access simultaneously
- **Snapshot Support**: Point-in-time backups

### Use Cases

- Lift and shift applications expecting file shares
- Shared storage for multiple VMs
- Development and testing environments
- Content management systems
- Application configuration storage

### File Share Types

1. **Standard File Shares**
   - HDD-backed
   - Lower cost
   - Good for general-purpose file sharing

2. **Premium File Shares**
   - SSD-backed
   - Higher performance
   - Better for I/O-intensive workloads

### Note for Data Engineering

While Azure Files is useful for certain scenarios, **Blob Storage** and **ADLS Gen2** are more commonly used in data engineering pipelines due to:
- Better integration with analytics tools
- Lower cost for large-scale data
- Optimized for batch processing


## Storage Endpoints and URLs

### Blob Storage Endpoints

```
https://<storageaccountname>.blob.core.windows.net/<container>/<blob>
```

Example:
```
https://mydatalake.blob.core.windows.net/raw-data/sales/2024/01/data.csv
```

### Data Lake Storage Gen2 Endpoints

**Data Lake Storage (DFS) Endpoint:**
```
https://<storageaccountname>.dfs.core.windows.net/<filesystem>/<path>/<file>
```

**Blob Endpoint (also works):**
```
https://<storageaccountname>.blob.core.windows.net/<filesystem>/<path>/<file>
```

Example:
```
https://mydatalake.dfs.core.windows.net/data-lake/raw/sales/2024/01/data.csv
```
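
The endpoint formats above differ only in the service segment of the host name, so a single helper can build either one. This is a minimal sketch: the function name is hypothetical, and it assumes the public-cloud `core.windows.net` suffix shown in this section.

```python
from urllib.parse import quote


def storage_url(account: str, container: str, path: str, endpoint: str = "blob") -> str:
    """Build an HTTPS URL for the blob or dfs endpoint of a storage account."""
    if endpoint not in ("blob", "dfs"):
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    # quote() percent-encodes unsafe characters but leaves "/" separators intact.
    return f"https://{account}.{endpoint}.core.windows.net/{container}/{quote(path)}"


print(storage_url("mydatalake", "raw-data", "sales/2024/01/data.csv"))
print(storage_url("mydatalake", "data-lake", "raw/sales/2024/01/data.csv", endpoint="dfs"))
```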

### Access Methods

1. **Azure Portal**: Web-based interface
2. **Azure Storage Explorer**: Desktop application
3. **REST API**: Programmatic access
4. **SDKs**: Python, .NET, Java, etc.
5. **Command Line**: Azure CLI, PowerShell
6. **Analytics Tools**: Spark, Hive, Synapse (via the `abfss://` URI scheme)
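
Analytics engines address ADLS Gen2 data with URIs of the form `abfss://<filesystem>@<storageaccountname>.dfs.core.windows.net/<path>`, which carry the same information as the HTTPS DFS endpoint in a different order. The sketch below converts one form to the other. The function name is hypothetical, and it assumes the input is a DFS endpoint URL like the ones shown in this section.

```python
from urllib.parse import urlparse


def to_abfss_uri(dfs_url: str) -> str:
    """Convert an https://<account>.dfs.core.windows.net/... URL to abfss:// form."""
    parsed = urlparse(dfs_url)
    if ".dfs." not in parsed.netloc:
        raise ValueError("expected a DFS endpoint URL")
    # The first path segment is the file system (container); the rest is the path.
    filesystem, _, path = parsed.path.lstrip("/").partition("/")
    return f"abfss://{filesystem}@{parsed.netloc}/{path}"


print(to_abfss_uri(
    "https://mydatalake.dfs.core.windows.net/data-lake/raw/sales/2024/01/data.csv"
))
```

The resulting string is the kind of path typically passed to a Spark reader, provided the cluster is already configured with credentials for the storage account.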


## Summary

In this module, we've covered:

✅ Azure Storage Account and its services (Blob, File, Queue, Table)
✅ Blob Storage types and access tiers
✅ Azure Data Lake Storage Gen2 and hierarchical namespace
✅ File system concepts and organization
✅ Storage account types and performance tiers
✅ Redundancy options for data durability
✅ Data organization best practices
✅ Comparison between Blob Storage and ADLS Gen2
✅ Storage endpoints and access methods

### Key Takeaways

1. **Azure Storage Account** is the foundation for storing data in Azure
2. **Blob Storage** is ideal for unstructured data and object storage
3. **ADLS Gen2** adds hierarchical namespace for better analytics performance
4. **Hierarchical namespace** enables file system semantics and better organization
5. **Choose storage type** based on your use case: simple storage vs analytics workloads
6. **Organize data** using consistent folder structures and naming conventions
7. **Consider redundancy** based on your durability and availability requirements

### Next Steps

Proceed to **Module 03: Data Ingestion** to learn about:
- How to move data from sources to Azure
- Batch vs streaming data ingestion
- Azure Data Factory for data movement
- Event Hubs for streaming data
