## 02 - Azure Data Lake Storage Gen2

In this task, we will look at how to create an Azure Data Lake Storage account and upload a file to it. Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. Azure Data Lake Storage Gen2 is the result of converging the capabilities of Azure Data Lake Storage Gen1 with Azure Blob storage.

Creating an Azure Data Lake Storage Gen2 account is the same as creating an Azure Storage account, except that we enable the hierarchical namespace option on the `Advanced` tab.

1. Open the Azure portal.
1. Search for `Storage accounts` and click on it.
1. Click on the `+ Create` button to create a new storage account.
1. Fill in the basic details to create a new storage account.

   > Important: Storage Account names must be globally unique. You may need to try a few different names before you find one that is available.

   ![Basics](../images/02-azure-data-lake-storage-gen2-basics.png)

1. Click `Next`.
1. Tick `Enable hierarchical namespace`.

   ![Advanced](../images/02-azure-data-lake-storage-gen2-advanced.png)
   
1. Click `Review + Create`.
1. Click `Create`.
1. After the storage account is created, navigate to it in the Portal.
1. Select the `Containers` blade under `Data Storage`.
1. Click `+ Container` to create a new container and name it `postanalytics`.
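Because the portal rejects names that break Azure's format rules, it can save a few attempts to check a candidate name locally first. The sketch below covers only the format rules (storage account names: 3 to 24 lowercase letters and digits; container names: 3 to 63 lowercase letters, digits, and single hyphens between them). It cannot tell whether a name is already taken, and the helper names are hypothetical.

```python
import re

# Storage account names: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME_RE = re.compile(r"[a-z0-9]{3,24}")

# Container names: lowercase letters, digits, and hyphens; each hyphen must sit
# between two letters/digits, so no leading, trailing, or doubled hyphens.
_CONTAINER_NAME_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_storage_account_name(name: str) -> bool:
    """Check the format only (hypothetical helper); uniqueness needs Azure."""
    return _ACCOUNT_NAME_RE.fullmatch(name) is not None


def is_valid_container_name(name: str) -> bool:
    """Check the container name format (hypothetical helper)."""
    return 3 <= len(name) <= 63 and _CONTAINER_NAME_RE.fullmatch(name) is not None


print(is_valid_storage_account_name("postanalytics01"))
print(is_valid_container_name("postanalytics"))
```

A name that passes these checks can still be refused by the portal if another subscription already uses it.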

### Creating an analytics data structure in Azure Data Lake Storage Gen2 using Python

In this task, we will create an analytics structure (a folder for each date in a month) in Azure Data Lake Storage Gen2 and upload some files to it.
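The folder layout can be previewed offline before any Azure call is made. This is a minimal sketch of the date arithmetic the notebook uses for the previous month; `previous_month_folders` is a hypothetical helper name, and the date passed in at the end is only an example.

```python
import datetime
from typing import List


def previous_month_folders(today: datetime.date) -> List[str]:
    """Return one 'YYYY-MM-DD' folder name per day of the month before `today`."""
    first_day_of_month = today.replace(day=1)
    # Stepping back one day from the 1st lands on the last day of the previous month.
    last_day_of_previous_month = first_day_of_month - datetime.timedelta(days=1)
    return [
        last_day_of_previous_month.replace(day=day).strftime("%Y-%m-%d")
        for day in range(1, last_day_of_previous_month.day + 1)
    ]


folders = previous_month_folders(datetime.date(2024, 7, 15))
print(folders[0], folders[-1], len(folders))
```

Stepping back from the first of the month avoids hard-coding month lengths, so leap years and year boundaries are handled by `datetime` itself.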

You'll need the storage account key, which you can copy from the `Access keys` blade of the storage account in the Azure Portal.
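An account key grants full access to the storage account, so one option is to keep it out of the notebook and read it from the environment instead. The sketch below assumes two environment variable names, `AZURE_STORAGE_ACCOUNT` and `AZURE_STORAGE_KEY`, and a hypothetical helper `load_storage_credentials`; neither is required by the notebook.

```python
import os
from typing import Mapping, Tuple


def load_storage_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (account_name, account_key) from environment variables.

    The variable names are an assumption made for this sketch.
    """
    try:
        return env["AZURE_STORAGE_ACCOUNT"], env["AZURE_STORAGE_KEY"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None


# Demo with an explicit mapping; call load_storage_credentials() with no
# arguments to read the real environment instead.
demo_env = {"AZURE_STORAGE_ACCOUNT": "mystorageaccount", "AZURE_STORAGE_KEY": "placeholder-key"}
account_name, account_key = load_storage_credentials(demo_env)
print(account_name)
```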

In [1]:
# Uncomment next line if not running in Devcontainer
# %pip install azure-storage-file-datalake pandas numpy

# Configure these for your Azure Storage Account
account_name = '<your storage account name>'
account_key = '<your storage account key>'

# Define the container name
container_name = 'postanalytics'

Collecting azure-storage-file-datalake
  Using cached azure_storage_file_datalake-12.15.0-py3-none-any.whl.metadata (15 kB)
Using cached azure_storage_file_datalake-12.15.0-py3-none-any.whl (254 kB)
Installing collected packages: azure-storage-file-datalake
Successfully installed azure-storage-file-datalake-12.15.0
Note: you may need to restart the kernel to use updated packages.


In [3]:
# Create an analytics folder structure in Azure Data Lake Storage Gen2 with one date folder per day
from azure.storage.filedatalake import DataLakeServiceClient
import datetime

# Create a DataLakeServiceClient for the account's dfs endpoint
datalake_service_client = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net", credential=account_key)

# Get a client for the existing container (file system) created in the portal
file_system_client = datalake_service_client.get_file_system_client(file_system=container_name)

# Create a date folder for each day in the previous month
today = datetime.date.today()
first_day_of_month = today.replace(day=1)
last_day_of_previous_month = first_day_of_month - datetime.timedelta(days=1)

for day in range(1, last_day_of_previous_month.day + 1):
    date_folder = last_day_of_previous_month.replace(day=day).strftime('%Y-%m-%d')
    directory_client = file_system_client.get_directory_client(date_folder)
    directory_client.create_directory()

    print(f"Created folder: {date_folder}")


Created folder: 2024-06-01
Created folder: 2024-06-02
Created folder: 2024-06-03
Created folder: 2024-06-04
Created folder: 2024-06-05
Created folder: 2024-06-06
Created folder: 2024-06-07
Created folder: 2024-06-08
Created folder: 2024-06-09
Created folder: 2024-06-10
Created folder: 2024-06-11
Created folder: 2024-06-12
Created folder: 2024-06-13
Created folder: 2024-06-14
Created folder: 2024-06-15
Created folder: 2024-06-16
Created folder: 2024-06-17
Created folder: 2024-06-18
Created folder: 2024-06-19
Created folder: 2024-06-20
Created folder: 2024-06-21
Created folder: 2024-06-22
Created folder: 2024-06-23
Created folder: 2024-06-24
Created folder: 2024-06-25
Created folder: 2024-06-26
Created folder: 2024-06-27
Created folder: 2024-06-28
Created folder: 2024-06-29
Created folder: 2024-06-30
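The upload code in this section passes a length to `append_data` and `flush_data`. Assuming, as the SDK documentation describes, that these values are sizes in bytes, the character count of a string is only safe while the content is pure ASCII. The sketch below shows the difference; `encode_for_upload` is a hypothetical helper and the sample text is made up.

```python
from typing import Tuple


def encode_for_upload(text: str, encoding: str = "utf-8") -> Tuple[bytes, int]:
    """Encode once and return the payload with its size in bytes."""
    payload = text.encode(encoding)
    return payload, len(payload)


csv_text = "CommentText\nTrès bien\n"  # contains one non-ASCII character
payload, length = encode_for_upload(csv_text)

# The character count and the byte count differ as soon as a character
# needs more than one byte in UTF-8.
print(len(csv_text), length)
```

Encoding once also avoids serialising the same DataFrame to CSV several times per file.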


In [6]:
# For each day in the previous month, upload postanalytics.csv and commentsanalytics.csv files
# with randomly generated content to the Azure Data Lake Storage Gen2 container

import pandas as pd
import numpy as np

for day in range(1, last_day_of_previous_month.day + 1):
    date_folder = last_day_of_previous_month.replace(day=day).strftime('%Y-%m-%d')
    directory_client = file_system_client.get_directory_client(date_folder)

    # Create a postanalytics.csv file
    postanalytics_file_client = directory_client.get_file_client("postanalytics.csv")
    postanalytics_data = pd.DataFrame({
        'PostId': np.random.randint(1, 1000, 1000),
        'PostTitle': np.random.choice(['Azure', 'Data', 'AI', 'Machine Learning', 'Python'], 1000),
        'PostViews': np.random.randint(1, 1000, 1000),
        'PostLikes': np.random.randint(1, 100, 1000),
        'PostComments': np.random.randint(1, 50, 1000)
    })
    # Serialise once and encode to bytes so the length passed below is a byte count
    postanalytics_csv = postanalytics_data.to_csv(index=False).encode('utf-8')
    postanalytics_file_client.create_file()
    postanalytics_file_client.append_data(data=postanalytics_csv, offset=0, length=len(postanalytics_csv))
    postanalytics_file_client.flush_data(len(postanalytics_csv))

    # Create a commentsanalytics.csv file
    commentsanalytics_file_client = directory_client.get_file_client("commentsanalytics.csv")
    commentsanalytics_data = pd.DataFrame({
        'CommentId': np.random.randint(1, 1000, 1000),
        'PostId': np.random.randint(1, 1000, 1000),
        'CommentText': np.random.choice(['Great post!', 'Thanks for sharing', 'I agree', 'Interesting', 'Not helpful'], 1000),
        'CommentLikes': np.random.randint(1, 50, 1000)
    })
    # Serialise once and encode to bytes so the length passed below is a byte count
    commentsanalytics_csv = commentsanalytics_data.to_csv(index=False).encode('utf-8')
    commentsanalytics_file_client.create_file()
    commentsanalytics_file_client.append_data(data=commentsanalytics_csv, offset=0, length=len(commentsanalytics_csv))
    commentsanalytics_file_client.flush_data(len(commentsanalytics_csv))

    print(f"Created files in folder: {date_folder}")


Created files in folder: 2024-06-01
Created files in folder: 2024-06-02
Created files in folder: 2024-06-03
Created files in folder: 2024-06-04
Created files in folder: 2024-06-05
Created files in folder: 2024-06-06
Created files in folder: 2024-06-07
Created files in folder: 2024-06-08
Created files in folder: 2024-06-09
Created files in folder: 2024-06-10
Created files in folder: 2024-06-11
Created files in folder: 2024-06-12
Created files in folder: 2024-06-13
Created files in folder: 2024-06-14
Created files in folder: 2024-06-15
Created files in folder: 2024-06-16
Created files in folder: 2024-06-17
Created files in folder: 2024-06-18
Created files in folder: 2024-06-19
Created files in folder: 2024-06-20
Created files in folder: 2024-06-21
Created files in folder: 2024-06-22
Created files in folder: 2024-06-23
Created files in folder: 2024-06-24
Created files in folder: 2024-06-25
Created files in folder: 2024-06-26
Created files in folder: 2024-06-27
Created files in folder: 202