<a href = "https://www.pieriantraining.com"><img src="../PT Centered Purple.png"> </a>

<em style="text-align:center">Copyrighted by Pierian Training</em>

# Data Lakes for Azure with Python

## Azure Actions Covered

* Creating a data lake file system
* Creating directories and files in the data lake file system
* Uploading a file to a data lake file system

In this lecture, we'll learn how to set up data lake storage in Azure with Python.

To begin, we'll need to import our usual libraries as well as any useful environment variables (e.g. AZURE_SUBSCRIPTION_ID). We'll add some new imports as well.

In [1]:
from azure.identity import AzureCliCredential
# New imports for data lake storage
from azure.storage.filedatalake import DataLakeServiceClient
from azure.mgmt.storage import StorageManagementClient

from settings import AZURE_SUBSCRIPTION_ID, DEFAULT_LOCATION, DEFAULT_RESOURCE_GROUP

Let's instantiate our credential and use it to set up our `StorageManagementClient`.

In [2]:
credential = AzureCliCredential()
sm_client = StorageManagementClient(credential, AZURE_SUBSCRIPTION_ID)

Let's list our storage accounts as well. I've set up one specifically for data lake storage.

In [3]:
for account in sm_client.storage_accounts.list():
    print(account.name)

benbstorage1234
bendatalake1234


To interact with our storage account as part of the data lake service, we can instantiate a `DataLakeServiceClient` object. Our account URL will be of the form `https://<storage-account>.dfs.core.windows.net/`.

In [4]:
dl_service_client = DataLakeServiceClient(
    account_url='https://bendatalake1234.dfs.core.windows.net/',
    credential=credential
)

We can confirm which endpoint the client points at by inspecting its `url` attribute.

In [5]:
dl_service_client.url

'https://bendatalake1234.dfs.core.windows.net/'
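The account URL pattern described above can be captured in a small helper. This is only a sketch: `build_account_url` is a hypothetical name (it is not part of the Azure SDK), and it assumes the public Azure cloud domain `dfs.core.windows.net` used in this lecture.

```python
def build_account_url(storage_account: str) -> str:
    """Return the Data Lake (dfs) endpoint URL for a storage account name.

    Assumption: the account lives in the public Azure cloud, so the
    endpoint suffix is dfs.core.windows.net.
    """
    return f"https://{storage_account}.dfs.core.windows.net/"


account_url = build_account_url("bendatalake1234")
print(account_url)
```

The resulting string could then be passed as `account_url` when instantiating `DataLakeServiceClient`.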

In our data lake storage, we can create a new file system for storing files using the `DataLakeServiceClient` (not the management client). This will return a `FileSystemClient` object.

In [8]:
fs_client = dl_service_client.create_file_system(
    file_system='dl-file-system'
)

Let's check some of the attributes on our returned object.

In [9]:
fs_client.file_system_name

'dl-file-system'

In [10]:
fs_client.primary_endpoint

'https://bendatalake1234.dfs.core.windows.net/dl-file-system'
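Before creating a file system, it can help to check the name locally. The sketch below assumes file system names follow the same rules as Blob container names (3 to 63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit; no consecutive hyphens). `is_valid_file_system_name` is a hypothetical helper, not an SDK function, and the service remains the final authority on what it accepts.

```python
import re

# Assumed rules (same as Blob container names): 3-63 characters, lowercase
# letters, digits and hyphens only, starts and ends with a letter or digit,
# and no consecutive hyphens.
_FILE_SYSTEM_NAME = re.compile(r"(?=.{3,63}\Z)[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_file_system_name(name: str) -> bool:
    """Return True if the name satisfies the assumed naming rules."""
    return _FILE_SYSTEM_NAME.fullmatch(name) is not None


print(is_valid_file_system_name("dl-file-system"))
print(is_valid_file_system_name("DL--File-System"))
```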

Once the file system is created, we can add structure to it by creating directories and adding files. Let's create a directory for raw data and then upload a file, starting off with the `create_directory()` method on our file system client. This returns a `DataLakeDirectoryClient` object.

In [11]:
dl_dir_client = fs_client.create_directory(directory='raw-data')

Let's check the primary endpoint for our new directory.

In [12]:
dl_dir_client.primary_endpoint

'https://bendatalake1234.dfs.core.windows.net/dl-file-system/raw-data'
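The two endpoints we've seen so far follow a simple pattern: the account URL, then the file system name, then the path inside the file system. The sketch below reproduces that pattern with plain string handling. `path_endpoint` is a hypothetical helper, and it does no percent-encoding, so it should only be expected to match the SDK's values for simple names like the ones used here.

```python
def path_endpoint(account_url: str, file_system: str, *segments: str) -> str:
    """Join an account URL, a file system name and optional path segments."""
    parts = [account_url, file_system, *segments]
    # Strip stray slashes so the pieces are joined by exactly one "/".
    return "/".join(part.strip("/") for part in parts)


account_url = "https://bendatalake1234.dfs.core.windows.net/"
print(path_endpoint(account_url, "dl-file-system"))
print(path_endpoint(account_url, "dl-file-system", "raw-data"))
```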

To upload a file to our directory, we need to:
    
* Create an empty file in the directory, which returns a `DataLakeFileClient`
* Use the new file client to upload the relevant file

In [26]:
file_client = dl_dir_client.create_file('income.csv')

We can open the local file in binary mode with a context manager (a `with` block) and upload its contents.

In [30]:
with open('/home/ben/Downloads/income.csv', 'rb') as myfile:
    file_client.upload_data(data=myfile.read(), overwrite=True)
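Calling `myfile.read()` loads the whole file into memory, which is fine for a small CSV. For larger files, one option is to read the file in fixed-size chunks. The sketch below shows only the local chunking step; `iter_chunks` is a hypothetical helper, and the in-memory stream stands in for a real file.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_chunks(stream: BinaryIO, chunk_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, chunk) pairs until the stream is exhausted."""
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield offset, chunk
        offset += len(chunk)


# Local demonstration with an in-memory stream instead of a real file.
demo = list(iter_chunks(io.BytesIO(b"abcdefg"), chunk_size=3))
print(demo)  # [(0, b'abc'), (3, b'def'), (6, b'g')]
```

With the `file_client` from this lecture, each pair could then be sent with `file_client.append_data(chunk, offset=offset, length=len(chunk))`, followed by one `file_client.flush_data(total_bytes)` call to commit the data, where `total_bytes` is the number of bytes appended. Treat this as an outline under those assumptions rather than a tested upload routine.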

Let's take a look at our directory structure with the `get_paths()` method.

In [32]:
# get_paths() yields both directories and files, so we call each item a path
for path in fs_client.get_paths():
    print(path.name)

raw-data
raw-data/income.csv
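The names above are flat strings, with `/` separating each level. To view them as an indented tree, we can post-process the names locally. This is a minimal sketch: `format_tree` is a hypothetical helper, and the list of names is copied from the output above rather than fetched from Azure.

```python
def format_tree(paths):
    """Render slash-separated path names as an indented tree."""
    lines = []
    # Sort by path components so children stay directly under their parent.
    for path in sorted(paths, key=lambda p: p.split("/")):
        depth = path.count("/")
        name = path.rsplit("/", 1)[-1]
        lines.append("    " * depth + name)
    return "\n".join(lines)


print(format_tree(["raw-data", "raw-data/income.csv"]))
```

In the notebook, the input list could be built with `[path.name for path in fs_client.get_paths()]`.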
