# Example: Using a Datastore and Creating Versioned Datasets 📁

This example walks through using the popular MNIST dataset in Azure ML. 🔢

We will go through the following steps:

1. Download the MNIST dataset
2. Create a Datastore pointing to Azure Blob Storage
3. Upload the MNIST dataset to the Datastore
4. Create Train, Validation, and Test Data Assets pointing to the MNIST dataset in the Datastore

Let's get started!

# The Data 🔢

The MNIST dataset is a collection of 70,000 images of handwritten digits. It is a popular dataset used for image classification. The dataset is split into 60,000 training images and 10,000 test images. The images are grayscale and 28 x 28 pixels in size. The dataset also includes labels for each image, telling us which digit it is. We will work with a subset of the MNIST dataset for this example.

To get the data, clone this repo (or download it as a zip) and navigate to its data folder: https://github.com/BredaUniversityADSAI/MNIST-Data.git

Extract the data to a folder called `data` in the same directory as this notebook.
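
Before uploading anything, it can help to confirm that the extracted folder looks the way the rest of the notebook expects. The sketch below is a minimal check, assuming a layout such as `data/train/<digit>/...` and `data/test/<digit>/...` (one folder per class); the helper name `count_files_per_folder` is hypothetical and not part of Azure ML.

```python
from collections import Counter
from pathlib import Path


def count_files_per_folder(root):
    """Count files per containing folder, keyed by the folder path relative to root."""
    root = Path(root)
    counts = Counter()
    for file_path in root.rglob("*"):
        if file_path.is_file():
            counts[file_path.parent.relative_to(root).as_posix()] += 1
    return dict(counts)


# Only run the check if the folder exists next to the notebook.
if Path("data").is_dir():
    for folder, n_files in sorted(count_files_per_folder("data").items()):
        print(f"{folder}: {n_files} files")
```

If a class folder is missing or unexpectedly empty, it is easier to fix that locally than after the upload.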

# Import Libraries and Set Up Workspace 🏗️

In [None]:
from azureml.core import Workspace, Datastore
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.core import Dataset

subscription_id = "<your-subscription-id>"  # use your subscription id
resource_group = "<your-resource-group>"  # use your resource group
workspace_name = "<your-workspace-name>"  # use your workspace name

# Log in using interactive Auth
auth = InteractiveLoginAuthentication()

# Connect to the workspace.
workspace = Workspace(subscription_id=subscription_id,
                      resource_group=resource_group,
                      workspace_name=workspace_name,
                      auth=auth,
                      )
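
If you would rather not type your identifiers into the notebook, one option is to read them from environment variables. The sketch below shows that idea; the function `workspace_settings_from_env` and the three variable names are hypothetical choices for illustration, not names that Azure ML itself defines.

```python
import os


def workspace_settings_from_env(env=None):
    """Collect the three workspace identifiers from environment variables.

    The variable names used here are hypothetical; pick any names you like.
    """
    env = os.environ if env is None else env
    names = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group": "AZURE_RESOURCE_GROUP",
        "workspace_name": "AZURE_ML_WORKSPACE",
    }
    # Fail early with a readable message if anything is missing or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary has the same keys as the keyword arguments used above, so it could be passed along as `Workspace(**workspace_settings_from_env(), auth=auth)`.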

# Inspect available datastores and upload data 📤 ☁️


In [None]:
# list all datastores registered in the current workspace
datastores = workspace.datastores
for name, datastore in datastores.items():
    print(name, datastore.datastore_type)

In [None]:
# Create a datastore object from the existing datastore named "workspaceblobstore".
datastore = Datastore(workspace, name='workspaceblobstore')

# Upload the contents of the local `data` folder to the `mnist` path in the datastore.
datastore.upload(src_dir='data', target_path='mnist', overwrite=True, show_progress=True)
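
The cells further down read from `mnist/train` and `mnist/test`, which relies on the upload keeping the folder structure below `src_dir`. The sketch below previews that mapping locally, under that assumption; `expected_blob_paths` is a hypothetical helper for illustration and does not call Azure ML.

```python
from pathlib import Path, PurePosixPath


def expected_blob_paths(src_dir, target_path):
    """Map each local file under src_dir to the path it is expected to get in the datastore."""
    src_dir = Path(src_dir)
    mapping = {}
    for file_path in sorted(src_dir.rglob("*")):
        if file_path.is_file():
            # Keep the path relative to src_dir and place it under target_path.
            relative = file_path.relative_to(src_dir).as_posix()
            mapping[str(file_path)] = str(PurePosixPath(target_path) / relative)
    return mapping
```

For example, with `src_dir='data'` and `target_path='mnist'`, a local file `data/train/3/img.png` would be expected at `mnist/train/3/img.png`.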

# Sample and plot images from the datastore

In [None]:
# Create a FileDataset from a path to a directory.
# The directory contains a folder per class, each of which contains image files.
sample_set = Dataset.File.from_files(path=(datastore, 'mnist/train'))
# Randomly sample about 0.1% of the files, keep at most 30, and download them locally.
paths = sample_set.take_sample(0.001).take(30).download()
print(paths)

### Challenge: Plot a sample of images from the data store in a grid with the corresponding labels
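
As a starting point, the two helpers below cover the non-plotting parts of the challenge. They are a sketch under one assumption taken from the comment above: each image sits in a folder named after its class, so the label is the name of the parent folder. `label_from_path` and `grid_shape` are hypothetical names, and the default of 6 columns is an arbitrary choice.

```python
import math
from pathlib import PurePath


def label_from_path(path):
    """Return the name of the folder that directly contains the file."""
    return PurePath(path).parent.name


def grid_shape(n_images, n_cols=6):
    """Return (rows, cols) for a grid that fits n_images using n_cols columns."""
    return math.ceil(n_images / n_cols), n_cols
```

With these in place, the remaining work is to open each downloaded file, draw it in one cell of the grid, and use `label_from_path(path)` as the title of that cell.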


In [None]:
# Your code here


# Create and Register a Training, Validation and Test Dataset 📝

Each dataset below is registered with `create_new_version=True`, so running the cell multiple times creates multiple versions of each dataset under the same name. You can see the versions in the UI.

In [None]:
# Create a FileDataset from a path to a directory for the training data.
train_set = Dataset.File.from_files(path=(datastore, 'mnist/train'))
# Split the dataset into train (about 80%) and validation (about 20%) sets; the fixed seed keeps the split reproducible.
train_set, val_set = train_set.random_split(0.8, seed=123)
# Create a FileDataset from a path to a directory for the test data.
test_set = Dataset.File.from_files(path=(datastore, 'mnist/test'))

# Register the datasets
train_reg = train_set.register(workspace=workspace, name='digits_train', description='training data', create_new_version=True)
val_reg = val_set.register(workspace=workspace, name='digits_val', description='validation data', create_new_version=True)
test_reg = test_set.register(workspace=workspace, name='digits_test', description='test data', create_new_version=True)
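
To build some intuition for what a seeded random split like `random_split(0.8, seed=123)` gives you, here is a plain Python sketch of the general idea. It is not the Azure ML implementation, and `seeded_split` is a hypothetical helper; it only illustrates why such a split is approximate in size yet repeatable for a given seed.

```python
import random


def seeded_split(items, fraction, seed):
    """Send each item to the first list with probability `fraction`, else to the second."""
    rng = random.Random(seed)  # a private generator, so the global random state is untouched
    first, second = [], []
    for item in items:
        (first if rng.random() < fraction else second).append(item)
    return first, second


train_files, val_files = seeded_split(range(1000), 0.8, seed=123)
print(len(train_files), len(val_files))
```

Because each item is assigned independently, the first part holds roughly, not exactly, 80% of the items, and calling the function again with the same seed returns the same two lists.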


In [None]:
# list all datasets registered in the current workspace
datasets = workspace.datasets
for name, dataset in datasets.items():
    print(name)

### Challenge: Try to print the version of the datasets as well.


In [None]:
# Your code here