# Exercise 2 - From Data to Model

In [the previous exercise](./01%20-%20Getting%20Started%20with%20Azure%20ML.ipynb), you created an Azure ML workspace and ran a simple experiment based on data in a CSV file in the **data** folder where this notebook is stored. Although it's fairly common for data scientists to work with data on their local file system, in an enterprise environment it can be more effective to store the data in a central location where multiple data scientists can access it. 

> **Important**: This exercise assumes you have completed the previous exercise in this series - specifically, you must have:
>
> - Created an Azure ML Workspace, and saved its configuration in this Azure Notebooks project.
>
> If you haven't done that, do it now - it'll only take a few minutes!

## Task 1: Connect to Your Workspace

The first thing you need to do is to connect to your workspace using the Azure ML SDK. Let's start by ensuring you still have the latest version installed.

In [None]:
# Uncomment the next line to upgrade the SDK to the latest version
#!pip install --upgrade azureml-sdk[notebooks]
import azureml.core
print("Ready to use Azure ML", azureml.core.VERSION)

Now you're ready to connect to your workspace. When you created it in the previous exercise, you saved its configuration; so now you can simply load the workspace from its configuration file.

> **Note**: If the authenticated session with your Azure subscription has expired since you completed the previous exercise, you'll be prompted to reauthenticate.

In [None]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)
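
If loading the workspace fails, it can help to check which configuration file is being picked up. The following is a minimal sketch using only the standard library, not the SDK's own lookup logic. It assumes the file is named `config.json`, lives either in the working folder or in a `.azureml` subfolder of it (or of a parent folder), and holds `subscription_id`, `resource_group`, and `workspace_name` keys. The function names are hypothetical.

```python
import json
from pathlib import Path


def find_workspace_config(start_dir=".", file_name="config.json"):
    """Return the first config file found in start_dir or one of its parents."""
    start = Path(start_dir).resolve()
    for folder in [start, *start.parents]:
        # Assumption: look in a .azureml subfolder first, then the folder itself
        for candidate in (folder / ".azureml" / file_name, folder / file_name):
            if candidate.is_file():
                return candidate
    return None


def describe_workspace_config(config_path):
    """Load the config file and list any expected keys that are missing."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    expected = ("subscription_id", "resource_group", "workspace_name")
    missing = [key for key in expected if key not in config]
    return config, missing
```

Calling `find_workspace_config()` from the notebook folder returns the path of the file that was found, or `None` if there isn't one; passing that path to `describe_workspace_config` shows whether any of the assumed keys are absent.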

## Task 2: Upload Data to a Datastore

In Azure ML, *datastores* are references to storage locations, such as Azure Storage blob containers. Every workspace has a *default* datastore - usually the Azure Storage blob container that was created with the workspace. If you need to work with data that is stored in different locations, you can add custom datastores to your workspace and set any of them to be the default.

Run the following code to determine the datastores in your workspace:

In [None]:
from azureml.core import Datastore

# Get the default datastore
default_ds = ws.get_default_datastore()

# Enumerate all datastores, indicating which is the default
for ds_name in ws.datastores:
    print(ds_name, "- Default =", ds_name == default_ds.name)
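
You don't need to add a custom datastore for this exercise, but for reference, the following sketch shows one way to register an existing blob container and optionally make it the default. The function name `add_blob_datastore` and its arguments are hypothetical; the container name, storage account name, and account key are values you would supply for your own storage account.

```python
def add_blob_datastore(ws, datastore_name, container_name, account_name,
                       account_key, make_default=False):
    """Register an existing blob container as a datastore in the workspace."""
    # Imported inside the function so it can be defined before the SDK is loaded
    from azureml.core import Datastore

    blob_ds = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name=datastore_name,   # name the workspace will use for it
        container_name=container_name,   # existing blob container
        account_name=account_name,       # storage account that owns the container
        account_key=account_key,         # credential for the storage account
    )

    # Optionally make the new datastore the workspace default
    if make_default:
        ws.set_default_datastore(datastore_name)
    return blob_ds
```

Avoid hard-coding the account key in a notebook; read it from an environment variable or another secure source before passing it in.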

Now that you have determined the available datastores, you can upload files from your local file system to a datastore so that they are accessible to experiments running in the workspace, regardless of where the experiment script is actually run.

In [None]:
default_ds.upload_files(files=['./data/diabetes.csv'], # Upload the data/diabetes.csv file
                       target_path='diabetes-data/', # Put it in a folder path in the datastore
                       overwrite=True, # Replace existing files of the same name
                       show_progress=True)

Note that the upload to the datastore results in the creation of a *data reference*, which is an abstraction that represents the connection to the data in the datastore.

So now there's a copy of the diabetes data in the default datastore for the workspace that you can use in future experiments. If you like, you can use the *Storage Explorer* interface for the Azure Storage account that was created with your Azure ML workspace in the [Azure portal](https://portal.azure.com) to verify that the *diabetes.csv* file has been uploaded to a *diabetes-data* folder in the *azureml-blobstore-nnnn...* blob container.

We'll return to datastores later, but for now, let's turn our attention to another data-related object in Azure ML - the *dataset*.

## Task 3: Create and Register a Dataset

A dataset is an object that encapsulates a specific data source. Let's create a dataset from the diabetes data you uploaded to the datastore, and view the first 20 records. In this case, the data is in a structured format in a CSV file, so we'll use a *Tabular* dataset.

> **More Information**: See the [documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-register-datasets) for more information about creating datasets.

In [None]:
from azureml.core import Dataset

# Create a tabular dataset from the path on the datastore (this may take a short while)
data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/diabetes.csv'))

# Display the first 20 rows as a Pandas dataframe
data_set.take(20).to_pandas_dataframe()

As you can see in the code above, it's easy to convert the dataset to a Pandas dataframe, enabling you to work with the data using common Python techniques.

Now that we have a dataset that references the diabetes data, we can register it to make it easily accessible to any experiment being run in the workspace.

In [None]:
# Register the dataset
dataset_name = 'Diabetes Dataset'
data_set = data_set.register(workspace=ws,
                           name=dataset_name,
                           description='diabetes data',
                           tags={'year': '2019', 'category': 'Diabetes'},
                           create_new_version=True)

# List the datasets registered in the workspace
for ds in ws.datasets:
    print(ds)
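
Once the dataset is registered, a later notebook or experiment script can look it up by name instead of rebuilding it from the datastore path. The following is a minimal sketch of that lookup; the helper name `load_registered_dataset` is hypothetical, and it assumes the dataset is tabular so that it can be converted to a Pandas dataframe.

```python
def load_registered_dataset(ws, name, version="latest"):
    """Fetch a registered dataset by name and return it as a Pandas dataframe."""
    # Imported inside the function so it can be defined before the SDK is loaded
    from azureml.core import Dataset

    # Look up the registered dataset in the workspace
    registered = Dataset.get_by_name(workspace=ws, name=name, version=version)

    # Assumption: the dataset is tabular, so it can be loaded into a dataframe
    return registered.to_pandas_dataframe()
```

For example, `load_registered_dataset(ws, 'Diabetes Dataset')` would return the latest registered version of the dataset created above.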