# Create an Azure ML workspace data asset from a local file

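## Step 1: Import a data asset into the Azure ML workspace using the CLI

Import local data into the Azure ML workspace. The YAML file written in the first cell describes the data asset, and `az ml data create` uploads the file and registers it.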

In [1]:
%%writefile dependencies/dataimport.yml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: credit_cards
description: Data asset created from local file.
type: uri_file
path: ../data/default_of_credit_card_clients.csv

Overwriting dependencies/dataimport.yml


In [2]:
!az ml data create --file dependencies/dataimport.yml 

{
  "creation_context": {
    "created_at": "2023-11-15T12:54:39.404122+00:00",
    "created_by": "Anton Slutsky",
    "created_by_type": "User",
    "last_modified_at": "2023-11-15T12:54:39.412977+00:00"
  },
  "description": "Data asset created from local file.",
  "id": "/subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourceGroups/aigbb-aml-bootcamp/providers/Microsoft.MachineLearningServices/workspaces/aigbb-aml-bootcamp/data/credit_cards/versions/1",
  "name": "credit_cards",
  "path": "azureml://subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourcegroups/aigbb-aml-bootcamp/workspaces/aigbb-aml-bootcamp/datastores/workspaceblobstore/paths/LocalUpload/4b1dfc4d12429b46389cabdf25b886a2/default_of_credit_card_clients.csv",
  "properties": {},
  "resourceGroup": "aigbb-aml-bootcamp",
  "tags": {},
  "type": "uri_file",
  "version": "1"
}



Uploading default_of_credit_card_clients.csv (< 1 MB): 0.00B [00:00, ?B/s]
Uploading default_of_credit_card_clients.csv (< 1 MB): 100%|##########| 2.90M/2.90M [00:00<00:00, 9.18MB/s]




Read [Access data from Azure cloud storage during interactive development](how-to-access-data-interactive.md) to learn more about data access in a notebook.
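
The JSON printed by `az ml data create` carries everything needed to refer to the asset later. The following is a minimal sketch, assuming the output has been captured as text. The helper name `summarize_data_asset` is hypothetical, and the `azureml:<name>:<version>` short form is assumed here to be the reference syntax accepted by Azure ML job definitions; check it against the CLI documentation before relying on it.

```python
import json

# Trimmed copy of the JSON printed by `az ml data create` in Step 1.
cli_output = """
{
  "name": "credit_cards",
  "path": "azureml://subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourcegroups/aigbb-aml-bootcamp/workspaces/aigbb-aml-bootcamp/datastores/workspaceblobstore/paths/LocalUpload/4b1dfc4d12429b46389cabdf25b886a2/default_of_credit_card_clients.csv",
  "type": "uri_file",
  "version": "1"
}
"""


def summarize_data_asset(raw_json: str) -> dict:
    """Hypothetical helper: keep only the fields needed to reference the asset."""
    asset = json.loads(raw_json)
    return {
        "name": asset["name"],
        "version": asset["version"],
        "path": asset["path"],
        # Assumed short reference form: azureml:<name>:<version>
        "reference": f"azureml:{asset['name']}:{asset['version']}",
    }


summary = summarize_data_asset(cli_output)
print(summary["reference"])
print(summary["path"])
```

The `path` value is the full `azureml://` URI of the uploaded file in the workspace blob store, while the short reference depends only on the asset name and version.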

## Step 2: Create Parquet version of the Data asset

Create a new version of the data asset and store in Parquet file format

You might have noticed that the data needs some light cleaning before it is fit to train a machine learning model. It has:

* two headers
* a client ID column, which is not a useful feature for machine learning
* spaces in the response variable name

Also, compared to the CSV format, Parquet is a better way to store this data: it offers compression and it preserves the schema. The following steps clean the data and store it in Parquet.

### Step 2a: Install data wrangling libraries

In [3]:
!pip install pyarrow pandas



In [4]:
import pandas as pd

# read in data again, this time using the 2nd row as the header
df = pd.read_csv("data/default_of_credit_card_clients.csv", header=1)
# rename column
df.rename(columns={"default payment next month": "default"}, inplace=True)
# remove ID column
df.drop("ID", axis=1, inplace=True)

# write file to filesystem
df.to_parquet("./data/cleaned-credit-card.parquet")

This table shows the structure of the data in the original **default_of_credit_card_clients.csv** file from Step 1. The data contains 23 explanatory variables and 1 response variable, as shown here:

|Column Name(s) | Variable Type  |Description  |
|---------|---------|---------|
|X1     |   Explanatory      |    Amount of the given credit (NT dollar): it includes both the individual consumer credit and their family (supplementary) credit.    |
|X2     |   Explanatory      |   Gender (1 = male; 2 = female).      |
|X3     |   Explanatory      |   Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).      |
|X4     |   Explanatory      |    Marital status (1 = married; 2 = single; 3 = others).     |
|X5     |   Explanatory      |    Age (years).     |
|X6-X11     | Explanatory        |  History of past payment, based on the monthly payment records from April to September 2005. -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.      |
|X12-X17     | Explanatory        |  Amount of bill statement (NT dollar) from April to September 2005.      |
|X18-X23     | Explanatory        |  Amount of previous payment (NT dollar) from April to September 2005.      |
|Y     | Response        |    Default payment (Yes = 1, No = 0)     |
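
The split between explanatory variables and the response variable maps directly onto a feature/target split. The sketch below assumes the cleaned frame from the previous cell, where the response column has been renamed to `default`. The rows, the feature names `LIMIT_BAL` and `SEX`, and the helper `split_features_response` are all hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the cleaned frame.
cleaned = pd.DataFrame(
    {"LIMIT_BAL": [20000, 120000], "SEX": [2, 2], "default": [1, 0]}
)


def split_features_response(df: pd.DataFrame, response: str = "default"):
    """Hypothetical helper: return (explanatory variables, response variable)."""
    features = df.drop(columns=response)
    target = df[response]
    return features, target


features, target = split_features_response(cleaned)
print(list(features.columns), target.name)
```

On the real data, the same split should leave 23 feature columns and one response column, matching the table.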

Next, register the cleaned Parquet file as a new data asset (the data automatically uploads to cloud storage):

> [!NOTE]
>
> The YAML file in Step 2b sets a fixed **name** for the data asset it creates. If you also set a fixed **version** in that file, the `az ml data create` command fails when it runs more than once, unless you change one of these values. Fixed **name** and **version** values let you refer to a specific, known asset without relying on auto-generated or randomly generated values.


### Step 2b: Import the Parquet file into the Azure ML workspace

In [5]:
%%writefile dependencies/dataimport_parquet.yml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: credit_cards_parquet
description: Data asset created from local file.
type: uri_file
path: ../data/cleaned-credit-card.parquet

Writing dependencies/dataimport_parquet.yml


In [6]:
!az ml data create --file dependencies/dataimport_parquet.yml 

{
  "creation_context": {
    "created_at": "2023-11-15T12:55:22.488730+00:00",
    "created_by": "Anton Slutsky",
    "created_by_type": "User",
    "last_modified_at": "2023-11-15T12:55:22.519881+00:00"
  },
  "description": "Data asset created from local file.",
  "id": "/subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourceGroups/aigbb-aml-bootcamp/providers/Microsoft.MachineLearningServices/workspaces/aigbb-aml-bootcamp/data/credit_cards_parquet/versions/1",
  "name": "credit_cards_parquet",
  "path": "azureml://subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourcegroups/aigbb-aml-bootcamp/workspaces/aigbb-aml-bootcamp/datastores/workspaceblobstore/paths/LocalUpload/56b293e476b6312818bef5931254dae9/cleaned-credit-card.parquet",
  "properties": {},
  "resourceGroup": "aigbb-aml-bootcamp",
  "tags": {},
  "type": "uri_file",
  "version": "1"
}



Uploading cleaned-credit-card.parquet (< 1 MB): 0.00B [00:00, ?B/s]
Uploading cleaned-credit-card.parquet (< 1 MB): 100%|##########| 1.58M/1.58M [00:00<00:00, 7.39MB/s]
Uploading cleaned-credit-card.parquet (< 1 MB): 100%|##########| 1.58M/1.58M [00:00<00:00, 7.23MB/s]


