# Create an ML Table

ML Table is aimed at:
- Capturing and defining schema contained in flat files (csv, parquet)
- Extracting relevant subsets from large data
- Fast materialization of data into Pandas and Spark

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](/sdk/resources/workspace/workspace.ipynb)
- A Compute Cluster. [Check this notebook to create a compute cluster](/sdk/resources/compute/compute.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2, installed - [install instructions](/sdk/README.md#getting-started)

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Create a new mltable using the different `from_*` options, or `load` an existing mltable artifact
- `save` an mltable locally or `upload` it to a cloud path

**Motivations** - This notebook shows, with examples, the various ways of creating an mltable and how to use the different helper methods on it.


# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the datastore will be created.

## 1.1. Import the required libraries


In [None]:
# mltable will be a separate package that the user installs with "pip install mltable"
# import required libraries
import mltable as mlt
from azure.ml import MLClient
from azure.ml.entities import CommandJob, JobInput
from azure.identity import InteractiveBrowserCredential

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need identifier parameters - a subscription, resource group and workspace name. We will use these details in the `MLClient` from `azure.ml` to get a handle to the required Azure Machine Learning workspace. We use the default [interactive authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.interactivebrowsercredential?view=azure-python) for this tutorial. More advanced connection methods can be found [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [None]:
#Enter details of your AML workspace
subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace = '<AML_WORKSPACE_NAME>'
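Before constructing the client, it can help to confirm that none of the three identifiers is still a placeholder. The following is a minimal sketch of such a check; `find_unset` is a hypothetical helper (not part of any Azure SDK), and the resource group value is made up for illustration.

```python
def find_unset(details):
    """Return the keys whose values are empty or still look like <PLACEHOLDER> text."""
    return [
        key
        for key, value in details.items()
        if not value or (value.startswith("<") and value.endswith(">"))
    ]


details = {
    "subscription_id": "<SUBSCRIPTION_ID>",  # placeholder left unchanged
    "resource_group": "my-resource-group",   # made-up example value
    "workspace": "",                         # empty value
}

print(find_unset(details))  # ['subscription_id', 'workspace']
```

If the returned list is non-empty, the listed values should be filled in before the workspace handle is requested.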

In [None]:
#get a handle to the workspace
ml_client = MLClient(InteractiveBrowserCredential(), subscription_id, resource_group, workspace)

### Create mltable artifact

The entry point for creating an MLTable object in memory is a `from_*` method. A `from_*` method does not materialize the data; instead, it stores the operation as a transform in the MLTable definition.

In [None]:
import mltable as mlt

# Use a local file
tbl = mlt.from_delimited_files(
    file="./data/titanic.csv", 
    validate=True # default=False
)



# Use a local folder

tbl = mlt.from_delimited_files(
    folder="./data/nyc-taxi",
    delimiter=",",
    validate=True,  # default=False
    header="all_files_same_headers",  # [no_header, from_first_file, all_files_different_headers, all_files_same_headers]
    support_multi_line=False,
    empty_as_string=False,
    encoding="utf8",  # [utf8, iso88591, latin1, ascii, utf16, utf32, utf8bom, windows1252]
    include_path_column=False
)
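
To make "stores a transform rather than materializing the data" concrete, here is a toy sketch in plain Python. `LazyTable` is a hypothetical class written for this illustration; it is not how mltable is implemented, and the CSV content is made up. Each method only records a step, and the source is read only when `to_rows` is called.

```python
import csv
import io


class LazyTable:
    """Toy stand-in for a lazily evaluated table (hypothetical, not part of mltable)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable returning a text stream
        self._steps = tuple(steps)   # recorded transforms, not yet applied

    def take(self, n):
        # Return a new table with one more recorded step; no data is read here.
        return LazyTable(self._source, self._steps + (("take", n),))

    def drop_columns(self, columns):
        # Same idea: only the intent to drop columns is recorded.
        return LazyTable(self._source, self._steps + (("drop_columns", tuple(columns)),))

    def to_rows(self):
        # Materialization: only now is the source opened and every step applied in order.
        rows = list(csv.DictReader(self._source()))
        for name, arg in self._steps:
            if name == "take":
                rows = rows[:arg]
            elif name == "drop_columns":
                rows = [{k: v for k, v in row.items() if k not in arg} for row in rows]
        return rows


reads = []  # one entry is appended each time the source is actually opened


def open_source():
    reads.append(1)
    return io.StringIO("id,name,age\n1,Ann,31\n2,Bob,25\n3,Cy,40\n")


tbl = LazyTable(open_source).drop_columns(["age"]).take(2)
print(len(reads))     # 0: nothing has been read yet
print(tbl.to_rows())  # the source is read once and both steps are applied
```

Because every method returns a new object, the result has to be kept (for example by reassigning `tbl`) for the recorded step to take effect.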
 


### Saving an ML Table artifact to local path
`save` supports saving to a local path only.


In [None]:
import mltable as mlt
tbl = mlt.from_delimited_files(file="./data/iris/iris.csv")
tbl = tbl.drop_columns("petal.width")  # assumes the call returns the updated table, as `filter` does
local_path = "./data/iris"  # the folder that the "Loading an existing MLTable artifact" cell reads from
tbl.save(path=local_path, overwrite=True)


### Loading an existing MLTable artifact

NOTE: The previous Python cell, which saves an ML Table to a local path, must have run successfully before this one.

In [None]:
import mltable as mlt

tbl = mlt.load(folder="./data/iris")
df = tbl.to_pandas_dataframe()

### Loading a V1 Dataset

NOTE: In the cell below, replace "v1dataset" with the name of the V1 AzureML Dataset in your workspace, and replace "5" with its version, if any.

In [None]:
import mltable as mlt

tbl = mlt.load(folder="azureml:v1dataset:5")
df = tbl.to_pandas_dataframe()

### Uploading an ML Table artifact to a cloud path
To save an MLTable to a cloud location, use `datastore.upload()`. Datastores can be accessed via `MLClient`.

In [None]:
from azure.ml import MLClient
import mltable as mlt
tbl = mlt.from_delimited_files(file="./data/iris/iris.csv")
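tbl = tbl.drop_columns("petal.width")  # assumes the call returns the updated table, as `filter` does
dstor = ml_client.datastores.get_default()  # ml_client is the workspace handle created in section 1.2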
cloud_path = dstor.upload(src_path="./my_data", target_folder="")


# MLTable operations

The following are helper methods on mltable.

### take

In [None]:
import mltable as mlt

tbl = mlt.from_delimited_files(file="./data/titanic.csv")
tbl.take(5)

### take_random_sample

In [None]:
import mltable as mlt

tbl = mlt.from_delimited_files(file="./data/titanic.csv")
tbl.take_random_sample(
    probability=1,
    seed=3 
)
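
The roles of `probability` and `seed` can be sketched in plain Python. This assumes that `probability` is the independent chance of each record being selected, which is how the parameter name reads; `bernoulli_sample` is a hypothetical helper, not an mltable function.

```python
import random


def bernoulli_sample(rows, probability, seed=None):
    """Keep each row independently with the given probability (hypothetical helper)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    rng = random.Random(seed)  # a dedicated generator makes the sample reproducible
    return [row for row in rows if rng.random() < probability]


rows = list(range(1000))
sample_a = bernoulli_sample(rows, probability=0.1, seed=3)
sample_b = bernoulli_sample(rows, probability=0.1, seed=3)

print(sample_a == sample_b)  # True: the same seed gives the same sample
print(len(bernoulli_sample(rows, probability=1, seed=3)))  # 1000: every row is kept
```

Under this reading, `probability=1` as used in the cell above keeps every row, so a smaller value is needed to actually reduce the data.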

### skip

In [None]:
import mltable as mlt
# "import a.B as c" only works for modules, so import the name from the package instead
from mltable import PromoteHeadersBehavior as phb

tbl = mlt.from_delimited_files(
    file="./data/iris/iris.csv",
    header=phb.ONLY_FIRST_FILE_HAS_HEADERS)
tbl.skip(n=5)



### keep_columns
`keep_columns` retains only the columns listed in `column_list` and drops everything else. You can specify column names or use a search pattern.


In [None]:
import mltable as mlt

tbl = mlt.from_delimited_files(file="./data/titanic.csv")
tbl.keep_columns(column_list = ["PassengerId","Survived","Name","Sex","Age"])
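
The difference between selecting by name and selecting by search pattern can be sketched as follows. `select_columns` is a hypothetical helper, and treating the pattern as a regular expression is an assumption made for this illustration; the exact pattern syntax that `keep_columns` accepts is not specified here.

```python
import re


def select_columns(all_columns, names=(), pattern=None):
    """Return the columns matching explicit names or a regex (hypothetical helper)."""
    regex = re.compile(pattern) if pattern else None
    wanted = set(names)
    return [
        col
        for col in all_columns
        if col in wanted or (regex is not None and regex.search(col))
    ]


columns = ["PassengerId", "Survived", "Name", "Sex", "Age", "Ticket", "Fare", "Parch", "Embarked"]

print(select_columns(columns, names=["Name", "Age"]))  # ['Name', 'Age']
print(select_columns(columns, pattern=r"^S"))          # ['Survived', 'Sex']
```

The result keeps the original column order rather than the order in which the names were requested.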

### drop_columns

In [None]:
import mltable as mlt

tbl = mlt.from_delimited_files(file="./data/titanic.csv")
tbl.drop_columns(column_list=["Embarked", "Ticket"])

### filter

In [None]:
import mltable as mlt

tbl = mlt.from_delimited_files(file="./data/titanic.csv")
tbl = tbl.filter(tbl['Age'] > 25)

### convert_column_types
`convert_column_types` converts each listed column to the data type specified for it.

In [None]:
import mltable as mlt

tbl = mlt.from_delimited_files(file="./data/titanic.csv")
# using a dict that maps each column name to its target type
new_types = {
    'Fare': mlt.DataType.to_int(),
    'Parch': mlt.DataType.to_string()
}

tbl.convert_column_types(new_types)
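
The effect of a column-to-type mapping can be sketched in plain Python. `convert_columns` is a hypothetical helper, the two rows are made-up data, and the converters are ordinary Python callables standing in for the `DataType` objects used above.

```python
def convert_columns(rows, converters):
    """Apply a per-column converter to every row (hypothetical helper)."""
    return [
        # Columns without a converter pass through unchanged.
        {col: converters[col](val) if col in converters else val for col, val in row.items()}
        for row in rows
    ]


rows = [
    {"Name": "Ann", "Fare": "7.25", "Parch": 0},   # made-up example rows
    {"Name": "Bob", "Fare": "71.28", "Parch": 1},
]
converters = {
    "Fare": lambda v: int(float(v)),  # note: truncates the fractional part
    "Parch": str,
}

converted = convert_columns(rows, converters)
print(converted[0])  # {'Name': 'Ann', 'Fare': 7, 'Parch': '0'}
```

In this sketch, converting a fractional value such as a fare to an integer loses precision, so it is worth checking a preview of the data after the conversion.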



### show
Before materializing into a pandas/Spark data frame, it can be convenient to preview the data quickly. The `show` function prints the first 20 rows by default; the number of rows can be adjusted by the user.


In [None]:
import mltable as mlt

tbl = mlt.from_delimited_files(file="./data/titanic.csv")
tbl.show()
# outputs first 20 rows

tbl.show(5)
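
One reason a preview can be quick is that only the leading rows need to be produced. The following plain-Python sketch illustrates that idea with a generator; `preview` and `row_stream` are hypothetical names, and this is not a description of how `show` is implemented.

```python
from itertools import islice


def preview(rows, count=20):
    """Return at most `count` leading rows without consuming the rest (hypothetical helper)."""
    return list(islice(rows, count))


produced = []  # records which rows the source actually generated


def row_stream(total):
    for i in range(total):
        produced.append(i)
        yield {"row": i}


first = preview(row_stream(1_000_000))
print(len(first), len(produced))               # 20 20: only the previewed rows were generated
print(len(preview(row_stream(1_000_000), 5)))  # 5
```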