# In this notebook, you'll test the data_prep module


You'll need the latest version of the **azure-ai-ml** package to run the code in this notebook. Run the cell below to check which Python environment the notebook is using.

## Setup

Knowing which Python environment is active is important if you want to use your `.py` modules to create jobs, rather than writing your code directly in the notebook.




In [1]:
import sys

print(sys.prefix)

/anaconda/envs/condav0


## Connect to your workspace

With the required SDK packages installed, you're ready to connect to your workspace.

To connect to a workspace, you need three identifiers: a subscription ID, a resource group name, and a workspace name. Because you're working on a compute instance managed by Azure Machine Learning, you can use the default values to connect to the workspace.

In [2]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

In [3]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

Found the config file in: /config.json
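
`MLClient.from_config` reads the workspace identifiers from a `config.json` file, which is why no IDs appear in the cell above. As a minimal sketch of what that file holds, the snippet below writes and reads back a config with the three keys the SDK expects (`subscription_id`, `resource_group`, `workspace_name`). The values are placeholders, and `read_workspace_config` is a hypothetical helper, not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def read_workspace_config(path):
    """Hypothetical helper: load config.json and check the three identifiers."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return {key: config[key] for key in REQUIRED_KEYS}


# Placeholder values only; a compute instance provides a real file.
example = {
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example, indent=2))
    workspace = read_workspace_config(config_path)

print(workspace["workspace_name"])
```

Outside a compute instance, the same three values can be passed to the `MLClient` constructor directly as `subscription_id`, `resource_group_name`, and `workspace_name`.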


## Testing the Data Processing Pipeline

In [4]:
import os
import sys

project_root_directory = os.getcwd().split("/notebooks")[0]
sys.path.insert(0, project_root_directory)
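
The cell above assumes the notebook runs somewhere under a `notebooks` folder and treats everything before it as the project root. A slightly more defensive sketch of the same idea, using `pathlib`, is shown here. `find_project_root` and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def find_project_root(cwd, marker="notebooks"):
    """Return the directory that contains the first `marker` folder in cwd."""
    path = PurePosixPath(cwd)
    parts = path.parts
    if marker not in parts:
        # Not inside a notebooks folder: fall back to the directory itself.
        return str(path)
    return str(PurePosixPath(*parts[: parts.index(marker)]))


print(find_project_root("/home/user/project/notebooks/exploration"))
# /home/user/project
```

Unlike a plain string split, this matches only a folder named exactly `notebooks`, so a sibling such as `notebooks_old` is not mistaken for the marker. The result can be used in the same way: `sys.path.insert(0, find_project_root(os.getcwd()))`.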

In [5]:
from core.data_preparation.config import FeatureEngineeringConfig, MethodConfig
from core.data_preparation.pipeline import DataPrepPipelineBuilder
import pandas as pd

In [6]:
# Create the FeatureEngineeringConfig object
config = FeatureEngineeringConfig(
    data_preparation_steps=[
        MethodConfig(name="sklearn.preprocessing.Normalizer", params=dict(norm="l2")),
        MethodConfig(
            name="core.data_preparation.pipeline.DropColumns",
            params=dict(columns="Selfi_Cam"),
        ),
    ],
)

The next cell rebuilds the config without the normalizer, for testing purposes only.

In [7]:
# Create the FeatureEngineeringConfig object without the normalizer
config = FeatureEngineeringConfig(
    data_preparation_steps=[
        MethodConfig(
            name="core.data_preparation.pipeline.DropColumns",
            params=dict(columns="Selfi_Cam"),
        )
    ],
)

Alternatively, load the configuration from a YAML file:

In [8]:
# model_training_config_file = 'path'
# model_training_config = FeatureEngineeringConfig.load_from_yaml(model_training_config_file)
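
Each `MethodConfig` above names a step by its dotted import path plus keyword parameters. How `DataPrepPipelineBuilder` turns those into objects is not shown in this notebook. One common approach, sketched below under that assumption, resolves the path with `importlib` and calls the class with `params`. `StepConfig` and `build_step` are hypothetical stand-ins, and a standard-library class is used so the snippet runs without the project installed.

```python
import importlib
from dataclasses import dataclass, field


@dataclass
class StepConfig:
    """Hypothetical stand-in for MethodConfig: a dotted name plus kwargs."""
    name: str
    params: dict = field(default_factory=dict)


def build_step(step):
    """Import `module.ClassName` from step.name and instantiate it."""
    module_path, _, attr_name = step.name.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected a dotted path, got {step.name!r}")
    module = importlib.import_module(module_path)
    factory = getattr(module, attr_name)
    return factory(**step.params)


# fractions.Fraction plays the role of a transformer class here.
step = build_step(
    StepConfig(name="fractions.Fraction", params={"numerator": 3, "denominator": 4})
)
print(step)
# 3/4
```

Keeping steps as strings and dictionaries is what makes a YAML-based configuration possible, since nothing in the config needs to be a live Python object.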

In [10]:
import mltable

registered_data_asset = ml_client.data.get(name="mobile-price-table", version=1)
tbl = mltable.load(f"azureml:/{registered_data_asset.id}")
df = tbl.to_pandas_dataframe()
df.head(5)

INFO:azure.identity._internal.get_token_mixin:AzureMLCredential.get_token succeeded
INFO:azure.identity._internal.decorators:ManagedIdentityCredential.get_token succeeded
INFO:azure.identity._credentials.default:DefaultAzureCredential acquired a token from ManagedIdentityCredential
INFO:azure.identity._credentials.environment:No environment configuration found.
INFO:azure.identity._credentials.managed_identity:ManagedIdentityCredential will use Azure ML managed identity
INFO:azure.identity._credentials.chained:DefaultAzureCredential acquired a token from ManagedIdentityCredential

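| Unnamed: 0 | Ratings | RAM | ROM | Mobile_Size | Primary_Cam | Selfi_Cam | Battery_Power | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.3 | 4.0 | 128.0 | 6.0 | 48 | 13.0 | 4000 | 24999 |
| 1 | 3.4 | 6.0 | 64.0 | 4.5 | 48 | 12.0 | 4000 | 15999 |
| 2 | 4.3 | 4.0 | 4.0 | 4.5 | 64 | 16.0 | 4000 | 15000 |
| 3 | 4.4 | 6.0 | 64.0 | 6.4 | 48 | 15.0 | 3800 | 18999 |
| 4 | 4.5 | 6.0 | 128.0 | 6.18 | 35 | 15.0 | 3800 | 18999 |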


In [None]:
pipeline_builder = DataPrepPipelineBuilder(config)
pipeline = pipeline_builder.get_pipeline()
X_transformed = pipeline.transform(df)
print(X_transformed)
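
`DropColumns` lives in the project's own `core.data_preparation.pipeline` module, so its code is not visible here. As a rough illustration of what such a step does, the sketch below follows the `fit`/`transform` convention on a list of row dictionaries instead of a DataFrame, which keeps it dependency-free. `DropColumnsSketch` is hypothetical, and the two rows are abbreviated from the data preview above.

```python
class DropColumnsSketch:
    """Hypothetical, simplified stand-in for the project's DropColumns step."""

    def __init__(self, columns):
        # Accept a single name (as in the config above) or a list of names.
        self.columns = [columns] if isinstance(columns, str) else list(columns)

    def fit(self, rows, y=None):
        # Nothing to learn: dropping columns is stateless.
        return self

    def transform(self, rows):
        # Return new rows without the unwanted keys; inputs are not mutated.
        return [
            {key: value for key, value in row.items() if key not in self.columns}
            for row in rows
        ]


rows = [
    {"Ratings": 4.3, "RAM": 4.0, "Selfi_Cam": 13.0, "Price": 24999},
    {"Ratings": 3.4, "RAM": 6.0, "Selfi_Cam": 12.0, "Price": 15999},
]

cleaned = DropColumnsSketch(columns="Selfi_Cam").fit(rows).transform(rows)
print(cleaned[0])
# {'Ratings': 4.3, 'RAM': 4.0, 'Price': 24999}
```

A quick check such as `"Selfi_Cam" not in cleaned[0]` is a cheap way to confirm that a configured step actually ran before submitting a job.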




## Checking for existing environments

Let's explore the environments within the workspace.


> **Note**:
> If the **azure-ai-ml** package is not installed, run `pip install azure-ai-ml` to install it.

In [None]:
envs = ml_client.environments.list()
for env in envs:
    print(env.name)

Submitting the job with the new custom environment triggers the build of the environment. The first time you use a newly created environment, it can take 10-15 minutes to build the environment, which also means your job will take longer to complete.
You can also choose to manually trigger the build of the environment before you submit a job. The environment only needs to be built the first time you use it.

## Creating a job to use a data asset

After using a notebook for experimentation, you can use scripts to train machine learning models. A script can be run as a job, and for each job you can specify inputs and outputs.

You can use either **data assets** or **datastore paths** as inputs or outputs of a job. 

The cell below creates the **main.py** script in the parent folder. The script loads the input data with the project's `DataLoader`, applies the data preparation pipeline, and stores the result as a CSV file in the output path.

To submit a job that runs the **main.py** script, run the cells below.

The job is configured to use the data asset `mobile-price-local`, pointing to the local **mobile-price-local.csv** file, as input. The output is a path pointing to a folder in the new datastore `blob_mobileprice_cleaned`.

In [None]:
%%writefile ../main.py

import sys
import os

# Make the packages inside ml-core-template importable
current_path = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, os.path.join(current_path, "ml-core-template"))

print(sys.path)

from core.data_preparation.config import FeatureEngineeringConfig, MethodConfig
from core.data_preparation.pipeline import DataPrepPipelineBuilder
from azure_module.data_loader import DataLoader


data_loader = DataLoader()
data_loader.parse_input_arg()
data_loader.parse_output_arg()
df = data_loader.get_data()


config = FeatureEngineeringConfig(
    data_preparation_steps=[MethodConfig(name='core.data_preparation.pipeline.DropColumns', params=dict(columns='Selfi_Cam'))],
    
)

pipeline_builder = DataPrepPipelineBuilder(config)
pipeline = pipeline_builder.get_pipeline()
X_transformed = pipeline.transform(df)


data_loader.save_data(X_transformed, filename="load_outside_job.csv")

In [None]:
envs = ml_client.environments.list()
for env in envs:
    print(env.name)

In [None]:
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import command

# configure input and output
my_job_inputs = {
    "local_data": Input(type=AssetTypes.URI_FILE, path="azureml:mobile-price-local:1")
}

my_job_outputs = {
    "datastore_data": Output(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/blob_mobileprice_cleaned/paths/datastore-path",
    )
}

# configure job
job = command(
    code="../",
    command="python main.py --input_data ${{inputs.local_data}} --output_datastore ${{outputs.datastore_data}}",
    inputs=my_job_inputs,
    outputs=my_job_outputs,
    environment="docker-context-repo-based-v1:1",
    compute="sandbox-ci",
    display_name="test",
    experiment_name="test",
)

# submit job
returned_job = ml_client.create_or_update(job)
aml_url = returned_job.studio_url
print("Monitor the job at", aml_url)
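
The job command passes the resolved input and output locations to `main.py` as `--input_data` and `--output_datastore`. `DataLoader` is part of the project and its parsing code is not shown; a minimal `argparse` sketch of how a script could read those two flags might look like this. `parse_job_args` and the example paths are hypothetical.

```python
import argparse


def parse_job_args(argv=None):
    """Parse the two paths that the job command passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, help="Path of the job input file")
    parser.add_argument("--output_datastore", type=str, help="Folder to write results to")
    # parse_known_args ignores extra flags instead of exiting with an error.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_job_args(
    ["--input_data", "/mnt/inputs/mobile-price.csv", "--output_datastore", "/mnt/outputs"]
)
print(args.input_data, args.output_datastore)
```

Both flags are optional in this sketch, so the same parser also works for a job that only passes `--output_datastore`.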

### We can also load the data inside the job itself

In [None]:
%%writefile ../main.py

import sys
import os

# Make the packages inside ml-core-template importable
current_path = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, os.path.join(current_path, "ml-core-template"))


from core.data_preparation.config import FeatureEngineeringConfig, MethodConfig
from core.data_preparation.pipeline import DataPrepPipelineBuilder
from azure_module.data_loader import DataLoader


import mltable

data_loader = DataLoader()
data_loader.parse_output_arg()

tbl = mltable.load(f"azureml://.....path...")
df = tbl.to_pandas_dataframe()


config = FeatureEngineeringConfig(
    data_preparation_steps=[MethodConfig(name='core.data_preparation.pipeline.DropColumns', params=dict(columns='Selfi_Cam'))],
    
)

pipeline_builder = DataPrepPipelineBuilder(config)
pipeline = pipeline_builder.get_pipeline()
X_transformed = pipeline.transform(df)


data_loader.save_data(X_transformed, filename="load_inside_job.csv")

In [None]:
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import command

# configure output only (the script loads its own data, so no job input is needed)

my_job_outputs = {
    "datastore_data": Output(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/blob_mobileprice_cleaned/paths/datastore-path",
    )
}

# configure job
job = command(
    code="../",
    command="python main.py --output_datastore ${{outputs.datastore_data}}",
    outputs=my_job_outputs,
    environment="docker-context-repo-based-v1:1",
    compute="sandbox-ci",
    display_name="test",
    experiment_name="test",
)

# submit job
returned_job = ml_client.create_or_update(job)
aml_url = returned_job.studio_url
print("Monitor the job at", aml_url)
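
Both jobs write to a datastore path of the form `azureml://datastores/<datastore>/paths/<path>`. The small helpers below, hypothetical conveniences and not SDK functions, build and split such URIs so the datastore name and relative path do not have to be repeated as string literals.

```python
import re

_DATASTORE_URI = re.compile(
    r"^azureml://datastores/(?P<datastore>[^/]+)/paths/(?P<path>.+)$"
)


def make_datastore_uri(datastore, path):
    """Build an azureml:// datastore URI from a datastore name and a relative path."""
    return f"azureml://datastores/{datastore}/paths/{path.lstrip('/')}"


def split_datastore_uri(uri):
    """Return (datastore, path) or raise ValueError if the URI has another shape."""
    match = _DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a datastore URI: {uri!r}")
    return match.group("datastore"), match.group("path")


uri = make_datastore_uri("blob_mobileprice_cleaned", "datastore-path")
print(uri)
# azureml://datastores/blob_mobileprice_cleaned/paths/datastore-path
print(split_datastore_uri(uri))
# ('blob_mobileprice_cleaned', 'datastore-path')
```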