# AutoML sentiment analysis scenario: train "the best" NLP multi-class text classification model for a 'Sentiment Labeled Sentences' dataset

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](../../../resources/workspace/workspace.ipynb) 
- A Python environment
- The Azure Machine Learning Python SDK v2, installed by following the [install instructions](../../../README.md) (check the getting started section)

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1. Import the required libraries

In [1]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import Input

from azure.ai.ml import automl
from azure.ai.ml.entities import ResourceConfiguration

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need its identifiers: a subscription ID, a resource group, and a workspace name. We pass these details to `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. This tutorial uses [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python). The cell below first tries to load the workspace details from a `config.json` file and falls back to the values you enter manually. Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

In [2]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

Found the config file in: .\config.json
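The output above shows that `MLClient.from_config` found a `config.json` file. If you do not have one yet, the following minimal sketch writes it with the three identifier keys that the workspace config file uses. The helper name `write_workspace_config` is hypothetical and not part of the SDK, and the placeholder values must be replaced with your own.

```python
import json
from pathlib import Path


def write_workspace_config(subscription_id, resource_group, workspace_name, folder="."):
    """Write a config.json with the workspace identifiers (hypothetical helper)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(folder) / "config.json"
    # Store the identifiers as plain JSON next to the notebook
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


# Example call (placeholders, replace with your own values):
# write_workspace_config("<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AML_WORKSPACE_NAME>")
```

With the file in place, the `try` branch of the cell above should succeed and the manual fallback is not needed.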


### Show Azure ML Workspace information

In [3]:
workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
output

{'Workspace': 'rbhimani-vnet-2',
 'Subscription ID': 'dbd697c3-ef40-488f-83e6-5ad4dfb78f9b',
 'Resource Group': 'rbhimani-rg',
 'Location': 'eastus'}

# 2. Data

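This notebook uses the MLTable definitions stored in separate folders next to it in the repo (`./training-mltable-folder/` and `./validation-mltable-folder/`). The cell below downloads the training and validation CSV files into those folders.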

In [4]:
# "import urllib" alone does not guarantee that the request submodule is loaded
import urllib.request

# Destination folders
training_mltable_path = "./training-mltable-folder/"
validation_mltable_path = "./validation-mltable-folder/"

# Download dataset files and copy within each MLTable folder

training_download_url = "https://raw.githubusercontent.com/dotnet/spark/main/examples/Microsoft.Spark.CSharp.Examples/MachineLearning/Sentiment/Resources/yelptrain.csv"
training_data_file = training_mltable_path + "yelp_training_set.csv"
urllib.request.urlretrieve(training_download_url, filename=training_data_file)

valid_download_url = "https://raw.githubusercontent.com/dotnet/spark/main/examples/Microsoft.Spark.CSharp.Examples/MachineLearning/Sentiment/Resources/yelptest.csv"
valid_data_file = validation_mltable_path + "yelp_validation_set.csv"
urllib.request.urlretrieve(valid_download_url, filename=valid_data_file)

print("Dataset files downloaded...")

Dataset files downloaded...
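Before submitting a training job, it can help to confirm that the downloaded file contains the label column and to see how the classes are distributed. The sketch below assumes that the CSV has a header row with a column named `Sentiment` (the target column used later in this notebook); the function name `label_distribution` is hypothetical.

```python
import csv
from collections import Counter


def label_distribution(csv_path, label_column="Sentiment"):
    """Count how many rows carry each label value (hypothetical helper)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Fail early if the expected label column is missing from the header
        if reader.fieldnames is None or label_column not in reader.fieldnames:
            raise ValueError(f"Column {label_column!r} not found in {csv_path}")
        return Counter(row[label_column] for row in reader)


# Example call, assuming the download cell above has run:
# print(label_distribution("./training-mltable-folder/yelp_training_set.csv"))
```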


In [5]:
# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(type=AssetTypes.MLTABLE, path=training_mltable_path)

# Validation MLTable defined locally, with local data to be uploaded
my_validation_data_input = Input(type=AssetTypes.MLTABLE, path=validation_mltable_path)

# WITH REMOTE PATH: If available already in the cloud/workspace-blob-store
# my_training_data_input = Input(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/my_training_mltable")
# my_validation_data_input = Input(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/my_validation_mltable")

For documentation on creating your own MLTable assets for jobs beyond this notebook:
- https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-mltable details how to write MLTable YAMLs (required for each MLTable asset).
- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets?tabs=Python-SDK covers how to work with them in the v2 CLI/SDK.
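As a rough illustration of what such an MLTable file can look like for a single delimited text file, the sketch below writes a minimal `MLTable` YAML into a folder. The delimiter and encoding values are assumptions and may differ from the MLTable files shipped with this notebook; the helper name `write_mltable` is hypothetical.

```python
from pathlib import Path

# Minimal MLTable definition for one CSV file (delimiter and encoding are assumptions)
MLTABLE_TEMPLATE = """\
paths:
  - file: ./{file_name}
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'ascii'
"""


def write_mltable(folder, file_name):
    """Create <folder>/MLTable pointing at a CSV file in the same folder."""
    folder_path = Path(folder)
    folder_path.mkdir(parents=True, exist_ok=True)
    mltable_path = folder_path / "MLTable"
    mltable_path.write_text(MLTABLE_TEMPLATE.format(file_name=file_name), encoding="utf-8")
    return mltable_path


# Example call for a new dataset folder (names are placeholders):
# write_mltable("./my-mltable-folder/", "my_data.csv")
```

The data file and the `MLTable` file live in the same folder, so the folder path can be passed directly as an `Input` of type `AssetTypes.MLTABLE`.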

# 3. Configure and run the AutoML NLP Text Classification Multiclass training job
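In this section we will configure and run the AutoML job that trains the model. The first cell makes sure that a GPU compute cluster is available, and the following cells define the job and its sweep settings.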

In [6]:
from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

compute_name = "gpu-cluster-nc12s-v3"

try:
    _ = ml_client.compute.get(compute_name)
    print("Found existing compute target.")
except ResourceNotFoundError:
    print("Creating a new compute target...")
    compute_config = AmlCompute(
        name=compute_name,
        type="amlcompute",
        size="Standard_NC12s_v3",
        idle_time_before_scale_down=120,
        min_instances=0,
        max_instances=4,
    )
    ml_client.begin_create_or_update(compute_config).result()

Found existing compute target.


In [7]:
# general job parameters
exp_name = "dpv2-nlp-text-classification-experiment"
dataset_language_code = "eng"

In [11]:
# Create the AutoML job with the related factory-function.
from azure.ai.ml.automl import SearchSpace
from azure.ai.ml.sweep import Uniform

text_classification_job = automl.text_classification(
    # name="dpv2-nlp-text-classification-multiclass-job-01",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    target_column_name="Sentiment",
    primary_metric="accuracy",
    tags={"my_custom_tag": "My custom value"},
    compute=compute_name,  # reuse the cluster name defined in the compute cell
)

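# Limit the whole job to 120 minutes and 4 trials, with at most 2 running in parallel
text_classification_job.set_limits(
    timeout_minutes=120, max_trials=4, max_concurrent_trials=2
)

# Fix the model architecture for all trials
text_classification_job.set_training_parameters(model_name="roberta-base")

# Sample hyperparameter values at random from the search space
text_classification_job.set_sweep(sampling_algorithm="Random")

# Sweep weight decay uniformly between 0.01 and 0.1
text_classification_job.extend_search_space(
    SearchSpace(
        weight_decay=Uniform(0.01, 0.1),
    ),
)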
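To build intuition for what random sampling over `Uniform(0.01, 0.1)` with `max_trials=4` means, the sketch below draws four candidate weight-decay values locally. It is only an illustration using Python's standard library, not how the AutoML service samples internally, and the function name `sample_weight_decay` is hypothetical.

```python
import random


def sample_weight_decay(n_trials, low=0.01, high=0.1, seed=None):
    """Draw one uniformly distributed weight-decay value per trial (illustration only)."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n_trials)]


# Four trials, matching max_trials=4 in the job limits
candidates = sample_weight_decay(4, seed=0)
for trial, value in enumerate(candidates, start=1):
    print(f"trial {trial}: weight_decay={value:.4f}")
```

Each trial would then train the fixed `roberta-base` model with one of these values, and the trial with the best primary metric wins.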

In [12]:
text_classification_job.set_featurization(dataset_language=dataset_language_code)

## 3.1. Run the AutoML job
Using the `MLClient` created earlier, we will now submit this AutoML job to the workspace.

In [13]:
# Submit the AutoML job

returned_job = ml_client.jobs.create_or_update(
    text_classification_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

Exception: 
1) One or more fields are invalid

Details: 

(x) Supported input path value are ARM id, AzureML id, remote uri or local path.
Met <class 'azure.core.exceptions.ServiceRequestError'>:
<urllib3.connection.HTTPSConnection object at 0x0000020AAA76C3C8>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed

Resolutions: 
1) Double-check that all specified parameters are of the correct types and formats prescribed by the Job schema.
If using the CLI, you can also check the full log in debug mode for more details by adding --debug to the end of your command

Additional Resources: The easiest way to author a yaml specification file is using IntelliSense and auto-completion Azure ML VS code extension provides: https://code.visualstudio.com/docs/datascience/azure-machine-learning. To set up VS Code, visit https://docs.microsoft.com/azure/machine-learning/how-to-setup-vs-code
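The `getaddrinfo failed` message in the output above indicates that a hostname could not be resolved while the local data was being uploaded, which usually points to a network or DNS problem rather than an invalid job definition. A quick way to check name resolution from the notebook machine is sketched below. The helper name `can_resolve` is hypothetical, and the hostname in the example call is only an assumption about which endpoint to probe, since the error message does not name the host.

```python
import socket


def can_resolve(hostname, port=443, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves, False if name resolution fails."""
    try:
        resolver(hostname, port)
        return True
    except socket.gaierror:
        # Raised when getaddrinfo cannot resolve the name (for example errno 11001 on Windows)
        return False


# Example call (hostname is an assumption, adjust to the endpoint you need):
# print(can_resolve("management.azure.com"))
```

If the check fails, fix the network, proxy, or DNS configuration and rerun the submission cell.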


In [None]:
ml_client.jobs.stream(returned_job.name)
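Streaming blocks the notebook until the job finishes. As an alternative, the sketch below polls a status callback until the job reaches a terminal state. The helper name `wait_for_terminal_status` is hypothetical, and the set of terminal status strings is an assumption that may not cover every state the service reports.

```python
import time

# Assumed terminal job states (the service may report others)
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(get_status, poll_seconds=30.0, max_polls=240, sleep=time.sleep):
    """Call get_status() repeatedly until a terminal status or the poll limit is reached."""
    status = get_status()
    polls = 1
    while status not in TERMINAL_STATUSES and polls < max_polls:
        sleep(poll_seconds)
        status = get_status()
        polls += 1
    return status


# Example call, assuming the job was submitted successfully:
# final_status = wait_for_terminal_status(
#     lambda: ml_client.jobs.get(returned_job.name).status
# )
```

The function returns the last observed status, so a non-terminal value means the poll limit was reached before the job finished.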

# Next Steps
You can find further examples of other AutoML tasks, such as Regression, Image-Object-Detection, NLP-Text-Classification, and Time-Series-Forecasting.