# AutoML - Train "the best" NLP NER model for a named entity recognition dataset.

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](/sdk/resources/workspace/workspace.ipynb) 
- A Compute Cluster. [Check this notebook to create a compute cluster](/sdk/resources/compute/compute.ipynb)
- A Python environment
- Installed Azure Machine Learning Python SDK v2 - [install instructions](/sdk/README.md#getting-started)

Named entity recognition (NER) is a sub-task of information extraction (IE) that locates specified entities in a body of text and assigns each one a category. NER is also known as entity identification, entity chunking, or entity extraction.
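To make "locating and categorizing entities" concrete, here is a minimal sketch that groups token-level tags into entities. It assumes the common BIO tagging convention (`B-` starts an entity, `I-` continues it, `O` is outside any entity). The helper name `extract_entities` and the sample sentence are illustrative only and are not part of AutoML.

```python
from typing import List, Optional, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A new entity begins here; close any entity in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" tag: the token is outside any entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Peter", "Blackburn", "visited", "Brussels", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(extract_entities(tokens, tags))
# [('Peter Blackburn', 'PER'), ('Brussels', 'LOC')]
```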

This notebook uses the AutoML NLP NER task to train a model on prepared datasets derived from the CoNLL-2003 dataset, introduced by Tjong Kim Sang and De Meulder in [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition](https://paperswithcode.com/paper/introduction-to-the-conll-2003-shared-task). A derived version is also available on Kaggle as [CoNLL003 (English-version)](https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion?select=valid.txt).

CoNLL-2003 is a named entity recognition dataset released as a part of CoNLL-2003 shared task: language-independent named entity recognition.
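The following sketch shows one way to read CoNLL-style text into sentences of `(token, tag)` pairs. It assumes the layout commonly used for CoNLL-2003 files: one token per line with whitespace-separated columns, the entity tag in the last column, blank lines between sentences, and `-DOCSTART-` marker lines between documents. The function name `read_conll` and the inline sample are illustrative; check them against the files you actually download.

```python
from typing import List, Tuple

# A tiny inline sample in the assumed column layout: token, POS, chunk, NER tag.
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Split CoNLL-style text into sentences of (token, ner_tag) pairs."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the entity tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
print(len(sentences))  # 2
print(sentences[0])
```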

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1. Import the required libraries

In [3]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.identity import InteractiveBrowserCredential
from azure.ml import MLClient

from azure.ml._constants import AssetTypes
from azure.ml.entities import JobInput

from azure.ml import automl
# from azure.ml.automl import text_ner

from pprint import pprint

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need identifier parameters - a subscription, resource group and workspace name. We will use these details in the `MLClient` from `azure.ml` to get a handle to the required Azure Machine Learning workspace. We use the default [interactive authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.interactivebrowsercredential?view=azure-python) for this tutorial. More advanced connection methods can be found [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [4]:
# Enter the details of your Azure ML workspace
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

In [5]:
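# Get a handle to the workspace.
# InteractiveBrowserCredential opens a browser window for sign-in.
# DefaultAzureCredential (imported above) is an alternative that tries
# several non-interactive sign-in methods in turn.
credential = InteractiveBrowserCredential()
ml_client = MLClient(credential, subscription_id, resource_group, workspace)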
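If you prefer not to type workspace identifiers into the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the helper `load_workspace_config` and the environment variable names are assumptions made for this example, not something the SDK requires or reads on its own.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical environment variable names, chosen for this sketch only.
ENV_VARS = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group": "AZURE_RESOURCE_GROUP",
    "workspace": "AZUREML_WORKSPACE_NAME",
}


def load_workspace_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the three workspace identifiers, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [var for var in ENV_VARS.values() if not env.get(var)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in ENV_VARS.items()}


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "AZURE_SUBSCRIPTION_ID": "<SUBSCRIPTION_ID>",
    "AZURE_RESOURCE_GROUP": "<RESOURCE_GROUP>",
    "AZUREML_WORKSPACE_NAME": "<AML_WORKSPACE_NAME>",
}
config = load_workspace_config(example_env)
print(config)
```

The resulting values could then be passed to `MLClient` in place of the hard-coded `subscription_id`, `resource_group` and `workspace` variables.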

# 2. Data

This model training uses the datasets from Kaggle [CoNLL003 (English-version)](https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion?select=valid.txt), in particular the following files for training and validation:

- Training dataset file (train.txt)
- Validation dataset file (valid.txt)

Both files are placed within their related MLTable folder.

**NOTE:** In this PRIVATE PREVIEW we're defining the MLTable in a separate folder and .YAML file.
In later versions, you'll be able to do it all in Python APIs.

In [8]:
# MLTable folders
training_mltable_path = "./training-mltable-folder/"
validation_mltable_path = "./validation-mltable-folder/"

# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = JobInput(type=AssetTypes.MLTABLE, path=training_mltable_path)

# Validation MLTable defined locally, with local data to be uploaded
my_validation_data_input = JobInput(type=AssetTypes.MLTABLE, path=validation_mltable_path)

# WITH REMOTE PATH: If available already in the cloud/workspace-blob-store
# my_training_data_input = JobInput(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/my_training_mltable")
# my_validation_data_input = JobInput(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/my_validation_mltable")    
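Because each data file has to sit inside its MLTable folder, a quick local check before submitting the job can catch a missing file early. This is a minimal sketch: `missing_files` is a hypothetical helper, and the definition file name `MLTable` is an assumption that may differ in this preview. The demonstration uses a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_files(folder: str, expected: Iterable[str]) -> List[str]:
    """Return the expected file names that are not present in the folder."""
    root = Path(folder)
    return [name for name in expected if not (root / name).is_file()]


# Demonstration: a folder that holds the data file but no definition file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.txt").write_text("EU B-ORG\n")
    demo_missing = missing_files(tmp, ["MLTable", "train.txt"])

print(demo_missing)  # ['MLTable']
```

For the folders in this notebook, the call might look like `missing_files("./training-mltable-folder/", ["MLTable", "train.txt"])`, where an empty list means nothing is missing.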

# 3. Configure and run the AutoML NLP Text NER training job
In this section we will configure and run the AutoML job that trains the model.

In [16]:
# Create the AutoML job with the related factory function.

text_ner_job = automl.text_ner(
    compute="gpu-cluster",
    # name="dpv2-nlp-text-ner-job-01",
    experiment_name="dpv2-nlp-text-ner-experiment",
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    primary_metric="accuracy",
    tags={"owner": "cesardl"},
    # Placeholder value: this argument should become optional (BUG:1721836)
    target_column_name="xxxxx",
    # Temporary properties needed in Private Preview
    properties={
        "_automl_internal_enable_mltable_quick_profile": True,
        "_automl_internal_label": "latest",
    },
)

text_ner_job.set_limits(timeout=60)

## 3.1. Run the AutoML job
Using the `MLClient` created earlier, we will now submit this AutoML job to the workspace.

In [17]:
# Submit the AutoML job

returned_job = ml_client.jobs.create_or_update(text_ner_job)  # submit the job to the backend

print(f"Created job: {returned_job}")

Readonly attribute primary_metric will be ignored in class <class 'azure.ml._restclient.v2022_02_01_preview.models._models_py3.TextNer'>


HttpResponseError: (UserError) A job with this name already exists. If you are trying to create a new job, use a different name. If you are trying to update an existing job, the existing job's compute, forecasting settings, general settings, limit settings, hyperparameter settings cannot be changed. Only description, tags, displayName, properties, and isArchived can be updated.
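The error output above reports that a job with this name already exists and suggests using a different name. One way to avoid the collision when re-running the notebook is to generate a fresh name for every submission. The sketch below is an illustration only: `unique_job_name` is a hypothetical helper, and it assumes that letters, digits and hyphens are acceptable in a job name.

```python
import uuid
from datetime import datetime, timezone


def unique_job_name(prefix: str) -> str:
    """Build a job name from a prefix, a UTC timestamp and a short random suffix."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    suffix = uuid.uuid4().hex[:8]  # 8 hex characters of randomness
    return f"{prefix}-{stamp}-{suffix}"


job_name = unique_job_name("dpv2-nlp-text-ner-job")
print(job_name)
```

The generated value could be passed as the `name` argument that is commented out in the job definition earlier in this section.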

In [None]:
# Get a URL for the status of the job
print("Open the following link to observe the AutoML training job/run:")

returned_job.services["Studio"].endpoint

# Next Steps
You can explore further examples of other AutoML tasks such as Regression, Image-Object-Detection, NLP-Text-Classification, Time-Series-Forecasting, etc.