Merge pull request #11940 from MicrosoftDocs/learn-build-service-prodbot/docutune-autopr-20240506-050657-4093632-ignore-build

[DocuTune-Remediation] - DocuTune scheduled execution in AAC (part 3)
JamesJBarnett committed May 6, 2024
2 parents 3f1ea31 + e99efa0 commit aa9967e
Showing 5 changed files with 60 additions and 61 deletions.
@@ -32,7 +32,7 @@ During the processing steps, Azure Databricks, Azure Synapse Analytics, and Azur
- [Event Hubs](https://azure.microsoft.com/services/event-hubs) ingests data streams that client applications generate. Event Hubs stores the streaming data and preserves the sequence of received events. Consumers can connect to hub endpoints to retrieve messages for processing. Event Hubs integrates with Data Lake Storage, as this solution shows.
- [Azure HDInsight](/azure/hdinsight/hdinsight-overview) is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. You can use open-source frameworks with Azure HDInsight, such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm, and R.
- [Data Factory](https://azure.microsoft.com/services/data-factory) automatically moves data between storage accounts of differing security levels to ensure separation of duties.
-- [Computer Vision](https://azure.microsoft.com/resources/cloud-computing-dictionary/what-is-computer-vision/) uses [text recognition APIs](/azure/cognitive-services/computer-vision/overview-ocr) to recognize text in images and extract that information. [The Read API](/azure/cognitive-services/computer-vision/overview-ocr#read-api) uses the latest recognition models, and is optimized for large, text-heavy documents and noisy images. [The OCR API](/azure/cognitive-services/computer-vision/concept-recognizing-text#ocr-optical-character-recognition-api) isn't optimized for large documents but supports more languages than the Read API. This solution uses OCR to produce data in the [hOCR](https://en.wikipedia.org/wiki/HOCR) format.
+- [Computer Vision](https://azure.microsoft.com/resources/cloud-computing-dictionary/what-is-computer-vision/) uses [text recognition APIs](/azure/cognitive-services/computer-vision/overview-ocr) to recognize text in images and extract that information. The [Read API](/azure/cognitive-services/computer-vision/overview-ocr#read-api) uses the latest recognition models, and is optimized for large, text-heavy documents and noisy images. The [OCR API](/azure/cognitive-services/computer-vision/concept-recognizing-text#ocr-optical-character-recognition-api) isn't optimized for large documents but supports more languages than the Read API. This solution uses OCR to produce data in the [hOCR](https://en.wikipedia.org/wiki/HOCR) format.
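For illustration, here is a minimal Python sketch of the asynchronous Read API call pattern that the bullet above describes, using the `azure-cognitiveservices-vision-computervision` SDK. The endpoint, key, and image URL are placeholders, and the downstream conversion to hOCR isn't shown.

```python
import time

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from msrest.authentication import CognitiveServicesCredentials

# Placeholder endpoint and key.
client = ComputerVisionClient(
    "https://<your-resource>.cognitiveservices.azure.com/",
    CognitiveServicesCredentials("<your-key>"),
)

# Submit the image; the Read API runs asynchronously and returns an operation ID.
response = client.read("https://<storage-account>.blob.core.windows.net/docs/page-01.jpg", raw=True)
operation_id = response.headers["Operation-Location"].split("/")[-1]

# Poll until the asynchronous operation finishes.
result = client.get_read_result(operation_id)
while result.status in (OperationStatusCodes.running, OperationStatusCodes.not_started):
    time.sleep(1)
    result = client.get_read_result(operation_id)

if result.status == OperationStatusCodes.succeeded:
    for page in result.analyze_result.read_results:
        for line in page.lines:
            print(line.text)
```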

## Scenario details

@@ -44,20 +44,20 @@ For customized NLP workloads, the open-source library Spark NLP serves as an eff

### Potential use cases

-- **Document classification**: Spark NLP offers several options for text classification:
+- **Document classification:** Spark NLP offers several options for text classification (a pipeline sketch follows this list):

- Text preprocessing in Spark NLP and machine learning algorithms that are based on Spark ML
- Text preprocessing and word embedding in Spark NLP with models such as GloVe, BERT, and ELMo, combined with machine learning algorithms
- Text preprocessing and sentence embedding in Spark NLP with models such as the Universal Sentence Encoder, combined with machine learning algorithms
- Text preprocessing and classification in Spark NLP that uses the ClassifierDL annotator and is based on TensorFlow

-- **Name entity extraction (NER)**: In Spark NLP, with a few lines of code, you can train a NER model that uses BERT, and you can achieve state-of-the-art accuracy. NER is a subtask of information extraction. NER locates named entities in unstructured text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. Spark NLP uses a state-of-the-art NER model with BERT. The model is inspired by a former NER model, bidirectional LSTM-CNN. That former model uses a novel neural network architecture that automatically detects word-level and character-level features. For this purpose, the model uses a hybrid bidirectional LSTM and CNN architecture, so it eliminates the need for most feature engineering.
+- **Named entity recognition (NER):** In Spark NLP, with a few lines of code, you can train an NER model that uses BERT and achieve state-of-the-art accuracy. NER is a subtask of information extraction. NER locates named entities in unstructured text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. Spark NLP uses a state-of-the-art NER model with BERT. The model is inspired by an earlier NER model, the bidirectional LSTM-CNN, which uses a novel neural network architecture that automatically detects word-level and character-level features. This hybrid bidirectional LSTM and CNN architecture eliminates the need for most feature engineering.

-- **Sentiment and emotion detection**: Spark NLP can automatically detect positive, negative, and neutral aspects of language.
+- **Sentiment and emotion detection:** Spark NLP can automatically detect positive, negative, and neutral aspects of language.

-- **Part of speech (POS)**: This functionality assigns a grammatical label to each token in input text.
+- **Part of speech (POS):** This functionality assigns a grammatical label to each token in input text.

-- **Sentence detection (SD)**: SD is based on a general-purpose neural network model for sentence boundary detection that identifies sentences within text. Many NLP tasks take a sentence as an input unit. Examples of these tasks include POS tagging, dependency parsing, named entity recognition, and machine translation.
+- **Sentence detection (SD):** SD is based on a general-purpose neural network model for sentence boundary detection that identifies sentences within text. Many NLP tasks take a sentence as an input unit. Examples of these tasks include POS tagging, dependency parsing, named entity recognition, and machine translation.

### Spark NLP functionality and pipelines

@@ -67,7 +67,7 @@ Spark NLP is by far the fastest open-source NLP library. Recent public benchmark

Besides excellent performance, Spark NLP also delivers state-of-the-art accuracy for a growing number of NLP tasks. The Spark NLP team regularly reads the latest relevant academic papers and produces the most accurate models.

-For the execution order of an NLP pipeline, Spark NLP follows the same development concept as traditional Spark ML machine learning models. But Spark NLP applies NLP techniques. The following diagram shows the core components of a Spark NLP pipeline.
+For the execution order of an NLP pipeline, Spark NLP follows the same development concept as traditional Spark machine learning models. But Spark NLP applies NLP techniques. The following diagram shows the core components of a Spark NLP pipeline.

:::image type="content" source="_images/spark-natural-language-processing-pipeline.png" alt-text="Diagram that shows N L P pipeline stages, such as document assembly, sentence detection, tokenization, normalization, and word embedding." border="false":::
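The stages in this diagram map onto Spark NLP annotators one for one. Here is a minimal sketch, assuming an input DataFrame `input_df` with a `text` column; `glove_100d` is one of several pretrained embedding models you could substitute.

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, Normalizer, WordEmbeddingsModel
from pyspark.ml import Pipeline

document = DocumentAssembler().setInputCol("text").setOutputCol("document")        # document assembly
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")  # sentence detection
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")               # tokenization
normalized = Normalizer().setInputCols(["token"]).setOutputCol("normalized")       # normalization
embeddings = (WordEmbeddingsModel.pretrained("glove_100d")                         # word embedding
              .setInputCols(["sentence", "normalized"])
              .setOutputCol("embeddings"))

pipeline = Pipeline(stages=[document, sentence, token, normalized, embeddings])
result = pipeline.fit(input_df).transform(input_df)  # input_df: assumed DataFrame with a "text" column
```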

@@ -1,6 +1,6 @@
This article describes an architecture for many models that uses Machine Learning and compute clusters. It provides great versatility for situations that require complex setup.

-A companion article, [Many models machine learning (ML) at scale in Azure with Spark](many-models-machine-learning-azure-spark.yml), uses Apache Spark in either Azure Databricks or Azure Synapse Analytics.
+A companion article, [Many models machine learning at scale in Azure with Spark](many-models-machine-learning-azure-spark.yml), uses Apache Spark in either Azure Databricks or Azure Synapse Analytics.

## Architecture

@@ -28,10 +28,10 @@

### Components

-- [Azure Machine Learning](https://azure.microsoft.com/services/machine-learning) is an enterprise-grade ML service for building and deploying models quickly. It provides users at all skill levels with a low-code designer, automated ML (AutoML), and a hosted Jupyter notebook environment that supports various IDEs.
-- [Azure Databricks](https://azure.microsoft.com/services/databricks) is a cloud-based data-engineering tool that's based on Apache Spark. It can process and transform massive quantities of data and explore it by using ML models. You can write jobs in R, Python, Java, Scala, and Spark SQL.
+- [Azure Machine Learning](https://azure.microsoft.com/services/machine-learning) is an enterprise-grade machine learning service for building and deploying models quickly. It provides users at all skill levels with a low-code designer, automated ML (AutoML), and a hosted Jupyter notebook environment that supports various IDEs.
+- [Azure Databricks](https://azure.microsoft.com/services/databricks) is a cloud-based data-engineering tool that's based on Apache Spark. It can process and transform massive quantities of data and explore it by using machine learning models. You can write jobs in R, Python, Java, Scala, and Spark SQL.
- [Azure Synapse Analytics](https://azure.microsoft.com/services/synapse-analytics) is an analytics service that unifies data integration, enterprise data warehousing, and big data analytics.
-- Synapse SQL is a distributed query system for T-SQL that enables data warehousing and data virtualization scenarios and extends T-SQL to address streaming and ML scenarios. It offers both serverless and dedicated resource models.
+- Synapse SQL is a distributed query system for T-SQL that enables data warehousing and data virtualization scenarios and extends T-SQL to address streaming and machine learning scenarios. It offers both serverless and dedicated resource models.
- [Azure Data Lake Storage](https://azure.microsoft.com/services/storage/data-lake-storage) is a massively scalable and secure storage service for high-performance analytics workloads.
- [Azure Kubernetes Service (AKS)](https://azure.microsoft.com/services/kubernetes-service) is a fully managed Kubernetes service for deploying and managing containerized applications. AKS simplifies deployment of a managed AKS cluster in Azure by offloading the operational overhead to Azure.
- [Azure DevOps](https://azure.microsoft.com/services/devops/) is a set of developer services that provide comprehensive application and infrastructure lifecycle management. DevOps includes work tracking, source control, build and CI/CD, package management, and testing solutions.
@@ -44,7 +44,7 @@

## Scenario details

-Many machine learning (ML) problems are too complex for a single ML model to solve. Whether it's predicting sales for every item of every store, or modeling maintenance for hundreds of oil wells, having a model for each instance might improve results on many ML problems. This *many models* pattern is common across a wide variety of industries, and has many real-world use cases. With the use of Azure Machine Learning, an end-to-end many models pipeline can include model training, batch-inferencing deployment, and real-time deployment.
+Many machine learning problems are too complex for a single machine learning model to solve. Whether it's predicting sales for every item of every store, or modeling maintenance for hundreds of oil wells, having a model for each instance might improve results on many machine learning problems. This *many models* pattern is common across a wide variety of industries, and has many real-world use cases. With the use of Azure Machine Learning, an end-to-end many models pipeline can include model training, batch-inferencing deployment, and real-time deployment.

A many models solution requires a different dataset for every model during training and scoring. For instance, if the task is to predict sales for every item of every store, every dataset will be for a unique item-store combination.
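For example, with PySpark you might materialize one dataset per item-store combination by partitioning a single source table on those keys; the column names and storage path here are assumptions.

```python
# Write one folder per store-item combination to Data Lake Storage; each model's
# training job then reads exactly one partition.
(sales_df.write
    .partitionBy("store_id", "item_id")
    .mode("overwrite")
    .parquet("abfss://data@<account>.dfs.core.windows.net/sales-by-store-item"))
```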

@@ -58,16 +58,16 @@ A many models solution requires a different dataset for every model during train

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see [Microsoft Azure Well-Architected Framework](/azure/architecture/framework).

-- **Data partitions** Partitioning the data is the key to implementing the many models pattern. If you want one model per store, a dataset comprises all the data for one store, and there are as many datasets as there are stores. If you want to model products by store, there will be a dataset for every combination of product and store. Depending on the source data format, it may be easy to partition the data, or it might require extensive data shuffling and transformation. Spark and Synapse SQL scale very well for such tasks, while Python pandas doesn't, since it runs only on one node and process.
+- **Data partitions:** Partitioning the data is the key to implementing the many models pattern. If you want one model per store, a dataset comprises all the data for one store, and there are as many datasets as there are stores. If you want to model products by store, there will be a dataset for every combination of product and store. Depending on the source data format, it might be easy to partition the data, or it might require extensive data shuffling and transformation. Spark and Synapse SQL scale very well for such tasks, while Python pandas doesn't, since it runs only on one node and process.
- **Model management:** The training and scoring pipelines identify and invoke the right model for each dataset. To do this, they calculate tags that characterize the dataset, and then use the tags to find the matching model. The tags identify the data partition key and the model version, and might also provide other information.
- **Choosing the right architecture:**
- Spark is appropriate when your training pipeline has complex data transformation and grouping requirements. It provides flexible splitting and grouping techniques to group data by combinations of characteristics, such as product-store or location-product. The results can be placed in a Spark DataFrame for use in subsequent steps.
-- When your ML training and scoring algorithms are straightforward, you might be able to partition data with libraries such as Scikit-learn. In such cases, you might not need Spark, so you can avoid possible complexities that can arise when installing Azure Synapse or Azure Databricks.
-- When the training datasets are already created—for example, they're in separate files or in separate rows or columns—you dont need Spark for complex data transformations.
+- When your machine learning training and scoring algorithms are straightforward, you might be able to partition data with libraries such as scikit-learn. In such cases, you might not need Spark, so you can avoid possible complexities that can arise when installing Azure Synapse or Azure Databricks.
+- When the training datasets are already created—for example, they're in separate files or in separate rows or columns—you don't need Spark for complex data transformations.
- The Machine Learning and compute clusters solution provides great versatility for situations that require complex setup. For example, you can use a custom Docker container, download files, or download pre-trained models. Computer vision and natural language processing (NLP) deep learning are examples of applications that might require such versatility.
- **Spark training and scoring:** When you use the Spark architecture, you can use the Spark pandas function API for parallel training and scoring (see the sketch after this list).
- **Separate model repos:** To protect the deployed models, consider storing them in their own repository that the training and testing pipelines don't touch.
-- **ParallelRunStep Class:** The Python [ParallelRunStep Class](/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallelrunstep?view=azure-ml-py) is a powerful option to run many models training and inferencing. It can partition your data in a variety of ways, and then apply your ML script on elements of the partition in parallel. Like other forms of Machine Learning training, you can specify a custom training environment with access to Python Package Index (PyPI) packages, or a more advanced custom docker environment for configurations that require more than standard PyPI. There are many CPUs and GPUs to choose from.
+- **ParallelRunStep Class:** The Python [ParallelRunStep Class](/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallelrunstep?view=azure-ml-py) is a powerful option to run many models training and inferencing. It can partition your data in a variety of ways, and then apply your machine learning script on elements of the partition in parallel. Like other forms of Machine Learning training, you can specify a custom training environment with access to Python Package Index (PyPI) packages, or a more advanced custom Docker environment for configurations that require more than standard PyPI. There are many CPUs and GPUs to choose from (see the ParallelRunStep sketch after this list).
- **Online inferencing:** If a pipeline loads and caches all models at the start, the models might exhaust the container's memory. Therefore, load the models on demand in the run method, even though it might increase latency slightly.
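Here is a minimal sketch of the Spark pandas function API approach from the **Spark training and scoring** bullet: Spark hands each group (one data partition) to an ordinary pandas function that trains one model. The `sales_df` DataFrame, its columns, and the scikit-learn estimator are assumptions.

```python
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

result_schema = StructType([
    StructField("store_id", StringType()),
    StructField("item_id", StringType()),
    StructField("mape", DoubleType()),
])

def train_one_model(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds every row for a single store-item combination (one data partition).
    from sklearn.linear_model import LinearRegression  # imported on the worker

    X, y = pdf[["week_of_year"]], pdf["sales"]
    model = LinearRegression().fit(X, y)
    # In a real pipeline you'd persist or register the model here (for example, with MLflow).
    mape = float((abs(model.predict(X) - y) / y).mean())
    return pd.DataFrame([{
        "store_id": pdf["store_id"].iloc[0],
        "item_id": pdf["item_id"].iloc[0],
        "mape": mape,
    }])

# One pandas function call per group, run in parallel across the Spark cluster.
metrics = sales_df.groupBy("store_id", "item_id").applyInPandas(train_one_model, schema=result_schema)
```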

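And here is a minimal sketch of a ParallelRunStep-based training step with the v1 Azure Machine Learning SDK (`azureml-pipeline-steps`). The compute target, environment, dataset, and entry script names are assumptions; `train.py` must define the `init()` and `run(mini_batch)` functions that ParallelRunStep expects.

```python
from azureml.core import Workspace, Experiment, Environment, Dataset
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

ws = Workspace.from_config()

# Each mini-batch is one file, that is, one data partition (for example, one store-item dataset).
parallel_run_config = ParallelRunConfig(
    source_directory="scripts",
    entry_script="train.py",                # defines init() and run(mini_batch)
    mini_batch_size="1",
    error_threshold=5,
    output_action="append_row",
    environment=Environment.get(ws, name="many-models-train-env"),  # assumed registered environment
    compute_target=ws.compute_targets["cpu-cluster"],               # assumed AmlCompute cluster
    node_count=4,
)

output = PipelineData(name="training_results", datastore=ws.get_default_datastore())

train_step = ParallelRunStep(
    name="many-models-training",
    parallel_run_config=parallel_run_config,
    inputs=[Dataset.get_by_name(ws, "sales_partitioned").as_named_input("sales_partitioned")],
    output=output,
)

run = Experiment(ws, "many-models").submit(Pipeline(workspace=ws, steps=[train_step]))
```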
### Cost optimization
@@ -107,4 +107,4 @@ Principal author:
- [Analytics architecture design](../../solution-ideas/articles/analytics-start-here.yml)
- [Choose an analytical data store in Azure](../../data-guide/technology-choices/analytical-data-stores.md)
- [Choose a data analytics technology in Azure](../../data-guide/technology-choices/analysis-visualizations-reporting.md)
-- [Many models machine learning (ML) at scale in Azure with Spark](many-models-machine-learning-azure-spark.yml)
+- [Many models machine learning at scale in Azure with Spark](many-models-machine-learning-azure-spark.yml)