Biomedical Entity Recognition using TDSP Template

NOTE This content is no longer maintained. Visit the Azure Machine Learning Notebook project for sample Jupyter notebooks for ML and deep learning with Azure Machine Learning.

Link to the Microsoft DOCS site

The detailed documentation for this example includes the step-by-step walk-through: https://docs.microsoft.com/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition

Link to the Gallery GitHub repository

The public GitHub repository for this example contains all the code samples: https://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction

Summary

Entity extraction, also known as named-entity recognition (NER), entity chunking, and entity identification, is a subtask of information extraction. Biomedical named entity recognition is a critical step for complex biomedical NLP tasks such as:

Extraction of diseases and symptoms from electronic medical or health records.
Drug discovery
Understanding the interactions between different entity types such as drug-drug interaction, drug-disease relationship and gene-protein relationship.
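
To make the task concrete, here is a minimal sketch of what an entity recognizer's output can look like and how it is turned into entities. It assumes the common BIO tagging scheme (B- begins an entity, I- continues it, O is outside any entity); the scheme, the tag names `Drug` and `Disease`, and the helper `decode_bio` are illustrative assumptions, not taken from this project's code.

```python
def decode_bio(tokens, tags):
    """Collect (entity_text, entity_type) pairs from per-token BIO tags."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one that was open, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open entity;
            # the token carrying the inconsistent tag is ignored.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical sentence and tags, written by hand for illustration.
tokens = ["Aspirin", "may", "reduce", "the", "risk", "of", "heart", "attack", "."]
tags = ["B-Drug", "O", "O", "O", "O", "O", "B-Disease", "I-Disease", "O"]
entities = decode_bio(tokens, tags)
print(entities)
```

Grouping tokens this way is what lets downstream tasks such as relationship extraction work with whole entities like "heart attack" rather than with individual words.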

This real-world scenario focuses on how a large unstructured, unlabeled corpus, such as PubMed article abstracts, can be analyzed to train a domain-specific word embedding model. The resulting embeddings are then used as automatically generated features to train a neural entity extraction model, using Keras with the TensorFlow deep learning framework as the backend and a small amount of labeled data.
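
The idea of using embeddings as features can be sketched in a few lines: each token of a sentence is replaced by its embedding vector, and the sentence is padded to a fixed length so that a sequence model can consume it. This is a simplified illustration, not the project's code. The helper name `sentence_to_features`, the lowercasing step, the zero vector for unknown words, and the toy 3-dimensional vectors are all assumptions.

```python
def sentence_to_features(tokens, embeddings, dim, max_len):
    """Map a token list to a max_len x dim matrix of embedding vectors.

    Unknown tokens and padding positions get a zero vector (an assumption;
    other strategies, such as a dedicated unknown-word vector, are common).
    """
    zero = [0.0] * dim
    rows = [list(embeddings.get(tok.lower(), zero)) for tok in tokens[:max_len]]
    # Pad short sentences so every sentence has the same shape.
    rows += [list(zero) for _ in range(max_len - len(rows))]
    return rows


# Toy vectors made up for illustration; real embeddings would come from the
# model trained on the unlabeled corpus and have many more dimensions.
embeddings = {
    "aspirin": [0.9, 0.1, 0.0],
    "treats": [0.0, 0.2, 0.7],
    "headache": [0.1, 0.8, 0.2],
}

features = sentence_to_features(
    ["Aspirin", "treats", "migraine"], embeddings, dim=3, max_len=5
)
for row in features:
    print(row)
```

In this sketch "migraine" is not in the toy vocabulary, so it maps to the zero vector. A corpus-specific embedding model reduces how often that happens for domain terms, which is the motivation for training embeddings on biomedical text.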

Description

The aim of this real-world scenario is to highlight how to use Azure Machine Learning Workbench to solve a complicated NLP task such as entity extraction from unstructured text. Here are the key points addressed:

How to train a neural word embedding model on a text corpus of about 18 million PubMed abstracts using the Spark Word2Vec implementation.
How to build a deep Long Short-Term Memory (LSTM) recurrent neural network model for entity extraction on a GPU-enabled Azure Data Science Virtual Machine (GPU DSVM).
How domain-specific word embedding models can outperform generic word embedding models in the entity recognition task.
How to train and operationalize deep learning models using Azure Machine Learning Workbench.
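
Comparing two entity recognition models requires a shared metric. The sketch below assumes entity-level precision, recall, and F1 with exact-match scoring, which is a common choice for this task but is not stated in this README. The function name `entity_prf`, the tuple layout, and all the data are hypothetical and do not represent results from this project.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall, and F1 using exact matches.

    Each entity is a (sentence_id, start_token, end_token, type) tuple.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hand-written toy annotations, for illustration only.
gold = [(0, 0, 1, "Drug"), (0, 6, 8, "Disease"), (1, 2, 3, "Drug"), (1, 5, 6, "Disease")]
predictions_a = [(0, 0, 1, "Drug"), (0, 6, 8, "Disease"), (1, 2, 3, "Drug"), (1, 4, 6, "Disease")]
predictions_b = [(0, 0, 1, "Drug"), (1, 2, 3, "Drug")]

scores_a = entity_prf(gold, predictions_a)
scores_b = entity_prf(gold, predictions_b)
print("A: precision=%.2f recall=%.2f f1=%.2f" % scores_a)
print("B: precision=%.2f recall=%.2f f1=%.2f" % scores_b)
```

With exact-match scoring, a prediction whose boundaries are off by one token, like the last entry of `predictions_a`, counts as both a false positive and a missed entity.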

This scenario showcases the following capabilities within Azure Machine Learning Workbench:

Instantiation of Team Data Science Process (TDSP) structure and templates.
Automated management of your project dependencies, including download and installation.
Execution of code in Jupyter notebooks as well as Python scripts.
Run history tracking for Python files.
Execution of jobs on remote Spark compute context using HDInsight Spark 2.1 clusters.
Execution of jobs in remote GPU VMs on Azure.
Easy operationalization of deep learning models as web services hosted on Azure Container Service.

The detailed documentation for this scenario, including the step-by-step walk-through, is available at https://review.docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition.

For code samples, click the View Project icon on the right and visit the project GitHub repository.

Key components needed to run this example:

An Azure subscription
Azure Machine Learning Workbench with a workspace created. See installation guide.
To run this scenario with a Spark cluster, provision an Azure HDInsight Spark cluster (Spark 2.1 on Linux (HDI 3.6)) for scale-out computation. To process the full set of MEDLINE abstracts discussed below, we recommend a cluster with:
- a head node of type D13_V2
- at least four worker nodes of type D12_V2
- To maximize the performance of the cluster, we recommend changing the parameters spark.executor.instances, spark.executor.cores, and spark.executor.memory by following the instructions here and editing the definitions in the "custom spark defaults" section.
You can run the entity extraction model training locally on a Data Science Virtual Machine (DSVM) or in a Docker container on a remote DSVM.
To provision a DSVM for Linux (Ubuntu), follow the instructions here. We recommend the NC6 Standard size (56 GB, NVIDIA Tesla K80).
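
As a rough illustration of how the three Spark parameters named above relate to cluster size, the sketch below derives candidate values from the worker node count and per-node resources. The sizing heuristic (reserve one core and some memory per node, leave a fraction of memory for overhead) is a generic rule of thumb and an assumption, not the procedure from the linked instructions. The helper `executor_settings` is hypothetical, and the per-node figures of 4 cores and 28 GB for D12_V2 workers should be checked against the current Azure VM size documentation.

```python
def executor_settings(worker_nodes, cores_per_node, memory_gb_per_node,
                      cores_per_executor=3, reserved_cores=1,
                      reserved_memory_gb=4, overhead_fraction=0.10):
    """Suggest Spark executor settings from cluster size (rule of thumb only)."""
    # Leave some cores and memory on each node for the OS and cluster services.
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("cores_per_executor is too large for this node size")
    usable_memory_gb = memory_gb_per_node - reserved_memory_gb
    memory_per_executor_gb = usable_memory_gb / executors_per_node
    # Keep a fraction of each executor's share for off-heap overhead.
    heap_gb = int(memory_per_executor_gb * (1 - overhead_fraction))
    return {
        "spark.executor.instances": str(worker_nodes * executors_per_node),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": "%dg" % heap_gb,
    }


# Four worker nodes, assumed to have 4 cores and 28 GB of memory each.
settings = executor_settings(worker_nodes=4, cores_per_node=4, memory_gb_per_node=28)
for name, value in settings.items():
    print(name, value)
```

Treat the resulting values as a starting point to enter in the "custom spark defaults" section and adjust after observing actual job behavior.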

Data/Telemetry

The Biomedical named entity recognition scenario collects usage data and sends it to Microsoft to help improve our products and services. Read our privacy statement to learn more.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
