This repository was archived by the owner on Jun 29, 2019. It is now read-only.
1. Data Acquisition and Understanding

The raw MEDLINE corpus contains a total of 27 million citations, of which about 10 million have an empty abstract field. The downloaded files total 22 GB. Azure HDInsight Spark is used to process this big data, which cannot be loaded into the memory of a single machine as a Pandas DataFrame.

Objective

The companion script covers how to:

  • download the MEDLINE XML files from the MEDLINE site to the head node of the Spark cluster (see the download_xml_gz_files() function),
  • parse them using the publicly available XML parser pubmed_parser,
  • save the abstracts into TSV files, and
  • upload the TSV files to the blob storage container associated with the Spark cluster (see the process_files() function in the Python script).
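The download/parse/save steps above can be sketched as follows. This is a condensed, single-machine illustration: xml.etree stands in for pubmed_parser, the XML sample is invented, and the real companion script handles many more fields and the upload to blob storage.

```python
import csv
import io
import xml.etree.ElementTree as ET


def parse_abstracts(xml_text):
    """Extract (pmid, abstract) pairs from a MEDLINE-style XML string.

    A simplified stand-in for pubmed_parser, which returns richer records.
    """
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID", default="")
        abstract = " ".join(
            t.text or "" for t in article.iter("AbstractText")
        ).strip()
        records.append((pmid, abstract))
    return records


def write_tsv(records, fh):
    """Write (pmid, abstract) rows as TSV, skipping empty abstracts."""
    writer = csv.writer(fh, delimiter="\t")
    for pmid, abstract in records:
        if abstract:
            writer.writerow([pmid, abstract])


# Invented sample: one article with an abstract, one without.
sample = """<PubmedArticleSet>
  <PubmedArticle>
    <PMID>12345</PMID>
    <Abstract><AbstractText>BRCA1 mutations raise cancer risk.</AbstractText></Abstract>
  </PubmedArticle>
  <PubmedArticle>
    <PMID>67890</PMID>
  </PubmedArticle>
</PubmedArticleSet>"""

buf = io.StringIO()
write_tsv(parse_abstracts(sample), buf)
```

In the actual pipeline each downloaded .xml.gz file is decompressed, parsed this way, and the resulting TSV files are uploaded to the cluster's blob storage container.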

To upload these files to a different blob storage account, create a blob storage container named 'dataset' in your storage account. You can do that by going to the Azure portal page of your storage account, clicking Blobs, and then clicking +Container. Enter 'dataset' as the Name and click OK. The following screenshots illustrate these steps:

[Screenshot: Open blob]

[Screenshot: Open container]

The upload of the files takes several minutes, depending on your Internet connection.

Execution Steps

First, the data is downloaded into the Spark cluster. Then the following steps are executed on the Spark DataFrame:

  • parse the XML files using the Medline XML Parser,
  • preprocess the abstract text, including sentence splitting, tokenization, and case normalization,
  • exclude articles whose abstract field is empty or contains only short text,
  • create the word vocabulary from the training abstracts, and
  • train the word embedding neural model. You can refer to this Python script and its documentation to get started.
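A minimal, single-machine sketch of the preprocessing, filtering, and vocabulary steps (the actual pipeline runs these as Spark transformations; the regex-based sentence splitter, tokenizer, and the MIN_ABSTRACT_TOKENS threshold below are simplifying assumptions, not the project's exact rules):

```python
import re
from collections import Counter

MIN_ABSTRACT_TOKENS = 10  # illustrative threshold for "short" abstracts


def preprocess(abstract):
    """Split an abstract into sentences, then tokenize and lowercase each."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences if s]


def build_vocabulary(abstracts, min_count=1):
    """Count tokens across abstracts, excluding empty or very short ones."""
    counts = Counter()
    for abstract in abstracts:
        tokens = [t for sent in preprocess(abstract) for t in sent]
        if len(tokens) < MIN_ABSTRACT_TOKENS:
            continue  # exclude articles with an empty or short abstract
        counts.update(tokens)
    return {w for w, c in counts.items() if c >= min_count}
```

The resulting sentence lists and vocabulary then feed the word embedding training step, for example with a word2vec-style model as described in the referenced Python script.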

After parsing the Medline XML files, each data record has the following format:

Data Sample
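For illustration, a parsed record resembles the following dictionary. The field names follow pubmed_parser's general conventions, but both the values and the exact set of fields here are hypothetical; see the data sample above for the actual format.

```python
# Hypothetical parsed record; all values are invented for illustration.
record = {
    "pmid": "12345",
    "title": "An example MEDLINE article title.",
    "abstract": "An example abstract describing the study.",
    "journal": "Example Journal of Medicine",
    "pubdate": "2017",
}
```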

The neural entity extraction model has been trained and evaluated on publicly available datasets. For a detailed description of these datasets, refer to their original sources.

How to run this script

To run this script on the HDInsight Spark cluster:

  1. Run the Azure ML Workbench installed on your DS VM.
  2. Open a command-line (CLI) window by clicking the File menu in the top-left corner of AML Workbench and choosing "Open Command Prompt."
  3. Run the following command in the CLI window:
    az ml experiment submit -c myspark 1_Download_and_Parse_XML_Spark.py

where myspark is the Spark environment defined in the configuration step.

Notes

  • There are more than 800 XML files on the MEDLINE FTP server. The shared code downloads them all, which takes a long time. If you just want to test the code, you can change that and download only a subsample.
  • The source code of the PubMed Parser is also included in the repository.
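One simple way to test on a subsample is to slice the file list before downloading. The file-name pattern and count below are illustrative; the real names come from the FTP directory listing, and how the list is passed to the download step depends on the companion script.

```python
# Take only the first few of the ~800 MEDLINE files for a quick test run.
# File names are illustrative; the real names come from the FTP listing.
all_files = ["medline17n%04d.xml.gz" % i for i in range(1, 813)]
SAMPLE_SIZE = 5
subsample = all_files[:SAMPLE_SIZE]
```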

Next Step

  1. Modeling