## AML Package for Text Analytics - Custom Word2Vec Embeddings
### Training Word2Vec Embeddings Model on Domain-Specific Data

This notebook shows how to use the **Azure Machine Learning Package for Text Analytics** (AMLPTA) to train Word2Vec word embeddings on a large text corpus.

The trained embeddings can be saved and later reloaded to compute word or sentence embeddings for new datasets, where they serve as features.

A Word2Vec model trained on a large corpus learns representations of words that incorporate information about their context. The model is a simple neural network that predicts either context words from a target word (skip-gram) or a target word from its context words (Continuous Bag of Words, CBOW). The first layer of the network acts as a lookup table that converts each input (i.e., each word in the corpus vocabulary) to a fixed-size vector called an embedding vector. This table of word vectors is learned during training, which is unsupervised: only text is supplied as input. After training, the embedding table is extracted and used for lookup and word-similarity calculations, for example in pipelines that use Word2Vec features.
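
To make the two training objectives concrete, the following minimal sketch generates the training examples each one would see for a single tokenized sentence. It is an illustration only, not the package's implementation: the helper names `skipgram_pairs` and `cbow_pairs` are hypothetical, the window is fixed, and refinements such as subsampling and negative sampling are left out.

```python
def skipgram_pairs(tokens, window):
    """Yield (target, context) pairs: the target word predicts each neighbor."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


def cbow_pairs(tokens, window):
    """Yield (context_words, target) pairs: the neighbors jointly predict the target."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            yield context, target


# Toy sentence (hypothetical example data)
sentence = ["the", "patient", "has", "fever"]
sg = list(skipgram_pairs(sentence, window=1))
cb = list(cbow_pairs(sentence, window=1))
print(sg)
print(cb)
```

With skip-gram, every (target, neighbor) combination is a separate example, so the same sentence yields more training pairs than with CBOW, where each position produces a single example.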

In this notebook, Word2Vec training is implemented by the `Word2VecModel` class. Its pipeline consists of the following steps:
- `RegExTransformer`, which cleans the input text by removing patterns that match a regular expression.
- `NltkPreprocessor`, which tokenizes the input and splits it into sentences.
- `UngroupTransformer`, which ungroups the detected sentences by writing each one to its own row of the dataset.
- `Word2VecVectorizer`, which trains the Word2Vec model.

Note that `RegExTransformer` is only added to the pipeline if the *regex* parameter is not *None* when creating a `Word2VecModel` instance, while `UngroupTransformer` is only added if *detect_sentences = True*.
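
The preprocessing steps can be pictured with a small standard-library sketch. It is only a stand-in under stated assumptions: the real pipeline relies on NLTK, whose sentence splitting and tokenization are more careful than the naive regular expressions used here, and the function names `clean`, `split_sentences`, `tokenize`, and `preprocess` are hypothetical. Lowercasing is also an assumption of this sketch.

```python
import re


def clean(text, regex=None):
    """Remove every match of `regex`; skip cleaning when regex is None."""
    return text if regex is None else re.sub(regex, " ", text)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase and keep runs of letters, digits, and hyphens as tokens."""
    return re.findall(r"[a-z0-9-]+", sentence.lower())


def preprocess(rows, regex=None, detect_sentences=True):
    """Return one token list per output row."""
    out = []
    for text in rows:
        text = clean(text, regex)
        units = split_sentences(text) if detect_sentences else [text]
        # "Ungroup": each detected sentence becomes its own row
        out.extend(tokenize(u) for u in units)
    return out


# Toy input with a citation marker to strip (hypothetical example data)
rows = ["Fever was observed [1]. Ibuprofen reduced it."]
sentences = preprocess(rows, regex=r"\[\d+\]")
print(sentences)
```

Mirroring the note above, the cleaning step is skipped when `regex` is `None`, and with `detect_sentences=False` each input row stays a single row.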

The model is trained on PubMed abstracts grouped into 18 batches (TSV files of about 1 GB each). Because training on the whole dataset is time-consuming, this tutorial trains on a subset of the data.

The steps for creating a custom Word2Vec model with the package are:
<br> Step 1: Configure and import modules
<br> Step 2: Prepare data for modeling and evaluation
<br> Step 3: Train the Word2Vec model 
<br> Step 4: Save and load pipeline for additional training
<br> Step 5: Save and load embeddings for lookup 

Consult the [package reference documentation](https://aka.ms/aml-packages/text) for the detailed reference for each module and class.

## Prerequisites 

1. If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin.

1. The following accounts and application must be set up and installed:
   - An Azure Machine Learning Experimentation account
   - An Azure Machine Learning Model Management account
   - An installation of Azure Machine Learning Workbench

   If these three are not yet created or installed, follow the [Azure Machine Learning Quickstart and Workbench installation](../service/quickstart-installation.md) article. 

1. The Azure Machine Learning Package for Text Analytics must be installed. Learn how to [install this package here](https://aka.ms/aml-packages/text).

## Step 1: Configure and import modules

In [25]:
import pip
print(pip.main(["show","azureml-tatk"]))

---
Metadata-Version: 2.0
Name: azureml-tatk
Version: 0.1.18129.4a1
Summary: Microsoft Azure Machine Learning Package for Text Analytics
Home-page: https://microsoft.sharepoint.com/teams/TextAnalyticsPackagePreview
Author: Microsoft Corporation
Author-email: amltap@microsoft.com
Installer: pip
License: UNKNOWN
Location: c:\users\tatk\appdata\local\amlworkbench\python\lib\site-packages
Requires: unidecode, ipython, jsonpickle, matplotlib, keras, dill, pandas, ruamel.yaml, ipywidgets, nltk, pdfminer.six, pyspark, azure-ml-api-sdk, requests, qgrid, scipy, azure-storage, nose, h5py, bqplot, pytest, gensim, lxml, sklearn-crfsuite, docker, numpy, validators, scikit-learn
Classifiers:


You are using pip version 8.1.2, however version 10.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


0


### (1.1) Configure AzureML logger

In [2]:
# Azure Machine Learning logger and magics for logging and run-history tracking
# Use the Azure Machine Learning data collector to log various metrics
from azureml.logging import get_azureml_logger
logger = get_azureml_logger()

# Log cell runs into run history
logger.log('Cell','Set up run')

<azureml.logging.script_run_request.ScriptRunRequest at 0x2518b309e80>

### (1.2) Import libraries

In [26]:
import os
import sys
import pandas as pd
import numpy as np

from tatk.utils import models_dir
from tatk.pipelines.feature_extraction.word2vec_model import Word2VecModel
from tatk.feature_extraction.word2vec_vectorizer import Word2VecVectorizer

'C:\\Users\\tatk\\tatk\\resources\\models'

## Step 2: Prepare data for modeling and evaluation

### (2.1) Download PubMed data.
Download the raw MEDLINE abstract data from [MEDLINE](https://www.nlm.nih.gov/pubs/factsheets/medline.html). The data is publicly available as XML files on the [FTP server](https://ftp.ncbi.nlm.nih.gov/pubmed/baseline). The server hosts 892 XML files, each containing records for 30,000 articles.

Set `xml_local_dir` to point to the folder containing the downloaded XML files.

In [27]:
xml_local_dir = r'C:\Users\tatk\Downloads\pubmed\raw'

### (2.2) Parse the downloaded XML data.

You can use the Azure ML Text Analytics Package to parse the downloaded PubMed XML files. It reads the XML files in batches (see the `batch_size` argument), preprocesses them (including removing articles with short abstracts; see the `min_abstract_len` argument), keeps specific columns (see the `cols_to_keep` argument), and saves them to disk as TSV files in `tsv_local_dir`.

Reading and saving the XML files in batches lets you train the embeddings model incrementally instead of loading a large amount of data into memory at once.

Change `tsv_local_dir` to write the preprocessed output to another path.

Change `num_xml_files` based on how many XML files you downloaded and/or want to process.

Change `batch_size` based on how many XML files you want to concatenate into one TSV file for model training.
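
As a rough guide to how `num_xml_files` and `batch_size` interact, the sketch below groups a list of file names into consecutive batches. How the package groups files internally is not documented here, so treat this as an assumption; `plan_batches` is a hypothetical helper, and the file names are modeled on the ones the parser reports when it runs (for example `pubmed18n0001.xml.gz`).

```python
def plan_batches(file_names, batch_size):
    """Split the list of XML files into consecutive groups of at most batch_size."""
    return [file_names[i:i + batch_size] for i in range(0, len(file_names), batch_size)]


num_xml_files = 20
batch_size = 10

# Hypothetical file list following the MEDLINE baseline naming pattern
files = ["pubmed18n{:04d}.xml.gz".format(i) for i in range(1, num_xml_files + 1)]
batches = plan_batches(files, batch_size)

for n, group in enumerate(batches, start=1):
    print("batch#{}.tsv <- {} ... {} ({} files)".format(n, group[0], group[-1], len(group)))
```

With 20 files and a batch size of 10, this yields two groups, which matches the two batch files used later in the notebook.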

In [5]:
from tatk.utils.parse_pubmed_data import process_files

tsv_local_dir = r'C:\Users\tatk\Downloads\pubmed\processed'
num_xml_files = 20
batch_size = 10 

process_files(xml_local_dir=xml_local_dir,
              tsv_local_dir=tsv_local_dir,
              batch_size=batch_size,
              num_xml_files=num_xml_files,
              min_abstract_len=10,
              cols_to_keep=['pmid', 'abstract'])

Process downloaded MEDLINE XML files - start
TSV directory C:\Users\tatk\Downloads\pubmed\processed
Processing 10 files ....
	processing pubmed18n0001.xml.gz .....
	number of records before missing data removal = 30000
RegExTransformer::tatk_transform ==> start
RegExTransformer::tatk_transform ==> end 	 Time taken: 0.02 mins
	number of records after missing data removal = 15377
	processing pubmed18n0002.xml.gz .....
	number of records before missing data removal = 30000
RegExTransformer::tatk_transform ==> start
RegExTransformer::tatk_transform ==> end 	 Time taken: 0.02 mins
	number of records after missing data removal = 13414
	processing pubmed18n0003.xml.gz .....
	number of records before missing data removal = 30000
RegExTransformer::tatk_transform ==> start
RegExTransformer::tatk_transform ==> end 	 Time taken: 0.02 mins
	number of records after missing data removal = 12615
	processing pubmed18n0004.xml.gz .....
	number of records before missing data removal = 30000
RegExTransfor

### (2.3) Read the first batch into memory.
Each paper has an ID (pmid) and a text abstract. 

In [6]:
file_path = os.path.join(tsv_local_dir, 'batch#1.tsv')

# Read the first batch into memory
data = pd.read_csv(file_path, sep="\t", usecols=['pmid', 'abstract'], encoding="ISO-8859-1").dropna()
df = data[:5000]  # take a subset for faster training

In [7]:
display(df[:5])

| | pmid | abstract |
|---|---|---|
| 20 | 21.0 | (--)-alpha-Bisabolol has a primary antipeptic ... |
| 21 | 22.0 | A report is given on the recent discovery of o... |
| 22 | 23.0 | The distribution of blood flow to the subendoc... |
| 23 | 24.0 | The virostatic compound N,N-diethyl-4-[2-(2-ox... |
| 24 | 25.0 | RMI 61 140, RMI 61 144 and RMI 61 280 are newl... |


## Step 3: Train the Word2Vec model

### (3.1) Create the Word2Vec model pipeline
Initialize the pipeline with default parameters. No regular expression cleaning is performed, and sentences are detected. 

In [8]:
word2vec_model = Word2VecModel(input_col = 'abstract', regex = None, detect_sentences = True)

Word2VecModel::create_pipeline ==> start
input_col=abstract
input_col=NltkPreprocessor6a6fd0bac3c54285b1fa0f4fb761c8ec
input_col=UngroupTransformerf7c4646520a84945ae49b853c24e4b26
:: number of jobs for the pipeline : 6
0	nltk_preprocessor
1	ungroup_transformer
2	vectorizer
Word2VecModel::create_pipeline ==> end


In [9]:
print(word2vec_model)

Word2VecModel TATK Pipeline:
0 - nltk_preprocessor(abstract,NltkPreprocessor6a6fd0bac3c54285b1fa0f4fb761c8ec)
1 - ungroup_transformer(NltkPreprocessor6a6fd0bac3c54285b1fa0f4fb761c8ec,UngroupTransformerf7c4646520a84945ae49b853c24e4b26)
2 - vectorizer(UngroupTransformerf7c4646520a84945ae49b853c24e4b26,Word2VecVectorizer6b41f68986cd47e39a887e1df7ad3c1e)



### (3.2) Display and Change default pipeline parameters

In [10]:
word2vec_model.get_step_params_by_name('vectorizer')

{'aggregation_func': <function tatk.feature_extraction.word2vec_vectorizer.Word2VecVectorizer.aggregate_mean(sentence_matrix)>,
 'case_sensitive': False,
 'context_window_size': 5,
 'copy_from_path': True,
 'embedding_size': 100,
 'embedding_table': None,
 'get_from_path': True,
 'input_col': 'UngroupTransformerf7c4646520a84945ae49b853c24e4b26',
 'lr_end': 0.005,
 'lr_start': 0.05,
 'min_df': 5,
 'negative_sample_size': 5,
 'num_epochs': 5,
 'num_workers': 4,
 'output_col': 'Word2VecVectorizer6b41f68986cd47e39a887e1df7ad3c1e',
 'return_type': 'word_vector',
 'save_overwrite': True,
 'skip_OOV': False,
 'trainable': True,
 'trained_model': None,
 'use_hierarchical_softmax': 0,
 'use_skipgram': 0}

In [11]:
# Change model parameters: enable skip-gram training (the default shown above is use_skipgram = 0)
word2vec_model.set_step_params_by_name('vectorizer', use_skipgram=1)
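
Among the parameters listed above, `min_df` is set to 5, and the training log in (3.3) reports `min_count=5`, which suggests that `min_df` acts as a minimum word frequency: words seen fewer times are dropped from the vocabulary before training. That mapping is inferred from the log rather than stated by the package. The sketch below shows the idea on toy data; `prune_vocabulary` is a hypothetical helper.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count):
    """Count tokens and keep only those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept


# Toy tokenized corpus (hypothetical example data)
sentences = [
    ["fever", "and", "rash"],
    ["fever", "and", "cough"],
    ["fever", "without", "rash"],
]
counts, kept = prune_vocabulary(sentences, min_count=2)
print(sorted(kept))
print(len(kept), "of", len(counts), "word types kept")
```

Raising the threshold shrinks the embedding table and removes rare, noisy words, at the cost of sending more words to out-of-vocabulary handling at lookup time.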

### (3.3) Fit the model on the training set

In [12]:
word2vec_model.fit(df)

Word2VecModel::fit ==> start
schema: col=pmid:R4:0 col=abstract:TX:1 header+
NltkPreprocessor::tatk_fit_transform ==> start
NltkPreprocessor::tatk_fit_transform ==> end 	 Time taken: 0.05 mins
UngroupTransformer::tatk_fit_transform ==> start
UngroupTransformer::tatk_fit_transform ==> end 	 Time taken: 0.0 mins
Word2VecVectorizer::tatk_fit ==> start
collecting all words and their counts
PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
PROGRESS: at sentence #10000, processed 265322 words, keeping 15654 word types
PROGRESS: at sentence #20000, processed 527867 words, keeping 22727 word types
PROGRESS: at sentence #30000, processed 792813 words, keeping 28054 word types
collected 30449 word types from a corpus of 917752 raw words and 34803 sentences
Loading a fresh vocabulary
min_count=5 retains 9697 unique words (31% of original 30449, drops 20752)
min_count=5 leaves 883549 word corpus (96% of original 917752, drops 34203)
deleting the raw counts dictionary of 30449 items

Word2VecModel(detect_sentences=True, input_col='abstract', regex=None)

### (3.4) Script to train the embeddings on multiple batches (note: very slow on the full 21 GB of data).
The loop below iterates over the batches in `tsv_local_dir`, reads each file into memory, and feeds it to the model for incremental training.

Replace *False* with *True* to run this step. Otherwise, proceed with the model trained above on a subset of batch #1.

In [13]:
num_batches = 2
if False:
    # Initialize the model
    word2vec_model = Word2VecModel(input_col='abstract', regex=None, detect_sentences=True)
    for b in range(1, num_batches + 1):
        print(b)
        file_path = os.path.join(tsv_local_dir, 'batch#{}.tsv'.format(b))
        df = pd.read_csv(file_path, sep="\t", usecols=['abstract'], encoding="ISO-8859-1").dropna()
        print(df.shape)
        word2vec_model.fit(df)

Word2VecModel::create_pipeline ==> start
input_col=abstract
input_col=NltkPreprocessor527ddb5170cd41728a90cd72a46d4d41
input_col=UngroupTransformeree8af7b41ea344379dbf9803e6ee5fbb
:: number of jobs for the pipeline : 6
0	nltk_preprocessor
1	ungroup_transformer
2	vectorizer
Word2VecModel::create_pipeline ==> end
1
(15377, 1)
Word2VecModel::fit ==> start
schema: col=abstract:TX:0 header+
NltkPreprocessor::tatk_fit_transform ==> start
NltkPreprocessor::tatk_fit_transform ==> end 	 Time taken: 0.15 mins
UngroupTransformer::tatk_fit_transform ==> start
UngroupTransformer::tatk_fit_transform ==> end 	 Time taken: 0.0 mins
Word2VecVectorizer::tatk_fit ==> start
collecting all words and their counts
PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
PROGRESS: at sentence #10000, processed 265322 words, keeping 15654 word types
PROGRESS: at sentence #20000, processed 527867 words, keeping 22727 word types
PROGRESS: at sentence #30000, processed 792813 words, keeping 28054 word ty

## Step 4: Save and load pipeline for additional training

### (4.1) Save and Load the pipeline

In [14]:
pipeline_path = os.path.join(models_dir, 'word2vec_model')
word2vec_model.save(pipeline_path, create_folders_on_path=True)
word2vec_model2 = Word2VecModel.load(pipeline_path)

BaseTextModel::save ==> start
TatkPipeline::save ==> start
saving Word2Vec object under C:\Users\tatk\tatk\resources\models\word2vec_model\pipeline\vectorizer\embedding_model.gen, separately None
not storing attribute syn0norm
not storing attribute cum_table
saved C:\Users\tatk\tatk\resources\models\word2vec_model\pipeline\vectorizer\embedding_model.gen
Time taken: 0.01 mins
TatkPipeline::save ==> end
Time taken: 0.01 mins
BaseTextModel::save ==> end
BaseTextModel::load ==> start
TatkPipeline::load ==> start
loading Word2Vec object from C:\Users\tatk\tatk\resources\models\word2vec_model\pipeline\vectorizer\embedding_model.gen
loading wv recursively from C:\Users\tatk\tatk\resources\models\word2vec_model\pipeline\vectorizer\embedding_model.gen.wv.* with mmap=None
setting ignored attribute syn0norm to None
setting ignored attribute cum_table to None
loaded C:\Users\tatk\tatk\resources\models\word2vec_model\pipeline\vectorizer\embedding_model.gen
Word2VecVectorizer: Word2Vec model loaded 

### (4.2) Perform additional training on new data

In [15]:
df2 = pd.DataFrame(data[5000:6000])
word2vec_model2.fit(df2)

Word2VecModel::fit ==> start
schema: col=pmid:R4:0 col=abstract:TX:1 header+
NltkPreprocessor::tatk_fit_transform ==> start
NltkPreprocessor::tatk_fit_transform ==> end 	 Time taken: 0.01 mins
UngroupTransformer::tatk_fit_transform ==> start
UngroupTransformer::tatk_fit_transform ==> end 	 Time taken: 0.0 mins
Word2VecVectorizer::tatk_fit ==> start
collecting all words and their counts
PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
collected 13179 word types from a corpus of 184248 raw words and 7106 sentences
Updating model with new vocabulary
New added 3795 unique words (22% of original 16974) and increased the count of 3795 pre-existing words (22% of original 16974)
deleting the raw counts dictionary of 13179 items
sample=0.0001 downsamples 976 most-common words
downsampling leaves estimated 150507 word corpus (89.3% of prior 168598)
estimated required memory for 7590 words and 100 dimensions: 9867000 bytes
updating layer weights
training model with 4 workers on 2

Word2VecModel(detect_sentences=True, input_col='abstract', regex=None)

## Step 5: Save and load embeddings for lookup

### (5.1) Save the embeddings from the model

In [16]:
# Saved embeddings file is in textual format and is readable if opened with a text editor
embeddings_file_path = os.path.join(models_dir, 'word2vec_embeddings.txt')
word2vec_model2.save_embeddings(embeddings_file_path)

Word2VecVectorizer::save_embeddings ==> start
storing 21903x100 projection weights into C:\Users\tatk\tatk\resources\models\word2vec_embeddings.txt
Time taken: 0.02 mins
Word2VecVectorizer::save_embeddings ==> end
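
The log line `storing 21903x100 projection weights` resembles what gensim prints when writing the plain-text word2vec format, so the saved file presumably starts with a header line holding the vocabulary size and the dimension, followed by one line per word. Assuming that layout, a minimal reader could look like the sketch below; `read_word2vec_text` is a hypothetical helper and the sample values are made up.

```python
import io


def read_word2vec_text(handle):
    """Parse the word2vec text format: a 'count dim' header, then 'word v1 ... vN' lines."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    table = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError("unexpected number of fields: {!r}".format(line))
        table[parts[0]] = [float(v) for v in parts[1:]]
    if len(table) != vocab_size:
        raise ValueError("header announces {} words, found {}".format(vocab_size, len(table)))
    return table


# Tiny in-memory stand-in for an embeddings file (made-up values, 3 dimensions)
sample = io.StringIO("2 3\nfever 0.1 0.2 0.3\nrash -0.4 0.5 0.6\n")
table = read_word2vec_text(sample)
print(table["fever"])
```

To read a real file, pass an open text handle instead of the `io.StringIO` object. For anything beyond inspection, the package's own `load_embeddings` method shown in the next step is the safer route.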


### (5.2) Load the embeddings into memory with include_unk set to True to handle out-of-vocabulary (OOV) words

In [17]:
vectorizer = Word2VecVectorizer.load_embeddings(embeddings_file_path, include_unk = True,
                                                unk_method = 'rnd', unk_vector = None, unk_word = '<UNK>')

Word2VecVectorizer::load_embeddings ==> start
loading projection weights from C:\Users\tatk\tatk\resources\models\word2vec_embeddings.txt
loaded (21903, 100) matrix from C:\Users\tatk\tatk\resources\models\word2vec_embeddings.txt
Time taken: 0.04 mins
Word2VecVectorizer::load_embeddings ==> end


### (5.3) Embedding Lookup: Get word indices.

In [18]:
df_predict = pd.DataFrame({'text' : ["I have fever", "My doctor prescribed me ibuprofen."]})
vectorizer.input_col = 'text'
vectorizer.output_col = 'indices'
vectorizer.return_type = 'word_index'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


| | text | indices |
|---|---|---|
| 0 | I have fever | [125, 53, 1891] |
| 1 | My doctor prescribed me ibuprofen. | [10249, 7005, 4920, 3623, 21903] |
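
In the output above, the last token of the second sentence maps to index 21903, which equals the number of rows in the loaded table (21903 x 100). This suggests that the unknown-word entry is appended as one extra row after the vocabulary and that `ibuprofen.` (with the trailing period) is treated as out of vocabulary. The notebook does not state this explicitly, so the sketch below is one plausible reading; `build_index` and `to_indices` are hypothetical helpers, and whitespace tokenization is an assumption.

```python
def build_index(vocabulary, unk_word="<UNK>"):
    """Map each word to its row; the unknown token gets the row after the last word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    index[unk_word] = len(vocabulary)
    return index


def to_indices(text, index, unk_word="<UNK>"):
    """Whitespace-tokenize, lowercase, and fall back to the unknown row for OOV tokens."""
    unk = index[unk_word]
    return [index.get(token, unk) for token in text.lower().split()]


# Toy vocabulary (hypothetical example data)
vocabulary = ["i", "have", "fever", "ibuprofen"]
index = build_index(vocabulary)
print(to_indices("I have fever", index))
print(to_indices("I took ibuprofen.", index))
```

In the second call, both `took` and `ibuprofen.` fall back to the unknown row, which shows why punctuation handling before lookup matters.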


### (5.4) Embedding Lookup: Get word embeddings.

In [19]:
vectorizer.output_col = 'word_vector'
vectorizer.return_type = 'word_vector'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


| | text | indices | word_vector |
|---|---|---|---|
| 0 | I have fever | [125, 53, 1891] | [[0.0705619975924, -0.0768989995122, 1.8238869... |
| 1 | My doctor prescribed me ibuprofen. | [10249, 7005, 4920, 3623, 21903] | [[0.075989998877, 0.00504799978808, -0.1458210... |


### (5.5) Embedding Lookup: Get sentence embedding.

In [20]:
vectorizer.output_col = 'sentence_vector'
vectorizer.return_type = 'sentence_vector'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


| | text | indices | word_vector | sentence_vector |
|---|---|---|---|---|
| 0 | I have fever | [125, 53, 1891] | [[0.0705619975924, -0.0768989995122, 1.8238869... | [0.0703796669841, 0.10862232993, 0.62739165624... |
| 1 | My doctor prescribed me ibuprofen. | [10249, 7005, 4920, 3623, 21903] | [[0.075989998877, 0.00504799978808, -0.1458210... | [0.49121833265, 0.210629751274, -0.23683439670... |
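
The vectorizer's default `aggregation_func` shown in (3.2) is `aggregate_mean`, so a sentence vector is presumably the element-wise average of the sentence's word vectors. The following is a re-implementation sketch of that idea on made-up numbers, not the package's own code.

```python
def aggregate_mean(word_vectors):
    """Average a list of equal-length word vectors into one sentence vector."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) / len(word_vectors) for d in range(dim)]


# Three made-up 3-dimensional word vectors standing in for one sentence
word_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
sentence_vector = aggregate_mean(word_vectors)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```

Averaging gives every sentence a vector of the same size regardless of its length, which is what makes it usable as a fixed-width feature.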


### (5.6) Embedding Lookup: Get the words most similar to a given word.

In [21]:
vectorizer.embedding_table.most_similar('fever')

precomputing L2-norms of word weight vectors


[('diarrhea', 0.7121096849441528),
 ('febrile', 0.6585642099380493),
 ('meningitis', 0.658104658126831),
 ('endophthalmitis', 0.6556650400161743),
 ('dysentery', 0.646338939666748),
 ('fatal', 0.6434411406517029),
 ('necrotizing', 0.6364932060241699),
 ('eosinophilia', 0.6359970569610596),
 ('diarrhoea', 0.6343546509742737),
 ('rash', 0.6313232183456421)]

In [22]:
vectorizer.embedding_table.most_similar('doctor')

[('physician', 0.8832104206085205),
 ('education', 0.846272349357605),
 ('physicians', 0.8457393050193787),
 ('assistants', 0.8402310609817505),
 ('staff', 0.8373662233352661),
 ('insurance', 0.8310399055480957),
 ('attitudes', 0.8197391033172607),
 ('nurse', 0.8195598125457764),
 ('assistant', 0.8187754154205322),
 ('practitioner', 0.8185635805130005)]

In [23]:
vectorizer.embedding_table.most_similar('ibuprofen')

[('diphenhydramine', 0.8111883401870728),
 ('phenylbutazone', 0.7920838594436646),
 ('asa', 0.7769595384597778),
 ('flurbiprofen', 0.7701882719993591),
 ('methadone', 0.7679111957550049),
 ('secobarbital', 0.7673499584197998),
 ('aspirin', 0.7639237642288208),
 ('perphenazine', 0.7494533061981201),
 ('indoprofen', 0.742222011089325),
 ('codeine', 0.7416385412216187)]

In [24]:
vectorizer.embedding_table.most_similar('have')

[('has', 0.712862491607666),
 ('previously', 0.45508110523223877),
 ('recently', 0.45498350262641907),
 ('examining', 0.41651538014411926),
 ('had', 0.4008268713951111),
 ('speculate', 0.395913302898407),
 ('herein', 0.3913381099700928),
 ('heretofore', 0.3901556730270386),
 ('nap', 0.3745071291923523),
 ('terminology', 0.36553382873535156)]
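
The scores returned above are cosine similarities between the query word's vector and every other vector in the table, which is why the log mentions precomputing L2 norms. The sketch below reproduces the ranking logic on a toy table in pure Python; the vectors are made up, and the `most_similar` function here is a simplified stand-in for the method called above, not its actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, table, topn=10):
    """Rank every other word in the table by cosine similarity to `word`."""
    query = table[word]
    scores = [(other, cosine(query, vec)) for other, vec in table.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-dimensional embedding table (made-up values)
table = {
    "fever": [1.0, 0.0],
    "febrile": [0.9, 0.1],
    "rash": [0.5, 0.5],
    "nap": [0.0, 1.0],
}
print(most_similar("fever", table, topn=2))
```

Because cosine similarity ignores vector length, frequent and rare words are compared on direction alone, which helps explain why a function word such as `have` has much lower scores than domain terms such as `ibuprofen`.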

## Next steps

Learn more about Azure Machine Learning Package for Text Analytics in these articles:

+ Read the [package overview and learn how to install it](https://aka.ms/aml-packages/text).

+ Explore the [reference documentation](https://aka.ms/aml-packages/text) for this package.

+ Learn about [other Python packages for Azure Machine Learning](reference-python-package-overview.md).