## AML Package for Text Analytics - Custom Word2Vec Embeddings
### Training Word2Vec Embeddings Model on Domain-Specific Data

This notebook shows how to use **Azure Machine Learning Package for Text Analytics** (AMLPTA) to train Word2Vec word embeddings on a large text corpus. 

The trained Word2Vec embeddings can be saved and later reloaded to compute embeddings for words or sentences in new datasets, where they serve as features.

A Word2Vec model trained on a large corpus learns word representations that incorporate information about each word's context. The model is a simple neural network that predicts either the context words from a target word (skip-gram) or a target word from its context words (Continuous Bag of Words, CBOW). The first layer of the network acts as a lookup table that converts each input (i.e., each word in the corpus vocabulary) to a fixed-size vector called an embedding vector. This table of word vectors is learned during training, which is unsupervised: only text is supplied as input. After training, the embedding table is extracted and used for lookups and word-similarity calculations, for example in pipelines that use Word2Vec features.
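
To make the difference between the two training objectives concrete, the following minimal sketch builds the training examples that each objective would see for a toy sentence. The function names `skipgram_pairs` and `cbow_examples` are hypothetical helpers written for this illustration and are not part of the package; real Word2Vec implementations also apply tricks such as subsampling of frequent words that are omitted here.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (target, context) pairs: skip-gram predicts each context word from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window_size=2):
    """Return (context_words, target) examples: CBOW predicts the target from its context."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


# Toy sentence (hypothetical data, already tokenized)
tokens = ["patients", "with", "acute", "fever"]
pairs = skipgram_pairs(tokens, window_size=1)
examples = cbow_examples(tokens, window_size=1)
print(pairs)
print(examples)
```

The window size here plays the same role as the `context_window_size` parameter shown later in the vectorizer settings.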

In this notebook, Word2Vec model training is implemented in the `Word2VecModel` class. Its pipeline consists of the following steps:
- `RegExTransformer`, which cleans the input text by removing patterns that match a regular expression.
- `NltkPreprocessor`, which tokenizes the input and splits it into sentences.
- `UngroupTransformer`, which ungroups the detected sentences by writing each one on its own row of the dataset.
- `Word2VecVectorizer`, which trains the Word2Vec model.

Note that `RegExTransformer` is only added to the pipeline if *regex* parameter is not *None* when creating a `Word2VecModel` instance, while `UngroupTransformer` is only added if *detect_sentences = True*.
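
The preprocessing steps listed above can be sketched in plain Python as follows. This is only an illustration of the data flow under simplifying assumptions (a naive punctuation-based sentence splitter and a regex tokenizer); it is not how `RegExTransformer`, `NltkPreprocessor`, or `UngroupTransformer` are implemented, and all function names are hypothetical.

```python
import re


def clean_text(text, pattern=None):
    """Remove everything matching `pattern`; skip cleaning when pattern is None."""
    return text if pattern is None else re.sub(pattern, "", text)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters, digits, and hyphens."""
    return re.findall(r"[a-z0-9-]+", sentence.lower())


def ungroup(rows):
    """Flatten rows of sentence lists so that each sentence gets its own row."""
    return [sentence for sentences in rows for sentence in sentences]


# Toy abstracts (hypothetical data); the pattern strips citation markers like "[1]"
abstracts = ["Fever was common [1]. Ibuprofen helped.", "No effect was seen."]
cleaned = [clean_text(a, pattern=r"\[\d+\]") for a in abstracts]
sentence_rows = ungroup([split_sentences(a) for a in cleaned])
token_rows = [tokenize(s) for s in sentence_rows]
print(token_rows)
```

Passing `pattern=None` skips the cleaning step, which mirrors the note above about `RegExTransformer` being optional.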

The model is trained on PubMed abstracts, grouped in 18 batches (TSV files of about 1 GB each). Because training on the whole dataset is time-consuming, this tutorial trains on a subset of the data.

Following are the steps for creating a custom word2vec model using the package:
<br> Step 1: Configure and import modules
<br> Step 2: Prepare data for modeling and evaluation
<br> Step 3: Train the Word2Vec model 
<br> Step 4: Save and load pipeline for additional training
<br> Step 5: Save and load embeddings for lookup 

Consult the [package reference documentation](https://aka.ms/aml-packages/text) for the detailed reference for each module and class.

## Prerequisites 

1. If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin.

1. The following accounts and application must be set up and installed:
   - Azure Machine Learning Experimentation account 
   - An Azure Machine Learning Model Management account
   - Azure Machine Learning Workbench installed

   If these three are not yet created or installed, follow the [Azure Machine Learning Quickstart and Workbench installation](../service/quickstart-installation.md) article. 

1. The Azure Machine Learning Package for Text Analytics must be installed. Learn how to [install this package here](https://aka.ms/aml-packages/text).

## Step 1: Configure and import modules

In [1]:
# Import Packages 
# Use Azure Machine Learning history magic to control history collection
# History is off by default, options are "on", "off", or "show"
#%azureml history on
%matplotlib inline
# Use the Azure Machine Learning data collector to log various metrics
from azureml.logging import get_azureml_logger
import os
import warnings
logger = get_azureml_logger()

# Log cell runs into run history
logger.log('Cell','Set up run')

warnings.filterwarnings("ignore")

---
Metadata-Version: 2.0
Name: azureml-tatk
Version: 0.1.18121.30a1
Summary: Microsoft Azure Machine Learning Package for Text Analytics
Home-page: https://microsoft.sharepoint.com/teams/TextAnalyticsPackagePreview
Author: Microsoft Corporation
Author-email: amltap@microsoft.com
Installer: pip
License: UNKNOWN
Location: c:\users\tatk\appdata\local\amlworkbench\python\lib\site-packages
Requires: pyspark, unidecode, dill, sklearn-crfsuite, h5py, scipy, pdfminer.six, azure-storage, scikit-learn, ruamel.yaml, azure-ml-api-sdk, bqplot, ipython, jsonpickle, numpy, matplotlib, ipywidgets, nose, pandas, pytest, docker, validators, qgrid, nltk, gensim, requests, lxml, keras
Classifiers:




### (1.1) Configure AzureML logger

In [2]:
# Azure Machine Learning logger and magics for logging and run-history tracking
# Use the Azure Machine Learning data collector to log various metrics
from azureml.logging import get_azureml_logger
logger = get_azureml_logger()

# Log cell runs into run history
logger.log('Cell','Set up run')

<azureml.logging.script_run_request.ScriptRunRequest at 0x1cc693c9eb8>

### (1.2) Import libraries

In [3]:
import os
import sys
import pandas as pd
import numpy as np

from tatk.utils import models_dir
from tatk.pipelines.feature_extraction.word2vec_model import Word2VecModel
from tatk.feature_extraction.word2vec_vectorizer import Word2VecVectorizer

'pattern' package not found; tag filters are not available for English


## Step 2: Prepare data for modeling and evaluation

### (2.1) Download and parse Pubmed data. 
Download the raw MEDLINE abstract data from [MEDLINE](https://www.nlm.nih.gov/pubs/factsheets/medline.html). The data is publicly available as XML files on the [FTP server](https://ftp.ncbi.nlm.nih.gov/pubmed/baseline). There are 892 XML files on the server, and each file contains information for 30,000 articles.

Keep two columns, pmid (the document ID) and abstract, and save them in tab-separated format.

You can split the data across multiple files and perform incremental learning instead of reading a large amount of data into memory at once.
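
One possible way to extract the two columns is sketched below with the Python standard library. It assumes the usual PubMed XML layout, in which each `PubmedArticle` element holds `MedlineCitation/PMID` and `MedlineCitation/Article/Abstract/AbstractText`; check this against the files you download. The helper names `iter_pmid_abstract` and `write_tsv` are hypothetical, and the inline sample stands in for a real file.

```python
import csv
import io
import xml.etree.ElementTree as ET


def iter_pmid_abstract(xml_source):
    """Yield (pmid, abstract) for every article that has a non-empty abstract."""
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID")
        parts = elem.findall("MedlineCitation/Article/Abstract/AbstractText")
        text = " ".join("".join(p.itertext()) for p in parts)
        abstract = " ".join(text.split())  # collapse tabs and newlines into single spaces
        if pmid and abstract:
            yield pmid, abstract
        elem.clear()  # release the parsed article to keep memory use low


def write_tsv(rows, out_file):
    """Write a header plus (pmid, abstract) rows in tab-separated format."""
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(["pmid", "abstract"])
    writer.writerows(rows)


# Tiny inline sample (hypothetical IDs and text); the second article has no abstract
sample = b"""<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID Version="1">1</PMID>
<Article><Abstract><AbstractText>First  abstract.</AbstractText></Abstract></Article>
</MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation><PMID Version="1">2</PMID>
<Article></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

rows = list(iter_pmid_abstract(io.BytesIO(sample)))
buffer = io.StringIO()
write_tsv(rows, buffer)
print(buffer.getvalue())
```

For a real file, pass its path (or an opened binary file) to `iter_pmid_abstract` and an opened text file to `write_tsv`.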

Set `file_path` to point to the specific dataset to train on.

In [4]:
file_path = r'C:\Users\tatk\Downloads\pubmed18nbatch#1.tsv'

### (2.2) Read the first batch in-memory. 
Each paper has an ID (pmid) and a text abstract. 

In [5]:
data = pd.read_csv(file_path, sep = "\t", usecols = ['pmid', 'abstract'], encoding = "ISO-8859-1").dropna()#read in-memory
df = data[:5000] # take a subset for faster training. 

In [6]:
display(df[:5])

| | pmid | abstract |
|---|---|---|
| 0 | 4525790 | We have developed a cell-free system to study ... |
| 1 | 4525907 | The oxy-form of sickle hemoglobin (Hb S) is ab... |
| 2 | 4525937 | Of 17 consecutive patients with acute granuloc... |
| 3 | 4526201 | Carbon magnetic resonance T(1) relaxation and ... |
| 4 | 4526202 | The ability of fibroblasts to perform unschedu... |


## Step 3: Train the Word2Vec model

### (3.1) Create the Word2Vec model pipeline
Initialize the pipeline with default parameters. No regular expression cleaning is performed, and sentences are detected. 

In [7]:
word2vec_model = Word2VecModel(input_col = 'abstract', regex = None, detect_sentences = True)

Word2VecModel::create_pipeline ==> start
input_col=abstract
input_col=NltkPreprocessor08a5db0dd16644aabf8c7e8812e47549
input_col=UngroupTransformerb59cc67a90b34d78bb2e9adb445290d6
:: number of jobs for the pipeline : 6
0	nltk_preprocessor
1	ungroup_transformer
2	vectorizer
Word2VecModel::create_pipeline ==> end


In [8]:
print(word2vec_model)

Word2VecModel TATK Pipeline:
0 - nltk_preprocessor(abstract,NltkPreprocessor08a5db0dd16644aabf8c7e8812e47549)
1 - ungroup_transformer(NltkPreprocessor08a5db0dd16644aabf8c7e8812e47549,UngroupTransformerb59cc67a90b34d78bb2e9adb445290d6)
2 - vectorizer(UngroupTransformerb59cc67a90b34d78bb2e9adb445290d6,Word2VecVectorizerb2f2ed9dea734f5cb2c53909b78e15a6)



### (3.2) Display and Change default pipeline parameters

In [9]:
word2vec_model.get_step_params_by_name('vectorizer')

{'aggregation_func': <function tatk.feature_extraction.word2vec_vectorizer.Word2VecVectorizer.aggregate_mean(sentence_matrix)>,
 'case_sensitive': False,
 'context_window_size': 5,
 'copy_from_path': True,
 'embedding_size': 100,
 'embedding_table': None,
 'get_from_path': True,
 'input_col': 'UngroupTransformerb59cc67a90b34d78bb2e9adb445290d6',
 'lr_end': 0.005,
 'lr_start': 0.05,
 'min_df': 5,
 'negative_sample_size': 5,
 'num_epochs': 5,
 'num_workers': 4,
 'output_col': 'Word2VecVectorizerb2f2ed9dea734f5cb2c53909b78e15a6',
 'return_type': 'word_vector',
 'save_overwrite': True,
 'skip_OOV': False,
 'trainable': True,
 'trained_model': None,
 'use_hierarchical_softmax': 0,
 'use_skipgram': 0}

In [10]:
# Change model parameters
word2vec_model.set_step_params_by_name('vectorizer', use_skipgram = 1) 

### (3.3) Fit the model on the training set

In [11]:
word2vec_model.fit(df)

Word2VecModel::fit ==> start
schema: col=pmid:I8:0 col=abstract:TX:1 header+
NltkPreprocessor::tatk_fit_transform ==> start
NltkPreprocessor::tatk_fit_transform ==> end 	 Time taken: 0.05 mins
UngroupTransformer::tatk_fit_transform ==> start
UngroupTransformer::tatk_fit_transform ==> end 	 Time taken: 0.0 mins
Word2VecVectorizer::tatk_fit ==> start
collecting all words and their counts
PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
PROGRESS: at sentence #10000, processed 281599 words, keeping 15380 word types
PROGRESS: at sentence #20000, processed 557607 words, keeping 20905 word types
collected 25746 word types from a corpus of 834512 raw words and 29982 sentences
Loading a fresh vocabulary
min_count=5 retains 9277 unique words (36% of original 25746, drops 16469)
min_count=5 leaves 806233 word corpus (96% of original 834512, drops 28279)
deleting the raw counts dictionary of 25746 items
sample=0.0001 downsamples 476 most-common words
downsampling leaves estimated 

Word2VecModel(detect_sentences=True, input_col='abstract', regex=None)

### (3.4) Script to train the embeddings on multiple batches (note: very long on full 21 GB of data).
We loop over the batches, assuming the batch files are at the following paths (replace them with your own):

`C:\Users\tatkdocs\Downloads\batch_files\pubmed18nbatch#1.tsv`, `C:\Users\tatkdocs\Downloads\batch_files\pubmed18nbatch#2.tsv`, etc.

Each file is read into memory and fed to the model for incremental learning.

Replace *False* with *True* if you would like to perform this step. Otherwise, proceed with the model trained above on a subset of batch #1.

In [12]:
num_batches = 1  # Number of batch files to train on
if False:  # Change to True to train on all batches
    # Initialize the model once; each fit() call below continues training on a new batch.
    word2vec_model = Word2VecModel(input_col = 'abstract', regex = None, detect_sentences = True)
    for b in range(1, num_batches + 1):
        print(b)
        file_path = r"C:\Users\tatkdocs\Downloads\batch_files\pubmed18nbatch#{}.tsv".format(b)
        df = pd.read_csv(file_path, sep = "\t", usecols = ['abstract'], encoding = "ISO-8859-1").dropna()
        print(df.shape)
        word2vec_model.fit(df)
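
If even a single batch file is too large to read at once, one option is to stream it in smaller chunks. The sketch below uses only the standard library and a hypothetical helper, `iter_chunks`; it skips rows with an empty abstract, similar in spirit to the `dropna()` call above. Wrapping each chunk in a DataFrame and passing it to `fit` relies on the incremental-learning behavior described in this section and is not shown here.

```python
import csv
import io
from itertools import islice


def iter_chunks(tsv_file, column="abstract", chunk_size=1000):
    """Yield lists of up to `chunk_size` non-empty values from one column of a TSV file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    values = (row[column] for row in reader if row.get(column))
    while True:
        chunk = list(islice(values, chunk_size))
        if not chunk:
            return
        yield chunk


# Inline sample (hypothetical data); row 2 has an empty abstract and is skipped
sample = io.StringIO("pmid\tabstract\n1\tA b c.\n2\t\n3\tD e f.\n4\tG h.\n")
chunks = list(iter_chunks(sample, chunk_size=2))
print(chunks)
```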

## Step 4: Save and load pipeline for additional training

### (4.1) Save and Load the pipeline

In [13]:
pipeline_path = os.path.join(models_dir, 'word2vec_model')
word2vec_model.save(pipeline_path, create_folders_on_path=True)
word2vec_model2 = Word2VecModel.load(pipeline_path)

BaseTextModel::save ==> start
TatkPipeline::save ==> start
saving Word2Vec object under C:\Users\tatk\tatk\resources\models\word2vec_model\pipeline\vectorizer\embedding_model.gen, separately None
not storing attribute syn0norm
not storing attribute cum_table
saved C:\Users\tatk\tatk\resources\models\word2vec_model\pipeline\vectorizer\embedding_model.gen
Time taken: 0.0 mins
TatkPipeline::save ==> end
Time taken: 0.0 mins
BaseTextModel::save ==> end
BaseTextModel::load ==> start
TatkPipeline::load ==> start
loading Word2Vec object from C:\Users\tatk\tatk\resources\models\word2vec_model\pipeline\vectorizer\embedding_model.gen
loading wv recursively from C:\Users\tatk\tatk\resources\models\word2vec_model\pipeline\vectorizer\embedding_model.gen.wv.* with mmap=None
setting ignored attribute syn0norm to None
setting ignored attribute cum_table to None
loaded C:\Users\tatk\tatk\resources\models\word2vec_model\pipeline\vectorizer\embedding_model.gen
Word2VecVectorizer: Word2Vec model loaded fr

### (4.2) Perform additional training on new data

In [14]:
df2 = pd.DataFrame(data[5000:6000])
word2vec_model2.fit(df2)

Word2VecModel::fit ==> start
schema: col=pmid:I8:0 col=abstract:TX:1 header+
NltkPreprocessor::tatk_fit_transform ==> start
NltkPreprocessor::tatk_fit_transform ==> end 	 Time taken: 0.01 mins
UngroupTransformer::tatk_fit_transform ==> start
UngroupTransformer::tatk_fit_transform ==> end 	 Time taken: 0.0 mins
Word2VecVectorizer::tatk_fit ==> start
collecting all words and their counts
PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
collected 11942 word types from a corpus of 181960 raw words and 7167 sentences
Updating model with new vocabulary
New added 3704 unique words (23% of original 15646) and increased the count of 3704 pre-existing words (23% of original 15646)
deleting the raw counts dictionary of 11942 items
sample=0.0001 downsamples 1046 most-common words
downsampling leaves estimated 149681 word corpus (89.2% of prior 167794)
estimated required memory for 7408 words and 100 dimensions: 9630400 bytes
updating layer weights
training model with 4 workers on 

Word2VecModel(detect_sentences=True, input_col='abstract', regex=None)

## Step 5: Save and load embeddings for lookup

### (5.1) Save the embeddings from the model

In [15]:
# Saved embeddings file is in textual format and is readable if opened with a text editor
embeddings_file_path = os.path.join(models_dir, 'word2vec_embeddings.txt')
word2vec_model2.save_embeddings(embeddings_file_path)

Word2VecVectorizer::save_embeddings ==> start
storing 9620x100 projection weights into C:\Users\tatk\tatk\resources\models\word2vec_embeddings.txt
Time taken: 0.01 mins
Word2VecVectorizer::save_embeddings ==> end
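
Because the saved file is plain text, it can also be inspected without the package. The sketch below assumes the file follows the common word2vec text layout: a header line with the vocabulary size and vector dimension, followed by one line per word with space-separated values. The log line above ("storing 9620x100 projection weights") is consistent with that layout, but verify it by opening your own file. `read_embeddings` is a hypothetical helper.

```python
import io


def read_embeddings(text_file):
    """Parse a word2vec-style text file into a {word: vector} dictionary."""
    vocab_size, dim = (int(x) for x in text_file.readline().split())
    table = {}
    for line in text_file:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError("unexpected vector length for %r" % word)
        table[word] = values
    if len(table) != vocab_size:
        raise ValueError("header announces %d words, found %d" % (vocab_size, len(table)))
    return table


# Inline sample with two words of dimension 3 (hypothetical values)
sample = io.StringIO("2 3\nfever 0.1 0.2 0.3\ndoctor -0.4 0.5 0.6\n")
table = read_embeddings(sample)
print(table["fever"])
```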


### (5.2) Load the embeddings into memory with include_unk set to True to handle out-of-vocabulary (OOV) words

In [16]:
vectorizer = Word2VecVectorizer.load_embeddings(embeddings_file_path, include_unk = True,
                                                unk_method = 'rnd', unk_vector = None, unk_word = '<UNK>')

Word2VecVectorizer::load_embeddings ==> start
loading projection weights from C:\Users\tatk\tatk\resources\models\word2vec_embeddings.txt
loaded (9620, 100) matrix from C:\Users\tatk\tatk\resources\models\word2vec_embeddings.txt
Time taken: 0.02 mins
Word2VecVectorizer::load_embeddings ==> end


### (5.3) Embedding Lookup: Get word indices.

In [17]:
df_predict = pd.DataFrame({'text' : ["I have fever", "My doctor prescribed me ibuprofen."]})
vectorizer.input_col = 'text'
vectorizer.output_col = 'indices'
vectorizer.return_type = 'word_index'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


| | text | indices |
|---|---|---|
| 0 | I have fever | [94, 55, 1078] |
| 1 | My doctor prescribed me ibuprofen. | [9620, 8025, 9620, 4657, 9620] |
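
In the output above, the index 9620 equals the number of rows in the loaded table, which is consistent with an extra unknown-word entry being appended after the vocabulary. The sketch below illustrates that idea with a toy vocabulary. It assumes lowercasing and whitespace tokenization, which is only one reading of the output and not confirmed package behavior; `words_to_indices` is a hypothetical helper.

```python
def words_to_indices(text, vocab, unk_index):
    """Map lowercased, whitespace-separated tokens to indices; unknown tokens get unk_index."""
    return [vocab.get(token, unk_index) for token in text.lower().split()]


# Toy vocabulary (hypothetical); the unknown-word index sits right after the last word
vocab = {"i": 0, "have": 1, "fever": 2, "doctor": 3, "me": 4}
unk_index = len(vocab)

texts = ["I have fever", "My doctor prescribed me ibuprofen."]
indices = [words_to_indices(t, vocab, unk_index) for t in texts]
print(indices)
```

Under these assumptions a token such as `ibuprofen.` (with the trailing period) is treated as unknown even if `ibuprofen` is in the vocabulary, so cleaning punctuation before lookup may matter.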


### (5.4) Embedding Lookup: Get word embeddings.

In [18]:
vectorizer.output_col = 'word_vector'
vectorizer.return_type = 'word_vector'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


| | text | indices | word_vector |
|---|---|---|---|
| 0 | I have fever | [94, 55, 1078] | [[0.175807997584, -0.01292799972, 0.2981559932... |
| 1 | My doctor prescribed me ibuprofen. | [9620, 8025, 9620, 4657, 9620] | [[-0.118047584745, -0.0765161739483, -0.008675... |


### (5.5) Embedding Lookup: Get sentence embedding.

In [19]:
vectorizer.output_col = 'sentence_vector'
vectorizer.return_type = 'sentence_vector'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


| | text | indices | word_vector | sentence_vector |
|---|---|---|---|---|
| 0 | I have fever | [94, 55, 1078] | [[0.175807997584, -0.01292799972, 0.2981559932... | [-0.198054343462, -0.0625083340953, 0.09954333... |
| 1 | My doctor prescribed me ibuprofen. | [9620, 8025, 9620, 4657, 9620] | [[-0.118047584745, -0.0765161739483, -0.008675... | [-0.0933727497029, -0.113094106418, -0.0371973... |
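
The default `aggregation_func` shown in section 3.2 is `aggregate_mean`, which suggests that a sentence vector is the element-wise mean of its word vectors. A minimal sketch of that aggregation, using a hypothetical helper `mean_vector` and toy two-dimensional vectors, could look like this:

```python
def mean_vector(word_vectors):
    """Average a list of equal-length word vectors element by element."""
    if not word_vectors:
        raise ValueError("cannot average an empty list of vectors")
    dim = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vec[k] for vec in word_vectors) / count for k in range(dim)]


# Three toy word vectors (hypothetical values) for a three-word sentence
word_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
sentence_vector = mean_vector(word_vectors)
print(sentence_vector)
```

The result has the same dimension as a single word vector, so sentences of any length map to fixed-size features.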


### (5.6) Embedding Lookup: Get the words most similar to a given word.

In [20]:
vectorizer.embedding_table.most_similar('fever')

precomputing L2-norms of word weight vectors


[('hay', 0.7872635126113892),
 ('haemorrhagic', 0.7152248620986938),
 ('era', 0.7067720890045166),
 ('chikungunya', 0.6999553442001343),
 ('southern', 0.6974883675575256),
 ('dairy', 0.695317804813385),
 ('encephalitis', 0.6922379732131958),
 ('vomiting', 0.6884969472885132),
 ('yellow', 0.6828291416168213),
 ('diarrhea', 0.681731104850769)]

In [21]:
vectorizer.embedding_table.most_similar('doctor')

[('prognosis', 0.9426259994506836),
 ('facilities', 0.941356360912323),
 ('radiotherapy', 0.9408228993415833),
 ('grounds', 0.9350035190582275),
 ('papers', 0.9279524087905884),
 ('communicable', 0.9261314868927002),
 ('emphasized', 0.9256585836410522),
 ('tasks', 0.9232277870178223),
 ('achievements', 0.9187901616096497),
 ('personnel', 0.9185605645179749)]

In [22]:
vectorizer.embedding_table.most_similar('ibuprofen')

[('chlormethiazole', 0.940321683883667),
 ('dipropionate', 0.9348151087760925),
 ('beclomethasone', 0.9326766133308411),
 ('benorylate', 0.9220473766326904),
 ('f2alpha', 0.9094793200492859),
 ('phosphide', 0.9047806859016418),
 ('bx24', 0.8957622051239014),
 ('frusemide', 0.8943196535110474),
 ('thrice', 0.8939546942710876),
 ('ra27', 0.8856890201568604)]

In [23]:
vectorizer.embedding_table.most_similar('have')

[('has', 0.6336463689804077),
 ('been', 0.6312592029571533),
 ('examining', 0.579426646232605),
 ('vertebrates', 0.5628061294555664),
 ('observing', 0.5623247623443604),
 ('microscopically', 0.5592154860496521),
 ('localize', 0.5555355548858643),
 ('call', 0.5527269840240479),
 ('characterize', 0.5492762923240662),
 ('technic', 0.5434949994087219)]
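
The scores above appear to be cosine similarities between word vectors (the log line about precomputing L2-norms points in that direction). The sketch below shows how such a ranking could be computed over a small dictionary of vectors. It is an illustration with toy values and hypothetical helper names, not the package's or gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, table, topn=3):
    """Rank every other word in `table` by cosine similarity to `word`."""
    query = table[word]
    scores = [(other, cosine(query, vec)) for other, vec in table.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy two-dimensional embedding table (hypothetical values)
table = {
    "fever": [1.0, 0.0],
    "vomiting": [0.8, 0.6],
    "dairy": [0.0, 1.0],
    "era": [-1.0, 0.0],
}
neighbors = most_similar("fever", table, topn=2)
print(neighbors)
```

With only 5,000 to 6,000 abstracts used for training in this tutorial, some neighbors in the real output are loosely related; training on more batches would be expected to improve them.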

## Next steps

Learn more about Azure Machine Learning Package for Text Analytics in these articles:

+ Read the [package overview and learn how to install it](https://aka.ms/aml-packages/text).

+ Explore the [reference documentation](https://aka.ms/aml-packages/text) for this package.

+ Learn about [other Python packages for Azure Machine Learning](reference-python-package-overview.md).

© 2018 Microsoft. All rights reserved.