# CRDCDH Semantic Mapping with BlazingText



The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. Word2vec is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification. BlazingText also handles out-of-vocabulary (OOV) words: it can produce vectors for words that are not in the training dataset, with much higher performance than FastText.

## Setup

Let's start by setting up the SageMaker environment:
- Before running the notebook for the first time, make sure the SageMaker environment configuration file, sagemaker_config.py, exists in the src/common directory. If it does not, create it by copying sagemaker_config_sample.py in src/common and renaming the copy to sagemaker_config.py.
- Open sagemaker_config.py in src/common, review all environment settings, and update them if necessary.
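
The copy step above can also be scripted. The following is a minimal sketch, not part of the project code: the helper name `ensure_sagemaker_config` is hypothetical, and the default directory assumes the notebook's working directory is the repository root.

```python
import shutil
from pathlib import Path


def ensure_sagemaker_config(common_dir="src/common"):
    """Create sagemaker_config.py from the sample file if it is missing.

    Hypothetical helper. Returns True if a new file was created and
    False if sagemaker_config.py already existed.
    """
    common = Path(common_dir)
    target = common / "sagemaker_config.py"
    sample = common / "sagemaker_config_sample.py"
    if target.exists():
        # Never overwrite settings that someone has already reviewed.
        return False
    if not sample.exists():
        raise FileNotFoundError(f"Sample config not found: {sample}")
    shutil.copyfile(sample, target)
    return True
```

The helper only creates the file. The settings inside it still need to be reviewed by hand.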

In [None]:
from semantic_analysis_class import SemanticAnalysis
import common.sagemaker_config as config

# instantiate a class object from SemanticAnalysis and set SageMaker environment
semantic_analysis = SemanticAnalysis()


## Training and/or Test Model Setup
First of all, set the container by calling:
    semantic_analysis.set_container(container_name, container_version)


In [None]:
CONTAINER_IMAGE_NAME = "blazingtext"
CONTAINER_IMAGE_VERSION = "latest"
semantic_analysis.set_container(CONTAINER_IMAGE_NAME, CONTAINER_IMAGE_VERSION)

Second, choose the train and/or test workflow by setting two switches:

Set TEST_MODEL_ONLY to False to train the model and then test it. Set it to True to test an already trained model only, which skips training, a very costly step.
    1) TEST_MODEL_ONLY = False

Set TRANSFORM_DATA to False to use existing training data in text8 format. Set it to True to call the transformData function, which converts the JSON files into a single text8 file.
    2) TRANSFORM_DATA = False
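
As a rough sketch of how the two switches combine, the function below lists the stages this notebook would run for a given setting. The function name and the stage labels are illustrative only and do not exist in SemanticAnalysis.

```python
def planned_steps(test_model_only, transform_data):
    """Return the notebook stages implied by the two workflow switches.

    Illustrative helper: the labels are descriptive names for the
    notebook cells, not calls into the SemanticAnalysis class.
    """
    steps = []
    if not test_model_only:
        if transform_data:
            # Raw JSON files are converted only when requested.
            steps.append("transform JSON to text8")
        steps.append("prepare training data")
        steps.append("train model")
        steps.append("download and evaluate vectors")
    # Hosting, testing, and cleanup run in both workflows.
    steps.append("deploy endpoint")
    steps.append("evaluate and test model")
    steps.append("delete endpoint")
    return steps
```

For example, `planned_steps(True, False)` contains only the hosting, testing, and cleanup stages.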


In [None]:
# Set the switches that determine the workflow
TEST_MODEL_ONLY = True
TRANSFORM_DATA = False

trained_model_s3_path = ""

if TEST_MODEL_ONLY:
    # The S3 path of the trained model must be set here
    trained_model_s3_path = "train_output/blazingtext-2024-06-21-14-56-31-917/output/model.tar.gz"

To train the model, we need to prepare the training data. If the datasets are only available as JSON files, call the transform function to create a single training file in text8 format.
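
To give a feel for what such a conversion involves, here is a minimal sketch that flattens JSON into a single line of lowercase, space-separated tokens in the style of text8. It is not the implementation of transformData: the JSON layout, the choice to keep only string values (not keys), and both function names are assumptions made for illustration.

```python
import json
import re


def iter_strings(node):
    """Yield every string value found anywhere inside a parsed JSON value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def json_to_text8(json_text):
    """Flatten JSON text into one line of lowercase, space-separated tokens."""
    tokens = []
    for text in iter_strings(json.loads(json_text)):
        # Keep runs of letters and digits; everything else acts as a separator.
        tokens.extend(re.findall(r"[a-z0-9]+", text.lower()))
    return " ".join(tokens)
```

A real pipeline would probably also decide which JSON fields to include and how to treat punctuation inside terms, which this sketch ignores.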

In [None]:
training_data_s3_path = ""
training_data_local_path = ""  # initialized so it is defined in both branches below
if not TEST_MODEL_ONLY:
    if TRANSFORM_DATA:
        # To transform data, set the raw data folder that contains the JSON file(s), either in the S3 bucket or in a local folder.
        # Contact the admin if you don't have it.
        s3_raw_data_prefix = "data/raw/json/blazingtext-2024-06-13-16-15/"
        training_data_s3_path = semantic_analysis.transformData(s3_raw_data_prefix)
    else:
        # Otherwise, set the path of the training data file in the S3 bucket. Contact the admin if you don't have it.
        training_data_s3_path = "data/train/blazingtext-2024-06-21-10-51-45-535/train_data_text8"
        # Or set the local path to the training dataset.
        training_data_local_path = "../data/train/text8/evaluation_data_set.txt"
    path = training_data_s3_path if training_data_s3_path else training_data_local_path
    print(f"Training data path: {path}")
    # Prepare the training data
    semantic_analysis.prepare_train_data(training_data_s3_path, training_data_local_path)

## Training the BlazingText model for generating word vectors

Now let's train the model.

In [None]:
if not TEST_MODEL_ONLY:
    # Set the training algorithm, then train the model
    algorithm = "FastText"  # values: "FastText", "Word2Vec", "TextClassification"
    semantic_analysis.train(algorithm)

### Evaluation

Let us now download the word vectors learned by our model and visualize them using a [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) plot.

In [None]:
if not TEST_MODEL_ONLY:
    downloaded_model_path = "../output/model.tar.gz"
    # Download the trained model to a local path
    semantic_analysis.download_trained_model(downloaded_model_path)
    # Evaluate the learned vectors in the trained model
    semantic_analysis.evaluate_learned_model_vacs(downloaded_model_path, "../output/model")

As expected, we get an n-dimensional vector (where n is vector_dim as specified in the hyperparameters) for each word. If a word is not in the training dataset and the model was trained without subword information, the model returns a vector of zeros.

Running the code above might generate a plot like the one below. Because t-SNE and Word2Vec are stochastic, your plot won't look exactly like this one, but you should still see clusters of similar words. In the example below, 'british', 'american', 'french', and 'english' are near the bottom-left, and 'military', 'army', and 'forces' are grouped together near the bottom.

![tsne plot of embeddings](../images/tsne.png)

## Hosting / Inference
Once training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This allows us to make predictions (or inferences) from the model. Note that we don't have to host on the same type of instance that we used for training. Because endpoint instances stay up and running for a long time, it's advisable to choose a cheaper instance for inference.

In [None]:
endpoint_name = config.ENDPOINT_NAME  # default endpoint name (crdcdh-ml-dev-endpoint in dev); it can be replaced with any valid, unique name
semantic_analysis.deploy_trained_model(endpoint_name, trained_model_s3_path)

### Check the accuracy of predictions with test dataset

#### Use YAML format for inference
Extract permissive and non-permissive value pairs, then calculate the mean similarity by comparing the word embeddings (vectors) of each value pair.
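
The mean-similarity calculation can be sketched as follows. This assumes cosine similarity as the comparison measure, which the notebook does not state explicitly, and both function names are hypothetical rather than methods of SemanticAnalysis.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pair_similarity(pairs, vectors):
    """Average the cosine similarity over (value_a, value_b) pairs.

    `vectors` maps each value to its embedding, for example as
    returned by the hosted endpoint.
    """
    scores = [cosine_similarity(vectors[a], vectors[b]) for a, b in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

A mean close to 1 indicates that paired values sit close together in the embedding space.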

In [None]:
yaml_test_data_path = "../data/test/cds_clean_dict_v1.3.yaml" # set the local yaml file path 
semantic_analysis.evaluate_trained_model(endpoint_name, yaml_test_data_path)

### Get vector representations for two words and their similarity score

In [None]:
words = ["genomics", "genomic"]   
semantic_analysis.test_trained_model(words, endpoint_name)
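
For reference, a BlazingText Word2Vec endpoint is typically queried with a JSON body of the form {"instances": [...]} and answers with a list of objects that each hold a "word" and its "vector". The sketch below builds such a request and parses such a response without calling any endpoint. The helper names are hypothetical, and whether test_trained_model works this way internally is an assumption.

```python
import json


def build_payload(words):
    """Serialize words into the JSON request body for the endpoint."""
    return json.dumps({"instances": list(words)})


def parse_vectors(response_body):
    """Map each word in the JSON response to its vector.

    Assumes the response is a JSON list of objects with
    "word" and "vector" keys.
    """
    return {item["word"]: item["vector"] for item in json.loads(response_body)}
```

The resulting dictionary can then be used to compare the two vectors with a similarity measure of your choice.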

### Stop / Close the Endpoint (Optional)
Finally, if the endpoint was created only for training and testing, delete it before closing the notebook. To keep the model hosted on the endpoint for a while, remove the endpoint_name parameter from the close() call, as shown below:

semantic_analysis.close()

In [None]:
# Delete the SageMaker endpoint and resources
semantic_analysis.close(0, endpoint_name)
semantic_analysis = None
s3_bucket = None