# CRDCDH Semantic Mapping with BlazingText Word2Vec 



Word2Vec is a popular algorithm used for generating dense vector representations of words in large corpora using unsupervised learning. The resulting vectors have been shown to capture semantic relationships between the corresponding words and are used extensively for many downstream natural language processing (NLP) tasks like sentiment analysis, named entity recognition and machine translation.  
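To make "capturing semantic relationships" concrete, the sketch below compares word vectors with cosine similarity, a common way to measure how close two embeddings are. The three-dimensional vectors and the words chosen are invented for illustration; they are not output from the model trained in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only (real vectors have vector_dim entries).
toy_vectors = {
    "tumor": [0.9, 0.1, 0.0],
    "neoplasm": [0.8, 0.2, 0.1],
    "protocol": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["tumor"], toy_vectors["neoplasm"])
sim_unrelated = cosine_similarity(toy_vectors["tumor"], toy_vectors["protocol"])
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Words used in similar contexts should end up with a higher cosine similarity than unrelated words, which is the property semantic mapping relies on.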

## Setup

Let's start by setting up the SageMaker environment:
- Before running the notebook for the first time, make sure the SageMaker environment configuration file, sagemaker_config.py, exists in the src/common directory. If it does not, copy sagemaker_config_sample.py in src/common and rename the copy to sagemaker_config.py.
- Open sagemaker_config.py in src/common, review all environment settings, and update them if necessary.
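The copy step above can also be scripted. The following is a minimal sketch, not part of the project code: the helper name `ensure_sagemaker_config` is hypothetical, and it only assumes the two file names mentioned in the list above.

```python
import shutil
from pathlib import Path


def ensure_sagemaker_config(common_dir):
    """Create sagemaker_config.py from the sample file if it is missing.

    Returns True when a new copy was created, False when the file already existed.
    """
    common_dir = Path(common_dir)
    target = common_dir / "sagemaker_config.py"
    sample = common_dir / "sagemaker_config_sample.py"
    if target.exists():
        return False  # never overwrite settings that were already reviewed
    if not sample.exists():
        raise FileNotFoundError(f"Sample config not found: {sample}")
    shutil.copyfile(sample, target)
    return True
```

Calling `ensure_sagemaker_config("src/common")` from the repository root would create the file only when it is missing; the settings still need to be reviewed by hand afterwards.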

In [1]:
from semantic_analysis_class import SemanticAnalysis
import common.sagemaker_config as config

# instantiate a class object from SemanticAnalysis and set SageMaker environment
semantic_analysis = SemanticAnalysis()
s3_bucket = semantic_analysis.get_s3_bucket()


sagemaker.config INFO - Not applying SDK defaults from location: /opt/homebrew/share/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/gup2/Library/Application Support/sagemaker/config.yaml
src dir:/Users/gup2/workspace/vscode/crdcdh/crdcdh-ml-notebooks/src
root dir:/Users/gup2/workspace/vscode/crdcdh/crdcdh-ml-notebooks
sagemaker.config INFO - Fetched defaults config from location: /Users/gup2/workspace/vscode/crdcdh/crdcdh-ml-notebooks/configs/sagemaker/config.yaml


## Training and/or Test Model Setup
First of all, set the container by calling:
    semantic_analysis.set_container(container_name, container_version)


In [2]:
CONTAINER_IMAGE_NAME = "blazingtext"
CONTAINER_IMAGE_VERSION = "latest"
semantic_analysis.set_container(CONTAINER_IMAGE_NAME, CONTAINER_IMAGE_VERSION)

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: latest.


Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:1 (us-east-1)


Second, choose whether to train and/or test the model by setting two switches:

Setting TEST_MODEL_ONLY to False trains the model and then tests it. Setting it to True only tests an already trained model and skips training, which is a very costly step.
    1) TEST_MODEL_ONLY = False  
   
Setting TRANSFORM_DATA to False uses existing training data that is already in text8 format. Setting it to True calls the transformData function to convert the JSON files into a single text8 file.
    2) TRANSFORM_DATA = False 
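
As a compact summary of how the two switches combine, the sketch below lists the notebook steps that run for a given setting. The function `plan_workflow` and the step labels are hypothetical and only mirror the conditions used in the cells of this notebook.

```python
def plan_workflow(test_model_only, transform_data):
    """Return the ordered notebook steps implied by the two switches."""
    steps = []
    if not test_model_only:
        if transform_data:
            steps.append("transform JSON files to text8")
        steps.extend(["prepare training data", "train", "evaluate"])
    # Deployment and endpoint testing run in both modes.
    steps.extend(["deploy endpoint", "test endpoint"])
    return steps


print(plan_workflow(test_model_only=True, transform_data=False))
print(plan_workflow(test_model_only=False, transform_data=True))
```

Note that TRANSFORM_DATA has no effect when TEST_MODEL_ONLY is True, because the whole data preparation branch is skipped.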


In [3]:
# Switches that determine the workflow
TEST_MODEL_ONLY = True
TRANSFORM_DATA = False

trained_model_s3_path = ""

if TEST_MODEL_ONLY:
    # The S3 path of the trained model must be set here
    trained_model_s3_path = "train_output/blazingtext-2024-06-12-15-16-09-327/output/model.tar.gz"
    # Check that the model exists
    if not trained_model_s3_path or not s3_bucket.file_exists_on_s3(trained_model_s3_path):
        # If it does not, report the problem and release the SageMaker resources
        print("Trained model does not exist!")
        semantic_analysis.close()
    else:
        print("Trained model is existed at " + trained_model_s3_path)

Trained model is existed at train_output/blazingtext-2024-06-12-15-16-09-327/output/model.tar.gz


Training the model requires training data. If the datasets are only available as JSON files, call the transform function to create a single training file in text8 format.
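
The idea behind such a transform can be sketched as follows. This is not the implementation of `transformData`, which is not shown in this notebook; the helper names, the choice to use only string values (not keys), and the normalization rule are all assumptions made for illustration. The output is one line of lowercase, space-separated tokens, in the spirit of the text8 format.

```python
import json
import re
from pathlib import Path


def iter_strings(node):
    """Yield every string value found in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def normalize(text):
    """Lowercase the text and keep only letters and digits, separated by single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def json_folder_to_text8(json_dir, out_path):
    """Write all string values from *.json files in json_dir to one text8-style file.

    Returns the number of tokens written.
    """
    pieces = []
    for json_file in sorted(Path(json_dir).glob("*.json")):
        with open(json_file, encoding="utf-8") as f:
            data = json.load(f)
        for value in iter_strings(data):
            cleaned = normalize(value)
            if cleaned:
                pieces.append(cleaned)
    text = " ".join(pieces)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return len(text.split())
```

Whether digits, JSON keys, or multi-word terms should be preserved depends on the vocabulary the semantic mapping needs, so treat these rules as a starting point rather than the project's actual preprocessing.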

In [4]:
import os
from common.utils import get_data_time

training_data_s3_path = ""
if not TEST_MODEL_ONLY:
    if TRANSFORM_DATA:
        # To transform data, set the raw data folder that contains the JSON file(s), either in the S3 bucket or in a local folder
        s3_raw_data_prefix = ""  # prefix of the raw data folder in the S3 bucket, if there is one
        local_raw_data_folder = "../data/raw/json/"
        if not s3_raw_data_prefix or not s3_bucket.file_exists_on_s3(s3_raw_data_prefix):
            # Not in S3: check whether the local folder contains JSON files
            if not os.path.exists(local_raw_data_folder) or not any(fname.endswith('.json') for fname in os.listdir(local_raw_data_folder)):
                print("No raw data found on s3 bucket or local folder")
                semantic_analysis.close()

        else:
            # The raw data is in the S3 bucket, so download it to the local folder
            s3_bucket.download_files_in_folder(s3_raw_data_prefix, local_raw_data_folder)

        local_text8_file_path = "../data/train/text8/train_data.txt"
        training_data_s3_path = semantic_analysis.transformData(s3_raw_data_prefix, local_text8_file_path)
    else:
        # Without a transform step, set the training data file, either in the S3 bucket or in a local folder
        local_text8_file_path = "../data/train/text8/updated_all_training_set_gdcfixed.txt"
        training_data_s3_path = "data/train/blazingtext-2024-06-12-11-15-48-f/train_data"

        if not training_data_s3_path or not s3_bucket.file_exists_on_s3(training_data_s3_path):
            # Not in S3: check whether the training data file is in the local folder
            if not os.path.exists(local_text8_file_path):
                print("No training data found on s3 bucket or local folder")
                semantic_analysis.close()
            else:
                # The file is in the local folder, so upload it to S3
                # (assumes get_data_time is a zero-argument helper that returns a timestamp string)
                training_data_s3_path = os.path.join(config.TRAIN_DATA_PREFIX, f"{CONTAINER_IMAGE_NAME}-{get_data_time()}/train_data")
                try:
                    s3_bucket.upload_file(training_data_s3_path, local_text8_file_path)
                except Exception as e:
                    print(f"Failed to upload training data: {e}")
                    semantic_analysis.close()

    print("Training data path: " + training_data_s3_path)
    # Register the training data with the SemanticAnalysis object
    semantic_analysis.prepare_train_data(training_data_s3_path)

## Training the BlazingText model for generating word vectors

Now let's train the model.

In [5]:
if not TEST_MODEL_ONLY:
    semantic_analysis.train()

### Evaluation

Let us now download the word vectors learned by our model and visualize them using a [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) plot.

In [6]:
if not TEST_MODEL_ONLY:
    downloaded_model_path = "../output/model.tar.gz"
    semantic_analysis.download_trained_model(downloaded_model_path) # download trained model to local
    semantic_analysis.evaluate_learned_model_vacs(downloaded_model_path, "../output/model") # evaluate learned vectors in trained model

As expected, we get an n-dimensional vector (where n is vector_dim as specified in the hyperparameters) for each word. If a word does not appear in the training dataset, the model returns a vector of zeros.

Running the evaluation cell above might generate a plot like the one below. t-SNE and Word2Vec are stochastic, so the plot won't look exactly like this when you run the code, but you should still see clusters of similar words. In the plot below, 'british', 'american', 'french', and 'english' are near the bottom-left, and 'military', 'army', and 'forces' are grouped together near the bottom.

![tsne plot of embeddings](../images/tsne.png)
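
The lookup behavior described earlier in this section (one vector of length vector_dim per word, zeros for unseen words) can be reproduced locally on the learned vectors. The sketch below assumes the vectors are available in the plain-text word2vec layout: a header line with the vocabulary size and dimension, then one word and its values per line. The file name and location inside model.tar.gz are not shown in this notebook, so the example parses an in-memory string with invented values; `parse_word_vectors` and `lookup` are hypothetical helpers.

```python
def parse_word_vectors(text):
    """Parse word2vec text format: 'vocab_size dim' header, then 'word v1 v2 ...' lines."""
    lines = text.strip().splitlines()
    vocab_size, dim = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.split()
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"Expected {dim} values for '{word}', got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"Header declares {vocab_size} words, found {len(vectors)}")
    return dim, vectors


def lookup(vectors, dim, word):
    """Return the vector for word, or a zero vector when the word was never seen."""
    return vectors.get(word, [0.0] * dim)


# Invented sample content, for illustration only.
sample = "2 3\nprotocol 0.1 0.2 0.3\nidentifier -0.4 0.5 0.6\n"
dim, vectors = parse_word_vectors(sample)
print(lookup(vectors, dim, "protocol"))
print(lookup(vectors, dim, "unseen_word"))
```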

## Hosting / Inference
Once training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This allows us to make predictions (or inference) from the model. Note that we don't have to host on the same type of instance that we used to train. Because endpoint instances stay up and running for a long time, it's advisable to choose a cheaper instance type for inference.

In [7]:
endpoint_name = config.ENDPOINT_NAME  # default endpoint name (crdcdh-ml-dev-endpoint in dev); it can be replaced with any valid, unique name
semantic_analysis.deploy_trained_model(endpoint_name, trained_model_s3_path)

<sagemaker.model.Model object at 0x31bcd3cd0>
------!

### Getting vector representations for words

#### Use JSON format for inference
The payload should contain a list of words under the key "**instances**". BlazingText supports the content type `application/json`.
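
A request body in that shape can be built with the standard `json` module. This is a minimal sketch; `build_payload` is a hypothetical helper, and how `test_trained_model` constructs its request internally is not shown in this notebook.

```python
import json


def build_payload(words):
    """Serialize a list of words into the {"instances": [...]} request body."""
    if not isinstance(words, (list, tuple)) or not all(isinstance(w, str) for w in words):
        raise TypeError("words must be a list of strings")
    return json.dumps({"instances": list(words)})


payload = build_payload(["protocol identifier"])
print(payload)
```

With boto3, a string like this would typically be sent as the `Body` of the SageMaker runtime `invoke_endpoint` call with `ContentType="application/json"`.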

In [8]:
words = ["protocol identifier"]
semantic_analysis.test_trained_model(words, endpoint_name)

b'[{"vector": [-0.08171992003917694, 0.4133380055427551, 0.15874870121479034, 0.9120864272117615, 0.35315394401550293, 0.5989611148834229, -0.13320013880729675, 0.32997578382492065, -0.931456983089447, -0.3034103810787201, -0.6623309254646301, 0.7876906394958496, -0.1892700493335724, 0.354099839925766, -0.4875767230987549, 0.5718265175819397, -0.015936193987727165, -0.2858751714229584, 0.1493469625711441, 0.5570916533470154, 0.05738626420497894, 0.2775570750236511, -0.2805495858192444, 0.8012495636940002, -0.49053627252578735, 0.15655098855495453, 0.6446893811225891, -0.23246367275714874, 0.3104402422904968, 0.336190402507782, -0.2623969614505768, 0.5630905628204346, -0.6810024380683899, -0.3621712327003479, 0.003490819362923503, -0.08670128136873245, 0.08936437964439392, -0.20517677068710327, -0.36212030053138733, 0.5797169804573059, -0.28241923451423645, -0.3633529543876648, 0.4453902542591095, 0.4731803238391876, -0.44600656628608704, -0.5014044046401978, -0.7921153903007507, -0.579
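
The response above is a JSON array in which each element carries a "vector" list. The sketch below decodes such a response and flags all-zero vectors, which would indicate a word the model has not seen. The sample bytes are invented and shortened, and the "word" field in the sample is an assumption that does not appear in the visible part of the output above.

```python
import json


def extract_vectors(response_body):
    """Decode the endpoint response and return the list of vectors it contains."""
    records = json.loads(response_body)
    return [record["vector"] for record in records]


def is_zero_vector(vector):
    """Return True when every component is exactly zero."""
    return all(value == 0.0 for value in vector)


# Invented, shortened sample response, for illustration only.
sample_response = b'[{"vector": [0.5, -0.25, 0.0], "word": "protocol"}, {"vector": [0.0, 0.0, 0.0], "word": "zzz"}]'
response_vectors = extract_vectors(sample_response)
for vec in response_vectors:
    print(len(vec), "dimensions, all zeros:", is_zero_vector(vec))
```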

### Stop / Close the Endpoint (Optional)
Finally, when the endpoint is used only for training and testing, delete it before closing the notebook. If you want to keep hosting the model on the endpoint for a while, remove the endpoint_name parameter from the close() call, as shown below:

semantic_analysis.close()

In [9]:
# Delete the SageMaker endpoint and resources
semantic_analysis.close(endpoint_name)
semantic_analysis = None
s3_bucket = None