<h1 style="text-align: center; font-size: 50px;"> 🌍 Word Embeddings Generation</h1>

This Jupyter notebook demonstrates how to generate embeddings for each document in a corpus using a pre-trained BERT model. Each document is represented by the hidden state of its `[CLS]` token, and these embeddings will be used to find semantically similar matches for a user query.

# Notebook Overview
- Start Execution
- Install and Import Libraries
- Configure Settings
- Verify Assets
- Load and Preprocess Data
- Initialize BERT Tokenizer and Model
- Generate Embeddings in Batches
- Save Embeddings to File
- Downloading the BERT Large Uncased Model

# Start Execution

In [1]:
import logging  # For application-level logging
import time     # For runtime measurement (wall clock)

# Configure logger
logger: logging.Logger = logging.getLogger("run_workflow_logger")
logger.setLevel(logging.INFO)
logger.propagate = False  # Prevent duplicate logs from parent loggers

# Set formatter
formatter: logging.Formatter = logging.Formatter(
    fmt="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)

# Configure and attach stream handler
stream_handler: logging.StreamHandler = logging.StreamHandler()
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)
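
Because `logging.getLogger` returns the same logger object for a given name, re-running the cell above in the same kernel attaches a second stream handler, and every message is then printed twice. A minimal sketch of one way to guard against that follows; the helper name `get_notebook_logger` is hypothetical and not part of the original notebook.

```python
import logging


def get_notebook_logger(name: str = "run_workflow_logger") -> logging.Logger:
    """Return a configured logger, attaching the stream handler only once."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # Prevent duplicate logs from parent loggers

    # getLogger returns the same object on every call, so only attach a
    # handler when none is present yet.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(
                fmt="%(asctime)s - %(levelname)s - %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
        )
        logger.addHandler(handler)
    return logger


logger = get_notebook_logger()
logger = get_notebook_logger()  # Simulates re-running the cell
```

With this guard, the second call reuses the existing handler instead of adding another one.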

In [2]:
start_time = time.time()  

logger.info("Notebook execution started.")

2025-08-07 14:26:05 - INFO - Notebook execution started.


# Install and Import Libraries

In [3]:
%%time

# Install required Python packages listed in requirements.txt silently
%pip install -r ../requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.
CPU times: user 1 s, sys: 429 ms, total: 1.43 s
Wall time: 45.1 s


In [4]:
import sys
import os  
from datetime import datetime
import warnings
from pathlib import Path

# Data manipulation libraries
import pandas as pd
import numpy as np
from tabulate import tabulate
from sklearn.metrics.pairwise import cosine_similarity

# Deep learning framework
import torch  

# NLP libraries
import nltk  # Natural Language Toolkit
from nemo.collections.nlp.models import BERTLMModel  # BERT Language Model from NVIDIA NeMo
from transformers import AutoTokenizer  # Tokenizer for transformer-based models
from transformers import logging as hf_logging
# Experiment tracking and model registry
import mlflow
from mlflow import MlflowClient
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import Schema, ColSpec, TensorSpec, ParamSchema, ParamSpec


# Configure Settings

In [5]:
# ------------------------ Suppress Python Warnings ------------------------
warnings.filterwarnings("ignore")

In [6]:
CORPUS_PATH = "../data/raw/corpus.csv"
TOKENIZER_DIR = "../artifacts/tokenizer"
BERT_MODEL_NAME = "bert-large-uncased"
BERT_MODEL_DATAFABRIC_PATH = "/home/jovyan/datafabric/Bertlargeuncased/bertlargeuncased.nemo"
EMBEDDINGS_OUTPUT_PATH = "../data/processed/"
BERT_MODEL_ONLINE_PATH = "/root/.cache/torch/NeMo/NeMo_1.22.0/bertlargeuncased/ca4ebba9f05a8ffb79845249ca046983/bertlargeuncased.nemo"
DEMO_PATH = "../demo"
EMBEDDINGS_PATH = "../data/processed/embeddings.csv"
MODEL_NAME = "BERT_Tourism_Model"

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


# Verify Assets

In [8]:
def log_asset_status(asset_path: str, asset_name: str, success_message: str, failure_message: str) -> None:
    """
    Logs the status of a given asset based on its existence.

    Parameters:
        asset_path (str): File or directory path to check.
        asset_name (str): Name of the asset for logging context.
        success_message (str): Message to log if asset exists.
        failure_message (str): Message to log if asset does not exist.
    """
    if Path(asset_path).exists():
        logger.info(f"{asset_name} is properly configured. {success_message}")
    else:
        logger.warning(f"{asset_name} is not properly configured. {failure_message}")

log_asset_status(
    asset_path=BERT_MODEL_DATAFABRIC_PATH,
    asset_name="BERT model",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio."
)

log_asset_status(
    asset_path=CORPUS_PATH,
    asset_name="Corpus data",
    success_message="",
    failure_message="Please check if Corpus was properly downloaded in your project on AI Studio."
)

2025-08-07 14:27:14 - INFO - BERT model is properly configured. 
2025-08-07 14:27:14 - INFO - Corpus data is properly configured. 
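
The check above only logs the asset status, so a missing file surfaces later as a less obvious error when the corpus or model is loaded. As a hedged alternative, the sketch below stops early by raising an exception that lists every missing asset. The helpers `find_missing_assets` and `require_assets` are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Dict, List


def find_missing_assets(assets: Dict[str, str]) -> List[str]:
    """Return the names of assets whose paths do not exist."""
    return [name for name, path in assets.items() if not Path(path).exists()]


def require_assets(assets: Dict[str, str]) -> None:
    """Raise FileNotFoundError naming every missing asset, if any."""
    missing = find_missing_assets(assets)
    if missing:
        raise FileNotFoundError(f"Missing required assets: {', '.join(missing)}")


# Illustrative call with a placeholder path (not one of the notebook's assets)
print(find_missing_assets({"Example asset": "path/that/does/not/exist"}))
```

In the notebook, the dictionary would map names such as "BERT model" and "Corpus data" to `BERT_MODEL_DATAFABRIC_PATH` and `CORPUS_PATH`.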


# Load and Preprocess Data

In [9]:
%%time

# Download the Punkt tokenizer data for sentence tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


CPU times: user 281 ms, sys: 90 ms, total: 371 ms
Wall time: 890 ms


True

In [10]:
# Load the dataset into a Pandas DataFrame
corpus_df = pd.read_csv(CORPUS_PATH)

# Display the first few rows of the DataFrame
logger.info("First few entries of the DataFrame:")
print(corpus_df.head())

2025-08-07 14:27:15 - INFO - First few entries of the DataFrame:


   Unnamed: 0  Topic                                             Pledge
0           0      1  Actually we as an association are still pretty...
1           1      1  EFFAT welcomes the Commission Proposal for a R...
2           2      1  HOTREC calls for a level playing field and fai...
3           3      1  Estonia sees the need to synchronize and harmo...
4           4      1  Sphere Travel Club contributes to a flourishin...


In [11]:
documents = corpus_df["Pledge"].astype(str).tolist()  # Convert the column to a list

# Initialize BERT Tokenizer and Model

In [12]:
%%time

# Initialize the tokenizer with a pre-trained BERT model
tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL_NAME)
tokenizer.save_pretrained(TOKENIZER_DIR)

# Set device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

logger.info("Loading BERT model...")

# Ensure you have added the 'bertlargeuncased' model from the NVIDIA NGC model catalog.
# If unavailable, use the alternative method below to download the model online.

# Uncomment the following line to download the BERT model online:
# bert_model = BERTLMModel.from_pretrained(model_name="bertlargeuncased", strict=False).to(device)

# Load the BERT model from a local .nemo file inside the datafabric folder
bert_model = BERTLMModel.restore_from(BERT_MODEL_DATAFABRIC_PATH, strict=False).to(device)

logger.info("BERT model loaded successfully.")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

2025-08-07 14:27:16 - INFO - Loading BERT model...
[NeMo W 2025-08-07 14:29:05 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    data_file: /home/yzhang/data/nlp/bert/47316/hdf5/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/
    max_predictions_per_seq: 80
    batch_size: 16
    shuffle: true
    num_samples: -1
    num_workers: 2
    drop_last: false
    pin_memory: false
    


model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

[NeMo W 2025-08-07 14:29:28 modelPT:617] Trainer wasn't specified in model constructor. Make sure that you really wanted it.


[NeMo I 2025-08-07 14:29:28 modelPT:728] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: (0.9, 0.999)
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 4.375e-05
        maximize: False
        weight_decay: 0.01
    )


[NeMo W 2025-08-07 14:29:28 lr_scheduler:890] Neither `max_steps` nor `iters_per_batch` were provided to `optim.sched`, cannot compute effective `max_steps` !
    Scheduler will not be instantiated !


[NeMo I 2025-08-07 14:29:30 save_restore_connector:249] Model BERTLMModel was successfully restored from /home/jovyan/datafabric/Bertlargeuncased/bertlargeuncased.nemo.


2025-08-07 14:29:30 - INFO - BERT model loaded successfully.


CPU times: user 28.3 s, sys: 13.7 s, total: 42 s
Wall time: 2min 15s


# Generate Embeddings in Batches

In [13]:
def generate_embeddings_in_batches(texts, tokenizer, model, batch_size=32):
    """
    Generates text embeddings using the NeMo BERT model in batches.
    
    Args:
        texts (list of str): List of input texts.
        tokenizer: Pretrained tokenizer.
        model: Pretrained NeMo BERT model.
        batch_size (int, optional): Batch size for processing. Default is 32.
    
    Returns:
        np.ndarray: Array of shape (len(texts), hidden_size) holding one [CLS] embedding per text.
    """
    model.eval()  # Set model to evaluation mode
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        
        # Tokenize batch with padding and truncation
        encoded_input = tokenizer(
            batch_texts, padding=True, truncation=True, return_tensors="pt", max_length=128
        )
        encoded_input = {key: val.to(device) for key, val in encoded_input.items()}

        with torch.no_grad():  # Disable gradient computation for inference
            output = model.bert_model(**encoded_input)
        
        # Use the hidden state of the first token ([CLS]) as the text embedding
        embeddings = output[:, 0, :].cpu().numpy()
        all_embeddings.append(embeddings)

    return np.vstack(all_embeddings)
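
The batching in the function above relies on stepping through the list with `range(0, len(texts), batch_size)`, so the final batch simply holds whatever is left over. The dependency-free sketch below isolates that slicing logic; `iter_batches` is a hypothetical helper and the input is toy data, not the notebook's corpus.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# 70 toy documents split into two full batches and one remainder batch
toy_texts = [f"document {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(toy_texts, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

An empty input produces no batches at all, so a caller that stacks the per-batch results may want to handle that case explicitly.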

# Save Embeddings to File

In [14]:
%%time

# Generate embeddings using the pre-trained model
embeddings = generate_embeddings_in_batches(documents, tokenizer, bert_model)

# Convert embeddings into a DataFrame
df_embeddings = pd.DataFrame(embeddings)

# Ensure the output directory exists
os.makedirs(EMBEDDINGS_OUTPUT_PATH, exist_ok=True)
    
# Define output file path
output_file = os.path.join(EMBEDDINGS_OUTPUT_PATH, "embeddings.csv")

# Save embeddings
df_embeddings.to_csv(output_file, index=False)

logger.info(f"✅ Embedding completed and saved to: {output_file}")

2025-08-07 14:29:58 - INFO - ✅ Embedding completed and saved to: ../data/processed/embeddings.csv


CPU times: user 29.7 s, sys: 10.4 s, total: 40.1 s
Wall time: 28.4 s
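
The saved embeddings are intended for matching a user query against the corpus. A common way to do this is to rank documents by cosine similarity between the query embedding and each document embedding. The sketch below illustrates the idea in plain Python with toy 3-dimensional vectors. The vectors and the helper `top_k_matches` are assumptions made for illustration, and in practice a vectorized routine such as the `cosine_similarity` function this notebook imports from scikit-learn would be the more likely choice.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(
    query: Sequence[float], corpus: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (row index, similarity) pairs for the k most similar corpus rows."""
    scores = [(idx, cosine_similarity(query, vec)) for idx, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings
toy_corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
toy_query = [1.0, 0.0, 0.0]
print(top_k_matches(toy_query, toy_corpus, k=2))  # row 0 ranks first, then row 2
```

The returned indices correspond to row positions in the embeddings file, which follow the order of the `Pledge` column in the corpus.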


# Downloading the BERT Large Uncased Model

In [15]:
# Ensure you have added the 'bertlargeuncased' model from the NVIDIA NGC model catalog.
# If the model is unavailable locally, uncomment the following line to download it online instead.
# bert_model = BERTLMModel.from_pretrained(model_name="bertlargeuncased", strict=False).to(device)

In [16]:
end_time: float = time.time()
elapsed_time: float = end_time - start_time
elapsed_minutes: int = int(elapsed_time // 60)
elapsed_seconds: float = elapsed_time % 60

logger.info(f"⏱️ Total execution time: {elapsed_minutes}m {elapsed_seconds:.2f}s")
logger.info("✅ Notebook execution completed successfully.")

2025-08-07 14:29:58 - INFO - ⏱️ Total execution time: 3m 53.51s
2025-08-07 14:29:58 - INFO - ✅ Notebook execution completed successfully.


Built with ❤️ using Z by HP AI Studio.