## Introduction to Advanced Semantic Similarity Analysis with Sentence Transformers and MLflow

Welcome to our in-depth tutorial on semantic similarity analysis using **Sentence Transformers** and **MLflow**. This tutorial is aimed at readers who want to explore more sophisticated applications in natural language processing (NLP), with a particular focus on managing and deploying flexible NLP models. We walk through an example that integrates the `sentence-transformers` library with MLflow and highlight a custom implementation that goes beyond typical usage.

### Unveiling the Power of Sentence Transformers for NLP

**Sentence Transformers** are a specialized adaptation of traditional transformer models, optimized to produce semantically rich sentence embeddings. Built on the Transformers library by 🤗 Hugging Face, these models excel in NLP tasks such as semantic search, clustering, and, most relevant here, similarity analysis. By building on models like BERT and RoBERTa, `sentence-transformers` enables semantic comparison at the sentence level.

### MLflow: Pioneering Flexible Model Management and Deployment

Integrating MLflow with Sentence Transformers not only simplifies managing NLP projects but also introduces a realm of possibilities for custom model functionalities:

- **Enhanced Experiment Tracking**: Effortlessly log detailed experiments, including unique model parameters and sentence embeddings, using MLflow.
- **Custom `PythonModel` Implementation**: A key learning point in this tutorial is the custom `PythonModel` implementation within MLflow. While the native `sentence-transformers` model in MLflow returns pooled embeddings, our custom implementation creatively adapts this to return cosine similarity scores between pairs of texts, showcasing the versatility enabled by MLflow's `PythonModel` abstraction.
- **Robust Model Lifecycle Management**: MLflow facilitates effective versioning and configuration control, vital for the iterative nature of NLP model development.
- **Deployment Readiness and Reproducibility**: MLflow ensures that your NLP models are not only deployment-ready but also reproducible, allowing for consistent and reliable applications of your models in production environments.

### Learning Objectives

In this tutorial, you will:

- Configure and utilize the `sentence-transformers` library for semantic similarity analysis.
- Delve into MLflow’s custom `PythonModel` implementation, understanding how to extend base model functionalities for bespoke requirements.
- Master logging models, configurations, and leveraging model signatures in MLflow.
- Deploy and apply these advanced models for inference, taking full advantage of MLflow's deployment capabilities.

By the end of this tutorial, you will know how to run semantic similarity analyses with Sentence Transformers and how to use MLflow's flexibility for custom model deployment. Whether you are deepening your NLP expertise or branching out into model management, this tutorial gives you the skills to track, manage, and deploy complex NLP models.

Let's embark on this enlightening journey of semantic similarity exploration with Sentence Transformers and MLflow!


## Implementing a Custom SimilarityModel with MLflow

In this section, we introduce a custom model class, `SimilarityModel`, derived from MLflow's `PythonModel`. This class is designed to compare the semantic similarity between two sentences using sentence embeddings.

### Overview of SimilarityModel

The `SimilarityModel` class is a bespoke implementation that allows us to extend the basic functionality of the Sentence Transformer model. It encapsulates the logic required for loading the model, preparing input data, and predicting the cosine similarity between sentence pairs.

#### 1. Importing Necessary Libraries:

- **MLflow Components**: We import `mlflow`, `infer_signature`, and `PythonModel` from MLflow to handle model logging, signature inference, and custom model creation.
- **Data Handling Libraries**: `numpy` and `pandas` are imported for numerical operations and data manipulation.
- **Sentence Transformer Components**: `SentenceTransformer` and `util` from the `sentence_transformers` library are used for model loading and utility functions.

#### 2. Custom PythonModel - SimilarityModel:

- **load_context Method**:
  - The `load_context` method is crucial for loading the model context during inference. Instead of initializing the Sentence Transformer model directly within the class (which can cause serialization issues due to the complexity of the object), we load the model using the path provided in the context. This method ensures safe and efficient model loading, avoiding potential serialization errors.
  - The `SentenceTransformer.load` function is used to load the model from the specified path in the `context.artifacts`.

- **predict Method**:
  - The `predict` method is the core of our custom model, designed to accept either a DataFrame or a dictionary as input.
  - **Input Type Checking**: We implement checks to ensure the input is either a DataFrame with exactly two columns or a dictionary with two specific keys (`'sentence_1'` and `'sentence_2'`). For DataFrame input, only the first row is read, so each call compares a single pair of sentences. These checks ensure the model receives correctly formatted input and give end-users a clear error instead of an unexpected failure.
  - **Sentence Embeddings**: For both sentences provided as input, we generate embeddings using the Sentence Transformer model's `encode` method.
  - **Cosine Similarity Calculation**: Utilizing the `util.cos_sim` function, we calculate the cosine similarity between the two sentence embeddings. This similarity score is a measure of how semantically similar the two sentences are.
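
The cosine similarity step can be illustrated without loading a model. The following is a minimal NumPy sketch of the quantity that `util.cos_sim` returns for a single pair of vectors. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions and are not part of the tutorial's code.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Dot product divided by the product of the Euclidean norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy 3-dimensional vectors standing in for real sentence embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(same_direction, orthogonal)
```

Vectors that point in the same direction score 1 regardless of their length, while orthogonal vectors score 0, which is why the measure is well suited to comparing embeddings.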

### Significance of Custom SimilarityModel

- **Flexibility**: By defining our model within `PythonModel`, we gain the flexibility to specify how inputs are handled and how predictions are made, tailoring the model to our specific semantic analysis task.
- **Robustness**: The input type checking and error handling ensure that the model behaves predictably and robustly, providing clear and informative error messages if incorrect input types are provided.
- **Efficient Model Loading**: Using `load_context` for model loading helps in avoiding common serialization pitfalls associated with complex model objects like Sentence Transformers.
- **Custom Functionality**: Implementing the `predict` method with Sentence Transformer's encoding and utility function for cosine similarity allows us to create a model that directly computes similarity scores, a more specialized application compared to standard sentence embedding models.

This custom `SimilarityModel` demonstrates the power of MLflow's `PythonModel` for creating advanced, deployable NLP models that go beyond basic functionalities, providing a blueprint for similar custom implementations in various ML projects.

Let's proceed to implement this custom model in our MLflow setup.


In [1]:
import mlflow
from mlflow.models.signature import infer_signature
from mlflow.pyfunc import PythonModel
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util


class SimilarityModel(PythonModel):

    def load_context(self, context):
        """Load the model context for inference."""
        from sentence_transformers import SentenceTransformer

        try:
            self.model = SentenceTransformer.load(context.artifacts["model_path"])
        except Exception as e:
            raise ValueError(f"Error loading model: {e}")

    def predict(self, context, model_input, params=None):
        """Predict method for comparing similarity between two sentences."""
        from sentence_transformers import util

        if isinstance(model_input, pd.DataFrame):
            if model_input.shape[1] != 2:
                raise ValueError("DataFrame input must have exactly two columns.")
            sentence_1, sentence_2 = model_input.iloc[0, 0], model_input.iloc[0, 1]
        elif isinstance(model_input, dict):
            sentence_1 = model_input.get("sentence_1")
            sentence_2 = model_input.get("sentence_2")
            if sentence_1 is None or sentence_2 is None:
                raise ValueError("Both 'sentence_1' and 'sentence_2' must be provided in the input dictionary.")
        else:
            raise TypeError(f"Unexpected type for model_input: {type(model_input)}. Must be either a Dict or a DataFrame.")

        embedding_1 = self.model.encode(sentence_1)
        embedding_2 = self.model.encode(sentence_2)

        return np.array(util.cos_sim(embedding_1, embedding_2).tolist())




## Preparing the Sentence Transformer Model and Signature

This part of the tutorial focuses on loading a pre-trained Sentence Transformer model, preparing an input example, saving the model, and defining its signature. These steps are essential for setting up the model for subsequent logging and deployment with MLflow.

### Loading and Saving the Pre-trained Model

1. **Model Initialization**:
   - We load a pre-trained Sentence Transformer model using `SentenceTransformer("all-MiniLM-L6-v2")`. The `"all-MiniLM-L6-v2"` model is known for its balance between performance and size, making it ideal for a variety of NLP tasks.

2. **Model Saving**:
   - After loading the model, we save it to a directory. In this case, we use `/tmp/sbert_model` as our saving location.
   - Saving the model locally is a necessary step before we can log it with MLflow, as MLflow requires access to the model's file path.
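
If you would rather not hard-code `/tmp` (for example on a platform without that directory), a temporary directory from the standard library is one possible alternative. This is a minimal sketch under that assumption; the `sbert_model_` prefix is arbitrary and the tutorial itself uses the fixed path.

```python
import os
import tempfile

# Create a fresh, uniquely named directory under the platform's temp location
model_directory = tempfile.mkdtemp(prefix="sbert_model_")

# In the tutorial, this is the directory that model.save(...) would write to
print(model_directory, os.path.isdir(model_directory))
```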

### Preparing Input Example and Artifacts

1. **Input Example Creation**:
   - We create a DataFrame `input_example` with two example sentences. This DataFrame mimics the format of the data that the model expects during inference.
   - The example sentences chosen are `"I like apples"` and `"I like oranges"`.

2. **Defining Artifacts**:
   - Artifacts in MLflow are additional files, like models and data files, associated with ML runs. We define our model's path in the `artifacts` dictionary, using the key `"model_path"` and the path where we saved the model.
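
The link between the `artifacts` dictionary and `load_context` comes down to a shared key. The sketch below mimics only that lookup, using a `SimpleNamespace` as a stand-in for the context object that MLflow supplies at load time. `ArtifactKeyDemo` and `fake_context` are hypothetical names used for illustration.

```python
from types import SimpleNamespace

# Same shape as the tutorial's artifacts dictionary
artifacts = {"model_path": "/tmp/sbert_model"}


class ArtifactKeyDemo:
    """Hypothetical stand-in that mirrors only the key lookup in load_context."""

    def load_context(self, context):
        # The key read here must match the key used in the artifacts dictionary
        self.model_dir = context.artifacts["model_path"]


# Stand-in for the real context; MLflow fills context.artifacts with local paths
fake_context = SimpleNamespace(artifacts=dict(artifacts))

demo = ArtifactKeyDemo()
demo.load_context(fake_context)
print(demo.model_dir)
```

If the two keys differ, the lookup raises a `KeyError`, which in the tutorial's `SimilarityModel` would surface as the `ValueError` raised from `load_context`.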

### Generating Test Output for Signature

1. **Test Output Calculation**:
   - To generate a test output for the signature, we calculate the cosine similarity between the embeddings of our example sentences. This is done using `util.cos_sim`.
   - The cosine similarity score gives us an insight into how similar the sentences are in terms of their semantic content.

2. **Signature Inference**:
   - The signature of a model in MLflow defines the input and output schema. We use `infer_signature` to automatically generate this signature based on our `input_example` and the generated `test_output`.
   - The inferred signature will be used by MLflow to validate the input and output formats when the model is deployed and used for prediction.

### Importance of These Steps

- **Model Readiness**: Loading and saving the model ensure that it is ready for logging and deployment via MLflow.
- **Input-Output Contract**: The signature acts as a contract that specifies what the model expects as input and what it produces as output, crucial for ensuring consistency and reliability in model deployment.

By completing these steps, we have effectively prepared our Sentence Transformer model and defined its operational schema, setting the stage for its integration and management within the MLflow ecosystem.

Let's move forward with preparing our model and its signature.


In [2]:
# Load a pre-trained sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Create an input example DataFrame
input_example = pd.DataFrame([{"sentence_1": "I like apples", "sentence_2": "I like oranges"}])

# Save the model in the /tmp directory
model_directory = "/tmp/sbert_model"
model.save(model_directory)

# Define artifacts with the absolute path
artifacts = {"model_path": model_directory}

# Generate test output for signature
test_output = np.array(util.cos_sim(model.encode(input_example["sentence_1"][0]), 
                                    model.encode(input_example["sentence_2"][0])).tolist())

# Define the signature associated with the model
signature = infer_signature(input_example, test_output)

### Setting the tracking server and creating an experiment

To view the results of this tutorial, we need an MLflow tracking server. For the purposes of this tutorial, we use a local tracking server at `http://127.0.0.1:8080`.

You can start a local instance of the MLflow server by running the following from a terminal:

``` bash
mlflow server --host 127.0.0.1 --port 8080
```

With the server started, the following code ensures that all experiments, runs, models, parameters, and metrics that we log are tracked within that server instance, which also serves the MLflow UI when you open that URL in a browser.

After setting the tracking URI, we create a new MLflow Experiment to hold the run we are about to create.

In [3]:
mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("Semantic Similarity")


2023/11/17 20:54:17 INFO mlflow.tracking.fluent: Experiment with name 'Semantic Similarity' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/413386080563320984', creation_time=1700272457800, experiment_id='413386080563320984', last_update_time=1700272457800, lifecycle_stage='active', name='Semantic Similarity', tags={}>
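
Before logging anything, it can be useful to confirm that the tracking server is reachable. The sketch below uses only the standard library and assumes the server answers `GET /health` with HTTP 200. Both the helper name `tracking_server_is_up` and the reliance on that endpoint are assumptions to verify against your MLflow version.

```python
import http.client
import urllib.error
import urllib.request


def tracking_server_is_up(base_url, timeout=2.0):
    """Return True if GET <base_url>/health answers with HTTP 200 (hypothetical helper)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, malformed reply, or an HTTP error status
        return False


print(tracking_server_is_up("http://127.0.0.1:8080"))
```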

## Logging the Custom Model with MLflow

In this step, we log our custom `SimilarityModel` with MLflow. This process involves encapsulating the model within MLflow's logging mechanism, which allows for tracking, versioning, and later deployment.

### Creating a Path for the PyFunc Model

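- **PyFunc Path**: The code defines a temporary path, `pyfunc_path`, intended as a local location for the serialized Python model. Note that the `log_model` call in this tutorial does not reference this variable; the model is stored as an artifact of the MLflow run instead.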

### Logging the Model in MLflow

- **Initiating MLflow Run**: We start an MLflow run using `with mlflow.start_run() as run:`. This run acts as a container for all the operations related to the model logging process.

- **Model Logging Details**:
  - **Model Name**: We specify `"similarity"` as the name for our logged model. This name can be used to reference the model in the MLflow tracking server.
  - **Python Model**: The `python_model` parameter is provided with an instance of our `SimilarityModel`. This custom model class handles the loading of the Sentence Transformer model and the prediction logic for calculating cosine similarity.
  - **Input Example**: We pass `input_example`, a DataFrame containing example sentences. This example helps users understand the format and type of data that the model expects.
  - **Signature**: The `signature` that we previously inferred is included. It provides a schema for the model's input and output, ensuring that the model is used correctly.
  - **Artifacts**: We include `artifacts`, which is a dictionary specifying the path to the saved Sentence Transformer model.
  - **Python Dependencies**: The `pip_requirements` argument lists the necessary Python packages (`sentence_transformers` and `numpy`) for the model to function correctly when loaded in a different environment.

### Significance of Model Logging

- **Model Tracking and Versioning**: By logging the model in MLflow, we ensure that it is tracked and versioned, facilitating better model lifecycle management.
- **Reproducibility and Deployment**: Logging the model with its input example, signature, and requirements ensures that it can be easily reproduced and deployed in different environments, maintaining consistency and reliability.
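
For stricter reproducibility, one option is to pin each requirement to the version installed in the current environment before passing the list to `pip_requirements`. This is a minimal sketch using the standard library; `pinned_requirements` is a hypothetical helper, and the tutorial itself passes unpinned package names.

```python
from importlib import metadata


def pinned_requirements(package_names):
    """Pin each package to its installed version, if any (hypothetical helper)."""
    requirements = []
    for name in package_names:
        try:
            requirements.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Not installed in this environment: leave the name unpinned
            requirements.append(name)
    return requirements


print(pinned_requirements(["sentence-transformers", "numpy"]))
```

The resulting list has the same form as the one used in the tutorial, so it could be passed to `pip_requirements` unchanged.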

Once the model is logged with MLflow, it is ready for further actions like model comparison, version tracking, and deployment for inference.

Let's proceed to log our custom `SimilarityModel` with MLflow.


In [4]:
pyfunc_path = "/tmp/sbert_pyfunc"

with mlflow.start_run() as run:
    model_info = mlflow.pyfunc.log_model(
        "similarity",
        python_model=SimilarityModel(),
        input_example=input_example,
        signature=signature,
        artifacts=artifacts,
        pip_requirements=["sentence_transformers", "numpy"],
    )


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




## Model Inference and Testing Similarity Prediction

After logging our custom `SimilarityModel` with MLflow, we now proceed to load this model for inference. This step demonstrates how to use the model to compute the semantic similarity between two sentences.

### Loading the Model for Inference

- **Loading with MLflow**: We use `mlflow.pyfunc.load_model` to load our model. This function requires the model's URI, which we obtain from `model_info.model_uri`. The model URI is a unique identifier that MLflow uses to locate and load the model.
- **Model Readiness**: The loaded model, `loaded_dynamic`, is now ready for inference. It encapsulates all the logic we defined in the `SimilarityModel`, including the Sentence Transformer model's loading and the cosine similarity calculation.

### Preparing Data for Similarity Prediction

- **Creating Input Data**: We create a DataFrame, `similarity_data`, containing a pair of sentences for which we want to compute the similarity. In this example, we use `"I like apples"` and `"I like oranges"` as our input sentences.
- **Flexibility in Input Format**: This step demonstrates the flexibility of our custom model in accepting input data in a DataFrame format, which is intuitive and user-friendly.

### Computing and Displaying Similarity Score

- **Predicting Similarity**: We call the `predict` method on `loaded_dynamic` with our `similarity_data`. This method computes the cosine similarity between the embeddings of the two input sentences.
- **Interpreting the Result**: The result, `similarity_score`, is a numerical representation of how similar the two sentences are in terms of their semantic content. A higher score indicates greater similarity.
- **Output Display**: We print out the similarity score to provide a clear and immediate understanding of the model's output. For example, `The similarity between these sentences is: [similarity_score]`.

### Importance of This Testing

- **Model Validation**: This step is crucial for validating that our model behaves as expected when making predictions on new data.
- **Practical Application**: Demonstrating the model's ability to compute sentence similarities showcases its practical application in real-world scenarios.

By completing this inference test, we have successfully demonstrated the application of our custom `SimilarityModel` for semantic similarity analysis, highlighting the model's utility and effectiveness.

Let's perform the inference test and observe the similarity score computed by our model.


In [5]:
loaded_dynamic = mlflow.pyfunc.load_model(model_info.model_uri)

similarity_data = pd.DataFrame([{"sentence_1": "I like apples", "sentence_2": "I like oranges"}])

similarity_score = loaded_dynamic.predict(similarity_data)

print(f"The similarity between these sentences is: {similarity_score}")




The similarity between these sentences is: [[0.63414472]]
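
The prediction is printed as a nested `[[...]]` array because the cosine similarity of one pair is returned as a 1x1 matrix. If a plain number is more convenient, it can be extracted as in the sketch below, which reuses the value printed above rather than calling the model.

```python
import numpy as np

# Same nested shape as the printed prediction: one row, one column
similarity_score = np.array([[0.63414472]])

# .item() extracts the single element as a built-in Python float
score = similarity_score.item()
print(f"{score:.4f}")
```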


## Evaluating Semantic Similarity with Distinct Text Pairs

In this section, we use our loaded MLflow model to evaluate semantic similarity for two pairs of sentences, specifically chosen to demonstrate the model's capability to discern varying degrees of similarity.

### Selection of Text Pairs

1. **Low Similarity Pair (`low_similarity`)**:
   - **Text Choice**: The first sentence describes an explorer at the edge of a rainforest, while the second details the process of installing software. These sentences were chosen for their starkly different themes and contexts – one is an adventurous narrative, and the other is a technical instruction.
   - **Expected Outcome**: Given their contrasting subject matters, we anticipate a low similarity score, reflecting the model's ability to recognize and differentiate disparate semantic contents.

2. **High Similarity Pair (`high_similarity`)**:
   - **Text Choice**: Both sentences in this pair describe personal experiences of visiting the Great Pyramids of Giza. While the sentence structures and specific details differ, the overarching theme, emotional tone, and subject matter are closely aligned.
   - **Expected Outcome**: These sentences are expected to yield a high similarity score, demonstrating the model's capacity to detect semantic parallels in texts with similar underlying themes, despite surface-level variations.

### sBERT Model's Role in Similarity Calculation

- **Semantic Understanding**: The Sentence-BERT (sBERT) model implementation in our `SimilarityModel` encodes each sentence into a vector that captures its semantic essence. 
- **Cosine Similarity**: The model then computes the cosine similarity between these vectors. This similarity score quantifies how close the vectors (and hence the sentences) are in the multi-dimensional space, with a higher score indicating greater semantic similarity.
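
For reference, the cosine similarity between two embedding vectors can be written as follows. This is the standard definition; the symbols $u$ and $v$ are introduced here for illustration and do not appear in the tutorial's code.

$$
\operatorname{cos\_sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad -1 \le \operatorname{cos\_sim}(u, v) \le 1
$$

A value near 1 means the vectors point in nearly the same direction, a value near 0 means they are close to orthogonal, and negative values mean they point in opposing directions.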

### Computing and Displaying Similarity Scores

- **Predicting for Low Similarity Pair**:
  - We input the `low_similarity` pair to our model and obtain a similarity score. This score quantifies how semantically related the two vastly different sentences are, so we expect it to be low.
  - The result is printed to give us a clear indication of how the model perceives the semantic relationship between these sentences.

- **Predicting for High Similarity Pair**:
  - Similarly, we input the `high_similarity` pair and obtain its similarity score. This score reflects the semantic closeness of the sentences, both centered around the awe-inspiring experience at the Pyramids of Giza.
  - The output is printed, providing insight into the model's ability to recognize semantic similarities in contextually related sentences.

### Why This Matters

- **Model Validation**: These tests are critical for validating the effectiveness of our custom model in real-world scenarios. They demonstrate the model's nuanced understanding of language and its ability to quantify semantic relationships.
- **Practical Implications**: Understanding how the model processes and evaluates semantic content is vital for applications such as content recommendation, information retrieval, and automated text comparison.

By analyzing these similarity scores, we gain valuable insights into our model's semantic analysis capabilities, confirming its practical utility in distinguishing and quantifying semantic relationships between texts.

Let's proceed to compute and observe the similarity scores for these carefully chosen text pairs.


In [6]:
low_similarity = {
    "sentence_1": "The explorer stood at the edge of the dense rainforest, "
                  "contemplating the journey ahead. The untamed wilderness was "
                  "a labyrinth of exotic plants and unknown dangers, a challenge "
                  "for even the most seasoned adventurer, brimming with the "
                  "prospect of new discoveries and uncharted territories.",
    "sentence_2": "To install the software, begin by downloading the latest "
                  "version from the official website. Once downloaded, run the "
                  "installer and follow the on-screen instructions. Ensure that "
                  "your system meets the minimum requirements and agree to the "
                  "license terms to complete the installation process successfully."
}

high_similarity = {
    "sentence_1": "Standing in the shadow of the Great Pyramids of Giza, I felt a "
                  "profound sense of awe. The towering structures, a testament to "
                  "ancient ingenuity, rose majestically against the clear blue sky. "
                  "As I walked around the base of the pyramids, the intricate "
                  "stonework and sheer scale of these wonders of the ancient world "
                  "left me speechless, enveloped in a deep sense of history.",
    "sentence_2": "My visit to the Great Pyramids of Giza was an unforgettable "
                  "experience. Gazing upon these monumental structures, I was "
                  "captivated by their grandeur and historical significance. Each "
                  "step around these ancient marvels filled me with a deep "
                  "appreciation for the architectural prowess of a civilization long "
                  "gone, yet still speaking through these timeless monuments."
}

low_similarity_score = loaded_dynamic.predict(low_similarity)

print(f"The similarity score for the 'low_similarity' pair is: {low_similarity_score}")


high_similarity_score = loaded_dynamic.predict(high_similarity)

print(f"The similarity score for the 'high_similarity' pair is: {high_similarity_score}")

The similarity score for the 'low_similarity' pair is: [[-0.00052751]]
The similarity score for the 'high_similarity' pair is: [[0.83703309]]


## Conclusion: Harnessing the Power of Custom MLflow Python Functions in NLP

As we conclude this tutorial, let's recap the significant strides we've made in understanding and applying advanced NLP techniques using Sentence Transformers and MLflow.

### Key Takeaways from the Tutorial

- **Versatile NLP Modeling**: We explored how to harness the advanced capabilities of Sentence Transformers for semantic similarity analysis, a critical task in many NLP applications.
- **Custom MLflow Python Function**: The implementation of the custom `SimilarityModel` in MLflow demonstrated the power and flexibility of using Python functions to extend and adapt the functionality of pre-trained models to suit specific project needs.
- **Model Management and Deployment**: We delved into the process of logging, managing, and deploying these models with MLflow, showcasing how MLflow streamlines these aspects of the machine learning lifecycle.
- **Practical Semantic Analysis**: Through hands-on examples, we demonstrated the model's ability to discern varying degrees of semantic similarity between sentence pairs, validating its effectiveness in real-world semantic analysis tasks.

### The Power and Flexibility of MLflow's Python Functions

- **Customization for Specific Needs**: One of the tutorial's highlights is the demonstration of how MLflow's `PythonModel` can be customized. This customization is not only powerful but also necessary for tailoring models to specific NLP tasks that go beyond standard model functionalities.
- **Adaptability and Extension**: The `PythonModel` framework in MLflow provides a solid foundation for implementing a wide range of NLP models. Its adaptability allows for the extension of base model functionalities, such as transforming a sentence embedding model into a semantic similarity comparison tool.

### Empowering Advanced NLP Applications

- **Ease of Modification**: The tutorial showcased that modifying the provided `PythonModel` implementation for different flavors in MLflow can be done with relative ease, empowering you to create models that align precisely with your project's requirements.
- **Wide Applicability**: Whether it's semantic search, content recommendation, or automated text comparison, the approach outlined in this tutorial can be adapted to a broad spectrum of NLP tasks, opening doors to innovative applications in the field.

### Moving Forward

Armed with the knowledge and skills acquired in this tutorial, you are now well-equipped to apply these advanced NLP techniques in your projects. The seamless integration of Sentence Transformers with MLflow's robust model management and deployment capabilities paves the way for developing sophisticated, efficient, and effective NLP solutions.

Thank you for joining us on this journey through advanced NLP modeling with Sentence Transformers and MLflow. We hope this tutorial has inspired you to explore further and innovate in your NLP endeavors!

Happy Modeling!