## Advanced Paraphrase Mining with Sentence Transformers and MLflow

Welcome to our hands-on tutorial on **Advanced Paraphrase Mining using Sentence Transformers** and **MLflow**. This guide is for readers who want to apply state-of-the-art Natural Language Processing (NLP) models within MLflow for effective model management and deployment.

### Exploring Paraphrase Mining

Paraphrase mining is the process of identifying textually distinct yet semantically similar sentences or phrases. Unlike basic text matching techniques, paraphrase mining delves deeper into the nuances of language, capturing the essence of meaning in different textual expressions. This capability is crucial in applications like document summarization, chatbot development, and information retrieval, where understanding the contextual similarity between sentences is essential.

### The Role of Sentence Transformers in Paraphrase Mining

**Sentence Transformers** are transformer models fine-tuned to produce contextually rich sentence embeddings. These models, built on the Transformers library by 🤗 Hugging Face, are well suited to comparing the semantic content of texts. In this tutorial, we use `sentence-transformers` to encode sentences as embeddings and identify paraphrases.
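
Embeddings are typically compared with cosine similarity, which is also the default score function of `util.paraphrase_mining`. The sketch below illustrates the idea with hand-written three-dimensional vectors; the vectors and variable names are hypothetical, and real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for illustration only.
stars = [0.9, 0.1, 0.2]
cosmos = [0.8, 0.2, 0.3]
pasta = [0.1, 0.9, 0.1]

# Vectors pointing in similar directions score closer to 1.
print(cosine_similarity(stars, cosmos) > cosine_similarity(stars, pasta))
```

Two sentences count as paraphrase candidates when the cosine similarity of their embeddings is high, regardless of how many words they share.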

### MLflow: Simplifying Model Management and Deployment

MLflow plays a pivotal role in our paraphrase mining project by providing:

- **Efficient Experiment Tracking**: Log and track your NLP experiments effortlessly, capturing intricate details like model parameters and sentence embeddings.
- **Custom `PythonModel` Implementation**: We will explore a custom `PythonModel` implementation within MLflow, tailored for paraphrase mining, demonstrating the platform's flexibility.
- **Streamlined Lifecycle Management**: Utilize MLflow's versioning and configuration control to manage the iterative nature of NLP model development.
- **Consistent Deployment and Reproducibility**: With MLflow, ensure that your models are not only ready for deployment but also maintain consistency and reliability across different environments.

### Learning Objectives

Throughout this tutorial, you will:

- Utilize the `sentence-transformers` library to perform advanced paraphrase mining.
- Create and customize a `PythonModel` in MLflow for paraphrase mining.
- Manage, log, and track models and configurations within the MLflow ecosystem.
- Deploy your paraphrase mining model for practical applications, taking full advantage of MLflow's deployment capabilities.

By the end of this tutorial, you will understand how to perform paraphrase mining with Sentence Transformers and how to use MLflow to manage and deploy NLP models.

## Introduction to the Paraphrase Mining Model

In this first code cell, we establish the foundation for our **Paraphrase Mining Model** using Sentence Transformers and MLflow. This model exemplifies the integration of sophisticated NLP techniques with the flexibility of MLflow for model management and deployment.

### Overview of the Model Structure

Our `ParaphraseMiningModel` class is a custom Python model that integrates advanced NLP capabilities with MLflow's deployment and tracking features.

#### Loading Model and Corpus: `load_context` Method

- The method loads a pre-trained Sentence Transformer model, optimized for semantic embeddings.
- It also reads a corpus of text from a file, providing data for paraphrase identification.

#### Paraphrase Mining Logic: `predict` Method

- This method includes input validation, checking and extracting a query sentence from various input formats.
- It allows users to customize the behavior of the model using parameters like `similarity_threshold`.
- The core functionality of the method is to identify semantically similar sentences to the input query within the corpus.
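
The parameter handling described above can be sketched in isolation. The helper name `resolve_threshold` and the range check are assumptions added for illustration (the tutorial's `predict` method reads the parameter inline and does not validate its range); cosine similarity always falls between -1 and 1, so values outside that range can never be useful.

```python
def resolve_threshold(params, default=0.5):
    """Return the similarity threshold from an optional params dict.

    Hypothetical helper: falls back to `default` when params is None or empty.
    """
    threshold = params.get("similarity_threshold", default) if params else default
    # Extra validation, not part of the tutorial's model.
    if not -1.0 <= threshold <= 1.0:
        raise ValueError("similarity_threshold must be between -1 and 1.")
    return threshold


print(resolve_threshold(None))
print(resolve_threshold({"similarity_threshold": 0.65}))
```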

#### Sorting and Filtering Matches: `_sort_and_filter_matches` Helper Method

- This helper method organizes the paraphrases based on their similarity scores.
- It ensures that only unique and relevant paraphrases above the set similarity threshold are returned, filtering out duplicates.
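
`util.paraphrase_mining` returns `[score, i, j]` triples for pairs drawn from the whole list it is given, not only for pairs that involve the query. One way to keep only matches for the query is sketched below with toy data. The function name, sentences, and scores are hypothetical, and this is a variant for illustration, not the helper defined in this tutorial.

```python
def matches_for_query(pairs, sentences, query_index, threshold):
    """Keep pairs that include the query and score at or above the threshold."""
    best = {}
    for score, i, j in pairs:
        if score < threshold or query_index not in (i, j):
            continue
        other = sentences[j] if i == query_index else sentences[i]
        if other == sentences[query_index]:
            continue  # skip exact duplicates of the query text
        # Keep the highest score seen for each sentence.
        if score > best.get(other, float("-inf")):
            best[other] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sentences and scores, invented for illustration only.
sentences = ["A", "B", "C", "query"]
pairs = [[0.90, 0, 1], [0.80, 1, 3], [0.70, 0, 3], [0.40, 2, 3]]

print(matches_for_query(pairs, sentences, query_index=3, threshold=0.5))
```

In this toy input, the highest-scoring pair `(0, 1)` is dropped because it does not involve the query at index 3.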

### Key Features

- Utilizes advanced NLP techniques by leveraging Sentence Transformers.
- Demonstrates the seamless integration of custom logic in the `predict` method.
- Offers flexibility to end users to modify match criteria, enhancing usability.
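- Loads the model and corpus once in `load_context`, so they are not reloaded on every call to `predict`. Note that the corpus itself is re-encoded on each prediction, because `util.paraphrase_mining` encodes the sentences it is given.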
- Includes robust error handling and validations for increased reliability.

### Practical Implications

This model provides a versatile framework for paraphrase mining, adaptable to various domains where textual similarity is key. It highlights the power of custom `PythonModel` in MLflow, tailored for the nuanced requirements of NLP applications.


In [1]:
import mlflow
from mlflow.models.signature import infer_signature
from mlflow.pyfunc import PythonModel
from sentence_transformers import SentenceTransformer, util
import numpy as np
import pandas as pd
import warnings
from typing import List


class ParaphraseMiningModel(PythonModel):
    def load_context(self, context):
        """Load the model context for inference, including the customer feedback corpus."""
        try:
            # Load the pre-trained sentence transformer model
            self.model = SentenceTransformer.load(context.artifacts["model_path"])
            
            # Load the customer feedback corpus from the specified file
            corpus_file = context.artifacts["corpus_file"]
            with open(corpus_file, 'r') as file:
                self.corpus = file.read().splitlines()

        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise ValueError(f"Error loading model and corpus: {e}") from e

    def _sort_and_filter_matches(self, query: str, paraphrase_pairs: List[list], similarity_threshold: float):
        """Sort and filter the matches by similarity score."""

        # Each pair is [score, i, j]; sort by the score (index 0), highest first
        sorted_matches = sorted(paraphrase_pairs, key=lambda x: x[0], reverse=True)

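        # Filter and collect paraphrases, keeping the highest score per sentence
        query_paraphrases = {}
        for score, i, j in sorted_matches:
            if score < similarity_threshold:
                continue
            
            # Indices refer to the extended corpus, where the query is the last entry.
            # Pairs made of two corpus sentences also reach this point; for those,
            # self.corpus[i] is recorded unless it equals the query.
            paraphrase = self.corpus[j] if self.corpus[i] == query else self.corpus[i]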
            if paraphrase == query:
                continue
            
            if paraphrase not in query_paraphrases or score > query_paraphrases[paraphrase]:
                query_paraphrases[paraphrase] = score

        return sorted(query_paraphrases.items(), key=lambda x: x[1], reverse=True)

    def predict(self, context, model_input, params=None):
        """Predict method to perform paraphrase mining over the corpus."""
        
        # Validate and extract the query input
        if isinstance(model_input, pd.DataFrame):
            if model_input.shape[1] != 1:
                raise ValueError("DataFrame input must have exactly one column.")
            query = model_input.iloc[0, 0]
        elif isinstance(model_input, dict):
            query = model_input.get("query")
            if query is None:
                raise ValueError("The input dictionary must have a key named 'query'.")
        else:
            raise TypeError(f"Unexpected type for model_input: {type(model_input)}. Must be either a Dict or a DataFrame.")

        # Determine the minimum similarity threshold
        similarity_threshold = params.get("similarity_threshold", 0.5) if params else 0.5

        # Add the query to the corpus for paraphrase mining
        extended_corpus = self.corpus + [query]

        # Perform paraphrase mining
        paraphrase_pairs = util.paraphrase_mining(self.model, extended_corpus, show_progress_bar=False)

        # Filter by threshold, remove duplicates, and sort by score
        sorted_paraphrases = self._sort_and_filter_matches(query, paraphrase_pairs, similarity_threshold)

        # Warning if no paraphrases found
        if not sorted_paraphrases:
            warnings.warn(
                "No paraphrases found above the similarity threshold.",
                UserWarning
            )

        return {sentence[0]: str(sentence[1]) for sentence in sorted_paraphrases}




## Preparing the Corpus for Paraphrase Mining

In this section of our tutorial, we focus on creating and preparing the corpus, which is a crucial component for effective paraphrase mining.

### Corpus Creation

- We define a `corpus` as a collection of sentences covering a wide range of topics. This diversity is essential to demonstrate the model's ability to identify paraphrases across various subjects.
- The topics include everything from space exploration and AI technologies to hobbies like gardening and yoga. Such a varied corpus ensures that our paraphrase mining model can handle a wide array of input queries.

### Writing the Corpus to a File

- The corpus is written to a file named `feedback.txt`. This step simulates a real-world scenario where large datasets are often stored in files or databases.
- Writing the corpus to a file also prepares it for loading into the Paraphrase Mining Model. This process will allow the model to access and process the corpus efficiently during the paraphrase mining task.
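
The one-sentence-per-line layout is what lets the model read the corpus back with `splitlines()`. The round trip can be sketched as follows, using a temporary directory and a two-sentence stand-in corpus (both are assumptions for illustration; the tutorial itself writes to `/tmp/feedback.txt`).

```python
import os
import tempfile

# Stand-in corpus, invented for illustration only.
sentences = ["First sentence.", "Second sentence."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "feedback.txt")

    # Write one sentence per line.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences) + "\n")

    # Read the file back the same way load_context does.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read().splitlines()

print(loaded == sentences)
```

This format assumes no sentence contains a line break; a sentence with an embedded newline would be split into two corpus entries on load.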

### Significance of the Corpus

- The corpus forms the backbone of our paraphrase mining application. It acts as the dataset against which the model will compare input queries to find semantically similar sentences.
- By covering a broad spectrum of topics, we ensure that the model is versatile and robust, capable of handling a variety of real-world use cases.

With the corpus prepared and saved, we are now set to move forward with loading it into our Paraphrase Mining Model and demonstrating the power of NLP in finding related sentences across different topics.

In [2]:
corpus = [
    "Exploring ancient cities in Europe offers a glimpse into history.",
    "Modern AI technologies are revolutionizing industries.",
    "Healthy eating contributes significantly to overall well-being.",
    "Advancements in renewable energy are combating climate change.",
    "Learning a new language opens doors to different cultures.",
    "Gardening is a relaxing hobby that connects you with nature.",
    "Blockchain technology could redefine digital transactions.",
    "Homemade Italian pasta is a delight to cook and eat.",
    "Practicing yoga daily improves both physical and mental health.",
    "The art of photography captures moments in time.",
    "Baking bread at home has become a popular quarantine activity.",
    "Virtual reality is creating new experiences in gaming.",
    "Sustainable travel is becoming a priority for eco-conscious tourists.",
    "Reading books is a great way to unwind and learn.",
    "Jazz music provides a rich tapestry of sound and rhythm.",
    "Marathon training requires discipline and perseverance.",
    "Studying the stars helps us understand our universe.",
    "The rise of electric cars is an important environmental development.",
    "Documentary films offer deep insights into real-world issues.",
    "Crafting DIY projects can be both fun and rewarding.",
    "The history of ancient civilizations is fascinating to explore.",
    "Exploring the depths of the ocean reveals a world of marine wonders.",
    "Learning to play a musical instrument can be a rewarding challenge.",
    "Artificial intelligence is shaping the future of personalized medicine.",
    "Cycling is not only a great workout but also eco-friendly transportation.",
    "Home automation with IoT devices is enhancing living experiences.",
    "Understanding quantum computing requires a grasp of complex physics.",
    "A well-brewed cup of coffee is the perfect start to the day.",
    "Urban farming is gaining popularity as a sustainable food source.",
    "Meditation and mindfulness can lead to a more balanced life.",
    "The popularity of podcasts has revolutionized audio storytelling.",
    "Space exploration continues to push the boundaries of human knowledge.",
    "Wildlife conservation is essential for maintaining biodiversity.",
    "The fusion of technology and fashion is creating new trends.",
    "E-learning platforms have transformed the educational landscape.",
    "Dark chocolate has surprising health benefits when enjoyed in moderation.",
    "Robotics in manufacturing is leading to more efficient production.",
    "Creating a personal budget is key to financial well-being.",
    "Hiking in nature is a great way to connect with the outdoors.",
    "3D printing is innovating the way we create and manufacture objects.",
    "Sommeliers can identify a wine's characteristics with just a taste.",
    "Mind-bending puzzles and riddles are great for cognitive exercise.",
    "Social media has a profound impact on communication and culture.",
    "Urban sketching captures the essence of city life on paper.",
    "The ethics of AI is a growing field in tech philosophy.",
    "Homemade skincare remedies are becoming more popular.",
    "Virtual travel experiences can provide a sense of adventure at home.",
    "Ancient mythology still influences modern storytelling and literature.",
    "Building model kits is a hobby that requires patience and precision.",
    "The study of languages opens windows into different worldviews.",
    "Professional esports has become a major global phenomenon.",
    "The mysteries of the universe are unveiled through space missions.",
    "Astronauts' experiences in space stations offer unique insights into life beyond Earth.",
    "Telescopic observations bring distant galaxies within our view.",
    "The study of celestial bodies helps us understand the cosmos.",
    "Space travel advancements could lead to interplanetary exploration.",
    "Observing celestial events provides valuable data for astronomers.",
    "The development of powerful rockets is key to deep space exploration.",
    "Mars rover missions are crucial in searching for extraterrestrial life.",
    "Satellites play a vital role in our understanding of Earth's atmosphere.",
    "Astrophysics is central to unraveling the secrets of space.",
    "Zero gravity environments in space pose unique challenges and opportunities.",
    "Space tourism might soon become a reality for many.",
    "Lunar missions have contributed significantly to our knowledge of the moon.",
    "The International Space Station is a hub for groundbreaking space research.",
    "Studying comets and asteroids reveals information about the early solar system.",
    "Advancements in space technology have implications for many scientific fields.",
    "The possibility of life on other planets continues to intrigue scientists.",
    "Black holes are among the most mysterious phenomena in space.",
    "The history of space exploration is filled with remarkable achievements.",
    "Future space missions could unlock the mysteries of dark matter."
]

# Write out the corpus to a file
corpus_file = '/tmp/feedback.txt'
with open(corpus_file, 'w') as file:
    for sentence in corpus:
        file.write(sentence + '\n')

## Setting Up the Paraphrase Mining Model

This part of the tutorial involves setting up the Sentence Transformer model and preparing it for integration with MLflow. This step is crucial for leveraging the model's capabilities in our paraphrase mining application.

### Loading the Sentence Transformer Model

- We start by loading a pre-trained Sentence Transformer model, specifically `all-MiniLM-L6-v2`. This model is known for its efficiency in generating high-quality sentence embeddings and is well-suited for paraphrase mining tasks.

### Preparing the Input Example

- An input example is created using a DataFrame. This example represents a typical query that our model is expected to process. It helps in understanding the structure and format of the input data the model will receive.

### Saving the Model

- The loaded Sentence Transformer model is then saved to a directory (`/tmp/paraphrase_search_model`). This step is essential for creating a portable version of the model that can be loaded by MLflow for deployment and further use.

### Defining Artifacts and Corpus Path

- The paths to the saved model and the corpus file are defined as artifacts. Artifacts in MLflow are used to log additional files, like data or models, which are needed to understand or reproduce the work.

### Generating Test Output for Signature

- A sample output for paraphrase mining is generated. This output is a list containing a dictionary that maps a paraphrase to its similarity score, with the score formatted as a string. This sample helps define the expected format of the model's output.

### Creating the Model Signature

- A model signature is created using MLflow's `infer_signature` function. The signature captures the expected input and output formats of the model, ensuring compatibility and clarity in how the model is to be used.
- Additionally, the parameter `similarity_threshold` is included in the signature. This is required to expose the parameter for override during inference: if it is not declared when the signature is assigned, the parameter will be ignored during inference.

With these steps, we have successfully set up and saved our Sentence Transformer model, along with defining the expected input and output structures. This setup lays the groundwork for integrating the model with MLflow, ensuring that it is ready for deployment and use in our paraphrase mining application.


In [3]:
# Load a pre-trained sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Create an input example DataFrame
input_example = pd.DataFrame({"query": ["This product works well. I'm satisfied."]})

# Save the model in the /tmp directory
model_directory = "/tmp/paraphrase_search_model"
model.save(model_directory)

# Define the path for the corpus file
corpus_file = "/tmp/feedback.txt"

# Define the artifacts (paths to the model and corpus file)
artifacts = {
    "model_path": model_directory,
    "corpus_file": corpus_file
}

# Generate test output for signature
# The sample is a list holding a dict that maps a paraphrase to its score (as a string)
test_output = [{"This product is satisfactory and functions as expected.": "0.8"}]

# Define the signature associated with the model
# The signature includes the structure of the input and the expected output
signature = infer_signature(model_input=input_example, model_output=test_output, params={"similarity_threshold": 0.5})

signature

inputs: 
  ['query': string]
outputs: 
  ['This product is satisfactory and functions as expected.': string]
params: 
  ['similarity_threshold': double (default: 0.5)]

### Setting the tracking server and creating an experiment

To view the results in the MLflow UI, we need a running tracking server. For the purposes of this tutorial, we use a local tracking server at `http://127.0.0.1:8080`.

We can start the MLflow tracking server locally by running the following from a terminal:

``` bash
mlflow server --host 127.0.0.1 --port 8080
```

With the server started, the following code ensures that all experiments, runs, models, parameters, and metrics that we log are tracked within that server instance (which also serves the MLflow UI when you open that URL in a browser).

After setting the tracking URI, we create a new MLflow Experiment to hold the run we're about to create.

In [4]:
mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("Semantic Similarity")

<Experiment: artifact_location='mlflow-artifacts:/413386080563320984', creation_time=1700272457800, experiment_id='413386080563320984', last_update_time=1700272457800, lifecycle_stage='active', name='Semantic Similarity', tags={}>

## Logging the Paraphrase Mining Model with MLflow

In this crucial step, we demonstrate the process of logging our custom Paraphrase Mining Model with MLflow, a pivotal phase for model management and deployment.

### Initiating an MLflow Run

- We begin by starting an MLflow run, which serves as a record of our operations related to the model. This run encapsulates all actions of logging, tracking, and managing the model within the MLflow framework.

### Logging the Model in MLflow

- The model is logged using `mlflow.pyfunc.log_model`, MLflow's function for logging custom Python models. This step is central to integrating our model into the MLflow ecosystem for effective management.
- We give the model an artifact path (`paraphrase_model`), which identifies it within the run.
- The custom Paraphrase Mining Model, instantiated from our defined class, is specified here for logging.
- An input example is provided to illustrate the expected format of data the model will process, enhancing documentation and understanding of the model's usage.
- A model signature is included, describing the model's input and output schema. This signature is crucial for ensuring that the model is used correctly and consistently in various environments.
- Artifacts, including the paths to the model and corpus file, are specified. These artifacts are essential components required for the model's operation.
- Python package dependencies are listed, ensuring that all necessary libraries are available in the environment where the model is deployed.

### Outcomes and Benefits of Model Logging

- By logging the model in MLflow, we make it available for management and deployment. The model becomes a part of the MLflow ecosystem, accessible for various operations.
- This step enhances the model's trackability, facilitating effective version control and ensuring reproducibility across different deployment environments.

This process of logging the model in MLflow is a testament to the platform's capabilities in handling complex models, such as our Paraphrase Mining Model. It showcases MLflow's role in simplifying model management and deployment, aligning with best practices in machine learning workflows.


In [5]:
with mlflow.start_run() as run:
    model_info = mlflow.pyfunc.log_model(
        "paraphrase_model",
        python_model=ParaphraseMiningModel(),
        input_example=input_example,
        signature=signature,
        artifacts=artifacts,
        pip_requirements=["sentence_transformers"],
    )




## Model Loading and Paraphrase Mining Prediction

In this section of the tutorial, we demonstrate the practical application of our Paraphrase Mining Model. We load the model using MLflow and perform a paraphrase mining prediction, illustrating how the model operates in a real-world scenario.

### Loading the Model for Inference

- The model is loaded using MLflow's `mlflow.pyfunc.load_model` function. This step retrieves the logged model and its artifacts from the tracking server, making it ready for inference.
- We use the model's URI, which is a unique identifier within MLflow, to locate and load the specific model we have saved and logged.

### Executing a Paraphrase Mining Prediction

- A prediction is made using the `predict` method of the loaded model. This method is where the paraphrase mining logic, as defined in our model class, is executed.
- We pass a query, "Space exploration is fascinating.", to the model. This query represents a typical sentence that we aim to find paraphrases for in our corpus.
- Additionally, we set a parameter `similarity_threshold` to 0.65, specifying the minimum similarity score for considering sentences as paraphrases. This threshold is adjustable, allowing users to control the strictness of match criteria.

### Interpreting the Model Output

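- The output is displayed in the notebook as a dictionary that maps sentences from the corpus to their similarity scores, sorted from highest to lowest.
- Sentences like "Studying the stars helps us understand our universe." and "The history of space exploration is filled with remarkable achievements." appear with high similarity scores.
- Note that `util.paraphrase_mining` scores every pair in the extended corpus, and `_sort_and_filter_matches` does not discard pairs made of two corpus sentences. Some entries therefore appear to reflect similarity to another corpus sentence rather than to the query; "Exploring ancient cities in Europe offers a glimpse into history." is a likely example.
- With that caveat, the result shows the approach identifying paraphrases: sentences that differ in wording but are similar in meaning.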

### Conclusion

The successful execution of this prediction showcases the effectiveness of our Paraphrase Mining Model in identifying semantically similar sentences within a given corpus. It highlights the model's potential in various applications, such as content recommendation, information retrieval, and enhancing understanding of user queries in conversational AI systems. The output also reflects the model's nuanced understanding of language, crucial for accurate paraphrase mining.


In [6]:
loaded_dynamic = mlflow.pyfunc.load_model(model_info.model_uri)

loaded_dynamic.predict({"query": "Space exploration is fascinating."}, params={"similarity_threshold": 0.65})



{'Studying the stars helps us understand our universe.': '0.8207424879074097',
 'The history of space exploration is filled with remarkable achievements.': '0.7770636677742004',
 'Exploring ancient cities in Europe offers a glimpse into history.': '0.7461957335472107',
 'Space travel advancements could lead to interplanetary exploration.': '0.7090306282043457',
 'Space exploration continues to push the boundaries of human knowledge.': '0.6893945932388306',
 'The mysteries of the universe are unveiled through space missions.': '0.6830739974975586',
 'The study of celestial bodies helps us understand the cosmos.': '0.671358048915863'}
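
To see how `similarity_threshold` controls strictness, the scores above can be filtered after the fact. This is a sketch only: the helper name `filter_results` is hypothetical, and the scores are copied from the output above. Because the model returns scores as strings, they are converted to floats before comparison.

```python
# Scores copied from the prediction output above (returned as strings).
results = {
    "Studying the stars helps us understand our universe.": "0.8207424879074097",
    "The history of space exploration is filled with remarkable achievements.": "0.7770636677742004",
    "Exploring ancient cities in Europe offers a glimpse into history.": "0.7461957335472107",
    "Space travel advancements could lead to interplanetary exploration.": "0.7090306282043457",
    "Space exploration continues to push the boundaries of human knowledge.": "0.6893945932388306",
    "The mysteries of the universe are unveiled through space missions.": "0.6830739974975586",
    "The study of celestial bodies helps us understand the cosmos.": "0.671358048915863",
}


def filter_results(results, threshold):
    """Keep entries whose score, parsed as a float, meets the threshold."""
    return {s: float(v) for s, v in results.items() if float(v) >= threshold}


for threshold in (0.65, 0.70, 0.80):
    print(threshold, len(filter_results(results, threshold)))
```

With these scores, a threshold of 0.65 keeps all seven sentences, 0.70 keeps four, and 0.80 keeps only one.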

## Conclusion: Insights and Potential Enhancements

As we wrap up this tutorial, let's reflect on the implementation of a Paraphrase Mining Model using Sentence Transformers and MLflow. We've built a model capable of identifying semantically similar sentences, logged it to MLflow, and loaded it for inference, showcasing the flexibility and power of MLflow's `PythonModel` implementation.

### Key Takeaways

- We learned how to integrate advanced NLP techniques, specifically paraphrase mining, with MLflow. This integration not only enhances model management but also simplifies deployment and scalability.
- The flexibility of the `PythonModel` implementation in MLflow was a central theme. We saw firsthand how it allows for the incorporation of custom logic into the model's predict function, catering to specific NLP tasks like paraphrase mining.
- Through our custom model, we explored the dynamics of sentence embeddings, semantic similarity, and the nuances of language understanding. This understanding is crucial in a wide range of applications, from content recommendation to conversational AI.

### Ideas for Enhancing the Paraphrase Mining Model

While our model serves as a robust starting point, there are several enhancements that could be made within the `predict` function to make it more powerful and feature-rich:

1. **Contextual Filters**: Introduce filters based on contextual clues or specific keywords to refine the search results further. This feature would allow users to narrow down paraphrases to those most relevant to their particular context or subject matter.

2. **Sentiment Analysis Integration**: Incorporate sentiment analysis to group paraphrases by their emotional tone. This would be especially useful in applications like customer feedback analysis, where understanding sentiment is as important as content.

3. **Multi-Lingual Support**: Expand the model to support paraphrase mining in multiple languages. This enhancement would significantly broaden the model's applicability in global or multi-lingual contexts.
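
As a starting point for the first idea, a keyword-based contextual filter could be applied to the dictionary that `predict` returns. The sketch below is one possible approach under simple assumptions: the function name `filter_by_keywords` is hypothetical, and matching is plain case-insensitive substring search rather than anything semantic.

```python
def filter_by_keywords(results, keywords):
    """Keep results whose sentence contains at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return {
        sentence: score
        for sentence, score in results.items()
        if any(k in sentence.lower() for k in lowered)
    }


# Two entries copied from the prediction output earlier in this tutorial.
results = {
    "Exploring ancient cities in Europe offers a glimpse into history.": "0.7461957335472107",
    "Space travel advancements could lead to interplanetary exploration.": "0.7090306282043457",
}

filtered = filter_by_keywords(results, ["space"])
print(list(filtered))
```

Substring matching is deliberately naive: a keyword such as "art" would also match "start", so a production filter would likely need tokenization or word-boundary checks.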

### Scalability with Vector Databases

- Moving beyond a static text file as a corpus, a more scalable and real-world approach would involve connecting the model to an external vector database or in-memory store. 
- Pre-calculated embeddings could be stored and updated in such databases, accommodating real-time content generation without requiring model redeployment. This approach would dramatically improve the model’s scalability and responsiveness in real-world applications.
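
The idea of storing pre-calculated embeddings can be sketched with a small in-memory store. Everything here is a hypothetical stand-in: the class name `InMemoryEmbeddingStore`, its methods, and the toy three-dimensional vectors are invented for illustration, and a real deployment would use embeddings from the Sentence Transformer model and an actual vector database.

```python
import math


class InMemoryEmbeddingStore:
    """Toy stand-in for a vector database: maps sentences to precomputed embeddings."""

    def __init__(self):
        self._items = {}

    def upsert(self, sentence, embedding):
        """Add or update a sentence without touching the deployed model."""
        norm = math.sqrt(sum(x * x for x in embedding))
        # Store unit-length vectors so a dot product equals cosine similarity.
        self._items[sentence] = [x / norm for x in embedding]

    def search(self, query_embedding, top_k=3):
        """Return the top_k (sentence, cosine similarity) pairs for the query."""
        norm = math.sqrt(sum(x * x for x in query_embedding))
        unit_query = [x / norm for x in query_embedding]
        scored = [
            (sentence, sum(q * v for q, v in zip(unit_query, vector)))
            for sentence, vector in self._items.items()
        ]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical toy embeddings, invented for illustration only.
store = InMemoryEmbeddingStore()
store.upsert("Studying the stars helps us understand our universe.", [0.9, 0.1, 0.2])
store.upsert("Homemade Italian pasta is a delight to cook and eat.", [0.1, 0.9, 0.1])
store.upsert("The study of celestial bodies helps us understand the cosmos.", [0.8, 0.2, 0.3])

top = store.search([1.0, 0.0, 0.1], top_k=2)
print([sentence for sentence, _ in top])
```

Because corpus embeddings are computed once at insert time, only the query needs to be encoded per request, and new sentences can be added with `upsert` while the model stays deployed.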

### Final Thoughts

The journey through building and deploying the Paraphrase Mining Model has been both enlightening and practical. We've seen how MLflow's `PythonModel` offers a flexible canvas for crafting custom NLP solutions, and how sentence transformers can be leveraged to delve deep into the semantics of language.

This tutorial is just the beginning. There’s a vast potential for further exploration and innovation in paraphrase mining and NLP as a whole. We encourage you to build upon this foundation, experiment with enhancements, and continue pushing the boundaries of what's possible with MLflow and advanced NLP techniques.
