# Walkthrough 2 - Detecting Similarity

Checking for the similarity between two (or more) pieces of textual input is useful, for example:
* When we explored AI for Defect Management, one use case could be to automate the process of checking for duplicates.
* When we looked at Self-Healing Tests, one of the ways that UI Locators are healed is by looking for locators that are similar to the one that is no longer working.

We could achieve this by computing how much the two inputs overlap (e.g. the number of words they have in common), but this ignores words that are different yet semantically similar or closely related in context. We can't account for these by comparing individual words.
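
As a rough illustration of the word-overlap approach and its blind spot, here is a minimal sketch using Jaccard similarity over word sets. The function name `word_overlap` and the example sentences are made up for illustration; they are not part of the walkthrough code.

```python
def word_overlap(text_a: str, text_b: str) -> float:
    """Jaccard similarity over lowercase word sets: shared words / all distinct words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0  # two empty inputs: treat as no measurable overlap
    return len(words_a & words_b) / len(words_a | words_b)


# Two reports that describe the same problem with different vocabulary.
# Only the filler words "the" and "is" are shared, so the score stays low.
print(word_overlap("the login button is broken",
                   "the sign-in control is not working"))
```

A human reader would call these two sentences near-duplicates, but word overlap cannot see that "login" and "sign-in" mean the same thing.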

In modern Natural Language Processing (the part of AI that deals with natural language), an idea called semantic embeddings can be used to address this (to some extent). The idea is that we represent each word as a set of numbers that act like co-ordinates on a graph or map, and we place the words so that closely related words are represented by points that are close together on the graph/map. Words that are not related are further apart.

Given this representation of a word (or sentence) we can measure the distance between the points to give a measure of similarity and relatedness. While not perfect, it can certainly provide a recommendation of possible duplicates or ranking for similar inputs.
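
To make "points that are close together" concrete, here is a minimal sketch with hand-picked two-dimensional vectors. The numbers in `toy_embeddings` are invented for illustration (real models learn vectors with hundreds of dimensions from data), and `cosine_similarity` is a plain-Python helper written for this example. Cosine similarity compares the direction of two vectors rather than the straight-line distance between them, and it is the measure used by the checker later in this walkthrough.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# Hypothetical, hand-picked co-ordinates (NOT produced by a real model):
# related words are deliberately placed close together.
toy_embeddings = {
    "login": [0.9, 0.1],
    "sign-in": [0.8, 0.2],
    "invoice": [0.1, 0.9],
}

related = cosine_similarity(toy_embeddings["login"], toy_embeddings["sign-in"])
unrelated = cosine_similarity(toy_embeddings["login"], toy_embeddings["invoice"])
print(f"login vs sign-in: {related:.3f}")
print(f"login vs invoice: {unrelated:.3f}")
```

With these made-up co-ordinates, the related pair scores noticeably higher than the unrelated pair, which is the behaviour a trained embedding model aims to produce for real text.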

In this walkthrough we will show how we can use HuggingFace models to perform similarity scoring using modern Transformers (as used in Large Language Models).

> For this walkthrough, you do not need to write any code. The code presented here is complete and is designed to show how we could integrate AI models. If you want to learn more about the coding aspect of integrating AI models, I would suggest you start with the HuggingFace NLP Course.



## Resources

| Link | Description |
|------|-------------|
| https://towardsdatascience.com/cosine-similarity-how-does-it-measure-the-similarity-maths-behind-and-usage-in-python-50ad30aad7db | Description of Cosine Similarity |
| https://towardsdatascience.com/understanding-nlp-word-embeddings-text-vectorization-1a23744f7223 | A reasonably accessible introduction to Word Embeddings |
| https://huggingface.co/models | Link to the HuggingFace Pre-trained Models Hub |
| https://huggingface.co/learn/nlp-course | A free course provided by HuggingFace that gets you up to speed with Natural Language Processing using the HuggingFace library (requires Python knowledge) |


First, we need to install the dependencies:

In [ ]:
!pip install sentence-transformers

Here we are going to use the HuggingFace Sentence Transformer model to compare the similarity between two pieces of text.


In [9]:
from sentence_transformers import SentenceTransformer, util

class SimilarityChecker:
    """
    Class that encapsulates a Sentence-Transformer model to perform semantic similarity on two sentences.
    We are using a HuggingFace Sentence Transformer for this task
    """
    def __init__(self, model_name = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
    
    def check_similarity(self, input_1, input_2, similarity_threshold=0.75):
        # Convert each input text into an embedding (a vector of numbers)
        embedding_1 = self.model.encode(input_1, convert_to_tensor=True)
        embedding_2 = self.model.encode(input_2, convert_to_tensor=True)
        # Compute the similarity between the two embeddings (we are using cosine similarity)
        similarity = util.pytorch_cos_sim(embedding_1, embedding_2).item()
        if similarity > similarity_threshold:
            return {"similar": True, "score": similarity}
        else:
            return {"similar": False, "score": similarity}
        
# Create an instance of the checker
sim_checker = SimilarityChecker()

Now let's try some comparisons

Feel free to change the values for *text_1* and *text_2* to understand how good the checker is at detecting similar text.
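You can also change the *match_threshold* to a value between 0.0 and 1.0 to see how that affects which pairs are flagged as similar: a lower threshold flags more pairs (including more false matches), while a higher threshold flags fewer.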

In [14]:


text_1 = "When I try to log in using an admin account the Admin menu does not exist"
text_2 = "I can access the admin pages using a direct URL"
match_threshold = 0.75

print(sim_checker.check_similarity(text_1, text_2, match_threshold))

{'similar': False, 'score': 0.4181157648563385}
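
For the duplicate-checking use case, one score per pair is often less useful than a ranked list of the closest existing items. The sketch below shows one way that could look. The helper `rank_candidates`, the `existing_reports` list, and the `shared_words` stand-in scorer are all hypothetical names introduced for illustration; the stand-in is used only so the example runs without downloading a model.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and return the top_k, best first."""
    scored = [(candidate, score_fn(query, candidate)) for candidate in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def shared_words(text_a: str, text_b: str) -> float:
    """Stand-in scorer (hypothetical): counts the words the two texts share."""
    return float(len(set(text_a.lower().split()) & set(text_b.lower().split())))


# Hypothetical list of previously reported defects
existing_reports = [
    "Admin menu is missing after login",
    "Invoice total is calculated incorrectly",
    "Admin pages load slowly",
]

for report, score in rank_candidates("Admin menu missing", existing_reports, shared_words, top_k=2):
    print(f"{score:.1f}  {report}")

# In this notebook, the semantic score could be plugged in instead, e.g.:
# rank_candidates(text_1, existing_reports,
#                 lambda a, b: sim_checker.check_similarity(a, b)["score"])
```

Passing the scorer in as a function keeps the ranking logic independent of the model, so the same helper works with the word-overlap stand-in or with the `SimilarityChecker` defined above.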
